How to build a serverless crawler using Google Apps Script to create a unified knowledge base for AI-assisted development.
The Goal: A Unified Context for Gemini
When building Google Workspace add-ons, particularly those using newer features like Workspace Flows, the documentation is often spread across dozens of nested pages. While excellent for browsing, this structure presents a challenge when working with Large Language Models (LLMs) like Gemini.
The primary motivation for this project was to create a single, consolidated “Knowledge Doc” containing the complete technical specification for Workspace Flows. By feeding this unified document into Gemini, I can:
Control the Context: Ensure the model answers based on the specific, official documentation rather than outdated training data or hallucinated APIs.
Accelerate Development: Ask complex architectural questions (“How do I build a flow that integrates X and Y?”) and get answers grounded in the full API reference without the latency of multiple web searches.
Bridge the Knowledge Gap: Work effectively with brand-new products and services that haven’t yet been absorbed into the LLM’s core training set.
This article details how I built the tool to create this document: a serverless crawler that runs entirely within Google Apps Script. If you are less concerned with the how and more with the end result, here is the [Shared] Google Workspace Flows Guide.
The Technical Challenge: Apps Script vs. The DOM
My initial plan was straightforward:
Fetch the HTML of each documentation page.
Parse it into Markdown (using a library like Turndown or Showdown).
Convert that Markdown into a Google Doc.
However, I hit a major roadblock immediately. Most JavaScript parsing libraries rely on the DOM (window, document, DOMParser). Google Apps Script runs in a server-side environment (similar to a Node.js runtime but without standard packages) where no DOM exists. Libraries like Turndown crashed instantly with ReferenceError: document is not defined.
The Solution: A Hybrid HTML Pipeline
Instead of fighting to make a browser library work on the server, I pivoted to a hybrid approach:
Cheerio for Parsing: I used the Cheerio library (which parses HTML using pure JavaScript strings, no DOM required) to “clean” the content.
Direct HTML Injection: Instead of converting to Markdown (which lost formatting for tables and images), I switched to the Google Drive API, which can natively convert HTML files into Google Docs.
This pipeline proved superior because it preserved:
Code Blocks: <pre> tags with specific formatting.
Tables: Complex data tables rendered perfectly.
Images: <img> tags with absolute URLs were automatically fetched and embedded by Google Drive.
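The conversion step itself is not shown in the article, so here is a minimal sketch of how it can look, assuming the advanced Drive service (v2) is enabled under Services in the Apps Script editor. The function names `buildDocResource` and `htmlToDoc` are illustrative, not from the original script:

```javascript
// Pure helper: build the file metadata that asks Drive to convert the
// uploaded HTML into a native Google Doc.
function buildDocResource(title) {
  return {
    title: title,
    mimeType: 'application/vnd.google-apps.document' // target type after conversion
  };
}

// Runs only inside Apps Script (Drive and Utilities are Apps Script globals).
function htmlToDoc(title, html) {
  var blob = Utilities.newBlob(html, 'text/html', title + '.html');
  // convert: true tells Drive API v2 to transform the HTML into a Doc,
  // fetching and embedding any absolute-URL images along the way.
  var file = Drive.Files.insert(buildDocResource(title), blob, { convert: true });
  return file.id;
}
```

Note that the newer v3 advanced Drive service drops the `convert` flag and uses `Drive.Files.create` with a `name` field instead; the target `mimeType` requesting a Google Doc stays the same.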
Important Note: This script was developed and tested specifically for Google Workspace Flows documentation. Web scraping inherently carries the risk that the target website’s HTML structure may change over time. Other documentation sites will use different CSS classes and HTML structures. While the core logic (crawling + HTML cleaning + Docs conversion) remains valid, you will need to inspect the source of your target site and update the Cheerio selectors (like .devsite-article-body or .devsite-book-nav) to match its specific layout.
Step 1: The Crawler
First, we needed to find all the relevant pages. Traversing the DOM tree proved fragile—class names change, and nesting levels vary.
The robust solution was Path Filtering. We fetch the main navigation bar and simply filter for any link that starts with our target root URL.
function extractLinksByPath(html, rootUrl) {
  const $ = Cheerio.load(html);
  const links = new Set();
  $('.devsite-book-nav a').each(function() {
    let href = $(this).attr('href');
    if (href) {
      // Resolve site-relative links against the base URL.
      if (href.startsWith('/')) href = CONFIG.BASE_URL + href;
      // Keep only pages under the target section, skipping in-page anchors.
      if (href.startsWith(rootUrl) && !href.includes('#')) links.add(href);
    }
  });
  return Array.from(links);
}
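With the link list in hand, a driver loop fetches each page in turn. This is a sketch with assumed names (`crawlSection`, `joinPages`); the original script's loop may differ, but the shape is the same: fetch, throttle, accumulate:

```javascript
// Pure helper: stitch per-page HTML fragments into one document body,
// separating pages with a horizontal rule.
function joinPages(pages) {
  return pages.join('\n<hr>\n');
}

// Runs only inside Apps Script (UrlFetchApp and Utilities are globals).
function crawlSection(rootUrl) {
  var navHtml = UrlFetchApp.fetch(rootUrl).getContentText();
  var urls = extractLinksByPath(navHtml, rootUrl);
  var pages = [];
  urls.forEach(function(url) {
    pages.push(UrlFetchApp.fetch(url).getContentText());
    Utilities.sleep(500); // be polite: pause between requests
  });
  return joinPages(pages);
}
```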
Step 2: The HTML Cleaner
We couldn’t just dump the raw page HTML into a Doc; it would include navigation bars, “Feedback” buttons, and footer links. We used Cheerio to strip these out.
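A cleaning pass along these lines might look as follows. The `.devsite-article-body` selector is the one the article names for Google's devsite layout; the list of chrome selectors to strip is my assumption and will need adjusting for other sites:

```javascript
// Selectors for page chrome to strip (illustrative; inspect your target
// site's source and adjust — devsite layouts change over time).
var CHROME_SELECTORS = [
  'devsite-toc',           // in-page table-of-contents widget
  '.devsite-article-meta', // breadcrumbs and "Send feedback" row
  'nav',
  'footer'
];

// Runs inside Apps Script with the bundled Cheerio library loaded.
function cleanArticleHtml(html) {
  var $ = Cheerio.load(html);
  // Keep only the article body, then remove the chrome inside it.
  var article = $('.devsite-article-body');
  CHROME_SELECTORS.forEach(function(sel) { article.find(sel).remove(); });
  return $.html(article);
}
```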
We also encountered a specific challenge with Definition Lists (<dl>, <dt>, <dd>). Google Docs has no native "Definition List" element, so they were often flattened into messy text.
The fix was to transform them into semantic HTML paragraphs with inline styling to force the visual layout we wanted.
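The transformation can be sketched like this. The exact inline styles (bold term, indented definition) are my assumption about the intended layout, and the function names are illustrative:

```javascript
// Pure helper: render one term/definition pair as styled paragraphs.
function renderDefinition(term, definition) {
  return '<p style="margin-bottom:0"><strong>' + term + '</strong></p>' +
         '<p style="margin-left:36px;margin-top:0">' + definition + '</p>';
}

// Cheerio pass: replace each <dl> with plain paragraphs before conversion.
// Expects the loaded Cheerio document ($) as its argument.
function flattenDefinitionLists($) {
  $('dl').each(function() {
    var html = '';
    $(this).find('dt').each(function() {
      var dt = $(this);
      var dd = dt.next('dd'); // the definition immediately following the term
      html += renderDefinition(dt.html() || '', dd.html() || '');
    });
    $(this).replaceWith(html);
  });
}
```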
Step 3: Post-Processing with DocumentApp
Even with clean HTML, the Google Docs converter can be stubborn. It sometimes ignores semantic headings (<h1>) or allows images to overflow the page width.
To fix this, we added a post-processing step. After the HTML file is converted to a Google Doc, we open it with DocumentApp to programmatically enforce styles and constraints.
Fixing Headings: We scan for our specific “Title” styling (24pt Bold) and force the Heading 1 style.
function applyHeadingsToDoc(docId) {
  const doc = DocumentApp.openById(docId);
  const body = doc.getBody();
  const paragraphs = body.getParagraphs();
  paragraphs.forEach(p => {
    const text = p.getText();
    if (text.length === 0) return;
    const fontSize = p.getAttributes().FONT_SIZE;
    const isBold = p.getAttributes().BOLD;
    // If it looks like an H1 (24pt + Bold), make it an H1!
    if (fontSize >= 24 && isBold) {
      p.setHeading(DocumentApp.ParagraphHeading.HEADING1);
    }
  });
  doc.saveAndClose();
}
Resizing Images: We also scan for images wider than the standard page width (e.g., 600px) and scale them down to fit, ensuring the document layout remains printable and clean.
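The image pass can be sketched as below. `MAX_WIDTH_PX` and the function names are my assumptions, matching the ~600px page width mentioned above:

```javascript
var MAX_WIDTH_PX = 600; // assumed usable page width in pixels

// Pure helper: compute scaled dimensions that preserve aspect ratio.
function fitWidth(width, height, maxWidth) {
  if (width <= maxWidth) return { width: width, height: height };
  var scale = maxWidth / width;
  return { width: maxWidth, height: Math.round(height * scale) };
}

// Runs only inside Apps Script (DocumentApp is an Apps Script global).
function resizeImagesInDoc(docId) {
  var body = DocumentApp.openById(docId).getBody();
  body.getImages().forEach(function(img) {
    var size = fitWidth(img.getWidth(), img.getHeight(), MAX_WIDTH_PX);
    img.setWidth(size.width).setHeight(size.height);
  });
}
```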
Once generated, this document becomes the cornerstone of my development workflow:
Consolidate: I run the script to pull the latest documentation into a fresh Google Doc.
Enrich: I use tools like Gemini Deep Research to find community patterns or architectural best practices, and paste them directly into the same doc.
Contextualise: This single document is then uploaded to Gemini (or another LLM workspace), providing a focused, hallucination-resistant context for answering questions and generating code.
By combining UrlFetchApp for crawling, Cheerio for parsing, and the Drive API for document generation, we built a custom scraping pipeline in under 200 lines of code without spinning up a single server. If you'd like to make your own, feel free to copy this script project, which comes with a compiled, Apps Script-friendly version of Cheerio.