How to build a serverless crawler using Google Apps Script to create a unified knowledge base for AI-assisted development.
The Goal: A Unified Context for Gemini
When building Google Workspace add-ons, particularly those using newer features like Workspace Flows, the documentation is often spread across dozens of nested pages. While excellent for browsing, this structure presents a challenge when working with Large Language Models (LLMs) like Gemini.
The primary motivation for this project was to create a single, consolidated “Knowledge Doc” containing the complete technical specification for Workspace Flows. By feeding this unified document into Gemini, I can:
Control the Context: Ensure the model answers based on the specific, official documentation rather than outdated training data or hallucinated APIs.
Accelerate Development: Ask complex architectural questions (“How do I build a flow that integrates X and Y?”) and get answers grounded in the full API reference without the latency of multiple web searches.
Bridge the Knowledge Gap: Work effectively with brand-new products and services that haven’t yet been absorbed into the LLM’s core training set.
This article details how I built the tool to create this document: a serverless crawler that runs entirely within Google Apps Script. If you are less concerned with the how and more with the end result, here is the [Shared] Google Workspace Flows Guide.
The Technical Challenge: Apps Script vs. The DOM
My initial plan was straightforward:
Fetch the HTML of each documentation page.
Parse it into Markdown (using a library like Turndown or Showdown).
Convert that Markdown into a Google Doc.
However, I hit a major roadblock immediately. Most JavaScript parsing libraries rely on the DOM (window, document, DOMParser). Google Apps Script runs in a server-side environment (similar to a Node.js runtime but without standard packages) where no DOM exists. Libraries like Turndown crashed instantly with ReferenceError: document is not defined.
The Solution: A Hybrid HTML Pipeline
Instead of fighting to make a browser library work on the server, I pivoted to a hybrid approach:
Cheerio for Parsing: I used the Cheerio library (which parses HTML using pure JavaScript strings, no DOM required) to “clean” the content.
Direct HTML Injection: Instead of converting to Markdown (which lost formatting for tables and images), I switched to the Google Drive API, which can natively convert HTML files into Google Docs.
This pipeline proved superior because it preserved:
Code Blocks: <pre> tags with specific formatting.
Tables: Complex data tables rendered perfectly.
Images: <img> tags with absolute URLs were automatically fetched and embedded by Google Drive.
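The conversion step itself is not shown in the article, so here is a minimal sketch of how it can look, assuming the advanced Drive service (v2) is enabled under Services in the Apps Script editor. The function names `buildDocResource` and `htmlToDoc` are illustrative, not from the original script:

```javascript
// Pure helper: build the file metadata that asks Drive to convert the
// uploaded HTML into a native Google Doc.
function buildDocResource(title) {
  return {
    title: title,
    mimeType: 'application/vnd.google-apps.document' // target type after conversion
  };
}

// Runs only inside Apps Script (Drive and Utilities are Apps Script globals).
function htmlToDoc(title, html) {
  var blob = Utilities.newBlob(html, 'text/html', title + '.html');
  // convert: true tells Drive API v2 to transform the HTML into a Doc,
  // fetching and embedding any absolute-URL images along the way.
  var file = Drive.Files.insert(buildDocResource(title), blob, { convert: true });
  return file.id;
}
```

Note that the newer v3 advanced Drive service drops the `convert` flag and uses `Drive.Files.create` with a `name` field instead; the target `mimeType` requesting a Google Doc stays the same.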
Important Note: This script was developed and tested specifically for Google Workspace Flows documentation. Web scraping inherently carries the risk that the target website’s HTML structure may change over time. Other documentation sites will use different CSS classes and HTML structures. While the core logic (crawling + HTML cleaning + Docs conversion) remains valid, you will need to inspect the source of your target site and update the Cheerio selectors (like .devsite-article-body or .devsite-book-nav) to match its specific layout.
Step 1: The Crawler
First, we needed to find all the relevant pages. Traversing the DOM tree proved fragile—class names change, and nesting levels vary.
The robust solution was Path Filtering. We fetch the main navigation bar and simply filter for any link that starts with our target root URL.
function extractLinksByPath(html, rootUrl) {
  const $ = Cheerio.load(html);
  const links = new Set();
  $('.devsite-book-nav a').each(function() {
    let href = $(this).attr('href');
    if (href) {
      // Resolve site-relative links against the base URL.
      if (href.startsWith('/')) href = CONFIG.BASE_URL + href;
      // Keep only pages under the target section, skipping in-page anchors.
      if (href.startsWith(rootUrl) && !href.includes('#')) links.add(href);
    }
  });
  return Array.from(links);
}
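With the link list in hand, a driver loop fetches each page in turn. This is a sketch with assumed names (`crawlSection`, `joinPages`); the original script's loop may differ, but the shape is the same: fetch, throttle, accumulate:

```javascript
// Pure helper: stitch per-page HTML fragments into one document body,
// separating pages with a horizontal rule.
function joinPages(pages) {
  return pages.join('\n<hr>\n');
}

// Runs only inside Apps Script (UrlFetchApp and Utilities are globals).
function crawlSection(rootUrl) {
  var navHtml = UrlFetchApp.fetch(rootUrl).getContentText();
  var urls = extractLinksByPath(navHtml, rootUrl);
  var pages = [];
  urls.forEach(function(url) {
    pages.push(UrlFetchApp.fetch(url).getContentText());
    Utilities.sleep(500); // be polite: pause between requests
  });
  return joinPages(pages);
}
```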
Step 2: The HTML Cleaner
We couldn’t just dump the raw page HTML into a Doc; it would include navigation bars, “Feedback” buttons, and footer links. We used Cheerio to strip these out.
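A cleaning pass along these lines might look as follows. The `.devsite-article-body` selector is the one the article names for Google's devsite layout; the list of chrome selectors to strip is my assumption and will need adjusting for other sites:

```javascript
// Selectors for page chrome to strip (illustrative; inspect your target
// site's source and adjust — devsite layouts change over time).
var CHROME_SELECTORS = [
  'devsite-toc',           // in-page table-of-contents widget
  '.devsite-article-meta', // breadcrumbs and "Send feedback" row
  'nav',
  'footer'
];

// Runs inside Apps Script with the bundled Cheerio library loaded.
function cleanArticleHtml(html) {
  var $ = Cheerio.load(html);
  // Keep only the article body, then remove the chrome inside it.
  var article = $('.devsite-article-body');
  CHROME_SELECTORS.forEach(function(sel) { article.find(sel).remove(); });
  return $.html(article);
}
```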
We also encountered a specific challenge with Definition Lists (<dl>, <dt>, <dd>). Google Docs has no native "Definition List" element, so they were often flattened into messy text.
The fix was to transform them into semantic HTML paragraphs with inline styling to force the visual layout we wanted.
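The transformation can be sketched like this. The exact inline styles (bold term, indented definition) are my assumption about the intended layout, and the function names are illustrative:

```javascript
// Pure helper: render one term/definition pair as styled paragraphs.
function renderDefinition(term, definition) {
  return '<p style="margin-bottom:0"><strong>' + term + '</strong></p>' +
         '<p style="margin-left:36px;margin-top:0">' + definition + '</p>';
}

// Cheerio pass: replace each <dl> with plain paragraphs before conversion.
// Expects the loaded Cheerio document ($) as its argument.
function flattenDefinitionLists($) {
  $('dl').each(function() {
    var html = '';
    $(this).find('dt').each(function() {
      var dt = $(this);
      var dd = dt.next('dd'); // the definition immediately following the term
      html += renderDefinition(dt.html() || '', dd.html() || '');
    });
    $(this).replaceWith(html);
  });
}
```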
Step 3: Post-Processing with DocumentApp
Even with clean HTML, the Google Docs converter can be stubborn. It sometimes ignores semantic headings (<h1>) or allows images to overflow the page width.
To fix this, we added a post-processing step. After the HTML file is converted to a Google Doc, we open it with DocumentApp to programmatically enforce styles and constraints.
Fixing Headings: We scan for our specific “Title” styling (24pt Bold) and force the Heading 1 style.
function applyHeadingsToDoc(docId) {
  const doc = DocumentApp.openById(docId);
  const body = doc.getBody();
  const paragraphs = body.getParagraphs();
  paragraphs.forEach(p => {
    const text = p.getText();
    if (text.length === 0) return;
    const fontSize = p.getAttributes().FONT_SIZE;
    const isBold = p.getAttributes().BOLD;
    // If it looks like an H1 (24pt + Bold), make it an H1!
    if (fontSize >= 24 && isBold) {
      p.setHeading(DocumentApp.ParagraphHeading.HEADING1);
    }
  });
  doc.saveAndClose();
}
Resizing Images: We also scan for images wider than the standard page width (e.g., 600px) and scale them down to fit, ensuring the document layout remains printable and clean.
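The image pass can be sketched as below. `MAX_WIDTH_PX` and the function names are my assumptions, matching the ~600px page width mentioned above:

```javascript
var MAX_WIDTH_PX = 600; // assumed usable page width in pixels

// Pure helper: compute scaled dimensions that preserve aspect ratio.
function fitWidth(width, height, maxWidth) {
  if (width <= maxWidth) return { width: width, height: height };
  var scale = maxWidth / width;
  return { width: maxWidth, height: Math.round(height * scale) };
}

// Runs only inside Apps Script (DocumentApp is an Apps Script global).
function resizeImagesInDoc(docId) {
  var body = DocumentApp.openById(docId).getBody();
  body.getImages().forEach(function(img) {
    var size = fitWidth(img.getWidth(), img.getHeight(), MAX_WIDTH_PX);
    img.setWidth(size.width).setHeight(size.height);
  });
}
```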
Once generated, this document becomes the cornerstone of my development workflow:
Consolidate: I run the script to pull the latest documentation into a fresh Google Doc.
Enrich: I use tools like Gemini Deep Research to find community patterns or architectural best practices, and paste them directly into the same doc.
Contextualise: This single document is then uploaded to Gemini (or another LLM workspace), providing a focused, hallucination-resistant context for answering questions and generating code.
By combining UrlFetchApp for crawling, Cheerio for parsing, and the Drive API for document generation, we built a custom scraping pipeline in under 200 lines of code without spinning up a single server. If you'd like to make your own, feel free to copy this script project, which comes with a compiled, Apps Script-friendly version of Cheerio.