Crawling Websites

Import content from your existing help center or support site

If you already have a help center, knowledge base, or support documentation online, you can import it automatically using the website crawler. Crawled pages appear in the Websites tab on the Train page.

How Website Crawling Works

  1. You provide a starting URL (usually your help center homepage)
  2. Goldilocks discovers pages by following links and checking sitemaps
  3. AI categorizes pages as support content, marketing, or other
  4. You select which pages to import
  5. Selected pages become documents in your knowledge base

Starting a Crawl

  1. Navigate to Train in the sidebar
  2. Click the + Add Content button
  3. Select the Website tab in the dialog
  4. Enter your website URL (e.g., https://help.yoursite.com)
  5. Click Start Crawl

The crawler begins discovering pages. Discovery typically takes 1 to 5 minutes, depending on the size of your site.

Crawl Process

Discovery Phase

The crawler finds pages by:

  • Reading your sitemap.xml (if available)
  • Following links from the starting page
  • Exploring subdirectories
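
The first two discovery methods can be illustrated with a short Python sketch. This is not Goldilocks' actual crawler, only a minimal illustration using the standard library. The function names `urls_from_sitemap` and `same_site_links` are hypothetical, and the sketch assumes the sitemap uses the standard sitemaps.org namespace.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
import xml.etree.ElementTree as ET

# Namespace used by standard sitemap.xml files.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def urls_from_sitemap(sitemap_xml: str) -> list[str]:
    """Return every <loc> value found in a sitemap.xml document."""
    root = ET.fromstring(sitemap_xml)
    return [loc.text.strip() for loc in root.iter(f"{SITEMAP_NS}loc") if loc.text]


class _LinkCollector(HTMLParser):
    """Collects the href of every <a> tag."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def same_site_links(page_url: str, html: str) -> list[str]:
    """Resolve links against page_url and keep only those on the same host."""
    collector = _LinkCollector()
    collector.feed(html)
    host = urlparse(page_url).netloc
    seen: set[str] = set()
    result: list[str] = []
    for href in collector.hrefs:
        absolute = urljoin(page_url, href).split("#", 1)[0]  # drop fragments
        if urlparse(absolute).netloc == host and absolute not in seen:
            seen.add(absolute)
            result.append(absolute)
    return result
```

Restricting links to the starting host is one reason a dedicated help center URL tends to produce a cleaner page list than a main website.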

Categorization Phase

AI analyzes each discovered page and categorizes it:

  • Recommended - Appears to be support/help content
  • Review - Might be relevant, needs manual review
  • Excluded - Marketing, legal, or non-support content
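
To make the three categories concrete, here is a deliberately simple stand-in. Goldilocks uses AI for this step; the sketch below swaps in a plain URL keyword rule purely to show how pages end up in each bucket. The keyword lists and the `categorize` function are hypothetical.

```python
from enum import Enum
from urllib.parse import urlparse


class Category(Enum):
    RECOMMENDED = "Recommended"
    REVIEW = "Review"
    EXCLUDED = "Excluded"


# Hypothetical keyword lists, chosen only for illustration.
SUPPORT_HINTS = ("help", "support", "docs", "faq", "kb")
EXCLUDE_HINTS = ("pricing", "terms", "privacy", "legal", "careers")


def categorize(url: str) -> Category:
    """Bucket a URL by keywords in its host and path (a stand-in for the AI step)."""
    parsed = urlparse(url)
    segments = [s for s in parsed.path.lower().split("/") if s]
    host_labels = parsed.netloc.lower().split(".")
    if any(seg in EXCLUDE_HINTS for seg in segments):
        return Category.EXCLUDED
    if any(part in SUPPORT_HINTS for part in segments + host_labels):
        return Category.RECOMMENDED
    # Anything without a clear signal is left for manual review.
    return Category.REVIEW
```

A rule this crude misclassifies often, which is why the Review bucket exists and why the selection step is manual.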

Selection Phase

Once categorization is complete, you're taken to the scan review page where you can:

  1. Review recommended pages
  2. Check pages marked for review
  3. Select or deselect individual pages
  4. Import selected pages

Viewing Crawled Websites

Imported pages appear in the Websites tab on the Train page. Each shows:

  • Name - Page title
  • URL - Original source URL
  • Folder - Folder assignment
  • Status - Active or Draft
  • Retrievals - Usage count
  • Last Synced - When content was last updated

Selecting Pages to Import

Recommended pages are the ones the AI identified as likely support content. Review them to:

  • Confirm they're relevant
  • Remove any that shouldn't be included
  • Check for duplicate content

Review Pages

Pages marked Review need a manual check. They might be:

  • General information pages
  • Blog posts about features
  • Pages with mixed content

Excluding Pages

Deselect pages you don't want imported:

  • Marketing/promotional content
  • Terms of service, privacy policy
  • Pages with outdated information
  • Duplicate pages (different URLs, same content)

Importing Pages

  1. Review and adjust your selection on the scan page
  2. Click Import Selected
  3. Pages will be converted to documents
  4. Processing begins automatically

Imported pages appear in your Websites tab with the original URL preserved.

Managing Website Content

Edit Imported Content

  1. Click on a website entry in the list
  2. Modify the name, content, or settings
  3. Save changes

Note: Content you edit here can diverge from the source page. If the source is updated later, consider re-syncing.

Re-sync Content

To update content from the original URL:

  1. Find the website entry
  2. Click the menu icon (three dots)
  3. Select Resync
  4. Content will be re-fetched from the source

Recrawl a Website

To discover new or updated pages:

  1. Go to Train > Websites tab
  2. Find your previously crawled site
  3. Click Recrawl or start a new crawl with the same URL
  4. New and updated pages will be available for import

Best Practices

Start with Your Help Center

Begin with your dedicated help/support site rather than your main website. This reduces noise from marketing content.

Review AI Categorization

AI categorization is not always accurate. Always review:

  • Recommended pages for false positives (pages that are not really support content)
  • Excluded pages for false negatives (support content that was filtered out)

Check for Duplicates

Websites often have duplicate content under different URLs. Look for:

  • Paginated versions of the same article
  • Print-friendly versions
  • Mobile-specific pages
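
If you have the extracted text of your pages available outside Goldilocks, one way to spot exact duplicates is to compare content hashes. The sketch below is an assumption about how you might do this yourself, not a Goldilocks feature; `content_fingerprint` and `find_duplicates` are hypothetical names.

```python
import hashlib
import re
from collections import defaultdict


def content_fingerprint(text: str) -> str:
    """Hash the text after lowercasing and collapsing whitespace."""
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def find_duplicates(pages: dict[str, str]) -> list[list[str]]:
    """Group URLs whose extracted text is identical after normalization."""
    groups: dict[str, list[str]] = defaultdict(list)
    for url, text in pages.items():
        groups[content_fingerprint(text)].append(url)
    # Only groups with more than one URL are duplicates.
    return [urls for urls in groups.values() if len(urls) > 1]
```

This only catches pages whose text matches exactly after normalization. Paginated articles, which split one article across several URLs, still need a manual check.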

Handle Dynamic Content

The crawler captures content at crawl time. If your site has:

  • Frequently updated content: Recrawl periodically
  • Personalized content: May not import correctly
  • JavaScript-heavy pages: Content might not extract fully

Set Up Regular Syncs

If your help center is frequently updated, schedule regular recrawls (weekly or monthly) to keep content fresh.

Troubleshooting

"No pages found"

  • Check that the URL is correct
  • Ensure the site is publicly accessible
  • Try a more specific starting URL

Poor content extraction

Some sites don't extract well due to:

  • Heavy JavaScript rendering
  • Content behind authentication
  • Unusual HTML structure

Try adding content manually instead.
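
To check whether a page is likely to suffer from heavy JavaScript rendering, you can look at how much readable text its raw HTML contains before any scripts run. The sketch below is a rough heuristic, not how Goldilocks extracts content. The `looks_javascript_rendered` function and its 200-character threshold are assumptions chosen for illustration.

```python
from html.parser import HTMLParser


class _VisibleText(HTMLParser):
    """Accumulates text that is not inside <script> or <style>."""

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def visible_text(html: str) -> str:
    """Return the readable text of an HTML document, ignoring scripts and styles."""
    parser = _VisibleText()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def looks_javascript_rendered(html: str, min_chars: int = 200) -> bool:
    """True when the raw HTML carries little readable text (threshold is an assumption)."""
    return len(visible_text(html)) < min_chars
```

A page that returns almost no text in its raw HTML is a good candidate for manual entry.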

Too many irrelevant pages

If too much marketing content is discovered:

  • Use a more specific starting URL (e.g., /help subdirectory)
  • Manually deselect irrelevant pages during import
  • Consider manual document creation instead
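
The effect of a more specific starting URL can be pictured as a path-prefix filter over discovered pages. The sketch below is illustrative only and assumes the crawler stays within the starting path, which this page does not confirm. `under_prefix` is a hypothetical helper.

```python
from urllib.parse import urlparse


def under_prefix(url: str, start_url: str) -> bool:
    """True when url is on the same host as start_url and inside its path."""
    target, start = urlparse(url), urlparse(start_url)
    # Compare with trailing slashes so /help does not match /helpdesk-pricing.
    prefix = start.path.rstrip("/") + "/"
    path = target.path.rstrip("/") + "/"
    return target.netloc == start.netloc and path.startswith(prefix)
```

For example, with a starting URL of https://yoursite.com/help, a page at /help/billing passes the filter while /pricing does not.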
