Crawling Websites

If you already have a help center, knowledge base, or support documentation online, you can import it automatically using the website crawler. Crawled pages appear in the Websites tab on the Training page.

How Website Crawling Works

You provide a starting URL (usually your help center homepage)
Goldilocks discovers pages by following links and checking sitemaps
AI categorizes pages as support content, marketing, or other
You select which pages to import
Selected pages become documents in your knowledge base

Starting a Crawl

Navigate to Training in the sidebar
Click the + Add Content button
Select the Website tab in the dialog
Enter your website URL (e.g., https://help.yoursite.com)
Click Start Crawl

The crawler then begins discovering pages. This typically takes 1 to 5 minutes, depending on the size of your site.

Crawl Process

Discovery Phase

The crawler finds pages by:

Reading your sitemap.xml (if available)
Following links from the starting page
Exploring subdirectories
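
To get a feel for what the discovery phase is likely to find, you can preview your own sitemap and page links before starting a crawl. The following is a minimal sketch using only the Python standard library. It is not Goldilocks' crawler; the function names are made up for illustration, and it works on sitemap and HTML text you have already downloaded.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
import xml.etree.ElementTree as ET

# Namespace used by standard sitemap.xml files.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def urls_from_sitemap(sitemap_xml: str) -> list[str]:
    """Return every <loc> value from a sitemap.xml document."""
    root = ET.fromstring(sitemap_xml)
    return [loc.text.strip() for loc in root.iter(f"{SITEMAP_NS}loc") if loc.text]


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def same_site_links(page_url: str, html: str) -> list[str]:
    """Resolve links in html against page_url and keep those on the same host."""
    collector = LinkCollector()
    collector.feed(html)
    host = urlparse(page_url).netloc
    found: list[str] = []
    for href in collector.hrefs:
        # Make the link absolute and drop any #fragment.
        absolute = urljoin(page_url, href).split("#")[0]
        if urlparse(absolute).netloc == host and absolute not in found:
            found.append(absolute)
    return found
```

If the preview shows mostly marketing URLs, that is a hint to pick a more specific starting URL.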

Categorization Phase

AI analyzes each discovered page and categorizes it:

Recommended - Appears to be support/help content
Review - Might be relevant, needs manual review
Excluded - Marketing, legal, or non-support content
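
If you want to pre-sort a list of URLs yourself before crawling, a simple rule-based stand-in can approximate the three categories. This sketch is only an illustration: Goldilocks uses AI analysis of each page, not these rules, and the keyword lists and the guess_category function are hypothetical.

```python
from urllib.parse import urlparse

# Hypothetical keyword lists; adjust them to match your own site structure.
SUPPORT_HINTS = ("help", "support", "docs", "faq", "kb", "guide")
EXCLUDE_HINTS = ("pricing", "legal", "terms", "privacy", "careers", "press")


def guess_category(url: str) -> str:
    """Guess 'Recommended', 'Review', or 'Excluded' from the URL alone."""
    parsed = urlparse(url.lower())
    segments = [s for s in parsed.path.split("/") if s]
    # Also treat the first hostname label (e.g. "help" in help.yoursite.com) as a hint.
    labels = [parsed.netloc.split(".")[0]] + segments
    # Exclusion rules win, so a legal page on a help subdomain is still excluded.
    if any(label in EXCLUDE_HINTS for label in labels):
        return "Excluded"
    if any(label in SUPPORT_HINTS for label in labels):
        return "Recommended"
    return "Review"
```

URL keywords are a much weaker signal than page content, so treat the result as a rough first pass only.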

Selection Phase

Once categorization is complete, you're taken to the scan review page where you can:

Review recommended pages
Check pages marked for review
Select or deselect individual pages
Import selected pages

Viewing Crawled Websites

Imported pages appear in the Websites tab on the Training page. Each shows:

Name - Page title
URL - Original source URL
Folder - Folder assignment
Status - Active or Draft
Restricted - Purple badge when visibility is limited to specific personas
Retrievals - Usage count
Last Synced - When content was last updated

Selecting Pages to Import

Recommended Pages

These are pages the AI identified as likely support content. Review them to:

Confirm they're relevant
Remove any that shouldn't be included
Check for duplicate content

Review Pages

These need manual review. They might be:

General information pages
Blog posts about features
Pages with mixed content

Excluding Pages

Deselect pages you don't want imported:

Marketing/promotional content
Terms of service, privacy policy
Pages with outdated information
Duplicate pages (different URLs, same content)

Importing Pages

Review and adjust your selection on the scan page
Click Import Selected
Pages will be converted to documents
Processing begins automatically

Imported pages appear in your Websites tab with the original URL preserved.

Managing Website Content

Edit Imported Content

Click on a website entry in the list
Modify the name, content, or settings
Save changes

Note: Edits you make here apply only to the imported copy, so it may diverge from the source page. Consider re-syncing when the source is updated.

Re-sync Content

To update content from the original URL:

Find the website entry
Click the menu icon (three dots)
Select Resync
Content will be re-fetched from the source

Recrawl a Website

To discover new or updated pages:

Go to Training > Websites tab
Find your previously crawled site
Click Recrawl or start a new crawl with the same URL
New and updated pages will be available for import

Best Practices

Start with Your Help Center

Begin with your dedicated help/support site rather than your main website. This reduces noise from marketing content.

Review AI Categorization

AI categorization is usually accurate but can misclassify pages. Always review:

Recommended pages for false positives
Excluded pages for false negatives

Check for Duplicates

Websites often have duplicate content under different URLs. Look for:

Paginated versions of the same article
Print-friendly versions
Mobile-specific pages
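
One way to spot duplicates in a list of discovered pages is to normalize each URL and compare a fingerprint of the page text. The sketch below assumes you already have the text for each URL. The VARIANT_PARAMS set and all function names are hypothetical; your site may mark print or paginated versions differently.

```python
import hashlib
import re
from urllib.parse import parse_qsl, urlencode, urlparse, urlunparse

# Hypothetical query parameters that often mark print or paginated variants.
VARIANT_PARAMS = {"print", "page"}


def canonical_url(url: str) -> str:
    """Drop variant query parameters, the fragment, and any trailing slash."""
    parts = urlparse(url)
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in VARIANT_PARAMS
    ]
    path = parts.path.rstrip("/") or "/"
    return urlunparse((parts.scheme, parts.netloc.lower(), path, "", urlencode(query), ""))


def content_fingerprint(text: str) -> str:
    """Hash the text after collapsing whitespace and lowercasing."""
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def duplicate_groups(pages: dict[str, str]) -> list[list[str]]:
    """Group URLs whose page text has the same fingerprint."""
    groups: dict[str, list[str]] = {}
    for url, text in pages.items():
        groups.setdefault(content_fingerprint(text), []).append(url)
    return [urls for urls in groups.values() if len(urls) > 1]
```

Exact hashing only catches pages whose text is identical after normalization. Near-duplicates, such as a print version with an extra footer, still need a manual check.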

Handle Dynamic Content

The crawler captures content at crawl time. If your site has:

Frequently updated content: Recrawl periodically
Personalized content: May not import correctly
JavaScript-heavy pages: Content might not extract fully

Troubleshooting

No pages discovered

Check that the URL is correct
Ensure the site is publicly accessible
Try a more specific starting URL

Poor content extraction

Some sites don't extract well due to:

Heavy JavaScript rendering
Content behind authentication
Unusual HTML structure

Try adding content manually instead.
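
To check whether a page is likely to extract poorly because of JavaScript rendering, you can measure how much readable text its raw HTML contains. This is a rough heuristic sketched with the Python standard library, not the crawler's own extraction logic. The looks_script_rendered name and the 200-character threshold are arbitrary assumptions.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collect text that sits outside <script>, <style>, and <noscript>."""

    SKIP = {"script", "style", "noscript"}

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text found outside skipped elements.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def visible_text(html: str) -> str:
    """Return the readable text of an HTML document as one string."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def looks_script_rendered(html: str, min_chars: int = 200) -> bool:
    """True when the raw HTML carries very little readable text."""
    return len(visible_text(html)) < min_chars
```

A page that returns True here probably builds its content in the browser, which makes it a good candidate for adding manually.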

Too many irrelevant pages

If too much marketing content is discovered:

Use a more specific starting URL (e.g., /help subdirectory)
Manually deselect irrelevant pages during import
Consider manual document creation instead
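
To choose a more specific starting URL, it can help to see which top-level sections of your site hold the most pages. The sketch below counts URLs by their first path segment. It assumes you already have a list of URLs, for example from your sitemap, and the section_counts name is made up for illustration.

```python
from collections import Counter
from urllib.parse import urlparse


def section_counts(urls: list[str]) -> Counter:
    """Count URLs by first path segment, using '/' for the site root."""
    counts: Counter = Counter()
    for url in urls:
        segments = [s for s in urlparse(url).path.split("/") if s]
        section = "/" + segments[0] if segments else "/"
        counts[section] += 1
    return counts
```

For example, if most support articles sit under /help, starting the crawl at that subdirectory keeps marketing pages out of the results.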

Visibility for Imported Websites

When importing marketing or sales content for a specific persona:

Import the pages, then add them to the right folder
Select all imported items with checkboxes
Use Restrict visibility to specific personas from the bulk actions dropdown
In the dialog, select which persona(s) can access (multi-select)
Click Apply. All selected pages are now restricted to those personas

You don't need to open each website entry individually. For details, see the Visibility rules guide.