Crawling Websites
Import content from your existing help center or support site
If you already have a help center, knowledge base, or support documentation online, you can import it automatically using the website crawler. Crawled pages appear in the Websites tab on the Train page.
How Website Crawling Works
- You provide a starting URL (usually your help center homepage)
- Goldilocks discovers pages by following links and checking sitemaps
- AI categorizes pages as support content, marketing, or other
- You select which pages to import
- Selected pages become documents in your knowledge base
Starting a Crawl
- Navigate to Train in the sidebar
- Click the + Add Content button
- Select the Website tab in the dialog
- Enter your website URL (e.g., https://help.yoursite.com)
- Click Start Crawl
The crawler then begins discovering pages. Discovery typically takes 1-5 minutes, depending on the size of your site.
Crawl Process
Discovery Phase
The crawler finds pages by:
- Reading your sitemap.xml (if available)
- Following links from the starting page
- Exploring subdirectories
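To make the discovery steps concrete, here is a minimal sketch of how a crawler could collect URLs from a sitemap and from the links on a page. It is an illustration only, not Goldilocks's actual implementation: the function names (urls_from_sitemap, links_from_page) are hypothetical, and it works on text you have already downloaded rather than fetching anything itself.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
import xml.etree.ElementTree as ET

# Namespace used by standard sitemap.xml files.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def urls_from_sitemap(sitemap_xml: str) -> list[str]:
    """Return every <loc> URL listed in a sitemap.xml document."""
    root = ET.fromstring(sitemap_xml)
    return [loc.text.strip() for loc in root.iter(f"{SITEMAP_NS}loc") if loc.text]


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def links_from_page(page_url: str, html: str) -> list[str]:
    """Resolve a page's links to absolute URLs, keeping only the same host."""
    collector = LinkCollector()
    collector.feed(html)
    host = urlparse(page_url).netloc
    found: list[str] = []
    for href in collector.hrefs:
        absolute = urljoin(page_url, href).split("#")[0]  # drop fragments
        if urlparse(absolute).netloc == host and absolute not in found:
            found.append(absolute)
    return found
```

Restricting links to the starting host is an assumption made here to keep the sketch small. It also shows why a dedicated help center URL tends to produce a cleaner result than your main website.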
Categorization Phase
AI analyzes each discovered page and categorizes it:
- Recommended - Appears to be support/help content
- Review - Might be relevant, needs manual review
- Excluded - Marketing, legal, or non-support content
Selection Phase
Once categorization is complete, you're taken to the scan review page where you can:
- Review recommended pages
- Check pages marked for review
- Select or deselect individual pages
- Import selected pages
Viewing Crawled Websites
Imported pages appear in the Websites tab on the Train page. Each shows:
- Name - Page title
- URL - Original source URL
- Folder - Folder assignment
- Status - Active or Draft
- Retrievals - Usage count
- Last Synced - When content was last updated
Selecting Pages to Import
Recommended Pages
These are pages the AI identified as likely support content. Review them to:
- Confirm they're relevant
- Remove any that shouldn't be included
- Check for duplicate content
Review Pages
These need manual review. They might be:
- General information pages
- Blog posts about features
- Pages with mixed content
Excluding Pages
Deselect pages you don't want imported:
- Marketing/promotional content
- Terms of service, privacy policy
- Pages with outdated information
- Duplicate pages (different URLs, same content)
Importing Pages
- Review and adjust your selection on the scan page
- Click Import Selected
- Pages will be converted to documents
- Processing begins automatically
Imported pages appear in your Websites tab with the original URL preserved.
Managing Website Content
Edit Imported Content
- Click on a website entry in the list
- Modify the name, content, or settings
- Save changes
Note: Editing content locally means it may diverge from the source. Consider re-syncing if the source is updated.
Re-sync Content
To update content from the original URL:
- Find the website entry
- Click the menu icon (three dots)
- Select Resync
- Content will be re-fetched from the source
Recrawl a Website
To discover new or updated pages:
- Go to Train > Websites tab
- Find your previously crawled site
- Click Recrawl or start a new crawl with the same URL
- New and updated pages will be available for import
Best Practices
Start with Your Help Center
Begin with your dedicated help/support site rather than your main website. This reduces noise from marketing content.
Review AI Categorization
AI categorization is usually accurate but not perfect. Always review:
- Recommended pages for false positives
- Excluded pages for false negatives
Check for Duplicates
Websites often have duplicate content under different URLs. Look for:
- Paginated versions of the same article
- Print-friendly versions
- Mobile-specific pages
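If you export or copy page text before importing, a quick script can help you spot exact duplicates. The sketch below groups URLs whose text is identical after lowercasing and collapsing whitespace. The function names and the sample URLs are hypothetical, and it only catches identical text, so paginated or lightly reworded copies still need a manual look.

```python
import hashlib
import re
from collections import defaultdict


def content_fingerprint(text: str) -> str:
    """Hash page text after lowercasing and collapsing whitespace."""
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def find_duplicates(pages: dict[str, str]) -> list[list[str]]:
    """Group URLs whose text is identical after normalization."""
    groups: dict[str, list[str]] = defaultdict(list)
    for url, text in pages.items():
        groups[content_fingerprint(text)].append(url)
    # Only groups with more than one URL are duplicates.
    return [urls for urls in groups.values() if len(urls) > 1]


if __name__ == "__main__":
    sample = {
        "https://help.yoursite.com/reset": "Reset your password",
        "https://help.yoursite.com/reset?print=1": "reset   your\npassword ",
        "https://help.yoursite.com/billing": "Update billing",
    }
    for group in find_duplicates(sample):
        print("Duplicate content:", ", ".join(group))
```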
Handle Dynamic Content
The crawler captures content at crawl time. If your site has:
- Frequently updated content: Recrawl periodically
- Personalized content: May not import correctly
- JavaScript-heavy pages: Content might not extract fully
Set Up Regular Syncs
If your help center is frequently updated, schedule regular recrawls (weekly or monthly) to keep content fresh.
Troubleshooting
"No pages found"
- Check that the URL is correct
- Ensure the site is publicly accessible
- Try a more specific starting URL
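Before retrying, it can help to rule out simple problems with the URL itself. The following is a small offline sketch that flags a missing scheme, a missing host, or a local-only host. The function name start_url_problems and its messages are hypothetical, and it does not contact your site, so a URL that passes may still be unreachable.

```python
from urllib.parse import urlparse


def start_url_problems(url: str) -> list[str]:
    """Return a list of obvious problems with a starting URL (empty if none)."""
    problems: list[str] = []
    parsed = urlparse(url.strip())
    if parsed.scheme not in ("http", "https"):
        problems.append("URL should start with http:// or https://")
    if not parsed.netloc:
        problems.append("URL has no host name")
    elif parsed.hostname in ("localhost", "127.0.0.1"):
        problems.append("host is local, so it is not publicly accessible")
    return problems


if __name__ == "__main__":
    for candidate in ("https://help.yoursite.com", "help.yoursite.com"):
        print(candidate, "->", start_url_problems(candidate) or "looks fine")
```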
Poor content extraction
Some sites don't extract well due to:
- Heavy JavaScript rendering
- Content behind authentication
- Unusual HTML structure
Try adding content manually instead.
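One way to check whether a page depends on JavaScript rendering is to look at how much readable text its raw HTML contains. The sketch below strips scripts and styles and measures what is left. The names (visible_text, looks_js_rendered) and the 200-character threshold are assumptions chosen for illustration, not values used by the crawler.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collect text that is not inside script, style, or similar tags."""

    SKIP = {"script", "style", "noscript", "template"}

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def visible_text(html: str) -> str:
    """Return the readable text found in raw HTML, without running any scripts."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def looks_js_rendered(html: str, min_chars: int = 200) -> bool:
    """Guess that a page needs JavaScript if its raw HTML has very little text."""
    return len(visible_text(html)) < min_chars
```

If a page's raw HTML contains almost no text, the content is probably filled in by JavaScript after load, and adding that page manually is likely the faster route.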
Too many irrelevant pages
If too much marketing content is discovered:
- Use a more specific starting URL (e.g., the /help subdirectory)
- Manually deselect irrelevant pages during import
- Consider manual document creation instead
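If you have a list of discovered URLs outside the product (for example, from your sitemap), you can preview what a narrower starting URL would keep. This sketch filters URLs by a path prefix. The /help prefix, the sample URLs, and the function names are examples only, so substitute your own help section path.

```python
from urllib.parse import urlparse


def under_prefix(url: str, prefix: str) -> bool:
    """True if the URL path is the prefix itself or sits below it."""
    path = urlparse(url).path.rstrip("/")
    prefix = prefix.rstrip("/")
    return path == prefix or path.startswith(prefix + "/")


def keep_support_pages(urls: list[str], prefix: str = "/help") -> list[str]:
    """Keep only the URLs that live under the given path prefix."""
    return [url for url in urls if under_prefix(url, prefix)]


if __name__ == "__main__":
    discovered = [
        "https://yoursite.com/help",
        "https://yoursite.com/help/billing",
        "https://yoursite.com/helpful-tips",
        "https://yoursite.com/pricing",
    ]
    print(keep_support_pages(discovered))
```

Matching on prefix plus a trailing slash, rather than a plain string prefix, keeps look-alike paths such as /helpful-tips out of the result.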