Crawl Optimization
Maximize crawl value within your page limits — prioritize the right pages so the agent always has the context it needs.
How the Crawler Works
The crawler uses a sitemap-first approach. If a sitemap is found at the standard location or linked from your root page, those URLs form the crawl queue. Without a sitemap, the crawler falls back to link discovery starting from your homepage. Progress streams in real time via SSE, and extracted data is stored per-project as a JSON snapshot the agent can query instantly.
Many sitemap generators list pages newest-first. If your most important pages are older pillar posts, move them to the top of your sitemap so they're always included within your crawl limit.
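The sitemap-first behavior described above can be sketched roughly as follows. This is a minimal illustration, not the crawler's actual implementation: the function names, the `page_limit` parameter, and the assumption that the queue is simply truncated in sitemap order are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def urls_from_sitemap(sitemap_xml):
    """Return the <loc> values of a sitemap in document order."""
    root = ET.fromstring(sitemap_xml)
    return [loc.text.strip() for loc in root.iter(f"{SITEMAP_NS}loc") if loc.text]


def build_crawl_queue(sitemap_xml, homepage_links, page_limit):
    """Hypothetical queue builder: sitemap URLs if available, else discovered links."""
    candidates = urls_from_sitemap(sitemap_xml) if sitemap_xml else homepage_links
    seen = set()
    queue = []
    for url in candidates:
        if url not in seen:  # drop duplicates, keep first-seen order
            seen.add(url)
            queue.append(url)
    # Assumption: pages beyond the limit are dropped in listing order.
    return queue[:page_limit]


EXAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/new-post</loc></url>
  <url><loc>https://example.com/pillar-guide</loc></url>
  <url><loc>https://example.com/pricing</loc></url>
</urlset>"""

queue = build_crawl_queue(EXAMPLE_SITEMAP, [], page_limit=2)
print(queue)
# -> ['https://example.com/new-post', 'https://example.com/pillar-guide']
```

With a limit of two, the pricing page listed last never enters the queue, which is why sitemap order matters when coverage is limited.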
Crawl Coverage
How many pages the crawler covers depends on your access level. For your free report, the agent crawls enough of your site to ground its diagnosis in your real content. Once you are an onboarded client, the agent crawls your full site so it always has complete context to run your SEO continuously.
| Access level | Crawl coverage | Best fit |
|---|---|---|
| Prospect (free report) | Enough to ground your report | Generating an honest, data-backed first audit |
| Client (continuous agent) | Full-site coverage | Large sites, e-commerce, multi-section publications |
Each crawl replaces the previous snapshot. There is no historical archive, so re-crawl whenever your site changes significantly.
Prioritization Strategies
When your site has more pages than the current crawl will cover, prioritize the pages the agent needs most for content analysis, link suggestions, and writing style extraction.
- Put product pages, service pages, and high-traffic pillar content at the top of your sitemap
- List important blog posts before category or tag archive pages
- Exclude utility pages (author archives, pagination, login) from your sitemap entirely
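One way to apply the ordering rules above is to assign each URL a tier and sort on it before regenerating your sitemap. The sketch below is only an illustration: the path patterns are hypothetical and need to be adapted to your own URL structure.

```python
from urllib.parse import urlparse

# Hypothetical path rules; adjust them to match your site.
PRIORITY_PREFIXES = ("/products/", "/services/", "/guides/")
BLOG_PREFIX = "/blog/"
ARCHIVE_PREFIXES = ("/category/", "/tag/")
EXCLUDED_MARKERS = ("/author/", "/login", "/page/")  # utility pages


def tier(url):
    """Return a sort tier (lower is earlier), or None to exclude the URL."""
    path = urlparse(url).path
    if any(marker in path for marker in EXCLUDED_MARKERS):
        return None
    if path.startswith(PRIORITY_PREFIXES):
        return 0  # product, service, and pillar pages
    if path.startswith(BLOG_PREFIX):
        return 1  # individual blog posts
    if path.startswith(ARCHIVE_PREFIXES):
        return 3  # category and tag archives go last
    return 2      # everything else


def prioritize(urls):
    """Drop excluded URLs and order the rest by tier, keeping original order within a tier."""
    ranked = [(tier(url), url) for url in urls]
    kept = [(t, url) for t, url in ranked if t is not None]
    kept.sort(key=lambda pair: pair[0])  # list.sort is stable
    return [url for _, url in kept]


urls = [
    "https://example.com/tag/seo/",
    "https://example.com/blog/how-to-audit/",
    "https://example.com/author/jane/",
    "https://example.com/products/widget/",
    "https://example.com/blog/page/2/",
    "https://example.com/about/",
]
ordered = prioritize(urls)
print(ordered)
```

Here the product page moves to the front, the blog post follows, the tag archive goes last, and the author and pagination URLs are dropped.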
“Crawl my site focusing on the blog section first”
The agent prioritizes that section's pages within the current crawl coverage.
For large sites, consider temporarily pruning your sitemap to a specific section (e.g., only /docs/ URLs) when you need the agent to do deep analysis of that area.
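A temporary section-only sitemap could be generated along these lines. This is a sketch under assumptions: `prune_sitemap` is a hypothetical helper, and how you publish the resulting file depends on your CMS or hosting setup.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

SITEMAP_XMLNS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def prune_sitemap(urls, section_prefix):
    """Build sitemap XML containing only URLs whose path starts with section_prefix."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_XMLNS)
    for url in urls:
        if urlparse(url).path.startswith(section_prefix):
            entry = ET.SubElement(urlset, "url")
            ET.SubElement(entry, "loc").text = url
    return ET.tostring(urlset, encoding="unicode")


all_urls = [
    "https://example.com/docs/setup",
    "https://example.com/blog/launch",
    "https://example.com/docs/api/auth",
]
pruned = prune_sitemap(all_urls, "/docs/")
print(pruned)
```

Remember to restore the full sitemap afterwards so later crawls see the whole site again.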
When to Re-Crawl
Re-crawl when any of the following apply:
- New content published — the agent cannot see pages added after the last crawl
- Site restructured — changed URLs, merged sections, or reorganized navigation
- Key pages updated — title changes, rewrites, or meta description updates
- 4-6 weeks have passed — periodic refreshes keep agent context accurate
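If you track these triggers yourself, the checklist above can be expressed as a small helper. The `SiteState` fields and the choice of four weeks (the lower bound of the 4-6 week guidance) are assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

REFRESH_INTERVAL = timedelta(weeks=4)  # assumed: lower bound of the 4-6 week guidance


@dataclass
class SiteState:
    """Hypothetical record of when things last changed on the site."""
    last_crawl: datetime
    last_publish: Optional[datetime] = None
    last_restructure: Optional[datetime] = None
    last_key_page_update: Optional[datetime] = None


def recrawl_reasons(state, now):
    """Return the list of reasons a re-crawl is due (empty if none apply)."""
    reasons = []
    if state.last_publish and state.last_publish > state.last_crawl:
        reasons.append("new content published")
    if state.last_restructure and state.last_restructure > state.last_crawl:
        reasons.append("site restructured")
    if state.last_key_page_update and state.last_key_page_update > state.last_crawl:
        reasons.append("key pages updated")
    if now - state.last_crawl >= REFRESH_INTERVAL:
        reasons.append("periodic refresh due")
    return reasons


state = SiteState(
    last_crawl=datetime(2026, 1, 1),
    last_publish=datetime(2026, 1, 10),
)
reasons = recrawl_reasons(state, now=datetime(2026, 1, 20))
print(reasons)  # -> ['new content published']
```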
GSC data syncs independently. A re-crawl updates page content knowledge only — it does not affect your search performance data.
How the Agent Uses Crawl Data
The agent accesses crawl data through the site_context tool. This lets it search pages, identify thin content, and check keyword coverage against the stored snapshot without live network requests.
| Agent Task | How Crawl Data Helps |
|---|---|
| Content gap analysis | Checks which topics already have pages before suggesting new ones |
| Internal link suggestions | Finds relevant anchor opportunities across your site |
| Content audit | Identifies thin pages, missing meta descriptions, and H1 issues |
| Writing style extraction | Samples existing content to model your brand voice |
| Keyword cannibalization | Finds multiple pages targeting the same query |
“Analyze my crawled content for thin pages that need improvement”
The agent scans your crawl snapshot and returns pages below a word count threshold with expansion suggestions.
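Conceptually, a thin-page scan over a stored snapshot might look like the sketch below. The snapshot schema (`pages`, `url`, `word_count`) and the 300-word threshold are assumptions for illustration; the source does not specify the actual snapshot format or the threshold the agent uses.

```python
import json

THIN_WORD_COUNT = 300  # assumed threshold, not a documented value

# Hypothetical snapshot layout, for illustration only.
SNAPSHOT_JSON = """
{"pages": [
  {"url": "https://example.com/guides/seo-basics", "word_count": 1850},
  {"url": "https://example.com/blog/quick-update", "word_count": 120},
  {"url": "https://example.com/services/audit", "word_count": 640},
  {"url": "https://example.com/blog/announcement", "word_count": 85}
]}
"""


def find_thin_pages(snapshot, threshold=THIN_WORD_COUNT):
    """Return pages under the word-count threshold, thinnest first."""
    thin = [page for page in snapshot["pages"] if page["word_count"] < threshold]
    return sorted(thin, key=lambda page: page["word_count"])


snapshot = json.loads(SNAPSHOT_JSON)
thin = find_thin_pages(snapshot)
for page in thin:
    print(page["url"], page["word_count"])
```

Because the scan runs against stored data, it needs no live network requests, which matches how the agent is described as working with the snapshot.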
Tips for Best Results
- Keep your sitemap current. Outdated sitemaps spend crawl coverage on deleted pages and miss new ones.
- Match crawl scope to your question. Crawl a specific section for focused analysis, or crawl broadly for a full audit.
- Re-crawl before content sprints. Fresh context prevents the agent from reasoning on stale data.
- Check crawl status first. Ask the agent what's been crawled and when before starting analysis.
For your free report, a focused crawl goes further than you'd expect. Lead with your top pillar pages plus highest-traffic blog posts — enough for the agent to understand your site's structure, voice, and topical coverage and ground its diagnosis in real content.
© 2026 Agentic SEO. All rights reserved.