Crawl Optimization
Maximize crawl value within your page limits — prioritize the right pages so the agent always has the context it needs.
How the Crawler Works
The crawler uses a sitemap-first approach. If a sitemap is found at the standard location or linked from your root page, those URLs form the crawl queue. Without a sitemap, the crawler falls back to link discovery starting from your homepage. Progress streams in real time via SSE, and extracted data is stored per-project as a JSON snapshot the agent can query instantly.
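The sitemap-first queue logic can be sketched in a few lines of Python. This is an illustrative model, not the product's actual implementation: the function name, the naive `href` regex, and the in-memory inputs are all assumptions made for the example.

```python
import re
import xml.etree.ElementTree as ET

SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def build_crawl_queue(sitemap_xml, homepage_html, homepage_url, limit):
    """Sitemap-first: take <loc> URLs in document order; if no sitemap
    is available, fall back to links discovered on the homepage."""
    urls = []
    if sitemap_xml:
        root = ET.fromstring(sitemap_xml)
        urls = [loc.text.strip() for loc in root.iter(SITEMAP_NS + "loc")]
    if not urls:
        # Naive link discovery: absolute links found in the homepage markup.
        urls = [homepage_url] + re.findall(r'href="(https?://[^"]+)"', homepage_html)
    # Deduplicate while preserving order, then cap at the plan limit.
    seen, queue = set(), []
    for u in urls:
        if u not in seen:
            seen.add(u)
            queue.append(u)
    return queue[:limit]
```

Because the queue is capped at the plan limit in sitemap order, whatever appears first in your sitemap is what actually gets crawled.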
Most sitemap generators list pages newest-first. If your top pages are older pillar posts, move them to the top of your sitemap so they're always included within your crawl limit.
Crawl Limits by Plan
| Plan | Max Pages per Crawl | Best Fit |
|---|---|---|
| Free | 25 pages | Small blogs, landing pages, early-stage projects |
| Pro ($29/mo) | 100 pages | Business blogs, content-heavy SaaS |
| Agency ($79/mo) | 500 pages | Large sites, e-commerce, multi-section publications |
Each crawl replaces the previous snapshot. There is no historical archive, so re-crawl whenever your site changes significantly.
Prioritization Strategies
When your site has more pages than your crawl limit, prioritize the pages the agent needs most for content analysis, link suggestions, and writing style extraction.
- Put product pages, service pages, and high-traffic pillar content at the top of your sitemap
- List important blog posts before category or tag archive pages
- Exclude utility pages (author archives, pagination, login) from your sitemap entirely
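The three rules above amount to a reorder-and-filter pass over your sitemap URLs. A minimal sketch, assuming illustrative prefix and pattern lists (adjust both to your own site's URL structure):

```python
from urllib.parse import urlparse

PRIORITY_PREFIXES = ("/products/", "/services/", "/blog/")   # pillar content first
EXCLUDE_PATTERNS = ("/author/", "/page/", "/tag/", "/login")  # utility pages

def prioritize(urls):
    """Drop utility pages, then stable-sort so priority sections
    land inside the crawl limit. Original order is preserved within
    each tier."""
    kept = [u for u in urls
            if not any(p in urlparse(u).path for p in EXCLUDE_PATTERNS)]
    def rank(url):
        path = urlparse(url).path
        for i, prefix in enumerate(PRIORITY_PREFIXES):
            if path.startswith(prefix):
                return i
        return len(PRIORITY_PREFIXES)  # everything else sorts last
    return sorted(kept, key=rank)
```

Run your sitemap's URL list through a pass like this before regenerating it, and the crawl limit will be spent on the pages that matter.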
“Crawl my site focusing on the blog section first”
The agent prioritizes that section's pages within your plan limit.
For large sites on Pro, consider temporarily pruning your sitemap to a specific section (e.g., only /docs/ URLs) when you need deep analysis of that area.
When to Re-Crawl
Re-crawl when any of the following apply:
- New content published — the agent cannot see pages added after the last crawl
- Site restructured — changed URLs, merged sections, or reorganized navigation
- Key pages updated — title changes, rewrites, or meta description updates
- 4-6 weeks have passed — periodic refreshes keep agent context accurate
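The checklist above boils down to a simple staleness test. A hedged sketch, using the upper end of the 4-6 week guideline as the cutoff (the function and its parameters are illustrative, not part of the product):

```python
from datetime import datetime, timedelta, timezone

RECRAWL_AFTER = timedelta(weeks=6)  # upper end of the 4-6 week guideline

def needs_recrawl(last_crawl_iso, pages_changed_since=0, now=None):
    """True when content has been published or changed since the last
    crawl, or the snapshot is older than the refresh window."""
    now = now or datetime.now(timezone.utc)
    last = datetime.fromisoformat(last_crawl_iso)
    return pages_changed_since > 0 or (now - last) > RECRAWL_AFTER
```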
GSC data syncs independently. A re-crawl updates page content knowledge only — it does not affect your search performance data.
How the Agent Uses Crawl Data
The agent accesses crawl data through the site_context tool. This lets it search pages, identify thin content, and check keyword coverage against the stored snapshot without live network requests.
| Agent Task | How Crawl Data Helps |
|---|---|
| Content gap analysis | Checks which topics already have pages before suggesting new ones |
| Internal link suggestions | Finds relevant anchor opportunities across your site |
| Content audit | Identifies thin pages, missing meta descriptions, and H1 issues |
| Writing style extraction | Samples existing content to model your brand voice |
| Keyword cannibalization | Finds multiple pages targeting the same query |
“Analyze my crawled content for thin pages that need improvement”
The agent scans your crawl snapshot and returns pages below a word count threshold with expansion suggestions.
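Conceptually, that scan is a filter over the stored snapshot. A minimal sketch, assuming a hypothetical snapshot schema (`pages`, `text`, `meta_description`) and an illustrative word-count cutoff; the real snapshot format and threshold are internal to the product:

```python
import json

THIN_THRESHOLD = 300  # words; an illustrative cutoff, not the product's value

def find_thin_pages(snapshot_json, threshold=THIN_THRESHOLD):
    """Return (url, word_count) pairs for pages below the threshold,
    thinnest first, plus any page missing a meta description."""
    pages = json.loads(snapshot_json)["pages"]
    counts = [(p["url"], len(p.get("text", "").split())) for p in pages]
    thin = sorted((c for c in counts if c[1] < threshold), key=lambda c: c[1])
    missing_meta = [p["url"] for p in pages if not p.get("meta_description")]
    return thin, missing_meta
```

Because everything runs against the local snapshot, no live requests hit your site during analysis.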
Tips for Best Results
- Keep your sitemap current. Outdated sitemaps waste quota on deleted pages and miss new ones.
- Match crawl scope to your question. Crawl a specific section for focused analysis, or crawl broadly for a full audit.
- Re-crawl before content sprints. Fresh context prevents the agent from reasoning on stale data.
- Check crawl status first. Ask the agent what's been crawled and when before starting analysis.
On the Free plan, 25 pages goes further than you'd expect. Focus on your top pillar pages plus highest-traffic blog posts — enough for the agent to understand your site's structure, voice, and topical coverage.
© 2026 Agentic SEO. All rights reserved.