Crawl Optimization

Maximize crawl value within your page limits — prioritize the right pages so the agent always has the context it needs.

How the Crawler Works

The crawler uses a sitemap-first approach. If a sitemap is found at the standard location (typically /sitemap.xml) or linked from your root page, those URLs form the crawl queue in sitemap order. Without a sitemap, the crawler falls back to link discovery starting from your homepage. Progress streams in real time via server-sent events (SSE), and extracted data is stored per project as a JSON snapshot the agent can query instantly.
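
The queue-building logic above can be sketched roughly as follows. This is a simplified illustration of the sitemap-first behavior, not the product's actual implementation; the function name and inputs are hypothetical:

```python
def build_crawl_queue(sitemap_urls, homepage_links, limit):
    """Sitemap-first: use sitemap order when a sitemap exists, otherwise
    fall back to links discovered from the homepage. Cap at the plan limit."""
    source = sitemap_urls if sitemap_urls else homepage_links
    seen, queue = set(), []
    for url in source:
        if url not in seen:          # de-duplicate while preserving order
            seen.add(url)
            queue.append(url)
        if len(queue) == limit:      # the plan limit caps every crawl
            break
    return queue
```

Because the queue is truncated at the limit, the order of URLs in your sitemap directly determines which pages make the cut.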

Tip

Most sitemap generators list pages newest-first. If your top pages are older pillar posts, move them to the top of your sitemap so they're always included within your crawl limit.

Crawl Limits by Plan

| Plan | Max Pages per Crawl | Best Fit |
| --- | --- | --- |
| Free | 25 pages | Small blogs, landing pages, early-stage projects |
| Pro ($29/mo) | 100 pages | Business blogs, content-heavy SaaS |
| Agency ($79/mo) | 500 pages | Large sites, e-commerce, multi-section publications |

Each crawl replaces the previous snapshot. There is no historical archive, so re-crawl whenever your site changes significantly.

Prioritization Strategies

When your site has more pages than your crawl limit, prioritize the pages the agent needs most for content analysis, link suggestions, and writing style extraction.

  • Put product pages, service pages, and high-traffic pillar content at the top of your sitemap
  • List important blog posts before category or tag archive pages
  • Exclude utility pages (author archives, pagination, login) from your sitemap entirely
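
The three rules above amount to a sort-and-filter pass over your URL list. A minimal sketch, assuming hypothetical section prefixes (adjust them to your own site's structure):

```python
PRIORITY_PREFIXES = ["/products/", "/services/", "/blog/"]   # most important first
EXCLUDE_PREFIXES = ["/author/", "/page/", "/login"]          # utility pages to drop

def prioritize(urls, limit):
    """Order URLs so priority sections come first, drop utility pages,
    and keep only what fits within the crawl limit."""
    def rank(url):
        for i, prefix in enumerate(PRIORITY_PREFIXES):
            if url.startswith(prefix):
                return i
        return len(PRIORITY_PREFIXES)    # everything else sorts last
    kept = [u for u in urls if not any(u.startswith(p) for p in EXCLUDE_PREFIXES)]
    return sorted(kept, key=rank)[:limit]
```

Applying the same ordering to the `<url>` entries in your sitemap achieves the identical effect without any code, since the crawler consumes the sitemap top to bottom.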

Crawl my site focusing on the blog section first

The agent prioritizes that section's pages within your plan limit.

Note

For large sites on Pro, consider temporarily pruning your sitemap to a specific section (e.g., only /docs/ URLs) when you need deep analysis of that area.

When to Re-Crawl

Re-crawl when any of the following apply:

  • New content published — the agent cannot see pages added after the last crawl
  • Site restructured — changed URLs, merged sections, or reorganized navigation
  • Key pages updated — title changes, rewrites, or meta description updates
  • 4-6 weeks have passed — periodic refreshes keep agent context accurate
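
The periodic-refresh rule is easy to automate on your side. A small sketch, using the six-week upper bound from the list above (the function and its inputs are illustrative, not part of the product):

```python
from datetime import datetime, timedelta

def needs_recrawl(last_crawl, now=None, max_age_weeks=6):
    """Flag a snapshot as stale once it passes the refresh window."""
    now = now or datetime.now()
    return now - last_crawl > timedelta(weeks=max_age_weeks)
```
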

Note

GSC data syncs independently. A re-crawl updates page content knowledge only — it does not affect your search performance data.

How the Agent Uses Crawl Data

The agent accesses crawl data through the site_context tool. This lets it search pages, identify thin content, and check keyword coverage against the stored snapshot without live network requests.

| Agent Task | How Crawl Data Helps |
| --- | --- |
| Content gap analysis | Checks which topics already have pages before suggesting new ones |
| Internal link suggestions | Finds relevant anchor opportunities across your site |
| Content audit | Identifies thin pages, missing meta descriptions, and H1 issues |
| Writing style extraction | Samples existing content to model your brand voice |
| Keyword cannibalization | Finds multiple pages targeting the same query |

Analyze my crawled content for thin pages that need improvement

The agent scans your crawl snapshot and returns pages below a word count threshold with expansion suggestions.
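
Conceptually, that scan is a filter over the stored snapshot. A minimal sketch, assuming a hypothetical snapshot shape of `{"url", "word_count"}` records and an illustrative 300-word threshold (the real snapshot schema and threshold may differ):

```python
def find_thin_pages(snapshot, min_words=300):
    """Return pages below the word-count threshold, thinnest first."""
    thin = [p for p in snapshot if p["word_count"] < min_words]
    return sorted(thin, key=lambda p: p["word_count"])

snapshot = [
    {"url": "/blog/deep-dive", "word_count": 1800},
    {"url": "/blog/stub", "word_count": 120},
    {"url": "/about", "word_count": 250},
]
```

Here `find_thin_pages(snapshot)` would surface `/blog/stub` and `/about` as expansion candidates while leaving the long-form post alone.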

Tips for Best Results

  • Keep your sitemap current. Outdated sitemaps waste quota on deleted pages and miss new ones.
  • Match crawl scope to your question. Crawl a specific section for focused analysis, or crawl broadly for a full audit.
  • Re-crawl before content sprints. Fresh context prevents the agent from reasoning on stale data.
  • Check crawl status first. Ask the agent what's been crawled and when before starting analysis.

Tip

On the Free plan, 25 pages goes further than you'd expect. Focus on your top pillar pages plus highest-traffic blog posts — enough for the agent to understand your site's structure, voice, and topical coverage.

Try these prompts

Crawl my site focusing on the blog section first
What pages have I crawled and when was the last crawl?
Analyze my crawled content for thin pages that need improvement

© 2026 Agentic SEO. All rights reserved.