Outcome Verification and Evals
How the agent proves a fix actually caused the change — every action carries an expected impact, ships through review, and is re-measured in Google Search Console weeks later.
The Problem with Most SEO KPIs
Walk into any SEO engagement and you get a dashboard: rankings, domain rating, total traffic, a backlink count ticking up. They look like progress. Almost none of them prove that the work caused the change. Rankings drift on their own. Domain metrics are third-party estimates. Total traffic moves with seasonality, brand demand, and Google's own updates. A backlink count says nothing about whether a single one moved a single query.
These are lagging, aggregate, and mostly non-causal: the SEO equivalent of vanity metrics. The question a buyer actually cares about is sharper: did this specific fix move this specific query, and can you show it? That is the only KPI that survives scrutiny. It is also the one that neither the freelancer who ships 200 links and a green dashboard nor the self-serve tool that surfaces a finding and stops is structurally able to answer.
Visibility is not the same as a verified outcome. On a site we audited, impressions rose 168% across two adjacent 28-day windows (791 to 2,117) while clicks stayed flat. The dashboard looked like a win. Nothing had actually been earned. On a similar audit, one query sat at average position 6.5 with 952 impressions and zero clicks over 90 days: page-one “visibility” that delivered nothing. Outcome verification exists to catch exactly this gap between a number that moved and an outcome that mattered.
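The arithmetic behind the impressions example can be checked in a few lines. This is only a sketch: the helper name `pct_change` is illustrative and not part of the product.

```python
def pct_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    if before == 0:
        raise ValueError("percent change is undefined for a zero baseline")
    return (after - before) / before * 100


# Impressions across the two adjacent 28-day windows quoted above.
impressions_change = pct_change(791, 2117)
print(f"{impressions_change:+.0f}%")  # +168%
```

Because CTR is clicks divided by impressions, flat (non-zero) clicks against that rise in impressions means CTR fell over the same period, which is why the headline number alone is misleading.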
The contrarian cut: most SEO KPIs are lagging or vanity (rankings, DR/DA, total traffic, backlink counts). The one that holds up is verified causal movement — did this fix move this metric on this query, confirmed in Search Console weeks after it shipped.
What an Eval Is Here
An eval, in this system, is not a free-text note about what the agent thinks happened. It is a structured, verified record of an action's outcome — for a fix, an article, or an opportunity the agent pursued. Each one ties an action to a prediction made before the work shipped, and to a real Search Console measurement taken after.
Every measurable action records three things, the first two before the work ships:
- What was done — the fix, the article, the change, with a link to the pull request that shipped it
- What it was expected to move — a named GSC metric and a target value
- What actually happened — the re-measured outcome weeks later, scored confirmed, regressed, or inconclusive
That triplet — action, prediction, verified result — is the eval. It turns SEO work from a list of activities into a body of evidence about what actually works.
The Verification Lifecycle
Every measurable action moves through the same four steps, automatically:
- Predict — when the agent files a finding, it sets an expected impact, the GSC metric that would prove it, and a target value. The prediction is committed before any work ships, so it can't be rationalized after the fact.
- Ship through review — the fix is implemented (usually as a pull request in your own repo) and only closed once it's actually live. Marking it done records the PR link and an implementation note.
- Wait out the window — closing a measurable task opens a verification window of roughly 21 days, long enough for Search Console to register the real effect rather than noise.
- Re-measure against GSC — once the window closes, the weekly monitor pulls the metric for that query or page and judges the outcome: confirmed (the predicted movement happened), regressed (it moved the wrong way), or inconclusive (no clear signal).
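The final step can be sketched as a small scoring function. The text does not spell out the monitor's exact decision rule, so everything here is an assumption made for illustration: the `judge` name, the `baseline` and `noise` parameters, and the choice to treat movement in the right direction that falls short of the target as inconclusive.

```python
from typing import Literal

Verdict = Literal["confirmed", "regressed", "inconclusive"]

# Assumption: for average position a lower number is better; for clicks,
# impressions, and ctr a higher number is better.
LOWER_IS_BETTER = {"position"}


def judge(metric: str, baseline: float, observed: float, target: float,
          noise: float = 0.0) -> Verdict:
    """Score a re-measured value against its prediction (hypothetical rule)."""
    if metric in LOWER_IS_BETTER:
        # Flip signs so "bigger is better" holds for every metric below.
        baseline, observed, target = -baseline, -observed, -target
    if observed >= target:
        return "confirmed"      # the predicted movement happened
    if observed < baseline - noise:
        return "regressed"      # the metric moved the wrong way
    return "inconclusive"       # no clear signal either way


# Illustrative values only: a query starting at average position 6.5
# with a predicted target of position 3.0.
print(judge("position", baseline=6.5, observed=2.8, target=3.0))  # confirmed
print(judge("position", baseline=6.5, observed=9.0, target=3.0))  # regressed
print(judge("position", baseline=6.5, observed=5.0, target=3.0))  # inconclusive
```

The `noise` margin is there so that a tiny dip below the baseline is not reported as a regression. A real monitor would need to pick that margin per metric.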
The Fields That Make It Measurable
A measurable task carries a small, deliberate set of fields. These are what turn a to-do into a verifiable claim:
| Field | What it captures |
|---|---|
| expected_impact | The prediction in plain language — what this fix should change and why |
| verify_metric | The GSC metric that would prove it — one of clicks, impressions, ctr, or position |
| verify_target | The target value the metric should reach for the prediction to hold |
| verify_after | The date the verification window closes — roughly 21 days after the fix shipped |
| verification_status | none → pending → confirmed, inconclusive, or regressed |
| verified_at | When the monitor actually ran the re-measurement |
| verification_note | The monitor's evidence — the GSC numbers behind the verdict |
| pr_url | The pull request that shipped the fix, so the change is traceable to code |
| implementation_note | What was actually changed when the task was closed |
Not every task is measurable, and that's fine. A task with no clear GSC signal (say, a structural cleanup) is tracked without a verification target. The verification machinery only fires when verify_metric and verify_target are set — so a claim is only ever made when it can actually be checked.
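As a rough sketch, the fields above could be modeled as a single record with a guard for the "measurable" rule. The field names come from the table. The Python types, the `title` field, and the `is_measurable` property are assumptions for illustration, not the product's actual schema.

```python
from dataclasses import dataclass
from datetime import date, datetime
from typing import Optional

GSC_METRICS = {"clicks", "impressions", "ctr", "position"}


@dataclass
class Task:
    title: str  # hypothetical field, not listed in the table
    expected_impact: Optional[str] = None
    verify_metric: Optional[str] = None
    verify_target: Optional[float] = None
    verify_after: Optional[date] = None
    verification_status: str = "none"
    verified_at: Optional[datetime] = None
    verification_note: Optional[str] = None
    pr_url: Optional[str] = None
    implementation_note: Optional[str] = None

    def __post_init__(self) -> None:
        # Reject metrics outside the four GSC metrics named in the table.
        if self.verify_metric is not None and self.verify_metric not in GSC_METRICS:
            raise ValueError(f"unknown GSC metric: {self.verify_metric!r}")

    @property
    def is_measurable(self) -> bool:
        """Verification only fires when both metric and target are set."""
        return self.verify_metric is not None and self.verify_target is not None


# Illustrative tasks: a structural cleanup and a fix with a named metric.
cleanup = Task(title="Structural cleanup")
rewrite = Task(title="Title rewrite", verify_metric="ctr", verify_target=0.03)
print(cleanup.is_measurable, rewrite.is_measurable)  # False True
```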
Why This Beats Run-and-Hope
The default mode of SEO is run-and-hope: do a batch of work, watch an aggregate number, and narrate a story about why it moved. Because the prediction is never committed before the work, any movement can be claimed and any non-movement can be explained away. Nothing is ever falsified, so nothing is ever really learned.
Outcome verification inverts that. The prediction is locked in first. The metric is named first. The window is fixed first. When the monitor re-measures, the verdict is whatever the GSC data says — including “regressed,” which a run-and-hope dashboard would never surface. That discipline is uncomfortable on purpose: it's the only way to tell a real win from a coincidence.
| Run-and-hope SEO | Verified outcomes |
|---|---|
| Reports aggregate traffic and rankings | Reports per-action causal movement |
| Prediction (if any) made after the fact | Prediction committed before work ships |
| Wins and coincidences look identical | A confirmed result is distinct from drift |
| Regressions get buried | Regressions are surfaced and named |
| Nothing is falsifiable, nothing learned | Every action adds to a record of what works |
The Learning Flywheel
Verified outcomes are valuable on their own — they prove the work to a client in numbers. But the deeper payoff is what accumulates. Every confirmed, regressed, or inconclusive eval is a structured data point: what worked, by tactic and by context. A title rewrite on a striking-distance commercial query. An internal-link push to a thin page. A net-new article against a low-difficulty term. Each ships with a prediction and resolves to a verdict.
Per site, that record sharpens prioritization fast: the agent stops guessing which tactics pay off on this domain and starts ranking findings by what has actually been confirmed to move the needle here. Across many engagements, the same records compound into cross-client priors — which tactic tends to work in which context — so the agent gets better at predicting impact before any work ships. That is the flywheel: every action becomes evidence, and the evidence makes the next prediction better.
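One simple way to picture that prioritization is a confirmation rate per tactic. The text does not say how the agent actually weighs past evals, so the function names, the tactic labels, and the made-up history below are all assumptions for illustration.

```python
from collections import Counter, defaultdict
from typing import Iterable


def confirmation_rates(evals: Iterable[tuple[str, str]]) -> dict[str, float]:
    """Share of resolved evals per tactic that ended as 'confirmed'."""
    counts: dict[str, Counter] = defaultdict(Counter)
    for tactic, verdict in evals:
        counts[tactic][verdict] += 1
    return {
        tactic: c["confirmed"] / sum(c.values())
        for tactic, c in counts.items()
    }


def rank_tactics(evals: Iterable[tuple[str, str]]) -> list[str]:
    """Tactics ordered from highest to lowest confirmation rate."""
    rates = confirmation_rates(evals)
    return sorted(rates, key=rates.get, reverse=True)


# Hypothetical history of (tactic, verdict) pairs, not real results.
history = [
    ("title_rewrite", "confirmed"),
    ("title_rewrite", "confirmed"),
    ("title_rewrite", "inconclusive"),
    ("internal_links", "regressed"),
    ("internal_links", "confirmed"),
]
print(rank_tactics(history))  # ['title_rewrite', 'internal_links']
```

With only a handful of evals per tactic these rates are noisy, so a real ranking would likely also account for sample size and context.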
This is the moat. A freelancer's 200 backlinks and a self-serve tool's findings list both stop at the activity. Neither produces a verified record of causal outcomes, so neither can learn from its own work. The accountability loop — predict, ship, re-measure — is the product.
How the Board and Monitor Produce It
The whole loop runs through two pieces you already have: the task board and the weekly monitor.
- The board is where predictions live. When the agent (or the coding agent in your own repo, over the MCP server) files a finding, it creates a task with expected_impact, verify_metric, and verify_target. When it ships the fix, it marks the task done with the pr_url and an implementation note.
- Marking a measurable task done queues verification. The moment a task with a verify metric is closed, its status flips to pending and the verify_after date is set to roughly 21 days out. Nothing else is needed: the prediction is now on the clock.
- The Monday monitor settles the verdict. Each week the monitor reads every task whose window has closed, pulls the relevant Search Console data, and writes back verification_status, verified_at, and a verification_note with the numbers. Confirmed, regressed, or inconclusive: the data decides.
It's the same continuous loop described in the autonomous agent and the agent loop, with the accountability layer made explicit: nothing is claimed that GSC hasn't confirmed.
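The hand-off between the board and the monitor can be sketched as two small functions. This is a minimal illustration, not the product's code: `BoardTask`, `mark_done`, `due_for_verification`, the exact 21-day constant, and the placeholder PR URL are all assumptions. The actual Search Console query is left out on purpose.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional

VERIFY_WINDOW = timedelta(days=21)  # "roughly 21 days" in the text


@dataclass
class BoardTask:
    verify_metric: Optional[str] = None
    verify_target: Optional[float] = None
    verification_status: str = "none"
    verify_after: Optional[date] = None
    pr_url: Optional[str] = None
    implementation_note: Optional[str] = None


def mark_done(task: BoardTask, pr_url: str, note: str, today: date) -> None:
    """Close a task; measurable tasks go on the verification clock."""
    task.pr_url = pr_url
    task.implementation_note = note
    if task.verify_metric is not None and task.verify_target is not None:
        task.verification_status = "pending"
        task.verify_after = today + VERIFY_WINDOW


def due_for_verification(tasks: list[BoardTask], today: date) -> list[BoardTask]:
    """Pending tasks whose verification window has closed."""
    return [
        t for t in tasks
        if t.verification_status == "pending"
        and t.verify_after is not None
        and t.verify_after <= today
    ]


# Illustrative run with a placeholder PR URL and an illustrative target.
task = BoardTask(verify_metric="clicks", verify_target=20.0)
mark_done(task, "https://example.com/pr/1", "Rewrote title tag", date(2026, 1, 5))
print(task.verification_status, task.verify_after)  # pending 2026-01-26
print(len(due_for_verification([task], date(2026, 1, 26))))  # 1
```

A task without a verify metric and target passes through `mark_done` with its status left at none, which matches the rule that a claim is only made when it can be checked.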
Getting Started
Outcome verification isn't a setting you toggle — it's how the agent works the moment it's running your SEO. Start with your free SEO report to see findings land on the board, then book a call with the founder to put the full loop on your account: the coding agent shipping fixes through review, and the weekly monitor proving each one in Search Console.
Once it's running, the question you ask changes. Not “did traffic go up?” but “which of last month's fixes were confirmed, which regressed, and what does that tell us to do next?” — the only SEO question worth answering.
© 2026 Agentic SEO. All rights reserved.