Outcome Verification and Evals

How the agent proves a fix actually caused the change — every action carries an expected impact, ships through review, and is re-measured in Google Search Console weeks later.

The Problem with Most SEO KPIs

Walk into any SEO engagement and you get a dashboard: rankings, domain rating, total traffic, a backlink count ticking up. They look like progress. Almost none of them prove that the work caused the change. Rankings drift on their own. Domain metrics are third-party estimates. Total traffic moves with seasonality, brand demand, and Google's own updates. A backlink count says nothing about whether a single one moved a single query.

These metrics are lagging, aggregate, and mostly non-causal, the SEO equivalent of vanity metrics. The question a buyer actually cares about is sharper: did this specific fix move this specific query, and can you show it? That is the only KPI that survives scrutiny. It is also the one that neither the freelancer who ships 200 links and a green dashboard, nor the self-serve tool that surfaces a finding and stops, is structurally able to answer.

Visibility is not the same as a verified outcome. On a site we audited, impressions rose 168% across two adjacent 28-day windows, from 791 to 2,117, while clicks stayed flat. The dashboard looked like a win. Nothing had actually been earned. In a similar audit, one query sat at average position 6.5 with 952 impressions and zero clicks over 90 days: page-one “visibility” that delivered nothing. Outcome verification exists to catch exactly this gap between a number that moved and an outcome that mattered.
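The arithmetic behind those two examples is easy to check. This is a minimal sketch; the helper names `pct_change` and `ctr` are illustrative, not part of the product.

```python
def pct_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    if before == 0:
        raise ValueError("percent change is undefined for a zero baseline")
    return (after - before) / before * 100


def ctr(clicks: int, impressions: int) -> float:
    """Click-through rate as a fraction; 0.0 when there were no impressions."""
    return clicks / impressions if impressions else 0.0


# Impressions across the two adjacent 28-day windows quoted above.
impressions_change = pct_change(791, 2117)  # roughly 167.6, reported as 168%

# The position-6.5 query: 952 impressions and zero clicks over 90 days.
zero_click_ctr = ctr(0, 952)
```

A large `impressions_change` next to a CTR of zero is the pattern outcome verification is meant to flag: the number moved, the outcome did not.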

Note

The contrarian cut: most SEO KPIs are lagging or vanity (rankings, DR/DA, total traffic, backlink counts). The one that holds up is verified causal movement — did this fix move this metric on this query, confirmed in Search Console weeks after it shipped.

What an Eval Is Here

An eval, in this system, is not a free-text note about what the agent thinks happened. It is a structured, verified record of an action's outcome — for a fix, an article, or an opportunity the agent pursued. Each one ties an action to a prediction made before the work shipped, and to a real Search Console measurement taken after.

Every measurable action records three things:

  • What was done — the fix, the article, the change, with a link to the pull request that shipped it
  • What it was expected to move — a named GSC metric and a target value
  • What actually happened — the re-measured outcome weeks later, scored confirmed, regressed, or inconclusive

That triplet — action, prediction, verified result — is the eval. It turns SEO work from a list of activities into a body of evidence about what actually works.

The Verification Lifecycle

Every measurable action moves through the same four steps, automatically:

  1. Predict — when the agent files a finding, it sets an expected impact, the GSC metric that would prove it, and a target value. The prediction is committed before any work ships, so it can't be rationalized after the fact.
  2. Ship through review — the fix is implemented (usually as a pull request in your own repo) and only closed once it's actually live. Marking it done records the PR link and an implementation note.
  3. Wait out the window — closing a measurable task opens a verification window of roughly 21 days, long enough for Search Console to register the real effect rather than noise.
  4. Re-measure against GSC — once the window closes, the weekly monitor pulls the metric for that query or page and judges the outcome: confirmed (the predicted movement happened), regressed (it moved the wrong way), or inconclusive (no clear signal).
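Step 4 can be sketched as a small scoring function. The source does not spell out the monitor's exact rules, so the logic here is an assumption: `judge`, the `noise` tolerance, and the decision to treat "moved the right way but missed the target" as inconclusive are all hypothetical choices.

```python
from typing import Literal

Metric = Literal["clicks", "impressions", "ctr", "position"]
Verdict = Literal["confirmed", "regressed", "inconclusive"]


def judge(metric: Metric, baseline: float, measured: float,
          target: float, noise: float = 0.0) -> Verdict:
    """Score a re-measurement against the prediction committed up front.

    Assumptions: for position, lower is better; for the other metrics,
    higher is better. `noise` is a hypothetical tolerance, and movement
    within it counts as no clear signal.
    """
    lower_is_better = metric == "position"
    # Normalize so a positive delta always means "moved the right way".
    delta = (baseline - measured) if lower_is_better else (measured - baseline)
    reached = measured <= target if lower_is_better else measured >= target

    if reached and delta > noise:
        return "confirmed"
    if delta < -noise:
        return "regressed"
    return "inconclusive"


# Made-up numbers, for illustration only.
print(judge("position", baseline=6.5, measured=3.2, target=4.0))
print(judge("clicks", baseline=10, measured=4, target=30))
print(judge("clicks", baseline=10, measured=18, target=30))
```

Because the target and baseline are passed in as they stood when the prediction was made, the verdict cannot be adjusted after the fact.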

The Fields That Make It Measurable

A measurable task carries a small, deliberate set of fields. These are what turn a to-do into a verifiable claim:

| Field | What it captures |
| --- | --- |
| expected_impact | The prediction in plain language: what this fix should change and why |
| verify_metric | The GSC metric that would prove it: one of clicks, impressions, ctr, or position |
| verify_target | The target value the metric should reach for the prediction to hold |
| verify_after | The date the verification window closes, roughly 21 days after the fix shipped |
| verification_status | none → pending → confirmed, inconclusive, or regressed |
| verified_at | When the monitor actually ran the re-measurement |
| verification_note | The monitor's evidence: the GSC numbers behind the verdict |
| pr_url | The pull request that shipped the fix, so the change is traceable to code |
| implementation_note | What was actually changed when the task was closed |

Tip

Not every task is measurable, and that's fine. A task with no clear GSC signal (say, a structural cleanup) is tracked without a verification target. The verification machinery only fires when verify_metric and verify_target are set — so a claim is only ever made when it can actually be checked.
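As a rough picture of how these fields fit together, here is a hypothetical in-memory shape for a task. The field names come from the table above; the `Task` class, the `title` field, the types, and the `is_measurable` property are assumptions made for illustration, not the product's actual schema.

```python
from dataclasses import dataclass
from datetime import date, datetime
from typing import Optional


@dataclass
class Task:
    """Hypothetical shape for a board task; not the real schema."""
    title: str                                # assumed; not in the field table
    expected_impact: Optional[str] = None
    verify_metric: Optional[str] = None       # clicks, impressions, ctr, or position
    verify_target: Optional[float] = None
    verify_after: Optional[date] = None
    verification_status: str = "none"         # none, pending, confirmed, inconclusive, regressed
    verified_at: Optional[datetime] = None
    verification_note: Optional[str] = None
    pr_url: Optional[str] = None
    implementation_note: Optional[str] = None

    @property
    def is_measurable(self) -> bool:
        # Verification only fires when both a metric and a target are set.
        return self.verify_metric is not None and self.verify_target is not None


cleanup = Task(title="Remove orphaned redirects")
rewrite = Task(title="Rewrite title tag", verify_metric="ctr", verify_target=0.03)
```

Here `cleanup.is_measurable` is false, so the task is tracked without a verdict, while `rewrite.is_measurable` is true and the task would enter the verification lifecycle once closed.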

Why This Beats Run-and-Hope

The default mode of SEO is run-and-hope: do a batch of work, watch an aggregate number, and narrate a story about why it moved. Because the prediction is never committed before the work, any movement can be claimed and any non-movement can be explained away. Nothing is ever falsified, so nothing is ever really learned.

Outcome verification inverts that. The prediction is locked in first. The metric is named first. The window is fixed first. When the monitor re-measures, the verdict is whatever the GSC data says — including “regressed,” which a run-and-hope dashboard would never surface. That discipline is uncomfortable on purpose: it's the only way to tell a real win from a coincidence.

| Run-and-hope SEO | Verified outcomes |
| --- | --- |
| Reports aggregate traffic and rankings | Reports per-action causal movement |
| Prediction (if any) made after the fact | Prediction committed before work ships |
| Wins and coincidences look identical | A confirmed result is distinct from drift |
| Regressions get buried | Regressions are surfaced and named |
| Nothing is falsifiable, nothing learned | Every action adds to a record of what works |

The Learning Flywheel

Verified outcomes are valuable on their own — they prove the work to a client in numbers. But the deeper payoff is what accumulates. Every confirmed, regressed, or inconclusive eval is a structured data point: what worked, by tactic and by context. A title rewrite on a striking-distance commercial query. An internal-link push to a thin page. A net-new article against a low-difficulty term. Each ships with a prediction and resolves to a verdict.

Per site, that record sharpens prioritization fast: the agent stops guessing which tactics pay off on this domain and starts ranking findings by what has actually been confirmed to move the needle here. Across many engagements, the same records compound into cross-client priors — which tactic tends to work in which context — so the agent gets better at predicting impact before any work ships. That is the flywheel: every action becomes evidence, and the evidence makes the next prediction better.
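One simple way to turn that record into a prioritization signal is a per-tactic confirmation rate. This is only a sketch under stated assumptions: the tactic labels and sample verdicts are made up, the function names are hypothetical, and a real system would also need to account for small sample sizes and context.

```python
from collections import Counter, defaultdict
from typing import Iterable


def confirmation_rates(evals: Iterable[tuple[str, str]]) -> dict[str, float]:
    """Share of resolved evals per tactic that ended up confirmed.

    Each item is (tactic, verdict), with verdict one of
    confirmed, regressed, or inconclusive.
    """
    counts: dict[str, Counter] = defaultdict(Counter)
    for tactic, verdict in evals:
        counts[tactic][verdict] += 1
    return {tactic: c["confirmed"] / sum(c.values()) for tactic, c in counts.items()}


def rank_tactics(evals: Iterable[tuple[str, str]]) -> list[str]:
    """Tactics ordered from highest to lowest confirmation rate."""
    rates = confirmation_rates(evals)
    return sorted(rates, key=lambda tactic: rates[tactic], reverse=True)


# Made-up history, for illustration only.
history = [
    ("title_rewrite", "confirmed"),
    ("title_rewrite", "confirmed"),
    ("title_rewrite", "inconclusive"),
    ("internal_links", "confirmed"),
    ("internal_links", "regressed"),
]
ranking = rank_tactics(history)
```

With this sample history, `title_rewrite` is confirmed two times out of three and `internal_links` one time out of two, so `title_rewrite` ranks first.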

Note

This is the moat. A freelancer's 200 backlinks and a self-serve tool's findings list both stop at the activity. Neither produces a verified record of causal outcomes, so neither can learn from its own work. The accountability loop — predict, ship, re-measure — is the product.

How the Board and Monitor Produce It

The whole loop runs through two pieces you already have: the task board and the weekly monitor.

  • The board is where predictions live. When the agent — or the coding agent in your own repo, over the MCP server — files a finding, it creates a task with expected_impact, verify_metric, and verify_target. When it ships the fix, it marks the task done with the pr_url and an implementation note.
  • Marking a measurable task done queues verification. The moment a task with verify_metric and verify_target set is closed, its status flips to pending and the verify_after date is set to roughly 21 days out. Nothing else is needed: the prediction is now on the clock.
  • The Monday monitor settles the verdict. Each week the monitor reads every task whose window has closed, pulls the relevant Search Console data, and writes back verification_status, verified_at, and a verification_note with the numbers. Confirmed, regressed, or inconclusive — the data decides.
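The hand-off between the board and the monitor can be sketched as two small functions. Tasks are plain dictionaries here for brevity. `close_task`, `due_for_verification`, the exact 21-day constant, and the placeholder PR URL are assumptions for illustration, not the product's implementation.

```python
from datetime import date, timedelta

# The text says "roughly 21 days"; the exact value is an assumption.
VERIFICATION_WINDOW = timedelta(days=21)


def close_task(task: dict, pr_url: str, implementation_note: str,
               shipped_on: date) -> dict:
    """Mark a task done; queue verification only if it is measurable."""
    closed = {**task, "pr_url": pr_url, "implementation_note": implementation_note}
    if closed.get("verify_metric") and closed.get("verify_target") is not None:
        closed["verification_status"] = "pending"
        closed["verify_after"] = shipped_on + VERIFICATION_WINDOW
    return closed


def due_for_verification(tasks: list[dict], today: date) -> list[dict]:
    """Tasks the weekly monitor should re-measure: pending, window closed."""
    return [
        t for t in tasks
        if t.get("verification_status") == "pending"
        and t.get("verify_after") is not None
        and t["verify_after"] <= today
    ]


# Placeholder values, for illustration only.
task = {"verify_metric": "clicks", "verify_target": 30}
closed = close_task(task, "https://example.com/pr/1", "Rewrote title tag",
                    shipped_on=date(2026, 1, 5))
```

In this example `closed["verify_after"]` is 2026-01-26, so a monitor run on that date or later would pick the task up, and an earlier run would skip it.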

It's the same continuous loop described in the autonomous agent and the agent loop, with the accountability layer made explicit: nothing is claimed that GSC hasn't confirmed.
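For readers curious what the re-measurement pulls, the sketch below builds a request body for the Search Console API's `searchanalytics.query` method, filtered to a single search query. How the monitor actually queries GSC is not described here, so the function name, the 28-day default window, and the single-query filter are assumptions.

```python
from datetime import date, timedelta


def gsc_query_body(query: str, window_end: date, days: int = 28) -> dict:
    """Request body for searchanalytics.query, filtered to one search query.

    The 28-day default is an assumption, chosen to match the windows
    mentioned earlier in this article.
    """
    window_start = window_end - timedelta(days=days - 1)  # inclusive window
    return {
        "startDate": window_start.isoformat(),
        "endDate": window_end.isoformat(),
        "dimensions": ["query"],
        "dimensionFilterGroups": [{
            "filters": [{
                "dimension": "query",
                "operator": "equals",
                "expression": query,
            }]
        }],
        "rowLimit": 1,
    }


body = gsc_query_body("example query", window_end=date(2026, 1, 28))

# With google-api-python-client and valid credentials, the call would look like:
#   service.searchanalytics().query(siteUrl=site_url, body=body).execute()
# Each returned row carries clicks, impressions, ctr, and position.
```

Running the same body over a window before the fix shipped and a window after `verify_after` gives the two measurements a verdict would be based on.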

Getting Started

Outcome verification isn't a setting you toggle — it's how the agent works the moment it's running your SEO. Start with your free SEO report to see findings land on the board, then book a call with the founder to put the full loop on your account: the coding agent shipping fixes through review, and the weekly monitor proving each one in Search Console.

Once it's running, the question you ask changes. Not “did traffic go up?” but “which of last month's fixes were confirmed, which regressed, and what does that tell us to do next?” — the only SEO question worth answering.

Try these prompts

  • For each open task, set an expected impact and the GSC metric that would prove it
  • Which tasks I marked done are still pending verification?
  • Re-measure last month's fixes against GSC and tell me which ones actually moved
  • Show me what worked: confirmed tasks grouped by tactic

© 2026 Agentic SEO. All rights reserved.