We Rebuilt Our Scoring Engine Based on Academic Research. Here's What Changed.

Dillon Richardson
AdAlign Team
April 1, 2026
We pressure-tested every design decision in our scoring engine against academic research — psychometrics, information foraging theory, and CRO studies. Here's what we found and what we changed.

*By Dillon Richardson — Founder, AdAlign*

---

AdAlign scores the congruence between your ads and landing pages. It tells you whether the page your traffic lands on actually matches the ad that sent them there.

We launched with a three-dimension scoring model: Visual Match, Contextual Match, and Tone Alignment. Platform-specific weights. A penalty system for mismatches. It worked. Scores felt right.

But "feels right" isn't good enough when agencies are making budget decisions based on your numbers.

So I ran a deep research sprint — academic papers, UX research, psychometrics literature, conversion rate optimization studies — to pressure-test every design decision in our scoring engine. The goal: find out where we were wrong, where we were right, and what we were missing entirely.

Here's what I found and what we changed.

---

## The research question

I structured the investigation around seven specific questions:

1. Does academic evidence support that ad-to-landing-page congruence actually drives conversions?
2. Are three scoring dimensions enough, or are we missing something critical?
3. Do our platform-specific weights (heavier visual weight on Meta, heavier contextual weight on Google) hold up?
4. Is our 0–10 scoring scale the right granularity?
5. Is our penalty system (hard cap at 6 for severe mismatches) well-designed?
6. Is our calibration methodology sound?
7. How does our approach compare to existing frameworks like Google Quality Score?

I used Perplexity Deep Research to synthesize findings across conversion rate optimization studies, information foraging theory, psychometrics, and advertising research. The prompt was deliberately specific — I wanted evidence, not opinions.

---

## What the research confirmed

Our three dimensions are a solid core. Multiple controlled experiments show that visual, message, and tonal congruence between ads and landing pages materially improve conversion rates. One A/B test by NextAfter replaced only the ad image to match the landing page visual and saw a 47.6% conversion lift — isolating the impact of pure visual continuity.

Message match is the dominant driver. CRO case studies consistently report that headline and offer continuity between ad and landing page produces the largest conversion lifts — one Moz study documented a 212% conversion rate increase after systematically aligning ad and landing page messaging. The research consistently treats message and intent match as the primary post-click conversion driver, with visual and tone acting as multipliers.

Our platform weights are directionally correct. Evidence supports heavier contextual weight for search (where users have explicit intent and evaluate relevance) and higher visual/tone weights for social (where creative salience and vibe drive engagement). Google Quality Score research shows that when ad relevance and landing page experience are both rated "above average," non-brand search conversion rates are up to 750% higher than when both are "below average."

Tone matters, especially on social. Analysis of video ad performance found that emotional alignment between ad tone and content context produces measurable (often double-digit) lifts in intent and conversion. NBCUniversal's internal "Emotional Congruence" engine confirms this at scale.

---

## Where we were wrong — or incomplete

### 1. We were missing Information Scent

This was the biggest finding. Information foraging theory (Pirolli and Card, Xerox PARC) formalizes something called information scent: users follow links and paths whose labels, context, and cues predict they'll reach their goal. When scent weakens — when "links lie" — users abandon, even if the content they need is technically on the page.

Strong information scent speeds navigation by up to 73% and can increase task completion by 40%+.

Here's why this matters for our scoring engine: information scent is a distinct construct from message match. You can have identical headlines between an ad and landing page (strong message match) but still fail on scent if the promised action is buried below the fold, if the page type doesn't match the ad's intent (purchase ad → blog post), or if the user has to think about what to do next.

Our old "Contextual Match" score conflated these two things. A page could score well on contextual alignment because the words matched, while completely failing the user's expectation of what they'd find.

### 2. Our 0–10 scale was creating false precision

Psychometrics research is clear: 5- to 7-point scales offer the best trade-off between reliability, validity, and ease of use. Expanding to 9 or 10 points adds cognitive load without meaningful reliability gains.

More importantly, when AI acts as the rater (as ours does), the primary risks aren't random noise — they're systematic bias, construct drift, and overconfidence. An AI scoring the difference between a 6 and a 7 is producing spurious precision.

Our rubric already defined five descriptive bands (1–2, 3–4, 5–6, 7–8, 9–10). We just weren't enforcing them structurally.

### 3. Our penalty system was losing information

We had a hard cap: any HIGH-severity mismatch in a dimension capped that score at 6. The research supports the *concept* — a single severe congruence failure should strongly bound the attainable score because the overall experience is broken regardless of other strengths.

But hard caps are a blunt instrument. A page with one major mismatch and everything else perfect got the same capped score as a page with one major mismatch and everything else mediocre. Both hit 6. The information about the *degree* of the problem was lost.

Related quality-scoring systems (machine learning loss functions, advertising quality models) consistently use smooth, nonlinear penalties rather than hard cutoffs.

---

## What we changed

### Change 1: Band-first scoring

Instead of asking the AI to assign a number directly, we now force it to:

1. Classify into a band (Exceptional, Good, Moderate, Poor, Very Poor)
2. Write 2–3 sentences of reasoning referencing specific elements from the ad and page
3. Then assign a numeric score within that band's range

This eliminates the "why did I get a 6 instead of a 7" problem. The AI has to commit to a qualitative judgment before picking a number. If the number falls outside the stated band's range, we override with the band midpoint.

The result: more consistent scores and structurally better explanations.
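
Mechanically, the enforcement step is simple. Here's a minimal TypeScript sketch of the idea; the type and function names are illustrative, not our production code:

```typescript
// A minimal sketch of band-first scoring. Names are illustrative.
type Band = "Exceptional" | "Good" | "Moderate" | "Poor" | "Very Poor";

const BAND_RANGES: Record<Band, [number, number]> = {
  Exceptional: [9, 10],
  Good: [7, 8],
  Moderate: [5, 6],
  Poor: [3, 4],
  "Very Poor": [1, 2],
};

interface RaterOutput {
  band: Band;        // qualitative judgment, committed first
  reasoning: string; // 2-3 sentences citing specific ad and page elements
  score: number;     // numeric score picked afterwards
}

// Structural enforcement: if the numeric score falls outside the stated
// band's range, override it with the band midpoint.
function enforceBand({ band, score }: RaterOutput): number {
  const [low, high] = BAND_RANGES[band];
  return score >= low && score <= high ? score : (low + high) / 2;
}

// enforceBand({ band: "Good", reasoning: "...", score: 9 }) -> 7.5
```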

### Change 2: Four dimensions instead of three

We split our old Contextual Match into two distinct scores.

Message Match: does the page's explicit language align with the ad's?

  • Does the landing page headline echo the ad's specific claim?
  • Is the same offer stated on both?
  • Do the CTAs use similar language?
  • Are key terms from the ad visibly present on the page?

Information Scent: does the page fulfill the expectation the ad created?

  • Within the first viewport, can a user confirm the ad's promise is being fulfilled?
  • Is the next step from the ad's promise obviously available?
  • Does the page *type* match the ad's intent stage? (An awareness ad should lead to educational content, not a checkout page. A "buy now" ad should lead to a product page, not a blog.)
  • Do visual and verbal cues reinforce that the user is progressing toward what the ad promised?

These are fundamentally different diagnostics with different fixes. "Your headline doesn't match" is a copywriting problem. "Your page buries the promised action" is an information architecture problem. Agencies need to see both.
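
In the audit output, the split simply means four dimension scores instead of three. Here's a rough sketch of the shape in TypeScript; the field names are simplified for illustration, not our actual API:

```typescript
// Illustrative shape of a four-dimension congruence report.
// Field names are simplified for this post.
interface CongruenceReport {
  visualMatch: number;      // imagery, color, and layout continuity
  messageMatch: number;     // headline, offer, CTA, and key-term alignment
  informationScent: number; // expectation fulfillment: viewport, next step, page type
  toneAlignment: number;    // emotional and stylistic fit between ad and page
  pageSpeedMs?: number;     // measured and displayed alongside, never mixed in
}
```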

### Change 3: Exponential decay penalties

We replaced the hard cap with a smooth exponential curve:

adjustedScore = rawScore × e^(-β × totalSeverity)

Where severity is mapped per mismatch type (HIGH = 0.85, MEDIUM = 0.40, LOW = 0.15) and multiple penalties in the same dimension compound up to a cap of 1.0.

  • A single HIGH mismatch on a raw score of 8 produces ~4.1 (serious, but not identical to everything else)
  • A single MEDIUM mismatch on a raw 8 produces ~5.8 (noticeable but not catastrophic)
  • Two MEDIUM mismatches compound to ~4.2 (correctly reflecting accumulated problems)
  • A HIGH + MEDIUM together pushes the score to ~3.6 (the page is fundamentally broken)

The curve naturally handles severity scaling without arbitrary cutoffs, and the β parameter (currently 0.8) is tunable as we collect calibration data.
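
For concreteness, here's a minimal TypeScript sketch of the penalty curve exactly as described above (names are illustrative):

```typescript
// Severity weight per mismatch type, as described above.
const SEVERITY: Record<"HIGH" | "MEDIUM" | "LOW", number> = {
  HIGH: 0.85,
  MEDIUM: 0.4,
  LOW: 0.15,
};

const BETA = 0.8; // decay rate, tunable as calibration data accumulates

// Smooth exponential penalty: adjusted = raw * e^(-beta * totalSeverity),
// with compounding severity capped at 1.0 per dimension.
function applyPenalties(
  rawScore: number,
  mismatches: Array<"HIGH" | "MEDIUM" | "LOW">
): number {
  const totalSeverity = Math.min(
    1.0,
    mismatches.reduce((sum, m) => sum + SEVERITY[m], 0)
  );
  return rawScore * Math.exp(-BETA * totalSeverity);
}

// applyPenalties(8, ["HIGH"])             -> ~4.1
// applyPenalties(8, ["MEDIUM"])           -> ~5.8
// applyPenalties(8, ["MEDIUM", "MEDIUM"]) -> ~4.2
// applyPenalties(8, ["HIGH", "MEDIUM"])   -> ~3.6 (severity capped at 1.0)
```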

### Change 4: Refined platform weights

Two targeted adjustments based on the research:

Google: Contextual weight increased from 45% to 55%, visual reduced from 35% to 25%. Search intent is overwhelmingly about relevance and message alignment. When the user types a query and clicks an ad, they're evaluating whether they found what they were looking for — not whether the color palette matches.

TikTok: Tone weight increased from 30% to 35%, visual reduced from 35% to 30%. TikTok is vibe-first. A polished corporate ad leading to a page that matches the words but kills the energy will still tank conversions. The tone weight increase catches this.
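
Concretely, the final score is a weighted sum of the dimension scores per platform. In the sketch below, only the Google and TikTok numbers called out above are stated values; the remaining shares are inferred by assuming each row sums to 1.0, and I'm glossing over how the contextual share divides between Message Match and Information Scent:

```typescript
// Illustrative platform weight rows. Only the Google and TikTok
// adjustments above are stated; remainders assume rows sum to 1.0.
const PLATFORM_WEIGHTS: Record<string, Record<string, number>> = {
  google: { contextual: 0.55, visual: 0.25, tone: 0.2 },
  tiktok: { contextual: 0.35, visual: 0.3, tone: 0.35 },
};

// Final congruence score: weighted sum of the dimension scores.
function platformScore(
  platform: keyof typeof PLATFORM_WEIGHTS,
  scores: Record<string, number>
): number {
  return Object.entries(PLATFORM_WEIGHTS[platform]).reduce(
    (total, [dimension, weight]) => total + weight * (scores[dimension] ?? 0),
    0
  );
}

// platformScore("google", { contextual: 8, visual: 6, tone: 7 }) -> 7.3
```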

---

## What we're tracking next

The research surfaced two additional areas we're not scoring yet but plan to:

Page speed and UX friction. Each additional second of load time in the first 5 seconds reduces conversions by roughly 4–7%. Pages loading in 1 second convert nearly 3x higher than those loading in 5+ seconds. We already measure and display page speed — it's shown alongside the congruence scores but not mixed into them. The research confirms this is the right approach: speed is orthogonal to congruence, not a dimension of it.

Trust and credibility signals. Testimonials, social proof, security badges, and professional design consistently correlate with higher conversion. But these function as baseline hygiene rather than congruence drivers — what matters is that trust signals don't *contradict* the ad promise. We'll likely add this as a separate diagnostic score rather than a fifth congruence dimension.

---

## The meta-lesson

The most useful thing about this exercise wasn't any single finding. It was learning that our intuitions were largely right but under-specified.

We had the right dimensions but conflated two distinct constructs. We had the right penalty concept but used a blunt implementation. We had the right scale but weren't enforcing it structurally.

If you're building a scoring system — for anything, not just ads — the research strongly suggests three things:

1. Define your constructs distinctly. If two things require different fixes, they need different scores.
2. Use behavioral anchors, not just numbers. A rubric that says "7–8 = Good" is worse than one that says "7–8 = Clear, intentional connection where most elements reinforce each other."
3. Smooth penalties over hard cutoffs. Nonlinear curves preserve information that caps destroy.

---

*AdAlign scores ad-to-landing-page congruence so agencies can find and fix alignment gaps before they cost conversions. Run a free audit at adalign.io.*

Tags: Scoring Engine, Research, Information Scent, Ad Alignment, Psychometrics
