UX Benchmarking: Measuring Progress Over Time

How to prove your redesign actually worked. A guide to establishing baselines, tracking metrics (SUS), and calculating ROI.

Marc Busch
Updated April 15, 2024
9 min read

Summary

UX benchmarking answers three questions: Where are we now (baseline)? Did we improve (pre/post tracking)? And how do we compare to competitors? Use standardized metrics like SUS with n=30+ participants per segment for stable means. Critical trap: never compare a live site against a prototype; technical friction skews the data. Compare apples to apples.

"The redesign looks great" is not evidence. "SUS improved from 62 to 78" is evidence.

UX benchmarking transforms subjective opinions about design quality into objective measurements you can track over time, compare across competitors, and use to calculate ROI.

The Three Goals of Benchmarking

Every benchmarking study answers one of three questions:

| Goal | Question | Use Case |
|---|---|---|
| Benchmark | "Where are we now?" | Establishing a baseline before changes |
| Track | "Did we get better?" | Measuring pre/post redesign impact |
| Compare | "Are we better than them?" | Competitive analysis |

Goal 1: Benchmark (Baseline)

Before you can measure improvement, you need to know where you started.

When to use:

  • Before a major redesign initiative
  • When taking over a new product
  • At regular intervals (quarterly, annually) for trending

What you get:

  • A quantified starting point
  • Objective evidence of current state
  • Ammunition for securing redesign budget

Goal 2: Track (Pre/Post)

The most powerful use of benchmarking: proving that your work made a measurable difference.

When to use:

  • After a significant redesign ships
  • To validate that fixes actually improved the experience
  • For quarterly/annual progress reporting

What you get:

  • Evidence of improvement (or regression)
  • ROI calculation inputs
  • Credibility for future initiatives

The Design:

[Figure: Pre/post benchmarking design. Three phases in sequence: baseline pre-test (n=30+), redesign ships, post-test (n=30+). A comparison arc connects baseline and post-test: "Did we improve?"]

Goal 3: Compare (Competitive)

How does your experience stack up against alternatives?

When to use:

  • Competitive intelligence gathering
  • Identifying industry best practices
  • Setting realistic improvement targets

What you get:

  • Relative positioning in the market
  • Specific areas where competitors excel
  • Evidence for competitive differentiation strategy

The Design:

[Figure: Competitive benchmarking design. The same users evaluate Your Product, Competitor A, and Competitor B using the same tasks and the same metrics, enabling direct comparison.]

The Study Design

Method: Unmoderated Remote Testing

For benchmarking at scale, unmoderated remote testing is typically the right choice:

| Factor | Moderated | Unmoderated |
|---|---|---|
| Sample size | 5-12 (expensive) | 30-100+ (scalable) |
| Cost per participant | High | Low |
| Depth of insight | Deep qualitative | Quantitative metrics |
| Geographic reach | Limited | Global |
| Scheduling | Complex | Participants self-schedule |

Sample Size: n=30+ Per Segment

Sample size determines how stable your metrics are:

| Sample Size | What You Get | Use Case |
|---|---|---|
| n=5 | Insights, not metrics | Qualitative usability testing |
| n=12 | Rough directional signal | Early-stage evaluation |
| n=30 | Stable mean, narrow confidence interval | Benchmarking a single segment |
| n=50+ | High precision | When small differences matter |

The Math:

With n=30, a typical SUS study has a 95% confidence interval of approximately ±6 points. This means if your measured SUS is 72, the true score is likely between 66 and 78.

With n=12, that interval might be ±10 points—too wide to detect meaningful differences.
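To see where those numbers come from: the half-width of the interval is z × SD / √n. Here is a minimal Python sketch, assuming a sample standard deviation of about 17 points (roughly in line with published SUS data) and a normal approximation:

```python
from math import sqrt
from statistics import NormalDist

def sus_ci_halfwidth(sd: float, n: int, confidence: float = 0.95) -> float:
    """Half-width of the confidence interval for a mean SUS score
    (normal approximation; sd is the sample standard deviation)."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # 1.96 for 95%
    return z * sd / sqrt(n)

# Assuming SD of about 17 points, typical for SUS data:
for n in (12, 30, 50):
    print(f"n={n:>2}: +/-{sus_ci_halfwidth(17, n):.1f} points")
# n=12: +/-9.6 points
# n=30: +/-6.1 points
# n=50: +/-4.7 points
```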

Segmentation

If your product serves distinct user groups, benchmark each separately:

| Segment | Why Separate |
|---|---|
| New vs. returning users | Learnability vs. efficiency |
| Free vs. paid users | Different feature access |
| Mobile vs. desktop | Different interaction patterns |
| Power users vs. casual | Different mental models |

Each segment needs n=30+ for stable metrics. A study with n=30 total across 3 segments (n=10 each) produces unreliable segment-level comparisons.
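A small, hypothetical helper that flags undersized segments before analysis (the function name, threshold, and data below are illustrative, not from a real study):

```python
from collections import Counter

MIN_N = 30  # minimum participants per segment for stable means

def check_segment_sizes(segments: list[str]) -> None:
    """Print each segment's n and flag any below the threshold."""
    for segment, n in Counter(segments).items():
        status = "ok" if n >= MIN_N else f"too small, need {MIN_N - n} more"
        print(f"{segment}: n={n} ({status})")

check_segment_sizes(["mobile"] * 34 + ["desktop"] * 18)
# mobile: n=34 (ok)
# desktop: n=18 (too small, need 12 more)
```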

The Metric: System Usability Scale (SUS)

The System Usability Scale (SUS) is the industry standard for measuring perceived usability. It is fast, reliable, and benchmarkable.

Why SUS?

| Advantage | Explanation |
|---|---|
| Standardized | Same 10 questions everywhere, enabling comparison |
| Benchmarkable | Decades of data establish what scores mean |
| Quick | 10 questions, under 2 minutes to complete |
| Reliable | High internal consistency across contexts |
| Technology-agnostic | Works for websites, apps, hardware, anything |
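Scoring follows Brooke's original rule: odd-numbered (positively worded) items score the response minus 1, even-numbered (negatively worded) items score 5 minus the response, and the raw sum is multiplied by 2.5 to give a 0-100 scale. A minimal sketch:

```python
def sus_score(responses: list[int]) -> float:
    """Score one completed SUS questionnaire.

    responses: ten answers on a 1-5 Likert scale, in question order.
    """
    if len(responses) != 10 or not all(1 <= r <= 5 for r in responses):
        raise ValueError("SUS needs ten responses between 1 and 5")
    odd = sum(r - 1 for r in responses[0::2])   # items 1, 3, 5, 7, 9
    even = sum(5 - r for r in responses[1::2])  # items 2, 4, 6, 8, 10
    return (odd + even) * 2.5  # scale the 0-40 raw sum to 0-100

print(sus_score([4, 2, 4, 1, 5, 2, 4, 2, 4, 3]))  # 77.5
```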

Interpreting SUS Scores

| Score | Grade | Interpretation |
|---|---|---|
| 80+ | A | Excellent—users love it |
| 70-79 | B | Good—above average |
| 68 | C | Average—industry midpoint |
| 50-67 | D | Below average—needs work |
| <50 | F | Poor—significant usability problems |

Complementary Metrics

SUS measures overall perceived usability. For a complete picture, add:

| Metric | What It Measures | When to Add |
|---|---|---|
| Task Success Rate | Can users complete key tasks? | Always |
| Time on Task | How efficiently can they complete tasks? | When speed matters |
| SEQ | Per-task difficulty rating | When task-level insight is needed |
| NPS | Likelihood to recommend | When loyalty/advocacy matters |
| CSAT | Satisfaction with a specific interaction | For transactional experiences |

The Trap: Comparing Apples to Oranges

This is where benchmarking studies go wrong.

The Fidelity Problem

Never compare a live site with a Figma prototype.

| Live Site | Prototype |
|---|---|
| Real load times | Instant transitions |
| Actual data | Placeholder content |
| Full functionality | Partial flows only |
| Real errors and edge cases | Happy path only |
| Authentication, sessions | None |

The Solution: Compare Apples to Apples

| Comparison Type | Valid Approach |
|---|---|
| Pre/Post Redesign | Both must be live, or both must be same-fidelity prototypes |
| Competitor Analysis | All must be live production sites |
| Concept Testing | All concepts at same prototype fidelity |

Other Comparison Traps

| Trap | Problem | Fix |
|---|---|---|
| Different task sets | Cannot compare if tasks differ | Use identical task scenarios |
| Different user segments | Novices vs. experts skews results | Recruit same profile for all conditions |
| Different time periods | Seasonal effects, market changes | Run conditions simultaneously when possible |
| Different devices | Mobile vs. desktop not comparable | Control for device type |

Running a Benchmark Study

Step-by-Step Process

1. Define Success Metrics

Before recruiting, decide exactly what you are measuring:

  • Primary metric (usually SUS)
  • Secondary metrics (task success, time, SEQ)
  • Target score (if tracking improvement)

2. Design Task Scenarios

Create realistic tasks that cover key user journeys:

| Task | Coverage | Success Criterion |
|---|---|---|
| "Find the pricing for the Pro plan" | Discovery, navigation | Correct answer given |
| "Add a new team member to your account" | Core workflow | Task completed |
| "Cancel your subscription" | Support flow | Reached confirmation |

3. Build the Test

Using an unmoderated testing platform:

  • Welcome and consent
  • Screening questions (if needed)
  • Task scenarios with success measures
  • Post-task questions (SEQ for each task)
  • Post-study questionnaire (SUS, open-ended)
  • Thank you and compensation

4. Recruit Participants

  • n=30+ per segment
  • Match your actual user profile
  • Screen out irrelevant populations
  • Consider over-recruiting by 15-20% for dropouts

5. Analyze and Report

| Metric | Report |
|---|---|
| SUS | Mean, 95% CI, comparison to benchmark/target |
| Task Success | Percentage per task, overall rate |
| Time on Task | Median (means are skewed by outliers) |
| SEQ | Mean per task, identify problem tasks |
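As an illustration, here is a sketch of computing these from raw results. The data below is invented for the example; a real benchmark would have n=30+ per segment:

```python
from math import sqrt
from statistics import mean, median, stdev

# Hypothetical results for one task (n=8, far below a real benchmark)
task_success = [True, True, False, True, True, True, False, True]
times_sec = [42, 55, 190, 38, 61, 47, 240, 50]
sus = [72.5, 65.0, 80.0, 77.5, 60.0, 70.0, 82.5, 67.5]

print(f"Task success: {sum(task_success) / len(task_success):.0%}")  # 75%

# Median, because the outliers (190s, 240s) would inflate a mean
print(f"Time on task: median {median(times_sec)}s, mean {mean(times_sec):.0f}s")

# SUS mean with a 95% CI (normal approximation)
m, half = mean(sus), 1.96 * stdev(sus) / sqrt(len(sus))
print(f"SUS: {m:.1f} (95% CI {m - half:.1f} to {m + half:.1f})")
```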

6. Track Over Time

Maintain a benchmark history:

| Date | SUS | Task Success | Notes |
|---|---|---|---|
| 2025-Q1 | 62 | 71% | Baseline |
| 2025-Q3 | 67 | 78% | Post-redesign v1 |
| 2026-Q1 | 74 | 84% | Post-redesign v2 |
| 2026-Q3 | 76 | 86% | Incremental improvements |

Calculating ROI

Benchmarking provides the inputs for calculating research ROI:

The Formula

ROI = (Value of Improvement - Cost of Research) / Cost of Research

Example Calculation

| Factor | Value |
|---|---|
| Baseline conversion rate | 2.0% |
| Post-redesign conversion rate | 2.4% |
| Monthly visitors | 100,000 |
| Average order value | €50 |
| Research + redesign cost | €25,000 |

Monthly revenue lift:

  • Before: 100,000 × 2.0% × €50 = €100,000
  • After: 100,000 × 2.4% × €50 = €120,000
  • Lift: €20,000/month

ROI (first year):

  • Annual lift: €240,000
  • Cost: €25,000
  • ROI: (€240,000 - €25,000) / €25,000 = 860%
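The same arithmetic as a short Python sketch, restating the numbers above:

```python
def research_roi(value_of_improvement: float, cost: float) -> float:
    """ROI = (value of improvement - cost) / cost, as a fraction."""
    return (value_of_improvement - cost) / cost

visitors, aov = 100_000, 50.0                    # monthly visitors, avg order value in EUR
monthly_lift = visitors * (0.024 - 0.020) * aov  # 20,000 EUR
annual_lift = 12 * monthly_lift                  # 240,000 EUR
print(f"ROI: {research_roi(annual_lift, 25_000):.0%}")  # ROI: 860%
```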

What This Means for Practice

Benchmarking transforms UX from opinion to evidence.

  1. Establish baselines before any major initiative—you cannot prove improvement without a starting point
  2. Use n=30+ per segment for stable metrics; n=5 is for insights, not measurement
  3. Standardize on SUS for comparability across time and competitors
  4. Compare apples to apples—never benchmark live sites against prototypes
  5. Track over time to demonstrate cumulative impact
  6. Calculate ROI to secure future investment

The goal is not to produce impressive numbers. It is to produce defensible evidence that your work made a measurable difference.
