
Quantitative Analysis: From Metrics to Significance

Don't just report averages. How to clean data, visualize distributions, and calculate statistical significance.

Marc Busch
Updated February 1, 2024
9 min read

Summary

Effective quantitative UX analysis starts with sanity-checking raw data before calculating means, visualizing distributions with histograms to verify assumptions, and using visual shortcuts like notched boxplots to assess significance. Choosing the right statistical test depends on your study design (between vs. within subjects) and whether your data meets normality assumptions.

Quantitative analysis in UX research is not about drowning stakeholders in numbers. It is about extracting reliable signals from noisy data and knowing when those signals are meaningful.

Do not just report averages. Understand what your data is actually telling you.

The Sanity Check: Before You Calculate Anything

Before you calculate a mean, median, or any statistic, you must examine the raw data. Numbers without context are dangerous.

The Question to Ask

For every data point that looks unusual, ask: Is this signal or noise?

| Observation | Possible Signal | Possible Noise |
| --- | --- | --- |
| Time on Task: 5 seconds | User is an expert | Page failed to load |
| Time on Task: 600 seconds | Task is genuinely hard | User left for coffee |
| Satisfaction: 1/7 | Genuine frustration | Misclick or spite response |
| Task Success: 0% | Design is broken | Technical failure during test |

The Cleaning Protocol

  1. Flag outliers that fall outside expected ranges
  2. Check session notes or recordings for context
  3. Decide: Remove (if noise), keep (if signal), or note (if unclear)
  4. Document every decision for transparency
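Step 1 of the protocol can be sketched in code. A minimal Python sketch (helper names are hypothetical; the article's own analysis code is in R) using Tukey's 1.5 × IQR fence, one common flagging rule, with illustrative numbers:

```python
def iqr_fences(values, k=1.5):
    """Tukey fences: points below Q1 - k*IQR or above Q3 + k*IQR are flagged."""
    s = sorted(values)
    n = len(s)

    def quantile(q):
        # Linear interpolation between order statistics.
        pos = q * (n - 1)
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        return s[lo] + (pos - lo) * (s[hi] - s[lo])

    q1, q3 = quantile(0.25), quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def flag_outliers(values):
    """Return the points to check against session notes (step 2)."""
    lower, upper = iqr_fences(values)
    return [v for v in values if v < lower or v > upper]

times = [32, 41, 45, 38, 52, 47, 5, 600, 44, 50]  # time on task, seconds
print(flag_outliers(times))  # flags the 5 s and 600 s sessions for review
```

Flagged points are candidates for inspection, not automatic deletion: the 600-second session may be a genuinely hard task rather than a coffee break.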

Visualizing Distributions: Why Averages Lie

The average (mean) is the most commonly reported statistic—and often the most misleading. Before calculating any mean, you must understand the shape of your data.

The Histogram First Rule

Always plot a histogram before reporting averages. The shape of your distribution determines what statistics are valid. In a normal (bell-curve) distribution the mean and median are roughly equal, and a standard t-test works fine. In a skewed distribution—common with time-on-task data—the mean and median diverge, and a t-test can give misleading results.
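A histogram does not require a plotting library; even a rough text version reveals the shape. A small Python sketch with illustrative time-on-task data (the bin count and data are arbitrary choices for the demo):

```python
def histogram_counts(values, bins=5):
    """Count values into equal-width bins between min and max."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        counts[min(int((v - lo) / width), bins - 1)] += 1
    return counts

def print_histogram(values, bins=5):
    """Print one '#' bar per bin."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    for i, c in enumerate(histogram_counts(values, bins)):
        print(f"{lo + i * width:6.1f}-{lo + (i + 1) * width:6.1f} | {'#' * c}")

# Right-skewed time on task: most users finish fast, a few take much longer.
print_histogram([20, 25, 28, 30, 32, 35, 38, 40, 55, 70, 95, 140])
```

A long right tail like this is the cue to report the median rather than the mean, and to reach for a non-parametric test.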

What Different Shapes Mean

| Distribution Shape | Common UX Metrics | Implication |
| --- | --- | --- |
| Normal (Bell) | Satisfaction ratings (sometimes) | Standard statistics apply |
| Right-skewed | Time on Task, Error counts | Use median, consider non-parametric tests |
| Bimodal | Task success with distinct user groups | You may have two populations; segment first |
| Uniform | Poorly designed rating scale | Scale may not be capturing real differences |

Mean vs. Median: When to Use Each

| Statistic | Use When | Example |
| --- | --- | --- |
| Mean | Data is normally distributed | "Average satisfaction was 5.2/7" |
| Median | Data is skewed or has outliers | "Median Time on Task was 45 seconds" |
| Both | You want to show the skew | "Mean was 72s, median was 45s (right-skewed)" |
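The "report both" case can be reproduced with Python's standard library. Illustrative numbers: one very slow session pulls the mean far above the median.

```python
from statistics import mean, median

# Illustrative time-on-task data (seconds) with one very slow session.
times = [30, 35, 40, 42, 44, 46, 50, 55, 60, 318]

print(f"Mean was {mean(times):.0f}s, median was {median(times):.0f}s (right-skewed)")
```

The single 318-second outlier moves the mean to 72s while the median stays at 45s, which is exactly why the median is the safer headline number for skewed data.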

The Notched Boxplot Trick

Statistical significance testing can be complex. But there is a visual shortcut that gives you a quick, intuitive answer: the notched boxplot.

How It Works

A notched boxplot adds "notches" around the median. These notches represent the approximate 95% confidence interval for the median.

[Figure: Notched boxplot comparison. Two side-by-side notched boxplots for Version A and Version B with overlapping notches; the notch is where the box narrows around the median. Notches overlap: no significant difference. Notches don't overlap: significant difference (~95% CI). This is a visual heuristic; confirm with a statistical test.]

Reading a Boxplot

| Element | What It Shows |
| --- | --- |
| Center line | Median (50th percentile) |
| Box edges | 25th and 75th percentiles (IQR) |
| Notches | ~95% confidence interval for median |
| Whiskers | Range of typical data (1.5 × IQR) |
| Dots beyond whiskers | Outliers |
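The notch half-width follows a simple rule of thumb: roughly median ± 1.58 × IQR / √n (1.58 is the constant R's boxplot.stats uses; some tools use 1.57). A Python sketch with illustrative SUS-like samples:

```python
import math

def quantile(values, q):
    """Linear-interpolation quantile of a sample."""
    s = sorted(values)
    pos = q * (len(s) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (pos - lo) * (s[hi] - s[lo])

def notch_interval(values, c=1.58):
    """Approximate 95% CI for the median: median +/- c * IQR / sqrt(n)."""
    med = quantile(values, 0.5)
    iqr = quantile(values, 0.75) - quantile(values, 0.25)
    half = c * iqr / math.sqrt(len(values))
    return med - half, med + half

version_a = [60, 62, 65, 68, 70, 70, 72, 75, 78, 80]
version_b = [75, 78, 80, 82, 83, 85, 87, 88, 90, 92]
print(notch_interval(version_a))
print(notch_interval(version_b))
# Version A's upper notch sits below Version B's lower notch:
# the notches do not overlap, hinting at a significant difference.
```

With real data the small-sample approximation is rough, which is why the figure above calls this a heuristic to confirm with a formal test.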

When to Use Notched Boxplots

  • Comparing two versions (A vs. B)
  • Comparing user segments (novice vs. expert)
  • Quick stakeholder communication (visual is more intuitive than p-values)
  • Exploratory analysis before running formal tests

Practical Walkthrough: A Comparative Analysis

Theory is useful. Seeing it applied to a real decision is better. Let us walk through an example from start to finish.

The Setup

Imagine your team has a live e-commerce site with an existing checkout flow (Version A). A UX designer has created a new, streamlined prototype (Version B). The business question is clear: "Is the new design a significant improvement that justifies the development cost?"

To answer this with confidence, you run a comparative study using a within-subjects design. Each participant uses both versions and rates them using the System Usability Scale (SUS). You recruit 30 participants and counterbalance the order (half start with A, half start with B) to control for learning effects.
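As a reminder of how raw questionnaire answers become a 0-100 SUS score, Brooke's standard scoring recipe is: odd-numbered items contribute (response − 1), even-numbered items contribute (5 − response), and the sum is multiplied by 2.5. A Python sketch with a hypothetical answer sheet:

```python
def sus_score(responses):
    """Score one participant's 10 SUS items (each rated 1-5).
    Odd items are positively worded, even items negatively worded."""
    assert len(responses) == 10
    total = sum((r - 1) if i % 2 == 1 else (5 - r)
                for i, r in enumerate(responses, start=1))
    return total * 2.5

print(sus_score([4, 2, 5, 1, 4, 2, 5, 1, 4, 2]))  # -> 85.0
```

Each participant in the study contributes two such scores, one per version, which is what makes the data paired.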

The Data

After collecting the data, you calculate the descriptive statistics:

| Version | Median SUS Score | Interpretation (Bangor et al. Benchmark) |
| --- | --- | --- |
| Version A (Current) | 70.0 | "Good" - acceptable but has room for improvement |
| Version B (New) | 82.5 | "Excellent" - users find it highly usable |

The medians tell a promising story. Version B scores 12.5 points higher. But is this difference real, or could it be random noise from your sample?

Selecting the Right Test

Your instinct might be to run a paired samples t-test. After all, you have paired data (each participant rated both versions). The t-test is the standard tool for this scenario.

But you need to check an assumption first. The paired t-test assumes the differences between scores are normally distributed. You run a Shapiro-Wilk test on the difference scores, and the result comes back with p < .05. This tells you the normality assumption is violated.

What do you do? You reach for the non-parametric equivalent: the Wilcoxon Signed-Rank Test. This test does not assume normality. It compares the ranks of differences rather than the raw values, making it robust to the distributional issues in your data.
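To make "compares the ranks of differences" concrete, here is a pure-Python sketch of the test statistic W (the smaller of the positive and negative signed rank sums). In practice you would use R's wilcox.test or scipy.stats.wilcoxon, which also compute the p-value; the data below is illustrative.

```python
def wilcoxon_w(x, y):
    """Wilcoxon signed-rank statistic W for paired samples.
    Zero differences are dropped; tied absolute differences get
    the average of the ranks they span."""
    diffs = [b - a for a, b in zip(x, y) if b != a]
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    w_plus = sum(r for d, r in zip(diffs, ranks) if d > 0)
    w_minus = sum(r for d, r in zip(diffs, ranks) if d < 0)
    return min(w_plus, w_minus)

a = [70, 65, 55, 80, 60]
b = [85, 78, 68, 92, 71]
print(wilcoxon_w(a, b))  # every difference is positive, so W = 0
```

When every participant improves, the negative rank sum is zero and W hits its minimum, which is what drives the tiny p-value in the next step.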

The Significance

You run the Wilcoxon Signed-Rank Test. The output: p < 0.000001.

This p-value is far below the conventional threshold of 0.05. You can confidently reject the null hypothesis (that there is no difference). The difference you observed is statistically significant. It is extremely unlikely to have occurred by chance.

But here is where many researchers stop, and where you should keep going.

The Magnitude: Effect Size

A p-value tells you whether an effect is real. It does not tell you how big that effect is. A tiny, practically meaningless difference can be statistically significant if your sample is large enough. Conversely, a meaningful difference might not reach significance if your sample is small.

This is why you must report Effect Size. For comparing two means (or medians), the standard measure is Cohen's d.

Cohen's d expresses the difference between groups in terms of standard deviations. The conventional benchmarks are:

| Cohen's d | Interpretation |
| --- | --- |
| d ≈ 0.2 | Small effect |
| d ≈ 0.5 | Medium effect |
| d ≈ 0.8 | Large effect |

For your checkout study, the calculated effect size is d = 1.2. This is a large effect by any standard. The new design does not just beat the old one by a statistically detectable margin. It beats it by a substantial, practically meaningful amount.
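For a paired design, Cohen's d is computed on the difference scores: the mean difference divided by the standard deviation of the differences (this is what effectsize::cohens_d(..., paired = TRUE) reports; it is sometimes written d_z). A Python sketch with illustrative, noisier data than the study's:

```python
from statistics import mean, stdev

def cohens_d_paired(before, after):
    """Paired Cohen's d: mean(differences) / sd(differences)."""
    diffs = [b - a for a, b in zip(before, after)]
    return mean(diffs) / stdev(diffs)

# Hypothetical paired SUS scores, one pair per participant.
sus_a = [60, 65, 55, 80, 60, 75, 70, 48, 75, 68]
sus_b = [85, 67, 73, 77, 72, 95, 75, 70, 83, 79]
print(round(cohens_d_paired(sus_a, sus_b), 2))  # well above 0.8: a large effect
```

Note the division by the spread of the differences: the same 12-point mean improvement yields a smaller d when participants disagree more.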

The Report: Communicating to Stakeholders

Finally, you translate your analysis into language that drives decisions. Stakeholders do not need to understand Wilcoxon tests or Cohen's d. They need to understand what the numbers mean for the business.

Here is how you might frame the recommendation:

"We should prioritize development resources to build and ship the new checkout design (Version B). Our usability study with 30 representative customers shows that the new design provides a measurably superior user experience. The improvement is both statistically significant (p < .001) and practically large (effect size d = 1.2), lifting our checkout flow's usability score from 'Good' to 'Excellent' on industry benchmarks. Given the direct relationship between checkout usability and conversion rates, this investment is a low-risk, high-reward opportunity."

Notice what this does: it states the recommendation, summarizes the evidence, translates the statistics into business terms, and connects the finding to outcomes the stakeholder cares about.

Choosing the Right Statistical Test

The right test depends on two factors: your study design and your data distribution.

Decision Framework

[Figure: Statistical test decision framework. First decide whether you are comparing 2 groups (between-subjects, different people) or 2 conditions (within-subjects, same people), then check whether the data is normal or skewed. This leads to four tests: Independent t-test (parametric) or Mann-Whitney U (non-parametric) for between-subjects, and Paired t-test (parametric) or Wilcoxon Signed-Rank (non-parametric) for within-subjects.]

Test Selection Table

| Study Design | Data Distribution | Recommended Test |
| --- | --- | --- |
| Between-subjects (2 groups) | Normal | Independent samples t-test |
| Between-subjects (2 groups) | Skewed/Non-normal | Mann-Whitney U test |
| Within-subjects (2 conditions) | Normal | Paired samples t-test |
| Within-subjects (2 conditions) | Skewed/Non-normal | Wilcoxon Signed-Rank test |
| Between-subjects (3+ groups) | Normal | One-way ANOVA |
| Between-subjects (3+ groups) | Skewed/Non-normal | Kruskal-Wallis test |
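The selection table maps cleanly onto a small lookup function, which can double as documentation in an analysis script (a sketch; the function name and argument names are hypothetical):

```python
def choose_test(design, n_groups, normal):
    """Pick the test from the selection table above.
    design: "between" or "within".
    normal: did the data pass a normality check (e.g. Shapiro-Wilk)?"""
    if n_groups == 2:
        if design == "between":
            return "Independent samples t-test" if normal else "Mann-Whitney U test"
        if design == "within":
            return "Paired samples t-test" if normal else "Wilcoxon Signed-Rank test"
    if design == "between" and n_groups >= 3:
        return "One-way ANOVA" if normal else "Kruskal-Wallis test"
    raise ValueError("combination not covered by the table")

print(choose_test("within", 2, normal=False))  # -> Wilcoxon Signed-Rank test
```

This is the same path the checkout study followed: within-subjects design plus a failed normality check leads to the Wilcoxon Signed-Rank test.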

Interpreting Results

| Result | Meaning | What to Report |
| --- | --- | --- |
| p < 0.05 | Statistically significant | "The difference was statistically significant (p = 0.023)" |
| p ≥ 0.05 | Not statistically significant | "No significant difference was detected (p = 0.34)" |
| Effect size (Cohen's d) | Practical significance | "A large effect (d = 0.8)" even if not significant with small n |

The Reporting Checklist

When presenting quantitative findings, include:

  1. Sample size — "n = 24 participants per condition"
  2. Central tendency — Mean and/or median as appropriate
  3. Spread — Standard deviation or interquartile range
  4. Visualization — Histogram, boxplot, or confidence interval plot
  5. Statistical test — Which test and why
  6. Significance — p-value and effect size
  7. Practical interpretation — What this means for the product

Technical Reference: R Code for Analysis

If you want to run the analysis described in the Practical Walkthrough above, you need a tool that handles statistics reliably. Spreadsheets are convenient, but errors hide in cells and formulas cannot be easily reviewed or shared. R is free, open-source, and produces reproducible scripts that document exactly what you did.

The following code runs the Wilcoxon Signed-Rank Test and calculates Cohen's d for the checkout flow comparison.

# R Analysis for a Comparative Usability Study

# --- Load Necessary Libraries ---
library(tidyverse)
library(effectsize)

# --- Create The Dataset ---
# 30 participants, each rating both Version A and Version B
study_data <- data.frame(
  participant_id = 1:30,
  sus_a = c(72.5, 65.0, 55.0, 80.0, 60.0, 85.0, 70.0, 47.5, 75.0, 67.5,
            82.5, 70.0, 57.5, 72.5, 62.5, 77.5, 70.0, 65.0, 75.0, 85.0,
            50.0, 75.0, 80.0, 62.5, 72.5, 67.5, 70.0, 80.0, 55.0, 75.0),
  sus_b = c(85.0, 77.5, 67.5, 92.5, 70.0, 95.0, 82.5, 60.0, 87.5, 80.0,
            90.0, 82.5, 72.5, 85.0, 77.5, 90.0, 82.5, 75.0, 85.0, 97.5,
            65.0, 87.5, 92.5, 75.0, 85.0, 80.0, 82.5, 90.0, 70.0, 87.5)
)

# --- Run the Wilcoxon Signed-Rank Test ---
# Used because the normality assumption was violated
wilcox.test(study_data$sus_b, study_data$sus_a, paired = TRUE, exact = FALSE)

# --- Calculate Effect Size (Cohen's d) ---
# The p-value tells you the difference is real; effect size tells you how big
cohens_d(study_data$sus_b, study_data$sus_a, paired = TRUE)

Running this code produces the p-value (p < 0.000001) and effect size (d = 1.2) reported in the walkthrough. You can replace the sus_a and sus_b vectors with your own data to analyze your studies.

What This Means for Practice

Quantitative analysis is not about proving you are right. It is about honestly assessing what your data can and cannot tell you.

  1. Sanity-check first — Examine raw data before calculating anything
  2. Visualize always — Plot distributions before choosing statistics
  3. Match test to design — Between vs. within, normal vs. skewed
  4. Report honestly — Include effect sizes, not just p-values
  5. Interpret practically — Statistical significance is not the finish line

The goal is not to produce impressive numbers. It is to reduce uncertainty about whether your design changes actually matter.


Quantitative Analysis: From Metrics to Significance | Busch Labs