Summary
Effective quantitative UX analysis starts with sanity-checking raw data before calculating means, visualizing distributions with histograms to verify assumptions, and using visual shortcuts like notched boxplots to assess significance. Choosing the right statistical test depends on your study design (between vs. within subjects) and whether your data meets normality assumptions.
Quantitative analysis in UX research is not about drowning stakeholders in numbers. It is about extracting reliable signals from noisy data and knowing when those signals are meaningful.
Do not just report averages. Understand what your data is actually telling you.
The Sanity Check: Before You Calculate Anything
Before you calculate a mean, median, or any statistic, you must examine the raw data. Numbers without context are dangerous.
The Question to Ask
For every data point that looks unusual, ask: Is this signal or noise?
| Observation | Possible Signal | Possible Noise |
|---|---|---|
| Time on Task: 5 seconds | User is an expert | Page failed to load |
| Time on Task: 600 seconds | Task is genuinely hard | User left for coffee |
| Satisfaction: 1/7 | Genuine frustration | Misclick or spite response |
| Task Success: 0% | Design is broken | Technical failure during test |
The Cleaning Protocol
- Flag outliers that fall outside expected ranges
- Check session notes or recordings for context
- Decide: Remove (if noise), keep (if signal), or note (if unclear)
- Document every decision for transparency
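The protocol above can be sketched as a simple screening pass. This is a minimal illustration, not a standard library routine: the `flag_outliers` helper and the 8-to-300-second bounds are hypothetical choices you would tune to your own task.

```python
# Sketch of the cleaning protocol: flag values outside an expected range,
# then record a documented decision for each flagged point.

def flag_outliers(times, low=8.0, high=300.0):
    """Return (index, value) pairs falling outside the expected range."""
    return [(i, t) for i, t in enumerate(times) if t < low or t > high]

def log_decision(index, value, decision, reason):
    """Document every decision for transparency."""
    return {"index": index, "value": value, "decision": decision, "reason": reason}

times = [42.0, 5.0, 51.3, 600.0, 48.9]   # seconds on task
flags = flag_outliers(times)             # the 5s and 600s sessions get flagged

audit_log = [
    log_decision(1, 5.0, "remove", "page failed to load (session notes)"),
    log_decision(3, 600.0, "keep", "task genuinely hard per recording"),
]
```

The audit log is the point: six months later, anyone can see which points were dropped and why.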
Visualizing Distributions: Why Averages Lie
The average (mean) is the most commonly reported statistic—and often the most misleading. Before calculating any mean, you must understand the shape of your data.
The Histogram First Rule
Always plot a histogram before reporting averages. The shape of your distribution determines what statistics are valid. In a normal (bell-curve) distribution the mean and median are roughly equal, and a standard t-test works fine. In a skewed distribution—common with time-on-task data—the mean and median diverge, and a t-test can give misleading results.
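The mean-median divergence in skewed data is easy to demonstrate. This sketch draws right-skewed "time on task" values from a lognormal distribution (a common model for such data; the parameters here are arbitrary) and compares the two statistics.

```python
import numpy as np

rng = np.random.default_rng(seed=7)

# Simulated right-skewed time-on-task data (lognormal is a common model).
times = rng.lognormal(mean=3.8, sigma=0.6, size=500)

mean_time = times.mean()
median_time = np.median(times)

# In a right-skewed distribution, the long tail of slow completions
# pulls the mean above the median.
print(f"mean = {mean_time:.1f}s, median = {median_time:.1f}s")
```

If you reported only the mean here, you would overstate how long a typical user takes.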
What Different Shapes Mean
| Distribution Shape | Common UX Metrics | Implication |
|---|---|---|
| Normal (Bell) | Satisfaction ratings (sometimes) | Standard statistics apply |
| Right-skewed | Time on Task, Error counts | Use median, consider non-parametric tests |
| Bimodal | Task success with distinct user groups | You may have two populations; segment first |
| Uniform | Poorly designed rating scale | Scale may not be capturing real differences |
Mean vs. Median: When to Use Each
| Statistic | Use When | Example |
|---|---|---|
| Mean | Data is normally distributed | "Average satisfaction was 5.2/7" |
| Median | Data is skewed or has outliers | "Median Time on Task was 45 seconds" |
| Both | You want to show the skew | "Mean was 72s, median was 45s (right-skewed)" |
The Notched Boxplot Trick
Statistical significance testing can be complex. But there is a visual shortcut that gives you a quick, intuitive answer: the notched boxplot.
How It Works
A notched boxplot adds "notches" around the median. These notches represent the approximate 95% confidence interval for the median. The practical reading rule: if the notches of two boxes do not overlap, their medians likely differ at roughly the 95% confidence level; if they overlap, you cannot conclude a difference from the plot alone.
Reading a Boxplot
| Element | What It Shows |
|---|---|
| Center line | Median (50th percentile) |
| Box edges | 25th and 75th percentiles (IQR) |
| Notches | ~95% confidence interval for median |
| Whiskers | Range of typical data (1.5 × IQR) |
| Dots beyond whiskers | Outliers |
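The notch interval itself is simple to compute. A widely used formula (the one behind R's `boxplot` and matplotlib's `notch=True`) is median ± 1.57 × IQR / √n. This sketch applies it to two hypothetical samples and checks whether the notches overlap; the data values are made up.

```python
import numpy as np

def notch_interval(data):
    """~95% CI for the median: median +/- 1.57 * IQR / sqrt(n)."""
    data = np.asarray(data)
    q1, med, q3 = np.percentile(data, [25, 50, 75])
    half_width = 1.57 * (q3 - q1) / np.sqrt(len(data))
    return med - half_width, med + half_width

version_a = [40, 44, 45, 47, 48, 50, 52, 55, 58, 60]   # hypothetical seconds
version_b = [28, 30, 31, 33, 34, 35, 36, 38, 40, 42]

lo_a, hi_a = notch_interval(version_a)
lo_b, hi_b = notch_interval(version_b)

# Non-overlapping notches suggest the medians differ (~95% confidence).
overlap = not (hi_b < lo_a or hi_a < lo_b)
print(f"A: [{lo_a:.1f}, {hi_a:.1f}]  B: [{lo_b:.1f}, {hi_b:.1f}]  overlap={overlap}")
```

With these samples the notches are clearly separated, which is the visual equivalent of a significant median difference.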
When to Use Notched Boxplots
- Comparing two versions (A vs. B)
- Comparing user segments (novice vs. expert)
- Quick stakeholder communication (visual is more intuitive than p-values)
- Exploratory analysis before running formal tests
Practical Walkthrough: A Comparative Analysis
Theory is useful. Seeing it applied to a real decision is better. Let us walk through an example from start to finish.
The Setup
Imagine your team has a live e-commerce site with an existing checkout flow (Version A). A UX designer has created a new, streamlined prototype (Version B). The business question is clear: "Is the new design a significant improvement that justifies the development cost?"
To answer this with confidence, you run a comparative usability study using a within-subjects design. Each participant uses both versions and rates them using the System Usability Scale (SUS). You recruit 30 participants and counterbalance the order (half start with A, half start with B) to control for learning effects.
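Counterbalancing can be as simple as alternating the starting version across the recruitment order. A minimal sketch (real studies often randomize within blocks instead of strictly alternating):

```python
# Minimal counterbalancing sketch: alternate which version each
# participant sees first, so order effects cancel across the sample.

def assign_orders(n_participants):
    orders = []
    for i in range(n_participants):
        first = "A" if i % 2 == 0 else "B"
        second = "B" if first == "A" else "A"
        orders.append((first, second))
    return orders

orders = assign_orders(30)
starts_with_a = sum(1 for first, _ in orders if first == "A")
print(starts_with_a, len(orders) - starts_with_a)  # half start with each version
```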
The Data
After collecting the data, you calculate the descriptive statistics:
| Version | Median SUS Score | Interpretation (Bangor Adjective Scale) |
|---|---|---|
| Version A (Current) | 70.0 | "Good" - acceptable but has room for improvement |
| Version B (New) | 82.5 | "Excellent" - users find it highly usable |
The medians tell a promising story. Version B scores 12.5 points higher. But is this difference real, or could it be random noise from your sample?
Selecting the Right Test
Your instinct might be to run a paired samples t-test. After all, you have paired data (each participant rated both versions). The t-test is the standard tool for this scenario.
But you need to check an assumption first. The paired t-test assumes the differences between scores are normally distributed. You run a Shapiro-Wilk test on the difference scores, and the result comes back with p < .05. This tells you the normality assumption is violated.
What do you do? You reach for the non-parametric equivalent: the Wilcoxon Signed-Rank Test. This test does not assume normality. It compares the ranks of differences rather than the raw values, making it robust to the distributional issues in your data.
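In Python with SciPy, this decision can be scripted directly: test the difference scores for normality, then pick the paired t-test or the Wilcoxon accordingly. A sketch with made-up paired SUS scores (not the study data; the skewed differences are constructed to trip the normality check):

```python
from scipy import stats

# Hypothetical paired SUS scores with right-skewed differences
sus_a = [50.0, 52.5, 55.0, 57.5, 60.0, 62.5, 65.0, 67.5, 70.0, 45.0]
sus_b = [52.5, 55.0, 60.0, 60.0, 65.0, 65.0, 90.0, 97.5, 72.5, 50.0]

diffs = [b - a for a, b in zip(sus_a, sus_b)]

# Step 1: check normality of the difference scores (Shapiro-Wilk).
_, shapiro_p = stats.shapiro(diffs)

# Step 2: choose the test based on the result.
if shapiro_p < 0.05:
    stat, p = stats.wilcoxon(sus_b, sus_a)    # non-parametric
    test_used = "Wilcoxon signed-rank"
else:
    stat, p = stats.ttest_rel(sus_b, sus_a)   # parametric
    test_used = "paired t-test"

print(test_used, p)
```

The branch matters: running the t-test on these skewed differences would quietly violate its assumptions, while the rank-based test stays valid.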
The Significance
You run the Wilcoxon Signed-Rank Test. The output: p < 0.000001.
This p-value is far below the conventional threshold of 0.05. You can confidently reject the null hypothesis (that there is no difference). The difference you observed is statistically significant. It is extremely unlikely to have occurred by chance.
But here is where many researchers stop, and where you should keep going.
The Magnitude: Effect Size
A p-value tells you whether an effect is real. It does not tell you how big that effect is. A tiny, practically meaningless difference can be statistically significant if your sample is large enough. Conversely, a meaningful difference might not reach significance if your sample is small.
This is why you must report Effect Size. For comparing two means (or medians), the most widely reported measure is Cohen's d (rank-based alternatives exist, but d is the one stakeholders tend to recognize).
Cohen's d expresses the difference between groups in terms of standard deviations. The conventional benchmarks are:
| Cohen's d | Interpretation |
|---|---|
| d ≈ 0.2 | Small effect |
| d ≈ 0.5 | Medium effect |
| d ≈ 0.8 | Large effect |
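For paired data, one common form of Cohen's d divides the mean of the difference scores by their standard deviation. A sketch with made-up numbers (other variants exist, such as standardizing by the average of the two group SDs):

```python
import math

def paired_cohens_d(before, after):
    """Cohen's d for paired samples: mean(diff) / sd(diff)."""
    diffs = [b - a for a, b in zip(before, after)]
    n = len(diffs)
    mean_d = sum(diffs) / n
    var_d = sum((x - mean_d) ** 2 for x in diffs) / (n - 1)  # sample variance
    return mean_d / math.sqrt(var_d)

# Hypothetical paired scores (not the study data)
version_a = [60, 65, 55, 70, 62, 68, 58, 64]
version_b = [68, 70, 66, 75, 64, 80, 63, 70]
d = paired_cohens_d(version_a, version_b)
print(f"d = {d:.2f}")
```

Note that the paired form standardizes by the variability of the differences, so very consistent improvements can produce a larger d than the between-group formula would.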
For your checkout study, the calculated effect size is d = 1.2. This is a large effect by any standard. The new design does not just beat the old one by a statistically detectable margin. It beats it by a substantial, practically meaningful amount.
The Report: Communicating to Stakeholders
Finally, you translate your analysis into language that drives decisions. Stakeholders do not need to understand Wilcoxon tests or Cohen's d. They need to understand what the numbers mean for the business.
Here is how you might frame the recommendation:
"We should prioritize development resources to build and ship the new checkout design (Version B). Our usability study with 30 representative customers shows that the new design provides a measurably superior user experience. The improvement is both statistically significant (p < .001) and practically large (effect size d = 1.2), lifting our checkout flow's usability score from 'Good' to 'Excellent' on industry benchmarks. Given the direct relationship between checkout usability and conversion rates, this investment is a low-risk, high-reward opportunity."
Notice what this does: it states the recommendation, summarizes the evidence, translates the statistics into business terms, and connects the finding to outcomes the stakeholder cares about.
Choosing the Right Statistical Test
The right test depends on two factors: your study design and your data distribution.
Test Selection Table
| Study Design | Data Distribution | Recommended Test |
|---|---|---|
| Between-subjects (2 groups) | Normal | Independent samples t-test |
| Between-subjects (2 groups) | Skewed/Non-normal | Mann-Whitney U test |
| Within-subjects (2 conditions) | Normal | Paired samples t-test |
| Within-subjects (2 conditions) | Skewed/Non-normal | Wilcoxon Signed-Rank test |
| Between-subjects (3+ groups) | Normal | One-way ANOVA |
| Between-subjects (3+ groups) | Skewed/Non-normal | Kruskal-Wallis test |
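The table translates directly into a small lookup function. A sketch; the string keys are this document's categories, not any library's API:

```python
# Test-selection lookup mirroring the table above.
TEST_TABLE = {
    ("between", 2, "normal"): "Independent samples t-test",
    ("between", 2, "non-normal"): "Mann-Whitney U test",
    ("within", 2, "normal"): "Paired samples t-test",
    ("within", 2, "non-normal"): "Wilcoxon Signed-Rank test",
    ("between", 3, "normal"): "One-way ANOVA",
    ("between", 3, "non-normal"): "Kruskal-Wallis test",
}

def choose_test(design, n_groups, distribution):
    """design: 'between' or 'within'; n_groups: 2, or any count >= 3."""
    key = (design, min(n_groups, 3), distribution)
    return TEST_TABLE[key]

print(choose_test("within", 2, "non-normal"))   # Wilcoxon Signed-Rank test
```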
Interpreting Results
| Result | Meaning | What to Report |
|---|---|---|
| p < 0.05 | Statistically significant | "The difference was statistically significant (p = 0.023)" |
| p ≥ 0.05 | Not statistically significant | "No significant difference was detected (p = 0.34)" |
| Large effect size (e.g., d = 0.8) | Practically meaningful difference | Report the effect size even when p is non-significant with a small n |
The Reporting Checklist
When presenting quantitative findings, include:
- Sample size — "n = 24 participants per condition"
- Central tendency — Mean and/or median as appropriate
- Spread — Standard deviation or interquartile range
- Visualization — Histogram, boxplot, or confidence interval plot
- Statistical test — Which test and why
- Significance — p-value and effect size
- Practical interpretation — What this means for the product
Technical Reference: R Code for Analysis
If you want to run the analysis described in the Practical Walkthrough above, you need a tool that handles statistics reliably. Spreadsheets are convenient, but errors hide in cells and formulas cannot be easily reviewed or shared. R is free, open-source, and produces reproducible scripts that document exactly what you did.
The following code runs the Wilcoxon Signed-Rank Test and calculates Cohen's d for the checkout flow comparison.
```r
# R Analysis for a Comparative Usability Study

# --- Load Necessary Libraries ---
library(tidyverse)   # general data handling
library(effectsize)  # provides cohens_d()

# --- Create the Dataset ---
# 30 participants, each rating both Version A and Version B
study_data <- data.frame(
  participant_id = 1:30,
  sus_a = c(72.5, 65.0, 55.0, 80.0, 60.0, 85.0, 70.0, 47.5, 75.0, 67.5,
            82.5, 70.0, 57.5, 72.5, 62.5, 77.5, 70.0, 65.0, 75.0, 85.0,
            50.0, 75.0, 80.0, 62.5, 72.5, 67.5, 70.0, 80.0, 55.0, 75.0),
  sus_b = c(85.0, 77.5, 67.5, 92.5, 70.0, 95.0, 82.5, 60.0, 87.5, 80.0,
            90.0, 82.5, 72.5, 85.0, 77.5, 90.0, 82.5, 75.0, 85.0, 97.5,
            65.0, 87.5, 92.5, 75.0, 85.0, 80.0, 82.5, 90.0, 70.0, 87.5)
)

# --- Run the Wilcoxon Signed-Rank Test ---
# Used because the normality assumption was violated
wilcox.test(study_data$sus_b, study_data$sus_a, paired = TRUE, exact = FALSE)

# --- Calculate Effect Size (Cohen's d) ---
# The p-value tells you the difference is real; effect size tells you how big
cohens_d(study_data$sus_b, study_data$sus_a, paired = TRUE)
```
Running this code prints the Wilcoxon test result and the paired Cohen's d. Replace the sus_a and sus_b vectors with your own data to analyze your studies.
What This Means for Practice
Quantitative analysis is not about proving you are right. It is about honestly assessing what your data can and cannot tell you.
- Sanity-check first — Examine raw data before calculating anything
- Visualize always — Plot distributions before choosing statistics
- Match test to design — Between vs. within, normal vs. skewed
- Report honestly — Include effect sizes, not just p-values
- Interpret practically — Statistical significance is not the finish line
The goal is not to produce impressive numbers. It is to reduce uncertainty about whether your design changes actually matter.