Summary
Between-subjects designs compare different groups, offering clean comparisons but requiring larger samples. Within-subjects designs have each participant experience all conditions, providing more statistical power but risking order effects. Counterbalancing mitigates these effects by varying condition order. The choice depends on your research questions, available participants, and practical constraints.
When comparing two or more versions of a design, or testing the same design with different user segments, you must decide how to structure the study. Who sees what? In what order?
This choice is not administrative. It directly shapes the conclusions you can draw, the number of participants you need, and the risks you must manage.
The Core Question
Imagine you want to compare two checkout flows: the current design (A) versus a proposed redesign (B). You have two fundamental options:
- Between-subjects: Different people test each version
- Within-subjects: The same people test both versions
Each approach has distinct trade-offs.
Between-Subjects Design (Independent Measures)
Definition: Different participants test different versions. Group A sees Prototype X; Group B sees Prototype Y. No participant ever sees both.
The Trade-Off
| Aspect | Assessment |
|---|---|
| Pro | No "learning effect"—seeing A does not teach you how to use B. Clean, uncontaminated data. |
| Con | Requires more participants (n=30+ per group) to average out individual differences between groups. |
Why It Works
No order effects: Participants cannot be influenced by having seen the other version first.
Clean comparisons: Reactions to each version are independent; there is no contamination between conditions.
Simpler sessions: Each participant has one task set, which can reduce session length and fatigue.
The Cost
Requires more participants: Because you are comparing different people, you need enough in each group to account for individual differences. Typically, you need roughly twice as many participants as a within-subjects design to achieve the same statistical power [1].
Individual differences become noise: Some observed differences might reflect the people in each group rather than the designs themselves.
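The power difference can be made concrete with a small simulation. The sketch below (all parameter values are invented for illustration) models each participant as having a personal baseline plus measurement noise. In a within-subjects comparison the baseline appears in both of a person's scores and cancels in the difference; in a between-subjects comparison it does not, so it inflates the spread you must average away with a larger sample:

```python
import random
import statistics

def simulate(n=30, true_effect=2.0, individual_sd=10.0, noise_sd=2.0, seed=0):
    """Illustrate why individual differences are noise in a between-subjects
    comparison but cancel out in a within-subjects one.
    All numbers here are made up for illustration."""
    rng = random.Random(seed)  # fixed seed so the demo is reproducible

    # Within-subjects: the same person tests A and B, so their personal
    # baseline appears in both scores and cancels in the difference.
    diffs = []
    for _ in range(n):
        baseline = rng.gauss(50, individual_sd)
        score_a = baseline + rng.gauss(0, noise_sd)
        score_b = baseline + true_effect + rng.gauss(0, noise_sd)
        diffs.append(score_b - score_a)

    # Between-subjects: different people in each group, so baselines
    # do not cancel and remain in the group comparison as noise.
    group_a = [rng.gauss(50, individual_sd) + rng.gauss(0, noise_sd)
               for _ in range(n)]
    group_b = [rng.gauss(50, individual_sd) + true_effect + rng.gauss(0, noise_sd)
               for _ in range(n)]

    return statistics.stdev(diffs), statistics.stdev(group_a + group_b)

within_sd, between_sd = simulate()
print(f"spread of within-subject differences: {within_sd:.1f}")
print(f"spread of between-subject scores:     {between_sd:.1f}")
```

With these illustrative settings, the within-subjects differences are far less variable than the between-subjects scores, which is exactly why the same effect can be detected with fewer participants.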
When to Use It
Between-subjects is the right choice when:
- Exposure to one condition would contaminate responses to the other
- Learning effects would be problematic (testing the same task twice would bias results)
- You are testing significantly different experiences (e.g., two completely different product concepts)
Within-Subjects Design (Repeated Measures)
Definition: The same participants test both versions. Each user acts as their own baseline, experiencing Prototype X and then Prototype Y (or vice versa).
The Trade-Off
| Aspect | Assessment |
|---|---|
| Pro | High statistical power with fewer users—each user acts as their own control, eliminating individual differences. |
| Con | High risk of "Order Effects" (learning). Seeing A first may teach you how to use B. |
Why It Works
More statistical power: Because you are comparing each person to themselves, individual differences cancel out. You need fewer participants to detect the same effect [2].
Richer comparative feedback: Participants can directly compare their experience ("B felt faster than A because...").
Cost efficiency: You get more data points per participant.
The Risk: Order Effects
Order effects: The sequence in which participants experience conditions matters. Being exposed to A first might change how someone perceives B.
Fatigue and learning: Longer sessions can tire participants, and practice with the first condition might improve performance on the second.
Carryover effects: Knowledge or expectations from one condition might persist into the next.
Order Effects
Order effects are a critical concern in within-subjects designs. They take three main forms:
Practice effects: Performance improves simply because participants become more familiar with the task type, the interface style, or the testing situation.
Fatigue effects: Performance declines because participants become tired, bored, or less engaged over time.
Sensitization: Experiencing one condition changes how participants perceive the other; they notice things they might not have otherwise.
If everyone experiences A before B, you cannot tell whether differences in performance are due to the designs or simply due to order.
Counterbalancing
The solution to order effects is counterbalancing: systematically varying the order of conditions across participants.
Simple Counterbalancing
With two conditions (A and B):
- Half of participants experience A then B
- Half experience B then A
This way, any practice or fatigue effects are distributed across both conditions.
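A minimal sketch of this assignment step, assuming hypothetical participant IDs and a fixed random seed for reproducibility:

```python
import random

def assign_orders(participant_ids, seed=42):
    """Randomly assign half of the participants to the order A->B
    and the other half to B->A (simple counterbalancing)."""
    ids = list(participant_ids)
    rng = random.Random(seed)   # fixed seed so the assignment is reproducible
    rng.shuffle(ids)            # randomize who lands in which half
    half = len(ids) // 2
    orders = {}
    for pid in ids[:half]:
        orders[pid] = ["A", "B"]
    for pid in ids[half:]:
        orders[pid] = ["B", "A"]
    return orders

assignments = assign_orders([f"P{i:02d}" for i in range(1, 9)])
for pid, order in sorted(assignments.items()):
    print(pid, " -> ".join(order))
```

Shuffling before splitting matters: assigning orders in recruitment order could accidentally confound order with recruitment time.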
Latin Square Counterbalancing
With more than two conditions, full counterbalancing becomes impractical (3 conditions = 6 orders; 4 conditions = 24 orders). A Latin Square design ensures each condition appears equally often in each position without testing every possible order.
For three conditions (A, B, C):
| Group | Order |
|---|---|
| 1 | A → B → C |
| 2 | B → C → A |
| 3 | C → A → B |
Each condition appears once in each position (first, second, third).
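The table above is a cyclic Latin square, which can be generated for any number of conditions by rotating the condition list one step per group:

```python
def latin_square(conditions):
    """Build a cyclic Latin square: row g is the condition list rotated
    by g positions, so each condition appears exactly once in each
    serial position (first, second, third, ...)."""
    n = len(conditions)
    return [[conditions[(g + pos) % n] for pos in range(n)] for g in range(n)]

for row in latin_square(["A", "B", "C"]):
    print(" -> ".join(row))
# prints:
# A -> B -> C
# B -> C -> A
# C -> A -> B
```

Note that a plain cyclic square balances position but not immediate sequence (B always follows A here); when carryover between adjacent conditions is a concern, a balanced Latin square is the stronger choice.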
Mixed Designs
Sometimes you need elements of both approaches. A mixed design (or "split-plot" design) combines between-subjects and within-subjects factors.
Example: You want to compare two checkout flows (A vs. B) across two user segments (new users vs. returning users).
- Between-subjects factor: User segment (a person is either new or returning)
- Within-subjects factor: Checkout flow (each person tests both A and B)
This design lets you ask: "Does the effect of the checkout redesign differ for new versus returning users?"
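One way to see the interaction question concretely is to compute, per segment, each participant's own A-minus-B difference. The records and task times below are hypothetical, invented purely to show the data shape of a mixed design:

```python
from collections import defaultdict

# Hypothetical task-time records: "segment" is the between-subjects
# factor, "flow" the within-subjects factor (each person has both flows).
records = [
    {"participant": "P01", "segment": "new",       "flow": "A", "seconds": 95},
    {"participant": "P01", "segment": "new",       "flow": "B", "seconds": 80},
    {"participant": "P02", "segment": "returning", "flow": "A", "seconds": 60},
    {"participant": "P02", "segment": "returning", "flow": "B", "seconds": 58},
]

# Group each participant's two scores together.
by_participant = defaultdict(dict)
for r in records:
    by_participant[(r["participant"], r["segment"])][r["flow"]] = r["seconds"]

# Per-segment mean of the within-person difference (A minus B):
# a larger value means the redesign helped that segment more.
effects = defaultdict(list)
for (pid, segment), flows in by_participant.items():
    effects[segment].append(flows["A"] - flows["B"])

for segment, diffs in sorted(effects.items()):
    print(segment, sum(diffs) / len(diffs))
```

If the mean difference is large for new users but near zero for returning users, that gap is the interaction the mixed design is built to detect.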
Practical Decision Framework
Use this framework to choose your design:
| Factor | Favors Between-Subjects | Favors Within-Subjects |
|---|---|---|
| Participant availability | Ample | Limited |
| Risk of order effects | High | Low/manageable |
| Need for direct comparison | Low | High |
| Session length tolerance | Shorter sessions required | Longer sessions acceptable |
| Statistical power needed | Lower | Higher |
| Learning/practice concern | High | Low |
The Baseline Problem
One common mistake is comparing a new design only against itself over time rather than against the current design.
"Users completed checkout faster after using the new design for a week" does not tell you whether the new design is better, it tells you users learned to use it.
To make a valid comparison, you need:
- The new design compared to the current design (not just to itself over time)
- Proper counterbalancing if using within-subjects
- Matched groups if using between-subjects
What This Means for Practice
Study design is not a formality; it is the structure that makes your conclusions valid or invalid.
Before recruiting a single participant, decide:
- What comparisons do you need to make?
- Can participants reasonably experience all conditions?
- What order effects might occur, and how will you control for them?
- How many participants do you need given your chosen design?
The right design depends on your specific research questions, practical constraints, and the conclusions you need to support. There is no universally "best" approach, only the right approach for your situation.