Summary
Between-subjects designs compare different groups, offering clean comparisons but requiring larger samples. Within-subjects designs have each participant experience all conditions, providing more statistical power but risking order effects. Counterbalancing mitigates these effects by varying condition order. The choice depends on your research questions, available participants, and practical constraints.
When comparing two or more versions of a design, or testing the same design with different user segments, you must decide how to structure the study. Who sees what? In what order?
This choice is not administrative. It directly shapes the conclusions you can draw, the number of participants you need, and the risks you must manage.
The Core Question
Imagine you want to compare two checkout flows: the current design (A) versus a proposed redesign (B). You have two fundamental options:
- Between-subjects: Different people test each version
- Within-subjects: The same people test both versions
Each approach has distinct trade-offs.
Between-Subjects Design (Independent Measures)
Definition: Different participants test different versions. Group A sees Prototype X; Group B sees Prototype Y. No participant ever sees both.
The Trade-Off
| Aspect | Assessment |
|---|---|
| Pro | No "learning effect"—seeing A does not teach you how to use B. Clean, uncontaminated data. |
| Con | Requires more participants (n=30+ per group) to average out individual differences between groups. |
Why It Works
No order effects: Participants cannot be influenced by having seen the other version first.
Clean comparisons: Reactions to each version are independent; there is no contamination between conditions.
Simpler sessions: Each participant has one task set, which can reduce session length and fatigue.
The Cost
Requires more participants: Because you are comparing different people, you need enough in each group to account for individual differences. Typically, you need roughly twice as many participants as a within-subjects design to achieve the same statistical power [1].
Individual differences become noise: Some observed differences might reflect the people in each group rather than the designs themselves.
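The power difference can be made concrete with a small simulation. The sketch below (all parameter values are invented for illustration) models each participant as having a personal baseline plus measurement noise. In a within-subjects comparison the baseline appears in both of a person's scores and cancels in the difference; in a between-subjects comparison it does not, so it inflates the spread you must average away with a larger sample:

```python
import random
import statistics

def simulate(n=30, true_effect=2.0, individual_sd=10.0, noise_sd=2.0, seed=0):
    """Illustrate why individual differences are noise in a between-subjects
    comparison but cancel out in a within-subjects one.
    All numbers here are made up for illustration."""
    rng = random.Random(seed)  # fixed seed so the demo is reproducible

    # Within-subjects: the same person tests A and B, so their personal
    # baseline appears in both scores and cancels in the difference.
    diffs = []
    for _ in range(n):
        baseline = rng.gauss(50, individual_sd)
        score_a = baseline + rng.gauss(0, noise_sd)
        score_b = baseline + true_effect + rng.gauss(0, noise_sd)
        diffs.append(score_b - score_a)

    # Between-subjects: different people in each group, so baselines
    # do not cancel and remain in the group comparison as noise.
    group_a = [rng.gauss(50, individual_sd) + rng.gauss(0, noise_sd)
               for _ in range(n)]
    group_b = [rng.gauss(50, individual_sd) + true_effect + rng.gauss(0, noise_sd)
               for _ in range(n)]

    return statistics.stdev(diffs), statistics.stdev(group_a + group_b)

within_sd, between_sd = simulate()
print(f"spread of within-subject differences: {within_sd:.1f}")
print(f"spread of between-subject scores:     {between_sd:.1f}")
```

With these illustrative settings, the within-subjects differences are far less variable than the between-subjects scores, which is exactly why the same effect can be detected with fewer participants.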
When to Use It
Between-subjects is the right choice when:
- Exposure to one condition would contaminate responses to the other
- Learning effects would be problematic (testing the same task twice would bias results)
- You are testing significantly different experiences (e.g., two completely different product concepts)
Within-Subjects Design (Repeated Measures)
Definition: The same participants test both versions. Each user acts as their own baseline, experiencing Prototype X and then Prototype Y (or vice versa).
The Trade-Off
| Aspect | Assessment |
|---|---|
| Pro | High statistical power with fewer users—each user acts as their own control, eliminating individual differences. |
| Con | High risk of "Order Effects" (learning). Seeing A first may teach you how to use B. |
Why It Works
More statistical power: Because you are comparing each person to themselves, individual differences cancel out. You need fewer participants to detect the same effect [2].
Richer comparative feedback: Participants can directly compare their experience ("B felt faster than A because...").
Cost efficiency: You get more data points per participant.
The Risk: Order Effects
Order effects: The sequence in which participants experience conditions matters. Being exposed to A first might change how someone perceives B.
Fatigue and learning: Longer sessions can tire participants, and practice with the first condition might improve performance on the second.
Carryover effects: Knowledge or expectations from one condition might persist into the next.
Order Effects
Order effects are a critical concern in within-subjects designs. They take three main forms:
Practice effects: Performance improves simply because participants become more familiar with the task type, the interface style, or the testing situation.
Fatigue effects: Performance declines because participants become tired, bored, or less engaged over time.
Sensitization: Experiencing one condition changes how participants perceive the other; they notice things they might not have otherwise.
If everyone experiences A before B, you cannot tell whether differences in performance are due to the designs or simply due to order.
Counterbalancing
The solution to order effects is counterbalancing: systematically varying the order of conditions across participants.
Simple Counterbalancing
With two conditions (A and B):
- Half of participants experience A then B
- Half experience B then A
This way, any practice or fatigue effects are distributed across both conditions.
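A minimal sketch of this assignment step, assuming hypothetical participant IDs and a fixed random seed for reproducibility:

```python
import random

def assign_orders(participant_ids, seed=42):
    """Randomly assign half of the participants to the order A->B
    and the other half to B->A (simple counterbalancing)."""
    ids = list(participant_ids)
    rng = random.Random(seed)   # fixed seed so the assignment is reproducible
    rng.shuffle(ids)            # randomize who lands in which half
    half = len(ids) // 2
    orders = {}
    for pid in ids[:half]:
        orders[pid] = ["A", "B"]
    for pid in ids[half:]:
        orders[pid] = ["B", "A"]
    return orders

assignments = assign_orders([f"P{i:02d}" for i in range(1, 9)])
for pid, order in sorted(assignments.items()):
    print(pid, " -> ".join(order))
```

Shuffling before splitting matters: assigning orders in recruitment order could accidentally confound order with recruitment time.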
Latin Square Counterbalancing
With more than two conditions, full counterbalancing becomes impractical (3 conditions = 6 orders; 4 conditions = 24 orders). A Latin Square design ensures each condition appears equally often in each position without testing every possible order.
For three conditions (A, B, C):
| Group | Order |
|---|---|
| 1 | A → B → C |
| 2 | B → C → A |
| 3 | C → A → B |
Each condition appears once in each position (first, second, third).
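The table above is a cyclic Latin square, which can be generated for any number of conditions by rotating the condition list one step per group:

```python
def latin_square(conditions):
    """Build a cyclic Latin square: row g is the condition list rotated
    by g positions, so each condition appears exactly once in each
    serial position (first, second, third, ...)."""
    n = len(conditions)
    return [[conditions[(g + pos) % n] for pos in range(n)] for g in range(n)]

for row in latin_square(["A", "B", "C"]):
    print(" -> ".join(row))
# prints:
# A -> B -> C
# B -> C -> A
# C -> A -> B
```

Note that a plain cyclic square balances position but not immediate sequence (B always follows A here); when carryover between adjacent conditions is a concern, a balanced Latin square is the stronger choice.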
Mixed Designs
Sometimes you need elements of both approaches. A mixed design (or "split-plot" design) combines between-subjects and within-subjects factors.
Example: You want to compare two checkout flows (A vs. B) across two user segments (new users vs. returning users).
- Between-subjects factor: User segment (a person is either new or returning)
- Within-subjects factor: Checkout flow (each person tests both A and B)
This design lets you ask: "Does the effect of the checkout redesign differ for new versus returning users?"
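One way to see the interaction question concretely is to compute, per segment, each participant's own A-minus-B difference. The records and task times below are hypothetical, invented purely to show the data shape of a mixed design:

```python
from collections import defaultdict

# Hypothetical task-time records: "segment" is the between-subjects
# factor, "flow" the within-subjects factor (each person has both flows).
records = [
    {"participant": "P01", "segment": "new",       "flow": "A", "seconds": 95},
    {"participant": "P01", "segment": "new",       "flow": "B", "seconds": 80},
    {"participant": "P02", "segment": "returning", "flow": "A", "seconds": 60},
    {"participant": "P02", "segment": "returning", "flow": "B", "seconds": 58},
]

# Group each participant's two scores together.
by_participant = defaultdict(dict)
for r in records:
    by_participant[(r["participant"], r["segment"])][r["flow"]] = r["seconds"]

# Per-segment mean of the within-person difference (A minus B):
# a larger value means the redesign helped that segment more.
effects = defaultdict(list)
for (pid, segment), flows in by_participant.items():
    effects[segment].append(flows["A"] - flows["B"])

for segment, diffs in sorted(effects.items()):
    print(segment, sum(diffs) / len(diffs))
```

If the mean difference is large for new users but near zero for returning users, that gap is the interaction the mixed design is built to detect.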
Practical Decision Framework
Use this framework to choose your design:
| Factor | Favors Between-Subjects | Favors Within-Subjects |
|---|---|---|
| Participant availability | Ample | Limited |
| Risk of order effects | High | Low/manageable |
| Need for direct comparison | Low | High |
| Session length tolerance | Shorter sessions required | Longer sessions acceptable |
| Statistical power needed | Lower | Higher |
| Learning/practice concern | High | Low |
The Baseline Problem
One common mistake is comparing a new design only against itself over time rather than against the current design.
"Users completed checkout faster after using the new design for a week" does not tell you whether the new design is better, it tells you users learned to use it.
To make a valid comparison, you need:
- The new design compared to the current design (not just to itself over time)
- Proper counterbalancing if using within-subjects
- Matched groups if using between-subjects
What This Means for Practice
Study design is not a formality; it is the structure that makes your conclusions valid or invalid.
Before recruiting a single participant, decide:
- What comparisons do you need to make?
- Can participants reasonably experience all conditions?
- What order effects might occur, and how will you control for them?
- How many participants do you need given your chosen design?
The right design depends on your specific research questions, practical constraints, and the conclusions you need to support. There is no universally "best" approach, only the right approach for your situation.