Summary
Internal, external, ecological, statistical conclusion, construct, criterion, content, face, nomological, structural, incremental, consequential: what each type of validity means, when it matters, and how to think about it in applied research. Plus: modern unified frameworks, qualitative trustworthiness, the AI contamination problem, and why your metrics might be lying to you.
"Is this valid?" is one of the most common questions in research, and one of the least specific. Validity is a family of related concepts, each addressing a different way your research can go wrong. The fix for one validity problem is often completely different from the fix for another. Understanding the distinctions is what separates rigorous research from research that just feels rigorous.
Modern validity theory has moved beyond the classical "types" model entirely. Since Messick (1995) [2] and Kane (2013) [21], the field treats validity as a unified concept, not a property of the instrument but of the interpretations and uses of scores. A survey is not "valid" or "invalid." Specific conclusions drawn from its data are more or less supported by available evidence for specific purposes. The same System Usability Scale (SUS) score might be well-validated for comparing two prototypes in a lab study but poorly validated for predicting market success.
Still, the classical types remain useful as a practical vocabulary. They name the specific ways things go wrong. The categories below are organized into three domains: whether your study design supports your conclusions, whether your measurements capture what you think they capture, and whether your qualitative work is trustworthy.
Study Design Validity
Internal Validity
Internal validity asks whether your conclusions about cause and effect hold within the study [1]. If you claim that a redesigned checkout flow reduced cart abandonment, internal validity is the question of whether it was actually the redesign, or whether a simultaneous pricing change, a seasonal effect, or a server speed improvement caused the drop. The main threats are confounding variables, selection bias, and maturation effects. Without internal validity, you are telling a causal story that your data does not support.
Common threats in UX and market research include maturation (users becoming more proficient through repeated exposure during a longitudinal study), history (external events, such as a competitor's update or a news cycle, influencing behavior during the study period), and selection bias (recruiting participants who are more tech-savvy or brand-positive than the actual target audience).
A/B testing [5] faces its own internal validity threats: the flicker effect (the testing tool causing visual delays that affect one variant), sample ratio mismatch (uneven traffic allocation between conditions), and novelty or primacy effects (temporary reactions to change rather than genuine preference).
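Sample ratio mismatch, at least, is cheap to check: a chi-square goodness-of-fit test comparing the observed traffic split against the intended allocation. A minimal Python sketch (the counts and the 50/50 allocation are illustrative assumptions):

```python
# Sample ratio mismatch (SRM) check: chi-square goodness-of-fit test of the
# observed assignment counts against the planned allocation.
from scipy.stats import chisquare

observed = [50_231, 49_112]        # users actually assigned to variants A and B
intended_ratio = [0.5, 0.5]        # planned 50/50 split
total = sum(observed)
expected = [r * total for r in intended_ratio]

stat, p_value = chisquare(f_obs=observed, f_exp=expected)
print(f"chi2 = {stat:.2f}, p = {p_value:.4g}")

# A very small p-value (a common convention is p < 0.001) signals SRM:
# the assignment mechanism itself is broken, so debug it before
# interpreting any metric differences between variants.
if p_value < 0.001:
    print("Likely sample ratio mismatch: investigate assignment before trusting results.")
```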
External Validity
External validity asks whether your findings generalize beyond the specific conditions of your study. You ran a usability test with 8 participants in Vienna, all aged 25–35, all tech-savvy: do those findings apply to your actual user base in rural Germany? External validity breaks down when your sample, setting, or timing is too narrow. This is the classic tension in research design: tightly controlled studies (high internal validity) often sacrifice external validity, and vice versa [4].
Population validity is a specific sub-dimension that concerns whether your sample represents the target population. Testing a retirement planning app exclusively with university students violates population validity regardless of how well-designed the study is otherwise. This distinction matters because a study can have high ecological validity (realistic setting) while still failing on population validity (wrong people).
Temporal validity asks whether findings hold across time. Munger (2023) argues that social science knowledge "decays" as the world changes. A 2015 finding about hamburger menu preferences may not hold in 2026's gesture-based interaction patterns. In fast-moving digital product environments, temporal validity deserves explicit consideration [6].
Ecological Validity
Ecological validity is a specific form of external validity that asks whether your study conditions reflect real-world use. A usability test in a quiet lab with a facilitator watching over someone's shoulder is not how people actually use your app. They use it on a crowded train, distracted, with one hand. Ecological validity is the reason diary studies, field studies, and unmoderated remote tests exist. If your method strips away the context that shapes behavior, your findings may be technically clean but practically useless.
Research comparing lab and remote testing shows no significant differences under favorable conditions, but under difficult operational conditions (dual-task demands, poor usability) meaningful differences emerge [7]. Marcilly et al. (2024) found that increasing test fidelity does not invariably improve error detection: low-fidelity tests efficiently identify ease-of-use and safety issues, while high-fidelity simulations reveal context-dependent problems [8].
Statistical Conclusion Validity
Statistical conclusion validity concerns whether the statistical relationship you found is real [1]. Did you actually detect a true effect, or is your result a false positive from running too many comparisons? Did you miss a real effect because your sample was too small? This type is often overlooked in applied UX research, but it matters whenever you report quantitative results. Common threats include low statistical power, violated test assumptions, and inflated error rates from multiple testing.
In A/B testing, statistical conclusion validity suffers from the pervasive practice of testing multiple hypotheses simultaneously, examining many metrics and segments without correction. Simpson's paradox (aggregated results reversing when segmented) and underpowered tests compound the problem. The most common industry mistake is treating statistical significance as sufficient for validity: a significant result from an underpowered, uncorrected, or confounded analysis proves little. Significance and sound design are both required for trustworthy results.
If you are unsure about sample size or effect size, this is where things go wrong first.
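To make the multiple-testing threat concrete, here is a minimal Python sketch (metric names and p-values are hypothetical placeholders) applying a Holm correction across a family of experiment metrics:

```python
# Holm correction for a family of metric comparisons from one experiment.
from statsmodels.stats.multitest import multipletests

metric_names = ["conversion", "time_on_task", "error_rate", "satisfaction", "retention_7d"]
p_values = [0.012, 0.048, 0.300, 0.021, 0.650]   # raw, uncorrected p-values

reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method="holm")
for name, p_raw, p_adj, sig in zip(metric_names, p_values, p_adjusted, reject):
    print(f"{name:14s} raw p={p_raw:.3f}  Holm-adjusted p={p_adj:.3f}  significant={sig}")

# Note how results "significant at p < .05" can disappear once the whole
# family of tests is accounted for -- statistical conclusion validity at work.
```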
For how sample size decisions interact with validity considerations, see Sample Sizes: Beyond the Magic Numbers.
Measurement Validity
Construct Validity
Construct validity asks whether you are measuring the theoretical concept you intend to measure [3]. If your survey claims to measure "user satisfaction," does it actually capture satisfaction, or is it picking up something else, like ease of use or brand loyalty? Construct validity is the deepest and most difficult form of measurement validity. It requires both theoretical clarity about what you mean by a construct and empirical evidence that your instrument captures it [2]. Poorly defined constructs lead to metrics that everyone reports but nobody trusts.
Messick (1995) argued that all validity is fundamentally construct validity, and that content, criterion, and the rest are simply different sources of evidence contributing to the overall construct validity argument. Two fundamental threats apply to every measurement: construct underrepresentation (the measure is too narrow, missing important dimensions) and construct-irrelevant variance (the measure captures extraneous factors). These twin threats provide a more actionable diagnostic framework than the classical types alone.
Borsboom, Mellenbergh, and van Heerden (2004) proposed an even sharper criterion: a test is valid if (a) the attribute exists and (b) variations in the attribute causally produce variation in measurement outcomes [9]. This forces uncomfortable questions about UX constructs. Does "user experience" exist as a real psychological entity? Do variations in it cause variations in UEQ scores, or are scores artifacts of response tendencies, social desirability, and item wording?
Convergent and Discriminant Validity
Convergent and discriminant validity are the two sides of construct validity in practice. Convergent validity means your measure correlates with other measures of the same construct: if your satisfaction scale correlates with NPS, that is convergent evidence. Discriminant validity means your measure does not correlate too highly with measures of different constructs: if your satisfaction scale correlates just as strongly with a usability scale, it may not be measuring satisfaction specifically. You need both. A measure that correlates with everything measures nothing in particular.
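A rough first pass at both checks is a simple correlation matrix. The sketch below assumes a hypothetical survey_scores.csv of per-respondent scale scores; the column names and thresholds are illustrative heuristics, not established cutoffs:

```python
# Convergent/discriminant sketch: correlate a new satisfaction scale with
# another satisfaction measure (convergent) and a usability scale (discriminant).
import pandas as pd

df = pd.read_csv("survey_scores.csv")   # hypothetical per-respondent scale scores
corr = df[["satisfaction_new", "satisfaction_existing", "usability"]].corr()
print(corr.round(2))

convergent = corr.loc["satisfaction_new", "satisfaction_existing"]
discriminant = corr.loc["satisfaction_new", "usability"]

# Rough heuristic only: convergent correlations should clearly exceed
# discriminant ones. Exact thresholds vary by field.
if convergent > 0.5 and convergent > discriminant + 0.2:
    print("Pattern consistent with convergent + discriminant validity.")
else:
    print("Scales may overlap too much -- inspect items and constructs.")
```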
Nomological Validity
Nomological validity asks whether a construct behaves as predicted within its broader theoretical network [3]. Where convergent and discriminant validity test isolated pairwise relationships, nomological validity tests a pattern of relationships. A "Brand Trust Scale" should correlate positively with purchase intent and loyalty, negatively with brand switching, and be unrelated to demographics like age. If the whole pattern holds, the construct is embedded in its theoretical network as expected.
This matters in practice because a scale can show good convergent validity with one related measure while failing to behave as theory predicts across the broader network. Kusano, Napier, and Jost (2025) recently argued that nomological validity should be prioritized over strict measurement invariance in cross-cultural research, offering a practical alternative when traditional invariance criteria are too restrictive [10].
Lim (2024) proposed a typology that treats nomological and predictive validity as distinct categories alongside the classical types, mapped sequentially across the research process [11].
Structural (Factorial) Validity
Structural validity examines whether a measure's empirical factor structure matches the theoretical structure of the construct. If your UX questionnaire is theorized to measure efficiency, learnability, and satisfaction as three dimensions, confirmatory factor analysis should reveal three correlated factors with items loading as predicted [12].
When Schankin et al. (2022) conducted a psychometric evaluation of the UEQ (N = 1,121, 23 products), they found that its six scales collapsed better into two higher-order factors, specifically pragmatic and hedonic quality [13]. That is structural validity evidence guiding practitioners toward a more parsimonious interpretation of UEQ scores. If you are collapsing a multi-dimensional questionnaire into a single score, structural validity is the check that tells you whether that collapse is legitimate or whether it obscures the nuances that matter for product decisions.
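For readers who want to see what a structural validity check looks like in practice, here is a minimal CFA sketch using the semopy package (assumed available), with hypothetical items q1–q9 loading on three theorized factors; the file name and model specification are assumptions:

```python
# CFA sketch: does the empirical factor structure match the theorized one?
import pandas as pd
from semopy import Model, calc_stats

# lavaan-style syntax: "factor =~ indicators"
model_desc = """
efficiency   =~ q1 + q2 + q3
learnability =~ q4 + q5 + q6
satisfaction =~ q7 + q8 + q9
"""

df = pd.read_csv("questionnaire_items.csv")   # hypothetical item-level responses
model = Model(model_desc)
model.fit(df)

print(model.inspect())       # factor loadings and covariances
print(calc_stats(model).T)   # fit indices; CFI/TLI near .95+ and RMSEA below
                             # roughly .06 are conventional (not absolute) cutoffs
```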
Content Validity
Content validity asks whether your measurement instrument covers the full scope of the construct. If you are measuring "onboarding experience" but your questionnaire only asks about the sign-up form and ignores the first-use tutorial, the tooltip guidance, and the initial value moment, your instrument has a content gap. Content validity is established through expert review and systematic mapping of the construct domain, not through statistics. It is especially critical when you build custom questionnaires rather than using validated scales.
Best practice for establishing content validity involves calculating a Content Validity Index: have three or more experts rate item relevance, targeting I-CVI (Item-level Content Validity Index, the proportion of experts rating an item as relevant) ≥ 0.78 per item and S-CVI/Ave (Scale-level Content Validity Index averaged across items) ≥ 0.90 across the instrument.
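The arithmetic is simple enough to script. A minimal sketch with illustrative ratings from five hypothetical experts:

```python
# Content Validity Index from expert relevance ratings.
# Ratings on a 1-4 scale; 3 or 4 counts as "relevant" (standard CVI convention).
import numpy as np

# rows = items, columns = experts (illustrative data)
ratings = np.array([
    [4, 3, 4, 4, 3],
    [4, 4, 3, 4, 4],
    [2, 3, 4, 2, 3],   # a weak item
    [4, 4, 4, 3, 4],
])

relevant = ratings >= 3
i_cvi = relevant.mean(axis=1)   # proportion of experts rating each item relevant
s_cvi_ave = i_cvi.mean()        # scale-level CVI, averaged across items

for i, v in enumerate(i_cvi, start=1):
    flag = "OK" if v >= 0.78 else "REVISE"
    print(f"Item {i}: I-CVI = {v:.2f}  [{flag}]")
print(f"S-CVI/Ave = {s_cvi_ave:.2f} (target >= 0.90)")
```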
Criterion Validity
Criterion validity asks whether your measure predicts or correlates with a concrete, real-world outcome. It comes in two flavors: predictive validity (does the measure predict a future outcome, for example, does your onboarding satisfaction score predict 30-day retention?) and concurrent validity (does the measure correlate with a current criterion, for example, does your usability score match task completion rates measured at the same time?). Criterion validity is what makes a metric actionable rather than decorative.
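As a sketch of what predictive validity evidence looks like in code (data and column names are hypothetical), a point-biserial correlation between a score and a later binary criterion:

```python
# Predictive validity sketch: does an onboarding satisfaction score relate
# to a later binary outcome (30-day retention)?
import pandas as pd
from scipy.stats import pointbiserialr

df = pd.read_csv("onboarding_cohort.csv")   # hypothetical columns: onboarding_csat, retained_30d (0/1)
r, p = pointbiserialr(df["retained_30d"], df["onboarding_csat"])
print(f"point-biserial r = {r:.2f}, p = {p:.4g}")

# A meaningful, replicable correlation with a future criterion is
# predictive-validity evidence; the same analysis against a criterion
# measured at the same time would be concurrent validity.
```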
Incremental Validity
Incremental validity asks whether a measure adds predictive power beyond what already exists [14]. This is the most pragmatically useful validity question for industry researchers who must justify adding new metrics to an already crowded dashboard. If a Customer Effort Score does not improve churn prediction beyond what Customer Satisfaction Score (CSAT) already provides, it is redundant. It is assessed via hierarchical regression: add the new measure to a model that already includes existing measures and check whether R² increases significantly.
The practical question every stakeholder asks: "Why do we need another metric?" Incremental validity is the answer.
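A minimal hierarchical regression sketch with statsmodels, using hypothetical variable names, shows the mechanics of the R² increment test:

```python
# Incremental validity sketch: does CES improve churn prediction beyond CSAT?
# Comparison of nested OLS models; data and names are hypothetical.
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

df = pd.read_csv("customer_metrics.csv")   # hypothetical columns: churn_risk, csat, ces

baseline = smf.ols("churn_risk ~ csat", data=df).fit()
extended = smf.ols("churn_risk ~ csat + ces", data=df).fit()

print(f"R2 baseline (CSAT only):  {baseline.rsquared:.3f}")
print(f"R2 extended (CSAT + CES): {extended.rsquared:.3f}")
print(anova_lm(baseline, extended))   # F-test on the R2 increment

# If the increment is trivial or non-significant, CES is redundant with
# CSAT for this purpose -- no incremental validity.
```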
Face Validity
Face validity is the simplest and weakest form: does the measure look like it measures what it claims to measure, on the surface? If you show your questionnaire to a participant and they think "yes, this is asking about my satisfaction," that is face validity. It matters for participant buy-in and response quality. People give better answers to questions that feel relevant. But face validity alone proves nothing about actual measurement quality. A question can look perfectly reasonable and still measure the wrong thing.
Consequential Validity
Consequential validity examines the social consequences of measurement use, both intended and unintended [2]. Introduced by Messick as the "consequential basis for validity" in his earlier work and elaborated in his unified framework [2], it remains controversial: some scholars argue it addresses ethical rather than measurement concerns and should sit outside the validity framework [15].
For applied researchers, it is too important to ignore. An engagement metric that inadvertently incentivizes dark patterns has a consequential validity problem. A "user satisfaction" score used to evaluate designers' performance may distort how designers approach their work. A hiring algorithm that screens UX portfolios and systematically disadvantages candidates from underrepresented backgrounds has consequential validity concerns regardless of how well it predicts job performance.
Cross-Cultural Validity and Measurement Invariance
Cross-cultural validity determines whether a measure functions equivalently across populations. Tested through multi-group confirmatory factor analysis at four progressively strict levels, from configural through metric and scalar to strict invariance, this matters for any research program that crosses cultural, linguistic, or demographic boundaries [16].
A SUS score of "70" in Japan and Germany may not carry identical meaning if scalar invariance fails, because cultural response styles or item interpretations differ. Recent work by Protzko (2025) found that a nonsense scale measuring nothing at all can pass strong measurement invariance tests [17], a sobering reminder that invariance is necessary but not sufficient evidence that the same construct is being measured.
A 2025 psychometric evaluation of the SUS in low- and middle-income countries found significant cross-cultural measurement issues, suggesting that even well-established scales require revalidation when deployed in new cultural contexts [18].
Response Process Validity
Response process validity, codified in the 2014 Standards for Educational and Psychological Testing (published jointly by the American Educational Research Association, the American Psychological Association, and the National Council on Measurement in Education), provides evidence that respondents engage with items as the researcher intended [19]. Methods include think-aloud protocols, cognitive interviews, eye-tracking, and response time analysis.
This is perhaps the most underused validity evidence source in industry practice, yet it catches problems no statistical analysis can detect. A market researcher conducting cognitive interviews for a "Purchase Intent Scale" might discover that respondents interpret "I would definitely purchase this product" as a certainty judgment rather than a strength-of-intent judgment, disagreeing not because they would not buy, but because they cannot be certain. Without response process evidence, this misinterpretation is invisible in the data.
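Response time analysis, the most automatable of these methods, can at least flag respondents who answered too quickly to have processed the items. A crude illustrative sketch (the file layout and thresholds are assumptions):

```python
# Response-time screening: flag respondents whose completion time is
# implausibly short for meaningful reading. A complement to cognitive
# interviews, not a substitute.
import pandas as pd

df = pd.read_csv("responses.csv")   # hypothetical: one row per respondent,
                                    # 'duration_sec' = total completion time
n_items = 25
min_seconds_per_item = 2.0          # illustrative floor; calibrate per survey

df["speeder"] = df["duration_sec"] < n_items * min_seconds_per_item
print(f"Flagged {df['speeder'].sum()} of {len(df)} respondents as speeders.")

# Flagged cases warrant review (or a sensitivity analysis with and without
# them), not automatic deletion.
```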
Validity in Qualitative Research
In qualitative research, where the goal is to construct understanding rather than generalize observed rules, the terminology of "validity" and "reliability" can be epistemologically mismatched. The most widely recognized alternative framework is Lincoln and Guba's (1985) trustworthiness criteria [20]:
Credibility is the confidence that findings represent plausible realities of participants. Strategies include member checks (sharing findings with participants for validation), triangulation (comparing interview data with observations or survey results), and prolonged engagement with participants.
Transferability replaces generalizability. Qualitative research does not claim universal applicability; instead, it provides "thick, rich descriptions" of context, participants, and settings, allowing readers to judge relevance to their own situations.
Dependability concerns the stability and consistency of the research process. It is supported by maintaining a detailed audit trail: a record of every step from data collection to analysis, so that an external reviewer could trace the logic of the study.
Confirmability ensures findings derive from data, not researcher bias. Strategies include using direct participant quotes, maintaining a reflexivity journal, and having an independent researcher review coding decisions.
For UX researchers, reflexivity is especially critical when testing your own designs. Documenting expectations before and after sessions helps distinguish between a user's genuine frustration and the researcher's anticipated pain points.
For practical strategies to manage bias and improve qualitative rigor, see Research Quality and Managing Bias.
Validity Requires Reliability, But Not Vice Versa
A method can be reliable without being valid. Your measurements might be perfectly consistent (high reliability) but consistently measuring the wrong thing, which means low validity.
However, a method cannot be valid without being reliable. If your measurements are random and inconsistent, they cannot be accurate. Reliability is necessary but not sufficient for validity.
The practical implication: when a metric looks unstable, fix reliability first. Only once measurements are consistent does it make sense to ask whether they are measuring the right thing.
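For a quick reliability check, Cronbach's alpha can be computed directly from item-level scores. A self-contained sketch (the toy data is random, so alpha should land near zero):

```python
# Cronbach's alpha from item-level scores: the standard first check when a
# metric looks unstable. Data shape is (respondents, items).
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """alpha = k/(k-1) * (1 - sum of item variances / variance of total score)"""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)
    total_var = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_vars.sum() / total_var)

rng = np.random.default_rng(0)
scores = rng.integers(1, 6, size=(200, 10)).astype(float)  # toy 10-item, 5-point scale
print(f"alpha = {cronbach_alpha(scores):.2f}")

# Random data yields alpha near 0; a coherent scale typically targets
# >= .70 for research use, higher when decisions affect individuals.
```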
Modern Frameworks: Beyond the "Types" Metaphor
The most important conceptual shift in validity theory is that validity is no longer considered to come in separate, independent types. Three frameworks, developed over three decades, converge on this point.
Messick's Unified Framework (1995)
Messick identified six interrelated aspects of construct validity: content, substantive, structural, generalizability, external, and consequential [2]. His central argument: these are not a menu to choose from but a comprehensive set that all apply to any measurement. Content validity and criterion validity become evidence contributing to the overarching construct validity argument rather than independent types.
For a UX practitioner validating a product experience questionnaire, this translates into six evidence streams: expert review of item relevance (content), cognitive interviews showing respondents interpret items as intended (substantive), CFA confirming the expected factor structure (structural), testing across product categories and demographics (generalizability), correlations with behavioral metrics (external), and examining whether using scores leads to appropriate design decisions (consequential).
Kane's Argument-Based Approach (1992, 2013)
Kane reframed validation as structured argumentation organized around four inferences: scoring (are observations accurately recorded?), generalization (are scores reliable across items, raters, occasions?), extrapolation (do scores predict real-world behavior?), and implication (is the action taken based on scores appropriate?) [21].
The practical insight: evidence should focus on the weakest link in the chain. If your lab-to-field transfer is questionable, concentrate evidence on extrapolation rather than accumulating more reliability evidence for a generalization inference that is already strong.
AERA/APA/NCME Standards (2014)
The Standards synthesized both Messick and Kane into five sources of validity evidence: test content, response processes, internal structure, relations to other variables, and consequences of testing [19]. The definition locates validity in interpretations and uses, not instruments: "Validity refers to the degree to which evidence and theory support the interpretations of test scores for proposed uses of tests."
The convergence across all three frameworks yields a clear message: stop asking "Is this measure valid?" and start asking "What evidence supports the specific conclusions I am drawing from these scores?"
Emerging Threat: AI and Data Validity
The most urgent validity threat in contemporary research is artificial intelligence contaminating survey data. The scale of the problem has grown faster than most research teams have adapted.
Participants Are Using AI
Zhang, Xu, and Alvero (2025) surveyed online panel participants and found that 34% reported using LLMs to help answer open-ended survey questions [22]. Participants cited unclear instructions, survey fatigue, and language barriers as motivations. Newer platform users, males, and college-educated participants were more likely to use AI, and demographic patterns in AI usage can systematically bias data in ways that mimic substantive findings.
AI Responses Are Homogeneous
LLM-generated responses are consistently more positive, more neutral, and less variable than human responses. Zhang et al. found that AI responses about social groups approached sensitive topics with sanitized detachment, while human responses contained concrete, emotionally charged language. The result is not simply added noise. AI contamination systematically flattens the distribution of responses, masking genuine variation in attitudes and beliefs [22].
Autonomous Agents Pass Quality Checks
Westwood (2025) demonstrated that an autonomous synthetic respondent powered by LLMs could complete online surveys end-to-end, passing 99.8% of attention checks across 6,000 trials. The agent outperformed actual humans [23]. Traditional fraud detection (CAPTCHAs, honeypot questions, logic puzzles) is nearly useless.
Validity Impact Across Types
Each validity type faces a distinct AI threat. Construct validity is compromised because researchers may be measuring AI training patterns rather than human psychological constructs. Ecological validity collapses because AI responses do not reflect real human behavior. Statistical conclusion validity suffers because reduced variance from homogeneous AI responses can inflate effect sizes or mask real effects. The bias is systematic, not random: it behaves like an unmeasured confound, not like noise.
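There is no reliable detector for AI-generated responses, but the variance-flattening pattern suggests at least a screening heuristic: look for suspicious uniformity in open-ended answers. An illustrative sketch (file and column names are assumptions; no defensible absolute thresholds exist):

```python
# Heuristic screen for homogenized open-ended responses, motivated by the
# variance-flattening pattern described above. Flags uniformity worth manual
# review; it does not identify AI use.
import pandas as pd

df = pd.read_csv("open_ends.csv")   # hypothetical column: 'answer'

def type_token_ratio(text: str) -> float:
    tokens = text.lower().split()
    return len(set(tokens)) / len(tokens) if tokens else 0.0

df["ttr"] = df["answer"].map(type_token_ratio)
df["length"] = df["answer"].str.split().str.len()
print(df[["ttr", "length"]].describe())

# Warning signs: unusually low variance in length and phrasing, or clusters
# of near-duplicate answers. Compare against an earlier survey wave as a
# baseline if one exists.
```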
When Metrics Lose Their Meaning
Goodhart's Law, in Strathern's widely cited formulation: "When a measure becomes a target, it ceases to be a good measure." This is a validity problem in disguise. The relationship between the metric and the underlying construct degrades precisely because the metric is being optimized.
When a team optimizes for click-through rate, they may select for outrage-bait or accidental taps rather than genuine interest. When "average handling time" becomes a call center target, it rewards premature call termination. The metric improves while customer frustration increases. NPS, despite its ubiquity, has been widely criticized: studies question whether it predicts revenue growth better than other loyalty questions, its 9–10/0–6 cutoffs lack clear statistical justification, its categorization discards approximately 30% of data (passives), and it requires much larger samples for significance than the raw 0–10 mean.
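The information loss in the NPS cutoff scheme is easy to demonstrate on toy data. A minimal sketch comparing NPS against the raw 0–10 mean:

```python
# NPS arithmetic sketch: the promoter/detractor categorization versus the
# raw 0-10 mean, showing how passives drop out of the score entirely.
import numpy as np

rng = np.random.default_rng(1)
ratings = rng.integers(0, 11, size=500)   # toy 0-10 responses

promoters = (ratings >= 9).mean()
detractors = (ratings <= 6).mean()
passives = ((ratings >= 7) & (ratings <= 8)).mean()

nps = (promoters - detractors) * 100
print(f"NPS = {nps:.0f}  (passives ignored: {passives:.0%} of responses)")
print(f"raw mean = {ratings.mean():.2f} on 0-10")

# The mean uses every response and has lower sampling variance; the cutoff
# scheme discards information, which is why NPS needs larger samples to
# detect the same change.
```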
Strategies for maintaining honest metrics include: explicitly defining the construct each KPI represents, pairing every target KPI with at least one counter-metric to detect harm (speed with error rate, conversion with return rate), setting review cadences and expiry dates for proxy metrics, and maintaining qualitative narrative records alongside numbers to prevent surrogation, the cognitive slip where teams start believing the number is the reality.
For the measurement scales and instruments where validity matters most, see UX Measurement Instruments.
Questionable Measurement Practices
Flake and Fried (2020) introduced the concept of Questionable Measurement Practices (QMPs): decisions that raise doubts about measure validity through lack of transparency, ignorance, or negligence [24]. They found that 79% of item-based scales in the Many Labs 2 replication project appeared to be ad hoc, created without supporting validity evidence.
Perrig et al. (2024) brought this diagnosis directly into UX research. Their systematic review of CHI papers found 85 different scales and 172 distinct constructs, with most scales used only once. More troubling: only about 20% of papers provided a complete rationale for scale selection, and only one-third reported any scale quality investigation [25].
Six questions every researcher should answer before deploying a measure: What is your construct? Why did you choose this measure? What measure did you use? How did you quantify results? Did you modify the scale? Did you create the measure? As Flake and Fried put it: "Neither rigorous research design, nor advanced statistics, nor large samples can correct false inferences stemming from poor measurement."
The choice between validated scales and custom questionnaires is itself a validity decision. The SUS demonstrates reliability at α ≥ 0.90 with extensive benchmark norms. The UMUX-Lite achieves strong validity evidence from just two items, correlating with SUS at r = .81 [26]. The UEQ provides six-scale measurement with a benchmark database and 40+ language translations, though collapsing it into a single KPI is not recommended [27]. When full validation is not feasible, the minimum viable approach includes expert review, cognitive pretesting, piloting with 30+ respondents, and calculating reliability on collected data.
For how sample size decisions interact with validity, see Sample Sizes: Beyond the Magic Numbers.
References
- [1] William R. Shadish et al. (2002). "Experimental and Quasi-Experimental Designs for Generalized Causal Inference". Houghton Mifflin.
- [2] Samuel Messick (1995). "Validity of Psychological Assessment: Validation of Inferences From Persons' Responses and Performances as Scientific Inquiry Into Score Meaning". American Psychologist, 50(9), 741–749.
- [3] Lee J. Cronbach & Paul E. Meehl (1955). "Construct Validity in Psychological Tests". Psychological Bulletin, 52(4), 281–302.
- [4] Thomas D. Cook & Donald T. Campbell (1979). "Quasi-Experimentation: Design and Analysis Issues for Field Settings". Houghton Mifflin.
- [5] Ron Kohavi et al. (2020). "Trustworthy Online Controlled Experiments: A Practical Guide to A/B Testing". Cambridge University Press.
- [6] Kevin Munger (2023). "Temporal Validity as Meta-Science". Research & Politics, 10(3).
- [7] Juergen Sauer et al. (2019). "Extra-Laboratorial Usability Tests: An Empirical Comparison of Remote and Classical Field Testing with Lab Testing". Applied Ergonomics, 74, 85–96.
- [8] Romaric Marcilly et al. (2024). "Usability Evaluation Ecological Validity: Is More Always Better?". Healthcare, 12(14), 1417.
- [9] Denny Borsboom et al. (2004). "The Concept of Validity". Psychological Review, 111(4), 1061–1071.
- [10] Kodai Kusano et al. (2025). "The Mismeasure of Culture: Why Measurement Invariance Is Rarely Appropriate for Comparative Research in Psychology". Personality and Social Psychology Bulletin.
- [11] Wing M. Lim (2024). "A Typology of Validity: Content, Face, Convergent, Discriminant, Nomological and Predictive Validity". Journal of Trade Science, 12(3), 155–179.
- [12] Lydia Repke et al. (2024). "Validity in Survey Research: From Research Design to Measurement Instruments". GESIS Survey Guidelines. Mannheim: GESIS.
- [13] Andrea Schankin et al. (2022). "Psychometric Properties of the User Experience Questionnaire (UEQ)". Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems. ACM.
- [14] Stephen N. Haynes & William Hayes O'Brien (2000). "Principles and Practice of Behavioral Assessment". Plenum Press.
- [15] Gregory J. Cizek et al. (2008). "Sources of Validity Evidence for Educational and Psychological Tests". Educational and Psychological Measurement, 68(3), 397–412.
- [16] David Lacko et al. (2022). "The Necessity of Testing Measurement Invariance in Cross-Cultural Research". Cross-Cultural Research, 56(2–3), 1–38.
- [17] John Protzko (2025). "Invariance: What Does Measurement Invariance Allow Us to Claim?". Educational and Psychological Measurement.
- [18]
- [19] American Educational Research Association, American Psychological Association & National Council on Measurement in Education (2014). "Standards for Educational and Psychological Testing". Washington, DC: AERA.
- [20] Yvonna S. Lincoln & Egon G. Guba (1985). "Naturalistic Inquiry". Sage Publications.
- [21] Michael T. Kane (2013). "Validating the Interpretations and Uses of Test Scores". Journal of Educational Measurement, 50(1), 1–73.
- [22] Simone Zhang et al. (2025). "Generative AI Meets Open-Ended Survey Responses: Research Participant Use of AI and Homogenization". Sociological Methods & Research.
- [23] Sean J. Westwood (2025). "The Potential Existential Threat of Large Language Models to Online Survey Research". Proceedings of the National Academy of Sciences, 122(47), e2518075122.
- [24] Jessica K. Flake & Eiko I. Fried (2020). "Measurement Schmeasurement: Questionable Measurement Practices and How to Avoid Them". Advances in Methods and Practices in Psychological Science, 3(4), 456–465.
- [25] Sebastian A. C. Perrig et al. (2024). "Measurement Practices in User Experience (UX) Research: A Systematic Quantitative Literature Review". Frontiers in Computer Science, 6.
- [26] James R. Lewis et al. (2013). "UMUX-LITE: When There's No Time for the SUS". Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI '13). ACM.
- [27] Martin Schrepp et al. (2023). "User Experience Questionnaire Handbook". UEQ Online.