Survey Pretesting: Cognitive Interviews, Expert Review, and Field Testing
Survey pretesting methods explained: cognitive interviewing (think-aloud, probing), expert review, behavior coding, and field testing. When to use each, how many respondents, and what problems each method catches.

Most survey errors are preventable. The questions that produce unusable data almost always show warning signs during pretesting. The problem is that most teams either skip pretesting entirely or conduct it in ways that miss the problems that matter.
Survey pretesting is the systematic evaluation of a questionnaire before full deployment. It is, by a wide margin, the most cost-effective step in the survey lifecycle. Fixing a confusing question before launch costs minutes. Discovering the same problem after 2,000 responses costs the entire dataset.
Yet pretesting remains one of the most commonly skipped steps in survey research. Teams invest weeks in sampling strategy, question development, and analysis planning, then launch without ever watching a single person attempt to answer their questions. The result is predictable: ambiguous wording that respondents interpret differently, response options that don't cover the range of real answers, skip logic that sends people down the wrong path, and data that looks clean but measures the wrong things.
This guide covers the major pretesting methods, what each one catches, when to use which, and how to build a pretesting process that actually improves measurement quality.
TL;DR:
- Pretesting catches problems that no amount of post-hoc analysis can fix. Most question wording issues, response option gaps, and logic errors are identifiable before launch.
- Four primary methods exist: cognitive interviewing, expert review, behavior coding, and field testing. Each catches different problems.
- Cognitive interviews are the gold standard for identifying comprehension and interpretation failures. Think-aloud protocols and verbal probing reveal how respondents actually process questions.
- Expert review should come first. It catches methodological violations cheaply, before you invest in respondent testing.
- Iterative rounds outperform single large rounds. Three cycles of 5 cognitive interviews catch more problems than one cycle of 15.
→ Build Better Surveys with Lensym
What Pretesting Is (and What It Is Not)
Pretesting is not a single activity. It is a family of methods, each designed to identify different categories of problems in a survey instrument. The term encompasses everything from having a colleague read through your questions to conducting formal cognitive interviews with recorded protocols.
What unites these methods is their purpose: evaluating the questionnaire itself, not the data it produces. Pretesting asks "Will this instrument work?" rather than "What did the data say?"
This distinction matters because it separates pretesting from analysis. You cannot analyze your way out of a poorly designed question. If respondents interpret "regular exercise" to mean anything from "daily gym sessions" to "walking to the car," no statistical technique will recover the signal. Pretesting is where you catch that problem.
The Total Survey Error Perspective
Within the Total Survey Error framework, pretesting primarily targets measurement error: the gap between the true value you want to capture and the value the instrument actually records. Question wording, response scales, instructions, and survey flow all introduce measurement error, and pretesting is the primary defense against all of them.
Presser et al. (2004) reviewed decades of pretesting research and concluded that cognitive interviewing and behavior coding are the most effective methods for identifying question problems, though no single method catches everything. Their review, published in Public Opinion Quarterly, remains the standard reference on pretesting methodology.
Cognitive Interviewing
Cognitive interviewing is the most widely used and most thoroughly validated pretesting method. It involves conducting one-on-one sessions where a trained interviewer administers the questionnaire while using structured techniques to understand how respondents interpret, process, and answer each question.
The method draws on Tourangeau's (1984) cognitive model of survey response, which identifies four stages respondents go through when answering a question:
- Comprehension: Understanding what the question asks
- Retrieval: Searching memory for relevant information
- Judgment: Evaluating and estimating an answer
- Response: Mapping the answer onto the available options
Problems can occur at any stage. Cognitive interviewing techniques are designed to detect failures at each one.
Think-Aloud Protocols
In the think-aloud method, respondents verbalize their thought process as they answer each question. The interviewer asks them to "think out loud" and share everything going through their mind: how they interpret the question, what information they consider, how they arrive at an answer.
Example interaction:
Interviewer: Please read this question and tell me everything you're thinking as you answer it.
Question: "In the past 12 months, how often have you consulted a healthcare professional about a mental health concern?"
Respondent: "Okay, so... does my therapist count as a healthcare professional? I see her every two weeks. But 'consulted' makes it sound like a one-time thing. And 'mental health concern,' I don't know if stress about work counts as a mental health concern. I guess I'll say 'sometimes'?"
This single interaction reveals three distinct problems: ambiguity in "healthcare professional," connotation issues with "consulted," and unclear boundaries of "mental health concern." Without the think-aloud protocol, this respondent would have quietly selected "sometimes," and the data would have looked perfectly clean.
Think-aloud protocols work best with questions that involve complex cognitive processing: behavioral recall, attitude formation, or sensitive topics. They are less effective for simple factual questions where the thought process is minimal.
Verbal Probing
Verbal probing is more structured than think-aloud. The interviewer asks specific follow-up questions (probes) after the respondent answers each survey item. Common probe types include:
- Comprehension probes: "What does the term 'regular exercise' mean to you in this question?"
- Retrieval probes: "How did you arrive at that number? What were you thinking about?"
- Confidence probes: "How sure are you about that answer?"
- Paraphrase probes: "Can you say this question back to me in your own words?"
- General probes: "Was this question easy or hard to answer? Why?"
Verbal probing gives the interviewer more control over the session. Instead of waiting for problems to surface organically (as in think-aloud), the interviewer systematically tests each question against known failure modes.
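To make systematic probing concrete, here is a minimal sketch that assembles per-item probe scripts from a reusable probe bank. It is an illustration, not a prescribed tool: the probe wordings echo the examples above, and the question IDs and the `probe_script` helper are hypothetical.

```python
# A small probe-bank sketch for assembling per-question probe scripts.
# Probe wordings are illustrative examples in the spirit of the list above.
PROBE_BANK = {
    "comprehension": "What does the term '{term}' mean to you in this question?",
    "retrieval": "How did you arrive at that answer? What were you thinking about?",
    "confidence": "How sure are you about that answer?",
    "paraphrase": "Can you say this question back to me in your own words?",
    "general": "Was this question easy or hard to answer? Why?",
}

def probe_script(question_id, probe_types, term=""):
    """Assemble an ordered probe script for one survey item."""
    return [(question_id, p, PROBE_BANK[p].format(term=term)) for p in probe_types]

# A complex behavioral item gets comprehension and retrieval probes.
for qid, ptype, wording in probe_script("Q4", ["comprehension", "retrieval"],
                                        term="regular exercise"):
    print(f"{qid} [{ptype}]: {wording}")
```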
Willis (2005), in Cognitive Interviewing: A Tool for Improving Questionnaire Design, found that probing and think-aloud identify somewhat different problems. Probing catches more comprehension issues; think-aloud reveals more about retrieval and judgment. Many practitioners use a hybrid: think-aloud for complex questions, probing for straightforward ones.
Retrospective Interviewing
A third technique has the respondent complete the entire survey first, then discuss their experience question by question afterward. This reduces the artificial nature of think-aloud and captures the holistic experience of taking the survey, including fatigue, order effects, and cumulative confusion.
The tradeoff is recall accuracy. By the time respondents revisit early questions, they may not remember their exact thought process. Retrospective interviewing works best as a complement to concurrent methods, not a replacement.
Sample Size for Cognitive Interviews
One of the most common questions about cognitive interviewing is how many participants you need. The research literature offers consistent guidance:
- 5 to 7 participants identify approximately 75 to 85% of major problems (Blair & Conrad, 2011)
- 10 to 15 participants across two rounds catch most remaining issues
- Iterative rounds matter more than sample size. Three rounds of 5 participants, with revisions between rounds, consistently outperform a single round of 15 (Willis, 2005)
The logic behind iterative rounds is straightforward. The first round reveals the most obvious problems. You fix those, then the second round reveals problems that were masked by the first set. Each round uncovers a new layer.
This is one of the most important practical insights in pretesting: small, repeated cycles are more effective than large single efforts.
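The arithmetic behind this guidance can be made concrete with a toy model: if each interview independently surfaces a given problem with probability p, then n interviews surface it with probability 1 − (1 − p)^n. The sketch below uses a hypothetical p of 0.3; it illustrates diminishing returns and is not a figure from Blair & Conrad.

```python
# Toy model of problem detection across cognitive interviews.
# p = 0.3 is a hypothetical per-interview detection probability.

def detection_rate(p: float, n: int) -> float:
    """Probability that at least one of n interviews surfaces a given problem."""
    return 1 - (1 - p) ** n

for n in (5, 7, 10, 15):
    print(f"{n:>2} interviews: {detection_rate(0.3, n):.1%} chance of detection")

# 5 -> 83.2%, 7 -> 91.8%, 10 -> 97.2%, 15 -> 99.5%. The marginal gain past
# the first handful is small; re-testing a revised instrument exposes new
# problems instead of re-finding the same ones.
```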
Who Should Participate
Cognitive interview participants should resemble your target population. This seems obvious but is routinely violated. Researchers pretest with graduate students, colleagues, or whoever is convenient, then deploy to a completely different population.
If your survey targets elderly patients, pretest with elderly patients. If it targets non-native English speakers, include non-native speakers in your cognitive interviews. The comprehension failures you need to catch are population-specific. A question that is perfectly clear to a college-educated researcher may be incomprehensible to someone with an eighth-grade reading level.
Expert Review
Expert review (sometimes called expert panel review or appraisal) involves having survey methodology experts, subject matter experts, or both evaluate the questionnaire against established design principles. It is typically the first pretesting method applied, because it catches problems cheaply before you invest in respondent testing.
What Experts Catch
Methodological experts evaluate questions against known best practices:
- Double-barreled questions: Items that ask about two things at once
- Leading or loaded wording: Language that pushes respondents toward a particular answer
- Response scale problems: Scales that are unbalanced, incomplete, or mismatched to the question stem
- Logical inconsistencies: Questions that contradict each other or make impossible assumptions
- Skip logic errors: Routing that sends respondents to irrelevant questions
- Missing options: Response lists that omit common or important categories
- Survey length concerns: Whether the instrument is longer than the topic warrants
Subject matter experts add a different perspective. They evaluate whether the questions adequately cover the construct, whether terminology is appropriate for the domain, and whether the response options reflect the range of real-world variation.
How to Conduct Expert Review
The most effective expert reviews use structured protocols rather than open-ended "take a look at this" requests. Provide reviewers with:
- The research objectives: What the survey is intended to measure
- The target population: Who will be answering
- A review checklist: Specific criteria to evaluate (question clarity, response option completeness, scale appropriateness, logical flow)
- Instructions to annotate: Ask reviewers to flag specific problems with specific explanations, not just "this question could be better"
A panel of 3 to 5 reviewers is standard. With fewer than three, you risk idiosyncratic opinions. With more than five, coordination costs outweigh marginal gains.
Limitations of Expert Review
Expert review has a well-documented blind spot: experts are not typical respondents. A survey methodologist may approve a question that real respondents find confusing, because the expert understands the intended meaning. Conversely, experts may flag issues that don't actually cause problems for the target audience.
Forsyth et al. (2004) compared expert review with cognitive interviewing and found that the two methods identify substantially different problem sets. Expert review catches more structural and methodological issues. Cognitive interviewing catches more comprehension and interpretation issues. Neither is a substitute for the other.
This is why the recommended sequence is expert review first, cognitive interviews second.
Behavior Coding
Behavior coding is a systematic observation method where trained coders watch (or listen to) interviewer-respondent interactions and record specific behaviors that signal question problems. Originally developed for interviewer-administered surveys, the method provides quantitative evidence about question performance.
How It Works
Coders classify each interaction using a standardized scheme. Common codes include:
| Code | Behavior | What It Signals |
|---|---|---|
| Exact reading | Interviewer reads question as written | Question works as designed |
| Minor change | Interviewer makes small wording change | Question is awkward to read |
| Major change | Interviewer substantially alters question | Question is problematic |
| Clarification request | Respondent asks what question means | Comprehension failure |
| Qualified answer | Respondent says "well, it depends..." | Question doesn't fit respondent's situation |
| Inadequate answer | Response doesn't match expected format | Response options are problematic |
| Don't know | Respondent cannot provide an answer | Question is unanswerable for this respondent |
A question is typically flagged for revision when problematic behaviors (anything other than exact reading and clean response) occur in more than 15 to 20 percent of interactions.
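As a rough illustration, the flagging rule can be computed from a coding file of (question, code) pairs. The sketch below is assumption-laden: the code labels are stand-ins for the scheme in the table, and the 15 percent threshold is the lower bound of the range above.

```python
from collections import Counter, defaultdict

# Codes treated as unproblematic; labels are illustrative stand-ins
# for the scheme in the table above.
CLEAN_CODES = {"exact_reading", "adequate_answer"}

def flag_questions(interactions, threshold=0.15):
    """Flag questions whose problematic-behavior rate exceeds the threshold."""
    per_question = defaultdict(Counter)
    for qid, code in interactions:
        per_question[qid][code] += 1
    flagged = {}
    for qid, counts in per_question.items():
        total = sum(counts.values())
        problems = sum(n for code, n in counts.items() if code not in CLEAN_CODES)
        if problems / total > threshold:
            flagged[qid] = problems / total
    return flagged

interactions = [
    ("Q1", "exact_reading"), ("Q1", "exact_reading"), ("Q1", "adequate_answer"),
    ("Q2", "clarification_request"), ("Q2", "exact_reading"), ("Q2", "qualified_answer"),
]
print(flag_questions(interactions))  # Q2 flagged: two of three interactions problematic
```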
When to Use Behavior Coding
Behavior coding is most valuable for large-scale, interviewer-administered surveys: government censuses, health surveys, and national opinion polls. It requires a minimum of 50 to 100 respondent interactions to produce reliable frequencies, making it impractical for small studies.
For self-administered surveys (including most online surveys), behavior coding in its traditional form isn't possible because there is no interviewer to observe. However, some of the same signals can be captured through paradata: response times, answer changes, drop-offs at specific questions, and patterns of straight-lining. These digital equivalents serve a similar diagnostic function.
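A minimal sketch of those paradata checks, assuming the platform exports per-item timings and answers; the record structure and field names here are hypothetical.

```python
import statistics

# Hypothetical paradata: per-item time in seconds (None = never reached)
# and answers to a five-item Likert battery.
respondents = [
    {"times": {"Q1": 8, "Q2": 41, "Q3": None}, "battery": [4, 4, 4, 4, 4]},
    {"times": {"Q1": 6, "Q2": 55, "Q3": 12},   "battery": [2, 4, 3, 5, 1]},
]

def median_item_time(respondents, qid):
    """Median seconds spent on an item; unusually long times signal difficulty."""
    times = [r["times"][qid] for r in respondents if r["times"].get(qid) is not None]
    return statistics.median(times) if times else None

def dropoff_item(record):
    """First item the respondent never reached (a drop-off signal)."""
    return next((q for q, t in record["times"].items() if t is None), None)

def is_straightliner(record):
    """Identical answers across a battery suggest satisficing."""
    return len(set(record["battery"])) == 1

print(median_item_time(respondents, "Q2"))         # 48.0 -- a slow item, worth a look
print([dropoff_item(r) for r in respondents])      # ['Q3', None]
print([is_straightliner(r) for r in respondents])  # [True, False]
```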
Field Testing (Pilot Testing)
Field testing, or pilot testing, is the final pretesting stage. It involves administering the survey to a small sample under conditions that approximate real deployment: same mode (online, phone, in-person), same population, same recruitment approach.
How Field Testing Differs from Cognitive Interviews
The distinction between field testing and cognitive interviewing is important because they catch fundamentally different problems.
Cognitive interviews reveal why questions fail: the thought process, the misinterpretation, the retrieval difficulty. But they happen in artificial conditions (one-on-one, with an interviewer present, with respondents who know they're being observed).
Field tests reveal what happens under real conditions: actual completion rates, real drop-off patterns, true response distributions, and technical failures across devices and browsers. But they don't tell you why problems occur.
| Dimension | Cognitive Interviews | Field Testing |
|---|---|---|
| Sample size | 5 to 15 | 50 to 200 |
| Conditions | Artificial, observed | Real or near-real |
| Data type | Qualitative (why) | Quantitative (what) |
| Catches | Interpretation problems | Distribution and completion problems |
| Misses | Scale-dependent issues | Cognitive process issues |
| Timing | Early in pretesting | Final pretesting stage |
What to Measure in a Field Test
A well-structured field test collects data on the following (a short analysis sketch appears after the list):
- Completion rate: What percentage finish? Rates below 50 percent warrant investigation. For guidance on diagnosing drop-off, see our completion rates guide.
- Median completion time: Is it within the range respondents were told to expect? Times significantly longer than projected suggest question difficulty or survey fatigue.
- Item-level drop-off: A sudden spike in abandonment at a particular item signals a problem with that question.
- Response distributions: Are respondents using the full range of your scales, or clustering at one end? Extreme ceiling or floor effects suggest the scale doesn't differentiate well.
- Skip logic paths: Do all paths produce the expected routing? Are any paths untested because no field test respondent triggered them?
- Straight-lining and pattern responding: Identical answers across a battery signal satisficing, often caused by survey length or repetitive formatting.
- Open-ended response quality: Are respondents providing substantive answers, or leaving open-ended items blank?
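Several of these checks reduce to simple arithmetic over respondent records. The sketch below assumes a hypothetical record format (total duration, an abandonment marker, and one scale item); it illustrates the computations, not a prescribed schema.

```python
import statistics
from collections import Counter

# Hypothetical field-test records: total duration in seconds, the item at
# which the respondent abandoned (None = completed), one 1-5 scale item.
records = [
    {"duration": 410, "abandoned_at": None,  "q7": 5},
    {"duration": 380, "abandoned_at": None,  "q7": 5},
    {"duration": 150, "abandoned_at": "Q12", "q7": 4},
]

completed = [r for r in records if r["abandoned_at"] is None]
print(f"Completion rate: {len(completed) / len(records):.0%}")  # below 50% warrants investigation

print("Median completion time:",
      statistics.median(r["duration"] for r in completed), "seconds")

# Item-level drop-off: a spike at one item signals a problem with it.
print("Drop-off by item:", Counter(r["abandoned_at"] for r in records if r["abandoned_at"]))

# Ceiling check: share of answers at the scale's top point.
print(f"Share at ceiling for Q7: {sum(r['q7'] == 5 for r in records) / len(records):.0%}")
```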
Sample Size for Field Tests
Field tests typically use 50 to 200 respondents, though the appropriate number depends on what you need to detect. For simple completion rate and timing checks, 50 responses may suffice. For evaluating response distributions across subgroups or testing complex skip logic with many branches, you may need 150 or more.
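A rough way to reason about these numbers is the standard error of a proportion: an observed rate p based on n respondents has standard error sqrt(p(1 − p)/n). The sketch below applies that formula; the worst-case p = 0.5 and the 95 percent multiplier are conventional statistical assumptions, not guidance from the pretesting literature.

```python
import math

def standard_error(p: float, n: int) -> float:
    """Standard error of an observed proportion p from n respondents."""
    return math.sqrt(p * (1 - p) / n)

for n in (50, 100, 200):
    half_width = 1.96 * standard_error(0.5, n)  # worst case p = 0.5
    print(f"n={n:>3}: rate estimated within ±{half_width:.0%} (95% CI)")

# n= 50: ±14%; n=100: ±10%; n=200: ±7% -- subgroup and path-level checks
# split the sample further, which is why they need the larger end of the range.
```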
The key constraint is that field test respondents should come from your target population or a reasonable approximation.
When to Use Which Method
No single pretesting method catches all problems. The choice depends on your resources, timeline, and the types of issues you most need to identify.
The Recommended Sequence
For surveys where data quality is critical (academic research, policy evaluation, clinical measurement), the recommended sequence is:
- Expert review (3 to 5 reviewers, 1 to 2 weeks)
- Cognitive interviews, Round 1 (5 to 7 participants)
- Revise based on findings
- Cognitive interviews, Round 2 (5 to 7 participants)
- Revise again
- Field test (50 to 200 respondents)
- Final revisions and launch
This process typically takes 4 to 8 weeks. It is thorough, and for high-stakes surveys, it is worth every day. Presser et al. (2004) found that approximately one-third of survey questions are revised after cognitive testing, meaning a typical 30-item questionnaire will have 10 items changed. Those changes would not have happened without pretesting.
Abbreviated Approaches
Not every survey justifies full pretesting. For lower-stakes instruments (internal feedback surveys, quick pulse checks, exploratory studies), a shortened process can still catch critical problems:
Minimum viable pretesting (1 to 2 days):
- Have 2 to 3 colleagues complete the survey while noting confusion points
- Fix obvious problems
- Soft launch to 20 to 50 respondents, check completion rates and timing
Moderate pretesting (1 to 2 weeks):
- Expert review by 1 to 2 methodology-aware colleagues
- 3 to 5 cognitive interviews with target population members
- Revise and field test with 30 to 50 respondents
The trade-off is real: abbreviated pretesting will miss problems that full pretesting would catch. But abbreviated pretesting is vastly better than no pretesting.
Common Problems Pretesting Reveals
Across methods, pretesting consistently identifies several categories of issues:
Ambiguous Question Wording
The most common finding. Words that seem clear to the survey designer mean different things to different respondents. "Frequently," "recently," "significant," "regular," and similar terms are especially problematic because they feel precise but are not. Cognitive interviews reliably surface these ambiguities because different participants interpret the same term differently, and the interviewer can see the divergence in real time.
Double-Barreled Items
Questions that ask about two things at once appear in nearly every untested questionnaire. Respondents in cognitive interviews typically signal these by saying something like, "Well, I agree with the first part but not the second." Expert reviewers also catch these reliably, since the pattern (conjunction joining two concepts) is structurally recognizable. For a detailed treatment, see our guide to double-barreled questions.
Response Option Gaps
Missing or inappropriate response options are common and often invisible without pretesting. Respondents may need a "not applicable" option that isn't provided, or the frequency scale may not extend far enough in either direction. Cognitive interviews reveal this when respondents say, "None of these really fits." Field tests reveal it through "Other" responses or unusual clustering.
Skip Logic Errors
Complex branching creates combinatorial paths that are difficult to test mentally. A question intended only for parents might also appear for non-parents due to a logic error. A screening question might route respondents to the wrong follow-up. Field testing is the most effective method for catching these issues because it exercises the logic with real variation in responses. Lensym's visual graph editor makes these logic paths explicit during design, reducing (though not eliminating) the need to catch them in pretesting.
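One way to exercise branching exhaustively is to simulate it. The sketch below assumes skip logic expressed as a (question, answer) → next-question table; the survey structure is hypothetical, and real platforms represent logic differently, but the idea of walking every screener path and asserting who sees what carries over.

```python
# Hypothetical routing table: (question, answer) -> next question.
# "any" is a catch-all for answers that do not branch.
ROUTES = {
    ("Q1", "yes"): "Q2",   # Q1 "Are you a parent?" -> parents-only item Q2
    ("Q1", "no"):  "Q3",
    ("Q2", "any"): "Q3",
    ("Q3", "any"): "END",
}

def walk(answers):
    """Follow the routing from Q1 for a given answer set; return items visited."""
    visited, current = [], "Q1"
    while current != "END":
        visited.append(current)
        ans = answers.get(current, "any")
        key = (current, ans) if (current, ans) in ROUTES else (current, "any")
        current = ROUTES[key]
    return visited

# Exercise both screener answers and assert the parents-only item is
# shown exactly when it should be.
for screener in ("yes", "no"):
    path = walk({"Q1": screener})
    assert ("Q2" in path) == (screener == "yes"), f"routing error for {screener!r}"
    print(screener, "->", path)  # yes -> [Q1, Q2, Q3]; no -> [Q1, Q3]
```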
Recall and Knowledge Assumptions
Questions sometimes ask respondents to recall information they don't have or never encoded. "How many times in the past year did you..." assumes respondents track this behavior. Cognitive interviews reveal the problem when respondents say, "I honestly have no idea, I'm just guessing." This is a retrieval-stage failure that can only be detected by surfacing the respondent's thought process.
Sensitive Question Reactions
Pretesting reveals which questions respondents find uncomfortable, intrusive, or inappropriate. In cognitive interviews, you can observe hesitation, discomfort, or refusal. In field tests, you see elevated drop-off rates at sensitive items. Both signals indicate a need to reword, reposition, or add framing to the question.
Iterative Pretesting: The Key Principle
The single most important principle in pretesting is iteration. Each round of pretesting, regardless of method, reveals a layer of problems. Fixing those problems sometimes introduces new issues, or exposes previously hidden ones.
The iterative cycle works as follows:
- Test the current version
- Identify problems
- Revise the instrument
- Test the revised version
- Repeat until no critical problems emerge
Two to three cycles are typically sufficient for a well-designed questionnaire. Poorly designed instruments may require more. The point is not to achieve perfection (all surveys have residual error) but to eliminate the problems that would make the data uninterpretable.
Convergence is the signal to stop. When a round of testing reveals only minor issues, or issues that would require disproportionate redesign to address, the instrument is ready for deployment.
Practical Recommendations
Based on the pretesting literature and applied practice, the following recommendations apply to most survey projects:
- Always pretest. There is no survey too simple, too short, or too routine to benefit from at least a quick review by someone other than the designer. The person who wrote the questions is the worst judge of their clarity.
- Start with expert review. It is the cheapest method and catches the most structurally obvious problems. Fixing leading questions, double-barreled items, and scale mismatches before cognitive interviews saves time and respondent burden.
- Prioritize cognitive interviews. If you can only do one method beyond expert review, choose cognitive interviews. They reveal the broadest range of problems, and the qualitative insight they provide is irreplaceable.
- Recruit from your target population. Pretesting with convenient samples (colleagues, students, friends) is a documented source of false confidence. The problems your actual respondents will encounter are often different from the problems your colleagues encounter.
- Iterate in small rounds. Five participants, revise, five more participants. This pattern catches more problems than a single round of any size.
- Document findings systematically. For each problem identified, record the question number, the nature of the problem, the evidence (quotes, behaviors, frequencies), and the proposed revision. This documentation supports methodological transparency and helps when reporting your study's design process. A minimal logging sketch follows this list.
- Test the full experience, not just individual questions. Also test transitions between sections, cumulative fatigue, the introduction and closing, and overall time burden.
- Don't skip pretesting because of time pressure. The time you "save" is almost always less than the time you waste collecting and trying to interpret data from a flawed instrument.
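As a sketch of what systematic documentation might look like, the snippet below writes a findings log to CSV. The structure and field names are suggestions, not a standard format.

```python
from dataclasses import dataclass, asdict
import csv

# One row per identified problem, mirroring the fields recommended above.
# Structure and field names are suggestions, not a standard format.
@dataclass
class Finding:
    question_id: str
    round: str      # e.g. "expert review", "cognitive round 1"
    problem: str    # nature of the problem
    evidence: str   # quote, behavior code, or observed frequency
    revision: str   # proposed fix

findings = [
    Finding("Q7", "cognitive round 1",
            "'regular exercise' interpreted inconsistently",
            "3 of 5 participants gave different definitions when probed",
            "define as 'exercise on 3 or more days per week'"),
]

with open("pretest_findings.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=list(asdict(findings[0])))
    writer.writeheader()
    writer.writerows(asdict(x) for x in findings)
```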
Conclusion
Survey pretesting is not a formality. It is the stage where preventable errors get prevented, where assumptions about respondent comprehension get tested against reality, and where the gap between what you think you are measuring and what you are actually measuring gets closed.
The methods are well established. Cognitive interviewing reveals how respondents think. Expert review catches design violations. Behavior coding quantifies interaction problems. Field testing reveals what happens under real conditions. Used in combination and iteratively, these methods dramatically improve the quality of the data your survey produces.
The cost of pretesting is measured in days. The cost of skipping it is measured in datasets.
Ready to build surveys worth pretesting?
→ Get Early Access · See Features · Read the Question Design Guide
Related Reading:
- Survey Question Design: How to Write Questions That Get Honest, Useful Answers
- Survey Validity vs Reliability: What They Mean and How to Design for Both
- Survey Measurement Error: Types, Examples, and How to Reduce It
- Double-Barreled Questions: Why They Destroy Measurement Validity
- Survey Completion Rates and Drop-Off
- How Long Should a Survey Be?
For comprehensive treatments of pretesting methodology, see Willis, G. B., Cognitive Interviewing: A Tool for Improving Questionnaire Design (2005), and Presser, S. et al., "Methods for Testing and Evaluating Survey Questionnaires," Public Opinion Quarterly, 68(1), 2004.