Survey Weighting: Post-Stratification, Raking, and Propensity Methods
Survey weighting corrects for known discrepancies between your sample and population. How post-stratification, raking, and propensity score methods work, when each applies, and what can go wrong.

Your sample almost never looks like your population. Weighting is the statistical tool that bridges the gap, but it has limits, and misusing it creates more problems than it solves.
Every survey researcher faces the same problem: the people who respond to your survey don't perfectly represent the population you're studying. Young adults respond at lower rates than older adults. Highly educated individuals are more likely to complete online surveys. People with strong opinions about your topic self-select into participation.
Survey weighting attempts to correct these imbalances by assigning different importance to different respondents. If 25-year-olds make up 15% of your population but only 8% of your sample, you give each 25-year-old respondent a higher weight so their collective influence in your estimates matches their population share.
The concept is straightforward. The execution is where things get complicated.
This guide covers the three main approaches to survey weighting (post-stratification, raking, and propensity score methods), when each is appropriate, how they work, and what goes wrong when they're misapplied.
TL;DR:
- Survey weighting adjusts for known differences between your sample and population by giving some responses more influence than others.
- Post-stratification matches your sample to known population totals on one or a few variables. Simple, transparent, requires population benchmarks.
- Raking (iterative proportional fitting) adjusts for multiple variables simultaneously using only marginal distributions. More flexible than post-stratification but harder to diagnose.
- Propensity score weighting models the probability of responding (or being in your sample) and uses the inverse as a weight. Handles many variables but depends on model specification.
- All weighting increases variance. You're trading bias reduction for wider confidence intervals.
- Weighting cannot fix what it cannot see. If the reasons people don't respond are unrelated to your adjustment variables, weighting doesn't help.
Why Surveys Need Weighting
In an ideal world, every member of your target population would have an equal probability of appearing in your sample, and everyone selected would respond. In practice, neither condition holds.
The Sources of Imbalance
Three mechanisms create discrepancies between your sample and your population:
Unequal selection probabilities. Complex sample designs intentionally oversample certain groups (e.g., minority populations in health surveys) to ensure sufficient cases for subgroup analysis. Without weighting, these oversampled groups are overrepresented in overall estimates.
Nonresponse. Not everyone who is sampled actually participates. If nonresponse is systematically related to the characteristics you're measuring, your estimates are biased. Younger respondents, lower-income respondents, and people less interested in your topic tend to respond at lower rates.
Coverage gaps. Your sampling frame may not include all members of the target population. Online panels miss people without internet access. University email lists miss staff who use alternative addresses. Phone surveys miss people without phones.
What Weighting Does (and Doesn't Do)
Weighting adjusts your estimates so that your sample, when weighted, resembles the target population on measured characteristics. If you know the population is 52% female and your sample is 60% female, weighting down-weights female respondents and up-weights male respondents.
What weighting cannot do is adjust for unmeasured differences. If female non-respondents differ from female respondents in attitudes, the weight adjustment corrects the gender ratio but not the attitude bias within gender groups.
This is the fundamental limitation: weighting assumes that within each adjustment cell (e.g., "females aged 25-34"), respondents are representative of all members of that cell, including non-respondents. This assumption, called "missing at random" conditional on the adjustment variables, is untestable with survey data alone.
Design Weights: The Starting Point
Before applying any adjustment, most complex surveys start with design weights (also called base weights or sampling weights). These account for unequal selection probabilities built into the sampling design.
Calculating Design Weights
The design weight for each respondent is the inverse of their probability of selection: d_i = 1 / π_i, where π_i is the inclusion probability for respondent i.
If you use simple random sampling from a population of 10,000 and select 500, every person has a selection probability of 500/10,000 = 0.05, and a design weight of 20. Each respondent represents 20 people in the population.
If you oversample a subgroup (say, selecting 200 of 1,000 people in a minority group and 300 of 9,000 in the majority group), the design weights differ:
- Minority group: 1,000 / 200 = 5
- Majority group: 9,000 / 300 = 30
Without these design weights, the minority group (40% of the sample) would be vastly overrepresented relative to the population (10%).
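The two-stratum example above can be computed directly. A minimal sketch, using the group sizes and sample counts from the example:

```python
# Design weight = inverse selection probability
#               = stratum population size / stratum sample size.
strata = {
    "minority": {"pop": 1_000, "sampled": 200},
    "majority": {"pop": 9_000, "sampled": 300},
}

design_weights = {name: s["pop"] / s["sampled"] for name, s in strata.items()}
# minority -> 5.0 (each respondent represents 5 people)
# majority -> 30.0

# Sanity check: weighted respondent counts reproduce the population size.
weighted_total = sum(
    s["sampled"] * design_weights[name] for name, s in strata.items()
)
# weighted_total -> 10000.0
```

The sanity check is worth keeping in real pipelines: if the design-weighted sample total does not equal the frame population, a selection probability was computed incorrectly.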
When Design Weights Aren't Enough
Design weights correct for the sampling design, but they don't address nonresponse. After applying design weights, your weighted sample reflects the intended design but not necessarily the population, because non-respondents are missing.
This is where nonresponse adjustments come in.
Post-Stratification
Post-stratification is the simplest and most widely used nonresponse adjustment. It adjusts the design-weighted sample totals to match known population totals.
How It Works
1. Define adjustment cells. Choose variables where you know both the sample distribution and the population distribution. Common choices: age group, gender, education level, geographic region.
2. Calculate cell totals. Sum the design weights within each cell to get the weighted sample total for that cell.
3. Apply adjustment factors. For each cell c, calculate f_c = N_c / (sum of d_i in cell c), where N_c is the known population total for cell c and the denominator is the sum of design weights in cell c.
4. Compute final weights. Multiply each respondent's design weight by their cell's adjustment factor: w_i = d_i × f_c.
Example
Suppose your population has 1,000 males aged 18-29, but after applying design weights, your sample represents only 600 in that cell. The adjustment factor is 1,000 / 600 ≈ 1.667. Every male aged 18-29 has their weight multiplied by 1.667.
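The four steps above can be sketched in a few lines. This is an illustrative implementation; the cell labels, design weights, and population totals are made-up numbers:

```python
from collections import defaultdict

# Each respondent: (adjustment cell, design weight). Made-up data.
respondents = [
    ("M18-29", 20.0), ("M18-29", 20.0), ("M18-29", 20.0),
    ("F18-29", 20.0), ("F18-29", 20.0),
]

# Known population totals per cell (made-up benchmarks).
population_totals = {"M18-29": 100.0, "F18-29": 50.0}

# Step 2: sum design weights within each cell.
cell_sums = defaultdict(float)
for cell, d in respondents:
    cell_sums[cell] += d

# Step 3: adjustment factor = population total / weighted sample total.
factors = {c: population_totals[c] / cell_sums[c] for c in cell_sums}

# Step 4: final weight = design weight x cell adjustment factor.
final_weights = [(cell, d * factors[cell]) for cell, d in respondents]

# The weighted sample now matches the population total in every cell.
```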
Requirements
Post-stratification requires:
- Known population totals for each adjustment cell. These typically come from census data, administrative records, or well-established reference surveys.
- Sufficient sample size in each cell. If a cell has only 2-3 respondents, their weights become very large, producing unstable estimates. A common rule of thumb is at least 20-30 respondents per cell.
- Cross-classified totals. If you adjust by age × gender, you need to know the population total for each age-gender combination, not just the marginal totals for age and gender separately.
Strengths and Limitations
Strengths:
- Transparent and easy to explain
- Guaranteed to match population totals on the adjustment variables
- Computationally simple
- Well-understood statistical properties
Limitations:
- Requires full cross-classified population totals, which become sparse as you add variables. With 4 age groups × 2 genders × 4 education levels × 5 regions, you have 160 cells, and many will have too few respondents.
- The number of adjustment variables is practically limited to 2-3.
- Assumes homogeneity within cells: respondents and non-respondents in the same cell have similar values on the survey variables.
Raking (Iterative Proportional Fitting)
Raking solves the key limitation of post-stratification: it adjusts for multiple variables simultaneously using only their marginal distributions, not the full cross-classification.
How It Works
Raking is an iterative algorithm that cycles through adjustment variables, adjusting weights to match the marginal distribution of each variable in turn.
Step 1: Adjust weights so the weighted sample matches the population distribution of Variable 1 (e.g., gender).
Step 2: With the adjusted weights from Step 1, adjust again to match the population distribution of Variable 2 (e.g., age group). This may disturb the gender distribution achieved in Step 1.
Step 3: Adjust again for Variable 3 (e.g., education). This may disturb both previous adjustments.
Repeat: Cycle through all variables, adjusting one at a time, until the weighted sample simultaneously matches all marginal distributions within a specified tolerance. Convergence typically takes 10-50 iterations.
Why It Converges
The mathematical foundation is iterative proportional fitting (IPF), first described by Deming and Stephan in 1940. Each iteration reduces the distance between the weighted sample margins and the population margins. Under mild conditions (no structural zeros, positive weights), the algorithm converges to a unique solution.
The intuition: each pass adjusts one dimension while slightly disturbing the others. But each successive pass makes smaller adjustments. The disturbances shrink with each cycle until the weights satisfy all constraints simultaneously.
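The cycle described above fits in a short function. This is a bare-bones sketch of iterative proportional fitting, not a production implementation (no handling of inconsistent margins or missing categories); the respondents and targets are toy data:

```python
def rake(rows, targets, max_iter=100, tol=1e-6):
    """Iterative proportional fitting.

    rows: list of dicts, each with a "weight" key plus one key per variable.
    targets: {variable: {category: population total}} (marginals only).
    """
    for _ in range(max_iter):
        max_gap = 0.0
        for var, marg in targets.items():
            # Current weighted total per category of this variable.
            sums = {cat: 0.0 for cat in marg}
            for r in rows:
                sums[r[var]] += r["weight"]
            max_gap = max(max_gap, max(abs(sums[c] - marg[c]) for c in marg))
            # Scale weights so this variable's margins match the targets
            # (this may disturb previously adjusted variables).
            for r in rows:
                r["weight"] *= marg[r[var]] / sums[r[var]]
        if max_gap < tol:
            break
    return rows

rows = [
    {"gender": "F", "age": "18-34", "weight": 1.0},
    {"gender": "F", "age": "35+",   "weight": 1.0},
    {"gender": "M", "age": "18-34", "weight": 1.0},
    {"gender": "M", "age": "35+",   "weight": 1.0},
    {"gender": "M", "age": "35+",   "weight": 1.0},
]
targets = {
    "gender": {"F": 52.0, "M": 48.0},
    "age": {"18-34": 30.0, "35+": 70.0},
}
raked = rake(rows, targets)
# After convergence, both gender and age margins match the targets,
# even though the joint gender x age distribution was never specified.
```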
Comparison with Post-Stratification
Raking and post-stratification produce identical results when you adjust for only one variable. When you adjust for multiple variables, they differ:
| | Post-Stratification | Raking |
|---|---|---|
| Population data required | Full cross-classification | Marginal totals only |
| Number of variables | 2-3 practical maximum | 5-10 or more |
| Cell-level match | Exact | Approximate (matches margins, not cells) |
| Empty cells | Problematic | Not an issue |
| Transparency | High | Moderate |
If you know that 30% of your population is female aged 18-34, post-stratification guarantees that exactly 30% of your weighted sample falls in that cell. Raking only guarantees that the overall gender split and the overall age split match the population; the joint distribution of gender × age may not match perfectly.
When Raking Fails
Raking can produce poor results when:
- Margins are inconsistent. If the marginal totals come from different sources and are internally contradictory (they don't correspond to any valid joint distribution), the algorithm may not converge or may produce extreme weights.
- Strong interactions exist. If the relationship between nonresponse and the survey variable depends on a specific combination of adjustment variables (not just their marginals), raking's inability to match the joint distribution is a real limitation.
- Starting weights are very unequal. Highly variable design weights combined with raking adjustments can produce extreme final weights.
Practical Considerations
Most statistical software includes raking implementations. In R, the survey package provides rake() and calibrate(). In Stata, ipfweight and survwgt handle raking. In Python, ipfn provides iterative proportional fitting.
When implementing raking:
- Start with the most important adjustment variable (the one most correlated with your key survey outcomes).
- Set convergence tolerance appropriately. Too tight (0.0001) wastes computation; too loose (0.1) leaves noticeable margin mismatches.
- Monitor weight distributions after raking. Check the minimum, maximum, and coefficient of variation of the weights. Flag extreme weights for review.
Propensity Score Weighting
Propensity score methods take a different approach. Instead of directly matching to population totals, they model the mechanism that creates the sample-population discrepancy: the probability of responding (or of being included in a non-probability sample).
The Basic Idea
For each person in your sample, estimate the probability that they would appear in your sample (or respond to your survey) given their characteristics. Call this p̂_i, the estimated propensity score. The weight is the inverse of this probability: w_i = 1 / p̂_i.
People with low propensity scores (those unlikely to respond but who did) receive high weights because they represent many similar people who didn't respond. People with high propensity scores receive low weights because they represent fewer missing people.
Estimating Propensity Scores
The propensity model is typically a logistic regression where:
- The outcome is whether a person responded (1) or not (0), or whether they're in the survey sample (1) or a reference sample (0).
- The predictors are characteristics available for both respondents and non-respondents (or for both the survey sample and reference population).
For probability samples with nonresponse, you need data on non-respondents from the sampling frame (e.g., demographic information from the frame database). For non-probability samples, you need a parallel reference survey (probability-based) that measures the same auxiliary variables.
Application to Non-Probability Samples
Propensity score weighting has become particularly important for non-probability samples (opt-in panels, convenience samples, online volunteer samples). The approach:
- Conduct your non-probability survey.
- Obtain a reference survey from a probability sample covering the same population and measuring the same auxiliary variables.
- Pool both datasets. Create a binary indicator: 1 = in non-probability sample, 0 = in reference sample.
- Fit a propensity model predicting membership in the non-probability sample.
- Weight the non-probability sample by the inverse of the estimated propensity scores.
The goal: make the weighted non-probability sample resemble the population (as represented by the reference survey) on the auxiliary variables.
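The five steps above can be sketched end-to-end. To keep the example self-contained, this uses a tiny hand-rolled logistic regression with a single auxiliary variable and toy data; in practice you would use a proper statistics library and many covariates:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, lr=0.1, epochs=2000):
    """Fit P(y=1 | x) = sigmoid(b0 + b1*x) by batch gradient descent."""
    b0 = b1 = 0.0
    n = len(xs)
    for _ in range(epochs):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(b0 + b1 * x) - y  # prediction error
            g0 += err
            g1 += err * x
        b0 -= lr * g0 / n
        b1 -= lr * g1 / n
    return b0, b1

# Pooled data: x = an auxiliary variable measured in both surveys,
# s = 1 for the non-probability sample, 0 for the reference sample.
# Toy numbers for illustration only.
xs = [0.1, 0.2, 0.3, 0.9, 0.8, 0.7, 0.6, 0.4]
ss = [1,   1,   1,   0,   0,   0,   1,   0]

b0, b1 = fit_logistic(xs, ss)

# Inverse-propensity weights for the non-probability respondents only:
# low estimated propensity -> high weight.
weights = [1.0 / sigmoid(b0 + b1 * x) for x, s in zip(xs, ss) if s == 1]
```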
Strengths and Limitations
Strengths:
- Handles many auxiliary variables naturally through the regression model
- Can capture nonlinear relationships and interactions between predictors
- Flexible model specification (logistic regression, random forests, gradient boosting)
- Works for both probability and non-probability samples
Limitations:
- Only as good as the propensity model. If the model is misspecified (missing important predictors of response or using the wrong functional form), the weights are biased.
- Sensitive to extreme propensity scores. When p̂_i is very small, 1 / p̂_i is very large, producing extreme weights.
- Requires data on both respondents and non-respondents (or a reference sample). This data isn't always available.
- Less transparent than post-stratification or raking. The weights depend on model choices that are harder to communicate.
Doubly Robust Estimation
A significant advance in weighting methodology is doubly robust estimation, which combines propensity score weighting with outcome modeling. The idea: estimate both the propensity model (probability of response) and an outcome model (predicted survey values given characteristics).
The resulting estimator is consistent if either the propensity model or the outcome model is correctly specified. You don't need both to be right, just one. This provides insurance against model misspecification, though it doesn't help if both models are wrong.
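In symbols, one standard doubly robust (augmented inverse-propensity) estimator of a population mean can be written as below, where R_i indicates response, p̂_i is the estimated response propensity, and m̂(x_i) is the outcome model's prediction; the notation here is ours, since the article states the idea only in prose:

```latex
\bar{y}_{DR} = \frac{1}{N} \sum_{i=1}^{N}
  \left[ \frac{R_i\, y_i}{\hat{p}_i}
  - \left( \frac{R_i}{\hat{p}_i} - 1 \right) \hat{m}(x_i) \right]
```

If the propensity model is correct, the augmentation term has expectation zero; if the outcome model is correct, it offsets the error in the inverse-propensity term. Either way the estimator remains consistent.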
Weight Trimming and Smoothing
All weighting methods can produce extreme weights. A single respondent with a weight of 50 means that person represents 50 people in the population. If they happen to have unusual opinions, they heavily influence estimates.
The Variance-Bias Trade-Off
Extreme weights increase the variance of estimates. The design effect due to weighting is approximately deff_w ≈ 1 + CV², where CV is the coefficient of variation of the weights. If your weights have a CV of 1.0, the effective sample size is halved. Every bit of weight variability costs you statistical precision.
But reducing extreme weights (trimming) reintroduces bias. If you cap the maximum weight at 5 when it should be 20, you're underrepresenting the group that respondent belongs to.
Trimming Approaches
Hard trimming. Set a maximum weight. Any weight above the threshold is set to the threshold. Common thresholds: the median weight plus 6 times the interquartile range, or the 95th or 99th percentile of the weight distribution.
Soft trimming. Gradually compress extreme weights rather than hard-capping them. Weights above a threshold are pulled toward the threshold but not fully capped.
Mean-shift trimming. After capping extreme weights, redistribute the excess weight proportionally across all respondents to preserve the total weight sum.
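A minimal sketch of hard trimming with the total weight sum preserved, roughly the mean-shift idea above. The cap value is illustrative; note that the proportional rescaling can lift capped weights slightly above the cap, which is why the two steps are sometimes iterated in practice:

```python
def trim_weights(weights, cap):
    """Cap weights at `cap`, then rescale all weights proportionally
    so the total weight sum is preserved."""
    total = sum(weights)
    capped = [min(w, cap) for w in weights]
    scale = total / sum(capped)  # redistribute the trimmed-off weight
    return [w * scale for w in capped]

weights = [1.0, 1.2, 0.9, 1.1, 6.0]
trimmed = trim_weights(weights, cap=3.0)
# Total weight is unchanged; the extreme weight is pulled down,
# reducing that respondent's influence on estimates.
```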
When to Trim
Trim when:
- A small number of respondents have disproportionate influence (check by calculating the proportion of total weight held by the top 5% of respondents)
- Removing a single respondent substantially changes key estimates (sensitivity analysis)
- The design effect due to weighting is very large (commonly, a deff above 2 warrants investigation)
Choosing a Weighting Method
The right method depends on your data situation:
| Situation | Recommended Approach |
|---|---|
| Probability sample, few known benchmarks | Post-stratification |
| Probability sample, many benchmarks but only marginals | Raking |
| Non-probability sample with reference survey | Propensity score weighting |
| Complex sample with many auxiliary variables | Raking + propensity hybrid |
| Small sample sizes in adjustment cells | Raking (avoids empty cell problem) |
Decision Criteria
What population data do you have? If you have census cross-tabulations for 2-3 variables, post-stratify. If you have marginal distributions for 5-8 variables, rake. If you have a parallel reference survey, consider propensity scores.
What is your sample design? Probability samples with known selection probabilities start with design weights and then adjust for nonresponse. Non-probability samples skip design weights and go directly to propensity or calibration methods.
How important is transparency? Post-stratification is the easiest to explain to stakeholders. Propensity methods are the hardest. For regulatory or policy contexts where methodology will be scrutinized, simpler methods may be preferable.
How much weight variability can you tolerate? More adjustment variables generally means more weight variability. Balance the bias reduction from additional adjustments against the variance increase from more variable weights.
Common Mistakes
Weighting on Too Many Variables
Every additional adjustment variable increases weight variability. If you rake on 10 variables, some respondents will end up with extremely large or small weights. Prioritize variables that are strongly related to both nonresponse and your key survey outcomes. Variables correlated with nonresponse but not with outcomes add variance without reducing bias.
Weighting on Survey Variables
Never weight on the variables you're trying to estimate. If you adjust your sample so that 60% of weighted respondents agree with a policy (matching some external benchmark), you've built that result into your data. Weighting variables should be auxiliary: demographic, geographic, or behavioral characteristics available from external sources, not the attitudinal or opinion measures your survey is designed to study.
Ignoring Weight Variability
Unweighted analyses of weighted data underestimate standard errors. Always use survey-aware estimation methods that account for the weights. In R, use the survey package. In Stata, use svy: prefix commands. In Python, use statsmodels with frequency weights. Standard statistical tests assume equal weights and will produce artificially small p-values and narrow confidence intervals when applied to weighted data.
Assuming Weighting Solves Everything
The most dangerous mistake is believing that weighting has "fixed" your sample. Weighting only adjusts for measured characteristics. If the people who didn't respond differ from respondents in unmeasured ways that matter for your research questions, weighting provides false confidence.
Always report both weighted and unweighted estimates. If they differ substantially, the weighting is doing a lot of work, and you should examine whether your adjustment variables are sufficient.
Evaluating Your Weights
After computing weights, evaluate them before using them for analysis.
Diagnostic Checks
Weight distribution. Report the minimum, maximum, mean, median, and coefficient of variation. Flag any respondent with a weight more than 5-6 times the median weight.
Design effect. Calculate deff_w = 1 + CV² and report the effective sample size n_eff = n / deff_w. If the effective sample size is less than half the actual sample size, your weights are highly variable.
Margin matching. For post-stratification and raking, verify that the weighted sample margins match the population targets. For raking, check all margins; convergence failures can leave some margins off-target.
Sensitivity analysis. Re-run key estimates excluding the respondents with the largest weights. If results change substantially, those respondents are driving your estimates, which is risky.
Comparison of weighted and unweighted estimates. Large differences suggest strong confounding between the adjustment variables and your outcomes. Small differences suggest either (a) your sample was already representative on these characteristics, or (b) the adjustment variables aren't strongly related to your outcomes.
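The weight-distribution and design-effect checks above can be computed directly from the weight vector; a minimal sketch using Kish's approximation deff = 1 + CV²:

```python
import statistics

def weight_diagnostics(weights):
    """Return the CV of the weights, the Kish approximation to the
    design effect (1 + CV^2), and the effective sample size n / deff."""
    n = len(weights)
    mean = sum(weights) / n
    sd = statistics.pstdev(weights)  # population SD of the weights
    cv = sd / mean
    deff = 1.0 + cv**2
    return {"cv": cv, "deff": deff, "n_eff": n / deff}

diag = weight_diagnostics([1.0, 1.0, 1.0, 1.0])
# Equal weights: cv = 0, deff = 1, n_eff = n -- weighting costs nothing.
```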
What Good Weights Look Like
- Low CV (below 0.5 is good, above 1.0 is concerning)
- No single respondent with more than 1-2% of the total weight
- Weighted margins match population targets
- Key estimates stable under sensitivity analysis
- Design effect below 2.0
Weighting in Practice
Tools and Software
Most survey analysis is done in software with built-in weighting support:
- R: The `survey` package (`svydesign()`, `calibrate()`, `rake()`, `postStratify()`) is the gold standard. The `srvyr` package provides a `dplyr`-compatible interface.
- Stata: `svyset` defines the survey design; the `svy:` prefix runs survey-aware analyses. `ipfweight` handles raking.
- SPSS: The COMPLEX SAMPLES module handles weighted analyses.
- Python: `statsmodels` supports frequency and probability weights. The `ipfn` package handles raking.
Reporting Standards
When reporting weighted survey results, include:
- Description of the weighting procedure (method, adjustment variables, population benchmarks)
- Source of population data (census, administrative records, reference survey)
- Summary of weight distribution (range, CV, design effect)
- Effective sample size after weighting
- Whether weight trimming was applied, and the threshold
- Both weighted and unweighted key estimates (at minimum in supplementary material)
The American Association for Public Opinion Research (AAPOR) Transparency Initiative provides detailed reporting standards for survey methodology, including weighting.
Summary
Survey weighting is a necessary but imperfect tool. It corrects for known, measurable differences between your sample and your population, improving the validity of your estimates when used appropriately.
Post-stratification is the right choice when you have a small number of reliable population benchmarks. Raking extends this to more variables when you only have marginal distributions. Propensity score methods handle complex multivariate relationships and non-probability samples.
But weighting is not magic. It increases variance, it can produce unstable estimates when weights are extreme, and it cannot correct for unmeasured bias. The best strategy combines good survey design (maximizing response rates, reducing coverage gaps) with appropriate weighting adjustments, followed by honest assessment of what the weighting can and cannot do.
No amount of statistical adjustment can substitute for a well-designed sample. Invest in design first. Weight second. And always report what you did.
Related Reading:
- Survey Sampling Methods: Probability vs Non-Probability Explained
- How to Calculate Survey Response Rate (With Examples and Formula)
- Survey Measurement Error: Sources, Types, and How to Minimize It
- How to Improve Survey Response Rates: Evidence-Based Strategies
- Survey Sample Size Guide: How to Determine Sample Size
Survey weighting methodology is covered rigorously in Valliant, R., Dever, J. A., & Kreuter, F., Practical Tools for Designing and Weighting Survey Samples (2nd ed.), and in Kish, L., Survey Sampling, the classic text on sampling theory and design.