Survey Weighting: Post-Stratification, Raking, and Propensity Methods
Survey weighting corrects for known discrepancies between your sample and population. How post-stratification, raking, and propensity score methods work, when each applies, and what can go wrong.

Your sample almost never looks like your population. Weighting is the statistical tool that bridges the gap, but it has limits, and misusing it creates more problems than it solves.
Every survey researcher faces the same problem: the people who respond to your survey don't perfectly represent the population you're studying. Young adults respond at lower rates than older adults. Highly educated individuals are more likely to complete online surveys. People with strong opinions about your topic self-select into participation.
Survey weighting attempts to correct these imbalances by assigning different importance to different respondents. If 25-year-olds make up 15% of your population but only 8% of your sample, you give each 25-year-old respondent a higher weight so their collective influence in your estimates matches their population share.
The concept is straightforward. The execution is where things get complicated.
This guide covers the three main approaches to survey weighting (post-stratification, raking, and propensity score methods), when each is appropriate, how they work, and what goes wrong when they're misapplied.
TL;DR:
- Survey weighting adjusts for known differences between your sample and population by giving some responses more influence than others.
- Post-stratification matches your sample to known population totals on one or a few variables. Simple, transparent, requires population benchmarks.
- Raking (iterative proportional fitting) adjusts for multiple variables simultaneously using only marginal distributions. More flexible than post-stratification but harder to diagnose.
- Propensity score weighting models the probability of responding (or being in your sample) and uses the inverse as a weight. Handles many variables but depends on model specification.
- All weighting increases variance. You're trading bias reduction for wider confidence intervals.
- Weighting cannot fix what it cannot see. If the reasons people don't respond are unrelated to your adjustment variables, weighting doesn't help.
Why Surveys Need Weighting
In an ideal world, every member of your target population would have an equal probability of appearing in your sample, and everyone selected would respond. In practice, neither condition holds.
The Sources of Imbalance
Three mechanisms create discrepancies between your sample and your population:
Unequal selection probabilities. Complex sample designs intentionally oversample certain groups (e.g., minority populations in health surveys) to ensure sufficient cases for subgroup analysis. Without weighting, these oversampled groups are overrepresented in overall estimates.
Nonresponse. Not everyone who is sampled actually participates. If nonresponse is systematically related to the characteristics you're measuring, your estimates are biased. Younger respondents, lower-income respondents, and people less interested in your topic tend to respond at lower rates.
Coverage gaps. Your sampling frame may not include all members of the target population. Online panels miss people without internet access. University email lists miss staff who use alternative addresses. Phone surveys miss people without phones.
What Weighting Does (and Doesn't Do)
Weighting adjusts your estimates so that your sample, when weighted, resembles the target population on measured characteristics. If you know the population is 52% female and your sample is 60% female, weighting down-weights female respondents and up-weights male respondents.
What weighting cannot do is adjust for unmeasured differences. If female non-respondents differ from female respondents in attitudes, the weight adjustment corrects the gender ratio but not the attitude bias within gender groups.
This is the fundamental limitation: weighting assumes that within each adjustment cell (e.g., "females aged 25-34"), respondents are representative of all members of that cell, including non-respondents. This assumption, called "missing at random" conditional on the adjustment variables, is untestable with survey data alone.
Design Weights: The Starting Point
Before applying any adjustment, most complex surveys start with design weights (also called base weights or sampling weights). These account for unequal selection probabilities built into the sampling design.
Calculating Design Weights
The design weight for each respondent is the inverse of their probability of selection: d_i = 1 / π_i, where π_i is the inclusion probability for respondent i.
If you use simple random sampling from a population of 10,000 and select 500, every person has a selection probability of 500/10,000 = 0.05, and a design weight of 20. Each respondent represents 20 people in the population.
If you oversample a subgroup (say, selecting 200 of 1,000 people in a minority group and 300 of 9,000 in the majority group), the design weights differ:
- Minority group: 1,000 / 200 = 5
- Majority group: 9,000 / 300 = 30
Without these design weights, the minority group (40% of the sample) would be vastly overrepresented relative to the population (10%).
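The two-stratum example above can be computed directly. A minimal sketch, using the group sizes and sample counts from the example:

```python
# Design weight = inverse selection probability
#               = stratum population size / stratum sample size.
strata = {
    "minority": {"pop": 1_000, "sampled": 200},
    "majority": {"pop": 9_000, "sampled": 300},
}

design_weights = {name: s["pop"] / s["sampled"] for name, s in strata.items()}
# minority -> 5.0 (each respondent represents 5 people)
# majority -> 30.0

# Sanity check: weighted respondent counts reproduce the population size.
weighted_total = sum(
    s["sampled"] * design_weights[name] for name, s in strata.items()
)
# weighted_total -> 10000.0
```

The sanity check is worth keeping in real pipelines: if the design-weighted sample total does not equal the frame population, a selection probability was computed incorrectly.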
When Design Weights Aren't Enough
Design weights correct for the sampling design, but they don't address nonresponse. After applying design weights, your weighted sample reflects the intended design but not necessarily the population, because non-respondents are missing.
This is where nonresponse adjustments come in.
Post-Stratification
Post-stratification is the simplest and most widely used nonresponse adjustment. It adjusts the design-weighted sample totals to match known population totals.
How It Works
1. Define adjustment cells. Choose variables where you know both the sample distribution and the population distribution. Common choices: age group, gender, education level, geographic region.
2. Calculate cell totals. Sum the design weights within each cell to get the weighted sample total for that cell.
3. Apply adjustment factors. For each cell c, calculate f_c = N_c / (sum of d_i in cell c), where N_c is the known population total for cell c and the denominator is the sum of design weights in cell c.
4. Compute final weights. Multiply each respondent's design weight by their cell's adjustment factor: w_i = d_i × f_c.
Example
Suppose your population has 1,000 males aged 18-29, but after applying design weights, your sample represents only 600 in that cell. The adjustment factor is 1,000 / 600 ≈ 1.667. Every male aged 18-29 has their weight multiplied by 1.667.
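The four steps above can be sketched in a few lines. This is an illustrative implementation; the cell labels, design weights, and population totals are made-up numbers:

```python
from collections import defaultdict

# Each respondent: (adjustment cell, design weight). Made-up data.
respondents = [
    ("M18-29", 20.0), ("M18-29", 20.0), ("M18-29", 20.0),
    ("F18-29", 20.0), ("F18-29", 20.0),
]

# Known population totals per cell (made-up benchmarks).
population_totals = {"M18-29": 100.0, "F18-29": 50.0}

# Step 2: sum design weights within each cell.
cell_sums = defaultdict(float)
for cell, d in respondents:
    cell_sums[cell] += d

# Step 3: adjustment factor = population total / weighted sample total.
factors = {c: population_totals[c] / cell_sums[c] for c in cell_sums}

# Step 4: final weight = design weight x cell adjustment factor.
final_weights = [(cell, d * factors[cell]) for cell, d in respondents]

# The weighted sample now matches the population total in every cell.
```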
Requirements
Post-stratification requires:
- Known population totals for each adjustment cell. These typically come from census data, administrative records, or well-established reference surveys.
- Sufficient sample size in each cell. If a cell has only 2-3 respondents, their weights become very large, producing unstable estimates. A common rule of thumb is at least 20-30 respondents per cell.
- Cross-classified totals. If you adjust by age × gender, you need to know the population total for each age-gender combination, not just the marginal totals for age and gender separately.
Strengths and Limitations
Strengths:
- Transparent and easy to explain
- Guaranteed to match population totals on the adjustment variables
- Computationally simple
- Well-understood statistical properties
Limitations:
- Requires full cross-classified population totals, which become sparse as you add variables. With 4 age groups × 2 genders × 4 education levels × 5 regions, you have 160 cells, and many will have too few respondents.
- The number of adjustment variables is practically limited to 2-3.
- Assumes homogeneity within cells: respondents and non-respondents in the same cell have similar values on the survey variables.
Raking (Iterative Proportional Fitting)
Raking solves the key limitation of post-stratification: it adjusts for multiple variables simultaneously using only their marginal distributions, not the full cross-classification.
How It Works
Raking is an iterative algorithm that cycles through adjustment variables, adjusting weights to match the marginal distribution of each variable in turn.
Step 1: Adjust weights so the weighted sample matches the population distribution of Variable 1 (e.g., gender).
Step 2: With the adjusted weights from Step 1, adjust again to match the population distribution of Variable 2 (e.g., age group). This may disturb the gender distribution achieved in Step 1.
Step 3: Adjust again for Variable 3 (e.g., education). This may disturb both previous adjustments.
Repeat: Cycle through all variables, adjusting one at a time, until the weighted sample simultaneously matches all marginal distributions within a specified tolerance. Convergence typically takes 10-50 iterations.
Why It Converges
The mathematical foundation is iterative proportional fitting (IPF), first described by Deming and Stephan in 1940. Each iteration reduces the distance between the weighted sample margins and the population margins. Under mild conditions (no structural zeros, positive weights), the algorithm converges to a unique solution.
The intuition: each pass adjusts one dimension while slightly disturbing the others. But each successive pass makes smaller adjustments. The disturbances shrink with each cycle until the weights satisfy all constraints simultaneously.
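The cycle described above fits in a short function. This is a bare-bones sketch of iterative proportional fitting, not a production implementation (no handling of inconsistent margins or missing categories); the respondents and targets are toy data:

```python
def rake(rows, targets, max_iter=100, tol=1e-6):
    """Iterative proportional fitting.

    rows: list of dicts, each with a "weight" key plus one key per variable.
    targets: {variable: {category: population total}} (marginals only).
    """
    for _ in range(max_iter):
        max_gap = 0.0
        for var, marg in targets.items():
            # Current weighted total per category of this variable.
            sums = {cat: 0.0 for cat in marg}
            for r in rows:
                sums[r[var]] += r["weight"]
            max_gap = max(max_gap, max(abs(sums[c] - marg[c]) for c in marg))
            # Scale weights so this variable's margins match the targets
            # (this may disturb previously adjusted variables).
            for r in rows:
                r["weight"] *= marg[r[var]] / sums[r[var]]
        if max_gap < tol:
            break
    return rows

rows = [
    {"gender": "F", "age": "18-34", "weight": 1.0},
    {"gender": "F", "age": "35+",   "weight": 1.0},
    {"gender": "M", "age": "18-34", "weight": 1.0},
    {"gender": "M", "age": "35+",   "weight": 1.0},
    {"gender": "M", "age": "35+",   "weight": 1.0},
]
targets = {
    "gender": {"F": 52.0, "M": 48.0},
    "age": {"18-34": 30.0, "35+": 70.0},
}
raked = rake(rows, targets)
# After convergence, both gender and age margins match the targets,
# even though the joint gender x age distribution was never specified.
```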
Comparison with Post-Stratification
Raking and post-stratification produce identical results when you adjust for only one variable. When you adjust for multiple variables, they differ:
| | Post-Stratification | Raking |
|---|---|---|
| Population data required | Full cross-classification | Marginal totals only |
| Number of variables | 2-3 practical maximum | 5-10 or more |
| Cell-level match | Exact | Approximate (matches margins, not cells) |
| Empty cells | Problematic | Not an issue |
| Transparency | High | Moderate |
If you know that 30% of your population is female aged 18-34, post-stratification guarantees that exactly 30% of your weighted sample falls in that cell. Raking only guarantees that the overall gender split and the overall age split match the population; the joint distribution of gender × age may not match perfectly.
When Raking Fails
Raking can produce poor results when:
- Margins are inconsistent. If the marginal totals come from different sources and are internally contradictory (they don't correspond to any valid joint distribution), the algorithm may not converge or may produce extreme weights.
- Strong interactions exist. If the relationship between nonresponse and the survey variable depends on a specific combination of adjustment variables (not just their marginals), raking's inability to match the joint distribution is a real limitation.
- Starting weights are very unequal. Highly variable design weights combined with raking adjustments can produce extreme final weights.
Practical Considerations
Most statistical software includes raking implementations. In R, the survey package provides rake() and calibrate(). In Stata, ipfweight and survwgt handle raking. In Python, ipfn provides iterative proportional fitting.
When implementing raking:
- Start with the most important adjustment variable (the one most correlated with your key survey outcomes).
- Set convergence tolerance appropriately. Too tight (0.0001) wastes computation; too loose (0.1) leaves noticeable margin mismatches.
- Monitor weight distributions after raking. Check the minimum, maximum, and coefficient of variation of the weights. Flag extreme weights for review.
Propensity Score Weighting
Propensity score methods take a different approach. Instead of directly matching to population totals, they model the mechanism that creates the sample-population discrepancy: the probability of responding (or of being included in a non-probability sample).
The Basic Idea
For each person in your sample, estimate the probability that they would appear in your sample (or respond to your survey) given their characteristics. Call this p̂_i, the estimated propensity score. The weight is the inverse of this probability: w_i = 1 / p̂_i.
People with low propensity scores (those unlikely to respond but who did) receive high weights because they represent many similar people who didn't respond. People with high propensity scores receive low weights because they represent fewer missing people.
Estimating Propensity Scores
The propensity model is typically a logistic regression where:
- The outcome is whether a person responded (1) or not (0), or whether they're in the survey sample (1) or a reference sample (0).
- The predictors are characteristics available for both respondents and non-respondents (or for both the survey sample and reference population).
For probability samples with nonresponse, you need data on non-respondents from the sampling frame (e.g., demographic information from the frame database). For non-probability samples, you need a parallel reference survey (probability-based) that measures the same auxiliary variables.
Application to Non-Probability Samples
Propensity score weighting has become particularly important for non-probability samples (opt-in panels, convenience samples, online volunteer samples). The approach:
- Conduct your non-probability survey.
- Obtain a reference survey from a probability sample covering the same population and measuring the same auxiliary variables.
- Pool both datasets. Create a binary indicator: 1 = in non-probability sample, 0 = in reference sample.
- Fit a propensity model predicting membership in the non-probability sample.
- Weight the non-probability sample by the inverse of the estimated propensity scores.
The goal: make the weighted non-probability sample resemble the population (as represented by the reference survey) on the auxiliary variables.
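The five steps above can be sketched end-to-end. To keep the example self-contained, this uses a tiny hand-rolled logistic regression with a single auxiliary variable and toy data; in practice you would use a proper statistics library and many covariates:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, lr=0.1, epochs=2000):
    """Fit P(y=1 | x) = sigmoid(b0 + b1*x) by batch gradient descent."""
    b0 = b1 = 0.0
    n = len(xs)
    for _ in range(epochs):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(b0 + b1 * x) - y  # prediction error
            g0 += err
            g1 += err * x
        b0 -= lr * g0 / n
        b1 -= lr * g1 / n
    return b0, b1

# Pooled data: x = an auxiliary variable measured in both surveys,
# s = 1 for the non-probability sample, 0 for the reference sample.
# Toy numbers for illustration only.
xs = [0.1, 0.2, 0.3, 0.9, 0.8, 0.7, 0.6, 0.4]
ss = [1,   1,   1,   0,   0,   0,   1,   0]

b0, b1 = fit_logistic(xs, ss)

# Inverse-propensity weights for the non-probability respondents only:
# low estimated propensity -> high weight.
weights = [1.0 / sigmoid(b0 + b1 * x) for x, s in zip(xs, ss) if s == 1]
```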
Strengths and Limitations
Strengths:
- Handles many auxiliary variables naturally through the regression model
- Can capture nonlinear relationships and interactions between predictors
- Flexible model specification (logistic regression, random forests, gradient boosting)
- Works for both probability and non-probability samples
Limitations:
- Only as good as the propensity model. If the model is misspecified (missing important predictors of response or using the wrong functional form), the weights are biased.
- Sensitive to extreme propensity scores. When p̂_i is very small, 1 / p̂_i is very large, producing extreme weights.
- Requires data on both respondents and non-respondents (or a reference sample). This data isn't always available.
- Less transparent than post-stratification or raking. The weights depend on model choices that are harder to communicate.
Doubly Robust Estimation
A significant advance in weighting methodology is doubly robust estimation, which combines propensity score weighting with outcome modeling. The idea: estimate both the propensity model (probability of response) and an outcome model (predicted survey values given characteristics).
The resulting estimator is consistent if either the propensity model or the outcome model is correctly specified. You don't need both to be right, just one. This provides insurance against model misspecification, though it doesn't help if both models are wrong.
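In symbols, one standard doubly robust (augmented inverse-propensity) estimator of a population mean can be written as below, where R_i indicates response, p̂_i is the estimated response propensity, and m̂(x_i) is the outcome model's prediction; the notation here is ours, since the article states the idea only in prose:

```latex
\bar{y}_{DR} = \frac{1}{N} \sum_{i=1}^{N}
  \left[ \frac{R_i\, y_i}{\hat{p}_i}
  - \left( \frac{R_i}{\hat{p}_i} - 1 \right) \hat{m}(x_i) \right]
```

If the propensity model is correct, the augmentation term has expectation zero; if the outcome model is correct, it offsets the error in the inverse-propensity term. Either way the estimator remains consistent.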
Weight Trimming and Smoothing
All weighting methods can produce extreme weights. A single respondent with a weight of 50 means that person represents 50 people in the population. If they happen to have unusual opinions, they heavily influence estimates.
The Variance-Bias Trade-Off
Extreme weights increase the variance of estimates. The design effect due to weighting is approximately deff_w ≈ 1 + CV², where CV is the coefficient of variation of the weights. If your weights have a CV of 1.0, the effective sample size is halved. Every bit of weight variability costs you statistical precision.
But reducing extreme weights (trimming) reintroduces bias. If you cap the maximum weight at 5 when it should be 20, you're underrepresenting the group that respondent belongs to.
Trimming Approaches
Hard trimming. Set a maximum weight. Any weight above the threshold is set to the threshold. Common thresholds: the median weight plus 6 times the interquartile range, or the 95th or 99th percentile of the weight distribution.
Soft trimming. Gradually compress extreme weights rather than hard-capping them. Weights above a threshold are pulled toward the threshold but not fully capped.
Mean-shift trimming. After capping extreme weights, redistribute the excess weight proportionally across all respondents to preserve the total weight sum.
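A minimal sketch of hard trimming with the total weight sum preserved, roughly the mean-shift idea above. The cap value is illustrative; note that the proportional rescaling can lift capped weights slightly above the cap, which is why the two steps are sometimes iterated in practice:

```python
def trim_weights(weights, cap):
    """Cap weights at `cap`, then rescale all weights proportionally
    so the total weight sum is preserved."""
    total = sum(weights)
    capped = [min(w, cap) for w in weights]
    scale = total / sum(capped)  # redistribute the trimmed-off weight
    return [w * scale for w in capped]

weights = [1.0, 1.2, 0.9, 1.1, 6.0]
trimmed = trim_weights(weights, cap=3.0)
# Total weight is unchanged; the extreme weight is pulled down,
# reducing that respondent's influence on estimates.
```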
When to Trim
Trim when:
- A small number of respondents have disproportionate influence (check by calculating the proportion of total weight held by the top 5% of respondents)
- Removing a single respondent substantially changes key estimates (sensitivity analysis)
- The design effect due to weighting is very large (commonly, a deff above 2 warrants investigation)
Choosing a Weighting Method
The right method depends on your data situation:
| Situation | Recommended Approach |
|---|---|
| Probability sample, few known benchmarks | Post-stratification |
| Probability sample, many benchmarks but only marginals | Raking |
| Non-probability sample with reference survey | Propensity score weighting |
| Complex sample with many auxiliary variables | Raking + propensity hybrid |
| Small sample sizes in adjustment cells | Raking (avoids empty cell problem) |
Decision Criteria
What population data do you have? If you have census cross-tabulations for 2-3 variables, post-stratify. If you have marginal distributions for 5-8 variables, rake. If you have a parallel reference survey, consider propensity scores.
What is your sample design? Probability samples with known selection probabilities start with design weights and then adjust for nonresponse. Non-probability samples skip design weights and go directly to propensity or calibration methods.
How important is transparency? Post-stratification is the easiest to explain to stakeholders. Propensity methods are the hardest. For regulatory or policy contexts where methodology will be scrutinized, simpler methods may be preferable.
How much weight variability can you tolerate? More adjustment variables generally means more weight variability. Balance the bias reduction from additional adjustments against the variance increase from more variable weights.
Common Mistakes
Weighting on Too Many Variables
Every additional adjustment variable increases weight variability. If you rake on 10 variables, some respondents will end up with extremely large or small weights. Prioritize variables that are strongly related to both nonresponse and your key survey outcomes. Variables correlated with nonresponse but not with outcomes add variance without reducing bias.
Weighting on Survey Variables
Never weight on the variables you're trying to estimate. If you adjust your sample so that 60% of weighted respondents agree with a policy (matching some external benchmark), you've built that result into your data. Weighting variables should be auxiliary: demographic, geographic, or behavioral characteristics available from external sources, not the attitudinal or opinion measures your survey is designed to study.
Ignoring Weight Variability
Unweighted analyses of weighted data underestimate standard errors. Always use survey-aware estimation methods that account for the weights. In R, use the survey package. In Stata, use svy: prefix commands. In Python, use statsmodels with frequency weights. Standard statistical tests assume equal weights and will produce artificially small p-values and narrow confidence intervals when applied to weighted data.
Assuming Weighting Solves Everything
The most dangerous mistake is believing that weighting has "fixed" your sample. Weighting only adjusts for measured characteristics. If the people who didn't respond differ from respondents in unmeasured ways that matter for your research questions, weighting provides false confidence.
Always report both weighted and unweighted estimates. If they differ substantially, the weighting is doing a lot of work, and you should examine whether your adjustment variables are sufficient.
Evaluating Your Weights
After computing weights, evaluate them before using them for analysis.
Diagnostic Checks
Weight distribution. Report the minimum, maximum, mean, median, and coefficient of variation. Flag any respondent with a weight more than 5-6 times the median weight.
Design effect. Calculate deff_w = 1 + CV² and report the effective sample size n_eff = n / deff_w. If the effective sample size is less than half the actual sample size, your weights are highly variable.
Margin matching. For post-stratification and raking, verify that the weighted sample margins match the population targets. For raking, check all margins; convergence failures can leave some margins off-target.
Sensitivity analysis. Re-run key estimates excluding the respondents with the largest weights. If results change substantially, those respondents are driving your estimates, which is risky.
Comparison of weighted and unweighted estimates. Large differences suggest strong confounding between the adjustment variables and your outcomes. Small differences suggest either (a) your sample was already representative on these characteristics, or (b) the adjustment variables aren't strongly related to your outcomes.
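The weight-distribution and design-effect checks above can be computed directly from the weight vector; a minimal sketch using Kish's approximation deff = 1 + CV²:

```python
import statistics

def weight_diagnostics(weights):
    """Return the CV of the weights, the Kish approximation to the
    design effect (1 + CV^2), and the effective sample size n / deff."""
    n = len(weights)
    mean = sum(weights) / n
    sd = statistics.pstdev(weights)  # population SD of the weights
    cv = sd / mean
    deff = 1.0 + cv**2
    return {"cv": cv, "deff": deff, "n_eff": n / deff}

diag = weight_diagnostics([1.0, 1.0, 1.0, 1.0])
# Equal weights: cv = 0, deff = 1, n_eff = n -- weighting costs nothing.
```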
What Good Weights Look Like
- Low CV (below 0.5 is good, above 1.0 is concerning)
- No single respondent with more than 1-2% of the total weight
- Weighted margins match population targets
- Key estimates stable under sensitivity analysis
- Design effect below 2.0
Weighting in Practice
Tools and Software
Most survey analysis is done in software with built-in weighting support:
- R: The `survey` package (`svydesign()`, `calibrate()`, `rake()`, `postStratify()`) is the gold standard. The `srvyr` package provides a `dplyr`-compatible interface.
- Stata: `svyset` defines the survey design; the `svy:` prefix runs survey-aware analyses. `ipfweight` handles raking.
- SPSS: The COMPLEX SAMPLES module handles weighted analyses.
- Python: `statsmodels` supports frequency and probability weights. The `ipfn` package handles raking.
Reporting Standards
When reporting weighted survey results, include:
- Description of the weighting procedure (method, adjustment variables, population benchmarks)
- Source of population data (census, administrative records, reference survey)
- Summary of weight distribution (range, CV, design effect)
- Effective sample size after weighting
- Whether weight trimming was applied, and the threshold
- Both weighted and unweighted key estimates (at minimum in supplementary material)
The American Association for Public Opinion Research (AAPOR) Transparency Initiative provides detailed reporting standards for survey methodology, including weighting.
Summary
Survey weighting is a necessary but imperfect tool. It corrects for known, measurable differences between your sample and your population, improving the validity of your estimates when used appropriately.
Post-stratification is the right choice when you have a small number of reliable population benchmarks. Raking extends this to more variables when you only have marginal distributions. Propensity score methods handle complex multivariate relationships and non-probability samples.
But weighting is not magic. It increases variance, it can produce unstable estimates when weights are extreme, and it cannot correct for unmeasured bias. The best strategy combines good survey design (maximizing response rates, reducing coverage gaps) with appropriate weighting adjustments, followed by honest assessment of what the weighting can and cannot do.
No amount of statistical adjustment can substitute for a well-designed sample. Invest in design first. Weight second. And always report what you did.
Related Reading:
- Survey Sampling Methods: Probability vs Non-Probability Explained
- How to Calculate Survey Response Rate (With Examples and Formula)
- Survey Measurement Error: Sources, Types, and How to Minimize It
- How to Improve Survey Response Rates: Evidence-Based Strategies
- Survey Sample Size Guide: How to Determine Sample Size
Survey weighting methodology is covered rigorously in Valliant, R., Dever, J. A., & Kreuter, F., Practical Tools for Designing and Weighting Survey Samples (2nd ed.), and in Kish, L., Survey Sampling, the classic text on sampling theory and design.