Inference for Program and Policy Evaluation

Why do inference?

  • We often observe a sample of a population.
  • Estimation under repeated sampling will yield a different result each time.
  • The (hypothetical) collection of results obtained (for a fixed sample size) can be summarized in a sampling distribution.

Why do inference?

  • The standard deviation of the sampling distribution is the standard error of the estimator.
  • We can also construct 95% confidence intervals that will, in expectation, cover the “truth” 95% of the time.
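This repeated-sampling logic is easy to verify by simulation. A minimal sketch (simulated normal data; the population mean, sample size, and seed are illustrative choices, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n, reps = 5.0, 2.0, 100, 2000   # hypothetical population and design

covered = 0
for _ in range(reps):
    sample = rng.normal(mu, sigma, n)
    se = sample.std(ddof=1) / np.sqrt(n)   # standard error of the sample mean
    lo, hi = sample.mean() - 1.96 * se, sample.mean() + 1.96 * se
    covered += (lo <= mu <= hi)            # did this interval cover the truth?

coverage = covered / reps                  # should be close to 0.95
```

Across many repeated samples, roughly 95% of the intervals cover the true mean.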

Why do inference?

  • This repeated sampling framework fits squarely within the generate, estimate, discriminate process we have used to understand estimation, bias, consistency, etc.

Our course thus far

Generate

Estimate

Discriminate

Our course thus far

Generate

Estimate

Discriminate

Our course thus far

Inference Framework

  • This conceptualization will become important later when we think about bootstrapping.
  • But first, we need to understand different types of inference and when we might use each.
  • To do so, we’ll construct a decision framework for how we do inference.

Inference Framework

We can think of two dimensions of uncertainty in a policy evaluation:

  1. Sampling: we only observe a sample of the population of interest.
  2. Treatment assignment: we (often) only observe one potential outcome.

Inference Framework

We can think of two dimensions of uncertainty in a policy evaluation:

  1. Sampling: we only observe a sample of the population of interest.
  2. Treatment assignment: we (often) only observe one potential outcome.

Our objective here is to develop an inference decision framework around these dimensions.

Inference Framework

  • Analytic inference relies on equations that use sample analogues to estimate a population-level parameter, such as a standard error.

Example:

\[ SE = \frac{\sigma}{\sqrt{n}} \]

  • \(SE\) is the standard error of the sample mean
  • \(\sigma\) is the sample standard deviation
  • \(n\) is the sample size
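As a quick illustration of the sample-analogue idea (simulated data; the values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(0.0, 2.0, size=400)      # simulated sample, true sigma = 2

se = x.std(ddof=1) / np.sqrt(len(x))    # sample analogue of sigma / sqrt(n)
# roughly 2 / sqrt(400) = 0.1
```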

Inference Framework

  • Drawback: standard analytic formulas apply for simple random samples of a population.
  • Another consideration is heteroskedasticity (though robust variance formulas are available).
  • In policy evaluation we often have both population sampling and treatment assignment correlated with clusters (e.g., states, households, etc.).

Inference Framework

This leads to the following frequently-asked questions:

  1. Should we use robust standard errors?
  2. Should we cluster?
  3. What method of inference should we use?

Four Questions You Should Ask

  • How were the data sampled?
  • How was treatment assigned?
    • Is treatment assignment perfectly or partially correlated with cluster (e.g., state)?
    • If so, how many clusters do you observe? (large #? large fraction?)

Four Questions You Should Ask

  • If you can answer these questions there is a clear pathway to conducting inference—though note that multiple pathways may exist!
  • We’ll now tackle each of these questions in turn.

Sampling Uncertainty

Sampling Uncertainty

  • This is the standard dimension that classic statistics/econometrics covers.
  • It has to do with the fact that we typically obtain a random sample from the population of interest.

Sampling Uncertainty

  • While simple random samples from a population are certainly feasible, we often analyze data sampled in complex ways.
    • Stratified samples (e.g., independent samples within a fixed number of groups like states)
    • Clustered samples of households.
    • Repeated sampling of households/individuals.
    • Often, all of the above.

Random Sampling

Random Sampling, Random Assignment

Scenario: Random sample of units from a large population with randomized treatment assignment at the unit level.

Random Sampling, Random Assignment

Scenario: Random sample of units from a large population with randomized treatment assignment at the unit level.

No reason to cluster, even if there is within-cluster correlation in outcomes.

  • Clustering standard errors can be harmful, resulting in unnecessarily wide confidence intervals.

  • If the sample represents a large fraction of the population and treatment effects are heterogeneous across units, robust standard errors are also conservative.

Source: Abadie et al. (2022)

Random Sampling, Random Assignment

Scenario: Random sample of units from a large population with randomized treatment assignment at the unit level.

No reason to cluster, even if there is within-cluster correlation in outcomes.

  • Clustering is not appropriate even if there is within-cluster correlation in outcomes (however those clusters are defined), and thus even if clustering makes a substantial difference in the magnitude of the standard errors.

  • However, if the data contain information on attributes of the units that are correlated with unit-level treatment effects, the methods in Abadie et al. (2020) can be applied to obtain less conservative standard errors.

Source: Abadie et al. (2022)

Example

  • Basic 2x2 difference-in-differences design with a null ATT (\(\tau=0\)).
  • Run a linear OLS DID specification for two time periods (pre vs. post).
  • Vary sampling design and assignment mechanisms.

DID Example

Inference under random sampling and random assignment
Method Coefficient SE p-value
Classic Inference -0.022 0.019 0.259
Robust -0.022 0.019 0.259
Clustered -0.022 0.021 0.308

Cluster Sampling, Random Assignment

Cluster Sampling, Random Assignment

Scenario: Clustered sample of units within clusters from a large population with randomized treatment assignment at the unit level.

Cluster Sampling, Random Assignment

Scenario: Clustered sample of units within clusters from a large population with randomized treatment assignment at the unit level.

  • Returns to age at HS graduation within a large sample of US households: large number of clusters (households) sampled, but small fraction of total clusters sampled.

Cluster Sampling, Random Assignment

Scenario: Clustered sample of units within clusters from a large population with randomized treatment assignment at the unit level.

  • \(q_k\) is the fraction of clusters sampled (e.g., 25 out of 50 states = 0.5).
  • \(p_k\) is the fraction of the population sampled.

Source: Abadie et al. (2022)

Cluster Sampling, Random Assignment

Scenario: Clustered sample of units within clusters from a large population with randomized treatment assignment at the unit level.

If \(q_k\) small or if \(q_k\) large but \(p_k\) small, clustered standard errors are asymptotically correct (with large # of clusters).

  • \(q_k\) is the fraction of clusters sampled (e.g., 25 out of 50 states = 0.5).
  • \(p_k\) is the fraction of the population sampled.

Source: Abadie et al. (2022)

Cluster Sampling, Random Assignment

Scenario: Clustered sample of units within clusters from a large population with randomized treatment assignment at the unit level.

If \(q_k\) small or if \(q_k\) large but \(p_k\) small, clustered standard errors are asymptotically correct (with large # of clusters).

If the total number of clusters is small (e.g., 30 or less), you may need to adopt a different inference approach. More later!

  • \(q_k\) is the fraction of clusters sampled (e.g., 25 out of 50 states = 0.5).
  • \(p_k\) is the fraction of the population sampled.

Source: Abadie et al. (2022)

DID Example

If \(q_k\) small or if \(q_k\) large but \(p_k\) small, clustered standard errors are asymptotically correct (with large # of clusters).

Inference under cluster (30/50) sampling and random assignment

Method      Coefficient   SE      p-value
Robust      -0.021        0.019   0.259
Clustered   -0.021        0.023   0.355

DID Example

If the total number of clusters is small (e.g., 30 or less), you may need to adopt a different inference approach. More later!

Inference under cluster (10/50) sampling and random assignment

Method      Coefficient   SE      p-value
Robust      0.01          0.015   0.518
Clustered   0.01          0.010   0.349

Treatment Assignment

Treatment Assignment

  • This dimension of uncertainty occurs because we often only observe one potential outcome for each treated unit.
  • This would remain the case even if we observed the entire population of interest!

Treatment Assignment

unit_id   cluster   w_i   y_obs   y_1     y_0
843943    44        0     0.66    ?       0.66
726814    32        1     0.93    0.93    ?
265572    10        0     -1.65   ?       -1.65
105153    6         1     -1.54   -1.54   ?
885526    44        1     0.09    0.09    ?

Treatment Assignment

  • Often, treatment assignment is correlated with clusters (e.g., state, county, etc.).
  • If treatment is perfectly correlated with cluster, then everyone within the same cluster is either treated or untreated.
    • Effect of Medicaid expansion on earnings.
  • Alternatively, treatment could be partially correlated with cluster.
    • Effect of attending college on earnings, where it is observed that college attendance is correlated within state, county, etc.

Random Sampling, Clustered Assignment

Scenario: Random sample of units from a large population with partially or fully clustered treatment assignment.

Random Sampling, Clustered Assignment

Scenario: Random sample of units from a large population with partially or fully clustered treatment assignment.

If assignment is perfectly clustered, and you observe a large number of clusters, use standard cluster adjustment.

  • Cluster at the level of the intervention (e.g., state)

Source: Abadie et al. (2022)

DID Example

  • Random sampling of population
  • All 50 state clusters observed
  • Treatment perfectly correlated with cluster
Method      Coefficient   SE      p-value
Robust      -0.431        0.019   0.000
Clustered   -0.431        0.339   0.209

Random Sampling, Clustered Assignment

Scenario: Random sample of units from a large population with partially or fully clustered treatment assignment.

  • If treatment is not perfectly correlated with cluster, then clustered standard errors may be too conservative.
  • Abadie et al. (2022) provide analytic formulas as well as a bootstrap procedure that can be used for inference.

Bootstrap Inference

  • Bootstrapping is a way to conduct inference when large-sample theory provides a poor guide.
    • Small number of clusters
    • Treatment assignment that is partially correlated with cluster.
  • You will see multiple bootstrap methods in the next slides.
  • Let’s spend a few minutes on intuition for the bootstrap…

Bootstrap Inference

Back to basics:

Generate

Estimate

Discriminate

Bootstrap Inference

  • Basic idea is to generate a large number of bootstrap samples that mimic the distribution from which the actual sample was obtained (Roodman et al. 2019).

Bootstrap

Estimate

Discriminate

Bootstrap Inference

Bootstrap Inference

  • We repeatedly resample the sample (with replacement) and generate the parameter of interest each time.
  • The bootstrap p-value is calculated as the proportion of bootstrap estimates that are more extreme than the actual estimate from the original sample.
  • In this way, we essentially mimic the process we’ve used all along to understand the properties of estimators by generating data!
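The resampling loop described above can be sketched as follows (simulated two-group data; the centered p-value calculation is one common convention, an assumption here rather than the slides' exact formula):

```python
import numpy as np

rng = np.random.default_rng(2)
y0 = rng.normal(0.0, 1.0, 60)      # simulated control outcomes
y1 = rng.normal(0.3, 1.0, 60)      # simulated treated outcomes
tau_hat = y1.mean() - y0.mean()    # estimate from the "original" sample

B = 2000
boot = np.empty(B)
for b in range(B):
    # resample each group with replacement and re-estimate the difference
    b0 = rng.choice(y0, size=len(y0), replace=True)
    b1 = rng.choice(y1, size=len(y1), replace=True)
    boot[b] = b1.mean() - b0.mean()

se_boot = boot.std(ddof=1)         # bootstrap standard error
# two-sided p-value for H0: tau = 0, using the centered bootstrap distribution
p_val = np.mean(np.abs(boot - tau_hat) >= abs(tau_hat))
```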

Bootstrap Inference

  • This is how bootstrapping works at a high level.
  • The various bootstrap procedures you’ll need have some additional wrinkles because of the clustered nature of sampling and/or treatment assignment.
  • Let’s now go back to the objective that brought us here …

Random Sampling, Clustered Assignment

Scenario: Random sample of units from a large population with partially or fully clustered treatment assignment.

  • If treatment is not perfectly correlated with cluster, then clustered standard errors may be too conservative.
  • Abadie et al. (2022) provide analytic formulas as well as a bootstrap procedure that can be used for inference.
  • The analytic variance formulas are called Causal Cluster Variance (CCV).
  • This bootstrap procedure is called two-stage cluster bootstrap (TSCB).

Clustered Assignment

Scenario: Random or clustered sample of units from a large population with fully clustered treatment assignment.

If assignment is perfectly clustered, and you observe a large number of clusters, use standard cluster adjustment.

Clustered Assignment

Scenario: Random or clustered sample of units from a large population with fully clustered treatment assignment.

If assignment is perfectly clustered, and you observe a large number of clusters, use standard cluster adjustment.

Scenario: Random or clustered sample of units from a large population with partially clustered treatment assignment.

If assignment is partially clustered (i.e., some variation in Tx assignment within cluster), use CCV/TSCB.

Clustered Assignment

Treatment perfectly correlated with cluster:

unit_id   cluster   y_i     pr_treated   w_i
309733    13        -0.58   0            0
498850    26        -1.13   0            0
418742    22        -1.64   0            0
385360    19        -1.39   0            0
603293    28        -0.94   0            0
998667    50        -0.99   1            1

Clustered Assignment

Treatment perfectly correlated with cluster:

unit_id   cluster   y_i     pr_treated   w_i
309733    13        -0.58   0            0
498850    26        -1.13   0            0
418742    22        -1.64   0            0
385360    19        -1.39   0            0
603293    28        -0.94   0            0
998667    50        -0.99   1            1

Treatment partially correlated with cluster:

unit_id   cluster   y_i     pr_treated   w_i
309733    13        -0.58   0.47         1
498850    26        -1.13   0.48         0
418742    22        -1.64   0.44         1
385360    19        -1.39   0.66         1
603293    28        -0.94   0.58         1
998667    50        -0.99   0.40         0

A Practical Guide to TSCB

  1. Start with your sample. At a minimum, you’ll need
    • Outcome
    • Treatment indicator
    • Cluster variable

A Practical Guide to TSCB

  2. Determine \(q_k\), the fraction of (population-level) clusters that are sampled.
    • It could be high! (e.g., for a state policy evaluation using a national survey that includes all states, this implies \(q_k=1.0\))
    • It could be low! (e.g., a national survey of households)

A Practical Guide to TSCB

  3. Choose the number of bootstrap replications (e.g., \(B=100\)).
    • You’re now ready to iterate through the bootstrap procedure replication by replication!

A Practical Guide to TSCB

For each bootstrap replication …

  4. Create a “pseudo-sample” by replicating your original sample \(1/q_k\) times.

    • If \(1/q_k\) is not an integer, you can use the methods in Chao and Lo (1985).
    • Essentially, you solve for a fraction \(\alpha\) using a fairly simple formula and some basic algebra.
    • If you observe 20 of 50 clusters (such that \(q_k=0.4\) and \(1/q_k=2.5\)), as you proceed through the \(B\) bootstrap replications, you’d replicate the sample 2 times with probability \(\alpha\) or 3 times with probability \(1-\alpha\).
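One way to derive \(\alpha\) is by matching the expected number of replications to \(1/q_k\); a sketch under that assumption (the function name is hypothetical, and this is a simplification of the Chao and Lo (1985) method):

```python
import math

def replication_scheme(q_k):
    """Split 1/q_k into integer replication counts m and m + 1, with
    probabilities alpha and 1 - alpha, so that the expected number of
    replications equals 1/q_k (expectation matching; a simplified
    version of Chao and Lo, 1985)."""
    inv = 1.0 / q_k
    m = math.floor(inv)
    alpha = (m + 1) - inv   # solves m*alpha + (m+1)*(1-alpha) = 1/q_k
    return m, alpha

m, alpha = replication_scheme(0.4)   # 20 of 50 clusters observed
# replicate the sample m times w.p. alpha, m + 1 times w.p. 1 - alpha
```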

A Practical Guide to TSCB

For each bootstrap replication …

  5. For each cluster in your pseudo-sample, calculate and record the fraction of observations treated.
    • This is the empirical distribution of cluster-level treatment fractions, or \(\bar{W}_{k,m}\).
    • NOTE: To proceed, \(\bar{W}_{k,m}\) must be strictly between 0 and 1 for all clusters.
    • If treatment assignment is perfectly correlated with cluster, just use cluster-robust inference (if you have a sufficient number of clusters; if you don’t, skip to the section on inference with a small number of clusters).

A Practical Guide to TSCB

For each bootstrap replication …

  6. Randomly draw clusters from your pseudo-sample.
    • If you observe 20 clusters in your original sample, then randomly draw 20 clusters from your pseudo-sample.
    • Draw these clusters without replacement.

A Practical Guide to TSCB

For each bootstrap replication …

  7. For each of these randomly drawn clusters, sample a treatment fraction value from \(\bar{W}_{k,m}\).
    • You sample here with replacement.
    • This is the new assignment probability for each sampled cluster, called \(A_{k,m}\).

A Practical Guide to TSCB

For each bootstrap replication …

  8. Next, within each sampled cluster (each of which has cluster sample size \(N_{k,m}\)), sample (with replacement) treated and control units.
    • Sample \(N_{k,m}A_{k,m}\) treated units.
    • Sample \(N_{k,m}(1 - A_{k,m})\) untreated units.
    • Do this for each sampled cluster in your bootstrap replicate sample.

A Practical Guide to TSCB

For each bootstrap replication …

  9. Estimate your outcome regression (e.g., DID using least squares or a fixed-effects model) on this bootstrap sample, and record the statistic/parameter you are interested in (e.g., \(\hat \tau\)).

A Practical Guide to TSCB

  10. Repeat steps 4-9 \(B\) times and collect the \(\hat \tau\) estimates.
    • You can now calculate the standard deviation of these estimates as an estimate of the standard error.
    • You could also calculate a p-value by calculating the fraction of bootstrap estimates that are as or more extreme than your estimate from the original data.
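The per-replication loop can be sketched on simulated data. This is a simplified illustration, not the exact Abadie et al. (2022) procedure: it assumes \(1/q_k\) is an integer and uses a difference in means in place of the DID regression:

```python
import numpy as np

rng = np.random.default_rng(3)

# Simulated sample: 25 of 50 clusters observed (q_k = 0.5), treatment
# partially correlated with cluster, true treatment effect zero.
K, n_k, q_k = 25, 40, 0.5
cluster = np.repeat(np.arange(K), n_k)
pr_treat = rng.uniform(0.3, 0.7, K)[cluster]   # cluster-level Tx propensity
w = rng.binomial(1, pr_treat)
y = rng.normal(0.0, 1.0, K * n_k) + 0.5 * rng.normal(0.0, 1.0, K)[cluster]

B = 200
taus = np.empty(B)
for b in range(B):
    # Step 4: pseudo-sample = the sample replicated 1/q_k times (an integer
    # here; Chao and Lo (1985) handle the non-integer case)
    reps = int(round(1 / q_k))
    pc = np.concatenate([cluster + k * K for k in range(reps)])
    pw, py = np.tile(w, reps), np.tile(y, reps)
    ids = np.unique(pc)
    # Step 5: empirical distribution of cluster-level treated fractions
    frac = np.array([pw[pc == c].mean() for c in ids])
    # Step 6: draw K clusters from the pseudo-sample WITHOUT replacement
    drawn = rng.choice(ids, size=K, replace=False)
    # Step 7: give each drawn cluster a treated fraction, sampled WITH replacement
    A = rng.choice(frac, size=K, replace=True)
    y_b, w_b = [], []
    for c, a in zip(drawn, A):
        y_c, w_c = py[pc == c], pw[pc == c]
        n_c = len(y_c)
        n_t = int(round(n_c * a))
        # Step 8: resample treated and control units within the cluster
        y_b.append(rng.choice(y_c[w_c == 1], size=n_t, replace=True))
        y_b.append(rng.choice(y_c[w_c == 0], size=n_c - n_t, replace=True))
        w_b += [1] * n_t + [0] * (n_c - n_t)
    y_b, w_b = np.concatenate(y_b), np.array(w_b)
    # Step 9: re-estimate (difference in means stands in for the DID regression)
    taus[b] = y_b[w_b == 1].mean() - y_b[w_b == 0].mean()

se_tscb = taus.std(ddof=1)   # Step 10: SD of the estimates is the SE estimate
```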

DID Example

Method                        Coefficient   SE      p-value
Classic Inference             0.028         0.019   0.143
Robust                        0.028         0.019   0.140
Clustered                     0.028         0.106   0.797
Two-Stage Cluster Bootstrap   0.028         0.082   0.810

Small Cluster Bias

  • Thus far we have assumed we have a sufficiently large number of clusters (e.g., >30) in our sample.
  • The asymptotic justification for cluster-robust standard errors assumes that the number of clusters goes to infinity.

Small Cluster Bias

  • What happens if we have only a few clusters?
    • Example: Evaluation of a state-level policy change based on a sample of 8 states with 4 treated and 4 untreated.
  • Cameron, Gelbach, and Miller (2008): With a small number of clusters, cluster-robust standard errors are downward biased.
    • You’re going to over-reject the null hypothesis because your standard errors are too small.

DID Example

If the total number of clusters is small (e.g., 30 or less), you may need to adopt a different inference approach.

Inference under cluster (10/50) sampling and random assignment

Method      Coefficient   SE      p-value
Robust      0.01          0.015   0.518
Clustered   0.01          0.010   0.349

Small Cluster Bias

  • How can we deal with this?
  • Two primary ways (though there are others):
    • Wild cluster bootstrap.
    • Randomization inference.

Wild cluster bootstrap

  • Same basic idea as a regular bootstrap.
  • Main difference is that rather than resampling rows of our data, we resample the residuals after fitting our outcome model.

Wild cluster bootstrap

For each bootstrap replicate …

  • Keep each observation’s residual from the main regression model.
  • We then apply a random weight to each residual.
    • The weight is -1 with probability 0.5 and 1 with probability 0.5 (a Rademacher weight).
  • The catch is that everyone within the same cluster receives the same weight value (i.e., we draw the 1 or -1 once per cluster).

Wild cluster bootstrap

For each bootstrap replicate …

  • Because we only resample and reweight the residuals, the other X variables used in the regression stay the same.
  • The coefficients and X’s from the original regression can be used to obtain a predicted value.
  • This predicted value plus the new (sampled and re-weighted) residuals are used to construct a new outcome value.

Wild cluster bootstrap

For each bootstrap replicate …

  • Re-estimate the regression using the new outcome values.
  • Repeat \(B\) times and collect the estimates.
  • You can use these estimates to construct a p-value based on how extreme the original value is within the distribution of estimates generated via the bootstrap samples.
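The steps above can be sketched as follows. A simplified, hypothetical illustration on simulated data: it compares bootstrap coefficients rather than t-statistics, and imposes the null when forming the residuals (as recommended by Cameron, Gelbach, and Miller 2008):

```python
import numpy as np

rng = np.random.default_rng(4)

# Simulated data: 10 clusters, treatment perfectly correlated with cluster,
# true treatment effect zero, with a cluster-level shock in the outcome.
K, n_k = 10, 50
cluster = np.repeat(np.arange(K), n_k)
w = (cluster < 5).astype(float)        # 5 treated, 5 untreated clusters
y = rng.normal(0.0, 1.0, K * n_k) + 0.5 * rng.normal(0.0, 1.0, K)[cluster]

# Fit the main regression y = a + tau * w + e by least squares
X = np.column_stack([np.ones_like(w), w])
tau_hat = np.linalg.lstsq(X, y, rcond=None)[0][1]

# Residuals with the null (tau = 0) imposed: restricted model is intercept-only
resid0 = y - y.mean()

B = 999
taus = np.empty(B)
for b in range(B):
    # Rademacher weights: -1 or +1 with probability 0.5, drawn once PER CLUSTER
    g = rng.choice([-1.0, 1.0], size=K)[cluster]
    y_star = y.mean() + g * resid0     # new outcome; the X's stay fixed
    taus[b] = np.linalg.lstsq(X, y_star, rcond=None)[0][1]

# p-value: how extreme is the original estimate in the bootstrap distribution?
p_val = np.mean(np.abs(taus) >= abs(tau_hat))
```

In practice you would use a packaged routine (e.g., boottest-style implementations in Stata or R) rather than hand-rolling this loop.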

Wild cluster bootstrap

  • Wild cluster bootstrap is easily executed in both Stata and R.
  • New “fast” routines can complete this process in seconds (or less).

DID Example

If the total number of clusters is small (e.g., 30 or less), you may need to adopt a different inference approach.

Inference under cluster (10/50) sampling and clustered assignment

Method                        Coefficient   SE      p-value
Robust                        0.009         0.015   0.537
Clustered                     0.009         0.057   0.873
Wild Cluster                  0.009         NA      0.906
Two-Stage Cluster Bootstrap   0.009         0.096   0.940

Permutation Inference

  • Another approach (though it can be a bit more conservative) is to use randomization inference.
  • Basic idea here is to randomly permute the treatment indicator.
  • This is a useful approach when treatment is perfectly correlated with clusters (e.g., state policy change) and you only have a few clusters in your sample.

Permutation Inference

  • Example: State policy change with 12 total states (4 treated, 8 untreated)
  • Idea: Iterative exercise where you randomly select the 4 “treated” states and re-estimate the outcome model each time.
    • There are 495 possible combinations of 4 treated states out of 12 total states.
    • You can therefore construct a distribution of 495 different treatment effect estimates.
    • How “extreme” is the estimate in your primary sample within this distribution of 495 estimates?
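The enumeration described above can be sketched directly (simulated state-level outcomes with a true effect of zero by construction):

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(5)

# 12 state-level average outcomes: 4 treated, 8 untreated, true effect zero
states = np.arange(12)
y = rng.normal(0.0, 1.0, 12)
treated = [0, 1, 2, 3]                  # the actually treated states

mask = np.isin(states, treated)
tau_hat = y[mask].mean() - y[~mask].mean()

# Enumerate all 12-choose-4 = 495 possible treated sets, re-estimating each time
perm_taus = []
for combo in combinations(range(12), 4):
    m = np.isin(states, combo)
    perm_taus.append(y[m].mean() - y[~m].mean())
perm_taus = np.array(perm_taus)

# Exact p-value: share of permutation estimates at least as extreme as tau_hat
p_exact = np.mean(np.abs(perm_taus) >= abs(tau_hat))
```

Because the actual assignment is one of the 495 permutations, the exact p-value is always at least 1/495.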

Permutation Inference

  • A nice feature of this approach is that you get an exact p-value.
  • If the total possible number of permutations is large (e.g., 20 choose 10 = 184,756), you could always just permute cluster treatment status, say, 1,000 times to approximate the distribution.
  • A downside is that constructing 95% confidence intervals requires strong assumptions (homogeneity of treatment effect).

References

Abadie, Alberto, Susan Athey, Guido W. Imbens, and Jeffrey M. Wooldridge. 2020. “Sampling-Based Versus Design-Based Uncertainty in Regression Analysis.” Econometrica 88 (1): 265–96.
Abadie, Alberto, Susan Athey, Guido Imbens, and Jeffrey Wooldridge. 2022. “When Should You Adjust Standard Errors for Clustering?” arXiv. https://doi.org/10.48550/arXiv.1710.02926.
Cameron, A. Colin, Jonah B. Gelbach, and Douglas L. Miller. 2008. “Bootstrap-Based Improvements for Inference with Clustered Errors.” The Review of Economics and Statistics 90 (3): 414–27.
Chao, Min-Te, and Shaw-Hwa Lo. 1985. “A Bootstrap Method for Finite Population.” Sankhyā: The Indian Journal of Statistics, Series A, 399–405.
Roodman, David, Morten Ørregaard Nielsen, James G. MacKinnon, and Matthew D. Webb. 2019. “Fast and Wild: Bootstrap Inference in Stata Using Boottest.” The Stata Journal 19 (1): 4–60.