Inference for Program and Policy Evaluation

Why do inference?

  • We often observe a sample of a population.
  • Estimation under repeated sampling will yield a different result each time.
  • The (hypothetical) collection of results obtained (for a fixed sample size) can be summarized in a sampling distribution.

Why do inference?

  • The standard deviation of the sampling distribution is the standard error of the estimator.
  • We can also construct 95% confidence intervals that will, in expectation, cover the “truth” 95% of the time.
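This repeated-sampling logic is easy to verify by simulation. A minimal sketch (simulated normal data; the population mean, sample size, and seed are illustrative choices, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n, reps = 5.0, 2.0, 100, 2000   # hypothetical population and design

covered = 0
for _ in range(reps):
    sample = rng.normal(mu, sigma, n)
    se = sample.std(ddof=1) / np.sqrt(n)   # standard error of the sample mean
    lo, hi = sample.mean() - 1.96 * se, sample.mean() + 1.96 * se
    covered += (lo <= mu <= hi)            # did this interval cover the truth?

coverage = covered / reps                  # should be close to 0.95
```

Across many repeated samples, roughly 95% of the intervals cover the true mean.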

Why do inference?

  • This repeated sampling framework fits squarely within the generate, estimate, discriminate process we have used to understand estimation, bias, consistency, etc.

Our course thus far

Generate

Estimate

Discriminate

Our course thus far

Generate

Estimate

Discriminate

Our course thus far

Inference Framework

  • This conceptualization will become important later when we think about bootstrapping.
  • But first, we need to understand different types of inference and when we might use each.
  • To do so, we’ll construct a decision framework for how we do inference.

Inference Framework

We can think of two dimensions of uncertainty in a policy evaluation:

  1. Sampling: we only observe a sample of the population of interest.
  2. Treatment assignment: we (often) only observe one potential outcome.

Inference Framework

We can think of two dimensions of uncertainty in a policy evaluation:

  1. Sampling: we only observe a sample of the population of interest.
  2. Treatment assignment: we (often) only observe one potential outcome.

Our objective here is to develop an inference decision framework around these dimensions.

Inference Framework

  • Analytic inference relies on equations that use sample analogues to estimate a population-level parameter, such as a standard error.

Example:

\[ SE = \frac{\sigma}{\sqrt{n}} \]

  • \(SE\) is the standard error of the sample mean
  • \(\sigma\) is the sample standard deviation
  • \(n\) is the sample size
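As a quick illustration of the sample-analogue idea (simulated data; the values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(0.0, 2.0, size=400)      # simulated sample, true sigma = 2

se = x.std(ddof=1) / np.sqrt(len(x))    # sample analogue of sigma / sqrt(n)
# roughly 2 / sqrt(400) = 0.1
```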

Inference Framework

  • Drawback: standard analytic formulas apply for simple random samples of a population.
  • Another consideration is heteroskedasticity (though robust variance formulas are available).
  • In policy evaluation we often have both population sampling and treatment assignment correlated with clusters (e.g., states, households, etc.).

Inference Framework

This leads to the following frequently-asked questions:

  1. Should we use robust standard errors?
  2. Should we cluster?
  3. What method of inference should we use?

Four Questions You Should Ask

  • How were the data sampled?
  • How was treatment assigned?
    • Is treatment assignment perfectly or partially correlated with cluster (e.g., state)?
    • If so, how many clusters do you observe? (large #? large fraction?)

Four Questions You Should Ask

  • If you can answer these questions there is a clear pathway to conducting inference—though note that multiple pathways may exist!
  • We’ll now tackle each of these questions in turn.

Sampling Uncertainty

Sampling Uncertainty

  • This is the standard dimension that classic statistics/econometrics covers.
  • It has to do with the fact that we typically obtain a random sample from the population of interest.

Sampling Uncertainty

  • While simple random samples from a population are certainly feasible, we often analyze data sampled in complex ways.
    • Stratified samples (e.g., independent samples within a fixed number of groups like states)
    • Clustered samples of households.
    • Repeated sampling of households/individuals.
    • Often, all of the above.

Random Sampling

Random Sampling, Random Assignment

Scenario: Random sample of units from a large population with randomized treatment assignment at the unit level.

Random Sampling, Random Assignment

Scenario: Random sample of units from a large population with randomized treatment assignment at the unit level.

No reason to cluster, even if there is within-cluster correlation in outcomes.

  • Clustering standard errors can be harmful, resulting in unnecessarily wide confidence intervals.

  • If the sample represents a large fraction of the population and treatment effects are heterogeneous across units, robust standard errors are also conservative.

Source: Abadie et al. (2022)

Random Sampling, Random Assignment

Scenario: Random sample of units from a large population with randomized treatment assignment at the unit level.

No reason to cluster, even if there is within-cluster correlation in outcomes.

  • Clustering is not appropriate even if there is within-cluster correlation in outcomes (however those clusters are defined), and thus even if clustering makes a substantial difference in the magnitude of the standard errors.

  • However, if the data contain information on attributes of the units that are correlated with unit-level treatment effects, the methods in Abadie et al. (2020) can be applied to obtain less conservative standard errors.

Source: Abadie et al. (2022)

Example

  • Basic 2x2 difference-in-differences design with a null ATT (\(\tau=0\)).
  • Run a linear OLS DID specification for two time periods (pre vs. post).
  • Vary sampling design and assignment mechanisms.

DID Example

Inference under random sampling and random assignment
Method Coefficient SE p-value
Classic Inference -0.022 0.019 0.259
Robust -0.022 0.019 0.259
Clustered -0.022 0.021 0.308

Cluster Sampling, Random Assignment

Cluster Sampling, Random Assignment

Scenario: Clustered sample of units within clusters from a large population with randomized treatment assignment at the unit level.

Cluster Sampling, Random Assignment

Scenario: Clustered sample of units within clusters from a large population with randomized treatment assignment at the unit level.

  • Returns to age at HS graduation within a large sample of US households: large number of clusters (households) sampled, but small fraction of total clusters sampled.

Cluster Sampling, Random Assignment

Scenario: Clustered sample of units within clusters from a large population with randomized treatment assignment at the unit level.

  • \(q_k\) is the fraction of clusters sampled (e.g., 25 out of 50 states = 0.5).
  • \(p_k\) is the fraction of the population sampled.

Source: Abadie et al. (2022)

Cluster Sampling, Random Assignment

Scenario: Clustered sample of units within clusters from a large population with randomized treatment assignment at the unit level.

If \(q_k\) small or if \(q_k\) large but \(p_k\) small, clustered standard errors are asymptotically correct (with large # of clusters).

  • \(q_k\) is the fraction of clusters sampled (e.g., 25 out of 50 states = 0.5).
  • \(p_k\) is the fraction of the population sampled.

Source: Abadie et al. (2022)

Cluster Sampling, Random Assignment

Scenario: Clustered sample of units within clusters from a large population with randomized treatment assignment at the unit level.

If \(q_k\) small or if \(q_k\) large but \(p_k\) small, clustered standard errors are asymptotically correct (with large # of clusters).

If the total number of clusters is small (e.g., 30 or less), you may need to adopt a different inference approach. More later!

  • \(q_k\) is the fraction of clusters sampled (e.g., 25 out of 50 states = 0.5).
  • \(p_k\) is the fraction of the population sampled.

Source: Abadie et al. (2022)

DID Example

If \(q_k\) small or if \(q_k\) large but \(p_k\) small, clustered standard errors are asymptotically correct (with large # of clusters).

Inference under cluster (30/50) sampling and random assignment

Method      Coefficient   SE      p-value
Robust      -0.021        0.019   0.259
Clustered   -0.021        0.023   0.355

DID Example

If the total number of clusters is small (e.g., 30 or less), you may need to adopt a different inference approach. More later!

Inference under cluster (10/50) sampling and random assignment

Method      Coefficient   SE      p-value
Robust      0.01          0.015   0.518
Clustered   0.01          0.010   0.349

Treatment Assignment

Treatment Assignment

  • This dimension of uncertainty occurs because we often only observe one potential outcome for each treated unit.
  • This would remain the case even if we observed the entire population of interest!

Treatment Assignment

unit_id   cluster   w_i   y_obs   y_1     y_0
843943    44        0     0.66    ?       0.66
726814    32        1     0.93    0.93    ?
265572    10        0     -1.65   ?       -1.65
105153    6         1     -1.54   -1.54   ?
885526    44        1     0.09    0.09    ?

Treatment Assignment

  • Often, treatment assignment is correlated with clusters (e.g., state, county, etc.).
  • If treatment is perfectly correlated with cluster, then everyone within the same cluster is either treated or untreated.
    • Effect of Medicaid expansion on earnings.
  • Alternatively, treatment could be partially correlated with cluster.
    • Effect of attending college on earnings, where it is observed that college attendance is correlated within state, county, etc.

Random Sampling, Clustered Assignment

Scenario: Random sample of units from a large population with partially or fully clustered treatment assignment.

Random Sampling, Clustered Assignment

Scenario: Random sample of units from a large population with partially or fully clustered treatment assignment.

If assignment is perfectly clustered, and you observe a large number of clusters, use standard cluster adjustment.

  • Cluster at the level of the intervention (e.g., state)

Source: Abadie et al. (2022)

DID Example

  • Random sampling of population
  • All 50 state clusters observed
  • Treatment perfectly correlated with cluster
Method      Coefficient   SE      p-value
Robust      -0.431        0.019   0.000
Clustered   -0.431        0.339   0.209

Random Sampling, Clustered Assignment

Scenario: Random sample of units from a large population with partially or fully clustered treatment assignment.

  • If treatment is not perfectly correlated with cluster, then clustered standard errors may be too conservative.
  • Abadie et al. (2022) provide analytic formulas as well as a bootstrap procedure that can be used for inference.

Bootstrap Inference

  • Bootstrapping is a way to conduct inference when large-sample theory provides a poor guide.
    • Small number of clusters
    • Treatment assignment that is partially correlated with cluster.
  • You will see multiple bootstrap methods in the next slides.
  • Let’s spend a few minutes on intuition for the bootstrap…

Bootstrap Inference

Back to basics:

Generate

Estimate

Discriminate

Bootstrap Inference

  • Basic idea is to generate a large number of bootstrap samples that mimic the distribution from which the actual sample was obtained (Roodman et al. 2019).

Bootstrap

Estimate

Discriminate

Bootstrap Inference

Bootstrap Inference

  • We repeatedly resample the sample (with replacement) and generate the parameter of interest each time.
  • The bootstrap p-value is calculated as the proportion of bootstrap estimates that are more extreme than the actual estimate from the original sample.
  • In this way, we essentially mimic the process we’ve used all along to understand the properties of estimators by generating data!
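The resampling loop described above can be sketched as follows (simulated two-group data; the centered p-value calculation is one common convention, an assumption here rather than the slides' exact formula):

```python
import numpy as np

rng = np.random.default_rng(2)
y0 = rng.normal(0.0, 1.0, 60)      # simulated control outcomes
y1 = rng.normal(0.3, 1.0, 60)      # simulated treated outcomes
tau_hat = y1.mean() - y0.mean()    # estimate from the "original" sample

B = 2000
boot = np.empty(B)
for b in range(B):
    # resample each group with replacement and re-estimate the difference
    b0 = rng.choice(y0, size=len(y0), replace=True)
    b1 = rng.choice(y1, size=len(y1), replace=True)
    boot[b] = b1.mean() - b0.mean()

se_boot = boot.std(ddof=1)         # bootstrap standard error
# two-sided p-value for H0: tau = 0, using the centered bootstrap distribution
p_val = np.mean(np.abs(boot - tau_hat) >= abs(tau_hat))
```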

Bootstrap Inference

  • This is how bootstrapping works at a high level.
  • The various bootstrap procedures you’ll need have some additional wrinkles because of the clustered nature of sampling and/or treatment assignment.
  • Let’s now go back to the objective that brought us here …

Random Sampling, Clustered Assignment

Scenario: Random sample of units from a large population with partially or fully clustered treatment assignment.

  • If treatment is not perfectly correlated with cluster, then clustered standard errors may be too conservative.
  • Abadie et al. (2022) provide analytic formulas as well as a bootstrap procedure that can be used for inference.
  • The analytic variance formulas are called Causal Cluster Variance (CCV).
  • This bootstrap procedure is called two-stage cluster bootstrap (TSCB).

Clustered Assignment

Scenario: Random or clustered sample of units from a large population with fully clustered treatment assignment.

If assignment is perfectly clustered, and you observe a large number of clusters, use standard cluster adjustment.

Clustered Assignment

Scenario: Random or clustered sample of units from a large population with fully clustered treatment assignment.

If assignment is perfectly clustered, and you observe a large number of clusters, use standard cluster adjustment.

Scenario: Random or clustered sample of units from a large population with partially clustered treatment assignment.

If assignment is partially clustered (i.e., some variation in Tx assignment within cluster), use CCV/TSCB.

Clustered Assignment

Treatment perfectly correlated with cluster:

unit_id   cluster   y_i     pr_treated   w_i
309733    13        -0.58   0            0
498850    26        -1.13   0            0
418742    22        -1.64   0            0
385360    19        -1.39   0            0
603293    28        -0.94   0            0
998667    50        -0.99   1            1

Clustered Assignment

Treatment perfectly correlated with cluster:

unit_id   cluster   y_i     pr_treated   w_i
309733    13        -0.58   0            0
498850    26        -1.13   0            0
418742    22        -1.64   0            0
385360    19        -1.39   0            0
603293    28        -0.94   0            0
998667    50        -0.99   1            1

Treatment partially correlated with cluster:

unit_id   cluster   y_i     pr_treated   w_i
309733    13        -0.58   0.47         1
498850    26        -1.13   0.48         0
418742    22        -1.64   0.44         1
385360    19        -1.39   0.66         1
603293    28        -0.94   0.58         1
998667    50        -0.99   0.40         0

A Practical Guide to TSCB

  1. Start with your sample. At a minimum, you’ll need
    • Outcome
    • Treatment indicator
    • Cluster variable

A Practical Guide to TSCB

  2. Determine \(q_k\), the fraction of (population-level) clusters that are sampled.
    • It could be high! (e.g., for a state policy evaluation using a national survey that includes all states, this implies \(q_k=1.0\))
    • It could be low! (e.g., a national survey of households)

A Practical Guide to TSCB

  3. Choose the number of bootstrap replications (e.g., \(B=100\)).
    • You’re now ready to iterate through the bootstrap procedure replication by replication!

A Practical Guide to TSCB

For each bootstrap replication …

  4. Create a “pseudo-sample” by replicating your original sample \(1/q_k\) times.

    • If \(1/q_k\) is not an integer, you can use the methods in Chao and Lo (1985).
    • Essentially, you solve for a fraction \(\alpha\) using a fairly simple formula and some basic algebra.
    • If you observe 20 of 50 clusters (such that \(q_k=0.4\) and \(1/q_k=2.5\)), as you proceed through the \(B\) bootstrap replications, you’d replicate the sample 2 times with probability \(\alpha\) or 3 times with probability \(1-\alpha\).
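One way to derive \(\alpha\) is by matching the expected number of replications to \(1/q_k\); a sketch under that assumption (the function name is hypothetical, and this is a simplification of the Chao and Lo (1985) method):

```python
import math

def replication_scheme(q_k):
    """Split 1/q_k into integer replication counts m and m + 1, with
    probabilities alpha and 1 - alpha, so that the expected number of
    replications equals 1/q_k (expectation matching; a simplified
    version of Chao and Lo, 1985)."""
    inv = 1.0 / q_k
    m = math.floor(inv)
    alpha = (m + 1) - inv   # solves m*alpha + (m+1)*(1-alpha) = 1/q_k
    return m, alpha

m, alpha = replication_scheme(0.4)   # 20 of 50 clusters observed
# replicate the sample m times w.p. alpha, m + 1 times w.p. 1 - alpha
```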

A Practical Guide to TSCB

For each bootstrap replication …

  5. For each cluster in your pseudo-sample, calculate and record the fraction of observations treated.
    • This is the empirical distribution of cluster-level treatment fractions, or \(\bar{W}_{k,m}\).
    • NOTE: To proceed, \(\bar{W}_{k,m}\) must be strictly between 0 and 1 for all clusters.
    • If treatment assignment is perfectly correlated with cluster, just use cluster-robust inference (if you have a sufficient number of clusters; if you don’t, skip to the section on inference with a small number of clusters).

A Practical Guide to TSCB

For each bootstrap replication …

  6. Randomly draw clusters from your pseudo-sample.
    • If you observe 20 clusters in your original sample, then randomly draw 20 clusters from your pseudo-sample.
    • Draw these clusters without replacement.

A Practical Guide to TSCB

For each bootstrap replication …

  7. For each of these randomly drawn clusters, sample a treatment fraction value from \(\bar{W}_{k,m}\).
    • You sample here with replacement.
    • This is the new assignment probability for each sampled cluster, called \(A_{k,m}\).

A Practical Guide to TSCB

For each bootstrap replication …

  8. Next, within each sampled cluster (each of which has cluster sample size \(N_{k,m}\)), sample (with replacement) treated and control units.
    • Sample \(N_{k,m}A_{k,m}\) treated units.
    • Sample \(N_{k,m}(1 - A_{k,m})\) untreated units.
    • Do this for each sampled cluster in your bootstrap replicate sample.

A Practical Guide to TSCB

For each bootstrap replication …

  9. Estimate your outcome regression (e.g., DID using least squares or a fixed-effects model) on this bootstrap sample, and record the statistic/parameter you are interested in (e.g., \(\hat \tau\)).

A Practical Guide to TSCB

  10. Repeat steps 4-9 \(B\) times and collect the \(\hat \tau\) estimates.
    • You can now calculate the standard deviation of these estimates as an estimate of the standard error.
    • You could also calculate a p-value by calculating the fraction of bootstrap estimates that are as or more extreme than your estimate from the original data.
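The per-replication loop can be sketched on simulated data. This is a simplified illustration, not the exact Abadie et al. (2022) procedure: it assumes \(1/q_k\) is an integer and uses a difference in means in place of the DID regression:

```python
import numpy as np

rng = np.random.default_rng(3)

# Simulated sample: 25 of 50 clusters observed (q_k = 0.5), treatment
# partially correlated with cluster, true treatment effect zero.
K, n_k, q_k = 25, 40, 0.5
cluster = np.repeat(np.arange(K), n_k)
pr_treat = rng.uniform(0.3, 0.7, K)[cluster]   # cluster-level Tx propensity
w = rng.binomial(1, pr_treat)
y = rng.normal(0.0, 1.0, K * n_k) + 0.5 * rng.normal(0.0, 1.0, K)[cluster]

B = 200
taus = np.empty(B)
for b in range(B):
    # Step 4: pseudo-sample = the sample replicated 1/q_k times (an integer
    # here; Chao and Lo (1985) handle the non-integer case)
    reps = int(round(1 / q_k))
    pc = np.concatenate([cluster + k * K for k in range(reps)])
    pw, py = np.tile(w, reps), np.tile(y, reps)
    ids = np.unique(pc)
    # Step 5: empirical distribution of cluster-level treated fractions
    frac = np.array([pw[pc == c].mean() for c in ids])
    # Step 6: draw K clusters from the pseudo-sample WITHOUT replacement
    drawn = rng.choice(ids, size=K, replace=False)
    # Step 7: give each drawn cluster a treated fraction, sampled WITH replacement
    A = rng.choice(frac, size=K, replace=True)
    y_b, w_b = [], []
    for c, a in zip(drawn, A):
        y_c, w_c = py[pc == c], pw[pc == c]
        n_c = len(y_c)
        n_t = int(round(n_c * a))
        # Step 8: resample treated and control units within the cluster
        y_b.append(rng.choice(y_c[w_c == 1], size=n_t, replace=True))
        y_b.append(rng.choice(y_c[w_c == 0], size=n_c - n_t, replace=True))
        w_b += [1] * n_t + [0] * (n_c - n_t)
    y_b, w_b = np.concatenate(y_b), np.array(w_b)
    # Step 9: re-estimate (difference in means stands in for the DID regression)
    taus[b] = y_b[w_b == 1].mean() - y_b[w_b == 0].mean()

se_tscb = taus.std(ddof=1)   # Step 10: SD of the estimates is the SE estimate
```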

DID Example

Method                        Coefficient   SE      p-value
Classic Inference             0.028         0.019   0.143
Robust                        0.028         0.019   0.140
Clustered                     0.028         0.106   0.797
Two-Stage Cluster Bootstrap   0.028         0.082   0.810

Small Cluster Bias

  • Thus far we have assumed we have a sufficiently large number of clusters (e.g., >30) in our sample.
  • The asymptotic justification for cluster-robust standard errors assumes that the number of clusters goes to infinity.

Small Cluster Bias

  • What happens if we have only a few clusters?
    • Example: Evaluation of a state-level policy change based on a sample of 8 states with 4 treated and 4 untreated.
  • Cameron, Gelbach, and Miller (2008): With a small number of clusters, cluster-robust standard errors are downward biased.
    • You’re going to over-reject the null hypothesis because your standard errors are too small.

DID Example

If the total number of clusters is small (e.g., 30 or less), you may need to adopt a different inference approach.

Inference under cluster (10/50) sampling and random assignment

Method      Coefficient   SE      p-value
Robust      0.01          0.015   0.518
Clustered   0.01          0.010   0.349

Small Cluster Bias

  • How can we deal with this?
  • Two primary ways (though there are others):
    • Wild cluster bootstrap.
    • Randomization inference.

Wild cluster bootstrap

  • Same basic idea as a regular bootstrap.
  • Main difference is that rather than resampling rows of our data, we resample the residuals after fitting our outcome model.

Wild cluster bootstrap

For each bootstrap replicate …

  • Keep each observation’s residual from the main regression model.
  • We then apply a random weight to each residual.
    • The weight is -1 with probability 0.5 and 1 with probability 0.5 (a Rademacher weight).
  • The catch is that everyone within the same cluster receives the same weight value (i.e., we draw the 1 or -1 once per cluster).

Wild cluster bootstrap

For each bootstrap replicate …

  • Because we only resample and reweight the residuals, the other X variables used in the regression stay the same.
  • The coefficients and X’s from the original regression can be used to obtain a predicted value.
  • This predicted value plus the new (sampled and re-weighted) residuals are used to construct a new outcome value.

Wild cluster bootstrap

For each bootstrap replicate …

  • Re-estimate the regression using the new outcome values.
  • Repeat \(B\) times and collect the estimates.
  • You can use these estimates to construct a p-value based on how extreme the original value is within the distribution of estimates generated via the bootstrap samples.
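The steps above can be sketched as follows. A simplified, hypothetical illustration on simulated data: it compares bootstrap coefficients rather than t-statistics, and imposes the null when forming the residuals (as recommended by Cameron, Gelbach, and Miller 2008):

```python
import numpy as np

rng = np.random.default_rng(4)

# Simulated data: 10 clusters, treatment perfectly correlated with cluster,
# true treatment effect zero, with a cluster-level shock in the outcome.
K, n_k = 10, 50
cluster = np.repeat(np.arange(K), n_k)
w = (cluster < 5).astype(float)        # 5 treated, 5 untreated clusters
y = rng.normal(0.0, 1.0, K * n_k) + 0.5 * rng.normal(0.0, 1.0, K)[cluster]

# Fit the main regression y = a + tau * w + e by least squares
X = np.column_stack([np.ones_like(w), w])
tau_hat = np.linalg.lstsq(X, y, rcond=None)[0][1]

# Residuals with the null (tau = 0) imposed: restricted model is intercept-only
resid0 = y - y.mean()

B = 999
taus = np.empty(B)
for b in range(B):
    # Rademacher weights: -1 or +1 with probability 0.5, drawn once PER CLUSTER
    g = rng.choice([-1.0, 1.0], size=K)[cluster]
    y_star = y.mean() + g * resid0     # new outcome; the X's stay fixed
    taus[b] = np.linalg.lstsq(X, y_star, rcond=None)[0][1]

# p-value: how extreme is the original estimate in the bootstrap distribution?
p_val = np.mean(np.abs(taus) >= abs(tau_hat))
```

In practice you would use a packaged routine (e.g., boottest-style implementations in Stata or R) rather than hand-rolling this loop.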

Wild cluster bootstrap

  • Wild cluster bootstrap is easily executed in both Stata and R.
  • New “fast” routines can complete this process in seconds (or less).

DID Example

If the total number of clusters is small (e.g., 30 or less), you may need to adopt a different inference approach.

Inference under cluster (10/50) sampling and clustered assignment

Method                        Coefficient   SE      p-value
Robust                        0.009         0.015   0.537
Clustered                     0.009         0.057   0.873
Wild Cluster                  0.009         NA      0.906
Two-Stage Cluster Bootstrap   0.009         0.096   0.940

Permutation Inference

  • Another approach (though it can be a bit more conservative) is to use randomization inference.
  • Basic idea here is to randomly permute the treatment indicator.
  • This is a useful approach when treatment is perfectly correlated with clusters (e.g., state policy change) and you only have a few clusters in your sample.

Permutation Inference

  • Example: State policy change with 12 total states (4 treated, 8 untreated)
  • Idea: Iterative exercise where you randomly select the 4 “treated” states and re-estimate the outcome model each time.
    • There are 495 possible combinations of 4 treated states out of 12 total states.
    • You can therefore construct a distribution of 495 different treatment effect estimates.
    • How “extreme” is the estimate in your primary sample within this distribution of 495 estimates?
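The enumeration described above can be sketched directly (simulated state-level outcomes with a true effect of zero by construction):

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(5)

# 12 state-level average outcomes: 4 treated, 8 untreated, true effect zero
states = np.arange(12)
y = rng.normal(0.0, 1.0, 12)
treated = [0, 1, 2, 3]                  # the actually treated states

mask = np.isin(states, treated)
tau_hat = y[mask].mean() - y[~mask].mean()

# Enumerate all 12-choose-4 = 495 possible treated sets, re-estimating each time
perm_taus = []
for combo in combinations(range(12), 4):
    m = np.isin(states, combo)
    perm_taus.append(y[m].mean() - y[~m].mean())
perm_taus = np.array(perm_taus)

# Exact p-value: share of permutation estimates at least as extreme as tau_hat
p_exact = np.mean(np.abs(perm_taus) >= abs(tau_hat))
```

Because the actual assignment is one of the 495 permutations, the exact p-value is always at least 1/495.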

Permutation Inference

  • A nice feature of this approach is that you get an exact p-value.
  • If the total possible number of permutations is large (e.g., 20 choose 10 = 184,756), you could always just permute cluster treatment status, say, 1,000 times to approximate the distribution.
  • A downside is that constructing 95% confidence intervals requires strong assumptions (homogeneity of treatment effect).

References

Abadie, Alberto, Susan Athey, Guido W. Imbens, and Jeffrey M. Wooldridge. 2020. “Sampling-Based Versus Design-Based Uncertainty in Regression Analysis.” Econometrica 88 (1): 265–96.
Abadie, Alberto, Susan Athey, Guido Imbens, and Jeffrey Wooldridge. 2022. “When Should You Adjust Standard Errors for Clustering?” arXiv. https://doi.org/10.48550/arXiv.1710.02926.
Cameron, A. Colin, Jonah B. Gelbach, and Douglas L. Miller. 2008. “Bootstrap-Based Improvements for Inference with Clustered Errors.” The Review of Economics and Statistics 90 (3): 414–27.
Chao, Min-Te, and Shaw-Hwa Lo. 1985. “A Bootstrap Method for Finite Population.” Sankhyā: The Indian Journal of Statistics, Series A, 399–405.
Roodman, David, Morten Ørregaard Nielsen, James G. MacKinnon, and Matthew D. Webb. 2019. “Fast and Wild: Bootstrap Inference in Stata Using Boottest.” The Stata Journal 19 (1): 4–60.