

Delta method

There are cases where we are interested in conducting inference on functions of estimates. Consider the case where

\sqrt{n} ( \beta_n - \beta ) \overset{d}{\rightarrow} \mathcal{N}(0, \Sigma)

and that we are interested in the limiting distribution of h(\beta_n) for some continuously differentiable function h.

A first-order Taylor expansion around \beta gives

h(\beta_n) \simeq h(\beta) + \nabla h(\beta)' ( \beta_n - \beta )

which allows us to express the variance of h(\beta_n) as

\begin{aligned} Var(h(\beta_n)) &\simeq Var( h(\beta) + \nabla h(\beta)' ( \beta_n - \beta ) ) \\ & = Var( \nabla h(\beta)' \beta_n ) \\ & = \nabla h(\beta)' Var( \beta_n ) \nabla h(\beta)\\ & = \nabla h(\beta)' \Sigma \nabla h(\beta)\\ \end{aligned}

and so we get that

\sqrt{n} ( h(\beta_n) - h(\beta) ) \overset{d}{\rightarrow} \mathcal{N}(0, \nabla h(\beta)' \Sigma \nabla h(\beta) )
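As a sanity check, here is a minimal numerical sketch (all numbers hypothetical) comparing the delta-method variance of h(\beta) = \beta_1 / \beta_2 against a direct Monte-Carlo simulation:

```python
import numpy as np

# Hypothetical limiting distribution: sqrt(n)(beta_n - beta) -> N(0, Sigma)
beta = np.array([2.0, 1.0])
Sigma = np.array([[1.0, 0.3],
                  [0.3, 2.0]])

# Gradient of h(b) = b1 / b2 at beta: dh/db1 = 1/b2, dh/db2 = -b1/b2^2
grad = np.array([1 / beta[1], -beta[0] / beta[1] ** 2])

# Delta-method asymptotic variance of sqrt(n)(h(beta_n) - h(beta))
avar = grad @ Sigma @ grad

# Monte-Carlo check: draw beta_n from its limiting distribution and
# compare n * Var(h(beta_n)) to the delta-method value
rng = np.random.default_rng(0)
n = 10_000
draws = rng.multivariate_normal(beta, Sigma / n, size=200_000)
h_draws = draws[:, 0] / draws[:, 1]
mc_avar = n * h_draws.var()

print(avar, mc_avar)  # the two numbers should be close
```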

Clustered standard errors

Please refer to the review paper by Cameron and Miller for more details. The following is a summary of the first part of the paper. Here we will consider the Moulton formula, but things go beyond this simple setting. We use the same notation as before, but for simplicity we suppose that we only have one regressor (and no intercept).

We recall that

\beta_n = \sum x_i y_i / \sum x_i^2 = \beta + \sum x_i u_i / \sum x_i^2

Hence in general we get that

Var[\beta_n] = Var[ \sum x_i u_i] / \Big( \sum x_i^2 \Big)^2

In the simplest case where the errors are uncorrelated across i and homoskedastic with variance \sigma^2, we get Var[\beta_n] = \sigma^2 / \sum x_i^2. If instead errors are heteroskedastic we get

Var[\beta_n] = \Big(\sum x_i^2 \mathbb{E}[u_i^2|x_i]\Big) / \Big( \sum x_i^2 \Big)^2

where we can construct an estimator using the residuals \hat{u}_i:

\hat{Var}[\beta_n] = \Big(\sum x_i^2 \hat{u}_i^2\Big) / \Big( \sum x_i^2 \Big)^2
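A minimal sketch of this heteroskedasticity-robust estimator in the one-regressor, no-intercept model, with simulated (hypothetical) data:

```python
import numpy as np

# Simulate y_i = x_i * beta + u_i with heteroskedastic errors
rng = np.random.default_rng(1)
n = 5_000
x = rng.normal(size=n)
u = rng.normal(size=n) * (0.5 + np.abs(x))   # error variance depends on x
beta = 2.0
y = x * beta + u

beta_n = np.sum(x * y) / np.sum(x ** 2)      # OLS estimate
u_hat = y - x * beta_n                       # residuals

# hat{Var}[beta_n] = (sum x_i^2 uhat_i^2) / (sum x_i^2)^2
var_robust = np.sum(x ** 2 * u_hat ** 2) / np.sum(x ** 2) ** 2
se_robust = np.sqrt(var_robust)
```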

Finally, what if the errors are correlated across i? In the most general case:

\begin{aligned} V\Big[\sum x_i u_i\Big] &= \sum_i\sum_j Cov[x_i u_i , x_j u_j] \\ &= \sum_i\sum_j x_i x_j \mathbb{E}[u_i u_j] \end{aligned}

Simply replacing u_i u_j with \hat{u}_i \hat{u}_j would unfortunately give 0 directly, since \sum_i x_i \hat{u}_i = 0. Instead, we are going to assume that in the population there are known / observed groups such that we allow correlation within clusters, but assume no correlation between clusters. Then we can compute:

\begin{aligned} \hat{V}\Big[\sum x_i u_i\Big] &= \sum_i\sum_j x_i x_j \hat{u}_i \hat{u}_j 1[i,j\text{ in same cluster}] \\ \end{aligned}
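A sketch of the corresponding cluster-robust computation; summing x_i \hat{u}_i within each cluster and squaring is equivalent to the double sum restricted to same-cluster pairs (names and data are illustrative):

```python
import numpy as np

def cluster_var(x, u_hat, cluster):
    """Cluster-robust variance estimate for the single-regressor slope."""
    denom = np.sum(x ** 2) ** 2
    meat = 0.0
    for g in np.unique(cluster):
        in_g = cluster == g
        s = np.sum(x[in_g] * u_hat[in_g])  # sum_i x_i uhat_i within cluster g
        meat += s ** 2                     # = double sum over i, j in cluster g
    return meat / denom

# Toy clustered data: a common cluster-level shock in the errors
rng = np.random.default_rng(7)
G, m = 40, 25
cluster = np.repeat(np.arange(G), m)
x = rng.normal(size=G * m)
u = rng.normal(size=G)[cluster] + rng.normal(size=G * m)
y = 1.5 * x + u

beta_n = np.sum(x * y) / np.sum(x ** 2)
v = cluster_var(x, y - x * beta_n, cluster)
```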

Clustered errors

Let's assume that there are G clusters and we still write:

Y_{ig} = X_{ig}'\beta + u_{ig}

And we assume that \mathbb{E}[u_{ig}|x_{ig}] = 0, and in addition that for g \neq g':

\mathbb{E}[ u_{ig} u_{jg'} | x_{ig}, x_{jg'} ] = 0

Moulton (1990) considered the case where Corr[u_{ig},u_{jg}] = \rho_u, the within-cluster correlation of the regressor is written \rho_x, and N_g is the average size of a cluster. Then the non-clustered variance estimator should be scaled by

\tau \simeq 1 + \rho_x \rho_u ( N_g - 1 )

The variance inflation factor, or the Moulton factor is increasing with:

  • within cluster correlation of the regressor
  • within cluster correlation of the error
  • number of observations in each cluster (because effectively it is the number of clusters, rather than of observations, that matters)

In an influential paper, Moulton (1990) pointed out that in many settings the inflation factor \tau can be large even if \rho_u is small. He considered a log earnings regression using March CPS data (N = 18,946), regressors aggregated at the state level (G = 49), and errors correlated within state (\rho_u = 0.032). The average group size was 18,946/49 \approx 387, and \rho_x = 1 for a state-level regressor, so the expression yields \tau = 1 + 1 \times 0.032 \times 386 = 13.3. The weak correlation of errors within state was still enough to make cluster-corrected standard errors \sqrt{13.3} = 3.7 times larger than the (incorrect) default standard errors!
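The arithmetic above can be reproduced directly:

```python
# Moulton inflation factor tau = 1 + rho_x * rho_u * (N_g - 1),
# using the CPS numbers from the text
rho_u = 0.032
rho_x = 1.0
N_g = 18_946 / 49                    # average cluster size, about 387
tau = 1 + rho_x * rho_u * (N_g - 1)

print(tau, tau ** 0.5)               # roughly 13.3 and 3.7
```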

Choosing where to cluster



Bootstrap

Computing confidence intervals and critical values can be tedious when using asymptotic formulas. If we could draw directly from the population, we could conduct a Monte-Carlo exercise and recover the distribution of the estimator. In this section we consider such an approach by sampling from the available sample. Given a sample Y_{1}, \ldots, Y_{n}, there are two main re-sampling approaches. The first is to re-sample n elements from (Y_{1}, \ldots, Y_{n}) with replacement; the second is to sample m < n elements from (Y_{1}, \ldots, Y_{n}) without replacement. In both approaches the goal is to generate draws from a distribution that resembles the population distribution as closely as possible.

The theory behind the bootstrap

The data are assumed to be independent draws from F_{0}(x)=F_{0}(x,\theta_{0}) and we consider a statistic T_{n}=T_{n}(X_{1}, \ldots, X_{n}). The distribution of T_{n} is denoted by G_{n}=G_{n}(t,F_{0})=P_{0}[T_{n}\leq t]. Asymptotic theory relies on G_{\infty}; the bootstrap instead plugs in an estimate of F_{0} and uses G_{n}(\cdot,F_{n}). Taking B samples with replacement from F_{n} and computing T_{n,b} in each, we can construct

\hat{G}_{n}(t,F_{n})=\frac{1}{B}\sum_{b}\mathbf{1}[T_{n,b}\leq t]

then what we need for the bootstrap procedure to be asymptotically valid is that

G_{n}(t,F_{n}) - G_{n}(t,F_{0}) \overset{p}{\rightarrow} 0

uniformly in t. This requires smoothness in F_{0} as well as in G_{n}(\cdot,\cdot), and consistency of F_{n} for F_{0}. In general, if we have \sqrt{n} asymptotic convergence to G_{\infty}, then both G_{n}(t,F_{0}) and G_{n}(t,F_{n}) converge at that rate, and so they are also close to each other:

G_{n}(t,F_{n}) - G_{n}(t,F_{0}) = O_{p}(n^{-1/2})

which provides no gain when compared to asymptotic standard errors besides the simplicity of the computation.
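A minimal sketch of the resampling scheme, here bootstrapping T_n = \sqrt{n}(\bar{Y}^* - \bar{Y}_n) for a sample mean (data simulated for illustration):

```python
import numpy as np

# Nonparametric bootstrap: resample with replacement from the observed data
rng = np.random.default_rng(2)
n, B = 200, 2_000
sample = rng.exponential(size=n)          # pretend this is the observed sample

t_boot = np.empty(B)
for b in range(B):
    resample = rng.choice(sample, size=n, replace=True)
    t_boot[b] = np.sqrt(n) * (resample.mean() - sample.mean())

# hat{G}_n(t) = (1/B) sum_b 1[T_{n,b} <= t]; e.g. two-sided 95% critical values:
lo, hi = np.quantile(t_boot, [0.025, 0.975])
```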

Parametric Bootstrap

Note that the goal is to approximate F_{0}; F_{n} is a natural candidate, but one can instead use F(\cdot,\theta_{n}) where \theta_{n} is a consistent estimator of \theta_{0}. This is referred to as the parametric bootstrap.
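A sketch, assuming for illustration an exponential distribution with scale parameter \theta estimated by the sample mean:

```python
import numpy as np

# Parametric bootstrap: draw new samples from F(., theta_n) rather than
# resampling the data; here F is assumed exponential with scale theta
rng = np.random.default_rng(3)
n, B = 200, 2_000
sample = rng.exponential(scale=1.5, size=n)
theta_n = sample.mean()                   # MLE of the scale parameter

theta_boot = np.array([
    rng.exponential(scale=theta_n, size=n).mean()  # re-estimate on each draw
    for _ in range(B)
])
se_param = theta_boot.std()               # parametric-bootstrap standard error
```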

Asymptotic refinement

It can be shown that in the case where T_{n} is asymptotically pivotal, meaning that its asymptotic distribution does not depend on unknown parameters, the bootstrap achieves:

G_{n}(t,F_{n}) - G_{n}(t,F_{0}) = O_{p}(n^{-1})

The idea here is that one can get a better approximation of the finite-sample distribution: for every n, G_{n}(t,F_{n}) is closer to G_{n}(t,F_{0}) than G_{\infty}(t,F_{0}) is. This can be shown using the Edgeworth expansion, which expands G_{n}(t) as a series in n^{-\frac{1}{2}}:

\begin{aligned} G_{n}(t,F_{n})-G_{n}(t,F_{0}) & =\left[G_{\infty}(t,F_{n})-G_{\infty}(t,F_{0})\right]\\ & +\frac{1}{\sqrt{n}}\left[g_{1}(t,F_{n})-g_{1}(t,F_{0})\right]+O(n^{-1}) \end{aligned}

and then G_{\infty}(t,F_{n})-G_{\infty}(t,F_{0})=0 if T_{n} is asymptotically pivotal, while g_{1}(t,F_{n})-g_{1}(t,F_{0})=O(n^{-1/2}), delivering an overall O(n^{-1}).
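A sketch of a bootstrap-t (percentile-t) procedure, which bootstraps the studentized and hence asymptotically pivotal statistic (toy data):

```python
import numpy as np

# Bootstrap-t: bootstrap t* = (mean* - mean_n) / se*, a studentized and
# asymptotically pivotal statistic, then invert for a confidence interval
rng = np.random.default_rng(4)
n, B = 200, 2_000
sample = rng.normal(loc=1.0, size=n)
mean_n = sample.mean()
se_n = sample.std(ddof=1) / np.sqrt(n)

t_boot = np.empty(B)
for b in range(B):
    rs = rng.choice(sample, size=n, replace=True)
    se_b = rs.std(ddof=1) / np.sqrt(n)
    t_boot[b] = (rs.mean() - mean_n) / se_b

q_lo, q_hi = np.quantile(t_boot, [0.025, 0.975])
ci = (mean_n - q_hi * se_n, mean_n - q_lo * se_n)  # bootstrap-t 95% interval
```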

Failure of the bootstrap

One example of failure, even when the estimator is asymptotically normal, is the nearest-neighbor matching estimator (Abadie and Imbens 2008): it is shown that the variance of the bootstrap is either too small or too large. Another example is the estimation of the median.

Bias correction using bootstrap

The bootstrap can be used to correct for the bias of an estimator. In many applications the exact form of the bias \mathbb{E}_{0}[\theta_{n}-\theta_{0}] is not known; however, if we consider \bar{\theta}_{n}^{*}, the average across bootstrap replications, then \bar{\theta}_{n}^{*}-\theta_{n} gives us an estimate of the bias. We can then consider the bias-corrected estimate \theta_{n}^{BR}=\theta_{n}-(\bar{\theta}_{n}^{*}-\theta_{n}).
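A sketch using a textbook example: the variance estimator that divides by n, which is biased downward by the factor (n-1)/n (toy data):

```python
import numpy as np

# Bootstrap bias correction for the downward-biased variance estimator
rng = np.random.default_rng(5)
n, B = 50, 5_000
sample = rng.normal(size=n)
theta_n = sample.var()                    # biased: divides by n, not n - 1

theta_star = np.array([
    rng.choice(sample, size=n, replace=True).var() for _ in range(B)
])
bias_hat = theta_star.mean() - theta_n    # bootstrap estimate of the bias
theta_bc = theta_n - bias_hat             # theta_n - (bar{theta}* - theta_n)
```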

Non-iid samples

There will be cases where the data are not exactly iid. For instance there might be weak spatial correlation. In this case, one might want to bootstrap by resampling clusters of data to replicate the dependence.
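A sketch of a cluster (block) bootstrap that resamples whole clusters with replacement, preserving the within-cluster dependence (toy data):

```python
import numpy as np

# Cluster bootstrap: draw cluster labels with replacement and keep
# each sampled cluster's observations together
rng = np.random.default_rng(6)
G = 30                                     # number of clusters
data = {g: rng.normal(loc=rng.normal(), size=20) for g in range(G)}

B = 1_000
means = np.empty(B)
for b in range(B):
    picked = rng.integers(0, G, size=G)    # G cluster labels, with replacement
    boot = np.concatenate([data[g] for g in picked])
    means[b] = boot.mean()

se_cluster = means.std()                   # cluster-bootstrap standard error
```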