

Delta method

There are cases where we are interested in conducting inference on functions of estimates. Consider the case where

\sqrt{n} ( \beta_n - \beta ) \overset{d}{\rightarrow} \mathcal{N}(0, \Sigma)

and that we are interested in the limiting distribution of h(\beta_n) for some continuously differentiable function h.

A first-order Taylor expansion around \beta gives

h(\beta_n) \simeq h(\beta) + \nabla h(\beta)' ( \beta_n - \beta )

which allows us to express the variance of h(\beta_n) as

\begin{aligned} Var(h(\beta_n)) &\simeq Var( h(\beta) + \nabla h(\beta)' ( \beta_n - \beta ) ) \\ & = Var( \nabla h(\beta)' \beta_n ) \\ & = \nabla h(\beta)' Var( \beta_n ) \nabla h(\beta)\\ & = \nabla h(\beta)' \Sigma \nabla h(\beta)\\ \end{aligned}

and so we get that

\sqrt{n} ( h(\beta_n) - h(\beta) ) \overset{d}{\rightarrow} \mathcal{N}(0, \nabla h(\beta)' \Sigma \nabla h(\beta) )
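As a sanity check, here is a minimal numerical sketch (all numbers hypothetical) comparing the delta-method variance of h(\beta) = \beta_1 / \beta_2 against a direct Monte-Carlo simulation:

```python
import numpy as np

# Hypothetical limiting distribution: sqrt(n)(beta_n - beta) -> N(0, Sigma)
beta = np.array([2.0, 1.0])
Sigma = np.array([[1.0, 0.3],
                  [0.3, 2.0]])

# Gradient of h(b) = b1 / b2 at beta: dh/db1 = 1/b2, dh/db2 = -b1/b2^2
grad = np.array([1 / beta[1], -beta[0] / beta[1] ** 2])

# Delta-method asymptotic variance of sqrt(n)(h(beta_n) - h(beta))
avar = grad @ Sigma @ grad

# Monte-Carlo check: draw beta_n from its limiting distribution and
# compare n * Var(h(beta_n)) to the delta-method value
rng = np.random.default_rng(0)
n = 10_000
draws = rng.multivariate_normal(beta, Sigma / n, size=200_000)
h_draws = draws[:, 0] / draws[:, 1]
mc_avar = n * h_draws.var()

print(avar, mc_avar)  # the two numbers should be close
```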

Clustered standard errors

Please refer to the review paper by Cameron and Miller for more details. The following is a summary of the first part of the paper. Here we will consider the Moulton formula, but things go beyond this simple setting. We use the same notation as before, but for simplicity we suppose that we only have one regressor (and no intercept).

We recall that

\beta_n = \sum x_i y_i / \sum x_i^2 = \beta + \sum x_i u_i / \sum x_i^2

Hence in general we get that

Var[\beta_n] = Var[ \sum x_i u_i] / \Big( \sum x_i^2 \Big)^2

In the simplest case where the errors are uncorrelated across i and homoskedastic with variance \sigma^2, we get Var[\beta_n] = \sigma^2 / \sum x_i^2. If instead errors are heteroskedastic we get

Var[\beta_n] = \Big(\sum x_i^2 \mathbb{E}[u_i^2|x_i]\Big) / \Big( \sum x_i^2 \Big)^2

where we can construct an estimator using the residuals \hat{u}_i:

\hat{Var}[\beta_n] = \Big(\sum x_i^2 \hat{u}_i^2\Big) / \Big( \sum x_i^2 \Big)^2
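A minimal sketch of this heteroskedasticity-robust estimator in the one-regressor, no-intercept model, with simulated (hypothetical) data:

```python
import numpy as np

# Simulate y_i = x_i * beta + u_i with heteroskedastic errors
rng = np.random.default_rng(1)
n = 5_000
x = rng.normal(size=n)
u = rng.normal(size=n) * (0.5 + np.abs(x))   # error variance depends on x
beta = 2.0
y = x * beta + u

beta_n = np.sum(x * y) / np.sum(x ** 2)      # OLS estimate
u_hat = y - x * beta_n                       # residuals

# hat{Var}[beta_n] = (sum x_i^2 uhat_i^2) / (sum x_i^2)^2
var_robust = np.sum(x ** 2 * u_hat ** 2) / np.sum(x ** 2) ** 2
se_robust = np.sqrt(var_robust)
```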

Finally, what if the errors are correlated across i? In the most general case:

\begin{aligned} V\Big[\sum x_i u_i\Big] &= \sum_i\sum_j Cov[x_i u_i , x_j u_j] \\ &= \sum_i\sum_j x_i x_j \mathbb{E}[u_i u_j] \end{aligned}

Simply replacing u_i u_j with \hat{u}_i \hat{u}_j would unfortunately give 0 directly, since \sum_i x_i \hat{u}_i = 0. Instead, we are going to assume that in the population there are known / observed groups such that we allow correlation within clusters, but assume no correlation between clusters. Then we can compute:

\begin{aligned} \hat{V}\Big[\sum x_i u_i\Big] &= \sum_i\sum_j x_i x_j \hat{u}_i \hat{u}_j 1[i,j\text{ in same cluster}] \\ \end{aligned}
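A sketch of the corresponding cluster-robust computation; summing x_i \hat{u}_i within each cluster and squaring is equivalent to the double sum restricted to same-cluster pairs (names and data are illustrative):

```python
import numpy as np

def cluster_var(x, u_hat, cluster):
    """Cluster-robust variance estimate for the single-regressor slope."""
    denom = np.sum(x ** 2) ** 2
    meat = 0.0
    for g in np.unique(cluster):
        in_g = cluster == g
        s = np.sum(x[in_g] * u_hat[in_g])  # sum_i x_i uhat_i within cluster g
        meat += s ** 2                     # = double sum over i, j in cluster g
    return meat / denom

# Toy clustered data: a common cluster-level shock in the errors
rng = np.random.default_rng(7)
G, m = 40, 25
cluster = np.repeat(np.arange(G), m)
x = rng.normal(size=G * m)
u = rng.normal(size=G)[cluster] + rng.normal(size=G * m)
y = 1.5 * x + u

beta_n = np.sum(x * y) / np.sum(x ** 2)
v = cluster_var(x, y - x * beta_n, cluster)
```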

Clustered errors

Let's assume that there are G clusters and we still write:

Y_{ig} = X_{ig}'\beta + u_{ig}

And we assume that \mathbb{E}[u_{ig}|x_{ig}] = 0, and in addition that for g \neq g':

\mathbb{E}[ u_{ig} u_{jg'} | x_{ig}, x_{jg'} ] = 0

Moulton (1990) considered the case where Corr[u_{ig},u_{jg}] = \rho_u, the within-cluster correlation of the regressor is written \rho_x, and N_g is the average size of a cluster. Then the non-clustered variance estimator should be scaled by

\tau \simeq 1 + \rho_x \rho_u ( N_g - 1 )

The variance inflation factor, or the Moulton factor is increasing with:

  • within cluster correlation of the regressor
  • within cluster correlation of the error
  • number of observations in each cluster (because effectively it is the number of clusters, rather than of observations, that matters)

In an influential paper, Moulton (1990) pointed out that in many settings the inflation factor \tau can be large even if \rho_u is small. He considered a log earnings regression using March CPS data (N = 18,946), regressors aggregated at the state level (G = 49), and errors correlated within state (\rho_u = 0.032). The average group size was 18,946/49 \approx 387, and \rho_x = 1 for a state-level regressor, so the expression yields \tau = 1 + 1 \times 0.032 \times 386 = 13.3. The weak correlation of errors within state was still enough to make cluster-corrected standard errors \sqrt{13.3} = 3.7 times larger than the (incorrect) default standard errors!
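The arithmetic above can be reproduced directly:

```python
# Moulton inflation factor tau = 1 + rho_x * rho_u * (N_g - 1),
# using the CPS numbers from the text
rho_u = 0.032
rho_x = 1.0
N_g = 18_946 / 49                    # average cluster size, about 387
tau = 1 + rho_x * rho_u * (N_g - 1)

print(tau, tau ** 0.5)               # roughly 13.3 and 3.7
```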

Choosing where to cluster



Bootstrap

Computing confidence intervals and critical values can be tedious when using asymptotic formulas. If we could draw directly from the population, we could conduct a Monte-Carlo exercise and recover the distribution of the estimator. In this section we consider such an approach by sampling from the available sample. Given a sample Y_{1}, \ldots, Y_{n}, there are two main re-sampling approaches. The first is to re-sample n elements from (Y_{1}, \ldots, Y_{n}) with replacement; the second is to sample m < n elements from (Y_{1}, \ldots, Y_{n}) without replacement. In both approaches the goal is to generate draws from a distribution that resembles the population distribution as closely as possible.

The theory behind the bootstrap

The data are assumed to be independent draws from F_{0}(x)=F_{0}(x,\theta_{0}) and we consider a statistic T_{n}=T_{n}(X_{1}, \ldots, X_{n}). The distribution of T_{n} is denoted by G_{n}=G_{n}(t,F_{0})=P_{0}[T_{n}\leq t]. Asymptotic theory relies on G_{\infty}; the bootstrap instead plugs in an estimate of F_{0} and uses G_{n}(\cdot,F_{n}). Taking B samples with replacement from F_{n} and computing T_{n,b} in each, we can construct

\hat{G}_{n}(t,F_{n})=\frac{1}{B}\sum_{b}\mathbf{1}[T_{n,b}\leq t]

then what we need for the bootstrap procedure to be asymptotically valid is that

G_{n}(t,F_{n}) - G_{n}(t,F_{0}) \overset{p}{\rightarrow} 0

uniformly in t. This requires smoothness in F_{0} as well as in G_{n}(\cdot,\cdot), and consistency of F_{n} for F_{0}. In general, if we have \sqrt{n} asymptotic convergence to G_{\infty}, then both G_{n}(t,F_{0}) and G_{n}(t,F_{n}) converge at that rate, and so they are also close to each other:

G_{n}(t,F_{n}) - G_{n}(t,F_{0}) = O_{p}(n^{-1/2})

which provides no gain when compared to asymptotic standard errors besides the simplicity of the computation.
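A minimal sketch of the resampling scheme, here bootstrapping T_n = \sqrt{n}(\bar{Y}^* - \bar{Y}_n) for a sample mean (data simulated for illustration):

```python
import numpy as np

# Nonparametric bootstrap: resample with replacement from the observed data
rng = np.random.default_rng(2)
n, B = 200, 2_000
sample = rng.exponential(size=n)          # pretend this is the observed sample

t_boot = np.empty(B)
for b in range(B):
    resample = rng.choice(sample, size=n, replace=True)
    t_boot[b] = np.sqrt(n) * (resample.mean() - sample.mean())

# hat{G}_n(t) = (1/B) sum_b 1[T_{n,b} <= t]; e.g. two-sided 95% critical values:
lo, hi = np.quantile(t_boot, [0.025, 0.975])
```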

Parametric Bootstrap

Note that the goal is to approximate F_{0}; F_{n} is a natural candidate, but one can instead use F(\cdot,\theta_{n}) where \theta_{n} is a consistent estimator of \theta_{0}. This is referred to as the parametric bootstrap.
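A sketch, assuming for illustration an exponential distribution with scale parameter \theta estimated by the sample mean:

```python
import numpy as np

# Parametric bootstrap: draw new samples from F(., theta_n) rather than
# resampling the data; here F is assumed exponential with scale theta
rng = np.random.default_rng(3)
n, B = 200, 2_000
sample = rng.exponential(scale=1.5, size=n)
theta_n = sample.mean()                   # MLE of the scale parameter

theta_boot = np.array([
    rng.exponential(scale=theta_n, size=n).mean()  # re-estimate on each draw
    for _ in range(B)
])
se_param = theta_boot.std()               # parametric-bootstrap standard error
```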

Asymptotic refinement

It can be shown that in the case where T_{n} is asymptotically pivotal, meaning that its asymptotic distribution does not depend on unknown parameters, the bootstrap achieves:

G_{n}(t,F_{n}) - G_{n}(t,F_{0}) = O_{p}(n^{-1})

The idea here is that one can get a better approximation of the finite-sample distribution: for every n, G_{n}(t,F_{n}) is closer to G_{n}(t,F_{0}) than G_{\infty}(t,F_{0}) is. This can be shown using the Edgeworth expansion, which expands G_{n}(t) as a series in n^{-\frac{1}{2}}:

\begin{aligned} G_{n}(t,F_{n})-G_{n}(t,F_{0}) & =\left[G_{\infty}(t,F_{n})-G_{\infty}(t,F_{0})\right]\\ & +\frac{1}{\sqrt{n}}\left[g_{1}(t,F_{n})-g_{1}(t,F_{0})\right]+O(n^{-1}) \end{aligned}

and then G_{\infty}(t,F_{n})-G_{\infty}(t,F_{0})=0 if T_{n} is asymptotically pivotal, while g_{1}(t,F_{n})-g_{1}(t,F_{0})=O(n^{-1/2}), delivering an overall O(n^{-1}).
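A sketch of a bootstrap-t (percentile-t) procedure, which bootstraps the studentized and hence asymptotically pivotal statistic (toy data):

```python
import numpy as np

# Bootstrap-t: bootstrap t* = (mean* - mean_n) / se*, a studentized and
# asymptotically pivotal statistic, then invert for a confidence interval
rng = np.random.default_rng(4)
n, B = 200, 2_000
sample = rng.normal(loc=1.0, size=n)
mean_n = sample.mean()
se_n = sample.std(ddof=1) / np.sqrt(n)

t_boot = np.empty(B)
for b in range(B):
    rs = rng.choice(sample, size=n, replace=True)
    se_b = rs.std(ddof=1) / np.sqrt(n)
    t_boot[b] = (rs.mean() - mean_n) / se_b

q_lo, q_hi = np.quantile(t_boot, [0.025, 0.975])
ci = (mean_n - q_hi * se_n, mean_n - q_lo * se_n)  # bootstrap-t 95% interval
```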

Failure of the bootstrap

One example of failure, even when the estimator is asymptotically normal, is the nearest-neighbor matching estimator (Abadie and Imbens 2008): it is shown that the variance of the bootstrap is either too small or too large. Another example is the estimation of the median.

Bias correction using bootstrap

The bootstrap can be used to correct for the bias of an estimator. In many applications the exact form of the bias \mathbb{E}_{0}[\theta_{n}-\theta_{0}] is not known; however, if we consider \bar{\theta}_{n}^{*}, the average across bootstrap replications, then \bar{\theta}_{n}^{*}-\theta_{n} gives us an estimate of the bias. We can then consider the bias-corrected estimate \theta_{n}^{BR}=\theta_{n}-(\bar{\theta}_{n}^{*}-\theta_{n}).
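A sketch using a textbook example: the variance estimator that divides by n, which is biased downward by the factor (n-1)/n (toy data):

```python
import numpy as np

# Bootstrap bias correction for the downward-biased variance estimator
rng = np.random.default_rng(5)
n, B = 50, 5_000
sample = rng.normal(size=n)
theta_n = sample.var()                    # biased: divides by n, not n - 1

theta_star = np.array([
    rng.choice(sample, size=n, replace=True).var() for _ in range(B)
])
bias_hat = theta_star.mean() - theta_n    # bootstrap estimate of the bias
theta_bc = theta_n - bias_hat             # theta_n - (bar{theta}* - theta_n)
```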

Non-iid samples

There will be cases where the data are not exactly iid. For instance there might be weak spatial correlation. In this case, one might want to bootstrap by resampling clusters of data to replicate the dependence.
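A sketch of a cluster (block) bootstrap that resamples whole clusters with replacement, preserving the within-cluster dependence (toy data):

```python
import numpy as np

# Cluster bootstrap: draw cluster labels with replacement and keep
# each sampled cluster's observations together
rng = np.random.default_rng(6)
G = 30                                     # number of clusters
data = {g: rng.normal(loc=rng.normal(), size=20) for g in range(G)}

B = 1_000
means = np.empty(B)
for b in range(B):
    picked = rng.integers(0, G, size=G)    # G cluster labels, with replacement
    boot = np.concatenate([data[g] for g in picked])
    means[b] = boot.mean()

se_cluster = means.std()                   # cluster-bootstrap standard error
```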