Inference
Delta method
There are some cases where we are interested in conducting inference on functions of estimates. Consider the case where

\sqrt{n} (\beta_n - \beta_0) \overset{d}{\longrightarrow} N(0, \Sigma)

and suppose that we are interested in the limiting distribution of h(\beta_n) for some continuously differentiable function h. We write the first-order expansion

h(\beta_n) \approx h(\beta_0) + \nabla h(\beta_0)' (\beta_n - \beta_0)

which allows us to express the asymptotic variance of h(\beta_n) as

\nabla h(\beta_0)' \Sigma \nabla h(\beta_0)

and so we get that

\sqrt{n} \big( h(\beta_n) - h(\beta_0) \big) \overset{d}{\longrightarrow} N\big(0, \nabla h(\beta_0)' \Sigma \nabla h(\beta_0)\big)
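To make this concrete, here is a minimal numpy sketch of a delta-method standard error, using numerical differentiation for \nabla h; the particular function h (a ratio of coefficients) and the plug-in covariance matrix are illustrative assumptions, not from the notes:

```python
import numpy as np

def delta_method_se(beta_hat, vcov_beta, h, eps=1e-6):
    """Delta-method standard error of h(beta_hat): sqrt(grad' V grad),
    with the gradient computed by central finite differences."""
    k = len(beta_hat)
    grad = np.zeros(k)
    for j in range(k):
        e = np.zeros(k)
        e[j] = eps
        grad[j] = (h(beta_hat + e) - h(beta_hat - e)) / (2 * eps)
    return np.sqrt(grad @ vcov_beta @ grad)

# Illustrative example: h(beta) = beta_1 / beta_2 with a made-up Var[beta_n]
beta_hat = np.array([1.2, 0.8])
vcov_beta = np.array([[0.04, 0.01],
                      [0.01, 0.09]])   # estimated Var[beta_n] (already / n)
print(delta_method_se(beta_hat, vcov_beta, lambda b: b[0] / b[1]))
```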
Clustered standard errors
Please refer to the review paper by Cameron and Miller (2015) for more details; the following is a summary of the first part of the paper. Here we will consider the Moulton formula, but things go beyond this simple consideration. We use the same notation as before, but for simplicity we suppose that we only have one regressor (and no intercept).
We recall that

\beta_n = \beta_0 + \frac{\sum_i x_i u_i}{\sum_i x_i^2}

Hence in general we get that

Var[\beta_n | x] = \frac{Var\big[ \sum_i x_i u_i \,\big|\, x \big]}{\big( \sum_i x_i^2 \big)^2}

In the simplest case where the errors are uncorrelated across i and homoskedastic, we get Var[\beta_n | x] = \sigma^2 / \sum_i x_i^2. If instead errors are heteroskedastic we get

Var[\beta_n | x] = \frac{\sum_i x_i^2 \sigma_i^2}{\big( \sum_i x_i^2 \big)^2}

where we could construct an estimator using the residuals \hat{u}_i:

\widehat{Var}[\beta_n] = \frac{\sum_i x_i^2 \hat{u}_i^2}{\big( \sum_i x_i^2 \big)^2}
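In code, this heteroskedasticity-robust (White) estimator is immediate; a minimal numpy sketch for the one-regressor, no-intercept case (the simulated data is purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
x = rng.normal(size=n)
u = rng.normal(size=n) * np.abs(x)       # heteroskedastic errors
y = 1.0 * x + u

beta_n = np.sum(x * y) / np.sum(x ** 2)  # OLS with one regressor, no intercept
u_hat = y - x * beta_n

# White estimator: sum_i x_i^2 u_hat_i^2 / (sum_i x_i^2)^2
var_white = np.sum(x ** 2 * u_hat ** 2) / np.sum(x ** 2) ** 2
print(beta_n, np.sqrt(var_white))
```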
Finally, what if the errors are correlated across i? In the most general case:

Var[\beta_n | x] = \frac{\sum_i \sum_j x_i x_j \mathbb{E}[u_i u_j | x]}{\big( \sum_i x_i^2 \big)^2}

Simply replacing \mathbb{E}[u_i u_j | x] with \hat{u}_i \hat{u}_j would unfortunately give 0 directly, since the double sum then factors as \big( \sum_i x_i \hat{u}_i \big)^2 = 0 by the OLS first-order condition. Instead we are going to assume that in the population there are known / observed groups such that we allow correlation within clusters, but we assume no correlation between clusters. Then we can compute:

\widehat{Var}_{cluster}[\beta_n] = \frac{\sum_g \big( \sum_{i \in g} x_{ig} \hat{u}_{ig} \big)^2}{\big( \sum_i x_i^2 \big)^2}
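A minimal sketch of this cluster-robust computation, assuming a vector g of observed cluster labels (all names and the simulated design are illustrative):

```python
import numpy as np

def cluster_variance(x, u_hat, g):
    """Cluster-robust variance of beta_n for one regressor without
    intercept: sum_g (sum_{i in g} x_i u_hat_i)^2 / (sum_i x_i^2)^2."""
    scores = x * u_hat
    cluster_sums = np.array([scores[g == c].sum() for c in np.unique(g)])
    return np.sum(cluster_sums ** 2) / np.sum(x ** 2) ** 2

rng = np.random.default_rng(0)
g = np.repeat(np.arange(20), 25)                       # 20 clusters of 25
x = rng.normal(size=g.size) + rng.normal(size=20)[g]   # regressor with cluster component
u = rng.normal(size=20)[g] + rng.normal(size=g.size)   # within-cluster correlated errors
y = 1.0 * x + u
beta_n = np.sum(x * y) / np.sum(x ** 2)
print(np.sqrt(cluster_variance(x, y - x * beta_n, g)))
```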
Clustered errors
Let's assume that there are G clusters and we still write:

y_{ig} = x_{ig} \beta + u_{ig}

We assume that \mathbb{E}[u_{ig}|x_{ig}] = 0, and we assume in addition that for g \neq g':

\mathbb{E}[u_{ig} u_{jg'}] = 0

Moulton (1990) considered the case where Cor[u_{ig},u_{jg}] = \rho_u, where the within-cluster correlation of the regressor is written \rho_x, and where \bar{N}_g is the average size of a cluster. Then the non-clustered variance estimator should be scaled by

\tau = 1 + \rho_x \rho_u (\bar{N}_g - 1)
The variance inflation factor, or Moulton factor, is increasing with:
- the within-cluster correlation of the regressor
- the within-cluster correlation of the error
- the number of observations in each cluster (because it is really the between-cluster variation that matters)
In an influential paper, Moulton (1990) pointed out that in many settings the inflation factor \tau can be large even if \rho_u is small. He considered a log earnings regression using March CPS data (N = 18,946), regressors aggregated at the state level (G = 49), and errors correlated within state (\rho_u = 0.032). The average group size was 18,946/49 = 387, and \rho_x = 1 for a state-level regressor, so the expression yields \tau = 1 + 1 \times 0.032 \times 386 = 13.3. The weak correlation of errors within state was still enough to lead to cluster-corrected standard errors being \sqrt{13.3} = 3.7 times larger than the (incorrect) default standard errors!
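The back-of-the-envelope calculation in code:

```python
# Moulton factor for the CPS example: tau = 1 + rho_x * rho_u * (N_bar - 1)
N, G = 18946, 49
rho_u, rho_x = 0.032, 1.0
N_bar = N / G                        # about 387 observations per state
tau = 1 + rho_x * rho_u * (N_bar - 1)
print(tau, tau ** 0.5)               # roughly 13.3 and 3.7
```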
Choosing where to cluster
TBD
Bootstrap
Computing confidence intervals and critical values can be tedious when using asymptotic formulas. If we could draw directly from the population we could conduct a Monte-Carlo exercise and recover the distribution of the estimator. In this section we consider such an approach by sampling from the available sample. Given a sample Y_{1},\dots,Y_{n}, there are two main re-sampling approaches. The first is to re-sample n elements from (Y_{1},\dots,Y_{n}) with replacement; the second is to sample m < n elements from (Y_{1},\dots,Y_{n}) without replacement. In both approaches the goal is to generate draws from a distribution that resembles the population distribution as closely as possible.
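As a concrete illustration of the first approach, here is a minimal numpy sketch that resamples with replacement and builds a percentile confidence interval; the sample and the statistic (the mean) are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.exponential(size=200)            # the available sample Y_1..Y_n
n, B = len(y), 2000

# Resample n elements with replacement B times and recompute the statistic
stats = np.array([np.mean(rng.choice(y, size=n, replace=True))
                  for _ in range(B)])

# Percentile confidence interval for the mean
lo, hi = np.quantile(stats, [0.025, 0.975])
print(lo, hi)
```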
The theory behind the bootstrap
The data is assumed to consist of independent draws from F_{0}(x)=F_{0}(x,\theta_{0}) and we consider a statistic T_{n}=T_{n}(X_{1},\dots,X_{n}). The distribution of T_{n} is denoted by G_{n}=G_{n}(t,F_{0})=P_{0}[T_{n}\leq t]. Asymptotic theory relies on G_{\infty}; instead the bootstrap relies on plugging in an estimate of F_{0} and uses G_{n}(\cdot,F_{n}). Taking B samples with replacement from F_{n} and computing T_{n,b} in each, we can construct

\hat{G}_{n}(t) = \frac{1}{B} \sum_{b=1}^{B} \mathbb{1}\{ T_{n,b} \leq t \}

Then what we need for the bootstrap procedure to be asymptotically valid is that

G_{n}(t,F_{n}) - G_{n}(t,F_{0}) \overset{p}{\longrightarrow} 0

uniformly in t. This requires smoothness in F_{0} as well as in G_{n}(\cdot,\cdot), and consistency of F_{n} for F_{0}. In general, if we have \sqrt{n} asymptotic convergence to G_{\infty}, then both G_{n}(t,F_{0}) and G_{n}(t,F_{n}) converge at that rate, and so they are also close to each other:

G_{n}(t,F_{n}) - G_{n}(t,F_{0}) = O_p(n^{-1/2})

which provides no gain over the asymptotic approximation besides the simplicity of the computation.
Parametric Bootstrap
Note that the goal is to approximate F_{0}, and hence F_{n} is a good candidate; however, one can instead use F(\cdot,\theta_{n}) where \theta_{n} is a consistent estimator of \theta_{0}. This is referred to as the parametric bootstrap.
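A minimal sketch, assuming a normal parametric family so that \theta_{n} is just the estimated mean and standard deviation (an illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(loc=2.0, scale=1.5, size=200)
B = 2000

# Parametric bootstrap: draw from F(., theta_n) rather than from F_n
mu_n, sigma_n = y.mean(), y.std(ddof=1)
stats = np.array([rng.normal(mu_n, sigma_n, size=len(y)).mean()
                  for _ in range(B)])
print(np.quantile(stats, [0.025, 0.975]))
```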
Asymptotic refinement
It can be shown that in the case where T_{n} is asymptotically pivotal, meaning that its limiting distribution does not depend on unknown parameters, the bootstrap achieves:

G_{n}(t,F_{n}) - G_{n}(t,F_{0}) = O_p(n^{-1})

The idea here is that one can get a better approximation of the finite-sample distribution: at every n, G_{n}(t,F_{n}) is closer to G_{n}(t,F_{0}) than G_{\infty}(t,F_{0}) is. This can be shown using the Edgeworth expansion, which expands G_{n} as a series in n^{-\frac{1}{2}}:

G_{n}(t,F) = G_{\infty}(t,F) + \frac{g_{1}(t,F)}{\sqrt{n}} + O(n^{-1})

and then G_{\infty}(t,F_{n})-G_{\infty}(t,F_{0})=0 if T_{n} is asymptotically pivotal, and g_{1}(t,F_{n})-g_{1}(t,F_{0})=O_p(n^{-1/2}), delivering an overall O_p(n^{-1}).
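To make the role of pivotal statistics concrete, here is a sketch of a bootstrap-t confidence interval for the mean, where the bootstrapped statistic is studentized and hence asymptotically pivotal (the sample is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.exponential(size=200)
n, B = len(y), 2000
ybar, s = y.mean(), y.std(ddof=1)

# Bootstrap the studentized (asymptotically pivotal) statistic
t_stats = np.empty(B)
for b in range(B):
    yb = rng.choice(y, size=n, replace=True)
    t_stats[b] = np.sqrt(n) * (yb.mean() - ybar) / yb.std(ddof=1)

# Invert the bootstrap distribution of T_n for the confidence interval
q_lo, q_hi = np.quantile(t_stats, [0.025, 0.975])
print(ybar - q_hi * s / np.sqrt(n), ybar - q_lo * s / np.sqrt(n))
```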
Failure of the bootstrap
One example of failure even when the estimator is asymptotically normal is the nearest-neighbor matching estimator (Abadie and Imbens, 2008), for which the variance of the bootstrap is shown to be either too small or too large. Another example is the estimation of the median.
Bias correction using bootstrap
The bootstrap can be used to correct for the bias of an estimator. In many applications the exact form of the bias \mathbb{E}_{0}[\theta_{n}-\theta_{0}] is not known; however, if we consider \bar{\theta}_{n}^{*}, the average of the estimator across bootstrap replications, then \bar{\theta}_{n}^{*}-\theta_{n} gives us an estimate of the bias. We can then consider the bias-corrected estimate \theta_{n}^{BR}=\theta_{n}-(\bar{\theta}_{n}^{*}-\theta_{n}).
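A minimal sketch, using the variance estimator with a 1/n divisor (whose finite-sample bias is known) as an illustrative \theta_{n}:

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(size=50)
n, B = len(y), 2000

theta_n = np.mean((y - y.mean()) ** 2)   # 1/n variance estimator (biased down)

# Average of the estimator across bootstrap replications
theta_b = np.empty(B)
for b in range(B):
    yb = rng.choice(y, size=n, replace=True)
    theta_b[b] = np.mean((yb - yb.mean()) ** 2)
theta_bar_star = theta_b.mean()

# Bias-corrected estimate: theta_n - (theta_bar_star - theta_n)
theta_bc = 2 * theta_n - theta_bar_star
print(theta_n, theta_bc)
```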
Non-iid samples
There will be cases where the data is not exactly iid. For instance, there might be weak spatial correlation. In this case, one might want to bootstrap by resampling clusters of data to replicate the dependence.
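A minimal sketch of such a cluster (block) bootstrap, resampling entire clusters with replacement; the data-generating process and the statistic are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
G, n_g = 20, 30
g = np.repeat(np.arange(G), n_g)                       # cluster labels
y = rng.normal(size=G)[g] + rng.normal(size=G * n_g)   # within-cluster correlation

B = 2000
clusters = np.unique(g)
stats = np.empty(B)
for b in range(B):
    # draw G whole clusters with replacement, then stack their observations
    draw = rng.choice(clusters, size=G, replace=True)
    yb = np.concatenate([y[g == c] for c in draw])
    stats[b] = yb.mean()

print(stats.std())   # bootstrap standard error of the mean under clustering
```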