OLS as a review
We are going to use OLS (ordinary least squares) to clarify a few terms.
Unless you are doing experimental economics, most projects start with a data set and a question. Let's then consider a simple data set (X_i,Y_i) for i \in \{1,...,n\}. These are two variables and n observations.
A simple question that we can directly answer is: what is the average X in the sample? This does not require any statistics; it only requires taking a mean: \bar{X}_n = \frac{1}{n} \sum_i X_i. Objects like \bar{X}_n, which are functions of the sample, are called estimators.
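The sample-mean estimator can be sketched numerically; the distribution and sample size below are purely illustrative.

```python
import numpy as np

# Illustrative sample: n = 1000 draws from an arbitrary distribution.
rng = np.random.default_rng(0)
X = rng.normal(loc=2.0, scale=1.0, size=1000)

# The estimator bar{X}_n = (1/n) * sum_i X_i is a function of the sample.
X_bar_n = X.sum() / len(X)
```

Because the sample is random, \bar{X}_n is itself a random object: a different draw of the sample gives a different value.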
In general, we are interested in questions that involve values we did not directly observe. For instance, we might be interested in the average value of Y conditional on some given fixed value of X, or we might be interested in the average value of X in the entire U.S. when we only have a sample of size 10,000. To make the object of interest precise, we then define a population, a distribution from which the sample is drawn.
Population, models and estimands
The population is a construct that allows us to precisely define the objects we are interested in recovering from our data. In concrete terms, it will represent the joint distribution from which the data/sample is drawn.
In its simplest form, we could write down our population as the joint distribution of the data f(X_1,Y_1,...,X_n,Y_n). Using this population I can for instance define the average of X in the population, i.e. \bar{X} = \mathbb{E}_F[X]. This \bar{X} = \mathbb{E}_F[X] is different from \bar{X}_n = \frac{1}{n} \sum_i X_i since it is defined using the population and not the sample. Objects constructed from the population will be referred to as estimands. One can then start to think about how well we can learn about \bar{X} from the sample (X_i,Y_i).
Often, however, we want to include variables that are not observed. For instance, we might want to control for some unobserved factors. Indeed, I might be interested in the effect of changing X while keeping some characteristic of each individual i fixed; let's call such a characteristic U_i. In this case I would define my population as f(X_1,Y_1,U_1,...,X_n,Y_n,U_n).
In general we will start with a class of such distributions, indexed by some parameter \theta. We will think of this class of distributions as our model. The parameter space can be finite dimensional, in which case we call it a parametric model, or it can be infinite dimensional, in which case we call it nonparametric.
The first interesting point to note is that in this new population \mathbb{E}_F[Y|X=x'] and \mathbb{E}_F \big[ \mathbb{E}_F[Y|X=x',U] \big] might be quite different objects. Take Y as income and X as college degree: the first expression asks for the difference in income between people with a college degree and people without one. The second expression asks what the average effect of changing the degree of each individual would be.
To be precise, also imposing iid sampling across i, we define:
The second important observation is that there might be different distributions f(X_1,Y_1,U_1,...,X_n,Y_n,U_n) that deliver a given f(X_1,Y_1,...,X_n,Y_n). Such augmented distributions are said to be observationally equivalent. If we have two such distributions that would generate the same exact data but would give us two different values of our parameter of interest, then we would be in trouble.
An example of a model is the linear conditional expectation where we specify
as well as the joint distribution f(X_1,U_1,...,X_n,U_n), where we would typically be interested in \beta. We see that, given \beta and f(X_1,U_1,...,X_n,U_n), one knows the joint f(X_1,Y_1,...,X_n,Y_n). The reverse will require additional assumptions! This leads to our next paragraph.
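The forward direction can be sketched in a few lines: given a value of \beta and a joint distribution for (X, U), we can generate the observable joint distribution of (X, Y). All numbers below are illustrative.

```python
import numpy as np

# Sketch: the model maps (beta, f(X, U)) into the observable f(X, Y)
# through Y = X * beta + U. Here X and U are drawn independently.
rng = np.random.default_rng(1)
n, beta = 10_000, 1.5

X = rng.normal(size=n)
U = rng.normal(size=n)
Y = X * beta + U   # the observable outcome implied by the model
```

Going the other way, from f(X, Y) back to \beta, is the identification question discussed next.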
Identification
An important first step, once the model has been defined and an estimand of interest has been expressed, is to ask whether the observed part of the population f(X_1,Y_1,...,X_n,Y_n), together with the structure we imposed in the model, allows recovering a unique estimand. When this is the case, we say that the parameter of interest, or the estimand of interest, is identified.
In other words, being identified refers to the ability to construct the estimand from the observed data in the context where the full distribution is given to you.
Let's take the linear case again: the assumption that Y_i = X_i \beta + U_i is, for instance, not sufficient to construct \beta from f(X_1,Y_1,...,X_n,Y_n).
Let's make an additional familiar assumption: assume that (X_i,Y_i,U_i) are independent across i and drawn from a joint distribution in which U_i and X_i are conditional mean independent. Hence we impose further that \mathbb{E}[U_i | X_i ]=0 and that \mathbb{E} XX' is invertible.
In this case, let's show that \beta is identified. We can show identification by directly constructing \beta from f(X_1,Y_1,...,X_n,Y_n). Indeed:
Note here that \mathbb{E}XX' is a k \times k matrix.
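As an illustration (not a proof) of this identification argument, we can approximate the population moments \mathbb{E} XX' and \mathbb{E} XY by very large-sample averages and recover \beta from them; the data-generating process below is an assumption made for the example.

```python
import numpy as np

# With E[U|X] = 0, beta = (E[XX'])^{-1} E[XY]: a construction of the
# estimand from observable population moments. We stand in for the
# population with a very large sample.
rng = np.random.default_rng(2)
n, beta = 1_000_000, np.array([2.0, -1.0])

X = rng.normal(size=(n, 2))        # k = 2 regressors
U = rng.normal(size=n)             # E[U|X] = 0 by construction
Y = X @ beta + U

EXX = X.T @ X / n                  # approximates E[XX'], a k x k matrix
EXY = X.T @ Y / n                  # approximates E[XY]
beta_identified = np.linalg.solve(EXX, EXY)
```

The key point is that only observable moments of (X, Y) enter the construction; U never appears.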
Estimators
As stated at the beginning, an estimator is a function of the sample, and as such it is a random object. Estimators are often written either with a hat or with an n subscript. In the case where the data are given by a vector Y_n of size n and a matrix X_n of size n \times k, the OLS estimator is given by
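The OLS estimator \beta_n = (X_n'X_n)^{-1} X_n'Y_n can be written directly as a function of the sample; the helper name `ols` and the simulated data are illustrative.

```python
import numpy as np

def ols(X, Y):
    """OLS coefficients for regressors X (n x k) and outcome Y (n,).

    Solving the normal equations (X'X) b = X'Y is numerically preferable
    to forming the inverse of X'X explicitly.
    """
    return np.linalg.solve(X.T @ X, X.T @ Y)

# Illustrative sample drawn from a linear model.
rng = np.random.default_rng(3)
X = rng.normal(size=(500, 3))
Y = X @ np.array([1.0, 0.5, -2.0]) + rng.normal(size=500)
beta_n = ols(X, Y)
```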
Finite sample properties of estimators
Unbiasedness
This is the property that \mathbb{E} [ \beta_n | X_n] = \beta. We can check this is true for the OLS estimator under the assumptions we stated before:
Since this is true conditional on X_n, it will also be true unconditionally.
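Unbiasedness conditional on X_n can be checked by Monte Carlo: hold X_n fixed, redraw U_n many times, and average the resulting estimates. The sample size and number of replications below are arbitrary choices for the sketch.

```python
import numpy as np

# Monte Carlo check of E[beta_n | X_n] = beta.
rng = np.random.default_rng(4)
n, beta = 200, np.array([1.0, -0.5])
X = rng.normal(size=(n, 2))          # held fixed across replications

draws = []
for _ in range(2000):
    U = rng.normal(size=n)           # only the errors are redrawn
    Y = X @ beta + U
    draws.append(np.linalg.solve(X.T @ X, X.T @ Y))

beta_bar = np.mean(draws, axis=0)    # should be close to beta
```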
Finite sample distribution
Let's assume further that U_n | X_n is Normally and independently distributed. In other words:
Then we have that
and, conditional on X_n, since U_n is normally distributed, any linear combination is also normally distributed. We can then compute the variance-covariance matrix of the joint normal distribution:
In addition note that
So we end up with the following expression for the finite sample distribution of the estimator of \beta:
Let's make a couple of remarks:
- we notice that as n \rightarrow \infty, \beta_n indeed concentrates on \beta.
- we also notice that it looks like copy-pasting the sample reduces the variance. Why is that not true?
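The conditional variance formula can also be verified by simulation; the comparison below is a sketch with illustrative parameter values.

```python
import numpy as np

# Monte Carlo check of Var(beta_n | X_n) = sigma^2 (X_n' X_n)^{-1}.
rng = np.random.default_rng(5)
n, k, sigma, beta = 100, 2, 1.5, np.array([1.0, 2.0])
X = rng.normal(size=(n, k))          # held fixed

reps = np.array([
    np.linalg.solve(X.T @ X, X.T @ (X @ beta + sigma * rng.normal(size=n)))
    for _ in range(20_000)
])
V_mc = np.cov(reps.T)                          # simulated variance matrix
V_theory = sigma**2 * np.linalg.inv(X.T @ X)   # the finite-sample formula
```

Note that duplicating each observation would halve the formula's value of (X_n'X_n)^{-1} but would not halve the true variance, because the duplicated errors are perfectly dependent; this is one way to see the answer to the question above.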
Asymptotic properties of estimators
Very often, we would rather not have to make the Normality assumption on the error directly. Instead it is common to try to rely on results based on large samples and build on top of the central limit theorem.
If we can specify a sequence of populations f_n(Y_1,X_1,...,Y_n,X_n), we can start thinking about deriving properties of estimators in the limit where n grows large.
A common way to generate such a sequence of populations is again to assume that observations are iid across i and drawn from f(Y,X).
We then look at two important properties.
Consistency
An estimator is consistent if \beta_n \rightarrow \beta in probability as n \rightarrow \infty.
We look again at the OLS estimator:
and hence we get that
where \text{plim}\frac{1}{n}X_{n}'U_{n}=\text{plim}\frac{1}{n}\sum_{i}X_{i} U_{i}=\mathbb{E} X_i U_i=0 (under existence of these limits).
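Consistency can be seen in action by computing the estimator on samples of growing size; the scalar design below is an illustrative assumption.

```python
import numpy as np

# The OLS estimate drifts toward beta as n grows (convergence in probability).
rng = np.random.default_rng(6)
beta = 2.0

def beta_hat(n):
    X = rng.normal(size=n)
    Y = X * beta + rng.normal(size=n)
    return (X @ Y) / (X @ X)     # scalar OLS: (X'X)^{-1} X'Y

errors = [abs(beta_hat(n) - beta) for n in (100, 10_000, 1_000_000)]
```

For a single realization the error need not shrink monotonically, but at n = 1{,}000{,}000 it is very small with overwhelming probability.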
Asymptotic distribution
We conclude with the asymptotic distribution of \beta_n. We consider
we have that \frac{1}{n}X_{n}'X_{n}\rightarrow\mathbb{E} XX'. Now we should look at the second term. By the central limit theorem it will converge to
if we are in an iid case and additionally assume homoskedasticity, i.e. \mathbb{E}\left[U_i^2|X_i\right] = \sigma^{2}_u, then
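The asymptotic approximation can be checked by simulating the distribution of \sqrt{n}(\beta_n - \beta); in the scalar illustrative design below with \mathbb{E} X^2 = 1, its standard deviation should be close to \sigma_u.

```python
import numpy as np

# Sketch: sqrt(n)(beta_n - beta) is approximately N(0, sigma_u^2 (E[XX'])^{-1}).
rng = np.random.default_rng(7)
n, beta, sigma_u = 500, 1.0, 2.0

stats = []
for _ in range(5000):
    X = rng.normal(size=n)                    # E[X^2] = 1
    Y = X * beta + sigma_u * rng.normal(size=n)
    b = (X @ Y) / (X @ X)
    stats.append(np.sqrt(n) * (b - beta))

sd_mc = np.std(stats)   # should be close to sigma_u / sqrt(E[X^2]) = 2
```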
Confidence intervals
Beyond point estimates of parameters, we are also interested in forming confidence intervals for the parameter \beta. A 1-\alpha confidence interval is a pair of estimators a_{n},c_{n} (functions of the sample) such that
where \beta is fixed and a_{n},c_{n} are the random variables. See the example for a normally distributed estimator.
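A standard construction, sketched here under the homoskedastic assumptions above, takes a_n and c_n as the estimate plus or minus a normal critical value times an estimated standard error; the 95% level and the simulated design are illustrative.

```python
import numpy as np

# A 95% confidence interval [a_n, c_n] = beta_n -/+ 1.96 * se(beta_n),
# with se based on sigma_hat^2 (X'X)^{-1}.
rng = np.random.default_rng(8)
n, beta = 1000, 1.0
X = rng.normal(size=n)
Y = X * beta + rng.normal(size=n)

b = (X @ Y) / (X @ X)                 # scalar OLS estimate
resid = Y - X * b
sigma2_hat = resid @ resid / (n - 1)  # estimated error variance
se = np.sqrt(sigma2_hat / (X @ X))
a_n, c_n = b - 1.96 * se, b + 1.96 * se   # the two estimators
```

Both endpoints are functions of the sample, hence random; across repeated samples the interval covers the fixed \beta about 95% of the time.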
We will often consider asymptotic confidence intervals, where the coverage statement is only required to hold in the limit:
Probability refresher
Central limit theorem: given a sequence of iid random variables (X_1, X_2, ...) with \mathbb{E}X = \mu and Var[X_i]=\sigma^2 < \infty, define S_n = \frac{1}{n}( X_1 + ... + X_n); then:
Law of large numbers: for the same sequence S_n \overset{p}{\rightarrow} \mu
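The law of large numbers can be sketched in one line of simulation; the mean and sample sizes below are illustrative.

```python
import numpy as np

# S_n = (1/n)(X_1 + ... + X_n) approaches mu as n grows.
rng = np.random.default_rng(9)
mu = 3.0
S = [rng.normal(loc=mu, size=n).mean() for n in (10, 1_000, 100_000)]
```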
Probability limit: given a sequence X_1,X_2,..., we say that X_n converges in probability to X, and write X_n \overset{p}{\rightarrow} X, if for every \epsilon > 0 we have that
Convergence in distribution: we write X_n \overset{d}{\rightarrow} X iff P(X_n \leq x) \rightarrow P(X \leq x) for all x at which x \mapsto P(X \leq x) is continuous.
Expressing matrix products: let's look at X_n' X_n. We have defined X_n as an n \times k matrix where each row corresponds to individual i's regressors x_i, which is k \times 1. Hence the element in row i and column j of X_n is [X_n]_{ij} = [x_i]_j, the j-th component of the regressors of individual i.
Looking at the matrix C = X_n' X_n, by definition of the matrix multiplication, the elements of C are
where we recognize the matrix product D_i = x_i x_i', which is k \times k and has elements [D_i]_{pq} = [x_i]_p [x_i]_q. Hence we do get that C = \sum_i D_i and:
We use this often in proofs to express limits as averages.
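The identity X_n'X_n = \sum_i x_i x_i' is easy to verify numerically; the dimensions below are arbitrary.

```python
import numpy as np

# Check that X_n' X_n equals the sum over i of the outer products x_i x_i'.
rng = np.random.default_rng(10)
n, k = 50, 3
X = rng.normal(size=(n, k))

C = X.T @ X                                          # k x k Gram matrix
D_sum = sum(np.outer(X[i], X[i]) for i in range(n))  # sum_i x_i x_i'
```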