OLS as a review

We are going to use OLS (ordinary least squares) to clarify a few terms.

Unless you are doing experimental economics, most projects start with a data set and a question. Let's then consider a simple data set $(X_i,Y_i)$ for $i \in \{1,...,n\}$ . These are two variables and $n$ observations.

A simple question that we can answer directly is: what is the average $X$ in the sample? This does not require any statistics, it only requires taking a mean: $\bar{X}_n = \frac{1}{n} \sum_i X_i$ . Objects like $\bar{X}_n$ , which are functions of the sample, are called estimators.

In general, we are interested in questions that involve values that we didn't directly observe. For instance we might be interested in the average value of $Y$ conditional on some given fixed value of $X$ , or we might be interested in the average value of $X$ in the whole U.S. when we only have a sample of size $10,000$ . To make the object of interest precise, we then define a population, a distribution from which the sample is drawn.

Population, models and estimands

The population is a construct that allows us to precisely define the objects we are interested in recovering from our data. In concrete terms it will represent the joint distribution from which the data/sample is drawn.

In its simplest form, we could write down our population as the joint distribution of the data $f(X_1,Y_1,...,X_n,Y_n)$ . Using this population I can for instance define the average of $X$ in the population, i.e. $\bar{X} = \mathbb{E}_F[X]$ . This is different from $\bar{X}_n = \frac{1}{n} \sum_i X_i$ since it is defined using the population and not using the sample. Objects constructed on the population will be referred to as estimands. One can then start to think about how well we can learn about $\bar{X}$ from the sample $(X_i,Y_i)$ .
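The distinction between the estimand $\bar{X}$ and the estimator $\bar{X}_n$ can be illustrated with a small simulation. This is only a sketch: the Normal population, its mean of 2.0, the sample size and the helper `sample_mean` are illustrative assumptions, not part of the notes.

```python
import numpy as np

def sample_mean(x):
    """Estimator: a function of the sample only."""
    return np.sum(x) / len(x)

rng = np.random.default_rng(0)

# Estimand: the population mean, known here only because we simulate.
population_mean = 2.0
n = 1_000

# Two independent samples drawn from the same (assumed Normal) population.
sample_a = rng.normal(loc=population_mean, scale=1.0, size=n)
sample_b = rng.normal(loc=population_mean, scale=1.0, size=n)

mean_a = sample_mean(sample_a)
mean_b = sample_mean(sample_b)
print(f"estimand: {population_mean}, estimates: {mean_a:.3f}, {mean_b:.3f}")
```

The estimand stays fixed while the estimate changes from one sample to the next, which is the sense in which an estimator is a random object.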

Often however, we want to include variables that are not observed. For instance we might want to control for some unobserved factors. Indeed I might be interested in the effect of changing $X$ while keeping some characteristics of each individual $i$ fixed; let's call such a characteristic $U_i$ . In this case I would define my population as $f(X_1,Y_1,U_1,...,X_n,Y_n,U_n)$ .

In general we will start with a class of such distributions indexed by some parameter $\theta$ . We will think of this class of distributions as our model. The parameter space can be finite dimensional, in which case we will call it a parametric model, or it could be infinite dimensional, in which case we will call it nonparametric.

The first interesting point to note is that in this new population $\mathbb{E}_F[Y|X=x']$ and $\mathbb{E}_F \big[ \mathbb{E}_F[Y|X=x',U] \big]$ might be quite different objects. Take $Y$ as income and $X$ as college degree. Comparing the first expression across values of $x'$ gives the difference in income between people with a college degree and people without a college degree. Comparing the second expression across values of $x'$ gives the average effect of changing the degree of each individual while holding $U$ fixed.

To be precise, also imposing iid across $i$ , we define:

$\begin{aligned} \mathbb{E}_F[Y|X=x'] & = \int y \, f_{Y|X}(y|x') \text{d}y \\ \mathbb{E}_F \big[ \mathbb{E}_F[Y|X=x',U] \big] & = \int \Big( \int y \, f_{Y|X,U}(y|x',u) \text{d}y \Big) f_U(u) \text{d} u \end{aligned}$
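A simulated population can make the gap between the two objects concrete. The sketch below assumes a hypothetical data generating process (the coefficients 1.0 and 2.0, the Normal draws and the threshold rule for $X$ are all illustrative) in which $U$ raises both the probability of $X=1$ and the level of $Y$. Because $\mathbb{E}[Y|X,U]$ is linear in this made-up population, a regression on $(1, X, U)$ recovers it; this is possible only because the simulation lets us see $U$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical population: U raises both the chance of X = 1 and the level of Y.
U = rng.normal(size=n)
X = (U + rng.normal(size=n) > 0).astype(float)
Y = 1.0 * X + 2.0 * U + rng.normal(size=n)  # effect of X holding U fixed is 1.0

# Contrast based on E[Y | X = x']: compare the two group means.
naive_gap = Y[X == 1].mean() - Y[X == 0].mean()

# Contrast based on E[ E[Y | X = x', U] ]: regress Y on (1, X, U) and read
# off the coefficient on X. U is usable here only because we simulate it.
Z = np.column_stack([np.ones(n), X, U])
coef = np.linalg.lstsq(Z, Y, rcond=None)[0]
gap_holding_u_fixed = coef[1]

print(f"group-mean gap: {naive_gap:.2f}, gap holding U fixed: {gap_holding_u_fixed:.2f}")
```

In this population the group-mean gap is much larger than 1.0, because people with $X=1$ also tend to have a higher $U$.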

The second important observation is that there might be different $f(X_1,Y_1,U_1,...,X_n,Y_n,U_n)$ distributions that deliver a given $f(X_1,Y_1,...,X_n,Y_n)$ . Such augmented distributions are said to be observationally equivalent. If two such distributions generate exactly the same data but give two different values of our parameter of interest, then we are in trouble.

An example of a model is the linear conditional expectation where we specify

$Y_i = X_i \beta + U_i$

as well as the joint distribution $f(X_1,U_1,...,X_n,U_n)$ , and where we would probably be interested in $\beta$ . We see that given $\beta$ and $f(X_1,U_1,...,X_n,U_n)$ one knows the joint $f(X_1,Y_1,...,X_n,Y_n)$ . The reverse will require additional assumptions! This leads to our next section.

Identification

An important first step once the model has been defined and an estimand of interest has been expressed is to ask the question whether the observed part of the population $f(X_1,Y_1,...,X_n,Y_n)$ together with the structure we imposed in the model allow recovering a unique estimand. When this is the case, we say that the parameter of interest, or the estimand of interest is identified.

In other words, being identified refers to the ability to construct the estimand using observed data in the context where the full distribution is given to you.

Let's take the linear case again: the assumption that $Y_i = X_i \beta + U_i$ is, on its own, not sufficient to construct $\beta$ from $f(X_1,Y_1,...,X_n,Y_n)$ .
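To see why, here is a small worked example under illustrative assumptions that are not part of the model above: take scalar $X_i$ and suppose the unobservable has the form $U_i = \gamma X_i + \epsilon_i$ with $\mathbb{E}[\epsilon_i | X_i] = 0$ , where $\gamma$ and $\epsilon_i$ are hypothetical. Substituting into the linear equation gives

$\begin{aligned} Y_i & = X_i \beta + \gamma X_i + \epsilon_i \\ & = X_i (\beta + \gamma) + \epsilon_i \end{aligned}$

For a given distribution of $(X_i, \epsilon_i)$ , the observed distribution $f(X_1,Y_1,...,X_n,Y_n)$ depends on $(\beta,\gamma)$ only through the sum $\beta + \gamma$ . For instance $(\beta,\gamma) = (1,0)$ and $(\beta,\gamma) = (0,1)$ are observationally equivalent while implying different values of $\beta$ .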

Let's make an additional familiar assumption: let's assume that $(X_i,Y_i,U_i)$ are independent across $i$ and drawn from a joint distribution where $U_i$ is conditionally mean independent of $X_i$ . More precisely, we impose that $\mathbb{E}[U_i | X_i ]=0$ and that $\mathbb{E} XX'$ is invertible.

In this case, let's show that $\beta$ is identified. We can show identification by directly constructing $\beta$ from $f(X_1,Y_1,...,X_n,Y_n)$ . Indeed:

$\begin{aligned} \left(\mathbb{E}XX'\right)^{-1}\mathbb{E}XY & = \left(\mathbb{E}XX'\right)^{-1}\mathbb{E}XX'\beta+\left(\mathbb{E}XX'\right)^{-1}\mathbb{E}X U \\ & = \beta+0 \end{aligned}$

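The last term vanishes because $\mathbb{E}XU = \mathbb{E}\big[ X \, \mathbb{E}[U|X] \big] = 0$ by the law of iterated expectations. Note here that $\mathbb{E}XX'$ is a $k \times k$ matrix.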

Estimators

As stated at the beginning, an estimator is a function of the sample, and as such it is a random object. Estimators are often written either with a hat or with an $n$ subscript. In the case where the data is given by a vector $Y_n$ of size $n$ and a matrix $X_n$ of size $n \times k$ , the OLS estimator is given by

$\beta_n^\text{ols} = (X_n' X_n)^{-1} X_n' Y_n$
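The formula translates almost directly into NumPy. The following is a minimal sketch: the function name `ols`, the simulated design and the value of `beta` are illustrative assumptions. It solves the normal equations $(X_n'X_n) b = X_n'Y_n$ rather than forming the inverse explicitly, which gives the same estimator.

```python
import numpy as np

def ols(X, y):
    """Return (X'X)^{-1} X'y by solving the normal equations (X'X) b = X'y."""
    return np.linalg.solve(X.T @ X, X.T @ y)

rng = np.random.default_rng(1)
n, k = 500, 3
beta = np.array([1.0, -2.0, 0.5])  # illustrative true parameter

X = rng.normal(size=(n, k))        # n x k matrix of regressors
U = rng.normal(size=n)             # errors with E[U | X] = 0
Y = X @ beta + U

beta_hat = ols(X, Y)
print(beta_hat)
```

The result agrees with `np.linalg.lstsq(X, Y, rcond=None)[0]`, which is usually the numerically safer choice when $X_n'X_n$ is close to singular.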

Finite sample properties of estimators

Unbiasedness

This is the property that $\mathbb{E} [ \beta_n | X_n] = \beta$ . We can check this is true for the OLS estimator under the assumptions we stated before:

$\begin{aligned} \mathbb{E}[ \beta_{n} | X_n] & =\mathbb{E} [ \left(X_{n}'X_{n}\right)^{-1}X_{n}'Y_{n} | X_n]\\ & =\mathbb{E}[ \left(X_{n}'X_{n}\right)^{-1}X_{n}'(X_{n}\beta + U_{n}) | X_n ] \\ & =\mathbb{E}[ \left(X_{n}'X_{n}\right)^{-1}X_{n}'X_{n}\beta| X_n] + \mathbb{E}[ \left(X_{n}'X_{n}\right)^{-1}X_{n}'U_{n} | X_n] \\ & =\beta \end{aligned}$

The last step uses $\mathbb{E}[U_n | X_n] = 0$ . Since this is true conditional on $X_n$ , it will also be true unconditionally.
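Unbiasedness conditional on $X_n$ can be checked by Monte Carlo: hold $X_n$ fixed, redraw $U_n$ many times, and average the resulting estimates. The sketch below uses assumed values for `n`, `k`, `reps` and `beta`, and uniform errors to stress that Normality is not needed for this property.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, reps = 50, 2, 5_000
beta = np.array([1.0, -0.5])                 # illustrative true parameter

X = rng.normal(size=(n, k))                  # drawn once, then held fixed
U = rng.uniform(-1.0, 1.0, size=(n, reps))   # E[U | X] = 0, not Normal
Y = (X @ beta)[:, None] + U                  # one column per replication

# OLS on every replication at once: solve (X'X) B = X'Y for all columns.
beta_hats = np.linalg.solve(X.T @ X, X.T @ Y)  # shape (k, reps)
mean_beta_hat = beta_hats.mean(axis=1)
print(mean_beta_hat)
```

The average of the estimates across replications should be close to `beta` even though each sample has only 50 observations.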

Finite sample distribution

Let's assume further that $U_n | X_n$ is Normally and independently distributed. In other words:

$U_n | X_n \sim \mathcal{N}(0, \sigma^2_u I_n)$

Then we have that

$\beta_n - \beta = \left(X_{n}'X_{n}\right)^{-1}X_{n}'U_{n}$

and, conditional on $X_n$ , since $U_n$ is Normally distributed, any linear combination of its elements is also Normally distributed. We can then compute the variance covariance matrix of the joint Normal distribution:

$\begin{aligned} \text{Var}( \left(X_{n}'X_{n}\right)^{-1}X_{n}'U_{n} | X_n) &= \left( X_n'X_n \right)^{-1}X'_{n} \; \mathbb{E} [ U_n U_n' | X_n ] \; X_n \left( X_{n}'X_{n}\right)^{-1} \\ &= \sigma_u^2 \left( X_n'X_n \right)^{-1} \end{aligned}$

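In addition note that, writing $\hat{Var}(X_i)$ for the sample second moment $\frac{1}{n} \sum_i X_i X_i'$ (which is the sample variance when the regressors are demeaned),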

$X_n'X_n = \sum_i X_i X_i' = n \hat{Var}(X_i)$

So we end up with the following expression for the finite sample distribution of the estimator of $\beta$ :

$\beta_n | X_n \sim \mathcal{N}\left( \beta , \frac{\sigma^2_u}{n} \hat{Var}(X_i)^{-1} \right)$
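The variance formula $\sigma_u^2 (X_n'X_n)^{-1}$ can be compared with the spread of simulated estimates. This is a sketch under assumed values of `n`, `k`, `reps`, `beta` and `sigma_u`; the regressors are drawn once and held fixed so that everything is conditional on $X_n$ .

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, reps = 30, 2, 20_000
beta = np.array([0.5, 2.0])    # illustrative true parameter
sigma_u = 1.5                  # illustrative error standard deviation

X = rng.normal(size=(n, k))                      # held fixed across replications
U = rng.normal(scale=sigma_u, size=(n, reps))    # U | X ~ N(0, sigma_u^2 I)
Y = (X @ beta)[:, None] + U
beta_hats = np.linalg.solve(X.T @ X, X.T @ Y)    # shape (k, reps)

empirical_cov = np.cov(beta_hats)                # rows are the k coefficients
theoretical_cov = sigma_u**2 * np.linalg.inv(X.T @ X)

print(empirical_cov)
print(theoretical_cov)
```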

Let's make a couple of remarks:

we notice that as $n \rightarrow \infty$ , indeed $\beta_n$ concentrates on $\beta$ .
we also notice that it looks like copy pasting the sample reduces the variance. Why is that not true?
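One way to explore the second remark numerically is to stack a sample on top of itself and compare. The sketch below uses an assumed scalar design and a hypothetical helper `ols`. It shows that the point estimate does not move at all, while the formula $\sigma_u^2 (X_n'X_n)^{-1}$ is mechanically halved. The formula was derived under $\text{Var}(U_n | X_n) = \sigma_u^2 I_n$ , which fails for the stacked sample because each error appears twice.

```python
import numpy as np

def ols(X, y):
    """OLS via the normal equations."""
    return np.linalg.solve(X.T @ X, X.T @ y)

rng = np.random.default_rng(0)
n = 200
sigma_u = 1.0
X = rng.normal(size=(n, 1))
Y = X @ np.array([1.0]) + rng.normal(scale=sigma_u, size=n)

# Copy-paste the sample: every observation now appears twice.
X_dup = np.vstack([X, X])
Y_dup = np.concatenate([Y, Y])

beta_orig = ols(X, Y)
beta_dup = ols(X_dup, Y_dup)

# Variance formula applied naively to each data set.
naive_var_orig = sigma_u**2 * np.linalg.inv(X.T @ X)
naive_var_dup = sigma_u**2 * np.linalg.inv(X_dup.T @ X_dup)

print(beta_orig, beta_dup)
print(naive_var_orig, naive_var_dup)
```

Since the estimator is numerically identical on both data sets, its true sampling variance cannot have changed; only the misapplied formula has.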

Asymptotic properties of estimators

Very often, we would rather not have to make the Normality assumption on the error directly. Instead it is common to try to rely on results based on large samples and build on top of the central limit theorem.

If we can specify a sequence of populations $f_n(Y_1,X_1,...,Y_n,X_n)$ we can start thinking about deriving properties of estimators in the limit where $n$ grows large.

A common way to generate such a sequence of populations is again to assume that observations are iid across $i$ and drawn from $f(Y,X)$ .

We then look at two important properties.

Consistency

An estimator is consistent if $\beta_n \rightarrow \beta$ in probability as $n \rightarrow \infty$ .

We look again at the OLS estimator:

$\begin{aligned} \beta_{n} &= \left(X_{n}'X_{n}\right)^{-1}X_{n}'Y_{n} \\ &=\left(X_{n}'X_{n}\right)^{-1}X_{n}'X_{n}\beta +\left(X_{n}'X_{n}\right)^{-1}X_{n}'U_{n} \\ &=\beta + \left(X_{n}'X_{n}\right)^{-1}X_{n}'U_{n} \\ &=\beta + \left(\frac{1}{n}X_{n}'X_{n}\right)^{-1}\left(\frac{1}{n}X_{n}'U_{n}\right) \\ \end{aligned}$

and hence we get that

$\text{plim}\beta_{n}=\beta+\left(\text{plim}\frac{1}{n}X_{n}'X_{n}\right)^{-1}\left(\text{plim}\frac{1}{n}X_{n}'U_{n}\right)$

where $\text{plim}\frac{1}{n}X_{n}'U_{n}=\text{plim}\frac{1}{n}\sum_{i}X_{i} U_{i}=\mathbb{E} X_i U_i=0$ (under existence of these limits).
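Consistency can be illustrated by letting the sample size grow and tracking the estimation error. In this sketch the sample sizes, `beta`, and the skewed mean-zero errors are illustrative assumptions; no Normality is imposed.

```python
import numpy as np

def ols(X, y):
    """OLS via the normal equations."""
    return np.linalg.solve(X.T @ X, X.T @ y)

rng = np.random.default_rng(0)
beta = np.array([1.0, -2.0])   # illustrative true parameter
sample_sizes = [100, 10_000, 1_000_000]

errors = {}
for n in sample_sizes:
    X = rng.normal(size=(n, 2))
    U = rng.exponential(scale=1.0, size=n) - 1.0   # mean zero, skewed
    Y = X @ beta + U
    # Largest absolute deviation of the estimate from the true parameter.
    errors[n] = np.max(np.abs(ols(X, Y) - beta))
    print(n, errors[n])
```

The error at the largest sample size should be very small, in line with $\beta_n \rightarrow \beta$ in probability.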

Asymptotic distribution

We conclude with the asymptotic distribution of $\beta_n$ . We consider

$\begin{aligned} \sqrt{n}(\beta_{n}-\beta) &=\sqrt{n}\left(X_{n}'X_{n}\right)^{-1}X_{n}'U_{n} \\ &=\left(\frac{1}{n}X_{n}'X_{n}\right)^{-1}\left(\frac{1}{\sqrt{n}}X_{n}'U_{n}\right) \end{aligned}$

By the law of large numbers we have that $\frac{1}{n}X_{n}'X_{n} \overset{p}{\rightarrow} \mathbb{E} XX'$ . Now we look at the second term. Since $\mathbb{E} X_i U_i = 0$ , the central limit theorem gives

$\frac{1}{\sqrt{n}}X_{n}'U_{n}=\frac{1}{\sqrt{n}}\sum_{i}X_{i} U_{i}\overset{d}{\rightarrow}\mathcal{N}(0,\mathbb{E} X_{i}U_{i}U_{i}'X_{i}')$

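If the errors are homoskedastic, that is $\mathbb{E}\left[U_i U_i'|X_i\right] = \mathbb{E}\left[U_i^2|X_i\right] = \sigma^{2}_u$ , then $\mathbb{E} X_{i}U_{i}U_{i}'X_{i}' = \sigma^2_u \mathbb{E} X_i X_i'$ and so

$\sqrt{n}(\beta_{n}-\beta)\overset{d}{\rightarrow}\mathcal{N}(0,\sigma^{2}_u\left(\mathbb{E} X_iX_i'\right)^{-1})$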
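The limiting variance can be checked by simulation without assuming Normal errors. The sketch below assumes a scalar regressor with $X_i \sim \mathcal{N}(0, 4)$ , so that $\mathbb{E} X_i^2 = 4$ , and centred exponential errors with variance 1 drawn independently of $X_i$ ; these choices, like `n` and `reps`, are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps = 500, 4_000
beta = 1.0
sigma2_u = 1.0           # variance of the centred exponential(1) errors
second_moment_x = 4.0    # E[X^2] for X ~ N(0, 2^2)

X = rng.normal(scale=2.0, size=(reps, n))             # one row per replication
U = rng.exponential(scale=1.0, size=(reps, n)) - 1.0  # mean zero, not Normal
Y = beta * X + U

# Scalar OLS in each replication: sum(x*y) / sum(x*x).
beta_hats = (X * Y).sum(axis=1) / (X * X).sum(axis=1)
z = np.sqrt(n) * (beta_hats - beta)

empirical_var = z.var()
limit_var = sigma2_u / second_moment_x
print(empirical_var, limit_var)
```

A histogram of `z` should also look close to a Normal density centred at zero, even though the errors themselves are skewed.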

Confidence intervals

Beyond point estimates of parameters, we are also interested in forming confidence intervals for the parameter $\beta$ . A $1-\alpha$ confidence interval is a pair of estimators $a_{n},c_{n}$ (functions of the sample) such that

$P(\beta\in[a_{n},c_{n}])\geq1-\alpha$

where $\beta$ is fixed and $a_{n},c_{n}$ are the random variables. See the example for a normally distributed estimator.

We will often consider asymptotic confidence intervals, where we replace the inequality with a limit as $n \rightarrow \infty$ :

$P(\beta \in[a_{n},c_{n}])\rightarrow 1-\alpha$
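Coverage can be estimated by simulation: build the interval in many independent samples and count how often it contains the fixed $\beta$ . The sketch below assumes a scalar model, $\alpha = 0.05$ , and the usual interval $\beta_n \pm 1.96 \cdot \text{se}$ based on the asymptotic Normal approximation; the names `z_crit`, `a_n` and `c_n` and the simulation sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps = 200, 4_000
beta = 1.0
z_crit = 1.96   # approximate 0.975 quantile of N(0, 1), i.e. alpha = 0.05

X = rng.normal(size=(reps, n))   # one row per replication
U = rng.normal(size=(reps, n))
Y = beta * X + U

sxx = (X * X).sum(axis=1)
beta_hats = (X * Y).sum(axis=1) / sxx

# Estimate the error variance from residuals (one regressor, so n - 1 d.o.f.).
resid = Y - beta_hats[:, None] * X
sigma2_hats = (resid**2).sum(axis=1) / (n - 1)
se = np.sqrt(sigma2_hats / sxx)

# The interval endpoints are random; beta is fixed.
a_n = beta_hats - z_crit * se
c_n = beta_hats + z_crit * se
coverage = np.mean((a_n <= beta) & (beta <= c_n))
print(coverage)
```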

Probability refresher

Central limit theorem: given a sequence of iid random variables $(X_1, X_2, ...)$ with $\mathbb{E}X_i = \mu$ and $Var[X_i]=\sigma^2 < \infty$ , define $S_n = \frac{1}{n} ( X_1 + ... + X_n)$ , then:

$\sqrt{n} ( S_n - \mu) \overset{d}{\rightarrow} \mathcal{N}(0,\sigma^2)$

Law of large numbers: for the same sequence $S_n \overset{p}{\rightarrow} \mu$
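Both results can be seen in a quick simulation. The sketch below assumes Uniform(0, 1) draws, for which $\mu = 1/2$ and $\sigma^2 = 1/12$ ; the sample sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma2 = 0.5, 1.0 / 12.0   # mean and variance of Uniform(0, 1)

# Law of large numbers: the mean of one long sample is close to mu.
s_large = rng.uniform(size=1_000_000).mean()

# Central limit theorem: across many samples of moderate size,
# sqrt(n) * (S_n - mu) has mean near 0 and variance near sigma2.
n, reps = 400, 5_000
s_n = rng.uniform(size=(reps, n)).mean(axis=1)
z = np.sqrt(n) * (s_n - mu)

print(s_large, z.mean(), z.var(), sigma2)
```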

Probability limit: given a sequence $X_1,X_2,...$ we say that $X_n$ converges in probability to $X$ , and write $X_n \overset{p}{\rightarrow} X$ , if for every $\epsilon > 0$ we have that

$\lim_{n \rightarrow \infty} Pr \Big[ |X_n - X| \geq \epsilon \Big] = 0$

Convergence in distribution: we write $X_n \overset{d}{\rightarrow} X$ iff $P(X_n \leq x) \rightarrow P(X \leq x)$ for all $x$ at which $P(X \leq x)$ is continuous.

Expressing matrix products: Let's look at $X_n' X_n$ . We have defined $X_n$ as an $n \times k$ matrix where each row corresponds to the regressors $x_i$ of individual $i$ , which form a $k \times 1$ vector. Hence the element in row $i$ and column $j$ of $X_n$ is $[X_n]_{ij} = [x_i]_j$ , the $j$ -th component of the regressors of individual $i$ .

Looking at the matrix $C = X_n' X_n$ , by definition of the matrix multiplication, the elements of $C$ are

$\begin{aligned} [C]_{pq} & = \sum_i [X'_n]_{pi} \cdot [X_n]_{iq} \\ & = \sum_i [X_n]_{ip} \cdot [X_n]_{iq} \\ & = \sum_i [x_i]_{p} \cdot [x_i]_{q} \\ & = \sum_i \sum_{l=1}^1 [ x_i]_{pl} \cdot [x_i]_{ql} \\ & = \sum_i \sum_{l=1}^1 [ x_i]_{pl} \cdot [x'_i]_{lq} \\ & = \sum_i \Big[ x_i x'_i \Big]_{pq} \end{aligned}$

where we recognized the matrix multiplication $D_i = x_i x_i'$ which is $k \times k$ and has for elements $[D_i]_{pq} = [x_i]_p [x_i]_q$ . Hence we do get that $C = \sum_i D_i$ and:

$\begin{aligned} X_n' X_n & = C \\ & = \sum_i D_i \\ & = \sum_i x_i x'_i \\ \end{aligned}$

We use this often in proofs to write such products as sums over individuals, so that $\frac{1}{n} X_n' X_n$ becomes an average whose limit we can take.
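The identity $X_n'X_n = \sum_i x_i x_i'$ is easy to confirm numerically. A minimal sketch, with an arbitrary random matrix standing in for the regressors:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 6, 3
X_n = rng.normal(size=(n, k))   # row i holds the regressors x_i of individual i

# Left-hand side: the k x k matrix product X_n' X_n.
C = X_n.T @ X_n

# Right-hand side: sum over individuals of the k x k outer products x_i x_i'.
sum_of_outer = sum(np.outer(x_i, x_i) for x_i in X_n)

print(np.allclose(C, sum_of_outer))
```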