A homework on mulitway fixed effect approach

In this homework we are going to consider a fictuous grading at a given universities and try to see what we can learn from grade variability and it affects earnings.

In the first data set we have the grades collected by a bunch of student within a given summester. Each student attended 2 courses. From this data we are going to try to extrac the ability of each student while allowing for course specific intercept. We can then use this to evaluate how much of the grade variation is due to student differences versus course differences and other factors (residuals).

Given this ability measures, we then merge a second file which has the earnings of the student at age 35. We then evaluate the effect of academic ability on individual earnings. Here again we will worry about the effect of over-fitting.

Of course this requires, like we saw in class, estimating many parameters, hence we will look into overfitting and how to address it! We wil lmake use of sparse matrices, degree of freedom correction and bias correction.

The two data files you will need are:

grades: hw4-grades.json
earnings: hw4-earnings.json

Useful links:

Sparse linear solver

In [1]:

Copied!





import os
import pandas as pd
import tqdm
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import sys


%cd ..
%load_ext autoreload
%autoreload 2

np.random.seed(5435245)

sys.path.append('../')
import os
import pandas as pd
import tqdm
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import sys


%cd ..
%load_ext autoreload
%autoreload 2

np.random.seed(5435245)

sys.path.append('../')

/Users/tlamadon/git/econ-21340/docs

In [2]:

Copied!

import solutions.sol_pset4 as solution # you need to command this, you don't have the solution file!
#solution.simulate_data()
import solutions.sol_pset4 as solution # you need to command this, you don't have the solution file!
#solution.simulate_data()

Explaining the dispersion in grades¶

Load the grade data from hw4-grades.json. Then compute:

total variance in grades
the between class variance
plot the histogram of class size

In [3]:

Copied!

df_all = solution.question1()

ns = len(np.unique(df_all['student_id']))
nc = len(np.unique(df_all['class_id']))
nn = len(df_all)

print("{} students and {} classes".format(ns,nc))

df_all[["grade","class_id","student_id","major","firstname"]]
df_all = solution.question1()

ns = len(np.unique(df_all['student_id']))
nc = len(np.unique(df_all['class_id']))
nn = len(df_all)

print("{} students and {} classes".format(ns,nc))

df_all[["grade","class_id","student_id","major","firstname"]]

 total variance is 2.199920200173231
 between class variance is 0.7605625660013033

No description has been provided for this image

6732 students and 673 classes

Out[3]:

	grade	class_id	student_id	major	firstname
0	0.843900	GP8471	9	PHYSICAL AND HEALTH EDUCATION TEACHING	Leann
1	0.926570	IK1731	9	PHYSICAL AND HEALTH EDUCATION TEACHING	Leann
2	1.695413	GW2045	15	STUDIO ARTS	Marcus
3	-0.038370	ML7772	15	STUDIO ARTS	Marcus
4	2.129442	BI3547	22	MANAGEMENT INFORMATION SYSTEMS AND STATISTICS	Lonnie
...	...	...	...	...	...
13459	-2.318588	JE6164	49527	PUBLIC POLICY	Lea
13460	-1.458787	GQ0531	49534	PHYSICS	Leonard
13461	-1.560250	IX0276	49534	PHYSICS	Leonard
13462	2.091195	KX3268	49539	EARLY CHILDHOOD EDUCATION	Addie
13463	-0.105092	BN9468	49539	EARLY CHILDHOOD EDUCATION	Addie

13464 rows × 5 columns

Constructing the sparse regressor matrices¶

In a similar fashion to what we covered in the class we want to estimate a two-way fixed model of grates. Specifically, we are want to fit:

$$y_{ic} = \alpha_i + \psi_c + \epsilon_{ic}$$

where $i$ denotes each individual, $c$ denote each courses and $\epsilon_{ic}$ is an error term that will assume conditional mean independent of the assignment of students to courses.

We are going to estimate this using least-square. This requires that we construct the matrices that correspond to the equation for $y_{ic}$. We then want to consruct the $A$ and $J$ such that

$$Y = A \alpha + J \psi + E$$

where for $n_s$ students each with $n_g$ grades in difference courses and a total of $n_c$ courses we have that $Y$ is $n_s \cdot n_g \times 1$ vector, $A$ is a $n_s \cdot n_g \times n_s$ matrix and $J$ is $n_s \cdot n_g \times n_c$. $\alpha$ is the vector of size $n_s$ and $\psi$ is a vector of size $n_c$.

Each fo the $n_s \cdot n_g$ correspond to a grade, in each row $A$ has a $1$ in the column corresponding to the individual of this row. Similary, $J$ has a $1$ for for the column corresponding to the class of that row.

So, I ask you to:

construct these matrices using python sparse matrices scipy.sparse.csc.csc_matrix

In [4]:

Copied!

df_all
df_all

Out[4]:

	grade	class_id	student_id	major	firstname
0	0.843900	GP8471	9	PHYSICAL AND HEALTH EDUCATION TEACHING	Leann
1	0.926570	IK1731	9	PHYSICAL AND HEALTH EDUCATION TEACHING	Leann
2	1.695413	GW2045	15	STUDIO ARTS	Marcus
3	-0.038370	ML7772	15	STUDIO ARTS	Marcus
4	2.129442	BI3547	22	MANAGEMENT INFORMATION SYSTEMS AND STATISTICS	Lonnie
...	...	...	...	...	...
13459	-2.318588	JE6164	49527	PUBLIC POLICY	Lea
13460	-1.458787	GQ0531	49534	PHYSICS	Leonard
13461	-1.560250	IX0276	49534	PHYSICS	Leonard
13462	2.091195	KX3268	49539	EARLY CHILDHOOD EDUCATION	Addie
13463	-0.105092	BN9468	49539	EARLY CHILDHOOD EDUCATION	Addie

13464 rows × 5 columns

In [5]:

Copied!





from scipy.sparse.linalg import spsolve
Y,A,J,df_all = solution.question2(df_all)

print(f"Type of A        : {type(A)}")
print(f"Shape of A       : {A.shape}")
print(f"Shape of J       : {J.shape}")
print(f"Shape of Y       : {Y.shape}")
print(f"Sum of A         : {A.sum():,.3f}")
print(f"Sum of J         : {J.sum():,.3f}")

# getting a nice diagonal here requires sorting by the studend_id
plt.spy(A,aspect=0.1,markersize=0.2)
plt.show()
plt.spy(J,aspect=0.1,markersize=0.2)
plt.show()
from scipy.sparse.linalg import spsolve
Y,A,J,df_all = solution.question2(df_all)

print(f"Type of A        : {type(A)}")
print(f"Shape of A       : {A.shape}")
print(f"Shape of J       : {J.shape}")
print(f"Shape of Y       : {Y.shape}")
print(f"Sum of A         : {A.sum():,.3f}")
print(f"Sum of J         : {J.sum():,.3f}")

# getting a nice diagonal here requires sorting by the studend_id
plt.spy(A,aspect=0.1,markersize=0.2)
plt.show()
plt.spy(J,aspect=0.1,markersize=0.2)
plt.show()

Type of A        : <class 'scipy.sparse._csc.csc_matrix'>
Shape of A       : (13464, 6732)
Shape of J       : (13464, 673)
Shape of Y       : (13464,)
Sum of A         : 13,464.000
Sum of J         : 13,464.000

In [6]:

Copied!

df_all
df_all

Out[6]:

	index	grade	class_id	student_id	major	firstname	si	ci
0	0	0.843900	GP8471	9	PHYSICAL AND HEALTH EDUCATION TEACHING	Leann	0	352
1	1	0.926570	IK1731	9	PHYSICAL AND HEALTH EDUCATION TEACHING	Leann	0	440
2	2	1.695413	GW2045	15	STUDIO ARTS	Marcus	1	368
3	3	-0.038370	ML7772	15	STUDIO ARTS	Marcus	1	651
4	4	2.129442	BI3547	22	MANAGEMENT INFORMATION SYSTEMS AND STATISTICS	Lonnie	2	68
...	...	...	...	...	...	...	...	...
13459	13459	-2.318588	JE6164	49527	PUBLIC POLICY	Lea	6729	475
13460	13460	-1.458787	GQ0531	49534	PHYSICS	Leonard	6730	353
13461	13461	-1.560250	IX0276	49534	PHYSICS	Leonard	6730	459
13462	13462	2.091195	KX3268	49539	EARLY CHILDHOOD EDUCATION	Addie	6731	565
13463	13463	-0.105092	BN9468	49539	EARLY CHILDHOOD EDUCATION	Addie	6731	77

13464 rows × 8 columns

Estimating the model¶

Next we estimate our model using the OLS estiamtor formula. We first remove the last column of $J$ (since the model we wrote does not pin down a constant we force the last course to have $\psi=0$). Solve the linear system using the formula

$$ \hat{\gamma} = (M'M)^{-1} M' Y $$

where $M = [A,J]$ and $\gamma = (\alpha,\psi)$.

So do the following:

select the last column simply by doing J = J[:,1:(nc-1]]
use scipy.sparse.hstack to concatenate the matrices to create M
use scipy.sparse.linalg.spsolve to solve a sparse linear system
extract $\hat{\alpha}$ from $\hat{\gamma}$ by selecting the first $n_s$ terms
merge $\hat{\alpha}$ into df_all
compute the variance of $\hat{\alpha}$ in df_all
compute the variance of the residuals
What share of the total variation in grades can be attributed to difference in students?

In [7]:

Copied!





df_all, M, gamma_hat = solution.question3(df_all,A,J,Y)
print(f"Variance of alpha_hat  : {df_all['alpha_hat'].var():.4f}")
print(f"Variance of resid      : {df_all['resid'].var():.4f}")
print(f"Variance of grade      : {df_all['grade'].var():.4f}")
df_all[["grade","class_id","student_id","major","firstname","alpha_hat"]]
df_all, M, gamma_hat = solution.question3(df_all,A,J,Y)
print(f"Variance of alpha_hat  : {df_all['alpha_hat'].var():.4f}")
print(f"Variance of resid      : {df_all['resid'].var():.4f}")
print(f"Variance of grade      : {df_all['grade'].var():.4f}")
df_all[["grade","class_id","student_id","major","firstname","alpha_hat"]]

Variance of alpha_hat  : 1.3367
Variance of resid      : 0.2498
Variance of grade      : 2.1999

Out[7]:

	grade	class_id	student_id	major	firstname	alpha_hat
0	0.843900	GP8471	9	PHYSICAL AND HEALTH EDUCATION TEACHING	Leann	1.523048
1	0.926570	IK1731	9	PHYSICAL AND HEALTH EDUCATION TEACHING	Leann	1.523048
2	1.695413	GW2045	15	STUDIO ARTS	Marcus	0.668487
3	-0.038370	ML7772	15	STUDIO ARTS	Marcus	0.668487
4	2.129442	BI3547	22	MANAGEMENT INFORMATION SYSTEMS AND STATISTICS	Lonnie	2.119907
...	...	...	...	...	...	...
13459	-2.318588	JE6164	49527	PUBLIC POLICY	Lea	-1.358326
13460	-1.458787	GQ0531	49534	PHYSICS	Leonard	-0.248203
13461	-1.560250	IX0276	49534	PHYSICS	Leonard	-0.248203
13462	2.091195	KX3268	49539	EARLY CHILDHOOD EDUCATION	Addie	1.255938
13463	-0.105092	BN9468	49539	EARLY CHILDHOOD EDUCATION	Addie	1.255938

13464 rows × 6 columns

Note that you might get alphas off by a constant, depending on which firm you chose to normalize.

In [8]:

Copied!





# We check that predicted values look good
print(np.var(df_all.grade - df_all.alpha_hat - df_all.psi_hat))
plt.scatter( df_all.grade, df_all.alpha_hat + df_all.psi_hat)
plt.xlabel("grade")
plt.ylabel("predicted grade")
plt.show()
# We check that predicted values look good
print(np.var(df_all.grade - df_all.alpha_hat - df_all.psi_hat))
plt.scatter( df_all.grade, df_all.alpha_hat + df_all.psi_hat)
plt.xlabel("grade")
plt.ylabel("predicted grade")
plt.show()

0.2497404970777697

A simple evaluation of our estimator¶

To see what we are dealing with, we are simly going to re-simulate using our estimated parameters, then re-run our estimation and compare the new results to the previous one. This is in the spirit of a bootstrap exercise, onyl we will just do it once this time.

Please do:

create $Y_2 = M \hat{\gamma} + \hat{\sigma}_r E$ where $E$ is a vector of draw from a standard normal.
estimate $\hat{\gamma}_2$ from $Y_2$
report the new variance term and compare them to the previously estimated
comment on the results (not that because of the seed and ordering, your number doesn't have to match mine exactly)

In [9]:

Copied!

df_all = solution.question4(df_all,M,gamma_hat)
print(f"Variance of alpha_hat2 : {df_all['alpha_hat2'].var():.4f}")
print(f"Variance of resid2      : {df_all['resid2'].var():.4f}")
df_all = solution.question4(df_all,M,gamma_hat)
print(f"Variance of alpha_hat2 : {df_all['alpha_hat2'].var():.4f}")
print(f"Variance of resid2      : {df_all['resid2'].var():.4f}")

Variance of alpha_hat2 : 1.4664
Variance of resid2      : 0.1113

Noticed how even the variance of the residual has shrunk? Now is the time to remember STATS 101. We have all heard this thing about degree of freedom correction! Indeed we should correct our raw variance estimates to control for the fact that we have estimated a bunch of dummies. Usually we use $n/n-1$ because we only estimate one mean. Here however we have estimated $n_s + n_c - 1$ means! Hence we should use

$$ \frac{N}{N-n_s -n_c +1} \hat{Var}(\hat{\epsilon}) $$

please do:

compute this variance corrected for degree of freedom using your recomputed residuals
compare this variance to the variance you estimated in quetion 3
what does this suggest about your estimates in Q3?

In [10]:

Copied!

solution.question4b(df_all)
solution.question4b(df_all)

Var2 with degree of freedom correction:0.247210099267587

Evaluate impact of academic measure on earnings¶

In this section we load a separate data set that contains for each student their earnings at age 35. We are intereted in the effect of $\alpha$ on earnings.

Do the following:

load the data the earnings data listed in the intro
merge $\alpha$ into the data
regress earnings on $\alpha$.

In [11]:

Copied!

df_earnings = solution.question5(df_all)
df_earnings
df_earnings = solution.question5(df_all)
df_earnings

                            OLS Regression Results                            
==============================================================================
Dep. Variable:               earnings   R-squared:                       0.025
Model:                            OLS   Adj. R-squared:                  0.025
Method:                 Least Squares   F-statistic:                     172.7
Date:                Tue, 27 May 2025   Prob (F-statistic):           5.80e-39
Time:                        12:40:56   Log-Likelihood:                -9595.0
No. Observations:                6732   AIC:                         1.919e+04
Df Residuals:                    6730   BIC:                         1.921e+04
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     -0.0903      0.013     -6.716      0.000      -0.117      -0.064
alpha_hat      0.1394      0.011     13.141      0.000       0.119       0.160
==============================================================================
Omnibus:                        0.815   Durbin-Watson:                   1.993
Prob(Omnibus):                  0.665   Jarque-Bera (JB):                0.769
Skew:                           0.013   Prob(JB):                        0.681
Kurtosis:                       3.045   Cond. No.                         1.65
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Out[11]:

	student_id	major	firstname	earnings	alpha_hat
0	9	PHYSICAL AND HEALTH EDUCATION TEACHING	Leann	-1.498308	1.523048
1	15	STUDIO ARTS	Marcus	2.908033	0.668487
2	22	MANAGEMENT INFORMATION SYSTEMS AND STATISTICS	Lonnie	0.933555	2.119907
3	24	MISCELLANEOUS FINE ARTS	Seamus	-0.913896	-0.703119
4	35	MEDICAL ASSISTING SERVICES	Wilbert	0.545178	1.874107
...	...	...	...	...	...
6727	49520	ANIMAL SCIENCES	Amos	-1.723242	1.208074
6728	49525	COMMERCIAL ART AND GRAPHIC DESIGN	Isaac	1.264385	0.518734
6729	49527	PUBLIC POLICY	Lea	-0.368899	-1.358326
6730	49534	PHYSICS	Leonard	1.244166	-0.248203
6731	49539	EARLY CHILDHOOD EDUCATION	Addie	1.735586	1.255938

6732 rows × 5 columns

Bias correction - construct the Q matrix¶

We want to apply bias correction to refine our results. As we have seen in class thaqt we can directly compute the bias of the expression of interest.

$$ E[ \hat{\gamma} Q \hat{\gamma}' ] = \gamma Q \gamma + \frac{\sigma^2}{n} \text{Tr}[ ( M'M )^{-1} Q] $$

under homoskedatic assumption of the error and hence we get the following expresison for the bias for any $Q$ matrix:

$$ B = \frac{\sigma^2}{n_s} \text{Tr}[ ( M'M )^{-1} Q] $$

When computing the variance of the measured ability of the student, we simply use a diagonal matrix on $\gamma$ which selects only the ability part and removes the average.

do:

Construct such Q matrix.
Check that $\gamma Q \gamma' = \hat{Var}(\hat{a})$.
(bonus) Show why the formula includes the $1/n_s$ terms rather than $1/n$ ?

In [12]:

Copied!





# a small example if we had ns=5,nc=4
Qbis = solution.question6(ns=5,nc=4)
print(Qbis)

# the full Q
Q = solution.question6(ns,nc)

# comparing Q expression to df_all expression
val1 = (1 / ns) * np.matmul(gamma_hat, np.matmul(Q, gamma_hat))
val2 = gamma_hat[np.arange(ns)].var()

print(f"Quadratic form (1/ns · γᵗQγ)      : {val1:.4f}")
print(f"Variance of gamma_hat[:ns]        : {val2:.4f}")
# a small example if we had ns=5,nc=4
Qbis = solution.question6(ns=5,nc=4)
print(Qbis)

# the full Q
Q = solution.question6(ns,nc)

# comparing Q expression to df_all expression
val1 = (1 / ns) * np.matmul(gamma_hat, np.matmul(Q, gamma_hat))
val2 = gamma_hat[np.arange(ns)].var()

print(f"Quadratic form (1/ns · γᵗQγ)      : {val1:.4f}")
print(f"Variance of gamma_hat[:ns]        : {val2:.4f}")

[[ 0.8 -0.2 -0.2 -0.2 -0.2  0.   0.   0. ]
 [-0.2  0.8 -0.2 -0.2 -0.2  0.   0.   0. ]
 [-0.2 -0.2  0.8 -0.2 -0.2  0.   0.   0. ]
 [-0.2 -0.2 -0.2  0.8 -0.2  0.   0.   0. ]
 [-0.2 -0.2 -0.2 -0.2  0.8  0.   0.   0. ]
 [ 0.   0.   0.   0.   0.   0.   0.   0. ]
 [ 0.   0.   0.   0.   0.   0.   0.   0. ]
 [ 0.   0.   0.   0.   0.   0.   0.   0. ]]
Quadratic form (1/ns · γᵗQγ)      : 1.3366
Variance of gamma_hat[:ns]        : 1.3366

We are now finally in the position to compute our bias. We have our matrix $Q$. Now we also need the variance of the residual in the original problem! Given what we have learn in Question 4, we definitely want to use the formula with degree of freedom correction.

Compute $\sigma^2_r$ with the degree of freedom correction
Invert $M'M$ using scipy.sparse.linalg
Compute $B = \frac{\sigma^2}{n_s} \text{Tr}[ ( M'M )^{-1} Q]$ using np.trace
Remove this from original estimate to get the share of variance explained by student differences!

Note that inversing a matrix is far longer than solving a linear system. You might need to be patient here!

In [13]:

Copied!





%%time
sig_hat, B = solution.question7(M,Q,df_all)
print(f"sigma_r  : {sig_hat:.4f}")
print(f"Value of bias  : {B:.4f}")
%%time
sig_hat, B = solution.question7(M,Q,df_all)
print(f"sigma_r  : {sig_hat:.4f}")
print(f"Value of bias  : {B:.4f}")

sigma_r  : 0.5549
Value of bias  : 0.3081
CPU times: user 55.2 s, sys: 1.59 s, total: 56.8 s
Wall time: 30.9 s

In [14]:

Copied!

gamma_hat[range(ns)].var() - B
gamma_hat[range(ns)].var() - B

Out[14]:

np.float64(1.0284885884511565)

Bias correction - Regression coefficient¶

Finally, we look back at our regression of earnings on estimated academic ability. We have seen in class that when the regressor has measurment error this will suffer from attenuation bias. Here we now know exactly how much of the variance is noise thanks to our bias correction.

The attenuation bias is given by :

$$ \beta_2 = \frac{Var(x)}{Var(x) + B} \beta $$

We then decide to compute a correction for our regression using our estimated $B$. This means computing a corrected parameters as follows:

$$ \hat{\beta}^{BC} = \frac{Var(x) + B}{Var(x)} \hat{\beta}^{Q5} = \frac{Var(\hat{\alpha})}{Var(\hat{\alpha})-B} \hat{\beta}^{Q5} $$

Do:

compute the corrected $\hat{\beta}^{BC}$
FIY, the true $\beta$ I used to simulate the data was 0.2, is your final parameter far? Is is economically different from the $\hat{\beta}^{Q5}$ we got in Question 5?

Conclusion¶

I hope you have learned about the pitfalls of over-fitting in this assignment! There are many and they can drastically affect the results and the conclusion of an empirical analysis.

This is the end of the class, I hope you enjoyed it and that you learned a thing or twom, have a nice summer!