Introduction

This tutorial reviews some concepts for the basic linear model, using the econometrics software package R. Specifically, this tutorial reviews:

conducting the Ramsey RESET test in R
general issues about choosing the correct functional form
interpretation when there are relevant variables omitted from the model

This tutorial requires two (2) data files:

clothes_tut4.csv
schools.csv

These files can be obtained from the Canvas subject page. In addition, the R script file tut4.R provides the program code necessary to complete the exercises.
This R script file uses the following packages which need to be installed prior to running the R script file:

modelsummary:	for easily generating high quality tables of results in R
car:	for easily conducting hypothesis tests in R
rio:	for easily importing data into R
tinytable:	for easily working with tinytable formats in R
lmtest:	for easily conducting the Ramsey RESET test in R

These can be installed directly in RStudio from the packages tab or by using the command install.packages() and inserting the name of the package in the brackets.
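As a minimal sketch of the second option, the snippet below checks which packages are missing and installs only those. The package vector also includes ggplot2, which the script loads; the `interactive()` guard is an assumption added here so that nothing is downloaded when the code is run non-interactively.

```r
# Packages listed above, plus ggplot2, which the script also loads.
pkgs <- c("modelsummary", "car", "rio", "tinytable", "lmtest", "ggplot2")

# Names of the packages that are not yet in the local library.
missing_pkgs <- setdiff(pkgs, rownames(installed.packages()))

# Install only what is missing, and only in an interactive session
# (the interactive() guard is an assumption, not part of tut4.R).
if (length(missing_pkgs) > 0 && interactive()) {
  install.packages(missing_pkgs)
}
```

Running this more than once is harmless, since packages that are already installed are skipped.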

Question 1

Consider the following total cost function where $y_{i}$ represents total cost for firm $i$ and $x_{i}$ represents output for this firm $i$:

\[\begin{equation} y_{i} = \beta_{0} + \beta_{1}\,x_{i} + \beta_{2}\,x_{i}^{2} + \beta_{3}\,x_{i}^{3} + \varepsilon_{i} \tag{1} \end{equation}\]

Data for a sample of 28 firms in the clothing industry are provided in the data file clothes_tut4.csv located on the subject page. This data file contains the following variables:

\[\begin{align*} \mbox{y} & = \mbox{total cost for firm $i$} \\ \mbox{x} & = \mbox{output for firm $i$} \\ \mbox{x2} & = \mbox{squared output for firm $i$} \\ \mbox{x3} & = \mbox{cubed output for firm $i$} \\ \end{align*}\]

with

\[\begin{align*} \ln\,\mbox{x} & = \mbox{natural logarithm of output for firm $i$} \\ \ln\,\mbox{y} & = \mbox{natural logarithm of total cost for firm $i$} \end{align*}\]

a)

Find the OLS estimates of the parameters $\beta_{0},\beta_{1},\beta_{2}$ and $\beta_{3}$.

Solution

The OLS estimation results are reproduced below in Figure 1.

Code

options(scipen=999)
library(modelsummary)     # nice R output
library(tinytable)
library(car)               # joint linear restrictions in R.
library(rio)               # easily import data into R
library(ggplot2)
library(lmtest)            # easily implement RESET test

Code

#-------------------------------
# create custom function for including sample F statistic
# in modelsummary table from the lm function
# only need to do this once per file
glance_custom.lm <- function(x, ...) {
  s <- summary(x)
  f <- s$fstatistic  # value, numdf, dendf
  
  data.frame(
    F_line = sprintf(
      "%.4f (df = %d; %d)",
      f[1], f[2], f[3]
    ),
    p_F = pf(f[1], f[2], f[3], lower.tail = FALSE),
    sigma = s$sigma,
    nobs = nobs(x),
    r.squared = s$r.squared,
    adj.r.squared = s$adj.r.squared
  )
}

Code

# Read data
tut4 <- import("clothes_tut4.csv")
# y: total cost
# x: output
#--------------------------------
# Question 1 (a)
# Estimate the cubic model (1) by OLS
reg1 <- lm(y ~ x + x2 + x3, data=tut4)
df_reg1 <- reg1$df.residual 
# modelsummary output
# put the dependent-variable label in the model name (this becomes the column header)
models1 <- list("Total Cost" = reg1)
# use modelsummary for results:
tut4_table1 <- modelsummary(
  models1,
  fmt      = 4,                   # digits=4
  statistic = "({std.error})",    
  coef_map = c(                   # covariate.labels + intercept at top
    `(Intercept)` = "Intercept",
    x = "Output",
    x2 = "Output squared",
    x3 =  "Output cubed"),
  gof_map = data.frame(
    raw   = c("nobs", "r.squared", "adj.r.squared", "sigma", "F_line", "p_F"),
    clean = c("Observations", "R-squared", "Adj. R-squared",
              "Residual Std. Error", "F Statistic", "F-test p value"),
    fmt   = c(0, 4, 4, 4, 4, 4)
  ),
  output   = "tinytable",
  stars = TRUE,
  notes = "Standard errors shown in parentheses"
)

#  save_tt(tut4_table1, "tut4_table1.html", overwrite = TRUE)
tut4_table1

	Total Cost
Intercept	134.6560**
	(44.8001)
Output	57.9702+
	(29.9702)
Output squared	-11.0289+
	(5.7646)
Output cubed	1.1431**
	(0.3359)
Observations	28
R-squared	0.9800
Adj. R-squared	0.9775
Residual Std. Error	21.9255
F Statistic	391.2195 (df = 3; 24)
F-test p value	0.0000
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001
Standard errors shown in parentheses

Figure 1: OLS Regression Results - Cubic Cost Function

b)

What parameter restrictions would imply a linear total cost function?

Using the car package, test whether the data are consistent with a linear total cost function, at the 5% level.

Solution

Cubic Total Cost Function:

\[ y_{i} = \beta_{0} + \beta_{1}\,x_{i} + \beta_{2}\,x_{i}^{2} + \beta_{3}\,x_{i}^{3} + \varepsilon_{i} \tag{Unrest'd} \]

\[ H_0: \beta_2=0,\ \beta_3=0 \]

\[ y_{i} = \beta_{0} + \beta_{1}\,x_{i} + \underbrace{\color{red}{\beta_{2}}}_{=0}\,x_{i}^{2} + \underbrace{\color{red}{\beta_{3}}}_{=0}\,x_{i}^{3} + \varepsilon_{i} \]

Linear Cost Function:

\[ y_{i} = \beta_{0} + \beta_{1}\,x_{i}+ \varepsilon_{i} \tag{Rest'd} \]

The total cost function (1) would be linear if $\beta_{2} = \beta_{3} = 0$ .

Therefore, we want to test the null hypothesis $H_{0}: \beta_{2} = \beta_{3} = 0$.

The sample test statistic will follow an F distribution with (2,24) degrees of freedom.

Code

alpha <- 0.05
qf(1-alpha,2,df_reg1)          # F critical value

[1] 3.402826

The decision rule is to reject the null hypothesis if the sample value of the F test statistic exceeds the critical value $F_{c}(2,24)$. Using statistical tables, this critical value is $F_{c} \thickapprox 3.40$.

Using R, the exact critical value is $3.402826$.

Alternatively, the decision rule is to reject the null hypothesis if the p-value for the sample test statistic is less than the desired level of significance such that $p< 0.05$.

Using the car package procedure in R we obtain the following results:

Code

# Question 1 (b)
# Test linear total cost function
# H0: beta2 = 0 and beta3 = 0
hnull_1b <-c("x2=0", "x3=0")
linearHypothesis(reg1, hnull_1b)


Linear hypothesis test:
x2 = 0
x3 = 0

Model 1: restricted model
Model 2: y ~ x + x2 + x3

  Res.Df   RSS Df Sum of Sq      F          Pr(>F)    
1     26 71202                                        
2     24 11537  2     59664 62.056 0.0000000003277 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

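Figure 2: $H_0:\beta_2=\beta_3=0$

The sample F statistic in Figure 2 is 62.056, which exceeds the critical value of 3.402826, and the associated p-value is effectively zero. Therefore, we should reject the null hypothesis; the data are not consistent with a linear total cost function.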
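The F statistic reported by linearHypothesis() can be checked by hand from the two residual sums of squares in Figure 2. The sketch below uses the rounded RSS values as printed, so the result agrees with the reported 62.056 only up to rounding; the variable names are illustrative.

```r
# Values copied from Figure 2 (rounded as printed).
rss_restricted   <- 71202   # linear model, 26 residual df
rss_unrestricted <- 11537   # cubic model, 24 residual df
J  <- 2                     # number of restrictions under H0
df_unrestricted <- 24

# F = [(RSS_R - RSS_U) / J] / [RSS_U / (N - K)]
F_stat  <- ((rss_restricted - rss_unrestricted) / J) /
  (rss_unrestricted / df_unrestricted)
F_crit  <- qf(0.95, J, df_unrestricted)
p_value <- pf(F_stat, J, df_unrestricted, lower.tail = FALSE)

c(F = F_stat, critical = F_crit, p = p_value)
```

Both decision rules agree: the statistic is far above the critical value and the p-value is far below 0.05.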

c)

Code

tut4_table1a <- tut4_table1 |>
  style_tt(i = 7, j = 2, bold = TRUE, color = "red") |>
  style_tt(i = 8, j = 2, color = "red")
tut4_table1a

	Total Cost
Intercept	134.6560**
	(44.8001)
Output	57.9702+
	(29.9702)
Output squared	-11.0289+
	(5.7646)
Output cubed	1.1431**
	(0.3359)
Observations	28
R-squared	0.9800
Adj. R-squared	0.9775
Residual Std. Error	21.9255
F Statistic	391.2195 (df = 3; 24)
F-test p value	0.0000
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001
Standard errors shown in parentheses

What parameter restrictions would imply a quadratic total cost function?
Test whether the data are consistent with a quadratic total cost function at the 5% level.

Solution

The total cost function (1) would be quadratic if $\beta_3=0$. Therefore, we want to test the null hypothesis $H_0:\beta_3=0$ against the two-sided alternative $H_A:\beta_3 \ne 0$; we can use a t-test.
The 5% critical value for a two-sided t-test with 24 degrees of freedom is 2.0639, so the decision rule is to reject $H_0$ if $\text{t} > 2.0639$ or if $\text{t} < -2.0639$. The sample test statistic is calculated as $\text{t} = \dfrac{1.1431-0}{0.3359}=3.403031$ with a $p-\text{value}= 0.0023$. Since the sample statistic exceeds 2.0639 (equivalently, the $p-\text{value}$ is less than 0.05), we should reject $H_0 \Rightarrow$ the data are not consistent with a quadratic total cost function.

Code

df_reg1 <- 24
qf(1-alpha,1,df_reg1)

[1] 4.259677

Tip

When the null hypothesis has one restriction then $t^2=F$ and $t_c^2=F_c$ e.g.
$t^2= 3.403031^2=11.58062=F$ and $t_c^2=2.0639^2 = 4.259683=F_c$
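The relationship between the two critical values in the tip can be confirmed directly in R. This is a small check using only base R functions.

```r
df    <- 24      # residual degrees of freedom of the cubic model
alpha <- 0.05

t_crit <- qt(1 - alpha / 2, df)   # two-sided t critical value
F_crit <- qf(1 - alpha, 1, df)    # F critical value with one restriction

# The squared t critical value equals the F critical value.
isTRUE(all.equal(t_crit^2, F_crit, tolerance = 1e-6))
```

The small difference between 4.259683 and 4.259677 in the tip comes from rounding the t critical value to 2.0639 before squaring.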

Alternatively, consider testing $H_0:\beta_3=0$ with an F-test.
The test statistic will follow an F distribution with (1,24) degrees of freedom.

The decision rule is to reject the null hypothesis if the sample value of the F test statistic exceeds the critical value, $F_c(1,24)$.
Using statistical tables, $F_c \approx 4.26$. Using R, the exact critical value is 4.259677.
Alternatively, the decision rule is to reject the null hypothesis if the $p-\text{value}$ is less than the desired level of significance such that $p<0.05$.
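The same single-restriction F-test can be obtained in base R by comparing the restricted (quadratic) and unrestricted (cubic) models with anova(). The sketch below uses simulated data, because clothes_tut4.csv is not reproduced here; the coefficients and the seed are arbitrary assumptions chosen only for illustration.

```r
# Simulated stand-in for the clothing data (illustrative values only).
set.seed(1)
n <- 28
x <- runif(n, 1, 10)
y <- 130 + 60 * x - 11 * x^2 + 1.1 * x^3 + rnorm(n, sd = 20)
sim <- data.frame(y = y, x = x, x2 = x^2, x3 = x^3)

cubic     <- lm(y ~ x + x2 + x3, data = sim)   # unrestricted model
quadratic <- lm(y ~ x + x2, data = sim)        # restricted model (beta3 = 0)

# F-test of the single restriction beta3 = 0.
F_table <- anova(quadratic, cubic)
F_stat  <- F_table$F[2]

# t statistic on x3 from the unrestricted model.
t_stat <- coef(summary(cubic))["x3", "t value"]

c(F = F_stat, t_squared = t_stat^2)
```

With the tutorial data, the equivalent car call would presumably be `linearHypothesis(reg1, "x3=0")`, in the same style as the calls used in parts (b) and (d).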


Linear hypothesis test:
x3 = 0

Model 1: restricted model
Model 2: y ~ x + x2 + x3

  Res.Df   RSS Df Sum of Sq      F  Pr(>F)   
1     25 17105                               
2     24 11537  1    5567.1 11.581 0.00234 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Figure 3: $H_0:\beta_3=0$

An examination of Figure 3 reveals the sample test statistic is 11.581 with an associated $p-\text{value}$ of 0.0023. Therefore, we should reject the null hypothesis.
The data are not consistent with a quadratic total cost function.

d)

Recall the definition of average cost. What parameter restrictions would imply a linear average cost function? Using the car package, test whether the data are consistent with a linear average cost function, at the 5% level.

Solution

The cubic cost function (1) implies the following average cost function:

\[ \frac{y_{i}}{x_{i}} = \frac{\beta_{0}}{x_{i}} + \beta_{1} + \beta_{2}\,x_{i} + \beta_{3}\,x_{i}^{2} + \frac{\varepsilon_{i}}{x_{i}} \]

Ignore the problem of heteroskedasticity for the moment. We will be looking at this later in the subject.

The average cost function will be linear if $\beta_{0} = \beta_{3} = 0$.
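To see why these are the relevant restrictions, substitute $\beta_{0} = \beta_{3} = 0$ into the average cost function above:

\[ \frac{y_{i}}{x_{i}} = \beta_{1} + \beta_{2}\,x_{i} + \frac{\varepsilon_{i}}{x_{i}} \]

The term $\beta_{0}/x_{i}$ is non-linear in output and the term $\beta_{3}\,x_{i}^{2}$ is quadratic in output, so both must vanish for average cost to be linear in $x_{i}$.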

We don’t need to estimate the average cost function since we are testing parameter restrictions on the cubic model (1).

The sample test statistic will follow an F distribution with (2,24) degrees of freedom. The decision rule is to reject the null hypothesis if the sample value of the F test statistic exceeds the critical value $F_{c}(2,24)$.
Using statistical tables, this critical value is $F_{c} \thickapprox 3.40$.
Using R, the exact critical value is $3.402826$. Alternatively, the decision rule is to reject the null hypothesis if the p-value for the sample test statistic is less than the desired level of significance such that $p < 0.05$.
Again using the car package, we get:

Code

# Question 1 (d)
# Test linear average cost function
hnull_1d <-c("(Intercept)=0", "x3=0")
linearHypothesis(reg1, hnull_1d)


Linear hypothesis test:
(Intercept) = 0
x3 = 0

Model 1: restricted model
Model 2: y ~ x + x2 + x3

  Res.Df    RSS Df Sum of Sq    F            Pr(>F)    
1     26 101818                                        
2     24  11537  2     90280 93.9 0.000000000004482 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Code

F_crit_d<- qf(1-alpha,2,df_reg1) 
F_crit_d

[1] 3.402826

Figure 4: $H_0:\beta_0=\beta_3=0$

The sample test statistic, from Figure 4, is 93.9 with an associated $p-\text{value}$ that is effectively zero ($4.48 \times 10^{-12}$).
Therefore, we should reject the null hypothesis; the data are not consistent with a linear average cost function.

e)

Test the hypothesis that the model is correctly specified using the RESET test, at the 5% level.
What is your conclusion?

Solution

Consider the following modification of model (1):

\[ y_{i} = \beta_{0} + \beta_{1}\,x_{i} + \beta_{2}\,x_{i}^{2} + \beta_{3}\,x_{i}^{3} + \gamma\,\hat{y}_{i}^{2} + \varepsilon_{i} \]

where $\hat{y}_{i}$ represents the fitted values from the OLS regression of model (1).

The RESET test is a test of the null hypothesis

$H_{0} : \gamma = 0$ against the alternative $H_{A}: \gamma \ne 0$.

The test statistic will follow a t-distribution with $N-5 = 23$ degrees of freedom.
The 5% critical value for a two-sided test with 23 degrees of freedom is 2.069, so the decision rule is to reject $H_{0}$ if $t>2.069$ or $t<-2.069$.
Alternatively, using the equivalence between the t-distribution and the F-distribution, the test statistic will follow an F-distribution with (1,23) degrees of freedom.
The F critical value $F_{c}$ is $4.279$. The decision rule is to reject $H_{0}$ if the sample value of the F-statistic exceeds $F_{c}$. Alternatively, the decision rule is to reject the null hypothesis if the p-value for the sample test statistic is less than the desired level of significance such that $p < 0.05$.
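The construction described above can be carried out by hand in base R: fit the model, add the squared fitted values as an extra regressor, and test the coefficient on that term. The sketch below uses simulated data in place of clothes_tut4.csv; the data-generating values and the column name `yhat2` are assumptions made for illustration.

```r
# Simulated stand-in for the clothing data (illustrative values only).
set.seed(2)
n <- 28
x <- runif(n, 1, 10)
sim <- data.frame(x = x, x2 = x^2, x3 = x^3)
sim$y <- 130 + 60 * sim$x - 11 * sim$x2 + 1.1 * sim$x3 + rnorm(n, sd = 20)

# Step 1: estimate the original model and store the squared fitted values.
base_fit  <- lm(y ~ x + x2 + x3, data = sim)
sim$yhat2 <- fitted(base_fit)^2

# Step 2: estimate the augmented model containing the squared fitted values.
aux_fit <- lm(y ~ x + x2 + x3 + yhat2, data = sim)

# Step 3: test gamma = 0, either as a t-test or as the equivalent F-test.
t_gamma <- coef(summary(aux_fit))["yhat2", "t value"]
F_reset <- anova(base_fit, aux_fit)$F[2]
df2     <- df.residual(aux_fit)   # N - 5 = 23

c(t = t_gamma, F = F_reset, df2 = df2)
```

The F statistic from this manual version is the quantity that a one-term RESET test reports, and it equals the square of the t statistic on the added term.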

Running the RESET test with one fitted term gives

Code

# Question 1 (e)
# Test total cost model is well specified
# Use the RESET test
# Cubic Model: Reset Test with squared fitted values
resettest(reg1, power=2, type="fitted")


    RESET test

data:  reg1
RESET = 0.98739, df1 = 1, df2 = 23, p-value = 0.3307

Code

qf(1-alpha,1,23)                       # F-critical value

[1] 4.279344

Figure 5: Question 1 (e) - RESET Test for $\hat{y}^2$.

Alternatively, consider the following modification of model (1).

\[ y_{i} = \beta_{0} + \beta_{1}\,x_{i} + \beta_{2}\,x_{i}^{2} + \beta_{3}\,x_{i}^{3} + \gamma_1\,\hat{y}_{i}^{2} + \gamma_2\,\hat{y}_{i}^{3}+ \varepsilon_{i} \]

where $\hat{y}_{i}$ represents the fitted values from the OLS regression of model (1).

The RESET test is a test of the null hypothesis $H_{0} : \gamma_{1} = \gamma_{2} = 0$ against the alternative $H_{A}: \gamma_{1} \mbox{ and/or } \gamma_{2} \ne 0$.

The test statistic will follow an F-distribution with (2,22) degrees of freedom. The F critical value $F_{c}$ is $3.443$.
The decision rule is to reject $H_{0}$ if the sample value of the F-statistic exceeds $F_{c}$, or use the p-value approach as we have done before.
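The two-term version can be sketched in the same way: add both the squared and the cubed fitted values, then test the two coefficients jointly by comparing the original and augmented models. Again the data are simulated and the names `yhat2` and `yhat3` are hypothetical.

```r
# Simulated stand-in for the clothing data (illustrative values only).
set.seed(5)
n <- 28
x <- runif(n, 1, 10)
sim <- data.frame(x = x, x2 = x^2, x3 = x^3)
sim$y <- 130 + 60 * sim$x - 11 * sim$x2 + 1.1 * sim$x3 + rnorm(n, sd = 20)

base_fit  <- lm(y ~ x + x2 + x3, data = sim)
sim$yhat2 <- fitted(base_fit)^2
sim$yhat3 <- fitted(base_fit)^3

# Augmented model with both powers of the fitted values.
aux_fit <- lm(y ~ x + x2 + x3 + yhat2 + yhat3, data = sim)

# Joint F-test of gamma1 = gamma2 = 0.
reset_table <- anova(base_fit, aux_fit)
F_reset <- reset_table$F[2]
p_reset <- reset_table[["Pr(>F)"]][2]
F_crit  <- qf(0.95, 2, df.residual(aux_fit))

c(F = F_reset, critical = F_crit, p = p_reset)
```

The degrees of freedom are (2, 22) because two terms are added to a model that already has four parameters and 28 observations.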

Running the RESET test with two fitted terms gives

Code

# Cubic Model: Reset Test with squared and cubed fitted values
resettest(reg1, power=2:3, type="fitted")


    RESET test

data:  reg1
RESET = 0.54561, df1 = 2, df2 = 22, p-value = 0.5871

Code

qf(1-alpha,2,22)                      # F-critical value

[1] 3.443357

Figure 6: Question 1 (e) - RESET Test for $\hat{y}^2$ and $\hat{y}^3$.

Since the p-values for both versions of the RESET test are quite large and definitely exceed $\alpha = 0.05$, we should not reject $H_{0}$.

The polynomials of the fitted values do not significantly improve the explanatory power of the cubic cost model.

This implies that the cubic cost model is appropriate, given our sample of data.

Note that this is also consistent with the hypothesis tests conducted above.

Recall, we have rejected both the linear cost function, and the quadratic cost function models.

f)

Estimate the following alternative total cost function:

\[\begin{equation} \ln y_{i} = \alpha_{1} + \alpha_{2}\,\ln x_{i} + \varepsilon_{i} \tag{2} \end{equation}\]

Test the hypothesis that the model is correctly specified using the RESET test, at the 5% level. What is your conclusion?

Solution

The regression results for Model (2) are presented in Figure 7 below.

First run model (2).

Code

# Question 1(f)
# Log-Log Total Cost Model
reg2 <- lm(lny ~ lnx, data=tut4)

Then use modelsummary to report the results.

Code

# modelsummary output
# put the dependent-variable label in the model name (this becomes the column header)
models2 <- list("(Log) Total Cost" = reg2)
# use modelsummary for results:
tut4_table2 <- modelsummary(
  models2,
  fmt      = 4,                   # digits=4
  statistic = "({std.error})",    
  coef_map = c(                   # covariate.labels + intercept at top
    `(Intercept)` = "Intercept",
    lnx = "(Log) Output"),
    gof_map = data.frame(
    raw   = c("nobs", "r.squared", "adj.r.squared", "sigma", "F_line", "p_F"),
    clean = c("Observations", "R-squared", "Adj. R-squared",
              "Residual Std. Error", "F Statistic", "F-test p value"),
    fmt   = c(0, 4, 4, 4, 4, 4)
  ),
  output   = "tinytable",
  stars = TRUE,
  notes = "Standard errors shown in parentheses"
)
tut4_table2

	(Log) Total Cost
Intercept	4.8407***
	(0.0853)
(Log) Output	0.6214***
	(0.0535)
Observations	28
R-squared	0.8384
Adj. R-squared	0.8322
Residual Std. Error	0.1594
F Statistic	134.8825 (df = 1; 26)
F-test p value	0.0000
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001
Standard errors shown in parentheses

Figure 7: OLS Regression Results - Log Cost Function.

Consider the following modification of the total cost function model (2):

\[ \ln y_{i} = \alpha_{1} + \alpha_{2}\,\ln x_{i} + \gamma\,\hat{y}_{i}^{2} + \varepsilon_{i} \]

where $\hat{y}_{i}$ represents the fitted values of $\ln y_{i}$ from the OLS regression for model (2).

The RESET test is a test of the null hypothesis $H_{0} : \gamma = 0$ against the alternative $H_{A}: \gamma \ne 0$.

The test statistic will follow a t-distribution with $N-3 = 25$ degrees of freedom. The 5% critical value for a two-sided test with 25 degrees of freedom is 2.060, so the decision rule is to reject $H_{0}$ if $t>2.060$ or $t<-2.060$.

Alternatively, using the equivalence between the t-distribution and the F-distribution, the test statistic will follow an F-distribution with (1,25) degrees of freedom.

The F critical value $F_{c}$ is $4.242$.
The decision rule is to reject $H_{0}$ if the sample value of the F-statistic exceeds $F_{c}$, or use the p-value approach and reject the null hypothesis if the $p-\text{value} < \alpha = 0.05$.

Then consider the following modification of the total cost function model (2).

\[ \ln y_{i} = \alpha_{1} + \alpha_{2}\,\ln x_{i} + \gamma_{1}\,\hat{y}_{i}^{2} + \gamma_{2}\,\hat{y}_{i}^{3} + \varepsilon_{i} \]

where $\hat{y}_{i}$ represents the fitted values of $\ln y_{i}$ from the OLS regression for model (2).

The RESET test is a test of the null hypothesis $H_{0} : \gamma_{1} = \gamma_{2} = 0$ against the alternative $H_{A}: \gamma_{1} \mbox{ and/or } \gamma_{2} \ne 0$.

The test statistic will follow an F-distribution with (2,24) degrees of freedom. The F critical value $F_{c}$ is $3.403$.

The decision rule is to reject $H_{0}$ if the sample value of the F-statistic exceeds $F_{c}$, or, again, use the p-value approach.
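The critical values for the log-log model can be reproduced in base R. The degrees of freedom follow from N = 28 observations and the number of parameters in each augmented regression (3 with one fitted term, 4 with two); the variable names below are illustrative.

```r
alpha <- 0.05
n <- 28   # number of firms
k <- 2    # parameters in the log-log model (intercept and slope)

# One fitted term: residual df = n - k - 1 = 25.
t_crit_one <- qt(1 - alpha / 2, n - k - 1)
F_crit_one <- qf(1 - alpha, 1, n - k - 1)

# Two fitted terms: residual df = n - k - 2 = 24.
F_crit_two <- qf(1 - alpha, 2, n - k - 2)

round(c(t_one = t_crit_one, F_one = F_crit_one, F_two = F_crit_two), 3)
```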

The result of the RESET test using one fitted term is shown in Figure 8 below; the result using two fitted terms is shown in Figure 9.

Code

# Log-Log Model: Reset Test with squared fitted values
resettest(reg2, power=2, type="fitted")

    RESET test

data:  reg2
RESET = 26.229, df1 = 1, df2 = 25, p-value = 0.00002723

Figure 8: Question 1 (f): RESET Test - $\hat{y}^2$.

Since p-values for both versions of the RESET test are extremely small and definitely less than $\alpha = 0.05$, we should reject $H_{0}$.

The polynomials of the fitted values significantly improve the explanatory power of the log-log cost model.

This implies that the log-log model is not appropriate, given our sample of data.


g)

Based on your results for (e) and (f), does the RESET test suggest that the cubic cost function or the log-log cost function is appropriate? Why?

Solution
The results from the RESET test suggest that the cubic cost model is appropriate and the log-log model is inappropriate, given our sample of data.

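The polynomials of the fitted values do not significantly improve the explanatory power of the cubic cost model, whereas they do significantly improve the explanatory power of the log-log cost model (Figures 8 and 9).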

Code

# Log-Log Model: Reset Test with squared and cubed fitted values
resettest(reg2, power=2:3, type="fitted")

    RESET test

data:  reg2
RESET = 32.752, df1 = 2, df2 = 24, p-value = 0.0000001382

Figure 9: Question 1 (f): RESET Test - $\hat{y}^2$ and $\hat{y}^3$.

Question 2

Consider the following model relating student test scores, on a standardised test, to student-teacher ratios:

\[ \text{score}_i = \beta_0+\beta_1\,\text{str}_i + \varepsilon_i \tag{3} \]

The data file schools.csv contains data on 420 primary schools in a large state, with a considerable portion of students who did not learn English as a first language. The data include the following variables:

`students`	number of students enrolled in the school
`teachers`	number of full-time equivalent teachers employed in the school
`english`	percent of limited English proficiency students enrolled in the school
`read`	average school reading test score
`math`	average school mathematics test score

Code

schools <- import("schools.csv")
names(schools)

 [1] "district"    "school"      "county"      "grades"      "students"   
 [6] "teachers"    "calworks"    "lunch"       "computer"    "expenditure"
[11] "income"      "english"     "read"        "math"

Code

# generate test score
schools$score <-(schools$read + schools$math)/2
# generate student-teacher-ratio
schools$str <- schools$students/schools$teachers
# create two subsets, based on student-teacher ratio
schools_low <- subset(schools, schools$str<20)
schools_high <- subset(schools, schools$str>=20)

The student-teacher ratio can be constructed as

$\text{str} = \dfrac{\text{students}}{\text{teachers}}$

and the test score at the school (across reading and mathematics) can be constructed as

$\text{score} = \dfrac{\text{read} + \text{math}}{2}$

a)

Estimate the econometric model (3) by Ordinary Least Squares (OLS).
At the 5% level of significance, test the hypothesis that a higher student-teacher ratio is associated with lower test scores.

Solution

The estimated results are reported in Figure 10.

$b_1=-2.2798$: a one-unit increase in the student teacher ratio is associated with a 2.2798 point reduction in the average test score, across reading and mathematics.

Code

# Question 2(a): incorrect model
reg1_schools <-lm(score ~str, data=schools)
df_reg1_schools <-df.residual(reg1_schools)

reg2_schools <-lm(score ~str + english, data=schools)
df_reg2_schools <-df.residual(reg2_schools)

Code

models3 <- list("(1)" = reg1_schools, "(2)" = reg2_schools)

tut4_table3 <- modelsummary(
  models3,
  fmt      = 4,                   # digits=4
  statistic = "({std.error})",    
  coef_map = c(                   # covariate.labels + intercept at top
    `(Intercept)` = "Intercept",
    str = "Student-Teacher Ratio",
    english = "% Limited English Proficiency"),   
  gof_map = data.frame(
    raw   = c("nobs", "r.squared", "adj.r.squared", "sigma", "F_line"),
    clean = c("Observations", "R-squared", "Adj. R-squared",
              "Residual Std. Error", "F Statistic"),
    fmt   = c(0, 4, 4, 4, 4)
  ),
  output   = "tinytable",
  stars = TRUE,
  title = "Dependent variable: School Average Test Score",
  notes = "Standard errors shown in parentheses"
)
tut4_table3

Dependent variable: School Average Test Score
	(1)	(2)
Intercept	698.9329***	686.0322***
	(9.4675)	(7.4113)
Student-Teacher Ratio	-2.2798***	-1.1013**
	(0.4798)	(0.3803)
% Limited English Proficiency		-0.6498***
		(0.0393)
Observations	420	420
R-squared	0.0512	0.4264
Adj. R-squared	0.0490	0.4237
Residual Std. Error	18.5810	14.4645
F Statistic	22.5751 (df = 1; 418)	155.0137 (df = 2; 417)
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001
Standard errors shown in parentheses

Figure 10: Question 2 OLS Regression Results.

The hypothesis may be represented as a one-sided test:

Code

# One-sided t test H0: beta1 = 0 HA:beta1 < 0
alpha <- 0.05                          # set desired significance
b1_reg1_schools <- coef(reg1_schools)[["str"]]       # coefficient on str
seb1_reg1_schools <- sqrt(vcov(reg1_schools)[2,2])   # standard error of b1
t_reg1_schools <- (b1_reg1_schools-0)/seb1_reg1_schools    # construct t test statistic
tcr_reg1_schools <- qt(alpha, df_reg1_schools)   # calculate critical value
pval_reg1_schools <- pt(t_reg1_schools,df_reg1_schools)   # calculate p-value for 1 sided test
print(df_reg1_schools)                          # print df for t test

[1] 418

Code

print(b1_reg1_schools)                          # print estimate b1

[1] -2.279808

Code

print(seb1_reg1_schools)                        # print std. error for b1

[1] 0.4798255

Code

print(t_reg1_schools)                           # print t test statistic

[1] -4.751327

Code

print(tcr_reg1_schools)                         # print t critical value

[1] -1.648507

Code

print(pval_reg1_schools)               # print p value for t test statistic

[1] 0.000001391654

Specify $H_0$ and $H_A$:
$H_0:\beta_1 \geq 0 \qquad H_A:\beta_1<0 \qquad$ alternatively this can be written as
$H_0:\beta_1 = 0 \qquad H_A:\beta_1<0$
The test statistic is $t=\dfrac{b_1-0}{se(b_1)} \thicksim t(N-2)$.
The level of significance is $\alpha=0.05$, with degrees of freedom $N-2=418$.
Decision rule: using statistical tables, reject $H_0$ if $t \leq -1.6449$.
Note that the file tut4.R provides an exact critical value of $t_c=-1.6485$.
Regression results:
$b_1=-2.2798 \qquad se(b_1)=0.4798$
Calculate the value of the test statistic using the unrounded estimates printed above: $t=\dfrac{b_1-0}{se(b_1)}=\dfrac{-2.279808-0}{0.4798255}=-4.7513$
Apply the decision rule: $t=-4.7513<t_c=-1.6485$, so reject the null hypothesis.
The data are consistent with the hypothesis that a higher student-teacher ratio is associated with lower test scores.
Alternatively, the $p-\text{value}$ for this test is $p=0.0000014$. Since $p<0.05$, reject $H_0$.

b)

Using a 5% significance level, test the hypothesis that a higher student-teacher ratio is associated with lower test scores, controlling for the percentage of limited English proficiency students in the school.
Your answer should clearly state the null and alternative hypotheses, the distribution of the test statistic, and your decision.

Solution

First, it is expected that schools with a higher percentage of limited English proficiency students will have lower average test scores, controlling for the student-teacher ratio.

This implies the following alternative econometric model: \[ \text{score}_i = \beta_0+\beta_1\,\text{str}_i +\beta_2\, \text{english}_i+ \varepsilon_i \tag{4} \]
Estimate the econometric model (4) by Ordinary Least Squares (OLS). The estimation results are reported as Model (2) in Figure 10.

$b_1=-1.1013$: a one-unit increase in the student-teacher ratio is associated with a 1.1013 point reduction in the average test score, across reading and mathematics, holding the percentage of limited English proficiency students constant.
$b_2=-0.6498$: a one percentage point increase in the share of limited English proficiency students in the school is associated with a 0.6498 point reduction in the average test score, across reading and mathematics, holding the student-teacher ratio constant.

Then at the 5% significance level, test the hypothesis that a higher student-teacher ratio is associated with lower test scores.

Again, the hypothesis may be represented as a one-sided test:

Specify $H_0$ and $H_A$:
$H_0:\beta_1 \geq 0 \qquad H_A:\beta_1<0 \qquad$ alternatively this can be written as
$H_0:\beta_1 = 0 \qquad H_A:\beta_1<0$
The test statistic is $t=\dfrac{b_1-0}{se(b_1)} \thicksim t(N-3)$.
The level of significance is $\alpha=0.05$, with degrees of freedom $N-3=417$, since model (4) has three parameters.
Decision rule: using statistical tables, reject $H_0$ if $t \leq -1.6449$.
Note that the file tut4.R provides an exact critical value of $t_c=-1.6485$.
Regression results:
$b_1=-1.1013 \qquad se(b_1)=0.3803$
Calculate the value of the test statistic: $t=\dfrac{b_1-0}{se(b_1)}=\dfrac{-1.1013-0}{0.3803}\approx-2.8960$
Apply the decision rule: $t=-2.8960<t_c=-1.6485$, so reject the null hypothesis.
The data are consistent with the hypothesis that a higher student-teacher ratio is associated with lower test scores.
Alternatively, the $p-\text{value}$ for this test is $p=0.0020$. Since $p<0.05$, reject $H_0$.
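The steps above can be wrapped in a small helper so that the same lower-tail test is applied to any coefficient of any fitted lm object. The function name `lower_tail_t_test` and the simulated data are hypothetical; with the tutorial data one would pass `reg2_schools` and `"str"` instead.

```r
# Lower-tail t-test of H0: beta = 0 against HA: beta < 0 for one coefficient.
lower_tail_t_test <- function(model, term, alpha = 0.05) {
  coefs  <- coef(summary(model))
  t_stat <- coefs[term, "Estimate"] / coefs[term, "Std. Error"]
  df     <- df.residual(model)
  t_crit <- qt(alpha, df)                 # negative critical value
  list(t = t_stat, df = df, critical = t_crit,
       p_value = pt(t_stat, df),          # lower-tail p-value
       reject = t_stat < t_crit)
}

# Simulated stand-in for the schools data (illustrative values only).
set.seed(3)
n <- 420
english   <- runif(n, 0, 80)
str_ratio <- 18 + 0.02 * english + rnorm(n, sd = 1.8)
score     <- 690 - 1.1 * str_ratio - 0.65 * english + rnorm(n, sd = 14)
sim <- data.frame(score = score, str = str_ratio, english = english)

fit <- lm(score ~ str + english, data = sim)
res <- lower_tail_t_test(fit, "str")
res
```

Taking the degrees of freedom from `df.residual()` avoids the easy mistake of reusing N - 2 when the model has three parameters.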

c)

Suppose further that limited English proficiency students are not randomly distributed across schools. In particular, limited English proficiency students may be more likely to be enrolled in schools with a larger student-teacher ratio.
Treating model (4) as the “true” model and model (3) as the “incorrect” model, write down an expression for the omitted variable bias in the “incorrect” model.
What is the expected direction of the omitted variable bias?

Solution

\[ \begin{align*} \text{score}_i &= \alpha_0 + \alpha_1\, \text{str}_i + \varepsilon_i& \text{incorrect model}\\ \text{score}_i &= \beta_0 + \beta_1\, \text{str}_i +\beta_2 \, \text{english}_i+ \varepsilon_i& \text{correct model} \end{align*} \]

Following the results in Week 3, Lecture 2:

\[ {E[\alpha_1 - \beta_1]}= \beta_2 \dfrac{\text{COV(str,english)}}{\text{VAR(str)}} \]

In the incorrect model, the OLS estimator of $\alpha_1$ will only be unbiased for $\beta_1$ if ${E[\alpha_1 - \beta_1]}=0$, which requires either:

$\beta_2=0$ or
$\text{COV(str,english)}=0$

Neither of these conditions is expected to hold in our model.

We expect $\beta_2<0$, as schools with more limited English proficiency students will have lower average test scores, controlling for the student-teacher ratio.
We expect $\text{COV(str,english)}>0$, as limited English proficiency students are more likely to be enrolled in schools with a larger student-teacher ratio.

Putting all this together

\[ {E[\alpha_1 - \beta_1]}= \beta_2 \dfrac{\text{COV(str,english)}}{\text{VAR(str)}}<0 \]
so we expect the estimator for $\alpha_1$ in the incorrect model to produce estimates that are negatively (downward) biased.

Since it is expected $\beta_1<0$, the estimator for $\alpha_1$ in the incorrect model will produce estimates that are more negative than the “true” $\beta_1$ in model (4).
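The sample analogue of this expression holds exactly for OLS estimates: the gap between the short-regression and long-regression slopes equals the estimated coefficient on the omitted variable times the ratio of sample covariance to sample variance. The sketch below illustrates this on simulated data; the data-generating values are assumptions chosen so that $\beta_2<0$ and the covariance is positive, as argued above.

```r
# Simulated stand-in for the schools data (illustrative values only).
set.seed(4)
n <- 420
english   <- runif(n, 0, 80)
str_ratio <- 18 + 0.02 * english + rnorm(n, sd = 1.8)   # positive covariance
score     <- 690 - 1.1 * str_ratio - 0.65 * english + rnorm(n, sd = 14)

short_fit <- lm(score ~ str_ratio)             # omits english ("incorrect")
long_fit  <- lm(score ~ str_ratio + english)   # includes english ("true")

a1 <- coef(short_fit)[["str_ratio"]]
b1 <- coef(long_fit)[["str_ratio"]]
b2 <- coef(long_fit)[["english"]]

# Sample version of the omitted variable bias expression.
bias_formula <- b2 * cov(str_ratio, english) / var(str_ratio)

c(difference = a1 - b1, formula = bias_formula)
```

The two numbers coincide, and both are negative, so the short regression overstates how negative the effect of the student-teacher ratio is.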

Code

cov_schools_high <- schools_high[c("score", "str", "english")]
covmat_high <- cov(cov_schools_high)                                         
print(covmat_high)

              score         str       english
score    318.757539 -1.75376147 -237.90417077
str       -1.753761  1.33382281    0.08797197
english -237.904171  0.08797197  372.97932993

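Note that the sample covariance is positive. In the subsample printed above (schools with a student-teacher ratio of 20 or more) it is $0.0880$; for the full sample of 420 schools, $\widehat{\text{COV}}\text{(str,english)}=6.4912>0$.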