ECOM30001/ECOM90001 Tutorial 4 Hypothesis Testing and Model Specification in R
Introduction
This tutorial reviews some concepts for the basic linear model, using the econometrics software package R. Specifically, this tutorial reviews:
conducting the Ramsey RESET test in R
general issues about choosing the correct functional form
interpretation when there are relevant variables omitted from the model
This tutorial requires two (2) data files:
clothes_tut4.csv
school.csv
These files can be obtained from the Canvas subject page. In addition the R script file tut4.R provides the program code necessary to complete the exercises.
This R script file uses the following packages which need to be installed prior to running the R script file:
modelsummary:
for easily generating high quality tables of results in R
car:
for easily conducting hypothesis tests in R
rio:
for easily importing data into R
tinytable:
for easily working with tinytable formats in R
lmtest:
for easily conducting the Ramsey RESET test in R
These can be installed directly in RStudio from the packages tab or by using the command install.packages() and inserting the name of the package in the brackets.
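As a small sketch of the second approach, the following checks which of the packages above are not yet installed. The object names pkgs and missing_pkgs are illustrative and are not part of tut4.R, and the install line is left commented out so that nothing is downloaded unintentionally.

```r
# Packages used in this tutorial (ggplot2 is also loaded by the script's library() calls)
pkgs <- c("modelsummary", "car", "rio", "tinytable", "lmtest", "ggplot2")

# requireNamespace() returns FALSE for packages that cannot be loaded
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- pkgs[!installed]

# Uncomment to install whatever is missing:
# install.packages(missing_pkgs)
print(missing_pkgs)
```

If missing_pkgs is empty, every package is already available and no installation is needed.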
Question 1
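Consider the following total cost function, where \(y_{i}\) represents total cost for firm \(i\) and \(x_{i}\) represents output for firm \(i\): \[
y_{i} = \beta_{0} + \beta_{1}\,x_{i} + \beta_{2}\,x_{i}^{2} + \beta_{3}\,x_{i}^{3} + \varepsilon_{i} \tag{1}
\]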
Data for a sample of 28 firms in the clothing industry are provided in the data file clothes_tut4.csv located on the subject page. This data file contains the following variables:
\[\begin{align*}
\mbox{y} & = \mbox{total cost for firm $i$} \\
\mbox{x} & = \mbox{output for firm $i$} \\
\mbox{x2} & = \mbox{squared output for firm $i$} \\
\mbox{x3} & = \mbox{cubed output for firm $i$}
\end{align*}\]
with
\[\begin{align*}
\ln\,\mbox{x} & = \mbox{natural logarithm of output for firm $i$} \\
\ln\,\mbox{y} & = \mbox{natural logarithm of total cost for firm $i$}
\end{align*}\]
a)
Find the OLS estimates of the parameters \(\beta_{0},\beta_{1},\beta_{2}\) and \(\beta_{3}\).
Solution
The OLS estimation results are reproduced below in Figure 1.
Code
options(scipen=999)
library(modelsummary) # nice R output
library(tinytable)
library(car) # joint linear restrictions in R.
library(rio) # easily import data into R
library(ggplot2)
library(lmtest) # easily implement RESET test
Code
#-------------------------------
# create custom function for including sample F statistic
# in modelsummary table from the lm function
# only need to do this once per file
glance_custom.lm <- function(x, ...) {
  s <- summary(x)
  f <- s$fstatistic # value, numdf, dendf
  data.frame(
    F_line = sprintf("%.4f (df = %d; %d)", f[1], f[2], f[3]),
    p_F = pf(f[1], f[2], f[3], lower.tail = FALSE),
    sigma = s$sigma,
    nobs = nobs(x),
    r.squared = s$r.squared,
    adj.r.squared = s$adj.r.squared
  )
}
Code
# Read data
tut4 <- import("clothes_tut4.csv")
# y: total cost
# x: output
#--------------------------------
# Question 1 (a)
# (2) Estimate the cubic model by OLS
reg1 <- lm(y ~ x + x2 + x3, data=tut4)
df_reg1 <- reg1$df.residual
# modelsummary output
# put the dependent-variable label in the model name (this becomes the column header)
models1 <- list("Total Cost" = reg1)
# use modelsummary for results:
tut4_table1 <- modelsummary(
  models1,
  fmt = 4, # digits=4
  statistic = "({std.error})",
  coef_map = c( # covariate.labels + intercept at top
    `(Intercept)` = "Intercept",
    x = "Output",
    x2 = "Output squared",
    x3 = "Output cubed"),
  gof_map = data.frame(
    raw = c("nobs", "r.squared", "adj.r.squared", "sigma", "F_line", "p_F"),
    clean = c("Observations", "R-squared", "Adj. R-squared",
              "Residual Std. Error", "F Statistic", "F-test p value"),
    fmt = c(0, 4, 4, 4, 4, 4)
  ),
  output = "tinytable",
  stars = TRUE,
  notes = "Standard errors shown in parentheses")
# save_tt(tut4_table1, "tut4_table1.html", overwrite = TRUE)
tut4_table1
                      Total Cost
Intercept             134.6560**
                      (44.8001)
Output                57.9702+
                      (29.9702)
Output squared        -11.0289+
                      (5.7646)
Output cubed          1.1431**
                      (0.3359)
Observations          28
R-squared             0.9800
Adj. R-squared        0.9775
Residual Std. Error   21.9255
F Statistic           391.2195 (df = 3; 24)
F-test p value        0.0000
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001
Standard errors shown in parentheses
Figure 1: OLS Regression Results - Cubic Cost Function
b)
What parameter restrictions would imply a linear total cost function?
Using the car package, test whether the data are consistent with a linear total cost function, at the 5% level.
Solution
The total cost function (1) would be linear if \(\beta_{2} = \beta_{3} = 0\).
Therefore, we want to test the null hypothesis \(H_{0}: \beta_{2} = \beta_{3} = 0\).
The sample test statistic will follow an F distribution with (2,24) degrees of freedom.
Code
alpha <- 0.05
qf(1-alpha,2,df_reg1) # F critical value
[1] 3.402826
The decision rule is to reject the null hypothesis if the sample value of the F test statistic exceeds the critical value \(F_{c}(2,24)\). Using statistical tables, this critical value is \(F_{c} \thickapprox 3.40\).
Using R, the exact critical value is \(3.402826\).
Alternatively, the decision rule is to reject the null hypothesis if the p-value for the sample test statistic is less than the desired level of significance such that \(p< 0.05\).
Using the car package procedure in R we obtain the following results:
Code
# Question 1 (b)
# Test linear total cost function
# H0: beta2 = 0 and beta3 = 0
hnull_1b <- c("x2=0", "x3=0")
linearHypothesis(reg1, hnull_1b)
Linear hypothesis test:
x2 = 0
x3 = 0
Model 1: restricted model
Model 2: y ~ x + x2 + x3
Res.Df RSS Df Sum of Sq F Pr(>F)
1 26 71202
2 24 11537 2 59664 62.056 0.0000000003277 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
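Figure 2: \(H_0:\beta_2=\beta_3=0\)
From Figure 2, the sample F statistic is 62.056, which exceeds the critical value of 3.4028, and the associated p-value is well below 0.05. Therefore, we should reject the null hypothesis; the data are not consistent with a linear total cost function.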
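The F statistic in Figure 2 can also be rebuilt by hand from the restricted and unrestricted residual sums of squares. The sketch below uses only base R and the RSS values printed in Figure 2; the object names (rss_r, rss_u, J, df_u) are illustrative and do not appear in tut4.R.

```r
# Values read from the linearHypothesis() output in Figure 2
rss_r <- 71202   # RSS of the restricted (linear) model
rss_u <- 11537   # RSS of the unrestricted (cubic) model
J     <- 2       # number of restrictions under H0
df_u  <- 24      # residual degrees of freedom of the unrestricted model

# F = [(RSS_R - RSS_U) / J] / [RSS_U / (N - K)]
F_stat <- ((rss_r - rss_u) / J) / (rss_u / df_u)
F_crit <- qf(0.95, J, df_u)                         # 5% critical value
p_val  <- pf(F_stat, J, df_u, lower.tail = FALSE)   # upper-tail p-value

print(c(F_stat = F_stat, F_crit = F_crit, p_val = p_val))
```

Because the printed RSS values are rounded, F_stat differs from the reported 62.056 in the third decimal place.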
c)
What parameter restrictions would imply a quadratic total cost function?
Test whether the data are consistent with a quadratic total cost function at the 5% level.
Solution
Code
tut4_table1a <- tut4_table1 |>
  style_tt(i = 7, j = 2, bold = TRUE, color = "red") |>
  style_tt(i = 8, j = 2, color = "red")
tut4_table1a
[Table: the results from Figure 1 are reproduced, with the Output cubed estimate 1.1431** and its standard error (0.3359) highlighted in red.]
The total cost function (1) would be quadratic if \(\beta_3=0\). Therefore, we want to test the null hypothesis \(H_0:\beta_3=0\) against the two-sided alternative \(H_A:\beta_3 \ne 0\); we can use a t-test.
The 5% critical value for a two-sided t-test with 24 degrees of freedom is 2.0639, so the decision rule is to reject \(H_0\) if \(\text{t} > 2.0639\) or if \(\text{t} < -2.0639\). The sample test statistic is calculated as \(\text{t} = \dfrac{1.1431-0}{0.3359}=3.4031\) (3.403031 when computed from the unrounded estimates), with a \(p-\text{value}\) of 0.0023. Since the \(p-\text{value}\) is less than 0.05, we should reject \(H_0 \Rightarrow\) the data are not consistent with a quadratic total cost function.
Code
df_reg1 <- 24
qf(1-alpha,1,df_reg1)
[1] 4.259677
Tip
When the null hypothesis has one restriction, \(t^2=F\) and \(t_c^2=F_c\); e.g. \(t^2= 3.403031^2=11.58062=F\) and \(t_c^2=2.0639^2 = 4.259683 \approx F_c\).
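The equivalence in the tip can be checked numerically in base R. This is only a sketch using the values quoted above (24 residual degrees of freedom and t = 3.403031); the object names are illustrative.

```r
df_resid <- 24
alpha    <- 0.05
t_stat   <- 3.403031                      # t statistic for beta3 quoted above

t_crit <- qt(1 - alpha / 2, df_resid)     # two-sided t critical value
F_crit <- qf(1 - alpha, 1, df_resid)      # F critical value with (1, 24) df

# The two-sided t p-value and the F p-value for t^2 coincide
p_two_sided <- 2 * pt(-abs(t_stat), df_resid)
p_from_F    <- pf(t_stat^2, 1, df_resid, lower.tail = FALSE)

print(c(t_crit_squared = t_crit^2, F_crit = F_crit))
print(c(p_two_sided = p_two_sided, p_from_F = p_from_F))
```

Both routes therefore lead to the same decision for a single restriction tested against a two-sided alternative.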
Alternatively, \(H_0:\beta_3=0\) can be tested using an F-test.
The test statistic will follow an F distribution with (1,24) degrees of freedom.
The decision rule is to reject the null hypothesis if the sample value of the F test statistic exceeds the critical value \(F_c(1,24)\).
Using statistical tables, \(F_c \approx 4.26\). Using R, the exact critical value is 4.259677.
Alternatively, the decision rule is to reject the null hypothesis if the \(p-\text{value}\) is less than the desired level of significance such that \(p<0.05\).
Linear hypothesis test:
x3 = 0
Model 1: restricted model
Model 2: y ~ x + x2 + x3
Res.Df RSS Df Sum of Sq F Pr(>F)
1 25 17105
2 24 11537 1 5567.1 11.581 0.00234 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Figure 3: \(H_0:\beta_3=0\)
An examination of Figure 3 reveals the sample test statistic is 11.581 with an associated \(p-\text{value}\) of 0.00234. Therefore, we should reject the null hypothesis.
The data are not consistent with a quadratic total cost function.
d)
Recall the definition of average cost. What parameter restrictions would imply a linear average cost function? Using the car package, test whether the data are consistent with a linear average cost function, at the 5% level.
Solution
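The cubic cost function (1) implies the following average cost function, obtained by dividing (1) by \(x_{i}\): \[
\frac{y_{i}}{x_{i}} = \beta_{0}\,\frac{1}{x_{i}} + \beta_{1} + \beta_{2}\,x_{i} + \beta_{3}\,x_{i}^{2} + \frac{\varepsilon_{i}}{x_{i}}
\]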
Ignore the problem of heteroskedasticity for the moment. We will be looking at this later in the subject.
The average cost function will be linear if \(\beta_{0} = \beta_{3} = 0\), so we want to test the null hypothesis \(H_{0}: \beta_{0} = \beta_{3} = 0\).
We don’t need to estimate the average cost function since we are testing parameter restrictions on the cubic model (1).
The sample test statistic will follow an F distribution with (2,24) degrees of freedom. The decision rule is to reject the null hypothesis if the sample value of the F test statistic exceeds the critical value \(F_{c}(2,24)\).
Using statistical tables, this critical value is \(F_{c} \thickapprox 3.40\).
Using R, the exact critical value is \(3.402826\). Alternatively, the decision rule is to reject the null hypothesis if the p-value for the sample test statistic is less than the desired level of significance such that \(p < 0.05\).
Again using the car package, we get:
Code
# Question 1 (d)
# Test linear average cost function
hnull_1d <- c("(Intercept)=0", "x3=0")
linearHypothesis(reg1, hnull_1d)
Linear hypothesis test:
(Intercept) = 0
x3 = 0
Model 1: restricted model
Model 2: y ~ x + x2 + x3
Res.Df RSS Df Sum of Sq F Pr(>F)
1 26 101818
2 24 11537 2 90280 93.9 0.000000000004482 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Code
F_crit_d <- qf(1-alpha,2,df_reg1)
F_crit_d
[1] 3.402826
Figure 4: \(H_0:\beta_0=\beta_3=0\)
The sample test statistic, from Figure 4, is 93.9, which exceeds the critical value of 3.4028, with an associated \(p-\text{value}\) of approximately 0.0000.
Therefore, we should reject the null hypothesis; the data are not consistent with a linear average cost function.
e)
Test the hypothesis that the model is correctly specified using the RESET test, at the 5% level.
What is your conclusion?
Solution
Consider the following modification of model (1): \[
y_{i} = \beta_{0} + \beta_{1}\,x_{i} + \beta_{2}\,x_{i}^{2} + \beta_{3}\,x_{i}^{3} + \gamma\,\hat{y}_{i}^{2} + \varepsilon_{i}
\] where \(\hat{y}_{i}\) represents the fitted values from the OLS regression of model (1). The RESET test is a test of the null hypothesis
\(H_{0} : \gamma = 0\) against the alternative \(H_{A}: \gamma \ne 0\).
The test statistic will follow a t-distribution with \(N-5 = 23\) degrees of freedom. The 5% critical value for a two-sided test with 23 degrees of freedom is 2.069, so the decision rule is to reject \(H_{0}\) if \(t>2.069\) or \(t<-2.069\).
Alternatively, using the equivalence between the t-distribution and the F-distribution, the test statistic will follow an F-distribution with (1,23) degrees of freedom. The F critical value \(F_{c}\) is \(4.279\). The decision rule is to reject \(H_{0}\) if the sample value of the F-statistic exceeds \(F_{c}\). Alternatively, the decision rule is to reject the null hypothesis if the p-value for the sample test statistic is less than the desired level of significance such that \(p < 0.05\).
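To make the mechanics of the RESET test concrete, here is a hand-rolled sketch in base R. It uses simulated data (hypothetical, not clothes_tut4.csv) in which the true relationship is quadratic but a linear model is fitted, so the test should reject. The names fit, aux, t_gamma and F_reset are illustrative only.

```r
set.seed(123)
n <- 28
x <- runif(n, 0, 10)
y <- 1 + 2 * x + 0.5 * x^2 + rnorm(n)    # simulated data: true model is quadratic

fit  <- lm(y ~ x)                        # deliberately misspecified linear model
yhat <- fitted(fit)

# Auxiliary regression: original regressors plus the squared fitted values
aux <- lm(y ~ x + I(yhat^2))

# t test of gamma = 0, and the equivalent F test comparing the two models
t_gamma <- coef(summary(aux))["I(yhat^2)", "t value"]
reset   <- anova(fit, aux)
F_reset <- reset$F[2]
p_reset <- reset[2, "Pr(>F)"]

print(c(t_squared = t_gamma^2, F_reset = F_reset, p_value = p_reset))
```

With one added term the squared t statistic equals the F statistic, which is the equivalence used above; resettest() from lmtest packages the same idea into a single call.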
Running the RESET test with one fitted term gives
Code
# Question 1 (e)
# Test total cost model is well specified
# Use the RESET test
# Cubic Model: Reset Test with squared fitted values
resettest(reg1, power=2, type="fitted")
Alternatively, consider the following modification of model (1): \[
y_{i} = \beta_{0} + \beta_{1}\,x_{i} + \beta_{2}\,x_{i}^{2} + \beta_{3}\,x_{i}^{3} + \gamma_{1}\,\hat{y}_{i}^{2} + \gamma_{2}\,\hat{y}_{i}^{3} + \varepsilon_{i}
\] where \(\hat{y}_{i}\) represents the fitted values from the OLS regression of model (1).
The RESET test is a test of the null hypothesis \(H_{0} : \gamma_{1} = \gamma_{2} = 0\) against the alternative \(H_{A}: \gamma_{1} \mbox{ and/or } \gamma_{2} \ne 0\).
The test statistic will follow an F-distribution with (2,22) degrees of freedom. The F critical value \(F_{c}\) is \(3.443\).
The decision rule is to reject \(H_{0}\) if the sample value of the F-statistic exceeds \(F_{c}\), or use the p-value approach as we have done before.
Running the RESET test with two fitted terms gives
Code
# Cubic Model: Reset Test with squared and cubed fitted values
resettest(reg1, power=2:3, type="fitted")
f)
Consider the following modification of the log-log total cost function model (2): \[
\ln y_{i} = \beta_{0} + \beta_{1}\,\ln x_{i} + \gamma\,\hat{y}_{i}^{2} + \varepsilon_{i}
\] where \(\hat{y}_{i}\) represents the fitted values from the OLS regression for model (2).
The RESET test is a test of the null hypothesis \(H_{0} : \gamma = 0\) against the alternative \(H_{A}: \gamma \ne 0\).
The test statistic will follow a t-distribution with \(N-3 = 25\) degrees of freedom. The 5% critical value for a two-sided test with 25 degrees of freedom is 2.060, so the decision rule is to reject \(H_{0}\) if \(t>2.060\) or \(t<-2.060\).
Alternatively, using the equivalence between the t-distribution and the F-distribution, the test statistic will follow an F-distribution with (1,25) degrees of freedom.
The F critical value \(F_{c}\) is \(4.242\).
The decision rule is to reject \(H_{0}\) if the sample value of the F-statistic exceeds \(F_{c}\), or use the p-value approach and reject the null hypothesis if the \(p-\text{value}<\alpha=0.05\).
Then consider the following modification of the total cost function model (2): \[
\ln y_{i} = \beta_{0} + \beta_{1}\,\ln x_{i} + \gamma_{1}\,\hat{y}_{i}^{2} + \gamma_{2}\,\hat{y}_{i}^{3} + \varepsilon_{i}
\] where \(\hat{y}_{i}\) represents the fitted values from the OLS regression for model (2).
The RESET test is a test of the null hypothesis \(H_{0} : \gamma_{1} = \gamma_{2} = 0\) against the alternative \(H_{A}: \gamma_{1} \mbox{ and/or } \gamma_{2} \ne 0\).
The test statistic will follow an F-distribution with (2,24) degrees of freedom. The F critical value \(F_{c}\) is \(3.403\).
The decision rule is to reject \(H_{0}\) if the sample value of the F-statistic exceeds \(F_{c}\), or, again, use the p-value approach.
The results of the RESET test using one and two fitted terms are below:
Code
# Log-Log Model: Reset Test with squared fitted values
resettest(reg2, power=2, type="fitted")
Figure 8: Question 1 (f): RESET Test - \(\hat{y}^2\).
Code
# Log-Log Model: Reset Test with squared and cubed fitted values
resettest(reg2, power=2:3, type="fitted")
g)
Based on your results for (e) and (f), does the RESET test suggest that the cubic cost function or the log-log cost function is appropriate? Why?
Solution
The results from the RESET test suggest that the cubic cost model is appropriate and the log-log model is inappropriate, given our sample of data.
The polynomials of the fitted values do not significantly improve the explanatory power of the cubic cost model.
Question 2
The data file schools.csv contains data on 420 primary schools in a large state in which a considerable portion of students did not learn English as a first language. The data include the following variables:
students
number of students enrolled in the school
teachers
number of full-time equivalent teachers employed in the school
english
percent of limited English proficiency students enrolled in the school
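a)
Consider the following econometric model, where \(\text{score}_i\) is the average test score for school \(i\) and \(\text{str}_i\) is its student-teacher ratio: \[
\text{score}_i = \beta_0+\beta_1\,\text{str}_i + \varepsilon_i \tag{3}
\]
Estimate the econometric model (3) by Ordinary Least Squares (OLS).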
At the 5% level of significance, test the hypothesis that a higher student-teacher ratio is associated with lower test scores.
Solution
The estimated results are reported in Figure 10.
\(b_1=-2.2798\): a one-unit increase in the student-teacher ratio is associated with a 2.2798 point reduction in the average test score, across reading and mathematics.
Code
models3 <- list("(1)" = reg1_schools, "(2)" = reg2_schools)
tut4_table3 <- modelsummary(
  models3,
  fmt = 4, # digits=4
  statistic = "({std.error})",
  coef_map = c( # covariate.labels + intercept at top
    `(Intercept)` = "Intercept",
    str = "Student-Teacher Ratio",
    english = "% Limited English Proficiency"),
  gof_map = data.frame(
    raw = c("nobs", "r.squared", "adj.r.squared", "sigma", "F_line"),
    clean = c("Observations", "R-squared", "Adj. R-squared",
              "Residual Std. Error", "F Statistic"),
    fmt = c(0, 4, 4, 4, 4)
  ),
  output = "tinytable",
  stars = TRUE,
  title = "Dependent variable: School Average Test Score",
  notes = "Standard errors shown in parentheses")
tut4_table3
Dependent variable: School Average Test Score
                                (1)                     (2)
Intercept                       698.9329***             686.0322***
                                (9.4675)                (7.4113)
Student-Teacher Ratio           -2.2798***              -1.1013**
                                (0.4798)                (0.3803)
% Limited English Proficiency                           -0.6498***
                                                        (0.0393)
Observations                    420                     420
R-squared                       0.0512                  0.4264
Adj. R-squared                  0.0490                  0.4237
Residual Std. Error             18.5810                 14.4645
F Statistic                     22.5751 (df = 1; 418)   155.0137 (df = 2; 417)
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001
Standard errors shown in parentheses
Figure 10: Question 2 OLS Regression Results
The hypothesis may be represented as a one-sided test:
Code
# One-sided t test H0: beta1 = 0 HA:beta1 < 0
alpha <- 0.05 # set desired significance
b1_reg1_schools <- coef(reg1_schools)[["str"]] # coefficient on str
seb1_reg1_schools <- sqrt(vcov(reg1_schools)[2,2]) # standard error of b1
t_reg1_schools <- (b1_reg1_schools-0)/seb1_reg1_schools # construct t test statistic
tcr_reg1_schools <- qt(alpha, df_reg1_schools) # calculate critical value
pval_reg1_schools <- pt(t_reg1_schools,df_reg1_schools) # calculate p-value for 1 sided test
print(df_reg1_schools) # print df for t test
[1] 418
Code
print(b1_reg1_schools) # print estimate b1
[1] -2.279808
Code
print(seb1_reg1_schools) # print std. error for b1
[1] 0.4798255
Code
print(t_reg1_schools) # print t test statistic
[1] -4.751327
Code
print(tcr_reg1_schools) # print t critical value
[1] -1.648507
Code
print(pval_reg1_schools) # print p value for t test statistic
[1] 0.000001391654
specify \(H_0\) and \(H_A\): \(H_0:\beta_1 \geq 0 \qquad H_A:\beta_1<0\); alternatively, this can be written as \(H_0:\beta_1 = 0 \qquad H_A:\beta_1<0\)
the test statistic: \(t=\dfrac{b_1-0}{se(b_1)} \thicksim t(N-2)\)
the level of significance: \(\alpha=0.05\), with \(N-2=418\) degrees of freedom
the decision rule: using statistical tables, reject \(H_0\) if \(t \leq -1.6449\).
Note that the file tut4.R provides an exact critical value of \(t_c=-1.6485\).
calculate the value of the test statistic: \(t=\dfrac{b_1-0}{se(b_1)}=\dfrac{-2.2798-0}{0.4798}=-4.7516\) (R reports \(-4.7513\) using the unrounded estimates)
apply the decision rule: since \(t<t_c\), reject the null hypothesis.
The data are consistent with the hypothesis that a higher student-teacher ratio is associated with lower test scores.
Alternatively, the \(p-\text{value}\) for this test is \(p=0.0000\). Since \(p<0.05\), we reject \(H_0\).
b)
Using a 5% significance level, test the hypothesis that a higher student-teacher ratio is associated with lower test scores, controlling for the percentage of limited English proficiency students in the school.
Your answer should clearly state the null and alternative hypotheses, the distribution of the test statistic, and your decision.
Solution
First, schools with a higher percentage of limited English proficiency students are expected to have lower average test scores, controlling for the student-teacher ratio.
This implies the following alternative econometric model: \[
\text{score}_i = \beta_0+\beta_1\,\text{str}_i +\beta_2\, \text{english}_i+ \varepsilon_i \tag{4}
\]
Estimate the econometric model (4) by Ordinary Least Squares (OLS). The estimation results are in column (2) of Figure 10.
\(b_1=-1.1013\): a one-unit increase in the student-teacher ratio is associated with a 1.1013 point reduction in the average test score, across reading and mathematics, holding the percentage of limited English proficiency students constant.
\(b_2=-0.6498\): a one percentage point increase in the share of limited English proficiency students in the school is associated with a 0.6498 point reduction in the average test score, across reading and mathematics, holding the student-teacher ratio constant.
Then at the 5% significance level, test the hypothesis that a higher student-teacher ratio is associated with lower test scores.
Again, the hypothesis may be represented as a one-sided test:
specify \(H_0\) and \(H_A\): \(H_0:\beta_1 \geq 0 \qquad H_A:\beta_1<0\); alternatively, this can be written as \(H_0:\beta_1 = 0 \qquad H_A:\beta_1<0\)
the test statistic: \(t=\dfrac{b_1-0}{se(b_1)} \thicksim t(N-3)\)
the level of significance: \(\alpha=0.05\), with \(N-3=417\) degrees of freedom
the decision rule: using statistical tables, reject \(H_0\) if \(t \leq -1.6449\).
Note that the file tut4.R provides an exact critical value of \(t_c=-1.6485\).
calculate the value of the test statistic: \(t=\dfrac{b_1-0}{se(b_1)}=\dfrac{-1.1013-0}{0.3803}=-2.8960\)
apply the decision rule: since \(t<t_c\), reject the null hypothesis.
The data are consistent with the hypothesis that a higher student-teacher ratio is associated with lower test scores.
Alternatively, the \(p-\text{value}\) for this test is \(p=0.0020\). Since \(p<0.05\), we reject \(H_0\).
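The steps for this one-sided test can be reproduced in base R from the rounded estimates reported in column (2) of Figure 10. This is a sketch: the object names are illustrative, and the residual degrees of freedom of 417 are taken from the F statistic line of that table.

```r
b1       <- -1.1013   # coefficient on str, model (4)
se_b1    <- 0.3803    # its standard error
df_resid <- 417       # residual degrees of freedom, from Figure 10
alpha    <- 0.05

t_stat <- (b1 - 0) / se_b1       # test statistic under H0: beta1 = 0
t_crit <- qt(alpha, df_resid)    # lower-tail critical value (negative)
p_val  <- pt(t_stat, df_resid)   # one-sided (lower-tail) p-value

reject <- t_stat < t_crit
print(c(t_stat = t_stat, t_crit = t_crit, p_val = p_val))
print(reject)
```

Because the inputs are rounded to four decimal places, the results may differ from R's full-precision output in the last digit.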
c)
Suppose further that limited English proficiency students are not randomly distributed across schools. In particular, limited English proficiency students may be more likely to be enrolled in schools with a larger student-teacher ratio.
Treating model (4) as the “true” model and model (3) as the “incorrect” model, write down an expression for the omitted variable bias in the “incorrect” model.
What is the expected direction of the omitted variable bias?
Solution
\[
\begin{align*}
\text{score}_i &= \alpha_0 + \alpha_1\, \text{str}_i + \varepsilon_i& \text{incorrect model}\\
\text{score}_i &= \beta_0 + \beta_1\, \text{str}_i +\beta_2 \, \text{english}_i+ \varepsilon_i& \text{correct model}
\end{align*}
\] Following the results in Week 3, Lecture 2: \[
{E[\alpha_1 - \beta_1]}= \beta_2 \dfrac{\text{COV(str,english)}}{\text{VAR(str)}}
\] In the incorrect model, the OLS estimator of \(\alpha_1\) will be unbiased for \(\beta_1\) only if \({E[\alpha_1 - \beta_1]}=0\), which requires either:
\(\beta_2=0\) or
\(\text{COV(str,english)}=0\)
Neither of these conditions is expected to hold in our model.
we expect \(\beta_2<0\), as limited English proficiency students will have lower average test scores, controlling for the student-teacher ratio
\(\text{COV(str,english)}>0\) : limited English proficiency students are more likely to be enrolled in schools with a larger student-teacher ratio
Putting all this together
\[
{E[\alpha_1 - \beta_1]}= \beta_2 \dfrac{\text{COV(str,english)}}{\text{VAR(str)}}<0
\]
so we expect the estimator for \(\alpha_1\) in the incorrect model to produce estimates that are negatively (downward) biased.
Since it is expected \(\beta_1<0\), the estimator for \(\alpha_1\) in the incorrect model will produce estimates that are more negative than the “true” \(\beta_1\) in model (4).
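The direction of the bias can be illustrated with a small base R simulation. All data below are simulated (hypothetical, not schools.csv), with parameter values chosen only to mimic the signs discussed above: \(\beta_2<0\) and a positive covariance between str and english. In any sample, the short-regression slope equals the long-regression slope plus \(b_2\) times the sample ratio COV(str,english)/VAR(str), which is the sample analogue of the bias expression.

```r
set.seed(42)
n <- 420

# Simulated data: english is positively related to str by construction
str     <- rnorm(n, mean = 20, sd = 2)
english <- pmax(15 + 3 * (str - 20) + rnorm(n, sd = 8), 0)
score   <- 700 - 1.1 * str - 0.65 * english + rnorm(n, sd = 10)

short <- lm(score ~ str)             # "incorrect" model, omits english
long  <- lm(score ~ str + english)   # "correct" model

delta <- cov(str, english) / var(str)   # sample COV(str, english) / VAR(str)

a1  <- coef(short)[["str"]]
b1  <- coef(long)[["str"]]
b2  <- coef(long)[["english"]]
rhs <- b1 + b2 * delta                  # decomposition of the short-regression slope

print(c(a1 = a1, b1 = b1, b2 = b2, delta = delta, b1_plus_bias = rhs))
```

Here a1 matches b1 + b2 * delta up to rounding error, and because b2 is negative while delta is positive, a1 is more negative than b1.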