#clean environment
rm(list = ls())
library("AER")
#load the CASchools data set shipped with AER
data("CASchools")
Estimate the linear regression showing the effect of income on average math test scores in the 420 California school districts.
lm_math <- lm(CASchools$math ~ CASchools$income)
Linear Regression:
plot(CASchools$math ~ CASchools$income)
regLine(lm_math) #add the OLS line; regLine() comes from car, which AER loads
summary(lm_math)
Call:
lm(formula = CASchools$math ~ CASchools$income)
Residuals:
Min 1Q Median 3Q Max
-39.045 -8.997 0.308 8.416 34.246
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 625.53948 1.53627 407.18 <2e-16 ***
CASchools$income 1.81523 0.09073 20.01 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 13.42 on 418 degrees of freedom
Multiple R-squared: 0.4892, Adjusted R-squared: 0.4879
F-statistic: 400.3 on 1 and 418 DF, p-value: < 2.2e-16
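To connect the reported standard errors to interval estimates, confint() can be applied to the same model object; a minimal sketch (no new data assumed):
#95% confidence interval for the intercept and the income slope;
#for the slope this is roughly 1.815 plus or minus 1.96 * 0.091
confint(lm_math, level = 0.95)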
Polynomial Regression:
#cubic regression: poly() generates orthogonal polynomial terms in income
poly_math <- lm(CASchools$math ~ poly(CASchools$income, 3))
plot(CASchools$math ~ CASchools$income)
lines(sort(CASchools$income),fitted(poly_math)[order(CASchools$income)],col="red")
summary(poly_math)
Call:
lm(formula = CASchools$math ~ poly(CASchools$income, 3))
Residuals:
Min 1Q Median 3Q Max
-39.491 -9.006 0.045 8.108 39.520
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 653.343 0.633 1032.206 < 2e-16 ***
poly(CASchools$income, 3)1 268.491 12.972 20.698 < 2e-16 ***
poly(CASchools$income, 3)2 -72.255 12.972 -5.570 4.58e-08 ***
poly(CASchools$income, 3)3 7.963 12.972 0.614 0.54
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 12.97 on 416 degrees of freedom
Multiple R-squared: 0.525, Adjusted R-squared: 0.5216
F-statistic: 153.3 on 3 and 416 DF, p-value: < 2.2e-16
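To see how the cubic fit departs from a straight line, both fits can be drawn on one scatter plot; a minimal sketch reusing lm_math and poly_math from above:
#scatter of math scores against income with both fitted curves overlaid
plot(CASchools$math ~ CASchools$income)
abline(lm_math, col = "blue") #linear fit
lines(sort(CASchools$income), fitted(poly_math)[order(CASchools$income)], col = "red") #cubic fit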
Test the null hypothesis that the population regression function is linear against a polynomial regression model of degree 3. Test the significance of the individual coefficients in the cubic regression function relating district income to test scores: \(TestScores = \beta_1 + \beta_2\,income + \beta_3\,income^2 + \beta_4\,income^3\).
#For the quadratic term:
(-72.25 - 0) / (12.972) #matches the t value in the table
[1] -5.569689
The absolute value, \(5.57\), is much larger than the 5% critical value of \(1.96\), so the quadratic term is statistically significant.
#For the cubic term (note the estimate is +7.963, not negative):
(7.963 - 0) / (12.972) #matches the t value in the table
[1] 0.6138606
The cubic term is not statistically significant (\(0.61 < 1.96\)).
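Rather than typing the numbers in by hand, the same statistics can be pulled out of the fitted object; a minimal sketch using poly_math from above:
#full coefficient matrix: estimates, standard errors, t values, p-values
coef(summary(poly_math))
#t values alone: each estimate divided by its standard error, as above
coef(summary(poly_math))[, "t value"]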
Based on the t-statistic, can we reject the null that the regression function is quadratic? No: the relevant statistic is the t value on the cubic term, which at \(0.61\) is well below \(1.96\), so the quadratic specification cannot be rejected.
In the previous regression, what is the value of the F-statistic testing the null hypothesis that the coefficients of the quadratic and cubic transformations are both zero? The F-statistic reported by summary() is \(153.3\), but note that it tests all three polynomial coefficients jointly; the two-restriction test is sketched below.
f <- 153.3
(7.96 + 72.25 - 0) / sqrt(f) #ad hoc scaling of the combined quadratic and cubic estimate
[1] 6.478246
Treating this \(6.5\) as a rough standard error, an approximate band for the combined effect would be:
estimated_effect <- (7.96 + 72.25)
estimated_effect + 1.96 * 6.4
[1] 92.754
estimated_effect - 1.96 * 6.4
[1] 67.666
This is only a back-of-envelope calculation; the anova() comparison below gives the proper joint test.
Can we reject this null against the alternative that the function is quadratic or cubic? Yes: the quadratic term alone is highly significant, so the relationship is clearly nonlinear.
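A direct way to run the two-restriction test is to compare the nested linear and cubic models with base R's anova(); a minimal sketch reusing lm_math and poly_math from above:
#F-test of H0: the quadratic and cubic coefficients are both zero,
#i.e. the linear model against the cubic model
anova(lm_math, poly_math)
Because poly() produces orthogonal regressors, this F-statistic should come out near \((5.57^2 + 0.61^2)/2 \approx 15.7\), far above the 5% critical value of roughly \(3\), confirming the rejection of linearity.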
Generate two new variables by logarithmic transformation of the income and test-score variables. Call them log_income and log_testscore.
log_income <- log(CASchools$income)
log_testscore <- log(CASchools$math)
Run and interpret these three regression models. (I) \(TestScores = \beta_1 + \beta_2 \log(income)\) (linear-log):
linlog_model <- lm(CASchools$math ~ log_income)
plot(CASchools$math ~ log_income)
lines(sort(log_income), fitted(linlog_model)[order(log_income)], col = "red")
In the linear-log model, a 1% increase in district income is associated with a \(\beta_2/100\) point increase in test scores.
(II) \(\log(TestScores) = \beta_1 + \beta_2\,income\) (log-linear):
loglin_model <- lm(log_testscore ~ CASchools$income)
plot(log_testscore ~ CASchools$income)
lines(sort(CASchools$income), fitted(loglin_model)[order(CASchools$income)], col = "red")
In the log-linear model, a one-unit ($1,000) increase in income is associated with an approximate \(100\beta_2\)% change in test scores.
(III) \(\log(TestScores) = \beta_1 + \beta_2 \log(income)\) (log-log):
loglog_model <- lm(log_testscore ~ log_income)
plot(log_testscore ~ log_income)
lines(sort(log_income), fitted(loglog_model)[order(log_income)], col = "red")
In the log-log model, \(\beta_2\) is an elasticity: a 1% increase in income is associated with a \(\beta_2\)% change in test scores.
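One caveat when ranking the three specifications: \(R^2\) values are only comparable across models with the same dependent variable, so (II) and (III) can be compared with each other but not directly with (I). A minimal sketch collecting the fit statistics (model names as defined above):
#adjusted R-squared for each specification; only (II) and (III) share
#a dependent variable (log test scores) and are directly comparable
sapply(list(linlog = linlog_model, loglin = loglin_model, loglog = loglog_model),
       function(m) summary(m)$adj.r.squared)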
Generate the binary variable \(D_i\) (\(= 1\) if class size is below the median) indicating low class-size, with class size measured by the student-teacher ratio.
#D = 1 for districts whose student-teacher ratio is below the median
D <- ifelse(CASchools$students/CASchools$teachers < median(CASchools$students/CASchools$teachers), 1, 0)
head(D)
[1] 1 0 1 1 1 0
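As a quick sanity check (a minimal sketch, no new variables assumed), tabulating D should show the 420 districts split roughly in half, by construction of the median cutoff:
#counts of large-class (0) and small-class (1) districts
table(D)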
Generate an interaction term (\(income \times D_i\)) by multiplying income by the binary dummy indicating low class-size.
income_D <- CASchools$income * D
Run and interpret this regression model: \(TestScores = \beta_1 + \beta_2 (income \times D_i)\)
last_model <- lm(CASchools$math ~ income_D)
plot(CASchools$math ~ income_D)
lines(sort(income_D), fitted(last_model)[order(income_D)], col = "red")
Here \(\beta_2\) is the income slope for small-class districts only: for large-class districts \(income \times D_i = 0\), so their fitted value is just the intercept.
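The specification above omits the main effects of income and \(D_i\), so large-class districts are fitted by the intercept alone. A common fuller specification, sketched here for comparison (full_model is a name introduced for illustration), lets both the intercept and the income slope differ between the two groups:
#income * D expands to income + D + income:D; the coefficient on income:D
#is the difference in the income slope between small- and large-class districts
full_model <- lm(CASchools$math ~ CASchools$income * D)
summary(full_model)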