Load the CASchools dataset used in class and define a new variable containing the number of students per teacher (the student-teacher ratio).
Run a regression of the math score on the student-teacher ratio. Interpret the results of the previous question. How do you interpret the intercept? Does it make sense? Explain.
library(AER)  # provides the CASchools dataset
data("CASchools")
CASchools$student_teacher_ratio <- CASchools$students / CASchools$teachers
fit <- lm(math ~ student_teacher_ratio, data = CASchools)
summary(fit)
##
## Call:
## lm(formula = math ~ student_teacher_ratio, data = CASchools)
##
## Residuals:
## Min 1Q Median 3Q Max
## -44.615 -13.374 -0.828 12.728 52.711
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 691.4174 9.3825 73.692 < 2e-16 ***
## student_teacher_ratio -1.9386 0.4755 -4.077 5.47e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 18.41 on 418 degrees of freedom
## Multiple R-squared: 0.03824, Adjusted R-squared: 0.03594
## F-statistic: 16.62 on 1 and 418 DF, p-value: 5.467e-05
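One way to read these estimates is sketched below, using the coefficients copied from the output above. The intercept is the fitted math score at a student-teacher ratio of zero, and the slope is the change in the fitted score for one additional student per teacher. The helper predict_math and the ratio of 20 are illustrative assumptions, not part of the original exercise.

```r
# Coefficients copied from the summary(fit) output above
b0 <- 691.4174   # intercept
b1 <- -1.9386    # slope on student_teacher_ratio

# Fitted math score at a given student-teacher ratio (hypothetical helper)
predict_math <- function(str) b0 + b1 * str

predict_math(0)                      # equals the intercept
predict_math(20)                     # fitted score at an illustrative ratio of 20
predict_math(21) - predict_math(20)  # one more student per teacher shifts the fit by b1
```

A ratio of zero would describe a district with no students per teacher, so the intercept is an extrapolation that anchors the line rather than a score any real district could be expected to have.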
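Create a dummy variable indicating whether the percentage of students qualifying for a reduced-price lunch (lunch) is higher than the median. Call it lunch_d.
median_lunch <- median(CASchools$lunch)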
CASchools$lunch_d <- ifelse(CASchools$lunch > median_lunch, 1, 0)
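The same median split can be checked on a small vector. This is a minimal sketch with made-up values (lunch_toy is hypothetical, not taken from CASchools) showing that observations strictly above the median get 1 and all others get 0.

```r
# Hypothetical values standing in for the lunch variable
lunch_toy <- c(10, 25, 40, 55, 70, 85)

# 1 if strictly above the median, 0 otherwise
lunch_d_toy <- ifelse(lunch_toy > median(lunch_toy), 1, 0)

# Count how many observations fall in each group
table(lunch_d_toy)
```

Because the comparison is strict, any observation exactly equal to the median is coded 0.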
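Run two regressions of the math score on the student-teacher ratio: one for the schools for which lunch_d is true and the other for those for which it is false. Interpret your results.
Judging by the t-values and p-values, the estimated slope is negative and statistically significant at the 5% level in both subsamples. Where lunch_d is 1 the slope is -1.01 (t = -2.10, p = 0.037); where lunch_d is 0 it is -2.18 (t = -4.07, p < 0.001). The R-squared is low in both regressions (0.02 and 0.07), so the student-teacher ratio explains only a small part of the variation in math scores.
fit_lunch_d_true <- lm(math ~ student_teacher_ratio, data = CASchools[CASchools$lunch_d == 1, ])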
summary(fit_lunch_d_true)
##
## Call:
## lm(formula = math ~ student_teacher_ratio, data = CASchools[CASchools$lunch_d ==
## 1, ])
##
## Residuals:
## Min 1Q Median 3Q Max
## -34.031 -10.016 -0.135 8.285 37.819
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 661.1314 9.5988 68.877 <2e-16 ***
## student_teacher_ratio -1.0138 0.4836 -2.096 0.0373 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.64 on 208 degrees of freedom
## Multiple R-squared: 0.02069, Adjusted R-squared: 0.01598
## F-statistic: 4.394 on 1 and 208 DF, p-value: 0.03727
fit_lunch_d_false <- lm(math ~ student_teacher_ratio, data = CASchools[CASchools$lunch_d == 0, ])
summary(fit_lunch_d_false)
##
## Call:
## lm(formula = math ~ student_teacher_ratio, data = CASchools[CASchools$lunch_d ==
## 0, ])
##
## Residuals:
## Min 1Q Median 3Q Max
## -36.132 -9.135 -0.450 8.597 40.293
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 708.1286 10.5006 67.44 < 2e-16 ***
## student_teacher_ratio -2.1789 0.5354 -4.07 6.68e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.15 on 208 degrees of freedom
## Multiple R-squared: 0.07376, Adjusted R-squared: 0.0693
## F-statistic: 16.56 on 1 and 208 DF, p-value: 6.683e-05
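To judge whether the two slopes really differ, a rough comparison can be made from the reported estimates and standard errors alone. The sketch below assumes the two subsamples are independent and uses a normal approximation; it is a back-of-the-envelope check, not the test the exercise asks for.

```r
# Estimates and standard errors copied from the two summaries above
b_high <- -1.0138; se_high <- 0.4836   # lunch_d == 1
b_low  <- -2.1789; se_low  <- 0.5354   # lunch_d == 0

# Difference in slopes and its standard error under independence
diff_slopes <- b_high - b_low
se_diff <- sqrt(se_high^2 + se_low^2)

# Approximate z statistic and two-sided p-value
z <- diff_slopes / se_diff
p_value <- 2 * pnorm(-abs(z))

c(difference = diff_slopes, z = z, p_value = p_value)
```

Under these assumptions the gap between the slopes is about 1.6 standard errors, so it is suggestive rather than clear-cut. A single regression with an interaction between student_teacher_ratio and lunch_d would be another way to make the same comparison.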
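Make a scatter plot of the math score against the student-teacher ratio, coloring the points by lunch_d. What do you see?
The outlying points belong mostly to the schools with the higher “lunch_d” value, while the two groups overlap in the middle of the plot.
library(ggplot2)
# y is the math score; `grades` is the grade-span factor, not a test score.
# factor(lunch_d) gives the two groups discrete colors.
ggplot(CASchools, aes(x = student_teacher_ratio, y = math, color = factor(lunch_d))) +
  geom_point() +
  ggtitle("Scatter Plot of Math Score x Student Teacher Ratio by Lunch_d") +
  xlab("Student Teacher Ratio") +
  ylab("Scores") +
  labs(color = "lunch_d")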
Add the regression lines to the plot. For this, specify the color in aes() and do the rest as usual.
library(ggplot2)
# lunch_d must be a factor so that geom_smooth() fits one line per group;
# y is the math score (`grades` is the grade-span factor).
ggplot(CASchools, aes(x = student_teacher_ratio, y = math, color = factor(lunch_d))) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  ggtitle("Scatter Plot of Math Score x Student Teacher Ratio by Lunch_d") +
  xlab("Student Teacher Ratio") +
  ylab("Scores") +
  labs(color = "lunch_d")
## `geom_smooth()` using formula = 'y ~ x'
We will use the dataset traffic2 in the wooldridge package in this exercise.
Regress total accidents (totacc) on unemployment (unem) without an intercept. For this, you need to write the relationship between y and x as y ~ 0 + x instead of the usual y ~ x when you call lm().
library(wooldridge)  # provides the traffic2 dataset
reg1 <- lm(totacc ~ 0 + unem, data = traffic2)
summary(reg1)
##
## Call:
## lm(formula = totacc ~ 0 + unem, data = traffic2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -29361 -5420 4918 14928 29389
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## unem 5484.2 184.5 29.72 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14230 on 107 degrees of freedom
## Multiple R-squared: 0.892, Adjusted R-squared: 0.8909
## F-statistic: 883.3 on 1 and 107 DF, p-value: < 2.2e-16
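Two properties of a regression without an intercept help in reading this output. The slope has the closed form sum(x * y) / sum(x^2), and the R-squared that summary() reports compares the residuals with sum(y^2) rather than with the variation around the mean of y. The sketch below checks both on made-up data (x and y are hypothetical, not taken from traffic2).

```r
# Hypothetical toy data
x <- c(1, 2, 3, 4, 5)
y <- c(12, 9, 15, 11, 14)

# Regression through the origin
fit0 <- lm(y ~ 0 + x)

# Closed-form slope for a regression through the origin
slope_by_hand <- sum(x * y) / sum(x^2)

# Uncentered R-squared: residual sum of squares relative to sum(y^2)
r2_uncentered <- 1 - sum(resid(fit0)^2) / sum(y^2)

all.equal(unname(coef(fit0)), slope_by_hand)
all.equal(summary(fit0)$r.squared, r2_uncentered)
```

Because of this different baseline, the R-squared of 0.892 above is not comparable with the R-squared of a model that includes an intercept.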
library(ggplot2)
ggplot(traffic2, aes(x = unem, y = totacc)) +
  geom_point() +
  geom_abline(intercept = 0, slope = coef(reg1)[1], color = "red") +
  ggtitle("Scatterplot of Total Accidents vs. Unemployment") +
  xlab("Unemployment") +
  ylab("Total Accidents")
mean_totacc <- mean(traffic2$totacc)
mean_totacc
## [1] 42831.26
reg2 <- lm(totacc ~ 1, data = traffic2)
reg2
##
## Call:
## lm(formula = totacc ~ 1, data = traffic2)
##
## Coefficients:
## (Intercept)
## 42831
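The match between the two numbers is not a coincidence: a regression on a constant alone chooses the value that minimizes the sum of squared deviations, which is the sample mean. A minimal check on made-up data (y_toy is hypothetical) is sketched below.

```r
# Hypothetical toy data
y_toy <- c(3, 7, 8, 12)

# Regression on a constant only
fit_mean <- lm(y_toy ~ 1)

# The estimated intercept and the sample mean coincide
coef(fit_mean)[["(Intercept)"]]
mean(y_toy)
```

In the traffic2 output above, the intercept of 42831 is the mean of 42831.26 rounded for printing.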