1. Schools and student-teacher ratio

  1. Load the CASchools dataset used in class and define a new variable containing the number of students per teacher (the student-teacher ratio).

  2. Run a regression of the math score on the student teacher ratio. Interpret the results of the previous question. How do you interpret the intercept? Does it make sense? Explain.

library(AER)  # CASchools ships with the AER package
data("CASchools")

CASchools$student_teacher_ratio <- CASchools$students / CASchools$teachers

fit <- lm(math ~ student_teacher_ratio, data = CASchools)

summary(fit)
## 
## Call:
## lm(formula = math ~ student_teacher_ratio, data = CASchools)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -44.615 -13.374  -0.828  12.728  52.711 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           691.4174     9.3825  73.692  < 2e-16 ***
## student_teacher_ratio  -1.9386     0.4755  -4.077 5.47e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 18.41 on 418 degrees of freedom
## Multiple R-squared:  0.03824,    Adjusted R-squared:  0.03594 
## F-statistic: 16.62 on 1 and 418 DF,  p-value: 5.467e-05
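One way to read the output above is to write out the fitted line and evaluate it at a concrete ratio. The value of 20 students per teacher is an assumption chosen only for illustration; it is not taken from the data:

```latex
\begin{aligned}
\widehat{\text{math}} &= 691.4174 - 1.9386 \times \text{student\_teacher\_ratio} \\
\widehat{\text{math}}\,\big|_{\text{ratio}=20} &= 691.4174 - 1.9386 \times 20 = 652.6454
\end{aligned}
```

Each additional student per teacher is associated with a math score about 1.94 points lower. The intercept, 691.42, is the predicted score at a ratio of zero, that is, a school with no students per teacher. No such school can exist, so the intercept anchors the line rather than describing a meaningful case.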
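  3. Are these estimates causal? I.e. do the assumptions discussed in class hold? In particular, do you think that E(u|x)=0? Explain.
     The estimates are probably not causal. E(u|x)=0 requires that the unobserved determinants of math scores be unrelated to the student-teacher ratio, and that is hard to defend here: factors such as family income or the share of students qualifying for free lunch plausibly affect scores and are systematically related to the ratio. If so, u is correlated with x and the slope picks up more than the effect of the ratio itself.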
  4. Now create a new variable that is true when the share of students qualifying for free lunch (in variable lunch) is higher than the median. Call it lunch_d.
median_lunch <- median(CASchools$lunch)
CASchools$lunch_d <- ifelse(CASchools$lunch > median_lunch, 1, 0)
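The concern behind the zero-conditional-mean question can be illustrated with a small simulation. This is a sketch on made-up data, not on CASchools: the variable names (`poverty`, `ratio`, `score`) and every coefficient are assumptions chosen for illustration. A factor that raises the ratio and lowers scores is left out of the first regression, and the slope on the ratio moves away from its true value of -1.

```r
# Simulated illustration of omitted-variable bias (all numbers are assumptions).
set.seed(1)
n <- 5000

poverty <- rnorm(n)                        # hypothetical omitted factor
ratio   <- 20 + 1.5 * poverty + rnorm(n)   # poorer districts get larger classes
score   <- 650 - 1 * ratio - 10 * poverty + rnorm(n, sd = 5)  # true slope on ratio: -1

short <- lm(score ~ ratio)            # omits poverty, so E(u | ratio) != 0
long  <- lm(score ~ ratio + poverty)  # controls for the omitted factor

slope_short <- unname(coef(short)["ratio"])
slope_long  <- unname(coef(long)["ratio"])
print(c(short = slope_short, long = slope_long))
```

In this setup the short regression attributes part of the effect of poverty to the ratio, so its slope is far more negative than -1, while the long regression recovers a value close to -1. A variable such as lunch could play a comparable role in CASchools, which is one reason to compare schools above and below its median.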
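  5. Run two regressions of math score on the student teacher ratio, one for schools where lunch_d is true and the other for those for which it is false. Interpret your results.
     In both subsamples the slope is negative and statistically significant at the 5% level, but its size differs: about -1.01 (p = 0.037) for schools with lunch_d equal to 1 and about -2.18 (p < 0.001) for schools with lunch_d equal to 0. The R-squared values are small (about 0.02 and 0.07), so the ratio explains little of the variation in math scores; the association is statistically significant rather than strong.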
fit_lunch_d_true <- lm(math ~ student_teacher_ratio, data = CASchools[CASchools$lunch_d == 1, ])
summary(fit_lunch_d_true)
## 
## Call:
## lm(formula = math ~ student_teacher_ratio, data = CASchools[CASchools$lunch_d == 
##     1, ])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -34.031 -10.016  -0.135   8.285  37.819 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           661.1314     9.5988  68.877   <2e-16 ***
## student_teacher_ratio  -1.0138     0.4836  -2.096   0.0373 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.64 on 208 degrees of freedom
## Multiple R-squared:  0.02069,    Adjusted R-squared:  0.01598 
## F-statistic: 4.394 on 1 and 208 DF,  p-value: 0.03727
fit_lunch_d_false <- lm(math ~ student_teacher_ratio, data = CASchools[CASchools$lunch_d == 0, ])
summary(fit_lunch_d_false)
## 
## Call:
## lm(formula = math ~ student_teacher_ratio, data = CASchools[CASchools$lunch_d == 
##     0, ])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -36.132  -9.135  -0.450   8.597  40.293 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           708.1286    10.5006   67.44  < 2e-16 ***
## student_teacher_ratio  -2.1789     0.5354   -4.07 6.68e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.15 on 208 degrees of freedom
## Multiple R-squared:  0.07376,    Adjusted R-squared:  0.0693 
## F-statistic: 16.56 on 1 and 208 DF,  p-value: 6.683e-05
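The two subsample regressions can also be obtained from a single model in which the ratio is interacted with the lunch indicator. The sketch below assumes the AER package is installed; the object names `fit_int`, `slope_low` and `slope_high` are introduced here for illustration, and lunch_d is rebuilt as a logical so that the block runs on its own.

```r
library(AER)   # assumed source of CASchools
data("CASchools")

CASchools$student_teacher_ratio <- CASchools$students / CASchools$teachers
CASchools$lunch_d <- CASchools$lunch > median(CASchools$lunch)  # logical version

# Fully interacted model: separate intercept and slope for each lunch group
fit_int <- lm(math ~ student_teacher_ratio * lunch_d, data = CASchools)
b <- coef(fit_int)

# Slope for lunch_d == FALSE is the main effect; for TRUE, add the interaction
slope_low  <- unname(b["student_teacher_ratio"])
slope_high <- unname(b["student_teacher_ratio"] + b["student_teacher_ratio:lunch_dTRUE"])
print(c(low_lunch = slope_low, high_lunch = slope_high))
```

The two slopes should match the subsample estimates reported above (about -2.18 and -1.01). The coefficient on the interaction term is the difference between them, and its t-test in summary(fit_int) gives a rough check of whether the slopes differ, under the added assumption that both groups share the same error variance.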
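  6. Make a scatter plot of math score and student teacher ratio using a different color according to lunch_d. What do you see?
     Schools with lunch_d equal to 1 tend to sit lower on the math-score axis at a given student-teacher ratio, which is consistent with the lower intercept estimated for that group, although the two groups overlap.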
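library(ggplot2)
# Plot math (not grades) on the y-axis; factor() makes lunch_d a discrete colour
ggplot(CASchools, aes(x = student_teacher_ratio, y = math, color = factor(lunch_d))) +
  geom_point() +
  ggtitle("Scatter Plot of Math Score x Student Teacher Ratio by Lunch_d") +
  xlab("Student Teacher Ratio") +
  ylab("Math Score") +
  labs(color = "lunch_d")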

  7. Can you add a regression line for each color to the previous plot? Tip: Just add the color variable in aes() and do the rest as usual.
library(ggplot2)
# factor(lunch_d) gives geom_smooth() two groups, so it fits one line per colour
ggplot(CASchools, aes(x = student_teacher_ratio, y = math, color = factor(lunch_d))) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  ggtitle("Scatter Plot of Math Score x Student Teacher Ratio by Lunch_d") +
  xlab("Student Teacher Ratio") +
  ylab("Math Score") +
  labs(color = "lunch_d")
## `geom_smooth()` using formula = 'y ~ x'

2. Playing with regressions

We will use the dataset traffic2 in the wooldridge package in this exercise.

  1. Run a regression of the total number of accidents (totacc) on unemployment (unem) without an intercept. For this, you need to write the relationship between y and x as y ~ 0 + x instead of the usual y~x when you call lm().
library(wooldridge)  # provides the traffic2 dataset
data("traffic2")
reg1 <- lm(totacc ~ 0 + unem, data = traffic2)
summary(reg1)
## 
## Call:
## lm(formula = totacc ~ 0 + unem, data = traffic2)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -29361  -5420   4918  14928  29389 
## 
## Coefficients:
##      Estimate Std. Error t value Pr(>|t|)    
## unem   5484.2      184.5   29.72   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14230 on 107 degrees of freedom
## Multiple R-squared:  0.892,  Adjusted R-squared:  0.8909 
## F-statistic: 883.3 on 1 and 107 DF,  p-value: < 2.2e-16
  2. Make a scatterplot of the data, adding the regression above (without intercept). What kind of restriction are you imposing when you remove the intercept (or, equivalently, force it to be zero)?
     Removing the intercept forces the regression line through the origin: the model predicts that totacc is zero when unem is zero. The slope then has to account for the overall level of accidents as well as their relationship with unemployment.
library(ggplot2)
ggplot(traffic2, aes(x = unem, y = totacc)) +
  geom_point() +
  geom_abline(intercept = 0, slope = coef(reg1)[1], color = "red") +
  ggtitle("Scatterplot of Total Accidents vs. Unemployment") +
  xlab("Unemployment") +
  ylab("Total Accidents")
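For a line through the origin, least squares has a closed form: the slope is the sum of x times y divided by the sum of x squared. The sketch below checks this against lm(), assuming the wooldridge package is installed; `slope_manual` and `r2_uncentered` are names introduced here for illustration.

```r
library(wooldridge)   # assumed source of traffic2
data("traffic2")

x <- traffic2$unem
y <- traffic2$totacc

# Closed-form OLS slope when the intercept is forced to zero
slope_manual <- sum(x * y) / sum(x^2)

reg1 <- lm(totacc ~ 0 + unem, data = traffic2)
print(c(manual = slope_manual, lm = unname(coef(reg1))))

# Without an intercept, summary() reports an uncentered R-squared:
# it compares residuals with sum(y^2) rather than sum((y - mean(y))^2)
r2_uncentered <- 1 - sum(resid(reg1)^2) / sum(y^2)
print(c(manual = r2_uncentered, summary = summary(reg1)$r.squared))
```

This is why the R-squared of 0.892 reported for the no-intercept regression cannot be compared with the R-squared of a model that includes an intercept.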

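  3. Now compute the mean of the total number of accidents, and run a regression of the total number of accidents on no regressor (y~1 in the lm language). How do these two numbers compare? What does the regression without regressors do?
     The two numbers are the same: the mean is 42831.26 and the estimated intercept is 42831, which is that mean printed with fewer digits. A regression on a constant alone estimates the sample mean of the dependent variable, since the mean is the constant that minimizes the sum of squared residuals.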
mean_totacc <- mean(traffic2$totacc)
mean_totacc
## [1] 42831.26
reg2 <- lm(totacc ~ 1, data = traffic2)
reg2
## 
## Call:
## lm(formula = totacc ~ 1, data = traffic2)
## 
## Coefficients:
## (Intercept)  
##       42831
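The equality between the intercept and the mean follows from the least squares problem itself. With only a constant b to choose, and writing y_i for totacc in observation i and n for the number of observations (notation introduced here for illustration):

```latex
\min_{b} \sum_{i=1}^{n} (y_i - b)^2
\;\Longrightarrow\;
-2 \sum_{i=1}^{n} (y_i - \hat{b}) = 0
\;\Longrightarrow\;
\hat{b} = \frac{1}{n} \sum_{i=1}^{n} y_i = \bar{y}
```

So lm(totacc ~ 1) returns the sample mean, 42831.26, which the printed output rounds to 42831.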