Problem Set 2

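library(AER)      # assumed source of the CASchools data set
library(dplyr)    # provides mutate()
data("CASchools")

data <- mutate(CASchools, stu_teach = students / teachers)
data <- mutate(data, avg_score = (read + math) / 2)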

lg1 <- lm(avg_score ~ stu_teach, data = data)
summary(lg1)

## 
## Call:
## lm(formula = avg_score ~ stu_teach, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -47.727 -14.251   0.483  12.822  48.540 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 698.9329     9.4675  73.825  < 2e-16 ***
## stu_teach    -2.2798     0.4798  -4.751 2.78e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 18.58 on 418 degrees of freedom
## Multiple R-squared:  0.05124,    Adjusted R-squared:  0.04897 
## F-statistic: 22.58 on 1 and 418 DF,  p-value: 2.783e-06

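Ans: The coefficient indicates that a one-unit increase in the student-teacher ratio (one more student per teacher) is associated with a drop of about 2.28 points in the average score. The estimate is statistically significant at the 0.1% level, so there is a negative relationship between the two, although the student-teacher ratio alone explains only about 5% of the variation in average scores (R-squared = 0.051).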
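
As a rough check on the precision of this slope, a 95% confidence interval can be rebuilt from the printed estimate and standard error. This is only a sketch in base R that hard-codes the values from the summary above; the names `est`, `se`, `df_resid`, `t_crit` and `ci` are illustrative and not part of the original analysis.

```r
# Values copied from the summary(lg1) output above
est <- -2.2798      # slope on stu_teach
se <- 0.4798        # its standard error
df_resid <- 418     # residual degrees of freedom

# Two-sided 95% interval: estimate +/- t critical value * standard error
t_crit <- qt(0.975, df = df_resid)
ci <- est + c(-1, 1) * t_crit * se
print(round(ci, 2))
```

With the fitted model object available, `confint(lg1, "stu_teach")` should give the same interval up to rounding.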

library(ggplot2)

# Scatter plot of average score against student-teacher ratio with the OLS fit
ggplot(data = data, aes(x = stu_teach, y = avg_score)) +
  geom_point() +
  stat_smooth(method = "lm")

lg2 <- lm(avg_score ~ stu_teach + english + expenditure, data = data)
summary(lg2)

## 
## Call:
## lm(formula = avg_score ~ stu_teach + english + expenditure, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -51.340 -10.111   0.293  10.318  43.181 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 649.577947  15.205717  42.719  < 2e-16 ***
## stu_teach    -0.286399   0.480523  -0.596  0.55149    
## english      -0.656023   0.039106 -16.776  < 2e-16 ***
## expenditure   0.003868   0.001412   2.739  0.00643 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.35 on 416 degrees of freedom
## Multiple R-squared:  0.4366, Adjusted R-squared:  0.4325 
## F-statistic: 107.5 on 3 and 416 DF,  p-value: < 2.2e-16

Ans: Holding the other regressors constant, the average score is negatively associated with both the student-teacher ratio and the percentage of English learners. A one-unit increase in ‘stu_teach’ is associated with a 0.286-point drop in avg_score, and a one-percentage-point increase in English learners with a 0.656-point drop. The relationship with the percentage of English learners is larger in magnitude and far more precisely estimated (t = -16.8). Expenditure per student is positively associated with the avg score; its coefficient (0.0039) looks small because it is measured per unit of expenditure, but it is statistically significant at the 1% level. In this model the variable ‘stu_teach’ is not statistically significant (p = 0.55).
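
The significance statement for ‘stu_teach’ can be reproduced by hand from the printed estimate and standard error. The sketch below uses base R only and hard-codes values from the summary above. The rescaling of the expenditure coefficient assumes that expenditure is measured in dollars per student, which the output itself does not state; all variable names are illustrative.

```r
# Values copied from the summary(lg2) output above
est <- -0.286399    # coefficient on stu_teach
se <- 0.480523      # its standard error
df_resid <- 416     # residual degrees of freedom

t_stat <- est / se
p_value <- 2 * pt(-abs(t_stat), df = df_resid)   # two-sided p-value
print(round(c(t = t_stat, p = p_value), 3))

# Assumption: expenditure is in dollars, so rescale to a 1,000-unit change
exp_per_1000 <- 0.003868 * 1000
print(exp_per_1000)
```

Expressing the expenditure coefficient per 1,000 units makes it easier to compare with the other coefficients, which are per one-unit changes in much smaller-scaled variables.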

Q6 Ans: The student-teacher ratio coefficient is smaller in magnitude in Q5 (-0.286) than in Q3 (-2.28). This is probably because in Q5 we ran a multivariate regression with control variables, the percentage of English learners and expenditure, that also explain changes in the average score. As a result, the estimated effect of the student-teacher ratio shrank. In Q3, ‘english’ and ‘expenditure’ were omitted variables; if they are correlated with ‘stu_teach’, their omission would have exaggerated the estimated effect of ‘stu_teach’ on the average score (omitted variable bias). We can check this by examining the relationship between ‘stu_teach’ and the variables ‘english’ and ‘expenditure’.

check <- lm(stu_teach ~ english + expenditure, data = data)
summary(check)

## 
## Call:
## lm(formula = stu_teach ~ english + expenditure, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.6152 -0.8798  0.0000  0.8414  4.3801 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 29.071055   0.612073  47.496  < 2e-16 ***
## english      0.014909   0.003918   3.806 0.000163 ***
## expenditure -0.001819   0.000113 -16.100  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.463 on 417 degrees of freedom
## Multiple R-squared:  0.405,  Adjusted R-squared:  0.4022 
## F-statistic: 141.9 on 2 and 417 DF,  p-value: < 2.2e-16
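
In the check regression above, both ‘english’ and ‘expenditure’ are significantly related to ‘stu_teach’, which is the condition under which omitting them can bias the ‘stu_teach’ coefficient. The mechanism can be illustrated with a small simulation. This is a sketch on made-up data, not on CASchools: the variables `x`, `z`, `y` and the chosen coefficients are hypothetical, and it uses a single omitted variable for simplicity.

```r
set.seed(1)
n <- 500

# Simulated data: x (regressor of interest) is correlated with z (control)
z <- rnorm(n)
x <- 0.5 * z + rnorm(n)
y <- 1 - 0.3 * x - 2 * z + rnorm(n)

short <- lm(y ~ x)        # omits z
long <- lm(y ~ x + z)     # controls for z
aux <- lm(z ~ x)          # how the omitted variable moves with x

# Omitted variable identity:
# short slope = long slope + (coef on z) * (slope of z on x)
implied <- coef(long)["x"] + coef(long)["z"] * coef(aux)["x"]
print(c(short = unname(coef(short)["x"]), implied = unname(implied)))
```

Because `z` lowers `y` and rises with `x`, the short regression attributes part of the effect of `z` to `x`, making the slope on `x` more negative than in the long regression. Note that the identity uses the regression of the omitted variable on the included one, which is the reverse of the check regression run above.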

model1 <- lm(math ~ computer + expenditure + income + calworks + lunch, data = data)
summary(model1)

## 
## Call:
## lm(formula = math ~ computer + expenditure + income + calworks + 
##     lunch, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -32.921  -6.803   0.237   6.151  33.410 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  6.577e+02  4.429e+00 148.504  < 2e-16 ***
## computer    -4.891e-04  1.154e-03  -0.424    0.672    
## expenditure  1.261e-03  8.662e-04   1.455    0.146    
## income       6.146e-01  1.041e-01   5.901 7.51e-09 ***
## calworks    -4.421e-02  6.515e-02  -0.679    0.498    
## lunch       -4.409e-01  3.214e-02 -13.717  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.09 on 414 degrees of freedom
## Multiple R-squared:  0.7141, Adjusted R-squared:  0.7106 
## F-statistic: 206.8 on 5 and 414 DF,  p-value: < 2.2e-16

model2 <- lm(read ~ computer + expenditure + income + calworks + lunch, data = data)
summary(model2)

## 
## Call:
## lm(formula = read ~ computer + expenditure + income + calworks + 
##     lunch, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -32.533  -5.312   0.005   5.109  36.070 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  6.562e+02  3.864e+00 169.823  < 2e-16 ***
## computer    -3.073e-03  1.007e-03  -3.052  0.00242 ** 
## expenditure  3.656e-03  7.557e-04   4.838 1.85e-06 ***
## income       3.899e-01  9.086e-02   4.291 2.21e-05 ***
## calworks     1.037e-01  5.684e-02   1.825  0.06874 .  
## lunch       -6.045e-01  2.804e-02 -21.556  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.802 on 414 degrees of freedom
## Multiple R-squared:  0.8107, Adjusted R-squared:  0.8084 
## F-statistic: 354.6 on 5 and 414 DF,  p-value: < 2.2e-16

Ans: This assumption is not valid, since ‘computer’ and ‘expenditure’ are statistically significant for reading scores (at the 1% level) but not for math scores, and ‘calworks’ is marginally significant for reading (p = 0.069) but not for math (p = 0.498). Only ‘income’ and ‘lunch’ are significant in both models.
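
The comparison between the two models can be laid out side by side. The sketch below hard-codes the p-values printed in the two summaries above and flags significance at the 5% level. The cutoff and the data frame `pvals` are illustrative choices, and the ‘lunch’ entries use 2e-16 as a stand-in for the reported "< 2e-16".

```r
# p-values copied from summary(model1) (math) and summary(model2) (read)
pvals <- data.frame(
  variable = c("computer", "expenditure", "income", "calworks", "lunch"),
  math = c(0.672, 0.146, 7.51e-09, 0.498, 2e-16),   # lunch: upper bound
  read = c(0.00242, 1.85e-06, 2.21e-05, 0.06874, 2e-16),
  stringsAsFactors = FALSE
)

# Flag significance at the 5% level in each model
pvals$math_sig <- pvals$math < 0.05
pvals$read_sig <- pvals$read < 0.05
print(pvals)

# Variables significant for reading but not for math
print(pvals$variable[pvals$read_sig & !pvals$math_sig])
```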

Mohammad Shaheer

30/09/2021