model <- lm(average_score ~ income, data = CASchools)
summary(model)
Call:
lm(formula = average_score ~ income, data = CASchools)
Residuals:
    Min      1Q  Median      3Q     Max
-39.574  -8.803   0.603   9.032  32.530
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 625.3836     1.5324  408.11   <2e-16 ***
income        1.8785     0.0905   20.76   <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 13.39 on 418 degrees of freedom
Multiple R-squared: 0.5076, Adjusted R-squared: 0.5064
F-statistic: 430.8 on 1 and 418 DF, p-value: < 2.2e-16
plot(CASchools$income, CASchools$average_score,
     main = "Regression of Average Score on District Average Income",
     xlab = "District Average Income", ylab = "Average Score")
abline(model, col = "red", lwd = 2)
The summary of the linear regression model shows a statistically significant positive relationship between district average income and the average score (p < 2e-16 for the income coefficient), and the fitted line in the plot shows the same upward trend. When district income increases by one unit, the predicted average score increases by about 1.88 points. The intercept of 625.38 is the predicted average score when district income is zero; since a district average income of zero is not realistic, the intercept mainly anchors the line rather than describing an actual district.
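To make the slope and intercept concrete, the fitted line can be evaluated by hand. The sketch below is illustrative only: it copies the two coefficients from the summary output, and the helper name `predict_linear` and the income values 10 and 20 are assumptions chosen for the example, not values from the analysis.

```r
# Coefficients copied from the summary(model) output above
b0 <- 625.3836  # intercept
b1 <- 1.8785    # slope on income

# Hypothetical helper: predicted average score for a given district income
predict_linear <- function(income) b0 + b1 * income

# Illustrative income values (assumed, not taken from the data)
pred <- predict_linear(c(10, 20))
print(pred)

# A 10-unit difference in income corresponds to 10 * b1 points
print(diff(pred))
```

With the fitted model in the session, `predict(model, newdata = data.frame(income = c(10, 20)))` should give the same values up to rounding of the coefficients.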
The residuals range from -39.57 to 32.53, with a median of 0.60. The median is close to zero, so the model overestimates and underestimates about equally often, but individual predictions can miss by 30 points or more.
The model’s R-squared value is 0.5076, meaning about 50.76% of the variation in the average score is explained by district income. This is a moderate value, which suggests that income plays an important role in predicting the average score, but other factors also influence the scores. The F-statistic is 430.8, and its p-value is very small, showing that the model is statistically significant.
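The R-squared and the F-statistic are two views of the same fit. As a rough check using the rounded values printed in the summary (one predictor, 418 residual degrees of freedom), the standard identity gives:

```latex
F = \frac{R^2 / 1}{(1 - R^2) / 418}
  = \frac{0.5076}{0.4924} \times 418 \approx 430.9,
\qquad
t_{\text{income}}^2 = 20.76^2 \approx 431.0
```

Both agree with the reported 430.8 up to rounding; in a regression with a single predictor, the F-statistic is the square of the slope's t value.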
The residual standard error is 13.39, so a typical prediction misses the observed average score by roughly 13.39 points.
Because the model explains only about half of the variation in average scores, district income is important, but other factors that the model does not account for also affect the scores. The spread of the residuals also suggests that a straight line may not fully capture the relationship between income and the average score, so the model overestimates some districts and underestimates others.
model_quad <- lm(average_score ~ income + I(income^2), data = CASchools)
summary(model_quad)
Call:
lm(formula = average_score ~ income + I(income^2), data = CASchools)
Residuals:
    Min      1Q  Median      3Q     Max
-44.416  -9.048   0.440   8.347  31.639
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 607.30174    3.04622 199.362  < 2e-16 ***
income        3.85099    0.30426  12.657  < 2e-16 ***
I(income^2)  -0.04231    0.00626  -6.758 4.71e-11 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 12.72 on 417 degrees of freedom
Multiple R-squared: 0.5562, Adjusted R-squared: 0.554
F-statistic: 261.3 on 2 and 417 DF, p-value: < 2.2e-16
plot(CASchools$income, CASchools$average_score,
     main = "Quadratic Regression of Average Score on District Average Income",
     xlab = "District Average Income", ylab = "Average Score")
curve(predict(model_quad, newdata = data.frame(income = x)),
      from = min(CASchools$income), to = max(CASchools$income),
      col = "red", lwd = 2, add = TRUE)
The summary of the quadratic model shows that both income and income squared are significant. The intercept is 607.30. The income coefficient of 3.85 is the slope of the fitted curve at an income of zero, not a constant effect: because of the squared term, the effect of a one-unit increase in income depends on the current income level. The negative coefficient on income squared (-0.0423) means that this effect shrinks as income increases, so the relationship is non-linear.
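How the effect of income changes along the curve can be worked out from the two slope coefficients. The following is a minimal sketch that copies the estimates from the output; the helper `marginal_effect` and the income levels 10, 20, and 40 are assumptions made for illustration.

```r
# Coefficients copied from the summary(model_quad) output above
b1 <- 3.85099   # income
b2 <- -0.04231  # I(income^2)

# Slope of the fitted curve at a given income:
# d(score)/d(income) = b1 + 2 * b2 * income
marginal_effect <- function(income) b1 + 2 * b2 * income

# Illustrative income levels (assumed for the example)
effects <- marginal_effect(c(10, 20, 40))
print(round(effects, 3))

# Income at which the fitted parabola peaks (slope equals zero)
turning_point <- -b1 / (2 * b2)
print(round(turning_point, 1))
```

Under these estimates, the slope falls from about 3.0 points per unit at an income of 10 to about 0.5 at an income of 40, and the fitted curve peaks near an income of 45.5.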
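The R-squared value is 0.5562, higher than the linear model’s 0.5076. Because R-squared never decreases when a predictor is added, the adjusted R-squared is the fairer comparison, and it also rises (from 0.5064 to 0.554), so the quadratic model explains more of the variation in the average score. The residual standard error is slightly lower at 12.72 (versus 13.39), showing a better fit, though the improvement is not drastic.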
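Whether the gain in R-squared is more than noise can be checked with the standard F-test for one added term, computed here from the rounded R-squared values in the two summaries (417 residual degrees of freedom in the quadratic model):

```latex
F = \frac{(R^2_{\text{quad}} - R^2_{\text{lin}}) / 1}{(1 - R^2_{\text{quad}}) / 417}
  = \frac{0.5562 - 0.5076}{0.4438} \times 417 \approx 45.7,
\qquad
t_{\text{income}^2}^2 = (-6.758)^2 \approx 45.7
```

The two numbers match because, for a single added term, this F-test is equivalent to the t-test on the squared-income coefficient, which is significant (p = 4.71e-11). In R, `anova(model, model_quad)` reports the same comparison.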
Overall, the quadratic model fits better than the linear one because it explains more of the variance and captures the curvilinear relationship between income and average score. The plot shows the same improvement in fit.
Call:
lm(formula = average_score ~ perc_english + student_teacher_ratio +
perc_english:student_teacher_ratio, data = CASchools)
Residuals:
    Min      1Q  Median      3Q     Max
-38.895 -14.560   0.646  12.429  45.417
Coefficients:
                                    Estimate Std. Error t value Pr(>|t|)
(Intercept)                        705.42328   10.56055  66.798  < 2e-16 ***
perc_english                        -1.02313    2.33843  -0.438    0.662
student_teacher_ratio               -2.45797    0.53477  -4.596 5.71e-06 ***
perc_english:student_teacher_ratio  -0.02427    0.12083  -0.201    0.841
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 17.55 on 416 degrees of freedom
Multiple R-squared: 0.158, Adjusted R-squared: 0.152
F-statistic: 26.03 on 3 and 416 DF, p-value: 1.898e-15
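The intercept is 705.42, the predicted average score when all independent variables are zero. Because the model includes an interaction, each main-effect coefficient describes the effect of that variable when the other one is zero. The percentage of English learners has a coefficient of -1.02, which is not statistically significant (p-value = 0.662). The student-teacher ratio has a coefficient of -2.46, which is statistically significant (p-value = 5.71e-06): each additional student per teacher is associated with an average score about 2.46 points lower. The interaction term is not significant either (p-value = 0.841), so there is no evidence that the effect of the student-teacher ratio changes with the percentage of English learners.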
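In a model with an interaction, the slope on the student-teacher ratio is not a single number but depends on the percentage of English learners. A minimal sketch using the reported estimates follows; the helper `str_slope` and the values 0, 10, and 20 for `perc_english` are assumptions chosen for illustration.

```r
# Coefficients copied from the interaction-model output above
b_str <- -2.45797  # student_teacher_ratio
b_int <- -0.02427  # perc_english:student_teacher_ratio

# Slope on student_teacher_ratio at a given value of perc_english
str_slope <- function(perc_english) b_str + b_int * perc_english

# Illustrative values of perc_english (assumed for the example)
slopes <- str_slope(c(0, 10, 20))
print(round(slopes, 3))
```

The slope changes only slightly across these values, and because the interaction coefficient is far from significant, the data do not support reading much into that change.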
The R-squared is 0.158, meaning the model explains only 15.8% of the variation in the average score. The F-statistic (26.03, p-value = 1.898e-15) indicates that the model is significant overall. However, the residual standard error of 17.55 is larger than in the income models (13.39 and 12.72), indicating a higher level of prediction error.
In summary, the student-teacher ratio is significantly associated with the average score, while the percentage of English learners and the interaction term are not. The model explains only a small portion of the variation in the average score.
Call:
lm(formula = average_score ~ student_teacher_dummy, data = CASchools)
Residuals:
    Min      1Q  Median      3Q     Max
-50.435 -14.071  -0.285  12.778  49.565
Coefficients:
                      Estimate Std. Error t value Pr(>|t|)
(Intercept)            657.185      1.202  546.62  < 2e-16 ***
student_teacher_dummy   -7.185      1.852   -3.88 0.000121 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 18.74 on 418 degrees of freedom
Multiple R-squared: 0.03476, Adjusted R-squared: 0.03245
F-statistic: 15.05 on 1 and 418 DF, p-value: 0.0001215
The results of the dummy regression model show that the intercept is 657.19, which is the mean average score when the student-teacher ratio is 20 or below (dummy variable = 0). The coefficient for the dummy variable is -7.19, so districts with a student-teacher ratio above 20 (dummy variable = 1) score 7.19 points lower on average, for a predicted score of 650.00. This difference is statistically significant, with a p-value of 0.000121, which is much smaller than 0.05.
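A regression on a single dummy variable is a comparison of two group means. The sketch below illustrates this on simulated data, not on the CASchools data: the variable names, sample size, and distribution parameters are all assumptions made for the example.

```r
# Simulated data for illustration only; this is NOT the CASchools data
set.seed(1)
ratio <- runif(200, min = 14, max = 26)          # hypothetical student-teacher ratios
dummy <- as.numeric(ratio > 20)                  # 1 if ratio is above 20, else 0
score <- 657 - 7 * dummy + rnorm(200, sd = 18)   # hypothetical scores

fit <- lm(score ~ dummy)
group_means <- tapply(score, dummy, mean)        # mean score for dummy = 0 and dummy = 1

# Intercept equals the mean of the dummy = 0 group;
# intercept + slope equals the mean of the dummy = 1 group
print(coef(fit))
print(group_means)
```

On the same simulated data, `t.test(score ~ dummy, var.equal = TRUE)` should report the same t value in absolute terms as the dummy coefficient, since both are the pooled two-sample comparison.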
The R-squared value is 0.0348, meaning the model explains only 3.48% of the variation in the average score, so the dummy variable alone explains very little. The residuals range from -50.44 to 49.57, and the residual standard error of 18.74 indicates a high level of prediction error.
In summary, the average score is lower when the student-teacher ratio is above 20, but the model explains only a small portion of the variation in the average score, so other factors likely influence the result.
The regression model shows that the intercept is 660.8, meaning the predicted average score is 660.8 when all variables are zero. The student-teacher ratio dummy has a coefficient of -1.08 but is not significant. The percentage of students qualifying for reduced-price lunch is significantly associated with lower scores: each additional percentage point lowers the predicted average score by 0.59 points. The number of computers and the interaction term with English learners are not statistically significant. Expenditure per student has a significant positive effect: each additional dollar spent raises the predicted score by 0.0039 points, or about 3.9 points per additional 1,000 dollars.
The model explains 77.71% of the variation in average scores, and the overall model is statistically significant. In conclusion, the statistically significant predictors of the average score are the percentage of students qualifying for reduced-price lunch and expenditure per student. The student-teacher ratio dummy and the number of computers do not have a significant effect once the other variables are included.