Libraries and data

# Loading necessary libraries
library(ggplot2)
library(tidyr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(effects)
## Warning: package 'effects' was built under R version 4.3.2
## Loading required package: carData
## lattice theme set by effectsTheme()
## See ?effectsTheme for details.
# Loading the dataset
library(readr)
CASchools <- read_csv("c:/users/dell/downloads/CASchools.csv")
## Rows: 420 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (7): students, teachers, lunch, expenditure, english, read, math
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Adding new variables for average test score and student-teacher ratio

CASchools$average <- (CASchools$read + CASchools$math) / 2
CASchools$ratio <- CASchools$students / CASchools$teachers

Regression with average test score as the dependent variable

# Include ratio, lunch, expenditure, and english as explanatory variables
model <- lm(average ~ ratio + lunch + expenditure + english, data = CASchools)

Summary

# Print the coefficient table and fit statistics
summary(model)
## 
## Call:
## lm(formula = average ~ ratio + lunch + expenditure + english, 
##     data = CASchools)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -33.366  -5.683   0.281   5.288  30.266 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  6.660e+02  9.460e+00  70.398  < 2e-16 ***
## ratio       -2.354e-01  2.983e-01  -0.789     0.43    
## lunch       -5.464e-01  2.119e-02 -25.780  < 2e-16 ***
## expenditure  3.622e-03  8.766e-04   4.132 4.36e-05 ***
## english     -1.283e-01  3.175e-02  -4.042 6.32e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.91 on 415 degrees of freedom
## Multiple R-squared:  0.7834, Adjusted R-squared:  0.7813 
## F-statistic: 375.3 on 4 and 415 DF,  p-value: < 2.2e-16

Interpretation:

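##Coefficients: The ratio coefficient is not statistically significant (t = -0.789, p-value = 0.43). The lunch, expenditure, and english coefficients are each significant at the 5% level; all three have p-values below 0.001. Holding the other regressors fixed, a one-unit increase in lunch is associated with a 0.546-point lower average score, a one-unit increase in english with a 0.128-point lower score, and a one-unit increase in expenditure with a 0.0036-point higher score.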
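The significance statements above can also be read off 95% confidence intervals. The following is a minimal sketch that rebuilds the intervals from the rounded estimates and standard errors printed in the summary table; the object names (`est`, `se`, `ci`) are illustrative and not part of the original analysis.

```r
# Estimates and standard errors copied (rounded) from the summary table
est <- c(ratio = -0.2354, lunch = -0.5464,
         expenditure = 0.003622, english = -0.1283)
se  <- c(ratio = 0.2983, lunch = 0.02119,
         expenditure = 0.0008766, english = 0.03175)
df_resid <- 415  # residual degrees of freedom reported by summary()

# Two-sided 95% critical value of the t distribution
t_crit <- qt(0.975, df = df_resid)

# Interval: estimate +/- critical value * standard error
ci <- cbind(lower = est - t_crit * se, upper = est + t_crit * se)
print(round(ci, 4))

# TRUE where the interval excludes zero (significant at the 5% level)
excludes_zero <- ci[, "lower"] > 0 | ci[, "upper"] < 0
print(excludes_zero)
```

With the fitted object still in the session, `confint(model)` should give the same intervals up to rounding.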

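##Residuals: The residual median (0.281) is close to zero, and both the quartiles (-5.683 and 5.288) and the extremes (-33.366 and 30.266) are roughly symmetric, so the residual distribution shows no strong skew. The residual mean is zero by construction because the model includes an intercept.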

##R-squared: The model explains about 78.34% of the variation in average test scores (adjusted R-squared: 78.13%).
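As a check on the fit statistics, the adjusted R-squared in the summary output can be recovered from the multiple R-squared, the sample size, and the number of slope coefficients. This is a small sketch using the rounded values printed above; the variable names are illustrative.

```r
r2 <- 0.7834  # Multiple R-squared from the summary output
n  <- 420     # number of rows reported when the CSV was read
k  <- 4       # slope coefficients: ratio, lunch, expenditure, english

# Adjusted R-squared penalises R-squared for the number of regressors
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - k - 1)
print(round(adj_r2, 4))  # 0.7813, matching the summary output
```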

##F-statistic: The F-statistic of 375.3 on 4 and 415 degrees of freedom is highly significant (p-value < 2.2e-16), so we reject the null hypothesis that all four slope coefficients are jointly zero.
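The overall F-statistic can be reproduced from R-squared alone. The sketch below uses the rounded R-squared from the summary, so the result agrees with the reported 375.3 only up to rounding; the variable names are illustrative.

```r
r2 <- 0.7834  # Multiple R-squared from the summary output
n  <- 420     # observations
k  <- 4       # slope coefficients being tested jointly

# F = (explained variation per regressor) / (unexplained variation per residual df)
f_stat <- (r2 / k) / ((1 - r2) / (n - k - 1))

# Upper-tail probability under the F(k, n - k - 1) distribution
p_value <- pf(f_stat, df1 = k, df2 = n - k - 1, lower.tail = FALSE)

print(round(f_stat, 1))
print(p_value)
```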

##Why can you not include the math and read test scores as further explanatory variables? The dependent variable is constructed from them: average = (read + math) / 2. Adding both as regressors would turn the regression into an identity: read and math would each receive a coefficient of 0.5, every other coefficient would be driven to zero, and R-squared would equal 1, so the model would say nothing about how ratio, lunch, expenditure, or english relate to test scores. Including only one of the two scores has a similar problem, because that score is a component of the outcome rather than a determinant of it. A secondary concern is multicollinearity: math and reading scores are likely to be highly correlated, which inflates standard errors and makes the individual coefficients unstable and hard to interpret. Where correlated predictors are legitimate regressors, common remedies are to keep only one of them or to combine them into a composite variable, for example with principal component analysis (PCA).
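The effect of regressing an average on its own components can be seen on simulated data. This is an illustrative sketch only: the data frame `sim` and its values are made up and are not the CASchools data, although the column names mirror it.

```r
set.seed(1)
n <- 420

# Simulated (hypothetical) scores; math is built to correlate with read
sim <- data.frame(
  ratio = rnorm(n, mean = 20, sd = 2),
  read  = rnorm(n, mean = 655, sd = 20)
)
sim$math    <- 65 + 0.9 * sim$read + rnorm(n, sd = 8)
sim$average <- (sim$read + sim$math) / 2

# Regress the average on its own components plus another regressor
fit <- lm(average ~ ratio + read + math, data = sim)

# read and math each get 0.5; the intercept and ratio are numerically zero
print(round(coef(fit), 6))

# Residuals vanish up to floating-point error: the fit is an identity
print(max(abs(residuals(fit))))
```

Calling `summary(fit)` on such a model makes R warn about an essentially perfect fit, which is why the sketch inspects the coefficients and residuals directly.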

Null and alternative hypotheses for each of the t-tests of the coefficients.

##Intercept: Null Hypothesis (H0): The intercept equals zero. Alternative Hypothesis (H1): The intercept is not zero.

##Ratio: Null Hypothesis (H0): The ratio variable’s coefficient equals zero. Alternative Hypothesis (H1): The ratio variable’s coefficient is not zero.

##Lunch: Null Hypothesis (H0): The lunch variable’s coefficient equals zero. Alternative Hypothesis (H1): The lunch variable’s coefficient is not zero.

##Expenditure: Null Hypothesis (H0): The expenditure variable’s coefficient equals zero. Alternative Hypothesis (H1): The expenditure variable’s coefficient is not zero.

##English: Null Hypothesis (H0): The English variable’s coefficient equals zero. Alternative Hypothesis (H1): The English variable’s coefficient is not zero.

For each t-test, the null hypothesis states that the corresponding coefficient equals zero (no linear effect, holding the other regressors fixed), while the two-sided alternative states that it differs from zero. Each t statistic is the estimate divided by its standard error, and its p-value is compared with the chosen significance level (5% here). The null hypothesis is rejected for the intercept, lunch, expenditure, and english, and is not rejected for ratio (p-value = 0.43).
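The t-tests described above can be reproduced by hand from the printed coefficient table. The sketch below uses the rounded estimates and standard errors, so the results match the summary only up to rounding; the object names are illustrative.

```r
# Rounded values copied from the coefficient table
est <- c("(Intercept)" = 666.0, ratio = -0.2354, lunch = -0.5464,
         expenditure = 0.003622, english = -0.1283)
se  <- c(9.460, 0.2983, 0.02119, 0.0008766, 0.03175)
df_resid <- 415  # residual degrees of freedom

# t statistic under H0: coefficient = 0
t_stat <- est / se

# Two-sided p-value: probability of a |t| at least this large under H0
p_value <- 2 * pt(abs(t_stat), df = df_resid, lower.tail = FALSE)

# Decision at the 5% significance level
reject_h0 <- p_value < 0.05

print(data.frame(t = round(t_stat, 3), p = signif(p_value, 3), reject_h0))
```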

##Conclusion: Holding the other variables constant, lunch, expenditure, and english each have a statistically significant linear relationship with the average test score, whereas ratio does not have a statistically significant effect in this model.