expenditureObiettivo del laboratorio
Analizzare l’effetto di class size (stratio) e della percentuale di English learners (english) sui risultati scolastici (score) delle scuole californiane.
Il notebook mostra: - la stima OLS della regressione multipla
- il calcolo manuale dei coefficienti e degli standard error
- il controllo dell’eteroschedasticità
- i test F per ipotesi congiunte
- la selezione automatica del modello migliore
library(AER)
data("CASchools")
# Variabili derivate
CASchools$stratio <- with(CASchools, students/teachers)
CASchools$score <- with(CASchools, (math + read)/2)
head(CASchools[, c("score", "stratio", "english")])
Nota:
score= media tra punteggi di matematica e lettura.
stratio= rapporto studenti/insegnanti (proxy della dimensione media della classe).
pairs(CASchools[, c("score", "stratio", "english")])
library(PerformanceAnalytics)
chart.Correlation(CASchools[, c("score", "stratio", "english")])
Le relazioni negative indicano che: - classi più grandi ⇒ punteggi medi più bassi
- percentuale di studenti non madrelingua più alta ⇒ punteggi inferiori
reg_mult <- lm(score ~ stratio + english, data = CASchools)
summary(reg_mult)
##
## Call:
## lm(formula = score ~ stratio + english, data = CASchools)
##
## Residuals:
## Min 1Q Median 3Q Max
## -48.845 -10.240 -0.308 9.815 43.461
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 686.03224 7.41131 92.566 < 2e-16 ***
## stratio -1.10130 0.38028 -2.896 0.00398 **
## english -0.64978 0.03934 -16.516 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.46 on 417 degrees of freedom
## Multiple R-squared: 0.4264, Adjusted R-squared: 0.4237
## F-statistic: 155 on 2 and 417 DF, p-value: < 2.2e-16
stratio: coefficiente negativo (-1.10) → più studenti per insegnante riduce lo score.
english: coefficiente negativo (-0.65) → più studenti non madrelingua riducono i punteggi medi.
- R² = 0.43 → il modello spiega il 43% della variabilità dei risultati.
plot(reg_mult$fitted.values, reg_mult$residuals, pch=19, col="gray",
xlab="Valori Predetti", ylab="Residui")
lines(lowess(reg_mult$fitted.values, reg_mult$residuals), col="red", lwd=2)
abline(h=0, lty=2)
Il grafico fitted vs residuals mostra se la varianza dei residui è costante.
Se i punti formano un “imbuto” → possibile eteroschedasticità.
library(lmtest)
library(sandwich)
coeftest(reg_mult, vcov = vcovHC(reg_mult, type = "HC1"))
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 686.032245 8.728225 78.5993 < 2e-16 ***
## stratio -1.101296 0.432847 -2.5443 0.01131 *
## english -0.649777 0.031032 -20.9391 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Gli standard error robusti correggono la stima in presenza di eteroschedasticità.
I valori dei coefficienti restano invariati, ma gli errori standard cambiano.
X <- as.matrix(cbind(1, CASchools$stratio, CASchools$english))
Y <- as.matrix(CASchools$score)
beta_hat <- solve(t(X) %*% X) %*% (t(X) %*% Y)
beta_hat
## [,1]
## [1,] 686.0322445
## [2,] -1.1012956
## [3,] -0.6497768
Formula base dell’OLS:
\[\hat{\beta} = (X'X)^{-1} X'Y\]
I risultati coincidono con quelli ottenuti conlm().
# Calcolo manuale degli errori standard classici
n <- nrow(CASchools)
k <- length(coef(reg_mult))
u_hat <- residuals(reg_mult)
SER <- sqrt(t(u_hat) %*% u_hat / (n - k))
SER
## [,1]
## [1,] 14.46448
mat_cov <- as.numeric(SER^2) * solve(t(X) %*% X)
sqrt(diag(mat_cov))
## [1] 7.41131160 0.38027827 0.03934254
Gli errori standard derivano da:
\[Var(\hat{\beta}) = \sigma^2 (X'X)^{-1}\]
con \(\sigma^2\) stimata come varianza dei residui.
D <- diag(as.numeric((n/(n-k)) * u_hat^2))
HC_Var_cov <- solve(t(X)%*%X) %*% (t(X)%*%D%*%X) %*% solve(t(X)%*%X)
sqrt(diag(HC_Var_cov))
## [1] 8.72822452 0.43284720 0.03103176
In presenza di eteroschedasticità:
\[Var(\hat{\beta}) = (X'X)^{-1} X' D X (X'X)^{-1}\]
dove \(D\) è la matrice diagonale dei residui al quadrato.
reg_1 <- lm(score ~ stratio, data = CASchools)
summary(reg_1)
##
## Call:
## lm(formula = score ~ stratio, data = CASchools)
##
## Residuals:
## Min 1Q Median 3Q Max
## -47.727 -14.251 0.483 12.822 48.540
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 698.9329 9.4675 73.825 < 2e-16 ***
## stratio -2.2798 0.4798 -4.751 2.78e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 18.58 on 418 degrees of freedom
## Multiple R-squared: 0.05124, Adjusted R-squared: 0.04897
## F-statistic: 22.58 on 1 and 418 DF, p-value: 2.783e-06
t_value <- summary(reg_1)$coef[2,3]
F_value <- t_value^2
p_value <- 1 - pf(F_value, 1, n-2)
c(F_value=F_value, p_value=p_value)
## F_value p_value
## 2.257511e+01 2.783308e-06
Con un solo regressore, la statistica F = t².
L’ipotesi \(H_0: \beta_{stratio} = 0\) viene rifiutata.
reg_2 <- lm(score ~ stratio + english, data = CASchools)
summary(reg_2)$fstatistic
## value numdf dendf
## 155.0137 2.0000 417.0000
\(H_0: \beta_{stratio} = \beta_{english} = 0\)
F ≈ 155, p-value < 0.001 → rifiuto dell’ipotesi nulla.
expenditurereg_3 <- lm(score ~ stratio + english + expenditure, data = CASchools)
summary(reg_3)
##
## Call:
## lm(formula = score ~ stratio + english + expenditure, data = CASchools)
##
## Residuals:
## Min 1Q Median 3Q Max
## -51.340 -10.111 0.293 10.318 43.181
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 649.577947 15.205717 42.719 < 2e-16 ***
## stratio -0.286399 0.480523 -0.596 0.55149
## english -0.656023 0.039106 -16.776 < 2e-16 ***
## expenditure 0.003868 0.001412 2.739 0.00643 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.35 on 416 degrees of freedom
## Multiple R-squared: 0.4366, Adjusted R-squared: 0.4325
## F-statistic: 107.5 on 3 and 416 DF, p-value: < 2.2e-16
La spesa per studente ha effetto positivo e significativo,
mastratioperde significatività → possibile collinearità conexpenditure.
library(olsrr)
lista_var_new <- colnames(CASchools)[!(colnames(CASchools) %in%
c("school","district","county","grades","math","read"))]
model <- lm(score ~ ., data = CASchools[, lista_var_new])
ols_step_forward_p(model, penter = 0.05)
##
##
## Stepwise Summary
## -----------------------------------------------------------------------------
## Step Variable AIC SBC SBIC R2 Adj. R2
## -----------------------------------------------------------------------------
## 0 Base Model 3670.591 3678.672 2475.782 0.00000 0.00000
## 1 lunch 3082.265 3094.386 1889.473 0.75476 0.75418
## 2 income 3036.983 3053.144 1844.418 0.78088 0.77982
## 3 english 2995.389 3015.591 1803.418 0.80248 0.80105
## 4 expenditure 2989.015 3013.257 1797.227 0.80638 0.80451
## 5 calworks 2988.533 3016.815 1796.841 0.80752 0.80519
## -----------------------------------------------------------------------------
##
## Final Model Output
## ------------------
##
## Model Summary
## ----------------------------------------------------------------
## R 0.899 RMSE 8.349
## R-Squared 0.808 MSE 70.720
## Adj. R-Squared 0.805 Coef. Var 1.286
## Pred R-Squared 0.802 AIC 2988.533
## MAE 6.489 SBC 3016.815
## ----------------------------------------------------------------
## RMSE: Root Mean Square Error
## MSE: Mean Square Error
## MAE: Mean Absolute Error
## AIC: Akaike Information Criteria
## SBC: Schwarz Bayesian Criteria
##
## ANOVA
## ------------------------------------------------------------------------
## Sum of
## Squares DF Mean Square F Sig.
## ------------------------------------------------------------------------
## Regression 122831.373 5 24566.275 347.372 0.0000
## Residual 29278.207 414 70.720
## Total 152109.580 419
## ------------------------------------------------------------------------
##
## Parameter Estimates
## --------------------------------------------------------------------------------------------
## model Beta Std. Error Std. Beta t Sig lower upper
## --------------------------------------------------------------------------------------------
## (Intercept) 654.008 3.686 177.411 0.000 646.762 661.255
## lunch -0.373 0.036 -0.531 -10.372 0.000 -0.443 -0.302
## income 0.623 0.088 0.236 7.090 0.000 0.450 0.796
## english -0.211 0.033 -0.203 -6.473 0.000 -0.275 -0.147
## expenditure 0.002 0.001 0.074 3.094 0.002 0.001 0.004
## calworks -0.089 0.057 -0.054 -1.567 0.118 -0.201 0.023
## --------------------------------------------------------------------------------------------
Aggiunge progressivamente le variabili significative.
Modello finale:lunch,income,english,expenditure,calworks.
ols_step_backward_p(model, prem = 0.05)
##
##
## Stepwise Summary
## ----------------------------------------------------------------------------
## Step Variable AIC SBC SBIC R2 Adj. R2
## ----------------------------------------------------------------------------
## 0 Full Model 2993.945 3038.388 1802.523 0.80870 0.80450
## 1 teachers 2991.946 3032.348 1800.475 0.80870 0.80498
## 2 stratio 2990.513 3026.875 1798.972 0.80844 0.80519
## ----------------------------------------------------------------------------
##
## Final Model Output
## ------------------
##
## Model Summary
## ----------------------------------------------------------------
## R 0.899 RMSE 8.329
## R-Squared 0.808 MSE 70.723
## Adj. R-Squared 0.805 Coef. Var 1.286
## Pred R-Squared 0.801 AIC 2990.513
## MAE 6.469 SBC 3026.875
## ----------------------------------------------------------------
## RMSE: Root Mean Square Error
## MSE: Mean Square Error
## MAE: Mean Absolute Error
## AIC: Akaike Information Criteria
## SBC: Schwarz Bayesian Criteria
##
## ANOVA
## ------------------------------------------------------------------------
## Sum of
## Squares DF Mean Square F Sig.
## ------------------------------------------------------------------------
## Regression 122971.846 7 17567.407 248.399 0.0000
## Residual 29137.735 412 70.723
## Total 152109.580 419
## ------------------------------------------------------------------------
##
## Parameter Estimates
## --------------------------------------------------------------------------------------------
## model Beta Std. Error Std. Beta t Sig lower upper
## --------------------------------------------------------------------------------------------
## (Intercept) 654.387 3.740 174.982 0.000 647.036 661.738
## students 0.000 0.000 -0.084 -1.395 0.164 -0.001 0.000
## calworks -0.092 0.058 -0.055 -1.586 0.113 -0.205 0.022
## lunch -0.371 0.036 -0.528 -10.249 0.000 -0.442 -0.300
## computer 0.003 0.003 0.081 1.361 0.174 -0.002 0.009
## expenditure 0.002 0.001 0.071 2.951 0.003 0.001 0.004
## income 0.618 0.089 0.234 6.946 0.000 0.443 0.792
## english -0.207 0.035 -0.198 -5.956 0.000 -0.275 -0.138
## --------------------------------------------------------------------------------------------
Elimina progressivamente le variabili meno rilevanti.
Restano le stesse variabili principali del modello forward.
ols_step_both_p(model, pent = 0.05, prem = 0.05)
##
##
## Stepwise Summary
## ---------------------------------------------------------------------------------
## Step Variable AIC SBC SBIC R2 Adj. R2
## ---------------------------------------------------------------------------------
## 0 Base Model 3670.591 3678.672 2475.782 0.00000 0.00000
## 1 lunch (+) 3082.265 3094.386 1889.473 0.75476 0.75418
## 2 income (+) 3036.983 3053.144 1844.418 0.78088 0.77982
## 3 english (+) 2995.389 3015.591 1803.418 0.80248 0.80105
## 4 expenditure (+) 2989.015 3013.257 1797.227 0.80638 0.80451
## ---------------------------------------------------------------------------------
##
## Final Model Output
## ------------------
##
## Model Summary
## ----------------------------------------------------------------
## R 0.898 RMSE 8.374
## R-Squared 0.806 MSE 70.968
## Adj. R-Squared 0.805 Coef. Var 1.288
## Pred R-Squared 0.802 AIC 2989.015
## MAE 6.488 SBC 3013.257
## ----------------------------------------------------------------
## RMSE: Root Mean Square Error
## MSE: Mean Square Error
## MAE: Mean Absolute Error
## AIC: Akaike Information Criteria
## SBC: Schwarz Bayesian Criteria
##
## ANOVA
## ------------------------------------------------------------------------
## Sum of
## Squares DF Mean Square F Sig.
## ------------------------------------------------------------------------
## Regression 122657.826 4 30664.457 432.088 0.0000
## Residual 29451.754 415 70.968
## Total 152109.580 419
## ------------------------------------------------------------------------
##
## Parameter Estimates
## --------------------------------------------------------------------------------------------
## model Beta Std. Error Std. Beta t Sig lower upper
## --------------------------------------------------------------------------------------------
## (Intercept) 655.049 3.632 180.337 0.000 647.909 662.189
## lunch -0.408 0.028 -0.581 -14.549 0.000 -0.463 -0.353
## income 0.622 0.088 0.236 7.063 0.000 0.449 0.795
## english -0.195 0.031 -0.187 -6.286 0.000 -0.256 -0.134
## expenditure 0.002 0.001 0.068 2.891 0.004 0.001 0.003
## --------------------------------------------------------------------------------------------
Combina i due approcci: aggiunge o rimuove variabili in base al p-value.
Conferma la rilevanza dilunch,income,english,expenditure.
ols_step_best_subset(model)
## Best Subsets Regression
## -------------------------------------------------------------------------------------------
## Model Index Predictors
## -------------------------------------------------------------------------------------------
## 1 lunch
## 2 lunch income
## 3 lunch income english
## 4 lunch expenditure income english
## 5 calworks lunch expenditure income english
## 6 calworks lunch expenditure income english stratio
## 7 students calworks lunch computer expenditure income english
## 8 students calworks lunch computer expenditure income english stratio
## 9 students teachers calworks lunch computer expenditure income english stratio
## -------------------------------------------------------------------------------------------
##
## Subsets Regression Summary
## -----------------------------------------------------------------------------------------------------------------------------------------
## Adj. Pred
## Model R-Square R-Square R-Square C(p) AIC SBIC SBC MSEP FPE HSP APC
## -----------------------------------------------------------------------------------------------------------------------------------------
## 1 0.7548 0.7542 0.7523 109.5996 3082.2654 1889.4726 3094.3862 37481.1109 89.6657 0.2140 0.2476
## 2 0.7809 0.7798 0.7779 55.6380 3036.9829 1844.4180 3053.1439 33570.9336 80.5013 0.1921 0.2223
## 3 0.8025 0.8011 0.7987 11.3368 2995.3893 1803.4177 3015.5906 30334.1198 72.9111 0.1740 0.2013
## 4 0.8064 0.8045 0.8016 4.9796 2989.0151 1797.2274 3013.2566 29807.1127 71.8129 0.1714 0.1983
## 5 0.8075 0.8052 0.8016 4.5343 2988.5329 1796.8410 3016.8146 29703.2191 71.7306 0.1712 0.1981
## 6 0.8079 0.8051 0.8008 5.7187 2989.7017 1798.0741 3022.0237 29716.4461 71.9306 0.1717 0.1986
## 7 0.8084 0.8052 0.8012 6.5550 2990.5129 1798.9716 3026.8752 29704.5548 72.0698 0.1721 0.1990
## 8 0.8087 0.8050 0.8003 8.0011 2991.9458 1800.4753 3032.3484 29736.8279 72.3162 0.1727 0.1997
## 9 0.8087 0.8045 0.7996 10.0000 2993.9447 1802.5230 3038.3875 29809.4563 72.6614 0.1735 0.2006
## -----------------------------------------------------------------------------------------------------------------------------------------
## AIC: Akaike Information Criteria
## SBIC: Sawa's Bayesian Information Criteria
## SBC: Schwarz Bayesian Criteria
## MSEP: Estimated error of prediction, assuming multivariate normality
## FPE: Final Prediction Error
## HSP: Hocking's Sp
## APC: Amemiya Prediction Criteria
Il modello a 4 variabili con miglior AIC è:
score ~ lunch + income + english + expenditure.
- La dimensione della classe (
stratio) perde importanza una volta controllato per fattori socioeconomici.
- Variabili come
income(reddito medio),lunch(povertà) edenglish(studenti non madrelingua) spiegano gran parte della variazione nei punteggi.
- Il modello finale è:
\[score = 655 - 0.41\, lunch + 0.62\, income - 0.20\, english + 0.002\, expenditure\]
con R² ≈ 0.81.
💡 Domande per la discussione in aula
- Perché gli errori robusti sono importanti?
- Quando conviene usare criteri automatici di selezione?
- Cosa implica la perdita di significatività distrationel modello esteso?