1 1. Dati e costruzione delle variabili
2 2. Analisi preliminare: relazioni tra le variabili
3 3. Regressione multipla OLS
4 4. Controllo dell’assunzione di omoschedasticità
- 4.1 Errori standard robusti (White, tipo HC1)
5 5. Calcolo manuale dei coefficienti OLS
6 6. Calcolo manuale degli errori standard classici
7 7. Calcolo manuale degli errori standard robusti
8 8. Test F: ipotesi congiunte
- 8.1 Caso con un solo regressore
- 8.2 Caso con due regressori
9 9. Aggiunta di una terza variabile: expenditure
10 10. Selezione del modello ottimale
11 11. Conclusioni

Obiettivo del laboratorio
Analizzare l’effetto di class size (stratio) e della percentuale di English learners (english) sui risultati scolastici (score) delle scuole californiane.
Il notebook mostra: - la stima OLS della regressione multipla
- il calcolo manuale dei coefficienti e degli standard error
- il controllo dell’eteroschedasticità
- i test F per ipotesi congiunte
- la selezione automatica del modello migliore

1 1. Dati e costruzione delle variabili

library(AER)
data("CASchools")

# Variabili derivate
CASchools$stratio <- with(CASchools, students/teachers)
CASchools$score <- with(CASchools, (math + read)/2)

head(CASchools[, c("score", "stratio", "english")])

Nota:
score = media tra punteggi di matematica e lettura.
stratio = rapporto studenti/insegnanti (proxy della dimensione media della classe).

2 2. Analisi preliminare: relazioni tra le variabili

pairs(CASchools[, c("score", "stratio", "english")])

library(PerformanceAnalytics)
chart.Correlation(CASchools[, c("score", "stratio", "english")])

Le relazioni negative indicano che: - classi più grandi ⇒ punteggi medi più bassi
- percentuale di studenti non madrelingua più alta ⇒ punteggi inferiori

3 3. Regressione multipla OLS

reg_mult <- lm(score ~ stratio + english, data = CASchools)
summary(reg_mult)

## 
## Call:
## lm(formula = score ~ stratio + english, data = CASchools)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -48.845 -10.240  -0.308   9.815  43.461 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 686.03224    7.41131  92.566  < 2e-16 ***
## stratio      -1.10130    0.38028  -2.896  0.00398 ** 
## english      -0.64978    0.03934 -16.516  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.46 on 417 degrees of freedom
## Multiple R-squared:  0.4264, Adjusted R-squared:  0.4237 
## F-statistic:   155 on 2 and 417 DF,  p-value: < 2.2e-16

stratio: coefficiente negativo (-1.10) → più studenti per insegnante riduce lo score.

english: coefficiente negativo (-0.65) → più studenti non madrelingua riducono i punteggi medi.

R² = 0.43 → il modello spiega il 43% della variabilità dei risultati.

4 4. Controllo dell’assunzione di omoschedasticità

plot(reg_mult$fitted.values, reg_mult$residuals, pch=19, col="gray",
     xlab="Valori Predetti", ylab="Residui")
lines(lowess(reg_mult$fitted.values, reg_mult$residuals), col="red", lwd=2)
abline(h=0, lty=2)

Il grafico fitted vs residuals mostra se la varianza dei residui è costante.
Se i punti formano un “imbuto” → possibile eteroschedasticità.

4.1 Errori standard robusti (White, tipo HC1)

library(lmtest)
library(sandwich)
coeftest(reg_mult, vcov = vcovHC(reg_mult, type = "HC1"))

## 
## t test of coefficients:
## 
##               Estimate Std. Error  t value Pr(>|t|)    
## (Intercept) 686.032245   8.728225  78.5993  < 2e-16 ***
## stratio      -1.101296   0.432847  -2.5443  0.01131 *  
## english      -0.649777   0.031032 -20.9391  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Gli standard error robusti correggono la stima in presenza di eteroschedasticità.
I valori dei coefficienti restano invariati, ma gli errori standard cambiano.

5 5. Calcolo manuale dei coefficienti OLS

X <- as.matrix(cbind(1, CASchools$stratio, CASchools$english))
Y <- as.matrix(CASchools$score)

beta_hat <- solve(t(X) %*% X) %*% (t(X) %*% Y)
beta_hat

##             [,1]
## [1,] 686.0322445
## [2,]  -1.1012956
## [3,]  -0.6497768

Formula base dell’OLS:
\[\hat{\beta} = (X'X)^{-1} X'Y\]
I risultati coincidono con quelli ottenuti con lm().

6 6. Calcolo manuale degli errori standard classici

# Calcolo manuale degli errori standard classici
n <- nrow(CASchools)
k <- length(coef(reg_mult))
u_hat <- residuals(reg_mult)

SER <- sqrt(t(u_hat) %*% u_hat / (n - k))
SER

##          [,1]
## [1,] 14.46448

mat_cov <- as.numeric(SER^2) * solve(t(X) %*% X)
sqrt(diag(mat_cov))

## [1] 7.41131160 0.38027827 0.03934254

Gli errori standard derivano da:
\[Var(\hat{\beta}) = \sigma^2 (X'X)^{-1}\]
con \(\sigma^2\) stimata come varianza dei residui.

7 7. Calcolo manuale degli errori standard robusti

D <- diag(as.numeric((n/(n-k)) * u_hat^2))
HC_Var_cov <- solve(t(X)%*%X) %*% (t(X)%*%D%*%X) %*% solve(t(X)%*%X)
sqrt(diag(HC_Var_cov))

## [1] 8.72822452 0.43284720 0.03103176

In presenza di eteroschedasticità:
\[Var(\hat{\beta}) = (X'X)^{-1} X' D X (X'X)^{-1}\]
dove \(D\) è la matrice diagonale dei residui al quadrato.

8 8. Test F: ipotesi congiunte

8.1 Caso con un solo regressore

reg_1 <- lm(score ~ stratio, data = CASchools)
summary(reg_1)

## 
## Call:
## lm(formula = score ~ stratio, data = CASchools)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -47.727 -14.251   0.483  12.822  48.540 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 698.9329     9.4675  73.825  < 2e-16 ***
## stratio      -2.2798     0.4798  -4.751 2.78e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 18.58 on 418 degrees of freedom
## Multiple R-squared:  0.05124,    Adjusted R-squared:  0.04897 
## F-statistic: 22.58 on 1 and 418 DF,  p-value: 2.783e-06

t_value <- summary(reg_1)$coef[2,3]
F_value <- t_value^2
p_value <- 1 - pf(F_value, 1, n-2)
c(F_value=F_value, p_value=p_value)

##      F_value      p_value 
## 2.257511e+01 2.783308e-06

Con un solo regressore, la statistica F = t².
L’ipotesi \(H_0: \beta_{stratio} = 0\) viene rifiutata.

8.2 Caso con due regressori

reg_2 <- lm(score ~ stratio + english, data = CASchools)
summary(reg_2)$fstatistic

##    value    numdf    dendf 
## 155.0137   2.0000 417.0000

\(H_0: \beta_{stratio} = \beta_{english} = 0\)
F ≈ 155, p-value < 0.001 → rifiuto dell’ipotesi nulla.

9 9. Aggiunta di una terza variabile: `expenditure`

reg_3 <- lm(score ~ stratio + english + expenditure, data = CASchools)
summary(reg_3)

## 
## Call:
## lm(formula = score ~ stratio + english + expenditure, data = CASchools)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -51.340 -10.111   0.293  10.318  43.181 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 649.577947  15.205717  42.719  < 2e-16 ***
## stratio      -0.286399   0.480523  -0.596  0.55149    
## english      -0.656023   0.039106 -16.776  < 2e-16 ***
## expenditure   0.003868   0.001412   2.739  0.00643 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.35 on 416 degrees of freedom
## Multiple R-squared:  0.4366, Adjusted R-squared:  0.4325 
## F-statistic: 107.5 on 3 and 416 DF,  p-value: < 2.2e-16

La spesa per studente ha effetto positivo e significativo,
ma stratio perde significatività → possibile collinearità con expenditure.

10 10. Selezione del modello ottimale

library(olsrr)

lista_var_new <- colnames(CASchools)[!(colnames(CASchools) %in%
   c("school","district","county","grades","math","read"))]

model <- lm(score ~ ., data = CASchools[, lista_var_new])

10.1 Metodo Forward

ols_step_forward_p(model, penter = 0.05)

## 
## 
##                                Stepwise Summary                                
## -----------------------------------------------------------------------------
## Step    Variable         AIC         SBC         SBIC        R2       Adj. R2 
## -----------------------------------------------------------------------------
##  0      Base Model     3670.591    3678.672    2475.782    0.00000    0.00000 
##  1      lunch          3082.265    3094.386    1889.473    0.75476    0.75418 
##  2      income         3036.983    3053.144    1844.418    0.78088    0.77982 
##  3      english        2995.389    3015.591    1803.418    0.80248    0.80105 
##  4      expenditure    2989.015    3013.257    1797.227    0.80638    0.80451 
##  5      calworks       2988.533    3016.815    1796.841    0.80752    0.80519 
## -----------------------------------------------------------------------------
## 
## Final Model Output 
## ------------------
## 
##                          Model Summary                           
## ----------------------------------------------------------------
## R                       0.899       RMSE                  8.349 
## R-Squared               0.808       MSE                  70.720 
## Adj. R-Squared          0.805       Coef. Var             1.286 
## Pred R-Squared          0.802       AIC                2988.533 
## MAE                     6.489       SBC                3016.815 
## ----------------------------------------------------------------
##  RMSE: Root Mean Square Error 
##  MSE: Mean Square Error 
##  MAE: Mean Absolute Error 
##  AIC: Akaike Information Criteria 
##  SBC: Schwarz Bayesian Criteria 
## 
##                                  ANOVA                                   
## ------------------------------------------------------------------------
##                   Sum of                                                
##                  Squares         DF    Mean Square       F         Sig. 
## ------------------------------------------------------------------------
## Regression    122831.373          5      24566.275    347.372    0.0000 
## Residual       29278.207        414         70.720                      
## Total         152109.580        419                                     
## ------------------------------------------------------------------------
## 
##                                     Parameter Estimates                                      
## --------------------------------------------------------------------------------------------
##       model       Beta    Std. Error    Std. Beta       t        Sig       lower      upper 
## --------------------------------------------------------------------------------------------
## (Intercept)    654.008         3.686                 177.411    0.000    646.762    661.255 
##       lunch     -0.373         0.036       -0.531    -10.372    0.000     -0.443     -0.302 
##      income      0.623         0.088        0.236      7.090    0.000      0.450      0.796 
##     english     -0.211         0.033       -0.203     -6.473    0.000     -0.275     -0.147 
## expenditure      0.002         0.001        0.074      3.094    0.002      0.001      0.004 
##    calworks     -0.089         0.057       -0.054     -1.567    0.118     -0.201      0.023 
## --------------------------------------------------------------------------------------------

Aggiunge progressivamente le variabili significative.
Modello finale: lunch, income, english, expenditure, calworks.

10.2 Metodo Backward

ols_step_backward_p(model, prem = 0.05)

## 
## 
##                                Stepwise Summary                               
## ----------------------------------------------------------------------------
## Step    Variable        AIC         SBC         SBIC        R2       Adj. R2 
## ----------------------------------------------------------------------------
##  0      Full Model    2993.945    3038.388    1802.523    0.80870    0.80450 
##  1      teachers      2991.946    3032.348    1800.475    0.80870    0.80498 
##  2      stratio       2990.513    3026.875    1798.972    0.80844    0.80519 
## ----------------------------------------------------------------------------
## 
## Final Model Output 
## ------------------
## 
##                          Model Summary                           
## ----------------------------------------------------------------
## R                       0.899       RMSE                  8.329 
## R-Squared               0.808       MSE                  70.723 
## Adj. R-Squared          0.805       Coef. Var             1.286 
## Pred R-Squared          0.801       AIC                2990.513 
## MAE                     6.469       SBC                3026.875 
## ----------------------------------------------------------------
##  RMSE: Root Mean Square Error 
##  MSE: Mean Square Error 
##  MAE: Mean Absolute Error 
##  AIC: Akaike Information Criteria 
##  SBC: Schwarz Bayesian Criteria 
## 
##                                  ANOVA                                   
## ------------------------------------------------------------------------
##                   Sum of                                                
##                  Squares         DF    Mean Square       F         Sig. 
## ------------------------------------------------------------------------
## Regression    122971.846          7      17567.407    248.399    0.0000 
## Residual       29137.735        412         70.723                      
## Total         152109.580        419                                     
## ------------------------------------------------------------------------
## 
##                                     Parameter Estimates                                      
## --------------------------------------------------------------------------------------------
##       model       Beta    Std. Error    Std. Beta       t        Sig       lower      upper 
## --------------------------------------------------------------------------------------------
## (Intercept)    654.387         3.740                 174.982    0.000    647.036    661.738 
##    students      0.000         0.000       -0.084     -1.395    0.164     -0.001      0.000 
##    calworks     -0.092         0.058       -0.055     -1.586    0.113     -0.205      0.022 
##       lunch     -0.371         0.036       -0.528    -10.249    0.000     -0.442     -0.300 
##    computer      0.003         0.003        0.081      1.361    0.174     -0.002      0.009 
## expenditure      0.002         0.001        0.071      2.951    0.003      0.001      0.004 
##      income      0.618         0.089        0.234      6.946    0.000      0.443      0.792 
##     english     -0.207         0.035       -0.198     -5.956    0.000     -0.275     -0.138 
## --------------------------------------------------------------------------------------------

Elimina progressivamente le variabili meno rilevanti.
Restano le stesse variabili principali del modello forward.

10.3 Metodo Stepwise

ols_step_both_p(model, pent = 0.05, prem = 0.05)

## 
## 
##                                  Stepwise Summary                                  
## ---------------------------------------------------------------------------------
## Step    Variable             AIC         SBC         SBIC        R2       Adj. R2 
## ---------------------------------------------------------------------------------
##  0      Base Model         3670.591    3678.672    2475.782    0.00000    0.00000 
##  1      lunch (+)          3082.265    3094.386    1889.473    0.75476    0.75418 
##  2      income (+)         3036.983    3053.144    1844.418    0.78088    0.77982 
##  3      english (+)        2995.389    3015.591    1803.418    0.80248    0.80105 
##  4      expenditure (+)    2989.015    3013.257    1797.227    0.80638    0.80451 
## ---------------------------------------------------------------------------------
## 
## Final Model Output 
## ------------------
## 
##                          Model Summary                           
## ----------------------------------------------------------------
## R                       0.898       RMSE                  8.374 
## R-Squared               0.806       MSE                  70.968 
## Adj. R-Squared          0.805       Coef. Var             1.288 
## Pred R-Squared          0.802       AIC                2989.015 
## MAE                     6.488       SBC                3013.257 
## ----------------------------------------------------------------
##  RMSE: Root Mean Square Error 
##  MSE: Mean Square Error 
##  MAE: Mean Absolute Error 
##  AIC: Akaike Information Criteria 
##  SBC: Schwarz Bayesian Criteria 
## 
##                                  ANOVA                                   
## ------------------------------------------------------------------------
##                   Sum of                                                
##                  Squares         DF    Mean Square       F         Sig. 
## ------------------------------------------------------------------------
## Regression    122657.826          4      30664.457    432.088    0.0000 
## Residual       29451.754        415         70.968                      
## Total         152109.580        419                                     
## ------------------------------------------------------------------------
## 
##                                     Parameter Estimates                                      
## --------------------------------------------------------------------------------------------
##       model       Beta    Std. Error    Std. Beta       t        Sig       lower      upper 
## --------------------------------------------------------------------------------------------
## (Intercept)    655.049         3.632                 180.337    0.000    647.909    662.189 
##       lunch     -0.408         0.028       -0.581    -14.549    0.000     -0.463     -0.353 
##      income      0.622         0.088        0.236      7.063    0.000      0.449      0.795 
##     english     -0.195         0.031       -0.187     -6.286    0.000     -0.256     -0.134 
## expenditure      0.002         0.001        0.068      2.891    0.004      0.001      0.003 
## --------------------------------------------------------------------------------------------

Combina i due approcci: aggiunge o rimuove variabili in base al p-value.
Conferma la rilevanza di lunch, income, english, expenditure.

10.4 Miglior sottoinsieme

ols_step_best_subset(model)

##                                   Best Subsets Regression                                  
## -------------------------------------------------------------------------------------------
## Model Index    Predictors
## -------------------------------------------------------------------------------------------
##      1         lunch                                                                        
##      2         lunch income                                                                 
##      3         lunch income english                                                         
##      4         lunch expenditure income english                                             
##      5         calworks lunch expenditure income english                                    
##      6         calworks lunch expenditure income english stratio                            
##      7         students calworks lunch computer expenditure income english                  
##      8         students calworks lunch computer expenditure income english stratio          
##      9         students teachers calworks lunch computer expenditure income english stratio 
## -------------------------------------------------------------------------------------------
## 
##                                                        Subsets Regression Summary                                                        
## -----------------------------------------------------------------------------------------------------------------------------------------
##                        Adj.        Pred                                                                                                   
## Model    R-Square    R-Square    R-Square      C(p)         AIC         SBIC          SBC          MSEP         FPE       HSP       APC  
## -----------------------------------------------------------------------------------------------------------------------------------------
##   1        0.7548      0.7542      0.7523    109.5996    3082.2654    1889.4726    3094.3862    37481.1109    89.6657    0.2140    0.2476 
##   2        0.7809      0.7798      0.7779     55.6380    3036.9829    1844.4180    3053.1439    33570.9336    80.5013    0.1921    0.2223 
##   3        0.8025      0.8011      0.7987     11.3368    2995.3893    1803.4177    3015.5906    30334.1198    72.9111    0.1740    0.2013 
##   4        0.8064      0.8045      0.8016      4.9796    2989.0151    1797.2274    3013.2566    29807.1127    71.8129    0.1714    0.1983 
##   5        0.8075      0.8052      0.8016      4.5343    2988.5329    1796.8410    3016.8146    29703.2191    71.7306    0.1712    0.1981 
##   6        0.8079      0.8051      0.8008      5.7187    2989.7017    1798.0741    3022.0237    29716.4461    71.9306    0.1717    0.1986 
##   7        0.8084      0.8052      0.8012      6.5550    2990.5129    1798.9716    3026.8752    29704.5548    72.0698    0.1721    0.1990 
##   8        0.8087      0.8050      0.8003      8.0011    2991.9458    1800.4753    3032.3484    29736.8279    72.3162    0.1727    0.1997 
##   9        0.8087      0.8045      0.7996     10.0000    2993.9447    1802.5230    3038.3875    29809.4563    72.6614    0.1735    0.2006 
## -----------------------------------------------------------------------------------------------------------------------------------------
## AIC: Akaike Information Criteria 
##  SBIC: Sawa's Bayesian Information Criteria 
##  SBC: Schwarz Bayesian Criteria 
##  MSEP: Estimated error of prediction, assuming multivariate normality 
##  FPE: Final Prediction Error 
##  HSP: Hocking's Sp 
##  APC: Amemiya Prediction Criteria

Il modello a 4 variabili con miglior AIC è:
score ~ lunch + income + english + expenditure.

11 11. Conclusioni

La dimensione della classe (stratio) perde importanza una volta controllato per fattori socioeconomici.

Variabili come income (reddito medio), lunch (povertà) ed english (studenti non madrelingua) spiegano gran parte della variazione nei punteggi.

Il modello finale è:
\[score = 655 - 0.41\, lunch + 0.62\, income - 0.20\, english + 0.002\, expenditure\]
con R² ≈ 0.81.

💡 Domande per la discussione in aula
- Perché gli errori robusti sono importanti?
- Quando conviene usare criteri automatici di selezione?
- Cosa implica la perdita di significatività di stratio nel modello esteso?

Regressione multipla con CASchools

Lezioni SAE — Notebook Didattico

2025-10-14

1 1. Dati e costruzione delle variabili

2 2. Analisi preliminare: relazioni tra le variabili

3 3. Regressione multipla OLS

4 4. Controllo dell’assunzione di omoschedasticità

4.1 Errori standard robusti (White, tipo HC1)

5 5. Calcolo manuale dei coefficienti OLS

6 6. Calcolo manuale degli errori standard classici

7 7. Calcolo manuale degli errori standard robusti

8 8. Test F: ipotesi congiunte

8.1 Caso con un solo regressore

8.2 Caso con due regressori

9 9. Aggiunta di una terza variabile: `expenditure`

10 10. Selezione del modello ottimale

10.1 Metodo Forward

10.2 Metodo Backward

10.3 Metodo Stepwise

10.4 Miglior sottoinsieme

11 11. Conclusioni

Regressione multipla con CASchools

Lezioni SAE — Notebook Didattico

2025-10-14

1 1. Dati e costruzione delle variabili

2 2. Analisi preliminare: relazioni tra le variabili

3 3. Regressione multipla OLS

4 4. Controllo dell’assunzione di omoschedasticità

4.1 Errori standard robusti (White, tipo HC1)

5 5. Calcolo manuale dei coefficienti OLS

6 6. Calcolo manuale degli errori standard classici

7 7. Calcolo manuale degli errori standard robusti

8 8. Test F: ipotesi congiunte

8.1 Caso con un solo regressore

8.2 Caso con due regressori

9 9. Aggiunta di una terza variabile: expenditure

10 10. Selezione del modello ottimale

10.1 Metodo Forward

10.2 Metodo Backward

10.3 Metodo Stepwise

10.4 Miglior sottoinsieme

11 11. Conclusioni

9 9. Aggiunta di una terza variabile: `expenditure`