Prestige Data Analysis

Questions

1. Is at least one of the predictors X1 , X2 , . . . , Xp useful in predicting the response?

library(car)
data(Prestige)
mod1<-lm(prestige ~ ., data = Prestige)
summary(mod1)

## 
## Call:
## lm(formula = prestige ~ ., data = Prestige)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.9863  -4.9813   0.6983   4.8690  19.2402 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.213e+01  8.018e+00  -1.513  0.13380    
## education    3.933e+00  6.535e-01   6.019 3.64e-08 ***
## income       9.946e-04  2.601e-04   3.824  0.00024 ***
## women        1.310e-02  3.018e-02   0.434  0.66524    
## census       1.156e-03  6.183e-04   1.870  0.06471 .  
## typeprof     1.077e+01  4.676e+00   2.303  0.02354 *  
## typewc       2.877e-01  3.139e+00   0.092  0.92718    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.037 on 91 degrees of freedom
##   (4 observations deleted due to missingness)
## Multiple R-squared:  0.841,  Adjusted R-squared:  0.8306 
## F-statistic: 80.25 on 6 and 91 DF,  p-value: < 2.2e-16

pairs(Prestige)

cor(Prestige[,-6])

##             education     income       women   prestige     census
## education  1.00000000  0.5775802  0.06185286  0.8501769 -0.8230882
## income     0.57758023  1.0000000 -0.44105927  0.7149057 -0.3610023
## women      0.06185286 -0.4410593  1.00000000 -0.1183342 -0.2270028
## prestige   0.85017689  0.7149057 -0.11833419  1.0000000 -0.6345103
## census    -0.82308821 -0.3610023 -0.22700277 -0.6345103  1.0000000

A: Yes. The main predictors are education and income, because their correlation are the highest with prestige. Also, in the full model (all variables together), their p-values are below 0.001.

2. Do all the predictors help to explain Y, or is only a subset of the predictors useful?

A: No. The predictors type, census and women show no strong correlation with prestige and their contribution are very low in the full model.

3. How well does the model fit the data?

mod2<-lm(prestige ~ education + income, data = Prestige)
summary(mod2)

## 
## Call:
## lm(formula = prestige ~ education + income, data = Prestige)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -19.4040  -5.3308   0.0154   4.9803  17.6889 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -6.8477787  3.2189771  -2.127   0.0359 *  
## education    4.1374444  0.3489120  11.858  < 2e-16 ***
## income       0.0013612  0.0002242   6.071 2.36e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.81 on 99 degrees of freedom
## Multiple R-squared:  0.798,  Adjusted R-squared:  0.7939 
## F-statistic: 195.6 on 2 and 99 DF,  p-value: < 2.2e-16

A: When we select only education and income to predict prestige our model (mod2) shows an adjusted \(R^2 = 0.794\).

par(mfrow=c(2,2))
plot(mod2)

By plotting the residuals of our mod2, we confirm that the residuals show low deviance and normal distribution.

4. Given a set of predictor values, what response value should we predict, and how accurate is our prediction?

A: Our model can make predictions:

# fake guess
guess<-data.frame(income=9000,education=14) # set of random predictor values
y_guess<-predict(mod2, guess)
y_guess

##        1 
## 63.32694

Our predicted values seems to be close to the real values of prestige:

y_hat<-predict(mod2)
plot(y_hat~Prestige$prestige)

Performing a Leave-one-out cross-validation, we can see that the residuals of our two models (with and without type variable) show normal distribution.

residuos<-c()
for(i in 1:nrow(Prestige)){
  model_cross_val<-lm(prestige ~ education + income, data = Prestige[-i,])
  residuos[i] <- Prestige$prestige[i]-
    predict(model_cross_val,Prestige[i,c("education","income")] )
}
densityPlot(scale(residuos))

which(is.na(Prestige$type))

## [1] 34 53 63 67

dados<-Prestige[-c(34,53,63,67),]
residuos<-c()
for(i in 1:nrow(dados)){
  model_cross_val<-lm(prestige ~ education + income+type, data = dados[-i,])
  residuos[i] <- dados$prestige[i]-
    predict(model_cross_val,dados[i,c("education","income","type")] )
}
densityPlot(scale(residuos))

dados<-Prestige[-c(34,53,63,67),]
mod1<-lm(prestige~income+education,dados)
mod2<-lm(prestige~income+education+type,dados)
AIC(mod1,mod2)

##      df      AIC
## mod1  4 676.6695
## mod2  6 669.0151

anova(mod1,mod2)

## Analysis of Variance Table
## 
## Model 1: prestige ~ income + education
## Model 2: prestige ~ income + education + type
##   Res.Df    RSS Df Sum of Sq      F   Pr(>F)   
## 1     95 5272.4                                
## 2     93 4681.3  2    591.16 5.8721 0.003966 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

From our analysis, especially from the results obtained with the AIC criterium, we conclude that including type predictor increases the model’s accuracy.