Prestige Data Analysis

Materials and Methods

We obtained the Prestige dataset from the “car” package.
Pair plot of all variables in Prestige:

## The following object is masked from package:datasets:
## 
##     women
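The loading step can be sketched as follows. This is a minimal example, assuming the `carData` package (which ships the `Prestige` data used by `car`) is installed. The masking message above is what `attach(Prestige)` prints, because the `datasets` package also has an object called `women`. The sketch passes the data frame explicitly instead, which avoids the masking.

```r
# Assumption: carData is installed; it provides the Prestige data frame.
data("Prestige", package = "carData")

# Quick look at the structure and at missing values per column.
str(Prestige)
colSums(is.na(Prestige))

# Scatterplot matrix of all variables (the factor `type` is shown as integer codes).
pairs(Prestige, main = "Prestige: all variables")
```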

Fit using all variables (prestige vs. education, income, women, census and type):

## 
## Call:
## lm(formula = prestige ~ education + income + women + census + 
##     type)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.9863  -4.9813   0.6983   4.8690  19.2402 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.213e+01  8.018e+00  -1.513  0.13380    
## education    3.933e+00  6.535e-01   6.019 3.64e-08 ***
## income       9.946e-04  2.601e-04   3.824  0.00024 ***
## women        1.310e-02  3.018e-02   0.434  0.66524    
## census       1.156e-03  6.183e-04   1.870  0.06471 .  
## typeprof     1.077e+01  4.676e+00   2.303  0.02354 *  
## typewc       2.877e-01  3.139e+00   0.092  0.92718    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.037 on 91 degrees of freedom
##   (4 observations deleted due to missingness)
## Multiple R-squared:  0.841,  Adjusted R-squared:  0.8306 
## F-statistic: 80.25 on 6 and 91 DF,  p-value: < 2.2e-16
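A fit like the one above could be reproduced with the sketch below. The object name `fit_full` and the use of `data = Prestige` (rather than an attached data frame) are assumptions made for illustration; the formula is the one shown in the `Call:` line.

```r
# Assumption: carData is installed; it provides the Prestige data frame.
data("Prestige", package = "carData")

# Full model: all five predictors. Rows with a missing `type` are dropped by lm().
fit_full <- lm(prestige ~ education + income + women + census + type,
               data = Prestige)

summary(fit_full)

# Number of rows actually used in the fit.
nobs(fit_full)
```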

Simplified models: we removed non-significant variables step by step to find the best model (“type” was marginally significant).

4.1: Removing the women variable

## 
## Call:
## lm(formula = prestige ~ education + income + census + type)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.0873  -4.9935   0.7435   4.9617  19.4891 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.144e+01  7.823e+00  -1.462   0.1472    
## education    3.947e+00  6.498e-01   6.075 2.76e-08 ***
## income       9.365e-04  2.221e-04   4.217 5.79e-05 ***
## census       1.125e-03  6.113e-04   1.840   0.0691 .  
## typeprof     1.091e+01  4.645e+00   2.348   0.0210 *  
## typewc       5.605e-01  3.062e+00   0.183   0.8551    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.006 on 92 degrees of freedom
##   (4 observations deleted due to missingness)
## Multiple R-squared:  0.8407, Adjusted R-squared:  0.8321 
## F-statistic: 97.12 on 5 and 92 DF,  p-value: < 2.2e-16

4.2: Removing the census variable

## 
## Call:
## lm(formula = prestige ~ education + income + type)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.9529  -4.4486   0.1678   5.0566  18.6320 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.6229292  5.2275255  -0.119    0.905    
## education    3.6731661  0.6405016   5.735 1.21e-07 ***
## income       0.0010132  0.0002209   4.586 1.40e-05 ***
## typeprof     6.0389707  3.8668551   1.562    0.122    
## typewc      -2.7372307  2.5139324  -1.089    0.279    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.095 on 93 degrees of freedom
##   (4 observations deleted due to missingness)
## Multiple R-squared:  0.8349, Adjusted R-squared:  0.8278 
## F-statistic: 117.5 on 4 and 93 DF,  p-value: < 2.2e-16

When we removed the census variable, type was no longer significant.
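The t-tests above judge each level of type separately. One way to test the factor as a whole is a partial F-test between the models with and without it, fitted on the same rows. The sketch below is an illustration, not part of the original analysis; the object names are hypothetical.

```r
# Assumption: carData is installed; it provides the Prestige data frame.
data("Prestige", package = "carData")

# Keep only rows where `type` is known, so both models use the same sample.
prestige_cc <- subset(Prestige, !is.na(type))

fit_with_type    <- lm(prestige ~ education + income + type, data = prestige_cc)
fit_without_type <- lm(prestige ~ education + income,        data = prestige_cc)

# Partial F-test: does adding `type` (2 extra coefficients) reduce the RSS enough?
type_test <- anova(fit_without_type, fit_with_type)
type_test
```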

4.3: Removing the type variable

## 
## Call:
## lm(formula = prestige ~ education + income)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -19.4040  -5.3308   0.0154   4.9803  17.6889 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -6.8477787  3.2189771  -2.127   0.0359 *  
## education    4.1374444  0.3489120  11.858  < 2e-16 ***
## income       0.0013612  0.0002242   6.071 2.36e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.81 on 99 degrees of freedom
## Multiple R-squared:  0.798,  Adjusted R-squared:  0.7939 
## F-statistic: 195.6 on 2 and 99 DF,  p-value: < 2.2e-16
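Steps 4.1 to 4.3 can be sketched with `update()`, which refits a model after removing one term. The object names are hypothetical. Note that the last output has 99 residual degrees of freedom and no missingness note: once type is dropped, the 4 rows with a missing type are used again, so that model is fitted on 102 rows instead of 98 and its R-squared is not computed on the same sample as the earlier ones.

```r
# Assumption: carData is installed; it provides the Prestige data frame.
data("Prestige", package = "carData")

fit_full <- lm(prestige ~ education + income + women + census + type,
               data = Prestige)

# Manual backward elimination, one term at a time.
fit_1 <- update(fit_full, . ~ . - women)   # 4.1
fit_2 <- update(fit_1,    . ~ . - census)  # 4.2
fit_3 <- update(fit_2,    . ~ . - type)    # 4.3

# Rows used by each model: the sample grows once `type` is removed.
n_used <- sapply(list(full = fit_full, m1 = fit_1, m2 = fit_2, m3 = fit_3), nobs)
n_used

summary(fit_3)
```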

This model selection was done manually. We also tried an automatic stepwise procedure, which gave a different result: it dropped only women and kept census and type.

## Start:  AIC=389.16
## prestige ~ education + income + women + census + type
## 
##             Df Sum of Sq    RSS    AIC
## - women      1      9.33 4515.2 387.36
## <none>                   4505.9 389.16
## - census     1    173.13 4679.0 390.86
## - type       2    669.16 5175.0 398.73
## - income     1    724.12 5230.0 401.77
## - education  1   1793.65 6299.5 420.00
## 
## Step:  AIC=387.36
## prestige ~ education + income + census + type
## 
##             Df Sum of Sq    RSS    AIC
## <none>                   4515.2 387.36
## - census     1    166.09 4681.3 388.90
## + women      1      9.33 4505.9 389.16
## - type       2    660.45 5175.6 396.74
## - income     1    872.78 5388.0 402.68
## - education  1   1811.26 6326.4 418.42

## 
## Call:
## lm(formula = prestige ~ education + income + census + type)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.0873  -4.9935   0.7435   4.9617  19.4891 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.144e+01  7.823e+00  -1.462   0.1472    
## education    3.947e+00  6.498e-01   6.075 2.76e-08 ***
## income       9.365e-04  2.221e-04   4.217 5.79e-05 ***
## census       1.125e-03  6.113e-04   1.840   0.0691 .  
## typeprof     1.091e+01  4.645e+00   2.348   0.0210 *  
## typewc       5.605e-01  3.062e+00   0.183   0.8551    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.006 on 92 degrees of freedom
##   (4 observations deleted due to missingness)
## Multiple R-squared:  0.8407, Adjusted R-squared:  0.8321 
## F-statistic: 97.12 on 5 and 92 DF,  p-value: < 2.2e-16
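The stepwise run can be sketched with `step()` from base R. The `+ women` row in the second step suggests that both directions were allowed, so `direction = "both"` is assumed here; the original call is not shown. The AIC that `step()` reports for a linear model is \(n \log(\mathrm{RSS}/n) + 2k\), with \(k\) the number of coefficients, which can be checked by hand against the table (the helper `aic_by_hand` is hypothetical).

```r
# Assumption: carData is installed; it provides the Prestige data frame.
data("Prestige", package = "carData")

fit_full <- lm(prestige ~ education + income + women + census + type,
               data = Prestige)

# Stepwise selection by AIC, starting from the full model.
fit_step <- step(fit_full, direction = "both")
formula(fit_step)

# AIC as used by step() for lm: n * log(RSS / n) + 2 * (number of coefficients).
aic_by_hand <- function(rss, n, n_coef) n * log(rss / n) + 2 * n_coef

aic_start <- aic_by_hand(4505.9, n = 98, n_coef = 7)  # full model, RSS from the table
aic_final <- aic_by_hand(4515.2, n = 98, n_coef = 6)  # after dropping women
c(start = aic_start, final = aic_final)
```

Dropping census or type from the selected model would raise the AIC (388.90 and 396.74 in the table), which is why the procedure stops after removing women.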

Model validation

Questions

Is at least one of the predictors \(X_1, X_2, \ldots, X_p\) useful in predicting the response?

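Yes. The overall F-test of the final model (F = 195.6 on 2 and 99 DF, p < 2.2e-16) rejects the hypothesis that all coefficients are zero, and both education and income are individually significant, so they are useful for predicting the prestige of an occupation.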

Do all the predictors help to explain \(Y\), or is only a subset of the predictors useful?

No, only a subset is useful. In our manual selection, the percentage of women, census and type were not good predictors.

How well does the model fit the data?

Adjusted R-squared: 0.7939, so the chosen model explains about 79% of the variance in prestige. Through model validation we checked that the structure of the chosen model was appropriate for the dataset considered.
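The adjusted R-squared can be recomputed from \(R^2\), the number of observations \(n\) and the number of predictors \(p\) as \(1 - (1 - R^2)(n - 1)/(n - p - 1)\). The sketch below does this for the final model and then draws the four standard residual plots of `plot.lm`. Which validation checks were actually run is not stated in the report, so the plots are an assumption about a typical workflow.

```r
# Assumption: carData is installed; it provides the Prestige data frame.
data("Prestige", package = "carData")

fit_final <- lm(prestige ~ education + income, data = Prestige)

# Adjusted R-squared by hand.
r2 <- summary(fit_final)$r.squared
n  <- nobs(fit_final)  # rows used in the fit
p  <- 2                # predictors: education, income
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
adj_r2

# Standard diagnostics: residuals vs fitted, normal Q-Q,
# scale-location, residuals vs leverage.
op <- par(mfrow = c(2, 2))
plot(fit_final)
par(op)
```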

Given a set of predictor values, what response value should we predict, and how accurate is our prediction?

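Higher income and education mean higher prestige. The fitted equation is \[\mathrm{Prestige} = -6.84778 + 4.13744\times\mathrm{Education} + 0.00136\times\mathrm{Income},\] with a residual standard error of 7.81. That value is the typical size of a residual, not a formal prediction interval.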
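A prediction with a proper interval can be sketched with `predict()`. The occupation below (12 years of education, income of 6000) is a made-up example, not a row discussed in the report.

```r
# Assumption: carData is installed; it provides the Prestige data frame.
data("Prestige", package = "carData")

fit_final <- lm(prestige ~ education + income, data = Prestige)

# Hypothetical occupation used only for illustration.
new_occupation <- data.frame(education = 12, income = 6000)

# Point prediction plus a 95% prediction interval for a single new occupation.
pred <- predict(fit_final, newdata = new_occupation,
                interval = "prediction", level = 0.95)
pred
```

For this example the point prediction is roughly 51, and the 95% prediction interval extends about twice the residual standard error on each side.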

This plot shows prestige vs. education, with the area of each circle depending on income. We chose education for the \(x\)-axis because it is the stronger predictor (t = 11.86 vs. 6.07 for income).
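A plot of this kind can be sketched with `symbols()` from base graphics. The original plotting code is not shown, so the scaling (circle area proportional to income), the colour and the size are assumptions.

```r
# Assumption: carData is installed; it provides the Prestige data frame.
data("Prestige", package = "carData")

# symbols() takes radii, so use sqrt(income / pi) to make the AREA proportional to income.
radius <- sqrt(Prestige$income / pi)

symbols(Prestige$education, Prestige$prestige,
        circles = radius,
        inches  = 0.2,          # radius of the largest circle, in inches
        fg      = "steelblue",
        xlab    = "Education (years)",
        ylab    = "Prestige",
        main    = "Prestige vs. education (circle area ~ income)")
```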

Wemi, Renan, Luis, Ignacio

December 12, 2017
