Materials and Methods

  1. Obtained the Prestige dataset from the “car” package;
  2. Drew a pair plot of all variables in Prestige:
## The following object is masked from package:datasets:
## 
##     women
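
A minimal sketch of steps 1 and 2, assuming the data frame is read from `carData` (the data package that `car` loads). The "masked" message above suggests the original code used `attach(Prestige)`; the sketch passes the data frame explicitly instead, which avoids masking `datasets::women`.

```r
# Prestige ships with carData, which car loads as a dependency.
data("Prestige", package = "carData")

# Six variables: education, income, women, prestige, census, type (a factor).
str(Prestige)

# Scatterplot matrix of every pair of variables.
# pairs() converts the factor column 'type' to its integer codes.
pairs(Prestige, main = "Prestige: pairwise scatterplots")
```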

  3. Fitted a linear model using all variables (prestige vs. education, income, women, census and type);
## 
## Call:
## lm(formula = prestige ~ education + income + women + census + 
##     type)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.9863  -4.9813   0.6983   4.8690  19.2402 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.213e+01  8.018e+00  -1.513  0.13380    
## education    3.933e+00  6.535e-01   6.019 3.64e-08 ***
## income       9.946e-04  2.601e-04   3.824  0.00024 ***
## women        1.310e-02  3.018e-02   0.434  0.66524    
## census       1.156e-03  6.183e-04   1.870  0.06471 .  
## typeprof     1.077e+01  4.676e+00   2.303  0.02354 *  
## typewc       2.877e-01  3.139e+00   0.092  0.92718    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.037 on 91 degrees of freedom
##   (4 observations deleted due to missingness)
## Multiple R-squared:  0.841,  Adjusted R-squared:  0.8306 
## F-statistic: 80.25 on 6 and 91 DF,  p-value: < 2.2e-16
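
The full fit above could be reproduced along these lines. This is a sketch that passes `data = Prestige` rather than relying on an attached data frame; the object name `full` is chosen here for illustration.

```r
data("Prestige", package = "carData")

# Full model: all five predictors; rows with a missing 'type' are dropped by lm().
full <- lm(prestige ~ education + income + women + census + type,
           data = Prestige)

summary(full)  # coefficient table, R-squared, overall F test
nobs(full)     # number of occupations actually used in the fit
```
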
  4. Simplified the model: removed non-significant variables step by step in order to find the optimal model (“type” was only marginally significant).

4.1: Removing the women variable

## 
## Call:
## lm(formula = prestige ~ education + income + census + type)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.0873  -4.9935   0.7435   4.9617  19.4891 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.144e+01  7.823e+00  -1.462   0.1472    
## education    3.947e+00  6.498e-01   6.075 2.76e-08 ***
## income       9.365e-04  2.221e-04   4.217 5.79e-05 ***
## census       1.125e-03  6.113e-04   1.840   0.0691 .  
## typeprof     1.091e+01  4.645e+00   2.348   0.0210 *  
## typewc       5.605e-01  3.062e+00   0.183   0.8551    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.006 on 92 degrees of freedom
##   (4 observations deleted due to missingness)
## Multiple R-squared:  0.8407, Adjusted R-squared:  0.8321 
## F-statistic: 97.12 on 5 and 92 DF,  p-value: < 2.2e-16
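
One convenient way to drop a single term without retyping the formula is `update()`. A sketch, with the object names `full` and `no_women` assumed for illustration:

```r
data("Prestige", package = "carData")

full <- lm(prestige ~ education + income + women + census + type,
           data = Prestige)

# ". ~ . - women" keeps the response and all other terms, minus women.
no_women <- update(full, . ~ . - women)

# Coefficient table of the reduced model.
summary(no_women)$coefficients
```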

4.2: Removing the census variable

## 
## Call:
## lm(formula = prestige ~ education + income + type)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.9529  -4.4486   0.1678   5.0566  18.6320 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.6229292  5.2275255  -0.119    0.905    
## education    3.6731661  0.6405016   5.735 1.21e-07 ***
## income       0.0010132  0.0002209   4.586 1.40e-05 ***
## typeprof     6.0389707  3.8668551   1.562    0.122    
## typewc      -2.7372307  2.5139324  -1.089    0.279    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.095 on 93 degrees of freedom
##   (4 observations deleted due to missingness)
## Multiple R-squared:  0.8349, Adjusted R-squared:  0.8278 
## F-statistic: 117.5 on 4 and 93 DF,  p-value: < 2.2e-16

When we removed the census variable, type was no longer significant.
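
Because type is a three-level factor, the two t tests in the table only compare prof and wc against the baseline level (bc). A joint F test of the whole term may give a different picture, so it can be worth checking before dropping type. A sketch using `drop1()` (the object name `m_type` is an assumption):

```r
data("Prestige", package = "carData")

m_type <- lm(prestige ~ education + income + type, data = Prestige)

# F test for removing each term as a whole.
# 'type' is tested on 2 degrees of freedom (both dummy variables at once).
tab <- drop1(m_type, test = "F")
tab
```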

4.3: Removing the type variable

## 
## Call:
## lm(formula = prestige ~ education + income)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -19.4040  -5.3308   0.0154   4.9803  17.6889 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -6.8477787  3.2189771  -2.127   0.0359 *  
## education    4.1374444  0.3489120  11.858  < 2e-16 ***
## income       0.0013612  0.0002242   6.071 2.36e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.81 on 99 degrees of freedom
## Multiple R-squared:  0.798,  Adjusted R-squared:  0.7939 
## F-statistic: 195.6 on 2 and 99 DF,  p-value: < 2.2e-16
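
Note that this last model has 99 residual degrees of freedom, so it was fitted to all 102 occupations, while the models containing type used only 98 (4 rows have a missing type). Their R-squared and residual standard errors are therefore not computed on the same data. A sketch of one way to make the comparison like for like, assuming the incomplete rows are dropped up front:

```r
data("Prestige", package = "carData")

# Keep only the occupations with no missing values (drops the 4 NA 'type' rows).
complete <- na.omit(Prestige)

m_type  <- lm(prestige ~ education + income + type, data = complete)
m_small <- lm(prestige ~ education + income,        data = complete)

# Both models now use the same rows.
c(nobs(m_type), nobs(m_small))

# Nested-model F test: does adding 'type' improve on education + income?
anova(m_small, m_type)
```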

This model selection was done manually. We also tried an automatic stepwise procedure, which gave a different result: it removed only women and kept census and type.

## Start:  AIC=389.16
## prestige ~ education + income + women + census + type
## 
##             Df Sum of Sq    RSS    AIC
## - women      1      9.33 4515.2 387.36
## <none>                   4505.9 389.16
## - census     1    173.13 4679.0 390.86
## - type       2    669.16 5175.0 398.73
## - income     1    724.12 5230.0 401.77
## - education  1   1793.65 6299.5 420.00
## 
## Step:  AIC=387.36
## prestige ~ education + income + census + type
## 
##             Df Sum of Sq    RSS    AIC
## <none>                   4515.2 387.36
## - census     1    166.09 4681.3 388.90
## + women      1      9.33 4505.9 389.16
## - type       2    660.45 5175.6 396.74
## - income     1    872.78 5388.0 402.68
## - education  1   1811.26 6326.4 418.42
## 
## Call:
## lm(formula = prestige ~ education + income + census + type)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.0873  -4.9935   0.7435   4.9617  19.4891 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.144e+01  7.823e+00  -1.462   0.1472    
## education    3.947e+00  6.498e-01   6.075 2.76e-08 ***
## income       9.365e-04  2.221e-04   4.217 5.79e-05 ***
## census       1.125e-03  6.113e-04   1.840   0.0691 .  
## typeprof     1.091e+01  4.645e+00   2.348   0.0210 *  
## typewc       5.605e-01  3.062e+00   0.183   0.8551    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.006 on 92 degrees of freedom
##   (4 observations deleted due to missingness)
## Multiple R-squared:  0.8407, Adjusted R-squared:  0.8321 
## F-statistic: 97.12 on 5 and 92 DF,  p-value: < 2.2e-16
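
The trace above (with a `+ women` row in the second step) is consistent with AIC-based selection in both directions. A sketch using `step()` from base R; whether the original used `step()` or a similar function is an assumption, as are the object names:

```r
data("Prestige", package = "carData")

# step() needs every candidate model fitted to the same rows,
# so drop the occupations with a missing 'type' first.
complete <- na.omit(Prestige)

full <- lm(prestige ~ education + income + women + census + type,
           data = complete)

# Stepwise selection by AIC, allowing terms to be dropped and re-added.
auto <- step(full, direction = "both")

formula(auto)  # the selected model
```

Stepwise selection minimises AIC rather than testing each coefficient at the 5% level, which is why it can keep a term such as census that the manual procedure removed.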

Model validation

Questions

  1. Is at least one of the predictors \(X_1, X_2, \ldots, X_p\) useful in predicting the response?
  2. Do all the predictors help to explain \(Y\), or is only a subset of the predictors useful?
  3. How well does the model fit the data?
  4. Given a set of predictor values, what response value should we predict, and how accurate is our prediction?
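
Each of these questions maps onto a quantity that R reports for a fitted model. A sketch using the education + income model; the new occupation in `new_occ` (12 years of education, income of 8000) is a made-up example, not a row from the data:

```r
data("Prestige", package = "carData")

fit <- lm(prestige ~ education + income, data = Prestige)
s   <- summary(fit)

# Question 1: overall F test (value, numerator df, denominator df).
s$fstatistic

# Question 2: t test for each predictor.
s$coefficients

# Question 3: goodness of fit (R-squared and residual standard error).
c(R2 = s$r.squared, RSE = s$sigma)

# Question 4: point prediction with a 95% prediction interval
# for a hypothetical occupation.
new_occ <- data.frame(education = 12, income = 8000)
pred <- predict(fit, newdata = new_occ, interval = "prediction", level = 0.95)
pred
```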

This plot shows prestige vs. education, with the area of each circle proportional to income. We chose education for the \(x\)-axis because it is a stronger predictor than income: it has the largest t value in every model fitted above.
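
A plot of this kind could be drawn with base graphics as sketched below. The colours, axis labels and the `inches` scaling are assumptions made for illustration.

```r
data("Prestige", package = "carData")

# symbols() takes circle radii, so use sqrt(income) to make the
# circle *area* proportional to income.
radius <- sqrt(Prestige$income)

symbols(Prestige$education, Prestige$prestige,
        circles = radius,
        inches  = 0.25,                                  # radius of the largest circle
        fg      = "steelblue",
        bg      = adjustcolor("steelblue", alpha.f = 0.3),
        xlab    = "Education (years)",
        ylab    = "Prestige score",
        main    = "Prestige vs. education (circle area ~ income)")
```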