As mentioned in one of the blogs (you can find it here), a good way of doing backward elimination is to use the stepwise function “step”.
Here we are going to build a full multiple linear model, then apply the step function to it to eliminate unneeded variables. The general idea behind backward selection is to start with the full model and eliminate one variable at a time until the ideal model is reached. That is, start with the full model, refit all possible models omitting one variable at a time, choose the model with the highest adjusted R square, and repeat until the maximum possible adjusted R square is reached. Note that step ranks the candidate models by AIC rather than by adjusted R square, as the output further down shows, but the procedure is otherwise the same.
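The procedure just described can be sketched by hand. The code below is only an illustration under stated assumptions: it uses the built-in mtcars data instead of evals, and the helper names adj_r2 and backward_adj_r2 are made up for this example. It selects on adjusted R square, as in the description above, which is not the criterion step itself uses.

```r
# Adjusted R-squared of a linear model with the given predictor names.
# (adj_r2 and backward_adj_r2 are hypothetical helpers for illustration.)
adj_r2 <- function(terms, data, response) {
  rhs <- if (length(terms) == 0) "1" else paste(terms, collapse = " + ")
  fit <- lm(as.formula(paste(response, "~", rhs)), data = data)
  summary(fit)$adj.r.squared
}

# Drop one predictor at a time for as long as adjusted R-squared increases.
backward_adj_r2 <- function(data, response, predictors) {
  current <- predictors
  best <- adj_r2(current, data, response)
  while (length(current) > 0) {
    # Score every model that omits exactly one of the remaining predictors
    candidates <- vapply(
      current,
      function(v) adj_r2(setdiff(current, v), data, response),
      numeric(1)
    )
    if (max(candidates) <= best) break  # no single removal helps: stop
    best <- max(candidates)
    current <- setdiff(current, names(which.max(candidates)))
  }
  list(terms = current, adj_r2 = best)
}

predictors <- c("cyl", "disp", "hp", "drat", "wt", "qsec")
result <- backward_adj_r2(mtcars, "mpg", predictors)
print(result)
```

The loop stops as soon as no single removal raises adjusted R square, so the result is a local optimum and not necessarily the best subset overall.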
We need to load both “tidyverse” and “openintro”; the latter contains the dataset we use. The data set is called evals and was gathered from end of semester student evaluations for a large sample of professors from the University of Texas at Austin. In addition, six students rated the professors’ physical appearance.
First we load the libraries, with library(tidyverse) and library(openintro).
To see the meaning of each variable, you can open the dataset's help page with ?evals.
Now that we know about our data, it is a good time to prepare it for our analysis.
Gender has two levels, male and female, and we are going to turn each of them into a separate 0/1 variable.
evals <- evals %>%
  mutate(male = ifelse(gender == "male", 1, 0)) %>%
  mutate(female = ifelse(gender == "female", 1, 0))
Next we add a quadratic term (the square of the percent of students in class who completed the evaluation) and a dichotomous vs. quantitative interaction term (the interaction between male and the average beauty rating of the professor). This step is optional.
evals <- evals %>%
  mutate(cls_perc_eval_sq = cls_perc_eval^2) %>%
  mutate(male_perc_beauty = male * bty_avg)
Let us start by building the full model.
df_lm_full <- lm(score ~ rank + male + female + ethnicity + language + age + cls_perc_eval
                 + cls_students + cls_level + cls_profs + cls_credits + bty_avg
                 + pic_outfit + pic_color + cls_perc_eval_sq + male_perc_beauty, data = evals)
summary(df_lm_full)
##
## Call:
## lm(formula = score ~ rank + male + female + ethnicity + language +
## age + cls_perc_eval + cls_students + cls_level + cls_profs +
## cls_credits + bty_avg + pic_outfit + pic_color + cls_perc_eval_sq +
## male_perc_beauty, data = evals)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.72712 -0.32545 0.07036 0.36050 0.94499
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.588e+00 4.302e-01 10.665 < 2e-16 ***
## ranktenure track -1.229e-01 8.361e-02 -1.470 0.14225
## ranktenured -4.983e-02 6.856e-02 -0.727 0.46774
## male -1.811e-01 1.597e-01 -1.134 0.25731
## female NA NA NA NA
## ethnicitynot minority 8.623e-02 7.985e-02 1.080 0.28075
## languagenon-english -2.597e-01 1.118e-01 -2.323 0.02064 *
## age -8.621e-03 3.151e-03 -2.736 0.00646 **
## cls_perc_eval -3.375e-03 9.329e-03 -0.362 0.71772
## cls_students 3.431e-04 3.800e-04 0.903 0.36719
## cls_levelupper 7.701e-02 5.779e-02 1.333 0.18335
## cls_profssingle -2.851e-02 5.197e-02 -0.549 0.58350
## cls_creditsone credit 5.528e-01 1.168e-01 4.733 2.98e-06 ***
## bty_avg -4.416e-03 2.450e-02 -0.180 0.85703
## pic_outfitnot formal -1.062e-01 7.350e-02 -1.444 0.14933
## pic_colorcolor -2.326e-01 7.155e-02 -3.251 0.00124 **
## cls_perc_eval_sq 6.045e-05 6.707e-05 0.901 0.36793
## male_perc_beauty 8.738e-02 3.369e-02 2.594 0.00980 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4951 on 446 degrees of freedom
## Multiple R-squared: 0.2001, Adjusted R-squared: 0.1714
## F-statistic: 6.974 on 16 and 446 DF, p-value: 2.142e-14
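The summary above reports one coefficient as "not defined because of singularities", and the female row is all NA. Since gender has only two levels, male + female equals 1 for every row, which is exactly the intercept column, so lm cannot estimate both. The sketch below reproduces this on made-up data (the variable names and values are illustrative, not taken from evals).

```r
# Made-up data: 20 women and 30 men, with a small shift in y for men
gender <- factor(rep(c("female", "male"), times = c(20, 30)))
male   <- ifelse(gender == "male", 1, 0)
female <- ifelse(gender == "female", 1, 0)

set.seed(1)
y <- 4 + 0.2 * male + rnorm(length(male), sd = 0.5)

# With both dummies the design matrix is rank deficient:
# intercept = male + female, so the last aliased term gets NA.
fit_both <- lm(y ~ male + female)
print(coef(fit_both))

# Dropping one dummy gives the same fitted values with no NA.
fit_one <- lm(y ~ male)
print(coef(fit_one))
```

Keeping only one of the two dummies (or passing the factor gender directly to lm) avoids the redundant column without changing the fit.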
About the coefficients
The following coefficients have a p-value > 0.05 and might be dropped: ranktenure track, ranktenured, male, ethnicitynot minority, cls_perc_eval, cls_students, cls_levelupper, cls_profssingle, bty_avg, pic_outfitnot formal, cls_perc_eval_sq. The female coefficient has no estimate at all, because it is redundant once male is in the model.
But as we said, we are not going to drop those variables by hand; instead we are going to apply the step function.
Now it is time to pass df_lm_full to step and summarise the model it returns, to see whether some variables are eliminated and whether multiple R square has improved.
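When reading the trace that follows, it may help to know where the AIC column comes from. For a linear model, step relies on extractAIC, which computes n * log(RSS / n) + 2 * edf, where edf is the number of estimated coefficients. The check below plugs in numbers read from the outputs in this post: the sample size is inferred as 446 residual degrees of freedom plus 17 estimated coefficients, and the RSS values are the rounded ones printed in the trace, so the results only match to about two decimals.

```r
# Values read from the printed output (rounded there, so this is approximate)
n        <- 446 + 17   # residual df + estimated coefficients (incl. intercept)
rss_full <- 109.31     # RSS on the <none> row of the first table
edf_full <- 17         # coefficients in the full model (female is aliased)

aic_full <- n * log(rss_full / n) + 2 * edf_full
print(round(aic_full, 2))   # the trace reports -634.37

# Dropping bty_avg removes one coefficient and raises RSS slightly
rss_drop <- 109.32
aic_drop <- n * log(rss_drop / n) + 2 * (edf_full - 1)
print(round(aic_drop, 2))   # the trace reports -636.33
```

A removal is accepted when the penalty saved (2 per coefficient) outweighs the increase in the n * log(RSS / n) term, which is why bty_avg goes first.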
## Start: AIC=-634.37
## score ~ rank + male + female + ethnicity + language + age + cls_perc_eval +
## cls_students + cls_level + cls_profs + cls_credits + bty_avg +
## pic_outfit + pic_color + cls_perc_eval_sq + male_perc_beauty
##
##
## Step: AIC=-634.37
## score ~ rank + male + ethnicity + language + age + cls_perc_eval +
## cls_students + cls_level + cls_profs + cls_credits + bty_avg +
## pic_outfit + pic_color + cls_perc_eval_sq + male_perc_beauty
##
## Df Sum of Sq RSS AIC
## - bty_avg 1 0.0080 109.32 -636.33
## - cls_perc_eval 1 0.0321 109.34 -636.23
## - rank 2 0.5315 109.84 -636.12
## - cls_profs 1 0.0738 109.38 -636.05
## - cls_perc_eval_sq 1 0.1991 109.51 -635.52
## - cls_students 1 0.1997 109.51 -635.52
## - ethnicity 1 0.2858 109.59 -635.16
## - male 1 0.3153 109.62 -635.03
## - cls_level 1 0.4352 109.74 -634.53
## <none> 109.31 -634.37
## - pic_outfit 1 0.5113 109.82 -634.21
## - language 1 1.3223 110.63 -630.80
## - male_perc_beauty 1 1.6491 110.96 -629.43
## - age 1 1.8350 111.14 -628.66
## - pic_color 1 2.5897 111.90 -625.52
## - cls_credits 1 5.4898 114.80 -613.68
##
## Step: AIC=-636.33
## score ~ rank + male + ethnicity + language + age + cls_perc_eval +
## cls_students + cls_level + cls_profs + cls_credits + pic_outfit +
## pic_color + cls_perc_eval_sq + male_perc_beauty
##
## Df Sum of Sq RSS AIC
## - cls_perc_eval 1 0.0311 109.35 -638.20
## - rank 2 0.5402 109.86 -638.05
## - cls_profs 1 0.0721 109.39 -638.03
## - cls_perc_eval_sq 1 0.1966 109.51 -637.50
## - cls_students 1 0.1985 109.52 -637.49
## - ethnicity 1 0.3180 109.63 -636.99
## - cls_level 1 0.4279 109.74 -636.52
## - male 1 0.4483 109.77 -636.44
## <none> 109.32 -636.33
## - pic_outfit 1 0.5059 109.82 -636.19
## - language 1 1.3184 110.64 -632.78
## - age 1 1.8330 111.15 -630.63
## - pic_color 1 2.7279 112.04 -626.92
## - male_perc_beauty 1 2.9541 112.27 -625.99
## - cls_credits 1 5.4822 114.80 -615.68
##
## Step: AIC=-638.2
## score ~ rank + male + ethnicity + language + age + cls_students +
## cls_level + cls_profs + cls_credits + pic_outfit + pic_color +
## cls_perc_eval_sq + male_perc_beauty
##
## Df Sum of Sq RSS AIC
## - rank 2 0.5137 109.86 -640.03
## - cls_profs 1 0.0678 109.42 -639.91
## - cls_students 1 0.1952 109.54 -639.37
## - ethnicity 1 0.3120 109.66 -638.88
## - cls_level 1 0.4059 109.75 -638.48
## - male 1 0.4320 109.78 -638.37
## <none> 109.35 -638.20
## - pic_outfit 1 0.5069 109.85 -638.06
## - language 1 1.3441 110.69 -634.54
## - age 1 1.8021 111.15 -632.63
## - cls_perc_eval_sq 1 2.6645 112.01 -629.05
## - pic_color 1 2.7584 112.11 -628.67
## - male_perc_beauty 1 2.9233 112.27 -627.98
## - cls_credits 1 5.4514 114.80 -617.67
##
## Step: AIC=-640.03
## score ~ male + ethnicity + language + age + cls_students + cls_level +
## cls_profs + cls_credits + pic_outfit + pic_color + cls_perc_eval_sq +
## male_perc_beauty
##
## Df Sum of Sq RSS AIC
## - cls_profs 1 0.0727 109.93 -641.72
## - cls_students 1 0.1744 110.03 -641.30
## - cls_level 1 0.3614 110.22 -640.51
## - pic_outfit 1 0.3952 110.26 -640.37
## - ethnicity 1 0.4181 110.28 -640.27
## <none> 109.86 -640.03
## - male 1 0.6766 110.54 -639.19
## - age 1 1.3761 111.24 -636.27
## - language 1 1.7146 111.58 -634.86
## - pic_color 1 2.6396 112.50 -631.04
## - cls_perc_eval_sq 1 2.6617 112.52 -630.95
## - male_perc_beauty 1 3.7017 113.56 -626.69
## - cls_credits 1 6.6258 116.49 -614.92
##
## Step: AIC=-641.72
## score ~ male + ethnicity + language + age + cls_students + cls_level +
## cls_credits + pic_outfit + pic_color + cls_perc_eval_sq +
## male_perc_beauty
##
## Df Sum of Sq RSS AIC
## - cls_students 1 0.2107 110.14 -642.84
## - pic_outfit 1 0.3420 110.28 -642.29
## - cls_level 1 0.3619 110.30 -642.20
## <none> 109.93 -641.72
## - ethnicity 1 0.5051 110.44 -641.60
## - male 1 0.6580 110.59 -640.96
## - age 1 1.3755 111.31 -637.97
## - language 1 1.6813 111.61 -636.70
## - cls_perc_eval_sq 1 2.6049 112.54 -632.88
## - pic_color 1 2.7450 112.68 -632.30
## - male_perc_beauty 1 3.6425 113.58 -628.63
## - cls_credits 1 6.9075 116.84 -615.51
##
## Step: AIC=-642.84
## score ~ male + ethnicity + language + age + cls_level + cls_credits +
## pic_outfit + pic_color + cls_perc_eval_sq + male_perc_beauty
##
## Df Sum of Sq RSS AIC
## - cls_level 1 0.2477 110.39 -643.80
## <none> 110.14 -642.84
## - ethnicity 1 0.5173 110.66 -642.67
## - pic_outfit 1 0.5979 110.74 -642.33
## - male 1 0.7530 110.90 -641.68
## - age 1 1.4493 111.59 -638.78
## - language 1 1.8342 111.98 -637.19
## - cls_perc_eval_sq 1 2.3996 112.54 -634.86
## - pic_color 1 2.5827 112.73 -634.11
## - male_perc_beauty 1 4.0430 114.19 -628.15
## - cls_credits 1 6.7173 116.86 -617.43
##
## Step: AIC=-643.8
## score ~ male + ethnicity + language + age + cls_credits + pic_outfit +
## pic_color + cls_perc_eval_sq + male_perc_beauty
##
## Df Sum of Sq RSS AIC
## <none> 110.39 -643.80
## - pic_outfit 1 0.6547 111.05 -643.06
## - ethnicity 1 0.6643 111.06 -643.02
## - male 1 0.7404 111.13 -642.70
## - age 1 1.3608 111.75 -640.12
## - language 1 1.6417 112.03 -638.96
## - pic_color 1 2.3474 112.74 -636.06
## - cls_perc_eval_sq 1 2.6554 113.05 -634.79
## - male_perc_beauty 1 3.9090 114.30 -629.69
## - cls_credits 1 6.5638 116.96 -619.06
##
## Call:
## lm(formula = score ~ male + ethnicity + language + age + cls_credits +
## pic_outfit + pic_color + cls_perc_eval_sq + male_perc_beauty,
## data = evals)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.80230 -0.31925 0.07448 0.37398 0.91811
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.306e+00 1.824e-01 23.605 < 2e-16 ***
## male -1.928e-01 1.106e-01 -1.743 0.08200 .
## ethnicitynot minority 1.236e-01 7.484e-02 1.651 0.09943 .
## languagenon-english -2.723e-01 1.049e-01 -2.596 0.00975 **
## age -6.179e-03 2.615e-03 -2.363 0.01854 *
## cls_creditsone credit 5.362e-01 1.033e-01 5.190 3.18e-07 ***
## pic_outfitnot formal -1.088e-01 6.635e-02 -1.639 0.10188
## pic_colorcolor -2.011e-01 6.479e-02 -3.104 0.00203 **
## cls_perc_eval_sq 3.393e-05 1.028e-05 3.301 0.00104 **
## male_perc_beauty 8.878e-02 2.217e-02 4.005 7.24e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4937 on 453 degrees of freedom
## Multiple R-squared: 0.1922, Adjusted R-squared: 0.1761
## F-statistic: 11.97 on 9 and 453 DF, p-value: < 2.2e-16
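Multiple R square has not improved: it fell slightly, from 0.2001 to 0.1922, since removing terms can never raise it. The changes in the residual standard error (0.4951 to 0.4937) and in adjusted R square (0.1714 to 0.1761) are small, but backward elimination gives us essentially the same accuracy with fewer variables: 9 coefficients besides the intercept instead of 16. The step function removes, on its own, every variable whose removal lowers the AIC, although a few of the terms it keeps (male, ethnicity, pic_outfit) still have p-values above 0.05. That is one of the reasons to use stepwise elimination when working on multiple regression.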
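The same before/after comparison can be put side by side in a small table. The sketch below is an illustration on the built-in mtcars data, not a rerun of the evals analysis; direction = "backward" and trace = 0 are arguments of step, and the object names full, reduced and comparison are chosen for this example.

```r
# Full model on mtcars, then backward elimination without printing the trace
full    <- lm(mpg ~ cyl + disp + hp + drat + wt + qsec, data = mtcars)
reduced <- step(full, direction = "backward", trace = 0)

# Collect the quantities discussed above for both models
comparison <- data.frame(
  model  = c("full", "reduced"),
  n_coef = c(length(coef(full)), length(coef(reduced))),
  r2     = c(summary(full)$r.squared, summary(reduced)$r.squared),
  adj_r2 = c(summary(full)$adj.r.squared, summary(reduced)$adj.r.squared),
  rse    = c(summary(full)$sigma, summary(reduced)$sigma),
  aic    = c(extractAIC(full)[2], extractAIC(reduced)[2])
)
print(comparison)
```

Multiple R square can only go down or stay the same in the reduced row, while AIC can only go down or stay the same, which mirrors what happened with evals.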