In this blog, we are going to do residual analysis.The residual analysis help us to fugue if a linear model is appropriate to a given data set. Prior to start, you will need to load both “tidyverse” and “openintro” where the dataset used is found. The data set is called evals, gathered from end of semester student evaluations for a large sample of professors from the University of Texas at Austin. In addition, six students rated the professors’ physical appearance.
Let first load the libraries…
As libraries are loaded, we are going to proceed with data exploration to help us getting ideas on the data we are using.
To know the meaning of each variable, you can run the code below..
Know that we know about our data, it is a good time to prepare it for our analysis.
We have a dichotomous variable (a variable that takes only one of two possibilities): Gender which has two levels, male and female, and we are going to make each one the possibilities to be a separate variable.
evals <- evals %>%
mutate("male" = ifelse(evals$gender == "male", 1, 0)) %>%
mutate("female" = ifelse(evals$gender == "female", 1, 0))Adding quadratic term (square Percent of students in class who completed evaluation), and dichotomous vs. quantitative interaction term (interaction between male and Average beauty rating of professor )
evals <- evals %>%
mutate("cls_perc_eval_sq" = cls_perc_eval^2) %>%
mutate("male_perc_beauty" = male * bty_avg)Now it is a good time for us to build the model.
Let’s run the full model…
df_lm_full <- lm(score ~ rank + male + female + ethnicity + language + age + cls_perc_eval
+ cls_students + cls_level + cls_profs + cls_credits + bty_avg
+ pic_outfit + pic_color + cls_perc_eval_sq + male_perc_beauty, data = evals)
summary(df_lm_full)##
## Call:
## lm(formula = score ~ rank + male + female + ethnicity + language +
## age + cls_perc_eval + cls_students + cls_level + cls_profs +
## cls_credits + bty_avg + pic_outfit + pic_color + cls_perc_eval_sq +
## male_perc_beauty, data = evals)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.72712 -0.32545 0.07036 0.36050 0.94499
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.588e+00 4.302e-01 10.665 < 2e-16 ***
## ranktenure track -1.229e-01 8.361e-02 -1.470 0.14225
## ranktenured -4.983e-02 6.856e-02 -0.727 0.46774
## male -1.811e-01 1.597e-01 -1.134 0.25731
## female NA NA NA NA
## ethnicitynot minority 8.623e-02 7.985e-02 1.080 0.28075
## languagenon-english -2.597e-01 1.118e-01 -2.323 0.02064 *
## age -8.621e-03 3.151e-03 -2.736 0.00646 **
## cls_perc_eval -3.375e-03 9.329e-03 -0.362 0.71772
## cls_students 3.431e-04 3.800e-04 0.903 0.36719
## cls_levelupper 7.701e-02 5.779e-02 1.333 0.18335
## cls_profssingle -2.851e-02 5.197e-02 -0.549 0.58350
## cls_creditsone credit 5.528e-01 1.168e-01 4.733 2.98e-06 ***
## bty_avg -4.416e-03 2.450e-02 -0.180 0.85703
## pic_outfitnot formal -1.062e-01 7.350e-02 -1.444 0.14933
## pic_colorcolor -2.326e-01 7.155e-02 -3.251 0.00124 **
## cls_perc_eval_sq 6.045e-05 6.707e-05 0.901 0.36793
## male_perc_beauty 8.738e-02 3.369e-02 2.594 0.00980 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4951 on 446 degrees of freedom
## Multiple R-squared: 0.2001, Adjusted R-squared: 0.1714
## F-statistic: 6.974 on 16 and 446 DF, p-value: 2.142e-14
About the coefficients
The following coefficients have the p-value > 0.05 and might be dropped: ranktenure, ranktenured, ethnicitynot minority, cls_perc_eval, cls_students, cls_levelupper, cls_profssingle, bty_avg, pic_outfitnot formal, cls_perc_eval_sq
Since I run the full model, now let do the backward eliminating to adjust the model before residual analysis. There is a specific function that can help us to do backward elemination (step function) but for this case as we mentioned, we are just going to drop the coefficients that we think, according to their p_value, don’t have a big impact in the model.
df_lm_back <- lm(score ~ language + age +
+ cls_credits
+ pic_color + male_perc_beauty, data = evals)
summary(df_lm_back)##
## Call:
## lm(formula = score ~ language + age + +cls_credits + pic_color +
## male_perc_beauty, data = evals)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.72867 -0.33006 0.05769 0.38667 0.95560
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.624001 0.129915 35.593 < 2e-16 ***
## languagenon-english -0.287925 0.098389 -2.926 0.00360 **
## age -0.007661 0.002403 -3.188 0.00153 **
## cls_creditsone credit 0.489696 0.099647 4.914 1.24e-06 ***
## pic_colorcolor -0.281230 0.062822 -4.477 9.59e-06 ***
## male_perc_beauty 0.058623 0.009930 5.903 6.95e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5017 on 457 degrees of freedom
## Multiple R-squared: 0.1582, Adjusted R-squared: 0.149
## F-statistic: 17.17 on 5 and 457 DF, p-value: 1.4e-15
Multiple r square has improved but there is not any big improvement on the rse nor the adjusted r square
There are two conditions that we are going to be focus on here for residual analysis: Nearly normal residuals and linearity.
Nearly normal residuals
Let’s plot an histogram…
The histogram shows that the distribution is not really normal, looks left skewed.
Or a normal probability plot of residuals…
As per Q-Q plot, the residuals distribution is not really normal.
Linearity
ggplot(data = df_lm_full, aes(x = .fitted, y = .resid)) +
geom_point()+
geom_hline(yintercept = 0, linetype = "dashed") +
xlab("Fitted values") +
ylab("Residuals")As we look at on the graph, we can see that there is a pattern in residuals and a linear model wouldn’t be stronger and efficient
From the diagnostics, we can say that the linear model wouldn’t be more appropriate. or will need to make more adjustment with the variables.