MATH 141 HW12 Lab/R Component Homework

Johnny Mendoza FL1

Notes

Before we can answer any questions, we load the data.

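# Download the evaluation data (skipped if the file is already present), then load `evals`.
if (!file.exists("evals.RData"))
    download.file("http://www.openintro.org/stat/data/evals.RData", destfile = "evals.RData")
load("evals.RData")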

Question 1

Question:

Picking up from the end of Lab 12, let's consider the multiple regression model from Exercise 11. Report and interpret the coefficient associated with the ethnicity variable.

Answer: The ethnicity variable (term ethnicitynot minority) has a coefficient of ~ 0.12. Holding all other variables in the model constant, a professor who is not of an ethnic minority is predicted to score about 0.12 points higher on average than a professor who is. The p-value associated with this coefficient is ~ 0.117, which is above the usual 0.05 cutoff, so the coefficient is not statistically significant (at least not in this model).

m_full <- lm(score ~ rank + ethnicity + gender + language + age + cls_perc_eval + 
    cls_students + cls_level + cls_profs + cls_credits + bty_avg + pic_outfit + 
    pic_color, data = evals)
summary(m_full)
## 
## Call:
## lm(formula = score ~ rank + ethnicity + gender + language + age + 
##     cls_perc_eval + cls_students + cls_level + cls_profs + cls_credits + 
##     bty_avg + pic_outfit + pic_color, data = evals)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.7740 -0.3243  0.0907  0.3518  0.9504 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            4.095214   0.290528   14.10  < 2e-16 ***
## ranktenure track      -0.147593   0.082067   -1.80  0.07278 .  
## ranktenured           -0.097338   0.066330   -1.47  0.14295    
## ethnicitynot minority  0.123493   0.078627    1.57  0.11698    
## gendermale             0.210948   0.051823    4.07  5.5e-05 ***
## languagenon-english   -0.229811   0.111375   -2.06  0.03965 *  
## age                   -0.009007   0.003136   -2.87  0.00427 ** 
## cls_perc_eval          0.005327   0.001539    3.46  0.00059 ***
## cls_students           0.000455   0.000377    1.20  0.22896    
## cls_levelupper         0.060514   0.057562    1.05  0.29369    
## cls_profssingle       -0.014662   0.051988   -0.28  0.77806    
## cls_creditsone credit  0.502043   0.115939    4.33  1.8e-05 ***
## bty_avg                0.040033   0.017506    2.29  0.02267 *  
## pic_outfitnot formal  -0.112682   0.073880   -1.53  0.12792    
## pic_colorcolor        -0.217263   0.071502   -3.04  0.00252 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.498 on 448 degrees of freedom
## Multiple R-squared:  0.187,  Adjusted R-squared:  0.162 
## F-statistic: 7.37 on 14 and 448 DF,  p-value: 6.55e-14
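
The ethnicity row in the output above can be checked by hand. The sketch below is only an illustration: it re-enters the rounded estimate, standard error, and residual degrees of freedom printed in the summary (the names `est`, `se`, and `df_resid` are made up for this example) and recomputes the t value, the two-sided p-value, and a 95% confidence interval.

```r
# Values copied from the printed summary(m_full) row "ethnicitynot minority".
est      <- 0.123493   # Estimate
se       <- 0.078627   # Std. Error
df_resid <- 448        # residual degrees of freedom

t_value <- est / se
p_value <- 2 * pt(-abs(t_value), df = df_resid)            # two-sided p-value
ci_95   <- est + c(-1, 1) * qt(0.975, df = df_resid) * se  # 95% confidence interval

print(round(c(t = t_value, p = p_value), 4))
print(round(ci_95, 3))
```

Because the p-value is above 0.05, the interval includes 0. With the fitted model in the session, `coef(summary(m_full))["ethnicitynot minority", ]` and `confint(m_full, "ethnicitynot minority")` should report the same quantities from the unrounded fit.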

Question 2

Question:

Drop the variable with the highest p-value and re-fit the model. Did the coefficients and significance of the other explanatory variables change? (One of the things that makes multiple regression interesting is that coefficient estimates depend on the other variables that are included in the model.) If not, what does this say about whether or not the dropped variable was collinear with the other explanatory variables?

Answer: cls_profs had the highest p-value (~ 0.778), so it was dropped for this revised model. The coefficients and significance of the other explanatory variables did change, but only by minuscule amounts (for example, the p-value for ethnicity moved from ~ 0.117 to ~ 0.100). This indicates that the dropped variable was not strongly collinear with the other explanatory variables.

m_drop <- lm(score ~ rank + ethnicity + gender + language + age + cls_perc_eval + 
    cls_students + cls_level + cls_credits + bty_avg + pic_outfit + pic_color, 
    data = evals)
summary(m_drop)
## 
## Call:
## lm(formula = score ~ rank + ethnicity + gender + language + age + 
##     cls_perc_eval + cls_students + cls_level + cls_credits + 
##     bty_avg + pic_outfit + pic_color, data = evals)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.7836 -0.3257  0.0859  0.3513  0.9551 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            4.087252   0.288856   14.15  < 2e-16 ***
## ranktenure track      -0.147675   0.081982   -1.80  0.07233 .  
## ranktenured           -0.097383   0.066261   -1.47  0.14235    
## ethnicitynot minority  0.127446   0.077289    1.65  0.09986 .  
## gendermale             0.210123   0.051687    4.07  5.7e-05 ***
## languagenon-english   -0.228289   0.111131   -2.05  0.04053 *  
## age                   -0.008999   0.003133   -2.87  0.00426 ** 
## cls_perc_eval          0.005289   0.001532    3.45  0.00061 ***
## cls_students           0.000469   0.000374    1.25  0.21038    
## cls_levelupper         0.060637   0.057501    1.05  0.29220    
## cls_creditsone credit  0.506120   0.114916    4.40  1.3e-05 ***
## bty_avg                0.039863   0.017478    2.28  0.02303 *  
## pic_outfitnot formal  -0.108323   0.072171   -1.50  0.13408    
## pic_colorcolor        -0.219053   0.071147   -3.08  0.00221 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.497 on 449 degrees of freedom
## Multiple R-squared:  0.187,  Adjusted R-squared:  0.163 
## F-statistic: 7.94 on 13 and 449 DF,  p-value: 2.34e-14
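
To make the size of the changes concrete, the estimates from the two printed summaries can be placed side by side. This is a sketch that re-enters the rounded values shown above; the data frame name `coefs` and its column names are chosen for this example only.

```r
# Estimates transcribed from the two printed summaries. cls_profssingle is
# left out because it appears only in the full model.
coefs <- data.frame(
  term = c("(Intercept)", "ranktenure track", "ranktenured",
           "ethnicitynot minority", "gendermale", "languagenon-english",
           "age", "cls_perc_eval", "cls_students", "cls_levelupper",
           "cls_creditsone credit", "bty_avg", "pic_outfitnot formal",
           "pic_colorcolor"),
  m_full = c(4.095214, -0.147593, -0.097338, 0.123493, 0.210948, -0.229811,
             -0.009007, 0.005327, 0.000455, 0.060514, 0.502043, 0.040033,
             -0.112682, -0.217263),
  m_drop = c(4.087252, -0.147675, -0.097383, 0.127446, 0.210123, -0.228289,
             -0.008999, 0.005289, 0.000469, 0.060637, 0.506120, 0.039863,
             -0.108323, -0.219053)
)
coefs$change <- coefs$m_drop - coefs$m_full

# Sort so the largest absolute changes come first.
coefs <- coefs[order(-abs(coefs$change)), ]
print(coefs, row.names = FALSE)
```

With both models in the session, `coef(m_drop) - coef(m_full)[names(coef(m_drop))]` gives the same comparison from the unrounded fits.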

Question 3

Question:

Using backward-selection and p-value as the selection criterion, determine the best model. You do not need to show all steps in your answer, just the output for the final model. Also, write out the linear model for predicting score based on the final model you settle on.

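Answer: The regression equation for the final revised model is: score = 3.77 + (0.17 * ethnicity) + (0.21 * gender) + (-0.21 * language) + (-0.006 * age) + (0.005 * cls_perc_eval) + (0.51 * cls_credits) + (0.05 * bty_avg) + (-0.19 * pic_color), where ethnicity = 1 for not minority, gender = 1 for male, language = 1 for non-english, cls_credits = 1 for one credit, and pic_color = 1 for color (each is 0 otherwise).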

m_drop7 <- lm(score ~ ethnicity + gender + language + age + cls_perc_eval + 
    cls_credits + bty_avg + pic_color, data = evals)
summary(m_drop7)
## 
## Call:
## lm(formula = score ~ ethnicity + gender + language + age + cls_perc_eval + 
##     cls_credits + bty_avg + pic_color, data = evals)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.8532 -0.3239  0.0998  0.3793  0.9361 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            3.77192    0.23205   16.25  < 2e-16 ***
## ethnicitynot minority  0.16787    0.07528    2.23   0.0262 *  
## gendermale             0.20711    0.05013    4.13  4.3e-05 ***
## languagenon-english   -0.20618    0.10364   -1.99   0.0473 *  
## age                   -0.00605    0.00261   -2.31   0.0211 *  
## cls_perc_eval          0.00466    0.00144    3.24   0.0013 ** 
## cls_creditsone credit  0.50531    0.10412    4.85  1.7e-06 ***
## bty_avg                0.05107    0.01693    3.02   0.0027 ** 
## pic_colorcolor        -0.19058    0.06735   -2.83   0.0049 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.499 on 454 degrees of freedom
## Multiple R-squared:  0.172,  Adjusted R-squared:  0.158 
## F-statistic: 11.8 on 8 and 454 DF,  p-value: 2.58e-15
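
The intermediate steps of the backward selection are not shown above. As a rough illustration of the idea (not the procedure actually used for this homework), the sketch below repeatedly drops the term with the largest p-value until every remaining p-value is at or below a cutoff. It runs on simulated data; the function `backward_p`, the data frame `sim`, and its columns are hypothetical names. It relies on `drop1(..., test = "F")`, which tests a multi-level factor such as rank as a single term, whereas `summary()` reports one p-value per level.

```r
# Backward elimination using p-values from single-term F tests.
backward_p <- function(formula, data, alpha = 0.05) {
  model <- lm(formula, data = data)
  repeat {
    # Stop if only the intercept is left.
    if (length(attr(terms(model), "term.labels")) == 0) break
    tests      <- drop1(model, test = "F")
    p_vals     <- tests[["Pr(>F)"]][-1]   # first row is "<none>"
    candidates <- rownames(tests)[-1]
    if (max(p_vals) <= alpha) break       # every remaining term is significant
    worst <- candidates[which.max(p_vals)]
    model <- update(model, as.formula(paste(". ~ . -", worst)))
  }
  model
}

# Simulated data: x1 and x2 affect y, noise1 and noise2 do not.
set.seed(141)
n   <- 200
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n),
                  noise1 = rnorm(n), noise2 = rnorm(n))
sim$y <- 1 + 2 * sim$x1 - 1.5 * sim$x2 + rnorm(n)

final <- backward_p(y ~ x1 + x2 + noise1 + noise2, data = sim)
print(formula(final))
```

On the homework data the analogous call would pass the full model formula and `data = evals`. The result need not match m_drop7 exactly, because of the factor handling noted above.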

Question 4

Question:

Verify that the conditions for this model are reasonable using diagnostic plots.

Answer: The conditions for using linear regression to model this data are not satisfied. The residuals are not centered around 0 and do not have constant variability. A histogram of the residuals shows that they are left-skewed. A Q-Q plot of the residuals further shows that they are not normally distributed. Any predictions made using this model are likely not valid.

plot(m_drop7$residuals ~ evals$score)
abline(h = 0, lty = 3)

[Figure: residuals plotted against evaluation score, with a dotted horizontal line at 0]

hist(m_drop7$residuals)

[Figure: histogram of the residuals]

qqnorm(m_drop7$residuals)
qqline(m_drop7$residuals)

[Figure: normal Q-Q plot of the residuals with reference line]
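
The first plot above puts the observed score on the x-axis. In a least-squares fit with an intercept, the residuals are uncorrelated with the fitted values but positively correlated with the observed response, so a residuals-versus-response plot can show a trend even when the model is adequate. The sketch below demonstrates this on simulated data; `x`, `y`, and `fit` are illustrative names and are not part of the evals data.

```r
# Simulated data that satisfies the linear model conditions.
set.seed(141)
x   <- runif(150, 0, 10)
y   <- 3 + 0.5 * x + rnorm(150)
fit <- lm(y ~ x)

res <- resid(fit)
r2  <- summary(fit)$r.squared

cor_with_fitted   <- cor(res, fitted(fit))  # zero up to rounding error
cor_with_response <- cor(res, y)            # equals sqrt(1 - R^2)
print(c(fitted = cor_with_fitted, response = cor_with_response,
        check = sqrt(1 - r2)))

# Conventional diagnostic: residuals against fitted values.
plot(fitted(fit), res, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 3)
```

For the model in this homework, the corresponding call would be `plot(m_drop7$residuals ~ m_drop7$fitted.values)`.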

Question 5

Question:

The original paper describes how these data were gathered by taking a sample of professors from the University of Texas at Austin and including all courses that they have taught. Considering that each row represents a course, could this new information have an impact on any of the conditions of linear regression?

Answer: Yes; each row of data corresponds to a particular course rather than a particular professor, so a professor who taught several courses appears in several rows. This is an example of pseudoreplication, where observations are treated as independent even though they were collected from the same subject. It violates the independence condition of linear regression, which makes the data seriously flawed for this analysis. The lack of independence may also help explain why the conditions checked in Question 4 were not met.

head(evals)
##   score         rank    ethnicity gender language age cls_perc_eval
## 1   4.7 tenure track     minority female  english  36         55.81
## 2   4.1 tenure track     minority female  english  36         68.80
## 3   3.9 tenure track     minority female  english  36         60.80
## 4   4.8 tenure track     minority female  english  36         62.60
## 5   4.6      tenured not minority   male  english  59         85.00
## 6   4.3      tenured not minority   male  english  59         87.50
##   cls_did_eval cls_students cls_level cls_profs  cls_credits bty_f1lower
## 1           24           43     upper    single multi credit           5
## 2           86          125     upper    single multi credit           5
## 3           76          125     upper    single multi credit           5
## 4           77          123     upper    single multi credit           5
## 5           17           20     upper  multiple multi credit           4
## 6           35           40     upper  multiple multi credit           4
##   bty_f1upper bty_f2upper bty_m1lower bty_m1upper bty_m2upper bty_avg
## 1           7           6           2           4           6       5
## 2           7           6           2           4           6       5
## 3           7           6           2           4           6       5
## 4           7           6           2           4           6       5
## 5           4           2           2           3           3       3
## 6           4           2           2           3           3       3
##   pic_outfit pic_color
## 1 not formal     color
## 2 not formal     color
## 3 not formal     color
## 4 not formal     color
## 5 not formal     color
## 6 not formal     color
tail(evals)
##     score         rank    ethnicity gender    language age cls_perc_eval
## 458   4.1 tenure track not minority   male     english  32         42.86
## 459   4.5 tenure track not minority   male     english  32         60.47
## 460   3.5 tenure track     minority female non-english  42         57.14
## 461   4.4 tenure track     minority female non-english  42         77.61
## 462   4.4 tenure track     minority female non-english  42         81.82
## 463   4.1 tenure track     minority female non-english  42         80.00
##     cls_did_eval cls_students cls_level cls_profs  cls_credits bty_f1lower
## 458            9           21     lower  multiple multi credit           6
## 459           52           86     upper  multiple multi credit           6
## 460           48           84     upper  multiple multi credit           3
## 461           52           67     upper  multiple multi credit           3
## 462           54           66     upper  multiple multi credit           3
## 463           28           35     lower  multiple   one credit           3
##     bty_f1upper bty_f2upper bty_m1lower bty_m1upper bty_m2upper bty_avg
## 458           6           9           7           8           5   6.833
## 459           6           9           7           8           5   6.833
## 460           8           7           4           6           4   5.333
## 461           8           7           4           6           4   5.333
## 462           8           7           4           6           4   5.333
## 463           8           7           4           6           4   5.333
##     pic_outfit pic_color
## 458 not formal     color
## 459 not formal     color
## 460 not formal     color
## 461 not formal     color
## 462 not formal     color
## 463 not formal     color
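
The repeated professor-level values in the output above can be counted directly. The sketch below re-enters a few columns from the six rows printed by head(evals) and groups rows that share the same professor attributes. The names `first_rows`, `prof_key`, and `rows_per_prof` are made up for this example, and matching on attributes is only a proxy, since two different professors could share the same values.

```r
# A few columns from the six rows printed by head(evals).
first_rows <- data.frame(
  rank    = c(rep("tenure track", 4), rep("tenured", 2)),
  gender  = c(rep("female", 4), rep("male", 2)),
  age     = c(36, 36, 36, 36, 59, 59),
  bty_avg = c(5, 5, 5, 5, 3, 3),
  score   = c(4.7, 4.1, 3.9, 4.8, 4.6, 4.3)
)

# Rows with identical professor-level attributes are treated as one professor.
prof_key      <- with(first_rows, paste(rank, gender, age, bty_avg, sep = " | "))
rows_per_prof <- table(prof_key)
print(rows_per_prof)

# Scores vary within each apparent professor, one course per row.
print(tapply(first_rows$score, prof_key, range))
```

The same `paste()` approach could be applied to the professor-level columns of the full `evals` data frame to estimate how many distinct professors account for the 463 rows.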

Question 6

Question:

Based on your final model, describe the characteristics of a professor and course at University of Texas at Austin that would be associated with a high evaluation score.

Answer: Although using a regression model for this data is not appropriate, and thus any predictions made from it are not valid, there are several variables that have a statistically significant association with evaluation score (according to the model). The final model includes 8 variables: ethnicity, gender, language, age, cls_perc_eval, cls_credits, bty_avg, and pic_color. According to the model, the highest predicted evaluation score would belong to a young, attractive male professor who is not an ethnic minority, attended an English-language school, and used a black & white photo, teaching a one-credit course in which a high percentage of students completed the evaluation. (The coefficients for languagenon-english and pic_colorcolor are negative, and the coefficient for cls_creditsone credit is positive.)
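
As a worked example of this profile, the sketch below plugs one set of values into the rounded coefficients printed in the summary of the final model. The numeric inputs (age 30, 100 percent evaluation completion, beauty rating 8) are hypothetical values picked for illustration, and `coefs` and `profile` are names made up here.

```r
# Rounded estimates from the printed summary(m_drop7).
coefs <- c(
  "(Intercept)"           =  3.77192,
  "ethnicitynot minority" =  0.16787,
  "gendermale"            =  0.20711,
  "languagenon-english"   = -0.20618,
  "age"                   = -0.00605,
  "cls_perc_eval"         =  0.00466,
  "cls_creditsone credit" =  0.50531,
  "bty_avg"               =  0.05107,
  "pic_colorcolor"        = -0.19058
)

# Hypothetical professor: not minority, male, English-language school,
# age 30, 100% completion, one-credit course, bty_avg 8, black & white photo.
profile <- c(
  "(Intercept)"           = 1,
  "ethnicitynot minority" = 1,
  "gendermale"            = 1,
  "languagenon-english"   = 0,
  "age"                   = 30,
  "cls_perc_eval"         = 100,
  "cls_creditsone credit" = 1,
  "bty_avg"               = 8,
  "pic_colorcolor"        = 0
)

predicted <- sum(coefs * profile[names(coefs)])
print(predicted)
```

With the fitted model in the session, `predict(m_drop7, newdata = ...)` on a one-row data frame with matching factor levels would do the same calculation using the unrounded coefficients.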

Question 7

Question:

Would you be comfortable generalizing your conclusions to apply to professors generally (at any university)? Why or why not?

Answer: No; the conditions for using linear regression to model the data were not met. The predictions made from the model are not valid even within the University of Texas at Austin, let alone at any other university. In addition, the sample includes only professors from that one university, so the data cannot be used to extrapolate reasonable predictions about professors in general.