Simple linear regression

http://htmlpreview.github.io/?https://github.com/andrewpbray/oiLabs-base-R/blob/master/multiple_regression/multiple_regression.html

download.file("http://www.openintro.org/stat/data/evals.RData", destfile = "evals.RData")
load("evals.RData")
head(evals)
##   score         rank    ethnicity gender language age cls_perc_eval
## 1   4.7 tenure track     minority female  english  36      55.81395
## 2   4.1 tenure track     minority female  english  36      68.80000
## 3   3.9 tenure track     minority female  english  36      60.80000
## 4   4.8 tenure track     minority female  english  36      62.60163
## 5   4.6      tenured not minority   male  english  59      85.00000
## 6   4.3      tenured not minority   male  english  59      87.50000
##   cls_did_eval cls_students cls_level cls_profs  cls_credits bty_f1lower
## 1           24           43     upper    single multi credit           5
## 2           86          125     upper    single multi credit           5
## 3           76          125     upper    single multi credit           5
## 4           77          123     upper    single multi credit           5
## 5           17           20     upper  multiple multi credit           4
## 6           35           40     upper  multiple multi credit           4
##   bty_f1upper bty_f2upper bty_m1lower bty_m1upper bty_m2upper bty_avg
## 1           7           6           2           4           6       5
## 2           7           6           2           4           6       5
## 3           7           6           2           4           6       5
## 4           7           6           2           4           6       5
## 5           4           2           2           3           3       3
## 6           4           2           2           3           3       3
##   pic_outfit pic_color
## 1 not formal     color
## 2 not formal     color
## 3 not formal     color
## 4 not formal     color
## 5 not formal     color
## 6 not formal     color

Exercise 1 - Beauty Study

This is an observational study. Causality can’t be shown, so this study couldn’t claim that beauty leads directly to differences in course evaluations. There are many factors that could affect a student’s evaluation of the course that are not included in the data. The study could show if there is an association between the rating of a professor’s beauty and the evaluation score of the course.

Exercise 2 - Evaluation score

The distribution of the course scores is left skewed with low outliers. I didn’t have an expectation about the distribution. I think the distribution is left skewed, rather than normally distributed, because it’s human nature to give a good score to someone you’ve had a relationship with for several months.

boxplot(evals$score, col = 'plum')

hist(evals$score, col = 'mediumorchid')

Exercise 3 - Age and Beauty

There is a moderate negative correlation between age and beauty. The average beauty score decreases as the age of a professor increases.

m1 <- lm(bty_avg ~ age, data = evals)
summary(m1)
## 
## Call:
## lm(formula = bty_avg ~ age, data = evals)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4326 -1.1254 -0.1971  0.6859  3.8724 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  6.713283   0.341083  19.682  < 2e-16 ***
## age         -0.047461   0.006912  -6.866 2.14e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.456 on 461 degrees of freedom
## Multiple R-squared:  0.09278,    Adjusted R-squared:  0.09082 
## F-statistic: 47.15 on 1 and 461 DF,  p-value: 2.137e-11
plot(evals$age, evals$bty_avg, main = "Age v. Beauty",
     ylab = "Beauty (avg)", xlab = "Age",
     pch = 23, cex=2, col="purple", bg="plum", lwd=2)
abline(m1, col = "magenta4", lwd=2.5)

plot(evals$score ~ evals$bty_avg, pch=10, col="mediumorchid4",
     main = "Score x Beauty", ylab = "score", xlab = " avg. beauty")

Exercise 4 - Beauty score

The number of observations in the data set is 463. There are duplicate (x, y)-tuples in the data, duplicate data points aren’t visible on the plot.

plot(jitter(evals$score) ~ jitter(evals$bty_avg), 
     pch=10, col=c("mediumorchid4","mediumvioletred", "pink3"),         #using several colors for distinctivity
     main = "Score x Beauty", ylab = "score", xlab = " avg. beauty")

Exercise 5 - Beauty trend

There’s a small positive correlation between average beauty and score, the slope of the regression line is 0.067. The p-value for bty_avg is small, it’s a statistically significant predictor. The adjusted R2 value = 0.033 is small. The SE = 0.535 is large, this isn’t a very good prediction model.

m_bty <- lm(score ~ bty_avg, data = evals)
s1 <- summary(m_bty)
s1
## 
## Call:
## lm(formula = score ~ bty_avg, data = evals)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.9246 -0.3690  0.1420  0.3977  0.9309 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.88034    0.07614   50.96  < 2e-16 ***
## bty_avg      0.06664    0.01629    4.09 5.08e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5348 on 461 degrees of freedom
## Multiple R-squared:  0.03502,    Adjusted R-squared:  0.03293 
## F-statistic: 16.73 on 1 and 461 DF,  p-value: 5.083e-05
plot(jitter(evals$score) ~ jitter(evals$bty_avg), 
     pch=10, col=c("mediumorchid4","mediumvioletred", "pink3"),
     main = "Score x Beauty", ylab = "score", xlab = " avg. beauty")
abline(m_bty, col = "magenta1", lwd=2.5)

Exercise 6 - Residual beauty

The residuals are normally distributed with a bit of left skewing. The residuals v. fitted plot doesn’t indicate any non-linearity. The scale location plot has a horizontal line with randomly spread points indicating good homoscedasticity, and there are no influencing outliers.

plot(m_bty$residuals ~ evals$bty_avg, col='plum', pch = 19,  ylab = "residuals", xlab = "avg. beauty")
abline(h = 0, lty = 3, col='midnightblue')  
lines(stats::lowess(m_bty$residuals ~ evals$bty_avg), col='firebrick', lwd=2, lty=5)

par(mfrow=c(1,2))

hist(m_bty$residuals, col='violetred', main = 'score by avg. beauty residuals', xlab = 'residuals')

qqnorm(m_bty$residuals, pch = 8, col = 'purple3')
qqline(m_bty$residuals, col = 'mediumblue', lwd=2)  # adds diagonal line to the normal prob plot

par(mfrow=c(2,2))
plot(m_bty, col= 'blueviolet', lwd = 2)

## Multiple linear regression

plot(evals$bty_avg ~ evals$bty_f1lower, main='Avg. Beauty to Beauty score by female lower lvl student',
     ylab='Avg. Beauty', xlab='Score from female LL student',
     bg='violet', pch=23)

c <- cor(evals$bty_avg, evals$bty_f1lower)

Correlation = 0.8439112

plot(evals[,13:19], col='darkorchid3')

### Exercise 7 - Beauty and Gender

The diagnostic plots look a little bit better with gender added in the model. The conditions for linear regression are still met.

m_bg <- lm(score ~ bty_avg + gender, data = evals)
par(mfrow=c(2,2))
plot(m_bg, col= 'darkviolet', lwd = 2)

Exercise 8 - Significant Beauty

The p-value for bty_avg is smaller than the model with bty_avg alone. Adding gender has increased the R2 value to 0.055, which is still not a very good prediction model.

multiLinesP <- function(model, ...){
  if(class(model)!="lm"){
    warning("Model must be the output of the function lm()")
  }
  
  if(length(model$xlevels)!=1){
    warning("Model must contain exactly one categorical predictor")
  }
  
  if(length(model$coef)-length(model$xlevels[[1]])!=1){
    warning("Model must contain exactly one non-categorical predictor")
  }
  
  palette <- c("#68228B", "#56B4E9", "#800000", "#009E73", "#CC79A7", "#F0E442", "#0072B2")
  
  baseIntercept <- model$coef[1]
  nLines <- length(model$xlevels[[1]])
  intercepts <- c(baseIntercept, rep(0, nLines-1))
  indicatorInd <- c(1, rep(0, nLines)) # used to find slope parameter by process of elimination
  
  for(i in 1:(nLines-1)){
    indicatorName <- paste(names(model$contrasts),model$xlevels[[1]][1+i], sep = "")
    intercepts[i+1] <- baseIntercept + model$coef[names(model$coef)==indicatorName]
    indicatorInd <- indicatorInd + (names(model$coef)==indicatorName)
  }
  
  slope <- model$coef[!indicatorInd]
  
  num_pred = which(names(model$model[,-1]) != names(model$xlevels)) + 1
  cat_pred = which(names(model$model[,-1]) == names(model$xlevels)) + 1
  
  model$model$COL = NA
  model$model$PCH = NA
  for(i in 1:nLines){
    model$model$COL[model$model[,cat_pred] == levels(model$model[,cat_pred])[i]] = adjustcolor(palette[i],0.40)
    model$model$PCH[model$model[,cat_pred] == levels(model$model[,cat_pred])[i]] = i+14
  }
  
  plot(model$model[,1] ~ jitter(model$model[,num_pred]), col = model$model$COL, pch = model$model$PCH,
       ylab = names(model$model)[1],
       xlab = names(model$model)[num_pred])
  
  for(j in 1:nLines){
    abline(intercepts[j], slope, col = palette[j], lwd = 2, ...)
  }
  
  if(slope > 0){legend_pos = "bottomright"}
  if(slope < 0){legend_pos = "topleft"}  
  
  legend(legend_pos, col = palette[1:nLines], lty = 1, legend = levels(model$model[,cat_pred]), lwd = 2)
}
multiLinesP(m_bg)          #customized multiLines function palette with purples

Exercise 9 - Male-Beauty Formula

score = 3.737 + 0.0742(bty_avg) + 0.172(gendermale), men will have higher course evaluation scores

s2 <- summary(m_bg)
s2
## 
## Call:
## lm(formula = score ~ bty_avg + gender, data = evals)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.8305 -0.3625  0.1055  0.4213  0.9314 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.74734    0.08466  44.266  < 2e-16 ***
## bty_avg      0.07416    0.01625   4.563 6.48e-06 ***
## gendermale   0.17239    0.05022   3.433 0.000652 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5287 on 460 degrees of freedom
## Multiple R-squared:  0.05912,    Adjusted R-squared:  0.05503 
## F-statistic: 14.45 on 2 and 460 DF,  p-value: 8.177e-07

Exercise 10 - Rank Beauty

The lowest value of rank = teaching is used for the y-intercept. The other values of rank are treated as binary variables, one or neither can be true for each observation. So the formula for score is:

score = 3.98 + 0.068(bty_avg) - 0.162(ranktenure track) - 0.126(ranktenured)

So tenure track professors get lower scores, maybe they’re spending more time publishing in order to get their tenure.

m_br <- lm(score ~ bty_avg + rank, data = evals)
s3 <- summary(m_br)
s3
## 
## Call:
## lm(formula = score ~ bty_avg + rank, data = evals)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.8713 -0.3642  0.1489  0.4103  0.9525 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       3.98155    0.09078  43.860  < 2e-16 ***
## bty_avg           0.06783    0.01655   4.098 4.92e-05 ***
## ranktenure track -0.16070    0.07395  -2.173   0.0303 *  
## ranktenured      -0.12623    0.06266  -2.014   0.0445 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5328 on 459 degrees of freedom
## Multiple R-squared:  0.04652,    Adjusted R-squared:  0.04029 
## F-statistic: 7.465 on 3 and 459 DF,  p-value: 6.88e-05

Beauty models

Exercise 11 - Full Beauty

I’d expect the number of students in the class to have a high p-value. The scores are calculated from the students who participated in the survey.

Exercise 12 - Full Model Check

m1 <- lm(score ~ rank + ethnicity + gender + language + age + cls_perc_eval 
             + cls_students + cls_level + cls_profs + cls_credits + bty_avg 
             + pic_outfit + pic_color, data = evals) 
s1 <- summary(m1)

The highest p-value is 0.7780566 for number of professors teaching sections in course. Number of students does have a p-value = 0.2289607

Exercise 13 - Full Model ethnicity

The p-value for ethnicity is 0.1169791, this is still fairly large. In this model, holding all other variables constant a non-minority professor would receive a score 0.123 points higher than a minority professor.

Exercise 14 - Full Model minus one

Removing the cls_profs variable from the model changes the values of the coefficients and p-values for the remaining variables.

m2 <- lm(score ~ rank + ethnicity + gender + language + age + cls_perc_eval 
             + cls_students + cls_level + cls_credits + bty_avg 
             + pic_outfit + pic_color, data = evals)
s2 <- summary(m2)
t1 <- tibble(name = labels(s1$coefficients[,'Estimate']), coef_m1 = s1$coefficients[,'Estimate'], 
             pval_m1 = s1$coefficients[,'Pr(>|t|)']) %>% 
  filter(name != 'cls_profssingle') %>% 
  mutate(coef_m2 = s2$coefficients[,1], diff_c = coef_m1 - coef_m2, 
         pval_m2 = s2$coefficients[,'Pr(>|t|)'], diff_pv = pval_m1 - pval_m2) %>% 
  relocate(pval_m1, .after = diff_c)
kable(t1, table.attr = "style = \"color: indigo;\"") %>% kable_minimal()
name coef_m1 coef_m2 diff_c pval_m1 pval_m2 diff_pv
(Intercept) 4.0952141 4.0872523 0.0079618 0.0000000 0.0000000 0.0000000
ranktenure track -0.1475932 -0.1476746 0.0000813 0.0727793 0.0723271 0.0004523
ranktenured -0.0973378 -0.0973829 0.0000451 0.1429455 0.1423493 0.0005962
ethnicitynot minority 0.1234929 0.1274458 -0.0039528 0.1169791 0.0998556 0.0171234
gendermale 0.2109481 0.2101231 0.0008250 0.0000554 0.0000566 -0.0000012
languagenon-english -0.2298112 -0.2282894 -0.0015217 0.0396509 0.0405303 -0.0008794
age -0.0090072 -0.0089992 -0.0000080 0.0042688 0.0042616 0.0000072
cls_perc_eval 0.0053272 0.0052888 0.0000385 0.0005903 0.0006072 -0.0000169
cls_students 0.0004546 0.0004687 -0.0000141 0.2289607 0.2103843 0.0185764
cls_levelupper 0.0605140 0.0606374 -0.0001235 0.2936925 0.2922000 0.0014925
cls_creditsone credit 0.5020432 0.5061196 -0.0040764 0.0000184 0.0000133 0.0000051
bty_avg 0.0400333 0.0398629 0.0001704 0.0226744 0.0230315 -0.0003571
pic_outfitnot formal -0.1126817 -0.1083227 -0.0043589 0.1279153 0.1340803 -0.0061650
pic_colorcolor -0.2172630 -0.2190527 0.0017897 0.0025162 0.0022052 0.0003110

Exercise 15 - Model redux

The best model is m2, all variables except cls_profs with an adjusted r2 = 0.163. With only 16.3% of the variability explained by the model, it’s still not great.

# remove cls_level with p_value 0.292
s3 <- lm(score ~ rank + ethnicity + gender + language + age + cls_perc_eval 
             + cls_students + cls_credits + bty_avg 
             + pic_outfit + pic_color, data = evals) %>% 
   summary()

# remove cls_level with p_value 0.319
s4 <- lm(score ~ rank + ethnicity + gender + language + age + cls_perc_eval 
             + cls_credits + bty_avg + pic_outfit + pic_color, data = evals) %>% 
   summary()

# remove rank with p_value 0.205
s5 <- lm(score ~ ethnicity + gender + language + age + cls_perc_eval 
             + cls_credits + bty_avg + pic_outfit + pic_color, data = evals) %>% 
   summary()

# remove pic_outfit with p_value 0.0905
s6 <- lm(score ~ ethnicity + gender + language + age + cls_perc_eval 
             + cls_credits + bty_avg + pic_color, data = evals) %>% 
   summary()

# The remaining redux models got smaller adj.r.squared values

models <- tibble(name = c('m1', 'm2', 'm3', 'm4', 'm5', 'm6'),
                 rsquared = c(s1$adj.r.squared, s2$adj.r.squared, s3$adj.r.squared, s4$adj.r.squared,
                              s5$adj.r.squared, s6$adj.r.squared),
                 SE = c(s1$sigma, s2$sigma, s3$sigma, s4$sigma, s5$sigma, s6$sigma),
                 f_statistic = c(s1$fstatistic[1], s2$fstatistic[1], s3$fstatistic[1], s4$fstatistic[1],
                                 s5$fstatistic[1], s6$fstatistic[1]),
                 bty.p_value = c(s1$coefficients[2,4], s2$coefficients[2,4], s3$coefficients[2,4],
                                 s4$coefficients[2,4], s5$coefficients[2,4], s6$coefficients[2,4]))   #p-value for bty-avg 
models <- arrange(models, desc(rsquared)) 
models$name <- as.character(models$name)
kable(models, table.attr = "style = \"color: indigo;\"") %>% kable_minimal()
name rsquared SE f_statistic bty.p_value
m2 0.1634262 0.4974425 7.942500 0.0723271
m4 0.1632301 0.4975008 9.193009 0.0828135
m3 0.1632178 0.4975044 8.509584 0.0831837
m1 0.1617076 0.4979532 7.365741 0.0727793
m5 0.1610344 0.4981531 10.853127 0.0297983
m6 0.1575649 0.4991821 11.801271 0.0262285

Exercise 16 - Model 2 verification

The diagnostic plots for m2 look good for satisfying the conditions of linear regression.

par(mfrow=c(2,2))
plot(m2, col= 'darkviolet', lwd = 2)

Exercise 17 - Observation = Course

Each row represents a course so the observations are about a course not a professor. There are courses that are taught by only one professor, and courses that are taught by multiple professors (cls_profs). If the single professor courses are filtered out of the data, then we can find a model with an adjusted R2 = 0.234, which is better than the model with all the observations.

cls_m <- evals %>% filter(cls_profs == 'multiple')

lm(score ~ ethnicity + gender + language + age + cls_perc_eval 
             + cls_credits + bty_avg + pic_outfit, data = cls_m) %>% 
summary()
## 
## Call:
## lm(formula = score ~ ethnicity + gender + language + age + cls_perc_eval + 
##     cls_credits + bty_avg + pic_outfit, data = cls_m)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.55019 -0.26938  0.04784  0.33733  1.25809 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            3.994145   0.268938  14.852  < 2e-16 ***
## ethnicitynot minority  0.197930   0.104830   1.888  0.05999 .  
## gendermale             0.176262   0.058777   2.999  0.00294 ** 
## languagenon-english   -0.207967   0.138240  -1.504  0.13354    
## age                   -0.014892   0.003236  -4.603 6.19e-06 ***
## cls_perc_eval          0.008509   0.001714   4.964 1.17e-06 ***
## cls_creditsone credit  0.487818   0.119172   4.093 5.48e-05 ***
## bty_avg                0.046966   0.019734   2.380  0.01795 *  
## pic_outfitnot formal  -0.250986   0.092186  -2.723  0.00686 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4825 on 297 degrees of freedom
## Multiple R-squared:  0.2539, Adjusted R-squared:  0.2338 
## F-statistic: 12.64 on 8 and 297 DF,  p-value: 1.266e-15

If we look at only the single professor courses, then bty_avg becomes less significant, p-value = 0.541, as we do model selection.

cls_s <- evals %>% filter(cls_profs == 'single')

lm(score ~ rank + gender + language + cls_level + cls_credits +
               bty_avg + pic_color, data = cls_s) %>% 
  summary()
## 
## Call:
## lm(formula = score ~ rank + gender + language + cls_level + cls_credits + 
##     bty_avg + pic_color, data = cls_s)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.57972 -0.30552  0.07844  0.33951  0.92990 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            4.36418    0.24922  17.512  < 2e-16 ***
## ranktenure track      -0.33060    0.12852  -2.572  0.01109 *  
## ranktenured           -0.11285    0.11397  -0.990  0.32369    
## gendermale             0.16256    0.09109   1.785  0.07638 .  
## languagenon-english   -0.23834    0.14588  -1.634  0.10444    
## cls_levelupper         0.19254    0.09737   1.977  0.04984 *  
## cls_creditsone credit  0.65080    0.23684   2.748  0.00675 ** 
## bty_avg                0.01923    0.03137   0.613  0.54079    
## pic_colorcolor        -0.44578    0.13498  -3.303  0.00120 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4788 on 148 degrees of freedom
## Multiple R-squared:  0.2271, Adjusted R-squared:  0.1853 
## F-statistic: 5.436 on 8 and 148 DF,  p-value: 5.172e-06

Exercise 18 - High score professor

A high scoring professor would be a younger man, who is not a minority and was educated in an English speaking school. He is not tenured or tenured tracked and teaches multi-credit, upper-level courses. He is attractive (to students), and is dressed formally in the black & white photograph shown to students.

t1 <- tibble(name = labels(s2$coefficients[,'Estimate']), coef = s2$coefficients[,'Estimate'])
kable(t1, table.attr = "style = \"color: indigo;\"") %>% kable_minimal()
name coef
(Intercept) 4.0872523
ranktenure track -0.1476746
ranktenured -0.0973829
ethnicitynot minority 0.1274458
gendermale 0.2101231
languagenon-english -0.2282894
age -0.0089992
cls_perc_eval 0.0052888
cls_students 0.0004687
cls_levelupper 0.0606374
cls_creditsone credit 0.5061196
bty_avg 0.0398629
pic_outfitnot formal -0.1083227
pic_colorcolor -0.2190527

Exercise 19 - Generalizing conclusions

I would not generalize these results to the population of professors. This observational study was conducted at a single American university and the sample data is not representative of all college professors.