Grading the professor: data exploration

download.file("http://www.openintro.org/stat/data/evals.RData", destfile = "evals.RData")
load("evals.RData")
##### Exercise 1: Observational study

Because no treatment was applied and no control group was assigned (the data were simply collected as they occurred), this is an observational study. The question of interest is whether beauty is a factor that influences the outcome of course evaluations. Since the data are observational, at best we can establish an association between beauty and evaluation scores, not a causal link.

##### Exercise 2: Score distribution

The distribution of scores is left-skewed, with outliers at the low and high ends. The majority of students rate the courses on the high end. My expectation would have been more of a “normal” distribution, with the bulk of scores falling in the middle (neither very low nor very high).

library(tidyverse)
## ── Attaching packages ──────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.2     ✓ purrr   0.3.4
## ✓ tibble  3.0.3     ✓ dplyr   1.0.2
## ✓ tidyr   1.1.2     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.5.0
## ── Conflicts ─────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
evals %>%
  ggplot() +
  geom_histogram(mapping = aes(x = score))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
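As the stat_bin() message suggests, an explicit binwidth gives more control over the picture. A minimal sketch (binwidth = 0.25 is an arbitrary choice, not part of the original output):

evals %>%
  ggplot() +
  geom_histogram(mapping = aes(x = score), binwidth = 0.25)  # one bar per quarter point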

summary(evals$score)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.300   3.800   4.300   4.175   4.600   5.000
sd(evals$score)
## [1] 0.5438645
evals %>%
  ggplot() +
  geom_boxplot(mapping = aes(x = score))

##### Exercise 3: Professors' average beauty rating vs. professors' age

evals_f <- evals %>%
  filter(gender == "female")   # female professors only

evals_m <- evals %>%
  filter(gender == "male")     # male professors only

It appears that there is an inverse relationship between the professors’ average beauty rating and their age: as the age of professors increases, the beauty rating they receive decreases. Although not strong, these variables exhibit a correlation coefficient of -0.30, indicating a modest negative relationship between them.

par(mfrow = c(2,1))  # 2 rows, 1 column of plots

hist(evals$bty_avg, 
     xlim = c(0,12), 
     main = "Avg. beauty rating of professors", 
     xlab = "", 
     col = "lightblue")

hist(evals$age, 
     xlim = c(25,80), 
     main = "Age of professors", 
     xlab = "", 
     col = "purple")

cor(evals$bty_avg, evals$age)
## [1] -0.3046034
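As a side check (not part of the original lab output), cor.test() would attach a confidence interval and significance test to this correlation:

cor.test(evals$bty_avg, evals$age)  # tests H0: true correlation is zero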
m1 <- lm(bty_avg ~ age, data = evals)  # form y ~ x
plot(evals$age, evals$bty_avg, 
     pch = 19,        # solid circle
     cex = .9,        # 90% of default point size
     col = "#cc0000", # red
     main = "Beauty rating as a function of professors' age", 
     xlab = "Professors' age", 
     ylab = "Avg. beauty rating")
abline(m1)

Female professors’ beauty rating as a function of their age, plus the correlation coefficient. The correlation (-0.21) again suggests an inverse relationship between beauty rating and age, though slightly less pronounced than that of male professors (-0.33).

cor(evals_f$bty_avg, evals_f$age)
## [1] -0.2070454
m2 <- lm(bty_avg ~ age, data = evals_f)  # form y ~ x
plot(evals_f$age, evals_f$bty_avg, 
     pch = 19,        # solid circle
     cex = .9,        # 90% of default point size
     col = "#009999", # teal
     main = "Female professors' beauty rating as a function of their age", 
     xlab = "Professors' age", 
     ylab = "Avg. beauty rating")
abline(m2)

Male professors’ beauty rating as a function of their age, plus the correlation coefficient (-0.33).

cor(evals_m$bty_avg, evals_m$age)
## [1] -0.3316213
m3 <- lm(bty_avg ~ age, data = evals_m)  # form y ~ x
plot(evals_m$age, evals_m$bty_avg, 
     pch = 19,        # solid circle
     cex = .9,        # 90% of default point size
     col = "#6600cc", # purple
     main = "Male professors' beauty rating as a function of their age", 
     xlab = "Professors' age", 
     ylab = "Avg. beauty rating")
abline(m3)
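The separate per-gender scatterplots above can also be condensed into a single faceted ggplot; a sketch using the tidyverse loaded earlier (not part of the original output):

evals %>%
  ggplot(aes(x = age, y = bty_avg)) +
  geom_point(alpha = 0.5) +                  # raw points, semi-transparent
  geom_smooth(method = "lm", se = FALSE) +   # per-panel regression line
  facet_wrap(~ gender)                       # one panel per gender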

### Simple Linear Regression

plot(evals$score ~ evals$bty_avg)

##### Exercise 4: Using jitter()

jitter() adds a small amount of noise to a numeric vector. The previous scatterplot hid the fact that many observations overlap exactly; the jittered plot below reveals them, because the noise added to the score variable separates points that would otherwise be drawn on top of one another.

?jitter
plot(jitter(evals$score) ~ evals$bty_avg)
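jitter() also takes factor and amount arguments to control how much noise is added; for example, a small fixed amount (0.05 here is an arbitrary choice):

plot(jitter(evals$score, amount = 0.05) ~ evals$bty_avg)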

##### Exercise 5: Fit a linear model called m_bty to predict average professor score by average beauty rating, and add the line to your plot using abline(m_bty). (The fitted equation and its interpretation follow the model summary below.)

Correlation & linear regression side note: The outcome variable is also called the response or dependent variable, and the risk factors and confounders are called the predictors, or explanatory or independent variables. In regression analysis, the dependent variable is denoted “Y” and the independent variables are denoted by “X”.

m_bty <- lm(score ~ bty_avg, data = evals)  # form y ~ x, predictor being x

plot(evals$bty_avg, evals$score, 
     pch = 19,        # solid circle
     cex = .9,        # 90% of default point size
     col = "#cc0000", # red
     main = "Avg. professor score as a function of professors' beauty rating", 
     xlab = "Avg. beauty rating", 
     ylab = "Avg. professor score")
abline(m_bty)

summary(m_bty)
## 
## Call:
## lm(formula = score ~ bty_avg, data = evals)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.9246 -0.3690  0.1420  0.3977  0.9309 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.88034    0.07614   50.96  < 2e-16 ***
## bty_avg      0.06664    0.01629    4.09 5.08e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5348 on 461 degrees of freedom
## Multiple R-squared:  0.03502,    Adjusted R-squared:  0.03293 
## F-statistic: 16.73 on 1 and 461 DF,  p-value: 5.083e-05
The fitted equation is ŷ = 3.880 + 0.0666*bty_avg. For this model, only 3.5% of the variability in average professor scores is explained by average beauty ratings (R² = 0.035). In other words, for each one-point increase in beauty rating, the predicted score increases by 0.0666. The p-value for the bty_avg coefficient is 5.08e-05, well below the 0.05 significance level, so there is strong evidence that bty_avg is a statistically significant predictor of professors’ scores; given the low R², however, it is not a practically strong one.
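As a quick sanity check of the fitted equation, predict() can evaluate it at a hypothetical beauty rating, say 5:

predict(m_bty, newdata = data.frame(bty_avg = 5))  # 3.880 + 0.0666 * 5, about 4.21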
##### Exercise 6: Residuals and diagnostic plots

The plots below suggest that the conditions for least squares regression are reasonably met.

library(statsr)
## 
## Attaching package: 'statsr'
## The following object is masked _by_ '.GlobalEnv':
## 
##     evals
?plot_ss
plot_ss(bty_avg, score, data = evals)

## Click two points to make a line.
                                
## Call:
## lm(formula = y ~ x, data = pts)
## 
## Coefficients:
## (Intercept)            x  
##     3.88034      0.06664  
## 
## Sum of Squares:  131.868
plot_ss(bty_avg, score, data = evals, showSquares = TRUE)

## Click two points to make a line.
                                
## Call:
## lm(formula = y ~ x, data = pts)
## 
## Coefficients:
## (Intercept)            x  
##     3.88034      0.06664  
## 
## Sum of Squares:  131.868
plot(evals$score ~ evals$bty_avg)
abline(m_bty)

We have looked at the residuals plotted against bty_avg, which show no apparent pattern, supporting the linearity condition.

plot(m_bty$residuals ~ evals$bty_avg)
abline(h = 0, lty = 3)  # adds a horizontal dashed line at y = 0

Although the histogram of the residuals appears left-skewed, the QQ plot suggests a fairly normal distribution of the residuals, with the skewness occurring at the upper and lower tails of the distribution.

hist(m_bty$residuals)

qqnorm(m_bty$residuals)
qqline(m_bty$residuals)  # adds diagonal line to the normal prob plot
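The same diagnostics are bundled in base R's plot method for lm objects; a sketch (output not shown here):

par(mfrow = c(2, 2))  # 2 x 2 panel of diagnostic plots
plot(m_bty)           # residuals vs fitted, normal QQ, scale-location, leverage
par(mfrow = c(1, 1))  # restore single-plot layout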

### Multiple linear regression

Looking at the relationship between one of the individual beauty scores (bty_f1lower) and the average beauty score; as expected, the relationship is strong.

plot(evals$bty_avg ~ evals$bty_f1lower)

cor(evals$bty_avg, evals$bty_f1lower)
## [1] 0.8439112

We can take a look at the relationships between all beauty variables (columns 13 through 19). These variables are collinear (correlated), and adding more than one of them to the model would not add much value. In this application, with these highly correlated predictors, it is reasonable to use the average beauty score as the single representative of the group.

plot(evals[,13:19])
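The pairwise correlations behind this plot can also be tabulated directly, assuming (as the plot does) that columns 13 through 19 hold the numeric beauty ratings:

round(cor(evals[, 13:19]), 2)  # correlation matrix of the beauty variables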

In order to see if beauty is still a significant predictor of professor score after we’ve accounted for the gender of the professor, we can add a gender term to the model.

m_bty_gen <- lm(score ~ bty_avg + gender, data = evals)
summary(m_bty_gen)
## 
## Call:
## lm(formula = score ~ bty_avg + gender, data = evals)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.8305 -0.3625  0.1055  0.4213  0.9314 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.74734    0.08466  44.266  < 2e-16 ***
## bty_avg      0.07416    0.01625   4.563 6.48e-06 ***
## gendermale   0.17239    0.05022   3.433 0.000652 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5287 on 460 degrees of freedom
## Multiple R-squared:  0.05912,    Adjusted R-squared:  0.05503 
## F-statistic: 14.45 on 2 and 460 DF,  p-value: 8.177e-07
##### Exercise 7: P-values and parameter estimates should only be trusted if the conditions for the regression are reasonable. Here we verify that the conditions of (1) linearity, (2) nearly normal residuals, and (3) constant variability are met using diagnostic plots.
plot(score ~ bty_avg + gender, data = evals)

abline(m_bty_gen)
## Warning in abline(m_bty_gen): only using the first two of 3 regression
## coefficients

plot(m_bty_gen$residuals ~ evals$bty_avg + evals$gender)

abline(h = 0, lty = 3)  # adds a horizontal dashed line at y = 0

hist(m_bty_gen$residuals)

qqnorm(m_bty_gen$residuals)
qqline(m_bty_gen$residuals)  # adds diagonal line to the normal prob plot

##### Exercise 8: bty_avg is still a significant predictor of score after gender is added to the model; in fact, its parameter estimate increases slightly (from 0.0666 to 0.0742). The correlations below show that the relationship between bty_avg and score is much stronger for male professors than for female professors, which is also visible in the gender-specific regression lines plotted by multiLines().

cor(evals$bty_avg, evals$score)
## [1] 0.1871424
cor(evals_f$bty_avg, evals_f$score)
## [1] 0.08547868
cor(evals_m$bty_avg, evals_m$score)
## [1] 0.310937
multiLines(m_bty_gen)

##### Exercise 9: The equation for the regression line corresponding to males is ŷ = β̂0 + β̂1*bty_avg + β̂2*(1) = (3.747 + 0.172) + 0.074*bty_avg = 3.919 + 0.074*bty_avg. For two professors who received the same beauty rating, one female and one male, the male professor tends to have the higher course evaluation score. Note: the decision to call the indicator variable gendermale instead of genderfemale has no deeper meaning; R simply codes the category that comes first alphabetically as 0. (You can change the reference level of a categorical variable, i.e., the level coded as 0, using the relevel() function; see ?relevel.)
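A sketch of the relevel() idea mentioned above; evals2 and m_bty_gen2 are hypothetical names, and this re-fit is not part of the original output:

evals2 <- evals
evals2$gender <- relevel(evals2$gender, ref = "male")      # make "male" the baseline (coded 0)
m_bty_gen2 <- lm(score ~ bty_avg + gender, data = evals2)  # now reports a genderfemale coefficient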

##### Exercise 10: For a categorical predictor with more than two levels, R creates an indicator (dummy) variable for each level except one. rank has three levels (teaching, tenure track, tenured); the first level alphabetically, teaching, is absorbed into the intercept as the baseline, which is why the output below shows coefficients only for ranktenure track and ranktenured.
m_bty_rank <- lm(score ~ bty_avg + rank, data = evals)
summary(m_bty_rank)
## 
## Call:
## lm(formula = score ~ bty_avg + rank, data = evals)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.8713 -0.3642  0.1489  0.4103  0.9525 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       3.98155    0.09078  43.860  < 2e-16 ***
## bty_avg           0.06783    0.01655   4.098 4.92e-05 ***
## ranktenure track -0.16070    0.07395  -2.173   0.0303 *  
## ranktenured      -0.12623    0.06266  -2.014   0.0445 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5328 on 459 degrees of freedom
## Multiple R-squared:  0.04652,    Adjusted R-squared:  0.04029 
## F-statistic: 7.465 on 3 and 459 DF,  p-value: 6.88e-05
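The dummy coding can be inspected directly with model.matrix(), which shows the indicator columns R builds for rank (a side check, not part of the original output):

head(model.matrix(~ rank, data = evals))  # intercept plus indicators for tenure track and tenured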

### The search for the best model

##### Exercise 11: I suspect that the variable cls_students would have the highest p-value.

We will start with a full model that predicts professor score based on rank, ethnicity, gender, language of the university where they got their degree, age, proportion of students that filled out evaluations, class size, course level, number of professors, number of credits, average beauty rating, outfit, and picture color. From the output below, the variable with the highest p-value is actually cls_profs (p = 0.778), not cls_students.

m_full <- lm(score ~ rank + ethnicity + gender + language + age + cls_perc_eval 
             + cls_students + cls_level + cls_profs + cls_credits + bty_avg 
             + pic_outfit + pic_color, data = evals)
summary(m_full)
## 
## Call:
## lm(formula = score ~ rank + ethnicity + gender + language + age + 
##     cls_perc_eval + cls_students + cls_level + cls_profs + cls_credits + 
##     bty_avg + pic_outfit + pic_color, data = evals)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.77397 -0.32432  0.09067  0.35183  0.95036 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            4.0952141  0.2905277  14.096  < 2e-16 ***
## ranktenure track      -0.1475932  0.0820671  -1.798  0.07278 .  
## ranktenured           -0.0973378  0.0663296  -1.467  0.14295    
## ethnicitynot minority  0.1234929  0.0786273   1.571  0.11698    
## gendermale             0.2109481  0.0518230   4.071 5.54e-05 ***
## languagenon-english   -0.2298112  0.1113754  -2.063  0.03965 *  
## age                   -0.0090072  0.0031359  -2.872  0.00427 ** 
## cls_perc_eval          0.0053272  0.0015393   3.461  0.00059 ***
## cls_students           0.0004546  0.0003774   1.205  0.22896    
## cls_levelupper         0.0605140  0.0575617   1.051  0.29369    
## cls_profssingle       -0.0146619  0.0519885  -0.282  0.77806    
## cls_creditsone credit  0.5020432  0.1159388   4.330 1.84e-05 ***
## bty_avg                0.0400333  0.0175064   2.287  0.02267 *  
## pic_outfitnot formal  -0.1126817  0.0738800  -1.525  0.12792    
## pic_colorcolor        -0.2172630  0.0715021  -3.039  0.00252 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.498 on 448 degrees of freedom
## Multiple R-squared:  0.1871, Adjusted R-squared:  0.1617 
## F-statistic: 7.366 on 14 and 448 DF,  p-value: 6.552e-14
##### Exercise 12: Checking the p-values for cls_level and cls_profs.

While cls_level is not a strong predictor of score variability, the predictor with the least strength is actually cls_profs. It has the highest p-value, and where 0.7% of the variability in scores is explained by cls_level, cls_profs explains only 0.06% of the score variability.

m_cls_level <- lm(score ~ cls_level, data = evals)  # form y ~ x, predictor being x

plot(evals$cls_level, evals$score, 
     pch = 19,        # solid circle
     cex = .9,        # 90% of default point size
     main = "Avg. professor score as a function of class level", 
     xlab = "Class level", 
     ylab = "Avg. professor score")
abline(m_cls_level)  # line for the cls_level model (approximate over a factor axis)

qqnorm(m_cls_level$residuals)
qqline(m_cls_level$residuals)  # adds diagonal line to the normal prob plot

summary(m_cls_level)
## 
## Call:
## lm(formula = score ~ cls_level, data = evals)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.8422 -0.3422  0.1578  0.4578  0.8578 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     4.23822    0.04330  97.881   <2e-16 ***
## cls_levelupper -0.09606    0.05326  -1.804    0.072 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5425 on 461 degrees of freedom
## Multiple R-squared:  0.007006,   Adjusted R-squared:  0.004852 
## F-statistic: 3.253 on 1 and 461 DF,  p-value: 0.07196

For this model, only 0.7% of the variability in scores is explained by cls_level (R² = 0.007).

m_cls_profs <- lm(score ~ cls_profs, data = evals) #form y ~ x, predictor being x
summary(m_cls_profs)
## 
## Call:
## lm(formula = score ~ cls_profs, data = evals)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.8554 -0.3846  0.1154  0.4154  0.8446 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      4.18464    0.03111 134.493   <2e-16 ***
## cls_profssingle -0.02923    0.05343  -0.547    0.585    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5443 on 461 degrees of freedom
## Multiple R-squared:  0.0006486,  Adjusted R-squared:  -0.001519 
## F-statistic: 0.2992 on 1 and 461 DF,  p-value: 0.5847
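Following the backward-elimination idea, the weakest predictor could be dropped from the full model with update(); m_drop_profs is a hypothetical name and its output is not shown:

m_drop_profs <- update(m_full, . ~ . - cls_profs)  # same model minus cls_profs
summary(m_drop_profs)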
##### Exercise 13: Interpreting the coefficient associated with the ethnicity variable

While only 0.57% of the variability in score can be explained by ethnicity, the model suggests that professors who are not part of a minority group tend to obtain higher scores: the predicted score is 0.119 points higher when a professor is not part of a minority group. Note, however, that this coefficient is not statistically significant at the 0.05 level (p = 0.103).

m_ethnicity <- lm(score ~ ethnicity, data = evals) #form y ~ x, predictor being x
summary(m_ethnicity)
## 
## Call:
## lm(formula = score ~ ethnicity, data = evals)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.8912 -0.3816  0.1088  0.4088  0.9281 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            4.07188    0.06786  60.003   <2e-16 ***
## ethnicitynot minority  0.11935    0.07310   1.633    0.103    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5429 on 461 degrees of freedom
## Multiple R-squared:  0.005749,   Adjusted R-squared:  0.003593 
## F-statistic: 2.666 on 1 and 461 DF,  p-value: 0.1032
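Consistent with the p-value of 0.103, a 95% confidence interval for this coefficient should cover zero; a quick check (not part of the original output):

confint(m_ethnicity, "ethnicitynot minority", level = 0.95)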