LOAD

install.packages("ggplot2")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
library(ggplot2)

Multiple Linear Regression

The Data

download.file("http://www.openintro.org/stat/data/evals.RData", destfile = "evals.RData")
load("evals.RData")

Exploring the data

Exercise 1

Is this an observational study or an experiment? The original research question posed in the paper is whether beauty leads directly to the differences in course evaluations. Given the study design, is it possible to answer this question as it is phrased? If not, rephrase the question.

  • This is an observational study because: 1) there is no control or experimental group 2) there is no manipulated variable 3) there is no randomization.
  • The original phrasing implies causality. Since this is not an experiment I would rephrase it like this: Is there an association between perceived beauty rating and average professor evaluation score?

Exercise 2

Describe the distribution of score. Is the distribution skewed? What does that tell you about how students rate courses? Is this what you expected to see? Why, or why not?

  • It is left skewed with a long left tail
  • If higher scores means a more positive view of the professor, it looks like students that do the course evaluations have a generally more positive view of their professors.
  • Yes, I expected this, those who actually choose to do course evaluations usually like their professors and want to help them out.
ggplot(evals,aes(score))+
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Exercise 3

Excluding score, select two other variables and describe their relationship using an appropriate visualization (scatterplot, side-by-side boxplots, or mosaic plot)

  • age

  • bty_avg

  • as age increases average beauty scores decreases

ggplot(evals,aes(age,bty_avg))+
  geom_point()+
  geom_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula = 'y ~ x'

Simple Linear Regression

The fundamental phenomenon suggested by the study is that better looking teachers are evaluated more favorably. Let’s create a scatterplot to see if this appears to be the case:

plot(evals$score ~ evals$bty_avg)

ggplot(evals, aes(bty_avg,score))+
  geom_point()

Exercise 4

Replot the scatterplot, but this time use the function jitter() on the y - or the x -coordinate. (Use ?jitter to learn more.) What was misleading about the initial scatterplot?

  • For context, the jitter function adds noise to a numeric vector. This scatterplot now shows repeated and overlapping values. This scatter plot shows that the initial scatterplot hid a lot of data points. It also shows that certain values are more heavily weighted.
plot(evals$score ~ jitter(evals$bty_avg))

ggplot(evals,aes(bty_avg,score))+
  geom_point(position="jitter")+
  geom_smooth (method="lm")
## `geom_smooth()` using formula = 'y ~ x'

Exercise 5

Let’s see if the apparent trend in the plot is something more than natural variation. Fit a linear model called m_bty to predict average professor score by average beauty rating and add the line to your plot using abline(m_bty). Write out the equation for the linear model and interpret the slope. Is average beauty score a statistically significant predictor? Does it appear to be a practically significant predictor?

  • y= 3.880338 + 0.06663704 x Beauty average
  • For every one unit increase in bty_avg, there is an increase of 0.6664 in score.
  • bty_avg does not seem to be a practically significant predictor; the multiple R-squared is 0.03502 meaning that only .03% of the variability is explained by this model. In addition to this, even for the most attractive professors the scores only increase by less than one point. Although, it is technically statistically significant with a p-value of 5.083e-05.
m_bty<-lm(score~bty_avg,evals)

m_bty$coefficients[1]
## (Intercept) 
##    3.880338
m_bty$coefficients[2]
##    bty_avg 
## 0.06663704
summary(m_bty) 
## 
## Call:
## lm(formula = score ~ bty_avg, data = evals)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.9246 -0.3690  0.1420  0.3977  0.9309 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.88034    0.07614   50.96  < 2e-16 ***
## bty_avg      0.06664    0.01629    4.09 5.08e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5348 on 461 degrees of freedom
## Multiple R-squared:  0.03502,    Adjusted R-squared:  0.03293 
## F-statistic: 16.73 on 1 and 461 DF,  p-value: 5.083e-05
ggplot(evals,aes(bty_avg,score))+
  geom_point()+
 geom_abline(slope=m_bty$coefficients[2],
             intercept = m_bty$coefficients[1],
             color="blue",
             size=1)
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

Exercise 6

Use residual plots to evaluate whether the conditions of least squares regression are reasonable. Provide plots and comments for each one (see the Simple Regression Lab for a reminder of how to make these).

Linearity: Residuals vs Fitted plot –> shows if residuals have non-linear patterns. My plot looks equally spread around the horizontal line. There does not seem to be a distinct pattern. This model passes the linearity test.

Normality: Normal Q-Q –> shows if residuals are normally distributed. The residuals follow a straight line. Some deviation at both extreme ends. This model passes the normality test.

Variability: Scale-Location –> shows if residuals are spread equally along the range of predictors (homoscedasticity). This plot shows a horizontal line, the points look equally spread, there is no obvious pattern such as a curve or a cone. This model passes the variability test.

plot(m_bty)

Multiple Linear Regression

Let’s take a look at the relationship between one of these scores and the average beauty score.

plot(evals$bty_avg ~ evals$bty_f1lower)

cor(evals$bty_avg, evals$bty_f1lower)
## [1] 0.8439112

We can actually take a look at the relationships between all beauty variables (columns 13 through 19) using the following command:

plot(evals[,13:19])

In order to see if beauty is still a significant predictor of professor score after we’ve accounted for the gender of the professor, we can add the gender term into the model.

m_bty_gen <- lm(score ~ bty_avg + gender, data = evals)
summary(m_bty_gen)
## 
## Call:
## lm(formula = score ~ bty_avg + gender, data = evals)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.8305 -0.3625  0.1055  0.4213  0.9314 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.74734    0.08466  44.266  < 2e-16 ***
## bty_avg      0.07416    0.01625   4.563 6.48e-06 ***
## gendermale   0.17239    0.05022   3.433 0.000652 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5287 on 460 degrees of freedom
## Multiple R-squared:  0.05912,    Adjusted R-squared:  0.05503 
## F-statistic: 14.45 on 2 and 460 DF,  p-value: 8.177e-07

Exercise 7

P-values and parameter estimates should only be trusted if the conditions for the regression are reasonable. Verify that the conditions for this model are reasonable using diagnostic plots.

Linearity: Residuals vs Fitted plot –> shows if residuals have non-linear patterns. My plot looks equally spread around the horizontal line. There does not seem to be a distinct pattern. This model passes the linearity test.

Normality: Normal Q-Q –> shows if residuals are normally distributed. The residuals follow a straight line. Some deviation at both extreme ends. This model passes the normality test.

Variability: Scale-Location –> shows if residuals are spread equally along the range of predictors (homoscedasticity). This plot shows a horizontal line, the points look equally spread, there is no obvious pattern such as a curve or a cone. This model passes the variability test.

plot(m_bty_gen)

Exercise 8

Is bty_avg still a significant predictor of score? Has the addition of gender to the model changed the parameter estimate for bty_avg? * the multiple R-squared is 0.05912. The addition of gender has increased the multiple R-squared value. the model now explains .05% of variability.The p-value is still quite low so it is still significant. But because the multiple r squared is also still quite low, it might not be practical.

summary(m_bty_gen)
## 
## Call:
## lm(formula = score ~ bty_avg + gender, data = evals)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.8305 -0.3625  0.1055  0.4213  0.9314 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.74734    0.08466  44.266  < 2e-16 ***
## bty_avg      0.07416    0.01625   4.563 6.48e-06 ***
## gendermale   0.17239    0.05022   3.433 0.000652 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5287 on 460 degrees of freedom
## Multiple R-squared:  0.05912,    Adjusted R-squared:  0.05503 
## F-statistic: 14.45 on 2 and 460 DF,  p-value: 8.177e-07

We can plot this line and the line corresponding to males with the following custom function.

multiLines(m_bty_gen)

Exercise 9

What is the equation of the line corresponding to males? (Hint: For males, the parameter estimate is multiplied by 1.) For two professors who received the same beauty rating, which gender tends to have the higher course evaluation score? * males tend to have the higher course evaluation score * 3.74734 + (0.17239 x gender_male) + (0.007416 x beauty_avg) –> 3.74734 + (1) + (0.007416 x beauty_avg)

summary(m_bty_gen)
## 
## Call:
## lm(formula = score ~ bty_avg + gender, data = evals)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.8305 -0.3625  0.1055  0.4213  0.9314 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.74734    0.08466  44.266  < 2e-16 ***
## bty_avg      0.07416    0.01625   4.563 6.48e-06 ***
## gendermale   0.17239    0.05022   3.433 0.000652 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5287 on 460 degrees of freedom
## Multiple R-squared:  0.05912,    Adjusted R-squared:  0.05503 
## F-statistic: 14.45 on 2 and 460 DF,  p-value: 8.177e-07

Exercise 10

Create a new model called m_bty_rank with gender removed and rank added in. How does R appear to handle categorical variables that have more than two levels? Note that the rank variable has three levels: teaching, tenure track, tenured. * R only shows 2 out of the 3 levels: tenure track and tenured. Our teaching variable = 0, it is our comparision group.

m_bty_rank <- lm(score ~ bty_avg + rank, data = evals)
summary(m_bty_rank)
## 
## Call:
## lm(formula = score ~ bty_avg + rank, data = evals)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.8713 -0.3642  0.1489  0.4103  0.9525 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       3.98155    0.09078  43.860  < 2e-16 ***
## bty_avg           0.06783    0.01655   4.098 4.92e-05 ***
## ranktenure track -0.16070    0.07395  -2.173   0.0303 *  
## ranktenured      -0.12623    0.06266  -2.014   0.0445 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5328 on 459 degrees of freedom
## Multiple R-squared:  0.04652,    Adjusted R-squared:  0.04029 
## F-statistic: 7.465 on 3 and 459 DF,  p-value: 6.88e-05