Using R, build a regression model for data that interests you. Conduct residual analysis. Was the linear model appropriate? Why or why not?

The data come from researchers at the University of Texas at Austin and record, for a sample of 463 professors, a teaching evaluation score (higher is better) and a standardized beauty score (0 means average, a negative score means below average, and a positive score means above average). We will fit a regression of teaching evaluation score on beauty score.

library(ggplot2)
library(dplyr)
# read the data from GitHub
pb <- read.csv("https://raw.githubusercontent.com/amit-kapoor/data605/master/profbeautyevals.csv")
bty <- pb$btystdave
ev <- pb$courseevaluation

# fit the linear model: teaching evaluation score on beauty score
m_ev_bty <- lm(ev ~ bty)
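
Before turning to the plots, it can help to look at the fitted slope, its p-value, and R-squared. The following is a minimal sketch, not part of the original analysis: `fit_summary` is a hypothetical helper name, and the demonstration uses R's built-in `cars` data so the block runs on its own.

```r
# Hypothetical helper: pull the slope, its p-value, and R-squared from a
# simple linear regression fitted with lm().
fit_summary <- function(model) {
  s <- summary(model)
  coefs <- coef(s)  # matrix with columns Estimate, Std. Error, t value, Pr(>|t|)
  list(
    slope     = unname(coefs[2, "Estimate"]),
    p_value   = unname(coefs[2, "Pr(>|t|)"]),
    r_squared = s$r.squared
  )
}

# Demonstration on the built-in `cars` data (an assumption, not this dataset).
demo_fit <- lm(dist ~ speed, data = cars)
str(fit_summary(demo_fit))

# With the model from this analysis: fit_summary(m_ev_bty)
```

A low R-squared alongside a significant slope would suggest that beauty score is related to evaluation score but explains little of its variation.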

# scatter plot of teaching evaluation against beauty score, with the fitted line
m_ev_bty %>% 
  ggplot(aes(bty, ev)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula 'y ~ x'

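# residuals against fitted values; for an OLS fit the lm smooth is flat at zero
# by construction, so it serves only as a reference line
m_ev_bty %>% 
  ggplot(aes(fitted(m_ev_bty), resid(m_ev_bty))) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "Residual Analysis",
       x = "Fitted Values", y = "Residuals") +
  theme_minimal()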
## `geom_smooth()` using formula 'y ~ x'

hist(m_ev_bty$residuals, xlab = "Residuals", ylab = "")

m_ev_bty %>% 
  ggplot(aes(sample = resid(m_ev_bty))) +
  stat_qq() +
  stat_qq_line() +
  labs(title = "Q-Q Plot") +
  theme_minimal()
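
The plots above can be complemented with numeric checks. The sketch below is an illustration under stated assumptions rather than part of the original analysis: `residual_checks` is a hypothetical helper, the constant-variance check is a studentized Breusch-Pagan (Koenker) statistic computed by hand for a single-predictor model, and the demonstration again uses the built-in `cars` data.

```r
# Hypothetical helper: numeric companions to the residual plots,
# intended for a simple (one-predictor) linear regression.
residual_checks <- function(model) {
  r <- resid(model)
  f <- fitted(model)

  # Shapiro-Wilk test; a small p-value suggests non-normal residuals.
  sw <- shapiro.test(r)

  # Studentized Breusch-Pagan (Koenker) statistic computed by hand:
  # regress squared residuals on fitted values, take n * R^2, and
  # compare it with a chi-squared distribution with 1 degree of freedom.
  aux <- lm(I(r^2) ~ f)
  lm_stat <- length(r) * summary(aux)$r.squared
  bp_p <- pchisq(lm_stat, df = 1, lower.tail = FALSE)

  c(shapiro_p = sw$p.value, bp_p = bp_p)
}

# Demonstration on the built-in `cars` data (an assumption, not this dataset).
residual_checks(lm(dist ~ speed, data = cars))

# With the model from this analysis: residual_checks(m_ev_bty)
```

With several hundred observations, the Shapiro-Wilk test can flag small departures from normality, so its result is best read alongside the histogram and Q-Q plot rather than in place of them.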

Conclusion

The residual histogram is nearly normal with a slight left skew. The plot of residuals against fitted values shows roughly constant variability. The Q-Q plot follows the reference line except in both tails. The scatter plot suggests a linear relationship between beauty score and teaching evaluation score, so the linear model appears reasonable for these data, although a non-linear model could be explored as well.
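
One way to explore a non-linear alternative is to add a squared term and compare the two nested models with an F-test. The following is a minimal sketch under assumptions: `compare_quadratic` is a hypothetical helper, and the demonstration uses simulated data with a deliberately curved relationship, not the professor data.

```r
# Hypothetical helper: does adding a squared term improve the fit?
compare_quadratic <- function(x, y) {
  linear    <- lm(y ~ x)
  quadratic <- lm(y ~ x + I(x^2))
  # Nested-model F-test; a small p-value favours the quadratic model.
  tab <- anova(linear, quadratic)
  tab[2, "Pr(>F)"]
}

# Demonstration on simulated data (an assumption, not the professor data).
set.seed(1)
x_sim <- runif(200, -2, 2)
y_sim <- 1 + 0.5 * x_sim + 0.8 * x_sim^2 + rnorm(200, sd = 0.5)
compare_quadratic(x_sim, y_sim)

# With the data from this analysis: compare_quadratic(bty, ev)
```

A large p-value from this comparison would support keeping the simpler linear model.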