Load necessary packages and the dataset in question:

library(ggplot2)
library(dplyr)
library(broom)
library(knitr)

load(url("http://www.openintro.org/stat/data/evals.RData"))
evals <- evals %>%
  select(score, ethnicity, gender, language, age, bty_avg, rank)
View(evals)

Question 1

Exploratory data analysis: Create a visualization that shows the relationship between:

  • \(y\): instructor teaching score
  • \(x\): instructor age

Comment on this relationship.

Answer

# Add your code to create visualization below:
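# geom_jitter() already draws the points, so geom_point() is dropped
# to avoid plotting every observation twice
ggplot(data = evals, aes(x = age, y = score)) +
  geom_jitter() +
  labs(x = "Age", y = "Teaching Score") +
  geom_smooth(method = "lm", se = FALSE)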

cor(evals$score, evals$age)
## [1] -0.107032

Comment here: The plot and the correlation coefficient (r ≈ -0.107) both indicate a weak, negative linear relationship between instructor age and teaching score.
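One way to put a number on "weak": in a simple linear regression with a single predictor, the squared correlation equals the proportion of variance in the response that the predictor explains. A small sketch using the correlation printed above (the variable names are illustrative only):

```r
# Correlation between score and age, copied from the cor() output above
r <- -0.107032

# In simple linear regression, R-squared equals the squared correlation
r_squared <- r^2

# Share of the variance in teaching score explained by age, in percent
round(100 * r_squared, 1)
```

By this measure, age accounts for only about 1% of the variation in teaching scores.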

Question 2

  1. Display the regression table that shows both 1) the fitted intercept and 2) the fitted slope of the regression line. Pipe %>% the table into kable(digits=3) to get a cleanly formatted table with values rounded to 3 decimal places.
  2. Interpret the slope.
  3. For an instructor that is 67 years old, what would you guess that their teaching score would be?

Answer

# Add your code to create regression table below:
lm(score~age, data=evals) %>% 
  tidy() %>%
  kable(digits=3)
term          estimate  std.error  statistic  p.value
(Intercept)      4.462      0.127     35.195    0.000
age             -0.006      0.003     -2.311    0.021

Answer the two other questions here: The slope is the average change in teaching score associated with a one-year increase in age. For every additional year of a professor's age, there is an associated decrease of, on average, 0.006 teaching score points. The predicted score for a 67-year-old is 4.462 - 0.006 × 67 ≈ 4.06.
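As a check on the prediction, the fitted line \(\widehat{y} = b_0 + b_1 x\) can be evaluated by hand. The sketch below plugs in the rounded estimates from the regression table; the variable names are illustrative, and working from the unrounded coefficients (for example with predict() on the fitted model) may shift the result slightly in the third decimal place.

```r
# Rounded estimates copied from the regression table above
b0 <- 4.462   # fitted intercept
b1 <- -0.006  # fitted slope for age

# Age of the instructor we want a prediction for
age_new <- 67

# Predicted teaching score from the fitted line
predicted_score <- b0 + b1 * age_new
round(predicted_score, 2)  # about 4.06
```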

Question 3

Does there seem to be a systematic pattern in the lack-of-fit of the model? In other words, is there a pattern in the error between the fitted score \(\widehat{y}\) and the observed score \(y\)? Hint:
geom_hline(yintercept=0, col="blue") adds a blue horizontal line at \(y=0\).

Answer

# Add the code necessary to answer this question below:
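point_by_point_info <- lm(score ~ age, data = evals) %>%
  augment() %>%
  select(score, age, .fitted, .resid)
# .resid already holds score - .fitted, so it is plotted directly;
# geom_jitter() draws the points, so geom_point() is not needed as well
ggplot(data = point_by_point_info, aes(x = age, y = .resid)) +
  geom_jitter() +
  labs(x = "Age", y = "Residual") +
  geom_hline(yintercept = 0, col = "blue")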

Comment here: The spread of the residuals appears fairly homoscedastic, i.e. roughly constant across ages.

Question 4

Say a college administrator wants to model teaching scores using more than one predictor/explanatory variable, in particular using the instructor’s gender in addition to age. Create a visualization that summarizes this relationship and comment on the observed relationship.

Answer

# Add your code to create visualization below:
# One panel per gender; geom_jitter() alone draws the points
ggplot(data = evals, aes(x = age, y = score)) +
  geom_jitter() +
  facet_wrap(~gender) +
  labs(x = "Age", y = "Teaching Score") +
  geom_smooth(method = "lm", se = FALSE)

Comment here: In my opinion, fitting a separate regression line for each gender, as in the plot above, does a better job of predicting teaching scores, since each line reflects how the corresponding subset of the original data behaves.
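The faceted plot fits one line per gender. The same idea can be written as a single regression with an age-by-gender interaction, which gives each gender its own intercept and slope. This is a minimal sketch and not necessarily the model the assignment has in mind; the helper name fit_age_by_gender is hypothetical.

```r
# Hypothetical helper: regress score on age, gender, and their interaction.
# Expects a data frame with columns score, age, and gender (as in evals).
fit_age_by_gender <- function(data) {
  lm(score ~ age * gender, data = data)
}
```

With evals loaded as above, fit_age_by_gender(evals) %>% tidy() %>% kable(digits=3) would display the corresponding regression table.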