Load necessary packages and the dataset in question:
library(ggplot2)
library(dplyr)
library(broom)
library(knitr)
load(url("http://www.openintro.org/stat/data/evals.RData"))
evals <- evals %>%
  select(score, ethnicity, gender, language, age, bty_avg, rank)
View(evals)
Exploratory data analysis: Create a visualization that shows the relationship between instructor age (age) and teaching score (score).
Comment on this relationship.
# Add your code to create visualization below:
ggplot(data = evals, aes(x = age, y = score)) +
  # geom_jitter() already draws the points; adding geom_point() would plot each one twice
  geom_jitter() +
  labs(x = "Age", y = "Teaching Score") +
  geom_smooth(method = "lm", se = FALSE)
cor(evals$score, evals$age)
## [1] -0.107032
Comment here: The plot above shows a weak, negative linear relationship between instructor age and teaching score (correlation of about -0.107).
Pipe (%>%) the table into kable(digits=3) to get a cleanly formatted table with values rounded to 3 decimal places.
# Add your code to create regression table below:
lm(score ~ age, data = evals) %>%
  tidy() %>%
  kable(digits = 3)

| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 4.462 | 0.127 | 35.195 | 0.000 |
| age | -0.006 | 0.003 | -2.311 | 0.021 |
Answer the two other questions here: The slope represents the rate of change of the predicted Y as X changes, i.e. how the predicted values vary with respect to X. For every one-year increase in the age of a professor, there is an associated decrease of, on average, 0.006 teaching score points. The predicted score of a 67-year-old would be approximately \(4.462 - 0.006 \times 67 \approx 4.06\).
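The prediction for a 67-year-old can be checked by plugging the age into the fitted line. The following is a minimal sketch that uses the rounded coefficients from the regression table rather than the fitted model object; the helper name `predict_score` is hypothetical and only for illustration.

```r
# Rounded coefficients copied from the regression table (intercept and age slope)
intercept <- 4.462
slope_age <- -0.006

# Hypothetical helper (not part of the assignment): fitted score at a given age
predict_score <- function(age) {
  intercept + slope_age * age
}

# Predicted teaching score for a 67-year-old instructor
predicted_67 <- predict_score(67)
print(round(predicted_67, 2))
```

Because the coefficients are rounded to 3 decimal places, the result can differ slightly from what `predict()` would return on the fitted model itself.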
Does there seem to be a systematic pattern in the lack-of-fit of the model? In other words, is there a pattern in the error between the fitted score \(\widehat{y}\) and the observed score \(y\)? Hint: geom_hline(yintercept=0, col="blue") adds a blue horizontal line at \(y=0\).
# Add the code necessary to answer this question below:
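point_by_point_info <- lm(score ~ age, data = evals) %>%
  augment() %>%
  select(score, age, .fitted, .resid)
# .resid is the observed score minus the fitted score (score - .fitted)
ggplot(data = point_by_point_info, aes(x = age, y = .resid)) +
  # geom_jitter() already draws the points; adding geom_point() would plot each one twice
  geom_jitter() +
  labs(x = "Age", y = "Residual") +
  geom_hline(yintercept = 0, col = "blue")
Comment here: The spread of the residuals seems to be fairly homoscedastic, i.e. roughly constant across ages.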
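The residual in the plot above is the observed score minus the fitted score. As a minimal sketch, the check below uses simulated data (not the evals dataset; the `toy` data frame and the coefficients used to generate it are made up for illustration) to confirm that subtracting fitted values from observed values reproduces what `resid()` returns, and that the residuals of a least-squares fit with an intercept average to zero.

```r
# Simulated stand-in data, NOT the evals dataset (all values are made up)
set.seed(42)
toy <- data.frame(age = sample(30:70, 100, replace = TRUE))
toy$score <- 4.5 - 0.006 * toy$age + rnorm(100, sd = 0.5)

# Simple linear regression of score on age
fit <- lm(score ~ age, data = toy)

# Residual = observed value minus fitted value
manual_resid <- toy$score - fitted(fit)

# Should agree with the residuals stored in the model object
print(all.equal(unname(manual_resid), unname(resid(fit))))

# With an intercept in the model, least-squares residuals average to zero
print(abs(mean(resid(fit))) < 1e-10)
```

This is why the blue line at \(y=0\) is a natural reference: a systematic pattern would show up as residuals that sit mostly above or below the line over some range of ages.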
Say a college administrator wants to model teaching scores using more than one predictor/explanatory variable, in particular the instructor’s gender in addition to age. Create a visualization that summarizes this relationship and comment on the observed relationship.
# Add your code to create visualization below:
ggplot(data = evals, aes(x = age, y = score)) +
  # geom_jitter() already draws the points; adding geom_point() would plot each one twice
  geom_jitter() +
  facet_wrap(~ gender) +
  labs(x = "Age", y = "Teaching Score") +
  geom_smooth(method = "lm", se = FALSE)
Comment here: In my opinion, the faceted regression above does a better job of predicting teaching scores, since it fits a separate line for each gender and so reflects how each subset of the original data behaves.
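Fitting a separate line in each facet corresponds to a single regression with an age-by-gender interaction. The sketch below illustrates this on simulated data (not the evals dataset; the `toy` data frame, its coefficients, and all object names are hypothetical), using base R only. It is one possible way to express the faceted plot as a model, not necessarily the model the assignment intends.

```r
# Simulated stand-in data, NOT the evals dataset (all values are made up)
set.seed(1)
n <- 200
toy <- data.frame(
  age    = sample(30:70, n, replace = TRUE),
  gender = factor(sample(c("female", "male"), n, replace = TRUE))
)
toy$score <- 4.5 - 0.01 * toy$age + 0.2 * (toy$gender == "male") +
  rnorm(n, sd = 0.5)

# One model with an interaction: its own intercept and slope for each gender
interaction_fit <- lm(score ~ age * gender, data = toy)

# Separate simple regressions, one per facet
female_fit <- lm(score ~ age, data = subset(toy, gender == "female"))
male_fit   <- lm(score ~ age, data = subset(toy, gender == "male"))

b <- coef(interaction_fit)
# "female" is the baseline level, so its slope is the main age effect
female_slope <- unname(b["age"])
# The male slope adds the interaction term to the baseline slope
male_slope <- unname(b["age"] + b["age:gendermale"])

# Both comparisons should agree up to numerical tolerance
print(all.equal(female_slope, unname(coef(female_fit)["age"])))
print(all.equal(male_slope, unname(coef(male_fit)["age"])))
```

Whether the extra flexibility actually improves prediction depends on how different the per-gender lines are; the sketch only shows that the two formulations give the same fitted lines.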