library(tidyverse)
library(moderndive)
library(skimr)
library(ISLR)
library(tinytex)
evals dataset
Researchers at the University of Texas in Austin, Texas (UT Austin) tried to answer the following research question: what factors explain differences in instructor teaching evaluation scores?
To this end, they collected instructor and course information on 463 courses. A full description of the study can be found at openintro.org.
glimpse(evals)
## Rows: 463
## Columns: 14
## $ ID <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17~
## $ prof_ID <int> 1, 1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 5, 5, ~
## $ score <dbl> 4.7, 4.1, 3.9, 4.8, 4.6, 4.3, 2.8, 4.1, 3.4, 4.5, 3.8, 4.~
## $ age <int> 36, 36, 36, 36, 59, 59, 59, 51, 51, 40, 40, 40, 40, 40, 4~
## $ bty_avg <dbl> 5.000, 5.000, 5.000, 5.000, 3.000, 3.000, 3.000, 3.333, 3~
## $ gender <fct> female, female, female, female, male, male, male, male, m~
## $ ethnicity <fct> minority, minority, minority, minority, not minority, not~
## $ language <fct> english, english, english, english, english, english, eng~
## $ rank <fct> tenure track, tenure track, tenure track, tenure track, t~
## $ pic_outfit <fct> not formal, not formal, not formal, not formal, not forma~
## $ pic_color <fct> color, color, color, color, color, color, color, color, c~
## $ cls_did_eval <int> 24, 86, 76, 77, 17, 35, 39, 55, 111, 40, 24, 24, 17, 14, ~
## $ cls_students <int> 43, 125, 125, 123, 20, 40, 44, 55, 195, 46, 27, 25, 20, 2~
## $ cls_level <fct> upper, upper, upper, upper, upper, upper, upper, upper, u~
Can we explain differences in teaching evaluation score based on various teacher attributes?
\(y:\) Average teaching \(score\) based on students evaluations
\(\vec{x}\) Attributes like gender, age, bty_avg
Investigate correlation between this variables
evals_disc11 <- evals %>%
select(ID, score, bty_avg, age)
evals_disc11
## # A tibble: 463 x 4
## ID score bty_avg age
## <int> <dbl> <dbl> <int>
## 1 1 4.7 5 36
## 2 2 4.1 5 36
## 3 3 3.9 5 36
## 4 4 4.8 5 36
## 5 5 4.6 3 59
## 6 6 4.3 3 59
## 7 7 2.8 3 59
## 8 8 4.1 3.33 51
## 9 9 3.4 3.33 51
## 10 10 4.5 3.17 40
## # ... with 453 more rows
Always perform an exploratory analysis of your variables before any formal modeling
Random sample of 10 out of the 463 courses at UT Ausitn
evals_disc11 %>%
sample_n(size = 10)
## # A tibble: 10 x 4
## ID score bty_avg age
## <int> <dbl> <dbl> <int>
## 1 228 4.7 8.17 39
## 2 406 5 2.83 57
## 3 196 4 6.5 44
## 4 371 3.9 3.33 38
## 5 308 3.7 3.67 35
## 6 191 4.3 2.33 54
## 7 192 4.1 2.33 54
## 8 218 4.4 4 42
## 9 16 4.3 3.17 40
## 10 48 4.7 4.67 33
evals_disc11 %>%
select(score, bty_avg, age) %>% skim()
Name | Piped data |
Number of rows | 463 |
Number of columns | 3 |
_______________________ | |
Column type frequency: | |
numeric | 3 |
________________________ | |
Group variables | None |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
score | 0 | 1 | 4.17 | 0.54 | 2.30 | 3.80 | 4.30 | 4.6 | 5.00 | ▁▁▅▇▇ |
bty_avg | 0 | 1 | 4.42 | 1.53 | 1.67 | 3.17 | 4.33 | 5.5 | 8.17 | ▃▇▇▃▂ |
age | 0 | 1 | 48.37 | 9.80 | 29.00 | 42.00 | 48.00 | 57.0 | 73.00 | ▅▆▇▆▁ |
## # A tibble: 1 x 1
## cor
## <dbl>
## 1 0.187
## # A tibble: 1 x 1
## cor
## <dbl>
## 1 -0.107
Visual Relationship
We can see their relationship is consistent with the coefficient sign and result
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
Regression line is consistent with the earlier result of correlation coefficient 0.187
as instructors have higher beauty scores they receive higher teacher evaluations
Score Model
term | estimate | std_error | statistic | p_value | lower_ci | upper_ci |
---|---|---|---|---|---|---|
intercept | 3.880 | 0.076 | 50.961 | 0 | 3.683 | 4.077 |
bty_avg | 0.067 | 0.016 | 4.090 | 0 | 0.024 | 0.109 |
Organize all points in one data frame how all values in the first row of output are computed.
Visual Relationship
Distribution of Residuals Investigate potential relationships between the residuals and all explanatory/predictor variables
\[\hat{y} = b_0 + b_1 * x\]
\[\hat{score} = b_0 + b_1 * bty.avg\]
\[\hat{score} = 3.880 + 0.067 * bty.avg\]
The intercept / average teaching score = \(b_0 = 3.88\)
Relationship between teaching and beauty = \(b_1 = 0.067\)
Note
Sign is positive
positive relationship
teachers with higher beauty tend to have higher teaching scores
Correlation Coefficient 0.187 is positive
slope \(b_1 = 0.067\) interpretation
*for every increase of 1 unit in bty_avg
there is an associated increase of,
on average, 0.0067 units of score*
Note
score = y
bty_avg = \(x\)
score_hat = \(\hat{y}\)
residual = \(y-\hat{y}\)
sum of squared residuals
if we compute the residuals for all 463 courses’ instructors and compute the sum of squared residuals we would obtain the “lack of a fit in a model”
\[ y - \hat{y} = 0\]
# compute the square of residuals
score_model_points %>%
mutate(squared_residuals = residual^2) %>%
summarise(sum_of_squared_residuals = sum(squared_residuals))
## # A tibble: 1 x 1
## sum_of_squared_residuals
## <dbl>
## 1 132.
if the regression line fits all the points perfectly, then the sum of the squared residuals is 0. In this case as we can see this linear model was not the appropriate choice for this regression.