library(tidyverse)
library(moderndive)
library(skimr)
library(ISLR)
library(tinytex)

Dataset

evals dataset

Researchers at the University of Texas at Austin (UT Austin) tried to answer the following research question: what factors explain differences in instructor teaching evaluation scores?

To this end, they collected instructor and course information on 463 courses. A full description of the study can be found at openintro.org.

glimpse(evals)
## Rows: 463
## Columns: 14
## $ ID           <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17~
## $ prof_ID      <int> 1, 1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 5, 5, ~
## $ score        <dbl> 4.7, 4.1, 3.9, 4.8, 4.6, 4.3, 2.8, 4.1, 3.4, 4.5, 3.8, 4.~
## $ age          <int> 36, 36, 36, 36, 59, 59, 59, 51, 51, 40, 40, 40, 40, 40, 4~
## $ bty_avg      <dbl> 5.000, 5.000, 5.000, 5.000, 3.000, 3.000, 3.000, 3.333, 3~
## $ gender       <fct> female, female, female, female, male, male, male, male, m~
## $ ethnicity    <fct> minority, minority, minority, minority, not minority, not~
## $ language     <fct> english, english, english, english, english, english, eng~
## $ rank         <fct> tenure track, tenure track, tenure track, tenure track, t~
## $ pic_outfit   <fct> not formal, not formal, not formal, not formal, not forma~
## $ pic_color    <fct> color, color, color, color, color, color, color, color, c~
## $ cls_did_eval <int> 24, 86, 76, 77, 17, 35, 39, 55, 111, 40, 24, 24, 17, 14, ~
## $ cls_students <int> 43, 125, 125, 123, 20, 40, 44, 55, 195, 46, 27, 25, 20, 2~
## $ cls_level    <fct> upper, upper, upper, upper, upper, upper, upper, upper, u~

Question

Can we explain differences in teaching evaluation score based on various teacher attributes?

Variables

\(y\): Average teaching evaluation score (score), based on students' evaluations

\(\vec{x}\): Instructor attributes such as gender, age, and bty_avg

Investigate the correlation between these variables:

  • ID
  • score
  • bty_avg (beauty average)
  • age

# keep only the course ID, evaluation score, beauty average, and instructor age
evals_disc11 <- evals %>% 
  select(ID, score, bty_avg, age)

evals_disc11
## # A tibble: 463 x 4
##       ID score bty_avg   age
##    <int> <dbl>   <dbl> <int>
##  1     1   4.7    5       36
##  2     2   4.1    5       36
##  3     3   3.9    5       36
##  4     4   4.8    5       36
##  5     5   4.6    3       59
##  6     6   4.3    3       59
##  7     7   2.8    3       59
##  8     8   4.1    3.33    51
##  9     9   3.4    3.33    51
## 10    10   4.5    3.17    40
## # ... with 453 more rows

EDA

Always perform an exploratory analysis of your variables before any formal modeling.

Random sample of 10 of the 463 courses at UT Austin:

evals_disc11 %>% 
  sample_n(size = 10)
## # A tibble: 10 x 4
##       ID score bty_avg   age
##    <int> <dbl>   <dbl> <int>
##  1   228   4.7    8.17    39
##  2   406   5      2.83    57
##  3   196   4      6.5     44
##  4   371   3.9    3.33    38
##  5   308   3.7    3.67    35
##  6   191   4.3    2.33    54
##  7   192   4.1    2.33    54
##  8   218   4.4    4       42
##  9    16   4.3    3.17    40
## 10    48   4.7    4.67    33

Statistics

evals_disc11 %>% 
  select(score, bty_avg, age) %>% skim()
Data summary

Name                      Piped data
Number of rows            463
Number of columns         3
Column type frequency:
  numeric                 3
Group variables           None

Variable type: numeric

skim_variable  n_missing  complete_rate   mean    sd     p0    p25    p50    p75   p100  hist
score                  0              1   4.17  0.54   2.30   3.80   4.30   4.60   5.00  ▁▁▅▇▇
bty_avg                0              1   4.42  1.53   1.67   3.17   4.33   5.50   8.17  ▃▇▇▃▂
age                    0              1  48.37  9.80  29.00  42.00  48.00  57.00  73.00  ▅▆▇▆▁

Correlation Coefficient

  • formula = score ~ bty_avg
  • formula = score ~ age
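
The code that produced the correlations below is not shown; a minimal sketch, assuming they were computed with moderndive's get_correlation():

# correlation of score with bty_avg, then with age
evals_disc11 %>% 
  get_correlation(formula = score ~ bty_avg)

evals_disc11 %>% 
  get_correlation(formula = score ~ age)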
## # A tibble: 1 x 1
##     cor
##   <dbl>
## 1 0.187
## # A tibble: 1 x 1
##      cor
##    <dbl>
## 1 -0.107

Visual Relationship

The scatterplots below show that each relationship is consistent with the sign and size of the corresponding correlation coefficient.
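
The plots themselves are not reproduced here; a sketch of how they could be drawn with ggplot2, assuming geom_smooth(method = "lm") for the fitted lines:

# score vs. beauty average with a fitted regression line
ggplot(evals_disc11, aes(x = bty_avg, y = score)) +
  geom_point() +
  geom_smooth(method = "lm") +
  labs(x = "Beauty average", y = "Teaching score")

# score vs. age with a fitted regression line
ggplot(evals_disc11, aes(x = age, y = score)) +
  geom_point() +
  geom_smooth(method = "lm") +
  labs(x = "Age", y = "Teaching score")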

## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'

The positive slope of the regression line is consistent with the correlation coefficient of 0.187 computed earlier: instructors with higher beauty scores tend to receive higher teaching evaluation scores.

Residuals

Score Model
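
The model-fitting code is not shown above; a minimal sketch, assuming the model was fit with lm() and summarized with moderndive's get_regression_table():

# fit a simple linear regression of score on bty_avg
score_model <- lm(score ~ bty_avg, data = evals_disc11)

# regression table with estimates, standard errors, and confidence intervals
get_regression_table(score_model)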

term       estimate  std_error  statistic  p_value  lower_ci  upper_ci
intercept     3.880      0.076     50.961        0     3.683     4.077
bty_avg       0.067      0.016      4.090        0     0.024     0.109

Next, organize the fitted values and residuals for all points in one data frame, so we can see how the values in each row of the output are computed.
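
A sketch of this step, assuming moderndive's get_regression_points() was used to build the score_model_points data frame referenced later:

# one row per course: observed score, fitted score_hat, and residual
score_model_points <- get_regression_points(score_model)
score_model_points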

Visual Relationship

Distribution of Residuals

Investigate potential relationships between the residuals and all explanatory/predictor variables, as sketched below.
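
The residual plots are not reproduced here; a sketch of these checks, assuming the score_model_points data frame built above:

# distribution of the residuals
ggplot(score_model_points, aes(x = residual)) +
  geom_histogram(binwidth = 0.25, color = "white") +
  labs(x = "Residual")

# residuals versus the explanatory variable
ggplot(score_model_points, aes(x = bty_avg, y = residual)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(x = "Beauty average", y = "Residual")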

Fit Model

\[\hat{y} = b_0 + b_1 \cdot x\]

\[\widehat{\text{score}} = b_0 + b_1 \cdot \text{bty\_avg}\]

\[\widehat{\text{score}} = 3.880 + 0.067 \cdot \text{bty\_avg}\]

The intercept \(b_0 = 3.88\) is the fitted average teaching score when bty_avg is 0.

The slope \(b_1 = 0.067\) summarizes the relationship between teaching score and beauty score.

Note

  • Sign is positive

  • positive relationship

  • teachers with higher beauty scores tend to have higher teaching scores

  • Correlation coefficient 0.187 is also positive

  • slope \(b_1 = 0.067\) interpretation

    *for every increase of 1 unit in bty_avg,
    there is an associated increase of,
    on average, 0.067 units of score*
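
As a quick arithmetic check of this interpretation (using hypothetical beauty scores of 4 and 5, one unit apart):

# fitted scores for two hypothetical beauty ratings one unit apart
3.880 + 0.067 * 4  # 4.148
3.880 + 0.067 * 5  # 4.215, i.e. 0.067 higher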

Note

  • score = \(y\)

  • bty_avg = \(x\)

  • score_hat = \(\hat{y}\)

  • residual = \(y - \hat{y}\)
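
These quantities can be verified directly from the regression points; a sketch, assuming the score_model_points data frame built above:

# residual should equal the observed score minus the fitted score
score_model_points %>% 
  mutate(check = score - score_hat) %>% 
  select(ID, score, score_hat, residual, check)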

Conditions for Linear Regression

Conclusion


sum of squared residuals

If we compute the residuals for all 463 courses' instructors and then sum their squares, we obtain a measure of the lack of fit of the model, the sum of squared residuals:

\[\sum_{i=1}^{463} (y_i - \hat{y}_i)^2\]

# compute the sum of squared residuals
score_model_points %>% 
  mutate(squared_residuals = residual^2) %>% 
  summarise(sum_of_squared_residuals = sum(squared_residuals))
## # A tibble: 1 x 1
##   sum_of_squared_residuals
##                      <dbl>
## 1                     132.

If the regression line fit all the points perfectly, the sum of squared residuals would be 0. Here the value is about 132, far from 0, which suggests this simple linear model may not be the most appropriate choice for these data.