library(tidyverse)
library(moderndive)
library(skimr)
library(ISLR)
library(tinytex)

Dataset

evals dataset

Researchers at the University of Texas at Austin (UT Austin) tried to answer the following research question: what factors explain differences in instructor teaching evaluation scores?

To this end, they collected instructor and course information on 463 courses. A full description of the study can be found at openintro.org.

glimpse(evals)
## Rows: 463
## Columns: 14
## $ ID           <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17~
## $ prof_ID      <int> 1, 1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 5, 5, ~
## $ score        <dbl> 4.7, 4.1, 3.9, 4.8, 4.6, 4.3, 2.8, 4.1, 3.4, 4.5, 3.8, 4.~
## $ age          <int> 36, 36, 36, 36, 59, 59, 59, 51, 51, 40, 40, 40, 40, 40, 4~
## $ bty_avg      <dbl> 5.000, 5.000, 5.000, 5.000, 3.000, 3.000, 3.000, 3.333, 3~
## $ gender       <fct> female, female, female, female, male, male, male, male, m~
## $ ethnicity    <fct> minority, minority, minority, minority, not minority, not~
## $ language     <fct> english, english, english, english, english, english, eng~
## $ rank         <fct> tenure track, tenure track, tenure track, tenure track, t~
## $ pic_outfit   <fct> not formal, not formal, not formal, not formal, not forma~
## $ pic_color    <fct> color, color, color, color, color, color, color, color, c~
## $ cls_did_eval <int> 24, 86, 76, 77, 17, 35, 39, 55, 111, 40, 24, 24, 17, 14, ~
## $ cls_students <int> 43, 125, 125, 123, 20, 40, 44, 55, 195, 46, 27, 25, 20, 2~
## $ cls_level    <fct> upper, upper, upper, upper, upper, upper, upper, upper, u~

Question

Can we explain differences in teaching evaluation score based on various teacher attributes?

Variables

\(y\): Average teaching evaluation score (score), based on students' evaluations

\(\vec{x}\): Instructor attributes such as gender, age, and bty_avg

Investigate the correlation between these variables:

  • ID
  • score
  • bty_avg (beauty average)
  • age

# keep only the course ID, evaluation score, beauty average, and instructor age
evals_disc11 <- evals %>% 
  select(ID, score, bty_avg, age)

evals_disc11
## # A tibble: 463 x 4
##       ID score bty_avg   age
##    <int> <dbl>   <dbl> <int>
##  1     1   4.7    5       36
##  2     2   4.1    5       36
##  3     3   3.9    5       36
##  4     4   4.8    5       36
##  5     5   4.6    3       59
##  6     6   4.3    3       59
##  7     7   2.8    3       59
##  8     8   4.1    3.33    51
##  9     9   3.4    3.33    51
## 10    10   4.5    3.17    40
## # ... with 453 more rows

EDA

Always perform an exploratory analysis of your variables before any formal modeling.

Random sample of 10 of the 463 courses at UT Austin:

evals_disc11 %>% 
  sample_n(size = 10)
## # A tibble: 10 x 4
##       ID score bty_avg   age
##    <int> <dbl>   <dbl> <int>
##  1   228   4.7    8.17    39
##  2   406   5      2.83    57
##  3   196   4      6.5     44
##  4   371   3.9    3.33    38
##  5   308   3.7    3.67    35
##  6   191   4.3    2.33    54
##  7   192   4.1    2.33    54
##  8   218   4.4    4       42
##  9    16   4.3    3.17    40
## 10    48   4.7    4.67    33

Statistics

evals_disc11 %>% 
  select(score, bty_avg, age) %>% skim()
Data summary

Name                      Piped data
Number of rows            463
Number of columns         3
Column type frequency:
  numeric                 3
Group variables           None

Variable type: numeric

skim_variable  n_missing  complete_rate   mean    sd     p0    p25    p50    p75   p100  hist
score                  0              1   4.17  0.54   2.30   3.80   4.30   4.60   5.00  ▁▁▅▇▇
bty_avg                0              1   4.42  1.53   1.67   3.17   4.33   5.50   8.17  ▃▇▇▃▂
age                    0              1  48.37  9.80  29.00  42.00  48.00  57.00  73.00  ▅▆▇▆▁

Correlation Coefficient

  • formula = score ~ bty_avg
  • formula = score ~ age
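
The code that produced the correlations below is not shown; a minimal sketch, assuming they were computed with moderndive's get_correlation():

# correlation of score with bty_avg, then with age
evals_disc11 %>% 
  get_correlation(formula = score ~ bty_avg)

evals_disc11 %>% 
  get_correlation(formula = score ~ age)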
## # A tibble: 1 x 1
##     cor
##   <dbl>
## 1 0.187
## # A tibble: 1 x 1
##      cor
##    <dbl>
## 1 -0.107

Visual Relationship

The scatterplots below show that each relationship is consistent with the sign and size of the corresponding correlation coefficient.
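
The plots themselves are not reproduced here; a sketch of how they could be drawn with ggplot2, assuming geom_smooth(method = "lm") for the fitted lines:

# score vs. beauty average with a fitted regression line
ggplot(evals_disc11, aes(x = bty_avg, y = score)) +
  geom_point() +
  geom_smooth(method = "lm") +
  labs(x = "Beauty average", y = "Teaching score")

# score vs. age with a fitted regression line
ggplot(evals_disc11, aes(x = age, y = score)) +
  geom_point() +
  geom_smooth(method = "lm") +
  labs(x = "Age", y = "Teaching score")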

## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'

The positive slope of the regression line is consistent with the correlation coefficient of 0.187 computed earlier: instructors with higher beauty scores tend to receive higher teaching evaluation scores.

Residuals

Score Model
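
The model-fitting code is not shown above; a minimal sketch, assuming the model was fit with lm() and summarized with moderndive's get_regression_table():

# fit a simple linear regression of score on bty_avg
score_model <- lm(score ~ bty_avg, data = evals_disc11)

# regression table with estimates, standard errors, and confidence intervals
get_regression_table(score_model)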

term       estimate  std_error  statistic  p_value  lower_ci  upper_ci
intercept     3.880      0.076     50.961        0     3.683     4.077
bty_avg       0.067      0.016      4.090        0     0.024     0.109

Next, organize the fitted values and residuals for all points in one data frame, so we can see how the values in each row of the output are computed.
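
A sketch of this step, assuming moderndive's get_regression_points() was used to build the score_model_points data frame referenced later:

# one row per course: observed score, fitted score_hat, and residual
score_model_points <- get_regression_points(score_model)
score_model_points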

Visual Relationship

Distribution of Residuals

Investigate potential relationships between the residuals and all explanatory/predictor variables, as sketched below.
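
The residual plots are not reproduced here; a sketch of these checks, assuming the score_model_points data frame built above:

# distribution of the residuals
ggplot(score_model_points, aes(x = residual)) +
  geom_histogram(binwidth = 0.25, color = "white") +
  labs(x = "Residual")

# residuals versus the explanatory variable
ggplot(score_model_points, aes(x = bty_avg, y = residual)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(x = "Beauty average", y = "Residual")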

Fit Model

\[\hat{y} = b_0 + b_1 \cdot x\]

\[\widehat{\text{score}} = b_0 + b_1 \cdot \text{bty\_avg}\]

\[\widehat{\text{score}} = 3.880 + 0.067 \cdot \text{bty\_avg}\]

The intercept \(b_0 = 3.88\) is the fitted average teaching score when bty_avg is 0.

The slope \(b_1 = 0.067\) summarizes the relationship between teaching score and beauty score.

Note

  • Sign is positive

  • positive relationship

  • teachers with higher beauty scores tend to have higher teaching scores

  • Correlation coefficient 0.187 is also positive

  • slope \(b_1 = 0.067\) interpretation

    *for every increase of 1 unit in bty_avg,
    there is an associated increase of,
    on average, 0.067 units of score*
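
As a quick arithmetic check of this interpretation (using hypothetical beauty scores of 4 and 5, one unit apart):

# fitted scores for two hypothetical beauty ratings one unit apart
3.880 + 0.067 * 4  # 4.148
3.880 + 0.067 * 5  # 4.215, i.e. 0.067 higher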

Note

  • score = \(y\)

  • bty_avg = \(x\)

  • score_hat = \(\hat{y}\)

  • residual = \(y - \hat{y}\)
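
These quantities can be verified directly from the regression points; a sketch, assuming the score_model_points data frame built above:

# residual should equal the observed score minus the fitted score
score_model_points %>% 
  mutate(check = score - score_hat) %>% 
  select(ID, score, score_hat, residual, check)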

Conditions for Linear Regression

Conclusion


sum of squared residuals

If we compute the residuals for all 463 courses' instructors and then sum their squares, we obtain a measure of the lack of fit of the model, the sum of squared residuals:

\[\sum_{i=1}^{463} (y_i - \hat{y}_i)^2\]

# compute the sum of squared residuals
score_model_points %>% 
  mutate(squared_residuals = residual^2) %>% 
  summarise(sum_of_squared_residuals = sum(squared_residuals))
## # A tibble: 1 x 1
##   sum_of_squared_residuals
##                      <dbl>
## 1                     132.

If the regression line fit all the points perfectly, the sum of squared residuals would be 0. Here the value is about 132, far from 0, which suggests this simple linear model may not be the most appropriate choice for these data.