Using R, build a multiple regression model for data that interests you. Include in this model at least one quadratic term, one dichotomous term, and one dichotomous vs. quantitative interaction term. Interpret all coefficients. Conduct residual analysis. Was the linear model appropriate? Why or why not?
This multiple regression model is for the evals dataset from the openintro package. It contains evaluation scores and characteristics for each professor. The regression model focuses on the following terms: age, ethnicity, and bty_avg.
library(tidyverse)
library(openintro)
df <- evals
summary(df)
## course_id prof_id score rank
## Min. : 1.0 Min. : 1.00 Min. :2.300 teaching :102
## 1st Qu.:116.5 1st Qu.:20.00 1st Qu.:3.800 tenure track:108
## Median :232.0 Median :43.00 Median :4.300 tenured :253
## Mean :232.0 Mean :45.15 Mean :4.175
## 3rd Qu.:347.5 3rd Qu.:70.50 3rd Qu.:4.600
## Max. :463.0 Max. :94.00 Max. :5.000
## ethnicity gender language age
## minority : 64 female:195 english :435 Min. :29.00
## not minority:399 male :268 non-english: 28 1st Qu.:42.00
## Median :48.00
## Mean :48.37
## 3rd Qu.:57.00
## Max. :73.00
## cls_perc_eval cls_did_eval cls_students cls_level cls_profs
## Min. : 10.42 Min. : 5.00 Min. : 8.00 lower:157 multiple:306
## 1st Qu.: 62.70 1st Qu.: 15.00 1st Qu.: 19.00 upper:306 single :157
## Median : 76.92 Median : 23.00 Median : 29.00
## Mean : 74.43 Mean : 36.62 Mean : 55.18
## 3rd Qu.: 87.25 3rd Qu.: 40.00 3rd Qu.: 60.00
## Max. :100.00 Max. :380.00 Max. :581.00
## cls_credits bty_f1lower bty_f1upper bty_f2upper
## multi credit:436 Min. :1.000 Min. :1.000 Min. : 1.000
## one credit : 27 1st Qu.:2.000 1st Qu.:4.000 1st Qu.: 4.000
## Median :4.000 Median :5.000 Median : 5.000
## Mean :3.963 Mean :5.019 Mean : 5.214
## 3rd Qu.:5.000 3rd Qu.:7.000 3rd Qu.: 6.000
## Max. :8.000 Max. :9.000 Max. :10.000
## bty_m1lower bty_m1upper bty_m2upper bty_avg
## Min. :1.000 Min. :1.000 Min. :1.000 Min. :1.667
## 1st Qu.:2.000 1st Qu.:3.000 1st Qu.:4.000 1st Qu.:3.167
## Median :3.000 Median :4.000 Median :5.000 Median :4.333
## Mean :3.413 Mean :4.147 Mean :4.752 Mean :4.418
## 3rd Qu.:5.000 3rd Qu.:5.000 3rd Qu.:6.000 3rd Qu.:5.500
## Max. :7.000 Max. :9.000 Max. :9.000 Max. :8.167
## pic_outfit pic_color
## formal : 77 black&white: 78
## not formal:386 color :385
##
##
##
##
glimpse(df)
## Rows: 463
## Columns: 23
## $ course_id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1~
## $ prof_id <int> 1, 1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 5, 5,~
## $ score <dbl> 4.7, 4.1, 3.9, 4.8, 4.6, 4.3, 2.8, 4.1, 3.4, 4.5, 3.8, 4~
## $ rank <fct> tenure track, tenure track, tenure track, tenure track, ~
## $ ethnicity <fct> minority, minority, minority, minority, not minority, no~
## $ gender <fct> female, female, female, female, male, male, male, male, ~
## $ language <fct> english, english, english, english, english, english, en~
## $ age <int> 36, 36, 36, 36, 59, 59, 59, 51, 51, 40, 40, 40, 40, 40, ~
## $ cls_perc_eval <dbl> 55.81395, 68.80000, 60.80000, 62.60163, 85.00000, 87.500~
## $ cls_did_eval <int> 24, 86, 76, 77, 17, 35, 39, 55, 111, 40, 24, 24, 17, 14,~
## $ cls_students <int> 43, 125, 125, 123, 20, 40, 44, 55, 195, 46, 27, 25, 20, ~
## $ cls_level <fct> upper, upper, upper, upper, upper, upper, upper, upper, ~
## $ cls_profs <fct> single, single, single, single, multiple, multiple, mult~
## $ cls_credits <fct> multi credit, multi credit, multi credit, multi credit, ~
## $ bty_f1lower <int> 5, 5, 5, 5, 4, 4, 4, 5, 5, 2, 2, 2, 2, 2, 2, 2, 2, 7, 7,~
## $ bty_f1upper <int> 7, 7, 7, 7, 4, 4, 4, 2, 2, 5, 5, 5, 5, 5, 5, 5, 5, 9, 9,~
## $ bty_f2upper <int> 6, 6, 6, 6, 2, 2, 2, 5, 5, 4, 4, 4, 4, 4, 4, 4, 4, 9, 9,~
## $ bty_m1lower <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 7, 7,~
## $ bty_m1upper <int> 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 6, 6,~
## $ bty_m2upper <int> 6, 6, 6, 6, 3, 3, 3, 3, 3, 2, 2, 2, 2, 2, 2, 2, 2, 6, 6,~
## $ bty_avg <dbl> 5.000, 5.000, 5.000, 5.000, 3.000, 3.000, 3.000, 3.333, ~
## $ pic_outfit <fct> not formal, not formal, not formal, not formal, not form~
## $ pic_color <fct> color, color, color, color, color, color, color, color, ~
ggplot(df , aes(x = age, y = score)) +
geom_point()
ggplot(df , aes(x = ethnicity, y = score)) +
geom_point()
ggplot(df , aes(x = bty_avg, y = score)) +
geom_point()
Regression Model
model <- lm(data=df, score ~ age + ethnicity + bty_avg)
summary(model)
##
## Call:
## lm(formula = score ~ age + ethnicity + bty_avg, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.9661 -0.3506 0.1343 0.3829 1.0187
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.955742 0.176774 22.377 < 2e-16 ***
## age -0.003555 0.002668 -1.332 0.183377
## ethnicitynot minority 0.140897 0.072161 1.953 0.051484 .
## bty_avg 0.061004 0.017047 3.579 0.000382 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.533 on 459 degrees of freedom
## Multiple R-squared: 0.04571, Adjusted R-squared: 0.03947
## F-statistic: 7.328 on 3 and 459 DF, p-value: 8.298e-05
Residual Analysis
plot(model)
The plots show us that the residuals are normally distributed. The Normal Q-Q plot shows the residuals following the main diagonal, giving a strong indication of normality. For this data, the linear model was appropriate.