This is an observational study because no treatment was assigned: the researchers simply collected data on course evaluations and on the instructors' appearance. The question would need to be rephrased, because there are a few more factors that could help answer the original question, such as personality or teaching ability. A better phrasing could be, “Is there an association between instructor appearance and course evaluation scores?”
## Rows: 463
## Columns: 23
## $ course_id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1…
## $ prof_id <int> 1, 1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 5, 5,…
## $ score <dbl> 4.7, 4.1, 3.9, 4.8, 4.6, 4.3, 2.8, 4.1, 3.4, 4.5, 3.8, 4…
## $ rank <fct> tenure track, tenure track, tenure track, tenure track, …
## $ ethnicity <fct> minority, minority, minority, minority, not minority, no…
## $ gender <fct> female, female, female, female, male, male, male, male, …
## $ language <fct> english, english, english, english, english, english, en…
## $ age <int> 36, 36, 36, 36, 59, 59, 59, 51, 51, 40, 40, 40, 40, 40, …
## $ cls_perc_eval <dbl> 55.81395, 68.80000, 60.80000, 62.60163, 85.00000, 87.500…
## $ cls_did_eval <int> 24, 86, 76, 77, 17, 35, 39, 55, 111, 40, 24, 24, 17, 14,…
## $ cls_students <int> 43, 125, 125, 123, 20, 40, 44, 55, 195, 46, 27, 25, 20, …
## $ cls_level <fct> upper, upper, upper, upper, upper, upper, upper, upper, …
## $ cls_profs <fct> single, single, single, single, multiple, multiple, mult…
## $ cls_credits <fct> multi credit, multi credit, multi credit, multi credit, …
## $ bty_f1lower <int> 5, 5, 5, 5, 4, 4, 4, 5, 5, 2, 2, 2, 2, 2, 2, 2, 2, 7, 7,…
## $ bty_f1upper <int> 7, 7, 7, 7, 4, 4, 4, 2, 2, 5, 5, 5, 5, 5, 5, 5, 5, 9, 9,…
## $ bty_f2upper <int> 6, 6, 6, 6, 2, 2, 2, 5, 5, 4, 4, 4, 4, 4, 4, 4, 4, 9, 9,…
## $ bty_m1lower <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 7, 7,…
## $ bty_m1upper <int> 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 6, 6,…
## $ bty_m2upper <int> 6, 6, 6, 6, 3, 3, 3, 3, 3, 2, 2, 2, 2, 2, 2, 2, 2, 6, 6,…
## $ bty_avg <dbl> 5.000, 5.000, 5.000, 5.000, 3.000, 3.000, 3.000, 3.333, …
## $ pic_outfit <fct> not formal, not formal, not formal, not formal, not form…
## $ pic_color <fct> color, color, color, color, color, color, color, color, …
The distribution is skewed to the left (negatively skewed). Students usually give course evaluation scores above 3.5. This suggests that students may be hesitant to give low scores, or that the instructors are performing well in their courses. Attractiveness may also play a role.
ggplot(evals, aes(x = score)) +
geom_histogram(binwidth = 0.2) +
labs(title = "Distribution of Course Evaluation Scores",
x = "Score",
y = "Count")
This graph shows the relationship between the age and the average beauty rating of each instructor. There is a slight downward trend, indicating a weak negative relationship between age and beauty rating. The relationship is not strong and the points are widely scattered, but there is some correlation: younger instructors typically receive higher beauty ratings.
ggplot(evals, aes(x = age, y = bty_avg)) +
geom_point() +
geom_smooth(method = "lm") +
labs(title = "Relationship Between Age and Beauty Rating",
x = "Age of Instructor",
y = "Average Beauty Rating")## `geom_smooth()` using formula = 'y ~ x'
Using geom_jitter spreads out the points so that overlapping observations become visible. With jitter, we can see that many instructors share identical or nearly identical values that were plotted on top of each other in the original scatterplot. Jitter gives a better view of how the values are distributed.
ggplot(evals, aes(x = age, y = bty_avg)) +
geom_jitter(width = 0.3, height = 0.3) +
geom_smooth(method = "lm") +
labs(title = "Relationship Between Age and Beauty Rating (With Jitter)",
x = "Age of Instructor",
y = "Average Beauty Rating")## `geom_smooth()` using formula = 'y ~ x'
Coefficients: Intercept = 3.880, bty_avg = 0.067
The slope of about 0.067 means that for every 1-point increase in average beauty rating, the predicted course evaluation score increases by roughly 0.07 points. The slope is statistically significant (p < 0.001), but it is a small effect in practical terms, which suggests there may be bigger factors driving course evaluation scores.
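The coefficients quoted above come from a simple linear model of score on bty_avg. A minimal sketch of the fit (the object name m_bty is an assumption; the formula matches the Call in the summary output below):
# Fit the simple linear model of evaluation score on average beauty rating
m_bty <- lm(score ~ bty_avg, data = evals)
# Extract the intercept and slope quoted above
coef(m_bty)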
ggplot(data = evals, aes(x = bty_avg, y = score)) +
geom_jitter(width = 0.2, height = 0.2) +
geom_smooth(method = "lm", se = TRUE) +
labs(title = "Score vs. Beauty with Regression Line",
x = "Average Beauty Rating",
y = "Teaching Evaluation Score")## `geom_smooth()` using formula = 'y ~ x'
The diagnostic plots show that the conditions for least squares regression (linearity, nearly normal residuals, constant variance) are reasonably satisfied, which suggests the model is trustworthy.
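A sketch of how those diagnostic plots can be produced with base R's plot() method, assuming the model object m_bty fit above:
# Standard residual diagnostics: Residuals vs Fitted, Normal Q-Q,
# Scale-Location, and Residuals vs Leverage
par(mfrow = c(2, 2))
plot(m_bty)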
##
## Call:
## lm(formula = score ~ bty_avg, data = evals)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.9246 -0.3690 0.1420 0.3977 0.9309
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.88034 0.07614 50.96 < 2e-16 ***
## bty_avg 0.06664 0.01629 4.09 5.08e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5348 on 461 degrees of freedom
## Multiple R-squared: 0.03502, Adjusted R-squared: 0.03293
## F-statistic: 16.73 on 1 and 461 DF, p-value: 5.083e-05
After adding gender to the model, beauty is still a significant predictor of course evaluation scores. The regression conditions remain reasonably satisfied, so the model is still reliable for describing this relationship.
# Multiple regression: score ~ bty_avg + gender
m_bty_gen <- lm(score ~ bty_avg + gender, data = evals)
summary(m_bty_gen)
##
## Call:
## lm(formula = score ~ bty_avg + gender, data = evals)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.8305 -0.3625 0.1055 0.4213 0.9314
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.74734 0.08466 44.266 < 2e-16 ***
## bty_avg 0.07416 0.01625 4.563 6.48e-06 ***
## gendermale 0.17239 0.05022 3.433 0.000652 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5287 on 460 degrees of freedom
## Multiple R-squared: 0.05912, Adjusted R-squared: 0.05503
## F-statistic: 14.45 on 2 and 460 DF, p-value: 8.177e-07
## # A tibble: 3 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 3.75 0.0847 44.3 6.23e-168
## 2 bty_avg 0.0742 0.0163 4.56 6.48e- 6
## 3 gendermale 0.172 0.0502 3.43 6.52e- 4
The coefficient for bty_avg is still positive and its p-value is still well below 0.001. This shows that average beauty rating remains a significant predictor of course evaluation scores, even after accounting for gender.
In the first model, the bty_avg coefficient was about 0.067, and after adding gender it increased slightly to about 0.074. Adding gender therefore does not explain away the beauty effect; the estimate remains positive and significant.
R automatically adds gendermale as a dummy (indicator) variable, with female as the reference level. The coefficient for gendermale (about 0.17) shows that male instructors have scores that are, on average, about 0.17 points higher than female instructors with the same beauty rating. Adding gender changes the estimated effect of beauty on course evaluation scores only slightly.
Male equation: score^male = (3.747 + 0.172) + 0.074 ⋅ bty_avg = 3.92 + 0.074 ⋅ bty_avg
Female equation: score^female = 3.75 + 0.074 ⋅ bty_avg
The slope is the same for both genders, but the intercept is about 0.17 higher for males. This shows that, for a given beauty rating, male instructors tend to receive higher course evaluation scores than female instructors in this model.
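A short sketch of how these gender-specific equations can be assembled from the fitted coefficients of m_bty_gen (the model fit above):
# Extract coefficients from the model with beauty and gender
b <- coef(m_bty_gen)
# Female (reference level): intercept + slope * bty_avg
c(intercept_female = unname(b["(Intercept)"]), slope = unname(b["bty_avg"]))
# Male: intercept shifted up by the gendermale coefficient, same slope
c(intercept_male = unname(b["(Intercept)"] + b["gendermale"]), slope = unname(b["bty_avg"]))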
In this exercise, rank has three levels: teaching (the reference level), tenure track, and tenured. Holding beauty constant, tenure track instructors are expected to score about 0.16 points lower than teaching-rank instructors, and tenured instructors are expected to score about 0.13 points lower.
# Multiple regression: score ~ bty_avg + rank
m_bty_rank <- lm(score ~ bty_avg + rank, data = evals)
tidy(m_bty_rank)
## # A tibble: 4 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 3.98 0.0908 43.9 2.92e-166
## 2 bty_avg 0.0678 0.0165 4.10 4.92e- 5
## 3 ranktenure track -0.161 0.0740 -2.17 3.03e- 2
## 4 ranktenured -0.126 0.0627 -2.01 4.45e- 2
Ethnicity and language are probably not good predictors, since their p-values are expected to be high. The models so far suggest that bty_avg, gender, and rank are the most useful predictors.
# Full multiple regression model
m_full <- lm(score ~ rank + gender + ethnicity + language + age + cls_perc_eval +
cls_students + cls_level + cls_profs + cls_credits + bty_avg,
data = evals)
tidy(m_full)
## # A tibble: 13 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 3.53 0.241 14.7 4.65e-40
## 2 ranktenure track -0.107 0.0820 -1.30 1.93e- 1
## 3 ranktenured -0.0450 0.0652 -0.691 4.90e- 1
## 4 gendermale 0.179 0.0515 3.47 5.79e- 4
## 5 ethnicitynot minority 0.187 0.0775 2.41 1.63e- 2
## 6 languagenon-english -0.127 0.108 -1.17 2.41e- 1
## 7 age -0.00665 0.00308 -2.16 3.15e- 2
## 8 cls_perc_eval 0.00570 0.00155 3.67 2.68e- 4
## 9 cls_students 0.000445 0.000358 1.24 2.15e- 1
## 10 cls_levelupper 0.0187 0.0556 0.337 7.37e- 1
## 11 cls_profssingle -0.00858 0.0514 -0.167 8.67e- 1
## 12 cls_creditsone credit 0.509 0.117 4.35 1.70e- 5
## 13 bty_avg 0.0613 0.0167 3.67 2.68e- 4
The coefficient for ethnicitynot minority is about 0.19, which indicates that instructors who are not a minority score, on average, roughly 0.19 points higher than minority instructors, holding the other variables constant. Its p-value (about 0.016) is significant at the 0.05 level but weaker than predictors such as gender, cls_perc_eval, cls_credits, and bty_avg, so ethnicity does not appear to be among the most important factors.
The adjusted R-squared of the full model is about 0.141, higher than for the simpler models. Dropping an uninformative variable (such as cls_profs) can simplify a multiple regression model with almost no effect on the remaining estimates; in general, though, coefficient estimates depend on which other variables are included in the model.
m_full <- lm(score ~ rank + gender + ethnicity + language + age + cls_perc_eval +
cls_students + cls_level + cls_profs + cls_credits + bty_avg,
data = evals)
summary(m_full)$adj.r.squared
## [1] 0.1412172
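As a sketch of the drop-one comparison discussed above (the object name m_full_no_profs is hypothetical), the model can be refit without cls_profs and its adjusted R-squared compared with the full model's:
# Refit the full model without cls_profs and compare adjusted R-squared
m_full_no_profs <- update(m_full, . ~ . - cls_profs)
summary(m_full_no_profs)$adj.r.squared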
Using backward selection, the best model includes gender, ethnicity, cls_perc_eval, cls_credits, and bty_avg. The other variables were removed because they did not improve the model. The final linear model for predicting course evaluation scores is: score = 3.14 + 0.158 ⋅ gendermale + 0.234 ⋅ ethnicitynot minority + 0.0052 ⋅ cls_perc_eval + 0.541 ⋅ cls_creditsone credit + 0.074 ⋅ bty_avg. This model keeps the most important predictors while remaining simple.
# Start with full model
m_full <- lm(score ~ rank + gender + ethnicity + language + age + cls_perc_eval +
cls_students + cls_level + cls_profs + cls_credits + bty_avg,
data = evals)
# Backward selection with step(); k = log(n) applies the BIC penalty
best_model <- step(m_full, direction = "backward", k = log(nrow(evals)))
## Start: AIC=-567.87
## score ~ rank + gender + ethnicity + language + age + cls_perc_eval +
## cls_students + cls_level + cls_profs + cls_credits + bty_avg
##
## Df Sum of Sq RSS AIC
## - rank 2 0.4325 114.74 -578.39
## - cls_profs 1 0.0071 114.31 -573.98
## - cls_level 1 0.0288 114.34 -573.89
## - language 1 0.3501 114.66 -572.59
## - cls_students 1 0.3923 114.70 -572.42
## - age 1 1.1818 115.49 -569.24
## - ethnicity 1 1.4771 115.78 -568.06
## <none> 114.31 -567.87
## - gender 1 3.0515 117.36 -561.81
## - cls_perc_eval 1 3.4284 117.74 -560.32
## - bty_avg 1 3.4287 117.74 -560.32
## - cls_credits 1 4.8017 119.11 -554.95
##
## Step: AIC=-578.39
## score ~ gender + ethnicity + language + age + cls_perc_eval +
## cls_students + cls_level + cls_profs + cls_credits + bty_avg
##
## Df Sum of Sq RSS AIC
## - cls_profs 1 0.0103 114.75 -584.49
## - cls_level 1 0.0173 114.76 -584.46
## - cls_students 1 0.3645 115.11 -583.06
## - language 1 0.5568 115.30 -582.29
## - age 1 0.8918 115.63 -580.95
## <none> 114.74 -578.39
## - ethnicity 1 1.7046 116.44 -577.70
## - gender 1 3.1469 117.89 -572.00
## - cls_perc_eval 1 3.5245 118.27 -570.52
## - bty_avg 1 3.5642 118.31 -570.37
## - cls_credits 1 5.6754 120.42 -562.18
##
## Step: AIC=-584.49
## score ~ gender + ethnicity + language + age + cls_perc_eval +
## cls_students + cls_level + cls_credits + bty_avg
##
## Df Sum of Sq RSS AIC
## - cls_level 1 0.0162 114.77 -590.56
## - cls_students 1 0.3731 115.12 -589.13
## - language 1 0.5552 115.31 -588.39
## - age 1 0.8964 115.65 -587.03
## <none> 114.75 -584.49
## - ethnicity 1 1.8229 116.57 -583.33
## - gender 1 3.1375 117.89 -578.14
## - cls_perc_eval 1 3.5166 118.27 -576.65
## - bty_avg 1 3.5547 118.31 -576.50
## - cls_credits 1 5.8278 120.58 -567.69
##
## Step: AIC=-590.56
## score ~ gender + ethnicity + language + age + cls_perc_eval +
## cls_students + cls_credits + bty_avg
##
## Df Sum of Sq RSS AIC
## - cls_students 1 0.3569 115.12 -595.26
## - language 1 0.5390 115.31 -594.53
## - age 1 0.8828 115.65 -593.15
## <none> 114.77 -590.56
## - ethnicity 1 1.8948 116.66 -589.12
## - gender 1 3.1222 117.89 -584.27
## - cls_perc_eval 1 3.5266 118.29 -582.69
## - bty_avg 1 3.5461 118.31 -582.61
## - cls_credits 1 6.2703 121.04 -572.07
##
## Step: AIC=-595.26
## score ~ gender + ethnicity + language + age + cls_perc_eval +
## cls_credits + bty_avg
##
## Df Sum of Sq RSS AIC
## - language 1 0.6192 115.74 -598.92
## - age 1 0.9342 116.06 -597.66
## <none> 115.12 -595.26
## - ethnicity 1 1.8997 117.02 -593.82
## - cls_perc_eval 1 3.1769 118.30 -588.80
## - gender 1 3.4709 118.59 -587.65
## - bty_avg 1 4.0096 119.13 -585.55
## - cls_credits 1 6.1046 121.23 -577.48
##
## Step: AIC=-598.92
## score ~ gender + ethnicity + age + cls_perc_eval + cls_credits +
## bty_avg
##
## Df Sum of Sq RSS AIC
## - age 1 0.9645 116.71 -601.21
## <none> 115.74 -598.92
## - ethnicity 1 2.9096 118.65 -593.56
## - cls_perc_eval 1 3.1928 118.94 -592.46
## - gender 1 3.3804 119.12 -591.73
## - bty_avg 1 3.9968 119.74 -589.34
## - cls_credits 1 6.5916 122.33 -579.41
##
## Step: AIC=-601.21
## score ~ gender + ethnicity + cls_perc_eval + cls_credits + bty_avg
##
## Df Sum of Sq RSS AIC
## <none> 116.71 -601.21
## - gender 1 2.7053 119.41 -596.74
## - ethnicity 1 2.7477 119.46 -596.58
## - cls_perc_eval 1 3.3244 120.03 -594.35
## - bty_avg 1 5.5674 122.28 -585.77
## - cls_credits 1 6.8241 123.53 -581.04
##
## Call:
## lm(formula = score ~ gender + ethnicity + cls_perc_eval + cls_credits +
## bty_avg, data = evals)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.8857 -0.3294 0.1066 0.3774 1.0540
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.137381 0.146450 21.423 < 2e-16 ***
## gendermale 0.157832 0.048493 3.255 0.001219 **
## ethnicitynot minority 0.233794 0.071275 3.280 0.001117 **
## cls_perc_eval 0.005208 0.001443 3.608 0.000343 ***
## cls_creditsone credit 0.541067 0.104669 5.169 3.52e-07 ***
## bty_avg 0.073644 0.015773 4.669 3.98e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5053 on 457 degrees of freedom
## Multiple R-squared: 0.146, Adjusted R-squared: 0.1366
## F-statistic: 15.62 on 5 and 457 DF, p-value: 3.338e-14
## # A tibble: 6 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 3.14 0.146 21.4 5.31e-71
## 2 gendermale 0.158 0.0485 3.25 1.22e- 3
## 3 ethnicitynot minority 0.234 0.0713 3.28 1.12e- 3
## 4 cls_perc_eval 0.00521 0.00144 3.61 3.43e- 4
## 5 cls_creditsone credit 0.541 0.105 5.17 3.52e- 7
## 6 bty_avg 0.0736 0.0158 4.67 3.98e- 6
Plots used: Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage. The diagnostic plots for the final model indicate that the regression conditions are reasonably satisfied: the residuals show no clear pattern (linearity looks reasonable), are nearly normally distributed, and have roughly constant variance, and there are no highly influential points. This suggests that the model’s coefficients, p-values, and predictions can be considered reliable.
# Residual diagnostics for the final model
par(mfrow = c(2,2)) # Arrange plots in a 2x2 grid
plot(best_model)
Because some professors teach multiple courses, the observations are not fully independent, which can affect the accuracy of standard errors and p-values in the regression.
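This can be checked directly from the data, since prof_id identifies the instructor. A quick sketch (assuming dplyr is loaded):
# Count how many professors appear in more than one course
library(dplyr)
evals %>%
  count(prof_id) %>%
  summarise(total_profs = n(), profs_with_multiple_courses = sum(n > 1))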
Based on the final model, instructors who have higher beauty ratings, who are male, who are not a minority, and who teach one-credit courses in which a high percentage of students complete the evaluation tend to receive the highest predicted evaluation scores. Rank and class size do not appear in the final model, so they do not affect these predictions.
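As an illustration only (the specific values below are hypothetical, not taken from the data), the fitted model can be used to predict the score for a course with those characteristics:
# Predict the evaluation score for a hypothetical course with
# characteristics the final model associates with high scores
new_course <- data.frame(
  gender = "male",
  ethnicity = "not minority",
  cls_perc_eval = 90,        # 90% of students completed the evaluation
  cls_credits = "one credit",
  bty_avg = 7
)
predict(best_model, newdata = new_course)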
No, we should not generalize these conclusions to professors at other universities. The data come from a single sample at the University of Texas at Austin, and factors like student culture, the course evaluation system, and regional differences may differ at other schools, so the relationships found here may not hold elsewhere.