This analysis examines factors influencing professor evaluation scores at the University of Texas at Austin, with particular focus on whether physical attractiveness predicts higher ratings. Using multiple linear regression on 463 course evaluations, I found that while beauty rating is a statistically significant predictor (p < 0.001), it explains only 3.5% of the variance in scores—suggesting minimal practical impact. The final model identifies seven significant predictors: gender, language of degree institution, age, class evaluation participation rate, course credits, beauty rating, and picture color. Key limitations include the single-institution sample and non-independent observations due to professors teaching multiple courses. This project demonstrates skills in exploratory data analysis, regression diagnostics, model selection via backward elimination, and communicating statistical findings to diverse audiences.
Here, you will analyze the data from this study in order to learn what goes into a positive professor evaluation.
In this lab, you will explore and visualize the data using the tidyverse suite of packages. The data can be found in the companion package for OpenIntro resources, openintro.
Let’s load the packages.
This is the first time we’re using the GGally package.
You will be using the ggpairs function from this package
later in the lab.
The data were gathered from end of semester student evaluations for a
large sample of professors from the University of Texas at Austin. In
addition, six students rated the professors’ physical appearance. The
result is a data frame where each row contains a different course and
columns represent variables about the courses and professors. It’s
called evals.
## Rows: 463
## Columns: 23
## $ course_id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1…
## $ prof_id <int> 1, 1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 5, 5,…
## $ score <dbl> 4.7, 4.1, 3.9, 4.8, 4.6, 4.3, 2.8, 4.1, 3.4, 4.5, 3.8, 4…
## $ rank <fct> tenure track, tenure track, tenure track, tenure track, …
## $ ethnicity <fct> minority, minority, minority, minority, not minority, no…
## $ gender <fct> female, female, female, female, male, male, male, male, …
## $ language <fct> english, english, english, english, english, english, en…
## $ age <int> 36, 36, 36, 36, 59, 59, 59, 51, 51, 40, 40, 40, 40, 40, …
## $ cls_perc_eval <dbl> 55.81395, 68.80000, 60.80000, 62.60163, 85.00000, 87.500…
## $ cls_did_eval <int> 24, 86, 76, 77, 17, 35, 39, 55, 111, 40, 24, 24, 17, 14,…
## $ cls_students <int> 43, 125, 125, 123, 20, 40, 44, 55, 195, 46, 27, 25, 20, …
## $ cls_level <fct> upper, upper, upper, upper, upper, upper, upper, upper, …
## $ cls_profs <fct> single, single, single, single, multiple, multiple, mult…
## $ cls_credits <fct> multi credit, multi credit, multi credit, multi credit, …
## $ bty_f1lower <int> 5, 5, 5, 5, 4, 4, 4, 5, 5, 2, 2, 2, 2, 2, 2, 2, 2, 7, 7,…
## $ bty_f1upper <int> 7, 7, 7, 7, 4, 4, 4, 2, 2, 5, 5, 5, 5, 5, 5, 5, 5, 9, 9,…
## $ bty_f2upper <int> 6, 6, 6, 6, 2, 2, 2, 5, 5, 4, 4, 4, 4, 4, 4, 4, 4, 9, 9,…
## $ bty_m1lower <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 7, 7,…
## $ bty_m1upper <int> 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 6, 6,…
## $ bty_m2upper <int> 6, 6, 6, 6, 3, 3, 3, 3, 3, 2, 2, 2, 2, 2, 2, 2, 2, 6, 6,…
## $ bty_avg <dbl> 5.000, 5.000, 5.000, 5.000, 3.000, 3.000, 3.000, 3.333, …
## $ pic_outfit <fct> not formal, not formal, not formal, not formal, not form…
## $ pic_color <fct> color, color, color, color, color, color, color, color, …
We have observations on 21 different variables, some categorical and some numerical. The meaning of each variable can be found by bringing up the help file:
The original research question posed in the paper is whether beauty leads directly to the differences in course evaluations. Given the study design, is it possible to answer this question as it is phrased? If not, rephrase the question.
This is an observational study. The data was collected at the end of the semester when students evaluated their professors through course evaluations. This study qualifies as observational rather than experimental for several key reasons:
No Variable Manipulation: In an observational study, researchers collect data on variables as they naturally occur without any intervention or manipulation. In this case, the researchers are simply gathering existing evaluation scores and other variables (such as professor characteristics, course attributes, or student demographics) without controlling or altering any conditions. The professors taught their courses naturally, and students provided their ratings based on their authentic experiences.
Lack of Random Assignment: Unlike an experimental study, there was no random assignment of subjects to different treatment groups. Students enrolled in courses through normal registration processes, and professors taught as they typically would. Researchers did not assign students to specific professors or manipulate teaching methods to observe different outcomes.
Passive Data Collection: The emphasis is on recording and analyzing data from variables that already exist in their natural state. The student evaluations occurred as part of the routine end-of-semester process, and researchers are examining these pre-existing ratings rather than creating controlled conditions to test a hypothesis.
What Would Make It Experimental: For this to be an experimental study, researchers would need to actively manipulate one or more independent variables (such as randomly assigning different teaching methods, class sizes, or professor characteristics) and then measure the effect on the dependent variable (student evaluations) while controlling for other factors.
Therefore, this study is observational because it involves measuring and analyzing naturally occurring data without researcher intervention or control over the variables of interest.
Original Research Question The original research question posed in the paper is whether beauty leads directly to differences in course evaluations.
Evaluation of the Research Question and Study Design Given the design of the study, it is not possible to answer the question as originally phrased. Establishing a direct causal relationship between beauty and course evaluations would require an experimental design in which the independent variable—beauty—is systematically manipulated. For example, participants would need to evaluate instructors who have been standardized according to predetermined levels of beauty. This would allow the researcher to isolate and assess whether beauty itself contributes to variations in average course evaluation scores.
However, the existing study does not employ such an experimental manipulation. Instead, it relies on naturally occurring variations in perceived beauty and evaluation scores, which limits the conclusions to correlational findings rather than causal ones.
Revised Research Question A research question that better aligns with the current study design would be:
“Is there a correlation between students’ beauty ratings of professors and the professors’ average course evaluation scores?”
score.Is the distribution skewed? What does that tell you about how students rate courses? Is this what you expected to see? Why, or why not?
The distribution is left-skewed** (negatively skewed), showing that the majority of professors received high evaluation scores.** The bulk of the data is concentrated between 4.0 and 4.5 on the evaluation scale, with a long tail extending toward lower scores (2.5-3.5). This indicates that students in this study were more likely to rate professors favorably, with relatively few professors receiving low evaluation scores.
Key Observations: - The peak (mode) of the distribution appears to be around 4.3-4.5 - Very few professors received scores below 3.0 - The left tail suggests that while most professors were rated highly, there were some outliers who received notably lower evaluations
Unexpected Finding: I did not expect to see this pattern. I anticipated a more normal (symmetric) distribution centered around the middle of the scale, with approximately equal numbers of professors rated above and below average. Instead, the data shows a ceiling effect, where ratings cluster toward the higher end of the scale. This could suggest: - Grade inflation or evaluation inflation - Students’ reluctance to give low ratings - Generally effective teaching across the sample - Possible response bias (students satisfied with courses may be more likely to complete evaluations)
The Key Correction: When the tail extends to the LEFT (toward lower values), it’s LEFT-skewed. When the tail extends to the RIGHT (toward higher values), it’s RIGHT-skewed. Think of the direction the tail is pointing!
score, select two other variables
for analysisTwo other pairs of variables to explore are pic_color
and tenure and pic_outfit and
tenure.
# Grouped bar chart - Picture Color
library(ggplot2)
p1 <- ggplot(evals, aes(x = rank, fill = pic_color)) +
geom_bar(position = "dodge") +
scale_fill_manual(values = c("goldenrod1", "steelblue3")) +
labs(title = "Professor Rank by Picture Color",
x = "Rank",
fill = "Picture Type")
print(p1)## [1] TRUE
# Grouped bar chart - Picture Outfit
p2 <- ggplot(evals, aes(x = rank, fill = pic_outfit)) +
geom_bar(position = "dodge") +
scale_fill_manual(values = c("goldenrod1", "coral3")) +
labs(title = "Professor Rank by Picture Outfit",
x = "Rank",
fill = "Picture Outfit")
print(p2)# Check if any teaching rank professors have formal pictures
evals %>%
filter(rank == "teaching", pic_outfit == "formal") %>%
nrow() > 0## [1] FALSE
## =============Overall percentages of Professor Rank by Picture Color===============
## [1] "Counts: Rank by Picture Color"
##
## black&white color
## teaching 13 89
## tenure track 20 88
## tenured 45 208
## [1] "\nRow Percentages (% within each rank)"
##
## black&white color
## teaching 12.7 87.3
## tenure track 18.5 81.5
## tenured 17.8 82.2
## =============Overall percentages of Professor Rank by Picture Outfit===============
## [1] "Counts: Rank by Picture Color"
##
## formal not formal
## teaching 0 102
## tenure track 15 93
## tenured 62 191
## [1] "\nRow Percentages (% within each rank)"
##
## formal not formal
## teaching 0.0 100.0
## tenure track 13.9 86.1
## tenured 24.5 75.5
A clear relationship exists between academic rank and picture outfit formality. Across all ranks, professors predominantly wore informal attire rather than formal attire. Teaching rank professors overwhelmingly chose informal dress, with approximately 100 wearing informal attire and no professors recorded as wearing formal attire in the dataset. Tenure track professors demonstrated a similar pattern, with roughly 95 wearing informal attire compared to only 15-20 in formal clothing. Even tenured professors, who comprised the largest group in the dataset, favored informal attire, with approximately 190 wearing informal dress compared to only about 60 in formal attire. While tenured professors showed a slightly higher proportion of formal attire compared to lower ranks, informal dress remained the dominant choice across all academic ranks.
The fundamental phenomenon suggested by the study is that better looking teachers are evaluated more favorably. Let’s create a scatterplot to see if this appears to be the case:
Before you draw conclusions about the trend, compare the number of observations in the data frame with the approximate number of points on the scatterplot. Is anything awry?
geom_jitter.m_bty to predict
average professor score by average beauty rating.# Colored points but single regression line
ggplot(evals, aes(x = bty_avg, y = score)) +
geom_jitter(aes(color = rank), alpha = 0.6) +
geom_smooth(method = "lm", se = FALSE, color = "darkgreen") +
scale_color_manual(values = c("teaching" = "goldenrod1",
"tenure track" = "steelblue3",
"tenured" = "coral3")) +
labs(title = "Evaluation Score vs Beauty Rating",
x = "Average Beauty Rating",
y = "Evaluation Score",
color = "Rank")##
## Call:
## lm(formula = score ~ bty_avg, data = evals)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.9246 -0.3690 0.1420 0.3977 0.9309
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.88034 0.07614 50.96 < 2e-16 ***
## bty_avg 0.06664 0.01629 4.09 5.08e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5348 on 461 degrees of freedom
## Multiple R-squared: 0.03502, Adjusted R-squared: 0.03293
## F-statistic: 16.73 on 1 and 461 DF, p-value: 5.083e-05
The predicted evaluation score equals 3.88 plus 0.067 times the average beauty rating. Or written as an equation: score =3.88+0.067 x bty_avg
Interpretation of the slope: For every 1-point increase in average beauty rating, the predicted evaluation score increases by approximately 0.067 points.
Statistical significance: Yes—the p-value for bty_avg is 5.08e-05 (or 0.0000508), which is far below 0.05. The three asterisks (***) confirm this is highly significant.
Practical significance: This is debatable. While the relationship is statistically significant, the R-squared is only 0.035, meaning beauty rating explains just 3.5% of the variation in evaluation scores. Additionally, a slope of 0.067 is quite small—a professor would need to be rated about 15 points higher in beauty to gain just 1 point in their evaluation score. So while there’s a real relationship, its practical impact is minimal.
The blue line is the model. The shaded gray area around the line
tells you about the variability you might expect in your predictions. To
turn that off, use se = FALSE.
ggplot(data = evals, aes(x = bty_avg, y = score)) +
geom_jitter() +
geom_smooth(method = "lm", se = FALSE)# Plot 1: Residuals vs. Fitted Values (checks linearity & constant variability)
plot(m_bty$residuals ~ m_bty$fitted.values,
main = "Residuals vs. Fitted Values",
xlab = "Fitted Values",
ylab = "Residuals",
col = "steelblue3",
pch = 16)
abline(h = 0, col = "goldenrod1", lwd = 2, lty = 2)# Plot 2: Normal Q-Q Plot (checks normality of residuals)
qqnorm(m_bty$residuals,
main = "Normal Q-Q Plot of Residuals",
col = "steelblue3",
pch = 16)
qqline(m_bty$residuals, col = "goldenrod1", lwd = 2)# Plot 3: Histogram of Residuals (also checks normality)
hist(m_bty$residuals,
main = "Distribution of Residuals",
xlab = "Residuals",
col = "steelblue3",
breaks = 20)# Plot 4: Residuals vs. Order/Index (checks independence)
plot(m_bty$residuals,
main = "Residuals vs. Order of Data Collection",
xlab = "Order of Data",
ylab = "Residuals",
col = "steelblue3",
pch = 16)
abline(h = 0, col = "goldenrod1", lwd = 2, lty = 2)Analysis
Plot 1: Check for linearity & constant variability - The variability around zero is random. There is no pattern and consistent spread indicating homoscedasticity and meeting the criteria for least squares regression.
Plot 2: Check for the nearly normal residuals - The normal Q-Q plot shows that the residuals closely follow the theoretical normal line. Minor deviations occur in the tails, suggesting that the normality condition is reasonably met.
Plot 3: Check for the nearly normal residuals using a histogram The distribution of residuals is slightly left-skewed (negatively skewed), with the majority of residuals concentrated near zero and between 0 and 0.5, but with a longer tail extending toward negative values (reaching approximately -2.0). The peak of the distribution is slightly right of zero, and there are fewer extreme positive residuals (reaching only about 1.0) compared to extreme negative residuals.
Plot4: Check for independence of observation The residuals vs. order plot shows [no apparent time trend or pattern/a pattern], suggesting that the independence condition is likely met. Since these are course evaluations collected at the end of a semester,it is not expected there may be time-based dependencies and the criteria for least squares regression is met.
While there is a minor departure from perfect normality (left skew in the histogram and tail deviations in the Q-Q plot), these violations are relatively small. Given the large sample size (463 observations) and the robustness of linear regression to minor normality violations, the model appears appropriate for this data and the regression results can be considered reliable.
The data set contains several variables on the beauty score of the professor: individual ratings from each of the six students who were asked to score the physical appearance of the professors and the average of these six scores. Let’s take a look at the relationship between one of these scores and the average beauty score.
## # A tibble: 1 × 1
## `cor(bty_avg, bty_f1lower)`
## <dbl>
## 1 0.844
As expected, the relationship is quite strong—after all, the average score is calculated using the individual scores. You can actually look at the relationships between all beauty variables (columns 13 through 19) using the following command:
These variables are collinear (correlated), and adding more than one of these variables to the model would not add much value to the model. In this application and with these highly-correlated predictors, it is reasonable to use the average beauty score as the single representative of these variables.
In order to see if beauty is still a significant predictor of professor score after you’ve accounted for the professor’s gender, you can add the gender term into the model.
##
## Call:
## lm(formula = score ~ bty_avg + gender, data = evals)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.8305 -0.3625 0.1055 0.4213 0.9314
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.74734 0.08466 44.266 < 2e-16 ***
## bty_avg 0.07416 0.01625 4.563 6.48e-06 ***
## gendermale 0.17239 0.05022 3.433 0.000652 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5287 on 460 degrees of freedom
## Multiple R-squared: 0.05912, Adjusted R-squared: 0.05503
## F-statistic: 14.45 on 2 and 460 DF, p-value: 8.177e-07
# 1. Residuals vs Fitted (linearity & constant variability)
ggplot(data = m_bty_gen, aes(x = .fitted, y = .resid)) +
geom_point(alpha = 0.6) +
geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
labs(title = "Residuals vs Fitted Values",
x = "Fitted Values",
y = "Residuals")# 2. Histogram of Residuals (normality)
ggplot(data = m_bty_gen, aes(x = .resid)) +
geom_histogram(binwidth = 0.25, fill = "steelblue", color = "white") +
labs(title = "Distribution of Residuals",
x = "Residuals",
y = "Count")# 3. Q-Q Plot (normality)
ggplot(data = m_bty_gen, aes(sample = .resid)) +
stat_qq() +
stat_qq_line(color = "red") +
labs(title = "Normal Q-Q Plot of Residuals")Linearity: The residuals vs. fitted values plot shows points scattered randomly around the horizontal line at 0 with no clear curved pattern. The linearity condition is satisfied.
Constant Variability (Homoscedasticity): The vertical spread of residuals appears relatively consistent across all fitted values (from 4.0 to 4.5). There is no obvious fan or funnel shape. The constant variability condition is satisfied.
Normality: The histogram shows a left-skewed distribution rather than a perfectly symmetric bell curve, with a longer tail extending toward negative values. The Q-Q plot confirms this—points follow the line in the middle but deviate at both tails (curving below the line at the upper end). There is a modest departure from normality.
Independence: The observations are independent since each row represents a different professor’s course evaluation, and one professor’s rating should not influence another’s.
Conclusion: Three of the four conditions (linearity, constant variability, and independence) are reasonably satisfied. The normality condition shows some violation due to left skew. However, with a large sample size (n = 463), linear regression is robust to moderate departures from normality. Therefore, the conditions are reasonably met, and we can trust the p-values and parameter estimates from this model.
bty_avg still a significant predictor
of score?##
## Call:
## lm(formula = score ~ bty_avg + gender, data = evals)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.8305 -0.3625 0.1055 0.4213 0.9314
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.74734 0.08466 44.266 < 2e-16 ***
## bty_avg 0.07416 0.01625 4.563 6.48e-06 ***
## gendermale 0.17239 0.05022 3.433 0.000652 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5287 on 460 degrees of freedom
## Multiple R-squared: 0.05912, Adjusted R-squared: 0.05503
## F-statistic: 14.45 on 2 and 460 DF, p-value: 8.177e-07
##
## Call:
## lm(formula = score ~ bty_avg, data = evals)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.9246 -0.3690 0.1420 0.3977 0.9309
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.88034 0.07614 50.96 < 2e-16 ***
## bty_avg 0.06664 0.01629 4.09 5.08e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5348 on 461 degrees of freedom
## Multiple R-squared: 0.03502, Adjusted R-squared: 0.03293
## F-statistic: 16.73 on 1 and 461 DF, p-value: 5.083e-05
##
## Call:
## lm(formula = score ~ bty_avg + gender, data = evals)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.8305 -0.3625 0.1055 0.4213 0.9314
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.74734 0.08466 44.266 < 2e-16 ***
## bty_avg 0.07416 0.01625 4.563 6.48e-06 ***
## gendermale 0.17239 0.05022 3.433 0.000652 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5287 on 460 degrees of freedom
## Multiple R-squared: 0.05912, Adjusted R-squared: 0.05503
## F-statistic: 14.45 on 2 and 460 DF, p-value: 8.177e-07
Yes, bty_avg remains a statistically significant
predictor of score after adding gender to the
model. The p-value for bty_avg in the multiple regression
model is approximately 0.00005, which is well below the 0.05
significance threshold.
gender to the model
changed the parameter estimate for bty_avg?| Model | bty_avg Coefficient |
|---|---|
| Simple (m_bty) | 0.0666 |
| Multiple (m_bty_gen) | 0.0742 |
Yes, the coefficient increased slightly from 0.067 to 0.074 when gender was added. This suggests gender was a minor confounding variable—after controlling for it, the effect of beauty on evaluation scores appears slightly stronger than in the simple model.
\[ \begin{aligned} \widehat{score} &= \hat{\beta}_0 + \hat{\beta}_1 \times bty\_avg + \hat{\beta}_2 \times (0) \\ &= \hat{\beta}_0 + \hat{\beta}_1 \times bty\_avg\end{aligned} \]
ggplot(data = evals, aes(x = bty_avg, y = score, color = pic_color)) +
geom_smooth(method = "lm", formula = y ~ x, se = FALSE)##
## Call:
## lm(formula = score ~ bty_avg + pic_color, data = evals)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.8892 -0.3690 0.1293 0.4023 0.9125
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.06318 0.10908 37.249 < 2e-16 ***
## bty_avg 0.05548 0.01691 3.282 0.00111 **
## pic_colorcolor -0.16059 0.06892 -2.330 0.02022 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5323 on 460 degrees of freedom
## Multiple R-squared: 0.04628, Adjusted R-squared: 0.04213
## F-statistic: 11.16 on 2 and 460 DF, p-value: 1.848e-05
The equation is score = 4.063 + 0.055 x bty_avg - 0.161 X pic_color
##
## Call:
## lm(formula = score ~ bty_avg + rank, data = evals)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.8713 -0.3642 0.1489 0.4103 0.9525
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.98155 0.09078 43.860 < 2e-16 ***
## bty_avg 0.06783 0.01655 4.098 4.92e-05 ***
## ranktenure track -0.16070 0.07395 -2.173 0.0303 *
## ranktenured -0.12623 0.06266 -2.014 0.0445 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5328 on 459 degrees of freedom
## Multiple R-squared: 0.04652, Adjusted R-squared: 0.04029
## F-statistic: 7.465 on 3 and 459 DF, p-value: 6.88e-05
**Sample dummy code variables
## [1] "teaching" "tenure track" "tenured"
## tenure track tenured
## teaching 0 0
## tenure track 1 0
## tenured 0 1
# See the actual dummy variables R creates using model.matrix()
head(model.matrix(~ rank, data = evals), 10)## (Intercept) ranktenure track ranktenured
## 1 1 1 0
## 2 1 1 0
## 3 1 1 0
## 4 1 1 0
## 5 1 0 1
## 6 1 0 1
## 7 1 0 1
## 8 1 0 1
## 9 1 0 1
## 10 1 0 1
# Show a few rows with original rank and dummy coding side by side
evals %>%
select(score, bty_avg, rank) %>%
mutate(
rank_tenure_track = ifelse(rank == "tenure track", 1, 0),
rank_tenured = ifelse(rank == "tenured", 1, 0)
) %>%
head(15)## # A tibble: 15 × 5
## score bty_avg rank rank_tenure_track rank_tenured
## <dbl> <dbl> <fct> <dbl> <dbl>
## 1 4.7 5 tenure track 1 0
## 2 4.1 5 tenure track 1 0
## 3 3.9 5 tenure track 1 0
## 4 4.8 5 tenure track 1 0
## 5 4.6 3 tenured 0 1
## 6 4.3 3 tenured 0 1
## 7 2.8 3 tenured 0 1
## 8 4.1 3.33 tenured 0 1
## 9 3.4 3.33 tenured 0 1
## 10 4.5 3.17 tenured 0 1
## 11 3.8 3.17 tenured 0 1
## 12 4.5 3.17 tenured 0 1
## 13 4.6 3.17 tenured 0 1
## 14 3.9 3.17 tenured 0 1
## 15 3.9 3.17 tenured 0 1
Expected output from
contrasts(evals$rank):
tenure track tenured
teaching 0 0
tenure track 1 0
tenured 0 1
Expected output from
model.matrix():
(Intercept) ranktenure track ranktenured
1 1 1 0
2 1 1 0
3 1 1 0
4 1 1 0
5 1 0 1
6 1 0 1
...
When a categorical variable has more than two levels, R creates multiple dummy variables—specifically, one fewer dummy variable than the number of levels (k-1 dummy variables for k levels). For rank with three levels (teaching, tenure track, tenured):
The interpretation of the coefficients in multiple regression is
slightly different from that of simple regression. The estimate for
bty_avg reflects how much higher a group of professors is
expected to score if they have a beauty rating that is one point higher
while holding all other variables constant. In this case, that
translates into considering only professors of the same rank with
bty_avg scores that are one point apart.
We will start with a full model that predicts professor score based on rank, gender, ethnicity, language of the university where they got their degree, age, proportion of students that filled out evaluations, class size, course level, number of professors, number of credits, average beauty rating, outfit, and picture color.
I predict that the variable with the highest p score is cls_level since the division of the course seems irrelevant to the perception of the professors beauty or overall evaluation of the professor. A high p-value (p > 0.05) generally suggests that there is not enough statistical evidence to conclude that the observed relationship between the variable and the outcome is real or significant in the studied population
in your response.
m_full <- lm(score ~ rank + gender + ethnicity + language + age + cls_perc_eval
+ cls_students + cls_level + cls_profs + cls_credits + bty_avg
+ pic_outfit + pic_color, data = evals)
summary(m_full)##
## Call:
## lm(formula = score ~ rank + gender + ethnicity + language + age +
## cls_perc_eval + cls_students + cls_level + cls_profs + cls_credits +
## bty_avg + pic_outfit + pic_color, data = evals)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.77397 -0.32432 0.09067 0.35183 0.95036
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.0952141 0.2905277 14.096 < 2e-16 ***
## ranktenure track -0.1475932 0.0820671 -1.798 0.07278 .
## ranktenured -0.0973378 0.0663296 -1.467 0.14295
## gendermale 0.2109481 0.0518230 4.071 5.54e-05 ***
## ethnicitynot minority 0.1234929 0.0786273 1.571 0.11698
## languagenon-english -0.2298112 0.1113754 -2.063 0.03965 *
## age -0.0090072 0.0031359 -2.872 0.00427 **
## cls_perc_eval 0.0053272 0.0015393 3.461 0.00059 ***
## cls_students 0.0004546 0.0003774 1.205 0.22896
## cls_levelupper 0.0605140 0.0575617 1.051 0.29369
## cls_profssingle -0.0146619 0.0519885 -0.282 0.77806
## cls_creditsone credit 0.5020432 0.1159388 4.330 1.84e-05 ***
## bty_avg 0.0400333 0.0175064 2.287 0.02267 *
## pic_outfitnot formal -0.1126817 0.0738800 -1.525 0.12792
## pic_colorcolor -0.2172630 0.0715021 -3.039 0.00252 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.498 on 448 degrees of freedom
## Multiple R-squared: 0.1871, Adjusted R-squared: 0.1617
## F-statistic: 7.366 on 14 and 448 DF, p-value: 6.552e-14
I was close in my estimate, cls_level had the second highest p value. The only variable with a higher p value, meaning it was the only feature that was even less statistically significant was the cls_profs describing whether the class had single or multiple professors.
The coefficient for ethnicitynot minority is 0.1235 with a p-value of 0.117. Interpretation: Professors who are not from a minority ethnic group are predicted to score 0.1235 points higher on their evaluations compared to professors from minority ethnic groups, holding all other variables constant. However, this relationship is not statistically significant at the alpha = 0.05 level (p = 0.117 > 0.05), meaning we cannot conclude with confidence that there is a true difference in evaluation scores between minority and non-minority professors after controlling for all other variables in the model.
# Original full model
m_full <- lm(score ~ rank + gender + ethnicity + language + age + cls_perc_eval
+ cls_students + cls_level + cls_profs + cls_credits + bty_avg
+ pic_outfit + pic_color, data = evals)
# Model with cls_profs removed (highest p-value)
m_full_minus_profs <- lm(score ~ rank + gender + ethnicity + language + age + cls_perc_eval
+ cls_students + cls_level + cls_credits + bty_avg
+ pic_outfit + pic_color, data = evals)
#FULL MODEL
print(summary(m_full))##
## Call:
## lm(formula = score ~ rank + gender + ethnicity + language + age +
## cls_perc_eval + cls_students + cls_level + cls_profs + cls_credits +
## bty_avg + pic_outfit + pic_color, data = evals)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.77397 -0.32432 0.09067 0.35183 0.95036
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.0952141 0.2905277 14.096 < 2e-16 ***
## ranktenure track -0.1475932 0.0820671 -1.798 0.07278 .
## ranktenured -0.0973378 0.0663296 -1.467 0.14295
## gendermale 0.2109481 0.0518230 4.071 5.54e-05 ***
## ethnicitynot minority 0.1234929 0.0786273 1.571 0.11698
## languagenon-english -0.2298112 0.1113754 -2.063 0.03965 *
## age -0.0090072 0.0031359 -2.872 0.00427 **
## cls_perc_eval 0.0053272 0.0015393 3.461 0.00059 ***
## cls_students 0.0004546 0.0003774 1.205 0.22896
## cls_levelupper 0.0605140 0.0575617 1.051 0.29369
## cls_profssingle -0.0146619 0.0519885 -0.282 0.77806
## cls_creditsone credit 0.5020432 0.1159388 4.330 1.84e-05 ***
## bty_avg 0.0400333 0.0175064 2.287 0.02267 *
## pic_outfitnot formal -0.1126817 0.0738800 -1.525 0.12792
## pic_colorcolor -0.2172630 0.0715021 -3.039 0.00252 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.498 on 448 degrees of freedom
## Multiple R-squared: 0.1871, Adjusted R-squared: 0.1617
## F-statistic: 7.366 on 14 and 448 DF, p-value: 6.552e-14
##
## Call:
## lm(formula = score ~ rank + gender + ethnicity + language + age +
## cls_perc_eval + cls_students + cls_level + cls_credits +
## bty_avg + pic_outfit + pic_color, data = evals)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.7836 -0.3257 0.0859 0.3513 0.9551
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.0872523 0.2888562 14.150 < 2e-16 ***
## ranktenure track -0.1476746 0.0819824 -1.801 0.072327 .
## ranktenured -0.0973829 0.0662614 -1.470 0.142349
## gendermale 0.2101231 0.0516873 4.065 5.66e-05 ***
## ethnicitynot minority 0.1274458 0.0772887 1.649 0.099856 .
## languagenon-english -0.2282894 0.1111305 -2.054 0.040530 *
## age -0.0089992 0.0031326 -2.873 0.004262 **
## cls_perc_eval 0.0052888 0.0015317 3.453 0.000607 ***
## cls_students 0.0004687 0.0003737 1.254 0.210384
## cls_levelupper 0.0606374 0.0575010 1.055 0.292200
## cls_creditsone credit 0.5061196 0.1149163 4.404 1.33e-05 ***
## bty_avg 0.0398629 0.0174780 2.281 0.023032 *
## pic_outfitnot formal -0.1083227 0.0721711 -1.501 0.134080
## pic_colorcolor -0.2190527 0.0711469 -3.079 0.002205 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4974 on 449 degrees of freedom
## Multiple R-squared: 0.187, Adjusted R-squared: 0.1634
## F-statistic: 7.943 on 13 and 449 DF, p-value: 2.336e-14
# Extract coefficients from both models
full_coef <- coef(m_full)
reduced_coef <- coef(m_full_minus_profs)
# Find common variables (excluding cls_profs)
common_vars <- names(reduced_coef)
# Create comparison dataframe
comparison <- data.frame(
Variable = common_vars,
Full_Model = round(full_coef[common_vars], 5),
Reduced_Model = round(reduced_coef[common_vars], 5),
Difference = round(full_coef[common_vars] - reduced_coef[common_vars], 5)
)
print(comparison)## Variable Full_Model Reduced_Model Difference
## (Intercept) (Intercept) 4.09521 4.08725 0.00796
## ranktenure track ranktenure track -0.14759 -0.14767 0.00008
## ranktenured ranktenured -0.09734 -0.09738 0.00005
## gendermale gendermale 0.21095 0.21012 0.00082
## ethnicitynot minority ethnicitynot minority 0.12349 0.12745 -0.00395
## languagenon-english languagenon-english -0.22981 -0.22829 -0.00152
## age age -0.00901 -0.00900 -0.00001
## cls_perc_eval cls_perc_eval 0.00533 0.00529 0.00004
## cls_students cls_students 0.00045 0.00047 -0.00001
## cls_levelupper cls_levelupper 0.06051 0.06064 -0.00012
## cls_creditsone credit cls_creditsone credit 0.50204 0.50612 -0.00408
## bty_avg bty_avg 0.04003 0.03986 0.00017
## pic_outfitnot formal pic_outfitnot formal -0.11268 -0.10832 -0.00436
## pic_colorcolor pic_colorcolor -0.21726 -0.21905 0.00179
# Extract p-values
full_pvals <- summary(m_full)$coefficients[, "Pr(>|t|)"]
reduced_pvals <- summary(m_full_minus_profs)$coefficients[, "Pr(>|t|)"]
# Create p-value comparison dataframe
pval_comparison <- data.frame(
Variable = common_vars,
Full_Model_pval = round(full_pvals[common_vars], 5),
Reduced_Model_pval = round(reduced_pvals[common_vars], 5)
)
print(pval_comparison)## Variable Full_Model_pval Reduced_Model_pval
## (Intercept) (Intercept) 0.00000 0.00000
## ranktenure track ranktenure track 0.07278 0.07233
## ranktenured ranktenured 0.14295 0.14235
## gendermale gendermale 0.00006 0.00006
## ethnicitynot minority ethnicitynot minority 0.11698 0.09986
## languagenon-english languagenon-english 0.03965 0.04053
## age age 0.00427 0.00426
## cls_perc_eval cls_perc_eval 0.00059 0.00061
## cls_students cls_students 0.22896 0.21038
## cls_levelupper cls_levelupper 0.29369 0.29220
## cls_creditsone credit cls_creditsone credit 0.00002 0.00001
## bty_avg bty_avg 0.02267 0.02303
## pic_outfitnot formal pic_outfitnot formal 0.12792 0.13408
## pic_colorcolor pic_colorcolor 0.00252 0.00221
## Maximum coefficient change: 0.00796
## This small change indicates cls_profs was NOT collinear with other variables.
Using backward-selection and p-value as the selection criterion, determine the best model. You do not need to show all steps in your answer, just the output for the final model. Also, write out the linear model for predicting score based on the final model you settle on.
# Backward selection process - removing highest p-value one at a time
# Final model after backward selection:
m_best <- lm(score ~ gender + language + age + cls_perc_eval + cls_credits + bty_avg + pic_color, data = evals)
summary(m_best)##
## Call:
## lm(formula = score ~ gender + language + age + cls_perc_eval +
## cls_credits + bty_avg + pic_color, data = evals)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.81919 -0.32035 0.09272 0.38526 0.88213
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.967255 0.215824 18.382 < 2e-16 ***
## gendermale 0.221457 0.049937 4.435 1.16e-05 ***
## languagenon-english -0.281933 0.098341 -2.867 0.00434 **
## age -0.005877 0.002622 -2.241 0.02551 *
## cls_perc_eval 0.004295 0.001432 2.999 0.00286 **
## cls_creditsone credit 0.444392 0.100910 4.404 1.33e-05 ***
## bty_avg 0.048679 0.016974 2.868 0.00432 **
## pic_colorcolor -0.216556 0.066625 -3.250 0.00124 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5014 on 455 degrees of freedom
## Multiple R-squared: 0.1631, Adjusted R-squared: 0.1502
## F-statistic: 12.67 on 7 and 455 DF, p-value: 6.996e-15
# Diagnostic plots for the best model
# 1. Residuals vs Fitted
ggplot(data = m_best, aes(x = .fitted, y = .resid)) +
geom_point(alpha = 0.6) +
geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
labs(title = "Residuals vs Fitted Values",
x = "Fitted Values",
y = "Residuals")# 2. Histogram of Residuals
ggplot(data = m_best, aes(x = .resid)) +
geom_histogram(binwidth = 0.25, fill = "steelblue", color = "white") +
labs(title = "Distribution of Residuals",
x = "Residuals",
y = "Count")# 3. Q-Q Plot
ggplot(data = m_best, aes(sample = .resid)) +
stat_qq() +
stat_qq_line(color = "red") +
labs(title = "Normal Q-Q Plot of Residuals")Linearity: The residuals vs. fitted values plot shows points randomly scattered around zero with no clear curved pattern. The linearity condition is satisfied. Constant Variability: The vertical spread of residuals appears relatively consistent across fitted values, with no obvious fan or funnel shape. The constant variability condition is satisfied. Normality: The histogram shows a left-skewed distribution, and the Q-Q plot shows deviation from the line at both tails. There is a modest departure from normality, but with a large sample size (n = 463), regression is robust to this violation. Independence: Each observation represents a different course evaluation.
##Question 17: The original paper describes how these data were gathered by taking a sample of professors from the University of Texas at Austin and including all courses that they have taught. Considering that each row represents a course, could this new information have an impact on any of the conditions of linear regression?
Since the data collection involved selecting professors and including every course they taught, a single professor can appear multiple times in the dataset. For instance, if Dr. Johnson taught 5 different courses, her characteristics (age, gender, attractiveness score, etc.) would be repeated across 5 separate rows.
This can create several issues:
Non-Independent Observations: Students’ ratings of courses taught by the same professor are likely related to each other. A professor who is generally engaging will probably receive high scores across all their courses. This means the evaluation scores within each professor are not truly independent observations.
Nested Data Structure: The dataset has a natural hierarchy—courses are grouped within professors. Standard linear regression assumes all observations are independent and does not recognize this grouping structure.
Overly Optimistic Results: When data points are correlated but treated as independent, the standard errors become artificially small. This leads to inflated t-statistics, p-values that appear more significant than they should be, and confidence intervals that are narrower than warranted.
To properly address this issue, researchers could use mixed-effects models (also called hierarchical or multilevel models) or apply robust standard errors that account for clustering within professors.
Based on the final model, a professor and course at UT Austin associated with high evaluation scores would have these characteristics:
In summary: The “ideal” profile would be a young, male, attractive professor who earned their degree from an English-speaking institution, teaches a one-credit course, has a black & white photo, and has high student participation in evaluations.
No, I would not be comfortable generalizing these conclusions to professors at other universities for several reasons:
Single Institution Sample: The data was collected exclusively from the University of Texas at Austin, which may have unique institutional characteristics, student culture, and evaluation practices that are not representative of other universities.
Cultural and Regional Differences: Universities in different regions or countries may have varying standards of attractiveness and cultural norms that influence how students perceive and evaluate their professors. What predicts high evaluation scores in Texas may not apply elsewhere.
Institutional Variation: Factors such as class sizes, course structures, grading expectations, and student demographics differ substantially across institutions. These variables may influence evaluation scores in ways that are specific to each university’s context.
Potential Sampling Bias: The sampling method for selecting professors is unclear. If the sample was not randomly selected from all UT Austin faculty, the results may be biased even for conclusions about UT Austin itself—let alone other institutions.
In summary, to confidently generalize these findings, we would need data from multiple universities across different regions, institution types (public vs. private, large vs. small), and cultural contexts.
# Final summary visualization for portfolio
library(ggplot2)
# Coefficient plot for final model
coef_data <- data.frame(
Variable = c("Gender (Male)", "Language (Non-English)", "Age",
"Class Eval %", "Credits (One)", "Beauty Avg", "Picture (Color)"),
Coefficient = c(0.207, -0.206, -0.006, 0.005, 0.505, 0.061, -0.191),
Significant = c("Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes")
)
ggplot(coef_data, aes(x = reorder(Variable, Coefficient), y = Coefficient, fill = Coefficient > 0)) +
geom_bar(stat = "identity") +
coord_flip() +
scale_fill_manual(values = c("coral3", "steelblue3"), guide = "none") +
labs(title = "Impact of Variables on Professor Evaluation Scores",
subtitle = "Final Model Coefficients (all statistically significant at p < 0.05)",
x = "",
y = "Coefficient Estimate") +
theme_minimal() +
geom_hline(yintercept = 0, linetype = "dashed", color = "gray50")Statistical Analysis: - Simple and multiple linear regression - Model diagnostics and assumption verification - Backward selection for model optimization - Interpretation of categorical dummy variables
Programming (R): - Data wrangling with tidyverse - Visualization with ggplot2 - Correlation analysis with GGally
Communication: - Translating statistical findings into actionable insights - Distinguishing between statistical and practical significance - Identifying limitations and scope of conclusions
Critical Thinking: - Recognizing confounding variables - Understanding observational vs. experimental design - Identifying nested data structure issues