Gender and Attractiveness Bias in Student Evaluations of Professors at University of Texas

library(ggplot2)
library(dplyr)
library(mosaic)
evals <- read.csv("evals.csv", header=TRUE, stringsAsFactors = TRUE)

Introduction:

Every semester, students at most major universities are required to fill out professor and class evaluations. These student evaluations often have major implications for a professor’s career path and credibility as an academic professional, so it is very important that they are fair and even across the ranges of professors. Unfortunately, people (students included) are often plagued by bias, both conscious and subconscious. These include assumptions about ability based on gender, race, age, and attractiveness, just to name a few. Even without expressly thinking it, we make evaluations based upon these biases every day, and these biases make a difference when they show up in formal evaluations. This can result in two professors, who have the same level of education and knowledge about their subject and who teach with the same level of competence, getting different evaluation scores because of their students’ subconscious biases based on their gender or attractiveness. We are analyzing this data to see how much of a difference these factor make in students’ evaluations, based on data from the University of Texas.

Our formal research questions are threefold:

First, do female professors have a different evaluation score than male professors, on average, at the University of Texas?
Next, do attractive professors have a different evaluation score than non-attractive people, on average, at the University of Texas?
Finally, is a professor’s perceived attractiveness or gender a better indicator of higher evaluation scores by students at the University of Texas?

Before our evaluation of the data, we hypothesize that:

Female professors will have statistically significant lower evaluation scores, on average, than male professors.
Professors with above-average attractiveness will have statistically significant higher evaluation scores, on average, than professors who are deemed below-average.
That the gender of a professor will have a statistically significant larger effect on evaluation score than level of attractiveness.

Descriptive Statistics:

evals <- evals %>%
  mutate(bty_avg2 = ifelse(bty_avg >5,  yes = "above", no = "not above"))

ggplot(data = evals, mapping = aes(x = gender, fill = bty_avg2)) +
  geom_bar(position = "dodge")+
  ggtitle("Attractiveness Rating by Gender")+
  xlab("Gender")+
  ylab("Count")+
  scale_fill_discrete(name="Attractivness Rating", labels=c("Above Average", "Not Above Average"))

ggplot(data = evals, mapping = aes(x = gender, y = score)) +
  geom_boxplot(aes (fill = bty_avg2))+
  labs(title="Attractiveness Rating and Evaluation Score by Gender", x="Gender", y="Evaluation Score")+
  scale_fill_discrete(name="Attractiveness Rating", labels=c("Above Average", "Not Above Average"))

ggplot(data = evals, mapping = aes(x = bty_avg2, y = age)) +
  geom_boxplot(aes(fill = bty_avg2))+
  labs(title="Attractiveness Rating by Age", x="Attractiveness Rating", y="Age")+
  scale_fill_discrete(name="Attractiveness Rating", labels=c("Above Average", "Not Above Average"))

Inference: Graph #1: Shows the number of professors who were evaluated broken down by gender and their attractiveness rating. We can see that there are more male professors than female, and that based on their attractiveness rating there are significantly more professors who are below average than above. There are about 270 males and only 70 of them scored above average, whereas there are about 200 female professors and around 60 of them scored above average.

Graph #2: Shows the score of professors based on gender and attractiveness rating average. We can see a significant difference in scores, with the males having a higher score overall. We can also see that both genders who had an above average beauty score did score significantly higher than professors who had a below average beauty score, with the difference showing up much more in the male professor population.

Graph #3: Shows the beauty average score of professors based on age. Professors between the age of 35 and 50 tend to score above average in their beauty ranking. Whereas, professors between the age of 45 to 60 scored below average. This tells us that the age of the professor most likely does have something to do with their beauty ranking.

When testing our hypothesis we dove into the question, is the confidence interval looking at bty_avg2 as the predictor of score “further from 0” compared to that of gender as predicted? When looking at our two independent means sample, beauty average and score, results rejected our null hypothesis in saying that bty-avg2 is less of a predictor of score than gender.

shuffled_bty_avg2 <- evals %>%
  mutate(bty_avg2 = shuffle(bty_avg2)) %>%
  group_by(bty_avg2) %>%
  summarize(mean_shuf_bty_avg2 = mean(score))

(obs_shuf_diff_bty_avg2 <- diff(shuffled_bty_avg2$mean_shuf_bty_avg2))

## [1] 0.06945969

many_shuffles_bty_avg2 <- do(10000) *
  evals %>%
  mutate(bty_avg2 = shuffle(bty_avg2)) %>%
  group_by(bty_avg2) %>%
  summarize(mean_shuf_bty_avg2 = mean(score))

rand_distn_bty_avg2 <- many_shuffles_bty_avg2 %>% 
  group_by(.index) %>% 
  summarize(diffmean = diff(mean_shuf_bty_avg2))

rand_distn_bty_avg2 %>%
  filter(abs(diffmean) >= -obs_shuf_diff_bty_avg2) %>%
  nrow() / nrow(rand_distn_bty_avg2)

## [1] 1

rand_distn_bty_avg2 %>% ggplot(aes(x = diffmean)) +
  geom_histogram(color = "white", aes(fill = (abs(diffmean) >= -obs_shuf_diff_bty_avg2)), bins = 20)

shuffled_gender <- evals %>% 
  mutate(gender = shuffle(gender)) %>% 
  group_by(gender) %>% 
  summarize(mean_shuf_gender = mean(score))

many_shuffles_gender <- do(10000) *
  evals %>% 
  mutate(gender = shuffle(gender)) %>% 
  group_by(gender) %>% 
  summarize(mean_shuf_gender = mean(score))
rand_distn_gender <- many_shuffles_gender %>% 
  group_by(.index) %>% 
  summarize(diffmean = diff(mean_shuf_gender))

(obs_shuf_diff_gender <- diff(shuffled_gender$mean_shuf_gender))

## [1] 0.03519326

rand_distn_gender %>%
  filter(abs(diffmean) >= -obs_shuf_diff_gender) %>%
  nrow() / nrow(rand_distn_gender)

## [1] 1

rand_distn_gender %>% ggplot(aes(x = diffmean)) +
  geom_histogram(color = "white", bins = 100) +
  geom_vline(xintercept = obs_shuf_diff_gender, color = "red")+
geom_vline(xintercept = -obs_shuf_diff_gender, color = "red")

rand_distn_gender %>% ggplot(aes(x = diffmean)) +
  geom_histogram(color = "white", aes(fill = (diffmean <= obs_shuf_diff_gender) | (diffmean >= -obs_shuf_diff_gender)), bins = 100)

Conclusion:

The data shows that according to our first hypothesis test based on bty_avg2, there is no statistically significant difference between the two variables in comparison, average beauty scores of professors who were above average or below average on their beauty rating.

When looking at our second hypothesis test based on gender there are many parallels in conclusion to the hypothesis test on bty_avg2. The trend here shows that there is not a statistically significant difference based on gender and the beauty score of different professors. Therefore, we cannot statistically say that either attractiveness nor gender is a definitive indicator of student evaluation score. Further research is needed.

Gender and Attractiveness Bias in Student Evaluations of Professors at University of Texas

Hunter Peterson, Cassie Dandy, Emily Trumpus, Elizabeth Oshiro