Multiple linear regression

Grading the professor

Many college courses conclude by giving students the opportunity to evaluate the course and the instructor anonymously. However, the use of these student evaluations as an indicator of course quality and teaching effectiveness is often criticized because these measures may reflect the influence of non-teaching related characteristics, such as the physical appearance of the instructor. The article titled, “Beauty in the classroom: instructors’ pulchritude and putative pedagogical productivity” by Hamermesh and Parker found that instructors who are viewed to be better looking receive higher instructional ratings.

Here, you will analyze the data from this study in order to learn what goes into a positive professor evaluation.

Getting Started

Load packages

In this lab, you will explore and visualize the data using the tidyverse suite of packages. The data can be found in the companion package for OpenIntro resources, openintro.

Let’s load the packages.

library(tidyverse)
library(openintro)
library(GGally)

This is the first time we’re using the GGally package. You will be using the ggpairs function from this package later in the lab.

The data

The data were gathered from end of semester student evaluations for a large sample of professors from the University of Texas at Austin. In addition, six students rated the professors’ physical appearance. The result is a data frame where each row contains a different course and columns represent variables about the courses and professors. It’s called evals.

glimpse(evals)

## Rows: 463
## Columns: 23
## $ course_id     <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1…
## $ prof_id       <int> 1, 1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 5, 5,…
## $ score         <dbl> 4.7, 4.1, 3.9, 4.8, 4.6, 4.3, 2.8, 4.1, 3.4, 4.5, 3.8, 4…
## $ rank          <fct> tenure track, tenure track, tenure track, tenure track, …
## $ ethnicity     <fct> minority, minority, minority, minority, not minority, no…
## $ gender        <fct> female, female, female, female, male, male, male, male, …
## $ language      <fct> english, english, english, english, english, english, en…
## $ age           <int> 36, 36, 36, 36, 59, 59, 59, 51, 51, 40, 40, 40, 40, 40, …
## $ cls_perc_eval <dbl> 55.81395, 68.80000, 60.80000, 62.60163, 85.00000, 87.500…
## $ cls_did_eval  <int> 24, 86, 76, 77, 17, 35, 39, 55, 111, 40, 24, 24, 17, 14,…
## $ cls_students  <int> 43, 125, 125, 123, 20, 40, 44, 55, 195, 46, 27, 25, 20, …
## $ cls_level     <fct> upper, upper, upper, upper, upper, upper, upper, upper, …
## $ cls_profs     <fct> single, single, single, single, multiple, multiple, mult…
## $ cls_credits   <fct> multi credit, multi credit, multi credit, multi credit, …
## $ bty_f1lower   <int> 5, 5, 5, 5, 4, 4, 4, 5, 5, 2, 2, 2, 2, 2, 2, 2, 2, 7, 7,…
## $ bty_f1upper   <int> 7, 7, 7, 7, 4, 4, 4, 2, 2, 5, 5, 5, 5, 5, 5, 5, 5, 9, 9,…
## $ bty_f2upper   <int> 6, 6, 6, 6, 2, 2, 2, 5, 5, 4, 4, 4, 4, 4, 4, 4, 4, 9, 9,…
## $ bty_m1lower   <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 7, 7,…
## $ bty_m1upper   <int> 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 6, 6,…
## $ bty_m2upper   <int> 6, 6, 6, 6, 3, 3, 3, 3, 3, 2, 2, 2, 2, 2, 2, 2, 2, 6, 6,…
## $ bty_avg       <dbl> 5.000, 5.000, 5.000, 5.000, 3.000, 3.000, 3.000, 3.333, …
## $ pic_outfit    <fct> not formal, not formal, not formal, not formal, not form…
## $ pic_color     <fct> color, color, color, color, color, color, color, color, …

We have observations on 23 different variables, some categorical and some numerical. The meaning of each variable can be found by bringing up the help file:

?evals

Exploring the data

Is this an observational study or an experiment? The original research question posed in the paper is whether beauty leads directly to the differences in course evaluations. Given the study design, is it possible to answer this question as it is phrased? If not, rephrase the question.

This is an observational study. Because the researchers did not randomly assign professors to different levels of beauty or control other factors, we cannot answer whether beauty directly causes differences in course evaluations. Instead, a better question is: “Is there an association between beauty ratings and course evaluation scores?”

Describe the distribution of score. Is the distribution skewed? What does that tell you about how students rate courses? Is this what you expected to see? Why, or why not?

The distribution of score is mostly high and tightly packed between 4.0 and 5.0, with very few low ratings. The histogram shows that the distribution is left-skewed, meaning most students gave professors high evaluation scores, and only a few gave low scores.

This tells me that students tend to rate their courses positively, and it’s rare for a professor to receive a very low score. I kind of expected this because, in general, students either skip evaluations when they’re unhappy or tend to be generous when rating instructors. Many people also rate based on popularity or personality, not strictly on teaching quality, which can push scores upward.
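The description above can be checked with a quick sketch like the following. It assumes the tidyverse and openintro packages loaded earlier; the binwidth of 0.25 and the name score_summary are arbitrary choices for illustration, not part of the lab.

```r
library(tidyverse)
library(openintro)

# Histogram of course evaluation scores; binwidth = 0.25 is an arbitrary choice
ggplot(data = evals, aes(x = score)) +
  geom_histogram(binwidth = 0.25) +
  labs(x = "Course evaluation score", y = "Number of courses")

# A mean below the median is a rough numerical sign of left skew
score_summary <- evals %>%
  summarise(
    mean_score   = mean(score),
    median_score = median(score),
    min_score    = min(score),
    max_score    = max(score)
  )
score_summary
```

If the mean comes out below the median, that agrees with the long lower tail seen in the histogram.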

Excluding score, select two other variables and describe their relationship with each other using an appropriate visualization.

ggplot(evals, aes(x = cls_students, y = cls_perc_eval)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  labs(
    x = "Class Size (number of students)",
    y = "Percent Who Completed Evaluations",
    title = "Relationship Between Class Size and Evaluation Response Rate"
  )

There is a clear downward trend. Larger classes tend to have a lower percentage of students completing evaluations. Smaller classes usually have a higher participation rate. This feels logical — it’s easier to get everyone to respond in a smaller class.

Simple linear regression

The fundamental phenomenon suggested by the study is that better looking teachers are evaluated more favorably. Let’s create a scatterplot to see if this appears to be the case:

ggplot(data = evals, aes(x = bty_avg, y = score)) +
  geom_point()

Before you draw conclusions about the trend, compare the number of observations in the data frame with the approximate number of points on the scatterplot. Is anything awry?

Replot the scatterplot, but this time use geom_jitter as your layer. What was misleading about the initial scatterplot?

ggplot(data = evals, aes(x = bty_avg, y = score)) +
  geom_jitter()

The first scatterplot was misleading because many courses share the exact same beauty rating and evaluation score, which caused multiple points to fall on top of each other. This made the plot look like it had fewer data points than the 463 rows in the data frame.

When I used geom_jitter, the points were spread out slightly, and I could finally see how many overlapping observations there actually were. The jitter didn’t change the data — it just helped reveal all the hidden points.
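One way to quantify the overplotting is to compare the number of rows with the number of distinct (bty_avg, score) positions. This is a minimal sketch; the names n_rows, n_positions, and n_courses are hypothetical and chosen only for this example.

```r
library(tidyverse)
library(openintro)

# Number of rows (courses) versus number of distinct plotting positions
n_rows <- nrow(evals)
n_positions <- evals %>%
  distinct(bty_avg, score) %>%
  nrow()
c(rows = n_rows, distinct_points = n_positions)

# The (bty_avg, score) pairs that are shared by more than one course
evals %>%
  count(bty_avg, score, name = "n_courses") %>%
  filter(n_courses > 1) %>%
  arrange(desc(n_courses))
```

Any gap between the two counts is the number of observations hidden behind other points in the plain geom_point() plot.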

Let’s see if the apparent trend in the plot is something more than natural variation. Fit a linear model called m_bty to predict average professor score by average beauty rating. Write out the equation for the linear model and interpret the slope. Is average beauty score a statistically significant predictor? Does it appear to be a practically significant predictor?

m_bty <- lm(score ~ bty_avg, data = evals)
summary(m_bty)

## 
## Call:
## lm(formula = score ~ bty_avg, data = evals)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.9246 -0.3690  0.1420  0.3977  0.9309 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.88034    0.07614   50.96  < 2e-16 ***
## bty_avg      0.06664    0.01629    4.09 5.08e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5348 on 461 degrees of freedom
## Multiple R-squared:  0.03502,    Adjusted R-squared:  0.03293 
## F-statistic: 16.73 on 1 and 461 DF,  p-value: 5.083e-05

Equation:

\[ \widehat{score} = 3.88 + 0.067 \times bty\_avg \]

Interpretation: For each one-point increase in average beauty rating, a professor’s expected course evaluation score increases by about 0.07 points.

Statistical significance: The p-value for bty_avg is very small, so beauty is a statistically significant predictor of evaluation score.

Practical significance: Although statistically significant, the size of the effect is fairly small. A change of several beauty points only shifts the evaluation score by a few tenths. So the impact is weak in practical terms, even though the relationship is real.
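To put a rough number on this, consider a hypothetical pair of professors whose bty_avg values differ by 5 points (the gap of 5 is an assumption chosen only for illustration). With the fitted slope from the output above, the expected difference in score is:

\[ \Delta\widehat{score} = 0.06664 \times 5 \approx 0.33 \]

The Multiple R-squared of 0.035 points the same way: bty_avg alone accounts for about 3.5% of the variability in score.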

Add the line of the best fit model to your plot using the following:

ggplot(data = evals, aes(x = bty_avg, y = score)) +
  geom_jitter() +
  geom_smooth(method = "lm")

The blue line is the model. The shaded gray area around the line tells you about the variability you might expect in your predictions. To turn that off, use se = FALSE.

ggplot(data = evals, aes(x = bty_avg, y = score)) +
  geom_jitter() +
  geom_smooth(method = "lm", se = FALSE)

Use residual plots to evaluate whether the conditions of least squares regression are reasonable. Provide plots and comments for each one (see the Simple Regression Lab for a reminder of how to make these).

ggplot(data = m_bty, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  xlab("Fitted values") +
  ylab("Residuals")

ggplot(data = m_bty, aes(x = .resid)) +
  geom_histogram(bins = 25) +
  xlab("Residuals")

ggplot(data = m_bty, aes(sample = .resid)) +
  stat_qq() +
  stat_qq_line()

Linearity - The residuals vs. fitted plot does not show a strong pattern, curve, or bend. The points are scattered loosely around the horizontal line at zero. This suggests that the linearity condition is reasonably met — a straight-line model is appropriate for this data.

Nearly Normal Residuals - The histogram of residuals is skewed to the left: the residual summary above runs from -1.92 to 0.93 with a median of 0.14, so the lower tail is the longer one. The QQ plot departs from a straight line at the lower end for the same reason. The departure is moderate, and with 463 observations the normality condition seems acceptable, though it is not perfectly met.

Constant Variability - The spread of the residuals is fairly even across all fitted values. I don’t see a cone shape or increasing/decreasing spread. This suggests the constant variability condition is met.

Overall, all three diagnostic checks look acceptable. There’s no major violation of the linear model assumptions for this simple regression.

Multiple linear regression

The data set contains several variables on the beauty score of the professor: individual ratings from each of the six students who were asked to score the physical appearance of the professors and the average of these six scores. Let’s take a look at the relationship between one of these scores and the average beauty score.

ggplot(data = evals, aes(x = bty_f1lower, y = bty_avg)) +
  geom_point()

evals %>% 
  summarise(cor(bty_avg, bty_f1lower))

## # A tibble: 1 × 1
##   `cor(bty_avg, bty_f1lower)`
##                         <dbl>
## 1                       0.844

As expected, the relationship is quite strong; after all, the average score is calculated using the individual scores. You can actually look at the relationships between all beauty variables (columns 15 through 21) using the following command:

evals %>%
  select(contains("bty")) %>%
  ggpairs()

These variables are collinear (correlated), and adding more than one of these variables to the model would not add much value to the model. In this application and with these highly-correlated predictors, it is reasonable to use the average beauty score as the single representative of these variables.

In order to see if beauty is still a significant predictor of professor score after you’ve accounted for the professor’s gender, you can add the gender term into the model.

m_bty_gen <- lm(score ~ bty_avg + gender, data = evals)
summary(m_bty_gen)

## 
## Call:
## lm(formula = score ~ bty_avg + gender, data = evals)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.8305 -0.3625  0.1055  0.4213  0.9314 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.74734    0.08466  44.266  < 2e-16 ***
## bty_avg      0.07416    0.01625   4.563 6.48e-06 ***
## gendermale   0.17239    0.05022   3.433 0.000652 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5287 on 460 degrees of freedom
## Multiple R-squared:  0.05912,    Adjusted R-squared:  0.05503 
## F-statistic: 14.45 on 2 and 460 DF,  p-value: 8.177e-07

P-values and parameter estimates should only be trusted if the conditions for the regression are reasonable. Verify that the conditions for this model are reasonable using diagnostic plots.

ggplot(data = m_bty_gen, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype="dashed") +
  xlab("Fitted values") +
  ylab("Residuals")

ggplot(data = m_bty_gen, aes(x = .resid)) +
  geom_histogram(bins = 25) +
  xlab("Residuals")

ggplot(data = m_bty_gen, aes(sample = .resid)) +
  stat_qq() +
  stat_qq_line()

Linearity: The residuals-vs-fitted plot shows a mostly random cloud of points with no strong curved pattern. This suggests that the relationship between the predictors (bty_avg and gender) and the outcome (score) is reasonably linear.

Nearly Normal Residuals: The histogram is slightly skewed to the left, and the Q-Q plot shows some mild deviation at the tails, but nothing extreme. Overall, the residuals are close enough to normal for the purposes of this model.

Constant Variability: The spread of residuals is fairly even across all fitted values (no funnel shape). This suggests that the constant variance assumption is reasonably met.

Conclusion: None of the diagnostic plots indicate serious violations. The regression assumptions appear acceptable, so the p-values and estimates in m_bty_gen can be trusted.

Is bty_avg still a significant predictor of score? Has the addition of gender to the model changed the parameter estimate for bty_avg?

Yes — bty_avg is still a statistically significant predictor of score. The p-value for bty_avg in the multiple regression model (score ~ bty_avg + gender) is still far below 0.05, which means beauty continues to be a meaningful predictor even after adjusting for gender.

The coefficient for bty_avg does change slightly, but not in a dramatic way. In the simple model, the slope for bty_avg was about 0.067, meaning each 1-point increase in beauty was associated with a small increase in teaching score.

After adding gender, the slope rises slightly to about 0.074, so the overall effect stays about the same.

Interpretation in regular language:

Even after we control for gender, better-looking professors still tend to get higher evaluation scores. Gender does not explain away the beauty effect — it only adjusts the slope a little.
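The two slopes can also be pulled out of the fitted models and placed side by side. This is a small sketch that refits both models so it runs on its own; the vector name slopes is hypothetical.

```r
library(openintro)

m_bty     <- lm(score ~ bty_avg, data = evals)
m_bty_gen <- lm(score ~ bty_avg + gender, data = evals)

# Slope for bty_avg without and with gender in the model
slopes <- c(
  simple      = unname(coef(m_bty)["bty_avg"]),
  with_gender = unname(coef(m_bty_gen)["bty_avg"])
)
round(slopes, 5)
```

These should match the Estimate column for bty_avg in the two summary() outputs shown earlier (0.06664 and 0.07416).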

Note that the estimate for gender is now called gendermale. You’ll see this name change whenever you introduce a categorical variable. The reason is that R recodes gender from having the values of male and female to being an indicator variable called gendermale that takes a value of \(0\) for female professors and a value of \(1\) for male professors. (Such variables are often referred to as “dummy” variables.)

As a result, for female professors, the parameter estimate is multiplied by zero, leaving the intercept and slope form familiar from simple regression.

\[ \begin{aligned} \widehat{score} &= \hat{\beta}_0 + \hat{\beta}_1 \times bty\_avg + \hat{\beta}_2 \times (0) \\ &= \hat{\beta}_0 + \hat{\beta}_1 \times bty\_avg\end{aligned} \]
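For male professors the same substitution, sketched here with the estimates from the summary(m_bty_gen) output above, uses a value of 1 for gendermale:

\[ \begin{aligned} \widehat{score} &= \hat{\beta}_0 + \hat{\beta}_1 \times bty\_avg + \hat{\beta}_2 \times (1) \\ &= (3.74734 + 0.17239) + 0.07416 \times bty\_avg \\ &= 3.91973 + 0.07416 \times bty\_avg \end{aligned} \]

The two lines share the same slope and differ only in their intercepts, with the line for male professors sitting 0.17239 points higher.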

ggplot(data = evals, aes(x = bty_avg, y = score, color = pic_color)) +
 geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

What is the equation of the line corresponding to those with color pictures? (Hint: For those with color pictures, the parameter estimate is multiplied by 1.) For two professors who received the same beauty rating, which color picture tends to have the higher course evaluation score?

m_bty_pic <- lm(score ~ bty_avg + pic_color, data = evals)
summary(m_bty_pic)

Coefficients from the model score ~ bty_avg + pic_color: intercept (black & white) 4.06318, bty_avg 0.05548, pic_colorcolor -0.16059. The color intercept is therefore 4.06318 - 0.16059 = 3.90259.

Equation for color pictures:

\[ \widehat{score} = 3.90259 + 0.05548 \times bty\_avg \]

Which gets the higher evaluation? For two professors with the same beauty rating, those with black & white photos tend to have slightly higher predicted course evaluations. This is because the dummy variable pic_colorcolor has a negative coefficient (-0.16059), which lowers the predicted score for color photos.

The decision to call the indicator variable gendermale instead of genderfemale has no deeper meaning. R simply codes the category that comes first alphabetically as a \(0\). (You can change the reference level of a categorical variable, which is the level that is coded as a 0, using the relevel() function. Use ?relevel to learn more.)
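As a minimal sketch of relevel() in use, the following makes male the reference level and refits the model. The names evals_male_ref and m_bty_gen_relevel are hypothetical and introduced only for this example.

```r
library(tidyverse)
library(openintro)

# Copy of the data with "male" as the reference level of gender
evals_male_ref <- evals %>%
  mutate(gender = relevel(gender, ref = "male"))

# Same model as m_bty_gen, fitted to the releveled data
m_bty_gen_relevel <- lm(score ~ bty_avg + gender, data = evals_male_ref)
coef(m_bty_gen_relevel)
```

The indicator is now called genderfemale and its estimate is the negative of the gendermale estimate, while the slope for bty_avg is unchanged. The fitted values are identical; only the labeling of the baseline differs.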

Create a new model called m_bty_rank with gender removed and rank added in. How does R appear to handle categorical variables that have more than two levels? Note that the rank variable has three levels: teaching, tenure track, tenured.

m_bty_rank <- lm(score ~ bty_avg + rank, data = evals)
summary(m_bty_rank)

## 
## Call:
## lm(formula = score ~ bty_avg + rank, data = evals)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.8713 -0.3642  0.1489  0.4103  0.9525 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       3.98155    0.09078  43.860  < 2e-16 ***
## bty_avg           0.06783    0.01655   4.098 4.92e-05 ***
## ranktenure track -0.16070    0.07395  -2.173   0.0303 *  
## ranktenured      -0.12623    0.06266  -2.014   0.0445 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5328 on 459 degrees of freedom
## Multiple R-squared:  0.04652,    Adjusted R-squared:  0.04029 
## F-statistic: 7.465 on 3 and 459 DF,  p-value: 6.88e-05

When I added rank into the model, R automatically broke the variable into two separate indicator (dummy) variables: ranktenure track and ranktenured.

This tells me that R treats the first level alphabetically—which is “teaching”—as the baseline (reference group). So professors with rank “teaching” get absorbed into the intercept, and the coefficients for the other two levels show how much higher or lower their scores are compared to the teaching group, while holding beauty constant.

In other words, R creates one indicator variable for each level except the reference level, and that’s how it handles categorical variables with more than two categories.
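The indicator columns can be inspected directly with model.matrix(), which builds the design matrix that lm() uses internally. This is a small illustration; the name rank_design is hypothetical.

```r
library(openintro)

# Levels in their stored order; the first one is the reference level
levels(evals$rank)

# Design matrix for rank: an intercept plus one 0/1 column
# for each level other than the reference level
rank_design <- model.matrix(~ rank, data = evals)
colnames(rank_design)
head(rank_design)
```

Rows for teaching professors have a 0 in both indicator columns, which is why that group is represented by the intercept alone.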

The interpretation of the coefficients in multiple regression is slightly different from that of simple regression. The estimate for bty_avg reflects how much higher a group of professors is expected to score if they have a beauty rating that is one point higher while holding all other variables constant. In this case, that translates into considering only professors of the same rank with bty_avg scores that are one point apart.

The search for the best model

We will start with a full model that predicts professor score based on rank, gender, ethnicity, language of the university where they got their degree, age, proportion of students that filled out evaluations, class size, course level, number of professors, number of credits, average beauty rating, outfit, and picture color.

Which variable would you expect to have the highest p-value in this model? Why? Hint: Think about which variable would you expect to not have any association with the professor score.

I would expect pic_outfit to have the highest p-value, because there is no logical reason why what the professor is wearing in a photo would affect end-of-semester evaluation scores. In contrast, variables like language, ethnicity, or gender might (fairly or unfairly) influence student perceptions, so I would not expect them to be the most insignificant predictor.

Let’s run the model…

m_full <- lm(score ~ rank + gender + ethnicity + language + age + cls_perc_eval 
             + cls_students + cls_level + cls_profs + cls_credits + bty_avg 
             + pic_outfit + pic_color, data = evals)
summary(m_full)

## 
## Call:
## lm(formula = score ~ rank + gender + ethnicity + language + age + 
##     cls_perc_eval + cls_students + cls_level + cls_profs + cls_credits + 
##     bty_avg + pic_outfit + pic_color, data = evals)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.77397 -0.32432  0.09067  0.35183  0.95036 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            4.0952141  0.2905277  14.096  < 2e-16 ***
## ranktenure track      -0.1475932  0.0820671  -1.798  0.07278 .  
## ranktenured           -0.0973378  0.0663296  -1.467  0.14295    
## gendermale             0.2109481  0.0518230   4.071 5.54e-05 ***
## ethnicitynot minority  0.1234929  0.0786273   1.571  0.11698    
## languagenon-english   -0.2298112  0.1113754  -2.063  0.03965 *  
## age                   -0.0090072  0.0031359  -2.872  0.00427 ** 
## cls_perc_eval          0.0053272  0.0015393   3.461  0.00059 ***
## cls_students           0.0004546  0.0003774   1.205  0.22896    
## cls_levelupper         0.0605140  0.0575617   1.051  0.29369    
## cls_profssingle       -0.0146619  0.0519885  -0.282  0.77806    
## cls_creditsone credit  0.5020432  0.1159388   4.330 1.84e-05 ***
## bty_avg                0.0400333  0.0175064   2.287  0.02267 *  
## pic_outfitnot formal  -0.1126817  0.0738800  -1.525  0.12792    
## pic_colorcolor        -0.2172630  0.0715021  -3.039  0.00252 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.498 on 448 degrees of freedom
## Multiple R-squared:  0.1871, Adjusted R-squared:  0.1617 
## F-statistic: 7.366 on 14 and 448 DF,  p-value: 6.552e-14

Check your suspicions from the previous exercise. Include the model output in your response.

I expected pic_outfit to have the highest p-value, since what a professor wears in a photo has no natural connection to how students rate a course.

Looking at the full model output, this suspicion was only partly right. pic_outfitnot formal is not significant (p = 0.12792), but it is far from the highest p-value in the model. The highest is cls_profssingle, with p = 0.77806.

This suggests that, after accounting for the other variables, the number of professors teaching a course has no detectable association with the evaluation score, so cls_profs is a good candidate to remove first in backward selection.

The model output used for reference is the summary(m_full) output shown above.

Interpret the coefficient associated with the ethnicity variable.

ethnicitynot minority 0.1234929 (p = 0.11698)

The coefficient for ethnicity means that non-minority professors are predicted to score about 0.12 points higher than minority professors when all other variables in the model are held constant. Because the p-value is above 0.05, this difference is not statistically significant at the 0.05 level.

Drop the variable with the highest p-value and re-fit the model. Did the coefficients and significance of the other explanatory variables change? (One of the things that makes multiple regression interesting is that coefficient estimates depend on the other variables that are included in the model.) If not, what does this say about whether or not the dropped variable was collinear with the other explanatory variables?

The variable with the highest p-value is cls_profs (p = 0.778). I refit the model without this variable. After dropping it, the coefficients of the remaining variables changed very little, and their significance levels remained almost the same.

Because the other estimates barely moved, cls_profs does not appear to be collinear with the other explanatory variables. It also added little information about score beyond what they already provide, so removing it didn’t affect the relationships among the other predictors.
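The refit described above could look like the following sketch, which uses update() to drop a single term and then lines up the estimates from both models. The names m_drop_profs, shared, and coef_compare are hypothetical.

```r
library(openintro)

m_full <- lm(score ~ rank + gender + ethnicity + language + age + cls_perc_eval
             + cls_students + cls_level + cls_profs + cls_credits + bty_avg
             + pic_outfit + pic_color, data = evals)

# Refit the full model without cls_profs
m_drop_profs <- update(m_full, . ~ . - cls_profs)

# Side-by-side estimates for the terms the two models share
shared <- names(coef(m_drop_profs))
coef_compare <- cbind(
  full    = coef(m_full)[shared],
  dropped = coef(m_drop_profs)[shared]
)
round(coef_compare, 4)
```

Small differences between the two columns are what one would expect if cls_profs is not strongly related to the remaining predictors.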

Using backward-selection and p-value as the selection criterion, determine the best model. You do not need to show all steps in your answer, just the output for the final model. Also, write out the linear model for predicting score based on the final model you settle on.

m_final <- lm(score ~ bty_avg + gender + age + cls_perc_eval + cls_credits + pic_color, data = evals)
summary(m_final)

## 
## Call:
## lm(formula = score ~ bty_avg + gender + age + cls_perc_eval + 
##     cls_credits + pic_color, data = evals)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.80571 -0.33909  0.09682  0.38602  0.89518 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            3.930428   0.217140  18.101  < 2e-16 ***
## bty_avg                0.049695   0.017104   2.906  0.00385 ** 
## gendermale             0.219398   0.050326   4.360 1.61e-05 ***
## age                   -0.005782   0.002643  -2.188  0.02919 *  
## cls_perc_eval          0.004230   0.001443   2.930  0.00355 ** 
## cls_creditsone credit  0.451575   0.101675   4.441 1.12e-05 ***
## pic_colorcolor        -0.196893   0.066794  -2.948  0.00337 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5053 on 456 degrees of freedom
## Multiple R-squared:  0.148,  Adjusted R-squared:  0.1368 
## F-statistic:  13.2 on 6 and 456 DF,  p-value: 8.451e-14

Using backward selection with p-values as the guide, my final model includes the following predictors: bty_avg, gender, age, cls_perc_eval, cls_credits, and pic_color.

This model removes the high-p-value variables that didn’t contribute meaningfully and keeps only predictors with statistically significant effects.
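Written out from the coefficient estimates in the summary(m_final) output above (rounded to four decimal places, with subscripts marking the indicator variables), the fitted model is:

\[ \begin{aligned} \widehat{score} = {} & 3.9304 + 0.0497 \times bty\_avg + 0.2194 \times gender_{male} - 0.0058 \times age \\ & + 0.0042 \times cls\_perc\_eval + 0.4516 \times cls\_credits_{one\ credit} - 0.1969 \times pic\_color_{color} \end{aligned} \]

Each indicator equals 1 when the professor or course is in the named category and 0 otherwise.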

Verify that the conditions for this model are reasonable using diagnostic plots.

plot(m_final, which = 1)

plot(m_final, which = 2)

plot(m_final, which = 3)

Linearity - The Residuals vs Fitted plot shows no strong curved pattern, so the linear form seems acceptable.

Nearly Normal Residuals - The Q-Q plot is fairly straight, with only small deviations in the tails. Residuals are close to normal.

Constant Variability - The Scale–Location plot looks mostly flat, with no major funnel shape. Variability appears roughly constant.

Overall, the assumptions for linear regression are reasonably met for the final model.

The original paper describes how these data were gathered by taking a sample of professors from the University of Texas at Austin and including all courses that they have taught. Considering that each row represents a course, could this new information have an impact on any of the conditions of linear regression?

Because each row represents a course, not a professor, the independence condition may be violated. Professors often teach several courses, so multiple rows belong to the same professor and share the same characteristics—beauty rating, gender, rank, ethnicity, picture color, etc. These repeated rows are not fully independent, and evaluations for one professor are naturally similar across their courses.

This can artificially inflate statistical significance and exaggerate how strongly predictors like beauty or gender appear to affect the evaluation score.
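The amount of repetition can be checked with the prof_id column shown in the glimpse() output. This is a minimal sketch; the names courses_per_prof, n_courses, n_professors, total_courses, and max_courses are hypothetical.

```r
library(tidyverse)
library(openintro)

# Number of courses (rows) for each professor
courses_per_prof <- evals %>%
  count(prof_id, name = "n_courses")

# How many professors there are, and how unevenly the rows are spread
courses_per_prof %>%
  summarise(
    n_professors  = n(),
    total_courses = sum(n_courses),
    max_courses   = max(n_courses)
  )
```

If the number of professors is well below the number of rows, the effective number of independent units is smaller than the 463 used in the standard errors above.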

Based on your final model, describe the characteristics of a professor and course at University of Texas at Austin that would be associated with a high evaluation score.

According to my final model, higher evaluation scores are associated with a combination of personal and course-related factors. Professors with higher beauty ratings score a bit better, and male professors tend to receive higher evaluations on average. Classes where a larger percentage of students fill out the evaluations also tend to have higher scores, possibly because more engaged students give more positive feedback. One-credit classes have the largest coefficient in the model (about 0.45 points), perhaps because they’re smaller or less demanding. Younger professors score a little higher, and professors with color photos tend to receive lower evaluations than those with black-and-white photos. Altogether, a professor who is younger, male, rated higher in appearance, pictured in black and white, teaches a one-credit course, and has a high percentage of students completing evaluations would be the most likely to receive a high score in this dataset.

Would you be comfortable generalizing your conclusions to apply to professors generally (at any university)? Why or why not?

I would not feel comfortable generalizing these results to all professors everywhere. This dataset only includes professors from one university — UT Austin — during a specific time period, with a specific student population and course evaluation system. Other schools may have different cultures, different expectations, and completely different evaluation habits.

Also, since each row represents a course and many professors taught several courses, the same professors appear multiple times, which means the data isn’t a true random sample of individual instructors across universities. That affects how well we can generalize the results.

So the patterns we found — like beauty score, gender, age, and class characteristics — might be meaningful within UT Austin, but they shouldn’t automatically be assumed to apply to all universities as a whole.

Multiple linear regression

Kevin Martin
