In this lab, we will use the tidyverse suite of packages to explore and visualize data. The data were gathered from end-of-semester student evaluations for a large sample of professors from the University of Texas at Austin. In addition, six students rated each professor’s physical appearance. The data frame, called “evals”, has 463 rows, one per course, and 23 columns of variables about the courses and professors.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.3 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(openintro)
## Loading required package: airports
## Loading required package: cherryblossom
## Loading required package: usdata
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(ggplot2)
head(evals)
## # A tibble: 6 × 23
## course_id prof_id score rank ethnicity gender language age cls_perc_eval
## <int> <int> <dbl> <fct> <fct> <fct> <fct> <int> <dbl>
## 1 1 1 4.7 tenure … minority female english 36 55.8
## 2 2 1 4.1 tenure … minority female english 36 68.8
## 3 3 1 3.9 tenure … minority female english 36 60.8
## 4 4 1 4.8 tenure … minority female english 36 62.6
## 5 5 2 4.6 tenured not mino… male english 59 85
## 6 6 2 4.3 tenured not mino… male english 59 87.5
## # ℹ 14 more variables: cls_did_eval <int>, cls_students <int>, cls_level <fct>,
## # cls_profs <fct>, cls_credits <fct>, bty_f1lower <int>, bty_f1upper <int>,
## # bty_f2upper <int>, bty_m1lower <int>, bty_m1upper <int>, bty_m2upper <int>,
## # bty_avg <dbl>, pic_outfit <fct>, pic_color <fct>
Is this an observational study or an experiment? The original research question posed in the paper is whether beauty leads directly to the differences in course evaluations. Given the study design, is it possible to answer this question as it is phrased? If not, rephrase the question.
This is an observational study: the evaluation scores and appearance ratings were collected as they occurred, without manipulating or controlling any variables. The original research question, “whether beauty leads directly to the differences in course evaluations,” cannot be answered causally with this design. Rephrased in terms of association: “Is there an association between professors’ physical appearance ratings and their course evaluation scores?”
Describe the distribution of score. Is the distribution skewed? What does that tell you about how students rate courses? Is this what you expected to see? Why, or why not?
Based on the summary statistics and the histogram below, the distribution of score is left-skewed: most courses receive high ratings (the median is 4.3 on a 5-point scale) and relatively few receive low ones. This suggests students tend to rate courses favorably, which is what I expected, since students who strongly dislike a course may drop it before the end-of-semester evaluation.
# Summary statistics for the 'score' variable
summary(evals$score)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.300 3.800 4.300 4.175 4.600 5.000
# Create a histogram to visualize the distribution of scores
ggplot(evals, aes(x = score)) +
geom_histogram(binwidth = 0.5, fill = "skyblue", color = "black") +
labs(title = "Distribution of Overall Course Evaluation Scores", x = "Score", y = "Frequency")
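To quantify the skew, a quick numeric check (a minimal dplyr sketch) computes how much of the distribution sits at the top of the scale:
# Proportion of courses rated at least 4, versus below 3
evals %>%
  summarise(prop_at_least_4 = mean(score >= 4),
            prop_below_3 = mean(score < 3))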
Excluding score, select two other variables and describe their relationship with each other using an appropriate visualization.
The box plot below compares the distribution of age across gender. It suggests that the female professors in this sample tend to be younger than the male professors.
library(ggplot2)
# Box plot to visualize the distribution of ages based on gender
ggplot(evals, aes(x = gender, y = age, fill = gender)) +
geom_boxplot() +
labs(title = "Distribution of Ages by Gender",
x = "Gender", y = "Age")
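To back up the visual impression numerically, a grouped summary (a small dplyr sketch) compares median ages and group sizes:
# Median age and number of courses by professor gender
evals %>%
  group_by(gender) %>%
  summarise(median_age = median(age), n_courses = n())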
Before you draw conclusions about the trend, compare the number of observations in the data frame with the approximate number of points on the scatterplot. Is anything awry? The number of visible points is notably lower than the 463 observations in the data set. Because both bty_avg and score take a limited set of values, many observations fall on exactly the same coordinates and overplot one another, which understates the density of the data in the scatterplot of bty_avg against score.
ggplot(data = evals, aes(x = bty_avg, y = score)) +
geom_point()
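One way to confirm the overplotting (a sketch using dplyr’s distinct()) is to compare the total row count with the number of unique (bty_avg, score) pairs actually drawn:
# Total observations versus distinct plotted points
nrow(evals)
evals %>%
  distinct(bty_avg, score) %>%
  nrow()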
Replot the scatterplot, but this time use geom_jitter as your layer. What was misleading about the initial scatterplot?
The initial scatterplot was misleading because identical (bty_avg, score) pairs were drawn on top of one another, hiding how many observations each point represents. The geom_jitter() function adds a small amount of random noise to each point, which separates overlapping points and makes dense regions visible, especially when the variables take discrete or rounded values. Adjusting the width and height arguments of geom_jitter() helps spread the points enough to see the distribution without distorting the overall pattern.
ggplot(data = evals, aes(x = bty_avg, y = score)) +
geom_jitter(width = 0.3, height = 0.2)
Let’s see if the apparent trend in the plot is something more than natural variation. Fit a linear model called m_bty to predict average professor score by average beauty rating. Write out the equation for the linear model and interpret the slope. Is average beauty score a statistically significant predictor? Does it appear to be a practically significant predictor?
# Fit linear model
m_bty <- lm(score ~ bty_avg, data = evals)
# Creating a jittered scatterplot
ggplot(data = evals, aes(x = bty_avg, y = score)) +
geom_jitter() +
geom_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula = 'y ~ x'
# Summary of the linear model
summary(m_bty)
##
## Call:
## lm(formula = score ~ bty_avg, data = evals)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.9246 -0.3690 0.1420 0.3977 0.9309
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.88034 0.07614 50.96 < 2e-16 ***
## bty_avg 0.06664 0.01629 4.09 5.08e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5348 on 461 degrees of freedom
## Multiple R-squared: 0.03502, Adjusted R-squared: 0.03293
## F-statistic: 16.73 on 1 and 461 DF, p-value: 5.083e-05
The equation for the linear model is: \(\widehat{\text{score}} = 3.88034 + 0.06664 \times \text{bty\_avg}\). The slope means that, on average, a one-point increase in average beauty rating is associated with a 0.06664-point increase in evaluation score. With a p-value of 5.083e-05, average beauty score is a statistically significant predictor of average professor score. However, the slope is small and the model explains only about 3.5% of the variation in score (R² = 0.035), so its practical significance as a predictor is limited.
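To gauge the practical effect, one rough check (a sketch) compares the predicted scores at the lowest and highest observed beauty ratings:
# Predicted scores at the extremes of bty_avg; the spread between the two
# values is the largest change in score the model attributes to beauty
predict(m_bty, newdata = data.frame(bty_avg = range(evals$bty_avg)))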
Use residual plots to evaluate whether the conditions of least squares regression are reasonable. Provide plots and comments for each one (see the Simple Regression Lab for a reminder of how to make these).
# Get residuals from the linear model
residuals <- resid(m_bty)
# Residuals vs Fitted Values Plot
plot(x = fitted(m_bty), y = residuals,
xlab = "Fitted Values", ylab = "Residuals",
main = "Residuals vs Fitted Values")
abline(h = 0, col = "red") # Add a horizontal line at y = 0
# Histogram of residuals
hist(m_bty$residuals)
# Normal Q-Q Plot
qqnorm(residuals)
qqline(residuals)
# Scale-Location (Sqrt(|Residuals|) vs Fitted Values) Plot
sqrt_abs_resid <- sqrt(abs(residuals))
plot(x = fitted(m_bty), y = sqrt_abs_resid,
xlab = "Fitted Values", ylab = "sqrt(|Residuals|)",
main = "Scale-Location Plot")
abline(h = mean(sqrt_abs_resid), col = "red") # Add a horizontal line
The residuals-vs-fitted plot shows points scattered randomly around the horizontal line at y = 0, which supports the linearity condition. The histogram shows the distribution of residuals: it is unimodal with a slight left skew, so the residuals are close to, but not perfectly, normal. The normal Q-Q plot tells the same story: most points follow the straight line, with some departure in the left tail. In the scale-location plot, the points scatter randomly around a horizontal line, indicating homoscedasticity (constant variance of the residuals) across the range of fitted values.
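For comparison, base R produces a standard set of four diagnostic plots in a single call, covering the same checks at once:
# Residuals vs fitted, Q-Q, scale-location, and residuals vs leverage in one step
par(mfrow = c(2, 2))
plot(m_bty)
par(mfrow = c(1, 1))  # reset the plotting layout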
Next, we examine the relationship between one of the individual beauty scores and the average beauty score using a scatterplot. For example, we can visualize the relationship between the bty_f1lower variable and the bty_avg variable from the evals dataset.
#visualizing the relationship between two variables (bty_avg and bty_f1lower)
ggplot(data = evals, aes(x = bty_f1lower, y = bty_avg)) +
geom_point()
#summarize the correlation between two variables (bty_avg and bty_f1lower)
evals %>%
summarise(cor(bty_avg, bty_f1lower))
## # A tibble: 1 × 1
## `cor(bty_avg, bty_f1lower)`
## <dbl>
## 1 0.844
# To analyze the relationships between multiple beauty variables (bty_* columns 13 through 19) in the evals dataset
evals %>%
select(contains("bty")) %>%
ggpairs()
# Fit a linear regression model predicting score from two predictors:
# bty_avg (average beauty score) and gender
m_bty_gen <- lm(score ~ bty_avg + gender, data = evals)
summary(m_bty_gen)
##
## Call:
## lm(formula = score ~ bty_avg + gender, data = evals)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.8305 -0.3625 0.1055 0.4213 0.9314
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.74734 0.08466 44.266 < 2e-16 ***
## bty_avg 0.07416 0.01625 4.563 6.48e-06 ***
## gendermale 0.17239 0.05022 3.433 0.000652 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5287 on 460 degrees of freedom
## Multiple R-squared: 0.05912, Adjusted R-squared: 0.05503
## F-statistic: 14.45 on 2 and 460 DF, p-value: 8.177e-07
P-values and parameter estimates should only be trusted if the conditions for the regression are reasonable. Verify that the conditions for this model are reasonable using diagnostic plots.
# Diagnostic plots for the linear regression model m_bty_gen
par(mfrow = c(2, 2)) # Set the layout to a 2x2 grid for plots
plot(m_bty_gen)
The diagnostic plots reveal patterns or deviations that might indicate violations of the model assumptions, such as heteroscedasticity, non-linearity, or influential outliers, any of which can undermine the reliability of the model’s coefficients and p-values. Here the residuals scatter roughly evenly around zero and the Q-Q plot is close to a straight line, so the conditions appear reasonable.
Is bty_avg still a significant predictor of score? Has the addition of gender to the model changed the parameter estimate for bty_avg?
summary(m_bty)
##
## Call:
## lm(formula = score ~ bty_avg, data = evals)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.9246 -0.3690 0.1420 0.3977 0.9309
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.88034 0.07614 50.96 < 2e-16 ***
## bty_avg 0.06664 0.01629 4.09 5.08e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5348 on 461 degrees of freedom
## Multiple R-squared: 0.03502, Adjusted R-squared: 0.03293
## F-statistic: 16.73 on 1 and 461 DF, p-value: 5.083e-05
summary(m_bty_gen)
##
## Call:
## lm(formula = score ~ bty_avg + gender, data = evals)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.8305 -0.3625 0.1055 0.4213 0.9314
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.74734 0.08466 44.266 < 2e-16 ***
## bty_avg 0.07416 0.01625 4.563 6.48e-06 ***
## gendermale 0.17239 0.05022 3.433 0.000652 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5287 on 460 degrees of freedom
## Multiple R-squared: 0.05912, Adjusted R-squared: 0.05503
## F-statistic: 14.45 on 2 and 460 DF, p-value: 8.177e-07
Comparing the coefficients before and after including gender, bty_avg remains a statistically significant predictor of score (p = 6.48e-06 in the two-predictor model), and its estimate changes only slightly, from 0.06664 to 0.07416. The addition of gender therefore adjusts, but does not substantially alter, the estimated relationship between bty_avg and score.
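The change can be read directly off the coefficient vectors:
# bty_avg estimate before and after adding gender
coef(m_bty)["bty_avg"]
coef(m_bty_gen)["bty_avg"]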
For a categorical variable like gender with two categories (male and female), R creates a single binary variable to represent this. Let’s say gendermale is the resulting variable. For female professors, gendermale takes a value of 0, and for male professors, it takes a value of 1. Consider the regression equation:
\[
\begin{aligned}
\widehat{\text{score}} &= \hat{\beta}_0 + \hat{\beta}_1 \times \text{bty\_avg} + \hat{\beta}_2 \times (0) \\
&= \hat{\beta}_0 + \hat{\beta}_1 \times \text{bty\_avg}
\end{aligned}
\]
# Plot fitted regression lines (using geom_smooth()) for the relationship
# between bty_avg (average beauty score) and score, one line per picture color
ggplot(data = evals, aes(x = bty_avg, y = score, color = pic_color)) +
geom_smooth(method = "lm", formula = y ~ x, se = FALSE)
What is the equation of the line corresponding to male professors? (Hint: For males, the parameter estimate for gendermale is multiplied by 1.) For two professors who received the same beauty rating, which gender tends to have the higher course evaluation score?
\[
\begin{aligned}
\widehat{\text{score}} &= 3.74734 + 0.07416 \times \text{bty\_avg} + 0.17239 \times (1) \\
&= 3.91973 + 0.07416 \times \text{bty\_avg}
\end{aligned}
\]
For male professors we evaluate the equation with gendermale \(= 1\); for female professors we substitute 0. A male professor is therefore predicted to have an evaluation score 0.17239 points higher than a female professor with the same beauty rating, all other things being equal.
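The two intercepts implied by the model can also be computed directly (female is the reference level):
# Intercept for female professors (gendermale = 0)
coef(m_bty_gen)["(Intercept)"]
# Intercept for male professors (gendermale = 1)
coef(m_bty_gen)["(Intercept)"] + coef(m_bty_gen)["gendermale"]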
Create a new model called m_bty_rank with gender removed and rank added in. How does R appear to handle categorical variables that have more than two levels? Note that the rank variable has three levels: teaching, tenure track, tenured.
# Create a new model m_bty_rank with rank added and gender removed
m_bty_rank <- lm(score ~ bty_avg + rank, data = evals)
summary(m_bty_rank)
##
## Call:
## lm(formula = score ~ bty_avg + rank, data = evals)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.8713 -0.3642 0.1489 0.4103 0.9525
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.98155 0.09078 43.860 < 2e-16 ***
## bty_avg 0.06783 0.01655 4.098 4.92e-05 ***
## ranktenure track -0.16070 0.07395 -2.173 0.0303 *
## ranktenured -0.12623 0.06266 -2.014 0.0445 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5328 on 459 degrees of freedom
## Multiple R-squared: 0.04652, Adjusted R-squared: 0.04029
## F-statistic: 7.465 on 3 and 459 DF, p-value: 6.88e-05
For the rank variable with three levels (teaching, tenure track, tenured), R creates two dummy variables, ranktenure track and ranktenured, using the first level, teaching, as the reference category. In general, a factor with k levels is encoded as k − 1 dummy variables; the sketch below shows the encoding directly.
In multiple regression, each coefficient is interpreted holding the other predictors constant, which differs slightly from simple regression.
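The dummy coding is easy to inspect via the model matrix:
# R expands the three-level rank factor into two indicator columns,
# with the first level, "teaching", as the reference category
levels(evals$rank)
head(model.matrix(~ rank, data = evals))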
Which variable would you expect to have the highest p-value in this model? Why? Hint: Think about which variable would you expect to not have any association with the professor score.
I would expect cls_profs (whether the course has one professor or multiple) to have the highest p-value, since the number of professors teaching sections of a course seems unrelated to how students rate it. To check, we fit a full linear regression model, m_full, with all of these predictors:
m_full <- lm(score ~ rank + gender + ethnicity + language + age + cls_perc_eval
+ cls_students + cls_level + cls_profs + cls_credits + bty_avg
+ pic_outfit + pic_color, data = evals)
summary(m_full)
##
## Call:
## lm(formula = score ~ rank + gender + ethnicity + language + age +
## cls_perc_eval + cls_students + cls_level + cls_profs + cls_credits +
## bty_avg + pic_outfit + pic_color, data = evals)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.77397 -0.32432 0.09067 0.35183 0.95036
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.0952141 0.2905277 14.096 < 2e-16 ***
## ranktenure track -0.1475932 0.0820671 -1.798 0.07278 .
## ranktenured -0.0973378 0.0663296 -1.467 0.14295
## gendermale 0.2109481 0.0518230 4.071 5.54e-05 ***
## ethnicitynot minority 0.1234929 0.0786273 1.571 0.11698
## languagenon-english -0.2298112 0.1113754 -2.063 0.03965 *
## age -0.0090072 0.0031359 -2.872 0.00427 **
## cls_perc_eval 0.0053272 0.0015393 3.461 0.00059 ***
## cls_students 0.0004546 0.0003774 1.205 0.22896
## cls_levelupper 0.0605140 0.0575617 1.051 0.29369
## cls_profssingle -0.0146619 0.0519885 -0.282 0.77806
## cls_creditsone credit 0.5020432 0.1159388 4.330 1.84e-05 ***
## bty_avg 0.0400333 0.0175064 2.287 0.02267 *
## pic_outfitnot formal -0.1126817 0.0738800 -1.525 0.12792
## pic_colorcolor -0.2172630 0.0715021 -3.039 0.00252 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.498 on 448 degrees of freedom
## Multiple R-squared: 0.1871, Adjusted R-squared: 0.1617
## F-statistic: 7.366 on 14 and 448 DF, p-value: 6.552e-14
As expected, the variable with the highest p-value is cls_profssingle (p = 0.778).
Check your suspicions from the previous exercise. Include the model output in your response.
The model output above confirms the suspicion: cls_profssingle has the highest p-value, 0.77806, indicating essentially no association with the professor’s score once the other predictors are accounted for.
Interpret the coefficient associated with the ethnicity variable.
The only ethnicity coefficient in the model is ethnicitynot minority, with an estimate of 0.1234929. This means that, holding all other variables constant, professors who are not from a minority group are estimated to score about 0.12 points higher, on average, than minority professors (the reference level).
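The reference level can be confirmed directly:
# "minority" is the first factor level, hence the reference category
levels(evals$ethnicity)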
Drop the variable with the highest p-value and re-fit the model. Did the coefficients and significance of the other explanatory variables change? (One of the things that makes multiple regression interesting is that coefficient estimates depend on the other variables that are included in the model.) If not, what does this say about whether or not the dropped variable was collinear with the other explanatory variables?
The variable with the highest p-value in m_full is cls_profs, so we drop it and re-fit the model.
# Dropping cls_profs from the model
m_revised <- update(m_full, . ~ . - cls_profs)
# Refitting the model without cls_profs
summary(m_revised)
##
## Call:
## lm(formula = score ~ rank + gender + ethnicity + language + age +
## cls_perc_eval + cls_students + cls_level + cls_credits +
## bty_avg + pic_outfit + pic_color, data = evals)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.7836 -0.3257 0.0859 0.3513 0.9551
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.0872523 0.2888562 14.150 < 2e-16 ***
## ranktenure track -0.1476746 0.0819824 -1.801 0.072327 .
## ranktenured -0.0973829 0.0662614 -1.470 0.142349
## gendermale 0.2101231 0.0516873 4.065 5.66e-05 ***
## ethnicitynot minority 0.1274458 0.0772887 1.649 0.099856 .
## languagenon-english -0.2282894 0.1111305 -2.054 0.040530 *
## age -0.0089992 0.0031326 -2.873 0.004262 **
## cls_perc_eval 0.0052888 0.0015317 3.453 0.000607 ***
## cls_students 0.0004687 0.0003737 1.254 0.210384
## cls_levelupper 0.0606374 0.0575010 1.055 0.292200
## cls_creditsone credit 0.5061196 0.1149163 4.404 1.33e-05 ***
## bty_avg 0.0398629 0.0174780 2.281 0.023032 *
## pic_outfitnot formal -0.1083227 0.0721711 -1.501 0.134080
## pic_colorcolor -0.2190527 0.0711469 -3.079 0.002205 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4974 on 449 degrees of freedom
## Multiple R-squared: 0.187, Adjusted R-squared: 0.1634
## F-statistic: 7.943 on 13 and 449 DF, p-value: 2.336e-14
The coefficients and significance levels of the remaining explanatory variables are almost unchanged after dropping cls_profs, which suggests that cls_profs was not strongly correlated (collinear) with the other explanatory variables. Consequently, excluding it did not notably affect the estimates or significance of the remaining variables.
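A quick numeric check of this (a sketch) compares the coefficients the two models share:
# Largest absolute change among coefficients common to both models
shared <- intersect(names(coef(m_full)), names(coef(m_revised)))
max(abs(coef(m_full)[shared] - coef(m_revised)[shared]))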
Using backward-selection and p-value as the selection criterion, determine the best model. You do not need to show all steps in your answer, just the output for the final model. Also, write out the linear model for predicting score based on the final model you settle on.
The step() function iteratively removes predictors until no removal improves the model. Note that step() actually uses AIC, not p-values, as its selection criterion, so it serves here as an automated approximation to backward elimination (a manual p-value version is sketched after the final model). summary(final_model) displays the resulting model.
# Perform backward selection (step() drops terms by AIC)
final_model <- step(m_full, direction = "backward")
## Start: AIC=-630.9
## score ~ rank + gender + ethnicity + language + age + cls_perc_eval +
## cls_students + cls_level + cls_profs + cls_credits + bty_avg +
## pic_outfit + pic_color
##
## Df Sum of Sq RSS AIC
## - cls_profs 1 0.0197 111.11 -632.82
## - cls_level 1 0.2740 111.36 -631.76
## - cls_students 1 0.3599 111.44 -631.40
## - rank 2 0.8930 111.98 -631.19
## <none> 111.08 -630.90
## - pic_outfit 1 0.5768 111.66 -630.50
## - ethnicity 1 0.6117 111.70 -630.36
## - language 1 1.0557 112.14 -628.52
## - bty_avg 1 1.2967 112.38 -627.53
## - age 1 2.0456 113.13 -624.45
## - pic_color 1 2.2893 113.37 -623.46
## - cls_perc_eval 1 2.9698 114.06 -620.69
## - gender 1 4.1085 115.19 -616.09
## - cls_credits 1 4.6495 115.73 -613.92
##
## Step: AIC=-632.82
## score ~ rank + gender + ethnicity + language + age + cls_perc_eval +
## cls_students + cls_level + cls_credits + bty_avg + pic_outfit +
## pic_color
##
## Df Sum of Sq RSS AIC
## - cls_level 1 0.2752 111.38 -633.67
## - cls_students 1 0.3893 111.49 -633.20
## - rank 2 0.8939 112.00 -633.11
## <none> 111.11 -632.82
## - pic_outfit 1 0.5574 111.66 -632.50
## - ethnicity 1 0.6728 111.78 -632.02
## - language 1 1.0442 112.15 -630.49
## - bty_avg 1 1.2872 112.39 -629.49
## - age 1 2.0422 113.15 -626.39
## - pic_color 1 2.3457 113.45 -625.15
## - cls_perc_eval 1 2.9502 114.06 -622.69
## - gender 1 4.0895 115.19 -618.08
## - cls_credits 1 4.7999 115.90 -615.24
##
## Step: AIC=-633.67
## score ~ rank + gender + ethnicity + language + age + cls_perc_eval +
## cls_students + cls_credits + bty_avg + pic_outfit + pic_color
##
## Df Sum of Sq RSS AIC
## - cls_students 1 0.2459 111.63 -634.65
## - rank 2 0.8140 112.19 -634.30
## <none> 111.38 -633.67
## - pic_outfit 1 0.6618 112.04 -632.93
## - ethnicity 1 0.8698 112.25 -632.07
## - language 1 0.9015 112.28 -631.94
## - bty_avg 1 1.3694 112.75 -630.02
## - age 1 1.9342 113.31 -627.70
## - pic_color 1 2.0777 113.46 -627.12
## - cls_perc_eval 1 3.0290 114.41 -623.25
## - gender 1 3.8989 115.28 -619.74
## - cls_credits 1 4.5296 115.91 -617.22
##
## Step: AIC=-634.65
## score ~ rank + gender + ethnicity + language + age + cls_perc_eval +
## cls_credits + bty_avg + pic_outfit + pic_color
##
## Df Sum of Sq RSS AIC
## - rank 2 0.7892 112.42 -635.39
## <none> 111.63 -634.65
## - ethnicity 1 0.8832 112.51 -633.00
## - pic_outfit 1 0.9700 112.60 -632.65
## - language 1 1.0338 112.66 -632.38
## - bty_avg 1 1.5783 113.20 -630.15
## - pic_color 1 1.9477 113.57 -628.64
## - age 1 2.1163 113.74 -627.96
## - cls_perc_eval 1 2.7922 114.42 -625.21
## - gender 1 4.0945 115.72 -619.97
## - cls_credits 1 4.5163 116.14 -618.29
##
## Step: AIC=-635.39
## score ~ gender + ethnicity + language + age + cls_perc_eval +
## cls_credits + bty_avg + pic_outfit + pic_color
##
## Df Sum of Sq RSS AIC
## <none> 112.42 -635.39
## - pic_outfit 1 0.7141 113.13 -634.46
## - ethnicity 1 1.1790 113.59 -632.56
## - language 1 1.3403 113.75 -631.90
## - age 1 1.6847 114.10 -630.50
## - pic_color 1 1.7841 114.20 -630.10
## - bty_avg 1 1.8553 114.27 -629.81
## - cls_perc_eval 1 2.9147 115.33 -625.54
## - gender 1 4.0577 116.47 -620.97
## - cls_credits 1 6.1208 118.54 -612.84
# Display the final model
summary(final_model)
##
## Call:
## lm(formula = score ~ gender + ethnicity + language + age + cls_perc_eval +
## cls_credits + bty_avg + pic_outfit + pic_color, data = evals)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.8455 -0.3221 0.1013 0.3745 0.9051
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.907030 0.244889 15.954 < 2e-16 ***
## gendermale 0.202597 0.050102 4.044 6.18e-05 ***
## ethnicitynot minority 0.163818 0.075158 2.180 0.029798 *
## languagenon-english -0.246683 0.106146 -2.324 0.020567 *
## age -0.006925 0.002658 -2.606 0.009475 **
## cls_perc_eval 0.004942 0.001442 3.427 0.000666 ***
## cls_creditsone credit 0.517205 0.104141 4.966 9.68e-07 ***
## bty_avg 0.046732 0.017091 2.734 0.006497 **
## pic_outfitnot formal -0.113939 0.067168 -1.696 0.090510 .
## pic_colorcolor -0.180870 0.067456 -2.681 0.007601 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4982 on 453 degrees of freedom
## Multiple R-squared: 0.1774, Adjusted R-squared: 0.161
## F-statistic: 10.85 on 9 and 453 DF, p-value: 2.441e-15
Based on the selected predictors, the linear model for predicting score is:
\[
\begin{aligned}
\widehat{\text{score}} = \hat{\beta}_0 &+ \hat{\beta}_1 \times \text{gendermale} + \hat{\beta}_2 \times \text{ethnicitynot minority} + \hat{\beta}_3 \times \text{languagenon-english} \\
&+ \hat{\beta}_4 \times \text{age} + \hat{\beta}_5 \times \text{cls\_perc\_eval} + \hat{\beta}_6 \times \text{cls\_creditsone credit} \\
&+ \hat{\beta}_7 \times \text{bty\_avg} + \hat{\beta}_8 \times \text{pic\_outfitnot formal} + \hat{\beta}_9 \times \text{pic\_colorcolor}
\end{aligned}
\]
Plugging in the estimates from the output above:
\[
\begin{aligned}
\widehat{\text{score}} = 3.907 &+ 0.2026 \times \text{gendermale} + 0.1638 \times \text{ethnicitynot minority} - 0.2467 \times \text{languagenon-english} \\
&- 0.0069 \times \text{age} + 0.0049 \times \text{cls\_perc\_eval} + 0.5172 \times \text{cls\_creditsone credit} \\
&+ 0.0467 \times \text{bty\_avg} - 0.1139 \times \text{pic\_outfitnot formal} - 0.1809 \times \text{pic\_colorcolor}
\end{aligned}
\]
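Because step() selects on AIC rather than p-values, a strictly p-value-based backward elimination would be done by hand: drop the predictor with the largest p-value, refit, and repeat. The first two steps, read from the summaries above, would look like this (a sketch; later drops depend on the refitted p-values):
# Step 1: drop cls_profs (p = 0.778, the largest in m_full)
m_manual <- update(m_full, . ~ . - cls_profs)
# Step 2: drop cls_level (p = 0.292, now the largest)
m_manual <- update(m_manual, . ~ . - cls_level)
# ...continue until all remaining predictors are below the chosen threshold
summary(m_manual)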
Verify that the conditions for this model are reasonable using diagnostic plots.
# Creating diagnostic plots for the final model
plot(final_model)
These plots suggest that the conditions of linearity, constant variance, normality of residuals, and absence of unduly influential points are reasonably met by the model.
The original paper describes how these data were gathered by taking a sample of professors from the University of Texas at Austin and including all courses that they have taught. Considering that each row represents a course, could this new information have an impact on any of the conditions of linear regression?
Yes: this information calls the independence condition into question. Each row is a course, but a professor who teaches multiple courses appears in multiple rows, and the evaluations of those courses share professor-specific characteristics. The observations are therefore not independent of one another, which violates one of the conditions of the linear regression model.
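The repeated-professor structure is easy to see with a quick count:
# Each professor contributes several courses, so rows are not independent
evals %>%
  count(prof_id) %>%
  summarise(n_professors = n(), median_courses = median(n), max_courses = max(n))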
Based on your final model, describe the characteristics of a professor and course at University of Texas at Austin that would be associated with a high evaluation score.
Based on the final model, a high evaluation score would be associated with a professor who is male, not from a minority group, educated at an English-speaking institution, and relatively young, with a high average beauty rating; and with a one-credit course that has a high percentage of students completing evaluations, taught by a professor whose profile picture is formal and black-and-white, all else being equal.
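One way to make this concrete is to look at the courses the final model itself rates highest (a sketch; it assumes the model was fit on all 463 rows, so that fitted() aligns row-for-row with evals):
# Courses with the highest fitted scores, with the characteristics driving them
evals %>%
  mutate(fitted_score = fitted(final_model)) %>%
  arrange(desc(fitted_score)) %>%
  select(fitted_score, score, gender, ethnicity, language, age,
         cls_perc_eval, cls_credits, bty_avg, pic_outfit, pic_color) %>%
  head()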
Would you be comfortable generalizing your conclusions to apply to professors generally (at any university)? Why or why not?
This observational data set was acquired from a particular university that might not fully represent the diversity of professors and courses across all universities, leading to potential biases or limitations in generalizing the findings. Each university has its own unique institutional culture, academic standards, student demographics, and evaluation methodologies. So, evaluation scores at one institution may not hold true for others due to these variations.
That said, from my standpoint, a professor’s perceived attractiveness, captured here by the bty_avg variable, likely influences students’ evaluation scores at other institutions as well, so this particular finding may generalize more readily than the model as a whole.