library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.0.4
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.3     v purrr   0.3.4
## v tibble  3.0.6     v dplyr   1.0.3
## v tidyr   1.1.2     v stringr 1.4.0
## v readr   1.4.0     v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(openintro)
## Loading required package: airports
## Loading required package: cherryblossom
## Loading required package: usdata

The Data

download.file("http://www.openintro.org/stat/data/evals.RData", destfile = "evals.RData")
load("evals.RData")

Exploring The Data

Exercise 1

Answer: This is an observational study, so we cannot establish causation. Rephrased, the question would be: Is there an association between an instructor’s physical appearance and course evaluation scores?

Exercise 2

hist(evals$score, prob = TRUE, breaks = 11,  main = "Professor Score on Course Evaluation", xlab = "score")
x <- seq(from = 0, to = 5, by = 0.5)
curve(dnorm(x, mean = mean(evals$score), sd = sd(evals$score) ), add = TRUE, col = "red", lwd = 2)

Answer: The evaluation scores are left-skewed: most scores are concentrated at the high end, which suggests that students tend to rate their courses positively. Yes, this is what I expected to see, since the sample is not random.

Exercise 3

hist(evals$age, prob = TRUE, breaks = 16, main = "Age Distribution on Course Evaluation", xlab = "Age")
x <- seq(from = 20, to =95, by = 10)
curve(dnorm(x, mean = mean(evals$age), sd = sd(evals$age) ), add = TRUE, col = "red", lwd = 2)

boxplot(evals$age~ evals$gender, main = "Age by Gender", xlab = "Gender", ylab = "Age")

Answer: Comparing age by gender, the boxplot shows that the male professors tend to be older than the female professors. The age distribution is roughly normal and unimodal, with a mean around 50.

Simple linear regression

plot(evals$score ~ evals$bty_avg)

nrow(evals)
## [1] 463
min(evals$score)
## [1] 2.3
summary(evals$score)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.300   3.800   4.300   4.175   4.600   5.000

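Answer: The data frame (evals) has 463 observations, and score ranges from 2.3 to 5. At first glance the scatterplot does not look awry, but it shows fewer than 463 distinct points because observations with identical bty_avg and score values are drawn on top of each other.

Exercise 4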

plot(jitter(evals$score) ~ jitter(evals$bty_avg))

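Answer: In the original scatterplot, observations with the same bty_avg and score were drawn at exactly the same position. Because these points overlapped, many of them were hidden, so the plot misrepresented how the observations are distributed. Jittering adds a small amount of random noise so that the overlapping points become visible.

Exercise 5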

m_bty <- lm(score~bty_avg, data = evals)
plot(jitter(evals$score) ~ jitter(evals$bty_avg))
abline(m_bty)

summary(m_bty)
## 
## Call:
## lm(formula = score ~ bty_avg, data = evals)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.9246 -0.3690  0.1420  0.3977  0.9309 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.88034    0.07614   50.96  < 2e-16 ***
## bty_avg      0.06664    0.01629    4.09 5.08e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5348 on 461 degrees of freedom
## Multiple R-squared:  0.03502,    Adjusted R-squared:  0.03293 
## F-statistic: 16.73 on 1 and 461 DF,  p-value: 5.083e-05

Answer: Linear model equation: score = 3.88034 + 0.06664 × bty_avg. The slope means that each one-point increase in bty_avg is associated with an average increase of 0.06664 points in the predicted score. The intercept, 3.88, is the predicted score for a professor with a (hypothetical) beauty rating of 0. The p-value for the slope (5.08e-05) shows that bty_avg is a statistically significant predictor, but R-squared (0.03502; adjusted 0.03293) shows that it explains only about 3.5% of the variation in score, so it is not a practically significant predictor.

Exercise 6

par(mfrow = c(2,2))
plot(m_bty)

Answer: The linearity condition does not appear to be met, since the points in the Residuals vs Fitted plot are not scattered randomly around zero. The Normal Q-Q plot shows that the residuals are not nearly normal. As a result, the conditions for least squares regression do not appear reasonable for this model.

Multiple linear regression

plot(evals$bty_avg ~ evals$bty_f1lower)

cor(evals$bty_avg, evals$bty_f1lower)
## [1] 0.8439112
plot(evals[,13:19])

m_bty_gen <- lm(score ~ bty_avg + gender, data = evals)
summary(m_bty_gen)
## 
## Call:
## lm(formula = score ~ bty_avg + gender, data = evals)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.8305 -0.3625  0.1055  0.4213  0.9314 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.74734    0.08466  44.266  < 2e-16 ***
## bty_avg      0.07416    0.01625   4.563 6.48e-06 ***
## gendermale   0.17239    0.05022   3.433 0.000652 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5287 on 460 degrees of freedom
## Multiple R-squared:  0.05912,    Adjusted R-squared:  0.05503 
## F-statistic: 14.45 on 2 and 460 DF,  p-value: 8.177e-07

Exercise 7

m_bty_gen <- lm(score ~ bty_avg + gender, data = evals)
par(mfrow = c(2,2))
plot(m_bty_gen)

Answer: The regression conditions for this model are not reasonable. The pattern in the Residuals vs Fitted plot does not meet the constant variance condition, and the Normal Q-Q plot shows that the residuals are not normally distributed (left-skewed).

Exercise 8

Answer: Yes, bty_avg is still a significant predictor of score (p = 6.48e-06). Adding gender to the model changed the parameter estimate for bty_avg from 0.06664 to 0.07416.

multiLines(m_bty_gen)

Exercise 9

summary(m_bty_gen)
## 
## Call:
## lm(formula = score ~ bty_avg + gender, data = evals)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.8305 -0.3625  0.1055  0.4213  0.9314 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.74734    0.08466  44.266  < 2e-16 ***
## bty_avg      0.07416    0.01625   4.563 6.48e-06 ***
## gendermale   0.17239    0.05022   3.433 0.000652 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5287 on 460 degrees of freedom
## Multiple R-squared:  0.05912,    Adjusted R-squared:  0.05503 
## F-statistic: 14.45 on 2 and 460 DF,  p-value: 8.177e-07
plot(m_bty_gen)

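Answer: score = 3.74734 + 0.07416 × bty_avg + 0.17239 × gendermale, where gendermale is 1 for male professors and 0 for female professors. For males the line is therefore score = 3.91973 + 0.07416 × bty_avg. For two professors who received the same beauty rating, the male professor tends to have the higher course evaluation score, by 0.17239 points on average.

Exercise 10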

m_bty_rank <- lm(score ~ bty_avg + rank, data = evals)
summary(m_bty_rank)
## 
## Call:
## lm(formula = score ~ bty_avg + rank, data = evals)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.8713 -0.3642  0.1489  0.4103  0.9525 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       3.98155    0.09078  43.860  < 2e-16 ***
## bty_avg           0.06783    0.01655   4.098 4.92e-05 ***
## ranktenure track -0.16070    0.07395  -2.173   0.0303 *  
## ranktenured      -0.12623    0.06266  -2.014   0.0445 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5328 on 459 degrees of freedom
## Multiple R-squared:  0.04652,    Adjusted R-squared:  0.04029 
## F-statistic: 7.465 on 3 and 459 DF,  p-value: 6.88e-05
plot(m_bty_rank)

Answer: R handles a categorical variable with more than two levels by creating one indicator variable for each level except the reference level. Here rank has three levels: teaching is the reference level, and the indicators ranktenure track and ranktenured equal 1 for professors of that rank and 0 otherwise. We get the following equation: score = 3.98155 + 0.06783 × bty_avg − 0.16070 × ranktenure track − 0.12623 × ranktenured

The search for the best model

Exercise 11

Answer: I would expect the “language” variable to have the highest p-value in this model. I would assume that the language of the university where the professor was educated is not related to the professor’s evaluation score.

m_full <- lm(score ~ rank + ethnicity + gender + language + age + cls_perc_eval 
             + cls_students + cls_level + cls_profs + cls_credits + bty_avg 
             + pic_outfit + pic_color, data = evals)
summary(m_full)
## 
## Call:
## lm(formula = score ~ rank + ethnicity + gender + language + age + 
##     cls_perc_eval + cls_students + cls_level + cls_profs + cls_credits + 
##     bty_avg + pic_outfit + pic_color, data = evals)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.77397 -0.32432  0.09067  0.35183  0.95036 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            4.0952141  0.2905277  14.096  < 2e-16 ***
## ranktenure track      -0.1475932  0.0820671  -1.798  0.07278 .  
## ranktenured           -0.0973378  0.0663296  -1.467  0.14295    
## ethnicitynot minority  0.1234929  0.0786273   1.571  0.11698    
## gendermale             0.2109481  0.0518230   4.071 5.54e-05 ***
## languagenon-english   -0.2298112  0.1113754  -2.063  0.03965 *  
## age                   -0.0090072  0.0031359  -2.872  0.00427 ** 
## cls_perc_eval          0.0053272  0.0015393   3.461  0.00059 ***
## cls_students           0.0004546  0.0003774   1.205  0.22896    
## cls_levelupper         0.0605140  0.0575617   1.051  0.29369    
## cls_profssingle       -0.0146619  0.0519885  -0.282  0.77806    
## cls_creditsone credit  0.5020432  0.1159388   4.330 1.84e-05 ***
## bty_avg                0.0400333  0.0175064   2.287  0.02267 *  
## pic_outfitnot formal  -0.1126817  0.0738800  -1.525  0.12792    
## pic_colorcolor        -0.2172630  0.0715021  -3.039  0.00252 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.498 on 448 degrees of freedom
## Multiple R-squared:  0.1871, Adjusted R-squared:  0.1617 
## F-statistic: 7.366 on 14 and 448 DF,  p-value: 6.552e-14
plot(m_full)

Exercise 12

Answer: In this model, the cls_profs variable has the highest p-value (0.77806). This suggests that, once the other variables are accounted for, the number of professors teaching the course has the weakest association with the score. I had expected the language variable to have the largest p-value, but its p-value (0.03965) is much smaller than that of cls_profs.

Exercise 13

Answer: The ethnicitynot minority coefficient is 0.1234929. This means that, with all other variables held constant, the predicted score is 0.1234929 points higher when the professor is not from an ethnic minority background. The coefficient is a difference in predicted score, not a share of the variability in score, and with a p-value of 0.11698 it is not statistically significant at the 5% level.

Exercise 14

no_cls_profs <- lm(score ~ rank + ethnicity + gender + language + age + cls_perc_eval 
             + cls_students + cls_level + cls_credits + bty_avg + pic_outfit
             + pic_color, data = evals)
summary(no_cls_profs)
## 
## Call:
## lm(formula = score ~ rank + ethnicity + gender + language + age + 
##     cls_perc_eval + cls_students + cls_level + cls_credits + 
##     bty_avg + pic_outfit + pic_color, data = evals)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.7836 -0.3257  0.0859  0.3513  0.9551 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            4.0872523  0.2888562  14.150  < 2e-16 ***
## ranktenure track      -0.1476746  0.0819824  -1.801 0.072327 .  
## ranktenured           -0.0973829  0.0662614  -1.470 0.142349    
## ethnicitynot minority  0.1274458  0.0772887   1.649 0.099856 .  
## gendermale             0.2101231  0.0516873   4.065 5.66e-05 ***
## languagenon-english   -0.2282894  0.1111305  -2.054 0.040530 *  
## age                   -0.0089992  0.0031326  -2.873 0.004262 ** 
## cls_perc_eval          0.0052888  0.0015317   3.453 0.000607 ***
## cls_students           0.0004687  0.0003737   1.254 0.210384    
## cls_levelupper         0.0606374  0.0575010   1.055 0.292200    
## cls_creditsone credit  0.5061196  0.1149163   4.404 1.33e-05 ***
## bty_avg                0.0398629  0.0174780   2.281 0.023032 *  
## pic_outfitnot formal  -0.1083227  0.0721711  -1.501 0.134080    
## pic_colorcolor        -0.2190527  0.0711469  -3.079 0.002205 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4974 on 449 degrees of freedom
## Multiple R-squared:  0.187,  Adjusted R-squared:  0.1634 
## F-statistic: 7.943 on 13 and 449 DF,  p-value: 2.336e-14
plot(no_cls_profs)

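Answer: The coefficients and p-values of the other explanatory variables changed slightly when we dropped cls_profs (for example, the ethnicity coefficient moved from 0.1235 to 0.1274 and its p-value from 0.117 to 0.0999). This suggests some dependence between cls_profs and the other explanatory variables.

Exercise 15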

m_full_2 <- lm(score ~ rank + ethnicity + gender + language + age + cls_perc_eval 
             + cls_students + cls_level + cls_profs + cls_credits + bty_avg 
             + pic_outfit + pic_color, data = evals)
best_model <- step(m_full_2, direction = 'backward')
## Start:  AIC=-630.9
## score ~ rank + ethnicity + gender + language + age + cls_perc_eval + 
##     cls_students + cls_level + cls_profs + cls_credits + bty_avg + 
##     pic_outfit + pic_color
## 
##                 Df Sum of Sq    RSS     AIC
## - cls_profs      1    0.0197 111.11 -632.82
## - cls_level      1    0.2740 111.36 -631.76
## - cls_students   1    0.3599 111.44 -631.40
## - rank           2    0.8930 111.98 -631.19
## <none>                       111.08 -630.90
## - pic_outfit     1    0.5768 111.66 -630.50
## - ethnicity      1    0.6117 111.70 -630.36
## - language       1    1.0557 112.14 -628.52
## - bty_avg        1    1.2967 112.38 -627.53
## - age            1    2.0456 113.13 -624.45
## - pic_color      1    2.2893 113.37 -623.46
## - cls_perc_eval  1    2.9698 114.06 -620.69
## - gender         1    4.1085 115.19 -616.09
## - cls_credits    1    4.6495 115.73 -613.92
## 
## Step:  AIC=-632.82
## score ~ rank + ethnicity + gender + language + age + cls_perc_eval + 
##     cls_students + cls_level + cls_credits + bty_avg + pic_outfit + 
##     pic_color
## 
##                 Df Sum of Sq    RSS     AIC
## - cls_level      1    0.2752 111.38 -633.67
## - cls_students   1    0.3893 111.49 -633.20
## - rank           2    0.8939 112.00 -633.11
## <none>                       111.11 -632.82
## - pic_outfit     1    0.5574 111.66 -632.50
## - ethnicity      1    0.6728 111.78 -632.02
## - language       1    1.0442 112.15 -630.49
## - bty_avg        1    1.2872 112.39 -629.49
## - age            1    2.0422 113.15 -626.39
## - pic_color      1    2.3457 113.45 -625.15
## - cls_perc_eval  1    2.9502 114.06 -622.69
## - gender         1    4.0895 115.19 -618.08
## - cls_credits    1    4.7999 115.90 -615.24
## 
## Step:  AIC=-633.67
## score ~ rank + ethnicity + gender + language + age + cls_perc_eval + 
##     cls_students + cls_credits + bty_avg + pic_outfit + pic_color
## 
##                 Df Sum of Sq    RSS     AIC
## - cls_students   1    0.2459 111.63 -634.65
## - rank           2    0.8140 112.19 -634.30
## <none>                       111.38 -633.67
## - pic_outfit     1    0.6618 112.04 -632.93
## - ethnicity      1    0.8698 112.25 -632.07
## - language       1    0.9015 112.28 -631.94
## - bty_avg        1    1.3694 112.75 -630.02
## - age            1    1.9342 113.31 -627.70
## - pic_color      1    2.0777 113.46 -627.12
## - cls_perc_eval  1    3.0290 114.41 -623.25
## - gender         1    3.8989 115.28 -619.74
## - cls_credits    1    4.5296 115.91 -617.22
## 
## Step:  AIC=-634.65
## score ~ rank + ethnicity + gender + language + age + cls_perc_eval + 
##     cls_credits + bty_avg + pic_outfit + pic_color
## 
##                 Df Sum of Sq    RSS     AIC
## - rank           2    0.7892 112.42 -635.39
## <none>                       111.63 -634.65
## - ethnicity      1    0.8832 112.51 -633.00
## - pic_outfit     1    0.9700 112.60 -632.65
## - language       1    1.0338 112.66 -632.38
## - bty_avg        1    1.5783 113.20 -630.15
## - pic_color      1    1.9477 113.57 -628.64
## - age            1    2.1163 113.74 -627.96
## - cls_perc_eval  1    2.7922 114.42 -625.21
## - gender         1    4.0945 115.72 -619.97
## - cls_credits    1    4.5163 116.14 -618.29
## 
## Step:  AIC=-635.39
## score ~ ethnicity + gender + language + age + cls_perc_eval + 
##     cls_credits + bty_avg + pic_outfit + pic_color
## 
##                 Df Sum of Sq    RSS     AIC
## <none>                       112.42 -635.39
## - pic_outfit     1    0.7141 113.13 -634.46
## - ethnicity      1    1.1790 113.59 -632.56
## - language       1    1.3403 113.75 -631.90
## - age            1    1.6847 114.10 -630.50
## - pic_color      1    1.7841 114.20 -630.10
## - bty_avg        1    1.8553 114.27 -629.81
## - cls_perc_eval  1    2.9147 115.33 -625.54
## - gender         1    4.0577 116.47 -620.97
## - cls_credits    1    6.1208 118.54 -612.84
summary(best_model)
## 
## Call:
## lm(formula = score ~ ethnicity + gender + language + age + cls_perc_eval + 
##     cls_credits + bty_avg + pic_outfit + pic_color, data = evals)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.8455 -0.3221  0.1013  0.3745  0.9051 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            3.907030   0.244889  15.954  < 2e-16 ***
## ethnicitynot minority  0.163818   0.075158   2.180 0.029798 *  
## gendermale             0.202597   0.050102   4.044 6.18e-05 ***
## languagenon-english   -0.246683   0.106146  -2.324 0.020567 *  
## age                   -0.006925   0.002658  -2.606 0.009475 ** 
## cls_perc_eval          0.004942   0.001442   3.427 0.000666 ***
## cls_creditsone credit  0.517205   0.104141   4.966 9.68e-07 ***
## bty_avg                0.046732   0.017091   2.734 0.006497 ** 
## pic_outfitnot formal  -0.113939   0.067168  -1.696 0.090510 .  
## pic_colorcolor        -0.180870   0.067456  -2.681 0.007601 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4982 on 453 degrees of freedom
## Multiple R-squared:  0.1774, Adjusted R-squared:  0.161 
## F-statistic: 10.85 on 9 and 453 DF,  p-value: 2.441e-15
plot(best_model)

Answer: Linear model for predicting score: score = 3.907030 + 0.163818 * ethnicitynot minority + 0.202597 * gendermale - 0.246683 * languagenon-english - 0.006925 * age + 0.004942 * cls_perc_eval + 0.517205 * cls_creditsone credit + 0.046732 * bty_avg - 0.113939 * pic_outfitnot formal - 0.180870 * pic_colorcolor

Exercise 16

par(mfrow = c(2,2))
plot(best_model)

Answer: Based on the Normal Q-Q plot, the residuals of the model are nearly normal (slightly left-skewed). We don’t see any outliers, but the tails deviate a little from the line.

Exercise 17

Answer: Yes and no. If a student takes two or more courses from the same professor, the independence of the observations may be compromised. However, a student who takes just one course with a professor may have a different opinion than a student who takes two courses with that professor, and it is difficult for a professor to teach two very different subjects equally well.

Exercise 18

Answer: Based on the final model, the highest scores are associated with professors who are male, not from a minority, younger, educated at an English-speaking university, and rated as more attractive, who teach one-credit courses with a high percentage of students completing the evaluation, and whose picture is black and white and shows formal clothing.

Exercise 19

Answer: No, we wouldn’t be comfortable generalizing our conclusions to professors at any university. The study is observational and the sample is not random, so it does not represent the general population of professors. Also, how students rate a teacher depends on social and psychological factors rather than on a scientific law. Additionally, the definition of beauty differs from person to person, which makes the bty_avg variable subjective.

---
title: "Lab 11: Multiple Linear Regression"
author: "Auriane Grippi"
date: "`r Sys.Date()`"
output: openintro::lab_report
---

```{r}
library(tidyverse)
library(openintro)
```

# The Data

```{r}
download.file("http://www.openintro.org/stat/data/evals.RData", destfile = "evals.RData")
load("evals.RData")
```


# Exploring The Data

## Exercise 1

Answer: This is an observational study, so we cannot establish causation. Rephrased, the question would be: Is there an association between an instructor’s physical appearance and course evaluation scores?


## Exercise 2

```{r}
hist(evals$score, prob = TRUE, breaks = 11,  main = "Professor Score on Course Evaluation", xlab = "score")
x <- seq(from = 0, to = 5, by = 0.5)
curve(dnorm(x, mean = mean(evals$score), sd = sd(evals$score) ), add = TRUE, col = "red", lwd = 2)
```

Answer: The evaluation scores are left-skewed: most scores are concentrated at the high end, which suggests that students tend to rate their courses positively. Yes, this is what I expected to see, since the sample is not random.


## Exercise 3

```{r}
hist(evals$age, prob = TRUE, breaks = 16, main = "Age Distribution on Course Evaluation", xlab = "Age")
x <- seq(from = 20, to =95, by = 10)
curve(dnorm(x, mean = mean(evals$age), sd = sd(evals$age) ), add = TRUE, col = "red", lwd = 2)
```

```{r}
boxplot(evals$age~ evals$gender, main = "Age by Gender", xlab = "Gender", ylab = "Age")
```

Answer: Comparing age by gender, the boxplot shows that the male professors tend to be older than the female professors. The age distribution is roughly normal and unimodal, with a mean around 50.


# Simple linear regression

```{r}
plot(evals$score ~ evals$bty_avg)
```

```{r}
nrow(evals)
```

```{r}
min(evals$score)
```

```{r}
summary(evals$score)
```

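Answer: The data frame (evals) has 463 observations, and score ranges from 2.3 to 5. At first glance the scatterplot does not look awry, but it shows fewer than 463 distinct points because observations with identical bty_avg and score values are drawn on top of each other.

## Exercise 4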

```{r}
plot(jitter(evals$score) ~ jitter(evals$bty_avg))
```

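Answer: In the original scatterplot, observations with the same bty_avg and score were drawn at exactly the same position. Because these points overlapped, many of them were hidden, so the plot misrepresented how the observations are distributed. Jittering adds a small amount of random noise so that the overlapping points become visible.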
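
As a rough illustration of how overplotting hides points, the sketch below counts how many observations share a position with another one. The `toy` data frame is hypothetical and made up for this example; it is not the evals data.

```{r}
# Hypothetical toy data (not the evals data): ratings recorded on a coarse
# grid, so several observations share exactly the same coordinates.
toy <- data.frame(
  bty_avg = c(3, 3, 3, 5, 5, 7),
  score   = c(4.2, 4.2, 4.2, 4.5, 4.5, 3.9)
)

n_obs      <- nrow(toy)            # observations in the data
n_distinct <- nrow(unique(toy))    # distinct (bty_avg, score) positions
n_hidden   <- n_obs - n_distinct   # points drawn on top of another point

c(observations = n_obs, distinct = n_distinct, hidden = n_hidden)
```

Applying the same two counts to `evals[, c("bty_avg", "score")]` should show how many points the unjittered scatterplot hides.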


## Exercise 5

```{r}
m_bty <- lm(score~bty_avg, data = evals)
plot(jitter(evals$score) ~ jitter(evals$bty_avg))
abline(m_bty)
```

```{r}
summary(m_bty)
```

Answer: 
Linear model equation: score = 3.88034 + 0.06664 × bty_avg. 
The slope means that each one-point increase in bty_avg is associated with an average increase of 0.06664 points in the predicted score.
The intercept, 3.88, is the predicted score for a professor with a (hypothetical) beauty rating of 0.
The p-value for the slope (5.08e-05) shows that bty_avg is a statistically significant predictor, but R-squared (0.03502; adjusted 0.03293) shows that it explains only about 3.5% of the variation in score, so it is not a practically significant predictor.
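
To make the reading of the slope and of R-squared concrete, here is a small sketch that uses only the rounded estimates printed by `summary(m_bty)`. The helper `predicted_score()` is a hypothetical name introduced for this illustration.

```{r}
# Estimates copied from the summary(m_bty) output (rounded as printed).
b0 <- 3.88034   # intercept
b1 <- 0.06664   # slope for bty_avg

# Hypothetical helper: predicted score for a given beauty rating.
predicted_score <- function(bty_avg) b0 + b1 * bty_avg

# Moving one point up the beauty scale changes the prediction by the slope.
one_point_change <- predicted_score(5) - predicted_score(4)

# In simple linear regression, R-squared is the squared correlation,
# so its square root gives the strength of the linear association.
r_squared <- 0.03502
r <- sqrt(r_squared)

c(one_point_change = one_point_change, correlation = r)
```

A correlation below 0.2 is weak, which fits the conclusion that the predictor is statistically significant but explains little of the variation.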


## Exercise 6

```{r}
par(mfrow = c(2,2))
plot(m_bty)
```

Answer: The linearity condition does not appear to be met, since the points in the Residuals vs Fitted plot are not scattered randomly around zero. The Normal Q-Q plot shows that the residuals are not nearly normal. As a result, the conditions for least squares regression do not appear reasonable for this model.


# Multiple linear regression

```{r}
plot(evals$bty_avg ~ evals$bty_f1lower)
cor(evals$bty_avg, evals$bty_f1lower)
```

```{r}
plot(evals[,13:19])
```

```{r}
m_bty_gen <- lm(score ~ bty_avg + gender, data = evals)
summary(m_bty_gen)
```


## Exercise 7

```{r}
m_bty_gen <- lm(score ~ bty_avg + gender, data = evals)
par(mfrow = c(2,2))
plot(m_bty_gen)
```

Answer: The regression conditions for this model are not reasonable. The pattern in the Residuals vs Fitted plot does not meet the constant variance condition, and the Normal Q-Q plot shows that the residuals are not normally distributed (left-skewed).


## Exercise 8

Answer: Yes, bty_avg is still a significant predictor of score (p = 6.48e-06). Adding gender to the model changed the parameter estimate for bty_avg from 0.06664 to 0.07416.


```{r}
multiLines(m_bty_gen)
```


## Exercise 9

```{r}
summary(m_bty_gen)
```

```{r}
plot(m_bty_gen)
```


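Answer: 
score = 3.74734 + 0.07416 × bty_avg + 0.17239 × gendermale, where gendermale is 1 for male professors and 0 for female professors. For males the line is therefore score = 3.91973 + 0.07416 × bty_avg.
For two professors who received the same beauty rating, the male professor tends to have the higher course evaluation score, by 0.17239 points on average.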
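
The arithmetic behind the two parallel lines can be sketched as follows, using the rounded coefficients from `summary(m_bty_gen)`. The names `coefs`, `predict_score()` and `gap` are hypothetical and exist only for this illustration.

```{r}
# Coefficients copied from the summary(m_bty_gen) output.
coefs <- c(intercept = 3.74734, bty_avg = 0.07416, gendermale = 0.17239)

# Hypothetical helper: male is 1 for a male professor, 0 for a female one.
predict_score <- function(bty_avg, male) {
  unname(coefs["intercept"] + coefs["bty_avg"] * bty_avg +
           coefs["gendermale"] * male)
}

# The indicator shifts the intercept and leaves the slope unchanged.
male_intercept <- unname(coefs["intercept"] + coefs["gendermale"])

# Gap between a male and a female professor with the same beauty rating.
gap <- predict_score(5, male = 1) - predict_score(5, male = 0)

c(male_intercept = male_intercept, gap = gap)
```

The gap does not depend on the beauty rating chosen, because the model has no interaction between bty_avg and gender.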


## Exercise 10

```{r}
m_bty_rank <- lm(score ~ bty_avg + rank, data = evals)
summary(m_bty_rank)
```

```{r}
plot(m_bty_rank)
```

Answer:
R handles a categorical variable with more than two levels by creating one indicator variable for each level except the reference level. Here rank has three levels: teaching is the reference level, and the indicators ranktenure track and ranktenured equal 1 for professors of that rank and 0 otherwise. We get the following equation.
score = 3.98155 + 0.06783 × bty_avg − 0.16070 × ranktenure track − 0.12623 × ranktenured
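
The indicator coding can be inspected directly with `model.matrix()`. The sketch below builds a three-row factor by hand, assuming the level order teaching, tenure track, tenured that the coefficient names in the output imply; it does not use the evals data.

```{r}
# One professor per rank; the first level acts as the reference level.
rank <- factor(
  c("teaching", "tenure track", "tenured"),
  levels = c("teaching", "tenure track", "tenured")
)

# Design matrix that lm() would build for a model containing rank.
design <- model.matrix(~ rank)
design
```

The teaching row has 0 in both indicator columns, so its predicted score uses only the intercept and the bty_avg term.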


# The search for the best model

## Exercise 11

Answer: I would expect the “language” variable to have the highest p-value in this model. I would assume that the language of the university where the professor was educated is not related to the professor’s evaluation score.


```{r}
m_full <- lm(score ~ rank + ethnicity + gender + language + age + cls_perc_eval 
             + cls_students + cls_level + cls_profs + cls_credits + bty_avg 
             + pic_outfit + pic_color, data = evals)
summary(m_full)
```

```{r}
plot(m_full)
```


## Exercise 12

Answer: In this model, the cls_profs variable has the highest p-value (0.77806). This suggests that, once the other variables are accounted for, the number of professors teaching the course has the weakest association with the score. I had expected the language variable to have the largest p-value, but its p-value (0.03965) is much smaller than that of cls_profs.
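
Rather than reading the coefficient table by eye, the largest p-value can be pulled out of the summary object. This is a minimal sketch on the built-in `mtcars` data, chosen only so the chunk runs on its own; the same lines should work with `m_full` in place of `fit`.

```{r}
# Illustrative model on built-in data (not the evals model).
fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)

# The coefficient table is a matrix; drop the intercept row and keep p-values.
p_values <- coef(summary(fit))[-1, "Pr(>|t|)"]

# Name of the coefficient with the weakest evidence against a zero slope.
weakest <- names(which.max(p_values))

sort(p_values, decreasing = TRUE)
weakest
```

For a factor with more than two levels, such as rank, this table holds one row per indicator, so a term-level comparison such as `drop1(fit, test = "F")` may be a better fit.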


## Exercise 13

Answer: The ethnicitynot minority coefficient is 0.1234929. This means that, with all other variables held constant, the predicted score is 0.1234929 points higher when the professor is not from an ethnic minority background. The coefficient is a difference in predicted score, not a share of the variability in score, and with a p-value of 0.11698 it is not statistically significant at the 5% level.


## Exercise 14

```{r}
no_cls_profs <- lm(score ~ rank + ethnicity + gender + language + age + cls_perc_eval 
             + cls_students + cls_level + cls_credits + bty_avg + pic_outfit
             + pic_color, data = evals)
summary(no_cls_profs)
```

```{r}
plot(no_cls_profs)
```

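Answer: The coefficients and p-values of the other explanatory variables changed slightly when we dropped cls_profs (for example, the ethnicity coefficient moved from 0.1235 to 0.1274 and its p-value from 0.117 to 0.0999). This suggests some dependence between cls_profs and the other explanatory variables.


## Exercise 15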

```{r}
m_full_2 <- lm(score ~ rank + ethnicity + gender + language + age + cls_perc_eval 
             + cls_students + cls_level + cls_profs + cls_credits + bty_avg 
             + pic_outfit + pic_color, data = evals)
best_model <- step(m_full_2, direction = 'backward')
```
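
The AIC values that `step()` prints can be reproduced by hand, which helps when reading its trace. The sketch below assumes the formula n * log(RSS / n) + 2 * k used by `extractAIC()` for linear models; n = 463 and RSS = 111.08 are taken from the knitted output of the full model, and k = 15 counts the intercept plus 14 slopes.

```{r}
# Values read from the knitted output of the full model.
n   <- 463      # rows in evals
rss <- 111.08   # residual sum of squares, from the "<none>" row
k   <- 15       # estimated coefficients: intercept + 14 slopes

aic_full <- n * log(rss / n) + 2 * k
aic_full        # close to the "Start: AIC=-630.9" line

# Check the same formula against extractAIC() on built-in data.
fit     <- lm(mpg ~ wt + hp, data = mtcars)
by_hand <- nrow(mtcars) * log(sum(resid(fit)^2) / nrow(mtcars)) +
  2 * length(coef(fit))
all.equal(by_hand, extractAIC(fit)[2])
```

At each step, backward selection drops the term whose removal gives the lowest AIC and stops when no removal lowers it further.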

```{r}
summary(best_model)
```

```{r}
plot(best_model)
```

Answer: 
Linear model for predicting score:
score = 3.907030 + 0.163818 * ethnicitynot minority + 0.202597 * gendermale - 0.246683 * languagenon-english - 0.006925 * age + 0.004942 * cls_perc_eval + 0.517205 * cls_creditsone credit + 0.046732 * bty_avg - 0.113939 * pic_outfitnot formal - 0.180870 * pic_colorcolor
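
As a worked example of using this equation, the sketch below computes the predicted score for one hypothetical professor. The profile in `prof` is invented for illustration and does not correspond to a row of evals; the coefficients are copied from `summary(best_model)`.

```{r}
# Coefficients copied from the summary(best_model) output.
coefs <- c(
  "(Intercept)"           =  3.907030,
  "ethnicitynot minority" =  0.163818,
  "gendermale"            =  0.202597,
  "languagenon-english"   = -0.246683,
  "age"                   = -0.006925,
  "cls_perc_eval"         =  0.004942,
  "cls_creditsone credit" =  0.517205,
  "bty_avg"               =  0.046732,
  "pic_outfitnot formal"  = -0.113939,
  "pic_colorcolor"        = -0.180870
)

# Hypothetical professor: male, not minority, English-language education,
# age 40, 80% of the class evaluating, multi-credit course, bty_avg of 5,
# formal outfit, black-and-white picture.
prof <- c(
  "(Intercept)" = 1, "ethnicitynot minority" = 1, "gendermale" = 1,
  "languagenon-english" = 0, "age" = 40, "cls_perc_eval" = 80,
  "cls_creditsone credit" = 0, "bty_avg" = 5,
  "pic_outfitnot formal" = 0, "pic_colorcolor" = 0
)

# Predicted score is the sum of coefficient times value, matched by name.
predicted <- sum(coefs * prof[names(coefs)])
predicted
```

With the fitted model in memory, `predict(best_model, newdata = ...)` would give the same kind of result without copying coefficients by hand.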


## Exercise 16

```{r}
par(mfrow = c(2,2))
plot(best_model)
```

Answer: Based on the Normal Q-Q plot, the residuals of the model are nearly normal (slightly left-skewed). We don't see any outliers, but the tails deviate a little from the line.


## Exercise 17

Answer: Yes and no. If a student takes two or more courses from the same professor, the independence of the observations may be compromised. However, a student who takes just one course with a professor may have a different opinion than a student who takes two courses with that professor, and it is difficult for a professor to teach two very different subjects equally well.


## Exercise 18

Answer: Based on the final model, the highest scores are associated with professors who are male, not from a minority, younger, educated at an English-speaking university, and rated as more attractive, who teach one-credit courses with a high percentage of students completing the evaluation, and whose picture is black and white and shows formal clothing.


## Exercise 19

Answer: No, we wouldn’t be comfortable generalizing our conclusions to professors at any university. The study is observational and the sample is not random, so it does not represent the general population of professors. Also, how students rate a teacher depends on social and psychological factors rather than on a scientific law. Additionally, the definition of beauty differs from person to person, which makes the bty_avg variable subjective.