Elina Azrilyan

November 13th, 2019

Using R, build a multiple regression model for data that interests you. Include in this model at least one quadratic term, one dichotomous term, and one dichotomous vs. quantitative interaction term. Interpret all coefficients. Conduct residual analysis. Was the linear model appropriate? Why or why not?

Inspecting the data

The data were gathered from end-of-semester student evaluations for a large sample of professors at the University of Texas at Austin.

load("evals.RData")
head(evals)
##   score         rank    ethnicity gender language age cls_perc_eval
## 1   4.7 tenure track     minority female  english  36      55.81395
## 2   4.1 tenure track     minority female  english  36      68.80000
## 3   3.9 tenure track     minority female  english  36      60.80000
## 4   4.8 tenure track     minority female  english  36      62.60163
## 5   4.6      tenured not minority   male  english  59      85.00000
## 6   4.3      tenured not minority   male  english  59      87.50000
##   cls_did_eval cls_students cls_level cls_profs  cls_credits bty_f1lower
## 1           24           43     upper    single multi credit           5
## 2           86          125     upper    single multi credit           5
## 3           76          125     upper    single multi credit           5
## 4           77          123     upper    single multi credit           5
## 5           17           20     upper  multiple multi credit           4
## 6           35           40     upper  multiple multi credit           4
##   bty_f1upper bty_f2upper bty_m1lower bty_m1upper bty_m2upper bty_avg
## 1           7           6           2           4           6       5
## 2           7           6           2           4           6       5
## 3           7           6           2           4           6       5
## 4           7           6           2           4           6       5
## 5           4           2           2           3           3       3
## 6           4           2           2           3           3       3
##   pic_outfit pic_color
## 1 not formal     color
## 2 not formal     color
## 3 not formal     color
## 4 not formal     color
## 5 not formal     color
## 6 not formal     color
length(evals$score)
## [1] 463

There are 21 columns in our dataset (the variables listed in the head() output above) and 463 rows of data.
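The row and column counts can also be checked directly instead of being read off the printed output. The following is a minimal sketch: shape_of() is a hypothetical helper introduced only for illustration, and it is demonstrated on R's built-in mtcars data frame so that it runs without evals.RData.

```r
# Hypothetical helper (not part of the original analysis):
# returns the number of rows and columns of a data frame.
shape_of <- function(df) {
  c(rows = nrow(df), cols = ncol(df))
}

# Demonstration on a built-in data set so the sketch runs anywhere.
shape_of(mtcars)

# With the evaluations data loaded via load("evals.RData"), the same call
# would be: shape_of(evals)   # equivalent to dim(evals)
```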

mr_bty_gen_age_clsize <- lm(score ~ bty_avg + gender + age + cls_students, data = evals)
summary(mr_bty_gen_age_clsize)
## 
## Call:
## lm(formula = score ~ bty_avg + gender + age + cls_students, data = evals)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.85333 -0.36138  0.08768  0.41174  0.93005 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   4.0580973  0.1676976  24.199  < 2e-16 ***
## bty_avg       0.0647559  0.0169816   3.813 0.000156 ***
## gendermale    0.2035758  0.0523620   3.888 0.000116 ***
## age          -0.0058029  0.0027197  -2.134 0.033406 *  
## cls_students -0.0001202  0.0003317  -0.362 0.717311    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5272 on 458 degrees of freedom
## Multiple R-squared:  0.06859,    Adjusted R-squared:  0.06046 
## F-statistic: 8.432 on 4 and 458 DF,  p-value: 1.432e-06
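
To make the coefficients concrete, a predicted score can be computed by hand from the estimates printed above. This is a sketch under stated assumptions: the professor profile (beauty rating 5, male, age 40, class of 50 students) is hypothetical and chosen only for illustration.

```r
# Estimates copied from the summary() output above.
coefs <- c(intercept    =  4.0580973,
           bty_avg      =  0.0647559,
           gendermale   =  0.2035758,
           age          = -0.0058029,
           cls_students = -0.0001202)

# Hypothetical professor: intercept term, bty_avg = 5, male (1),
# age = 40, cls_students = 50.
profile <- c(1, 5, 1, 40, 50)

# The prediction is the sum of each coefficient times its predictor value.
predicted <- sum(coefs * profile)
round(predicted, 3)
```

For this profile the fitted model gives a score of about 4.35. With a fitted model object available, predict(mr_bty_gen_age_clsize, newdata = ...) would do the same calculation.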

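Conclusion: Class size is the only variable in our model that is not statistically significant, with a p-value of 0.72. As expected, the professor's age has a negative coefficient (about -0.006 per year), which would indicate that, holding the other predictors fixed, students score younger professors slightly higher. Note that the model explains only about 7% of the variation in score (multiple R-squared of 0.069).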
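
The assignment also asks for a quadratic term, a dichotomous-by-quantitative interaction, and a residual analysis, which the model above does not yet include. One way this could be sketched is shown below. The helpers fit_extended() and plot_residuals() are hypothetical names, and the choice of age for the quadratic term and bty_avg for the interaction with gender is an assumption, not a result from the original analysis.

```r
# Hypothetical helper: expects a data frame with the columns
# score, bty_avg, age, and gender (as in evals).
fit_extended <- function(df) {
  lm(score ~ bty_avg + age + I(age^2) + gender + bty_avg:gender, data = df)
}

# Basic residual diagnostics for any fitted lm object:
# residuals vs. fitted values, and a normal Q-Q plot.
plot_residuals <- function(fit) {
  old_par <- par(mfrow = c(1, 2))
  on.exit(par(old_par))  # restore the plotting layout afterwards
  plot(fitted(fit), resid(fit),
       xlab = "Fitted values", ylab = "Residuals")
  abline(h = 0, lty = 2)
  qqnorm(resid(fit))
  qqline(resid(fit))
}

# Usage, assuming evals has been loaded as above:
# fit <- fit_extended(evals)
# summary(fit)
# plot_residuals(fit)
```

A residuals-vs-fitted plot with no visible pattern and a roughly straight Q-Q plot would support the linear model; curvature, funnel shapes, or heavy tails would argue against it.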