load(url("http://www.openintro.org/stat/data/evals.RData"))
score
. Is the distribution skewed? What does that tell you about how students rate courses? Is this what you expected to see? Why, or why not?hist(evals$score)
summary(evals$score)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.300 3.800 4.300 4.175 4.600 5.000
score
, select two other variables and describe their relationship using an appropriate visualization (scatterplot, side-by-side boxplots, or mosaic plot).mosaic(~ evals$ethnicity + evals$gender)
plot(evals$score ~ evals$bty_avg)
jitter()
on the \(y\)- or the \(x\)-coordinate. (Use ?jitter
to learn more.) What was misleading about the initial scatterplot?qplot(x = evals$bty_avg, y = evals$score, geom = "point") + geom_jitter()
jitter
, the scatterplot displays points in a cloud-looking cluster of pointsm_bty
to predict average professor score by average beauty rating and add the line to your plot using abline(m_bty)
. Write out the equation for the linear model and interpret the slope. Is average beauty score a statistically significant predictor? Does it appear to be a practically significant predictor?m_bty<-lm(evals$score ~ evals$bty_avg)
plot(jitter(evals$score) ~ jitter(evals$bty_avg))
abline(m_bty)
summary(m_bty)
##
## Call:
## lm(formula = evals$score ~ evals$bty_avg)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.9246 -0.3690 0.1420 0.3977 0.9309
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.88034 0.07614 50.96 < 2e-16 ***
## evals$bty_avg 0.06664 0.01629 4.09 5.08e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5348 on 461 degrees of freedom
## Multiple R-squared: 0.03502, Adjusted R-squared: 0.03293
## F-statistic: 16.73 on 1 and 461 DF, p-value: 5.083e-05
plot(m_bty$residuals ~ evals$bty_avg)
abline(h = 0)
hist(m_bty$residuals)
qqnorm(m_bty$residuals)
qqline(m_bty$residuals)
plot(evals$bty_avg ~ evals$bty_f1lower)
cor(evals$bty_avg, evals$bty_f1lower)
## [1] 0.8439112
plot(evals[,13:19])
m_bty_gen <- lm(score ~ bty_avg + gender, data = evals)
summary(m_bty_gen)
##
## Call:
## lm(formula = score ~ bty_avg + gender, data = evals)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.8305 -0.3625 0.1055 0.4213 0.9314
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.74734 0.08466 44.266 < 2e-16 ***
## bty_avg 0.07416 0.01625 4.563 6.48e-06 ***
## gendermale 0.17239 0.05022 3.433 0.000652 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5287 on 460 degrees of freedom
## Multiple R-squared: 0.05912, Adjusted R-squared: 0.05503
## F-statistic: 14.45 on 2 and 460 DF, p-value: 8.177e-07
plot(m_bty_gen$residuals ~ evals$bty_avg)
abline(h = 0)
hist(m_bty_gen$residuals)
qqnorm(m_bty_gen$residuals)
qqline(m_bty_gen$residuals)
bty_avg
still a significant predictor of score
? Has the addition of gender
to the model changed the parameter estimate for bty_avg
?summary(m_bty)
##
## Call:
## lm(formula = evals$score ~ evals$bty_avg)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.9246 -0.3690 0.1420 0.3977 0.9309
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.88034 0.07614 50.96 < 2e-16 ***
## evals$bty_avg 0.06664 0.01629 4.09 5.08e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5348 on 461 degrees of freedom
## Multiple R-squared: 0.03502, Adjusted R-squared: 0.03293
## F-statistic: 16.73 on 1 and 461 DF, p-value: 5.083e-05
summary(m_bty_gen)
##
## Call:
## lm(formula = score ~ bty_avg + gender, data = evals)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.8305 -0.3625 0.1055 0.4213 0.9314
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.74734 0.08466 44.266 < 2e-16 ***
## bty_avg 0.07416 0.01625 4.563 6.48e-06 ***
## gendermale 0.17239 0.05022 3.433 0.000652 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5287 on 460 degrees of freedom
## Multiple R-squared: 0.05912, Adjusted R-squared: 0.05503
## F-statistic: 14.45 on 2 and 460 DF, p-value: 8.177e-07
As a result, for females, the parameter estimate is multiplied by zero, leaving the intercept and slope form familiar from simple regression.
\[ \begin{aligned} \widehat{score} &= \hat{\beta}_0 + \hat{\beta}_1 \times bty\_avg + \hat{\beta}_2 \times (0) \\ &= \hat{\beta}_0 + \hat{\beta}_1 \times bty\_avg\end{aligned} \]
We can plot this line and the line corresponding to males with the following custom function.
multiLines(m_bty_gen)
m_bty_rank
with gender
removed and rank
added in. How does R appear to handle categorical variables that have more than two levels? Note that the rank variable has three levels: teaching
, tenure track
, tenured
.m_bty_rank <- lm(score ~ bty_avg + rank, data=evals)
summary(m_bty_rank)
##
## Call:
## lm(formula = score ~ bty_avg + rank, data = evals)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.8713 -0.3642 0.1489 0.4103 0.9525
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.98155 0.09078 43.860 < 2e-16 ***
## bty_avg 0.06783 0.01655 4.098 4.92e-05 ***
## ranktenure track -0.16070 0.07395 -2.173 0.0303 *
## ranktenured -0.12623 0.06266 -2.014 0.0445 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5328 on 459 degrees of freedom
## Multiple R-squared: 0.04652, Adjusted R-squared: 0.04029
## F-statistic: 7.465 on 3 and 459 DF, p-value: 6.88e-05
variables<-c("rank" , "ethnicity" , "gender" , "language" , "age" , "cls_perc_eval" , "cls_students" , "cls_level" , "cls_profs" , "cls_credits" , "bty_avg" , "pic_outfit" , "pic_color")
cor_results<-c(cor(as.numeric(evals$score),as.numeric(evals$rank)),cor(as.numeric(evals$score),as.numeric(evals$ethnicity)),cor(as.numeric(evals$score),as.numeric(evals$gender)),cor(as.numeric(evals$score),as.numeric(evals$language)),cor(as.numeric(evals$score),as.numeric(evals$age)),cor(as.numeric(evals$score),as.numeric(evals$cls_perc_eval)),cor(as.numeric(evals$score),as.numeric(evals$cls_students)),cor(as.numeric(evals$score),as.numeric(evals$cls_level)),cor(as.numeric(evals$score),as.numeric(evals$cls_profs)),cor(as.numeric(evals$score),as.numeric(evals$cls_credits)),cor(as.numeric(evals$score),as.numeric(evals$bty_avg)),cor(as.numeric(evals$score),as.numeric(evals$pic_outfit)),cor(as.numeric(evals$score),as.numeric(evals$pic_color)))
for(i in cor_results){
if(i>=-0.026 & i<=0.026){
print(variables[which(cor_results == i)])
}
}
## [1] "cls_students"
## [1] "cls_profs"
m_full <- lm(score ~ rank + ethnicity + gender + language + age + cls_perc_eval
+ cls_students + cls_level + cls_profs + cls_credits + bty_avg
+ pic_outfit + pic_color, data = evals)
summary(m_full)
ethnicitynot minority
variable has a coefficent of 0.1234929 that is a multiplier to itselfwithout_high_p_value <- lm(score ~ rank + ethnicity + gender + language + age + cls_perc_eval + cls_students + cls_level + cls_credits + bty_avg + pic_outfit + pic_color, data = evals)
summary(without_high_p_value)
##
## Call:
## lm(formula = score ~ rank + ethnicity + gender + language + age +
## cls_perc_eval + cls_students + cls_level + cls_credits +
## bty_avg + pic_outfit + pic_color, data = evals)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.7836 -0.3257 0.0859 0.3513 0.9551
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.0872523 0.2888562 14.150 < 2e-16 ***
## ranktenure track -0.1476746 0.0819824 -1.801 0.072327 .
## ranktenured -0.0973829 0.0662614 -1.470 0.142349
## ethnicitynot minority 0.1274458 0.0772887 1.649 0.099856 .
## gendermale 0.2101231 0.0516873 4.065 5.66e-05 ***
## languagenon-english -0.2282894 0.1111305 -2.054 0.040530 *
## age -0.0089992 0.0031326 -2.873 0.004262 **
## cls_perc_eval 0.0052888 0.0015317 3.453 0.000607 ***
## cls_students 0.0004687 0.0003737 1.254 0.210384
## cls_levelupper 0.0606374 0.0575010 1.055 0.292200
## cls_creditsone credit 0.5061196 0.1149163 4.404 1.33e-05 ***
## bty_avg 0.0398629 0.0174780 2.281 0.023032 *
## pic_outfitnot formal -0.1083227 0.0721711 -1.501 0.134080
## pic_colorcolor -0.2190527 0.0711469 -3.079 0.002205 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4974 on 449 degrees of freedom
## Multiple R-squared: 0.187, Adjusted R-squared: 0.1634
## F-statistic: 7.943 on 13 and 449 DF, p-value: 2.336e-14
backward_selection <- lm(score ~ ethnicity + gender + language + age + cls_perc_eval +
cls_credits + bty_avg + pic_color, data = evals)
summary(backward_selection)
##
## Call:
## lm(formula = score ~ ethnicity + gender + language + age + cls_perc_eval +
## cls_credits + bty_avg + pic_color, data = evals)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.85320 -0.32394 0.09984 0.37930 0.93610
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.771922 0.232053 16.255 < 2e-16 ***
## ethnicitynot minority 0.167872 0.075275 2.230 0.02623 *
## gendermale 0.207112 0.050135 4.131 4.30e-05 ***
## languagenon-english -0.206178 0.103639 -1.989 0.04726 *
## age -0.006046 0.002612 -2.315 0.02108 *
## cls_perc_eval 0.004656 0.001435 3.244 0.00127 **
## cls_creditsone credit 0.505306 0.104119 4.853 1.67e-06 ***
## bty_avg 0.051069 0.016934 3.016 0.00271 **
## pic_colorcolor -0.190579 0.067351 -2.830 0.00487 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4992 on 454 degrees of freedom
## Multiple R-squared: 0.1722, Adjusted R-squared: 0.1576
## F-statistic: 11.8 on 8 and 454 DF, p-value: 2.58e-15
hist(backward_selection$residuals)
qqnorm(backward_selection$residuals)
qqline(backward_selection$residuals)
plot(backward_selection$residuals)
abline(h = 0)
Characteristics would be:
Better not generalize it because: