LOAD
install.packages("ggplot2")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
library(ggplot2)
download.file("http://www.openintro.org/stat/data/evals.RData", destfile = "evals.RData")
load("evals.RData")
Is this an observational study or an experiment? The original research question posed in the paper is whether beauty leads directly to the differences in course evaluations. Given the study design, is it possible to answer this question as it is phrased? If not, rephrase the question.
Describe the distribution of score. Is the distribution skewed? What does that tell you about how students rate courses? Is this what you expected to see? Why, or why not?
ggplot(evals,aes(score))+
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Excluding score, select two other variables and describe their relationship using an appropriate visualization (scatterplot, side-by-side boxplots, or mosaic plot)
age
bty_avg
as age increases average beauty scores decreases
ggplot(evals,aes(age,bty_avg))+
geom_point()+
geom_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula = 'y ~ x'
The fundamental phenomenon suggested by the study is that better looking teachers are evaluated more favorably. Let’s create a scatterplot to see if this appears to be the case:
plot(evals$score ~ evals$bty_avg)
ggplot(evals, aes(bty_avg,score))+
geom_point()
Replot the scatterplot, but this time use the function jitter() on the y - or the x -coordinate. (Use ?jitter to learn more.) What was misleading about the initial scatterplot?
plot(evals$score ~ jitter(evals$bty_avg))
ggplot(evals,aes(bty_avg,score))+
geom_point(position="jitter")+
geom_smooth (method="lm")
## `geom_smooth()` using formula = 'y ~ x'
Let’s see if the apparent trend in the plot is something more than natural variation. Fit a linear model called m_bty to predict average professor score by average beauty rating and add the line to your plot using abline(m_bty). Write out the equation for the linear model and interpret the slope. Is average beauty score a statistically significant predictor? Does it appear to be a practically significant predictor?
m_bty<-lm(score~bty_avg,evals)
m_bty$coefficients[1]
## (Intercept)
## 3.880338
m_bty$coefficients[2]
## bty_avg
## 0.06663704
summary(m_bty)
##
## Call:
## lm(formula = score ~ bty_avg, data = evals)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.9246 -0.3690 0.1420 0.3977 0.9309
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.88034 0.07614 50.96 < 2e-16 ***
## bty_avg 0.06664 0.01629 4.09 5.08e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5348 on 461 degrees of freedom
## Multiple R-squared: 0.03502, Adjusted R-squared: 0.03293
## F-statistic: 16.73 on 1 and 461 DF, p-value: 5.083e-05
ggplot(evals,aes(bty_avg,score))+
geom_point()+
geom_abline(slope=m_bty$coefficients[2],
intercept = m_bty$coefficients[1],
color="blue",
size=1)
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
Use residual plots to evaluate whether the conditions of least squares regression are reasonable. Provide plots and comments for each one (see the Simple Regression Lab for a reminder of how to make these).
Linearity: Residuals vs Fitted plot –> shows if residuals have non-linear patterns. My plot looks equally spread around the horizontal line. There does not seem to be a distinct pattern. This model passes the linearity test.
Normality: Normal Q-Q –> shows if residuals are normally distributed. The residuals follow a straight line. Some deviation at both extreme ends. This model passes the normality test.
Variability: Scale-Location –> shows if residuals are spread equally along the range of predictors (homoscedasticity). This plot shows a horizontal line, the points look equally spread, there is no obvious pattern such as a curve or a cone. This model passes the variability test.
plot(m_bty)
Let’s take a look at the relationship between one of these scores and the average beauty score.
plot(evals$bty_avg ~ evals$bty_f1lower)
cor(evals$bty_avg, evals$bty_f1lower)
## [1] 0.8439112
We can actually take a look at the relationships between all beauty variables (columns 13 through 19) using the following command:
plot(evals[,13:19])
In order to see if beauty is still a significant predictor of professor
score after we’ve accounted for the gender of the professor, we can add
the gender term into the model.
m_bty_gen <- lm(score ~ bty_avg + gender, data = evals)
summary(m_bty_gen)
##
## Call:
## lm(formula = score ~ bty_avg + gender, data = evals)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.8305 -0.3625 0.1055 0.4213 0.9314
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.74734 0.08466 44.266 < 2e-16 ***
## bty_avg 0.07416 0.01625 4.563 6.48e-06 ***
## gendermale 0.17239 0.05022 3.433 0.000652 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5287 on 460 degrees of freedom
## Multiple R-squared: 0.05912, Adjusted R-squared: 0.05503
## F-statistic: 14.45 on 2 and 460 DF, p-value: 8.177e-07
P-values and parameter estimates should only be trusted if the conditions for the regression are reasonable. Verify that the conditions for this model are reasonable using diagnostic plots.
Linearity: Residuals vs Fitted plot –> shows if residuals have non-linear patterns. My plot looks equally spread around the horizontal line. There does not seem to be a distinct pattern. This model passes the linearity test.
Normality: Normal Q-Q –> shows if residuals are normally distributed. The residuals follow a straight line. Some deviation at both extreme ends. This model passes the normality test.
Variability: Scale-Location –> shows if residuals are spread equally along the range of predictors (homoscedasticity). This plot shows a horizontal line, the points look equally spread, there is no obvious pattern such as a curve or a cone. This model passes the variability test.
plot(m_bty_gen)
Is bty_avg still a significant predictor of score? Has the addition of gender to the model changed the parameter estimate for bty_avg? * the multiple R-squared is 0.05912. The addition of gender has increased the multiple R-squared value. the model now explains .05% of variability.The p-value is still quite low so it is still significant. But because the multiple r squared is also still quite low, it might not be practical.
summary(m_bty_gen)
##
## Call:
## lm(formula = score ~ bty_avg + gender, data = evals)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.8305 -0.3625 0.1055 0.4213 0.9314
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.74734 0.08466 44.266 < 2e-16 ***
## bty_avg 0.07416 0.01625 4.563 6.48e-06 ***
## gendermale 0.17239 0.05022 3.433 0.000652 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5287 on 460 degrees of freedom
## Multiple R-squared: 0.05912, Adjusted R-squared: 0.05503
## F-statistic: 14.45 on 2 and 460 DF, p-value: 8.177e-07
We can plot this line and the line corresponding to males with the following custom function.
multiLines(m_bty_gen)
What is the equation of the line corresponding to males? (Hint: For males, the parameter estimate is multiplied by 1.) For two professors who received the same beauty rating, which gender tends to have the higher course evaluation score? * males tend to have the higher course evaluation score * 3.74734 + (0.17239 x gender_male) + (0.007416 x beauty_avg) –> 3.74734 + (1) + (0.007416 x beauty_avg)
summary(m_bty_gen)
##
## Call:
## lm(formula = score ~ bty_avg + gender, data = evals)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.8305 -0.3625 0.1055 0.4213 0.9314
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.74734 0.08466 44.266 < 2e-16 ***
## bty_avg 0.07416 0.01625 4.563 6.48e-06 ***
## gendermale 0.17239 0.05022 3.433 0.000652 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5287 on 460 degrees of freedom
## Multiple R-squared: 0.05912, Adjusted R-squared: 0.05503
## F-statistic: 14.45 on 2 and 460 DF, p-value: 8.177e-07
Create a new model called m_bty_rank with gender removed and rank added in. How does R appear to handle categorical variables that have more than two levels? Note that the rank variable has three levels: teaching, tenure track, tenured. * R only shows 2 out of the 3 levels: tenure track and tenured. Our teaching variable = 0, it is our comparision group.
m_bty_rank <- lm(score ~ bty_avg + rank, data = evals)
summary(m_bty_rank)
##
## Call:
## lm(formula = score ~ bty_avg + rank, data = evals)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.8713 -0.3642 0.1489 0.4103 0.9525
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.98155 0.09078 43.860 < 2e-16 ***
## bty_avg 0.06783 0.01655 4.098 4.92e-05 ***
## ranktenure track -0.16070 0.07395 -2.173 0.0303 *
## ranktenured -0.12623 0.06266 -2.014 0.0445 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5328 on 459 degrees of freedom
## Multiple R-squared: 0.04652, Adjusted R-squared: 0.04029
## F-statistic: 7.465 on 3 and 459 DF, p-value: 6.88e-05