library(ggplot2)
download.file("http://www.openintro.org/stat/data/evals.RData", destfile = "evals.RData")
load("evals.RData")
hist(evals$score, col = "Lavender")
Based on the plot, there is minimal difference in average beauty rating between course levels (upper or lower). I didn’t anticipate much of a difference, but I was curious whether the difficulty of a course would directly change how students view the instructor, which is something I have seen happen. In this case, that doesn’t seem to apply.
plot(evals$bty_avg ~ evals$cls_level)
plot(evals$score ~ evals$bty_avg)
The scatterplot doesn’t show a clear relationship between average beauty rating and score because many points overlap: there seem to be more observations in the data than visible points on the plot.
plot(jitter(evals$score)~ jitter(evals$bty_avg))
Equation: yhat = 3.88034 + 0.06664 * bty_avg. bty_avg is a statistically significant predictor of evaluation score (p = 5.08e-05), but the model has a very low R-squared (0.035) and the predicted increase is only 0.06664 points per beauty point, so it wouldn’t be practically useful for predicting evaluation scores.
m_bty <- lm(evals$score ~ evals$bty_avg)
plot(jitter(evals$score) ~ jitter(evals$bty_avg))
summary(m_bty)
##
## Call:
## lm(formula = evals$score ~ evals$bty_avg)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.9246 -0.3690 0.1420 0.3977 0.9309
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.88034 0.07614 50.96 < 2e-16 ***
## evals$bty_avg 0.06664 0.01629 4.09 5.08e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5348 on 461 degrees of freedom
## Multiple R-squared: 0.03502, Adjusted R-squared: 0.03293
## F-statistic: 16.73 on 1 and 461 DF, p-value: 5.083e-05
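To get a feel for how small the slope is in practical terms, the fitted line can be evaluated by hand from the coefficients in the summary above. This is only a sketch: the helper name predict_score is hypothetical, and the 1 to 10 range for bty_avg is an assumption about the rating scale.

```r
# Coefficients copied from summary(m_bty) above
b0 <- 3.88034   # intercept
b1 <- 0.06664   # slope for bty_avg

# Hypothetical helper: predicted score for a given average beauty rating
predict_score <- function(bty_avg) b0 + b1 * bty_avg

# Assumed rating scale of 1 to 10; compare the two extremes
low  <- predict_score(1)
high <- predict_score(10)
score_gap <- high - low   # predicted change across the whole assumed scale

print(c(low = low, high = high, gap = score_gap))
```

Even across the whole assumed scale the predicted score moves by well under one point, which fits the low R-squared.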
Provide plots and comments for each one (see the Simple Regression Lab for a reminder of how to make these). From the histogram we can see that the residuals are not normally distributed, as shown by the left skew. We can also see that the bulk of the residuals is not centered on zero (the median is 0.142), which pushes the model further from meeting the conditions for regression.
plot(m_bty$residuals ~ evals$bty_avg)
hist(m_bty$residuals, col = "Lavender")
plot(evals$bty_avg ~ evals$bty_f1lower)
cor(evals$bty_avg, evals$bty_f1lower)
## [1] 0.8439112
plot(evals[,13:19])
m_bty_gen <- lm(score ~ bty_avg + gender, data = evals)
summary(m_bty_gen)
##
## Call:
## lm(formula = score ~ bty_avg + gender, data = evals)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.8305 -0.3625 0.1055 0.4213 0.9314
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.74734 0.08466 44.266 < 2e-16 ***
## bty_avg 0.07416 0.01625 4.563 6.48e-06 ***
## gendermale 0.17239 0.05022 3.433 0.000652 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5287 on 460 degrees of freedom
## Multiple R-squared: 0.05912, Adjusted R-squared: 0.05503
## F-statistic: 14.45 on 2 and 460 DF, p-value: 8.177e-07
(I tried)
qqnorm(m_bty_gen$residuals)
plot(m_bty_gen$residuals ~ m_bty_gen$fitted.values)
plot(m_bty_gen$residuals)
plot(evals$score ~ evals$gender, col = "Lavender")
#5.) Overall, the Q-Q plot shows that most of the residuals fall along the normal line, except at the upper end where they seem to curve slightly, while the residuals are spread around the zero line seemingly at random.
plot(m_bty_gen$residuals ~ evals$bty_avg)
I would say yes. Based on this summary and the previous exercise, adding gender made average beauty more significant: the p-value for bty_avg dropped from 5.08e-05 to 6.48e-06 and its coefficient rose from 0.06664 to 0.07416, compared to when beauty average was the only variable being considered.
summary(m_bty_gen)
##
## Call:
## lm(formula = score ~ bty_avg + gender, data = evals)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.8305 -0.3625 0.1055 0.4213 0.9314
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.74734 0.08466 44.266 < 2e-16 ***
## bty_avg 0.07416 0.01625 4.563 6.48e-06 ***
## gendermale 0.17239 0.05022 3.433 0.000652 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5287 on 460 degrees of freedom
## Multiple R-squared: 0.05912, Adjusted R-squared: 0.05503
## F-statistic: 14.45 on 2 and 460 DF, p-value: 8.177e-07
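score hat = 3.74734 + 0.07416 * bty_avg + 0.17239 * 1 = 3.91973 + 0.07416 * bty_avg for males. In the end, for two professors with the same beauty rating, the male professor is predicted to have the higher evaluation score.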
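The two gender-specific lines can be worked out directly from the coefficients in summary(m_bty_gen). The sketch below assumes female is the reference level (the output only lists gendermale), and the variable names are illustrative.

```r
# Coefficients copied from summary(m_bty_gen) above
b0     <- 3.74734   # intercept (reference level, assumed to be female)
b_bty  <- 0.07416   # slope for bty_avg, shared by both groups
b_male <- 0.17239   # shift in intercept when gendermale = 1

female_intercept <- b0            # gendermale = 0
male_intercept   <- b0 + b_male   # gendermale = 1

# Both lines have the same slope; only the intercept differs
cat("female: score_hat =", female_intercept, "+", b_bty, "* bty_avg\n")
cat("male:   score_hat =", male_intercept, "+", b_bty, "* bty_avg\n")
```

Because the model has no interaction term, the two lines are parallel and the gap between them is 0.17239 points at every beauty rating.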
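R seems to handle a categorical variable with more than two levels by splitting it into separate indicator variables. The number of indicators is one fewer than the number of levels, because one level serves as the reference: rank has three levels, so the model gets two indicators (ranktenure track and ranktenured).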
m_bty_rank <- lm(score ~ bty_avg + rank, data = evals)
summary(m_bty_rank)
##
## Call:
## lm(formula = score ~ bty_avg + rank, data = evals)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.8713 -0.3642 0.1489 0.4103 0.9525
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.98155 0.09078 43.860 < 2e-16 ***
## bty_avg 0.06783 0.01655 4.098 4.92e-05 ***
## ranktenure track -0.16070 0.07395 -2.173 0.0303 *
## ranktenured -0.12623 0.06266 -2.014 0.0445 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5328 on 459 degrees of freedom
## Multiple R-squared: 0.04652, Adjusted R-squared: 0.04029
## F-statistic: 7.465 on 3 and 459 DF, p-value: 6.88e-05
multiLines(m_bty_rank)
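The way a three-level factor turns into two indicator columns can be seen with model.matrix() on a toy data frame. This is a sketch on made-up rows, not the evals data; the level name "teaching" for the reference category is an assumption, since the summary only shows the other two levels.

```r
# Toy data frame standing in for the rank column (made-up rows)
toy <- data.frame(
  rank = factor(c("teaching", "tenure track", "tenured", "teaching"))
)

# Design matrix that lm() would build for a formula with rank in it
X <- model.matrix(~ rank, data = toy)

print(colnames(X))  # intercept plus one indicator per non-reference level
print(X)
```

Rows for the reference level have zeros in both indicator columns, so their prediction comes from the intercept alone.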
The variable I expected to have the least association with score is the number of professors teaching the class (cls_profs), so it should have the highest p-value.
Include the model output in your response. Based on the plot and the summary, we can see that the number of professors teaching the class has the weakest association with score, judging by the visualization and its p-value (0.778).
m_full <- lm(score ~ rank + ethnicity + gender + language + age + cls_perc_eval
+ cls_students + cls_level + cls_profs + cls_credits + bty_avg
+ pic_outfit + pic_color, data = evals)
summary(m_full)
##
## Call:
## lm(formula = score ~ rank + ethnicity + gender + language + age +
## cls_perc_eval + cls_students + cls_level + cls_profs + cls_credits +
## bty_avg + pic_outfit + pic_color, data = evals)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.77397 -0.32432 0.09067 0.35183 0.95036
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.0952141 0.2905277 14.096 < 2e-16 ***
## ranktenure track -0.1475932 0.0820671 -1.798 0.07278 .
## ranktenured -0.0973378 0.0663296 -1.467 0.14295
## ethnicitynot minority 0.1234929 0.0786273 1.571 0.11698
## gendermale 0.2109481 0.0518230 4.071 5.54e-05 ***
## languagenon-english -0.2298112 0.1113754 -2.063 0.03965 *
## age -0.0090072 0.0031359 -2.872 0.00427 **
## cls_perc_eval 0.0053272 0.0015393 3.461 0.00059 ***
## cls_students 0.0004546 0.0003774 1.205 0.22896
## cls_levelupper 0.0605140 0.0575617 1.051 0.29369
## cls_profssingle -0.0146619 0.0519885 -0.282 0.77806
## cls_creditsone credit 0.5020432 0.1159388 4.330 1.84e-05 ***
## bty_avg 0.0400333 0.0175064 2.287 0.02267 *
## pic_outfitnot formal -0.1126817 0.0738800 -1.525 0.12792
## pic_colorcolor -0.2172630 0.0715021 -3.039 0.00252 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.498 on 448 degrees of freedom
## Multiple R-squared: 0.1871, Adjusted R-squared: 0.1617
## F-statistic: 7.366 on 14 and 448 DF, p-value: 6.552e-14
plot(evals$score ~ evals$cls_profs, col = "lavender")
Based on the summary, holding the other variables constant, professors who are not a minority are predicted to score about 0.123 points higher than those who are.
Did the coefficients and significance of the other explanatory variables change? (One of the things that makes multiple regression interesting is that coefficient estimates depend on the other variables that are included in the model.) If not, what does this say about whether or not the dropped variable was collinear with the other explanatory variables? There seemed to be only a slight change when class professors was removed: several variables now have a slightly lower p-value (for example ethnicity, from 0.117 to 0.100), while a few others rose marginally. Because the changes are this small, the dropped variable does not appear to have been strongly collinear with the other explanatory variables.
m_full <- lm(score ~ rank + ethnicity + gender + language + age + cls_perc_eval
+ cls_students + cls_level + cls_credits + bty_avg
+ pic_outfit + pic_color, data = evals)
summary(m_full)
##
## Call:
## lm(formula = score ~ rank + ethnicity + gender + language + age +
## cls_perc_eval + cls_students + cls_level + cls_credits +
## bty_avg + pic_outfit + pic_color, data = evals)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.7836 -0.3257 0.0859 0.3513 0.9551
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.0872523 0.2888562 14.150 < 2e-16 ***
## ranktenure track -0.1476746 0.0819824 -1.801 0.072327 .
## ranktenured -0.0973829 0.0662614 -1.470 0.142349
## ethnicitynot minority 0.1274458 0.0772887 1.649 0.099856 .
## gendermale 0.2101231 0.0516873 4.065 5.66e-05 ***
## languagenon-english -0.2282894 0.1111305 -2.054 0.040530 *
## age -0.0089992 0.0031326 -2.873 0.004262 **
## cls_perc_eval 0.0052888 0.0015317 3.453 0.000607 ***
## cls_students 0.0004687 0.0003737 1.254 0.210384
## cls_levelupper 0.0606374 0.0575010 1.055 0.292200
## cls_creditsone credit 0.5061196 0.1149163 4.404 1.33e-05 ***
## bty_avg 0.0398629 0.0174780 2.281 0.023032 *
## pic_outfitnot formal -0.1083227 0.0721711 -1.501 0.134080
## pic_colorcolor -0.2190527 0.0711469 -3.079 0.002205 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4974 on 449 degrees of freedom
## Multiple R-squared: 0.187, Adjusted R-squared: 0.1634
## F-statistic: 7.943 on 13 and 449 DF, p-value: 2.336e-14
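One way to see why dropping cls_profs is worthwhile is to recompute the adjusted R-squared of both models from the reported R-squared values. The sketch below uses the standard formula; the helper name adj_r2 is hypothetical, and n = 463 is inferred from the residual degrees of freedom rather than read from the data.

```r
# Hypothetical helper: adjusted R-squared from R-squared,
# sample size n, and number of slope coefficients p
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

n <- 463  # inferred: 448 residual df + 14 slopes + 1 intercept

with_profs    <- adj_r2(0.1871, n, 14)  # full model
without_profs <- adj_r2(0.1870, n, 13)  # cls_profs dropped

print(round(c(with_profs = with_profs, without_profs = without_profs), 4))
```

The inputs are rounded, so the results can differ from the printed summaries in the last digit, but the direction is the same: R-squared barely moves while the penalty for an extra term disappears.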
score hat = 3.771922 + 0.167872 * ethnicity(not minority) + 0.207112 * gender(male) - 0.206178 * language(non-english) - 0.006046 * age + 0.004656 * cls_perc_eval + 0.505306 * cls_credits(one credit) + 0.051069 * bty_avg - 0.190579 * pic_color(color)
fb <- lm(score ~ ethnicity + gender + language + age + cls_perc_eval + cls_credits + bty_avg + pic_color, data = evals)
summary(fb)
##
## Call:
## lm(formula = score ~ ethnicity + gender + language + age + cls_perc_eval +
## cls_credits + bty_avg + pic_color, data = evals)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.85320 -0.32394 0.09984 0.37930 0.93610
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.771922 0.232053 16.255 < 2e-16 ***
## ethnicitynot minority 0.167872 0.075275 2.230 0.02623 *
## gendermale 0.207112 0.050135 4.131 4.30e-05 ***
## languagenon-english -0.206178 0.103639 -1.989 0.04726 *
## age -0.006046 0.002612 -2.315 0.02108 *
## cls_perc_eval 0.004656 0.001435 3.244 0.00127 **
## cls_creditsone credit 0.505306 0.104119 4.853 1.67e-06 ***
## bty_avg 0.051069 0.016934 3.016 0.00271 **
## pic_colorcolor -0.190579 0.067351 -2.830 0.00487 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4992 on 454 degrees of freedom
## Multiple R-squared: 0.1722, Adjusted R-squared: 0.1576
## F-statistic: 11.8 on 8 and 454 DF, p-value: 2.58e-15
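As a quick check on how the final model is used, the coefficients from summary(fb) can be applied by hand to a single professor. The professor below is entirely hypothetical (the values are made up for illustration), and each categorical variable is entered as a 0/1 indicator.

```r
# Coefficients copied from summary(fb) above
coefs <- c(intercept = 3.771922, not_minority = 0.167872, male = 0.207112,
           non_english = -0.206178, age = -0.006046, cls_perc_eval = 0.004656,
           one_credit = 0.505306, bty_avg = 0.051069, pic_color = -0.190579)

# Hypothetical professor: non-minority male, English-language education,
# age 40, 80% of students completing evaluations, multi-credit course,
# average beauty rating 5, color picture
x <- c(intercept = 1, not_minority = 1, male = 1, non_english = 0,
       age = 40, cls_perc_eval = 80, one_credit = 0, bty_avg = 5,
       pic_color = 1)

# Prediction is the sum of coefficient times value, term by term
pred <- sum(coefs * x[names(coefs)])
print(pred)
```

With the fitted model object available, predict(fb, newdata = ...) on a one-row data frame should give the same number without copying coefficients by hand.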
The residuals seem to be nearly normal, and their variability seems to be constant.
qqnorm(fb$residuals)
plot(fb)
hist(fb$residuals, col = "Lavender")
### Exercise 17: The original paper describes how these data were
gathered by taking a sample of professors from the University of Texas
at Austin and including all courses that they have taught. Considering
that each row represents a course, could this new information have an
impact on any of the conditions of linear regression? It would have an
effect because it violates the independence condition: courses taught by
the same professor are not independent of one another, so the scores of
those courses are not independent either.
Based on the model, a professor who received an education at an English-language school, is male, is not a minority, teaches a one-credit course, is fairly young, has a high beauty score, has a high percentage of students who actually complete the evaluations, and has a black-and-white picture (the coefficient for a color picture is negative) would be expected to receive a high evaluation.
No, mainly because the findings are specific to the University of Texas at Austin, so they don’t account for how the variables and other qualities of professors might differ at other universities.