Acknowledgment: This work was adapted from Gelman and Hill (2006).
Citation:
Gelman, A., & Hill, J. (2006). Data analysis using regression and multilevel/hierarchical models. Cambridge University Press.
# load packages
library(ggplot2)
library(MASS)
library(Matrix)
library(lme4)
library(tidyverse) # Mike's favorites…
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.1 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ lubridate 1.9.2 ✔ tibble 3.2.1
## ✔ purrr 1.0.1 ✔ tidyr 1.3.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ tidyr::expand() masks Matrix::expand()
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ✖ tidyr::pack() masks Matrix::pack()
## ✖ dplyr::select() masks MASS::select()
## ✖ tidyr::unpack() masks Matrix::unpack()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(haven) # for read_dta()
library(arm)
##
## arm (Version 1.13-1, built: 2022-8-25)
##
## Working directory is /Users/user/Desktop/R - Spring
read.csv("exercise2.1.dat")
## y.x1.x2
## 1 15.68 6.87 14.09
## 2 6.18 4.4 4.35
## 3 18.1 0.43 18.09
## 4 9.07 2.73 8.65
## 5 17.97 3.25 17.68
## 6 10.04 5.3 8.53
## 7 20.74 7.08 19.5
## 8 9.76 9.73 0.72
## 9 8.23 4.51 6.88
## 10 6.52 6.4 1.26
## 11 15.69 5.72 14.62
## 12 15.51 6.28 14.18
## 13 20.61 6.14 19.68
## 14 19.58 8.26 17.75
## 15 9.72 9.41 2.44
## 16 16.36 2.88 16.1
## 17 18.3 5.74 17.37
## 18 13.26 0.45 13.25
## 19 12.1 3.74 11.51
## 20 18.15 5.03 17.44
## 21 16.8 9.67 13.74
## 22 16.55 3.62 16.15
## 23 18.79 2.54 18.62
## 24 15.68 9.15 12.74
## 25 4.08 0.69 4.02
## 26 15.45 7.97 13.24
## 27 13.44 2.49 13.21
## 28 20.86 9.81 18.41
## 29 16.05 7.56 14.16
## 30 6 0.98 5.92
## 31 3.29 0.65 3.22
## 32 9.41 9 2.74
## 33 10.76 7.83 7.39
## 34 5.98 0.26 5.97
## 35 19.23 3.64 18.89
## 36 15.67 9.28 12.63
## 37 7.04 5.66 4.18
## 38 21.63 9.71 19.32
## 39 17.84 9.36 15.19
## 40 7.49 0.88 7.43
## 41 NA 9.87 10.43
## 42 NA 9.99 15.72
## 43 NA 8.39 0.35
## 44 NA 0.8 10.91
## 45 NA 9.58 15.82
## 46 NA 4.82 11.9
## 47 NA 2.97 2.46
## 48 NA 8.8 4.09
## 49 NA 6.07 1.8
## 50 NA 0.19 13.54
## 51 NA 4.19 19.13
## 52 NA 5.39 14.84
## 53 NA 6.58 5.28
## 54 NA 2.36 15.42
## 55 NA 2.37 4.12
## 56 NA 1.52 6.54
## 57 NA 2.07 2.67
## 58 NA 6.7 12.85
## 59 NA 2.02 8.36
## 60 NA 9.63 12.16
data <- "exercise2.1.dat"
The folder pyth contains outcome y and inputs x1, x2 for 40 data points, with a further 20 points with the inputs but no observed outcome. Save the file to your working directory and read it into R using the read.table() function.
data1 <- read.table(data, header = TRUE)
train_data <- data1[1:40, ]  # rows with observed y
test_data <- data1[41:60, ]  # rows with y missing
library(dplyr)
# alternative way of taking the first 40 rows
data_set <- data1 %>% head(40)
# alternative way of taking the last 20 rows
test_data_set <- data1 %>% tail(20)
lm_data1 <- lm(y ~ x1 + x2, data = data_set)
summary(lm_data1)
##
## Call:
## lm(formula = y ~ x1 + x2, data = data_set)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.9585 -0.5865 -0.3356 0.3973 2.8548
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.31513 0.38769 3.392 0.00166 **
## x1 0.51481 0.04590 11.216 1.84e-13 ***
## x2 0.80692 0.02434 33.148 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9 on 37 degrees of freedom
## Multiple R-squared: 0.9724, Adjusted R-squared: 0.9709
## F-statistic: 652.4 on 2 and 37 DF, p-value: < 2.2e-16
Comments: for every 1-unit increase in x1, holding x2 constant, y increases by 0.51481 on average, and for every 1-unit increase in x2, holding x1 constant, y increases by 0.80692. The model accounts for about 97% of the variability in y (adjusted R-squared = 0.9709), which is very good.
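As a quick complement to the point estimates, here is a small sketch pulling 95% confidence intervals for the coefficients (confint() is base R):
# 95% confidence intervals for the coefficients of lm_data1
confint(lm_data1, level = 0.95)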
sim_data1 <- sim(lm_data1)
coef(sim_data1)[1:2, ] # first two simulation draws of the coefficients
## (Intercept) x1 x2
## [1,] 1.234660 0.4878389 0.8142380
## [2,] 1.937482 0.4352307 0.7712429
Plot of fit against x1 with x2 held at its average value
par(mfrow = c(1, 2))
plot(data_set$x1, data_set$y, xlab = "x1", ylab = "y")
for (i in 1:25) {
  curve(cbind(1, x, mean(data_set$x2)) %*% coef(sim_data1)[i, ], lwd = .5, col = "#FFB6C1", add = TRUE)
}
curve(cbind(1, x, mean(data_set$x2)) %*% coef(lm_data1), col = "#ff0081", add = TRUE)
Plot of fit against x2 with x1 held at its average value
par(mfrow = c(1, 2))
plot(data_set$x2, data_set$y, xlab = "x2", ylab = "y")
for (i in 1:25) {
  curve(cbind(1, mean(data_set$x1), x) %*% coef(sim_data1)[i, ], lwd = .5, col = "#FFB6C1", add = TRUE)
}
curve(cbind(1, mean(data_set$x1), x) %*% coef(lm_data1), col = "#ff0081", add = TRUE)
y_hat <- fitted(lm_data1)
r <- resid(lm_data1)
sh <- sigma.hat(lm_data1)
residual.plot(y_hat, r, sh, lwd=1)
Comments: There seems to be some clustering in the bottom left of the plot which isn’t good. The assumption of equal variance of errors does not appear to be met.
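One way to probe the unequal-variance concern further is to plot the residuals against each predictor separately; a minimal sketch, reusing lm_data1 and data_set from above:
# residuals against each predictor, looking for non-constant spread
par(mfrow = c(1, 2))
plot(data_set$x1, resid(lm_data1), xlab = "x1", ylab = "residual")
abline(h = 0, lty = 2)
plot(data_set$x2, resid(lm_data1), xlab = "x2", ylab = "residual")
abline(h = 0, lty = 2)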
test_model <- predict(lm_data1, test_data_set, interval = "prediction", level=0.95)
test_model
## fit lwr upr
## 41 14.812484 12.916966 16.708002
## 42 19.142865 17.241520 21.044211
## 43 5.916816 3.958626 7.875005
## 44 10.530475 8.636141 12.424809
## 45 19.012485 17.118597 20.906373
## 46 13.398863 11.551815 15.245911
## 47 4.829144 2.918323 6.739965
## 48 9.145767 7.228364 11.063170
## 49 5.892489 3.979060 7.805918
## 50 12.338639 10.426349 14.250929
## 51 18.908561 17.021818 20.795303
## 52 16.064649 14.212209 17.917088
## 53 8.963122 7.084081 10.842163
## 54 14.972786 13.094194 16.851379
## 55 5.859744 3.959679 7.759808
## 56 7.374900 5.480921 9.268879
## 57 4.535267 2.616996 6.453539
## 58 15.133280 13.282467 16.984094
## 59 9.100899 7.223395 10.978403
## 60 16.084900 14.196990 17.972810
Comments: Given that the regression of y on x1 and x2 had an adjusted R-squared of about 0.97, I feel fairly confident about these predictions, at least for test points whose inputs fall within the range of the training data. The residual plot above did suggest some unequal variance, so the nominal 95% prediction intervals should be read with a little caution.
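One informal way to gauge how informative the predictions are is to compare the width of the prediction intervals to the spread of the observed outcome; a small sketch using test_model and data_set from above:
# width of the 95% prediction intervals versus the spread of observed y
interval_width <- test_model[, "upr"] - test_model[, "lwr"]
summary(interval_width)
sd(data_set$y)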
In this exercise you will simulate two variables that are statistically independent of each other to see what happens when we run a regression of one on the other.
First generate 1000 data points from a normal distribution with mean 0 and standard deviation 1 by typing var1 <- rnorm(1000,0,1) in R. Generate another variable in the same way (call it var2). Run a regression of one variable on the other. Is the slope coefficient statistically significant?
var1 <- rnorm(1000,0,1)
var2 <- rnorm(1000,0,1)
var1_lm <- lm(var1 ~ var2)
display(var1_lm)
## lm(formula = var1 ~ var2)
## coef.est coef.se
## (Intercept) -0.01 0.03
## var2 0.08 0.03
## ---
## n = 1000, k = 2
## residual sd = 0.98, R-Squared = 0.01
summary(var1_lm)
##
## Call:
## lm(formula = var1 ~ var2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.1743 -0.6651 -0.0205 0.6773 3.3471
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.005341 0.030963 -0.172 0.8631
## var2 0.075176 0.030594 2.457 0.0142 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9789 on 998 degrees of freedom
## Multiple R-squared: 0.006014, Adjusted R-squared: 0.005018
## F-statistic: 6.038 on 1 and 998 DF, p-value: 0.01417
var2_lm <- lm(var2 ~ var1)
summary(var2_lm)
##
## Call:
## lm(formula = var2 ~ var1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.3077 -0.6942 0.0191 0.6983 3.4545
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.02142 0.03193 0.671 0.5024
## var1 0.07999 0.03255 2.457 0.0142 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.01 on 998 degrees of freedom
## Multiple R-squared: 0.006014, Adjusted R-squared: 0.005018
## F-statistic: 6.038 on 1 and 998 DF, p-value: 0.01417
Comment: In this particular simulation the slope coefficient is statistically significant at the 5% level (p = 0.0142), even though the two variables were generated independently. With truly independent variables we expect a "significant" slope like this roughly 5% of the time just by chance, which the repeated simulation below illustrates.
# repeat the simulation 100 times, storing the z-score of the slope each time
z_scores <- rep(NA, 100)
for (k in 1:100) {
  var1 <- rnorm(1000, 0, 1)
  var2 <- rnorm(1000, 0, 1)
  fit <- lm(var2 ~ var1)
  z_scores[k] <- coef(fit)[2] / se.coef(fit)[2]
}
z_scores
## [1] -1.70964118 -0.47180246 -1.09415752 1.05912756 -0.40166600 -1.58777003
## [7] -0.28600526 -1.45986716 1.18298900 -2.87436560 1.48976811 0.17733226
## [13] 1.32743834 -1.02674140 0.67508277 -0.37580288 1.04091099 -0.60652049
## [19] -0.34211507 1.35293187 -1.09364202 -0.83093977 0.27901561 -0.59578195
## [25] -0.35784424 -1.01168365 0.74771949 0.79328903 -0.86323537 0.13480022
## [31] -1.00121981 0.69560501 -0.42947062 0.25650720 0.80230488 1.21333458
## [37] 0.19662594 0.82650396 -0.71085341 1.99799684 -0.03294835 0.21693150
## [43] 1.72238589 -0.24327113 -0.18080530 -1.34204512 -0.87533410 -1.82324482
## [49] -0.33544538 1.97136817 -1.34643008 0.95454704 -0.62330061 0.13047605
## [55] -0.45179356 1.84254809 -0.07683816 -0.31753948 -0.22118539 -0.71811574
## [61] 0.13764131 0.23420692 0.44605824 0.61550213 -1.20265382 1.78659104
## [67] -0.10207484 2.49712343 -0.35786576 -1.31350688 0.42847527 0.10159801
## [73] 1.34583237 0.48126826 0.45924945 -0.06255146 1.19666560 -0.17830837
## [79] 1.51497937 0.51249608 0.41650951 -1.16313692 0.48561808 -0.36401785
## [85] -0.99591460 1.16562085 0.54110733 1.32565094 -0.18770910 -0.95158427
## [91] 1.41217740 0.63533859 0.51603665 -0.49584593 1.09517189 1.63602069
## [97] 0.55262672 -1.75843024 0.84593461 -1.13151544
Comments: each z-score above is the slope estimate divided by its standard error from one regression of two independent variables; under the null of no relationship, values beyond about ±2 should occur roughly 5% of the time.
sum(z_scores > 2)
## [1] 1
sum(z_scores < -2)
## [1] 1
Comments: in this run, 2 of the 100 z-scores fall outside ±2 (one above 2 and one below -2) and would be judged statistically significant, which is close to the roughly 5 expected by chance.
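To tie this to the nominal error rate directly, a one-line sketch computes the observed share of |z| > 2 across the 100 simulations:
# share of simulations with |z| > 2; about 0.05 is expected by chance
mean(abs(z_scores) > 2)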
The child.iq folder contains a subset of the children and mother data discussed earlier in the chapter. You have access to children’s test scores at age 3, mother’s education, and the mother’s age at the time she gave birth for a sample of 400 children. The data are a Stata file which you can read into R.
data_iq <- read_dta ("child.iq.dta")
require("foreign")
## Loading required package: foreign
require("arm")
(Generalized) linear models make some strong assumptions about the data structure:
- Independence of each data point
- Correct distribution of the residuals
- Correct specification of the variance structure
- A linear relationship between the response and the linear predictor
For simple linear models the last three points mean that the residuals should be normally distributed, the variance should be homogeneous across the fitted values of the model and for each predictor separately, and the y's should be linearly related to the predictors. In R, checking these assumptions for an lm or glm object is fairly easy, as the sketch below shows.
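For example, the four standard diagnostic plots for an lm fit come from a single plot() call; a minimal sketch, shown here with lm_data1 from the first exercise (the same approach is applied to the models below):
# standard diagnostics: residuals vs fitted, normal Q-Q, scale-location, residuals vs leverage
par(mfrow = c(2, 2))
plot(lm_data1)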
head(data_iq)
## # A tibble: 6 × 3
## ppvt educ_cat momage
## <dbl> <dbl> <dbl>
## 1 120 2 21
## 2 89 1 17
## 3 78 2 19
## 4 42 1 20
## 5 115 4 26
## 6 97 1 20
To remind myself of the task: fit a regression of child test scores on mother's age.
lm_iq <- lm(ppvt ~ momage, data = data_iq)
summary(lm_iq)
##
## Call:
## lm(formula = ppvt ~ momage, data = data_iq)
##
## Residuals:
## Min 1Q Median 3Q Max
## -67.109 -11.798 2.971 14.860 55.210
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 67.7827 8.6880 7.802 5.42e-14 ***
## momage 0.8403 0.3786 2.219 0.027 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 20.34 on 398 degrees of freedom
## Multiple R-squared: 0.01223, Adjusted R-squared: 0.009743
## F-statistic: 4.926 on 1 and 398 DF, p-value: 0.02702
par(mfrow=c(2,2))
plot(lm_iq)
Comments: For every one-year increase in the mother's age at birth, the predicted child test score increases by 0.8403 points. This is a very weak model: the adjusted R-squared is 0.009743, so mother's age explains less than 1% of the variation in scores. The intercept (not the slope) says that a mother giving birth at age 0 would have a child with a predicted score of 67.78, an extrapolation far outside the range of the data.
What age would I suggest moms give birth?
plot(data_iq$momage, data_iq$ppvt, xlab = "Mother's Age", ylab = "Child's Test Score")
abline(lm_iq)
Comments: Just looking at the plot above, I would suggest mothers give birth sometime in their 20s. The fitted model, though, says the predicted score rises by 0.8403 points for each additional year of mother's age, which taken literally implies the older the better, at least within the observed range of ages.
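To put the 0.8403-point-per-year slope on a concrete scale, here is a small sketch comparing predicted scores from lm_iq at two illustrative mother's ages (the ages are arbitrary choices):
# predicted child test scores at two illustrative mother's ages
predict(lm_iq, newdata = data.frame(momage = c(17, 29)))
# the difference is 12 years * 0.8403 points per year, about 10 points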
lm_iq_2 <- lm(ppvt ~ momage + educ_cat, data = data_iq)
summary(lm_iq_2)
##
## Call:
## lm(formula = ppvt ~ momage + educ_cat, data = data_iq)
##
## Residuals:
## Min 1Q Median 3Q Max
## -61.763 -13.130 2.495 14.620 55.610
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 69.1554 8.5706 8.069 8.51e-15 ***
## momage 0.3433 0.3981 0.862 0.389003
## educ_cat 4.7114 1.3165 3.579 0.000388 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 20.05 on 397 degrees of freedom
## Multiple R-squared: 0.04309, Adjusted R-squared: 0.03827
## F-statistic: 8.939 on 2 and 397 DF, p-value: 0.0001594
par(mfrow=c(2,2))
plot(lm_iq_2)
Comment: The four panels are the standard lm diagnostics: residuals vs fitted values (checking linearity), a normal Q-Q plot of the residuals (checking normality), a scale-location plot (checking constant variance), and residuals vs leverage (flagging influential observations).
colours <- c('#00bfaf', '#FFC0CB', '#b4a7d6','#8fce00')
plot(data_iq$momage, data_iq$ppvt, xlab = "Mother's Age", ylab = "Child's Test Score", col = colours, pch=10)
for (i in 1:4) {
curve (cbind(1, x, i) %*% coef(lm_iq_2), add=TRUE, col=colours[i])
}
Comments: My conclusions about the timing of birth have changed. When mother's age was the only predictor it was statistically significant, but once mother's education is added to the model, mother's age is no longer significant. Timing of birth therefore seems to matter less than education level; the apparent age effect may simply reflect that mothers who give birth later tend to have more education.
# indicator for having completed high school (educ_cat >= 2)
data_iq$hs <- ifelse(data_iq$educ_cat >= 2, 1, 0)
lm_iq_3 <- lm(ppvt ~ hs * momage, data=data_iq)
summary(lm_iq_3)
##
## Call:
## lm(formula = ppvt ~ hs * momage, data = data_iq)
##
## Residuals:
## Min 1Q Median 3Q Max
## -56.696 -12.407 2.022 14.804 54.343
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 105.2202 17.6454 5.963 5.49e-09 ***
## hs -38.4088 20.2815 -1.894 0.0590 .
## momage -1.2402 0.8113 -1.529 0.1271
## hs:momage 2.2097 0.9181 2.407 0.0165 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 19.85 on 396 degrees of freedom
## Multiple R-squared: 0.06417, Adjusted R-squared: 0.05708
## F-statistic: 9.051 on 3 and 396 DF, p-value: 8.276e-06
display(lm_iq_3)
## lm(formula = ppvt ~ hs * momage, data = data_iq)
## coef.est coef.se
## (Intercept) 105.22 17.65
## hs -38.41 20.28
## momage -1.24 0.81
## hs:momage 2.21 0.92
## ---
## n = 400, k = 4
## residual sd = 19.85, R-Squared = 0.06
colours <- ifelse(data_iq$hs == 1, '#FFC0CB', '#8fce00')
plot(data_iq$momage, data_iq$ppvt, xlab = "Mother's Age", ylab = "Child's Test Score", col = colours, pch=10)
#completed high school
curve (cbind(1, 1, x, 1 * x) %*% coef(lm_iq_3), add=TRUE, col='#FFC0CB')
#did not complete high school
curve (cbind(1, 0, x, 0 * x) %*% coef(lm_iq_3), add=TRUE, col='#8fce00')
Comment: For mothers who completed high school, the later the birth, the higher the predicted test score. For mothers who did not complete high school, the later the birth, the lower the predicted test score.
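The two slopes described in that comment can be read off the coefficients of the interaction model; a minimal sketch using lm_iq_3:
# slope of momage for mothers who did not complete high school (hs = 0)
coef(lm_iq_3)["momage"]
# slope of momage for mothers who completed high school (hs = 1): main effect plus interaction
coef(lm_iq_3)["momage"] + coef(lm_iq_3)["hs:momage"]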
# split data set into training and test sets
training_data <- data_iq[1:200, ]
testing_data <- data_iq[201:nrow(data_iq), ]
# fit linear model
lm_iq_4 <- lm(ppvt ~ momage + educ_cat, data = training_data)
summary(lm_iq_4)
##
## Call:
## lm(formula = ppvt ~ momage + educ_cat, data = training_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -46.358 -12.967 2.866 14.435 58.428
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 63.6295 11.8202 5.383 2.07e-07 ***
## momage 0.4473 0.5516 0.811 0.41836
## educ_cat 5.4434 1.8228 2.986 0.00318 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 19.58 on 197 degrees of freedom
## Multiple R-squared: 0.06199, Adjusted R-squared: 0.05246
## F-statistic: 6.509 on 2 and 197 DF, p-value: 0.001831
# make predictions
predict_iq <- predict(lm_iq_4, testing_data)
plot(predict_iq, testing_data$ppvt, xlab = "Predicted Score", ylab = "Observed Score")
abline(a=0, b=1)
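Beyond the plot, a one-line numeric summary of the out-of-sample fit (an extra check, not required by the exercise prompt):
# root-mean-square error of the predictions on the held-out rows
sqrt(mean((testing_data$ppvt - predict_iq)^2))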
The folder beauty contains data from Hamermesh and Parker (2005) on student evaluations of instructors’ beauty and teaching quality for several courses at the University of Texas. The teaching evaluations were conducted at the end of the semester, and the beauty judgments were made later, by six students who had not attended the classes and were not aware of the course evaluations.
Reading in data
beauty_data <- read.csv("http://www.stat.columbia.edu/~gelman/arm/examples/beauty/ProfEvaltnsBeautyPublic.csv")
head(beauty_data)
## tenured profnumber minority age beautyf2upper beautyflowerdiv beautyfupperdiv
## 1 0 1 1 36 6 5 7
## 2 1 2 0 59 2 4 4
## 3 1 3 0 51 5 5 2
## 4 1 4 0 40 4 2 5
## 5 0 5 0 31 9 7 9
## 6 1 6 0 62 5 6 6
## beautym2upper beautymlowerdiv beautymupperdiv btystdave btystdf2u
## 1 6 2 4 0.2015666 0.2893519
## 2 3 2 3 -0.8260813 -1.6193560
## 3 3 2 3 -0.6603327 -0.1878249
## 4 2 3 3 -0.7663125 -0.6650018
## 5 6 7 6 1.4214450 1.7208830
## 6 6 5 5 0.5002196 -0.1878249
## btystdfl btystdfu btystdm2u btystdml btystdmu class1 class2 class3
## 1 0.4580018 0.8758139 0.6817153 -0.9000649 -0.1954181 0 0 1
## 2 -0.0735065 -0.5770065 -1.1319040 -0.9000649 -0.6546507 0 0 0
## 3 0.4580018 -1.5455530 -1.1319040 -0.9000649 -0.6546507 0 0 0
## 4 -1.1365230 -0.0927330 -1.7364440 -0.3125226 -0.6546507 0 1 0
## 5 1.5210190 1.8443610 0.6817153 2.0376470 0.7230470 0 0 0
## 6 0.9895102 0.3915404 0.6817153 0.8625621 0.2638144 0 0 0
## class4 class5 class6 class7 class8 class9 class10 class11 class12 class13
## 1 0 0 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0 0 0
## 3 1 0 0 0 0 0 0 0 0 0
## 4 0 0 0 0 0 0 0 0 0 0
## 5 0 0 0 0 0 0 0 0 0 0
## 6 0 0 0 0 0 0 0 0 0 0
## class14 class15 class16 class17 class18 class19 class20 class21 class22
## 1 0 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0 0
## 3 0 0 0 0 0 0 0 0 0
## 4 0 0 0 0 0 0 0 0 0
## 5 0 0 0 0 0 0 0 0 0
## 6 0 0 0 0 0 0 0 0 0
## class23 class24 class25 class26 class27 class28 class29 class30
## 1 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0
## 3 0 0 0 0 0 0 0 0
## 4 0 0 0 0 0 0 0 0
## 5 0 0 0 0 0 0 0 0
## 6 0 0 0 0 0 0 0 0
## courseevaluation didevaluation female formal fulldept lower multipleclass
## 1 4.3 24 1 0 1 0 1
## 2 4.5 17 0 0 1 0 0
## 3 3.7 55 0 0 1 0 1
## 4 4.3 40 1 0 1 0 1
## 5 4.4 42 1 0 1 0 0
## 6 4.2 182 0 1 1 0 0
## nonenglish onecredit percentevaluating profevaluation students tenuretrack
## 1 0 0 55.81395 4.7 43 1
## 2 0 0 85.00000 4.6 20 1
## 3 0 0 100.00000 4.1 55 1
## 4 0 0 86.95652 4.5 46 1
## 5 0 0 87.50000 4.8 48 1
## 6 0 0 64.53901 4.4 282 1
## blkandwhite btystdvariance btystdavepos btystdaveneg
## 1 0 2.1298060 0.201567 0.000000
## 2 0 1.3860810 0.000000 -0.826081
## 3 0 2.5374350 0.000000 -0.660333
## 4 0 1.7605770 0.000000 -0.766312
## 5 0 1.6931000 1.421450 0.000000
## 6 0 0.9447419 0.500220 0.000000
lm_beauty <- lm(btystdave ~ courseevaluation, data = beauty_data)
summary(lm_beauty)
##
## Call:
## lm(formula = btystdave ~ courseevaluation, data = beauty_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.42409 -0.62469 -0.09744 0.50828 2.13077
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.1626 0.2624 -4.431 1.17e-05 ***
## courseevaluation 0.2687 0.0650 4.133 4.25e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7753 on 461 degrees of freedom
## Multiple R-squared: 0.03574, Adjusted R-squared: 0.03364
## F-statistic: 17.08 on 1 and 461 DF, p-value: 4.247e-05
lm_beauty2 <- lm(btystdave ~ courseevaluation + female + age, data = beauty_data)
summary(lm_beauty2)
##
## Call:
## lm(formula = btystdave ~ courseevaluation + female + age, data = beauty_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.80374 -0.54913 -0.08697 0.45625 1.87299
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.167859 0.335388 -0.500 0.617
## courseevaluation 0.265700 0.063100 4.211 3.06e-05 ***
## female 0.124219 0.073802 1.683 0.093 .
## age -0.021403 0.003684 -5.809 1.17e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7404 on 459 degrees of freedom
## Multiple R-squared: 0.1244, Adjusted R-squared: 0.1187
## F-statistic: 21.73 on 3 and 459 DF, p-value: 3.543e-13
par(mfrow=c(2,2))
plot(lm_beauty2)
Comments: For every 1-unit increase in course evaluation score, the predicted beauty score increases by 0.2657. Female instructors have predicted beauty scores 0.124219 higher than male instructors, holding the other predictors constant, and each additional year of age lowers the predicted beauty score by 0.021403. The residual standard error of 0.74 means that roughly 68% of observed beauty scores fall within about 0.74 of the model's predictions. Whether that is large should be judged against the overall spread of btystdave; with an R-squared of only 0.12, the residual sd is not much smaller than the raw sd of the outcome, so the model leaves most of the variation unexplained.
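The 68% reading of the residual standard error can be checked directly; a minimal sketch using lm_beauty2 (sigma() is the base-R extractor for the residual standard deviation):
# residual standard error, and the share of residuals within one sd of the fit
sigma(lm_beauty2)
mean(abs(resid(lm_beauty2)) <= sigma(lm_beauty2))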
beta_hat <- coef(lm_beauty2)
beta_sim <- sim(lm_beauty2)@coef
par(mfrow=c(1,3))
Plot course evaluation against beauty score
plot(beauty_data$courseevaluation, beauty_data$btystdave, pch = 10, xlab = "Course Evaluation Score", ylab = "Beauty Score")
for (i in 1:10){
curve(cbind (1,x, mean(beauty_data$female), mean(beauty_data$age)) %*% beta_sim[i,], lwd = .5, col = '#00bfaf', add = TRUE)
}
curve (cbind (1, x, mean(beauty_data$female), mean(beauty_data$age)) %*% beta_hat, col="#ff0081", add=TRUE)
Plot sex against beauty score
plot(beauty_data$female, beauty_data$btystdave, pch = 10, xlab = "Sex", ylab = "Beauty Score")
for (i in 1:10){
curve(cbind (1, mean(beauty_data$courseevaluation), x, mean(beauty_data$age)) %*% beta_sim[i,], lwd = .5, col = '#00bfaf', add = TRUE)
}
curve (cbind (1, mean(beauty_data$courseevaluation),x, mean(beauty_data$age)) %*% beta_hat, col="#ff0081", add=TRUE)
Plot age against beauty score
plot(beauty_data$age, beauty_data$btystdave, pch = 10, xlab = "Age", ylab = "Beauty Score")
for (i in 1:10){
curve(cbind(1, mean(beauty_data$courseevaluation), mean(beauty_data$female), x) %*% beta_sim[i,], lwd = .5, col = '#00bfaf', add = TRUE)
}
curve(cbind(1, mean(beauty_data$courseevaluation), mean(beauty_data$female), x) %*% beta_hat, col = "#ff0081", add = TRUE)
lm_beauty3 <- lm(btystdave ~ courseevaluation * female + age, data = beauty_data)
summary(lm_beauty3)
##
## Call:
## lm(formula = btystdave ~ courseevaluation * female + age, data = beauty_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.80938 -0.54166 -0.09022 0.42178 1.82833
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.512488 0.387154 -1.324 0.1863
## courseevaluation 0.356584 0.081244 4.389 1.41e-05 ***
## female 1.020189 0.511609 1.994 0.0467 *
## age -0.021899 0.003686 -5.940 5.63e-09 ***
## courseevaluation:female -0.226481 0.127976 -1.770 0.0774 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7387 on 458 degrees of freedom
## Multiple R-squared: 0.1303, Adjusted R-squared: 0.1227
## F-statistic: 17.16 on 4 and 458 DF, p-value: 4.001e-13
Comment: For male instructors (female = 0), each one-unit increase in course evaluation is associated with a 0.356584 increase in beauty score. The female coefficient of 1.020189 is the predicted female-male difference at a course evaluation of 0, which lies outside the observed range, so it should not be read as an overall sex difference. Each additional year of age lowers the predicted beauty score by 0.021899. The interaction means the course-evaluation slope is 0.226481 smaller for female instructors than for males (about 0.13 versus 0.36), but this interaction is not statistically significant at the 5% level.
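Those two slopes can be computed directly from the interaction model's coefficients; a short sketch using lm_beauty3:
# course-evaluation slope for male instructors (female = 0)
coef(lm_beauty3)["courseevaluation"]
# course-evaluation slope for female instructors: main effect plus interaction
coef(lm_beauty3)["courseevaluation"] + coef(lm_beauty3)["courseevaluation:female"]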
lm_beauty4 <- lm(btystdave ~ courseevaluation * age + female, data = beauty_data)
summary(lm_beauty4)
##
## Call:
## lm(formula = btystdave ~ courseevaluation * age + female, data = beauty_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.82375 -0.55055 -0.08067 0.44817 1.80452
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.259577 1.268005 1.782 0.07541 .
## courseevaluation -0.347133 0.315136 -1.102 0.27124
## age -0.070473 0.024997 -2.819 0.00502 **
## female 0.132078 0.073674 1.793 0.07367 .
## courseevaluation:age 0.012392 0.006244 1.985 0.04779 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.738 on 458 degrees of freedom
## Multiple R-squared: 0.1318, Adjusted R-squared: 0.1243
## F-statistic: 17.39 on 4 and 458 DF, p-value: 2.712e-13
par(mfrow=c(2,2))
plot(lm_beauty3)
Comments: In this model the courseevaluation:age interaction is positive and significant at the 5% level (p = 0.048), suggesting the association between course evaluations and beauty scores is stronger for older instructors, while the main effect of course evaluation on its own is no longer significant.
We will now fit a model which, compared to lm_beauty3, adds a few more input variables. As we will see, this model, although slightly more complicated, explains more of the variance in the outcome.
lm_beauty5 <- lm(btystdave ~ courseevaluation * female + age + blkandwhite + formal, data = beauty_data)
summary(lm_beauty5)
##
## Call:
## lm(formula = btystdave ~ courseevaluation * female + age + blkandwhite +
## formal, data = beauty_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.6987 -0.5367 -0.0485 0.3941 1.8714
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.270223 0.370255 -0.730 0.46587
## courseevaluation 0.314151 0.077555 4.051 6.00e-05 ***
## female 1.356260 0.490874 2.763 0.00596 **
## age -0.025580 0.003569 -7.167 3.11e-12 ***
## blkandwhite 0.552726 0.092295 5.989 4.29e-09 ***
## formal 0.253208 0.090476 2.799 0.00535 **
## courseevaluation:female -0.330770 0.123321 -2.682 0.00758 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.703 on 456 degrees of freedom
## Multiple R-squared: 0.2158, Adjusted R-squared: 0.2054
## F-statistic: 20.91 on 6 and 456 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(lm_beauty5)
Comments: Adding blkandwhite and formal increased the adjusted R-squared from 0.1227 to 0.2054, so the extra predictors noticeably improve the fit of the model.
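Since lm_beauty5 nests lm_beauty3 (it only adds blkandwhite and formal), the improvement in fit can also be tested formally with an F-test; a minimal sketch:
# F-test comparing the smaller interaction model to the expanded model
anova(lm_beauty3, lm_beauty5)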