Acknowledgment: This work was adapted from Gelman and Hill (2006).

Citation:

Gelman, A., & Hill, J. (2006). Data analysis using regression and multilevel/hierarchical models. Cambridge University Press.

# load packages
library(ggplot2)
library(MASS)
library(Matrix)
library(lme4)
library(tidyverse) # Mike's favorites…
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.1     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ lubridate 1.9.2     ✔ tibble    3.2.1
## ✔ purrr     1.0.1     ✔ tidyr     1.3.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ tidyr::expand() masks Matrix::expand()
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ✖ tidyr::pack()   masks Matrix::pack()
## ✖ dplyr::select() masks MASS::select()
## ✖ tidyr::unpack() masks Matrix::unpack()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(haven) # provides read_dta() for reading Stata files
library(arm)
## 
## arm (Version 1.13-1, built: 2022-8-25)
## 
## Working directory is /Users/user/Desktop/R - Spring

Chapter 3 Exercises

# a first look at the raw file: read.csv() collapses it into a single column because the file is space-delimited, so read.table() is used below
read.csv("exercise2.1.dat")
##             y.x1.x2
## 1  15.68 6.87 14.09
## 2     6.18 4.4 4.35
## 3   18.1 0.43 18.09
## 4    9.07 2.73 8.65
## 5  17.97 3.25 17.68
## 6    10.04 5.3 8.53
## 7   20.74 7.08 19.5
## 8    9.76 9.73 0.72
## 9    8.23 4.51 6.88
## 10    6.52 6.4 1.26
## 11 15.69 5.72 14.62
## 12 15.51 6.28 14.18
## 13 20.61 6.14 19.68
## 14 19.58 8.26 17.75
## 15   9.72 9.41 2.44
## 16  16.36 2.88 16.1
## 17  18.3 5.74 17.37
## 18 13.26 0.45 13.25
## 19  12.1 3.74 11.51
## 20 18.15 5.03 17.44
## 21  16.8 9.67 13.74
## 22 16.55 3.62 16.15
## 23 18.79 2.54 18.62
## 24 15.68 9.15 12.74
## 25   4.08 0.69 4.02
## 26 15.45 7.97 13.24
## 27 13.44 2.49 13.21
## 28 20.86 9.81 18.41
## 29 16.05 7.56 14.16
## 30      6 0.98 5.92
## 31   3.29 0.65 3.22
## 32      9.41 9 2.74
## 33  10.76 7.83 7.39
## 34   5.98 0.26 5.97
## 35 19.23 3.64 18.89
## 36 15.67 9.28 12.63
## 37   7.04 5.66 4.18
## 38 21.63 9.71 19.32
## 39 17.84 9.36 15.19
## 40   7.49 0.88 7.43
## 41    NA 9.87 10.43
## 42    NA 9.99 15.72
## 43     NA 8.39 0.35
## 44     NA 0.8 10.91
## 45    NA 9.58 15.82
## 46     NA 4.82 11.9
## 47     NA 2.97 2.46
## 48      NA 8.8 4.09
## 49      NA 6.07 1.8
## 50    NA 0.19 13.54
## 51    NA 4.19 19.13
## 52    NA 5.39 14.84
## 53     NA 6.58 5.28
## 54    NA 2.36 15.42
## 55     NA 2.37 4.12
## 56     NA 1.52 6.54
## 57     NA 2.07 2.67
## 58     NA 6.7 12.85
## 59     NA 2.02 8.36
## 60    NA 9.63 12.16
data <- "exercise2.1.dat"

Question 1

The folder pyth contains outcome y and inputs x1, x2 for 40 data points, with a further 20 points with the inputs but no observed outcome. Save the file to your working directory and read it into R using the read.table() function.

data1 <- read.table(data, header = TRUE)

# first 40 rows have the observed outcome (training); the last 20 have only the inputs
train_data <- data1[1:40, ]

test_data <- data1[-(1:40), ]


library(dplyr)

# alternative way of taking the first 40 rows
data_set <- data1 %>% head(40)

# and an alternative way of taking the last 20 rows
test_data_set <- data1 %>% tail(20)

(a) Use R to fit a linear regression model predicting y from x1,x2, using the first 40 data points in the file. Summarize the inferences and check the fit of your model.

lm_data1 <- lm(y ~ x1 + x2, data = data_set)
summary(lm_data1)
## 
## Call:
## lm(formula = y ~ x1 + x2, data = data_set)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.9585 -0.5865 -0.3356  0.3973  2.8548 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.31513    0.38769   3.392  0.00166 ** 
## x1           0.51481    0.04590  11.216 1.84e-13 ***
## x2           0.80692    0.02434  33.148  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9 on 37 degrees of freedom
## Multiple R-squared:  0.9724, Adjusted R-squared:  0.9709 
## F-statistic: 652.4 on 2 and 37 DF,  p-value: < 2.2e-16

Comments: For every one-unit increase in x1, holding x2 constant, y increases by 0.515 on average; for every one-unit increase in x2, holding x1 constant, y increases by 0.807. The model accounts for about 97% of the variability in y (adjusted R-squared = 0.971), which is a good fit.
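As a quick additional check of the fit (a sketch; the residual plot in part (c) below is the more formal check), we can compare the observed outcomes with the fitted values:

# observed y against fitted values; points should fall close to the 45-degree line
plot(fitted(lm_data1), data_set$y, xlab = "Fitted values", ylab = "Observed y")
abline(a = 0, b = 1)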

(b) Display the estimated model graphically as in Figure 3.2.

sim_data1 <- sim(lm_data1)
coef(sim_data1)[1:2, ]   # first two simulation draws of the coefficients
##      (Intercept)        x1        x2
## [1,]    1.234660 0.4878389 0.8142380
## [2,]    1.937482 0.4352307 0.7712429

Plot of the fit against x1, with x2 held at its average value

par(mfrow = c(1, 2))

plot(data_set$x1, data_set$y, xlab = "x1", ylab = "y")
for (i in 1:25){
  curve(cbind(1, x, mean(data_set$x2)) %*% coef(sim_data1)[i, ], lwd = .5, col = "#FFB6C1", add = TRUE)
}

curve(cbind(1, x, mean(data_set$x2)) %*% coef(lm_data1), col = "#ff0081", add = TRUE)

Plot of the fit against x2, with x1 held at its average value

par(mfrow = c(1, 2))

plot(data_set$x2, data_set$y, xlab = "x2", ylab = "y")
for (i in 1:25){
  curve(cbind(1, mean(data_set$x1), x) %*% coef(sim_data1)[i, ], lwd = .5, col = "#FFB6C1", add = TRUE)
}

curve(cbind(1, mean(data_set$x1), x) %*% coef(lm_data1), col = "#ff0081", add = TRUE)

(c) Make a residual plot for this model. Do the assumptions appear to be met?

y_hat <- fitted(lm_data1)     # fitted values
r <- resid(lm_data1)          # residuals
sh <- sigma.hat(lm_data1)     # residual standard deviation

residual.plot(y_hat, r, sh, lwd = 1)

Comments: There is some clustering of residuals in the bottom left of the plot, and the spread of the residuals is not constant across the fitted values, so the assumption of equal error variance does not appear to be met.
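To probe the equal-variance assumption a bit further, one could also plot the residuals against each predictor separately (a sketch reusing the objects defined above):

# residuals against each input; the spread should look roughly constant
par(mfrow = c(1, 2))
plot(data_set$x1, r, xlab = "x1", ylab = "Residual"); abline(h = 0, lty = 2)
plot(data_set$x2, r, xlab = "x2", ylab = "Residual"); abline(h = 0, lty = 2)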

(d) Make predictions for the remaining 20 data points in the file. How confident do you feel about these predictions?

test_model <- predict(lm_data1, test_data_set, interval = "prediction", level=0.95)
test_model
##          fit       lwr       upr
## 41 14.812484 12.916966 16.708002
## 42 19.142865 17.241520 21.044211
## 43  5.916816  3.958626  7.875005
## 44 10.530475  8.636141 12.424809
## 45 19.012485 17.118597 20.906373
## 46 13.398863 11.551815 15.245911
## 47  4.829144  2.918323  6.739965
## 48  9.145767  7.228364 11.063170
## 49  5.892489  3.979060  7.805918
## 50 12.338639 10.426349 14.250929
## 51 18.908561 17.021818 20.795303
## 52 16.064649 14.212209 17.917088
## 53  8.963122  7.084081 10.842163
## 54 14.972786 13.094194 16.851379
## 55  5.859744  3.959679  7.759808
## 56  7.374900  5.480921  9.268879
## 57  4.535267  2.616996  6.453539
## 58 15.133280 13.282467 16.984094
## 59  9.100899  7.223395 10.978403
## 60 16.084900 14.196990 17.972810

Comments: Given that the fitted model has an adjusted R-squared of about 0.97 and the prediction intervals above are fairly narrow, I feel reasonably confident about these predictions, provided the new x1 and x2 values fall within the range of the data the model was fit to.
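One caveat: the predictions are only as trustworthy as the assumption that the new points look like the training data. A minimal sketch of that check is to compare the spread of the inputs in the two sets:

# the new x1, x2 values should fall within the range seen in the training data
summary(data_set[, c("x1", "x2")])
summary(test_data_set[, c("x1", "x2")])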

Question 3

In this exercise you will simulate two variables that are statistically independent of each other to see what happens when we run a regression of one on the other.

First generate 1000 data points from a normal distribution with mean 0 and standard deviation 1 by typing var1 <- rnorm(1000,0,1) in R. Generate another variable in the same way (call it var2). Run a regression of one variable on the other. Is the slope coefficient statistically significant?

var1 <- rnorm(1000,0,1)
var2 <- rnorm(1000,0,1)

var1_lm <- lm(var1 ~ var2)
display(var1_lm)
## lm(formula = var1 ~ var2)
##             coef.est coef.se
## (Intercept) -0.01     0.03  
## var2         0.08     0.03  
## ---
## n = 1000, k = 2
## residual sd = 0.98, R-Squared = 0.01
summary(var1_lm)
## 
## Call:
## lm(formula = var1 ~ var2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.1743 -0.6651 -0.0205  0.6773  3.3471 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)  
## (Intercept) -0.005341   0.030963  -0.172   0.8631  
## var2         0.075176   0.030594   2.457   0.0142 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9789 on 998 degrees of freedom
## Multiple R-squared:  0.006014,   Adjusted R-squared:  0.005018 
## F-statistic: 6.038 on 1 and 998 DF,  p-value: 0.01417
var2_lm <- lm(var2 ~ var1)
summary(var2_lm)
## 
## Call:
## lm(formula = var2 ~ var1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.3077 -0.6942  0.0191  0.6983  3.4545 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)  0.02142    0.03193   0.671   0.5024  
## var1         0.07999    0.03255   2.457   0.0142 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.01 on 998 degrees of freedom
## Multiple R-squared:  0.006014,   Adjusted R-squared:  0.005018 
## F-statistic: 6.038 on 1 and 998 DF,  p-value: 0.01417

Comment: In this particular simulated draw the slope coefficient happens to be statistically significant at the 5% level (p = 0.014), even though var1 and var2 were generated independently; with a 5% threshold, such spurious significance is expected about 1 time in 20.

(b) Now run a simulation repeating this process 100 times. This can be done using a loop. From each simulation, save the z-score (the estimated coefficient of var1 divided by its standard error). If the absolute value of the z-score exceeds 2, the estimate is statistically significant. Here is code to perform the simulation:

z_scores <- rep(NA, 100)

for (k in 1:100) {
  var1 <- rnorm(1000, 0, 1)   # two independent standard normal variables
  var2 <- rnorm(1000, 0, 1)
  fit <- lm(var2 ~ var1)      # regress one on the other
  z_scores[k] <- coef(fit)[2] / se.coef(fit)[2]   # slope estimate divided by its standard error
}
z_scores
##   [1] -1.70964118 -0.47180246 -1.09415752  1.05912756 -0.40166600 -1.58777003
##   [7] -0.28600526 -1.45986716  1.18298900 -2.87436560  1.48976811  0.17733226
##  [13]  1.32743834 -1.02674140  0.67508277 -0.37580288  1.04091099 -0.60652049
##  [19] -0.34211507  1.35293187 -1.09364202 -0.83093977  0.27901561 -0.59578195
##  [25] -0.35784424 -1.01168365  0.74771949  0.79328903 -0.86323537  0.13480022
##  [31] -1.00121981  0.69560501 -0.42947062  0.25650720  0.80230488  1.21333458
##  [37]  0.19662594  0.82650396 -0.71085341  1.99799684 -0.03294835  0.21693150
##  [43]  1.72238589 -0.24327113 -0.18080530 -1.34204512 -0.87533410 -1.82324482
##  [49] -0.33544538  1.97136817 -1.34643008  0.95454704 -0.62330061  0.13047605
##  [55] -0.45179356  1.84254809 -0.07683816 -0.31753948 -0.22118539 -0.71811574
##  [61]  0.13764131  0.23420692  0.44605824  0.61550213 -1.20265382  1.78659104
##  [67] -0.10207484  2.49712343 -0.35786576 -1.31350688  0.42847527  0.10159801
##  [73]  1.34583237  0.48126826  0.45924945 -0.06255146  1.19666560 -0.17830837
##  [79]  1.51497937  0.51249608  0.41650951 -1.16313692  0.48561808 -0.36401785
##  [85] -0.99591460  1.16562085  0.54110733  1.32565094 -0.18770910 -0.95158427
##  [91]  1.41217740  0.63533859  0.51603665 -0.49584593  1.09517189  1.63602069
##  [97]  0.55262672 -1.75843024  0.84593461 -1.13151544

Comments: Each iteration generates two fresh independent standard normal variables, regresses var2 on var1, and stores the slope estimate divided by its standard error (the z-score). (Can we go over this code in class? I still have a few questions.)

(c) How many of these z scores are significant?

sum(z_scores > 2)
## [1] 1
sum(z_scores < -2)
## [1] 1

Comments: 2 of the 100 z-scores exceed 2 in absolute value (one above 2 and one below -2) and are therefore statistically significant, which is roughly in line with the ~5% we would expect by chance.
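An equivalent one-line count uses the absolute value to catch both tails at once (a sketch):

# z-scores that are significant in either direction
sum(abs(z_scores) > 2)
# with a 5% threshold we expect roughly 5 of the 100 to be significant by chance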

Question 4

The child.iq folder contains a subset of the children and mother data discussed earlier in the chapter. You have access to children’s test scores at age 3, mother’s education, and the mother’s age at the time she gave birth for a sample of 400 children. The data are a Stata file which you can read into R.

data_iq <- read_dta ("child.iq.dta")

(a) Fit a regression of child test scores on mother’s age, display the data and fitted model, check assumptions, and interpret the slope coefficient. When do you recommend mothers should give birth? What are you assuming in making these recommendations?

require("foreign")
## Loading required package: foreign
require("arm")

(Generalized) linear models make some strong assumptions about the data structure:

- Independence of the data points
- Correct distribution of the residuals
- Correct specification of the variance structure
- A linear relationship between the response and the linear predictor

For simple linear models, the last three points mean that the residuals should be normally distributed, the variance should be homogeneous across the fitted values of the model and across each predictor separately, and y should be linearly related to the predictors. In R, checking these assumptions from an lm or glm object is fairly easy (e.g., with plot() on the fitted model, as below).

head(data_iq)
## # A tibble: 6 × 3
##    ppvt educ_cat momage
##   <dbl>    <dbl>  <dbl>
## 1   120        2     21
## 2    89        1     17
## 3    78        2     19
## 4    42        1     20
## 5   115        4     26
## 6    97        1     20

Just to remind myself what I want to complete

Fit a regression of child test scores on mother’s age

lm_iq <- lm(ppvt ~ momage, data = data_iq)
summary(lm_iq)
## 
## Call:
## lm(formula = ppvt ~ momage, data = data_iq)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -67.109 -11.798   2.971  14.860  55.210 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  67.7827     8.6880   7.802 5.42e-14 ***
## momage        0.8403     0.3786   2.219    0.027 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 20.34 on 398 degrees of freedom
## Multiple R-squared:  0.01223,    Adjusted R-squared:  0.009743 
## F-statistic: 4.926 on 1 and 398 DF,  p-value: 0.02702
par(mfrow=c(2,2))
plot(lm_iq)

Comments: For every one-year increase in the mother's age at birth there is a 0.84-point increase in the child's test score. This is a really poor model considering the adjusted R-squared is only 0.0097. The intercept (67.78) is the predicted score when the mother's age is zero, which is an extrapolation with no practical meaning.

What age would I suggest moms give birth?

plot(data_iq$momage, data_iq$ppvt, xlab = "Mother's Age", ylab = "Child's Test Score")
abline(lm_iq)

Comments: Just looking at the plot above I would suggest mothers give birth sometime in their 20s. But the model implies that every additional year of mother's age adds 0.84 points to the child's test score, i.e., the older the better. That recommendation assumes the linear trend holds across (and beyond) the observed range of ages and that age is not standing in for other variables such as education.
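To make the size of this effect concrete, here is a sketch comparing the predicted scores at two illustrative ages (20 and 30 are hypothetical values within the observed range):

# predicted test score for a mother giving birth at age 20 versus age 30
predict(lm_iq, newdata = data.frame(momage = c(20, 30)))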

(b) Repeat this for a regression that further includes mother’s education, interpreting both slope coefficients in this model. Have your conclusions about the timing of birth changed?

lm_iq_2 <- lm(ppvt ~ momage + educ_cat, data = data_iq)
summary(lm_iq_2)
## 
## Call:
## lm(formula = ppvt ~ momage + educ_cat, data = data_iq)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -61.763 -13.130   2.495  14.620  55.610 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  69.1554     8.5706   8.069 8.51e-15 ***
## momage        0.3433     0.3981   0.862 0.389003    
## educ_cat      4.7114     1.3165   3.579 0.000388 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 20.05 on 397 degrees of freedom
## Multiple R-squared:  0.04309,    Adjusted R-squared:  0.03827 
## F-statistic: 8.939 on 2 and 397 DF,  p-value: 0.0001594
par(mfrow=c(2,2))
plot(lm_iq_2)

Comment: Holding education constant, each additional year of mother's age is associated with a 0.34-point increase in the child's test score, which is not statistically significant. Holding mother's age constant, each one-category increase in mother's education is associated with a 4.71-point increase in test score, which is statistically significant.

colours <- c('#00bfaf', '#FFC0CB', '#b4a7d6', '#8fce00')

# colour each point by the mother's education category
plot(data_iq$momage, data_iq$ppvt, xlab = "Mother's Age", ylab = "Child's Test Score", col = colours[data_iq$educ_cat], pch = 10)

# one fitted line per education category
for (i in 1:4) {
  curve(cbind(1, x, i) %*% coef(lm_iq_2), add = TRUE, col = colours[i])
}

Comments: My conclusions about the timing of birth have changed. When mother's age was the only predictor it was statistically significant, but once mother's education is added to the model, mother's age is no longer significant. Therefore the timing of birth matters less than education level. The apparent age effect in part (a) likely reflects the fact that older mothers tend to have more education.
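Since the age-only model is nested in the age-plus-education model, the two can also be compared formally with an F-test (a sketch):

# does adding mother's education significantly improve on the age-only model?
anova(lm_iq, lm_iq_2)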

(c) Now create an indicator variable reflecting whether the mother has completed high school or not. Consider interactions between the high school completion and mother’s age in family. Also, create a plot that shows the separate regression lines for each high school completion status group.

# indicator: 1 if the mother completed high school (educ_cat >= 2), 0 otherwise
data_iq$hs <- ifelse(data_iq$educ_cat >= 2, 1, 0)

lm_iq_3 <- lm(ppvt ~ hs * momage, data=data_iq)
summary(lm_iq_3)
## 
## Call:
## lm(formula = ppvt ~ hs * momage, data = data_iq)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -56.696 -12.407   2.022  14.804  54.343 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 105.2202    17.6454   5.963 5.49e-09 ***
## hs          -38.4088    20.2815  -1.894   0.0590 .  
## momage       -1.2402     0.8113  -1.529   0.1271    
## hs:momage     2.2097     0.9181   2.407   0.0165 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 19.85 on 396 degrees of freedom
## Multiple R-squared:  0.06417,    Adjusted R-squared:  0.05708 
## F-statistic: 9.051 on 3 and 396 DF,  p-value: 8.276e-06
display(lm_iq_3)
## lm(formula = ppvt ~ hs * momage, data = data_iq)
##             coef.est coef.se
## (Intercept) 105.22    17.65 
## hs          -38.41    20.28 
## momage       -1.24     0.81 
## hs:momage     2.21     0.92 
## ---
## n = 400, k = 4
## residual sd = 19.85, R-Squared = 0.06
colours <- ifelse(data_iq$hs == 1, '#FFC0CB', '#8fce00')

plot(data_iq$momage, data_iq$ppvt, xlab = "Mother's Age", ylab = "Child's Test Score", col = colours, pch = 10)
# completed high school
curve(cbind(1, 1, x, 1 * x) %*% coef(lm_iq_3), add = TRUE, col = '#FFC0CB')
# did not complete high school
curve(cbind(1, 0, x, 0 * x) %*% coef(lm_iq_3), add = TRUE, col = '#8fce00')

Comment: For mothers who completed high school, the older the mother was when giving birth, the higher the child's test score. For mothers who did not complete high school, the older the mother was when giving birth, the lower the child's test score.
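The implied slope of mother's age for each group can be read off the coefficients directly (a sketch):

b <- coef(lm_iq_3)
b["momage"]                    # age slope for mothers without a high school degree
b["momage"] + b["hs:momage"]   # age slope for mothers who completed high school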

(d) Finally, fit a regression of child test scores on mother’s age and education level for the first 200 children and use this model to predict test scores for the next 200. Graphically display comparisons of the predicted and actual scores for the final 200 children.

# split data set into training (first 200 children) and test (remaining 200) sets
training_data <- data_iq[1:200, ]
testing_data <- data_iq[201:nrow(data_iq), ]

# fit linear model
lm_iq_4 <- lm(ppvt ~ momage + educ_cat, data = training_data)
summary(lm_iq_4)
## 
## Call:
## lm(formula = ppvt ~ momage + educ_cat, data = training_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -46.358 -12.967   2.866  14.435  58.428 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  63.6295    11.8202   5.383 2.07e-07 ***
## momage        0.4473     0.5516   0.811  0.41836    
## educ_cat      5.4434     1.8228   2.986  0.00318 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 19.58 on 197 degrees of freedom
## Multiple R-squared:  0.06199,    Adjusted R-squared:  0.05246 
## F-statistic: 6.509 on 2 and 197 DF,  p-value: 0.001831
# make predictions
predict_iq <- predict(lm_iq_4, testing_data)

plot(predict_iq, testing_data$ppvt, xlab = "Predicted Score", ylab = "Observed Score")
abline(a=0, b=1)
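A numeric summary of predictive accuracy on the held-out children (a sketch); comparing it with the residual standard deviation of the fitted model (about 19.6) shows whether the model predicts new data about as well as it fits the training data:

# root mean squared prediction error for the final 200 children
sqrt(mean((testing_data$ppvt - predict_iq)^2))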

Question 5

The folder beauty contains data from Hamermesh and Parker (2005) on student evaluations of instructors’ beauty and teaching quality for several courses at the University of Texas. The teaching evaluations were conducted at the end of the semester, and the beauty judgments were made later, by six students who had not attended the classes and were not aware of the course evaluations.

Reading in data

beauty_data <- read.csv("http://www.stat.columbia.edu/~gelman/arm/examples/beauty/ProfEvaltnsBeautyPublic.csv")
head(beauty_data)
##   tenured profnumber minority age beautyf2upper beautyflowerdiv beautyfupperdiv
## 1       0          1        1  36             6               5               7
## 2       1          2        0  59             2               4               4
## 3       1          3        0  51             5               5               2
## 4       1          4        0  40             4               2               5
## 5       0          5        0  31             9               7               9
## 6       1          6        0  62             5               6               6
##   beautym2upper beautymlowerdiv beautymupperdiv  btystdave  btystdf2u
## 1             6               2               4  0.2015666  0.2893519
## 2             3               2               3 -0.8260813 -1.6193560
## 3             3               2               3 -0.6603327 -0.1878249
## 4             2               3               3 -0.7663125 -0.6650018
## 5             6               7               6  1.4214450  1.7208830
## 6             6               5               5  0.5002196 -0.1878249
##     btystdfl   btystdfu  btystdm2u   btystdml   btystdmu class1 class2 class3
## 1  0.4580018  0.8758139  0.6817153 -0.9000649 -0.1954181      0      0      1
## 2 -0.0735065 -0.5770065 -1.1319040 -0.9000649 -0.6546507      0      0      0
## 3  0.4580018 -1.5455530 -1.1319040 -0.9000649 -0.6546507      0      0      0
## 4 -1.1365230 -0.0927330 -1.7364440 -0.3125226 -0.6546507      0      1      0
## 5  1.5210190  1.8443610  0.6817153  2.0376470  0.7230470      0      0      0
## 6  0.9895102  0.3915404  0.6817153  0.8625621  0.2638144      0      0      0
##   class4 class5 class6 class7 class8 class9 class10 class11 class12 class13
## 1      0      0      0      0      0      0       0       0       0       0
## 2      0      0      0      0      0      0       0       0       0       0
## 3      1      0      0      0      0      0       0       0       0       0
## 4      0      0      0      0      0      0       0       0       0       0
## 5      0      0      0      0      0      0       0       0       0       0
## 6      0      0      0      0      0      0       0       0       0       0
##   class14 class15 class16 class17 class18 class19 class20 class21 class22
## 1       0       0       0       0       0       0       0       0       0
## 2       0       0       0       0       0       0       0       0       0
## 3       0       0       0       0       0       0       0       0       0
## 4       0       0       0       0       0       0       0       0       0
## 5       0       0       0       0       0       0       0       0       0
## 6       0       0       0       0       0       0       0       0       0
##   class23 class24 class25 class26 class27 class28 class29 class30
## 1       0       0       0       0       0       0       0       0
## 2       0       0       0       0       0       0       0       0
## 3       0       0       0       0       0       0       0       0
## 4       0       0       0       0       0       0       0       0
## 5       0       0       0       0       0       0       0       0
## 6       0       0       0       0       0       0       0       0
##   courseevaluation didevaluation female formal fulldept lower multipleclass
## 1              4.3            24      1      0        1     0             1
## 2              4.5            17      0      0        1     0             0
## 3              3.7            55      0      0        1     0             1
## 4              4.3            40      1      0        1     0             1
## 5              4.4            42      1      0        1     0             0
## 6              4.2           182      0      1        1     0             0
##   nonenglish onecredit percentevaluating profevaluation students tenuretrack
## 1          0         0          55.81395            4.7       43           1
## 2          0         0          85.00000            4.6       20           1
## 3          0         0         100.00000            4.1       55           1
## 4          0         0          86.95652            4.5       46           1
## 5          0         0          87.50000            4.8       48           1
## 6          0         0          64.53901            4.4      282           1
##   blkandwhite btystdvariance btystdavepos btystdaveneg
## 1           0      2.1298060     0.201567     0.000000
## 2           0      1.3860810     0.000000    -0.826081
## 3           0      2.5374350     0.000000    -0.660333
## 4           0      1.7605770     0.000000    -0.766312
## 5           0      1.6931000     1.421450     0.000000
## 6           0      0.9447419     0.500220     0.000000

(a) Run a regression using beauty (the variable btystdave) to predict course evaluations (courseevaluation), controlling for various other inputs. Display the fitted model graphically, and explain the meaning of each of the coefficients, along with the residual standard deviation. Plot the residuals versus fitted values.

lm_beauty <- lm(btystdave ~ courseevaluation, data = beauty_data)
summary(lm_beauty)
## 
## Call:
## lm(formula = btystdave ~ courseevaluation, data = beauty_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.42409 -0.62469 -0.09744  0.50828  2.13077 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -1.1626     0.2624  -4.431 1.17e-05 ***
## courseevaluation   0.2687     0.0650   4.133 4.25e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7753 on 461 degrees of freedom
## Multiple R-squared:  0.03574,    Adjusted R-squared:  0.03364 
## F-statistic: 17.08 on 1 and 461 DF,  p-value: 4.247e-05
lm_beauty2 <- lm(btystdave ~ courseevaluation + female + age, data = beauty_data)
summary(lm_beauty2)
## 
## Call:
## lm(formula = btystdave ~ courseevaluation + female + age, data = beauty_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.80374 -0.54913 -0.08697  0.45625  1.87299 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -0.167859   0.335388  -0.500    0.617    
## courseevaluation  0.265700   0.063100   4.211 3.06e-05 ***
## female            0.124219   0.073802   1.683    0.093 .  
## age              -0.021403   0.003684  -5.809 1.17e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7404 on 459 degrees of freedom
## Multiple R-squared:  0.1244, Adjusted R-squared:  0.1187 
## F-statistic: 21.73 on 3 and 459 DF,  p-value: 3.543e-13
par(mfrow=c(2,2))
plot(lm_beauty2)

Comments: For every one-unit increase in course evaluation score there is a 0.266 increase in evaluated beauty, holding the other predictors constant. Female instructors score 0.124 higher in evaluated beauty than male instructors, and every additional year of age is associated with a 0.021 decrease in evaluated beauty. The residual standard deviation of 0.74 means that roughly 68% of the beauty scores fall within 0.74 of their fitted values; since the model explains only about 12% of the variance, that residual spread is large relative to the variation in btystdave itself.
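That 68% interpretation can be checked directly against the fitted model (a sketch):

# proportion of residuals within one residual standard deviation of zero
mean(abs(resid(lm_beauty2)) < summary(lm_beauty2)$sigma)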

beta_hat <- coef(lm_beauty2)        # point estimates
beta_sim <- sim(lm_beauty2)@coef    # simulation draws of the coefficients from arm::sim()
par(mfrow = c(1, 3))

Plot course evaluation against beauty score

plot(beauty_data$courseevaluation, beauty_data$btystdave, pch = 10, xlab = "Course Evaluation Score", ylab = "Beauty Score")
for (i in 1:10){
  curve(cbind (1,x, mean(beauty_data$female), mean(beauty_data$age)) %*% beta_sim[i,], lwd = .5, col = '#00bfaf', add = TRUE)
}
curve (cbind (1, x, mean(beauty_data$female), mean(beauty_data$age)) %*% beta_hat, col="#ff0081", add=TRUE)

Plot sex against beauty score

plot(beauty_data$female, beauty_data$btystdave, pch = 10, xlab = "Sex", ylab = "Beauty Score")
for (i in 1:10){
  curve(cbind (1, mean(beauty_data$courseevaluation), x, mean(beauty_data$age)) %*% beta_sim[i,], lwd = .5, col = '#00bfaf', add = TRUE)
}
curve (cbind (1, mean(beauty_data$courseevaluation),x, mean(beauty_data$age)) %*% beta_hat, col="#ff0081", add=TRUE)

Plot age against beauty score

plot(beauty_data$age, beauty_data$btystdave, pch = 10, xlab = "Age", ylab = "Beauty Score") 
for (i in 1:10){
  curve(cbind(1, mean(beauty_data$courseevaluation), mean(beauty_data$female), x) %*% beta_sim[i,], lwd = .5, col = '#00bfaf', add = TRUE)
}
curve(cbind(1, mean(beauty_data$courseevaluation), mean(beauty_data$female), x) %*% beta_hat, col = "#ff0081", add = TRUE)

(b) Fit some other models, including beauty and also other input variables. Consider at least one model with interactions. For each model, state what the predictors are, and what the inputs are (see Section 3.7), and explain the meaning of each of its coefficients.

lm_beauty3 <- lm(btystdave ~ courseevaluation * female + age, data = beauty_data)
summary(lm_beauty3)
## 
## Call:
## lm(formula = btystdave ~ courseevaluation * female + age, data = beauty_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.80938 -0.54166 -0.09022  0.42178  1.82833 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             -0.512488   0.387154  -1.324   0.1863    
## courseevaluation         0.356584   0.081244   4.389 1.41e-05 ***
## female                   1.020189   0.511609   1.994   0.0467 *  
## age                     -0.021899   0.003686  -5.940 5.63e-09 ***
## courseevaluation:female -0.226481   0.127976  -1.770   0.0774 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7387 on 458 degrees of freedom
## Multiple R-squared:  0.1303, Adjusted R-squared:  0.1227 
## F-statistic: 17.16 on 4 and 458 DF,  p-value: 4.001e-13

Comment: For male instructors (female = 0), every one-unit increase in course evaluation is associated with a 0.357 increase in evaluated beauty. The female coefficient (1.02) is the predicted difference between female and male instructors when course evaluation equals zero, which is an extrapolation outside the observed range. Every additional year of age is associated with a 0.022 decrease in evaluated beauty. The courseevaluation:female interaction (-0.226) means the course evaluation slope is 0.226 smaller for female instructors than for male instructors, but it is not statistically significant at the 5% level (p = 0.077).
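The interaction can also be expressed as two separate slopes, one per group (a sketch):

b <- coef(lm_beauty3)
b["courseevaluation"]                                  # slope for male instructors (female = 0)
b["courseevaluation"] + b["courseevaluation:female"]   # slope for female instructors (female = 1)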

lm_beauty4 <- lm(btystdave ~ courseevaluation * age + female, data = beauty_data)
summary(lm_beauty4)
## 
## Call:
## lm(formula = btystdave ~ courseevaluation * age + female, data = beauty_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.82375 -0.55055 -0.08067  0.44817  1.80452 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)   
## (Intercept)           2.259577   1.268005   1.782  0.07541 . 
## courseevaluation     -0.347133   0.315136  -1.102  0.27124   
## age                  -0.070473   0.024997  -2.819  0.00502 **
## female                0.132078   0.073674   1.793  0.07367 . 
## courseevaluation:age  0.012392   0.006244   1.985  0.04779 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.738 on 458 degrees of freedom
## Multiple R-squared:  0.1318, Adjusted R-squared:  0.1243 
## F-statistic: 17.39 on 4 and 458 DF,  p-value: 2.712e-13
par(mfrow=c(2,2))
plot(lm_beauty3)

Comments: In this model the courseevaluation:age interaction is positive and statistically significant at the 5% level (estimate 0.012, p = 0.048), suggesting the association between course evaluation and beauty is slightly stronger for older instructors.

We will now fit a model which, compared to lm_beauty3, adds two more input variables (blkandwhite and formal). As we will see, this model, although slightly more complicated, explains more of the variance of the outcome variable.

lm_beauty5 <- lm(btystdave ~ courseevaluation * female + age + blkandwhite + formal, data = beauty_data)
summary(lm_beauty5)
## 
## Call:
## lm(formula = btystdave ~ courseevaluation * female + age + blkandwhite + 
##     formal, data = beauty_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.6987 -0.5367 -0.0485  0.3941  1.8714 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             -0.270223   0.370255  -0.730  0.46587    
## courseevaluation         0.314151   0.077555   4.051 6.00e-05 ***
## female                   1.356260   0.490874   2.763  0.00596 ** 
## age                     -0.025580   0.003569  -7.167 3.11e-12 ***
## blkandwhite              0.552726   0.092295   5.989 4.29e-09 ***
## formal                   0.253208   0.090476   2.799  0.00535 ** 
## courseevaluation:female -0.330770   0.123321  -2.682  0.00758 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.703 on 456 degrees of freedom
## Multiple R-squared:  0.2158, Adjusted R-squared:  0.2054 
## F-statistic: 20.91 on 6 and 456 DF,  p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(lm_beauty5)

Comments: Adding blkandwhite and formal increased the adjusted R-squared from 0.123 to 0.205, so these additional inputs noticeably improve the fit of the model.
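Since lm_beauty5 simply adds blkandwhite and formal to lm_beauty3, the two models are nested and can also be compared with an F-test (a sketch):

# formal comparison of the nested models
anova(lm_beauty3, lm_beauty5)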