Welcome to week 2!
HW questions will once again be split into conceptual and technical.
1. List the five assumptions of regression models. (Hint: 3 of them relate to error, and linearity is not one of them) Answer: NHST of regression models is based on a theoretical sampling distribution of error. We make five assumptions about this distribution: 1. The distribution is normal with a mean of zero and a standard deviation sigma. 2. The errors are independent. 3. Homoscedasticity: the error variance is constant at each value of X. 4. There is no measurement error in the regressors; measurement error biases the estimates. 5. Correct model specification: all causes of Y that are correlated with an X in the model are included in the model.
2. What was the motivation for creating an adjusted R2? Answer: R2 is an upwardly biased estimate of the population parameter. Adjusted R2 corrects for this by penalizing the number of predictors relative to the sample size, which gives a more conservative estimate.
3. Explain how a p-value is obtained for a b-value. (Hint: We can calculate the ___ of a b and then that allows us to calculate a ___ and from that we can get a p-value).
Answer: A p-value is obtained for a b-value by conducting a t-test. The t-statistic is obtained by subtracting the value of b under the null hypothesis (usually zero) from the estimated b-value and dividing by the standard error of b. The t-statistic is then compared to a t-distribution with N-k-1 degrees of freedom, where N is the total sample size and k is the number of predictors.
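The steps in this answer can be sketched numerically. The snippet below is only an illustration and is not part of the original homework code: it plugs in the age estimate and standard error reported in the part a regression output of this document (N = 25, k = 1) and assumes a null value of zero. The variable names are made up for the example.

```r
# Values copied from the summary(lm(vote ~ age)) output in part a
b      <- 0.099435   # estimated slope for age
b_null <- 0          # value of b under the null hypothesis (assumed to be 0)
se_b   <- 0.003356   # standard error of b
n      <- 25         # total sample size
k      <- 1          # number of predictors

t_stat  <- (b - b_null) / se_b        # t = (b - b0) / SE(b)
df      <- n - k - 1                  # degrees of freedom
p_value <- 2 * pt(-abs(t_stat), df)   # two-tailed p-value from the t-distribution

t_stat
df
p_value
```

The t value should land very close to the 29.63 that R reports; any small difference comes from rounding the inputs.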
4. What factors impact the standard errors of b-values and in which direction (according to Dan, not the book)? Answer: Standard errors of b-values increase when multicollinearity is large, the number of predictors is large, and sample size is small.
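A small simulation can make two of these factors concrete. This is a sketch on made-up data, not the homework data; the variable names, the seed, and the effect sizes are all assumptions chosen for illustration. It compares the standard error of one slope when the predictor is alone, when a nearly redundant second predictor is added, and when the sample is cut down.

```r
set.seed(1)                      # make the simulated data reproducible
n  <- 100
x1 <- rnorm(n)                   # predictor of interest
x2 <- x1 + rnorm(n, sd = 0.1)    # second predictor, almost a copy of x1
y  <- x1 + rnorm(n)              # outcome depends on x1 only

# Standard error of the x1 slope without and with the collinear predictor
se_alone     <- summary(lm(y ~ x1))$coefficients["x1", "Std. Error"]
se_collinear <- summary(lm(y ~ x1 + x2))$coefficients["x1", "Std. Error"]

# Same single-predictor model, but fit to only the first 20 cases
small    <- 1:20
se_small <- summary(lm(y[small] ~ x1[small]))$coefficients[2, "Std. Error"]

c(alone = se_alone, collinear = se_collinear, small_n = se_small)
```

Both the collinear model and the small-sample model should show a larger standard error for x1 than the single-predictor model fit to all 100 cases.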
The first thing we should do is create a chunk for all the packages we will need.
library(psych)
library(ggplot2) #this is a very powerful library for creating plots. I sent you all a GUI (i.e. click and drag) version of this as well.
##
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
##
## %+%, alpha
library(lmSupport)
## Registered S3 methods overwritten by 'lme4':
## method from
## cooks.distance.influence.merMod car
## influence.merMod car
## dfbeta.influence.merMod car
## dfbetas.influence.merMod car
And another chunk to load the csv.
Setting Working Directory
Make sure your working directory is the folder that contains this script and the csv! You can do this by clicking Session -> Set Working Directory and choosing the folder your script and csv are saved in. You can then copy the command that appears in your console (something like setwd("~/Desktop/PSYC 212/212:2021/Week_2")) into the chunk above so that it runs automatically every time you run this script.
#note how I named the chunk above. this will show up when I knit this file and can be helpful for organization.
data<-read.csv("homework2data.csv", header=TRUE, stringsAsFactors = FALSE, na.strings=c("","NA")) #once you load it, you should see it in your environment.
#when you load a csv, you often want to include some arguments to make sure it handles missing data correctly and does not silently convert strings to factors. We can talk about this in discussion. But for sanity I would recommend pretty much always using these arguments when you load data.
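To see what those arguments do, here is a minimal sketch on a tiny made-up file. The file contents and the object name toy are hypothetical and have nothing to do with homework2data.csv; the point is only to show how blank cells and "NA" cells are both read as missing, and how string columns stay as character.

```r
# Write a tiny, made-up csv to a temporary file (hypothetical data)
tmp <- tempfile(fileext = ".csv")
writeLines(c("id,group,score",
             "1,a,3",
             "2,,NA",
             "3,b,"), tmp)

toy <- read.csv(tmp, header = TRUE, stringsAsFactors = FALSE,
                na.strings = c("", "NA"))

str(toy)              # group stays character, score is read as a number
colSums(is.na(toy))   # blank cells and "NA" cells are both counted as missing
```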
a. Estimate a regression equation using R predicting voting interest from age and include the output below. Also, write out the resulting regression equation.
#Let's get to regression!
#This is predicting voting interest from age
age <- lm(vote~age, data)
#We can create an object from the summary, that way we can extract some items from our summary
agesummary <- summary(age)
agesummary
##
## Call:
## lm(formula = vote ~ age, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.51205 -0.19679 -0.00922 0.17269 0.49926
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.037478 0.161532 0.232 0.819
## age 0.099435 0.003356 29.631 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2832 on 23 degrees of freedom
## Multiple R-squared: 0.9745, Adjusted R-squared: 0.9734
## F-statistic: 878 on 1 and 23 DF, p-value: < 2.2e-16
#We can take a look at the R-squared and the residuals
agesummary$r.squared
## [1] 0.9744722
agesummary$residuals
## 1 2 3 4 5 6
## -0.009219308 0.172694892 -0.026174745 0.476651161 0.482302973 -0.318827389
## 7 8 9 10 11 12
## -0.003567496 0.499258410 -0.196785321 0.007736129 0.189650329 -0.512045214
## 13 14 15 16 17 18
## -0.208088946 -0.108654127 0.101519135 0.002084317 -0.213740758 0.183998517
## 19 20 21 22 23 24
## 0.090215510 -0.014871121 -0.412610396 0.382868154 -0.125609564 -0.423914021
## 25
## -0.014871121
Answer: Y-hat = .04 + .1X
b. We want to know how much voting interest will change for each additional year of age? What coefficient are we looking for and how do we interpret it? Answer: We would be interested in the coefficient (b) of age, which is .1 (rounded). This suggests that for every year increase in age, there is a .1 increase in voting interest.
c. We want to know how much voting interest a person will have at 0 years old? What term are we looking for and how would you interpret it? Answer: We would be interested in the intercept, which would be .04. This suggests that at 0 years old, someone would have a voting interest of .04.
d. Is the value in part c useful interpretively? Answer: This is not useful interpretively, as someone would not have voting interest at 0 years old.
e. How does the model fit overall? Is this a good model? Support your answer with some kind of statistic. Answer: Model fit is strong. Specifically, age explains 97% of the variation in voting interest [Adj. R-squared = .97, F(1, 23) = 878, p < .01].
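The fit statistics quoted here can be reproduced by hand from the multiple R-squared. The sketch below is an illustration with made-up variable names: it copies R-squared, N, and k from the part a output and applies the usual formulas for the overall F test and for adjusted R-squared.

```r
# Values copied from the part a output
r2 <- 0.9744722   # multiple R-squared
n  <- 25          # sample size
k  <- 1           # number of predictors

f_stat <- (r2 / k) / ((1 - r2) / (n - k - 1))    # F = (R2/k) / ((1-R2)/(N-k-1))
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - k - 1)   # adjusted R-squared

f_stat
adj_r2
pf(f_stat, k, n - k - 1, lower.tail = FALSE)     # p-value for the overall F test
```

Up to rounding, f_stat and adj_r2 should agree with the F-statistic and Adjusted R-squared lines of the summary output.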
f. Calculate the coefficients of a standardized equation of this model in R, include the output below, write out the full equation, and interpret all components. Answer: Standardized equation is below.
g. Oops. We jumped into the analysis. What should we have looked at first?
#we forgot to make a scatterplot
plot(data$vote~data$age)
Answer: Performing data visualization ahead of creating models or running statistical tests will allow us to see patterns in the data, as well as potential outliers.
h. What else might we like to do in this situation to resolve an earlier problem with the unstandardized equation? (Hint: think about cynical infants). From now on, we will almost always want to do this.
#centering our age variable
data$age_c <- scale(data$age, scale=FALSE)
Answer: Centering data so that values on the x-axis represent differences from mean X may make our model more interpretable, particularly the intercept.
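To see what centering does and does not change, here is a short sketch on simulated data. The variables x and y are hypothetical stand-ins, not the homework variables. Note that scale() returns a one-column matrix; wrapping it in as.numeric() is an optional extra step used here to keep a plain numeric vector.

```r
set.seed(2)                          # reproducible, made-up data
x <- round(runif(30, 18, 80))        # hypothetical ages
y <- 0.1 * x + rnorm(30, sd = 0.3)   # hypothetical outcome

x_c <- as.numeric(scale(x, scale = FALSE))   # center: subtract the mean of x

fit_raw      <- lm(y ~ x)
fit_centered <- lm(y ~ x_c)

coef(fit_raw)        # intercept = predicted y when x is 0
coef(fit_centered)   # intercept = predicted y at the mean of x; slope is unchanged
mean(y)              # matches the centered intercept
```

The slope and the model fit are identical in the two models; only the meaning of the intercept changes.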
i. Estimate the coefficients of a regular regression equation and the standardized equation predicting voting interest from CENTERED age in R, include the output below, write both the equations, and interpret all the parts.
#running the regression with age centered
centered <- lm(vote~age_c, data)
#again creating an object from our summary
centeredsum <-summary(centered)
centeredsum
##
## Call:
## lm(formula = vote ~ age_c, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.51205 -0.19679 -0.00922 0.17269 0.49926
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.520000 0.056632 79.81 <2e-16 ***
## age_c 0.099435 0.003356 29.63 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2832 on 23 degrees of freedom
## Multiple R-squared: 0.9745, Adjusted R-squared: 0.9734
## F-statistic: 878 on 1 and 23 DF, p-value: < 2.2e-16
#getting our adjusted r-squared value from the centered model
centeredsum$adj.r.squared
## [1] 0.9733623
#Let's z-score our variables now
data$vote_z <- scale(data$vote)
data$age_z <- scale(data$age)
#Running the regression with our standardized variables now
standard <- lm(vote_z~age_z, data)
#outputting the betas
summary(standard)
##
## Call:
## lm(formula = vote_z ~ age_z, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.295138 -0.113425 -0.005314 0.099540 0.287768
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.677e-16 3.264e-02 0.00 1
## age_z 9.872e-01 3.332e-02 29.63 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1632 on 23 degrees of freedom
## Multiple R-squared: 0.9745, Adjusted R-squared: 0.9734
## F-statistic: 878 on 1 and 23 DF, p-value: < 2.2e-16
Answer: Regular regression equation: Y-hat = 4.52 + .1X
For every one-year increase in age, predicted voting interest increases by .1. The predicted voting interest at the mean age is 4.52.
Standardized regression equation: Zy = .99(Zx)
For every 1 standard deviation increase in age, predicted voting interest increases by .99 standard deviations.
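The link between the unstandardized and standardized slopes can be checked directly. The sketch below uses simulated data with hypothetical variable names; it shows that the slope from z-scored variables equals the raw slope rescaled by the two standard deviations and, with a single predictor, also equals the correlation.

```r
set.seed(3)                          # reproducible, made-up data
x <- rnorm(40, mean = 45, sd = 15)   # hypothetical predictor
y <- 0.1 * x + rnorm(40)             # hypothetical outcome

zx <- as.numeric(scale(x))           # z-score the predictor
zy <- as.numeric(scale(y))           # z-score the outcome

b    <- unname(coef(lm(y ~ x))["x"])       # unstandardized slope
beta <- unname(coef(lm(zy ~ zx))["zx"])    # standardized slope

beta
b * sd(x) / sd(y)   # same value: b rescaled by the two standard deviations
cor(x, y)           # same value again when there is only one predictor
```

In the homework output the same pattern holds: the standardized slope of .987 is the square root of the multiple R-squared of .9745.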
j. Now estimate the coefficients of a regression equation in R predicting voting interest from age and education, include the output below, and write the equation. Use all centered predictors.
#centering our education and voting variables now
data$education_c <- scale(data$education, scale=FALSE)
data$vote_c <- scale(data$vote, scale=FALSE)
#this is our model with everything centered and multiple predictors
#to add multiple predictors, we only need a + between predictors
multiplecentered <- lm(vote_c~age_c + education_c, data)
summary(multiplecentered)
##
## Call:
## lm(formula = vote_c ~ age_c + education_c, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.53281 -0.18523 -0.01487 0.17027 0.50181
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.584e-16 5.781e-02 0.000 1.000000
## age_c 9.320e-02 2.385e-02 3.908 0.000755 ***
## education_c 5.067e-02 1.919e-01 0.264 0.794190
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2891 on 22 degrees of freedom
## Multiple R-squared: 0.9746, Adjusted R-squared: 0.9722
## F-statistic: 421.3 on 2 and 22 DF, p-value: < 2.2e-16
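Answer: Y-hat = .09X1 + .05X2 (the intercept is 0 because the outcome and both predictors are centered)
k. Interpret all the important information for describing this model as well as the predictors in the model. Answer: The model as a whole explains 97% of the variance in voter interest [Adj. R-squared = .97, F(2, 22) = 421.3, p < .01]. For every one-unit increase in age, there is a .09 increase in voter interest, holding education constant. For every one-unit increase in education, there is a .05 increase in voter interest, holding age constant.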
l. Does education significantly predict interest in voting above and beyond what age can explain? Explain how you know. Answer: No, education is not a significant predictor of voting interest (p=.79)
m. Does age predict interest in voting above and beyond what education can explain? Explain. Answer: Yes, age is a significant, positive predictor of voting interest (p < .01)
n. Now we will conduct a hierarchical linear regression analysis. Create two regression models and analyze them hierarchically (i.e., in sequence). Use voting interest predicted from (1) only age on the first step and (2) age and education on the second step. Include R output for this below.
age <- lm(vote~age, data) #just age
multiple <- lm(vote~age+education ,data) #age and education
#the "anova" command will compare the residual sums of squares of the two models and test whether the larger model is doing significantly better
anova(age, multiple)
## Analysis of Variance Table
##
## Model 1: vote ~ age
## Model 2: vote ~ age + education
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 23 1.8441
## 2 22 1.8383 1 0.0058264 0.0697 0.7942
Answer: Adding education to the model does not significantly increase the variance explained in voter interest, F(1, 22) = .07, p = .79.
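The F in this answer can be rebuilt from the pieces of the anova table. The sketch below is an illustration with made-up variable names: it copies the sum of squares, residual sum of squares, and degrees of freedom from the anova(age, multiple) output and applies the standard model-comparison formula.

```r
# Values copied from the anova(age, multiple) table
ss_change <- 0.0058264   # drop in residual sum of squares when education is added
rss_full  <- 1.8383      # residual sum of squares of the larger model
df_change <- 1           # number of predictors added
df_full   <- 22          # residual degrees of freedom of the larger model

f_change <- (ss_change / df_change) / (rss_full / df_full)
p_change <- pf(f_change, df_change, df_full, lower.tail = FALSE)

f_change
p_change
```

Up to rounding, these should match the F and Pr(>F) columns in the table.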
o. Compare the significance of the t statistic for education in part 1j to the significance of ΔR (see Pr(>F)) from what you just did in part 1n. Comment on the relationship between these two p-values.
#change in R2 and its square root
chgR2 = summary(multiplecentered)$r.squared - centeredsum$r.squared
chgR2
## [1] 8.065404e-05
sqrt(chgR2)
## [1] 0.008980759
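Answer: The amount of additional variance in voter interest explained by adding education to the model is minuscule (8.07 x 10^-5). The p-value of the t-statistic for the education coefficient (t = .26, p = .79) is identical to the p-value for the R change, ΔR (p = .79). When a single predictor is added, the F for the change equals the square of that predictor's t (.264^2 = .0697), so the two tests are equivalent.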
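The equivalence between the two p-values is general, and it can be checked on any pair of nested models that differ by one predictor. The sketch below uses simulated data; the variable names, seed, and effect sizes are assumptions chosen only for illustration.

```r
set.seed(4)                       # reproducible, made-up data
n  <- 50
x1 <- rnorm(n)
x2 <- 0.5 * x1 + rnorm(n)         # hypothetical second predictor
y  <- x1 + 0.2 * x2 + rnorm(n)

step1 <- lm(y ~ x1)               # first step: x1 only
step2 <- lm(y ~ x1 + x2)          # second step: add x2

coefs <- summary(step2)$coefficients
t_x2  <- coefs["x2", "t value"]   # t statistic for the added predictor
p_t   <- coefs["x2", "Pr(>|t|)"]

cmp      <- anova(step1, step2)   # hierarchical comparison of the two steps
f_change <- cmp$F[2]
p_f      <- cmp$`Pr(>F)`[2]

c(t_squared = t_x2^2, F_change = f_change)
c(p_from_t = p_t, p_from_F = p_f)
```

The squared t for the added predictor equals the F for the change, and the two p-values agree.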