Welcome to week 2!
HW questions will once again be split into conceptual and technical.
1. List the five assumptions of regression models. (Hint: 3 of them relate to error, and linearity is not one of them) Answer: (1) no measurement error in the regressors, (2) the model is correctly specified, (3) homoscedasticity, (4) errors are independent, and (5) errors are normally distributed.
2. What was the motivation for creating an adjusted R2? Answer: R2 is an upwardly biased estimator of the population R2; the adjustment corrects for this bias (illustrated in the sketch after question 4).
3. Explain how a p-value is obtained for a b-value. (Hint: We can calculate the ___ of a b, which allows us to calculate a ___, and from that we can get a p-value.)
Answer: We calculate the standard error of a specific b-value; dividing b by its standard error gives a t-value, which we compare against a t distribution with N - k - 1 degrees of freedom to obtain the p-value (see the sketch after question 4).
4. What factors impact the standard errors of b-values and in which direction (according to Dan, not the book)? Answer: The standard errors of b-values are impacted by the number of participants and the number of groups (predictors), as well as R2_Y: more participants and a larger R2_Y shrink the standard errors, while more groups inflate them.
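To make questions 2 and 3 concrete, here is a minimal numerical sketch using values from the age model fit later in this script (b = 0.099435, SE = 0.003356, R2 = 0.9745, N = 25, k = 1):
#question 2: adjusted R2 corrects the upward bias in R2
r2 <- 0.9745; N <- 25; k <- 1
1 - (1 - r2) * (N - 1) / (N - k - 1) #~0.9734, matching the summary() output below
#question 3: standard error -> t-value -> p-value
b <- 0.099435; se <- 0.003356
t <- b / se #~29.63, matching the t value in the summary() output below
2 * pt(-abs(t), df = N - k - 1) #two-tailed p-value; effectively 0 (<2e-16)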
The first thing we should do is create a chunk for all the packages we will need.
library(psych)
library(ggplot2) #this is a very powerful library for creating plots. I sent you all a GUI (i.e. click and drag) version of this as well.
##
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
##
## %+%, alpha
library(lmSupport)
## Warning: package 'lmSupport' was built under R version 4.0.3
## Registered S3 methods overwritten by 'lme4':
## method from
## cooks.distance.influence.merMod car
## influence.merMod car
## dfbeta.influence.merMod car
## dfbetas.influence.merMod car
And another chunk to load the csv.
Setting Working Directory: Make sure your working directory is the folder containing this script and the csv! You can do this by clicking Session -> Set Working Directory -> and choosing the folder your script and csv are saved in. You can then copy the command that appears in your console (something like setwd("~/Desktop/PSYC 212/212:2021/Week_2")) into the setwd() chunk so that every time you run this script it sets the directory automatically.
setwd( "C:/Users/Nina/Desktop/Essays/UC Riverside/Multiple Correlation & Regression Analysis/Homework")
#note how I named the chunk above. this will show up when I knit this file and can be helpful for organization.
data<-read.csv("homework2data.csv", header=T, stringsAsFactors = FALSE, na.strings=c("","NA")) #once you load it, you should see it in your environment.
#when you load a csv, you often want to include some arguments to make sure it handles missing data correctly and does not silently convert strings to factors. We can talk about this in discussion, but for sanity I would recommend pretty much always using these arguments when you load data.
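A quick sanity check right after loading can catch problems early. A minimal sketch, assuming the columns used later in this script (vote, age, education):
#confirm column types and look for missing values
str(data) #vote, age, and education should be numeric
colSums(is.na(data)) #count of NAs per column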
a. Estimate a regression equation using R predicting voting interest from age and include the output below. Also, write out the resulting regression equation.
#Let's get to regression!
#This is predicting voting interest from age
age <- lm(vote~age, data)
#We can create an object from the summary, that way we can extract some items from our summary
agesummary <- summary(age)
agesummary
##
## Call:
## lm(formula = vote ~ age, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.51205 -0.19679 -0.00922 0.17269 0.49926
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.037478 0.161532 0.232 0.819
## age 0.099435 0.003356 29.631 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2832 on 23 degrees of freedom
## Multiple R-squared: 0.9745, Adjusted R-squared: 0.9734
## F-statistic: 878 on 1 and 23 DF, p-value: < 2.2e-16
#We can take a look at the R-squared and the residuals
agesummary$r.squared
## [1] 0.9744722
agesummary$residuals
## 1 2 3 4 5 6
## -0.009219308 0.172694892 -0.026174745 0.476651161 0.482302973 -0.318827389
## 7 8 9 10 11 12
## -0.003567496 0.499258410 -0.196785321 0.007736129 0.189650329 -0.512045214
## 13 14 15 16 17 18
## -0.208088946 -0.108654127 0.101519135 0.002084317 -0.213740758 0.183998517
## 19 20 21 22 23 24
## 0.090215510 -0.014871121 -0.412610396 0.382868154 -0.125609564 -0.423914021
## 25
## -0.014871121
Answer: The resulting regression equation is predicted voting interest = 0.037 + 0.099 x age (output above).
b. We want to know how much voting interest will change for each additional year of age. What coefficient are we looking for and how do we interpret it? Answer: We are looking for b, the slope for age. For each additional year of age, voting interest increases by 0.099 units.
c. We want to know how much voting interest a person will have at 0 years old. What term are we looking for and how would you interpret it? Answer: We are looking at the intercept. An individual of age 0 is expected to have a voting interest of 0.037 units.
d. Is the value in part C useful interpretively? Answer: The value is not useful, as individuals of age 0 cannot vote.
e. How does the model fit overall? Is this a good model? Support your answer with some kind of statistic. Answer: The model is a very good fit: the R2 and adjusted R2 values (.9745 and .9734) are very close to one, and the overall test is significant, F(1, 23) = 878, p < .001.
f. Calculate the coefficients of a standardized equation of this model in R, include the output below, write out the full equation, and interpret all components. Answer: With a single predictor, the standardized slope is beta = b x (s_age/s_vote) = sqrt(R2) ≈ 0.987, so the standardized equation is predicted z_vote = 0.987 x z_age. Its standard error follows SE(beta) = sqrt(((1 - R2_Y)/(N - k - 1)) x (1/(1 - R2_j))):
#standard error of the standardized slope: N = 25 participants, k = 1 predictor, and R2_j = 0 with a single predictor
SE_beta <- sqrt(((1 - 0.9745)/(25 - 1 - 1)) * (1/(1 - 0)))
SE_beta
## [1] 0.03329708
R2_Y is the coefficient of determination, N is the total number of participants, k is the number of predictors (groups/sets), and R2_j is the R2 from regressing predictor j on the other predictors (0 here, since there is only one predictor). Beta gives us the difference in standard units of Y for a 1 SD difference in X.
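The standardized slope itself can also be pulled straight from the unstandardized fit. A sketch, assuming the age model object created above:
#beta = b x (SD of predictor / SD of outcome); with one predictor this equals r
coef(age)["age"] * sd(data$age) / sd(data$vote) #~0.987, matching the age_z slope in part i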
g. Oops. We jumped in to the analysis. What should we have looked at first?
#we forgot to make a scatterplot
plot(data$vote~data$age)
Answer: We should have plotted the data first to confirm that the relationship looks linear and to check for outliers before fitting the model.
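As an aside, ggplot2 (loaded above) can draw the same scatterplot with a fitted line. A sketch, assuming the same data frame and column names:
#scatterplot of voting interest against age, with the least-squares line added
ggplot(data, aes(x = age, y = vote)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(x = "Age", y = "Voting interest")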
h. What else might we like to do in this situation to resolve an earlier problem with the unstandardized equation? (Hint: think about cynical infants). From now on, we will almost always want to do this.
#centering our age variable
data$age_c <- scale(data$age, scale=FALSE)
Answer: We want to center age so that the intercept becomes meaningful: it is then the expected voting interest at the mean age rather than at age 0.
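Note that scale() returns a one-column matrix. An equivalent way to center that keeps a plain numeric vector, shown here with a hypothetical age_c2 column so nothing above is overwritten:
#centering by hand: subtract the mean from every value
data$age_c2 <- data$age - mean(data$age, na.rm = TRUE)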
i. Estimate the coefficients of a regular regression equation and the standardized equation predicting voting interest from CENTERED age in R, include the output below, write both the equations, and interpret all the parts.
#running the regression with age centered
centered <- lm(vote~age_c, data)
#again creating an object from our summary
centeredsum <-summary(centered)
centeredsum
##
## Call:
## lm(formula = vote ~ age_c, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.51205 -0.19679 -0.00922 0.17269 0.49926
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.520000 0.056632 79.81 <2e-16 ***
## age_c 0.099435 0.003356 29.63 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2832 on 23 degrees of freedom
## Multiple R-squared: 0.9745, Adjusted R-squared: 0.9734
## F-statistic: 878 on 1 and 23 DF, p-value: < 2.2e-16
#getting our adjusted r-squared value from the centered model
centeredsum$adj.r.squared
## [1] 0.9733623
#Let's z-score our variables now
data$vote_z <- scale(data$vote)
data$age_z <- scale(data$age)
#Running the regression with our standardized variables now
standard <- lm(vote_z~age_z, data)
#outputting the betas
summary(standard)
##
## Call:
## lm(formula = vote_z ~ age_z, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.295138 -0.113425 -0.005314 0.099540 0.287768
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.677e-16 3.264e-02 0.00 1
## age_z 9.872e-01 3.332e-02 29.63 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1632 on 23 degrees of freedom
## Multiple R-squared: 0.9745, Adjusted R-squared: 0.9734
## F-statistic: 878 on 1 and 23 DF, p-value: < 2.2e-16
Answer: Unstandardized (centered) equation: predicted vote = 4.52 + 0.099 x age_c. Standardized equation: predicted z_vote = 0.987 x z_age. The standard errors use the same elements as before: SE(b_j) = (s_Y/s_j) x sqrt(((1 - R2_Y)/(N - k - 1)) x (1/(1 - R2_j))) and SE(beta_j) = sqrt(((1 - R2_Y)/(N - k - 1)) x (1/(1 - R2_j))), where s_Y is the SD of the outcome and s_j is the SD of predictor j.
The intercept (4.52) is the expected voting interest at the mean age. For each additional year of age, voting interest increases by 0.099 units; equivalently, for each 1 SD difference in age there is a 0.987 SD difference in voting interest. The model is a good fit (adjusted R2 = 0.9734) and the regression is significant (p < .001).
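To connect question 4's formula to this output, here is a sketch that reproduces the standard error of the age slope from the pieces above (it assumes the centeredsum object and that R2_j = 0 with a single predictor):
#SE(b) = (s_Y/s_j) x sqrt((1 - R2_Y)/(N - k - 1)) x sqrt(1/(1 - R2_j))
N <- nrow(data); k <- 1
r2 <- centeredsum$r.squared
(sd(data$vote) / sd(data$age)) * sqrt((1 - r2) / (N - k - 1)) #~0.003356, matching the SE for age_c above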
j. Now estimate the coefficients of a regression equation in R predicting voting interest from age and education, include the output below, and write the equation. Use all centered predictors.
#centering our education and voting variables now
data$education_c <- scale(data$education, scale=FALSE)
data$vote_c <- scale(data$vote, scale=FALSE)
#this is our model with everything centered and multiple predictors
#to add multiple predictors, we only need a + between predictors
multiplecentered <- lm(vote_c~age_c + education_c, data)
summary(multiplecentered)
##
## Call:
## lm(formula = vote_c ~ age_c + education_c, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.53281 -0.18523 -0.01487 0.17027 0.50181
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.584e-16 5.781e-02 0.000 1.000000
## age_c 9.320e-02 2.385e-02 3.908 0.000755 ***
## education_c 5.067e-02 1.919e-01 0.264 0.794190
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2891 on 22 degrees of freedom
## Multiple R-squared: 0.9746, Adjusted R-squared: 0.9722
## F-statistic: 421.3 on 2 and 22 DF, p-value: < 2.2e-16
Answer: The resulting equation is predicted vote_c = 0.093 x age_c + 0.051 x education_c (the intercept is effectively 0 because everything is centered).
k. Interpret all the important information for describing this model as well as the predictors in the model. Answer: Controlling for education, each additional year of age predicts a 0.093-unit increase in voting interest; controlling for age, each additional unit of education predicts a 0.051-unit increase. The model fits the data well (R2 = .97). Age is a significant predictor (p < .001), but education is not (p = .79).
l. Does education significantly predict interest in voting above and beyond what age can explain? Explain how you know. Answer: Education is not a significant predictor above and beyond age (t = 0.26, p = .79 > .05).
m. Does age predict interest in voting above and beyond what education can explain? Explain. Answer: Age is a significant predictor of interest in voting above and beyond education (t = 3.91, p < .001).
n. Now we will conduct a hierarchical linear regression analysis. Create two regression models and analyze them hierarchically (i.e., in sequence). Use voting interest predicted from (1) only age on the first step and (2) age and education on the second step. Include R output for this below.
age <- lm(vote~age, data) #just age
multiple <- lm(vote~age+education, data) #age and education
#the anova() command will compare the residual sums of squares and test whether the larger model does significantly better
anova(age, multiple)
## Analysis of Variance Table
##
## Model 1: vote ~ age
## Model 2: vote ~ age + education
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 23 1.8441
## 2 22 1.8383 1 0.0058264 0.0697 0.7942
Answer: Adding education does not significantly improve prediction beyond age, F(1, 22) = 0.07, p = .794.
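Since lmSupport is already loaded, its modelCompare() function is an alternative to anova() that also reports the change in R2 directly. A sketch, assuming the age and multiple objects above:
#compare the compact (age only) and augmented (age + education) models
modelCompare(age, multiple)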
o. Compare the significance of the t statistic for education in part 1j to the significance of ΔR2 (see Pr(>F)) from what you just did in part 1n. Comment on the relationship between these two p-values.
#change in R2 and its square root
chgR2 = summary(multiplecentered)$r.squared - centeredsum$r.squared
chgR2
## [1] 8.065404e-05
sqrt(chgR2)
## [1] 0.008980759
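As a check on the relationship asked about here, we can square the t value for education from the multiple-regression summary. A sketch, assuming the multiple object from part 1n:
#with a single added predictor, the hierarchical F equals the squared t
t_edu <- summary(multiple)$coefficients["education", "t value"]
t_edu^2 #~0.0697, matching the F in the anova() table above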
Answer: The two p-values are identical (p = .794). When a single predictor is added, the F test of the change in R2 is equivalent to the t test of that predictor's coefficient (F = t^2: 0.264^2 ≈ 0.0697). The change in R2 itself is tiny (about .00008, with square root .009), consistent with education adding essentially no predictive information beyond age.