Welcome to week 2!

HW questions will once again be split into conceptual and technical.

CONCEPTS

1. List the five assumptions of regression models. (Hint: 3 of them relate to error, and linearity is not one of them) Answer: It’s assumed that (1) there is no measurement error, (2) the model is correctly specified, (3) the residuals are normally distributed, (4) the residuals are independent, and (5) the residuals have constant variance (homoscedasticity).

2. What was the motivation for creating an adjusted R2? Answer: Sample R2 overestimates the population value (adding predictors can only increase it), so adjusted R2 was created as a less biased estimate of the population parameter.

3. Explain how a p-value is obtained for a b-value. (Hint: We can calculate the ___ of a b and then that allows us to calculate a ___ and from that we can get a p-value).
Answer: We calculate the standard error of b, which lets us compute a t statistic (t = b divided by its standard error); comparing that t to a t distribution with the residual degrees of freedom gives us the p-value.
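If you want to see this mechanically, here is a minimal sketch with made-up data (x and y are hypothetical) showing that the p-value R reports is just a t-test on b against its standard error:

#a hypothetical example: reproduce the p-value for b by hand
set.seed(1)
x <- rnorm(30)
y <- 2 + 0.5*x + rnorm(30)
fit <- summary(lm(y ~ x))
b <- fit$coefficients["x", "Estimate"]
se <- fit$coefficients["x", "Std. Error"]
t <- b/se #the t statistic: how many standard errors b is from 0
2*pt(-abs(t), df = fit$df[2]) #two-tailed p; should match Pr(>|t|) in the summary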

4. What factors impact the standard errors of b-values and in which direction (according to Dan, not the book)? Answer: (1) measurement error in Y increases the standard error of b, which decreases power; (2) non-linearity that isn’t properly modeled will bias the standard error of b.
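To see point (1) in action, here is a small simulation sketch (all names made up) showing that measurement error in Y inflates the standard error of b:

#simulated demonstration: noisier Y -> larger standard error of b
set.seed(2)
x <- rnorm(100)
y <- 2 + 0.5*x + rnorm(100, sd = 0.5) #Y measured fairly cleanly
y_noisy <- y + rnorm(100, sd = 2) #the same Y with measurement error added
summary(lm(y ~ x))$coefficients["x", "Std. Error"] #smaller SE
summary(lm(y_noisy ~ x))$coefficients["x", "Std. Error"] #larger SE, so less power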

MULTIPLE REGRESSION

HW 2 prompt: Today we are interested in what makes people interested in voting. We collected information on age, education, and voting. The data and R scripts are on iLearn.

The first thing we should do is create a chunk for all the packages we will need.

library(psych)
library(ggplot2) #this is a very powerful library for creating plots. I sent you all a GUI (i.e. click and drag) version of this as well. 
## 
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
## 
##     %+%, alpha
library(lmSupport)
## Registered S3 methods overwritten by 'lme4':
##   method                          from
##   cooks.distance.influence.merMod car 
##   influence.merMod                car 
##   dfbeta.influence.merMod         car 
##   dfbetas.influence.merMod        car

And another chunk to load the csv.

Setting Working Directory

Make sure your working directory is the folder containing this script and the csv! You can do this by clicking Session -> Set Working Directory and choosing the folder your script and csv are saved in. You can then copy the command that appears in your console (something like setwd("~/Desktop/PSYC 212/212:2021/Week_2")) into the chunk above so that every time you run this script it sets the working directory automatically.

#note how I named the chunk above. this will show up when I knit this file and can be helpful for organization. 
data <- read.csv("homework2data.csv", header=T, stringsAsFactors = FALSE, na.strings=c("","NA")) #once you load it, you should see it in your environment.

#when you load a csv, you often want to include some arguments to make sure it handles missing data correctly (na.strings) and does NOT automatically convert strings to factors (stringsAsFactors = FALSE). We can talk about this in discussion. But for sanity I would recommend pretty much always using these arguments when you load data. 
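It is also worth eyeballing the data right after loading it. A quick sketch, assuming the csv loaded into data as above:

str(data) #check that age and vote came in as numeric
head(data) #peek at the first few rows
colSums(is.na(data)) #count missing values per column; na.strings above turned "" into NA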

a. Estimate a regression equation using R predicting voting interest from age and include the output below. Also, write out the resulting regression equation.

#Let's get to regression!
#This is predicting voting interest from age
age <- lm(vote~age, data)
#We can create an object from the summary, that way we can extract some items from our summary
agesummary <- summary(age)
agesummary
## 
## Call:
## lm(formula = vote ~ age, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.51205 -0.19679 -0.00922  0.17269  0.49926 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.037478   0.161532   0.232    0.819    
## age         0.099435   0.003356  29.631   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2832 on 23 degrees of freedom
## Multiple R-squared:  0.9745, Adjusted R-squared:  0.9734 
## F-statistic:   878 on 1 and 23 DF,  p-value: < 2.2e-16
#We can take a look at the R-squared and the residuals
agesummary$r.squared
## [1] 0.9744722
agesummary$residuals
##            1            2            3            4            5            6 
## -0.009219308  0.172694892 -0.026174745  0.476651161  0.482302973 -0.318827389 
##            7            8            9           10           11           12 
## -0.003567496  0.499258410 -0.196785321  0.007736129  0.189650329 -0.512045214 
##           13           14           15           16           17           18 
## -0.208088946 -0.108654127  0.101519135  0.002084317 -0.213740758  0.183998517 
##           19           20           21           22           23           24 
##  0.090215510 -0.014871121 -0.412610396  0.382868154 -0.125609564 -0.423914021 
##           25 
## -0.014871121

Answer: The regression equation would be: Y = 0.037 + 0.099(X) to predict voting interest from age, with a residual standard error of 0.283.
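One way to double-check the equation you wrote out is to pull the coefficients straight from the model object and let R generate a prediction (the age of 40 below is just an arbitrary example value):

coef(age) #b0 and b1, straight from the model object
predict(age, newdata = data.frame(age = 40)) #predicted voting interest for a 40-year-old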

b. We want to know how much voting interest will change for each additional year of age. What coefficient are we looking for and how do we interpret it? Answer: The coefficient we are interested in is the slope (b1), which in this case is 0.099 and represents the predicted change in voting interest for each additional year of age.

c. We want to know how much voting interest a person will have at 0 years old. What term are we looking for and how would you interpret it? Answer: The term we are interested in is the Y intercept, which in this case is 0.037: the predicted voting interest at age 0.

d. Is the value in part C useful interpretively? Answer: Probably not. The intercept isn’t meaningful for ages below a certain point (is there a rule of thumb for when political awareness develops??). In reality, the voting interest of a newborn is going to be 0, which is inconsistent with this model’s prediction.

e. How does the model fit overall? Is this a good model? Support your answer with some kind of statistic. Answer: Our R squared and adjusted R squared values both equal approximately 0.97, which means age explains almost all of the variance in voting interest, and the overall F test is significant (F(1, 23) = 878, p < .001). By both criteria, this is a very good model fit.
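If you want the F test without hunting through the printed summary, it is stored in the summary object too:

agesummary$fstatistic #the F value and its degrees of freedom (878 on 1 and 23 df)
agesummary$adj.r.squared #the adjusted R-squared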

f. Calculate the coefficients of a standardized equation of this model in R, include the output below, write out the full equation, and interpret all components.

# I found this package online to calculate our beta coefficient for age
library("QuantPsyc")
## Loading required package: boot
## 
## Attaching package: 'boot'
## The following object is masked from 'package:psych':
## 
##     logit
## Loading required package: MASS
## 
## Attaching package: 'QuantPsyc'
## The following object is masked from 'package:base':
## 
##     norm
age_lmbeta <- lm.beta(age)
age_lmbeta
##       age 
## 0.9871536

Answer: Equation: Zy = 0.987(Zx). Components: 0.987 is the beta coefficient, the standardized version of b1 (age); it means a 1 SD increase in age predicts a 0.987 SD increase in voting interest. There is no intercept in this equation because the intercept is always 0.00 when all variables are standardized.
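You don’t strictly need a package for this: beta is just b rescaled by the two standard deviations, so a quick check against lm.beta’s answer is:

#beta = b * SD(X) / SD(Y); should reproduce the 0.987 above
coef(age)["age"] * sd(data$age) / sd(data$vote)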

g. Oops. We jumped into the analysis. What should we have looked at first?

#we forgot to make a scatterplot
plot(data$vote~data$age)

Answer: We should have first plotted the data to check that the relationship looks linear and that the variance of Y is similar at each value of X (homoscedasticity).
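Since we already loaded ggplot2, a slightly nicer version of that scatterplot, with the fitted line overlaid, might look like this:

ggplot(data, aes(x = age, y = vote)) +
  geom_point() + #the raw data, one point per person
  geom_smooth(method = "lm") #the fitted regression line with its confidence band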

h. What else might we like to do in this situation to resolve an earlier problem with the unstandardized equation? (Hint: think about cynical infants). From now on, we will almost always want to do this.

#centering our age variable
data$age_c <- scale(data$age, scale=FALSE) #scale=FALSE subtracts the mean but does NOT divide by the SD

Answer: By centering, we avoid an intercept at a meaningless age. The observed ages begin at 18, so after mean-centering, the intercept becomes the predicted voting interest at the average age in the sample, which is a value that actually occurs in our data.
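A quick sketch to confirm the centering worked as intended:

round(mean(data$age_c), 10) #should be 0: centering subtracts the mean
range(data$age) #the original ages (per the answer above, they start at 18)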

i. Estimate the coefficients of a regular regression equation and the standardized equation predicting voting interest from CENTERED age in R, include the output below, write both the equations, and interpret all the parts.

#running the regression with age centered
centered <- lm(vote~age_c, data)
#again creating an object from our summary
centeredsum <-summary(centered)
centeredsum
## 
## Call:
## lm(formula = vote ~ age_c, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.51205 -0.19679 -0.00922  0.17269  0.49926 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 4.520000   0.056632   79.81   <2e-16 ***
## age_c       0.099435   0.003356   29.63   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2832 on 23 degrees of freedom
## Multiple R-squared:  0.9745, Adjusted R-squared:  0.9734 
## F-statistic:   878 on 1 and 23 DF,  p-value: < 2.2e-16
#getting our adjusted r-squared value from the centered model
centeredsum$adj.r.squared
## [1] 0.9733623
#Let's z-score our variables now
data$vote_z <- scale(data$vote)
data$age_z <- scale(data$age)
#Running the regression with our standardized variables now
standard <- lm(vote_z~age_z, data)
#outputting the betas
summary(standard)
## 
## Call:
## lm(formula = vote_z ~ age_z, data = data)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.295138 -0.113425 -0.005314  0.099540  0.287768 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1.677e-16  3.264e-02    0.00        1    
## age_z       9.872e-01  3.332e-02   29.63   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1632 on 23 degrees of freedom
## Multiple R-squared:  0.9745, Adjusted R-squared:  0.9734 
## F-statistic:   878 on 1 and 23 DF,  p-value: < 2.2e-16

Answer: (1) Regression equation with centered age: Y = 4.52 + 0.099(X), with a residual standard error of 0.283. The intercept of 4.52 is the predicted voting interest at the mean age, and predicted voting interest increases by 0.099 for each additional year of age. (2) Standardized equation: Zy = 0.987(Zx), with a residual standard error of 0.163. There is no intercept, and predicted voting interest increases by 0.987 standard deviations for each 1 SD increase in age.
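A nice property to check here: with a mean-centered predictor, the intercept should equal the mean of the outcome, which you can verify directly:

mean(data$vote) #should match the 4.52 intercept in the centered model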

j. Now estimate the coefficients of a regression equation in R predicting voting interest from age and education, include the output below, and write the equation. Use all centered predictors.

#centering our education and voting variables now
data$education_c <- scale(data$education, scale=FALSE)
data$vote_c <- scale(data$vote, scale=FALSE)
#this is our model with everything centered and multiple predictors
#to add multiple predictors, we only need a + between predictors
multiplecentered <- lm(vote_c~age_c + education_c, data)
summary(multiplecentered)
## 
## Call:
## lm(formula = vote_c ~ age_c + education_c, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.53281 -0.18523 -0.01487  0.17027  0.50181 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 3.584e-16  5.781e-02   0.000 1.000000    
## age_c       9.320e-02  2.385e-02   3.908 0.000755 ***
## education_c 5.067e-02  1.919e-01   0.264 0.794190    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2891 on 22 degrees of freedom
## Multiple R-squared:  0.9746, Adjusted R-squared:  0.9722 
## F-statistic: 421.3 on 2 and 22 DF,  p-value: < 2.2e-16

Answer: Y = 0.093(X1) + 0.051(X2), where X1 is centered age and X2 is centered education. The intercept (3.58 x 10^-16) is effectively 0 because every variable is centered, and the residual standard error is 0.289.

k. Interpret all the important information for describing this model as well as the predictors in the model. Answer: The intercept is effectively 0 because the outcome and both predictors are centered. The coefficient for X1 (age) is 0.093: holding education constant, each additional year of age predicts a 0.093 increase in voting interest, and this effect is significant (p < .001). The coefficient for X2 (education) is 0.051: holding age constant, each additional unit of education predicts a 0.051 increase, but this effect is not significant (p = .79). The residual standard error is 0.289, barely different from the single-predictor model.
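Confidence intervals tell the same story as the p-values, and base R will produce them without any extra packages:

confint(multiplecentered) #the interval for age_c should exclude 0; education_c's should include 0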

l. Does education significantly predict interest in voting above and beyond what age can explain? Explain how you know. Answer: No. Adding education barely changes R squared (adjusted R squared actually drops slightly), and education’s coefficient is not significant (p = .79) once age is in the model.

m. Does age predict interest in voting above and beyond what education can explain? Explain. Answer: Yes. In the model containing both predictors, the coefficient for age remains significant (t = 3.91, p < .001), so age explains unique variance in voting interest beyond what education accounts for.

n. Now we will conduct a hierarchical linear regression analysis. Create two regression models and analyze them hierarchically (i.e., in sequence). Use voting interest predicted from (1) only age on the first step and (2) age and education on the second step. Include R output for this below.

age <- lm(vote~age, data) #just age
multiple <- lm(vote~age+education ,data) #age and education

#the "anova" command will compared the residual sum of squares and see if our model is doing signficantly better
anova(age, multiple)
## Analysis of Variance Table
## 
## Model 1: vote ~ age
## Model 2: vote ~ age + education
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1     23 1.8441                           
## 2     22 1.8383  1 0.0058264 0.0697 0.7942

Answer: The comparison is not significant (F(1, 22) = 0.07, p = .79): adding education on the second step does not make the model significantly better, so the age-only model is sufficient.

o. Compare the significance of the t statistic for education in part 1j to the significance of ΔR2 (see Pr(>F)) from what you just did in part 1n. Comment on the relationship between these two p-values.

#change in R2 and its square root
chgR2 = summary(multiplecentered)$r.squared - centeredsum$r.squared 
chgR2
## [1] 8.065404e-05
sqrt(chgR2)
## [1] 0.008980759

Answer: In 1j, the t statistic for education had a p-value of 0.79, and in 1n the test of ΔR2 for the model adding education also had a p-value of 0.79... they are the same! That is no coincidence: when exactly one predictor is added, the F test for ΔR2 equals the square of that predictor’s t test (0.264^2 ≈ 0.0697), so the two tests are equivalent. Because this p-value is much larger than 0.05, we conclude that education is not a significant predictor of voting interest over and above age.
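You can verify that equivalence directly: when a single predictor is added, the F for the model comparison is the square of that predictor’s t statistic:

tval <- summary(multiple)$coefficients["education", "t value"]
tval^2 #should reproduce the F of about 0.0697 from the anova() comparison above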