Welcome to week 2!
HW questions will once again be split into conceptual and technical.
1. List the five assumptions of regression models. (Hint: 3 of them relate to error, and linearity is not one of them)
1) Normality. The conditional distributions of Y (equivalently, the errors) must be normally distributed. This becomes less of a problem as N gets larger.
2) Independence of observations. Cases from the data must be independent of (unrelated to) one another on Y.
3) Homoscedasticity. The variance of Y must be equal (or similar) across values of X. Otherwise the standard errors, and so the significance tests, can be off.
4) No measurement error in the regressors. Measurement error in regressors can result in a downward bias in r, b, and B (for one regressor), or a bias in either direction if there are multiple regressors.
5) Correct model specification. For a causal interpretation, any other causes of Y that are left out of the model must be uncorrelated with (unrelated to) X (my predictor). This is like ruling out third variables and alternative explanations.
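2. What was the motivation for creating an adjusted R2? Answer: The sample R2 is a biased estimate of the population R2: it tends to be too high, especially when n is small or there are many predictors. The bias was substantial enough that adjusted R2 was created to give a better, less biased estimate.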
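A minimal sketch of the usual adjusted R2 formula, 1 - (1 - R2)(n - 1)/(n - k - 1). The helper name adj_r2 is made up for illustration; the inputs (R2 = 0.9744722, n = 25, k = 1) are taken from the age-only model output later in this homework.

```r
# Hypothetical helper: adjusted R2 from R2, sample size n, and number of predictors k
adj_r2 <- function(r2, n, k) {
  1 - (1 - r2) * (n - 1) / (n - k - 1)
}

# Values from the age-only model in this homework (25 cases, 1 predictor)
adj_age_only <- adj_r2(0.9744722, n = 25, k = 1)
adj_age_only
```

This should land very close to the Adjusted R-squared that summary() reports for that model (0.9734). The penalty grows as k gets closer to n.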
3. Explain how a p-value is obtained for a b-value. (Hint: We can calculate the ___ of a b and then that allows to calculate a ___ and from that we can get a p-value).
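Answer: We can calculate the standard error of b (s hat of b), and that allows us to calculate a t statistic (t = b / standard error). Comparing that t to a t distribution with n - k - 1 degrees of freedom gives the p-value. The same standard error also gives us the 95% confidence interval.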
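As a rough illustration of those steps, here is a sketch that rebuilds the t statistic, p-value, and 95% confidence interval by hand. The numbers are the slope, standard error, and residual degrees of freedom for age.1 from the regression output later in this homework; the object names are only for illustration.

```r
# Slope and standard error for age.1, copied from the summary() output
b_age  <- 0.099435
se_age <- 0.003356
df_res <- 23            # n - k - 1 = 25 - 1 - 1

# Step 1: t statistic
t_value <- b_age / se_age

# Step 2: two-tailed p-value from the t distribution
p_value <- 2 * pt(-abs(t_value), df = df_res)

# Bonus: the 95% confidence interval uses the same standard error
ci_95 <- b_age + c(-1, 1) * qt(0.975, df = df_res) * se_age

t_value
p_value
ci_95
```

The t value should match the "t value" column in the output (about 29.63), and the p-value is far below the 2e-16 printing cutoff that R uses.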
4. What factors impact the standard errors of b-values and in which direction (according to Dan, not the book)? Answer: The standard error of b gets larger when multicollinearity is high (the R2 from predicting that regressor from the other regressors is large), k is large, n is small, or the R2 for Y is small.
First thing we should do is create a chunk for all the packages we will need
And another chunk to load the csv.
getwd()
## [1] "C:/Users/AshMi/Desktop/stat"
setwd("C:/Users/AshMi/Desktop/stat")
getwd()
## [1] "C:/Users/AshMi/Desktop/stat"
hmwk.1<- read.csv("homework2data.csv")
attach(hmwk.1)
Setting Working Directory
Make sure your working directory is the folder that contains this script and the csv! You can do this by clicking Session -> Set Working Directory and choosing the folder where your script and csv are saved. You can then copy the command that appears in your console (something like setwd("~/Desktop/PSYC 212/212:2021/Week_2")) into the chunk above so that it runs automatically every time you run this script.
#note how I named the chunk above. this will show up when I knit this file and can be helpful for organization.
data<-read.csv("homework2data.csv", header=T, stringsAsFactors = FALSE, na.strings=c("","NA"))
#once you load it, you should see it in your environment.
#when you load a csv, you often want to include some arguments to make sure it is handling missing data correctly and not converting strings to factors. We can talk about this in discussion. But for sanity I would recommend pretty much always using these arguments when you load data.
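To see what those read.csv arguments do, here is a small sketch on a made-up three-row csv passed in through the text argument (the columns id and group are invented for illustration and are not in homework2data.csv).

```r
# A tiny made-up csv: row 2 has an empty cell, row 3 has the literal text NA
toy_csv <- "id,group\n1,a\n2,\n3,NA"

toy <- read.csv(text = toy_csv, header = TRUE,
                stringsAsFactors = FALSE,   # keep strings as character, not factor
                na.strings = c("", "NA"))   # treat empty cells and "NA" as missing

str(toy)
sum(is.na(toy$group))   # how many missing values in group
```

Without "" in na.strings, the empty cell in a character column would be read as an empty string instead of a missing value.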
a. Estimate a regression equation using R predicting voting interest from age and include the output below. Also, write out the resulting regression equation.
#Let's get to regression!
#This is predicting voting interest from age
age.2 <- lm(vote~age.1, data)
#We can create an object from the summary, that way we can extract some items from our summary
agesummary <- summary(age.2)
agesummary
##
## Call:
## lm(formula = vote ~ age.1, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.51205 -0.19679 -0.00922 0.17269 0.49926
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.037478 0.161532 0.232 0.819
## age.1 0.099435 0.003356 29.631 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2832 on 23 degrees of freedom
## Multiple R-squared: 0.9745, Adjusted R-squared: 0.9734
## F-statistic: 878 on 1 and 23 DF, p-value: < 2.2e-16
#We can take a look at the R-squared and the residuals
agesummary$r.squared
## [1] 0.9744722
agesummary$residuals
## 1 2 3 4 5 6
## -0.009219308 0.172694892 -0.026174745 0.476651161 0.482302973 -0.318827389
## 7 8 9 10 11 12
## -0.003567496 0.499258410 -0.196785321 0.007736129 0.189650329 -0.512045214
## 13 14 15 16 17 18
## -0.208088946 -0.108654127 0.101519135 0.002084317 -0.213740758 0.183998517
## 19 20 21 22 23 24
## 0.090215510 -0.014871121 -0.412610396 0.382868154 -0.125609564 -0.423914021
## 25
## -0.014871121
Answer: y = .037 + .099X. For each additional year of age, predicted voting interest increases by .099 units.
b. We want to know how much voting interest will change for each additional year of age? What coefficient are we looking for and how do we interpret it? Answer: We are looking for the slope, b. Predicted voting interest increases by .099 units with every additional year of age.
c. We want to know how much voting interest a person will have at 0 years old? What term are we looking for and how would you interpret it? Answer: We would look at the intercept (.037). At 0 years of age, someone would have .037 units of interest in voting (not really, but that's what the statistics say).
d. Is the value in part c useful interpretively? Answer: No, probably not. An age of 0 is not meaningful in this case, since infants cannot have an interest in voting.
e. How does the model fit overall? Is this a good model? Support your answer with some kind of statistic. Answer: The adjusted R2 is .97, which would indicate that it is likely a good fit overall.
f. Calculate the coefficients of a standardized equation of this model in R, include the output below, write out the full equation, and interpret all components.
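#z-scoring voting interest and age
vote.scale<- scale(vote)
age.scale<- scale(as.numeric(age.1))
#note: this model uses raw age.1 (not age.scale), so only the outcome is standardized here
model.hmwk<- lm(vote.scale~ age.1)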
summary(model.hmwk)
##
## Call:
## lm(formula = vote.scale ~ age.1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.295138 -0.113425 -0.005314 0.099540 0.287768
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.583683 0.093106 -27.75 <2e-16 ***
## age.1 0.057313 0.001934 29.63 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1632 on 23 degrees of freedom
## Multiple R-squared: 0.9745, Adjusted R-squared: 0.9734
## F-statistic: 878 on 1 and 23 DF, p-value: < 2.2e-16
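Answer: As run, only vote was standardized (the model uses age.1 rather than age.scale), so this output gives zy = -2.58 + .057X: each additional year of age is associated with a .057 standard deviation increase in voting interest. The fully standardized equation, with both variables z-scored (see the output in part i), is zy = 0 + .987zx: a 1 standard deviation increase in age is associated with a .987 standard deviation increase in voting interest, and the intercept is 0 because both variables have a mean of 0.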
g. Oops. We jumped in to the analysis. What should we have looked at first?
#we forgot to make a scatterplot
plot(data$vote~data$age)
Answer: A scatterplot is good to look at first, to check that the relationship between age and voting interest looks linear before you conduct an analysis that assumes it is linear.
h. What else might we like to do in this situation to resolve an earlier problem with the unstandardized equation? (Hint: think about cynical infants). From now on, we will almost always want to do this.
#centering our age variable
age_c <- scale(age.1, scale=FALSE)
vote_c<- scale(vote, scale=FALSE)
Answer: We will want to mean center, because in this case 0 is not meaningful and infants are too young to be preoccupied with the joys of voting.
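A small sketch of what centering does, using simulated data (the toy_ variables are made up and are not the homework data). Centering the predictor leaves the slope alone and moves the intercept to the predicted value at the mean of the predictor, which for a simple regression is the mean of the outcome.

```r
set.seed(1)

# Simulated ages and voting interest, for illustration only
toy_age  <- round(runif(30, min = 18, max = 80))
toy_vote <- 0.1 * toy_age + rnorm(30, sd = 0.3)

# Mean-center the predictor
toy_age_c <- toy_age - mean(toy_age)

fit_raw      <- lm(toy_vote ~ toy_age)
fit_centered <- lm(toy_vote ~ toy_age_c)

coef(fit_raw)        # intercept = predicted vote at age 0
coef(fit_centered)   # intercept = predicted vote at the mean age
mean(toy_vote)
```

The two slopes should be identical, and the centered intercept should equal mean(toy_vote).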
i. Estimate the coefficients of a regular regression equation and the standardized equation predicting voting interest from CENTERED age in R, include the output below, write both the equations, and interpret all the parts.
#running the regression with age centered
centered <- lm(vote~age_c)
#again creating an object from our summary
centeredsum <-summary(centered)
centeredsum
##
## Call:
## lm(formula = vote ~ age_c)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.51205 -0.19679 -0.00922 0.17269 0.49926
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.520000 0.056632 79.81 <2e-16 ***
## age_c 0.099435 0.003356 29.63 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2832 on 23 degrees of freedom
## Multiple R-squared: 0.9745, Adjusted R-squared: 0.9734
## F-statistic: 878 on 1 and 23 DF, p-value: < 2.2e-16
#getting our adjusted r-squared value from the centered model
centeredsum$adj.r.squared
## [1] 0.9733623
#Let's z-score our variables now
vote_z <- scale(data$vote)
age_z <- scale(data$age)
#Running the regression with our standardized variables now
standard <- lm(vote_z~age_z)
#outputting the betas
summary(standard)
##
## Call:
## lm(formula = vote_z ~ age_z)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.295138 -0.113425 -0.005314 0.099540 0.287768
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.677e-16 3.264e-02 0.00 1
## age_z 9.872e-01 3.332e-02 29.63 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1632 on 23 degrees of freedom
## Multiple R-squared: 0.9745, Adjusted R-squared: 0.9734
## F-statistic: 878 on 1 and 23 DF, p-value: < 2.2e-16
Answer: For our regular centered regression model: y = 4.52 + .099X. At the mean age, predicted voting interest is 4.52 units, and it increases by .099 units with every additional year of age. For our standardized regression model: zy = 0 + .987zx. Someone at the mean age is predicted to be at the mean of voting interest, and for every 1 standard deviation increase in age, there is a .99 standard deviation increase in voting interest.
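For a regression with one predictor, the standardized slope is the same number as the correlation between X and Y, and it can also be recovered from the raw slope as b * sd(x) / sd(y). Here is a sketch on simulated data (toy_x and toy_y are made up for illustration).

```r
set.seed(2)

# Simulated predictor and outcome, for illustration only
toy_x <- rnorm(40, mean = 50, sd = 15)
toy_y <- 0.1 * toy_x + rnorm(40)

fit_b <- lm(toy_y ~ toy_x)                 # unstandardized
fit_z <- lm(scale(toy_y) ~ scale(toy_x))   # both variables z-scored

beta_from_model <- unname(coef(fit_z)[2])
beta_from_b     <- unname(coef(fit_b)[2]) * sd(toy_x) / sd(toy_y)
r_xy            <- cor(toy_x, toy_y)

c(beta_from_model, beta_from_b, r_xy)
```

All three values should agree. This shortcut only holds with a single predictor; with several predictors the betas are no longer simple correlations.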
j. Now estimate the coefficients of a regression equation in R predicting voting interest from age and education, include the output below, and write the equation. Use all centered predictors.
#centering our education and voting variables now
data$education_c <- scale(data$education, scale=FALSE)
data$vote_c <- scale(data$vote, scale=FALSE)
#this is our model with everything centered and multiple predictors
#to add multiple predictors, we only need a + between predictors
multiplecentered <- lm(vote_c~age_c + education_c, data)
summary(multiplecentered)
##
## Call:
## lm(formula = vote_c ~ age_c + education_c, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.53281 -0.18523 -0.01487 0.17027 0.50181
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.584e-16 5.781e-02 0.000 1.000000
## age_c 9.320e-02 2.385e-02 3.908 0.000755 ***
## education_c 5.067e-02 1.919e-01 0.264 0.794190
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2891 on 22 degrees of freedom
## Multiple R-squared: 0.9746, Adjusted R-squared: 0.9722
## F-statistic: 421.3 on 2 and 22 DF, p-value: < 2.2e-16
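Answer: y = 0 + .093X1 + .051X2, where y is centered voting interest, X1 is centered age, and X2 is centered education. The intercept (3.58e-16) is zero up to rounding error because the outcome and both predictors are centered.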
k. Interpret all the important information for describing this model as well as the predictors in the model. Answer: Holding education constant, each additional year of age is associated with a .093 unit increase in voting interest (p < .001). Holding age constant, each additional unit of education is associated with a .051 unit increase in voting interest, which is not significant (p = .79). The predictors are centered rather than standardized, so these slopes are in raw units, not standard deviations. The adjusted R2 of .97 indicates the model fits well.
l. Does education significantly predict interest in voting above and beyond what age can explain? Explain how you know. Answer: No. Multiple R2 barely moved (.9745 to .9746), adjusted R2 went down (.9734 to .9722), residual standard error went up (.2832 to .2891), and the coefficient for education is not significant (t = 0.26, p = .79). All of this indicates that education is not explaining new variation in voting interest. The standard error for age also jumped from .003 to .024, which points to multicollinearity between age and education.
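m. Does age predict interest in voting above and beyond what education can explain? Explain. Answer: Yes. With education held constant, age is still a significant predictor (b = .093, t = 3.91, p < .001), so it explains variance in voting interest that education does not. Its standard error is much larger than in the age-only model (.024 vs. .003) because age overlaps with education (multicollinearity), but the unique effect of age is still detectable.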
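To make the multicollinearity point concrete, here is a sketch with simulated data (all sim_ variables are made up; they only mimic two predictors that overlap heavily). It compares the standard error of the same slope with and without the correlated predictor, and computes a variance inflation factor by hand as 1 / (1 - R2), where R2 comes from predicting one regressor from the other.

```r
set.seed(3)
n_sim <- 200

# Two strongly correlated predictors; only sim_age truly drives the outcome
sim_age  <- rnorm(n_sim)
sim_edu  <- sim_age + rnorm(n_sim, sd = 0.2)
sim_vote <- sim_age + rnorm(n_sim)

fit_one <- lm(sim_vote ~ sim_age)
fit_two <- lm(sim_vote ~ sim_age + sim_edu)

# Standard error of the sim_age slope in each model
se_one <- summary(fit_one)$coefficients["sim_age", "Std. Error"]
se_two <- summary(fit_two)$coefficients["sim_age", "Std. Error"]

# Variance inflation factor for sim_age, computed by hand
r2_age_on_edu <- summary(lm(sim_age ~ sim_edu))$r.squared
vif_age <- 1 / (1 - r2_age_on_edu)

c(se_one = se_one, se_two = se_two, vif_age = vif_age)
```

The standard error for sim_age should be several times larger once sim_edu is in the model, which is the same pattern as the jump in the standard error for age in the homework output.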
n. Now we will conduct a hierarchical linear regression analysis. Create two regression models and analyze them hierarchically (i.e., in sequence). Use voting interest predicted from (1) only age on the first step and (2) age and education on the second step. Include R output for this below.
age.3 <- lm(vote~age.1, data) #just age
multiple <- lm(vote~age.1+education ,data) #age and education
#the "anova" command will compare the residual sums of squares and see if our second model is doing significantly better
anova(age.3, multiple)
## Analysis of Variance Table
##
## Model 1: vote ~ age.1
## Model 2: vote ~ age.1 + education
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 23 1.8441
## 2 22 1.8383 1 0.0058264 0.0697 0.7942
Answer: Adding education lowers the residual sum of squares only slightly (1.8441 to 1.8383), and that change is not significant, F(1, 22) = 0.07, p = .79. Nothing is gained by including education in the model.
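The F in the anova table can be rebuilt by hand from the numbers it prints. This sketch copies the Sum of Sq, the RSS of the larger model, and the degrees of freedom from the table above; the object names are only for illustration.

```r
# Numbers copied from the anova() table
ss_change <- 0.0058264   # drop in RSS from adding education
df_change <- 1           # one predictor added
rss_full  <- 1.8383      # RSS of the age + education model
df_full   <- 22          # residual df of that model

# F = (improvement per added predictor) / (leftover error per residual df)
f_change <- (ss_change / df_change) / (rss_full / df_full)
p_change <- pf(f_change, df_change, df_full, lower.tail = FALSE)

f_change
p_change
```

Both values should match the F and Pr(>F) columns in the table up to rounding.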
o. Compare the significance of the t statistic for education in part 1j to the significance of ΔR2 (see Pr(>F)) from what you just did in part 1n. Comment on the relationship between these two p-values.
#change in R2 and its square root
chgR2 = summary(multiplecentered)$r.squared - centeredsum$r.squared
chgR2
## [1] 8.065404e-05
sqrt(chgR2)
## [1] 0.008980759
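Answer: The p-values are identical (.794 for the t-test of education in part 1j and .794 for the F-test in part 1n). When a single predictor is added, the F for the change in R2 equals the square of that predictor's t (0.264^2 = 0.0697), so the two tests are equivalent.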
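A quick check of that equivalence on simulated data (the demo_ variables are made up for illustration). When the second step adds exactly one predictor, the F from anova() should equal the squared t for that predictor in the larger model, and the p-values should be the same.

```r
set.seed(4)

# Simulated data, for illustration only
demo_age  <- rnorm(50)
demo_edu  <- 0.5 * demo_age + rnorm(50)
demo_vote <- demo_age + rnorm(50)

step1 <- lm(demo_vote ~ demo_age)
step2 <- lm(demo_vote ~ demo_age + demo_edu)

# F test for the change between the two steps
step_test <- anova(step1, step2)
f_step <- step_test$F[2]
p_f    <- step_test$`Pr(>F)`[2]

# t test for the added predictor in the larger model
t_edu <- summary(step2)$coefficients["demo_edu", "t value"]
p_t   <- summary(step2)$coefficients["demo_edu", "Pr(>|t|)"]

c(t_squared = t_edu^2, F = f_step, p_t = p_t, p_F = p_f)
```

If more than one predictor were added at the second step, the F test would no longer match any single t test.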