Welcome to week 2!
HW questions will once again be split into conceptual and technical.
1. List the five assumptions of regression models. (Hint: 3 of them relate to error, and linearity is not one of them) Answer: Most of the assumptions center on the distribution of the residuals:
1. Residuals are normally distributed with a mean of zero and a standard deviation of sigma-e.
2. Residual values are independent, so knowledge of one error term gives no knowledge of any other.
3. Sigma-e, the standard deviation of the errors, is constant at every x value (i.e., homoscedasticity).
4. There is no measurement error in the regressors.
5. The model is correctly specified, and the variables accurately reflect the underlying theoretical constructs they are meant to measure.
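Most of the residual-based assumptions can be eyeballed with R's built-in diagnostic plots. A minimal sketch on simulated data (fakedat and fakefit are made-up names, not part of the homework):
#checking residual assumptions visually on simulated data
set.seed(1)
fakedat <- data.frame(x = 1:50)
fakedat$y <- 2 + 0.5 * fakedat$x + rnorm(50)
fakefit <- lm(y ~ x, fakedat)
par(mfrow = c(2, 2))
plot(fakefit) #residuals vs. fitted (constant variance), normal Q-Q (normality), etc.
par(mfrow = c(1, 1))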
2. What was the motivation for creating an adjusted R2? Answer: R^2 is a biased estimator that consistently overestimates the population value, for reasons that turn out to be less eldritch than they look: the sample coefficients capitalize on chance variation, and R^2 can only increase as predictors are added. Adjusted R^2 corrects for this by applying a penalty based on the number of predictors relative to the sample size: adjusted R^2 = 1 - (1 - R^2)(n - 1)/(n - k - 1).
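To make the adjustment concrete, here is the formula written out by hand (adj_r2 is just an illustrative helper, not a built-in):
#hand-rolling the adjusted R^2 penalty: n = sample size, k = number of predictors
adj_r2 <- function(r2, n, k) 1 - (1 - r2) * (n - 1) / (n - k - 1)
adj_r2(0.9745, n = 25, k = 1) #matches the ~0.9734 adjusted R-squared R reports below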
3. Explain how a p-value is obtained for a b-value. (Hint: We can calculate the ___ of a b and then that allows us to calculate a ___ and from that we can get a p-value).
Answer: In order to calculate a significance value for a regression coefficient, you have to first formulate hypotheses. The null is that the population-level slope, referred to as capital Beta, is 0; the alternative is that it is not zero and there is some predictive relationship. This calls for a t-test evaluating the observed coefficient b.
To do this, we calculate the standard error of the b: se(b) = sqrt(MSE / sum((xi - xbar)^2)), where MSE is the mean square error of the equation, the variance not accounted for by the model: MSE = sum((yi - yhat_i)^2) / (n - 2).
Finally, to calculate the observed t, we divide the b itself by its standard error: t = b / se(b). This is then evaluated against the t distribution with n - 2 degrees of freedom (or a standard t table) for significance.
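Here is that whole chain carried out by hand on simulated data and checked against what summary() reports (simdat and fit are made-up names):
#hand-computing se(b), t, and the p-value for a simple regression
set.seed(2)
simdat <- data.frame(x = rnorm(30))
simdat$y <- 1 + 0.5 * simdat$x + rnorm(30)
fit <- lm(y ~ x, simdat)
n <- nrow(simdat)
mse <- sum(resid(fit)^2) / (n - 2) #variance not accounted for by the model
se_b <- sqrt(mse / sum((simdat$x - mean(simdat$x))^2)) #the formula above
t_obs <- coef(fit)["x"] / se_b #t = b / se(b)
2 * pt(-abs(t_obs), df = n - 2) #two-tailed p; compare to summary(fit)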
4. What factors impact the standard errors of b-values and in which direction (according to Dan, not the book)? Answer: This was somewhat unclear and weird. According to my notes, if there is one regressor in a model, measurement error attenuates r, b, and beta, biasing them toward zero. With multiple regressors, however, the bias can go in either direction.
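The single-regressor attenuation is easy to demonstrate by simulation (all names here are made up):
#measurement error in a lone regressor biases the slope toward zero
set.seed(3)
true_x <- rnorm(1000)
y <- 2 * true_x + rnorm(1000) #true slope is 2
noisy_x <- true_x + rnorm(1000) #same x, measured with error
coef(lm(y ~ true_x))["true_x"] #close to 2
coef(lm(y ~ noisy_x))["noisy_x"] #noticeably smaller: attenuation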
The first thing we should do is create a chunk for all the packages we will need.
library(psych)
library(ggplot2) #this is a very powerful library for creating plots. I sent you all a GUI (i.e. click and drag) version of this as well.
##
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
##
## %+%, alpha
library(lmSupport)
## Registered S3 methods overwritten by 'lme4':
## method from
## cooks.distance.influence.merMod car
## influence.merMod car
## dfbeta.influence.merMod car
## dfbetas.influence.merMod car
And another chunk to load the csv.
Setting Working Directory
Make sure your working directory is the folder containing this script and the csv! You can do this by clicking Session -> Set Working Directory -> and choosing the folder your script and csv are saved in. You can then copy the output you will see in your console (something like setwd("~/Desktop/PSYC 212/212:2021/Week_2")) into the chunk above, so that every time you run this script it sets the directory automatically.
getwd()
## [1] "C:/Users/Bob McTavish/Documents/R"
rm(list = ls())
#Note how I named the chunk above. This will show up when I knit this file and can be helpful for organization.
data<-read.csv("homework2data.csv", header=T, stringsAsFactors = FALSE, na.strings=c("","NA")) #once you load it, you should see it in your environment.
#when you load a csv, you often want to include some arguments to make sure it handles missing data correctly (na.strings) and does not silently convert strings to factors (stringsAsFactors = FALSE). We can talk about this in discussion, but for sanity I would recommend pretty much always using these arguments when you load data.
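Once the data are in, a quick sanity check is cheap insurance:
#quick checks after loading: column types and missing values
str(data) #structure: each column's type and first few values
colSums(is.na(data)) #count of NAs in each column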
a. Estimate a regression equation using R predicting voting interest from age and include the output below. Also, write out the resulting regression equation.
#Let's get to regression!
#This is predicting voting interest from age
age <- lm(vote~age, data)
#We can create an object from the summary, that way we can extract some items from our summary
agesummary <- summary(age)
agesummary
##
## Call:
## lm(formula = vote ~ age, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.51205 -0.19679 -0.00922 0.17269 0.49926
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.037478 0.161532 0.232 0.819
## age 0.099435 0.003356 29.631 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2832 on 23 degrees of freedom
## Multiple R-squared: 0.9745, Adjusted R-squared: 0.9734
## F-statistic: 878 on 1 and 23 DF, p-value: < 2.2e-16
#We can take a look at the R-squared and the residuals
agesummary$r.squared
## [1] 0.9744722
agesummary$residuals
## 1 2 3 4 5 6
## -0.009219308 0.172694892 -0.026174745 0.476651161 0.482302973 -0.318827389
## 7 8 9 10 11 12
## -0.003567496 0.499258410 -0.196785321 0.007736129 0.189650329 -0.512045214
## 13 14 15 16 17 18
## -0.208088946 -0.108654127 0.101519135 0.002084317 -0.213740758 0.183998517
## 19 20 21 22 23 24
## 0.090215510 -0.014871121 -0.412610396 0.382868154 -0.125609564 -0.423914021
## 25
## -0.014871121
Answer: Based on R’s calculations for us, we can see that age is in fact a significant predictor of voting interest, t(23) = 29.63, p < .001. Predicted voting interest = yhat = 0.037478 + 0.099435(age)
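To see the equation in action, we can plug in an age by hand or let predict() do it (age 30 is just an arbitrary example):
#predicting voting interest at an arbitrary age (30), two ways
0.037478 + 0.099435 * 30 #by hand from the equation
predict(age, newdata = data.frame(age = 30)) #same thing via predict()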
b. We want to know how much voting interest will change for each additional year of age. What coefficient are we looking for and how do we interpret it? Answer: This is the b coefficient for age in the model. For every one-unit increase in age, presumably in years, voting interest increases by about .1, on whatever scale it was measured.
c. We want to know how much voting interest a person will have at 0 years old. What term are we looking for and how would you interpret it? Answer: This is given by the intercept in the model, the predicted Y when X is zero. In this case, newborns are predicted to have approximately 0.04 interest in voting, which makes some sense: when you consider they do not speak a language or know what voting is, it is reasonable to assume their voting enthusiasm would be very near, if not actually, zero.
d. Is the value in part C useful interpretively? Answer: Somewhat, but not really. Intercepts are in the real units of the outcome's scale, but what exactly that 0.04 means on a scale of voting interest is dubious. And since we are here considering the voting behavior of newborns, it is somewhat ridiculous… but, crucially, not entirely ridiculous. The prediction makes sense, even if it is silly.
e. How does the model fit overall? Is this a good model? Support your answer with some kind of statistic. Answer: The R2 can be used as a general measure of model fit, though it does have limitations (and Dan does not like it). The R2 of this model is a sky-high .9745, meaning that age accounts for about 97.45% of the variance in voting interest. Since R2 is a biased estimator, the adjusted R2 takes things down a notch, but only a very small one here, as there is only one predictor in the model: adjusted R2 = .9734. The overall F-test is significant as well, giving us more support for the fit: F(1, 23) = 878, p < .001
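If you would rather pull these statistics out programmatically than read them off the printout, the summary object stores them:
#extracting fit statistics from the summary object
agesummary$adj.r.squared #adjusted R-squared
fstat <- agesummary$fstatistic #the F value and its two df
pf(fstat[1], fstat[2], fstat[3], lower.tail = FALSE) #recomputing the p-value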
f. Calculate the coefficients of a standardized equation of this model in R, include the output below, write out the full equation, and interpret all components.
dat2 = as.data.frame(scale(data, scale = T, center = T)) #scale() returns a matrix, so we convert it back to a data frame for lm()
age2 = lm(vote~age, dat2)
agesummary2 = summary(age2)
agesummary2
##
## Call:
## lm(formula = vote ~ age, data = dat2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.295138 -0.113425 -0.005314 0.099540 0.287768
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.677e-16 3.264e-02 0.00 1
## age 9.872e-01 3.332e-02 29.63 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1632 on 23 degrees of freedom
## Multiple R-squared: 0.9745, Adjusted R-squared: 0.9734
## F-statistic: 878 on 1 and 23 DF, p-value: < 2.2e-16
agesummary2$r.squared
## [1] 0.9744722
agesummary2$residuals
## 1 2 3 4 5 6
## -0.005313921 0.099539681 -0.015086872 0.274737162 0.277994812 -0.183769052
## 7 8 9 10 11 12
## -0.002056270 0.287767764 -0.113425173 0.004459031 0.109312632 -0.295137955
## 13 14 15 16 17 18
## -0.119940474 -0.062627198 0.058514657 0.001201380 -0.123198125 0.106054982
## 19 20 21 22 23 24
## 0.051999356 -0.008571571 -0.237824678 0.220681535 -0.072400149 -0.244339979
## 25
## -0.008571571
Answer: I think I’ve done this properly. We’ll see, I suppose. The standardized coefficient (beta) for age is .9872, meaning predicted voting interest increases by nearly one standard deviation for every standard deviation increase in age. R2 and the F-test remain the same, because standardizing is just a linear rescaling of the same data. The intercept of a standardized model is essentially zero (the 1.677e-16 is floating-point noise), since both variables have been centered at their means.
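A reassuring check: with a single predictor, the standardized beta is just the correlation between the two variables, and you can also recover it from the unstandardized b by rescaling:
#with one predictor, beta equals the correlation...
with(data, cor(vote, age))
#...and equals b rescaled by the two standard deviations
coef(age)["age"] * sd(data$age) / sd(data$vote)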
g. Oops. We jumped in to the analysis. What should we have looked at first?
#we forgot to make a scatterplot
plot(data$vote~data$age)
Answer: Look at this beautifully linear data! So regular, so diagonal.
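Since we loaded ggplot2 up top, here is the same plot in that style, with the fitted line overlaid (purely optional):
#the same scatterplot in ggplot2, regression line included
ggplot(data, aes(x = age, y = vote)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)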
h. What else might we like to do in this situation to resolve an earlier problem with the unstandardized equation? (Hint: think about cynical infants). From now on, we will almost always want to do this.
#centering our age variable
data$age_c <- scale(data$age, scale=FALSE)
Answer: This centers the age data around the mean value of, uh…
mean(data$age)
## [1] 45.08
of 45.08. This mean is now our zero value on the graph, which makes a lot more sense than a true zero age. While a zero age is technically meaningful and interpretable, it is not valuable in the context of this study: infants can’t vote, weren’t studied directly, and are likely incapable of expressing any enthusiasm they might have if they were.
i. Estimate the coefficients of a regular regression equation and the standardized equation predicting voting interest from CENTERED age in R, include the output below, write both the equations, and interpret all the parts.
#running the regression with age centered
centered <- lm(vote~age_c, data)
#again creating an object from our summary
centeredsum <-summary(centered)
centeredsum
##
## Call:
## lm(formula = vote ~ age_c, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.51205 -0.19679 -0.00922 0.17269 0.49926
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.520000 0.056632 79.81 <2e-16 ***
## age_c 0.099435 0.003356 29.63 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2832 on 23 degrees of freedom
## Multiple R-squared: 0.9745, Adjusted R-squared: 0.9734
## F-statistic: 878 on 1 and 23 DF, p-value: < 2.2e-16
#getting our r-squared value from the centered model
centeredsum$adj.r.squared
## [1] 0.9733623
#Let's z-score our variables now
data$vote_z <- scale(data$vote)
data$age_z <- scale(data$age)
#Running the regression with our standardized variables now
standard <- lm(vote_z~age_z, data)
#outputting the betas
summary(standard)
##
## Call:
## lm(formula = vote_z ~ age_z, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.295138 -0.113425 -0.005314 0.099540 0.287768
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.677e-16 3.264e-02 0.00 1
## age_z 9.872e-01 3.332e-02 29.63 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1632 on 23 degrees of freedom
## Multiple R-squared: 0.9745, Adjusted R-squared: 0.9734
## F-statistic: 878 on 1 and 23 DF, p-value: < 2.2e-16
Answer: Regular: yhat = 4.52 + 0.099435x. All that has changed here is the intercept, which is good: centering did not alter the relationships in our data. The intercept is more useful now, predicting a voting enthusiasm of 4.52 at the mean age of 45.08. Standardized: yhat = 1.677e-16 + 0.9872x, i.e., effectively yhat = 0.9872x. This is identical to the earlier standardized model, since standardizing already involves centering.
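One more sanity check: when the predictor is mean-centered, the intercept must equal the mean of the outcome, and it does:
#with age centered, the intercept equals the mean of vote
mean(data$vote) #compare to the 4.52 intercept above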
j. Now estimate the coefficients of a regression equation in R predicting voting interest from age and education, include the output below, and write the equation. Use all centered predictors.
#centering our education and voting variables now
data$education_c <- scale(data$education, scale=FALSE)
data$vote_c <- scale(data$vote, scale=FALSE)
#this is our model with everything centered and multiple predictors
#to add multiple predictors, we only need a + between predictors
multiplecentered <- lm(vote_c~age_c + education_c, data)
summary(multiplecentered)
##
## Call:
## lm(formula = vote_c ~ age_c + education_c, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.53281 -0.18523 -0.01487 0.17027 0.50181
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.584e-16 5.781e-02 0.000 1.000000
## age_c 9.320e-02 2.385e-02 3.908 0.000755 ***
## education_c 5.067e-02 1.919e-01 0.264 0.794190
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2891 on 22 degrees of freedom
## Multiple R-squared: 0.9746, Adjusted R-squared: 0.9722
## F-statistic: 421.3 on 2 and 22 DF, p-value: < 2.2e-16
Answer: Our very first multiple regression, I’m so proud of it. The equation: predicted vote_c = 0.0932(age_c) + 0.0507(education_c), with an intercept of essentially zero because every variable has been centered.
k. Interpret all the important information for describing this model as well as the predictors in the model. Answer: This more complex model predicts yhat from both predictors at once. The coefficient for age has dropped slightly to .0932, compared to .0994 when it was the only predictor. Education comes in at a coefficient of .0507, and notably, the t-test of its unique contribution is not significant (p = .794). Education appears to add very little once age is in the model.
l. Does education significantly predict interest in voting above and beyond what age can explain? Explain how you know. Answer: No, because R has handily calculated a t-test of that specific predictor’s unique contribution and found it decidedly non-significant, t(22) = 0.264, p = .794. I suspect this is a symptom of multicollinearity between age and education.
m. Does age predict interest in voting above and beyond what education can explain? Explain. Answer: Yes: even with education in the model, age remains a significant unique predictor, t(22) = 3.91, p < .001. The fuller picture, though, is likely more complicated.
cor(data[c(1,2,3)])
## vote age education
## vote 1.0000000 0.9871536 0.9782081
## age 0.9871536 1.0000000 0.9896314
## education 0.9782081 0.9896314 1.0000000
The variables are all extremely highly correlated with one another, suggesting that much of the variance in each is shared.
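One way to quantify this is a variance inflation factor; hand-rolling it keeps this self-contained (car::vif() would give the same number). Given the .99 correlation, it comes out around 48, far past the usual rule-of-thumb cutoff of 10:
#a hand-rolled VIF for age: 1 / (1 - R^2) from regressing age on education
r2_age <- summary(lm(age_c ~ education_c, data))$r.squared
1 / (1 - r2_age) #values above ~10 are usually read as serious multicollinearity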
edumodel <- lm(vote_c~education_c, data)
summary(edumodel)
##
## Call:
## lm(formula = vote_c ~ education_c, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.83709 -0.21527 -0.04436 0.16291 0.74836
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.036e-16 7.359e-02 0.00 1
## education_c 7.927e-01 3.508e-02 22.59 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.368 on 23 degrees of freedom
## Multiple R-squared: 0.9569, Adjusted R-squared: 0.955
## F-statistic: 510.5 on 1 and 23 DF, p-value: < 2.2e-16
As we see from this one-predictor model, education by itself does appear to be a significant predictor in a vacuum… which strongly implicates multicollinearity between the two variables as the reason for its lack of significance when combined with age. Out of curiosity, I’ve run the same model with the variables entered in the opposite order.
multiplecentered2 <- lm(vote_c~education_c + age_c, data)
summary(multiplecentered2)
##
## Call:
## lm(formula = vote_c ~ education_c + age_c, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.53281 -0.18523 -0.01487 0.17027 0.50181
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.584e-16 5.781e-02 0.000 1.000000
## education_c 5.067e-02 1.919e-01 0.264 0.794190
## age_c 9.320e-02 2.385e-02 3.908 0.000755 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2891 on 22 degrees of freedom
## Multiple R-squared: 0.9746, Adjusted R-squared: 0.9722
## F-statistic: 421.3 on 2 and 22 DF, p-value: < 2.2e-16
Age is still the significant predictor here, though I admit I expected it to be education instead, since education is entered into the model first. It turns out the t-tests in summary() assess each predictor’s unique contribution controlling for all the others, so the order of terms in the formula doesn’t affect them.
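Where order genuinely would matter is the sequential anova() table, which hands out sums of squares in entry order, unlike the summary() t-tests:
#sequential anova() credits each predictor only with what it adds
#over the predictors entered before it, so formula order matters here
anova(multiplecentered) #age first; education gets only the leftover
anova(multiplecentered2) #education first; age gets only the leftover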
n. Now we will conduct a hierarchical linear regression analysis. Create two regression models and analyze them hierarchically (i.e., in sequence). Use voting interest predicted from (1) only age on the first step and (2) age and education on the second step. Include R output for this below.
age <- lm(vote~age, data) #just age
multiple <- lm(vote~age+education ,data) #age and education
#the "anova" command will compared the residual sum of squares and see if our model is doing signficantly better
anova(age, multiple)
## Analysis of Variance Table
##
## Model 1: vote ~ age
## Model 2: vote ~ age + education
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 23 1.8441
## 2 22 1.8383 1 0.0058264 0.0697 0.7942
Answer: Disappointingly, the comparison is non-significant: F(1, 22) = 0.0697, p = .794. Adding education on the second step does not significantly reduce the residual sum of squares beyond the age-only model, so the simpler model is preferred.
o. Compare the significance of the t statistic for education in part 1j to the significance of ΔR2 (see Pr(>F)) from what you just did in part 1n. Comment on the relationship between these two p-values.
#change in R2 and its square root
chgR2 = summary(multiplecentered)$r.squared - centeredsum$r.squared
chgR2
## [1] 8.065404e-05
sqrt(chgR2)
## [1] 0.008980759
Answer: ΔR2 is the change in R2 between the two models, about .00008 here (.009 in r units). The relationship: when the second step adds exactly one predictor, the ΔR2 F-test and the t-test of that predictor’s coefficient are the same test. F = t2, and indeed 0.264^2 ≈ 0.0697, the F from part 1n; the two p-values are identical (.794190 vs .7942). It is 4 am, but it turns out there was no need to throw in the towel.
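The equivalence is easy to verify directly from the stored objects:
#the squared t for education equals the model-comparison F
summary(multiple)$coefficients["education", "t value"]^2 #~0.0697, as in the anova() above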
#tinytex::install_tinytex() #commented out: only needed once, if you want to knit this file to PDF