This week’s HW will be split into 2 sections that both mix applied and conceptual questions. The first will be on nonlinear relationships and the second on multiple comparisons and randomization tests.
The first section will focus on nonlinear relationships and will be using a data set that includes a measure of performance on a test as a function of stress (in addition to a few other variables). Let’s begin!
Per usual, we’ll start by loading our necessary packages
Next, our data:
1. Look at the relationship between performance and stress by regressing performance on stress. Do not center stress for now. Is stress sig?
Performmodel <- lm(performance~stress, data)
summary(Performmodel)
##
## Call:
## lm(formula = performance ~ stress, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.012 -1.226 0.119 1.250 3.381
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.0952 0.9856 6.185 6.08e-06 ***
## stress 0.1310 0.2204 0.594 0.559
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.02 on 19 degrees of freedom
## Multiple R-squared: 0.01824, Adjusted R-squared: -0.03343
## F-statistic: 0.3531 on 1 and 19 DF, p-value: 0.5594
Answer: No it is not (p > .55)
a. What can we do to examine our assumption of correct model specification (HINT: It’s also how we test some of the other assumptions)? Let’s do it and interpret what we see.
plot(Performmodel$residuals~Performmodel$fitted.values, ylab="Residuals", xlab="Predicted Values")
plot(Performmodel$residuals~data$stress, ylab="Residuals", xlab="Stress values")
Answer: We plot the residuals against the predicted values (also, this is a homoscedasticity assumption check). But it can alert us to something systematic going on with the residuals. Our graph suggests we have a non-linear relationship between stress and performance. When there are lots of predictors, it can be difficult to determine WHICH predictor may be able to explain more. Plotting residuals against the individual X’s can tell you which is incorrectly specified. Here, stress shows a relationship to the residuals in a systematic (curvilinear) way. Suggesting it is the culprit (the figures look the same because stress is the ONLY predictor). A scatterplot of the Y’s against the X’s can also reveal this information.
b.Create three new terms: Stress squared, centered stress, and centered stress squared. Correlate each of them together (along with stress), what do you notice? What are the implications for multicollinearity?
#Now let's center our stress variable
data$stress_c<-scale(data$stress, center = TRUE, scale = FALSE)
#Now we'll create quadratic terms with our centered and uncentered variables
data$stress_sq <- data$stress^2
data$stress_csq<- data$stress_c^2
cor(data[c("stress","stress_c","stress_sq","stress_csq")])
## stress stress_c stress_sq stress_csq
## stress 1.0000000 1.0000000 0.9773556 0.0000000
## stress_c 1.0000000 1.0000000 0.9773556 0.0000000
## stress_sq 0.9773556 0.9773556 1.0000000 0.2116037
## stress_csq 0.0000000 0.0000000 0.2116037 1.0000000
Answer: NOT surprisingly, stress and centered stress are perfectly correlated. Also not surprisingly, they are highly correlated with stress squared. What might be surprising is that centered stress and centered stress squared are not correlated (in real data it will be slightly off based on larger asymmetry in the distribution).
We are demonstrating that centering reduces artificial multicollinearity and leaves only “essential” multicollinearity.
c. Review: what are the reasons that centering is useful?
Answer: Makes b0 (intercept) interpretable. Removes non-essential multicollinearity when dealing with polynomial terms and interactions (note that third and first powers will still be correlated, as will second and fourth, etc.).
d. Predict performance from stress and stress squared (using centering to reduce MC). Is this a good model?
modelquad<- lm(performance~stress_c+stress_csq, data)
summary(modelquad)
##
## Call:
## lm(formula = performance ~ stress_c + stress_csq, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.9048 -0.6508 0.1191 0.8333 1.8333
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.50794 0.36275 23.454 6.06e-15 ***
## stress_c 0.13095 0.11874 1.103 0.285
## stress_csq -0.47222 0.06855 -6.888 1.93e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.088 on 18 degrees of freedom
## Multiple R-squared: 0.73, Adjusted R-squared: 0.7
## F-statistic: 24.33 on 2 and 18 DF, p-value: 7.627e-06
Answer: Yes, R² = 0.7, F(2,18) = 24.333, p < .001. (or equivalently, this model accounts for ~70% of the variance in performance).
e. What do the signs tell us about the shape of the relationship between stress and performance?
plot(modelquad$fitted.values~data$stress, ylab="Predicted Y", xlab="Stress", col="blue", type="b")
Answer:The positive sign for the linear term says that the shape is mostly upward, or that the slope of the line tangent to the centering point is positive, that is why the right side is kicked up a tad. The negative sign for the quadratic term says that the relationship bends downwards (Frown, i.e., it starts going up but ends going down).
f. What is the mean slope of the line? What is the slope of the line when stress = 4?
Answer: The mean slope of the line is .131. The slope of the line when stress = 4 (the mean level of stress) is ALSO .131 (because we mean centered).
We can see that the line starts-off steeply upwards, levels off, then becomes more and more steeply downwards. If you like, you can dig deeper. To get the equation for the slope of the tangent line at any X, we just take the derivative. It happens that this will ALWAYS be: b1+2b2X using this equation, .131 -.944X.
Using the centered values, we can see that when we plug-in 0 into the above equation
[ .131 -.944(0) = .131]
When stress = 5, Cstress = +1.0 (because the mean value is 4, and that is the “new” 0 in our centered variable). Plugging-in 1 for X, we get a slope of -.813 and we can see in the figure, the slope is already trending downward by that point. (If we take the second derivative, it tells us how quickly the slope accelerates, it will always just be 2b2.)
g. So, what do each the raw regression coefficients tell us?
Answer: The 8.51 tells us the level of Performance at the Mean of X (ie, when X = 0).
The .131 is the mean slope of the tangent line (tells us overall positive relationship across all levels of X).
and the slope of the line when at the mean on stress.
The -.472 tells us that the curve is pointed downwards.
h. What amount of stress will yield the greatest estimated performance?
8.508-((.130^2)/(4*-.472))
## [1] 8.516951
Answer:To get a maximum or minimum (depending on the direction of the quadratic term), we want to find where the slope of the tangent line is 0. It is the point at which the line levels off before changing directions.
So just solve the above magic equation b1+2b2X for the slope= 0.
b1+2b2X = 0
The min or max will have a tangent line slope = 0, it’s the flat point before the curve changes slope sign
Whether it is a maximum or minimum depends on if the line curves up (minimum) or down (maximum). So again…
MinMaxx = -b1 / 2b2
Here, -.131/-.944 = .139 (on Cstress). To convert back to raw units, add this value to the mean of X which is 4! Thus this X value is about 4.139, which should match the figure if you check it out above.
But what is the y value?
MinMaxy = b0 - b21 / 4b2
(8.508)-((.130^2)/(4*-.472)) = 8.516951
The Y is 8.52.
i. Calculate the descriptive statistics for stress and stress squared. If we maintained the relationship between centered stress and centered stress squared, a zero on centered stress should correspond to a zero on centered stress squared. However, if we z-scored both variables, is this what would happen? What might this mean for the betas that we calculate based on these z-scores?
describe(data[c("stress_c","stress_csq")])
## vars n mean sd median trimmed mad min max range skew kurtosis
## stress_c 1 21 0 2.05 0 0.00 2.97 -3 3 6 0.00 -1.41
## stress_csq 2 21 4 3.55 4 3.88 4.45 0 9 9 0.42 -1.50
## se
## stress_c 0.45
## stress_csq 0.77
Answer: Forcing both to a standard deviation of 1 and mean of 0 creates the need for impossible scores (an X and X-squared that are incompatible). The main point here is that you should first z-score the original variable then create the interaction term.
j. How CAN we get the proper betas? What are they?
data$performance_z <- scale(data$performance, scale=TRUE, center=TRUE)
data$stress_z <- scale(data$stress, scale=TRUE, center=TRUE)
data$stress_zsq <- data$stress_z^2
Zmodel <-lm(formula = performance_z ~ stress_z + stress_zsq, data = data)
summary(Zmodel)
##
## Call:
## lm(formula = performance_z ~ stress_z + stress_zsq, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.95868 -0.32755 0.05992 0.41942 0.92273
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.9507 0.1826 5.207 5.94e-05 ***
## stress_z 0.1351 0.1225 1.103 0.285
## stress_zsq -0.9982 0.1449 -6.888 1.93e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5477 on 18 degrees of freedom
## Multiple R-squared: 0.73, Adjusted R-squared: 0.7
## F-statistic: 24.33 on 2 and 18 DF, p-value: 7.627e-06
Answer: With the quadratic term we will need to square the previously z scored stress (NOT standardize the square of centered stress). This will be the betas based on possible X’s, where you are 0 on each. (Note: Usually, when everything is z-scored, the constant is 0, but not in this case here because the mean of the product of the z score term is actually not zero! But it’s not a big deal since we are interested in comparative contributions which are unchanged.) We want the Z score squared not the Z score of the X squared, which is what most programs want to do.
k. How would the model be changed if we did not include the linear effect of stress?
Answer: If we dropped the linear effect, the squared term would no longer carry its interpretation as ONLY the squared relationship. Alone in a model, the squared term would carry both the linear and quadratic pieces in it. In order to keep the interpretations we want (“pure” quadratic only), you must include the linear term whenever you include the quadratic term.
In the next section, we are going to learn how to run permutation (i.e. randomization) tests in R, in addition to examining different ways to control for multiple comparisons.
Importantly, we will do so by simulating some data, which will hopefully give you insight into how and when you might want to do so for your own research.
The questions for this section may require you to do a quick google search for a function, or better yet to do a ?a nd then then name of the function.
1. If we have a data set with n variables, what are the chances that one or more are significant just by chance (i.e. spurious)? To examine how we might go about figuring this out, we will simulate a data set with variables that vary in how correlated they are.
set.seed(316) #why do we do this?
N <-300 # set our sample
mu <- sample(1:7, 10, replace=T)
n <- 10 #set how many
covm <- rWishart(1,n,diag(n)) # simulates random positive definite matrices
sigma <- matrix(covm,10,10) #creates matrix
# A non-positive definite VCV matrix is when the variables are a linear combination of one another or that one variable entirely determines/depends on another. If you randomly create your own VCV matrix without knowing what you're inputting, your likely to encounter this.
df <- MASS::mvrnorm(n=N,mu=mu,Sigma=sigma) #simulate the data
df <- as.data.frame(df) # let's make it a dataframe
names(df) <- c("Y1","x2","x3","x4","x5","x6","x7","x8","x9","x10") # and name it
M <- cor(df) # run correlation
a. why do we need to set.seed before we simulate data? Answer: We set.seed in order to recreate an identical example. Everything in R is pseudorandomized making this possible.
b. what does the sample function do? Answer: the sample function takes a sample of a specified size from the element of x. You can specify if you want it to be samples with replacement (i.e. after you pull a sample it’s thrown back into the pool, or not) In this case, because of the small sample size, we did want to sample with replacement.
c. how many variables does df contain? Answer: 9 IV and 1 DV
2. Next we are going to create a visual correlation matrix that contains information about both the strenght of the correlation (color) and the significance of the correlation.
#this is a function that grabs p values from a correlation matrix
cor.mtest <- function(mat, ...) {
mat <- as.matrix(mat)
n <- ncol(mat)
p.mat<- matrix(NA, n, n)
diag(p.mat) <- 0
for (i in 1:(n - 1)) {
for (j in (i + 1):n) {
tmp <- cor.test(mat[, i], mat[, j], ...)
p.mat[i, j] <- p.mat[j, i] <- tmp$p.value
}
}
colnames(p.mat) <- rownames(p.mat) <- colnames(mat)
p.mat
}
# next we can apply the function to the data set we simulated earlier
p.mat <- cor.mtest(df)
#let's visualize the correlation with color representing the and white cells representing no sig.
col <- colorRampPalette(c("#BB4444", "#EE9988", "#FFFFFF", "#77AADD", "#4477AA"))
corrplot(M, method="color", col=col(200),
type="upper", order="hclust", #order by level of strength
addCoef.col = "black",
tl.col="black", tl.srt=45,
# Combine with significance
p.mat = p.mat, sig.level = 0.05, insig = "blank",
# hide correlation coefficient on the principal diagonal
diag=FALSE
)
a. what do the blue and red colored cells indicate?
Answer: The blue gradient represents positive correlation (i.e. > 0), whereas the red gradient represents negative correlations (i.e. < 0).
b. how many variables are not correlated?
Answer: 11, as denoted by the cells that are white.
3. one way to test how many correlations we would expect to find by chance is by using a randomization test. This basically reshuffles the values in each column (i.e. chance) and then compares that correlation to the one observed. If we do this over and over (e.g., 1k times) this will give us insight into how many we would expect to get by chance.
#the rand.test function from the MASS packages does just that
randTest <- rand.test(df[,1:5],df[,6:10],sims=1000, crit = .95)
## Warning in rand.test(df[, 1:5], df[, 6:10], sims = 1000, crit = 0.95): When p=.
## 00, confidence interval for p not valid.
?rand.test
a. what two lists does this function give you and what do they indicate? (hint: run the line ?rand.test) Answer: The first is a list regarding the absolute correlations observed, the size of the absolute correlation we would expect by chance and SE around that.
b. How many correlation were significant and how many would we expect by chance? Answer: 20 were significant and we would expect 1.164 by chance. However, to be conservative, we would take the upper limit + SE of 1.27 and then round up. Meaning that we should remove 4 sig effects. I think in this situation we would want to order the correlation (e.g., using the sort(df$variable)) and then not interpret the 4 correlations that were the least significant.
4. next, let’s simulate a between subjects factorial design to look at how we might control for multiple comparisons
DV<-rnorm(80,c(100,200,300,400),sd=10)
IV1<-rep(c("level1","level2", "level3", "level4"),each=2,20)
IV2<-rep(c("level1","level2"),each=1,40)
AllData<-data.frame(DV,IV1,IV2)
a. Remember that a model using aov (ANOVA) and lm (regression) with a categorical predictor will return the identical F test. The difference is that the aov will give you the omnibus test, whereas the regression will choose a coding scheme (dummy by default) and compare means to that value (i.e. intercept). Is the F test the same, and if so what is it?
anovaOutput <- aov(DV~IV1, data = AllData)
summary(anovaOutput)
## Df Sum Sq Mean Sq F value Pr(>F)
## IV1 3 1641655 547218 197 <2e-16 ***
## Residuals 156 433389 2778
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
lmOutput <-lm(DV~IV1, data = AllData)
summary(lmOutput)
##
## Call:
## lm(formula = DV ~ IV1, data = AllData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -69.45 -51.25 1.58 50.76 68.75
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 151.558 8.334 18.186 <2e-16 ***
## IV1level2 201.627 11.786 17.108 <2e-16 ***
## IV1level3 -5.569 11.786 -0.473 0.637
## IV1level4 197.865 11.786 16.788 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 52.71 on 156 degrees of freedom
## Multiple R-squared: 0.7911, Adjusted R-squared: 0.7871
## F-statistic: 197 on 3 and 156 DF, p-value: < 2.2e-16
Answer: Yes, 190.7, p < 2.2e-16
b. Okay, how do we run post hoc tests? Well, we can do so with the pairwise.t.test, which computes all possible comparisons. If we run a post hoc test without any corrections, how many are significant?
pairwise.t.test(AllData$DV, AllData$IV1, p.adj = "none")
##
## Pairwise comparisons using t tests with pooled SD
##
## data: AllData$DV and AllData$IV1
##
## level1 level2 level3
## level2 <2e-16 - -
## level3 0.64 <2e-16 -
## level4 <2e-16 0.75 <2e-16
##
## P value adjustment method: none
Answer: 4 are sig and 2 are not.
c. what if we want to control using bonferroni corrections? How many are significant now?
pairwise.t.test(AllData$DV, AllData$IV1, p.adj = "bonf")
##
## Pairwise comparisons using t tests with pooled SD
##
## data: AllData$DV and AllData$IV1
##
## level1 level2 level3
## level2 <2e-16 - -
## level3 1 <2e-16 -
## level4 <2e-16 1 <2e-16
##
## P value adjustment method: bonferroni
Answer: Well in this case it’s actually the same, but interestingly, we can see the p values of the non sig tests are now 1.
c. the most common multiple comparison is Tukey, which requires a different function. Does this change anything?
TukeyHSD(anovaOutput)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = DV ~ IV1, data = AllData)
##
## $IV1
## diff lwr upr p adj
## level2-level1 201.627458 171.02026 232.23465 0.0000000
## level3-level1 -5.569371 -36.17657 25.03782 0.9650043
## level4-level1 197.865027 167.25783 228.47222 0.0000000
## level3-level2 -207.196829 -237.80402 -176.58963 0.0000000
## level4-level2 -3.762431 -34.36963 26.84476 0.9887056
## level4-level3 203.434398 172.82720 234.04159 0.0000000
Answer: It does change the p values of the non sig effects (closer to bonferroni) but the same 4 tests are sig.