This week’s HW will be split into 2 sections that both mix applied and conceptual questions. The first will be on nonlinear relationships and the second on multiple comparisons and randomization tests.

Power Polynomials

The first section will focus on nonlinear relationships and will be using a data set that includes a measure of performance on a test as a function of stress (in addition to a few other variables). Let’s begin!

Per usual, we’ll start by loading our necessary packages
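The loading chunk isn't shown here; a minimal sketch, assuming the packages used later in this document (psych for describe(), corrplot for corrplot(), and MASS for mvrnorm()). Whichever package supplies rand.test() below would need to be loaded as well.

library(psych)    # describe()
library(corrplot) # corrplot()
library(MASS)     # mvrnorm()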

Next, our data:
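The import chunk isn't shown either; a sketch with a hypothetical file name (substitute the actual file for this assignment):

data <- read.csv("stress_performance.csv") # hypothetical file name
str(data)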

1. Look at the relationship between performance and stress by regressing performance on stress. Do not center stress for now. Is stress significant?

Performmodel <- lm(performance~stress, data)
summary(Performmodel)
## 
## Call:
## lm(formula = performance ~ stress, data = data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -4.012 -1.226  0.119  1.250  3.381 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   6.0952     0.9856   6.185 6.08e-06 ***
## stress        0.1310     0.2204   0.594    0.559    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.02 on 19 degrees of freedom
## Multiple R-squared:  0.01824,    Adjusted R-squared:  -0.03343 
## F-statistic: 0.3531 on 1 and 19 DF,  p-value: 0.5594

Answer: No, it is not (b = 0.131, t(19) = 0.59, p = .56).

a. What can we do to examine our assumption of correct model specification (HINT: It’s also how we test some of the other assumptions)? Let’s do it and interpret what we see.

plot(Performmodel$residuals~Performmodel$fitted.values, ylab="Residuals", xlab="Predicted Values")

plot(Performmodel$residuals~data$stress, ylab="Residuals", xlab="Stress values")

Answer: We plot the residuals against the predicted values (the same plot we use to check the homoscedasticity assumption). This plot can also alert us to something systematic going on in the residuals, and here it suggests a non-linear relationship between stress and performance. When there are lots of predictors, it can be difficult to determine WHICH predictor is incorrectly specified; plotting the residuals against the individual X’s tells you which one. Here, stress relates to the residuals in a systematic (curvilinear) way, suggesting it is the culprit (the two figures look identical because stress is the ONLY predictor). A scatterplot of Y against the X’s can also reveal this information.
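As a quick sketch of that last point (same data, no new assumptions), the curvature is visible in the raw scatterplot too:

plot(performance ~ stress, data = data, ylab = "Performance", xlab = "Stress")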

b. Create three new terms: stress squared, centered stress, and centered stress squared. Correlate them with each other (along with stress). What do you notice? What are the implications for multicollinearity?

#Now let's center our stress variable
data$stress_c<-scale(data$stress, center = TRUE, scale = FALSE)
#Now we'll create quadratic terms with our centered and uncentered variables
data$stress_sq <- data$stress^2
data$stress_csq<- data$stress_c^2
cor(data[c("stress","stress_c","stress_sq","stress_csq")])
##               stress  stress_c stress_sq stress_csq
## stress     1.0000000 1.0000000 0.9773556  0.0000000
## stress_c   1.0000000 1.0000000 0.9773556  0.0000000
## stress_sq  0.9773556 0.9773556 1.0000000  0.2116037
## stress_csq 0.0000000 0.0000000 0.2116037  1.0000000

Answer: NOT surprisingly, stress and centered stress are perfectly correlated, and both are highly correlated with stress squared. What might be surprising is that centered stress and centered stress squared are uncorrelated (in real data this correlation will be slightly nonzero, growing with the asymmetry of the distribution).

We are demonstrating that centering reduces artificial multicollinearity and leaves only “essential” multicollinearity.

c. Review: what are the reasons that centering is useful?

Answer: It makes b0 (the intercept) interpretable, and it removes non-essential multicollinearity when dealing with polynomial terms and interactions (note that the first and third powers will still be correlated, as will the second and fourth, etc.).
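A quick sketch of that caveat using our centered variable; the first and third powers remain strongly correlated even after centering:

cor(data$stress_c, data$stress_c^3) # essential multicollinearity survives centering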

d. Predict performance from stress and stress squared (using centering to reduce multicollinearity). Is this a good model?

modelquad<- lm(performance~stress_c+stress_csq, data)
summary(modelquad)
## 
## Call:
## lm(formula = performance ~ stress_c + stress_csq, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.9048 -0.6508  0.1191  0.8333  1.8333 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  8.50794    0.36275  23.454 6.06e-15 ***
## stress_c     0.13095    0.11874   1.103    0.285    
## stress_csq  -0.47222    0.06855  -6.888 1.93e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.088 on 18 degrees of freedom
## Multiple R-squared:   0.73,  Adjusted R-squared:    0.7 
## F-statistic: 24.33 on 2 and 18 DF,  p-value: 7.627e-06

Answer: Yes: R² = .73 (adjusted R² = .70), F(2, 18) = 24.33, p < .001. Equivalently, the model accounts for about 73% of the variance in performance.

e. What do the signs tell us about the shape of the relationship between stress and performance?

plot(modelquad$fitted.values~data$stress, ylab="Predicted Y", xlab="Stress", col="blue", type="b")

Answer: The positive sign on the linear term says that the shape is mostly upward; more precisely, the slope of the line tangent to the curve at the centering point is positive, which is why the right side is kicked up a tad. The negative sign on the quadratic term says that the relationship bends downward (a frown: it starts going up but ends going down).

f. What is the mean slope of the line? What is the slope of the line when stress = 4?

Answer: The mean slope of the line is .131. The slope of the line when stress = 4 (the mean level of stress) is ALSO .131 (because we mean centered).

We can see that the line starts off steeply upward, levels off, then becomes more and more steeply downward. If you like, you can dig deeper. To get the slope of the tangent line at any X, we just take the derivative, which will ALWAYS be b1 + 2b2X. With our estimates, that is .131 - .944X.

Using the centered values, plugging 0 into this equation gives:

.131 - .944(0) = .131

When stress = 5, centered stress = +1.0 (because the mean is 4, and that is the “new” 0 of the centered variable). Plugging in 1 for X, we get a slope of -.813, and we can see in the figure that the slope is already trending downward by that point. (The second derivative tells us how quickly the slope changes; it will always just be 2b2.)
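To make this concrete, here is a small helper (a sketch; tangent_slope is our own name) built from the fitted coefficients:

b <- coef(modelquad)
tangent_slope <- function(x_c) unname(b["stress_c"] + 2 * b["stress_csq"] * x_c) # b1 + 2b2X
tangent_slope(0) # at the mean (raw stress = 4): ~.131
tangent_slope(1) # at raw stress = 5: ~ -.813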

g. So, what does each of the raw regression coefficients tell us?

Answer: The 8.51 tells us the level of performance at the mean of stress (i.e., when centered stress = 0).

The .131 is the mean slope of the tangent line (an overall positive relationship across all levels of X), which is also the slope of the tangent line at the mean of stress.

The -.472 tells us that the curve bends downward.

h. What amount of stress will yield the greatest estimated performance?

8.508-((.130^2)/(4*-.472))
## [1] 8.516951

Answer: To get a maximum or minimum (depending on the direction of the quadratic term), we want to find where the slope of the tangent line is 0. That is the point at which the line levels off before changing direction.

So just solve the above magic equation b1 + 2b2X for slope = 0:

b1 + 2b2X = 0

The min or max will have a tangent line with slope = 0; it is the flat point just before the curve’s slope changes sign. Whether it is a maximum or a minimum depends on whether the curve opens up (minimum) or down (maximum). Solving for X:

MinMaxX = -b1 / (2b2)

Here, -.131 / -.944 = .139 (on centered stress). To convert back to raw units, add this value to the mean of stress, which is 4. Thus the raw X value is about 4.139, which should match the figure above.

But what is the y value?

MinMaxY = b0 - b1^2 / (4b2)

(8.508)-((.130^2)/(4*-.472)) = 8.516951

The Y is 8.52.
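The same vertex can be computed straight from the fitted coefficients (a sketch; the object names are ours):

b <- coef(modelquad)
x_max_c <- -b["stress_c"] / (2 * b["stress_csq"]) # ~.139 on centered stress
x_max_raw <- x_max_c + mean(data$stress)          # ~4.14 in raw units
y_max <- b["(Intercept)"] + b["stress_c"] * x_max_c + b["stress_csq"] * x_max_c^2 # ~8.52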

i. Calculate the descriptive statistics for centered stress and centered stress squared. If we maintained the relationship between centered stress and centered stress squared, a zero on centered stress should correspond to a zero on centered stress squared. However, if we z-scored both variables, is this what would happen? What might this mean for the betas that we calculate based on these z-scores?

describe(data[c("stress_c","stress_csq")])
##            vars  n mean   sd median trimmed  mad min max range skew kurtosis
## stress_c      1 21    0 2.05      0    0.00 2.97  -3   3     6 0.00    -1.41
## stress_csq    2 21    4 3.55      4    3.88 4.45   0   9     9 0.42    -1.50
##              se
## stress_c   0.45
## stress_csq 0.77

Answer: Forcing both to a standard deviation of 1 and a mean of 0 creates the need for impossible scores (an X and an X-squared that are incompatible). The main point here is that you should first z-score the original variable and THEN square it, rather than z-scoring the squared term.
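One quick way to see the impossibility (a sketch): z-scoring the squared term itself forces it to take negative values, which no real squared score could have.

range(scale(data$stress_csq)) # the minimum is negative, impossible for a true square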

j. How CAN we get the proper betas? What are they?

data$performance_z <- scale(data$performance, scale=TRUE, center=TRUE)
data$stress_z <- scale(data$stress, scale=TRUE, center=TRUE)
data$stress_zsq <- data$stress_z^2

Zmodel <-lm(formula = performance_z ~ stress_z + stress_zsq, data = data)
summary(Zmodel)
## 
## Call:
## lm(formula = performance_z ~ stress_z + stress_zsq, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.95868 -0.32755  0.05992  0.41942  0.92273 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   0.9507     0.1826   5.207 5.94e-05 ***
## stress_z      0.1351     0.1225   1.103    0.285    
## stress_zsq   -0.9982     0.1449  -6.888 1.93e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5477 on 18 degrees of freedom
## Multiple R-squared:   0.73,  Adjusted R-squared:    0.7 
## F-statistic: 24.33 on 2 and 18 DF,  p-value: 7.627e-06

Answer: With the quadratic term we need to square the previously z-scored stress (NOT standardize the square of centered stress). This gives betas based on possible X’s, where a 0 on one corresponds to a 0 on the other. (Note: usually when everything is z-scored the constant is 0, but not here, because the mean of the squared z-score term is not zero! It’s not a big deal, though, since we are interested in comparative contributions, which are unchanged.) We want the z-score squared, not the z-score of the X squared, which is what most programs will try to do.
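We can check why the constant isn’t 0 here (a sketch): the squared z-score has a mean near 1, not 0.

mean(data$stress_z^2) # (n-1)/n = 20/21 with scale()'s n-1 denominator, so ~.95, not 0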

k. How would the model be changed if we did not include the linear effect of stress?

Answer: If we dropped the linear effect, the squared term would no longer carry its interpretation as ONLY the squared relationship; alone in a model, the squared term would carry both the linear and quadratic pieces. To keep the interpretation we want (“pure” quadratic only), you must include the linear term whenever you include the quadratic term.
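For a sketch of what that looks like, fit the quadratic-only model and note that the squared term’s coefficient now absorbs any linear trend as well:

summary(lm(performance ~ stress_csq, data))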

Randomization tests

In the next section, we are going to learn how to run permutation (i.e. randomization) tests in R, in addition to examining different ways to control for multiple comparisons.

Importantly, we will do so by simulating some data, which will hopefully give you insight into how and when you might want to do so for your own research.

The questions for this section may require you to do a quick Google search for a function, or better yet, to type ? followed by the name of the function.

1. If we have a data set with n variables, what are the chances that one or more are significant just by chance (i.e. spurious)? To examine how we might go about figuring this out, we will simulate a data set with variables that vary in how correlated they are.

set.seed(316) #why do we do this?

N <- 300 # set our sample size
mu <- sample(1:7, 10, replace=T) # draw 10 population means from 1 to 7
n <- 10 # set how many variables
covm <- rWishart(1, n, diag(n)) # simulates a random positive definite matrix
sigma <- matrix(covm, 10, 10) # reshapes it into a 10 x 10 covariance matrix

# A non-positive definite VCV matrix arises when the variables are linear combinations of one another, i.e., when one variable entirely determines/depends on another. If you randomly create your own VCV matrix without knowing what you're inputting, you're likely to encounter this.

df <- MASS::mvrnorm(n=N, mu=mu, Sigma=sigma)  #simulate the data

df <- as.data.frame(df) # let's make it a dataframe
names(df) <- c("Y1","x2","x3","x4","x5","x6","x7","x8","x9","x10") # and name it
M <- cor(df) # run correlation

a. Why do we need to set.seed before we simulate data? Answer: We set.seed in order to recreate an identical example. Random number generation in R is pseudorandom, which makes this possible.

b. What does the sample function do? Answer: The sample function takes a sample of a specified size from the elements of x. You can specify whether you want to sample with replacement (i.e., after you pull a value it is thrown back into the pool) or without. In this case we must sample with replacement, because we are drawing 10 values from a pool of only 7, as the sketch below shows.
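Here is the call from above in isolation (a sketch), showing why replacement is mandatory when the pool is smaller than the draw:

sample(1:7, 10, replace = TRUE) # 10 draws from only 7 options
# sample(1:7, 10, replace = FALSE) would throw an error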

c. How many variables does df contain? Answer: 10 in total: 1 DV (Y1) and 9 IVs (x2 through x10).

2. Next, we are going to create a visual correlation matrix that conveys both the strength of each correlation (via color) and its significance.

#this is a function that grabs p values from a correlation matrix
cor.mtest <- function(mat, ...) {
    mat <- as.matrix(mat)
    n <- ncol(mat)
    p.mat<- matrix(NA, n, n)
    diag(p.mat) <- 0
    for (i in 1:(n - 1)) {
        for (j in (i + 1):n) {
            tmp <- cor.test(mat[, i], mat[, j], ...)
            p.mat[i, j] <- p.mat[j, i] <- tmp$p.value
        }
    }
  colnames(p.mat) <- rownames(p.mat) <- colnames(mat)
  p.mat
}
# next we can apply the function to the data set we simulated earlier
p.mat <- cor.mtest(df)

#let's visualize the correlations, with color representing strength and direction, and white cells representing non-significant correlations
col <- colorRampPalette(c("#BB4444", "#EE9988", "#FFFFFF", "#77AADD", "#4477AA"))
corrplot(M, method="color", col=col(200),  
         type="upper", order="hclust", #order by level of strength
         addCoef.col = "black", 
         tl.col="black", tl.srt=45, 
         # Combine with significance
         p.mat = p.mat, sig.level = 0.05, insig = "blank", 
         # hide correlation coefficient on the principal diagonal
         diag=FALSE 
         )

a. what do the blue and red colored cells indicate?

Answer: The blue gradient represents positive correlation (i.e. > 0), whereas the red gradient represents negative correlations (i.e. < 0).

b. How many pairs of variables are not significantly correlated?

Answer: 11, as denoted by the cells that are white.
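We can confirm this count from the p-value matrix (a sketch; the upper triangle counts each pair once):

sum(p.mat[upper.tri(p.mat)] >= .05) # number of non-significant pairs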

3. One way to test how many correlations we would expect to find by chance is to use a randomization test. This basically reshuffles the values in each column (i.e., chance) and then compares the resulting correlations to the ones observed. If we do this over and over (e.g., 1,000 times), it gives us insight into how many significant correlations we would expect by chance.

#the rand.test function does just that (see ?rand.test for which package provides it)
randTest <- rand.test(df[,1:5],df[,6:10],sims=1000, crit = .95)
## Warning in rand.test(df[, 1:5], df[, 6:10], sims = 1000, crit = 0.95): When p=.
## 00, confidence interval for p not valid.

?rand.test

a. What two lists does this function give you, and what do they indicate? (Hint: run the line ?rand.test.) Answer: The first is a list regarding the absolute correlations observed: the size of the absolute correlation we would expect by chance and the SE around that. The second summarizes how many correlations were significant versus how many we would expect by chance.

b. How many correlations were significant, and how many would we expect by chance? Answer: 20 were significant, and we would expect 1.164 by chance. However, to be conservative, we would take the upper limit + SE of 1.27 and then round up, meaning that we should discount 4 significant effects. In this situation we would want to order the correlations by significance (e.g., using sort()) and then not interpret the 4 that were the least significant; a sketch of one way to do that follows.
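A sketch of that ordering idea, using the p-value matrix from question 2:

ps <- p.mat[upper.tri(p.mat)]
sort(ps[ps < .05], decreasing = TRUE) # the first few are the weakest "significant" effects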

4. Next, let’s simulate a between-subjects factorial design to look at how we might control for multiple comparisons.

# note: DV (length 80) and IV2 (length 80) are recycled to length 160 to match IV1;
# the cycling means pair IV1 levels 1 and 3 with means 100/200, and levels 2 and 4 with means 300/400
DV<-rnorm(80,c(100,200,300,400),sd=10)
IV1<-rep(c("level1","level2", "level3", "level4"),each=2,20) # length 160
IV2<-rep(c("level1","level2"),each=1,40) # length 80
AllData<-data.frame(DV,IV1,IV2)

a. Remember that a model using aov (ANOVA) and lm (regression) with a categorical predictor will return the identical F test. The difference is that aov will give you the omnibus test, whereas the regression will choose a coding scheme (dummy coding by default) and compare each level’s mean to the reference level (i.e., the intercept). Is the F test the same, and if so, what is it?

anovaOutput <- aov(DV~IV1, data = AllData)
summary(anovaOutput)
##              Df  Sum Sq Mean Sq F value Pr(>F)    
## IV1           3 1641655  547218     197 <2e-16 ***
## Residuals   156  433389    2778                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
lmOutput <-lm(DV~IV1, data = AllData)
summary(lmOutput)
## 
## Call:
## lm(formula = DV ~ IV1, data = AllData)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -69.45 -51.25   1.58  50.76  68.75 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  151.558      8.334  18.186   <2e-16 ***
## IV1level2    201.627     11.786  17.108   <2e-16 ***
## IV1level3     -5.569     11.786  -0.473    0.637    
## IV1level4    197.865     11.786  16.788   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 52.71 on 156 degrees of freedom
## Multiple R-squared:  0.7911, Adjusted R-squared:  0.7871 
## F-statistic:   197 on 3 and 156 DF,  p-value: < 2.2e-16

Answer: Yes, F(3, 156) = 197, p < 2.2e-16.

b. Okay, how do we run post hoc tests? We can do so with pairwise.t.test, which computes all possible pairwise comparisons. If we run the post hoc tests without any correction, how many are significant?

pairwise.t.test(AllData$DV, AllData$IV1, p.adj = "none")
## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  AllData$DV and AllData$IV1 
## 
##        level1 level2 level3
## level2 <2e-16 -      -     
## level3 0.64   <2e-16 -     
## level4 <2e-16 0.75   <2e-16
## 
## P value adjustment method: none

Answer: 4 are significant and 2 are not.

c. What if we want to control the familywise error rate using a Bonferroni correction? How many are significant now?

pairwise.t.test(AllData$DV, AllData$IV1, p.adj = "bonf")
## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  AllData$DV and AllData$IV1 
## 
##        level1 level2 level3
## level2 <2e-16 -      -     
## level3 1      <2e-16 -     
## level4 <2e-16 1      <2e-16
## 
## P value adjustment method: bonferroni

Answer: In this case it’s actually the same four, but interestingly, we can see that the p values of the non-significant tests are now 1; the arithmetic is sketched below.
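The capping at 1 is just arithmetic: Bonferroni multiplies each raw p by the number of comparisons (6 pairwise tests here).

p.adjust(c(0.64, 0.75), method = "bonferroni", n = 6) # both exceed 1, so both are reported as 1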

d. The most common multiple-comparison correction is Tukey’s HSD, which requires a different function. Does this change anything?

TukeyHSD(anovaOutput)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = DV ~ IV1, data = AllData)
## 
## $IV1
##                      diff        lwr        upr     p adj
## level2-level1  201.627458  171.02026  232.23465 0.0000000
## level3-level1   -5.569371  -36.17657   25.03782 0.9650043
## level4-level1  197.865027  167.25783  228.47222 0.0000000
## level3-level2 -207.196829 -237.80402 -176.58963 0.0000000
## level4-level2   -3.762431  -34.36963   26.84476 0.9887056
## level4-level3  203.434398  172.82720  234.04159 0.0000000

Answer: It does change the p values of the non-significant effects (moving them closer to the Bonferroni values), but the same 4 tests are significant.