This week’s HW will be split into 2 sections that both mix applied and conceptual questions. The first will be on nonlinear relationships and the second on multiple comparisons and randomization tests.

Power Polynomials

The first section will focus on nonlinear relationships and will be using a data set that includes a measure of performance on a test as a function of stress (in addition to a few other variables). Let’s begin!

Per usual, we’ll start by loading our necessary packages

Next, our data:

1. Look at the relationship between performance and stress by regressing performance on stress. Do not center stress for now. Is stress sig?

Performmodel <- lm(performance~stress, data)
summary(Performmodel)
## 
## Call:
## lm(formula = performance ~ stress, data = data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -4.012 -1.226  0.119  1.250  3.381 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   6.0952     0.9856   6.185 6.08e-06 ***
## stress        0.1310     0.2204   0.594    0.559    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.02 on 19 degrees of freedom
## Multiple R-squared:  0.01824,    Adjusted R-squared:  -0.03343 
## F-statistic: 0.3531 on 1 and 19 DF,  p-value: 0.5594

Answer: Stress is not a significant predictor of performance, b = .13, p = .60.

a. What can we do to examine our assumption of correct model specification (HINT: It’s also how we test some of the other assumptions)? Let’s do it and interpret what we see.

plot(Performmodel$residuals~Performmodel$fitted.values, ylab="Residuals", xlab="Predicted Values")

plot(Performmodel$residuals~data$stress, ylab="Residuals", xlab="Stress values")

Answer: To test model specification, we may plot residuals of Y-hat. Residuals appear to follow a quadratic relationship.

b.Create three new terms: Stress squared, centered stress, and centered stress squared. Correlate each of them together (along with stress), what do you notice? What are the implications for multicollinearity?

#Now let's center our stress variable
data$stress_c<-scale(data$stress, center = TRUE, scale = FALSE)
#Now we'll create quadratic terms with our centered and uncentered variables
data$stress_sq <- data$stress^2
data$stress_csq<- data$stress_c^2
cor(data[c("stress","stress_c","stress_sq","stress_csq")])
##               stress  stress_c stress_sq stress_csq
## stress     1.0000000 1.0000000 0.9773556  0.0000000
## stress_c   1.0000000 1.0000000 0.9773556  0.0000000
## stress_sq  0.9773556 0.9773556 1.0000000  0.2116037
## stress_csq 0.0000000 0.0000000 0.2116037  1.0000000

Answer: Stress is perfectly correlated with centered stress and centered stress squared. Centered stress is also perfectly correlated with center stress squared. This is an issue for multicollinearity as multicollinearity involves correlation between independent variables. Stress squared, by contrast, is less correlated with the other variables.

c.Review: what are the reasons that centering is useful?

Answer: Centering makes models more predictable, especially the intercept.

d. Predict performance from stress and stress squared (using centering to reduce MC). Is this a good model?

modelquad<- lm(performance~stress_c+stress_csq, data)
summary(modelquad)
## 
## Call:
## lm(formula = performance ~ stress_c + stress_csq, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.9048 -0.6508  0.1191  0.8333  1.8333 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  8.50794    0.36275  23.454 6.06e-15 ***
## stress_c     0.13095    0.11874   1.103    0.285    
## stress_csq  -0.47222    0.06855  -6.888 1.93e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.088 on 18 degrees of freedom
## Multiple R-squared:   0.73,  Adjusted R-squared:    0.7 
## F-statistic: 24.33 on 2 and 18 DF,  p-value: 7.627e-06

Answer: Yes, the model exhibits good fit F(2, 18) = 24.33, p < .01. The model accounts for 70% of variance in performance.

e. What do the signs tell us about the shape of the relationship between stress and performance?

plot(modelquad$fitted.values~data$stress, ylab="Predicted Y", xlab="Stress", col="blue", type="b")

Answer: There is a curvilinear effect such that stress is positively related to performance until it reaches the critical point of 4, at which point, it is negatively related to performance.

f. What is the mean slope of the line? What is the slope of the line when stress = 4? Answer: The mean slope of the line is .13. The slope of the line when stress is 4 is approximately 1.

g. So, what do each the raw regression coefficients tell us? Answer: The intercept, b0 = 8.5, is performance at mean stress. The coefficient for centered stress, .13, is the tangent at the mean. The coefficient for centered squared stress is -.47.

h. What amount of stress will yield the greatest estimated performance?

8.508-((.130^2)/(4*-.472))
## [1] 8.516951

Answer: A stress amount of 4 yields the greatest predicted performance.

i. Calculate the descriptive statistics for stress and stress squared. If we maintained the relationship between centered stress and centered stress squared, a zero on centered stress should correspond to a zero on centered stress squared. However, if we z-scored both variables, is this what would happen? What might this mean for the betas that we calculate based on these z-scores?

describe(data[c("stress_c","stress_csq")])
##            vars  n mean   sd median trimmed  mad min max range skew kurtosis
## stress_c      1 21    0 2.05      0    0.00 2.97  -3   3     6 0.00    -1.41
## stress_csq    2 21    4 3.55      4    3.88 4.45   0   9     9 0.42    -1.50
##              se
## stress_c   0.45
## stress_csq 0.77

Answer: If stress and stress squared were standardized, a zero on z-scores centered stress would not correspond to a zero of z-scored centered stress squared.

j. How CAN we get the proper betas? What are they?

data$performance_z <- scale(data$performance, scale=TRUE, center=TRUE)
data$stress_z <- scale(data$stress, scale=TRUE, center=TRUE)
data$stress_zsq <- data$stress_z^2

Zmodel <-lm(formula = performance_z ~ stress_z + stress_zsq, data = data)
summary(Zmodel)
## 
## Call:
## lm(formula = performance_z ~ stress_z + stress_zsq, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.95868 -0.32755  0.05992  0.41942  0.92273 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   0.9507     0.1826   5.207 5.94e-05 ***
## stress_z      0.1351     0.1225   1.103    0.285    
## stress_zsq   -0.9982     0.1449  -6.888 1.93e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5477 on 18 degrees of freedom
## Multiple R-squared:   0.73,  Adjusted R-squared:    0.7 
## F-statistic: 24.33 on 2 and 18 DF,  p-value: 7.627e-06

Answer: Standardized regression coefficients would be obtained by z-scoring both independent variables. The standardized regression coefficients for z-scored stress and z-scored squared stress are .14 and -.1, respectively.

k. How would the model be changed if we did not include the linear effect of stress?

Answer: If we did not include the linear effect of stress, there would be less model fit. In a significant quadratic model, the linear term cannot be trimmed.

Randomization tests

In the next section, we are going to learn how to run permutation (i.e. randomization) tests in R, in addition to examining different ways to control for multiple comparisons. Importantly, we will do so by simulating some data, which will hopefully give you insight in how and when you might want to do so for your own research. The questions for this section may require you to do a quick google search for a function, or better yet to do a ?and then then name of the function.

1. If we have a data set with n variables, what are the chances that one or more are significant just by chance (i.e. spurious)? To examine how we might go about figuring this out, we will simulate a data set with variables that vary in how correlated they are.

set.seed(316) #why do we do this?

N <-300 # set our sample         
mu <- sample(1:7, 10, replace=T) 
n <- 10 #set how many 
covm <- rWishart(1,n,diag(n)) # simulates random positive definite matrices
sigma <- matrix(covm,10,10) #creates matrix

# A non-positive definite VCV matrix is when the variables are a linear combination of one another or that one variable entirely determines/depends on another. If you randomly create your own VCV matrix without knowing what you're inputting, your likely to encounter this.

df <- MASS::mvrnorm(n=N,mu=mu,Sigma=sigma)  #simulate the data

df <- as.data.frame(df) # let's make it a dataframe
names(df) <- c("Y1","x2","x3","x4","x5","x6","x7","x8","x9","x10") # and name it
M <- cor(df) # run correlation

a. why do we need to set.seed before we simulate data? Answer: The set.seed function ensures the samples we pull are different. This moves the starting point for the random sequence of numbers.

b. what does the sample function do? Answer: The sample function took a sample of our specified size (10). We also specified we want sampling with replacement.

c. how many variables does df contain? Answer: Df contains 10 variables (9 independent variables and 1 dependent variable).

2. Next we are going to create a visual correlation matrix that contains information about both the strenght of the correlation (color) and the significance of the correlation.

cor.mtest <- function(mat, ...) {
    mat <- as.matrix(mat)
    n <- ncol(mat)
    p.mat<- matrix(NA, n, n)
    diag(p.mat) <- 0
    for (i in 1:(n - 1)) {
        for (j in (i + 1):n) {
            tmp <- cor.test(mat[, i], mat[, j], ...)
            p.mat[i, j] <- p.mat[j, i] <- tmp$p.value
        }
    }
  colnames(p.mat) <- rownames(p.mat) <- colnames(mat)
  p.mat
}
# next we can apply the function to the data set we simulated earlier
p.mat <- cor.mtest(df)

#let's visualize the correlation with color representing the and white cells representing no sig. 
col <- colorRampPalette(c("#BB4444", "#EE9988", "#FFFFFF", "#77AADD", "#4477AA"))
corrplot(M, method="color", col=col(200),  
         type="upper", order="hclust", #order by level of strength
         addCoef.col = "black", 
         tl.col="black", tl.srt=45, 
         # Combine with significance
         p.mat = p.mat, sig.level = 0.05, insig = "blank", 
         # hide correlation coefficient on the principal diagonal
         diag=FALSE 
         )

a. what do the blue and red colored cells indicate? Blue and red colored cells indicate significant correlations.

b. how many variables are not correlated? 11 variables are not correlated.

3. one way to test how many correlations we would expect to find by chance is by using a randomization test. This basically reshuffles the values in each column (i.e. chance) and then compares that correlation to the one observed. If we do this over and over (e.g., 1k times) this will give us insight into how many we would expect to get by chance.

#the rand.test function from the MASS packages does just that
randTest <- rand.test(df[,1:5],df[,6:10],sims=1000, crit = .95)
## Warning in rand.test(df[, 1:5], df[, 6:10], sims = 1000, crit = 0.95): When p=.
## 00, confidence interval for p not valid.

#?rand.test
?rand.test

a. what two lists does this function give you and what do they indicate? (hint: run the line ?rand.test) Answer: The function gives two lists: (1) the average absolute r between sets 1 and 2, and (2) the number of significant correlations between sets 1 and 2.

b. How many correlation were significant and how many would we expect by chance? Answer: 20 correlations were significant and 1.16 correlations would be expected by chance.

4. next, let’s simulate a between subjects factorial design to look at how we might control for multiple comparisons

DV<-rnorm(80,c(100,200,300,400),sd=10)
IV1<-rep(c("level1","level2", "level3", "level4"),each=2,20)
IV2<-rep(c("level1","level2"),each=1,40)
AllData<-data.frame(DV,IV1,IV2)

a. Remember that a model using aov (ANOVA) and lm (regression) with a categorical predictor will return the identical F test. The difference is that the aov will give you the omnibus test, whereas the regression will choose a coding scheme (dummy by default) and compare means to that value (i.e. intercept). Is the F test the same, and if so what is it?

anovaOutput <- aov(DV~IV1, data = AllData)
summary(anovaOutput)
##              Df  Sum Sq Mean Sq F value Pr(>F)    
## IV1           3 1641655  547218     197 <2e-16 ***
## Residuals   156  433389    2778                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
lmOutput <-lm(DV~IV1, data = AllData)
summary(lmOutput)
## 
## Call:
## lm(formula = DV ~ IV1, data = AllData)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -69.45 -51.25   1.58  50.76  68.75 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  151.558      8.334  18.186   <2e-16 ***
## IV1level2    201.627     11.786  17.108   <2e-16 ***
## IV1level3     -5.569     11.786  -0.473    0.637    
## IV1level4    197.865     11.786  16.788   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 52.71 on 156 degrees of freedom
## Multiple R-squared:  0.7911, Adjusted R-squared:  0.7871 
## F-statistic:   197 on 3 and 156 DF,  p-value: < 2.2e-16

Answer: Yes, the F-statistic yielded is the same: F(3, 156) = 197, p , <.01.

b. Okay, how do we run post hoc tests? Well, we can do so with the pairwise.t.test, which computes all possible comparisons. If we run a post hoc test without any corrections, how many are significant?

pairwise.t.test(AllData$DV, AllData$IV1, p.adj = "none")
## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  AllData$DV and AllData$IV1 
## 
##        level1 level2 level3
## level2 <2e-16 -      -     
## level3 0.64   <2e-16 -     
## level4 <2e-16 0.75   <2e-16
## 
## P value adjustment method: none

Answer: Four correlations are significant.

c. what if we want to control using bonferroni corrections? How many are significant now?

pairwise.t.test(AllData$DV, AllData$IV1, p.adj = "bonf")
## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  AllData$DV and AllData$IV1 
## 
##        level1 level2 level3
## level2 <2e-16 -      -     
## level3 1      <2e-16 -     
## level4 <2e-16 1      <2e-16
## 
## P value adjustment method: bonferroni

Answer: Four correlations are significant.

c. the most common multiple comparison is Tukey, which requires a different function. Does this change anything?

TukeyHSD(anovaOutput)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = DV ~ IV1, data = AllData)
## 
## $IV1
##                      diff        lwr        upr     p adj
## level2-level1  201.627458  171.02026  232.23465 0.0000000
## level3-level1   -5.569371  -36.17657   25.03782 0.9650043
## level4-level1  197.865027  167.25783  228.47222 0.0000000
## level3-level2 -207.196829 -237.80402 -176.58963 0.0000000
## level4-level2   -3.762431  -34.36963   26.84476 0.9887056
## level4-level3  203.434398  172.82720  234.04159 0.0000000

Answer: Whether using pairwise t-tests, bonferroni corrections, or Tukey HSD, four correlations remain significant.