This week's HW will be split into 2 sections that both mix applied and conceptual questions. The first will be on nonlinear relationships and the second on multiple comparisons and randomization tests.

Power Polynomials

The first section will focus on nonlinear relationships and will be using a data set that includes a measure of performance on a test as a function of stress (in addition to a few other variables). Let's begin!

Per usual, we'll start by loading our necessary packages:
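
The exact package list isn't shown here; a minimal sketch, assuming the functions used below come from psych (describe), corrplot (corrplot), MASS (mvrnorm), and multicon (rand.test):

library(psych)    # describe()
library(corrplot) # corrplot()
library(MASS)     # mvrnorm() (also called below via MASS::)
library(multicon) # rand.test() -- assumed source; swap in whatever package the course provides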

Next, our data:
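
A placeholder sketch for reading in the data; the actual file name is not given, so "stress_performance.csv" below is hypothetical:

data <- read.csv("stress_performance.csv") # hypothetical file name -- substitute the actual data file
head(data)                                 # includes performance, stress, and a few other variables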

1. Look at the relationship between performance and stress by regressing performance on stress. Do not center stress for now. Is stress significant?

Performmodel <- lm(performance~stress, data)
summary(Performmodel)
## 
## Call:
## lm(formula = performance ~ stress, data = data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -4.012 -1.226  0.119  1.250  3.381 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   6.0952     0.9856   6.185 6.08e-06 ***
## stress        0.1310     0.2204   0.594    0.559    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.02 on 19 degrees of freedom
## Multiple R-squared:  0.01824,    Adjusted R-squared:  -0.03343 
## F-statistic: 0.3531 on 1 and 19 DF,  p-value: 0.5594

Answer: No, the relationship between stress and performance is not significant (b = 0.13, p = .56).

a. What can we do to examine our assumption of correct model specification (HINT: It’s also how we test some of the other assumptions)? Let’s do it and interpret what we see.

plot(Performmodel$residuals~Performmodel$fitted.values, ylab="Residuals", xlab="Predicted Values")

plot(Performmodel$residuals~data$stress, ylab="Residuals", xlab="Stress values")

Answer: We can plot the residuals against the fitted values (and against the predictor). Both plots show a clear curved pattern rather than a random scatter around zero, which tells us the relationship between stress and performance is not linear and the model is misspecified.

b. Create three new terms: stress squared, centered stress, and centered stress squared. Correlate them (along with stress) with one another. What do you notice? What are the implications for multicollinearity?

#Now let's center our stress variable
data$stress_c<-scale(data$stress, center = TRUE, scale = FALSE)
#Now we'll create quadratic terms with our centered and uncentered variables
data$stress_sq <- data$stress^2
data$stress_csq<- data$stress_c^2
cor(data[c("stress","stress_c","stress_sq","stress_csq")])
##               stress  stress_c stress_sq stress_csq
## stress     1.0000000 1.0000000 0.9773556  0.0000000
## stress_c   1.0000000 1.0000000 0.9773556  0.0000000
## stress_sq  0.9773556 0.9773556 1.0000000  0.2116037
## stress_csq 0.0000000 0.0000000 0.2116037  1.0000000

Answer: Stress and centered stress are perfectly correlated (r = 1.00), and both are very strongly correlated with uncentered stress squared (r = .98), so entering stress and its uncentered square in the same model would create severe multicollinearity. Centered stress squared, by contrast, is uncorrelated with stress and centered stress (r = .00), so centering before squaring essentially removes the multicollinearity between the linear and quadratic terms.

c. Review: what are the reasons that centering is useful?

Answer: Centering is useful when 0 is not a meaningful value of the predictor: it makes the intercept (and, in polynomial or interaction models, the lower-order coefficients) interpretable as values at the mean of the predictor, and it reduces the nonessential multicollinearity between a predictor and its higher-order or interaction terms.

d. Predict performance from stress and stress squared (using centering to reduce multicollinearity). Is this a good model?

modelquad<- lm(performance~stress_c+stress_csq, data)
summary(modelquad)
## 
## Call:
## lm(formula = performance ~ stress_c + stress_csq, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.9048 -0.6508  0.1191  0.8333  1.8333 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  8.50794    0.36275  23.454 6.06e-15 ***
## stress_c     0.13095    0.11874   1.103    0.285    
## stress_csq  -0.47222    0.06855  -6.888 1.93e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.088 on 18 degrees of freedom
## Multiple R-squared:   0.73,  Adjusted R-squared:    0.7 
## F-statistic: 24.33 on 2 and 18 DF,  p-value: 7.627e-06

Answer: Yes. The model is significant overall, F(2, 18) = 24.33, p < .001, and explains 73% of the variance in performance (R² = .73); the quadratic term is significant while the linear term is not.

e. What do the signs tell us about the shape of the relationship between stress and performance?

plot(modelquad$fitted.values~data$stress, ylab="Predicted Y", xlab="Stress", col="blue", type="b")

Answer: It's an inverted U (the quadratic coefficient is negative): as stress increases, performance increases up to a point, after which further increases in stress are associated with decreases in performance.

f. What is the mean slope of the line? What is the slope of the line when stress = 4? Answer: Because stress is centered, b1 = 0.13 is the slope of the line at the mean of stress. Stress = 4 appears to be the sample mean of stress (stress_c = 0), so the slope at stress = 4 is also about 0.13; at any other value, the slope is b1 + 2·b2·stress_c, as sketched below.
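
A quick sketch of that calculation from the fitted coefficients (the conversion from raw stress to centered stress assumes the sample mean of stress):

b <- coef(modelquad)                      # b[1] = intercept, b[2] = linear term, b[3] = quadratic term
b[2] + 2 * b[3] * 0                       # slope at the mean of stress (stress_c = 0): about 0.13
b[2] + 2 * b[3] * (4 - mean(data$stress)) # slope at stress = 4, converted to centered units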

g. So, what does each of the raw regression coefficients tell us? Answer: Because stress is centered, b0 (8.51) is the predicted performance at the mean of stress, b1 (0.13) is the slope of the curve at the mean of stress, and b2 (-0.47) describes the curvature: its negative sign tells us the curve bends downward (an inverted U).

h. What amount of stress will yield the greatest estimated performance?

8.508-((.130^2)/(4*-.472))
## [1] 8.516951

Answer: The quantity computed above, b0 - b1²/(4·b2) ≈ 8.52, is actually the maximum predicted performance, not a stress level. The stress level that yields it is the vertex of the parabola, stress_c = -b1/(2·b2) ≈ 0.14, i.e., just above the mean of stress, or roughly 4 units of stress on the raw scale (see the sketch below).
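
A sketch of the same calculation using the fitted coefficients directly, rather than the rounded values typed above:

b <- coef(modelquad)
-b[2] / (2 * b[3])                    # centered stress at the vertex (about 0.14, just above the mean)
mean(data$stress) - b[2] / (2 * b[3]) # the same vertex expressed on the raw stress scale
b[1] - b[2]^2 / (4 * b[3])            # maximum predicted performance (about 8.52)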

i. Calculate the descriptive statistics for stress and stress squared. If we maintained the relationship between centered stress and centered stress squared, a zero on centered stress should correspond to a zero on centered stress squared. However, if we z-scored both variables, is this what would happen? What might this mean for the betas that we calculate based on these z-scores?

describe(data[c("stress_c","stress_csq")])
##            vars  n mean   sd median trimmed  mad min max range skew kurtosis
## stress_c      1 21    0 2.05      0    0.00 2.97  -3   3     6 0.00    -1.41
## stress_csq    2 21    4 3.55      4    3.88 4.45   0   9     9 0.42    -1.50
##              se
## stress_c   0.45
## stress_csq 0.77

Answer: No. If we z-scored both variables separately, a zero on z-scored centered stress would no longer correspond to a zero on z-scored centered stress squared, because the squared term has its own (nonzero) mean and SD that get subtracted and divided out. This breaks the exact squaring relationship between the two predictors, so betas based on those z-scores would not be the correct standardized coefficients for the quadratic model.

j. How CAN we get the proper betas? What are they?

data$performance_z <- scale(data$performance, scale=TRUE, center=TRUE)
data$stress_z <- scale(data$stress, scale=TRUE, center=TRUE)
data$stress_zsq <- data$stress_z^2

Zmodel <-lm(formula = performance_z ~ stress_z + stress_zsq, data = data)
summary(Zmodel)
## 
## Call:
## lm(formula = performance_z ~ stress_z + stress_zsq, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.95868 -0.32755  0.05992  0.41942  0.92273 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   0.9507     0.1826   5.207 5.94e-05 ***
## stress_z      0.1351     0.1225   1.103    0.285    
## stress_zsq   -0.9982     0.1449  -6.888 1.93e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5477 on 18 degrees of freedom
## Multiple R-squared:   0.73,  Adjusted R-squared:    0.7 
## F-statistic: 24.33 on 2 and 18 DF,  p-value: 7.627e-06

Answer: We get the proper betas by z-scoring performance and stress first and then squaring the z-scored stress (rather than z-scoring the already-squared term). Doing so, the betas are 0.135 for the linear stress term and -0.998 for the quadratic term.

k. How would the model be changed if we did not include the linear effect of stress?

Answer: We would be forcing the fitted parabola to have its vertex exactly at the mean of stress (stress_c = 0), i.e., assuming the curve is perfectly symmetric around the mean. Any linear trend in the data would then be pushed into the error term, so the model would fit worse and the quadratic coefficient could be biased. This is why lower-order terms are usually retained whenever their higher-order terms are in the model; see the sketch below for a direct comparison.
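
One way to see this in these data is to fit the quadratic-only model and compare it to the full model; a sketch using the variables created above (modelquad_noLin is just an illustrative name):

modelquad_noLin <- lm(performance ~ stress_csq, data) # quadratic term only: vertex forced to the mean of stress
anova(modelquad_noLin, modelquad)                     # tests whether the linear term adds anything
summary(modelquad_noLin)$r.squared                    # compare to the full model's R-squared of .73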

Randomization tests

In the next section, we are going to learn how to run permutation (i.e., randomization) tests in R, in addition to examining different ways to control for multiple comparisons. Importantly, we will do so by simulating some data, which will hopefully give you insight into how and when you might want to do so for your own research. The questions for this section may require you to do a quick Google search for a function, or better yet to type ? followed by the name of the function.

1. If we have a data set with n variables, what are the chances that one or more of the correlations among them are significant just by chance (i.e., spurious)? To examine how we might go about figuring this out, we will simulate a data set with variables that vary in how correlated they are.
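
As a rough back-of-the-envelope check before simulating: with k independent tests at alpha = .05, the chance of at least one spurious "significant" result is 1 - .95^k. For the 45 pairwise correlations among 10 variables:

k <- choose(10, 2) # 45 pairwise correlations among 10 variables
1 - (1 - .05)^k    # probability of at least one significant by chance alone (about .90)
.05 * k            # expected number significant by chance (about 2.25), if the tests were independent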

set.seed(316) #why do we do this?

N <- 300 # set our sample size
mu <- sample(1:7, 10, replace=T) # randomly draw 10 population means between 1 and 7
n <- 10 # set how many variables we will simulate
covm <- rWishart(1, n, diag(n)) # simulates a random positive definite covariance matrix
sigma <- matrix(covm, 10, 10) # reshape it into a 10 x 10 variance-covariance (VCV) matrix

# A VCV matrix is non-positive definite when one variable is a linear combination of the others, i.e., when one variable is entirely determined by the rest. If you build your own VCV matrix by hand without checking this, you're likely to encounter that problem.

df <- MASS::mvrnorm(n=N,mu=mu,Sigma=sigma)  #simulate the data

df <- as.data.frame(df) # let's make it a dataframe
names(df) <- c("Y1","x2","x3","x4","x5","x6","x7","x8","x9","x10") # and name it
M <- cor(df) # run correlation

a. why do we need to set.seed before we simulate data? Answer: set.seed() fixes the starting point ("seed") of the random number generator, so we get the same sequence of "random" numbers, and therefore the same simulated data and results, every time the code is run. This makes the simulation reproducible, as demonstrated below.
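
A quick illustration (note that running extra random draws mid-script changes the random stream for everything simulated afterward, so this is shown only for demonstration):

set.seed(316); rnorm(3) # three "random" numbers
set.seed(316); rnorm(3) # resetting the same seed reproduces exactly the same three numbers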

b. what does the sample function do? Answer: It draws a random sample of a specified size from a vector, either with or without replacement. Sampling without replacement simply reshuffles the values, which is exactly what permutation tests need. A couple of examples are sketched below.
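
For example:

sample(1:7, 10, replace = TRUE) # 10 draws from the integers 1-7, with replacement (as in mu above)
sample(df$Y1)                   # no size argument and no replacement: a reshuffled (permuted) copy of Y1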

c. how many variables does df contain? Answer: 10 variables.

2. Next we are going to create a visual correlation matrix that contains information about both the strength of the correlation (color) and the significance of the correlation.

#this is a function that grabs p values from a correlation matrix
cor.mtest <- function(mat, ...) {
    mat <- as.matrix(mat)
    n <- ncol(mat)
    p.mat<- matrix(NA, n, n)
    diag(p.mat) <- 0
    for (i in 1:(n - 1)) {
        for (j in (i + 1):n) {
            tmp <- cor.test(mat[, i], mat[, j], ...)
            p.mat[i, j] <- p.mat[j, i] <- tmp$p.value
        }
    }
  colnames(p.mat) <- rownames(p.mat) <- colnames(mat)
  p.mat
}
# next we can apply the function to the data set we simulated earlier
p.mat <- cor.mtest(df)

#let's visualize the correlations, with color indicating the strength and direction of each correlation and blank (white) cells indicating correlations that are not significant
col <- colorRampPalette(c("#BB4444", "#EE9988", "#FFFFFF", "#77AADD", "#4477AA"))
corrplot(M, method="color", col=col(200),  
         type="upper", order="hclust", #order by level of strength
         addCoef.col = "black", 
         tl.col="black", tl.srt=45, 
         # Combine with significance
         p.mat = p.mat, sig.level = 0.05, insig = "blank", 
         # hide correlation coefficient on the principal diagonal
         diag=FALSE 
         )

a. what do the blue and red colored cells indicate?

Answer: The blue cells are significant positive correlations and the red cells are significant negative correlations; the deeper the color, the stronger the correlation.

b. how many variables are not correlated? Answer: 12 of the pairwise correlations are not significant (the blank cells).

3. One way to test how many significant correlations we would expect to find by chance is a randomization test. This reshuffles the values within each column (breaking any real association, i.e., pure chance) and counts how many correlations come out significant. Repeating this many times (e.g., 1,000) gives a distribution of how many we would expect by chance, which we can compare to the number actually observed.

#the rand.test function (from the multicon package, not MASS) does just that
randTest <- rand.test(df[,1:5],df[,6:10],sims=1000, crit = .95)
## Warning in rand.test(df[, 1:5], df[, 6:10], sims = 1000, crit = 0.95): When p=.
## 00, confidence interval for p not valid.

?rand.test
## starting httpd help server ... done
#?rand.test

a. what two lists does this function give you and what do they indicate? (hint: run the line ?rand.test) Answer: It returns two lists: one for the number of statistically significant correlations between the two sets of variables and one for the mean absolute correlation. Each compares the observed value to its randomization (chance) sampling distribution, telling us whether we found more significant correlations (and larger correlations on average) than we would expect by chance.

b. How many correlations were significant and how many would we expect by chance?

Answer: 37 were significant; we would expect about 4 to be significant by chance.
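
For intuition, the chance-level count can also be approximated by hand, reusing the cor.mtest() helper defined earlier. This is only a sketch of the reshuffling idea applied to all 45 correlations in df, not the rand.test implementation, so its numbers will not match the output above exactly:

obs_sig <- sum(cor.mtest(df)[upper.tri(diag(10))] < .05) # observed number of significant correlations
null_sig <- replicate(1000, {
  shuffled <- as.data.frame(lapply(df, sample))          # reshuffle each column independently (pure chance)
  sum(cor.mtest(shuffled)[upper.tri(diag(10))] < .05)    # count of "significant" correlations under chance
})
mean(null_sig)            # average number significant by chance
mean(null_sig >= obs_sig) # proportion of reshuffles with at least as many significant correlations as observed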

4. Next, let's simulate a between-subjects factorial design to look at how we might control for multiple comparisons.

DV <- rnorm(80, c(100, 200, 300, 400), sd = 10) # simulate scores, recycling the four means (SD = 10)
IV1 <- rep(c("level1", "level2", "level3", "level4"), each = 2, 20) # a four-level between-subjects factor
IV2 <- rep(c("level1", "level2"), each = 1, 40) # a two-level between-subjects factor
AllData <- data.frame(DV, IV1, IV2) # combine into one data frame (shorter vectors are recycled to match)

a. Remember that a model using aov (ANOVA) and lm (regression) with a categorical predictor will return the identical F test. The difference is that aov gives you the omnibus test, whereas the regression chooses a coding scheme (dummy coding by default) and compares each group mean to the reference group (the intercept). Is the F test the same, and if so what is it?

anovaOutput <- aov(DV~IV1, data = AllData)
summary(anovaOutput)
##              Df  Sum Sq Mean Sq F value Pr(>F)    
## IV1           3 1567452  522484   202.7 <2e-16 ***
## Residuals   156  402103    2578                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
lmOutput <-lm(DV~IV1, data = AllData)
summary(lmOutput)
## 
## Call:
## lm(formula = DV ~ IV1, data = AllData)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -69.447 -49.604  -1.634  49.290  69.952 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  148.441      8.027  18.492   <2e-16 ***
## IV1level2    201.101     11.352  17.714   <2e-16 ***
## IV1level3      5.258     11.352   0.463    0.644    
## IV1level4    199.995     11.352  17.617   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 50.77 on 156 degrees of freedom
## Multiple R-squared:  0.7958, Adjusted R-squared:  0.7919 
## F-statistic: 202.7 on 3 and 156 DF,  p-value: < 2.2e-16

Answer: Yes, the F test is identical for both models: F(3, 156) = 202.7, p < .001.

b. Okay, how do we run post hoc tests? Well, we can do so with the pairwise.t.test, which computes all possible comparisons. If we run a post hoc test without any corrections, how many are significant?

pairwise.t.test(AllData$DV, AllData$IV1, p.adj = "none")
## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  AllData$DV and AllData$IV1 
## 
##        level1 level2 level3
## level2 <2e-16 -      -     
## level3 0.64   <2e-16 -     
## level4 <2e-16 0.92   <2e-16
## 
## P value adjustment method: none

Answer: 4 of the 6 pairwise comparisons are significant (all p < 2e-16); level1 vs. level3 (p = .64) and level2 vs. level4 (p = .92) are not.

c. What if we want to control for multiple comparisons using a Bonferroni correction? How many are significant now?

pairwise.t.test(AllData$DV, AllData$IV1, p.adj = "bonf")
## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  AllData$DV and AllData$IV1 
## 
##        level1 level2 level3
## level2 <2e-16 -      -     
## level3 1      <2e-16 -     
## level4 <2e-16 1      <2e-16
## 
## P value adjustment method: bonferroni

Answer: Still 4. The significant comparisons have such tiny raw p values (< 2e-16) that the correction leaves them significant, while the non-significant ones are simply pushed up to 1.
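
Bonferroni just multiplies each raw p value by the number of comparisons (6 here) and caps the result at 1, which is why the raw p values of .64 and .92 both show up as 1 in the corrected table. A quick check with p.adjust, using the rounded raw p values from the uncorrected table:

p.adjust(c(0.64, 0.92), method = "bonferroni", n = 6) # 0.64*6 and 0.92*6 both exceed 1, so both are capped at 1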

d. The most common multiple-comparison correction is Tukey's HSD, which requires a different function. Does this change anything?

TukeyHSD(anovaOutput)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = DV ~ IV1, data = AllData)
## 
## $IV1
##                      diff        lwr        upr     p adj
## level2-level1  201.100813  171.61908  230.58254 0.0000000
## level3-level1    5.258443  -24.22329   34.74017 0.9669323
## level4-level1  199.995284  170.51356  229.47701 0.0000000
## level3-level2 -195.842370 -225.32410 -166.36064 0.0000000
## level4-level2   -1.105529  -30.58726   28.37620 0.9996679
## level4-level3  194.736841  165.25511  224.21857 0.0000000

Answer: Not substantively. TukeyHSD reports mean differences, confidence intervals, and adjusted p values rather than just p values, but the conclusion is the same: the same 4 comparisons are significant.