This week's HW is split into two sections that both mix applied and conceptual questions: the first covers nonlinear relationships and the second covers multiple comparisons and randomization tests.
The first section uses a data set that includes a measure of performance on a test as a function of stress (in addition to a few other variables). Let's begin!
Per usual, we'll start by loading our necessary packages
Next, our data:
1. Look at the relationship between performance and stress by regressing performance on stress. Do not center stress for now. Is stress a significant predictor?
Performmodel <- lm(performance~stress, data)
summary(Performmodel)
##
## Call:
## lm(formula = performance ~ stress, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.012 -1.226 0.119 1.250 3.381
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.0952 0.9856 6.185 6.08e-06 ***
## stress 0.1310 0.2204 0.594 0.559
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.02 on 19 degrees of freedom
## Multiple R-squared: 0.01824, Adjusted R-squared: -0.03343
## F-statistic: 0.3531 on 1 and 19 DF, p-value: 0.5594
Answer: No, stress is not a significant predictor of performance in this model (b = 0.13, t(19) = 0.59, p = .56).
a. What can we do to examine our assumption of correct model specification (HINT: It’s also how we test some of the other assumptions)? Let’s do it and interpret what we see.
plot(Performmodel$residuals~Performmodel$fitted.values, ylab="Residuals", xlab="Predicted Values")
plot(Performmodel$residuals~data$stress, ylab="Residuals", xlab="Stress values")
Answer: We can plot the residuals against the fitted values (and against the predictor). Here the residuals show a clear curved pattern rather than a random scatter around zero, which indicates that the relationship between stress and performance is not linear and the model is misspecified.
b. Create three new terms: stress squared, centered stress, and centered stress squared. Correlate them with each other (along with stress). What do you notice? What are the implications for multicollinearity?
#Now let's center our stress variable
data$stress_c<-scale(data$stress, center = TRUE, scale = FALSE)
#Now we'll create quadratic terms with our centered and uncentered variables
data$stress_sq <- data$stress^2
data$stress_csq<- data$stress_c^2
cor(data[c("stress","stress_c","stress_sq","stress_csq")])
## stress stress_c stress_sq stress_csq
## stress 1.0000000 1.0000000 0.9773556 0.0000000
## stress_c 1.0000000 1.0000000 0.9773556 0.0000000
## stress_sq 0.9773556 0.9773556 1.0000000 0.2116037
## stress_csq 0.0000000 0.0000000 0.2116037 1.0000000
Answer: Stress and centered stress are perfectly correlated with each other, and both are very strongly correlated with uncentered stress squared (r = .98), but centered stress is uncorrelated with centered stress squared (r = 0). Centering therefore removes most of the nonessential multicollinearity between the linear and quadratic terms.
c. Review: what are the reasons that centering is useful?
Answer: Centering gives the intercept (and, in a quadratic model, the linear coefficient) a meaningful interpretation when a raw score of 0 has no practical meaning, and it reduces the nonessential multicollinearity between a predictor and its higher-order (squared or interaction) terms.
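To see the multicollinearity point concretely, here is a minimal sketch (assuming the car package is available for vif()) comparing variance inflation factors for the uncentered and centered quadratic terms created above:
# a minimal sketch, assuming the car package is available for vif()
library(car)
vif(lm(performance ~ stress + stress_sq, data))    # uncentered terms: inflated VIFs expected
vif(lm(performance ~ stress_c + stress_csq, data)) # centered terms: VIFs near 1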
d. Predict performance from stress and stress squared (using centering to reduce multicollinearity). Is this a good model?
modelquad<- lm(performance~stress_c+stress_csq, data)
summary(modelquad)
##
## Call:
## lm(formula = performance ~ stress_c + stress_csq, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.9048 -0.6508 0.1191 0.8333 1.8333
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.50794 0.36275 23.454 6.06e-15 ***
## stress_c 0.13095 0.11874 1.103 0.285
## stress_csq -0.47222 0.06855 -6.888 1.93e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.088 on 18 degrees of freedom
## Multiple R-squared: 0.73, Adjusted R-squared: 0.7
## F-statistic: 24.33 on 2 and 18 DF, p-value: 7.627e-06
Answer: Yes, this is a good model: the overall F test is significant, F(2, 18) = 24.33, p < .001, and the model explains about 73% of the variance in performance.
e. What do the signs tell us about the shape of the relationship between stress and performance?
plot(modelquad$fitted.values~data$stress, ylab="Predicted Y", xlab="Stress", col="blue", type="b")
Answer: It's an inverted U: the negative quadratic coefficient means that as stress increases, performance increases up to a point, after which further increases in stress are accompanied by decreases in performance.
f. What is the mean slope of the line? What is the slope of the line when stress = 4? Answer: With centered stress, the linear coefficient is the slope at the mean of stress, so the mean slope is b1 = .13. Because the mean of stress is 4 (centered stress equals 0 there), the slope at stress = 4 is also b1 + 2*b2*(0) = .13.
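A quick sketch of the slope calculation (assuming, as the centered descriptives suggest, that the mean of stress is 4): the instantaneous slope of the fitted curve is b1 + 2*b2*stress_c.
# sketch: instantaneous slope of the quadratic at a given (raw) stress value
b <- coef(modelquad)
slope_at <- function(stress) unname(b["stress_c"] + 2 * b["stress_csq"] * (stress - 4))
slope_at(4) # at the mean of stress this is just b1, about .13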
g. So, what do each of the raw regression coefficients tell us? Answer: b0 is the predicted performance at the mean of stress, b1 is the slope (instantaneous rate of change) of the curve at the mean of stress, and b2 describes the curvature, i.e., how the slope changes as stress increases (negative here, giving the inverted U).
h. What amount of stress will yield the greatest estimated performance?
# note: this vertex formula, c - b^2/(4a), gives the maximum predicted performance
8.508-((.130^2)/(4*-.472))
## [1] 8.516951
Answer: The curve peaks at stress_c = -b1/(2*b2) ≈ .14, i.e., at roughly 4 units of stress (about 4.1 on the raw scale). The value computed above, 8.52, is the maximum estimated performance at that point.
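A sketch of the vertex calculation, again assuming the mean of stress is 4:
# sketch: the curve peaks where the slope is zero, i.e. stress_c = -b1 / (2*b2)
b <- coef(modelquad)
peak_c <- -b["stress_c"] / (2 * b["stress_csq"]) # about 0.14 on the centered scale
unname(peak_c + 4)                               # about 4.1 on the original stress scale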
i. Calculate the descriptive statistics for stress and stress squared. If we maintained the relationship between centered stress and centered stress squared, a zero on centered stress should correspond to a zero on centered stress squared. However, if we z-scored both variables, is this what would happen? What might this mean for the betas that we calculate based on these z-scores?
describe(data[c("stress_c","stress_csq")])
## vars n mean sd median trimmed mad min max range skew kurtosis
## stress_c 1 21 0 2.05 0 0.00 2.97 -3 3 6 0.00 -1.41
## stress_csq 2 21 4 3.55 4 3.88 4.45 0 9 9 0.42 -1.50
## se
## stress_c 0.45
## stress_csq 0.77
Answer: No, that is not what would happen. If we z-scored both variables separately, the squared term would have its own mean (4) subtracted and be divided by its own SD, so a zero on z-scored stress would no longer correspond to a zero on z-scored stress squared; the exact quadratic relationship between the two terms would be broken. Betas calculated from separately z-scored variables would therefore be misleading.
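A minimal sketch of that point: the z-score of the squared term is not the same as the square of the z-scored term, so the zeros no longer line up.
# sketch: separately z-scoring the squared term breaks the exact quadratic link
z_of_square <- scale(data$stress_csq) # z-score of (centered stress) squared
square_of_z <- scale(data$stress_c)^2 # square of z-scored stress
head(cbind(z_of_square, square_of_z)) # the two columns differ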
j. How CAN we get the proper betas? What are they?
data$performance_z <- scale(data$performance, scale=TRUE, center=TRUE)
data$stress_z <- scale(data$stress, scale=TRUE, center=TRUE)
data$stress_zsq <- data$stress_z^2
Zmodel <-lm(formula = performance_z ~ stress_z + stress_zsq, data = data)
summary(Zmodel)
##
## Call:
## lm(formula = performance_z ~ stress_z + stress_zsq, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.95868 -0.32755 0.05992 0.41942 0.92273
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.9507 0.1826 5.207 5.94e-05 ***
## stress_z 0.1351 0.1225 1.103 0.285
## stress_zsq -0.9982 0.1449 -6.888 1.93e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5477 on 18 degrees of freedom
## Multiple R-squared: 0.73, Adjusted R-squared: 0.7
## F-statistic: 24.33 on 2 and 18 DF, p-value: 7.627e-06
Answer: We get them by z-scoring performance and stress and then squaring the z-scored stress (rather than z-scoring the squared term separately). The proper betas are .135 for the linear term and -.998 for the quadratic term.
k. How would the model be changed if we did not include the linear effect of stress?
Answer: Dropping the linear term forces the fitted parabola to have its vertex exactly at the mean of stress (stress_c = 0), i.e., it assumes the curve is perfectly symmetric around the mean. Here the linear effect is small and nonsignificant, so the fit would change little, but in general this constraint can bias the quadratic estimate and should only be imposed if theory justifies it.
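For reference, a sketch of the reduced (quadratic-only) model, which imposes that symmetry constraint:
# sketch: quadratic term only, constraining the curve to be symmetric around mean stress
modelquad_only <- lm(performance ~ stress_csq, data)
summary(modelquad_only)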
In the next section, we are going to learn how to run permutation (i.e., randomization) tests in R, and examine different ways to control for multiple comparisons. Importantly, we will do so by simulating some data, which will hopefully give you insight into how and when you might want to do the same for your own research. The questions in this section may require a quick Google search for a function, or better yet, typing ? followed by the name of the function.
1. If we have a data set with n variables, what are the chances that one or more correlations are significant just by chance (i.e., spurious)? To examine how we might figure this out, we will simulate a data set with variables that vary in how correlated they are.
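Before simulating, a quick back-of-the-envelope check: if each of k pairwise correlations were tested independently at alpha = .05, the chance of at least one spurious significant result would be 1 - (1 - .05)^k. A minimal sketch for 10 variables:
# sketch: family-wise chance of at least one spurious result across k independent tests
k <- choose(10, 2) # 45 pairwise correlations among 10 variables
1 - (1 - 0.05)^k   # roughly .90, so a spurious finding is almost guaranteed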
set.seed(316) #why do we do this?
N <- 300 # set our sample size
mu <- sample(1:7, 10, replace=T) # randomly draw 10 population means between 1 and 7
n <- 10 # set how many variables
covm <- rWishart(1,n,diag(n)) # simulates random positive definite matrices
sigma <- matrix(covm,10,10) #creates matrix
# A VCV matrix is non-positive definite when the variables are linear combinations of one another, or when one variable entirely determines/depends on another. If you randomly create your own VCV matrix without knowing what you're inputting, you're likely to encounter this.
df <- MASS::mvrnorm(n=N,mu=mu,Sigma=sigma) #simulate the data
df <- as.data.frame(df) # let's make it a dataframe
names(df) <- c("Y1","x2","x3","x4","x5","x6","x7","x8","x9","x10") # and name it
M <- cor(df) # run correlation
a. why do we need to set.seed before we simulate data? Answer: set.seed() sets the starting value used to generate the sequence of pseudo-random numbers, so we get the same simulated data each time we run the code. It is about reproducibility, not about minimizing error.
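A tiny sketch of the reproducibility point:
# sketch: the same seed reproduces the same "random" draws
set.seed(316); rnorm(3)
set.seed(316); rnorm(3) # identical to the line above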
b. what does the sample function do? Answer: It draws samples from a vector, with or without replacement. Sampling without replacement reshuffles the values, which is exactly what a permutation test needs.
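A quick sketch of sample() in both modes:
# sketch: sample() with and without replacement
sample(1:5)                    # without replacement: a reshuffling (permutation) of 1..5
sample(1:5, 3, replace = TRUE) # with replacement: 3 draws, values may repeat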
c. how many variables does df contain? Answer: 10 variables.
2. Next we are going to create a visual correlation matrix that contains information about both the strength of the correlation (color) and the significance of the correlation.
#this is a function that grabs p values from a correlation matrix
cor.mtest <- function(mat, ...) {
mat <- as.matrix(mat)
n <- ncol(mat)
p.mat<- matrix(NA, n, n)
diag(p.mat) <- 0
for (i in 1:(n - 1)) {
for (j in (i + 1):n) {
tmp <- cor.test(mat[, i], mat[, j], ...)
p.mat[i, j] <- p.mat[j, i] <- tmp$p.value
}
}
colnames(p.mat) <- rownames(p.mat) <- colnames(mat)
p.mat
}
# next we can apply the function to the data set we simulated earlier
p.mat <- cor.mtest(df)
#let's visualize the correlations, with color representing their strength and blank (white) cells representing non-significant correlations
col <- colorRampPalette(c("#BB4444", "#EE9988", "#FFFFFF", "#77AADD", "#4477AA"))
corrplot(M, method="color", col=col(200),
type="upper", order="hclust", #order by level of strength
addCoef.col = "black",
tl.col="black", tl.srt=45,
# Combine with significance
p.mat = p.mat, sig.level = 0.05, insig = "blank",
# hide correlation coefficient on the principal diagonal
diag=FALSE
)
a. what do the blue and red colored cells indicate?
Answer: Blue cells are significant positive correlations and red cells are significant negative correlations; the deeper the color, the stronger the correlation.
b. how many variables are not correlated? Answer: 12 of the pairwise correlations are not significant (the blank cells).
3. One way to test how many significant correlations we would expect by chance is a randomization test. This reshuffles the values in each column (i.e., chance) and compares the resulting correlations to the ones observed. If we do this over and over (e.g., 1,000 times), it tells us how many significant correlations we would expect by chance.
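To make the logic concrete, here is a hand-rolled sketch of a permutation test for a single pair of variables (the rand.test call below extends this idea to counting significant correlations across many pairs):
# sketch: permutation test for one correlation, done by hand
obs_r <- cor(df$Y1, df$x2)                           # observed correlation
perm_r <- replicate(1000, cor(sample(df$Y1), df$x2)) # reshuffle Y1 and recompute, 1000 times
mean(abs(perm_r) >= abs(obs_r))                      # permutation p-value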
#the rand.test function does just that
randTest <- rand.test(df[,1:5],df[,6:10],sims=1000, crit = .95)
## Warning in rand.test(df[, 1:5], df[, 6:10], sims = 1000, crit = 0.95): When p=.
## 00, confidence interval for p not valid.
?rand.test
## starting httpd help server ... done
#?rand.test
a. what two lists does this function give you and what do they indicate? (hint: run the line ?rand.test) Answer: It returns (1) the sampling distribution of the number of statistically significant correlations expected by chance and (2) the sampling distribution of the absolute value of r. Together they tell us how many significant correlations, and how large an |r|, we would expect if the variables were unrelated.
b. How many correlations were significant and how many would we expect by chance?
Answer: 37 were significant; we would expect about 4 to be significant by chance.
4. Next, let's simulate a between-subjects factorial design to look at how we might control for multiple comparisons
DV<-rnorm(80,c(100,200,300,400),sd=10) # scores drawn around four different group means
IV1<-rep(c("level1","level2", "level3", "level4"),each=2,20) # first factor, four levels
IV2<-rep(c("level1","level2"),each=1,40) # second factor, two levels
AllData<-data.frame(DV,IV1,IV2) # shorter vectors are recycled to the length of IV1
a. Remember that a model using aov (ANOVA) and lm (regression) with a categorical predictor will return the identical F test. The difference is that aov will give you the omnibus test, whereas the regression will choose a coding scheme (dummy coding by default) and compare each group mean to the reference group's mean (i.e., the intercept). Is the F test the same, and if so what is it?
anovaOutput <- aov(DV~IV1, data = AllData)
summary(anovaOutput)
## Df Sum Sq Mean Sq F value Pr(>F)
## IV1 3 1567452 522484 202.7 <2e-16 ***
## Residuals 156 402103 2578
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
lmOutput <-lm(DV~IV1, data = AllData)
summary(lmOutput)
##
## Call:
## lm(formula = DV ~ IV1, data = AllData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -69.447 -49.604 -1.634 49.290 69.952
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 148.441 8.027 18.492 <2e-16 ***
## IV1level2 201.101 11.352 17.714 <2e-16 ***
## IV1level3 5.258 11.352 0.463 0.644
## IV1level4 199.995 11.352 17.617 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 50.77 on 156 degrees of freedom
## Multiple R-squared: 0.7958, Adjusted R-squared: 0.7919
## F-statistic: 202.7 on 3 and 156 DF, p-value: < 2.2e-16
Answer: Yes, the F test is identical in both: F(3, 156) = 202.7, p < .001.
b. Okay, how do we run post hoc tests? Well, we can do so with the pairwise.t.test, which computes all possible comparisons. If we run a post hoc test without any corrections, how many are significant?
pairwise.t.test(AllData$DV, AllData$IV1, p.adj = "none")
##
## Pairwise comparisons using t tests with pooled SD
##
## data: AllData$DV and AllData$IV1
##
## level1 level2 level3
## level2 <2e-16 - -
## level3 0.64 <2e-16 -
## level4 <2e-16 0.92 <2e-16
##
## P value adjustment method: none
Answer: Four of the six pairwise comparisons are significant (all except level1 vs. level3 and level2 vs. level4).
c. What if we want to control for multiple comparisons using the Bonferroni correction? How many are significant now?
pairwise.t.test(AllData$DV, AllData$IV1, p.adj = "bonf")
##
## Pairwise comparisons using t tests with pooled SD
##
## data: AllData$DV and AllData$IV1
##
## level1 level2 level3
## level2 <2e-16 - -
## level3 1 <2e-16 -
## level4 <2e-16 1 <2e-16
##
## P value adjustment method: bonferroni
Answer: Still four. The Bonferroni correction multiplies each p-value by the number of comparisons (see the sketch below), but the significant p-values here are so small that the conclusions do not change; the non-significant ones are simply pushed toward 1.
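A minimal sketch of what the Bonferroni adjustment does, using the (rounded) raw p-values from the unadjusted output above:
# sketch: Bonferroni by hand -- each raw p times the number of comparisons, capped at 1
raw_p <- c(2e-16, 0.64, 2e-16, 2e-16, 0.92, 2e-16) # approximate raw pairwise p-values from above
p.adjust(raw_p, method = "bonferroni")
pmin(raw_p * length(raw_p), 1)                     # same result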
d. The most common multiple-comparison correction is Tukey's HSD, which requires a different function. Does this change anything?
TukeyHSD(anovaOutput)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = DV ~ IV1, data = AllData)
##
## $IV1
## diff lwr upr p adj
## level2-level1 201.100813 171.61908 230.58254 0.0000000
## level3-level1 5.258443 -24.22329 34.74017 0.9669323
## level4-level1 199.995284 170.51356 229.47701 0.0000000
## level3-level2 -195.842370 -225.32410 -166.36064 0.0000000
## level4-level2 -1.105529 -30.58726 28.37620 0.9996679
## level4-level3 194.736841 165.25511 224.21857 0.0000000
Answer: Tukey's HSD gives exact adjusted p-values and confidence intervals for each difference rather than just "<2e-16", but the conclusion is unchanged: the same four comparisons are significant.