This week's HW will be split into 2 sections that both mix applied and conceptual questions. The first will be on nonlinear relationships and the second on multiple comparisons and randomization tests.
The first section focuses on nonlinear relationships and uses a data set with a measure of test performance as a function of stress (along with a few other variables). Let's begin!
Per usual, we'll start by loading our necessary packages:
Next, our data:
1. Look at the relationship between performance and stress by regressing performance on stress. Do not center stress for now. Is stress significant?
Performmodel <- lm(performance~stress, data)
summary(Performmodel)
##
## Call:
## lm(formula = performance ~ stress, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.012 -1.226 0.119 1.250 3.381
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.0952 0.9856 6.185 6.08e-06 ***
## stress 0.1310 0.2204 0.594 0.559
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.02 on 19 degrees of freedom
## Multiple R-squared: 0.01824, Adjusted R-squared: -0.03343
## F-statistic: 0.3531 on 1 and 19 DF, p-value: 0.5594
Answer: No, stress is not a significant predictor of performance (b = 0.13, p = .56).
a. What can we do to examine our assumption of correct model specification (HINT: It’s also how we test some of the other assumptions)? Let’s do it and interpret what we see.
plot(Performmodel$residuals~Performmodel$fitted.values, ylab="Residuals", xlab="Predicted Values")
plot(Performmodel$residuals~data$stress, ylab="Residuals", xlab="Stress values")
Answer: We can plot the residuals against the fitted values (and against stress). Instead of a random band around zero, the residuals show a clear inverted-U pattern, which tells us the linear model is misspecified: there is a systematic nonlinear component that the model is missing.
b. Create three new terms: stress squared, centered stress, and centered stress squared. Correlate them with one another (along with stress). What do you notice? What are the implications for multicollinearity?
#Now let's center our stress variable
data$stress_c<-scale(data$stress, center = TRUE, scale = FALSE)
#Now we'll create quadratic terms with our centered and uncentered variables
data$stress_sq <- data$stress^2
data$stress_csq<- data$stress_c^2
cor(data[c("stress","stress_c","stress_sq","stress_csq")])
## stress stress_c stress_sq stress_csq
## stress 1.0000000 1.0000000 0.9773556 0.0000000
## stress_c 1.0000000 1.0000000 0.9773556 0.0000000
## stress_sq 0.9773556 0.9773556 1.0000000 0.2116037
## stress_csq 0.0000000 0.0000000 0.2116037 1.0000000
Answer: Stress and stress squared are almost perfectly correlated (r = .98), so entering both uncentered terms in one model would create severe multicollinearity. Centered stress, by contrast, is uncorrelated with centered stress squared (r = 0 here, and only .21 with raw stress squared). Centering before squaring removes this nonessential multicollinearity, which is why we use the centered terms in the quadratic model.
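To make the implication concrete, one option (a sketch, assuming the car package is installed) is to compare variance inflation factors (VIFs) for the uncentered and centered quadratic models:
#VIFs for the uncentered quadratic model (expect large values, given r = .98)
car::vif(lm(performance ~ stress + stress_sq, data))
#VIFs for the centered quadratic model (expect values near 1, given r = 0)
car::vif(lm(performance ~ stress_c + stress_csq, data))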
c. Review: what are the reasons that centering is useful?
Answer: Centering puts the mean of the predictor at 0, so the intercept becomes the predicted outcome for an average case, the linear coefficient in a polynomial model becomes the slope at the mean, and the nonessential multicollinearity between a predictor and its higher-order (or interaction) terms is greatly reduced.
d. Predict performance from stress and stress squared (using centering to reduce MC). Is this a good model?
modelquad<- lm(performance~stress_c+stress_csq, data)
summary(modelquad)
##
## Call:
## lm(formula = performance ~ stress_c + stress_csq, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.9048 -0.6508 0.1191 0.8333 1.8333
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.50794 0.36275 23.454 6.06e-15 ***
## stress_c 0.13095 0.11874 1.103 0.285
## stress_csq -0.47222 0.06855 -6.888 1.93e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.088 on 18 degrees of freedom
## Multiple R-squared: 0.73, Adjusted R-squared: 0.7
## F-statistic: 24.33 on 2 and 18 DF, p-value: 7.627e-06
Answer: Yes, this is a much better model: the quadratic term is significant (b = -0.47, p < .001) and R-squared jumps from .02 in the linear-only model to .73.
e. What do the signs tell us about the shape of the relationship between stress and performance?
plot(modelquad$fitted.values~data$stress, ylab="Predicted Y", xlab="Stress", col="blue", type="b")
Answer: The negative sign on the squared term tells us the curve opens downward: a curvilinear, inverted-U relationship in which performance rises with stress up to a point and then declines. The positive linear coefficient is the (slightly upward) slope at mean stress.
f. What is the mean slope of the line? What is the slope of the line when stress = 4? Answer: The mean slope is 0.13, the coefficient on centered stress. Because the mean of stress in these data is 4 (stress_c runs from -3 to 3), stress = 4 corresponds to stress_c = 0, so the slope at stress = 4 is also 0.13.
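As a quick check (a sketch using the fitted coefficients; the derivative of b0 + b1*x + b2*x^2 is b1 + 2*b2*x):
b <- coef(modelquad)
slope_at <- function(x_c) unname(b["stress_c"] + 2*b["stress_csq"]*x_c) #instantaneous slope at centered stress x_c
slope_at(0) #slope at the mean of stress (raw stress = 4), about 0.13
slope_at(4 - mean(data$stress)) #the same value computed from the raw stress score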
g. So, what does each of the raw regression coefficients tell us? Answer: The intercept (8.51) is the predicted performance at mean stress; the coefficient on centered stress (0.13) is the slope of the curve at mean stress (the average slope of the line); and the coefficient on centered stress squared (-0.47) captures the curvature, meaning the slope drops by 2*0.47, about 0.94, for each one-unit increase in stress.
h. What amount of stress will yield the greatest estimated performance?
8.508-((.130^2)/(4*-.472)) #maximum of the quadratic: b0 - b1^2/(4*b2)
## [1] 8.516951
Answer: The calculation above gives the maximum estimated performance (about 8.52), not the stress level that produces it. The peak of the parabola sits at stress_c = -b1/(2*b2) = -0.131/(2*-0.472), about 0.14 above the mean, i.e., a raw stress of about 4.14.
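Equivalently (a quick sketch using the fitted coefficients rather than rounded values):
b <- coef(modelquad)
peak_c <- unname(-b["stress_c"] / (2*b["stress_csq"])) #centered stress at the peak, about 0.14
peak_c + mean(data$stress) #the peak on the raw stress scale, about 4.14
unname(b["(Intercept)"] + b["stress_c"]*peak_c + b["stress_csq"]*peak_c^2) #maximum predicted performance, about 8.52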
i. Calculate the descriptive statistics for stress and stress squared. If we maintained the relationship between centered stress and centered stress squared, a zero on centered stress should correspond to a zero on centered stress squared. However, if we z-scored both variables, is this what would happen? What might this mean for the betas that we calculate based on these z-scores?
describe(data[c("stress_c","stress_csq")])
## vars n mean sd median trimmed mad min max range skew kurtosis
## stress_c 1 21 0 2.05 0 0.00 2.97 -3 3 6 0.00 -1.41
## stress_csq 2 21 4 3.55 4 3.88 4.45 0 9 9 0.42 -1.50
## se
## stress_c 0.45
## stress_csq 0.77
Answer: No. Centered stress squared has a mean of 4 and an SD of 3.55, so if we z-scored both variables separately, a zero on standardized stress would no longer correspond to a zero on the standardized squared term, and the squared term would no longer equal the square of the standardized linear term. Betas computed from those z-scores would therefore not be the correct standardized coefficients for the quadratic model.
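One way to see the problem directly (a minimal check, not required for the assignment): z-scoring the squared term is not the same as squaring the z-scored term, because stress_csq has its own mean and SD:
#the two columns differ, so separate z-scoring breaks the deterministic link between the terms
head(data.frame(z_of_square = as.numeric(scale(data$stress_csq)),
                square_of_z = as.numeric(scale(data$stress_c))^2))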
j. How CAN we get the proper betas? What are they?
data$performance_z <- scale(data$performance, scale=TRUE, center=TRUE)
data$stress_z <- scale(data$stress, scale=TRUE, center=TRUE)
data$stress_zsq <- data$stress_z^2
Zmodel <-lm(formula = performance_z ~ stress_z + stress_zsq, data = data)
summary(Zmodel)
##
## Call:
## lm(formula = performance_z ~ stress_z + stress_zsq, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.95868 -0.32755 0.05992 0.41942 0.92273
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.9507 0.1826 5.207 5.94e-05 ***
## stress_z 0.1351 0.1225 1.103 0.285
## stress_zsq -0.9982 0.1449 -6.888 1.93e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5477 on 18 degrees of freedom
## Multiple R-squared: 0.73, Adjusted R-squared: 0.7
## F-statistic: 24.33 on 2 and 18 DF, p-value: 7.627e-06
Answer: We standardize performance and stress first and then square the standardized stress (rather than z-scoring the squared term itself). Doing so gives betas of 0.14 for the linear term and -1.00 for the quadratic term.
k. How would the model be changed if we did not include the linear effect of stress?
Answer: Dropping the linear term forces the parabola's vertex to sit exactly at the mean of stress (stress_c = 0), so the curve is constrained to be symmetric around the mean. In general we keep lower-order terms whenever their higher-order terms are in the model; here, because the linear effect is small and non-significant, the fit would change only slightly (a quick check is sketched below).
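To see this for yourself (a quick sketch; output not shown here), drop the linear term and compare the two nested models:
modelquad_nolin <- lm(performance ~ stress_csq, data) #forces the peak to sit exactly at mean stress
anova(modelquad_nolin, modelquad) #does adding the linear term significantly improve the fit?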
In the next section, we are going to learn how to run permutation (i.e., randomization) tests in R, in addition to examining different ways to control for multiple comparisons. Importantly, we will do so by simulating some data, which will hopefully give you insight into how and when you might want to do so for your own research. The questions for this section may require you to do a quick Google search for a function, or better yet to type ? followed by the name of the function.
1. If we have a data set with n variables, what are the chances that one or more are significant just by chance (i.e. spurious)? To examine how we might go about figuring this out, we will simulate a data set with variables that vary in how correlated they are.
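As a rough back-of-envelope first (under the optimistic assumption that all tests are independent), the familywise error rate for k tests at alpha = .05 is 1 - (1 - .05)^k:
k <- choose(10, 2) #45 pairwise correlations among 10 variables
1 - (1 - 0.05)^k #roughly .90: at least one spurious "significant" correlation is very likely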
set.seed(316) #why do we do this?
N <-300 # set our sample size
mu <- sample(1:7, 10, replace=T) # randomly draw 10 population means between 1 and 7
n <- 10 #set how many variables
covm <- rWishart(1,n,diag(n)) # simulates random positive definite matrices
sigma <- matrix(covm,10,10) #reshape into a 10 x 10 covariance (VCV) matrix
# A non-positive definite VCV matrix arises when the variables are linear combinations of one another, or when one variable entirely determines/depends on another. If you randomly create your own VCV matrix without knowing what you're inputting, you're likely to encounter this.
df <- MASS::mvrnorm(n=N,mu=mu,Sigma=sigma) #simulate the data
df <- as.data.frame(df) # let's make it a dataframe
names(df) <- c("Y1","x2","x3","x4","x5","x6","x7","x8","x9","x10") # and name it
M <- cor(df) # run correlation
a. why do we need to set.seed before we simulate data? Answer: set.seed fixes the starting state of the random number generator so the simulation is reproducible: anyone who runs this code gets exactly the same "random" data and therefore the same results.
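A quick illustration of what set.seed buys us (a side sketch, separate from the assignment code):
set.seed(316); rnorm(3) #resetting the seed first means these three "random" numbers are identical every time this line is run
rnorm(3) #a fresh draw continues from the generator's current state, so it differs from the line above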
b. what does the sample function do? Answer: It randomly draws a given number of values from a supplied set; here it draws 10 values from 1 to 7 with replacement, which become the population means of our variables.
c. how many variables does df contain? Answer: It contains 10 variables.
2. Next we are going to create a visual correlation matrix that contains information about both the strength of the correlation (color) and the significance of the correlation.
#this is a function that grabs p values from a correlation matrix
cor.mtest <- function(mat, ...) {
  mat <- as.matrix(mat)
  n <- ncol(mat)
  p.mat <- matrix(NA, n, n)
  diag(p.mat) <- 0
  for (i in 1:(n - 1)) {
    for (j in (i + 1):n) {
      tmp <- cor.test(mat[, i], mat[, j], ...)
      p.mat[i, j] <- p.mat[j, i] <- tmp$p.value
    }
  }
  colnames(p.mat) <- rownames(p.mat) <- colnames(mat)
  p.mat
}
# next we can apply the function to the data set we simulated earlier
p.mat <- cor.mtest(df)
#let's visualize the correlations, with color representing the strength and direction of each correlation and blank (white) cells representing non-significant correlations
col <- colorRampPalette(c("#BB4444", "#EE9988", "#FFFFFF", "#77AADD", "#4477AA"))
corrplot(M, method="color", col=col(200),
type="upper", order="hclust", #order by level of strength
addCoef.col = "black",
tl.col="black", tl.srt=45,
# Combine with significance
p.mat = p.mat, sig.level = 0.05, insig = "blank",
# hide correlation coefficient on the principal diagonal
diag=FALSE
)
a. what do the blue and red colored cells indicate? Answer: With this palette, red cells mark negative correlations and blue cells mark positive correlations, with darker shades indicating stronger correlations; blank white cells are correlations that did not reach significance.
b. how many variables are not correlated? Answer: 11 of the pairwise correlations are not significant (the blank cells), though none of the estimated correlations is exactly zero.
3. One way to test how many correlations we would expect to find by chance is to use a randomization test. This basically reshuffles the values in each column (i.e., chance) and then compares the resulting correlation to the one observed. If we do this over and over (e.g., 1,000 times), it gives us insight into how many significant correlations we would expect by chance.
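Before using a canned function, here is what a single permutation test looks like by hand (a minimal sketch for one pair of the simulated variables):
obs_r <- cor(df$Y1, df$x2) #the observed correlation
perm_r <- replicate(1000, cor(df$Y1, sample(df$x2))) #reshuffle x2 and recompute the correlation 1,000 times
mean(abs(perm_r) >= abs(obs_r)) #permutation p value: the share of shuffles at least as extreme as what we observed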
#the rand.test function does just that
randTest <- rand.test(df[,1:5],df[,6:10],sims=1000, crit = .95)
## Warning in rand.test(df[, 1:5], df[, 6:10], sims = 1000, crit = 0.95): When p=.
## 00, confidence interval for p not valid.
#?rand.test
a. what two lists does this function give you and what do they indicate? (hint: run the line ?rand.test) Answer: It gives the approximate sampling distribution, under reshuffling, of (a) the average absolute r and (b) the number of significant tests.
b. How many correlations were significant and how many would we expect by chance? Answer: By chance we would expect about 4 significant correlations; we observed more than 20, far more than chance alone would produce.
4. Next, let's simulate a between-subjects factorial design to look at how we might control for multiple comparisons.
DV<-rnorm(80,c(100,200,300,400),sd=10) #simulate scores; the means 100/200/300/400 are recycled across the draws
IV1<-rep(c("level1","level2", "level3", "level4"),each=2,20) #4-level between-subjects factor (160 labels)
IV2<-rep(c("level1","level2"),each=1,40) #2-level between-subjects factor (80 labels)
AllData<-data.frame(DV,IV1,IV2) #combine; DV and IV2 are recycled to match IV1's 160 values
a. Remember that a model using aov (ANOVA) and lm (regression) with a categorical predictor will return the identical F test. The difference is that the aov will give you the omnibus test, whereas the regression will choose a coding scheme (dummy by default) and compare means to that value (i.e. intercept). Is the F test the same, and if so what is it?
anovaOutput <- aov(DV~IV1, data = AllData)
summary(anovaOutput)
## Df Sum Sq Mean Sq F value Pr(>F)
## IV1 3 1641655 547218 197 <2e-16 ***
## Residuals 156 433389 2778
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
lmOutput <-lm(DV~IV1, data = AllData)
summary(lmOutput)
##
## Call:
## lm(formula = DV ~ IV1, data = AllData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -69.45 -51.25 1.58 50.76 68.75
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 151.558 8.334 18.186 <2e-16 ***
## IV1level2 201.627 11.786 17.108 <2e-16 ***
## IV1level3 -5.569 11.786 -0.473 0.637
## IV1level4 197.865 11.786 16.788 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 52.71 on 156 degrees of freedom
## Multiple R-squared: 0.7911, Adjusted R-squared: 0.7871
## F-statistic: 197 on 3 and 156 DF, p-value: < 2.2e-16
Answer: Yes, the omnibus F test is identical: F(3, 156) = 197, p < .001.
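If you want to confirm the equivalence directly (a quick check), the same omnibus table can be pulled from the regression object:
anova(lmOutput) #returns the same omnibus F(3, 156) = 197 as summary(anovaOutput)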
b. Okay, how do we run post hoc tests? Well, we can do so with pairwise.t.test, which computes all possible pairwise comparisons. If we run a post hoc test without any corrections, how many are significant? Answer: Four of the six comparisons are significant; only level1 vs. level3 and level2 vs. level4 are not.
pairwise.t.test(AllData$DV, AllData$IV1, p.adj = "none")
##
## Pairwise comparisons using t tests with pooled SD
##
## data: AllData$DV and AllData$IV1
##
## level1 level2 level3
## level2 <2e-16 - -
## level3 0.64 <2e-16 -
## level4 <2e-16 0.75 <2e-16
##
## P value adjustment method: none
c. what if we want to control using a Bonferroni correction? How many are significant now? Answer: Still four of the six comparisons; the correction pushes the two non-significant p-values up to 1, but it does not change which comparisons are significant.
pairwise.t.test(AllData$DV, AllData$IV1, p.adj = "bonf")
##
## Pairwise comparisons using t tests with pooled SD
##
## data: AllData$DV and AllData$IV1
##
## level1 level2 level3
## level2 <2e-16 - -
## level3 1 <2e-16 -
## level4 <2e-16 1 <2e-16
##
## P value adjustment method: bonferroni
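For reference (a small sketch with illustrative p values), the Bonferroni adjustment simply multiplies each raw p value by the number of comparisons and caps the result at 1, which is what p.adjust does:
p_raw <- c(0.64, 0.75, 2e-16, 2e-16, 2e-16, 2e-16) #illustrative raw p values for the six comparisons
p.adjust(p_raw, method = "bonferroni") #equivalent to pmin(p_raw * 6, 1)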
d. The most common multiple comparison procedure is Tukey's HSD, which requires a different function. Does this change anything? Answer: No. The same four comparisons are significant with Tukey as with Bonferroni; only the adjusted p-values for the two non-significant comparisons differ slightly (about .97 and .99 instead of 1).
TukeyHSD(anovaOutput)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = DV ~ IV1, data = AllData)
##
## $IV1
## diff lwr upr p adj
## level2-level1 201.627458 171.02026 232.23465 0.0000000
## level3-level1 -5.569371 -36.17657 25.03782 0.9650043
## level4-level1 197.865027 167.25783 228.47222 0.0000000
## level3-level2 -207.196829 -237.80402 -176.58963 0.0000000
## level4-level2 -3.762431 -34.36963 26.84476 0.9887056
## level4-level3 203.434398 172.82720 234.04159 0.0000000