This week's HW will be split into 2 sections that both mix applied and conceptual questions. The first will be on nonlinear relationships and the second on multiple comparisons and randomization tests.
The first section focuses on nonlinear relationships and uses a data set with a measure of test performance as a function of stress (along with a few other variables). Let's begin!
Per usual, we'll start by loading our necessary packages:
Next, our data:
1. Look at the relationship between performance and stress by regressing performance on stress. Do not center stress for now. Is stress significant?
Performmodel <- lm(performance~stress, data)
summary(Performmodel)
##
## Call:
## lm(formula = performance ~ stress, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.012 -1.226 0.119 1.250 3.381
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.0952 0.9856 6.185 6.08e-06 ***
## stress 0.1310 0.2204 0.594 0.559
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.02 on 19 degrees of freedom
## Multiple R-squared: 0.01824, Adjusted R-squared: -0.03343
## F-statistic: 0.3531 on 1 and 19 DF, p-value: 0.5594
Answer: No, stress is not a significant predictor of performance (b = 0.13, p = .56).
a. What can we do to examine our assumption of correct model specification (HINT: It’s also how we test some of the other assumptions)? Let’s do it and interpret what we see.
plot(Performmodel$residuals~Performmodel$fitted.values, ylab="Residuals", xlab="Predicted Values")
plot(Performmodel$residuals~data$stress, ylab="Residuals", xlab="Stress values")
Answer: We can plot the residuals against the fitted values (and against stress). Instead of a random band around zero, the residuals show a clear inverted-U pattern, which tells us the linear model is misspecified: there is a systematic nonlinear component that the model is missing.
b. Create three new terms: stress squared, centered stress, and centered stress squared. Correlate them with one another (along with stress). What do you notice? What are the implications for multicollinearity?
#Now let's center our stress variable
data$stress_c<-scale(data$stress, center = TRUE, scale = FALSE)
#Now we'll create quadratic terms with our centered and uncentered variables
data$stress_sq <- data$stress^2
data$stress_csq<- data$stress_c^2
cor(data[c("stress","stress_c","stress_sq","stress_csq")])
## stress stress_c stress_sq stress_csq
## stress 1.0000000 1.0000000 0.9773556 0.0000000
## stress_c 1.0000000 1.0000000 0.9773556 0.0000000
## stress_sq 0.9773556 0.9773556 1.0000000 0.2116037
## stress_csq 0.0000000 0.0000000 0.2116037 1.0000000
Answer: Stress and stress squared are almost perfectly correlated (r = .98), so entering both uncentered terms in one model would create severe multicollinearity. Centered stress, by contrast, is uncorrelated with centered stress squared (r = 0 here, and only .21 with raw stress squared). Centering before squaring removes this nonessential multicollinearity, which is why we use the centered terms in the quadratic model.
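To make the implication concrete, one option (a sketch, assuming the car package is installed) is to compare variance inflation factors (VIFs) for the uncentered and centered quadratic models:
#VIFs for the uncentered quadratic model (expect large values, given r = .98)
car::vif(lm(performance ~ stress + stress_sq, data))
#VIFs for the centered quadratic model (expect values near 1, given r = 0)
car::vif(lm(performance ~ stress_c + stress_csq, data))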
c. Review: what are the reasons that centering is useful?
Answer: Centering puts the mean of the predictor at 0, so the intercept becomes the predicted outcome for an average case, the linear coefficient in a polynomial model becomes the slope at the mean, and the nonessential multicollinearity between a predictor and its higher-order (or interaction) terms is greatly reduced.
d. Predict performance from stress and stress squared (using centering to reduce MC). Is this a good model?
modelquad<- lm(performance~stress_c+stress_csq, data)
summary(modelquad)
##
## Call:
## lm(formula = performance ~ stress_c + stress_csq, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.9048 -0.6508 0.1191 0.8333 1.8333
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.50794 0.36275 23.454 6.06e-15 ***
## stress_c 0.13095 0.11874 1.103 0.285
## stress_csq -0.47222 0.06855 -6.888 1.93e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.088 on 18 degrees of freedom
## Multiple R-squared: 0.73, Adjusted R-squared: 0.7
## F-statistic: 24.33 on 2 and 18 DF, p-value: 7.627e-06
Answer: Yes, this is a much better model: the quadratic term is significant (b = -0.47, p < .001) and R-squared jumps from .02 in the linear-only model to .73.
e. What do the signs tell us about the shape of the relationship between stress and performance?
plot(modelquad$fitted.values~data$stress, ylab="Predicted Y", xlab="Stress", col="blue", type="b")
Answer: The negative sign on the squared term tells us the curve opens downward: a curvilinear, inverted-U relationship in which performance rises with stress up to a point and then declines. The positive linear coefficient is the (slightly upward) slope at mean stress.
f. What is the mean slope of the line? What is the slope of the line when stress = 4? Answer: The mean slope is 0.13, the coefficient on centered stress. Because the mean of stress in these data is 4 (stress_c runs from -3 to 3), stress = 4 corresponds to stress_c = 0, so the slope at stress = 4 is also 0.13.
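As a quick check (a sketch using the fitted coefficients; the derivative of b0 + b1*x + b2*x^2 is b1 + 2*b2*x):
b <- coef(modelquad)
slope_at <- function(x_c) unname(b["stress_c"] + 2*b["stress_csq"]*x_c) #instantaneous slope at centered stress x_c
slope_at(0) #slope at the mean of stress (raw stress = 4), about 0.13
slope_at(4 - mean(data$stress)) #the same value computed from the raw stress score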
g. So, what does each of the raw regression coefficients tell us? Answer: The intercept (8.51) is the predicted performance at mean stress; the coefficient on centered stress (0.13) is the slope of the curve at mean stress (the average slope of the line); and the coefficient on centered stress squared (-0.47) captures the curvature, meaning the slope drops by 2*0.47, about 0.94, for each one-unit increase in stress.
h. What amount of stress will yield the greatest estimated performance?
8.508-((.130^2)/(4*-.472)) #maximum of the quadratic: b0 - b1^2/(4*b2)
## [1] 8.516951
Answer: The calculation above gives the maximum estimated performance (about 8.52), not the stress level that produces it. The peak of the parabola sits at stress_c = -b1/(2*b2) = -0.131/(2*-0.472), about 0.14 above the mean, i.e., a raw stress of about 4.14.
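Equivalently (a quick sketch using the fitted coefficients rather than rounded values):
b <- coef(modelquad)
peak_c <- unname(-b["stress_c"] / (2*b["stress_csq"])) #centered stress at the peak, about 0.14
peak_c + mean(data$stress) #the peak on the raw stress scale, about 4.14
unname(b["(Intercept)"] + b["stress_c"]*peak_c + b["stress_csq"]*peak_c^2) #maximum predicted performance, about 8.52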
i. Calculate the descriptive statistics for stress and stress squared. If we maintained the relationship between centered stress and centered stress squared, a zero on centered stress should correspond to a zero on centered stress squared. However, if we z-scored both variables, is this what would happen? What might this mean for the betas that we calculate based on these z-scores?
describe(data[c("stress_c","stress_csq")])
## vars n mean sd median trimmed mad min max range skew kurtosis
## stress_c 1 21 0 2.05 0 0.00 2.97 -3 3 6 0.00 -1.41
## stress_csq 2 21 4 3.55 4 3.88 4.45 0 9 9 0.42 -1.50
## se
## stress_c 0.45
## stress_csq 0.77
Answer: No. Centered stress squared has a mean of 4 and an SD of 3.55, so if we z-scored both variables separately, a zero on standardized stress would no longer correspond to a zero on the standardized squared term, and the squared term would no longer equal the square of the standardized linear term. Betas computed from those z-scores would therefore not be the correct standardized coefficients for the quadratic model.
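One way to see the problem directly (a minimal check, not required for the assignment): z-scoring the squared term is not the same as squaring the z-scored term, because stress_csq has its own mean and SD:
#the two columns differ, so separate z-scoring breaks the deterministic link between the terms
head(data.frame(z_of_square = as.numeric(scale(data$stress_csq)),
                square_of_z = as.numeric(scale(data$stress_c))^2))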
j. How CAN we get the proper betas? What are they?
data$performance_z <- scale(data$performance, scale=TRUE, center=TRUE)
data$stress_z <- scale(data$stress, scale=TRUE, center=TRUE)
data$stress_zsq <- data$stress_z^2
Zmodel <-lm(formula = performance_z ~ stress_z + stress_zsq, data = data)
summary(Zmodel)
##
## Call:
## lm(formula = performance_z ~ stress_z + stress_zsq, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.95868 -0.32755 0.05992 0.41942 0.92273
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.9507 0.1826 5.207 5.94e-05 ***
## stress_z 0.1351 0.1225 1.103 0.285
## stress_zsq -0.9982 0.1449 -6.888 1.93e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5477 on 18 degrees of freedom
## Multiple R-squared: 0.73, Adjusted R-squared: 0.7
## F-statistic: 24.33 on 2 and 18 DF, p-value: 7.627e-06
Answer: We standardize performance and stress first and then square the standardized stress (rather than z-scoring the squared term itself). Doing so gives betas of 0.14 for the linear term and -1.00 for the quadratic term.
k. How would the model be changed if we did not include the linear effect of stress?
Answer: Dropping the linear term forces the parabola's vertex to sit exactly at the mean of stress (stress_c = 0), so the curve is constrained to be symmetric around the mean. In general we keep lower-order terms whenever their higher-order terms are in the model; here, because the linear effect is small and non-significant, the fit would change only slightly (a quick check is sketched below).
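To see this for yourself (a quick sketch; output not shown here), drop the linear term and compare the two nested models:
modelquad_nolin <- lm(performance ~ stress_csq, data) #forces the peak to sit exactly at mean stress
anova(modelquad_nolin, modelquad) #does adding the linear term significantly improve the fit?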
In the next section, we are going to learn how to run permutation (i.e., randomization) tests in R, in addition to examining different ways to control for multiple comparisons. Importantly, we will do so by simulating some data, which will hopefully give you insight into how and when you might want to do so for your own research. The questions for this section may require you to do a quick Google search for a function, or better yet to type ? followed by the name of the function.
1. If we have a data set with n variables, what are the chances that one or more are significant just by chance (i.e. spurious)? To examine how we might go about figuring this out, we will simulate a data set with variables that vary in how correlated they are.
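As a rough back-of-envelope first (under the optimistic assumption that all tests are independent), the familywise error rate for k tests at alpha = .05 is 1 - (1 - .05)^k:
k <- choose(10, 2) #45 pairwise correlations among 10 variables
1 - (1 - 0.05)^k #roughly .90: at least one spurious "significant" correlation is very likely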
set.seed(316) #why do we do this?
N <-300 # set our sample size
mu <- sample(1:7, 10, replace=T) # randomly draw 10 population means between 1 and 7
n <- 10 #set how many variables
covm <- rWishart(1,n,diag(n)) # simulates random positive definite matrices
sigma <- matrix(covm,10,10) #reshape into a 10 x 10 covariance (VCV) matrix
# A non-positive definite VCV matrix arises when the variables are linear combinations of one another, or when one variable entirely determines/depends on another. If you randomly create your own VCV matrix without knowing what you're inputting, you're likely to encounter this.
df <- MASS::mvrnorm(n=N,mu=mu,Sigma=sigma) #simulate the data
df <- as.data.frame(df) # let's make it a dataframe
names(df) <- c("Y1","x2","x3","x4","x5","x6","x7","x8","x9","x10") # and name it
M <- cor(df) # run correlation
a. why do we need to set.seed before we simulate data? Answer: set.seed fixes the starting state of the random number generator so the simulation is reproducible: anyone who runs this code gets exactly the same "random" data and therefore the same results.
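A quick illustration of what set.seed buys us (a side sketch, separate from the assignment code):
set.seed(316); rnorm(3) #resetting the seed first means these three "random" numbers are identical every time this line is run
rnorm(3) #a fresh draw continues from the generator's current state, so it differs from the line above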
b. what does the sample function do? Answer: It randomly draws a given number of values from a supplied set; here it draws 10 values from 1 to 7 with replacement, which become the population means of our variables.
c. how many variables does df contain? Answer: It contains 10 variables.
2. Next we are going to create a visual correlation matrix that contains information about both the strength of the correlation (color) and the significance of the correlation.
#this is a function that grabs p values from a correlation matrix
cor.mtest <- function(mat, ...) {
  mat <- as.matrix(mat)
  n <- ncol(mat)
  p.mat <- matrix(NA, n, n)
  diag(p.mat) <- 0
  for (i in 1:(n - 1)) {
    for (j in (i + 1):n) {
      tmp <- cor.test(mat[, i], mat[, j], ...)
      p.mat[i, j] <- p.mat[j, i] <- tmp$p.value
    }
  }
  colnames(p.mat) <- rownames(p.mat) <- colnames(mat)
  p.mat
}
# next we can apply the function to the data set we simulated earlier
p.mat <- cor.mtest(df)
#let's visualize the correlations, with color representing the strength and direction of each correlation and blank (white) cells representing non-significant correlations
col <- colorRampPalette(c("#BB4444", "#EE9988", "#FFFFFF", "#77AADD", "#4477AA"))
corrplot(M, method="color", col=col(200),
type="upper", order="hclust", #order by level of strength
addCoef.col = "black",
tl.col="black", tl.srt=45,
# Combine with significance
p.mat = p.mat, sig.level = 0.05, insig = "blank",
# hide correlation coefficient on the principal diagonal
diag=FALSE
)
a. what do the blue and red colored cells indicate? Answer: With this palette, red cells mark negative correlations and blue cells mark positive correlations, with darker shades indicating stronger correlations; blank white cells are correlations that did not reach significance.
b. how many variables are not correlated? Answer: 11 of the pairwise correlations are not significant (the blank cells), though none of the estimated correlations is exactly zero.
3. One way to test how many correlations we would expect to find by chance is to use a randomization test. This basically reshuffles the values in each column (i.e., chance) and then compares the resulting correlation to the one observed. If we do this over and over (e.g., 1,000 times), it gives us insight into how many significant correlations we would expect by chance.
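Before using a canned function, here is what a single permutation test looks like by hand (a minimal sketch for one pair of the simulated variables):
obs_r <- cor(df$Y1, df$x2) #the observed correlation
perm_r <- replicate(1000, cor(df$Y1, sample(df$x2))) #reshuffle x2 and recompute the correlation 1,000 times
mean(abs(perm_r) >= abs(obs_r)) #permutation p value: the share of shuffles at least as extreme as what we observed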
#the rand.test function does just that
randTest <- rand.test(df[,1:5],df[,6:10],sims=1000, crit = .95)
## Warning in rand.test(df[, 1:5], df[, 6:10], sims = 1000, crit = 0.95): When p=.
## 00, confidence interval for p not valid.
#?rand.test
a. what two lists does this function give you and what do they indicate? (hint: run the line ?rand.test) Answer: It gives the approximate sampling distribution, under reshuffling, of (a) the average absolute r and (b) the number of significant tests.
b. How many correlations were significant and how many would we expect by chance? Answer: By chance we would expect about 4 significant correlations; we observed more than 20, far more than chance alone would produce.
4. Next, let's simulate a between-subjects factorial design to look at how we might control for multiple comparisons.
DV<-rnorm(80,c(100,200,300,400),sd=10) #simulate scores; the means 100/200/300/400 are recycled across the draws
IV1<-rep(c("level1","level2", "level3", "level4"),each=2,20) #4-level between-subjects factor (160 labels)
IV2<-rep(c("level1","level2"),each=1,40) #2-level between-subjects factor (80 labels)
AllData<-data.frame(DV,IV1,IV2) #combine; DV and IV2 are recycled to match IV1's 160 values
a. Remember that a model using aov (ANOVA) and lm (regression) with a categorical predictor will return the identical F test. The difference is that the aov will give you the omnibus test, whereas the regression will choose a coding scheme (dummy by default) and compare means to that value (i.e. intercept). Is the F test the same, and if so what is it?
anovaOutput <- aov(DV~IV1, data = AllData)
summary(anovaOutput)
## Df Sum Sq Mean Sq F value Pr(>F)
## IV1 3 1641655 547218 197 <2e-16 ***
## Residuals 156 433389 2778
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
lmOutput <-lm(DV~IV1, data = AllData)
summary(lmOutput)
##
## Call:
## lm(formula = DV ~ IV1, data = AllData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -69.45 -51.25 1.58 50.76 68.75
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 151.558 8.334 18.186 <2e-16 ***
## IV1level2 201.627 11.786 17.108 <2e-16 ***
## IV1level3 -5.569 11.786 -0.473 0.637
## IV1level4 197.865 11.786 16.788 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 52.71 on 156 degrees of freedom
## Multiple R-squared: 0.7911, Adjusted R-squared: 0.7871
## F-statistic: 197 on 3 and 156 DF, p-value: < 2.2e-16
Answer: Yes, the omnibus F test is identical: F(3, 156) = 197, p < .001.
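If you want to confirm the equivalence directly (a quick check), the same omnibus table can be pulled from the regression object:
anova(lmOutput) #returns the same omnibus F(3, 156) = 197 as summary(anovaOutput)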
b. Okay, how do we run post hoc tests? Well, we can do so with pairwise.t.test, which computes all possible pairwise comparisons. If we run a post hoc test without any corrections, how many are significant? Answer: Four of the six comparisons are significant; only level1 vs. level3 and level2 vs. level4 are not.
pairwise.t.test(AllData$DV, AllData$IV1, p.adj = "none")
##
## Pairwise comparisons using t tests with pooled SD
##
## data: AllData$DV and AllData$IV1
##
## level1 level2 level3
## level2 <2e-16 - -
## level3 0.64 <2e-16 -
## level4 <2e-16 0.75 <2e-16
##
## P value adjustment method: none
c. what if we want to control using a Bonferroni correction? How many are significant now? Answer: Still four of the six comparisons; the correction pushes the two non-significant p-values up to 1, but it does not change which comparisons are significant.
pairwise.t.test(AllData$DV, AllData$IV1, p.adj = "bonf")
##
## Pairwise comparisons using t tests with pooled SD
##
## data: AllData$DV and AllData$IV1
##
## level1 level2 level3
## level2 <2e-16 - -
## level3 1 <2e-16 -
## level4 <2e-16 1 <2e-16
##
## P value adjustment method: bonferroni
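For reference (a small sketch with illustrative p values), the Bonferroni adjustment simply multiplies each raw p value by the number of comparisons and caps the result at 1, which is what p.adjust does:
p_raw <- c(0.64, 0.75, 2e-16, 2e-16, 2e-16, 2e-16) #illustrative raw p values for the six comparisons
p.adjust(p_raw, method = "bonferroni") #equivalent to pmin(p_raw * 6, 1)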
d. The most common multiple comparison procedure is Tukey's HSD, which requires a different function. Does this change anything? Answer: No. The same four comparisons are significant with Tukey as with Bonferroni; only the adjusted p-values for the two non-significant comparisons differ slightly (about .97 and .99 instead of 1).
TukeyHSD(anovaOutput)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = DV ~ IV1, data = AllData)
##
## $IV1
## diff lwr upr p adj
## level2-level1 201.627458 171.02026 232.23465 0.0000000
## level3-level1 -5.569371 -36.17657 25.03782 0.9650043
## level4-level1 197.865027 167.25783 228.47222 0.0000000
## level3-level2 -207.196829 -237.80402 -176.58963 0.0000000
## level4-level2 -3.762431 -34.36963 26.84476 0.9887056
## level4-level3 203.434398 172.82720 234.04159 0.0000000