Multiple comparisons

1

mult.comparison = matrix(nrow=9, ncol=3)
colnames(mult.comparison) = c("n", "comb(n)", "Type I error rate")
# number of pairwise comparisons among n groups: n! / ((n-2)! 2!) = choose(n, 2)
comb = function(n) factorial(n) / (factorial(n - 2) * factorial(2))
# one simulated experiment: draw comb(n) independent uniform p-values
# and check whether any of them falls below .05
sim = function(n) any(runif(comb(n), 0, 1) < .05)
n.sims = 1000
for (i in 2:10) {
    sims = replicate(n.sims, sim(i))
    mult.comparison[i - 1, ] = c(i, comb(i), mean(sims))
}
mult.comparison
##        n comb(n) Type I error rate
##  [1,]  2       1             0.055
##  [2,]  3       3             0.131
##  [3,]  4       6             0.272
##  [4,]  5      10             0.375
##  [5,]  6      15             0.549
##  [6,]  7      21             0.650
##  [7,]  8      28             0.779
##  [8,]  9      36             0.834
##  [9,] 10      45             0.872
plot(mult.comparison[,1], mult.comparison[,3], xlab="n", ylab="Type I error rate", main="1. Multiple Comparison")
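
As a sanity check on the simulation, the expected family-wise error rate for k independent tests at alpha = .05 is 1 - .95^k. Computing this analytic value for k = choose(n, 2) should roughly track the simulated rates above (a quick sketch; ns and ks are just helper names introduced here):

# Analytic family-wise error rate for k independent uniform p-values:
# P(at least one p < .05) = 1 - .95^k, with k = choose(n, 2) comparisons
ns = 2:10
ks = choose(ns, 2)
round(1 - .95^ks, 3)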

2

pairwise.comparison = matrix(nrow=9, ncol=3)
colnames(pairwise.comparison) = c("n", "comb(n)", "Type I error rate")
for (n in 2:10) {
    # one simulated experiment: n groups drawn from the *same* normal
    # distribution, followed by all pairwise Welch t-tests
    sim = function(k) {
        data.sets = sapply(1:n, FUN = function(i) rnorm(15, 3, 1))
        all.p.values = c()
        for (i in 1:(n - 1)) {
            for (j in (i + 1):n) {
                p.value = t.test(data.sets[, i], data.sets[, j])$p.value
                all.p.values = c(all.p.values, p.value)
            }
        }
        return(any(all.p.values < .05))
    }
    sims = sapply(1:1000, sim)
    pairwise.comparison[n - 1, ] = c(n, comb(n), mean(sims))
}
pairwise.comparison
##        n comb(n) Type I error rate
##  [1,]  2       1             0.047
##  [2,]  3       3             0.120
##  [3,]  4       6             0.203
##  [4,]  5      10             0.282
##  [5,]  6      15             0.337
##  [6,]  7      21             0.445
##  [7,]  8      28             0.491
##  [8,]  9      36             0.572
##  [9,] 10      45             0.622
plot(pairwise.comparison[,1], pairwise.comparison[,3], xlab="n", ylab="Type I error rate", main="2. Pairwise Comparison")  
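
A more compact way to enumerate the same pairwise tests is R's combn(). This is only a sketch: it assumes a data.sets matrix with n columns like the one built inside the loop above.

# All pairwise Welch t-tests over the columns of data.sets via combn();
# combn(n, 2, FUN) applies FUN to every pair of column indices
p.vals = combn(n, 2, FUN = function(pair)
    t.test(data.sets[, pair[1]], data.sets[, pair[2]])$p.value)
any(p.vals < .05)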

3

The Type I error rates generated in problems 1 and 2 are different. Under the null hypothesis, each individual t-test p-value in problem 2 is still (approximately) uniformly distributed, just like the randomly generated p-values in problem 1. The difference is that the p-values in problem 2 are not independent: any two tests that share a group are computed from overlapping data, so their p-values are correlated. This dependence makes it less likely that at least one of them falls below .05, and the family-wise Type I error rate therefore grows more slowly with n than in the independent-uniform simulation of problem 1.

plot(mult.comparison[,3], pairwise.comparison[,3], xlab="Multiple Comparison Type I error rate", ylab="Pairwise Comparison Type I error rate", main="3. Type I error rate comparison")
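
To support this, a quick simulation sketch (using the same data-generating setup as problem 2) shows that a single t-test under the null produces roughly uniform p-values, rejecting about 5% of the time on its own; the lower family-wise rate in problem 2 therefore comes from the dependence among the tests, not from individually conservative p-values.

# p-values of a single Welch t-test under the null: two samples drawn from
# the same N(3, 1) distribution, repeated many times
null.p = replicate(1000, t.test(rnorm(15, 3, 1), rnorm(15, 3, 1))$p.value)
hist(null.p, main = "p-values of a single t-test under the null", xlab = "p-value")
mean(null.p < .05)  # should be close to .05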

Linear regression

library(languageR)

4

plot( english$Ncount, english$RTlexdec, ylab="Lexical Decision Time", xlab="Neighborhood Density", main="4. Linear Regression")

5

m = lm(english$RTlexdec ~ english$Ncount, data=english)
summary(m)
## 
## Call:
## lm(formula = english$RTlexdec ~ english$Ncount, data = english)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.35180 -0.12503  0.00113  0.10205  0.63715 
## 
## Coefficients:
##                  Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)     6.5633025  0.0037644 1743.528  < 2e-16 ***
## english$Ncount -0.0021075  0.0004735   -4.451 8.76e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1566 on 4566 degrees of freedom
## Multiple R-squared:  0.00432,    Adjusted R-squared:  0.004102 
## F-statistic: 19.81 on 1 and 4566 DF,  p-value: 8.755e-06

What is the interpretation of the intercept and slope coefficients?
For the intercept, “the predicted Lexical Decision Time is 6.56 when Neighborhood Density is 0.”
For the slope, “the predicted Lexical Decision Time decreases by .0021 for each one-unit increase in Neighborhood Density.”
Are these estimates statistically significant? At what level?
Both estimates are statistically significant at the p < .0001 level.
What does “statistically significant” mean in this context?
In this context, “statistically significant” means that estimates this far from zero would be very unlikely if the null hypothesis were true, so we reject the null hypothesis.

The null hypotheses are that the intercept is 0 (Lexical Decision Time is 0 when Neighborhood Density is 0) and that the slope is 0 (there is no relationship between Lexical Decision Time and Neighborhood Density).
What proportion of the variance in the data does the model explain?
Only about 0.4 percent of the variance in Lexical Decision Time (R-squared = 0.0043) is explained by Neighborhood Density.
What does this statistic mean - i.e., what model is your model being compared to when we say “it explains d% of the variance”, and by what procedure?
The R-squared statistic is the ratio of the explained sum of squares to the total sum of squares. The total sum of squares measures the variation of the observed Lexical Decision Times around their mean; the mean is exactly what the comparison model, an intercept-only model, predicts for every observation. The explained sum of squares measures how much of that variation the fitted values capture. When the two are equal (the ratio is 1), the regression line fits the data perfectly. Our regression line was fit by minimizing the residual sum of squares, but it still captures only about 0.4 percent of the variation around the mean.
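
A minimal sketch of that comparison, computing R-squared by hand from the residual and total sums of squares:

# R-squared by hand: 1 - RSS/TSS, where TSS is the residual sum of squares
# of the intercept-only model (every observation predicted by the mean)
rss = sum(resid(m)^2)
tss = sum((english$RTlexdec - mean(english$RTlexdec))^2)
1 - rss / tss  # should match summary(m)$r.squared, about 0.0043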

6

coef(m)
##    (Intercept) english$Ncount 
##    6.563302539   -0.002107512
plot( english$Ncount, english$RTlexdec, ylab="Lexical Decision Time", xlab="Neighborhood Density", main="6. Best-Fit Regression Line")
abline(a=coef(m)[1], b=coef(m)[2], col='red')

Neighborhood Density does not explain Lexical Decision Time very well at all. This does not conflict with the answer about statistical significance above: significance only means the slope is reliably different from zero, not that Neighborhood Density is a practically important predictor of Lexical Decision Time.
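
One way to see how small the effect is (a rough sketch): compare the predicted change in Lexical Decision Time across the full observed range of Neighborhood Density with the model's residual standard error.

# Predicted change in RTlexdec across the observed range of Ncount,
# versus the typical size of a single residual
coef(m)["english$Ncount"] * diff(range(english$Ncount))
sigma(m)  # residual standard error, about 0.157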

7

plot(resid(m))

plot(english$RTlexdec)
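
To see whether the pattern lines up with a grouping variable, one quick check (a sketch using the AgeSubject column of the english data) is to color the residuals of m by subject age group:

# Residuals of m colored by AgeSubject; clear separation of the two colors
# would suggest the pattern is driven by subject age
plot(resid(m), col = ifelse(english$AgeSubject == "young", "blue", "red"),
     ylab = "Residuals", main = "Residuals of m by AgeSubject")
legend("topright", legend = c("young", "old"), col = c("blue", "red"), pch = 1)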

There is an obvious pattern in the residuals, and in the Lexical Decision Times themselves. Looking at the data around the shifts in the pattern, these shifts appear to be explained by the age of the subject.

8

m2 = lm(english$RTlexdec ~ english$Ncount + english$AgeSubject, data=english)
summary(m2)
## 
## Call:
## lm(formula = english$RTlexdec ~ english$Ncount + english$AgeSubject, 
##     data = english)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.27097 -0.08315 -0.01641  0.06948  0.52629 
## 
## Coefficients:
##                           Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)              6.6741633  0.0031216 2138.068  < 2e-16 ***
## english$Ncount          -0.0021075  0.0003344   -6.303  3.2e-10 ***
## english$AgeSubjectyoung -0.2217215  0.0032725  -67.754  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1106 on 4565 degrees of freedom
## Multiple R-squared:  0.5035, Adjusted R-squared:  0.5033 
## F-statistic:  2315 on 2 and 4565 DF,  p-value: < 2.2e-16

When Age of Subject is added as a second independent variable, the model explains about 50% of the variance in Lexical Decision Time.

anova(m)
## Analysis of Variance Table
## 
## Response: english$RTlexdec
##                  Df  Sum Sq Mean Sq F value    Pr(>F)    
## english$Ncount    1   0.486 0.48580  19.811 8.755e-06 ***
## Residuals      4566 111.970 0.02452                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(m2)
## Analysis of Variance Table
## 
## Response: english$RTlexdec
##                      Df Sum Sq Mean Sq  F value    Pr(>F)    
## english$Ncount        1  0.486   0.486   39.723 3.203e-10 ***
## english$AgeSubject    1 56.141  56.141 4590.562 < 2.2e-16 ***
## Residuals          4565 55.829   0.012                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The residual sum of squares is much smaller in the second model (55.829) than in the first (111.970).
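
A direct way to test whether the added predictor is a significant improvement is a nested-model comparison, passing both fits to anova():

# F-test comparing the nested models: does adding AgeSubject significantly
# reduce the residual sum of squares relative to m?
anova(m, m2)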

par(mfrow=c(1,2))
plot(resid(m))
plot(resid(m2))

The residuals of m2 no longer show the banded pattern visible in the residuals of m; they are much more homogeneous.

9

There does not appear to be an interaction between Neighborhood Density and Age of Subject. A priori there is also little reason to expect one, since the two variables describe different kinds of units: Neighborhood Density is a property of the word, while Age of Subject is a property of the subject.
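
A quick visual check (a sketch): fit a separate regression line for each age group; roughly parallel lines are consistent with the absence of an interaction.

# RTlexdec vs. Ncount with one fitted line per age group; parallel lines
# mean the Ncount slope does not depend on AgeSubject
plot(english$Ncount, english$RTlexdec,
     col = ifelse(english$AgeSubject == "young", "blue", "red"),
     xlab = "Neighborhood Density", ylab = "Lexical Decision Time")
abline(lm(RTlexdec ~ Ncount, data = subset(english, AgeSubject == "young")), col = "blue")
abline(lm(RTlexdec ~ Ncount, data = subset(english, AgeSubject == "old")), col = "red")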

10

m3 = lm(english$RTlexdec ~ english$Ncount * english$AgeSubject, data=english)
summary(m3)
## 
## Call:
## lm(formula = english$RTlexdec ~ english$Ncount * english$AgeSubject, 
##     data = english)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.27112 -0.08316 -0.01635  0.06947  0.52628 
## 
## Coefficients:
##                                          Estimate Std. Error  t value
## (Intercept)                             6.674e+00  3.760e-03 1775.112
## english$Ncount                         -2.132e-03  4.729e-04   -4.508
## english$AgeSubjectyoung                -2.220e-01  5.317e-03  -41.756
## english$Ncount:english$AgeSubjectyoung  4.946e-05  6.688e-04    0.074
##                                        Pr(>|t|)    
## (Intercept)                             < 2e-16 ***
## english$Ncount                         6.69e-06 ***
## english$AgeSubjectyoung                 < 2e-16 ***
## english$Ncount:english$AgeSubjectyoung    0.941    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1106 on 4564 degrees of freedom
## Multiple R-squared:  0.5036, Adjusted R-squared:  0.5032 
## F-statistic:  1543 on 3 and 4564 DF,  p-value: < 2.2e-16

As expected, the interaction between Neighborhood Density and Age of Subject does not help explain any additional variance (R-squared is still about 50%), and its coefficient is not statistically significant (p = .941), so we cannot reject the null hypothesis that the interaction is zero.
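
The same conclusion follows from a nested-model comparison of the additive and interaction models:

# F-test for the interaction term: compare m2 (additive) to m3 (interaction);
# a large p-value means the interaction adds nothing
anova(m2, m3)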