Q1)
#Data set comparisons
Ns = seq(from=2, to=10)
#Number of pairwise combinations of n items taken r at a time (order doesn't matter)
combo <- function(n, r=2) {
  return(factorial(n)/(factorial(n-r)*factorial(r)))
}
#Matrix to hold results
summary.matrix.1 = matrix(nrow=9, ncol=3)
#Main loop
for(n in Ns) {
  #Draw n uniform "p-values" and check whether any fall below .05
  sim = function(i) {
    simple.sim = runif(n, min=0, max=1)
    return(any(simple.sim < .05))
  }
  sim.data = sapply(1:1000, FUN=sim)
  summary.matrix.1[n-1, ] = c(n, combo(n), mean(sim.data))
}
#rename our columns
colnames(summary.matrix.1) = c("n.data.sets",
"n.pwise.comparisons", "error.rate")
summary.matrix.1
## n.data.sets n.pwise.comparisons error.rate
## [1,] 2 1 0.106
## [2,] 3 3 0.154
## [3,] 4 6 0.187
## [4,] 5 10 0.237
## [5,] 6 15 0.271
## [6,] 7 21 0.291
## [7,] 8 28 0.335
## [8,] 9 36 0.359
## [9,] 10 45 0.401
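As a quick sanity check (not part of the original write-up), these simulated error rates should track the analytic value 1 - .95^n, the chance that at least one of n independent uniform draws falls below .05, and the pairwise counts can be reproduced with R's built-in choose():
#analytic family-wise error rate for n independent tests at alpha = .05
round(1 - .95^Ns, 3)
#built-in count of pairwise combinations; should match combo(Ns)
choose(Ns, 2)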
Q2)
summary.matrix.2 = matrix(nrow=9, ncol=3)
#Loop through number of data sets
for(n in Ns) {
  ###------Our sim function-------###
  sim = function(k) {
    #Our current n data sets, each with 15 samples, mean=3, sd=1
    data.sets = sapply(1:n, FUN=function(i){rnorm(15, 3, 1)})
    all.p.values = c()
    #Assess all possible pairwise comparisons
    for(i in 1:(n-1)) {
      for(j in (i+1):n) {
        #current pairwise comparison
        p.value = t.test(data.sets[,i], data.sets[,j])$p.value
        #append to vector of p-values
        all.p.values = c(all.p.values, p.value)
      }
    }
    #return TRUE if any p-value is lower than .05
    return(any(all.p.values < .05))
  }
  ###------Our sim function-------###
  #run simulation 1000 times at current n
  sims = sapply(1:1000, FUN=sim)
  #error rate for this simulation
  error.rate = mean(sims)
  entry = c(n, combo(n), error.rate)
  summary.matrix.2[n-1, ] = entry
}
#rename our columns
colnames(summary.matrix.2) = c("n.data.set",
"n.pwise.comparisons", "error.rate")
summary.matrix.2
## n.data.set n.pwise.comparisons error.rate
## [1,] 2 1 0.052
## [2,] 3 3 0.120
## [3,] 4 6 0.217
## [4,] 5 10 0.306
## [5,] 6 15 0.391
## [6,] 7 21 0.477
## [7,] 8 28 0.482
## [8,] 9 36 0.558
## [9,] 10 45 0.585
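As an aside (not in the original solution), the nested loops inside sim could be replaced with combn(), which enumerates the index pairs directly; a sketch of the equivalent inner computation, assuming the same n and data.sets as above:
#each pair of column indices is passed to FUN as a vector idx
all.p.values = combn(n, 2, FUN=function(idx) {
  t.test(data.sets[, idx[1]], data.sets[, idx[2]])$p.value
})
any(all.p.values < .05)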
#plot our error rates
par(mfrow=c(1,2))
barplot(summary.matrix.1[,"error.rate"], names.arg=Ns,
main="Error rates\n(runif(n,0,1))\nby data sets",
xlab="n.data sets", ylab="error.rate", col="purple", ylim=c(0,.7))
barplot(summary.matrix.2[,"error.rate"], names.arg=Ns,
main="Error rates\n(rnorm(15,3,1))\nby data sets",
xlab="n.data sets", ylab="error.rate", col="blue", ylim=c(0,.7))
#Error rates and plot
par(mfrow=c(1,1))
summary.matrix.1[,3]
## [1] 0.106 0.154 0.187 0.237 0.271 0.291 0.335 0.359 0.401
summary.matrix.2[,3]
## [1] 0.052 0.120 0.217 0.306 0.391 0.477 0.482 0.558 0.585
plot(summary.matrix.1[,3], summary.matrix.2[,3],
     xlab="Q1 runif()", ylab="Q2 rnorm()",
     main="Error rates by distribution source", pch=16,
     xlim=c(0,.7), ylim=c(0,.7))
#plotting y = x for slope reference
abline(0, 1, lty=2, col="red")
Q3)
I’m not completely sure I understand this question, but here’s my takeaway regarding using runif() to model the probability of committing a Type I error vs. generating data sets from the same underlying distribution: in both cases the probability of a Type I error and the number of pairwise comparisons are positively related; the more comparisons we make, the more likely we are to commit a Type I error. However, when we actually pull data from the same underlying normal distribution, as in Q2, the probability of committing a Type I error increases more quickly. In Q1, by contrast, we model the probability of committing a Type I error by chance when our alpha is set to .05. This is troubling because it means that in situations in which the null hypothesis is in fact true (Q2, where the underlying distribution is the same), we’re even more likely to commit a Type I error. This is due in part to the fact that when we use runif() we assume complete independence between tests, while in our data-set-generating context (Q2) this assumption does not hold (one of the data sets may happen to be drastically different, generating lower p-values in all of its comparisons). This, of course, won’t be the case if we pull a fresh number from runif() each time. In both cases, however, we could address some of these issues by performing a Bonferroni correction or otherwise adjusting our alphas (as in notes from 2/9).
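For reference, R’s p.adjust() implements the Bonferroni correction mentioned above; a minimal sketch, assuming a vector all.p.values like the one built inside the Q2 simulation:
#Bonferroni: scale each p-value by the number of comparisons (capped at 1)
adjusted = p.adjust(all.p.values, method="bonferroni")
any(adjusted < .05)
#equivalently, compare the raw p-values against alpha divided by the number of tests
any(all.p.values < .05/length(all.p.values))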
Q4) Plot lexical decision time against neighborhood density.
library(languageR)
with(english, plot(Ncount, RTlexdec,
                   main="RT against neighborhood density",
                   col="blue", pch=20))
Q5)
#Our linear model
m = lm(RTlexdec ~ Ncount, data=english)
summary(m)
##
## Call:
## lm(formula = RTlexdec ~ Ncount, data = english)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.35180 -0.12503 0.00113 0.10205 0.63715
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.5633025 0.0037644 1743.528 < 2e-16 ***
## Ncount -0.0021075 0.0004735 -4.451 8.76e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1566 on 4566 degrees of freedom
## Multiple R-squared: 0.00432, Adjusted R-squared: 0.004102
## F-statistic: 19.81 on 1 and 4566 DF, p-value: 8.755e-06
with(english, plot(Ncount, RTlexdec,
                   main="RT against neighborhood density",
                   col="blue", pch=20))
abline(m, col="red")
–> We can interpret the intercept of 6.5633025 as the predicted RT for a neighborhood density of 0.
–> Our Ncount coefficient, which captures the slope of our regression line in this model, means that as we increase the neighborhood density by one word, the predicted reaction time decreases by 0.0021075.
–> According to our summary statistics, both of these estimates are statistically significant, with a probability of committing a Type I error far lower than .001. Intuitively, this is the probability of observing estimates as extreme as these if we assume the null hypothesis is true. For our intercept and slope estimates, the null hypothesis is that no relationship exists between Ncount and RTlexdec, so these estimates would be zero (with any variation in the data due to noise).
–> Our R-squared was fairly low (0.00432) when I ran this. This is the proportion of variance in our dependent variable (RT) explained by our model m, and serves as an assessment of goodness of fit.
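(For reference, the R-squared cited above can be pulled straight from the model summary:)
#proportion of variance in RTlexdec explained by the Ncount model
summary(m)$r.squared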
Q6)
coef(m)
## (Intercept) Ncount
## 6.563302539 -0.002107512
with(english, plot(Ncount, RTlexdec,
                   main="RT against neighborhood density",
                   col="blue", pch=20))
abline(a=coef(m)[1], b=coef(m)[2], col="red")
We talked briefly in class about what is essentially the least-effective-while-not-fully-arbitrary model: one in which we ignore information about the IV and simply try to minimize error by predicting the mean of our DV, i.e., a horizontal line with intercept = mean of the DV and slope = 0. While that model is obviously weak, the mean values of our IV and DV are still important. Using our model, which incorporates Ncount, I plot the mean values of both the IV and DV. These lines intersect at a point lying on our regression line, which is a nice demonstration of how the model works: the least-squares line always passes through (mean(Ncount), mean(RTlexdec)).
with(english, plot(Ncount, RTlexdec,
                   main="RT against neighborhood density",
                   col="blue", pch=20))
abline(a=coef(m)[1], b=coef(m)[2], col="red")
abline(v=mean(english$Ncount), col="green", lty=2)
abline(h=mean(english$RTlexdec), col="green", lty=2)
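As a quick numeric check of this claim (not in the original write-up; output omitted), the fitted value at the mean of Ncount should equal the mean of RTlexdec:
#the least-squares line passes through (mean(x), mean(y))
coef(m)[1] + coef(m)[2]*mean(english$Ncount)
mean(english$RTlexdec)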
Q7) Plotting our residuals
par(mfrow=c(1,2))
plot(resid(m), main="Model residuals")
#this looks strange; as a sanity check, let's plot RTlexdec by index
#to see whether we get the same pattern (we should, since our model
#isn't doing a lot of heavy lifting)
plot(english$RTlexdec, main="RTlexdec by Index")
par(mfrow=c(1,2))
barplot(as.numeric(factor(english$AgeSubject)), main="Age factors")
plot(resid(m), main="Model residuals")
We’re getting this strange plot because the data set contains grouped subsets of young and old subjects, so when we plot by index we see those groups reflected. This makes sense, as mean RT for younger subjects tends to be lower than that of older subjects:
young.indices = which(english[,"AgeSubject"] == "young")
old.indices = which(english[,"AgeSubject"] == "old")
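To back this up, we could compare the group means directly (a quick check, output omitted):
#mean RT within each age group; younger subjects should come out lower
mean(english$RTlexdec[young.indices])
mean(english$RTlexdec[old.indices])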
Q8)
Let’s adjust our model to incorporate information about age:
m.Ncount.Age = lm(RTlexdec ~ Ncount + AgeSubject, data=english)
par(mfrow=c(1,2))
plot(resid(m), main="Model residuals Ncount", col="purple")
plot(resid(m.Ncount.Age), main="Model residuals Ncount+AgeSubject",
     col="blue")
summary(m)
##
## Call:
## lm(formula = RTlexdec ~ Ncount, data = english)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.35180 -0.12503 0.00113 0.10205 0.63715
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.5633025 0.0037644 1743.528 < 2e-16 ***
## Ncount -0.0021075 0.0004735 -4.451 8.76e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1566 on 4566 degrees of freedom
## Multiple R-squared: 0.00432, Adjusted R-squared: 0.004102
## F-statistic: 19.81 on 1 and 4566 DF, p-value: 8.755e-06
summary(m.Ncount.Age)
##
## Call:
## lm(formula = RTlexdec ~ Ncount + AgeSubject, data = english)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.27097 -0.08315 -0.01641 0.06948 0.52629
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.6741633 0.0031216 2138.068 < 2e-16 ***
## Ncount -0.0021075 0.0003344 -6.303 3.2e-10 ***
## AgeSubjectyoung -0.2217215 0.0032725 -67.754 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1106 on 4565 degrees of freedom
## Multiple R-squared: 0.5035, Adjusted R-squared: 0.5033
## F-statistic: 2315 on 2 and 4565 DF, p-value: < 2.2e-16
Our r-squared value improves dramatically when we add the critical age information to our model.
Model without age variable: 0.0043199
Model with age and Ncount: 0.5035499
ANOVA tables for our two models:
anova(m)
## Analysis of Variance Table
##
## Response: RTlexdec
## Df Sum Sq Mean Sq F value Pr(>F)
## Ncount 1 0.486 0.48580 19.811 8.755e-06 ***
## Residuals 4566 111.970 0.02452
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(m.Ncount.Age)["Sum Sq"][2,]
## [1] 56.14118
Our second model does a much better job of accounting for variance in the data set. Additionally, we can see that AgeSubject is doing the majority of the heavy lifting, accounting for a much larger portion of the variance (captured by the Sum Sq values), 56.1411838, than Ncount, 0.4858022.
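As a sketch (not part of the original answer; output omitted), the two models could also be compared directly with a single anova() call, which tests whether adding AgeSubject significantly reduces the residual sum of squares:
#direct nested-model comparison: m vs. m.Ncount.Age
anova(m, m.Ncount.Age)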
Q9) Adding an interaction between AgeSubject and Ncount is unlikely to improve our model, as these measures are fundamentally unrelated. Since we know this, it wouldn’t make sense to take the correlation between these variables, but we can still do so for fun, converting AgeSubject into a numeric vector (1=old, 2=young, following the alphabetical ordering of the factor levels).
with(english, cor(Ncount, as.numeric(AgeSubject)))
## [1] 0
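The exact-zero correlation suggests the design is balanced: the same items appear to contribute observations to both age groups, so Ncount would be distributed identically across groups. A quick check of that assumption (a sketch, output omitted):
#mean Ncount by age group and group sizes; identical values would confirm the balance
with(english, tapply(Ncount, AgeSubject, mean))
table(english$AgeSubject)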
Q10) Adding the interaction to our model:
m.Ncount.Age.Interaction = lm(RTlexdec ~ Ncount * AgeSubject, data=english)
summary(m.Ncount.Age.Interaction)$fstatistic[1]
## value
## 1543.094
summary(m.Ncount.Age)$fstatistic[1]
## value
## 2315.143
These summary statistics suggest that the interaction between our independent variables is basically non-existent. Including the interaction in our model actually lowers our F-statistic, as we would expect when adding a term that explains essentially no additional variance:
F-statistic without interaction: 2315.1427634
F-statistic with interaction: 1543.0940796
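To confirm this more directly (a sketch, not part of the original answer; output omitted), we could test the interaction term itself, either through a nested-model comparison or by inspecting its coefficient:
#does the interaction term explain any additional variance?
anova(m.Ncount.Age, m.Ncount.Age.Interaction)
#coefficient table for the interaction model, including the interaction row
summary(m.Ncount.Age.Interaction)$coefficients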