Import the data from the Dutch Lexicon Project (DLP_words.csv). All materials are available here: http://crr.ugent.be/programs-data/lexicon-projects
Variables we are going to use:

- rt: response latency on the lexical decision task
- subtlex.frequency: the frequency of the word from the Dutch Subtitle Project
- length: length of the word
- POS: part of speech
- bigram.freq: summed frequency of the bigrams in the word (the sum of each two-letter combination's frequency)
DLP <- read.csv("DLP_words.csv") #import the data
DLP <- na.omit(DLP) #drop rows with missing values
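A quick optional check (not part of the assignment output) that the columns we will use imported as expected:

str(DLP[ , c("rt", "subtlex.frequency", "length", "POS", "bigram.freq")]) #column types and a preview of values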
Load all the libraries or functions that you will use for the rest of the assignment. It is helpful to define your libraries and functions at the top of a report, so that others know what they need for the report to compile correctly.
library(car)
## Loading required package: carData
library(carData)
Update the part of speech variable so that the Nouns are the comparison category. Here’s what the labels mean:
- ADJ - Adjective
- N - Noun
- WW - Verb
table(DLP$POS)
##
## ADJ N WW
## 1395 7582 3038
DLP$POS = factor(DLP$POS, #the column you want to update
                 #the values in the data in the order you want
                 levels = c("N", "ADJ", "WW"),
                 #give them better labels if you want
                 labels = c("Noun", "Adjective", "Verb"))
table(DLP$POS)
##
## Noun Adjective Verb
## 7582 1395 3038
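As an aside, if you only needed to change the comparison category (keeping the existing labels), relevel() would be an equivalent one-liner - a sketch, assuming the factor already exists:

DLP$POS = relevel(DLP$POS, ref = "Noun") #Noun is already the first level here, so this is a no-op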
Since we are using frequencies, we should consider the non-normality of frequency.

- Include a histogram of the original subtlex.frequency column.
- Log-transform the subtlex.frequency column.
- Include a histogram of bigram.freq - note that it does not appear extremely skewed.
hist(DLP$subtlex.frequency, breaks = 100) #raw frequencies are heavily right skewed
DLP$Log_SUB = log(DLP$subtlex.frequency) #log transform to reduce the skew
hist(DLP$Log_SUB) #check the distribution after the transform
hist(DLP$bigram.freq) #not extremely skewed, so left as is
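One caution (an assumption on my part, not something this dataset necessarily requires): if any subtlex.frequency values were 0, log() would return -Inf. A common guard is log1p(), which computes log(x + 1); the Log_SUB_alt name is purely illustrative:

DLP$Log_SUB_alt = log1p(DLP$subtlex.frequency) #log(x + 1) stays finite at x = 0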
See if you can predict response latencies (DV) with the following IVs: subtitle frequency, length, POS, and bigram frequency.
model = lm(rt ~ Log_SUB + length + POS + bigram.freq,
           data = DLP)
What do they suggest predicts response latency? (i.e., give the non-stats interpretation of the coefficients) Subtitle frequency and length are negative predictors - words that are more frequent or longer are responded to faster. POS and bigram frequency are positive predictors - adjectives and verbs are responded to more slowly than nouns, and words with higher summed bigram frequency are slower.
Which coefficients appear to predict the most variance? Calculate the \(pr^2\) values below: Log_SUB (subtitle frequency) predicts the most variance by a wide margin (\(pr^2 \approx .35\)); every other predictor accounts for less than 1% of the variance.
summary(model)
##
## Call:
## lm(formula = rt ~ Log_SUB + length + POS + bigram.freq, data = DLP)
##
## Residuals:
## Min 1Q Median 3Q Max
## -348.88 -33.85 -6.19 27.49 616.60
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.112e+02 2.520e+00 282.176 < 2e-16 ***
## Log_SUB -1.808e+01 2.253e-01 -80.218 < 2e-16 ***
## length -1.806e+00 3.486e-01 -5.180 2.26e-07 ***
## POSAdjective 7.976e+00 1.486e+00 5.368 8.13e-08 ***
## POSVerb 1.170e+01 1.157e+00 10.119 < 2e-16 ***
## bigram.freq 3.374e-05 5.246e-06 6.432 1.31e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 50.76 on 12009 degrees of freedom
## Multiple R-squared: 0.37, Adjusted R-squared: 0.3697
## F-statistic: 1410 on 5 and 12009 DF, p-value: < 2.2e-16
summary(model$residuals)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -348.879 -33.850 -6.189 0.000 27.489 616.596
tapply(DLP$rt, #dv
       DLP$POS, #iv group variable
       mean) #function
## Noun Adjective Verb
## 633.1077 628.9087 633.5708
summary(model)$fstatistic
## value numdf dendf
## 1410.291 5.000 12009.000
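For reference, the overall p-value can be recovered from these values with pf(), taking the upper tail of the F distribution:

pf(1410.291, df1 = 5, df2 = 12009, lower.tail = FALSE) #effectively 0, consistent with p < 2.2e-16 above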
Yes - the overall model is significant, \(F(5, 12009) = 1410.29, p < .001\). - What is the practical importance of the overall model? The model explains about 37% of the variance in response latencies (\(R^2 = .37\)).
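The \(pr^2\) values below are squared partial correlations, computed for each predictor from its t value and the residual degrees of freedom:

\[ pr = \frac{t}{\sqrt{t^2 + df_{residual}}} \]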
t = summary(model)$coefficients[-1 , 3] #t values, dropping the intercept
pr = t / sqrt(t^2 + model$df.residual) #partial correlations
pr^2 #squared partial correlations
## Log_SUB length POSAdjective POSVerb bigram.freq
## 0.348893207 0.002229149 0.002393353 0.008453925 0.003432757
Create an influence plot of the model using the car library.

- Which data points appear to have the most influence on the model?
influencePlot(model)
## StudRes Hat CookD
## 358 7.4011293 0.0004432197 4.030100e-03
## 7620 0.2091906 0.0022564015 1.649549e-05
## 7799 -6.8894657 0.0008468915 6.679415e-03
## 8552 12.2246554 0.0003687713 9.076198e-03
## 11041 -0.2678347 0.0022063720 2.643955e-05
Rows 358, 7799, and 8552 appear most influential: they pair large studentized residuals with the largest Cook's distances, and row 8552 (StudRes = 12.22, CookD = .009) stands out the most.
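A possible follow-up, not required by the assignment, is to count how many rows exceed the conventional Cook's distance cutoff of 4/n - a sketch under that rule of thumb:

cutoff = 4 / nrow(DLP) #rule-of-thumb cutoff
influential = which(cooks.distance(model) > cutoff) #rows past the cutoff
length(influential) #how many potentially influential points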
Do we have additivity in our model? Yes - as shown below, no correlation between predictors approaches .9 and all VIF values are close to 1.

- Show that the correlations between predictors are less than .9.
- Show the VIF values.
summary(model, correlation = T)$correlation[ , -1]
## Log_SUB length POSAdjective POSVerb
## (Intercept) -0.61547149 -0.817549059 -0.062985653 0.006374922
## Log_SUB 1.00000000 0.300651186 -0.090366305 -0.146973534
## length 0.30065119 1.000000000 -0.001602351 0.054511388
## POSAdjective -0.09036631 -0.001602351 1.000000000 0.205766502
## POSVerb -0.14697353 0.054511388 0.205766502 1.000000000
## bigram.freq -0.02461407 -0.423780585 0.015906745 -0.295822672
## bigram.freq
## (Intercept) -0.01692010
## Log_SUB -0.02461407
## length -0.42378059
## POSAdjective 0.01590674
## POSVerb -0.29582267
## bigram.freq 1.00000000
vif(model)
## GVIF Df GVIF^(1/(2*Df))
## Log_SUB 1.143764 1 1.069469
## length 1.362545 1 1.167281
## POS 1.139962 2 1.033291
## bigram.freq 1.358598 1 1.165589
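An alternative way to check the predictor correlations directly is cor() on the numeric predictors (POS, being a factor, is left out of this sketch):

cor(DLP[ , c("Log_SUB", "length", "bigram.freq")]) #all pairwise correlations should be well below .9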
Is the model linear? - Include a plot and interpret the output. Yes - most points on the normal Q-Q plot fall along the diagonal (with some deviation in the upper tail), so the linearity assumption appears reasonably met.
plot(model, which = 2)
Are the errors normally distributed? - Include a plot and interpret the output. Mostly - the standardized residuals are centered on zero, but the distribution is somewhat right skewed (note the residual summary above: max 616.60 versus min -348.88), so normality is only approximately met.
hist(scale(residuals(model)))
Do the errors meet the assumptions of homoscedasticity and homogeneity? - Include a plot and interpret the output (either plot option).
plot(model, which = 1)
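The prompt allows either plot option; the scale-location plot is the other common choice - a roughly flat trend line suggests homoscedasticity:

plot(model, which = 3) #scale-location: sqrt(|standardized residuals|) vs. fitted values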
Use the function provided from class (included below) and the boot library to bootstrap the model you created 1000 times.

- Include the estimates of the coefficients from the bootstrapping.
- Include the confidence intervals for at least one of the predictors (not the intercept).
- Do our estimates appear stable, given the bootstrapping results?
Use the following to randomly sample 500 rows of data - generally, you need more bootstraps than rows of data, so this code will speed up your assignment. In the boot function, use data = DF[sample(1:nrow(DF), 500, replace = FALSE), ] for the data argument, changing DF to the name of your data frame; a sketch of the full call appears after the function below.
bootcoef = function(formula, data, indices){
  d = data[indices, ] #randomize the data by row
  model = lm(formula, data = d) #run our model
  return(coef(model)) #give back coefficients
}
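A minimal sketch of the bootstrap call itself, under the assignment's setup (1000 replicates on a 500-row subsample, percentile intervals for Log_SUB; the model.boot name is illustrative):

library(boot)
model.boot = boot(data = DLP[sample(1:nrow(DLP), 500, replace = FALSE), ],
                  statistic = bootcoef, #the function defined above
                  R = 1000, #number of bootstrap replicates
                  formula = rt ~ Log_SUB + length + POS + bigram.freq)
model.boot #original vs. bootstrapped coefficient estimates
boot.ci(model.boot, index = 2, type = "perc") #percentile CI for Log_SUB (coefficient 2)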