09/06/19Import the data from the Dutch Lexicon Project DLP_words.csv. All materials are from here: http://crr.ugent.be/programs-data/lexicon-projects
Variables we are going to use: - rt: Response Latency to the Lexical Decision Task - subtlex.frequency: The frequency of the word from the Dutch Subtitle Project. - length: Length of the word. - POS: part of speech. - bigram.freq: Summed frequency of the bigrams in the word (the sum of each two-letter combination frequency).
DLP_words <- readr::read_csv("DLP_words.csv")
## Parsed with column specification:
## cols(
## spelling = col_character(),
## lexicality = col_character(),
## rt = col_double(),
## subtlex.frequency = col_double(),
## length = col_double(),
## POS = col_character(),
## bigram.freq = col_double()
## )
Load all the libraries or functions that you will use to for the rest of the assignment. It is helpful to define your libraries and functions at the top of a report, so that others can know what they need for the report to compile correctly.
library(car)
## Loading required package: carData
library(readr)
library(boot)
##
## Attaching package: 'boot'
## The following object is masked from 'package:car':
##
## logit
Update the part of speech variable so that the Nouns are the comparison category. Here’s what the labels mean:
ADJ - Adjective N - Noun WW - Verbs
DLP_words$POS <- factor(DLP_words$POS, levels = c("N", "ADJ", "WW"), labels = c("Noun", "Adjective", "Verb"))
DLP_words
## # A tibble: 12,026 x 7
## spelling lexicality rt subtlex.frequency length POS bigram.freq
## <chr> <chr> <dbl> <dbl> <dbl> <fct> <dbl>
## 1 aaide W 643. 11 5 Verb 211373
## 2 aaien W 614. 92 5 Verb 317860
## 3 aak W 561. 5 3 Noun 75056
## 4 aal W 669. 35 3 Noun 83336
## 5 aalmoes W 636. 42 7 Noun 166581
## 6 aam W NA 1 3 Adjective 71881
## 7 aambeeld W 691. 22 8 Noun 218551
## 8 aanblik W 614. 96 7 Noun 141520
## 9 aanbod W 568. 1174 6 Noun 139596
## 10 aanbouw W 652. 32 7 Noun 119589
## # ... with 12,016 more rows
Since we are using frequencies, we should consider the non-normality of frequency. - Include a histogram of the original subtlex.frequency column. - Log-transform the subtlex.frequency column. - Include a histogram of bigram.freq - note that it does not appear extremely skewed.
## The histogram of the original `subtlex.frequency` column
hist(DLP_words$subtlex.frequency, breaks = 100)
## Log-transform the `subtlex.frequency` column
DLP_words$log_subtlex_freq <- log(DLP_words$subtlex.frequency)
## The histogram of `bigram.freq`
hist(DLP_words$bigram.freq, breaks = 100)
See if you can predict response latencies (DV) with the following IVs: subtitle frequency, length, POS, and bigram frequency.
lm_mod <- lm(rt ~ subtlex.frequency + length + POS + bigram.freq, data = DLP_words)
summary(lm_mod)
##
## Call:
## lm(formula = rt ~ subtlex.frequency + length + POS + bigram.freq,
## data = DLP_words)
##
## Residuals:
## Min 1Q Median 3Q Max
## -258.31 -46.27 -9.90 36.87 651.07
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.894e+02 2.474e+00 238.250 < 2e-16 ***
## subtlex.frequency -4.084e-04 4.834e-05 -8.449 < 2e-16 ***
## length 6.261e+00 4.128e-01 15.166 < 2e-16 ***
## POSAdjective -2.536e+00 1.829e+00 -1.387 0.165526
## POSVerb -8.008e-01 1.420e+00 -0.564 0.572825
## bigram.freq 2.220e-05 6.481e-06 3.425 0.000617 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 62.72 on 12009 degrees of freedom
## (11 observations deleted due to missingness)
## Multiple R-squared: 0.03806, Adjusted R-squared: 0.03766
## F-statistic: 95.03 on 5 and 12009 DF, p-value: < 2.2e-16
What do they suggest predicts response latency? (i.e., give the non-stats interpretation of the coefficients)
The coefficent on variable subtlex.frequency is negative, therefore frequent words are faster The coeffcient on length is positive: Thereforet longer words take longer to react The coefficient on the variable POS is not significant: Hence Adjectives, Verbs and Nouns have the same response latency The coeffcient on bigram.freq is positive: which implies that more frequent bigrams are slower
options(scipen = 999)
t <- summary(lm_mod)$coefficients[-1 , 3]
pr <- t / sqrt(t^2 + lm_mod$df.residual)
pr^2
## subtlex.frequency length POSAdjective POSVerb
## 0.00590899844 0.01879270997 0.00016012374 0.00002647912
## bigram.freq
## 0.00097594604
tapply(DLP_words$rt, DLP_words$POS, mean, na.rm = TRUE)
## Noun Adjective Verb
## 633.1077 628.9087 633.5708
What is the practical importance of the overall model? The R-squared is 0.038
Create an influence plot of the model using the car library. - Which data points appear to have the most influence on the model?
influencePlot(lm_mod)
## StudRes Hat CookD
## 4184 7.974493 0.5263562855 11.717236871
## 8552 10.428743 0.0002960303 0.005319831
## 11783 2.157855 0.0704942705 0.058838675
Do we have additivity in our model? - Show that the correlations between predictors is less than .9. - Show the VIF values.
## correlation
summary(lm_mod, correlation= TRUE)$correlation[-1, -1]
## subtlex.frequency length POSAdjective POSVerb
## subtlex.frequency 1.00000000 0.09789214 -0.01679744 -0.09437705
## length 0.09789214 1.00000000 0.02513935 0.09441722
## POSAdjective -0.01679744 0.02513935 1.00000000 0.19608345
## POSVerb -0.09437705 0.09441722 0.19608345 1.00000000
## bigram.freq 0.02157752 -0.43240037 0.01337525 -0.30343414
## bigram.freq
## subtlex.frequency 0.02157752
## length -0.43240037
## POSAdjective 0.01337525
## POSVerb -0.30343414
## bigram.freq 1.00000000
## VIF
vif(lm_mod)
## GVIF Df GVIF^(1/(2*Df))
## subtlex.frequency 1.022716 1 1.011294
## length 1.251375 1 1.118649
## POS 1.121023 2 1.028972
## bigram.freq 1.358407 1 1.165507
Is the model linear? - Include a plot and interpret the output.
plot(lm_mod, which = 2, pch = 16)
Are the errors normally distributed? - Include a plot and interpret the output.
hist(scale(residuals(lm_mod)))
Do the errors meet the assumptions of homoscedasticity and homogeneity? - Include a plot and interpret the output (either plot option).
plot(lm_mod, which = 1)
Use the function provided from class (included below) and the boot library to bootstrap the model you created 1000 times. - Include the estimates of the coefficients from the bootstrapping. - Include the confidence intervals for at least one of the predictors (not the intercept). - Do our estimates appear stable, given the bootstrapping results?
Use the following to randomly sample 500 rows of data - generally, you have to have more bootstraps than rows of data, so this code will speed up your assignment. In the boot function use: data = DF[sample(1:nrow(DF), 500, replace=FALSE),] for the data argument changing DF to the name of your data frame.
bootcoef = function(formula, data, indices){
d = data[indices, ] #randomize the data by row
model = lm(formula, data = d) #run our model
return(coef(model)) #give back coefficients
}
# The estimates of the coefficients from the bootstrapping
lm_mod_boot <- boot(formula = rt ~ subtlex.frequency + length + POS + bigram.freq,
data = DLP_words[sample(1:nrow(DLP_words), size = 500, replace = FALSE), ],
statistic = bootcoef,
R = 1000)
lm_mod_boot
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = DLP_words[sample(1:nrow(DLP_words), size = 500, replace = FALSE),
## ], statistic = bootcoef, R = 1000, formula = rt ~ subtlex.frequency +
## length + POS + bigram.freq)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 591.21979418172 0.7635156646024 10.27444850290
## t2* -0.00349408177 -0.0006581106600 0.00171904484
## t3* 5.13021312371 -0.0866320837204 1.75365845974
## t4* 0.79332710091 0.0524178813671 8.62253008182
## t5* 9.62914463832 0.2501683854199 6.29971240592
## t6* 0.00004920488 0.0000009384283 0.00003167061