Import the data from the Dutch Lexicon Project DLP_words.csv. All materials are from here: http://crr.ugent.be/programs-data/lexicon-projects
Variables we are going to use: - rt: Response Latency to the Lexical Decision Task - subtlex.frequency: The frequency of the word from the Dutch Subtitle Project. - length: Length of the word. - POS: part of speech. - bigram.freq: Summed frequency of the bigrams in the word (the sum of each two-letter combination frequency).
library(readr)
DLP_words <- read_csv("DLP_words.csv")
## Parsed with column specification:
## cols(
## spelling = col_character(),
## lexicality = col_character(),
## rt = col_double(),
## subtlex.frequency = col_integer(),
## length = col_integer(),
## POS = col_character(),
## bigram.freq = col_integer()
## )
DLP_words
## # A tibble: 12,026 x 7
## spelling lexicality rt subtlex.frequency length POS bigram.freq
## <chr> <chr> <dbl> <int> <int> <chr> <int>
## 1 aaide W 643. 11 5 WW 211373
## 2 aaien W 614. 92 5 WW 317860
## 3 aak W 561. 5 3 N 75056
## 4 aal W 669. 35 3 N 83336
## 5 aalmoes W 636. 42 7 N 166581
## 6 aam W NA 1 3 ADJ 71881
## 7 aambeeld W 691. 22 8 N 218551
## 8 aanblik W 614. 96 7 N 141520
## 9 aanbod W 568. 1174 6 N 139596
## 10 aanbouw W 652. 32 7 N 119589
## # … with 12,016 more rows
Load all the libraries or functions that you will use to for the rest of the assignment. It is helpful to define your libraries and functions at the top of a report, so that others can know what they need for the report to compile correctly.
library(boot)
Update the part of speech variable so that the Nouns are the comparison category. Here’s what the labels mean:
ADJ - Adjective N - Noun WW - Verbs
DLP_words$POS <- factor(DLP_words$POS, levels = c("N", "ADJ", "WW"), labels = c("Noun", "Adjective", "Verb"))
DLP_words
## # A tibble: 12,026 x 7
## spelling lexicality rt subtlex.frequency length POS bigram.freq
## <chr> <chr> <dbl> <int> <int> <fct> <int>
## 1 aaide W 643. 11 5 Verb 211373
## 2 aaien W 614. 92 5 Verb 317860
## 3 aak W 561. 5 3 Noun 75056
## 4 aal W 669. 35 3 Noun 83336
## 5 aalmoes W 636. 42 7 Noun 166581
## 6 aam W NA 1 3 Adjective 71881
## 7 aambeeld W 691. 22 8 Noun 218551
## 8 aanblik W 614. 96 7 Noun 141520
## 9 aanbod W 568. 1174 6 Noun 139596
## 10 aanbouw W 652. 32 7 Noun 119589
## # … with 12,016 more rows
Since we are using frequencies, we should consider the non-normality of frequency. - Include a histogram of the original subtlex.frequency column. - Log-transform the subtlex.frequency column. - Include a histogram of bigram.freq - note that it does not appear extremely skewed.
hist(DLP_words$subtlex.frequency, breaks = 100)
See if you can predict response latencies (DV) with the following IVs: subtitle frequency, length, POS, and bigram frequency.
DLP_words$log_subtlex_freq <- log(DLP_words$subtlex.frequency)
hist(DLP_words$bigram.freq, breaks = 100)
lm_mod <- lm(rt ~ subtlex.frequency + length + POS + bigram.freq, data = DLP_words)
summary(lm_mod)
##
## Call:
## lm(formula = rt ~ subtlex.frequency + length + POS + bigram.freq,
## data = DLP_words)
##
## Residuals:
## Min 1Q Median 3Q Max
## -258.31 -46.27 -9.90 36.87 651.07
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.894e+02 2.474e+00 238.250 < 2e-16 ***
## subtlex.frequency -4.084e-04 4.834e-05 -8.449 < 2e-16 ***
## length 6.261e+00 4.128e-01 15.166 < 2e-16 ***
## POSAdjective -2.536e+00 1.829e+00 -1.387 0.165526
## POSVerb -8.008e-01 1.420e+00 -0.564 0.572825
## bigram.freq 2.220e-05 6.481e-06 3.425 0.000617 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 62.72 on 12009 degrees of freedom
## (11 observations deleted due to missingness)
## Multiple R-squared: 0.03806, Adjusted R-squared: 0.03766
## F-statistic: 95.03 on 5 and 12009 DF, p-value: < 2.2e-16
Which coefficients are statistically significant? Three coefficients are significant, which are sublex.frequency, length and bigram.freq.
What do they suggest predicts response latency? (i.e., give the non-stats interpretation of the coefficients) Since the subtlex.frequency is negative, it means more words leads to process faster. In addition, the longer length means process is slow.
Which coefficients appear to predict the most variance? Calculate the \(pr^2\) values below:
options(scipen = 999)
v <- summary(lm_mod)$coefficients[-1 , 3]
pr <- v / sqrt(v^2 + lm_mod$df.residual)
pr^2
## subtlex.frequency length POSAdjective POSVerb
## 0.00590899844 0.01879270997 0.00016012374 0.00002647912
## bigram.freq
## 0.00097594604
tapply(DLP_words$rt, DLP_words$POS, mean, na.rm = TRUE)
## Noun Adjective Verb
## 633.1077 628.9087 633.5708
confint(lm_mod)
## 2.5 % 97.5 %
## (Intercept) 584.508295912703 594.20596994958
## subtlex.frequency -0.000503197495 -0.00031367922
## length 5.451658950622 7.07006821880
## POSAdjective -6.121108514006 1.04858008848
## POSVerb -3.584515634718 1.98285867975
## bigram.freq 0.000009495121 0.00003490433
Is the overall model statistically significant? Yes, P value is close to 0.
What is the practical importance of the overall model?
Create an influence plot of the model using the car library. - Which data points appear to have the most influence on the model?
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:boot':
##
## logit
influencePlot(lm_mod)
## StudRes Hat CookD
## 4184 7.974493 0.5263562855 11.717236871
## 8552 10.428743 0.0002960303 0.005319831
## 11783 2.157855 0.0704942705 0.058838675
DLP_words[c(8552, 11783, 4184), ]
## # A tibble: 3 x 8
## spelling lexicality rt subtlex.frequen… length POS bigram.freq
## <chr> <chr> <dbl> <int> <int> <fct> <int>
## 1 skunk W 1273 14 5 Noun 57488
## 2 zijn W 604. 348446 4 Verb 122407
## 3 is W 559. 947568 2 Verb 61959
## # … with 1 more variable: log_subtlex_freq <dbl>
Do we have additivity in our model? - Show that the correlations between predictors is less than .9. - Show the VIF values.
summary(lm_mod, correlation = T)$correlation[ , -1]
## subtlex.frequency length POSAdjective POSVerb
## (Intercept) -0.12416861 -0.84307473 -0.14782371 -0.09482313
## subtlex.frequency 1.00000000 0.09789214 -0.01679744 -0.09437705
## length 0.09789214 1.00000000 0.02513935 0.09441722
## POSAdjective -0.01679744 0.02513935 1.00000000 0.19608345
## POSVerb -0.09437705 0.09441722 0.19608345 1.00000000
## bigram.freq 0.02157752 -0.43240037 0.01337525 -0.30343414
## bigram.freq
## (Intercept) -0.04305614
## subtlex.frequency 0.02157752
## length -0.43240037
## POSAdjective 0.01337525
## POSVerb -0.30343414
## bigram.freq 1.00000000
vif(lm_mod)
## GVIF Df GVIF^(1/(2*Df))
## subtlex.frequency 1.022716 1 1.011294
## length 1.251375 1 1.118649
## POS 1.121023 2 1.028972
## bigram.freq 1.358407 1 1.165507
Is the model linear? - Include a plot and interpret the output.
plot(lm_mod, which = 2, pch = 16)
Are the errors normally distributed? - Include a plot and interpret the output. Yes, the histogram shows the data is normally distributed.
hist(scale(residuals(lm_mod)))
Do the errors meet the assumptions of homoscedasticity and homogeneity? - Include a plot and interpret the output (either plot option). No..
plot(lm_mod, which = 1)
Use the function provided from class (included below) and the boot library to bootstrap the model you created 1000 times. - Include the estimates of the coefficients from the bootstrapping. - Include the confidence intervals for at least one of the predictors (not the intercept). - Do our estimates appear stable, given the bootstrapping results?
Use the following to randomly sample 500 rows of data - generally, you have to have more bootstraps than rows of data, so this code will speed up your assignment. In the boot function use: data = DF[sample(1:nrow(DF), 500, replace=FALSE),] for the data argument changing DF to the name of your data frame.
bootcoef = function(formula, data, indices){
d = data[indices, ] #randomize the data by row
model = lm(formula, data = d) #run our model
return(coef(model)) #give back coefficients
}
lm_mod_boot <- boot(formula = rt ~ subtlex.frequency + length + POS + bigram.freq,
data = DLP_words[sample(1:nrow(DLP_words), size = 500, replace = FALSE), ],
statistic = bootcoef,
R = 1000)
lm_mod_boot
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = DLP_words[sample(1:nrow(DLP_words), size = 500, replace = FALSE),
## ], statistic = bootcoef, R = 1000, formula = rt ~ subtlex.frequency +
## length + POS + bigram.freq)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 604.064537227117 0.2879418325631 13.82927371060
## t2* -0.000760064672 -0.0003190676290 0.00110867166
## t3* 4.737223660103 -0.0247546435201 2.16572299505
## t4* -0.199336748392 0.8411336489761 10.29709522343
## t5* 3.890311685732 0.1240061727525 6.69258053453
## t6* 0.000003810574 0.0000001614324 0.00003867773
boot.ci(lm_mod_boot, index = 2)
## Warning in boot.ci(lm_mod_boot, index = 2): bootstrap variances needed for
## studentized intervals
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 1000 bootstrap replicates
##
## CALL :
## boot.ci(boot.out = lm_mod_boot, index = 2)
##
## Intervals :
## Level Normal Basic
## 95% (-0.0026, 0.0017 ) (-0.0011, 0.0017 )
##
## Level Percentile BCa
## 95% (-0.0033, -0.0004 ) (-0.0025, -0.0004 )
## Calculations and Intervals on Original Scale