Load all the libraries or functions that you will use to for the rest of the assignment. It is helpful to define your libraries and functions at the top of a report, so that others can know what they need for the report to compile correctly.
library(boot)
library(readr)
## Warning: package 'readr' was built under R version 3.5.3
library(car)
## Warning: package 'car' was built under R version 3.5.3
## Loading required package: carData
## Warning: package 'carData' was built under R version 3.5.2
##
## Attaching package: 'car'
## The following object is masked from 'package:boot':
##
## logit
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.5.3
Import the data from the Dutch Lexicon Project DLP_words.csv
. All materials are from here: http://crr.ugent.be/programs-data/lexicon-projects
Variables we are going to use: - rt
: Response Latency to the Lexical Decision Task - subtlex.frequency
: The frequency of the word from the Dutch Subtitle Project. - length
: Length of the word. - POS
: part of speech. - bigram.freq
: Summed frequency of the bigrams in the word (the sum of each two-letter combination frequency).
dataDLP <- read_csv("DLP_words.csv")
## Parsed with column specification:
## cols(
## spelling = col_character(),
## lexicality = col_character(),
## rt = col_double(),
## subtlex.frequency = col_double(),
## length = col_double(),
## POS = col_character(),
## bigram.freq = col_double()
## )
summary(dataDLP)
## spelling lexicality rt subtlex.frequency
## Length:12026 Length:12026 Min. : 358.0 Min. : 1
## Class :character Class :character 1st Qu.: 585.9 1st Qu.: 16
## Mode :character Mode :character Median : 624.1 Median : 67
## Mean : 632.7 Mean : 1217
## 3rd Qu.: 670.5 3rd Qu.: 306
## Max. :1273.0 Max. :947568
## NA's :11
## length POS bigram.freq
## Min. : 2.000 Length:12026 Min. : 5026
## 1st Qu.: 5.000 Class :character 1st Qu.:127241
## Median : 6.000 Mode :character Median :188577
## Mean : 6.351 Mean :207476
## 3rd Qu.: 7.000 3rd Qu.:285100
## Max. :12.000 Max. :603499
##
dataDLP <- na.omit(dataDLP)
Update the part of speech variable so that the Nouns are the comparison category. Here’s what the labels mean:
ADJ - Adjective N - Noun WW - Verbs
dataDLP$POS <- factor(dataDLP$POS, levels = c("N", "ADJ", "WW"), labels = c("Noun", "Adjective", "Verb"))
table(dataDLP$POS)
##
## Noun Adjective Verb
## 7582 1395 3038
Since we are using frequencies, we should consider the non-normality of frequency. - Include a histogram of the original subtlex.frequency
column. - Log-transform the subtlex.frequency
column. - Include a histogram of bigram.freq
- note that it does not appear extremely skewed.
## The histogram of the original `subtlex.frequency` column
hist(dataDLP$subtlex.frequency, breaks = 100)
## Log-transform the `subtlex.frequency` column
dataDLP$log_subtlex_freq <- log(dataDLP$subtlex.frequency)
## The histogram of `bigram.freq`
hist(dataDLP$bigram.freq, breaks = 100)
See if you can predict response latencies (DV) with the following IVs: subtitle frequency, length, POS, and bigram frequency.
modelDLP <- lm(rt ~ subtlex.frequency + length + POS + bigram.freq, data = dataDLP)
summary(modelDLP )
##
## Call:
## lm(formula = rt ~ subtlex.frequency + length + POS + bigram.freq,
## data = dataDLP)
##
## Residuals:
## Min 1Q Median 3Q Max
## -258.31 -46.27 -9.90 36.87 651.07
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.894e+02 2.474e+00 238.250 < 2e-16 ***
## subtlex.frequency -4.084e-04 4.834e-05 -8.449 < 2e-16 ***
## length 6.261e+00 4.128e-01 15.166 < 2e-16 ***
## POSAdjective -2.536e+00 1.829e+00 -1.387 0.165526
## POSVerb -8.008e-01 1.420e+00 -0.564 0.572825
## bigram.freq 2.220e-05 6.481e-06 3.425 0.000617 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 62.72 on 12009 degrees of freedom
## Multiple R-squared: 0.03806, Adjusted R-squared: 0.03766
## F-statistic: 95.03 on 5 and 12009 DF, p-value: < 2.2e-16
“subtlex. frequency”, “length” and “bigram.freq” are the statistically significant coefficients.
Positive coefficients increase the time of processing whereas negative will reduce the processing time.“bigram.freq” is postive stating processing is slower with more bigrams.
The most variance seems to be predicted by length.
options(scipen = 999)
round(summary(modelDLP)$coefficients, 3)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 589.357 2.474 238.250 0.000
## subtlex.frequency 0.000 0.000 -8.449 0.000
## length 6.261 0.413 15.166 0.000
## POSAdjective -2.536 1.829 -1.387 0.166
## POSVerb -0.801 1.420 -0.564 0.573
## bigram.freq 0.000 0.000 3.425 0.001
Nominal variables into regression analysis incroporated using Dummy coding. Nouns responded faster than verb and adjectives respond faster than noun.
tapply(dataDLP$rt, dataDLP$POS, mean, na.rm = TRUE)
## Noun Adjective Verb
## 633.1077 628.9087 633.5708
confint(modelDLP)
## 2.5 % 97.5 %
## (Intercept) 584.508295912703 594.20596994958
## subtlex.frequency -0.000503197495 -0.00031367922
## length 5.451658950622 7.07006821880
## POSAdjective -6.121108514006 1.04858008848
## POSVerb -3.584515634718 1.98285867975
## bigram.freq 0.000009495121 0.00003490433
The overall model seems to be significant with 5% level. p value is close to 0 and model seems to be significant
Model is not clear enough.The R-squared of the overall model is 0.038.
Create an influence plot of the model using the car
library. - Which data points appear to have the most influence on the model?
library(car)
influencePlot(modelDLP)
## StudRes Hat CookD
## 4180 7.974493 0.5263562855 11.717236871
## 8542 10.428743 0.0002960303 0.005319831
## 11772 2.157855 0.0704942705 0.058838675
dataDLP[c(8552, 11783, 4184), ]
## # A tibble: 3 x 8
## spelling lexicality rt subtlex.frequen~ length POS bigram.freq
## <chr> <chr> <dbl> <dbl> <dbl> <fct> <dbl>
## 1 slaan W 541. 4133 5 Verb 210966
## 2 zinde W 799. 3 5 Verb 245166
## 3 ivoor W 598. 58 5 Noun 97604
## # ... with 1 more variable: log_subtlex_freq <dbl>
Do we have additivity in our model? - Show that the correlations between predictors is less than .9. - Show the VIF values.
summary(modelDLP, correlation = T)$correlation[ , -1]
## subtlex.frequency length POSAdjective POSVerb
## (Intercept) -0.12416861 -0.84307473 -0.14782371 -0.09482313
## subtlex.frequency 1.00000000 0.09789214 -0.01679744 -0.09437705
## length 0.09789214 1.00000000 0.02513935 0.09441722
## POSAdjective -0.01679744 0.02513935 1.00000000 0.19608345
## POSVerb -0.09437705 0.09441722 0.19608345 1.00000000
## bigram.freq 0.02157752 -0.43240037 0.01337525 -0.30343414
## bigram.freq
## (Intercept) -0.04305614
## subtlex.frequency 0.02157752
## length -0.43240037
## POSAdjective 0.01337525
## POSVerb -0.30343414
## bigram.freq 1.00000000
vif(modelDLP)
## GVIF Df GVIF^(1/(2*Df))
## subtlex.frequency 1.022716 1 1.011294
## length 1.251375 1 1.118649
## POS 1.121023 2 1.028972
## bigram.freq 1.358407 1 1.165507
Is the model linear?
It is fairly linear.
plot(modelDLP, which = 2)
Are the errors normally distributed?
The errors are normally distributed.
hist(scale(residuals(modelDLP)))
-Do the errors meet the assumptions of homoscedasticity and homogeneity?
No.
plot(modelDLP, which = 1)
{plot(scale(residuals(modelDLP)), scale(modelDLP$fitted.values))
abline(v = 0, h = 0)}
Use the function provided from class (included below) and the boot
library to bootstrap the model you created 1000 times. - Include the estimates of the coefficients from the bootstrapping. - Include the confidence intervals for at least one of the predictors (not the intercept). - Do our estimates appear stable, given the bootstrapping results?
Use the following to randomly sample 500 rows of data - generally, you have to have more bootstraps than rows of data, so this code will speed up your assignment. In the boot
function use: data = DF[sample(1:nrow(DF), 500, replace=FALSE),]
for the data argument changing DF to the name of your data frame.
coefbootstrap = function(formula, data, indices){
d = data[indices, ]
model = lm(formula, data = d)
return(coef(model))
}
modelDLPbootstrap <- boot(formula = rt ~ subtlex.frequency + length + POS + bigram.freq,
data = dataDLP[sample(1:nrow(dataDLP), size = 500, replace = FALSE), ],
statistic = coefbootstrap,
R = 1000)
modelDLPbootstrap
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = dataDLP[sample(1:nrow(dataDLP), size = 500, replace = FALSE),
## ], statistic = coefbootstrap, R = 1000, formula = rt ~ subtlex.frequency +
## length + POS + bigram.freq)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 596.507423985742 1.4681438490759 13.25308740850
## t2* -0.000529769736 -0.0001799386274 0.00073369950
## t3* 6.785880126497 -0.1833479137690 2.10775051166
## t4* 11.215291181869 -0.1167673720855 10.01884429658
## t5* 3.637628464075 0.1794891951984 7.21732171786
## t6* -0.000004125443 -0.0000004589594 0.00003434058
boot.ci(modelDLPbootstrap, index = 2)
## Warning in boot.ci(modelDLPbootstrap, index = 2): bootstrap variances
## needed for studentized intervals
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 1000 bootstrap replicates
##
## CALL :
## boot.ci(boot.out = modelDLPbootstrap, index = 2)
##
## Intervals :
## Level Normal Basic
## 95% (-0.0018, 0.0011 ) (-0.0010, 0.0017 )
##
## Level Percentile BCa
## 95% (-0.0028, -0.0001 ) (-0.0022, 0.0000 )
## Calculations and Intervals on Original Scale