Import the data from the Dutch Lexicon Project DLP_words.csv. All materials are from here: http://crr.ugent.be/programs-data/lexicon-projects
Variables we are going to use: - rt: Response Latency to the Lexical Decision Task - subtlex.frequency: The frequency of the word from the Dutch Subtitle Project. - length: Length of the word. - POS: part of speech. - bigram.freq: Summed frequency of the bigrams in the word (the sum of each two-letter combination frequency).
library(readr)
DLP_words <- read_csv("C:/Users/Emily/Desktop/540/DLP_words.csv")
## Parsed with column specification:
## cols(
## spelling = col_character(),
## lexicality = col_character(),
## rt = col_double(),
## subtlex.frequency = col_double(),
## length = col_double(),
## POS = col_character(),
## bigram.freq = col_double()
## )
Load all the libraries or functions that you will use to for the rest of the assignment. It is helpful to define your libraries and functions at the top of a report, so that others can know what they need for the report to compile correctly.
library(car)
## Loading required package: carData
library(boot)
##
## Attaching package: 'boot'
## The following object is masked from 'package:car':
##
## logit
Update the part of speech variable so that the Nouns are the comparison category. Here’s what the labels mean:
ADJ - Adjective N - Noun WW - Verbs
DLP_words$POS <- factor(DLP_words$POS, levels = c("N", "ADJ", "WW"), labels = c("Noun", "Adjective", "Verb"))
DLP_words
## # A tibble: 12,026 x 7
## spelling lexicality rt subtlex.frequency length POS bigram.freq
## <chr> <chr> <dbl> <dbl> <dbl> <fct> <dbl>
## 1 aaide W 643. 11 5 Verb 211373
## 2 aaien W 614. 92 5 Verb 317860
## 3 aak W 561. 5 3 Noun 75056
## 4 aal W 669. 35 3 Noun 83336
## 5 aalmoes W 636. 42 7 Noun 166581
## 6 aam W NA 1 3 Adjective 71881
## 7 aambeeld W 691. 22 8 Noun 218551
## 8 aanblik W 614. 96 7 Noun 141520
## 9 aanbod W 568. 1174 6 Noun 139596
## 10 aanbouw W 652. 32 7 Noun 119589
## # ... with 12,016 more rows
Since we are using frequencies, we should consider the non-normality of frequency. - Include a histogram of the original subtlex.frequency column. - Log-transform the subtlex.frequency column. - Include a histogram of bigram.freq - note that it does not appear extremely skewed.
hist(DLP_words$subtlex.frequency, breaks = 100)
## Log-transform the `subtlex.frequency` column
DLP_words$log_subtlex_freq <- log(DLP_words$subtlex.frequency)
## The histogram of `bigram.freq`
hist(DLP_words$bigram.freq, breaks = 100)
See if you can predict response latencies (DV) with the following IVs: subtitle frequency, length, POS, and bigram frequency.
model <- lm(rt ~ subtlex.frequency + length + POS + bigram.freq, data = DLP_words)
summary(model)
##
## Call:
## lm(formula = rt ~ subtlex.frequency + length + POS + bigram.freq,
## data = DLP_words)
##
## Residuals:
## Min 1Q Median 3Q Max
## -258.31 -46.27 -9.90 36.87 651.07
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.894e+02 2.474e+00 238.250 < 2e-16 ***
## subtlex.frequency -4.084e-04 4.834e-05 -8.449 < 2e-16 ***
## length 6.261e+00 4.128e-01 15.166 < 2e-16 ***
## POSAdjective -2.536e+00 1.829e+00 -1.387 0.165526
## POSVerb -8.008e-01 1.420e+00 -0.564 0.572825
## bigram.freq 2.220e-05 6.481e-06 3.425 0.000617 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 62.72 on 12009 degrees of freedom
## (11 observations deleted due to missingness)
## Multiple R-squared: 0.03806, Adjusted R-squared: 0.03766
## F-statistic: 95.03 on 5 and 12009 DF, p-value: < 2.2e-16
a <- summary(model)$coefficients[-1 , 3]
pr <- a / sqrt(a^2 + model$df.residual)
pr^2
## subtlex.frequency length POSAdjective POSVerb
## 5.908998e-03 1.879271e-02 1.601237e-04 2.647912e-05
## bigram.freq
## 9.759460e-04
tapply(DLP_words$rt, DLP_words$POS, mean, na.rm = TRUE)
## Noun Adjective Verb
## 633.1077 628.9087 633.5708
Create an influence plot of the model using the car library. - Which data points appear to have the most influence on the model?
influencePlot(model)
## StudRes Hat CookD
## 4184 7.974493 0.5263562855 11.717236871
## 8552 10.428743 0.0002960303 0.005319831
## 11783 2.157855 0.0704942705 0.058838675
Do we have additivity in our model?
summary(model, correlation= TRUE)$correlation[-1, -1]
## subtlex.frequency length POSAdjective POSVerb
## subtlex.frequency 1.00000000 0.09789214 -0.01679744 -0.09437705
## length 0.09789214 1.00000000 0.02513935 0.09441722
## POSAdjective -0.01679744 0.02513935 1.00000000 0.19608345
## POSVerb -0.09437705 0.09441722 0.19608345 1.00000000
## bigram.freq 0.02157752 -0.43240037 0.01337525 -0.30343414
## bigram.freq
## subtlex.frequency 0.02157752
## length -0.43240037
## POSAdjective 0.01337525
## POSVerb -0.30343414
## bigram.freq 1.00000000
vif(model)
## GVIF Df GVIF^(1/(2*Df))
## subtlex.frequency 1.022716 1 1.011294
## length 1.251375 1 1.118649
## POS 1.121023 2 1.028972
## bigram.freq 1.358407 1 1.165507
Is the model linear? Yes the model is linear - Include a plot and interpret the output.
plot(model, which = 2, pch = 16)
Are the errors normally distributed? Yes, normally distributed. - Include a plot and interpret the output.
hist(scale(residuals(model)))
Do the errors meet the assumptions of homoscedasticity and homogeneity? No the errors do not meet the assumptions of homoscedasticity and homogeneity - Include a plot and interpret the output (either plot option).
plot(model, which = 1)
Use the function provided from class (included below) and the boot library to bootstrap the model you created 1000 times. - Include the estimates of the coefficients from the bootstrapping. - Include the confidence intervals for at least one of the predictors (not the intercept). - Do our estimates appear stable, given the bootstrapping results? The estimates do not appear stable based on the results.
Use the following to randomly sample 500 rows of data - generally, you have to have more bootstraps than rows of data, so this code will speed up your assignment. In the boot function use: data = DF[sample(1:nrow(DF), 500, replace=FALSE),] for the data argument changing DF to the name of your data frame.
bootcoef = function(formula, data, indices){
d = data[indices, ] #randomize the data by row
model = lm(formula, data = d) #run our model
return(coef(model)) #give back coefficients
model_boot <- boot(formula = rt ~ subtlex.frequency + length + POS + bigram.freq,
data = DLP_words[sample(1:nrow(DLP_words), size = 500, replace = FALSE), ],
statistic = bootcoef,
R = 1000)
summary(model_boot)
}