Load all the libraries or functions that you will use to for the rest of the assignment. It is helpful to define your libraries and functions at the top of a report, so that others can know what they need for the report to compile correctly.
library(readr)
library(Rling)
library(car)
library(boot)
Import the data from the Dutch Lexicon Project DLP_words.csv. All materials are from here: http://crr.ugent.be/programs-data/lexicon-projects
Variables we are going to use: - rt: Response Latency to the Lexical Decision Task - subtlex.frequency: The frequency of the word from the Dutch Subtitle Project. - length: Length of the word. - POS: part of speech. - bigram.freq: Summed frequency of the bigrams in the word (the sum of each two-letter combination frequency).
DLP_words <- read_csv("DLP_words.csv")
Update the part of speech variable so that the Nouns are the comparison category. Here’s what the labels mean:
ADJ - Adjective N - Noun WW - Verbs
table(DLP_words$POS)
##
## ADJ N WW
## 1397 7591 3038
DLP_words$POS = factor(DLP_words$POS, levels = c("N", "WW", "ADJ"), labels = c("Noun", "Adjective", "Verbs"))
table(DLP_words$POS)
##
## Noun Adjective Verbs
## 7591 3038 1397
Since we are using frequencies, we should consider the non-normality of frequency. - Include a histogram of the original subtlex.frequency column. - Log-transform the subtlex.frequency column. - Include a histogram of bigram.freq - note that it does not appear extremely skewed.
hist(DLP_words$subtlex.frequency, breaks = 100)
DLP_words$Log_transform = log(DLP_words$subtlex.frequency)
hist(DLP_words$Log_transform)
hist(DLP_words$bigram.freq)
See if you can predict response latencies (DV) with the following IVs: subtitle frequency, length, POS, and bigram frequency.
linear_model = lm(data = DLP_words, DLP_words$rt ~ DLP_words$length + DLP_words$Log_transform + DLP_words$POS + DLP_words$bigram.freq)
summary(linear_model)
##
## Call:
## lm(formula = DLP_words$rt ~ DLP_words$length + DLP_words$Log_transform +
## DLP_words$POS + DLP_words$bigram.freq, data = DLP_words)
##
## Residuals:
## Min 1Q Median 3Q Max
## -348.88 -33.85 -6.19 27.49 616.60
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.112e+02 2.520e+00 282.176 < 2e-16 ***
## DLP_words$length -1.806e+00 3.486e-01 -5.180 2.26e-07 ***
## DLP_words$Log_transform -1.808e+01 2.253e-01 -80.218 < 2e-16 ***
## DLP_words$POSAdjective 1.170e+01 1.157e+00 10.119 < 2e-16 ***
## DLP_words$POSVerbs 7.976e+00 1.486e+00 5.368 8.13e-08 ***
## DLP_words$bigram.freq 3.374e-05 5.246e-06 6.432 1.31e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 50.76 on 12009 degrees of freedom
## (11 observations deleted due to missingness)
## Multiple R-squared: 0.37, Adjusted R-squared: 0.3697
## F-statistic: 1410 on 5 and 12009 DF, p-value: < 2.2e-16
Which coefficients are statistically significant?
P-values are less than 0.05, hence all coefficients (length, log_transform, pos adjective, pos verbs, bigram frequency ) are significant
What do they suggest predicts response latency? (i.e., give the non-stats interpretation of the coefficients)
Increase in length increases the response latency. The more number of times a word occurs, the less the response time. Adjectives and verbs have difference response latency when compared with noun
Which coefficients appear to predict the most variance? Calculate the \(pr^2\) values below:
Log transform appears to predict most variance
options(scipen = 999)
round(summary(linear_model)$coefficients,3)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 711.200 2.520 282.176 0
## DLP_words$length -1.806 0.349 -5.180 0
## DLP_words$Log_transform -18.077 0.225 -80.218 0
## DLP_words$POSAdjective 11.705 1.157 10.119 0
## DLP_words$POSVerbs 7.976 1.486 5.368 0
## DLP_words$bigram.freq 0.000 0.000 6.432 0
t = summary(linear_model)$coefficients[-1,3]
pr = t / sqrt(t^2 + linear_model$df.residual)
pr^2
## DLP_words$length DLP_words$Log_transform DLP_words$POSAdjective
## 0.002229149 0.348893207 0.008453925
## DLP_words$POSVerbs DLP_words$bigram.freq
## 0.002393353 0.003432757
summary(DLP_words)
## spelling lexicality rt subtlex.frequency
## Length:12026 Length:12026 Min. : 358.0 Min. : 1
## Class :character Class :character 1st Qu.: 585.9 1st Qu.: 16
## Mode :character Mode :character Median : 624.1 Median : 67
## Mean : 632.7 Mean : 1217
## 3rd Qu.: 670.5 3rd Qu.: 306
## Max. :1273.0 Max. :947568
## NA's :11
## length POS bigram.freq Log_transform
## Min. : 2.000 Noun :7591 Min. : 5026 Min. : 0.000
## 1st Qu.: 5.000 Adjective:3038 1st Qu.:127241 1st Qu.: 2.773
## Median : 6.000 Verbs :1397 Median :188577 Median : 4.205
## Mean : 6.351 Mean :207476 Mean : 4.305
## 3rd Qu.: 7.000 3rd Qu.:285100 3rd Qu.: 5.724
## Max. :12.000 Max. :603499 Max. :13.762
##
tapply(DLP_words$rt, DLP_words$POS, mean)
## Noun Adjective Verbs
## NA 633.5708 NA
confint(linear_model)
## 2.5 % 97.5 %
## (Intercept) 706.25942190124 716.14023493008
## DLP_words$length -2.48916207364 -1.12243081141
## DLP_words$Log_transform -18.51863172506 -17.63520071430
## DLP_words$POSAdjective 9.43745753617 13.97230528326
## DLP_words$POSVerbs 5.06330402122 10.88879557920
## DLP_words$bigram.freq 0.00002345653 0.00004402181
dlpsub <- subset(DLP_words, !(rt == "NA"))
tapply(dlpsub$rt, dlpsub$POS, mean)
## Noun Adjective Verbs
## 633.1077 633.5708 628.9087
summary(linear_model)$fstatistic
## value numdf dendf
## 1410.291 5.000 12009.000
summary(linear_model)$r.squared
## [1] 0.3699521
Is the overall model statistically significant?
Overall predicted model is R^2 value is 0.3699% , not closer to 1. So not really significant due high variability
What is the practical importance of the overall model?
To be able to predict response latency given the length of the word, adjectives, verbs, frequencies.
Create an influence plot of the model using the car library. - Which data points appear to have the most influence on the model?
Data points closer to zero appears to have most influence on the model. Bats, rechtstreeks, risks, skunk, vrienden the most significant data points
influencePlot(linear_model)
## StudRes Hat CookD
## 358 7.4011293 0.0004432197 0.00403010022
## 7620 0.2091906 0.0022564015 0.00001649549
## 7799 -6.8894657 0.0008468915 0.00667941519
## 8552 12.2246554 0.0003687713 0.00907619766
## 11041 -0.2678347 0.0022063720 0.00002643955
DLP_words[c(358,7620,7799,8552,11041),]
## # A tibble: 5 x 8
## spelling lexicality rt subtlex.frequen~ length POS bigram.freq
## <chr> <chr> <dbl> <dbl> <dbl> <fct> <dbl>
## 1 bats W 1033 15 4 Noun 95596
## 2 rechtst~ W 605. 567 12 Verbs 338480
## 3 riks W 358 1 4 Noun 86017
## 4 skunk W 1273 14 5 Noun 57488
## 5 vrienden W 529. 14545 8 Noun 578859
## # ... with 1 more variable: Log_transform <dbl>
Do we have additivity in our model?
There is not multicolinearity, so additivity
summary(linear_model, correlation = T)$correlation[,-1]
## DLP_words$length DLP_words$Log_transform
## (Intercept) -0.817549059 -0.61547149
## DLP_words$length 1.000000000 0.30065119
## DLP_words$Log_transform 0.300651186 1.00000000
## DLP_words$POSAdjective 0.054511388 -0.14697353
## DLP_words$POSVerbs -0.001602351 -0.09036631
## DLP_words$bigram.freq -0.423780585 -0.02461407
## DLP_words$POSAdjective DLP_words$POSVerbs
## (Intercept) 0.006374922 -0.062985653
## DLP_words$length 0.054511388 -0.001602351
## DLP_words$Log_transform -0.146973534 -0.090366305
## DLP_words$POSAdjective 1.000000000 0.205766502
## DLP_words$POSVerbs 0.205766502 1.000000000
## DLP_words$bigram.freq -0.295822672 0.015906745
## DLP_words$bigram.freq
## (Intercept) -0.01692010
## DLP_words$length -0.42378059
## DLP_words$Log_transform -0.02461407
## DLP_words$POSAdjective -0.29582267
## DLP_words$POSVerbs 0.01590674
## DLP_words$bigram.freq 1.00000000
vif(linear_model)
## GVIF Df GVIF^(1/(2*Df))
## DLP_words$length 1.362545 1 1.167281
## DLP_words$Log_transform 1.143764 1 1.069469
## DLP_words$POS 1.139962 2 1.033291
## DLP_words$bigram.freq 1.358598 1 1.165589
Is the model linear?
Model appears to be linear based on plot observation
plot(linear_model, which=2)
Are the errors normally distributed?
Errors appear to be normally distributed
hist(scale(residuals(linear_model)))
Do the errors meet the assumptions of homoscedasticity and homogeneity? - Include a plot and interpret the output (either plot option).
plot(linear_model, which=1)
{plot(scale(residuals(linear_model)), scale(linear_model$fitted.values))
abline(v=0, h=0)}
Use the function provided from class (included below) and the boot library to bootstrap the model you created 1000 times. - Include the estimates of the coefficients from the bootstrapping. - Include the confidence intervals for at least one of the predictors (not the intercept). - Do our estimates appear stable, given the bootstrapping results?
Use the following to randomly sample 500 rows of data - generally, you have to have more bootstraps than rows of data, so this code will speed up your assignment. In the boot function use: data = DF[sample(1:nrow(DF), 500, replace=FALSE),] for the data argument changing DF to the name of your data frame.
bootcoef = function(formula, data, indices){
d = data[indices, ] #randomize the data by row
model = lm(formula, data = d) #run our model
return(coef(model)) #give back coefficients
}
boot_model = lm(data=DLP_words, formula = rt ~ Log_transform + length + POS + bigram.freq, statistic = bootcoef, R = 1000)
boot_model
##
## Call:
## lm(formula = rt ~ Log_transform + length + POS + bigram.freq,
## data = DLP_words, statistic = bootcoef, R = 1000)
##
## Coefficients:
## (Intercept) Log_transform length POSAdjective POSVerbs
## 711.19982842 -18.07691622 -1.80579644 11.70488141 7.97604980
## bigram.freq
## 0.00003374