Import the Data

Import the data from the Dutch Lexicon Project DLP_words.csv. All materials are from here: http://crr.ugent.be/programs-data/lexicon-projects

Variables we are going to use: - rt: Response Latency to the Lexical Decision Task - subtlex.frequency: The frequency of the word from the Dutch Subtitle Project. - length: Length of the word. - POS: part of speech. - bigram.freq: Summed frequency of the bigrams in the word (the sum of each two-letter combination frequency).

DLP_words = read.csv("/Users/pallavisaitu/Downloads/DLP_words.csv") 
head(DLP_words)
##   spelling lexicality       rt subtlex_frequency length POS bigram_freq
## 1    aaide          W 642.9189                11      5  WW      211373
## 2    aaien          W 613.7778                92      5  WW      317860
## 3      aak          W 560.6667                 5      3   N       75056
## 4      aal          W 669.4815                35      3   N       83336
## 5  aalmoes          W 636.4706                42      7   N      166581
## 6      aam          W       NA                 1      3 ADJ       71881

Load the Libraries + Functions

Load all the libraries or functions that you will use to for the rest of the assignment. It is helpful to define your libraries and functions at the top of a report, so that others can know what they need for the report to compile correctly.

library(car)
## Warning: package 'car' was built under R version 3.5.2
## Loading required package: carData
library(boot)
## Warning: package 'boot' was built under R version 3.5.2
## 
## Attaching package: 'boot'
## The following object is masked from 'package:car':
## 
##     logit

Clean Up Part of Speech

Update the part of speech variable so that the Nouns are the comparison category. Here’s what the labels mean:

ADJ - Adjective N - Noun WW - Verbs

DLP_words$POS = factor(DLP_words$POS, levels = c("ADJ","N","WW"),
                       labels = c("Adjective","Noun","Verbs"))
head(DLP_words)
##   spelling lexicality       rt subtlex_frequency length       POS
## 1    aaide          W 642.9189                11      5     Verbs
## 2    aaien          W 613.7778                92      5     Verbs
## 3      aak          W 560.6667                 5      3      Noun
## 4      aal          W 669.4815                35      3      Noun
## 5  aalmoes          W 636.4706                42      7      Noun
## 6      aam          W       NA                 1      3 Adjective
##   bigram_freq
## 1      211373
## 2      317860
## 3       75056
## 4       83336
## 5      166581
## 6       71881

Deal with Non-Normality

Since we are using frequencies, we should consider the non-normality of frequency. - Include a histogram of the original subtlex.frequency column. - Log-transform the subtlex.frequency column. - Include a histogram of bigram.freq - note that it does not appear extremely skewed.

# histogram of the original subtlex.frequency column
hist(DLP_words$subtlex_frequency, breaks = 50)

# log-transform the subtlex.frequency column
DLP_words$subtlex_frequency_log = log(DLP_words$subtlex_frequency)

# histogram of bigram.freq 
hist(DLP_words$bigram_freq, breaks =50)

Create Your Linear Model

See if you can predict response latencies (DV) with the following IVs: subtitle frequency, length, POS, and bigram frequency.

linear_model = lm(rt ~ subtlex_frequency + length + POS + bigram_freq, data = DLP_words)
summary(linear_model)
## 
## Call:
## lm(formula = rt ~ subtlex_frequency + length + POS + bigram_freq, 
##     data = DLP_words)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -258.31  -46.27   -9.90   36.87  651.07 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        5.868e+02  2.851e+00 205.853  < 2e-16 ***
## subtlex_frequency -4.084e-04  4.834e-05  -8.449  < 2e-16 ***
## length             6.261e+00  4.128e-01  15.166  < 2e-16 ***
## POSNoun            2.536e+00  1.829e+00   1.387 0.165526    
## POSVerbs           1.735e+00  2.084e+00   0.833 0.405000    
## bigram_freq        2.220e-05  6.481e-06   3.425 0.000617 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 62.72 on 12009 degrees of freedom
##   (11 observations deleted due to missingness)
## Multiple R-squared:  0.03806,    Adjusted R-squared:  0.03766 
## F-statistic: 95.03 on 5 and 12009 DF,  p-value: < 2.2e-16

Interpret Your Model

Coefficients

  • Which coefficients are statistically significant? Interpretation: bigram_frequency, subtlex_frequency and length are statistically significat at 95% confidence interval.
  • What do they suggest predicts response latency? (i.e., give the non-stats interpretation of the coefficients) Interpretation: The most variance appears to be predicted by “length” variable.
  • Which coefficients appear to predict the most variance? Calculate the \(pr^2\) values below:
options(scipen = 999)
t <- summary(linear_model)$coefficients[-1 , 3]
pr <- t / sqrt(t^2 + linear_model$df.residual)
pr^2
## subtlex_frequency            length           POSNoun          POSVerbs 
##     0.00590899844     0.01879270997     0.00016012374     0.00005774338 
##       bigram_freq 
##     0.00097594604
  • What do the dummy coded POS values mean? Calculate the means of each group below to help you interpret: Interpretation: The mean of each group below shows that Adjectives are processed and responded the fastest, then Nouns are responded to and Verbs are responded to the last.
tapply(DLP_words$rt, DLP_words$POS, mean, na.rm = TRUE)
## Adjective      Noun     Verbs 
##  628.9087  633.1077  633.5708

Overall Model

  • Is the overall model statistically significant? Interpretation: Looking at the F value: 95.03 oand p-value: < 2.2e-16, we can say that the model is significant at 5%.
  • What is the practical importance of the overall model? Interpretation: R square is 0.038 which doesn’t show a clear importance of the overall model.

Diagnostic Tests

Outliers

Create an influence plot of the model using the car library. - Which data points appear to have the most influence on the model?

influencePlot(linear_model)

##         StudRes          Hat        CookD
## 4184   7.974493 0.5263562855 11.717236871
## 8552  10.428743 0.0002960303  0.005319831
## 11783  2.157855 0.0704942705  0.058838675

Additivity

Do we have additivity in our model? - Show that the correlations between predictors is less than .9. - Show the VIF values.

summary(linear_model, correlation = TRUE)$correlation[-1,-1]
##                   subtlex_frequency      length     POSNoun    POSVerbs
## subtlex_frequency        1.00000000  0.09789214  0.01679744 -0.04957259
## length                   0.09789214  1.00000000 -0.02513935  0.04227927
## POSNoun                  0.01679744 -0.02513935  1.00000000  0.74395844
## POSVerbs                -0.04957259  0.04227927  0.74395844  1.00000000
## bigram_freq              0.02157752 -0.43240037 -0.01337525 -0.21851451
##                   bigram_freq
## subtlex_frequency  0.02157752
## length            -0.43240037
## POSNoun           -0.01337525
## POSVerbs          -0.21851451
## bigram_freq        1.00000000
vif(linear_model)
##                       GVIF Df GVIF^(1/(2*Df))
## subtlex_frequency 1.022716  1        1.011294
## length            1.251375  1        1.118649
## POS               1.121023  2        1.028972
## bigram_freq       1.358407  1        1.165507

Linearity

Is the model linear? - Include a plot and interpret the output.

plot(linear_model, which=2, pch=16)

Interpretation: The Q-Q plot shows the data points along the straight line which means the model is linear. ### Normality

Are the errors normally distributed? - Include a plot and interpret the output.

hist(scale(residuals(linear_model)))

Interpretation: The histogram above shows the model is linear.

Homoscedasticity/Homogeneity

Do the errors meet the assumptions of homoscedasticity and homogeneity? Interpretation:Errors do not meet the assumptions of homoscedasticity and homogeneity. - Include a plot and interpret the output (either plot option).

plot(linear_model, which=1)

Bootstrapping

Use the function provided from class (included below) and the boot library to bootstrap the model you created 1000 times. - Include the estimates of the coefficients from the bootstrapping. - Include the confidence intervals for at least one of the predictors (not the intercept). - Do our estimates appear stable, given the bootstrapping results?

Use the following to randomly sample 500 rows of data - generally, you have to have more bootstraps than rows of data, so this code will speed up your assignment. In the boot function use: data = DF[sample(1:nrow(DF), 500, replace=FALSE),] for the data argument changing DF to the name of your data frame.

bootcoef = function(formula, data, indices){
  d = data[indices, ] #randomize the data by row
  model = lm(formula, data = d) #run our model
  return(coef(model)) #give back coefficients
}

# The estimates of the coefficients from the bootstrapping
model_bootstrapping <- boot(formula = rt ~ subtlex_frequency + length + POS + bigram_freq, 
                    data = DLP_words[sample(1:nrow(DLP_words), size = 500, replace = FALSE), ], 
                    statistic = bootcoef,
                    R = 1000)
model_bootstrapping
## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = DLP_words[sample(1:nrow(DLP_words), size = 500, replace = FALSE), 
##     ], statistic = bootcoef, R = 1000, formula = rt ~ subtlex_frequency + 
##     length + POS + bigram_freq)
## 
## 
## Bootstrap Statistics :
##            original           bias       std. error
## t1* 599.20670861223  0.5446131088721 14.60016287063
## t2*  -0.00297999061 -0.0002924792553  0.00120597114
## t3*   5.18603647974 -0.1179444767634  1.84460950641
## t4*  -3.98990948222  0.2994992798702 10.16550914808
## t5*   4.08581429584  0.3401499410068 11.43343669502
## t6*   0.00001037677  0.0000007121897  0.00002946026

Interpretation: The coefficients of POS and bigram_freq from boostrapping are different from previous estimates linear regression which means the model is not stable.