Linear Regression

Load the Libraries + Functions

Load all the libraries or functions that you will use to for the rest of the assignment. It is helpful to define your libraries and functions at the top of a report, so that others can know what they need for the report to compile correctly.

library(readr)
library(Rling)
library(car)
library(boot)

Import the Data

Import the data from the Dutch Lexicon Project DLP_words.csv. All materials are from here: http://crr.ugent.be/programs-data/lexicon-projects

Variables we are going to use: - rt: Response Latency to the Lexical Decision Task - subtlex.frequency: The frequency of the word from the Dutch Subtitle Project. - length: Length of the word. - POS: part of speech. - bigram.freq: Summed frequency of the bigrams in the word (the sum of each two-letter combination frequency).

DLP_words <- read_csv("DLP_words.csv")

Clean Up Part of Speech

Update the part of speech variable so that the Nouns are the comparison category. Here’s what the labels mean:

ADJ - Adjective N - Noun WW - Verbs

table(DLP_words$POS)

## 
##  ADJ    N   WW 
## 1397 7591 3038

DLP_words$POS = factor(DLP_words$POS, levels = c("N", "WW", "ADJ"), labels = c("Noun", "Adjective", "Verbs"))
table(DLP_words$POS)

## 
##      Noun Adjective     Verbs 
##      7591      3038      1397

Deal with Non-Normality

Since we are using frequencies, we should consider the non-normality of frequency. - Include a histogram of the original subtlex.frequency column. - Log-transform the subtlex.frequency column. - Include a histogram of bigram.freq - note that it does not appear extremely skewed.

hist(DLP_words$subtlex.frequency, breaks = 100)

DLP_words$Log_transform = log(DLP_words$subtlex.frequency)
hist(DLP_words$Log_transform)

hist(DLP_words$bigram.freq)

Create Your Linear Model

See if you can predict response latencies (DV) with the following IVs: subtitle frequency, length, POS, and bigram frequency.

linear_model = lm(data = DLP_words, DLP_words$rt  ~ DLP_words$length + DLP_words$Log_transform + DLP_words$POS + DLP_words$bigram.freq)
summary(linear_model)

## 
## Call:
## lm(formula = DLP_words$rt ~ DLP_words$length + DLP_words$Log_transform + 
##     DLP_words$POS + DLP_words$bigram.freq, data = DLP_words)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -348.88  -33.85   -6.19   27.49  616.60 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              7.112e+02  2.520e+00 282.176  < 2e-16 ***
## DLP_words$length        -1.806e+00  3.486e-01  -5.180 2.26e-07 ***
## DLP_words$Log_transform -1.808e+01  2.253e-01 -80.218  < 2e-16 ***
## DLP_words$POSAdjective   1.170e+01  1.157e+00  10.119  < 2e-16 ***
## DLP_words$POSVerbs       7.976e+00  1.486e+00   5.368 8.13e-08 ***
## DLP_words$bigram.freq    3.374e-05  5.246e-06   6.432 1.31e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 50.76 on 12009 degrees of freedom
##   (11 observations deleted due to missingness)
## Multiple R-squared:   0.37,  Adjusted R-squared:  0.3697 
## F-statistic:  1410 on 5 and 12009 DF,  p-value: < 2.2e-16

Interpret Your Model

Coefficients

Which coefficients are statistically significant?
P-values are less than 0.05, hence all coefficients (length, log_transform, pos adjective, pos verbs, bigram frequency ) are significant
What do they suggest predicts response latency? (i.e., give the non-stats interpretation of the coefficients)
Increase in length increases the response latency. The more number of times a word occurs, the less the response time. Adjectives and verbs have difference response latency when compared with noun
Which coefficients appear to predict the most variance? Calculate the \(pr^2\) values below:
Log transform appears to predict most variance

options(scipen = 999)
round(summary(linear_model)$coefficients,3)

##                         Estimate Std. Error t value Pr(>|t|)
## (Intercept)              711.200      2.520 282.176        0
## DLP_words$length          -1.806      0.349  -5.180        0
## DLP_words$Log_transform  -18.077      0.225 -80.218        0
## DLP_words$POSAdjective    11.705      1.157  10.119        0
## DLP_words$POSVerbs         7.976      1.486   5.368        0
## DLP_words$bigram.freq      0.000      0.000   6.432        0

t = summary(linear_model)$coefficients[-1,3]
pr = t / sqrt(t^2 + linear_model$df.residual)
pr^2

##        DLP_words$length DLP_words$Log_transform  DLP_words$POSAdjective 
##             0.002229149             0.348893207             0.008453925 
##      DLP_words$POSVerbs   DLP_words$bigram.freq 
##             0.002393353             0.003432757

What do the dummy coded POS values mean? Calculate the means of each group below to help you interpret:
Adjectives are responded to faster than nouns and verbs

summary(DLP_words)

##    spelling          lexicality              rt         subtlex.frequency
##  Length:12026       Length:12026       Min.   : 358.0   Min.   :     1   
##  Class :character   Class :character   1st Qu.: 585.9   1st Qu.:    16   
##  Mode  :character   Mode  :character   Median : 624.1   Median :    67   
##                                        Mean   : 632.7   Mean   :  1217   
##                                        3rd Qu.: 670.5   3rd Qu.:   306   
##                                        Max.   :1273.0   Max.   :947568   
##                                        NA's   :11                        
##      length              POS        bigram.freq     Log_transform   
##  Min.   : 2.000   Noun     :7591   Min.   :  5026   Min.   : 0.000  
##  1st Qu.: 5.000   Adjective:3038   1st Qu.:127241   1st Qu.: 2.773  
##  Median : 6.000   Verbs    :1397   Median :188577   Median : 4.205  
##  Mean   : 6.351                    Mean   :207476   Mean   : 4.305  
##  3rd Qu.: 7.000                    3rd Qu.:285100   3rd Qu.: 5.724  
##  Max.   :12.000                    Max.   :603499   Max.   :13.762  
##

tapply(DLP_words$rt, DLP_words$POS, mean)

##      Noun Adjective     Verbs 
##        NA  633.5708        NA

confint(linear_model)

##                                   2.5 %          97.5 %
## (Intercept)             706.25942190124 716.14023493008
## DLP_words$length         -2.48916207364  -1.12243081141
## DLP_words$Log_transform -18.51863172506 -17.63520071430
## DLP_words$POSAdjective    9.43745753617  13.97230528326
## DLP_words$POSVerbs        5.06330402122  10.88879557920
## DLP_words$bigram.freq     0.00002345653   0.00004402181

dlpsub <- subset(DLP_words, !(rt == "NA"))

tapply(dlpsub$rt, dlpsub$POS, mean)

##      Noun Adjective     Verbs 
##  633.1077  633.5708  628.9087

summary(linear_model)$fstatistic

##     value     numdf     dendf 
##  1410.291     5.000 12009.000

summary(linear_model)$r.squared

## [1] 0.3699521

Overall Model

Is the overall model statistically significant?
Overall predicted model is R^2 value is 0.3699% , not closer to 1. So not really significant due high variability
What is the practical importance of the overall model?
To be able to predict response latency given the length of the word, adjectives, verbs, frequencies.

Diagnostic Tests

Outliers

Create an influence plot of the model using the car library. - Which data points appear to have the most influence on the model?
Data points closer to zero appears to have most influence on the model. Bats, rechtstreeks, risks, skunk, vrienden the most significant data points

influencePlot(linear_model)

##          StudRes          Hat         CookD
## 358    7.4011293 0.0004432197 0.00403010022
## 7620   0.2091906 0.0022564015 0.00001649549
## 7799  -6.8894657 0.0008468915 0.00667941519
## 8552  12.2246554 0.0003687713 0.00907619766
## 11041 -0.2678347 0.0022063720 0.00002643955

DLP_words[c(358,7620,7799,8552,11041),]

## # A tibble: 5 x 8
##   spelling lexicality    rt subtlex.frequen~ length POS   bigram.freq
##   <chr>    <chr>      <dbl>            <dbl>  <dbl> <fct>       <dbl>
## 1 bats     W          1033                15      4 Noun        95596
## 2 rechtst~ W           605.              567     12 Verbs      338480
## 3 riks     W           358                 1      4 Noun        86017
## 4 skunk    W          1273                14      5 Noun        57488
## 5 vrienden W           529.            14545      8 Noun       578859
## # ... with 1 more variable: Log_transform <dbl>

Additivity

Do we have additivity in our model?
There is not multicolinearity, so additivity

Show that the correlations between predictors is less than .9.
Show the VIF values.

summary(linear_model, correlation = T)$correlation[,-1]

##                         DLP_words$length DLP_words$Log_transform
## (Intercept)                 -0.817549059             -0.61547149
## DLP_words$length             1.000000000              0.30065119
## DLP_words$Log_transform      0.300651186              1.00000000
## DLP_words$POSAdjective       0.054511388             -0.14697353
## DLP_words$POSVerbs          -0.001602351             -0.09036631
## DLP_words$bigram.freq       -0.423780585             -0.02461407
##                         DLP_words$POSAdjective DLP_words$POSVerbs
## (Intercept)                        0.006374922       -0.062985653
## DLP_words$length                   0.054511388       -0.001602351
## DLP_words$Log_transform           -0.146973534       -0.090366305
## DLP_words$POSAdjective             1.000000000        0.205766502
## DLP_words$POSVerbs                 0.205766502        1.000000000
## DLP_words$bigram.freq             -0.295822672        0.015906745
##                         DLP_words$bigram.freq
## (Intercept)                       -0.01692010
## DLP_words$length                  -0.42378059
## DLP_words$Log_transform           -0.02461407
## DLP_words$POSAdjective            -0.29582267
## DLP_words$POSVerbs                 0.01590674
## DLP_words$bigram.freq              1.00000000

vif(linear_model)

##                             GVIF Df GVIF^(1/(2*Df))
## DLP_words$length        1.362545  1        1.167281
## DLP_words$Log_transform 1.143764  1        1.069469
## DLP_words$POS           1.139962  2        1.033291
## DLP_words$bigram.freq   1.358598  1        1.165589

Linearity

Is the model linear?
Model appears to be linear based on plot observation

Include a plot and interpret the output.

plot(linear_model, which=2)

Normality

Are the errors normally distributed?
Errors appear to be normally distributed

Include a plot and interpret the output.

hist(scale(residuals(linear_model)))

Homoscedasticity/Homogeneity

Do the errors meet the assumptions of homoscedasticity and homogeneity? - Include a plot and interpret the output (either plot option).

plot(linear_model, which=1)

{plot(scale(residuals(linear_model)), scale(linear_model$fitted.values))
abline(v=0, h=0)}

Bootstrapping

Use the function provided from class (included below) and the boot library to bootstrap the model you created 1000 times. - Include the estimates of the coefficients from the bootstrapping. - Include the confidence intervals for at least one of the predictors (not the intercept). - Do our estimates appear stable, given the bootstrapping results?

Use the following to randomly sample 500 rows of data - generally, you have to have more bootstraps than rows of data, so this code will speed up your assignment. In the boot function use: data = DF[sample(1:nrow(DF), 500, replace=FALSE),] for the data argument changing DF to the name of your data frame.

bootcoef = function(formula, data, indices){
  d = data[indices, ] #randomize the data by row
  model = lm(formula, data = d) #run our model
  return(coef(model)) #give back coefficients
}

boot_model = lm(data=DLP_words, formula = rt ~ Log_transform + length + POS + bigram.freq, statistic = bootcoef, R = 1000)

boot_model

## 
## Call:
## lm(formula = rt ~ Log_transform + length + POS + bigram.freq, 
##     data = DLP_words, statistic = bootcoef, R = 1000)
## 
## Coefficients:
##   (Intercept)  Log_transform         length   POSAdjective       POSVerbs  
##  711.19982842   -18.07691622    -1.80579644    11.70488141     7.97604980  
##   bigram.freq  
##    0.00003374

Discussion Question

Thinking of your current job or prospective career, propose a research project to address a problem or question in your industry that would use regression analysis. Spend time thinking about this and write roughly a paragraph describing the problem/question and how you would address it with text data and regression analysis.
Text data and regression analysis can be sued to analyze how the sprint planning goes and how the sprint review went.