Load all the libraries or functions that you will use for the rest of the assignment. It is helpful to define your libraries and functions at the top of a report, so that others know what they need for the report to compile correctly.
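The diagnostics and bootstrapping later in this report rely on the car and boot packages, so a minimal setup might look like this (assuming both are installed):

library(car)  # influence plots and VIF
library(boot) # bootstrapping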
Import the data from the Dutch Lexicon Project DLP_words.csv. All materials are from here: http://crr.ugent.be/programs-data/lexicon-projects
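A minimal import, assuming the file sits in the working directory and the data frame is named data (the name used in the code later):

data = read.csv("DLP_words.csv", stringsAsFactors = TRUE) # factors are needed for the summary below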
Variables we are going to use:
rt: Response latency in the lexical decision task.
subtlex.frequency: The frequency of the word from the Dutch Subtitle Project.
length: Length of the word.
POS: Part of speech.
bigram.freq: Summed frequency of the bigrams in the word (the sum of each two-letter combination frequency).
Let’s look at the summary of this dataset:
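The summary shown below can be produced with:

summary(data)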
## spelling lexicality rt subtlex.frequency
## aïs : 1 W:12026 Min. : 358.0 Min. : 1
## aaide : 1 1st Qu.: 585.9 1st Qu.: 16
## aaien : 1 Median : 624.1 Median : 67
## aak : 1 Mean : 632.7 Mean : 1217
## aal : 1 3rd Qu.: 670.5 3rd Qu.: 306
## aalmoes: 1 Max. :1273.0 Max. :947568
## (Other):12020 NA's :11
## length POS bigram.freq
## Min. : 2.000 ADJ:1397 Min. : 5026
## 1st Qu.: 5.000 N :7591 1st Qu.:127241
## Median : 6.000 WW :3038 Median :188577
## Mean : 6.351 Mean :207476
## 3rd Qu.: 7.000 3rd Qu.:285100
## Max. :12.000 Max. :603499
##
We see that there are 11 records where the response latency variable has NA values. Since response latency is one of the most important variables (the dependent variable) in our dataset, let’s remove these 11 records to simplify our analysis.
The code removing these 11 records is shown below.
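A minimal sketch of the removal, since the original code isn’t shown:

data = subset(data, !is.na(rt)) # drop the 11 rows with missing response latencies
nrow(data) # 12,015 rows remain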
Update the part of speech variable so that the Nouns are the comparison category. The label counts are below (ADJ = adjective, N = noun, WW = verb):
##
## ADJ N WW
## 1395 7582 3038
data$POS = factor(data$POS, levels = c("N", "WW", "ADJ"), labels = c("Noun", "Verb", "Adjective"))
table(data$POS)
##
## Noun Verb Adjective
## 7582 3038 1395
We did this because we want to compare adjectives and verbs to nouns. Essentially, we are making nouns our frame of comparison.
Since we are using frequencies, we should consider the non-normality of frequency.
Include a histogram of the original subtlex.frequency column.
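One way to draw it:

hist(data$subtlex.frequency)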
The histogram for subtlex.frequency shows that its distribution follows a power curve. Such extreme skew in a predictor can distort a linear regression fit, so we transform this variable to bring it closer to normal.
Log-transform the subtlex.frequency column.
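A sketch of the transformation; the model output below uses a column named Log_SUB, and a natural log is assumed here since the base isn’t shown:

data$Log_SUB = log(data$subtlex.frequency) # base e is an assumption
hist(data$Log_SUB)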
We use a log transformation and see that the histogram is now much more normal, with a slight right tail.
Include a histogram of bigram.freq - note that it does not appear extremely skewed.
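The same kind of call works here:

hist(data$bigram.freq)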
We also plot the histogram for bigram.freq to assess its normality and see that it is mostly normal with a slight right tail. So no transformation is required for bigram.freq.
See if you can predict response latencies (DV) with the following IVs: subtitle frequency, length, POS, and bigram frequency.
Below is a summary of the model. We see that all the variables are statistically significant in the prediction of response latencies.
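The model can be fit and summarized with a call matching the Call: line in the output:

model = lm(rt ~ Log_SUB + length + POS + bigram.freq, data = data)
summary(model)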
##
## Call:
## lm(formula = rt ~ Log_SUB + length + POS + bigram.freq, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -348.88 -33.85 -6.19 27.49 616.60
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.112e+02 2.520e+00 282.176 < 2e-16 ***
## Log_SUB -1.808e+01 2.253e-01 -80.218 < 2e-16 ***
## length -1.806e+00 3.486e-01 -5.180 2.26e-07 ***
## POSVerb 1.170e+01 1.157e+00 10.119 < 2e-16 ***
## POSAdjective 7.976e+00 1.486e+00 5.368 8.13e-08 ***
## bigram.freq 3.374e-05 5.246e-06 6.432 1.31e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 50.76 on 12009 degrees of freedom
## Multiple R-squared: 0.37, Adjusted R-squared: 0.3697
## F-statistic: 1410 on 5 and 12009 DF, p-value: < 2.2e-16
When we look at the summary of the residuals, we see that the mean of the error terms is zero and the median is also very close to zero.
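This residual summary comes from a call like:

summary(residuals(model))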
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -348.879 -33.850 -6.189 0.000 27.489 616.596
Which coefficients are statistically significant?
As noted above, all of the coefficients are statistically significant. This indicates that word frequency, word length, part of speech, and the summed frequency of the bigrams in the word all significantly predict the response latency of Dutch words.
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 711.200 2.520 282.176 0
## Log_SUB -18.077 0.225 -80.218 0
## length -1.806 0.349 -5.180 0
## POSVerb 11.705 1.157 10.119 0
## POSAdjective 7.976 1.486 5.368 0
## bigram.freq 0.000 0.000 6.432 0
What do they suggest predicts response latency? (i.e., give the non-stats interpretation of the coefficients)
Based on the coefficient estimates, we can say the following about response latencies:
More frequent words are responded to faster: each one-unit increase in log subtitle frequency predicts a response about 18 ms faster.
Longer words are responded to slightly faster: each additional letter predicts a response about 1.8 ms faster.
Verbs are about 12 ms slower than nouns, and adjectives are about 8 ms slower than nouns, holding the other predictors constant.
Words with a higher summed bigram frequency are responded to slightly more slowly, although the effect per unit of bigram frequency is tiny.
Let’s also look at the confidence intervals for the coefficients, which give a range (with 95% confidence) for the change in response latency associated with a one-unit increase in each predictor.
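These come from a call like:

confint(model)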
## 2.5 % 97.5 %
## (Intercept) 706.25942190124 716.14023493008
## Log_SUB -18.51863172506 -17.63520071430
## length -2.48916207363 -1.12243081141
## POSVerb 9.43745753617 13.97230528326
## POSAdjective 5.06330402122 10.88879557920
## bigram.freq 0.00002345653 0.00004402181
Which coefficients appear to predict the most variance? Calculate the \(pr^2\) values below:
We calculate the squared partial correlation (\(pr^2\)) for each of the independent variables to assess their practical significance.
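One way to compute these by hand is from the t-values and residual degrees of freedom, since \(pr^2 = t^2/(t^2 + df_{residual})\); this reproduces the values below:

tvals = summary(model)$coefficients[-1, "t value"] # drop the intercept row
pr2 = tvals^2 / (tvals^2 + model$df.residual)
pr2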
## Log_SUB length POSVerb POSAdjective bigram.freq
## 0.348893207 0.002229149 0.008453925 0.002393353 0.003432757
When we look at the results, we see that word frequency is by far the most practically important predictor of response latencies (\(pr^2\) ≈ .35). Word length contributes very little, unlike in English.
What do the dummy coded POS values mean? Calculate the means of each group below to help you interpret:
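The group means shown below can be computed with:

tapply(data$rt, data$POS, mean)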
## Noun Verb Adjective
## 633.1077 633.5708 628.9087
Based on the summary of the model, the response latencies of adjectives and verbs both appear statistically different from those of nouns. However, when we look at the mean response latencies for each part of speech, they seem more or less similar. The discrepancy arises because the model coefficients are estimated while controlling for the other variables (frequency, length, and bigram frequency); once those are held constant, verbs and adjectives are reliably slower than nouns, a difference that can’t be seen by just looking at raw means.
Is the overall model statistically significant?
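The F-statistic and its degrees of freedom shown below can be extracted with:

summary(model)$fstatistic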
## value numdf dendf
## 1410.291 5.000 12009.000
##
## Call:
## lm(formula = rt ~ Log_SUB + length + POS + bigram.freq, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -348.88 -33.85 -6.19 27.49 616.60
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 711.199828416 2.520407834 282.176 < 0.0000000000000002 ***
## Log_SUB -18.076916220 0.225346480 -80.218 < 0.0000000000000002 ***
## length -1.805796443 0.348627200 -5.180 0.000000225800 ***
## POSVerb 11.704881410 1.156753574 10.119 < 0.0000000000000002 ***
## POSAdjective 7.976049800 1.485972310 5.368 0.000000081293 ***
## bigram.freq 0.000033739 0.000005246 6.432 0.000000000131 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 50.76 on 12009 degrees of freedom
## Multiple R-squared: 0.37, Adjusted R-squared: 0.3697
## F-statistic: 1410 on 5 and 12009 DF, p-value: < 0.00000000000000022
The F-statistic tests how much better the model predicts than an intercept-only model. Its p-value is far below 0.05, indicating that the overall model is statistically significant.
What is the practical importance of the overall model?
To assess the practical significance of the model, we calculate its R-squared value.
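A minimal way to extract it from the fitted model:

summary(model)$r.squared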
## [1] 0.3699521
The R-squared value indicates that 37% of the variation in response latencies is explained by the model. This is a strong value for psychological data, so the model has practical significance.
Create an influence plot of the model using the car library.
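A sketch using the influencePlot function from the car package (assuming the fitted model is named model):

influencePlot(model)

Points are sized by Cook’s distance, so the largest circles are the most influential observations.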
Which data points appear to have the most influence on the model?
The words which have the most influence on the model are below:
Do we have additivity in our model?
Show that the correlations between predictors are less than .9.
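Note that the matrix below includes the intercept, which suggests it is the correlation matrix of the coefficient estimates rather than of the raw predictor columns. One way to produce that form (an assumption, since the original code isn’t shown):

summary(model, correlation = TRUE)$correlation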
## Log_SUB length POSVerb POSAdjective
## (Intercept) -0.61547149 -0.817549059 0.006374922 -0.062985653
## Log_SUB 1.00000000 0.300651186 -0.146973534 -0.090366305
## length 0.30065119 1.000000000 0.054511388 -0.001602351
## POSVerb -0.14697353 0.054511388 1.000000000 0.205766502
## POSAdjective -0.09036631 -0.001602351 0.205766502 1.000000000
## bigram.freq -0.02461407 -0.423780585 -0.295822672 0.015906745
## bigram.freq
## (Intercept) -0.01692010
## Log_SUB -0.02461407
## length -0.42378059
## POSVerb -0.29582267
## POSAdjective 0.01590674
## bigram.freq 1.00000000
We see that none of the correlations between the independent variables are greater than 0.9 or less than -0.9. This indicates that there is no multicollinearity in the model and that we have additivity: each variable adds something different to the model.
Show the VIF values.
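These come from vif in the car package; because POS is a factor with more than one degree of freedom, generalized VIFs (GVIF) are reported:

vif(model)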
## GVIF Df GVIF^(1/(2*Df))
## Log_SUB 1.143764 1 1.069469
## length 1.362545 1 1.167281
## POS 1.139962 2 1.033291
## bigram.freq 1.358598 1 1.165589
Is the model linear?
Include a plot and interpret the output.
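One way to draw a normal probability plot of the standardized residuals (a sketch, since the original plotting code isn’t shown):

qqnorm(rstandard(model))
qqline(rstandard(model))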
The dots lie close to the dotted line between the theoretical quantiles of -2 and +2 (two z-scores on either side of the mean). Toward the theoretical quantile of +2, the dots curve slightly away from the line, which is not unexpected: we cannot predict as well where there is less data. Overall, the plot looks fine and we can say that the model is linear.
Are the errors normally distributed?
Include a plot and interpret the output.
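A sketch of the histogram, using studentized residuals as one reasonable choice of standardization:

standardized = rstudent(model)
hist(standardized, breaks = 15)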
Based on the histogram of the standardized errors, the data appear centered around zero and fall mostly between scaled residuals of -2 and +2. There is a slight skew to the right, but it appears too small to matter. We can conclude that the errors are more or less normally distributed. Also, since this is a very large dataset, normality is not much of an issue thanks to the Central Limit Theorem.
Do the errors meet the assumptions of homoscedasticity and homogeneity?
Include a plot and interpret the output (either plot option).
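A sketch of the standardized fitted values plotted against the standardized residuals, with reference lines at zero:

fitted_z = scale(fitted(model))
standardized = rstudent(model)
plot(fitted_z, standardized)
abline(h = 0)
abline(v = 0)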
Although the reference line here is flat, the residuals are much tighter for smaller predicted response latencies and spread out for higher predicted latencies, forming a triangular shape.
Based on the second plot of standardized predicted values vs. standardized residuals, we see a distinct triangular shape. Here again, the residuals are smaller for lower fitted values than for higher fitted values. Based on both plots, we can say that the errors do not meet the assumptions of homoscedasticity and homogeneity. This is probably because there is another relevant variable that we are not including in the model, for example, something related to pronunciation rules.
Use the function provided from class (included below) and the boot library to bootstrap the model you created 1000 times.
bootcoef = function(formula, data, indices){
  d = data[indices, ] # resample the rows selected by boot
  model = lm(formula, data = d) # refit the model on the resampled data
  return(coef(model)) # return the coefficients
}
Include the estimates of the coefficients from the bootstrapping.
model.boot = boot(formula = rt ~ Log_SUB + length + POS + bigram.freq,
data = data,
statistic = bootcoef,
R = 1000)
model.boot
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = data, statistic = bootcoef, R = 1000, formula = rt ~
## Log_SUB + length + POS + bigram.freq)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 711.19982841566 -0.1560855429433 2.889632016423
## t2* -18.07691621968 0.0052677266541 0.231785001340
## t3* -1.80579644252 0.0183693533719 0.383665288422
## t4* 11.70488140972 -0.0344971341641 1.049103703240
## t5* 7.97604980021 -0.0291743083312 1.502942372724
## t6* 0.00003373917 0.0000001365802 0.000005375022
The “original” coefficients in the bootstrap output match the model’s estimates, and the bias values are all close to zero.
Include the confidence intervals for at least one of the predictors (not the intercept).
The boot.ci call for the confidence interval for length was not run on the full dataset because it throws an error when the number of bootstrap replicates (1000) is smaller than the number of rows (12,015).
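For reference, the unevaluated call would look like this (index 3 corresponds to the length coefficient, as in the sampled version later):

# boot.ci(model.boot, index = 3) # errors on the full data with only 1000 replicates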
This happens because boot.ci calls bca.ci. Since the boot.out object doesn’t supply L, the empirical influence values for the statistic we are calculating, bca.ci tries to compute them with the empinf function and then uses them to calculate the acceleration constant.
With fewer replications than data rows, empinf can fail and return a vector of NA values. The result is that we have no values for L, the acceleration constant a can’t be calculated, and we get an error.
Do our estimates appear stable, given the bootstrapping results?
Our coefficient estimates appear stable (the bootstrap originals match the model and the bias is small), but the standard errors for the part of speech dummies are high relative to their coefficients. Also, to get the confidence intervals we would need more bootstrap replicates than rows (> 12,015), which is computationally intensive, so instead we run the bootstrap on a smaller randomized sample of the data.
Use the following to randomly sample 500 rows of data - generally, you have to have more bootstraps than rows of data, so this code will speed up your assignment. In the boot function use: data = DF[sample(1:nrow(DF), 500, replace=FALSE),] for the data argument changing DF to the name of your data frame.
The coefficients of the bootstrap over the randomized sample of 500 rows are presented below:
model.boot2 = boot(formula = rt ~ Log_SUB + length + POS + bigram.freq,
data = data[sample(1:nrow(data), 500, replace=FALSE),],
statistic = bootcoef,
R = 1000)
model.boot2
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = data[sample(1:nrow(data), 500, replace = FALSE),
## ], statistic = bootcoef, R = 1000, formula = rt ~ Log_SUB +
## length + POS + bigram.freq)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 700.06591950099 -0.4418150429487 14.98026243146
## t2* -17.62953858506 0.0186672148613 1.14417198430
## t3* -1.70068095997 0.0341757975943 2.03135540497
## t4* 18.16635710075 -0.5994692432541 6.69089346135
## t5* 16.18190207503 0.1293699632657 7.01481966429
## t6* 0.00005980857 0.0000008407677 0.00002653149
The coefficients of the variables differ from the original model because this bootstrap is run on a subset of the data. The standard errors are also very high for part of speech and for word length.
The confidence intervals for length are presented below:
## Warning in boot.ci(model.boot2, index = 3): bootstrap variances needed for
## studentized intervals
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 1000 bootstrap replicates
##
## CALL :
## boot.ci(boot.out = model.boot2, index = 3)
##
## Intervals :
## Level Normal Basic
## 95% (-5.716, 2.247 ) (-5.862, 2.062 )
##
## Level Percentile BCa
## 95% (-5.463, 2.461 ) (-5.478, 2.419 )
## Calculations and Intervals on Original Scale
These confidence intervals for length are noticeably wider than the ones calculated earlier on the full model, and they now include zero. But we cannot really conclude that our model is bad or doesn’t represent reality; it all depends on the sampling, that is, on how well these 500 rows represent the entire dataset of 12,015 rows.