Load all the libraries or functions that you will use for the rest of the assignment. It is helpful to define your libraries and functions at the top of a report, so that others know what they need for the report to compile correctly.
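The diagnostics and bootstrapping later in this report rely on the car and boot packages, so a minimal setup might look like this (assuming both are installed):

library(car)  # influence plots and VIF
library(boot) # bootstrapping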
Import the data from the Dutch Lexicon Project DLP_words.csv. All materials are from here: http://crr.ugent.be/programs-data/lexicon-projects
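A minimal import, assuming the file sits in the working directory and the data frame is named data (the name used in the code later):

data = read.csv("DLP_words.csv", stringsAsFactors = TRUE) # factors are needed for the summary below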
Variables we are going to use:
rt: Response latency in the lexical decision task.
subtlex.frequency: The frequency of the word from the Dutch Subtitle Project.
length: Length of the word.
POS: Part of speech.
bigram.freq: Summed frequency of the bigrams in the word (the sum of each two-letter combination frequency).
Let’s look at the summary of this dataset:
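The summary shown below can be produced with:

summary(data)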
## spelling lexicality rt subtlex.frequency
## aïs : 1 W:12026 Min. : 358.0 Min. : 1
## aaide : 1 1st Qu.: 585.9 1st Qu.: 16
## aaien : 1 Median : 624.1 Median : 67
## aak : 1 Mean : 632.7 Mean : 1217
## aal : 1 3rd Qu.: 670.5 3rd Qu.: 306
## aalmoes: 1 Max. :1273.0 Max. :947568
## (Other):12020 NA's :11
## length POS bigram.freq
## Min. : 2.000 ADJ:1397 Min. : 5026
## 1st Qu.: 5.000 N :7591 1st Qu.:127241
## Median : 6.000 WW :3038 Median :188577
## Mean : 6.351 Mean :207476
## 3rd Qu.: 7.000 3rd Qu.:285100
## Max. :12.000 Max. :603499
##
We see that there are 11 records where the response latency variable has NA values. Since response latency is one of the most important variables (the dependent variable) in our dataset, let’s remove these 11 records to simplify our analysis.
The code removing these 11 records is shown below.
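A minimal sketch of the removal, since the original code isn’t shown:

data = subset(data, !is.na(rt)) # drop the 11 rows with missing response latencies
nrow(data) # 12,015 rows remain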
Update the part of speech variable so that the Nouns are the comparison category. The label counts are below (ADJ = adjective, N = noun, WW = verb):
##
## ADJ N WW
## 1395 7582 3038
data$POS = factor(data$POS, levels = c("N", "WW", "ADJ"), labels = c("Noun", "Verb", "Adjective"))
table(data$POS)
##
## Noun Verb Adjective
## 7582 3038 1395
We did this because we want to compare adjectives and verbs to nouns. Essentially, we are making nouns our frame of comparison.
Since we are using frequencies, we should consider the non-normality of frequency.
Include a histogram of the original subtlex.frequency column.
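One way to draw it:

hist(data$subtlex.frequency)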
The histogram for subtlex.frequency shows that its distribution follows a power curve. Such extreme skew in a predictor can distort a linear regression fit, so we transform this variable to bring it closer to normal.
Log-transform the subtlex.frequency column.
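A sketch of the transformation; the model output below uses a column named Log_SUB, and a natural log is assumed here since the base isn’t shown:

data$Log_SUB = log(data$subtlex.frequency) # base e is an assumption
hist(data$Log_SUB)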
We use a log transformation and see that the histogram is now much more normal, with a slight right tail.
Include a histogram of bigram.freq - note that it does not appear extremely skewed.
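The same kind of call works here:

hist(data$bigram.freq)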
We also plot the histogram for bigram.freq to assess its normality and see that it is mostly normal with a slight right tail. So no transformation is required for bigram.freq.
See if you can predict response latencies (DV) with the following IVs: subtitle frequency, length, POS, and bigram frequency.
Below is a summary of the model. We see that all the variables are statistically significant in the prediction of response latencies.
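The model can be fit and summarized with a call matching the Call: line in the output:

model = lm(rt ~ Log_SUB + length + POS + bigram.freq, data = data)
summary(model)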
##
## Call:
## lm(formula = rt ~ Log_SUB + length + POS + bigram.freq, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -348.88 -33.85 -6.19 27.49 616.60
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.112e+02 2.520e+00 282.176 < 2e-16 ***
## Log_SUB -1.808e+01 2.253e-01 -80.218 < 2e-16 ***
## length -1.806e+00 3.486e-01 -5.180 2.26e-07 ***
## POSVerb 1.170e+01 1.157e+00 10.119 < 2e-16 ***
## POSAdjective 7.976e+00 1.486e+00 5.368 8.13e-08 ***
## bigram.freq 3.374e-05 5.246e-06 6.432 1.31e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 50.76 on 12009 degrees of freedom
## Multiple R-squared: 0.37, Adjusted R-squared: 0.3697
## F-statistic: 1410 on 5 and 12009 DF, p-value: < 2.2e-16
When we look at the summary of the residuals, we see that the mean of the error terms is zero and the median is also very close to zero.
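This residual summary comes from a call like:

summary(residuals(model))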
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -348.879 -33.850 -6.189 0.000 27.489 616.596
Which coefficients are statistically significant?
As noted above, all of the coefficients are statistically significant. This indicates that word frequency, word length, part of speech, and the summed frequency of the bigrams in the word all significantly predict the response latency of Dutch words.
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 711.200 2.520 282.176 0
## Log_SUB -18.077 0.225 -80.218 0
## length -1.806 0.349 -5.180 0
## POSVerb 11.705 1.157 10.119 0
## POSAdjective 7.976 1.486 5.368 0
## bigram.freq 0.000 0.000 6.432 0
What do they suggest predicts response latency? (i.e., give the non-stats interpretation of the coefficients)
Based on the coefficient estimates, we can say the following about response latencies:
More frequent words are responded to faster: each one-unit increase in log subtitle frequency predicts a response about 18 ms faster.
Longer words are responded to slightly faster: each additional letter predicts a response about 1.8 ms faster.
Verbs are about 12 ms slower than nouns, and adjectives are about 8 ms slower than nouns, holding the other predictors constant.
Words with a higher summed bigram frequency are responded to slightly more slowly, although the effect per unit of bigram frequency is tiny.
Let’s also look at the confidence intervals for the coefficients, which give a range (with 95% confidence) for the change in response latency associated with a one-unit increase in each predictor.
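These come from a call like:

confint(model)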
## 2.5 % 97.5 %
## (Intercept) 706.25942190124 716.14023493008
## Log_SUB -18.51863172506 -17.63520071430
## length -2.48916207363 -1.12243081141
## POSVerb 9.43745753617 13.97230528326
## POSAdjective 5.06330402122 10.88879557920
## bigram.freq 0.00002345653 0.00004402181
Which coefficients appear to predict the most variance? Calculate the \(pr^2\) values below:
We calculate the squared partial correlation (\(pr^2\)) for each of the independent variables to assess their practical significance.
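One way to compute these by hand is from the t-values and residual degrees of freedom, since \(pr^2 = t^2/(t^2 + df_{residual})\); this reproduces the values below:

tvals = summary(model)$coefficients[-1, "t value"] # drop the intercept row
pr2 = tvals^2 / (tvals^2 + model$df.residual)
pr2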
## Log_SUB length POSVerb POSAdjective bigram.freq
## 0.348893207 0.002229149 0.008453925 0.002393353 0.003432757
When we look at the results, we see that word frequency is by far the most practically important predictor of response latencies (\(pr^2\) ≈ .35). Word length contributes very little, unlike in English.
What do the dummy coded POS values mean? Calculate the means of each group below to help you interpret:
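The group means shown below can be computed with:

tapply(data$rt, data$POS, mean)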
## Noun Verb Adjective
## 633.1077 633.5708 628.9087
Based on the summary of the model, the response latencies of adjectives and verbs both appear statistically different from those of nouns. However, when we look at the mean response latencies for each part of speech, they seem more or less similar. The discrepancy arises because the model coefficients are estimated while controlling for the other variables (frequency, length, and bigram frequency); once those are held constant, verbs and adjectives are reliably slower than nouns, a difference that can’t be seen by just looking at raw means.
Is the overall model statistically significant?
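The F-statistic and its degrees of freedom shown below can be extracted with:

summary(model)$fstatistic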
## value numdf dendf
## 1410.291 5.000 12009.000
##
## Call:
## lm(formula = rt ~ Log_SUB + length + POS + bigram.freq, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -348.88 -33.85 -6.19 27.49 616.60
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 711.199828416 2.520407834 282.176 < 0.0000000000000002 ***
## Log_SUB -18.076916220 0.225346480 -80.218 < 0.0000000000000002 ***
## length -1.805796443 0.348627200 -5.180 0.000000225800 ***
## POSVerb 11.704881410 1.156753574 10.119 < 0.0000000000000002 ***
## POSAdjective 7.976049800 1.485972310 5.368 0.000000081293 ***
## bigram.freq 0.000033739 0.000005246 6.432 0.000000000131 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 50.76 on 12009 degrees of freedom
## Multiple R-squared: 0.37, Adjusted R-squared: 0.3697
## F-statistic: 1410 on 5 and 12009 DF, p-value: < 0.00000000000000022
The F-statistic tests how much better the model predicts than an intercept-only model. Its p-value is far below 0.05, indicating that the overall model is statistically significant.
What is the practical importance of the overall model?
To assess the practical significance of the model, we calculate its R-squared value.
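A minimal way to extract it from the fitted model:

summary(model)$r.squared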
## [1] 0.3699521
The R-squared value indicates that 37% of the variation in response latencies is explained by the model. This is a strong value for psychological data, so the model has practical significance.
Create an influence plot of the model using the car library.
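A sketch using the influencePlot function from the car package (assuming the fitted model is named model):

influencePlot(model)

Points are sized by Cook’s distance, so the largest circles are the most influential observations.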
Which data points appear to have the most influence on the model?
The words which have the most influence on the model are below:
Do we have additivity in our model?
Show that the correlations between predictors are less than .9.
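Note that the matrix below includes the intercept, which suggests it is the correlation matrix of the coefficient estimates rather than of the raw predictor columns. One way to produce that form (an assumption, since the original code isn’t shown):

summary(model, correlation = TRUE)$correlation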
## Log_SUB length POSVerb POSAdjective
## (Intercept) -0.61547149 -0.817549059 0.006374922 -0.062985653
## Log_SUB 1.00000000 0.300651186 -0.146973534 -0.090366305
## length 0.30065119 1.000000000 0.054511388 -0.001602351
## POSVerb -0.14697353 0.054511388 1.000000000 0.205766502
## POSAdjective -0.09036631 -0.001602351 0.205766502 1.000000000
## bigram.freq -0.02461407 -0.423780585 -0.295822672 0.015906745
## bigram.freq
## (Intercept) -0.01692010
## Log_SUB -0.02461407
## length -0.42378059
## POSVerb -0.29582267
## POSAdjective 0.01590674
## bigram.freq 1.00000000
We see that none of the correlations between the independent variables are greater than 0.9 or less than -0.9. This indicates that there is no multicollinearity in the model and that we have additivity: each variable adds something different to the model.
Show the VIF values.
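These come from vif in the car package; because POS is a factor with more than one degree of freedom, generalized VIFs (GVIF) are reported:

vif(model)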
## GVIF Df GVIF^(1/(2*Df))
## Log_SUB 1.143764 1 1.069469
## length 1.362545 1 1.167281
## POS 1.139962 2 1.033291
## bigram.freq 1.358598 1 1.165589
Is the model linear?
Include a plot and interpret the output.
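One way to draw a normal probability plot of the standardized residuals (a sketch, since the original plotting code isn’t shown):

qqnorm(rstandard(model))
qqline(rstandard(model))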
The dots lie close to the dotted line between the theoretical quantiles of -2 and +2 (two z-scores on either side of the mean). Toward the theoretical quantile of +2, the dots curve slightly away from the line, which is not unexpected: we cannot predict as well where there is less data. Overall, the plot looks fine and we can say that the model is linear.
Are the errors normally distributed?
Include a plot and interpret the output.
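A sketch of the histogram, using studentized residuals as one reasonable choice of standardization:

standardized = rstudent(model)
hist(standardized, breaks = 15)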
Based on the histogram of the standardized errors, the data appear centered around zero and fall mostly between scaled residuals of -2 and +2. There is a slight skew to the right, but it appears too small to matter. We can conclude that the errors are more or less normally distributed. Also, since this is a very large dataset, normality is not much of an issue thanks to the Central Limit Theorem.
Do the errors meet the assumptions of homoscedasticity and homogeneity?
Include a plot and interpret the output (either plot option).
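A sketch of the standardized fitted values plotted against the standardized residuals, with reference lines at zero:

fitted_z = scale(fitted(model))
standardized = rstudent(model)
plot(fitted_z, standardized)
abline(h = 0)
abline(v = 0)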
Although the reference line here is flat, the residuals are much tighter for smaller predicted response latencies and spread out for higher predicted latencies, forming a triangular shape.
Based on the second plot of standardized predicted values vs. standardized residuals, we see a distinct triangular shape. Here again, the residuals are smaller for lower fitted values than for higher fitted values. Based on both plots, we can say that the errors do not meet the assumptions of homoscedasticity and homogeneity. This is probably because there is another relevant variable that we are not including in the model, for example, something related to pronunciation rules.
Use the function provided from class (included below) and the boot library to bootstrap the model you created 1000 times.
bootcoef = function(formula, data, indices){
  d = data[indices, ] # resample the rows selected by boot
  model = lm(formula, data = d) # refit the model on the resampled data
  return(coef(model)) # return the coefficients
}
Include the estimates of the coefficients from the bootstrapping.
model.boot = boot(formula = rt ~ Log_SUB + length + POS + bigram.freq,
data = data,
statistic = bootcoef,
R = 1000)
model.boot
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = data, statistic = bootcoef, R = 1000, formula = rt ~
## Log_SUB + length + POS + bigram.freq)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 711.19982841566 -0.1560855429433 2.889632016423
## t2* -18.07691621968 0.0052677266541 0.231785001340
## t3* -1.80579644252 0.0183693533719 0.383665288422
## t4* 11.70488140972 -0.0344971341641 1.049103703240
## t5* 7.97604980021 -0.0291743083312 1.502942372724
## t6* 0.00003373917 0.0000001365802 0.000005375022
The “original” coefficients in the bootstrap output match the model’s estimates, and the bias values are all close to zero.
Include the confidence intervals for at least one of the predictors (not the intercept).
The boot.ci call for the confidence interval for length was not run on the full dataset because it throws an error when the number of bootstrap replicates (1000) is smaller than the number of rows (12,015).
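For reference, the unevaluated call would look like this (index 3 corresponds to the length coefficient, as in the sampled version later):

# boot.ci(model.boot, index = 3) # errors on the full data with only 1000 replicates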
This happens because boot.ci calls bca.ci. Since the boot.out object doesn’t supply L, the empirical influence values for the statistic we are calculating, bca.ci tries to compute them with the empinf function and then uses them to calculate the acceleration constant.
With fewer replications than data rows, empinf can fail and return a vector of NA values. The result is that we have no values for L, the acceleration constant a can’t be calculated, and we get an error.
Do our estimates appear stable, given the bootstrapping results?
Our coefficient estimates appear stable (the bootstrap originals match the model and the bias is small), but the standard errors for the part of speech dummies are high relative to their coefficients. Also, to get the confidence intervals we would need more bootstrap replicates than rows (> 12,015), which is computationally intensive, so instead we run the bootstrap on a smaller randomized sample of the data.
Use the following to randomly sample 500 rows of data - generally, you have to have more bootstraps than rows of data, so this code will speed up your assignment. In the boot function use: data = DF[sample(1:nrow(DF), 500, replace=FALSE),] for the data argument changing DF to the name of your data frame.
The coefficients of the bootstrap over the randomized sample of 500 rows are presented below:
model.boot2 = boot(formula = rt ~ Log_SUB + length + POS + bigram.freq,
data = data[sample(1:nrow(data), 500, replace=FALSE),],
statistic = bootcoef,
R = 1000)
model.boot2
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = data[sample(1:nrow(data), 500, replace = FALSE),
## ], statistic = bootcoef, R = 1000, formula = rt ~ Log_SUB +
## length + POS + bigram.freq)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 700.06591950099 -0.4418150429487 14.98026243146
## t2* -17.62953858506 0.0186672148613 1.14417198430
## t3* -1.70068095997 0.0341757975943 2.03135540497
## t4* 18.16635710075 -0.5994692432541 6.69089346135
## t5* 16.18190207503 0.1293699632657 7.01481966429
## t6* 0.00005980857 0.0000008407677 0.00002653149
The coefficients of the variables differ from the original model because this bootstrap is run on a subset of the data. The standard errors are also very high for part of speech and for word length.
The confidence intervals for length are presented below:
## Warning in boot.ci(model.boot2, index = 3): bootstrap variances needed for
## studentized intervals
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 1000 bootstrap replicates
##
## CALL :
## boot.ci(boot.out = model.boot2, index = 3)
##
## Intervals :
## Level Normal Basic
## 95% (-5.716, 2.247 ) (-5.862, 2.062 )
##
## Level Percentile BCa
## 95% (-5.463, 2.461 ) (-5.478, 2.419 )
## Calculations and Intervals on Original Scale
These confidence intervals for length are noticeably wider than the ones calculated earlier on the full model, and they now include zero. But we cannot really conclude that our model is bad or doesn’t represent reality; it all depends on the sampling, that is, on how well these 500 rows represent the entire dataset of 12,015 rows.