Let’s work with the data set with the results of a psycholinguistic experiment dedicated to lexical decision and word naming. The brief description of the experiment is the following. In studies of lexical decision paticipants (also called subjects) are asked to decide whether the word shown on the screen is a real word or not. In other words, whether a word exists in the language or it is just an artificical word created using grammatical rules. In studies of word naming participants are asked to read a word shown aloud. Then the reaction time in miliseconds is measured: how fast a person clicks on the button word or non-word (lexical decision) or pronouns a word (word naming).

Load data on English words:

eng <- read.csv("http://math-info.hse.ru/f/2018-19/ling-data/english.csv")

Some of the variables:

To see the complete description of this dataset (original one, we use a modified version here, for example, we work with reaction time in ms, not with its logarithm), you can install the library languageR and get the documentation for english:

?english

To make things simple we can choose variables for analysis and take a smaller sample — only young people.

library(tidyverse)
young <- eng %>% filter(AgeSubject == "young", WordCategory == "N")
small <- young %>% select(RTlexdec, WrittenFrequency, 
                          LengthInLetters, FamilySize, 
                          NumberSimplexSynsets)

Now we will work with the data frame small. We want to create a model that will explain how the reaction time in lexical decision depends (in statistical terms, not in cause-and-effect terms) on several features, namely word frequency, word length in letters, morphological family size and number of synonyms (synsect).

Before proceeding to models, usually it is really helpful to visualise the relationships between variables of interest. We can use the library GGally we discussed before.

library(GGally)
ggpairs(small)

At the first sight, all associations are logical: the more frequent is a word, the lower is the reaction time (people react quickly on frequent words), the longer is a word, the higher the reaction time and so on.

Now let’s create a multiple linear regression model. To do so we use the same function lm() as for a bivariate linear model:

model <-  lm(data=small, RTlexdec ~ WrittenFrequency + LengthInLetters + FamilySize + NumberSimplexSynsets)
summary(model)
## 
## Call:
## lm(formula = RTlexdec ~ WrittenFrequency + LengthInLetters + 
##     FamilySize + NumberSimplexSynsets, data = small)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -221.162  -36.239   -5.461   33.623  270.033 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           753.474      8.614  87.469  < 2e-16 ***
## WrittenFrequency      -20.870      1.098 -19.012  < 2e-16 ***
## LengthInLetters         3.706      1.708   2.170 0.030200 *  
## FamilySize             -9.156      2.374  -3.857 0.000120 ***
## NumberSimplexSynsets   -9.598      2.867  -3.348 0.000836 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 53.27 on 1447 degrees of freedom
## Multiple R-squared:  0.4358, Adjusted R-squared:  0.4342 
## F-statistic: 279.4 on 4 and 1447 DF,  p-value: < 2.2e-16

Interpretation:

\[ H_0: \beta_1 = \beta_2 = \dots = \beta_k = 0 \text{ (model is not better than the model with the intercept only)} \]

\[ H_1: \text{at least one } \beta_i \ne 0 \text{ (model is better than the model with the intercept only)} \] Informally, here we test the hypothesis that a model constructed is needed (not useless). Here p-value is close to 0, so we reject \(H_0\) and conclude that at least one coefficient is statistically different from zero, model is worth considering. Frankly speaking, seldom can we construct a model that is completely useless, so where p-value is too high.

We can extract some elements of this output separately, for example, coefficients, residuals or fitted values (values of reaction time predicted by our model).

model$coefficients
##          (Intercept)     WrittenFrequency      LengthInLetters 
##           753.474472           -20.870361             3.705659 
##           FamilySize NumberSimplexSynsets 
##            -9.156051            -9.597713
model$residuals[0:10]
##         1         2         3         4         5         6         7 
##  31.28950 -53.99426 -54.07484 -18.62364  10.81357  46.15871 -23.41527 
##         8         9        10 
## -34.62099 109.13492 -76.78837
model$fitted.values[0:10]
##        1        2        3        4        5        6        7        8 
## 663.6005 654.3943 601.3448 635.2236 622.2664 640.5913 607.8153 561.4410 
##        9       10 
## 632.3451 613.1684

To check the fit of our model we can look at several graphs of residuals of this model. First, let us plot graphs independent variable vs residuals for every independent variable in our model. For convenience we can save residuals as a separate column in small dataframe.

small <- small %>% mutate(res = model$residuals)
ggplot(data = small, aes(x = WrittenFrequency, y = res)) + geom_point()

ggplot(data = small, aes(x = LengthInLetters, y = res)) + geom_point()

ggplot(data = small, aes(x = FamilySize, y = res)) + geom_point()

ggplot(data = small, aes(x = NumberSimplexSynsets, y = res)) + geom_point()

Interestingly, in most these graphs we can see a slight non-linear pattern of points, so, it can serve as a sign that our model might have problems with specification. However, usually it is sensible to consider substantial knowledge about the variables we study (based on processes known, previous research. etc.)

Let’s add more residuals plots. To do this, we need the library ggfortify.

install.packages("ggfortify")
library(ggfortify)
# which - which graphs to use (we chose two first)
autoplot(model, which = 1:2)

Interpretation:

Now let us export the regression results into a document so as to get a regression table we usually see in academic papers or projects. By default R works with LaTeX code, however, if you work with Word/Libre Office/Open Office only, you can export the results into a doc-file as well. To do this you need the stargazer library. This name is derived from the expression “gaze at stars”, so stars for significance that we have seen in the regression output.

install.packages("stargazer")

We add type=html so as not to get LaTeX code and specify the name of the file.

library(stargazer)
stargazer(model, type = "html", out = "my model.doc")
## 
## <table style="text-align:center"><tr><td colspan="2" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left"></td><td><em>Dependent variable:</em></td></tr>
## <tr><td></td><td colspan="1" style="border-bottom: 1px solid black"></td></tr>
## <tr><td style="text-align:left"></td><td>RTlexdec</td></tr>
## <tr><td colspan="2" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left">WrittenFrequency</td><td>-20.870<sup>***</sup></td></tr>
## <tr><td style="text-align:left"></td><td>(1.098)</td></tr>
## <tr><td style="text-align:left"></td><td></td></tr>
## <tr><td style="text-align:left">LengthInLetters</td><td>3.706<sup>**</sup></td></tr>
## <tr><td style="text-align:left"></td><td>(1.708)</td></tr>
## <tr><td style="text-align:left"></td><td></td></tr>
## <tr><td style="text-align:left">FamilySize</td><td>-9.156<sup>***</sup></td></tr>
## <tr><td style="text-align:left"></td><td>(2.374)</td></tr>
## <tr><td style="text-align:left"></td><td></td></tr>
## <tr><td style="text-align:left">NumberSimplexSynsets</td><td>-9.598<sup>***</sup></td></tr>
## <tr><td style="text-align:left"></td><td>(2.867)</td></tr>
## <tr><td style="text-align:left"></td><td></td></tr>
## <tr><td style="text-align:left">Constant</td><td>753.474<sup>***</sup></td></tr>
## <tr><td style="text-align:left"></td><td>(8.614)</td></tr>
## <tr><td style="text-align:left"></td><td></td></tr>
## <tr><td colspan="2" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left">Observations</td><td>1,452</td></tr>
## <tr><td style="text-align:left">R<sup>2</sup></td><td>0.436</td></tr>
## <tr><td style="text-align:left">Adjusted R<sup>2</sup></td><td>0.434</td></tr>
## <tr><td style="text-align:left">Residual Std. Error</td><td>53.266 (df = 1447)</td></tr>
## <tr><td style="text-align:left">F Statistic</td><td>279.430<sup>***</sup> (df = 4; 1447)</td></tr>
## <tr><td colspan="2" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left"><em>Note:</em></td><td style="text-align:right"><sup>*</sup>p<0.1; <sup>**</sup>p<0.05; <sup>***</sup>p<0.01</td></tr>
## </table>

If you do not know where R saves files, check it — get the path to a current working directory:

getwd()
## [1] "/Users/allat/Desktop"

If you want to knit an Rmd-file as an html (like this file) and include regression results as a neat table, you can add the option `results while creating a new code chunk, like this: {r, results = 'asis'}

stargazer(model, type = "html", out = "my model.doc")
Dependent variable:
RTlexdec
WrittenFrequency -20.870***
(1.098)
LengthInLetters 3.706**
(1.708)
FamilySize -9.156***
(2.374)
NumberSimplexSynsets -9.598***
(2.867)
Constant 753.474***
(8.614)
Observations 1,452
R2 0.436
Adjusted R2 0.434
Residual Std. Error 53.266 (df = 1447)
F Statistic 279.430*** (df = 4; 1447)
Note: p<0.1; p<0.05; p<0.01

Note: stars in stargazer are added in a slightly different way: one star corresponds to the 10% significance (p<0.1), two stars correspond to the 5% significance (p<0.05), three stars correspond to 1% significance (p<0.01).