Linear Models - Anscombe’s Quartet

Getting started - loading some data to model.

library(datasets)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.2
## ✔ ggplot2   4.0.0     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

From the datasets package we’ll load up a small dataset called Anscombe’s quartet.

a <- datasets::anscombe

There are 11 observations of 8 variables, which form 4 pairs of x and y data, i.e., each x variable is paired with (and used to predict) its corresponding y variable.
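To confirm the structure, here is a quick check of the dimensions and column names:

dim(a)    # 11 rows, 8 columns
names(a)  # x1 through x4 and y1 through y4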

Let’s summarize these variables numerically - a quick and dirty way is to call the summary() function on the whole data frame.

summary(a)
##        x1             x2             x3             x4           y1        
##  Min.   : 4.0   Min.   : 4.0   Min.   : 4.0   Min.   : 8   Min.   : 4.260  
##  1st Qu.: 6.5   1st Qu.: 6.5   1st Qu.: 6.5   1st Qu.: 8   1st Qu.: 6.315  
##  Median : 9.0   Median : 9.0   Median : 9.0   Median : 8   Median : 7.580  
##  Mean   : 9.0   Mean   : 9.0   Mean   : 9.0   Mean   : 9   Mean   : 7.501  
##  3rd Qu.:11.5   3rd Qu.:11.5   3rd Qu.:11.5   3rd Qu.: 8   3rd Qu.: 8.570  
##  Max.   :14.0   Max.   :14.0   Max.   :14.0   Max.   :19   Max.   :10.840  
##        y2              y3              y4        
##  Min.   :3.100   Min.   : 5.39   Min.   : 5.250  
##  1st Qu.:6.695   1st Qu.: 6.25   1st Qu.: 6.170  
##  Median :8.140   Median : 7.11   Median : 7.040  
##  Mean   :7.501   Mean   : 7.50   Mean   : 7.501  
##  3rd Qu.:8.950   3rd Qu.: 7.98   3rd Qu.: 8.190  
##  Max.   :9.260   Max.   :12.74   Max.   :12.500

You’ll notice that all of the x variables have the same mean (9), and all of the y variables have the same mean (about 7.5).
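We can check this directly - a quick sketch using dplyr’s across() to compute the column means (and, if you like, the standard deviations, which are also nearly identical within the x set and within the y set):

a %>% summarise(across(everything(), mean))
a %>% summarise(across(everything(), sd))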

Let’s run a few simple linear models.

m1 <- lm(y1 ~ x1, data = a)
summary(m1)
## 
## Call:
## lm(formula = y1 ~ x1, data = a)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.92127 -0.45577 -0.04136  0.70941  1.83882 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)   3.0001     1.1247   2.667  0.02573 * 
## x1            0.5001     0.1179   4.241  0.00217 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.237 on 9 degrees of freedom
## Multiple R-squared:  0.6665, Adjusted R-squared:  0.6295 
## F-statistic: 17.99 on 1 and 9 DF,  p-value: 0.00217
m2 <- lm(y2 ~ x2, data = a)
summary(m2)
## 
## Call:
## lm(formula = y2 ~ x2, data = a)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.9009 -0.7609  0.1291  0.9491  1.2691 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)    3.001      1.125   2.667  0.02576 * 
## x2             0.500      0.118   4.239  0.00218 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.237 on 9 degrees of freedom
## Multiple R-squared:  0.6662, Adjusted R-squared:  0.6292 
## F-statistic: 17.97 on 1 and 9 DF,  p-value: 0.002179
m3 <- lm(y3 ~ x3, data = a)
summary(m3)
## 
## Call:
## lm(formula = y3 ~ x3, data = a)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.1586 -0.6146 -0.2303  0.1540  3.2411 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)   3.0025     1.1245   2.670  0.02562 * 
## x3            0.4997     0.1179   4.239  0.00218 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.236 on 9 degrees of freedom
## Multiple R-squared:  0.6663, Adjusted R-squared:  0.6292 
## F-statistic: 17.97 on 1 and 9 DF,  p-value: 0.002176
m4 <- lm(y4 ~ x4, data = a)
summary(m4)
## 
## Call:
## lm(formula = y4 ~ x4, data = a)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -1.751 -0.831  0.000  0.809  1.839 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)   3.0017     1.1239   2.671  0.02559 * 
## x4            0.4999     0.1178   4.243  0.00216 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.236 on 9 degrees of freedom
## Multiple R-squared:  0.6667, Adjusted R-squared:  0.6297 
## F-statistic:    18 on 1 and 9 DF,  p-value: 0.002165

Wow, looks like the models all say essentially the same thing about the relationship between the x variable and the y variable. The estimated slope (beta) for x is about .5 in each model, so each one-unit increase in x is associated with a .5-unit increase in y.
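To see this side by side, we can pull the coefficients out of all four models at once. A small sketch (the row names come from the first model, but each column holds one model’s intercept and slope):

# One column per model; every intercept is ~3.0 and every slope is ~0.5
sapply(list(m1 = m1, m2 = m2, m3 = m3, m4 = m4), coef)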

But of course, we should plot our data to see whether these models actually make sense. Let’s take a look:

Pair 1:

a %>% ggplot(aes(x = x1, y = y1))+
  geom_point()

Hmm, this looks pretty linear…

a %>% ggplot(aes(x = x1, y = y1))+
  geom_point()+
  geom_smooth(method = "lm", se = F, color = "red")+
  geom_smooth(se = F)
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

Let’s look at the other pairs…

a %>% ggplot(aes(x = x2, y = y2))+
  geom_point()+
  geom_smooth(method = "lm", se = F, color = "red")+
  geom_smooth(se = F)
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

a %>% ggplot(aes(x = x3, y = y3))+
  geom_point()+
  geom_smooth(method = "lm", se = F, color = "red")+
  geom_smooth(se = F)
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

a %>% ggplot(aes(x = x4, y = y4))+
  geom_point()+
  geom_smooth(method = "lm", se = F, color = "red")+
  geom_smooth(se = F)
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : at 7.945
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : radius 0.003025
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : all data on boundary of neighborhood. make span bigger
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : pseudoinverse used at 7.945
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : neighborhood radius 0.055
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : reciprocal condition number 1
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : There are other near singularities as well. 122.21
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : zero-width neighborhood. make span bigger
## Warning: Failed to fit group -1.
## Caused by error in `predLoess()`:
## ! NA/NaN/Inf in foreign function call (arg 5)

These data famously illustrate the importance of plotting. If you don’t plot, you might miss important aspects of the relationships among variables and end up with models that don’t make much sense. (The warnings above are a symptom of this: in pair 4, every x value except one is 8, so the loess smoother can’t be fit at all.)
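As an aside, one way to look at all four pairs at once is to reshape the quartet into long format and facet. Here’s a sketch using pivot_longer()’s ".value" convention to split the x1 through y4 column names into a variable part and a pair number:

# Reshape x1..y4 into x, y, and pair, then plot one panel per pair
a %>%
  pivot_longer(everything(),
               names_to = c(".value", "pair"),
               names_pattern = "(.)(.)") %>%
  ggplot(aes(x = x, y = y)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  facet_wrap(~pair)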

Correlations

Loading the broom package:

library(broom)

Read in data from Winter:

ELP <- read_csv("ELP_full_length_frequency.csv")
## Rows: 33075 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Word
## dbl (3): Log10Freq, length, RT
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

We’ll start off running correlations on the raw data. First, we should make some plots, right?

ELP %>% ggplot(aes(x = Log10Freq, y = RT))+
  geom_point()+
  geom_smooth(method = "lm")
## `geom_smooth()` using formula = 'y ~ x'

This looks like a mostly linear (negative) relationship. Let’s quantify it with some correlations:

cor.test(ELP$Log10Freq, ELP$RT, method = "pearson")
## 
##  Pearson's product-moment correlation
## 
## data:  ELP$Log10Freq and ELP$RT
## t = -143.41, df = 33073, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.6258090 -0.6125185
## sample estimates:
##        cor 
## -0.6192081

So the correlation between Log10Freq and RT is -.62. With the cor.test() function we also get to see a 95% confidence interval for that correlation.
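If you’d rather have those numbers in a data frame than in printed output, broom’s tidy() (loaded above) also works on the test object. A quick sketch:

# Tidy the test object into a one-row tibble with estimate, conf.low, and conf.high
cor.test(ELP$Log10Freq, ELP$RT, method = "pearson") %>% tidy()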

Let’s try a different kind of correlation, the Spearman correlation, which only takes into account the ranks of values in the variables.

cor.test(ELP$Log10Freq, ELP$RT, method = "spearman")
## Warning in cor.test.default(ELP$Log10Freq, ELP$RT, method = "spearman"): Cannot
## compute exact p-value with ties
## 
##  Spearman's rank correlation rho
## 
## data:  ELP$Log10Freq and ELP$RT
## S = 9.9688e+12, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##        rho 
## -0.6530766

This correlation of -.65 is very close to the Pearson correlation, as is often the case.
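One way to see what the Spearman correlation is doing: up to the handling of ties, it is just a Pearson correlation computed on the ranks of the two variables. A quick sketch:

# Pearson correlation on the ranks ~ the Spearman rho above (about -.65)
cor(rank(ELP$Log10Freq), rank(ELP$RT))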

Transformations

Now we’re going to transform these variables to further explore the relationships among them using correlations and regression.

ELP <- ELP %>% mutate(Freq = exp(Log10Freq),
                      Log10Freq_c = Log10Freq - mean(Log10Freq),
                      Log10Freq_z = Log10Freq_c/sd(Log10Freq),
                      RT_c = RT - mean(RT),
                      RT_z = RT_c/sd(RT),
                      Log10RT = log10(RT),
                      Log10RT_c = Log10RT - mean(Log10RT),
                      Log10RT_z = Log10RT_c/sd(Log10RT))

Let’s look at the distributions of these variables. We’ll use density plots instead of histograms, as the dataset is quite large (over 33,000 rows).

ELP %>% pivot_longer(Log10Freq:Log10RT_z) %>% filter(name != "length") %>%
  ggplot(aes(x = value))+
  geom_density()+
  facet_wrap(~name, scales = "free")

These plots show that centering and standardizing shift and rescale the distributions but don’t change their shape - only the log transformation does that. Anyway, let’s look at some correlations among these variables:

ELP %>% select(Log10Freq:Log10RT_z) %>% cor() %>% round(2)
##             Log10Freq length    RT  Freq Log10Freq_c Log10Freq_z  RT_c  RT_z
## Log10Freq        1.00  -0.38 -0.62  0.74        1.00        1.00 -0.62 -0.62
## length          -0.38   1.00  0.53 -0.30       -0.38       -0.38  0.53  0.53
## RT              -0.62   0.53  1.00 -0.41       -0.62       -0.62  1.00  1.00
## Freq             0.74  -0.30 -0.41  1.00        0.74        0.74 -0.41 -0.41
## Log10Freq_c      1.00  -0.38 -0.62  0.74        1.00        1.00 -0.62 -0.62
## Log10Freq_z      1.00  -0.38 -0.62  0.74        1.00        1.00 -0.62 -0.62
## RT_c            -0.62   0.53  1.00 -0.41       -0.62       -0.62  1.00  1.00
## RT_z            -0.62   0.53  1.00 -0.41       -0.62       -0.62  1.00  1.00
## Log10RT         -0.64   0.54  0.99 -0.44       -0.64       -0.64  0.99  0.99
## Log10RT_c       -0.64   0.54  0.99 -0.44       -0.64       -0.64  0.99  0.99
## Log10RT_z       -0.64   0.54  0.99 -0.44       -0.64       -0.64  0.99  0.99
##             Log10RT Log10RT_c Log10RT_z
## Log10Freq     -0.64     -0.64     -0.64
## length         0.54      0.54      0.54
## RT             0.99      0.99      0.99
## Freq          -0.44     -0.44     -0.44
## Log10Freq_c   -0.64     -0.64     -0.64
## Log10Freq_z   -0.64     -0.64     -0.64
## RT_c           0.99      0.99      0.99
## RT_z           0.99      0.99      0.99
## Log10RT        1.00      1.00      1.00
## Log10RT_c      1.00      1.00      1.00
## Log10RT_z      1.00      1.00      1.00

You should notice here that each variable is (essentially) perfectly correlated with its centered and standardized versions.
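For example, RT correlates at exactly 1 with its centered and standardized versions, because centering and scaling are linear transformations:

cor(ELP$RT, ELP$RT_c)  # exactly 1
cor(ELP$RT, ELP$RT_z)  # exactly 1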

Let’s see now if we can reproduce the Pearson correlation coefficient between Log10Freq and RT from our earlier example, but this time using a linear regression. We’ll use the standardized versions of the variables to do that.

mod <- lm(Log10Freq_z ~ RT_z, data = ELP)
summary(mod)
## 
## Call:
## lm(formula = Log10Freq_z ~ RT_z, data = ELP)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.2046 -0.5161 -0.0071  0.4942  3.6760 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.283e-16  4.318e-03     0.0        1    
## RT_z        -6.192e-01  4.318e-03  -143.4   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7852 on 33073 degrees of freedom
## Multiple R-squared:  0.3834, Adjusted R-squared:  0.3834 
## F-statistic: 2.057e+04 on 1 and 33073 DF,  p-value: < 2.2e-16

It’s identical, though a little hard to read in the summary output. Remember that we can pull specific elements out of model objects.

round(mod$coefficients[2],2)
##  RT_z 
## -0.62

There it is - a standardized regression coefficient of -.62.
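As a sanity check, that standardized slope should match the Pearson correlation we computed earlier:

round(cor(ELP$Log10Freq, ELP$RT), 2)  # also -.62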

Regression Practice

Now we’ll follow the examples from Winter Ch. 5.

ELP <- read_csv("ELP_frequency.csv")
## Rows: 12 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Word
## dbl (2): Freq, RT
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Transform some variables:

ELP <- mutate(ELP, Log10Freq = log10(Freq),
              LogRT = log(RT))

This is a smaller version of the ELP dataset we loaded earlier, with just 12 words. Let’s plot it using the words themselves as labels.

Plot 1: Log RTs

ELP %>% ggplot(aes(x = Freq, y = LogRT, label = Word))+
  geom_text()+
  geom_smooth(method = "lm")+
  ggtitle('Log RT ~ raw frequency')+
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
## Warning: The following aesthetics were dropped during statistical transformation: label.
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
##   the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
##   variable into a factor?

And now with log frequencies:

ELP %>% ggplot(aes(x = Log10Freq, y = LogRT, label = Word))+
  geom_text()+
  geom_smooth(method = "lm")+
  ggtitle('Log RT ~ log frequency')+
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
## Warning: The following aesthetics were dropped during statistical transformation: label.
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
##   the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
##   variable into a factor?

This is stretched out more - the words no longer pile up near 0 the way they do with raw frequency.

On to a regression model:

ELP_mdl <- lm(LogRT ~ Log10Freq, data = ELP)
tidy(ELP_mdl)
## # A tibble: 2 × 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)    6.79     0.0611    111.   8.56e-17
## 2 Log10Freq     -0.104    0.0201     -5.20 4.03e- 4

Extra: Model predictions and back-transforming

A neat thing about regression models is that they allow you to make predictions for values that aren’t actually in the data - you can plug values of interest into the regression equation, using the estimated coefficients, to get predicted values of the outcome variable.

b0 <- tidy(ELP_mdl)$estimate[1] #intercept
b1 <- tidy(ELP_mdl)$estimate[2] #slope

With our intercept and slope coefficients saved, we can compute predicted log RTs for frequencies of 10 and 1000. Remember that the predictor in the model is log10 frequency, so we plug in log10(10) = 1 and log10(1000) = 3.

logRT_10freq <- b0 + b1*1    # log10(10) = 1
logRT_1000freq <- b0 + b1*3  # log10(1000) = 3

Now we can exponentiate those predictions to get values in raw RT units (the RTs were natural-log transformed, so exp() undoes that):

exp(logRT_10freq)
## [1] 801.9387
exp(logRT_1000freq)
## [1] 651.0159

So words that show up in the corpus around 1000 times are predicted to elicit RTs about 150 ms faster than words that show up only around 10 times.
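As a sketch of an alternative, we could get the same predictions without extracting the coefficients by hand, using predict() with the log10 frequencies of interest as new data (new_freqs is just a name chosen here for illustration):

new_freqs <- tibble(Log10Freq = log10(c(10, 1000)))  # log10 values of 1 and 3
exp(predict(ELP_mdl, newdata = new_freqs))           # should reproduce the ~802 ms and ~651 ms values above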

Extra: More practice with centering and standardizing

Following section 5.6 -

ELP <- mutate(ELP,
              Log10Freq_c = Log10Freq - mean(Log10Freq),
              Log10Freq_z = Log10Freq_c/sd(Log10Freq_c))

select(ELP, Freq, Log10Freq, Log10Freq_c, Log10Freq_z)
## # A tibble: 12 × 4
##     Freq Log10Freq Log10Freq_c Log10Freq_z
##    <dbl>     <dbl>       <dbl>       <dbl>
##  1 55522     4.74        2.03       1.41  
##  2 40629     4.61        1.89       1.31  
##  3 14895     4.17        1.46       1.01  
##  4  3992     3.60        0.884      0.614 
##  5  3850     3.59        0.868      0.603 
##  6   409     2.61       -0.106     -0.0736
##  7   241     2.38       -0.336     -0.233 
##  8   238     2.38       -0.341     -0.237 
##  9    66     1.82       -0.898     -0.624 
## 10    32     1.51       -1.21      -0.842 
## 11     4     0.602      -2.12      -1.47  
## 12     4     0.602      -2.12      -1.47

Note: you can also use the scale() function to center and/or standardize variables.
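For example, this sketch should reproduce Log10Freq_z (the Log10Freq_z_alt name is just for illustration); note that scale() returns a one-column matrix, so it helps to wrap it in as.numeric():

ELP %>%
  mutate(Log10Freq_z_alt = as.numeric(scale(Log10Freq))) %>%
  select(Log10Freq_z, Log10Freq_z_alt)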

Some new linear models:

ELP_mdl_c <- lm(LogRT ~ Log10Freq_c, ELP) #centered
ELP_mdl_z <- lm(LogRT ~ Log10Freq_z, ELP) #standardized

Now to check out the output. First, the original model with log10 word frequencies:

tidy(ELP_mdl) %>% select(term, estimate)
## # A tibble: 2 × 2
##   term        estimate
##   <chr>          <dbl>
## 1 (Intercept)    6.79 
## 2 Log10Freq     -0.104

Centered log10 frequencies:

tidy(ELP_mdl_c) %>% select(term, estimate)
## # A tibble: 2 × 2
##   term        estimate
##   <chr>          <dbl>
## 1 (Intercept)    6.51 
## 2 Log10Freq_c   -0.104

And standardized log10 frequencies:

tidy(ELP_mdl_z) %>% select(term, estimate)
## # A tibble: 2 × 2
##   term        estimate
##   <chr>          <dbl>
## 1 (Intercept)    6.51 
## 2 Log10Freq_z   -0.150

The estimates differ, but the models explain exactly the same proportion of variance in the data.

glance(ELP_mdl)$r.squared
## [1] 0.7297624
glance(ELP_mdl_c)$r.squared
## [1] 0.7297624
glance(ELP_mdl_z)$r.squared
## [1] 0.7297624

Linear transformations (like centering and standardizing) won’t change correlations, the variance explained by a regression, and so on. But nonlinear transformations (like the log) will:

glance(lm(LogRT ~ Freq, ELP))$r.squared
## [1] 0.2914014

Finally, let’s introduce the with() function as a convenient way to compute a correlation:

with(ELP, cor(Log10Freq, LogRT))
## [1] -0.8542613

Recreating that correlation in a regression, by standardizing both variables and dropping the intercept (the -1 in the formula):

ELP_cor <- lm(scale(LogRT) ~ -1 + Log10Freq_z, ELP)

tidy(ELP_cor) %>% select(estimate)
## # A tibble: 1 × 1
##   estimate
##      <dbl>
## 1   -0.854

That’s it for now!