Setup:
library(tidyverse)
Registered S3 methods overwritten by 'dbplyr':
  method         from
  print.tbl_lazy dplyr
  print.tbl_sql  dplyr
── Attaching packages ─────────────────────────────── tidyverse 1.3.1 ──
✓ ggplot2 3.3.4     ✓ purrr   0.3.4
✓ tibble  3.1.2     ✓ dplyr   1.0.6
✓ tidyr   1.1.3     ✓ stringr 1.4.0
✓ readr   1.4.0     ✓ forcats 0.5.1
── Conflicts ──────────────────────────────── tidyverse_conflicts() ──
✗ dplyr::filter() masks stats::filter()
✗ dplyr::lag()    masks stats::lag()
Read in data from Winter:
ELP <- read_csv("ELP_full_length_frequency.csv")
── Column specification ────────────────────────────────────────────
cols(
Word = col_character(),
Log10Freq = col_double(),
length = col_double(),
RT = col_double()
)
We’ll start off by running correlations on the raw data. But first, we should make some plots, right?
ELP %>% ggplot(aes(x = Log10Freq, y = RT))+
  geom_point()+
  geom_smooth(method = "lm")
This looks like a mostly linear relationship, so a correlation is a reasonable summary. Let’s try some correlations:
cor.test(ELP$Log10Freq, ELP$RT, method = "pearson")
Pearson's product-moment correlation
data: ELP$Log10Freq and ELP$RT
t = -143.41, df = 33073, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.6258090 -0.6125185
sample estimates:
cor
-0.6192081
So the correlation between Log10Freq and RT is -.62. With the cor.test() function we also get a 95% confidence interval for that correlation. We’ll learn about confidence intervals in detail later, but the quick explanation is this: if we repeatedly resampled ~33,000 words and computed an interval each time, about 95% of those intervals would contain the true correlation.
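To make the resampling idea concrete, here is a minimal bootstrap sketch (the 1000 replicates and the seed are arbitrary choices of ours, not part of Winter’s examples): resample the rows with replacement, recompute the correlation each time, and look at the middle 95% of the estimates. It should land close to the interval cor.test() reported.
set.seed(1)  # make the resampling reproducible
boot_cors <- replicate(1000, {
  idx <- sample(nrow(ELP), replace = TRUE)  # resample row indices with replacement
  cor(ELP$Log10Freq[idx], ELP$RT[idx])      # correlation in this resample
})
quantile(boot_cors, c(0.025, 0.975))        # should be close to the cor.test() interval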
Let’s try a different kind of correlation, the Spearman correlation, which only takes into account the ranks of values in the variables.
cor.test(ELP$Log10Freq, ELP$RT, method = "spearman")
Cannot compute exact p-value with ties
Spearman's rank correlation rho
data: ELP$Log10Freq and ELP$RT
S = 9.9688e+12, p-value < 2.2e-16
alternative hypothesis: true rho is not equal to 0
sample estimates:
rho
-0.6530766
This correlation of -.65 is very close to the Pearson correlation, as will often be the case.
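Under the hood, Spearman’s rho is just a Pearson correlation computed on ranks (rank() assigns average ranks to ties, which is also what cor.test() uses), so we can reproduce it by hand:
cor(rank(ELP$Log10Freq), rank(ELP$RT))  # matches the rho from cor.test()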
Now we’re going to transform these variables to further explore the relationships among them using correlations and regression.
ELP <- ELP %>% mutate(Freq = exp(Log10Freq),
Log10Freq_c = Log10Freq - mean(Log10Freq),
Log10Freq_z = Log10Freq_c/sd(Log10Freq),
RT_c = RT - mean(RT),
RT_z = RT_c/sd(RT),
Log10RT = log10(RT),
Log10RT_c = Log10RT - mean(Log10RT),
Log10RT_z = Log10RT_c/sd(Log10RT))
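A quick sanity check on these transformations: the z-scored versions should have standard deviation exactly 1 and a mean of 0 (up to floating-point error). Here is one way to check with across():
ELP %>% summarise(across(c(Log10Freq_z, RT_z, Log10RT_z),
                         list(mean = mean, sd = sd)))
# the means are ~0 (floating-point noise), the sds are exactly 1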
Let’s look at the distributions of these variables. We’ll use density plots instead of histograms, since we have ~33,000 observations.
ELP %>% pivot_longer(Log10Freq:Log10RT_z) %>%  # 'length' falls inside this column range,
  filter(name != "length") %>%                 # so we drop it before plotting
  ggplot(aes(x = value))+
  geom_density()+
  facet_wrap(~name, scales = "free")
These plots show that centering and standardizing the log-transformed values only shifts and rescales the x-axis; the shape of each distribution stays the same. Still, let’s look at some correlations among these variables:
ELP %>% select(Log10Freq:Log10RT_z) %>% cor() %>% round(2)
Log10Freq length RT Freq Log10Freq_c Log10Freq_z RT_c RT_z Log10RT Log10RT_c Log10RT_z
Log10Freq 1.00 -0.38 -0.62 0.74 1.00 1.00 -0.62 -0.62 -0.64 -0.64 -0.64
length -0.38 1.00 0.53 -0.30 -0.38 -0.38 0.53 0.53 0.54 0.54 0.54
RT -0.62 0.53 1.00 -0.41 -0.62 -0.62 1.00 1.00 0.99 0.99 0.99
Freq 0.74 -0.30 -0.41 1.00 0.74 0.74 -0.41 -0.41 -0.44 -0.44 -0.44
Log10Freq_c 1.00 -0.38 -0.62 0.74 1.00 1.00 -0.62 -0.62 -0.64 -0.64 -0.64
Log10Freq_z 1.00 -0.38 -0.62 0.74 1.00 1.00 -0.62 -0.62 -0.64 -0.64 -0.64
RT_c -0.62 0.53 1.00 -0.41 -0.62 -0.62 1.00 1.00 0.99 0.99 0.99
RT_z -0.62 0.53 1.00 -0.41 -0.62 -0.62 1.00 1.00 0.99 0.99 0.99
Log10RT -0.64 0.54 0.99 -0.44 -0.64 -0.64 0.99 0.99 1.00 1.00 1.00
Log10RT_c -0.64 0.54 0.99 -0.44 -0.64 -0.64 0.99 0.99 1.00 1.00 1.00
Log10RT_z -0.64 0.54 0.99 -0.44 -0.64 -0.64 0.99 0.99 1.00 1.00 1.00
You should notice here that each variable is essentially perfectly correlated with its own centered and standardized versions.
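For instance, RT correlates perfectly with its own z-scored version, because z-scoring only subtracts a constant and divides by a constant:
cor(ELP$RT, ELP$RT_z)  # exactly 1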
Let’s see now if we can reproduce the Pearson correlation coefficient between Log10Freq and RT from our earlier example, but this time using a linear regression. We’ll use the standardized versions of the variables to do that.
mod <- lm(Log10Freq_z ~ RT_z, data = ELP)
summary(mod)
Call:
lm(formula = Log10Freq_z ~ RT_z, data = ELP)
Residuals:
Min 1Q Median 3Q Max
-3.2046 -0.5161 -0.0071 0.4942 3.6760
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.292e-16 4.318e-03 0.0 1
RT_z -6.192e-01 4.318e-03 -143.4 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.7852 on 33073 degrees of freedom
Multiple R-squared: 0.3834, Adjusted R-squared: 0.3834
F-statistic: 2.057e+04 on 1 and 33073 DF, p-value: < 2.2e-16
It’s identical! Though it’s a little hard to read in the summary output, since the estimates are printed in scientific notation. Remember that we can call specific elements from model objects:
round(mod$coefficients[2],2)
RT_z
-0.62
There it is - a standardized regression coefficient of -.62.
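One more note: with standardized variables it doesn’t matter which one plays the role of the outcome. The slope between two z-scored variables is the correlation itself, so regressing the other way around recovers the same coefficient:
mod_flip <- lm(RT_z ~ Log10Freq_z, data = ELP)  # outcome and predictor swapped
round(mod_flip$coefficients[2], 2)              # also -0.62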
Now we’ll follow the examples from Winter Ch. 5.
ELP <- read_csv("ELP_frequency.csv")
── Column specification ────────────────────────────────────────────
cols(
Word = col_character(),
Freq = col_double(),
RT = col_double()
)
Transform some variables:
ELP <- mutate(ELP, Log10Freq = log10(Freq),
LogRT = log(RT))
This is a smaller version of the dataset we first loaded. Let’s plot it using some neat labels.
Plot 1: log RTs against raw frequency
ELP %>% ggplot(aes(x = Freq, y = LogRT, label = Word))+
geom_text()+
geom_smooth(method = "lm")+
ggtitle('Log RT ~ raw frequency')+
theme_minimal()
And now with log frequencies:
ELP %>% ggplot(aes(x = Log10Freq, y = LogRT, label = Word))+
geom_text()+
geom_smooth(method = "lm")+
ggtitle('Log RT ~ log frequency')+
theme_minimal()
This is stretched out more: words no longer pile up near 0 the way they do on the raw frequency scale.
On to a regression model (which you did for HW!)
library(broom)  # for tidy() and glance(); installed with the tidyverse but not attached by it
ELP_mdl <- lm(LogRT ~ Log10Freq, data = ELP)
tidy(ELP_mdl)
A neat thing about regression models is that they allow you to make predictions for values not actually in the data: you can 'plug' values of interest into the regression equation, using the fitted coefficients, to get predicted values of the outcome variable.
b0 <- tidy(ELP_mdl)$estimate[1] #intercept
b1 <- tidy(ELP_mdl)$estimate[2] #slope
With our intercept and slope coefficients saved, we can compute predicted values for raw frequencies of 10 and 1000. The predictor in the model is log10 frequency, so we plug in log10(10) = 1 and log10(1000) = 3:
logRT_10freq <- b0 + b1*1    # predicted log RT at a frequency of 10
logRT_1000freq <- b0 + b1*3  # predicted log RT at a frequency of 1000
Now we can exponentiate those predictions to get values in raw RT units (we log-transformed RT with the natural log, so exp() undoes it):
exp(logRT_10freq)
[1] 801.9387
exp(logRT_1000freq)
[1] 651.0159
So words that show up in the corpus around 1000 times will elicit RTs about 150ms faster than words that only show up around 10 times.
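We didn’t have to plug the values in by hand, by the way. Base R’s predict() does the same computation; here is a minimal sketch (the newdata column name must match the predictor in the model):
exp(predict(ELP_mdl, newdata = tibble(Log10Freq = c(1, 3))))  # same two predictions as above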
Following section 5.6, we center and standardize the log frequencies:
ELP <- mutate(ELP,
Log10Freq_c = Log10Freq - mean(Log10Freq),
Log10Freq_z = Log10Freq_c/sd(Log10Freq_c))
select(ELP, Freq, Log10Freq, Log10Freq_c, Log10Freq_z)
Note: you can also use the scale() function to center and/or standardize variables.
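For example, this should reproduce Log10Freq_z exactly (scale() returns a one-column matrix, so we wrap it in as.numeric() to get a plain vector back):
all.equal(as.numeric(scale(ELP$Log10Freq)), ELP$Log10Freq_z)  # TRUE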
Some new linear models:
ELP_mdl_c <- lm(LogRT ~ Log10Freq_c, ELP) #centered
ELP_mdl_z <- lm(LogRT ~ Log10Freq_z, ELP) #standardized
Now to check out the output… First, plain log10 word frequencies:
tidy(ELP_mdl) %>% select(term, estimate)
Next, centered log10 frequencies:
tidy(ELP_mdl_c) %>% select(term, estimate)
And standardized log10 frequencies:
tidy(ELP_mdl_z) %>% select(term, estimate)
The intercepts and slopes differ across the three models, but they explain exactly the same proportion of variance in the data:
glance(ELP_mdl)$r.squared
[1] 0.7297624
glance(ELP_mdl_c)$r.squared
[1] 0.7297624
glance(ELP_mdl_z)$r.squared
[1] 0.7297624
Linear transformations won’t change correlations, regressions, etc. But nonlinear transformations will:
glance(lm(LogRT ~ Freq, ELP))$r.squared
[1] 0.2914014
Introducing the with() function, which lets us refer to columns directly (no $ needed) when running a correlation:
with(ELP, cor(Log10Freq, LogRT))
[1] -0.8542613
Recreating it in a regression. The -1 drops the intercept, which is zero anyway when both variables are standardized:
ELP_cor <- lm(scale(LogRT) ~ -1 + Log10Freq_z, ELP)
tidy(ELP_cor) %>% select(estimate)
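The estimate should match the correlation we just computed with with():
all.equal(tidy(ELP_cor)$estimate, with(ELP, cor(Log10Freq, LogRT)))  # TRUE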
That’s it for now!