We’ll be using a couple of new packages today, so install car, ppcor, and GGally if you don’t have them on your machine already. We’ll also be revisiting Winter’s data on iconicity that we saw in an earlier chapter.
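If any of them are missing from your machine, a single call installs all three:
install.packages(c("car", "ppcor", "GGally"))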
library(tidyverse)
library(broom)
library(car)
library(ppcor)
library(GGally)
icon <- read_csv("perry_winter_2017_iconicity.csv")
-- Column specification ----------------------------------------------------------------------------------
cols(
Word = col_character(),
POS = col_character(),
SER = col_double(),
CorteseImag = col_double(),
Conc = col_double(),
Syst = col_double(),
Freq = col_double(),
Iconicity = col_double()
)
d <- read_csv("isbell_under_rev.csv")
-- Column specification ----------------------------------------------------------------------------------
cols(
Speaker = col_character(),
Gender = col_character(),
L1 = col_character(),
EIT = col_double(),
Comprehensibility = col_double(),
Accentedness = col_double(),
SA_Comprehensibility = col_double(),
SA_Accentedness = col_double(),
Satisfaction = col_double(),
Value = col_double(),
Kor_Use = col_double(),
LivingSK = col_double(),
SKKorSchool = col_double(),
OtherKorSchool = col_double(),
TotalKorSchool = col_double(),
Age = col_double(),
AO = col_double()
)
You might notice some messages about masked functions when these packages load - they aren’t errors. These packages (and the packages they attach) include functions whose names overlap with some tidyverse functions. That’s okay - we can force R to use a specific package’s function with the package::function() syntax when needed.
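One common example (the specific conflicts depend on what is attached in your session): ppcor brings MASS along with it, and MASS has its own select() that can mask dplyr’s. That is why the code in this document uses the explicit package::function() form:
# explicitly ask for dplyr's select() rather than whichever select() is masking it
icon %>% dplyr::select(Word, Iconicity) %>% head()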
We’re going to play with some data from a study I have under review (and from your HW last week) to examine how partial correlations relate to multiple regression.
Let’s look at some simple correlations among three variables first: self-assessed comprehensibility (SA_Comprehensibility), listener-assessed comprehensibility (Comprehensibility), and EIT scores.
d %>% dplyr::select(SA_Comprehensibility, Comprehensibility, EIT) %>%
cor() -> corrs
round(corrs, 2)
                     SA_Comprehensibility Comprehensibility  EIT
SA_Comprehensibility                 1.00              0.54 0.51
Comprehensibility                    0.54              1.00 0.70
EIT                                  0.51              0.70 1.00
EIT and Comprehensibility are rather strongly correlated (~.70) and both are moderately correlated with SA_Comprehensibility (~.50).
Now we’ll use the pcor() function from ppcor to examine the partial correlations of the 3 variables.
d %>% dplyr::select(SA_Comprehensibility, Comprehensibility, EIT) %>%
pcor() -> partials
round(partials$estimate, 2)
                     SA_Comprehensibility Comprehensibility  EIT
SA_Comprehensibility                 1.00              0.29 0.23
Comprehensibility                    0.29              1.00 0.59
EIT                                  0.23              0.59 1.00
You’ll notice that these partial correlation coefficients are all smaller - each one removes the variance that each member of the pair shares with the third variable before calculating the relationship between the pair.
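As an informal check on that idea (not something you would normally need to run, and assuming no missing values in these three columns, which the complete cor() matrix above suggests): regress each member of a pair on the third variable, then correlate the residuals. The result should match the pcor() estimate for that pair, up to rounding.
# partial correlation of SA_Comprehensibility and Comprehensibility, controlling for EIT:
# correlate what is left of each variable after removing what EIT predicts
res_sa <- resid(lm(SA_Comprehensibility ~ EIT, data = d))
res_comp <- resid(lm(Comprehensibility ~ EIT, data = d))
round(cor(res_sa, res_comp), 2)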
Now for a regression. We can actually standardize the variables within the regression equation itself, though it’s generally better to follow Winter’s practice of creating new variables.
d_mod <- lm(scale(SA_Comprehensibility) ~ scale(Comprehensibility) + scale(EIT), data = d)
tidy(d_mod) %>% dplyr::select(term, estimate) %>% mutate(estimate = round(estimate, 2))
The regression coefficients are not exactly the same as the partial correlation estimates, but they are similar. Both are calculated from a common numerator with different denominators, which leads to the small differences in the final values.
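To make the “common numerator, different denominators” point concrete, here is a small sketch that computes both quantities for Comprehensibility directly from the corrs matrix above, using the standard two-predictor formulas (r_y1, r_y2, and r_12 are just local shorthand for this sketch):
# y = SA_Comprehensibility, 1 = Comprehensibility, 2 = EIT
r_y1 <- corrs["SA_Comprehensibility", "Comprehensibility"]
r_y2 <- corrs["SA_Comprehensibility", "EIT"]
r_12 <- corrs["Comprehensibility", "EIT"]
# standardized regression coefficient for Comprehensibility (two-predictor case)
(r_y1 - r_y2 * r_12) / (1 - r_12^2)
# partial correlation of SA_Comprehensibility and Comprehensibility, controlling for EIT
(r_y1 - r_y2 * r_12) / sqrt((1 - r_y2^2) * (1 - r_12^2))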
The point here is that, just as simple regression is built on correlation, multiple regression runs on the same kind of logic and principles as partial correlation.
We’ll do some transformations next:
icon <- icon %>% mutate(Log10Freq = log10(Freq),
SER_z = scale(SER),
CorteseImag_z = scale(CorteseImag),
Syst_z = scale(Syst),
Freq_z = scale(Log10Freq))
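A quick sanity check, not part of the original workflow: each z-scored column should now have a mean of about 0 and a standard deviation of 1. (Keep in mind that scale() returns a one-column matrix, which is usually harmless but can look odd when printed.)
icon %>%
  summarize(mean_SER_z = mean(SER_z, na.rm = TRUE),
            sd_SER_z = sd(SER_z, na.rm = TRUE))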
The ggpairs() function from GGally can be very helpful for examining the distributions of and correlations among several variables.
icon %>% dplyr::select(Iconicity, SER, CorteseImag, Syst, Log10Freq) %>%
ggpairs(axisLabels = "none", progress = F,
lower = list(continuous = wrap("points", alpha = 0.3)))
Now we’ll specify and run (‘fit’) the model from Ch. 6 based on unstandardized variables.
icon_mdl <- lm(Iconicity ~ SER + CorteseImag + Syst + Log10Freq, data = icon)
summary(icon_mdl)
Call:
lm(formula = Iconicity ~ SER + CorteseImag + Syst + Log10Freq,
data = icon)
Residuals:
Min 1Q Median 3Q Max
-3.07601 -0.71411 -0.03824 0.67337 2.76066
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.54476 0.18795 8.219 6.43e-16 ***
SER 0.49713 0.04012 12.391 < 2e-16 ***
CorteseImag -0.26328 0.02500 -10.531 < 2e-16 ***
Syst 401.52431 262.90268 1.527 0.127
Log10Freq -0.25163 0.03741 -6.725 2.97e-11 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.002 on 984 degrees of freedom
(2012 observations deleted due to missingness)
Multiple R-squared: 0.2125, Adjusted R-squared: 0.2093
F-statistic: 66.36 on 4 and 984 DF, p-value: < 2.2e-16
Remember, the broom package has tools for extracting model info more ‘cleanly’.
glance(icon_mdl)$r.squared
[1] 0.2124559
tidy(icon_mdl) %>% dplyr::select(term, estimate)
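If you also want confidence intervals alongside the estimates, tidy() takes a conf.int argument for lm models:
tidy(icon_mdl, conf.int = TRUE) %>%
  dplyr::select(term, estimate, conf.low, conf.high) %>%
  mutate(estimate = round(estimate, 2),
         conf.low = round(conf.low, 2),
         conf.high = round(conf.high, 2))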
Note: the Iconicity variable is pre-standardized, so only the predictors are z-scored in this next model.
icon_mdl_z <- lm(Iconicity ~ SER_z + CorteseImag_z, data = icon)
summary(icon_mdl_z)
Call:
lm(formula = Iconicity ~ SER_z + CorteseImag_z, data = icon)
Residuals:
Min 1Q Median 3Q Max
-3.2164 -0.7381 -0.1050 0.6143 3.0554
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.12971 0.03319 34.036 <2e-16 ***
SER_z 0.57965 0.04040 14.349 <2e-16 ***
CorteseImag_z -0.33641 0.03535 -9.517 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.032 on 1088 degrees of freedom
(1910 observations deleted due to missingness)
Multiple R-squared: 0.1658, Adjusted R-squared: 0.1643
F-statistic: 108.1 on 2 and 1088 DF, p-value: < 2.2e-16
We start checking our assumptions by looking at residuals.
A histogram:
hist(resid(icon_mdl))
Next we look at a “Q-Q plot”, which plots the quantiles of the residuals against the quantiles of a theoretical normal distribution. If the residuals are approximately normal, the points fall along a straight line.
qqnorm(resid(icon_mdl))
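A reference line makes the plot easier to judge by eye. Running qqline() immediately after the qqnorm() call above (base R, not part of the original code) adds a line through the first and third quartiles:
qqline(resid(icon_mdl))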
Next comes a plot of the residuals against the fitted (predicted) values. We hope not to see any patterns - that is, we hope to see a largely formless cloud or blob.
plot(fitted(icon_mdl), resid(icon_mdl))
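As an aside, base R’s plot() method for lm objects produces a standard set of diagnostic plots (residuals vs. fitted, Q-Q, scale-location, and residuals vs. leverage) in one go:
par(mfrow = c(2, 2))  # show the four plots in a 2 x 2 grid
plot(icon_mdl)
par(mfrow = c(1, 1))  # reset the plotting layout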
Now we’ll take a closer look at the predictor variables to see whether collinearity might be a concern, using the vif() function from the car package. Common rules of thumb: VIF values greater than 10 indicate a serious problem, and values above roughly 4 are worth a closer look.
vif(icon_mdl)
SER CorteseImag Syst Log10Freq
1.148597 1.143599 1.015054 1.020376
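For some intuition about what vif() reports: a predictor’s VIF is 1 / (1 - R-squared) from regressing that predictor on all of the other predictors. Here is an illustrative check for SER, using model.frame() to work with the same rows the model actually used (m_used and r2_ser are just local names for this sketch):
m_used <- model.frame(icon_mdl)  # the complete-case data behind icon_mdl
r2_ser <- summary(lm(SER ~ CorteseImag + Syst + Log10Freq, data = m_used))$r.squared
1 / (1 - r2_ser)  # should match vif(icon_mdl)["SER"]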