Correlation

Author

Dan Isbell

Set-up

First, load packages and data:

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.3     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

d <- read_csv("correlation_practice.csv")

Rows: 54 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): ID, Sex
dbl (3): OPIc_rating, Vocab_score, Grammar_score

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Scatter Plots

Scatter plots are a very useful way to visualize the relationship between two variables. We’ll try a quick way and a pretty way to make scatter plots.

Quick way:

plot(d$OPIc_rating, d$Vocab_score)

Pretty way:

d %>% ggplot(aes(x = OPIc_rating, y = Vocab_score)) +
  geom_point()+
  labs(x = "Speaking Proficiency (OPIc rating)", y = "Vocabulary Knowledge")+
  theme_bw()

A nice extra step is to add a trend line:

d %>% ggplot(aes(x = OPIc_rating, y = Vocab_score)) +
  geom_point()+
  geom_smooth(method = "lm")+
  labs(x = "Speaking Proficiency (OPIc rating)", y = "Vocabulary Knowledge")+
  theme_bw()

`geom_smooth()` using formula = 'y ~ x'

Correlations

R has several built-in functions for running correlations. It is important to specify which type of correlation you want to run.

In our data, OPIc_rating is an ordinal rating, so we should use a Spearman correlation.

cor(d$OPIc_rating, d$Vocab_score, method = "spearman")

[1] 0.8206502

The Spearman correlation between OPIc ratings and Vocabulary scores is .82. This is a strong, positive correlation.
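To see why Spearman suits ordinal ratings, it can help to know that a Spearman correlation is a Pearson correlation computed on ranks (tied values share an average rank). The sketch below checks this on a handful of made-up values; `rating` and `vocab` are illustrative vectors, not columns from correlation_practice.csv.

```r
# Made-up example values (NOT taken from correlation_practice.csv)
rating <- c(1, 2, 2, 3, 4, 4, 5)        # ordinal ratings with ties
vocab  <- c(12, 15, 14, 20, 22, 27, 30)

# Built-in Spearman correlation
rho_builtin <- cor(rating, vocab, method = "spearman")

# Same thing by hand: rank each variable, then run Pearson on the ranks
rho_by_hand <- cor(rank(rating), rank(vocab), method = "pearson")

# Compare the two results
all.equal(rho_builtin, rho_by_hand)
```

Because only the rank order enters the calculation, the unequal spacing between rating levels does not affect the result.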

To correlate two continuous variables, you don’t need to specify the method = argument, as Pearson correlations are the default, but we will do so anyway just to practice:

cor(d$Vocab_score, d$Grammar_score, method = "pearson")

[1] 0.8756754

While you cannot directly compare a Spearman correlation and a Pearson correlation, it seems likely that the correlation between Vocabulary and Grammar is stronger than the correlation between Speaking and Vocabulary.

Let’s look at all the pairwise correlations among the three numeric variables (columns 3 to 5), starting with Pearson:

cor(select(d, 3:5), method = "pearson")

              OPIc_rating Vocab_score Grammar_score
OPIc_rating     1.0000000   0.7931991     0.8484629
Vocab_score     0.7931991   1.0000000     0.8756754
Grammar_score   0.8484629   0.8756754     1.0000000

We can also run Spearman correlations:

cor(select(d, 3:5), method = "spearman")

              OPIc_rating Vocab_score Grammar_score
OPIc_rating     1.0000000   0.8206502     0.8180669
Vocab_score     0.8206502   1.0000000     0.8738085
Grammar_score   0.8180669   0.8738085     1.0000000
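Selecting columns by position (3:5) breaks silently if the column order changes, so one option is to select by name and round the matrix for reporting. This is a minimal sketch on a made-up data frame called `toy` that stands in for `d`; the values are illustrative only.

```r
library(dplyr)

# Made-up stand-in for `d` (values are illustrative, not the real data)
toy <- data.frame(
  ID            = c("a", "b", "c", "d", "e"),
  OPIc_rating   = c(1, 2, 2, 4, 5),
  Vocab_score   = c(10, 14, 13, 20, 24),
  Grammar_score = c(8, 12, 15, 18, 25)
)

# Select the numeric variables by name, correlate, and round to 2 decimals
cors <- toy %>%
  select(OPIc_rating, Vocab_score, Grammar_score) %>%
  cor(method = "spearman") %>%
  round(2)

cors
```

With the real data, the same pipeline should work by replacing `toy` with `d`.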

If you want to make an inference about the population of test takers, you can get p-values and confidence intervals using the cor.test() function:

cor.test(d$Vocab_score, d$Grammar_score)


    Pearson's product-moment correlation

data:  d$Vocab_score and d$Grammar_score
t = 13.076, df = 52, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.7941111 0.9262556
sample estimates:
      cor 
0.8756754
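To connect the pieces of this output, the sketch below recomputes the t statistic and the 95% confidence interval from just the reported correlation and degrees of freedom. It assumes the standard formulas for a Pearson correlation test: t = r * sqrt(df) / sqrt(1 - r^2), and a confidence interval built on Fisher's z transformation with standard error 1 / sqrt(n - 3).

```r
# Values copied from the cor.test() output above
r  <- 0.8756754
df <- 52
n  <- df + 2   # the Pearson test uses df = n - 2, so n = 54

# t statistic for the null hypothesis that the true correlation is 0
t_stat <- r * sqrt(df) / sqrt(1 - r^2)

# 95% confidence interval via Fisher's z transformation
z  <- atanh(r)                 # transform r to the z scale
se <- 1 / sqrt(n - 3)          # standard error on the z scale
ci <- tanh(z + c(-1, 1) * qnorm(0.975) * se)  # back-transform the limits

round(t_stat, 3)
round(ci, 4)
```

The results should agree with the t value and interval printed by cor.test(), up to rounding. Note that the interval is not symmetric around r, which is a consequence of the back-transformation.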