This R Markdown file demonstrates how to perform several types of correlation analysis in R. We will use a dataset named data_ITAproject.csv, which contains the columns ParticipantID, HolisticScore, Composite, Pronunciation, LexicalGrammar, RhetoricalOrganization, and TopicDevelopment.
Load Data
# Load the data
data_ITA <- read.csv("data_ITAproject.csv")
# Display the first few rows
head(data_ITA)
## ParticipantID HolisticScore Composite Pronunciation LexicalGrammar
## 1 1 50 1.49 1.77 2.11
## 2 2 60 3.14 3.59 4.04
## 3 3 45 0.75 0.49 0.62
## 4 4 50 2.58 3.14 3.19
## 5 5 45 0.65 0.02 0.92
## 6 6 40 -0.04 -0.13 -0.15
## RhetoricalOrganization TopicDevelopment
## 1 1.61 1.21
## 2 3.16 2.93
## 3 1.20 0.86
## 4 2.46 2.53
## 5 0.92 0.88
## 6 0.00 0.05
# Provide summary statistics of the data
summary(data_ITA)
## ParticipantID HolisticScore Composite Pronunciation
## Min. : 1.0 Min. :35.00 Min. :-0.690 Min. :-2.5100
## 1st Qu.: 32.5 1st Qu.:45.00 1st Qu.: 0.670 1st Qu.:-0.2750
## Median : 64.0 Median :45.00 Median : 1.150 Median : 0.5700
## Mean : 64.0 Mean :46.61 Mean : 1.231 Mean : 0.7391
## 3rd Qu.: 95.5 3rd Qu.:50.00 3rd Qu.: 1.720 3rd Qu.: 1.5750
## Max. :127.0 Max. :60.00 Max. : 3.140 Max. : 3.7700
## LexicalGrammar RhetoricalOrganization TopicDevelopment
## Min. :-0.920 Min. :-0.620 Min. :-0.680
## 1st Qu.: 0.730 1st Qu.: 1.040 1st Qu.: 0.910
## Median : 1.460 Median : 1.550 Median : 1.390
## Mean : 1.583 Mean : 1.567 Mean : 1.424
## 3rd Qu.: 2.300 3rd Qu.: 2.100 3rd Qu.: 1.960
## Max. : 4.370 Max. : 3.420 Max. : 2.930
Covariance measures the degree to which two variables change together. If one variable tends to go up when the other goes up, there is a positive covariance.
# Calculate covariance between HolisticScore and Composite
cov_Holistic_Composite <- cov(data_ITA$HolisticScore, data_ITA$Composite)
# Print the covariance
cov_Holistic_Composite
## [1] 4.02679
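The value returned by cov() can be reproduced from the definition of the sample covariance. The sketch below uses two short made-up vectors (x and y are illustrative values, not taken from data_ITAproject.csv) and also shows that covariance depends on the measurement scale, which is why its magnitude is hard to interpret on its own.

```r
# Illustrative vectors (hypothetical values, not from data_ITAproject.csv)
x <- c(35, 40, 45, 50, 60)
y <- c(-0.2, 0.4, 0.9, 1.8, 3.1)

n <- length(x)

# Sample covariance from its definition (note the n - 1 denominator)
cov_manual <- sum((x - mean(x)) * (y - mean(y))) / (n - 1)

# The manual value should agree with the built-in function
all.equal(cov_manual, cov(x, y))

# Rescaling one variable by 10 rescales the covariance by 10 as well
cov_rescaled <- cov(x * 10, y)
all.equal(cov_rescaled, 10 * cov(x, y))
```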
Pearson correlation measures the linear relationship between two variables and ranges between -1 and 1.
# Calculate Pearson correlation
pearson_corr <- cor(data_ITA$HolisticScore, data_ITA$Composite, method = "pearson")
# Print the Pearson correlation
pearson_corr
## [1] 0.8256058
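Pearson’s r is the covariance divided by the product of the two standard deviations, which removes the units and bounds the result between -1 and 1. A minimal sketch with made-up vectors (x and y are illustrative, not the ITA data):

```r
# Illustrative vectors (hypothetical values, not from data_ITAproject.csv)
x <- c(35, 40, 45, 50, 60)
y <- c(-0.2, 0.4, 0.9, 1.8, 3.1)

# Pearson's r as standardized covariance
r_manual <- cov(x, y) / (sd(x) * sd(y))

# Should match cor() with method = "pearson"
all.equal(r_manual, cor(x, y, method = "pearson"))

# Unlike covariance, the correlation is unchanged when a variable is rescaled
r_rescaled <- cor(x * 10, y)
all.equal(r_rescaled, cor(x, y))
```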
Spearman correlation assesses how well an arbitrary monotonic function can describe the relationship between two variables, without making any assumptions about the frequency distribution.
# Calculate Spearman correlation
spearman_corr <- cor(data_ITA$HolisticScore, data_ITA$Composite, method = "spearman")
# Print the Spearman correlation
spearman_corr
## [1] 0.7924679
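One way to see what Spearman’s rho measures: it is Pearson’s r computed on the ranks of the data, with tied values receiving their average rank. The vectors below are made up for illustration and include a tie.

```r
# Illustrative vectors with a tie in x (hypothetical values)
x <- c(35, 40, 45, 45, 50, 60)
y <- c(-0.2, 0.4, 0.9, 1.1, 1.8, 3.1)

# rank() assigns average ranks to ties by default
rho_manual <- cor(rank(x), rank(y), method = "pearson")

# Should match the built-in Spearman correlation
all.equal(rho_manual, cor(x, y, method = "spearman"))
```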
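Kendall’s tau is a rank-based measure that is well suited to ordinal data and to data with many tied values. Like Spearman’s rho, it is much less sensitive to outliers than Pearson’s r, and its values are typically smaller in magnitude than rho for the same data.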
# Calculate Kendall's Tau
kendall_tau <- cor(data_ITA$HolisticScore, data_ITA$Composite, method = "kendall")
# Print Kendall's Tau
kendall_tau
## [1] 0.6643445
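Kendall’s tau compares every pair of observations and counts whether the two variables order that pair the same way (concordant) or the opposite way (discordant). The sketch below computes this by hand for made-up data without ties; in that case the simple formula (tau-a) agrees with the tau-b value that cor() reports. With ties, cor() applies a correction that this sketch does not implement.

```r
# Illustrative data without ties (hypothetical values)
x <- c(1, 2, 3, 4, 5, 6)
y <- c(2, 1, 4, 3, 6, 5)

# All unordered pairs of observations (one pair per column)
pair_idx <- combn(length(x), 2)

# +1 if a pair is ordered the same way in x and y, -1 if ordered oppositely
pair_sign <- sign(x[pair_idx[1, ]] - x[pair_idx[2, ]]) *
  sign(y[pair_idx[1, ]] - y[pair_idx[2, ]])

n_concordant <- sum(pair_sign > 0)
n_discordant <- sum(pair_sign < 0)

# Tau-a: difference in counts divided by the number of pairs
tau_manual <- (n_concordant - n_discordant) / ncol(pair_idx)

all.equal(tau_manual, cor(x, y, method = "kendall"))
```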
A correlation matrix provides a summary of the correlation between all possible pairs of variables in the dataset.
# Calculate correlation matrix
cor_matrix <- cor(data_ITA[,2:7], method = "pearson")
# Print correlation matrix
cor_matrix
## HolisticScore Composite Pronunciation LexicalGrammar
## HolisticScore 1.0000000 0.8256058 0.6797656 0.8704173
## Composite 0.8256058 1.0000000 0.8834001 0.9625157
## Pronunciation 0.6797656 0.8834001 1.0000000 0.8056969
## LexicalGrammar 0.8704173 0.9625157 0.8056969 1.0000000
## RhetoricalOrganization 0.7668623 0.9515875 0.7328956 0.8992653
## TopicDevelopment 0.7586988 0.9274102 0.6719024 0.8850215
## RhetoricalOrganization TopicDevelopment
## HolisticScore 0.7668623 0.7586988
## Composite 0.9515875 0.9274102
## Pronunciation 0.7328956 0.6719024
## LexicalGrammar 0.8992653 0.8850215
## RhetoricalOrganization 1.0000000 0.9604132
## TopicDevelopment 0.9604132 1.0000000
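A full matrix repeats every coefficient twice, so it can help to list each pair once, sorted by strength. The sketch below does this for a few columns of the built-in mtcars dataset, used here only as a stand-in; the same steps should apply to cor_matrix. The names m and pairs_df are illustrative.

```r
# Correlation matrix for a few columns of a built-in dataset (stand-in data)
m <- cor(mtcars[, c("mpg", "disp", "hp", "wt")], method = "pearson")

# Row/column positions of the upper triangle (each pair appears once)
idx <- which(upper.tri(m), arr.ind = TRUE)

# One row per variable pair
pairs_df <- data.frame(
  var1 = rownames(m)[idx[, "row"]],
  var2 = colnames(m)[idx[, "col"]],
  r    = m[idx]
)

# Strongest relationships first, regardless of sign
pairs_df <- pairs_df[order(-abs(pairs_df$r)), ]
pairs_df
```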
Visualizing correlations is often more intuitive when using heatmaps. A heatmap color-codes values in a matrix, providing a graphical representation of the strength and direction of each correlation.
Here, we will use the ggcorrplot package to create a heatmap alongside the numerical correlation coefficients. This provides a visually rich and data-dense representation of how the variables relate to each other.
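In this heatmap, the color of each circle shows the direction of the correlation (tomato red for negative, white for near zero, green for positive), the size of the circle shows its strength, and the printed labels give the coefficients. Only the lower triangle is drawn (type = "lower"), and hc.order = TRUE reorders the variables by hierarchical clustering so that similar variables sit next to each other.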
This visualization allows us to quickly identify patterns in the dataset, thus serving as an effective exploratory tool.
# Install the ggcorrplot package if you haven't
# install.packages("ggcorrplot")
# Load the library
library(ggcorrplot)
## Loading required package: ggplot2
# Generate the plot
ggcorrplot(cor_matrix, hc.order = TRUE, type = "lower",
           lab = TRUE, lab_size = 3, method = "circle",
           colors = c("tomato2", "white", "springgreen3"),
           title = "Correlation Matrix with ggcorrplot",
           ggtheme = ggplot2::theme_minimal())
After understanding the basic correlations between different pairs of variables, it’s often useful to visualize these relationships along with the distribution of each variable. The pairs.panels function from the psych package allows us to create a scatterplot matrix, also known as a “splom,” which showcases the relationships and distributions all in one plot.
# Load the psych package if not already loaded
library(psych)
##
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
# Create the scatterplot matrix
psych::pairs.panels(data_ITA[, 2:7],    # Columns 2 to 7 of data_ITA (all score variables)
                    method = "pearson", # Use Pearson correlation
                    density = TRUE,     # Add density curves to the histograms
                    ellipses = FALSE)   # Do not draw correlation ellipses
For this part of the demonstration, we will aim to answer a specific research question:
“What is the relationship between Age of Acquisition (AOA) and English Pronunciation (PronEng)?”
We will use the dataset FlegeYeniKomshianLiu.sav.
# Load the required package
library(foreign)
# Load the dataset
data_RQ <- read.spss("FlegeYeniKomshianLiu.sav", to.data.frame = TRUE)
## re-encoding from CP1252
# Display the first few rows
head(data_RQ)
## L1 Group AOA LOR Age EngUse KorUse PronEng PronKor
## 1 2 2 2.5 20 20.73 4.57 2.78 7.19 2.25
## 2 2 2 2.5 17 20.42 4.86 2.00 8.57 3.63
## 3 2 2 3.5 17 20.63 4.43 2.56 7.47 2.00
## 4 2 2 3.5 18 21.19 4.83 2.44 6.81 1.67
## 5 2 2 2.5 19 20.45 5.00 2.44 7.22 3.04
## 6 2 2 3.5 20 22.68 5.00 2.44 7.05 1.97
We start by plotting histograms to get a sense of the data distribution.
# Load ggplot2 if not already loaded
library(ggplot2)
# Histogram for AOA
ggplot(data_RQ, aes(x = AOA)) +
  geom_histogram(binwidth = 2, fill = "white", color = "black") +
  ggtitle("Histogram of AOA") +
  xlab("Age of Acquisition (AOA)") +
  ylab("Frequency")
# Histogram for PronEng
ggplot(data_RQ, aes(x = PronEng)) +
  geom_histogram(binwidth = 1, fill = "white", color = "black") +
  ggtitle("Histogram of PronEng") +
  xlab("English Pronunciation (PronEng)") +
  ylab("Frequency")
Q-Q Plots can help us identify if the data departs from normality.
# Q-Q Plot for AOA
qqnorm(data_RQ$AOA)
qqline(data_RQ$AOA)
# Q-Q Plot for PronEng
qqnorm(data_RQ$PronEng)
qqline(data_RQ$PronEng)
Interpreting Q-Q Plots
In a normal Q-Q plot, the x-axis shows the theoretical quantiles of a normal distribution and the y-axis shows the observed (sample) quantiles of the data. The reference line drawn by qqline() passes through the first and third quartiles; it coincides with the 45-degree line only when the data are standardized. For normally distributed data, the points fall approximately along this line.
When interpreting the plot, pay attention to how the points depart from the reference line: points that bend away from the line in opposite directions at the two ends (an S shape) indicate tails that are heavier or lighter than normal, a bow to one side indicates skewness, and isolated points far from the line indicate outliers.
Note that Q-Q plots are more reliable when the sample size is large. For small sample sizes, minor deviations from the line shouldn’t be over-interpreted.
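To make the construction concrete, a normal Q-Q plot can be built by hand: sort the data, pair each value with the normal quantile at the matching plotting position, and draw a line through the quartiles. The sketch below uses simulated right-skewed data (not AOA or PronEng) and checks the result against qqnorm(); the variable names are illustrative.

```r
set.seed(1)
x <- rexp(100)  # simulated right-skewed data, for illustration only

# Theoretical normal quantiles at the standard plotting positions
theoretical <- qnorm(ppoints(length(x)))
# Observed quantiles are simply the sorted data
observed <- sort(x)

plot(theoretical, observed,
     xlab = "Theoretical Quantiles", ylab = "Sample Quantiles",
     main = "Normal Q-Q Plot (built by hand)")

# Reference line through the first and third quartiles, as qqline() draws it
probs <- c(0.25, 0.75)
slope <- diff(quantile(x, probs, names = FALSE)) / diff(qnorm(probs))
intercept <- quantile(x, probs[1], names = FALSE) - slope * qnorm(probs[1])
abline(a = intercept, b = slope)

# The hand-built quantiles should match those used by qqnorm()
qq <- qqnorm(x, plot.it = FALSE)
all.equal(sort(qq$x), theoretical)
```

For skewed data like this, the points bow away from the line instead of following it.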
The Shapiro-Wilk test can be used for a formal check of normality.
# Shapiro-Wilk Test for AOA
shapiro.test(data_RQ$AOA)
##
## Shapiro-Wilk normality test
##
## data: data_RQ$AOA
## W = 0.95589, p-value = 1.044e-06
# Shapiro-Wilk Test for PronEng
shapiro.test(data_RQ$PronEng)
##
## Shapiro-Wilk normality test
##
## data: data_RQ$PronEng
## W = 0.95275, p-value = 4.685e-07
Interpreting Results:
A p-value below 0.05 leads us to reject the null hypothesis of normality, suggesting that the data are not normally distributed. A p-value above 0.05 means the test found no evidence against normality; it does not prove that the data are normal.
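The object returned by shapiro.test() is a list, so the W statistic and p-value can be pulled out and used in a decision rule instead of being read off the printed output. A minimal sketch on simulated, clearly non-normal data (the names skewed, res, and reject_normality are illustrative):

```r
set.seed(42)
skewed <- rexp(200)  # simulated exponential data, for illustration only

res <- shapiro.test(skewed)

# Components of the returned "htest" object
w_stat  <- unname(res$statistic)  # the W statistic
p_value <- res$p.value

# Decision at the conventional 0.05 level
reject_normality <- p_value < 0.05
reject_normality
```

With large samples the test can flag even small, practically unimportant departures from normality, so it is best read alongside the histograms and Q-Q plots.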
Decision Based on Data Normality
Interpretation of Normality Tests:
From the histograms, Q-Q plots, and the Shapiro-Wilk tests (p < .001 for both variables), it is evident that neither AOA nor PronEng is normally distributed. Pearson correlation may therefore not be the most appropriate choice for our data.
Spearman’s Rank Correlation as an Alternative
Because our data do not meet the assumption of normality, Spearman’s rank correlation is recommended as an alternative to Pearson. Spearman’s correlation does not assume that the data are normally distributed and is particularly useful when the relationship between the variables is monotonic but not linear.
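A small made-up example illustrates the difference: when y increases with x along a curve, Spearman’s rho is 1 because the ordering is perfectly preserved, while Pearson’s r falls below 1 because the points do not lie on a straight line.

```r
# A perfectly monotonic but non-linear relationship (illustrative values)
x <- 1:20
y <- exp(x / 3)

pearson_r    <- cor(x, y, method = "pearson")
spearman_rho <- cor(x, y, method = "spearman")

# Spearman's rho is 1 because the ranks of x and y agree exactly
all.equal(spearman_rho, 1)

# Pearson's r is smaller because the relationship is curved
pearson_r < spearman_rho
```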
Scatterplot
To check for linearity, we will plot the variables against each other.
# Scatterplot
scatter <- ggplot(data_RQ, aes(x = AOA, y = PronEng)) +
  geom_point() +
  geom_smooth(method = "lm", color = "blue", se = FALSE) +
  geom_smooth(method = "loess", color = "red", se = FALSE) +
  labs(title = "Scatterplot with Linear and Loess Lines",
       x = "Age of Acquisition",
       y = "English Pronunciation")
scatter
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
Interpreting Results:
The blue line represents the linear model. If the data points cluster around this line, it suggests a linear relationship. The red line represents the loess line, which provides a more flexible fit. If the data points closely follow this line but not the blue line, the relationship might be non-linear. By checking these assumptions, we ensure that our subsequent analysis will be valid and interpretable.
Now that we have established that our data does not meet the assumption of normality, we’ll proceed with Spearman’s rank correlation. Spearman’s correlation is a non-parametric test that does not assume that the data is normally distributed. This makes it a suitable choice for our dataset.
Running the Spearman Correlation
We can run the Spearman correlation using the cor.test() function, specifying the method as “spearman”.
# Running Spearman correlation
cor_result <- cor.test(data_RQ$AOA, data_RQ$PronEng, method = "spearman")
## Warning in cor.test.default(data_RQ$AOA, data_RQ$PronEng, method = "spearman"):
## Cannot compute exact p-value with ties
cor_result
##
## Spearman's rank correlation rho
##
## data: data_RQ$AOA and data_RQ$PronEng
## S = 4272925, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## -0.8546005
Here, cor_result will contain the Spearman’s rho value along with the p-value.
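The warning about ties appears because an exact p-value is only defined when no values are tied; R then falls back to an approximation. Passing exact = FALSE requests the approximation directly and avoids the warning. The sketch below uses made-up vectors with ties (not the Flege et al. data) and shows how to extract the estimate and p-value from the result.

```r
# Illustrative data containing ties in x (hypothetical values)
x <- c(2.5, 2.5, 3.5, 3.5, 5, 7, 9, 12, 15, 15, 18, 21)
y <- c(8.6, 7.2, 7.5, 6.8, 7.0, 6.1, 5.5, 4.9, 4.2, 4.4, 3.8, 3.1)

# exact = FALSE asks for the approximate p-value up front
res <- cor.test(x, y, method = "spearman", exact = FALSE)

# Pull the pieces out of the returned "htest" object
rho     <- unname(res$estimate)
p_value <- res$p.value

rho
p_value
```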
Calculating 95% Confidence Interval for Spearman’s Rho
We’ll use the RVAideMemoire package to calculate the 95% confidence interval for Spearman’s rho.
# Load the RVAideMemoire package
# install.packages('RVAideMemoire')
library(RVAideMemoire)
## *** Package RVAideMemoire v 0.9-83-7 ***
# Calculating 95% CI for Spearman's rho
ci_result <- spearman.ci(data_RQ$AOA, data_RQ$PronEng, nrep = 1000, conf.level = 0.95)
ci_result
##
## Spearman's rank correlation
##
## data: data_RQ$AOA and data_RQ$PronEng
## 1000 replicates
##
## 95 percent confidence interval:
## -0.8829082 -0.8224301
## sample estimates:
## rho
## -0.8546005
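The “1000 replicates” in the output reflects that the interval is obtained by bootstrapping. To show the idea, here is a minimal percentile-bootstrap sketch in base R on simulated data. It is an illustration under assumed settings, not necessarily the exact procedure spearman.ci() uses, and its numbers will not match the output above.

```r
set.seed(123)

# Simulated data with a strong negative relationship (illustrative only)
n <- 100
x <- runif(n, min = 1, max = 23)
y <- 9 - 0.25 * x + rnorm(n, sd = 0.8)

rho_hat <- cor(x, y, method = "spearman")

# Resample (x, y) pairs with replacement and recompute rho each time
n_rep <- 1000
boot_rho <- replicate(n_rep, {
  idx <- sample.int(n, replace = TRUE)
  cor(x[idx], y[idx], method = "spearman")
})

# Percentile interval: the middle 95% of the bootstrap estimates
ci <- quantile(boot_rho, probs = c(0.025, 0.975))

rho_hat
ci
```

Because the resampling is random, the interval changes slightly from run to run unless a seed is set.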
Interpretation of Spearman’s Rank Correlation Results
Spearman’s Rho Value
The Spearman’s rho value is -0.855, which is very close to -1. This indicates a strong negative correlation between Age of Acquisition (AOA) and English Pronunciation (PronEng). In practical terms, this suggests that as the age of acquiring English (AOA) increases, the quality of English pronunciation (PronEng) tends to decrease, or vice versa.
95% Confidence Interval
The 95% confidence interval for the Spearman’s rho is between -0.883 and -0.822. The confidence interval is narrow and does not contain zero, further confirming that the correlation is both strong and statistically significant. Specifically, we can be 95% confident that the true Spearman’s rho in the population lies between -0.883 and -0.822.
Overall Conclusion
The analysis clearly indicates a strong, negative relationship between AOA and PronEng. Given the narrow 95% confidence interval that does not contain zero, we can say that this relationship is statistically significant. Therefore, the age at which one acquires English is strongly and inversely related to the quality of English pronunciation.