Rmarkdown: Correlation

Introduction

This Rmarkdown file serves as a demonstration of how to perform various types of correlation analyses using R. We will be using a dataset named data_ITAproject.csv that contains columns such as Participant ID, Holistic Score, Composite, Pronunciation, Lexical Grammar, Rhetorical Organization, and Topic Development.

Load Data

# Load the data
data_ITA <- read.csv("data_ITAproject.csv")

# Display the first few rows
head(data_ITA)

##   ParticipantID HolisticScore Composite Pronunciation LexicalGrammar
## 1             1            50      1.49          1.77           2.11
## 2             2            60      3.14          3.59           4.04
## 3             3            45      0.75          0.49           0.62
## 4             4            50      2.58          3.14           3.19
## 5             5            45      0.65          0.02           0.92
## 6             6            40     -0.04         -0.13          -0.15
##   RhetoricalOrganization TopicDevelopment
## 1                   1.61             1.21
## 2                   3.16             2.93
## 3                   1.20             0.86
## 4                   2.46             2.53
## 5                   0.92             0.88
## 6                   0.00             0.05

# Provide summary statitiscs of the data
summary(data_ITA)

##  ParticipantID   HolisticScore     Composite      Pronunciation    
##  Min.   :  1.0   Min.   :35.00   Min.   :-0.690   Min.   :-2.5100  
##  1st Qu.: 32.5   1st Qu.:45.00   1st Qu.: 0.670   1st Qu.:-0.2750  
##  Median : 64.0   Median :45.00   Median : 1.150   Median : 0.5700  
##  Mean   : 64.0   Mean   :46.61   Mean   : 1.231   Mean   : 0.7391  
##  3rd Qu.: 95.5   3rd Qu.:50.00   3rd Qu.: 1.720   3rd Qu.: 1.5750  
##  Max.   :127.0   Max.   :60.00   Max.   : 3.140   Max.   : 3.7700  
##  LexicalGrammar   RhetoricalOrganization TopicDevelopment
##  Min.   :-0.920   Min.   :-0.620         Min.   :-0.680  
##  1st Qu.: 0.730   1st Qu.: 1.040         1st Qu.: 0.910  
##  Median : 1.460   Median : 1.550         Median : 1.390  
##  Mean   : 1.583   Mean   : 1.567         Mean   : 1.424  
##  3rd Qu.: 2.300   3rd Qu.: 2.100         3rd Qu.: 1.960  
##  Max.   : 4.370   Max.   : 3.420         Max.   : 2.930

Covariance

Covariance measures the degree to which two variables change together. If one variable tends to go up when the other goes up, there is a positive covariance.

# Calculate covariance between HolisticScore and Composite
cov_Holistic_Composite <- cov(data_ITA$HolisticScore, data_ITA$Composite)

# Print the covariance
cov_Holistic_Composite

## [1] 4.02679

Pearson Correlation

Pearson correlation measures the linear relationship between two variables and ranges between -1 and 1.

# Calculate Pearson correlation
pearson_corr <- cor(data_ITA$HolisticScore, data_ITA$Composite, method = "pearson")

# Print the Pearson correlation
pearson_corr

## [1] 0.8256058

Spearman Correlation

Spearman correlation assesses how well an arbitrary monotonic function can describe the relationship between two variables, without making any assumptions about the frequency distribution.

# Calculate Spearman correlation
spearman_corr <- cor(data_ITA$HolisticScore, data_ITA$Composite, method = "spearman")

# Print the Spearman correlation
spearman_corr

## [1] 0.7924679

Kendall’s Tau

Kendall’s Tau is used for ordinal data and is less sensitive to outliers compared to Pearson and Spearman.

# Calculate Kendall's Tau
kendall_tau <- cor(data_ITA$HolisticScore, data_ITA$Composite, method = "kendall")

# Print Kendall's Tau
kendall_tau

## [1] 0.6643445

Correlation Matrix

A correlation matrix provides a summary of the correlation between all possible pairs of variables in the dataset.

# Calculate correlation matrix
cor_matrix <- cor(data_ITA[,2:7], method = "pearson")

# Print correlation matrix
cor_matrix

##                        HolisticScore Composite Pronunciation LexicalGrammar
## HolisticScore              1.0000000 0.8256058     0.6797656      0.8704173
## Composite                  0.8256058 1.0000000     0.8834001      0.9625157
## Pronunciation              0.6797656 0.8834001     1.0000000      0.8056969
## LexicalGrammar             0.8704173 0.9625157     0.8056969      1.0000000
## RhetoricalOrganization     0.7668623 0.9515875     0.7328956      0.8992653
## TopicDevelopment           0.7586988 0.9274102     0.6719024      0.8850215
##                        RhetoricalOrganization TopicDevelopment
## HolisticScore                       0.7668623        0.7586988
## Composite                           0.9515875        0.9274102
## Pronunciation                       0.7328956        0.6719024
## LexicalGrammar                      0.8992653        0.8850215
## RhetoricalOrganization              1.0000000        0.9604132
## TopicDevelopment                    0.9604132        1.0000000

Correlation Matrix with Heatmap

Visualizing correlations is often more intuitive when using heatmaps. A heatmap color-codes values in a matrix, providing a graphical representation of the strength and direction of each correlation.

Here, we will use the ggcorrplot package to create a heatmap alongside the numerical correlation coefficients. This provides a visually rich and data-dense representation of how the variables relate to each other.

In this heatmap:

A color closer to green represents a positive correlation.
A color closer to red represents a negative correlation.
The absence of color (white) signifies no or weak correlation.

This visualization allows us to quickly identify patterns in the dataset, thus serving as an effective exploratory tool.

# Install the ggcorrplot package if you haven't
# install.packages("ggcorrplot")

# Load the library
library(ggcorrplot)

## Loading required package: ggplot2

# Generate the plot
ggcorrplot(cor_matrix, hc.order = TRUE, type = "lower", 
           lab = TRUE, lab_size = 3, method="circle", 
           colors = c("tomato2", "white", "springgreen3"), 
           title="Correlation Matrix with ggcorrplot", 
           ggtheme=ggplot2::theme_minimal())

Scatterplot Matrix with Distribution

After understanding the basic correlations between different pairs of variables, it’s often useful to visualize these relationships along with the distribution of each variable. The pairs.panels function from the psych package allows us to create a scatterplot matrix, also known as a “splom,” which showcases the relationships and distributions all in one plot.

# Load the psych package if not already loaded
library(psych)

## 
## Attaching package: 'psych'

## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha

# Create the scatterplot matrix
psych::pairs.panels(data_ITA[, 2:7],  # Selecting columns 2 to 7 from the data_ITA dataset
                    method = "pearson",  # Using Pearson correlation
                    density = TRUE,  # Add density plots
                    ellipses = FALSE)  # Do not add concentration ellipses

For this part of the demonstration, we will aim to answer a specific research question:

“What is the relationship between Age of Acquisition (AOA) and English Pronunciation (PronEng)?”

We will use the dataset FlegeYeniKomshianLiu.sav.

Loading the Data

# Load the required package
library(foreign)

# Load the dataset
data_RQ <- read.spss("FlegeYeniKomshianLiu.sav", to.data.frame=T)

## re-encoding from CP1252

# Display the first few rows
head(data_RQ)

##   L1 Group AOA LOR   Age EngUse KorUse PronEng PronKor
## 1  2     2 2.5  20 20.73   4.57   2.78    7.19    2.25
## 2  2     2 2.5  17 20.42   4.86   2.00    8.57    3.63
## 3  2     2 3.5  17 20.63   4.43   2.56    7.47    2.00
## 4  2     2 3.5  18 21.19   4.83   2.44    6.81    1.67
## 5  2     2 2.5  19 20.45   5.00   2.44    7.22    3.04
## 6  2     2 3.5  20 22.68   5.00   2.44    7.05    1.97

Checking Assumptions

1) Normality

Histograms

We start by plotting histograms to get a sense of the data distribution.

# Load ggplot2 if not already loaded
library(ggplot2)

# Histogram for AOA
ggplot(data_RQ, aes(x=AOA)) +
  geom_histogram(binwidth=2, fill="white", color = "black") +
  ggtitle("Histogram of AOA") +
  xlab("Age of Acquisition (AOA)") +
  ylab("Frequency")

# Histogram for PronEng
ggplot(data_RQ, aes(x=PronEng)) +
  geom_histogram(binwidth=1, fill="white", color = "black") +
  ggtitle("Histogram of PronEng") +
  xlab("English Pronunciation (PronEng)") +
  ylab("Frequency")

Q-Q Plot

Q-Q Plots can help us identify if the data departs from normality.

# Q-Q Plot for AOA
qqnorm(data_RQ$AOA)
qqline(data_RQ$AOA)

# Q-Q Plot for PronEng
qqnorm(data_RQ$PronEng)
qqline(data_RQ$PronEng)

Interpreting Q-Q Plots In a Q-Q plot for normality, the x-axis displays the expected quantiles of the normal distribution, and the y-axis shows the observed quantiles of the data. The points should fall approximately along a straight line (the 45-degree line, also known as the “line of equality”) for data that are normally distributed. Here are some guidelines for interpreting Q-Q plots:

When interpreting the Q-Q plot, pay attention to how the points align with the line of equality (the straight line in the plot):

Closely Aligned: Suggests that the data is normally distributed.
Curved or “S” Shape: Indicates the presence of skewness or heavy-tailedness, suggesting non-normality.
Systematic Deviations: May indicate that the data follows a different kind of distribution.

Note that Q-Q plots are more reliable when the sample size is large. For small sample sizes, minor deviations from the line shouldn’t be over-interpreted.

Shapiro-Wilk Test

The Shapiro-Wilk test can be used for a formal check of normality.

# Shapiro-Wilk Test for AOA
shapiro.test(data_RQ$AOA)

## 
##  Shapiro-Wilk normality test
## 
## data:  data_RQ$AOA
## W = 0.95589, p-value = 1.044e-06

# Shapiro-Wilk Test for PronEng
shapiro.test(data_RQ$PronEng)

## 
##  Shapiro-Wilk normality test
## 
## data:  data_RQ$PronEng
## W = 0.95275, p-value = 4.685e-07

Interpreting Results:

A p-value below 0.05 suggests that the data is not normally distributed. A p-value above 0.05 suggests the data is approximately normally distributed.

Decision Based on Data Normality

Interpretation of Normality Tests:

Histograms: Both variables show a distribution that deviates from the bell curve.
Q-Q Plots: The points do not align closely with the line of equality, indicating non-normality.
Shapiro-Wilk Test: The p-values are below 0.05, suggesting that the data is not normally distributed.

From the histograms, Q-Q plots, and the Shapiro-Wilk test, it’s evident that our data for both AOA and PronEng are not normally distributed. Due to these findings, using Pearson correlation may not be the most appropriate choice for our data. Spearman’s Rank Correlation as an alternative.

Given that our data does not meet the assumption of normality, Spearman’s rank correlation is recommended as an alternative to Pearson. Spearman’s correlation does not assume that the data is normally distributed and is particularly useful when the relationship between variables is not linear.

2) Linearity

Scatterplot

To check for linearity, we will plot the variables against each other.

# Scatterplot
scatter <- ggplot(data_RQ, aes(x=AOA, y=PronEng)) +
  geom_point() +
  geom_smooth(method="lm", color="blue", se=FALSE) +
  geom_smooth(method="loess", color="red", se=FALSE) +
  labs(title = "Scatterplot with Linear and Loess Lines",
       x = "Age of Acquisition", 
       y = "English Pronunciation")
scatter

## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'

Interpreting Results:

The blue line represents the linear model. If the data points cluster around this line, it suggests a linear relationship. The red line represents the loess line, which provides a more flexible fit. If the data points closely follow this line but not the blue line, the relationship might be non-linear. By checking these assumptions, we ensure that our subsequent analysis will be valid and interpretable.

Correlational Analysis

Now that we have established that our data does not meet the assumption of normality, we’ll proceed with Spearman’s rank correlation. Spearman’s correlation is a non-parametric test that does not assume that the data is normally distributed. This makes it a suitable choice for our dataset.

Running the Spearman Correlation

We can run the Spearman correlation using the cor.test() function, specifying the method as “spearman”.

# Running Spearman correlation
cor_result <- cor.test(data_RQ$AOA, data_RQ$PronEng, method = "spearman")

## Warning in cor.test.default(data_RQ$AOA, data_RQ$PronEng, method = "spearman"):
## Cannot compute exact p-value with ties

cor_result

## 
##  Spearman's rank correlation rho
## 
## data:  data_RQ$AOA and data_RQ$PronEng
## S = 4272925, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##        rho 
## -0.8546005

Here, cor_result will contain the Spearman’s rho value along with the p-value.

Calculating 95% Confidence Interval for Spearman’s Rho

We’ll use the RVAideMemoire package to calculate the 95% confidence interval for Spearman’s rho.

# Load the RVAideMemoire package
# install.packages('RVAideMemoire')
library(RVAideMemoire)

## *** Package RVAideMemoire v 0.9-83-7 ***

# Calculating 95% CI for Spearman's rho
ci_result <- spearman.ci(data_RQ$AOA, data_RQ$PronEng, nrep = 1000, conf.level = 0.95)
ci_result

## 
##  Spearman's rank correlation
## 
## data:  data_RQ$AOA and data_RQ$PronEng
## 1000 replicates
## 
## 95 percent confidence interval:
##  -0.8829082 -0.8224301
## sample estimates:
##        rho 
## -0.8546005

Interpretation of Spearman’s Rank Correlation Results

Spearman’s Rho Value

The Spearman’s rho value is -0.855, which is very close to -1. This indicates a strong negative correlation between Age of Acquisition (AOA) and English Pronunciation (PronEng). In practical terms, this suggests that as the age of acquiring English (AOA) increases, the quality of English pronunciation (PronEng) tends to decrease, or vice versa.

95% Confidence Interval

The 95% confidence interval for the Spearman’s rho is between -0.883 and -0.822. The confidence interval is narrow and does not contain zero, further confirming that the correlation is both strong and statistically significant. Specifically, we can be 95% confident that the true Spearman’s rho in the population lies between -0.883 and -0.822.

Overall Conclusion

The analysis clearly indicates a strong, negative relationship between AOA and PronEng. Given the narrow 95% confidence interval that does not contain zero, we can say that this relationship is statistically significant. Therefore, the age at which one acquires English is strongly and inversely related to the quality of English pronunciation.