2024-02-01

Correlation: Overview

Correlation is a statistical measure that quantifies the extent to which two variables move in relation to each other. It assesses the strength and direction of a linear relationship between the variables.

In simpler terms, it helps us understand how changes in one variable correspond to changes in another.

  • The sign of the correlation coefficient (positive or negative) indicates the direction of the relationship.
  • Strength is measured by the magnitude of the correlation coefficient, which ranges from -1 to 1.

Types of Correlation

  • Positive Correlation: Exists when both variables increase or decrease together.
    • Example: As the temperature increases, ice cream sales also increase.
  • Negative Correlation: Occurs when one variable increases while the other decreases.
    • Example: As the number of exercise hours increases, patient weight tends to decrease.
  • Zero Correlation: Implies no linear relationship, so changes in one variable do not linearly predict changes in the other.
    • Example: The height of a person and the number of books they own

Correlation Coefficient

The Pearson correlation coefficient (\(r\)) between two variables \(X\) and \(Y\) is calculated using the following formula:

\[ r = \frac{{\sum (X_i - \bar{X})(Y_i - \bar{Y})}}{{\sqrt{\sum (X_i - \bar{X})^2 \sum (Y_i - \bar{Y})^2}}} \]

where \(\bar{X}\) and \(\bar{Y}\) are the means of variables \(X\) and \(Y\), respectively, and each sum runs over all \(n\) paired observations \((X_i, Y_i)\).

The closer the correlation coefficient is to -1 or 1, the stronger the correlation. A coefficient of 0 indicates no linear correlation.
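As a quick illustration of these endpoints, the sketch below uses small made-up vectors (the names `x`, `y_up`, `y_down`, and `y_curve` are illustrative only and are not part of the case-study data). It shows a coefficient of 1, a coefficient of -1, and a coefficient of 0 for a relationship that is exact but not linear.

```r
# Illustrative, made-up data: five evenly spaced values centred on zero
x <- c(-2, -1, 0, 1, 2)

y_up    <- 2 * x + 1   # exact increasing straight line
y_down  <- -3 * x      # exact decreasing straight line
y_curve <- x^2         # fully determined by x, but not linear

r_up    <- cor(x, y_up)     # perfect positive linear relationship
r_down  <- cor(x, y_down)   # perfect negative linear relationship
r_curve <- cor(x, y_curve)  # no linear relationship despite the dependence

print(c(r_up = r_up, r_down = r_down, r_curve = r_curve))
```

The last case is why a coefficient of 0 is described as no linear correlation: `y_curve` is completely determined by `x`, yet the coefficient is 0 because the relationship is symmetric around zero rather than linear.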

Calculate Correlation Coefficient in R

# Example data
X <- c(2, 4, 5, 7, 9)
Y <- c(10, 8, 5, 3, 1)

# Calculate correlation coefficient
correlation_coefficient <- cor(X, Y)

# Print the result
print(paste("Correlation Coefficient (r):", correlation_coefficient))
## [1] "Correlation Coefficient (r): -0.984429192109557"
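
To connect `cor()` back to the Pearson formula, the following sketch computes the numerator and denominator step by step on the same example data. The intermediate names (`dx`, `dy`, `numerator`, `denominator`, `r_manual`) are introduced here only for illustration.

```r
# Same example data as above
X <- c(2, 4, 5, 7, 9)
Y <- c(10, 8, 5, 3, 1)

# Deviations of each observation from its mean
dx <- X - mean(X)
dy <- Y - mean(Y)

# Numerator: sum of the products of paired deviations
numerator <- sum(dx * dy)

# Denominator: square root of the product of the sums of squared deviations
denominator <- sqrt(sum(dx^2) * sum(dy^2))

r_manual <- numerator / denominator

# Compare the hand-computed value with the built-in function
print(r_manual)
print(all.equal(r_manual, cor(X, Y)))
```

Both routes should agree up to floating-point rounding, since `cor()` uses the Pearson method by default.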

Case Study: Life Expectancy vs. BMI

## [1] "Correlation between Life Expectancy and BMI: 0.567693547545986"
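
A result like this might be produced along the following lines. This is only a sketch: the tiny data frame below is made up so that the snippet runs on its own, the column name `BMI` is an assumption, and the printed value will not match the case-study figure.

```r
# Made-up stand-in for the case-study data; the values are illustrative only
# and the column name `BMI` is an assumption
data <- data.frame(
  Life.expectancy = c(65.0, 59.9, 77.8, NA, 74.5, 81.2),
  BMI             = c(19.1, 18.6, 26.0, 23.3, NA, 27.5)
)

# use = "complete.obs" drops any row where either value is missing;
# without it, cor() returns NA as soon as one value is missing
r_le_bmi <- cor(data$Life.expectancy, data$BMI, use = "complete.obs")

print(paste("Correlation between Life Expectancy and BMI:", r_le_bmi))
```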

Interpreting Results

  • A correlation coefficient of \(r = 0.5677\) indicates a moderate positive linear correlation between Life Expectancy and BMI.
  • This suggests that as BMI increases, Life Expectancy also tends to increase.
  • However, it does not show that a higher BMI causes a longer life, which brings up the important distinction between causation and correlation.

Causation Vs. Correlation

The phrase “correlation does not imply causation” is a fundamental principle in statistics and research methodology. It emphasizes that just because two variables are correlated (i.e., there is a statistical association between them) does not mean that one variable causes the other.

  • Causation: Implies a cause-and-effect relationship between two variables, where changes in one variable directly cause changes in the other.
  • Correlation: Indicates that there is a statistical association between two variables, but it does not establish that either variable causes the other, nor in which direction any causal effect runs.
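
One way to see the difference is to simulate a common cause. In the sketch below, a hypothetical `temperature` variable drives two outcomes (`ice_cream_sales` and `sunburn_cases`, both invented for illustration). Neither outcome influences the other, yet they end up strongly correlated; once the linear effect of temperature is removed, the correlation largely disappears.

```r
set.seed(42)  # fixed seed so the simulation is reproducible

n <- 1000

# Hypothetical common cause: daily temperature in degrees Celsius
temperature <- runif(n, min = 10, max = 35)

# Both outcomes depend on temperature plus independent noise;
# neither outcome appears in the other's equation
ice_cream_sales <- 2.0 * temperature + rnorm(n, sd = 5)
sunburn_cases   <- 1.5 * temperature + rnorm(n, sd = 5)

# Raw correlation between the two outcomes
r_raw <- cor(ice_cream_sales, sunburn_cases)

# Remove the linear effect of temperature from each outcome,
# then correlate what is left over
res_ice  <- resid(lm(ice_cream_sales ~ temperature))
res_burn <- resid(lm(sunburn_cases ~ temperature))
r_adjusted <- cor(res_ice, res_burn)

print(c(raw = r_raw, adjusted_for_temperature = r_adjusted))
```

The raw correlation is high even though the simulation contains no causal link between the two outcomes, which is exactly the situation the phrase warns about.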

Overview: Correlation Matrix

  • A correlation matrix is a symmetric table where each cell represents the correlation between two variables.
    • Rows and columns correspond to variables, and the diagonal contains perfect correlations (1.0) because each variable is perfectly correlated with itself.
    • Off-diagonal elements show the pairwise correlations between variables.
## [1] "Correlation Matrix:"
##                 Life.expectancy         GDP infant.deaths  Population
## Life.expectancy      1.00000000  0.46566179    -0.1732494 -0.02262795
## GDP                  0.46566179  1.00000000    -0.1038938 -0.02832441
## infant.deaths       -0.17324943 -0.10389376     1.0000000  0.55676997
## Population          -0.02262795 -0.02832441     0.5567700  1.00000000
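
A matrix of this shape can be obtained by passing several numeric columns to `cor()` at once. The sketch below is an assumption about how that might look: it uses a simulated stand-in data frame with the same column names, so the numbers it prints are random and will not match the output above.

```r
set.seed(1)  # reproducible stand-in data

# Simulated stand-in with the same column names as the output above;
# the values are random and purely illustrative
n <- 200
data <- data.frame(
  Life.expectancy = rnorm(n, mean = 70, sd = 9),
  GDP             = rlnorm(n, meanlog = 8, sdlog = 1),
  infant.deaths   = rpois(n, lambda = 30),
  Population      = rlnorm(n, meanlog = 15, sdlog = 2)
)
data$GDP[c(3, 17)] <- NA  # a few missing values, as real data often has

vars <- c("Life.expectancy", "GDP", "infant.deaths", "Population")

# Pairwise Pearson correlations, using only rows complete in all four columns
correlation_matrix <- cor(data[, vars], use = "complete.obs")

print("Correlation Matrix:")
print(correlation_matrix)

# Structural checks matching the bullets above:
# the matrix is symmetric and its diagonal is all ones
print(isSymmetric(correlation_matrix))
print(all.equal(unname(diag(correlation_matrix)), rep(1, length(vars))))
```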

Pair Plot: Multiple Variable Correlation

Multiple Variable Scatter Plot - 3D

The 3D scatter plot illustrates the relationship between Life Expectancy, GDP, and Infant Deaths. Note: Population is represented by color intensity.

My Plotly Code:

library(readr)
library(plotly)

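# `data` is assumed to be the life-expectancy data frame loaded earlier.
# Keep only the four plotted columns and drop rows with missing values.
data <- na.omit(data[, c("GDP", "infant.deaths", "Life.expectancy",
                         "Population")])

# Create a 3D scatter plot; marker color encodes Population
# on a custom two-color scale
scatter_3d_plot <- plot_ly(data, x = ~Life.expectancy,
                           y = ~GDP, z = ~infant.deaths,
                           type = 'scatter3d', mode = 'markers',
                           # Marker size, color mapping, and color scale
                           # are set in a single list
                           marker = list(size = 3,
                                         color = ~Population,
                                         colorscale = c('#FFE1A1',
                                                        '#683531'),
                                         showscale = TRUE)) %>%
  layout(scene = list(
    xaxis = list(title = "Life Expectancy"),
    yaxis = list(title = "GDP"),
    zaxis = list(title = "Infant Deaths")
  ))

# Display the plot
scatter_3d_plot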

Alternate Methods: Spearman’s Rank

When there are no tied ranks, Spearman’s rank correlation coefficient (\(\rho\)) can be calculated as:

\[ \rho = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)} \]

Where:

  • \(n\) is the number of observations.
  • \(d_i\) is the difference between the rank of \(X_i\) and the rank of \(Y_i\) for the \(i\)-th pair of observations.

  • Spearman’s correlation is based on the ranks of the data rather than the actual values.
  • It assesses the strength of a monotonic relationship between two variables.

Note: In a monotonic relationship, one variable consistently increases (or consistently decreases) as the other increases, without necessarily following a straight line.
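
As a minimal sketch, the rank-difference formula above can be applied to the small example from the Pearson section and compared with R's built-in `cor(..., method = "spearman")`. The names `d`, `rho_formula`, and `rho_builtin` are introduced here for illustration.

```r
# Example data from the Pearson section: Y falls whenever X rises,
# but not along an exact straight line
X <- c(2, 4, 5, 7, 9)
Y <- c(10, 8, 5, 3, 1)

n <- length(X)

# d_i: difference between the rank of X_i and the rank of Y_i
d <- rank(X) - rank(Y)

# Rank-difference formula (valid here because there are no tied values)
rho_formula <- 1 - (6 * sum(d^2)) / (n * (n^2 - 1))

# Built-in equivalent
rho_builtin <- cor(X, Y, method = "spearman")

print(c(spearman_formula = rho_formula,
        spearman_builtin = rho_builtin,
        pearson          = cor(X, Y)))
```

Both Spearman values come out as -1 because the ranks of `Y` are in exactly the reverse order of the ranks of `X`, while Pearson's \(r\) is about -0.984 because the points do not lie exactly on a straight line.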