Correlation analysis: practice

Anna Shirokanova & Olesya Volchenko

March 16, 2020

This seminar

Correlations

  1. Are variables related or not?
  2. If yes, is this a positive or negative relationship?
  3. If yes, how large is it?

Pearson’s Product Moment Correlation

Pearson’s r is used for evaluating bivariate relationships, i.e. to describe the relationship of two continuous variables

\[ r_{xy} = \frac{Cov(x, y)}{S_x * S_y} = \frac{Joint \space variance}{Separate \space SDs}\]

Covariance (Cov) is a measure of joint variability of two variables: it is positive if variables go in the same direction (when X goes up, Y goes up), negative if not (when X goes up, Y goes down). It is a non-normalised measure, i.e., it depends on the original scales of variables.

Correlation is normalised covariance: the covariance of X and Y is divided by the product of their standard deviations (Sx * Sy).

A correlation coefficient can be significant or not (there is a test for statistical significance).

Graphically, it shows whether the dots lie along the line or not

Examine positive correlation

Source: http://guessthecorrelation.com/

Pearson’s Product-Moment Correlation: Assumptions

If the assumptions are not met, then apply Spearman’s or Kendall’s coefficients (they have no distributional assumptions)

Common Mistakes with Correlation

1. Correlation is not causation

Ex.: There is a positive correlation between living in an industrial neighbourhood like Sedova and having tuberculosis. Does industry induce the disease? Or is it that a majority are low-income people who have poor nutrition and generally higher chances of contracting tuberculosis?

2. Zero correlation does not always mean there is no relationship between the variables.

Not all relations are linear, see Anscombe’s quartet below

Samples can be heterogeneous, see Simpson’s paradox below

3. Non-zero correlation coefficient does not always mean a relationship between the variables.

Spurious correlations, see http://www.tylervigen.com/spurious-correlations

Bottomline: Do not jump to quick conclusions based on correlation alone.

Visualizing Data Gives Insights

Anscombe’s quartet: these are four data sets with equal means, SDs, and the same correlation coefficient of .86 between X and Y, but their distributions are completely different.

Conclusion: visualise your data before interpreting the statistics

What Else Can Be Difficult With Correlations

Simpson’s paradox: in a heterogeneous sample, relationships within groups and on the aggregate level can go in opposite directions

Conclusion: Mind the groups, do not generalise individual-level conclusions to the group level without double-checking

Now on to the empirical example

Before you go further, revise the terms from the first part of this presentation and make sure you can explain them without looking back at the slides:

Data description

Let’s analyze some of the World Bank development indicators about countries: https://databank.worldbank.org/data/source/world-development-indicators

The data come from year 2010.

The dataset is available in the data folder in your Dropbox <bit.do/das2020Dropbox>

Variables description

n.NY.GDP.PCAP.PP.KD - GDP per capita, PPP (constant 2011 international $)
n.IT.NET.USER.ZS - Individuals using the Internet (% of population)
n.IT.NET.SECR.P6 - Secure Internet servers (per 1 million people)
n.SE.COM.DURS - Compulsory education, duration (years)
n.SH.HIV.INCD.ZS - Incidence of HIV (% of uninfected population ages 15-49)
n.SH.STA.SUIC.P5 - Suicide mortality rate (per 100,000 population)
n.SH.STA.SUIC.MA.P5 - Suicide mortality rate, male (per 100,000 male population)
n.SH.STA.SUIC.FE.P5 - Suicide mortality rate, female (per 100,000 female population)
n.SH.ALC.PCAP.LI - Total alcohol consumption per capita (liters of pure alcohol, projected estimates, 15+ years of age)
n.EG.ELC.ACCS.RU.ZS - Access to electricity, rural (% of rural population)

wb <- read.csv("wb_clean.csv")
library(tidyverse)
glimpse(wb)
## Observations: 217
## Variables: 13
## $ X                   <int> 1, 14, 27, 66, 79, 92, 105, 131, 144, 157, 170,...
## $ country             <fct> "Afghanistan", "Albania", "Algeria", "Angola", ...
## $ n.NY.GDP.PCAP.PP.KD <dbl> 1693.7702, 9927.1529, 12870.6027, 6356.9350, 19...
## $ n.IT.NET.USER.ZS    <dbl> 4.00, 45.00, 12.50, 2.80, 47.00, 45.00, 25.00, ...
## $ n.IT.NET.SECR.P6    <dbl> 0.4860577, 4.1194348, 0.3599350, 1.2837448, 633...
## $ n.SE.COM.DURS       <dbl> 9, 8, 10, 6, 11, 13, 11, 10, 8, 9, 12, 9, 5, 11...
## $ n.SH.HIV.INCD.ZS    <dbl> NA, 0.01, 0.01, 0.21, NA, 0.03, 0.01, 0.01, 0.0...
## $ n.SH.STA.SUIC.P5    <dbl> 5.1, 7.8, 3.3, 5.7, 0.3, 8.7, 6.0, 12.5, 16.0, ...
## $ n.SH.STA.SUIC.MA.P5 <dbl> 8.6, 9.5, 4.9, 8.7, 0.5, 14.3, 10.1, 18.6, 25.0...
## $ n.SH.STA.SUIC.FE.P5 <dbl> 1.4, 6.1, 1.8, 2.8, 0.0, 3.3, 2.4, 6.3, 7.5, 1....
## $ n.SH.ALC.PCAP.LI    <dbl> 0.2, 7.9, 0.7, 9.0, 6.1, 9.3, 5.6, 12.5, 12.0, ...
## $ n.EG.ELC.ACCS.RU.ZS <dbl> 32.40000, 100.00000, 97.59427, 16.20996, 92.221...
## $ na_count            <int> 4, 3, 3, 3, 4, 3, 3, 3, 3, 3, 1, 1, 3, 3, 3, 4,...
wb <- wb[ ,-1]

Let’s assign labels

labs <- c("Country name", # create a list of labels from a separate file downloaded with the data from the World Bank
          "GDP per capita, PPP (constant 2011 international $)",
          "Individuals using the Internet (% of population)",
          "Secure Internet servers (per 1 million people)",
          "Compulsory education, duration (years)",
          "Incidence of HIV (% of uninfected population ages 15-49)",
          "Suicide mortality rate (per 100,000 population)",
          "Suicide mortality rate, male (per 100,000 male population)",
          "Suicide mortality rate, female (per 100,000 female population)",
          "Total alcohol consumption per capita (liters of pure alcohol, projected estimates, 15+ years of age)",
          "Access to electricity, rural (% of rural population)",
          "Number of NA's") # do not forget the variable that we have just created
library(sjlabelled)
wb <- set_label(wb, label = labs)

Variables description

Now that we have attached proper labels to the variables, we can better understand the data.

library(sjPlot)
view_df(wb[2:11], show.prc = F, verbose = F)
Data frame: wb[2:11]
ID Name Label Values Value Labels
1 n.NY.GDP.PCAP.PP.KD GDP per capita, PPP (constant 2011 international
$)
range: 660.2-125140.8
2 n.IT.NET.USER.ZS Individuals using the Internet (% of population) range: 0.2-93.4
3 n.IT.NET.SECR.P6 Secure Internet servers (per 1 million people) range: 0.0-2481.6
4 n.SE.COM.DURS Compulsory education, duration (years) range: 5.0-15.0
5 n.SH.HIV.INCD.ZS Incidence of HIV (% of uninfected population ages
15-49)
range: 0.0-3.1
6 n.SH.STA.SUIC.P5 Suicide mortality rate (per 100,000 population) range: 0.3-40.0
7 n.SH.STA.SUIC.MA.P5 Suicide mortality rate, male (per 100,000 male
population)
range: 0.5-71.7
8 n.SH.STA.SUIC.FE.P5 Suicide mortality rate, female (per 100,000 female
population)
range: 0.0-23.2
9 n.SH.ALC.PCAP.LI Total alcohol consumption per capita (liters of
pure alcohol, projected estimates, 15+ years of
age)
range: 0.0-17.9
10 n.EG.ELC.ACCS.RU.ZS Access to electricity, rural (% of rural
population)
range: 0.9-100.0

That’s it with the data preprocessing. Let’s move on to correlations!

Correlations

This simplest way to calculate a correlation coefficient is the cor() function:

cor(wb$n.NY.GDP.PCAP.PP.KD, wb$n.IT.NET.USER.ZS)
## [1] NA

It does not retunrn the value if there are missing values.

cor(wb$n.NY.GDP.PCAP.PP.KD, wb$n.IT.NET.USER.ZS, use = "complete.obs")
## [1] 0.7816712

The default method of correlation is Pearson’s product moment, but other methods are also available:

cor(wb$n.NY.GDP.PCAP.PP.KD, wb$n.IT.NET.USER.ZS, use = "complete.obs", method = "spearman")
## [1] 0.8896563
cor(wb$n.NY.GDP.PCAP.PP.KD, wb$n.IT.NET.USER.ZS, use = "complete.obs", method = "kendall")
## [1] 0.7217487

Summary:

When using ‘cor’, specify which observations to use. If there are NAs, the output will be “NA”. You can choose which observations to use: "everything", "all.obs", "complete.obs", "na.or.complete", or "pairwise.complete.obs".

Use use = "complete.obs" to correlate the observations without missing values on all the variables that you put in the formula.

We can see that both Pearson’s and Spearman’s correlation coefficients are very high. The interpretation is that the higher the country’s GDP per capita, the higher is the share of the country’s population using the Internet.

As expected, Kendall’s correlation is lower than Spearman’s. However, its value is also high.

With cor function, you do not get the statistical significance of the coefficient.

Correlations

A more informative way is to use the cor.test() function:

cor.test(wb$n.NY.GDP.PCAP.PP.KD, wb$n.IT.NET.USER.ZS)
## 
##  Pearson's product-moment correlation
## 
## data:  wb$n.NY.GDP.PCAP.PP.KD and wb$n.IT.NET.USER.ZS
## t = 18.291, df = 213, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.7235433 0.8287912
## sample estimates:
##       cor 
## 0.7816712
cor.test(wb$n.NY.GDP.PCAP.PP.KD, wb$n.IT.NET.USER.ZS, method = "spearman")
## 
##  Spearman's rank correlation rho
## 
## data:  wb$n.NY.GDP.PCAP.PP.KD and wb$n.IT.NET.USER.ZS
## S = 182769, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##       rho 
## 0.8896563
cor.test(wb$n.NY.GDP.PCAP.PP.KD, wb$n.IT.NET.USER.ZS, method = "kendall")
## 
##  Kendall's rank correlation tau
## 
## data:  wb$n.NY.GDP.PCAP.PP.KD and wb$n.IT.NET.USER.ZS
## z = 15.736, p-value < 2.2e-16
## alternative hypothesis: true tau is not equal to 0
## sample estimates:
##       tau 
## 0.7217487

We get the same results, but also much more information:

  1. the statistical significance tests of correlation coefficients;
  2. confidence intervals for Pearson’s correlation coefficient; and
  3. alternative hypotheses to the tests.
  4. NAs are removed by default.

Let’s plot the relationship

Plain scatterplots

plot(wb$n.NY.GDP.PCAP.PP.KD, wb$n.IT.NET.USER.ZS)

plot(wb$n.NY.GDP.PCAP.PP.KD, wb$n.IT.NET.USER.ZS, 
     pch = 20, 
     xlab = "GDP PPP", 
     ylab = "Internet users, per cent of the population")

Scatterplot with labels

wbws <-  na.omit(data.frame(n.NY.GDP.PCAP.PP.KD = wb$n.NY.GDP.PCAP.PP.KD, n.IT.NET.USER.ZS = wb$n.IT.NET.USER.ZS, 
                            country = wb$country))
library(wordcloud)
textplot(wbws$n.NY.GDP.PCAP.PP.KD, wbws$n.IT.NET.USER.ZS, wbws$country, 
         cex = 0.5,
         xlab = "GDP PPP",
         ylab = "Internet users, per cent of the population")

textplot(log(wbws$n.NY.GDP.PCAP.PP.KD), wbws$n.IT.NET.USER.ZS, wbws$country, 
         cex = 0.5,
         xlab = "GDP PPP",
         ylab = "Internet users, per cent of the population")

If you had fewer observations, this could be a very informative plot with labels.

Another scatterplot

library(ggpubr)
ggscatter(wbws, x = "n.NY.GDP.PCAP.PP.KD", y = "n.IT.NET.USER.ZS", 
          add = "reg.line", 
          cor.coef = TRUE, 
          cor.method = "pearson",
          xlab = "GDP", 
          ylab = "Internet users, per cent of the population")

This scatterplot shows not only the data points, but also the line of best fit, the value of correlation coefficient (R) and its p-value.

Is everything right with the axes?

Take a look at the Y scale: R does not know that it is per cent that will not exceed 100, so it continues the scale upwards.

The current axis is misleading and must be adjusted as shown below:

ggscatter(wbws, x = "n.NY.GDP.PCAP.PP.KD", y = "n.IT.NET.USER.ZS", 
          add = "reg.line",
          cor.coef = TRUE, 
          cor.method = "pearson",
          xlab = "GDP", 
          ylab = "Internet users, per cent of the population",
          ylim = c(0, 100))

Is it a linear relationship, after all? It looks more like inverse quadratic.

Log-transformation is sometimes used to make such relationships linear (see below). Do not use log-transformation if you do not know what you are doing!

wbws$logGDP <- log(wbws$n.NY.GDP.PCAP.PP.KD)
ggscatter(wbws, x = "logGDP", y = "n.IT.NET.USER.ZS", 
          add = "reg.line",
          cor.coef = TRUE, 
          cor.method = "pearson",
          xlab = "Log GDP", 
          ylab = "Internet users, per cent of the population",
          ylim = c(0, 100))

This pair shows a higher correlation r = 0.85 (p < 0.001), with a more linear trend on the plot.

Get fancy scatterplots with single distributions

library(GGally)
ggpairs(wbws[, 1:2],
        upper = list(continuous = wrap("cor", size = 8, col = "black"))) # to adjust the font

Correlation matrix

If there are more than two variable you want to correlate, you can use a correlation matrix like that:

Pearson’s product moment correlations

Look at the note at the bottom-right: ‘listwise deletion’ means that this correlation matrix was calculated only for the countries that contained the data on ALL the variables used in this matrix.

  1. Try intrepreting some of the correlations.

  2. Why is the diagonal empty?

library(sjPlot)
sjt.corr(wb[, 3:12])
  Individuals using the Internet (% of
population)
Secure Internet servers (per 1 million
people)
Compulsory education, duration (years) Incidence of HIV (% of uninfected
population ages 15-49)
Suicide mortality rate (per 100,000
population)
Suicide mortality rate, male (per
100,000 male population)
Suicide mortality rate, female (per
100,000 female population)
Total alcohol consumption per capita
(liters of pure alcohol, projected
estimates, 15+ years of age)
Access to electricity, rural (% of rural
population)
Number of NA’s
Individuals using the Internet (% of
population)
  0.665*** 0.343*** -0.261** 0.266** 0.298*** 0.059 0.481*** 0.733*** -0.173*
Secure Internet servers (per 1 million
people)
0.665***   0.193* -0.112 0.114 0.096 0.124 0.306*** 0.299*** -0.019
Compulsory education, duration (years) 0.343*** 0.193*   -0.270** -0.008 0.039 -0.145 0.164 0.388*** -0.196*
Incidence of HIV (% of uninfected
population ages 15-49)
-0.261** -0.112 -0.270**   0.022 -0.028 0.177* 0.029 -0.381*** 0.092
Suicide mortality rate (per 100,000
population)
0.266** 0.114 -0.008 0.022   0.978*** 0.752*** 0.591*** 0.248** -0.027
Suicide mortality rate, male (per
100,000 male population)
0.298*** 0.096 0.039 -0.028 0.978***   0.601*** 0.621*** 0.305*** -0.050
Suicide mortality rate, female (per
100,000 female population)
0.059 0.124 -0.145 0.177* 0.752*** 0.601***   0.320*** -0.020 0.076
Total alcohol consumption per capita
(liters of pure alcohol, projected
estimates, 15+ years of age)
0.481*** 0.306*** 0.164 0.029 0.591*** 0.621*** 0.320***   0.309*** -0.016
Access to electricity, rural (% of rural
population)
0.733*** 0.299*** 0.388*** -0.381*** 0.248** 0.305*** -0.020 0.309***   -0.166*
Number of NA’s -0.173* -0.019 -0.196* 0.092 -0.027 -0.050 0.076 -0.016 -0.166*  
Computed correlation used pearson-method with listwise-deletion.

Correlation matrix

Spearman’s correlation coefficient (it transforms all the variables into ranks)

Compare the correlation coefficients for the last variable with those obtained with Pearson’s coefficient. Why are they different?

Which of them suits our data better here?

sjt.corr(wb[, 3:12], 
         corr.method = "spearman")
  Individuals using the Internet (% of
population)
Secure Internet servers (per 1 million
people)
Compulsory education, duration (years) Incidence of HIV (% of uninfected
population ages 15-49)
Suicide mortality rate (per 100,000
population)
Suicide mortality rate, male (per
100,000 male population)
Suicide mortality rate, female (per
100,000 female population)
Total alcohol consumption per capita
(liters of pure alcohol, projected
estimates, 15+ years of age)
Access to electricity, rural (% of rural
population)
Number of NA’s
Individuals using the Internet (% of
population)
  0.876*** 0.402*** -0.530*** 0.214* 0.294*** 0.002 0.498*** 0.883*** -0.201*
Secure Internet servers (per 1 million
people)
0.876***   0.405*** -0.361*** 0.220** 0.308*** -0.001 0.533*** 0.738*** -0.226**
Compulsory education, duration (years) 0.402*** 0.405***   -0.363*** -0.038 0.036 -0.161 0.201* 0.404*** -0.195*
Incidence of HIV (% of uninfected
population ages 15-49)
-0.530*** -0.361*** -0.363***   -0.038 -0.063 0.041 0.000 -0.662*** 0.170*
Suicide mortality rate (per 100,000
population)
0.214* 0.220** -0.038 -0.038   0.966*** 0.849*** 0.569*** 0.201* 0.079
Suicide mortality rate, male (per
100,000 male population)
0.294*** 0.308*** 0.036 -0.063 0.966***   0.706*** 0.620*** 0.266** 0.063
Suicide mortality rate, female (per
100,000 female population)
0.002 -0.001 -0.161 0.041 0.849*** 0.706***   0.354*** 0.002 0.128
Total alcohol consumption per capita
(liters of pure alcohol, projected
estimates, 15+ years of age)
0.498*** 0.533*** 0.201* 0.000 0.569*** 0.620*** 0.354***   0.398*** -0.039
Access to electricity, rural (% of rural
population)
0.883*** 0.738*** 0.404*** -0.662*** 0.201* 0.266** 0.002 0.398***   -0.174*
Number of NA’s -0.201* -0.226** -0.195* 0.170* 0.079 0.063 0.128 -0.039 -0.174*  
Computed correlation used spearman-method with listwise-deletion.

Correlation matrix

Kendall’s tau (it transforms all the variables into ranks but uses a different formula)

Compare Kendall’s correlations with Spearman’s correlations.

sjt.corr(wb[, 3:12], 
         corr.method = "kendall")
  Individuals using the Internet (% of
population)
Secure Internet servers (per 1 million
people)
Compulsory education, duration (years) Incidence of HIV (% of uninfected
population ages 15-49)
Suicide mortality rate (per 100,000
population)
Suicide mortality rate, male (per
100,000 male population)
Suicide mortality rate, female (per
100,000 female population)
Total alcohol consumption per capita
(liters of pure alcohol, projected
estimates, 15+ years of age)
Access to electricity, rural (% of rural
population)
Number of NA’s
Individuals using the Internet (% of
population)
  0.702*** 0.300*** -0.368*** 0.136* 0.194*** -0.004 0.352*** 0.700*** -0.163*
Secure Internet servers (per 1 million
people)
0.702***   0.302*** -0.252*** 0.138* 0.196*** -0.006 0.376*** 0.560*** -0.184**
Compulsory education, duration (years) 0.300*** 0.302***   -0.266*** -0.029 0.022 -0.112 0.134* 0.301*** -0.170*
Incidence of HIV (% of uninfected
population ages 15-49)
-0.368*** -0.252*** -0.266***   -0.015 -0.035 0.029 0.004 -0.490*** 0.144*
Suicide mortality rate (per 100,000
population)
0.136* 0.138* -0.029 -0.015   0.856*** 0.670*** 0.403*** 0.131* 0.064
Suicide mortality rate, male (per
100,000 male population)
0.194*** 0.196*** 0.022 -0.035 0.856***   0.528*** 0.447*** 0.183** 0.052
Suicide mortality rate, female (per
100,000 female population)
-0.004 -0.006 -0.112 0.029 0.670*** 0.528***   0.242*** -0.003 0.104
Total alcohol consumption per capita
(liters of pure alcohol, projected
estimates, 15+ years of age)
0.352*** 0.376*** 0.134* 0.004 0.403*** 0.447*** 0.242***   0.275*** -0.032
Access to electricity, rural (% of rural
population)
0.700*** 0.560*** 0.301*** -0.490*** 0.131* 0.183** -0.003 0.275***   -0.145*
Number of NA’s -0.163* -0.184** -0.170* 0.144* 0.064 0.052 0.104 -0.032 -0.145*  
Computed correlation used kendall-method with listwise-deletion.

Get a graphical table of the correlations

Color is used to indicate positive and negative values.

The columns are given in reverse order, so that the correlations remain on the left and are easier to read.

Compare this table to Pearson’s product moment correlation matrix - they should be the same.

In your projects, use either matrices or tables, depending on which suits your goal better.

sjp.corr(wb[, 3:12])

Final remarks