Anna Shirokanova & Olesya Volchenko
March 16, 2020
Pearson’s r is used for evaluating bivariate relationships, i.e. to describe the relationship of two continuous variables
\[ r_{xy} = \frac{Cov(x, y)}{S_x * S_y} = \frac{Joint \space variance}{Separate \space SDs}\]
Covariance (Cov) is a measure of joint variability of two variables: it is positive if variables go in the same direction (when X goes up, Y goes up), negative if not (when X goes up, Y goes down). It is a non-normalised measure, i.e., it depends on the original scales of variables.
Correlation is normalised covariance: the covariance of X and Y is divided by the product of their standard deviations (Sx * Sy).
A correlation coefficient can be significant or not (there is a test for statistical significance).
Graphically, it shows whether the dots lie along the line or not
If the assumptions are not met, then apply Spearman’s or Kendall’s coefficients (they have no distributional assumptions)
1. Correlation is not causation
Ex.: There is a positive correlation between living in an industrial neighbourhood like Sedova and having tuberculosis. Does industry induce the disease? Or is it that a majority are low-income people who have poor nutrition and generally higher chances of contracting tuberculosis?
2. Zero correlation does not always mean there is no relationship between the variables.
Not all relations are linear, see Anscombe’s quartet below
Samples can be heterogeneous, see Simpson’s paradox below
3. Non-zero correlation coefficient does not always mean a relationship between the variables.
Spurious correlations, see http://www.tylervigen.com/spurious-correlations
Bottomline: Do not jump to quick conclusions based on correlation alone.
Anscombe’s quartet: these are four data sets with equal means, SDs, and the same correlation coefficient of .86 between X and Y, but their distributions are completely different.
Conclusion: visualise your data before interpreting the statistics
Simpson’s paradox: in a heterogeneous sample, relationships within groups and on the aggregate level can go in opposite directions
Conclusion: Mind the groups, do not generalise individual-level conclusions to the group level without double-checking
Before you go further, revise the terms from the first part of this presentation and make sure you can explain them without looking back at the slides:
Let’s analyze some of the World Bank development indicators about countries: https://databank.worldbank.org/data/source/world-development-indicators
The data come from year 2010.
The dataset is available in the data folder in your Dropbox <bit.do/das2020Dropbox>
n.NY.GDP.PCAP.PP.KD - GDP per capita, PPP (constant 2011 international $)
n.IT.NET.USER.ZS - Individuals using the Internet (% of population)
n.IT.NET.SECR.P6 - Secure Internet servers (per 1 million people)
n.SE.COM.DURS - Compulsory education, duration (years)
n.SH.HIV.INCD.ZS - Incidence of HIV (% of uninfected population ages 15-49)
n.SH.STA.SUIC.P5 - Suicide mortality rate (per 100,000 population)
n.SH.STA.SUIC.MA.P5 - Suicide mortality rate, male (per 100,000 male population)
n.SH.STA.SUIC.FE.P5 - Suicide mortality rate, female (per 100,000 female population)
n.SH.ALC.PCAP.LI - Total alcohol consumption per capita (liters of pure alcohol, projected estimates, 15+ years of age)
n.EG.ELC.ACCS.RU.ZS - Access to electricity, rural (% of rural population)
## Observations: 217
## Variables: 13
## $ X <int> 1, 14, 27, 66, 79, 92, 105, 131, 144, 157, 170,...
## $ country <fct> "Afghanistan", "Albania", "Algeria", "Angola", ...
## $ n.NY.GDP.PCAP.PP.KD <dbl> 1693.7702, 9927.1529, 12870.6027, 6356.9350, 19...
## $ n.IT.NET.USER.ZS <dbl> 4.00, 45.00, 12.50, 2.80, 47.00, 45.00, 25.00, ...
## $ n.IT.NET.SECR.P6 <dbl> 0.4860577, 4.1194348, 0.3599350, 1.2837448, 633...
## $ n.SE.COM.DURS <dbl> 9, 8, 10, 6, 11, 13, 11, 10, 8, 9, 12, 9, 5, 11...
## $ n.SH.HIV.INCD.ZS <dbl> NA, 0.01, 0.01, 0.21, NA, 0.03, 0.01, 0.01, 0.0...
## $ n.SH.STA.SUIC.P5 <dbl> 5.1, 7.8, 3.3, 5.7, 0.3, 8.7, 6.0, 12.5, 16.0, ...
## $ n.SH.STA.SUIC.MA.P5 <dbl> 8.6, 9.5, 4.9, 8.7, 0.5, 14.3, 10.1, 18.6, 25.0...
## $ n.SH.STA.SUIC.FE.P5 <dbl> 1.4, 6.1, 1.8, 2.8, 0.0, 3.3, 2.4, 6.3, 7.5, 1....
## $ n.SH.ALC.PCAP.LI <dbl> 0.2, 7.9, 0.7, 9.0, 6.1, 9.3, 5.6, 12.5, 12.0, ...
## $ n.EG.ELC.ACCS.RU.ZS <dbl> 32.40000, 100.00000, 97.59427, 16.20996, 92.221...
## $ na_count <int> 4, 3, 3, 3, 4, 3, 3, 3, 3, 3, 1, 1, 3, 3, 3, 4,...
labs <- c("Country name", # create a list of labels from a separate file downloaded with the data from the World Bank
"GDP per capita, PPP (constant 2011 international $)",
"Individuals using the Internet (% of population)",
"Secure Internet servers (per 1 million people)",
"Compulsory education, duration (years)",
"Incidence of HIV (% of uninfected population ages 15-49)",
"Suicide mortality rate (per 100,000 population)",
"Suicide mortality rate, male (per 100,000 male population)",
"Suicide mortality rate, female (per 100,000 female population)",
"Total alcohol consumption per capita (liters of pure alcohol, projected estimates, 15+ years of age)",
"Access to electricity, rural (% of rural population)",
"Number of NA's") # do not forget the variable that we have just created
library(sjlabelled)
wb <- set_label(wb, label = labs)
Now that we have attached proper labels to the variables, we can better understand the data.
ID | Name | Label | Values | Value Labels |
---|---|---|---|---|
1 | n.NY.GDP.PCAP.PP.KD |
GDP per capita, PPP (constant 2011 international $) |
range: 660.2-125140.8 | |
2 | n.IT.NET.USER.ZS | Individuals using the Internet (% of population) | range: 0.2-93.4 | |
3 | n.IT.NET.SECR.P6 | Secure Internet servers (per 1 million people) | range: 0.0-2481.6 | |
4 | n.SE.COM.DURS | Compulsory education, duration (years) | range: 5.0-15.0 | |
5 | n.SH.HIV.INCD.ZS |
Incidence of HIV (% of uninfected population ages 15-49) |
range: 0.0-3.1 | |
6 | n.SH.STA.SUIC.P5 | Suicide mortality rate (per 100,000 population) | range: 0.3-40.0 | |
7 | n.SH.STA.SUIC.MA.P5 |
Suicide mortality rate, male (per 100,000 male population) |
range: 0.5-71.7 | |
8 | n.SH.STA.SUIC.FE.P5 |
Suicide mortality rate, female (per 100,000 female population) |
range: 0.0-23.2 | |
9 | n.SH.ALC.PCAP.LI |
Total alcohol consumption per capita (liters of pure alcohol, projected estimates, 15+ years of age) |
range: 0.0-17.9 | |
10 | n.EG.ELC.ACCS.RU.ZS |
Access to electricity, rural (% of rural population) |
range: 0.9-100.0 |
That’s it with the data preprocessing. Let’s move on to correlations!
This simplest way to calculate a correlation coefficient is the cor()
function:
## [1] NA
It does not retunrn the value if there are missing values.
## [1] 0.7816712
The default method of correlation is Pearson’s product moment, but other methods are also available:
## [1] 0.8896563
## [1] 0.7217487
Summary:
When using ‘cor’, specify which observations to use. If there are NAs, the output will be “NA”. You can choose which observations to use: "everything", "all.obs", "complete.obs", "na.or.complete", or "pairwise.complete.obs"
.
Use use = "complete.obs"
to correlate the observations without missing values on all the variables that you put in the formula.
We can see that both Pearson’s and Spearman’s correlation coefficients are very high. The interpretation is that the higher the country’s GDP per capita, the higher is the share of the country’s population using the Internet.
As expected, Kendall’s correlation is lower than Spearman’s. However, its value is also high.
With cor
function, you do not get the statistical significance of the coefficient.
A more informative way is to use the cor.test()
function:
##
## Pearson's product-moment correlation
##
## data: wb$n.NY.GDP.PCAP.PP.KD and wb$n.IT.NET.USER.ZS
## t = 18.291, df = 213, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.7235433 0.8287912
## sample estimates:
## cor
## 0.7816712
##
## Spearman's rank correlation rho
##
## data: wb$n.NY.GDP.PCAP.PP.KD and wb$n.IT.NET.USER.ZS
## S = 182769, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.8896563
##
## Kendall's rank correlation tau
##
## data: wb$n.NY.GDP.PCAP.PP.KD and wb$n.IT.NET.USER.ZS
## z = 15.736, p-value < 2.2e-16
## alternative hypothesis: true tau is not equal to 0
## sample estimates:
## tau
## 0.7217487
We get the same results, but also much more information:
Plain scatterplots
plot(wb$n.NY.GDP.PCAP.PP.KD, wb$n.IT.NET.USER.ZS,
pch = 20,
xlab = "GDP PPP",
ylab = "Internet users, per cent of the population")
wbws <- na.omit(data.frame(n.NY.GDP.PCAP.PP.KD = wb$n.NY.GDP.PCAP.PP.KD, n.IT.NET.USER.ZS = wb$n.IT.NET.USER.ZS,
country = wb$country))
library(wordcloud)
textplot(wbws$n.NY.GDP.PCAP.PP.KD, wbws$n.IT.NET.USER.ZS, wbws$country,
cex = 0.5,
xlab = "GDP PPP",
ylab = "Internet users, per cent of the population")
textplot(log(wbws$n.NY.GDP.PCAP.PP.KD), wbws$n.IT.NET.USER.ZS, wbws$country,
cex = 0.5,
xlab = "GDP PPP",
ylab = "Internet users, per cent of the population")
If you had fewer observations, this could be a very informative plot with labels.
library(ggpubr)
ggscatter(wbws, x = "n.NY.GDP.PCAP.PP.KD", y = "n.IT.NET.USER.ZS",
add = "reg.line",
cor.coef = TRUE,
cor.method = "pearson",
xlab = "GDP",
ylab = "Internet users, per cent of the population")
This scatterplot shows not only the data points, but also the line of best fit, the value of correlation coefficient (R) and its p-value.
Is everything right with the axes?
Take a look at the Y scale: R does not know that it is per cent that will not exceed 100, so it continues the scale upwards.
The current axis is misleading and must be adjusted as shown below:
ggscatter(wbws, x = "n.NY.GDP.PCAP.PP.KD", y = "n.IT.NET.USER.ZS",
add = "reg.line",
cor.coef = TRUE,
cor.method = "pearson",
xlab = "GDP",
ylab = "Internet users, per cent of the population",
ylim = c(0, 100))
Is it a linear relationship, after all? It looks more like inverse quadratic.
Log-transformation is sometimes used to make such relationships linear (see below). Do not use log-transformation if you do not know what you are doing!
wbws$logGDP <- log(wbws$n.NY.GDP.PCAP.PP.KD)
ggscatter(wbws, x = "logGDP", y = "n.IT.NET.USER.ZS",
add = "reg.line",
cor.coef = TRUE,
cor.method = "pearson",
xlab = "Log GDP",
ylab = "Internet users, per cent of the population",
ylim = c(0, 100))
This pair shows a higher correlation r = 0.85 (p < 0.001), with a more linear trend on the plot.
library(GGally)
ggpairs(wbws[, 1:2],
upper = list(continuous = wrap("cor", size = 8, col = "black"))) # to adjust the font
If there are more than two variable you want to correlate, you can use a correlation matrix like that:
Pearson’s product moment correlations
Look at the note at the bottom-right: ‘listwise deletion’ means that this correlation matrix was calculated only for the countries that contained the data on ALL the variables used in this matrix.
Try intrepreting some of the correlations.
Why is the diagonal empty?
Individuals using the Internet (% of population) |
Secure Internet servers (per 1 million people) |
Compulsory education, duration (years) |
Incidence of HIV (% of uninfected population ages 15-49) |
Suicide mortality rate (per 100,000 population) |
Suicide mortality rate, male (per 100,000 male population) |
Suicide mortality rate, female (per 100,000 female population) |
Total alcohol consumption per capita (liters of pure alcohol, projected estimates, 15+ years of age) |
Access to electricity, rural (% of rural population) |
Number of NA’s | |
---|---|---|---|---|---|---|---|---|---|---|
Individuals using the Internet (% of population) |
0.665*** | 0.343*** | -0.261** | 0.266** | 0.298*** | 0.059 | 0.481*** | 0.733*** | -0.173* | |
Secure Internet servers (per 1 million people) |
0.665*** | 0.193* | -0.112 | 0.114 | 0.096 | 0.124 | 0.306*** | 0.299*** | -0.019 | |
Compulsory education, duration (years) | 0.343*** | 0.193* | -0.270** | -0.008 | 0.039 | -0.145 | 0.164 | 0.388*** | -0.196* | |
Incidence of HIV (% of uninfected population ages 15-49) |
-0.261** | -0.112 | -0.270** | 0.022 | -0.028 | 0.177* | 0.029 | -0.381*** | 0.092 | |
Suicide mortality rate (per 100,000 population) |
0.266** | 0.114 | -0.008 | 0.022 | 0.978*** | 0.752*** | 0.591*** | 0.248** | -0.027 | |
Suicide mortality rate, male (per 100,000 male population) |
0.298*** | 0.096 | 0.039 | -0.028 | 0.978*** | 0.601*** | 0.621*** | 0.305*** | -0.050 | |
Suicide mortality rate, female (per 100,000 female population) |
0.059 | 0.124 | -0.145 | 0.177* | 0.752*** | 0.601*** | 0.320*** | -0.020 | 0.076 | |
Total alcohol consumption per capita (liters of pure alcohol, projected estimates, 15+ years of age) |
0.481*** | 0.306*** | 0.164 | 0.029 | 0.591*** | 0.621*** | 0.320*** | 0.309*** | -0.016 | |
Access to electricity, rural (% of rural population) |
0.733*** | 0.299*** | 0.388*** | -0.381*** | 0.248** | 0.305*** | -0.020 | 0.309*** | -0.166* | |
Number of NA’s | -0.173* | -0.019 | -0.196* | 0.092 | -0.027 | -0.050 | 0.076 | -0.016 | -0.166* | |
Computed correlation used pearson-method with listwise-deletion. |
Spearman’s correlation coefficient (it transforms all the variables into ranks)
Compare the correlation coefficients for the last variable with those obtained with Pearson’s coefficient. Why are they different?
Which of them suits our data better here?
Individuals using the Internet (% of population) |
Secure Internet servers (per 1 million people) |
Compulsory education, duration (years) |
Incidence of HIV (% of uninfected population ages 15-49) |
Suicide mortality rate (per 100,000 population) |
Suicide mortality rate, male (per 100,000 male population) |
Suicide mortality rate, female (per 100,000 female population) |
Total alcohol consumption per capita (liters of pure alcohol, projected estimates, 15+ years of age) |
Access to electricity, rural (% of rural population) |
Number of NA’s | |
---|---|---|---|---|---|---|---|---|---|---|
Individuals using the Internet (% of population) |
0.876*** | 0.402*** | -0.530*** | 0.214* | 0.294*** | 0.002 | 0.498*** | 0.883*** | -0.201* | |
Secure Internet servers (per 1 million people) |
0.876*** | 0.405*** | -0.361*** | 0.220** | 0.308*** | -0.001 | 0.533*** | 0.738*** | -0.226** | |
Compulsory education, duration (years) | 0.402*** | 0.405*** | -0.363*** | -0.038 | 0.036 | -0.161 | 0.201* | 0.404*** | -0.195* | |
Incidence of HIV (% of uninfected population ages 15-49) |
-0.530*** | -0.361*** | -0.363*** | -0.038 | -0.063 | 0.041 | 0.000 | -0.662*** | 0.170* | |
Suicide mortality rate (per 100,000 population) |
0.214* | 0.220** | -0.038 | -0.038 | 0.966*** | 0.849*** | 0.569*** | 0.201* | 0.079 | |
Suicide mortality rate, male (per 100,000 male population) |
0.294*** | 0.308*** | 0.036 | -0.063 | 0.966*** | 0.706*** | 0.620*** | 0.266** | 0.063 | |
Suicide mortality rate, female (per 100,000 female population) |
0.002 | -0.001 | -0.161 | 0.041 | 0.849*** | 0.706*** | 0.354*** | 0.002 | 0.128 | |
Total alcohol consumption per capita (liters of pure alcohol, projected estimates, 15+ years of age) |
0.498*** | 0.533*** | 0.201* | 0.000 | 0.569*** | 0.620*** | 0.354*** | 0.398*** | -0.039 | |
Access to electricity, rural (% of rural population) |
0.883*** | 0.738*** | 0.404*** | -0.662*** | 0.201* | 0.266** | 0.002 | 0.398*** | -0.174* | |
Number of NA’s | -0.201* | -0.226** | -0.195* | 0.170* | 0.079 | 0.063 | 0.128 | -0.039 | -0.174* | |
Computed correlation used spearman-method with listwise-deletion. |
Kendall’s tau (it transforms all the variables into ranks but uses a different formula)
Compare Kendall’s correlations with Spearman’s correlations.
Individuals using the Internet (% of population) |
Secure Internet servers (per 1 million people) |
Compulsory education, duration (years) |
Incidence of HIV (% of uninfected population ages 15-49) |
Suicide mortality rate (per 100,000 population) |
Suicide mortality rate, male (per 100,000 male population) |
Suicide mortality rate, female (per 100,000 female population) |
Total alcohol consumption per capita (liters of pure alcohol, projected estimates, 15+ years of age) |
Access to electricity, rural (% of rural population) |
Number of NA’s | |
---|---|---|---|---|---|---|---|---|---|---|
Individuals using the Internet (% of population) |
0.702*** | 0.300*** | -0.368*** | 0.136* | 0.194*** | -0.004 | 0.352*** | 0.700*** | -0.163* | |
Secure Internet servers (per 1 million people) |
0.702*** | 0.302*** | -0.252*** | 0.138* | 0.196*** | -0.006 | 0.376*** | 0.560*** | -0.184** | |
Compulsory education, duration (years) | 0.300*** | 0.302*** | -0.266*** | -0.029 | 0.022 | -0.112 | 0.134* | 0.301*** | -0.170* | |
Incidence of HIV (% of uninfected population ages 15-49) |
-0.368*** | -0.252*** | -0.266*** | -0.015 | -0.035 | 0.029 | 0.004 | -0.490*** | 0.144* | |
Suicide mortality rate (per 100,000 population) |
0.136* | 0.138* | -0.029 | -0.015 | 0.856*** | 0.670*** | 0.403*** | 0.131* | 0.064 | |
Suicide mortality rate, male (per 100,000 male population) |
0.194*** | 0.196*** | 0.022 | -0.035 | 0.856*** | 0.528*** | 0.447*** | 0.183** | 0.052 | |
Suicide mortality rate, female (per 100,000 female population) |
-0.004 | -0.006 | -0.112 | 0.029 | 0.670*** | 0.528*** | 0.242*** | -0.003 | 0.104 | |
Total alcohol consumption per capita (liters of pure alcohol, projected estimates, 15+ years of age) |
0.352*** | 0.376*** | 0.134* | 0.004 | 0.403*** | 0.447*** | 0.242*** | 0.275*** | -0.032 | |
Access to electricity, rural (% of rural population) |
0.700*** | 0.560*** | 0.301*** | -0.490*** | 0.131* | 0.183** | -0.003 | 0.275*** | -0.145* | |
Number of NA’s | -0.163* | -0.184** | -0.170* | 0.144* | 0.064 | 0.052 | 0.104 | -0.032 | -0.145* | |
Computed correlation used kendall-method with listwise-deletion. |
Color is used to indicate positive and negative values.
The columns are given in reverse order, so that the correlations remain on the left and are easier to read.
Compare this table to Pearson’s product moment correlation matrix - they should be the same.
In your projects, use either matrices or tables, depending on which suits your goal better.
Correlation can help evaluate relationships between continuous variables.
If you want to use correlation in a research project, report the statistic, its p-value, the direction and magnitude of correlation.
Check the normality of distributions to choose between the types of correlation you need.