Correlation analysis: practice

Variables description

NY.GDP.PCAP.PP.KD - GDP per capita, PPP (constant 2011 international $)
IT.NET.USER.ZS - Individuals using the Internet (% of population)
IT.NET.SECR.P6 - Secure Internet servers (per 1 million people)
SE.COM.DURS - Compulsory education, duration (years)
SE.TER.CUAT.BA.FE.ZS - Educational attainment, at least Bachelor’s or equivalent, population 25+, female (%) (cumulative)
SE.TER.CUAT.BA.MA.ZS - Educational attainment, at least Bachelor’s or equivalent, population 25+, male (%) (cumulative)
SH.HIV.INCD.ZS - Incidence of HIV (% of uninfected population ages 15-49)
SH.STA.SUIC.P5 - Suicide mortality rate (per 100,000 population)
SH.STA.SUIC.MA.P5 - Suicide mortality rate, male (per 100,000 male population)
SH.STA.SUIC.FE.P5 - Suicide mortality rate, female (per 100,000 female population)
SH.ALC.PCAP.LI - Total alcohol consumption per capita (liters of pure alcohol, projected estimates, 15+ years of age)
EN.FSH.THRD.NO - Fish species, threatened
EG.ELC.ACCS.RU.ZS - Access to electricity, rural (% of rural population)

Data preparation

We have our dataset in the ‘long’ format: multiple rows with the country name, one column where all the indicators are listed, and another column with values.

wb <- read.csv("cade1285-2388-448f-9e00-b37286b9cf08_Data.csv")
wb[1:10,]

##    ï..Country.Name Country.Code
## 1      Afghanistan          AFG
## 2      Afghanistan          AFG
## 3      Afghanistan          AFG
## 4      Afghanistan          AFG
## 5      Afghanistan          AFG
## 6      Afghanistan          AFG
## 7      Afghanistan          AFG
## 8      Afghanistan          AFG
## 9      Afghanistan          AFG
## 10     Afghanistan          AFG
##                                                                                           Series.Name
## 1                                                 GDP per capita, PPP (constant 2011 international $)
## 2                                                    Individuals using the Internet (% of population)
## 3                                                      Secure Internet servers (per 1 million people)
## 4                                                              Compulsory education, duration (years)
## 5  Educational attainment, at least Bachelor's or equivalent, population 25+, female (%) (cumulative)
## 6    Educational attainment, at least Bachelor's or equivalent, population 25+, male (%) (cumulative)
## 7                                            Incidence of HIV (% of uninfected population ages 15-49)
## 8                                                     Suicide mortality rate (per 100,000 population)
## 9                                          Suicide mortality rate, male (per 100,000 male population)
## 10                                     Suicide mortality rate, female (per 100,000 female population)
##             Series.Code    X2010..YR2010.
## 1     NY.GDP.PCAP.PP.KD   1693.7701994264
## 2        IT.NET.USER.ZS                 4
## 3        IT.NET.SECR.P6 0.486057661645332
## 4           SE.COM.DURS                 9
## 5  SE.TER.CUAT.BA.FE.ZS                ..
## 6  SE.TER.CUAT.BA.MA.ZS                ..
## 7        SH.HIV.INCD.ZS                ..
## 8        SH.STA.SUIC.P5               5.1
## 9     SH.STA.SUIC.MA.P5               8.6
## 10    SH.STA.SUIC.FE.P5               1.4

Let’s move to wide!

In the ‘wide’ format, there is one row per country, and each indicator gets its own column.

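names(wb)[1] <- "Country.Name" # the first column name starts with "ï.." because of a byte-order mark in the CSV file; we rename it by position
wb1 <- wb[c("Country.Name", "Series.Code", "X2010..YR2010.")]
names(wb1) <- c("country", "series", "n") # we replace the 'year' variable with a short name, "n"
# reshape() is part of base R (package stats), so no extra library is needed
wbw <- reshape(wb1, idvar = "country", timevar = "series", direction = "wide")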
head(wbw)

##           country n.NY.GDP.PCAP.PP.KD n.IT.NET.USER.ZS  n.IT.NET.SECR.P6
## 1     Afghanistan     1693.7701994264                4 0.486057661645332
## 14        Albania    9927.15286013963               45  4.11943477235489
## 27        Algeria    12870.6026985154             12.5 0.359934953662666
## 40 American Samoa                  ..               ..  53.9209518845373
## 53        Andorra                  ..               81   686.80505393788
## 66         Angola    6356.93499087317              2.8  1.28374478280771
##    n.SE.COM.DURS n.SE.TER.CUAT.BA.FE.ZS n.SE.TER.CUAT.BA.MA.ZS
## 1              9                     ..                     ..
## 14             8                     ..                     ..
## 27            10                     ..                     ..
## 40            ..                     ..                     ..
## 53            10                     ..                     ..
## 66             6                     ..                     ..
##    n.SH.HIV.INCD.ZS n.SH.STA.SUIC.P5 n.SH.STA.SUIC.MA.P5
## 1                ..              5.1                 8.6
## 14             0.01              7.8                 9.5
## 27             0.01              3.3                 4.9
## 40               ..               ..                  ..
## 53               ..               ..                  ..
## 66             0.21              5.7                 8.7
##    n.SH.STA.SUIC.FE.P5 n.SH.ALC.PCAP.LI n.EN.FSH.THRD.NO
## 1                  1.4              0.2               ..
## 14                 6.1              7.9               ..
## 27                 1.8              0.7               ..
## 40                  ..               ..               ..
## 53                  ..             11.4               ..
## 66                 2.8                9               ..
##    n.EG.ELC.ACCS.RU.ZS   n.
## 1                 32.4 <NA>
## 14                 100 <NA>
## 27    97.5942707575627 <NA>
## 40                  .. <NA>
## 53                 100 <NA>
## 66     16.209957525451 <NA>

wbw$n. <- NULL # this column is built from the rows whose series code is empty (they hold no data), so we delete it
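
The long-to-wide step can be tried on a small example first. The sketch below uses a made-up data frame (the countries "A" and "B" and all values are invented for illustration) and the same reshape() call as above.

```r
# Toy long-format data: two countries, two indicators (made-up values)
long <- data.frame(
  country = c("A", "A", "B", "B"),
  series  = c("gdp", "net", "gdp", "net"),
  n       = c(100, 10, 200, 20)
)

# One row per country, one column per indicator
wide <- reshape(long, idvar = "country", timevar = "series", direction = "wide")
print(wide)
```

The new column names are built from the value column name and the indicator code, which is why the real data set gets names such as n.NY.GDP.PCAP.PP.KD.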

Setting NA’s right

In the original data set, all the missing values are denoted with two dots, “..”.

R does not recognise this as a missing value. Therefore, we replace those dots with NA (R’s missing-value marker) in columns 2 to 14.

summary(wbw)

##            country          n.NY.GDP.PCAP.PP.KD n.IT.NET.USER.ZS
##                :  1   ..              : 25      ..     : 15     
##  Afghanistan   :  1   3241.69994573611:  2      15.9   :  4     
##  Albania       :  1   4174.71460976018:  2      8      :  4     
##  Algeria       :  1   10197.5237047127:  1      25     :  3     
##  American Samoa:  1   1032.9628806199 :  1      3      :  3     
##  Andorra       :  1   (Other)         :233      (Other):235     
##  (Other)       :261   NA's            :  3      NA's   :  3     
##            n.IT.NET.SECR.P6 n.SE.COM.DURS      n.SE.TER.CUAT.BA.FE.ZS
##  0                 : 12     9      :84    ..              :245       
##  ..                :  7     10     :41    10.8434200286865:  1       
##  1.42076916942725  :  2     ..     :31    11.5630903244019:  1       
##  3.64335593664317  :  2     8      :22    11.5771198272705:  1       
##  0.0199378354241743:  1     6      :21    12.8441400527954:  1       
##  (Other)           :240     (Other):65    (Other)         : 15       
##  NA's              :  3     NA's   : 3    NA's            :  3       
##       n.SE.TER.CUAT.BA.MA.ZS n.SH.HIV.INCD.ZS n.SH.STA.SUIC.P5
##  ..              :245        ..     :107      ..     : 35     
##  10.626540184021 :  1        0.01   : 42      5.1    :  5     
##  10.7774600982666:  1        0.02   : 19      11.1   :  4     
##  11.6993999481201:  1        0.03   :  9      3.3    :  4     
##  12.6875896453857:  1        0.04   :  9      3.6    :  4     
##  (Other)         : 15        (Other): 78      (Other):212     
##  NA's            :  3        NA's   :  3      NA's   :  3     
##  n.SH.STA.SUIC.MA.P5 n.SH.STA.SUIC.FE.P5 n.SH.ALC.PCAP.LI
##  ..     : 35         ..     : 35         ..     : 31     
##  5      :  4         3.6    :  7         0.7    :  7     
##  8.7    :  4         2.5    :  6         0.2    :  6     
##  11.7   :  3         1.5    :  5         11.4   :  5     
##  12.9   :  3         2.8    :  5         7      :  5     
##  (Other):215         (Other):206         (Other):210     
##  NA's   :  3         NA's   :  3         NA's   :  3     
##           n.EN.FSH.THRD.NO       n.EG.ELC.ACCS.RU.ZS
##  ..               :264     100             : 72     
##                   :  0     ..              : 17     
##  0                :  0     17.3994206856074:  2     
##  0.01             :  0     3.1             :  2     
##  0.010232397845675:  0     3.5             :  2     
##  (Other)          :  0     (Other)         :169     
##  NA's             :  3     NA's            :  3

for (i in 2:14)
{
  # assign a real missing value (NA), not the character string "NA"
  wbw[, i] <- replace(wbw[, i], which(wbw[, i] == ".."), NA)
}
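
An alternative that this tutorial does not use is to declare the missing-value marker when the file is read, through the na.strings argument of read.csv(). The sketch below assumes a tiny made-up CSV passed as text, not the World Bank file.

```r
# Toy CSV text with ".." marking a missing value (made-up numbers)
csv_text <- "country,value\nA,1.5\nB,..\nC,3"

# na.strings tells read.csv() which strings stand for missing values
toy <- read.csv(text = csv_text, na.strings = "..")
str(toy)
```

With this approach the column is read as numeric straight away, so neither the replacement loop nor the later conversion to numeric would be needed for it.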

Deleting cases with 5 or more NA’s

Most of the countries have up to 4 missing values. To keep most information but get a cleaner-looking data set, we filter out the countries with many NA’s.

wbw$na_count <- apply(wbw, 1, function(x) sum(is.na(x)))
table(wbw$na_count)

## 
##   1   2   3   4   5   6   7   8   9  10  11  12  13 
##  12   4 129  72  10   1   5   5   8  10   3   4   4

wbw1 <- subset(wbw, na_count < 5)
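
The same row filter can be sketched on a toy data frame. The example below uses rowSums(is.na()) as an equivalent way to count missing values per row; the data and the cut-off of 2 are made up for illustration.

```r
# Toy data: country C has no data at all (made-up values)
toy <- data.frame(
  country = c("A", "B", "C"),
  x = c(1, NA, NA),
  y = c(2, 5, NA),
  z = c(3, NA, NA)
)

# Count missing values in each row
toy$na_count <- rowSums(is.na(toy))

# Keep the rows with fewer than 2 missing values (hypothetical cut-off)
kept <- subset(toy, na_count < 2)
print(kept)
```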

Deleting variables with more than 50% missings

Some variables have many empty cells as well, i.e. the statistic is not collected in many countries.

We delete the variables for which most countries have no data.

summary(wbw1)

##                 country          n.NY.GDP.PCAP.PP.KD n.IT.NET.USER.ZS
##  Afghanistan        :  1   3241.69994573611:  2      15.9   :  4     
##  Albania            :  1   4174.71460976018:  2      3      :  3     
##  Algeria            :  1   1032.9628806199 :  1      45     :  3     
##  Angola             :  1   10436.3655998628:  1      8      :  3     
##  Antigua and Barbuda:  1   1073.82629515037:  1      1      :  2     
##  Arab World         :  1   (Other)         :208      1.7    :  2     
##  (Other)            :211   NA's            :  2      (Other):200     
##            n.IT.NET.SECR.P6 n.SE.COM.DURS      n.SE.TER.CUAT.BA.FE.ZS
##  0                 :  4     9      :79    10.8434200286865:  1       
##  1.42076916942725  :  2     10     :36    11.5630903244019:  1       
##  3.64335593664317  :  2     6      :21    11.5771198272705:  1       
##  0.0199378354241743:  1     8      :20    13.3864803314209:  1       
##  0.0325068985327394:  1     11     :16    14.4489097595215:  1       
##  (Other)           :205     (Other):37    (Other)         : 12       
##  NA's              :  2     NA's   : 8    NA's            :200       
##       n.SE.TER.CUAT.BA.MA.ZS n.SH.HIV.INCD.ZS n.SH.STA.SUIC.P5
##  10.626540184021 :  1        0.01   :41       3.3    :  4     
##  10.7774600982666:  1        0.02   :19       3.6    :  4     
##  11.6993999481201:  1        0.03   : 9       5.1    :  4     
##  12.6875896453857:  1        0.04   : 9       8      :  4     
##  13.4247598648071:  1        0.05   : 7       10     :  3     
##  (Other)         : 12        (Other):70       12.5   :  3     
##  NA's            :200        NA's   :62       (Other):195     
##        n.SH.STA.SUIC.MA.P5 n.SH.STA.SUIC.FE.P5 n.SH.ALC.PCAP.LI
##  5               :  4      3.6    :  7         0.7    :  7     
##  8.7             :  4      2.5    :  6         0.2    :  6     
##  11.7            :  3      1.5    :  5         7      :  5     
##  12.9            :  3      6      :  5         7.1    :  5     
##  4.3             :  3      6.1    :  5         11.4   :  4     
##  10.1443132276089:  2      1.9    :  4         0.5    :  3     
##  (Other)         :198      (Other):185         (Other):187     
##           n.EN.FSH.THRD.NO       n.EG.ELC.ACCS.RU.ZS    na_count    
##                   :  0     100             : 58      Min.   :1.000  
##  ..               :  0     17.3994206856074:  2      1st Qu.:3.000  
##  0                :  0     3.1             :  2      Median :3.000  
##  0.01             :  0     3.5             :  2      Mean   :3.203  
##  0.010232397845675:  0     64.9773918291319:  2      3rd Qu.:4.000  
##  (Other)          :  0     (Other)         :147      Max.   :4.000  
##  NA's             :217     NA's            :  4

wbw1$n.EN.FSH.THRD.NO <- NULL # completely empty (217 NA's)
wbw1$n.SE.TER.CUAT.BA.MA.ZS <- NULL # almost empty (200 NA's)
wbw1$n.SE.TER.CUAT.BA.FE.ZS <- NULL # almost empty (200 NA's)
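
Instead of reading the NA counts off the summary, the share of missing values per column can be computed directly. This is a sketch on made-up data, with the 50% threshold taken from the heading above.

```r
# Toy data: column b is mostly missing (made-up values)
toy <- data.frame(
  a = c(1, 2, 3, 4),
  b = c(NA, NA, NA, 4),
  c = c(1, NA, 3, 4)
)

# Share of missing values in each column
na_share <- colMeans(is.na(toy))
print(na_share)

# Keep only the columns with at most 50% missing values
toy_clean <- toy[, na_share <= 0.5, drop = FALSE]
names(toy_clean)
```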

Factors to numeric

Before the analysis, make sure all variables except for ‘country’ are read by R as numeric (which they should be).

str(wbw1) # Those factors should be numeric

## 'data.frame':    217 obs. of  12 variables:
##  $ country            : Factor w/ 267 levels "","Afghanistan",..: 2 3 4 7 8 10 11 13 14 15 ...
##  $ n.NY.GDP.PCAP.PP.KD: Factor w/ 1302 levels "","..","0","0.01",..: 404 1300 291 1005 464 447 1023 809 825 378 ...
##  $ n.IT.NET.USER.ZS   : Factor w/ 1302 levels "","..","0","0.01",..: 760 838 262 491 850 838 579 1096 1090 845 ...
##  $ n.IT.NET.SECR.P6   : Factor w/ 1302 levels "","..","0","0.01",..: 87 763 74 132 1004 574 173 347 1035 486 ...
##  $ n.SE.COM.DURS      : Factor w/ 1302 levels "","..","0","0.01",..: 1180 1109 159 940 202 293 202 159 1109 1180 ...
##  $ n.SH.HIV.INCD.ZS   : Factor w/ 1302 levels "","..","0","0.01",..: NA 4 4 52 NA 9 4 4 4 4 ...
##  $ n.SH.STA.SUIC.P5   : Factor w/ 1302 levels "","..","0","0.01",..: 867 1062 647 885 67 1126 940 262 379 642 ...
##  $ n.SH.STA.SUIC.MA.P5: Factor w/ 1302 levels "","..","0","0.01",..: 1124 1200 791 1126 90 332 161 436 579 864 ...
##  $ n.SH.STA.SUIC.FE.P5: Factor w/ 1302 levels "","..","0","0.01",..: 137 949 151 491 3 647 478 959 1054 128 ...
##  $ n.SH.ALC.PCAP.LI   : Factor w/ 1302 levels "","..","0","0.01",..: 50 1064 105 1180 949 1190 883 262 250 494 ...
##  $ n.EG.ELC.ACCS.RU.ZS: Factor w/ 1302 levels "","..","0","0.01",..: 686 190 1264 382 1229 1271 1287 190 190 1295 ...
##  $ na_count           : int  4 3 3 3 4 3 3 3 3 3 ...

for (i in 2:11)
{
  wbw1[,i] <- as.numeric(as.character(wbw1[,i]))
}

str(wbw1)

## 'data.frame':    217 obs. of  12 variables:
##  $ country            : Factor w/ 267 levels "","Afghanistan",..: 2 3 4 7 8 10 11 13 14 15 ...
##  $ n.NY.GDP.PCAP.PP.KD: num  1694 9927 12871 6357 19213 ...
##  $ n.IT.NET.USER.ZS   : num  4 45 12.5 2.8 47 ...
##  $ n.IT.NET.SECR.P6   : num  0.486 4.119 0.36 1.284 633.841 ...
##  $ n.SE.COM.DURS      : num  9 8 10 6 11 13 11 10 8 9 ...
##  $ n.SH.HIV.INCD.ZS   : num  NA 0.01 0.01 0.21 NA 0.03 0.01 0.01 0.01 0.01 ...
##  $ n.SH.STA.SUIC.P5   : num  5.1 7.8 3.3 5.7 0.3 8.7 6 12.5 16 3.1 ...
##  $ n.SH.STA.SUIC.MA.P5: num  8.6 9.5 4.9 8.7 0.5 14.3 10.1 18.6 25 5 ...
##  $ n.SH.STA.SUIC.FE.P5: num  1.4 6.1 1.8 2.8 0 3.3 2.4 6.3 7.5 1.2 ...
##  $ n.SH.ALC.PCAP.LI   : num  0.2 7.9 0.7 9 6.1 9.3 5.6 12.5 12 2.9 ...
##  $ n.EG.ELC.ACCS.RU.ZS: num  32.4 100 97.6 16.2 92.2 ...
##  $ na_count           : int  4 3 3 3 4 3 3 3 3 3 ...

Nice! Every variable is now of its correct type.
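
The conversion goes through as.character() for a reason: applied directly to a factor, as.numeric() returns the internal level codes, not the numbers printed on screen. A minimal sketch with made-up values:

```r
# A factor whose labels look like numbers (made-up values)
f <- factor(c("10.5", "3", "200"))

codes  <- as.numeric(f)                # internal level codes, not the values
values <- as.numeric(as.character(f))  # the numbers we actually want

print(codes)
print(values)
```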

Let’s assign labels

labs <- c("Country name", # create a vector of labels, taken from a separate file downloaded with the data from the World Bank
          "GDP per capita, PPP (constant 2011 international $)",
          "Individuals using the Internet (% of population)",
          "Secure Internet servers (per 1 million people)",
          "Compulsory education, duration (years)",
          "Incidence of HIV (% of uninfected population ages 15-49)",
          "Suicide mortality rate (per 100,000 population)",
          "Suicide mortality rate, male (per 100,000 male population)",
          "Suicide mortality rate, female (per 100,000 female population)",
          "Total alcohol consumption per capita (liters of pure alcohol, projected estimates, 15+ years of age)",
          "Access to electricity, rural (% of rural population)",
          "Number of NA's") # do not forget the variable that we have just created
library(sjlabelled)
wbw1 <- set_label(wbw1, label = labs)

Variables description

Now that we have attached proper labels to the variables, we can use them and understand the data better.

library(sjPlot)
view_df(wbw1[2:11], show.prc = FALSE, verbose = FALSE)

Data frame: wbw1[2:11]
ID	Name	Label	Values
1	n.NY.GDP.PCAP.PP.KD	GDP per capita, PPP (constant 2011 international $)	range: 660.2-125140.8
2	n.IT.NET.USER.ZS	Individuals using the Internet (% of population)	range: 0.2-93.4
3	n.IT.NET.SECR.P6	Secure Internet servers (per 1 million people)	range: 0.0-2481.6
4	n.SE.COM.DURS	Compulsory education, duration (years)	range: 5.0-15.0
5	n.SH.HIV.INCD.ZS	Incidence of HIV (% of uninfected population ages 15-49)	range: 0.0-3.1
6	n.SH.STA.SUIC.P5	Suicide mortality rate (per 100,000 population)	range: 0.3-40.0
7	n.SH.STA.SUIC.MA.P5	Suicide mortality rate, male (per 100,000 male population)	range: 0.5-71.7
8	n.SH.STA.SUIC.FE.P5	Suicide mortality rate, female (per 100,000 female population)	range: 0.0-23.2
9	n.SH.ALC.PCAP.LI	Total alcohol consumption per capita (liters of pure alcohol, projected estimates, 15+ years of age)	range: 0.0-17.9
10	n.EG.ELC.ACCS.RU.ZS	Access to electricity, rural (% of rural population)	range: 0.9-100.0

That’s it with the data manipulations. Let’s move on to correlations!

Correlations

using cor() function

cor(wbw1$n.NY.GDP.PCAP.PP.KD, wbw1$n.IT.NET.USER.ZS)

## [1] NA

cor(wbw1$n.NY.GDP.PCAP.PP.KD, wbw1$n.IT.NET.USER.ZS, use = "complete.obs")

## [1] 0.7816712

cor(wbw1$n.NY.GDP.PCAP.PP.KD, wbw1$n.IT.NET.USER.ZS, use = "complete.obs", method = "spearman")

## [1] 0.8896563

cor(wbw1$n.NY.GDP.PCAP.PP.KD, wbw1$n.IT.NET.USER.ZS, use = "complete.obs", method = "kendall")

## [1] 0.7217487

When using cor(), specify how missing values are handled through the ‘use’ argument. With the default, “everything”, the output is NA whenever either variable contains an NA. The options are: “everything”, “all.obs”, “complete.obs”, “na.or.complete”, or “pairwise.complete.obs”.

Use “complete.obs” to compute the correlation only on the observations that have values for all the variables passed to cor().
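
The effect of the ‘use’ argument can be checked on two short made-up vectors: with the default the result is NA, and with “complete.obs” it equals the correlation computed after dropping the incomplete pair by hand.

```r
# Two short vectors; x has one missing value (made-up numbers)
x <- c(1, 2, 3, 4, NA)
y <- c(2, 4, 5, 9, 10)

r_default  <- cor(x, y)                        # default use = "everything"
r_complete <- cor(x, y, use = "complete.obs")  # drops the incomplete pair
r_manual   <- cor(x[1:4], y[1:4])              # the same pairs, selected by hand

print(c(r_default, r_complete, r_manual))
```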

We can see that both Pearson’s (0.78) and Spearman’s (0.89) correlation coefficients are high. The interpretation is that the higher a country’s GDP per capita, the higher the share of its population using the Internet.

As expected, Kendall’s correlation is lower than Spearman’s. However, its value is also high.
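
The three coefficients can be compared on a small made-up example: y grows as the square of x (monotone but not linear), with two neighbouring values swapped. This is only an illustration, not the World Bank data.

```r
# Squares of 1..8 with the 5th and 6th values swapped (made-up data)
x <- 1:8
y <- c(1, 4, 9, 16, 36, 25, 49, 64)

r   <- cor(x, y, method = "pearson")
rho <- cor(x, y, method = "spearman")
tau <- cor(x, y, method = "kendall")

print(c(pearson = r, spearman = rho, kendall = tau))
```

Here, too, Kendall’s tau comes out lower than Spearman’s rho, although both describe the same nearly perfect ordering.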

With the cor() function, you do not get the statistical significance of the coefficient.

Correlations

using cor.test() function

cor.test(wbw1$n.NY.GDP.PCAP.PP.KD, wbw1$n.IT.NET.USER.ZS)

## 
##  Pearson's product-moment correlation
## 
## data:  wbw1$n.NY.GDP.PCAP.PP.KD and wbw1$n.IT.NET.USER.ZS
## t = 18.291, df = 213, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.7235433 0.8287912
## sample estimates:
##       cor 
## 0.7816712

cor.test(wbw1$n.NY.GDP.PCAP.PP.KD, wbw1$n.IT.NET.USER.ZS, method = "spearman")

## Warning in cor.test.default(wbw1$n.NY.GDP.PCAP.PP.KD,
## wbw1$n.IT.NET.USER.ZS, : Cannot compute exact p-value with ties

## 
##  Spearman's rank correlation rho
## 
## data:  wbw1$n.NY.GDP.PCAP.PP.KD and wbw1$n.IT.NET.USER.ZS
## S = 182770, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##       rho 
## 0.8896563

cor.test(wbw1$n.NY.GDP.PCAP.PP.KD, wbw1$n.IT.NET.USER.ZS, method = "kendall")

## 
##  Kendall's rank correlation tau
## 
## data:  wbw1$n.NY.GDP.PCAP.PP.KD and wbw1$n.IT.NET.USER.ZS
## z = 15.736, p-value < 2.2e-16
## alternative hypothesis: true tau is not equal to 0
## sample estimates:
##       tau 
## 0.7217487

We get the same results, but also much more information: (1) the statistical significance tests of correlation coefficients; (2) confidence intervals for Pearson’s correlation coefficient; and (3) alternative hypotheses to the tests.

Moreover, NA’s are removed by default.
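
The object returned by cor.test() is a list, so its parts can be extracted for later use. The sketch below uses simulated data (the seed and sample size are arbitrary choices) with one missing value to show that the incomplete pair is dropped.

```r
set.seed(1)        # arbitrary seed, for reproducibility only
x <- rnorm(30)
y <- x + rnorm(30)
x[1] <- NA         # one incomplete pair

res <- cor.test(x, y)

res$estimate   # the correlation coefficient
res$p.value    # the p-value of the test
res$conf.int   # the 95% confidence interval (Pearson only)
res$parameter  # degrees of freedom: number of complete pairs minus 2
```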

Let’s plot the relationship

Plain scatterplots

plot(wbw1$n.NY.GDP.PCAP.PP.KD, wbw1$n.IT.NET.USER.ZS)

plot(wbw1$n.NY.GDP.PCAP.PP.KD, wbw1$n.IT.NET.USER.ZS, pch = 20, xlab = "GDP PPP", ylab = "Internet users, per cent of the population")

Scatterplot with labels

wbws <-  na.omit(data.frame(n.NY.GDP.PCAP.PP.KD = wbw1$n.NY.GDP.PCAP.PP.KD, n.IT.NET.USER.ZS = wbw1$n.IT.NET.USER.ZS, 
                            country = wbw1$country))
library(wordcloud)
textplot(wbws$n.NY.GDP.PCAP.PP.KD, wbws$n.IT.NET.USER.ZS, wbws$country, cex = 0.5)

textplot(log(wbws$n.NY.GDP.PCAP.PP.KD), wbws$n.IT.NET.USER.ZS, wbws$country, cex = 0.5)

If you have fewer countries, this could be a very informative plot with labels.

Another scatterplot

library("ggpubr")
ggscatter(wbws, x = "n.NY.GDP.PCAP.PP.KD", y = "n.IT.NET.USER.ZS", 
          add = "reg.line", conf.int = TRUE, 
          cor.coef = TRUE, cor.method = "pearson",
          xlab = "GDP", ylab = "Internet users, per cent of the population")

ggscatter(wbws, x = "n.NY.GDP.PCAP.PP.KD", y = "n.IT.NET.USER.ZS", 
          add = "reg.line", conf.int = TRUE, 
          cor.coef = TRUE, cor.method = "pearson",
          xlab = "GDP", ylab = "Internet users, per cent of the population",
          ylim = c(0, 100))

We can add the regression line with its confidence band.

Take a look at the Y scale in the first plot: R does not know that a percentage cannot exceed 100, so it extends the scale upwards. In the second plot, we restrict the scale with ylim = c(0, 100).

Is it a linear relationship, after all?
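
One way to probe this question is to compare Pearson’s coefficient before and after a log transformation of the skewed variable, as was done for the labelled scatterplot above. The sketch below uses made-up data in which y is exactly linear in log(x).

```r
# Made-up data: x is strongly skewed, y is linear in log(x), not in x
x <- exp(seq(1, 5, length.out = 20))
y <- 10 * log(x) + 5

r_raw <- cor(x, y)       # Pearson on the raw scale
r_log <- cor(log(x), y)  # Pearson after taking logs

print(c(raw = r_raw, log = r_log))
```

If the coefficient rises noticeably after the transformation, the relationship is better described as linear in the logarithm than on the raw scale.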

Fancy scatterplot with distributions using GGally

library(GGally)
ggpairs(data = wbws,  # data.frame with variables
        columns = 1:2)

ggpairs(wbws[, 1:2],
        upper = list(continuous = wrap("cor", size = 8)))

Correlation matrix

Pearson’s product moment correlations

library(sjPlot)
sjt.corr(wbw1[, 2:12])

	GDP per capita, PPP (constant 2011 international $)	Individuals using the Internet (% of population)	Secure Internet servers (per 1 million people)	Compulsory education, duration (years)	Incidence of HIV (% of uninfected population ages 15-49)	Suicide mortality rate (per 100,000 population)	Suicide mortality rate, male (per 100,000 male population)	Suicide mortality rate, female (per 100,000 female population)	Total alcohol consumption per capita (liters of pure alcohol, projected estimates, 15+ years of age)	Access to electricity, rural (% of rural population)	Number of NA’s
GDP per capita, PPP (constant 2011 international $)		0.764***	0.567***	0.209*	-0.158	0.131	0.144	-0.015	0.241**	0.515***	-0.251**
Individuals using the Internet (% of population)	0.764***		0.664***	0.348***	-0.264**	0.265**	0.298***	0.056	0.476***	0.738***	-0.158
Secure Internet servers (per 1 million people)	0.567***	0.664***		0.195*	-0.114	0.114	0.096	0.123	0.304***	0.301***	-0.010
Compulsory education, duration (years)	0.209*	0.348***	0.195*		-0.270**	-0.005	0.042	-0.143	0.170*	0.397***	-0.209*
Incidence of HIV (% of uninfected population ages 15-49)	-0.158	-0.264**	-0.114	-0.270**		0.023	-0.028	0.176*	0.026	-0.384***	0.103
Suicide mortality rate (per 100,000 population)	0.131	0.265**	0.114	-0.005	0.023		0.978***	0.752***	0.592***	0.241**	-0.025
Suicide mortality rate, male (per 100,000 male population)	0.144	0.298***	0.096	0.042	-0.028	0.978***		0.600***	0.622***	0.298***	-0.050
Suicide mortality rate, female (per 100,000 female population)	-0.015	0.056	0.123	-0.143	0.176*	0.752***	0.600***		0.317***	-0.027	0.085
Total alcohol consumption per capita (liters of pure alcohol, projected estimates, 15+ years of age)	0.241**	0.476***	0.304***	0.170*	0.026	0.592***	0.622***	0.317***		0.304***	0.011
Access to electricity, rural (% of rural population)	0.515***	0.738***	0.301***	0.397***	-0.384***	0.241**	0.298***	-0.027	0.304***		-0.168*
Number of NA’s	-0.251**	-0.158	-0.010	-0.209*	0.103	-0.025	-0.050	0.085	0.011	-0.168*
Computed correlation used pearson-method with listwise-deletion.

At the bottom of the table, ‘listwise deletion’ is mentioned. It means that this correlation matrix was calculated only for the countries that contained the data on ALL the variables used in this matrix.
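
The difference between listwise and pairwise deletion can be seen with base R’s cor() on a made-up data frame: the values of “complete.obs” (listwise) and “pairwise.complete.obs” differ for the pair a and b, because the listwise version drops every row where c is missing.

```r
# Toy data: c is missing in the first two rows (made-up values)
toy <- data.frame(
  a = c(1, 2, 3, 4, 5, 6),
  b = c(2, 1, 4, 3, 6, 5),
  c = c(NA, NA, 1, 3, 2, 5)
)

m_listwise <- cor(toy, use = "complete.obs")           # rows 3 to 6 only
m_pairwise <- cor(toy, use = "pairwise.complete.obs")  # all rows available for each pair

m_listwise["a", "b"]
m_pairwise["a", "b"]
```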

Try interpreting some of the correlations.

Why is the diagonal empty?

Correlation matrix

Spearman’s correlation coefficient (it transforms all the variables into ranks)

sjt.corr(wbw1[, 2:12], corr.method = "spearman")

	GDP per capita, PPP (constant 2011 international $)	Individuals using the Internet (% of population)	Secure Internet servers (per 1 million people)	Compulsory education, duration (years)	Incidence of HIV (% of uninfected population ages 15-49)	Suicide mortality rate (per 100,000 population)	Suicide mortality rate, male (per 100,000 male population)	Suicide mortality rate, female (per 100,000 female population)	Total alcohol consumption per capita (liters of pure alcohol, projected estimates, 15+ years of age)	Access to electricity, rural (% of rural population)	Number of NA’s
GDP per capita, PPP (constant 2011 international $)		0.887***	0.840***	0.371***	-0.460***	0.226**	0.295***	0.019	0.450***	0.837***	-0.221**
Individuals using the Internet (% of population)	0.887***		0.884***	0.411***	-0.529***	0.212*	0.292***	-0.001	0.494***	0.886***	-0.187*
Secure Internet servers (per 1 million people)	0.840***	0.884***		0.410***	-0.367***	0.237**	0.325***	0.005	0.536***	0.758***	-0.204*
Compulsory education, duration (years)	0.371***	0.411***	0.410***		-0.368***	-0.030	0.044	-0.157	0.213*	0.415***	-0.234**
Incidence of HIV (% of uninfected population ages 15-49)	-0.460***	-0.529***	-0.367***	-0.368***		-0.032	-0.057	0.044	0.003	-0.662***	0.178*
Suicide mortality rate (per 100,000 population)	0.226**	0.212*	0.237**	-0.030	-0.032		0.965***	0.849***	0.571***	0.190*	0.084
Suicide mortality rate, male (per 100,000 male population)	0.295***	0.292***	0.325***	0.044	-0.057	0.965***		0.705***	0.622***	0.255**	0.070
Suicide mortality rate, female (per 100,000 female population)	0.019	-0.001	0.005	-0.157	0.044	0.849***	0.705***		0.353***	-0.007	0.142
Total alcohol consumption per capita (liters of pure alcohol, projected estimates, 15+ years of age)	0.450***	0.494***	0.536***	0.213*	0.003	0.571***	0.622***	0.353***		0.397***	0.008
Access to electricity, rural (% of rural population)	0.837***	0.886***	0.758***	0.415***	-0.662***	0.190*	0.255**	-0.007	0.397***		-0.187*
Number of NA’s	-0.221**	-0.187*	-0.204*	-0.234**	0.178*	0.084	0.070	0.142	0.008	-0.187*
Computed correlation used spearman-method with listwise-deletion.

Compare the correlation coefficients for the last variable with those obtained with Pearson’s coefficient. Why are they different? Which of them suits our data better here?
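
As a hint for the comparison, recall that Spearman’s coefficient is Pearson’s coefficient computed on ranks, which makes it much less sensitive to extreme values. A minimal sketch with made-up data containing one outlier:

```r
# Made-up data: the last x value is an outlier
x <- c(1, 2, 3, 4, 5, 100)
y <- c(2, 1, 4, 3, 5, 6)

r_spearman <- cor(x, y, method = "spearman")
r_ranks    <- cor(rank(x), rank(y))  # Pearson on the ranks
r_pearson  <- cor(x, y)              # Pearson on the raw values

print(c(spearman = r_spearman, ranks = r_ranks, pearson = r_pearson))
```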

Correlation matrix

Kendall’s tau (it transforms all the variables into ranks but uses a different formula)

sjt.corr(wbw1[, 2:12], corr.method = "kendall")

	GDP per capita, PPP (constant 2011 international $)	Individuals using the Internet (% of population)	Secure Internet servers (per 1 million people)	Compulsory education, duration (years)	Incidence of HIV (% of uninfected population ages 15-49)	Suicide mortality rate (per 100,000 population)	Suicide mortality rate, male (per 100,000 male population)	Suicide mortality rate, female (per 100,000 female population)	Total alcohol consumption per capita (liters of pure alcohol, projected estimates, 15+ years of age)	Access to electricity, rural (% of rural population)	Number of NA’s
GDP per capita, PPP (constant 2011 international $)		0.725***	0.672***	0.270***	-0.323***	0.145*	0.191***	0.009	0.323***	0.674***	-0.181**
Individuals using the Internet (% of population)	0.725***		0.712***	0.306***	-0.367***	0.135*	0.193***	-0.006	0.347***	0.705***	-0.154*
Secure Internet servers (per 1 million people)	0.672***	0.712***		0.307***	-0.255***	0.150**	0.209***	-0.000	0.380***	0.576***	-0.167*
Compulsory education, duration (years)	0.270***	0.306***	0.307***		-0.270***	-0.023	0.027	-0.110	0.142*	0.308***	-0.207**
Incidence of HIV (% of uninfected population ages 15-49)	-0.323***	-0.367***	-0.255***	-0.270***		-0.011	-0.031	0.030	0.005	-0.490***	0.152*
Suicide mortality rate (per 100,000 population)	0.145*	0.135*	0.150**	-0.023	-0.011		0.855***	0.671***	0.406***	0.123*	0.069
Suicide mortality rate, male (per 100,000 male population)	0.191***	0.193***	0.209***	0.027	-0.031	0.855***		0.528***	0.449***	0.175**	0.057
Suicide mortality rate, female (per 100,000 female population)	0.009	-0.006	-0.000	-0.110	0.030	0.671***	0.528***		0.241***	-0.009	0.117
Total alcohol consumption per capita (liters of pure alcohol, projected estimates, 15+ years of age)	0.323***	0.347***	0.380***	0.142*	0.005	0.406***	0.449***	0.241***		0.275***	0.006
Access to electricity, rural (% of rural population)	0.674***	0.705***	0.576***	0.308***	-0.490***	0.123*	0.175**	-0.009	0.275***		-0.156*
Number of NA’s	-0.181**	-0.154*	-0.167*	-0.207**	0.152*	0.069	0.057	0.117	0.006	-0.156*
Computed correlation used kendall-method with listwise-deletion.

Compare Kendall’s correlations with Spearman’s correlations.
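
To see where the ‘different formula’ comes from, Kendall’s tau can be computed by hand as the difference between the numbers of concordant and discordant pairs, divided by the total number of pairs. The sketch below uses made-up data without ties; with ties, cor() applies a correction and the simple formula no longer matches exactly.

```r
# Made-up data without ties
x <- c(1, 2, 3, 4, 5, 6)
y <- c(1, 3, 2, 5, 4, 6)

# All pairs of observations (each column holds two indices)
pairs <- combn(length(x), 2)

# +1 if a pair is ordered the same way in x and y, -1 if the order is reversed
s <- sign(x[pairs[1, ]] - x[pairs[2, ]]) * sign(y[pairs[1, ]] - y[pairs[2, ]])

concordant <- sum(s > 0)
discordant <- sum(s < 0)
tau_manual <- (concordant - discordant) / ncol(pairs)

tau_r <- cor(x, y, method = "kendall")
print(c(manual = tau_manual, cor = tau_r))
```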

Graphical table

sjp.corr(wbw1[, 2:12])

Here, color is used to indicate positive and negative correlations.

The columns are in reverse order, so that the part with the correlations remains on the left and is easier to read.

Compare this table to the Pearson’s product moment correlation matrix - they should be the same.

In your projects, use either matrices or tables depending on which suits your goal better.
