Anna Shirokanova & Olesya Volchenko
March 14, 2019
We decided to analyze some of the World Bank indicators.
Those are country-level variables from https://databank.worldbank.org/data/source/world-development-indicators
We have all countries as observations and a set of 13 variables for 2010:
NY.GDP.PCAP.PP.KD - GDP per capita, PPP (constant 2011 international $)
IT.NET.USER.ZS - Individuals using the Internet (% of population)
IT.NET.SECR.P6 - Secure Internet servers (per 1 million people)
SE.COM.DURS - Compulsory education, duration (years)
SE.TER.CUAT.BA.FE.ZS - Educational attainment, at least Bachelor’s or equivalent, population 25+, female (%) (cumulative)
SE.TER.CUAT.BA.MA.ZS - Educational attainment, at least Bachelor’s or equivalent, population 25+, male (%) (cumulative)
SH.HIV.INCD.ZS - Incidence of HIV (% of uninfected population ages 15-49)
SH.STA.SUIC.P5 - Suicide mortality rate (per 100,000 population)
SH.STA.SUIC.MA.P5 - Suicide mortality rate, male (per 100,000 male population)
SH.STA.SUIC.FE.P5 - Suicide mortality rate, female (per 100,000 female population)
SH.ALC.PCAP.LI - Total alcohol consumption per capita (liters of pure alcohol, projected estimates, 15+ years of age)
EN.FSH.THRD.NO - Fish species, threatened
EG.ELC.ACCS.RU.ZS - Access to electricity, rural (% of rural population)
We have our dataset in the ‘long’ format: multiple rows with the country name, one column where all the indicators are listed, and another column with values.
wb <- read.csv("cade1285-2388-448f-9e00-b37286b9cf08_Data.csv")
wb[1:10,]
## ï..Country.Name Country.Code
## 1 Afghanistan AFG
## 2 Afghanistan AFG
## 3 Afghanistan AFG
## 4 Afghanistan AFG
## 5 Afghanistan AFG
## 6 Afghanistan AFG
## 7 Afghanistan AFG
## 8 Afghanistan AFG
## 9 Afghanistan AFG
## 10 Afghanistan AFG
## Series.Name
## 1 GDP per capita, PPP (constant 2011 international $)
## 2 Individuals using the Internet (% of population)
## 3 Secure Internet servers (per 1 million people)
## 4 Compulsory education, duration (years)
## 5 Educational attainment, at least Bachelor's or equivalent, population 25+, female (%) (cumulative)
## 6 Educational attainment, at least Bachelor's or equivalent, population 25+, male (%) (cumulative)
## 7 Incidence of HIV (% of uninfected population ages 15-49)
## 8 Suicide mortality rate (per 100,000 population)
## 9 Suicide mortality rate, male (per 100,000 male population)
## 10 Suicide mortality rate, female (per 100,000 female population)
## Series.Code X2010..YR2010.
## 1 NY.GDP.PCAP.PP.KD 1693.7701994264
## 2 IT.NET.USER.ZS 4
## 3 IT.NET.SECR.P6 0.486057661645332
## 4 SE.COM.DURS 9
## 5 SE.TER.CUAT.BA.FE.ZS ..
## 6 SE.TER.CUAT.BA.MA.ZS ..
## 7 SH.HIV.INCD.ZS ..
## 8 SH.STA.SUIC.P5 5.1
## 9 SH.STA.SUIC.MA.P5 8.6
## 10 SH.STA.SUIC.FE.P5 1.4
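The ".." entries in the value column are the World Bank's marker for missing data. As a possible shortcut, read.csv() can be told to treat them as NA at import time through its na.strings argument, which also lets R read the value column as numeric. The sketch below uses a small inline table with simplified, made-up column names instead of the downloaded file. The fileEncoding setting that might remove the "ï.." prefix from the first column name appears only in a comment, because it rests on an assumption about how the file was saved.

```r
# Toy stand-in for the downloaded CSV (values copied from the printout above;
# the column names are simplified for illustration)
csv_text <- "country,series,value
Afghanistan,NY.GDP.PCAP.PP.KD,1693.7701994264
Afghanistan,SH.HIV.INCD.ZS,..
Afghanistan,SH.STA.SUIC.P5,5.1"

# na.strings = ".." turns the missing-value marker into NA on import
toy <- read.csv(text = csv_text, na.strings = "..")

str(toy$value)         # a numeric vector
sum(is.na(toy$value))  # number of missing values

# For the real file one could try the following
# (assumption: the file starts with a UTF-8 byte-order mark):
# wb <- read.csv("cade1285-2388-448f-9e00-b37286b9cf08_Data.csv",
#                na.strings = "..", fileEncoding = "UTF-8-BOM")
```

With this approach the later replacement loop and the factor-to-numeric conversion would not be needed, but the rest of this walkthrough keeps the original import.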
In the ‘wide’ format, there is one row per country and one column per indicator.
wb1 <- wb[c("ï..Country.Name", "Series.Code", "X2010..YR2010.")]
names(wb1) <- c("country", "series", "n") # we replace the 'year' variable with a short name, "n"
library(reshape)
wbw <- reshape(wb1, idvar = "country", timevar = "series", direction = "wide")
head(wbw)
## country n.NY.GDP.PCAP.PP.KD n.IT.NET.USER.ZS n.IT.NET.SECR.P6
## 1 Afghanistan 1693.7701994264 4 0.486057661645332
## 14 Albania 9927.15286013963 45 4.11943477235489
## 27 Algeria 12870.6026985154 12.5 0.359934953662666
## 40 American Samoa .. .. 53.9209518845373
## 53 Andorra .. 81 686.80505393788
## 66 Angola 6356.93499087317 2.8 1.28374478280771
## n.SE.COM.DURS n.SE.TER.CUAT.BA.FE.ZS n.SE.TER.CUAT.BA.MA.ZS
## 1 9 .. ..
## 14 8 .. ..
## 27 10 .. ..
## 40 .. .. ..
## 53 10 .. ..
## 66 6 .. ..
## n.SH.HIV.INCD.ZS n.SH.STA.SUIC.P5 n.SH.STA.SUIC.MA.P5
## 1 .. 5.1 8.6
## 14 0.01 7.8 9.5
## 27 0.01 3.3 4.9
## 40 .. .. ..
## 53 .. .. ..
## 66 0.21 5.7 8.7
## n.SH.STA.SUIC.FE.P5 n.SH.ALC.PCAP.LI n.EN.FSH.THRD.NO
## 1 1.4 0.2 ..
## 14 6.1 7.9 ..
## 27 1.8 0.7 ..
## 40 .. .. ..
## 53 .. 11.4 ..
## 66 2.8 9 ..
## n.EG.ELC.ACCS.RU.ZS n.
## 1 32.4 <NA>
## 14 100 <NA>
## 27 97.5942707575627 <NA>
## 40 .. <NA>
## 53 100 <NA>
## 66 16.209957525451 <NA>
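To see what the long-to-wide step does, here is a minimal sketch on a made-up four-row table. The names long and wide and the values are illustrative only. The reshape() function used here ships with base R (package stats), so the sketch needs no extra library.

```r
# Long format: one row per country-indicator pair
long <- data.frame(
  country = rep(c("A", "B"), each = 2),
  series  = rep(c("gdp", "net"), times = 2),
  n       = c(1, 2, 3, 4)
)

# Wide format: one row per country, one "n.<series>" column per indicator
wide <- reshape(long, idvar = "country", timevar = "series", direction = "wide")

names(wide)  # "country" "n.gdp" "n.net"
wide
```

The new column names are built from the value column name, a dot, and the series code, which is why the real data set ends up with names such as n.NY.GDP.PCAP.PP.KD.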
wbw$n. <- NULL # the 'n.' column is empty (all NA), so we delete it
In the original data set, all the missing values are denoted with two dots, “..”.
R does not recognize this as a missing value. Therefore, we replace those dots with NA in columns 2 to 14.
summary(wbw)
## country n.NY.GDP.PCAP.PP.KD n.IT.NET.USER.ZS
## : 1 .. : 25 .. : 15
## Afghanistan : 1 3241.69994573611: 2 15.9 : 4
## Albania : 1 4174.71460976018: 2 8 : 4
## Algeria : 1 10197.5237047127: 1 25 : 3
## American Samoa: 1 1032.9628806199 : 1 3 : 3
## Andorra : 1 (Other) :233 (Other):235
## (Other) :261 NA's : 3 NA's : 3
## n.IT.NET.SECR.P6 n.SE.COM.DURS n.SE.TER.CUAT.BA.FE.ZS
## 0 : 12 9 :84 .. :245
## .. : 7 10 :41 10.8434200286865: 1
## 1.42076916942725 : 2 .. :31 11.5630903244019: 1
## 3.64335593664317 : 2 8 :22 11.5771198272705: 1
## 0.0199378354241743: 1 6 :21 12.8441400527954: 1
## (Other) :240 (Other):65 (Other) : 15
## NA's : 3 NA's : 3 NA's : 3
## n.SE.TER.CUAT.BA.MA.ZS n.SH.HIV.INCD.ZS n.SH.STA.SUIC.P5
## .. :245 .. :107 .. : 35
## 10.626540184021 : 1 0.01 : 42 5.1 : 5
## 10.7774600982666: 1 0.02 : 19 11.1 : 4
## 11.6993999481201: 1 0.03 : 9 3.3 : 4
## 12.6875896453857: 1 0.04 : 9 3.6 : 4
## (Other) : 15 (Other): 78 (Other):212
## NA's : 3 NA's : 3 NA's : 3
## n.SH.STA.SUIC.MA.P5 n.SH.STA.SUIC.FE.P5 n.SH.ALC.PCAP.LI
## .. : 35 .. : 35 .. : 31
## 5 : 4 3.6 : 7 0.7 : 7
## 8.7 : 4 2.5 : 6 0.2 : 6
## 11.7 : 3 1.5 : 5 11.4 : 5
## 12.9 : 3 2.8 : 5 7 : 5
## (Other):215 (Other):206 (Other):210
## NA's : 3 NA's : 3 NA's : 3
## n.EN.FSH.THRD.NO n.EG.ELC.ACCS.RU.ZS
## .. :264 100 : 72
## : 0 .. : 17
## 0 : 0 17.3994206856074: 2
## 0.01 : 0 3.1 : 2
## 0.010232397845675: 0 3.5 : 2
## (Other) : 0 (Other) :169
## NA's : 3 NA's : 3
for (i in 2:14)
{
wbw[,i] <- replace(wbw[,i], wbw[,i] == "..", NA) # NA without quotes is the missing-value marker, not the string "NA"
}
Most of the countries have up to 4 missing values. To keep as much information as possible while getting a cleaner data set, we filter out the countries with many NA’s.
wbw$na_count <- apply(wbw, 1, function(x) sum(is.na(x)))
table(wbw$na_count)
##
## 1 2 3 4 5 6 7 8 9 10 11 12 13
## 12 4 129 72 10 1 5 5 8 10 3 4 4
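The row-wise count of missing values can also be written without apply(). The sketch below, on a made-up three-row data frame, uses rowSums(is.na()) and then keeps the rows below a cutoff. The toy cutoff of 2 is arbitrary and only mirrors the idea of the na_count filter.

```r
# Made-up data: country A misses one value, B misses two, C misses none
toy <- data.frame(
  country = c("A", "B", "C"),
  gdp     = c(1, NA, 3),
  net     = c(NA, NA, 30)
)

# is.na() gives a logical matrix; rowSums() counts the TRUEs in each row
toy$na_count <- rowSums(is.na(toy))

table(toy$na_count)                  # how many rows have 0, 1, 2 missing values
toy_kept <- subset(toy, na_count < 2) # drop the rows with 2 or more NA's
toy_kept
```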
wbw1 <- subset(wbw, na_count < 5)
Some variables have many empty cells as well, i.e., these statistics are not collected in many countries.
We delete the variables for which most countries have no data.
summary(wbw1)
## country n.NY.GDP.PCAP.PP.KD n.IT.NET.USER.ZS
## Afghanistan : 1 3241.69994573611: 2 15.9 : 4
## Albania : 1 4174.71460976018: 2 3 : 3
## Algeria : 1 1032.9628806199 : 1 45 : 3
## Angola : 1 10436.3655998628: 1 8 : 3
## Antigua and Barbuda: 1 1073.82629515037: 1 1 : 2
## Arab World : 1 (Other) :208 1.7 : 2
## (Other) :211 NA's : 2 (Other):200
## n.IT.NET.SECR.P6 n.SE.COM.DURS n.SE.TER.CUAT.BA.FE.ZS
## 0 : 4 9 :79 10.8434200286865: 1
## 1.42076916942725 : 2 10 :36 11.5630903244019: 1
## 3.64335593664317 : 2 6 :21 11.5771198272705: 1
## 0.0199378354241743: 1 8 :20 13.3864803314209: 1
## 0.0325068985327394: 1 11 :16 14.4489097595215: 1
## (Other) :205 (Other):37 (Other) : 12
## NA's : 2 NA's : 8 NA's :200
## n.SE.TER.CUAT.BA.MA.ZS n.SH.HIV.INCD.ZS n.SH.STA.SUIC.P5
## 10.626540184021 : 1 0.01 :41 3.3 : 4
## 10.7774600982666: 1 0.02 :19 3.6 : 4
## 11.6993999481201: 1 0.03 : 9 5.1 : 4
## 12.6875896453857: 1 0.04 : 9 8 : 4
## 13.4247598648071: 1 0.05 : 7 10 : 3
## (Other) : 12 (Other):70 12.5 : 3
## NA's :200 NA's :62 (Other):195
## n.SH.STA.SUIC.MA.P5 n.SH.STA.SUIC.FE.P5 n.SH.ALC.PCAP.LI
## 5 : 4 3.6 : 7 0.7 : 7
## 8.7 : 4 2.5 : 6 0.2 : 6
## 11.7 : 3 1.5 : 5 7 : 5
## 12.9 : 3 6 : 5 7.1 : 5
## 4.3 : 3 6.1 : 5 11.4 : 4
## 10.1443132276089: 2 1.9 : 4 0.5 : 3
## (Other) :198 (Other):185 (Other):187
## n.EN.FSH.THRD.NO n.EG.ELC.ACCS.RU.ZS na_count
## : 0 100 : 58 Min. :1.000
## .. : 0 17.3994206856074: 2 1st Qu.:3.000
## 0 : 0 3.1 : 2 Median :3.000
## 0.01 : 0 3.5 : 2 Mean :3.203
## 0.010232397845675: 0 64.9773918291319: 2 3rd Qu.:4.000
## (Other) : 0 (Other) :147 Max. :4.000
## NA's :217 NA's : 4
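Instead of reading the NA counts off the summary, the share of missing values per column can be computed directly. This is a minimal sketch on made-up data. The 50% threshold is an assumption chosen for illustration, not a rule taken from the analysis above.

```r
# Made-up data: column a is mostly complete, b is mostly empty, c is empty
toy <- data.frame(
  a = c(1, NA, 3, 4),
  b = c(NA, NA, NA, 4),
  c = c(NA, NA, NA, NA)
)

# Share of missing values in each column
na_share <- colMeans(is.na(toy))
na_share

# Keep only the columns where at most half of the values are missing
keep <- names(na_share)[na_share <= 0.5]
toy_clean <- toy[, keep, drop = FALSE]
names(toy_clean)
```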
wbw1$n.EN.FSH.THRD.NO <- NULL # completely empty (217 NA's)
wbw1$n.SE.TER.CUAT.BA.MA.ZS <- NULL # almost empty (200 NA's)
wbw1$n.SE.TER.CUAT.BA.FE.ZS <- NULL # almost empty (200 NA's)
Before the analysis, make sure all variables except for ‘country’ are read by R as numeric (which they should be).
str(wbw1) # Those factors should be numeric
## 'data.frame': 217 obs. of 12 variables:
## $ country : Factor w/ 267 levels "","Afghanistan",..: 2 3 4 7 8 10 11 13 14 15 ...
## $ n.NY.GDP.PCAP.PP.KD: Factor w/ 1302 levels "","..","0","0.01",..: 404 1300 291 1005 464 447 1023 809 825 378 ...
## $ n.IT.NET.USER.ZS : Factor w/ 1302 levels "","..","0","0.01",..: 760 838 262 491 850 838 579 1096 1090 845 ...
## $ n.IT.NET.SECR.P6 : Factor w/ 1302 levels "","..","0","0.01",..: 87 763 74 132 1004 574 173 347 1035 486 ...
## $ n.SE.COM.DURS : Factor w/ 1302 levels "","..","0","0.01",..: 1180 1109 159 940 202 293 202 159 1109 1180 ...
## $ n.SH.HIV.INCD.ZS : Factor w/ 1302 levels "","..","0","0.01",..: NA 4 4 52 NA 9 4 4 4 4 ...
## $ n.SH.STA.SUIC.P5 : Factor w/ 1302 levels "","..","0","0.01",..: 867 1062 647 885 67 1126 940 262 379 642 ...
## $ n.SH.STA.SUIC.MA.P5: Factor w/ 1302 levels "","..","0","0.01",..: 1124 1200 791 1126 90 332 161 436 579 864 ...
## $ n.SH.STA.SUIC.FE.P5: Factor w/ 1302 levels "","..","0","0.01",..: 137 949 151 491 3 647 478 959 1054 128 ...
## $ n.SH.ALC.PCAP.LI : Factor w/ 1302 levels "","..","0","0.01",..: 50 1064 105 1180 949 1190 883 262 250 494 ...
## $ n.EG.ELC.ACCS.RU.ZS: Factor w/ 1302 levels "","..","0","0.01",..: 686 190 1264 382 1229 1271 1287 190 190 1295 ...
## $ na_count : int 4 3 3 3 4 3 3 3 3 3 ...
for (i in 2:11)
{
wbw1[,i] <- as.numeric(as.character(wbw1[,i]))
}
str(wbw1)
## 'data.frame': 217 obs. of 12 variables:
## $ country : Factor w/ 267 levels "","Afghanistan",..: 2 3 4 7 8 10 11 13 14 15 ...
## $ n.NY.GDP.PCAP.PP.KD: num 1694 9927 12871 6357 19213 ...
## $ n.IT.NET.USER.ZS : num 4 45 12.5 2.8 47 ...
## $ n.IT.NET.SECR.P6 : num 0.486 4.119 0.36 1.284 633.841 ...
## $ n.SE.COM.DURS : num 9 8 10 6 11 13 11 10 8 9 ...
## $ n.SH.HIV.INCD.ZS : num NA 0.01 0.01 0.21 NA 0.03 0.01 0.01 0.01 0.01 ...
## $ n.SH.STA.SUIC.P5 : num 5.1 7.8 3.3 5.7 0.3 8.7 6 12.5 16 3.1 ...
## $ n.SH.STA.SUIC.MA.P5: num 8.6 9.5 4.9 8.7 0.5 14.3 10.1 18.6 25 5 ...
## $ n.SH.STA.SUIC.FE.P5: num 1.4 6.1 1.8 2.8 0 3.3 2.4 6.3 7.5 1.2 ...
## $ n.SH.ALC.PCAP.LI : num 0.2 7.9 0.7 9 6.1 9.3 5.6 12.5 12 2.9 ...
## $ n.EG.ELC.ACCS.RU.ZS: num 32.4 100 97.6 16.2 92.2 ...
## $ na_count : int 4 3 3 3 4 3 3 3 3 3 ...
Nice! Every variable is now of its correct type.
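The conversion above goes through as.character() first for a reason: calling as.numeric() directly on a factor returns the internal level codes rather than the numbers printed on screen. A small sketch with a made-up factor:

```r
# A factor whose labels look like numbers
f <- factor(c("10", "2.5", "10"))

codes  <- as.numeric(f)                # internal level codes, not the values
values <- as.numeric(as.character(f))  # the numbers that the labels spell out

codes
values
```

The same pitfall explains the large integers in the first str() printout: they are level codes, not GDP or suicide rates.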
labs <- c("Country name", # create a list of labels from a separate file downloaded with the data from the World Bank
"GDP per capita, PPP (constant 2011 international $)",
"Individuals using the Internet (% of population)",
"Secure Internet servers (per 1 million people)",
"Compulsory education, duration (years)",
"Incidence of HIV (% of uninfected population ages 15-49)",
"Suicide mortality rate (per 100,000 population)",
"Suicide mortality rate, male (per 100,000 male population)",
"Suicide mortality rate, female (per 100,000 female population)",
"Total alcohol consumption per capita (liters of pure alcohol, projected estimates, 15+ years of age)",
"Access to electricity, rural (% of rural population)",
"Number of NA's") # do not forget the variable that we have just created
library(sjlabelled)
wbw1 <- set_label(wbw1, label = labs)
Now that we have attached proper labels to the variables, we can use them to understand the data better.
library(sjPlot)
view_df(wbw1[2:11], show.prc = FALSE, verbose = FALSE)
| ID | Name | Label | Values | Value Labels |
|---|---|---|---|---|
| 1 | n.NY.GDP.PCAP.PP.KD | GDP per capita, PPP (constant 2011 international $) | range: 660.2-125140.8 | |
| 2 | n.IT.NET.USER.ZS | Individuals using the Internet (% of population) | range: 0.2-93.4 | |
| 3 | n.IT.NET.SECR.P6 | Secure Internet servers (per 1 million people) | range: 0.0-2481.6 | |
| 4 | n.SE.COM.DURS | Compulsory education, duration (years) | range: 5.0-15.0 | |
| 5 | n.SH.HIV.INCD.ZS | Incidence of HIV (% of uninfected population ages 15-49) | range: 0.0-3.1 | |
| 6 | n.SH.STA.SUIC.P5 | Suicide mortality rate (per 100,000 population) | range: 0.3-40.0 | |
| 7 | n.SH.STA.SUIC.MA.P5 | Suicide mortality rate, male (per 100,000 male population) | range: 0.5-71.7 | |
| 8 | n.SH.STA.SUIC.FE.P5 | Suicide mortality rate, female (per 100,000 female population) | range: 0.0-23.2 | |
| 9 | n.SH.ALC.PCAP.LI | Total alcohol consumption per capita (liters of pure alcohol, projected estimates, 15+ years of age) | range: 0.0-17.9 | |
| 10 | n.EG.ELC.ACCS.RU.ZS | Access to electricity, rural (% of rural population) | range: 0.9-100.0 | |
That’s it with the data manipulations. Let’s move on to correlations!
Using the cor() function
cor(wbw1$n.NY.GDP.PCAP.PP.KD, wbw1$n.IT.NET.USER.ZS)
## [1] NA
cor(wbw1$n.NY.GDP.PCAP.PP.KD, wbw1$n.IT.NET.USER.ZS, use = "complete.obs")
## [1] 0.7816712
cor(wbw1$n.NY.GDP.PCAP.PP.KD, wbw1$n.IT.NET.USER.ZS, use = "complete.obs", method = "spearman")
## [1] 0.8896563
cor(wbw1$n.NY.GDP.PCAP.PP.KD, wbw1$n.IT.NET.USER.ZS, use = "complete.obs", method = "kendall")
## [1] 0.7217487
When using cor(), specify which observations to use through the ‘use’ argument. With the default, “everything”, the output is NA whenever the variables contain NA’s. The options are: “everything”, “all.obs”, “complete.obs”, “na.or.complete”, or “pairwise.complete.obs”.
Use “complete.obs” to keep only the observations that have values on all the variables passed to cor().
We can see that both Pearson’s and Spearman’s correlation values are very high. The interpretation is that the higher the country’s GDP per capita, the higher the share of the country’s population using the Internet.
As expected, Kendall’s correlation is lower than Spearman’s. However, its value is also high.
With the cor() function, you do not get the statistical significance of the coefficient.
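The difference between “complete.obs” and “pairwise.complete.obs” only shows up with more than two variables. The sketch below uses a made-up data frame d with three columns and two NA’s: with “complete.obs”, every coefficient is computed on the rows that are complete on all three columns, while with “pairwise.complete.obs” each pair uses every row available for that pair.

```r
# Made-up data: x is missing in row 5, z is missing in row 2
d <- data.frame(
  x = c(1, 2, 3, 4, NA),
  y = c(2, 4, 6, 8, 10),
  z = c(5, NA, 3, 2, 4)
)

cor(d$x, d$y)                        # NA: the default is use = "everything"
cor(d$x, d$y, use = "complete.obs")  # computed on rows 1 to 4

cc <- cor(d, use = "complete.obs")           # all pairs use rows 1, 3, 4 only
pw <- cor(d, use = "pairwise.complete.obs")  # y-z uses rows 1, 3, 4, 5

cc["y", "z"]
pw["y", "z"]
```

The y-z coefficient differs between the two matrices because the pairwise version includes row 5, which the listwise version drops on account of the missing x.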
Using the cor.test() function
cor.test(wbw1$n.NY.GDP.PCAP.PP.KD, wbw1$n.IT.NET.USER.ZS)
##
## Pearson's product-moment correlation
##
## data: wbw1$n.NY.GDP.PCAP.PP.KD and wbw1$n.IT.NET.USER.ZS
## t = 18.291, df = 213, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.7235433 0.8287912
## sample estimates:
## cor
## 0.7816712
cor.test(wbw1$n.NY.GDP.PCAP.PP.KD, wbw1$n.IT.NET.USER.ZS, method = "spearman")
## Warning in cor.test.default(wbw1$n.NY.GDP.PCAP.PP.KD,
## wbw1$n.IT.NET.USER.ZS, : Cannot compute exact p-value with ties
##
## Spearman's rank correlation rho
##
## data: wbw1$n.NY.GDP.PCAP.PP.KD and wbw1$n.IT.NET.USER.ZS
## S = 182770, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.8896563
cor.test(wbw1$n.NY.GDP.PCAP.PP.KD, wbw1$n.IT.NET.USER.ZS, method = "kendall")
##
## Kendall's rank correlation tau
##
## data: wbw1$n.NY.GDP.PCAP.PP.KD and wbw1$n.IT.NET.USER.ZS
## z = 15.736, p-value < 2.2e-16
## alternative hypothesis: true tau is not equal to 0
## sample estimates:
## tau
## 0.7217487
We get the same results, but also much more information: (1) the statistical significance tests of correlation coefficients; (2) confidence intervals for Pearson’s correlation coefficient; and (3) alternative hypotheses to the tests.
Moreover, NA’s are removed by default.
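The object returned by cor.test() can be stored, and its parts extracted by name, which is handy for reporting. This sketch uses the built-in mtcars data set rather than the World Bank data, so the numbers are unrelated to the results above. The exact = FALSE setting in the Spearman call is a choice made here to avoid the warning about ties.

```r
# Pearson's correlation test on a built-in data set
ct <- cor.test(mtcars$mpg, mtcars$wt)

ct$estimate   # the correlation coefficient
ct$p.value    # the p-value of the test
ct$conf.int   # 95% confidence interval (reported for Pearson's coefficient)

# Rank-based tests return no confidence interval
ct_s <- cor.test(mtcars$mpg, mtcars$wt, method = "spearman", exact = FALSE)
is.null(ct_s$conf.int)
```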
Plain scatterplots
plot(wbw1$n.NY.GDP.PCAP.PP.KD, wbw1$n.IT.NET.USER.ZS)
plot(wbw1$n.NY.GDP.PCAP.PP.KD, wbw1$n.IT.NET.USER.ZS, pch = 20, xlab = "GDP PPP", ylab = "Internet users, per cent of the population")
wbws <- na.omit(data.frame(n.NY.GDP.PCAP.PP.KD = wbw1$n.NY.GDP.PCAP.PP.KD, n.IT.NET.USER.ZS = wbw1$n.IT.NET.USER.ZS,
country = wbw1$country))
library(wordcloud)
textplot(wbws$n.NY.GDP.PCAP.PP.KD, wbws$n.IT.NET.USER.ZS, wbws$country, cex = 0.5)
textplot(log(wbws$n.NY.GDP.PCAP.PP.KD), wbws$n.IT.NET.USER.ZS, wbws$country, cex = 0.5)
With fewer countries, a labelled plot like this can be very informative.
library("ggpubr")
ggscatter(wbws, x = "n.NY.GDP.PCAP.PP.KD", y = "n.IT.NET.USER.ZS",
add = "reg.line", conf.int = TRUE,
cor.coef = TRUE, cor.method = "pearson",
xlab = "GDP", ylab = "Internet users, per cent of the population")
ggscatter(wbws, x = "n.NY.GDP.PCAP.PP.KD", y = "n.IT.NET.USER.ZS",
add = "reg.line", conf.int = TRUE,
cor.coef = TRUE, cor.method = "pearson",
xlab = "GDP", ylab = "Internet users, per cent of the population",
ylim = c(0, 100))
We can add the regression line with its confidence band.
Take a look at the Y scale in the first plot: R does not know that a percentage cannot exceed 100, so it continues the scale upwards. The second call limits the axis with ylim = c(0, 100).
Is it a linear relationship, after all?
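One way to probe this question is to compare the coefficients before and after a log transformation, as in the second textplot() call. The sketch below uses simulated data (not the World Bank values) in which y depends on log(x): Pearson’s coefficient rises after the transformation, while Spearman’s stays the same because taking logs does not change the ranks.

```r
set.seed(1)

# Simulated, right-skewed x and a y that is linear in log(x) plus noise
x <- exp(runif(200, min = 6, max = 11))
y <- 10 * log(x) + rnorm(200, sd = 5)

pearson_raw  <- cor(x, y)
pearson_log  <- cor(log(x), y)
spearman_raw <- cor(x, y, method = "spearman")
spearman_log <- cor(log(x), y, method = "spearman")

c(pearson_raw = pearson_raw, pearson_log = pearson_log)
c(spearman_raw = spearman_raw, spearman_log = spearman_log)
```

A gap between Pearson’s and Spearman’s coefficients on the raw scale, like the 0.78 versus 0.89 reported earlier for GDP and Internet use, is one hint that the relationship may be monotonic but not linear.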
library(GGally)
ggpairs(data = wbws, # data.frame with variables
columns = 1:2)
ggpairs(wbws[, 1:2],
upper = list(continuous = wrap("cor", size = 8)))
Pearson’s product moment correlations
library(sjPlot)
sjt.corr(wbw1[, 2:12])
| | GDP per capita, PPP (constant 2011 international $) | Individuals using the Internet (% of population) | Secure Internet servers (per 1 million people) | Compulsory education, duration (years) | Incidence of HIV (% of uninfected population ages 15-49) | Suicide mortality rate (per 100,000 population) | Suicide mortality rate, male (per 100,000 male population) | Suicide mortality rate, female (per 100,000 female population) | Total alcohol consumption per capita (liters of pure alcohol, projected estimates, 15+ years of age) | Access to electricity, rural (% of rural population) | Number of NA’s |
|---|---|---|---|---|---|---|---|---|---|---|---|
| GDP per capita, PPP (constant 2011 international $) | | 0.764*** | 0.567*** | 0.209* | -0.158 | 0.131 | 0.144 | -0.015 | 0.241** | 0.515*** | -0.251** |
| Individuals using the Internet (% of population) | 0.764*** | | 0.664*** | 0.348*** | -0.264** | 0.265** | 0.298*** | 0.056 | 0.476*** | 0.738*** | -0.158 |
| Secure Internet servers (per 1 million people) | 0.567*** | 0.664*** | | 0.195* | -0.114 | 0.114 | 0.096 | 0.123 | 0.304*** | 0.301*** | -0.010 |
| Compulsory education, duration (years) | 0.209* | 0.348*** | 0.195* | | -0.270** | -0.005 | 0.042 | -0.143 | 0.170* | 0.397*** | -0.209* |
| Incidence of HIV (% of uninfected population ages 15-49) | -0.158 | -0.264** | -0.114 | -0.270** | | 0.023 | -0.028 | 0.176* | 0.026 | -0.384*** | 0.103 |
| Suicide mortality rate (per 100,000 population) | 0.131 | 0.265** | 0.114 | -0.005 | 0.023 | | 0.978*** | 0.752*** | 0.592*** | 0.241** | -0.025 |
| Suicide mortality rate, male (per 100,000 male population) | 0.144 | 0.298*** | 0.096 | 0.042 | -0.028 | 0.978*** | | 0.600*** | 0.622*** | 0.298*** | -0.050 |
| Suicide mortality rate, female (per 100,000 female population) | -0.015 | 0.056 | 0.123 | -0.143 | 0.176* | 0.752*** | 0.600*** | | 0.317*** | -0.027 | 0.085 |
| Total alcohol consumption per capita (liters of pure alcohol, projected estimates, 15+ years of age) | 0.241** | 0.476*** | 0.304*** | 0.170* | 0.026 | 0.592*** | 0.622*** | 0.317*** | | 0.304*** | 0.011 |
| Access to electricity, rural (% of rural population) | 0.515*** | 0.738*** | 0.301*** | 0.397*** | -0.384*** | 0.241** | 0.298*** | -0.027 | 0.304*** | | -0.168* |
| Number of NA’s | -0.251** | -0.158 | -0.010 | -0.209* | 0.103 | -0.025 | -0.050 | 0.085 | 0.011 | -0.168* | |

Computed correlation used pearson-method with listwise-deletion.
At the bottom of the table, ‘listwise deletion’ is mentioned. It means that this correlation matrix was calculated only for the countries that contained the data on ALL the variables used in this matrix.
Try interpreting some of the correlations.
Why is the diagonal empty?
Spearman’s correlation coefficient (it transforms all the variables into ranks)
sjt.corr(wbw1[, 2:12], corr.method = "spearman")
| | GDP per capita, PPP (constant 2011 international $) | Individuals using the Internet (% of population) | Secure Internet servers (per 1 million people) | Compulsory education, duration (years) | Incidence of HIV (% of uninfected population ages 15-49) | Suicide mortality rate (per 100,000 population) | Suicide mortality rate, male (per 100,000 male population) | Suicide mortality rate, female (per 100,000 female population) | Total alcohol consumption per capita (liters of pure alcohol, projected estimates, 15+ years of age) | Access to electricity, rural (% of rural population) | Number of NA’s |
|---|---|---|---|---|---|---|---|---|---|---|---|
| GDP per capita, PPP (constant 2011 international $) | | 0.887*** | 0.840*** | 0.371*** | -0.460*** | 0.226** | 0.295*** | 0.019 | 0.450*** | 0.837*** | -0.221** |
| Individuals using the Internet (% of population) | 0.887*** | | 0.884*** | 0.411*** | -0.529*** | 0.212* | 0.292*** | -0.001 | 0.494*** | 0.886*** | -0.187* |
| Secure Internet servers (per 1 million people) | 0.840*** | 0.884*** | | 0.410*** | -0.367*** | 0.237** | 0.325*** | 0.005 | 0.536*** | 0.758*** | -0.204* |
| Compulsory education, duration (years) | 0.371*** | 0.411*** | 0.410*** | | -0.368*** | -0.030 | 0.044 | -0.157 | 0.213* | 0.415*** | -0.234** |
| Incidence of HIV (% of uninfected population ages 15-49) | -0.460*** | -0.529*** | -0.367*** | -0.368*** | | -0.032 | -0.057 | 0.044 | 0.003 | -0.662*** | 0.178* |
| Suicide mortality rate (per 100,000 population) | 0.226** | 0.212* | 0.237** | -0.030 | -0.032 | | 0.965*** | 0.849*** | 0.571*** | 0.190* | 0.084 |
| Suicide mortality rate, male (per 100,000 male population) | 0.295*** | 0.292*** | 0.325*** | 0.044 | -0.057 | 0.965*** | | 0.705*** | 0.622*** | 0.255** | 0.070 |
| Suicide mortality rate, female (per 100,000 female population) | 0.019 | -0.001 | 0.005 | -0.157 | 0.044 | 0.849*** | 0.705*** | | 0.353*** | -0.007 | 0.142 |
| Total alcohol consumption per capita (liters of pure alcohol, projected estimates, 15+ years of age) | 0.450*** | 0.494*** | 0.536*** | 0.213* | 0.003 | 0.571*** | 0.622*** | 0.353*** | | 0.397*** | 0.008 |
| Access to electricity, rural (% of rural population) | 0.837*** | 0.886*** | 0.758*** | 0.415*** | -0.662*** | 0.190* | 0.255** | -0.007 | 0.397*** | | -0.187* |
| Number of NA’s | -0.221** | -0.187* | -0.204* | -0.234** | 0.178* | 0.084 | 0.070 | 0.142 | 0.008 | -0.187* | |

Computed correlation used spearman-method with listwise-deletion.
Compare the correlation coefficients for the last variable with those obtained with Pearson’s coefficient. Why are they different? Which of them suits our data better here?
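As a hint for this comparison, recall that Spearman’s coefficient is Pearson’s coefficient computed on ranks, which makes it much less sensitive to extreme values. The sketch below uses made-up numbers with one outlier to illustrate both points. It is not derived from the World Bank data.

```r
# Made-up data: a perfectly monotonic relationship with one extreme y value
x <- 1:10
y <- c(1:9, 100)

pearson  <- cor(x, y)                       # pulled below 1 by the outlier
spearman <- cor(x, y, method = "spearman")  # the ranks agree perfectly

# Spearman's rho is Pearson's r computed on the ranks
on_ranks <- cor(rank(x), rank(y))

c(pearson = pearson, spearman = spearman, on_ranks = on_ranks)
```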
Kendall’s tau (it transforms all the variables into ranks but uses a different formula)
sjt.corr(wbw1[, 2:12], corr.method = "kendall")
| | GDP per capita, PPP (constant 2011 international $) | Individuals using the Internet (% of population) | Secure Internet servers (per 1 million people) | Compulsory education, duration (years) | Incidence of HIV (% of uninfected population ages 15-49) | Suicide mortality rate (per 100,000 population) | Suicide mortality rate, male (per 100,000 male population) | Suicide mortality rate, female (per 100,000 female population) | Total alcohol consumption per capita (liters of pure alcohol, projected estimates, 15+ years of age) | Access to electricity, rural (% of rural population) | Number of NA’s |
|---|---|---|---|---|---|---|---|---|---|---|---|
| GDP per capita, PPP (constant 2011 international $) | | 0.725*** | 0.672*** | 0.270*** | -0.323*** | 0.145* | 0.191*** | 0.009 | 0.323*** | 0.674*** | -0.181** |
| Individuals using the Internet (% of population) | 0.725*** | | 0.712*** | 0.306*** | -0.367*** | 0.135* | 0.193*** | -0.006 | 0.347*** | 0.705*** | -0.154* |
| Secure Internet servers (per 1 million people) | 0.672*** | 0.712*** | | 0.307*** | -0.255*** | 0.150** | 0.209*** | -0.000 | 0.380*** | 0.576*** | -0.167* |
| Compulsory education, duration (years) | 0.270*** | 0.306*** | 0.307*** | | -0.270*** | -0.023 | 0.027 | -0.110 | 0.142* | 0.308*** | -0.207** |
| Incidence of HIV (% of uninfected population ages 15-49) | -0.323*** | -0.367*** | -0.255*** | -0.270*** | | -0.011 | -0.031 | 0.030 | 0.005 | -0.490*** | 0.152* |
| Suicide mortality rate (per 100,000 population) | 0.145* | 0.135* | 0.150** | -0.023 | -0.011 | | 0.855*** | 0.671*** | 0.406*** | 0.123* | 0.069 |
| Suicide mortality rate, male (per 100,000 male population) | 0.191*** | 0.193*** | 0.209*** | 0.027 | -0.031 | 0.855*** | | 0.528*** | 0.449*** | 0.175** | 0.057 |
| Suicide mortality rate, female (per 100,000 female population) | 0.009 | -0.006 | -0.000 | -0.110 | 0.030 | 0.671*** | 0.528*** | | 0.241*** | -0.009 | 0.117 |
| Total alcohol consumption per capita (liters of pure alcohol, projected estimates, 15+ years of age) | 0.323*** | 0.347*** | 0.380*** | 0.142* | 0.005 | 0.406*** | 0.449*** | 0.241*** | | 0.275*** | 0.006 |
| Access to electricity, rural (% of rural population) | 0.674*** | 0.705*** | 0.576*** | 0.308*** | -0.490*** | 0.123* | 0.175** | -0.009 | 0.275*** | | -0.156* |
| Number of NA’s | -0.181** | -0.154* | -0.167* | -0.207** | 0.152* | 0.069 | 0.057 | 0.117 | 0.006 | -0.156* | |

Computed correlation used kendall-method with listwise-deletion.
Compare Kendall’s correlations with Spearman’s correlations.
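To see why Kendall’s tau tends to be smaller in absolute value than Spearman’s rho, it helps to compute tau by hand once. The sketch below uses five made-up observations without ties: it counts concordant and discordant pairs, divides their difference by the number of pairs, and checks the result against cor(). The by-hand formula shown here is the simple version for data without ties.

```r
# Made-up data without ties
x <- c(1, 2, 3, 4, 5)
y <- c(2, 1, 4, 3, 5)

# All pairs of observations (one pair per column)
pairs <- combn(length(x), 2)

# +1 for a concordant pair (x and y move in the same direction), -1 for a discordant one
s <- sign(x[pairs[1, ]] - x[pairs[2, ]]) * sign(y[pairs[1, ]] - y[pairs[2, ]])

tau_by_hand <- sum(s) / ncol(pairs)   # (8 concordant - 2 discordant) / 10 pairs

tau_by_hand
cor(x, y, method = "kendall")   # same value
cor(x, y, method = "spearman")  # larger for these data
```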
sjp.corr(wbw1[, 2:12])
Here, color is used to indicate positive and negative correlations.
The columns are in reverse order, so that the part with the correlations remains on the left and is easier to read.
Compare this plot with the table of Pearson’s product moment correlations above: the coefficients should be the same.
In your projects, use either matrices or tables depending on which suits your goal better.