We have decided to work on a task proposed by the World Bank on the following page :
https://www.drivendata.org/competitions/1/united-nations-millennium-development-goals/page/3/
In the year 2000, United Nations proposed a set of 8 development goals which should be achieved in every country worldwide.
Aim of the study :
After renaming the columns, we have a data frame with lines for countries and series, and 36 columns for the values, each one for one year from 1972 to 2007 :
#Below modify the location of your working folder
setwd("C:/Users/chloe/Google Drive/M1_T3_Big_Data_Analytics/Projet_BigDataAna")
library(tidyverse)
library(dplyr)
library(ggplot2)
library(cluster)
library(reshape2)
library(ggforce)
library(leaps)
RawData <- read_csv('_Data/TrainingSet.csv')
RawData <- RawData[,2:40]
for (i in 1:36) {
colnames(RawData)[i] <- 1972+i-1
}
RawData <- RawData %>%
rename(
"CountryName" = "Country Name",
"SerieCode" = "Series Code",
"SerieName" = "Series Name",
)
str(RawData)
## tibble [195,402 x 39] (S3: tbl_df/tbl/data.frame)
## $ 1972 : num [1:195402] NA NA NA NA NA NA NA NA NA NA ...
## $ 1973 : num [1:195402] NA NA NA NA NA NA NA NA NA NA ...
## $ 1974 : num [1:195402] NA NA NA NA NA NA NA NA NA NA ...
## $ 1975 : num [1:195402] NA NA NA NA NA NA NA NA NA NA ...
## $ 1976 : num [1:195402] NA NA NA NA NA NA NA NA NA NA ...
## $ 1977 : num [1:195402] NA NA NA NA NA NA NA NA NA NA ...
## $ 1978 : num [1:195402] NA NA NA NA NA NA NA NA NA NA ...
## $ 1979 : num [1:195402] NA NA NA NA NA NA NA NA NA NA ...
## $ 1980 : num [1:195402] NA NA NA NA NA NA NA NA NA NA ...
## $ 1981 : num [1:195402] NA NA NA NA NA NA NA NA NA NA ...
## $ 1982 : num [1:195402] NA NA NA NA NA NA NA NA NA NA ...
## $ 1983 : num [1:195402] NA NA NA NA NA NA NA NA NA NA ...
## $ 1984 : num [1:195402] NA NA NA NA NA NA NA NA NA NA ...
## $ 1985 : num [1:195402] NA NA NA NA NA NA NA NA NA NA ...
## $ 1986 : num [1:195402] NA NA NA NA NA NA NA NA NA NA ...
## $ 1987 : num [1:195402] NA NA NA NA NA NA NA NA NA NA ...
## $ 1988 : num [1:195402] NA NA NA NA NA NA NA NA NA NA ...
## $ 1989 : num [1:195402] NA NA NA NA NA NA NA NA NA NA ...
## $ 1990 : num [1:195402] NA NA NA NA NA NA NA NA NA NA ...
## $ 1991 : num [1:195402] NA NA NA NA NA NA NA NA NA NA ...
## $ 1992 : num [1:195402] NA NA NA NA NA NA NA NA NA NA ...
## $ 1993 : num [1:195402] NA NA NA NA NA NA NA NA NA NA ...
## $ 1994 : num [1:195402] NA NA NA NA NA NA NA NA NA NA ...
## $ 1995 : num [1:195402] NA NA NA NA NA NA NA NA NA NA ...
## $ 1996 : num [1:195402] NA NA NA NA NA NA NA NA NA NA ...
## $ 1997 : num [1:195402] NA NA NA NA NA NA NA NA NA NA ...
## $ 1998 : num [1:195402] NA NA NA NA NA NA NA NA NA NA ...
## $ 1999 : num [1:195402] NA NA NA NA NA NA NA NA NA NA ...
## $ 2000 : num [1:195402] NA NA NA NA NA NA NA NA NA NA ...
## $ 2001 : num [1:195402] NA NA NA NA NA NA NA NA NA NA ...
## $ 2002 : num [1:195402] NA NA NA NA NA NA NA NA NA NA ...
## $ 2003 : num [1:195402] NA NA NA NA NA NA NA NA NA NA ...
## $ 2004 : num [1:195402] NA NA NA NA NA NA NA NA NA NA ...
## $ 2005 : num [1:195402] NA NA NA NA NA NA NA NA NA NA ...
## $ 2006 : num [1:195402] NA NA NA NA NA NA NA NA NA NA ...
## $ 2007 : num [1:195402] 3.77 7.03 8.24 12.93 19 ...
## $ CountryName: chr [1:195402] "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
## $ SerieCode : chr [1:195402] "allsi.bi_q1" "allsp.bi_q1" "allsa.bi_q1" "allsi.gen_pop" ...
## $ SerieName : chr [1:195402] "(%) Benefits held by 1st 20% population - All Social Insurance" "(%) Benefits held by 1st 20% population - All Social Protection" "(%) Benefits held by 1st 20% population - All Social Safety Nets" "(%) Generosity of All Social Insurance" ...
The dataset is composed of observations of up to 1305 series in 214 countries :
ListePays <- unique(RawData$`CountryName`)
(ListePays)
## [1] "Afghanistan" "Albania"
## [3] "Algeria" "American Samoa"
## [5] "Andorra" "Angola"
## [7] "Antigua and Barbuda" "Argentina"
## [9] "Armenia" "Aruba"
## [11] "Australia" "Austria"
## [13] "Azerbaijan" "Bahamas, The"
## [15] "Bahrain" "Bangladesh"
## [17] "Barbados" "Belarus"
## [19] "Belgium" "Belize"
## [21] "Benin" "Bermuda"
## [23] "Bhutan" "Bolivia"
## [25] "Bosnia and Herzegovina" "Botswana"
## [27] "Brazil" "Brunei Darussalam"
## [29] "Bulgaria" "Burkina Faso"
## [31] "Burundi" "Cabo Verde"
## [33] "Cambodia" "Cameroon"
## [35] "Canada" "Cayman Islands"
## [37] "Central African Republic" "Chad"
## [39] "Channel Islands" "Chile"
## [41] "China" "Colombia"
## [43] "Comoros" "Congo, Dem. Rep."
## [45] "Congo, Rep." "Costa Rica"
## [47] "Cote d'Ivoire" "Croatia"
## [49] "Cuba" "Curacao"
## [51] "Cyprus" "Czech Republic"
## [53] "Denmark" "Djibouti"
## [55] "Dominica" "Dominican Republic"
## [57] "Ecuador" "Egypt, Arab Rep."
## [59] "El Salvador" "Equatorial Guinea"
## [61] "Eritrea" "Estonia"
## [63] "Ethiopia" "Faeroe Islands"
## [65] "Fiji" "Finland"
## [67] "France" "French Polynesia"
## [69] "Gabon" "Gambia, The"
## [71] "Georgia" "Germany"
## [73] "Ghana" "Greece"
## [75] "Greenland" "Grenada"
## [77] "Guam" "Guatemala"
## [79] "Guinea" "Guinea-Bissau"
## [81] "Guyana" "Haiti"
## [83] "Honduras" "Hong Kong SAR, China"
## [85] "Hungary" "Iceland"
## [87] "India" "Indonesia"
## [89] "Iran, Islamic Rep." "Iraq"
## [91] "Ireland" "Isle of Man"
## [93] "Israel" "Italy"
## [95] "Jamaica" "Japan"
## [97] "Jordan" "Kazakhstan"
## [99] "Kenya" "Kiribati"
## [101] "Korea, Dem. Rep." "Korea, Rep."
## [103] "Kosovo" "Kuwait"
## [105] "Kyrgyz Republic" "Lao PDR"
## [107] "Latvia" "Lebanon"
## [109] "Lesotho" "Liberia"
## [111] "Libya" "Liechtenstein"
## [113] "Lithuania" "Luxembourg"
## [115] "Macao SAR, China" "Macedonia, FYR"
## [117] "Madagascar" "Malawi"
## [119] "Malaysia" "Maldives"
## [121] "Mali" "Malta"
## [123] "Marshall Islands" "Mauritania"
## [125] "Mauritius" "Mexico"
## [127] "Micronesia, Fed. Sts." "Moldova"
## [129] "Monaco" "Mongolia"
## [131] "Montenegro" "Morocco"
## [133] "Mozambique" "Myanmar"
## [135] "Namibia" "Nepal"
## [137] "Netherlands" "New Caledonia"
## [139] "New Zealand" "Nicaragua"
## [141] "Niger" "Nigeria"
## [143] "Northern Mariana Islands" "Norway"
## [145] "Oman" "Pakistan"
## [147] "Palau" "Panama"
## [149] "Papua New Guinea" "Paraguay"
## [151] "Peru" "Philippines"
## [153] "Poland" "Portugal"
## [155] "Puerto Rico" "Qatar"
## [157] "Romania" "Russian Federation"
## [159] "Rwanda" "Samoa"
## [161] "San Marino" "Sao Tome and Principe"
## [163] "Saudi Arabia" "Senegal"
## [165] "Serbia" "Seychelles"
## [167] "Sierra Leone" "Singapore"
## [169] "Sint Maarten (Dutch part)" "Slovak Republic"
## [171] "Slovenia" "Solomon Islands"
## [173] "Somalia" "South Africa"
## [175] "South Sudan" "Spain"
## [177] "Sri Lanka" "St. Kitts and Nevis"
## [179] "St. Lucia" "St. Martin (French part)"
## [181] "St. Vincent and the Grenadines" "Sudan"
## [183] "Suriname" "Swaziland"
## [185] "Sweden" "Switzerland"
## [187] "Syrian Arab Republic" "Tajikistan"
## [189] "Tanzania" "Thailand"
## [191] "Timor-Leste" "Togo"
## [193] "Tonga" "Trinidad and Tobago"
## [195] "Tunisia" "Turkey"
## [197] "Turkmenistan" "Turks and Caicos Islands"
## [199] "Tuvalu" "Uganda"
## [201] "Ukraine" "United Arab Emirates"
## [203] "United Kingdom" "United States"
## [205] "Uruguay" "Uzbekistan"
## [207] "Vanuatu" "Venezuela, RB"
## [209] "Vietnam" "Virgin Islands (U.S.)"
## [211] "West Bank and Gaza" "Yemen, Rep."
## [213] "Zambia" "Zimbabwe"
ListeSeries <- unique(RawData[c("SerieCode","SerieName")])
(ListeSeries)
## # A tibble: 1,305 x 2
## SerieCode SerieName
## <chr> <chr>
## 1 allsi.bi_q1 (%) Benefits held by 1st 20% population - All Social Insurance
## 2 allsp.bi_q1 (%) Benefits held by 1st 20% population - All Social Protection
## 3 allsa.bi_q1 (%) Benefits held by 1st 20% population - All Social Safety Ne~
## 4 allsi.gen_pop (%) Generosity of All Social Insurance
## 5 allsp.gen_pop (%) Generosity of All Social Protection
## 6 allsa.gen_pop (%) Generosity of All Social Safety Nets
## 7 allsi.cov_pop (%) Program participation - All Social Insurance
## 8 allsp.cov_pop (%) Program participation - All Social Protection
## 9 allsa.cov_pop (%) Program participation - All Social Safety Nets
## 10 lm_ub.cov_pop (%) Program participation - Unemp benefits and ALMP
## # ... with 1,295 more rows
The proposed data frame format is not easy to handle, so we transform it from a large dataset to a long dataset. The latter is composed of 4 columns: CountryName,SerieCode,Year and Value.
RawData <- RawData[,1:38]
DataLong <- melt( RawData, na.rm= TRUE, id.vars=c("CountryName","SerieCode"),
value.name="Value", variable.name="Year" )
str(DataLong)
## 'data.frame': 3714187 obs. of 4 variables:
## $ CountryName: chr "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
## $ SerieCode : chr "NY.ADJ.DCO2.GN.ZS" "NY.ADJ.DCO2.CD" "NY.ADJ.AEDU.GN.ZS" "NY.ADJ.AEDU.CD" ...
## $ Year : Factor w/ 36 levels "1972","1973",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Value : num 1.38e-01 2.26e+06 1.03 1.68e+07 1.37 ...
We decided to focus on one of the 8 UN development goals.
Moreover, given the high number of parameters contained in the database, we made a shorlist of influencing parameters from litterature.
Target variables :
TargetSeries=c("SH.DYN.MORT.FE","SH.DYN.MORT.MA")
ListeSeries[ListeSeries$SerieCode %in% TargetSeries,]
## # A tibble: 2 x 2
## SerieCode SerieName
## <chr> <chr>
## 1 SH.DYN.MORT.FE Mortality rate, under-5, female (per 1,000)
## 2 SH.DYN.MORT.MA Mortality rate, under-5, male (per 1,000)
Short list of influencing parameters :
Parameters=c("SH.STA.BRTC.ZS", "SH.DTH.COMM.ZS", "SH.DTH.INJR.ZS", "SH.DTH.NCOM.ZS", "SH.HIV.0014", "SH.MLR.TRET.ZS", "SH.STA.ORCF.ZS", "SH.STA.ORTH", "SH.STA.BFED.ZS","NY.GDP.MKTP.PP.KD","SH.XPD.TOTL.ZS","SH.IMM.IDPT","SH.IMM.MEAS","SH.STA.ACSN")
ListeSeries[ListeSeries$SerieCode %in% Parameters,]
## # A tibble: 14 x 2
## SerieCode SerieName
## <chr> <chr>
## 1 SH.STA.BRTC.ZS Births attended by skilled health staff (% of total)
## 2 SH.DTH.COMM.ZS Cause of death, by communicable diseases and maternal, pren~
## 3 SH.DTH.INJR.ZS Cause of death, by injury (% of total)
## 4 SH.DTH.NCOM.ZS Cause of death, by non-communicable diseases (% of total)
## 5 SH.STA.ORTH Diarrhea treatment (% of children under 5 who received ORS ~
## 6 SH.STA.BFED.ZS Exclusive breastfeeding (% of children under 6 months)
## 7 NY.GDP.MKTP.PP.~ GDP, PPP (constant 2011 international $)
## 8 SH.XPD.TOTL.ZS Health expenditure, total (% of GDP)
## 9 SH.IMM.IDPT Immunization, DPT (% of children ages 12-23 months)
## 10 SH.IMM.MEAS Immunization, measles (% of children ages 12-23 months)
## 11 SH.STA.ACSN Improved sanitation facilities (% of population with access)
## 12 SH.STA.ORCF.ZS Diarrhea treatment (% of children under 5 receiving oral re~
## 13 SH.HIV.0014 Children (0-14) living with HIV
## 14 SH.MLR.TRET.ZS Children with fever receiving antimalarial drugs (% of chil~
All countries do not have a measure for the target variable. Hence we create a subset of the interesting countries for further analysis.
#Reducing Dataset to interesting series
DataReduced <- DataLong[DataLong$SerieCode %in% c(TargetSeries,Parameters),]
#Reducing Dataset to countries where target variables have been measured
PaysSeries <- unique(DataReduced[c("CountryName","SerieCode")])
ListePaysOK <- unique(PaysSeries[PaysSeries$SerieCode %in% TargetSeries,1])
In this part, we want to group similar countries together to further analyse the factors.
We will use all the target variable and the selected factors to cluster them as a first step.
As for these series some values are missing, the clustering is done on an average on the 1972-2007 period.
PaysSeriesOK <-PaysSeries[PaysSeries$CountryName %in% ListePaysOK,]
DataCluster <- PaysSeriesOK
for (i in 1:nrow(DataCluster)) {
DataCluster[i,3]=mean(DataReduced[DataReduced$CountryName==DataCluster[i,1] & DataReduced$SerieCode==DataCluster[i,2],4], na.rm = TRUE)
}
colnames(DataCluster)[3] <- "Mean"
DataCluster <- dcast(DataCluster, CountryName ~ SerieCode)
## Using Mean as value column: use value.var to override.
Still, some series have never been measured in some countries. Then, in order to compare countries together along all the dimensions, we select the countries where all series have a value. This will be our benchmark group.
QtteNA<-DataCluster[,]
QtteNA[,18] <- rowSums(is.na(QtteNA[,2:17]))
print(paste("Number of countries with extensive measures along selected dimensions :",length(QtteNA[QtteNA$V18==0,1])))
## [1] "Number of countries with extensive measures along selected dimensions : 29"
ListePaysBM <- QtteNA[QtteNA$V18==0,1]
ListePaysBM
## [1] "Benin" "Burkina Faso" "Burundi"
## [4] "Cameroon" "Chad" "Congo, Dem. Rep."
## [7] "Congo, Rep." "Cote d'Ivoire" "Djibouti"
## [10] "Equatorial Guinea" "Eritrea" "Ethiopia"
## [13] "Ghana" "Guinea" "Guinea-Bissau"
## [16] "Haiti" "Kenya" "Liberia"
## [19] "Malawi" "Mozambique" "Namibia"
## [22] "Nigeria" "Rwanda" "Sierra Leone"
## [25] "Swaziland" "Tanzania" "Togo"
## [28] "Uganda" "Zambia"
DataCluster <- DataCluster[DataCluster$CountryName %in% ListePaysBM,]
We use the kmeans methology to cluster the benchmark countries (5 clusters) :
NbClusters <-5
km_out <- kmeans(DataCluster[,2:17], centers = NbClusters, nstart = 50)
ClusterPays <- data.frame(DataCluster[,1],km_out$cluster)
colnames(ClusterPays)[2] <- "Cluster"
for (i in 1:NbClusters) {
PlotData <- DataReduced[DataReduced$CountryName %in% ClusterPays[ClusterPays$Cluster==i,1],]
for (j in 1:3) {
p <- ggplot(data=PlotData, aes(x = Year, y = Value,color=CountryName,group=CountryName)) +
geom_point()+
geom_path()+
#facet_wrap(~PlotData$SerieCode,scales="free_y")+
facet_wrap_paginate(~PlotData$SerieCode, ncol = 2, nrow = 2, page=j,scales="free_y")+
scale_x_discrete(breaks = seq(1970, 2010, 5))+
labs(x="",y="",title =paste("Cluster ",i," data"), subtitle = paste("Page ",j))
print(p)
}
}
All regression function stumble on missing values in the dataset. The easiest solution is to fill the NA values with the average value of the given serie. After this step, we can apply linear regression for each country. We select the meaningful coefficients on a p-value criterion, and then compare it between countries.
#regression for each country
#Le data frame TabFacteurs contient les coefficients de regression pour chaque facteur et chaque Pays, si le seuil de P-test a ƩtƩ franchi (significatif).
TabFacteurs <- PaysSeriesOK
TabFacteurs <- TabFacteurs[TabFacteurs$CountryName %in% ListePaysBM,]
TabFacteurs$Estimate <- 0
for (p in ListePaysBM) {
DataBM <-DataReduced
DataBM <-DataBM[DataBM$CountryName==p,]
DataBM <- dcast(DataBM, CountryName + Year ~ SerieCode)
DataBM$Mortality <- (DataBM$SH.DYN.MORT.FE+DataBM$SH.DYN.MORT.MA)/2
DataBM$SH.DYN.MORT.FE <- NULL
DataBM$SH.DYN.MORT.MA <- NULL
DataBM$CountryName <- NULL
DataBM$Year <- NULL
for (c in 1:ncol(DataBM)) {
DataBM[is.na(DataBM)] <- mean(DataBM[,c])
}
model2 <- lm(data=DataBM, Mortality~.)
summary(model2)
result <- as.data.frame(summary(model2)$coefficients)
result$CountryName <- p
result$SerieCode <- rownames(result)
result<- result[result$"Pr(>|t|)"<0.05,]
result = result %>% select(CountryName, SerieCode, Estimate)
TabFacteurs <- merge(x = TabFacteurs, y=result,by=c("CountryName","SerieCode","Estimate"), all.x=TRUE, all.y=TRUE)
}
In the benchmark group, the following series have been identified as statistically significant in at least one country :
SeriesLM <- aggregate(TabFacteurs$Estimate, by=list(SerieCode=TabFacteurs$SerieCode), FUN=sum)
SeriesLM <- SeriesLM[SeriesLM$x != 0,]
ListeSeries[ListeSeries$SerieCode %in% SeriesLM[,1],]
## # A tibble: 8 x 2
## SerieCode SerieName
## <chr> <chr>
## 1 SH.STA.BRTC.ZS Births attended by skilled health staff (% of total)
## 2 SH.DTH.COMM.ZS Cause of death, by communicable diseases and maternal, prena~
## 3 NY.GDP.MKTP.PP.~ GDP, PPP (constant 2011 international $)
## 4 SH.XPD.TOTL.ZS Health expenditure, total (% of GDP)
## 5 SH.IMM.IDPT Immunization, DPT (% of children ages 12-23 months)
## 6 SH.IMM.MEAS Immunization, measles (% of children ages 12-23 months)
## 7 SH.STA.ACSN Improved sanitation facilities (% of population with access)
## 8 SH.HIV.0014 Children (0-14) living with HIV
Here is how the estimates of factors vary across countries :
for (s in SeriesLM[,1]) {
p <- ggplot(data=TabFacteurs[TabFacteurs$SerieCode==s,], aes(x = CountryName, y = Estimate)) +
geom_bar(stat = "identity")+
labs(title ="Estimates for benchmark group", subtitle = s)
print(p)
}