This is a non-professional analysis of the relationships between GDP, human life expectancy (HLE), and world population. I am not a sociologist, epidemiologist, or economist; this was done purely as a self-training exercise in Java, MySQL, and R with R Markdown.
The data were obtained from the World Bank website and consist of three tables: human life expectancy, gross domestic product (GDP), and population.
The rows contain national and sub-national records, and the columns hold the country name and code, the geographical zone (geozone), and the yearly values from 1990 to 2019, that is, 30 years of records. For several countries the first available data start in 2000. For others, the gaps reflect the underlying politics; for instance, Kosovo has no HLE data until 2011.
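Gaps like the Kosovo one can be located by asking for the first year in which each country reports a value. The following is only a minimal sketch in base R: the table name `hle`, the long layout, and every value except Albania's 1990 figure are hypothetical stand-ins, with `NA` marking a missing report.

```r
# Toy long-format records; NA marks a year with no reported value.
# Only Albania 1990 (71.8) comes from the real data, the rest is made up.
hle <- data.frame(
  country = c("Albania", "Albania", "Albania", "Kosovo", "Kosovo", "Kosovo"),
  yearID  = c(1990, 2010, 2011, 1990, 2010, 2011),
  age     = c(71.8, 76.0, 76.2, NA, NA, 70.0),
  stringsAsFactors = FALSE
)

# Drop the missing reports, then take the earliest remaining year per country
reported   <- hle[!is.na(hle$age), ]
first_year <- aggregate(yearID ~ country, data = reported, FUN = min)
names(first_year)[2] <- "first_year"

print(first_year)
```

Countries whose `first_year` is later than 1990 are the ones with an incomplete series.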
The population table is a compilation of the population per country from 1960 to 2020. No cleaning was applied to this data set, because I have no comparison or control source against which to check the data.
The first step taken with the HLE data was to drop the geozone column, because it was meaningless for our purposes. Because the years were arranged as columns, a small Java program was written to read the data file and output the values with the years arranged row-wise; this decision was made to match the layout of the GDP table. A second step was also needed: even though the tables come from the same source, the country names were not consistent. Where the other tables list the US as “United States”, the GDP table has “the United States”. In other cases the name had been updated; for instance, “North Korea” was renamed to “Korea Rep.”, etc.
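The original reshaping was done in Java, and that program is not shown here. As an illustration only, the same two steps (years from columns to rows, then harmonizing country names) could be sketched in base R as follows. The column names `X1990`/`X1991` assume the default `read.csv()` behaviour of prefixing numeric headers with `X`; the lookup table `canonical`, the helper `normalize_country()`, and all values except Albania's 1990 figure are hypothetical.

```r
# Toy wide table: one row per country, one column per year (values made up,
# except Albania 1990)
wide <- data.frame(
  country = c("Albania", "the United States"),
  X1990   = c(71.8, 75.2),
  X1991   = c(71.9, 75.4),
  stringsAsFactors = FALSE
)

# Hypothetical lookup: name variant -> one canonical spelling
canonical <- c("the United States" = "United States")

normalize_country <- function(x) {
  hit <- x %in% names(canonical)
  x[hit] <- unname(canonical[x[hit]])
  x
}

# Stack the year columns: unlist() walks the columns one after another,
# so countries repeat once per year column and each year repeats per country
year_cols <- grep("^X[0-9]{4}$", names(wide), value = TRUE)
long <- data.frame(
  country = normalize_country(rep(wide$country, times = length(year_cols))),
  yearID  = rep(as.integer(sub("^X", "", year_cols)), each = nrow(wide)),
  value   = unlist(wide[year_cols], use.names = FALSE),
  stringsAsFactors = FALSE
)
long <- long[order(long$country, long$yearID), ]
rownames(long) <- NULL

print(long)
```

Keeping the name variants in one lookup vector means a newly discovered inconsistency only needs one more entry rather than another special case in the code.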
After that step, the data were imported into a MySQL schema, simply to filter them (keeping only national, not sub-national, records) and to sort them. Two ordered tables were obtained: the first sorted by country name and then by year, the second by year and then by country. This makes it easier to analyze specific years or countries later. The code used was very simple:
-- Join the life expectancy and GDP tables on year and country,
-- keeping only national records for 1990 to 2019
SELECT
    yearID, country, geozone, age, gdp, gdp_percent, population
FROM
    hle_gdp_dwb.hle
    INNER JOIN hle_gdp_dwb.gdp
        ON yearID = year_ID AND country = countryID
WHERE
    yearID BETWEEN 1990 AND 2019
    AND level = 'National'
ORDER BY yearID ASC, country ASC;
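For readers who prefer to stay in R, the join, filter, and two sort orders described above could be reproduced roughly as follows. This is a sketch under assumptions, not the workflow actually used: the placement of the `level` column in the `hle` table is a guess, and the toy rows are hypothetical apart from the 1990 values for Albania and Algeria.

```r
# Toy versions of the two tables (column placement is an assumption)
hle <- data.frame(
  yearID  = c(1990, 1990, 2019, 1989, 1990),
  country = c("Albania", "Algeria", "Albania", "Albania", "Albania"),
  geozone = c("Europe", "Africa", "Europe", "Europe", "Europe"),
  age     = c(71.8, 66.9, 78.0, 71.5, 70.0),
  level   = c("National", "National", "National", "National", "Subnational"),
  stringsAsFactors = FALSE
)
gdp <- data.frame(
  year_ID   = c(1990, 1990, 2019, 1989),
  countryID = c("Albania", "Algeria", "Albania", "Albania"),
  gdp       = c(2028553750, 62048562947, 1.5e10, 2.0e9),
  stringsAsFactors = FALSE
)

# Inner join on (year, country); the key columns have different names
joined <- merge(hle, gdp,
                by.x = c("yearID", "country"),
                by.y = c("year_ID", "countryID"))

# Same conditions as the WHERE clause
keep <- joined$yearID >= 1990 & joined$yearID <= 2019 &
  joined$level == "National"
national <- joined[keep, c("yearID", "country", "geozone", "age", "gdp")]

# The two orderings mentioned in the text
by_year_country <- national[order(national$yearID, national$country), ]
by_country_year <- national[order(national$country, national$yearID), ]

print(by_year_country)
print(by_country_year)
```

The 1989 row and the sub-national row are dropped by the filter, so three rows remain in each ordered table.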
The first step is to activate required libraries and read the dataset as seen next (including a summary of the data):
library(scatterplot3d)
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5 v purrr 0.3.4
## v tibble 3.1.6 v dplyr 1.0.8
## v tidyr 1.2.0 v stringr 1.4.0
## v readr 2.1.2 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(maps)
##
## Attaching package: 'maps'
## The following object is masked from 'package:purrr':
##
## map
library(ggpattern)
setwd("D://Documents//Certificado-Google//Data Analytics//Portfolio//Life_GDP")
data <- read.csv("hle_gdp_pop_dwb_yc.csv")
summary(data)
## yearID country geozone age
## Min. :1990 Length:4537 Length:4537 Min. :26.2
## 1st Qu.:1998 Class :character Class :character 1st Qu.:63.3
## Median :2006 Mode :character Mode :character Median :71.3
## Mean :2005 Mean :69.0
## 3rd Qu.:2013 3rd Qu.:76.2
## Max. :2019 Max. :84.6
## gdp gdp_percent population
## Min. :6.310e+07 Min. :0.0000019 Min. :3.318e+04
## 1st Qu.:5.082e+09 1st Qu.:0.0001210 1st Qu.:2.857e+06
## Median :2.339e+10 Median :0.0004730 Median :8.580e+06
## Mean :3.278e+11 Mean :0.0065053 Mean :4.040e+07
## 3rd Qu.:1.486e+11 3rd Qu.:0.0031104 3rd Qu.:2.725e+07
## Max. :2.143e+13 Max. :0.3197593 Max. :1.408e+09
head(data)
## yearID country geozone age gdp gdp_percent population
## 1 1990 Albania Europe 71.8 2028553750 8.995872e-05 3286542
## 2 1990 Algeria Africa 66.9 62048562947 2.751620e-03 25758872
## 3 1990 Argentina America 71.6 141352368714 6.268445e-03 32618648
## 4 1990 Armenia Asia 67.9 2256838857 1.000823e-04 3538164
## 5 1990 Australia Oceania 76.9 310781069642 1.378197e-02 17065100
## 6 1990 Austria Europe 75.6 166463386179 7.382024e-03 7677850
tail(data)
## yearID country geozone age gdp gdp_percent population
## 4532 2019 USA America 78.9 2.143322e+13 2.484793e-01 328329953
## 4533 2019 Uzbekistan Asia 71.7 5.772654e+10 6.692344e-04 33580350
## 4534 2019 Vanuatu Oceania 70.5 9.303380e+08 1.078558e-05 299882
## 4535 2019 Vietnam Asia 75.4 2.619212e+11 3.036501e-03 96462108
## 4536 2019 Yemen Asia 66.1 2.258108e+10 2.617866e-04 29161922
## 4537 2019 Zambia Africa 63.9 2.330869e+10 2.702219e-04 17861034
As expected, the tidyverse conflicts with other packages are listed, but I will not use the masked functions, so this causes no problems.
The summary of the data was included to check that the imported data span the full range of years from 1990 to 2019, which means that the import was ‘successful’.
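The minimum and maximum reported by `summary()` only show the ends of the range. A more direct check that no year in between is missing could look like the sketch below. The toy data frame and the shortened range 1990 to 1993 are placeholders; with the real data the expected range would be `1990:2019`.

```r
# Toy stand-in for the imported data (year 1992 is deliberately absent)
toy <- data.frame(
  yearID  = c(1990, 1990, 1991, 1993),
  country = c("Albania", "Algeria", "Albania", "Albania"),
  stringsAsFactors = FALSE
)

expected_years <- 1990:1993  # use 1990:2019 for the real data

# Years that should be present but have no rows at all
missing_years <- setdiff(expected_years, unique(toy$yearID))

# Number of rows per year, including zero counts for absent years
rows_per_year <- table(factor(toy$yearID, levels = expected_years))

print(missing_years)
print(rows_per_year)
```

An empty `missing_years` means every year has at least one record, and `rows_per_year` shows how many countries report in each year.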
The following shows how the data were read and plotted.
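The map code in this document uses a table called `mapdata` whose construction is not shown. One plausible way to build such a table is to attach the 2019 indicators to country polygons with a left join, so that countries without data keep their outline but carry `NA` values. The sketch below uses a tiny hand-made polygon table in place of the real one; `ggplot2::map_data("world")` returns a table with the same kind of columns (`long`, `lat`, `group`, `order`, `region`). The region `NoDataLand` and all coordinates are hypothetical; the Zambia values are the 2019 figures listed earlier.

```r
# Toy polygon table: two "countries" with three vertices each
world <- data.frame(
  long   = c(0, 1, 1, 10, 11, 11),
  lat    = c(0, 0, 1, 10, 10, 11),
  group  = c(1, 1, 1, 2, 2, 2),
  order  = 1:6,
  region = c("Zambia", "Zambia", "Zambia",
             "NoDataLand", "NoDataLand", "NoDataLand"),
  stringsAsFactors = FALSE
)

# Indicators for one year only; NoDataLand does not report
data_2019 <- data.frame(
  yearID  = 2019,
  country = "Zambia",
  age     = 63.9,
  gdp     = 2.330869e10,
  stringsAsFactors = FALSE
)

# Left join: every polygon vertex is kept, unmatched regions get NA
mapdata <- merge(world, data_2019,
                 by.x = "region", by.y = "country", all.x = TRUE)

# merge() reorders rows; polygons must be drawn in their original vertex order
mapdata <- mapdata[order(mapdata$order), ]

print(mapdata)
```

Filtering to a single year before the join, as done here, avoids duplicating every polygon once per year; this differs from the later code, which filters `mapdata` by `yearID` after the join.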
The maps above (and all the following maps) show information for the year 2019. Some data are missing from the World Bank database because several countries do not report them; these countries appear as white areas.
As you can see, there is an enormous disparity in life expectancy from country to country, but a comparison with GDP shows no clear relationship between these variables. The following map shows the results.
scatterplot3d( data[,c(2,1,5)], angle = 45, color=304050, type ="l")
# Keep only the 2019 polygons that have a GDP value
mapdata_gdp <- mapdata %>% filter(!is.na(gdp), yearID == 2019)
map_gdp <- ggplot(mapdata_gdp, aes(x = long, y = lat, group = group)) +
geom_polygon(aes(color = gdp, fill = gdp)) #+
# geom_polygon_pattern(
# pattern = mapdata_gdp$age,
# pattern_color = "black",
# fill = mapdata_gdp$gdp,
# pattern_fill = mapdata_gdp$age,
# colour =mapdata_gdp$gdp
#)
map_gdp
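The visual impression that life expectancy and GDP are not clearly related can also be checked numerically with a rank correlation. The sketch below uses only the six 2019 rows printed by `tail(data)` earlier, which is far too small a sample to support any conclusion; it only illustrates the calculation. The derived column `gdp_per_capita` is an addition of mine and is not part of the original data set.

```r
# The six 2019 rows shown by tail(data)
tail_2019 <- data.frame(
  country    = c("USA", "Uzbekistan", "Vanuatu", "Vietnam", "Yemen", "Zambia"),
  age        = c(78.9, 71.7, 70.5, 75.4, 66.1, 63.9),
  gdp        = c(2.143322e13, 5.772654e10, 9.303380e8,
                 2.619212e11, 2.258108e10, 2.330869e10),
  population = c(328329953, 33580350, 299882, 96462108, 29161922, 17861034),
  stringsAsFactors = FALSE
)

# Total GDP depends heavily on country size, so also look at GDP per person
tail_2019$gdp_per_capita <- tail_2019$gdp / tail_2019$population

# Spearman correlation works on ranks, so the huge spread in GDP does not dominate
rho_total      <- cor(tail_2019$age, tail_2019$gdp, method = "spearman")
rho_per_capita <- cor(tail_2019$age, tail_2019$gdp_per_capita, method = "spearman")

print(rho_total)
print(rho_per_capita)
```

Running the same two lines on the full 2019 subset of `data` would give a figure that can be compared with what the maps suggest.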
scatterplot3d( data[,c(3,1,4)], angle = 45, color=304050 , type ="l" )
scatterplot3d( data[,c(2,3,5)], angle = 45, color=304050, type ="p")
scatterplot3d( data[,c(2,3,6)], angle = 45, color=30405, type ="p")
ggplot(data,
aes(x = geozone, y = age)) +
geom_point(aes(color = geozone, shape = geozone)) +
scale_color_manual(values = c("darkorange","purple","cyan4", "black", "red", "blue")) +
labs(
title = "Life Expectancy by Geographical Zone",
x = "GeoZone", y = "Life Expectancy",
color = "GeoZone", shape = "GeoZone"
) +
theme_minimal()
That Africa has the lowest life expectancy is not surprising: successive calamities such as HIV/AIDS, the Ebola virus, and COVID-19/SARS, combined with low-level or insufficient medical services, make for the worst possible scenario for human life.
ggplot(data,
aes(x = geozone, y = gdp)) +
geom_point(aes(color = geozone, shape = geozone)) +
scale_color_manual(values = c("darkorange","purple","cyan4", "black", "red", "blue")) +
labs(
title = "GDP by Geographical Zone",
x = "GeoZone", y = "GDP (USD)",
color = "GeoZone", shape = "GeoZone"
) +
theme_minimal()
ggplot(data,
aes(x = geozone, y = gdp_percent)) +
geom_point(aes(color = geozone, shape = geozone)) +
scale_color_manual(values = c("darkorange","purple","cyan4", "black", "red", "blue")) +
labs(
title = "Percent of GDP by Geographical Zone",
x = "GeoZone", y = "%GDP",
color = "GeoZone", shape = "GeoZone"
) +
theme_minimal()
ggplot(data,
aes(x = geozone, y = population)) +
geom_point(aes(color = geozone, shape = geozone)) +
scale_color_manual(values = c("darkorange","purple","cyan4", "black", "red", "blue")) +
labs(
title = "Population by Geographical Zone",
x = "GeoZone", y = "Population",
color = "GeoZone", shape = "GeoZone"
) +
theme_minimal()
The jump in the total population of the Asia geozone is probably due to missing data for some countries in certain years, as also happened in Africa, Oceania, and Europe; in the last case, because of the new countries that appeared after the fall of the “Communist Bloc”.
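Whether a jump in a geozone total comes from missing countries rather than real growth can be tested by counting how many countries report in each year alongside the summed population. The following is a minimal sketch with hypothetical population values, in which one country only starts reporting in the second year.

```r
# Toy records: Uzbekistan has no row for 1990 (all populations are made up)
toy <- data.frame(
  yearID     = c(1990, 1990, 1991, 1991, 1991),
  country    = c("China", "Vietnam", "China", "Vietnam", "Uzbekistan"),
  geozone    = "Asia",
  population = c(1000, 500, 1010, 510, 200),
  stringsAsFactors = FALSE
)

# Number of distinct reporting countries per geozone and year
n_countries <- aggregate(country ~ geozone + yearID, data = toy,
                         FUN = function(x) length(unique(x)))
names(n_countries)[names(n_countries) == "country"] <- "n_countries"

# Total population per geozone and year
total_pop <- aggregate(population ~ geozone + yearID, data = toy, FUN = sum)

# Side-by-side view: a jump in population together with a jump in
# n_countries points to coverage, not demography
summary_by_year <- merge(n_countries, total_pop, by = c("geozone", "yearID"))

print(summary_by_year)
```

In this toy case the total rises from 1500 to 1720, but 200 of that increase is simply the third country entering the data.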
ggplot(data,
aes(x = age, y = gdp )) +
geom_point(aes(color = geozone, shape = geozone)) +
scale_color_manual(values = c("darkorange","purple","cyan4", "black", "red", "blue")) +
labs(
title = "GDP versus Life Expectancy",
x = "Life Expectancy", y= "GDP"
) +
theme_minimal()
ggplot(data,
aes(x = age, y = gdp_percent )) +
geom_point(aes(color = geozone, shape = geozone)) +
scale_color_manual(values = c("darkorange","purple","cyan4", "black", "red", "blue")) +
labs(
title = "%GDP versus Life Expectancy",
x = "Life Expectancy", y= "%GDP"
) +
theme_minimal()
ggplot(data,
aes(x = population, y = gdp )) +
geom_point(aes(color = geozone, shape = geozone)) +
labs(
title = "GDP and Population ",
x = "Population", y = "GDP"
) +
theme_minimal()
ggplot(data,
aes(x = population, y = gdp_percent )) +
geom_point(aes(color = geozone, shape = geozone)) +
labs(
title = "%GDP and Population ",
x = "Population", y = "%GDP"
) +
theme_minimal()
## `geom_smooth()` using formula 'y ~ s(x, bs = "cs")'
## `geom_smooth()` using formula 'y ~ s(x, bs = "cs")'
ggplot(china,
aes(x = yearID, y = age )) +
geom_point(aes(color = geozone, shape = geozone)) +
labs(
title = "China: Annual Life Expectancy",
x = "Year", y = "Life Expectancy"
) +
geom_smooth(method = "lm") +
theme_minimal()
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ s(x, bs = "cs")'
## `geom_smooth()` using formula 'y ~ s(x, bs = "cs")'
ggplot(china,
aes(x = yearID, y = population )) +
geom_point(aes(color = geozone, shape = geozone)) +
labs(
title = "China: Annual Population Growth",
x = "Year", y = "Population"
) +
geom_smooth(method = "lm") +
theme_minimal()
## `geom_smooth()` using formula 'y ~ x'
I will do the regression and prediction curves for the China case…
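As a starting point for that step, a straight-line fit with `lm()` and a forecast with `predict()` could look like the sketch below. It assumes a linear trend, matching the `geom_smooth(method = "lm")` layer used in the China plots, which may not hold over longer horizons. The data frame `china_toy` is synthetic (a made-up trend with small deterministic wiggles) and stands in for the real China subset.

```r
# Synthetic series: NOT real China data, just a linear trend plus small wiggles
china_toy <- data.frame(
  yearID = 2010:2019,
  age    = 74 + 0.2 * (0:9) + rep(c(0.05, -0.05), times = 5)
)

# Ordinary least squares fit of life expectancy on year
fit <- lm(age ~ yearID, data = china_toy)

# Forecast a few years ahead, with 95% prediction intervals
future <- data.frame(yearID = 2020:2025)
pred   <- predict(fit, newdata = future, interval = "prediction")

# Combine years and predictions into one table (columns: fit, lwr, upr)
forecast <- cbind(future, pred)

print(coef(fit))
print(forecast)
```

The `lwr` and `upr` columns widen as the forecast moves away from the observed years, which is a useful reminder of how quickly the uncertainty of an extrapolated trend grows.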
(I am not a sociologist, epidemiologist, or economist…)