R Markdown

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

Here I make a non-professional analysis of the relationships between GDP, Human Life Expectancy and World Population. I am not a sociologist, epidemiologist or economist; this is done only for self-training purposes in Java, MySQL and R with R Markdown.

Databases description

Data were obtained from the World Bank website. They consist of a set of three tables: Human Life Expectancy, Gross Domestic Product and Population.

Human Life Expectancy (HLE) data

Rows hold national and sub-national records; columns hold the name of the country and its code, the geographical zone, and one value per year from 1990 to 2019, that is, 30 years of records. For several countries the first available data start in 2000. For some others the gaps reflect the politics behind them: Kosovo, for instance, has no HLE data until 2011.

Population data

This is a compilation of the population per country for the years 1960 to 2020. No cleaning was applied to this data set, because I have no comparison or control source to check the data against.

Gross Domestic Product (GDP) data

Data cleaning

The first step taken with the HLE data was to drop the geographical zone column, because it was meaningless for our purposes. Because the years were laid out as columns, a small Java program was written to read the data file and write the values with the years arranged row-wise; this decision was driven by the layout of the GDP table. One more step was needed: even though the tables come from the same source, the country names were not consistent. Where the other tables list the US as “United States”, the GDP table has “the United States”. In other cases the name was updated, for instance “North Korea” was renamed to “Korea Rep.”, etc.
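
The Java program itself is not shown here. As a rough illustration only, the same wide-to-long step and the name harmonization could be sketched in base R as follows. The `wide` table, its values and the `name_fixes` lookup are made-up stand-ins, not the real World Bank file or the actual list of corrections.

```r
# Toy stand-in for the wide HLE table: one column per year (values are invented).
wide <- data.frame(
  country = c("the United States", "Albania"),
  `1990` = c(75.0, 72.0),
  `1991` = c(75.5, 72.5),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Wide to long: one row per (country, year) pair.
year_cols <- setdiff(names(wide), "country")
long <- data.frame(
  country = rep(wide$country, times = length(year_cols)),
  yearID  = rep(as.integer(year_cols), each = nrow(wide)),
  age     = unlist(wide[year_cols], use.names = FALSE),
  stringsAsFactors = FALSE
)

# Harmonize country names with a lookup vector (hypothetical mapping).
name_fixes <- c("the United States" = "United States")
hit <- long$country %in% names(name_fixes)
long$country[hit] <- unname(name_fixes[long$country[hit]])

long
```

A lookup vector keeps every rename in one place, which makes it easier to apply the same corrections to the GDP and population tables.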

After that step, the data were imported into a MySQL schema, only to filter them (national records only, no sub-national data) and to sort them. Two ordered tables were obtained: the first sorted by country name and then by year, the second by year and then by country. This strategy makes it easier to analyze specific years or countries later. The code used was very simple:

SELECT
    yearID, country, geozone, age, gdp, gdp_percent, population
FROM
    hle_gdp_dwb.hle, hle_gdp_dwb.gdp
WHERE
    (yearID >= 1990 AND yearID <= 2019)
    AND (yearID = year_ID)
    AND (country = countryID)
    AND level = 'National'
ORDER BY yearID ASC, country ASC;
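
For comparison, both orderings can also be produced in R once the joined table is loaded. This is only a sketch on a made-up `joined` data frame that borrows two column names from the query; it is not part of the original workflow.

```r
# Toy stand-in for the joined table (invented rows).
joined <- data.frame(
  yearID  = c(1991, 1990, 1990, 1991),
  country = c("Albania", "Algeria", "Albania", "Algeria"),
  stringsAsFactors = FALSE
)

# First table: by country name, then by year.
by_country_year <- joined[order(joined$country, joined$yearID), ]

# Second table: by year, then by country (same as the ORDER BY in the query).
by_year_country <- joined[order(joined$yearID, joined$country), ]
```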

CAUTION!!!

THIS IS A FAKE ANALYSIS! I assume that there is no relationship between human life expectancy or population and GDP; this text is only an “academic” exercise. GDP depends more on internal or external investment in industry and services, financial support for certain activities, and even protectionist measures taken by governments.

The first step is to load the required libraries and read the dataset, as shown next (including a summary of the data):

library(scatterplot3d)
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.6     v dplyr   1.0.8
## v tidyr   1.2.0     v stringr 1.4.0
## v readr   2.1.2     v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(maps)
## 
## Attaching package: 'maps'
## The following object is masked from 'package:purrr':
## 
##     map
library(ggpattern)

setwd("D://Documents//Certificado-Google//Data Analytics//Portfolio//Life_GDP")
data <- read.csv("hle_gdp_pop_dwb_yc.csv")
summary(data)
##      yearID       country            geozone               age      
##  Min.   :1990   Length:4537        Length:4537        Min.   :26.2  
##  1st Qu.:1998   Class :character   Class :character   1st Qu.:63.3  
##  Median :2006   Mode  :character   Mode  :character   Median :71.3  
##  Mean   :2005                                         Mean   :69.0  
##  3rd Qu.:2013                                         3rd Qu.:76.2  
##  Max.   :2019                                         Max.   :84.6  
##       gdp             gdp_percent          population       
##  Min.   :6.310e+07   Min.   :0.0000019   Min.   :3.318e+04  
##  1st Qu.:5.082e+09   1st Qu.:0.0001210   1st Qu.:2.857e+06  
##  Median :2.339e+10   Median :0.0004730   Median :8.580e+06  
##  Mean   :3.278e+11   Mean   :0.0065053   Mean   :4.040e+07  
##  3rd Qu.:1.486e+11   3rd Qu.:0.0031104   3rd Qu.:2.725e+07  
##  Max.   :2.143e+13   Max.   :0.3197593   Max.   :1.408e+09
head(data)
##   yearID   country geozone  age          gdp  gdp_percent population
## 1   1990   Albania  Europe 71.8   2028553750 8.995872e-05    3286542
## 2   1990   Algeria  Africa 66.9  62048562947 2.751620e-03   25758872
## 3   1990 Argentina America 71.6 141352368714 6.268445e-03   32618648
## 4   1990   Armenia    Asia 67.9   2256838857 1.000823e-04    3538164
## 5   1990 Australia Oceania 76.9 310781069642 1.378197e-02   17065100
## 6   1990   Austria  Europe 75.6 166463386179 7.382024e-03    7677850
tail(data)
##      yearID    country geozone  age          gdp  gdp_percent population
## 4532   2019        USA America 78.9 2.143322e+13 2.484793e-01  328329953
## 4533   2019 Uzbekistan    Asia 71.7 5.772654e+10 6.692344e-04   33580350
## 4534   2019    Vanuatu Oceania 70.5 9.303380e+08 1.078558e-05     299882
## 4535   2019    Vietnam    Asia 75.4 2.619212e+11 3.036501e-03   96462108
## 4536   2019      Yemen    Asia 66.1 2.258108e+10 2.617866e-04   29161922
## 4537   2019     Zambia  Africa 63.9 2.330869e+10 2.702219e-04   17861034

As expected, the tidyverse conflicts with other packages are listed, but I will not use the masked functions, so this is not a problem.

The summary of the data was included to check whether every year is represented in the imported data, which would mean that the import was ‘successful’.
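
`summary()` only shows the minimum and maximum year, so a gap in the middle would go unnoticed. A minimal sketch of a stricter check, using a made-up `data_toy` table in place of the imported data:

```r
# Toy stand-in for the imported table; 1992 is deliberately missing.
data_toy <- data.frame(
  yearID  = c(1990, 1991, 1993),
  country = "Albania",
  stringsAsFactors = FALSE
)

expected_years <- 1990:1993
missing_years <- setdiff(expected_years, unique(data_toy$yearID))

missing_years           # years with no rows at all
table(data_toy$yearID)  # number of rows per year
```

With the real table the same idea would be `setdiff(1990:2019, unique(data$yearID))`, which should return an empty vector if every year is present.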

The following sections show how the data were read and plotted.

Data by Country

The maps shown above (and all the following maps) show information for the year 2019. Some data are missing from the World Bank database because several countries do not report them; these countries appear as white spots.

As you can see, there is an enormous disparity in life expectancy from country to country, but a comparison with GDP shows no clear relationship between these variables. The following map shows the results.

# Columns 2, 1, 5: country, yearID, gdp
scatterplot3d(data[, c(2, 1, 5)], angle = 45, color = 304050, type = "l")

# Keep only 2019 rows that have a GDP value
mapdata_gdp <- mapdata %>% filter(!is.na(gdp), yearID == 2019)
map_gdp <- ggplot(mapdata_gdp, aes(x = long, y = lat, group = group)) +
  geom_polygon(aes(color = gdp, fill = gdp))
# Pattern overlay with ggpattern, left commented out:
#   geom_polygon_pattern(
#     pattern = mapdata_gdp$age,
#     pattern_color = "black",
#     fill = mapdata_gdp$gdp,
#     pattern_fill = mapdata_gdp$age,
#     colour = mapdata_gdp$gdp
#   )

map_gdp
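
The chunk that builds `mapdata` is not shown in this document. A minimal sketch of one way it could be built, assuming the country names match the `region` column returned by `map_data("world")` (the `head(data)` and `tail(data)` output suggests they do, for example "USA"). The `data_2019` table and its values are invented stand-ins.

```r
library(ggplot2)  # map_data() needs the 'maps' package to be installed
library(dplyr)

# Toy stand-in for the 2019 rows of `data` (invented values).
data_2019 <- data.frame(
  yearID  = 2019,
  country = c("USA", "Vietnam"),
  gdp     = c(2.0e13, 2.5e11),
  stringsAsFactors = FALSE
)

# World polygons: columns long, lat, group, order, region, subregion.
world <- map_data("world")

# Attach the indicators to every polygon vertex of the matching country.
mapdata <- left_join(world, data_2019, by = c("region" = "country"))
```

Countries without a match keep `NA` in `gdp`, so they drop out of the filtered table and stay blank on the map.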

Data by Geographical Zone

Depending on year

# Columns 3, 1, 4: geozone, yearID, age
scatterplot3d(data[, c(3, 1, 4)], angle = 45, color = 304050, type = "l")

Depending on Geographical area

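# Columns 2, 3, 5: country, geozone, gdp
scatterplot3d(data[, c(2, 3, 5)], angle = 45, color = 304050, type = "p")

# Columns 2, 3, 6: country, geozone, gdp_percent
scatterplot3d(data[, c(2, 3, 6)], angle = 45, color = 30405, type = "p")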

Let’s group the data by geographical area.

ggplot(data, 
       aes(x = geozone, y = age)) +
  geom_point(aes(color = geozone, shape = geozone)) +
  scale_color_manual(values = c("darkorange","purple","cyan4", "black", "red", "blue")) +
  labs(
    title = "Life Expectancy by Geographical Zone",
    x = "GeoZone", y = "Life Expectancy",
    color = "GeoZone", shape = "GeoZone"
  ) +
  theme_minimal()

That Africa has the lowest life expectancy is not surprising: successive calamities such as HIV/AIDS, the Ebola virus and COVID-19/SARS, combined with low-level or insufficient medical services, make the worst possible scenario for human life.
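
The ranking of the zones can also be read from a small summary table instead of the plot. A sketch with invented numbers in `data_toy`; with the real table, `data` would take its place.

```r
# Toy stand-in for `data` (invented values, not World Bank figures).
data_toy <- data.frame(
  geozone = c("Africa", "Africa", "Europe", "Europe", "Asia", "Asia"),
  age     = c(55, 61, 76, 80, 68, 72),
  stringsAsFactors = FALSE
)

# Median life expectancy per geographical zone, lowest first.
by_zone <- aggregate(age ~ geozone, data = data_toy, FUN = median)
by_zone <- by_zone[order(by_zone$age), ]
by_zone
```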

ggplot(data, 
       aes(x = geozone, y = gdp)) +
  geom_point(aes(color = geozone, shape = geozone)) +
  scale_color_manual(values = c("darkorange","purple","cyan4", "black", "red", "blue")) +
  labs(
    title = "GDP by Geographical Zone",
    x = "GeoZone", y = "GDP (USD)",
    color = "GeoZone", shape = "GeoZone"
  ) +
  theme_minimal()

ggplot(data, 
       aes(x = geozone, y = gdp_percent)) +
  geom_point(aes(color = geozone, shape = geozone)) +
  scale_color_manual(values = c("darkorange","purple","cyan4", "black", "red", "blue")) +
  labs(
    title = "Percent of GDP by Geographical Zone",
    x = "GeoZone", y = "%GDP",
    color = "GeoZone", shape = "GeoZone"
  ) +
  theme_minimal()

ggplot(data, 
       aes(x = geozone, y = population)) +
  geom_point(aes(color = geozone, shape = geozone)) +
  scale_color_manual(values = c("darkorange","purple","cyan4", "black", "red", "blue")) +
  labs(
    title = "Population by Geographical Zone",
    x = "GeoZone", y = "Population",
    color = "GeoZone", shape = "GeoZone"
  ) +
  theme_minimal()

The jump in the total population of the Asia geozone is probably due to missing data for some countries in certain years, as also happened in Africa, Oceania and Europe; in the last case it is due to the new countries that appeared after the fall of the “Communist Bloc”.
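
One way to test this explanation is to count how many countries report data in each zone and year: a jump in the count would line up with a jump in the total. A sketch on an invented `data_toy` table, in which country "C" only starts reporting in 1991:

```r
# Toy stand-in for `data` (invented rows).
data_toy <- data.frame(
  yearID  = c(1990, 1990, 1991, 1991, 1991),
  country = c("A", "B", "A", "B", "C"),
  geozone = "Asia",
  stringsAsFactors = FALSE
)

# Number of distinct countries reporting per geozone and year.
n_reporting <- aggregate(
  x   = list(n_countries = data_toy$country),
  by  = list(geozone = data_toy$geozone, yearID = data_toy$yearID),
  FUN = function(x) length(unique(x))
)
n_reporting
```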

GDP versus HLE and Population

ggplot(data, 
       aes(x = age, y = gdp )) +
  geom_point(aes(color = geozone, shape = geozone)) +
  scale_color_manual(values = c("darkorange","purple","cyan4", "black", "red", "blue")) +
  labs(
    title = "GDP versus Life Expectancy",
    x =  "Life Expectancy", y= "GDP"
  ) +
  theme_minimal()

ggplot(data, 
       aes(x = age, y = gdp_percent )) +
  geom_point(aes(color = geozone, shape = geozone)) +
  scale_color_manual(values = c("darkorange","purple","cyan4", "black", "red", "blue")) +
  labs(
    title = "%GDP versus Life Expectancy",
    x =  "Life Expectancy", y= "%GDP"
  ) +
  theme_minimal()

ggplot(data, 
       aes(x = population, y = gdp )) +
  geom_point(aes(color = geozone, shape = geozone)) +
  labs(
    title = "GDP and Population ",
    x  = "Population", y = "GDP"
  ) +
  theme_minimal()

ggplot(data, 
       aes(x = population, y = gdp_percent )) +
  geom_point(aes(color = geozone, shape = geozone)) +
  labs(
    title = "%GDP and Population ",
    x  = "Population", y = "%GDP"
  ) +
  theme_minimal()
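
The scatter plots can be complemented with a single number. A minimal sketch using Spearman's rank correlation, which only looks at the ordering of the values and is therefore not dominated by the few very large GDP figures. The `data_toy` values are invented; this is an illustration, not a result.

```r
# Toy stand-in for `data` (invented values).
data_toy <- data.frame(
  age = c(55, 62, 70, 75, 81),
  gdp = c(4e9, 2e11, 9e9, 3e12, 5e11)
)

# Rank correlation between life expectancy and GDP.
rho <- cor(data_toy$age, data_toy$gdp, method = "spearman")
rho
```

With the real table the call would be `cor(data$age, data$gdp, method = "spearman")`.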

Analysis: The China case

ggplot(china, 
       aes(x = yearID, y = age )) +
  geom_point(aes(color = geozone, shape = geozone)) +
  labs(
    title = "China: Annual Life Expectancy",
    x  = "Year", y = "Life Expectancy"
  ) +
  geom_smooth(method = "lm") +
  theme_minimal()
## `geom_smooth()` using formula 'y ~ x'

ggplot(china, 
       aes(x = yearID, y = population )) +
  geom_point(aes(color = geozone, shape = geozone)) +
  labs(
    title = "China: Annual Population Growth",
    x  = "Year", y = "Population"
  ) +
   geom_smooth(method = "lm") +
  theme_minimal()
## `geom_smooth()` using formula 'y ~ x'

To be continued….

I will do the regression and prediction curves for the China case…

(I am not a sociologist, epidemiologist or economist…)
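
Until then, here is a minimal sketch of what such a regression and prediction could look like with `lm()` and `predict()`. The `china_toy` table is synthetic (an exact straight line), so the numbers say nothing about China. With the real data the subset would presumably be something like `subset(data, country == "China")`, which is an assumption because the chunk that creates `china` is not shown.

```r
# Synthetic stand-in for the China rows: an exact straight line, not real data.
china_toy <- data.frame(yearID = 1990:2019)
china_toy$age <- 68 + 0.3 * (china_toy$yearID - 1990)

# Linear regression of life expectancy on year.
fit <- lm(age ~ yearID, data = china_toy)

# Predictions for years outside the observed range, with prediction intervals.
future <- data.frame(yearID = 2020:2025)
pred <- predict(fit, newdata = future, interval = "prediction")
cbind(future, pred)
```

The `pred` matrix has the columns `fit`, `lwr` and `upr`, which can be drawn as a curve with a band around it.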