Introduction

This report uses ggplot2 charts and leaflet maps to communicate patterns in a set of indicators across different countries.

For the first phase of this assignment, we compared five countries (Brazil, China, India, the Russian Federation and the United States) on three indicators: percentage of internet usage, CO2 emissions, and health expenditures as a % of GDP.

For the second phase, we plotted two separate leaflet maps that use the “countries” dataset to focus on the life expectancy variable. For each country, we plotted the value at the location of its capital city and compared life expectancy between two years, 1998 and 2017 (19 years apart).

Step 1: library calls to load packages

This first chunk calls the packages that will be used for this notebook.

library(tidyverse)
library(leaflet)
library(WDI)
library(ggplot2)
library(tidyr)

Step 2: Call package WDI to retrieve most updated figures available.

The second code chunk automatically retrieves the latest data from the World Development Indicators database, for use in this notebook. The dataframe will also contain the longitude and latitude of the capital city in each country.

In this assignment, we will fetch ten data series from the WDI:

Tableau Name                                    WDI Series
Birth Rate                                      SP.DYN.CBRT.IN
Infant Mortality Rate                           SP.DYN.IMRT.IN
Internet Usage                                  IT.NET.USER.ZS
Life Expectancy (Total)                         SP.DYN.LE00.IN
Forest Area (% of land)                         AG.LND.FRST.ZS
Mobile Phone Usage                              IT.CEL.SETS.P2
Population Total                                SP.POP.TOTL
International Tourism receipts (current US$)    ST.INT.RCPT.CD
Import value index (2000=100)                   TM.VAL.MRCH.XD.WD
Export value index (2000=100)                   TX.VAL.MRCH.XD.WD

The World Bank uses a complex, non-intuitive scheme for naming variables. For example, the Birth Rate series is called SP.DYN.CBRT.IN. The code below assigns variable names that are more intuitive than the World Bank codes, and converts the geocodes (longitude and latitude) from factors to numbers.

We used the data frame called countries.

birth <- "SP.DYN.CBRT.IN"
infmort <- "SP.DYN.IMRT.IN"
net <-"IT.NET.USER.ZS"
lifeexp <- "SP.DYN.LE00.IN"
forest <- "AG.LND.FRST.ZS"
mobile <- "IT.CEL.SETS.P2"
pop <- "SP.POP.TOTL"
tour <- "ST.INT.RCPT.CD"
import <- "TM.VAL.MRCH.XD.WD"
export <- "TX.VAL.MRCH.XD.WD"

# create a vector of the desired indicator series
indicators <- c(birth, infmort, net, lifeexp, forest,
                mobile, pop, tour, import, export)

countries <- WDI(country="all", indicator = indicators, 
     start = 1998, end = 2018, extra = TRUE)

## rename columns for ease of reference
countries <- rename(countries, birth = SP.DYN.CBRT.IN, 
       infmort = SP.DYN.IMRT.IN, net  = IT.NET.USER.ZS,
       lifeexp = SP.DYN.LE00.IN, forest = AG.LND.FRST.ZS,
       mobile = IT.CEL.SETS.P2, pop = SP.POP.TOTL, 
       tour = ST.INT.RCPT.CD, import = TM.VAL.MRCH.XD.WD,
       export = TX.VAL.MRCH.XD.WD)

# convert geocodes from factors into numerics

countries$lng <- as.numeric(as.character(countries$longitude))
countries$lat <- as.numeric(as.character(countries$latitude))

# Remove groupings, which have no geocodes
countries <- countries %>%
   filter(!is.na(lng))

A Glimpse of the new dataframe

Here we can see sample observations for the 22 variables in the new dataframe. This helped us pick a target variable for phase 2. Ultimately, we chose “lifeexp” (life expectancy) as our target variable.

glimpse(countries)
## Observations: 4,410
## Variables: 22
## $ iso2c     <chr> "AD", "AD", "AD", "AD", "AD", "AD", "AD", "AD", "AD", …
## $ country   <chr> "Andorra", "Andorra", "Andorra", "Andorra", "Andorra",…
## $ year      <int> 2018, 2007, 2004, 2005, 2017, 1998, 1999, 2000, 2006, …
## $ birth     <dbl> NA, 10.100, 10.900, 10.700, NA, 11.900, 12.600, 11.300…
## $ infmort   <dbl> 2.7, 4.5, 5.1, 4.9, 2.8, 6.4, 6.2, 5.9, 4.7, 5.5, 5.3,…
## $ net       <dbl> NA, 70.870000, 26.837954, 37.605766, 91.567467, 6.8862…
## $ lifeexp   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ forest    <dbl> NA, 34.042553, 34.042553, 34.042553, NA, 34.042553, 34…
## $ mobile    <dbl> 107.28255, 76.80204, 76.55160, 81.85933, 104.33241, 22…
## $ pop       <dbl> 77006, 82684, 76244, 78867, 77001, 64142, 64370, 65390…
## $ tour      <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ import    <dbl> 136.50668, 190.30053, 174.09246, 178.06349, 146.27331,…
## $ export    <dbl> 268.35043, 332.78037, 271.81148, 314.89205, 264.92993,…
## $ iso3c     <fct> AND, AND, AND, AND, AND, AND, AND, AND, AND, AND, AND,…
## $ region    <fct> Europe & Central Asia, Europe & Central Asia, Europe &…
## $ capital   <fct> Andorra la Vella, Andorra la Vella, Andorra la Vella, …
## $ longitude <fct> 1.5218, 1.5218, 1.5218, 1.5218, 1.5218, 1.5218, 1.5218…
## $ latitude  <fct> 42.5075, 42.5075, 42.5075, 42.5075, 42.5075, 42.5075, …
## $ income    <fct> High income, High income, High income, High income, Hi…
## $ lending   <fct> Not classified, Not classified, Not classified, Not cl…
## $ lng       <dbl> 1.5218, 1.5218, 1.5218, 1.5218, 1.5218, 1.5218, 1.5218…
## $ lat       <dbl> 42.5075, 42.5075, 42.5075, 42.5075, 42.5075, 42.5075, …

Plot from Phase 1

The clearest way to create the plots was with ggplot2 in R. We filtered for the five countries in question and plotted each indicator over time in its own figure, faceted by country, using three separate code chunks. This made more sense than a single matrix of all 15 facets, because the three indicators are measured on different scales and would not share a common y-axis. We also made sure that the plotted variables were interpreted as numeric, and we thinned the year labels on the x-axis to reduce clutter. For aesthetic purposes, we assigned a different color to each country.

The three graphs below offer some interesting insights. Internet usage has been increasing over time in all five countries, but India’s rate of growth has been slower than that of the other four. CO2 emissions vary drastically across countries, with China showing both the highest level and the largest increase. The same graph shows that although CO2 emissions in the United States have been relatively high, they have decreased slightly. Lastly, we observe a drastic increase in health expenditures as a percentage of GDP in the United States compared to the other countries.

world_indicators <- read_csv("World Indicators.csv")
## Parsed with column specification:
## cols(
##   .default = col_character(),
##   `Business Tax Rate` = col_logical(),
##   `CO2 Emissions` = col_double(),
##   `Life Expectancy` = col_double(),
##   `Days to Start Business` = col_double(),
##   `Ease of Business` = col_logical(),
##   `Energy Usage` = col_double(),
##   `Hours to do Tax` = col_logical(),
##   `Lending Interest` = col_double(),
##   `Life Expectancy Female` = col_double(),
##   `Life Expectancy Male` = col_double(),
##   `Number of Records` = col_double(),
##   `Population 65+` = col_double()
## )
## See spec(...) for full column specifications.
## Warning: 3023 parsing failures.
##  row               col           expected actual                   file
## 1041 Business Tax Rate 1/0/T/F/TRUE/FALSE  0.769 'World Indicators.csv'
## 1041 Hours to do Tax   1/0/T/F/TRUE/FALSE  451   'World Indicators.csv'
## 1042 Business Tax Rate 1/0/T/F/TRUE/FALSE  0.521 'World Indicators.csv'
## 1042 Hours to do Tax   1/0/T/F/TRUE/FALSE  272   'World Indicators.csv'
## 1043 Business Tax Rate 1/0/T/F/TRUE/FALSE  0.757 'World Indicators.csv'
## .... ................. .................. ...... ......................
## See problems(...) for more details.
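
The parsing failures above are consistent with readr guessing column types from the leading rows of the file: a column that is empty near the top is guessed as logical, and numeric values further down then fail to parse. One possible remedy is to declare such columns explicitly through col_types. The sketch below demonstrates this on a small temporary file rather than on World Indicators.csv itself; demo, demo_path and the file contents are assumptions made for illustration.

```r
library(readr)

# Build a small stand-in file (hypothetical data): the tax column is empty
# for the first 1,200 rows and numeric afterwards
demo <- data.frame(
  Country = rep("Demo", 1500),
  `Business Tax Rate` = c(rep(NA, 1200), rep(0.5, 300)),
  check.names = FALSE
)
demo_path <- tempfile(fileext = ".csv")
write_csv(demo, demo_path, na = "")

# Declaring the column types up front avoids relying on the type guess
demo_read <- read_csv(
  demo_path,
  col_types = cols(
    Country = col_character(),
    `Business Tax Rate` = col_double()
  )
)

# Count the numeric values that survived the import
sum(!is.na(demo_read$`Business Tax Rate`))
```

For the real file, the same idea would mean passing col_double() for `Business Tax Rate` and `Hours to do Tax`, the two columns named in the warnings. Neither column is used in the plots that follow, so the figures in this report are not affected.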
# Internet Usage %   ############

world_indicators$`Internet Usage` <- sub("%", "", world_indicators$`Internet Usage`)
world_indicators$`Year` <- sub("12/1/", "", world_indicators$`Year`)
world_indicators$internet_usage <- as.numeric(world_indicators$`Internet Usage`)

world_indicators %>%
  filter(Country == "United States"| Country == "Brazil"| Country == "Russian Federation"| Country == "India"| Country == "China") %>%
ggplot(aes(x = Year, y = internet_usage, group= Country, color = Country)) +
  geom_point(shape = 1) +
  geom_line() +
  scale_color_manual(values = c("Brazil" = "royal blue", "China" = "forest green", 
                                "India" = "purple", "Russian Federation" = "brown", "United States" = "black")) +
  facet_wrap(. ~ Country, ncol = 5) +
  labs(x = "Year", y = "Internet Usage(%)", title = "Percentage of Internet Usage Across Countries") +
  theme(axis.text.x = element_text(angle = 90)) + 
  scale_x_discrete(limits=c("2000", "2002", "2004", "2006", "2008", "2010", "2012"))+
  theme(legend.position = "none")
## Warning: Removed 30 rows containing missing values (geom_point).
## Warning: Removed 30 rows containing missing values (geom_path).

# CO2 Emissions Across Countries   ############

world_indicators$`Year` <- sub("12/1/", "", world_indicators$`Year`)
world_indicators %>%
  filter(Country == "United States"| Country == "Brazil"| Country == "Russian Federation"| Country == "India"| Country == "China") %>%
ggplot(aes(x = Year, y = `CO2 Emissions`, group= Country, color = Country)) +
  geom_point(shape = 1) +
  geom_line() +
  scale_color_manual(values = c("Brazil" = "royal blue", "China" = "forest green", "India" = "purple", "Russian Federation" = "brown", "United States" = "black")) +
  facet_wrap(. ~ Country, ncol = 5) +
  labs(x = "Year", y = "CO2 Emissions", title = "CO2 Emissions Across Countries") +
  theme(axis.text.x = element_text(angle = 90)) + scale_x_discrete(limits=c("2000","2002","2004", "2006","2008","2010","2012"))+
  theme(legend.position = "none")
## Warning: Removed 35 rows containing missing values (geom_point).
## Warning: Removed 35 rows containing missing values (geom_path).

# Health Expenditures as a % of GDP  ############

world_indicators$`Health Exp % GDP` <- sub("%", "", world_indicators$`Health Exp % GDP`)
world_indicators$`Year` <- sub("12/1/", "", world_indicators$`Year`)
world_indicators$`Health Exp % GDP` <- as.numeric(world_indicators$`Health Exp % GDP`)
world_indicators %>%
  filter(Country == "United States"| Country == "Brazil"| Country == "Russian Federation"| Country == "India"| Country == "China") %>%
ggplot(aes(x = Year, y = `Health Exp % GDP`, group= Country, color = Country)) +
  geom_point(shape = 1) +
  geom_line() +
  scale_color_manual(values = c("Brazil" = "royal blue", "China" = "forest green", "India" = "purple", "Russian Federation" = "brown", "United States" = "black")) +
  facet_wrap(. ~ Country, ncol = 5) +
  labs(x = "Year", y = "Health Exp % GDP", title = "Percentage of Health Exp. Across Countries") +
  theme(axis.text.x = element_text(angle = 90)) + scale_x_discrete(limits=c("2000","2002","2004", "2006","2008","2010","2012"))+
  theme(legend.position = "none")
## Warning: Removed 30 rows containing missing values (geom_point).
## Warning: Removed 30 rows containing missing values (geom_path).
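
One detail worth noting in the three chunks above: on a discrete scale, limits restricts which values are drawn, so years outside the list are dropped from the plot, which would be consistent with the "Removed ... rows" warnings. If the goal is only to thin the axis labels, breaks is one alternative that keeps every year in the plot. The sketch below uses a small made-up data frame (demo_df and its values are hypothetical) and also shows %in% as a shorter form of the chained == filter.

```r
library(ggplot2)

# Hypothetical stand-in for world_indicators: three countries, 2000-2012
demo_df <- data.frame(
  Country = rep(c("Brazil", "India", "France"), each = 13),
  Year = rep(as.character(2000:2012), times = 3),
  value = c(seq(3, 51, by = 4), seq(1, 13, by = 1), seq(14, 86, by = 6))
)

# %in% keeps the rows whose Country is one of the selected names
selected <- c("Brazil", "India")
plot_df <- demo_df[demo_df$Country %in% selected, ]

# breaks controls which labels appear; all years remain in the plot
p_breaks <- ggplot(plot_df, aes(x = Year, y = value,
                                group = Country, color = Country)) +
  geom_point(shape = 1) +
  geom_line() +
  facet_wrap(~ Country, ncol = 2) +
  scale_x_discrete(breaks = as.character(seq(2000, 2012, by = 2))) +
  theme(axis.text.x = element_text(angle = 90), legend.position = "none")
```

Calling print(p_breaks) renders the figure. Whether the odd years should appear in the final charts is a design choice; the sketch only shows that label thinning and data filtering can be separated.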

World map showing a variable in 1998

For the code, we decided to round the life expectancy numbers to the nearest whole year so that the quantile groupings would be easier to read. We then created a color palette with the colorQuantile function, using the countries’ life expectancy for the year 1998 as the domain. Finally, we created a map from the countries dataset that uses this palette for the circle markers and the legend.

Looking at the map, we can see that underdeveloped countries have a lower life expectancy than developed countries. The legend shows which color corresponds to which quantile: Africa is mostly covered in red, while the “western” world is mostly covered in blue, indicating a high life expectancy. One thing that surprised us is that some countries in Europe have “NA” values, since we expected data from Europe to be well documented. Overall, life expectancy seems fairly “balanced” across the map, but developed nations clearly hold an advantage. Since this map shows 1998, we expect improvement by 2017, as we believe the overall trend is for life expectancy to increase across the world.

# Keep only the 1998 rows and round life expectancy to whole years
countries_1998 <- countries %>%
  filter(year == 1998) %>%
  mutate(lifeexp_rounded = round(lifeexp, 0))

# Quantile-based palette built from the 1998 values
pall <- colorQuantile(
  palette = "RdYlBu",
  domain = countries_1998$lifeexp_rounded)

leaflet(countries_1998) %>%
  addTiles() %>%
  addCircleMarkers(lng = ~lng, lat = ~lat, color = ~pall(lifeexp_rounded)) %>%
  addLegend("bottomright", pal = pall, values = ~lifeexp_rounded,
            title = "Life Expectancy (1998)")

World map showing the same variable recently

For 2017, we used the same logic for the code as we did for 1998. The general color pattern across countries stays much the same, but this is because the “colorQuantile” function colors each country relative to the others in the same year. So although life expectancies change within our dataset, the changes are not necessarily apparent in the leaflet plot. In both years, countries in Africa are generally in the lowest quantiles, while the rest of the world shows a wider spread, with European countries’ life expectancy at or above the 50% quantile of the data. We can also see that certain countries in the southeast Pacific region moved from one quantile to the next (red to orange), so we do see some subtle improvements.

# Keep only the 2017 rows and round life expectancy to whole years
countries_2017 <- countries %>%
  filter(year == 2017) %>%
  mutate(lifeexp_rounded = round(lifeexp, 0))

# Quantile-based palette built from the 2017 values
pall <- colorQuantile(
  palette = "RdYlBu",
  domain = countries_2017$lifeexp_rounded)

leaflet(countries_2017) %>%
  addTiles() %>%
  addCircleMarkers(lng = ~lng, lat = ~lat, color = ~pall(lifeexp_rounded)) %>%
  addLegend("bottomright", pal = pall, values = ~lifeexp_rounded,
            title = "Life Expectancy (2017)")
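
The point about relative coloring can be checked in base R: quantile bins depend only on rank within a year, so a uniform gain in life expectancy leaves every country in the same bin. The values below are invented for illustration, and cut() with quantile() is used as a stand-in for the idea behind colorQuantile, not as a reproduction of its internals. Fixed bins shared by both years are shown as one alternative.

```r
# Hypothetical life expectancies for eight countries in an early year
early <- c(45, 52, 58, 63, 68, 72, 76, 80)
# Same countries later, each five years higher (an assumed uniform gain)
late <- early + 5

# Assign each value to a quartile bin computed from its own year's data
quartile_bin <- function(x) {
  cut(x, breaks = quantile(x, probs = seq(0, 1, 0.25)),
      include.lowest = TRUE, labels = FALSE)
}

# Assign each value to fixed bins shared by both years
fixed_bin <- function(x) {
  cut(x, breaks = c(40, 50, 60, 70, 80, 90),
      include.lowest = TRUE, labels = FALSE)
}

# Per-year quartiles are unchanged, so the gain is invisible
identical(quartile_bin(early), quartile_bin(late))  # TRUE

# Shared fixed bins differ between years, so the gain shows up
identical(fixed_bin(early), fixed_bin(late))        # FALSE
```

In leaflet, a palette built with colorBin or colorNumeric over a domain that covers both years would play the role of fixed_bin here, at the cost of a legend that is less evenly populated.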

Conclusion

Through this report we learned about the functionality of faceted plots and leaflet maps. For phase 1, faceting the plots let us split each indicator by country and compare the countries side by side, which made the individual variables easier to interpret. For phase 2, the leaflet plots gave us an interactive map for comparing life expectancy across countries. We chose quantiles for simplicity; however, we realized that they are not an all-encompassing solution for comparisons, because they only show where countries stand relative to one another. We could have used the “colorNumeric” function for numeric comparisons of life expectancy, but given the amount of data, many similar shades of color were difficult to tell apart. With both faceting and leaflet maps, we were able to obtain key insights from our visualizations.