Introduction

In this document, we explore two datasets: World Indicators.csv and the World Bank's World Development Indicators (WDI). Before visualizing them, we perform two steps. Step 1 loads the required libraries. Step 2 calls the WDI package to retrieve the latest data series for several variables across countries, and also prepares and cleans the WDI dataset.

Step 1:

Library calls to load packages

library(tidyverse)
library(leaflet)
library(WDI)

Step 2:

Call the WDI package to retrieve the most recent figures available. In this assignment, we fetch ten data series from the WDI:

Tableau Name	WDI Series
Birth Rate	SP.DYN.CBRT.IN
Infant Mortality Rate	SP.DYN.IMRT.IN
Internet Usage	IT.NET.USER.ZS
Life Expectancy (Total)	SP.DYN.LE00.IN
Forest Area (% of land)	AG.LND.FRST.ZS
Mobile Phone Usage	IT.CEL.SETS.P2
Population Total	SP.POP.TOTL
International Tourism receipts (current US$)	ST.INT.RCPT.CD
Import value index (2000=100)	TM.VAL.MRCH.XD.WD
Export value index (2000=100)	TX.VAL.MRCH.XD.WD

The next code chunk will call the WDI API and fetch the years 1998 through 2018, as available. You will find that only a few variables have data for 2018. The dataframe will also contain the longitude and latitude of the capital city in each country.

Note: This notebook takes approximately 2 minutes to run. The WDI call is time-consuming, as is the process of knitting the file. Be patient.

The World Bank uses a complex, non-intuitive scheme for naming variables. For example, the Birth Rate series is called SP.DYN.CBRT.IN. The code assigns variable names that are more intuitive than the World Bank's codes, and converts the geocodes from factors to numbers.

In your code, you will use the data frame called countries.

birth <- "SP.DYN.CBRT.IN"
infmort <- "SP.DYN.IMRT.IN"
net <-"IT.NET.USER.ZS"
lifeexp <- "SP.DYN.LE00.IN"
forest <- "AG.LND.FRST.ZS"
mobile <- "IT.CEL.SETS.P2"
pop <- "SP.POP.TOTL"
tour <- "ST.INT.RCPT.CD"
import <- "TM.VAL.MRCH.XD.WD"
export <- "TX.VAL.MRCH.XD.WD"

# create a vector of the desired indicator series
indicators <- c(birth, infmort, net, lifeexp, forest,
                mobile, pop, tour, import, export)

countries <- WDI(country="all", indicator = indicators, 
     start = 1998, end = 2018, extra = TRUE)

## rename columns for ease of reference
countries <- rename(countries, birth = SP.DYN.CBRT.IN, 
       infmort = SP.DYN.IMRT.IN, net  = IT.NET.USER.ZS,
       lifeexp = SP.DYN.LE00.IN, forest = AG.LND.FRST.ZS,
       mobile = IT.CEL.SETS.P2, pop = SP.POP.TOTL, 
       tour = ST.INT.RCPT.CD, import = TM.VAL.MRCH.XD.WD,
       export = TX.VAL.MRCH.XD.WD)

# convert geocodes from factors into numerics

countries$lng <- as.numeric(as.character(countries$longitude))
countries$lat <- as.numeric(as.character(countries$latitude))

# Remove groupings, which have no geocodes
countries <- countries %>%
   filter(!is.na(lng))
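
The renaming and cleaning steps above can be tried without calling the API. The following is a minimal sketch on a made-up two-row data frame; the `toy` and `lookup` names and all values are illustrative, not WDI output. It assumes that aggregate rows such as "World" arrive with an empty longitude, which is why they drop out once the geocodes are converted to numbers.

```r
# Made-up stand-in for the WDI result; values are illustrative only.
toy <- data.frame(
  country        = c("Andorra", "World"),
  SP.DYN.CBRT.IN = c(10.1, 20.0),
  SP.POP.TOTL    = c(80000, 6e9),
  longitude      = factor(c("1.5218", "")),
  stringsAsFactors = FALSE
)

# Friendly name -> World Bank series code.
lookup <- c(birth = "SP.DYN.CBRT.IN", pop = "SP.POP.TOTL")

# Rename every column whose code appears in the lookup.
names(toy)[match(lookup, names(toy))] <- names(lookup)

# Convert factor -> character -> numeric; calling as.numeric() directly on a
# factor would return the internal level codes instead of the coordinates.
toy$lng <- as.numeric(as.character(toy$longitude))

# Rows without a geocode (aggregates such as "World") become NA and are dropped.
toy <- toy[!is.na(toy$lng), ]
print(toy)
```

Keeping the name-to-code mapping in one vector means the indicator list and the rename step cannot drift apart.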

A Glimpse of the new dataframe:

DataFrame from the WDI package after data preparation and cleaning activities

glimpse(countries)

## Observations: 4,410
## Variables: 22
## $ iso2c     <chr> "AD", "AD", "AD", "AD", "AD", "AD", "AD", "AD", "AD", …
## $ country   <chr> "Andorra", "Andorra", "Andorra", "Andorra", "Andorra",…
## $ year      <int> 2018, 2007, 2004, 2005, 2017, 1998, 1999, 2000, 2006, …
## $ birth     <dbl> NA, 10.100, 10.900, 10.700, NA, 11.900, 12.600, 11.300…
## $ infmort   <dbl> 2.7, 4.5, 5.1, 4.9, 2.8, 6.4, 6.2, 5.9, 4.7, 5.5, 5.3,…
## $ net       <dbl> NA, 70.870000, 26.837954, 37.605766, 91.567467, 6.8862…
## $ lifeexp   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ forest    <dbl> NA, 34.042553, 34.042553, 34.042553, NA, 34.042553, 34…
## $ mobile    <dbl> 107.28255, 76.80204, 76.55160, 81.85933, 104.33241, 22…
## $ pop       <dbl> 77006, 82684, 76244, 78867, 77001, 64142, 64370, 65390…
## $ tour      <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ import    <dbl> 136.50668, 190.30053, 174.09246, 178.06349, 146.27331,…
## $ export    <dbl> 268.35043, 332.78037, 271.81148, 314.89205, 264.92993,…
## $ iso3c     <fct> AND, AND, AND, AND, AND, AND, AND, AND, AND, AND, AND,…
## $ region    <fct> Europe & Central Asia, Europe & Central Asia, Europe &…
## $ capital   <fct> Andorra la Vella, Andorra la Vella, Andorra la Vella, …
## $ longitude <fct> 1.5218, 1.5218, 1.5218, 1.5218, 1.5218, 1.5218, 1.5218…
## $ latitude  <fct> 42.5075, 42.5075, 42.5075, 42.5075, 42.5075, 42.5075, …
## $ income    <fct> High income, High income, High income, High income, Hi…
## $ lending   <fct> Not classified, Not classified, Not classified, Not cl…
## $ lng       <dbl> 1.5218, 1.5218, 1.5218, 1.5218, 1.5218, 1.5218, 1.5218…
## $ lat       <dbl> 42.5075, 42.5075, 42.5075, 42.5075, 42.5075, 42.5075, …

Phase 1:

Plot for Phase 1 - In this section, we first load the World Indicators.csv dataset from the local computer. We then rename the columns we need, because the column names in the dataset contain spaces (which read.csv converts to dots). Next, we reduce the Year column to just the four-digit year for plotting, and select only the columns required for the faceted graphs. Finally, we convert CO2Emissions and internetUsage to numeric. This completes the data preparation for Phase 1.

# Our code for Phase 1
setwd("~/Desktop")
worldIndicators.df <- read.csv("World Indicators.csv", header = TRUE)
#changing names of the columns as there are spaces in column names in the dataset
names(worldIndicators.df)[names(worldIndicators.df) == "Internet.Usage"] <- "internetUsage"
names(worldIndicators.df)[names(worldIndicators.df) == "CO2.Emissions"] <- "CO2Emissions"
names(worldIndicators.df)[names(worldIndicators.df) == "Health.Exp...GDP"] <- "healthExpPercentGDP"

#changing the year format to have just year for plotting purposes
worldIndicators.df$Year <- substring(worldIndicators.df$Year,6,9)
#selecting only the required columns for making facetted graphs
worldIndicators.df.norm <- worldIndicators.df[,c(9,3,15,19,33)]
#changing the type
worldIndicators.df.norm$internetUsage <- as.numeric(worldIndicators.df.norm$internetUsage)
worldIndicators.df.norm$CO2Emissions <- as.numeric(worldIndicators.df.norm$CO2Emissions)
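
Two of the preparation steps above depend on how read.csv parsed the file, so they are worth checking on a small example. The sketch below does not use the assignment's data: it assumes Year values look like "12/1/2000" and that a percentage column may have been read as a factor of strings such as "54%". Both are assumptions about the CSV, and the `raw` data frame is hypothetical.

```r
# Hypothetical rows mimicking the assumed CSV layout.
raw <- data.frame(
  Year           = c("12/1/2000", "12/1/2001"),
  Internet.Usage = c("54%", "9%"),
  stringsAsFactors = TRUE
)

# Fixed positions work only when every date string has the same shape.
year_by_position <- substring(raw$Year, 6, 9)

# Parsing the date first does not depend on character positions.
year_by_parsing <- format(as.Date(as.character(raw$Year), format = "%m/%d/%Y"), "%Y")

# as.numeric() on a factor returns level codes, not the printed values.
level_codes <- as.numeric(raw$Internet.Usage)

# Going through character (and stripping "%") recovers the actual numbers.
usage <- as.numeric(sub("%", "", as.character(raw$Internet.Usage), fixed = TRUE))

print(data.frame(year_by_position, year_by_parsing, level_codes, usage))
```

If str(worldIndicators.df.norm) already reports the columns as num or int, the plain as.numeric() calls are harmless; the extra as.character() step matters only when a column was read as a factor.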

Plots for Phase 1:

For the Phase 1 faceting, we first prepare the dataset by filtering to the countries we want to display. We then use the geom_line and facet_wrap functions to draw one faceted plot for each of the 3 variables, across 5 different countries. For the third plot, we applied a custom theme with the theme function. We chose a green background because we want to support the environment. Go Green!

p <- worldIndicators.df.norm %>% filter(Country %in% c("India", "United States", "Brazil", "Russian Federation", "China"))
ggplot(data = p ,aes(x= Year, y = internetUsage))+
  geom_line(mapping = 
                    aes(group = Country, color = Country), show.legend = FALSE) +
  facet_wrap(~Country, ncol = 2 , scales = "free")+
  expand_limits(y =0 )  +
  labs( x= "\nYear", y ="internetUsage", title= "Internet Usage Vs Year for various Countries")

options(scipen = 10L)
p <- worldIndicators.df.norm %>% filter(Country %in% c("India", "United States", "Brazil", "Russian Federation", "China"))
ggplot(data = p ,aes(x= Year, y = CO2Emissions))+
  geom_line(mapping = 
                    aes(group = Country, color = Country), show.legend = FALSE) +
  facet_wrap(~Country, ncol = 2, scales = "free")+
  expand_limits(y =0 )  +
  labs( x= "\nYear", y ="CO2Emissions", title= "CO2.Emissions Vs Year for various Countries")

## Warning: Removed 10 rows containing missing values (geom_path).

p <- worldIndicators.df.norm %>% filter(Country %in% c("India", "United States", "Brazil", "Russian Federation", "China"))
theme_set(theme_gray(base_size = 6.5))
ggplot(data = p ,aes(x= Year, y = healthExpPercentGDP))+
 
  geom_line(mapping = 
                    aes(group = Country, color = Country), show.legend = FALSE) + 
  facet_wrap(~Country, ncol=2, scales="free")+
  expand_limits(x =0 , y = 3)  +
  labs( x= "\nYear", y ="healthExpPercentGDP", title= "Health Expenditure (% of GDP) Vs Year for various Countries")+
  theme(plot.background=element_rect(fill="darkseagreen"),
        plot.margin = unit(c(0.001, 0.001, 0.001, 0.001), "cm"))
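
The three plots above differ only in the column mapped to y and in the title, so the shared layers could be wrapped in a function. This is a sketch of one way to do that, not the code used to produce the figures; `facet_by_country` and the `demo` data are hypothetical names and values, and the `.data[[...]]` pronoun assumes a reasonably recent ggplot2.

```r
library(ggplot2)

# Hypothetical helper: one faceted line chart per country for a given column.
facet_by_country <- function(df, y_col, title) {
  ggplot(df, aes(x = Year, y = .data[[y_col]])) +
    geom_line(aes(group = Country, color = Country), show.legend = FALSE) +
    facet_wrap(~Country, ncol = 2, scales = "free") +
    expand_limits(y = 0) +
    labs(x = "\nYear", y = y_col, title = title)
}

# Made-up values, only to show the call.
demo <- data.frame(
  Country = rep(c("A", "B"), each = 3),
  Year = rep(c("2000", "2001", "2002"), times = 2),
  internetUsage = c(1, 2, 4, 10, 12, 15)
)

p <- facet_by_country(demo, "internetUsage", "Internet Usage vs Year")
print(p)
```

Because the function returns an ordinary ggplot object, plot-specific additions such as the green theme can still be appended with `+` after the call.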

Phase 2 - World Map - Initial year - 1998:

World map showing a variable in 1998 (Phase 2). We chose the population variable for our analysis and filtered to the countries suggested in the assignment. On this world map, each marker's popup shows the capital and population of the country, separated by “/”. The marker color comes from a quantile palette computed over the selected countries only, so a dark blue marker indicates one of the largest populations among those five countries, not among all countries in the dataset. The Russian Federation and Brazil have the smallest populations of the five, so their markers are very light blue. Optionally, we can also set a default view of the map with a zoom level of 2, and set maximum bounds for panning.

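# select by name rather than position so the code survives column reordering
countries.df <- countries %>% select(country, year, lng, lat, capital, pop)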
countries.df.year <- countries.df %>% filter(year == 1998)


countries.df.year <- countries.df.year %>% filter(country %in% c("India", "United States", "Brazil", "Russian Federation", "China"))

countries.df.year$popnew <- paste(as.character(countries.df.year$capital), as.character(countries.df.year$pop), sep= '/ ')
pal <-  colorQuantile("Blues", countries.df.year$pop)

map <- leaflet() %>%
addProviderTiles("CartoDB") %>%
addCircleMarkers(lng = countries.df.year$lng, 
                 lat = countries.df.year$lat, 
                 color = pal(countries.df.year$pop) , 
                 popup = countries.df.year$popnew)
  

map
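
The optional default view and boundary limits mentioned above could look like the following sketch. The `demo` data frame, its coordinates, and its populations are placeholders rather than WDI values, and the centre point and bounds are arbitrary choices; `setView()` and `setMaxBounds()` are the leaflet functions for the initial view and the panning limits.

```r
library(leaflet)

# Placeholder markers; not real capitals or populations.
demo <- data.frame(
  capital = c("Capital A", "Capital B", "Capital C"),
  lng = c(-50, 20, 100),
  lat = c(-10, 50, 30),
  pop = c(1e6, 5e7, 9e8)
)
demo$popnew <- paste(demo$capital, demo$pop, sep = "/ ")

# Quantile palette computed only from the rows being plotted.
pal <- colorQuantile("Blues", demo$pop)

map <- leaflet(demo) %>%
  addProviderTiles("CartoDB") %>%
  addCircleMarkers(lng = ~lng, lat = ~lat, color = ~pal(pop), popup = ~popnew) %>%
  setView(lng = 0, lat = 20, zoom = 2) %>%   # default centre and zoom level
  setMaxBounds(lng1 = -180, lat1 = -90,      # south-west corner
               lng2 = 180, lat2 = 90)        # north-east corner

map
```

Passing the data frame to leaflet() and using the `~column` formula syntax avoids repeating the data frame name in every argument.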

World map - Latest year - 2018:

World map showing the same variable in the most recent year (Phase 2). We again filtered to the countries suggested in the assignment, and each marker's popup shows the capital and population of the country. As before, a dark blue marker indicates one of the largest populations among the five selected countries. Population grew between 1998 and 2018 in most of these countries, but the Russian Federation and Brazil still have the smallest populations of the five, so their markers remain light blue.

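# select by name rather than position so the code survives column reordering
countries.df <- countries %>% select(country, year, lng, lat, capital, pop)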
countries.df.year.2018 <- countries.df %>% filter(year == 2018)


countries.df.year.2018 <- countries.df.year.2018 %>% filter(country %in% c("India", "United States", "Brazil", "Russian Federation", "China"))

countries.df.year.2018$popnew <- paste(as.character(countries.df.year.2018$capital), as.character(countries.df.year.2018$pop), sep= '/ ')
pal <-  colorQuantile("Blues", countries.df.year.2018$pop)

map <- leaflet() %>%
addProviderTiles("CartoDB") %>%
addCircleMarkers(lng = countries.df.year.2018$lng, 
                 lat = countries.df.year.2018$lat, 
                 color = pal(countries.df.year.2018$pop) , 
                 popup = countries.df.year.2018$popnew)
 
map
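
The statement about population change between the two maps can be checked directly rather than read off the marker colors. A minimal sketch, using a made-up long-format data frame with the same `country`, `year`, and `pop` columns as `countries` (the values are invented):

```r
# Invented values in the same long layout as the countries data frame.
demo <- data.frame(
  country = rep(c("A", "B"), each = 2),
  year    = rep(c(1998, 2018), times = 2),
  pop     = c(100, 120, 200, 190)
)

# One row per country, one population column per year.
wide <- reshape(demo, idvar = "country", timevar = "year", direction = "wide")

# Absolute change and a flag for growth over the period.
wide$change <- wide$pop.2018 - wide$pop.1998
wide$grew   <- wide$change > 0
print(wide)
```

Applied to `countries` filtered to the five focus countries and the years 1998 and 2018, the `grew` column shows at a glance whether every country's population increased.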

Assignment 2

Malavika Andavilli and Federico Lederman