MATH2349 Data Wrangling

Required packages

Provide the packages required to reproduce the report. Make sure you fulfilled the minimum requirement #10.

# This is the R chunk for the required packages
library(readr)
library(tidyr)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(magrittr)

## 
## Attaching package: 'magrittr'

## The following object is masked from 'package:tidyr':
## 
##     extract

library(ggplot2)

Executive Summary

This investigation joins three World Bank (WB) data sets using the left join function. One contains each country’s total population, another contains GDP, and the last contains the income grouping (high, upper middle, lower middle, and low).

The population and GDP data were put in a tidy format using the gather function before being joined. Unnecessary columns were then removed and the year of investigation was narrowed to 2019.

Investigation of the data shows that omitting NA values is appropriate, because the WB includes non-country regional and geographic groupings alongside countries. Whilst some of the remaining observations had population data, they did not have GDP data, as many were overseas territories or dependencies of other states, e.g. Gibraltar and the Isle of Man in relation to the United Kingdom. For simplicity, these locations were removed from the analysis to avoid the possibility of double reporting.

Following this, the variables are correctly categorised (converted to the appropriate data types). The variable GDP per capita is created with the mutate function. Tukey’s method of outlier detection is used to test for outliers. Given the large discrepancy in GDP per capita between income groups, it was appropriate to test for outliers within each income group, with the data grouped by income group to create the box plots. Tukey’s method revealed only one outlier, which was located in the high income group. The outlier, Luxembourg, was kept in the data as it illustrates an interesting story about how a small, highly skilled population can produce a very high GDP per capita figure.

Histograms revealed that the distribution of GDP per capita was right skewed for the high, upper middle and lower middle income groups. A natural logarithm transformation was used to bring these closer to a normal distribution. The low income group was approximately normally distributed to begin with and did not require transformation.

Data

A clear description of data sets, their sources, and variable descriptions should be provided. In this section, you must also provide the R codes with outputs (head of data sets) that you used to import/read/scrape the data set. You need to fulfil the minimum requirement #1 and merge at least two data sets to create the one you are going to work on. In addition to the R codes and outputs, you need to explain the steps that you have taken.

All sources used for this analysis are from the World Bank. The data used is from three World Bank Tables:

Population, total (http://datatopics.worldbank.org/world-development-indicators/themes/people.html).
The Population, total metadata which included income grouping information (http://datatopics.worldbank.org/world-development-indicators/themes/people.html).
GDP (current US$) (http://datatopics.worldbank.org/world-development-indicators/themes/econoumy.html).

The Population, total data set includes the total population of countries, specific regions and geographical groupings from 1960 to 2019. The other variables included were country name and country code. The data set was untidy as each year formed a separate column.

The Population, total metadata records each country’s current income group, the region to which it belongs, the country code and the table name (country). The data is tidy.

The GDP (current US$) data set contains the country name, country code, and GDP from 1960 to 2019 in current US$. The data is untidy as each year has a separate column, rather than there being a single year column.

I blended several steps in this first process and describe them below:

Used the read_csv function to read the three data sets into R.
Progressively removed obsolete columns as I checked the imported structure. Column X66 was meaningless and contained only missing values.
The 2020 column was also full of NA values as the information had not been reported yet.
Made the untidy data sets tidy with the gather function, placing all year columns in a single Year column with the values stored as Total Population and GDP respectively.
Used the left join function to join the total population data set (assigned to pop) with the country metadata set (assigned to pop_info). The key was “Country Code”, as it was the only column shared between both tables.
Removed two more obsolete columns, SpecialNotes and TableName.
Put the GDP data in a tidy form using the gather function.
Combined the population and GDP data using a left join and assigned the result to total_data. The tables shared multiple key variables: year, country code, and country name. Following this, I used the select function to remove other uninformative columns and filtered the data to 2019 only, as we did not cover time series data in this course.
Renamed the columns in lower case for ease of coding.
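As a minimal sketch of the join step described above, the toy tables below (made-up values and hypothetical names pop_toy and info_toy, not the World Bank files) show how left_join keeps every row of the left table and fills the joined columns with NA where a key has no match on the right.

```r
library(dplyr)

# Toy stand-ins for the population table and its metadata (values are made up).
pop_toy <- tibble(
  code = c("AAA", "BBB", "ZZZ"),
  population = c(100, 250, 350)
)
info_toy <- tibble(
  code = c("AAA", "BBB"),
  income_group = c("High income", "Low income")
)

# left_join keeps all three rows of pop_toy; "ZZZ" has no match in info_toy,
# so its income_group is NA after the join.
joined_toy <- left_join(pop_toy, info_toy, by = "code")
joined_toy
```

Checking the row count before and after a join is a quick way to confirm that the key did not duplicate or drop any observations.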

# This is the R chunk for the Data Section
#upload data

pop <- read_csv("API_SP.POP.TOTL_DS2_en_csv_v2_1429441.csv", skip = 3)

## Warning: Missing column names filled in: 'X66' [66]

## Parsed with column specification:
## cols(
##   .default = col_double(),
##   `Country Name` = col_character(),
##   `Country Code` = col_character(),
##   `Indicator Name` = col_character(),
##   `Indicator Code` = col_character(),
##   `2020` = col_logical(),
##   X66 = col_logical()
## )

## See spec(...) for full column specifications.

head(pop)

pop %<>% select(-"X66")

pop_info <- read_csv("Metadata_Country_API_SP.POP.TOTL_DS2_en_csv_v2_1429441.csv")

## Parsed with column specification:
## cols(
##   `Country Code` = col_character(),
##   Region = col_character(),
##   IncomeGroup = col_character(),
##   SpecialNotes = col_character(),
##   TableName = col_character()
## )

head(pop_info)

gdp <- read_csv("API_NY.GDP.MKTP.CD_DS2_en_csv_v2_1429653.csv", skip = 3)

## Warning: Missing column names filled in: 'X66' [66]

## Parsed with column specification:
## cols(
##   .default = col_double(),
##   `Country Name` = col_character(),
##   `Country Code` = col_character(),
##   `Indicator Name` = col_character(),
##   `Indicator Code` = col_character(),
##   `2020` = col_logical(),
##   X66 = col_logical()
## )
## See spec(...) for full column specifications.

head(gdp)

gdp %<>% select(-"X66")

#gather pop

pop %<>% gather('1960':'2020', key = "Year", value = "Total Population")
head(pop)

# join pop_info to pop

pop %<>% left_join(pop_info, by = "Country Code")
head(pop)

pop %<>% select(-"SpecialNotes", -"TableName")


#Gather gdp

gdp %<>% gather('1960':'2020', key = "Year", value = "GDP")
head(gdp)

#join gdp to pop to create new dataset: total_data

total_data <- pop %>% left_join(gdp, by = c("Country Name", "Country Code", 
                                            "Year")) %>% 
  select(- c(starts_with("Ind"), "Country Code")) %>% filter(Year == 2019)

head(total_data)

colnames(total_data) <- c("country", "year", "total_population",
                          "region", "income_group", "gdp")
head(total_data)

str(total_data)

## tibble [264 x 6] (S3: tbl_df/tbl/data.frame)
##  $ country         : chr [1:264] "Aruba" "Afghanistan" "Angola" "Albania" ...
##  $ year            : chr [1:264] "2019" "2019" "2019" "2019" ...
##  $ total_population: num [1:264] 106314 38041754 31825295 2854191 77142 ...
##  $ region          : chr [1:264] "Latin America & Caribbean" "South Asia" "Sub-Saharan Africa" "Europe & Central Asia" ...
##  $ income_group    : chr [1:264] "High income" "Low income" "Lower middle income" "Upper middle income" ...
##  $ gdp             : num [1:264] NA 1.91e+10 9.46e+10 1.53e+10 3.15e+09 ...

Understand

Summarise the types of variables and data structures, check the attributes in the data and apply proper data type conversions. In addition to the R codes and outputs, explain briefly the steps that you have taken. In this section, show that you have fulfilled minimum requirements 2-4.

Only the population and GDP data were correctly read in as numeric values. The other variables required explicit conversion.

The following variables were converted to factors:

country
region

income_group was converted to an ordered factor with the levels Low income < Lower middle income < Upper middle income < High income.

I used the unique function to see each unique level and decide how to order them logically.

I noted there is an NA value, which will be removed in the next step.
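The short sketch below illustrates what an ordered factor adds over a plain one, using a made-up vector (groups) rather than the report's data: once the levels are ordered, comparison operators respect that order and NA values stay NA.

```r
# Made-up income labels, including a missing value.
groups <- c("High income", "Low income", NA, "Upper middle income")
lvls <- c("Low income", "Lower middle income",
          "Upper middle income", "High income")

# ordered = TRUE records that the levels run from lowest to highest.
income <- factor(groups, levels = lvls, ordered = TRUE)

levels(income)

# Comparisons follow the level order; the NA entry remains NA.
below_high <- income < "High income"
below_high

# Count each level, showing NA as its own category if present.
table(income, useNA = "ifany")
```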

# This is the R chunk for the Understand Section
str(total_data)

## tibble [264 x 6] (S3: tbl_df/tbl/data.frame)
##  $ country         : chr [1:264] "Aruba" "Afghanistan" "Angola" "Albania" ...
##  $ year            : chr [1:264] "2019" "2019" "2019" "2019" ...
##  $ total_population: num [1:264] 106314 38041754 31825295 2854191 77142 ...
##  $ region          : chr [1:264] "Latin America & Caribbean" "South Asia" "Sub-Saharan Africa" "Europe & Central Asia" ...
##  $ income_group    : chr [1:264] "High income" "Low income" "Lower middle income" "Upper middle income" ...
##  $ gdp             : num [1:264] NA 1.91e+10 9.46e+10 1.53e+10 3.15e+09 ...

total_data$country %<>% factor()
total_data$region %<>% factor()
total_data$income_group %>% unique

## [1] "High income"         "Low income"          "Lower middle income"
## [4] "Upper middle income" NA

total_data$income_group %<>% factor(levels = c("Low income", "Lower middle income",
                                               "Upper middle income", "High income"),
                                               ordered = TRUE)
glimpse(total_data)

## Rows: 264
## Columns: 6
## $ country          <fct> "Aruba", "Afghanistan", "Angola", "Albania", "Ando...
## $ year             <chr> "2019", "2019", "2019", "2019", "2019", "2019", "2...
## $ total_population <dbl> 106314, 38041754, 31825295, 2854191, 77142, 427870...
## $ region           <fct> Latin America & Caribbean, South Asia, Sub-Saharan...
## $ income_group     <ord> High income, Low income, Lower middle income, Uppe...
## $ gdp              <dbl> NA, 1.910135e+10, 9.463542e+10, 1.527808e+10, 3.15...

Tidy & Manipulate Data I

Explain why your data (or one of the data sets) doesn’t conform the tidy data principles (minimum requirement #5). Apply the required steps to reshape the data into a tidy format. In addition to the R codes and outputs, explain everything that you do in this step.

This process was completed during the import stage.

For data to be tidy:

Each variable must have its own column.
Each observation must have its own row.
Each value must have its own cell.

Having each year in a separate column violated the first principle, because year is a single variable whose values were spread across the column names.
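For illustration, the same wide-to-long reshape can be sketched on a toy table. This report uses gather, which still works; tidyr 1.0.0 and later document pivot_longer as its replacement, so the sketch below uses that instead. The table wide_toy and its values are made up.

```r
library(tidyr)
library(tibble)

# A toy wide table with one column per year (values are made up).
wide_toy <- tibble(
  country = c("A", "B"),
  `2018` = c(10, 20),
  `2019` = c(11, 21)
)

# Move the year columns into a single "year" column, one row per country-year.
long_toy <- pivot_longer(
  wide_toy,
  cols = c(`2018`, `2019`),
  names_to = "year",
  values_to = "total_population"
)
long_toy
```

As with gather, the year values come from column names and are therefore stored as character strings unless converted.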

Tidy & Manipulate Data II

Create/mutate at least one variable from the existing variables (minimum requirement #6). In addition to the R codes and outputs, explain everything that you do in this step.

I created gdp_per_capita using the mutate function.

gdp_per_capita is a country’s GDP divided by its total population and is a measure of a country’s relative wealth and development.
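As a small worked example of the calculation, the sketch below applies the same division to two rows whose figures are copied from the printed output earlier in this report (Afghanistan and Angola, with GDP rounded as displayed). The object name example_rows is introduced here only for illustration.

```r
library(dplyr)

# Figures as printed by glimpse() above; GDP is rounded to 7 significant digits.
example_rows <- tibble(
  country = c("Afghanistan", "Angola"),
  total_population = c(38041754, 31825295),
  gdp = c(1.910135e10, 9.463542e10)
)

# GDP per capita = GDP / total population, in current US$ per person.
example_rows <- example_rows %>%
  mutate(gdp_per_capita = gdp / total_population)
example_rows
```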

# This is the R chunk for the Tidy & Manipulate Data II 
total_data %<>% mutate(gdp_per_capita = gdp / total_population)

head(total_data)

Scan I

Scan the data for missing values, special values and obvious errors (i.e. inconsistencies). In this step, you should fulfil the minimum requirement #7. In addition to the R codes and outputs, explain your methodology (i.e. explain why you have chosen that methodology and the actions that you have taken to handle these values) and communicate your results clearly.

Firstly, I paired the sum and colSums functions with the is.na function to check whether NAs are present in the data set.

As shown earlier, there are. The colSums function showed 47 NAs in the region variable, which flagged rows that are groupings of countries rather than individual countries. After confirming this, I removed the rows with an NA region through subsetting and re-assigned the result to total_data.

Whilst the remaining NAs are above the 5% deletion level, further investigation showed that many of them belong to small semi-independent states or overseas territories of other states. It is impossible to adequately measure the GDP contribution of these territories to the greater state, and an imputed average could well over- or underestimate it. I therefore thought it made sense to remove them.

I used the na.omit function to remove the remaining missing values.
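The missing-value workflow described above can be sketched on a toy data frame (made-up values, hypothetical name toy). It also shows colMeans as one way to get the share of missing values per column for comparison against a deletion threshold, which is an addition for illustration rather than a step taken in this report.

```r
# Toy data with missing region and gdp values (made up).
toy <- data.frame(
  country = c("A", "B", "C", "D"),
  region = c("X", NA, "Y", "Y"),
  gdp = c(1.5, NA, NA, 2.0)
)

# Count and proportion of NAs per column.
na_counts <- colSums(is.na(toy))
na_share <- colMeans(is.na(toy))
na_counts
na_share

# Drop rows with a missing region using a logical index.
toy_regions <- toy[!is.na(toy$region), ]

# Drop any remaining rows that still contain an NA.
toy_complete <- na.omit(toy_regions)
nrow(toy_complete)
```

Logical indexing with !is.na() behaves the same as negative indexing with which() when NAs are present, and it also returns the full data frame when there are none, whereas -which() on an empty result would return zero rows.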

# This is the R chunk for the Scan I
#missing values
sum(is.na(total_data))

## [1] 177

colSums(is.na(total_data))

##          country             year total_population           region 
##                0                0                1               47 
##     income_group              gdp   gdp_per_capita 
##               47               41               41

is.na(total_data$region)

##   [1] FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
##  [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE
##  [37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE
##  [49] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE
##  [61]  TRUE  TRUE  TRUE  TRUE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE
##  [73]  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [85] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE
##  [97]  TRUE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE FALSE FALSE
## [109]  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [121] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE
## [133]  TRUE  TRUE  TRUE FALSE FALSE  TRUE  TRUE FALSE  TRUE FALSE FALSE FALSE
## [145] FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE
## [157] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [169]  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE
## [181] FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE
## [193] FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE
## [205] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE  TRUE
## [217]  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [229]  TRUE  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE  TRUE FALSE  TRUE  TRUE
## [241] FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE
## [253] FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE

which(is.na(total_data$region))

##  [1]   6  35  48  60  61  62  63  64  67  72  73  94  97 101 102 103 104 106 109
## [20] 127 133 134 135 138 139 141 152 155 160 169 180 182 190 196 197 203 214 216
## [39] 217 229 230 235 237 239 240 248 258

# missing values in region highlight that there are rows consisting of
# aggregations of countries under the country column.

total_data %>% head()

#subset missing rows from the regions out of total_data

total_data <- total_data[-(which(is.na(total_data$region))),]

head(total_data)

colSums(is.na(total_data))

##          country             year total_population           region 
##                0                0                0                0 
##     income_group              gdp   gdp_per_capita 
##                0               38               38

0.05 * 263

## [1] 13.15

head(total_data[(which(is.na(total_data$gdp))),])

#the 38 missing GDP values are above the 5% deletion level, however, it
#doesn't make sense to impute an average of gdp data to fill the gaps.
#gdp is country specific. Furthermore, many of the countries without gdp data seem
#to be small states, or overseas territories or dependencies of other countries
#e.g. Gibraltar and the Channel Islands in relation to the UK. It seems
#reasonable to omit the data for this reason.

total_data %<>% na.omit()

Scan II

Scan the numeric data for outliers. In this step, you should fulfil the minimum requirement #8. In addition to the R codes and outputs, explain your methodology (i.e. explain why you have chosen that methodology and the actions that you have taken to handle these values) and communicate your results clearly.

I decided to use Tukey’s method of outlier detection because it pairs naturally with a box plot and is relatively simple.

I also chose it because it is easy to apply within groups. The difference in GDP per capita between rich and poor countries is so great that it makes sense to treat each income group as a discrete group rather than assess all GDP per capita values as a whole.

With all groups on one plot the low income group is not readable, but the plot clearly shows an outlier at the upper end of the high income group.

Excluding high income from the graph re-scales the GDP per capita axis and makes it clear that there is no outlier in the low income group.

I then used the filter function to find which country was the outlier.

Luxembourg is the outlier. It is a small Western European country located between France and Germany. Its major sectors include financial services and law, and it hosts one of the four capitals of the European Union. I decided to leave it in the data as it is not an error and provides an interesting story about development and specialisation in services.
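The box plots apply Tukey’s rule visually. As a rough numeric counterpart, the sketch below computes the fences (Q1 - 1.5 × IQR and Q3 + 1.5 × IQR) within each group on made-up data. It assumes quantile() with its default method, which can differ slightly from the hinges a box plot uses, so borderline points may be classified differently.

```r
library(dplyr)

# Made-up GDP per capita values for two hypothetical groups.
toy <- tibble(
  income_group = rep(c("Low", "High"), each = 6),
  gdp_per_capita = c(400, 500, 550, 600, 700, 800,
                     30000, 35000, 40000, 42000, 45000, 115000)
)

# Compute Tukey's fences separately within each income group.
flagged <- toy %>%
  group_by(income_group) %>%
  mutate(
    q1 = quantile(gdp_per_capita, 0.25, names = FALSE),
    q3 = quantile(gdp_per_capita, 0.75, names = FALSE),
    lower_fence = q1 - 1.5 * (q3 - q1),
    upper_fence = q3 + 1.5 * (q3 - q1),
    is_outlier = gdp_per_capita < lower_fence | gdp_per_capita > upper_fence
  ) %>%
  ungroup()

# Rows that fall outside their own group's fences.
filter(flagged, is_outlier)
```

Computing the fences per group avoids a hard-coded cut-off and makes the flagged rows reproducible if the data are updated.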

# This is the R chunk for the Scan II
#scan for outliers
ggplot(total_data, aes(x = income_group, y = gdp_per_capita)) + 
  geom_boxplot() + coord_flip() +
  labs(title = "World GDP per Capita by Income Groups",
       x = "",
       y = "GDP per Capita (Current $US)")

ggplot(total_data %>% filter(income_group != "High income"), aes(x = income_group, y = gdp_per_capita)) + 
  geom_boxplot() + coord_flip() +
  labs(title = "World GDP per Capita by Income Groups",
       subtitle = "Excluding High income",
       x = "",
       y = "GDP per Capita (Current $US)")

ggplot(total_data %>% filter(income_group == "High income"), aes(x = income_group, y = gdp_per_capita)) + 
  geom_boxplot() + coord_flip() +
  labs(title = "World GDP per Capita by Income Groups",
       subtitle = "High income only",
       x = "",
       y = "GDP per Capita (Current $US)")

ggplot(total_data %>% filter(income_group == "Low income"), aes(x = income_group, y = gdp_per_capita)) + 
  geom_boxplot() + coord_flip() +
  labs(title = "World GDP per Capita by Income Groups",
       subtitle = "Low income only",
       x = "",
       y = "GDP per Capita (Current $US)")

#it makes sense to test for outliers within each income group as gdp per capita
# varies so much between high and low income countries.
# Using Tukey's method - intuitive and straightforward.


#there appears to be one outlier in the whole data set.
total_data %>% filter(gdp_per_capita >= 90000)

#Luxembourg is an outlier. It is a small Western European country located
#between France and Germany. Its major sectors include financial services and
#law, and it hosts one of the four capitals of the European Union.
#I will leave it in the data as it is not an error and provides an interesting
#story about development and specialisation in services.

Transform

Apply an appropriate transformation for at least one of the variables. In addition to the R codes and outputs, explain everything that you do in this step. In this step, you should fulfil the minimum requirement #9.

Firstly, I filtered total_data by income group to make four discrete groups.

I then plotted the distribution of each income group using a histogram.

The histograms showed GDP per capita is right skewed for all except low income, which is roughly normally distributed.

For right skewed distributions the appropriate transformations are roots, logarithms or reciprocals.

After comparing each, the natural logarithm seemed to provide the best transformation, correcting the right skew in the three histograms.

The original histogram for the low income group was already approximately normally distributed so no transformation was required.
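The comparison of candidate transformations was made by eye from histograms. As a hedged numeric complement, the sketch below computes sample skewness for each candidate on simulated right skewed data. The data are simulated (log-normal, fixed seed), and the skewness helper is defined here for illustration, so the result does not describe the World Bank figures.

```r
# Simulated right skewed values standing in for GDP per capita (not real data).
set.seed(1)
x <- rlnorm(500, meanlog = 8, sdlog = 1)

# Moment-based sample skewness; values near 0 indicate a symmetric shape.
skewness <- function(v) {
  m <- mean(v)
  mean((v - m)^3) / mean((v - m)^2)^1.5
}

# Skewness of the raw data and of each candidate transformation.
skews <- c(
  raw = skewness(x),
  sqrt = skewness(sqrt(x)),
  log = skewness(log(x)),
  reciprocal = skewness(1 / x)
)
round(skews, 2)

# Name of the transformation whose skewness is closest to zero.
best <- names(which.min(abs(skews)))
best
```

On real data the preferred transformation may differ, so the histograms remain the primary check.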

# This is the R chunk for the Transform Section
high_income <- total_data %>% filter(income_group == "High income")
head(high_income)

upper_middle <- total_data %>% filter(income_group == "Upper middle income")
head(upper_middle)

lower_middle <- total_data %>% filter(income_group == "Lower middle income")
head(lower_middle)

low_income <- total_data %>% filter(income_group == "Low income")
head(low_income)

hist(high_income$gdp_per_capita)

hist(upper_middle$gdp_per_capita)

hist(lower_middle$gdp_per_capita)

hist(low_income$gdp_per_capita)

#GDP per capita is right skewed for all except low income, which is roughly
#normally distributed

# for right skewed data the appropriate transformations are taking roots,
# logarithms or reciprocals; let's see which one is more appropriate.

#ln high income

#ln (natural logarithm)
ln_high_income <- log(high_income$gdp_per_capita)
hist(ln_high_income)

#much better

#ln upper middle

ln_upper_middle <- log(upper_middle$gdp_per_capita)
hist(ln_upper_middle)

#ln lower middle

ln_lower_middle <- log(lower_middle$gdp_per_capita)
hist(ln_lower_middle)