MATH2349 Data Wrangling

Required packages

Provide the packages required to reproduce the report. Make sure you fulfilled the minimum requirement #10.

# This is the R chunk for the required packages

library(readr)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(tidyr)
library(Hmisc)

## Loading required package: lattice

## Loading required package: survival

## Loading required package: Formula

## Loading required package: ggplot2

## 
## Attaching package: 'Hmisc'

## The following objects are masked from 'package:dplyr':
## 
##     src, summarize

## The following objects are masked from 'package:base':
## 
##     format.pval, units

Executive Summary

In this assignment, 3 datasets were merged with the help of common attributes i.e. Country Name, Country Code and Year to form a complete dataframe. Various data preprocessing steps were performed like subsetting and renaming columns, omitting rows with too many missing values etc. Two of the dataframes were untidy as the year columns were attributes instead of observations. The data was brought to correct format with the help of tidy principles suh as gather().

Checked the structure of all datasets and performed relevant conversion of data types, wherever required. Nominal attributes such as ‘Income Group’ and ‘Years’ which take limited values were made factors with ordered is equal to true.

Scanned all datasets for missing and special values. Missing values for the nominal attribute was removed since they were mostly part of regions such as South Asia, East Asia, Small states etc instead of countries. The missing values for numerical attribute were replaced by the mean of the respective countries.

Numerical attributes were scanned for outliers with the help of boxplot. The outliers were dealt using capping since they were extreme values for some of the countries which was correct.

Lastly, data transformation techniques were performed to reduce skewness and bring the data close to normal distribution so that further statistical apporoaches could be carried out.

Data

The data is taken from “The World Bank” website. The three datasets used in our analysis are-

World Population dataset- The dataset contains population data of 264 countries from 1960 to 2019. It has 65 attributes and 264 observations. The attributes are :-
1. Country Name- Name of 264 countries
2. Country Code- 3 letter codes of country
3. Indicator Name- Data indicator i.e. Population
4. Indicator Code- Code of the indicator
5. Years- 61 variables of years starting from 1960 - 2020
Source of data- https://data.worldbank.org/indicator/SP.POP.TOTL?locations=1W

Download link- http://api.worldbank.org/v2/en/indicator/SP.POP.TOTL?downloadformat=csv
Poverty dataset- The dataset contains poverty headcount ratio at $1.90 day(% of population) from 1960 to 2018. It has 65 attributes and 264 observations. The attributes are:-
1. Country Name- Name of 264 countries
2. Country Code- 3 letter codes of country
3. Indicator Name- Data indicator i.e. Poverty percentage
4. Indicator Code- Code of the indicator
5. Years- 61 variables of years starting from 1960 - 2020
Source of data- https://data.worldbank.org/indicator/SI.POV.DDAY

Download Link- http://api.worldbank.org/v2/en/indicator/SI.POV.MDIM.17.XQ?downloadformat=csv
Metadata- The dataset contains income groups of 263 countries. It has 5 attributes and 263 observations. The attributes are:-
1. Country Code- 3 letter country code
2. Region- Region of the country. For example- Latin America & Caribbean, South Asia etc.
3. Income group- Income status of countries
4. Special notes
5. Table Name- Name of countries
The dataset in avaiable under all the links provided above.

The main objective is to analyze the percentage of population living under/over poverty over time from 2001 to 2018. There is no data for the year 2019 and 2020 in our dataset.

# This is the R chunk for the Data Section

# Importing Population Data
  
Population <- read_csv("E:/data wrangling/TotalPopulation/API_SP.POP.TOTL_DS2_en_csv_v2_1495124.csv",skip=3)

## Warning: Missing column names filled in: 'X66' [66]

## Parsed with column specification:
## cols(
##   .default = col_double(),
##   `Country Name` = col_character(),
##   `Country Code` = col_character(),
##   `Indicator Name` = col_character(),
##   `Indicator Code` = col_character(),
##   `2020` = col_logical(),
##   X66 = col_logical()
## )

## See spec(...) for full column specifications.

head(Population)

# Importing Poverty data

Poverty <- read_csv("E:/data wrangling/ExtremePoverty/API_SI.POV.DDAY_DS2_en_csv_v2_1496490.csv",skip = 3)

## Warning: Missing column names filled in: 'X66' [66]

## Parsed with column specification:
## cols(
##   .default = col_double(),
##   `Country Name` = col_character(),
##   `Country Code` = col_character(),
##   `Indicator Name` = col_character(),
##   `Indicator Code` = col_character(),
##   `1960` = col_logical(),
##   `1961` = col_logical(),
##   `1962` = col_logical(),
##   `1963` = col_logical(),
##   `1964` = col_logical(),
##   `1965` = col_logical(),
##   `1966` = col_logical(),
##   `1968` = col_logical(),
##   `1970` = col_logical(),
##   `1972` = col_logical(),
##   `1973` = col_logical(),
##   `1976` = col_logical(),
##   `2019` = col_logical(),
##   `2020` = col_logical(),
##   X66 = col_logical()
## )
## See spec(...) for full column specifications.

head(Poverty)

# Importing Metadata

Metadata <- read_csv("E:/data wrangling/ExtremePoverty/Metadata_Country_API_SI.POV.DDAY_DS2_en_csv_v2_1496490.csv")

## Warning: Missing column names filled in: 'X6' [6]

## Parsed with column specification:
## cols(
##   `Country Code` = col_character(),
##   Region = col_character(),
##   IncomeGroup = col_character(),
##   SpecialNotes = col_character(),
##   TableName = col_character(),
##   X6 = col_logical()
## )

head(Metadata)

Subset required columns

Filtered out useful columns from Population, Poverty and Metadata in new dataframes. These dataframes and their attributes are:-

Population_filtered- Country Name, Country Code, Years (2001-2018)
Poverty_filtered- Country Name, Country Code, Years (2001-2018)
Metadata_filtered- Country Code, Income group , TableName

# This is the R chunk for the Subset required columns Section

# Sub-setting required columns from Population, Poverty and Metadata

Population_filtered <- Population[, c(1,2, 46:63)]
head(Population_filtered)

Poverty_filtered <- Poverty[, c(1,2, 46:63)]
head(Poverty_filtered)

Metadata_filtered <- Metadata[,c(1,3,5)]
head(Metadata_filtered)

Tidy & Manipulate Data I

Observations with more than 9 missing values from year 2001 to 2018 were removed from all the datasets in order to avoid discrepancy.

Column “TableName” of Metadata was renamed to “Country Name” for better understanding.

In tidy data:

Each variable must have its own column.
Each observation must have its own row.
Each value must have its own cell.

Population and Poverty datasets were untidy as the years (2001-2018) were column headers whereas they should be observations (values) with column header as “Year”. To transform the data from wide to long format, gather() function of tidyr package was used.

# This is the R chunk for the Tidy & Manipulate Data I

# Function for deleting NA values

delete_NA <- function(DF, n=0) {
  DF[rowSums(is.na(DF)) <= n,]
}

# Deleting rows with more than 9 NA values from all the datasets

Population_filtered <- delete_NA(Population_filtered,9)

Poverty_filtered <- delete_NA(Poverty_filtered,9)

Metadata_filtered <- delete_NA(Metadata_filtered,9)

# Renaming column name

colnames(Metadata_filtered)[3] <- c("Country Name")

# Checking dimension of the dataset

dim(Population_filtered)

## [1] 263  20

dim(Poverty_filtered)

## [1] 74 20

dim(Metadata_filtered)

## [1] 263   3

# Gather function to get year in one column

Population_filtered <- gather(Population_filtered, key="Year", value= "Population", c(3:20))
head(Population_filtered)

Poverty_filtered <- gather(Poverty_filtered, key="Year", value= "Perc_pop_under_poverty", c(3:20))
head(Poverty_filtered)

Scan I

Scannned datasets for missing and special values before joining them. Checked for missing values in Population and Poverty datasets and replaced it with the mean of the respective country. This has been done in order to preserve useful information and removing it would have lead to loss of quality data. The colSums() function was used to get the number of missing values in each column.

For Metadata dataset, the missing values were dropped as the values didn’t belong to any countries but regions which was not of any use for our analysis.

Merged all the datasets with the help of common attributes. The inner_join() function was used to retain only rows which are present in both Population and Poverty datasets. The right_join() function was used between Metadata and PopulationXPoverty in order retain all the rows present in PopulationXPoverty and matching observations in both the dataset.

Rows with missing values under “Income Group” were dropped as it can’t be predicted and replacing it with the mode of the variable is not a correct approach and may lead to bias.

Special charaters were checked in the final dataframe. The dataframe didn’t contain any special character.

# This is the R chunk for the Scan I

# Checking and dealing with missing values

colSums(is.na(Population_filtered))

## Country Name Country Code         Year   Population 
##            0            0            0            7

Population_filtered <- Population_filtered %>% group_by(`Country Name`) %>%
  mutate(Population=ifelse(is.na(Population),mean(Population,na.rm=TRUE),Population))

colSums(is.na(Poverty_filtered))

##           Country Name           Country Code                   Year 
##                      0                      0                      0 
## Perc_pop_under_poverty 
##                    194

Poverty_filtered <- Poverty_filtered %>% group_by(`Country Name`) %>%
  mutate(`Perc_pop_under_poverty`=ifelse(is.na(`Perc_pop_under_poverty`),mean(`Perc_pop_under_poverty`,na.rm=TRUE),`Perc_pop_under_poverty`))

Metadata_filtered %>% is.na() %>% table()

## .
## FALSE  TRUE 
##   743    46

Metadata_filtered <- drop_na(Metadata_filtered)

# Merging all the datasets
PopulationXPoverty <- inner_join(Population_filtered,Poverty_filtered, by= c("Country Name","Country Code","Year"))

Final_df <- right_join(Metadata_filtered,PopulationXPoverty, by = c("Country Name","Country Code"))
head(Final_df)

colSums(is.na(Final_df))

##           Country Code            IncomeGroup           Country Name 
##                      0                    252                      0 
##                   Year             Population Perc_pop_under_poverty 
##                      0                      0                      0

# Omitting rows with na values in Income group variable

Final_df <- Final_df[!is.na(Final_df$IncomeGroup), ]

# Checking for missing or special character 

is.specialorNA <- function(x){
  if (is.numeric(x)) (is.infinite(x) | is.nan(x) | is.na(x))
}

Final_df_special <- sapply(Final_df, is.specialorNA)

Final_df_special <- as.data.frame(Final_df_special[c(5,6)])

colSums(Final_df_special)

##             Population Perc_pop_under_poverty 
##                      0                      0

Understand

Data structure of all the variables were checked using str() function. “IncomeGroup” and “Year” attributes were converted from chr to factor as they contained a set of limited values. The final dataset contains multiple data types such as numerics, characters, factors.

# This is the R chunk for the Understand Section

str(Final_df)

## Classes 'tbl_df', 'tbl' and 'data.frame':    1080 obs. of  6 variables:
##  $ Country Code          : chr  "ARG" "ARM" "AUT" "BEL" ...
##  $ IncomeGroup           : chr  "Upper middle income" "Upper middle income" "High income" "High income" ...
##  $ Country Name          : chr  "Argentina" "Armenia" "Austria" "Belgium" ...
##  $ Year                  : chr  "2001" "2001" "2001" "2001" ...
##  $ Population            : num  37275652 3050687 8042293 10286570 8009142 ...
##  $ Perc_pop_under_poverty: num  10.7 14.4 0.4 0.147 2.217 ...

Final_df$IncomeGroup <- factor(Final_df$IncomeGroup, levels = c("Low income", "Lower middle income", "Upper middle income", "High income"), ordered = TRUE)

Final_df$Year <- factor(Final_df$Year, levels = c(2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018), ordered = TRUE)

str(Final_df)

## Classes 'tbl_df', 'tbl' and 'data.frame':    1080 obs. of  6 variables:
##  $ Country Code          : chr  "ARG" "ARM" "AUT" "BEL" ...
##  $ IncomeGroup           : Ord.factor w/ 4 levels "Low income"<"Lower middle income"<..: 3 3 4 4 3 3 2 3 4 3 ...
##  $ Country Name          : chr  "Argentina" "Armenia" "Austria" "Belgium" ...
##  $ Year                  : Ord.factor w/ 18 levels "2001"<"2002"<..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Population            : num  37275652 3050687 8042293 10286570 8009142 ...
##  $ Perc_pop_under_poverty: num  10.7 14.4 0.4 0.147 2.217 ...

Tidy & Manipulate Data II

The final dataframe (Final_df) contains variable “Population” and “Perc_pop_under_poverty”. These two variables were used to mutate 3 more variables. These variables are:-

Num_of_people_under_poverty- This variable gives us information about the number of people living below poverty in a particular country. The formula for calculating it is (Perc_pop_under_poverty * Population)/100.
Perc_pop_over_poverty- This variable gives us information about the percentage of population living above poverty in a particular country. The formula for calculating it is 100 - Perc_pop_under_poverty.
Num_of_people_over_poverty- This variable gives us information about the number of people living above poverty in a particular country. The formula for calculating it is Population - Num_of_people_under_poverty.

The trunc() function is used to get whole numbers.

# This is the R chunk for the Tidy & Manipulate Data II 

# Mutating new variables 

Final_df <- mutate(Final_df,
                   Num_of_people_under_poverty = trunc((Final_df$`Perc_pop_under_poverty`*Final_df$Population)/100))

Final_df <- mutate(Final_df,
                   Perc_pop_over_poverty = (100-Final_df$Perc_pop_under_poverty)
)

Final_df <- mutate(Final_df,
                   Num_of_people_over_poverty = trunc((Final_df$Population - Final_df$Num_of_people_under_poverty ))
)

head(Final_df)

Scan II

Boxplot was used to check for univariate and multivariate outliers in all the numerical attributes of the dataframe. The outliers were handled using capping technique as they were extreme values and not outliers. Countries like India, China have a very high population compared to other nations and removing there values would result in information loss. Capping or winsorising involves replacing the outliers with the nearest neighbours that are not outliers. Hence, it can be used as a statistical approach to deal with extreme values. Outliers that lie outside the outlier fences on a box-plot, capping can be used to replace those observations outside the lower limit with the value of 5th percentile and those that lie above the upper limit, with the value of 95th percentile.

# This is the R chunk for the Scan II

# Detecting univariate and multivariate outliers

Final_df$Population %>% boxplot(main = "Population", ylab = "Number of people",horizontal = TRUE)

boxplot(Final_df$Population ~ Final_df$IncomeGroup, main="Country Population by Income Group", ylab = "Income Group", xlab = "Population",horizontal = TRUE)

boxplot(Final_df$Perc_pop_under_poverty ~ Final_df$IncomeGroup, main="Perc of people under poverty by Income Group", ylab = "Income Group", xlab = "% of people under poverty",horizontal = TRUE)

boxplot(Final_df$Num_of_people_under_poverty ~ Final_df$IncomeGroup, main="Number of people under poverty by Income Group", ylab = "Income Group", xlab = "Num of people under poverty",horizontal = TRUE)

boxplot(Final_df$Perc_pop_over_poverty ~ Final_df$IncomeGroup, main="Perc of people over poverty by Income Group", ylab = "Income Group", xlab = "% of people over poverty",horizontal = TRUE)

boxplot(Final_df$Num_of_people_over_poverty ~ Final_df$IncomeGroup, main="Number of people over poverty by Income Group", ylab = "Income Group", xlab = "Num of people over poverty",horizontal = TRUE)

# Function to cap the values outside the limits

cap <- function(x){
  quantiles <- quantile( x, c(.05, 0.25, 0.75, .95 ) )
  x[ x < quantiles[2] - 1.5*IQR(x) ] <- quantiles[1]
  x[ x > quantiles[3] + 1.5*IQR(x) ] <- quantiles[4]
  x
}

# Capping

Final_df$Population <- Final_df$Population %>% cap()

Final_df$Num_of_people_under_poverty <- Final_df$Num_of_people_under_poverty %>% cap()

Final_df$Num_of_people_over_poverty <- Final_df$Num_of_people_over_poverty %>% cap()

Transform

Checked for skewness of all the numerical variables using histogram. All the variables except “Perc_pop_over_poverty” were highly right skewed. “Perc_pop_over_poverty” was highly left skewed.

Tried transforming these variables to normal distribution by removing skewness so that statistical tests based on normality can be applied to it, if required.

After applying log transform on “Population”, the distribution of the variable has come close to normal or gaussian distribution.

Applying power transform on “Perc_pop_over_poverty” has no significant effect on the distribution of the variable.

# This is the R chunk for the Transform Section

# Histograms for all the numerical attributes 

Final_df$Population %>% hist(main = "Histogram of Population", xlab = "Total number of people")

Final_df$Perc_pop_under_poverty %>% hist(main = "Histogram of % of population under poverty line", xlab = "% of people")

Final_df$Perc_pop_over_poverty %>% hist(main = "Histogram of % of population over poverty line", xlab = "% of people" )

Final_df$Num_of_people_under_poverty %>% hist(main = "Histogram of Population under poverty line", xlab = "Number of people under poverty")

Final_df$Num_of_people_over_poverty %>% hist(main = "Histogram of Population above Poverty line", xlab = "Number of people over poverty")

# Transforming to remove skewness

# Log transform

Total_Population <- log10(Final_df$Population)
hist(Total_Population)

# Power transform
Perc_population_over_poverty <- (Final_df$Perc_pop_over_poverty)^50
hist(Perc_population_over_poverty)