Provide the packages required to reproduce the report. Make sure you fulfilled the minimum requirement #10.
# This is the R chunk for the required packages
library(readr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr)
library(Hmisc)
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
## Loading required package: ggplot2
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:dplyr':
##
## src, summarize
## The following objects are masked from 'package:base':
##
## format.pval, units
In this assignment, 3 datasets were merged with the help of common attributes i.e. Country Name, Country Code and Year to form a complete dataframe. Various data preprocessing steps were performed like subsetting and renaming columns, omitting rows with too many missing values etc. Two of the dataframes were untidy as the year columns were attributes instead of observations. The data was brought to correct format with the help of tidy principles suh as gather().
Checked the structure of all datasets and performed relevant conversion of data types, wherever required. Nominal attributes such as ‘Income Group’ and ‘Years’ which take limited values were made factors with ordered is equal to true.
Scanned all datasets for missing and special values. Missing values for the nominal attribute was removed since they were mostly part of regions such as South Asia, East Asia, Small states etc instead of countries. The missing values for numerical attribute were replaced by the mean of the respective countries.
Numerical attributes were scanned for outliers with the help of boxplot. The outliers were dealt using capping since they were extreme values for some of the countries which was correct.
Lastly, data transformation techniques were performed to reduce skewness and bring the data close to normal distribution so that further statistical apporoaches could be carried out.
The data is taken from “The World Bank” website. The three datasets used in our analysis are-
World Population dataset- The dataset contains population data of 264 countries from 1960 to 2019. It has 65 attributes and 264 observations. The attributes are :-
Source of data- https://data.worldbank.org/indicator/SP.POP.TOTL?locations=1W
Download link- http://api.worldbank.org/v2/en/indicator/SP.POP.TOTL?downloadformat=csv
Poverty dataset- The dataset contains poverty headcount ratio at $1.90 day(% of population) from 1960 to 2018. It has 65 attributes and 264 observations. The attributes are:-
Source of data- https://data.worldbank.org/indicator/SI.POV.DDAY
Download Link- http://api.worldbank.org/v2/en/indicator/SI.POV.MDIM.17.XQ?downloadformat=csv
Metadata- The dataset contains income groups of 263 countries. It has 5 attributes and 263 observations. The attributes are:-
The dataset in avaiable under all the links provided above.
The main objective is to analyze the percentage of population living under/over poverty over time from 2001 to 2018. There is no data for the year 2019 and 2020 in our dataset.
# This is the R chunk for the Data Section
# Importing Population Data
Population <- read_csv("E:/data wrangling/TotalPopulation/API_SP.POP.TOTL_DS2_en_csv_v2_1495124.csv",skip=3)
## Warning: Missing column names filled in: 'X66' [66]
## Parsed with column specification:
## cols(
## .default = col_double(),
## `Country Name` = col_character(),
## `Country Code` = col_character(),
## `Indicator Name` = col_character(),
## `Indicator Code` = col_character(),
## `2020` = col_logical(),
## X66 = col_logical()
## )
## See spec(...) for full column specifications.
head(Population)
# Importing Poverty data
Poverty <- read_csv("E:/data wrangling/ExtremePoverty/API_SI.POV.DDAY_DS2_en_csv_v2_1496490.csv",skip = 3)
## Warning: Missing column names filled in: 'X66' [66]
## Parsed with column specification:
## cols(
## .default = col_double(),
## `Country Name` = col_character(),
## `Country Code` = col_character(),
## `Indicator Name` = col_character(),
## `Indicator Code` = col_character(),
## `1960` = col_logical(),
## `1961` = col_logical(),
## `1962` = col_logical(),
## `1963` = col_logical(),
## `1964` = col_logical(),
## `1965` = col_logical(),
## `1966` = col_logical(),
## `1968` = col_logical(),
## `1970` = col_logical(),
## `1972` = col_logical(),
## `1973` = col_logical(),
## `1976` = col_logical(),
## `2019` = col_logical(),
## `2020` = col_logical(),
## X66 = col_logical()
## )
## See spec(...) for full column specifications.
head(Poverty)
# Importing Metadata
Metadata <- read_csv("E:/data wrangling/ExtremePoverty/Metadata_Country_API_SI.POV.DDAY_DS2_en_csv_v2_1496490.csv")
## Warning: Missing column names filled in: 'X6' [6]
## Parsed with column specification:
## cols(
## `Country Code` = col_character(),
## Region = col_character(),
## IncomeGroup = col_character(),
## SpecialNotes = col_character(),
## TableName = col_character(),
## X6 = col_logical()
## )
head(Metadata)
Filtered out useful columns from Population, Poverty and Metadata in new dataframes. These dataframes and their attributes are:-
# This is the R chunk for the Subset required columns Section
# Sub-setting required columns from Population, Poverty and Metadata
Population_filtered <- Population[, c(1,2, 46:63)]
head(Population_filtered)
Poverty_filtered <- Poverty[, c(1,2, 46:63)]
head(Poverty_filtered)
Metadata_filtered <- Metadata[,c(1,3,5)]
head(Metadata_filtered)
Observations with more than 9 missing values from year 2001 to 2018 were removed from all the datasets in order to avoid discrepancy.
Column “TableName” of Metadata was renamed to “Country Name” for better understanding.
In tidy data:
Population and Poverty datasets were untidy as the years (2001-2018) were column headers whereas they should be observations (values) with column header as “Year”. To transform the data from wide to long format, gather() function of tidyr package was used.
# This is the R chunk for the Tidy & Manipulate Data I
# Function for deleting NA values
delete_NA <- function(DF, n=0) {
DF[rowSums(is.na(DF)) <= n,]
}
# Deleting rows with more than 9 NA values from all the datasets
Population_filtered <- delete_NA(Population_filtered,9)
Poverty_filtered <- delete_NA(Poverty_filtered,9)
Metadata_filtered <- delete_NA(Metadata_filtered,9)
# Renaming column name
colnames(Metadata_filtered)[3] <- c("Country Name")
# Checking dimension of the dataset
dim(Population_filtered)
## [1] 263 20
dim(Poverty_filtered)
## [1] 74 20
dim(Metadata_filtered)
## [1] 263 3
# Gather function to get year in one column
Population_filtered <- gather(Population_filtered, key="Year", value= "Population", c(3:20))
head(Population_filtered)
Poverty_filtered <- gather(Poverty_filtered, key="Year", value= "Perc_pop_under_poverty", c(3:20))
head(Poverty_filtered)
Scannned datasets for missing and special values before joining them. Checked for missing values in Population and Poverty datasets and replaced it with the mean of the respective country. This has been done in order to preserve useful information and removing it would have lead to loss of quality data. The colSums() function was used to get the number of missing values in each column.
For Metadata dataset, the missing values were dropped as the values didn’t belong to any countries but regions which was not of any use for our analysis.
Merged all the datasets with the help of common attributes. The inner_join() function was used to retain only rows which are present in both Population and Poverty datasets. The right_join() function was used between Metadata and PopulationXPoverty in order retain all the rows present in PopulationXPoverty and matching observations in both the dataset.
Rows with missing values under “Income Group” were dropped as it can’t be predicted and replacing it with the mode of the variable is not a correct approach and may lead to bias.
Special charaters were checked in the final dataframe. The dataframe didn’t contain any special character.
# This is the R chunk for the Scan I
# Checking and dealing with missing values
colSums(is.na(Population_filtered))
## Country Name Country Code Year Population
## 0 0 0 7
Population_filtered <- Population_filtered %>% group_by(`Country Name`) %>%
mutate(Population=ifelse(is.na(Population),mean(Population,na.rm=TRUE),Population))
colSums(is.na(Poverty_filtered))
## Country Name Country Code Year
## 0 0 0
## Perc_pop_under_poverty
## 194
Poverty_filtered <- Poverty_filtered %>% group_by(`Country Name`) %>%
mutate(`Perc_pop_under_poverty`=ifelse(is.na(`Perc_pop_under_poverty`),mean(`Perc_pop_under_poverty`,na.rm=TRUE),`Perc_pop_under_poverty`))
Metadata_filtered %>% is.na() %>% table()
## .
## FALSE TRUE
## 743 46
Metadata_filtered <- drop_na(Metadata_filtered)
# Merging all the datasets
PopulationXPoverty <- inner_join(Population_filtered,Poverty_filtered, by= c("Country Name","Country Code","Year"))
Final_df <- right_join(Metadata_filtered,PopulationXPoverty, by = c("Country Name","Country Code"))
head(Final_df)
colSums(is.na(Final_df))
## Country Code IncomeGroup Country Name
## 0 252 0
## Year Population Perc_pop_under_poverty
## 0 0 0
# Omitting rows with na values in Income group variable
Final_df <- Final_df[!is.na(Final_df$IncomeGroup), ]
# Checking for missing or special character
is.specialorNA <- function(x){
if (is.numeric(x)) (is.infinite(x) | is.nan(x) | is.na(x))
}
Final_df_special <- sapply(Final_df, is.specialorNA)
Final_df_special <- as.data.frame(Final_df_special[c(5,6)])
colSums(Final_df_special)
## Population Perc_pop_under_poverty
## 0 0
Data structure of all the variables were checked using str() function. “IncomeGroup” and “Year” attributes were converted from chr to factor as they contained a set of limited values. The final dataset contains multiple data types such as numerics, characters, factors.
# This is the R chunk for the Understand Section
str(Final_df)
## Classes 'tbl_df', 'tbl' and 'data.frame': 1080 obs. of 6 variables:
## $ Country Code : chr "ARG" "ARM" "AUT" "BEL" ...
## $ IncomeGroup : chr "Upper middle income" "Upper middle income" "High income" "High income" ...
## $ Country Name : chr "Argentina" "Armenia" "Austria" "Belgium" ...
## $ Year : chr "2001" "2001" "2001" "2001" ...
## $ Population : num 37275652 3050687 8042293 10286570 8009142 ...
## $ Perc_pop_under_poverty: num 10.7 14.4 0.4 0.147 2.217 ...
Final_df$IncomeGroup <- factor(Final_df$IncomeGroup, levels = c("Low income", "Lower middle income", "Upper middle income", "High income"), ordered = TRUE)
Final_df$Year <- factor(Final_df$Year, levels = c(2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018), ordered = TRUE)
str(Final_df)
## Classes 'tbl_df', 'tbl' and 'data.frame': 1080 obs. of 6 variables:
## $ Country Code : chr "ARG" "ARM" "AUT" "BEL" ...
## $ IncomeGroup : Ord.factor w/ 4 levels "Low income"<"Lower middle income"<..: 3 3 4 4 3 3 2 3 4 3 ...
## $ Country Name : chr "Argentina" "Armenia" "Austria" "Belgium" ...
## $ Year : Ord.factor w/ 18 levels "2001"<"2002"<..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Population : num 37275652 3050687 8042293 10286570 8009142 ...
## $ Perc_pop_under_poverty: num 10.7 14.4 0.4 0.147 2.217 ...
The final dataframe (Final_df) contains variable “Population” and “Perc_pop_under_poverty”. These two variables were used to mutate 3 more variables. These variables are:-
Num_of_people_under_poverty- This variable gives us information about the number of people living below poverty in a particular country. The formula for calculating it is (Perc_pop_under_poverty * Population)/100.
Perc_pop_over_poverty- This variable gives us information about the percentage of population living above poverty in a particular country. The formula for calculating it is 100 - Perc_pop_under_poverty.
Num_of_people_over_poverty- This variable gives us information about the number of people living above poverty in a particular country. The formula for calculating it is Population - Num_of_people_under_poverty.
The trunc() function is used to get whole numbers.
# This is the R chunk for the Tidy & Manipulate Data II
# Mutating new variables
Final_df <- mutate(Final_df,
Num_of_people_under_poverty = trunc((Final_df$`Perc_pop_under_poverty`*Final_df$Population)/100))
Final_df <- mutate(Final_df,
Perc_pop_over_poverty = (100-Final_df$Perc_pop_under_poverty)
)
Final_df <- mutate(Final_df,
Num_of_people_over_poverty = trunc((Final_df$Population - Final_df$Num_of_people_under_poverty ))
)
head(Final_df)
Boxplot was used to check for univariate and multivariate outliers in all the numerical attributes of the dataframe. The outliers were handled using capping technique as they were extreme values and not outliers. Countries like India, China have a very high population compared to other nations and removing there values would result in information loss. Capping or winsorising involves replacing the outliers with the nearest neighbours that are not outliers. Hence, it can be used as a statistical approach to deal with extreme values. Outliers that lie outside the outlier fences on a box-plot, capping can be used to replace those observations outside the lower limit with the value of 5th percentile and those that lie above the upper limit, with the value of 95th percentile.
# This is the R chunk for the Scan II
# Detecting univariate and multivariate outliers
Final_df$Population %>% boxplot(main = "Population", ylab = "Number of people",horizontal = TRUE)
boxplot(Final_df$Population ~ Final_df$IncomeGroup, main="Country Population by Income Group", ylab = "Income Group", xlab = "Population",horizontal = TRUE)
boxplot(Final_df$Perc_pop_under_poverty ~ Final_df$IncomeGroup, main="Perc of people under poverty by Income Group", ylab = "Income Group", xlab = "% of people under poverty",horizontal = TRUE)
boxplot(Final_df$Num_of_people_under_poverty ~ Final_df$IncomeGroup, main="Number of people under poverty by Income Group", ylab = "Income Group", xlab = "Num of people under poverty",horizontal = TRUE)
boxplot(Final_df$Perc_pop_over_poverty ~ Final_df$IncomeGroup, main="Perc of people over poverty by Income Group", ylab = "Income Group", xlab = "% of people over poverty",horizontal = TRUE)
boxplot(Final_df$Num_of_people_over_poverty ~ Final_df$IncomeGroup, main="Number of people over poverty by Income Group", ylab = "Income Group", xlab = "Num of people over poverty",horizontal = TRUE)
# Function to cap the values outside the limits
cap <- function(x){
quantiles <- quantile( x, c(.05, 0.25, 0.75, .95 ) )
x[ x < quantiles[2] - 1.5*IQR(x) ] <- quantiles[1]
x[ x > quantiles[3] + 1.5*IQR(x) ] <- quantiles[4]
x
}
# Capping
Final_df$Population <- Final_df$Population %>% cap()
Final_df$Num_of_people_under_poverty <- Final_df$Num_of_people_under_poverty %>% cap()
Final_df$Num_of_people_over_poverty <- Final_df$Num_of_people_over_poverty %>% cap()
Checked for skewness of all the numerical variables using histogram. All the variables except “Perc_pop_over_poverty” were highly right skewed. “Perc_pop_over_poverty” was highly left skewed.
Tried transforming these variables to normal distribution by removing skewness so that statistical tests based on normality can be applied to it, if required.
After applying log transform on “Population”, the distribution of the variable has come close to normal or gaussian distribution.
Applying power transform on “Perc_pop_over_poverty” has no significant effect on the distribution of the variable.
# This is the R chunk for the Transform Section
# Histograms for all the numerical attributes
Final_df$Population %>% hist(main = "Histogram of Population", xlab = "Total number of people")
Final_df$Perc_pop_under_poverty %>% hist(main = "Histogram of % of population under poverty line", xlab = "% of people")
Final_df$Perc_pop_over_poverty %>% hist(main = "Histogram of % of population over poverty line", xlab = "% of people" )
Final_df$Num_of_people_under_poverty %>% hist(main = "Histogram of Population under poverty line", xlab = "Number of people under poverty")
Final_df$Num_of_people_over_poverty %>% hist(main = "Histogram of Population above Poverty line", xlab = "Number of people over poverty")
# Transforming to remove skewness
# Log transform
Total_Population <- log10(Final_df$Population)
hist(Total_Population)
# Power transform
Perc_population_over_poverty <- (Final_df$Perc_pop_over_poverty)^50
hist(Perc_population_over_poverty)