MATH2349 Data Wrangling

Required packages

# This is the R chunk for the required packages

#Import the necessary packages
library(readr) # Useful for importing data
library(foreign) # Useful for importing SPSS, SAS, STATA etc. data files
library(rvest) # Useful for scraping HTML data

## Loading required package: xml2

## 
## Attaching package: 'rvest'

## The following object is masked from 'package:readr':
## 
##     guess_encoding

library(knitr) # Useful for creating nice tables
library(gdata) # Useful for manipulating data

## gdata: Unable to locate valid perl interpreter
## gdata: 
## gdata: read.xls() will be unable to read Excel XLS and XLSX files
## gdata: unless the 'perl=' argument is used to specify the location of a
## gdata: valid perl intrpreter.
## gdata: 
## gdata: (To avoid display of this message in the future, please ensure
## gdata: perl is installed and available on the executable search path.)

## gdata: Unable to load perl libaries needed by read.xls()
## gdata: to support 'XLX' (Excel 97-2004) files.

##

## gdata: Unable to load perl libaries needed by read.xls()
## gdata: to support 'XLSX' (Excel 2007+) files.

##

## gdata: Run the function 'installXLSXsupport()'
## gdata: to automatically download and install the perl
## gdata: libaries needed to support Excel XLS and XLSX formats.

## 
## Attaching package: 'gdata'

## The following object is masked from 'package:stats':
## 
##     nobs

## The following object is masked from 'package:utils':
## 
##     object.size

## The following object is masked from 'package:base':
## 
##     startsWith

library(readxl) #Useful for getting data out of excel
library(dplyr) # Useful for data manipulation

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:gdata':
## 
##     combine, first, last

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(tidyr) #Useful for tidying data
library(outliers) #Useful for handling outliers
library(Hmisc) #Useful to perform many operations for data analysis

## Loading required package: lattice

## Loading required package: survival

## Loading required package: Formula

## Loading required package: ggplot2

## 
## Attaching package: 'Hmisc'

## The following objects are masked from 'package:dplyr':
## 
##     src, summarize

## The following object is masked from 'package:rvest':
## 
##     html

## The following objects are masked from 'package:base':
## 
##     format.pval, units

library(lubridate) #Useful to work with dates and times

## 
## Attaching package: 'lubridate'

## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union

library(stringr) #Useful to handle string operations
library(mlr) #Useful for feature ranking and selection

## Loading required package: ParamHelpers

## 'mlr' is in maintenance mode since July 2019. Future development
## efforts will go into its successor 'mlr3' (<https://mlr3.mlr-org.com>).

## 
## Attaching package: 'mlr'

## The following object is masked from 'package:Hmisc':
## 
##     impute

## The following object is masked from 'package:gdata':
## 
##     resample

library(caret) #Useful to apply feature extraction.

## 
## Attaching package: 'caret'

## The following object is masked from 'package:mlr':
## 
##     train

## The following object is masked from 'package:survival':
## 
##     cluster

library(forecast) #Useful for forecasting functions

## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo

Executive Summary

The data preprocessing is an important process whenever we are dealing with datasets.This makes data cleaner,more managable further and to draw proper conclusions.I have handled the datasets by using the five important steps Get,Understand,Tidy and Manipulate,Scan,Transform. As a start,relevant datasets and packages are imported.Here I have used two csv datasets WorldGDP_Data1 and Income_Data2 which are from open public sources where only relevant columns are selected.As a part of understanding of data,column names are then renamed and merged accordingly.Next,analysed the columns of the dataset along with their datatype and ordered properly as its easy to analyse,For the sake of further pre-processing datatype conversions,character to factor conversions are also carried out for easier labelling and ordering.For tidying the data considered the Tidy Data Principles and united two columns as “Place”.Also created mutate variable called “Real_GDP “ in the dataset.Then the data is checked for any missing values and inconsistencies by considering individual columns. As a part of Scan process,the outlier scan in the datasets involved by trying to inspect the numeric variables by using “Tukey’s method of outlier detection” with the boxplots. After confirming the GDP Per Capita variable that it has an approximately normal distribution, calculated z scores using outliers package along with the summary of output. I found the locations of the z-scores and location of outliers where considered the absolute value and considered it is greater than 3.By the thorough analysis of boxplots the variables which had extreme outliers where the capping method was applied using a pre-set method called capping or winsorising function. The final step Transform involves the transforming of the GDP Per Capita variable using the mean-centering method ,with proper scaling and by z-score standardisation transformations.Also visualised the transformed variable GDP Per Capita using histograms to make sure its symmetric one and analysed the skew.

Data

The datasets have been obtained from the public open source worldpopulationreview.com and the kaggle.First dataset is from GDP Ranked by Country 2020 with URL https://worldpopulationreview.com/countries/countries-by-gdp and the second dataset is obtained from World Bank Unemployment Data with URL https://www.kaggle.com/uddipta/world-bank-unemployment-data-19912017. The first dataset is having the information about the Gross Domestic Product(GDP) value of different countries in the world where these kind of datasets are useful for calculating the world GDP Value.It is the total of gross national income for every country in the world. These values helps to provide a snapshot of a country’s economy and can be calculated using expenditures, production, or incomes.The dataset is having 211 objects with 6 variables.

First Dataset variables description: * rank : Denotes Rank of country as per the GDP value by statistics, * country: Denotes name of the country, * imfGDP : Denotes GDP Monetary value of final goods and services calculated by International Monetary Fund which in US dollars, * unGDP : Denotes GDP value which is calculated by United Nations which in US dollars, * gdpPerCapita : Denotes “Per capita gross domestic product” which is determined by dividing GDP value of country by countrys population value, * pop : Denotes the population of country. The first dataset is having the datatypes character and numeric.

The second dataset contains the unemployment rate of different countries in the world in across different regions with several categoried income groups supplied by the worldbank Its having the classification of a countries in the world by regionwise with 232 objects of 31 variables.By proper analysis used relevant three variables * Country Name - Denotes the names of country * Region - Denotes the region in which the country belongs * IncomeGroup - Denotes the income group of the country to represent overall condition. In the second datatset the character type above variables are considered

These variables are only selected as our purpose is to determine the GDP in each particular country area along with the income group status of people in the country not by year wise.The orginial dataset is having values from 1991 to 2017 values where its not required.And also the required GDP values are present in the first dataset for the year 2019-2020.

Using the read_csv function reads both csv files and for making the data more meaningful the column names have renamed in the both datasets.Merged both the datasets where “Country” is the common variable in both datasets to merge.In Merged Dataset rearranged the pattern of columns in such a way to easily analyse.As the main purpose of the dataset to derive the information about the GDP value of each Country ranking in the decreased order is easier and more informative

# This is the R chunk for the Data Section

#Reading first dataset
WorldGDP_Data1 <- read_csv("C:/Users/Geena George/Desktop/Sem2/DataWrangling/Assignment2/csvData.csv")

## Parsed with column specification:
## cols(
##   rank = col_double(),
##   country = col_character(),
##   imfGDP = col_double(),
##   unGDP = col_double(),
##   gdpPerCapita = col_double(),
##   pop = col_double()
## )

#Displaying the few records of first data
head(WorldGDP_Data1)

#Reading the second dataset
Income_Data2 <- read_csv("C:/Users/Geena George/Desktop/Sem2/DataWrangling/Assignment2/full_data.csv")

## Parsed with column specification:
## cols(
##   .default = col_double(),
##   `Country Name` = col_character(),
##   Region = col_character(),
##   IncomeGroup = col_character(),
##   SpecialNotes = col_character()
## )

## See spec(...) for full column specifications.

#Displaying the few records of second data
head(Income_Data2)

#After analysing renaming first datas column names to make data more meaningfull
colnames(WorldGDP_Data1)[colnames(WorldGDP_Data1)=="rank"] <- "Rank"
colnames(WorldGDP_Data1)[colnames(WorldGDP_Data1)=="country"] <- "Country"
colnames(WorldGDP_Data1)[colnames(WorldGDP_Data1)=="imfGDP"] <- "GDP(IMF - in US$)"
colnames(WorldGDP_Data1)[colnames(WorldGDP_Data1)=="unGDP"] <- "GDP(UN - in US$)"
colnames(WorldGDP_Data1)[colnames(WorldGDP_Data1)=="gdpPerCapita"] <- "GDP Per Capita"
colnames(WorldGDP_Data1)[colnames(WorldGDP_Data1)=="pop"] <- "Population"

#After renaming the columns the dataset records 
head(WorldGDP_Data1)

#After analysing choosing the relevant columns from second dataset among 31 columns
Income_Data2 <- Income_Data2 %>% select(`Country Name`,Region, IncomeGroup)

##After analysing renaming to make data more meaningfull
colnames(Income_Data2)[colnames(Income_Data2)=="Country Name"] <- "Country"

#After renaming the columns the dataset records 
head(Income_Data2)

# Merge the two datasets
Merged_Data<- merge(x= WorldGDP_Data1, y= Income_Data2,by="Country")

#Displays the records of Merged data
head(Merged_Data)

#Rearranging the columns order which easier to analyse the variables
Merged_Data <- Merged_Data[,c(2,1,7,8,6,3,4,5)]

#Sorting the data based on rank of countries with respect to GDP
Merged_Data<-Merged_Data[order(as.integer(Merged_Data$Rank),decreasing = FALSE),]

The merged data is perfectly formatted as the columns are rearranged and based on the order of columns we can analyse the data by correlating the varibles.Moreover the rank of each countries with respect to the GDP value is sorted accordingly which is best for reference.

Understand

In the Understand task the data structures are checked by using the str() function and dim() is used to find the number of attributes.By using the class() function we can check the class of an object.Numeric objects can be of integar or double and we can identify this by using typeof() function.By analysis the best Rank type is suggested is Integer and the Rank variable is converted using is.integer() function.

# This is the R chunk for the Understand Section

#Checking the data structures
str(Merged_Data)

## 'data.frame':    157 obs. of  8 variables:
##  $ Rank             : num  1 2 3 4 5 6 7 8 9 10 ...
##  $ Country          : chr  "United States" "China" "Japan" "Germany" ...
##  $ Region           : chr  "North America" "East Asia & Pacific" "East Asia & Pacific" "Europe & Central Asia" ...
##  $ IncomeGroup      : chr  "High income" "Upper middle income" "High income" "High income" ...
##  $ Population       : num  331003 1439324 126476 83784 1380004 ...
##  $ GDP(IMF - in US$): num  2.22e+13 1.55e+13 5.50e+12 4.16e+12 3.26e+12 ...
##  $ GDP(UN - in US$) : num  1.86e+13 1.12e+13 4.94e+12 3.48e+12 2.26e+12 ...
##  $ GDP Per Capita   : num  67063 10747 43450 49617 2361 ...

#Finding the number of attributes
dim(Merged_Data)

## [1] 157   8

#Checking the type of variables
class(Merged_Data$Rank)

## [1] "numeric"

class(Merged_Data$Country)

## [1] "character"

class(Merged_Data$Region)

## [1] "character"

class(Merged_Data$IncomeGroup)

## [1] "character"

class(Merged_Data$Population)

## [1] "numeric"

class(Merged_Data$`GDP(IMF - in US$)`)

## [1] "numeric"

class(Merged_Data$`GDP(UN - in US$)`)

## [1] "numeric"

class(Merged_Data$`GDP Per Capita`)

## [1] "numeric"

#To check whether a numeric object is integer or double
typeof(Merged_Data$Rank)

## [1] "double"

typeof(Merged_Data$Population)

## [1] "double"

typeof(Merged_Data$`GDP(IMF - in US$)`)

## [1] "double"

typeof(Merged_Data$`GDP(UN - in US$)`)

## [1] "double"

typeof(Merged_Data$`GDP Per Capita`)

## [1] "double"

#The correct data type for Rank is Integar instead of numeric.Converting to integer datatype
Merged_Data$Rank <- as.integer(Merged_Data$Rank)

#Checking the datatype of Rank if its integer
is.integer(Merged_Data$Rank)

## [1] TRUE

#After the conversions checking the data structres.
str(Merged_Data)

## 'data.frame':    157 obs. of  8 variables:
##  $ Rank             : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Country          : chr  "United States" "China" "Japan" "Germany" ...
##  $ Region           : chr  "North America" "East Asia & Pacific" "East Asia & Pacific" "Europe & Central Asia" ...
##  $ IncomeGroup      : chr  "High income" "Upper middle income" "High income" "High income" ...
##  $ Population       : num  331003 1439324 126476 83784 1380004 ...
##  $ GDP(IMF - in US$): num  2.22e+13 1.55e+13 5.50e+12 4.16e+12 3.26e+12 ...
##  $ GDP(UN - in US$) : num  1.86e+13 1.12e+13 4.94e+12 3.48e+12 2.26e+12 ...
##  $ GDP Per Capita   : num  67063 10747 43450 49617 2361 ...

#The data contains the Integer,Character,Numeric datatype

From above structure we can infer that out data contains the Integer,Character and numeric data types.As the Income group is having four income levels the best method is to convert this character datatype to an ordinal factor by proper labelling and ordering

# Converting the charactor datatype of "IncomeGroup" to ordinal factor.
Merged_Data$IncomeGroup <- factor(Merged_Data$IncomeGroup,levels = c("High income","Upper middle income","Lower middle income","Low income"), labels = c("High Income","Upper Middle Income","Lower Middle Income","Low Income") , ordered=TRUE)

str(Merged_Data)

## 'data.frame':    157 obs. of  8 variables:
##  $ Rank             : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Country          : chr  "United States" "China" "Japan" "Germany" ...
##  $ Region           : chr  "North America" "East Asia & Pacific" "East Asia & Pacific" "Europe & Central Asia" ...
##  $ IncomeGroup      : Ord.factor w/ 4 levels "High Income"<..: 1 2 1 1 3 1 1 1 2 1 ...
##  $ Population       : num  331003 1439324 126476 83784 1380004 ...
##  $ GDP(IMF - in US$): num  2.22e+13 1.55e+13 5.50e+12 4.16e+12 3.26e+12 ...
##  $ GDP(UN - in US$) : num  1.86e+13 1.12e+13 4.94e+12 3.48e+12 2.26e+12 ...
##  $ GDP Per Capita   : num  67063 10747 43450 49617 2361 ...

#The dataset is having one factor variable.

Tidy & Manipulate Data I

“Tidy Data Principles” provide a standard and consistent way of storing data that makes transformation, visualization, and modeling easier.As per this we are having three tidy data rules, * Each variable must have its own column * Each observation must have its own row. * Each value must have its own cell

Analysing both the datasets as per this principle and the resources the second dataset is Untidy. In the source dataset of WorldBank mainly GDP values are recorded for Country in specific region.As per the above principles,its better to combine the multiple columns into a single column to form a new variable which is highly correlated as per statistics.So I have used unite() function used to combine multiple columns Country and Region into a single column.As Country is the common key for the two datasets used for merging step tidying step carried out in data after the merging.By following above rules,the first dataset is in tidy format.And in the merged dataset the default rownumber generated by R is removed,as a process of tidying the merged data.

# This is the R chunk for the Tidy & Manipulate Data I
#Removing the default rownumber generated by R as a part of Tidying Data
row.names(Merged_Data) <- NULL

#As per the tidy principle tidying the second dataset Income_Data2 after merging, inorder to keep the variables in common to merge them(Country)
Merged_Data <- Merged_Data %>% unite(CountryRegion,Country,Region)

head(Merged_Data)

Tidy & Manipulate Data II

As per the statistics and definition of GDP the real GDP growth in US dollars is found by the calculating GDP Monetary value of final goods and services with GDP value calculated by United Nations which in US dollars.This is a measure of Nominal GDP which can be used forstatistical analysis and for checking price inflations and inflations.So mutated variable called “Real_GrowthGDP” calculated by taking difference of existing variables GDP value of IMF and GDP value of UN in the US dollars which denotes the real economic growth of the each region.

# This is the R chunk for the Tidy & Manipulate Data II 
#Mutating the Real Growth GDP Variable
Merged_Data<-mutate(Merged_Data,Real_GrowthGDP=(Merged_Data$`GDP(IMF - in US$)` - Merged_Data$`GDP(UN - in US$)`))

Scan I

By using the colSums() we can compute the total missing values in each column in the Merged Data.Where this function will displays the total number of NA values in each column.

# This is the R chunk for the Scan I
#Scan for the missing value in the entire dataset
colSums(is.na(Merged_Data))

##              Rank     CountryRegion       IncomeGroup        Population 
##                 0                 0                 0                 0 
## GDP(IMF - in US$)  GDP(UN - in US$)    GDP Per Capita    Real_GrowthGDP 
##                 0                 0                 0                 0

There is no missing values in the data and need to scan all variables to check whether there is any special values and obvious errors.The below function is used to check whether there is any errors and considered for each column.

#Scan in each rows for any errors
sum(is.nan(Merged_Data$Rank))

## [1] 0

sum(is.nan(Merged_Data$CountryRegion))

## [1] 0

sum(is.nan(Merged_Data$IncomeGroup))

## [1] 0

sum(is.nan(Merged_Data$Population))

## [1] 0

sum(is.nan(Merged_Data$`GDP(IMF - in US$)`))

## [1] 0

sum(is.nan(Merged_Data$`GDP(UN - in US$)`))

## [1] 0

sum(is.nan(Merged_Data$`GDP Per Capita`))

## [1] 0

sum(is.nan(Merged_Data$Real_GrowthGDP))

## [1] 0

we have zero Missing values in each variable.Hence we can conclude that there is no errors in datase.There by no replacings,imputations are needed to proceed with this dataset.

Scan II

An outlier is defined as an observation which stands far away from the most of other observations. It can be a result of an measurement error and the inclusion of that error would have a great impact on the analysis results.While analysing the structure the numeric columns are Population,GDP IMF, GDP UN,GDP Per Capita and Real_GrowthGDP.Outliers can drastically change the results of the data analysis and statistical modelling. Some of the unfavourable impacts of outliers are; * They increase the error variance, * They reduce the power of statistical tests, * They can bias or influence the estimates of model parameters that may be of substantive interest.

One of the simplest methods for detecting univariate outliers is the use of box plots where its a graphical display for describing the distribution of the data using the median, the first (Q1) and third quartiles (Q3) and the inter-quartile range (IQR=Q3−Q1).In the box plot, the “Tukey’s method of outlier detection” is used to detect outliers. According to this method, outliers are defined as the values in the data set that fall beyond the range of −1.5×IQR to 1.5×IQR.

# This is the R chunk for the Scan II

# Outliers detection for all the numeric variables
GDPIMF <-boxplot(Merged_Data$`GDP(IMF - in US$)`, main = "GDP(IMF values in US$)",horizontal = TRUE)

GDPUN <-boxplot(Merged_Data$`GDP(UN - in US$)`, main = "GDP(UN values in US$)",horizontal = TRUE)

GDP_Capita <-boxplot(Merged_Data$`GDP Per Capita`, main = "GDP per Capita",horizontal = TRUE)

Real_GDP <-boxplot(Merged_Data$Real_GrowthGDP, main = "Real GDP",horizontal = TRUE)

pop <- boxplot(Merged_Data$Population, main = "Population",horizontal = TRUE)

As the data is mainly focussing on GDP Per capita of each country-region where its the average amount of goods and services produced per person.From the box plots the GDP Per Capita box plot is having approximately normal distribution comparing with the other boxplots even its skewed we can calculate the z scores.

z_scores <- Merged_Data$`GDP Per Capita` %>%  scores(type = "z")
z_scores %>% summary()

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -0.7202 -0.6286 -0.4399  0.0000  0.1589  4.7193

Using the summary() output the minimum score is -0.7202 and maximum is 4.7193.According to these values we are finding the total and number of outliers present.

# Total number of outliers as per the z-score
length (which( abs(z_scores) >3 ))

## [1] 4

#Four outliers are present here which is having greater than value 3.
## Finding the locations of these outliers in the GDP Per Capita variable
which( abs(z_scores) >3 )

## [1] 18 27 28 64

Capping or winsorising involves replacing the existing outliers with nearest neighbours which are not outliers which defining a function called cap.Also applied this function to the data using sapply function.

## Define a function which is used to cap the values outside the limits
cap <- function(x){
  quantiles <- quantile( x, c(.05, 0.25, 0.75, .95))
  x [ x < quantiles[2] - 1.5 * IQR(x) ] <- quantiles[1]
  x [ x > quantiles[3] + 1.5* IQR(x) ] <- quantiles[3]+1.5*IQR(x)
  x
}
#Using sapply function applying cap() function to the dataset
GDP_per_Capita<- sapply(Merged_Data %>% select(`GDP Per Capita`),FUN = cap)

#Plotting the GDP Per Capita without outlier
GDP <- cbind (Merged_Data %>% select(`GDP Per Capita`),GDP_per_Capita)
boxplot(GDP$`GDP Per Capita`,main ="The Box Plot of GDP Per Capita")

#checking the summary statistics
summary(GDP_per_Capita)

##  GDP Per Capita   
##  Min.   :  312.3  
##  1st Qu.: 2288.5  
##  Median : 6361.8  
##  Mean   :13448.2  
##  3rd Qu.:19287.2  
##  Max.   :44785.3

Here the outliers are replaced with the nearest neighbours using the cap method.And from the above and first plot we can compare the presence of outliers.

Transform

For statistical analysis the Data transformation step is very important.Here I am considering variable “GDP Per Capita” having the most importance metrc that breaks down a countrys economic output in our dataset.In statistical inferences carried out from the previous plotted boxplot of GDP Per Capita,its having a right skew where the symmetric (normal) distribution is preferred over skewed distribution.As we have a skewed distribution or heterogeneous of variances the purpose of applying transformations to this variable is to reduce the skewness or heterogeneity of variances,thereby we can convert the distribution into a normal distribution.

# This is the R chunk for the Transform Section

#For better analysis we can plot the histogram of GDP Per Capita and compare with and without transformation.
#GDP Per Capita using histogram - Before transformation
hist(Merged_Data$`GDP Per Capita`, main = "GDP per Capita(Before transformation)",xlab="GDP per Capita",col="grey",breaks=10)

The above histogram is right skewed distribution and mathematical operations have to applied to the data to achieve normality or variance homogeneity.The most useful transformation in introductory data analysis is the logarithm (base 10 and base e) reciprocal where it compresses high values and spreads low values by expressing the values as orders of magnitude.This transformation is commonly used for reducing right skewness.As it cannot be applied to zero or negative values directly we can apply the log transformation to a zero or negative value and we can add a non-negative constant to all observations and then take the logarithm.

#Transforming GDP Per Capita using histogram - After transformation
hist(log(Merged_Data$`GDP Per Capita`), main = "GDP per Capita(After transformation)",xlab="GDP per Capita",col="yellow",breaks=10)

After transformation the symmetry of GDP Per Capita improved and achieved normal distribution.

Conclusion

These all are the major tasks in data preprocessing framework and adapting these tasks we can minimise the garbage that gets into our analysis.The data used here had undergone these process sourced from multiple sources which made suitable for statistical analysis and used to draw valid interpretations.

##References Dr.Anil Dolgun 22 June,2020 “Module Notes” In Data Wrangling,RMIT University