MATH2349 Data Wrangling

Required packages

In this step I load the packages required to produce the report and set the Working directory.

#Loading packages:

library(readr)
library(foreign) 
library(gdata)

## gdata: Unable to locate valid perl interpreter
## gdata: 
## gdata: read.xls() will be unable to read Excel XLS and XLSX files
## gdata: unless the 'perl=' argument is used to specify the location of a
## gdata: valid perl intrpreter.
## gdata: 
## gdata: (To avoid display of this message in the future, please ensure
## gdata: perl is installed and available on the executable search path.)

## gdata: Unable to load perl libaries needed by read.xls()
## gdata: to support 'XLX' (Excel 97-2004) files.

##

## gdata: Unable to load perl libaries needed by read.xls()
## gdata: to support 'XLSX' (Excel 2007+) files.

##

## gdata: Run the function 'installXLSXsupport()'
## gdata: to automatically download and install the perl
## gdata: libaries needed to support Excel XLS and XLSX formats.

## 
## Attaching package: 'gdata'

## The following object is masked from 'package:stats':
## 
##     nobs

## The following object is masked from 'package:utils':
## 
##     object.size

## The following object is masked from 'package:base':
## 
##     startsWith

library(rvest)

## Loading required package: xml2

## 
## Attaching package: 'rvest'

## The following object is masked from 'package:readr':
## 
##     guess_encoding

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:gdata':
## 
##     combine, first, last

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(tidyr)
library(deductive)
library(deducorrect)

## Loading required package: editrules

## Loading required package: igraph

## 
## Attaching package: 'igraph'

## The following object is masked from 'package:tidyr':
## 
##     crossing

## The following objects are masked from 'package:dplyr':
## 
##     as_data_frame, groups, union

## The following objects are masked from 'package:stats':
## 
##     decompose, spectrum

## The following object is masked from 'package:base':
## 
##     union

## 
## Attaching package: 'editrules'

## The following objects are masked from 'package:igraph':
## 
##     blocks, normalize

## The following objects are masked from 'package:tidyr':
## 
##     contains, separate

## The following object is masked from 'package:dplyr':
## 
##     contains

library(editrules) 
library(validate)

## 
## Attaching package: 'validate'

## The following object is masked from 'package:igraph':
## 
##     compare

## The following object is masked from 'package:dplyr':
## 
##     expr

library(Hmisc)

## Loading required package: lattice

## Loading required package: survival

## Loading required package: Formula

## Loading required package: ggplot2

## 
## Attaching package: 'ggplot2'

## The following object is masked from 'package:validate':
## 
##     expr

## 
## Attaching package: 'Hmisc'

## The following objects are masked from 'package:validate':
## 
##     label, label<-

## The following objects are masked from 'package:dplyr':
## 
##     src, summarize

## The following object is masked from 'package:rvest':
## 
##     html

## The following objects are masked from 'package:base':
## 
##     format.pval, units

library(forecast)

## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo

library(stringr) 
library(lubridate)

## 
## Attaching package: 'lubridate'

## The following objects are masked from 'package:igraph':
## 
##     %--%, union

## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union

library(car)

## Loading required package: carData

## 
## Attaching package: 'car'

## The following object is masked from 'package:dplyr':
## 
##     recode

library(outliers)
library(MVN)

## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2

## sROC 0.1-2 loaded

library(infotheo)
library(MASS)

## 
## Attaching package: 'MASS'

## The following object is masked from 'package:dplyr':
## 
##     select

library(caret)

## 
## Attaching package: 'caret'

## The following object is masked from 'package:survival':
## 
##     cluster

library(ggplot2)
library(knitr)

#Setting the working directory:

setwd("C:/Users/jaspe/OneDrive/2020/Analytics RMIT/Data Wrangling/Assignment 2")

Executive Summary

In this report several pre-processing steps have been applied to some publicly available data from the World Bank. These steps transform the data from a state that is incomplete, noisy, inconsistent, does not have all the correct labels and codes to data that is clean, tidy, complete and ready for further analysis. These pre-processing steps include reading the Data into R Studio, gathering one of the data sets to tidy it, filtering the data to focus on specific types of observations, merging two data sets with a join, sub setting the data to exclude irrelevant variables, inspecting the structure of the data and applying factor type conversions to create an ordinal factor that correctly reflects the nature of the data, using the mutate and arrange functions to create two new variables that explain interesting characteristics of the data, scanning the data to identify missing and special values and impute the median to replace missing values which ensures the data is complete, scanning the distribution of numeric variables by creating boxplots, histograms and scatter plots to determine if there are any outliers. Finally, a log transformation is applied to one of the numeric variables in order to change the scale for better understanding of the variable and to decrease the skewness and convert the distribution into a normal distribution.

Data

This report utilises 2 data sets. The first data set; SA_Consum, describes South African Annual Household final Consumption in 2010 by Sector and Consumption Segment in Local Currency, PPP and USD (Million). The variables in this data set are:

Area: this refers to the geographic location of the households within South Africa and whether they were taken from rural or urban areas, or both (national) areas.
Consumption segment: Four levels of consumption are used to segment the market in each country: lowest, low, middle, and higher. They are based on global income distribution data, which rank the global population by income per capita. The lowest consumption segment corresponds to the bottom half of the global distribution, or the 50th percentile and below; the low consumption segment to the 51th–75th percentiles; the middle consumption segment to the 76th–90th percentiles; and the higher consumption segment to the 91st percentile and above (The World Bank, 2020).
Country Hide: this simply refers to the country which the data refers to.
Indicator: this variable is present as the data set clearly was part of a larger data set that pertained to multiple indicators.As this subset of the data only concerns one indicator which is Annual household expenditure by sector all of the observations in the columns indicate this. Household final consumption expenditure is the market value of all goods and services, including durable products (such as cars, washing machines, and home computers), purchased by households. It excludes purchases of dwellings but includes imputed rent for owner-occupied dwellings. It also includes payments and fees to governments to obtain permits and licenses. Here, household consumption expenditure includes the expenditures of nonprofit institutions serving households, even when reported separately by the country (The World Bank, 2020).
Sectors: this refers to different industries with the economy.
PPP: Purchasing power parity (PPP) is a measurement of prices in different countries that uses the prices of specific goods to compare the absolute purchasing power of the countries’ currencies. in this case this variable indicates Value of the Annual household expenditure by sector denoted in PPP.
Local Currency: Value of the Annual household expenditure by sector denoted in the South African Rand.
USD: Value of the Annual household expenditure by sector denoted in United States Dollars.

The Second data set; SA_Seg_Pop describes South Africa’s Total Population in 2010 by Area and Consumption Segment. The data set includes some common variables with the SA_Consum these are:

Country_Hide
Area
Indicator

As well as some distinct variables which are:

Sex: Gender of survey participants.
Unit: refers to method of measurement present in the data.
Country: duplicate of Country_Hide.
All: total population regardless of consumption segment .
Higher:Population of higher consumption segment as per prior description of consumption segments.
Low: Population of Low consumption segment as per prior description of consumption segments.
Lowest: Population of Lowest consumption segment as per prior description of consumption segments.
Middle: Population of Middle consumption segment as per prior description of consumption segments.

In this data set population is described in millions and is based on the de facto definition of population, which counts all residents regardless of legal status or citizenship–except for refugees not permanently settled in the country of asylum, who are generally considered part of the population of their country of origin. The values shown are midyear estimates(The World Bank, 2020).

These Data sets were sorced from The World Bank, 2020. Global Consumption Database. [Online] Available at: http://datatopics.worldbank.org/consumption/ [Accessed 29 September 2020].

Due to the nature of these data sets the merge operation will not be performed in this step. Instead the merging of data will Occur in the in the step; Tidy and Manipulate I, after the initial cleaning of the data has been performed.

#Data Selection: This involves reading in the data set, assigning them to an object, viewing the data set and printing the head of each data set. 

#Data set 1: South Africa Goods and Services Consumption

SA_Consum <- read_csv("SA Consumption Data.csv")

## Parsed with column specification:
## cols(
##   Area = col_character(),
##   `Consumption Segment` = col_character(),
##   Country = col_character(),
##   Country_Hide = col_character(),
##   Indicator = col_character(),
##   Sector = col_character(),
##   PPP = col_number(),
##   `Local Currency` = col_number(),
##   USD = col_number()
## )

View(SA_Consum)
head(SA_Consum)

#Data set 2: Population of South Africa by Consumption Segments

SA_Pop <- read_csv("SA Population Data .csv")

## Parsed with column specification:
## cols(
##   Country_Hide = col_character(),
##   `Age Group` = col_character(),
##   Area = col_character(),
##   Country = col_character(),
##   Indicator = col_character(),
##   Sex = col_character(),
##   Unit = col_character(),
##   All = col_double(),
##   Higher = col_double(),
##   Low = col_double(),
##   Lowest = col_double(),
##   Middle = col_double()
## )

View(SA_Pop)
head(SA_Pop)

Tidy & Manipulate Data I

The Three fundamental rules of tidy, which were set out by Hadley Wickham in his article “Tidy Data” are that, in tidy data:

Each variable must have its own column.
Each observation must have its own row.
Each value must have its own cell.

Looking at the second data set SA_Pop we can see that it violates the 1st rule of tidy data and therefore is messy. This represents a common problem in messy data; Column headers are values, not variable names. In this case each category of population segment (All, Higher, Low, Lowest, Middle) are column headers not values, as they should be. This causes the data to appear short and wide, where-as tidy data is generally narrow and long. Below I will apply the gather() function to combine these columns into one variable that contains all the various categories of consumption segment.

#Filtering and Merging The Data:

After performing the gather() operation both data sets should be tidy and ready to be merged. As the SA_Consum data set only contains data concerning South Africa’s national consumption of goods and services we must first filter the population data (SA_Pop) so that it too only displays national statistics. This is done by applying the filter() function.

Once the filter has been applied the mutating join function left_join() is used to attach matching rows from the tidied and filtered SA_Pop Data set to the SA_consum Data set, by the ‘Consumption Segment’ and ‘Area’ variables, while keeping all the observations in SA_Consum.

#Subseting the Data:

Looking at the head of the new merged data there are number of irrelevant columns. These columns contain variables that do not describe anything useful about the data, e.g.the country.x and country.y columns all contain the same observation; ‘South Africa’ as this is the country with which the data is concerned, therefore this information could be included in the data set’s title rather than in the form of a variable. Below the subseting [ ] function is used to select only variables that are meaningful and useful.

#Gather(): Below I use the gather function to combine multiple columns into the new column entitled "Consumption segment", I ensure the correct value corresponds with each observation and entitle this Column "Segment Population", This then assigned to the new object 'SA_Pop_tidy' and the head of this new object is printed.   

SA_Pop_tidy <- gather(SA_Pop, 'All','Higher','Low','Lowest','Middle', key = "Consumption Segment", value = "Segment Population")
head(SA_Pop_tidy)

#Filter: SA_Pop is filtered to only include National statistics, this is assigned to a new object. 

SA_Pop_1 <- SA_Pop_tidy %>% filter(Area == "National")


#Merge: The data sets are joined by two variables using the function left_join(), data set is viewed and the head is printed. 

SA_Consum_and_Pop <- SA_Consum %>% left_join(SA_Pop_1, by = "Consumption Segment", "Area")
View(SA_Consum_and_Pop)
head(SA_Consum_and_Pop)

#Subset: Firstly the names() function is used to gain a succinct view of the positions of the order of the variables in the data. Secondly, the subseting [ ] function is used to select only variables that are meaningful and useful. These are assigned to a new object 'SA', this object is viewed and the head is printed

names(SA_Consum_and_Pop)

##  [1] "Area.x"              "Consumption Segment" "Country.x"          
##  [4] "Country_Hide.x"      "Indicator.x"         "Sector"             
##  [7] "PPP"                 "Local Currency"      "USD"                
## [10] "Country_Hide.y"      "Age Group"           "Area.y"             
## [13] "Country.y"           "Indicator.y"         "Sex"                
## [16] "Unit"                "Segment Population"

SA <- SA_Consum_and_Pop[, c(2, 6, 7, 8, 9, 17)]
View(SA)
head(SA)

Understand

As per the inspection of the merged data below we can see that the SA data frame contains 65 observations and 6 variables. With 2 character variables; ‘Consumption Segment’ and ‘Sector’ as well as 4 numeric or dbl variables including ‘PPP’, ‘Local currency’, ‘USD’ and Segment Population’.

This inspection was completed by applying the str(), Glimpse() and attributes() functions to the data.

However the segment population is actually an ordinal variable meaning, therefore a factor conversion is applied to convert this character variable into an ordered factor.

#Inspection of data structures:

str(SA)

## tibble [65 x 6] (S3: tbl_df/tbl/data.frame)
##  $ Consumption Segment: chr [1:65] "Low" "Low" "Low" "Low" ...
##  $ Sector             : chr [1:65] "All Sectors" "Food and Beverages" "Clothing and Footwear" "Housing" ...
##  $ PPP                : num [1:65] 33777 9707 2637 6914 1552 ...
##  $ Local Currency     : num [1:65] 193105 55493 15074 39527 8871 ...
##  $ USD                : num [1:65] 26377 7580 2059 5399 1212 ...
##  $ Segment Population : num [1:65] 16658579 16658579 16658579 16658579 16658579 ...

glimpse(SA)

## Rows: 65
## Columns: 6
## $ `Consumption Segment` <chr> "Low", "Low", "Low", "Low", "Low", "Low", "Lo...
## $ Sector                <chr> "All Sectors", "Food and Beverages", "Clothin...
## $ PPP                   <dbl> 33777.37, 9706.70, 2636.79, 6913.86, 1551.64,...
## $ `Local Currency`      <dbl> 193105.25, 55493.22, 15074.50, 39526.55, 8870...
## $ USD                   <dbl> 26376.90, 7580.01, 2059.08, 5399.06, 1211.68,...
## $ `Segment Population`  <dbl> 16658579, 16658579, 16658579, 16658579, 16658...

attributes(SA)

## $names
## [1] "Consumption Segment" "Sector"              "PPP"                
## [4] "Local Currency"      "USD"                 "Segment Population" 
## 
## $row.names
##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
## [26] 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
## [51] 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65
## 
## $class
## [1] "tbl_df"     "tbl"        "data.frame"

#Apply Data type Conversions: As Consumption segment represents a ordinal variable its appropriate convert it to an ordered factor:

SA$`Consumption Segment` <- factor(SA$`Consumption Segment`, 
                                          levels = c("Lowest","Low","Middle","Higher","All"), 
                                          ordered = TRUE)

#Apply str() function to confirm variable class conversion is successful:
str(SA)

## tibble [65 x 6] (S3: tbl_df/tbl/data.frame)
##  $ Consumption Segment: Ord.factor w/ 5 levels "Lowest"<"Low"<..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ Sector             : chr [1:65] "All Sectors" "Food and Beverages" "Clothing and Footwear" "Housing" ...
##  $ PPP                : num [1:65] 33777 9707 2637 6914 1552 ...
##  $ Local Currency     : num [1:65] 193105 55493 15074 39527 8871 ...
##  $ USD                : num [1:65] 26377 7580 2059 5399 1212 ...
##  $ Segment Population : num [1:65] 16658579 16658579 16658579 16658579 16658579 ...

Tidy & Manipulate Data II

In this step the arrange() and mutate() functions are used to create multiple new variables including:

-Spend_Per_Person_inPPP: this variable indicates the amount each person with their respective Consumption segement spent by sector with the value denominated in purchasing power parity dollars. -Spend_Per_Person_Rank: this variable indicates the rank of person spending by sector and consumption segment with 1 being the biggest perperson spend and 65 being the lowest.

#Create Spending_Per_Person_inPPP variable: as PPP is in millions we firt multiply it by 1 million to get the true number, then it is divided by the segment population and this result is assigned to the new variable. 

SA_1 <- SA %>% mutate(Spend_Per_Person_inPPP = ((SA$`PPP`*1000000)/SA$`Segment Population`))


#Create Overall_Expenditure_PP_Rank variable: First we arrange the data set by the new Spending_Per_Person_inPPP variable, in descending order using the arrange function, then we use the row_number() function within the mutate function to assign the rank. Then the data set is viewed and the head is printed.

SA_2 <- SA_1 %>% arrange(desc(SA_1$Spend_Per_Person_inPPP)) %>% mutate(Spend_Per_Person_Rank = row_number())

View(SA_2)
head(SA_2)

Scan I

Firstly, which(is.na()) is used to identify that there in fact are missing values in the data. The the missing values are then replaced with the median, this is the appropriate method of imputation as is the nature of the data means there is no way to infer the values and the mean would be biased by the ‘totals’ observations. Additionally, Sapply() is used to identify that there no special values within this data. Finally, The mutates from a previous step are re run in order to correct the new variables as per the imputed values.

# Identify if there are missing values:

which(is.na(SA_2))

## [1] 195 214 319 455

colSums(is.na(SA_2))

##    Consumption Segment                 Sector                    PPP 
##                      0                      0                      1 
##         Local Currency                    USD     Segment Population 
##                      1                      1                      0 
## Spend_Per_Person_inPPP  Spend_Per_Person_Rank 
##                      1                      0

#Identify special values:

sapply(SA_2, is.infinite) %>% sum()

## [1] 0

sapply(SA_2, is.nan) %>% sum()

## [1] 0

#impute median for missing values:

SA_2$PPP[which(is.na(SA_2$PPP))] = median(SA_2$PPP, na.rm = TRUE)
SA_2$`Local Currency`[which(is.na(SA_2$`Local Currency`))] = median(SA_2$`Local Currency`, na.rm = TRUE)
SA_2$USD[which(is.na(SA_2$USD))] = median(SA_2$USD, na.rm = TRUE)


#Re-run mutates to correct new variables and assign new variables:

SA_3 <- SA_2 %>% mutate(Spend_Per_Person_inPPP = ((SA_2$`PPP`*1000000)/SA_2$`Segment Population`)) %>% arrange(desc(SA_2$Spend_Per_Person_inPPP)) %>% mutate(Spend_Per_Person_Rank = row_number())

Scan II

In this step the variable SA_3$PPP is scanned for outliers by creating univariate boxplot and histogram as well as a multivariate boxplot grouped by consumption segment. As you can see below these measurements identify huge outliers. Upon inspection of the data we can see the the totals which are included as observations within the data set are what are causing these outliers. After filtering out the totals and re running the multivariate boxplot we can see that all significant outliers have been eliminated. scatter plots are then used to determine whether there appears to be correlation between multiple numeric variables. As seen below the scatterplot demonstrate appropriate correlations with very minor deviations created by imputed observations.

#Initial univariate boxplot, histogram and multivariate boxplot

SA_3$PPP %>%  boxplot(main="Box Plot of Annual household expenditure by sector (PPP)", ylab="Expenditure (PPP, in Millions) ", col = "grey")

hist(SA_3$PPP)

boxplot(SA_3$PPP ~ SA_3$`Consumption Segment`, main="Box Plot of Annual household expenditure inSouth Africa by sector (PPP", ylab = "Expenditure (PPP, Millions)", xlab = "Consumption Segment")

#boxplot after removing All Sector observations

SA_4 <- SA_3 %>% filter(Sector != 'All Sectors')

boxplot(SA_4$PPP ~ SA_4$`Consumption Segment`, main="Annual household expenditure in South Africa by sector (PPP)", ylab = "Expenditure (PPP, Millions)", xlab = "Consumption Segment")

#Scatter plot

SA_4 %>% plot(PPP ~ USD, data = ., ylab="PPP", xlab="USD", main="PPP by USD")

SA_4 %>% plot(PPP ~ `Local Currency`, data = ., ylab="PPP", xlab="Local Currency", main="PPP by Local Cuarrency")

Transform

Finally, a log transformation is applied to one of the numeric variable ‘PPP’ in order to change the scale for better understanding of the variable and to decrease the skewness and convert the distribution into a normal distribution.

#log transformation

log_PPP <- log10(SA_4$PPP)

hist(log_PPP)