MATH2349 Data Wrangling Semester 2, 2020

Required packages

# This is the R chunk for the required packages
library(readr)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(tidyr)
library(magrittr)

## 
## Attaching package: 'magrittr'

## The following object is masked from 'package:tidyr':
## 
##     extract

library(Hmisc)

## Loading required package: lattice

## Loading required package: survival

## Loading required package: Formula

## Loading required package: ggplot2

## 
## Attaching package: 'Hmisc'

## The following objects are masked from 'package:dplyr':
## 
##     src, summarize

## The following objects are masked from 'package:base':
## 
##     format.pval, units

library(outliers)

Executive Summary

This report will exercist prepocessing techniques to import, understand, tidy, manipulate, transform and scan with three CSV datasets. Three dataset were cleaned with tidy data principles by gathering and spreading the data. Select and filter functions were used extensivly in the process of manipulating data. Mutate function was then used to calculate the GDP. NA values were recoded with the proposed method in Scan I. Followed by that, Log transformation for imputing were conducted and outliers were detected or replaced at the end in Scan II with z-score, capping and Tukey’s method. Note that outliers may serve as a reminder for possible anomalies. Therefore, capping method may be suggested as it is less cruicial way to deal with outliers.

Data

There are three open source csv data sets were chosen from the internet for the purpose of investigating the global terrorist attacks. More specifically, the 911 terrorist attack that happened in 2011 in America. The first one, population, taken from world bank https://datacatalog.worldbank.org/dataset/population-estimates-and-projections, which contains the variable of interest - total population by country from 1960 to 2019. The population dataset has country name, country code, indicator name, indicator code and indicator value from 1960-2019, which are self-explanantory. The second data is GDP per capita (current US$)from 1960 to 2019 taken from world bank https://data.worldbank.org/indicator/NY.GDP.PCAP.CD. The third dataset “case of deaths” taken from Kaggle https://www.kaggle.com/tahminashoaib86/global-cause-of-the-deaths-other-than-diseases, which contains information on casue of deaths by conflict and terrorism, epidemics, famine, natural disaster and other injuries from the years 1960 to 2019 aross the globe. The data for the year of 2000 will also be selected as the baseline of 2001 (the year of 911 terrorist attack happened), so that the three data sets can be merged and analysed by country and year which demonstrates the student’s ability of using R programming.

Read_CSV is a lot faster than Read.CSV. That is why we are using this function to read the data. Note that the col_types are guessed with default setting.

# This is the R chunk for the Data Section
total_population <- read_csv("C:/Users/harry.lu/Google Drive/RMIT Master of Analytics/2020 Semester 2/Data Wrangling/Assignment 2/Report/Population.csv", col_types = cols( .default = col_guess()))
GDP_US <- read_csv("C:/Users/harry.lu/Google Drive/RMIT Master of Analytics/2020 Semester 2/Data Wrangling/Assignment 2/Report/GDP.csv", skip = 4,  col_types = cols( .default = col_guess()))
deaths <- read_csv("C:/Users/harry.lu/Google Drive/RMIT Master of Analytics/2020 Semester 2/Data Wrangling/Assignment 2/Report/Caused of Deaths.csv")

## Parsed with column specification:
## cols(
##   Country = col_character(),
##   ISO_CODE = col_character(),
##   Year = col_double(),
##   Deaths = col_double(),
##   Cause = col_character(),
##   `Male POP` = col_double(),
##   `Female POP` = col_double(),
##   `Total Pop` = col_double(),
##   GDP = col_double(),
##   PCAP = col_double()
## )

head(total_population)

head(GDP_US)

head(deaths)

Tidy & Manipulate Data I

Three rules must be complied as follow (Dolgun 2020). 1. Each variable must have its own column. 2. Each observation must have its own row. 3. Each value must have its own cell. The years obviously cannot be the variable them so that it comforms the observation must have its own row. Gather() were used to transform data form wide into long format. The indicator values is given as the variable name to the values of GDP per capita. Followed by that, Spread() were used to transform the deaths table after joining tidy_tpop and tidy_GDP table with deaths_subset table.

# Firstly gather the year and then rule our the redundant variables with select function and keep the variables of interest.
tpop_gather <- total_population %>% gather(key = Year, value = "Indicator Value", 5:95) %>% select(c(1, 3, 5, 6))
GDP_gather <- GDP_US %>% gather(key = Year, value = "Indicator Value", 5:64) %>% select(c(1, 3, 6, 7))

head(tpop_gather)

head(GDP_gather)

Different type of subsetting methods was used with square brakets in combination with c() for combineing the columns 1, 3, 4, 5, 8, 9 and 10 into deaths_subset.

deaths_subset <- deaths[ ,c(1, 3, 4, 5)] %>% filter(`Year` == 2000 | `Year` == 2001) 

head(deaths_subset)

# To follow the tidy data priciples, we use gather and spread () to conform the tiday data rules. 
tpop_spread <- tpop_gather %>% spread("Indicator Name", value = "Indicator Value")
GDP_spread <- GDP_gather %>% spread("Indicator Name", value = "Indicator Value")

# The data is now considered as tidy
head(tpop_spread)

head(GDP_spread)

# The variables of interest are slected through filter fuction and select function as below.
tidy_tpop <- filter(tpop_spread , Year == 2000 | Year == 2001) %>% select("Country Name", 'Population, total', "Year")  
head(tidy_tpop)

tidy_GDP <-filter(GDP_spread , Year == 2000 | Year == 2001) %>% select("Country Name", "GDP per capita (current US$)", "Year")
head(tidy_GDP)

# Inner join was used to filter out the spelling issue with country. Only the ones that are matched in both table with country and year will be used for the analaysis.
deaths_tpop <- inner_join(tidy_tpop, tidy_GDP, by = c("Country Name" = "Country Name", "Year")) 

# The Year in death_tpop must be coercied from character to numeric before right join with the deaths_subset. 
deaths_tpop$Year <- as.numeric(deaths_tpop$Year)

deaths_tpop <- deaths_tpop %>% right_join(deaths_subset, by = c("Country Name" = "Country", "Year"))
str(deaths_tpop)

## tibble [1,940 x 6] (S3: tbl_df/tbl/data.frame)
##  $ Country Name                : chr [1:1940] "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
##  $ Population, total           : num [1:1940] 20779953 20779953 20779953 20779953 20779953 ...
##  $ Year                        : num [1:1940] 2000 2000 2000 2000 2000 ...
##  $ GDP per capita (current US$): num [1:1940] NA NA NA NA NA NA NA NA NA NA ...
##  $ Deaths                      : num [1:1940] 5429 198 12 0 46 ...
##  $ Cause                       : chr [1:1940] "Conflict and Terrorism" "Epidemics" "Famine" "Natural Disaster" ...

# To tidy up, we need to finall tranform the cause of deaths variable to wide format.
dpop_spread <- deaths_tpop %>% spread("Cause", value = "Deaths")

str(dpop_spread)

## tibble [388 x 9] (S3: tbl_df/tbl/data.frame)
##  $ Country Name                : chr [1:388] "Afghanistan" "Afghanistan" "Albania" "Albania" ...
##  $ Population, total           : num [1:388] 20779953 21606988 3060173 3089027 31042235 ...
##  $ Year                        : num [1:388] 2000 2001 2001 2000 2000 ...
##  $ GDP per capita (current US$): num [1:388] NA NA 1282 1127 1765 ...
##  $ Conflict and Terrorism      : num [1:388] 5429 5727 5 1 2534 ...
##  $ Epidemics                   : num [1:388] 198 114 0 0 0 0 0 0 133 359 ...
##  $ Famine                      : num [1:388] 12 12 0 0 0 0 0 0 0 58 ...
##  $ Natural Disaster            : num [1:388] 0 4 0 0 48 921 0 0 59 57 ...
##  $ Other Injuries              : num [1:388] 46 349 0 0 22 0 0 0 217 206 ...

Understand

Initially, I have multiple runs with errors where it says that “non-numeric argument to binary operator”. This means that R was cofused with the column names - “GDP per capita (current US$)” and “Population, total” I had, so I used make.names() to produce a sutiable abd reconginisable names and rename these two variables as below for mutate() to calculate GDP in Tidy Manipulate Data II.

# Renaming "Population, total " to "total.population" and "GDP per capita (current US$)" to "gdp.per.capia" with assistance of make.names(names, unique = FALSE, allow_ = TRUE)
names(dpop_spread)[names(dpop_spread) == 'Population, total'] <- 'total.population'
names(dpop_spread)[names(dpop_spread) == 'GDP per capita (current US$)'] <- 'gdp.per.capia'

# Coverting charactor to factor for cause of deaths 
deaths_tpop$Cause <- factor(deaths_tpop$Cause, order=FALSE) 
levels(deaths_tpop$Cause)

## [1] "Conflict and Terrorism" "Epidemics"              "Famine"                
## [4] "Natural Disaster"       "Other Injuries"

is.factor(deaths_tpop$Cause)

## [1] TRUE

# Check data types of the variables
str(dpop_spread)

## tibble [388 x 9] (S3: tbl_df/tbl/data.frame)
##  $ Country Name          : chr [1:388] "Afghanistan" "Afghanistan" "Albania" "Albania" ...
##  $ total.population      : num [1:388] 20779953 21606988 3060173 3089027 31042235 ...
##  $ Year                  : num [1:388] 2000 2001 2001 2000 2000 ...
##  $ gdp.per.capia         : num [1:388] NA NA 1282 1127 1765 ...
##  $ Conflict and Terrorism: num [1:388] 5429 5727 5 1 2534 ...
##  $ Epidemics             : num [1:388] 198 114 0 0 0 0 0 0 133 359 ...
##  $ Famine                : num [1:388] 12 12 0 0 0 0 0 0 0 58 ...
##  $ Natural Disaster      : num [1:388] 0 4 0 0 48 921 0 0 59 57 ...
##  $ Other Injuries        : num [1:388] 46 349 0 0 22 0 0 0 217 206 ...

# Check attributes of the data
attributes(dpop_spread)

## $names
## [1] "Country Name"           "total.population"       "Year"                  
## [4] "gdp.per.capia"          "Conflict and Terrorism" "Epidemics"             
## [7] "Famine"                 "Natural Disaster"       "Other Injuries"        
## 
## $row.names
##   [1]   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18
##  [19]  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36
##  [37]  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54
##  [55]  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71  72
##  [73]  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89  90
##  [91]  91  92  93  94  95  96  97  98  99 100 101 102 103 104 105 106 107 108
## [109] 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126
## [127] 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144
## [145] 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162
## [163] 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180
## [181] 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198
## [199] 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216
## [217] 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234
## [235] 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252
## [253] 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270
## [271] 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288
## [289] 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306
## [307] 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324
## [325] 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342
## [343] 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360
## [361] 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378
## [379] 379 380 381 382 383 384 385 386 387 388
## 
## $class
## [1] "tbl_df"     "tbl"        "data.frame"

Tidy & Manipulate Data II

Mutate() is used to calculate GDP which will be added into the dpop_spread table if the run is successful.

dpop_spread <- dpop_spread %>% mutate(GDP = total.population * gdp.per.capia) 
str(dpop_spread)

## tibble [388 x 10] (S3: tbl_df/tbl/data.frame)
##  $ Country Name          : chr [1:388] "Afghanistan" "Afghanistan" "Albania" "Albania" ...
##  $ total.population      : num [1:388] 20779953 21606988 3060173 3089027 31042235 ...
##  $ Year                  : num [1:388] 2000 2001 2001 2000 2000 ...
##  $ gdp.per.capia         : num [1:388] NA NA 1282 1127 1765 ...
##  $ Conflict and Terrorism: num [1:388] 5429 5727 5 1 2534 ...
##  $ Epidemics             : num [1:388] 198 114 0 0 0 0 0 0 133 359 ...
##  $ Famine                : num [1:388] 12 12 0 0 0 0 0 0 0 58 ...
##  $ Natural Disaster      : num [1:388] 0 4 0 0 48 921 0 0 59 57 ...
##  $ Other Injuries        : num [1:388] 46 349 0 0 22 0 0 0 217 206 ...
##  $ GDP                   : num [1:388] NA NA 3.92e+09 3.48e+09 5.48e+10 ...

Scan I

Firslty, scanning for missing or special values for each variable: sapply() is used to check for NAN numbers and infinite values in the data frame.

# Scanning for infinite values
sapply(dpop_spread, function(x) sum(is.infinite(x)))

##           Country Name       total.population                   Year 
##                      0                      0                      0 
##          gdp.per.capia Conflict and Terrorism              Epidemics 
##                      0                      0                      0 
##                 Famine       Natural Disaster         Other Injuries 
##                      0                      0                      0 
##                    GDP 
##                      0

# Scanning for NaN numbers
sapply(dpop_spread, function(x) sum(is.nan(x)))

##           Country Name       total.population                   Year 
##                      0                      0                      0 
##          gdp.per.capia Conflict and Terrorism              Epidemics 
##                      0                      0                      0 
##                 Famine       Natural Disaster         Other Injuries 
##                      0                      0                      0 
##                    GDP 
##                      0

# Scanning for missing values 
colSums(is.na(dpop_spread))

##           Country Name       total.population                   Year 
##                      0                     48                      0 
##          gdp.per.capia Conflict and Terrorism              Epidemics 
##                     63                      0                      0 
##                 Famine       Natural Disaster         Other Injuries 
##                      0                      0                      0 
##                    GDP 
##                     63

We note that there are 63 missing cells each under GDP and gdp.per.capia data. 48 missing observation from total population. Thus, We can confirm that most of the missing data are due to the NA values in the GDP per capia. As the goal of this analysis is to observe the GDP changes before and after the infamous terrorist attack from Al-Qaeda to America on 11th of September in 2001.

# Below are counties that contain missing total population and GDP per capita.
dpop_spread[which(is.na(dpop_spread$total.population)), c("Country Name", "Year")]

dpop_spread[which(is.na(dpop_spread$gdp.per.capia)), c("Country Name", "Year")]

# Print observations were missing form total population and GDP per capita 
dpop_spread[which(is.na(dpop_spread$gdp.per.capia)), ]

As below, the sutiable methodology in dealing with NA values would be manipulating the dataset to join tables to ensure that the information can be recoded correctly. The NA values were recoded with the data from original cause of deaths.csv that contains GDP information which the GDP.csv did not reveal and show as NA Values. As a result, 42 NA Vlaues were replaced with data from deaths table and 21 NA values were removed which is believed that this is the optimal method to deal with NA values in this case. Please note that before removing (see table dpop_spread) is 388 observations and now we have retrieved back the 42 observations and removed 21 NA values (5% out of 388 observations), hence we have 367 obersvations with no NA values in the dpop_tran_final table:

# Step 1. subset the NA values for replacing them with the existing GDP information from original cause of death CSV.
dpop_tran <- dpop_spread[is.na(dpop_spread$gdp.per.capia), ]

# Step 2. manipulated the data for the purpose of replacing NA with data from deaths table into dpop_spread
deaths_tran <- deaths[ ,c(1, 3, 4, 5, 8, 10)]  %>% spread("Cause", value = "Deaths") %>% filter(`Year` == 2000 | `Year` == 2001) %>% select(1:4)

# Step 3. use inner join to match the NA values by country and year between deaths transfromed table and dpop transformed table.
transformed <- dpop_tran %>% inner_join(deaths_tran, by = c("Country Name" = "Country", "Year"))

# step 4. drop 21 NA values that cannot be found from the data which equals 5% of the total 388 observations before removing NA values. Please note that world bank and kaggle data were used which they have relatively high credibility on the data source. 
transformed.narm <- transformed[!is.na(transformed$PCAP), ]
transformed.narm <- transformed.narm[!is.na(transformed.narm$'Total Pop'), ]

# step 5. Now we have 42 GDP per capia and total population were recoded through joining function as above and we can apply mutate () again to calculate GDP now.
transformed.narm <- transformed.narm %>% select(c(1,3,5,6:12))

# step 6. remove the rest of 21 NA values that cannot be found in death table from dpop_spread and join the tranformed.narm with the dpop_spread table to ensure no NA values are within the final table.Recalculate the GDP with mutate ().
dpop_tran_final <- dpop_spread[!is.na(dpop_spread$gdp.per.capia), ]

names(transformed.narm)[names(transformed.narm) == 'Total Pop'] <- 'total.population'
names(transformed.narm)[names(transformed.narm) == 'PCAP'] <- 'gdp.per.capia'

dpop_tran_final <- bind_rows(transformed.narm, dpop_tran_final) 

dpop_tran_final<- dpop_tran_final %>% mutate(GDP = total.population * gdp.per.capia) 

# step 7. check if NA values exist in the data
colSums(is.na(dpop_tran_final))

##           Country Name                   Year Conflict and Terrorism 
##                      0                      0                      0 
##              Epidemics                 Famine       Natural Disaster 
##                      0                      0                      0 
##         Other Injuries                    GDP       total.population 
##                      0                      0                      0 
##          gdp.per.capia 
##                      0

Transform

In order to decrease the skewness and conver the distribution into a normal distribution, transformation is done as below. It is believed that it would be better than imputing with mean for total population, since the distibution is well distributed normally. Moreover, death table happens to have information we need. For GDP per capita, as we can see the ditribution it is slightly right skewed distribution which means outliers have a greater impact to the distribution and push the mean to cause skewness. Median is preferrable in this situation.

Imputing with mean and median are carried out on total population and GDP per capita respectively as below. Histogram is great way to see the parameter distribution for imputing.

par(mfrow = c(2,2))

hist(dpop_spread$total.population) 
hist(dpop_spread$gdp.per.capia)

hist(log(dpop_spread$total.population))
hist(log(dpop_spread$gdp.per.capia))

The log transformation to Total Population and GDP per capita would show us a better indication of how the data are distributed. We can see in the transformed histogram, population and gdp per capia follow normal distributions. Hence, imputing with mean and median are chose beacuse they are numerical variables and the centre of measure.

# Impute using median and mean 
dpop_spread$total.population <- impute(dpop_spread$total.population, fun = mean) 
dpop_spread$gdp.per.capia <- impute(dpop_spread$gdp.per.capia, fun = median)

GDP can be recalculated after the missing values are gone.

dpop_spread <- dpop_spread %>% mutate(GDP = total.population * gdp.per.capia)

Check if missing values still exist.

colSums(is.na(dpop_spread))

##           Country Name       total.population                   Year 
##                      0                      0                      0 
##          gdp.per.capia Conflict and Terrorism              Epidemics 
##                      0                      0                      0 
##                 Famine       Natural Disaster         Other Injuries 
##                      0                      0                      0 
##                    GDP 
##                      0

Scan II

Firstly, start with hist() to observe the distribution for the skewness as below. It is noted that capping method and zscores are used in this section. Capping method is less crude than other methdos like removing the outliers. Sometimes, outliers do serve a reminder of possible anomalies in the data.

dpop_spread %>% select(c(2, 4:10)) %>% hist()

# Log transformation applied for all variables to understand the skewness of distributions better
dpop_spread %>% select(c(2, 4:10)) %>% log() %>% hist()

For the skewed distributions as above, Tukey’s/boxplot method is used as it is a non-parametric method to test outliers in non-symmetric/ non-normal data distributions (Dolgun 2020):

# subsetting the skewed variables as below with Tukey's method 
dpsele <- dpop_spread %>% select(`Conflict and Terrorism`, Famine, `Natural Disaster`) 
dpbox <- dpsele %>% boxplot()

For the outliers in Conflict and Terrorism, Famine, Natural Disaster a capping method inovlved in replacing the outliers with the nearest neighbours that are not outliers. It is important to note that outliers can still hold valuable information, especially when it comes to anomaly detection or fraud detection.

# function for capping outliers
cap <- function(x) {
    quantiles <- quantile( x, c(.05, 0.25, 0.75, .95 ) )
    x[ x < quantiles[2] - 1.5*IQR(x) ] <- quantiles[1]
    x[ x > quantiles[3] + 1.5*IQR(x) ] <- quantiles[4]
    x
}

# apply cap function on variables
capping <- sapply(dpsele, FUN = cap)

# print the boxplot after capping
boxplot(capping)

We can also use the z-score method to identify outliers where it is above 3 or below -3.

dpzscore <- dpop_spread %>% select(Epidemics, 'Other Injuries', `GDP`)
zs <- dpzscore %>% scores(type = "z")
summary(zs)

##    Epidemics       Other Injuries         GDP         
##  Min.   :-0.2714   Min.   :-0.2049   Min.   :-0.2042  
##  1st Qu.:-0.2714   1st Qu.:-0.2049   1st Qu.:-0.2003  
##  Median :-0.2714   Median :-0.2049   Median :-0.1877  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.:-0.2483   3rd Qu.:-0.1400   3rd Qu.:-0.1304  
##  Max.   :11.2938   Max.   :12.7821   Max.   :12.2994

As it can be seen from below, by using which () and abs(), we can obtain the absolute values of z-score values which our observations have are greather than three.

which(abs(zs$`GDP`) > 3)

## [1] 171 172 369 370

which(abs(zs$'Other Injuries') > 3)

## [1]  47  48  73  74 247 248

which(abs(zs$Epidemics) > 3)

## [1]  54  68  91 155 156 203 245 386

Some important facts about z-score can be concluded as follow: 1. A positive z-score says the data point is above average. 2. A negative z-score says the data point is below average. 3. A z-score close to 000 says the data point is close to average. 4. A data point can be considered unusual if its z-score is above 3 or below -3.

References

Dolgun, A 2020, ‘Module 5 Scan: Missing Values’, Module notes, MATH2349, RMIT University, viewed 10 October 2020, http://rare-phoenix-161610.appspot.com/secured/Module_05.html.

Dolgun, A 2020, ‘Module 6 Scan: Outliers’, Module notes, MATH2349, RMIT University, viewed 20 October 2020, http://rare-phoenix-161610.appspot.com/secured/Module_06.html.

Dolgun, A 2020, ‘Module Transform: Data Transformation, Standardisation, and Reduction’, Module notes, MATH2349, RMIT University, viewed 20 October 2020, http://rare-phoenix-161610.appspot.com/secured/Module_07.html.

The WORLD BANK 2020, Population Estimates And Projections, data file, The WORLD BANK, Melbourne, viewed 30 September 2020, https://datacatalog.worldbank.org/dataset/population-estimates-and-projections

The WORLD BANK 2020, GDP per capita, data file, The WORLD BANK, Melbourne, viewed 30 September 2020, https://data.worldbank.org/indicator/NY.GDP.PCAP.CD

Kaggle 2017, Global Cause of the Deaths other than diseases, data file, Kaggle, Melbourne, viewed 30 September 2020, https://www.kaggle.com/tahminashoaib86/global-cause-of-the-deaths-other-than-diseases