MATH2349 Semester 1, 2019

Required packages

Provide the packages required to reproduce the report. Make sure you fulfilled the minimum requirement #10.

library(readr)

## Warning: package 'readr' was built under R version 3.5.3

library(tidyr)

## Warning: package 'tidyr' was built under R version 3.5.3

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(Hmisc)

## Warning: package 'Hmisc' was built under R version 3.5.3

## Loading required package: lattice

## Loading required package: survival

## Loading required package: Formula

## Loading required package: ggplot2

## Warning: package 'ggplot2' was built under R version 3.5.3

## 
## Attaching package: 'Hmisc'

## The following objects are masked from 'package:dplyr':
## 
##     src, summarize

## The following objects are masked from 'package:base':
## 
##     format.pval, units

library(outliers)
library(knitr)

## Warning: package 'knitr' was built under R version 3.5.3

library(lubridate)

## Warning: package 'lubridate' was built under R version 3.5.3

## 
## Attaching package: 'lubridate'

## The following object is masked from 'package:base':
## 
##     date

library(stringr)

## Warning: package 'stringr' was built under R version 3.5.3

library(mlr)

## Warning: package 'mlr' was built under R version 3.5.3

## Loading required package: ParamHelpers

## Warning: package 'ParamHelpers' was built under R version 3.5.3

## 
## Attaching package: 'mlr'

## The following object is masked from 'package:Hmisc':
## 
##     impute

library(readxl)

## Warning: package 'readxl' was built under R version 3.5.3

Executive Summary

At the start of the pre-processing, the datasets and relevant packages are imported.

Column names are then applied to the datasets and the key column that will be used to join the data undergoes pre-processing before joining the datasets. Specifically, ‘-’ are removed and replaced with spaces.

The next step involves analysing the columns of the dataset for their column characteristics, determining if they are characters, factors the types and ranges of values they contain. For the sake of further pre-processing, some column types are converted from characters to numeric and characters are converted to factors for easier labelling and ordering.

The fourth step of pre-processing involves checking if the dataset is in tidy format. Fortunately, this is the case as the data imported is already in a tidy format.

Next, a new mutated variable is created as a net figure of the capital gain and capital loss column. The dataset is then checking for missing values and inconsistencies and the only missing values are found in the GDP column which is due to the combination of the datasets as the country codes are not all the same. For these missing values, they were imputed using the mean of the GDP column.

The outlier scan involved inspecting all the numeric variables for outliers using boxplots. From the observed boxplots some variables had extreme outliers in which case capping was applied using a pre-set capping function. Some values were left as is despite having outliers as the values seemed to be intentional and it would be wrong to modify them for the sake of removing outliers.

The final step involves transforming the hours per week variable using the mean-centering, scaling and z-score standardisation transformations and visualising the transformed variables using boxplots. The boxplots are compared with the original untransformed boxplot.

Data

The datasets have been obtained from the UCI repository and the worldbank.

All currency measures will be in USD.

Source for the adult dataset: https://archive.ics.uci.edu/ml/datasets/Adult

The adult dataset contains the following attributes:

age: continuous. workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked. fnlwgt: continuous. education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool. education-num: continuous. marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse. occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces. relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried. race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black. sex: Female, Male. capital-gain: continuous. capital-loss: continuous. hours-per-week: continuous. native-country:

This dataset was originally used to predict wether or not yearly income would exceed $50k/ year.The time period the data was extracted from is the year 1994.

Source for the popular indicators dataset: https://databank.worldbank.org/data/indicator/NY.GDP.MKTP.KD.ZG/1ff4a498/Popular-Indicators#

The popular indicator contains gdp growth for 217 countries in the year 1994.

The adult dataset will be merged with the popular indicators dataset on the key of the country name to show gdp growth of the country next to the adult dataset.

#Import adult dataset
adult <- read_csv("adult.data", col_names = FALSE)

## Parsed with column specification:
## cols(
##   X1 = col_double(),
##   X2 = col_character(),
##   X3 = col_double(),
##   X4 = col_character(),
##   X5 = col_double(),
##   X6 = col_character(),
##   X7 = col_character(),
##   X8 = col_character(),
##   X9 = col_character(),
##   X10 = col_character(),
##   X11 = col_double(),
##   X12 = col_double(),
##   X13 = col_double(),
##   X14 = col_character(),
##   X15 = col_character()
## )

#Add column names to adult dataset
colnames(adult) <- c("Age", "Workclass", "fnlwgt", "Education", "Education_No", "Marital_Status", "Occupation",
                     "Relationship", "Race", "Sex", "Capital_gain", "Capital_loss", "Hours_per_week",
                     "Country_Name", "Over/Under50k")

#View unique values under country name
adultUnique <- unique(adult$Native_country)

## Warning: Unknown or uninitialised column: 'Native_country'.

adultUnique

## NULL

#Replace "-" with spaces to make joining the dataset easier
adult$Country_Name <- chartr(old = "-", new = " ", adult$Country_Name)
adultUnique2 <- unique(adult$Country_Name)
adultUnique2

##  [1] "United States"              "Cuba"                      
##  [3] "Jamaica"                    "India"                     
##  [5] "?"                          "Mexico"                    
##  [7] "South"                      "Puerto Rico"               
##  [9] "Honduras"                   "England"                   
## [11] "Canada"                     "Germany"                   
## [13] "Iran"                       "Philippines"               
## [15] "Italy"                      "Poland"                    
## [17] "Columbia"                   "Cambodia"                  
## [19] "Thailand"                   "Ecuador"                   
## [21] "Laos"                       "Taiwan"                    
## [23] "Haiti"                      "Portugal"                  
## [25] "Dominican Republic"         "El Salvador"               
## [27] "France"                     "Guatemala"                 
## [29] "China"                      "Japan"                     
## [31] "Yugoslavia"                 "Peru"                      
## [33] "Outlying US(Guam USVI etc)" "Scotland"                  
## [35] "Trinadad&Tobago"            "Greece"                    
## [37] "Nicaragua"                  "Vietnam"                   
## [39] "Hong"                       "Ireland"                   
## [41] "Hungary"                    "Holand Netherlands"

#View head of adult dataset
head(adult)

#Import country gdp growth dataset for the year 1994
indicators <- read_excel("Popular Indicators.xlsx")
head(indicators)

#Drop non-essential columns from indicators to only keep country name and gdp for 1994
gdp <- indicators[c(3,5)]
colnames(gdp) <- c("Country_Name", "GDP")

#View head of gdp
head(gdp)

#Merging the dataset
adultGdp <- left_join(adult, gdp, key = "Country_Name")

## Joining, by = "Country_Name"

head(adultGdp)

Understand

Summarise the types of variables and data structures, check the attributes in the data and apply data type conversions. In addition to the R codes and outputs, explain briefly the steps that you have taken. In this section, show that you have fulfilled minimum requirements 2-4.

#View structure of adultGdp dataset
str(adultGdp)

## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 32561 obs. of  16 variables:
##  $ Age           : num  39 50 38 53 28 37 49 52 31 42 ...
##  $ Workclass     : chr  "State-gov" "Self-emp-not-inc" "Private" "Private" ...
##  $ fnlwgt        : num  77516 83311 215646 234721 338409 ...
##  $ Education     : chr  "Bachelors" "Bachelors" "HS-grad" "11th" ...
##  $ Education_No  : num  13 13 9 7 13 14 5 9 14 13 ...
##  $ Marital_Status: chr  "Never-married" "Married-civ-spouse" "Divorced" "Married-civ-spouse" ...
##  $ Occupation    : chr  "Adm-clerical" "Exec-managerial" "Handlers-cleaners" "Handlers-cleaners" ...
##  $ Relationship  : chr  "Not-in-family" "Husband" "Not-in-family" "Husband" ...
##  $ Race          : chr  "White" "White" "White" "Black" ...
##  $ Sex           : chr  "Male" "Male" "Male" "Male" ...
##  $ Capital_gain  : num  2174 0 0 0 0 ...
##  $ Capital_loss  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Hours_per_week: num  40 13 40 40 40 40 16 45 50 40 ...
##  $ Country_Name  : chr  "United States" "United States" "United States" "United States" ...
##  $ Over/Under50k : chr  "<=50K" "<=50K" "<=50K" "<=50K" ...
##  $ GDP           : chr  "4.037643424864811" "4.037643424864811" "4.037643424864811" "4.037643424864811" ...

#Convert education characters to factors
educationUnique <- unique(adultGdp$Education)
educationUnique

##  [1] "Bachelors"    "HS-grad"      "11th"         "Masters"     
##  [5] "9th"          "Some-college" "Assoc-acdm"   "Assoc-voc"   
##  [9] "7th-8th"      "Doctorate"    "Prof-school"  "5th-6th"     
## [13] "10th"         "1st-4th"      "Preschool"    "12th"

adultGdp$Education <- factor(adultGdp$Education, levels=c("Preschool", "1st-4th", "5th-6th","7th-8th", "9th",
                                                          "10th", "11th", "12th", "HS-grad", "Some-college",
                                                          "Assoc-acdm", "Assoc-voc", "Bachelors", "Masters",
                                                          "Doctorate", "Prof-school"),
                                                labels=c("Preschool", "1st-4th", "5th-6th","7th-8th", "9th",
                                                          "10th", "11th", "12th", "HS-grad", "Some-college",
                                                          "Assoc-acdm", "Assoc-voc", "Bachelors", "Masters",
                                                          "Doctorate", "Prof-school"))

levels(adultGdp$Education)

##  [1] "Preschool"    "1st-4th"      "5th-6th"      "7th-8th"     
##  [5] "9th"          "10th"         "11th"         "12th"        
##  [9] "HS-grad"      "Some-college" "Assoc-acdm"   "Assoc-voc"   
## [13] "Bachelors"    "Masters"      "Doctorate"    "Prof-school"

sum(is.na(adultGdp$Education))

## [1] 0

#Check if education column is a factor
is.factor(adultGdp$Education)

## [1] TRUE

#Convert education number to factors
educationUniqueNum <- unique(adultGdp$Education_No)
educationUniqueNum

##  [1] 13  9  7 14  5 10 12 11  4 16 15  3  6  2  1  8

adultGdp$Education_No <- as.factor(adultGdp$Education_No)

#Check if education number column is a factor
is.factor(adultGdp$Education_No)

## [1] TRUE

#Convert gdp column from character to numeric
adultGdp$GDP <- as.numeric(adultGdp$GDP)
is.numeric(adultGdp$GDP)

## [1] TRUE

#View Structure after transformations
str(adultGdp)

## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 32561 obs. of  16 variables:
##  $ Age           : num  39 50 38 53 28 37 49 52 31 42 ...
##  $ Workclass     : chr  "State-gov" "Self-emp-not-inc" "Private" "Private" ...
##  $ fnlwgt        : num  77516 83311 215646 234721 338409 ...
##  $ Education     : Factor w/ 16 levels "Preschool","1st-4th",..: 13 13 9 7 13 14 5 9 14 13 ...
##  $ Education_No  : Factor w/ 16 levels "1","2","3","4",..: 13 13 9 7 13 14 5 9 14 13 ...
##  $ Marital_Status: chr  "Never-married" "Married-civ-spouse" "Divorced" "Married-civ-spouse" ...
##  $ Occupation    : chr  "Adm-clerical" "Exec-managerial" "Handlers-cleaners" "Handlers-cleaners" ...
##  $ Relationship  : chr  "Not-in-family" "Husband" "Not-in-family" "Husband" ...
##  $ Race          : chr  "White" "White" "White" "Black" ...
##  $ Sex           : chr  "Male" "Male" "Male" "Male" ...
##  $ Capital_gain  : num  2174 0 0 0 0 ...
##  $ Capital_loss  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Hours_per_week: num  40 13 40 40 40 40 16 45 50 40 ...
##  $ Country_Name  : chr  "United States" "United States" "United States" "United States" ...
##  $ Over/Under50k : chr  "<=50K" "<=50K" "<=50K" "<=50K" ...
##  $ GDP           : num  4.038 4.038 4.038 4.038 0.717 ...

Tidy & Manipulate Data I

Check if the data conforms the tidy data principles. If your data is untidy, reshape your data into a tidy format (minimum requirement #5). In addition to the R codes and outputs, explain everything that you do in this step.

#View head of dataset
head(adultGdp)

#Looking at the head of the dataset we can see that the dataset is in tidy format

Tidy & Manipulate Data II

Create/mutate at least one variable from the existing variables (minimum requirement #6). In addition to the R codes and outputs, explain everything that you do in this step.

#Mutating one variable
adultGdp <- mutate(adultGdp, netCapital = adultGdp$Capital_gain - adultGdp$Capital_loss)

#View head of dataset with new variable
head(adultGdp)

Scan I

Scan the data for missing values, inconsistencies and obvious errors. In this step, you should fulfil the minimum requirement #7. In addition to the R codes and outputs, explain how you dealt with these values.

#Searching for missing values and inconsistencies in each column
colSums(is.na(adultGdp))

##            Age      Workclass         fnlwgt      Education   Education_No 
##              0              0              0              0              0 
## Marital_Status     Occupation   Relationship           Race            Sex 
##              0              0              0              0              0 
##   Capital_gain   Capital_loss Hours_per_week   Country_Name  Over/Under50k 
##              0              0              0              0              0 
##            GDP     netCapital 
##           1006              0

#As we can see, the gdp column has a lot of missing values
#We will replace the missing values in GDP with the mean value of the gdp column
adultGdp$GDP[is.na(adultGdp$GDP)] <- mean(adultGdp$GDP, na.rm = TRUE)
colSums(is.na(adultGdp))

##            Age      Workclass         fnlwgt      Education   Education_No 
##              0              0              0              0              0 
## Marital_Status     Occupation   Relationship           Race            Sex 
##              0              0              0              0              0 
##   Capital_gain   Capital_loss Hours_per_week   Country_Name  Over/Under50k 
##              0              0              0              0              0 
##            GDP     netCapital 
##              0              0

Scan II

Scan the numeric data for outliers. In this step, you should fulfil the minimum requirement #8. In addition to the R codes and outputs, explain how you dealt with these values.

#Scanning dataset for outliers
summary(adultGdp)

##       Age         Workclass             fnlwgt               Education    
##  Min.   :17.00   Length:32561       Min.   :  12285   HS-grad     :10501  
##  1st Qu.:28.00   Class :character   1st Qu.: 117827   Some-college: 7291  
##  Median :37.00   Mode  :character   Median : 178356   Bachelors   : 5355  
##  Mean   :38.58                      Mean   : 189778   Masters     : 1723  
##  3rd Qu.:48.00                      3rd Qu.: 237051   Assoc-voc   : 1382  
##  Max.   :90.00                      Max.   :1484705   11th        : 1175  
##                                                       (Other)     : 5134  
##   Education_No   Marital_Status      Occupation        Relationship      
##  9      :10501   Length:32561       Length:32561       Length:32561      
##  10     : 7291   Class :character   Class :character   Class :character  
##  13     : 5355   Mode  :character   Mode  :character   Mode  :character  
##  14     : 1723                                                           
##  11     : 1382                                                           
##  7      : 1175                                                           
##  (Other): 5134                                                           
##      Race               Sex             Capital_gain    Capital_loss   
##  Length:32561       Length:32561       Min.   :    0   Min.   :   0.0  
##  Class :character   Class :character   1st Qu.:    0   1st Qu.:   0.0  
##  Mode  :character   Mode  :character   Median :    0   Median :   0.0  
##                                        Mean   : 1078   Mean   :  87.3  
##                                        3rd Qu.:    0   3rd Qu.:   0.0  
##                                        Max.   :99999   Max.   :4356.0  
##                                                                        
##  Hours_per_week  Country_Name       Over/Under50k           GDP         
##  Min.   : 1.00   Length:32561       Length:32561       Min.   :-34.809  
##  1st Qu.:40.00   Class :character   Class :character   1st Qu.:  4.038  
##  Median :40.00   Mode  :character   Mode  :character   Median :  4.038  
##  Mean   :40.44                                         Mean   :  4.024  
##  3rd Qu.:45.00                                         3rd Qu.:  4.038  
##  Max.   :99.00                                         Max.   : 13.052  
##                                                                         
##    netCapital     
##  Min.   :-4356.0  
##  1st Qu.:    0.0  
##  Median :    0.0  
##  Mean   :  990.4  
##  3rd Qu.:    0.0  
##  Max.   :99999.0  
##

#From the summary view we can see there appear to be outliers for age, fnlwgt, capital gain, capital loss, hours per week, gdp and netcapital

#Function to cap values
cap <- function(x){
    quantiles <- quantile( x, c(.05, 0.25, 0.75, .95 ) )
    x[ x < quantiles[2] - 1.5*IQR(x) ] <- quantiles[1]
    x[ x > quantiles[3] + 1.5*IQR(x) ] <- quantiles[4]
    x
}

#We will be using boxplots to visualise outliers for numeric variables
#Age
boxplot(adultGdp$Age, main = "Age")

#No action needed since all values appear to be within reason

#fnlwgt
boxplot(adultGdp$fnlwgt, main = "fnlwgt")

#Capping will be applied to these outliers
adultGdp$fnlwgt <- adultGdp$fnlwgt %>% cap()

#Capital Gain
boxplot(adultGdp$Capital_gain, main = "Capital Gain")

#Few impossible values of 99999, capping outliers
adultGdp$Capital_gain <- adultGdp$Capital_gain %>% cap()

#Capital Loss
boxplot(adultGdp$Capital_loss, main = "Capital Loss")

#Capping will be used to manage these outliers
adultGdp$Capital_loss <- adultGdp$Capital_loss %>% cap()

#Hours per week
boxplot(adultGdp$Hours_per_week, main = "Hours per Week")

#Capping will be used to manage these outliers
adultGdp$Hours_per_week <- adultGdp$Hours_per_week %>% cap()

#GDP
boxplot(adultGdp$GDP, main = "GDP")

#Values appear to be intentional, no further action required

Transform

Apply an appropriate transformation for at least one of the variables. In addition to the R codes and outputs, explain everything that you do in this step. In this step, you should fulfil the minimum requirement #9.

#Variable chosen for transformation will be the hours per week variable
boxplot(adultGdp$Hours_per_week, main = "Hours Per Week")

#Transformation we will be using is the mean-centering, scaling and z-score standardisation transformation
hours_Mean_Centered <- scale(adultGdp$Hours_per_week, center = TRUE, scale = FALSE)
hours_sd <- scale(adultGdp$Hours_per_week, center = FALSE, scale = TRUE)
hours_zscore <- scale(adultGdp$Hours_per_week, center = TRUE, scale = TRUE)

#Boxplot of transformed data
boxplot(hours_Mean_Centered, main = "Hours Per Week Mean Centered")

boxplot(hours_sd, main = "Hours Per Week Standard Deviation")

boxplot(hours_zscore, main = "Hours Per Week Z-Score")