Provide the packages required to reproduce the report. Make sure you fulfilled the minimum requirement #10.
library(readr)
## Warning: package 'readr' was built under R version 3.5.3
library(tidyr)
## Warning: package 'tidyr' was built under R version 3.5.3
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(Hmisc)
## Warning: package 'Hmisc' was built under R version 3.5.3
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 3.5.3
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:dplyr':
##
## src, summarize
## The following objects are masked from 'package:base':
##
## format.pval, units
library(outliers)
library(knitr)
## Warning: package 'knitr' was built under R version 3.5.3
library(lubridate)
## Warning: package 'lubridate' was built under R version 3.5.3
##
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
##
## date
library(stringr)
## Warning: package 'stringr' was built under R version 3.5.3
library(mlr)
## Warning: package 'mlr' was built under R version 3.5.3
## Loading required package: ParamHelpers
## Warning: package 'ParamHelpers' was built under R version 3.5.3
##
## Attaching package: 'mlr'
## The following object is masked from 'package:Hmisc':
##
## impute
library(readxl)
## Warning: package 'readxl' was built under R version 3.5.3
At the start of the pre-processing, the datasets and relevant packages are imported.
Column names are then applied to the datasets and the key column that will be used to join the data undergoes pre-processing before joining the datasets. Specifically, ‘-’ are removed and replaced with spaces.
The next step involves analysing the columns of the dataset for their column characteristics, determining if they are characters, factors the types and ranges of values they contain. For the sake of further pre-processing, some column types are converted from characters to numeric and characters are converted to factors for easier labelling and ordering.
The fourth step of pre-processing involves checking if the dataset is in tidy format. Fortunately, this is the case as the data imported is already in a tidy format.
Next, a new mutated variable is created as a net figure of the capital gain and capital loss column. The dataset is then checking for missing values and inconsistencies and the only missing values are found in the GDP column which is due to the combination of the datasets as the country codes are not all the same. For these missing values, they were imputed using the mean of the GDP column.
The outlier scan involved inspecting all the numeric variables for outliers using boxplots. From the observed boxplots some variables had extreme outliers in which case capping was applied using a pre-set capping function. Some values were left as is despite having outliers as the values seemed to be intentional and it would be wrong to modify them for the sake of removing outliers.
The final step involves transforming the hours per week variable using the mean-centering, scaling and z-score standardisation transformations and visualising the transformed variables using boxplots. The boxplots are compared with the original untransformed boxplot.
The datasets have been obtained from the UCI repository and the worldbank.
All currency measures will be in USD.
Source for the adult dataset: https://archive.ics.uci.edu/ml/datasets/Adult
The adult dataset contains the following attributes:
age: continuous. workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked. fnlwgt: continuous. education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool. education-num: continuous. marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse. occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces. relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried. race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black. sex: Female, Male. capital-gain: continuous. capital-loss: continuous. hours-per-week: continuous. native-country:
This dataset was originally used to predict wether or not yearly income would exceed $50k/ year.The time period the data was extracted from is the year 1994.
Source for the popular indicators dataset: https://databank.worldbank.org/data/indicator/NY.GDP.MKTP.KD.ZG/1ff4a498/Popular-Indicators#
The popular indicator contains gdp growth for 217 countries in the year 1994.
The adult dataset will be merged with the popular indicators dataset on the key of the country name to show gdp growth of the country next to the adult dataset.
#Import adult dataset
adult <- read_csv("adult.data", col_names = FALSE)
## Parsed with column specification:
## cols(
## X1 = col_double(),
## X2 = col_character(),
## X3 = col_double(),
## X4 = col_character(),
## X5 = col_double(),
## X6 = col_character(),
## X7 = col_character(),
## X8 = col_character(),
## X9 = col_character(),
## X10 = col_character(),
## X11 = col_double(),
## X12 = col_double(),
## X13 = col_double(),
## X14 = col_character(),
## X15 = col_character()
## )
#Add column names to adult dataset
colnames(adult) <- c("Age", "Workclass", "fnlwgt", "Education", "Education_No", "Marital_Status", "Occupation",
"Relationship", "Race", "Sex", "Capital_gain", "Capital_loss", "Hours_per_week",
"Country_Name", "Over/Under50k")
#View unique values under country name
adultUnique <- unique(adult$Native_country)
## Warning: Unknown or uninitialised column: 'Native_country'.
adultUnique
## NULL
#Replace "-" with spaces to make joining the dataset easier
adult$Country_Name <- chartr(old = "-", new = " ", adult$Country_Name)
adultUnique2 <- unique(adult$Country_Name)
adultUnique2
## [1] "United States" "Cuba"
## [3] "Jamaica" "India"
## [5] "?" "Mexico"
## [7] "South" "Puerto Rico"
## [9] "Honduras" "England"
## [11] "Canada" "Germany"
## [13] "Iran" "Philippines"
## [15] "Italy" "Poland"
## [17] "Columbia" "Cambodia"
## [19] "Thailand" "Ecuador"
## [21] "Laos" "Taiwan"
## [23] "Haiti" "Portugal"
## [25] "Dominican Republic" "El Salvador"
## [27] "France" "Guatemala"
## [29] "China" "Japan"
## [31] "Yugoslavia" "Peru"
## [33] "Outlying US(Guam USVI etc)" "Scotland"
## [35] "Trinadad&Tobago" "Greece"
## [37] "Nicaragua" "Vietnam"
## [39] "Hong" "Ireland"
## [41] "Hungary" "Holand Netherlands"
#View head of adult dataset
head(adult)
#Import country gdp growth dataset for the year 1994
indicators <- read_excel("Popular Indicators.xlsx")
head(indicators)
#Drop non-essential columns from indicators to only keep country name and gdp for 1994
gdp <- indicators[c(3,5)]
colnames(gdp) <- c("Country_Name", "GDP")
#View head of gdp
head(gdp)
#Merging the dataset
adultGdp <- left_join(adult, gdp, key = "Country_Name")
## Joining, by = "Country_Name"
head(adultGdp)
Summarise the types of variables and data structures, check the attributes in the data and apply data type conversions. In addition to the R codes and outputs, explain briefly the steps that you have taken. In this section, show that you have fulfilled minimum requirements 2-4.
#View structure of adultGdp dataset
str(adultGdp)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 32561 obs. of 16 variables:
## $ Age : num 39 50 38 53 28 37 49 52 31 42 ...
## $ Workclass : chr "State-gov" "Self-emp-not-inc" "Private" "Private" ...
## $ fnlwgt : num 77516 83311 215646 234721 338409 ...
## $ Education : chr "Bachelors" "Bachelors" "HS-grad" "11th" ...
## $ Education_No : num 13 13 9 7 13 14 5 9 14 13 ...
## $ Marital_Status: chr "Never-married" "Married-civ-spouse" "Divorced" "Married-civ-spouse" ...
## $ Occupation : chr "Adm-clerical" "Exec-managerial" "Handlers-cleaners" "Handlers-cleaners" ...
## $ Relationship : chr "Not-in-family" "Husband" "Not-in-family" "Husband" ...
## $ Race : chr "White" "White" "White" "Black" ...
## $ Sex : chr "Male" "Male" "Male" "Male" ...
## $ Capital_gain : num 2174 0 0 0 0 ...
## $ Capital_loss : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Hours_per_week: num 40 13 40 40 40 40 16 45 50 40 ...
## $ Country_Name : chr "United States" "United States" "United States" "United States" ...
## $ Over/Under50k : chr "<=50K" "<=50K" "<=50K" "<=50K" ...
## $ GDP : chr "4.037643424864811" "4.037643424864811" "4.037643424864811" "4.037643424864811" ...
#Convert education characters to factors
educationUnique <- unique(adultGdp$Education)
educationUnique
## [1] "Bachelors" "HS-grad" "11th" "Masters"
## [5] "9th" "Some-college" "Assoc-acdm" "Assoc-voc"
## [9] "7th-8th" "Doctorate" "Prof-school" "5th-6th"
## [13] "10th" "1st-4th" "Preschool" "12th"
adultGdp$Education <- factor(adultGdp$Education, levels=c("Preschool", "1st-4th", "5th-6th","7th-8th", "9th",
"10th", "11th", "12th", "HS-grad", "Some-college",
"Assoc-acdm", "Assoc-voc", "Bachelors", "Masters",
"Doctorate", "Prof-school"),
labels=c("Preschool", "1st-4th", "5th-6th","7th-8th", "9th",
"10th", "11th", "12th", "HS-grad", "Some-college",
"Assoc-acdm", "Assoc-voc", "Bachelors", "Masters",
"Doctorate", "Prof-school"))
levels(adultGdp$Education)
## [1] "Preschool" "1st-4th" "5th-6th" "7th-8th"
## [5] "9th" "10th" "11th" "12th"
## [9] "HS-grad" "Some-college" "Assoc-acdm" "Assoc-voc"
## [13] "Bachelors" "Masters" "Doctorate" "Prof-school"
sum(is.na(adultGdp$Education))
## [1] 0
#Check if education column is a factor
is.factor(adultGdp$Education)
## [1] TRUE
#Convert education number to factors
educationUniqueNum <- unique(adultGdp$Education_No)
educationUniqueNum
## [1] 13 9 7 14 5 10 12 11 4 16 15 3 6 2 1 8
adultGdp$Education_No <- as.factor(adultGdp$Education_No)
#Check if education number column is a factor
is.factor(adultGdp$Education_No)
## [1] TRUE
#Convert gdp column from character to numeric
adultGdp$GDP <- as.numeric(adultGdp$GDP)
is.numeric(adultGdp$GDP)
## [1] TRUE
#View Structure after transformations
str(adultGdp)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 32561 obs. of 16 variables:
## $ Age : num 39 50 38 53 28 37 49 52 31 42 ...
## $ Workclass : chr "State-gov" "Self-emp-not-inc" "Private" "Private" ...
## $ fnlwgt : num 77516 83311 215646 234721 338409 ...
## $ Education : Factor w/ 16 levels "Preschool","1st-4th",..: 13 13 9 7 13 14 5 9 14 13 ...
## $ Education_No : Factor w/ 16 levels "1","2","3","4",..: 13 13 9 7 13 14 5 9 14 13 ...
## $ Marital_Status: chr "Never-married" "Married-civ-spouse" "Divorced" "Married-civ-spouse" ...
## $ Occupation : chr "Adm-clerical" "Exec-managerial" "Handlers-cleaners" "Handlers-cleaners" ...
## $ Relationship : chr "Not-in-family" "Husband" "Not-in-family" "Husband" ...
## $ Race : chr "White" "White" "White" "Black" ...
## $ Sex : chr "Male" "Male" "Male" "Male" ...
## $ Capital_gain : num 2174 0 0 0 0 ...
## $ Capital_loss : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Hours_per_week: num 40 13 40 40 40 40 16 45 50 40 ...
## $ Country_Name : chr "United States" "United States" "United States" "United States" ...
## $ Over/Under50k : chr "<=50K" "<=50K" "<=50K" "<=50K" ...
## $ GDP : num 4.038 4.038 4.038 4.038 0.717 ...
Check if the data conforms the tidy data principles. If your data is untidy, reshape your data into a tidy format (minimum requirement #5). In addition to the R codes and outputs, explain everything that you do in this step.
#View head of dataset
head(adultGdp)
#Looking at the head of the dataset we can see that the dataset is in tidy format
Create/mutate at least one variable from the existing variables (minimum requirement #6). In addition to the R codes and outputs, explain everything that you do in this step.
#Mutating one variable
adultGdp <- mutate(adultGdp, netCapital = adultGdp$Capital_gain - adultGdp$Capital_loss)
#View head of dataset with new variable
head(adultGdp)
Scan the data for missing values, inconsistencies and obvious errors. In this step, you should fulfil the minimum requirement #7. In addition to the R codes and outputs, explain how you dealt with these values.
#Searching for missing values and inconsistencies in each column
colSums(is.na(adultGdp))
## Age Workclass fnlwgt Education Education_No
## 0 0 0 0 0
## Marital_Status Occupation Relationship Race Sex
## 0 0 0 0 0
## Capital_gain Capital_loss Hours_per_week Country_Name Over/Under50k
## 0 0 0 0 0
## GDP netCapital
## 1006 0
#As we can see, the gdp column has a lot of missing values
#We will replace the missing values in GDP with the mean value of the gdp column
adultGdp$GDP[is.na(adultGdp$GDP)] <- mean(adultGdp$GDP, na.rm = TRUE)
colSums(is.na(adultGdp))
## Age Workclass fnlwgt Education Education_No
## 0 0 0 0 0
## Marital_Status Occupation Relationship Race Sex
## 0 0 0 0 0
## Capital_gain Capital_loss Hours_per_week Country_Name Over/Under50k
## 0 0 0 0 0
## GDP netCapital
## 0 0
Scan the numeric data for outliers. In this step, you should fulfil the minimum requirement #8. In addition to the R codes and outputs, explain how you dealt with these values.
#Scanning dataset for outliers
summary(adultGdp)
## Age Workclass fnlwgt Education
## Min. :17.00 Length:32561 Min. : 12285 HS-grad :10501
## 1st Qu.:28.00 Class :character 1st Qu.: 117827 Some-college: 7291
## Median :37.00 Mode :character Median : 178356 Bachelors : 5355
## Mean :38.58 Mean : 189778 Masters : 1723
## 3rd Qu.:48.00 3rd Qu.: 237051 Assoc-voc : 1382
## Max. :90.00 Max. :1484705 11th : 1175
## (Other) : 5134
## Education_No Marital_Status Occupation Relationship
## 9 :10501 Length:32561 Length:32561 Length:32561
## 10 : 7291 Class :character Class :character Class :character
## 13 : 5355 Mode :character Mode :character Mode :character
## 14 : 1723
## 11 : 1382
## 7 : 1175
## (Other): 5134
## Race Sex Capital_gain Capital_loss
## Length:32561 Length:32561 Min. : 0 Min. : 0.0
## Class :character Class :character 1st Qu.: 0 1st Qu.: 0.0
## Mode :character Mode :character Median : 0 Median : 0.0
## Mean : 1078 Mean : 87.3
## 3rd Qu.: 0 3rd Qu.: 0.0
## Max. :99999 Max. :4356.0
##
## Hours_per_week Country_Name Over/Under50k GDP
## Min. : 1.00 Length:32561 Length:32561 Min. :-34.809
## 1st Qu.:40.00 Class :character Class :character 1st Qu.: 4.038
## Median :40.00 Mode :character Mode :character Median : 4.038
## Mean :40.44 Mean : 4.024
## 3rd Qu.:45.00 3rd Qu.: 4.038
## Max. :99.00 Max. : 13.052
##
## netCapital
## Min. :-4356.0
## 1st Qu.: 0.0
## Median : 0.0
## Mean : 990.4
## 3rd Qu.: 0.0
## Max. :99999.0
##
#From the summary view we can see there appear to be outliers for age, fnlwgt, capital gain, capital loss, hours per week, gdp and netcapital
#Function to cap values
cap <- function(x){
quantiles <- quantile( x, c(.05, 0.25, 0.75, .95 ) )
x[ x < quantiles[2] - 1.5*IQR(x) ] <- quantiles[1]
x[ x > quantiles[3] + 1.5*IQR(x) ] <- quantiles[4]
x
}
#We will be using boxplots to visualise outliers for numeric variables
#Age
boxplot(adultGdp$Age, main = "Age")
#No action needed since all values appear to be within reason
#fnlwgt
boxplot(adultGdp$fnlwgt, main = "fnlwgt")
#Capping will be applied to these outliers
adultGdp$fnlwgt <- adultGdp$fnlwgt %>% cap()
#Capital Gain
boxplot(adultGdp$Capital_gain, main = "Capital Gain")
#Few impossible values of 99999, capping outliers
adultGdp$Capital_gain <- adultGdp$Capital_gain %>% cap()
#Capital Loss
boxplot(adultGdp$Capital_loss, main = "Capital Loss")
#Capping will be used to manage these outliers
adultGdp$Capital_loss <- adultGdp$Capital_loss %>% cap()
#Hours per week
boxplot(adultGdp$Hours_per_week, main = "Hours per Week")
#Capping will be used to manage these outliers
adultGdp$Hours_per_week <- adultGdp$Hours_per_week %>% cap()
#GDP
boxplot(adultGdp$GDP, main = "GDP")
#Values appear to be intentional, no further action required
Apply an appropriate transformation for at least one of the variables. In addition to the R codes and outputs, explain everything that you do in this step. In this step, you should fulfil the minimum requirement #9.
#Variable chosen for transformation will be the hours per week variable
boxplot(adultGdp$Hours_per_week, main = "Hours Per Week")
#Transformation we will be using is the mean-centering, scaling and z-score standardisation transformation
hours_Mean_Centered <- scale(adultGdp$Hours_per_week, center = TRUE, scale = FALSE)
hours_sd <- scale(adultGdp$Hours_per_week, center = FALSE, scale = TRUE)
hours_zscore <- scale(adultGdp$Hours_per_week, center = TRUE, scale = TRUE)
#Boxplot of transformed data
boxplot(hours_Mean_Centered, main = "Hours Per Week Mean Centered")
boxplot(hours_sd, main = "Hours Per Week Standard Deviation")
boxplot(hours_zscore, main = "Hours Per Week Z-Score")