library(forecast)
library(tidyr)
library(dplyr)
library(stringr)
library(outliers)
library(editrules)
In this report, it shows step by step how to do data preprocessing before jumping into data modelling in an data science project. We use Adult dataset from UCI to implement this part. There are five main steps considered in this report. First of all, we should discover variable types and meaning of dataset. Secondly, we check whether dataset is tidy or not. Thirdly, we deal with issues related to typing errors such as white space, wrong grammar. In this step, we categorize some variables into groups so that it would be easier to build data model later on. The Fourth step checks whether missing values or inconsistent errors exists in the dataset. We verify the values of nummeric variables such Age, Capital Loss, Capital gain and Hours per week. Finally, we do transformation for Capital and Hours per week by using reciprocal and boxcox methods, respectively.
The Adult dataset is used to build classifiers to predict whether an individual has income more than or less than 50K per year base on the 1994 US Cencus dataset. There are total 48842 records divided into two datasets “adult.data” and “adult.test”. Both datasets have 14 descriptive features and 1 target features. We will merge these two datasets into one for Data Preprocessing Steps.
Data Description:
| Feature Name | Description |
|---|---|
| Age | Continous |
| Workclass | Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked |
| fnlwgt | Continous |
| Education | Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool. |
| Education-num | Continuous |
| Marital-status | Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse. |
| Occupation | Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces. |
| Relationship | Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried. |
| Race | White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black. |
| Sex | Female, Male |
| Capital-loss | Continuous |
| Capital-gain | Continuous |
| Hours-per-week | Continous |
| Native-country | United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands. |
| Income | >50K , <=50K |
attributes_name <- c("age", "workclass", "fnlwgt", "education", "education-num", "marital-status", "occupation", "relationship", "race","sex", "capital-gain", "capital-loss","hours-per-week", "native countries", "income")
adult_training <- read.delim("adult.data.txt", sep = ",", header = FALSE, col.names = attributes_name)
head(adult_training)
adult_test <- read.delim("adult.test.txt", sep=",", header = FALSE, col.names = attributes_name)
head(adult_test)
Obviously, adult_test dataset got wrong value in first row. So therefore we need to remove this line before merging.
# Remove first row in adult_test
adult_test <- adult_test[-1,]
# Merge two dataset into one
adult <- rbind(adult_training, adult_test)
head(adult)
# Check dimension of data
dim(adult)
[1] 48842 15
It shows that there are totally 48842 records and 15 attributes
# Summary of dataset
summary(adult)
age workclass fnlwgt education education.num marital.status
Length:48842 Private :33906 Min. : 12285 HS-grad :15784 Min. : 1.00 Married-civ-spouse :22379
Class :character Self-emp-not-inc: 3862 1st Qu.: 117551 Some-college:10878 1st Qu.: 9.00 Never-married :16117
Mode :character Local-gov : 3136 Median : 178145 Bachelors : 8025 Median :10.00 Divorced : 6633
? : 2799 Mean : 189664 Masters : 2657 Mean :10.08 Separated : 1530
State-gov : 1981 3rd Qu.: 237642 Assoc-voc : 2061 3rd Qu.:12.00 Widowed : 1518
Self-emp-inc : 1695 Max. :1490400 11th : 1812 Max. :16.00 Married-spouse-absent: 628
(Other) : 1463 (Other) : 7625 (Other) : 37
occupation relationship race sex capital.gain capital.loss hours.per.week
Prof-specialty : 6172 Husband :19716 Amer-Indian-Eskimo: 470 Female:16192 Min. : 0 Min. : 0.0 Min. : 1.00
Craft-repair : 6112 Not-in-family :12583 Asian-Pac-Islander: 1519 Male :32650 1st Qu.: 0 1st Qu.: 0.0 1st Qu.:40.00
Exec-managerial: 6086 Other-relative: 1506 Black : 4685 : 0 Median : 0 Median : 0.0 Median :40.00
Adm-clerical : 5611 Own-child : 7581 Other : 406 Mean : 1079 Mean : 87.5 Mean :40.42
Sales : 5504 Unmarried : 5125 White :41762 3rd Qu.: 0 3rd Qu.: 0.0 3rd Qu.:45.00
Other-service : 4923 Wife : 2331 : 0 Max. :99999 Max. :4356.0 Max. :99.00
(Other) :14434 : 0
native.countries income
United-States:43832 <=50K :24720
Mexico : 951 >50K : 7841
? : 857 : 0
Philippines : 295 <=50K.:12435
Germany : 206 >50K. : 3846
Puerto-Rico : 184
(Other) : 2517
# Check type of variables:
lapply(adult, typeof)
$workclass
[1] "integer"
$education
[1] "integer"
$marital.status
[1] "integer"
$occupation
[1] "integer"
$relationship
[1] "integer"
$race
[1] "integer"
$sex
[1] "integer"
$native.countries
[1] "integer"
$income
[1] "integer"
$age
[1] "integer"
$fnlwgt
[1] "integer"
$capital.gain
[1] "integer"
$capital.loss
[1] "integer"
$hours.per.week
[1] "double"
$capital
[1] "integer"
$age_group
[1] "integer"
Type of variables such as Age, Workclass, Education, Marital.Status, Relationship,Occupation, Sex, Native Countries and Income look different compared with data description.
# Check class of variables:
lapply(adult, class)
$workclass
[1] "factor"
$education
[1] "factor"
$marital.status
[1] "factor"
$occupation
[1] "factor"
$relationship
[1] "factor"
$race
[1] "factor"
$sex
[1] "factor"
$native.countries
[1] "factor"
$income
[1] "factor"
$age
[1] "integer"
$fnlwgt
[1] "integer"
$capital.gain
[1] "integer"
$capital.loss
[1] "integer"
$hours.per.week
[1] "numeric"
$capital
[1] "integer"
$age_group
[1] "factor"
# Check levels of factor variables:
fac_cols <- sapply(adult, is.factor)
lapply(adult[, fac_cols], levels)
$workclass
[1] " ?" " Federal-gov" " Local-gov" " Never-worked" " Private" " Self-emp-inc" " Self-emp-not-inc"
[8] " State-gov" " Without-pay" ""
$education
[1] " 10th" " 11th" " 12th" " 1st-4th" " 5th-6th" " 7th-8th" " 9th" " Assoc-acdm"
[9] " Assoc-voc" " Bachelors" " Doctorate" " HS-grad" " Masters" " Preschool" " Prof-school" " Some-college"
[17] ""
$marital.status
[1] " Divorced" " Married-AF-spouse" " Married-civ-spouse" " Married-spouse-absent" " Never-married"
[6] " Separated" " Widowed" ""
$occupation
[1] " ?" " Adm-clerical" " Armed-Forces" " Craft-repair" " Exec-managerial" " Farming-fishing"
[7] " Handlers-cleaners" " Machine-op-inspct" " Other-service" " Priv-house-serv" " Prof-specialty" " Protective-serv"
[13] " Sales" " Tech-support" " Transport-moving" ""
$relationship
[1] " Husband" " Not-in-family" " Other-relative" " Own-child" " Unmarried" " Wife" ""
$race
[1] " Amer-Indian-Eskimo" " Asian-Pac-Islander" " Black" " Other" " White" ""
$sex
[1] " Female" " Male" ""
$native.countries
[1] " ?" " Cambodia" " Canada" " China"
[5] " Columbia" " Cuba" " Dominican-Republic" " Ecuador"
[9] " El-Salvador" " England" " France" " Germany"
[13] " Greece" " Guatemala" " Haiti" " Holand-Netherlands"
[17] " Honduras" " Hong" " Hungary" " India"
[21] " Iran" " Ireland" " Italy" " Jamaica"
[25] " Japan" " Laos" " Mexico" " Nicaragua"
[29] " Outlying-US(Guam-USVI-etc)" " Peru" " Philippines" " Poland"
[33] " Portugal" " Puerto-Rico" " Scotland" " South"
[37] " Taiwan" " Thailand" " Trinadad&Tobago" " United-States"
[41] " Vietnam" " Yugoslavia" ""
$income
[1] " <=50K" " >50K" "" " <=50K." " >50K."
It is clearly to see that there are some typing errors in this dataset for factor variables.
Check if the data conforms the tidy data principles.
names(adult)
[1] "age" "workclass" "fnlwgt" "education" "education.num" "marital.status" "occupation"
[8] "relationship" "race" "sex" "capital.gain" "capital.loss" "hours.per.week" "native.countries"
[15] "income"
head(adult)
This dataset is considered as tidy dataset. However there is only one thing needed to be checked about the relationship between education and education.num variables
# Check relationship between education and education.num
adult %>% distinct(education, education.num)
It is clearly that Education.Num is an ID number of education. Therefore, we can remove one of them. In this assignment, I will drop education.num
# drop education.num variable
adult$education.num <- NULL
Create/mutate at least one variable from the existing variables (minimum requirement #6). In addition to the R codes and outputs, explain everything that you do in this step..
# Create capital variable which is the difference betwwen capital-gain and capital-loss
adult <- adult %>% mutate(capital = capital.gain - capital.loss)
Clean Data
# List down Factor Columns in dataframe & Trim string in factor columns
fac_cols <- sapply(adult, is.factor)
adult <- data.frame(cbind(sapply(adult[,fac_cols], trimws, which="both"), adult[,!fac_cols]))
adult <- adult %>% mutate(workclass = ifelse(grepl(".gov$", str_trim(workclass)), "Gov",
ifelse(grepl("^Self.",str_trim(workclass)),"Self-emp",
ifelse(grepl("^Private$", str_trim(workclass)),"Private", "Other"))))
adult$workclass <- as.factor(adult$workclass)
levels(adult$workclass)
[1] "Gov" "Other" "Private" "Self-emp"
adult <- adult %>% mutate(education = ifelse(grepl(".th$|^Preschool$", (education)), "Before-Highschool",
ifelse(grepl("^Assoc.", (education)),"Associate",
ifelse(grepl("^Masters$|^Doctorate$|^Pro.",(education)), "Post-Graduate",
as.character((education))))))
adult$education <- as.factor(adult$education)
levels(adult$education)
[1] "Associate" "Bachelors" "Before-Highschool" "HS-grad" "Post-Graduate" "Some-college"
adult <- adult %>% mutate(marital.status = ifelse(grepl("^Married.", marital.status), "Married", as.character(marital.status)))
adult$marital.status <- as.factor(adult$marital.status)
levels(adult$marital.status)
[1] "Divorced" "Married" "Never-married" "Separated" "Widowed"
adult <- adult %>% mutate(income = ifelse(grepl("^<=50K.$", income), "<=50K",
ifelse(grepl("^>50K.$", income),">50K", as.character(income))))
adult$income <- as.factor(adult$income)
levels(adult$income)
[1] "<=50K" ">50K"
# Convert Age character into numeric because Age has character type as default in dataset.
adult$age <- as.integer(adult$age)
# Categorize Age into 4 groups
adult<- adult %>% mutate(age_group = ifelse(age <=30, "<=30",
ifelse(age>30 & age <=45, "30-45",
ifelse(age>45 & age <=60,"45-60",
">60"))))
adult$age_group <- as.factor(adult$age_group)
# Check levels result of Age after processing
levels(adult$age_group)
[1] "<=30" ">60" "30-45" "45-60"
adult<- adult %>% mutate(native.countries = ifelse(grepl("United.",native.countries), "USA", "Non-USA"))
adult$native.countries <- as.factor(adult$native.countries)
levels(adult$native.countries)
[1] "Non-USA" "USA"
summary(adult)
workclass education marital.status occupation relationship race
Gov : 6549 Associate : 3662 Divorced : 6633 Prof-specialty : 6172 Husband :19716 Amer-Indian-Eskimo: 470
Other : 2830 Bachelors : 8025 Married :23044 Craft-repair : 6112 Not-in-family :12583 Asian-Pac-Islander: 1519
Private :33906 Before-Highschool: 6408 Never-married:16117 Exec-managerial: 6086 Other-relative: 1506 Black : 4685
Self-emp: 5557 HS-grad :15784 Separated : 1530 Adm-clerical : 5611 Own-child : 7581 Other : 406
Post-Graduate : 4085 Widowed : 1518 Sales : 5504 Unmarried : 5125 White :41762
Some-college :10878 Other-service : 4923 Wife : 2331
(Other) :14434
sex native.countries income age fnlwgt capital.gain capital.loss hours.per.week
Female:16192 Non-USA: 5010 <=50K:37155 Min. :17.00 Min. : 12285 Min. : 0 Min. : 0.0 Min. : 1.00
Male :32650 USA :43832 >50K :11687 1st Qu.:28.00 1st Qu.: 117551 1st Qu.: 0 1st Qu.: 0.0 1st Qu.:40.00
Median :37.00 Median : 178145 Median : 0 Median : 0.0 Median :40.00
Mean :38.64 Mean : 189664 Mean : 1079 Mean : 87.5 Mean :40.42
3rd Qu.:48.00 3rd Qu.: 237642 3rd Qu.: 0 3rd Qu.: 0.0 3rd Qu.:45.00
Max. :90.00 Max. :1490400 Max. :99999 Max. :4356.0 Max. :99.00
capital age_group
Min. :-4356.0 <=30 :15793
1st Qu.: 0.0 >60 : 3606
Median : 0.0 30-45:18505
Mean : 991.6 45-60:10938
3rd Qu.: 0.0
Max. :99999.0
Scan the data for missing values, inconsistencies and obvious errors. In this step, you should fulfil the minimum requirement #7. In addition to the R codes and outputs, explain how you dealt with these values.
** Check missing values **
colnames(adult)[apply(is.na(adult),2,any)]
character(0)
There is no missing values in all variables
** Check Infinite vaules for nummeric variables
num_cols <- sapply(adult, is.numeric)
is.special <- function(x){
if (is.numeric(x)) is.infinite(x)
}
head(sapply(adult[,num_cols], is.special))
age fnlwgt capital.gain capital.loss hours.per.week capital
[1,] FALSE FALSE FALSE FALSE FALSE FALSE
[2,] FALSE FALSE FALSE FALSE FALSE FALSE
[3,] FALSE FALSE FALSE FALSE FALSE FALSE
[4,] FALSE FALSE FALSE FALSE FALSE FALSE
[5,] FALSE FALSE FALSE FALSE FALSE FALSE
[6,] FALSE FALSE FALSE FALSE FALSE FALSE
** Check inconsistence error for Age variable
Age_rule <- editset(c("age >15", "age <= 100"))
violate <- violatedEdits(Age_rule, adult)
summary(violate)
No violations detected, 0 checks evaluated to NA
NULL
There are no violation for Age
** Check inconsistence error for Capital Gain variable
cap_gain_rule <- editset("capital.gain >=0")
violateCapGain <- violatedEdits(cap_gain_rule, adult)
summary(violateCapGain)
No violations detected, 0 checks evaluated to NA
NULL
** Check inconsistence error for Capital Loss variable
cap_loss_rule <- editset("capital.loss >=0")
violateCapLoss <- violatedEdits(cap_loss_rule, adult)
summary(violateCapLoss)
No violations detected, 0 checks evaluated to NA
NULL
** Check inconsistence error for Hours.Per.Week variable
hour_rule <- editset(c("hours.per.week > 0", "hours.per.week <160"))
violateHours <- violatedEdits(hour_rule, adult)
summary(violateHours)
No violations detected, 0 checks evaluated to NA
NULL
Scan the numeric data for outliers. In this step, we only consider outliers for variables including “Capital” (because it is the difference between captial gain and capital loss) and “Hours.per.week”. For “fnlwgt” variable, it stands for “Final Weight” defined by the US Census. For “Age” variable, because it does not violate rule which we did check in previous part and we did categorize Age into 4 groups. Hence we do not check outlier for “Age” and “fnlwgt”
Check outliers of Hours Per Week
# Draw Boxplot chart of this variable
boxplot(x = adult$hours.per.week, main = "Box Plot of Hours Per Week", ylab= "Hours per week")
# Using z-score to check outliers
z.scores.hours<- adult$hours.per.week %>% scores(type="z")
z.scores.hours %>% summary()
Min. 1st Qu. Median Mean 3rd Qu. Max.
-3.18142 -0.03409 -0.03409 0.00000 0.36942 4.72726
length(which(abs(z.scores.hours)>3))
[1] 681
There are 681 outliers for this variable in dataset. To deal with this, we impute outliers with mean value
adult$hours.per.week[which(abs(z.scores.hours)>3)] <- mean(adult$hours.per.week, na.rm = TRUE)
# Using boxplot to see changes after imputing outliers
boxplot(x = adult$hours.per.week, main = "Box Plot of Hours Per Week After Imputing Outliers", ylab= "Hours per week")
Check outliers for Capital
boxplot(x = adult$capital, main = "Box Plot of Capital", ylab= "Capital")
Obviously, the value 99999 is as an outlier of capital. However, we skip this value because it can happen in reality for People’s Income.
Transform Capital variable
hist(adult$capital, main = "Histogram of Capital")
Looking at histogram of capital, it has right skew distribution. Therefore we use reciprocal tranformation to reduce the right skewness.
# Apply Reciprocal Transformation for Capital
capital_recip <- 1/adult$capital
hist(capital_recip)
Transform Hours.Per.Week variable
hist(adult$hours.per.week, main = "Histogram of Hours Per Week", xlab="Hours per week")
# Using BoxCox to do transformation for Hours.per.week
boxcox_hours <- BoxCox(adult$hours.per.week, lambda = "auto")
hist(boxcox_hours)