library(readr)
library(tidyr)
library(dplyr)
library(stringr)
library(outliers)
This report contains the data pre-processing on the 1994 US census data obtained from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/adult). Both training and test datasets are supplied along with the attribute names. Every individual is made up of social and demographic attributes such as age, gender, nationality, etc, this dataset contains these factors and is intended to use them to produce predictive models capable of classifying a person’s income accurately. Annual Income is the target feature and is measured as a binary category target; either less than or equal to $50kUSD, or greater than $50kUSD. This report contains a step by step pre-processing of the dataset to be ready for data modelling to achieve a high level of accuracy through machine learning algorithms. This report contains: the data description, dataset merge, data type conversions, creating and mutating attributes, missing value and inconsistencies checks, outlier removal, data transformation and normalization.
Both training and test datasets are supplied along with the attribute names, have all been sourced from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/adult). Both Datasets contain 15 variables (14 descriptive and 1 target), the training set contains 32561 rows and the test set contains 16281 rows. Both training and test sets are merged into one dataframe, the dimension has now become 48842 rows and 15 features. Variable descriptions are given in the table below:
| Variable Name | Description |
|---|---|
| age | continuous |
| workclass | Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked |
| fnlwgt | continuous |
| Education | Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool. |
| Education-num | continuous |
| Marital-status | Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse. |
| Occupation | Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces. |
| Relationship | Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried. |
| Race | White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black. |
| Sex | Female, Male |
| Capital-loss | continuous |
| Capital-gain | continuous |
| Hours-per-week | continuous |
| Native-country | United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands. |
| Income | >50K , <=50K |
adult.data <- read.table('https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data',
sep = ',', fill = F, strip.white = T)
colnames(adult.data) <- c('age', 'workclass', 'fnlwgt', 'education',
'education_num', 'marital_status', 'occupation', 'relationship', 'race', 'sex',
'capital_gain', 'capital_loss', 'hours_per_week', 'native_country', 'income')
adult.test <- read.table('https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.test',
sep = ',', fill = F, strip.white = T, skip = 1)
colnames(adult.test) <- c('age', 'workclass', 'fnlwgt', 'education',
'education_num', 'marital_status', 'occupation', 'relationship', 'race', 'sex',
'capital_gain', 'capital_loss', 'hours_per_week', 'native_country', 'income')
dim(adult.data)
[1] 32561 15
dim(adult.test)
[1] 16281 15
head(adult.data)
head(adult.test)
# merging the 2 training and test datasets
Adult <- rbind(adult.data, adult.test)
dim(Adult)
[1] 48842 15
head(Adult)
Summarising the types of variables and data structures, checking the attributes in the data and applying data type conversions. The structure shows 9 factor variables and 6 numerical integer variables. summary() is used to explore variable levels frequency and descriptive statistics on numerical features. Whitespaces are present in the factor variables, these are removed using ‘trimws’. The target feature ‘income’ has a full stop in some rows after ‘<=50K’ or ‘>50K’ , these are removed and then imputed to become a ‘0’ for less than 50K and a ‘1’ for more than 50K. Finally, the zero and ones are changed to a numerical data-type for modelling such as logistic regression. The variable ‘Marital Status’ has 3 similar ‘married’ levels, these are all combined into one ‘married’ level.
# Data structure
str(Adult)
'data.frame': 48842 obs. of 15 variables:
$ age : int 39 50 38 53 28 37 49 52 31 42 ...
$ workclass : Factor w/ 9 levels "?","Federal-gov",..: 8 7 5 5 5 5 5 7 5 5 ...
$ fnlwgt : int 77516 83311 215646 234721 338409 284582 160187 209642 45781 159449 ...
$ education : Factor w/ 16 levels "10th","11th",..: 10 10 12 2 10 13 7 12 13 10 ...
$ education_num : int 13 13 9 7 13 14 5 9 14 13 ...
$ marital_status: Factor w/ 7 levels "Divorced","Married-AF-spouse",..: 5 3 1 3 3 3 4 3 5 3 ...
$ occupation : Factor w/ 15 levels "?","Adm-clerical",..: 2 5 7 7 11 5 9 5 11 5 ...
$ relationship : Factor w/ 6 levels "Husband","Not-in-family",..: 2 1 2 1 6 6 2 1 2 1 ...
$ race : Factor w/ 5 levels "Amer-Indian-Eskimo",..: 5 5 5 3 3 5 3 5 5 5 ...
$ sex : Factor w/ 2 levels "Female","Male": 2 2 2 2 1 1 1 2 1 2 ...
$ capital_gain : int 2174 0 0 0 0 0 0 0 14084 5178 ...
$ capital_loss : int 0 0 0 0 0 0 0 0 0 0 ...
$ hours_per_week: int 40 13 40 40 40 40 16 45 50 40 ...
$ native_country: Factor w/ 42 levels "?","Cambodia",..: 40 40 40 40 6 40 24 40 40 40 ...
$ income : Factor w/ 4 levels "<=50K",">50K",..: 1 1 1 1 1 1 1 2 2 2 ...
# Summary of each variable in the Adult dataset
summary(Adult)
age workclass fnlwgt education education_num marital_status occupation
Min. :17.00 Private :33906 Min. : 12285 HS-grad :15784 Min. : 1.00 Divorced : 6633 Prof-specialty : 6172
1st Qu.:28.00 Self-emp-not-inc: 3862 1st Qu.: 117550 Some-college:10878 1st Qu.: 9.00 Married-AF-spouse : 37 Craft-repair : 6112
Median :37.00 Local-gov : 3136 Median : 178144 Bachelors : 8025 Median :10.00 Married-civ-spouse :22379 Exec-managerial: 6086
Mean :38.64 ? : 2799 Mean : 189664 Masters : 2657 Mean :10.08 Married-spouse-absent: 628 Adm-clerical : 5611
3rd Qu.:48.00 State-gov : 1981 3rd Qu.: 237642 Assoc-voc : 2061 3rd Qu.:12.00 Never-married :16117 Sales : 5504
Max. :90.00 Self-emp-inc : 1695 Max. :1490400 11th : 1812 Max. :16.00 Separated : 1530 Other-service : 4923
(Other) : 1463 (Other) : 7625 Widowed : 1518 (Other) :14434
relationship race sex capital_gain capital_loss hours_per_week native_country income
Husband :19716 Amer-Indian-Eskimo: 470 Female:16192 Min. : 0 Min. : 0.0 Min. : 1.00 United-States:43832 <=50K :24720
Not-in-family :12583 Asian-Pac-Islander: 1519 Male :32650 1st Qu.: 0 1st Qu.: 0.0 1st Qu.:40.00 Mexico : 951 >50K : 7841
Other-relative: 1506 Black : 4685 Median : 0 Median : 0.0 Median :40.00 ? : 857 <=50K.:12435
Own-child : 7581 Other : 406 Mean : 1079 Mean : 87.5 Mean :40.42 Philippines : 295 >50K. : 3846
Unmarried : 5125 White :41762 3rd Qu.: 0 3rd Qu.: 0.0 3rd Qu.:45.00 Germany : 206
Wife : 2331 Max. :99999 Max. :4356.0 Max. :99.00 Puerto-Rico : 184
(Other) : 2517
# Checking the levels of each factor variable
factors <- sapply(Adult, is.factor)
lapply(Adult[, factors], levels)
$workclass
[1] "?" "Federal-gov" "Local-gov" "Never-worked" "Private" "Self-emp-inc" "Self-emp-not-inc" "State-gov"
[9] "Without-pay"
$education
[1] "10th" "11th" "12th" "1st-4th" "5th-6th" "7th-8th" "9th" "Assoc-acdm" "Assoc-voc" "Bachelors"
[11] "Doctorate" "HS-grad" "Masters" "Preschool" "Prof-school" "Some-college"
$marital_status
[1] "Divorced" "Married-AF-spouse" "Married-civ-spouse" "Married-spouse-absent" "Never-married" "Separated"
[7] "Widowed"
$occupation
[1] "?" "Adm-clerical" "Armed-Forces" "Craft-repair" "Exec-managerial" "Farming-fishing" "Handlers-cleaners" "Machine-op-inspct"
[9] "Other-service" "Priv-house-serv" "Prof-specialty" "Protective-serv" "Sales" "Tech-support" "Transport-moving"
$relationship
[1] "Husband" "Not-in-family" "Other-relative" "Own-child" "Unmarried" "Wife"
$race
[1] "Amer-Indian-Eskimo" "Asian-Pac-Islander" "Black" "Other" "White"
$sex
[1] "Female" "Male"
$native_country
[1] "?" "Cambodia" "Canada" "China" "Columbia"
[6] "Cuba" "Dominican-Republic" "Ecuador" "El-Salvador" "England"
[11] "France" "Germany" "Greece" "Guatemala" "Haiti"
[16] "Holand-Netherlands" "Honduras" "Hong" "Hungary" "India"
[21] "Iran" "Ireland" "Italy" "Jamaica" "Japan"
[26] "Laos" "Mexico" "Nicaragua" "Outlying-US(Guam-USVI-etc)" "Peru"
[31] "Philippines" "Poland" "Portugal" "Puerto-Rico" "Scotland"
[36] "South" "Taiwan" "Thailand" "Trinadad&Tobago" "United-States"
[41] "Vietnam" "Yugoslavia"
$income
[1] "<=50K" ">50K" "<=50K." ">50K."
# remove whitespcaes in Factor variables
Adult <- data.frame(cbind(sapply(Adult[,factors], trimws), Adult[,!factors]))
# adjusting the income variable to binary target.
levels(Adult$income)[levels(Adult$income)=="<=50K."] <- "<=50K"
levels(Adult$income)[levels(Adult$income)==">50K."] <- ">50K"
levels(Adult$income)
[1] "<=50K" ">50K"
# changing levels of Income to a numeric value of 0 or 1 for classification modelling including logistic regression
levels(Adult$income)[levels(Adult$income)=="<=50K"] <- "0"
levels(Adult$income)[levels(Adult$income)==">50K"] <- "1"
levels(Adult$income)
[1] "0" "1"
# changing data type to numeric, factor -> numeric
Adult$income <- as.numeric(Adult$income)-1
class(Adult$income)
[1] "numeric"
head(Adult$income)
[1] 0 0 0 0 0 0
# re-levelling marital status factor
levels(Adult$marital_status)[levels(Adult$marital_status)=="Married-AF-spouse"] <- "Married"
levels(Adult$marital_status)[levels(Adult$marital_status)=="Married-civ-spouse"] <- "Married"
levels(Adult$marital_status)[levels(Adult$marital_status)=="Married-spouse-absent"] <- "Married"
levels(Adult$marital_status)
[1] "Divorced" "Married" "Never-married" "Separated" "Widowed"
The Adult dataset is in tidy format.
The ‘fnlwgt’ variable (stands for final weight) is removed as it has no predictive power since it is a feature aimed to allocate similar weights to people with similar demographic characteristics. ‘Education’ is removed since it is just a label on ‘education_num’ (number of years of education).
# deleting fnlwgt and education from dataframe
Adult$education <- NULL
Adult$fnlwgt <- NULL
‘Capital gain’ and ‘capital loss’ are converted into one ‘capital’ variable which is calculated by subtracting capital loss from capital gain. The dataset now contains 48,842 rows and 12 features (11 descriptive, 1 target). There are not many individuals in each distinct native country category other than USA, therefore we bin them. Changing the ‘Native country’ levels to ‘USA’ or ‘Other’ will increase this attributes’ predictive power when modelling.
# creating 'capital' variable
Adult <- Adult %>% mutate(capital = capital_gain - capital_loss)
Adult$capital_gain <- NULL
Adult$capital_loss <- NULL
summary(Adult$capital)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-4356.0 0.0 0.0 991.6 0.0 99999.0
# binning all other native countries
Adult<- Adult %>% mutate(native_country = ifelse(grepl("United.",native_country), "USA", "Other"))
Adult$native_country <- as.factor(Adult$native_country)
levels(Adult$native_country)
[1] "Other" "USA"
dim(Adult)
[1] 48842 12
summary(Adult)
workclass marital_status occupation relationship race sex native_country
Private :33906 Divorced : 6633 Prof-specialty : 6172 Husband :19716 Amer-Indian-Eskimo: 470 Female:16192 Other: 5010
Self-emp-not-inc: 3862 Married :23044 Craft-repair : 6112 Not-in-family :12583 Asian-Pac-Islander: 1519 Male :32650 USA :43832
Local-gov : 3136 Never-married:16117 Exec-managerial: 6086 Other-relative: 1506 Black : 4685
? : 2799 Separated : 1530 Adm-clerical : 5611 Own-child : 7581 Other : 406
State-gov : 1981 Widowed : 1518 Sales : 5504 Unmarried : 5125 White :41762
Self-emp-inc : 1695 Other-service : 4923 Wife : 2331
(Other) : 1463 (Other) :14434
income age education_num hours_per_week capital
Min. :0.0000 Min. :17.00 Min. : 1.00 Min. : 1.00 Min. :-4356.0
1st Qu.:0.0000 1st Qu.:28.00 1st Qu.: 9.00 1st Qu.:40.00 1st Qu.: 0.0
Median :0.0000 Median :37.00 Median :10.00 Median :40.00 Median : 0.0
Mean :0.2393 Mean :38.64 Mean :10.08 Mean :40.42 Mean : 991.6
3rd Qu.:0.0000 3rd Qu.:48.00 3rd Qu.:12.00 3rd Qu.:45.00 3rd Qu.: 0.0
Max. :1.0000 Max. :90.00 Max. :16.00 Max. :99.00 Max. :99999.0
Scanning the data for missing values, inconsistencies and obvious errors. Using sum(is.na()), we can confirm there are no missing values in the dataset. By using is.infinite() into a function that scans all numeric features, we can confirm there are no infintie values. Using summary() of just the numerical features we can confirm there are no errors from looking at the min and max values, all features seem to be within a realistic range of values, eg. age is between 17-90. Using lapply() of just the factor variables, we can see that there are ‘?’ entries in workclass and occupation, these rows containing the ‘?’ are removed from the dataset since they can not be imputed.
# total missing values
sum(is.na(Adult))
[1] 0
# check for infinite values
numerics <- sapply(Adult, is.numeric)
special <- function(x){
if (is.numeric(x)) is.infinite(x)
}
sum(sapply(Adult[,numerics], special))
[1] 0
# Checking iconsistencies in numerical variables
summary(Adult[,numerics])
income age education_num hours_per_week capital
Min. :0.0000 Min. :17.00 Min. : 1.00 Min. : 1.00 Min. :-4356.0
1st Qu.:0.0000 1st Qu.:28.00 1st Qu.: 9.00 1st Qu.:40.00 1st Qu.: 0.0
Median :0.0000 Median :37.00 Median :10.00 Median :40.00 Median : 0.0
Mean :0.2393 Mean :38.64 Mean :10.08 Mean :40.42 Mean : 991.6
3rd Qu.:0.0000 3rd Qu.:48.00 3rd Qu.:12.00 3rd Qu.:45.00 3rd Qu.: 0.0
Max. :1.0000 Max. :90.00 Max. :16.00 Max. :99.00 Max. :99999.0
# Checking iconsistencies in factor variables
factors <- sapply(Adult, is.factor)
lapply(Adult[, factors], levels)
$workclass
[1] "?" "Federal-gov" "Local-gov" "Never-worked" "Private" "Self-emp-inc" "Self-emp-not-inc" "State-gov"
[9] "Without-pay"
$marital_status
[1] "Divorced" "Married" "Never-married" "Separated" "Widowed"
$occupation
[1] "?" "Adm-clerical" "Armed-Forces" "Craft-repair" "Exec-managerial" "Farming-fishing" "Handlers-cleaners" "Machine-op-inspct"
[9] "Other-service" "Priv-house-serv" "Prof-specialty" "Protective-serv" "Sales" "Tech-support" "Transport-moving"
$relationship
[1] "Husband" "Not-in-family" "Other-relative" "Own-child" "Unmarried" "Wife"
$race
[1] "Amer-Indian-Eskimo" "Asian-Pac-Islander" "Black" "Other" "White"
$sex
[1] "Female" "Male"
$native_country
[1] "Other" "USA"
#removing '?' in workclass and occupation
is.na(Adult) = Adult=='?'
is.na(Adult) = Adult==' ?'
Adult = na.omit(Adult)
summary(Adult[,factors])
workclass marital_status occupation relationship race sex native_country
Private :33906 Divorced : 6363 Prof-specialty : 6172 Husband :19005 Amer-Indian-Eskimo: 435 Female:14919 Other: 4741
Self-emp-not-inc: 3862 Married :22066 Craft-repair : 6112 Not-in-family :11916 Asian-Pac-Islander: 1423 Male :31114 USA :41292
Local-gov : 3136 Never-married:14875 Exec-managerial: 6086 Other-relative: 1400 Black : 4356
State-gov : 1981 Separated : 1433 Adm-clerical : 5611 Own-child : 6706 Other : 375
Self-emp-inc : 1695 Widowed : 1296 Sales : 5504 Unmarried : 4867 White :39444
Federal-gov : 1432 Other-service : 4923 Wife : 2139
(Other) : 21 (Other) :11625
By scanning all numeric variables for outliers we are then able to impute them to their corresponding variable’s mean. This is acheived using the z-score approach, all values are presented with their z-score and if the value is greater than 3 or less than -3 they are imputed to the mean of that variable. Age, hours_per_week and education_num are the numeric variables that will use this scan and imputation. ‘Capital’ will not be adjusted since most of the values are ‘0’, resulting in all non-zero values being affected.
# age
z_scores_age<- Adult$age %>% scores(type="z")
length(which(abs(z_scores_age)>3))
[1] 167
Adult$age[which(abs(z_scores_age)>3)] <- mean(Adult$age, na.rm = TRUE)
# hours per week
z_scores_hoursperweek<- Adult$hours_per_week %>% scores(type="z")
length(which(abs(z_scores_hoursperweek)>3))
[1] 639
Adult$hours_per_week[which(abs(z_scores_hoursperweek)>3)] <- mean(Adult$hours_per_week, na.rm = TRUE)
# eduction_num
z_scores_eduction_num<- Adult$education_num %>% scores(type="z")
length(which(abs(z_scores_eduction_num)>3))
[1] 302
Adult$education_num[which(abs(z_scores_eduction_num)>3)] <- mean(Adult$education_num, na.rm = TRUE)
summary(Adult[,numerics])
income age education_num hours_per_week capital
Min. :0.0000 Min. :17.00 Min. : 3.00 Min. : 5.00 Min. :-4356
1st Qu.:0.0000 1st Qu.:28.00 1st Qu.: 9.00 1st Qu.:40.00 1st Qu.: 0
Median :0.0000 Median :37.00 Median :10.00 Median :40.00 Median : 0
Mean :0.2481 Mean :38.39 Mean :10.18 Mean :40.53 Mean : 1026
3rd Qu.:0.0000 3rd Qu.:47.00 3rd Qu.:13.00 3rd Qu.:45.00 3rd Qu.: 0
Max. :1.0000 Max. :78.00 Max. :16.00 Max. :76.00 Max. :99999
Applying an appropriate transformation for ‘capital’ will decrease the skewness and convert the distribution of this variable into an approximate normal distribution. Firstly, we will plot the ‘capital’ variable using a histogram over the range -5000,100000 with 50 bins. It can be clearly seen that there is a positive skew. This is rectified by using a reciprocal transformation. Finally, the transformed capital variable is visualised using a histogram and it can be seen that this transformed variable is now approximatley normal distributed.
# plot capital
hist(Adult$capital, breaks=seq(-5000,100000,l=50), main = "Histogram of Capital", xlab = "capital", ylab='Frequency',col="orange")
# Apply Reciprocal Transformation for Capital
recip_capital <- 1/Adult$capital
# plot transformed capital variable
hist(recip_capital, breaks=seq(-0.007,0.010,l=100), main = "Histogram of reciprocal transformed Capital variable", xlab = "1/capital", ylab='Frequency',col="orange")
Numerical descriptive features will be min-max scaled for machine learning classification modelling techniques such as K-nearest Neighbours. The variables: education_num, hours_per_week, capital and age will be min-max scaled between the range of 0,1
normalize <- function(x) {
return ((x - min(x)) / (max(x) - min(x)))
}
Adult$age <- normalize(Adult$age)
Adult$education_num <- normalize(Adult$education_num)
Adult$hours_per_week <- normalize(Adult$hours_per_week)
Adult$capital <- normalize(Adult$capital)
summary(Adult[,numerics])
income age education_num hours_per_week capital
Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.00000
1st Qu.:0.0000 1st Qu.:0.1803 1st Qu.:0.4615 1st Qu.:0.4930 1st Qu.:0.04174
Median :0.0000 Median :0.3279 Median :0.5385 Median :0.4930 Median :0.04174
Mean :0.2481 Mean :0.3507 Mean :0.5526 Mean :0.5004 Mean :0.05157
3rd Qu.:0.0000 3rd Qu.:0.4918 3rd Qu.:0.7692 3rd Qu.:0.5634 3rd Qu.:0.04174
Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.00000