library(forecast)
library(knitr)
library(pander)
library(tidyverse)
library(rmarkdown)
library(class)
library(caret)
library(nnet)
library(scales)
library(finalfit)
library(naniar)
library(mice)
library(factoextra)
library(rpart)
library(rpart.plot)
library(devtools)
library(VIM)
The problem of how to classify a population for the benefit of enhanced business operations and strategy is central to many departments (from Marketing to Risk Analysis). In our research, we attempt to analyze a standard dataset of socio-economic demographics using the Cross Industry Standardized Process for Data Mining (CRISP-DM) framework and apply three supervised machine learning algorithms to classify each observation into an income class. Considerable resources must be spent in preparing the data for modeling and initial models may prove to be of little use. That said, the CRISP-DM framework emphasizes a cyclical process by which early decisions are be revisited as new information is realized. The CRISP-DM process allows a process to follow in order to avoid duplication of efforts across team members and allows analysts the opportunity for continued improvement through the data mining and information discovery process.
Through our data manipulation, exploration, analysis, and model development, assumptions evolved and with them revisions to our models which resulted in model improvements. As the following report will demonstrate, the Adults dataset containing socio-economic demographic data from the U.S. Census can be understood and managed in order to develop fairly accurate models to predict the income class of observations based on certain variables. The most important and accurate model developed based on a subset of the data that observed a complete dataset (all observations missing data were removed). The neural network model developed with complete observations and the omission of outliers for capital gains performed the most efficiently, with an accuracy rate of 87.8% when running the training set against the test set. This process is explained in the pages that follow so that the reader may gain an understanding of the CRISP-DM framework, as well as the final results shown on the socio-economic demographics dataset.
The goal of this analysis is to explore and build a model that will correctly classify the target variable (income class) as a function of the other identified variables in the dataset. This will be accomplished by understanding the data, tidying the data, and making any necessary adjustments in order to build models to classify the target variable and quantify the accuracy of the classifications Then the best model will be identified.
First the Provided dataset will be introduced and the ‘str()’ command is used to see the structure of the dataset, this dataset is a subset of the UCI Adults dataset.
setwd("C:/Users/Marissa.Valente/Documents/RStuff2020")
Adults <- read.csv("Provideddataset.csv", header = TRUE, sep=",")
str(Adults)
## 'data.frame': 16281 obs. of 15 variables:
## $ age : int 25 38 28 44 18 34 29 63 24 55 ...
## $ workclass : chr " Private" " Private" " Local-gov" " Private" ...
## $ fnlwgt : int 226802 89814 336951 160323 103497 198693 227026 104626 369667 104996 ...
## $ education : chr " 11th" " HS-grad" " Assoc-acdm" " Some-college" ...
## $ education.num : int 7 9 12 10 10 6 9 15 10 4 ...
## $ marital.status: chr " Never-married" " Married-civ-spouse" " Married-civ-spouse" " Married-civ-spouse" ...
## $ occupation : chr " Machine-op-inspct" " Farming-fishing" " Protective-serv" " Machine-op-inspct" ...
## $ relationship : chr " Own-child" " Husband" " Husband" " Husband" ...
## $ race : chr " Black" " White" " White" " Black" ...
## $ sex : chr " Male" " Male" " Male" " Male" ...
## $ capital.gain : int 0 0 0 7688 0 0 0 3103 0 0 ...
## $ capital.loss : int 0 0 0 0 0 0 0 0 0 0 ...
## $ hours.per.week: int 40 50 40 40 30 30 40 32 40 10 ...
## $ native.country: chr " United-States" " United-States" " United-States" " United-States" ...
## $ income.class : chr " <=50K." " <=50K." " >50K." " >50K." ...
head(Adults)
## age workclass fnlwgt education education.num marital.status
## 1 25 Private 226802 11th 7 Never-married
## 2 38 Private 89814 HS-grad 9 Married-civ-spouse
## 3 28 Local-gov 336951 Assoc-acdm 12 Married-civ-spouse
## 4 44 Private 160323 Some-college 10 Married-civ-spouse
## 5 18 ? 103497 Some-college 10 Never-married
## 6 34 Private 198693 10th 6 Never-married
## occupation relationship race sex capital.gain capital.loss
## 1 Machine-op-inspct Own-child Black Male 0 0
## 2 Farming-fishing Husband White Male 0 0
## 3 Protective-serv Husband White Male 0 0
## 4 Machine-op-inspct Husband Black Male 7688 0
## 5 ? Own-child White Female 0 0
## 6 Other-service Not-in-family White Male 0 0
## hours.per.week native.country income.class
## 1 40 United-States <=50K.
## 2 50 United-States <=50K.
## 3 40 United-States >50K.
## 4 40 United-States >50K.
## 5 30 United-States <=50K.
## 6 30 United-States <=50K.
ls(Adults)
## [1] "age" "capital.gain" "capital.loss" "education"
## [5] "education.num" "fnlwgt" "hours.per.week" "income.class"
## [9] "marital.status" "native.country" "occupation" "race"
## [13] "relationship" "sex" "workclass"
This dataset is a subset of the Adults dataset, which comes from U.S. Census data. It contains 15 variables and 16,281 observations. The observations are U.S. people and the data is collected from the U.S. Census. There are 15 variables showing various demographics of these people (observations).
Age: The age of the individual in that observation (continuous)
Workclass: The working class for the individual’s occupation (categorical)
Fnlwgt: Final weight, which is the number of units in the target population that the responding unit represents (continuous)
Education: The highest level of education completed by the individual (categorical)
Education-Num: The number of years the individual was in an education program (continuous)
Marital-Status: The marital status of the individual (categorical)
Occupation: The occupation category of the individual (categorical)
Relationship: Relationship to the individual reporting (categorical)
Race: The race of the individual reporting (categorical)
Sex: The sex of the individual reporting (categorical)
Capital-Gain: Capital gain value of the individual (continuous)
Capital-Loss: Capital loss value of the individual (continuous)
Hours-Per-Week: The hours worked per week by the individual (continuous)
Native-Country: The native country of the individual (categorical)
Income-Class: The income class (at, above, or below $50,000) of the individual (categorical)
To begin the data pre-processing and exploratory data analysis (EDA) portion of this project, missing data within the Adults dataset will be identified and a decision on what to do with these data will be explored and identified.
Our missing data was categorized as " ?" instead of “N/A” or blanks.This means we need to look for " ?" through the data in order to know what cells/observations need to be dealt with as missing data and either imputed or omitted.
Adults2 <- na_if(Adults, " ?")
summary(Adults2)
## age workclass fnlwgt education
## Min. :17.00 Length:16281 Min. : 13492 Length:16281
## 1st Qu.:28.00 Class :character 1st Qu.: 116736 Class :character
## Median :37.00 Mode :character Median : 177831 Mode :character
## Mean :38.77 Mean : 189436
## 3rd Qu.:48.00 3rd Qu.: 238384
## Max. :90.00 Max. :1490400
## education.num marital.status occupation relationship
## Min. : 1.00 Length:16281 Length:16281 Length:16281
## 1st Qu.: 9.00 Class :character Class :character Class :character
## Median :10.00 Mode :character Mode :character Mode :character
## Mean :10.07
## 3rd Qu.:12.00
## Max. :16.00
## race sex capital.gain capital.loss
## Length:16281 Length:16281 Min. : 0 Min. : 0.0
## Class :character Class :character 1st Qu.: 0 1st Qu.: 0.0
## Mode :character Mode :character Median : 0 Median : 0.0
## Mean : 1082 Mean : 87.9
## 3rd Qu.: 0 3rd Qu.: 0.0
## Max. :99999 Max. :3770.0
## hours.per.week native.country income.class
## Min. : 1.00 Length:16281 Length:16281
## 1st Qu.:40.00 Class :character Class :character
## Median :40.00 Mode :character Mode :character
## Mean :40.39
## 3rd Qu.:45.00
## Max. :99.00
anyNA(Adults2)
## [1] TRUE
n_miss(Adults2)
## [1] 2203
x <- c(NA, NaN, " ?", "missing")
anyNA(x)
## [1] TRUE
table(Adults$age)
##
## 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36
## 200 312 341 360 376 413 452 408 354 368 397 413 410 417 437 425 460 417 461 450
## 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56
## 422 437 390 393 427 385 334 343 362 360 373 302 270 264 282 260 247 198 202 198
## 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76
## 193 189 168 137 150 136 105 132 106 87 87 58 41 44 46 53 44 26 27 23
## 77 78 79 80 81 82 83 84 85 87 88 89 90
## 25 11 8 16 17 3 5 3 2 2 3 2 12
table(Adults$workclass)
##
## ? Federal-gov Local-gov Never-worked
## 963 472 1043 3
## Private Self-emp-inc Self-emp-not-inc State-gov
## 11210 579 1321 683
## Without-pay
## 7
table(Adults$education)
##
## 10th 11th 12th 1st-4th 5th-6th
## 456 637 224 79 176
## 7th-8th 9th Assoc-acdm Assoc-voc Bachelors
## 309 242 534 679 2670
## Doctorate HS-grad Masters Preschool Prof-school
## 181 5283 934 32 258
## Some-college
## 3587
table(Adults$education.num)
##
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
## 32 79 176 309 242 456 637 224 5283 3587 679 534 2670 934 258 181
table(Adults$marital.status)
##
## Divorced Married-AF-spouse Married-civ-spouse
## 2190 14 7403
## Married-spouse-absent Never-married Separated
## 210 5434 505
## Widowed
## 525
table(Adults$occupation)
##
## ? Adm-clerical Armed-Forces Craft-repair
## 966 1841 6 2013
## Exec-managerial Farming-fishing Handlers-cleaners Machine-op-inspct
## 2020 496 702 1020
## Other-service Priv-house-serv Prof-specialty Protective-serv
## 1628 93 2032 334
## Sales Tech-support Transport-moving
## 1854 518 758
table(Adults$relationship)
##
## Husband Not-in-family Other-relative Own-child Unmarried
## 6523 4278 525 2513 1679
## Wife
## 763
table(Adults$race)
##
## Amer-Indian-Eskimo Asian-Pac-Islander Black Other
## 159 480 1561 135
## White
## 13946
table(Adults$sex)
##
## Female Male
## 5421 10860
table(Adults$capital.gain)
##
## 0 114 401 594 914 991 1055 1086 1151 1173 1264 1409 1424
## 14958 2 3 18 2 1 12 4 5 2 2 3 1
## 1455 1471 1506 1731 1797 1831 1848 2036 2062 2105 2174 2176 2202
## 3 2 9 1 3 2 3 1 1 6 26 8 12
## 2290 2329 2346 2354 2407 2414 2463 2538 2580 2597 2635 2653 2829
## 5 1 2 10 6 2 4 4 8 11 3 6 11
## 2885 2907 2936 2961 2964 2977 2993 3103 3137 3273 3325 3411 3418
## 6 7 1 1 5 3 1 55 14 1 28 10 3
## 3456 3464 3471 3674 3781 3818 3887 3908 3942 4064 4101 4386 4416
## 4 10 3 8 4 4 2 10 4 12 9 38 12
## 4508 4650 4687 4787 4865 4931 4934 5013 5060 5178 5455 5556 5721
## 11 22 1 12 8 3 3 48 1 49 7 1 4
## 6097 6418 6497 6514 6612 6723 6767 6849 7262 7298 7430 7443 7688
## 1 7 4 5 1 3 1 15 1 118 6 2 126
## 7896 7978 8614 9386 9562 10520 10566 10605 11678 13550 14084 14344 15020
## 1 1 27 9 1 21 2 7 2 15 8 8 5
## 15024 15831 20051 25124 25236 27828 34095 41310 99999
## 166 2 12 2 3 24 1 1 85
table(Adults$capital.loss)
##
## 0 213 323 625 653 1092 1138 1258 1340 1380 1408 1411 1421
## 15518 1 2 5 1 4 2 2 4 3 14 3 1
## 1429 1485 1504 1510 1564 1573 1579 1590 1594 1602 1617 1628 1648
## 3 20 8 3 18 6 10 22 1 15 2 9 1
## 1651 1668 1669 1672 1719 1721 1726 1735 1740 1741 1762 1816 1825
## 2 5 11 16 16 10 5 1 16 20 6 2 1
## 1844 1848 1870 1876 1887 1902 1911 1944 1974 1977 1980 2001 2002
## 2 16 1 20 74 102 1 2 10 85 13 11 12
## 2042 2051 2057 2129 2149 2163 2174 2179 2205 2231 2238 2246 2258
## 3 8 10 4 3 1 3 5 10 4 2 2 14
## 2282 2339 2377 2392 2415 2444 2457 2465 2467 2472 2547 2559 2603
## 1 10 5 2 23 8 1 1 1 3 1 5 2
## 2824 3004 3175 3770
## 4 3 2 2
table(Adults$hours.per.week)
##
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
## 7 21 20 30 35 28 19 73 9 147 9 74 5 21 219 98
## 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32
## 13 54 5 638 22 18 19 102 284 10 13 54 8 551 7 157
## 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48
## 22 20 640 116 93 238 25 7586 23 119 76 98 893 47 33 253
## 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64
## 10 1427 7 67 14 21 357 44 2 10 2 702 2 5 5 8
## 65 66 67 68 69 70 72 73 74 75 76 77 78 79 80 84
## 111 6 2 4 1 146 36 2 2 39 1 3 5 1 77 27
## 85 86 88 89 90 92 96 98 99
## 4 2 2 1 13 2 4 3 52
table(Adults$native.country)
##
## ? Cambodia
## 274 9
## Canada China
## 61 47
## Columbia Cuba
## 26 43
## Dominican-Republic Ecuador
## 33 17
## El-Salvador England
## 49 37
## France Germany
## 9 69
## Greece Guatemala
## 20 24
## Haiti Honduras
## 31 7
## Hong Hungary
## 10 6
## India Iran
## 51 16
## Ireland Italy
## 13 32
## Jamaica Japan
## 25 30
## Laos Mexico
## 5 308
## Nicaragua Outlying-US(Guam-USVI-etc)
## 15 9
## Peru Philippines
## 15 97
## Poland Portugal
## 27 30
## Puerto-Rico Scotland
## 70 9
## South Taiwan
## 35 14
## Thailand Trinadad&Tobago
## 12 8
## United-States Vietnam
## 14662 19
## Yugoslavia
## 7
table(Adults$income.class)
##
## <=50K. >50K.
## 12435 3846
In the Adults dataset, there were 963 observations missing information for the workclass variable there were 966 observations missing information for the occupation variable, and there were 274 observations missing the native country variable. The 963 missing workclass observations make sense considering they are also the same observations that are missing occupation. The additional three observations missing within occupation are individuals who have never worked. Moreover, of the 274 observations missing native.country, 19 of them are also missing work class and occupation. It does not seem that there is a pattern to the missing variables. Due to the large sample size of the dataset, 16,281 observations and the fact that a total of 1,221 observations are impacted by a missing value in one or more variables, these observations will be omitted. This allows the analysis to avoid issues with missing or imputed data causing non-sensical or error-prone results within the statistics of the dataset, mainly issues with the mean, standard deviation, or other non-sensical values (observations that are not possible). The entire record for all 1,221 observations that were missing values across one or more variables will be removed, which is approximately 7% of the original dataset provided. To determine if any of the variables in the dateset should be dropped completely, a function will be used that checks for features (columns) and samples (rows) where more than 5% of data is missing1. In order to remove the entire variable and/or row (or observation) from the data, also known as Listwise Deletion, the missing values needs to be identified as MCAR (missing completely at random)5. Assuming MCAR, the dataset needs to be checked for too much missing data, which can also lead to problems. A safe threshold for large datasets is ~5% 1. In order to determine this the mice package can be used and the md.pattern() command will calculate the frequencies of the missing data patterns and determine if the missing data is systematic or random.
table_ofMD_WF = table(Adults$workclass == " ?")
table_ofMD_WF
##
## FALSE TRUE
## 15318 963
table_ofMD_Occ = table(Adults$occupation == " ?")
table_ofMD_Occ
##
## FALSE TRUE
## 15315 966
table_ofMD_Cty = table(Adults$native.country == " ?")
table_ofMD_Cty
##
## FALSE TRUE
## 16007 274
table_ofMD_2 = table(Adults$occupation == " ?" & Adults$workclass == " ?")
table_ofMD_2
##
## FALSE TRUE
## 15318 963
table_ofMD_all3 = table(Adults$native.country == " ?" & Adults$occupation == " ?" & Adults$workclass == " ?")
table_ofMD_all3
##
## FALSE TRUE
## 16262 19
aMiss <- function(x){sum(is.na(x))/length(x)*100}
tablesum <- apply(Adults2, 2, aMiss)
tablesum
all_obs <- apply(Adults2, 1, aMiss)
head(all_obs)
## [1] 0.00000 0.00000 0.00000 0.00000 13.33333 0.00000
As shown above, both workclass and occupation have more than 5% of data missing within the variable. Due to the variables having just over the 5% suggested threshold, at 5.91% and 5.93%, respectively, these variables will be kept in for analysis but it should be noted that our models may be better suited for prediction by omitting them and this will need to be explored. Models for comparison will be developed later. Additionally, this table shows the percentage of missing data per sample (or observation). For example, wherever there is a 13.33% this means the observation is missing 2 of the 15 variables.
md.pattern(Adults2, plot=TRUE, rotate.names = TRUE)
## age fnlwgt education education.num marital.status relationship race sex
## 15060 1 1 1 1 1 1 1 1
## 3 1 1 1 1 1 1 1 1
## 944 1 1 1 1 1 1 1 1
## 255 1 1 1 1 1 1 1 1
## 19 1 1 1 1 1 1 1 1
## 0 0 0 0 0 0 0 0
## capital.gain capital.loss hours.per.week income.class native.country
## 15060 1 1 1 1 1
## 3 1 1 1 1 1
## 944 1 1 1 1 1
## 255 1 1 1 1 0
## 19 1 1 1 1 0
## 0 0 0 0 274
## workclass occupation
## 15060 1 1 0
## 3 1 0 1
## 944 0 0 2
## 255 1 1 1
## 19 0 0 3
## 963 966 2203
As verified using the md.pattern() command, there are a few missing data observations that have a pattern. The column to the far left shows the frequency of each pattern and the bottom row shows the total number of missing entries per variable, as well as, the total number of missing cells 6. The Adults dataset does show patterns within the missing entries for Native country, Workclass, and Occupation. For Native country, 255 observations are missing only Native country, for Workclass, 944 observations are missing only Workclass, and for Occupation, only 3 observations are missing only Occupation. Lastly, 19 observations are missing Workclass, Occupation, and Native Country. The above command also shows that 15,060 observations that are complete data. The patterns can be seen in the last three columns of the table above, Using the VIM package and the aggr() command we will be able to visualize the patterns2.
MD_Pattern_Adults2 <- aggr(Adults2, prop=FALSE, numbers=TRUE, col=c('blue','green'), sortVars = TRUE, labels=names(Adults2), cex.axis = .7, gap = 3, ylab=c("Histogram of Missing Data", "Pattern"))
##
## Variables sorted by number of missings:
## Variable Count
## occupation 966
## workclass 963
## native.country 274
## age 0
## fnlwgt 0
## education 0
## education.num 0
## marital.status 0
## relationship 0
## race 0
## sex 0
## capital.gain 0
## capital.loss 0
## hours.per.week 0
## income.class 0
plot(MD_Pattern_Adults2)
Based on the output from the table above, the patterns and how many observations within each pattern are more clearly explored, showing that there is no distinct subset of the population that has been systematically removed, i.e. a certain age or country, etc. The observations missing data will be dropped from the dataset moving forward. The complete dataset for modeling includes 15,060 observations.
Adults3 <- Adults2 %>% replace_with_na_if(.predicate = is.character,
condition = ~.x == (" ?")) %>% drop_na() %>% droplevels()
summary(Adults3)
## age workclass fnlwgt education
## Min. :17.00 Length:15060 Min. : 13492 Length:15060
## 1st Qu.:28.00 Class :character 1st Qu.: 116655 Class :character
## Median :37.00 Mode :character Median : 177955 Mode :character
## Mean :38.77 Mean : 189616
## 3rd Qu.:48.00 3rd Qu.: 238589
## Max. :90.00 Max. :1490400
## education.num marital.status occupation relationship
## Min. : 1.00 Length:15060 Length:15060 Length:15060
## 1st Qu.: 9.00 Class :character Class :character Class :character
## Median :10.00 Mode :character Mode :character Mode :character
## Mean :10.11
## 3rd Qu.:13.00
## Max. :16.00
## race sex capital.gain capital.loss
## Length:15060 Length:15060 Min. : 0 Min. : 0.00
## Class :character Class :character 1st Qu.: 0 1st Qu.: 0.00
## Mode :character Mode :character Median : 0 Median : 0.00
## Mean : 1120 Mean : 89.04
## 3rd Qu.: 0 3rd Qu.: 0.00
## Max. :99999 Max. :3770.00
## hours.per.week native.country income.class
## Min. : 1.00 Length:15060 Length:15060
## 1st Qu.:40.00 Class :character Class :character
## Median :40.00 Mode :character Mode :character
## Mean :40.95
## 3rd Qu.:45.00
## Max. :99.00
str(Adults3)
## 'data.frame': 15060 obs. of 15 variables:
## $ age : int 25 38 28 44 34 63 24 55 65 36 ...
## $ workclass : chr " Private" " Private" " Local-gov" " Private" ...
## $ fnlwgt : int 226802 89814 336951 160323 198693 104626 369667 104996 184454 212465 ...
## $ education : chr " 11th" " HS-grad" " Assoc-acdm" " Some-college" ...
## $ education.num : int 7 9 12 10 6 15 10 4 9 13 ...
## $ marital.status: chr " Never-married" " Married-civ-spouse" " Married-civ-spouse" " Married-civ-spouse" ...
## $ occupation : chr " Machine-op-inspct" " Farming-fishing" " Protective-serv" " Machine-op-inspct" ...
## $ relationship : chr " Own-child" " Husband" " Husband" " Husband" ...
## $ race : chr " Black" " White" " White" " Black" ...
## $ sex : chr " Male" " Male" " Male" " Male" ...
## $ capital.gain : int 0 0 0 7688 0 3103 0 0 6418 0 ...
## $ capital.loss : int 0 0 0 0 0 0 0 0 0 0 ...
## $ hours.per.week: int 40 50 40 40 30 32 40 10 40 40 ...
## $ native.country: chr " United-States" " United-States" " United-States" " United-States" ...
## $ income.class : chr " <=50K." " <=50K." " >50K." " >50K." ...
Adults3$income.class2[Adults3$income.class == " <=50K."]= "$50,000 or less"
Adults3$income.class2[Adults3$income.class == " >50K."]= "$50,001 or more"
Adults_Updated <- Adults3[,-15]
head(Adults_Updated)
## age workclass fnlwgt education education.num marital.status
## 1 25 Private 226802 11th 7 Never-married
## 2 38 Private 89814 HS-grad 9 Married-civ-spouse
## 3 28 Local-gov 336951 Assoc-acdm 12 Married-civ-spouse
## 4 44 Private 160323 Some-college 10 Married-civ-spouse
## 5 34 Private 198693 10th 6 Never-married
## 6 63 Self-emp-not-inc 104626 Prof-school 15 Married-civ-spouse
## occupation relationship race sex capital.gain capital.loss
## 1 Machine-op-inspct Own-child Black Male 0 0
## 2 Farming-fishing Husband White Male 0 0
## 3 Protective-serv Husband White Male 0 0
## 4 Machine-op-inspct Husband Black Male 7688 0
## 5 Other-service Not-in-family White Male 0 0
## 6 Prof-specialty Husband White Male 3103 0
## hours.per.week native.country income.class2
## 1 40 United-States $50,000 or less
## 2 50 United-States $50,000 or less
## 3 40 United-States $50,001 or more
## 4 40 United-States $50,001 or more
## 5 30 United-States $50,000 or less
## 6 32 United-States $50,001 or more
ls(Adults_Updated)
## [1] "age" "capital.gain" "capital.loss" "education"
## [5] "education.num" "fnlwgt" "hours.per.week" "income.class2"
## [9] "marital.status" "native.country" "occupation" "race"
## [13] "relationship" "sex" "workclass"
str(Adults_Updated)
## 'data.frame': 15060 obs. of 15 variables:
## $ age : int 25 38 28 44 34 63 24 55 65 36 ...
## $ workclass : chr " Private" " Private" " Local-gov" " Private" ...
## $ fnlwgt : int 226802 89814 336951 160323 198693 104626 369667 104996 184454 212465 ...
## $ education : chr " 11th" " HS-grad" " Assoc-acdm" " Some-college" ...
## $ education.num : int 7 9 12 10 6 15 10 4 9 13 ...
## $ marital.status: chr " Never-married" " Married-civ-spouse" " Married-civ-spouse" " Married-civ-spouse" ...
## $ occupation : chr " Machine-op-inspct" " Farming-fishing" " Protective-serv" " Machine-op-inspct" ...
## $ relationship : chr " Own-child" " Husband" " Husband" " Husband" ...
## $ race : chr " Black" " White" " White" " Black" ...
## $ sex : chr " Male" " Male" " Male" " Male" ...
## $ capital.gain : int 0 0 0 7688 0 3103 0 0 6418 0 ...
## $ capital.loss : int 0 0 0 0 0 0 0 0 0 0 ...
## $ hours.per.week: int 40 50 40 40 30 32 40 10 40 40 ...
## $ native.country: chr " United-States" " United-States" " United-States" " United-States" ...
## $ income.class2 : chr "$50,000 or less" "$50,000 or less" "$50,001 or more" "$50,001 or more" ...
Adults_NumericOnly <- Adults_Updated[,c(1,3,5,11:13)]
plot(Adults_NumericOnly)
summary(Adults_NumericOnly)
## age fnlwgt education.num capital.gain
## Min. :17.00 Min. : 13492 Min. : 1.00 Min. : 0
## 1st Qu.:28.00 1st Qu.: 116655 1st Qu.: 9.00 1st Qu.: 0
## Median :37.00 Median : 177955 Median :10.00 Median : 0
## Mean :38.77 Mean : 189616 Mean :10.11 Mean : 1120
## 3rd Qu.:48.00 3rd Qu.: 238589 3rd Qu.:13.00 3rd Qu.: 0
## Max. :90.00 Max. :1490400 Max. :16.00 Max. :99999
## capital.loss hours.per.week
## Min. : 0.00 Min. : 1.00
## 1st Qu.: 0.00 1st Qu.:40.00
## Median : 0.00 Median :40.00
## Mean : 89.04 Mean :40.95
## 3rd Qu.: 0.00 3rd Qu.:45.00
## Max. :3770.00 Max. :99.00
table(Adults_Updated$occupation, Adults_Updated$income.class2)
##
## $50,000 or less $50,001 or more
## Adm-clerical 1561 258
## Armed-Forces 2 3
## Craft-repair 1543 447
## Exec-managerial 1062 930
## Farming-fishing 434 57
## Handlers-cleaners 644 52
## Machine-op-inspct 884 120
## Other-service 1532 64
## Priv-house-serv 87 2
## Prof-specialty 1077 893
## Protective-serv 235 97
## Sales 1339 485
## Tech-support 375 133
## Transport-moving 585 159
table(Adults_Updated$income.class2, Adults_Updated$education)
##
## 10th 11th 12th 1st-4th 5th-6th 7th-8th 9th Assoc-acdm
## $50,000 or less 380 541 186 69 151 246 208 357
## $50,001 or more 23 30 14 2 10 20 13 142
##
## Assoc-voc Bachelors Doctorate HS-grad Masters Preschool
## $50,000 or less 492 1474 50 4144 412 26
## $50,001 or more 160 1052 119 799 475 1
##
## Prof-school Some-college
## $50,000 or less 57 2567
## $50,001 or more 186 654
agevincome <-table(Adults_Updated$income.class2, Adults_Updated$age)
summary(agevincome)
## Number of cases in table: 15060
## Number of factors: 2
## Test for independence of all factors:
## Chisq = 1557.1, df = 72, p-value = 1.216e-277
## Chi-squared approximation may be incorrect
barplot(agevincome, legend=c("$50,000 or less", "$50,001 or more"), xlab="Age", ylab="Count of Individuals", col=c("blue","green"))
Most of the individuals who are making $50,001 or more are older than 30.
Using ggplot2, a bar graph was constructed to show occupation sorted by individuals with income levels above or below $50,000.
P2 <- ggplot(Adults_Updated, aes(x= occupation, fill= income.class2))+
geom_bar()
P2 + theme(axis.text.x = element_text(angle=45, hjust=1))
When comparing occupation by income class, it becomes obvious that there are few people in most occupations making more than $50,000. It is interesting to note that the closest breakdown of individuals between $50,000 or less and $50,001 or more is within the “Prof-specialty” and “Exec-managerial” occupations.
P2 <- ggplot(Adults_Updated, aes(x= workclass, fill= income.class2))+
geom_bar()
P2 + theme(axis.text.x = element_text(angle=45, hjust=1))
## age workclass fnlwgt education education.num marital.status
## 1 25 Private 226802 11th 7 Never-married
## 2 38 Private 89814 HS-grad 9 Married-civ-spouse
## 3 28 Local-gov 336951 Assoc-acdm 12 Married-civ-spouse
## 4 34 Private 198693 10th 6 Never-married
## 5 24 Private 369667 Some-college 10 Never-married
## 6 55 Private 104996 7th-8th 4 Married-civ-spouse
## occupation relationship race sex capital.gain capital.loss
## 1 Machine-op-inspct Own-child Black Male 0 0
## 2 Farming-fishing Husband White Male 0 0
## 3 Protective-serv Husband White Male 0 0
## 4 Other-service Not-in-family White Male 0 0
## 5 Other-service Unmarried White Female 0 0
## 6 Craft-repair Husband White Male 0 0
## hours.per.week native.country income.class2
## 1 40 United-States $50,000 or less
## 2 50 United-States $50,000 or less
## 3 40 United-States $50,001 or more
## 4 30 United-States $50,000 or less
## 5 40 United-States $50,000 or less
## 6 10 United-States $50,000 or less
Capital_Gains_Brkdn <- count(Adults_Updated, capital.gain, capital.gain > 90000)
head(Capital_Gains_Brkdn)
## capital.gain capital.gain > 90000 n
## 1 0 FALSE 13808
## 2 114 FALSE 2
## 3 401 FALSE 1
## 4 594 FALSE 14
## 5 914 FALSE 2
## 6 991 FALSE 1
tail(Capital_Gains_Brkdn)
## capital.gain capital.gain > 90000 n
## 105 25124 FALSE 2
## 106 25236 FALSE 3
## 107 27828 FALSE 24
## 108 34095 FALSE 1
## 109 41310 FALSE 1
## 110 99999 TRUE 81
There are 13,808 observations with capital.gain = 0. For those not equal to 0, there were 1,252 observations ranging from $144 to $99,999.
capital.gain_2 <- filter(Adults_Updated, capital.gain > 0)
summary(capital.gain_2)
## age workclass fnlwgt education
## Min. :17.00 Length:1252 Min. : 19302 Length:1252
## 1st Qu.:34.00 Class :character 1st Qu.:113383 Class :character
## Median :43.00 Mode :character Median :173453 Mode :character
## Mean :43.98 Mean :184846
## 3rd Qu.:52.00 3rd Qu.:230477
## Max. :90.00 Max. :764638
## education.num marital.status occupation relationship
## Min. : 1.00 Length:1252 Length:1252 Length:1252
## 1st Qu.: 9.00 Class :character Class :character Class :character
## Median :10.00 Mode :character Mode :character Mode :character
## Mean :11.07
## 3rd Qu.:13.00
## Max. :16.00
## race sex capital.gain capital.loss
## Length:1252 Length:1252 Min. : 114 Min. :0
## Class :character Class :character 1st Qu.: 3464 1st Qu.:0
## Mode :character Mode :character Median : 7298 Median :0
## Mean :13476 Mean :0
## 3rd Qu.:13550 3rd Qu.:0
## Max. :99999 Max. :0
## hours.per.week native.country income.class2
## Min. : 2.00 Length:1252 Length:1252
## 1st Qu.:40.00 Class :character Class :character
## Median :40.00 Mode :character Mode :character
## Mean :43.99
## 3rd Qu.:50.00
## Max. :99.00
min(capital.gain_2$capital.gain)
## [1] 114
max(capital.gain_2$capital.gain)
## [1] 99999
hist(capital.gain_2$capital.gain, breaks = 20, xlim = c(140,100000), xlab = "capital gains")
Based on the histogram of all observations where capital gain is larger than 0, there are 81 observations with capital gains equal to $99,999.
which(capital.gain_2$capital.gain >90000)
## [1] 34 36 38 69 72 85 98 114 116 117 125 135 167 174 197
## [16] 203 227 256 266 282 287 323 327 369 378 382 398 401 419 438
## [31] 439 456 463 491 502 504 522 526 549 561 579 580 598 599 612
## [46] 617 646 663 687 710 728 736 746 749 750 760 770 784 798 801
## [61] 805 851 893 928 965 974 976 992 995 1000 1007 1048 1070 1108 1110
## [76] 1120 1166 1229 1231 1232 1236
As LaRose and LaRose explain, “when mixing categorical and continuous variables, the min-max normalization may be preferred”(Larose, 155). This is due to the fact that using the z-score standardization allows for a wider scale than the min-max normalization and may reverse the conclusions for how variables are categorized.4 For this reason, the min-max transformation will be performed on all numeric variables.
MMNormTrans <- preProcess(Adults_Updated, method=c("range"))
Adults_CatModel <- predict(MMNormTrans, Adults_Updated)
str(Adults_CatModel)
## 'data.frame': 15060 obs. of 15 variables:
## $ age : num 0.11 0.288 0.151 0.37 0.233 ...
## $ workclass : chr " Private" " Private" " Local-gov" " Private" ...
## $ fnlwgt : num 0.1444 0.0517 0.219 0.0994 0.1254 ...
## $ education : chr " 11th" " HS-grad" " Assoc-acdm" " Some-college" ...
## $ education.num : num 0.4 0.533 0.733 0.6 0.333 ...
## $ marital.status: chr " Never-married" " Married-civ-spouse" " Married-civ-spouse" " Married-civ-spouse" ...
## $ occupation : chr " Machine-op-inspct" " Farming-fishing" " Protective-serv" " Machine-op-inspct" ...
## $ relationship : chr " Own-child" " Husband" " Husband" " Husband" ...
## $ race : chr " Black" " White" " White" " Black" ...
## $ sex : chr " Male" " Male" " Male" " Male" ...
## $ capital.gain : num 0 0 0 0.0769 0 ...
## $ capital.loss : num 0 0 0 0 0 0 0 0 0 0 ...
## $ hours.per.week: num 0.398 0.5 0.398 0.398 0.296 ...
## $ native.country: chr " United-States" " United-States" " United-States" " United-States" ...
## $ income.class2 : chr "$50,000 or less" "$50,000 or less" "$50,001 or more" "$50,001 or more" ...
Adults_CatModel$income.class2 <- as.factor(Adults_CatModel$income.class2)
str(Adults_CatModel)
## 'data.frame': 15060 obs. of 15 variables:
## $ age : num 0.11 0.288 0.151 0.37 0.233 ...
## $ workclass : chr " Private" " Private" " Local-gov" " Private" ...
## $ fnlwgt : num 0.1444 0.0517 0.219 0.0994 0.1254 ...
## $ education : chr " 11th" " HS-grad" " Assoc-acdm" " Some-college" ...
## $ education.num : num 0.4 0.533 0.733 0.6 0.333 ...
## $ marital.status: chr " Never-married" " Married-civ-spouse" " Married-civ-spouse" " Married-civ-spouse" ...
## $ occupation : chr " Machine-op-inspct" " Farming-fishing" " Protective-serv" " Machine-op-inspct" ...
## $ relationship : chr " Own-child" " Husband" " Husband" " Husband" ...
## $ race : chr " Black" " White" " White" " Black" ...
## $ sex : chr " Male" " Male" " Male" " Male" ...
## $ capital.gain : num 0 0 0 0.0769 0 ...
## $ capital.loss : num 0 0 0 0 0 0 0 0 0 0 ...
## $ hours.per.week: num 0.398 0.5 0.398 0.398 0.296 ...
## $ native.country: chr " United-States" " United-States" " United-States" " United-States" ...
## $ income.class2 : Factor w/ 2 levels "$50,000 or less",..: 1 1 2 2 1 2 1 1 2 1 ...
Outliers for Age
hist(Adults_Updated$age)
hist(Adults_CatModel$age)
boxplot(Adults_Updated$age)
boxplot(Adults_CatModel$age)
which(Adults_Updated$age >78)
## [1] 216 833 881 996 1291 1925 2116 2757 3235 3403 4315 6076
## [13] 6245 6387 6428 6445 6448 6620 6860 6865 6983 7339 7681 7789
## [25] 8280 8306 8341 8359 8577 9018 9130 9274 9422 9442 9922 10451
## [37] 10488 10563 10966 11060 11166 11328 11522 12065 12922 13204 13239 13362
## [49] 13486 13507 13637 13966 14732 14760 14796 14893 14933
boxplot.stats(Adults_Updated$age, do.out=FALSE)
## $stats
## [1] 17 28 37 48 78
##
## $n
## [1] 15060
##
## $conf
## [1] 36.7425 37.2575
##
## $out
## numeric(0)
IQR = 3rd Quartile - 1st Quartile for Age (non-normalized variable)
48-28
## [1] 20
1.5*20
## [1] 30
28-30
## [1] -2
48+30
## [1] 78
The 58 outliers for age can be found in observations where age is greater than 78 years old. These outliers are been considered when building the data models. It is been determined that the ages are reasonable values. Also, the histogram indicates a reasonable skew to the data. For this reason, the observations have been included in all models.
While it is possible that these outliers may affect the accuracy of models, the parameters of the project do not currently cap the age of the observations. Moreover, this subset of data does not indicate any strong patterns in employment type or hours.per.week.
AgeOutliers <- which(Adults_Updated$age >78)
All_Age_outliers <- Adults_Updated %>% filter(age < 78) %>% arrange(desc(age))
head(All_Age_outliers)
## age workclass fnlwgt education education.num marital.status
## 1 77 Private 89655 HS-grad 9 Widowed
## 2 77 Self-emp-inc 29702 HS-grad 9 Married-civ-spouse
## 3 77 Private 148949 10th 6 Married-civ-spouse
## 4 77 Private 133728 7th-8th 4 Married-civ-spouse
## 5 77 Private 189173 HS-grad 9 Married-civ-spouse
## 6 77 State-gov 267799 Doctorate 16 Married-spouse-absent
## occupation relationship race sex capital.gain capital.loss
## 1 Adm-clerical Not-in-family White Female 0 0
## 2 Other-service Husband White Male 0 0
## 3 Other-service Husband Black Male 3818 0
## 4 Craft-repair Husband White Male 0 0
## 5 Prof-specialty Husband White Male 0 0
## 6 Prof-specialty Not-in-family White Male 0 0
## hours.per.week native.country income.class2
## 1 40 United-States $50,000 or less
## 2 35 United-States $50,001 or more
## 3 30 United-States $50,000 or less
## 4 40 United-States $50,000 or less
## 5 35 United-States $50,000 or less
## 6 4 United-States $50,001 or more
Outliers for Education Number of Years
hist(Adults_Updated$education.num)
hist(Adults_CatModel$education.num)
boxplot(Adults_Updated$education.num)
boxplot(Adults_CatModel$education.num)
which(Adults_Updated$education.num <3)
boxplot.stats(Adults_Updated$education.num, do.out=FALSE)
## $stats
## [1] 3 9 10 13 16
##
## $n
## [1] 15060
##
## $conf
## [1] 9.9485 10.0515
##
## $out
## numeric(0)
IQR = 3rd Quartile - 1st Quartile for Education in number of years (non-normalized variable)
13-9
## [1] 4
1.5*4
## [1] 6
9-6
## [1] 3
13+6
## [1] 19
All_Education_Outliers <- Adults_Updated %>% filter(education.num<3) %>% arrange(desc(age))
head(All_Education_Outliers)
## age workclass fnlwgt education education.num marital.status
## 1 80 Local-gov 81534 1st-4th 2 Widowed
## 2 78 Private 454321 1st-4th 2 Widowed
## 3 77 Self-emp-not-inc 161552 Preschool 1 Widowed
## 4 72 Private 109385 1st-4th 2 Married-civ-spouse
## 5 65 Private 237024 1st-4th 2 Married-civ-spouse
## 6 64 State-gov 107732 1st-4th 2 Married-civ-spouse
## occupation relationship race sex capital.gain
## 1 Farming-fishing Not-in-family Asian-Pac-Islander Male 1086
## 2 Handlers-cleaners Other-relative White Male 0
## 3 Exec-managerial Not-in-family White Female 0
## 4 Other-service Husband Black Male 0
## 5 Other-service Husband White Male 0
## 6 Farming-fishing Husband Other Male 0
## capital.loss hours.per.week native.country income.class2
## 1 0 20 Philippines $50,000 or less
## 2 0 20 Nicaragua $50,000 or less
## 3 0 60 United-States $50,000 or less
## 4 0 36 United-States $50,000 or less
## 5 0 35 Mexico $50,000 or less
## 6 0 45 Columbia $50,000 or less
The 99 outliers for number of years of education can be found in observations where education.num is less than 3 years. These outliers have been considered when building the data models. It has been determined that, while obtaining fewer than 3 years of education is rare, it is not an impossible statistic. Furthermore, a table of the observations does not reveal any obvious correlations with any other variables, this is shown at the end of the outliers section and before the modeling section. For this reason, the observations have been included in all models.
EducationNOutliers <- which(Adults_Updated$education.num <3)
Outliers for Capital Gain
boxplot(Adults_Updated$capital.gain)
boxplot(capital.gain_2$capital.gain)
boxplot(Adults_CatModel$capital.gain)
boxplot.stats(Adults_Updated$capital.gain, do.out=FALSE)
## $stats
## [1] 0 0 0 0 0
##
## $n
## [1] 15060
##
## $conf
## [1] 0 0
##
## $out
## numeric(0)
boxplot.stats(Adults_CatModel$capital.gain, do.out=FALSE)
## $stats
## [1] 0 0 0 0 0
##
## $n
## [1] 15060
##
## $conf
## [1] 0 0
##
## $out
## numeric(0)
Based on the boxplots of capital gain, it is evident anything above $0 would be considered an outlier.Boxplot and IQR show the same observations, the observations listed above are outliers in terms of capital gains observations. There are 1,252 outliers for capital gains,those being any observations above $0. These outliers should be considered when building the data models. Since 81 of the outliers are $99,999, it seems logical that these represent a format constraint of the data inputs. Furthermore, the next highest value is $41,310. With such a significant cluster of outliers, we believe that future models should be run with and without these observations.
CGOutliers <- which(Adults_Updated$capital.gain >0)
Outliers for Capital Loss
hist(Adults_Updated$capital.loss)
hist(Adults_CatModel$capital.loss)
boxplot(Adults_Updated$capital.loss)
boxplot(Adults_CatModel$capital.loss)
which(Adults_Updated$capital.loss >0)
boxplot.stats(Adults_Updated$capital.loss, do.out=FALSE)
## $stats
## [1] 0 0 0 0 0
##
## $n
## [1] 15060
##
## $conf
## [1] 0 0
##
## $out
## numeric(0)
Capital_Loss_Brkdn <- count(Adults_Updated, capital.loss, capital.loss > 0)
head(Capital_Loss_Brkdn)
## capital.loss capital.loss > 0 n
## 1 0 FALSE 14347
## 2 213 TRUE 1
## 3 323 TRUE 2
## 4 625 TRUE 5
## 5 653 TRUE 1
## 6 1092 TRUE 4
tail(Capital_Loss_Brkdn)
## capital.loss capital.loss > 0 n
## 74 2559 TRUE 5
## 75 2603 TRUE 1
## 76 2824 TRUE 4
## 77 3004 TRUE 3
## 78 3175 TRUE 2
## 79 3770 TRUE 2
In the Adults_Updated dataset, there were 14,347 observations with $0 in capital losses. Any observations with values other than $0 for capital losses are considered outliers. The total number of outliers based on capital losses is 713 and the range for these observations is from $213 to $3,770. These outliers are to be considered when building the data models.
CLOutliers <- which(Adults_Updated$capital.loss >0)
Outliers for Hours Per Week
hist(Adults_Updated$hours.per.week)
hist(Adults_CatModel$hours.per.week)
boxplot(Adults_Updated$hours.per.week)
boxplot(Adults_CatModel$hours.per.week)
which(Adults_Updated$hours.per.week >55)
which(Adults_Updated$hours.per.week <27)
boxplot.stats(Adults_Updated$hours.per.week, do.out=FALSE)
## $stats
## [1] 33 40 40 45 52
##
## $n
## [1] 15060
##
## $conf
## [1] 39.93563 40.06437
##
## $out
## numeric(0)
IQR = 3rd Quartile - 1st Quartile for Age (non-normalized variable)
45-40
## [1] 5
1.5*5
## [1] 7.5
40-7.5
## [1] 32.5
45+7.5
## [1] 52.5
Under_Hrs <- count(Adults_Updated, hours.per.week < 32.5)
Under_Hrs
## hours.per.week < 32.5 n
## 1 FALSE 12768
## 2 TRUE 2292
Above_Hrs <- count(Adults_Updated, hours.per.week > 52.5)
Above_Hrs
## hours.per.week > 52.5 n
## 1 FALSE 13406
## 2 TRUE 1654
There are 3,946 outliers based on hours per week. The observations with hours above 52.5 hours a week (1,654) and below 32.5 hours a week (2,292) are all outliers. These outliers have been considered when building the data models, and ultimately, they were included in all modeling. While technically outliers, the histogram confirms that these datapoints are evenly distributed along a bell curve and do no appear to be heavily skewed, the result of imputation error or the result of a formatting constraint.
HPWOutliers_Above <- which(Adults_Updated$hours.per.week >52.5)
HPWOutliers_Below <- which(Adults_Updated$hours.per.week <32.5)
HPWOutliers <- which(Adults_Updated$hours.per.week <32.5 | Adults_Updated$hours.per.week >52.5)
New Column Added for Identifying All Outliers based on all Continuous Variables
Adults_Updated$Outliers <-(Adults_Updated$Outliers<- Adults_Updated$age >78| Adults_Updated$education.num <3| Adults_Updated$capital.gain >0| Adults_Updated$capital.loss >0| Adults_Updated$hours.per.week <32.5 | Adults_Updated$hours.per.week >52.5)
str(Adults_Updated)
## 'data.frame': 15060 obs. of 16 variables:
## $ age : int 25 38 28 44 34 63 24 55 65 36 ...
## $ workclass : chr " Private" " Private" " Local-gov" " Private" ...
## $ fnlwgt : int 226802 89814 336951 160323 198693 104626 369667 104996 184454 212465 ...
## $ education : chr " 11th" " HS-grad" " Assoc-acdm" " Some-college" ...
## $ education.num : int 7 9 12 10 6 15 10 4 9 13 ...
## $ marital.status: chr " Never-married" " Married-civ-spouse" " Married-civ-spouse" " Married-civ-spouse" ...
## $ occupation : chr " Machine-op-inspct" " Farming-fishing" " Protective-serv" " Machine-op-inspct" ...
## $ relationship : chr " Own-child" " Husband" " Husband" " Husband" ...
## $ race : chr " Black" " White" " White" " Black" ...
## $ sex : chr " Male" " Male" " Male" " Male" ...
## $ capital.gain : int 0 0 0 7688 0 3103 0 0 6418 0 ...
## $ capital.loss : int 0 0 0 0 0 0 0 0 0 0 ...
## $ hours.per.week: int 40 50 40 40 30 32 40 10 40 40 ...
## $ native.country: chr " United-States" " United-States" " United-States" " United-States" ...
## $ income.class2 : chr "$50,000 or less" "$50,000 or less" "$50,001 or more" "$50,001 or more" ...
## $ Outliers : logi FALSE FALSE FALSE TRUE TRUE TRUE ...
table(Adults_Updated$Outliers)
##
## FALSE TRUE
## 9549 5511
There are 5,511 observations or individuals that are identified as outliers based on one or more of the continuous variables (age, education number of years, capital gain,capital loss or hours per week) within the Adults dataset.
Next, it is important to determine which variables are correlated and to what degree (based on the correlation coefficient, r). This can help identify relationships and trends within the data and which variables are necessary for prediction and which can potential be removed or intentionally omitted. According to LaRose and LaRose, “Identify any variables that are perfectly correlated (i.e., r = 1.0 or r=-1.0). Do not retain both variables in the model, but rather omit one. Identify groups of variables that are correlated with each other. Then, later, during the modeling phase, apply dimension reduction methods, such as principal components analysis to these variables” (LaRose, 78).4 This is done because correlated variables can overemphasize a data component or even "cause the model to become unstable and deliver unreliable results (LaRose, 76).4
A good way to begin this analysis is to look at the correlation coefficients between all the continuous variables to see where they show r values closer to -1.00 or 1.00. The correlation coefficient (r) shows the strength and direction of the linear relationship between the two variables.
contV2 <- Adults_NumericOnly
cor(contV2)
## age fnlwgt education.num capital.gain capital.loss
## age 1.00000000 -0.074375137 0.02612265 0.07876001 0.057745040
## fnlwgt -0.07437514 1.000000000 -0.03600951 -0.01283892 0.006420974
## education.num 0.02612265 -0.036009508 1.00000000 0.13174987 0.085816886
## capital.gain 0.07876001 -0.012838920 0.13174987 1.00000000 -0.031875657
## capital.loss 0.05774504 0.006420974 0.08581689 -0.03187566 1.000000000
## hours.per.week 0.10275833 -0.010306013 0.13369132 0.09050098 0.057712162
## hours.per.week
## age 0.10275833
## fnlwgt -0.01030601
## education.num 0.13369132
## capital.gain 0.09050098
## capital.loss 0.05771216
## hours.per.week 1.00000000
From the above table, there seems to be no strong correlations among the continuous variables, all variables should be included in the data models. This is explored further through scatterplots in the Appendix.
In order to construct the three models used to classify income class, the modified subsets of the original dataset and all variables must undergo a final round of pre-processing. The data is first copied into two data sets: one which remains a mix of numeric and categorical variables, and one that is converted into 100% numeric data. The latter set is developed by expanding the categorical variables to include various factors. It should be noted that this format is not recommended for exploratory data analysis (as it makes for difficult interpretation because it is not a tidy data format); however, it is required to pass categorical variables to models which will only accept numeric inputs. For this, the ‘carat’ package is used to assign dummy variables to factors.
After the Adults_NumModel set is expanded, it is standardized using min/max standardization so that all data points fall between 0.0 and 1.0. This is performed in order to normalize the variables and results in standardizing the scale of effects for each variable.
The last part of the pre-processing section is to create training and test data sets by randomly subsetting 80% of the observations into a training set and the other 20% of the observations into the test set. Models will learn from the training set and their accuracy will be tested with the test set for overfitting. Data sets are created that exclude the capital gains outliers of $99,999. This percent split between training and test sets is accepted as an industry standard by which to allow the model to train using the bulk of the data while simultaneously providing a test set large enough to guarantee a high level of diversity. A 90%/10% split could also be used; however, in the absence of numerous random test sets, we prefer to use 20% of the data.
Pre-processing for models
#creating a categorical set and a numeric set
Adults_CatModel <- Adults_CatModel[,-c(4,8,16)]
Adults_NumModel <- Adults_Updated[,-c(4,8,16)]
#race
dmy <- dummyVars("~ race", data = Adults_NumModel)
dmy.race <- predict(dmy, Adults_NumModel)
Adults_NumModel <- cbind(Adults_NumModel, dmy.race)
Adults_NumModel <- Adults_NumModel[,-7]
#workclass
dmy <- dummyVars("~ workclass", data = Adults_NumModel)
dmy.workclass <- predict(dmy, Adults_NumModel)
Adults_NumModel <- cbind(Adults_NumModel, dmy.workclass)
Adults_NumModel <- Adults_NumModel[,-2]
#martial status
dmy <- dummyVars("~ marital.status", data = Adults_NumModel)
dmy.marital.status <- predict(dmy, Adults_NumModel)
Adults_NumModel <- cbind(Adults_NumModel, dmy.marital.status)
Adults_NumModel <- Adults_NumModel[,-4]
#occupation
dmy <- dummyVars("~ occupation", data = Adults_NumModel)
dmy.occupation <- predict(dmy, Adults_NumModel)
Adults_NumModel <- cbind(Adults_NumModel, dmy.occupation)
Adults_NumModel <- Adults_NumModel[,-4]
#native country
dmy <- dummyVars("~ native.country", data = Adults_NumModel)
dmy.native.country <- predict(dmy, Adults_NumModel)
Adults_NumModel <- cbind(Adults_NumModel, dmy.native.country)
Adults_NumModel <- Adults_NumModel[,-8]
#changing Males to 0 and Females to 1 and removing original sex column
Adults_NumModel$sex2[Adults_NumModel$sex == " Male"]=0
Adults_NumModel$sex2[Adults_NumModel$sex == " Female"]=1
Adults_NumModel <- Adults_NumModel[,-4]
Adults_NumModel <- Adults_NumModel %>% relocate(income.class2, .after = last_col())
### Min/Max for Adults_NumModel
MinMax <- preProcess(Adults_NumModel, method=c("range"))
Adults_NumModel <- predict(MinMax, Adults_NumModel)
#removing capital gains outliers
Adults_CatModel.Outlier <- Adults_CatModel %>% filter(capital.gain < max(capital.gain))
Adults_NumModel.Outlier <- Adults_NumModel %>% filter(capital.gain < max(capital.gain))
### Creating training and test sets (80/20% split)
set.seed(100)
samp <- sample(1:15060, 12048, replace = FALSE)
TrainCat <- Adults_CatModel[samp,]
TestCat <- Adults_CatModel[-samp,]
TrainNum <- Adults_NumModel[samp,]
TestNum <- Adults_NumModel[-samp,]
TrainCat.cap.outliers <- Adults_CatModel.Outlier[samp,]
TrainNum.cap.outliers <- Adults_NumModel.Outlier[samp,]
TestCat.cap.outliers <- Adults_CatModel.Outlier[-samp,]
TestNum.cap.outliers <- Adults_NumModel.Outlier[-samp,]
The first model that we test is the K-Nearest Neighbor (KNN) model. This model uses numeric independent variables to measure the euclidean distance between variables. The K value in the equation indicates the number of nearest neighboring observations the model considers. The ‘knn.cv’ model includes cross-validation in its computation, so the entire data set was used instead of the training set. The advantage in this is that the model is able to train on a greater number of observations. Through trial and error, we determine that a K values of 33 provides the most accurate model in classifying the income class.
Once the most accurate K value was determined, the model is also run without observations in which capital are equal-to or greater-than $99,999.
#k-nearest neighbor
knn33 <- knn.cv(Adults_NumModel[,-81], Adults_NumModel[,81], k = 33)
tknn33 <- table(Adults_NumModel$income.class2, knn33)
tknn33
## knn33
## $50,000 or less $50,001 or more
## $50,000 or less 10447 913
## $50,001 or more 1650 2050
knnaccu33 <- round((tknn33[1,1] + tknn33[2,2]) / sum(tknn33)*100, 2)
print(paste("Our KNN model is ",knnaccu33,"percent accurate with a K value of 33."))
## [1] "Our KNN model is 82.98 percent accurate with a K value of 33."
#removing capital gains outliers
knn.outlier <- knn.cv(Adults_NumModel.Outlier[,-81], Adults_NumModel.Outlier[,81], k = 33)
tknnOutlier <- table(Adults_NumModel.Outlier$income.class2, knn.outlier)
tknnOutlier
## knn.outlier
## $50,000 or less $50,001 or more
## $50,000 or less 10448 912
## $50,001 or more 1629 1990
knnaccuOutlier <- round((tknnOutlier[1,1] + tknnOutlier[2,2]) / sum(tknnOutlier)*100, 2)
print(paste("The model is",knnaccuOutlier,"percent accurate without capital gains >= $99,999."))
## [1] "The model is 83.04 percent accurate without capital gains >= $99,999."
The accuracy of the KNN model is slightly improved when capital gains outliers are omitted from the data set. While this is noteworthy, we do not make the assumption that all models will act accordingly. We continue to use both datasets throughout our modeling.
The second model we test is a Classification Tree. The ‘rpart’ model calculates the probability of the target variable given the probabilities of the independent variables. As this model accepts both numeric and categorical variables, we utilize the training and test data sets that include both variables, as well as the set that is only numeric. The ‘rpart.plot’ function renders an acceptable chart of the classifications. Additionally, we run the dataset that excludes the capital gains outliers.
# using categorical data set
CatTree <- rpart(income.class2 ~ ., data = TrainCat, method = "class")
rpart.plot(CatTree)
est.train.tree <- predict(CatTree, TrainCat, type = "class")
tree.train.table <- table(TrainCat$income.class2, est.train.tree)
tree.train.table
## est.train.tree
## $50,000 or less $50,001 or more
## $50,000 or less 8525 550
## $50,001 or more 1378 1595
accutree.train <- round((tree.train.table[1,1]+tree.train.table[2,2])/sum(tree.train.table)*100,2)
print(paste("Our classification tree is",accutree.train,"percent accurate using the training set."))
## [1] "Our classification tree is 84 percent accurate using the training set."
# with categorical test data
est.test.tree <- predict(CatTree, TestCat, type = "class")
tree.test.table <- table(TestCat$income.class2, est.test.tree)
tree.test.table
## est.test.tree
## $50,000 or less $50,001 or more
## $50,000 or less 2146 139
## $50,001 or more 345 382
accutree.test <- round((tree.test.table[1,1]+tree.test.table[2,2])/sum(tree.test.table)*100,2)
print(paste("Our classification tree is",accutree.test,"percent accurate using the test set."))
## [1] "Our classification tree is 83.93 percent accurate using the test set."
# with numeric data
NumTree <- rpart(income.class2 ~ ., data = TrainNum, method = "class")
rpart.plot(NumTree)
est.NumTrain.tree <- predict(NumTree, TrainNum, type = "class")
Numtree.train.table <- table(TrainNum$income.class2, est.NumTrain.tree)
Numtree.train.table
## est.NumTrain.tree
## $50,000 or less $50,001 or more
## $50,000 or less 8494 581
## $50,001 or more 1281 1692
accuNumTree.train <- round((Numtree.train.table[1,1]+Numtree.train.table[2,2])/sum(Numtree.train.table)*100,2)
print(paste("The classification tree is",accuNumTree.train,"percent accurate using the numeric training set."))
## [1] "The classification tree is 84.55 percent accurate using the numeric training set."
est.NumTest.tree <- predict(NumTree, TestNum, type = "class")
Numtree.test.table <- table(TestNum$income.class2, est.NumTest.tree)
Numtree.test.table
## est.NumTest.tree
## $50,000 or less $50,001 or more
## $50,000 or less 2134 151
## $50,001 or more 325 402
accuNumTree.test <- round((Numtree.test.table[1,1]+Numtree.test.table[2,2])/sum(Numtree.test.table)*100)
print(paste("The classification tree is",accuNumTree.test,"percent accurate using the numeric test set."))
## [1] "The classification tree is 84 percent accurate using the numeric test set."
est.OutlierTrain.tree <- predict(NumTree, TrainNum.cap.outliers, type = "class")
Numtree.Outlier.table <- table(TrainNum.cap.outliers$income.class2, est.OutlierTrain.tree)
Numtree.Outlier.table
## est.OutlierTrain.tree
## $50,000 or less $50,001 or more
## $50,000 or less 8473 599
## $50,001 or more 1295 1613
accuNumTreeOutlier.train <- round( (Numtree.Outlier.table[1,1]+Numtree.Outlier.table[2,2])/sum(Numtree.Outlier.table)*100,2)
print(paste("The classification tree is",accuNumTreeOutlier.train,"percent accurate using the numeric training set without capital gains."))
## [1] "The classification tree is 84.19 percent accurate using the numeric training set without capital gains."
est.OutlierTrain.tree <- predict(NumTree, TestNum.cap.outliers, type = "class")
Numtree.Outlier.TestTable <- table(TestNum.cap.outliers$income.class2, est.OutlierTrain.tree)
accuNumTreeOutlier.test <- round((Numtree.Outlier.TestTable[1,1]+Numtree.Outlier.TestTable[2,2])/sum(Numtree.Outlier.TestTable)*100,2)
print(paste("The classification tree is",accuNumTreeOutlier.test,"percent accurate using the numeric training set without capital gains."))
## [1] "The classification tree is 85.2 percent accurate using the numeric training set without capital gains."
The results of the classification tree models show that the model is highly effective in learning from the training data. The results from the test data were actually slightly more accurate than when using the training data. This is rare, and may be a product of the random sampling. This model is also marginally more accurate when using the all numeric dataset. Further research is needed, but this result is likely caused by the expanded number of variables in the numeric dataset.
Regarding the inclusion of capital gains outliers, the model accuracy is not improved by omitting outliers.
The final model we test is a neural network. Acting in same fashion as a biological neural network, the independent inputs are passed through hidden nodes before they reach the output. The number of hidden nodes determines the complexity of the network and, in most cases,the accuracy of the model. The optimal number of nodes (‘size’ input) was reached through trial and error.
neuronet <- nnet(income.class2 ~ ., TrainCat, size = 12, maxit = 2000)
## # weights: 925
## initial value 7552.141318
## iter 10 value 4433.614072
## iter 20 value 3988.648238
## iter 30 value 3768.480483
## iter 40 value 3665.206615
## iter 50 value 3595.230739
## iter 60 value 3515.572438
## iter 70 value 3429.873544
## iter 80 value 3375.787692
## iter 90 value 3330.771082
## iter 100 value 3301.091894
## iter 110 value 3277.558026
## iter 120 value 3259.546942
## iter 130 value 3244.858710
## iter 140 value 3225.218682
## iter 150 value 3207.807109
## iter 160 value 3194.882171
## iter 170 value 3183.757705
## iter 180 value 3174.957568
## iter 190 value 3168.052302
## iter 200 value 3162.828960
## iter 210 value 3156.421323
## iter 220 value 3149.462459
## iter 230 value 3139.071010
## iter 240 value 3131.132907
## iter 250 value 3121.244632
## iter 260 value 3111.099618
## iter 270 value 3098.990054
## iter 280 value 3086.255558
## iter 290 value 3071.991418
## iter 300 value 3062.033493
## iter 310 value 3054.138433
## iter 320 value 3047.626867
## iter 330 value 3043.710684
## iter 340 value 3037.956656
## iter 350 value 3033.946186
## iter 360 value 3030.649482
## iter 370 value 3027.045318
## iter 380 value 3024.198476
## iter 390 value 3021.760005
## iter 400 value 3018.762047
## iter 410 value 3015.561039
## iter 420 value 3010.942512
## iter 430 value 3005.294207
## iter 440 value 2997.980463
## iter 450 value 2984.488882
## iter 460 value 2971.177285
## iter 470 value 2967.867188
## iter 480 value 2966.559369
## iter 490 value 2965.880647
## iter 500 value 2965.372854
## iter 510 value 2964.510261
## iter 520 value 2963.602925
## iter 530 value 2963.404966
## iter 540 value 2963.310224
## iter 550 value 2963.146404
## iter 560 value 2963.102699
## iter 570 value 2963.054454
## iter 580 value 2962.803637
## iter 590 value 2962.751746
## iter 600 value 2962.739251
## iter 610 value 2962.735827
## iter 620 value 2962.734374
## final value 2962.733850
## converged
source_url("https://gist.githubusercontent.com/fawda123/7471137/raw/466c1474d0a505ff044412703516c34f1a4684a5/nnet_plot_update.r")
## SHA-1 hash of file is 74c80bd5ddbc17ab3ae5ece9c0ed9beb612e87ef
plot.nnet(neuronet)
## Loading required package: reshape
## Warning: package 'reshape' was built under R version 4.0.3
##
## Attaching package: 'reshape'
## The following object is masked from 'package:class':
##
## condense
## The following object is masked from 'package:dplyr':
##
## rename
## The following objects are masked from 'package:tidyr':
##
## expand, smiths
est.train.nnet <- predict(neuronet, TrainCat, type = "class")
nnet.train.table <- table(TrainCat$income.class2, est.train.nnet)
nnet.train.table
## est.train.nnet
## $50,000 or less $50,001 or more
## $50,000 or less 8527 548
## $50,001 or more 835 2138
accuNeuroTrain <- round((nnet.train.table[1,1]+nnet.train.table[2,2])/sum(nnet.train.table)*100, 2)
print(paste("The neural network is",accuNumTreeOutlier.test,"percent accurate using the training set."))
## [1] "The neural network is 85.2 percent accurate using the training set."
est.test.nnet <- predict(neuronet, TestCat, type = "class")
nnet.test.table <- table(TestCat$income.class2, est.test.nnet)
nnet.test.table
## est.test.nnet
## $50,000 or less $50,001 or more
## $50,000 or less 2068 217
## $50,001 or more 289 438
accuNeuroTest <- round((nnet.test.table[1,1]+nnet.test.table[2,2])/sum(nnet.test.table)*100,2)
print(paste("The neural network is",accuNeuroTest,"percent accurate using the test set."))
## [1] "The neural network is 83.2 percent accurate using the test set."
neuronetOutlier <- nnet(income.class2 ~ ., TrainCat.cap.outliers, size = 12, maxit = 2000)
## # weights: 925
## initial value 7864.019938
## iter 10 value 4359.345073
## iter 20 value 4014.169594
## iter 30 value 3851.216239
## iter 40 value 3752.494821
## iter 50 value 3634.009812
## iter 60 value 3557.540748
## iter 70 value 3507.596000
## iter 80 value 3443.251643
## iter 90 value 3392.274843
## iter 100 value 3339.842681
## iter 110 value 3309.006820
## iter 120 value 3282.351822
## iter 130 value 3260.089445
## iter 140 value 3241.794525
## iter 150 value 3226.446812
## iter 160 value 3215.061360
## iter 170 value 3202.254023
## iter 180 value 3191.827580
## iter 190 value 3185.702072
## iter 200 value 3180.834070
## iter 210 value 3175.488630
## iter 220 value 3170.374565
## iter 230 value 3166.034203
## iter 240 value 3161.840127
## iter 250 value 3157.803293
## iter 260 value 3153.486203
## iter 270 value 3149.413036
## iter 280 value 3145.916200
## iter 290 value 3141.997192
## iter 300 value 3139.146562
## iter 310 value 3136.340690
## iter 320 value 3133.372236
## iter 330 value 3131.156089
## iter 340 value 3129.156098
## iter 350 value 3127.166687
## iter 360 value 3124.915179
## iter 370 value 3123.025817
## iter 380 value 3121.085554
## iter 390 value 3118.563946
## iter 400 value 3115.617177
## iter 410 value 3113.630854
## iter 420 value 3110.843488
## iter 430 value 3107.656758
## iter 440 value 3104.587279
## iter 450 value 3101.901386
## iter 460 value 3100.231060
## iter 470 value 3098.921020
## iter 480 value 3098.128777
## iter 490 value 3097.073419
## iter 500 value 3095.995085
## iter 510 value 3095.026221
## iter 520 value 3093.919997
## iter 530 value 3092.833244
## iter 540 value 3090.971085
## iter 550 value 3089.055287
## iter 560 value 3087.919341
## iter 570 value 3086.849389
## iter 580 value 3086.034270
## iter 590 value 3085.392120
## iter 600 value 3084.780812
## iter 610 value 3084.325282
## iter 620 value 3083.834814
## iter 630 value 3083.494235
## iter 640 value 3083.165797
## iter 650 value 3082.836459
## iter 660 value 3082.591911
## iter 670 value 3082.300170
## iter 680 value 3081.874064
## iter 690 value 3081.602754
## iter 700 value 3081.444231
## iter 710 value 3081.346245
## iter 720 value 3081.242607
## iter 730 value 3081.102443
## iter 740 value 3080.976904
## iter 750 value 3080.777696
## iter 760 value 3080.563885
## iter 770 value 3080.390172
## iter 780 value 3080.184533
## iter 790 value 3080.005283
## iter 800 value 3079.805771
## iter 810 value 3079.630719
## iter 820 value 3079.485379
## iter 830 value 3079.329283
## iter 840 value 3079.141042
## iter 850 value 3079.026082
## iter 860 value 3078.904930
## iter 870 value 3078.827663
## iter 880 value 3078.774900
## iter 890 value 3078.719398
## iter 900 value 3078.622810
## iter 910 value 3078.560264
## iter 920 value 3078.499980
## iter 930 value 3078.432886
## iter 940 value 3078.379092
## iter 950 value 3078.325757
## iter 960 value 3078.280438
## iter 970 value 3078.252646
## iter 980 value 3078.229042
## iter 990 value 3078.211967
## iter1000 value 3078.194648
## iter1010 value 3078.177356
## iter1020 value 3078.162001
## iter1030 value 3078.142118
## iter1040 value 3078.128868
## iter1050 value 3078.120769
## iter1060 value 3078.106646
## iter1070 value 3078.090232
## iter1080 value 3078.077286
## iter1090 value 3078.067858
## iter1100 value 3078.051835
## iter1110 value 3078.033993
## iter1120 value 3078.001297
## iter1130 value 3077.977457
## iter1140 value 3077.962030
## iter1150 value 3077.934615
## iter1160 value 3077.917206
## iter1170 value 3077.889605
## iter1180 value 3077.869117
## iter1190 value 3077.837896
## iter1200 value 3077.815072
## iter1210 value 3077.787919
## iter1220 value 3077.564873
## iter1230 value 3077.538882
## iter1240 value 3077.440655
## iter1250 value 3077.400800
## iter1260 value 3077.354974
## iter1270 value 3077.326164
## iter1280 value 3077.306054
## iter1290 value 3077.286337
## iter1300 value 3077.266745
## iter1310 value 3077.248757
## iter1320 value 3077.236299
## iter1330 value 3077.224937
## iter1340 value 3077.205868
## iter1350 value 3077.183553
## iter1360 value 3077.180036
## iter1370 value 3077.176571
## iter1380 value 3077.173578
## iter1390 value 3077.168581
## iter1400 value 3077.163128
## iter1410 value 3077.158087
## iter1420 value 3077.152923
## iter1430 value 3077.148663
## iter1440 value 3077.142974
## final value 3077.138895
## converged
est.trainOutlier.nnet <- predict(neuronetOutlier, TrainCat.cap.outliers,type = "class")
nnet.trainOutlier.table <- table(TrainCat.cap.outliers$income.class2, est.trainOutlier.nnet)
nnet.trainOutlier.table
## est.trainOutlier.nnet
## $50,000 or less $50,001 or more
## $50,000 or less 8510 562
## $50,001 or more 933 1975
accuNeuroOutlier.train <- round((nnet.trainOutlier.table[1,1]+ nnet.trainOutlier.table[2,2])/sum(nnet.trainOutlier.table)*100,)
print(paste("The neural network is",accuNeuroOutlier.train,"percent accurate using the training set without capital gains outliers."))
## [1] "The neural network is 88 percent accurate using the training set without capital gains outliers."
est.testOutlier.nnet <- predict(neuronet, TestCat.cap.outliers, type = "class")
nnet.testOutlier.table <- table(TestCat.cap.outliers$income.class2, est.testOutlier.nnet)
accuNeuroOutlier.test<- round((nnet.testOutlier.table[1,1]+nnet.testOutlier.table[2,2])/sum(nnet.testOutlier.table)*100,2)
nnet.testOutlier.table
## est.testOutlier.nnet
## $50,000 or less $50,001 or more
## $50,000 or less 2138 150
## $50,001 or more 215 496
print(paste("The neural network is",accuNeuroOutlier.test,"percent accurate using the test set without capital gains outliers."))
## [1] "The neural network is 87.83 percent accurate using the test set without capital gains outliers."
JoblessTrain <- TrainCat.cap.outliers %>% select(-workclass, -occupation)
JoblessTest <- TestCat.cap.outliers %>% select(-workclass, -occupation)
jobless <- nnet(income.class2 ~ .,JoblessTrain, size = 12, maxit = 2000 )
## # weights: 697
## initial value 9817.205207
## iter 10 value 4489.661816
## iter 20 value 4097.696311
## iter 30 value 3969.028544
## iter 40 value 3877.747735
## iter 50 value 3788.771730
## iter 60 value 3716.922948
## iter 70 value 3667.826328
## iter 80 value 3629.332332
## iter 90 value 3598.808544
## iter 100 value 3580.069004
## iter 110 value 3563.204011
## iter 120 value 3544.993231
## iter 130 value 3531.601566
## iter 140 value 3522.496119
## iter 150 value 3515.038532
## iter 160 value 3510.762539
## iter 170 value 3507.147291
## iter 180 value 3503.956239
## iter 190 value 3499.805189
## iter 200 value 3494.490264
## iter 210 value 3491.817027
## iter 220 value 3488.105833
## iter 230 value 3485.615833
## iter 240 value 3483.406555
## iter 250 value 3479.920896
## iter 260 value 3476.684622
## iter 270 value 3472.012551
## iter 280 value 3467.497696
## iter 290 value 3463.101936
## iter 300 value 3460.792174
## iter 310 value 3458.527119
## iter 320 value 3456.081766
## iter 330 value 3451.308933
## iter 340 value 3444.920326
## iter 350 value 3440.881802
## iter 360 value 3436.968986
## iter 370 value 3431.277828
## iter 380 value 3423.611504
## iter 390 value 3416.752803
## iter 400 value 3412.912384
## iter 410 value 3409.799541
## iter 420 value 3406.503142
## iter 430 value 3404.531429
## iter 440 value 3403.138207
## iter 450 value 3401.912735
## iter 460 value 3400.370399
## iter 470 value 3399.280193
## iter 480 value 3398.504253
## iter 490 value 3397.788974
## iter 500 value 3397.244895
## iter 510 value 3396.743398
## iter 520 value 3396.259363
## iter 530 value 3395.761922
## iter 540 value 3395.123488
## iter 550 value 3394.111128
## iter 560 value 3392.680211
## iter 570 value 3391.461341
## iter 580 value 3390.103602
## iter 590 value 3389.014159
## iter 600 value 3387.986310
## iter 610 value 3387.219067
## iter 620 value 3386.450007
## iter 630 value 3386.116354
## iter 640 value 3385.972585
## iter 650 value 3385.838267
## iter 660 value 3385.761592
## iter 670 value 3385.719235
## iter 680 value 3385.704511
## iter 690 value 3385.688491
## iter 700 value 3385.681509
## iter 710 value 3385.675970
## final value 3385.675014
## converged
estjoblessTest <- predict(jobless, JoblessTest, type = "class")
joblesstable <- table(JoblessTest$income.class2, estjoblessTest)
accuJobless <- round((joblesstable[1,1]+joblesstable[2,2])/sum(joblesstable)*100,2)
print(paste("Without the 'occupation' or 'workclass' columns, the model has been reduced to", accuJobless, "percent accuracy."))
## [1] "Without the 'occupation' or 'workclass' columns, the model has been reduced to 84.76 percent accuracy."
The results of the neural network reveal that the model is superior to the KNN and classification tree models. It is notable that model is slightly improved with the omission of capital gains outliers.
The results of the neural network reveal that the model is in line with KNN and the classification tree models. It is notable, however, that model improves with the omission of capital gains outliers.
From our analysis, the neural network model developed with complete observations and the omission of outliers for capital gains performed the most efficiently, with an accuracy rate of 87.8% when running the training set against the test set. This model is the most accurate model in classifying the income levels for the given observations. The model correctly classified 87.8% of the test data (resulting in an error rate of only 12.2%). Further research is needed in order to fine tune our model to its full potential. A more detailed analysis of each variable may suggest that specific variables be omitted from the model while others may benefit from being weighted more heavily; nevertheless, the findings of this project alone may yield a useful framework to others.
2.“Chapter 15. Advanced Methods for Missing Data · R in Action.” · R In Action, livebook.manning.com/book/r-in-action/chapter-15/133.
Hastie, T., Tibshirani, R., Walther, G. (2000). Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society. Retrieved 11/19/20, from https://web.stanford.edu/~hastie/Papers/gap.pdf
Larose, D. T., Larose, C. D. (2014). Discovering Knowledge in Data: an Introduction to Data Mining. Second ed., Wiley, 2014.
Roy, B. “All About Missing Data Handling.” Medium, Towards Data Science, 22 Feb. 2020, (https://towardsdatascience.com/all-about-missing-data-handling-b94b8b5d2184).
van Buuren, S. (2020, November 23) “4.1 Missing Data Pattern.” stefvanbuuren.name/fimd/missing-data-pattern.html.
Zscore Transformations of Adults Dataset Variables
attach(Adults)
mean(age)
## [1] 38.76746
sd(age)
## [1] 13.84919
mean(fnlwgt)
## [1] 189435.7
sd(fnlwgt)
## [1] 105714.9
mean(education.num)
## [1] 10.07291
sd(education.num)
## [1] 2.567545
mean(capital.gain)
## [1] 1081.905
sd(capital.gain)
## [1] 7583.936
mean(capital.loss)
## [1] 87.89927
sd(capital.loss)
## [1] 403.1053
mean(hours.per.week)
## [1] 40.39224
sd(hours.per.week)
## [1] 12.47933
Adults_Updated$AgeZscore = (Adults_Updated$age - mean(Adults_Updated$age))/sd(Adults_Updated$age)
Adults_Updated$FNLWGTZscore = (Adults_Updated$fnlwgt - mean(Adults_Updated$fnlwgt))/sd(Adults_Updated$fnlwgt)
Adults_Updated$EduNumZscore = (Adults_Updated$education.num - mean(Adults_Updated$education.num))/sd(Adults_Updated$education.num)
Adults_Updated$CGZscore = (Adults_Updated$capital.gain - mean(Adults_Updated$capital.gain))/sd(Adults_Updated$capital.gain)
Adults_Updated$CLZscore = (Adults_Updated$capital.loss - mean(Adults_Updated$capital.loss))/sd(Adults_Updated$capital.loss)
Adults_Updated$HrsZscore = (Adults_Updated$hours.per.week - mean(Adults_Updated$hours.per.week))/sd(Adults_Updated$hours.per.week)
ls.str(Adults_Updated)
## age : int [1:15060] 25 38 28 44 34 63 24 55 65 36 ...
## AgeZscore : num [1:15060] -1.029 -0.0574 -0.8048 0.391 -0.3564 ...
## capital.gain : int [1:15060] 0 0 0 7688 0 3103 0 0 6418 0 ...
## capital.loss : int [1:15060] 0 0 0 0 0 0 0 0 0 0 ...
## CGZscore : num [1:15060] -0.145 -0.145 -0.145 0.853 -0.145 ...
## CLZscore : num [1:15060] -0.219 -0.219 -0.219 -0.219 -0.219 ...
## education : chr [1:15060] " 11th" " HS-grad" " Assoc-acdm" " Some-college" " 10th" ...
## education.num : int [1:15060] 7 9 12 10 6 15 10 4 9 13 ...
## EduNumZscore : num [1:15060] -1.2165 -0.4349 0.7376 -0.0441 -1.6073 ...
## fnlwgt : int [1:15060] 226802 89814 336951 160323 198693 104626 369667 104996 184454 212465 ...
## FNLWGTZscore : num [1:15060] 0.3521 -0.945 1.395 -0.2774 0.0859 ...
## hours.per.week : int [1:15060] 40 50 40 40 30 32 40 10 40 40 ...
## HrsZscore : num [1:15060] -0.0789 0.7501 -0.0789 -0.0789 -0.9079 ...
## income.class2 : chr [1:15060] "$50,000 or less" "$50,000 or less" "$50,001 or more" ...
## marital.status : chr [1:15060] " Never-married" " Married-civ-spouse" " Married-civ-spouse" ...
## native.country : chr [1:15060] " United-States" " United-States" " United-States" ...
## occupation : chr [1:15060] " Machine-op-inspct" " Farming-fishing" " Protective-serv" ...
## Outliers : logi [1:15060] FALSE FALSE FALSE TRUE TRUE TRUE ...
## race : chr [1:15060] " Black" " White" " White" " Black" " White" " White" ...
## relationship : chr [1:15060] " Own-child" " Husband" " Husband" " Husband" ...
## sex : chr [1:15060] " Male" " Male" " Male" " Male" " Male" " Male" " Female" ...
## workclass : chr [1:15060] " Private" " Private" " Local-gov" " Private" " Private" ...
hist(Adults_Updated$age)
hist(Adults_Updated$AgeZscore)
hist(Adults_Updated$capital.gain)
hist(Adults_Updated$CGZscore)
Scatterplots to Explore the Correlation between Continuous Variables
plot(Adults_NumericOnly$age, Adults_NumericOnly$fnlwgt)
plot(Adults_NumericOnly$age, Adults_NumericOnly$education.num)
plot(Adults_NumericOnly$age, Adults_NumericOnly$capital.gain)
plot(Adults_NumericOnly$age, Adults_NumericOnly$capital.loss)
plot(Adults_NumericOnly$age, Adults_NumericOnly$hours.per.week)
plot(Adults_NumericOnly$fnlwgt, Adults_NumericOnly$education.num)
plot(Adults_NumericOnly$fnlwgt, Adults_NumericOnly$capital.gain)
plot(Adults_NumericOnly$fnlwgt, Adults_NumericOnly$capital.loss)
plot(Adults_NumericOnly$fnlwgt, Adults_NumericOnly$hours.per.week)
plot(Adults_NumericOnly$education.num, Adults_NumericOnly$capital.gain)
plot(Adults_NumericOnly$education.num, Adults_NumericOnly$capital.loss)
plot(Adults_NumericOnly$education.num, Adults_NumericOnly$hours.per.week)
plot(Adults_NumericOnly$capital.gain, Adults_NumericOnly$capital.loss)
plot(Adults_NumericOnly$capital.gain, Adults_NumericOnly$hours.per.week)
plot(Adults_NumericOnly$capital.loss, Adults_NumericOnly$hours.per.week)