library(forecast)
library(knitr)
library(pander)
library(tidyverse)
library(rmarkdown)
library(class)
library(caret)
library(nnet) 
library(scales)
library(finalfit)
library(naniar)
library(mice)
library(factoextra)
library(rpart)
library(rpart.plot)
library(devtools)
library(VIM)

Executive Summary

The problem of how to classify a population for the benefit of enhanced business operations and strategy is central to many departments (from Marketing to Risk Analysis). In our research, we attempt to analyze a standard dataset of socio-economic demographics using the Cross Industry Standardized Process for Data Mining (CRISP-DM) framework and apply three supervised machine learning algorithms to classify each observation into an income class. Considerable resources must be spent in preparing the data for modeling and initial models may prove to be of little use. That said, the CRISP-DM framework emphasizes a cyclical process by which early decisions are be revisited as new information is realized. The CRISP-DM process allows a process to follow in order to avoid duplication of efforts across team members and allows analysts the opportunity for continued improvement through the data mining and information discovery process.

Through our data manipulation, exploration, analysis, and model development, assumptions evolved and with them revisions to our models which resulted in model improvements. As the following report will demonstrate, the Adults dataset containing socio-economic demographic data from the U.S. Census can be understood and managed in order to develop fairly accurate models to predict the income class of observations based on certain variables. The most important and accurate model developed based on a subset of the data that observed a complete dataset (all observations missing data were removed). The neural network model developed with complete observations and the omission of outliers for capital gains performed the most efficiently, with an accuracy rate of 87.8% when running the training set against the test set. This process is explained in the pages that follow so that the reader may gain an understanding of the CRISP-DM framework, as well as the final results shown on the socio-economic demographics dataset.

Business Understanding Section

The goal of this analysis is to explore and build a model that will correctly classify the target variable (income class) as a function of the other identified variables in the dataset. This will be accomplished by understanding the data, tidying the data, and making any necessary adjustments in order to build models to classify the target variable and quantify the accuracy of the classifications Then the best model will be identified.

First the Provided dataset will be introduced and the ‘str()’ command is used to see the structure of the dataset, this dataset is a subset of the UCI Adults dataset.

Importing the Adults dataset

setwd("C:/Users/Marissa.Valente/Documents/RStuff2020")
Adults <- read.csv("Provideddataset.csv", header = TRUE, sep=",")
str(Adults)

## 'data.frame':    16281 obs. of  15 variables:
##  $ age           : int  25 38 28 44 18 34 29 63 24 55 ...
##  $ workclass     : chr  " Private" " Private" " Local-gov" " Private" ...
##  $ fnlwgt        : int  226802 89814 336951 160323 103497 198693 227026 104626 369667 104996 ...
##  $ education     : chr  " 11th" " HS-grad" " Assoc-acdm" " Some-college" ...
##  $ education.num : int  7 9 12 10 10 6 9 15 10 4 ...
##  $ marital.status: chr  " Never-married" " Married-civ-spouse" " Married-civ-spouse" " Married-civ-spouse" ...
##  $ occupation    : chr  " Machine-op-inspct" " Farming-fishing" " Protective-serv" " Machine-op-inspct" ...
##  $ relationship  : chr  " Own-child" " Husband" " Husband" " Husband" ...
##  $ race          : chr  " Black" " White" " White" " Black" ...
##  $ sex           : chr  " Male" " Male" " Male" " Male" ...
##  $ capital.gain  : int  0 0 0 7688 0 0 0 3103 0 0 ...
##  $ capital.loss  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ hours.per.week: int  40 50 40 40 30 30 40 32 40 10 ...
##  $ native.country: chr  " United-States" " United-States" " United-States" " United-States" ...
##  $ income.class  : chr  " <=50K." " <=50K." " >50K." " >50K." ...

head(Adults)

##   age  workclass fnlwgt     education education.num      marital.status
## 1  25    Private 226802          11th             7       Never-married
## 2  38    Private  89814       HS-grad             9  Married-civ-spouse
## 3  28  Local-gov 336951    Assoc-acdm            12  Married-civ-spouse
## 4  44    Private 160323  Some-college            10  Married-civ-spouse
## 5  18          ? 103497  Some-college            10       Never-married
## 6  34    Private 198693          10th             6       Never-married
##           occupation   relationship   race     sex capital.gain capital.loss
## 1  Machine-op-inspct      Own-child  Black    Male            0            0
## 2    Farming-fishing        Husband  White    Male            0            0
## 3    Protective-serv        Husband  White    Male            0            0
## 4  Machine-op-inspct        Husband  Black    Male         7688            0
## 5                  ?      Own-child  White  Female            0            0
## 6      Other-service  Not-in-family  White    Male            0            0
##   hours.per.week native.country income.class
## 1             40  United-States       <=50K.
## 2             50  United-States       <=50K.
## 3             40  United-States        >50K.
## 4             40  United-States        >50K.
## 5             30  United-States       <=50K.
## 6             30  United-States       <=50K.

ls(Adults)

##  [1] "age"            "capital.gain"   "capital.loss"   "education"     
##  [5] "education.num"  "fnlwgt"         "hours.per.week" "income.class"  
##  [9] "marital.status" "native.country" "occupation"     "race"          
## [13] "relationship"   "sex"            "workclass"

This dataset is a subset of the Adults dataset, which comes from U.S. Census data. It contains 15 variables and 16,281 observations. The observations are U.S. people and the data is collected from the U.S. Census. There are 15 variables showing various demographics of these people (observations).

Variables:

Age: The age of the individual in that observation (continuous)

Workclass: The working class for the individual’s occupation (categorical)

Fnlwgt: Final weight, which is the number of units in the target population that the responding unit represents (continuous)

Education: The highest level of education completed by the individual (categorical)

Education-Num: The number of years the individual was in an education program (continuous)

Marital-Status: The marital status of the individual (categorical)

Occupation: The occupation category of the individual (categorical)

Relationship: Relationship to the individual reporting (categorical)

Race: The race of the individual reporting (categorical)

Sex: The sex of the individual reporting (categorical)

Capital-Gain: Capital gain value of the individual (continuous)

Capital-Loss: Capital loss value of the individual (continuous)

Hours-Per-Week: The hours worked per week by the individual (continuous)

Native-Country: The native country of the individual (categorical)

Income-Class: The income class (at, above, or below $50,000) of the individual (categorical)

Data Pre-Processing and Exploratory Data Analysis

To begin the data pre-processing and exploratory data analysis (EDA) portion of this project, missing data within the Adults dataset will be identified and a decision on what to do with these data will be explored and identified.

Look at Missing Data within Variables and Evaluate If Omission or Imputation Should be Performed

Our missing data was categorized as " ?" instead of “N/A” or blanks.This means we need to look for " ?" through the data in order to know what cells/observations need to be dealt with as missing data and either imputed or omitted.

Adults2 <- na_if(Adults, " ?")

summary(Adults2)

##       age         workclass             fnlwgt         education        
##  Min.   :17.00   Length:16281       Min.   :  13492   Length:16281      
##  1st Qu.:28.00   Class :character   1st Qu.: 116736   Class :character  
##  Median :37.00   Mode  :character   Median : 177831   Mode  :character  
##  Mean   :38.77                      Mean   : 189436                     
##  3rd Qu.:48.00                      3rd Qu.: 238384                     
##  Max.   :90.00                      Max.   :1490400                     
##  education.num   marital.status      occupation        relationship      
##  Min.   : 1.00   Length:16281       Length:16281       Length:16281      
##  1st Qu.: 9.00   Class :character   Class :character   Class :character  
##  Median :10.00   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :10.07                                                           
##  3rd Qu.:12.00                                                           
##  Max.   :16.00                                                           
##      race               sex             capital.gain    capital.loss   
##  Length:16281       Length:16281       Min.   :    0   Min.   :   0.0  
##  Class :character   Class :character   1st Qu.:    0   1st Qu.:   0.0  
##  Mode  :character   Mode  :character   Median :    0   Median :   0.0  
##                                        Mean   : 1082   Mean   :  87.9  
##                                        3rd Qu.:    0   3rd Qu.:   0.0  
##                                        Max.   :99999   Max.   :3770.0  
##  hours.per.week  native.country     income.class      
##  Min.   : 1.00   Length:16281       Length:16281      
##  1st Qu.:40.00   Class :character   Class :character  
##  Median :40.00   Mode  :character   Mode  :character  
##  Mean   :40.39                                        
##  3rd Qu.:45.00                                        
##  Max.   :99.00

anyNA(Adults2)

## [1] TRUE

n_miss(Adults2)

## [1] 2203

x <- c(NA, NaN, " ?", "missing")
anyNA(x)

## [1] TRUE

table(Adults$age)

## 
##  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36 
## 200 312 341 360 376 413 452 408 354 368 397 413 410 417 437 425 460 417 461 450 
##  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56 
## 422 437 390 393 427 385 334 343 362 360 373 302 270 264 282 260 247 198 202 198 
##  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71  72  73  74  75  76 
## 193 189 168 137 150 136 105 132 106  87  87  58  41  44  46  53  44  26  27  23 
##  77  78  79  80  81  82  83  84  85  87  88  89  90 
##  25  11   8  16  17   3   5   3   2   2   3   2  12

table(Adults$workclass)

## 
##                 ?       Federal-gov         Local-gov      Never-worked 
##               963               472              1043                 3 
##           Private      Self-emp-inc  Self-emp-not-inc         State-gov 
##             11210               579              1321               683 
##       Without-pay 
##                 7

table(Adults$education)

## 
##          10th          11th          12th       1st-4th       5th-6th 
##           456           637           224            79           176 
##       7th-8th           9th    Assoc-acdm     Assoc-voc     Bachelors 
##           309           242           534           679          2670 
##     Doctorate       HS-grad       Masters     Preschool   Prof-school 
##           181          5283           934            32           258 
##  Some-college 
##          3587

table(Adults$education.num)

## 
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16 
##   32   79  176  309  242  456  637  224 5283 3587  679  534 2670  934  258  181

table(Adults$marital.status)

## 
##               Divorced      Married-AF-spouse     Married-civ-spouse 
##                   2190                     14                   7403 
##  Married-spouse-absent          Never-married              Separated 
##                    210                   5434                    505 
##                Widowed 
##                    525

table(Adults$occupation)

## 
##                  ?       Adm-clerical       Armed-Forces       Craft-repair 
##                966               1841                  6               2013 
##    Exec-managerial    Farming-fishing  Handlers-cleaners  Machine-op-inspct 
##               2020                496                702               1020 
##      Other-service    Priv-house-serv     Prof-specialty    Protective-serv 
##               1628                 93               2032                334 
##              Sales       Tech-support   Transport-moving 
##               1854                518                758

table(Adults$relationship)

## 
##         Husband   Not-in-family  Other-relative       Own-child       Unmarried 
##            6523            4278             525            2513            1679 
##            Wife 
##             763

table(Adults$race)

## 
##  Amer-Indian-Eskimo  Asian-Pac-Islander               Black               Other 
##                 159                 480                1561                 135 
##               White 
##               13946

table(Adults$sex)

## 
##  Female    Male 
##    5421   10860

table(Adults$capital.gain)

## 
##     0   114   401   594   914   991  1055  1086  1151  1173  1264  1409  1424 
## 14958     2     3    18     2     1    12     4     5     2     2     3     1 
##  1455  1471  1506  1731  1797  1831  1848  2036  2062  2105  2174  2176  2202 
##     3     2     9     1     3     2     3     1     1     6    26     8    12 
##  2290  2329  2346  2354  2407  2414  2463  2538  2580  2597  2635  2653  2829 
##     5     1     2    10     6     2     4     4     8    11     3     6    11 
##  2885  2907  2936  2961  2964  2977  2993  3103  3137  3273  3325  3411  3418 
##     6     7     1     1     5     3     1    55    14     1    28    10     3 
##  3456  3464  3471  3674  3781  3818  3887  3908  3942  4064  4101  4386  4416 
##     4    10     3     8     4     4     2    10     4    12     9    38    12 
##  4508  4650  4687  4787  4865  4931  4934  5013  5060  5178  5455  5556  5721 
##    11    22     1    12     8     3     3    48     1    49     7     1     4 
##  6097  6418  6497  6514  6612  6723  6767  6849  7262  7298  7430  7443  7688 
##     1     7     4     5     1     3     1    15     1   118     6     2   126 
##  7896  7978  8614  9386  9562 10520 10566 10605 11678 13550 14084 14344 15020 
##     1     1    27     9     1    21     2     7     2    15     8     8     5 
## 15024 15831 20051 25124 25236 27828 34095 41310 99999 
##   166     2    12     2     3    24     1     1    85

table(Adults$capital.loss)

## 
##     0   213   323   625   653  1092  1138  1258  1340  1380  1408  1411  1421 
## 15518     1     2     5     1     4     2     2     4     3    14     3     1 
##  1429  1485  1504  1510  1564  1573  1579  1590  1594  1602  1617  1628  1648 
##     3    20     8     3    18     6    10    22     1    15     2     9     1 
##  1651  1668  1669  1672  1719  1721  1726  1735  1740  1741  1762  1816  1825 
##     2     5    11    16    16    10     5     1    16    20     6     2     1 
##  1844  1848  1870  1876  1887  1902  1911  1944  1974  1977  1980  2001  2002 
##     2    16     1    20    74   102     1     2    10    85    13    11    12 
##  2042  2051  2057  2129  2149  2163  2174  2179  2205  2231  2238  2246  2258 
##     3     8    10     4     3     1     3     5    10     4     2     2    14 
##  2282  2339  2377  2392  2415  2444  2457  2465  2467  2472  2547  2559  2603 
##     1    10     5     2    23     8     1     1     1     3     1     5     2 
##  2824  3004  3175  3770 
##     4     3     2     2

table(Adults$hours.per.week)

## 
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16 
##    7   21   20   30   35   28   19   73    9  147    9   74    5   21  219   98 
##   17   18   19   20   21   22   23   24   25   26   27   28   29   30   31   32 
##   13   54    5  638   22   18   19  102  284   10   13   54    8  551    7  157 
##   33   34   35   36   37   38   39   40   41   42   43   44   45   46   47   48 
##   22   20  640  116   93  238   25 7586   23  119   76   98  893   47   33  253 
##   49   50   51   52   53   54   55   56   57   58   59   60   61   62   63   64 
##   10 1427    7   67   14   21  357   44    2   10    2  702    2    5    5    8 
##   65   66   67   68   69   70   72   73   74   75   76   77   78   79   80   84 
##  111    6    2    4    1  146   36    2    2   39    1    3    5    1   77   27 
##   85   86   88   89   90   92   96   98   99 
##    4    2    2    1   13    2    4    3   52

table(Adults$native.country)

## 
##                           ?                    Cambodia 
##                         274                           9 
##                      Canada                       China 
##                          61                          47 
##                    Columbia                        Cuba 
##                          26                          43 
##          Dominican-Republic                     Ecuador 
##                          33                          17 
##                 El-Salvador                     England 
##                          49                          37 
##                      France                     Germany 
##                           9                          69 
##                      Greece                   Guatemala 
##                          20                          24 
##                       Haiti                    Honduras 
##                          31                           7 
##                        Hong                     Hungary 
##                          10                           6 
##                       India                        Iran 
##                          51                          16 
##                     Ireland                       Italy 
##                          13                          32 
##                     Jamaica                       Japan 
##                          25                          30 
##                        Laos                      Mexico 
##                           5                         308 
##                   Nicaragua  Outlying-US(Guam-USVI-etc) 
##                          15                           9 
##                        Peru                 Philippines 
##                          15                          97 
##                      Poland                    Portugal 
##                          27                          30 
##                 Puerto-Rico                    Scotland 
##                          70                           9 
##                       South                      Taiwan 
##                          35                          14 
##                    Thailand             Trinadad&Tobago 
##                          12                           8 
##               United-States                     Vietnam 
##                       14662                          19 
##                  Yugoslavia 
##                           7

table(Adults$income.class)

## 
##  <=50K.   >50K. 
##   12435    3846

Removing Observations with Missing Data

In the Adults dataset, there were 963 observations missing information for the workclass variable there were 966 observations missing information for the occupation variable, and there were 274 observations missing the native country variable. The 963 missing workclass observations make sense considering they are also the same observations that are missing occupation. The additional three observations missing within occupation are individuals who have never worked. Moreover, of the 274 observations missing native.country, 19 of them are also missing work class and occupation. It does not seem that there is a pattern to the missing variables. Due to the large sample size of the dataset, 16,281 observations and the fact that a total of 1,221 observations are impacted by a missing value in one or more variables, these observations will be omitted. This allows the analysis to avoid issues with missing or imputed data causing non-sensical or error-prone results within the statistics of the dataset, mainly issues with the mean, standard deviation, or other non-sensical values (observations that are not possible). The entire record for all 1,221 observations that were missing values across one or more variables will be removed, which is approximately 7% of the original dataset provided. To determine if any of the variables in the dateset should be dropped completely, a function will be used that checks for features (columns) and samples (rows) where more than 5% of data is missing¹. In order to remove the entire variable and/or row (or observation) from the data, also known as Listwise Deletion, the missing values needs to be identified as MCAR (missing completely at random)⁵. Assuming MCAR, the dataset needs to be checked for too much missing data, which can also lead to problems. A safe threshold for large datasets is ~5% ¹. In order to determine this the mice package can be used and the md.pattern() command will calculate the frequencies of the missing data patterns and determine if the missing data is systematic or random.

table_ofMD_WF = table(Adults$workclass == " ?")
table_ofMD_WF

## 
## FALSE  TRUE 
## 15318   963

table_ofMD_Occ = table(Adults$occupation == " ?")
table_ofMD_Occ

## 
## FALSE  TRUE 
## 15315   966

table_ofMD_Cty = table(Adults$native.country == " ?")
table_ofMD_Cty

## 
## FALSE  TRUE 
## 16007   274

table_ofMD_2 = table(Adults$occupation == " ?" & Adults$workclass == " ?")
table_ofMD_2

## 
## FALSE  TRUE 
## 15318   963

table_ofMD_all3 = table(Adults$native.country == " ?" & Adults$occupation == " ?" & Adults$workclass == " ?")
table_ofMD_all3

## 
## FALSE  TRUE 
## 16262    19

aMiss <- function(x){sum(is.na(x))/length(x)*100}
tablesum <- apply(Adults2, 2, aMiss)
tablesum

all_obs <- apply(Adults2, 1, aMiss)
head(all_obs)

## [1]  0.00000  0.00000  0.00000  0.00000 13.33333  0.00000

As shown above, both workclass and occupation have more than 5% of data missing within the variable. Due to the variables having just over the 5% suggested threshold, at 5.91% and 5.93%, respectively, these variables will be kept in for analysis but it should be noted that our models may be better suited for prediction by omitting them and this will need to be explored. Models for comparison will be developed later. Additionally, this table shows the percentage of missing data per sample (or observation). For example, wherever there is a 13.33% this means the observation is missing 2 of the 15 variables.

md.pattern(Adults2, plot=TRUE, rotate.names = TRUE)

##       age fnlwgt education education.num marital.status relationship race sex
## 15060   1      1         1             1              1            1    1   1
## 3       1      1         1             1              1            1    1   1
## 944     1      1         1             1              1            1    1   1
## 255     1      1         1             1              1            1    1   1
## 19      1      1         1             1              1            1    1   1
##         0      0         0             0              0            0    0   0
##       capital.gain capital.loss hours.per.week income.class native.country
## 15060            1            1              1            1              1
## 3                1            1              1            1              1
## 944              1            1              1            1              1
## 255              1            1              1            1              0
## 19               1            1              1            1              0
##                  0            0              0            0            274
##       workclass occupation     
## 15060         1          1    0
## 3             1          0    1
## 944           0          0    2
## 255           1          1    1
## 19            0          0    3
##             963        966 2203

As verified using the md.pattern() command, there are a few missing data observations that have a pattern. The column to the far left shows the frequency of each pattern and the bottom row shows the total number of missing entries per variable, as well as, the total number of missing cells ⁶. The Adults dataset does show patterns within the missing entries for Native country, Workclass, and Occupation. For Native country, 255 observations are missing only Native country, for Workclass, 944 observations are missing only Workclass, and for Occupation, only 3 observations are missing only Occupation. Lastly, 19 observations are missing Workclass, Occupation, and Native Country. The above command also shows that 15,060 observations that are complete data. The patterns can be seen in the last three columns of the table above, Using the VIM package and the aggr() command we will be able to visualize the patterns².

MD_Pattern_Adults2 <- aggr(Adults2, prop=FALSE, numbers=TRUE, col=c('blue','green'), sortVars = TRUE, labels=names(Adults2), cex.axis = .7, gap = 3, ylab=c("Histogram of Missing Data", "Pattern"))

## 
##  Variables sorted by number of missings: 
##        Variable Count
##      occupation   966
##       workclass   963
##  native.country   274
##             age     0
##          fnlwgt     0
##       education     0
##   education.num     0
##  marital.status     0
##    relationship     0
##            race     0
##             sex     0
##    capital.gain     0
##    capital.loss     0
##  hours.per.week     0
##    income.class     0

plot(MD_Pattern_Adults2)

Based on the output from the table above, the patterns and how many observations within each pattern are more clearly explored, showing that there is no distinct subset of the population that has been systematically removed, i.e. a certain age or country, etc. The observations missing data will be dropped from the dataset moving forward. The complete dataset for modeling includes 15,060 observations.

Adults3 <- Adults2 %>% replace_with_na_if(.predicate = is.character,
                     condition = ~.x == (" ?")) %>% drop_na() %>% droplevels()
summary(Adults3)

##       age         workclass             fnlwgt         education        
##  Min.   :17.00   Length:15060       Min.   :  13492   Length:15060      
##  1st Qu.:28.00   Class :character   1st Qu.: 116655   Class :character  
##  Median :37.00   Mode  :character   Median : 177955   Mode  :character  
##  Mean   :38.77                      Mean   : 189616                     
##  3rd Qu.:48.00                      3rd Qu.: 238589                     
##  Max.   :90.00                      Max.   :1490400                     
##  education.num   marital.status      occupation        relationship      
##  Min.   : 1.00   Length:15060       Length:15060       Length:15060      
##  1st Qu.: 9.00   Class :character   Class :character   Class :character  
##  Median :10.00   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :10.11                                                           
##  3rd Qu.:13.00                                                           
##  Max.   :16.00                                                           
##      race               sex             capital.gain    capital.loss    
##  Length:15060       Length:15060       Min.   :    0   Min.   :   0.00  
##  Class :character   Class :character   1st Qu.:    0   1st Qu.:   0.00  
##  Mode  :character   Mode  :character   Median :    0   Median :   0.00  
##                                        Mean   : 1120   Mean   :  89.04  
##                                        3rd Qu.:    0   3rd Qu.:   0.00  
##                                        Max.   :99999   Max.   :3770.00  
##  hours.per.week  native.country     income.class      
##  Min.   : 1.00   Length:15060       Length:15060      
##  1st Qu.:40.00   Class :character   Class :character  
##  Median :40.00   Mode  :character   Mode  :character  
##  Mean   :40.95                                        
##  3rd Qu.:45.00                                        
##  Max.   :99.00

str(Adults3)

## 'data.frame':    15060 obs. of  15 variables:
##  $ age           : int  25 38 28 44 34 63 24 55 65 36 ...
##  $ workclass     : chr  " Private" " Private" " Local-gov" " Private" ...
##  $ fnlwgt        : int  226802 89814 336951 160323 198693 104626 369667 104996 184454 212465 ...
##  $ education     : chr  " 11th" " HS-grad" " Assoc-acdm" " Some-college" ...
##  $ education.num : int  7 9 12 10 6 15 10 4 9 13 ...
##  $ marital.status: chr  " Never-married" " Married-civ-spouse" " Married-civ-spouse" " Married-civ-spouse" ...
##  $ occupation    : chr  " Machine-op-inspct" " Farming-fishing" " Protective-serv" " Machine-op-inspct" ...
##  $ relationship  : chr  " Own-child" " Husband" " Husband" " Husband" ...
##  $ race          : chr  " Black" " White" " White" " Black" ...
##  $ sex           : chr  " Male" " Male" " Male" " Male" ...
##  $ capital.gain  : int  0 0 0 7688 0 3103 0 0 6418 0 ...
##  $ capital.loss  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ hours.per.week: int  40 50 40 40 30 32 40 10 40 40 ...
##  $ native.country: chr  " United-States" " United-States" " United-States" " United-States" ...
##  $ income.class  : chr  " <=50K." " <=50K." " >50K." " >50K." ...

Transform Income Class

Adults3$income.class2[Adults3$income.class == " <=50K."]= "$50,000 or less"
Adults3$income.class2[Adults3$income.class == " >50K."]= "$50,001 or more"
Adults_Updated <- Adults3[,-15]
head(Adults_Updated)

##   age         workclass fnlwgt     education education.num      marital.status
## 1  25           Private 226802          11th             7       Never-married
## 2  38           Private  89814       HS-grad             9  Married-civ-spouse
## 3  28         Local-gov 336951    Assoc-acdm            12  Married-civ-spouse
## 4  44           Private 160323  Some-college            10  Married-civ-spouse
## 5  34           Private 198693          10th             6       Never-married
## 6  63  Self-emp-not-inc 104626   Prof-school            15  Married-civ-spouse
##           occupation   relationship   race   sex capital.gain capital.loss
## 1  Machine-op-inspct      Own-child  Black  Male            0            0
## 2    Farming-fishing        Husband  White  Male            0            0
## 3    Protective-serv        Husband  White  Male            0            0
## 4  Machine-op-inspct        Husband  Black  Male         7688            0
## 5      Other-service  Not-in-family  White  Male            0            0
## 6     Prof-specialty        Husband  White  Male         3103            0
##   hours.per.week native.country   income.class2
## 1             40  United-States $50,000 or less
## 2             50  United-States $50,000 or less
## 3             40  United-States $50,001 or more
## 4             40  United-States $50,001 or more
## 5             30  United-States $50,000 or less
## 6             32  United-States $50,001 or more

Look for Outliers within Variables in Adults Dataset

ls(Adults_Updated)

##  [1] "age"            "capital.gain"   "capital.loss"   "education"     
##  [5] "education.num"  "fnlwgt"         "hours.per.week" "income.class2" 
##  [9] "marital.status" "native.country" "occupation"     "race"          
## [13] "relationship"   "sex"            "workclass"

str(Adults_Updated)

## 'data.frame':    15060 obs. of  15 variables:
##  $ age           : int  25 38 28 44 34 63 24 55 65 36 ...
##  $ workclass     : chr  " Private" " Private" " Local-gov" " Private" ...
##  $ fnlwgt        : int  226802 89814 336951 160323 198693 104626 369667 104996 184454 212465 ...
##  $ education     : chr  " 11th" " HS-grad" " Assoc-acdm" " Some-college" ...
##  $ education.num : int  7 9 12 10 6 15 10 4 9 13 ...
##  $ marital.status: chr  " Never-married" " Married-civ-spouse" " Married-civ-spouse" " Married-civ-spouse" ...
##  $ occupation    : chr  " Machine-op-inspct" " Farming-fishing" " Protective-serv" " Machine-op-inspct" ...
##  $ relationship  : chr  " Own-child" " Husband" " Husband" " Husband" ...
##  $ race          : chr  " Black" " White" " White" " Black" ...
##  $ sex           : chr  " Male" " Male" " Male" " Male" ...
##  $ capital.gain  : int  0 0 0 7688 0 3103 0 0 6418 0 ...
##  $ capital.loss  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ hours.per.week: int  40 50 40 40 30 32 40 10 40 40 ...
##  $ native.country: chr  " United-States" " United-States" " United-States" " United-States" ...
##  $ income.class2 : chr  "$50,000 or less" "$50,000 or less" "$50,001 or more" "$50,001 or more" ...

Adults_NumericOnly <- Adults_Updated[,c(1,3,5,11:13)]

plot(Adults_NumericOnly)

summary(Adults_NumericOnly)

##       age            fnlwgt        education.num    capital.gain  
##  Min.   :17.00   Min.   :  13492   Min.   : 1.00   Min.   :    0  
##  1st Qu.:28.00   1st Qu.: 116655   1st Qu.: 9.00   1st Qu.:    0  
##  Median :37.00   Median : 177955   Median :10.00   Median :    0  
##  Mean   :38.77   Mean   : 189616   Mean   :10.11   Mean   : 1120  
##  3rd Qu.:48.00   3rd Qu.: 238589   3rd Qu.:13.00   3rd Qu.:    0  
##  Max.   :90.00   Max.   :1490400   Max.   :16.00   Max.   :99999  
##   capital.loss     hours.per.week 
##  Min.   :   0.00   Min.   : 1.00  
##  1st Qu.:   0.00   1st Qu.:40.00  
##  Median :   0.00   Median :40.00  
##  Mean   :  89.04   Mean   :40.95  
##  3rd Qu.:   0.00   3rd Qu.:45.00  
##  Max.   :3770.00   Max.   :99.00

table(Adults_Updated$occupation, Adults_Updated$income.class2)

##                     
##                      $50,000 or less $50,001 or more
##    Adm-clerical                 1561             258
##    Armed-Forces                    2               3
##    Craft-repair                 1543             447
##    Exec-managerial              1062             930
##    Farming-fishing               434              57
##    Handlers-cleaners             644              52
##    Machine-op-inspct             884             120
##    Other-service                1532              64
##    Priv-house-serv                87               2
##    Prof-specialty               1077             893
##    Protective-serv               235              97
##    Sales                        1339             485
##    Tech-support                  375             133
##    Transport-moving              585             159

table(Adults_Updated$income.class2, Adults_Updated$education)

##                  
##                    10th  11th  12th  1st-4th  5th-6th  7th-8th  9th  Assoc-acdm
##   $50,000 or less   380   541   186       69      151      246  208         357
##   $50,001 or more    23    30    14        2       10       20   13         142
##                  
##                    Assoc-voc  Bachelors  Doctorate  HS-grad  Masters  Preschool
##   $50,000 or less        492       1474         50     4144      412         26
##   $50,001 or more        160       1052        119      799      475          1
##                  
##                    Prof-school  Some-college
##   $50,000 or less           57          2567
##   $50,001 or more          186           654

agevincome <-table(Adults_Updated$income.class2, Adults_Updated$age)
summary(agevincome)

## Number of cases in table: 15060 
## Number of factors: 2 
## Test for independence of all factors:
##  Chisq = 1557.1, df = 72, p-value = 1.216e-277
##  Chi-squared approximation may be incorrect

barplot(agevincome, legend=c("$50,000 or less", "$50,001 or more"), xlab="Age", ylab="Count of Individuals", col=c("blue","green"))

Most of the individuals who are making $50,001 or more are older than 30.

Using ggplot2, a bar graph was constructed to show occupation sorted by individuals with income levels above or below $50,000.

Bar Chart of Occupation compared to Income Class

P2 <- ggplot(Adults_Updated, aes(x= occupation, fill= income.class2))+
  geom_bar()
P2 + theme(axis.text.x = element_text(angle=45, hjust=1))

When comparing occupation by income class, it becomes obvious that there are few people in most occupations making more than $50,000. It is interesting to note that the closest breakdown of individuals between $50,000 or less and $50,001 or more is within the “Prof-specialty” and “Exec-managerial” occupations.

Bar Chart of Work Class compared to Income Class

P2 <- ggplot(Adults_Updated, aes(x= workclass, fill= income.class2))+
  geom_bar()
P2 + theme(axis.text.x = element_text(angle=45, hjust=1))

Check for outliers on the following: capital gain, education number, age, hours per week

##   age  workclass fnlwgt     education education.num      marital.status
## 1  25    Private 226802          11th             7       Never-married
## 2  38    Private  89814       HS-grad             9  Married-civ-spouse
## 3  28  Local-gov 336951    Assoc-acdm            12  Married-civ-spouse
## 4  34    Private 198693          10th             6       Never-married
## 5  24    Private 369667  Some-college            10       Never-married
## 6  55    Private 104996       7th-8th             4  Married-civ-spouse
##           occupation   relationship   race     sex capital.gain capital.loss
## 1  Machine-op-inspct      Own-child  Black    Male            0            0
## 2    Farming-fishing        Husband  White    Male            0            0
## 3    Protective-serv        Husband  White    Male            0            0
## 4      Other-service  Not-in-family  White    Male            0            0
## 5      Other-service      Unmarried  White  Female            0            0
## 6       Craft-repair        Husband  White    Male            0            0
##   hours.per.week native.country   income.class2
## 1             40  United-States $50,000 or less
## 2             50  United-States $50,000 or less
## 3             40  United-States $50,001 or more
## 4             30  United-States $50,000 or less
## 5             40  United-States $50,000 or less
## 6             10  United-States $50,000 or less

Capital_Gains_Brkdn <- count(Adults_Updated, capital.gain, capital.gain > 90000)
head(Capital_Gains_Brkdn)

##   capital.gain capital.gain > 90000     n
## 1            0                FALSE 13808
## 2          114                FALSE     2
## 3          401                FALSE     1
## 4          594                FALSE    14
## 5          914                FALSE     2
## 6          991                FALSE     1

tail(Capital_Gains_Brkdn)

##     capital.gain capital.gain > 90000  n
## 105        25124                FALSE  2
## 106        25236                FALSE  3
## 107        27828                FALSE 24
## 108        34095                FALSE  1
## 109        41310                FALSE  1
## 110        99999                 TRUE 81

There are 13,808 observations with capital.gain = 0. For those not equal to 0, there were 1,252 observations ranging from $144 to $99,999.

capital.gain_2 <- filter(Adults_Updated, capital.gain > 0)
summary(capital.gain_2)

##       age         workclass             fnlwgt        education        
##  Min.   :17.00   Length:1252        Min.   : 19302   Length:1252       
##  1st Qu.:34.00   Class :character   1st Qu.:113383   Class :character  
##  Median :43.00   Mode  :character   Median :173453   Mode  :character  
##  Mean   :43.98                      Mean   :184846                     
##  3rd Qu.:52.00                      3rd Qu.:230477                     
##  Max.   :90.00                      Max.   :764638                     
##  education.num   marital.status      occupation        relationship      
##  Min.   : 1.00   Length:1252        Length:1252        Length:1252       
##  1st Qu.: 9.00   Class :character   Class :character   Class :character  
##  Median :10.00   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :11.07                                                           
##  3rd Qu.:13.00                                                           
##  Max.   :16.00                                                           
##      race               sex             capital.gain    capital.loss
##  Length:1252        Length:1252        Min.   :  114   Min.   :0    
##  Class :character   Class :character   1st Qu.: 3464   1st Qu.:0    
##  Mode  :character   Mode  :character   Median : 7298   Median :0    
##                                        Mean   :13476   Mean   :0    
##                                        3rd Qu.:13550   3rd Qu.:0    
##                                        Max.   :99999   Max.   :0    
##  hours.per.week  native.country     income.class2     
##  Min.   : 2.00   Length:1252        Length:1252       
##  1st Qu.:40.00   Class :character   Class :character  
##  Median :40.00   Mode  :character   Mode  :character  
##  Mean   :43.99                                        
##  3rd Qu.:50.00                                        
##  Max.   :99.00

min(capital.gain_2$capital.gain)

## [1] 114

max(capital.gain_2$capital.gain)

## [1] 99999

hist(capital.gain_2$capital.gain, breaks = 20, xlim = c(140,100000), xlab = "capital gains")

Based on the histogram of all observations where capital gain is larger than 0, there are 81 observations with capital gains equal to $99,999.

which(capital.gain_2$capital.gain >90000)

##  [1]   34   36   38   69   72   85   98  114  116  117  125  135  167  174  197
## [16]  203  227  256  266  282  287  323  327  369  378  382  398  401  419  438
## [31]  439  456  463  491  502  504  522  526  549  561  579  580  598  599  612
## [46]  617  646  663  687  710  728  736  746  749  750  760  770  784  798  801
## [61]  805  851  893  928  965  974  976  992  995 1000 1007 1048 1070 1108 1110
## [76] 1120 1166 1229 1231 1232 1236

Creating Min-Max Transformation Variables for All Numeric Variables in the Adults_Updated dataset

As LaRose and LaRose explain, “when mixing categorical and continuous variables, the min-max normalization may be preferred”(Larose, 155). This is due to the fact that using the z-score standardization allows for a wider scale than the min-max normalization and may reverse the conclusions for how variables are categorized.⁴ For this reason, the min-max transformation will be performed on all numeric variables.

MMNormTrans <- preProcess(Adults_Updated, method=c("range"))
Adults_CatModel <- predict(MMNormTrans, Adults_Updated)
str(Adults_CatModel)

## 'data.frame':    15060 obs. of  15 variables:
##  $ age           : num  0.11 0.288 0.151 0.37 0.233 ...
##  $ workclass     : chr  " Private" " Private" " Local-gov" " Private" ...
##  $ fnlwgt        : num  0.1444 0.0517 0.219 0.0994 0.1254 ...
##  $ education     : chr  " 11th" " HS-grad" " Assoc-acdm" " Some-college" ...
##  $ education.num : num  0.4 0.533 0.733 0.6 0.333 ...
##  $ marital.status: chr  " Never-married" " Married-civ-spouse" " Married-civ-spouse" " Married-civ-spouse" ...
##  $ occupation    : chr  " Machine-op-inspct" " Farming-fishing" " Protective-serv" " Machine-op-inspct" ...
##  $ relationship  : chr  " Own-child" " Husband" " Husband" " Husband" ...
##  $ race          : chr  " Black" " White" " White" " Black" ...
##  $ sex           : chr  " Male" " Male" " Male" " Male" ...
##  $ capital.gain  : num  0 0 0 0.0769 0 ...
##  $ capital.loss  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ hours.per.week: num  0.398 0.5 0.398 0.398 0.296 ...
##  $ native.country: chr  " United-States" " United-States" " United-States" " United-States" ...
##  $ income.class2 : chr  "$50,000 or less" "$50,000 or less" "$50,001 or more" "$50,001 or more" ...

Adults_CatModel$income.class2 <- as.factor(Adults_CatModel$income.class2)
str(Adults_CatModel)

## 'data.frame':    15060 obs. of  15 variables:
##  $ age           : num  0.11 0.288 0.151 0.37 0.233 ...
##  $ workclass     : chr  " Private" " Private" " Local-gov" " Private" ...
##  $ fnlwgt        : num  0.1444 0.0517 0.219 0.0994 0.1254 ...
##  $ education     : chr  " 11th" " HS-grad" " Assoc-acdm" " Some-college" ...
##  $ education.num : num  0.4 0.533 0.733 0.6 0.333 ...
##  $ marital.status: chr  " Never-married" " Married-civ-spouse" " Married-civ-spouse" " Married-civ-spouse" ...
##  $ occupation    : chr  " Machine-op-inspct" " Farming-fishing" " Protective-serv" " Machine-op-inspct" ...
##  $ relationship  : chr  " Own-child" " Husband" " Husband" " Husband" ...
##  $ race          : chr  " Black" " White" " White" " Black" ...
##  $ sex           : chr  " Male" " Male" " Male" " Male" ...
##  $ capital.gain  : num  0 0 0 0.0769 0 ...
##  $ capital.loss  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ hours.per.week: num  0.398 0.5 0.398 0.398 0.296 ...
##  $ native.country: chr  " United-States" " United-States" " United-States" " United-States" ...
##  $ income.class2 : Factor w/ 2 levels "$50,000 or less",..: 1 1 2 2 1 2 1 1 2 1 ...

Outliers for Age

hist(Adults_Updated$age)

hist(Adults_CatModel$age)

boxplot(Adults_Updated$age)

boxplot(Adults_CatModel$age)

which(Adults_Updated$age >78)

##  [1]   216   833   881   996  1291  1925  2116  2757  3235  3403  4315  6076
## [13]  6245  6387  6428  6445  6448  6620  6860  6865  6983  7339  7681  7789
## [25]  8280  8306  8341  8359  8577  9018  9130  9274  9422  9442  9922 10451
## [37] 10488 10563 10966 11060 11166 11328 11522 12065 12922 13204 13239 13362
## [49] 13486 13507 13637 13966 14732 14760 14796 14893 14933

boxplot.stats(Adults_Updated$age, do.out=FALSE)

## $stats
## [1] 17 28 37 48 78
## 
## $n
## [1] 15060
## 
## $conf
## [1] 36.7425 37.2575
## 
## $out
## numeric(0)

IQR = 3rd Quartile - 1st Quartile for Age (non-normalized variable)

48-28

## [1] 20

1.5*20

## [1] 30

28-30

## [1] -2

48+30

## [1] 78

The 58 outliers for age can be found in observations where age is greater than 78 years old. These outliers are been considered when building the data models. It is been determined that the ages are reasonable values. Also, the histogram indicates a reasonable skew to the data. For this reason, the observations have been included in all models.

While it is possible that these outliers may affect the accuracy of models, the parameters of the project do not currently cap the age of the observations. Moreover, this subset of data does not indicate any strong patterns in employment type or hours.per.week.

AgeOutliers <- which(Adults_Updated$age >78)
All_Age_outliers <- Adults_Updated %>% filter(age < 78) %>% arrange(desc(age))
head(All_Age_outliers)

##   age     workclass fnlwgt  education education.num         marital.status
## 1  77       Private  89655    HS-grad             9                Widowed
## 2  77  Self-emp-inc  29702    HS-grad             9     Married-civ-spouse
## 3  77       Private 148949       10th             6     Married-civ-spouse
## 4  77       Private 133728    7th-8th             4     Married-civ-spouse
## 5  77       Private 189173    HS-grad             9     Married-civ-spouse
## 6  77     State-gov 267799  Doctorate            16  Married-spouse-absent
##        occupation   relationship   race     sex capital.gain capital.loss
## 1    Adm-clerical  Not-in-family  White  Female            0            0
## 2   Other-service        Husband  White    Male            0            0
## 3   Other-service        Husband  Black    Male         3818            0
## 4    Craft-repair        Husband  White    Male            0            0
## 5  Prof-specialty        Husband  White    Male            0            0
## 6  Prof-specialty  Not-in-family  White    Male            0            0
##   hours.per.week native.country   income.class2
## 1             40  United-States $50,000 or less
## 2             35  United-States $50,001 or more
## 3             30  United-States $50,000 or less
## 4             40  United-States $50,000 or less
## 5             35  United-States $50,000 or less
## 6              4  United-States $50,001 or more

Outliers for Education Number of Years

hist(Adults_Updated$education.num)

hist(Adults_CatModel$education.num)

boxplot(Adults_Updated$education.num)

boxplot(Adults_CatModel$education.num)

which(Adults_Updated$education.num <3)

boxplot.stats(Adults_Updated$education.num, do.out=FALSE)

## $stats
## [1]  3  9 10 13 16
## 
## $n
## [1] 15060
## 
## $conf
## [1]  9.9485 10.0515
## 
## $out
## numeric(0)

IQR = 3rd Quartile - 1st Quartile for Education in number of years (non-normalized variable)

13-9

## [1] 4

1.5*4

## [1] 6

9-6

## [1] 3

13+6

## [1] 19

All_Education_Outliers <- Adults_Updated %>% filter(education.num<3) %>% arrange(desc(age))
head(All_Education_Outliers)

##   age         workclass fnlwgt  education education.num      marital.status
## 1  80         Local-gov  81534    1st-4th             2             Widowed
## 2  78           Private 454321    1st-4th             2             Widowed
## 3  77  Self-emp-not-inc 161552  Preschool             1             Widowed
## 4  72           Private 109385    1st-4th             2  Married-civ-spouse
## 5  65           Private 237024    1st-4th             2  Married-civ-spouse
## 6  64         State-gov 107732    1st-4th             2  Married-civ-spouse
##           occupation    relationship                race     sex capital.gain
## 1    Farming-fishing   Not-in-family  Asian-Pac-Islander    Male         1086
## 2  Handlers-cleaners  Other-relative               White    Male            0
## 3    Exec-managerial   Not-in-family               White  Female            0
## 4      Other-service         Husband               Black    Male            0
## 5      Other-service         Husband               White    Male            0
## 6    Farming-fishing         Husband               Other    Male            0
##   capital.loss hours.per.week native.country   income.class2
## 1            0             20    Philippines $50,000 or less
## 2            0             20      Nicaragua $50,000 or less
## 3            0             60  United-States $50,000 or less
## 4            0             36  United-States $50,000 or less
## 5            0             35         Mexico $50,000 or less
## 6            0             45       Columbia $50,000 or less

The 99 outliers for number of years of education can be found in observations where education.num is less than 3 years. These outliers have been considered when building the data models. It has been determined that, while obtaining fewer than 3 years of education is rare, it is not an impossible statistic. Furthermore, a table of the observations does not reveal any obvious correlations with any other variables, this is shown at the end of the outliers section and before the modeling section. For this reason, the observations have been included in all models.

EducationNOutliers <- which(Adults_Updated$education.num <3)

Outliers for Capital Gain

boxplot(Adults_Updated$capital.gain)

boxplot(capital.gain_2$capital.gain)

boxplot(Adults_CatModel$capital.gain)

boxplot.stats(Adults_Updated$capital.gain, do.out=FALSE)

## $stats
## [1] 0 0 0 0 0
## 
## $n
## [1] 15060
## 
## $conf
## [1] 0 0
## 
## $out
## numeric(0)

boxplot.stats(Adults_CatModel$capital.gain, do.out=FALSE)

## $stats
## [1] 0 0 0 0 0
## 
## $n
## [1] 15060
## 
## $conf
## [1] 0 0
## 
## $out
## numeric(0)

Based on the boxplots of capital gain, it is evident anything above $0 would be considered an outlier.Boxplot and IQR show the same observations, the observations listed above are outliers in terms of capital gains observations. There are 1,252 outliers for capital gains,those being any observations above $0. These outliers should be considered when building the data models. Since 81 of the outliers are $99,999, it seems logical that these represent a format constraint of the data inputs. Furthermore, the next highest value is $41,310. With such a significant cluster of outliers, we believe that future models should be run with and without these observations.

CGOutliers <- which(Adults_Updated$capital.gain >0)

Outliers for Capital Loss

hist(Adults_Updated$capital.loss)

hist(Adults_CatModel$capital.loss)

boxplot(Adults_Updated$capital.loss)

boxplot(Adults_CatModel$capital.loss)

which(Adults_Updated$capital.loss >0)

boxplot.stats(Adults_Updated$capital.loss, do.out=FALSE)

## $stats
## [1] 0 0 0 0 0
## 
## $n
## [1] 15060
## 
## $conf
## [1] 0 0
## 
## $out
## numeric(0)

Capital_Loss_Brkdn <- count(Adults_Updated, capital.loss, capital.loss > 0)
head(Capital_Loss_Brkdn)

##   capital.loss capital.loss > 0     n
## 1            0            FALSE 14347
## 2          213             TRUE     1
## 3          323             TRUE     2
## 4          625             TRUE     5
## 5          653             TRUE     1
## 6         1092             TRUE     4

tail(Capital_Loss_Brkdn)

##    capital.loss capital.loss > 0 n
## 74         2559             TRUE 5
## 75         2603             TRUE 1
## 76         2824             TRUE 4
## 77         3004             TRUE 3
## 78         3175             TRUE 2
## 79         3770             TRUE 2

In the Adults_Updated dataset, there were 14,347 observations with $0 in capital losses. Any observations with values other than $0 for capital losses are considered outliers. The total number of outliers based on capital losses is 713 and the range for these observations is from $213 to $3,770. These outliers are to be considered when building the data models.

CLOutliers <- which(Adults_Updated$capital.loss >0)

Outliers for Hours Per Week

hist(Adults_Updated$hours.per.week)

hist(Adults_CatModel$hours.per.week)

boxplot(Adults_Updated$hours.per.week)

boxplot(Adults_CatModel$hours.per.week)

which(Adults_Updated$hours.per.week >55)

which(Adults_Updated$hours.per.week <27)

boxplot.stats(Adults_Updated$hours.per.week, do.out=FALSE)

## $stats
## [1] 33 40 40 45 52
## 
## $n
## [1] 15060
## 
## $conf
## [1] 39.93563 40.06437
## 
## $out
## numeric(0)

IQR = 3rd Quartile - 1st Quartile for Age (non-normalized variable)

45-40

## [1] 5

1.5*5

## [1] 7.5

40-7.5

## [1] 32.5

45+7.5

## [1] 52.5

Under_Hrs <- count(Adults_Updated, hours.per.week < 32.5)
Under_Hrs

##   hours.per.week < 32.5     n
## 1                 FALSE 12768
## 2                  TRUE  2292

Above_Hrs <- count(Adults_Updated, hours.per.week > 52.5)
Above_Hrs

##   hours.per.week > 52.5     n
## 1                 FALSE 13406
## 2                  TRUE  1654

There are 3,946 outliers based on hours per week. The observations with hours above 52.5 hours a week (1,654) and below 32.5 hours a week (2,292) are all outliers. These outliers have been considered when building the data models, and ultimately, they were included in all modeling. While technically outliers, the histogram confirms that these datapoints are evenly distributed along a bell curve and do no appear to be heavily skewed, the result of imputation error or the result of a formatting constraint.

HPWOutliers_Above <- which(Adults_Updated$hours.per.week >52.5)
HPWOutliers_Below <- which(Adults_Updated$hours.per.week <32.5)
HPWOutliers <- which(Adults_Updated$hours.per.week <32.5 | Adults_Updated$hours.per.week >52.5)

New Column Added for Identifying All Outliers based on all Continuous Variables

Adults_Updated$Outliers <-(Adults_Updated$Outliers<- Adults_Updated$age >78| Adults_Updated$education.num <3| Adults_Updated$capital.gain >0| Adults_Updated$capital.loss >0| Adults_Updated$hours.per.week <32.5 | Adults_Updated$hours.per.week >52.5)
str(Adults_Updated)

## 'data.frame':    15060 obs. of  16 variables:
##  $ age           : int  25 38 28 44 34 63 24 55 65 36 ...
##  $ workclass     : chr  " Private" " Private" " Local-gov" " Private" ...
##  $ fnlwgt        : int  226802 89814 336951 160323 198693 104626 369667 104996 184454 212465 ...
##  $ education     : chr  " 11th" " HS-grad" " Assoc-acdm" " Some-college" ...
##  $ education.num : int  7 9 12 10 6 15 10 4 9 13 ...
##  $ marital.status: chr  " Never-married" " Married-civ-spouse" " Married-civ-spouse" " Married-civ-spouse" ...
##  $ occupation    : chr  " Machine-op-inspct" " Farming-fishing" " Protective-serv" " Machine-op-inspct" ...
##  $ relationship  : chr  " Own-child" " Husband" " Husband" " Husband" ...
##  $ race          : chr  " Black" " White" " White" " Black" ...
##  $ sex           : chr  " Male" " Male" " Male" " Male" ...
##  $ capital.gain  : int  0 0 0 7688 0 3103 0 0 6418 0 ...
##  $ capital.loss  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ hours.per.week: int  40 50 40 40 30 32 40 10 40 40 ...
##  $ native.country: chr  " United-States" " United-States" " United-States" " United-States" ...
##  $ income.class2 : chr  "$50,000 or less" "$50,000 or less" "$50,001 or more" "$50,001 or more" ...
##  $ Outliers      : logi  FALSE FALSE FALSE TRUE TRUE TRUE ...

table(Adults_Updated$Outliers)

## 
## FALSE  TRUE 
##  9549  5511

There are 5,511 observations or individuals that are identified as outliers based on one or more of the continuous variables (age, education number of years, capital gain,capital loss or hours per week) within the Adults dataset.

Next, it is important to determine which variables are correlated and to what degree (based on the correlation coefficient, r). This can help identify relationships and trends within the data and which variables are necessary for prediction and which can potential be removed or intentionally omitted. According to LaRose and LaRose, “Identify any variables that are perfectly correlated (i.e., r = 1.0 or r=-1.0). Do not retain both variables in the model, but rather omit one. Identify groups of variables that are correlated with each other. Then, later, during the modeling phase, apply dimension reduction methods, such as principal components analysis to these variables” (LaRose, 78).⁴ This is done because correlated variables can overemphasize a data component or even "cause the model to become unstable and deliver unreliable results (LaRose, 76).⁴

A good way to begin this analysis is to look at the correlation coefficients between all the continuous variables to see where they show r values closer to -1.00 or 1.00. The correlation coefficient (r) shows the strength and direction of the linear relationship between the two variables.

Checking for Correlation among the Continuous Variables

contV2 <- Adults_NumericOnly
cor(contV2)

##                        age       fnlwgt education.num capital.gain capital.loss
## age             1.00000000 -0.074375137    0.02612265   0.07876001  0.057745040
## fnlwgt         -0.07437514  1.000000000   -0.03600951  -0.01283892  0.006420974
## education.num   0.02612265 -0.036009508    1.00000000   0.13174987  0.085816886
## capital.gain    0.07876001 -0.012838920    0.13174987   1.00000000 -0.031875657
## capital.loss    0.05774504  0.006420974    0.08581689  -0.03187566  1.000000000
## hours.per.week  0.10275833 -0.010306013    0.13369132   0.09050098  0.057712162
##                hours.per.week
## age                0.10275833
## fnlwgt            -0.01030601
## education.num      0.13369132
## capital.gain       0.09050098
## capital.loss       0.05771216
## hours.per.week     1.00000000

From the above table, there seems to be no strong correlations among the continuous variables, all variables should be included in the data models. This is explored further through scatterplots in the Appendix.

Modeling Section

In order to construct the three models used to classify income class, the modified subsets of the original dataset and all variables must undergo a final round of pre-processing. The data is first copied into two data sets: one which remains a mix of numeric and categorical variables, and one that is converted into 100% numeric data. The latter set is developed by expanding the categorical variables to include various factors. It should be noted that this format is not recommended for exploratory data analysis (as it makes for difficult interpretation because it is not a tidy data format); however, it is required to pass categorical variables to models which will only accept numeric inputs. For this, the ‘carat’ package is used to assign dummy variables to factors.

After the Adults_NumModel set is expanded, it is standardized using min/max standardization so that all data points fall between 0.0 and 1.0. This is performed in order to normalize the variables and results in standardizing the scale of effects for each variable.

The last part of the pre-processing section is to create training and test data sets by randomly subsetting 80% of the observations into a training set and the other 20% of the observations into the test set. Models will learn from the training set and their accuracy will be tested with the test set for overfitting. Data sets are created that exclude the capital gains outliers of $99,999. This percent split between training and test sets is accepted as an industry standard by which to allow the model to train using the bulk of the data while simultaneously providing a test set large enough to guarantee a high level of diversity. A 90%/10% split could also be used; however, in the absence of numerous random test sets, we prefer to use 20% of the data.

Pre-processing for models

#creating a categorical set and a numeric set
Adults_CatModel <- Adults_CatModel[,-c(4,8,16)]
Adults_NumModel <- Adults_Updated[,-c(4,8,16)]

#race
dmy <- dummyVars("~ race", data = Adults_NumModel)
dmy.race <- predict(dmy, Adults_NumModel)
Adults_NumModel <- cbind(Adults_NumModel, dmy.race) 
Adults_NumModel <- Adults_NumModel[,-7]

#workclass
dmy <- dummyVars("~ workclass", data = Adults_NumModel)
dmy.workclass <- predict(dmy, Adults_NumModel)
Adults_NumModel <- cbind(Adults_NumModel, dmy.workclass) 
Adults_NumModel <- Adults_NumModel[,-2]

#martial status
dmy <- dummyVars("~ marital.status", data = Adults_NumModel)
dmy.marital.status <- predict(dmy, Adults_NumModel)
Adults_NumModel <- cbind(Adults_NumModel, dmy.marital.status) 
Adults_NumModel <- Adults_NumModel[,-4]

#occupation
dmy <- dummyVars("~ occupation", data = Adults_NumModel)
dmy.occupation <- predict(dmy, Adults_NumModel)
Adults_NumModel <- cbind(Adults_NumModel, dmy.occupation) 
Adults_NumModel <- Adults_NumModel[,-4]

#native country
dmy <- dummyVars("~ native.country", data = Adults_NumModel)
dmy.native.country <- predict(dmy, Adults_NumModel)
Adults_NumModel <- cbind(Adults_NumModel, dmy.native.country) 
Adults_NumModel <- Adults_NumModel[,-8]

#changing Males to 0 and Females to 1 and removing original sex column
Adults_NumModel$sex2[Adults_NumModel$sex == " Male"]=0
Adults_NumModel$sex2[Adults_NumModel$sex == " Female"]=1
Adults_NumModel <- Adults_NumModel[,-4]

Adults_NumModel <- Adults_NumModel %>% relocate(income.class2, .after = last_col())

### Min/Max for Adults_NumModel
MinMax <- preProcess(Adults_NumModel, method=c("range"))
Adults_NumModel <- predict(MinMax, Adults_NumModel)

#removing capital gains outliers
Adults_CatModel.Outlier <- Adults_CatModel %>% filter(capital.gain < max(capital.gain))
Adults_NumModel.Outlier <- Adults_NumModel %>% filter(capital.gain < max(capital.gain))

### Creating training and test sets (80/20% split)
set.seed(100)
samp <- sample(1:15060, 12048, replace = FALSE)
TrainCat <- Adults_CatModel[samp,]
TestCat <- Adults_CatModel[-samp,]
TrainNum <- Adults_NumModel[samp,]
TestNum <- Adults_NumModel[-samp,]

TrainCat.cap.outliers <- Adults_CatModel.Outlier[samp,] 
TrainNum.cap.outliers <- Adults_NumModel.Outlier[samp,]
TestCat.cap.outliers <- Adults_CatModel.Outlier[-samp,] 
TestNum.cap.outliers <- Adults_NumModel.Outlier[-samp,]

The first model that we test is the K-Nearest Neighbor (KNN) model. This model uses numeric independent variables to measure the euclidean distance between variables. The K value in the equation indicates the number of nearest neighboring observations the model considers. The ‘knn.cv’ model includes cross-validation in its computation, so the entire data set was used instead of the training set. The advantage in this is that the model is able to train on a greater number of observations. Through trial and error, we determine that a K values of 33 provides the most accurate model in classifying the income class.

Once the most accurate K value was determined, the model is also run without observations in which capital are equal-to or greater-than $99,999.

#k-nearest neighbor
knn33 <- knn.cv(Adults_NumModel[,-81], Adults_NumModel[,81], k = 33)
tknn33 <- table(Adults_NumModel$income.class2, knn33)
tknn33

##                  knn33
##                   $50,000 or less $50,001 or more
##   $50,000 or less           10447             913
##   $50,001 or more            1650            2050

knnaccu33 <- round((tknn33[1,1] + tknn33[2,2]) / sum(tknn33)*100, 2)
print(paste("Our KNN model is ",knnaccu33,"percent accurate with a K value of 33."))

## [1] "Our KNN model is  82.98 percent accurate with a K value of 33."

#removing capital gains outliers
knn.outlier <- knn.cv(Adults_NumModel.Outlier[,-81], Adults_NumModel.Outlier[,81], k = 33)
tknnOutlier <- table(Adults_NumModel.Outlier$income.class2, knn.outlier)
tknnOutlier

##                  knn.outlier
##                   $50,000 or less $50,001 or more
##   $50,000 or less           10448             912
##   $50,001 or more            1629            1990

knnaccuOutlier <- round((tknnOutlier[1,1] + tknnOutlier[2,2]) / sum(tknnOutlier)*100, 2)
print(paste("The model is",knnaccuOutlier,"percent accurate without capital gains >= $99,999."))

## [1] "The model is 83.04 percent accurate without capital gains >= $99,999."

The accuracy of the KNN model is slightly improved when capital gains outliers are omitted from the data set. While this is noteworthy, we do not make the assumption that all models will act accordingly. We continue to use both datasets throughout our modeling.

The second model we test is a Classification Tree. The ‘rpart’ model calculates the probability of the target variable given the probabilities of the independent variables. As this model accepts both numeric and categorical variables, we utilize the training and test data sets that include both variables, as well as the set that is only numeric. The ‘rpart.plot’ function renders an acceptable chart of the classifications. Additionally, we run the dataset that excludes the capital gains outliers.

# using categorical data set
CatTree <- rpart(income.class2 ~ ., data = TrainCat, method = "class")
rpart.plot(CatTree)

est.train.tree <- predict(CatTree, TrainCat, type = "class")
tree.train.table <- table(TrainCat$income.class2, est.train.tree)
tree.train.table

##                  est.train.tree
##                   $50,000 or less $50,001 or more
##   $50,000 or less            8525             550
##   $50,001 or more            1378            1595

accutree.train <- round((tree.train.table[1,1]+tree.train.table[2,2])/sum(tree.train.table)*100,2)
print(paste("Our classification tree is",accutree.train,"percent accurate using the training set."))

## [1] "Our classification tree is 84 percent accurate using the training set."

# with categorical test data
est.test.tree <- predict(CatTree, TestCat, type = "class")
tree.test.table <- table(TestCat$income.class2, est.test.tree)
tree.test.table

##                  est.test.tree
##                   $50,000 or less $50,001 or more
##   $50,000 or less            2146             139
##   $50,001 or more             345             382

accutree.test <- round((tree.test.table[1,1]+tree.test.table[2,2])/sum(tree.test.table)*100,2)
print(paste("Our classification tree is",accutree.test,"percent accurate using the test set."))

## [1] "Our classification tree is 83.93 percent accurate using the test set."

# with numeric data
NumTree <- rpart(income.class2 ~ ., data = TrainNum, method = "class")
rpart.plot(NumTree)

est.NumTrain.tree <- predict(NumTree, TrainNum, type = "class")
Numtree.train.table <- table(TrainNum$income.class2, est.NumTrain.tree)
Numtree.train.table

##                  est.NumTrain.tree
##                   $50,000 or less $50,001 or more
##   $50,000 or less            8494             581
##   $50,001 or more            1281            1692

accuNumTree.train <- round((Numtree.train.table[1,1]+Numtree.train.table[2,2])/sum(Numtree.train.table)*100,2)
print(paste("The classification tree is",accuNumTree.train,"percent accurate using the numeric training set."))

## [1] "The classification tree is 84.55 percent accurate using the numeric training set."

with numeric test data

est.NumTest.tree <- predict(NumTree, TestNum, type = "class")
Numtree.test.table <- table(TestNum$income.class2, est.NumTest.tree)
Numtree.test.table

##                  est.NumTest.tree
##                   $50,000 or less $50,001 or more
##   $50,000 or less            2134             151
##   $50,001 or more             325             402

accuNumTree.test <- round((Numtree.test.table[1,1]+Numtree.test.table[2,2])/sum(Numtree.test.table)*100)
print(paste("The classification tree is",accuNumTree.test,"percent accurate using the numeric test set."))

## [1] "The classification tree is 84 percent accurate using the numeric test set."

removing outliers

est.OutlierTrain.tree <- predict(NumTree, TrainNum.cap.outliers, type = "class")
Numtree.Outlier.table <- table(TrainNum.cap.outliers$income.class2, est.OutlierTrain.tree)
Numtree.Outlier.table

##                  est.OutlierTrain.tree
##                   $50,000 or less $50,001 or more
##   $50,000 or less            8473             599
##   $50,001 or more            1295            1613

accuNumTreeOutlier.train <- round( (Numtree.Outlier.table[1,1]+Numtree.Outlier.table[2,2])/sum(Numtree.Outlier.table)*100,2)
print(paste("The classification tree is",accuNumTreeOutlier.train,"percent accurate using the numeric training set without capital gains."))

## [1] "The classification tree is 84.19 percent accurate using the numeric training set without capital gains."

est.OutlierTrain.tree <- predict(NumTree, TestNum.cap.outliers, type = "class")
Numtree.Outlier.TestTable <- table(TestNum.cap.outliers$income.class2, est.OutlierTrain.tree)
accuNumTreeOutlier.test <- round((Numtree.Outlier.TestTable[1,1]+Numtree.Outlier.TestTable[2,2])/sum(Numtree.Outlier.TestTable)*100,2)

print(paste("The classification tree is",accuNumTreeOutlier.test,"percent accurate using the numeric training set without capital gains."))

## [1] "The classification tree is 85.2 percent accurate using the numeric training set without capital gains."

The results of the classification tree models show that the model is highly effective in learning from the training data. The results from the test data were actually slightly more accurate than when using the training data. This is rare, and may be a product of the random sampling. This model is also marginally more accurate when using the all numeric dataset. Further research is needed, but this result is likely caused by the expanded number of variables in the numeric dataset.

Regarding the inclusion of capital gains outliers, the model accuracy is not improved by omitting outliers.

The final model we test is a neural network. Acting in same fashion as a biological neural network, the independent inputs are passed through hidden nodes before they reach the output. The number of hidden nodes determines the complexity of the network and, in most cases,the accuracy of the model. The optimal number of nodes (‘size’ input) was reached through trial and error.

neuronet <- nnet(income.class2 ~ ., TrainCat, size = 12, maxit = 2000)

## # weights:  925
## initial  value 7552.141318 
## iter  10 value 4433.614072
## iter  20 value 3988.648238
## iter  30 value 3768.480483
## iter  40 value 3665.206615
## iter  50 value 3595.230739
## iter  60 value 3515.572438
## iter  70 value 3429.873544
## iter  80 value 3375.787692
## iter  90 value 3330.771082
## iter 100 value 3301.091894
## iter 110 value 3277.558026
## iter 120 value 3259.546942
## iter 130 value 3244.858710
## iter 140 value 3225.218682
## iter 150 value 3207.807109
## iter 160 value 3194.882171
## iter 170 value 3183.757705
## iter 180 value 3174.957568
## iter 190 value 3168.052302
## iter 200 value 3162.828960
## iter 210 value 3156.421323
## iter 220 value 3149.462459
## iter 230 value 3139.071010
## iter 240 value 3131.132907
## iter 250 value 3121.244632
## iter 260 value 3111.099618
## iter 270 value 3098.990054
## iter 280 value 3086.255558
## iter 290 value 3071.991418
## iter 300 value 3062.033493
## iter 310 value 3054.138433
## iter 320 value 3047.626867
## iter 330 value 3043.710684
## iter 340 value 3037.956656
## iter 350 value 3033.946186
## iter 360 value 3030.649482
## iter 370 value 3027.045318
## iter 380 value 3024.198476
## iter 390 value 3021.760005
## iter 400 value 3018.762047
## iter 410 value 3015.561039
## iter 420 value 3010.942512
## iter 430 value 3005.294207
## iter 440 value 2997.980463
## iter 450 value 2984.488882
## iter 460 value 2971.177285
## iter 470 value 2967.867188
## iter 480 value 2966.559369
## iter 490 value 2965.880647
## iter 500 value 2965.372854
## iter 510 value 2964.510261
## iter 520 value 2963.602925
## iter 530 value 2963.404966
## iter 540 value 2963.310224
## iter 550 value 2963.146404
## iter 560 value 2963.102699
## iter 570 value 2963.054454
## iter 580 value 2962.803637
## iter 590 value 2962.751746
## iter 600 value 2962.739251
## iter 610 value 2962.735827
## iter 620 value 2962.734374
## final  value 2962.733850 
## converged

source_url("https://gist.githubusercontent.com/fawda123/7471137/raw/466c1474d0a505ff044412703516c34f1a4684a5/nnet_plot_update.r")

## SHA-1 hash of file is 74c80bd5ddbc17ab3ae5ece9c0ed9beb612e87ef

plot.nnet(neuronet)

## Loading required package: reshape

## Warning: package 'reshape' was built under R version 4.0.3

## 
## Attaching package: 'reshape'

## The following object is masked from 'package:class':
## 
##     condense

## The following object is masked from 'package:dplyr':
## 
##     rename

## The following objects are masked from 'package:tidyr':
## 
##     expand, smiths

est.train.nnet <- predict(neuronet, TrainCat, type = "class")
nnet.train.table <- table(TrainCat$income.class2, est.train.nnet)
nnet.train.table

##                  est.train.nnet
##                   $50,000 or less $50,001 or more
##   $50,000 or less            8527             548
##   $50,001 or more             835            2138

accuNeuroTrain <- round((nnet.train.table[1,1]+nnet.train.table[2,2])/sum(nnet.train.table)*100, 2)

print(paste("The neural network is",accuNumTreeOutlier.test,"percent accurate using the training set."))

## [1] "The neural network is 85.2 percent accurate using the training set."

applying to test set

est.test.nnet <- predict(neuronet, TestCat, type = "class")
nnet.test.table <- table(TestCat$income.class2, est.test.nnet)
nnet.test.table

##                  est.test.nnet
##                   $50,000 or less $50,001 or more
##   $50,000 or less            2068             217
##   $50,001 or more             289             438

accuNeuroTest <- round((nnet.test.table[1,1]+nnet.test.table[2,2])/sum(nnet.test.table)*100,2)
print(paste("The neural network is",accuNeuroTest,"percent accurate using the test set."))

## [1] "The neural network is 83.2 percent accurate using the test set."

removing outliers

neuronetOutlier <- nnet(income.class2 ~ ., TrainCat.cap.outliers, size = 12, maxit = 2000)

## # weights:  925
## initial  value 7864.019938 
## iter  10 value 4359.345073
## iter  20 value 4014.169594
## iter  30 value 3851.216239
## iter  40 value 3752.494821
## iter  50 value 3634.009812
## iter  60 value 3557.540748
## iter  70 value 3507.596000
## iter  80 value 3443.251643
## iter  90 value 3392.274843
## iter 100 value 3339.842681
## iter 110 value 3309.006820
## iter 120 value 3282.351822
## iter 130 value 3260.089445
## iter 140 value 3241.794525
## iter 150 value 3226.446812
## iter 160 value 3215.061360
## iter 170 value 3202.254023
## iter 180 value 3191.827580
## iter 190 value 3185.702072
## iter 200 value 3180.834070
## iter 210 value 3175.488630
## iter 220 value 3170.374565
## iter 230 value 3166.034203
## iter 240 value 3161.840127
## iter 250 value 3157.803293
## iter 260 value 3153.486203
## iter 270 value 3149.413036
## iter 280 value 3145.916200
## iter 290 value 3141.997192
## iter 300 value 3139.146562
## iter 310 value 3136.340690
## iter 320 value 3133.372236
## iter 330 value 3131.156089
## iter 340 value 3129.156098
## iter 350 value 3127.166687
## iter 360 value 3124.915179
## iter 370 value 3123.025817
## iter 380 value 3121.085554
## iter 390 value 3118.563946
## iter 400 value 3115.617177
## iter 410 value 3113.630854
## iter 420 value 3110.843488
## iter 430 value 3107.656758
## iter 440 value 3104.587279
## iter 450 value 3101.901386
## iter 460 value 3100.231060
## iter 470 value 3098.921020
## iter 480 value 3098.128777
## iter 490 value 3097.073419
## iter 500 value 3095.995085
## iter 510 value 3095.026221
## iter 520 value 3093.919997
## iter 530 value 3092.833244
## iter 540 value 3090.971085
## iter 550 value 3089.055287
## iter 560 value 3087.919341
## iter 570 value 3086.849389
## iter 580 value 3086.034270
## iter 590 value 3085.392120
## iter 600 value 3084.780812
## iter 610 value 3084.325282
## iter 620 value 3083.834814
## iter 630 value 3083.494235
## iter 640 value 3083.165797
## iter 650 value 3082.836459
## iter 660 value 3082.591911
## iter 670 value 3082.300170
## iter 680 value 3081.874064
## iter 690 value 3081.602754
## iter 700 value 3081.444231
## iter 710 value 3081.346245
## iter 720 value 3081.242607
## iter 730 value 3081.102443
## iter 740 value 3080.976904
## iter 750 value 3080.777696
## iter 760 value 3080.563885
## iter 770 value 3080.390172
## iter 780 value 3080.184533
## iter 790 value 3080.005283
## iter 800 value 3079.805771
## iter 810 value 3079.630719
## iter 820 value 3079.485379
## iter 830 value 3079.329283
## iter 840 value 3079.141042
## iter 850 value 3079.026082
## iter 860 value 3078.904930
## iter 870 value 3078.827663
## iter 880 value 3078.774900
## iter 890 value 3078.719398
## iter 900 value 3078.622810
## iter 910 value 3078.560264
## iter 920 value 3078.499980
## iter 930 value 3078.432886
## iter 940 value 3078.379092
## iter 950 value 3078.325757
## iter 960 value 3078.280438
## iter 970 value 3078.252646
## iter 980 value 3078.229042
## iter 990 value 3078.211967
## iter1000 value 3078.194648
## iter1010 value 3078.177356
## iter1020 value 3078.162001
## iter1030 value 3078.142118
## iter1040 value 3078.128868
## iter1050 value 3078.120769
## iter1060 value 3078.106646
## iter1070 value 3078.090232
## iter1080 value 3078.077286
## iter1090 value 3078.067858
## iter1100 value 3078.051835
## iter1110 value 3078.033993
## iter1120 value 3078.001297
## iter1130 value 3077.977457
## iter1140 value 3077.962030
## iter1150 value 3077.934615
## iter1160 value 3077.917206
## iter1170 value 3077.889605
## iter1180 value 3077.869117
## iter1190 value 3077.837896
## iter1200 value 3077.815072
## iter1210 value 3077.787919
## iter1220 value 3077.564873
## iter1230 value 3077.538882
## iter1240 value 3077.440655
## iter1250 value 3077.400800
## iter1260 value 3077.354974
## iter1270 value 3077.326164
## iter1280 value 3077.306054
## iter1290 value 3077.286337
## iter1300 value 3077.266745
## iter1310 value 3077.248757
## iter1320 value 3077.236299
## iter1330 value 3077.224937
## iter1340 value 3077.205868
## iter1350 value 3077.183553
## iter1360 value 3077.180036
## iter1370 value 3077.176571
## iter1380 value 3077.173578
## iter1390 value 3077.168581
## iter1400 value 3077.163128
## iter1410 value 3077.158087
## iter1420 value 3077.152923
## iter1430 value 3077.148663
## iter1440 value 3077.142974
## final  value 3077.138895 
## converged

est.trainOutlier.nnet <- predict(neuronetOutlier, TrainCat.cap.outliers,type = "class")
nnet.trainOutlier.table <- table(TrainCat.cap.outliers$income.class2, est.trainOutlier.nnet)
nnet.trainOutlier.table

##                  est.trainOutlier.nnet
##                   $50,000 or less $50,001 or more
##   $50,000 or less            8510             562
##   $50,001 or more             933            1975

accuNeuroOutlier.train <- round((nnet.trainOutlier.table[1,1]+                                 nnet.trainOutlier.table[2,2])/sum(nnet.trainOutlier.table)*100,)
print(paste("The neural network is",accuNeuroOutlier.train,"percent accurate using the training set without capital gains outliers."))

## [1] "The neural network is 88 percent accurate using the training set without capital gains outliers."

est.testOutlier.nnet <- predict(neuronet, TestCat.cap.outliers, type = "class")
nnet.testOutlier.table <- table(TestCat.cap.outliers$income.class2, est.testOutlier.nnet)
accuNeuroOutlier.test<- round((nnet.testOutlier.table[1,1]+nnet.testOutlier.table[2,2])/sum(nnet.testOutlier.table)*100,2)

nnet.testOutlier.table

##                  est.testOutlier.nnet
##                   $50,000 or less $50,001 or more
##   $50,000 or less            2138             150
##   $50,001 or more             215             496

print(paste("The neural network is",accuNeuroOutlier.test,"percent accurate using the test set without capital gains outliers."))

## [1] "The neural network is 87.83 percent accurate using the test set without capital gains outliers."

JoblessTrain <- TrainCat.cap.outliers %>% select(-workclass, -occupation)
JoblessTest <- TestCat.cap.outliers %>% select(-workclass, -occupation)
jobless <- nnet(income.class2 ~ .,JoblessTrain, size = 12, maxit = 2000 )

## # weights:  697
## initial  value 9817.205207 
## iter  10 value 4489.661816
## iter  20 value 4097.696311
## iter  30 value 3969.028544
## iter  40 value 3877.747735
## iter  50 value 3788.771730
## iter  60 value 3716.922948
## iter  70 value 3667.826328
## iter  80 value 3629.332332
## iter  90 value 3598.808544
## iter 100 value 3580.069004
## iter 110 value 3563.204011
## iter 120 value 3544.993231
## iter 130 value 3531.601566
## iter 140 value 3522.496119
## iter 150 value 3515.038532
## iter 160 value 3510.762539
## iter 170 value 3507.147291
## iter 180 value 3503.956239
## iter 190 value 3499.805189
## iter 200 value 3494.490264
## iter 210 value 3491.817027
## iter 220 value 3488.105833
## iter 230 value 3485.615833
## iter 240 value 3483.406555
## iter 250 value 3479.920896
## iter 260 value 3476.684622
## iter 270 value 3472.012551
## iter 280 value 3467.497696
## iter 290 value 3463.101936
## iter 300 value 3460.792174
## iter 310 value 3458.527119
## iter 320 value 3456.081766
## iter 330 value 3451.308933
## iter 340 value 3444.920326
## iter 350 value 3440.881802
## iter 360 value 3436.968986
## iter 370 value 3431.277828
## iter 380 value 3423.611504
## iter 390 value 3416.752803
## iter 400 value 3412.912384
## iter 410 value 3409.799541
## iter 420 value 3406.503142
## iter 430 value 3404.531429
## iter 440 value 3403.138207
## iter 450 value 3401.912735
## iter 460 value 3400.370399
## iter 470 value 3399.280193
## iter 480 value 3398.504253
## iter 490 value 3397.788974
## iter 500 value 3397.244895
## iter 510 value 3396.743398
## iter 520 value 3396.259363
## iter 530 value 3395.761922
## iter 540 value 3395.123488
## iter 550 value 3394.111128
## iter 560 value 3392.680211
## iter 570 value 3391.461341
## iter 580 value 3390.103602
## iter 590 value 3389.014159
## iter 600 value 3387.986310
## iter 610 value 3387.219067
## iter 620 value 3386.450007
## iter 630 value 3386.116354
## iter 640 value 3385.972585
## iter 650 value 3385.838267
## iter 660 value 3385.761592
## iter 670 value 3385.719235
## iter 680 value 3385.704511
## iter 690 value 3385.688491
## iter 700 value 3385.681509
## iter 710 value 3385.675970
## final  value 3385.675014 
## converged

estjoblessTest <- predict(jobless, JoblessTest, type = "class")
joblesstable <- table(JoblessTest$income.class2, estjoblessTest)
accuJobless <- round((joblesstable[1,1]+joblesstable[2,2])/sum(joblesstable)*100,2)

print(paste("Without the 'occupation' or 'workclass' columns, the model has been reduced to", accuJobless, "percent accuracy."))

## [1] "Without the 'occupation' or 'workclass' columns, the model has been reduced to 84.76 percent accuracy."

The results of the neural network reveal that the model is superior to the KNN and classification tree models. It is notable that model is slightly improved with the omission of capital gains outliers.

The results of the neural network reveal that the model is in line with KNN and the classification tree models. It is notable, however, that model improves with the omission of capital gains outliers.

Conclusion

From our analysis, the neural network model developed with complete observations and the omission of outliers for capital gains performed the most efficiently, with an accuracy rate of 87.8% when running the training set against the test set. This model is the most accurate model in classifying the income levels for the given observations. The model correctly classified 87.8% of the test data (resulting in an error rate of only 12.2%). Further research is needed in order to fine tune our model to its full potential. A more detailed analysis of each variable may suggest that specific variables be omitted from the model while others may benefit from being weighted more heavily; nevertheless, the findings of this project alone may yield a useful framework to others.

Resources:

Alice, Michy. “Imputing Missing Data with R; MICE Package.” DataScience+, 14 May 2018, datascienceplus.com/imputing-missing-data-with-r-mice-package/.

2.“Chapter 15. Advanced Methods for Missing Data · R in Action.” · R In Action, livebook.manning.com/book/r-in-action/chapter-15/133.

Hastie, T., Tibshirani, R., Walther, G. (2000). Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society. Retrieved 11/19/20, from https://web.stanford.edu/~hastie/Papers/gap.pdf
Larose, D. T., Larose, C. D. (2014). Discovering Knowledge in Data: an Introduction to Data Mining. Second ed., Wiley, 2014.
Roy, B. “All About Missing Data Handling.” Medium, Towards Data Science, 22 Feb. 2020, (https://towardsdatascience.com/all-about-missing-data-handling-b94b8b5d2184).
van Buuren, S. (2020, November 23) “4.1 Missing Data Pattern.” stefvanbuuren.name/fimd/missing-data-pattern.html.

Appendix

Zscore Transformations of Adults Dataset Variables

attach(Adults)
mean(age)

## [1] 38.76746

sd(age)

## [1] 13.84919

mean(fnlwgt)

## [1] 189435.7

sd(fnlwgt)

## [1] 105714.9

mean(education.num)

## [1] 10.07291

sd(education.num)

## [1] 2.567545

mean(capital.gain)

## [1] 1081.905

sd(capital.gain)

## [1] 7583.936

mean(capital.loss)

## [1] 87.89927

sd(capital.loss)

## [1] 403.1053

mean(hours.per.week)

## [1] 40.39224

sd(hours.per.week)

## [1] 12.47933

Adults_Updated$AgeZscore = (Adults_Updated$age - mean(Adults_Updated$age))/sd(Adults_Updated$age)
Adults_Updated$FNLWGTZscore = (Adults_Updated$fnlwgt - mean(Adults_Updated$fnlwgt))/sd(Adults_Updated$fnlwgt)
Adults_Updated$EduNumZscore = (Adults_Updated$education.num - mean(Adults_Updated$education.num))/sd(Adults_Updated$education.num)
Adults_Updated$CGZscore = (Adults_Updated$capital.gain - mean(Adults_Updated$capital.gain))/sd(Adults_Updated$capital.gain)
Adults_Updated$CLZscore = (Adults_Updated$capital.loss - mean(Adults_Updated$capital.loss))/sd(Adults_Updated$capital.loss)
Adults_Updated$HrsZscore = (Adults_Updated$hours.per.week - mean(Adults_Updated$hours.per.week))/sd(Adults_Updated$hours.per.week)
ls.str(Adults_Updated)

## age :  int [1:15060] 25 38 28 44 34 63 24 55 65 36 ...
## AgeZscore :  num [1:15060] -1.029 -0.0574 -0.8048 0.391 -0.3564 ...
## capital.gain :  int [1:15060] 0 0 0 7688 0 3103 0 0 6418 0 ...
## capital.loss :  int [1:15060] 0 0 0 0 0 0 0 0 0 0 ...
## CGZscore :  num [1:15060] -0.145 -0.145 -0.145 0.853 -0.145 ...
## CLZscore :  num [1:15060] -0.219 -0.219 -0.219 -0.219 -0.219 ...
## education :  chr [1:15060] " 11th" " HS-grad" " Assoc-acdm" " Some-college" " 10th" ...
## education.num :  int [1:15060] 7 9 12 10 6 15 10 4 9 13 ...
## EduNumZscore :  num [1:15060] -1.2165 -0.4349 0.7376 -0.0441 -1.6073 ...
## fnlwgt :  int [1:15060] 226802 89814 336951 160323 198693 104626 369667 104996 184454 212465 ...
## FNLWGTZscore :  num [1:15060] 0.3521 -0.945 1.395 -0.2774 0.0859 ...
## hours.per.week :  int [1:15060] 40 50 40 40 30 32 40 10 40 40 ...
## HrsZscore :  num [1:15060] -0.0789 0.7501 -0.0789 -0.0789 -0.9079 ...
## income.class2 :  chr [1:15060] "$50,000 or less" "$50,000 or less" "$50,001 or more" ...
## marital.status :  chr [1:15060] " Never-married" " Married-civ-spouse" " Married-civ-spouse" ...
## native.country :  chr [1:15060] " United-States" " United-States" " United-States" ...
## occupation :  chr [1:15060] " Machine-op-inspct" " Farming-fishing" " Protective-serv" ...
## Outliers :  logi [1:15060] FALSE FALSE FALSE TRUE TRUE TRUE ...
## race :  chr [1:15060] " Black" " White" " White" " Black" " White" " White" ...
## relationship :  chr [1:15060] " Own-child" " Husband" " Husband" " Husband" ...
## sex :  chr [1:15060] " Male" " Male" " Male" " Male" " Male" " Male" " Female" ...
## workclass :  chr [1:15060] " Private" " Private" " Local-gov" " Private" " Private" ...

hist(Adults_Updated$age)

hist(Adults_Updated$AgeZscore)

hist(Adults_Updated$capital.gain)

hist(Adults_Updated$CGZscore)

Scatterplots to Explore the Correlation between Continuous Variables

plot(Adults_NumericOnly$age, Adults_NumericOnly$fnlwgt)

plot(Adults_NumericOnly$age, Adults_NumericOnly$education.num)

plot(Adults_NumericOnly$age, Adults_NumericOnly$capital.gain)

plot(Adults_NumericOnly$age, Adults_NumericOnly$capital.loss)

plot(Adults_NumericOnly$age, Adults_NumericOnly$hours.per.week)

plot(Adults_NumericOnly$fnlwgt, Adults_NumericOnly$education.num)

plot(Adults_NumericOnly$fnlwgt, Adults_NumericOnly$capital.gain)

plot(Adults_NumericOnly$fnlwgt, Adults_NumericOnly$capital.loss)

plot(Adults_NumericOnly$fnlwgt, Adults_NumericOnly$hours.per.week)

plot(Adults_NumericOnly$education.num, Adults_NumericOnly$capital.gain)

plot(Adults_NumericOnly$education.num, Adults_NumericOnly$capital.loss)

plot(Adults_NumericOnly$education.num, Adults_NumericOnly$hours.per.week)

plot(Adults_NumericOnly$capital.gain, Adults_NumericOnly$capital.loss)

plot(Adults_NumericOnly$capital.gain, Adults_NumericOnly$hours.per.week)

plot(Adults_NumericOnly$capital.loss, Adults_NumericOnly$hours.per.week)

MBA 699 - Final Project - EDA and Model Development

Marissa Valente

12/05/2020

Executive Summary

Business Understanding Section

Importing the Adults dataset

Variables:

Data Pre-Processing and Exploratory Data Analysis

Look at Missing Data within Variables and Evaluate If Omission or Imputation Should be Performed

Removing Observations with Missing Data

Transform Income Class

Look for Outliers within Variables in Adults Dataset

Bar Chart of Occupation compared to Income Class

Bar Chart of Work Class compared to Income Class

Check for outliers on the following: capital gain, education number, age, hours per week

Creating Min-Max Transformation Variables for All Numeric Variables in the Adults_Updated dataset

Checking for Correlation among the Continuous Variables

Modeling Section

with numeric test data

removing outliers

applying to test set

removing outliers

Conclusion

Resources:

Appendix