As income inequality grows throughout the world, understanding the relationships between an individual’s income and the other factors in this study can help us better identify and address the underlying causes of that inequality. This study analyzes how 14 factors, such as age, country, working class, sex, race, and education, influence our target variable, income, in a diverse dataset of 48,842 individuals. The goal of this analysis is to develop models that best predict income, so that these models can be used to make better decisions when considering income from occupations.
The dataset for this project is from the UCI Machine Learning Repository. This dataset was donated by Ron Kohavi and Barry Becker. Extraction was done by Barry Becker from the 1994 Census database. A set of reasonably clean records was extracted using the following conditions: ((AAGE>16) && (AGI>100) && (AFNLWGT>1) && (HRSWK>0)). The relevant paper is Ron Kohavi, “Scaling Up the Accuracy of Naive-Bayes Classifiers: a Decision-Tree Hybrid”, Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, 1996. Based on this peer-reviewed paper, our prediction task is to determine whether a person makes over 50K a year.
In this analysis we attempt to predict the income of individuals given a host of variables. For simplicity, our target income variable has been broken into two categories: greater than $50,000, or less than or equal to $50,000. Ideally, this will make its prediction less prone to error since we are using a host of categorical variables. We then explore the relationships between our variables, make any necessary changes prior to modeling, develop several models, and evaluate them using confusion matrices. Our focus in this analysis will be on identifying the variables that improve real-world accuracy, to best capture the full context of how income interacts with our variables.
We begin with data exploration to understand the relationships our target variable ‘Income’ has with our variables and the variables’ relationships to each other. This allows us to determine the steps necessary to set up for model development. Once we have an understanding of these variables, we use that knowledge to prepare the data. We handle missing values, subset the data, and split it 70/30 into training and test sets so that we may better extract information when modeling. Then, we build the models and predict with the testing dataset. We focus on prediction accuracy when assessing the models but consider a host of performance statistics and real-world applications to determine which model is best.
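As a minimal sketch of the evaluation step described above, the following base R code builds a confusion matrix and derives accuracy, sensitivity, and specificity from it. The eight labels are made up for illustration and are not results from this study.

```r
# Hypothetical actual and predicted labels for eight test-set individuals.
lvls      <- c("<=50K", ">50K")
actual    <- factor(c(">50K", "<=50K", "<=50K", ">50K",
                      "<=50K", "<=50K", ">50K", "<=50K"), levels = lvls)
predicted <- factor(c(">50K", "<=50K", ">50K", "<=50K",
                      "<=50K", "<=50K", ">50K", "<=50K"), levels = lvls)

# Rows are predictions, columns are the true classes.
cm <- table(Predicted = predicted, Actual = actual)

accuracy    <- sum(diag(cm)) / sum(cm)                    # share of correct predictions
sensitivity <- cm[">50K", ">50K"] / sum(cm[, ">50K"])     # recall for the >50K class
specificity <- cm["<=50K", "<=50K"] / sum(cm[, "<=50K"])  # recall for the <=50K class

print(cm)
print(round(c(accuracy = accuracy,
              sensitivity = sensitivity,
              specificity = specificity), 3))
```

Reporting sensitivity and specificity alongside accuracy matters when one class is much larger than the other, because accuracy alone can look strong while the smaller class is predicted poorly.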
In this section we begin exploring the data by creating a table with the variable names and descriptions. We first identify the characteristics of the data to help with properly labeling and categorizing our factors. To better understand the data, we then summarize those characteristics and present them in the data summary section. Additional visualizations and correlations are created to discover any unseen patterns or potentially problematic areas prior to preparing the data.
There are 48842 observations of 15 variables. Each observation holds one individual’s income data with its corresponding variables of interest. Below is a description of the variables in the dataset.
VARIABLE NAME | DESCRIPTION |
---|---|
age | continuous |
workclass | Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked |
fnlwgt | continuous |
education | Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool |
education-num | continuous |
marital-status | Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse |
occupation | Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces |
relationship | Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried |
race | White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black |
sex | Female, Male |
capital-gain | continuous |
capital-loss | continuous |
hours-per-week | continuous |
native-country | United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands |
income | >50K, <=50K |
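The following is a sketch of how a file with this layout could be read into R, assuming comma-separated values with a space after each comma and no header row. The three rows are invented for illustration, the object name `income_data` is borrowed from the summary output, and the column names use underscores in place of the hyphens in the table.

```r
# Column names follow the variable table, with hyphens replaced by underscores.
col_names <- c("age", "workclass", "fnlwgt", "education", "education_num",
               "marital_status", "occupation", "relationship", "race", "sex",
               "capital_gain", "capital_loss", "hours_per_week",
               "native_country", "income")

# Three made-up rows in the assumed layout (not real census records).
raw_text <- paste(
  "41, Private, 150000, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K",
  "29, ?, 210000, Bachelors, 13, Never-married, ?, Not-in-family, Black, Female, 0, 0, 35, ?, <=50K",
  "52, Self-emp-inc, 98000, Masters, 14, Married-civ-spouse, Exec-managerial, Husband, White, Male, 5000, 0, 50, United-States, >50K",
  sep = "\n"
)

# strip.white removes the space after each comma;
# stringsAsFactors turns the text columns into factors.
income_data <- read.csv(text = raw_text, header = FALSE, col.names = col_names,
                        strip.white = TRUE, stringsAsFactors = TRUE)

str(income_data)
```

With a real file, the `text` argument would be replaced by a file path. The `?` entries are read as an ordinary factor level here, which is why they do not register as missing values until they are recoded.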
We continue by identifying any potential sources of error in the analysis. This includes calculating the number of missing and outlier values for each variable, and determining and assigning the proper methods to reduce missing and problematic data points. Due to the unique dataset, we determine the proper data type for each variable and convert it accordingly. Here again, for simplicity and the most effective use of the data, the categorical variables are converted to factors with multiple levels, as intended by the original data contributors, while the continuous variables are kept as integers. A gridded table with each variable’s levels, their frequencies, and the number of valid and missing data points is presented below.
Below is a summary of the census income dataset. For this process we have already adjusted the data types to their proper forms. This summarizing function quantifies each variable in a manner consistent with its type. We note the levels of each factor in the ‘Stats / Values’ column, the frequency of valid (non-missing) observations per level, and the count and percentage missing alongside them. We review these statistics to identify any issues with each variable.
## Data Frame Summary
## income_data
## Dimensions: 48842 x 15
## Duplicates: 52
##
## +----+-----------------+--------------------------------+-----------------------+----------+---------+
## | No | Variable | Stats / Values | Freqs (% of Valid) | Valid | Missing |
## +====+=================+================================+=======================+==========+=========+
## | 1 | age | Mean (sd) : 38.6 (13.7) | 74 distinct values | 48842 | 0 |
## | | [integer] | min < med < max: | | (100.0%) | (0.0%) |
## | | | 17 < 37 < 90 | | | |
## | | | IQR (CV) : 20 (0.4) | | | |
## +----+-----------------+--------------------------------+-----------------------+----------+---------+
## | 2 | workclass | 1. ? | 2799 ( 5.7%) | 48842 | 0 |
## | | [factor] | 2. Federal-gov | 1432 ( 2.9%) | (100.0%) | (0.0%) |
## | | | 3. Local-gov | 3136 ( 6.4%) | | |
## | | | 4. Never-worked | 10 ( 0.0%) | | |
## | | | 5. Private | 33906 (69.4%) | | |
## | | | 6. Self-emp-inc | 1695 ( 3.5%) | | |
## | | | 7. Self-emp-not-inc | 3862 ( 7.9%) | | |
## | | | 8. State-gov | 1981 ( 4.1%) | | |
## | | | 9. Without-pay | 21 ( 0.0%) | | |
## +----+-----------------+--------------------------------+-----------------------+----------+---------+
## | 3 | fnlwgt | Mean (sd) : 189664.1 (105604) | 28523 distinct values | 48842 | 0 |
## | | [integer] | min < med < max: | | (100.0%) | (0.0%) |
## | | | 12285 < 178144.5 < 1490400 | | | |
## | | | IQR (CV) : 120091.5 (0.6) | | | |
## +----+-----------------+--------------------------------+-----------------------+----------+---------+
## | 4 | education | 1. 10th | 1389 ( 2.8%) | 48842 | 0 |
## | | [factor] | 2. 11th | 1812 ( 3.7%) | (100.0%) | (0.0%) |
## | | | 3. 12th | 657 ( 1.3%) | | |
## | | | 4. 1st-4th | 247 ( 0.5%) | | |
## | | | 5. 5th-6th | 509 ( 1.0%) | | |
## | | | 6. 7th-8th | 955 ( 2.0%) | | |
## | | | 7. 9th | 756 ( 1.5%) | | |
## | | | 8. Assoc-acdm | 1601 ( 3.3%) | | |
## | | | 9. Assoc-voc | 2061 ( 4.2%) | | |
## | | | 10. Bachelors | 8025 (16.4%) | | |
## | | | [ 6 others ] | 30830 (63.1%) | | |
## +----+-----------------+--------------------------------+-----------------------+----------+---------+
## | 5 | education_num | Mean (sd) : 10.1 (2.6) | 16 distinct values | 48842 | 0 |
## | | [integer] | min < med < max: | | (100.0%) | (0.0%) |
## | | | 1 < 10 < 16 | | | |
## | | | IQR (CV) : 3 (0.3) | | | |
## +----+-----------------+--------------------------------+-----------------------+----------+---------+
## | 6 | marital_status | 1. Divorced | 6633 (13.6%) | 48842 | 0 |
## | | [factor] | 2. Married-AF-spouse | 37 ( 0.1%) | (100.0%) | (0.0%) |
## | | | 3. Married-civ-spouse | 22379 (45.8%) | | |
## | | | 4. Married-spouse-absent | 628 ( 1.3%) | | |
## | | | 5. Never-married | 16117 (33.0%) | | |
## | | | 6. Separated | 1530 ( 3.1%) | | |
## | | | 7. Widowed | 1518 ( 3.1%) | | |
## +----+-----------------+--------------------------------+-----------------------+----------+---------+
## | 7 | occupation | 1. ? | 2809 ( 5.8%) | 48842 | 0 |
## | | [factor] | 2. Adm-clerical | 5611 (11.5%) | (100.0%) | (0.0%) |
## | | | 3. Armed-Forces | 15 ( 0.0%) | | |
## | | | 4. Craft-repair | 6112 (12.5%) | | |
## | | | 5. Exec-managerial | 6086 (12.5%) | | |
## | | | 6. Farming-fishing | 1490 ( 3.1%) | | |
## | | | 7. Handlers-cleaners | 2072 ( 4.2%) | | |
## | | | 8. Machine-op-inspct | 3022 ( 6.2%) | | |
## | | | 9. Other-service | 4923 (10.1%) | | |
## | | | 10. Priv-house-serv | 242 ( 0.5%) | | |
## | | | [ 5 others ] | 16460 (33.7%) | | |
## +----+-----------------+--------------------------------+-----------------------+----------+---------+
## | 8 | relationship | 1. Husband | 19716 (40.4%) | 48842 | 0 |
## | | [factor] | 2. Not-in-family | 12583 (25.8%) | (100.0%) | (0.0%) |
## | | | 3. Other-relative | 1506 ( 3.1%) | | |
## | | | 4. Own-child | 7581 (15.5%) | | |
## | | | 5. Unmarried | 5125 (10.5%) | | |
## | | | 6. Wife | 2331 ( 4.8%) | | |
## +----+-----------------+--------------------------------+-----------------------+----------+---------+
## | 9 | race | 1. Amer-Indian-Eskimo | 470 ( 1.0%) | 48842 | 0 |
## | | [factor] | 2. Asian-Pac-Islander | 1519 ( 3.1%) | (100.0%) | (0.0%) |
## | | | 3. Black | 4685 ( 9.6%) | | |
## | | | 4. Other | 406 ( 0.8%) | | |
## | | | 5. White | 41762 (85.5%) | | |
## +----+-----------------+--------------------------------+-----------------------+----------+---------+
## | 10 | sex | 1. Female | 16192 (33.2%) | 48842 | 0 |
## | | [factor] | 2. Male | 32650 (66.8%) | (100.0%) | (0.0%) |
## +----+-----------------+--------------------------------+-----------------------+----------+---------+
## | 11 | capital_gain | Mean (sd) : 1079.1 (7452) | 123 distinct values | 48842 | 0 |
## | | [integer] | min < med < max: | | (100.0%) | (0.0%) |
## | | | 0 < 0 < 99999 | | | |
## | | | IQR (CV) : 0 (6.9) | | | |
## +----+-----------------+--------------------------------+-----------------------+----------+---------+
## | 12 | capital_loss | Mean (sd) : 87.5 (403) | 99 distinct values | 48842 | 0 |
## | | [integer] | min < med < max: | | (100.0%) | (0.0%) |
## | | | 0 < 0 < 4356 | | | |
## | | | IQR (CV) : 0 (4.6) | | | |
## +----+-----------------+--------------------------------+-----------------------+----------+---------+
## | 13 | hours_per_week | Mean (sd) : 40.4 (12.4) | 96 distinct values | 48842 | 0 |
## | | [integer] | min < med < max: | | (100.0%) | (0.0%) |
## | | | 1 < 40 < 99 | | | |
## | | | IQR (CV) : 5 (0.3) | | | |
## +----+-----------------+--------------------------------+-----------------------+----------+---------+
## | 14 | native_country | 1. ? | 857 ( 1.8%) | 48842 | 0 |
## | | [factor] | 2. Cambodia | 28 ( 0.1%) | (100.0%) | (0.0%) |
## | | | 3. Canada | 182 ( 0.4%) | | |
## | | | 4. China | 122 ( 0.2%) | | |
## | | | 5. Columbia | 85 ( 0.2%) | | |
## | | | 6. Cuba | 138 ( 0.3%) | | |
## | | | 7. Dominican-Republic | 103 ( 0.2%) | | |
## | | | 8. Ecuador | 45 ( 0.1%) | | |
## | | | 9. El-Salvador | 155 ( 0.3%) | | |
## | | | 10. England | 127 ( 0.3%) | | |
## | | | [ 32 others ] | 47000 (96.2%) | | |
## +----+-----------------+--------------------------------+-----------------------+----------+---------+
## | 15 | income | 1. <=50K | 37155 (76.1%) | 48842 | 0 |
## | | [factor] | 2. >50K | 11687 (23.9%) | (100.0%) | (0.0%) |
## +----+-----------------+--------------------------------+-----------------------+----------+---------+
At first glance, it appears no values are missing and each categorical variable is a factor as we intended, but we begin to notice a few issues. Certain variables contain a multitude of distinct values and, as such, are treated as numeric data types with statistics for mean, median, minimum, maximum, standard deviation, and interquartile range (IQR). For example, the variables `age`, `fnlwgt`, `capital_gain`, `capital_loss`, and `hours_per_week` each have 74 or more distinct values, with `fnlwgt` having 28523. We will need to decide if these are worth adjusting further to capture the full picture of the relationships between the variables and our target.
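One way to check how many distinct values each column holds, and whether R stores it as numeric, is sketched below. The small data frame `toy` is hypothetical and stands in for the full dataset.

```r
# Hypothetical miniature of the income data.
toy <- data.frame(
  age            = c(25L, 38L, 38L, 52L, 61L),
  hours_per_week = c(40L, 40L, 45L, 40L, 20L),
  sex            = factor(c("Male", "Female", "Male", "Male", "Female"))
)

# Number of distinct values and storage type for every column.
n_distinct <- sapply(toy, function(x) length(unique(x)))
is_numeric <- sapply(toy, is.numeric)

print(data.frame(n_distinct, is_numeric))
```

Columns with many distinct values and a numeric type are the ones summarized by mean and quantiles rather than by level counts.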
Following the Missing column, it seems none of the columns have missing values, but the Stats / Values column shows that `workclass`, `occupation`, and `native_country` contain the value ‘?’. The proportions of values in several columns also show significant imbalance. For example, 66.8% of observations are male according to the `sex` variable, and 85.5% are white according to the `race` variable. Given these disproportionate levels, we should expect that the data are not representative of a larger population unless that population happens to have similar proportions.
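The sketch below shows why a summary can report zero missing values while ‘?’ entries are present, and how level proportions can be tabulated. The data frame is hypothetical.

```r
# Hypothetical rows where missing entries were recorded as "?".
toy <- data.frame(
  workclass  = c("Private", "?", "Private", "State-gov", "?"),
  occupation = c("Sales", "?", "Tech-support", "Adm-clerical", "?"),
  sex        = c("Male", "Female", "Male", "Male", "Female"),
  stringsAsFactors = TRUE
)

# R sees "?" as a regular level, so no NA values are counted.
na_counts <- colSums(is.na(toy))

# Counting the "?" level directly reveals the hidden missing entries.
question_marks <- sapply(toy, function(x) sum(x == "?"))

# Share of each level, useful for spotting imbalance.
sex_share <- prop.table(table(toy$sex))

print(na_counts)
print(question_marks)
print(round(sex_share, 2))
```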
Our numeric variables `age`, `fnlwgt`, `capital_gain`, and `capital_loss` show signs of skew in the differences between their means and medians as well as in their ranges. The lowest value of `fnlwgt` is 12285, while the highest is 1490400. A similar problem exists with `capital_gain` and `capital_loss`.
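The gap between mean and median can be turned into a rough skew indicator. The sketch below applies Pearson's median skewness, 3 × (mean − median) / sd, to the means, medians, and standard deviations reported in the summary table. This coefficient is our choice for illustration and is not a statistic computed in the original analysis.

```r
# Mean, median, and standard deviation as reported in the data frame summary.
stats <- data.frame(
  variable = c("age", "fnlwgt", "capital_gain", "capital_loss"),
  mean     = c(38.6, 189664.1, 1079.1, 87.5),
  median   = c(37, 178144.5, 0, 0),
  sd       = c(13.7, 105604, 7452, 403)
)

# Pearson's median skewness: positive values indicate a longer right tail.
stats$pearson_skew <- 3 * (stats$mean - stats$median) / stats$sd

print(stats)
```

All four values are positive, which agrees with the right skew noted above. For zero-heavy variables such as `capital_gain`, this coefficient is only a coarse signal, since most of the mass sits at a single value.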
We consider a different summary method, which calculates the statistical parameters previously mentioned and counts the number of observations for each level, as performed above. This should confirm our previous grid table results, but we also look for any changes. The results of this new summary method are shown.
## age workclass fnlwgt
## Min. :17.00 Private :33906 Min. : 12285
## 1st Qu.:28.00 Self-emp-not-inc: 3862 1st Qu.: 117550
## Median :37.00 Local-gov : 3136 Median : 178144
## Mean :38.64 ? : 2799 Mean : 189664
## 3rd Qu.:48.00 State-gov : 1981 3rd Qu.: 237642
## Max. :90.00 Self-emp-inc : 1695 Max. :1490400
## (Other) : 1463
## education education_num marital_status
## HS-grad :15784 Min. : 1.00 Divorced : 6633
## Some-college:10878 1st Qu.: 9.00 Married-AF-spouse : 37
## Bachelors : 8025 Median :10.00 Married-civ-spouse :22379
## Masters : 2657 Mean :10.08 Married-spouse-absent: 628
## Assoc-voc : 2061 3rd Qu.:12.00 Never-married :16117
## 11th : 1812 Max. :16.00 Separated : 1530
## (Other) : 7625 Widowed : 1518
## occupation relationship race
## Prof-specialty : 6172 Husband :19716 Amer-Indian-Eskimo: 470
## Craft-repair : 6112 Not-in-family :12583 Asian-Pac-Islander: 1519
## Exec-managerial: 6086 Other-relative: 1506 Black : 4685
## Adm-clerical : 5611 Own-child : 7581 Other : 406
## Sales : 5504 Unmarried : 5125 White :41762
## Other-service : 4923 Wife : 2331
## (Other) :14434
## sex capital_gain capital_loss hours_per_week
## Female:16192 Min. : 0 Min. : 0.0 Min. : 1.00
## Male :32650 1st Qu.: 0 1st Qu.: 0.0 1st Qu.:40.00
## Median : 0 Median : 0.0 Median :40.00
## Mean : 1079 Mean : 87.5 Mean :40.42
## 3rd Qu.: 0 3rd Qu.: 0.0 3rd Qu.:45.00
## Max. :99999 Max. :4356.0 Max. :99.00
##
## native_country income
## United-States:43832 <=50K:37155
## Mexico : 951 >50K :11687
## ? : 857
## Philippines : 295
## Germany : 206
## Puerto-Rico : 184
## (Other) : 2517
With this method, our first results are confirmed: there appear to be few differences, if any, between the two summaries. The only noticeable change is to our target variable, income, where the previous function interpreted the values as factors without levels rather than a series of character strings as this new method did. This indicates we might not need to make any further changes to the data types, or adjustments for outliers or for the quantity of missing values (which might have included those with the ‘?’ level within their factors).
We take a closer look at our target variable to get a sense of what we are trying to predict. We also look for any innate imbalances within the target by spotting any additional unintentional biases towards a specific income level. We visualize the proportions for other factors as well to see just how skewed and disproportionate this dataset is. We include missing values as ‘?’ to demonstrate their influence on the dataset as well. The chart below shows the distribution of all categorical variables, which includes the factors mentioned previously.
In this first bar chart we notice that individuals with income less than or equal to $50,000 are roughly three times as frequent as those above it. This could be problematic for predicting the minority class (those with >50k), but we also know from the literature that this is realistic: in the world’s population, far fewer people have more wealth than have less of it. For this reason, we note the disproportionate share of individuals at or below $50,000 (<=50k) but leave it as is, because it reflects real-world statistics.
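The size of this imbalance can be worked out from the income counts reported in the data summary. The sketch below also computes the accuracy of always predicting the majority class, which is a useful floor when judging any model.

```r
# Income counts as reported in the data summary.
income_counts <- c("<=50K" = 37155, ">50K" = 11687)

class_share <- income_counts / sum(income_counts)                 # proportion of each class
imbalance   <- income_counts[["<=50K"]] / income_counts[[">50K"]] # majority-to-minority ratio
baseline    <- max(class_share)                                   # accuracy of always guessing <=50K

print(round(class_share, 3))
print(round(imbalance, 2))
print(round(baseline, 3))
```

A model therefore needs to exceed roughly 76% accuracy before it improves on a constant guess of <=50K.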
From this chart, it is very clear we have a dataset of mostly white males in the private working class. Individuals in this survey were also mostly married at some point, as noted by the `married_civ_spouse`, `married_af_spouse`, `separated`, and `widowed` frequencies above. To us, the only case in which marriage did not occur is the `never_married` category. It is not likely that this will help us predict income, since the data disproportionately favor those who have been married at least once in their lives. We can clearly see this growing disproportionality in the relationship variable, where most individuals are husbands who also did not have kids. We see missing values in the form of a question mark (?) for the variables `workclass`, `occupation`, and `native_country`. The target variable income is also shown as mostly <=50k, so we know the results agree with the previous chart. Occupation seems to have no underlying patterns or skew, but we should note that individuals in the U.S. dominate the respondents list. This makes understanding differences between income relationships across countries quite difficult for our purposes.
We also generate histograms with the count of each observation to assess our numeric variable distributions. This will let us know more about the skewness, average values, and where potential outliers may be found for our numeric variables. The graph below shows their distributions.
As expected, the number of hours per week spikes at about 40. This makes sense since most people in the U.S. tend to work about 40 hours per week, and they are the group best represented within this dataset. Unfortunately, an extremely small proportion of individuals reported `capital_gain` and `capital_loss` in this study. These reflect real-world trends but are likely not representative of true income categories above and below $50,000.
The next set of graphs shows the income distribution against `native_country`, `workclass`, `education`, `sex`, and `race`. Males outnumber females in both income categories. White respondents make up a far larger share of the income distribution than other races. The private working class has more individuals in both income categories than any other class, and the United States has the largest counts in both categories compared to all other countries. Another chart is shown below for details.
In addition to those trends, we note that in many other countries most respondents have income less than or equal to $50,000. Of the other countries, only Canada and India appear to have a significant proportion of people with greater than $50,000 per year. The rest are either `?` or so small that they might as well be absent, since they likely misrepresent the population of the country they are assigned to. No other country has a distribution similar in scale, proportions, or magnitude to that captured for the United States, which will affect our results.
The private class dominates the proportion of respondents from this survey. If all other categories were stacked together, including the missing values labeled as `?`, they would cover less than half of the private working class responses in this study. Because of this, it is also no surprise that the private working class holds the largest share of income in the category greater than $50,000. But what about education?
From this dataset, we see that education does have an impact on earning potential. As years of education increase, the proportion of the population with income greater than $50,000 increases. However, the issue with this dataset is that it does not match repeated findings from peer-reviewed literature. Due to the slight increases in the proportion of people with `>50k`, it seems to matter less if you have a graduate or professional degree (categorized as `Masters`, `Doctorate`, `Prof-school` in the chart) than if you simply graduate from high school. We know this to be partially the case, but the income of people with graduate or professional degrees should be greater in all cases, since pursuing the degree also delays earning income (while costing the individual).
This could be because we have a disproportionate number of individuals who responded with at least some college when compared to real-world populations. This disproportionality is evident in the sum of the respondents with any education at or above the category `some_college`. Add up those with `some_college`, an associate’s, bachelor’s, master’s, doctorate, or professional degree and we have a population greater than or equal to all high-school graduates. In other words, roughly 50 - 75% of this population would have been to college. Based on census bureau data, this is not the case in the real world. Estimates in peer-reviewed literature place the proportion who have been to college well below our lowest bound. It is important to recognize this subset of the population which responded to the survey.
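The lower end of that range can be checked from the education counts that are printed in the summary outputs. The Doctorate and Prof-school counts are not listed individually there, so the sketch below leaves them out and the result is a lower bound.

```r
# Education counts taken from the summary outputs.
# Doctorate and Prof-school are not itemized there, so they are omitted.
college_counts <- c("Some-college" = 10878, "Bachelors" = 8025, "Masters" = 2657,
                    "Assoc-voc" = 2061, "Assoc-acdm" = 1601)
hs_grad <- 15784
n_total <- 48842

college_total <- sum(college_counts)       # respondents with at least some college (lower bound)
college_share <- college_total / n_total   # share of the whole dataset

print(college_total)
print(college_total > hs_grad)
print(round(college_share, 3))
```

Even without the two omitted categories, just over half of the respondents report at least some college, and that group is larger than the group of high-school graduates.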
This chart shows how males hold most of the income among individuals in this dataset. Unfortunately, females make up a smaller proportion of the dataset. This could be the case in some countries, but in the U.S. (which makes up roughly 90% of the data), it is common for women to have occupations that should pay the equivalent of men for the same occupation. However, this chart shows that if the counts of each bar were adjusted to the same height, females would have a smaller number of individuals counted with income >50k based on this dataset. There is clearly not an equal income distribution among the sexes presented in this dataset.
The majority of individual respondents were white and, as such, their distribution covers most of the chart. From this we can see that white individuals with `>50k` make up the largest of all the proportions of income among the races. Based on this dataset, it appears that the races with the lowest proportion of people who have income greater than $50,000 are Black, American Indian/Eskimo, or other minorities. This is consistent with conventional literature and it will also affect our results by making prediction of minority classes more difficult.
To determine how well each variable is correlated with our target variable and with one another, we construct a correlation plot. This plot contains the values of all correlations between variables, represented by colors and numbers. The lighter the color, the lower the correlation. Meanwhile, darker blue indicates stronger positive correlations while darker red indicates stronger negative correlations.
Given that our numeric features have correlation values near 0, they do not seem to be strongly correlated with our target. They also do not seem to have any correlation with one another, so collinearity among them does not have to be dealt with. Weak correlations indicate little to no linear relationship with our target variable.
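The numbers behind such a plot come from a correlation matrix. The sketch below computes one with base R's `cor` on simulated data; the columns are generated independently, so the off-diagonal values are near 0 by construction and say nothing about the real dataset. The 0/1 column `income_high` is a hypothetical numeric coding of the target.

```r
set.seed(1)
n <- 200

# Simulated numeric features, drawn independently of one another.
toy <- data.frame(
  age            = round(runif(n, 17, 90)),
  hours_per_week = round(rnorm(n, mean = 40, sd = 12)),
  education_num  = sample(1:16, n, replace = TRUE)
)
# Hypothetical 0/1 coding of the income target (1 means >50K).
toy$income_high <- rbinom(n, size = 1, prob = 0.24)

# Pearson correlations between every pair of columns.
corr <- cor(toy)

print(round(corr, 2))
```

The row for `income_high` holds each feature's correlation with the target, and the remaining off-diagonal cells show how the features relate to one another.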
Before this income data can be used as input to our machine learning models, it must be cleaned, formatted, and restructured; this is typically known as preprocessing. In this income dataset there are columns that have values listed as ‘?’. During the data preparation process we will clean these values, transform skewed features, and perform the train and test split for the models. This preprocessing can improve the outcome and increase the accuracy of almost all our learning algorithms.
At this stage it is clear that our dataset does have missing values that appear as ‘?’. In the next step we replace the ? with NA and then keep only the complete cases. There are 3620 cases with missing values that need to be left out. This leaves a dataset with 45222 rows and 15 columns. A plot of the remaining proportions of missing values is shown after this reduction in the dimensions of the dataset.
The only 3 variables that contained missing values were `native_country`, `workclass`, and `occupation`. Each contains an acceptable amount of missing data, but these variables will not be highly weighted in our algorithms. We tally the number of incomplete cases present at this step and show the reduction in the dimensions of the dataset, which maintains the same number of variables.
## [1] 3620
## [1] 45222 15
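The recoding and filtering described above can be sketched in base R as follows. The four-row data frame is hypothetical, and the last lines check the arithmetic of the reported counts.

```r
# Hypothetical rows with missing entries recorded as "?".
toy <- data.frame(
  age            = c(39L, 50L, 28L, 45L),
  workclass      = c("Private", "?", "State-gov", "Private"),
  occupation     = c("Sales", "?", "?", "Tech-support"),
  native_country = c("United-States", "Cuba", "United-States", "?"),
  stringsAsFactors = FALSE
)

# Recode "?" as NA so R treats it as missing.
toy[toy == "?"] <- NA

# Count incomplete rows, then keep only the complete ones.
n_incomplete <- sum(!complete.cases(toy))
clean        <- toy[complete.cases(toy), ]

print(n_incomplete)
print(dim(clean))

# Consistency check on the counts reported above: 48842 - 3620 = 45222.
print(48842 - 3620)
```

If the text columns are stored as factors, calling `droplevels` on the result removes the now-empty `?` level.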
There are many outliers identified in the `hours_per_week`, `capital_gain`, `capital_loss`, and `fnlwgt` variables. These are shown as points in the grid of boxplots above. The `fnlwgt` variable (i.e. final weight) should be removed since it has no predictive power; it is a feature used to allocate similar weights to people with similar demographic characteristics. We are also removing the variable `education` since it is just a label for the `education_num` column. This reduces our dataset by 2 variables, to 13.
For highly skewed feature distributions, we perform a Box-Cox transformation on selected disproportionate columns to reduce the skewness and make them more Gaussian. Combining this with the center and scale transforms standardizes the data, so the features have a mean of 0 and a standard deviation of 1. This preprocessing uses the caret package’s `preProcess` function to apply a Box-Cox transformation to all numeric variables in our income dataset. These numeric variables include `age`, `education_num`, `capital_gain`, `capital_loss`, and `hours_per_week`. A sample of the first few rows after this transformation is shown.
## age workclass education_num marital_status
## 1 0.17111968 State-gov 1.1572034 Never-married
## 2 0.91061264 Self-emp-not-inc 1.1572034 Married-civ-spouse
## 3 0.09590960 Private -0.4799097 Divorced
## 4 1.08942229 Private -1.2243435 Married-civ-spouse
## 5 -0.75961330 Private 1.1572034 Married-civ-spouse
## 6 0.01909911 Private 1.5929027 Married-civ-spouse
## occupation relationship race sex capital_gain capital_loss
## 1 Adm-clerical Not-in-family White Male 0.1428868 -0.2187778
## 2 Exec-managerial Husband White Male -0.1467316 -0.2187778
## 3 Handlers-cleaners Not-in-family White Male -0.1467316 -0.2187778
## 4 Handlers-cleaners Husband Black Male -0.1467316 -0.2187778
## 5 Prof-specialty Wife Black Female -0.1467316 -0.2187778
## 6 Exec-managerial Wife White Female -0.1467316 -0.2187778
## hours_per_week native_country income
## 1 -0.0781192 United-States 0
## 2 -2.3267123 United-States 0
## 3 -0.0781192 United-States 0
## 4 -0.0781192 United-States 0
## 5 -0.0781192 Cuba 0
## 6 -0.0781192 United-States 0
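To illustrate what the transformation does to a single column, here is a base R sketch of a Box-Cox transform followed by centering and scaling. The helper `box_cox`, the ages, and the value of `lambda` are all assumptions made for illustration; in the analysis itself the power parameter is estimated by caret rather than fixed by hand.

```r
# Box-Cox transform for a given lambda; only defined for strictly positive input.
box_cox <- function(x, lambda) {
  stopifnot(all(x > 0))
  if (abs(lambda) < 1e-8) log(x) else (x^lambda - 1) / lambda
}

age    <- c(17, 23, 28, 37, 37, 48, 61, 90)  # made-up, right-skewed ages
lambda <- 0.5                                # assumed value, not an estimate

age_bc  <- box_cox(age, lambda)              # compress the long right tail
age_std <- as.numeric(scale(age_bc))         # center to mean 0, scale to sd 1

print(round(age_std, 3))
print(round(c(mean = mean(age_std), sd = sd(age_std)), 6))
```

Because Box-Cox is only defined for strictly positive values, columns that contain zeros, such as `capital_gain` and `capital_loss`, cannot be transformed this way without first being shifted or handled separately.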
In this section we try to find those variables which could be used to reduce the dimensions of the dataset while also explaining the variation among the variables. Factors and numeric values typically require their own kinds of analysis, specifically factor analysis and principal component analysis (PCA). We take advantage of the multitude of levels identified within our factors during data exploration and assign numeric codes to each of them. This allows all of our dataset’s variables to appear numeric and be run in PCA while preserving the distance between points. We center and scale each run and complete 4 analyses. Ideally this will identify a few variables that we can use to maintain the same prediction accuracy in certain models. The results are shown below.
In the variance plots we see how much the variation changes, in numerical terms, between principal components, while the scree plots show the percentage of variance explained by those same components. We also create contribution plots to show which variables contribute the most to the data, and biplots to show their directional contribution (as high or low contribution ratings). These 4 analyses should tell a similar story, but we make specific changes to highlight the differences. We start by running PCA on the numeric variables alone. Then, we remove the `capital_gain` and `capital_loss` variables. We repeat this on the full group of variables (numeric and factors) with and without these gain and loss variables, since they skew the true results when trying to predict our target.
## Importance of components:
## PC1 PC2 PC3 PC4 PC5
## Standard deviation 7506.4415 404.7482 13.37 11.7 2.499
## Proportion of Variance 0.9971 0.0029 0.00 0.0 0.000
## Cumulative Proportion 0.9971 1.0000 1.00 1.0 1.000
## Importance of components:
## PC1 PC2 PC3
## Standard deviation 13.4813 11.7173 2.52352
## Proportion of Variance 0.5585 0.4219 0.01957
## Cumulative Proportion 0.5585 0.9804 1.00000
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## Standard deviation 7506.4415 404.7483 13.39 11.7 6.086 4.035 2.476 1.553
## Proportion of Variance 0.9971 0.0029 0.00 0.0 0.000 0.000 0.000 0.000
## Cumulative Proportion 0.9971 1.0000 1.00 1.0 1.000 1.000 1.000 1.000
## PC9 PC10 PC11 PC12 PC13
## Standard deviation 1.391 1.141 0.8144 0.3902 0.3513
## Proportion of Variance 0.000 0.000 0.0000 0.0000 0.0000
## Cumulative Proportion 1.000 1.000 1.0000 1.0000 1.0000
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 13.4999 11.7204 6.08599 4.03695 2.49933 1.55425 1.39069
## Proportion of Variance 0.4724 0.3561 0.09601 0.04224 0.01619 0.00626 0.00501
## Cumulative Proportion 0.4724 0.8285 0.92446 0.96671 0.98290 0.98916 0.99417
## PC8 PC9 PC10 PC11
## Standard deviation 1.14165 0.81443 0.39604 0.35384
## Proportion of Variance 0.00338 0.00172 0.00041 0.00032
## Cumulative Proportion 0.99755 0.99927 0.99968 1.00000
When we exclude the `capital_gain` and `capital_loss` variables, we find that the most important factors for explaining the variation in the data are the age of the individual, how many hours per week they work, and what their education level is. These factors are largely consistent with real-world expectations, even though we are aware the dataset is not necessarily representative of a larger population. These variables, `age`, `education_num`, and `hours_per_week`, should be used if a reduction in the dimensions of the data is necessary for model development while maintaining accuracy.
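The effect that a single large-variance column has on PCA can be reproduced with base R's `prcomp`. The sketch below uses simulated columns, not the study data, and compares the variance explained with and without scaling. In the first and third summaries above, PC1 alone carries 99.71% of the variance, a pattern that typically appears when one column's variance is far larger than the rest.

```r
set.seed(42)
n <- 300

# Simulated columns; capital_gain is given a much larger spread on purpose.
toy <- data.frame(
  age            = rnorm(n, mean = 38, sd = 13),
  hours_per_week = rnorm(n, mean = 40, sd = 12),
  capital_gain   = rexp(n, rate = 1 / 1000)
)

pca_raw    <- prcomp(toy, center = TRUE, scale. = FALSE)  # columns keep their original variance
pca_scaled <- prcomp(toy, center = TRUE, scale. = TRUE)   # every column rescaled to unit variance

# Proportion of total variance carried by each component.
var_share <- function(p) p$sdev^2 / sum(p$sdev^2)

print(round(var_share(pca_raw), 4))
print(round(var_share(pca_scaled), 4))
```

Without scaling, the first component is almost entirely the large-variance column. With scaling, the variance is spread across components, which makes the loadings easier to compare across variables.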
In this data preparation step we partition the dataset into training and validation sets using the `createDataPartition` function from the `caret` package. We reserve 70% for training and the remaining 30% for validation. The training dataset contains 31656 observations of our 13 selected variables. We set the remaining 13566 observations aside to assess our models’ performance.
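A stratified 70/30 split keeps the income proportions the same in both sets. The sketch below is a base R stand-in for that idea on a hypothetical 100-row data frame; it is not the caret call used in the analysis, and the seed is arbitrary.

```r
set.seed(123)

# Hypothetical data with roughly the same class balance as the income target.
income <- factor(c(rep("<=50K", 76), rep(">50K", 24)))
toy    <- data.frame(id = seq_along(income), income = income)

# Draw 70% of the row indices within each income class.
rows_by_class <- split(seq_len(nrow(toy)), toy$income)
train_idx <- unlist(lapply(rows_by_class, function(idx) {
  sample(idx, size = ceiling(0.7 * length(idx)))
}), use.names = FALSE)

training   <- toy[train_idx, ]
validation <- toy[-train_idx, ]

print(c(training = nrow(training), validation = nrow(validation)))
print(round(prop.table(table(training$income)), 3))

# Consistency check on the reported sizes: 31656 + 13566 = 45222.
print(31656 + 13566)
```

Sampling within each class is what keeps the >50K share of the training set close to its share in the full data.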
With the data prepared for modeling, we develop several models that we expect to have the best chance of predicting the binary income target. These include logistic regression, a decision tree, a random forest, and a gradient-boosted (XGBoost) model. We use the same dataset for each, without further transformation or reversion, to keep the results simple to compare. In this section we tabulate the accuracy of each model so that later models can build on the strengths and cover the weaker, more error-prone portions of earlier ones. Those results are then compiled in the model performance section.
Our first model is a logistic regression model. This model lets us identify the variables and factor levels that have a significant influence on the target variable, income. To do this, the factors must first be converted into dummy variables within the training and test sets. We use the caret package for its development. The results are shown below.
##
## Call:
## glm(formula = income ~ ., family = "binomial", data = training)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -4.8857 -0.5154 -0.1875 -0.0215 3.9965
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.47311 0.74384 -4.669 3.02e-06
## age 0.42398 0.02391 17.735 < 2e-16
## workclassLocal-gov -0.70800 0.10982 -6.447 1.14e-10
## workclassPrivate -0.48537 0.09074 -5.349 8.84e-08
## workclassSelf-emp-inc -0.35680 0.11986 -2.977 0.002912
## workclassSelf-emp-not-inc -1.05975 0.10651 -9.950 < 2e-16
## workclassState-gov -0.79442 0.12079 -6.577 4.81e-11
## workclassWithout-pay -1.17435 0.82319 -1.427 0.153701
## education_num 0.70257 0.02325 30.219 < 2e-16
## marital_statusMarried-AF-spouse 2.39636 0.61699 3.884 0.000103
## marital_statusMarried-civ-spouse 2.42026 0.27689 8.741 < 2e-16
## marital_statusMarried-spouse-absent 0.22617 0.22609 1.000 0.317132
## marital_statusNever-married -0.35877 0.08800 -4.077 4.56e-05
## marital_statusSeparated 0.12102 0.15507 0.780 0.435156
## marital_statusWidowed 0.16393 0.15205 1.078 0.280983
## occupationArmed-Forces 0.34511 0.93830 0.368 0.713024
## occupationCraft-repair 0.01538 0.07826 0.196 0.844244
## occupationExec-managerial 0.71541 0.07549 9.477 < 2e-16
## occupationFarming-fishing -0.99588 0.13732 -7.253 4.09e-13
## occupationHandlers-cleaners -0.66232 0.13622 -4.862 1.16e-06
## occupationMachine-op-inspct -0.37912 0.10149 -3.735 0.000187
## occupationOther-service -0.87291 0.11601 -7.525 5.28e-14
## occupationPriv-house-serv -1.76899 0.77207 -2.291 0.021951
## occupationProf-specialty 0.52309 0.07837 6.675 2.47e-11
## occupationProtective-serv 0.49507 0.12565 3.940 8.15e-05
## occupationSales 0.26105 0.08013 3.258 0.001123
## occupationTech-support 0.52252 0.10809 4.834 1.34e-06
## occupationTransport-moving -0.10875 0.09657 -1.126 0.260104
## relationshipNot-in-family 0.65806 0.27368 2.404 0.016196
## relationshipOther-relative -0.57552 0.25459 -2.261 0.023785
## relationshipOwn-child -0.35996 0.26685 -1.349 0.177358
## relationshipUnmarried 0.43244 0.28956 1.493 0.135321
## relationshipWife 1.25354 0.10318 12.149 < 2e-16
## raceAsian-Pac-Islander 1.08237 0.27466 3.941 8.12e-05
## raceBlack 0.47047 0.23270 2.022 0.043201
## raceOther 0.75004 0.34122 2.198 0.027943
## raceWhite 0.70143 0.22136 3.169 0.001531
## sexMale 0.81226 0.07848 10.349 < 2e-16
## capital_gain 2.40587 0.07870 30.569 < 2e-16
## capital_loss 0.26919 0.01523 17.674 < 2e-16
## hours_per_week 0.33501 0.01947 17.206 < 2e-16
## native_countryCanada -0.27713 0.68660 -0.404 0.686493
## native_countryChina -1.46388 0.70671 -2.071 0.038321
## native_countryColumbia -2.58076 1.02850 -2.509 0.012099
## native_countryCuba -0.59564 0.70644 -0.843 0.399135
## native_countryDominican-Republic -2.65882 1.21353 -2.191 0.028453
## native_countryEcuador -1.51333 0.99685 -1.518 0.128988
## native_countryEl-Salvador -0.95062 0.79168 -1.201 0.229846
## native_countryEngland -0.21348 0.72118 -0.296 0.767215
## native_countryFrance 0.21584 0.80953 0.267 0.789754
## native_countryGermany -0.55407 0.68625 -0.807 0.419442
## native_countryGreece -0.72748 0.76363 -0.953 0.340763
## native_countryGuatemala -0.92641 0.99291 -0.933 0.350807
## native_countryHaiti 0.29994 0.82510 0.364 0.716217
## native_countryHoland-Netherlands -6.79532 119.47006 -0.057 0.954642
## native_countryHonduras -0.28313 1.32242 -0.214 0.830470
## native_countryHong -1.09882 0.86073 -1.277 0.201737
## native_countryHungary 0.04357 0.88496 0.049 0.960733
## native_countryIndia -1.22863 0.68149 -1.803 0.071411
## native_countryIran -0.25848 0.78130 -0.331 0.740769
## native_countryIreland 0.51969 0.85381 0.609 0.542741
## native_countryItaly 0.30977 0.70855 0.437 0.661976
## native_countryJamaica -0.17515 0.81132 -0.216 0.829076
## native_countryJapan -0.51107 0.72863 -0.701 0.483051
## native_countryLaos -1.94923 1.28405 -1.518 0.129006
## native_countryMexico -1.05230 0.66693 -1.578 0.114604
## native_countryNicaragua -1.02472 1.00870 -1.016 0.309684
## native_countryOutlying-US(Guam-USVI-etc) -0.89973 1.25993 -0.714 0.475161
## native_countryPeru -2.11946 1.04829 -2.022 0.043195
## native_countryPhilippines -0.72138 0.65812 -1.096 0.273025
## native_countryPoland -0.21904 0.73221 -0.299 0.764830
## native_countryPortugal -0.19556 0.79784 -0.245 0.806371
## native_countryPuerto-Rico -0.77693 0.72086 -1.078 0.281130
## native_countryScotland -2.62038 1.26438 -2.072 0.038222
## native_countrySouth -2.39725 0.76145 -3.148 0.001642
## native_countryTaiwan -0.77156 0.77061 -1.001 0.316712
## native_countryThailand -1.97936 1.05882 -1.869 0.061568
## native_countryTrinadad&Tobago -1.28077 1.05847 -1.210 0.226273
## native_countryUnited-States -0.40577 0.63489 -0.639 0.522746
## native_countryVietnam -1.92562 0.86961 -2.214 0.026805
## native_countryYugoslavia 1.17984 1.02311 1.153 0.248832
##
## (Intercept) ***
## age ***
## workclassLocal-gov ***
## workclassPrivate ***
## workclassSelf-emp-inc **
## workclassSelf-emp-not-inc ***
## workclassState-gov ***
## workclassWithout-pay
## education_num ***
## marital_statusMarried-AF-spouse ***
## marital_statusMarried-civ-spouse ***
## marital_statusMarried-spouse-absent
## marital_statusNever-married ***
## marital_statusSeparated
## marital_statusWidowed
## occupationArmed-Forces
## occupationCraft-repair
## occupationExec-managerial ***
## occupationFarming-fishing ***
## occupationHandlers-cleaners ***
## occupationMachine-op-inspct ***
## occupationOther-service ***
## occupationPriv-house-serv *
## occupationProf-specialty ***
## occupationProtective-serv ***
## occupationSales **
## occupationTech-support ***
## occupationTransport-moving
## relationshipNot-in-family *
## relationshipOther-relative *
## relationshipOwn-child
## relationshipUnmarried
## relationshipWife ***
## raceAsian-Pac-Islander ***
## raceBlack *
## raceOther *
## raceWhite **
## sexMale ***
## capital_gain ***
## capital_loss ***
## hours_per_week ***
## native_countryCanada
## native_countryChina *
## native_countryColumbia *
## native_countryCuba
## native_countryDominican-Republic *
## native_countryEcuador
## native_countryEl-Salvador
## native_countryEngland
## native_countryFrance
## native_countryGermany
## native_countryGreece
## native_countryGuatemala
## native_countryHaiti
## native_countryHoland-Netherlands
## native_countryHonduras
## native_countryHong
## native_countryHungary
## native_countryIndia .
## native_countryIran
## native_countryIreland
## native_countryItaly
## native_countryJamaica
## native_countryJapan
## native_countryLaos
## native_countryMexico
## native_countryNicaragua
## native_countryOutlying-US(Guam-USVI-etc)
## native_countryPeru *
## native_countryPhilippines
## native_countryPoland
## native_countryPortugal
## native_countryPuerto-Rico
## native_countryScotland *
## native_countrySouth **
## native_countryTaiwan
## native_countryThailand .
## native_countryTrinadad&Tobago
## native_countryUnited-States
## native_countryVietnam *
## native_countryYugoslavia
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 35452 on 31655 degrees of freedom
## Residual deviance: 20411 on 31575 degrees of freedom
## AIC: 20573
##
## Number of Fisher Scoring iterations: 9
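The estimates in the summary above are on the log-odds scale. As a rough reading aid (a sketch on simulated data with assumed coefficients, not a refit of the census model), exponentiating a coefficient gives the multiplicative change in the odds of the positive outcome per one-unit increase in that predictor:

```r
# Simulated data with assumed true coefficients; illustrative only.
set.seed(622)
n <- 5000
education_num  <- rnorm(n)   # treated as already centred and scaled
hours_per_week <- rnorm(n)
log_odds <- -1.5 + 0.7 * education_num + 0.3 * hours_per_week
income   <- rbinom(n, size = 1, prob = plogis(log_odds))

fit <- glm(income ~ education_num + hours_per_week, family = "binomial")

# exp(coefficient) = odds ratio for a one-unit increase in the predictor.
odds_ratios <- exp(coef(fit))
print(round(odds_ratios, 3))

# Predicted probability for a person one unit above average on both.
new_person <- data.frame(education_num = 1, hours_per_week = 1)
p_hat <- predict(fit, newdata = new_person, type = "response")
print(round(unname(p_hat), 3))
```

Read the same way, the reported education_num estimate of 0.70257 corresponds to an odds ratio of about 2.02 per unit. If the predictors were centred and scaled as in the preprocessing code, one unit is one standard deviation of the transformed variable.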
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 9449 755
## 1 1307 2055
##
## Accuracy : 0.848
## 95% CI : (0.8418, 0.854)
## No Information Rate : 0.7929
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.5685
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.8785
## Specificity : 0.7313
## Pos Pred Value : 0.9260
## Neg Pred Value : 0.6112
## Prevalence : 0.7929
## Detection Rate : 0.6965
## Detection Prevalence : 0.7522
## Balanced Accuracy : 0.8049
##
## 'Positive' Class : 0
##
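The statistics in the confusion matrix output above can be recomputed by hand from the four counts, which makes clear what each one measures. This is a sketch in base R rather than the caret call that produced the output; note that the positive class is "0" (income at or below $50K), so sensitivity here describes the lower-income group.

```r
# Counts copied from the confusion matrix above
# (rows = prediction, columns = reference; positive class is "0").
cm <- matrix(c(9449, 1307, 755, 2055), nrow = 2,
             dimnames = list(Prediction = c("0", "1"),
                             Reference  = c("0", "1")))

n           <- sum(cm)
accuracy    <- sum(diag(cm)) / n
sensitivity <- cm["0", "0"] / sum(cm[, "0"])  # reference "0" predicted as "0"
specificity <- cm["1", "1"] / sum(cm[, "1"])  # reference "1" predicted as "1"
ppv         <- cm["0", "0"] / sum(cm["0", ])  # predicted "0" that are "0"
npv         <- cm["1", "1"] / sum(cm["1", ])  # predicted "1" that are "1"

# Cohen's kappa: agreement beyond what the marginal totals alone give.
expected    <- sum(rowSums(cm) * colSums(cm)) / n^2
cohen_kappa <- (accuracy - expected) / (1 - expected)

print(round(c(Accuracy = accuracy, Kappa = cohen_kappa,
              Sensitivity = sensitivity, Specificity = specificity,
              PosPredValue = ppv, NegPredValue = npv), 4))
```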
Our logistic regression model's accuracy comes out as 0.848, or roughly 85%. There is room for improvement in this model's sensitivity (0.8785) and, even more, its specificity (0.7313), among other metrics. Judging by the significance of each coefficient in this first logistic regression model, the native_country levels provide little, if any, significance. Next, we try to improve the overall performance of the model by removing country as a variable.
##
## Call:
## glm(formula = income ~ age + workclass + education_num + marital_status +
## occupation + relationship + race + sex + capital_gain + capital_loss +
## hours_per_week, family = "binomial", data = training)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -5.1765 -0.5190 -0.1897 -0.0248 3.8727
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.84375 0.37085 -10.365 < 2e-16 ***
## age 0.42704 0.02377 17.964 < 2e-16 ***
## workclassLocal-gov -0.70705 0.10957 -6.453 1.10e-10 ***
## workclassPrivate -0.49229 0.09054 -5.437 5.42e-08 ***
## workclassSelf-emp-inc -0.36874 0.11938 -3.089 0.00201 **
## workclassSelf-emp-not-inc -1.07250 0.10619 -10.100 < 2e-16 ***
## workclassState-gov -0.79835 0.12063 -6.618 3.63e-11 ***
## workclassWithout-pay -1.15359 0.82450 -1.399 0.16177
## education_num 0.70247 0.02289 30.693 < 2e-16 ***
## marital_statusMarried-AF-spouse 2.38860 0.61396 3.891 0.00010 ***
## marital_statusMarried-civ-spouse 2.38313 0.27427 8.689 < 2e-16 ***
## marital_statusMarried-spouse-absent 0.20679 0.22395 0.923 0.35581
## marital_statusNever-married -0.35864 0.08777 -4.086 4.39e-05 ***
## marital_statusSeparated 0.11248 0.15483 0.726 0.46757
## marital_statusWidowed 0.15843 0.15196 1.043 0.29712
## occupationArmed-Forces 0.35864 0.93893 0.382 0.70248
## occupationCraft-repair 0.01981 0.07798 0.254 0.79945
## occupationExec-managerial 0.72302 0.07521 9.613 < 2e-16 ***
## occupationFarming-fishing -0.98003 0.13654 -7.178 7.09e-13 ***
## occupationHandlers-cleaners -0.66306 0.13574 -4.885 1.03e-06 ***
## occupationMachine-op-inspct -0.39751 0.10093 -3.939 8.19e-05 ***
## occupationOther-service -0.88180 0.11518 -7.656 1.92e-14 ***
## occupationPriv-house-serv -1.77814 0.76588 -2.322 0.02025 *
## occupationProf-specialty 0.52934 0.07805 6.782 1.19e-11 ***
## occupationProtective-serv 0.48854 0.12510 3.905 9.41e-05 ***
## occupationSales 0.26084 0.07988 3.265 0.00109 **
## occupationTech-support 0.53429 0.10783 4.955 7.24e-07 ***
## occupationTransport-moving -0.09862 0.09621 -1.025 0.30536
## relationshipNot-in-family 0.63300 0.27104 2.335 0.01952 *
## relationshipOther-relative -0.63392 0.25152 -2.520 0.01172 *
## relationshipOwn-child -0.37860 0.26498 -1.429 0.15307
## relationshipUnmarried 0.39173 0.28695 1.365 0.17220
## relationshipWife 1.24731 0.10285 12.128 < 2e-16 ***
## raceAsian-Pac-Islander 0.57791 0.24371 2.371 0.01773 *
## raceBlack 0.48339 0.23209 2.083 0.03727 *
## raceOther 0.45988 0.33398 1.377 0.16852
## raceWhite 0.70150 0.22118 3.172 0.00152 **
## sexMale 0.80531 0.07829 10.286 < 2e-16 ***
## capital_gain 2.40571 0.07844 30.671 < 2e-16 ***
## capital_loss 0.26819 0.01516 17.694 < 2e-16 ***
## hours_per_week 0.33279 0.01938 17.176 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 35452 on 31655 degrees of freedom
## Residual deviance: 20515 on 31615 degrees of freedom
## AIC: 20597
##
## Number of Fisher Scoring iterations: 7
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 9449 755
## 1 1307 2055
##
## Accuracy : 0.848
## 95% CI : (0.8418, 0.854)
## No Information Rate : 0.7929
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.5685
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.8785
## Specificity : 0.7313
## Pos Pred Value : 0.9260
## Neg Pred Value : 0.6112
## Prevalence : 0.7929
## Detection Rate : 0.6965
## Detection Prevalence : 0.7522
## Balanced Accuracy : 0.8049
##
## 'Positive' Class : 0
##
The accuracy of our logistic regression model without country comes out as 0.848, or roughly 85%, the same as before. There is room for improvement in this model's sensitivity, among other metrics. We try to improve this by adding the country variable back and creating explicit dummy variables for the factor variables.
##
## Call:
## glm(formula = income ~ ., family = "binomial", data = training.dum)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -4.8857 -0.5154 -0.1875 -0.0215 3.9965
##
## Coefficients: (11 not defined because of singularities)
## Estimate Std. Error z value
## (Intercept) -0.64521 1.19892 -0.538
## age 0.42398 0.02391 17.735
## education_num 0.70257 0.02325 30.219
## capital_gain 2.40587 0.07870 30.569
## capital_loss 0.26919 0.01523 17.674
## hours_per_week 0.33501 0.01947 17.206
## `workclass_?` NA NA NA
## `workclass_Federal-gov` 1.17435 0.82319 1.427
## `workclass_Local-gov` 0.46635 0.82148 0.568
## `workclass_Never-worked` NA NA NA
## workclass_Private 0.68898 0.81882 0.841
## `workclass_Self-emp-inc` 0.81755 0.82229 0.994
## `workclass_Self-emp-not-inc` 0.11461 0.82022 0.140
## `workclass_State-gov` 0.37993 0.82304 0.462
## `workclass_Without-pay` NA NA NA
## marital_status_Divorced -0.16393 0.15205 -1.078
## `marital_status_Married-AF-spouse` 2.23243 0.63132 3.536
## `marital_status_Married-civ-spouse` 2.25633 0.30620 7.369
## `marital_status_Married-spouse-absent` 0.06224 0.25964 0.240
## `marital_status_Never-married` -0.52270 0.15788 -3.311
## marital_status_Separated -0.04292 0.20055 -0.214
## marital_status_Widowed NA NA NA
## `occupation_?` NA NA NA
## `occupation_Adm-clerical` 0.10875 0.09657 1.126
## `occupation_Armed-Forces` 0.45385 0.94018 0.483
## `occupation_Craft-repair` 0.12413 0.08290 1.497
## `occupation_Exec-managerial` 0.82416 0.08511 9.683
## `occupation_Farming-fishing` -0.88713 0.13867 -6.397
## `occupation_Handlers-cleaners` -0.55357 0.13953 -3.967
## `occupation_Machine-op-inspct` -0.27037 0.10549 -2.563
## `occupation_Other-service` -0.76416 0.12250 -6.238
## `occupation_Priv-house-serv` -1.66024 0.77356 -2.146
## `occupation_Prof-specialty` 0.63184 0.09146 6.908
## `occupation_Protective-serv` 0.60382 0.13031 4.634
## occupation_Sales 0.36980 0.08748 4.227
## `occupation_Tech-support` 0.63127 0.11616 5.434
## `occupation_Transport-moving` NA NA NA
## relationship_Husband -1.25354 0.10318 -12.149
## `relationship_Not-in-family` -0.59548 0.28428 -2.095
## `relationship_Other-relative` -1.82907 0.26480 -6.907
## `relationship_Own-child` -1.61350 0.27749 -5.815
## relationship_Unmarried -0.82110 0.29384 -2.794
## relationship_Wife NA NA NA
## `race_Amer-Indian-Eskimo` -0.70143 0.22136 -3.169
## `race_Asian-Pac-Islander` 0.38095 0.16545 2.302
## race_Black -0.23096 0.07771 -2.972
## race_Other 0.04861 0.26111 0.186
## race_White NA NA NA
## sex_Female -0.81226 0.07848 -10.349
## sex_Male NA NA NA
## `native_country_?` NA NA NA
## native_country_Cambodia -1.17984 1.02311 -1.153
## native_country_Canada -1.45697 0.84348 -1.727
## native_country_China -2.64372 0.88690 -2.981
## native_country_Columbia -3.76060 1.13990 -3.299
## native_country_Cuba -1.77548 0.85952 -2.066
## `native_country_Dominican-Republic` -3.83866 1.30919 -2.932
## native_country_Ecuador -2.69317 1.11137 -2.423
## `native_country_El-Salvador` -2.13046 0.93046 -2.290
## native_country_England -1.39332 0.87177 -1.598
## native_country_France -0.96400 0.94727 -1.018
## native_country_Germany -1.73391 0.84363 -2.055
## native_country_Greece -1.90732 0.90741 -2.102
## native_country_Guatemala -2.10625 1.10760 -1.902
## native_country_Haiti -0.87990 0.96000 -0.917
## `native_country_Holand-Netherlands` -7.97142 119.24792 -0.067
## native_country_Honduras -1.46297 1.41062 -1.037
## native_country_Hong -2.27867 1.01266 -2.250
## native_country_Hungary -1.13627 1.01167 -1.123
## native_country_India -2.40847 0.86331 -2.790
## native_country_Iran -1.43832 0.92817 -1.550
## native_country_Ireland -0.66015 0.98701 -0.669
## native_country_Italy -0.87007 0.86127 -1.010
## native_country_Jamaica -1.35499 0.94812 -1.429
## native_country_Japan -1.69091 0.89640 -1.886
## native_country_Laos -3.12907 1.39178 -2.248
## native_country_Mexico -2.23214 0.82786 -2.696
## native_country_Nicaragua -2.20456 1.12335 -1.962
## `native_country_Outlying-US(Guam-USVI-etc)` -2.07957 1.35677 -1.533
## native_country_Peru -3.29930 1.15746 -2.850
## native_country_Philippines -1.90122 0.84760 -2.243
## native_country_Poland -1.39888 0.88177 -1.586
## native_country_Portugal -1.37540 0.93662 -1.468
## `native_country_Puerto-Rico` -1.95677 0.87148 -2.245
## native_country_Scotland -3.80022 1.35591 -2.803
## native_country_South -3.57709 0.93179 -3.839
## native_country_Taiwan -1.95141 0.93755 -2.081
## native_country_Thailand -3.15920 1.18736 -2.661
## `native_country_Trinadad&Tobago` -2.46061 1.17213 -2.099
## `native_country_United-States` -1.58561 0.80228 -1.976
## native_country_Vietnam -3.10546 1.02168 -3.040
## native_country_Yugoslavia NA NA NA
## Pr(>|z|)
## (Intercept) 0.590470
## age < 2e-16 ***
## education_num < 2e-16 ***
## capital_gain < 2e-16 ***
## capital_loss < 2e-16 ***
## hours_per_week < 2e-16 ***
## `workclass_?` NA
## `workclass_Federal-gov` 0.153701
## `workclass_Local-gov` 0.570241
## `workclass_Never-worked` NA
## workclass_Private 0.400106
## `workclass_Self-emp-inc` 0.320107
## `workclass_Self-emp-not-inc` 0.888877
## `workclass_State-gov` 0.644354
## `workclass_Without-pay` NA
## marital_status_Divorced 0.280983
## `marital_status_Married-AF-spouse` 0.000406 ***
## `marital_status_Married-civ-spouse` 1.72e-13 ***
## `marital_status_Married-spouse-absent` 0.810542
## `marital_status_Never-married` 0.000930 ***
## marital_status_Separated 0.830550
## marital_status_Widowed NA
## `occupation_?` NA
## `occupation_Adm-clerical` 0.260104
## `occupation_Armed-Forces` 0.629285
## `occupation_Craft-repair` 0.134331
## `occupation_Exec-managerial` < 2e-16 ***
## `occupation_Farming-fishing` 1.58e-10 ***
## `occupation_Handlers-cleaners` 7.27e-05 ***
## `occupation_Machine-op-inspct` 0.010375 *
## `occupation_Other-service` 4.44e-10 ***
## `occupation_Priv-house-serv` 0.031855 *
## `occupation_Prof-specialty` 4.91e-12 ***
## `occupation_Protective-serv` 3.59e-06 ***
## occupation_Sales 2.36e-05 ***
## `occupation_Tech-support` 5.50e-08 ***
## `occupation_Transport-moving` NA
## relationship_Husband < 2e-16 ***
## `relationship_Not-in-family` 0.036201 *
## `relationship_Other-relative` 4.94e-12 ***
## `relationship_Own-child` 6.08e-09 ***
## relationship_Unmarried 0.005200 **
## relationship_Wife NA
## `race_Amer-Indian-Eskimo` 0.001531 **
## `race_Asian-Pac-Islander` 0.021311 *
## race_Black 0.002957 **
## race_Other 0.852313
## race_White NA
## sex_Female < 2e-16 ***
## sex_Male NA
## `native_country_?` NA
## native_country_Cambodia 0.248832
## native_country_Canada 0.084109 .
## native_country_China 0.002875 **
## native_country_Columbia 0.000970 ***
## native_country_Cuba 0.038861 *
## `native_country_Dominican-Republic` 0.003367 **
## native_country_Ecuador 0.015381 *
## `native_country_El-Salvador` 0.022039 *
## native_country_England 0.109983
## native_country_France 0.308841
## native_country_Germany 0.039851 *
## native_country_Greece 0.035558 *
## native_country_Guatemala 0.057218 .
## native_country_Haiti 0.359373
## `native_country_Holand-Netherlands` 0.946703
## native_country_Honduras 0.299686
## native_country_Hong 0.024437 *
## native_country_Hungary 0.261367
## native_country_India 0.005274 **
## native_country_Iran 0.121232
## native_country_Ireland 0.503598
## native_country_Italy 0.312389
## native_country_Jamaica 0.152965
## native_country_Japan 0.059250 .
## native_country_Laos 0.024561 *
## native_country_Mexico 0.007012 **
## native_country_Nicaragua 0.049707 *
## `native_country_Outlying-US(Guam-USVI-etc)` 0.125340
## native_country_Peru 0.004365 **
## native_country_Philippines 0.024892 *
## native_country_Poland 0.112638
## native_country_Portugal 0.141974
## `native_country_Puerto-Rico` 0.024747 *
## native_country_Scotland 0.005067 **
## native_country_South 0.000124 ***
## native_country_Taiwan 0.037399 *
## native_country_Thailand 0.007798 **
## `native_country_Trinadad&Tobago` 0.035794 *
## `native_country_United-States` 0.048112 *
## native_country_Vietnam 0.002369 **
## native_country_Yugoslavia NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 35452 on 31655 degrees of freedom
## Residual deviance: 20411 on 31575 degrees of freedom
## AIC: 20573
##
## Number of Fisher Scoring iterations: 9
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 9449 755
## 1 1307 2055
##
## Accuracy : 0.848
## 95% CI : (0.8418, 0.854)
## No Information Rate : 0.7929
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.5685
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.8785
## Specificity : 0.7313
## Pos Pred Value : 0.9260
## Neg Pred Value : 0.6112
## Prevalence : 0.7929
## Detection Rate : 0.6965
## Detection Prevalence : 0.7522
## Balanced Accuracy : 0.8049
##
## 'Positive' Class : 0
##
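The NA rows in the dummy-variable summary above are what R reports when a predictor column is a linear combination of other columns. A small sketch on made-up data (not the census data; the factor levels and probabilities are assumptions) suggests why a complete set of dummy columns triggers this and why the fit itself is unaffected:

```r
# Made-up data: one three-level factor and a binary outcome.
set.seed(622)
n <- 600
workclass <- factor(sample(c("Gov", "Private", "Self"), n, replace = TRUE))
p_by_level <- c(Gov = 0.3, Private = 0.2, Self = 0.4)
income <- rbinom(n, 1, prob = p_by_level[as.character(workclass)])

# (a) Let glm() encode the factor: one level becomes the reference.
fit_factor <- glm(income ~ workclass, family = "binomial")

# (b) Supply one 0/1 column per level. With an intercept these columns
# sum to 1 in every row, so one of them is redundant.
dummies <- as.data.frame(model.matrix(~ workclass - 1))
names(dummies) <- levels(workclass)
fit_dummy <- glm(income ~ ., family = "binomial",
                 data = data.frame(income = income, dummies))

print(coef(fit_dummy))  # the redundant column is reported as NA
print(max(abs(fitted(fit_factor) - fitted(fit_dummy))))
```

Both fits span the same column space, so their fitted probabilities agree up to numerical error; only the labelling of the coefficients differs.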
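The accuracy of our logistic regression model with dummy variables comes out as 0.848, or roughly 85%. This has not changed from the previous two models, which is expected for the full model: glm() already expands factors into dummy variables internally, so the residual deviance (20411) and AIC (20573) are identical to those of the first model, and the 11 coefficients reported as NA are redundant or empty dummy columns. There is room for improvement in this model's sensitivity, among other metrics. We try to improve this with a decision tree model.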
In a decision tree model the data is repeatedly split into two branches, 'yes' or 'no', according to whether a condition on one of the predictors holds. These splits are called nodes, and the decisions made at them can be mapped as a tree. We follow this principle to identify the decisions that give the most predictive accuracy for our income target variable. The results are shown below.
## CART
##
## 31656 samples
## 12 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 28491, 28491, 28490, 28490, 28490, 28491, ...
## Resampling results across tuning parameters:
##
## maxdepth Accuracy Kappa
## 1 0.8158114 0.4303746
## 2 0.8158114 0.4303746
## 3 0.8302584 0.4869948
## 4 0.8393562 0.5216016
## 5 0.8442842 0.5433691
## 6 0.8442842 0.5433691
## 7 0.8442842 0.5433691
## 8 0.8442842 0.5433691
## 9 0.8442842 0.5433691
## 10 0.8442842 0.5433691
## 11 0.8442842 0.5433691
## 12 0.8442842 0.5433691
## 13 0.8442842 0.5433691
## 14 0.8442842 0.5433691
## 15 0.8442842 0.5433691
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was maxdepth = 5.
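To make the idea of a split concrete, the sketch below searches one numeric predictor for the threshold that minimises the weighted Gini impurity, the default criterion for classification in the rpart package that caret's CART models build on. The helper names and toy data are hypothetical; a real tree repeats this search over all predictors and recursively within each child node, up to the chosen maxdepth.

```r
# Gini impurity of a vector of class labels.
gini <- function(y) {
  if (length(y) == 0) return(0)
  p <- table(y) / length(y)
  1 - sum(p^2)
}

# Hypothetical helper: best threshold on one numeric predictor.
best_split <- function(x, y) {
  values <- sort(unique(x))
  # Candidate thresholds are midpoints between consecutive values.
  thresholds <- (head(values, -1) + tail(values, -1)) / 2
  impurity <- sapply(thresholds, function(t) {
    left  <- y[x <= t]
    right <- y[x > t]
    (length(left) * gini(left) + length(right) * gini(right)) / length(y)
  })
  list(threshold = thresholds[which.min(impurity)],
       impurity  = min(impurity))
}

# Toy data: higher education_num makes class 1 more likely.
education_num <- c(5, 7, 9, 9, 10, 12, 13, 14, 15, 16)
income        <- c(0, 0, 0, 0, 0,  0,  1,  1,  1,  1)

result <- best_split(education_num, income)
print(result)
```

On this toy data the classes separate cleanly, so the best split leaves both child nodes pure; on real data the impurity rarely reaches zero, which is why deeper trees can keep improving up to a point.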
We also review which variables are most important for making decisions in our model. These are shown in the plot as a straight line extending from the axis, with length proportional to the variable's importance to the model. Accuracy was used to select the optimal model using the largest value, and the final tree depth used for this model is 5.
## [1] 0.8424001
Our decision tree model's accuracy comes out as 0.8424, or roughly 84%. There is room for improvement in this model's sensitivity, among other metrics. We try to improve this with the random forest model.
A random forest model works by building a number of decision trees, each on a random sample of the data, and combining their predictions by majority vote. The splits are randomized as well: in our case, the model tries 3 variables at each node or split in the tree. We set our number of trees to 500 and train the model to predict income. We review the variables of most importance in the model and, in this case, give the model a boost to improve accuracy.
## Random Forest
##
## 31656 samples
## 12 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 31656, 31656, 31656, 31656, 31656, 31656, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.8021560 0.3128271
## 43 0.8333873 0.5419169
## 84 0.8261639 0.5256646
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 43.
## [1] NaN
Our random forest model's accuracy comes out as 0.8399, or roughly 84%. This is marginally below the decision tree's accuracy (0.8424), although Kappa improved (0.559 versus 0.542) and the sensitivity did increase as we desired. However, there are more strategies we can try with other models.
# XGBoost
## Epoch Iteration max_depth min_child_weight subsample gpUtility acqOptimum
## 1: 0 1 4 16.900949 0.3980034 NA FALSE
## 2: 0 2 9 22.465545 0.4598973 NA FALSE
## 3: 0 3 4 2.543344 0.3380428 NA FALSE
## 4: 0 4 7 8.946161 0.2773560 NA FALSE
## 5: 1 5 8 1.000000 0.2500000 0.7678954 TRUE
## 6: 2 6 10 1.000000 0.2500000 0.4972389 TRUE
## 7: 3 7 10 1.000000 0.5000000 0.2777107 TRUE
## inBounds Elapsed Score nrounds errorMessage
## 1: TRUE 4.852 0.8944900 100 NA
## 2: TRUE 9.094 0.9070150 100 NA
## 3: TRUE 4.434 0.8964657 100 NA
## 4: TRUE 6.443 0.9071000 100 NA
## 5: TRUE 7.455 0.9117193 99 NA
## 6: TRUE 8.618 0.9128047 100 NA
## 7: TRUE 11.480 0.9130480 100 NA
## $max_depth
## [1] 10
##
## $min_child_weight
## [1] 1
##
## $subsample
## [1] 0.5
## Confusion Matrix and Statistics
##
## y_test
## XGB.predict 0 1
## 0 9317 1209
## 1 887 2153
##
## Accuracy : 0.8455
## 95% CI : (0.8393, 0.8515)
## No Information Rate : 0.7522
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.5718
##
## Mcnemar's Test P-Value : 2.358e-12
##
## Sensitivity : 0.9131
## Specificity : 0.6404
## Pos Pred Value : 0.8851
## Neg Pred Value : 0.7082
## Prevalence : 0.7522
## Detection Rate : 0.6868
## Detection Prevalence : 0.7759
## Balanced Accuracy : 0.7767
##
## 'Positive' Class : 0
##
When evaluating how the models performed, we focused on accuracy as our main metric. However, we also considered how the results might apply in real-world settings. This slightly changes how each model's performance should be judged, depending on the circumstances in which the prediction is needed. For example, someone whose goal is to identify which factors to maximize to boost their income will care about a fundamentally different set of variables, and thus results, than someone whose goal is to minimize income loss for an individual. Nevertheless, we compiled the results as follows:
statistic | xgboost | decisiontree | randomforest | logit |
---|---|---|---|---|
Accuracy | 0.845 | 0.842 | 0.840 | 0.848 |
Kappa | 0.572 | 0.542 | 0.559 | 0.569 |
AccuracyLower | 0.839 | 0.836 | 0.834 | 0.842 |
AccuracyUpper | 0.852 | 0.848 | 0.846 | 0.854 |
AccuracyNull | 0.752 | 0.752 | 0.752 | 0.793 |
AccuracyPValue | 0.000 | 0.000 | 0.000 | 0.000 |
McnemarPValue | 0.000 | 0.000 | 0.000 | 0.000 |
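
The AccuracyLower, AccuracyUpper and AccuracyPValue rows can be reproduced from counts alone. As a sketch, assuming (as caret's documentation for confusionMatrix describes) an exact binomial interval and a one-sided test against the no-information rate, the XGBoost figures follow from its confusion matrix reported earlier:

```r
# Counts taken from the XGBoost confusion matrix reported earlier.
correct <- 9317 + 2153
n       <- 9317 + 1209 + 887 + 2153
nir     <- (9317 + 887) / n   # share of the larger reference class

# Exact (Clopper-Pearson) interval for the accuracy.
ci <- binom.test(correct, n)$conf.int

# One-sided exact test of accuracy > no-information rate.
nir_test <- binom.test(correct, n, p = nir, alternative = "greater")

print(round(c(Accuracy = correct / n, Lower = ci[1], Upper = ci[2],
              NIR = nir), 4))
print(nir_test$p.value)
```

The p-value is far smaller than three decimals can show, which is why the table lists it as 0.000.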
Our conclusions should be read with caution, regardless of model performance and accuracy, given the makeup of the dataset and its substantial drawbacks. Perhaps most importantly, this dataset is not representative of the global population and its results should not be applied too broadly. The respondents were predominantly white, highly educated males who had been married at least once in their lives. Many of these respondents also had no children, which evidence suggests can significantly shape an individual's income over their lifetime. Typically, having children increases income for males while it decreases income for females. This makes our results less realistic and hard to interpret, especially for non-white females and other minority classes not well represented in this dataset.
Additionally, respondents located in the U.S. dominate the dataset, making up nearly 90% of its individuals. This undermines the results for other countries, whose variables contain large clusters of outliers that could not be dealt with without compromising the integrity of the data. To reduce the errors inherent to the dataset, extensive and strategic oversampling of the minority classes would be necessary, but unfortunately there is no way to tell whether the results would be reliable. For these reasons, we focus on the relationships between the variables, which have greater reliability and certainty in this analysis.
Recall that our target variable, income, was split into two factor levels: those whose income is greater than $50,000 and those whose income is less than or equal to $50,000. As is, our XGBoost model performed best by Kappa (0.572), with an accuracy of 0.845 (95% CI 0.839 to 0.852); the logistic regression reached a marginally higher accuracy of 0.848. Our closest alternative model was the random forest. Excluding capital gains and losses, we found that age, education, and hours worked per week capture most of the variance in the dataset (the first three principal components account for about 92%). If we were to reduce the dimensions of the dataset, these would be the best variables to use. This suggests that, aside from capital gains, the best ways to increase income to more than $50,000 in the United States are to get a higher education, work 40 or more hours per week, and be older than your colleagues. These results are applicable across the United States.
https://archive.ics.uci.edu/ml/datasets/Census+Income
https://stats.idre.ucla.edu/r/dae/logit-regression/
https://www.datacamp.com/community/tutorials/logistic-regression-R
https://www.marsja.se/create-dummy-variables-in-r/
knitr::opts_chunk$set(echo=FALSE, error=FALSE, warning=FALSE, message=FALSE, fig.align="center", fig.width = 10)
# Libraries
library(dplyr)
library(summarytools)
library(reshape2)
library(ggplot2)
library(DataExplorer)
library(caret)
library(tidyverse)
library(mice)
library(MASS)
library(e1071)
library(tree)
library(randomForest)
library(corrplot)
library(kableExtra)
library(htmltools)
library(fastDummies)
library(mlbench)
library(xgboost)
library(ParBayesianOptimization)
library(factoextra)
income_data <- read.csv("https://raw.githubusercontent.com/amit-kapoor/Data622Group2/main/FinalProject/data/census_income.csv",
                        check.names = FALSE) %>%
  na_if("")
# categorical columns as factors
income_data <- income_data %>%
  mutate(workclass=as.factor(workclass),
         education=as.factor(education),
         marital_status=as.factor(marital_status),
         occupation=as.factor(occupation),
         relationship=as.factor(relationship),
         race=as.factor(race),
         sex=as.factor(sex),
         native_country=as.factor(native_country),
         income=as.factor(income))
dfSummary(income_data, style = 'grid', graph.col = FALSE)
summary(income_data)
income_data %>%
  count(income) %>%
  ggplot(data=., aes(x=factor(income), y=n, fill = income)) +
  geom_col() +
  xlab("Income") +
  ylab("Frequency") +
  ggtitle("Frequency of Income") +
  theme_classic() +
  theme(legend.position = "none")
# select categorical columns
cat_cols = c()
j <- 1
for (i in 1:ncol(income_data)) {
  if (class((income_data[,i])) == 'factor') {
    cat_cols[j]=names(income_data[i])
    j <- j+1
  }
}
income_fact <- income_data[cat_cols]
# long format
income_factm <- melt(income_fact, measure.vars = cat_cols, variable.name = 'metric', value.name = 'value')
# plot categorical columns
ggplot(income_factm, aes(x = value)) +
geom_bar(aes(fill = metric)) +
facet_wrap( ~ metric, nrow = 5L, scales = 'free') + coord_flip() +
theme_classic() +
theme(legend.position = "none")
plot_histogram(income_data, geom_histogram_args = list("fill" = "tomato4"))
ggplot(income_data, aes(x=native_country, fill=income)) +
geom_bar() +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
ggplot(income_data, aes(x=workclass, fill=income)) +
geom_bar() +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
ggplot(income_data, aes(x=education, fill=income)) +
geom_bar(stat = "count") +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
ggplot(income_data, aes(x=sex, fill=income)) +
geom_bar() +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
ggplot(income_data, aes(x=race, fill=income)) +
geom_bar() +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
cors <- income_data %>%
  select_if(is.numeric) %>%
  na.omit() %>%
  cor()
corrplot::corrplot(cors, method="number")
# treat the "?" placeholder as a missing value
index <- income_data == "?"
is.na(income_data) <- index
# plot missing values
plot_missing(income_data)
# finding NAs now in income_data
sum(!complete.cases(income_data))
income_data_clean <- income_data[complete.cases(income_data),]
dim(income_data_clean)
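# Illustrative sketch (not part of the original analysis): how the "?" handling and
# complete-case filter above behave on a tiny, made-up data frame. The names
# `toy`, `toy_index` and `toy_clean` and all values are hypothetical.
toy <- data.frame(workclass = c("Private", "?", "State-gov"),
                  age = c(39L, 50L, 38L),
                  stringsAsFactors = FALSE)
toy_index <- toy == "?"                  # logical matrix flagging the placeholder
is.na(toy) <- toy_index                  # flagged cells become NA
toy_clean <- toy[complete.cases(toy), ]  # keep only rows without any NA
sum(!complete.cases(toy))                # number of incomplete rows
dim(toy_clean)                           # rows and columns that remain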
library(tidyr)
df <- income_data_clean
df %>%
  dplyr::select_if(is.integer) %>%
  gather(key, value) %>%
  ggplot(aes(key, value)) +
  geom_boxplot(aes(fill = key)) +
  facet_wrap(~key, scales = "free") # Lots of outliers
# removing columns fnlwgt and education
income_data_clean <- income_data_clean %>%
  dplyr::select(-c(fnlwgt, education))
set.seed(622)
# Center and scaling for numeric features
income_data_tf <- income_data_clean %>%
  dplyr::select(c("age", "education_num" , "capital_gain", "capital_loss","hours_per_week")) %>%
  preProcess(method = c("BoxCox", "center","scale")) %>%
  predict(income_data_clean)
# recode the target: '>50K' becomes 1 and '<=50K' becomes 0
income_data_tf$income <- plyr::mapvalues(income_data_tf$income, from = c('>50K','<=50K'), to = c(1,0))
head(income_data_tf)
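# Illustrative sketch (assumption: made-up numbers, not the census data). It mimics
# only the "center" and "scale" steps of the preprocessing above with base R
# scale(); the Box-Cox step is deliberately left out. `toy_hours` and `toy_scaled`
# are hypothetical names.
toy_hours  <- c(20, 40, 40, 60)
toy_scaled <- as.numeric(scale(toy_hours))  # subtract the mean, divide by the sd
mean(toy_scaled)                            # approximately 0 after centering
sd(toy_scaled)                              # 1 after scaling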
# PCA on the numeric columns; where() replaces the deprecated bare-predicate form
nums <- income_data_clean %>%
  dplyr::select(where(is.numeric))
pca1 <- prcomp(nums)
summary(pca1)
plot(pca1, type = 'l', col = 'light blue')
fviz_eig(pca1)
fviz_contrib(pca1, choice = "var", axes = c(1,2), top = 15)
fviz_pca_var(pca1,
             col.var = "contrib", # Color by contributions to the PC
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             repel = TRUE,        # Avoid text overlapping
             axes = c(1,2)
             )
# PCA with removal of capital_gains and losses
nums2 <- income_data_clean %>%
  dplyr::select(where(is.numeric), -capital_gain, -capital_loss)
pca2 <- prcomp(nums2)
summary(pca2)
fviz_eig(pca2)
fviz_contrib(pca2, choice = "var", axes = c(1,2), top = 15)
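fviz_pca_var(pca2,
             col.var = "contrib", # Color by contributions to the PC
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             repel = TRUE,        # Avoid text overlapping
             axes = c(1,2)
             )
# PCA on all columns, with factors converted to their integer codes
facs <- income_data_clean %>%
  dplyr::select(where(is.factor))
facs_nums <- sapply(facs, as.numeric)
dfnumeric <- cbind(facs_nums, nums)
pca3 <- prcomp(dfnumeric)
# centered and scaled variant, stored under its own name so that the base
# function prcomp() is not masked by an object called `prcomp`
pca3_scaled <- prcomp(dfnumeric, scale. = TRUE, center=TRUE)
summary(pca3)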
fviz_eig(pca3)
fviz_contrib(pca3, choice = "var", axes = c(1,2), top = 15)
fviz_pca_var(pca3,
             col.var = "contrib", # Color by contributions to the PC
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             repel = TRUE,        # Avoid text overlapping
             axes = c(1,2)
             )
dfnumeric2 <- dfnumeric %>%
  dplyr::select(-capital_gain, -capital_loss)
pca4 <- prcomp(dfnumeric2)
summary(pca4)
fviz_eig(pca4)
fviz_contrib(pca4, choice = "var", axes = c(1,2), top = 15)
fviz_pca_var(pca4,
             col.var = "contrib", # Color by contributions to the PC
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             repel = TRUE,        # Avoid text overlapping
             axes = c(1,2)
             )
set.seed(622)
partition <- createDataPartition(income_data_tf$income, p=0.70, list = FALSE)
training <- income_data_tf[partition,]
testing <- income_data_tf[-partition,]
logit.income <- glm(income ~., data = training, family = "binomial")
summary(logit.income)
logit.pred <- predict(logit.income, testing, type="response")
testing$pred_glm <- ifelse(logit.pred > 0.5, "1", "0")
testing$pred_glm <- as.factor(testing$pred_glm)
testing$income <- as.factor(testing$income)
# caret expects predictions first and the reference (observed) classes second
conf.mat.logit <- confusionMatrix(testing$pred_glm, testing$income)
conf.mat.logit
# logistic regression without native_country
logit.income.nc <- glm(income ~age + workclass + education_num + marital_status + occupation +
                         relationship + race + sex + capital_gain + capital_loss +
                         hours_per_week, data = training, family = "binomial")
summary(logit.income.nc)
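logit.pred.nc <- predict(logit.income.nc, testing, type="response")
# threshold the predictions of the reduced model (logit.pred.nc, not logit.pred)
testing$pred_glm2 <- ifelse(logit.pred.nc > 0.5, "1", "0")
testing$pred_glm2 <- as.factor(testing$pred_glm2)
testing$income <- as.factor(testing$income)
# caret expects predictions first and the reference (observed) classes second
conf.mat.logit.nc <- confusionMatrix(testing$pred_glm2, testing$income)
conf.mat.logit.nc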
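# Illustrative sketch (assumption: made-up probabilities and labels, not model output).
# It shows in base R what the 0.5 cutoff and the confusion matrices above compute.
# `toy_prob`, `toy_truth`, `toy_pred`, `toy_tab` and `toy_acc` are hypothetical names.
toy_prob  <- c(0.10, 0.80, 0.65, 0.30, 0.55, 0.20)
toy_truth <- factor(c("0", "1", "0", "0", "1", "1"), levels = c("0", "1"))
toy_pred  <- factor(ifelse(toy_prob > 0.5, "1", "0"), levels = c("0", "1"))
toy_tab   <- table(Prediction = toy_pred, Reference = toy_truth)
toy_acc   <- sum(diag(toy_tab)) / sum(toy_tab)  # share of correct predictions
toy_tab
toy_acc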
library(fastDummies)
set.seed(622)
training.dum <- dummy_cols(training,
                           select_columns = c("workclass",
                                              "marital_status",
                                              "occupation",
                                              "relationship",
                                              "race",
                                              "sex",
                                              "native_country"),
                           remove_selected_columns = TRUE)
set.seed(622)
testing.dum <- dummy_cols(testing,
                          select_columns = c("workclass",
                                             "marital_status",
                                             "occupation",
                                             "relationship",
                                             "race",
                                             "sex",
                                             "native_country"),
                          remove_selected_columns = TRUE)
logit.income.dum <- glm(income ~., data = training.dum, family = "binomial")
summary(logit.income.dum)
logit.pred.dum <- predict(logit.income.dum, testing.dum, type="response")
testing.dum$pred_glm <- ifelse(logit.pred.dum > 0.5, "1", "0")
testing.dum$pred_glm <- as.factor(testing.dum$pred_glm)
testing.dum$income <- as.factor(testing.dum$income)
# caret expects predictions first and the reference (observed) classes second
conf.mat.logit.dum <- confusionMatrix(testing.dum$pred_glm, testing.dum$income)
conf.mat.logit.dum
# Check Number of Levels for each Factor
training %>% map(levels) %>% map(length)
testing %>% map(levels) %>% map(length)
# Decision Trees model
set.seed(622)
control <- trainControl(method="repeatedcv", number=10, repeats=3, search='grid')
metric <- "Accuracy"
tunegrid <- expand.grid(.maxdepth=c(1:15))
tree.income <- train(income~., data = training, method="rpart2", tuneGrid=tunegrid, trControl=control)
print(tree.income)
plot(tree.income)
treeImp <- varImp(tree.income, scale = FALSE)
plot(treeImp, top = 10)
# prediction from decision tree model
tree.predict <- predict(tree.income, testing,type='raw')
mean(tree.predict == testing$income) # accuracy
conf.mat.decisiontree <- confusionMatrix(tree.predict, testing$income)
var1 <- round(conf.mat.decisiontree$overall[1], 4)
var2 <- (round(conf.mat.decisiontree$overall[1], 2)*100)
set.seed(622)
# Random Forest model
rf.income <- train(income~., data = training, method="rf", ntree = 5)
print(rf.income)
plot(rf.income)
rfImp <- varImp(rf.income, scale = FALSE)
plot(rfImp, top = 10)
# prediction from random forest model
rf.predict <- predict(rf.income, testing,type='raw')
mean(rf.predict == testing$income) # accuracy (the target column is income, not Loan_Status)
conf.mat.randomforest <- confusionMatrix(rf.predict, testing$income)
# XGBoost inputs: numeric 0/1 labels and all-numeric feature matrices
y_train <- as.numeric(as.matrix(training$income))
y_test <- as.numeric(as.matrix(testing$income))
X_train <- sapply(subset(training, select = -income), as.numeric)
X_test <- sapply(subset(testing, select = -c(income, pred_glm, pred_glm2)), as.numeric)
# three interleaved cross-validation folds
Folds <- list(
    Fold1 = as.integer(seq(1,nrow(X_train),by = 3))
  , Fold2 = as.integer(seq(2,nrow(X_train),by = 3))
  , Fold3 = as.integer(seq(3,nrow(X_train),by = 3))
)
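# Illustrative sketch (assumption: a hypothetical row count of 10 in place of
# nrow(X_train)). The interleaved seq() calls above put every third row in the
# same fold, so the three folds together cover each row exactly once.
# `toy_n` and `toy_folds` are hypothetical names.
toy_n <- 10
toy_folds <- list(
    Fold1 = as.integer(seq(1, toy_n, by = 3))
  , Fold2 = as.integer(seq(2, toy_n, by = 3))
  , Fold3 = as.integer(seq(3, toy_n, by = 3))
)
sort(unlist(toy_folds, use.names = FALSE))  # every row index from 1 to toy_n, once
lengths(toy_folds)                          # fold sizes differ by at most one row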
# objective for the Bayesian search: cross-validated AUC for one parameter set
scoringFunction <- function(max_depth, min_child_weight, subsample) {
  dtrain <- xgb.DMatrix(X_train, label=y_train)
  Pars <- list(
      booster = "gbtree"
    , eta = 0.01
    , max_depth = max_depth
    , min_child_weight = min_child_weight
    , subsample = subsample
    , objective = "binary:logistic"
    , eval_metric = "auc"
  )
  xgbcv <- xgb.cv(
      params = Pars
    , data = dtrain
    , nround = 100
    , folds = Folds
    , prediction = TRUE
    , showsd = TRUE
    , early_stopping_rounds = 5
    , maximize = TRUE
    , verbose = 0)
  return(
    list(
        Score = max(xgbcv$evaluation_log$test_auc_mean)
      , nrounds = xgbcv$best_iteration
    )
  )
}
set.seed(50)
bounds <- list(
    max_depth = c(2L, 10L)
  , min_child_weight = c(1, 25)
  , subsample = c(0.25, .5)
)
optObj <- bayesOpt(
    FUN = scoringFunction
  , bounds = bounds
  , initPoints = 4
  , iters.n = 3
)
optObj$scoreSummary
print(getBestPars(optObj))
dt <- xgb.DMatrix(X_train, label=y_train)
XGB <- xgboost(data = dt
               , nround = 100
               , min_child_weight=1
               , subsample=.5
               , max_depth = 10
               , early_stopping_rounds = 5
               , verbose = 0)

XGB.predict <- as.numeric(predict(XGB,X_test) > 0.5)
importance_matrix <- xgb.importance(model = XGB)
xgb.plot.importance(importance_matrix = importance_matrix)
conf.mat.xgboost <- confusionMatrix(table(XGB.predict, y_test))
print(conf.mat.xgboost)
# collect the overall statistics of each model in one table
results <- data.frame(matrix(names(conf.mat.xgboost$overall)))
results$xgboost <- round(conf.mat.xgboost$overall, 3)
results$decisiontree <- round(conf.mat.decisiontree$overall, 3)
results$randomforest <- round(conf.mat.randomforest$overall, 3)
results$logit <- round(conf.mat.logit$overall, 3)

results %>%
  rename(statistic = matrix.names.conf.mat.xgboost.overall..) %>%
  kable()