My Thought
Before working on the final project, I did not realize how important feature engineering is. After reading some statistics documentation, I found that feature engineering is very useful for extracting or creating features that can improve later analysis and modeling. In addition, I had never used tidymodels before, so I wanted to give it a shot and see whether it would lead me to different insights.
Data
The data come from Kaggle.
- I am only interested in the salary distribution in the United States, so observations from other countries are omitted.
## Rows: 29,170
## Columns: 14
## $ age <int> 39, 50, 38, 53, 37, 52, 31, 42, 37, 23, 32, 25, 32, 38,…
## $ workclass <chr> " State-gov", " Self-emp-not-inc", " Private", " Privat…
## $ fnlwgt <int> 77516, 83311, 215646, 234721, 284582, 209642, 45781, 15…
## $ education <chr> " Bachelors", " Bachelors", " HS-grad", " 11th", " Mast…
## $ education.num <int> 13, 13, 9, 7, 14, 9, 14, 13, 10, 13, 12, 9, 9, 7, 14, 1…
## $ marital.status <chr> " Never-married", " Married-civ-spouse", " Divorced", "…
## $ occupation <chr> " Adm-clerical", " Exec-managerial", " Handlers-cleaner…
## $ relationship <chr> " Not-in-family", " Husband", " Not-in-family", " Husba…
## $ race <chr> " White", " White", " White", " Black", " White", " Whi…
## $ sex <chr> " Male", " Male", " Male", " Male", " Female", " Male",…
## $ capital.gain <int> 2174, 0, 0, 0, 0, 0, 14084, 5178, 0, 0, 0, 0, 0, 0, 0, …
## $ capital.loss <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2…
## $ hours.per.week <int> 40, 13, 40, 40, 40, 45, 50, 40, 80, 30, 50, 35, 40, 50,…
## $ salary <chr> " <=50K", " <=50K", " <=50K", " <=50K", " <=50K", " >50…
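The filtering described above might look like the following minimal sketch. It assumes the raw Kaggle file has a `native.country` column whose US value is `United-States` (neither appears in the output above, so both names are assumptions), and it uses a small made-up data frame instead of the real file. It also trims the leading spaces that are visible in the character columns.

```r
# Toy stand-in for the raw data; the rows are made up for illustration.
# `native.country` and the value "United-States" are assumed names.
adult <- data.frame(
  age = c(39L, 50L, 38L, 28L),
  native.country = c(" United-States", " United-States", " Cuba", " United-States"),
  salary = c(" <=50K", " <=50K", " >50K", " >50K"),
  stringsAsFactors = FALSE
)

# Strip leading/trailing whitespace from every character column.
chr_cols <- vapply(adult, is.character, logical(1))
adult[chr_cols] <- lapply(adult[chr_cols], trimws)

# Keep only US observations, then drop the now-constant country column.
us_only <- adult[adult$native.country == "United-States", ]
us_only$native.country <- NULL

print(us_only)
```

Trimming first matters here: without it, a comparison against "United-States" would silently match nothing because of the leading space.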
Distribution of Numerical Variables
Interpretation of Interesting variables
age: the peak for salaries of 50K or less is around age 25, whereas the peak for higher salaries is around ages 40-50. In other words, people around 25 mostly make 50K or less, while people aged 40-50 are more likely to make more than that.
education.num: has several peaks at both salary levels. Below 13, most individuals make 50K or less; when education.num is greater than 13, most make more than 50K.
hours.per.week: the majority of individuals who work under 40-42 hours per week have salaries of 50K or less. Past that threshold, it seems that the more hours you work, the more you make.
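One way to check a "peak" like the ones described above without reading it off a plot is to bin the variable and look for the most populated bin in each salary group. This is only a sketch on made-up ages; the 10-year bin width and the toy values are my own choices, not something taken from the real data.

```r
# Made-up ages for illustration only; not taken from the Kaggle data.
toy <- data.frame(
  age = c(22, 24, 25, 26, 27, 33, 48, 38, 41, 44, 46, 47, 52, 29),
  salary = c(rep("<=50K", 7), rep(">50K", 7))
)

# Bin age into 10-year intervals: [20,30), [30,40), [40,50), [50,60).
toy$age_bin <- cut(toy$age, breaks = seq(20, 60, by = 10), right = FALSE)

# Count observations per salary group and age bin.
counts <- table(toy$salary, toy$age_bin)

# For each salary group (row), find the bin with the most observations.
modal_bin <- apply(counts, 1, function(row) names(row)[which.max(row)])

print(counts)
print(modal_bin)
```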
Distribution of Categorical Variables
Interpretation of Interesting variables
education: as mentioned for education.num above, the trends are clear.
marital.status: whatever the status, the group that makes 50K or less is dominant. In addition, stable marital statuses show a smaller difference between the two salary levels.
occupation: other than executive/managerial roles, most people in the other occupations earn 50K or less.
race: there is a clear difference in salaries between white and other races, but the sample is imbalanced across races, so I cannot draw general conclusions. Based on the data obtained, however, most white individuals make 50K or less.
sex: from the plot, fewer females than males make 50K or less, and fewer females than males make more than 50K. It could be that there are fewer female workers in the dataset, or it could be that females make less in both situations.
workclass: except for people who own their own company, the 50K-or-less level dominates in every work class.
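The question raised under sex (fewer women in the data, or a lower share of high earners?) can be separated by looking at within-group shares instead of raw counts. Here is a minimal sketch with made-up counts; the numbers are purely illustrative and are not the real distribution.

```r
# Made-up data: 4 females and 8 males, for illustration only.
toy <- data.frame(
  sex = c(rep("Female", 4), rep("Male", 8)),
  salary = c("<=50K", "<=50K", "<=50K", ">50K",
             rep("<=50K", 5), rep(">50K", 3))
)

# Raw counts: Female is smaller than Male in BOTH salary columns,
# simply because there are fewer females in this toy sample.
counts <- table(toy$sex, toy$salary)

# Row-wise shares: each row sums to 1, so group size no longer matters.
shares <- prop.table(counts, margin = 1)

print(counts)
print(shares)
```

With `margin = 1` each sex is normalized separately, so a difference in the `>50K` column reflects a difference in share and not in sample size.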
Feature Engineering
Since education and education.num represent basically the same thing, I am going to keep only one of the two. In addition, the plots clearly show that individuals with a degree higher than Bachelors (in education) tend to make more than 50K, which corresponds to values > 13 (in education.num).
occupation and workclass overlap somewhat, and in reality there should be more occupation types than the data contain; therefore, workclass is kept instead of occupation. Besides, the workclass variable clearly shows that people who own their own company tend to make over 50K.
Although family and partner relationships affect individuals both mentally and physically, a relationship leading to a raise in salary is rare in general. So I do not need such variables either.
## Rows: 29,170
## Columns: 10
## $ age <int> 39, 50, 38, 53, 37, 52, 31, 42, 37, 23, 32, 25, 32, 38,…
## $ fnlwgt <int> 77516, 83311, 215646, 234721, 284582, 209642, 45781, 15…
## $ race <fct> white, white, white, not_white, white, white, white, wh…
## $ sex <fct> Male, Male, Male, Male, Female, Male, Female, M…
## $ capital.gain <int> 2174, 0, 0, 0, 0, 0, 14084, 5178, 0, 0, 0, 0, 0, 0, 0, …
## $ capital.loss <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2…
## $ hours.per.week <int> 40, 13, 40, 40, 40, 45, 50, 40, 80, 30, 50, 35, 40, 50,…
## $ salary <fct> <=50K, <=50K, <=50K, <=50K, <=50K, >50K, >50K, …
## $ edu_level <fct> below/equal bachelors, below/equal bachelors, below/equ…
## $ work_level <fct> worker, worker, worker, worker, worker, worker, worker,…
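The recoding that leads to the three new factor columns above could be sketched as follows. Only the levels `white`, `not_white`, `below/equal bachelors` and `worker` are visible in the output; the labels `above bachelors` and `self_employed`, and the choice of `Self-emp-inc` as the "own company" class, are assumptions made for this example. The input rows are made up.

```r
# Made-up rows for illustration; values are already whitespace-trimmed.
toy <- data.frame(
  education.num = c(13, 9, 14, 16),
  workclass = c("State-gov", "Private", "Self-emp-inc", "Private"),
  race = c("White", "Black", "White", "Asian-Pac-Islander"),
  stringsAsFactors = FALSE
)

# education.num > 13 means a degree above Bachelors.
# "above bachelors" is an assumed label.
toy$edu_level <- factor(
  ifelse(toy$education.num > 13, "above bachelors", "below/equal bachelors")
)

# Assumption: only "Self-emp-inc" counts as owning a company;
# "self_employed" is an assumed label.
toy$work_level <- factor(
  ifelse(toy$workclass == "Self-emp-inc", "self_employed", "worker")
)

# Collapse race into two levels.
toy$race <- factor(ifelse(toy$race == "White", "white", "not_white"))

# Keep the engineered columns and drop the originals they replace.
engineered <- toy[, c("race", "edu_level", "work_level")]
print(engineered)
```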
Collinearity in Numeric Variables
There is no multicollinearity among the numeric variables in the dataset.
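A simple way to run this kind of check is to compute the pairwise correlation matrix and flag any pair above a cutoff. The sketch below uses simulated data with one deliberately redundant column (`age_months`, a hypothetical variable) so that there is something to flag; the 0.8 cutoff is a common rule of thumb that I chose for the example, not a value from the original analysis.

```r
set.seed(1)
n <- 200

# Simulated numeric predictors (not the real data).
toy <- data.frame(
  age = rnorm(n, mean = 40, sd = 12),
  hours.per.week = rnorm(n, mean = 40, sd = 10),
  capital.gain = rexp(n, rate = 1 / 1000)
)
# Deliberately redundant column so the check has something to find.
toy$age_months <- toy$age * 12 + rnorm(n, mean = 0, sd = 1)

# Pairwise Pearson correlations.
cors <- cor(toy)

# Flag each pair once (upper triangle) when |r| exceeds the cutoff.
cutoff <- 0.8
high <- which(abs(cors) > cutoff & upper.tri(cors), arr.ind = TRUE)
flagged <- data.frame(
  var1 = rownames(cors)[high[, "row"]],
  var2 = colnames(cors)[high[, "col"]],
  r = cors[high]
)
print(flagged)
```

An empty `flagged` table on the real numeric columns would support the statement above.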
Target Variable Balance Check
The target variable is not balanced; however, we should keep it as it is, because in reality the two salary levels do not occur in equal proportions. Therefore, we preserved the ratio when splitting the data.
## salary n ratio
## 1 <=50K 21999 0.7541652
## 2 >50K 7171 0.2458348
Create partitions
Dimension of training set
## [1] 21877 10
Dimension of testing set
## [1] 7293 9
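Preserving the class ratio in both partitions is a stratified split. In tidymodels this is usually done with `rsample::initial_split()` and its `strata` argument, though the exact call used here is not shown. The base R sketch below illustrates the idea on made-up data with a 75/25 split, which matches the partition sizes above; the 80/20 class mix in the toy data is just for round numbers.

```r
set.seed(123)

# Made-up data: 800 rows of "<=50K" and 200 rows of ">50K".
toy <- data.frame(
  id = 1:1000,
  salary = factor(rep(c("<=50K", ">50K"), times = c(800, 200)))
)

# Row indices grouped by class.
rows_by_class <- split(seq_len(nrow(toy)), toy$salary)

# Sample 75% of the rows within each class separately.
train_idx <- unlist(lapply(rows_by_class, function(idx) {
  sample(idx, size = 0.75 * length(idx))
}))

train.df <- toy[train_idx, ]
test.df <- toy[-train_idx, ]

# The class shares are the same in both partitions.
print(prop.table(table(train.df$salary)))
print(prop.table(table(test.df$salary)))
```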
Modeling
# load tidymodels (recipes, parsnip, workflows)
library(tidymodels)

# create recipe
re <-
  recipe(salary ~ ., data = train.df) %>%
  step_dummy(all_nominal_predictors())
# create model type
log_mod <-
  logistic_reg() %>%
  set_engine("glm")
# piece recipe and model
df_wkfl <-
  workflow() %>%
  add_model(log_mod) %>%
  add_recipe(re)
# fit data using workflow
df_fit <-
  df_wkfl %>%
  fit(data = train.df)
All features are significant based on their p-values.
## # A tibble: 10 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -3.73 0.171 -21.8 1.71e-105
## 2 age 0.0343 0.00142 24.1 2.50e-128
## 3 fnlwgt 0.000000785 0.000000176 4.46 8.15e- 6
## 4 capital.gain 0.000331 0.0000115 28.8 7.36e-182
## 5 capital.loss 0.000704 0.0000383 18.4 2.05e- 75
## 6 hours.per.week 0.0328 0.00158 20.8 6.99e- 96
## 7 race_white 0.548 0.0688 7.97 1.61e- 15
## 8 sex_X.Male 1.13 0.0488 23.1 2.18e-118
## 9 edu_level_below.equal.bachelors -1.49 0.0604 -24.6 5.77e-134
## 10 work_level_worker -0.649 0.0897 -7.23 4.71e- 13
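Logistic regression coefficients are on the log-odds scale, so exponentiating them gives odds ratios, which are easier to read. As a small worked example, the sketch below copies a few estimates from the table above and applies `exp()`. For instance, the `sex_X.Male` estimate of 1.13 corresponds to an odds ratio of roughly 3.1, holding the other predictors fixed.

```r
# Estimates copied from the coefficient table above.
estimates <- c(
  age = 0.0343,
  hours.per.week = 0.0328,
  race_white = 0.548,
  sex_X.Male = 1.13,
  edu_level_below.equal.bachelors = -1.49,
  work_level_worker = -0.649
)

# exp(log-odds) = odds ratio. Values above 1 raise the odds of >50K,
# values below 1 lower them.
odds_ratios <- exp(estimates)
print(round(odds_ratios, 2))
```

Negative estimates such as the one for `edu_level_below.equal.bachelors` give odds ratios below 1, which matches the earlier observation that education above Bachelors goes with higher salaries.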
Model Performance
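Based on these statistics, I would say that the model is reasonably good: training and testing accuracy are nearly identical (0.812 vs. 0.813), so there is no sign of overfitting, and both are above the 0.754 share of the majority class (<=50K).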
Training accuracy
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.812
Testing accuracy
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.813
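# confusion matrix
# caret::confusionMatrix(data, reference) expects predictions first; here the true labels
# come first, so the "Prediction" rows are true labels and the "Reference" columns are predictions
confusionMatrix(test.label, test_pred$.pred_class)
## Confusion Matrix and Statistics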
##
## Reference
## Prediction <=50K >50K
## <=50K 5238 262
## >50K 1102 691
##
## Accuracy : 0.813
## 95% CI : (0.8038, 0.8219)
## No Information Rate : 0.8693
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.4011
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.8262
## Specificity : 0.7251
## Pos Pred Value : 0.9524
## Neg Pred Value : 0.3854
## Prevalence : 0.8693
## Detection Rate : 0.7182
## Detection Prevalence : 0.7541
## Balanced Accuracy : 0.7756
##
## 'Positive' Class : <=50K
##
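The headline numbers can be recomputed by hand from the four counts. One thing to keep in mind: the call above passes the true labels as the first argument, while caret's `confusionMatrix()` treats its first argument as the predictions, so the rows of the printed table are actually the true classes and the columns are the predicted ones. The sketch below names the dimensions accordingly and derives accuracy plus per-class recall and precision.

```r
# Counts copied from the table above. Rows are the ACTUAL classes and
# columns the PREDICTED ones, given the argument order of the call.
counts <- matrix(
  c(5238, 1102, 262, 691),
  nrow = 2,
  dimnames = list(
    actual = c("<=50K", ">50K"),
    predicted = c("<=50K", ">50K")
  )
)

# Share of all test rows that were classified correctly.
accuracy <- sum(diag(counts)) / sum(counts)

# Recall: of the rows that truly belong to a class, how many were found.
recall <- diag(counts) / rowSums(counts)

# Precision: of the rows predicted as a class, how many were right.
precision <- diag(counts) / colSums(counts)

print(round(accuracy, 4))
print(round(recall, 4))
print(round(precision, 4))
```

Read this way, the 0.9524 and 0.3854 reported as "Pos Pred Value" and "Neg Pred Value" are the recall for `<=50K` and `>50K`, so the model finds most low earners but misses a large share of the high earners.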
Conclusion
tidymodels splits modeling into separate parts, which differs from the traditional way of building models, where everything is computed at once. When one wants to reuse some of those parts in another model, tidymodels is convenient. With the traditional approach, any change to the model means re-computing everything, which consumes memory and slows down the computer. For simple modeling, I would still recommend the traditional approach because it saves coding time.