Blog2: Tidymodels

Jie Zou

2022-05-20

My Thought

Before working on my final project, I did not realize how important feature engineering is. After reading some statistics documentation, I found that feature engineering is very useful for extracting or creating features that can improve later analysis and modeling. In addition, I have never used tidymodels before, so I would like to give it a shot and see whether it gives me different insights.

Data

The data that I got is from Kaggle.

  • I am only interested in the salary distribution in the United States, so observations from other countries are omitted.
## Rows: 29,170
## Columns: 14
## $ age            <int> 39, 50, 38, 53, 37, 52, 31, 42, 37, 23, 32, 25, 32, 38,…
## $ workclass      <chr> " State-gov", " Self-emp-not-inc", " Private", " Privat…
## $ fnlwgt         <int> 77516, 83311, 215646, 234721, 284582, 209642, 45781, 15…
## $ education      <chr> " Bachelors", " Bachelors", " HS-grad", " 11th", " Mast…
## $ education.num  <int> 13, 13, 9, 7, 14, 9, 14, 13, 10, 13, 12, 9, 9, 7, 14, 1…
## $ marital.status <chr> " Never-married", " Married-civ-spouse", " Divorced", "…
## $ occupation     <chr> " Adm-clerical", " Exec-managerial", " Handlers-cleaner…
## $ relationship   <chr> " Not-in-family", " Husband", " Not-in-family", " Husba…
## $ race           <chr> " White", " White", " White", " Black", " White", " Whi…
## $ sex            <chr> " Male", " Male", " Male", " Male", " Female", " Male",…
## $ capital.gain   <int> 2174, 0, 0, 0, 0, 0, 14084, 5178, 0, 0, 0, 0, 0, 0, 0, …
## $ capital.loss   <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2…
## $ hours.per.week <int> 40, 13, 40, 40, 40, 45, 50, 40, 80, 30, 50, 35, 40, 50,…
## $ salary         <chr> " <=50K", " <=50K", " <=50K", " <=50K", " <=50K", " >50…
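
The filtering step described above could look roughly like the sketch below. It runs on a small toy table rather than the real Kaggle file, and the column name `native.country` and the leading space in its values are assumptions inferred from how the other character columns are printed; they are not shown in the original post.

```r
library(dplyr)

# Toy stand-in for the Kaggle table. `native.country` is an assumed column
# name; the real file is not shown in the post.
adult <- data.frame(
  age            = c(39L, 50L, 28L, 45L),
  native.country = c(" United-States", " United-States", " Cuba", " India"),
  salary         = c(" <=50K", " <=50K", " <=50K", " >50K")
)

us_only <- adult %>%
  # trimws() guards against the leading space seen in the printed values
  filter(trimws(native.country) == "United-States") %>%
  # the country column is constant after filtering, so it can be dropped
  select(-native.country)

glimpse(us_only)
```

Dropping the country column after filtering is consistent with the 14 columns shown above, but that link is a guess.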

Distribution of Numerical Variables

Interpretation of Interesting variables

  • age: the peak for salaries less than or equal to 50K is around 25 years old, whereas the peak for higher salaries is around 40-50 years old. This means that 25-year-olds mostly make at most 50K, while people aged 40-50 are the most likely to make more than that.

  • education.num: there are several peaks at both salary levels. Below 13, most individuals make less than or equal to 50K; when education.num is greater than 13, most make more than 50K.

  • hours.per.week: the majority of individuals who work up to 40-42 hours per week have salaries at or below 50K. Past that threshold, it seems that the more hours you work, the more you make.

Distribution of Categorical Variables

Interpretation of Interesting variables

  • education: as mentioned for education.num above, the trend is clear.

  • marital.status: whatever the status, the population that makes at most 50K is dominant. In addition, stable marital statuses show a smaller difference between the two salary levels.

  • occupation: apart from executive/managerial roles, most people in the other occupations earn at most 50K.

  • race: there is a clear difference in salaries between white and other races, but the sample is imbalanced across races, so I cannot draw general conclusions. Based on the data obtained, however, most white individuals make at most 50K.

  • sex: from the plot, fewer females than males make less than or equal to 50K, and fewer females than males make more than 50K. It could be that there are fewer female workers in the dataset, or that females make less in both situations.

  • workclass: apart from the self-employed with their own company, every class is dominated by salaries less than or equal to 50K.

Feature Engineering

  • Since education and education.num represent basically the same thing, I keep only one of the two. In addition, the plots clearly show that individuals with a degree higher than Bachelors (in education) tend to make more than 50K, which corresponds to education.num > 13.

  • occupation and workclass overlap somewhat, and there are more occupation types in reality than the data lists, so workclass is kept instead of occupation. Besides, the workclass variable clearly shows that people who own their own company tend to make over 50K.

  • Although family and partner relationships affect individuals both mentally and physically, a relationship leading to a salary raise is rare in general. So I do not need these variables either.

## Rows: 29,170
## Columns: 10
## $ age            <int> 39, 50, 38, 53, 37, 52, 31, 42, 37, 23, 32, 25, 32, 38,…
## $ fnlwgt         <int> 77516, 83311, 215646, 234721, 284582, 209642, 45781, 15…
## $ race           <fct> white, white, white, not_white, white, white, white, wh…
## $ sex            <fct>  Male,  Male,  Male,  Male,  Female,  Male,  Female,  M…
## $ capital.gain   <int> 2174, 0, 0, 0, 0, 0, 14084, 5178, 0, 0, 0, 0, 0, 0, 0, …
## $ capital.loss   <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2…
## $ hours.per.week <int> 40, 13, 40, 40, 40, 45, 50, 40, 80, 30, 50, 35, 40, 50,…
## $ salary         <fct>  <=50K,  <=50K,  <=50K,  <=50K,  <=50K,  >50K,  >50K,  …
## $ edu_level      <fct> below/equal bachelors, below/equal bachelors, below/equ…
## $ work_level     <fct> worker, worker, worker, worker, worker, worker, worker,…
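
The recoding described in the bullets above might be sketched as follows, on a toy table. The level names `above bachelors` and `owner` are hypothetical (only `below/equal bachelors`, `worker`, `white` and `not_white` are visible in the output), and mapping only `Self-emp-inc` to the non-worker level is an assumption based on the second row, where `Self-emp-not-inc` appears as `worker`.

```r
library(dplyr)

# Toy rows using the original column names; values mimic the raw data,
# including the leading space.
toy <- data.frame(
  education      = c(" Bachelors", " HS-grad", " Masters", " Doctorate"),
  education.num  = c(13L, 9L, 14L, 16L),
  workclass      = c(" State-gov", " Self-emp-inc", " Private", " Self-emp-not-inc"),
  occupation     = c(" Adm-clerical", " Exec-managerial", " Sales", " Prof-specialty"),
  marital.status = c(" Never-married", " Divorced", " Divorced", " Never-married"),
  relationship   = c(" Not-in-family", " Unmarried", " Unmarried", " Not-in-family"),
  race           = c(" White", " Black", " White", " Asian-Pac-Islander")
)

engineered <- toy %>%
  mutate(
    # education.num > 13 corresponds to a degree above Bachelors
    edu_level  = factor(if_else(education.num > 13,
                                "above bachelors", "below/equal bachelors")),
    # assumption: only incorporated self-employment counts as owning a company
    work_level = factor(if_else(trimws(workclass) == "Self-emp-inc",
                                "owner", "worker")),
    # collapse the imbalanced race variable into two levels
    race       = factor(if_else(trimws(race) == "White", "white", "not_white"))
  ) %>%
  # drop the columns that the new features replace or that are not needed
  select(-c(education, education.num, workclass,
            occupation, marital.status, relationship))

engineered
```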

Collinearity in Numeric Variables

No multicollinearity appears among the numeric variables in the dataset.
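
One simple way to screen for this is a pairwise correlation matrix, sketched below on simulated data. The 0.8 cutoff is an arbitrary assumption, and pairwise correlation is only a first check: it cannot reveal collinearity that involves three or more variables at once.

```r
set.seed(1)

# Simulated stand-ins for three of the numeric columns (independent by design)
num <- data.frame(
  age            = rnorm(500, mean = 38, sd = 13),
  hours.per.week = rnorm(500, mean = 40, sd = 12),
  fnlwgt         = rnorm(500, mean = 190000, sd = 100000)
)

cor_mat <- cor(num)

# List variable pairs whose absolute correlation exceeds the assumed cutoff;
# upper.tri() keeps each pair once and skips the diagonal
high <- which(abs(cor_mat) > 0.8 & upper.tri(cor_mat), arr.ind = TRUE)

round(cor_mat, 2)
nrow(high)  # number of strongly correlated pairs
```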

Target Variable Balance Check

The target variable is not balanced. However, we should keep it as it is, because in reality the two salary levels do not occur in equal proportions. Therefore, the ratio is preserved when splitting the data.

##   salary     n     ratio
## 1  <=50K 21999 0.7541652
## 2   >50K  7171 0.2458348

Create partitions

Dimension of training set

## [1] 21877    10

Dimension of testing set

## [1] 7293    9
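
A stratified split that keeps the salary ratio in both partitions can be sketched with rsample, as below. The post does not show which function was actually used, so `initial_split()` is an assumption. The 0.75 proportion matches 21877 / 29170, and the testing set having 9 columns suggests the label was stored separately, which the sketch imitates with `test.label`.

```r
library(rsample)

set.seed(123)

# Toy data with roughly the same class imbalance as the real target
toy <- data.frame(
  x      = rnorm(1000),
  salary = factor(rep(c("<=50K", ">50K"), times = c(750, 250)))
)

# strata = salary samples within each class, so the ratio is preserved
split    <- initial_split(toy, prop = 0.75, strata = salary)
train.df <- training(split)
test.df  <- testing(split)

# Keep the true labels aside and remove them from the testing predictors
test.label     <- test.df$salary
test.df$salary <- NULL

prop.table(table(train.df$salary))
dim(train.df)
dim(test.df)
```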

Modeling

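library(tidymodels)

# create recipe: predict salary from all other columns,
# converting factor predictors to dummy variables
re <- 
  recipe(salary ~ ., data = train.df) %>% 
  step_dummy(all_nominal_predictors())

# create model type: logistic regression fitted with stats::glm()
log_mod <- 
  logistic_reg() %>% 
  set_engine("glm")

# combine recipe and model into a workflow
df_wkfl <- 
  workflow() %>% 
  add_model(log_mod) %>% 
  add_recipe(re)

# fit the workflow on the training data
df_fit <- 
  df_wkfl %>% 
  fit(data = train.df)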

All features are significant based on their p-values: every p-value is far below 0.05.

## # A tibble: 10 × 5
##    term                                estimate   std.error statistic   p.value
##    <chr>                                  <dbl>       <dbl>     <dbl>     <dbl>
##  1 (Intercept)                     -3.73        0.171          -21.8  1.71e-105
##  2 age                              0.0343      0.00142         24.1  2.50e-128
##  3 fnlwgt                           0.000000785 0.000000176      4.46 8.15e-  6
##  4 capital.gain                     0.000331    0.0000115       28.8  7.36e-182
##  5 capital.loss                     0.000704    0.0000383       18.4  2.05e- 75
##  6 hours.per.week                   0.0328      0.00158         20.8  6.99e- 96
##  7 race_white                       0.548       0.0688           7.97 1.61e- 15
##  8 sex_X.Male                       1.13        0.0488          23.1  2.18e-118
##  9 edu_level_below.equal.bachelors -1.49        0.0604         -24.6  5.77e-134
## 10 work_level_worker               -0.649       0.0897          -7.23 4.71e- 13
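
Because these are logistic regression coefficients, they are on the log-odds scale. Exponentiating them gives odds ratios, which are easier to read. The short sketch below copies a few of the rounded estimates from the table above, so the results are approximate.

```r
# Estimates copied (rounded) from the coefficient table above
coefs <- c(
  age                             = 0.0343,
  hours.per.week                  = 0.0328,
  race_white                      = 0.548,
  sex_X.Male                      = 1.13,
  edu_level_below.equal.bachelors = -1.49,
  work_level_worker               = -0.649
)

# exp() turns a log-odds coefficient into an odds ratio:
# > 1 raises the odds of earning >50K, < 1 lowers them
odds_ratios <- exp(coefs)

round(odds_ratios, 2)
```

For example, holding the other variables fixed, being male multiplies the odds of earning more than 50K by roughly 3, while having at most a bachelor's degree cuts them to under a quarter.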

Model Performance

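Based on the stats, the model is a reasonable start. Accuracy is about 0.81 on both the training and testing sets, so there is no sign of overfitting. However, always predicting <=50K would already reach about 0.75 accuracy on this data, and of the 1,793 test individuals who actually earn more than 50K, only 691 (about 39%) are predicted correctly, so the model is weak at detecting high earners.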

Training accuracy

## # A tibble: 1 × 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary         0.812

Testing accuracy

## # A tibble: 1 × 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary         0.813
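# confusion matrix
# note: caret::confusionMatrix() expects (data = predictions, reference = truth).
# Here the true labels are passed first, so in the output below the rows
# labelled "Prediction" hold the true classes and the columns labelled
# "Reference" hold the predicted classes.
confusionMatrix(test.label, test_pred$.pred_class)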
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  <=50K  >50K
##      <=50K   5238   262
##      >50K    1102   691
##                                           
##                Accuracy : 0.813           
##                  95% CI : (0.8038, 0.8219)
##     No Information Rate : 0.8693          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.4011          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.8262          
##             Specificity : 0.7251          
##          Pos Pred Value : 0.9524          
##          Neg Pred Value : 0.3854          
##              Prevalence : 0.8693          
##          Detection Rate : 0.7182          
##    Detection Prevalence : 0.7541          
##       Balanced Accuracy : 0.7756          
##                                           
##        'Positive' Class :  <=50K          
## 
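
The key rates can be recomputed by hand from the four counts in the table above. The sketch treats >50K as the class of interest and reads the rows as the true labels, which follows from the order of arguments in the call above (1,793 true >50K cases is about 24.6% of the 7,293 test rows, matching the target ratio).

```r
# Counts from the confusion matrix above (rows = true class, columns = predicted)
tn <- 5238  # true <=50K, predicted <=50K
fp <- 262   # true <=50K, predicted >50K
fn <- 1102  # true >50K,  predicted <=50K
tp <- 691   # true >50K,  predicted >50K

n <- tn + fp + fn + tp

accuracy  <- (tp + tn) / n   # share of correct predictions
baseline  <- (tn + fp) / n   # accuracy of always predicting <=50K
recall    <- tp / (tp + fn)  # share of true >50K that the model finds
precision <- tp / (tp + fp)  # share of predicted >50K that are correct

round(c(accuracy = accuracy, baseline = baseline,
        recall = recall, precision = precision), 4)
```

The recall and precision here correspond to the values printed as Neg Pred Value (0.3854) and Specificity (0.7251) in the output, because of the swapped argument order.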

Conclusion

Tidymodels splits modeling into separate parts, which differs from the traditional way of building models, where everything is computed at once. When one wants to reuse some of those parts in another model, Tidymodels is convenient. With the traditional approach, any change to the model means recomputing everything, which consumes memory and slows down the computer. For simple modeling, I would still recommend the traditional approach because it saves coding time.
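
As a small illustration of that reuse, the sketch below builds one workflow on simulated data and then swaps only the model specification with `update_model()`, leaving the recipe untouched. The toy data, the variable names and the choice of a probit link as the second model are all assumptions made for the example, not part of the analysis above.

```r
library(tidymodels)

set.seed(42)

# Simulated binary-outcome data (hypothetical, for illustration only)
toy <- tibble(
  x1  = rnorm(200),
  grp = factor(sample(c("a", "b"), 200, replace = TRUE)),
  y   = factor(if_else(x1 + rnorm(200) > 0, "yes", "no"))
)

# The recipe is defined once and shared by both models
rec <- recipe(y ~ ., data = toy) %>%
  step_dummy(all_nominal_predictors())

logit_spec  <- logistic_reg() %>% set_engine("glm")
# Same model type, but glm() is asked to use a probit link instead
probit_spec <- logistic_reg() %>%
  set_engine("glm", family = stats::binomial(link = "probit"))

wkfl <- workflow() %>%
  add_recipe(rec) %>%
  add_model(logit_spec)

logit_fit <- fit(wkfl, data = toy)

# Reuse the workflow: only the model part is replaced
probit_fit <- wkfl %>%
  update_model(probit_spec) %>%
  fit(data = toy)

tidy(logit_fit)
tidy(probit_fit)
```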