Welcome to my Rmd. I created this notebook to improve my understanding of classification machine learning using the R language.
The data set contains information on students from XYZ Campus and whether or not they were hired by a company after graduating.
Column descriptions:
1. sl_no: Serial number / index number.
2. gender: Gender of the candidate (M = Male, F = Female).
3. ssc_p: Secondary education percentage - overall marks (in percent) obtained in the grade 10 examinations.
4. ssc_b: Board of education whose curriculum the candidate followed in grade 10 (Central or Others).
5. hsc_p: Higher secondary education percentage - overall marks (in percent) obtained in the grade 12 examinations.
6. hsc_s: Specialisation in higher secondary education.
7. degree_p: Degree percentage - overall marks (in percent) obtained during the undergraduate degree.
8. degree_t: Undergraduate degree type - field of degree education.
9. workex: Indicates whether the student has work experience (Yes or No).
10. etest_p: Employability test result in percent (conducted by the college).
11. specialisation: Post-graduate (MBA) specialisation.
12. mba_p: MBA percentage - overall marks (in percent) obtained during the MBA.
13. status: Placement status (Placed or Not Placed). This is the target variable.
14. salary: Yearly salary offered to the candidate by the company, in USD.
You may download the data set from Kaggle: https://www.kaggle.com/benroshan/factors-affecting-campus-placement
1. Which machine learning model provides the better result?
2. Which factors influence whether a candidate gets hired?
3. Do percentages/overall marks matter for getting hired?
4. Which degree specialisation is most in demand by corporates?
library(tidyverse)   # data wrangling and ggplot2
library(caret)       # confusionMatrix(), upSample()
library(ggplot2)     # plotting (also loaded by tidyverse)
library(dplyr)       # data manipulation (also loaded by tidyverse)
library(data.table)  # fast data handling
library(GGally)      # ggcorr() correlation plot
library(tidymodels)  # modeling framework
library(car)         # regression diagnostics
library(class)       # knn()
library(MASS)        # stepAIC(); MASS masks dplyr::select, hence dplyr::select() below
library(rsample)     # initial_split(), training(), testing()
hire <- read.csv("Placement_Data_Full_Class.csv")
head(hire)
There is a function in R called ggcorr() (from GGally) that can be used to check the correlation between columns and get a sense of how strongly each column influences the target variable, which in this case is the column status.
Unfortunately ggcorr() can only calculate correlations for integer/numeric columns, so the values in column status must first be converted into integers: "Placed" -> 1, "Not Placed" -> 0.
hire$status <- ifelse(hire$status == "Not Placed", 0, 1)
# Check the function result
head(hire$status)
## [1] 1 1 1 0 1 0
Good! The function works as expected. Now let's check the correlation between the columns.
ggcorr(hire, label = T)
From the ggcorr() visualization, the columns ssc_p, hsc_p and degree_p have a strong influence on whether a student will be hired. The columns etest_p and mba_p also have some influence, but not as much as the three columns mentioned earlier.
What about the columns gender, ssc_b, hsc_b, hsc_s, degree_t, workex and specialisation? For now those columns do not need to be transformed into integers, since the modeling step itself will show whether they matter for the target variable.
But hold on a second! Why not use some simple visualization first? Hopefully it will give a little insight into how gender, ssc_b, hsc_b, hsc_s, degree_t, workex and specialisation relate to the target variable.
- Correlation between column gender and status
hire %>%
  group_by(gender, status) %>%
  summarise(freq = n()) %>%
  ggplot(aes(x = as.factor(status),
             y = freq,
             fill = gender)) +
  geom_col(position = position_dodge()) +
  geom_text(aes(label = freq),
            vjust = 1.6,
            color = "white",
            position = position_dodge(0.9), size = 10) +
  theme_minimal() +
  labs(title = "Correlation between column gender and status",
       x = "Not Hired (0) or Hired (1)",
       y = "Freq",
       fill = "Gender")
- Correlation between column ssc_b and status
hire %>%
  group_by(ssc_b, status) %>%
  summarise(freq = n()) %>%
  ggplot(aes(x = as.factor(status),
             y = freq,
             fill = ssc_b)) +
  geom_col(position = position_dodge()) +
  geom_text(aes(label = freq),
            vjust = 1.6,
            color = "white",
            position = position_dodge(0.9), size = 10) +
  theme_minimal() +
  labs(title = "Correlation between column ssc_b and status",
       x = "Not Hired (0) or Hired (1)",
       y = "Freq",
       fill = "Board")
- Correlation between column hsc_b and status
hire %>%
  group_by(hsc_b, status) %>%
  summarise(freq = n()) %>%
  ggplot(aes(x = as.factor(status),
             y = freq,
             fill = hsc_b)) +
  geom_col(position = position_dodge()) +
  geom_text(aes(label = freq),
            vjust = 1.6,
            color = "white",
            position = position_dodge(0.9), size = 10) +
  theme_minimal() +
  labs(title = "Correlation between column hsc_b and status",
       x = "Not Hired (0) or Hired (1)",
       y = "Freq",
       fill = "Board")
- Correlation between column hsc_s and status
hire %>%
  group_by(hsc_s, status) %>%
  summarise(freq = n()) %>%
  ggplot(aes(x = as.factor(status),
             y = freq,
             fill = hsc_s)) +
  geom_col(position = position_dodge()) +
  geom_text(aes(label = freq),
            vjust = 1.6,
            color = "white",
            position = position_dodge(0.9), size = 5) +
  theme_minimal() +
  labs(title = "Correlation between column hsc_s and status",
       x = "Not Hired (0) or Hired (1)",
       y = "Freq",
       fill = "Specialisation")
- Correlation between column degree_t and status
hire %>%
  group_by(degree_t, status) %>%
  summarise(freq = n()) %>%
  ggplot(aes(x = as.factor(status),
             y = freq,
             fill = degree_t)) +
  geom_col(position = position_dodge()) +
  geom_text(aes(label = freq),
            vjust = 1.6,
            color = "white",
            position = position_dodge(0.9), size = 5) +
  theme_minimal() +
  labs(title = "Correlation between column degree_t and status",
       x = "Not Hired (0) or Hired (1)",
       y = "Freq",
       fill = "Degree type")
- Correlation between column workex and status
hire %>%
  group_by(workex, status) %>%
  summarise(freq = n()) %>%
  ggplot(aes(x = as.factor(status),
             y = freq,
             fill = workex)) +
  geom_col(position = position_dodge()) +
  geom_text(aes(label = freq),
            vjust = 1.6,
            color = "white",
            position = position_dodge(0.9), size = 5) +
  theme_minimal() +
  labs(title = "Correlation between column workex and status",
       x = "Not Hired (0) or Hired (1)",
       y = "Freq",
       fill = "Work experience")
- Correlation between column specialisation and status
hire %>%
  group_by(specialisation, status) %>%
  summarise(freq = n()) %>%
  ggplot(aes(x = as.factor(status),
             y = freq,
             fill = specialisation)) +
  geom_col(position = position_dodge()) +
  geom_text(aes(label = freq),
            vjust = 1.6,
            color = "white",
            position = position_dodge(0.9), size = 5) +
  theme_minimal() +
  labs(title = "Correlation between column specialisation and status",
       x = "Not Hired (0) or Hired (1)",
       y = "Freq",
       fill = "Specialisation")
From the simple visualizations above, the column workex might have some influence on whether a student gets hired, since most of the students who have work experience were hired, compared to the students who do not. For the remaining columns no conclusion can be drawn yet, since their visualizations do not show a pattern as clear as the one for workex; the quick check below quantifies the workex pattern.
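A row-wise proportion table can quantify that impression, i.e. the share of hired vs not hired within each work-experience group:
# Proportion of not hired (0) vs hired (1) within each workex group
prop.table(table(hire$workex, hire$status), margin = 1)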
From the data inspection there is some rework to be done, but before that let's check whether there are any missing values.
colSums(is.na(hire))
## sl_no gender ssc_p ssc_b hsc_p
## 0 0 0 0 0
## hsc_b hsc_s degree_p degree_t workex
## 0 0 0 0 0
## etest_p specialisation mba_p status salary
## 0 0 0 0 67
Column sl_no is a unique identifier for each student, so it can be ignored. Furthermore, there are missing values in column salary; that column can be dropped, since the goal is to find out whether a student gets hired, not how much they will earn.
hire <- hire %>%
dplyr::select(-sl_no, -salary)
hire_clean <- hire %>%
mutate(gender = as.factor(gender),
hsc_s = as.factor(hsc_s),
degree_t = as.factor(degree_t),
specialisation = as.factor(specialisation),
workex = as.factor(workex),
status = as.factor(status))
glimpse(hire_clean)
## Rows: 215
## Columns: 13
## $ gender <fct> M, M, M, M, M, M, F, M, M, M, M, M, F, F, M, F, M, F, F~
## $ ssc_p <dbl> 67.00, 79.33, 65.00, 56.00, 85.80, 55.00, 46.00, 82.00,~
## $ ssc_b <chr> "Others", "Central", "Central", "Central", "Central", "~
## $ hsc_p <dbl> 91.00, 78.33, 68.00, 52.00, 73.60, 49.80, 49.20, 64.00,~
## $ hsc_b <chr> "Others", "Others", "Central", "Central", "Central", "O~
## $ hsc_s <fct> Commerce, Science, Arts, Science, Commerce, Science, Co~
## $ degree_p <dbl> 58.00, 77.48, 64.00, 52.00, 73.30, 67.25, 79.00, 66.00,~
## $ degree_t <fct> Sci&Tech, Sci&Tech, Comm&Mgmt, Sci&Tech, Comm&Mgmt, Sci~
## $ workex <fct> No, Yes, No, No, No, Yes, No, Yes, No, No, Yes, Yes, No~
## $ etest_p <dbl> 55.00, 86.50, 75.00, 66.00, 96.80, 55.00, 74.28, 67.00,~
## $ specialisation <fct> Mkt&HR, Mkt&Fin, Mkt&Fin, Mkt&HR, Mkt&Fin, Mkt&Fin, Mkt~
## $ mba_p <dbl> 58.80, 66.28, 57.80, 59.43, 55.50, 51.58, 53.29, 62.14,~
## $ status <fct> 1, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0~
summary(hire_clean)
## gender ssc_p ssc_b hsc_p hsc_b
## F: 76 Min. :40.89 Length:215 Min. :37.00 Length:215
## M:139 1st Qu.:60.60 Class :character 1st Qu.:60.90 Class :character
## Median :67.00 Mode :character Median :65.00 Mode :character
## Mean :67.30 Mean :66.33
## 3rd Qu.:75.70 3rd Qu.:73.00
## Max. :89.40 Max. :97.70
## hsc_s degree_p degree_t workex etest_p
## Arts : 11 Min. :50.00 Comm&Mgmt:145 No :141 Min. :50.0
## Commerce:113 1st Qu.:61.00 Others : 11 Yes: 74 1st Qu.:60.0
## Science : 91 Median :66.00 Sci&Tech : 59 Median :71.0
## Mean :66.37 Mean :72.1
## 3rd Qu.:72.00 3rd Qu.:83.5
## Max. :91.00 Max. :98.0
## specialisation mba_p status
## Mkt&Fin:120 Min. :51.21 0: 67
## Mkt&HR : 95 1st Qu.:57.95 1:148
## Median :62.00
## Mean :62.28
## 3rd Qu.:66.25
## Max. :77.89
From the result above there is some information that can help the modeling and answer the business questions:
- The target variable proportion is quite imbalanced, since most of the students got hired. That is not great news for modeling, but from the university's and the students' perspective it is very good news. The imbalance can be confirmed with prop.table(). Ideally, a target variable is considered balanced when its classes are distributed roughly equally (50:50).
prop.table(table(hire_clean$status))
##
## 0 1
## 0.3116279 0.6883721
- About 65% of the students do not have work experience, which is normal since most students focus on their studies first.
- The most common degree field is Commerce and Management.
- The most common MBA specialisation is Marketing and Finance.
Cross validation checks how good the model is by splitting the data into a train set and a test set.
- Data train: used for model training.
- Data test: used for testing model performance. The model predicts the test data, and the predictions are compared with the actual values to validate the model's performance.
RNGkind(sample.kind = "Rounding")
set.seed(123)
init <- initial_split(data = hire_clean,
prop = 0.8,
strata = status)
hire_train <- training(init)
hire_test <- testing(init)
Recheck the class proportions of the target variable after splitting the data.
prop.table(table(hire_train$status))
##
## 0 1
## 0.3121387 0.6878613
table(hire_train$status)
##
## 0 1
## 54 119
prop.table(table(hire_test$status))
##
## 0 1
## 0.3095238 0.6904762
table(hire_test$status)
##
## 0 1
## 13 29
Because of the imbalance, even after splitting into train and test data, the proportion of the target variable stays roughly the same. Therefore, one of the following techniques can be applied:
- Upsampling: add minority-class observations until they balance the majority class, by duplicating data. Weakness: it only duplicates data and does not add new information.
- Downsampling: reduce the majority-class observations until they equal the minority class. Usually used when the minority class already has quite a lot of data. Weakness: it discards information from the data at hand (see the sketch below).
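For comparison, a downsampling version could look roughly like the sketch below, using downSample() from caret on the same training data. This is only an illustration and is not used in the rest of this analysis.
# Sketch: balance the classes by discarding majority-class rows instead of duplicating
set.seed(123)
hire_train_ds <- downSample(x = hire_train %>% dplyr::select(-status),
                            y = hire_train$status,
                            yname = "status")
table(hire_train_ds$status)  # both classes reduced to the minority count (54 each)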
In this case let’s use Upsample.
set.seed(123)
hire_train_us <- upSample(x = hire_train %>% dplyr::select(-status),
y = hire_train$status,
yname = "status")
prop.table(table(hire_train_us$status))
##
## 0 1
## 0.5 0.5
table(hire_train_us$status)
##
## 0 1
## 119 119
Here comes the interesting part: modeling!
There are two approaches that can be used to predict whether a student will be hired or not. The first is a Logistic Regression model and the second is K-NN. Both models have their own pros and cons, so let's implement both and compare them to find out which one is better.
First, let's put column status as the target variable and the rest of the columns as predictors into the function glm().
model_lr <- glm(status ~ ., data = hire_train, family = "binomial")
summary(model_lr)
##
## Call:
## glm(formula = status ~ ., family = "binomial", data = hire_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.14688 -0.07866 0.05301 0.24815 2.60278
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -19.489772 6.166586 -3.161 0.001575 **
## genderM 0.632681 0.912041 0.694 0.487872
## ssc_p 0.237694 0.056925 4.176 2.97e-05 ***
## ssc_bOthers 0.586890 1.026890 0.572 0.567646
## hsc_p 0.116535 0.041769 2.790 0.005271 **
## hsc_bOthers -0.126905 0.952764 -0.133 0.894038
## hsc_sCommerce -1.572819 1.507598 -1.043 0.296827
## hsc_sScience -1.086265 1.590515 -0.683 0.494629
## degree_p 0.230454 0.071411 3.227 0.001250 **
## degree_tOthers -1.276990 1.754182 -0.728 0.466633
## degree_tSci&Tech -2.361904 0.926069 -2.550 0.010758 *
## workexYes 2.075737 0.898944 2.309 0.020939 *
## etest_p 0.003647 0.028899 0.126 0.899567
## specialisationMkt&HR -0.884902 0.697665 -1.268 0.204664
## mba_p -0.265985 0.072398 -3.674 0.000239 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 214.797 on 172 degrees of freedom
## Residual deviance: 69.749 on 158 degrees of freedom
## AIC: 99.749
##
## Number of Fisher Scoring iterations: 7
From the summary() output it can be concluded that the p-values of ssc_p, hsc_p, degree_p, workexYes and mba_p indicate a strong influence on the model. Let's cross-check this once more with stepwise selection using stepAIC() from the MASS package.
stepAIC() reports which combination of predictor columns has the most influence by comparing AIC values; the lower the AIC, the better.
stepAIC(model_lr, direction = "both")
## Start: AIC=99.75
## status ~ gender + ssc_p + ssc_b + hsc_p + hsc_b + hsc_s + degree_p +
## degree_t + workex + etest_p + specialisation + mba_p
##
## Df Deviance AIC
## - hsc_s 2 71.004 97.004
## - etest_p 1 69.765 97.765
## - hsc_b 1 69.767 97.767
## - ssc_b 1 70.079 98.079
## - gender 1 70.233 98.233
## - specialisation 1 71.412 99.412
## <none> 69.749 99.749
## - degree_t 2 77.474 103.474
## - workex 1 76.300 104.300
## - hsc_p 1 79.750 107.750
## - degree_p 1 82.717 110.717
## - mba_p 1 87.896 115.896
## - ssc_p 1 102.061 130.061
##
## Step: AIC=97
## status ~ gender + ssc_p + ssc_b + hsc_p + hsc_b + degree_p +
## degree_t + workex + etest_p + specialisation + mba_p
##
## Df Deviance AIC
## - etest_p 1 71.004 95.004
## - hsc_b 1 71.009 95.009
## - ssc_b 1 71.360 95.360
## - gender 1 71.482 95.482
## - specialisation 1 72.892 96.892
## <none> 71.004 97.004
## + hsc_s 2 69.749 99.749
## - degree_t 2 79.574 101.574
## - workex 1 79.790 103.790
## - hsc_p 1 81.011 105.011
## - degree_p 1 84.462 108.462
## - mba_p 1 89.003 113.003
## - ssc_p 1 106.303 130.303
##
## Step: AIC=95
## status ~ gender + ssc_p + ssc_b + hsc_p + hsc_b + degree_p +
## degree_t + workex + specialisation + mba_p
##
## Df Deviance AIC
## - hsc_b 1 71.009 93.009
## - ssc_b 1 71.403 93.403
## - gender 1 71.498 93.498
## - specialisation 1 72.905 94.905
## <none> 71.004 95.004
## + etest_p 1 71.004 97.004
## + hsc_s 2 69.765 97.765
## - degree_t 2 79.645 99.645
## - workex 1 79.791 101.791
## - hsc_p 1 81.208 103.208
## - degree_p 1 84.506 106.506
## - mba_p 1 89.194 111.194
## - ssc_p 1 107.646 129.646
##
## Step: AIC=93.01
## status ~ gender + ssc_p + ssc_b + hsc_p + degree_p + degree_t +
## workex + specialisation + mba_p
##
## Df Deviance AIC
## - gender 1 71.503 91.503
## - ssc_b 1 71.570 91.570
## - specialisation 1 72.977 92.977
## <none> 71.009 93.009
## + hsc_b 1 71.004 95.004
## + etest_p 1 71.009 95.009
## + hsc_s 2 69.776 95.776
## - degree_t 2 79.696 97.696
## - workex 1 79.796 99.796
## - hsc_p 1 81.215 101.215
## - degree_p 1 84.607 104.607
## - mba_p 1 89.720 109.720
## - ssc_p 1 108.176 128.176
##
## Step: AIC=91.5
## status ~ ssc_p + ssc_b + hsc_p + degree_p + degree_t + workex +
## specialisation + mba_p
##
## Df Deviance AIC
## - ssc_b 1 72.827 90.827
## <none> 71.503 91.503
## - specialisation 1 74.522 92.522
## + gender 1 71.009 93.009
## + etest_p 1 71.491 93.491
## + hsc_b 1 71.498 93.498
## + hsc_s 2 70.308 94.308
## - degree_t 2 79.931 95.931
## - workex 1 82.076 100.076
## - hsc_p 1 82.489 100.489
## - degree_p 1 84.658 102.658
## - mba_p 1 95.750 113.750
## - ssc_p 1 108.176 126.176
##
## Step: AIC=90.83
## status ~ ssc_p + hsc_p + degree_p + degree_t + workex + specialisation +
## mba_p
##
## Df Deviance AIC
## <none> 72.827 90.827
## + ssc_b 1 71.503 91.503
## + gender 1 71.570 91.570
## - specialisation 1 76.130 92.130
## + hsc_b 1 72.342 92.342
## + etest_p 1 72.806 92.806
## + hsc_s 2 71.477 93.477
## - degree_t 2 80.714 94.714
## - hsc_p 1 83.108 99.108
## - workex 1 83.474 99.474
## - degree_p 1 85.373 101.373
## - mba_p 1 96.165 112.165
## - ssc_p 1 110.538 126.538
##
## Call: glm(formula = status ~ ssc_p + hsc_p + degree_p + degree_t +
## workex + specialisation + mba_p, family = "binomial", data = hire_train)
##
## Coefficients:
## (Intercept) ssc_p hsc_p
## -17.9221 0.2285 0.1091
## degree_p degree_tOthers degree_tSci&Tech
## 0.2105 -0.6693 -2.0701
## workexYes specialisationMkt&HR mba_p
## 2.4860 -1.1218 -0.2617
##
## Degrees of Freedom: 172 Total (i.e. Null); 164 Residual
## Null Deviance: 214.8
## Residual Deviance: 72.83 AIC: 90.83
The lowest AIC reported by stepAIC() (90.83) keeps the columns ssc_p, hsc_p, degree_p, degree_t, workex, specialisation and mba_p. Using that information about which predictors carry the most influence, fit a new model; specialisation, which was not significant in the full model summary, is left out here.
model_new <- glm(status ~ ssc_p + hsc_p + degree_p + degree_t + workex + mba_p, data = hire_train, family = "binomial")
summary(model_new)
##
## Call:
## glm(formula = status ~ ssc_p + hsc_p + degree_p + degree_t +
## workex + mba_p, family = "binomial", data = hire_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.59796 -0.12374 0.06724 0.30739 2.37830
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -18.29822 4.84192 -3.779 0.000157 ***
## ssc_p 0.22027 0.04843 4.548 5.42e-06 ***
## hsc_p 0.11040 0.03846 2.871 0.004097 **
## degree_p 0.20486 0.06294 3.255 0.001135 **
## degree_tOthers -0.86652 1.35390 -0.640 0.522161
## degree_tSci&Tech -1.98134 0.77251 -2.565 0.010324 *
## workexYes 2.68314 0.83617 3.209 0.001333 **
## mba_p -0.25277 0.06016 -4.202 2.65e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 214.80 on 172 degrees of freedom
## Residual deviance: 76.13 on 165 degrees of freedom
## AIC: 92.13
##
## Number of Fisher Scoring iterations: 7
hire_test$status.pred <- predict(object = model_new ,
newdata = hire_test,
type = "response")
head(hire_test$status.pred)
## [1] 0.9999395 0.4948445 0.6910503 0.9629369 0.9995559 0.9993583
The prediction result from predict() is still a probability (a decimal between 0 and 1). Therefore the predicted numbers need to be converted into 1 (hired) or 0 (not hired).
One way to do this is with an if/else function that applies a threshold:
- If the prediction is below 0.5, convert it to 0.
- Otherwise (0.5 or above), convert it to 1.
con <- function(x) {
  # Convert a predicted probability into a class label using a 0.5 threshold
  if (x < 0.5) 0 else 1
}
hire_test$status.label <- as.factor(sapply(X = hire_test$status.pred, FUN = con))
head(hire_test$status.label)
## [1] 1 0 1 1 1 1
## Levels: 0 1
Good! The function works perfectly.
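As a side note, the same labelling can be done in a single vectorized call with ifelse(), which avoids looping with sapply():
# Equivalent vectorized version of the conversion above (same 0.5 threshold)
hire_test$status.label <- as.factor(ifelse(hire_test$status.pred < 0.5, 0, 1))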
Even after predicting with the model there will still be some wrong predictions. In classification, model performance is evaluated with a confusion matrix via the function confusionMatrix() from caret.
# Set positive = "1" because 1 indicates that the student gets hired
eval_lr <- confusionMatrix(data = hire_test$status.label, reference = hire_test$status, positive = "1")
eval_lr
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 11 4
## 1 2 25
##
## Accuracy : 0.8571
## 95% CI : (0.7146, 0.9457)
## No Information Rate : 0.6905
## P-Value [Acc > NIR] : 0.01118
##
## Kappa : 0.6794
##
## Mcnemar's Test P-Value : 0.68309
##
## Sensitivity : 0.8621
## Specificity : 0.8462
## Pos Pred Value : 0.9259
## Neg Pred Value : 0.7333
## Prevalence : 0.6905
## Detection Rate : 0.5952
## Detection Prevalence : 0.6429
## Balanced Accuracy : 0.8541
##
## 'Positive' Class : 1
##
Logistic Regression can be tuned manually by moving the threshold up or down. Adjusting the threshold can produce a model that is better, or worse, than the initial one, so the main goal must be kept in mind.
The threshold can be nudged up or down little by little until it produces the desired model, but doing that by hand takes a lot of time (a scan over several thresholds is sketched below).
In this case the threshold will be increased and decreased by half of the initial value (+/- 0.25).
Disclaimer: there are no special rules for choosing the threshold when tuning manually, so it can be adjusted freely.
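Instead of nudging the threshold by hand, the scan can also be automated. A minimal sketch, assuming the predictions from model_new are already stored in hire_test$status.pred:
# Try a range of thresholds and record accuracy, recall and precision for each
thresholds <- seq(0.25, 0.75, by = 0.05)
threshold_scan <- sapply(thresholds, function(t) {
  pred_label <- factor(ifelse(hire_test$status.pred < t, 0, 1), levels = c(0, 1))
  cm <- confusionMatrix(data = pred_label,
                        reference = hire_test$status,
                        positive = "1")
  c(threshold = t,
    accuracy  = unname(cm$overall["Accuracy"]),
    recall    = unname(cm$byClass["Sensitivity"]),
    precision = unname(cm$byClass["Pos Pred Value"]))
})
t(threshold_scan)  # one row per threshold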
Threshold value 0.25
con <- function(x) {
  # Lower threshold: label as hired (1) when the probability is 0.25 or higher
  if (x < 0.25) 0 else 1
}
hire_test$status.label <- as.factor(sapply(X = hire_test$status.pred, FUN = con))
head(hire_test$status.label)
## [1] 1 1 1 1 1 1
## Levels: 0 1
eval_lr2 <- confusionMatrix(data = hire_test$status.label, reference = hire_test$status, positive = "1")
eval_lr2
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 9 3
## 1 4 26
##
## Accuracy : 0.8333
## 95% CI : (0.6864, 0.9303)
## No Information Rate : 0.6905
## P-Value [Acc > NIR] : 0.0284
##
## Kappa : 0.6016
##
## Mcnemar's Test P-Value : 1.0000
##
## Sensitivity : 0.8966
## Specificity : 0.6923
## Pos Pred Value : 0.8667
## Neg Pred Value : 0.7500
## Prevalence : 0.6905
## Detection Rate : 0.6190
## Detection Prevalence : 0.7143
## Balanced Accuracy : 0.7944
##
## 'Positive' Class : 1
##
Judging from the evaluation at threshold 0.25, accuracy dropped by about 2 percentage points and the Pos Pred Value by about 6 points, while Sensitivity increased by about 3 points. Furthermore, the number of students predicted to be hired who were actually not hired increased from 2 to 4, while the number of students predicted not to be hired who were actually hired decreased from 4 to 3.
Threshold value 0.75
con <- function(x) {
  # Higher threshold: label as hired (1) only when the probability is 0.75 or higher
  if (x < 0.75) 0 else 1
}
hire_test$status.label <- as.factor(sapply(X = hire_test$status.pred, FUN = con))
head(hire_test$status.label)
## [1] 1 0 0 1 1 1
## Levels: 0 1
eval_lr3 <- confusionMatrix(data = hire_test$status.label, reference = hire_test$status, positive = "1")
eval_lr3
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 12 8
## 1 1 21
##
## Accuracy : 0.7857
## 95% CI : (0.6319, 0.897)
## No Information Rate : 0.6905
## P-Value [Acc > NIR] : 0.1194
##
## Kappa : 0.5635
##
## Mcnemar's Test P-Value : 0.0455
##
## Sensitivity : 0.7241
## Specificity : 0.9231
## Pos Pred Value : 0.9545
## Neg Pred Value : 0.6000
## Prevalence : 0.6905
## Detection Rate : 0.5000
## Detection Prevalence : 0.5238
## Balanced Accuracy : 0.8236
##
## 'Positive' Class : 1
##
Judging from the evaluation at threshold 0.75, accuracy dropped by about 7 percentage points and Sensitivity by about 14 points, while the Pos Pred Value actually increased by about 3 points. Furthermore, the number of students predicted to be hired who were actually not hired dropped from 2 to 1, but the number of students predicted not to be hired who were actually hired doubled from 4 to 8.
K-NN, or K nearest neighbors, is the second approach to predict whether a student will be hired or not. Unfortunately, a K-NN model can only use numeric predictors. Because of that the result might not be as good as the Logistic Regression model, but with the K-NN approach the university can predict whether a student will be hired based on the students' test scores alone.
hire_knn<- hire_clean %>%
dplyr::select(ssc_p, hsc_p, degree_p, etest_p,mba_p,status)
glimpse(hire_knn)
## Rows: 215
## Columns: 6
## $ ssc_p <dbl> 67.00, 79.33, 65.00, 56.00, 85.80, 55.00, 46.00, 82.00, 73.00~
## $ hsc_p <dbl> 91.00, 78.33, 68.00, 52.00, 73.60, 49.80, 49.20, 64.00, 79.00~
## $ degree_p <dbl> 58.00, 77.48, 64.00, 52.00, 73.30, 67.25, 79.00, 66.00, 72.00~
## $ etest_p <dbl> 55.00, 86.50, 75.00, 66.00, 96.80, 55.00, 74.28, 67.00, 91.34~
## $ mba_p <dbl> 58.80, 66.28, 57.80, 59.43, 55.50, 51.58, 53.29, 62.14, 61.29~
## $ status <fct> 1, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1~
prop.table(table(hire_knn$status))
##
## 0 1
## 0.3116279 0.6883721
set.seed(123)
hire_knn_us <- upSample(x = hire_knn %>% dplyr::select(-status),
y = hire_knn$status,
yname = "status")
prop.table(table(hire_knn_us$status))
##
## 0 1
## 0.5 0.5
table(hire_knn_us$status)
##
## 0 1
## 148 148
init <- initial_split(data = hire_knn_us,prop = 0.8,strata = status )
hire_knn_train <- training(init)
hire_knn_test <- testing(init)
prop.table(table(hire_knn_train$status))
##
## 0 1
## 0.5 0.5
prop.table(table(hire_knn_test$status))
##
## 0 1
## 0.5 0.5
For k-NN, the data needs to be separated into predictors and labels (the target variable).
hire_knn_train_x <- hire_knn_train %>%
dplyr::select(-status)
hire_knn_test_x <- hire_knn_test %>%
dplyr::select(-status)
# target
hire_knn_train_y <- hire_knn_train %>%
dplyr::select(status)
hire_knn_test_y <- hire_knn_test %>%
dplyr::select(status)
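Since k-NN is distance-based, predictors on very different scales are usually standardized first. Here all five predictors are percentages on a comparable 0-100 scale, so that step is skipped, but a minimal sketch (not used in the results below) would look like this:
# Optional: z-score the predictors using the training set's center and spread
hire_knn_train_xs <- scale(hire_knn_train_x)
hire_knn_test_xs  <- scale(hire_knn_test_x,
                           center = attr(hire_knn_train_xs, "scaled:center"),
                           scale  = attr(hire_knn_train_xs, "scaled:scale"))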
Unlike Logistic Regression, k-NN does not build an explicit model, so it goes straight to prediction.
k-NN needs a K value, which indicates how many nearest neighbors to use for the vote. A common rule of thumb for choosing K is the square root of N, where N is the number of observations in the training set; in R this is simply sqrt(N).
sqrt(nrow(hire_knn_train_x))
## [1] 15.42725
After obtaining the K value, pass it to knn() as the number of nearest neighbors. Since sqrt(N) is about 15.4, round it to 15 (an odd K also helps avoid ties in the majority vote).
hire_knn_pred <- knn(train = hire_knn_train_x, test = hire_knn_test_x, cl = as.factor(hire_knn_train_y$status), k = 15)
head(hire_knn_pred)
## [1] 0 0 0 0 1 1
## Levels: 0 1
eval_knn <- confusionMatrix(data = hire_knn_pred, reference = as.factor(hire_knn_test_y$status), positive = "1")
eval_knn
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 20 4
## 1 9 25
##
## Accuracy : 0.7759
## 95% CI : (0.6473, 0.8749)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : 1.506e-05
##
## Kappa : 0.5517
##
## Mcnemar's Test P-Value : 0.2673
##
## Sensitivity : 0.8621
## Specificity : 0.6897
## Pos Pred Value : 0.7353
## Neg Pred Value : 0.8333
## Prevalence : 0.5000
## Detection Rate : 0.4310
## Detection Prevalence : 0.5862
## Balanced Accuracy : 0.7759
##
## 'Positive' Class : 1
##
# Collect the evaluation metrics of both models for comparison
compare_log <- tibble(Model = "Logistic Regression",
                      Accuracy  = round((eval_lr$overall[1] * 100), 2),
                      Recall    = round((eval_lr$byClass[1] * 100), 2),
                      Precision = round((eval_lr$byClass[3] * 100), 2))
compare_knn <- tibble(Model = "K-NN",
                      Accuracy  = round((eval_knn$overall[1] * 100), 2),
                      Recall    = round((eval_knn$byClass[1] * 100), 2),
                      Precision = round((eval_knn$byClass[3] * 100), 2))
rbind(compare_log, compare_knn)
Judging by the confusion matrix results, the model's ability to correctly predict whether a student will be hired is better with Logistic Regression, since its overall results are higher than K-NN's.
1. From the comparison of the confusion matrix results between Logistic Regression and K-NN, the Logistic Regression model gives better Accuracy and Precision scores.
The Logistic Regression model can outperform K-NN because in this case the data set contains not only numeric variables but categorical ones as well. As mentioned previously, K-NN can only use numeric predictors, so Logistic Regression has an advantage: it can use both numeric and categorical variables as predictors.
Furthermore, the main goal is to predict whether a student will be hired or not. For the university's data scientist it is better to accept a lower recall than a lower precision, because with high precision the university avoids giving false hope to students who will not actually be hired. Both metrics can be recomputed from the confusion matrix, as the check below shows.
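As a sanity check, precision and recall can be recomputed by hand from the counts in the Logistic Regression confusion matrix at threshold 0.5 (TP = 25, FP = 2, FN = 4):
# Precision = TP / (TP + FP), Recall = TP / (TP + FN), counts taken from eval_lr
TP <- 25; FP <- 2; FN <- 4
c(precision = TP / (TP + FP),   # 25/27 = 0.926, the Pos Pred Value
  recall    = TP / (TP + FN))   # 25/29 = 0.862, the Sensitivity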
2. Based on the stepAIC() results, the columns ssc_p, hsc_p, degree_p, degree_t, workex and mba_p are the most influential for predicting whether a student will be hired, since the combination that keeps them (together with specialisation) gives the lowest AIC.
summary(model_new)
##
## Call:
## glm(formula = status ~ ssc_p + hsc_p + degree_p + degree_t +
## workex + mba_p, family = "binomial", data = hire_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.59796 -0.12374 0.06724 0.30739 2.37830
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -18.29822 4.84192 -3.779 0.000157 ***
## ssc_p 0.22027 0.04843 4.548 5.42e-06 ***
## hsc_p 0.11040 0.03846 2.871 0.004097 **
## degree_p 0.20486 0.06294 3.255 0.001135 **
## degree_tOthers -0.86652 1.35390 -0.640 0.522161
## degree_tSci&Tech -1.98134 0.77251 -2.565 0.010324 *
## workexYes 2.68314 0.83617 3.209 0.001333 **
## mba_p -0.25277 0.06016 -4.202 2.65e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 214.80 on 172 degrees of freedom
## Residual deviance: 76.13 on 165 degrees of freedom
## AIC: 92.13
##
## Number of Fisher Scoring iterations: 7
From model_new it can be seen that students who have work experience have a higher chance of being hired than students with no work experience at all. To find out how many times the workexYes coefficient increases the odds, pass its estimate into the function exp().
exp(2.68314)
## [1] 14.63096
The result implies that students with work experience have roughly 14 times higher odds of being hired than students without any work experience. Therefore, in education, students are encouraged to do an internship at least once, in order to gain experience of working in the real world.
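The same interpretation extends to every coefficient: exponentiating the whole coefficient vector of model_new gives each predictor's odds ratio in one call.
# Odds ratios for all predictors in model_new
# (values above 1 increase the odds of being placed, values below 1 decrease them)
exp(coef(model_new))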
3. Viewed through the stepAIC() selection, the employability test percentage (etest_p) is dropped, but the academic percentages ssc_p, hsc_p, degree_p and mba_p are all kept, so overall marks do play a role in whether a student gets hired. K-NN, which only uses the percentage scores as its prediction parameters, also produces a fairly good model, although not as good as the Logistic Regression model.
In the real world it is common for HR to look at grade percentages, especially for fresh graduates, as a consideration for hiring or not. So even though the Logistic Regression model performs well without etest_p as a predictor, it cannot be concluded that grade percentages do not affect whether a student will get a job.
4. Out of the 148 students who were hired, 95 of them came from the Marketing & Finance specialisation.
knitr::include_graphics("capture.png")