Welcome to my Rmd. I created this notebook to improve my understanding of classification machine learning using the R language.
The data set contains information on students from XYZ Campus and whether or not they were hired by a company after graduating.
Column descriptions:
1. sl_no: Serial number / index number.
2. gender: Gender of the candidate (M = Male, F = Female).
3. ssc_p: Secondary education percentage - overall marks (in percent) obtained in the grade 10 examinations.
4. ssc_b: Board of education whose curriculum the candidate followed in grade 10 (Central or Others).
5. hsc_p: Higher secondary education percentage - overall marks (in percent) obtained in the grade 12 examinations.
6. hsc_s: Specialisation in higher secondary education.
7. degree_p: Degree percentage - overall marks (in percent) obtained during the undergraduate degree.
8. degree_t: Undergraduate degree type - field of degree education.
9. workex: Indicates whether the student has work experience (Yes or No).
10. etest_p: Employability test result in percent (conducted by the college).
11. specialisation: Post-graduate (MBA) specialisation.
12. mba_p: MBA percentage - overall marks (in percent) obtained during the MBA.
13. status: Placement status (Placed or Not Placed). This is the target variable.
14. salary: Yearly salary offered to the candidate by the company, in USD.
You may download the data set from Kaggle: https://www.kaggle.com/benroshan/factors-affecting-campus-placement
1. Which machine learning model provides the better result?
2. Which factors influence whether a candidate gets hired?
3. Do percentages/overall marks matter for getting hired?
4. Which degree specialisation is most in demand by corporates?
library(tidyverse)   # data wrangling and ggplot2
library(caret)       # confusionMatrix(), upSample()
library(ggplot2)     # plotting (also loaded by tidyverse)
library(dplyr)       # data manipulation (also loaded by tidyverse)
library(data.table)  # fast data handling
library(GGally)      # ggcorr() correlation plot
library(tidymodels)  # modeling framework
library(car)         # regression diagnostics
library(class)       # knn()
library(MASS)        # stepAIC(); MASS masks dplyr::select, hence dplyr::select() below
library(rsample)     # initial_split(), training(), testing()
hire <- read.csv("Placement_Data_Full_Class.csv")
head(hire)
There is a function in R called ggcorr() (from GGally) that can be used to check the correlation between columns and get a sense of how strongly each column influences the target variable, which in this case is the column status.
Unfortunately ggcorr() can only calculate correlations for integer/numeric columns, so the values in column status must first be converted into integers: "Placed" -> 1, "Not Placed" -> 0.
hire$status <- ifelse(hire$status == "Not Placed", 0, 1)
# Check the function result
head(hire$status)
## [1] 1 1 1 0 1 0
Good! The function works as expected. Now let's check the correlation between the columns.
ggcorr(hire, label = T)
From the ggcorr() visualization, the columns ssc_p, hsc_p and degree_p have a strong influence on whether a student will be hired. The columns etest_p and mba_p also have some influence, but not as much as the three columns mentioned earlier.
What about the columns gender, ssc_b, hsc_b, hsc_s, degree_t, workex and specialisation? For now those columns do not need to be transformed into integers, since the modeling step itself will show whether they matter for the target variable.
But hold on a second! Why not use some simple visualization first? Hopefully it will give a little insight into how gender, ssc_b, hsc_b, hsc_s, degree_t, workex and specialisation relate to the target variable.
- Correlation between column gender and status
hire %>%
  group_by(gender, status) %>%
  summarise(freq = n()) %>%
  ggplot(aes(x = as.factor(status),
             y = freq,
             fill = gender)) +
  geom_col(position = position_dodge()) +
  geom_text(aes(label = freq),
            vjust = 1.6,
            color = "white",
            position = position_dodge(0.9), size = 10) +
  theme_minimal() +
  labs(title = "Correlation between column gender and status",
       x = "Not Hired (0) or Hired (1)",
       y = "Freq",
       fill = "Gender")
- Correlation between column ssc_b and status
hire %>%
  group_by(ssc_b, status) %>%
  summarise(freq = n()) %>%
  ggplot(aes(x = as.factor(status),
             y = freq,
             fill = ssc_b)) +
  geom_col(position = position_dodge()) +
  geom_text(aes(label = freq),
            vjust = 1.6,
            color = "white",
            position = position_dodge(0.9), size = 10) +
  theme_minimal() +
  labs(title = "Correlation between column ssc_b and status",
       x = "Not Hired (0) or Hired (1)",
       y = "Freq",
       fill = "Board")
- Correlation between column hsc_b and status
hire %>%
  group_by(hsc_b, status) %>%
  summarise(freq = n()) %>%
  ggplot(aes(x = as.factor(status),
             y = freq,
             fill = hsc_b)) +
  geom_col(position = position_dodge()) +
  geom_text(aes(label = freq),
            vjust = 1.6,
            color = "white",
            position = position_dodge(0.9), size = 10) +
  theme_minimal() +
  labs(title = "Correlation between column hsc_b and status",
       x = "Not Hired (0) or Hired (1)",
       y = "Freq",
       fill = "Board")
- Correlation between column hsc_s and status
hire %>%
  group_by(hsc_s, status) %>%
  summarise(freq = n()) %>%
  ggplot(aes(x = as.factor(status),
             y = freq,
             fill = hsc_s)) +
  geom_col(position = position_dodge()) +
  geom_text(aes(label = freq),
            vjust = 1.6,
            color = "white",
            position = position_dodge(0.9), size = 5) +
  theme_minimal() +
  labs(title = "Correlation between column hsc_s and status",
       x = "Not Hired (0) or Hired (1)",
       y = "Freq",
       fill = "Specialisation")
- Correlation between column degree_t and status
hire %>%
  group_by(degree_t, status) %>%
  summarise(freq = n()) %>%
  ggplot(aes(x = as.factor(status),
             y = freq,
             fill = degree_t)) +
  geom_col(position = position_dodge()) +
  geom_text(aes(label = freq),
            vjust = 1.6,
            color = "white",
            position = position_dodge(0.9), size = 5) +
  theme_minimal() +
  labs(title = "Correlation between column degree_t and status",
       x = "Not Hired (0) or Hired (1)",
       y = "Freq",
       fill = "Degree type")
- Correlation between column workex and status
hire %>%
  group_by(workex, status) %>%
  summarise(freq = n()) %>%
  ggplot(aes(x = as.factor(status),
             y = freq,
             fill = workex)) +
  geom_col(position = position_dodge()) +
  geom_text(aes(label = freq),
            vjust = 1.6,
            color = "white",
            position = position_dodge(0.9), size = 5) +
  theme_minimal() +
  labs(title = "Correlation between column workex and status",
       x = "Not Hired (0) or Hired (1)",
       y = "Freq",
       fill = "Work experience")
- Correlation between column specialisation and status
hire %>%
  group_by(specialisation, status) %>%
  summarise(freq = n()) %>%
  ggplot(aes(x = as.factor(status),
             y = freq,
             fill = specialisation)) +
  geom_col(position = position_dodge()) +
  geom_text(aes(label = freq),
            vjust = 1.6,
            color = "white",
            position = position_dodge(0.9), size = 5) +
  theme_minimal() +
  labs(title = "Correlation between column specialisation and status",
       x = "Not Hired (0) or Hired (1)",
       y = "Freq",
       fill = "Specialisation")
From the simple visualizations above, the column workex might have some influence on whether a student gets hired, since most of the students who have work experience were hired, compared to the students who do not. For the remaining columns no conclusion can be drawn yet, since their visualizations do not show a pattern as clear as the one for workex; the quick check below quantifies the workex pattern.
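A row-wise proportion table can quantify that impression, i.e. the share of hired vs not hired within each work-experience group:
# Proportion of not hired (0) vs hired (1) within each workex group
prop.table(table(hire$workex, hire$status), margin = 1)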
From the data inspection there is some rework to be done, but before that let's check whether there are any missing values.
colSums(is.na(hire))
## sl_no gender ssc_p ssc_b hsc_p
## 0 0 0 0 0
## hsc_b hsc_s degree_p degree_t workex
## 0 0 0 0 0
## etest_p specialisation mba_p status salary
## 0 0 0 0 67
Column sl_no is a unique identifier for each student, so it can be ignored. Furthermore, there are missing values in column salary; that column can be dropped, since the goal is to find out whether a student gets hired, not how much they will earn.
hire <- hire %>%
dplyr::select(-sl_no, -salary)
hire_clean <- hire %>%
mutate(gender = as.factor(gender),
hsc_s = as.factor(hsc_s),
degree_t = as.factor(degree_t),
specialisation = as.factor(specialisation),
workex = as.factor(workex),
status = as.factor(status))
glimpse(hire_clean)
## Rows: 215
## Columns: 13
## $ gender <fct> M, M, M, M, M, M, F, M, M, M, M, M, F, F, M, F, M, F, F~
## $ ssc_p <dbl> 67.00, 79.33, 65.00, 56.00, 85.80, 55.00, 46.00, 82.00,~
## $ ssc_b <chr> "Others", "Central", "Central", "Central", "Central", "~
## $ hsc_p <dbl> 91.00, 78.33, 68.00, 52.00, 73.60, 49.80, 49.20, 64.00,~
## $ hsc_b <chr> "Others", "Others", "Central", "Central", "Central", "O~
## $ hsc_s <fct> Commerce, Science, Arts, Science, Commerce, Science, Co~
## $ degree_p <dbl> 58.00, 77.48, 64.00, 52.00, 73.30, 67.25, 79.00, 66.00,~
## $ degree_t <fct> Sci&Tech, Sci&Tech, Comm&Mgmt, Sci&Tech, Comm&Mgmt, Sci~
## $ workex <fct> No, Yes, No, No, No, Yes, No, Yes, No, No, Yes, Yes, No~
## $ etest_p <dbl> 55.00, 86.50, 75.00, 66.00, 96.80, 55.00, 74.28, 67.00,~
## $ specialisation <fct> Mkt&HR, Mkt&Fin, Mkt&Fin, Mkt&HR, Mkt&Fin, Mkt&Fin, Mkt~
## $ mba_p <dbl> 58.80, 66.28, 57.80, 59.43, 55.50, 51.58, 53.29, 62.14,~
## $ status <fct> 1, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0~
summary(hire_clean)
## gender ssc_p ssc_b hsc_p hsc_b
## F: 76 Min. :40.89 Length:215 Min. :37.00 Length:215
## M:139 1st Qu.:60.60 Class :character 1st Qu.:60.90 Class :character
## Median :67.00 Mode :character Median :65.00 Mode :character
## Mean :67.30 Mean :66.33
## 3rd Qu.:75.70 3rd Qu.:73.00
## Max. :89.40 Max. :97.70
## hsc_s degree_p degree_t workex etest_p
## Arts : 11 Min. :50.00 Comm&Mgmt:145 No :141 Min. :50.0
## Commerce:113 1st Qu.:61.00 Others : 11 Yes: 74 1st Qu.:60.0
## Science : 91 Median :66.00 Sci&Tech : 59 Median :71.0
## Mean :66.37 Mean :72.1
## 3rd Qu.:72.00 3rd Qu.:83.5
## Max. :91.00 Max. :98.0
## specialisation mba_p status
## Mkt&Fin:120 Min. :51.21 0: 67
## Mkt&HR : 95 1st Qu.:57.95 1:148
## Median :62.00
## Mean :62.28
## 3rd Qu.:66.25
## Max. :77.89
From the result above there is some information that can help the modeling and answer the business questions:
- The target variable proportion is quite imbalanced, since most of the students got hired. That is not great news for modeling, but from the university's and the students' perspective it is very good news. The imbalance can be confirmed with prop.table(). Ideally, a target variable is considered balanced when its classes are distributed roughly equally (50:50).
prop.table(table(hire_clean$status))
##
## 0 1
## 0.3116279 0.6883721
- About 65% of the students do not have work experience, which is normal since most students focus on their studies first.
- The most common degree field is Commerce and Management.
- The most common MBA specialisation is Marketing and Finance.
Cross validation checks how good the model is by splitting the data into a train set and a test set.
- Data train: used for model training.
- Data test: used for testing model performance. The model predicts the test data, and the predictions are compared with the actual values to validate the model's performance.
RNGkind(sample.kind = "Rounding")
set.seed(123)
init <- initial_split(data = hire_clean,
prop = 0.8,
strata = status)
hire_train <- training(init)
hire_test <- testing(init)
Recheck the class proportions of the target variable after splitting the data.
prop.table(table(hire_train$status))
##
## 0 1
## 0.3121387 0.6878613
table(hire_train$status)
##
## 0 1
## 54 119
prop.table(table(hire_test$status))
##
## 0 1
## 0.3095238 0.6904762
table(hire_test$status)
##
## 0 1
## 13 29
Because of the imbalance, even after splitting into train and test data, the proportion of the target variable stays roughly the same. Therefore, one of the following techniques can be applied:
- Upsampling: add minority-class observations until they balance the majority class, by duplicating data. Weakness: it only duplicates data and does not add new information.
- Downsampling: reduce the majority-class observations until they equal the minority class. Usually used when the minority class already has quite a lot of data. Weakness: it discards information from the data at hand (see the sketch below).
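For comparison, a downsampling version could look roughly like the sketch below, using downSample() from caret on the same training data. This is only an illustration and is not used in the rest of this analysis.
# Sketch: balance the classes by discarding majority-class rows instead of duplicating
set.seed(123)
hire_train_ds <- downSample(x = hire_train %>% dplyr::select(-status),
                            y = hire_train$status,
                            yname = "status")
table(hire_train_ds$status)  # both classes reduced to the minority count (54 each)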
In this case let’s use Upsample.
set.seed(123)
hire_train_us <- upSample(x = hire_train %>% dplyr::select(-status),
y = hire_train$status,
yname = "status")
prop.table(table(hire_train_us$status))
##
## 0 1
## 0.5 0.5
table(hire_train_us$status)
##
## 0 1
## 119 119
Here comes the interesting part: modeling!
There are two approaches that can be used to predict whether a student will be hired or not. The first is a Logistic Regression model and the second is K-NN. Both models have their own pros and cons, so let's implement both and compare them to find out which one is better.
First, let's put column status as the target variable and the rest of the columns as predictors into the function glm().
model_lr <- glm(status ~ ., data = hire_train, family = "binomial")
summary(model_lr)
##
## Call:
## glm(formula = status ~ ., family = "binomial", data = hire_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.14688 -0.07866 0.05301 0.24815 2.60278
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -19.489772 6.166586 -3.161 0.001575 **
## genderM 0.632681 0.912041 0.694 0.487872
## ssc_p 0.237694 0.056925 4.176 2.97e-05 ***
## ssc_bOthers 0.586890 1.026890 0.572 0.567646
## hsc_p 0.116535 0.041769 2.790 0.005271 **
## hsc_bOthers -0.126905 0.952764 -0.133 0.894038
## hsc_sCommerce -1.572819 1.507598 -1.043 0.296827
## hsc_sScience -1.086265 1.590515 -0.683 0.494629
## degree_p 0.230454 0.071411 3.227 0.001250 **
## degree_tOthers -1.276990 1.754182 -0.728 0.466633
## degree_tSci&Tech -2.361904 0.926069 -2.550 0.010758 *
## workexYes 2.075737 0.898944 2.309 0.020939 *
## etest_p 0.003647 0.028899 0.126 0.899567
## specialisationMkt&HR -0.884902 0.697665 -1.268 0.204664
## mba_p -0.265985 0.072398 -3.674 0.000239 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 214.797 on 172 degrees of freedom
## Residual deviance: 69.749 on 158 degrees of freedom
## AIC: 99.749
##
## Number of Fisher Scoring iterations: 7
From the summary() output it can be concluded that the p-values of ssc_p, hsc_p, degree_p, workexYes and mba_p indicate a strong influence on the model. Let's cross-check this once more with stepwise selection using stepAIC() from the MASS package.
stepAIC() reports which combination of predictor columns has the most influence by comparing AIC values; the lower the AIC, the better.
stepAIC(model_lr, direction = "both")
## Start: AIC=99.75
## status ~ gender + ssc_p + ssc_b + hsc_p + hsc_b + hsc_s + degree_p +
## degree_t + workex + etest_p + specialisation + mba_p
##
## Df Deviance AIC
## - hsc_s 2 71.004 97.004
## - etest_p 1 69.765 97.765
## - hsc_b 1 69.767 97.767
## - ssc_b 1 70.079 98.079
## - gender 1 70.233 98.233
## - specialisation 1 71.412 99.412
## <none> 69.749 99.749
## - degree_t 2 77.474 103.474
## - workex 1 76.300 104.300
## - hsc_p 1 79.750 107.750
## - degree_p 1 82.717 110.717
## - mba_p 1 87.896 115.896
## - ssc_p 1 102.061 130.061
##
## Step: AIC=97
## status ~ gender + ssc_p + ssc_b + hsc_p + hsc_b + degree_p +
## degree_t + workex + etest_p + specialisation + mba_p
##
## Df Deviance AIC
## - etest_p 1 71.004 95.004
## - hsc_b 1 71.009 95.009
## - ssc_b 1 71.360 95.360
## - gender 1 71.482 95.482
## - specialisation 1 72.892 96.892
## <none> 71.004 97.004
## + hsc_s 2 69.749 99.749
## - degree_t 2 79.574 101.574
## - workex 1 79.790 103.790
## - hsc_p 1 81.011 105.011
## - degree_p 1 84.462 108.462
## - mba_p 1 89.003 113.003
## - ssc_p 1 106.303 130.303
##
## Step: AIC=95
## status ~ gender + ssc_p + ssc_b + hsc_p + hsc_b + degree_p +
## degree_t + workex + specialisation + mba_p
##
## Df Deviance AIC
## - hsc_b 1 71.009 93.009
## - ssc_b 1 71.403 93.403
## - gender 1 71.498 93.498
## - specialisation 1 72.905 94.905
## <none> 71.004 95.004
## + etest_p 1 71.004 97.004
## + hsc_s 2 69.765 97.765
## - degree_t 2 79.645 99.645
## - workex 1 79.791 101.791
## - hsc_p 1 81.208 103.208
## - degree_p 1 84.506 106.506
## - mba_p 1 89.194 111.194
## - ssc_p 1 107.646 129.646
##
## Step: AIC=93.01
## status ~ gender + ssc_p + ssc_b + hsc_p + degree_p + degree_t +
## workex + specialisation + mba_p
##
## Df Deviance AIC
## - gender 1 71.503 91.503
## - ssc_b 1 71.570 91.570
## - specialisation 1 72.977 92.977
## <none> 71.009 93.009
## + hsc_b 1 71.004 95.004
## + etest_p 1 71.009 95.009
## + hsc_s 2 69.776 95.776
## - degree_t 2 79.696 97.696
## - workex 1 79.796 99.796
## - hsc_p 1 81.215 101.215
## - degree_p 1 84.607 104.607
## - mba_p 1 89.720 109.720
## - ssc_p 1 108.176 128.176
##
## Step: AIC=91.5
## status ~ ssc_p + ssc_b + hsc_p + degree_p + degree_t + workex +
## specialisation + mba_p
##
## Df Deviance AIC
## - ssc_b 1 72.827 90.827
## <none> 71.503 91.503
## - specialisation 1 74.522 92.522
## + gender 1 71.009 93.009
## + etest_p 1 71.491 93.491
## + hsc_b 1 71.498 93.498
## + hsc_s 2 70.308 94.308
## - degree_t 2 79.931 95.931
## - workex 1 82.076 100.076
## - hsc_p 1 82.489 100.489
## - degree_p 1 84.658 102.658
## - mba_p 1 95.750 113.750
## - ssc_p 1 108.176 126.176
##
## Step: AIC=90.83
## status ~ ssc_p + hsc_p + degree_p + degree_t + workex + specialisation +
## mba_p
##
## Df Deviance AIC
## <none> 72.827 90.827
## + ssc_b 1 71.503 91.503
## + gender 1 71.570 91.570
## - specialisation 1 76.130 92.130
## + hsc_b 1 72.342 92.342
## + etest_p 1 72.806 92.806
## + hsc_s 2 71.477 93.477
## - degree_t 2 80.714 94.714
## - hsc_p 1 83.108 99.108
## - workex 1 83.474 99.474
## - degree_p 1 85.373 101.373
## - mba_p 1 96.165 112.165
## - ssc_p 1 110.538 126.538
##
## Call: glm(formula = status ~ ssc_p + hsc_p + degree_p + degree_t +
## workex + specialisation + mba_p, family = "binomial", data = hire_train)
##
## Coefficients:
## (Intercept) ssc_p hsc_p
## -17.9221 0.2285 0.1091
## degree_p degree_tOthers degree_tSci&Tech
## 0.2105 -0.6693 -2.0701
## workexYes specialisationMkt&HR mba_p
## 2.4860 -1.1218 -0.2617
##
## Degrees of Freedom: 172 Total (i.e. Null); 164 Residual
## Null Deviance: 214.8
## Residual Deviance: 72.83 AIC: 90.83
The lowest AIC reported by stepAIC() (90.83) keeps the columns ssc_p, hsc_p, degree_p, degree_t, workex, specialisation and mba_p. Using that information about which predictors carry the most influence, fit a new model; specialisation, which was not significant in the full model summary, is left out here.
model_new <- glm(status ~ ssc_p + hsc_p + degree_p + degree_t + workex + mba_p, data = hire_train, family = "binomial")
summary(model_new)
##
## Call:
## glm(formula = status ~ ssc_p + hsc_p + degree_p + degree_t +
## workex + mba_p, family = "binomial", data = hire_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.59796 -0.12374 0.06724 0.30739 2.37830
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -18.29822 4.84192 -3.779 0.000157 ***
## ssc_p 0.22027 0.04843 4.548 5.42e-06 ***
## hsc_p 0.11040 0.03846 2.871 0.004097 **
## degree_p 0.20486 0.06294 3.255 0.001135 **
## degree_tOthers -0.86652 1.35390 -0.640 0.522161
## degree_tSci&Tech -1.98134 0.77251 -2.565 0.010324 *
## workexYes 2.68314 0.83617 3.209 0.001333 **
## mba_p -0.25277 0.06016 -4.202 2.65e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 214.80 on 172 degrees of freedom
## Residual deviance: 76.13 on 165 degrees of freedom
## AIC: 92.13
##
## Number of Fisher Scoring iterations: 7
hire_test$status.pred <- predict(object = model_new ,
newdata = hire_test,
type = "response")
head(hire_test$status.pred)
## [1] 0.9999395 0.4948445 0.6910503 0.9629369 0.9995559 0.9993583
The prediction result from predict() is still a probability (a decimal between 0 and 1). Therefore the predicted numbers need to be converted into 1 (hired) or 0 (not hired).
One way to do this is with an if/else function that applies a threshold:
- If the prediction is below 0.5, convert it to 0.
- Otherwise (0.5 or above), convert it to 1.
con <- function(x) {
  # Convert a predicted probability into a class label using a 0.5 threshold
  if (x < 0.5) 0 else 1
}
hire_test$status.label <- as.factor(sapply(X = hire_test$status.pred, FUN = con))
head(hire_test$status.label)
## [1] 1 0 1 1 1 1
## Levels: 0 1
Good! The function works perfectly.
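As a side note, the same labelling can be done in a single vectorized call with ifelse(), which avoids looping with sapply():
# Equivalent vectorized version of the conversion above (same 0.5 threshold)
hire_test$status.label <- as.factor(ifelse(hire_test$status.pred < 0.5, 0, 1))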
Even after predicting with the model there will still be some wrong predictions. In classification, model performance is evaluated with a confusion matrix via the function confusionMatrix() from caret.
# Set positive = "1" because 1 indicates that the student gets hired
eval_lr <- confusionMatrix(data = hire_test$status.label, reference = hire_test$status, positive = "1")
eval_lr
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 11 4
## 1 2 25
##
## Accuracy : 0.8571
## 95% CI : (0.7146, 0.9457)
## No Information Rate : 0.6905
## P-Value [Acc > NIR] : 0.01118
##
## Kappa : 0.6794
##
## Mcnemar's Test P-Value : 0.68309
##
## Sensitivity : 0.8621
## Specificity : 0.8462
## Pos Pred Value : 0.9259
## Neg Pred Value : 0.7333
## Prevalence : 0.6905
## Detection Rate : 0.5952
## Detection Prevalence : 0.6429
## Balanced Accuracy : 0.8541
##
## 'Positive' Class : 1
##
Logistic Regression can be tuned manually by moving the threshold up or down. Adjusting the threshold can produce a model that is better, or worse, than the initial one, so the main goal must be kept in mind.
The threshold can be nudged up or down little by little until it produces the desired model, but doing that by hand takes a lot of time (a scan over several thresholds is sketched below).
In this case the threshold will be increased and decreased by half of the initial value (+/- 0.25).
Disclaimer: there are no special rules for choosing the threshold when tuning manually, so it can be adjusted freely.
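Instead of nudging the threshold by hand, the scan can also be automated. A minimal sketch, assuming the predictions from model_new are already stored in hire_test$status.pred:
# Try a range of thresholds and record accuracy, recall and precision for each
thresholds <- seq(0.25, 0.75, by = 0.05)
threshold_scan <- sapply(thresholds, function(t) {
  pred_label <- factor(ifelse(hire_test$status.pred < t, 0, 1), levels = c(0, 1))
  cm <- confusionMatrix(data = pred_label,
                        reference = hire_test$status,
                        positive = "1")
  c(threshold = t,
    accuracy  = unname(cm$overall["Accuracy"]),
    recall    = unname(cm$byClass["Sensitivity"]),
    precision = unname(cm$byClass["Pos Pred Value"]))
})
t(threshold_scan)  # one row per threshold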
Threshold value 0.25
con <- function(x) {
  # Lower threshold: label as hired (1) when the probability is 0.25 or higher
  if (x < 0.25) 0 else 1
}
hire_test$status.label <- as.factor(sapply(X = hire_test$status.pred, FUN = con))
head(hire_test$status.label)
## [1] 1 1 1 1 1 1
## Levels: 0 1
eval_lr2 <- confusionMatrix(data = hire_test$status.label, reference = hire_test$status, positive = "1")
eval_lr2
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 9 3
## 1 4 26
##
## Accuracy : 0.8333
## 95% CI : (0.6864, 0.9303)
## No Information Rate : 0.6905
## P-Value [Acc > NIR] : 0.0284
##
## Kappa : 0.6016
##
## Mcnemar's Test P-Value : 1.0000
##
## Sensitivity : 0.8966
## Specificity : 0.6923
## Pos Pred Value : 0.8667
## Neg Pred Value : 0.7500
## Prevalence : 0.6905
## Detection Rate : 0.6190
## Detection Prevalence : 0.7143
## Balanced Accuracy : 0.7944
##
## 'Positive' Class : 1
##
Judging from the evaluation at threshold 0.25, accuracy dropped by about 2 percentage points and the Pos Pred Value by about 6 points, while Sensitivity increased by about 3 points. Furthermore, the number of students predicted to be hired who were actually not hired increased from 2 to 4, while the number of students predicted not to be hired who were actually hired decreased from 4 to 3.
Threshold value 0.75
con <- function(x) {
  # Higher threshold: label as hired (1) only when the probability is 0.75 or higher
  if (x < 0.75) 0 else 1
}
hire_test$status.label <- as.factor(sapply(X = hire_test$status.pred, FUN = con))
head(hire_test$status.label)
## [1] 1 0 0 1 1 1
## Levels: 0 1
eval_lr3 <- confusionMatrix(data = hire_test$status.label, reference = hire_test$status, positive = "1")
eval_lr3
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 12 8
## 1 1 21
##
## Accuracy : 0.7857
## 95% CI : (0.6319, 0.897)
## No Information Rate : 0.6905
## P-Value [Acc > NIR] : 0.1194
##
## Kappa : 0.5635
##
## Mcnemar's Test P-Value : 0.0455
##
## Sensitivity : 0.7241
## Specificity : 0.9231
## Pos Pred Value : 0.9545
## Neg Pred Value : 0.6000
## Prevalence : 0.6905
## Detection Rate : 0.5000
## Detection Prevalence : 0.5238
## Balanced Accuracy : 0.8236
##
## 'Positive' Class : 1
##
Judging from the evaluation at threshold 0.75, accuracy dropped by about 7 percentage points and Sensitivity by about 14 points, while the Pos Pred Value actually increased by about 3 points. Furthermore, the number of students predicted to be hired who were actually not hired dropped from 2 to 1, but the number of students predicted not to be hired who were actually hired doubled from 4 to 8.
K-NN, or K nearest neighbors, is the second approach to predict whether a student will be hired or not. Unfortunately, a K-NN model can only use numeric predictors. Because of that the result might not be as good as the Logistic Regression model, but with the K-NN approach the university can predict whether a student will be hired based on the students' test scores alone.
hire_knn<- hire_clean %>%
dplyr::select(ssc_p, hsc_p, degree_p, etest_p,mba_p,status)
glimpse(hire_knn)
## Rows: 215
## Columns: 6
## $ ssc_p <dbl> 67.00, 79.33, 65.00, 56.00, 85.80, 55.00, 46.00, 82.00, 73.00~
## $ hsc_p <dbl> 91.00, 78.33, 68.00, 52.00, 73.60, 49.80, 49.20, 64.00, 79.00~
## $ degree_p <dbl> 58.00, 77.48, 64.00, 52.00, 73.30, 67.25, 79.00, 66.00, 72.00~
## $ etest_p <dbl> 55.00, 86.50, 75.00, 66.00, 96.80, 55.00, 74.28, 67.00, 91.34~
## $ mba_p <dbl> 58.80, 66.28, 57.80, 59.43, 55.50, 51.58, 53.29, 62.14, 61.29~
## $ status <fct> 1, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1~
prop.table(table(hire_knn$status))
##
## 0 1
## 0.3116279 0.6883721
set.seed(123)
hire_knn_us <- upSample(x = hire_knn %>% dplyr::select(-status),
y = hire_knn$status,
yname = "status")
prop.table(table(hire_knn_us$status))
##
## 0 1
## 0.5 0.5
table(hire_knn_us$status)
##
## 0 1
## 148 148
init <- initial_split(data = hire_knn_us,prop = 0.8,strata = status )
hire_knn_train <- training(init)
hire_knn_test <- testing(init)
prop.table(table(hire_knn_train$status))
##
## 0 1
## 0.5 0.5
prop.table(table(hire_knn_test$status))
##
## 0 1
## 0.5 0.5
For k-NN, the data needs to be separated into predictors and labels (the target variable).
hire_knn_train_x <- hire_knn_train %>%
dplyr::select(-status)
hire_knn_test_x <- hire_knn_test %>%
dplyr::select(-status)
# target
hire_knn_train_y <- hire_knn_train %>%
dplyr::select(status)
hire_knn_test_y <- hire_knn_test %>%
dplyr::select(status)
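Since k-NN is distance-based, predictors on very different scales are usually standardized first. Here all five predictors are percentages on a comparable 0-100 scale, so that step is skipped, but a minimal sketch (not used in the results below) would look like this:
# Optional: z-score the predictors using the training set's center and spread
hire_knn_train_xs <- scale(hire_knn_train_x)
hire_knn_test_xs  <- scale(hire_knn_test_x,
                           center = attr(hire_knn_train_xs, "scaled:center"),
                           scale  = attr(hire_knn_train_xs, "scaled:scale"))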
Unlike Logistic Regression, k-NN does not build an explicit model, so it goes straight to prediction.
k-NN needs a K value, which indicates how many nearest neighbors to use for the vote. A common rule of thumb for choosing K is the square root of N, where N is the number of observations in the training set; in R this is simply sqrt(N).
sqrt(nrow(hire_knn_train_x))
## [1] 15.42725
After obtaining the K value, pass it to knn() as the number of nearest neighbors. Since sqrt(N) is about 15.4, round it to 15 (an odd K also helps avoid ties in the majority vote).
hire_knn_pred <- knn(train = hire_knn_train_x, test = hire_knn_test_x, cl = as.factor(hire_knn_train_y$status), k = 15)
head(hire_knn_pred)
## [1] 0 0 0 0 1 1
## Levels: 0 1
eval_knn <- confusionMatrix(data = hire_knn_pred, reference = as.factor(hire_knn_test_y$status), positive = "1")
eval_knn
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 20 4
## 1 9 25
##
## Accuracy : 0.7759
## 95% CI : (0.6473, 0.8749)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : 1.506e-05
##
## Kappa : 0.5517
##
## Mcnemar's Test P-Value : 0.2673
##
## Sensitivity : 0.8621
## Specificity : 0.6897
## Pos Pred Value : 0.7353
## Neg Pred Value : 0.8333
## Prevalence : 0.5000
## Detection Rate : 0.4310
## Detection Prevalence : 0.5862
## Balanced Accuracy : 0.7759
##
## 'Positive' Class : 1
##
# Collect the evaluation metrics of both models for comparison
compare_log <- tibble(Model = "Logistic Regression",
                      Accuracy  = round((eval_lr$overall[1] * 100), 2),
                      Recall    = round((eval_lr$byClass[1] * 100), 2),
                      Precision = round((eval_lr$byClass[3] * 100), 2))
compare_knn <- tibble(Model = "K-NN",
                      Accuracy  = round((eval_knn$overall[1] * 100), 2),
                      Recall    = round((eval_knn$byClass[1] * 100), 2),
                      Precision = round((eval_knn$byClass[3] * 100), 2))
rbind(compare_log, compare_knn)
Judging by the confusion matrix results, the model's ability to correctly predict whether a student will be hired is better with Logistic Regression, since its overall results are higher than K-NN's.
1. From the comparison of the confusion matrix results between Logistic Regression and K-NN, the Logistic Regression model gives better Accuracy and Precision scores.
The Logistic Regression model can outperform K-NN because in this case the data set contains not only numeric variables but categorical ones as well. As mentioned previously, K-NN can only use numeric predictors, so Logistic Regression has an advantage: it can use both numeric and categorical variables as predictors.
Furthermore, the main goal is to predict whether a student will be hired or not. For the university's data scientist it is better to accept a lower recall than a lower precision, because with high precision the university avoids giving false hope to students who will not actually be hired. Both metrics can be recomputed from the confusion matrix, as the check below shows.
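As a sanity check, precision and recall can be recomputed by hand from the counts in the Logistic Regression confusion matrix at threshold 0.5 (TP = 25, FP = 2, FN = 4):
# Precision = TP / (TP + FP), Recall = TP / (TP + FN), counts taken from eval_lr
TP <- 25; FP <- 2; FN <- 4
c(precision = TP / (TP + FP),   # 25/27 = 0.926, the Pos Pred Value
  recall    = TP / (TP + FN))   # 25/29 = 0.862, the Sensitivity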
2. Based on the stepAIC() results, the columns ssc_p, hsc_p, degree_p, degree_t, workex and mba_p are the most influential for predicting whether a student will be hired, since the combination that keeps them (together with specialisation) gives the lowest AIC.
summary(model_new)
##
## Call:
## glm(formula = status ~ ssc_p + hsc_p + degree_p + degree_t +
## workex + mba_p, family = "binomial", data = hire_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.59796 -0.12374 0.06724 0.30739 2.37830
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -18.29822 4.84192 -3.779 0.000157 ***
## ssc_p 0.22027 0.04843 4.548 5.42e-06 ***
## hsc_p 0.11040 0.03846 2.871 0.004097 **
## degree_p 0.20486 0.06294 3.255 0.001135 **
## degree_tOthers -0.86652 1.35390 -0.640 0.522161
## degree_tSci&Tech -1.98134 0.77251 -2.565 0.010324 *
## workexYes 2.68314 0.83617 3.209 0.001333 **
## mba_p -0.25277 0.06016 -4.202 2.65e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 214.80 on 172 degrees of freedom
## Residual deviance: 76.13 on 165 degrees of freedom
## AIC: 92.13
##
## Number of Fisher Scoring iterations: 7
From model_new it can be seen that students who have work experience have a higher chance of being hired than students with no work experience at all. To find out how many times the workexYes coefficient increases the odds, pass its estimate into the function exp().
exp(2.68314)
## [1] 14.63096
The result implies that students with work experience have roughly 14 times higher odds of being hired than students without any work experience. Therefore, in education, students are encouraged to do an internship at least once, in order to gain experience of working in the real world.
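The same interpretation extends to every coefficient: exponentiating the whole coefficient vector of model_new gives each predictor's odds ratio in one call.
# Odds ratios for all predictors in model_new
# (values above 1 increase the odds of being placed, values below 1 decrease them)
exp(coef(model_new))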
3. Viewed through the stepAIC() selection, the employability test percentage (etest_p) is dropped, but the academic percentages ssc_p, hsc_p, degree_p and mba_p are all kept, so overall marks do play a role in whether a student gets hired. K-NN, which only uses the percentage scores as its prediction parameters, also produces a fairly good model, although not as good as the Logistic Regression model.
In the real world it is common for HR to look at grade percentages, especially for fresh graduates, as a consideration for hiring or not. So even though the Logistic Regression model performs well without etest_p as a predictor, it cannot be concluded that grade percentages do not affect whether a student will get a job.
4. Out of the 148 students who were hired, 95 of them came from the Marketing & Finance specialisation.
knitr::include_graphics("capture.png")