The goal of this research is to predict whether or not students will continue their studies toward a college degree. The analysis uses several predictors to arrive at that goal.
df <- read.csv("data.csv")
library(dplyr)
glimpse(df)
## Rows: 1,000
## Columns: 11
## $ type_school <chr> "Academic", "Academic", "Academic", "Vocational"…
## $ school_accreditation <chr> "A", "A", "B", "B", "A", "B", "A", "B", "B", "B"…
## $ gender <chr> "Male", "Male", "Female", "Male", "Female", "Fem…
## $ interest <chr> "Less Interested", "Less Interested", "Very Inte…
## $ residence <chr> "Urban", "Urban", "Urban", "Rural", "Urban", "Ru…
## $ parent_age <int> 56, 57, 50, 49, 57, 48, 52, 53, 52, 47, 57, 48, …
## $ parent_salary <int> 6950000, 4410000, 6500000, 6600000, 5250000, 377…
## $ house_area <dbl> 83.0, 76.8, 80.6, 78.2, 75.1, 65.3, 85.5, 83.3, …
## $ average_grades <dbl> 84.09, 86.91, 87.43, 82.12, 86.79, 86.79, 90.39,…
## $ parent_was_in_college <chr> "False", "False", "False", "True", "False", "Tru…
## $ will_go_to_college <chr> "True", "True", "True", "True", "False", "False"…
Next, convert the data to appropriate data types and values. We will change the parent_age and will_go_to_college variables into factors (will_go_to_college is first recoded from "True"/"False" to 1/0), because the value of each observation only identifies a category. In other words, an increase in the numeric value SHOULD NOT affect the prediction, because it does not represent an increase in the attribute being predicted.
df$will_go_to_college <- as.integer(as.logical(df$will_go_to_college))
df_clean <-
df |>
mutate_if(is.character, as.factor) |>
mutate(
parent_age = as.factor(parent_age),
will_go_to_college = as.factor(will_go_to_college)
)
df_clean |> head()
## type_school school_accreditation gender interest residence parent_age
## 1 Academic A Male Less Interested Urban 56
## 2 Academic A Male Less Interested Urban 57
## 3 Academic B Female Very Interested Urban 50
## 4 Vocational B Male Very Interested Rural 49
## 5 Academic A Female Very Interested Urban 57
## 6 Vocational B Female Less Interested Rural 48
## parent_salary house_area average_grades parent_was_in_college
## 1 6950000 83.0 84.09 False
## 2 4410000 76.8 86.91 False
## 3 6500000 80.6 87.43 False
## 4 6600000 78.2 82.12 True
## 5 5250000 75.1 86.79 False
## 6 3770000 65.3 86.79 True
## will_go_to_college
## 1 1
## 2 1
## 3 1
## 4 1
## 5 0
## 6 0
Checking whether there are any missing values
anyNA(df_clean)
## [1] FALSE
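If anyNA() had returned TRUE, a per-column count would show where the gaps are; a quick sketch using only base R:

# number of missing values in each column of df_clean
colSums(is.na(df_clean))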
There are no missing values in our data, so let's analyze the data range.
Analyzing the data range
summary(df_clean)
## type_school school_accreditation gender interest
## Academic :609 A:481 Female:485 Interested :100
## Vocational:391 B:519 Male :515 Less Interested:229
## Not Interested : 86
## Uncertain :261
## Very Interested:324
##
##
## residence parent_age parent_salary house_area average_grades
## Rural:461 52 :144 Min. : 1000000 Min. : 20.00 Min. :75.00
## Urban:539 53 :126 1st Qu.: 4360000 1st Qu.: 64.60 1st Qu.:83.74
## 54 :110 Median : 5440000 Median : 75.50 Median :85.58
## 51 : 94 Mean : 5381570 Mean : 74.52 Mean :86.10
## 50 : 93 3rd Qu.: 6382500 3rd Qu.: 84.83 3rd Qu.:88.26
## 55 : 82 Max. :10000000 Max. :120.00 Max. :98.00
## (Other):351
## parent_was_in_college will_go_to_college
## False:480 0:500
## True :520 1:500
##
##
##
##
##
It is clear that the Min., Max., and Mean values are reasonably balanced, give or take.
Separating data into train and test data
RNGkind(sample.kind = "Rounding")
## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
set.seed(77)
index <- sample(nrow(df_clean), nrow(df_clean)*0.8)
df_train <- df_clean[index,]
df_test <- df_clean[-index, ]
The model will be built using the Logistic Regression method.
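Before fitting anything, a quick sanity check (a small sketch reusing only the objects created above) confirms that the 80/20 split keeps the two classes roughly balanced in both subsets:

# class proportions in the training and test sets
prop.table(table(df_train$will_go_to_college))
prop.table(table(df_test$will_go_to_college))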
model_df <- glm(will_go_to_college ~.,
df_train,
family = "binomial")
summary(model_df)
##
## Call:
## glm(formula = will_go_to_college ~ ., family = "binomial", data = df_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.5874 -0.3681 0.0693 0.4412 3.4661
##
## Coefficients:
## Estimate Std. Error z value
## (Intercept) -53.9775703058 2714.6259998454 -0.020
## type_schoolVocational 0.2099041003 0.2651536826 0.792
## school_accreditationB -0.7529066366 0.2981696416 -2.525
## genderMale -0.2517013527 0.2384513534 -1.056
## interestLess Interested 0.9714331541 0.5984085232 1.623
## interestNot Interested 2.8856198234 0.6840458177 4.218
## interestUncertain 2.1889939188 0.6073957216 3.604
## interestVery Interested 1.7048333759 0.5939576142 2.870
## residenceUrban 3.5755653723 0.4394804166 8.136
## parent_age41 10.7124969378 2714.6968092055 0.004
## parent_age42 7.1105721822 2714.6246761944 0.003
## parent_age43 -6.6877044729 3042.8075456593 -0.002
## parent_age44 -6.2903409903 2955.9662354725 -0.002
## parent_age45 9.4131874445 2714.6237245058 0.003
## parent_age46 8.4225169989 2714.6236785859 0.003
## parent_age47 7.3269759784 2714.6236539005 0.003
## parent_age48 8.7547419558 2714.6236154623 0.003
## parent_age49 9.0627232177 2714.6235914337 0.003
## parent_age50 9.0688591715 2714.6235985027 0.003
## parent_age51 9.1539140768 2714.6235940064 0.003
## parent_age52 9.4976482823 2714.6235801824 0.003
## parent_age53 8.4728378663 2714.6236089151 0.003
## parent_age54 9.1656124263 2714.6236040726 0.003
## parent_age55 9.0358504208 2714.6236193481 0.003
## parent_age56 8.3590044425 2714.6236374659 0.003
## parent_age57 7.9185791517 2714.6236542702 0.003
## parent_age58 8.7671319289 2714.6236665527 0.003
## parent_age59 9.7478935673 2714.6236772235 0.004
## parent_age60 8.5327286803 2714.6237830794 0.003
## parent_age61 9.5249981802 2714.6240154566 0.004
## parent_age62 -6.5202107096 4797.9728874104 -0.001
## parent_age64 25.8778323036 4797.9729291751 0.005
## parent_age65 -7.3053691106 4797.9729054866 -0.002
## parent_salary 0.0000012197 0.0000001216 10.034
## house_area 0.1096628704 0.0120197821 9.124
## average_grades 0.3107906763 0.0427215753 7.275
## parent_was_in_collegeTrue 0.7354255897 0.3909757801 1.881
## Pr(>|z|)
## (Intercept) 0.984136
## type_schoolVocational 0.428575
## school_accreditationB 0.011567 *
## genderMale 0.291166
## interestLess Interested 0.104512
## interestNot Interested 0.000024597675126773 ***
## interestUncertain 0.000313 ***
## interestVery Interested 0.004101 **
## residenceUrban 0.000000000000000409 ***
## parent_age41 0.996851
## parent_age42 0.997910
## parent_age43 0.998246
## parent_age44 0.998302
## parent_age45 0.997233
## parent_age46 0.997524
## parent_age47 0.997846
## parent_age48 0.997427
## parent_age49 0.997336
## parent_age50 0.997334
## parent_age51 0.997309
## parent_age52 0.997208
## parent_age53 0.997510
## parent_age54 0.997306
## parent_age55 0.997344
## parent_age56 0.997543
## parent_age57 0.997673
## parent_age58 0.997423
## parent_age59 0.997135
## parent_age60 0.997492
## parent_age61 0.997200
## parent_age62 0.998916
## parent_age64 0.995697
## parent_age65 0.998785
## parent_salary < 0.0000000000000002 ***
## house_area < 0.0000000000000002 ***
## average_grades 0.000000000000346948 ***
## parent_was_in_collegeTrue 0.059972 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1108.63 on 799 degrees of freedom
## Residual deviance: 500.03 on 763 degrees of freedom
## AIC: 574.03
##
## Number of Fisher Scoring iterations: 16
Scrutinizing the data from a business perspective, every variable in df looks useful for predicting whether or not a student will continue to college. However, looking at the summary of model_df, it is pretty clear that some variables contribute more to predicting the target variable: interest, residence, parent_salary, house_area, and average_grades. Hence, the feature-selection model will use those variables as predictors.
Build the model based on feature selection
model_fs <-
glm(
will_go_to_college ~ interest + residence + parent_salary + house_area + average_grades,
df_train,
family = "binomial"
)
summary(model_fs)
##
## Call:
## glm(formula = will_go_to_college ~ interest + residence + parent_salary +
## house_area + average_grades, family = "binomial", data = df_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4540 -0.4209 0.0804 0.4900 3.2214
##
## Coefficients:
## Estimate Std. Error z value
## (Intercept) -44.1503284866 3.8016578929 -11.613
## interestLess Interested 0.7840403305 0.5371460732 1.460
## interestNot Interested 2.8304167706 0.6283981771 4.504
## interestUncertain 2.0999088366 0.5463231032 3.844
## interestVery Interested 1.7103143594 0.5323297110 3.213
## residenceUrban 3.1581025613 0.3174124274 9.950
## parent_salary 0.0000012063 0.0000001107 10.893
## house_area 0.1038193427 0.0108671675 9.553
## average_grades 0.3087706545 0.0391663745 7.884
## Pr(>|z|)
## (Intercept) < 0.0000000000000002 ***
## interestLess Interested 0.144389
## interestNot Interested 0.00000666305784958 ***
## interestUncertain 0.000121 ***
## interestVery Interested 0.001314 **
## residenceUrban < 0.0000000000000002 ***
## parent_salary < 0.0000000000000002 ***
## house_area < 0.0000000000000002 ***
## average_grades 0.00000000000000318 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1108.63 on 799 degrees of freedom
## Residual deviance: 535.96 on 791 degrees of freedom
## AIC: 553.96
##
## Number of Fisher Scoring iterations: 6
Xsqr <- 1108.63 - 535.96
Xsqr
## [1] 572.67
How good is the model? The drop in deviance relative to the null deviance gives a rough measure of how much of the variation the model explains:
Xsqr/1108.63
## [1] 0.5165565
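The ratio above is essentially a McFadden-style pseudo-R². A small sketch, using only the components already stored in the glm object, computes the same number directly from model_fs and tests whether the deviance drop is significant:

# pseudo-R^2: share of the null deviance explained by model_fs
with(model_fs, (null.deviance - deviance) / null.deviance)

# p-value of the deviance drop against a chi-square distribution,
# with df equal to the difference between null and residual degrees of freedom
with(model_fs, pchisq(null.deviance - deviance,
                      df = df.null - df.residual,
                      lower.tail = FALSE))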
Build the model based on backward step-wise selection
model_step <-
step(
object = model_df,
direction = "backward",
trace = F
)
summary(model_step)
##
## Call:
## glm(formula = will_go_to_college ~ school_accreditation + interest +
## residence + parent_salary + house_area + average_grades +
## parent_was_in_college, family = "binomial", data = df_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4875 -0.4147 0.0738 0.4835 3.3366
##
## Coefficients:
## Estimate Std. Error z value
## (Intercept) -44.1983447454 3.8303682064 -11.539
## school_accreditationB -0.7007825257 0.2772069209 -2.528
## interestLess Interested 0.8440487637 0.5671129345 1.488
## interestNot Interested 2.8517705822 0.6592236912 4.326
## interestUncertain 2.1229845774 0.5782737514 3.671
## interestVery Interested 1.7222057002 0.5611346240 3.069
## residenceUrban 3.4188453590 0.3832184590 8.921
## parent_salary 0.0000011964 0.0000001112 10.758
## house_area 0.1070554563 0.0111832544 9.573
## average_grades 0.3049394176 0.0395634667 7.708
## parent_was_in_collegeTrue 0.7232668227 0.3585379455 2.017
## Pr(>|z|)
## (Intercept) < 0.0000000000000002 ***
## school_accreditationB 0.011471 *
## interestLess Interested 0.136665
## interestNot Interested 0.0000151873933821 ***
## interestUncertain 0.000241 ***
## interestVery Interested 0.002147 **
## residenceUrban < 0.0000000000000002 ***
## parent_salary < 0.0000000000000002 ***
## house_area < 0.0000000000000002 ***
## average_grades 0.0000000000000128 ***
## parent_was_in_collegeTrue 0.043668 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1108.63 on 799 degrees of freedom
## Residual deviance: 528.46 on 789 degrees of freedom
## AIC: 550.46
##
## Number of Fisher Scoring iterations: 6
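Before assessing the overall fit, the coefficients of model_step can be translated from the log-odds scale into odds ratios, which are easier to read from a business perspective; a minimal sketch:

# exponentiating the coefficients gives odds ratios:
# a value above 1 means the predictor raises the odds of going to college
round(exp(coef(model_step)), 4)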
Xsqr <- 1108.63 - 528.46
Xsqr
## [1] 580.17
How good is the model?
Xsqr/1108.63
## [1] 0.5233216
It is pretty clear that model_step has a lower residual deviance than model_fs. The goodness-of-fit computation also confirms this: model_step explains a slightly larger share of the null deviance, which means it is performing better.
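Since the predictors of model_fs are a subset of those in model_step, the two models are nested and can also be compared formally; a minimal sketch using base R only:

# likelihood-ratio test: does model_step improve significantly on model_fs?
anova(model_fs, model_step, test = "Chisq")

# AIC comparison (lower is better); matches the values reported in the summaries
AIC(model_fs, model_step)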
df_test$predict_result <- predict(model_step, df_test, type="response")
df_test$predict_go <- ifelse(df_test$predict_result > 0.5, "1", "0")
df_test$predict_go <- as.factor(df_test$predict_go)
df_test %>% select(will_go_to_college, predict_go)
## will_go_to_college predict_go
## 7 1 1
## 22 0 0
## 27 0 0
## 29 0 1
## 34 0 0
## 36 1 1
## 37 1 1
## 46 0 0
## 50 0 1
## 51 1 1
## 57 1 1
## 60 1 1
## 63 1 0
## 65 0 0
## 80 1 1
## 82 0 1
## 88 1 1
## 91 0 0
## 97 0 0
## 101 0 0
## 103 1 1
## 108 1 1
## 117 0 0
## 119 1 0
## 121 0 0
## 126 0 0
## 127 1 1
## 129 1 1
## 132 0 0
## 138 0 0
## 140 0 0
## 144 1 1
## 154 1 1
## 165 1 1
## 170 0 0
## 175 0 0
## 177 1 1
## 181 0 0
## 184 1 1
## 189 1 1
## 194 0 1
## 195 0 0
## 197 1 0
## 202 0 1
## 207 0 0
## 209 1 1
## 225 1 1
## 226 1 1
## 228 1 1
## 229 1 1
## 232 1 1
## 239 1 1
## 241 1 1
## 243 0 1
## 244 0 0
## 245 1 1
## 252 0 0
## 254 1 1
## 255 0 1
## 256 0 0
## 276 0 0
## 278 1 1
## 284 0 0
## 287 0 0
## 298 0 0
## 308 0 0
## 312 0 0
## 317 0 0
## 326 0 0
## 331 0 0
## 348 1 1
## 350 1 1
## 369 0 0
## 370 1 1
## 377 0 0
## 378 0 1
## 388 1 1
## 402 0 0
## 403 1 1
## 404 0 0
## 407 1 1
## 409 0 1
## 412 1 1
## 415 0 0
## 416 0 0
## 422 0 0
## 425 1 0
## 429 0 0
## 433 0 0
## 465 1 1
## 466 1 0
## 471 1 1
## 472 1 1
## 479 0 0
## 488 0 0
## 495 1 1
## 498 0 0
## 502 0 1
## 507 0 0
## 517 0 0
## 518 0 0
## 521 0 0
## 525 1 1
## 526 0 0
## 537 0 0
## 544 1 1
## 553 1 1
## 559 0 0
## 573 0 0
## 576 1 1
## 581 0 0
## 585 1 1
## 597 1 1
## 602 1 1
## 604 1 1
## 609 0 0
## 610 0 0
## 611 0 0
## 614 0 0
## 618 1 1
## 619 1 1
## 621 1 1
## 624 0 0
## 630 1 1
## 636 1 1
## 644 0 0
## 645 0 0
## 650 1 0
## 652 0 0
## 659 1 1
## 669 0 0
## 672 0 0
## 693 0 1
## 698 1 1
## 700 0 0
## 702 1 1
## 707 0 1
## 709 1 1
## 710 1 1
## 715 1 1
## 724 1 1
## 728 0 0
## 733 0 0
## 734 1 0
## 739 1 1
## 750 0 0
## 758 1 1
## 760 0 0
## 761 1 1
## 764 1 1
## 766 1 1
## 767 1 1
## 773 0 1
## 778 1 1
## 781 0 0
## 790 0 0
## 791 1 1
## 793 0 0
## 797 0 0
## 802 1 1
## 803 0 0
## 807 1 1
## 811 0 0
## 812 0 0
## 813 0 0
## 814 0 0
## 819 1 1
## 826 1 1
## 829 0 0
## 833 0 0
## 839 0 0
## 856 0 0
## 863 0 0
## 872 1 1
## 877 0 0
## 882 0 0
## 883 1 1
## 885 0 0
## 890 0 0
## 895 0 0
## 898 1 0
## 904 0 0
## 910 1 0
## 923 1 1
## 926 1 1
## 927 1 1
## 952 0 0
## 962 0 0
## 966 0 0
## 971 1 1
## 972 1 1
## 977 0 0
## 980 0 0
## 984 0 0
## 985 0 0
## 987 0 0
## 991 1 1
## 994 0 0
## 997 1 1
## 999 1 1
library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
confusionMatrix(data = df_test$predict_go,
reference = df_test$will_go_to_college,
positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 96 9
## 1 13 82
##
## Accuracy : 0.89
## 95% CI : (0.8382, 0.9298)
## No Information Rate : 0.545
## P-Value [Acc > NIR] : <0.0000000000000002
##
## Kappa : 0.779
##
## Mcnemar's Test P-Value : 0.5224
##
## Sensitivity : 0.9011
## Specificity : 0.8807
## Pos Pred Value : 0.8632
## Neg Pred Value : 0.9143
## Prevalence : 0.4550
## Detection Rate : 0.4100
## Detection Prevalence : 0.4750
## Balanced Accuracy : 0.8909
##
## 'Positive' Class : 1
##
TN = predicted will not go to college, and this is indeed the case
FN = predicted will not go to college, but the student actually does go
TP = predicted will go to college, and this is indeed the case
FP = predicted will go to college, but the student actually does not go
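To make these definitions concrete, the four cells can be pulled out of the caret result directly; a small sketch (storing the confusionMatrix object, which the call above only printed):

# store the confusion matrix so the individual cells can be inspected
cm <- confusionMatrix(data = df_test$predict_go,
                      reference = df_test$will_go_to_college,
                      positive = "1")
cm$table

# Sensitivity computed by hand: TP / (TP + FN)
cm$table["1", "1"] / sum(cm$table[, "1"])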
To improve the model, this approach will aim to minimize the FN, since the business needs to focus on the class that will not go to college (the 0s): identifying those students correctly makes it possible to approach them and explain why they should go to college. Hence, we will move the threshold closer to 0.
df_test$predict_go <- ifelse(df_test$predict_result > 0.4, "1", "0")
df_test$predict_go <- as.factor(df_test$predict_go)
confusionMatrix(data = df_test$predict_go,
reference = df_test$will_go_to_college,
positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 90 8
## 1 19 83
##
## Accuracy : 0.865
## 95% CI : (0.8097, 0.9091)
## No Information Rate : 0.545
## P-Value [Acc > NIR] : < 0.0000000000000002
##
## Kappa : 0.7305
##
## Mcnemar's Test P-Value : 0.05429
##
## Sensitivity : 0.9121
## Specificity : 0.8257
## Pos Pred Value : 0.8137
## Neg Pred Value : 0.9184
## Prevalence : 0.4550
## Detection Rate : 0.4150
## Detection Prevalence : 0.5100
## Balanced Accuracy : 0.8689
##
## 'Positive' Class : 1
##
In this case we will focus on the Sensitivity metric, since we care about the 1 class, rather than on the Accuracy metric. Using Sensitivity, moving the threshold closer to 0, from 0.5 to 0.4, improves the value by 1.1 percentage points, from 90.11% to 91.21%. The model also yields fewer FN (the value in the top-right cell of the confusion matrix), dropping from 9 to 8.
In an extreme case, we can lower the threshold to 0.2, which yields an even better Sensitivity value, as shown in the example below:
df_test$predict_go <- ifelse(df_test$predict_result > 0.2, "1", "0")
df_test$predict_go <- as.factor(df_test$predict_go)
confusionMatrix(data = df_test$predict_go,
reference = df_test$will_go_to_college,
positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 76 1
## 1 33 90
##
## Accuracy : 0.83
## 95% CI : (0.7706, 0.8793)
## No Information Rate : 0.545
## P-Value [Acc > NIR] : < 0.00000000000000022
##
## Kappa : 0.6669
##
## Mcnemar's Test P-Value : 0.0000001058
##
## Sensitivity : 0.9890
## Specificity : 0.6972
## Pos Pred Value : 0.7317
## Neg Pred Value : 0.9870
## Prevalence : 0.4550
## Detection Rate : 0.4500
## Detection Prevalence : 0.6150
## Balanced Accuracy : 0.8431
##
## 'Positive' Class : 1
##
As we can see, the Sensitivity value reaches 98.9% with only 1 FN left. However, this approach is not recommended, as it amounts to pushing the threshold too hard: the gain in Sensitivity comes at the cost of Specificity, which drops to 69.72%.
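Rather than trying thresholds one by one, the whole trade-off can be inspected in a single sweep. A rough sketch using only the columns already added to df_test (the threshold grid is illustrative):

# sensitivity and specificity of model_step across a grid of thresholds
thresholds <- seq(0.1, 0.9, by = 0.1)
sweep_metrics <- t(sapply(thresholds, function(th) {
  pred <- factor(ifelse(df_test$predict_result > th, "1", "0"), levels = c("0", "1"))
  cm   <- table(Prediction = pred, Reference = df_test$will_go_to_college)
  c(threshold   = th,
    sensitivity = cm["1", "1"] / sum(cm[, "1"]),   # TP / (TP + FN)
    specificity = cm["0", "0"] / sum(cm[, "0"]))   # TN / (TN + FP)
}))
sweep_metrics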
We have seen the Logistic Regression model; now let's move on to k-NN.
index <- sample(nrow(df), nrow(df)*0.8)
df_train2 <- df[index,]
df_test2 <- df[-index,]
df_train_pred <- df_train2 |> select_if(is.numeric)
df_test_pred <- df_test2 |> select_if(is.numeric)
df_train_tar <- df_train2[,"will_go_to_college"]
df_test_tar <- df_test2[, "will_go_to_college"]
Scaling the train & test data
df_train_pred_s <- scale(df_train_pred)
df_test_pred_s <- scale(
df_test_pred,
center = attr(df_train_pred_s, "scaled:center"),
scale = attr(df_train_pred_s, "scaled:scale")
)
Find the optimum k
sqrt(nrow(df_train_pred_s))
## [1] 28.28427
Since the target variable has 2 classes, let's set k to an odd number.
kx <- ceiling(sqrt(nrow(df_train_pred_s)))
kx
## [1] 29
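Note that ceiling() happens to land on an odd number here, but it does not guarantee one. A small sketch of rounding sqrt(n) down and bumping it to the nearest odd integer, so a 2-class vote can never tie:

# force k to be odd so majority votes between the 2 classes cannot tie
k_raw <- floor(sqrt(nrow(df_train_pred_s)))
kx <- if (k_raw %% 2 == 0) k_raw + 1 else k_raw
kx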
library(class) # package providing the knn() function
pred_knn <- knn(train = df_train_pred_s,
test = df_test_pred_s,
cl = df_train_tar,
k = kx)
confusionMatrix(data = as.factor(pred_knn),
reference = as.factor(df_test_tar),
positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 103 0
## 1 0 97
##
## Accuracy : 1
## 95% CI : (0.9817, 1)
## No Information Rate : 0.515
## P-Value [Acc > NIR] : < 0.00000000000000022
##
## Kappa : 1
##
## Mcnemar's Test P-Value : NA
##
## Sensitivity : 1.000
## Specificity : 1.000
## Pos Pred Value : 1.000
## Neg Pred Value : 1.000
## Prevalence : 0.485
## Detection Rate : 0.485
## Detection Prevalence : 0.485
## Balanced Accuracy : 1.000
##
## 'Positive' Class : 1
##
As we can see from the Confusion Matrix above, the predictions are perfect: the Sensitivity value is 1.000 and not a single observation is misclassified, since there are no FN or FP at all. However, such a perfect result is suspect even though we did not use k equal to one (k = 1). Recall that will_go_to_college was converted to an integer earlier, so select_if(is.numeric) includes the target itself among the numeric predictors; the flawless score is therefore most likely caused by target leakage rather than by a genuinely strong (if over-fit) k-NN model.
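Under that assumption, the leak can be checked by rebuilding the predictor matrices without the target column and re-running k-NN. A minimal sketch reusing the objects and packages already loaded above (the explicit drop of will_go_to_college is the assumed intent, not part of the original code):

# numeric predictors only, with the target column explicitly dropped
df_train_pred2 <- df_train2 |> select(where(is.numeric)) |> select(-will_go_to_college)
df_test_pred2  <- df_test2  |> select(where(is.numeric)) |> select(-will_go_to_college)

# rescale with the training-set centre and spread, as before
df_train_pred2_s <- scale(df_train_pred2)
df_test_pred2_s  <- scale(df_test_pred2,
                          center = attr(df_train_pred2_s, "scaled:center"),
                          scale  = attr(df_train_pred2_s, "scaled:scale"))

# k-NN without the leaked target; the accuracy should now look far more realistic
pred_knn2 <- knn(train = df_train_pred2_s,
                 test  = df_test_pred2_s,
                 cl    = df_train_tar,
                 k     = kx)
confusionMatrix(data = as.factor(pred_knn2),
                reference = as.factor(df_test_tar),
                positive = "1")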
Comparing the two methods, k-NN appears to give better predictions than the Logistic Regression model, since it predicts the target perfectly during evaluation. However, that perfect result cannot be trusted. So, to predict whether students will go to college or not, it is better for now to use Logistic Regression instead of k-NN: judging from the confusionMatrix result we assume the k-NN model is over-fitting (or, more precisely, leaking the target), and such a model would predict new data badly because it is only capable of reproducing the data it has already seen. We could, of course, rebuild and re-tune the k-NN model with the target removed from the predictors, as sketched above, and with a different k value.