1 Objective

The goal of this research is to predict whether or not students will continue their studies to a college degree. The data contains several predictors that we will use to arrive at that goal.

2 Import Data

df <- read.csv("data.csv")

3 Data Wrangling

library(dplyr)
glimpse(df)
## Rows: 1,000
## Columns: 11
## $ type_school           <chr> "Academic", "Academic", "Academic", "Vocational"…
## $ school_accreditation  <chr> "A", "A", "B", "B", "A", "B", "A", "B", "B", "B"…
## $ gender                <chr> "Male", "Male", "Female", "Male", "Female", "Fem…
## $ interest              <chr> "Less Interested", "Less Interested", "Very Inte…
## $ residence             <chr> "Urban", "Urban", "Urban", "Rural", "Urban", "Ru…
## $ parent_age            <int> 56, 57, 50, 49, 57, 48, 52, 53, 52, 47, 57, 48, …
## $ parent_salary         <int> 6950000, 4410000, 6500000, 6600000, 5250000, 377…
## $ house_area            <dbl> 83.0, 76.8, 80.6, 78.2, 75.1, 65.3, 85.5, 83.3, …
## $ average_grades        <dbl> 84.09, 86.91, 87.43, 82.12, 86.79, 86.79, 90.39,…
## $ parent_was_in_college <chr> "False", "False", "False", "True", "False", "Tru…
## $ will_go_to_college    <chr> "True", "True", "True", "True", "False", "False"…

Change the columns to appropriate data types and values. We will convert parent_age and will_go_to_college to factors: parent_age is currently stored as an integer and will_go_to_college as a character, but each value is a categorical label in its own right. In other words, an increase in the stored value SHOULD NOT affect the prediction, because it does not represent an increase in the attribute being predicted.

# recode "True"/"False" to 1/0, then store the categorical columns as factors
df$will_go_to_college <- as.integer(as.logical(df$will_go_to_college))
df_clean <- 
  df |> 
  mutate_if(is.character, as.factor) |> 
  mutate(
    parent_age = as.factor(parent_age),
    will_go_to_college = as.factor(will_go_to_college)
  )

df_clean |> head()
##   type_school school_accreditation gender        interest residence parent_age
## 1    Academic                    A   Male Less Interested     Urban         56
## 2    Academic                    A   Male Less Interested     Urban         57
## 3    Academic                    B Female Very Interested     Urban         50
## 4  Vocational                    B   Male Very Interested     Rural         49
## 5    Academic                    A Female Very Interested     Urban         57
## 6  Vocational                    B Female Less Interested     Rural         48
##   parent_salary house_area average_grades parent_was_in_college
## 1       6950000       83.0          84.09                 False
## 2       4410000       76.8          86.91                 False
## 3       6500000       80.6          87.43                 False
## 4       6600000       78.2          82.12                  True
## 5       5250000       75.1          86.79                 False
## 6       3770000       65.3          86.79                  True
##   will_go_to_college
## 1                  1
## 2                  1
## 3                  1
## 4                  1
## 5                  0
## 6                  0

4 EDA

Checking whether there are any missing values

anyNA(df_clean)
## [1] FALSE

There are no missing values in our data, so let's analyze the data range

Analyzing the data range

summary(df_clean)
##      type_school  school_accreditation    gender               interest  
##  Academic  :609   A:481                Female:485   Interested     :100  
##  Vocational:391   B:519                Male  :515   Less Interested:229  
##                                                     Not Interested : 86  
##                                                     Uncertain      :261  
##                                                     Very Interested:324  
##                                                                          
##                                                                          
##  residence     parent_age  parent_salary        house_area     average_grades 
##  Rural:461   52     :144   Min.   : 1000000   Min.   : 20.00   Min.   :75.00  
##  Urban:539   53     :126   1st Qu.: 4360000   1st Qu.: 64.60   1st Qu.:83.74  
##              54     :110   Median : 5440000   Median : 75.50   Median :85.58  
##              51     : 94   Mean   : 5381570   Mean   : 74.52   Mean   :86.10  
##              50     : 93   3rd Qu.: 6382500   3rd Qu.: 84.83   3rd Qu.:88.26  
##              55     : 82   Max.   :10000000   Max.   :120.00   Max.   :98.00  
##              (Other):351                                                      
##  parent_was_in_college will_go_to_college
##  False:480             0:500             
##  True :520             1:500             
##                                          
##                                          
##                                          
##                                          
## 

It is clear that:

  • The target variable is perfectly balanced (500 observations in each class); see the quick check after this list
  • The continuous variables look roughly symmetric, since each Mean sits close to its Median and near the middle of the Min.–Max. range
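
As a quick sanity check of the class balance, for example:

prop.table(table(df_clean$will_go_to_college))
# given the 500/500 split in the summary, this returns 0.5 for each class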

5 Cross Validation

Separating the data into train and test sets

RNGkind(sample.kind = "Rounding")
## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
set.seed(77)

index <- sample(nrow(df_clean), nrow(df_clean)*0.8)
df_train <- df_clean[index,]
df_test <- df_clean[-index,]

6 Logistic Regression Model

6.1 Building The Model

The model will be built using the Logistic Regression method, with all variables as predictors

model_df <- glm(will_go_to_college ~ .,
                data = df_train,
                family = "binomial")
summary(model_df)
## 
## Call:
## glm(formula = will_go_to_college ~ ., family = "binomial", data = df_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.5874  -0.3681   0.0693   0.4412   3.4661  
## 
## Coefficients:
##                                  Estimate      Std. Error z value
## (Intercept)                -53.9775703058 2714.6259998454  -0.020
## type_schoolVocational        0.2099041003    0.2651536826   0.792
## school_accreditationB       -0.7529066366    0.2981696416  -2.525
## genderMale                  -0.2517013527    0.2384513534  -1.056
## interestLess Interested      0.9714331541    0.5984085232   1.623
## interestNot Interested       2.8856198234    0.6840458177   4.218
## interestUncertain            2.1889939188    0.6073957216   3.604
## interestVery Interested      1.7048333759    0.5939576142   2.870
## residenceUrban               3.5755653723    0.4394804166   8.136
## parent_age41                10.7124969378 2714.6968092055   0.004
## parent_age42                 7.1105721822 2714.6246761944   0.003
## parent_age43                -6.6877044729 3042.8075456593  -0.002
## parent_age44                -6.2903409903 2955.9662354725  -0.002
## parent_age45                 9.4131874445 2714.6237245058   0.003
## parent_age46                 8.4225169989 2714.6236785859   0.003
## parent_age47                 7.3269759784 2714.6236539005   0.003
## parent_age48                 8.7547419558 2714.6236154623   0.003
## parent_age49                 9.0627232177 2714.6235914337   0.003
## parent_age50                 9.0688591715 2714.6235985027   0.003
## parent_age51                 9.1539140768 2714.6235940064   0.003
## parent_age52                 9.4976482823 2714.6235801824   0.003
## parent_age53                 8.4728378663 2714.6236089151   0.003
## parent_age54                 9.1656124263 2714.6236040726   0.003
## parent_age55                 9.0358504208 2714.6236193481   0.003
## parent_age56                 8.3590044425 2714.6236374659   0.003
## parent_age57                 7.9185791517 2714.6236542702   0.003
## parent_age58                 8.7671319289 2714.6236665527   0.003
## parent_age59                 9.7478935673 2714.6236772235   0.004
## parent_age60                 8.5327286803 2714.6237830794   0.003
## parent_age61                 9.5249981802 2714.6240154566   0.004
## parent_age62                -6.5202107096 4797.9728874104  -0.001
## parent_age64                25.8778323036 4797.9729291751   0.005
## parent_age65                -7.3053691106 4797.9729054866  -0.002
## parent_salary                0.0000012197    0.0000001216  10.034
## house_area                   0.1096628704    0.0120197821   9.124
## average_grades               0.3107906763    0.0427215753   7.275
## parent_was_in_collegeTrue    0.7354255897    0.3909757801   1.881
##                                       Pr(>|z|)    
## (Intercept)                           0.984136    
## type_schoolVocational                 0.428575    
## school_accreditationB                 0.011567 *  
## genderMale                            0.291166    
## interestLess Interested               0.104512    
## interestNot Interested    0.000024597675126773 ***
## interestUncertain                     0.000313 ***
## interestVery Interested               0.004101 ** 
## residenceUrban            0.000000000000000409 ***
## parent_age41                          0.996851    
## parent_age42                          0.997910    
## parent_age43                          0.998246    
## parent_age44                          0.998302    
## parent_age45                          0.997233    
## parent_age46                          0.997524    
## parent_age47                          0.997846    
## parent_age48                          0.997427    
## parent_age49                          0.997336    
## parent_age50                          0.997334    
## parent_age51                          0.997309    
## parent_age52                          0.997208    
## parent_age53                          0.997510    
## parent_age54                          0.997306    
## parent_age55                          0.997344    
## parent_age56                          0.997543    
## parent_age57                          0.997673    
## parent_age58                          0.997423    
## parent_age59                          0.997135    
## parent_age60                          0.997492    
## parent_age61                          0.997200    
## parent_age62                          0.998916    
## parent_age64                          0.995697    
## parent_age65                          0.998785    
## parent_salary             < 0.0000000000000002 ***
## house_area                < 0.0000000000000002 ***
## average_grades            0.000000000000346948 ***
## parent_was_in_collegeTrue             0.059972 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1108.63  on 799  degrees of freedom
## Residual deviance:  500.03  on 763  degrees of freedom
## AIC: 574.03
## 
## Number of Fisher Scoring iterations: 16

Scrutinizing the data from the business perspective, every variable in df looks potentially useful for predicting whether or not the students will continue to college. However, the summary of model_df makes it pretty clear that some variables contribute far more to predicting the target than others, namely interest, residence, parent_salary, house_area, and average_grades (note also the enormous standard errors on the parent_age dummies, a sign that treating parent age as a factor adds noise rather than signal). Hence, the feature-selection model will use those variables as predictors.

Build the model based on feature selection

model_fs <- 
  glm(
    will_go_to_college ~ interest + residence + parent_salary + house_area + average_grades, 
    df_train,
    family = "binomial"
  )
summary(model_fs)
## 
## Call:
## glm(formula = will_go_to_college ~ interest + residence + parent_salary + 
##     house_area + average_grades, family = "binomial", data = df_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.4540  -0.4209   0.0804   0.4900   3.2214  
## 
## Coefficients:
##                               Estimate     Std. Error z value
## (Intercept)             -44.1503284866   3.8016578929 -11.613
## interestLess Interested   0.7840403305   0.5371460732   1.460
## interestNot Interested    2.8304167706   0.6283981771   4.504
## interestUncertain         2.0999088366   0.5463231032   3.844
## interestVery Interested   1.7103143594   0.5323297110   3.213
## residenceUrban            3.1581025613   0.3174124274   9.950
## parent_salary             0.0000012063   0.0000001107  10.893
## house_area                0.1038193427   0.0108671675   9.553
## average_grades            0.3087706545   0.0391663745   7.884
##                                     Pr(>|z|)    
## (Intercept)             < 0.0000000000000002 ***
## interestLess Interested             0.144389    
## interestNot Interested   0.00000666305784958 ***
## interestUncertain                   0.000121 ***
## interestVery Interested             0.001314 ** 
## residenceUrban          < 0.0000000000000002 ***
## parent_salary           < 0.0000000000000002 ***
## house_area              < 0.0000000000000002 ***
## average_grades           0.00000000000000318 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1108.63  on 799  degrees of freedom
## Residual deviance:  535.96  on 791  degrees of freedom
## AIC: 553.96
## 
## Number of Fisher Scoring iterations: 6
Xsqr <- 1108.63 - 535.96 # explained deviance: null deviance minus residual deviance
Xsqr
## [1] 572.67

How good is the model? Dividing the explained deviance by the null deviance gives a pseudo-R²:

Xsqr/1108.63
## [1] 0.5165565
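
The same figure can be taken straight from the fitted model object, since glm stores both deviances:

# pseudo-R^2: share of the null deviance explained by the model
1 - model_fs$deviance / model_fs$null.deviance
# matches the manual computation above (~0.5166)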

Build the model using backward step-wise selection

model_step <- 
  step(
    object = model_df,
    direction = "backward",
    trace = F
  )
summary(model_step)
## 
## Call:
## glm(formula = will_go_to_college ~ school_accreditation + interest + 
##     residence + parent_salary + house_area + average_grades + 
##     parent_was_in_college, family = "binomial", data = df_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.4875  -0.4147   0.0738   0.4835   3.3366  
## 
## Coefficients:
##                                 Estimate     Std. Error z value
## (Intercept)               -44.1983447454   3.8303682064 -11.539
## school_accreditationB      -0.7007825257   0.2772069209  -2.528
## interestLess Interested     0.8440487637   0.5671129345   1.488
## interestNot Interested      2.8517705822   0.6592236912   4.326
## interestUncertain           2.1229845774   0.5782737514   3.671
## interestVery Interested     1.7222057002   0.5611346240   3.069
## residenceUrban              3.4188453590   0.3832184590   8.921
## parent_salary               0.0000011964   0.0000001112  10.758
## house_area                  0.1070554563   0.0111832544   9.573
## average_grades              0.3049394176   0.0395634667   7.708
## parent_was_in_collegeTrue   0.7232668227   0.3585379455   2.017
##                                       Pr(>|z|)    
## (Intercept)               < 0.0000000000000002 ***
## school_accreditationB                 0.011471 *  
## interestLess Interested               0.136665    
## interestNot Interested      0.0000151873933821 ***
## interestUncertain                     0.000241 ***
## interestVery Interested               0.002147 ** 
## residenceUrban            < 0.0000000000000002 ***
## parent_salary             < 0.0000000000000002 ***
## house_area                < 0.0000000000000002 ***
## average_grades              0.0000000000000128 ***
## parent_was_in_collegeTrue             0.043668 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1108.63  on 799  degrees of freedom
## Residual deviance:  528.46  on 789  degrees of freedom
## AIC: 550.46
## 
## Number of Fisher Scoring iterations: 6
Xsqr <- 1108.63 - 528.46 # explained deviance for model_step
Xsqr
## [1] 580.17

How good is the model? Again, dividing the explained deviance by the null deviance:

Xsqr/1108.63
## [1] 0.5233216

It is pretty clear that model_step has a lower residual deviance than model_fs (528.46 vs. 535.96). The explained-deviance ratio confirms this (0.5233 vs. 0.5166), which means that model_step is performing better; the side-by-side AIC comparison below points the same way.
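
The AICs reported in the two summaries can also be compared directly with base R's AIC():

# lower AIC is better; the summaries report 553.96 (model_fs) vs 550.46 (model_step)
AIC(model_fs, model_step)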

6.2 Evaluating The Model

df_test$predict_result <- predict(model_step, df_test, type="response")
df_test$predict_go <- ifelse(df_test$predict_result > 0.5, "1", "0")
df_test$predict_go <- as.factor(df_test$predict_go)
df_test |> select(will_go_to_college, predict_go)
##     will_go_to_college predict_go
## 7                    1          1
## 22                   0          0
## 27                   0          0
## 29                   0          1
## 34                   0          0
## 36                   1          1
## 37                   1          1
## 46                   0          0
## 50                   0          1
## 51                   1          1
## 57                   1          1
## 60                   1          1
## 63                   1          0
## 65                   0          0
## 80                   1          1
## 82                   0          1
## 88                   1          1
## 91                   0          0
## 97                   0          0
## 101                  0          0
## 103                  1          1
## 108                  1          1
## 117                  0          0
## 119                  1          0
## 121                  0          0
## 126                  0          0
## 127                  1          1
## 129                  1          1
## 132                  0          0
## 138                  0          0
## 140                  0          0
## 144                  1          1
## 154                  1          1
## 165                  1          1
## 170                  0          0
## 175                  0          0
## 177                  1          1
## 181                  0          0
## 184                  1          1
## 189                  1          1
## 194                  0          1
## 195                  0          0
## 197                  1          0
## 202                  0          1
## 207                  0          0
## 209                  1          1
## 225                  1          1
## 226                  1          1
## 228                  1          1
## 229                  1          1
## 232                  1          1
## 239                  1          1
## 241                  1          1
## 243                  0          1
## 244                  0          0
## 245                  1          1
## 252                  0          0
## 254                  1          1
## 255                  0          1
## 256                  0          0
## 276                  0          0
## 278                  1          1
## 284                  0          0
## 287                  0          0
## 298                  0          0
## 308                  0          0
## 312                  0          0
## 317                  0          0
## 326                  0          0
## 331                  0          0
## 348                  1          1
## 350                  1          1
## 369                  0          0
## 370                  1          1
## 377                  0          0
## 378                  0          1
## 388                  1          1
## 402                  0          0
## 403                  1          1
## 404                  0          0
## 407                  1          1
## 409                  0          1
## 412                  1          1
## 415                  0          0
## 416                  0          0
## 422                  0          0
## 425                  1          0
## 429                  0          0
## 433                  0          0
## 465                  1          1
## 466                  1          0
## 471                  1          1
## 472                  1          1
## 479                  0          0
## 488                  0          0
## 495                  1          1
## 498                  0          0
## 502                  0          1
## 507                  0          0
## 517                  0          0
## 518                  0          0
## 521                  0          0
## 525                  1          1
## 526                  0          0
## 537                  0          0
## 544                  1          1
## 553                  1          1
## 559                  0          0
## 573                  0          0
## 576                  1          1
## 581                  0          0
## 585                  1          1
## 597                  1          1
## 602                  1          1
## 604                  1          1
## 609                  0          0
## 610                  0          0
## 611                  0          0
## 614                  0          0
## 618                  1          1
## 619                  1          1
## 621                  1          1
## 624                  0          0
## 630                  1          1
## 636                  1          1
## 644                  0          0
## 645                  0          0
## 650                  1          0
## 652                  0          0
## 659                  1          1
## 669                  0          0
## 672                  0          0
## 693                  0          1
## 698                  1          1
## 700                  0          0
## 702                  1          1
## 707                  0          1
## 709                  1          1
## 710                  1          1
## 715                  1          1
## 724                  1          1
## 728                  0          0
## 733                  0          0
## 734                  1          0
## 739                  1          1
## 750                  0          0
## 758                  1          1
## 760                  0          0
## 761                  1          1
## 764                  1          1
## 766                  1          1
## 767                  1          1
## 773                  0          1
## 778                  1          1
## 781                  0          0
## 790                  0          0
## 791                  1          1
## 793                  0          0
## 797                  0          0
## 802                  1          1
## 803                  0          0
## 807                  1          1
## 811                  0          0
## 812                  0          0
## 813                  0          0
## 814                  0          0
## 819                  1          1
## 826                  1          1
## 829                  0          0
## 833                  0          0
## 839                  0          0
## 856                  0          0
## 863                  0          0
## 872                  1          1
## 877                  0          0
## 882                  0          0
## 883                  1          1
## 885                  0          0
## 890                  0          0
## 895                  0          0
## 898                  1          0
## 904                  0          0
## 910                  1          0
## 923                  1          1
## 926                  1          1
## 927                  1          1
## 952                  0          0
## 962                  0          0
## 966                  0          0
## 971                  1          1
## 972                  1          1
## 977                  0          0
## 980                  0          0
## 984                  0          0
## 985                  0          0
## 987                  0          0
## 991                  1          1
## 994                  0          0
## 997                  1          1
## 999                  1          1
library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
confusionMatrix(data = df_test$predict_go,
                reference = df_test$will_go_to_college,
                positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 96  9
##          1 13 82
##                                              
##                Accuracy : 0.89               
##                  95% CI : (0.8382, 0.9298)   
##     No Information Rate : 0.545              
##     P-Value [Acc > NIR] : <0.0000000000000002
##                                              
##                   Kappa : 0.779              
##                                              
##  Mcnemar's Test P-Value : 0.5224             
##                                              
##             Sensitivity : 0.9011             
##             Specificity : 0.8807             
##          Pos Pred Value : 0.8632             
##          Neg Pred Value : 0.9143             
##              Prevalence : 0.4550             
##          Detection Rate : 0.4100             
##    Detection Prevalence : 0.4750             
##       Balanced Accuracy : 0.8909             
##                                              
##        'Positive' Class : 1                  
## 

Reading the matrix above (Prediction by row, Reference by column):

  • TN = predicted will not go to college, and the fact is likewise (96, top-left)
  • FN = predicted will not go to college, but the fact is otherwise (9, top-right)
  • TP = predicted will go to college, and the fact is likewise (82, bottom-right)
  • FP = predicted will go to college, but the fact is otherwise (13, bottom-left)

6.3 Fine Tuning The Model

To improve the model, this approach will minimize the FN, since the business needs to focus on the class that will not go to college (the 0s): identifying those students correctly makes it possible to approach them and explain why they should go to college. Hence, we will move the threshold closer to 0.

df_test$predict_go <- ifelse(df_test$predict_result > 0.4, "1", "0")
df_test$predict_go <- as.factor(df_test$predict_go)
confusionMatrix(data = df_test$predict_go,
                reference = df_test$will_go_to_college,
                positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 90  8
##          1 19 83
##                                               
##                Accuracy : 0.865               
##                  95% CI : (0.8097, 0.9091)    
##     No Information Rate : 0.545               
##     P-Value [Acc > NIR] : < 0.0000000000000002
##                                               
##                   Kappa : 0.7305              
##                                               
##  Mcnemar's Test P-Value : 0.05429             
##                                               
##             Sensitivity : 0.9121              
##             Specificity : 0.8257              
##          Pos Pred Value : 0.8137              
##          Neg Pred Value : 0.9184              
##              Prevalence : 0.4550              
##          Detection Rate : 0.4150              
##    Detection Prevalence : 0.5100              
##       Balanced Accuracy : 0.8689              
##                                               
##        'Positive' Class : 1                   
## 

In this case we will focus on the Sensitivity metric rather than Accuracy, since we are focusing on the 1 class. As we move the threshold closer to 0, from 0.5 to 0.4, Sensitivity improves by 1.1 percentage points, from 90.11% to 91.21%. The model also yields a better FN count (the top-right cell of the confusion matrix), down from 9 to 8.

In an extreme case, we can lower the threshold to 0.2, which yields an even better Sensitivity value, as shown in the example below:

df_test$predict_go <- ifelse(df_test$predict_result > 0.2, "1", "0")
df_test$predict_go <- as.factor(df_test$predict_go)
confusionMatrix(data = df_test$predict_go,
                reference = df_test$will_go_to_college,
                positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 76  1
##          1 33 90
##                                                
##                Accuracy : 0.83                 
##                  95% CI : (0.7706, 0.8793)     
##     No Information Rate : 0.545                
##     P-Value [Acc > NIR] : < 0.00000000000000022
##                                                
##                   Kappa : 0.6669               
##                                                
##  Mcnemar's Test P-Value : 0.0000001058         
##                                                
##             Sensitivity : 0.9890               
##             Specificity : 0.6972               
##          Pos Pred Value : 0.7317               
##          Neg Pred Value : 0.9870               
##              Prevalence : 0.4550               
##          Detection Rate : 0.4500               
##    Detection Prevalence : 0.6150               
##       Balanced Accuracy : 0.8431               
##                                                
##        'Positive' Class : 1                    
## 

As we can see, Sensitivity reaches 98.9% with only 1 FN. However, this extreme threshold is not recommended: Specificity collapses to 69.72% and the FP count jumps from 13 to 33, so the model would flag far too many students who will in fact go to college. A sketch for scanning this trade-off across several thresholds follows.
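
A quick way to see the whole trade-off in one place is to scan several thresholds at once; a minimal sketch, reusing the predicted probabilities already stored in df_test:

# scan candidate thresholds and report sensitivity/specificity for each
for (th in c(0.2, 0.3, 0.4, 0.5)) {
  pred <- factor(ifelse(df_test$predict_result > th, "1", "0"), levels = c("0", "1"))
  cm   <- table(Prediction = pred, Reference = df_test$will_go_to_college)
  sens <- cm["1", "1"] / sum(cm[, "1"])  # TP / (TP + FN)
  spec <- cm["0", "0"] / sum(cm[, "0"])  # TN / (TN + FP)
  cat(sprintf("threshold %.1f: sensitivity %.4f, specificity %.4f\n", th, sens, spec))
}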

7 K-NN Model

We have seen the Logistic Regression model; now let's move on to k-NN

7.1 Cross validation

index <- sample(nrow(df), nrow(df)*0.8)
df_train2 <- df[index,]
df_test2 <- df[-index,]
# note: in df, will_go_to_college was recoded to an integer back in section 3,
# so select_if(is.numeric) keeps the target among the predictors (see 7.3)
df_train_pred <- df_train2 |> select_if(is.numeric)
df_test_pred <- df_test2 |> select_if(is.numeric)
df_train_tar <- df_train2[,"will_go_to_college"]
df_test_tar <- df_test2[,"will_go_to_college"]

Scaling the train and test data; the test set is scaled with the train set's center and scale, so that no test information leaks into the preprocessing

df_train_pred_s <- scale(df_train_pred)
df_test_pred_s <- scale(
  df_test_pred,
  center = attr(df_train_pred_s, "scaled:center"),
  scale = attr(df_train_pred_s, "scaled:scale")
)

7.2 Predicting

Find the optimum K

sqrt(nrow(df_train_pred_s))
## [1] 28.28427

Since the target has 2 classes, let's set K to an odd number to avoid tied votes

kx <- ceiling(sqrt(nrow(df_train_pred_s)))
kx
## [1] 29
library(class) # the package that provides the `knn()` function

pred_knn <- knn(train = df_train_pred_s,
                 test = df_test_pred_s,
                 cl = df_train_tar,
                 k = kx)

7.3 Evaluating

confusionMatrix(data = as.factor(pred_knn),
                reference = as.factor(df_test_tar),
                positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 103   0
##          1   0  97
##                                                
##                Accuracy : 1                    
##                  95% CI : (0.9817, 1)          
##     No Information Rate : 0.515                
##     P-Value [Acc > NIR] : < 0.00000000000000022
##                                                
##                   Kappa : 1                    
##                                                
##  Mcnemar's Test P-Value : NA                   
##                                                
##             Sensitivity : 1.000                
##             Specificity : 1.000                
##          Pos Pred Value : 1.000                
##          Neg Pred Value : 1.000                
##              Prevalence : 0.485                
##          Detection Rate : 0.485                
##    Detection Prevalence : 0.485                
##       Balanced Accuracy : 1.000                
##                                                
##        'Positive' Class : 1                    
## 

As we can see from the Confusion Matrix above, the prediction is literally perfect: Sensitivity is 1.000 and not a single observation is misclassified, with zero FN and zero FP. A perfect score on test data should raise suspicion rather than celebration, and not only because of possible over-fitting (we are not even using k = 1 here). There is a concrete cause upstream: in section 3, will_go_to_college was recoded to an integer on df itself, so select_if(is.numeric) in section 7.1 kept the target among the k-NN predictors, effectively handing the model the answer. A corrected setup is sketched below.
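
A minimal sketch of the corrected feature preparation, dropping the leaked target (the scaling and knn() steps then proceed exactly as above):

# exclude the target column from the numeric predictors before scaling
df_train_pred <- df_train2 |> select_if(is.numeric) |> select(-will_go_to_college)
df_test_pred  <- df_test2  |> select_if(is.numeric) |> select(-will_go_to_college)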

8 Conclusion and Suggestion

At first glance, of the 2 prediction methods, k-NN gives us a better prediction than the Logistic Regression model, as it classifies the evaluation data perfectly. However, that perfect k-NN result cannot be trusted.

So, in order to predict whether the students will go to college or not, it is better to use Logistic Regression for now instead of k-NN: the perfect confusionMatrix reflects the target leaking into the predictors, and a model evaluated that way tells us nothing about new data (an over-fit model likewise predicts new data badly, since it only memorizes the existing data). That said, after removing the target from the predictors, we can re-run and fine-tune the k-NN model with different k values, as sketched below.
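
A minimal sketch of such tuning, assuming the corrected predictor matrices from section 7.3 have been re-scaled into df_train_pred_s and df_test_pred_s:

# try several odd k values and compare test-set accuracy
library(class)
for (k in seq(3, 41, by = 2)) {
  pred <- knn(train = df_train_pred_s, test = df_test_pred_s,
              cl = df_train_tar, k = k)
  acc <- mean(as.character(pred) == as.character(df_test_tar))
  cat("k =", k, " accuracy =", round(acc, 3), "\n")
}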