Aim of the Project (Supervised)

Select the features that are important for each analysis:
For the binary dependent variable, a stepwise method is used.
For the non-binary dependent variable, the information gain method is used.
For the output variable J_or_C (binary), perform clustering by applying at least 2 clustering algorithms to generate clusters (assume k=2).
- Logistic Regression and KNN are used.
The performance of the two algorithms is compared. (1)
For the output variable Quartile (non-binary), perform clustering by applying at least 2 clustering algorithms to generate clusters (assume k=5).
- KNN and SVM are used.
The performance of the two algorithms is compared. (2)

Load Dataset

library(readxl)
# setwd() below sets the working directory to the folder that contains the RMD file (requires RStudio).
setwd(dirname(rstudioapi::getActiveDocumentContext()$path))
df <- read_excel("data2.xlsx")
dim(df)

## [1] 1000   82

The data set has 1000 rows and 82 variables, so feature selection for the models will not be trivial.

Load Packages

Check RMD codes for details.

Exploratory Data Analysis

Data Overview

Note that “id” should not be passed to the algorithms.

This variable does not carry any characteristic information.

We inspected the data with the head(), str() and summary() functions. Their output was too cluttered in the HTML report, so it has been removed.

Missing Values

miss_var_summary(df)

## # A tibble: 82 x 3
##    variable               n_miss pct_miss
##    <chr>                   <int>    <dbl>
##  1 CitationMetric_5a_CM2      18      1.8
##  2 CitationMetric_4a_CM2      17      1.7
##  3 CitationMetric_5_CB         1      0.1
##  4 CitationMetric_5b_CM3       1      0.1
##  5 id                          0      0  
##  6 Authors_Num                 0      0  
##  7 Countries_Num               0      0  
##  8 Countries_Unique_Num        0      0  
##  9 Countries_Unique_Count      0      0  
## 10 Countries_Perc              0      0  
## # ... with 72 more rows

vis_miss(df)

We can see that a few rows contain missing values (at most 1.8% in any single column).
We remove those records in order to have more accurate clusters.

# REMOVING ROWS WITH NULL VALUES.
df <- na.omit(df)
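The per-column counts reported by miss_var_summary() can also be cross-checked in base R, which makes it easy to see how many rows na.omit() will drop. The following is only a minimal sketch on a made-up data frame; `toy` and its columns are hypothetical and not part of data2.xlsx.

```r
# Toy data frame standing in for df (hypothetical values).
toy <- data.frame(
  a = c(1, NA, 3, 4),
  b = c("x", "y", NA, "z"),
  c = c(10, 20, 30, 40)
)

# Missing values per column, comparable to the n_miss column above.
n_miss <- colSums(is.na(toy))

# Number of rows that contain at least one missing value.
rows_with_na <- sum(!complete.cases(toy))

# na.omit() keeps only the complete rows.
toy_clean <- na.omit(toy)

print(n_miss)
print(rows_with_na)
print(nrow(toy_clean))
```

In the report itself the same step reduces the data from 1000 rows to the 982 rows seen in the later outputs.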

#From the structure we see that these columns are categorical and unique for each row.
#They slow down the clustering algorithms.
#Their relation to the dependent variable is relatively weak.
#We remove these variables from the dataset
#to get more accurate results and more efficient clustering.
#The remaining columns below are removed for the same reason.

df$id <- NULL
df$Authors_Num <- NULL
df$Countries_Num <- NULL
df$Countries_Unique_Num <- NULL

df$Countries_Unique_Count <- NULL
df$Countries_Perc <- NULL
df$Countries_First_Author <- NULL

ALGORITHM AND FEATURE SELECTIONS

Dependent Variable Research (J_or_C)

# XTAB OF J AND C
table(df$J_or_C)

## 
##   C   J 
## 485 497

The dependent variable is binary: it takes the value J or C.
The two classes are nearly balanced (485 C vs. 497 J).

ggplot(df,
        aes(factor(J_or_C))) +
    geom_bar(fill = "coral",
        alpha = 0.5) +
    theme_classic()

Choosing Clustering Algorithms

We can perform “Logistic Regression” and “KNN”.
We can use stepwise methods to pick the important features.

Logistic Regression

Creating Logistic Regression with all variables.
Recoding dependent J_or_C variable into binary structure.

df_logistic <- df
df_logistic$J_or_C <- ifelse(df_logistic$J_or_C == "J",1,0)
table(df_logistic$J_or_C)

## 
##   0   1 
## 485 497

Checking Correlations with Dependent Variable

# CORRELATION OF EVERY VARIABLE WITH J_or_C

# ONLY Quartile (0.74) AND numPage (0.45) STAND OUT; ALL OTHER CORRELATIONS ARE WEAK


cor(df_logistic[-26],df_logistic$J_or_C)

## Warning in cor(df_logistic[-26], df_logistic$J_or_C): the standard deviation is
## zero

##                                                  [,1]
## FRES_Title                               -0.030776698
## FLESCH_Title                              0.036355645
## numCharTitle_all                          0.050662339
## numCharTitle_onlyAlpha                    0.053562812
## numCharTitle_nonAlpha                     0.040687463
## nonAlphaCharTitle_isExist                 0.011332172
## TTR_Title                                -0.003377063
## numWordTitle                              0.019689673
## numABV_Title                             -0.067194218
## binABV_Title                             -0.073972587
## numSentAbstract                           0.129395358
## numABV_Abstract                          -0.003951528
## binABV_Abstract                          -0.073972587
## PaperAge                                  0.214074054
## numLexVerb                                0.015713515
## numSylGreaThan2                           0.016937051
## numPage                                   0.451179476
## Year                                     -0.214074054
## Cited by                                  0.118646941
## FRES_Abstract                            -0.042862894
## FLESCH_Abstract                           0.004959666
## isFunding                                 0.246225226
## keywordsListedAlpha                       0.206591955
## numKeywords                              -0.166545696
## Quartile                                  0.741636549
## CitationMetric_1                          0.200695011
## CitationMetric_2                          0.109219433
## CitationMetric_3                          0.200628899
## CitationMetric_4_CB                       0.135376418
## CitationMetric_4a_CM2                     0.135376418
## CitationMetric_4b_CM3                     0.222523064
## CitationMetric_5_CB                       0.156181537
## CitationMetric_5a_CM2                     0.156181537
## CitationMetric_5b_CM3                     0.231167913
## Dominant_Topic                           -0.008926470
## single_quote_mark                         0.005762380
## double_quote_mark                        -0.033263488
## exclamation_mark                         -0.014809022
## em_dash_mark                                       NA
## parenthesis_mark                          0.054947625
## plus_mark                                -0.033795629
## comma_mark                                         NA
## hyphen_mark                               0.057948334
## period_mark                               0.078153365
## slash_mark                                0.038813257
## colon_mark                                0.094033629
## semicolon_mark                                     NA
## question_mark                             0.131906706
## square_parenthesis_mark                  -0.035032601
## underscore_mark                                    NA
## curly_parenthesis_mark                    0.031539745
## apostrophe_mark                           0.039907178
## and_mark                                  0.047746319
## backslash_mark                            0.031539745
## equal_mark                               -0.019130078
## presenceColon                             0.082372830
## avgPunctuation                            0.110797407
## numPreposition                            0.115023990
## numFreqLexItems_connectives               0.048923202
## numFreqLexItems_reviews                            NA
## numFreqLexItems_previews                  0.018308762
## numFreqLexItems_action_markers           -0.027478316
## numFreqLexItems_closing                  -0.001444089
## numTitleSubstantiveWordsWoutStopwords     0.014228113
## numTitleSubstantiveWordsWithStopwords     0.007738565
## numAbstractSubstantiveWordsWoutStopwords  0.143464423
## numAbstractSubstantiveWordsWithStopwords  0.139311196
## question_mark_loc                         0.019682779
## question_mark_isExist                     0.026916236
## presenceInitialPosition_a                 0.071255143
## presenceInitialPosition_the               0.014674761
## presenceInitialPosition_a_or_the          0.071475824
## presenceInitialPosition_ing               0.013179398
## numPrepositionBeginning                   0.023389140

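Apart from Quartile (0.74) and numPage (0.45), no independent variable shows a strong correlation with J_or_C. The NA entries belong to columns with zero standard deviation, as the warning indicates.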
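The NA entries in the correlation output come from columns that never vary, since a correlation is undefined when the standard deviation is zero. A rough sketch of how such columns could be located and left out before calling cor() is given here; the data frame `toy` and its column names are made up for illustration.

```r
# Toy numeric data (hypothetical); `flat` never varies.
toy <- data.frame(
  x    = c(1, 2, 3, 4, 5),
  flat = c(0, 0, 0, 0, 0),
  y    = c(0, 0, 1, 1, 1)
)

# Standard deviation of every column; 0 marks a constant column.
sds <- sapply(toy, sd)
constant_cols <- names(sds)[sds == 0]
print(constant_cols)

# A constant column yields NA (plus the warning seen in the report).
flat_cor <- suppressWarnings(cor(toy$flat, toy$y))
print(flat_cor)

# Correlate only the non-constant predictors with the outcome.
keep <- setdiff(names(toy), c(constant_cols, "y"))
print(cor(toy[keep], toy$y))
```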

Building Logistic Regression Model

model_full_logistic <- glm( J_or_C ~., data = df_logistic, family = binomial)

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

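The GLM model has been created. The two warnings suggest that some predictors separate the classes almost perfectly, so the coefficients of this full model should be read with care.
Next, we select the important features with a stepwise model.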

Feature Selection

Stepwise Regression

#Stepwise regression picks the model with the lowest AIC.
#By that criterion, the selected combination of variables gives the best logistic regression model.

#step.model <- model_full_logistic %>% stepAIC(trace = FALSE)

#The step.model call above is commented out to save run time.
#The variables it selected are listed below.

df2 <- df_logistic[,c("FRES_Title","FLESCH_Title","numSentAbstract","PaperAge","numLexVerb",
                      "numSylGreaThan2","numPage","FRES_Abstract","FLESCH_Abstract",
                      "numKeywords","Quartile","Dominant_Topic","single_quote_mark",
                      "double_quote_mark","exclamation_mark",
                      "numTitleSubstantiveWordsWoutStopwords",
                      "numTitleSubstantiveWordsWithStopwords",
                      "question_mark_loc","question_mark_isExist",
                      "presenceInitialPosition_ing",
                      "parenthesis_mark","plus_mark","numPrepositionBeginning","J_or_C")]


model_full_logistic2 <- glm( J_or_C ~., data = df2, family = binomial)

step.model2 <- model_full_logistic2 %>% stepAIC(trace = FALSE)
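To make the stepwise idea concrete, here is a small sketch on simulated data. It uses stats::step() from base R in place of MASS::stepAIC() so that no extra package is needed; both search for a model with a lower AIC. The data frame `toy`, its columns and the simulated coefficients are assumptions made only for this illustration.

```r
set.seed(1)
n <- 200

# Simulated predictors (hypothetical): two informative, one pure noise.
toy <- data.frame(
  x1    = rnorm(n),
  x2    = rnorm(n),
  noise = rnorm(n)
)

# Binary outcome that depends on x1 and x2 only.
p <- plogis(1.5 * toy$x1 - 1 * toy$x2)
toy$y <- rbinom(n, 1, p)

# Full logistic model, then AIC-based stepwise selection.
full <- glm(y ~ ., data = toy, family = binomial)
sel  <- step(full, trace = 0)

print(formula(sel))
print(c(full = AIC(full), selected = AIC(sel)))
```

The selected model never has a higher AIC than the starting model, because a step is only taken when it lowers the AIC.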

Performance of the Model

Accuracy

#The logistic regression predictions match the observed J_or_C values
#with 97.0% accuracy.

#In other words, with the given predictors, the model
#finds the same result as the observed class 97.0% of the time.


probabilities <- step.model2 %>% predict(df2[-24], type = "response")
predicted.classes <- ifelse(probabilities > 0.5, 1, 0)

observed.classes <- df_logistic[26]
mean(predicted.classes == observed.classes)

## [1] 0.9704684

Visuals for the Logistic Model

Curve

#Logistic regression curve by dependent J_or_C and 
#independent "avgPunctuation" variable.

plot(as.numeric(df_logistic$J_or_C) ~ avgPunctuation , data = df_logistic,
     col = "darkorange",
     pch = "I", 
     ylim = c(-0.2, 1))

abline(h = 0, lty = 3)
abline(h = 1, lty = 3)
abline(h = 0.5, lty = 2)

# Separate name so that the stepwise input model is not overwritten.
model_punct <- glm(J_or_C ~ avgPunctuation,
                   data = df_logistic,
                   family = "binomial")

curve(predict(model_punct, data.frame(avgPunctuation = x), type = "response"),
      add = TRUE,
      lwd = 3,
      col = "dodgerblue")

ROC Curve

test_ol <- predict(step.model2, newdata = df2, type = "response")
observed.classes2 <- df2$J_or_C

a <- roc(observed.classes2 ~ test_ol, plot = TRUE, print.auc = TRUE)

## Setting levels: control = 0, case = 1

## Setting direction: controls < cases

a$auc

## Area under the curve: 0.9966

Confusion Matrix

The confusion matrix is used to compare the efficiency of the algorithms.

predicted.classes2 <- as.factor(ifelse(predicted.classes==1,"Yes","No"))
observed.classes3 <- as.factor(ifelse(observed.classes2==1,"Yes","No"))

confusionMatrix(data = predicted.classes2,
                reference = observed.classes3, positive = "Yes")

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Yes
##        No  468  12
##        Yes  17 485
##                                           
##                Accuracy : 0.9705          
##                  95% CI : (0.9579, 0.9801)
##     No Information Rate : 0.5061          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.9409          
##                                           
##  Mcnemar's Test P-Value : 0.4576          
##                                           
##             Sensitivity : 0.9759          
##             Specificity : 0.9649          
##          Pos Pred Value : 0.9661          
##          Neg Pred Value : 0.9750          
##              Prevalence : 0.5061          
##          Detection Rate : 0.4939          
##    Detection Prevalence : 0.5112          
##       Balanced Accuracy : 0.9704          
##                                           
##        'Positive' Class : Yes             
##
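The statistics printed by confusionMatrix() follow directly from the four cell counts. As a rough check, the sketch below rebuilds them by hand; the counts are copied from the output above, while the variable names are only illustrative.

```r
# Counts from the confusion matrix above (rows = prediction, columns = reference).
cm <- matrix(c(468, 17, 12, 485), nrow = 2,
             dimnames = list(Prediction = c("No", "Yes"),
                             Reference  = c("No", "Yes")))

tp <- cm["Yes", "Yes"]  # predicted Yes, truly Yes
tn <- cm["No",  "No"]   # predicted No,  truly No
fp <- cm["Yes", "No"]   # predicted Yes, truly No
fn <- cm["No",  "Yes"]  # predicted No,  truly Yes
n  <- sum(cm)

accuracy    <- (tp + tn) / n
sensitivity <- tp / (tp + fn)
specificity <- tn / (tn + fp)
ppv         <- tp / (tp + fp)
npv         <- tn / (tn + fn)

# Cohen's kappa: observed agreement corrected for chance agreement.
p_chance <- sum(rowSums(cm) * colSums(cm)) / n^2
kappa    <- (accuracy - p_chance) / (1 - p_chance)

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity, ppv = ppv, npv = npv,
              kappa = kappa), 4))
```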

KNN METHOD

Data Normalization

#We scale the data column by column with the "scale" function.
#This puts all predictors on a common scale, which distance-based methods such as KNN rely on.
#Remember that only numeric columns can be normalized.

df.scaled2 <- as.data.frame(scale(df2[-24]))
df.scaled2$J_or_C <- df2$J_or_C
df.scaled2 <- na.omit(df.scaled2)
str(df.scaled2)

## 'data.frame':    982 obs. of  24 variables:
##  $ FRES_Title                           : num  0.413 -0.412 -1.293 0.301 -0.552 ...
##  $ FLESCH_Title                         : num  -0.31515 0.3351 1.1276 0.00997 0.72119 ...
##  $ numSentAbstract                      : num  -0.2603 -0.2603 -0.6187 -0.9771 0.0982 ...
##  $ PaperAge                             : num  1.073 -0.953 1.819 -0.526 -0.206 ...
##  $ numLexVerb                           : num  -0.0881 -0.0881 -1.118 0.9418 1.9718 ...
##  $ numSylGreaThan2                      : num  0.771 0.771 0.1 0.771 1.441 ...
##  $ numPage                              : num  0.266 -0.584 -0.705 -0.584 -1.069 ...
##  $ FRES_Abstract                        : num  0.6392 0.0668 0.322 0.568 -0.3816 ...
##  $ FLESCH_Abstract                      : num  -0.4671 0.5015 -0.5579 -0.346 -0.0131 ...
##  $ numKeywords                          : num  0.4564 -0.0449 -0.7968 -0.5462 -0.7968 ...
##  $ Quartile                             : num  0.33 -0.75 -0.75 -0.75 -0.75 ...
##  $ Dominant_Topic                       : num  1.02 -1.19 1.47 1.02 -0.97 ...
##  $ single_quote_mark                    : num  -0.292 -0.292 -0.292 -0.292 -0.292 ...
##  $ double_quote_mark                    : num  -0.207 -0.207 -0.207 -0.207 -0.207 ...
##  $ exclamation_mark                     : num  -0.0428 -0.0428 -0.0428 -0.0428 -0.0428 ...
##  $ numTitleSubstantiveWordsWoutStopwords: num  0.433 -0.821 -1.238 1.268 0.433 ...
##  $ numTitleSubstantiveWordsWithStopwords: num  0.458 -0.798 -1.217 1.296 0.458 ...
##  $ question_mark_loc                    : num  -0.172 -0.172 -0.172 -0.172 -0.172 ...
##  $ question_mark_isExist                : num  -0.18 -0.18 -0.18 -0.18 -0.18 ...
##  $ presenceInitialPosition_ing          : num  1.882 -0.531 -0.531 -0.531 -0.531 ...
##  $ parenthesis_mark                     : num  0.965 0.965 -0.495 -0.495 -0.495 ...
##  $ plus_mark                            : num  -0.115 -0.115 -0.115 -0.115 -0.115 ...
##  $ numPrepositionBeginning              : num  -0.124 -0.124 -0.124 -0.124 -0.124 ...
##  $ J_or_C                               : num  1 0 0 0 0 0 0 0 0 0 ...
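For reference, scale() with its default arguments subtracts the column mean and divides by the column standard deviation. A minimal sketch on a made-up vector (`x` is hypothetical) shows the equivalence:

```r
# Made-up values for one column.
x <- c(2, 4, 6, 8)

# Manual z-score: centre on the mean, divide by the standard deviation.
z_manual <- (x - mean(x)) / sd(x)

# scale() returns a one-column matrix; as.numeric() drops the attributes.
z_scale <- as.numeric(scale(x))

print(all.equal(z_manual, z_scale))
print(c(mean = mean(z_scale), sd = sd(z_scale)))
```

After scaling, every column has mean 0 and standard deviation 1, so no single predictor dominates the distance computation in KNN.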

KNN Model

#Creating KNN Model with k 1 to 30.

df.scaled2$J_or_C <- as.factor(df.scaled2$J_or_C)

table(df.scaled2$J_or_C)

## 
##   0   1 
## 485 497

modelknn <- train(J_or_C~., data=df.scaled2,
                method="knn",
                tuneGrid=expand.grid(k=1:30))

modelknn

## k-Nearest Neighbors 
## 
## 982 samples
##  23 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 982, 982, 982, 982, 982, 982, ... 
## Resampling results across tuning parameters:
## 
##   k   Accuracy   Kappa    
##    1  0.7758277  0.5517484
##    2  0.7606678  0.5213913
##    3  0.7646328  0.5293783
##    4  0.7719704  0.5442119
##    5  0.7801897  0.5606423
##    6  0.7816779  0.5635686
##    7  0.7902618  0.5807583
##    8  0.7902759  0.5808810
##    9  0.7919121  0.5840294
##   10  0.7945540  0.5893973
##   11  0.7944254  0.5891288
##   12  0.7963657  0.5929346
##   13  0.8007361  0.6017250
##   14  0.8019043  0.6040637
##   15  0.8043639  0.6090050
##   16  0.8015406  0.6034319
##   17  0.8043338  0.6090305
##   18  0.8048107  0.6101427
##   19  0.8037911  0.6082270
##   20  0.8069060  0.6143907
##   21  0.8092181  0.6190312
##   22  0.8105469  0.6217465
##   23  0.8083458  0.6173057
##   24  0.8088214  0.6182303
##   25  0.8090527  0.6186174
##   26  0.8110740  0.6226284
##   27  0.8121362  0.6247965
##   28  0.8112550  0.6230760
##   29  0.8110751  0.6228365
##   30  0.8134211  0.6275656
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 30.
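To illustrate what the tuned parameter k controls, here is a bare-bones KNN classifier in base R. It is a sketch of the general technique, not the caret implementation used above; the function `knn_predict` and the toy training data are hypothetical.

```r
# Predict the class of one query point by majority vote among its
# k nearest training points (Euclidean distance).
knn_predict <- function(train_x, train_y, query, k) {
  diffs   <- sweep(as.matrix(train_x), 2, query)  # subtract query from each row
  dists   <- sqrt(rowSums(diffs^2))
  nearest <- order(dists)[seq_len(k)]
  votes   <- table(train_y[nearest])
  names(votes)[which.max(votes)]                  # ties go to the first level
}

# Toy training set (hypothetical): two well-separated groups.
train_x <- data.frame(f1 = c(0, 0.2, 0.1, 3, 3.1, 2.9),
                      f2 = c(0, 0.1, 0.3, 3, 2.8, 3.2))
train_y <- factor(c("0", "0", "0", "1", "1", "1"))

pred_a <- knn_predict(train_x, train_y, query = c(0.1, 0.1), k = 3)
pred_b <- knn_predict(train_x, train_y, query = c(3.0, 3.0), k = 3)
print(c(pred_a, pred_b))
```

A small k follows the training points closely, while a large k smooths the decision over a wider neighbourhood.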

KNN Model Plot

#Plotting Model.

plot(modelknn)

Confusion Matrix

pred<-predict(modelknn,df.scaled2[-24])
confusionMatrix(as.factor(pred),as.factor(df2$J_or_C))

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 467 145
##          1  18 352
##                                           
##                Accuracy : 0.834           
##                  95% CI : (0.8092, 0.8568)
##     No Information Rate : 0.5061          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.669           
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.9629          
##             Specificity : 0.7082          
##          Pos Pred Value : 0.7631          
##          Neg Pred Value : 0.9514          
##              Prevalence : 0.4939          
##          Detection Rate : 0.4756          
##    Detection Prevalence : 0.6232          
##       Balanced Accuracy : 0.8356          
##                                           
##        'Positive' Class : 0               
##

CONCLUSION FOR BINARY CLUSTERING

Comparison of Algorithms

Selection of algorithm: The dependent variable is binary.
Thus, we used the logistic regression and KNN methods.
The important variables were picked with stepwise regression over the logistic regression model.
Prediction: Both logistic regression and KNN were run over the selected variables.
Predicted values and observed J_or_C values were compared and visualized.
Algorithm comparison: The confusion matrix results show that logistic regression performs better.
KNN accuracy: 83.4%, Logistic Regression accuracy: 97.0%
The confusion matrices contain more information on sensitivity and specificity.
Note that the positive class is “Yes” (J) in the logistic regression matrix and “0” (C) in the KNN matrix, so the two sets of figures below do not refer to the same class.
Log Reg sensitivity: 0.9759 while KNN sensitivity: 0.9629
Log Reg specificity: 0.9649 while KNN specificity: 0.7082
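Because the two confusion matrices were printed with different positive classes, a like-for-like comparison is easier with per-class recall, that is, the share of each true class that was predicted correctly. The sketch below recomputes it from the counts in the two outputs above; the object names are illustrative only.

```r
# Counts from the two confusion matrices (rows = prediction, columns = reference),
# with classes coded 0 = C and 1 = J in both.
classes  <- c("0", "1")
logit_cm <- matrix(c(468, 17, 12, 485), nrow = 2,
                   dimnames = list(pred = classes, ref = classes))
knn_cm   <- matrix(c(467, 18, 145, 352), nrow = 2,
                   dimnames = list(pred = classes, ref = classes))

# Recall per true class: correct predictions divided by the column total.
class_recall <- function(cm) {
  out <- diag(cm) / colSums(cm)
  names(out) <- colnames(cm)
  out
}

recall <- rbind(logistic = class_recall(logit_cm),
                knn      = class_recall(knn_cm))
accuracy <- c(logistic = sum(diag(logit_cm)) / sum(logit_cm),
              knn      = sum(diag(knn_cm)) / sum(knn_cm))

print(round(recall, 4))
print(round(accuracy, 4))
```

Read this way, the two models are about equally good at recognising class 0, and the gap comes almost entirely from class 1.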

CLUSTERING WITH NON-BINARY DEPENDENT VARIABLE

Exploratory Data Analysis

table(df$Quartile)

## 
##   0   1   2   3   4 
## 519 319  91  31  22

prop.table(table(df$Quartile))*100

## 
##         0         1         2         3         4 
## 52.851324 32.484725  9.266802  3.156823  2.240326
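With classes this unbalanced, a useful baseline is the accuracy of always predicting the largest class. This is the value that confusionMatrix() reports as the No Information Rate. A short sketch using the counts from the table above:

```r
# Class counts of Quartile, copied from the table above.
counts <- c("0" = 519, "1" = 319, "2" = 91, "3" = 31, "4" = 22)

# Accuracy obtained by always predicting the most frequent class.
no_info_rate <- max(counts) / sum(counts)
print(round(no_info_rate, 4))
```

Any model for Quartile should be judged against this baseline rather than against 0.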

ggplot(df,
        aes(factor(Quartile))) +
    geom_bar(fill = "coral",
        alpha = 0.5) +
    theme_classic()

FEATURE SELECTION

#Remove non-info columns.


df$id <- NULL
df$Authors_Num <- NULL
df$Countries_Num <- NULL
df$Countries_Unique_Num <- NULL

df$Countries_Unique_Count <- NULL
df$Countries_Perc <- NULL
df$Countries_First_Author <- NULL
df <- na.omit(df)

#Using information.gain function for selection of the variables.

#information.gain function calculates entropy and returns variables where
#we can gain information for dependent variable.

#For binary clustering we used stepwise regression models.

q <- information.gain(Quartile~., df)
q <- as.data.frame(q)
q <- tibble::rownames_to_column(q, "VALUE")
q <-q[order(-q$attr_importance),]

q1 <- q %>% filter(attr_importance>0)
q1$VALUE

##  [1] "J_or_C"                "numPage"               "PaperAge"             
##  [4] "Year"                  "Cited by"              "CitationMetric_1"     
##  [7] "CitationMetric_3"      "question_mark"         "CitationMetric_4b_CM3"
## [10] "CitationMetric_5b_CM3" "CitationMetric_4_CB"   "CitationMetric_4a_CM2"
## [13] "CitationMetric_5_CB"   "CitationMetric_5a_CM2"

#Scaling Data

df3KNN <- df[,q1$VALUE]
df3KNN$J_or_C <- ifelse(df3KNN$J_or_C == "J",1,0)
df3KNN <- as.data.frame(scale(df3KNN))
df3KNN$Quartile <- df$Quartile

We used the information gain method.
Variables that give zero information are removed.
14 of the variables carry information about the Quartile variable.
We will implement the KNN and SVM methods for clustering.
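The idea behind information gain can be sketched in a few lines of base R: it is the entropy of the target minus the entropy that remains once the feature is known. The functions `entropy` and `info_gain` and the toy vectors are hypothetical, and the sketch only handles categorical features; continuous predictors would have to be binned first, which is left out here.

```r
# Shannon entropy of a categorical vector (natural log).
entropy <- function(x) {
  p <- table(x) / length(x)
  p <- p[p > 0]
  -sum(p * log(p))
}

# Information gain: H(target) - H(target | feature).
info_gain <- function(feature, target) {
  weights <- table(feature) / length(feature)       # share of each feature value
  cond    <- sapply(split(target, feature), entropy) # entropy within each value
  entropy(target) - sum(weights[names(cond)] * cond)
}

# Toy data (hypothetical).
target      <- c("a", "a", "b", "b")
informative <- c("x", "x", "y", "y")  # determines the target completely
useless     <- c("x", "y", "x", "y")  # tells nothing about the target

gain_informative <- info_gain(informative, target)
gain_useless     <- info_gain(useless, target)
print(c(informative = gain_informative, useless = gain_useless))
```

A feature with zero gain, like `useless` here, corresponds to the variables that were filtered out with attr_importance > 0.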

KNN METHOD

#Creating KNN Model with k 1 to 30.

df3KNN$Quartile <- as.factor(df3KNN$Quartile)

table(df3KNN$Quartile)

## 
##   0   1   2   3   4 
## 519 319  91  31  22

modelknn <- train(Quartile~., data=df3KNN,
                method="knn",
                tuneGrid=expand.grid(k=1:30))

modelknn

## k-Nearest Neighbors 
## 
## 982 samples
##  14 predictor
##   5 classes: '0', '1', '2', '3', '4' 
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 982, 982, 982, 982, 982, 982, ... 
## Resampling results across tuning parameters:
## 
##   k   Accuracy   Kappa    
##    1  0.8082758  0.6838606
##    2  0.8030140  0.6757666
##    3  0.8073025  0.6816296
##    4  0.8081096  0.6815487
##    5  0.8160422  0.6931038
##    6  0.8183412  0.6958800
##    7  0.8246655  0.7058145
##    8  0.8266796  0.7086463
##    9  0.8281881  0.7104673
##   10  0.8319806  0.7167732
##   11  0.8302935  0.7132037
##   12  0.8310505  0.7139109
##   13  0.8305155  0.7125105
##   14  0.8294755  0.7101852
##   15  0.8296722  0.7104142
##   16  0.8320216  0.7140035
##   17  0.8326327  0.7147589
##   18  0.8320532  0.7138247
##   19  0.8307571  0.7114743
##   20  0.8316195  0.7127679
##   21  0.8311658  0.7117304
##   22  0.8284676  0.7069512
##   23  0.8287605  0.7072143
##   24  0.8284572  0.7066616
##   25  0.8298149  0.7087841
##   26  0.8288261  0.7068250
##   27  0.8282081  0.7055289
##   28  0.8269523  0.7031983
##   29  0.8265923  0.7024095
##   30  0.8261297  0.7017050
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 17.

KNN Model Plot

#Plotting Model.

plot(modelknn)

Confusion Matrix

pred<-predict(modelknn,df3KNN[-15])
confusionMatrix(as.factor(pred),as.factor(df3KNN$Quartile))

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1   2   3   4
##          0 497   5   4   2   2
##          1  21 307  57  23  16
##          2   1   6  27   4   3
##          3   0   1   1   1   0
##          4   0   0   2   1   1
## 
## Overall Statistics
##                                           
##                Accuracy : 0.8483          
##                  95% CI : (0.8243, 0.8701)
##     No Information Rate : 0.5285          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.7389          
##                                           
##  Mcnemar's Test P-Value : 3.28e-16        
## 
## Statistics by Class:
## 
##                      Class: 0 Class: 1 Class: 2 Class: 3 Class: 4
## Sensitivity            0.9576   0.9624  0.29670 0.032258 0.045455
## Specificity            0.9719   0.8235  0.98429 0.997897 0.996875
## Pos Pred Value         0.9745   0.7241  0.65854 0.333333 0.250000
## Neg Pred Value         0.9534   0.9785  0.93199 0.969356 0.978528
## Prevalence             0.5285   0.3248  0.09267 0.031568 0.022403
## Detection Rate         0.5061   0.3126  0.02749 0.001018 0.001018
## Detection Prevalence   0.5193   0.4318  0.04175 0.003055 0.004073
## Balanced Accuracy      0.9648   0.8930  0.64050 0.515078 0.521165
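In the multi-class case the per-class sensitivity is the diagonal count divided by the column total. The sketch below recomputes it from the KNN matrix above; the counts are copied from the output and the object names are illustrative.

```r
# KNN confusion matrix for Quartile (rows = prediction, columns = reference).
classes <- as.character(0:4)
knn_cm <- matrix(c(497,   5,  4,  2,  2,
                    21, 307, 57, 23, 16,
                     1,   6, 27,  4,  3,
                     0,   1,  1,  1,  0,
                     0,   0,  2,  1,  1),
                 nrow = 5, byrow = TRUE,
                 dimnames = list(pred = classes, ref = classes))

# Overall accuracy: share of observations on the diagonal.
accuracy <- sum(diag(knn_cm)) / sum(knn_cm)

# Sensitivity per class: correct predictions over the true class size.
sens <- diag(knn_cm) / colSums(knn_cm)
names(sens) <- classes

print(round(accuracy, 4))
print(round(sens, 4))
```

The column totals reproduce the class sizes 519, 319, 91, 31 and 22, which shows how few observations of classes 3 and 4 the model has to learn from.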

SVM METHOD

model_svm <- svm(Quartile~., df3KNN)

pred <- predict(model_svm, df3KNN)
confusionMatrix(as.factor(pred),as.factor(df3KNN$Quartile))

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1   2   3   4
##          0 505   4   3   2   1
##          1  14 311  59  25  18
##          2   0   4  29   3   3
##          3   0   0   0   1   0
##          4   0   0   0   0   0
## 
## Overall Statistics
##                                           
##                Accuracy : 0.8615          
##                  95% CI : (0.8383, 0.8825)
##     No Information Rate : 0.5285          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.7603          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: 0 Class: 1 Class: 2 Class: 3 Class: 4
## Sensitivity            0.9730   0.9749  0.31868 0.032258   0.0000
## Specificity            0.9784   0.8250  0.98878 1.000000   1.0000
## Pos Pred Value         0.9806   0.7283  0.74359 1.000000      NaN
## Neg Pred Value         0.9700   0.9856  0.93425 0.969419   0.9776
## Prevalence             0.5285   0.3248  0.09267 0.031568   0.0224
## Detection Rate         0.5143   0.3167  0.02953 0.001018   0.0000
## Detection Prevalence   0.5244   0.4348  0.03971 0.001018   0.0000
## Balanced Accuracy      0.9757   0.9000  0.65373 0.516129   0.5000

CONCLUSION FOR NON-BINARY CLUSTERING

We can compare KNN and SVM by their confusion matrices.
Accuracy of SVM: 0.8615
Accuracy of KNN: 0.8483
For the “Quartile” dependent variable, the two algorithms are close to each other in terms of accuracy. Both predict classes 0 and 1 well and rarely detect classes 3 and 4 (sensitivity below 0.05).
Each case is special and unique, but for this data set SVM does slightly better than KNN when the dependent variable is non-binary.
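Overall accuracy is dominated by the two large classes, so it can help to average the per-class sensitivities, giving every class the same weight. The sketch below does this with the sensitivity values printed in the two outputs above; the averaging is an additional summary for illustration, not a figure from the original analysis.

```r
# Per-class sensitivities copied from the KNN and SVM outputs (classes 0 to 4).
knn_sens <- c(0.9576, 0.9624, 0.29670, 0.032258, 0.045455)
svm_sens <- c(0.9730, 0.9749, 0.31868, 0.032258, 0.0000)

# Macro-averaged sensitivity: unweighted mean over the five classes.
macro <- c(knn = mean(knn_sens), svm = mean(svm_sens))
print(round(macro, 4))
```

Both averages stay below 0.5, which reflects the weak detection of classes 2 to 4 despite the high overall accuracy.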

Testing Clustering Algorithms’ Performance

Cagri Cebisli

05 01 2020
