Aim of the Project (Supervised)

Load Dataset

library(readxl)
# Set the working directory to the folder that contains this RMD file (requires RStudio).
setwd(dirname(rstudioapi::getActiveDocumentContext()$path))
df <- read_excel("data2.xlsx")
dim(df)
## [1] 1000   82

Load Packages

See the RMD source for the full list of packages.

Exploratory Data Analysis

Data Overview

Note that “id” should not be passed to the algorithms.

It is only a row identifier and carries no information about the records.

We inspected the data with head(), str() and summary(). Their output rendered poorly in the HTML report, so it has been omitted.

Missing Values

miss_var_summary(df)
## # A tibble: 82 x 3
##    variable               n_miss pct_miss
##    <chr>                   <int>    <dbl>
##  1 CitationMetric_5a_CM2      18      1.8
##  2 CitationMetric_4a_CM2      17      1.7
##  3 CitationMetric_5_CB         1      0.1
##  4 CitationMetric_5b_CM3       1      0.1
##  5 id                          0      0  
##  6 Authors_Num                 0      0  
##  7 Countries_Num               0      0  
##  8 Countries_Unique_Num        0      0  
##  9 Countries_Unique_Count      0      0  
## 10 Countries_Perc              0      0  
## # ... with 72 more rows
vis_miss(df)

  • Only four columns contain missing values, and none of them is missing in more than 1.8% of the rows.

  • We will remove those records so that the models are fitted on complete cases.
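
The check-and-remove step described above can be sketched on a toy data frame. This is a minimal base-R illustration, not the report's data: `toy` and its column names are made up for the example, and `colSums(is.na(...))` stands in for the per-column counts that `miss_var_summary()` reports.

```r
# Hypothetical toy data; the column names are illustrative only.
toy <- data.frame(
  metric_a = c(1.2, NA, 3.4, 4.1, 5.0),
  metric_b = c(10, 20, NA, 40, 50),
  label    = c("J", "C", "J", "C", "J")
)

# Missing values per column (count and percentage).
n_miss   <- colSums(is.na(toy))
pct_miss <- 100 * n_miss / nrow(toy)
print(data.frame(n_miss, pct_miss))

# Keep only complete rows, as na.omit(df) does in the report.
toy_complete <- na.omit(toy)
print(dim(toy_complete))
```

Dropping rows is reasonable here because so few are affected; with a larger share of missing values, imputation might be worth considering instead.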

# Remove rows with missing values.
df <- na.omit(df)
# From the structure we see that the columns below are categorical and unique for each row.
# They slow down the algorithms,
# and their relation to the dependent variable is relatively weak.
# We will remove these variables from the dataset
# to obtain more accurate results and more efficient models.
# The id column is removed together with them for the same reason.

df$id <- NULL
df$Authors_Num <- NULL
df$Countries_Num <- NULL
df$Countries_Unique_Num <- NULL

df$Countries_Unique_Count <- NULL
df$Countries_Perc <- NULL
df$Countries_First_Author <- NULL

ALGORITHM AND FEATURE SELECTIONS

Dependent Variable Research (J_or_C)

# XTAB OF J AND C
table(df$J_or_C)
## 
##   C   J 
## 485 497
  • The dependent variable is binary: each record is either J or C.

  • The two classes are nearly balanced (485 C vs. 497 J).

ggplot(df,
        aes(factor(J_or_C))) +
    geom_bar(fill = "coral",
        alpha = 0.5) +
    theme_classic()

Choosing Classification Algorithms

  • We can perform “Logistic Regression” and “KNN”.

  • We can use stepwise methods to pick the important features.

Logistic Regression

  • Creating Logistic Regression with all variables.

  • Recoding dependent J_or_C variable into binary structure.

df_logistic <- df
df_logistic$J_or_C <- ifelse(df_logistic$J_or_C == "J",1,0)
table(df_logistic$J_or_C)
## 
##   0   1 
## 485 497

Checking Correlations with Dependent Variable

# Correlation of every predictor with J_or_C (column 26 is J_or_C itself).

# Most correlations are weak; Quartile (0.74) and numPage (0.45) are the exceptions.


cor(df_logistic[-26],df_logistic$J_or_C)
## Warning in cor(df_logistic[-26], df_logistic$J_or_C): the standard deviation is
## zero
##                                                  [,1]
## FRES_Title                               -0.030776698
## FLESCH_Title                              0.036355645
## numCharTitle_all                          0.050662339
## numCharTitle_onlyAlpha                    0.053562812
## numCharTitle_nonAlpha                     0.040687463
## nonAlphaCharTitle_isExist                 0.011332172
## TTR_Title                                -0.003377063
## numWordTitle                              0.019689673
## numABV_Title                             -0.067194218
## binABV_Title                             -0.073972587
## numSentAbstract                           0.129395358
## numABV_Abstract                          -0.003951528
## binABV_Abstract                          -0.073972587
## PaperAge                                  0.214074054
## numLexVerb                                0.015713515
## numSylGreaThan2                           0.016937051
## numPage                                   0.451179476
## Year                                     -0.214074054
## Cited by                                  0.118646941
## FRES_Abstract                            -0.042862894
## FLESCH_Abstract                           0.004959666
## isFunding                                 0.246225226
## keywordsListedAlpha                       0.206591955
## numKeywords                              -0.166545696
## Quartile                                  0.741636549
## CitationMetric_1                          0.200695011
## CitationMetric_2                          0.109219433
## CitationMetric_3                          0.200628899
## CitationMetric_4_CB                       0.135376418
## CitationMetric_4a_CM2                     0.135376418
## CitationMetric_4b_CM3                     0.222523064
## CitationMetric_5_CB                       0.156181537
## CitationMetric_5a_CM2                     0.156181537
## CitationMetric_5b_CM3                     0.231167913
## Dominant_Topic                           -0.008926470
## single_quote_mark                         0.005762380
## double_quote_mark                        -0.033263488
## exclamation_mark                         -0.014809022
## em_dash_mark                                       NA
## parenthesis_mark                          0.054947625
## plus_mark                                -0.033795629
## comma_mark                                         NA
## hyphen_mark                               0.057948334
## period_mark                               0.078153365
## slash_mark                                0.038813257
## colon_mark                                0.094033629
## semicolon_mark                                     NA
## question_mark                             0.131906706
## square_parenthesis_mark                  -0.035032601
## underscore_mark                                    NA
## curly_parenthesis_mark                    0.031539745
## apostrophe_mark                           0.039907178
## and_mark                                  0.047746319
## backslash_mark                            0.031539745
## equal_mark                               -0.019130078
## presenceColon                             0.082372830
## avgPunctuation                            0.110797407
## numPreposition                            0.115023990
## numFreqLexItems_connectives               0.048923202
## numFreqLexItems_reviews                            NA
## numFreqLexItems_previews                  0.018308762
## numFreqLexItems_action_markers           -0.027478316
## numFreqLexItems_closing                  -0.001444089
## numTitleSubstantiveWordsWoutStopwords     0.014228113
## numTitleSubstantiveWordsWithStopwords     0.007738565
## numAbstractSubstantiveWordsWoutStopwords  0.143464423
## numAbstractSubstantiveWordsWithStopwords  0.139311196
## question_mark_loc                         0.019682779
## question_mark_isExist                     0.026916236
## presenceInitialPosition_a                 0.071255143
## presenceInitialPosition_the               0.014674761
## presenceInitialPosition_a_or_the          0.071475824
## presenceInitialPosition_ing               0.013179398
## numPrepositionBeginning                   0.023389140
  • Most predictors are only weakly correlated with J_or_C; Quartile (0.74) and numPage (0.45) are the exceptions. NA marks columns with zero variance.
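
The NA entries and the "standard deviation is zero" warning in the output above come from columns that take a single value in every row. A minimal sketch of how such columns could be detected and set aside before calling cor() is given below. The data frame `toy` and the vector `target` are hypothetical stand-ins, not the report's data.

```r
# Hypothetical toy predictors; `constant` never changes, like a
# punctuation count that is zero for every record.
toy <- data.frame(
  informative = c(1, 2, 3, 4, 5, 6),
  constant    = rep(0, 6)
)
target <- c(0, 0, 0, 1, 1, 1)

# Identify columns whose standard deviation is zero.
sds      <- sapply(toy, sd)
zero_var <- names(sds)[sds == 0]
print(zero_var)

# Correlate only the columns that actually vary, which avoids the warning.
kept <- toy[, sds > 0, drop = FALSE]
print(cor(kept, target))
```

Constant columns cannot help any classifier either, so removing them before modelling loses nothing.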

Building Logistic Regression Model

model_full_logistic <- glm( J_or_C ~., data = df_logistic, family = binomial)
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
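  • The GLM has been fitted. The warnings indicate that some fitted probabilities are numerically 0 or 1, which typically happens when predictors separate the classes almost perfectly, so the coefficients of this full model should be read with caution.

  • Next we select the important features with a stepwise procedure.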
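
One common cause of the glm.fit warnings shown above is (quasi-)complete separation, where some combination of predictors splits the two classes perfectly. Whether that is what happens in this dataset is an assumption; the sketch below only reproduces the effect on hypothetical toy data where a single predictor `x` separates the outcome `y`.

```r
# Hypothetical toy data in which x separates the classes perfectly.
x <- 1:10
y <- as.integer(x > 5)

# Collect warnings instead of printing them.
msgs <- character(0)
fit <- withCallingHandlers(
  glm(y ~ x, family = binomial),
  warning = function(w) {
    msgs <<- c(msgs, conditionMessage(w))
    invokeRestart("muffleWarning")
  }
)

print(msgs)
# Fitted probabilities are pushed to the extremes of 0 and 1.
print(round(range(fitted(fit)), 6))
```

In such a fit the coefficient estimates grow without bound, so predictions may still be usable while the individual coefficients and their standard errors are not.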

Feature Selection

Stepwise Regression

# Stepwise regression picks the model with the lowest AIC.
# That combination of variables gives the best logistic regression model by this criterion.

#step.model <- model_full_logistic %>% stepAIC(trace = FALSE)

# The full stepwise run above is commented out to save run time.
# The variables it selected are listed below.

df2 <- df_logistic[,c("FRES_Title","FLESCH_Title","numSentAbstract","PaperAge","numLexVerb",
                      "numSylGreaThan2","numPage","FRES_Abstract","FLESCH_Abstract",
                      "numKeywords","Quartile","Dominant_Topic","single_quote_mark",
                      "double_quote_mark","exclamation_mark",
                      "numTitleSubstantiveWordsWoutStopwords",
                      "numTitleSubstantiveWordsWithStopwords",
                      "question_mark_loc","question_mark_isExist",
                      "presenceInitialPosition_ing",
                      "parenthesis_mark","plus_mark","numPrepositionBeginning","J_or_C")]


model_full_logistic2 <- glm( J_or_C ~., data = df2, family = binomial)

step.model2 <- model_full_logistic2 %>% stepAIC(trace = FALSE)
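
To make the selection step concrete, here is a minimal sketch of AIC-based backward elimination on simulated data. It uses step() from base R's stats package in place of MASS::stepAIC so that it runs without extra packages; the simulated data frame `sim` and its variables are assumptions made for the illustration.

```r
# Simulated data: only x1 truly drives the outcome; x2 and x3 are noise.
set.seed(42)
n  <- 300
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
y  <- rbinom(n, 1, plogis(1.5 * x1))
sim <- data.frame(y, x1, x2, x3)

# Full model, then backward elimination by AIC.
full     <- glm(y ~ ., data = sim, family = binomial)
selected <- step(full, direction = "backward", trace = 0)

print(formula(selected))
print(c(full = AIC(full), selected = AIC(selected)))
```

Each step drops the term whose removal lowers AIC the most and stops when no removal helps, so the selected model never has a higher AIC than the full one.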

Performance of the Model

Accuracy
# Predicted classes from the logistic regression match the observed
# J_or_C values with 97.0% accuracy.

# In other words, with the selected predictors the model reproduces
# the observed class 97.0% of the time on the data it was fitted to.


probabilities <- step.model2 %>% predict(df2[-24], type = "response")
predicted.classes <- ifelse(probabilities > 0.5, 1, 0)

# Observed 0/1 labels as a plain vector.
observed.classes <- df2$J_or_C
mean(predicted.classes == observed.classes)
## [1] 0.9704684
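
The accuracy above is measured on the same rows the model was fitted to, which tends to be optimistic. A hold-out split is one common way to check this. The sketch below is an illustration on simulated data, not a re-analysis of the report: the data frame `sim`, the 70/30 split and the helper `accuracy()` are all assumptions.

```r
set.seed(123)
n   <- 400
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim$y <- rbinom(n, 1, plogis(2 * sim$x1 - sim$x2))

# 70/30 split into training and test rows.
train_idx <- sample(seq_len(n), size = round(0.7 * n))
train <- sim[train_idx, ]
test  <- sim[-train_idx, ]

fit <- glm(y ~ x1 + x2, data = train, family = binomial)

# Classify at the 0.5 threshold, as in the report.
accuracy <- function(model, data) {
  p <- predict(model, newdata = data, type = "response")
  mean(ifelse(p > 0.5, 1, 0) == data$y)
}

print(c(train = accuracy(fit, train), test = accuracy(fit, test)))
```

The test figure is the one to quote when comparing models, since it is computed on rows the model has not seen.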
Visuals for the Logistic Model
  • Curve
# Fitted logistic curve of J_or_C against the single predictor avgPunctuation.

plot(as.numeric(df_logistic$J_or_C) ~ avgPunctuation , data = df_logistic,
     col = "darkorange",
     pch = "I", 
     ylim = c(-0.2, 1))

# Reference lines at probabilities 0 and 1 and at the 0.5 decision threshold.
abline(h = 0, lty = 3)
abline(h = 1, lty = 3)
abline(h = 0.5, lty = 2)

# Single-predictor model, stored under its own name so that the
# multi-predictor model_full_logistic2 fitted earlier is not overwritten.
model_punct <- glm(J_or_C ~ avgPunctuation,
                   data = df_logistic,
                   family = "binomial")

curve(predict(model_punct, data.frame(avgPunctuation = x), type = "response"),
      add = TRUE,
      lwd = 3,
      col = "dodgerblue")

ROC Curve
test_ol <- predict(step.model2, newdata = df2, type = "response")
observed.classes2 <- df2$J_or_C

a <- roc(observed.classes2 ~ test_ol, plot = TRUE, print.auc = TRUE)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases

a$auc
## Area under the curve: 0.9966
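
For intuition about what the AUC measures: it equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The sketch below computes it from ranks in base R. The function `auc_rank()` and the four toy scores are hypothetical; this is not how the roc() call above is implemented internally.

```r
# AUC via the rank-sum (Mann-Whitney) identity.
auc_rank <- function(scores, labels) {
  r  <- rank(scores)            # average ranks handle ties
  n1 <- sum(labels == 1)
  n0 <- sum(labels == 0)
  (sum(r[labels == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

scores <- c(0.10, 0.40, 0.35, 0.80)
labels <- c(0, 0, 1, 1)
print(auc_rank(scores, labels))  # 0.75
```

Of the four positive/negative pairs in the toy data, three are ordered correctly, which gives 3/4 = 0.75.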
Confusion Matrix
  • Used to compare the performance of the algorithms.
predicted.classes2 <- as.factor(ifelse(predicted.classes==1,"Yes","No"))
observed.classes3 <- as.factor(ifelse(observed.classes2==1,"Yes","No"))

confusionMatrix(data = predicted.classes2,
                reference = observed.classes3, positive = "Yes")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Yes
##        No  468  12
##        Yes  17 485
##                                           
##                Accuracy : 0.9705          
##                  95% CI : (0.9579, 0.9801)
##     No Information Rate : 0.5061          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.9409          
##                                           
##  Mcnemar's Test P-Value : 0.4576          
##                                           
##             Sensitivity : 0.9759          
##             Specificity : 0.9649          
##          Pos Pred Value : 0.9661          
##          Neg Pred Value : 0.9750          
##              Prevalence : 0.5061          
##          Detection Rate : 0.4939          
##    Detection Prevalence : 0.5112          
##       Balanced Accuracy : 0.9704          
##                                           
##        'Positive' Class : Yes             
## 
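
The headline statistics in the output above follow directly from the four cell counts. As a worked check, the sketch below recomputes them by hand; the variable names are illustrative, and the counts are copied from the matrix.

```r
# Counts copied from the confusion matrix above (positive class "Yes").
tp <- 485; fn <- 12   # reference Yes: predicted Yes / predicted No
fp <- 17;  tn <- 468  # reference No:  predicted Yes / predicted No

accuracy    <- (tp + tn) / (tp + tn + fp + fn)
sensitivity <- tp / (tp + fn)
specificity <- tn / (tn + fp)
ppv         <- tp / (tp + fp)

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity, ppv = ppv), 4))
```

The results agree with the Accuracy, Sensitivity, Specificity and Pos Pred Value lines reported by confusionMatrix().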

KNN METHOD

Data Normalization

# Scale the data column by column with the scale() function.
# KNN is distance-based, so variables with large ranges would otherwise dominate.
# Note that only numeric columns can be scaled.

df.scaled2 <- as.data.frame(scale(df2[-24]))
df.scaled2$J_or_C <- df2$J_or_C
df.scaled2 <- na.omit(df.scaled2)
str(df.scaled2)
## 'data.frame':    982 obs. of  24 variables:
##  $ FRES_Title                           : num  0.413 -0.412 -1.293 0.301 -0.552 ...
##  $ FLESCH_Title                         : num  -0.31515 0.3351 1.1276 0.00997 0.72119 ...
##  $ numSentAbstract                      : num  -0.2603 -0.2603 -0.6187 -0.9771 0.0982 ...
##  $ PaperAge                             : num  1.073 -0.953 1.819 -0.526 -0.206 ...
##  $ numLexVerb                           : num  -0.0881 -0.0881 -1.118 0.9418 1.9718 ...
##  $ numSylGreaThan2                      : num  0.771 0.771 0.1 0.771 1.441 ...
##  $ numPage                              : num  0.266 -0.584 -0.705 -0.584 -1.069 ...
##  $ FRES_Abstract                        : num  0.6392 0.0668 0.322 0.568 -0.3816 ...
##  $ FLESCH_Abstract                      : num  -0.4671 0.5015 -0.5579 -0.346 -0.0131 ...
##  $ numKeywords                          : num  0.4564 -0.0449 -0.7968 -0.5462 -0.7968 ...
##  $ Quartile                             : num  0.33 -0.75 -0.75 -0.75 -0.75 ...
##  $ Dominant_Topic                       : num  1.02 -1.19 1.47 1.02 -0.97 ...
##  $ single_quote_mark                    : num  -0.292 -0.292 -0.292 -0.292 -0.292 ...
##  $ double_quote_mark                    : num  -0.207 -0.207 -0.207 -0.207 -0.207 ...
##  $ exclamation_mark                     : num  -0.0428 -0.0428 -0.0428 -0.0428 -0.0428 ...
##  $ numTitleSubstantiveWordsWoutStopwords: num  0.433 -0.821 -1.238 1.268 0.433 ...
##  $ numTitleSubstantiveWordsWithStopwords: num  0.458 -0.798 -1.217 1.296 0.458 ...
##  $ question_mark_loc                    : num  -0.172 -0.172 -0.172 -0.172 -0.172 ...
##  $ question_mark_isExist                : num  -0.18 -0.18 -0.18 -0.18 -0.18 ...
##  $ presenceInitialPosition_ing          : num  1.882 -0.531 -0.531 -0.531 -0.531 ...
##  $ parenthesis_mark                     : num  0.965 0.965 -0.495 -0.495 -0.495 ...
##  $ plus_mark                            : num  -0.115 -0.115 -0.115 -0.115 -0.115 ...
##  $ numPrepositionBeginning              : num  -0.124 -0.124 -0.124 -0.124 -0.124 ...
##  $ J_or_C                               : num  1 0 0 0 0 0 0 0 0 0 ...
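
What scale() does to each column can be shown in a few lines: it subtracts the column mean and divides by the column standard deviation. The toy data frame below is hypothetical and only serves to compare a manual z-score with the output of scale().

```r
# Hypothetical toy columns on very different scales.
toy <- data.frame(pages = c(4, 8, 12, 20, 6),
                  score = c(0.10, 0.35, 0.20, 0.90, 0.45))

# Manual z-score: subtract the column mean, divide by the column sd.
manual <- as.data.frame(lapply(toy, function(col) (col - mean(col)) / sd(col)))

# scale() does the same thing column by column.
scaled <- as.data.frame(scale(toy))

# Largest difference between the two versions (should be essentially zero).
print(max(abs(as.matrix(manual) - as.matrix(scaled))))
print(round(sapply(scaled, sd), 6))
```

After scaling, every column has mean 0 and standard deviation 1, so each contributes comparably to the distances KNN relies on.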

KNN Model

# Fit KNN models for k = 1 to 30 and keep the most accurate one.

df.scaled2$J_or_C <- as.factor(df.scaled2$J_or_C)

table(df.scaled2$J_or_C)
## 
##   0   1 
## 485 497
modelknn <- train(J_or_C~., data=df.scaled2,
                method="knn",
                tuneGrid=expand.grid(k=1:30))

modelknn
## k-Nearest Neighbors 
## 
## 982 samples
##  23 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 982, 982, 982, 982, 982, 982, ... 
## Resampling results across tuning parameters:
## 
##   k   Accuracy   Kappa    
##    1  0.7758277  0.5517484
##    2  0.7606678  0.5213913
##    3  0.7646328  0.5293783
##    4  0.7719704  0.5442119
##    5  0.7801897  0.5606423
##    6  0.7816779  0.5635686
##    7  0.7902618  0.5807583
##    8  0.7902759  0.5808810
##    9  0.7919121  0.5840294
##   10  0.7945540  0.5893973
##   11  0.7944254  0.5891288
##   12  0.7963657  0.5929346
##   13  0.8007361  0.6017250
##   14  0.8019043  0.6040637
##   15  0.8043639  0.6090050
##   16  0.8015406  0.6034319
##   17  0.8043338  0.6090305
##   18  0.8048107  0.6101427
##   19  0.8037911  0.6082270
##   20  0.8069060  0.6143907
##   21  0.8092181  0.6190312
##   22  0.8105469  0.6217465
##   23  0.8083458  0.6173057
##   24  0.8088214  0.6182303
##   25  0.8090527  0.6186174
##   26  0.8110740  0.6226284
##   27  0.8121362  0.6247965
##   28  0.8112550  0.6230760
##   29  0.8110751  0.6228365
##   30  0.8134211  0.6275656
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 30.
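
The idea behind the tuning table above is to estimate out-of-sample accuracy for each candidate k and keep the best. A minimal sketch of that loop follows. It is a simplification under stated assumptions: it uses leave-one-out cross-validation via knn.cv() from the class package rather than caret's bootstrap resampling, and the built-in iris data stand in for the report's dataset.

```r
library(class)  # recommended package shipped with standard R installations

# Stand-in data: scaled iris measurements and species labels.
x  <- scale(iris[, 1:4])
cl <- iris$Species

# Leave-one-out accuracy for each candidate k.
set.seed(1)  # knn.cv breaks ties at random
ks  <- 1:15
acc <- sapply(ks, function(k) mean(knn.cv(x, cl, k = k) == cl))

print(round(setNames(acc, ks), 3))
best_k <- ks[which.max(acc)]
print(best_k)
```

In the report's output the selected k = 30 is the largest value in the grid and accuracy is still creeping upward there, so a wider grid might be worth trying.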

KNN Model Plot

#Plotting Model.

plot(modelknn)

Confusion Matrix

pred<-predict(modelknn,df.scaled2[-24])
confusionMatrix(as.factor(pred),as.factor(df2$J_or_C))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 467 145
##          1  18 352
##                                           
##                Accuracy : 0.834           
##                  95% CI : (0.8092, 0.8568)
##     No Information Rate : 0.5061          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.669           
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.9629          
##             Specificity : 0.7082          
##          Pos Pred Value : 0.7631          
##          Neg Pred Value : 0.9514          
##              Prevalence : 0.4939          
##          Detection Rate : 0.4756          
##    Detection Prevalence : 0.6232          
##       Balanced Accuracy : 0.8356          
##                                           
##        'Positive' Class : 0               
## 

CONCLUSION FOR BINARY CLASSIFICATION

Comparison of Algorithms

  • On the question of algorithm selection: the dependent variable is binary.

  • We therefore used logistic regression and KNN as classification methods.

  • Stepwise regression on the logistic regression model was used to pick the important variables.

  • On the question of prediction: both logistic regression and KNN were run on the selected variables.

  • Predicted values were compared with the observed J_or_C values and visualized.

  • On the question of algorithm comparison: the confusion matrices show that logistic regression performs better.

  • KNN accuracy: 83.4%; logistic regression accuracy: 97.0%.

  • The confusion matrices also report sensitivity and specificity. Note that the positive class is J (“Yes”) in the logistic regression matrix but 0 (C) in the KNN matrix.

  • Logistic regression sensitivity: 0.9759; KNN sensitivity: 0.9629.

  • Logistic regression specificity: 0.9649; KNN specificity: 0.7082.
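
Because the two confusion matrices in this section use different positive classes ("Yes", i.e. J, for logistic regression and 0, i.e. C, for KNN), their sensitivity and specificity figures are not directly comparable as printed. The sketch below recomputes both with J as the positive class, using the cell counts copied from the two matrices. The object names and the helper `metrics()` are illustrative.

```r
# Counts copied from the two confusion matrices in this section.
# Rows are predictions, columns are the reference classes (C = 0, J = 1).
logit_cm <- matrix(c(468, 17, 12, 485), nrow = 2,
                   dimnames = list(pred = c("C", "J"), ref = c("C", "J")))
knn_cm   <- matrix(c(467, 18, 145, 352), nrow = 2,
                   dimnames = list(pred = c("C", "J"), ref = c("C", "J")))

# Accuracy, sensitivity and specificity with J as the positive class.
metrics <- function(m) {
  c(accuracy    = sum(diag(m)) / sum(m),
    sensitivity = m["J", "J"] / sum(m[, "J"]),
    specificity = m["C", "C"] / sum(m[, "C"]))
}

print(round(rbind(logistic = metrics(logit_cm), knn = metrics(knn_cm)), 4))
```

On this common footing the gap is mainly in how many J records are recovered: KNN labels 145 of the 497 J records as C, while the two models are close on the C records.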

CLASSIFICATION WITH NON-BINARY DEPENDENT VARIABLE

Exploratory Data Analysis

table(df$Quartile)
## 
##   0   1   2   3   4 
## 519 319  91  31  22
prop.table(table(df$Quartile))*100
## 
##         0         1         2         3         4 
## 52.851324 32.484725  9.266802  3.156823  2.240326
ggplot(df,
        aes(factor(Quartile))) +
    geom_bar(fill = "coral",
        alpha = 0.5) +
    theme_classic()

FEATURE SELECTION

# Remove non-informative columns (a no-op if they were already dropped above).


df$id <- NULL
df$Authors_Num <- NULL
df$Countries_Num <- NULL
df$Countries_Unique_Num <- NULL

df$Countries_Unique_Count <- NULL
df$Countries_Perc <- NULL
df$Countries_First_Author <- NULL
df <- na.omit(df)

# Use the information.gain function to select variables.

# information.gain computes an entropy-based score for each predictor;
# we keep the predictors with a positive score for the dependent variable.

# (For the binary classification we used stepwise regression instead.)

q <- information.gain(Quartile~., df)
q <- as.data.frame(q)
q <- tibble::rownames_to_column(q, "VALUE")
q <-q[order(-q$attr_importance),]

q1 <- q %>% filter(attr_importance>0)
q1$VALUE
##  [1] "J_or_C"                "numPage"               "PaperAge"             
##  [4] "Year"                  "Cited by"              "CitationMetric_1"     
##  [7] "CitationMetric_3"      "question_mark"         "CitationMetric_4b_CM3"
## [10] "CitationMetric_5b_CM3" "CitationMetric_4_CB"   "CitationMetric_4a_CM2"
## [13] "CitationMetric_5_CB"   "CitationMetric_5a_CM2"
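
The quantity behind this ranking is information gain: the entropy of the target minus its entropy once the predictor is known. The sketch below implements it in base R for discrete variables only. The helpers `entropy()` and `info_gain()` are hypothetical and are not the implementation behind information.gain(); continuous predictors would need to be binned first, which this sketch leaves out.

```r
# Shannon entropy (in bits) of a discrete vector.
entropy <- function(v) {
  p <- table(v) / length(v)
  p <- p[p > 0]
  -sum(p * log2(p))
}

# Information gain of feature x about target y: H(y) - H(y | x).
info_gain <- function(x, y) {
  h_cond <- sum(sapply(split(y, x), function(part) {
    length(part) / length(y) * entropy(part)
  }))
  entropy(y) - h_cond
}

y             <- c("a", "a", "b", "b")
informative   <- c(1, 1, 2, 2)   # determines y completely
uninformative <- c(1, 2, 1, 2)   # tells us nothing about y

print(info_gain(informative, y))    # 1 bit
print(info_gain(uninformative, y))  # 0 bits
```

A predictor with zero gain leaves the uncertainty about the target unchanged, which is why only variables with a positive score are kept.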
#Scaling Data

df3KNN <- df[,q1$VALUE]
df3KNN$J_or_C <- ifelse(df3KNN$J_or_C == "J",1,0)
df3KNN <- as.data.frame(scale(df3KNN))
df3KNN$Quartile <- df$Quartile

KNN METHOD

# Fit KNN models for k = 1 to 30 and keep the most accurate one.

df3KNN$Quartile <- as.factor(df3KNN$Quartile)

table(df3KNN$Quartile)
## 
##   0   1   2   3   4 
## 519 319  91  31  22
modelknn <- train(Quartile~., data=df3KNN,
                method="knn",
                tuneGrid=expand.grid(k=1:30))

modelknn
## k-Nearest Neighbors 
## 
## 982 samples
##  14 predictor
##   5 classes: '0', '1', '2', '3', '4' 
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 982, 982, 982, 982, 982, 982, ... 
## Resampling results across tuning parameters:
## 
##   k   Accuracy   Kappa    
##    1  0.8082758  0.6838606
##    2  0.8030140  0.6757666
##    3  0.8073025  0.6816296
##    4  0.8081096  0.6815487
##    5  0.8160422  0.6931038
##    6  0.8183412  0.6958800
##    7  0.8246655  0.7058145
##    8  0.8266796  0.7086463
##    9  0.8281881  0.7104673
##   10  0.8319806  0.7167732
##   11  0.8302935  0.7132037
##   12  0.8310505  0.7139109
##   13  0.8305155  0.7125105
##   14  0.8294755  0.7101852
##   15  0.8296722  0.7104142
##   16  0.8320216  0.7140035
##   17  0.8326327  0.7147589
##   18  0.8320532  0.7138247
##   19  0.8307571  0.7114743
##   20  0.8316195  0.7127679
##   21  0.8311658  0.7117304
##   22  0.8284676  0.7069512
##   23  0.8287605  0.7072143
##   24  0.8284572  0.7066616
##   25  0.8298149  0.7087841
##   26  0.8288261  0.7068250
##   27  0.8282081  0.7055289
##   28  0.8269523  0.7031983
##   29  0.8265923  0.7024095
##   30  0.8261297  0.7017050
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 17.

KNN Model Plot

#Plotting Model.

plot(modelknn)

Confusion Matrix

pred<-predict(modelknn,df3KNN[-15])
confusionMatrix(as.factor(pred),as.factor(df3KNN$Quartile))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1   2   3   4
##          0 497   5   4   2   2
##          1  21 307  57  23  16
##          2   1   6  27   4   3
##          3   0   1   1   1   0
##          4   0   0   2   1   1
## 
## Overall Statistics
##                                           
##                Accuracy : 0.8483          
##                  95% CI : (0.8243, 0.8701)
##     No Information Rate : 0.5285          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.7389          
##                                           
##  Mcnemar's Test P-Value : 3.28e-16        
## 
## Statistics by Class:
## 
##                      Class: 0 Class: 1 Class: 2 Class: 3 Class: 4
## Sensitivity            0.9576   0.9624  0.29670 0.032258 0.045455
## Specificity            0.9719   0.8235  0.98429 0.997897 0.996875
## Pos Pred Value         0.9745   0.7241  0.65854 0.333333 0.250000
## Neg Pred Value         0.9534   0.9785  0.93199 0.969356 0.978528
## Prevalence             0.5285   0.3248  0.09267 0.031568 0.022403
## Detection Rate         0.5061   0.3126  0.02749 0.001018 0.001018
## Detection Prevalence   0.5193   0.4318  0.04175 0.003055 0.004073
## Balanced Accuracy      0.9648   0.8930  0.64050 0.515078 0.521165
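
The per-class figures above can be reproduced from the matrix itself, which helps to see why overall accuracy looks good while classes 3 and 4 are almost never recovered. The sketch below rebuilds the matrix from the printed counts; the object names are illustrative.

```r
# Counts copied from the KNN confusion matrix above
# (rows = predicted class, columns = reference class).
classes <- as.character(0:4)
cm <- matrix(c(497,   5,  4,  2,  2,
                21, 307, 57, 23, 16,
                 1,   6, 27,  4,  3,
                 0,   1,  1,  1,  0,
                 0,   0,  2,  1,  1),
             nrow = 5, byrow = TRUE,
             dimnames = list(pred = classes, ref = classes))

accuracy  <- sum(diag(cm)) / sum(cm)
recall    <- diag(cm) / colSums(cm)   # sensitivity per class
precision <- diag(cm) / rowSums(cm)   # positive predictive value per class

print(round(accuracy, 4))
print(round(rbind(recall, precision), 4))
```

Most records from classes 2, 3 and 4 end up in the row for predicted class 1, which is consistent with the strong class imbalance seen in the Quartile frequency table.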

SVM METHOD

model_svm <- svm(Quartile~., df3KNN)
pred <- predict(model_svm, df3KNN)
confusionMatrix(as.factor(pred),as.factor(df3KNN$Quartile))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1   2   3   4
##          0 505   4   3   2   1
##          1  14 311  59  25  18
##          2   0   4  29   3   3
##          3   0   0   0   1   0
##          4   0   0   0   0   0
## 
## Overall Statistics
##                                           
##                Accuracy : 0.8615          
##                  95% CI : (0.8383, 0.8825)
##     No Information Rate : 0.5285          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.7603          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: 0 Class: 1 Class: 2 Class: 3 Class: 4
## Sensitivity            0.9730   0.9749  0.31868 0.032258   0.0000
## Specificity            0.9784   0.8250  0.98878 1.000000   1.0000
## Pos Pred Value         0.9806   0.7283  0.74359 1.000000      NaN
## Neg Pred Value         0.9700   0.9856  0.93425 0.969419   0.9776
## Prevalence             0.5285   0.3248  0.09267 0.031568   0.0224
## Detection Rate         0.5143   0.3167  0.02953 0.001018   0.0000
## Detection Prevalence   0.5244   0.4348  0.03971 0.001018   0.0000
## Balanced Accuracy      0.9757   0.9000  0.65373 0.516129   0.5000
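
With classes this imbalanced, overall accuracy is dominated by classes 0 and 1. One complementary summary is the macro-averaged recall, the plain mean of the per-class sensitivities, which weights every class equally. The sketch below computes it for the SVM matrix above from the printed counts; the choice of this metric is a suggestion, not part of the original analysis.

```r
# Counts copied from the SVM confusion matrix above
# (rows = predicted class, columns = reference class).
classes <- as.character(0:4)
svm_cm <- matrix(c(505,   4,  3,  2,  1,
                    14, 311, 59, 25, 18,
                     0,   4, 29,  3,  3,
                     0,   0,  0,  1,  0,
                     0,   0,  0,  0,  0),
                 nrow = 5, byrow = TRUE,
                 dimnames = list(pred = classes, ref = classes))

accuracy     <- sum(diag(svm_cm)) / sum(svm_cm)
recall       <- diag(svm_cm) / colSums(svm_cm)
macro_recall <- mean(recall)

print(round(c(accuracy = accuracy, macro_recall = macro_recall), 4))
```

The gap between the two numbers reflects that the SVM never predicts class 4 and predicts class 3 only once, even though its overall accuracy is slightly higher than that of KNN.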

CONCLUSION FOR NON-BINARY CLASSIFICATION