Feature selection for each analysis:
For the binary dependent variable, the stepwise method was used.
For the non-binary dependent variable, the information gain method was used.
For the output variable J_or_C (binary), clustering was performed by applying at least 2 clustering algorithms to generate clusters (assume k=2).
The performance of the two algorithms was compared. (1)
For the output variable Quartile (non-binary), clustering was performed by applying at least 2 clustering algorithms to generate clusters (assume k=5).
The performance of the two algorithms was compared. (2)
library(readxl)
# setwd() below sets the working directory to the folder that contains the RMD file (requires RStudio).
setwd(dirname(rstudioapi::getActiveDocumentContext()$path))
df <- read_excel("data2.xlsx")
dim(df)
## [1] 1000 82
See the RMD source for details.
Note that “id” should not be passed to the algorithms.
This variable does not carry any information about the characteristics of a record.
We inspected the data with the head(), str() and summary() functions, but their output was too cluttered in the HTML report, so we removed it.
miss_var_summary(df)
## # A tibble: 82 x 3
## variable n_miss pct_miss
## <chr> <int> <dbl>
## 1 CitationMetric_5a_CM2 18 1.8
## 2 CitationMetric_4a_CM2 17 1.7
## 3 CitationMetric_5_CB 1 0.1
## 4 CitationMetric_5b_CM3 1 0.1
## 5 id 0 0
## 6 Authors_Num 0 0
## 7 Countries_Num 0 0
## 8 Countries_Unique_Num 0 0
## 9 Countries_Unique_Count 0 0
## 10 Countries_Perc 0 0
## # ... with 72 more rows
vis_miss(df)
We can see that there are some rows with missing values.
We will remove those records in order to have more accurate clusters.
# Remove rows with missing (NA) values.
df <- na.omit(df)
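To make the effect of na.omit() concrete, here is a minimal sketch on a small made-up data frame (the column names and values are hypothetical, not taken from data2.xlsx). It counts the missing values per column and reports how many rows the removal drops.

```r
# Toy data frame standing in for the real data (values are made up).
toy <- data.frame(
  id      = 1:6,
  metric  = c(0.5, NA, 1.2, 0.7, NA, 0.9),
  n_pages = c(10, 12, NA, 8, 15, 9)
)

# Count missing values per column before dropping anything.
colSums(is.na(toy))

# Keep only the rows without any missing value.
toy_complete <- na.omit(toy)

# Report how many rows were dropped.
n_dropped <- nrow(toy) - nrow(toy_complete)
cat("Dropped", n_dropped, "of", nrow(toy), "rows\n")
```

Checking the number of dropped rows this way is a cheap guard against losing more data than expected.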
# The structure shows that these columns are categorical and unique for each row.
# They would slow down the clustering algorithms,
# and their relationship with the dependent variable is weak.
# We therefore remove them from the data set.
# This gives more accurate results and more efficient clustering.
# The remaining columns below are removed for the same reason.
df$id <- NULL
df$Authors_Num <- NULL
df$Countries_Num <- NULL
df$Countries_Unique_Num <- NULL
df$Countries_Unique_Count <- NULL
df$Countries_Perc <- NULL
df$Countries_First_Author <- NULL
# XTAB OF J AND C
table(df$J_or_C)
##
## C J
## 485 497
The dependent variable is binary: it takes the value J or C.
The two classes are nearly balanced (485 C vs. 497 J).
ggplot(df,
aes(factor(J_or_C))) +
geom_bar(fill = "coral",
alpha = 0.5) +
theme_classic()
We can apply “Logistic Regression” and “KNN”.
We can use stepwise methods to pick the important features.
First, we fit a logistic regression with all variables.
The dependent variable J_or_C is recoded as binary (J = 1, C = 0).
df_logistic <- df
df_logistic$J_or_C <- ifelse(df_logistic$J_or_C == "J",1,0)
table(df_logistic$J_or_C)
##
## 0 1
## 485 497
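# Correlation of every predictor with J_or_C.
# Most correlations are weak; the main exceptions are Quartile (0.74) and numPage (0.45).
# Columns with zero variance (e.g. em_dash_mark) trigger the warning and give NA below.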
cor(df_logistic[-26],df_logistic$J_or_C)
## Warning in cor(df_logistic[-26], df_logistic$J_or_C): the standard deviation is
## zero
## [,1]
## FRES_Title -0.030776698
## FLESCH_Title 0.036355645
## numCharTitle_all 0.050662339
## numCharTitle_onlyAlpha 0.053562812
## numCharTitle_nonAlpha 0.040687463
## nonAlphaCharTitle_isExist 0.011332172
## TTR_Title -0.003377063
## numWordTitle 0.019689673
## numABV_Title -0.067194218
## binABV_Title -0.073972587
## numSentAbstract 0.129395358
## numABV_Abstract -0.003951528
## binABV_Abstract -0.073972587
## PaperAge 0.214074054
## numLexVerb 0.015713515
## numSylGreaThan2 0.016937051
## numPage 0.451179476
## Year -0.214074054
## Cited by 0.118646941
## FRES_Abstract -0.042862894
## FLESCH_Abstract 0.004959666
## isFunding 0.246225226
## keywordsListedAlpha 0.206591955
## numKeywords -0.166545696
## Quartile 0.741636549
## CitationMetric_1 0.200695011
## CitationMetric_2 0.109219433
## CitationMetric_3 0.200628899
## CitationMetric_4_CB 0.135376418
## CitationMetric_4a_CM2 0.135376418
## CitationMetric_4b_CM3 0.222523064
## CitationMetric_5_CB 0.156181537
## CitationMetric_5a_CM2 0.156181537
## CitationMetric_5b_CM3 0.231167913
## Dominant_Topic -0.008926470
## single_quote_mark 0.005762380
## double_quote_mark -0.033263488
## exclamation_mark -0.014809022
## em_dash_mark NA
## parenthesis_mark 0.054947625
## plus_mark -0.033795629
## comma_mark NA
## hyphen_mark 0.057948334
## period_mark 0.078153365
## slash_mark 0.038813257
## colon_mark 0.094033629
## semicolon_mark NA
## question_mark 0.131906706
## square_parenthesis_mark -0.035032601
## underscore_mark NA
## curly_parenthesis_mark 0.031539745
## apostrophe_mark 0.039907178
## and_mark 0.047746319
## backslash_mark 0.031539745
## equal_mark -0.019130078
## presenceColon 0.082372830
## avgPunctuation 0.110797407
## numPreposition 0.115023990
## numFreqLexItems_connectives 0.048923202
## numFreqLexItems_reviews NA
## numFreqLexItems_previews 0.018308762
## numFreqLexItems_action_markers -0.027478316
## numFreqLexItems_closing -0.001444089
## numTitleSubstantiveWordsWoutStopwords 0.014228113
## numTitleSubstantiveWordsWithStopwords 0.007738565
## numAbstractSubstantiveWordsWoutStopwords 0.143464423
## numAbstractSubstantiveWordsWithStopwords 0.139311196
## question_mark_loc 0.019682779
## question_mark_isExist 0.026916236
## presenceInitialPosition_a 0.071255143
## presenceInitialPosition_the 0.014674761
## presenceInitialPosition_a_or_the 0.071475824
## presenceInitialPosition_ing 0.013179398
## numPrepositionBeginning 0.023389140
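The warning and the NA entries above come from columns whose standard deviation is zero. A minimal sketch of one way to filter such columns before calling cor() follows; the data frame and its column names are made up for illustration and are not the report's data.

```r
# Toy data: `const_mark` never varies, like the columns that gave NA above.
set.seed(1)
toy <- data.frame(
  num_page   = rnorm(50, mean = 10, sd = 3),
  const_mark = rep(0, 50),
  target     = rbinom(50, size = 1, prob = 0.5)
)

predictors <- toy[, setdiff(names(toy), "target")]

# Flag predictors whose standard deviation is zero.
zero_var <- vapply(predictors, function(col) sd(col) == 0, logical(1))
names(predictors)[zero_var]

# Correlate only the remaining predictors with the target.
cors <- cor(predictors[, !zero_var, drop = FALSE], toy$target)
cors
```

Selecting the predictors by name rather than by position also avoids depending on the column index of the target.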
model_full_logistic <- glm( J_or_C ~., data = df_logistic, family = binomial)
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
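The full GLM has been fitted, but the warnings show that it did not converge and that some fitted probabilities are numerically 0 or 1, which usually points to (quasi-)complete separation. Its coefficients should therefore be read with caution.
Next, we select the important features with a stepwise model.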
# Stepwise regression keeps the model with the lowest AIC.
# That combination of variables gives the best logistic regression model by AIC.
#step.model <- model_full_logistic %>% stepAIC(trace = FALSE)
# The step.model call above is commented out to save run time.
# The variables it selected are hard-coded below.
df2 <- df_logistic[,c("FRES_Title","FLESCH_Title","numSentAbstract","PaperAge","numLexVerb",
"numSylGreaThan2","numPage","FRES_Abstract","FLESCH_Abstract",
"numKeywords","Quartile","Dominant_Topic","single_quote_mark",
"double_quote_mark","exclamation_mark",
"numTitleSubstantiveWordsWoutStopwords",
"numTitleSubstantiveWordsWithStopwords",
"question_mark_loc","question_mark_isExist",
"presenceInitialPosition_ing",
"parenthesis_mark","plus_mark","numPrepositionBeginning","J_or_C")]
model_full_logistic2 <- glm( J_or_C ~., data = df2, family = binomial)
step.model2 <- model_full_logistic2 %>% stepAIC(trace = FALSE)
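As an illustration of AIC-based stepwise selection, here is a minimal sketch on simulated data. It uses base R's step() rather than MASS::stepAIC(), and the variables x1, x2, noise1 and noise2 are hypothetical. The only property it relies on is that backward elimination never returns a model with a higher AIC than the starting model.

```r
# Simulated data (made up): only x1 and x2 drive the binary outcome.
set.seed(42)
n <- 300
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n),
                  noise1 = rnorm(n), noise2 = rnorm(n))
sim$y <- rbinom(n, size = 1, prob = plogis(1.0 * sim$x1 - 0.8 * sim$x2))

# Full model with every predictor.
full_fit <- glm(y ~ ., data = sim, family = binomial)

# Backward elimination by AIC with base R's step().
step_fit <- step(full_fit, direction = "backward", trace = 0)

# Compare the two models.
formula(step_fit)
c(full = AIC(full_fit), stepwise = AIC(step_fit))
```

Which noise variables are dropped depends on the simulated sample, so the selected formula should be inspected rather than assumed.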
# In-sample check: the classes predicted by the logistic regression
# match the observed J_or_C values 97.0% of the time.
# In other words, with the selected predictors the model reproduces
# the observed class for 97.0% of the rows it was fitted on.
probabilities <- step.model2 %>% predict(df2[-24], type = "response")
predicted.classes <- ifelse(probabilities > 0.5, 1, 0)
observed.classes <- df_logistic$J_or_C
mean(predicted.classes == observed.classes)
## [1] 0.9704684
#Logistic regression curve by dependent J_or_C and
#independent "avgPunctuation" variable.
plot(as.numeric(df_logistic$J_or_C) ~ avgPunctuation , data = df_logistic,
col = "darkorange",
pch = "I",
ylim = c(-0.2, 1))
abline(h = 0, lty = 3)
abline(h = 1, lty = 3)
abline(h = 0.5, lty = 2)
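# Single-predictor model used only for this plot; it gets its own name
# so that the earlier model_full_logistic2 is not overwritten.
model_punct <- glm(J_or_C ~ avgPunctuation,
data = df_logistic,
family = "binomial")
curve(predict(model_punct, data.frame(avgPunctuation = x), type = "response"),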
add = TRUE,
lwd = 3,
col = "dodgerblue")
test_ol <- predict(step.model2, newdata = df2, type = "response")
observed.classes2 <- df2$J_or_C
a <- roc(observed.classes2 ~ test_ol, plot = TRUE, print.auc = TRUE)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
a$auc
## Area under the curve: 0.9966
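For readers who want to see what the AUC measures, the sketch below computes it from the rank-sum (Mann-Whitney) identity in base R. The function name auc_rank and the example values are hypothetical; this is not how pROC is implemented internally, only an equivalent calculation for 0/1 labels.

```r
# AUC via the rank-sum (Mann-Whitney) identity.
# `scores` are predicted probabilities, `labels` are 0/1 observed classes.
auc_rank <- function(scores, labels) {
  stopifnot(length(scores) == length(labels), all(labels %in% c(0, 1)))
  r  <- rank(scores)             # average ranks handle ties
  n1 <- sum(labels == 1)
  n0 <- sum(labels == 0)
  (sum(r[labels == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Made-up example values.
scores <- c(0.10, 0.40, 0.35, 0.80)
labels <- c(0, 0, 1, 1)
auc_rank(scores, labels)
```

The result is the share of (positive, negative) pairs in which the positive case receives the higher score, here 3 out of 4 pairs.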
predicted.classes2 <- as.factor(ifelse(predicted.classes==1,"Yes","No"))
observed.classes3 <- as.factor(ifelse(observed.classes2==1,"Yes","No"))
confusionMatrix(data = predicted.classes2,
reference = observed.classes3, positive = "Yes")
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 468 12
## Yes 17 485
##
## Accuracy : 0.9705
## 95% CI : (0.9579, 0.9801)
## No Information Rate : 0.5061
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9409
##
## Mcnemar's Test P-Value : 0.4576
##
## Sensitivity : 0.9759
## Specificity : 0.9649
## Pos Pred Value : 0.9661
## Neg Pred Value : 0.9750
## Prevalence : 0.5061
## Detection Rate : 0.4939
## Detection Prevalence : 0.5112
## Balanced Accuracy : 0.9704
##
## 'Positive' Class : Yes
##
# Standardise each numeric column with scale() (mean 0, standard deviation 1).
# KNN is distance-based, so features on larger scales would otherwise dominate.
# Note that only numeric columns can be scaled.
df.scaled2 <- as.data.frame(scale(df2[-24]))
df.scaled2$J_or_C <- df2$J_or_C
df.scaled2 <- na.omit(df.scaled2)
str(df.scaled2)
## 'data.frame': 982 obs. of 24 variables:
## $ FRES_Title : num 0.413 -0.412 -1.293 0.301 -0.552 ...
## $ FLESCH_Title : num -0.31515 0.3351 1.1276 0.00997 0.72119 ...
## $ numSentAbstract : num -0.2603 -0.2603 -0.6187 -0.9771 0.0982 ...
## $ PaperAge : num 1.073 -0.953 1.819 -0.526 -0.206 ...
## $ numLexVerb : num -0.0881 -0.0881 -1.118 0.9418 1.9718 ...
## $ numSylGreaThan2 : num 0.771 0.771 0.1 0.771 1.441 ...
## $ numPage : num 0.266 -0.584 -0.705 -0.584 -1.069 ...
## $ FRES_Abstract : num 0.6392 0.0668 0.322 0.568 -0.3816 ...
## $ FLESCH_Abstract : num -0.4671 0.5015 -0.5579 -0.346 -0.0131 ...
## $ numKeywords : num 0.4564 -0.0449 -0.7968 -0.5462 -0.7968 ...
## $ Quartile : num 0.33 -0.75 -0.75 -0.75 -0.75 ...
## $ Dominant_Topic : num 1.02 -1.19 1.47 1.02 -0.97 ...
## $ single_quote_mark : num -0.292 -0.292 -0.292 -0.292 -0.292 ...
## $ double_quote_mark : num -0.207 -0.207 -0.207 -0.207 -0.207 ...
## $ exclamation_mark : num -0.0428 -0.0428 -0.0428 -0.0428 -0.0428 ...
## $ numTitleSubstantiveWordsWoutStopwords: num 0.433 -0.821 -1.238 1.268 0.433 ...
## $ numTitleSubstantiveWordsWithStopwords: num 0.458 -0.798 -1.217 1.296 0.458 ...
## $ question_mark_loc : num -0.172 -0.172 -0.172 -0.172 -0.172 ...
## $ question_mark_isExist : num -0.18 -0.18 -0.18 -0.18 -0.18 ...
## $ presenceInitialPosition_ing : num 1.882 -0.531 -0.531 -0.531 -0.531 ...
## $ parenthesis_mark : num 0.965 0.965 -0.495 -0.495 -0.495 ...
## $ plus_mark : num -0.115 -0.115 -0.115 -0.115 -0.115 ...
## $ numPrepositionBeginning : num -0.124 -0.124 -0.124 -0.124 -0.124 ...
## $ J_or_C : num 1 0 0 0 0 0 0 0 0 0 ...
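A quick way to confirm what scale() does is to check the column means and standard deviations afterwards. The sketch below uses a small made-up data frame; the column names are only illustrative.

```r
# Made-up columns on very different scales.
toy <- data.frame(
  num_page  = c(4, 12, 9, 20, 7),
  paper_age = c(1, 3, 2, 10, 5),
  flesch    = c(35.2, 48.9, 20.1, 60.3, 41.7)
)

# scale() centres each column and divides by its standard deviation.
toy_scaled <- as.data.frame(scale(toy))

# Every column now has mean ~0 and standard deviation 1.
round(colMeans(toy_scaled), 10)
apply(toy_scaled, 2, sd)
```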
# Fit KNN models for k = 1 to 30; caret picks the k with the best resampled accuracy.
df.scaled2$J_or_C <- as.factor(df.scaled2$J_or_C)
table(df.scaled2$J_or_C)
##
## 0 1
## 485 497
modelknn <- train(J_or_C~., data=df.scaled2,
method="knn",
tuneGrid=expand.grid(k=1:30))
modelknn
## k-Nearest Neighbors
##
## 982 samples
## 23 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 982, 982, 982, 982, 982, 982, ...
## Resampling results across tuning parameters:
##
## k Accuracy Kappa
## 1 0.7758277 0.5517484
## 2 0.7606678 0.5213913
## 3 0.7646328 0.5293783
## 4 0.7719704 0.5442119
## 5 0.7801897 0.5606423
## 6 0.7816779 0.5635686
## 7 0.7902618 0.5807583
## 8 0.7902759 0.5808810
## 9 0.7919121 0.5840294
## 10 0.7945540 0.5893973
## 11 0.7944254 0.5891288
## 12 0.7963657 0.5929346
## 13 0.8007361 0.6017250
## 14 0.8019043 0.6040637
## 15 0.8043639 0.6090050
## 16 0.8015406 0.6034319
## 17 0.8043338 0.6090305
## 18 0.8048107 0.6101427
## 19 0.8037911 0.6082270
## 20 0.8069060 0.6143907
## 21 0.8092181 0.6190312
## 22 0.8105469 0.6217465
## 23 0.8083458 0.6173057
## 24 0.8088214 0.6182303
## 25 0.8090527 0.6186174
## 26 0.8110740 0.6226284
## 27 0.8121362 0.6247965
## 28 0.8112550 0.6230760
## 29 0.8110751 0.6228365
## 30 0.8134211 0.6275656
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 30.
#Plotting Model.
plot(modelknn)
pred<-predict(modelknn,df.scaled2[-24])
confusionMatrix(as.factor(pred),as.factor(df2$J_or_C))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 467 145
## 1 18 352
##
## Accuracy : 0.834
## 95% CI : (0.8092, 0.8568)
## No Information Rate : 0.5061
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.669
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9629
## Specificity : 0.7082
## Pos Pred Value : 0.7631
## Neg Pred Value : 0.9514
## Prevalence : 0.4939
## Detection Rate : 0.4756
## Detection Prevalence : 0.6232
## Balanced Accuracy : 0.8356
##
## 'Positive' Class : 0
##
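To show the mechanics behind KNN, here is a minimal base-R sketch of the prediction step: Euclidean distance plus a majority vote. The function knn_predict is hypothetical and is not caret's implementation; the example runs on the built-in iris data rather than on the report's data.

```r
# Predict one class label per row of `test_x` by majority vote among
# the k nearest training rows (Euclidean distance).
knn_predict <- function(train_x, train_y, test_x, k = 5) {
  train_x <- as.matrix(train_x)
  test_x  <- as.matrix(test_x)
  train_y <- as.character(train_y)
  vapply(seq_len(nrow(test_x)), function(i) {
    # Squared distances from test row i to every training row.
    d <- rowSums(sweep(train_x, 2, test_x[i, ])^2)
    nearest <- order(d)[seq_len(k)]
    votes <- table(train_y[nearest])
    names(votes)[which.max(votes)]   # ties go to the first label
  }, character(1))
}

# Example on the built-in iris data, scaled first.
set.seed(123)
x <- scale(iris[, 1:4])
y <- iris$Species
test_idx <- sample(nrow(iris), 30)

pred <- knn_predict(x[-test_idx, ], y[-test_idx], x[test_idx, ], k = 5)
mean(pred == as.character(y[test_idx]))
```

Because the example scores rows that were held out of the training set, its accuracy is not inflated by a point being its own neighbour.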
For Question Selection of Algorithm: The dependent variable is binary.
We therefore used logistic regression and KNN as the two methods.
The important variables were picked with stepwise regression on the logistic regression model.
For Question Prediction: Both logistic regression and KNN were run on the selected variables.
Predicted values and observed J_or_C values were compared and visualized.
For Question Algorithm Comparison: The confusion matrices show that logistic regression performs better.
KNN accuracy: 83.4%, Logistic Regression accuracy: 97.0%
The confusion matrices also report sensitivity and specificity. Note that the positive class is J (“Yes”) for logistic regression but C (“0”) for KNN, so the figures below are not on the same basis.
Log Reg sensitivity: 0.9759 while KNN sensitivity: 0.9629
Log Reg specificity: 0.9649 while KNN specificity: 0.7082
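The two confusionMatrix() outputs use different positive classes (“Yes”, i.e. J, for logistic regression and “0”, i.e. C, for KNN), so their sensitivity and specificity are not directly comparable. The sketch below recomputes both from the reported counts with J as the positive class in each case; the helper binary_metrics is a hypothetical name.

```r
# Sensitivity, specificity and accuracy from the four cells of a 2x2 table.
binary_metrics <- function(tp, fn, fp, tn) {
  c(sensitivity = tp / (tp + fn),
    specificity = tn / (tn + fp),
    accuracy    = (tp + tn) / (tp + fn + fp + tn))
}

# Counts copied from the confusion matrices above, with J (coded 1) as positive.
log_reg <- binary_metrics(tp = 485, fn = 12,  fp = 17, tn = 468)
knn     <- binary_metrics(tp = 352, fn = 145, fp = 18, tn = 467)

round(rbind(log_reg, knn), 4)
```

On this common basis KNN misses far more J records (145 false negatives) than logistic regression does (12).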
table(df$Quartile)
##
## 0 1 2 3 4
## 519 319 91 31 22
prop.table(table(df$Quartile))*100
##
## 0 1 2 3 4
## 52.851324 32.484725 9.266802 3.156823 2.240326
ggplot(df,
aes(factor(Quartile))) +
geom_bar(fill = "coral",
alpha = 0.5) +
theme_classic()
# Remove uninformative columns.
df$id <- NULL
df$Authors_Num <- NULL
df$Countries_Num <- NULL
df$Countries_Unique_Num <- NULL
df$Countries_Unique_Count <- NULL
df$Countries_Perc <- NULL
df$Countries_First_Author <- NULL
df <- na.omit(df)
# Use information.gain() to select the variables.
# It computes an entropy-based score for each variable, so we can keep
# the variables that carry information about the dependent variable.
# (For the binary target we used stepwise regression instead.)
q <- information.gain(Quartile~., df)
q <- as.data.frame(q)
q <- tibble::rownames_to_column(q, "VALUE")
q <-q[order(-q$attr_importance),]
q1 <- q %>% filter(attr_importance>0)
q1$VALUE
## [1] "J_or_C" "numPage" "PaperAge"
## [4] "Year" "Cited by" "CitationMetric_1"
## [7] "CitationMetric_3" "question_mark" "CitationMetric_4b_CM3"
## [10] "CitationMetric_5b_CM3" "CitationMetric_4_CB" "CitationMetric_4a_CM2"
## [13] "CitationMetric_5_CB" "CitationMetric_5a_CM2"
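The idea behind information gain can be sketched by hand for categorical variables: it is the entropy of the target minus the entropy that remains once the feature is known. The functions entropy and info_gain below are hypothetical helpers using base-2 logs; FSelector's information.gain() also discretises numeric columns and may use a different log base, so its scores will not match these numbers.

```r
# Shannon entropy (in bits) of a categorical vector.
entropy <- function(x) {
  p <- table(x) / length(x)
  p <- p[p > 0]
  -sum(p * log2(p))
}

# Information gain of `feature` about `target`:
# H(target) minus the weighted entropy of target within each feature level.
info_gain <- function(feature, target) {
  groups <- split(target, feature)
  weights <- vapply(groups, length, integer(1)) / length(target)
  cond_entropy <- sum(weights * vapply(groups, entropy, numeric(1)))
  entropy(target) - cond_entropy
}

# Made-up example: `type` determines the target, `coin` is unrelated.
target <- c("Q1", "Q1", "Q2", "Q2", "Q1", "Q1", "Q2", "Q2")
type   <- c("J",  "J",  "C",  "C",  "J",  "J",  "C",  "C")
coin   <- c("h",  "t",  "h",  "t",  "h",  "t",  "h",  "t")

c(type = info_gain(type, target), coin = info_gain(coin, target))
```

A feature that fully determines the target recovers all of its entropy (1 bit here), while an unrelated feature scores 0, which is why variables with zero gain can be dropped.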
#Scaling Data
df3KNN <- df[,q1$VALUE]
df3KNN$J_or_C <- ifelse(df3KNN$J_or_C == "J",1,0)
df3KNN <- as.data.frame(scale(df3KNN))
df3KNN$Quartile <- df$Quartile
The information gain method was used.
We removed the variables that give zero information.
14 of the variables carry information about the Quartile variable.
We will apply the KNN and SVM methods.
# Fit KNN models for k = 1 to 30; caret picks the k with the best resampled accuracy.
df3KNN$Quartile <- as.factor(df3KNN$Quartile)
table(df3KNN$Quartile)
##
## 0 1 2 3 4
## 519 319 91 31 22
modelknn <- train(Quartile~., data=df3KNN,
method="knn",
tuneGrid=expand.grid(k=1:30))
modelknn
## k-Nearest Neighbors
##
## 982 samples
## 14 predictor
## 5 classes: '0', '1', '2', '3', '4'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 982, 982, 982, 982, 982, 982, ...
## Resampling results across tuning parameters:
##
## k Accuracy Kappa
## 1 0.8082758 0.6838606
## 2 0.8030140 0.6757666
## 3 0.8073025 0.6816296
## 4 0.8081096 0.6815487
## 5 0.8160422 0.6931038
## 6 0.8183412 0.6958800
## 7 0.8246655 0.7058145
## 8 0.8266796 0.7086463
## 9 0.8281881 0.7104673
## 10 0.8319806 0.7167732
## 11 0.8302935 0.7132037
## 12 0.8310505 0.7139109
## 13 0.8305155 0.7125105
## 14 0.8294755 0.7101852
## 15 0.8296722 0.7104142
## 16 0.8320216 0.7140035
## 17 0.8326327 0.7147589
## 18 0.8320532 0.7138247
## 19 0.8307571 0.7114743
## 20 0.8316195 0.7127679
## 21 0.8311658 0.7117304
## 22 0.8284676 0.7069512
## 23 0.8287605 0.7072143
## 24 0.8284572 0.7066616
## 25 0.8298149 0.7087841
## 26 0.8288261 0.7068250
## 27 0.8282081 0.7055289
## 28 0.8269523 0.7031983
## 29 0.8265923 0.7024095
## 30 0.8261297 0.7017050
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 17.
#Plotting Model.
plot(modelknn)
pred<-predict(modelknn,df3KNN[-15])
confusionMatrix(as.factor(pred),as.factor(df3KNN$Quartile))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1 2 3 4
## 0 497 5 4 2 2
## 1 21 307 57 23 16
## 2 1 6 27 4 3
## 3 0 1 1 1 0
## 4 0 0 2 1 1
##
## Overall Statistics
##
## Accuracy : 0.8483
## 95% CI : (0.8243, 0.8701)
## No Information Rate : 0.5285
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7389
##
## Mcnemar's Test P-Value : 3.28e-16
##
## Statistics by Class:
##
## Class: 0 Class: 1 Class: 2 Class: 3 Class: 4
## Sensitivity 0.9576 0.9624 0.29670 0.032258 0.045455
## Specificity 0.9719 0.8235 0.98429 0.997897 0.996875
## Pos Pred Value 0.9745 0.7241 0.65854 0.333333 0.250000
## Neg Pred Value 0.9534 0.9785 0.93199 0.969356 0.978528
## Prevalence 0.5285 0.3248 0.09267 0.031568 0.022403
## Detection Rate 0.5061 0.3126 0.02749 0.001018 0.001018
## Detection Prevalence 0.5193 0.4318 0.04175 0.003055 0.004073
## Balanced Accuracy 0.9648 0.8930 0.64050 0.515078 0.521165
model_svm <- svm(Quartile~., df3KNN)
pred <- predict(model_svm, df3KNN)
confusionMatrix(as.factor(pred),as.factor(df3KNN$Quartile))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1 2 3 4
## 0 505 4 3 2 1
## 1 14 311 59 25 18
## 2 0 4 29 3 3
## 3 0 0 0 1 0
## 4 0 0 0 0 0
##
## Overall Statistics
##
## Accuracy : 0.8615
## 95% CI : (0.8383, 0.8825)
## No Information Rate : 0.5285
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7603
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 0 Class: 1 Class: 2 Class: 3 Class: 4
## Sensitivity 0.9730 0.9749 0.31868 0.032258 0.0000
## Specificity 0.9784 0.8250 0.98878 1.000000 1.0000
## Pos Pred Value 0.9806 0.7283 0.74359 1.000000 NaN
## Neg Pred Value 0.9700 0.9856 0.93425 0.969419 0.9776
## Prevalence 0.5285 0.3248 0.09267 0.031568 0.0224
## Detection Rate 0.5143 0.3167 0.02953 0.001018 0.0000
## Detection Prevalence 0.5244 0.4348 0.03971 0.001018 0.0000
## Balanced Accuracy 0.9757 0.9000 0.65373 0.516129 0.5000
We can compare KNN and SVM using their confusion matrices.
Accuracy of SVM: 0.8615
Accuracy of KNN: 0.8483
For the “Quartile” dependent variable, the two algorithms perform similarly, with SVM slightly ahead. Both rarely identify classes 2, 3 and 4 correctly (sensitivity of at most 0.32), so the overall accuracy mostly reflects the two largest classes.
Each case is unique, but for this data set KNN did somewhat better on the non-binary dependent variable (accuracy 0.8483 on Quartile) than on the binary one (0.834 on J_or_C).
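Overall accuracy hides how unevenly the classes are recovered. As a minimal sketch, the per-class sensitivity can be recomputed directly from the SVM confusion matrix reported above; only the variable names svm_cm and recall are introduced here.

```r
# SVM confusion matrix copied from the output above
# (rows = predicted class, columns = observed class).
svm_cm <- matrix(
  c(505,   4,  3,  2,  1,
     14, 311, 59, 25, 18,
      0,   4, 29,  3,  3,
      0,   0,  0,  1,  0,
      0,   0,  0,  0,  0),
  nrow = 5, byrow = TRUE,
  dimnames = list(predicted = 0:4, observed = 0:4)
)

# Per-class sensitivity (recall): correct predictions / observed cases.
recall <- diag(svm_cm) / colSums(svm_cm)
round(recall, 4)

# Overall accuracy for comparison.
sum(diag(svm_cm)) / sum(svm_cm)
```

Classes 0 and 1 are recovered almost completely, while class 4 is never predicted, which matches the class imbalance seen in table(df$Quartile).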