# Results {.tabset .tabset-fade .tabset-pills}
The objective of this project is to predict cocaine consumption in the dataset using three classification algorithms and two synthetic balancing techniques. The three classifiers we will train are:
- Logistic Regression
- K-Nearest Neighbors
- Support Vector Machine
Given that the objective is to evaluate the performance of these three classifiers and of the synthetic balancing techniques, we will focus on the classification performance results. We start by loading the R packages used in this project, which include the caret, corrplot, and smotefamily packages.
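A minimal setup chunk (a sketch: beyond the three packages named above, we assume readr, dplyr, ggplot2, and janitor are installed, based on the functions used later in this report).
# Load the packages used throughout the analysis
library(readr)       # read_csv()
library(dplyr)       # pipes and data manipulation
library(ggplot2)     # charts
library(caret)       # createDataPartition(), train(), confusionMatrix()
library(corrplot)    # corrplot.mixed()
library(smotefamily) # SMOTE()
library(janitor)     # tabyl(), make_clean_names()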
# Load the dataset
drugs <- read_csv("drugs_train.csv")
Check the structure of the dataset using the glimpse() function. It helps us better understand our dataset and identify potential data quality issues. Our dataset has 1,500 rows and 21 columns.
glimpse(drugs)
## Rows: 1,500
## Columns: 21
## $ id <chr> "train_0001", "train_0002", "train_0003~
## $ age <chr> "45-54", "25-34", "18-24", "25-34", "18~
## $ gender <chr> "male", "male", "female", "female", "ma~
## $ education <chr> "Masters degree", "University degree", ~
## $ country <chr> "USA", "USA", "USA", "USA", "Australia"~
## $ ethnicity <chr> "Mixed-Black/Asian", "Mixed-Black/Asian~
## $ personality_neuroticism <dbl> 57.6, 47.8, 57.6, 71.8, 56.1, 47.8, 75.~
## $ personality_extraversion <dbl> 57.3, 67.0, 43.3, 31.2, 62.3, 20.7, 69.~
## $ personality_openness <dbl> 50.1, 45.7, 55.3, 43.6, 70.2, 57.8, 62.~
## $ personality_agreeableness <dbl> 47.8, 47.8, 45.6, 56.3, 66.1, 41.2, 54.~
## $ personality_conscientiousness <dbl> 53.7, 56.0, 49.9, 31.8, 42.4, 33.6, 38.~
## $ personality_impulsiveness <dbl> 42.8, 33.8, 63.0, 63.0, 50.4, 56.5, 56.~
## $ personality_sensation <dbl> 22.4, 30.8, 62.0, 71.1, 62.0, 62.0, 22.~
## $ consumption_alcohol <chr> "used in last week", "used in last week~
## $ consumption_amphetamines <chr> "used over a decade ago", "never used",~
## $ consumption_caffeine <chr> "used in last day", "used in last week"~
## $ consumption_cannabis <chr> "used in last week", "never used", "use~
## $ consumption_chocolate <chr> "used in last day", "used in last day",~
## $ consumption_mushrooms <chr> "never used", "never used", "used in la~
## $ consumption_nicotine <chr> "used in last week", "never used", "use~
## $ consumption_cocaine_last_month <chr> "No", "No", "No", "No", "No", "No", "Ye~
Here we will check for missing values and perform data imputation where necessary. We will also convert the character variables in our dataset into factor variables in order to properly apply our models. We use the xtabs() function to check whether each level of an independent variable is represented in both classes ("Yes" and "No") of the binary dependent variable "consumption of cocaine"; levels that appear in only one class can bias our model.
Let's count the missing values in every column of our dataset. Our dataset has no missing values in any column.
colSums(is.na(drugs)) %>%
sort()
## id age
## 0 0
## gender education
## 0 0
## country ethnicity
## 0 0
## personality_neuroticism personality_extraversion
## 0 0
## personality_openness personality_agreeableness
## 0 0
## personality_conscientiousness personality_impulsiveness
## 0 0
## personality_sensation consumption_alcohol
## 0 0
## consumption_amphetamines consumption_caffeine
## 0 0
## consumption_cannabis consumption_chocolate
## 0 0
## consumption_mushrooms consumption_nicotine
## 0 0
## consumption_cocaine_last_month
## 0
In R, categorical variables are represented by 'chr'. There are two types of categorical variables: nominal and ordinal. Nominal variables will be converted to factors using the as.factor() function. Ordinal variables will be converted using the factor() function, with levels specifying the ranking/order. Our dataset has 14 categorical variables.
# Create a variable with categorical variables
drugs_cat_vars <- sapply(drugs, is.character) %>%
which() %>%
names()
sapply(drugs, is.character) %>%
which() %>%
length()
## [1] 14
The age variable will be treated as ordinal. Since the 65+ age group has a small representation of only 14 people, we merge the 55-64 and 65+ groups into a new 55+ age group.
# the xtabs() function is used to show the representation of the variable across the dependent variable
xtabs(~consumption_cocaine_last_month + age,data=drugs)
## age
## consumption_cocaine_last_month 18-24 25-34 35-44 45-54 55-64 65+
## No 450 340 270 228 71 14
## Yes 78 35 8 5 1 0
drugs$age[drugs$age %in% c("55-64", "65+")] <- "55+"
drugs$age <- factor(drugs$age,
levels = c("18-24",
"25-34",
"35-44",
"45-54",
"55+"),
ordered = TRUE)
xtabs(~consumption_cocaine_last_month + age,data=drugs)
## age
## consumption_cocaine_last_month 18-24 25-34 35-44 45-54 55+
## No 450 340 270 228 85
## Yes 78 35 8 5 1
All of the "Left school" levels will be combined into a single new level, "Left school at <18" (left school at 18 or before).
drugs$education[drugs$education %in% c("Left school before 16 years",
"Left school at 16 years",
"Left school at 17 years",
"Left school at 18 years")] <- "Left school at <18"
#converting the variable to factors and levels
drugs$education <- factor(drugs$education,
levels = c("Doctorate degree",
"Masters degree",
"University degree",
"Professional certificate/ diploma",
"Some college or university, no certificate or degree",
#"Left school at 17 or 18 years",
#"Left school at 16 years or below"
"Left school at <18"),
ordered = TRUE)
xtabs(~ consumption_cocaine_last_month + education,data=drugs)
## education
## consumption_cocaine_last_month Doctorate degree Masters degree
## No 61 211
## Yes 5 18
## education
## consumption_cocaine_last_month University degree
## No 351
## Yes 25
## education
## consumption_cocaine_last_month Professional certificate/ diploma
## No 207
## Yes 14
## education
## consumption_cocaine_last_month Some college or university, no certificate or degree
## No 356
## Yes 49
## education
## consumption_cocaine_last_month Left school at <18
## No 187
## Yes 16
The variables below will also be included as ordinal variables in our model, since they can be ordered in a natural way: from "used in last day" at one end down to "never used" at the other.
consumption_cat_vars <- drugs_cat_vars[7:13]
consumption_cat_vars
## [1] "consumption_alcohol" "consumption_amphetamines"
## [3] "consumption_caffeine" "consumption_cannabis"
## [5] "consumption_chocolate" "consumption_mushrooms"
## [7] "consumption_nicotine"
for (variable in consumption_cat_vars) {
drugs[[variable]] <- factor(drugs[[variable]],
levels = c("used in last day",
"used in last week",
"used in last month",
"used in last year",
"used in last decade",
"used over a decade ago",
"never used"),
ordered = TRUE)
}
Canada and Ireland have a small representation of only 5 and 13 people respectively. Canada also has no 'Yes' cases of the dependent variable consumption_cocaine_last_month and Ireland has only one, which could harm the accuracy of our model. Both countries will be merged into the 'Other' category.
xtabs(~consumption_cocaine_last_month + country,data=drugs)
## country
## consumption_cocaine_last_month Australia Canada Ireland New Zealand Other UK
## No 398 5 12 85 40 59
## Yes 62 0 1 9 4 14
## country
## consumption_cocaine_last_month USA
## No 774
## Yes 37
drugs$country[drugs$country== "Canada" | drugs$country=="Ireland"] <- "Other"
xtabs(~consumption_cocaine_last_month + country,data=drugs)
## country
## consumption_cocaine_last_month Australia New Zealand Other UK USA
## No 398 85 57 59 774
## Yes 62 9 5 14 37
The "White" ethnicity is also poorly represented, with only 3 people and no 'Yes' cases in the output variable. It will therefore be merged into the 'Other' category.
xtabs(~consumption_cocaine_last_month + ethnicity,data=drugs)
## ethnicity
## consumption_cocaine_last_month Asian Black Mixed-Black/Asian Mixed-White/Asian
## No 23 21 1265 11
## Yes 2 1 107 4
## ethnicity
## consumption_cocaine_last_month Mixed-White/Black Other White
## No 37 13 3
## Yes 10 3 0
drugs$ethnicity[drugs$ethnicity=="White"] <- "Other"
xtabs(~consumption_cocaine_last_month + ethnicity,data=drugs)
## ethnicity
## consumption_cocaine_last_month Asian Black Mixed-Black/Asian Mixed-White/Asian
## No 23 21 1265 11
## Yes 2 1 107 4
## ethnicity
## consumption_cocaine_last_month Mixed-White/Black Other
## No 37 16
## Yes 10 3
The variables below are nominal variables that have not yet been converted into factors. They will be converted using the as.factor() function.
drugs_nominal_vars <- sapply(drugs, is.character) %>%
which() %>%
names()
drugs_nominal_vars
## [1] "id" "gender"
## [3] "country" "ethnicity"
## [5] "consumption_cocaine_last_month"
for ( variable in drugs_nominal_vars ) {
drugs[[variable]] <- as.factor(drugs[[variable]])
}
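A quick sanity check that no character columns remain after these conversions (a sketch; the result should be 0):
sum(sapply(drugs, is.character))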
We check the number of unique values of each numeric variable to see whether any of them hides an order or is actually categorical. The output shows that all of them are genuinely continuous, so none will be converted into factors.
drugs_numeric_var <- sapply(drugs, is.numeric) %>%
which() %>%
names()
drugs_numeric_var <- drugs_numeric_var[!drugs_numeric_var %in% consumption_cat_vars]
sapply(drugs[, drugs_numeric_var],
function(x)
unique(x) %>%
length()) %>%
sort()
## personality_impulsiveness personality_sensation
## 10 11
## personality_openness personality_agreeableness
## 35 39
## personality_extraversion personality_conscientiousness
## 41 41
## personality_neuroticism
## 49
All categorical variables have been converted to either factors or ordered factors. Numeric variables were left as they are.
#Viewing processed dataset
glimpse(drugs)
## Rows: 1,500
## Columns: 21
## $ id <fct> train_0001, train_0002, train_0003, tra~
## $ age <ord> 45-54, 25-34, 18-24, 25-34, 18-24, 18-2~
## $ gender <fct> male, male, female, female, male, femal~
## $ education <ord> "Masters degree", "University degree", ~
## $ country <fct> USA, USA, USA, USA, Australia, Australi~
## $ ethnicity <fct> Mixed-Black/Asian, Mixed-Black/Asian, M~
## $ personality_neuroticism <dbl> 57.6, 47.8, 57.6, 71.8, 56.1, 47.8, 75.~
## $ personality_extraversion <dbl> 57.3, 67.0, 43.3, 31.2, 62.3, 20.7, 69.~
## $ personality_openness <dbl> 50.1, 45.7, 55.3, 43.6, 70.2, 57.8, 62.~
## $ personality_agreeableness <dbl> 47.8, 47.8, 45.6, 56.3, 66.1, 41.2, 54.~
## $ personality_conscientiousness <dbl> 53.7, 56.0, 49.9, 31.8, 42.4, 33.6, 38.~
## $ personality_impulsiveness <dbl> 42.8, 33.8, 63.0, 63.0, 50.4, 56.5, 56.~
## $ personality_sensation <dbl> 22.4, 30.8, 62.0, 71.1, 62.0, 62.0, 22.~
## $ consumption_alcohol <ord> used in last week, used in last week, u~
## $ consumption_amphetamines <ord> used over a decade ago, never used, nev~
## $ consumption_caffeine <ord> used in last day, used in last week, us~
## $ consumption_cannabis <ord> used in last week, never used, used in ~
## $ consumption_chocolate <ord> used in last day, used in last day, use~
## $ consumption_mushrooms <ord> never used, never used, used in last ye~
## $ consumption_nicotine <ord> used in last week, never used, used in ~
## $ consumption_cocaine_last_month <fct> No, No, No, No, No, No, Yes, No, Yes, N~
We will check for a potential collinearity problem among the numeric variables by looking at a graphical representation of the correlation matrix, produced with the corrplot() function. The graph shows no collinearity between the numeric variables. The most highly correlated pair is personality_impulsiveness and personality_sensation at 0.61, which is well below the usual threshold.
drugs_correlations <-
cor(drugs[, drugs_numeric_var],
use = "pairwise.complete.obs")
corrplot.mixed(drugs_correlations,
upper = "pie",
lower = "number",
tl.col = "black", # color of labels (variable names)
tl.pos = "lt") # position of labels (lt = left and top)
We check the class distribution of the dependent variable consumption_cocaine_last_month with a bar chart. The result shows that our dataset is imbalanced.
ggplot(drugs,
aes(x = `consumption_cocaine_last_month`)) +
geom_bar(stat="count",fill = "lightblue")+
theme_bw()
The data is randomly divided into training and test sets (stratified by class) using the caret package. We set list = FALSE so that we get a vector of row indices (by default, the result would be a list), and then apply this index to split the data into train and test sets.
set.seed(7000)
drugs_train_index <- createDataPartition(drugs$consumption_cocaine_last_month,
p = 0.7,
list = FALSE)
drugs_train <- drugs[drugs_train_index,] #Training set
drugs_test <- drugs[-drugs_train_index,] #test set
# Class balance for train dataset
tabyl(drugs_train$consumption_cocaine_last_month)
## drugs_train$consumption_cocaine_last_month n percent
## No 962 0.91531874
## Yes 89 0.08468126
# Class balance for test dataset
tabyl(drugs_test$consumption_cocaine_last_month)
## drugs_test$consumption_cocaine_last_month n percent
## No 411 0.91536748
## Yes 38 0.08463252
Now that we have split our dataset into training and test sets, let's create synthetically balanced versions of the imbalanced training data. We will do this through caret's sampling option in trainControl(), trying out two different techniques: SMOTE and ROSE. Briefly, SMOTE (Synthetic Minority Over-sampling Technique) creates new minority-class observations by interpolating between existing minority cases and their nearest neighbors, while ROSE (Random Over-Sampling Examples) draws synthetic observations from a smoothed (kernel density) estimate around existing cases.
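To illustrate what SMOTE does outside of train(), here is a standalone sketch using smotefamily::SMOTE() on the numeric personality variables only (the function requires numeric features; within train() the resampling is applied automatically inside each fold):
# generate synthetic minority cases from the numeric predictors
smote_demo <- SMOTE(X = as.data.frame(drugs_train[, drugs_numeric_var]),
target = drugs_train$consumption_cocaine_last_month,
K = 5)
# class distribution after oversampling
table(smote_demo$data$class)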
One of the resampling methods, ROSE, requires that the levels of factor variables (dummy variables after recoding) do not include special characters such as spaces.
# first we store a list of names of variables that are factors
drugs_vars_factors <-
drugs_train %>%
sapply(is.factor) %>%
which() %>%
names()
# We use a for loop to check which variables levels include a space
for(var_ in drugs_vars_factors) {
levels_ <- levels(drugs_train[[var_]])
which_include_spaces <- grep(" ", levels_)
if (length(which_include_spaces) == 0) next else {
message(paste0("variable: ",
var_,
" has a space in levels: "))
print(levels_[which_include_spaces])
}
}
## [1] "Doctorate degree"
## [2] "Masters degree"
## [3] "University degree"
## [4] "Professional certificate/ diploma"
## [5] "Some college or university, no certificate or degree"
## [6] "Left school at <18"
## [1] "New Zealand"
## [1] "used in last day" "used in last week" "used in last month"
## [4] "used in last year" "used in last decade" "used over a decade ago"
## [7] "never used"
## [1] "used in last day" "used in last week" "used in last month"
## [4] "used in last year" "used in last decade" "used over a decade ago"
## [7] "never used"
## [1] "used in last day" "used in last week" "used in last month"
## [4] "used in last year" "used in last decade" "used over a decade ago"
## [7] "never used"
## [1] "used in last day" "used in last week" "used in last month"
## [4] "used in last year" "used in last decade" "used over a decade ago"
## [7] "never used"
## [1] "used in last day" "used in last week" "used in last month"
## [4] "used in last year" "used in last decade" "used over a decade ago"
## [7] "never used"
## [1] "used in last day" "used in last week" "used in last month"
## [4] "used in last year" "used in last decade" "used over a decade ago"
## [7] "never used"
## [1] "used in last day" "used in last week" "used in last month"
## [4] "used in last year" "used in last decade" "used over a decade ago"
## [7] "never used"
# We "correct" the issue by transforming the levels using a function make_clean_names()
# from the janitor package (it replaces all non-standard characters into "_")
# transformation of variables applied to the train dataset
for(var_ in drugs_vars_factors) {
levels(drugs_train[[var_]]) <-
make_clean_names(levels(drugs_train[[var_]]) )
}
# transformation of variables applied to the test dataset
for(var_ in drugs_vars_factors) {
levels(drugs_test[[var_]]) <-
make_clean_names(levels(drugs_test[[var_]]) )
}
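A quick re-check that no factor level still contains a space (a sketch; both lines should return FALSE):
any(grepl(" ", unlist(lapply(drugs_train[drugs_vars_factors], levels))))
any(grepl(" ", unlist(lapply(drugs_test[drugs_vars_factors], levels))))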
The cross-validation applied to the datasets is 5-fold CV. A summary_binary_class() function is also created to calculate performance metrics including F1.
# lets define a summary function to evaluate models with all 5 measured
fiveStats <- function(...) c(twoClassSummary(...),
defaultSummary(...))
ctrl_cv5 <- trainControl(method = "cv",
number = 5,
classProbs = TRUE,
summaryFunction = fiveStats)
# function created to calculate performance metrics including F1
summary_binary_class <- function(predicted_classes,
real,
level_positive = "Yes",
level_negative = "No") {
ctable <- confusionMatrix(as.factor(predicted_classes),
real,
positive = level_positive)
stats <- round(c(ctable$overall[1],
ctable$byClass[c(1:4, 7, 11)]),
5)
return(stats)
}
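For reference, the F1 score reported by this function is the harmonic mean of precision (positive predictive value) and recall (sensitivity):
$$F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$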
Logistic regression, a generalized linear model (GLM), estimates the probability of event success versus event failure. It is based on the sigmoid function, whose output is a probability ranging from 0 to 1. The distribution family we use for the model is "binomial". The model is applied to the original data, the SMOTE-balanced data, and finally the ROSE-balanced data.
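In symbols, for predictors $\mathbf{x}$ the model estimates
$$P(\text{Yes} \mid \mathbf{x}) = \frac{1}{1 + e^{-(\beta_0 + \mathbf{x}^{\top} \boldsymbol{\beta})}},$$
i.e., the linear predictor is passed through the sigmoid to map it into the (0, 1) interval.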
options(contrasts = c("contr.treatment", # for non-ordinal factors
"contr.treatment")) # for ordinal factors
set.seed(7000)
model_glm_orig <-
train(consumption_cocaine_last_month ~ .,
data = drugs_train %>%
dplyr::select(-id),
method = "glm",
family = "binomial",
trControl = ctrl_cv5)
# 6.2 SMOTE model on Logistic Regression
ctrl_cv5$sampling <- "smote"
set.seed(7000)
model_glm_smote <-
train(consumption_cocaine_last_month ~ .,
data = drugs_train %>%
dplyr::select(-id),
method = "glm",
family = "binomial",
trControl = ctrl_cv5)
# 6.3 ROSE model on Logistic Regression
ctrl_cv5$sampling <- "rose"
set.seed(7000)
model_glm_rose <-
train(consumption_cocaine_last_month ~ .,
data = drugs_train %>%
dplyr::select(-id),
method = "glm",
family = "binomial",
trControl = ctrl_cv5)
We now compare the three logistic regression models on the train and test data to find the best one.
# let's compare all models at the validation stage
models_all <- ls(pattern = "model_glm")
models_all
## [1] "model_glm_orig" "model_glm_rose" "model_glm_smote"
# Results for train data
glm_train_final <- models_all %>%
sapply(function(x) get(x) %>%
predict(newdata = drugs_train) %>%
summary_binary_class(level_positive = "yes",
level_negative = "no",
real = drugs_train$consumption_cocaine_last_month)) %>%
# transpose the result to have models in rows
t()
glm_train_final
## Accuracy Sensitivity Specificity Pos Pred Value Neg Pred Value
## model_glm_orig 0.91912 0.21348 0.98441 0.55882 0.93117
## model_glm_rose 0.77069 0.85393 0.76299 0.25000 0.98260
## model_glm_smote 0.82873 0.84270 0.82744 0.31120 0.98272
## F1 Balanced Accuracy
## model_glm_orig 0.30894 0.59895
## model_glm_rose 0.38677 0.80846
## model_glm_smote 0.45455 0.83507
# Results for the TEST data
glm_test_final <- models_all %>%
sapply(function(x) get(x) %>%
predict(newdata = drugs_test) %>%
summary_binary_class(level_positive = "yes",
level_negative = "no",
real = drugs_test$consumption_cocaine_last_month)) %>%
# transpose the result to have models in rows
t()
glm_test_final
## Accuracy Sensitivity Specificity Pos Pred Value Neg Pred Value
## model_glm_orig 0.90423 0.15789 0.97324 0.35294 0.92593
## model_glm_rose 0.74388 0.68421 0.74939 0.20155 0.96250
## model_glm_smote 0.76615 0.60526 0.78102 0.20354 0.95536
## F1 Balanced Accuracy
## model_glm_orig 0.21818 0.56557
## model_glm_rose 0.31138 0.71680
## model_glm_smote 0.30464 0.69314
We will use the train() function from the caret package, which for the KNN method uses the knn() function from the class package. The list of (hyper)parameters required by a specific model can be previewed with the modelLookup() function from caret. The knn() function has a default value of k = 5; however, we will find the optimal value of k and use that instead.
set.seed(7000)
sqrt(nrow(drugs_train))
## [1] 32.41913
different_k <- data.frame(k = seq(1, 40, 1))
ctrl_cvk <- trainControl(method = "cv",
number = 5,
# for roc, spec, sensi
classProbs = TRUE,
summaryFunction = twoClassSummary)
set.seed(7000)
# Imbalanced dataset on KNN Model
model_knn_orig <-
train(`consumption_cocaine_last_month` ~ .,
data = drugs_train %>%
dplyr::select(-id),
method = "knn",
trControl = ctrl_cvk,
tuneGrid = different_k,
preProcess= c("range"),
metric = "ROC")
#The optimal value of k is 40, but k = 23 performs almost as well, so we choose k = 23 instead
modelLookup("knn")
## model parameter label forReg forClass probModel
## 1 knn k #Neighbors TRUE TRUE TRUE
plot(model_knn_orig)
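The tuned optimum can also be read directly from the fitted object (bestTune stores caret's best parameter combination):
model_knn_orig$bestTune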
#We will use k = 23 which is also an odd number
k_value <- data.frame(k = 23)
# 7.2 SMOTE model on KNN Model
ctrl_cv5$sampling <- "smote"
set.seed(7000)
model_knn_smote <-
train(`consumption_cocaine_last_month` ~ .,
data = drugs_train %>%
dplyr::select(-id),
method = "knn",
trControl = ctrl_cv5,
tuneGrid = k_value,
preProcess= c("range"))
# 7.3 ROSE model on KNN Model
ctrl_cv5$sampling <- "rose"
set.seed(7000)
model_knn_rose <-
train(`consumption_cocaine_last_month` ~ .,
data = drugs_train %>%
dplyr::select(-id),
method = "knn",
trControl = ctrl_cv5,
tuneGrid = k_value,
preProcess= c("range"))
We now compare the three KNN models on the train and test data to find the best one.
# let's compare all models at the validation stage
models_knn_all <- ls(pattern = "model_knn")
models_knn_all
## [1] "model_knn_orig" "model_knn_rose" "model_knn_smote"
# Results for train data
knn_train_final <- models_knn_all %>%
sapply(function(x) get(x) %>%
# use it for prediction
# in the TRAIN sample
predict(newdata = drugs_train) %>%
# apply the summary
summary_binary_class(level_positive = "yes",
level_negative = "no",
real = drugs_train$consumption_cocaine_last_month)) %>%
# transpose the result to have models in rows
t()
knn_train_final
## Accuracy Sensitivity Specificity Pos Pred Value Neg Pred Value
## model_knn_orig 0.91532 0.00000 1.00000 NaN 0.91532
## model_knn_rose 0.61370 0.96629 0.58108 0.17587 0.99466
## model_knn_smote 0.54995 0.96629 0.51143 0.15468 0.99394
## F1 Balanced Accuracy
## model_knn_orig NA 0.50000
## model_knn_rose 0.29758 0.77369
## model_knn_smote 0.26667 0.73886
#Results for the TEST data
knn_test_final <- models_knn_all %>%
sapply(function(x) get(x) %>%
# use it for prediction
# in the TEST sample
predict(newdata = drugs_test) %>%
# apply the summary
summary_binary_class(level_positive = "yes",
level_negative = "no",
real = drugs_test$consumption_cocaine_last_month)) %>%
# transpose the result to have models in rows
t()
knn_test_final
## Accuracy Sensitivity Specificity Pos Pred Value Neg Pred Value
## model_knn_orig 0.91537 0.00000 1.00000 NaN 0.91537
## model_knn_rose 0.63474 0.89474 0.61071 0.17526 0.98431
## model_knn_smote 0.56125 0.94737 0.52555 0.15584 0.99083
## F1 Balanced Accuracy
## model_knn_orig NA 0.50000
## model_knn_rose 0.29310 0.75272
## model_knn_smote 0.26766 0.73646
We use the svmRadial method for the Support Vector Machine. The algorithm has two hyperparameters (see the kernel formula below):
- cost C (here starting from the default C = 0.25)
- sigma, the smoothing parameter of the radial basis kernel
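For reference, the radial basis kernel as parameterized by kernlab (the engine behind caret's svmRadial) is
$$K(\mathbf{x}, \mathbf{x}') = \exp\left(-\sigma \, \lVert \mathbf{x} - \mathbf{x}' \rVert^{2}\right),$$
so larger values of sigma make the kernel more local.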
#Imbalanced dataset on SVM Model
ctrl_cv5$sampling <- NULL # reset the sampling option (set to "rose" above), so this model is trained on the original imbalanced data
modelLookup("svmRadial")
## model parameter label forReg forClass probModel
## 1 svmRadial sigma Sigma TRUE TRUE TRUE
## 2 svmRadial C Cost TRUE TRUE TRUE
set.seed(7000)
model_svm_orig <- train(consumption_cocaine_last_month ~ .,
data = drugs_train %>%
dplyr::select(-id),
method = "svmRadial",
trControl = ctrl_cv5,
metric = "ROC")
# SMOTE model on SVM Model
ctrl_cv5$sampling <- "smote"
set.seed(7000)
model_svm_smote <-
train(`consumption_cocaine_last_month` ~ .,
data = drugs_train %>%
dplyr::select(-id),
method = "svmRadial",
trControl = ctrl_cv5,
metric = "ROC")
We now compare the two SVM models on the train and test data to find the best one.
# let's compare all models at the validation stage
models_svm_all <- ls(pattern = "model_svm")
models_svm_all
## [1] "model_svm_orig" "model_svm_smote"
# Results for train data
svm_train_final <- models_svm_all %>%
sapply(function(x) get(x) %>%
predict(newdata = drugs_train) %>%
summary_binary_class(level_positive = "yes",
level_negative = "no",
real = drugs_train$consumption_cocaine_last_month)) %>%
t()
svm_train_final
## Accuracy Sensitivity Specificity Pos Pred Value Neg Pred Value
## model_svm_orig 0.65842 0.94382 0.63202 0.19178 0.99184
## model_svm_smote 0.90105 0.79775 0.91060 0.45223 0.97987
## F1 Balanced Accuracy
## model_svm_orig 0.31879 0.78792
## model_svm_smote 0.57724 0.85418
svm_test_final <- models_svm_all %>%
sapply(function(x) get(x) %>%
predict(newdata = drugs_test) %>%
summary_binary_class(level_positive = "yes",
level_negative = "no",
real = drugs_test$consumption_cocaine_last_month)) %>%
t()
svm_test_final
## Accuracy Sensitivity Specificity Pos Pred Value Neg Pred Value
## model_svm_orig 0.60579 0.84211 0.58394 0.15764 0.97561
## model_svm_smote 0.85746 0.42105 0.89781 0.27586 0.94373
## F1 Balanced Accuracy
## model_svm_orig 0.26556 0.71302
## model_svm_smote 0.33333 0.65943
#We combine all model performances in one data frame using the rbind() function
knn_test <- as.data.frame(knn_test_final)
glm_test <- as.data.frame(glm_test_final)
svm_test <- as.data.frame(svm_test_final)
knn_test["model"] <- c("knn_orig", "knn_rose", "knn_smote")
glm_test["model"] <- c("glm_orig", "glm_rose", "glm_smote")
svm_test["model"] <- c("svm_orig", "svm_smote")
data_compiled <- rbind(knn_test,glm_test,svm_test)
#results
data_compiled
## Accuracy Sensitivity Specificity Pos Pred Value Neg Pred Value
## model_knn_orig 0.91537 0.00000 1.00000 NaN 0.91537
## model_knn_rose 0.63474 0.89474 0.61071 0.17526 0.98431
## model_knn_smote 0.56125 0.94737 0.52555 0.15584 0.99083
## model_glm_orig 0.90423 0.15789 0.97324 0.35294 0.92593
## model_glm_rose 0.74388 0.68421 0.74939 0.20155 0.96250
## model_glm_smote 0.76615 0.60526 0.78102 0.20354 0.95536
## model_svm_orig 0.60579 0.84211 0.58394 0.15764 0.97561
## model_svm_smote 0.85746 0.42105 0.89781 0.27586 0.94373
## F1 Balanced Accuracy model
## model_knn_orig NA 0.50000 knn_orig
## model_knn_rose 0.29310 0.75272 knn_rose
## model_knn_smote 0.26766 0.73646 knn_smote
## model_glm_orig 0.21818 0.56557 glm_orig
## model_glm_rose 0.31138 0.71680 glm_rose
## model_glm_smote 0.30464 0.69314 glm_smote
## model_svm_orig 0.26556 0.71302 svm_orig
## model_svm_smote 0.33333 0.65943 svm_smote
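A ranked view makes the comparison easier to scan (a dplyr sketch sorting by balanced accuracy):
data_compiled %>%
arrange(desc(`Balanced Accuracy`))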
We will compare the balanced accuracy, specificity, recall (sensitivity), and F1 measures on the test data for each of the three algorithms trained on the three training datasets (note that for SVM only the original and SMOTE variants were fitted):
1. Original imbalanced
2. SMOTE balanced
3. ROSE balanced
ggplot(aes(x = reorder(model,-`Balanced Accuracy`),y = `Balanced Accuracy`),data = data_compiled) +
geom_bar(stat = 'identity', fill = 'light blue') +
ggtitle('Comparative Balanced Accuracy on Test Data') +
xlab('Models') +
ylab('Balanced Accuracy')+
geom_text(aes(label = round(`Balanced Accuracy`,2))) + theme_bw() +
theme(axis.text.x = element_text(angle = 40))
ggplot(aes(x = reorder(model,-Specificity),y = Specificity),data = data_compiled) +
geom_bar(stat = 'identity',fill = 'light green') +
ggtitle('Comparative Specificity of Models on Test Data') +
xlab('Models') +
ylab('Specificity Measure')+
geom_text(aes(label = round(Specificity,2))) + theme_bw() +
theme(axis.text.x = element_text(angle = 40))
ggplot(aes(x=reorder(model,-Sensitivity),y = Sensitivity),data = data_compiled) +
geom_bar(stat = 'identity',fill = 'light grey') +
ggtitle('Comparative Recall of Models on Test Data') +
xlab('Models') +
ylab('Recall Measure')+
geom_text(aes(label = round(Sensitivity,2))) + theme_bw() +
theme(axis.text.x = element_text(angle = 40))
ggplot(aes(x=reorder(model,-F1),y = F1),data = data_compiled) +
geom_bar(stat = 'identity',fill = 'light pink') +
ggtitle('Comparative F1 of Models on Test Data') +
xlab('Models') +
ylab('F1 Measure')+
geom_text(aes(label = round(F1,2))) + theme_bw() +
theme(axis.text.x = element_text(angle = 40))
Based on balanced accuracy, the best model was obtained using the K-Nearest Neighbors algorithm with ROSE as the balancing technique for the imbalanced data. The balanced accuracy obtained on the train data was 77.37% and on the test data 75.27%. The closeness of these results shows the consistency of our model's predictions (no overfitting).
The SMOTE balancing technique was the second best, also with the K-Nearest Neighbors algorithm. The balanced accuracy obtained on the train data was 73.89% and on the test data 73.65%, again showing consistent predictions with the SMOTE balancing technique (no overfitting).
It was also noted that the Support Vector Machine with SMOTE and logistic regression with SMOTE predicted well on the train dataset, with balanced accuracies of 85.42% and 83.51% respectively; however, both performed poorly on the test data, with balanced accuracies of 65.94% and 69.31% respectively. Using these models results in overfitting (low error on train data, high error on test data).