# Results {.tabset .tabset-fade .tabset-pills}
The objective of this project is to predict cocaine consumption in the dataset using three classification algorithms and two synthetic balancing techniques. The three classifiers we will train are:
- Logistic Regression
- K-Nearest Neighbors
- Support Vector Machine
Given that the objective is to evaluate the performance of these three classifiers and of the synthetic balancing techniques, we will focus on the classification performance results. We start by loading the R packages used in this project, which include the caret, corrplot, and smotefamily packages.
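A minimal setup chunk (a sketch: beyond the three packages named above, we assume readr, dplyr, ggplot2, and janitor are installed, based on the functions used later in this report).
# Load the packages used throughout the analysis
library(readr)       # read_csv()
library(dplyr)       # pipes and data manipulation
library(ggplot2)     # charts
library(caret)       # createDataPartition(), train(), confusionMatrix()
library(corrplot)    # corrplot.mixed()
library(smotefamily) # SMOTE()
library(janitor)     # tabyl(), make_clean_names()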
# Load the dataset
drugs <- read_csv("drugs_train.csv")
Check the structure of the dataset using the glimpse() function. It helps us better understand our dataset and identify potential data quality issues. Our dataset has 1,500 rows and 21 columns.
glimpse(drugs)
## Rows: 1,500
## Columns: 21
## $ id <chr> "train_0001", "train_0002", "train_0003~
## $ age <chr> "45-54", "25-34", "18-24", "25-34", "18~
## $ gender <chr> "male", "male", "female", "female", "ma~
## $ education <chr> "Masters degree", "University degree", ~
## $ country <chr> "USA", "USA", "USA", "USA", "Australia"~
## $ ethnicity <chr> "Mixed-Black/Asian", "Mixed-Black/Asian~
## $ personality_neuroticism <dbl> 57.6, 47.8, 57.6, 71.8, 56.1, 47.8, 75.~
## $ personality_extraversion <dbl> 57.3, 67.0, 43.3, 31.2, 62.3, 20.7, 69.~
## $ personality_openness <dbl> 50.1, 45.7, 55.3, 43.6, 70.2, 57.8, 62.~
## $ personality_agreeableness <dbl> 47.8, 47.8, 45.6, 56.3, 66.1, 41.2, 54.~
## $ personality_conscientiousness <dbl> 53.7, 56.0, 49.9, 31.8, 42.4, 33.6, 38.~
## $ personality_impulsiveness <dbl> 42.8, 33.8, 63.0, 63.0, 50.4, 56.5, 56.~
## $ personality_sensation <dbl> 22.4, 30.8, 62.0, 71.1, 62.0, 62.0, 22.~
## $ consumption_alcohol <chr> "used in last week", "used in last week~
## $ consumption_amphetamines <chr> "used over a decade ago", "never used",~
## $ consumption_caffeine <chr> "used in last day", "used in last week"~
## $ consumption_cannabis <chr> "used in last week", "never used", "use~
## $ consumption_chocolate <chr> "used in last day", "used in last day",~
## $ consumption_mushrooms <chr> "never used", "never used", "used in la~
## $ consumption_nicotine <chr> "used in last week", "never used", "use~
## $ consumption_cocaine_last_month <chr> "No", "No", "No", "No", "No", "No", "Ye~
Here we will check for missing values and perform data imputation where necessary. We will also convert the character variables in our dataset into factor variables in order to properly apply our models. We use the xtabs() function to check whether each level of an independent variable is represented in both classes ("Yes" and "No") of the binary dependent variable "consumption of cocaine"; levels that appear in only one class can bias our model.
Let's count the missing values in every column of our dataset. Our dataset has no missing values in any column.
colSums(is.na(drugs)) %>%
sort()
## id age
## 0 0
## gender education
## 0 0
## country ethnicity
## 0 0
## personality_neuroticism personality_extraversion
## 0 0
## personality_openness personality_agreeableness
## 0 0
## personality_conscientiousness personality_impulsiveness
## 0 0
## personality_sensation consumption_alcohol
## 0 0
## consumption_amphetamines consumption_caffeine
## 0 0
## consumption_cannabis consumption_chocolate
## 0 0
## consumption_mushrooms consumption_nicotine
## 0 0
## consumption_cocaine_last_month
## 0
In R, categorical variables are represented by 'chr'. There are two types of categorical variables: nominal and ordinal. Nominal variables will be converted to factors using the as.factor() function. Ordinal variables will be converted using the factor() function, with levels specifying the ranking/order. Our dataset has 14 categorical variables.
# Create a variable with categorical variables
drugs_cat_vars <- sapply(drugs, is.character) %>%
which() %>%
names()
sapply(drugs, is.character) %>%
which() %>%
length()
## [1] 14
The age variable will be treated as ordinal. Since the 65+ age group has a small representation of only 14 people, we merge the 55-64 and 65+ groups into a new 55+ age group.
# the xtabs() function is used to show the representation of the variable across the dependent variable
xtabs(~consumption_cocaine_last_month + age,data=drugs)
## age
## consumption_cocaine_last_month 18-24 25-34 35-44 45-54 55-64 65+
## No 450 340 270 228 71 14
## Yes 78 35 8 5 1 0
drugs$age[drugs$age %in% c("55-64", "65+")] <- "55+"
drugs$age <- factor(drugs$age,
levels = c("18-24",
"25-34",
"35-44",
"45-54",
"55+"),
ordered = TRUE)
xtabs(~consumption_cocaine_last_month + age,data=drugs)
## age
## consumption_cocaine_last_month 18-24 25-34 35-44 45-54 55+
## No 450 340 270 228 85
## Yes 78 35 8 5 1
All of the "Left school" levels will be combined into a single new level, "Left school at <18" (left school at 18 or before).
drugs$education[drugs$education %in% c("Left school before 16 years",
"Left school at 16 years",
"Left school at 17 years",
"Left school at 18 years")] <- "Left school at <18"
#converting the variable to factors and levels
drugs$education <- factor(drugs$education,
levels = c("Doctorate degree",
"Masters degree",
"University degree",
"Professional certificate/ diploma",
"Some college or university, no certificate or degree",
#"Left school at 17 or 18 years",
#"Left school at 16 years or below"
"Left school at <18"),
ordered = TRUE)
xtabs(~ consumption_cocaine_last_month + education,data=drugs)
## education
## consumption_cocaine_last_month Doctorate degree Masters degree
## No 61 211
## Yes 5 18
## education
## consumption_cocaine_last_month University degree
## No 351
## Yes 25
## education
## consumption_cocaine_last_month Professional certificate/ diploma
## No 207
## Yes 14
## education
## consumption_cocaine_last_month Some college or university, no certificate or degree
## No 356
## Yes 49
## education
## consumption_cocaine_last_month Left school at <18
## No 187
## Yes 16
The variables below will also be included as ordinal variables in our model, since they can be ordered in a natural way: from "used in last day" at one end down to "never used" at the other.
consumption_cat_vars <- drugs_cat_vars[7:13]
consumption_cat_vars
## [1] "consumption_alcohol" "consumption_amphetamines"
## [3] "consumption_caffeine" "consumption_cannabis"
## [5] "consumption_chocolate" "consumption_mushrooms"
## [7] "consumption_nicotine"
for (variable in consumption_cat_vars) {
drugs[[variable]] <- factor(drugs[[variable]],
levels = c("used in last day",
"used in last week",
"used in last month",
"used in last year",
"used in last decade",
"used over a decade ago",
"never used"),
ordered = TRUE)
}
Canada and Ireland have a small representation of only 5 and 13 people respectively. Canada also has no 'Yes' cases of the dependent variable consumption_cocaine_last_month and Ireland has only one, which could harm the accuracy of our model. Both countries will be merged into the 'Other' category.
xtabs(~consumption_cocaine_last_month + country,data=drugs)
## country
## consumption_cocaine_last_month Australia Canada Ireland New Zealand Other UK
## No 398 5 12 85 40 59
## Yes 62 0 1 9 4 14
## country
## consumption_cocaine_last_month USA
## No 774
## Yes 37
drugs$country[drugs$country== "Canada" | drugs$country=="Ireland"] <- "Other"
xtabs(~consumption_cocaine_last_month + country,data=drugs)
## country
## consumption_cocaine_last_month Australia New Zealand Other UK USA
## No 398 85 57 59 774
## Yes 62 9 5 14 37
The "White" ethnicity is also poorly represented, with only 3 people and no 'Yes' cases in the output variable. It will therefore be merged into the 'Other' category.
xtabs(~consumption_cocaine_last_month + ethnicity,data=drugs)
## ethnicity
## consumption_cocaine_last_month Asian Black Mixed-Black/Asian Mixed-White/Asian
## No 23 21 1265 11
## Yes 2 1 107 4
## ethnicity
## consumption_cocaine_last_month Mixed-White/Black Other White
## No 37 13 3
## Yes 10 3 0
drugs$ethnicity[drugs$ethnicity=="White"] <- "Other"
xtabs(~consumption_cocaine_last_month + ethnicity,data=drugs)
## ethnicity
## consumption_cocaine_last_month Asian Black Mixed-Black/Asian Mixed-White/Asian
## No 23 21 1265 11
## Yes 2 1 107 4
## ethnicity
## consumption_cocaine_last_month Mixed-White/Black Other
## No 37 16
## Yes 10 3
The variables below are nominal variables that have not yet been converted into factors. They will be converted using the as.factor() function.
drugs_nominal_vars <- sapply(drugs, is.character) %>%
which() %>%
names()
drugs_nominal_vars
## [1] "id" "gender"
## [3] "country" "ethnicity"
## [5] "consumption_cocaine_last_month"
for ( variable in drugs_nominal_vars ) {
drugs[[variable]] <- as.factor(drugs[[variable]])
}
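A quick sanity check that no character columns remain after these conversions (a sketch; the result should be 0):
sum(sapply(drugs, is.character))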
We check the number of unique values of each numeric variable to see whether any of them hides an order or is actually categorical. The output shows that all of them are genuinely continuous, so none will be converted into factors.
drugs_numeric_var <- sapply(drugs, is.numeric) %>%
which() %>%
names()
drugs_numeric_var <- drugs_numeric_var[!drugs_numeric_var %in% consumption_cat_vars]
sapply(drugs[, drugs_numeric_var],
function(x)
unique(x) %>%
length()) %>%
sort()
## personality_impulsiveness personality_sensation
## 10 11
## personality_openness personality_agreeableness
## 35 39
## personality_extraversion personality_conscientiousness
## 41 41
## personality_neuroticism
## 49
All categorical variables have been converted to either factors or ordered factors. Numeric variables were left as they are.
#Viewing processed dataset
glimpse(drugs)
## Rows: 1,500
## Columns: 21
## $ id <fct> train_0001, train_0002, train_0003, tra~
## $ age <ord> 45-54, 25-34, 18-24, 25-34, 18-24, 18-2~
## $ gender <fct> male, male, female, female, male, femal~
## $ education <ord> "Masters degree", "University degree", ~
## $ country <fct> USA, USA, USA, USA, Australia, Australi~
## $ ethnicity <fct> Mixed-Black/Asian, Mixed-Black/Asian, M~
## $ personality_neuroticism <dbl> 57.6, 47.8, 57.6, 71.8, 56.1, 47.8, 75.~
## $ personality_extraversion <dbl> 57.3, 67.0, 43.3, 31.2, 62.3, 20.7, 69.~
## $ personality_openness <dbl> 50.1, 45.7, 55.3, 43.6, 70.2, 57.8, 62.~
## $ personality_agreeableness <dbl> 47.8, 47.8, 45.6, 56.3, 66.1, 41.2, 54.~
## $ personality_conscientiousness <dbl> 53.7, 56.0, 49.9, 31.8, 42.4, 33.6, 38.~
## $ personality_impulsiveness <dbl> 42.8, 33.8, 63.0, 63.0, 50.4, 56.5, 56.~
## $ personality_sensation <dbl> 22.4, 30.8, 62.0, 71.1, 62.0, 62.0, 22.~
## $ consumption_alcohol <ord> used in last week, used in last week, u~
## $ consumption_amphetamines <ord> used over a decade ago, never used, nev~
## $ consumption_caffeine <ord> used in last day, used in last week, us~
## $ consumption_cannabis <ord> used in last week, never used, used in ~
## $ consumption_chocolate <ord> used in last day, used in last day, use~
## $ consumption_mushrooms <ord> never used, never used, used in last ye~
## $ consumption_nicotine <ord> used in last week, never used, used in ~
## $ consumption_cocaine_last_month <fct> No, No, No, No, No, No, Yes, No, Yes, N~
We will check for a potential collinearity problem among the numeric variables by looking at a graphical representation of the correlation matrix, produced with the corrplot() function. The graph shows no collinearity between the numeric variables. The most highly correlated pair is personality_impulsiveness and personality_sensation at 0.61, which is well below the usual threshold.
drugs_correlations <-
cor(drugs[, drugs_numeric_var],
use = "pairwise.complete.obs")
corrplot.mixed(drugs_correlations,
upper = "pie",
lower = "number",
tl.col = "black", # color of labels (variable names)
tl.pos = "lt") # position of labels (lt = left and top)
We check the class distribution of the dependent variable consumption_cocaine_last_month with a bar chart. The result shows that our dataset is imbalanced.
ggplot(drugs,
aes(x = `consumption_cocaine_last_month`)) +
geom_bar(stat="count",fill = "lightblue")+
theme_bw()
The data is randomly divided into training and test sets (stratified by class) using the caret package. We set list = FALSE so that we get a vector of row indices (by default, the result would be a list), and then apply this index to split the data into train and test sets.
set.seed(7000)
drugs_train_index <- createDataPartition(drugs$consumption_cocaine_last_month,
p = 0.7,
list = FALSE)
drugs_train <- drugs[drugs_train_index,] #Training set
drugs_test <- drugs[-drugs_train_index,] #test set
# Class balance for train dataset
tabyl(drugs_train$consumption_cocaine_last_month)
## drugs_train$consumption_cocaine_last_month n percent
## No 962 0.91531874
## Yes 89 0.08468126
# Class balance for test dataset
tabyl(drugs_test$consumption_cocaine_last_month)
## drugs_test$consumption_cocaine_last_month n percent
## No 411 0.91536748
## Yes 38 0.08463252
Now that we have split our dataset into training and test sets, let's create synthetically balanced versions of the imbalanced training data. We will do this through caret's sampling option in trainControl(), trying out two different techniques: SMOTE and ROSE. Briefly, SMOTE (Synthetic Minority Over-sampling Technique) creates new minority-class observations by interpolating between existing minority cases and their nearest neighbors, while ROSE (Random Over-Sampling Examples) draws synthetic observations from a smoothed (kernel density) estimate around existing cases.
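To illustrate what SMOTE does outside of train(), here is a standalone sketch using smotefamily::SMOTE() on the numeric personality variables only (the function requires numeric features; within train() the resampling is applied automatically inside each fold):
# generate synthetic minority cases from the numeric predictors
smote_demo <- SMOTE(X = as.data.frame(drugs_train[, drugs_numeric_var]),
target = drugs_train$consumption_cocaine_last_month,
K = 5)
# class distribution after oversampling
table(smote_demo$data$class)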
One of the resampling methods, ROSE, requires that the levels of factor variables (dummy variables after recoding) do not include special characters such as spaces.
# first we store a list of names of variables that are factors
drugs_vars_factors <-
drugs_train %>%
sapply(is.factor) %>%
which() %>%
names()
# We use a for loop to check which variables levels include a space
for(var_ in drugs_vars_factors) {
levels_ <- levels(drugs_train[[var_]])
which_include_spaces <- grep(" ", levels_)
if (length(which_include_spaces) == 0) next else {
message(paste0("variable: ",
var_,
" has a space in levels: "))
print(levels_[which_include_spaces])
}
}
## [1] "Doctorate degree"
## [2] "Masters degree"
## [3] "University degree"
## [4] "Professional certificate/ diploma"
## [5] "Some college or university, no certificate or degree"
## [6] "Left school at <18"
## [1] "New Zealand"
## [1] "used in last day" "used in last week" "used in last month"
## [4] "used in last year" "used in last decade" "used over a decade ago"
## [7] "never used"
## [1] "used in last day" "used in last week" "used in last month"
## [4] "used in last year" "used in last decade" "used over a decade ago"
## [7] "never used"
## [1] "used in last day" "used in last week" "used in last month"
## [4] "used in last year" "used in last decade" "used over a decade ago"
## [7] "never used"
## [1] "used in last day" "used in last week" "used in last month"
## [4] "used in last year" "used in last decade" "used over a decade ago"
## [7] "never used"
## [1] "used in last day" "used in last week" "used in last month"
## [4] "used in last year" "used in last decade" "used over a decade ago"
## [7] "never used"
## [1] "used in last day" "used in last week" "used in last month"
## [4] "used in last year" "used in last decade" "used over a decade ago"
## [7] "never used"
## [1] "used in last day" "used in last week" "used in last month"
## [4] "used in last year" "used in last decade" "used over a decade ago"
## [7] "never used"
# We "correct" the issue by transforming the levels using a function make_clean_names()
# from the janitor package (it replaces all non-standard characters into "_")
# transformation of variables applied to the train dataset
for(var_ in drugs_vars_factors) {
levels(drugs_train[[var_]]) <-
make_clean_names(levels(drugs_train[[var_]]) )
}
# transformation of variables applied to the test dataset
for(var_ in drugs_vars_factors) {
levels(drugs_test[[var_]]) <-
make_clean_names(levels(drugs_test[[var_]]) )
}
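A quick re-check that no factor level still contains a space (a sketch; both lines should return FALSE):
any(grepl(" ", unlist(lapply(drugs_train[drugs_vars_factors], levels))))
any(grepl(" ", unlist(lapply(drugs_test[drugs_vars_factors], levels))))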
The cross-validation applied to the datasets is 5-fold CV. A summary_binary_class() function is also created to calculate performance metrics including F1.
# lets define a summary function to evaluate models with all 5 measured
fiveStats <- function(...) c(twoClassSummary(...),
defaultSummary(...))
ctrl_cv5 <- trainControl(method = "cv",
number = 5,
classProbs = TRUE,
summaryFunction = fiveStats)
# function created to calculate performance metrics including F1
summary_binary_class <- function(predicted_classes,
real,
level_positive = "Yes",
level_negative = "No") {
ctable <- confusionMatrix(as.factor(predicted_classes),
real,
positive = level_positive)
stats <- round(c(ctable$overall[1],
ctable$byClass[c(1:4, 7, 11)]),
5)
return(stats)
}
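For reference, the F1 score reported by this function is the harmonic mean of precision (positive predictive value) and recall (sensitivity):
$$F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$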
Logistic regression, a generalized linear model (GLM), estimates the probability of event success versus event failure. It is based on the sigmoid function, whose output is a probability ranging from 0 to 1. The distribution family we use for the model is "binomial". The model is applied to the original data, the SMOTE-balanced data, and finally the ROSE-balanced data.
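In symbols, for predictors $\mathbf{x}$ the model estimates
$$P(\text{Yes} \mid \mathbf{x}) = \frac{1}{1 + e^{-(\beta_0 + \mathbf{x}^{\top} \boldsymbol{\beta})}},$$
i.e., the linear predictor is passed through the sigmoid to map it into the (0, 1) interval.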
options(contrasts = c("contr.treatment", # for non-ordinal factors
"contr.treatment")) # for ordinal factors
set.seed(7000)
model_glm_orig <-
train(consumption_cocaine_last_month ~ .,
data = drugs_train %>%
dplyr::select(-id),
method = "glm",
family = "binomial",
trControl = ctrl_cv5)
# 6.2 SMOTE model on Logistic Regression
ctrl_cv5$sampling <- "smote"
set.seed(7000)
model_glm_smote <-
train(consumption_cocaine_last_month ~ .,
data = drugs_train %>%
dplyr::select(-id),
method = "glm",
family = "binomial",
trControl = ctrl_cv5)
# 6.3 ROSE model on Logistic Regression
ctrl_cv5$sampling <- "rose"
set.seed(7000)
model_glm_rose <-
train(consumption_cocaine_last_month ~ .,
data = drugs_train %>%
dplyr::select(-id),
method = "glm",
family = "binomial",
trControl = ctrl_cv5)
We now compare the three logistic regression models on the train and test data to find the best one.
# let's compare all models at the validation stage
models_all <- ls(pattern = "model_glm")
models_all
## [1] "model_glm_orig" "model_glm_rose" "model_glm_smote"
# Results for train data
glm_train_final <- models_all %>%
sapply(function(x) get(x) %>%
predict(newdata = drugs_train) %>%
summary_binary_class(level_positive = "yes",
level_negative = "no",
real = drugs_train$consumption_cocaine_last_month)) %>%
# transpose the result to have models in rows
t()
glm_train_final
## Accuracy Sensitivity Specificity Pos Pred Value Neg Pred Value
## model_glm_orig 0.91912 0.21348 0.98441 0.55882 0.93117
## model_glm_rose 0.77069 0.85393 0.76299 0.25000 0.98260
## model_glm_smote 0.82873 0.84270 0.82744 0.31120 0.98272
## F1 Balanced Accuracy
## model_glm_orig 0.30894 0.59895
## model_glm_rose 0.38677 0.80846
## model_glm_smote 0.45455 0.83507
# Results for the TEST data
glm_test_final <- models_all %>%
sapply(function(x) get(x) %>%
predict(newdata = drugs_test) %>%
summary_binary_class(level_positive = "yes",
level_negative = "no",
real = drugs_test$consumption_cocaine_last_month)) %>%
# transpose the result to have models in rows
t()
glm_test_final
## Accuracy Sensitivity Specificity Pos Pred Value Neg Pred Value
## model_glm_orig 0.90423 0.15789 0.97324 0.35294 0.92593
## model_glm_rose 0.74388 0.68421 0.74939 0.20155 0.96250
## model_glm_smote 0.76615 0.60526 0.78102 0.20354 0.95536
## F1 Balanced Accuracy
## model_glm_orig 0.21818 0.56557
## model_glm_rose 0.31138 0.71680
## model_glm_smote 0.30464 0.69314
We will use the train() function from the caret package, which for the KNN method uses the knn() function from the class package. The list of (hyper)parameters required by a specific model can be previewed with the modelLookup() function from caret. The knn() function has a default value of k = 5; however, we will find the optimal value of k and use that instead.
set.seed(7000)
sqrt(nrow(drugs_train))
## [1] 32.41913
different_k <- data.frame(k = seq(1, 40, 1))
ctrl_cvk <- trainControl(method = "cv",
number = 5,
# for roc, spec, sensi
classProbs = TRUE,
summaryFunction = twoClassSummary)
set.seed(7000)
# Imbalanced dataset on KNN Model
model_knn_orig <-
train(`consumption_cocaine_last_month` ~ .,
data = drugs_train %>%
dplyr::select(-id),
method = "knn",
trControl = ctrl_cvk,
tuneGrid = different_k,
preProcess= c("range"),
metric = "ROC")
#The optimal value of k is 40, but k = 23 performs almost as well, so we choose k = 23 instead
modelLookup("knn")
## model parameter label forReg forClass probModel
## 1 knn k #Neighbors TRUE TRUE TRUE
plot(model_knn_orig)
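The tuned optimum can also be read directly from the fitted object (bestTune stores caret's best parameter combination):
model_knn_orig$bestTune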
#We will use k = 23 which is also an odd number
k_value <- data.frame(k = 23)
# 7.2 SMOTE model on KNN Model
ctrl_cv5$sampling <- "smote"
set.seed(7000)
model_knn_smote <-
train(`consumption_cocaine_last_month` ~ .,
data = drugs_train %>%
dplyr::select(-id),
method = "knn",
trControl = ctrl_cv5,
tuneGrid = k_value,
preProcess= c("range"))
# 7.3 ROSE model on KNN Model
ctrl_cv5$sampling <- "rose"
set.seed(7000)
model_knn_rose <-
train(`consumption_cocaine_last_month` ~ .,
data = drugs_train %>%
dplyr::select(-id),
method = "knn",
trControl = ctrl_cv5,
tuneGrid = k_value,
preProcess= c("range"))
We now compare the three KNN models on the train and test data to find the best one.
# let's compare all models at the validation stage
models_knn_all <- ls(pattern = "model_knn")
models_knn_all
## [1] "model_knn_orig" "model_knn_rose" "model_knn_smote"
# Results for train data
knn_train_final <- models_knn_all %>%
sapply(function(x) get(x) %>%
# use it for prediction
# in the TRAIN sample
predict(newdata = drugs_train) %>%
# apply the summary
summary_binary_class(level_positive = "yes",
level_negative = "no",
real = drugs_train$consumption_cocaine_last_month)) %>%
# transpose the result to have models in rows
t()
knn_train_final
## Accuracy Sensitivity Specificity Pos Pred Value Neg Pred Value
## model_knn_orig 0.91532 0.00000 1.00000 NaN 0.91532
## model_knn_rose 0.61370 0.96629 0.58108 0.17587 0.99466
## model_knn_smote 0.54995 0.96629 0.51143 0.15468 0.99394
## F1 Balanced Accuracy
## model_knn_orig NA 0.50000
## model_knn_rose 0.29758 0.77369
## model_knn_smote 0.26667 0.73886
#Results for the TEST data
knn_test_final <- models_knn_all %>%
sapply(function(x) get(x) %>%
# use it for prediction
# in the TEST sample
predict(newdata = drugs_test) %>%
# apply the summary
summary_binary_class(level_positive = "yes",
level_negative = "no",
real = drugs_test$consumption_cocaine_last_month)) %>%
# transpose the result to have models in rows
t()
knn_test_final
## Accuracy Sensitivity Specificity Pos Pred Value Neg Pred Value
## model_knn_orig 0.91537 0.00000 1.00000 NaN 0.91537
## model_knn_rose 0.63474 0.89474 0.61071 0.17526 0.98431
## model_knn_smote 0.56125 0.94737 0.52555 0.15584 0.99083
## F1 Balanced Accuracy
## model_knn_orig NA 0.50000
## model_knn_rose 0.29310 0.75272
## model_knn_smote 0.26766 0.73646
We use the svmRadial method for the Support Vector Machine. The algorithm has two hyperparameters (see the kernel formula below):
- cost C (here starting from the default C = 0.25)
- sigma, the smoothing parameter of the radial basis kernel
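For reference, the radial basis kernel as parameterized by kernlab (the engine behind caret's svmRadial) is
$$K(\mathbf{x}, \mathbf{x}') = \exp\left(-\sigma \, \lVert \mathbf{x} - \mathbf{x}' \rVert^{2}\right),$$
so larger values of sigma make the kernel more local.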
#Imbalanced dataset on SVM Model
ctrl_cv5$sampling <- NULL # reset the sampling option (set to "rose" above), so this model is trained on the original imbalanced data
modelLookup("svmRadial")
## model parameter label forReg forClass probModel
## 1 svmRadial sigma Sigma TRUE TRUE TRUE
## 2 svmRadial C Cost TRUE TRUE TRUE
set.seed(7000)
model_svm_orig <- train(consumption_cocaine_last_month ~ .,
data = drugs_train %>%
dplyr::select(-id),
method = "svmRadial",
trControl = ctrl_cv5,
metric = "ROC")
# SMOTE model on SVM Model
ctrl_cv5$sampling <- "smote"
set.seed(7000)
model_svm_smote <-
train(`consumption_cocaine_last_month` ~ .,
data = drugs_train %>%
dplyr::select(-id),
method = "svmRadial",
trControl = ctrl_cv5,
metric = "ROC")
We now compare the two SVM models on the train and test data to find the best one.
# let's compare all models at the validation stage
models_svm_all <- ls(pattern = "model_svm")
models_svm_all
## [1] "model_svm_orig" "model_svm_smote"
# Results for train data
svm_train_final <- models_svm_all %>%
sapply(function(x) get(x) %>%
predict(newdata = drugs_train) %>%
summary_binary_class(level_positive = "yes",
level_negative = "no",
real = drugs_train$consumption_cocaine_last_month)) %>%
t()
svm_train_final
## Accuracy Sensitivity Specificity Pos Pred Value Neg Pred Value
## model_svm_orig 0.65842 0.94382 0.63202 0.19178 0.99184
## model_svm_smote 0.90105 0.79775 0.91060 0.45223 0.97987
## F1 Balanced Accuracy
## model_svm_orig 0.31879 0.78792
## model_svm_smote 0.57724 0.85418
svm_test_final <- models_svm_all %>%
sapply(function(x) get(x) %>%
predict(newdata = drugs_test) %>%
summary_binary_class(level_positive = "yes",
level_negative = "no",
real = drugs_test$consumption_cocaine_last_month)) %>%
t()
svm_test_final
## Accuracy Sensitivity Specificity Pos Pred Value Neg Pred Value
## model_svm_orig 0.60579 0.84211 0.58394 0.15764 0.97561
## model_svm_smote 0.85746 0.42105 0.89781 0.27586 0.94373
## F1 Balanced Accuracy
## model_svm_orig 0.26556 0.71302
## model_svm_smote 0.33333 0.65943
#We combine all model performances in one data frame using the rbind() function
knn_test <- as.data.frame(knn_test_final)
glm_test <- as.data.frame(glm_test_final)
svm_test <- as.data.frame(svm_test_final)
knn_test["model"] <- c("knn_orig", "knn_rose", "knn_smote")
glm_test["model"] <- c("glm_orig", "glm_rose", "glm_smote")
svm_test["model"] <- c("svm_orig", "svm_smote")
data_compiled <- rbind(knn_test,glm_test,svm_test)
#results
data_compiled
## Accuracy Sensitivity Specificity Pos Pred Value Neg Pred Value
## model_knn_orig 0.91537 0.00000 1.00000 NaN 0.91537
## model_knn_rose 0.63474 0.89474 0.61071 0.17526 0.98431
## model_knn_smote 0.56125 0.94737 0.52555 0.15584 0.99083
## model_glm_orig 0.90423 0.15789 0.97324 0.35294 0.92593
## model_glm_rose 0.74388 0.68421 0.74939 0.20155 0.96250
## model_glm_smote 0.76615 0.60526 0.78102 0.20354 0.95536
## model_svm_orig 0.60579 0.84211 0.58394 0.15764 0.97561
## model_svm_smote 0.85746 0.42105 0.89781 0.27586 0.94373
## F1 Balanced Accuracy model
## model_knn_orig NA 0.50000 knn_orig
## model_knn_rose 0.29310 0.75272 knn_rose
## model_knn_smote 0.26766 0.73646 knn_smote
## model_glm_orig 0.21818 0.56557 glm_orig
## model_glm_rose 0.31138 0.71680 glm_rose
## model_glm_smote 0.30464 0.69314 glm_smote
## model_svm_orig 0.26556 0.71302 svm_orig
## model_svm_smote 0.33333 0.65943 svm_smote
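A ranked view makes the comparison easier to scan (a dplyr sketch sorting by balanced accuracy):
data_compiled %>%
arrange(desc(`Balanced Accuracy`))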
We will compare the balanced accuracy, specificity, recall (sensitivity), and F1 measures on the test data for each of the three algorithms trained on the three training datasets (note that for SVM only the original and SMOTE variants were fitted):
1. Original imbalanced
2. SMOTE balanced
3. ROSE balanced
ggplot(aes(x = reorder(model,-`Balanced Accuracy`),y = `Balanced Accuracy`),data = data_compiled) +
geom_bar(stat = 'identity', fill = 'light blue') +
ggtitle('Comparative Balanced Accuracy on Test Data') +
xlab('Models') +
ylab('Balanced Accuracy')+
geom_text(aes(label = round(`Balanced Accuracy`,2))) + theme_bw() +
theme(axis.text.x = element_text(angle = 40))
ggplot(aes(x = reorder(model,-Specificity),y = Specificity),data = data_compiled) +
geom_bar(stat = 'identity',fill = 'light green') +
ggtitle('Comparative Specificity of Models on Test Data') +
xlab('Models') +
ylab('Specificity Measure')+
geom_text(aes(label = round(Specificity,2))) + theme_bw() +
theme(axis.text.x = element_text(angle = 40))
ggplot(aes(x=reorder(model,-Sensitivity),y = Sensitivity),data = data_compiled) +
geom_bar(stat = 'identity',fill = 'light grey') +
ggtitle('Comparative Recall of Models on Test Data') +
xlab('Models') +
ylab('Recall Measure')+
geom_text(aes(label = round(Sensitivity,2))) + theme_bw() +
theme(axis.text.x = element_text(angle = 40))
ggplot(aes(x=reorder(model,-F1),y = F1),data = data_compiled) +
geom_bar(stat = 'identity',fill = 'light pink') +
ggtitle('Comparative F1 of Models on Test Data') +
xlab('Models') +
ylab('F1 Measure')+
geom_text(aes(label = round(F1,2))) + theme_bw() +
theme(axis.text.x = element_text(angle = 40))
Based on balanced accuracy, the best model was obtained using the K-Nearest Neighbors algorithm with ROSE as the balancing technique for the imbalanced data. The balanced accuracy obtained on the train data was 77.37% and on the test data 75.27%. The closeness of these results shows the consistency of our model's predictions (no overfitting).
The SMOTE balancing technique was the second best, also with the K-Nearest Neighbors algorithm. The balanced accuracy obtained on the train data was 73.89% and on the test data 73.65%, again showing consistent predictions with the SMOTE balancing technique (no overfitting).
It was also noted that the Support Vector Machine with SMOTE and logistic regression with SMOTE predicted well on the train dataset, with balanced accuracies of 85.42% and 83.51% respectively; however, both performed poorly on the test data, with balanced accuracies of 65.94% and 69.31% respectively. Using these models results in overfitting (low error on train data, high error on test data).