Complete the following:
Download “labeled.csv” and “unlabeled.csv” to your folder
Open a new file and SAVE it to the same folder
Set your working directory to the folder
Let’s review:
Steps we will go through today:
Preprocessing (with textual data, the end result is typically a document-term matrix)
Partition the labeled data into the training set and the testing set
Tune and train the models in the training set
Test model performance in the testing set
Apply model in unlabeled data
We will use the caret package in R.
caret is short for “Classification And REgression
Training”, and it provides a uniform interface for hundreds of
supervised machine learning algorithms.
See this comprehensive tutorial, written by the package developer, for all models supported by caret and for additional SML steps such as feature selection: http://topepo.github.io/caret/index.html
#install.packages("caret")
library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
caret loads modeling packages as needed and assumes that they are
installed. If a modeling package is missing, caret will prompt you to
install it. For this tutorial, we install up front the ones needed for the
three algorithms we will cover.
#install.packages("kernlab") # for SVM
#install.packages("naivebayes") # for naive Bayes
#install.packages("ranger") # for random forest
Also load packages for data preprocessing.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ lubridate 1.9.2 ✔ tibble 3.2.1
## ✔ purrr 1.0.1 ✔ tidyr 1.3.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ✖ purrr::lift() masks caret::lift()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tidytext)
labeled_data <- read.csv("labeled.csv")
str(labeled_data)
## 'data.frame': 600 obs. of 3 variables:
## $ id : int 1 2 3 4 5 6 7 8 9 10 ...
## $ content : chr "Because Donald Trump uses his Twitter feed to spread venom, disinformation and lies. Joe Biden doesn't. This is"| __truncated__ "Twitter is in bed with the dims to get rid of Trump. I thought everyone knew that. RT @Real_G2DAZ: It appears t"| __truncated__ "RT @Rapscallianna: @BrandyZadrozny @kim @ArijitDSen Facebook has done more damage than good. It \031s time Face"| __truncated__ "@thehill Twitter should block this it is spreading false information" ...
## $ politics: int 1 0 0 0 1 1 0 0 0 0 ...
We most often want the dependent variable to be a “factor” (categorical).
summary(labeled_data$politics)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.3767 1.0000 1.0000
labeled_data$politics_f <- factor(labeled_data$politics, levels = c(0,1), labels = c("no","yes"))
summary(labeled_data$politics_f)
## no yes
## 374 226
Initial cleaning: remove URLs and digits.
labeled_data$content_clean <- gsub("http\\S+\\s*", "", labeled_data$content)
labeled_data$content[1]
## [1] "Because Donald Trump uses his Twitter feed to spread venom, disinformation and lies. Joe Biden doesn't. This isn't hard ma'am. RT @MarshaBlackburn: Twitter has not censored Joe Biden once. It has censored @realDonaldTrump more than 65 times. https://www.foxnews.com/media/twitter-facebook-have-censored-trump-65-times-compared-to-zero-for-biden-study-says"
labeled_data$content_clean[1]
## [1] "Because Donald Trump uses his Twitter feed to spread venom, disinformation and lies. Joe Biden doesn't. This isn't hard ma'am. RT @MarshaBlackburn: Twitter has not censored Joe Biden once. It has censored @realDonaldTrump more than 65 times. "
labeled_data$content_clean <- gsub('[[:digit:]]+','',labeled_data$content_clean)
Tokenization, tf-idf, and document-term matrix:
labeled_tokens <- labeled_data %>%
  unnest_tokens(word, content_clean) %>%
  anti_join(stop_words, by = "word") %>%
  count(id, word) %>%
  bind_tf_idf(word, id, n) # argument order is (term, document, n)
labeled_tokens$word <- gsub('[[:punct:]]+','',labeled_tokens$word)
labeled_dtm <- labeled_tokens %>%
cast_dtm(id, word, tf_idf)
Now we subset the labeled data into a training set and a testing set.
We first randomly generate indexes for the training set and the testing set:
set.seed(357)
trainIndex <- createDataPartition(labeled_data$politics_f, p = 0.5, list = FALSE, times = 1)
And then use these indexes to subset the document-term matrix:
to_train <- labeled_dtm[trainIndex, ] %>% as.matrix() %>% as.data.frame()
to_test <- labeled_dtm[-trainIndex, ] %>% as.matrix() %>% as.data.frame()
Because the document-term matrix does not contain the labels, we put the labels for the training set in a separate object so we can feed them to the algorithm:
politics_code <- labeled_data$politics_f[trainIndex]
Terms to be familiar with:
K-fold cross validation: https://docs.aws.amazon.com/machine-learning/latest/dg/cross-validation.html
Precision, Recall, Sensitivity, Specificity: https://topepo.github.io/caret/measuring-performance.html
ROC curve, AUC: https://developers.google.com/machine-learning/crash-course/classification/roc-and-auc
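To make k-fold cross-validation concrete before we hand it to caret, here is a small illustrative sketch using caret's createFolds() on the training labels we just created: each document lands in exactly one held-out fold, and the folds roughly preserve the class balance.
# Illustration only: what a 3-fold split of the training labels looks like.
# createFolds() returns, for each fold, the indexes that are held out.
set.seed(123)
folds <- createFolds(politics_code, k = 3)
str(folds)                        # three index vectors of roughly equal size
table(politics_code[folds[[1]]])  # class balance within the first held-out fold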
set.seed(42)
trctrl <- trainControl(method = "cv", # use the cross validation method for resampling
number = 3, # 3-fold cross-validation for the sake of time in class
summaryFunction = twoClassSummary, # evaluate performance using measures specific to two-class problems, such as the area under the ROC curve (AUC), sensitivity and specificity.
classProbs = TRUE, # estimate class probabilities
verboseIter = TRUE) # print a log for training
set.seed(825)
svm_model_plain <- train(y = politics_code,
x = to_train,
method = "svmLinear2",
trControl = trctrl,
scale = FALSE)
## Warning in train.default(y = politics_code, x = to_train, method =
## "svmLinear2", : The metric "Accuracy" was not in the result set. ROC will be
## used instead.
## + Fold1: cost=0.25
## - Fold1: cost=0.25
## + Fold1: cost=0.50
## - Fold1: cost=0.50
## + Fold1: cost=1.00
## - Fold1: cost=1.00
## + Fold2: cost=0.25
## - Fold2: cost=0.25
## + Fold2: cost=0.50
## - Fold2: cost=0.50
## + Fold2: cost=1.00
## - Fold2: cost=1.00
## + Fold3: cost=0.25
## - Fold3: cost=0.25
## + Fold3: cost=0.50
## - Fold3: cost=0.50
## + Fold3: cost=1.00
## - Fold3: cost=1.00
## Aggregating results
## Selecting tuning parameters
## Fitting cost = 0.25 on full training set
One way to tune the model is to use tuneLength, which
tells caret how many candidate values of the main tuning
parameter to try (caret picks the values itself).
The main tuning parameter for a linear SVM is C (cost). It controls how hard/soft we want the boundary to be (larger C -> harder margins): https://stats.stackexchange.com/questions/225409/what-does-the-cost-c-parameter-mean-in-svm
set.seed(825)
svm_model_length <- train(y = politics_code,
x = to_train,
method = "svmLinear2",
trControl = trctrl,
scale = FALSE,
tuneLength = 5) # Try 5 default values
## Warning in train.default(y = politics_code, x = to_train, method =
## "svmLinear2", : The metric "Accuracy" was not in the result set. ROC will be
## used instead.
## + Fold1: cost=0.25
## - Fold1: cost=0.25
## + Fold1: cost=0.50
## - Fold1: cost=0.50
## + Fold1: cost=1.00
## - Fold1: cost=1.00
## + Fold1: cost=2.00
## - Fold1: cost=2.00
## + Fold1: cost=4.00
## - Fold1: cost=4.00
## + Fold2: cost=0.25
## - Fold2: cost=0.25
## + Fold2: cost=0.50
## - Fold2: cost=0.50
## + Fold2: cost=1.00
## - Fold2: cost=1.00
## + Fold2: cost=2.00
## - Fold2: cost=2.00
## + Fold2: cost=4.00
## - Fold2: cost=4.00
## + Fold3: cost=0.25
## - Fold3: cost=0.25
## + Fold3: cost=0.50
## - Fold3: cost=0.50
## + Fold3: cost=1.00
## - Fold3: cost=1.00
## + Fold3: cost=2.00
## - Fold3: cost=2.00
## + Fold3: cost=4.00
## - Fold3: cost=4.00
## Aggregating results
## Selecting tuning parameters
## Fitting cost = 0.25 on full training set
Another way, which gives us even more control over tuning, is to use
tuneGrid, which lets us specify the exact values the algorithm
will try:
set.seed(825)
svm_model_tuned <- train(y = politics_code,
x = to_train,
method = "svmLinear",
trControl = trctrl,
scale = FALSE,
tuneGrid = expand.grid(C = 3^(-5:5))) # Try these values
## Warning in train.default(y = politics_code, x = to_train, method =
## "svmLinear", : The metric "Accuracy" was not in the result set. ROC will be used
## instead.
## + Fold1: C=4.115e-03
## - Fold1: C=4.115e-03
## + Fold1: C=1.235e-02
## - Fold1: C=1.235e-02
## + Fold1: C=3.704e-02
## - Fold1: C=3.704e-02
## + Fold1: C=1.111e-01
## - Fold1: C=1.111e-01
## + Fold1: C=3.333e-01
## - Fold1: C=3.333e-01
## + Fold1: C=1.000e+00
## - Fold1: C=1.000e+00
## + Fold1: C=3.000e+00
## - Fold1: C=3.000e+00
## + Fold1: C=9.000e+00
## - Fold1: C=9.000e+00
## + Fold1: C=2.700e+01
## - Fold1: C=2.700e+01
## + Fold1: C=8.100e+01
## - Fold1: C=8.100e+01
## + Fold1: C=2.430e+02
## - Fold1: C=2.430e+02
## + Fold2: C=4.115e-03
## - Fold2: C=4.115e-03
## + Fold2: C=1.235e-02
## - Fold2: C=1.235e-02
## + Fold2: C=3.704e-02
## - Fold2: C=3.704e-02
## + Fold2: C=1.111e-01
## - Fold2: C=1.111e-01
## + Fold2: C=3.333e-01
## - Fold2: C=3.333e-01
## + Fold2: C=1.000e+00
## - Fold2: C=1.000e+00
## + Fold2: C=3.000e+00
## - Fold2: C=3.000e+00
## + Fold2: C=9.000e+00
## - Fold2: C=9.000e+00
## + Fold2: C=2.700e+01
## - Fold2: C=2.700e+01
## + Fold2: C=8.100e+01
## - Fold2: C=8.100e+01
## + Fold2: C=2.430e+02
## - Fold2: C=2.430e+02
## + Fold3: C=4.115e-03
## - Fold3: C=4.115e-03
## + Fold3: C=1.235e-02
## - Fold3: C=1.235e-02
## + Fold3: C=3.704e-02
## - Fold3: C=3.704e-02
## + Fold3: C=1.111e-01
## - Fold3: C=1.111e-01
## + Fold3: C=3.333e-01
## - Fold3: C=3.333e-01
## + Fold3: C=1.000e+00
## - Fold3: C=1.000e+00
## + Fold3: C=3.000e+00
## - Fold3: C=3.000e+00
## + Fold3: C=9.000e+00
## - Fold3: C=9.000e+00
## + Fold3: C=2.700e+01
## - Fold3: C=2.700e+01
## + Fold3: C=8.100e+01
## - Fold3: C=8.100e+01
## + Fold3: C=2.430e+02
## - Fold3: C=2.430e+02
## Aggregating results
## Selecting tuning parameters
## Fitting C = 0.00412 on full training set
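Before moving to the test set, it is worth inspecting how the cross-validated ROC varied across the grid of C values. A brief look at the fitted train object (these are standard components of caret's output):
svm_model_tuned$results   # cross-validated ROC, sensitivity, and specificity for each C
svm_model_tuned$bestTune  # the value of C that caret selected
plot(svm_model_tuned)     # plot the selection metric (ROC) against C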
svm_predict <- predict(svm_model_tuned, newdata = to_test)
confusionMatrix(svm_predict, labeled_data$politics_f[-trainIndex], mode = "everything")
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 159 82
## yes 28 31
##
## Accuracy : 0.6333
## 95% CI : (0.576, 0.688)
## No Information Rate : 0.6233
## P-Value [Acc > NIR] : 0.3846
##
## Kappa : 0.1376
##
## Mcnemar's Test P-Value : 4.341e-07
##
## Sensitivity : 0.8503
## Specificity : 0.2743
## Pos Pred Value : 0.6598
## Neg Pred Value : 0.5254
## Precision : 0.6598
## Recall : 0.8503
## F1 : 0.7430
## Prevalence : 0.6233
## Detection Rate : 0.5300
## Detection Prevalence : 0.8033
## Balanced Accuracy : 0.5623
##
## 'Positive' Class : no
##
(159+31)/300 # Accuracy
## [1] 0.6333333
159/(159+28) # Recall (= Sensitivity)
## [1] 0.8502674
159/(159+82) # Precision
## [1] 0.659751
31/(82+31) # Specificity (= Recall of the other class)
## [1] 0.2743363
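The same quantities can also be computed with caret's helper functions instead of typing the cell counts by hand. A short sketch (like confusionMatrix(), these functions treat the first factor level, "no", as the positive/relevant class unless told otherwise):
truth <- labeled_data$politics_f[-trainIndex]  # reference labels for the test set

precision(svm_predict, truth)    # = Pos Pred Value above
recall(svm_predict, truth)       # = Sensitivity above
specificity(svm_predict, truth)  # recall of the other class
F_meas(svm_predict, truth)       # F1 for the positive class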
confusionMatrix(svm_predict, reference = labeled_data$politics_f[-trainIndex], mode = "everything", positive = "yes")
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 159 82
## yes 28 31
##
## Accuracy : 0.6333
## 95% CI : (0.576, 0.688)
## No Information Rate : 0.6233
## P-Value [Acc > NIR] : 0.3846
##
## Kappa : 0.1376
##
## Mcnemar's Test P-Value : 4.341e-07
##
## Sensitivity : 0.2743
## Specificity : 0.8503
## Pos Pred Value : 0.5254
## Neg Pred Value : 0.6598
## Precision : 0.5254
## Recall : 0.2743
## F1 : 0.3605
## Prevalence : 0.3767
## Detection Rate : 0.1033
## Detection Prevalence : 0.1967
## Balanced Accuracy : 0.5623
##
## 'Positive' Class : yes
##
set.seed(255)
nb_model_tuned <- train(y = politics_code,
x = to_train,
method = "naive_bayes",
trControl = trctrl,
tuneGrid = expand.grid(
usekernel = c(TRUE, FALSE),
adjust = c(1, 2),
laplace = 0))
## Warning in train.default(y = politics_code, x = to_train, method =
## "naive_bayes", : The metric "Accuracy" was not in the result set. ROC will be
## used instead.
## + Fold1: usekernel= TRUE, adjust=1, laplace=0
## - Fold1: usekernel= TRUE, adjust=1, laplace=0
## + Fold1: usekernel=FALSE, adjust=1, laplace=0
## - Fold1: usekernel=FALSE, adjust=1, laplace=0
## + Fold1: usekernel= TRUE, adjust=2, laplace=0
## - Fold1: usekernel= TRUE, adjust=2, laplace=0
## + Fold1: usekernel=FALSE, adjust=2, laplace=0
## - Fold1: usekernel=FALSE, adjust=2, laplace=0
## + Fold2: usekernel= TRUE, adjust=1, laplace=0
## - Fold2: usekernel= TRUE, adjust=1, laplace=0
## + Fold2: usekernel=FALSE, adjust=1, laplace=0
## - Fold2: usekernel=FALSE, adjust=1, laplace=0
## + Fold2: usekernel= TRUE, adjust=2, laplace=0
## - Fold2: usekernel= TRUE, adjust=2, laplace=0
## + Fold2: usekernel=FALSE, adjust=2, laplace=0
## - Fold2: usekernel=FALSE, adjust=2, laplace=0
## + Fold3: usekernel= TRUE, adjust=1, laplace=0
## - Fold3: usekernel= TRUE, adjust=1, laplace=0
## + Fold3: usekernel=FALSE, adjust=1, laplace=0
## - Fold3: usekernel=FALSE, adjust=1, laplace=0
## + Fold3: usekernel= TRUE, adjust=2, laplace=0
## - Fold3: usekernel= TRUE, adjust=2, laplace=0
## + Fold3: usekernel=FALSE, adjust=2, laplace=0
## - Fold3: usekernel=FALSE, adjust=2, laplace=0
## Aggregating results
## Selecting tuning parameters
## Fitting laplace = 0, usekernel = FALSE, adjust = 1 on full training set
Notes:
usekernel chooses how numeric features are modeled: a kernel density
estimate (TRUE) vs. a Gaussian density (FALSE)
adjust scales the bandwidth of the kernel density estimate
(larger values -> wider bandwidth -> smoother estimates)
laplace sets the Laplace smoothing parameter
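To see what usekernel and adjust change, here is a small, self-contained illustration (base R, simulated data, not the tweet features): it compares a single Gaussian fit with kernel density estimates at two bandwidth adjustments.
# Illustration of usekernel / adjust on one simulated, bimodal feature
set.seed(1)
feat <- c(rnorm(100, mean = 0), rnorm(100, mean = 4))
m <- mean(feat); s <- sd(feat)

hist(feat, freq = FALSE, breaks = 30, main = "Gaussian vs. kernel density estimate")
curve(dnorm(x, mean = m, sd = s), add = TRUE, lty = 2)  # usekernel = FALSE: one Gaussian
lines(density(feat, adjust = 1), col = "blue")          # usekernel = TRUE, adjust = 1
lines(density(feat, adjust = 2), col = "red")           # adjust = 2: wider bandwidth, smoother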
nb_predict <- predict(nb_model_tuned, newdata = to_test)
confusionMatrix(nb_predict, reference = labeled_data$politics_f[-trainIndex], mode = "everything", positive = "yes")
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 0 0
## yes 187 113
##
## Accuracy : 0.3767
## 95% CI : (0.3216, 0.4342)
## No Information Rate : 0.6233
## P-Value [Acc > NIR] : 1
##
## Kappa : 0
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 1.0000
## Specificity : 0.0000
## Pos Pred Value : 0.3767
## Neg Pred Value : NaN
## Precision : 0.3767
## Recall : 1.0000
## F1 : 0.5472
## Prevalence : 0.3767
## Detection Rate : 0.3767
## Detection Prevalence : 1.0000
## Balanced Accuracy : 0.5000
##
## 'Positive' Class : yes
##
set.seed(569)
rf_model_tuned <- train(y = politics_code,
x = to_train,
method = "ranger",
trControl = trctrl,
tuneGrid = data.frame(mtry = floor(sqrt(dim(to_train)[2])),
splitrule = c("gini","extratrees"),
min.node.size = 1))
## Warning in train.default(y = politics_code, x = to_train, method = "ranger", :
## The metric "Accuracy" was not in the result set. ROC will be used instead.
## + Fold1: mtry=63, splitrule=gini, min.node.size=1
## - Fold1: mtry=63, splitrule=gini, min.node.size=1
## + Fold1: mtry=63, splitrule=extratrees, min.node.size=1
## - Fold1: mtry=63, splitrule=extratrees, min.node.size=1
## + Fold2: mtry=63, splitrule=gini, min.node.size=1
## - Fold2: mtry=63, splitrule=gini, min.node.size=1
## + Fold2: mtry=63, splitrule=extratrees, min.node.size=1
## - Fold2: mtry=63, splitrule=extratrees, min.node.size=1
## + Fold3: mtry=63, splitrule=gini, min.node.size=1
## - Fold3: mtry=63, splitrule=gini, min.node.size=1
## + Fold3: mtry=63, splitrule=extratrees, min.node.size=1
## - Fold3: mtry=63, splitrule=extratrees, min.node.size=1
## Aggregating results
## Selecting tuning parameters
## Fitting mtry = 63, splitrule = extratrees, min.node.size = 1 on full training set
Notes:
mtry controls how many of the input features a decision tree may
consider at each split, and therefore how much randomness is injected
into the tree-building process. A common rule of thumb is the square
root of the number of features (in our case, features = words):
https://crunchingthedata.com/mtry-in-random-forests/
splitrule sets the criterion used to choose splits (here, Gini
impurity vs. extremely randomized "extratrees" splits)
min.node.size limits how deep a tree grows: a branch keeps splitting
until a node contains no more than the minimum node size, here 1
observation
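If you want caret to tune these parameters rather than fix them, you can hand it a larger grid, along the lines of the SVM tuning above. A sketch reusing the same objects (expect it to take noticeably longer to run):
# Square-root rule of thumb for mtry, plus values around it
p <- ncol(to_train)  # number of features (words)
rf_grid <- expand.grid(mtry = floor(sqrt(p) * c(0.5, 1, 2)),
                       splitrule = c("gini", "extratrees"),
                       min.node.size = c(1, 5, 10),
                       stringsAsFactors = FALSE)

set.seed(569)
rf_model_grid <- train(y = politics_code,
                       x = to_train,
                       method = "ranger",
                       trControl = trctrl,
                       tuneGrid = rf_grid)  # 3 x 2 x 3 = 18 candidate models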
rf_predict <- predict(rf_model_tuned, newdata = to_test)
confusionMatrix(rf_predict, reference = labeled_data$politics_f[-trainIndex], mode = "everything")
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 184 100
## yes 3 13
##
## Accuracy : 0.6567
## 95% CI : (0.5999, 0.7103)
## No Information Rate : 0.6233
## P-Value [Acc > NIR] : 0.1285
##
## Kappa : 0.1193
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.9840
## Specificity : 0.1150
## Pos Pred Value : 0.6479
## Neg Pred Value : 0.8125
## Precision : 0.6479
## Recall : 0.9840
## F1 : 0.7813
## Prevalence : 0.6233
## Detection Rate : 0.6133
## Detection Prevalence : 0.9467
## Balanced Accuracy : 0.5495
##
## 'Positive' Class : no
##
confusionMatrix(rf_predict, reference = labeled_data$politics_f[-trainIndex], mode = "everything", positive = "yes")
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 184 100
## yes 3 13
##
## Accuracy : 0.6567
## 95% CI : (0.5999, 0.7103)
## No Information Rate : 0.6233
## P-Value [Acc > NIR] : 0.1285
##
## Kappa : 0.1193
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.11504
## Specificity : 0.98396
## Pos Pred Value : 0.81250
## Neg Pred Value : 0.64789
## Precision : 0.81250
## Recall : 0.11504
## F1 : 0.20155
## Prevalence : 0.37667
## Detection Rate : 0.04333
## Detection Prevalence : 0.05333
## Balanced Accuracy : 0.54950
##
## 'Positive' Class : yes
##
Load the unlabeled data:
unlabeled_data <- read.csv("unlabeled.csv")
str(unlabeled_data)
## 'data.frame': 2000 obs. of 2 variables:
## $ id : int 1 2 3 4 5 6 7 8 9 10 ...
## $ content: chr "RT @kylegriffin1: Twitter has removed a tweet from Scott Atlas, one of Trump's top COVID advisers, who lied abo"| __truncated__ "RT @oneunderscore__: Not going to link to it, but Jacob Wohl is now verified on Instagram. Facebook, which owns"| __truncated__ "Facebook is their channel. Shame on Zuckerberg RT @AuthorKimberley: The Russian disinformation campaign will go"| __truncated__ "RT @atrupar: The irony is that repealing Section 230 would result in the total suppression of bogus Hunter Bide"| __truncated__ ...
You should recognize most of the code below. We used the exact same code to preprocess the labeled data and to make predictions on the test subset of the labeled data. Now we do the same for the unlabeled data.
First, we get the unlabeled data into the same format:
unlabeled_data$content_clean <- gsub("http\\S+\\s*", "", unlabeled_data$content)
unlabeled_data$content_clean <- gsub('[[:digit:]]+','', unlabeled_data$content_clean)
unlabeled_tokens <- unlabeled_data %>%
  unnest_tokens(word, content_clean) %>%
  anti_join(stop_words, by = "word") %>%
  count(id, word) %>%
  bind_tf_idf(word, id, n) # same argument order as before: (term, document, n)
unlabeled_tokens$word <- gsub('[[:punct:]]+','',unlabeled_tokens$word)
unlabeled_dtm <- unlabeled_tokens %>%
cast_dtm(id, word, tf_idf)
unlabeled_input <- unlabeled_dtm %>% as.matrix() %>% as.data.frame()
One important thing to know is that the above machine learning algorithms are trained on a specific set of features (words); they do not know what to do with features they have never seen.
In some data (e.g., survey data, economic data), this is less of a problem, as new data usually come with the same set of features anyway.
With textual data it matters, because unlabeled data will almost always contain words that never appeared in our labeled data (and vice versa).
The code below makes the columns (words) of the unlabeled data match exactly the columns used for training and testing.
# Keep only the words that also appear in the training data
unlabeled_input_clean <- unlabeled_input[, intersect(colnames(unlabeled_input), colnames(to_train))]
# An empty data frame with exactly the training columns
empty_data <- as.data.frame(matrix(nrow = 0, ncol = ncol(to_train)))
colnames(empty_data) <- colnames(to_train)
# Add the training-only words back as columns and fill them with 0
unlabeled_input_final <- plyr::rbind.fill(unlabeled_input_clean, empty_data) %>%
  mutate_all(~replace_na(., 0))
Now we can apply the model to unlabeled data using the same code:
results <- predict(svm_model_tuned, newdata = unlabeled_input_final)
We can now attach a predicted label to each document. Because entire documents can get discarded during preprocessing (for example, if all of their words are removed), it is good practice to make sure the document indexes still line up:
final_data <- tibble(id = as.numeric(dimnames(unlabeled_dtm)[[1]])) %>%
left_join(unlabeled_data[!duplicated(unlabeled_data$id), ], by = "id")
final_data$politics <- results
summary(final_data$politics)
## no yes
## 1731 269
The quick and simple algorithms we ran today didn’t do that well. How do we improve them?
More labeled data
Better, deeper data cleaning
Dictionary/Topic modeling as features rather than words as features
Feature selection: https://topepo.github.io/caret/feature-selection-overview.html
Detailed tuning of hyperparameters (need to deep dive into each algorithm)
For imbalanced data, oversampling the minority class or undersampling the majority class can improve performance considerably (see the sketch below): https://topepo.github.io/caret/subsampling-for-class-imbalances.html#subsampling-during-resampling
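For instance, caret can do the subsampling for you inside each resampling iteration via the sampling argument of trainControl(). A sketch reusing the objects from above:
# Up-sample the minority class ("yes") within each cross-validation fold
trctrl_up <- trainControl(method = "cv",
                          number = 3,
                          summaryFunction = twoClassSummary,
                          classProbs = TRUE,
                          sampling = "up",  # "down", "smote", and "rose" also exist (the last two need extra packages)
                          verboseIter = TRUE)

set.seed(825)
svm_model_up <- train(y = politics_code,
                      x = to_train,
                      method = "svmLinear",
                      trControl = trctrl_up,
                      scale = FALSE,
                      tuneGrid = expand.grid(C = 3^(-5:5)))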
Step 0: Clean your working environment using
rm(list=ls())
Step 1: Open a new file, SAVE it to your folder, set your working directory to the folder, and then read “sml_practice.csv” into R
Step 2: Preprocess the data. Your goal is to have a document-term matrix
Step 3: Partition the data into a training set (60%) and a testing set (40%)
Step 4: Set up resampling to be 10-fold cross-validation
Step 5: Choose one of the three algorithms we covered today, tune its hyperparameters as you see fit, and train it on the training set
Step 6: Test the performance of this algorithm on the testing set
Step 7 (Optional): Try one of the following to improve performance. You can either
try one of the other two algorithms, or
try further tuning the hyperparameters, or
try oversampling/undersampling by following the link above (this is the bravest option; ask questions if you explore it)