Prep

Complete the following:

  • Download “labeled.csv” and “unlabeled.csv” to your folder

  • Open a new file and SAVE it to the same folder

  • Set your working directory to the folder

Let’s review:

  • In one sentence, what is supervised machine learning?

Steps we will go through today:

  • Preprocessing (for textual data, the end result is often a document-term matrix)

  • Partition the labeled data into the training set and the testing set

  • Tune and train the models in the training set

    • SVM, NB, Random Forest

  • Test model performance in the testing set

    • Accuracy, Precision, Recall, F1

  • Apply the model to the unlabeled data

Package

We will use the caret package in R.

caret is short for “Classification And REgression Training”, and it provides a uniform interface for hundreds of supervised machine learning algorithms.

See this comprehensive tutorial written by the package developer for all models supported by caret and additional steps in SML such as feature selection: http://topepo.github.io/caret/index.html

#install.packages("caret")
library(caret)
## Loading required package: ggplot2
## Loading required package: lattice

caret loads modeling packages as needed and assumes they are installed; if one is missing, caret will prompt you to install it. For this tutorial, install the packages needed for the three algorithms we will cover.

#install.packages("kernlab") # for SVM
#install.packages("naivebayes") # for naive Bayes
#install.packages("ranger") # for random forest

Also load packages for data preprocessing.

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ lubridate 1.9.2     ✔ tibble    3.2.1
## ✔ purrr     1.0.1     ✔ tidyr     1.3.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ✖ purrr::lift()   masks caret::lift()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tidytext)

Preprocessing

labeled_data <- read.csv("labeled.csv")
str(labeled_data)
## 'data.frame':    600 obs. of  3 variables:
##  $ id      : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ content : chr  "Because Donald Trump uses his Twitter feed to spread venom, disinformation and lies. Joe Biden doesn't. This is"| __truncated__ "Twitter is in bed with the dims to get rid of Trump. I thought everyone knew that. RT @Real_G2DAZ: It appears t"| __truncated__ "RT @Rapscallianna: @BrandyZadrozny @kim @ArijitDSen Facebook has done more damage than good. It \031s time Face"| __truncated__ "@thehill Twitter should block this it is spreading false information" ...
##  $ politics: int  1 0 0 0 1 1 0 0 0 0 ...

We most often want the dependent variable to be a “factor” (categorical).

summary(labeled_data$politics)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.3767  1.0000  1.0000
labeled_data$politics_f <- factor(labeled_data$politics, levels = c(0,1), labels = c("no","yes"))
summary(labeled_data$politics_f)
##  no yes 
## 374 226

Initial cleaning: remove URLs and digits.

labeled_data$content_clean <- gsub("http\\S+\\s*", "", labeled_data$content)
labeled_data$content[1]
## [1] "Because Donald Trump uses his Twitter feed to spread venom, disinformation and lies. Joe Biden doesn't. This isn't hard ma'am. RT @MarshaBlackburn: Twitter has not censored Joe Biden once. It has censored @realDonaldTrump more than 65 times. https://www.foxnews.com/media/twitter-facebook-have-censored-trump-65-times-compared-to-zero-for-biden-study-says"
labeled_data$content_clean[1]
## [1] "Because Donald Trump uses his Twitter feed to spread venom, disinformation and lies. Joe Biden doesn't. This isn't hard ma'am. RT @MarshaBlackburn: Twitter has not censored Joe Biden once. It has censored @realDonaldTrump more than 65 times. "
labeled_data$content_clean <- gsub('[[:digit:]]+','',labeled_data$content_clean)

Tokenization, tf-idf, and document-term matrix:

labeled_tokens <- labeled_data %>%
  unnest_tokens(word, content_clean) %>%
  anti_join(stop_words, by = "word") %>%
  count(id, word) %>%
  bind_tf_idf(word, id, n) # bind_tf_idf() expects term, then document, then count

labeled_tokens$word <- gsub('[[:punct:]]+','',labeled_tokens$word)
labeled_dtm <- labeled_tokens %>%
  cast_dtm(id, word, tf_idf)
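
As a quick sanity check (optional, not part of the original workflow), we can confirm the dimensions of the resulting document-term matrix: rows are the labeled documents that survived preprocessing and columns are the unique words.

dim(labeled_dtm) # number of documents x number of unique words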

Partition

Now we subset the labeled data into a training set and a testing set.

We first randomly generate row indexes for the training set (the testing set will simply be the remaining rows):

set.seed(357)
trainIndex <- createDataPartition(labeled_data$politics_f, p = 0.5, list = FALSE, times = 1)

We then use these indexes to subset the document-term matrix:

to_train <- labeled_dtm[trainIndex, ] %>% as.matrix() %>% as.data.frame()
to_test <- labeled_dtm[-trainIndex, ] %>% as.matrix() %>% as.data.frame()

Because the document-term matrix does not contain information on the labels, we put the labels for the training set in a separate object so we can feed it into the algorithm:

politics_code <- labeled_data$politics_f[trainIndex]
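
To verify that createDataPartition kept the class balance similar across the two subsets (an optional check, not part of the original code), we can tabulate the labels in each:

table(politics_code) # labels in the training set
table(labeled_data$politics_f[-trainIndex]) # labels in the testing set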

Tune, Train & Test

Resampling

Terms to be familiar with (explained in the comments below):

set.seed(42)
trctrl <- trainControl(method = "cv", # use the cross validation method for resampling
                       number = 3, # 3-fold cross validation for the sake of time in class
                       summaryFunction = twoClassSummary, # evaluate performance using measures specific to two-class problems, such as the area under the ROC curve (AUC), sensitivity and specificity.
                       classProbs = TRUE, # estimate class probabilities
                       verboseIter = TRUE) # print a log for training
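
The same call is easy to adapt to other resampling schemes; for example, the 10-fold cross-validation asked for in the exercise at the end would only change these settings (a sketch, not run here):

# trctrl_10fold <- trainControl(method = "cv", number = 10, # 10-fold CV; slower but more stable
#                               summaryFunction = twoClassSummary,
#                               classProbs = TRUE,
#                               verboseIter = TRUE)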

SVM

SVM: Tuning & Training

set.seed(825)
svm_model_plain <- train(y = politics_code,
                        x = to_train,
                        method = "svmLinear2",
                        trControl = trctrl,
                        scale = FALSE) 
## Warning in train.default(y = politics_code, x = to_train, method =
## "svmLinear2", : The metric "Accuracy" was not in the result set. ROC will be
## used instead.
## + Fold1: cost=0.25 
## - Fold1: cost=0.25 
## + Fold1: cost=0.50 
## - Fold1: cost=0.50 
## + Fold1: cost=1.00 
## - Fold1: cost=1.00 
## + Fold2: cost=0.25 
## - Fold2: cost=0.25 
## + Fold2: cost=0.50 
## - Fold2: cost=0.50 
## + Fold2: cost=1.00 
## - Fold2: cost=1.00 
## + Fold3: cost=0.25 
## - Fold3: cost=0.25 
## + Fold3: cost=0.50 
## - Fold3: cost=0.50 
## + Fold3: cost=1.00 
## - Fold3: cost=1.00 
## Aggregating results
## Selecting tuning parameters
## Fitting cost = 0.25 on full training set

One way to tune the model is to use tuneLength, which tells caret how many default values of the main tuning parameter to try.

The main parameter for SVM is C (cost). It controls how hard or soft we want the margin to be (larger C –> harder margins): https://stats.stackexchange.com/questions/225409/what-does-the-cost-c-parameter-mean-in-svm

set.seed(825)
svm_model_length <- train(y = politics_code,
                        x = to_train,
                        method = "svmLinear2",
                        trControl = trctrl,
                        scale = FALSE,
                        tuneLength = 5) # Try 5 default values
## Warning in train.default(y = politics_code, x = to_train, method =
## "svmLinear2", : The metric "Accuracy" was not in the result set. ROC will be
## used instead.
## + Fold1: cost=0.25 
## - Fold1: cost=0.25 
## + Fold1: cost=0.50 
## - Fold1: cost=0.50 
## + Fold1: cost=1.00 
## - Fold1: cost=1.00 
## + Fold1: cost=2.00 
## - Fold1: cost=2.00 
## + Fold1: cost=4.00 
## - Fold1: cost=4.00 
## + Fold2: cost=0.25 
## - Fold2: cost=0.25 
## + Fold2: cost=0.50 
## - Fold2: cost=0.50 
## + Fold2: cost=1.00 
## - Fold2: cost=1.00 
## + Fold2: cost=2.00 
## - Fold2: cost=2.00 
## + Fold2: cost=4.00 
## - Fold2: cost=4.00 
## + Fold3: cost=0.25 
## - Fold3: cost=0.25 
## + Fold3: cost=0.50 
## - Fold3: cost=0.50 
## + Fold3: cost=1.00 
## - Fold3: cost=1.00 
## + Fold3: cost=2.00 
## - Fold3: cost=2.00 
## + Fold3: cost=4.00 
## - Fold3: cost=4.00 
## Aggregating results
## Selecting tuning parameters
## Fitting cost = 0.25 on full training set

Another way that gives us even more control over tuning is to use tuneGrid, which lets us specify the exact values the algorithm will try:

set.seed(825)
svm_model_tuned <- train(y = politics_code,
                        x = to_train,
                        method = "svmLinear",
                        trControl = trctrl, 
                        scale = FALSE,
                        tuneGrid = expand.grid(C = 3^(-5:5))) # Try these values
## Warning in train.default(y = politics_code, x = to_train, method =
## "svmLinear", : The metric "Accuracy" was not in the result set. ROC will be used
## instead.
## + Fold1: C=4.115e-03 
## - Fold1: C=4.115e-03 
## + Fold1: C=1.235e-02 
## - Fold1: C=1.235e-02 
## + Fold1: C=3.704e-02 
## - Fold1: C=3.704e-02 
## + Fold1: C=1.111e-01 
## - Fold1: C=1.111e-01 
## + Fold1: C=3.333e-01 
## - Fold1: C=3.333e-01 
## + Fold1: C=1.000e+00 
## - Fold1: C=1.000e+00 
## + Fold1: C=3.000e+00 
## - Fold1: C=3.000e+00 
## + Fold1: C=9.000e+00 
## - Fold1: C=9.000e+00 
## + Fold1: C=2.700e+01 
## - Fold1: C=2.700e+01 
## + Fold1: C=8.100e+01 
## - Fold1: C=8.100e+01 
## + Fold1: C=2.430e+02 
## - Fold1: C=2.430e+02 
## + Fold2: C=4.115e-03 
## - Fold2: C=4.115e-03 
## + Fold2: C=1.235e-02 
## - Fold2: C=1.235e-02 
## + Fold2: C=3.704e-02 
## - Fold2: C=3.704e-02 
## + Fold2: C=1.111e-01 
## - Fold2: C=1.111e-01 
## + Fold2: C=3.333e-01 
## - Fold2: C=3.333e-01 
## + Fold2: C=1.000e+00 
## - Fold2: C=1.000e+00 
## + Fold2: C=3.000e+00 
## - Fold2: C=3.000e+00 
## + Fold2: C=9.000e+00 
## - Fold2: C=9.000e+00 
## + Fold2: C=2.700e+01 
## - Fold2: C=2.700e+01 
## + Fold2: C=8.100e+01 
## - Fold2: C=8.100e+01 
## + Fold2: C=2.430e+02 
## - Fold2: C=2.430e+02 
## + Fold3: C=4.115e-03 
## - Fold3: C=4.115e-03 
## + Fold3: C=1.235e-02 
## - Fold3: C=1.235e-02 
## + Fold3: C=3.704e-02 
## - Fold3: C=3.704e-02 
## + Fold3: C=1.111e-01 
## - Fold3: C=1.111e-01 
## + Fold3: C=3.333e-01 
## - Fold3: C=3.333e-01 
## + Fold3: C=1.000e+00 
## - Fold3: C=1.000e+00 
## + Fold3: C=3.000e+00 
## - Fold3: C=3.000e+00 
## + Fold3: C=9.000e+00 
## - Fold3: C=9.000e+00 
## + Fold3: C=2.700e+01 
## - Fold3: C=2.700e+01 
## + Fold3: C=8.100e+01 
## - Fold3: C=8.100e+01 
## + Fold3: C=2.430e+02 
## - Fold3: C=2.430e+02 
## Aggregating results
## Selecting tuning parameters
## Fitting C = 0.00412 on full training set

SVM: Testing Performance

svm_predict <- predict(svm_model_tuned, newdata = to_test)
confusionMatrix(svm_predict, labeled_data$politics_f[-trainIndex], mode = "everything")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  no yes
##        no  159  82
##        yes  28  31
##                                         
##                Accuracy : 0.6333        
##                  95% CI : (0.576, 0.688)
##     No Information Rate : 0.6233        
##     P-Value [Acc > NIR] : 0.3846        
##                                         
##                   Kappa : 0.1376        
##                                         
##  Mcnemar's Test P-Value : 4.341e-07     
##                                         
##             Sensitivity : 0.8503        
##             Specificity : 0.2743        
##          Pos Pred Value : 0.6598        
##          Neg Pred Value : 0.5254        
##               Precision : 0.6598        
##                  Recall : 0.8503        
##                      F1 : 0.7430        
##              Prevalence : 0.6233        
##          Detection Rate : 0.5300        
##    Detection Prevalence : 0.8033        
##       Balanced Accuracy : 0.5623        
##                                         
##        'Positive' Class : no            
## 
(159+31)/300 # Accuracy
## [1] 0.6333333
159/(159+28) # Recall (= Sensitivity)
## [1] 0.8502674
159/(159+82) # Precision
## [1] 0.659751
31/(82+31) # Specificity (= Recall of the other class)
## [1] 0.2743363
confusionMatrix(svm_predict, reference = labeled_data$politics_f[-trainIndex], mode = "everything", positive = "yes")  
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  no yes
##        no  159  82
##        yes  28  31
##                                         
##                Accuracy : 0.6333        
##                  95% CI : (0.576, 0.688)
##     No Information Rate : 0.6233        
##     P-Value [Acc > NIR] : 0.3846        
##                                         
##                   Kappa : 0.1376        
##                                         
##  Mcnemar's Test P-Value : 4.341e-07     
##                                         
##             Sensitivity : 0.2743        
##             Specificity : 0.8503        
##          Pos Pred Value : 0.5254        
##          Neg Pred Value : 0.6598        
##               Precision : 0.5254        
##                  Recall : 0.2743        
##                      F1 : 0.3605        
##              Prevalence : 0.3767        
##          Detection Rate : 0.1033        
##    Detection Prevalence : 0.1967        
##       Balanced Accuracy : 0.5623        
##                                         
##        'Positive' Class : yes           
## 
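
F1 is the harmonic mean of precision and recall, so we can also reproduce the F1 reported above by hand (a quick sketch; these helper objects are not part of the original code):

precision_no <- 159/(159+82) # precision for the "no" class
recall_no <- 159/(159+28) # recall for the "no" class
2 * precision_no * recall_no / (precision_no + recall_no) # ~0.743, matching the F1 above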

NB

NB: Tuning & Training

set.seed(255)
nb_model_tuned <- train(y = politics_code,
                        x = to_train,
                        method = "naive_bayes",
                        trControl = trctrl,
                        tuneGrid = expand.grid(
                          usekernel = c(TRUE, FALSE),
                          adjust = c(1, 2), 
                          laplace = 0)) 
## Warning in train.default(y = politics_code, x = to_train, method =
## "naive_bayes", : The metric "Accuracy" was not in the result set. ROC will be
## used instead.
## + Fold1: usekernel= TRUE, adjust=1, laplace=0 
## - Fold1: usekernel= TRUE, adjust=1, laplace=0 
## + Fold1: usekernel=FALSE, adjust=1, laplace=0 
## - Fold1: usekernel=FALSE, adjust=1, laplace=0 
## + Fold1: usekernel= TRUE, adjust=2, laplace=0 
## - Fold1: usekernel= TRUE, adjust=2, laplace=0 
## + Fold1: usekernel=FALSE, adjust=2, laplace=0 
## - Fold1: usekernel=FALSE, adjust=2, laplace=0 
## + Fold2: usekernel= TRUE, adjust=1, laplace=0 
## - Fold2: usekernel= TRUE, adjust=1, laplace=0 
## + Fold2: usekernel=FALSE, adjust=1, laplace=0 
## - Fold2: usekernel=FALSE, adjust=1, laplace=0 
## + Fold2: usekernel= TRUE, adjust=2, laplace=0 
## - Fold2: usekernel= TRUE, adjust=2, laplace=0 
## + Fold2: usekernel=FALSE, adjust=2, laplace=0 
## - Fold2: usekernel=FALSE, adjust=2, laplace=0 
## + Fold3: usekernel= TRUE, adjust=1, laplace=0 
## - Fold3: usekernel= TRUE, adjust=1, laplace=0 
## + Fold3: usekernel=FALSE, adjust=1, laplace=0 
## - Fold3: usekernel=FALSE, adjust=1, laplace=0 
## + Fold3: usekernel= TRUE, adjust=2, laplace=0 
## - Fold3: usekernel= TRUE, adjust=2, laplace=0 
## + Fold3: usekernel=FALSE, adjust=2, laplace=0 
## - Fold3: usekernel=FALSE, adjust=2, laplace=0 
## Aggregating results
## Selecting tuning parameters
## Fitting laplace = 0, usekernel = FALSE, adjust = 1 on full training set

Notes:

  • usekernel specifies the density estimate used for numeric features: kernel density (TRUE) vs. Gaussian (FALSE)

  • adjust scales the bandwidth of the kernel density estimate (larger numbers –> smoother, less flexible estimates)

  • laplace sets the amount of Laplace smoothing
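
To see how these settings fared during resampling and which combination caret selected ("Fitting laplace = 0, usekernel = FALSE, adjust = 1" above), we can inspect the stored results (a quick sketch using components of the caret train object):

nb_model_tuned$results # ROC, sensitivity, and specificity for each grid combination
nb_model_tuned$bestTune # the combination caret selected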

NB: Testing Performance

nb_predict <- predict(nb_model_tuned, newdata = to_test)
confusionMatrix(nb_predict, reference = labeled_data$politics_f[-trainIndex], mode = "everything", positive = "yes") 
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  no yes
##        no    0   0
##        yes 187 113
##                                           
##                Accuracy : 0.3767          
##                  95% CI : (0.3216, 0.4342)
##     No Information Rate : 0.6233          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0               
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 1.0000          
##             Specificity : 0.0000          
##          Pos Pred Value : 0.3767          
##          Neg Pred Value :    NaN          
##               Precision : 0.3767          
##                  Recall : 1.0000          
##                      F1 : 0.5472          
##              Prevalence : 0.3767          
##          Detection Rate : 0.3767          
##    Detection Prevalence : 1.0000          
##       Balanced Accuracy : 0.5000          
##                                           
##        'Positive' Class : yes             
## 

Random Forest

Random Forest: Tuning & Training

set.seed(569)
rf_model_tuned <- train(y = politics_code,
                        x = to_train,
                        method = "ranger",
                        trControl = trctrl,
                        tuneGrid = data.frame(mtry = floor(sqrt(dim(to_train)[2])),
                                              splitrule = c("gini","extratrees"),
                                              min.node.size = 1))
## Warning in train.default(y = politics_code, x = to_train, method = "ranger", :
## The metric "Accuracy" was not in the result set. ROC will be used instead.
## + Fold1: mtry=63, splitrule=gini, min.node.size=1 
## - Fold1: mtry=63, splitrule=gini, min.node.size=1 
## + Fold1: mtry=63, splitrule=extratrees, min.node.size=1 
## - Fold1: mtry=63, splitrule=extratrees, min.node.size=1 
## + Fold2: mtry=63, splitrule=gini, min.node.size=1 
## - Fold2: mtry=63, splitrule=gini, min.node.size=1 
## + Fold2: mtry=63, splitrule=extratrees, min.node.size=1 
## - Fold2: mtry=63, splitrule=extratrees, min.node.size=1 
## + Fold3: mtry=63, splitrule=gini, min.node.size=1 
## - Fold3: mtry=63, splitrule=gini, min.node.size=1 
## + Fold3: mtry=63, splitrule=extratrees, min.node.size=1 
## - Fold3: mtry=63, splitrule=extratrees, min.node.size=1 
## Aggregating results
## Selecting tuning parameters
## Fitting mtry = 63, splitrule = extratrees, min.node.size = 1 on full training set

Notes:

  • mtry controls how many of the input features each tree can consider at any given split, thereby controlling how much randomness is added to the tree-building process. A common rule of thumb is the square root of the number of features (in our case, features = words). https://crunchingthedata.com/mtry-in-random-forests/

  • splitrule sets the criterion used to decide how the data are split at each node

  • min.node.size controls how deep the trees grow: nodes are split until they reach the minimum node size, which here is 1 observation (i.e., fully grown trees)
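
The grid above fixes mtry at the square-root heuristic. One way to tune it further (a sketch, assuming the extra fits are affordable time-wise; the grid values are illustrative) is to search a few values around that heuristic:

# Hypothetical wider grid around the square-root rule of thumb
mtry_default <- floor(sqrt(ncol(to_train)))
rf_grid <- expand.grid(mtry = c(floor(mtry_default/2), mtry_default, 2*mtry_default),
                       splitrule = c("gini", "extratrees"),
                       min.node.size = c(1, 5))
# rf_grid could then be passed to train() via tuneGrid = rf_grid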

Random Forest: Testing Performance

rf_predict <- predict(rf_model_tuned, newdata = to_test)
confusionMatrix(rf_predict, reference = labeled_data$politics_f[-trainIndex], mode = "everything")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  no yes
##        no  184 100
##        yes   3  13
##                                           
##                Accuracy : 0.6567          
##                  95% CI : (0.5999, 0.7103)
##     No Information Rate : 0.6233          
##     P-Value [Acc > NIR] : 0.1285          
##                                           
##                   Kappa : 0.1193          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.9840          
##             Specificity : 0.1150          
##          Pos Pred Value : 0.6479          
##          Neg Pred Value : 0.8125          
##               Precision : 0.6479          
##                  Recall : 0.9840          
##                      F1 : 0.7813          
##              Prevalence : 0.6233          
##          Detection Rate : 0.6133          
##    Detection Prevalence : 0.9467          
##       Balanced Accuracy : 0.5495          
##                                           
##        'Positive' Class : no              
## 
confusionMatrix(rf_predict, reference = labeled_data$politics_f[-trainIndex], mode = "everything", positive = "yes")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  no yes
##        no  184 100
##        yes   3  13
##                                           
##                Accuracy : 0.6567          
##                  95% CI : (0.5999, 0.7103)
##     No Information Rate : 0.6233          
##     P-Value [Acc > NIR] : 0.1285          
##                                           
##                   Kappa : 0.1193          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.11504         
##             Specificity : 0.98396         
##          Pos Pred Value : 0.81250         
##          Neg Pred Value : 0.64789         
##               Precision : 0.81250         
##                  Recall : 0.11504         
##                      F1 : 0.20155         
##              Prevalence : 0.37667         
##          Detection Rate : 0.04333         
##    Detection Prevalence : 0.05333         
##       Balanced Accuracy : 0.54950         
##                                           
##        'Positive' Class : yes             
## 

Apply

Load the unlabeled data:

unlabeled_data <- read.csv("unlabeled.csv")
str(unlabeled_data)
## 'data.frame':    2000 obs. of  2 variables:
##  $ id     : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ content: chr  "RT @kylegriffin1: Twitter has removed a tweet from Scott Atlas, one of Trump's top COVID advisers, who lied abo"| __truncated__ "RT @oneunderscore__: Not going to link to it, but Jacob Wohl is now verified on Instagram. Facebook, which owns"| __truncated__ "Facebook is their channel. Shame on Zuckerberg RT @AuthorKimberley: The Russian disinformation campaign will go"| __truncated__ "RT @atrupar: The irony is that repealing Section 230 would result in the total suppression of bogus Hunter Bide"| __truncated__ ...

You should recognize most of the code below. We used the exact same code to preprocess the labeled data and to make predictions on the testing set. Now we do the same for the unlabeled data.

We get the unlabeled data into the same format using the same code:

unlabeled_data$content_clean <- gsub("http\\S+\\s*", "", unlabeled_data$content)
unlabeled_data$content_clean <- gsub('[[:digit:]]+','', unlabeled_data$content_clean)

unlabeled_tokens <- unlabeled_data %>%
  unnest_tokens(word, content_clean) %>%
  anti_join(stop_words, by = "word") %>%
  count(id, word) %>%
  bind_tf_idf(word, id, n) # bind_tf_idf() expects term, then document, then count

unlabeled_tokens$word <- gsub('[[:punct:]]+','',unlabeled_tokens$word)

unlabeled_dtm <- unlabeled_tokens %>%
  cast_dtm(id, word, tf_idf)

unlabeled_input <- unlabeled_dtm %>% as.matrix() %>% as.data.frame()

One important thing to know is that the machine learning algorithms above are trained on a specific set of features (words). They do not know what to do with new features/words.

  • In some data (e.g., survey data, economic data), this is less of a problem because “new data” usually come with the same set of features anyway.

  • In textual data, this is worth noting because new, unlabeled data will almost always contain words that did not appear in our labeled data.

  • The code below transforms the columns (words) in the unlabeled data so that they exactly match the columns of the training/testing data.

unlabeled_input_clean <- unlabeled_input[,intersect(colnames(unlabeled_input),colnames(to_train))]

empty_data <- as.data.frame(matrix(nrow = 0, ncol = ncol(to_train))) 
colnames(empty_data) <- colnames(to_train)

unlabeled_input_final <- plyr::rbind.fill(unlabeled_input_clean,empty_data) %>%
  mutate_all(~replace_na(.,0))
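
Before predicting, it is worth confirming that the aligned data frame now has exactly the same set of features as the training data (an optional check, not part of the original code):

setequal(colnames(unlabeled_input_final), colnames(to_train)) # should be TRUE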

Now we can apply the trained model to the unlabeled data using the same predict() call:

results <- predict(svm_model_tuned, newdata = unlabeled_input_final)

We can now see the predicted label for each document. Sometimes entire documents get discarded during preprocessing, so it is always good practice to make sure the document ids line up correctly.

final_data <- tibble(id = as.numeric(dimnames(unlabeled_dtm)[[1]])) %>%
  left_join(unlabeled_data[!duplicated(unlabeled_data$id), ], by = "id")
final_data$politics <- results
summary(final_data$politics)
##   no  yes 
## 1731  269

Improve

The quick and simple algorithms we ran today didn’t do that well. How do we improve them?
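
One common starting point is the class imbalance in the labels (374 "no" vs. 226 "yes"): the SVM and random forest, for instance, recovered only a small share of the minority "yes" class. caret's trainControl() has a built-in sampling option that down- or up-samples the training folds; a minimal sketch (one of several possible improvements) looks like this:

# Sketch: rebalance the classes within each training fold before fitting
trctrl_balanced <- trainControl(method = "cv", number = 3,
                                summaryFunction = twoClassSummary,
                                classProbs = TRUE,
                                sampling = "down") # or "up"; "smote"/"rose" need extra packages
# Then pass trctrl_balanced to train() as before, e.g. trControl = trctrl_balanced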

Exercise

  • Step 0: Clean your working environment using rm(list=ls())

  • Step 1: Open a new file, SAVE it to your folder, set your working directory to the folder, and then read “sml_practice.csv” into R

  • Step 2: Preprocess the data. Your goal is to have a document-term matrix

  • Step 3: Partition the data into a training set (60%) and a testing set (40%)

  • Step 4: Set up resampling to be 10-fold cross-validation

  • Step 5: Choose one algorithm from the three we covered today, tune hyperparameters as you see fit, and train it on the training set

  • Step 6: Test the performance of this algorithm on the testing set

  • Step 7 (Optional): Try one of the following to improve. You can either

    • try one of the other two algorithms, or

    • try further tuning the hyperparameters, or

    • try oversampling/undersampling by following the link above (this is the bravest option; ask questions if you explore this new part)