Introduction

In this lesson, we will use TensorFlow and Keras for deep learning with neural networks.

TensorFlow is an open-source machine learning library for Python. Keras is a high-level neural network API that runs on top of TensorFlow.

A deep neural network is an artificial neural network (ANN) with multiple hidden layers between the input and output layers.

The more hidden layers a network has, the deeper it is and the more complex the patterns it can learn. Deep neural networks are usually made up of long chains of hidden layers, and deep learning models typically have more hidden layers but fewer nodes per layer than the typical, shallower ANN model.

Preliminary

library(tm) # text mining
library(caret) # classification
library(keras) # deep learning
library(tensorflow) # deep learning
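
If keras and tensorflow have not been set up on your machine before, they require a one-time installation of the Python backend in addition to the R packages. A minimal sketch of one common approach (run once, not in every session):

install.packages(c("keras", "tensorflow")) # install the R packages (run once)
keras::install_keras() # installs TensorFlow/Keras into a dedicated Python environment (run once)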

Rather than using a Document Term Matrix (DTM) as model input, deep learning with keras takes sentences as input, which are converted to an integer-index representation. For this reason, we will apply the text transformations to the corpus and return the clean text to the original dataframe.

Cleaning

First, we import the data, remove missing text documents and prepare the data for corpus representation.

cr <- read.csv("C:/Users/chh35/OneDrive - Drexel University/Teaching/Drexel/BSAN 710/Course Content/Week 2/clothing_revs_sample.csv", 
               stringsAsFactors = FALSE)
cr <- cr[cr$Review.Text != "" & cr$Review.Text != " ",]
names(cr)[1] <- "doc_id"
cr$text <- paste(cr$Title, cr$Review.Text, sep = " ")

Text Preprocessing

Next, we convert the text documents to a corpus and apply transformations to the corpus documents.

crCorpus <- Corpus(DataframeSource(cr))
crCorpus <- tm_map(crCorpus, tolower) 
crCorpus <- tm_map(crCorpus, removeNumbers)
punc2Space <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
crCorpus <- tm_map(crCorpus, punc2Space, "/")
crCorpus <- tm_map(crCorpus, content_transformer(function(x) removeWords(x, stopwords("en"))))
crCorpus <- tm_map(crCorpus, 
                         removePunctuation, 
                         preserve_intra_word_contractions = FALSE, 
                         preserve_intra_word_dashes = TRUE)
crCorpus <- tm_map(crCorpus, stemDocument, language = "english")
crCorpus <- tm_map(crCorpus, stripWhitespace)

Finally, we convert the corpus to a plain text representation so that we can add the clean text as a column in the cr dataframe.

myCorpus <- tm_map(crCorpus, PlainTextDocument)
## Warning in tm_map.SimpleCorpus(crCorpus, PlainTextDocument): transformation
## drops documents
cr$clean_text <- myCorpus$content$content

Training/Testing Split

We split the data into training and testing, preserving the Rating distribution. We use 70% of the data for training and 30% for testing.

set.seed(831)
samp <- createDataPartition(cr$Rating, p = .70, list = FALSE)
train = cr[samp, ] 
test = cr[-samp, ]

One Hot Encoding of Rating

To make the dependent variable compatible with keras modeling, we need to convert the current categorical variable, Rating (with levels 1-5), to a one hot encoding/dummy variable representation with levels 0-4 (to be consistent with Python indexing).

# make compatible
y_train <- train$Rating - 1
y_test <- test$Rating - 1
# convert y variable to one-hot-encoding representation
y_train <- to_categorical(y = y_train, num_classes = 5)
y_test <- to_categorical(y = y_test, num_classes = 5)
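
As a quick sanity check, each row of the encoded matrix should contain a single 1 in the column corresponding to its class:

head(y_train, 3) # first three rows of the 5-column indicator matrix
colSums(y_train) # number of training documents per class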

Text Tokenization

The final preliminary step before we can create our deep learning model is text tokenization. For modeling compatibility purposes, we tokenize our cleaned sentences based on white space and create a list of tokens for each document. From there, the text is indexed/vectorized.

Before doing so, we need to choose the number of terms to include in the vocabulary (num_words) and the number of tokens to include per document (max_length).

num_words will set the maximum number of words in our vocabulary. We will set this value to 5000, but it can be adjusted as a modeling choice.

num_words <- 5000

Keras requires that all inputs have the same dimensionality, so documents longer than max_length are truncated and shorter documents are padded.

We will choose the maximum length of a document to be two standard deviations above the mean.

counts <- sapply(strsplit(cr$clean_text, " "), length)
mean(counts) # average tokens per document
## [1] 30.9268
max(counts) # maximum tokens per document
## [1] 62
max_length <- round(mean(counts) + 2 * sd(counts))
max_length
## [1] 58

In choosing 58 as our maximum per-document token count, about 99.8% of documents will not be truncated.
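
This figure can be verified directly from the token counts:

mean(counts <= max_length) # proportion of documents that fit within 58 tokens (≈ 0.9984)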

Once these numbers are chosen, we initialize the text vectorizer (layer_text_vectorization). Then, we apply it to the text.

text_vectorization <- layer_text_vectorization(
  max_tokens = num_words, 
  output_sequence_length = max_length
)

# apply
text_vectorization %>% 
  adapt(cr$clean_text)

We can compare the first document before and after applying text vectorization:

cr$clean_text[1]
## [1] "just look perfect loung sleep cami lbs purchas dark purpl small hang bit loos prefer fit sleep shirt good qualiti easi wash machin delic cool airi hot summer night"
text_vectorization(matrix(cr$clean_text[1], ncol = 1))
## tf.Tensor(
## [[  13    7   12  698  745  294   78   49  310  613   16  231   35  131
##    390    4  745   36   65   46  182  117 1007  325  286  534  328   63
##    352    0    0    0    0    0    0    0    0    0    0    0    0    0
##      0    0    0    0    0    0    0    0    0    0    0    0    0    0
##      0    0]], shape=(1, 58), dtype=int64)

NN using TensorFlow and Keras

The general framework for modeling using keras (running on TensorFlow) is: (1) define the model architecture, (2) compile the model with a loss function, optimizer, and metrics, (3) fit the model to the training data, (4) evaluate performance, and (5) predict on new data, as sketched below.
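
As a schematic (the object names here are placeholders for the real ones we create below):

# model <- keras_model(input, output)         # 1. define the architecture
# model %>% compile(optimizer, loss, metrics) # 2. specify how to train
# model %>% fit(x_train, y_train, ...)        # 3. train
# model %>% evaluate(x_test, y_test)          # 4. evaluate on held-out data
# predict(model, x_new)                       # 5. predict on new data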

Recurrent Neural Network (RNN) Model

Recurrent neural networks (RNNs) are used for processing sequences of data, such as time series or text. RNNs are recurrent because they contain loops in which the output of a given layer becomes input to the same layer in the next step. For time series, the next step is the next point in time. For text, this step is the next word in a series of words.

The recurrent part is the Long Short-Term Memory (LSTM) layer, which helps the network learn long-term word dependencies.

RNNs are popular choices for sequence tasks such as time-series forecasting, speech recognition, and natural language processing.

Embedding Layer

The first layer in the RNN is the embedding layer, which converts each integer-token into a vector of values. The embedding layer is initialized with random weights and will learn an embedding for all of the words in the training dataset. This is necessary because the integer-tokens may take on values between 0 and 4,999 for a vocabulary of 5,000 words, and the RNN cannot work well with raw values in such a wide range. The embedding layer is trained as part of the RNN and will learn to map words with similar semantic meanings to similar embedding-vectors. This creates a denser vector representation that reduces dimensionality (which is particularly important for text!).

In addition to creating your own embeddings in the embedding layer, there are pre-defined word embeddings that can be used, including Word2Vec and GloVe, which can be loaded into the NN to save time during model training. These can also be useful with small datasets and can improve model performance.

The size of the embedding vector will vary and is a modeling choice. For smaller datasets, the value may also be small (8 or 16), and for larger datasets it can be much larger (upwards of 1000). For applications of sentiment analysis, smaller embedding sizes are typically used.

First, we define the size of the embedding vector. In this case, we set it to 32, so that each integer-token is converted to a vector of length 32.

The embedding-layer also needs to know the number of words in the vocabulary (num_words) and the length of the padded token-sequences (max_length).

LSTM Layer

Next, we add the LSTM layer. The units argument is the number of neurons in the layer. More neurons = more memory. A guideline for choosing the number of units is a value between the number of classes being predicted (5) and the sequence length (58 tokens per observation). Let's use 32.

Dropout is the percentage of neurons that will be randomly disabled when processing the layer's input and output, which reduces overfitting.

Recurrent Dropout is the percentage of neurons that will be randomly disabled when the layer’s output is fed back into the layer again to allow the network to learn from what it has previously seen.

With text, the NN can usually perform better if it processes sequences both from start to finish and backwards. For this reason, we wrap the LSTM in a bidirectional() layer.

Dense Layers

Dense layers are fully connected layers.

First, we add a dense hidden layer after our LSTM layer and set the activation function to relu (rectified linear unit).

Finally, we include a dense output layer that takes the input from the hidden layer and produces an output for each of the 5 class levels. We use the softmax activation for multiclass classification.

embedding_size <- 32

input <- layer_input(shape = c(1), dtype = "string")

output <- input %>% 
  text_vectorization() %>%
  layer_embedding(input_dim = num_words + 1, 
                  output_dim = embedding_size) %>%
  bidirectional(layer_lstm(units = embedding_size,
             dropout = 0.2, 
             recurrent_dropout = 0.2)) %>%
  layer_dense(units = embedding_size, 
              activation = "relu") %>%
  layer_dense(units = 5, 
              activation = "softmax")



rnn_model <- keras_model(input, output)
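
Once the model object is created, summary() prints the layer stack and parameter counts, which is a useful check of the architecture:

summary(rnn_model)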

Compiling the Model

We compile the model to specify the loss function, optimizer, and metrics that we will use to assess the performance of our model, both at each epoch and overall.

We use categorical crossentropy, which measures the performance of a classification model whose output is multiclass. For binary classification, binary crossentropy can be used.

For more information about the stochastic gradient-based optimizer, Adam, that we are using, see Kingma and Ba (2014), "Adam: A Method for Stochastic Optimization". The optimizer adjusts the weights throughout the neural network as it learns. The learning rate of the optimizer is a hyperparameter that can be tuned to improve model performance.

In addition to accuracy (which is default), additional metrics can be added and returned at each epoch.

rnn_model %>% compile(
  optimizer = 'adam',
  loss = 'categorical_crossentropy',
  metrics = list('accuracy')
)
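
To tune the learning rate mentioned above, the optimizer can be constructed explicitly instead of passing the string 'adam'. A sketch, assuming a recent version of the keras package (the argument is named lr in older versions; 0.001 is Adam's default and is shown only for illustration):

rnn_model %>% compile(
  optimizer = optimizer_adam(learning_rate = 0.001), # tunable learning rate
  loss = 'categorical_crossentropy',
  metrics = list('accuracy')
)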

Training the Model

The number of epochs is the number of times the model processes the entire set of training data, and the batch size specifies the number of samples to process at a time during each epoch. We set the number of epochs to 10 and the batch size to 32, meaning we will iterate over the data in batches of 32 samples. In general, larger batch sizes can decrease model accuracy.

We use 10% of the training set as a small validation set, so we have a rough idea of whether the model is generalizing well or overfitting to the training set. This reserves the last 10% of the training data for validation purposes.

history <- rnn_model %>% fit(
  train$clean_text,
  y_train,
  epochs = 10,
  batch_size = 32,
  validation_split = 0.1,
  verbose=2
)

plot(history, method = "auto", smooth=FALSE)

To evaluate model performance, we need to convert the predicted class probabilities back to a single class label by taking, for each document, the class with the highest predicted probability.

ypred_train <- predict(rnn_model, train$clean_text) # matrix of class probabilities
ypred_train_cat <- vector()
for (i in 1:nrow(ypred_train)){
  ypred_train_cat[i] <- which.max(ypred_train[i,]) # index of the most probable class
}
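
The loop can equivalently be written as a single vectorized call:

ypred_train_cat <- apply(ypred_train, 1, which.max) # max-probability class per row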

Then, we can use the confusionMatrix() function from the caret package.

confusionMatrix(data = factor(ypred_train_cat), 
                reference = factor(train$Rating),
                mode = "everything"
                )
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    1    2    3    4    5
##          1   30   18    2    0    0
##          2   41   70   19    1    0
##          3   29  111  306   25    4
##          4    4    8   60  488   52
##          5    3    5    8  168 1704
## 
## Overall Statistics
##                                           
##                Accuracy : 0.8232          
##                  95% CI : (0.8094, 0.8364)
##     No Information Rate : 0.5577          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.7065          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
## 
## Statistics by Class:
## 
##                      Class: 1 Class: 2 Class: 3 Class: 4 Class: 5
## Sensitivity          0.280374  0.33019  0.77468   0.7155   0.9682
## Specificity          0.993440  0.97928  0.93879   0.9499   0.8682
## Pos Pred Value       0.600000  0.53435  0.64421   0.7974   0.9025
## Neg Pred Value       0.975209  0.95306  0.96680   0.9237   0.9558
## Precision            0.600000  0.53435  0.64421   0.7974   0.9025
## Recall               0.280374  0.33019  0.77468   0.7155   0.9682
## F1                   0.382166  0.40816  0.70345   0.7543   0.9342
## Prevalence           0.033904  0.06717  0.12516   0.2161   0.5577
## Detection Rate       0.009506  0.02218  0.09696   0.1546   0.5399
## Detection Prevalence 0.015843  0.04151  0.15051   0.1939   0.5982
## Balanced Accuracy    0.636907  0.65473  0.85674   0.8327   0.9182

Testing the Model

To get the overall performance of the model on our testing data, we can use evaluate().

results <- rnn_model %>% evaluate(
  test$clean_text, 
  y_test, 
  verbose = 0)

results
## $loss
## [1] 1.220713
## 
## $accuracy
## [1] 0.6087278

Then, we can perform the same conversion on the predicted probabilities and use confusionMatrix() to assess performance.

ypred_test <- predict(rnn_model, test$clean_text)
ypred_test_cat <- vector()
for (i in 1:nrow(ypred_test)){
  ypred_test_cat[i] <- which.max(ypred_test[i,])
}
confusionMatrix(data = factor(ypred_test_cat),
                reference = factor(test$Rating),
                mode = "everything")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   1   2   3   4   5
##          1  11   6   1   0   0
##          2  14  13  16   3   1
##          3  19  47  65  33  16
##          4   5  21  44  83  86
##          5   5   9  28 175 651
## 
## Overall Statistics
##                                           
##                Accuracy : 0.6087          
##                  95% CI : (0.5821, 0.6349)
##     No Information Rate : 0.5577          
##     P-Value [Acc > NIR] : 8.229e-05       
##                                           
##                   Kappa : 0.3316          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
## 
## Statistics by Class:
## 
##                      Class: 1 Class: 2 Class: 3 Class: 4 Class: 5
## Sensitivity          0.203704 0.135417  0.42208  0.28231   0.8634
## Specificity          0.994607 0.972930  0.90401  0.85255   0.6371
## Pos Pred Value       0.611111 0.276596  0.36111  0.34728   0.7500
## Neg Pred Value       0.967766 0.936398  0.92406  0.81042   0.7872
## Precision            0.611111 0.276596  0.36111  0.34728   0.7500
## Recall               0.203704 0.135417  0.42208  0.28231   0.8634
## F1                   0.305556 0.181818  0.38922  0.31144   0.8027
## Prevalence           0.039941 0.071006  0.11391  0.21746   0.5577
## Detection Rate       0.008136 0.009615  0.04808  0.06139   0.4815
## Detection Prevalence 0.013314 0.034763  0.13314  0.17678   0.6420
## Balanced Accuracy    0.599155 0.554173  0.66304  0.56743   0.7503

Convolutional Neural Network (CNN) Model

Convolutional neural networks (CNNs), which are sometimes called convnets, are popular models for the classification of 2D image data, but can also be used for 1D data and classification in higher dimensions.

While originally created for image processing, CNNs have been gaining traction in NLP classification.

CNNs involve two major elements: convolution layers, which learn local features from small windows of the input, and pooling layers, which downsample the resulting feature maps.

Embedding Layer

As in the RNN, we begin the CNN with an embedding layer, using the same specifications (including embedding_size) as in the RNN model.

Convolution Layer

The convolution layer uses the relationships between the terms that are close to one another to learn useful features or patterns in small areas of each sample.

The output produced by applying a filter across the input is called a feature map, which summarizes the presence of detected features in the input. 32 and 64 are common choices for the number of filters (although 128 and 256 have been used in large-scale applications).

The kernel is the small window of adjacent tokens that the convolution learns from, and the kernel_size is a tunable hyperparameter, typically set to 3, 5, or 7.

Pooling Layer

We include a pooling layer after the convolution layer to reduce overfitting and computation time. A pooling layer compresses the feature maps by keeping only the strongest signals and discarding the rest, which helps to increase the generalizability of the model.

Dense Layer

We add a typical hidden layer, with 100 nodes. This layer uses what was learned about the features in the previous layers and will take that information to learn the relationships between the features and the class labels.

The rectified linear unit (relu) activation function is used in this layer. The relu activation function is the most popular in deep learning networks for its speed and efficiency.

Dense Output Layer

Finally, we add a dense output layer, which takes the input from the dense hidden layer and transforms it into output for the 5 class levels (0-4) of the Rating variable.

We use the same input as in the RNN model and change the output. We then create the model by combining the input and new output (output_cnn) in the keras_model() function. After this, we compile the model using the same arguments as in the RNN model.

output_cnn <- input %>% 
  text_vectorization() %>%
  layer_embedding(input_dim = num_words + 1, 
                  output_dim = embedding_size) %>%
  layer_conv_1d(filters = embedding_size, 
                kernel_size = 5, 
                activation='relu') %>%
  layer_global_max_pooling_1d() %>%
  layer_dense(units = 100, 
              activation='relu') %>%
  layer_dense(units = 5, 
              activation = "softmax")

cnn_model <- keras_model(input, output_cnn)

cnn_model %>% compile(
  optimizer = 'adam',
  loss = 'categorical_crossentropy',
  metrics = list('accuracy')
)

Training the Model

history_cnn <- cnn_model %>% fit(
  train$clean_text,
  y_train,
  epochs = 10,
  batch_size = 32,
  validation_split = 0.1,
  verbose=2
)

plot(history_cnn, method = "auto", smooth=FALSE)

ypred_train_cnn <- predict(cnn_model, train$clean_text)
ypred_train_cat_cnn <- vector()
for (i in 1:nrow(ypred_train_cnn)){
  ypred_train_cat_cnn[i] <- which.max(ypred_train_cnn[i,])
}
confusionMatrix(data = factor(ypred_train_cat_cnn), 
                reference = factor(train$Rating),
                mode = "everything")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    1    2    3    4    5
##          1   95    1    4    1    0
##          2    4  188    0    1    1
##          3    4   12  370   12   10
##          4    2    8   13  632   23
##          5    2    3    8   36 1726
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9541          
##                  95% CI : (0.9462, 0.9611)
##     No Information Rate : 0.5577          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9257          
##                                           
##  Mcnemar's Test P-Value : 0.004159        
## 
## Statistics by Class:
## 
##                      Class: 1 Class: 2 Class: 3 Class: 4 Class: 5
## Sensitivity            0.8879  0.88679   0.9367   0.9267   0.9807
## Specificity            0.9980  0.99796   0.9862   0.9814   0.9649
## Pos Pred Value         0.9406  0.96907   0.9069   0.9322   0.9724
## Neg Pred Value         0.9961  0.99190   0.9909   0.9798   0.9754
## Precision              0.9406  0.96907   0.9069   0.9322   0.9724
## Recall                 0.8879  0.88679   0.9367   0.9267   0.9807
## F1                     0.9135  0.92611   0.9215   0.9294   0.9765
## Prevalence             0.0339  0.06717   0.1252   0.2161   0.5577
## Detection Rate         0.0301  0.05957   0.1172   0.2003   0.5469
## Detection Prevalence   0.0320  0.06147   0.1293   0.2148   0.5624
## Balanced Accuracy      0.9429  0.94238   0.9615   0.9540   0.9728

Testing Performance

results_cnn <- cnn_model %>% evaluate(
  test$clean_text, 
  y_test, 
  verbose = 0)

results_cnn
## $loss
## [1] 1.524807
## 
## $accuracy
## [1] 0.5798817
ypred_test_cnn <- predict(cnn_model, test$clean_text)
ypred_test_cat_cnn <- vector()
for (i in 1:nrow(ypred_test_cnn)){
  ypred_test_cat_cnn[i] <- which.max(ypred_test_cnn[i,])
}
confusionMatrix(data = factor(ypred_test_cat_cnn),
                reference = factor(test$Rating),
                mode = "everything")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   1   2   3   4   5
##          1   8   5   6   1   0
##          2   8  20  16   5   2
##          3  24  50  57  26  35
##          4   8  14  43  92 110
##          5   6   7  32 170 607
## 
## Overall Statistics
##                                           
##                Accuracy : 0.5799          
##                  95% CI : (0.5531, 0.6064)
##     No Information Rate : 0.5577          
##     P-Value [Acc > NIR] : 0.05291         
##                                           
##                   Kappa : 0.2981          
##                                           
##  Mcnemar's Test P-Value : 4.682e-10       
## 
## Statistics by Class:
## 
##                      Class: 1 Class: 2 Class: 3 Class: 4 Class: 5
## Sensitivity          0.148148  0.20833  0.37013  0.31293   0.8050
## Specificity          0.990755  0.97532  0.88731  0.83459   0.6405
## Pos Pred Value       0.400000  0.39216  0.29688  0.34457   0.7384
## Neg Pred Value       0.965465  0.94158  0.91638  0.81382   0.7226
## Precision            0.400000  0.39216  0.29688  0.34457   0.7384
## Recall               0.148148  0.20833  0.37013  0.31293   0.8050
## F1                   0.216216  0.27211  0.32948  0.32799   0.7703
## Prevalence           0.039941  0.07101  0.11391  0.21746   0.5577
## Detection Rate       0.005917  0.01479  0.04216  0.06805   0.4490
## Detection Prevalence 0.014793  0.03772  0.14201  0.19749   0.6080
## Balanced Accuracy    0.569452  0.59183  0.62872  0.57376   0.7228

Tuning Deep Learning Models

Variables that impact your model's performance include the quantity and quality of the training data, the network architecture (the types and number of layers), and how the model is trained.

Some parameters that can be tuned include the number of layers, the number of nodes per layer, the embedding size, the number of filters and kernel size (for CNNs), the dropout rate, the optimizer's learning rate, the batch size, and the number of epochs.

Overfitting

In the presence of overfitting, common approaches include:

  • training with more data
  • decreasing model capacity
  • adding dropout
  • adding weight regularization

Decreasing model capacity involves reducing the number of model parameters.

Regularization, specifically dropout, can be used. Dropout is added like any other layer in the model (with layer_dropout()) and can be placed between layers so that the model does not learn spurious features at hidden nodes. A common dropout rate is 0.5.
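
For example, a dropout layer could be inserted between the dense layers of the CNN defined above; a sketch reusing the objects from earlier (the 0.5 rate follows the common choice mentioned above):

output_cnn_do <- input %>% 
  text_vectorization() %>%
  layer_embedding(input_dim = num_words + 1, 
                  output_dim = embedding_size) %>%
  layer_conv_1d(filters = embedding_size, 
                kernel_size = 5, 
                activation = 'relu') %>%
  layer_global_max_pooling_1d() %>%
  layer_dense(units = 100, 
              activation = 'relu') %>%
  layer_dropout(rate = 0.5) %>% # randomly disable 50% of these nodes during training
  layer_dense(units = 5, 
              activation = "softmax")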

You can also add weight regularization (L1 or L2) to layers through the kernel_regularizer argument. Weight regularization puts constraints on the complexity of a network by forcing its weights to take only small values, which makes the distribution of weight values more "regular".
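
For instance, the dense hidden layer from either model could be given an L2 penalty; a sketch (the penalty value 0.001 is illustrative, not a recommendation):

# a drop-in replacement for the dense hidden layer in either model above
reg_dense <- layer_dense(units = 100, 
                         activation = 'relu', 
                         kernel_regularizer = regularizer_l2(l = 0.001)) # penalize large weights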

For more information, visit: https://tensorflow.rstudio.com/tutorials/beginners/basic-ml/tutorial_overfit_underfit/

Class Imbalance

In the presence of class imbalance, common approaches include:

  • resampling (i.e., random oversampling, random undersampling, SMOTE)
  • alternate performance metrics (precision, recall, F1, AUC/ROC)
  • initialize class weights

For more information, visit: https://www.tensorflow.org/tutorials/structured_data/imbalanced_data

Initializing Class Weights

When fitting the model, you can incorporate class weights to handle class imbalance. In doing so, you create a weighted average for the loss function, where a misclassification of a minority class is weighted more heavily than one of a majority class.

The code below is equivalent to Python scikit-learn's calculation for 'balanced' class weights.

dist <- table(train$Rating)

weights <- c()
for (i in 1:length(dist)){
  # n_samples / (n_classes * n_samples_in_class_i)
  weights[i] <- sum(dist) / (length(dist) * dist[i])
}
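
Since R is vectorized, the same weights can be computed without a loop:

weights <- as.numeric(sum(dist) / (length(dist) * dist)) # identical result, no loop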

class_weights <- list("0" = weights[1],
                      "1" = weights[2],
                      "2" = weights[3],
                      "3" = weights[4],
                      "4" = weights[5])

When training the model, the list of class weights is assigned to the class_weight argument.

history_rnn_cw <- rnn_model %>% fit(
  train$clean_text,
  y_train,
  epochs = 10,
  batch_size = 32,
  validation_split = 0.1,
  verbose=2,
  class_weight = class_weights
)

Resampling Strategies

Resampling (random oversampling, random undersampling, SMOTE, etc.) can be used to handle class imbalance. Below, we demonstrate random oversampling using the caret package.

To keep the validation set untouched, the validation and training sets need to be separated prior to resampling. The code below reserves the final 10% of observations as the validation set (val); all other observations remain as the training set (train_for_us), which will be randomly oversampled.

val <- tail(train, round(nrow(train)*.1))
train_for_us <- head(train, nrow(train)-nrow(val))

We use the upSample() function from the caret package to randomly oversample the training data. This will replicate observations for the minority classes to match the frequency of the majority class.

set.seed(831) 
train_us <- upSample(x=train_for_us$clean_text,
                     y = as.factor(train_for_us$Rating), 
                     yname="Rating")
table(train_us$Rating) # view the new Rating distribution
## 
##    1    2    3    4    5 
## 1591 1591 1591 1591 1591

Next, we need to transform variables before training the model. During oversampling, the clean text column was converted to a factor and renamed to x; it needs to be converted back to a character variable.

train_us$x <- as.character(train_us$x)

Next, we need to convert the Rating variable to a dummy-coded representation for both the training and validation sets.

# Training
y_train_us <- as.integer(train_us$Rating) # convert back to numeric from factor
y_train_us <- y_train_us - 1
y_train_us <- to_categorical(y = y_train_us, num_classes = 5)
# Validation
y_val <- val$Rating - 1
y_val <- to_categorical(y = y_val, num_classes = 5)

Finally, we can train the model using oversampled training data.

history_rnn_us <- rnn_model %>% fit(
  train_us$x,
  y_train_us,
  epochs = 10,
  batch_size = 32,
  validation_data = list(val$clean_text, y_val),
  verbose=2
)