In real-world data analysis, you will often want to combine text and non-text data when classifying observations. To do this, you can use a multi-input neural network.
library(tm) # text mining
library(caret) # classification
library(keras) # deep learning
library(tensorflow) # deep learning
First, we import the data, remove observations with missing review text, and prepare the data for corpus representation. Then, we preprocess and stem the text and add the cleaned, stemmed text back to the original dataframe.
cr <- read.csv("C:/Users/chh35/OneDrive - Drexel University/Teaching/Drexel/BSAN 710/Course Content/Week 2/clothing_revs_sample.csv",
stringsAsFactors = FALSE)
cr <- cr[cr$Review.Text != "" & cr$Review.Text != " ",]
names(cr)[1] <- "doc_id"
cr$text <- paste(cr$Title, cr$Review.Text, sep = " ")
crCorpus <- Corpus(DataframeSource(cr))
crCorpus <- tm_map(crCorpus, tolower)
crCorpus <- tm_map(crCorpus, removeNumbers)
punc2Space <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
crCorpus <- tm_map(crCorpus, punc2Space, "/")
crCorpus <- tm_map(crCorpus, content_transformer(function(x) removeWords(x, stopwords("en"))))
crCorpus <- tm_map(crCorpus,
removePunctuation,
preserve_intra_word_contractions = FALSE,
preserve_intra_word_dashes = TRUE)
crCorpus <- tm_map(crCorpus, stemDocument, language = "english")
crCorpus <- tm_map(crCorpus, stripWhitespace)
myCorpus <- tm_map(crCorpus, PlainTextDocument)
## Warning in tm_map.SimpleCorpus(crCorpus, PlainTextDocument): transformation
## drops documents
cr$clean_text <- myCorpus$content$content
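Before moving on, it can help to spot-check that the cleaning behaved as expected. A minimal check (a sketch using the text and clean_text columns created above) is to compare the raw and cleaned versions of a single review:
# compare the raw combined text with its cleaned, stemmed version
cr[1, c("text", "clean_text")]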
We split the data into training and testing, preserving the Rating distribution. We use 70% of the data for training and 30% for testing.
set.seed(831)
samp <- createDataPartition(cr$Rating, p = .70, list = FALSE)
train <- cr[samp, ]
test <- cr[-samp, ]
In this example, we will use the Age, Recommended.IND, and Positive.Feedback.Count variables as our covariates. Age and Positive.Feedback.Count are numeric and should be rescaled, while Recommended.IND is binary and can be left as-is. For compatibility with keras, we convert the data from a dataframe to a matrix after transformation.
x_train2 <- train[,c("Age", "Recommended.IND", "Positive.Feedback.Count")]
x_test2 <- test[,c("Age", "Recommended.IND", "Positive.Feedback.Count")]
## Rescale the numeric variables; fit the range transformation on the training
## data and apply that same transformation to the test data
rangeTrans <- preProcess(x_train2[, c(1,3)], method = "range")
x_train2[, c(1,3)] <- predict(rangeTrans, x_train2[, c(1,3)])
x_test2[, c(1,3)] <- predict(rangeTrans, x_test2[, c(1,3)])
## Convert to matrices for compatibility
x_train2 <- as.matrix(x_train2)
x_test2 <- as.matrix(x_test2)
To make the dependent variable compatible with keras modeling, we need to convert the categorical variable Rating (with levels 1-5) to a one-hot encoding/dummy-variable representation with levels 0-4 (to be consistent with Python indexing).
# make compatible
y_train <- train$Rating - 1
y_test <- test$Rating - 1
# convert y variable to one-hot-encoding representation
y_train <- to_categorical(y = y_train, num_classes = 5)
y_test <- to_categorical(y = y_test, num_classes = 5)
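If you want to confirm the encoding, you can inspect the first few rows of the resulting indicator matrix (each row should contain a single 1 in the column corresponding to that review's rating):
# first six rows of the 5-column one-hot matrix
head(y_train)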
Before vectorizing the text, we choose a maximum vocabulary size (num_words) and a maximum sequence length (max_length). Here we keep the 5,000 most frequent tokens and set the sequence length to the mean document length plus two standard deviations, so shorter reviews are padded and only the longest are truncated.
num_words <- 5000
# token counts per cleaned review
counts <- sapply(strsplit(cr$clean_text, " "), length)
max_length <- round(mean(counts) + 2 * sd(counts))
max_length
## [1] 58
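The mean-plus-two-standard-deviations cutoff keeps sequences short while still covering most reviews in full. A quick check of that coverage, using the counts vector computed above, might look like this:
# proportion of cleaned reviews that fit entirely within max_length tokens
mean(counts <= max_length)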
Once these numbers are chosen, we initialize the text vectorizer (layer_text_vectorization). Then, we apply it to the text.
text_vectorization <- layer_text_vectorization(
  max_tokens = num_words,
  output_sequence_length = max_length
)
# learn the vocabulary from the cleaned text
text_vectorization %>%
  adapt(cr$clean_text)
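After adapt() has run, the layer stores the learned vocabulary. If your version of keras exports get_vocabulary() (recent releases do), you can peek at the most frequent tokens it will keep:
# first entries are the reserved padding/unknown tokens, followed by the most frequent stems
head(get_vocabulary(text_vectorization), 10)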
We define the two inputs separately and create two separate paths, one for the text and one for the other covariates, which we then combine using a concatenate layer. Once the text and non-text covariates are combined, we add our typical output layer, although additional hidden layers can be added to the network.
embedding_size <- 32
# input 1: the raw text (one string per observation)
input1 <- layer_input(shape = c(1), dtype = "string")
# input 2: the three numeric/binary covariates
input2 <- layer_input(shape = c(3))
# text path: vectorize, embed, then summarize with a bidirectional LSTM
text_input <- input1 %>%
text_vectorization() %>%
layer_embedding(input_dim = num_words + 1,
output_dim = embedding_size) %>%
bidirectional(layer_lstm(units = embedding_size,
dropout = 0.2,
recurrent_dropout = 0.2))
# covariate path: two small dense layers with dropout
covar_input <- input2 %>%
layer_dense(units = 10,
activation = "relu") %>%
layer_dropout(rate = 0.5) %>%
layer_dense(units = 10,
activation = "relu")
# combine the two paths
concat_layer <- layer_concatenate(list(text_input, covar_input))
# shared layers after the concatenation, ending in a 5-class softmax output
output <- concat_layer %>%
layer_dense(units = 10,
activation = "relu") %>%
layer_dense(units = 5, activation = "softmax")
rnn_model <- keras_model(list(input1, input2), output)
rnn_model %>% compile(
optimizer = 'adam',
loss = 'categorical_crossentropy',
metrics = list('accuracy')
)
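To check that the two branches were wired together as intended, you can print the model architecture; summary() on a keras model lists each layer with its output shape and parameter count:
# inspect the multi-input architecture
summary(rnn_model)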
history <- rnn_model %>% fit(
list(train$clean_text, x_train2),
y_train,
epochs = 10,
batch_size = 32,
validation_split = 0.1,
verbose=2
)
plot(history, method = "auto", smooth=FALSE)
To evaluate model performance with confusionMatrix(), we need class labels rather than probabilities, so we convert each row of predicted class probabilities into a single predicted rating by taking the class with the highest probability. Because which.max() returns the column index (1 through 5), the predicted labels line up directly with the original Rating values.
ypred_train <- predict(rnn_model, list(train$clean_text, x_train2))
ypred_train_cat <- vector()
for (i in 1:nrow(ypred_train)){
ypred_train_cat[i] <- which.max(ypred_train[i,])
}
confusionMatrix(factor(ypred_train_cat),
factor(train$Rating),
mode="everything")
## Warning in levels(reference) != levels(data): longer object length is not a
## multiple of shorter object length
## Warning in confusionMatrix.default(factor(ypred_train_cat),
## factor(train$Rating), : Levels are not in the same order for reference and data.
## Refactoring data to match.
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 2 3 4 5
## 1 0 0 0 0 0
## 2 86 129 65 1 0
## 3 18 74 283 37 5
## 4 0 4 37 506 37
## 5 3 5 10 138 1718
##
## Overall Statistics
##
## Accuracy : 0.8352
## 95% CI : (0.8218, 0.848)
## No Information Rate : 0.5577
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7283
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 1 Class: 2 Class: 3 Class: 4 Class: 5
## Sensitivity 0.0000 0.60849 0.71646 0.7419 0.9761
## Specificity 1.0000 0.94837 0.95147 0.9685 0.8883
## Pos Pred Value NaN 0.45907 0.67866 0.8664 0.9168
## Neg Pred Value 0.9661 0.97113 0.95911 0.9316 0.9672
## Precision NA 0.45907 0.67866 0.8664 0.9168
## Recall 0.0000 0.60849 0.71646 0.7419 0.9761
## F1 NA 0.52333 0.69704 0.7994 0.9455
## Prevalence 0.0339 0.06717 0.12516 0.2161 0.5577
## Detection Rate 0.0000 0.04087 0.08967 0.1603 0.5444
## Detection Prevalence 0.0000 0.08904 0.13213 0.1850 0.5938
## Balanced Accuracy 0.5000 0.77843 0.83396 0.8552 0.9322
To get the overall performance of the model on our testing data, we can use evaluate().
results <- rnn_model %>% evaluate(
list(test$clean_text, x_test2),
y_test,
verbose = 0)
results
## $loss
## [1] 1.09695
##
## $accuracy
## [1] 0.6198225
Then, we can perform the same conversion on the predicted probabilities and use confusionMatrix() to assess performance on the test set.
ypred_test <- predict(rnn_model, list(test$clean_text, x_test2))
ypred_test_cat <- vector()
for (i in 1:nrow(ypred_test)){
ypred_test_cat[i] <- which.max(ypred_test[i,])
}
confusionMatrix(factor(ypred_test_cat),
factor(test$Rating),
mode="everything")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 2 3 4 5
## 1 0 0 0 0 0
## 2 37 36 31 0 0
## 3 12 49 74 30 16
## 4 0 1 19 92 102
## 5 5 10 30 172 636
##
## Overall Statistics
##
## Accuracy : 0.6198
## 95% CI : (0.5933, 0.6458)
## No Information Rate : 0.5577
## P-Value [Acc > NIR] : 2.119e-06
##
## Kappa : 0.3589
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 1 Class: 2 Class: 3 Class: 4 Class: 5
## Sensitivity 0.00000 0.37500 0.48052 0.31293 0.8435
## Specificity 1.00000 0.94586 0.91068 0.88469 0.6371
## Pos Pred Value NaN 0.34615 0.40884 0.42991 0.7456
## Neg Pred Value 0.96006 0.95192 0.93168 0.82250 0.7635
## Precision NA 0.34615 0.40884 0.42991 0.7456
## Recall 0.00000 0.37500 0.48052 0.31293 0.8435
## F1 NA 0.36000 0.44179 0.36220 0.7915
## Prevalence 0.03994 0.07101 0.11391 0.21746 0.5577
## Detection Rate 0.00000 0.02663 0.05473 0.06805 0.4704
## Detection Prevalence 0.00000 0.07692 0.13388 0.15828 0.6309
## Balanced Accuracy 0.50000 0.66043 0.69560 0.59881 0.7403
As a comparison, we can swap the recurrent text path for a convolutional one: the embedded sequences are passed through a 1D convolution followed by global max pooling instead of a bidirectional LSTM, while the covariate path and the rest of the network stay the same.
# text path: vectorize, embed, then a 1D convolution with global max pooling
text_input_cnn <- input1 %>%
text_vectorization() %>%
layer_embedding(input_dim = num_words + 1,
output_dim = embedding_size) %>%
layer_conv_1d(filters = embedding_size,
kernel_size = 5,
activation='relu') %>%
layer_global_max_pooling_1d()
covar_input_cnn <- input2 %>%
layer_dense(units = 10,
activation = "relu") %>%
layer_dropout(rate = 0.5) %>%
layer_dense(units = 10,
activation = "relu")
concat_layer_cnn <- layer_concatenate(list(text_input_cnn, covar_input_cnn))
output_cnn <- concat_layer_cnn %>%
layer_dense(units = 10, activation = "relu") %>%
layer_dropout(rate = 0.5) %>%
layer_dense(units = 5, activation = "softmax")
cnn_model <- keras_model(list(input1, input2), output_cnn)
cnn_model %>% compile(
optimizer = 'adam',
loss = 'categorical_crossentropy',
metrics = list('accuracy')
)
history_cnn <- cnn_model %>% fit(
list(train$clean_text, x_train2),
y_train,
epochs = 10,
batch_size = 32,
validation_split = 0.1,
verbose=2
)
plot(history_cnn, method = "auto", smooth=FALSE)
ypred_train_cnn <- predict(cnn_model, list(train$clean_text, x_train2))
ypred_train_cat_cnn <- vector()
for (i in 1:nrow(ypred_train_cnn)){
ypred_train_cat_cnn[i] <- which.max(ypred_train_cnn[i,])
}
confusionMatrix(factor(ypred_train_cat_cnn),
factor(train$Rating),
mode="everything")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 2 3 4 5
## 1 82 96 32 0 0
## 2 0 2 0 0 0
## 3 20 102 337 31 3
## 4 0 5 15 553 35
## 5 5 7 11 98 1722
##
## Overall Statistics
##
## Accuracy : 0.8542
## 95% CI : (0.8414, 0.8664)
## No Information Rate : 0.5577
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7614
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 1 Class: 2 Class: 3 Class: 4 Class: 5
## Sensitivity 0.76636 0.0094340 0.8532 0.8109 0.9784
## Specificity 0.95802 1.0000000 0.9435 0.9778 0.9133
## Pos Pred Value 0.39048 1.0000000 0.6836 0.9095 0.9343
## Neg Pred Value 0.99151 0.9334179 0.9782 0.9494 0.9711
## Precision 0.39048 1.0000000 0.6836 0.9095 0.9343
## Recall 0.76636 0.0094340 0.8532 0.8109 0.9784
## F1 0.51735 0.0186916 0.7590 0.8574 0.9559
## Prevalence 0.03390 0.0671736 0.1252 0.2161 0.5577
## Detection Rate 0.02598 0.0006337 0.1068 0.1752 0.5456
## Detection Prevalence 0.06654 0.0006337 0.1562 0.1926 0.5840
## Balanced Accuracy 0.86219 0.5047170 0.8983 0.8943 0.9459
results_cnn <- cnn_model %>% evaluate(
list(test$clean_text, x_test2),
y_test,
verbose = 0)
results_cnn
## $loss
## [1] 1.409228
##
## $accuracy
## [1] 0.6050296
ypred_test_cnn <- predict(cnn_model, list(test$clean_text, x_test2))
ypred_test_cat_cnn <- vector()
for (i in 1:nrow(ypred_test_cnn)){
ypred_test_cat_cnn[i] <- which.max(ypred_test_cnn[i,])
}
confusionMatrix(factor(ypred_test_cat_cnn),
factor(test$Rating),
mode="everything")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 2 3 4 5
## 1 27 27 18 0 0
## 2 2 1 1 0 0
## 3 18 58 83 38 23
## 4 0 2 20 97 121
## 5 7 8 32 159 610
##
## Overall Statistics
##
## Accuracy : 0.605
## 95% CI : (0.5784, 0.6312)
## No Information Rate : 0.5577
## P-Value [Acc > NIR] : 0.0002402
##
## Kappa : 0.346
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 1 Class: 2 Class: 3 Class: 4 Class: 5
## Sensitivity 0.50000 0.0104167 0.53896 0.32993 0.8090
## Specificity 0.96533 0.9976115 0.88564 0.86484 0.6555
## Pos Pred Value 0.37500 0.2500000 0.37727 0.40417 0.7475
## Neg Pred Value 0.97891 0.9295252 0.93728 0.82284 0.7313
## Precision 0.37500 0.2500000 0.37727 0.40417 0.7475
## Recall 0.50000 0.0104167 0.53896 0.32993 0.8090
## F1 0.42857 0.0200000 0.44385 0.36330 0.7771
## Prevalence 0.03994 0.0710059 0.11391 0.21746 0.5577
## Detection Rate 0.01997 0.0007396 0.06139 0.07175 0.4512
## Detection Prevalence 0.05325 0.0029586 0.16272 0.17751 0.6036
## Balanced Accuracy 0.73267 0.5040141 0.71230 0.59739 0.7323
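Finally, to compare the two architectures on the held-out data, a quick side-by-side of the evaluate() output (a sketch using the results and results_cnn objects created above) might look like this:
# test accuracy of the two multi-input models
data.frame(model = c("bidirectional LSTM", "CNN"),
           test_accuracy = c(results$accuracy, results_cnn$accuracy))
Here the bidirectional LSTM edges out the CNN on test accuracy (roughly 0.62 versus 0.61), and both models score noticeably higher on the training data than on the test data.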