In real-world data analysis, you will often want to combine text and non-text data when classifying observations. To do this, you can use a multi-input neural network.
library(tm) # text mining
library(caret) # classification
library(keras) # deep learning
library(tensorflow) # deep learning
First, we import the data, remove observations with missing review text, and prepare the data for corpus representation. Then, we preprocess and stem the text and add the cleaned, stemmed text back to the original dataframe.
cr <- read.csv("C:/Users/chh35/OneDrive - Drexel University/Teaching/Drexel/BSAN 710/Course Content/Week 2/clothing_revs_sample.csv",
stringsAsFactors = FALSE)
cr <- cr[cr$Review.Text != "" & cr$Review.Text != " ",]
names(cr)[1] <- "doc_id"
cr$text <- paste(cr$Title, cr$Review.Text, sep = " ")
crCorpus <- Corpus(DataframeSource(cr))
crCorpus <- tm_map(crCorpus, tolower)
crCorpus <- tm_map(crCorpus, removeNumbers)
punc2Space <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
crCorpus <- tm_map(crCorpus, punc2Space, "/")
crCorpus <- tm_map(crCorpus, content_transformer(function(x) removeWords(x, stopwords("en"))))
crCorpus <- tm_map(crCorpus,
removePunctuation,
preserve_intra_word_contractions = FALSE,
preserve_intra_word_dashes = TRUE)
crCorpus <- tm_map(crCorpus, stemDocument, language = "english")
crCorpus <- tm_map(crCorpus, stripWhitespace)
myCorpus <- tm_map(crCorpus, PlainTextDocument)
## Warning in tm_map.SimpleCorpus(crCorpus, PlainTextDocument): transformation
## drops documents
cr$clean_text <- myCorpus$content$content
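Before moving on, it can help to spot-check that the cleaning behaved as expected. A minimal check (a sketch using the text and clean_text columns created above) is to compare the raw and cleaned versions of a single review:
# compare the raw combined text with its cleaned, stemmed version
cr[1, c("text", "clean_text")]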
We split the data into training and testing, preserving the Rating distribution. We use 70% of the data for training and 30% for testing.
set.seed(831)
samp <- createDataPartition(cr$Rating, p = .70, list = FALSE)
train <- cr[samp, ]
test <- cr[-samp, ]
In this example, we will use the Age, Recommended.IND, and Positive.Feedback.Count variables as our covariates. Age and Positive.Feedback.Count are numeric and should be rescaled, while Recommended.IND is binary and can be left as-is. For compatibility with keras, we convert the data from a dataframe to a matrix after transformation.
x_train2 <- train[,c("Age", "Recommended.IND", "Positive.Feedback.Count")]
x_test2 <- test[,c("Age", "Recommended.IND", "Positive.Feedback.Count")]
## Rescale the numeric variables; fit the range transformation on the training
## data and apply that same transformation to the test data
rangeTrans <- preProcess(x_train2[, c(1,3)], method = "range")
x_train2[, c(1,3)] <- predict(rangeTrans, x_train2[, c(1,3)])
x_test2[, c(1,3)] <- predict(rangeTrans, x_test2[, c(1,3)])
## Convert to matrices for compatibility
x_train2 <- as.matrix(x_train2)
x_test2 <- as.matrix(x_test2)
To make the dependent variable compatible with keras modeling, we need to convert the categorical variable Rating (with levels 1-5) to a one-hot encoding/dummy-variable representation with levels 0-4 (to be consistent with Python indexing).
# make compatible
y_train <- train$Rating - 1
y_test <- test$Rating - 1
# convert y variable to one-hot-encoding representation
y_train <- to_categorical(y = y_train, num_classes = 5)
y_test <- to_categorical(y = y_test, num_classes = 5)
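If you want to confirm the encoding, you can inspect the first few rows of the resulting indicator matrix (each row should contain a single 1 in the column corresponding to that review's rating):
# first six rows of the 5-column one-hot matrix
head(y_train)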
Before vectorizing the text, we choose a maximum vocabulary size (num_words) and a maximum sequence length (max_length). Here we keep the 5,000 most frequent tokens and set the sequence length to the mean document length plus two standard deviations, so shorter reviews are padded and only the longest are truncated.
num_words <- 5000
# token counts per cleaned review
counts <- sapply(strsplit(cr$clean_text, " "), length)
max_length <- round(mean(counts) + 2 * sd(counts))
max_length
## [1] 58
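The mean-plus-two-standard-deviations cutoff keeps sequences short while still covering most reviews in full. A quick check of that coverage, using the counts vector computed above, might look like this:
# proportion of cleaned reviews that fit entirely within max_length tokens
mean(counts <= max_length)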
Once these numbers are chosen, we initialize the text vectorizer (layer_text_vectorization). Then, we apply it to the text.
text_vectorization <- layer_text_vectorization(
  max_tokens = num_words,
  output_sequence_length = max_length
)
# learn the vocabulary from the cleaned text
text_vectorization %>%
  adapt(cr$clean_text)
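After adapt() has run, the layer stores the learned vocabulary. If your version of keras exports get_vocabulary() (recent releases do), you can peek at the most frequent tokens it will keep:
# first entries are the reserved padding/unknown tokens, followed by the most frequent stems
head(get_vocabulary(text_vectorization), 10)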
We define the two inputs separately and create two separate paths, one for the text and one for the other covariates, which we then combine using a concatenate layer. Once the text and non-text covariates are combined, we add our typical output layer, although additional hidden layers can be added to the network.
embedding_size <- 32
# input 1: the raw text (one string per observation)
input1 <- layer_input(shape = c(1), dtype = "string")
# input 2: the three numeric/binary covariates
input2 <- layer_input(shape = c(3))
# text path: vectorize, embed, then summarize with a bidirectional LSTM
text_input <- input1 %>%
text_vectorization() %>%
layer_embedding(input_dim = num_words + 1,
output_dim = embedding_size) %>%
bidirectional(layer_lstm(units = embedding_size,
dropout = 0.2,
recurrent_dropout = 0.2))
# covariate path: two small dense layers with dropout
covar_input <- input2 %>%
layer_dense(units = 10,
activation = "relu") %>%
layer_dropout(rate = 0.5) %>%
layer_dense(units = 10,
activation = "relu")
# combine the two paths
concat_layer <- layer_concatenate(list(text_input, covar_input))
# shared layers after the concatenation, ending in a 5-class softmax output
output <- concat_layer %>%
layer_dense(units = 10,
activation = "relu") %>%
layer_dense(units = 5, activation = "softmax")
rnn_model <- keras_model(list(input1, input2), output)
rnn_model %>% compile(
optimizer = 'adam',
loss = 'categorical_crossentropy',
metrics = list('accuracy')
)
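To check that the two branches were wired together as intended, you can print the model architecture; summary() on a keras model lists each layer with its output shape and parameter count:
# inspect the multi-input architecture
summary(rnn_model)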
history <- rnn_model %>% fit(
list(train$clean_text, x_train2),
y_train,
epochs = 10,
batch_size = 32,
validation_split = 0.1,
verbose=2
)
plot(history, method = "auto", smooth=FALSE)
To evaluate model performance with confusionMatrix(), we need class labels rather than probabilities, so we convert each row of predicted class probabilities into a single predicted rating by taking the class with the highest probability. Because which.max() returns the column index (1 through 5), the predicted labels line up directly with the original Rating values.
ypred_train <- predict(rnn_model, list(train$clean_text, x_train2))
ypred_train_cat <- vector()
for (i in 1:nrow(ypred_train)){
ypred_train_cat[i] <- which.max(ypred_train[i,])
}
confusionMatrix(factor(ypred_train_cat),
factor(train$Rating),
mode="everything")
## Warning in levels(reference) != levels(data): longer object length is not a
## multiple of shorter object length
## Warning in confusionMatrix.default(factor(ypred_train_cat),
## factor(train$Rating), : Levels are not in the same order for reference and data.
## Refactoring data to match.
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 2 3 4 5
## 1 0 0 0 0 0
## 2 86 129 65 1 0
## 3 18 74 283 37 5
## 4 0 4 37 506 37
## 5 3 5 10 138 1718
##
## Overall Statistics
##
## Accuracy : 0.8352
## 95% CI : (0.8218, 0.848)
## No Information Rate : 0.5577
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7283
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 1 Class: 2 Class: 3 Class: 4 Class: 5
## Sensitivity 0.0000 0.60849 0.71646 0.7419 0.9761
## Specificity 1.0000 0.94837 0.95147 0.9685 0.8883
## Pos Pred Value NaN 0.45907 0.67866 0.8664 0.9168
## Neg Pred Value 0.9661 0.97113 0.95911 0.9316 0.9672
## Precision NA 0.45907 0.67866 0.8664 0.9168
## Recall 0.0000 0.60849 0.71646 0.7419 0.9761
## F1 NA 0.52333 0.69704 0.7994 0.9455
## Prevalence 0.0339 0.06717 0.12516 0.2161 0.5577
## Detection Rate 0.0000 0.04087 0.08967 0.1603 0.5444
## Detection Prevalence 0.0000 0.08904 0.13213 0.1850 0.5938
## Balanced Accuracy 0.5000 0.77843 0.83396 0.8552 0.9322
To get the overall performance of the model on our testing data, we can use evaluate().
results <- rnn_model %>% evaluate(
list(test$clean_text, x_test2),
y_test,
verbose = 0)
results
## $loss
## [1] 1.09695
##
## $accuracy
## [1] 0.6198225
Then, we can perform the same conversion on the predicted probabilities and use confusionMatrix() to assess performance on the test set.
ypred_test <- predict(rnn_model, list(test$clean_text, x_test2))
ypred_test_cat <- vector()
for (i in 1:nrow(ypred_test)){
ypred_test_cat[i] <- which.max(ypred_test[i,])
}
confusionMatrix(factor(ypred_test_cat),
factor(test$Rating),
mode="everything")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 2 3 4 5
## 1 0 0 0 0 0
## 2 37 36 31 0 0
## 3 12 49 74 30 16
## 4 0 1 19 92 102
## 5 5 10 30 172 636
##
## Overall Statistics
##
## Accuracy : 0.6198
## 95% CI : (0.5933, 0.6458)
## No Information Rate : 0.5577
## P-Value [Acc > NIR] : 2.119e-06
##
## Kappa : 0.3589
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 1 Class: 2 Class: 3 Class: 4 Class: 5
## Sensitivity 0.00000 0.37500 0.48052 0.31293 0.8435
## Specificity 1.00000 0.94586 0.91068 0.88469 0.6371
## Pos Pred Value NaN 0.34615 0.40884 0.42991 0.7456
## Neg Pred Value 0.96006 0.95192 0.93168 0.82250 0.7635
## Precision NA 0.34615 0.40884 0.42991 0.7456
## Recall 0.00000 0.37500 0.48052 0.31293 0.8435
## F1 NA 0.36000 0.44179 0.36220 0.7915
## Prevalence 0.03994 0.07101 0.11391 0.21746 0.5577
## Detection Rate 0.00000 0.02663 0.05473 0.06805 0.4704
## Detection Prevalence 0.00000 0.07692 0.13388 0.15828 0.6309
## Balanced Accuracy 0.50000 0.66043 0.69560 0.59881 0.7403
As a comparison, we can swap the recurrent text path for a convolutional one: the embedded sequences are passed through a 1D convolution followed by global max pooling instead of a bidirectional LSTM, while the covariate path and the rest of the network stay the same.
# text path: vectorize, embed, then a 1D convolution with global max pooling
text_input_cnn <- input1 %>%
text_vectorization() %>%
layer_embedding(input_dim = num_words + 1,
output_dim = embedding_size) %>%
layer_conv_1d(filters = embedding_size,
kernel_size = 5,
activation='relu') %>%
layer_global_max_pooling_1d()
covar_input_cnn <- input2 %>%
layer_dense(units = 10,
activation = "relu") %>%
layer_dropout(rate = 0.5) %>%
layer_dense(units = 10,
activation = "relu")
concat_layer_cnn <- layer_concatenate(list(text_input_cnn, covar_input_cnn))
output_cnn <- concat_layer_cnn %>%
layer_dense(units = 10, activation = "relu") %>%
layer_dropout(rate = 0.5) %>%
layer_dense(units = 5, activation = "softmax")
cnn_model <- keras_model(list(input1, input2), output_cnn)
cnn_model %>% compile(
optimizer = 'adam',
loss = 'categorical_crossentropy',
metrics = list('accuracy')
)
history_cnn <- cnn_model %>% fit(
list(train$clean_text, x_train2),
y_train,
epochs = 10,
batch_size = 32,
validation_split = 0.1,
verbose=2
)
plot(history_cnn, method = "auto", smooth=FALSE)
ypred_train_cnn <- predict(cnn_model, list(train$clean_text, x_train2))
ypred_train_cat_cnn <- vector()
for (i in 1:nrow(ypred_train_cnn)){
ypred_train_cat_cnn[i] <- which.max(ypred_train_cnn[i,])
}
confusionMatrix(factor(ypred_train_cat_cnn),
factor(train$Rating),
mode="everything")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 2 3 4 5
## 1 82 96 32 0 0
## 2 0 2 0 0 0
## 3 20 102 337 31 3
## 4 0 5 15 553 35
## 5 5 7 11 98 1722
##
## Overall Statistics
##
## Accuracy : 0.8542
## 95% CI : (0.8414, 0.8664)
## No Information Rate : 0.5577
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7614
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 1 Class: 2 Class: 3 Class: 4 Class: 5
## Sensitivity 0.76636 0.0094340 0.8532 0.8109 0.9784
## Specificity 0.95802 1.0000000 0.9435 0.9778 0.9133
## Pos Pred Value 0.39048 1.0000000 0.6836 0.9095 0.9343
## Neg Pred Value 0.99151 0.9334179 0.9782 0.9494 0.9711
## Precision 0.39048 1.0000000 0.6836 0.9095 0.9343
## Recall 0.76636 0.0094340 0.8532 0.8109 0.9784
## F1 0.51735 0.0186916 0.7590 0.8574 0.9559
## Prevalence 0.03390 0.0671736 0.1252 0.2161 0.5577
## Detection Rate 0.02598 0.0006337 0.1068 0.1752 0.5456
## Detection Prevalence 0.06654 0.0006337 0.1562 0.1926 0.5840
## Balanced Accuracy 0.86219 0.5047170 0.8983 0.8943 0.9459
results_cnn <- cnn_model %>% evaluate(
list(test$clean_text, x_test2),
y_test,
verbose = 0)
results_cnn
## $loss
## [1] 1.409228
##
## $accuracy
## [1] 0.6050296
ypred_test_cnn <- predict(cnn_model, list(test$clean_text, x_test2))
ypred_test_cat_cnn <- vector()
for (i in 1:nrow(ypred_test_cnn)){
ypred_test_cat_cnn[i] <- which.max(ypred_test_cnn[i,])
}
confusionMatrix(factor(ypred_test_cat_cnn),
factor(test$Rating),
mode="everything")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 2 3 4 5
## 1 27 27 18 0 0
## 2 2 1 1 0 0
## 3 18 58 83 38 23
## 4 0 2 20 97 121
## 5 7 8 32 159 610
##
## Overall Statistics
##
## Accuracy : 0.605
## 95% CI : (0.5784, 0.6312)
## No Information Rate : 0.5577
## P-Value [Acc > NIR] : 0.0002402
##
## Kappa : 0.346
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 1 Class: 2 Class: 3 Class: 4 Class: 5
## Sensitivity 0.50000 0.0104167 0.53896 0.32993 0.8090
## Specificity 0.96533 0.9976115 0.88564 0.86484 0.6555
## Pos Pred Value 0.37500 0.2500000 0.37727 0.40417 0.7475
## Neg Pred Value 0.97891 0.9295252 0.93728 0.82284 0.7313
## Precision 0.37500 0.2500000 0.37727 0.40417 0.7475
## Recall 0.50000 0.0104167 0.53896 0.32993 0.8090
## F1 0.42857 0.0200000 0.44385 0.36330 0.7771
## Prevalence 0.03994 0.0710059 0.11391 0.21746 0.5577
## Detection Rate 0.01997 0.0007396 0.06139 0.07175 0.4512
## Detection Prevalence 0.05325 0.0029586 0.16272 0.17751 0.6036
## Balanced Accuracy 0.73267 0.5040141 0.71230 0.59739 0.7323
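Finally, to compare the two architectures on the held-out data, a quick side-by-side of the evaluate() output (a sketch using the results and results_cnn objects created above) might look like this:
# test accuracy of the two multi-input models
data.frame(model = c("bidirectional LSTM", "CNN"),
           test_accuracy = c(results$accuracy, results_cnn$accuracy))
Here the bidirectional LSTM edges out the CNN on test accuracy (roughly 0.62 versus 0.61), and both models score noticeably higher on the training data than on the test data.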