1 Introduction

Patients today have access to many kinds of information about drugs, ranging from numeric ratings to free-text comments. The most useful and comprehensive source is the text comments, which describe the effectiveness of a drug, its side effects and other precautions. Yet even with wide access to online review sites and forums, consumers often find it difficult to take in all the relevant information. An algorithm that groups comments by their key characteristics can help.

The purpose of our project is therefore to classify textual data on prescription drugs: first, reviews of their effectiveness; second, reviews of their side effects; and third, general comments.

This research may help consumers of medicines to make the right choice and, at the same time, to save money on gathering further information.

2 Main assumptions

The project consists of the following analysis steps:

corpus construction
data cleaning
corpus descriptive statistics
preparation of Document-Term Matrix for future analysis
cluster analysis

# Load libraries
library(ggplot2)
library(tidyverse)
library(tm)
library(wordcloud2)
library(dplyr)
library(clustertend)
library(factoextra)
library(gridExtra)
library(cluster)
library(dendextend) # Dendrogram (hierarchical clustering)

3 Description of the data set

The dataset used in this project contains patient reviews of specific drugs together with the conditions they were taken for. Each review is divided into three parts: benefits, side effects and an overall comment. The authors of the dataset obtained it by crawling online pharmaceutical review sites.

The data were collected by Professor Surya Kallumadi from Kansas State University, Manhattan, Kansas, USA and Felix Gräßer from Institut für Biomedizinische Technik, Technische Universität Dresden, Dresden, Germany. The data are available from the UCI Machine Learning Repository:

https://archive.ics.uci.edu/ml/datasets/Drug+Review+Dataset+%28Druglib.com%29

First, let’s summarize the data.

# Read the train and test files and stack them into one data frame
train <- read_tsv("drugLibTrain_raw.tsv", col_names = TRUE)
test <- read_tsv("drugLibTest_raw.tsv", col_names = TRUE)
df <- bind_rows(train, test)

names(df)

## [1] "...1"              "urlDrugName"       "rating"           
## [4] "effectiveness"     "sideEffects"       "condition"        
## [7] "benefitsReview"    "sideEffectsReview" "commentsReview"

names(train) = c("ID", "urlDrugName", "rating", "effectiveness", "sideEffects", "condition", "benefitsReview", "sideEffectsReview",  "commentsReview")

names(test) = c("ID", "urlDrugName", "rating", "effectiveness", "sideEffects", "condition", "benefitsReview", "sideEffectsReview",  "commentsReview")

names(df) = c("ID", "urlDrugName", "rating", "effectiveness", "sideEffects", "condition", "benefitsReview", "sideEffectsReview",  "commentsReview")

summary(df)

##        ID       urlDrugName            rating       effectiveness     
##  Min.   :   0   Length:4143        Min.   : 1.000   Length:4143       
##  1st Qu.:1042   Class :character   1st Qu.: 5.000   Class :character  
##  Median :2083   Mode  :character   Median : 8.000   Mode  :character  
##  Mean   :2082                      Mean   : 6.946                     
##  3rd Qu.:3124                      3rd Qu.: 9.000                     
##  Max.   :4161                      Max.   :10.000                     
##  sideEffects         condition         benefitsReview     sideEffectsReview 
##  Length:4143        Length:4143        Length:4143        Length:4143       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##  commentsReview    
##  Length:4143       
##  Class :character  
##  Mode  :character  
##                    
##                    
##

glimpse(df)

## Rows: 4,143
## Columns: 9
## $ ID                <dbl> 2202, 3117, 1146, 3947, 1951, 2372, 1043, 2715, 1591…
## $ urlDrugName       <chr> "enalapril", "ortho-tri-cyclen", "ponstel", "prilose…
## $ rating            <dbl> 4, 1, 10, 3, 2, 1, 9, 10, 10, 1, 7, 8, 8, 9, 4, 8, 6…
## $ effectiveness     <chr> "Highly Effective", "Highly Effective", "Highly Effe…
## $ sideEffects       <chr> "Mild Side Effects", "Severe Side Effects", "No Side…
## $ condition         <chr> "management of congestive heart failure", "birth pre…
## $ benefitsReview    <chr> "slowed the progression of left ventricular dysfunct…
## $ sideEffectsReview <chr> "cough, hypotension , proteinuria, impotence , renal…
## $ commentsReview    <chr> "monitor blood pressure , weight and asses for resol…

Attribute description for the dataset

Attribute 1: urlDrugName (categorical): name of drug
Attribute 2: rating (numerical): 10 star patient rating

hist(as.numeric(df$rating), breaks = 10, main = "Frequencies of rating", xlab = "rating", col = "#4cbea3", labels = TRUE, border = "#FFFFFF")

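Based on the histogram of rating frequencies, note that 968 reviews fall in the highest bin and 556 in the lowest. Because hist() closes its first interval on both sides, the lowest bin may combine ratings 1 and 2, whereas the highest bin holds only rating 10.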
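The counts printed on a histogram depend on where the breaks fall. As a minimal sketch on made-up ratings (the vector `ratings` is illustrative and not taken from the dataset, and the explicit breaks 1, 2, ..., 10 are an assumption about what `breaks = 10` yields for this range), a frequency table gives one count per rating value, while `hist()` merges ratings 1 and 2 into its first bin:

```r
# Illustrative ratings on a 1-10 scale (not the real data)
ratings <- c(1, 1, 2, 5, 8, 8, 9, 10, 10, 10)

# Exact count per rating value
rating_counts <- table(factor(ratings, levels = 1:10))

# Histogram bins with integer breaks; plot = FALSE returns the bins only.
# The first interval is closed on both sides, so it holds ratings 1 and 2.
h <- hist(ratings, breaks = 1:10, plot = FALSE)

exactly_one <- as.integer(rating_counts["1"])  # reviews rated exactly 1
first_bin   <- h$counts[1]                     # reviews rated 1 or 2
```

If exact per-rating counts are needed, `table(df$rating)` is the more direct choice.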

Attribute 3: effectiveness (categorical): 5 step effectiveness rating

ggplot(df, aes(x=effectiveness)) + geom_bar(stat="count") + ggtitle("Frequencies of effectiveness")

Most medications were rated Considerably Effective or Highly Effective. We can therefore expect the benefitsReview variable to contain mostly positive comments from patients.

# using group_by and summarize
df %>%
  group_by(effectiveness) %>%
  summarize(rating_mean = mean(rating))

## # A tibble: 5 × 2
##   effectiveness          rating_mean
##   <chr>                        <dbl>
## 1 Considerably Effective        7.30
## 2 Highly Effective              8.77
## 3 Ineffective                   1.59
## 4 Marginally Effective          3.29
## 5 Moderately Effective          5.38

In addition, the Considerably Effective and Highly Effective categories have the highest mean ratings.

Attribute 4: sideEffects (categorical): 5 step side effect rating

ggplot(df, aes(x=sideEffects)) + geom_bar(stat="count")+ ggtitle("Frequencies of sideEffects")

In most cases, the drugs in our dataset have Mild Side Effects or No Side Effects. We can therefore expect the sideEffectsReview variable to contain mostly comments about slight side effects.

# using group_by and summarize
df %>%
  group_by(sideEffects) %>%
  summarize(rating_mean = mean(rating))

## # A tibble: 5 × 2
##   sideEffects                   rating_mean
##   <chr>                               <dbl>
## 1 Extremely Severe Side Effects        1.66
## 2 Mild Side Effects                    8.13
## 3 Moderate Side Effects                6.18
## 4 No Side Effects                      8.65
## 5 Severe Side Effects                  3.58

Medications rated No Side Effects or Mild Side Effects have the highest mean ratings.

Attribute 5: condition (categorical): name of condition
Attribute 6: benefitsReview (text): patient review of benefits
Attribute 7: sideEffectsReview (text): patient review of side effects
Attribute 8: commentsReview (text): overall patient comment

4 Preparation of data for modeling

We carry out the analysis for three review fields: commentsReview, sideEffectsReview and benefitsReview.

4.1 For commentsReview

The first stage of building the model is creating the corpus. A visual inspection of the comments shows that some of them are very similar.

The next step is to clean the corpus using the following transformations:

stripWhitespace - collapses repeated whitespace
removePunctuation - removes punctuation marks
removeNumbers - removes numbers
content_transformer(tolower) - converts every word to lowercase
removeWords - removes stop words
stemDocument - stems words
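The order of these steps matters when the stop-word list contains stem forms. The following is a minimal base-R sketch of that effect. The helper `remove_words` is hypothetical: it only approximates whole-word removal and is not part of tm. The stemmed string is written by hand for illustration.

```r
# Hypothetical helper: removes complete words only (base-R approximation)
remove_words <- function(text, words) {
  pattern <- paste0("\\b(", paste(words, collapse = "|"), ")\\b")
  gsub(pattern, "", text, perl = TRUE)
}

review <- "however it still helped every morning"

# Stem forms do not match the unstemmed text, so nothing is removed
before_stemming <- remove_words(review, c("howev", "morn"))

# The same list matches once the text has been stemmed
# (stemmed string written by hand, not produced by a stemmer)
stemmed_review <- "howev it still help everi morn"
after_stemming <- remove_words(stemmed_review, c("howev", "morn"))
after_stemming <- gsub("\\s+", " ", trimws(after_stemming))  # tidy whitespace
```

Under these assumptions, stem forms in a stop-word list only take effect if the removal runs after stemming.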

docs <- VCorpus(VectorSource(df$commentsReview))

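# Preliminary cleaning.
# Note: toSpace is defined here but not applied in the steps below.
toSpace <- content_transformer(function (x, pattern) gsub(pattern, " ", x))

docs <- tm_map(docs, removePunctuation)
docs <- tm_map(docs, removeNumbers)
docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, stripWhitespace)
# Remove English stop words plus a custom list. Only whole words are matched,
# so stem forms such as "howev" and "morn" do not remove "however" or
# "morning" at this point, before stemming.
docs <- tm_map(docs, removeWords, c(stopwords("en"), "howev", "just", "due", "still", "per", "also", "aaaaarrrgh", "aana", "aarp", "abdo", "abbout", "abcess", "aboc", "abruptlyand", "even", "morn"))
# Stemming
docs <- tm_map(docs, stemDocument)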

# Generate document-term matrix
dtm <- DocumentTermMatrix(docs)
dtm

## <<DocumentTermMatrix (documents: 4143, terms: 7814)>>
## Non-/sparse entries: 86347/32287055
## Sparsity           : 100%
## Maximal term length: 48
## Weighting          : term frequency (tf)

# Remove sparse terms
dtms <- removeSparseTerms(dtm, sparse = 0.97)

m <- as.matrix(dtms)

# Total number of retained term occurrences per review
m_freq <- as.matrix(rowSums(m))

# Names chosen so that base sd() and mean() are not masked
sd_freq <- sd(m_freq)
mean_freq <- mean(m_freq)

The descriptive statistics and basic graphs for the number of retained term occurrences per review show the following:

Mean: 10.93
Standard deviation: 9.92. The cleaned data set is highly variable: the standard deviation is about 91% of the mean.
The histogram and the box plot show the distribution of the cleaned data.

par(mfrow = c(1,2))
# Create histogram and boxplot
hist(m_freq,
     main = "Histogram",
     col = "blue",
     col.main = "blue")
boxplot(m_freq, 
        main = "Boxplot", 
        col = "blue",
        col.main = "blue")

The Document-Term Matrix (DTM) created above records how often each term occurs in each document of the corpus. Note that the DTM is highly sparse: almost all of its entries are zero.
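To make the notions of a DTM and its sparsity concrete, here is a minimal base-R sketch on three made-up, already cleaned reviews. It does not use tm. The filtering rule at the end mirrors the idea behind `removeSparseTerms()` as we understand it: keep a term only if it occurs in more than a share of `1 - sparse` of the documents. The value `sparse = 0.5` is chosen only to suit the toy data.

```r
# Three toy reviews, already cleaned (illustrative only)
reviews <- c("take pill day", "pill help pain", "take day night")

tokens <- strsplit(reviews, " ")
vocab  <- sort(unique(unlist(tokens)))

# Rows are documents, columns are terms, cells are term counts
toy_dtm <- t(sapply(tokens, function(tk) {
  tabulate(match(tk, vocab), nbins = length(vocab))
}))
colnames(toy_dtm) <- vocab

# Share of zero cells: the sparsity of the matrix
sparsity <- mean(toy_dtm == 0)

# Keep terms that occur in more than (1 - sparse) of the documents
sparse   <- 0.5
doc_freq <- colSums(toy_dtm > 0)
kept     <- toy_dtm[, doc_freq > nrow(toy_dtm) * (1 - sparse), drop = FALSE]
kept_terms <- colnames(kept)
```

By the same rule, `sparse = 0.97` with 4,143 documents would keep only terms that occur in more than about 124 reviews.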

Let’s first analyze the words appearing in the comments with the help of a word cloud.

The terms patients use most often in the general comments are “take”, “day”, “week”, “time”, “effect”, “doctor”, “depress”, “morn”, “pill”, “night”, “start”, etc. The most common words give a first impression of the topic: the medicine prescribed by the doctor takes effect; after some time on the dose (a week, for example) the patient feels much better during the day, depression decreases, mornings feel more energetic and sleep improves.

tdm_1 <- TermDocumentMatrix(docs)
tdm_2 <- as.matrix(tdm_1)
tdm_2[1:20,1:20]

##            Docs
## Terms       1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
##   abandon   0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0
##   abat      0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0
##   abbrevi   0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0
##   abcess    0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0
##   abdomen   0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0
##   abdomin   0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0
##   aberr     0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0
##   abid      0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0
##   abil      0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0
##   abilifi   0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0
##   abl       0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0
##   ablat     0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0
##   abnorm    0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0
##   aboc      0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0
##   aborb     0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0
##   abovem    0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0
##   abovement 0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0
##   abras     0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0
##   abrupt    0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0
##   abscess   0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0

w <- rowSums(tdm_2)  # total frequency of each term across all reviews
w <- subset(w, w >= 100) # keep terms that occur at least 100 times
barplot(w, las = 2, col = rainbow(50))

w <- data.frame(names(w),w)
colnames(w) <- c('word','freq')
wordcloud2(w,size = 0.5, shape = 'triangle', rotateRatio = 0.5, 
           minSize = 1)

4.2 For sideEffectsReview

The first stage of building the model is creating the corpus. A visual inspection of the comments shows that some of them are very similar.

The next step is to clean the corpus.

docs1 <- VCorpus(VectorSource(df$sideEffectsReview))

# Preliminary cleaning
toSpace <- content_transformer(function (x, pattern) gsub(pattern, " ", x))

docs1 <- tm_map(docs1, removePunctuation)
docs1 <- tm_map(docs1, removeNumbers)
docs1 <- tm_map(docs1, content_transformer(tolower))
docs1 <- tm_map(docs1, removeWords, stopwords("english"))
docs1 <- tm_map(docs1, stripWhitespace)

# Stemming 
docs1 <- tm_map(docs1, stemDocument)

# Create document-term matrix 
dtm1 <- DocumentTermMatrix(docs1)
# Reduce sparsity by removing terms that appear in only a few documents
dtms1 <- removeSparseTerms(dtm1, sparse = 0.97)

m1 <- as.matrix(dtms1)

# Total number of retained term occurrences per review
m1_freq = as.matrix(rowSums(m1))

sd1 = sd(m1_freq)
mean1 = mean(m1_freq)

The descriptive statistics and basic graphs for the number of retained term occurrences per review show the following:

Mean: 6.65
Standard deviation: 6.66. The cleaned data set is highly variable: the standard deviation is about as large as the mean.
The histogram and the box plot show the distribution of the cleaned data.

par(mfrow = c(1,2))
# Create histogram and boxplot
hist(m1_freq,
     main = "Histogram",
     col = "green",
     col.main = "green")
boxplot(m1_freq, 
        main = "Boxplot", 
        col = "green",
        col.main = "green")

Terms such as “effect”, “side”, “take”, “day”, “time”, “medic”, “pain”, “feel”, “first”, “stomach”, “headach”, “skin”, “weight”, etc. were used more often than other words. They suggest that, in the first few days or weeks of taking the prescribed medication, patients experienced problems such as abdominal pain, headache and skin allergy.

tdm1_1 <- TermDocumentMatrix(docs1)
tdm1_2 <- as.matrix(tdm1_1)
tdm1_2[1:20,1:20]

##           Docs
## Terms      1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
##   aafter   0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0
##   abandon  0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0
##   abat     0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0
##   abbsess  0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0
##   abdomen  0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0
##   abdomin  0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0
##   abfter   0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0
##   abil     0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0
##   abilifi  0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0
##   abit     0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0
##   abl      0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0
##   abnorm   0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0
##   aboveand 0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0
##   abrupt   0 0 0 0 0 0 0 0 0  0  0  1  0  0  0  0  0  0  0  0
##   abscens  0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0
##   absenc   0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0
##   abslut   0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0
##   absolut  0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0
##   absorb   0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0
##   absorpt  0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0

w1 <- rowSums(tdm1_2)  # total frequency of each term across all reviews
w1 <- subset(w1, w1 >= 100) # keep terms that occur at least 100 times
barplot(w1, las = 2, col = rainbow(50))

w1 <- data.frame(names(w1),w1)
colnames(w1) <- c('word','freq')
wordcloud2(w1,size = 0.5, shape = 'triangle', rotateRatio = 0.5, 
           minSize = 1)

4.3 For benefitsReview

The first stage of building the model is creating the corpus. A visual inspection of the comments shows that some of them are very similar.

The next step is to clean the corpus.

docs2 <- VCorpus(VectorSource(df$benefitsReview))

# Preliminary cleaning
toSpace <- content_transformer(function (x, pattern) gsub(pattern, " ", x))

docs2 <- tm_map(docs2, removePunctuation)
docs2 <- tm_map(docs2, removeNumbers)
docs2 <- tm_map(docs2, content_transformer(tolower))
docs2 <- tm_map(docs2, removeWords, stopwords("english"))
docs2 <- tm_map(docs2, stripWhitespace)
# Custom stop words (the English stop words were already removed above)
docs2 <- tm_map(docs2, removeWords, c("also", "abfter", "aboveand", "even", "will", "just", "didnt", "dont", "howev", "due", "still", "per", "aaaaarrrgh", "aana", "aarp", "abdo", "abbout", "abcess", "aboc", "abruptlyand"))

# Stemming 
docs2 <- tm_map(docs2, stemDocument)

# Create document-term matrix
dtm2 <- DocumentTermMatrix(docs2)

# reduce sparsity
dtms2 <- removeSparseTerms(dtm2, sparse = 0.97)


m2 <- as.matrix(dtms2)

# Total number of retained term occurrences per review

m2_freq = as.matrix(rowSums(m2))

sd2 = sd(m2_freq)
mean2 = mean(m2_freq)

The descriptive statistics and basic graphs for the number of retained term occurrences per review show the following:

Mean: 7.51
Standard deviation: 6.47. The cleaned data set is highly variable: the standard deviation is about 86% of the mean.
The histogram and the box plot show the distribution of the cleaned data.

par(mfrow = c(1,2))
# Creating histogram and boxplot
hist(m2_freq,
     main = "Histogram",
     col = "brown",
     col.main = "brown")
boxplot(m2_freq, 
        main = "Boxplot", 
        col = "brown",
        col.main = "brown")

tdm2_1 <- TermDocumentMatrix(docs2)
tdm2_2 <- as.matrix(tdm2_1)
tdm2_2[1:20,1:20]

##               Docs
## Terms          1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
##   aand         0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0
##   aarm         0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0
##   abait        0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0
##   abandon      0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0
##   abat         0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0
##   abcess       0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0
##   abdomen      0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0
##   abdomin      0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0
##   abdominfirst 0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0
##   abil         0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0
##   abilifi      0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0
##   abl          0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0
##   ablat        0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0
##   abli         0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0
##   ablil        0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0
##   ablv         0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0
##   abnorm       0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0
##   aboard       0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0
##   abort        0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0
##   aboveaverag  0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0

w2 <- rowSums(tdm2_2)  # total frequency of each term across all reviews
w2 <- subset(w2, w2 >= 100) # keep terms that occur at least 100 times
barplot(w2, las = 2, col = rainbow(50))

Comparing the frequent terms across the groups, it should be noted that the terms of the first group, the general comments (“take”, “day”, “week”, “time”, “effect”, “doctor”, “depress”, “normal”, “night”, “start”, etc.), largely coincide with those of the second group, the side-effect reviews (“effect”, “take”, “day”, “time”, “medic”, “pain”, “feel”, “skin”, etc.).

w2 <- data.frame(names(w2),w2)
colnames(w2) <- c('word','freq')
wordcloud2(w2,size = 0.5, shape = 'triangle', rotateRatio = 0.5, 
           minSize = 1)

5 Modeling

To achieve the aim of the project, we analyze each of the variables that contain patient comments. We start with commentsReview, continue with sideEffectsReview and finish with benefitsReview.

5.1 For commentsReview

Before clustering, it is necessary to determine the optimal number of clusters. The silhouette statistic is applied to three partitioning (flat) clustering algorithms, k-means, PAM and CLARA, and to hierarchical clustering.
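As a minimal sketch of how the silhouette statistic selects the number of clusters, the code below computes the average silhouette width for k = 2, ..., 5 on made-up two-dimensional data and picks the k with the largest value. The data, the seed and the range of k are assumptions for illustration; the plots in this report come from `fviz_nbclust()`, not from this code.

```r
library(cluster)  # silhouette()

set.seed(42)
# Two well-separated toy groups in two dimensions (illustrative data)
toy <- rbind(matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
             matrix(rnorm(40, mean = 5, sd = 0.3), ncol = 2))
toy_dist <- dist(toy)

# Average silhouette width for k = 2, ..., 5 using k-means
avg_width <- sapply(2:5, function(k) {
  fit <- kmeans(toy, centers = k, nstart = 20)
  mean(silhouette(fit$cluster, toy_dist)[, "sil_width"])
})
names(avg_width) <- 2:5

# The suggested number of clusters maximizes the average width
best_k <- as.integer(names(which.max(avg_width)))
```

A width close to 1 means that observations sit well inside their own cluster; values near 0 indicate overlapping clusters.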

a <- fviz_nbclust(m, FUNcluster = kmeans, method = "silhouette") + theme_classic() 
b <- fviz_nbclust(m, FUNcluster = cluster::pam, method = "silhouette") + theme_classic() 
c <- fviz_nbclust(m, FUNcluster = cluster::clara, method = "silhouette") + theme_classic() 
d <- fviz_nbclust(m, FUNcluster = hcut, method = "silhouette") + theme_classic() 
grid.arrange(a, b, c, d, ncol=2)

According to the silhouette statistic, the optimal number of clusters is 2 for both k-means and hierarchical clustering.

In the next step we cluster the reviews with k-means and with hierarchical clustering. We use two methods so that the results can be compared.

Let’s start with k-means clustering.

5.1.1 K Means clustering

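# Compute distance between document vectors
d <- dist(m, method = "euclidean")
# k-means with k = 2. Note: kmeans() receives the distance object d, so it
# clusters the rows of the n x n distance matrix rather than the rows of m.
ncl <- 2
kfit <- kmeans(d, ncl, nstart = 100)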
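The call above passes a distance object to `kmeans()`, which is less common than passing the document-term matrix itself. The sketch below, on a made-up count matrix, runs both variants and cross-tabulates the resulting partitions. The object names and the toy values are assumptions for illustration only.

```r
set.seed(1)
# Toy document-term counts: rows 1-3 and rows 4-6 form two obvious groups
toy_m <- rbind(c(5, 0), c(6, 1), c(5, 1),
               c(0, 5), c(1, 6), c(1, 5))

# Conventional input: one row per document, one column per term
fit_features <- kmeans(toy_m, centers = 2, nstart = 10)

# Input in the style of this report: the n x n matrix of pairwise distances,
# so each document is described by its distances to all documents
toy_d <- dist(toy_m, method = "euclidean")
fit_distances <- kmeans(as.matrix(toy_d), centers = 2, nstart = 10)

# Cross-tabulate the two partitions (cluster labels may be swapped)
agreement <- table(fit_features$cluster, fit_distances$cluster)
```

On clearly separated data the two inputs give the same partition; on real, noisy text data they need not agree, so the choice of input is worth stating explicitly.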

Create a list of the document IDs belonging to each cluster.

cl_k = as.data.frame(kfit$cluster)

# Generate the list of documents in each cluster
ncl = 2
cl = list()
for (i in 1:ncl) {
  cl[paste("cl_",i, sep = "")] = list(rownames(subset(cl_k, cl_k == i)))
}
#cl

The two clusters are shown in the graph below.

clusplot(as.matrix(d), kfit$cluster, color=TRUE, shade=TRUE, labels=2, lines=0)

Let’s analyze the possible themes of the clusters using their most frequent words.

# Generate corpuses for two clusters

ncl=2
for (i in 1:ncl) {
  name = paste("cl_corp_", i, sep = "")
  assign(name, docs[match(cl[[i]], names(docs))])
} 

Tdm = list()

# Generate a list of TDMs per cluster
for (i in 1:ncl) {
  bigram_dtm_i = TermDocumentMatrix(get(paste("cl_corp_",i,sep="")))
  tdm_i <- as.matrix(bigram_dtm_i)
  Tdm[paste("cluster_",i,sep="")] = list(tdm_i)
}

# Chart the 20 most frequent terms in each cluster
par(mfrow = c(1,2))
for (i in 1:ncl) {
  cl_m = as.matrix(Tdm[[i]])
  # barplot() has no 'decreasing' argument; the ordering is done by sort()
  barplot(sort(sort(rowSums(cl_m), decreasing = TRUE)[1:20], decreasing = FALSE),
          las = 2,
          horiz = TRUE,
          main = paste("20 most frequent terms in cluster", i, sep = " "),
          cex.main = 0.8,
          cex.names = 0.8,
          col.main = "blue")
}
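The per-cluster term ranking can also be sketched without building a separate corpus for each cluster. The example below is a minimal base-R illustration on a made-up document-term matrix; the matrix, the term names and the cluster labels are all hypothetical.

```r
# Toy document-term matrix (rows = reviews, columns = stemmed terms)
toy_dtm <- rbind(c(2, 0, 1, 0),
                 c(2, 0, 2, 0),
                 c(0, 3, 0, 1),
                 c(0, 1, 0, 2))
colnames(toy_dtm) <- c("pain", "pill", "work", "daili")
labels <- c(1, 1, 2, 2)   # hypothetical cluster assignment

# Sum the term counts within each cluster
term_totals <- rowsum(toy_dtm, group = labels)

# Rank the terms per cluster and keep the two most frequent ones
top_terms <- apply(term_totals, 1, function(x) {
  names(sort(x, decreasing = TRUE))[1:2]
})
```

Each column of `top_terms` then lists the leading terms of one cluster, which is the information the bar charts display.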

Based on the clustering, we can distinguish two groups of drugs. The first group includes terms such as “start”, “work”, “pain”, “get”, “feel”, “dose”, etc. This suggests that the medicine prescribed by the doctor helps at the specified dosage: the patient feels an improvement in their condition and the pain gradually disappears.

The second group includes terms such as “pill”, “daily”, “treatment”, “prescribe”, “every”, etc. This group is associated with the regular intake of medication prescribed by a doctor, for example one tablet every day. In this case the patient needs to remain under the supervision of a doctor.

5.1.2 Hierarchical algorithms

Let’s compare the k-means results with an alternative method, hierarchical clustering.

hc <- hclust(d, method = "ward.D")

library(extrafont)

## Registering fonts with R

my_font <- "Roboto Condensed"
hc_d = as.dendrogram(hc)


# Plot a dendrogram

plot(hc_d, main = "Method Ward",leaflab = "none", col.main = "dodgerblue")

# Add cluster rectangles 
ncl = 2
rect.dendrogram(hc_d, k = ncl, border = "blue", xpd = FALSE, lower_rect = 0)

clward = as.data.frame(cutree(hc_d, ncl))
# Generate list of clusters 
ncl = 2
cl_h = list()
for (i in 1:ncl) {
cl_h[paste("cl_h_",i, sep = "")] = list(rownames(subset(clward, clward == i)))
}
#cl_h

ncl=2
# Generate corpuses for two clusters
for (i in 1:ncl) {
  name = paste("cl_corp_h_", i, sep = "")
  assign(name, docs[match(cl_h[[i]], names(docs))])
} 

Tdm_h = list()

# Generate a list of TDMs per cluster
for (i in 1:ncl) {
  bigram_dtm_h_i = TermDocumentMatrix(get(paste("cl_corp_h_",i,sep="")))
  tdm_h_i <- as.matrix(bigram_dtm_h_i)
  Tdm_h[paste("cluster_h_",i,sep="")] = list(tdm_h_i)
}

# Chart the 20 most frequent terms in each cluster
par(mfrow = c(1,2))
for (i in 1:ncl) {
  cl_m_h = as.matrix(Tdm_h[[i]])
  # barplot() has no 'decreasing' argument; the ordering is done by sort()
  barplot(sort(sort(rowSums(cl_m_h), decreasing = TRUE)[1:20], decreasing = FALSE),
          las = 2,
          horiz = TRUE,
          main = paste("20 most frequent terms in cluster", i, sep = " "),
          cex.main = 0.8,
          cex.names = 0.8,
          col.main = "blue")
}

The results of the hierarchical analysis are largely similar to those of k-means; only the order of the groups has changed.
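One way to check such a similarity numerically is to cross-tabulate the two sets of cluster labels. The following minimal sketch does this on made-up data; the data, the seed and the object names are assumptions, and for two clusters a swap of labels is handled by taking the better of the two possible matchings.

```r
set.seed(7)
# Illustrative data with two clear groups
toy <- rbind(matrix(rnorm(20, mean = 0, sd = 0.2), ncol = 2),
             matrix(rnorm(20, mean = 3, sd = 0.2), ncol = 2))

km_labels <- kmeans(toy, centers = 2, nstart = 10)$cluster
hc_labels <- cutree(hclust(dist(toy), method = "ward.D"), k = 2)

# Rows: k-means clusters, columns: hierarchical clusters
cross_tab <- table(kmeans = km_labels, hclust = hc_labels)

# Share of observations on which the methods agree, allowing swapped labels
on_diagonal <- sum(diag(cross_tab))
agreement <- max(on_diagonal, sum(cross_tab) - on_diagonal) / sum(cross_tab)
```

Applied to the report's own objects, the same idea would read `table(kfit$cluster, cutree(hc, 2))`.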

5.2 For sideEffectsReview

a1 <- fviz_nbclust(m1, FUNcluster = kmeans, method = "silhouette") + theme_classic() 
b1 <- fviz_nbclust(m1, FUNcluster = cluster::pam, method = "silhouette") + theme_classic() 
c1 <- fviz_nbclust(m1, FUNcluster = cluster::clara, method = "silhouette") + theme_classic() 
d1 <- fviz_nbclust(m1, FUNcluster = hcut, method = "silhouette") + theme_classic() 
grid.arrange(a1, b1, c1, d1, ncol=2)

According to the silhouette statistic, the optimal number of clusters is 2 for both k-means and hierarchical clustering.

5.2.1 K Means clustering

# Compute distance between document vectors
d1 <- dist(m1, method = "euclidean")
# k-means with k = 2 (applied to the distance object, as in section 5.1.1)
ncl1 <- 2
kfit1 <- kmeans(d1, ncl1, nstart = 100)

Create a list of the document IDs belonging to each cluster.

cl_k1 = as.data.frame(kfit1$cluster)

# Generate the list of documents in each cluster
ncl1 = 2
cl1 = list()
for (i in 1:ncl1) {
  cl1[paste("cl_",i, sep = "")] = list(rownames(subset(cl_k1, cl_k1 == i)))
}
#cl1

The two clusters are shown in the graph below.

clusplot(as.matrix(d1), kfit1$cluster, color=TRUE, shade=TRUE, labels=2, lines=0)

Let’s analyze the possible themes of the clusters using their most frequent words.

# Generate corpuses for two clusters

ncl1=2
for (i in 1:ncl1) {
  name = paste("cl1_corp_", i, sep = "")
  assign(name, docs1[match(cl1[[i]], names(docs1))])
} 

Tdm1 = list()

# Generate a list of TDMs per cluster
for (i in 1:ncl1) {
  bigram_dtm1_i = TermDocumentMatrix(get(paste("cl1_corp_",i,sep="")))
  tdm1_i <- as.matrix(bigram_dtm1_i)
  Tdm1[paste("cluster_",i,sep="")] = list(tdm1_i)
}

# Chart the 20 most frequent terms in each cluster
par(mfrow = c(1,2))
for (i in 1:ncl1) {
  cl1_m = as.matrix(Tdm1[[i]])
  # barplot() has no 'decreasing' argument; the ordering is done by sort()
  barplot(sort(sort(rowSums(cl1_m), decreasing = TRUE)[1:20], decreasing = FALSE),
          las = 2,
          horiz = TRUE,
          main = paste("20 most frequent terms in cluster", i, sep = " "),
          cex.main = 0.8,
          cex.names = 0.8,
          col.main = "blue")
}

Based on the clustering, we can distinguish two groups of drugs with respect to side effects. The first group includes terms such as “none”, “dri”, “mild”, “notic”, “nausea”, “weight”, etc. The medications prescribed by the doctor in most cases have no side effects, although they can cause slight nausea and weight loss over time.

The second group includes terms such as “get”, “like”, “sleep”, “skin”, etc. This group of drugs can cause an allergic skin reaction and drowsiness.

5.2.2 Hierarchical algorithms

Let’s compare the k-means results with an alternative method, hierarchical clustering.

hc1 <- hclust(d1, method = "ward.D")
hc_d1 = as.dendrogram(hc1)


# Plot a dendrogram

plot(hc_d1, main = "Method Ward",leaflab = "none", col.main = "dodgerblue")

# Add cluster rectangles 
ncl_h1 = 2
rect.dendrogram(hc_d1, k = ncl_h1, border = "blue", xpd = FALSE, lower_rect = 0)

clward1 = as.data.frame(cutree(hc_d1, ncl_h1))
# Generate the list of documents in each cluster
ncl_h1 = 2
cl_h1 = list()
for (i in 1:ncl_h1) {
  cl_h1[paste("cl_h1_",i, sep = "")] = list(rownames(subset(clward1, clward1 == i)))
}
#cl_h1

ncl_h1=2
# Generate corpuses for two clusters
for (i in 1:ncl_h1) {
  name = paste("cl_corp_h1_", i, sep = "")
  assign(name, docs1[match(cl_h1[[i]], names(docs1))])
} 

Tdm_h1 = list()

# Generate a list of TDMs per cluster
for (i in 1:ncl_h1) {
  bigram_dtm_h1_i = TermDocumentMatrix(get(paste("cl_corp_h1_",i,sep="")))
  tdm_h1_i <- as.matrix(bigram_dtm_h1_i)
  Tdm_h1[paste("cluster_h1_",i,sep="")] = list(tdm_h1_i)
}

# Chart the 20 most frequent terms in each cluster
par(mfrow = c(1,2))
for (i in 1:ncl_h1) {
  cl_m_h1 = as.matrix(Tdm_h1[[i]])
  # barplot() has no 'decreasing' argument; the ordering is done by sort()
  barplot(sort(sort(rowSums(cl_m_h1), decreasing = TRUE)[1:20], decreasing = FALSE),
          las = 2,
          horiz = TRUE,
          main = paste("20 most frequent terms in cluster", i, sep = " "),
          cex.main = 0.8,
          cex.names = 0.8,
          col.main = "blue")
}

Based on the hierarchical clustering, we can again distinguish two groups of drugs with respect to side effects. The first group includes the terms “none”, “dri”, “mild”, “weight”, but also “skin”, “mouth”, “loss”, “increase”, “gain”, etc. The medications prescribed by the doctor in most cases have no side effects, although they can cause weight loss over time.

The second group includes the terms “like”, “sleep”, but also “notic”, “start”, “drug”, “pain”, “medic”, etc. This group of drugs can cause drowsiness.

5.3 For benefitsReview

a2 <- fviz_nbclust(m2, FUNcluster = kmeans, method = "silhouette") + theme_classic() 
b2 <- fviz_nbclust(m2, FUNcluster = cluster::pam, method = "silhouette") + theme_classic() 
c2 <- fviz_nbclust(m2, FUNcluster = cluster::clara, method = "silhouette") + theme_classic() 
d2 <- fviz_nbclust(m2, FUNcluster = hcut, method = "silhouette") + theme_classic() 
grid.arrange(a2, b2, c2, d2, ncol=2)

According to the silhouette statistic, the optimal number of clusters is 2 for k-means and 3 for hierarchical clustering. However, we will assume 2 clusters for the hierarchical method as well.

5.3.1 K Means clustering

# Compute distance between document vectors
d2 <- dist(m2, method = "euclidean")
# k-means with k = 2 (applied to the distance object, as in section 5.1.1)
ncl2 <- 2
kfit2 <- kmeans(d2, ncl2, nstart = 100)

Create a list of the document IDs belonging to each cluster.

cl_k2 = as.data.frame(kfit2$cluster)

# Generate the list of documents in each cluster
ncl2 = 2
cl2 = list()
for (i in 1:ncl2) {
  cl2[paste("cl2_",i, sep = "")] = list(rownames(subset(cl_k2, cl_k2 == i)))
}
#cl2

The two clusters are shown in the graph below.

clusplot(as.matrix(d2), kfit2$cluster, color=TRUE, shade=TRUE, labels=2, lines=0)

Let’s analyze the possible themes of the clusters using their most frequent words.

# Generate corpuses for two clusters

ncl2=2
for (i in 1:ncl2) {
  name = paste("cl2_corp_", i, sep = "")
  assign(name, docs2[match(cl2[[i]], names(docs2))])
} 

Tdm2 = list()

# Generate a list of TDMs per cluster
for (i in 1:ncl2) {
  bigram_dtm2_i = TermDocumentMatrix(get(paste("cl2_corp_",i,sep="")))
  tdm2_i <- as.matrix(bigram_dtm2_i)
  Tdm2[paste("cluster_",i,sep="")] = list(tdm2_i)
}

# Chart the 20 most frequent terms in each cluster
par(mfrow = c(1,2))
for (i in 1:ncl2) {
  cl2_m = as.matrix(Tdm2[[i]])
  # barplot() has no 'decreasing' argument; the ordering is done by sort()
  barplot(sort(sort(rowSums(cl2_m), decreasing = TRUE)[1:20], decreasing = FALSE),
          las = 2,
          horiz = TRUE,
          main = paste("20 most frequent terms in cluster", i, sep = " "),
          cex.main = 0.8,
          cex.names = 0.8,
          col.main = "blue")
}

Based on the clustering, we can distinguish two groups of drugs with respect to effectiveness. The first group includes terms such as “time”, “year”, “month”, “week”, “medic”, “work”, “get”, “start”, “much”, etc. The medications in this group are prescribed for treatments of different durations (a week, 10 days, a month, a year). The reviews indicate that the treatment takes effect: sleep improves, anxiety decreases and general well-being improves.

The second group includes terms such as “reduce”, “benefit”, “skin”, “better”, “clear”, etc. The drugs in this group are characterized by higher efficacy: they are beneficial and fast-acting, clear the skin better and improve the general condition of the patient.

5.3.2 Hierarchical algorithms

Let’s compare the k-means results with an alternative method, hierarchical clustering.

hc2 <- hclust(d2, method = "ward.D")
hc_d2 = as.dendrogram(hc2)

# Plot a dendrogram
plot(hc_d2, main = "Method Ward",leaflab = "none", col.main = "dodgerblue")

# Add cluster rectangles 
ncl_h2 = 2
rect.dendrogram(hc_d2, k = ncl_h2, border = "blue", xpd = FALSE, lower_rect = 0)

# Use ncl_h2 (defined above) rather than the k-means counter ncl2
clward2 = as.data.frame(cutree(hc_d2, ncl_h2))
# Generate list of clusters with their files
ncl_h2 = 2
cl_h2 = list()
for (i in 1:ncl_h2) {
cl_h2[paste("cl_h2_",i, sep = "")] = list(rownames(subset(clward2, clward2 == i)))
}
#cl_h2
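
To make the hclust() and cutree() steps easier to follow, here is a minimal sketch on toy data using base R only. The points and their labels are made up, and cutree() is applied to the hclust object directly rather than to a dendrogram.

```r
# Toy data: two well-separated groups of three points each (values are made up).
x <- rbind(
  a = c(0.0, 0.1), b = c(0.2, 0.0), c = c(0.1, 0.2),
  d = c(5.0, 5.1), e = c(5.2, 5.0), f = c(5.1, 5.2)
)

# Euclidean distances, then Ward linkage as in the code above.
d_toy <- dist(x)
hc_toy <- hclust(d_toy, method = "ward.D")

# cutree() on the hclust object returns a named vector of cluster ids.
groups <- cutree(hc_toy, k = 2)
print(groups)

# Cluster sizes are a quick sanity check before interpreting the terms.
print(table(groups))
```

On real review data the groups are rarely this clean, so checking the cluster sizes first helps to spot a degenerate split, such as one very large and one very small cluster.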

ncl_h2=2
# Generate corpuses for two clusters
for (i in 1:ncl_h2) {
  name = paste("cl_corp_h2_", i, sep = "")
  assign(name, docs2[match(cl_h2[[i]], names(docs2))])
} 

Tdm_h2 = list()

# Generate a list of TDMs per cluster
for (i in 1:ncl_h2) {
  bigram_dtm_h2_i = TermDocumentMatrix(get(paste("cl_corp_h2_",i,sep="")))
  tdm_h2_i <- as.matrix(bigram_dtm_h2_i)
  Tdm_h2[paste("cluster_h2_",i,sep="")] = list(tdm_h2_i)
}

# Chart the 20 most frequent terms per cluster
# ("decreasing" is not a barplot() argument, so it is not passed to barplot())
par(mfrow = c(1,2))
for (i in 1:ncl_h2) {
  cl_m_h2 = as.matrix(Tdm_h2[[i]])
  # head() avoids NA bars when a cluster has fewer than 20 terms
  barplot(sort(head(sort(rowSums(cl_m_h2), decreasing = TRUE), 20), decreasing = FALSE),
          las = 2,
          horiz = TRUE,
          main = paste("20 most frequent terms in cluster", i, sep = " "),
          cex.main = 0.8,
          cex.names = 0.8,
          col.main = "blue")
}

Based on the hierarchical clustering, we can again distinguish two groups of drugs in terms of reported effectiveness.

The first group is characterized by terms such as “reduce”, “benefit” and “better”, but also “abl”, “sympton”, “impov” and “week”. Reviews in this group describe the drugs as more effective: they are beneficial and fast-acting and improve the patient's general condition.

The second group is characterized by terms such as “time”, “year”, “month”, “medic”, “work”, “get”, “start” and “much”, but also “use”, “sleep”, “skin” and “life”. Reviews in this group refer to doctor-prescribed medications with different treatment durations (10 days, a month, a year). The reported benefits include improved sleep, reduced anxiety and improved well-being.

6 Evaluation

6.1 For commentsReview

The results obtained with the two methods (k-means and hierarchical clustering) are largely similar.
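
One way to put a number on this similarity is to cross-tabulate the two sets of cluster labels. The sketch below uses hypothetical labels for six documents; in the project, the inputs would be the k-means cluster vector and the cutree() result for the same documents.

```r
# Hypothetical cluster labels for the same six documents from two methods.
km_labels <- c(1, 1, 2, 2, 2, 1)
hc_labels <- c(2, 2, 1, 1, 2, 2)

# Contingency table: rows are k-means clusters, columns are hierarchical clusters.
tab <- table(kmeans = km_labels, hclust = hc_labels)
print(tab)

# With two clusters the ids may simply be swapped between methods, so take
# the better of the two possible matchings (diagonal vs. anti-diagonal).
n <- sum(tab)
agreement <- max(sum(diag(tab)), n - sum(diag(tab))) / n
print(agreement)
```

The anti-diagonal shortcut only works for two clusters. With more clusters, the labels would have to be matched explicitly before computing the share of agreeing documents.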

6.2 For sideEffectsReview

The results obtained with the two methods (k-means and hierarchical clustering) are largely similar.

Both methods distinguish two groups of drugs with respect to side effects. The terms common to the first group include “none”, “dri”, “mild” and “weight”. In most cases, the doctor-prescribed medications in this group are reported to have no side effects, although they can cause weight loss over time.

The second group includes terms such as “like” and “sleep”. This group of drugs can cause an allergic reaction or drowsiness.

6.3 For benefitsReview

Both methods (k-means and hierarchical clustering) distinguish two groups of drugs in terms of reported effectiveness.

The terms common to the first group include “reduce”, “benefit” and “better”, but also “abl”, “sympton”, “impov” and “week”. Reviews in this group describe the drugs as more effective: they are beneficial and fast-acting and improve the patient's general condition.

The terms common to the second group include “time”, “year”, “month”, “medic”, “work”, “get”, “start” and “much”, but also “use”, “sleep”, “skin” and “life”. Reviews in this group refer to doctor-prescribed medications with different treatment durations (10 days, a month, a year). The reported benefits include improved sleep, reduced anxiety and improved well-being.

7 Summary

The goal of our project was to classify text data concerning the effectiveness of prescription drugs, their side effects, and general comments.

First, the text corpus was built, followed by data cleaning, descriptive statistics of the corpus and preparation of the Document-Term Matrix for analysis. Two clustering techniques were used in the project: k-means and hierarchical clustering. For each of the three review types, the two methods gave largely consistent results. For each variable (the effectiveness of prescription drugs, the side effects of drugs, and general comments), two clusters were obtained.

For the general comments, both methods separated two clusters. The first cluster suggests that patients feel an improvement in their condition and that the pain gradually disappears. This group of medications is associated with regular use as prescribed by a doctor: one tablet every day. In this case, it is necessary to remain under a doctor's supervision.

For the side-effect reviews, both methods separated two clusters. The doctor-prescribed medications in the first cluster are, in most cases, reported to have no side effects, although they can cause weight loss over time. Drugs in the second cluster can cause an allergic reaction or drowsiness.

For the benefits reviews, both methods separated two clusters. Drugs in the first cluster are described as more effective: they are beneficial and fast-acting and improve the patient's general condition. Reviews in the second cluster refer to doctor-prescribed medications with different treatment durations (10 days, a month, a year). The reported benefits include improved sleep, reduced anxiety and improved well-being.

We hope that these findings will help patients choose medications, so that they do not have to search websites for opinions about the medications they use.

First project: Drug Review - cluster analysis

Marian Nehrebecki & Magdalena Sobala

25 12 2021
