2 Main assumptions
In this project, the analysis consists of the following steps:
- building the text corpus
- data cleaning
- corpus descriptive statistics
- preparation of Document-Term Matrix for future analysis
- cluster analysis
Patients today have access to several kinds of information about a drug, either numeric ratings or free-text comments. Text comments are the most useful and comprehensive source: they describe the effectiveness of the drug, its side effects and other precautions. Yet with so many online review sites and forums available, a consumer can rarely cover all the relevant information. An algorithm that groups comments by their most important characteristics can help.
Therefore, the purpose of our project is to classify textual data: first, comments on the effectiveness of prescription drugs; second, comments on side effects; and third, general comments.
We hope that this research helps consumers choose the right medicine while saving the money they would otherwise spend on finding more information.
The dataset used in this project contains patient reviews of specific drugs along with the related conditions. Each review is divided into three parts: benefits, side effects and overall comment. The authors of the dataset obtained it by crawling online pharmaceutical review sites.
The data were collected by Professor Surya Kallumadi (Kansas State University, Manhattan, Kansas, USA) and Felix Gräßer (Institut für Biomedizinische Technik, Technische Universität Dresden, Dresden, Germany). They are available from the UCI Machine Learning Repository:
https://archive.ics.uci.edu/ml/datasets/Drug+Review+Dataset+%28Druglib.com%29
First, let’s summarize the data.
library(readr)  # read_tsv()
library(dplyr)  # full_join(), group_by(), summarize(), %>%
train <- read_tsv("drugLibTrain_raw.tsv", col_names = TRUE)
test <- read_tsv("drugLibTest_raw.tsv", col_names = TRUE)
df <- full_join(train, test)
names(df)
## [1] "...1" "urlDrugName" "rating"
## [4] "effectiveness" "sideEffects" "condition"
## [7] "benefitsReview" "sideEffectsReview" "commentsReview"
names(train) = c("ID", "urlDrugName", "rating", "effectiveness", "sideEffects", "condition", "benefitsReview", "sideEffectsReview", "commentsReview")
names(test) = c("ID", "urlDrugName", "rating", "effectiveness", "sideEffects", "condition", "benefitsReview", "sideEffectsReview", "commentsReview")
names(df) = c("ID", "urlDrugName", "rating", "effectiveness", "sideEffects", "condition", "benefitsReview", "sideEffectsReview", "commentsReview")
summary(df)
## ID urlDrugName rating effectiveness
## Min. : 0 Length:4143 Min. : 1.000 Length:4143
## 1st Qu.:1042 Class :character 1st Qu.: 5.000 Class :character
## Median :2083 Mode :character Median : 8.000 Mode :character
## Mean :2082 Mean : 6.946
## 3rd Qu.:3124 3rd Qu.: 9.000
## Max. :4161 Max. :10.000
## sideEffects condition benefitsReview sideEffectsReview
## Length:4143 Length:4143 Length:4143 Length:4143
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## commentsReview
## Length:4143
## Class :character
## Mode :character
##
##
##
## Rows: 4,143
## Columns: 9
## $ ID <dbl> 2202, 3117, 1146, 3947, 1951, 2372, 1043, 2715, 1591…
## $ urlDrugName <chr> "enalapril", "ortho-tri-cyclen", "ponstel", "prilose…
## $ rating <dbl> 4, 1, 10, 3, 2, 1, 9, 10, 10, 1, 7, 8, 8, 9, 4, 8, 6…
## $ effectiveness <chr> "Highly Effective", "Highly Effective", "Highly Effe…
## $ sideEffects <chr> "Mild Side Effects", "Severe Side Effects", "No Side…
## $ condition <chr> "management of congestive heart failure", "birth pre…
## $ benefitsReview <chr> "slowed the progression of left ventricular dysfunct…
## $ sideEffectsReview <chr> "cough, hypotension , proteinuria, impotence , renal…
## $ commentsReview <chr> "monitor blood pressure , weight and asses for resol…
Attribute description for the dataset
Attribute 1: urlDrugName (categorical): name of drug
Attribute 2: rating (numerical): 10 star patient rating
hist(as.numeric(df$rating), breaks = 10, main = "Frequencies of rating", xlab = "rating", col = "#4cbea3", labels = TRUE, border = "#FFFFFF")
The histogram of rating frequencies shows that 968 reviews give the best rating and 556 reviews give the worst rating.
Most drugs were marked as Considerably Effective or Highly Effective. We can therefore expect the benefitsReview variable to contain mostly positive statements from patients.
# Mean rating per effectiveness category, using group_by and summarize
df %>%
  group_by(effectiveness) %>%
  summarize(rating_mean = mean(rating))
## # A tibble: 5 × 2
## effectiveness rating_mean
## <chr> <dbl>
## 1 Considerably Effective 7.30
## 2 Highly Effective 8.77
## 3 Ineffective 1.59
## 4 Marginally Effective 3.29
## 5 Moderately Effective 5.38
In addition, the Considerably Effective and Highly Effective categories have the highest mean ratings.
Most drugs in our dataset have Mild Side Effects or No Side Effects. We can therefore expect the sideEffectsReview variable to contain mostly comments about slight side effects.
# Mean rating per side-effects category, using group_by and summarize
df %>%
  group_by(sideEffects) %>%
  summarize(rating_mean = mean(rating))
## # A tibble: 5 × 2
## sideEffects rating_mean
## <chr> <dbl>
## 1 Extremely Severe Side Effects 1.66
## 2 Mild Side Effects 8.13
## 3 Moderate Side Effects 6.18
## 4 No Side Effects 8.65
## 5 Severe Side Effects 3.58
Drugs with No Side Effects or Mild Side Effects have the highest mean ratings.
We will analyze three review fields: commentsReview, sideEffectsReview and benefitsReview.
The first stage of building the model is creating the corpus. Visual inspection of the comments shows that some of them are very similar.
The next step is to clean the corpus. We apply the following transformations:
- stripWhitespace: removes extra whitespace
- removePunctuation: removes punctuation marks
- removeNumbers: removes numbers
- content_transformer(tolower): converts every word to lowercase
- removeWords: removes stop words
- stemDocument: stems words
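To make the effect of these transformations concrete, the following minimal sketch applies rough base-R equivalents to a single made-up sentence. The sentence and the short stop-word list are assumptions for illustration only; the project itself relies on the tm functions listed above, and stemming is left out because it needs an external stemmer.

```r
# Toy review text (invented for illustration, not taken from the dataset)
x <- "I took 2 pills DAILY,   and the pain was gone!"

x <- gsub("[[:punct:]]", "", x)    # roughly what removePunctuation does
x <- gsub("[[:digit:]]", "", x)    # roughly what removeNumbers does
x <- tolower(x)                    # same idea as content_transformer(tolower)
x <- gsub("\\s+", " ", trimws(x))  # roughly what stripWhitespace does

# Hypothetical, hand-picked stop-word list (tm would use stopwords("en"))
stop_words <- c("i", "and", "the", "was")
tokens <- strsplit(x, " ", fixed = TRUE)[[1]]
tokens <- tokens[!tokens %in% stop_words]

cleaned <- paste(tokens, collapse = " ")
print(cleaned)
```

Only the content-bearing words of the sentence survive, which is the form in which the comments enter the Document-Term Matrix.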
library(tm)  # VCorpus(), tm_map(), DocumentTermMatrix(), removeSparseTerms()
docs <- VCorpus(VectorSource(df$commentsReview))
# Preliminary cleaning.
toSpace <- content_transformer(function (x, pattern) gsub(pattern, " ", x))
docs <- tm_map(docs, removePunctuation)
docs <- tm_map(docs, removeNumbers)
docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, stripWhitespace)
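# Remove stop words plus a custom list. removeWords() matches whole words, so stems such as "howev" and "morn" are not removed at this point because stemming runs afterwards.
docs <- tm_map(docs, removeWords, c(stopwords("en"), "howev", "just", "due", "still", "per", "also", "aaaaarrrgh", "aana", "aarp", "abdo", "abbout", "abcess", "aboc", "abruptlyand", "even", "morn"))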
# Stemming
docs <- tm_map(docs, stemDocument)
# Generate document-term matrix
dtm <- DocumentTermMatrix(docs)
dtm
## <<DocumentTermMatrix (documents: 4143, terms: 7814)>>
## Non-/sparse entries: 86347/32287055
## Sparsity : 100%
## Maximal term length: 48
## Weighting : term frequency (tf)
# Remove sparse terms
dtms <- removeSparseTerms(dtm, sparse = 0.97)
m <- as.matrix(dtms)
# Number of retained terms per document, with its standard deviation and mean
m_freq <- as.matrix(rowSums(m))
sd_freq <- sd(m_freq)
mean_freq <- mean(m_freq)
The descriptive statistics and basic graphs below describe the cleaned data:
par(mfrow = c(1,2))
# Create histogram and boxplot
hist(m_freq,
main = "Histogram",
col = "blue",
col.main = "blue")
boxplot(m_freq,
main = "Boxplot",
col = "blue",
col.main = "blue")
The next step is to create the Document-Term Matrix (DTM). The DTM records how often each term occurs in each document of the corpus. Note that the DTM is highly sparse.
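As a rough illustration of what a DTM and its sparsity are, the sketch below builds a tiny matrix by hand in base R from three invented documents. It also imitates the idea behind removeSparseTerms, on the assumption that a term is kept when it occurs in more than a share (1 - sparse) of the documents. The names docs_toy, dtm_toy and dtm_kept are hypothetical.

```r
# Three toy documents, already cleaned (invented, not from the dataset)
docs_toy <- c(d1 = "pain gone day", d2 = "pain day day", d3 = "sleep better")

tokens <- strsplit(docs_toy, " ", fixed = TRUE)
terms  <- sort(unique(unlist(tokens)))

# Rows are documents, columns are terms, cells are term counts
dtm_toy <- t(sapply(tokens, function(tk) table(factor(tk, levels = terms))))
print(dtm_toy)

# Sparsity is the share of zero cells in the matrix
sparsity <- mean(dtm_toy == 0)
print(sparsity)

# Keep terms that occur in more than (1 - sparse) of the documents
sparse   <- 0.5
doc_freq <- colSums(dtm_toy > 0)
dtm_kept <- dtm_toy[, doc_freq > nrow(dtm_toy) * (1 - sparse), drop = FALSE]
print(colnames(dtm_kept))
```

The same arithmetic applies to the matrix reported above: 32287055 of its 86347 + 32287055 cells are empty, a share of about 0.997, which is displayed as 100%.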
Let’s first analyze the words appearing in the comments with the help of a word cloud.
The terms patients use most often in the general comments are “take”, “day”, “week”, “time”, “effect”, “doctor”, “depress”, “morn”, “pill”, “night”, “start”, etc. The most common words give a first impression of the topic: the medicine prescribed by the doctor has an effect, and after taking the dose for some time (about a week) the patient feels much better during the day, depression decreases, morning vigor returns and sleep improves.
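A list of the most common terms like the one above can be obtained by summing the columns of a document-term matrix. The sketch below shows this on a small invented matrix; m_toy and min_count are hypothetical names and values, not the project's data.

```r
# Toy document-term matrix (documents in rows, stemmed terms in columns);
# the counts are invented for illustration
m_toy <- matrix(
  c(2, 1, 0,
    1, 0, 1,
    3, 1, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("doc1", "doc2", "doc3"), c("take", "day", "week"))
)

# Total frequency of each term across all documents, most frequent first
term_freq <- sort(colSums(m_toy), decreasing = TRUE)
print(term_freq)

# Keep only terms used at least `min_count` times (hypothetical threshold)
min_count <- 2
frequent <- term_freq[term_freq >= min_count]
print(names(frequent))
```

Passing frequent to barplot(frequent, las = 2) would give a bar chart of the same kind as the one used in this report.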
## Docs
## Terms 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## abandon 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## abat 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## abbrevi 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## abcess 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## abdomen 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## abdomin 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## aberr 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## abid 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## abil 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## abilifi 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## abl 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## ablat 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## abnorm 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## aboc 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## aborb 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## abovem 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## abovement 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## abras 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## abrupt 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## abscess 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
w <- rowSums(tdm_2) # total count of each term across all comments
w <- subset(w, w >= 100) # keep terms used at least 100 times
barplot(w, las = 2, col = rainbow(50))
The first stage of building the model is creating the corpus, now from sideEffectsReview. Visual inspection of the comments shows that some of them are very similar.
The next step is to clean the corpus.
docs1 <- VCorpus(VectorSource(df$sideEffectsReview))
# Preliminary cleaning
toSpace <- content_transformer(function (x, pattern) gsub(pattern, " ", x))
docs1 <- tm_map(docs1, removePunctuation)
docs1 <- tm_map(docs1, removeNumbers)
docs1 <- tm_map(docs1, content_transformer(tolower))
docs1 <- tm_map(docs1, removeWords, stopwords("english"))
docs1 <- tm_map(docs1, stripWhitespace)
# Stemming
docs1 <- tm_map(docs1, stemDocument)
# Create document-term matrix
dtm1 <- DocumentTermMatrix(docs1)
# Reducing sparsity 1: Dealing with the sparse terms
# We are removing these terms which don't appear too often.
dtms1 <- removeSparseTerms(dtm1, sparse = 0.97)
m1 <- as.matrix(dtms1)
# Number of retained terms per document, with its standard deviation and mean
m1_freq <- as.matrix(rowSums(m1))
sd1 <- sd(m1_freq)
mean1 <- mean(m1_freq)
The descriptive statistics and basic graphs below describe the cleaned data:
par(mfrow = c(1,2))
# Create histogram and boxplot
hist(m1_freq,
main = "Histogram",
col = "green",
col.main = "green")
boxplot(m1_freq,
main = "Boxplot",
col = "green",
col.main = "green")
Terms such as “effect”, “side”, “take”, “day”, “time”, “medic”, “pain”, “feel”, “first”, “stomach”, “headach”, “skin”, “weight”, etc. were used more often than other words. They suggest that in the first few days or weeks of taking the prescribed medication, patients experienced problems such as abdominal pain, headache and skin allergy.
## Docs
## Terms 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## aafter 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## abandon 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## abat 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## abbsess 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## abdomen 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## abdomin 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## abfter 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## abil 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## abilifi 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## abit 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## abl 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## abnorm 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## aboveand 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## abrupt 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0
## abscens 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## absenc 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## abslut 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## absolut 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## absorb 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## absorpt 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
w1 <- rowSums(tdm1_2) # total count of each term across all comments
w1 <- subset(w1, w1 >= 100) # keep terms used at least 100 times
barplot(w1, las = 2, col = rainbow(50))
The first stage of building the model is creating the corpus, now from benefitsReview. Visual inspection of the comments shows that some of them are very similar.
The next step is to clean the corpus.
docs2 <- VCorpus(VectorSource(df$benefitsReview))
# Preliminary cleaning
toSpace <- content_transformer(function (x, pattern) gsub(pattern, " ", x))
docs2 <- tm_map(docs2, removePunctuation)
docs2 <- tm_map(docs2, removeNumbers)
docs2 <- tm_map(docs2, content_transformer(tolower))
docs2 <- tm_map(docs2, removeWords, stopwords("english"))
docs2 <- tm_map(docs2, stripWhitespace)
docs2 = tm_map(docs2, removeWords, c(stopwords("en"), "also", "abfter", "aboveand", "even", "will", "just", "didnt", "dont", "howev", "due", "still", "per", "aaaaarrrgh", "aana", "aarp", "abdo", "abbout", "abcess", "aboc", "abruptlyand", "even"))
# Stemming
docs2 <- tm_map(docs2, stemDocument)
# Create document-term matrix
dtm2 <- DocumentTermMatrix(docs2)
# reduce sparsity
dtms2 <- removeSparseTerms(dtm2, sparse = 0.97)
m2 <- as.matrix(dtms2)
# Number of retained terms per document, with its standard deviation and mean
m2_freq <- as.matrix(rowSums(m2))
sd2 <- sd(m2_freq)
mean2 <- mean(m2_freq)
The descriptive statistics and basic graphs below describe the cleaned data:
par(mfrow = c(1,2))
# Creating histogram and boxplot
hist(m2_freq,
main = "Histogram",
col = "brown",
col.main = "brown")
boxplot(m2_freq,
main = "Boxplot",
col = "brown",
col.main = "brown")
## Docs
## Terms 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## aand 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## aarm 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## abait 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## abandon 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## abat 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## abcess 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## abdomen 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## abdomin 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## abdominfirst 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## abil 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## abilifi 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## abl 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## ablat 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## abli 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## ablil 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## ablv 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## abnorm 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## aboard 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## abort 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## aboveaverag 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
w2 <- rowSums(tdm2_2) # total count of each term across all comments
w2 <- subset(w2, w2 >= 100) # keep terms used at least 100 times
barplot(w2, las = 2, col = rainbow(50))
Comparing the term lists, we note that the frequent terms of the first group, the general comments (“take”, “day”, “week”, “time”, “effect”, “doctor”, “depress”, “normal”, “night”, “start”, etc.), largely coincide with those of the second group, the side effects (“effect”, “take”, “day”, “time”, “medic”, “pain”, “feel”, “skin”, etc.).
To achieve the aim of the project, we analyze each variable that contains patient comments. We start with commentsReview, continue with sideEffectsReview and finish with benefitsReview.
Before clustering, we need to determine the optimal number of clusters. We apply the silhouette statistic to three flat clustering algorithms (k-means, PAM and CLARA) and to hierarchical clustering.
library(factoextra)  # fviz_nbclust(), hcut()
library(gridExtra)   # grid.arrange()
a <- fviz_nbclust(m, FUNcluster = kmeans, method = "silhouette") + theme_classic()
b <- fviz_nbclust(m, FUNcluster = cluster::pam, method = "silhouette") + theme_classic()
c <- fviz_nbclust(m, FUNcluster = cluster::clara, method = "silhouette") + theme_classic()
d <- fviz_nbclust(m, FUNcluster = hcut, method = "silhouette") + theme_classic()
grid.arrange(a, b, c, d, ncol=2)
According to the silhouette statistic, the optimal number of clusters is 2 for k-means and 2 for the hierarchical algorithm.
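The idea behind the silhouette criterion can be sketched on toy data: for each candidate number of clusters we compute the average silhouette width and choose the value that maximizes it. The points below are invented, and the sketch calls silhouette() from the cluster package directly rather than going through fviz_nbclust; it illustrates the principle under these assumptions and does not reproduce the plots above.

```r
library(cluster)  # silhouette()

# Two well-separated toy groups of five points each (invented data)
base_pts <- rbind(c(0, 0), c(0, 1), c(1, 0), c(1, 1), c(0.5, 0.5))
x_toy <- rbind(base_pts, base_pts + 10)

set.seed(1)
d_toy <- dist(x_toy)

# Average silhouette width for each candidate number of clusters
ks <- 2:4
avg_sil <- sapply(ks, function(k) {
  fit <- kmeans(x_toy, centers = k, nstart = 20)
  mean(silhouette(fit$cluster, d_toy)[, "sil_width"])
})
names(avg_sil) <- ks

best_k <- ks[which.max(avg_sil)]
print(round(avg_sil, 3))
print(best_k)
```

With two clearly separated groups, splitting either of them further lowers the average width, so the maximum falls at two clusters.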
In the next step, we cluster the documents with k-means and with a hierarchical algorithm. We use two methods so that we can compare their results.
Let’s start the analysis with k-means clustering.
# Compute distance between document vectors
d <- dist(m, method = "euclidean")
# k-means with k = 2
# Note: kmeans() coerces its input with as.matrix(), so passing the dist object d clusters the rows of the full distance matrix
ncl <- 2
kfit <- kmeans(d, ncl, nstart = 100)
Next, we create a list of clusters with the ids of the reviews they contain.
cl_k <- as.data.frame(kfit$cluster)
# Generate the list of review ids per cluster
ncl <- 2
cl <- list()
for (i in 1:ncl) {
  cl[paste("cl_",i, sep = "")] <- list(rownames(subset(cl_k, cl_k == i)))
}
#cl
We will present the two clusters in the graph.
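As a compact alternative to the loop above, cluster membership can be read directly from the cluster vector returned by kmeans. The following sketch uses an invented six-document matrix and, as an assumption that differs from the listing above, runs kmeans on the matrix itself instead of a distance object. All names ending in _toy are hypothetical.

```r
# Toy document-term matrix: rows are documents, columns are term counts
# (values invented so that two groups are clearly separated)
m_toy <- rbind(
  doc1 = c(5, 4, 0, 0),
  doc2 = c(4, 5, 0, 1),
  doc3 = c(5, 5, 1, 0),
  doc4 = c(0, 1, 5, 4),
  doc5 = c(0, 0, 4, 5),
  doc6 = c(1, 0, 5, 5)
)
colnames(m_toy) <- c("pain", "work", "pill", "daili")

set.seed(42)
kfit_toy <- kmeans(m_toy, centers = 2, nstart = 25)

# One character vector of document ids per cluster
members <- split(names(kfit_toy$cluster), kfit_toy$cluster)
print(members)
```

The cluster numbers themselves are arbitrary and may differ between runs; only the grouping of the documents is meaningful.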
Let’s analyze the possible themes of the resulting clusters using their most frequent words.
# Generate corpora for the two clusters
ncl=2
for (i in 1:ncl) {
name = paste("cl_corp_", i, sep = "")
assign(name, docs[match(cl[[i]], names(docs))])
}
Tdm = list()
# Generate a list of TDMs per cluster
for (i in 1:ncl) {
bigram_dtm_i = TermDocumentMatrix(get(paste("cl_corp_",i,sep="")))
tdm_i <- as.matrix(bigram_dtm_i)
Tdm[paste("cluster_",i,sep="")] = list(tdm_i)
}
# Chart the 20 most frequent terms per cluster
par(mfrow = c(1,2))
for (i in 1:ncl) {
  cl_m <- as.matrix(Tdm[[i]])
  barplot(sort(sort(rowSums(cl_m), decreasing = TRUE)[1:20], decreasing = FALSE),
          las = 2,
          horiz = TRUE,
          main = paste("20 most frequent terms in cluster", i, sep = " "),
          cex.main = 0.8,
          cex.names = 0.8,
          col.main = "blue")
}
The clustering distinguishes two groups of drugs. The first group is characterized by terms such as “start”, “work”, “pain”, “get”, “fuel”, “dolce”, etc. This suggests that the medicine, taken at the dosage prescribed by the doctor, helps: the patient feels an improvement in their condition and the pain gradually disappears.
The second group is characterized by terms such as “pill”, “daily”, “treatment”, “prescribe”, “every”, etc. This group is associated with the continuous use of medication prescribed by a doctor, for example one tablet every day, which requires medical supervision.
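The most frequent terms of each cluster can also be computed directly from a document-term matrix and a vector of cluster labels, without building a separate corpus per cluster. The sketch below illustrates this on invented counts; m_toy, cluster_id and top_terms are hypothetical names, and this is an alternative route rather than the procedure used above.

```r
# Toy document-term matrix and one cluster label per document (both invented)
m_toy <- rbind(
  doc1 = c(pain = 4, work = 2, pill = 0, daili = 0),
  doc2 = c(pain = 2, work = 3, pill = 1, daili = 0),
  doc3 = c(pain = 0, work = 0, pill = 4, daili = 2),
  doc4 = c(pain = 0, work = 1, pill = 3, daili = 3)
)
cluster_id <- c(doc1 = 1, doc2 = 1, doc3 = 2, doc4 = 2)

# Sum the term counts within each cluster: one row per cluster
freq_by_cluster <- rowsum(m_toy, group = cluster_id)
print(freq_by_cluster)

# Names of the n most frequent terms of one cluster
top_terms <- function(freq, n = 2) {
  names(sort(freq, decreasing = TRUE))[seq_len(n)]
}

# One column of top terms per cluster
top <- apply(freq_by_cluster, 1, top_terms, n = 2)
print(top)
```

Each column of top lists the terms that dominate one cluster, which is the information the bar charts display.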
Let’s compare the k-means results with an alternative method, hierarchical clustering.
## Registering fonts with R
library(dendextend)  # rect.dendrogram(), cutree() for dendrograms
my_font <- "Roboto Condensed"
# Ward hierarchical clustering on the distance matrix d
hc <- hclust(d, method = "ward.D")
hc_d <- as.dendrogram(hc)
# Plot a dendrogram
plot(hc_d, main = "Method Ward",leaflab = "none", col.main = "dodgerblue")
# Add cluster rectangles
ncl <- 2
rect.dendrogram(hc_d, k = ncl, border = "blue", xpd = FALSE, lower_rect = 0)
clward <- as.data.frame(cutree(hc_d, ncl))
# Generate list of clusters
ncl = 2
cl_h = list()
for (i in 1:ncl) {
cl_h[paste("cl_h_",i, sep = "")] = list(rownames(subset(clward, clward == i)))
}
#cl_h
ncl=2
# Generate corpora for the two clusters
for (i in 1:ncl) {
name = paste("cl_corp_h_", i, sep = "")
assign(name, docs[match(cl_h[[i]], names(docs))])
}
Tdm_h = list()
# Generate a list of TDMs per cluster
for (i in 1:ncl) {
bigram_dtm_h_i = TermDocumentMatrix(get(paste("cl_corp_h_",i,sep="")))
tdm_h_i <- as.matrix(bigram_dtm_h_i)
Tdm_h[paste("cluster_h_",i,sep="")] = list(tdm_h_i)
}
# Chart the 20 most frequent terms per cluster
par(mfrow = c(1,2))
for (i in 1:ncl) {
  cl_m_h <- as.matrix(Tdm_h[[i]])
  barplot(sort(sort(rowSums(cl_m_h), decreasing = TRUE)[1:20], decreasing = FALSE),
          las = 2,
          horiz = TRUE,
          main = paste("20 most frequent terms in cluster", i, sep = " "),
          cex.main = 0.8,
          cex.names = 0.8,
          col.main = "blue")
}
The results of the hierarchical analysis are largely similar; only the order of the groups has changed.
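Whether two clusterings agree apart from the order of the groups can be checked by cross-tabulating their labels. The sketch below does this for k-means and Ward clustering on invented toy points; it is meant as an illustration of the check, not as a result for the review data.

```r
# Toy data: two well-separated groups of five points (invented)
base_pts <- rbind(c(0, 0), c(0, 1), c(1, 0), c(1, 1), c(0.5, 0.5))
x_toy <- rbind(base_pts, base_pts + 10)
rownames(x_toy) <- paste0("doc", 1:10)

set.seed(7)
km <- kmeans(x_toy, centers = 2, nstart = 20)$cluster
hc <- cutree(hclust(dist(x_toy), method = "ward.D"), k = 2)

# Rows: k-means labels, columns: hierarchical labels
agreement <- table(kmeans = km, hclust = hc)
print(agreement)

# The two partitions are the same up to relabelling when every row and
# every column of the table has exactly one non-zero cell
same_partition <- all(rowSums(agreement > 0) == 1) &&
  all(colSums(agreement > 0) == 1)
print(same_partition)
```

If the non-zero cells lie off the diagonal, the groups are the same but numbered in the opposite order.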
Before clustering the side-effects reviews, we again determine the optimal number of clusters. We apply the silhouette statistic to three flat clustering algorithms (k-means, PAM and CLARA) and to hierarchical clustering.
a1 <- fviz_nbclust(m1, FUNcluster = kmeans, method = "silhouette") + theme_classic()
b1 <- fviz_nbclust(m1, FUNcluster = cluster::pam, method = "silhouette") + theme_classic()
c1 <- fviz_nbclust(m1, FUNcluster = cluster::clara, method = "silhouette") + theme_classic()
d1 <- fviz_nbclust(m1, FUNcluster = hcut, method = "silhouette") + theme_classic()
grid.arrange(a1, b1, c1, d1, ncol=2)
According to the silhouette statistic, the optimal number of clusters is 2 for k-means and 2 for the hierarchical algorithm.
# Compute distance between document vectors
d1 <- dist(m1, method = "euclidean")
# k-means with k = 2
ncl1 <- 2
kfit1 <- kmeans(d1, ncl1, nstart = 100)
Next, we create a list of clusters with the ids of the reviews they contain.
cl_k1 = as.data.frame(kfit1$cluster)
# Generate list of clusters with drugs
ncl1 = 2
cl1 = list()
for (i in 1:ncl1) {
cl1[paste("cl_",i, sep = "")] = list(rownames(subset(cl_k1, cl_k1 == i)))
}
#cl1
We will present the two clusters in the graph.
Let’s analyze possible themes of constructed clusters using the most frequently used words.
# Generate corpuses for two clusters
ncl1=2
for (i in 1:ncl1) {
name = paste("cl1_corp_", i, sep = "")
assign(name, docs1[match(cl1[[i]], names(docs1))])
}
Tdm1 = list()
# Generate a list of TDMs per cluster
for (i in 1:ncl1) {
bigram_dtm1_i = TermDocumentMatrix(get(paste("cl1_corp_",i,sep="")))
tdm1_i <- as.matrix(bigram_dtm1_i)
Tdm1[paste("cluster_",i,sep="")] = list(tdm1_i)
}
# Chart the 20 most frequent terms per cluster
par(mfrow = c(1,2))
for (i in 1:ncl1) {
  cl1_m <- as.matrix(Tdm1[[i]])
  barplot(sort(sort(rowSums(cl1_m), decreasing = TRUE)[1:20], decreasing = FALSE),
          las = 2,
          horiz = TRUE,
          main = paste("20 most frequent terms in cluster", i, sep = " "),
          cex.main = 0.8,
          cex.names = 0.8,
          col.main = "blue")
}
The clustering distinguishes two groups of drugs with respect to side effects. The first group is characterized by terms such as “none”, “dri”, “mild”, “notic”, “nausea”, “weight”, etc. In most cases, the medications in this group have no side effects, although they can cause slight nausea and weight loss over time.
The second group is characterized by terms such as “get”, “like”, “sleep”, “skin”, etc. Drugs in this group can cause allergic skin reactions and drowsiness.
Let’s compare the k-means results with an alternative method, hierarchical clustering.
hc1 <- hclust(d1, method = "ward.D")
hc_d1 <- as.dendrogram(hc1)
# Plot a dendrogram
plot(hc_d1, main = "Method Ward",leaflab = "none", col.main = "dodgerblue")
# Add cluster rectangles
ncl_h1 <- 2
rect.dendrogram(hc_d1, k = ncl_h1, border = "blue", xpd = FALSE, lower_rect = 0)
clward1 <- as.data.frame(cutree(hc_d1, ncl_h1))
# Generate the list of review ids per cluster
cl_h1 <- list()
for (i in 1:ncl_h1) {
  cl_h1[paste("cl_h1_",i, sep = "")] <- list(rownames(subset(clward1, clward1 == i)))
}
#cl_h1
# Generate corpuses for two clusters
for (i in 1:ncl_h1) {
name = paste("cl_corp_h1_", i, sep = "")
assign(name, docs1[match(cl_h1[[i]], names(docs1))])
}
Tdm_h1 = list()
# Generate a list of TDMs per cluster
for (i in 1:ncl_h1) {
bigram_dtm_h1_i = TermDocumentMatrix(get(paste("cl_corp_h1_",i,sep="")))
tdm_h1_i <- as.matrix(bigram_dtm_h1_i)
Tdm_h1[paste("cluster_h1_",i,sep="")] = list(tdm_h1_i)
}
# Chart the 20 most frequent terms per cluster
par(mfrow = c(1,2))
for (i in 1:ncl_h1) {
  cl_m_h1 <- as.matrix(Tdm_h1[[i]])
  barplot(sort(sort(rowSums(cl_m_h1), decreasing = TRUE)[1:20], decreasing = FALSE),
          las = 2,
          horiz = TRUE,
          main = paste("20 most frequent terms in cluster", i, sep = " "),
          cex.main = 0.8,
          cex.names = 0.8,
          col.main = "blue")
}
The hierarchical clustering also distinguishes two groups of drugs with respect to side effects. The first group is characterized by terms such as “none”, “dri”, “mild”, “weight”, but also “skin”, “mouth”, “loss”, “increase”, “gain”, etc. In most cases, the medications in this group have no side effects, although they can cause weight loss over time.
The second group is characterized by terms such as “like”, “sleep”, but also “notic”, “start”, “drug”, “pain”, “medic”, etc. Drugs in this group can cause drowsiness.
Before clustering the benefits reviews, we again determine the optimal number of clusters. We apply the silhouette statistic to three flat clustering algorithms (k-means, PAM and CLARA) and to hierarchical clustering.
a2 <- fviz_nbclust(m2, FUNcluster = kmeans, method = "silhouette") + theme_classic()
b2 <- fviz_nbclust(m2, FUNcluster = cluster::pam, method = "silhouette") + theme_classic()
c2 <- fviz_nbclust(m2, FUNcluster = cluster::clara, method = "silhouette") + theme_classic()
d2 <- fviz_nbclust(m2, FUNcluster = hcut, method = "silhouette") + theme_classic()
grid.arrange(a2, b2, c2, d2, ncol=2)
According to the silhouette statistic, the optimal number of clusters is 2 for k-means and 3 for the hierarchical algorithm. Nevertheless, we also use 2 clusters for the hierarchical algorithm.
# Compute distance between document vectors
d2 <- dist(m2, method = "euclidean")
# k-means with k = 2
ncl2 <- 2
kfit2 <- kmeans(d2, ncl2, nstart = 100)
Next, we create a list of clusters with the ids of the reviews they contain.
cl_k2 = as.data.frame(kfit2$cluster)
# Generate list of clusters with drugs
ncl2 = 2
cl2 = list()
for (i in 1:ncl2) {
cl2[paste("cl2_",i, sep = "")] = list(rownames(subset(cl_k2, cl_k2 == i)))
}
#cl2
We will present the two clusters in the graph.
Let’s analyze possible themes of constructed clusters using the most frequently used words.
# Generate corpuses for two clusters
ncl2=2
for (i in 1:ncl2) {
name = paste("cl2_corp_", i, sep = "")
assign(name, docs2[match(cl2[[i]], names(docs2))])
}
Tdm2 = list()
# Generate a list of TDMs per cluster
for (i in 1:ncl2) {
bigram_dtm2_i = TermDocumentMatrix(get(paste("cl2_corp_",i,sep="")))
tdm2_i <- as.matrix(bigram_dtm2_i)
Tdm2[paste("cluster_",i,sep="")] = list(tdm2_i)
}
# Chart the 20 most frequent terms per cluster
par(mfrow = c(1,2))
for (i in 1:ncl2) {
  cl2_m <- as.matrix(Tdm2[[i]])
  barplot(sort(sort(rowSums(cl2_m), decreasing = TRUE)[1:20], decreasing = FALSE),
          las = 2,
          horiz = TRUE,
          main = paste("20 most frequent terms in cluster", i, sep = " "),
          cex.main = 0.8,
          cex.names = 0.8,
          col.main = "blue")
}
The clustering distinguishes two groups of drugs with respect to effectiveness. The first group is characterized by terms such as “time”, “year”, “month”, “week”, “medic”, “work”, “get”, “start”, “much”, etc. Medications in this group are prescribed for treatments of different duration (a week, 10 days, a month, a year). The reviews indicate that the treatment becomes effective over time: sleep improves, anxiety decreases and well-being improves.
The second group is characterized by terms such as “reduce”, “benefit”, “skin”, “better”, “clear”, etc. This group of drugs shows better efficacy: the drugs are beneficial and act quickly, clear the skin and improve the general condition of the patient.
Let’s compare the k-means results with an alternative method, hierarchical clustering.
hc2 <- hclust(d2, method = "ward.D")
hc_d2 <- as.dendrogram(hc2)
# Plot a dendrogram
plot(hc_d2, main = "Method Ward",leaflab = "none", col.main = "dodgerblue")
# Add cluster rectangles
ncl_h2 <- 2
rect.dendrogram(hc_d2, k = ncl_h2, border = "blue", xpd = FALSE, lower_rect = 0)
clward2 <- as.data.frame(cutree(hc_d2, ncl_h2))
# Generate list of clusters with their files
ncl_h2 = 2
cl_h2 = list()
for (i in 1:ncl_h2) {
cl_h2[paste("cl_h2_",i, sep = "")] = list(rownames(subset(clward2, clward2 == i)))
}
#cl_h2
ncl_h2=2
# Generate corpuses for two clusters
for (i in 1:ncl_h2) {
name = paste("cl_corp_h2_", i, sep = "")
assign(name, docs2[match(cl_h2[[i]], names(docs2))])
}
Tdm_h2 = list()
# Generate a list of TDMs per cluster
for (i in 1:ncl_h2) {
bigram_dtm_h2_i = TermDocumentMatrix(get(paste("cl_corp_h2_",i,sep="")))
tdm_h2_i <- as.matrix(bigram_dtm_h2_i)
Tdm_h2[paste("cluster_h2_",i,sep="")] = list(tdm_h2_i)
}
# Chart the 20 most frequent terms per cluster
par(mfrow = c(1,2))
for (i in 1:ncl_h2) {
  cl_m_h2 <- as.matrix(Tdm_h2[[i]])
  barplot(sort(sort(rowSums(cl_m_h2), decreasing = TRUE)[1:20], decreasing = FALSE),
          las = 2,
          horiz = TRUE,
          main = paste("20 most frequent terms in cluster", i, sep = " "),
          cex.main = 0.8,
          cex.names = 0.8,
          col.main = "blue")
}
The hierarchical clustering also distinguishes two groups of drugs with respect to effectiveness.
The first group is characterized by terms such as “reduce”, “benefit”, “better”, but also “abl”, “sympton”, “impov”, “week”, etc. This group of drugs shows better efficacy: the drugs are beneficial, act quickly and improve the general condition of the patient.
The second group is characterized by terms such as “time”, “year”, “month”, “medic”, “work”, “get”, “start”, “much”, but also “use”, “sleep”, “skin”, “life”, etc. Medications in this group are prescribed for treatments of different duration (10 days, a month, a year). The reviews indicate that the treatment becomes effective over time: sleep improves, anxiety decreases and well-being improves.
The results obtained with the two methods (k-means and hierarchical clustering) are largely similar.
For the general comments, the clustering distinguishes two groups of drugs. The first group is characterized by terms such as “start”, “work”, “pain”, “get”, “fuel”, “dolce”, etc. This suggests that the medicine, taken at the dosage prescribed by the doctor, helps: the patient feels an improvement in their condition and the pain gradually disappears.
The second group is characterized by terms such as “pill”, “daily”, “treatment”, “prescribe”, “every”, etc. This group is associated with the continuous use of medication prescribed by a doctor, for example one tablet every day, which requires medical supervision.
The results obtained with the two methods (k-means and hierarchical clustering) are largely similar.
For the side effects, both methods (k-means and hierarchical clustering) distinguish two groups of drugs. The terms common to the first group include “none”, “dri”, “mild”, “weight”, etc. In most cases, the medications in this group have no side effects, although they can cause weight loss over time.
The terms common to the second group include “like”, “sleep”, etc. Drugs in this group can cause drowsiness.
For effectiveness, both methods (k-means and hierarchical clustering) distinguish two groups of drugs.
The terms common to the first group include “reduce”, “benefit”, “better”, but also “abl”, “sympton”, “impov”, “week”, etc. This group of drugs shows better efficacy: the drugs are beneficial, act quickly and improve the general condition of the patient.
The terms common to the second group include “time”, “year”, “month”, “medic”, “work”, “get”, “start”, “much”, but also “use”, “sleep”, “skin”, “life”, etc. Medications in this group are prescribed for treatments of different duration (10 days, a month, a year). The reviews indicate that the treatment becomes effective over time: sleep improves, anxiety decreases and well-being improves.
The goal of our project was to classify text data on the effectiveness of prescription drugs, on their side effects, and on general comments.
First, we built the text corpus, then cleaned the data, computed descriptive statistics of the corpus and prepared the Document-Term Matrix for analysis. Two techniques were used for classification: k-means and hierarchical clustering. For each of the three review fields, the two methods produced largely similar results. For each variable (the effectiveness of prescription drugs, the side effects of drugs, and general comments), 2 clusters were obtained.
For the general comments, both methods separated two clusters. The first cluster suggests that patients feel an improvement in their condition and that the pain gradually disappears. The second cluster is associated with the continuous use of medication prescribed by a doctor, for example one tablet every day, which requires medical supervision.
For the side-effects reviews, both methods separated two clusters. Medications in the first cluster in most cases have no side effects, although they can cause weight loss over time. Drugs in the second cluster can cause drowsiness.
For the benefits reviews, both methods separated two clusters. Drugs in the first cluster show better efficacy: they are beneficial, act quickly and improve the general condition of the patient. Medications in the second cluster are prescribed for treatments of different duration (10 days, a month, a year). The reviews indicate that the treatment becomes effective over time: sleep improves, anxiety decreases and well-being improves.
We hope that these findings will help patients choose their medications, so that they do not have to search websites for opinions about them.