Data Science Capstone Project, Milestone Report

Introduction

This report details the advance of coursera’s data science specialization capstone project during week two of the course. The requirement is to perform an exploratory data analysis of the data, after having being exposed to natural language processing tools and libraries.

Text loading

I will load the text data from all three files and clean them in the following ways: * Remove numbers * Remove punctuation while preserving the mid-word hyphens * Remove extra whitespaces * Convert all the text to lower case * Remove non-english sentences from the original data.

For the first four items we use the functions of the tm library, to detect the language we use the textcat library. The output will be a dataframe named indf that contains one column with the cleaned-up text.

1. Set up

# a. Environment set-up
basedir <- '~/Coursera_R_programming/Capstone/final'
setwd(basedir)

# b. Libraries load
library(tm); library(RColorBrewer); library(quanteda); library(cluster)

## Loading required package: NLP

## quanteda version 0.9.8

## 
## Attaching package: 'quanteda'

## The following objects are masked from 'package:tm':
## 
##     as.DocumentTermMatrix, stopwords

## The following object is masked from 'package:NLP':
## 
##     ngrams

## The following object is masked from 'package:base':
## 
##     sample

library(textcat); library(ggplot2); library(stringi); library(xlsx); library(dendextend)

## 
## Attaching package: 'ggplot2'

## The following object is masked from 'package:NLP':
## 
##     annotate

## Loading required package: rJava

## Loading required package: xlsxjars

## 
## ---------------------
## Welcome to dendextend version 1.2.0
## Type browseVignettes(package = 'dendextend') for the package vignette.
## The github page is: https://github.com/talgalili/dendextend/
## 
## Suggestions and bug-reports can be submitted at: https://github.com/talgalili/dendextend/issues
## Or contact: <tal.galili@gmail.com>
## 
##  To suppress this message use:  suppressPackageStartupMessages(library(dendextend))
## ---------------------

## 
## Attaching package: 'dendextend'

## The following object is masked from 'package:stats':
## 
##     cutree

# c. Data load
news <- readLines("~/Coursera_R_programming/Capstone/final/en_US/en_US.news.txt")
suppressWarnings(twitter <- readLines("~/Coursera_R_programming/Capstone/final/en_US/en_US.twitter.txt"))
blogs <- readLines("~/Coursera_R_programming/Capstone/final/en_US/en_US.blogs.txt")

2. Data sampling

The datasets are very big, so I am going to sample 10% of the total. The sample is random.

seed <- as.numeric(as.Date("2016-8-18"))
set.seed(seed)
news.sample <- sample(news, length(news)*0.1, replace = FALSE)
twitter.sample <- sample(twitter, length(twitter)*0.1, replace = FALSE)
blogs.sample <- sample(blogs, length(blogs)*0.1, replace = FALSE)

Clean-up and pre-processing

1. Data pooling

I am pooling the three datasets together for cleaning and machine learning purposes.

alldata <- c(news.sample, twitter.sample, blogs.sample)

remove(news.sample); remove(twitter.sample); remove(blogs.sample)
remove(news); remove(twitter); remove(blogs)

2. Corpus creation

I am creating the corpus that will be the basis for cleaning and text mining.

corpus2 <- corpus(alldata)
remove(alldata)

3. Corpus clean-up

I am cleaning up the data by using the quanteda package.

# I use quanteda's "tokenize()" because tm's "tm_map" does not recognise the "corpus" class that quanteda uses. 
# The following must all be passed in one function because quanteda cannot tokenize an already tokenized object.
corpus2 <- tokenize(corpus2, 
                    removeNumbers = TRUE, 
                    removePunct = TRUE, 
                    removeSymbols = TRUE,
                    removeTwitter = TRUE,
                    removeHyphens = TRUE,
                    removeURL = TRUE)
corpus2 <- removeFeatures(corpus2, stopwords("english"))
corpus2 <- toLower(corpus2)

Stage the data

Create the “document features matrix”

To proceed, I create a document features matrix. This is the quanteda equivalent of the tm’s package “document terms matrix”. This is what you will be using from this point on.

# a. Create the "document-features matrix"
suppressMessages(dfm <- dfm(corpus2, ignoredFeatures = stopwords("english"), stem = TRUE))

# b. Access a list of the most frequently occurring features
topfeatures(dfm, 100) # 100 most frequent words

##   will    one   just   like   said    get   time     go    can   year 
##  32501  31408  30502  30369  30347  30040  27110  26865  25072  24359 
##    day   make   love    new   know   good    now   work  peopl    say 
##  22971  20939  20235  19792  18713  18513  18206  18124  16493  16209 
##   want    see  think   look  thank   back   come   need  first    use 
##  16208  16041  15654  15386  15313  15033  14678  14231  14150  13433 
##   also  thing   last   take    way  great    two  today     us   well 
##  13098  13091  13045  12789  12775  12587  12231  12146  12123  12066 
##   even   much  right realli follow    got  start   week  still   game 
##  12058  12029  11720  11506  11394  10857  10846  10799  10465  10435 
##   play   call  state   show    tri   feel school   life     rt   home 
##  10240   9987   9772   9642   9541   9446   9368   9338   9019   8992 
##   mani  night   live   help   made   hope  never  littl   best    may 
##   8947   8727   8675   8567   8536   8380   8223   8060   8040   7989 
##      u friend   give   next    lol someth    run   book  world   long 
##   7977   7769   7625   7615   7445   7314   7279   7191   7186   7185 
##  happi    end   citi    lot  place   find around  alway  watch   team 
##   7163   7109   7039   6911   6910   6850   6802   6773   6719   6715 
##    man  everi   keep    old  anoth better   talk    put    big  three 
##   6703   6687   6685   6647   6647   6623   6486   6445   6418   6384

top100words <- as.data.frame(topfeatures(dfm, 100)) # convert to dataframe for manipulating
write.xlsx(top100words, file = "topwords.xlsx", sheetName = "top100words")
top100words <- read.xlsx("topwords.xlsx", sheetName = "top100words")
names(top100words) <- c("word", "frequency")

Exploratory data analysis and preliminary visualization

I am creating the following two graphs to offer a preliminary/exploratory visualization of the data analysis I have completed up to this point.

# 1. Create a wordcloud plot
suppressWarnings(wordcloud <- plot(dfm, max.words = 100, colors = brewer.pal(6, "Dark2"), scale = c(8, .5)))

# 2. Create a graph that plots a series of most common words
## First, I sort by frequency in decreasing order, just to make the end plot prettier
top100words$word <- factor(top100words$word, levels = top100words$word[order(top100words$frequency)])

## Then, I create the plot in the gglpot environment
frequencyplot <- ggplot(head(top100words, n = 30), aes(word, frequency)) +
        geom_bar(stat = "identity") +
        coord_flip() +
        theme_minimal() + ggtitle("Histogram of top 25 words") +
        xlab("Word") + ylab("Frequency")

frequencyplot

Creating and viewing n-grams

I will now use the quanteda library (RWeka did not work for me, and it’s been a struggle) to create 1, 2 and 3 n-grams over the curated set. Then, I will order the ngram data frames in descending order, so we have the most frequently used terms on top.

# 1. Create ngrams
ngram1 <- ngrams(corpus2, n = 1)
ngram2 <- ngrams(corpus2, n = 2)
ngram3 <- ngrams(corpus2, n = 3)

# 2. Tranform the ngrams into "document-features matrix"
suppressMessages(dfm.ngram1 <- dfm(ngram1, ignoredFeatures = stopwords("english"), stem = TRUE))
suppressMessages(dfm.ngram2 <- dfm(ngram2, ignoredFeatures = stopwords("english"), stem = TRUE))
suppressMessages(dfm.ngram3 <- dfm(ngram3, ignoredFeatures = stopwords("english"), stem = TRUE))

# 3. Access a list of the most frequently occurring features
## a. for unigram
top100ngram1 <- as.data.frame(topfeatures(dfm.ngram1, 100)) # creating a dataframe of 100 most common unigrams
write.xlsx(top100ngram1, file = "topwords.xlsx", sheetName = "top100ngram1", append = TRUE)
top100ngram1 <- read.xlsx("topwords.xlsx", sheetName = "top100ngram1")
names(top100ngram1) <- c("feature", "frequency")

## b. for bigram
top100ngram2 <- as.data.frame(topfeatures(dfm.ngram2, 100)) # creating a dataframe of 100 most common bigrams
write.xlsx(top100ngram2, file = "topwords.xlsx", sheetName = "top100ngram2", append = TRUE)
top100ngram2 <- read.xlsx("topwords.xlsx", sheetName = "top100ngram2")
names(top100ngram2) <- c("feature", "frequency")

## c. for trigram
top100ngram3 <- as.data.frame(topfeatures(dfm.ngram3, 100)) # creating a dataframe of 100 most common trigrams
write.xlsx(top100ngram3, file = "topwords.xlsx", sheetName = "top100ngram3", append = TRUE)
top100ngram3 <- read.xlsx("topwords.xlsx", sheetName = "top100ngram3")
names(top100ngram3) <- c("feature", "frequency")

# 4. Create graphs for unigram, bigram and trigram
# First, I sort by frequency in decreasing order, just to make the end plot prettier
## a. for unigram
top100ngram1$feature <- factor(top100ngram1$feature, levels = top100ngram1$feature[order(top100ngram1$frequency)])

## b. for bigram
top100ngram2$feature <- factor(top100ngram2$feature, levels = top100ngram2$feature[order(top100ngram2$frequency)])

## c. for trigram
top100ngram3$feature <- factor(top100ngram3$feature, levels = top100ngram3$feature[order(top100ngram3$frequency)])

# Then, I create the plot in the gglpot environment
## a. for unigram
frequencyplot.ngram1 <- ggplot(head(top100ngram1, n = 30), aes(feature, frequency)) +
        geom_bar(stat = "identity") +
        coord_flip() +
        theme_minimal() + ggtitle("Unigram") +
        xlab("Feature") + ylab("Frequency")

## b. for bigram
frequencyplot.ngram2 <- ggplot(head(top100ngram2, n = 30), aes(feature, frequency)) +
        geom_bar(stat = "identity") +
        coord_flip() +
        theme_minimal() + ggtitle("Bigram") +
        xlab("Feature") + ylab("Frequency")

## c. for trigram
frequencyplot.ngram3 <- ggplot(head(top100ngram3, n = 30), aes(feature, frequency)) +
        geom_bar(stat = "identity") +
        coord_flip() +
        theme_minimal() + ggtitle("Trigram") +
        xlab("Feature") + ylab("Frequency")

frequencyplot.ngram1

frequencyplot.ngram2

frequencyplot.ngram3

Similarities between words (?? - whatever, I’m just putting it out there)

I can use distances to draw comparisons between words and cluster them. Here, I do it for the 30 most common words. Note: Initially, I was trying to plot the dendrogram but couldn’t figure out a way to show the labels of the leaves, i.e. the words. I found the answer here: https://github.com/kbenoit/quanteda/issues/112; or here http://www.kenbenoit.net/courses/nyu2014qta/day7examples.R. God bless the guy.

1. Dendrogram

# hierarchical clustering on words
# weight by relative term frequency
word.dfm <- sort(tf(dfm, "prop")) # sort in decreasing order of total word frequency
word.dfm <- t(word.dfm)[1:30, ] # because it is transposed
word.dist.mat <- dist(word.dfm)
wordcluster <- hclust(word.dist.mat)

# This is a different presentation which I find more appealing 
wordcluster <- as.dendrogram(wordcluster) 

# Color the branches using color_branches() from 'dendextend' package
colorbranch <- color_branches(wordcluster, k=20)

# To color also labels, plug the result of coloring branches to color_labels()
allcolor <- color_labels(colorbranch, k=20)

# Finally, plot the dendrogram
plot(allcolor, xlab = "", ylab = "Height", main = "Euclidean Distance on Normalized Token Frequency")

2. k-means clustering

# k-means clustering
k_means_dfm <- trim(head(dfm, n = 200), minCount = 5, minDoc = 3)

## Document-feature matrix of: 426,966 documents, 131,641 features.
## (showing first 200 documents and first 6 features)
##          features
## docs      magic fail allow roster deterior team
##   text1       1    1     1      1        1    1
##   text2       0    0     0      0        0    0
##   text3       0    0     0      0        0    0
##   text4       0    0     0      0        0    0
##   text5       0    0     0      0        0    1
##   text6       0    0     1      0        0    0
##   text7       0    0     1      0        0    0
##   text8       0    0     0      0        0    0
##   text9       0    0     0      0        0    0
##   text10      0    0     0      0        0    0
##   text11      0    0     0      0        0    0
##   text12      0    0     0      0        0    0
##   text13      0    0     0      0        0    0
##   text14      0    0     0      0        0    0
##   text15      0    0     0      0        0    0
##   text16      0    0     0      0        0    0
##   text17      0    0     0      0        0    0
##   text18      0    0     0      0        0    0
##   text19      0    0     0      0        0    0
##   text20      0    0     0      0        0    0
##   text21      0    0     0      0        0    0
##   text22      0    0     0      0        0    0
##   text23      0    0     0      0        0    0
##   text24      0    0     0      0        0    0
##   text25      0    0     0      0        0    0
##   text26      0    0     0      0        0    0
##   text27      0    0     0      0        0    0
##   text28      0    0     0      0        0    0
##   text29      0    0     0      0        0    0
##   text30      0    0     0      0        0    0
##   text31      0    0     0      0        0    0
##   text32      0    0     0      0        0    0
##   text33      0    0     0      0        0    0
##   text34      0    0     0      0        0    0
##   text35      0    0     0      0        0    0
##   text36      0    0     0      0        0    0
##   text37      0    0     0      0        0    0
##   text38      0    0     0      0        0    0
##   text39      0    0     0      0        0    0
##   text40      0    0     0      0        0    0
##   text41      0    0     0      0        0    0
##   text42      0    0     0      0        0    0
##   text43      0    0     0      0        0    0
##   text44      0    0     0      0        0    0
##   text45      0    0     0      0        0    0
##   text46      1    0     0      0        0    0
##   text47      0    0     0      0        0    0
##   text48      0    0     0      0        0    0
##   text49      0    0     0      0        0    0
##   text50      0    0     0      0        0    0
##   text51      0    0     0      0        0    0
##   text52      0    0     0      0        0    0
##   text53      0    0     0      0        0    0
##   text54      0    0     0      0        0    0
##   text55      0    0     0      0        0    0
##   text56      0    0     0      0        0    0
##   text57      0    0     0      0        0    0
##   text58      0    0     0      0        0    0
##   text59      0    0     0      0        0    0
##   text60      0    0     0      0        0    0
##   text61      0    0     0      0        0    0
##   text62      0    0     0      0        0    0
##   text63      0    0     0      0        0    0
##   text64      0    0     0      0        0    0
##   text65      0    0     0      0        0    0
##   text66      0    0     0      0        0    0
##   text67      0    0     0      0        0    0
##   text68      0    0     0      0        0    0
##   text69      0    0     0      0        0    0
##   text70      0    0     0      0        0    0
##   text71      0    0     0      0        0    0
##   text72      0    0     0      0        0    0
##   text73      0    0     0      0        0    0
##   text74      0    0     0      0        0    0
##   text75      0    0     0      0        0    0
##   text76      0    0     0      0        0    0
##   text77      0    0     0      0        0    0
##   text78      0    0     0      0        0    0
##   text79      0    0     0      0        0    0
##   text80      0    0     0      0        0    0
##   text81      0    0     0      0        0    0
##   text82      0    0     0      0        0    0
##   text83      0    0     0      0        0    0
##   text84      0    0     0      0        0    0
##   text85      0    0     0      0        0    0
##   text86      0    0     0      0        0    0
##   text87      0    0     0      0        0    0
##   text88      0    0     0      0        0    0
##   text89      0    0     0      0        0    0
##   text90      0    0     0      0        0    0
##   text91      0    0     0      0        0    0
##   text92      0    0     0      0        0    0
##   text93      0    0     0      0        0    0
##   text94      0    0     0      0        0    0
##   text95      0    0     0      0        0    0
##   text96      0    0     0      0        0    0
##   text97      0    0     0      0        0    0
##   text98      0    0     0      0        0    0
##   text99      0    0     0      0        0    0
##   text100     0    0     0      0        0    0
##   text101     0    0     0      0        0    0
##   text102     0    0     0      0        0    0
##   text103     0    0     0      0        0    2
##   text104     0    1     0      0        0    0
##   text105     0    0     0      0        0    0
##   text106     0    0     0      0        0    0
##   text107     0    0     0      0        0    1
##   text108     0    0     0      0        0    0
##   text109     0    0     0      0        0    0
##   text110     0    0     0      0        0    0
##   text111     0    0     0      0        0    0
##   text112     0    0     0      0        0    0
##   text113     0    0     0      0        0    0
##   text114     0    0     0      0        0    1
##   text115     0    0     0      0        0    0
##   text116     0    0     0      0        0    0
##   text117     0    0     0      0        0    0
##   text118     0    0     0      0        0    0
##   text119     0    0     0      0        0    0
##   text120     0    0     0      0        0    0
##   text121     0    0     0      0        0    0
##   text122     0    0     0      0        0    0
##   text123     0    0     0      0        0    0
##   text124     0    0     1      0        0    0
##   text125     0    0     0      0        0    0
##   text126     0    0     0      0        0    0
##   text127     0    0     0      0        0    0
##   text128     0    0     0      0        0    0
##   text129     0    0     0      0        0    0
##   text130     0    0     0      0        0    0
##   text131     0    0     0      0        0    1
##   text132     0    0     0      0        0    0
##   text133     0    0     0      0        0    0
##   text134     0    0     0      0        0    0
##   text135     0    0     0      0        0    0
##   text136     0    0     0      0        0    0
##   text137     0    0     0      0        0    0
##   text138     0    0     0      0        0    0
##   text139     0    0     0      0        0    0
##   text140     0    0     0      0        0    0
##   text141     0    0     0      0        0    0
##   text142     0    0     0      0        0    0
##   text143     0    0     0      0        0    0
##   text144     0    0     0      0        0    0
##   text145     0    0     0      0        0    0
##   text146     0    0     0      0        0    0
##   text147     0    0     0      0        0    0
##   text148     0    0     0      0        0    0
##   text149     0    0     0      0        0    0
##   text150     0    0     0      0        0    0
##   text151     0    0     0      0        0    0
##   text152     0    0     0      0        0    0
##   text153     0    0     0      0        0    0
##   text154     0    0     0      0        0    0
##   text155     0    0     0      0        0    0
##   text156     0    0     0      0        0    0
##   text157     0    0     0      0        0    0
##   text158     0    0     0      0        0    0
##   text159     0    0     1      0        0    0
##   text160     0    0     0      0        0    0
##   text161     0    0     0      0        0    0
##   text162     0    0     1      0        0    0
##   text163     0    0     0      0        0    0
##   text164     0    0     0      0        0    0
##   text165     0    0     0      0        0    0
##   text166     0    0     0      0        0    0
##   text167     0    0     0      0        0    0
##   text168     0    0     0      0        0    0
##   text169     0    0     0      0        0    0
##   text170     0    0     0      0        0    0
##   text171     0    0     0      0        0    0
##   text172     0    0     0      0        0    0
##   text173     0    0     0      0        0    0
##   text174     0    0     0      0        0    0
##   text175     0    0     0      0        0    0
##   text176     0    0     0      0        0    0
##   text177     0    0     0      0        0    0
##   text178     0    0     0      0        0    0
##   text179     0    0     0      0        0    0
##   text180     0    0     0      0        0    0
##   text181     0    0     0      0        0    0
##   text182     0    0     0      0        0    0
##   text183     0    0     0      0        0    0
##   text184     0    0     0      0        0    0
##   text185     0    0     0      0        0    0
##   text186     0    0     0      0        0    0
##   text187     0    0     0      0        0    0
##   text188     0    0     0      0        0    0
##   text189     0    0     0      0        0    0
##   text190     0    0     0      0        0    0
##   text191     0    0     0      0        0    0
##   text192     0    0     0      0        0    0
##   text193     0    0     0      0        0    0
##   text194     0    0     0      0        0    0
##   text195     0    0     0      0        0    0
##   text196     0    0     0      0        0    0
##   text197     0    0     0      0        0    0
##   text198     0    0     0      0        0    0
##   text199     0    0     0      0        0    0
##   text200     0    0     0      0        0    0

## Removing features occurring fewer than 5 times: 4

## Removing features occurring in fewer than 3 documents: 4

# try default guidelines
k <- round(sqrt(ndoc(k_means_dfm)/2))

# clusterk5 <- kmeans(tf(k_means_dfm, "prop"), k)
# split(docnames(k_means_dfm), clusterk5$cluster)

clusterk3 <- kmeans(tf(k_means_dfm, "prop"), 3)
split(docnames(k_means_dfm), clusterk3$cluster)

## $`1`
## [1] "text1"   "text5"   "text103" "text107" "text114" "text131"
## 
## $`2`
## [1] "text6"   "text7"   "text124" "text159" "text162"
## 
## $`3`
##   [1] "text2"   "text3"   "text4"   "text8"   "text9"   "text10"  "text11" 
##   [8] "text12"  "text13"  "text14"  "text15"  "text16"  "text17"  "text18" 
##  [15] "text19"  "text20"  "text21"  "text22"  "text23"  "text24"  "text25" 
##  [22] "text26"  "text27"  "text28"  "text29"  "text30"  "text31"  "text32" 
##  [29] "text33"  "text34"  "text35"  "text36"  "text37"  "text38"  "text39" 
##  [36] "text40"  "text41"  "text42"  "text43"  "text44"  "text45"  "text46" 
##  [43] "text47"  "text48"  "text49"  "text50"  "text51"  "text52"  "text53" 
##  [50] "text54"  "text55"  "text56"  "text57"  "text58"  "text59"  "text60" 
##  [57] "text61"  "text62"  "text63"  "text64"  "text65"  "text66"  "text67" 
##  [64] "text68"  "text69"  "text70"  "text71"  "text72"  "text73"  "text74" 
##  [71] "text75"  "text76"  "text77"  "text78"  "text79"  "text80"  "text81" 
##  [78] "text82"  "text83"  "text84"  "text85"  "text86"  "text87"  "text88" 
##  [85] "text89"  "text90"  "text91"  "text92"  "text93"  "text94"  "text95" 
##  [92] "text96"  "text97"  "text98"  "text99"  "text100" "text101" "text102"
##  [99] "text104" "text105" "text106" "text108" "text109" "text110" "text111"
## [106] "text112" "text113" "text115" "text116" "text117" "text118" "text119"
## [113] "text120" "text121" "text122" "text123" "text125" "text126" "text127"
## [120] "text128" "text129" "text130" "text132" "text133" "text134" "text135"
## [127] "text136" "text137" "text138" "text139" "text140" "text141" "text142"
## [134] "text143" "text144" "text145" "text146" "text147" "text148" "text149"
## [141] "text150" "text151" "text152" "text153" "text154" "text155" "text156"
## [148] "text157" "text158" "text160" "text161" "text163" "text164" "text165"
## [155] "text166" "text167" "text168" "text169" "text170" "text171" "text172"
## [162] "text173" "text174" "text175" "text176" "text177" "text178" "text179"
## [169] "text180" "text181" "text182" "text183" "text184" "text185" "text186"
## [176] "text187" "text188" "text189" "text190" "text191" "text192" "text193"
## [183] "text194" "text195" "text196" "text197" "text198" "text199" "text200"

clusterk2 <- kmeans(tf(k_means_dfm, "prop"), 2)
split(docnames(k_means_dfm), clusterk2$cluster)

## $`1`
## [1] "text1"   "text5"   "text103" "text107" "text114" "text131"
## 
## $`2`
##   [1] "text2"   "text3"   "text4"   "text6"   "text7"   "text8"   "text9"  
##   [8] "text10"  "text11"  "text12"  "text13"  "text14"  "text15"  "text16" 
##  [15] "text17"  "text18"  "text19"  "text20"  "text21"  "text22"  "text23" 
##  [22] "text24"  "text25"  "text26"  "text27"  "text28"  "text29"  "text30" 
##  [29] "text31"  "text32"  "text33"  "text34"  "text35"  "text36"  "text37" 
##  [36] "text38"  "text39"  "text40"  "text41"  "text42"  "text43"  "text44" 
##  [43] "text45"  "text46"  "text47"  "text48"  "text49"  "text50"  "text51" 
##  [50] "text52"  "text53"  "text54"  "text55"  "text56"  "text57"  "text58" 
##  [57] "text59"  "text60"  "text61"  "text62"  "text63"  "text64"  "text65" 
##  [64] "text66"  "text67"  "text68"  "text69"  "text70"  "text71"  "text72" 
##  [71] "text73"  "text74"  "text75"  "text76"  "text77"  "text78"  "text79" 
##  [78] "text80"  "text81"  "text82"  "text83"  "text84"  "text85"  "text86" 
##  [85] "text87"  "text88"  "text89"  "text90"  "text91"  "text92"  "text93" 
##  [92] "text94"  "text95"  "text96"  "text97"  "text98"  "text99"  "text100"
##  [99] "text101" "text102" "text104" "text105" "text106" "text108" "text109"
## [106] "text110" "text111" "text112" "text113" "text115" "text116" "text117"
## [113] "text118" "text119" "text120" "text121" "text122" "text123" "text124"
## [120] "text125" "text126" "text127" "text128" "text129" "text130" "text132"
## [127] "text133" "text134" "text135" "text136" "text137" "text138" "text139"
## [134] "text140" "text141" "text142" "text143" "text144" "text145" "text146"
## [141] "text147" "text148" "text149" "text150" "text151" "text152" "text153"
## [148] "text154" "text155" "text156" "text157" "text158" "text159" "text160"
## [155] "text161" "text162" "text163" "text164" "text165" "text166" "text167"
## [162] "text168" "text169" "text170" "text171" "text172" "text173" "text174"
## [169] "text175" "text176" "text177" "text178" "text179" "text180" "text181"
## [176] "text182" "text183" "text184" "text185" "text186" "text187" "text188"
## [183] "text189" "text190" "text191" "text192" "text193" "text194" "text195"
## [190] "text196" "text197" "text198" "text199" "text200"

# Render the cluster plot
clusplot(as.matrix(k_means_dfm), clusterk2$cluster, color=TRUE, shade=TRUE, labels=2, lines=0, main = "K-means clustering")