Milestone Report

Introduction

Around the world, people are spending an increasing amount of time on their mobile devices for email, social networking, banking and a whole range of other activities. But typing on mobile devices can be a serious pain. SwiftKey, our corporate partner in this capstone, builds a smart keyboard that makes it easier for people to type on their mobile devices.

In this project, we will analyze a large corpus of text documents to discover the structure in the data and how words are put together.

In this report, I cover the work I have done so far on loading and cleaning the text data, followed by an exploratory analysis.

Loading Data

The data is provided by the Coursera course Data Science Capstone via the link

After downloading and unzipping the file Coursera-SwiftKey.zip, we have a folder named final, which consists of four sub-folders corresponding to four locales: en_US (English), de_DE (German), ru_RU (Russian) and fi_FI (Finnish). Each sub-folder has three text files collected from different sources (blogs, news and Twitter) in the given language.

In this report, my work focuses on the English data, so I only consider the files in the folder named en_US.

cname <- file.path(".", "final", "en_US")
dir(cname)

## [1] "en_US.blogs.txt"   "en_US.news.txt"    "en_US.twitter.txt"

A basic summary of these files:

getTextInfo <- function(dirName) {
        files <- list.files(dirName)
        info <- lapply(files, function(fileName) {
                extPath <- file.path(dirName, fileName)

                ## File size in Mbytes
                size <- file.size(extPath)
                mbSize <- paste0(round(size/(1024^2), 2), " MB")

                ## Open the file
                con <- file(extPath, "r")

                nLines <- 0
                maxLine <- 0
                nWords <- 0

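                while (TRUE) {
                        ## Read one line at a time to keep memory use low
                        line <- readLines(con, n = 1, skipNul = TRUE)
                        if ( length(line) == 0 ) {
                                break
                        }
                        nLines <- nLines + 1

                        ## Approximate word count: runs of non-word characters, plus one
                        lineLength <- sapply(gregexpr("\\W+", line), length) + 1

                        nWords <- nWords + lineLength

                        ## Track the longest line, measured in words
                        if ( lineLength > maxLine ) {
                                maxLine <- lineLength
                        }
                }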

                close(con)

                return(c(fileName, mbSize, nLines, maxLine, nWords))
        })

        ## Overview in a dataframe
        dataInfo <- data.frame(matrix(unlist(info), nrow = length(info), byrow = TRUE))
        names(dataInfo) <- c("Name", "Size", "Num of Lines", "Max Line", "Num of Words")
        return(dataInfo)
}

dataInfo <- getTextInfo("final/en_US")
dataInfo

##                Name      Size Num of Lines Max Line Num of Words
## 1   en_US.blogs.txt 200.42 MB       899288     6852     39120483
## 2    en_US.news.txt 196.28 MB      1010242     1929     36721085
## 3 en_US.twitter.txt 159.36 MB      2360148       47     32793443
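The word counts above come from counting runs of non-word characters and adding one, which is only an estimate. As a rough illustration on made-up lines (not taken from the corpus), the sketch below compares that estimate with a count based on splitting each line; the variable names are hypothetical.

```r
# Toy lines, made up for illustration
lines <- c("Hello, world!", "one", "a b  c")

# Same estimate as in getTextInfo: number of separator matches, plus one.
# When nothing matches, gregexpr returns -1, which still has length 1.
approxCount <- sapply(gregexpr("\\W+", lines), length) + 1

# Alternative: split on separators and count the non-empty pieces
exactCount <- sapply(strsplit(lines, "\\W+"), function(w) sum(nzchar(w)))

data.frame(line = lines, approxCount, exactCount)
```

The estimate runs one too high for a line that ends in punctuation or holds a single word, so the totals in the table are best read as approximate.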

Since the datasets are very large, we take a 1% random sample of each dataset as a representative subset and write it to a new file.

set.seed(2018 - 8 - 2)  # note: the expression evaluates to 2008
blogLines <- readLines("final/en_US/en_US.blogs.txt")
newsLines <- readLines("final/en_US/en_US.news.txt")
twitterLines <- readLines("final/en_US/en_US.twitter.txt")
# get samples
sampleBlog <- sample(blogLines, length(blogLines) * 0.01)
sampleNews <- sample(newsLines, length(newsLines) * 0.01)
sampleTwitter <- sample(twitterLines, length(twitterLines) * 0.01)
# write to files (create the output folder first if it does not exist)
dir.create("final/sample", showWarnings = FALSE)
write(sampleBlog, "final/sample/en_US.blogs.txt")
write(sampleNews, "final/sample/en_US.news.txt")
write(sampleTwitter, "final/sample/en_US.twitter.txt")
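Reading each full file into memory just to keep 1% of it can be expensive. A possible alternative, sketched here under my own assumptions, is to read the file in chunks and keep each line with a fixed probability. The helper name sampleFile and its parameters are hypothetical, and the result is approximately (not exactly) the requested fraction of lines.

```r
# Hypothetical helper: keep each line with probability `prob`, reading in chunks
sampleFile <- function(path, prob, chunkSize = 10000) {
    con <- file(path, "r")
    on.exit(close(con))
    kept <- character(0)
    repeat {
        chunk <- readLines(con, n = chunkSize, skipNul = TRUE)
        if (length(chunk) == 0) break
        # One coin flip per line decides whether it is kept
        keep <- rbinom(length(chunk), size = 1, prob = prob) == 1
        kept <- c(kept, chunk[keep])
    }
    kept
}

# Demonstration on a temporary file with made-up content
tmp <- tempfile(fileext = ".txt")
writeLines(paste("line", 1:1000), tmp)
set.seed(1)
demoSample <- sampleFile(tmp, prob = 0.1, chunkSize = 100)
length(demoSample)
```

Only one chunk is held in memory at a time, at the cost of a sample size that varies slightly from run to run.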

Load the files into a corpus using the tm package

cname <- file.path(".", "final", "sample")
dir(cname)

## [1] "en_US.blogs.txt"   "en_US.news.txt"    "en_US.twitter.txt"

library(tm)
docs <- VCorpus(DirSource(cname))
summary(docs)

##                   Length Class             Mode
## en_US.blogs.txt   2      PlainTextDocument list
## en_US.news.txt    2      PlainTextDocument list
## en_US.twitter.txt 2      PlainTextDocument list

inspect(docs)

## <<VCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 3
## 
## [[1]]
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 2055467
## 
## [[2]]
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 2021892
## 
## [[3]]
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 1623230

Preprocessing

# Removing punctuation
docs <- tm_map(docs, removePunctuation)
# Removing special characters (each pattern is replaced by a space)
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
docs <- tm_map(docs, toSpace, "/")
docs <- tm_map(docs, toSpace, "@")
docs <- tm_map(docs, toSpace, "\\|")
docs <- tm_map(docs, toSpace, "\u2028")
# Converting to lowercase; content_transformer keeps each document a PlainTextDocument
docs <- tm_map(docs, content_transformer(tolower))
# Removing “stopwords” (common words) that usually have no analytic value.
docs <- tm_map(docs, removeWords, stopwords("english"))
# Removing common word endings (e.g., “ing”, “es”, “s”); stemDocument relies on the SnowballC package
docs <- tm_map(docs, stemDocument)
# Stripping unnecessary whitespace from the documents
docs <- tm_map(docs, stripWhitespace)
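To make the effect of these steps concrete, here is a small base R sketch that mimics them on a made-up sentence. It is a stand-in under my own assumptions, not the tm implementation: the stopword list is a tiny hand-picked one, stemming is left out, and the function name cleanText is hypothetical.

```r
# Made-up example text
txt <- "The cats/dogs are RUNNING @ the park, aren't they?"

# Tiny hand-picked stopword list (an assumption; tm's English list is much longer)
stopWords <- c("the", "are", "they", "arent")
stopPattern <- paste0("\\b(", paste(stopWords, collapse = "|"), ")\\b")

# Hypothetical stand-in for the tm steps: punctuation, case, stopwords, whitespace
cleanText <- function(x) {
    x <- gsub("[[:punct:]]+", "", x)          # strip punctuation
    x <- tolower(x)                           # lowercase
    x <- gsub(stopPattern, "", x, perl = TRUE) # drop stopwords
    trimws(gsub("\\s+", " ", x))              # collapse whitespace
}

# Punctuation stripped first, as in the pipeline above
cleanText(txt)                        # "catsdogs running park"

# Separators turned into spaces before punctuation is stripped
cleanText(gsub("[/@|]", " ", txt))    # "cats dogs running park"
```

The two results differ because stripping punctuation first glues "cats/dogs" into a single token, so the order of the steps is worth keeping in mind when reading the term lists.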

Explore Data

Create a document term matrix

dtm <- DocumentTermMatrix(docs)
dtm

## <<DocumentTermMatrix (documents: 3, terms: 47737)>>
## Non-/sparse entries: 68479/74732
## Sparsity           : 52%
## Maximal term length: 165
## Weighting          : term frequency (tf)

Word Frequency

A view of term frequency

freq <- sort(colSums(as.matrix(dtm)), decreasing=TRUE)   
head(freq, 30)

##  will  like   one   get  just  said  time   can   day  year  make  love 
##  3266  3060  3050  3030  2984  2977  2626  2466  2200  2186  2087  2032 
##   new  good  know   now  work   say   see peopl thank  want think  come 
##  1963  1886  1836  1758  1724  1701  1645  1615  1529  1525  1517  1480 
##  look  dont  back  need first   use 
##  1451  1450  1446  1403  1396  1349

An alternate view of term frequency

This will identify all terms that appear frequently (in this case, 1000 or more times).

findFreqTerms(dtm, lowfreq=1000)

##  [1] "also"   "back"   "can"    "come"   "day"    "dont"   "even"  
##  [8] "first"  "follow" "game"   "get"    "good"   "got"    "great" 
## [15] "just"   "know"   "last"   "like"   "look"   "love"   "make"  
## [22] "much"   "need"   "new"    "now"    "one"    "peopl"  "play"  
## [29] "realli" "right"  "said"   "say"    "see"    "start"  "take"  
## [36] "thank"  "thing"  "think"  "time"   "today"  "two"    "use"   
## [43] "want"   "way"    "week"   "well"   "will"   "work"   "year"
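The two views are closely related: the second one amounts to filtering the frequency vector by a threshold. A minimal sketch on a toy named vector follows, where the first four counts are copied from the output above and "zebra" is a made-up entry.

```r
# Toy frequency vector; "zebra" is invented for illustration
freqToy <- c(will = 3266, like = 3060, thank = 1529, use = 1349, zebra = 12)

# Terms whose total count reaches the threshold, in alphabetical order
frequentTerms <- sort(names(freqToy)[freqToy >= 1500])
frequentTerms
```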

Plot Word Frequencies

library(ggplot2)
wf <- data.frame(word=names(freq), freq=freq)
g <- ggplot(subset(wf, freq>1000), aes(x = reorder(word, -freq), y = freq)) +
          geom_bar(stat = "identity") + 
          theme(axis.text.x=element_text(angle=45, hjust=1))
g <- g + ggtitle("Words that appear more than 1000 times")
g

Relationships Between Terms

Term Correlations

If words always appear together, then correlation=1.0.
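It helps to remember that the correlation is computed across documents, and this corpus has only three of them, so each term is described by just three counts. The sketch below uses made-up counts (ordered as blogs, news, twitter) to show how easily two terms reach a correlation near 1 in that setting.

```r
# Made-up counts per document: blogs, news, twitter
hate <- c(10, 5, 40)
wow  <- c(3, 1, 15)
said <- c(50, 300, 5)

# Two terms concentrated in the same source are almost perfectly correlated
cor(hate, wow)

# A term concentrated in a different source is negatively correlated
cor(hate, said)
```

With three data points per term, values of 0.99 or 1.00 mostly say that two terms are concentrated in the same source, which should be kept in mind when reading the long lists below.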

findAssocs(dtm, c("hate" , "love"), corlimit=0.99)

## $hate
##       10am       10pm        1st       29th        3rd      630pm 
##       1.00       1.00       1.00       1.00       1.00       1.00 
##        7am        7pm        8pm        9am    acronym      advic 
##       1.00       1.00       1.00       1.00       1.00       1.00 
##       alot    alright   alrighti        ant      anyon     anytim 
##       1.00       1.00       1.00       1.00       1.00       1.00 
##      asian        ass     asshol    atleast     austin     avatar 
##       1.00       1.00       1.00       1.00       1.00       1.00 
##     awesom    awkward     badass      bagel        bam      betti 
##       1.00       1.00       1.00       1.00       1.00       1.00 
##        bff   birthday      bitch blackberri   blackout      booti 
##       1.00       1.00       1.00       1.00       1.00       1.00 
##     brewer     broken    browser        btw      buddi        bye 
##       1.00       1.00       1.00       1.00       1.00       1.00 
##        cam    catalog      check cheeseburg      chill      class 
##       1.00       1.00       1.00       1.00       1.00       1.00 
##  coachella      cobra       coke        com      comfi       comp 
##       1.00       1.00       1.00       1.00       1.00       1.00 
##  congratul       cont      convo    coolest     coupon   crawfish 
##       1.00       1.00       1.00       1.00       1.00       1.00 
##       damn       dang      delet       dell       demo       dept 
##       1.00       1.00       1.00       1.00       1.00       1.00 
##     deserv       dick       dirt      doggi       dope      douch 
##       1.00       1.00       1.00       1.00       1.00       1.00 
##      drunk       dumb      dunno        dvr    eachoth        err 
##       1.00       1.00       1.00       1.00       1.00       1.00 
##        est        fab       fake        fav       fave       flop 
##       1.00       1.00       1.00       1.00       1.00       1.00 
##       fuck     fucker        fyi        gah     ghetto       girl 
##       1.00       1.00       1.00       1.00       1.00       1.00 
##      givin       glad      great   greatest        gum    haircut 
##       1.00       1.00       1.00       1.00       1.00       1.00 
##    hangout      happi     havent   headphon      heheh        hmm 
##       1.00       1.00       1.00       1.00       1.00       1.00 
##        hoo     housew        hrs        huh       hype        ili 
##       1.00       1.00       1.00       1.00       1.00       1.00 
##        ill       info  instagram      intro       itll        jag 
##       1.00       1.00       1.00       1.00       1.00       1.00 
##    jealous       join     karaok kardashian  kickstart      kiddi 
##       1.00       1.00       1.00       1.00       1.00       1.00 
##       kobe       lame       latt    learner   linkedin       loll 
##       1.00       1.00       1.00       1.00       1.00       1.00 
##      lovin       luck      lunch        mad        man        mca 
##       1.00       1.00       1.00       1.00       1.00       1.00 
##        meh       meow      messi        mil        min       mint 
##       1.00       1.00       1.00       1.00       1.00       1.00 
##      mobil       morn      moron        msg        nah     nephew 
##       1.00       1.00       1.00       1.00       1.00       1.00 
##  nevermind     newest      niall      nippl       nope     nothin 
##       1.00       1.00       1.00       1.00       1.00       1.00 
##        nyc       olli       omfg      oprah      overr   password 
##       1.00       1.00       1.00       1.00       1.00       1.00 
##       peep        pic       poop       porn    preview     profil 
##       1.00       1.00       1.00       1.00       1.00       1.00 
##       prop      proud      pussi        rad      readi    reunion 
##       1.00       1.00       1.00       1.00       1.00       1.00 
##        rip     semest       semi     server       sexi      shirt 
##       1.00       1.00       1.00       1.00       1.00       1.00 
##       shit    shotgun      shout        sht     skinni       slut 
##       1.00       1.00       1.00       1.00       1.00       1.00 
##    snicker      sniff       snow    someday      sorri    special 
##       1.00       1.00       1.00       1.00       1.00       1.00 
##      spoil    spotifi   starbuck      stink     stupid        sub 
##       1.00       1.00       1.00       1.00       1.00       1.00 
##    submiss   sugarfre      super      swarm      swear   sweetest 
##       1.00       1.00       1.00       1.00       1.00       1.00 
##       sxsw        tat       teas       tech     tellin       text 
##       1.00       1.00       1.00       1.00       1.00       1.00 
##      thank   thankyou       thru    timelin       tire      today 
##       1.00       1.00       1.00       1.00       1.00       1.00 
##   tomorrow     trivia       tune    tweeter     twinkl        ugh 
##       1.00       1.00       1.00       1.00       1.00       1.00 
##        umm       unti    useless   valentin       vega        via 
##       1.00       1.00       1.00       1.00       1.00       1.00 
##     violet        vip       wack      waffl       wait    watchin 
##       1.00       1.00       1.00       1.00       1.00       1.00 
##       weed    weekend   weirdest      wendi      whoop      wiggl 
##       1.00       1.00       1.00       1.00       1.00       1.00 
##        wit     woohoo  wordpress   workshop        wow        wtf 
##       1.00       1.00       1.00       1.00       1.00       1.00 
##        wth       xbox       xoxo        yay       yeah        yer 
##       1.00       1.00       1.00       1.00       1.00       1.00 
##        yes       yoga       youd      youll        yup        4th 
##       1.00       1.00       1.00       1.00       1.00       0.99 
##        9pm        app       aunt        aww       awww        bad 
##       0.99       0.99       0.99       0.99       0.99       0.99 
##       best        bio       blow       bore       bout        bus 
##       0.99       0.99       0.99       0.99       0.99       0.99 
##       butt   calendar      chick      coffe    concert     cousin 
##       0.99       0.99       0.99       0.99       0.99       0.99 
##      crazi        cuz        dat        dem       dont      excit 
##       0.99       0.99       0.99       0.99       0.99       0.99 
##       fest     follow       fool     fuckin       goin      gonna 
##       0.99       0.99       0.99       0.99       0.99       0.99 
##       good  goodnight        got      gotta       haha    hashtag 
##       0.99       0.99       0.99       0.99       0.99       0.99 
##       hell      hello        hoe       hope       itun     killer 
##       0.99       0.99       0.99       0.99       0.99       0.99 
##      momma      music       next      pizza      poker        pre 
##       0.99       0.99       0.99       0.99       0.99       0.99 
##       rage       rain      remix        sad    session     shower 
##       0.99       0.99       0.99       0.99       0.99       0.99 
##      skype      sleep        soo       soon      stoke       stop 
##       0.99       0.99       0.99       0.99       0.99       0.99 
##      stuck     studio       suck        sum  superbowl     sweeti 
##       0.99       0.99       0.99       0.99       0.99       0.99 
##        tho        til      trend     tumblr      wanna        wat 
##       0.99       0.99       0.99       0.99       0.99       0.99 
##      watch    weather     welcom       whoa       wish       yall 
##       0.99       0.99       0.99       0.99       0.99       0.99 
##        yep        yum 
##       0.99       0.99 
## 
## $love
##            2nd           30th            5am            5pm            5th 
##           1.00           1.00           1.00           1.00           1.00 
##            acl          admin          advil           aliv          alpha 
##           1.00           1.00           1.00           1.00           1.00 
##        alreadi           amaz        appreci       aquarius         asleep 
##           1.00           1.00           1.00           1.00           1.00 
##          assoc           aunt      autograph            awe            bad 
##           1.00           1.00           1.00           1.00           1.00 
##        baddest        beckett            bee          beliv           beta 
##           1.00           1.00           1.00           1.00           1.00 
##         better        bipolar        blanket       blizzard           blow 
##           1.00           1.00           1.00           1.00           1.00 
##          booth           bore            boy      boyfriend         breezi 
##           1.00           1.00           1.00           1.00           1.00 
##          broom         browni           buff          bulli       bullshit 
##           1.00           1.00           1.00           1.00           1.00 
##           butt            cat         cereal          chick   choreographi 
##           1.00           1.00           1.00           1.00           1.00 
##          clown           come           cool          couch         cousin 
##           1.00           1.00           1.00           1.00           1.00 
##          crazi         creepi           cuff            cum            day 
##           1.00           1.00           1.00           1.00           1.00 
##        dentist        derrick           desk          detox      directori 
##           1.00           1.00           1.00           1.00           1.00 
##          dirti       discount       dishwash            djs        dubstep 
##           1.00           1.00           1.00           1.00           1.00 
##          email           epic       everyday          excus        extinct 
##           1.00           1.00           1.00           1.00           1.00 
##            fat           feud          final            foo         forgot 
##           1.00           1.00           1.00           1.00           1.00 
##          freak            fri            fuk            fun           gawd 
##           1.00           1.00           1.00           1.00           1.00 
##           geek            get        getaway            gif         ginger 
##           1.00           1.00           1.00           1.00           1.00 
##          gmail          googl        grandma           grit          haiti 
##           1.00           1.00           1.00           1.00           1.00 
##        hammock        handsom        hardcor        hashtag            hat 
##           1.00           1.00           1.00           1.00           1.00 
##           hear           hehe           hell           hope            hug 
##           1.00           1.00           1.00           1.00           1.00 
##         hungri          hurri            imo           indi          insan 
##           1.00           1.00           1.00           1.00           1.00 
##     instrument           jerk           jess        jessica           joel 
##           1.00           1.00           1.00           1.00           1.00 
##           just         killer            kim          kinda            lam 
##           1.00           1.00           1.00           1.00           1.00 
##           lawn            let      lightbulb         listen            mac 
##           1.00           1.00           1.00           1.00           1.00 
##            mag            max        melissa          merch            mom 
##           1.00           1.00           1.00           1.00           1.00 
##           mood            nap         nathan       newcastl           nice 
##           1.00           1.00           1.00           1.00           1.00 
##           nuff            num           okay            ooh         pajama 
##           1.00           1.00           1.00           1.00           1.00 
##          parti         pathet            pbr            pet       phenomen 
##           1.00           1.00           1.00           1.00           1.00 
##       platform          pleas        popcorn     powerpoint         preach 
##           1.00           1.00           1.00           1.00           1.00 
##         presal      priceless           prof         prolli            rid 
##           1.00           1.00           1.00           1.00           1.00 
##       roadtrip            sad          salsa          screw           scum 
##           1.00           1.00           1.00           1.00           1.00 
##       seinfeld           send          setup            sex           shes 
##           1.00           1.00           1.00           1.00           1.00 
##         shitti         shower            sid           sing         sittin 
##           1.00           1.00           1.00           1.00           1.00 
##          slang          sleep         sleepi          sleev          sneez 
##           1.00           1.00           1.00           1.00           1.00 
##           soon           spec           spin        sticker           stop 
##           1.00           1.00           1.00           1.00           1.00 
##          storm         strive          stuck         studio      subconsci 
##           1.00           1.00           1.00           1.00           1.00 
##      sweatpant         sweeti         switch          tasti            tea 
##           1.00           1.00           1.00           1.00           1.00 
##           temp         textil          thang      thanksgiv        toddler 
##           1.00           1.00           1.00           1.00           1.00 
##          troll           tube        unicorn          updat            url 
##           1.00           1.00           1.00           1.00           1.00 
##            usa          vacat          video          virtu           wear 
##           1.00           1.00           1.00           1.00           1.00 
##          weird           whew           whoa        whoever            wig 
##           1.00           1.00           1.00           1.00           1.00 
##            wii      wikipedia           wink           wish           woke 
##           1.00           1.00           1.00           1.00           1.00 
##            woo           woot           wwii           xmas           yang 
##           1.00           1.00           1.00           1.00           1.00 
##            yep           yike          100th            219            247 
##           1.00           1.00           0.99           0.99           0.99 
##            3pm            3rd            4th           500k        abbrevi 
##           0.99           0.99           0.99           0.99           0.99 
##          advic           ahem          alert          alley          altar 
##           0.99           0.99           0.99           0.99           0.99 
##          ambit          annoy      apprentic       artifact         aubrey 
##           0.99           0.99           0.99           0.99           0.99 
##          audio          baffl        balloon      bandwagon      bandwidth 
##           0.99           0.99           0.99           0.99           0.99 
##           bark          beatl           beet        belmont          berni 
##           0.99           0.99           0.99           0.99           0.99 
##          biggi          bloat          blown         blurri            bot 
##           0.99           0.99           0.99           0.99           0.99 
##        bourbon         boweri     brainstorm      breakfast           brim 
##           0.99           0.99           0.99           0.99           0.99 
##           buck          bunni       calendar         carniv    chattanooga 
##           0.99           0.99           0.99           0.99           0.99 
##       childish          citat          class   claustrophob         clover 
##           0.99           0.99           0.99           0.99           0.99 
##          coffe          combo          conan         condom           cone 
##           0.99           0.99           0.99           0.99           0.99 
##      congratul      consensus         convoy          coven           crap 
##           0.99           0.99           0.99           0.99           0.99 
##          crock      crossword            cst          cuddl         curios 
##           0.99           0.99           0.99           0.99           0.99 
##        currant       daffodil         dammit          delet        desktop 
##           0.99           0.99           0.99           0.99           0.99 
##            dew      dickinson discriminatori            dos      downstair 
##           0.99           0.99           0.99           0.99           0.99 
##          drink          dylan            eff          elbow          eliot 
##           0.99           0.99           0.99           0.99           0.99 
##            ell         ernest           erot        everyon       everytim 
##           0.99           0.99           0.99           0.99           0.99 
##           ewhc          excit        eyebrow        family”          femin 
##           0.99           0.99           0.99           0.99           0.99 
##         fetish            fig           fing        flippin           folk 
##           0.99           0.99           0.99           0.99           0.99 
##           fool           foxi         franki        fricken     friendship 
##           0.99           0.99           0.99           0.99           0.99 
##         fungus          funni          futon           gasp           germ 
##           0.99           0.99           0.99           0.99           0.99 
##      gettogeth         gideon           girl            gis           glad 
##           0.99           0.99           0.99           0.99           0.99 
##          gloat           good         gossip            got        grandad 
##           0.99           0.99           0.99           0.99           0.99 
##       greatest         groovi        grouchi            grr          guess 
##           0.99           0.99           0.99           0.99           0.99 
##           guis            hai      handlebar         hannah           hass 
##           0.99           0.99           0.99           0.99           0.99 
##          heath            heh          hello        hideous         hijack 
##           0.99           0.99           0.99           0.99           0.99 
##           hool           hoot           html           hurt        hustler 
##           0.99           0.99           0.99           0.99           0.99 
##           imac         incess      instantan         insult       interfac 
##           0.99           0.99           0.99           0.99           0.99 
##           iota            isa           itun       jalapeno        jameson 
##           0.99           0.99           0.99           0.99           0.99 
##           jeez           jimi           jinx          jonni           junk 
##           0.99           0.99           0.99           0.99           0.99 
##          karma         kinder        koolaid           kyli          lamar 
##           0.99           0.99           0.99           0.99           0.99 
##           lame      lauderdal           leap          lesli           lgbt 
##           0.99           0.99           0.99           0.99           0.99 
##            lib          llama          loren            lps            mad 
##           0.99           0.99           0.99           0.99           0.99 
##         mariah           marx         matrix          meant         mediev 
##           0.99           0.99           0.99           0.99           0.99 
##          micah        mifflin        migrain       mindblow          minus 
##           0.99           0.99           0.99           0.99           0.99 
##          momma           moos            mow         mullah          multi 
##           0.99           0.99           0.99           0.99           0.99 
##       multitud          music            nbd           next     nickelback 
##           0.99           0.99           0.99           0.99           0.99 
##          ninja        nonissu           nono           noob      nostalgia 
##           0.99           0.99           0.99           0.99           0.99 
##            off         offlin            oti          outag       paintbal 
##           0.99           0.99           0.99           0.99           0.99 
##       paraffin          paula            pcs          pecan           pimp 
##           0.99           0.99           0.99           0.99           0.99 
##          plato        pokemon          poker         pollen        poolsid 
##           0.99           0.99           0.99           0.99           0.99 
##            pre       preorder         preset        preteen      prettiest 
##           0.99           0.99           0.99           0.99           0.99 
##         propag    proprietari        proverb          prowl            pun 
##           0.99           0.99           0.99           0.99           0.99 
##            pup        quantit        radiant           rage           rain 
##           0.99           0.99           0.99           0.99           0.99 
##       raindrop         regret        relearn          remix         repent 
##           0.99           0.99           0.99           0.99           0.99 
##         reveri       rhapsodi        rihanna      roadblock          roddi 
##           0.99           0.99           0.99           0.99           0.99 
##          room”          roomi            rss          salli       sarasota 
##           0.99           0.99           0.99           0.99           0.99 
##            saw            sed       selfless          seper        serious 
##           0.99           0.99           0.99           0.99           0.99 
##            sfo           shag       sherlock           shin         shorti 
##           0.99           0.99           0.99           0.99           0.99 
##   shuffleboard           sick        sickest          snoop         snuggl 
##           0.99           0.99           0.99           0.99           0.99 
##         someon           song          spank       stairway          stalk 
##           0.99           0.99           0.99           0.99           0.99 
##        stapler       starbuck         static         stella       storylin 
##           0.99           0.99           0.99           0.99           0.99 
##      subscript            sum      summertim        swagger          swear 
##           0.99           0.99           0.99           0.99           0.99 
##        sweater        swedish        synergi       teriyaki       thespian 
##           0.99           0.99           0.99           0.99           0.99 
##           tivo          trait      trampolin          trend           true 
##           0.99           0.99           0.99           0.99           0.99 
##           ugli        upstair           upto        uruguay            uso 
##           0.99           0.99           0.99           0.99           0.99 
##       vanguard           verb        voiceov            wag         wallow 
##           0.99           0.99           0.99           0.99           0.99 
##            wan          watch          whatd        whimsic         whittl 
##           0.99           0.99           0.99           0.99           0.99 
##         wither          wring         xoxoxo            yeh           yesy 
##           0.99           0.99           0.99           0.99           0.99 
##            yup           zine         zipper 
##           0.99           0.99           0.99

Word Clouds

The 100 most frequently occurring words.

library(RColorBrewer)
library(wordcloud)
dark2 <- brewer.pal(6, "Dark2")
wordcloud(names(freq), freq, max.words=100, rot.per=0.2, colors=dark2)

Clustering by Term Similarity

Removing the uninteresting or infrequent words.

dtmss <- removeSparseTerms(dtm, 0.15) # This makes a matrix that is only 15% empty space, maximum.   
dtmss

## <<DocumentTermMatrix (documents: 3, terms: 7057)>>
## Non-/sparse entries: 21171/0
## Sparsity           : 0%
## Maximal term length: 15
## Weighting          : term frequency (tf)
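With three documents, a term can only be missing from zero, one or two of them, so its sparsity is 0, 1/3 or 2/3. A threshold of 0.15 therefore keeps only the terms that occur in all three files. The base R sketch below reproduces that filter on a made-up count matrix; it illustrates the idea and is not the tm implementation.

```r
# Made-up counts: documents in rows, terms in columns
toy <- rbind(blogs   = c(good = 2, love = 0, said = 1),
             news    = c(good = 1, love = 0, said = 4),
             twitter = c(good = 5, love = 3, said = 0))

# Share of documents in which each term is absent
termSparsity <- colMeans(toy == 0)
termSparsity

# Keep the terms whose sparsity is below the threshold
kept <- toy[, termSparsity < 0.15, drop = FALSE]
colnames(kept)

# Consistency check on the reported figures: 7057 terms in all 3 documents
7057 * 3 == 21171
```

This explains why the reduced matrix reports 21171 non-sparse entries and 0% sparsity.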

Hierarchical Clustering

d <- dist(t(dtmss), method="euclidean")   
fit <- hclust(d=d, method="complete")   # for a different look try substituting: method="ward.D"
plot(fit, hang=-1)
groups <- cutree(fit, k=6)   # "k=" defines the number of clusters you are using   
rect.hclust(fit, k=6, border="red") # draw dendrogram with red borders around the 6 clusters
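The same three calls can be tried on a very small example to see what the clustering does. The counts below are made up (terms in rows, the three documents in columns) and the object names are hypothetical.

```r
# Made-up term counts per document: blogs, news, twitter
toyCounts <- rbind(like = c(100,  20, 200),
                   love = c( 90,  15, 210),
                   said = c( 40, 300,   5),
                   year = c( 50, 280,  10))

# Euclidean distance between the terms' count profiles
dToy <- dist(toyCounts, method = "euclidean")

# Complete-linkage clustering, then cut the tree into two groups
fitToy <- hclust(dToy, method = "complete")
groupsToy <- cutree(fitToy, k = 2)
groupsToy
```

Terms with similar count profiles across the sources ("like" and "love", "said" and "year") end up in the same group. Since the distances are computed on raw counts, very frequent terms also tend to sit far from everything else.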

Conclusion

In this part, I have learnt how to use some packages for text mining, such as tm and wordcloud.
Specifically, I found that the full datasets are too big and consume a lot of computational resources during analysis. It is better to analyze a sample that represents the dataset.
With the functionality provided in tm, I can do some basic preprocessing steps such as stemming, removing stopwords and stripping whitespace. Moreover, I have learnt a new kind of chart, the word cloud, which is very useful in text mining.
The tm package can generate a document-term matrix, a representation that makes it easy to observe word frequencies.
For future work, I will go on to consider n-gram terms.
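As a first idea of what that next step could look like, here is a minimal base R sketch that extracts and counts bigrams from made-up lines. The function name bigrams and the tokenization rule are my own assumptions, not a settled design.

```r
# Hypothetical helper: split a line into lowercase words and pair neighbours
bigrams <- function(text) {
    words <- strsplit(tolower(text), "[^a-z']+")[[1]]
    words <- words[nzchar(words)]
    if (length(words) < 2) return(character(0))
    paste(words[-length(words)], words[-1])
}

# Made-up lines for illustration
toyLines <- c("I love new york", "I love this day")

# Count how often each bigram occurs, most frequent first
bigramCounts <- sort(table(unlist(lapply(toyLines, bigrams))), decreasing = TRUE)
bigramCounts
```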