Around the world, people are spending an increasing amount of time on their mobile devices for email, social networking, banking and a whole range of other activities. But typing on mobile devices can be a serious pain. SwiftKey, our corporate partner in this capstone, builds a smart keyboard that makes it easier for people to type on their mobile devices.
In this project, we will analyze a large corpus of text documents to discover the structure in the data and how words are put together.
In this report, I describe the work I have done so far: loading and cleaning the text data, followed by an exploratory analysis.
The data is provided by the Coursera course Data Science Capstone through a download link.
After downloading and unzipping the file Coursera-SwiftKey.zip, we have a folder named final that contains four sub-folders, one for each of four locales: en_US (English), de_DE (German), ru_RU (Russian) and fi_FI (Finnish). Each sub-folder holds three text files in the given language, collected from different sources: blogs, news and Twitter.
In this report, my work focuses on the English data, so I only consider the files in the folder named en_US.
cname <- file.path(".", "final", "en_US")
dir(cname)
## [1] "en_US.blogs.txt" "en_US.news.txt" "en_US.twitter.txt"
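Some basic summary statistics for these files (file size, number of lines, largest number of words on a single line, and total number of words):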
getTextInfo <- function(dirName) {
  files <- list.files(dirName)
  info <- lapply(files, function(fileName) {
    extPath <- file.path(dirName, fileName)
    ## File size in megabytes
    size <- file.size(extPath)
    mbSize <- paste0(round(size / (1024^2), 2), " MB")
    ## Read the file line by line, tracking the number of lines, the total
    ## number of words and the largest number of words on a single line
    con <- file(extPath, "r")
    nLines <- 0
    maxLine <- 0
    nWords <- 0
    while (TRUE) {
      line <- readLines(con, n = 1, skipNul = TRUE)
      if (length(line) == 0) {
        break
      }
      nLines <- nLines + 1
      ## Approximate word count: number of non-word runs plus one
      lineLength <- sapply(gregexpr("\\W+", line), length) + 1
      nWords <- nWords + lineLength
      if (lineLength > maxLine) {
        maxLine <- lineLength
      }
    }
    close(con)
    return(c(fileName, mbSize, nLines, maxLine, nWords))
  })
  ## Overview in a data frame, one row per file
  dataInfo <- data.frame(matrix(unlist(info), nrow = length(info), byrow = TRUE))
  names(dataInfo) <- c("Name", "Size", "Num of Lines", "Max Line", "Num of Words")
  return(dataInfo)
}
dataInfo <- getTextInfo("final/en_US")
dataInfo
## Name Size Num of Lines Max Line Num of Words
## 1 en_US.blogs.txt 200.42 MB 899288 6852 39120483
## 2 en_US.news.txt 196.28 MB 1010242 1929 36721085
## 3 en_US.twitter.txt 159.36 MB 2360148 47 32793443
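The word counts above are obtained by counting the runs of non-word characters on each line and adding one. This is only an approximation: a one-word line, or a line that ends in punctuation, is counted one word too high. The sketch below compares that approach with counting the word-like tokens directly on a few toy lines. The helper names countBySeparators and countWords and the sample lines are illustrative assumptions, not part of the original analysis.

```r
# Approach used in getTextInfo: number of non-word runs plus one.
countBySeparators <- function(lines) {
  sapply(gregexpr("\\W+", lines), length) + 1
}

# Hypothetical alternative: count the word-like tokens themselves
# (letters, digits and apostrophes).
countWords <- function(lines) {
  lengths(regmatches(lines, gregexpr("[[:alnum:]']+", lines)))
}

toyLines <- c("hello world", "hello", "Hello, world!")
bySeparators <- countBySeparators(toyLines)
byTokens <- countWords(toyLines)

# Side-by-side comparison of the two counts for each toy line.
print(data.frame(line = toyLines, bySeparators, byTokens))
```

For large files the difference is small relative to the totals, so the table above remains a reasonable overview.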
Since the datasets are very large, we take a 1% random sample of the lines of each file as a representative dataset and write each sample to a new file.
set.seed(2018 - 8 - 2)
blogLines <- readLines("final/en_US/en_US.blogs.txt")
newsLines <- readLines("final/en_US/en_US.news.txt")
twitterLines <- readLines("final/en_US/en_US.twitter.txt")
# get samples
sampleBlog <- sample(blogLines, length(blogLines) * 0.01)
sampleNews <- sample(newsLines, length(newsLines) * 0.01)
sampleTwitter <- sample(twitterLines, length(twitterLines) * 0.01)
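# write to files (create the output folder first if it does not exist)
dir.create("final/sample", showWarnings = FALSE)
write(sampleBlog, "final/sample/en_US.blogs.txt")
write(sampleNews, "final/sample/en_US.news.txt")
write(sampleTwitter, "final/sample/en_US.twitter.txt")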
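The sampling step can be wrapped in a small function so that the same fraction and seed are applied to every file. The sketch below is a minimal illustration on a toy vector; the helper name sampleLines and the toy data are assumptions. Note that set.seed(2018 - 8 - 2) is evaluated arithmetically, so the seed actually used in the code above is 2008.

```r
# Hypothetical helper: draw a reproducible random sample of lines.
sampleLines <- function(lines, fraction, seed) {
  set.seed(seed)
  n <- floor(length(lines) * fraction)  # whole number of lines to keep
  sample(lines, n)
}

toyLines <- paste("line", 1:1000)       # stand-in for a real text file
first <- sampleLines(toyLines, 0.01, seed = 2008)
second <- sampleLines(toyLines, 0.01, seed = 2008)

print(length(first))                    # number of sampled lines
print(identical(first, second))         # same seed, same sample
```

Resetting the seed inside the helper makes each file's sample reproducible on its own, at the cost of using the same random stream for every file.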
Load the sample files into a corpus using the tm package:
library(tm)
cname <- file.path(".", "final", "sample")
dir(cname)
## [1] "en_US.blogs.txt" "en_US.news.txt" "en_US.twitter.txt"
docs <- VCorpus(DirSource(cname))
summary(docs)
## Length Class Mode
## en_US.blogs.txt 2 PlainTextDocument list
## en_US.news.txt 2 PlainTextDocument list
## en_US.twitter.txt 2 PlainTextDocument list
inspect(docs)
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 3
##
## [[1]]
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 2055467
##
## [[2]]
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 2021892
##
## [[3]]
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 1623230
# Removing punctuation
docs <- tm_map(docs,removePunctuation)
# Removing special characters
for (j in seq(docs)) {
  docs[[j]] <- gsub("/", " ", docs[[j]])
  docs[[j]] <- gsub("@", " ", docs[[j]])
  docs[[j]] <- gsub("\\|", " ", docs[[j]])
  docs[[j]] <- gsub("\u2028", " ", docs[[j]])
}
# Converting to lowercase
docs <- tm_map(docs, tolower)
docs <- tm_map(docs, PlainTextDocument)
# Removing “stopwords” (common words) that usually have no analytic value.
docs <- tm_map(docs, removeWords, stopwords("english"))
docs <- tm_map(docs, PlainTextDocument)
# Removing common word endings (e.g., “ing”, “es”, “s”)
docs <- tm_map(docs, stemDocument)
docs <- tm_map(docs, PlainTextDocument)
# Stripping unnecessary whitespace from the documents
docs <- tm_map(docs, stripWhitespace)
docs <- tm_map(docs, PlainTextDocument)
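The order of the cleaning steps matters. Assuming removePunctuation strips every punctuation character, the characters "/", "@" and "|" are already gone by the time the loop tries to replace them with spaces, so words joined by them (for example "and/or") end up glued together. The sketch below illustrates a possible alternative order on a single toy string using base R only. The function cleanText, its short stopword list and the sample sentence are illustrative assumptions and do not reproduce the tm pipeline exactly (stemming is left out).

```r
# Hypothetical base-R cleaner that mirrors the main steps in this order:
# separators -> spaces, drop punctuation, lowercase, drop stopwords,
# collapse whitespace.
cleanText <- function(x, stopWords = c("the", "a", "and", "or", "is", "to")) {
  x <- gsub("[/@|]", " ", x)               # separators become spaces first
  x <- gsub("[[:punct:]]+", "", x)         # then remove remaining punctuation
  x <- tolower(x)
  pattern <- paste0("\\b(", paste(stopWords, collapse = "|"), ")\\b")
  x <- gsub(pattern, "", x, perl = TRUE)   # remove whole-word stopwords only
  x <- gsub("\\s+", " ", x)                # collapse runs of whitespace
  trimws(x)
}

cleaned <- cleanText("Send it to me@work and/or call the office!")
print(cleaned)
```

With this order, "me@work" and "and/or" are split into separate words before punctuation is removed, which is usually what a word-frequency analysis needs.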
Create a document-term matrix, with one row per document and one column per term:
dtm <- DocumentTermMatrix(docs)
dtm
## <<DocumentTermMatrix (documents: 3, terms: 47737)>>
## Non-/sparse entries: 68479/74732
## Sparsity : 52%
## Maximal term length: 165
## Weighting : term frequency (tf)
freq <- sort(colSums(as.matrix(dtm)), decreasing=TRUE)
head(freq, 30)
## will like one get just said time can day year make love
## 3266 3060 3050 3030 2984 2977 2626 2466 2200 2186 2087 2032
## new good know now work say see peopl thank want think come
## 1963 1886 1836 1758 1724 1701 1645 1615 1529 1525 1517 1480
## look dont back need first use
## 1451 1450 1446 1403 1396 1349
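To make the structure of a document-term matrix concrete, the sketch below builds one by hand for three tiny, already cleaned documents and then sums the columns to get overall term frequencies, as colSums does above. The toy documents and variable names are assumptions chosen for illustration; DocumentTermMatrix applies its own tokenisation rules, which this sketch does not replicate.

```r
# Three toy "documents", already lowercased and stripped of punctuation.
toyDocs <- c(
  blogs   = "love good day good",
  news    = "said year day",
  twitter = "love love thank day"
)

tokens <- strsplit(toyDocs, " ", fixed = TRUE)
terms <- sort(unique(unlist(tokens)))

# Count every term in every document; the result is terms x documents.
counts <- vapply(
  tokens,
  function(tok) as.integer(table(factor(tok, levels = terms))),
  integer(length(terms))
)
rownames(counts) <- terms

toyDtm <- t(counts)                                   # documents x terms
toyFreq <- sort(colSums(toyDtm), decreasing = TRUE)   # overall frequencies

print(toyDtm)
print(toyFreq)
```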
This will identify all terms that appear frequently (in this case, 1000 or more times).
findFreqTerms(dtm, lowfreq=1000)
## [1] "also" "back" "can" "come" "day" "dont" "even"
## [8] "first" "follow" "game" "get" "good" "got" "great"
## [15] "just" "know" "last" "like" "look" "love" "make"
## [22] "much" "need" "new" "now" "one" "peopl" "play"
## [29] "realli" "right" "said" "say" "see" "start" "take"
## [36] "thank" "thing" "think" "time" "today" "two" "use"
## [43] "want" "way" "week" "well" "will" "work" "year"
library(ggplot2)
wf <- data.frame(word = names(freq), freq = freq)
g <- ggplot(subset(wf, freq > 1000), aes(x = reorder(word, -freq), y = freq)) +
  geom_bar(stat = "identity") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
g <- g + ggtitle("Words that appear more than 1000 times")
g
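findAssocs lists, for each given term, the terms whose counts correlate with it across the documents at or above the limit corlimit. If two words always appear together, their correlation is 1.0.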
findAssocs(dtm, c("hate" , "love"), corlimit=0.99)
## $hate
## 10am 10pm 1st 29th 3rd 630pm
## 1.00 1.00 1.00 1.00 1.00 1.00
## 7am 7pm 8pm 9am acronym advic
## 1.00 1.00 1.00 1.00 1.00 1.00
## alot alright alrighti ant anyon anytim
## 1.00 1.00 1.00 1.00 1.00 1.00
## asian ass asshol atleast austin avatar
## 1.00 1.00 1.00 1.00 1.00 1.00
## awesom awkward badass bagel bam betti
## 1.00 1.00 1.00 1.00 1.00 1.00
## bff birthday bitch blackberri blackout booti
## 1.00 1.00 1.00 1.00 1.00 1.00
## brewer broken browser btw buddi bye
## 1.00 1.00 1.00 1.00 1.00 1.00
## cam catalog check cheeseburg chill class
## 1.00 1.00 1.00 1.00 1.00 1.00
## coachella cobra coke com comfi comp
## 1.00 1.00 1.00 1.00 1.00 1.00
## congratul cont convo coolest coupon crawfish
## 1.00 1.00 1.00 1.00 1.00 1.00
## damn dang delet dell demo dept
## 1.00 1.00 1.00 1.00 1.00 1.00
## deserv dick dirt doggi dope douch
## 1.00 1.00 1.00 1.00 1.00 1.00
## drunk dumb dunno dvr eachoth err
## 1.00 1.00 1.00 1.00 1.00 1.00
## est fab fake fav fave flop
## 1.00 1.00 1.00 1.00 1.00 1.00
## fuck fucker fyi gah ghetto girl
## 1.00 1.00 1.00 1.00 1.00 1.00
## givin glad great greatest gum haircut
## 1.00 1.00 1.00 1.00 1.00 1.00
## hangout happi havent headphon heheh hmm
## 1.00 1.00 1.00 1.00 1.00 1.00
## hoo housew hrs huh hype ili
## 1.00 1.00 1.00 1.00 1.00 1.00
## ill info instagram intro itll jag
## 1.00 1.00 1.00 1.00 1.00 1.00
## jealous join karaok kardashian kickstart kiddi
## 1.00 1.00 1.00 1.00 1.00 1.00
## kobe lame latt learner linkedin loll
## 1.00 1.00 1.00 1.00 1.00 1.00
## lovin luck lunch mad man mca
## 1.00 1.00 1.00 1.00 1.00 1.00
## meh meow messi mil min mint
## 1.00 1.00 1.00 1.00 1.00 1.00
## mobil morn moron msg nah nephew
## 1.00 1.00 1.00 1.00 1.00 1.00
## nevermind newest niall nippl nope nothin
## 1.00 1.00 1.00 1.00 1.00 1.00
## nyc olli omfg oprah overr password
## 1.00 1.00 1.00 1.00 1.00 1.00
## peep pic poop porn preview profil
## 1.00 1.00 1.00 1.00 1.00 1.00
## prop proud pussi rad readi reunion
## 1.00 1.00 1.00 1.00 1.00 1.00
## rip semest semi server sexi shirt
## 1.00 1.00 1.00 1.00 1.00 1.00
## shit shotgun shout sht skinni slut
## 1.00 1.00 1.00 1.00 1.00 1.00
## snicker sniff snow someday sorri special
## 1.00 1.00 1.00 1.00 1.00 1.00
## spoil spotifi starbuck stink stupid sub
## 1.00 1.00 1.00 1.00 1.00 1.00
## submiss sugarfre super swarm swear sweetest
## 1.00 1.00 1.00 1.00 1.00 1.00
## sxsw tat teas tech tellin text
## 1.00 1.00 1.00 1.00 1.00 1.00
## thank thankyou thru timelin tire today
## 1.00 1.00 1.00 1.00 1.00 1.00
## tomorrow trivia tune tweeter twinkl ugh
## 1.00 1.00 1.00 1.00 1.00 1.00
## umm unti useless valentin vega via
## 1.00 1.00 1.00 1.00 1.00 1.00
## violet vip wack waffl wait watchin
## 1.00 1.00 1.00 1.00 1.00 1.00
## weed weekend weirdest wendi whoop wiggl
## 1.00 1.00 1.00 1.00 1.00 1.00
## wit woohoo wordpress workshop wow wtf
## 1.00 1.00 1.00 1.00 1.00 1.00
## wth xbox xoxo yay yeah yer
## 1.00 1.00 1.00 1.00 1.00 1.00
## yes yoga youd youll yup 4th
## 1.00 1.00 1.00 1.00 1.00 0.99
## 9pm app aunt aww awww bad
## 0.99 0.99 0.99 0.99 0.99 0.99
## best bio blow bore bout bus
## 0.99 0.99 0.99 0.99 0.99 0.99
## butt calendar chick coffe concert cousin
## 0.99 0.99 0.99 0.99 0.99 0.99
## crazi cuz dat dem dont excit
## 0.99 0.99 0.99 0.99 0.99 0.99
## fest follow fool fuckin goin gonna
## 0.99 0.99 0.99 0.99 0.99 0.99
## good goodnight got gotta haha hashtag
## 0.99 0.99 0.99 0.99 0.99 0.99
## hell hello hoe hope itun killer
## 0.99 0.99 0.99 0.99 0.99 0.99
## momma music next pizza poker pre
## 0.99 0.99 0.99 0.99 0.99 0.99
## rage rain remix sad session shower
## 0.99 0.99 0.99 0.99 0.99 0.99
## skype sleep soo soon stoke stop
## 0.99 0.99 0.99 0.99 0.99 0.99
## stuck studio suck sum superbowl sweeti
## 0.99 0.99 0.99 0.99 0.99 0.99
## tho til trend tumblr wanna wat
## 0.99 0.99 0.99 0.99 0.99 0.99
## watch weather welcom whoa wish yall
## 0.99 0.99 0.99 0.99 0.99 0.99
## yep yum
## 0.99 0.99
##
## $love
## 2nd 30th 5am 5pm 5th
## 1.00 1.00 1.00 1.00 1.00
## acl admin advil aliv alpha
## 1.00 1.00 1.00 1.00 1.00
## alreadi amaz appreci aquarius asleep
## 1.00 1.00 1.00 1.00 1.00
## assoc aunt autograph awe bad
## 1.00 1.00 1.00 1.00 1.00
## baddest beckett bee beliv beta
## 1.00 1.00 1.00 1.00 1.00
## better bipolar blanket blizzard blow
## 1.00 1.00 1.00 1.00 1.00
## booth bore boy boyfriend breezi
## 1.00 1.00 1.00 1.00 1.00
## broom browni buff bulli bullshit
## 1.00 1.00 1.00 1.00 1.00
## butt cat cereal chick choreographi
## 1.00 1.00 1.00 1.00 1.00
## clown come cool couch cousin
## 1.00 1.00 1.00 1.00 1.00
## crazi creepi cuff cum day
## 1.00 1.00 1.00 1.00 1.00
## dentist derrick desk detox directori
## 1.00 1.00 1.00 1.00 1.00
## dirti discount dishwash djs dubstep
## 1.00 1.00 1.00 1.00 1.00
## email epic everyday excus extinct
## 1.00 1.00 1.00 1.00 1.00
## fat feud final foo forgot
## 1.00 1.00 1.00 1.00 1.00
## freak fri fuk fun gawd
## 1.00 1.00 1.00 1.00 1.00
## geek get getaway gif ginger
## 1.00 1.00 1.00 1.00 1.00
## gmail googl grandma grit haiti
## 1.00 1.00 1.00 1.00 1.00
## hammock handsom hardcor hashtag hat
## 1.00 1.00 1.00 1.00 1.00
## hear hehe hell hope hug
## 1.00 1.00 1.00 1.00 1.00
## hungri hurri imo indi insan
## 1.00 1.00 1.00 1.00 1.00
## instrument jerk jess jessica joel
## 1.00 1.00 1.00 1.00 1.00
## just killer kim kinda lam
## 1.00 1.00 1.00 1.00 1.00
## lawn let lightbulb listen mac
## 1.00 1.00 1.00 1.00 1.00
## mag max melissa merch mom
## 1.00 1.00 1.00 1.00 1.00
## mood nap nathan newcastl nice
## 1.00 1.00 1.00 1.00 1.00
## nuff num okay ooh pajama
## 1.00 1.00 1.00 1.00 1.00
## parti pathet pbr pet phenomen
## 1.00 1.00 1.00 1.00 1.00
## platform pleas popcorn powerpoint preach
## 1.00 1.00 1.00 1.00 1.00
## presal priceless prof prolli rid
## 1.00 1.00 1.00 1.00 1.00
## roadtrip sad salsa screw scum
## 1.00 1.00 1.00 1.00 1.00
## seinfeld send setup sex shes
## 1.00 1.00 1.00 1.00 1.00
## shitti shower sid sing sittin
## 1.00 1.00 1.00 1.00 1.00
## slang sleep sleepi sleev sneez
## 1.00 1.00 1.00 1.00 1.00
## soon spec spin sticker stop
## 1.00 1.00 1.00 1.00 1.00
## storm strive stuck studio subconsci
## 1.00 1.00 1.00 1.00 1.00
## sweatpant sweeti switch tasti tea
## 1.00 1.00 1.00 1.00 1.00
## temp textil thang thanksgiv toddler
## 1.00 1.00 1.00 1.00 1.00
## troll tube unicorn updat url
## 1.00 1.00 1.00 1.00 1.00
## usa vacat video virtu wear
## 1.00 1.00 1.00 1.00 1.00
## weird whew whoa whoever wig
## 1.00 1.00 1.00 1.00 1.00
## wii wikipedia wink wish woke
## 1.00 1.00 1.00 1.00 1.00
## woo woot wwii xmas yang
## 1.00 1.00 1.00 1.00 1.00
## yep yike 100th 219 247
## 1.00 1.00 0.99 0.99 0.99
## 3pm 3rd 4th 500k abbrevi
## 0.99 0.99 0.99 0.99 0.99
## advic ahem alert alley altar
## 0.99 0.99 0.99 0.99 0.99
## ambit annoy apprentic artifact aubrey
## 0.99 0.99 0.99 0.99 0.99
## audio baffl balloon bandwagon bandwidth
## 0.99 0.99 0.99 0.99 0.99
## bark beatl beet belmont berni
## 0.99 0.99 0.99 0.99 0.99
## biggi bloat blown blurri bot
## 0.99 0.99 0.99 0.99 0.99
## bourbon boweri brainstorm breakfast brim
## 0.99 0.99 0.99 0.99 0.99
## buck bunni calendar carniv chattanooga
## 0.99 0.99 0.99 0.99 0.99
## childish citat class claustrophob clover
## 0.99 0.99 0.99 0.99 0.99
## coffe combo conan condom cone
## 0.99 0.99 0.99 0.99 0.99
## congratul consensus convoy coven crap
## 0.99 0.99 0.99 0.99 0.99
## crock crossword cst cuddl curios
## 0.99 0.99 0.99 0.99 0.99
## currant daffodil dammit delet desktop
## 0.99 0.99 0.99 0.99 0.99
## dew dickinson discriminatori dos downstair
## 0.99 0.99 0.99 0.99 0.99
## drink dylan eff elbow eliot
## 0.99 0.99 0.99 0.99 0.99
## ell ernest erot everyon everytim
## 0.99 0.99 0.99 0.99 0.99
## ewhc excit eyebrow family” femin
## 0.99 0.99 0.99 0.99 0.99
## fetish fig fing flippin folk
## 0.99 0.99 0.99 0.99 0.99
## fool foxi franki fricken friendship
## 0.99 0.99 0.99 0.99 0.99
## fungus funni futon gasp germ
## 0.99 0.99 0.99 0.99 0.99
## gettogeth gideon girl gis glad
## 0.99 0.99 0.99 0.99 0.99
## gloat good gossip got grandad
## 0.99 0.99 0.99 0.99 0.99
## greatest groovi grouchi grr guess
## 0.99 0.99 0.99 0.99 0.99
## guis hai handlebar hannah hass
## 0.99 0.99 0.99 0.99 0.99
## heath heh hello hideous hijack
## 0.99 0.99 0.99 0.99 0.99
## hool hoot html hurt hustler
## 0.99 0.99 0.99 0.99 0.99
## imac incess instantan insult interfac
## 0.99 0.99 0.99 0.99 0.99
## iota isa itun jalapeno jameson
## 0.99 0.99 0.99 0.99 0.99
## jeez jimi jinx jonni junk
## 0.99 0.99 0.99 0.99 0.99
## karma kinder koolaid kyli lamar
## 0.99 0.99 0.99 0.99 0.99
## lame lauderdal leap lesli lgbt
## 0.99 0.99 0.99 0.99 0.99
## lib llama loren lps mad
## 0.99 0.99 0.99 0.99 0.99
## mariah marx matrix meant mediev
## 0.99 0.99 0.99 0.99 0.99
## micah mifflin migrain mindblow minus
## 0.99 0.99 0.99 0.99 0.99
## momma moos mow mullah multi
## 0.99 0.99 0.99 0.99 0.99
## multitud music nbd next nickelback
## 0.99 0.99 0.99 0.99 0.99
## ninja nonissu nono noob nostalgia
## 0.99 0.99 0.99 0.99 0.99
## off offlin oti outag paintbal
## 0.99 0.99 0.99 0.99 0.99
## paraffin paula pcs pecan pimp
## 0.99 0.99 0.99 0.99 0.99
## plato pokemon poker pollen poolsid
## 0.99 0.99 0.99 0.99 0.99
## pre preorder preset preteen prettiest
## 0.99 0.99 0.99 0.99 0.99
## propag proprietari proverb prowl pun
## 0.99 0.99 0.99 0.99 0.99
## pup quantit radiant rage rain
## 0.99 0.99 0.99 0.99 0.99
## raindrop regret relearn remix repent
## 0.99 0.99 0.99 0.99 0.99
## reveri rhapsodi rihanna roadblock roddi
## 0.99 0.99 0.99 0.99 0.99
## room” roomi rss salli sarasota
## 0.99 0.99 0.99 0.99 0.99
## saw sed selfless seper serious
## 0.99 0.99 0.99 0.99 0.99
## sfo shag sherlock shin shorti
## 0.99 0.99 0.99 0.99 0.99
## shuffleboard sick sickest snoop snuggl
## 0.99 0.99 0.99 0.99 0.99
## someon song spank stairway stalk
## 0.99 0.99 0.99 0.99 0.99
## stapler starbuck static stella storylin
## 0.99 0.99 0.99 0.99 0.99
## subscript sum summertim swagger swear
## 0.99 0.99 0.99 0.99 0.99
## sweater swedish synergi teriyaki thespian
## 0.99 0.99 0.99 0.99 0.99
## tivo trait trampolin trend true
## 0.99 0.99 0.99 0.99 0.99
## ugli upstair upto uruguay uso
## 0.99 0.99 0.99 0.99 0.99
## vanguard verb voiceov wag wallow
## 0.99 0.99 0.99 0.99 0.99
## wan watch whatd whimsic whittl
## 0.99 0.99 0.99 0.99 0.99
## wither wring xoxoxo yeh yesy
## 0.99 0.99 0.99 0.99 0.99
## yup zine zipper
## 0.99 0.99 0.99
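The very long lists above are largely a consequence of the corpus having only three documents: each term is described by just three counts, and a correlation computed from three points is very easily close to 1. The sketch below illustrates this with made-up counts, on the assumption that the association score is the ordinary Pearson correlation between term columns. The matrix toyCounts and its values are invented for illustration.

```r
# Toy counts for three documents (rows) and three stemmed terms (columns).
toyCounts <- cbind(
  hate = c(1, 2, 10),
  wow  = c(2, 4, 20),   # exactly twice "hate" in every document
  yoga = c(3, 3, 9)     # not proportional to "hate"
)
rownames(toyCounts) <- c("blogs", "news", "twitter")

# Pearson correlation of "hate" with the other two terms.
assoc <- cor(toyCounts)["hate", c("wow", "yoga")]
print(round(assoc, 2))
```

Both terms pass a 0.99 threshold here simply because all three are most frequent in the same document. Treating each line of the sample as its own document would give many more observations per term and should make the associations far more selective.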
A word cloud of the 100 most frequently occurring words:
library(wordcloud)
library(RColorBrewer)
dark2 <- brewer.pal(6, "Dark2")
wordcloud(names(freq), freq, max.words=100, rot.per=0.2, colors=dark2)
Removing sparse terms, that is, the uninteresting or infrequent words that are absent from most documents:
dtmss <- removeSparseTerms(dtm, 0.15) # This makes a matrix that is only 15% empty space, maximum.
dtmss
## <<DocumentTermMatrix (documents: 3, terms: 7057)>>
## Non-/sparse entries: 21171/0
## Sparsity : 0%
## Maximal term length: 15
## Weighting : term frequency (tf)
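With only three documents, a term's sparsity (the share of documents in which it does not occur) can only be 0, 1/3 or 2/3, so a threshold of 0.15 keeps exactly the terms that occur in all three documents. This is consistent with the output above, where all 7057 x 3 = 21171 entries are non-sparse. The sketch below reproduces the idea on a toy matrix in base R. The matrix and names are assumptions, and the filter is a simplified stand-in for removeSparseTerms rather than its actual implementation.

```r
# Toy document-term matrix: 3 documents x 3 terms.
toyDtm <- matrix(
  c(2, 0, 1,
    1, 1, 3,
    0, 0, 4),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("blogs", "news", "twitter"), c("love", "wow", "day"))
)

# Sparsity of a term = share of documents with a zero count.
sparsity <- colMeans(toyDtm == 0)

# Keep only terms whose sparsity is below the threshold.
kept <- toyDtm[, sparsity < 0.15, drop = FALSE]

print(round(sparsity, 2))
print(colnames(kept))
```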
d <- dist(t(dtmss), method="euclidean")
fit <- hclust(d=d, method="complete") # for a different look try substituting: method="ward.D"
plot(fit, hang=-1)
groups <- cutree(fit, k=6) # "k=" defines the number of clusters you are using
rect.hclust(fit, k=6, border="red") # draw dendrogram with red borders around the 6 clusters
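The clustering steps can be followed more easily on a handful of terms. The sketch below applies the same dist, hclust and cutree calls to a small made-up matrix with one row per term and one column per document, which is the layout of t(dtmss). The counts are invented for illustration. Because Euclidean distance is computed on raw counts, terms tend to be grouped mainly by how frequent they are overall.

```r
# Toy term counts: rows are terms, columns are documents.
termCounts <- rbind(
  will = c(1100, 1000, 1166),
  like = c(1000, 1020, 1040),
  xbox = c(2, 1, 9),
  yoga = c(3, 2, 5)
)
colnames(termCounts) <- c("blogs", "news", "twitter")

d <- dist(termCounts, method = "euclidean")   # pairwise distances between terms
fit <- hclust(d, method = "complete")         # agglomerative clustering
groups <- cutree(fit, k = 2)                  # cut the tree into 2 clusters

print(groups)
# plot(fit, hang = -1) would draw the corresponding dendrogram.
```

Here the two frequent terms land in one cluster and the two rare terms in the other. Scaling the counts before computing distances would be one way to cluster on usage pattern rather than on overall frequency.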
Using the tm and wordcloud packages, I can perform basic preprocessing steps such as stemming, removing stopwords and stripping whitespace. I have also learned a new type of chart, the word cloud, which is very useful in text mining. In addition, tm can generate a document-term matrix, a transformation that makes it easy to observe the frequency of words.