knitr::opts_chunk$set(cache=TRUE, warning=FALSE, message=FALSE, fig.width=6)
The goal of this project is to show that I have become familiar with the data and that I am on track to create the prediction algorithm. This report, published on RPubs (http://rpubs.com/), explains my exploratory analysis and my goals for the eventual app and algorithm. It is meant to be concise: it highlights only the major features of the data identified so far and briefly summarizes the plan for creating the prediction algorithm and Shiny app in a way that is understandable to a non-data-scientist manager, using tables and plots to illustrate important summaries of the data set.
Text mining refers to the process of parsing a selection or corpus of text in order to identify certain aspects, such as the most frequently occurring word or phrase.
Hence, let’s download the data from the website and load them into R.
temp <- tempfile()
download.file("https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip", temp)
# List the files contained in the zip archive without extracting them
zipped.files <- unzip(temp, list = TRUE)
zipped.files
## Name Length Date
## 1 final/ 0 2014-07-22 10:10:00
## 2 final/de_DE/ 0 2014-07-22 10:10:00
## 3 final/de_DE/de_DE.twitter.txt 75578341 2014-07-22 10:11:00
## 4 final/de_DE/de_DE.blogs.txt 85459666 2014-07-22 10:11:00
## 5 final/de_DE/de_DE.news.txt 95591959 2014-07-22 10:11:00
## 6 final/ru_RU/ 0 2014-07-22 10:10:00
## 7 final/ru_RU/ru_RU.blogs.txt 116855835 2014-07-22 10:12:00
## 8 final/ru_RU/ru_RU.news.txt 118996424 2014-07-22 10:12:00
## 9 final/ru_RU/ru_RU.twitter.txt 105182346 2014-07-22 10:12:00
## 10 final/en_US/ 0 2014-07-22 10:10:00
## 11 final/en_US/en_US.twitter.txt 167105338 2014-07-22 10:12:00
## 12 final/en_US/en_US.news.txt 205811889 2014-07-22 10:13:00
## 13 final/en_US/en_US.blogs.txt 210160014 2014-07-22 10:13:00
## 14 final/fi_FI/ 0 2014-07-22 10:10:00
## 15 final/fi_FI/fi_FI.news.txt 94234350 2014-07-22 10:11:00
## 16 final/fi_FI/fi_FI.blogs.txt 108503595 2014-07-22 10:12:00
## 17 final/fi_FI/fi_FI.twitter.txt 25331142 2014-07-22 10:10:00
# Extract only the English (en_US) files
unzip(temp, files = zipped.files$Name[11:13], exdir = "temp")
# Helper to move a file from "from" to "to", creating the target directory if needed
my.file.rename <- function(from, to) {
  todir <- dirname(to)
  if (!isTRUE(file.info(todir)$isdir)) dir.create(todir, recursive = TRUE)
  file.rename(from = from, to = to)
}
Now it is time to load the R packages for text mining and read the texts into R. Since the full data sets are too big for my PC, I will work with a random sample of about 5% of the lines from each file.
library(tm)         # text mining framework
library(knitr)      # report tables
library(LaF)        # fast access to large text files (line counting and sampling)
library(SnowballC)  # stemming
library(RWeka)      # n-gram tokenizers
library(wordcloud)  # word clouds
library(lattice)    # bar charts
# Count the lines in each file, then write a 5% random sample of each to disk
lblog <- determine_nlines("./temp/final/en_US/en_US.blogs.txt")
ltwit <- determine_nlines("./temp/final/en_US/en_US.twitter.txt")
lnews <- determine_nlines("./temp/final/en_US/en_US.news.txt")
set.seed(123)
writeLines(sample_lines("./temp/final/en_US/en_US.blogs.txt", round(lblog * 0.05), lblog), "blogs2.txt")
writeLines(sample_lines("./temp/final/en_US/en_US.twitter.txt", round(ltwit * 0.05), ltwit), "twitter2.txt")
writeLines(sample_lines("./temp/final/en_US/en_US.news.txt", round(lnews * 0.05), lnews), "news2.txt")
my.file.rename("blogs2.txt", "./text2/blogs2.txt")
## [1] TRUE
my.file.rename("twitter2.txt", "./text2/twitter2.txt")
## [1] TRUE
my.file.rename("news2.txt", "./text2/news2.txt")
## [1] TRUE
# Remove the full extracted files to free disk space
unlink("temp", recursive = TRUE)
docs <- Corpus(DirSource("./text2"))  # build a corpus from the three sampled files
summary(docs)
## Length Class Mode
## blogs2.txt 2 PlainTextDocument list
## news2.txt 2 PlainTextDocument list
## twitter2.txt 2 PlainTextDocument list
inspect(docs)
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 3
##
## [[1]]
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 9301157
##
## [[2]]
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 10259198
##
## [[3]]
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 8121908
Once the data are loaded properly, let’s begin to preprocess the texts. We are going to remove capitalization, common words, and punctuation, and otherwise prepare the texts for analysis. This can be somewhat time consuming and fiddly, but it pays off in the end in terms of higher-quality analyses. We will start with 1-grams only, pre-processing the text before setting up a Term-Document Matrix (TDM) for our word clouds.
We’ll use tm_map to transform everything to lower case so that differently cased forms of the same term are not missed. We’ll also remove common junk words (“stopwords”) and custom-defined words of no interest, strip extra spaces, and stem words so that recurrences of the same root word are matched.
docs <- tm_map(docs, content_transformer(tolower))                          # lower-case everything
docs <- tm_map(docs, content_transformer(removeWords), "year")              # drop a custom word of no interest
docs <- tm_map(docs, content_transformer(removeWords), stopwords("SMART"))  # drop common stopwords
docs <- tm_map(docs, stripWhitespace)                                       # collapse extra spaces
docs <- tm_map(docs, removePunctuation)                                     # drop punctuation
docs <- tm_map(docs, stemDocument)                                          # reduce words to their stems
docs <- tm_map(docs, PlainTextDocument)
tdm <- TermDocumentMatrix(docs)   # terms in rows, documents in columns
dtm <- DocumentTermMatrix(docs)   # documents in rows, terms in columns
Some words are more frequent than others - what are the distributions of word frequencies? We will answer this question by removing the sparsest terms from the term-document matrix (TDM) and document-term matrix (DTM) created above, and then summarizing the term counts.
tdm2 <- removeSparseTerms(tdm, 0.85)
tdm2
## <<TermDocumentMatrix (terms: 120083, documents: 3)>>
## Non-/sparse entries: 167198/193051
## Sparsity : 54%
## Maximal term length: 267
## Weighting : term frequency (tf)
dtm2 <- removeSparseTerms(dtm, 0.85)
dtm2
## <<DocumentTermMatrix (documents: 3, terms: 120083)>>
## Non-/sparse entries: 167198/193051
## Sparsity : 54%
## Maximal term length: 267
## Weighting : term frequency (tf)
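To get a rough feel for how skewed the word frequencies are, we can summarize the total count of each term in the sparse-filtered document-term matrix. The sketch below only illustrates the idea; the actual quantiles depend on the sample.
# Summarize the distribution of term counts (totals across the three documents)
term.freqs <- colSums(as.matrix(dtm2))
summary(term.freqs)
quantile(term.freqs, probs = c(0.50, 0.90, 0.99, 1))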
Let’s create a word cloud of our one-grams to visualize the 50 most frequent words.
notsparse <- tdm2
m <- as.matrix(notsparse)
v <- sort(rowSums(m), decreasing = TRUE)
d <- data.frame(word = names(v), freq = v)
# Create the word cloud
pal <- brewer.pal(9, "BuPu")
wordcloud(words = d$word,
          freq = d$freq,
          scale = c(3, .8),
          random.order = FALSE,
          colors = pal,
          max.words = 50)
And here we have the frequency table for the most frequent one-grams. To produce it, we sum each term’s counts across the documents in the DTM and sort the totals in decreasing order.
freq <- sort(colSums(as.matrix(dtm2)), decreasing = TRUE)
freqDf <- data.frame(word=names(freq), freq=freq)
head(freqDf)
## word freq
## time time 12459
## day day 10915
## make make 9939
## love love 9417
## good good 9017
## work work 8529
barchart(word ~ freq, head(freqDf, 25))
Let’s tokenize our data! This time we are interested in bigrams (two-word sequences).
diTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
diTdm <- TermDocumentMatrix(docs, control = list(tokenize = diTokenizer))
diDtm <- DocumentTermMatrix(docs, control = list(tokenize = diTokenizer))
diTdm2 <- removeSparseTerms(diTdm, 0.90)
diTdm2
## <<TermDocumentMatrix (terms: 1466596, documents: 3)>>
## Non-/sparse entries: 1609224/2790564
## Sparsity : 63%
## Maximal term length: 373
## Weighting : term frequency (tf)
diDtm2 <- removeSparseTerms(diDtm, 0.90)
diDtm2
## <<DocumentTermMatrix (documents: 3, terms: 1466596)>>
## Non-/sparse entries: 1609224/2790564
## Sparsity : 63%
## Maximal term length: 373
## Weighting : term frequency (tf)
Let’s create a word cloud to visualize the bigrams, because bigrams love word clouds too!
m2 <- as.matrix(diTdm2)
v2 <- sort(rowSums(m2), decreasing = TRUE)
d2 <- data.frame(word = names(v2), freq = v2)
head(d2)
## word freq
## high school high school 796
## year ago year ago 622
## st loui st loui 469
## happi birthday happi birthday 437
## good morn good morn 398
## unit state unit state 369
# Create the word cloud
pal <- brewer.pal(9, "BuPu")
wordcloud(words = d2$word,
          freq = d2$freq,
          scale = c(2, .7),
          random.order = FALSE,
          colors = pal,
          max.words = 50)
Then, the bar chart of the most frequent bigrams.
freq2 <- sort(colSums(as.matrix(diDtm2)), decreasing = TRUE)
freqDf2 <- data.frame(word=names(freq2), freq=freq2)
head(freqDf2)
## word freq
## high school high school 796
## year ago year ago 622
## st loui st loui 469
## happi birthday happi birthday 437
## good morn good morn 398
## unit state unit state 369
barchart(word ~ freq, head(freqDf2, 25))
Now, let’s repeat the operation for trigrams (three-word sequences).
triTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
triTdm <- TermDocumentMatrix(docs, control = list(tokenize = triTokenizer))
triDtm <- DocumentTermMatrix(docs, control = list(tokenize = triTokenizer))
triTdm2 <- removeSparseTerms(triTdm, 0.90)
triTdm2
## <<TermDocumentMatrix (terms: 1790689, documents: 3)>>
## Non-/sparse entries: 1797698/3574369
## Sparsity : 67%
## Maximal term length: 556
## Weighting : term frequency (tf)
triDtm2 <- removeSparseTerms(triDtm, 0.90)
triDtm2
## <<DocumentTermMatrix (documents: 3, terms: 1790689)>>
## Non-/sparse entries: 1797698/3574369
## Sparsity : 67%
## Maximal term length: 556
## Weighting : term frequency (tf)
Let’s create a word cloud to visualize the trigrams as well.
m3 <- as.matrix(triTdm2)
v3 <- sort(rowSums(m3), decreasing = TRUE)
d3 <- data.frame(word = names(v3), freq = v3)
head(d3)
## word freq
## happi mother day happi mother day 180
## presid barack obama presid barack obama 78
## love love love love love love 61
## cinco de mayo cinco de mayo 57
## st loui counti st loui counti 55
## world war ii world war ii 51
# Create the word cloud
pal <- brewer.pal(9, "BuPu")
wordcloud(words = d3$word,
          freq = d3$freq,
          scale = c(2, .7),
          random.order = FALSE,
          colors = pal,
          max.words = 50)
Then, the bar chart of the most frequent trigrams.
freq3 <- sort(colSums(as.matrix(triDtm2)), decreasing = TRUE)
freqDf3 <- data.frame(word=names(freq3), freq=freq3)
barchart(word ~ freq, head(freqDf3, 25))
These tasks are quite computationally demanding for my PC, which is why I sampled the data, keeping 5% of each text. In this sample, the most frequent terms appear to revolve around time and notable recurring moments (days, years, birthdays, holidays).
Next, I will develop the prediction algorithm. I do not yet have a firm preference for a specific approach, but it will be built on the sampled data; unfortunately, cloud computing is not an option.
Most likely, the algorithm will look up progressively shorter n-grams (a back-off approach) until a match is found.
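As a rough illustration of that back-off idea, here is a minimal sketch built on the frequency tables above (freqDf, freqDf2, freqDf3). It is only a toy: the tables contain stemmed, stopword-free n-grams, so the input phrase has to be in the same pre-processed form, and a real implementation would use cleaner, smoothed counts.
# Toy back-off predictor: try trigrams, then bigrams, then the most frequent unigram.
# Assumes freqDf, freqDf2 and freqDf3 (already sorted by decreasing frequency) exist.
predict.next <- function(phrase) {
  words <- unlist(strsplit(tolower(phrase), "\\s+"))
  n <- length(words)
  # Trigram step: match the last two words against the first two words of each trigram
  if (n >= 2) {
    hits <- freqDf3[grepl(paste0("^", words[n - 1], " ", words[n], " "), freqDf3$word), ]
    if (nrow(hits) > 0) return(sub(".* ", "", as.character(hits$word[1])))
  }
  # Bigram step: match the last word against the first word of each bigram
  if (n >= 1) {
    hits <- freqDf2[grepl(paste0("^", words[n], " "), freqDf2$word), ]
    if (nrow(hits) > 0) return(sub(".* ", "", as.character(hits$word[1])))
  }
  # Fallback: the single most frequent unigram
  as.character(freqDf$word[1])
}
predict.next("high")   # with the counts above this should return "school"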
The Shiny app will be kept simple. The user interface will consist of a text input box that allows the user to enter a phrase; the app will then use the algorithm to suggest the most likely next word.
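As a rough sketch of what that could look like, assuming a prediction function such as the predict.next() outlined above, the app could be as simple as:
# Minimal Shiny sketch: a text box for the phrase and a text output for the suggested word
library(shiny)
ui <- fluidPage(
  titlePanel("Next-word prediction"),
  textInput("phrase", "Enter a phrase:"),
  textOutput("prediction")
)
server <- function(input, output) {
  output$prediction <- renderText({
    if (nchar(input$phrase) == 0) return("")
    predict.next(input$phrase)
  })
}
shinyApp(ui = ui, server = server)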