knitr::opts_chunk$set(cache=TRUE, warning=FALSE, message=FALSE, fig.width=6)

Introduction

The goal of this project is just to display that you’ve gotten used to working with the data and that you are on track to create your prediction algorithm. Please submit a report on R Pubs http://rpubs.com/ that explains your exploratory analysis and your goals for the eventual app and algorithm. This document should be concise and explain only the major features of the data you have identified and briefly summarize your plans for creating the prediction algorithm and Shiny app in a way that would be understandable to a non-data scientist manager. You should make use of tables and plots to illustrate important summaries of the data set. The motivation for this project is to:

  1. Demonstrate that you’ve downloaded the data and have successfully loaded it in;
  2. Create a basic report of summary statistics about the data sets;
  3. Report any interesting findings that you amassed so far;
  4. Get feedback on your plans for creating a prediction algorithm and Shiny app.

Index

  1. Loading data
  2. Preprocessing
  3. Exploratory analysis
  4. Findings and next steps

Text mining refers to the process of parsing a selection or corpus of text in order to identify certain aspects, such as the most frequently occurring word or phrase.
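
As a minimal illustration (on a made-up sentence, not on the capstone data), counting word frequencies in base R looks like this:

# Toy example: most frequent words in a short, made-up text
txt <- "the cat sat on the mat and the dog sat too"
words <- unlist(strsplit(tolower(txt), "\\s+"))  # split on whitespace
sort(table(words), decreasing = TRUE)[1:3]       # top three words by count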

1. Loading data

Let’s download the data from the website and load them into R.

temp <- tempfile(fileext = ".zip")
download.file("https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip", temp)
# List the files contained in the zip archive without extracting them
zipContents <- unzip(temp, list = TRUE)
zipContents
##                             Name    Length                Date
## 1                         final/         0 2014-07-22 10:10:00
## 2                   final/de_DE/         0 2014-07-22 10:10:00
## 3  final/de_DE/de_DE.twitter.txt  75578341 2014-07-22 10:11:00
## 4    final/de_DE/de_DE.blogs.txt  85459666 2014-07-22 10:11:00
## 5     final/de_DE/de_DE.news.txt  95591959 2014-07-22 10:11:00
## 6                   final/ru_RU/         0 2014-07-22 10:10:00
## 7    final/ru_RU/ru_RU.blogs.txt 116855835 2014-07-22 10:12:00
## 8     final/ru_RU/ru_RU.news.txt 118996424 2014-07-22 10:12:00
## 9  final/ru_RU/ru_RU.twitter.txt 105182346 2014-07-22 10:12:00
## 10                  final/en_US/         0 2014-07-22 10:10:00
## 11 final/en_US/en_US.twitter.txt 167105338 2014-07-22 10:12:00
## 12    final/en_US/en_US.news.txt 205811889 2014-07-22 10:13:00
## 13   final/en_US/en_US.blogs.txt 210160014 2014-07-22 10:13:00
## 14                  final/fi_FI/         0 2014-07-22 10:10:00
## 15    final/fi_FI/fi_FI.news.txt  94234350 2014-07-22 10:11:00
## 16   final/fi_FI/fi_FI.blogs.txt 108503595 2014-07-22 10:12:00
## 17 final/fi_FI/fi_FI.twitter.txt  25331142 2014-07-22 10:10:00
# Extract only the English (en_US) files, rows 11 to 13 in the listing above
unzip(temp, files = zipContents$Name[11:13], exdir = "temp")

# Helper function to move a file from "from" to "to", creating the destination directory if it does not exist

my.file.rename <- function(from, to) {
    todir <- dirname(to)
    if (!isTRUE(file.info(todir)$isdir)) dir.create(todir, recursive=TRUE)
    file.rename(from = from,  to = to)
}

Now, it is time to load the R packages for text mining and read the texts into R. Since the full files are too big for my PC, I will sample a small fraction of the lines from each one.

library(tm)          # text mining framework
library(knitr)
library(LaF)         # fast access to large text files (determine_nlines, sample_lines)
library(SnowballC)   # stemming
library(RWeka)       # n-gram tokenizers
library(wordcloud)
library(lattice)

# Count the lines in each file, then sample roughly 5% of them
lblog <- determine_nlines("./temp/final/en_US/en_US.blogs.txt")
ltwit <- determine_nlines("./temp/final/en_US/en_US.twitter.txt")
lnews <- determine_nlines("./temp/final/en_US/en_US.news.txt")

set.seed(123)
writeLines(sample_lines("./temp/final/en_US/en_US.blogs.txt", round(lblog * 0.05), lblog), "blogs2.txt")
writeLines(sample_lines("./temp/final/en_US/en_US.twitter.txt", round(ltwit * 0.05), ltwit), "twitter2.txt")
writeLines(sample_lines("./temp/final/en_US/en_US.news.txt", round(lnews * 0.05), lnews), "news2.txt")

my.file.rename("blogs2.txt", "./text2/blogs2.txt")
## [1] TRUE
my.file.rename("twitter2.txt", "./text2/twitter2.txt")
## [1] TRUE
my.file.rename("news2.txt", "./text2/news2.txt")
## [1] TRUE
# Remove the full extracted files; only the 5% samples are kept
unlink("temp", recursive = TRUE)

docs <- Corpus(DirSource("./text2"))   # one could also pass encoding = "UTF-8" to DirSource

summary(docs)  
##              Length Class             Mode
## blogs2.txt   2      PlainTextDocument list
## news2.txt    2      PlainTextDocument list
## twitter2.txt 2      PlainTextDocument list
inspect(docs)
## <<VCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 3
## 
## [[1]]
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 9301157
## 
## [[2]]
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 10259198
## 
## [[3]]
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 8121908

2. Preprocessing

Once the data are loaded properly, we can begin preprocessing the texts. We are going to remove capitalization, common stopwords, a custom word of no interest, punctuation, and extra whitespace, and stem the words to prepare the texts for analysis. This step can be somewhat time-consuming and finicky, but it pays off in the end in terms of higher-quality analyses. We will start with 1-grams only, pre-processing the text before setting up a Term-Document Matrix (TDM) for our word clouds.

We’ll use tm_map to transform everything to lower case to avoid missed matches between differently cased terms. We’ll also remove common junk words (“stopwords”) and custom-defined words of no interest, strip extra spaces, remove punctuation, and stem the words to better capture recurrences of the same root word.

# Transform everything to lower case
docs <- tm_map(docs, content_transformer(tolower))

# Remove a custom word of no interest
docs <- tm_map(docs, content_transformer(removeWords), "year")

# Remove common English stopwords (SMART list), extra whitespace and punctuation
docs <- tm_map(docs, content_transformer(removeWords), stopwords("SMART"))
docs <- tm_map(docs, stripWhitespace)
docs <- tm_map(docs, removePunctuation)

# Stem the words and keep the documents as plain text
docs <- tm_map(docs, stemDocument)
docs <- tm_map(docs, PlainTextDocument)

# Term-document and document-term matrices for the 1-grams
tdm <- TermDocumentMatrix(docs)
dtm <- DocumentTermMatrix(docs)

3. Exploratory analysis

3.1 Most frequent words

Some words are more frequent than others - what does the distribution of word frequencies look like? To answer this question, we first trim the sparse terms from the term-document matrix and the document-term matrix built above.

tdm2 <- removeSparseTerms(tdm, 0.85)
tdm2
## <<TermDocumentMatrix (terms: 120083, documents: 3)>>
## Non-/sparse entries: 167198/193051
## Sparsity           : 54%
## Maximal term length: 267
## Weighting          : term frequency (tf)
dtm2 <- removeSparseTerms(dtm, 0.85)
dtm2
## <<DocumentTermMatrix (documents: 3, terms: 120083)>>
## Non-/sparse entries: 167198/193051
## Sparsity           : 54%
## Maximal term length: 267
## Weighting          : term frequency (tf)

Let’s create a word cloud of our one-grams to visualize the 50 most frequent words.

notsparse <- tdm2
m <- as.matrix(notsparse)
v <- sort(rowSums(m), decreasing = TRUE)   # total frequency of each term across documents
d <- data.frame(word = names(v), freq = v)

# Create the word cloud
pal <- brewer.pal(9, "BuPu")
wordcloud(words = d$word,
          freq = d$freq,
          scale = c(3, .8),
          random.order = FALSE,
          colors = pal,
          max.words = 50)

Here is the frequency table for the most frequent one-grams, obtained by summing the term counts across the three documents, followed by a bar chart of the top 25.

freq <- sort(colSums(as.matrix(dtm2)), decreasing = TRUE)
freqDf <- data.frame(word=names(freq), freq=freq)   
head(freqDf)
##      word  freq
## time time 12459
## day   day 10915
## make make  9939
## love love  9417
## good good  9017
## work work  8529
barchart(word ~ freq, head(freqDf, 25))
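
As a side note, the bars above appear in alphabetical order of the terms (the default factor ordering); if we prefer them sorted by frequency, the factor levels can be reordered first. A small optional variant:

# Optional: sort the bars by frequency instead of alphabetically
barchart(reorder(word, freq) ~ freq, data = head(freqDf, 25),
         xlab = "Frequency", ylab = "Word")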

3.2 Bigrams and trigrams

Let’s tokenize our data! First, we are interested in bigrams (two-grams).

# Bigram tokenizer based on RWeka
diTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))

diTdm <- TermDocumentMatrix(docs, control = list(tokenize = diTokenizer))
diDtm <- DocumentTermMatrix(docs, control = list(tokenize = diTokenizer))

diTdm2 <- removeSparseTerms(diTdm, 0.90)
diTdm2
## <<TermDocumentMatrix (terms: 1466596, documents: 3)>>
## Non-/sparse entries: 1609224/2790564
## Sparsity           : 63%
## Maximal term length: 373
## Weighting          : term frequency (tf)
diDtm2 <- removeSparseTerms(diDtm, 0.90)
diDtm2
## <<DocumentTermMatrix (documents: 3, terms: 1466596)>>
## Non-/sparse entries: 1609224/2790564
## Sparsity           : 63%
## Maximal term length: 373
## Weighting          : term frequency (tf)

Let’s create a word cloud to visualize the 50 most frequent bigrams - they love word clouds too!

m2 <- as.matrix(diTdm2)
v2 <- sort(rowSums(m2), decreasing = TRUE)
d2 <- data.frame(word = names(v2), freq = v2)
head(d2)
##                          word freq
## high school       high school  796
## year ago             year ago  622
## st loui               st loui  469
## happi birthday happi birthday  437
## good morn           good morn  398
## unit state         unit state  369
# Create the word cloud
pal <- brewer.pal(9, "BuPu")
wordcloud(words = d2$word,
          freq = d2$freq,
          scale = c(2, .7),
          random.order = FALSE,
          colors = pal,
          max.words = 50)

Then, the bar chart of the most frequent bigrams.

freq2 <- sort(colSums(as.matrix(diDtm2)), decreasing = TRUE)
freqDf2 <- data.frame(word=names(freq2), freq=freq2)   
head(freqDf2)
##                          word freq
## high school       high school  796
## year ago             year ago  622
## st loui               st loui  469
## happi birthday happi birthday  437
## good morn           good morn  398
## unit state         unit state  369
barchart(word ~ freq, head(freqDf2, 25))

Now, let’s repeat the operation for the trigrams.

# Trigram tokenizer based on RWeka
triTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))

triTdm <- TermDocumentMatrix(docs, control = list(tokenize = triTokenizer))
triDtm <- DocumentTermMatrix(docs, control = list(tokenize = triTokenizer))

triTdm2 <- removeSparseTerms(triTdm, 0.90)
triTdm2
## <<TermDocumentMatrix (terms: 1790689, documents: 3)>>
## Non-/sparse entries: 1797698/3574369
## Sparsity           : 67%
## Maximal term length: 556
## Weighting          : term frequency (tf)
triDtm2 <- removeSparseTerms(triDtm, 0.90)
triDtm2
## <<DocumentTermMatrix (documents: 3, terms: 1790689)>>
## Non-/sparse entries: 1797698/3574369
## Sparsity           : 67%
## Maximal term length: 556
## Weighting          : term frequency (tf)

And a word cloud of the 50 most frequent trigrams.

m3 <- as.matrix(triTdm2)
v3 <- sort(rowSums(m3), decreasing = TRUE)
d3 <- data.frame(word = names(v3), freq = v3)
head(d3)
##                                    word freq
## happi mother day       happi mother day  180
## presid barack obama presid barack obama   78
## love love love           love love love   61
## cinco de mayo             cinco de mayo   57
## st loui counti           st loui counti   55
## world war ii               world war ii   51
# Create the word cloud
pal <- brewer.pal(9, "BuPu")
wordcloud(words = d3$word,
          freq = d3$freq,
          scale = c(2,.7),
          random.order = FALSE,
          colors = pal,
          max.words = 50)

Then, the bar chart of the most frequent trigrams.

freq3 <- sort(colSums(as.matrix(triDtm2)), decreasing = TRUE)
freqDf3 <- data.frame(word=names(freq3), freq=freq3)   

barchart(word ~ freq, head(freqDf3, 25))

4. Findings and next steps

These tasks are quite computationally demanding for my PC, so I sampled the data, keeping 5% of each text file. In the sample, the concepts expressed apparently revolve around time and recurring meaningful moments (e.g. “year ago”, “happi birthday”, “happi mother day”).

Now, I will have to develop a prediction algorithm. I still have no firm preference for a specific approach, but it will be built on the sampled data, since cloud computing is unfortunately not an option.

Most likely, the algorithm will use a back-off strategy: start from the longest n-grams and fall back to progressively shorter ones until a match is found.
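
A minimal sketch of that back-off idea, reusing the unigram, bigram and trigram frequency tables built above (freqDf, freqDf2, freqDf3). The function name and the simple prefix matching are illustrative only; in the real app the user’s input would also need the same preprocessing (lower-casing, stemming, stopword removal) that was applied to the corpus.

# Illustrative back-off predictor over the frequency tables computed above
predictNext <- function(phrase) {
    tokens <- unlist(strsplit(tolower(phrase), "\\s+"))
    n <- length(tokens)

    # 1. Try trigrams whose first two words match the last two words typed
    if (n >= 2) {
        prefix <- paste(tokens[n - 1], tokens[n])
        hits <- freqDf3[startsWith(as.character(freqDf3$word), paste0(prefix, " ")), ]
        if (nrow(hits) > 0) {
            best <- as.character(hits$word[which.max(hits$freq)])
            return(tail(strsplit(best, " ")[[1]], 1))
        }
    }

    # 2. Back off to bigrams whose first word matches the last word typed
    if (n >= 1) {
        hits <- freqDf2[startsWith(as.character(freqDf2$word), paste0(tokens[n], " ")), ]
        if (nrow(hits) > 0) {
            best <- as.character(hits$word[which.max(hits$freq)])
            return(tail(strsplit(best, " ")[[1]], 1))
        }
    }

    # 3. Final fallback: the single most frequent unigram
    as.character(freqDf$word[1])
}

predictNext("happi mother")   # with this sample it should suggest "day"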

The Shiny app will be really simple.

The user interface will consist of a text input box that will allow a user to enter a phrase. Then the app will use our algorithm to suggest the most likely next word.
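
A minimal sketch of that interface, assuming a prediction function like the predictNext() outlined above is available to the app; the widget names are placeholders.

library(shiny)

# Minimal sketch of the planned app: one text box in, one suggested word out
ui <- fluidPage(
    titlePanel("Next-word prediction"),
    textInput("phrase", "Type a phrase:", value = ""),
    h4("Suggested next word:"),
    textOutput("nextWord")
)

server <- function(input, output) {
    output$nextWord <- renderText({
        if (nchar(trimws(input$phrase)) == 0) return("")
        predictNext(input$phrase)   # the back-off function sketched above
    })
}

shinyApp(ui = ui, server = server)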