This report describes my exploratory analysis of the data and my goals for the app and the prediction algorithm. The data was provided as part of the capstone project in the Data Science Specialization by Johns Hopkins University on Coursera. Analyzing the data is necessary to understand its structure and to plan an effective predictive text model; to do this, the goal is to identify patterns in the data.
Data for this project was obtained from the Coursera data set at https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip; I used the US English texts.
if (!file.exists("Coursera-SwiftKey.zip")){
download.file("https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip",
"Coursera-SwiftKey.zip")
unzip("Coursera-SwiftKey.zip")
}
blogs <- readLines("./final/en_US/en_US.blogs.txt", encoding="UTF-8", skipNul=TRUE)
news <- readLines(file("./final/en_US/en_US.news.txt", blocking=TRUE, open="rb"), encoding="UTF-8", skipNul=TRUE)
twitter <- readLines("./final/en_US/en_US.twitter.txt", encoding="UTF-8", skipNul=TRUE)
For each file used, I obtain basic information: its size, number of lines, number of words, and the length of the longest line.
info <- cbind.data.frame(c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt"),
c(round(file.info("./final/en_US/en_US.blogs.txt")$size/1024/1024,1),
round(file.info("./final/en_US/en_US.news.txt")$size/1024/1024,1),
round(file.info("./final/en_US/en_US.twitter.txt")$size/1024/1024,1)),
c(length(blogs), length(news), length(twitter)),
c(sum(sapply(strsplit(blogs,"\\s+"),length)),
sum(sapply(strsplit(news,"\\s+"),length)),
sum(sapply(strsplit(twitter,"\\s+"),length))),
c(max(nchar(blogs)), max(nchar(news)), max(nchar(twitter))))
names(info) <- c("File", "Size (MB)", "# Lines", "# Words", "Longest Line Length")
info
## File Size (MB) # Lines # Words Longest Line Length
## 1 en_US.blogs.txt 200.4 100 4704 1461
## 2 en_US.news.txt 196.3 100 3222 982
## 3 en_US.twitter.txt 159.4 100 1275 140
Due to the excessive amount of memory required to store and process all of the data, and the limitations of the computer I was working on, I took a sample of only 1,000 lines from each data set.
set.seed(777)
sample.size.factor <- 1000
data <- c(sample(blogs, sample.size.factor),
sample(news, sample.size.factor),
sample(twitter, sample.size.factor))
length(data)
## [1] 3000
object.size(data)
## 689752 bytes
It was necessary to perform the following cleaning tasks using the tm package: eliminate punctuation symbols and numbers, convert all letters to lowercase, and strip whitespace. Because the original texts contain non-UTF-8 characters, the iconv function was used to change encodings. Finally, a list of profanity words was obtained from www.freewebheaders.com.
These cleaning tasks prepare the data for the exploratory analysis. I decided not to remove stop words, because I want my prediction algorithm to be able to predict them.
require(tm)
## Loading required package: tm
## Loading required package: NLP
# Clean non UTF-8
data <- iconv(data, from="UTF-8", to="ASCII", sub="")
# Clean transformations
corpus <- VCorpus(VectorSource(data))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, stripWhitespace)
# Clean banned words
if (!file.exists("bannedwords.zip"))
{
download.file("http://www.freewebheaders.com/wordpress/wp-content/uploads/full-list-of-bad-words-banned-by-google-txt-file.zip", "bannedwords.zip")
unzip("bannedwords.zip")
}
banned <- readLines(file("full-list-of-bad-words-banned-by-google-txt-file_2013_11_26_04_53_31_867.txt", blocking=TRUE, open="rb"))
corpus <- tm_map(corpus, removeWords, banned)
corpus
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 3000
An n-gram is a contiguous sequence of n items from a given sequence of text or speech. I used the RWeka package to perform frequency analyses of 1-grams, 2-grams, and 3-grams; a small illustration of the tokenizer follows the package loading below.
require(RWeka)
## Loading required package: RWeka
require(ggplot2)
## Loading required package: ggplot2
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
##
## annotate
require(data.table)
## Loading required package: data.table
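To make the n-gram definition above concrete, here is a small illustration (a sketch only; the sentence is invented for demonstration) of what the RWeka tokenizer produces for a single phrase: its 2-grams are simply the consecutive word pairs.
# Illustration only: 2-grams of a short made-up phrase
NGramTokenizer("thanks for the follow", Weka_control(min = 2, max = 2))
# Should yield the pairs "thanks for", "for the", "the follow"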
# Make the 1-gram
Tokenizer1Gram <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
uni.gram <- DocumentTermMatrix(corpus, control = list(tokenize = Tokenizer1Gram))
uni.gram
## <<DocumentTermMatrix (documents: 3000, terms: 13781)>>
## Non-/sparse entries: 59073/41283927
## Sparsity : 100%
## Maximal term length: 31
## Weighting : term frequency (tf)
# Plot the top 20 1-gram
freq1 <- sort(colSums(as.matrix(uni.gram)), decreasing=TRUE)
head(freq1, 20)
## the and that for you was with have this are but not from his its
## 4420 2198 929 899 675 605 602 520 488 425 421 392 331 314 310
## they said one all will
## 299 285 267 260 257
datafreq1 <- as.data.frame(data.table(word=names(freq1), freq=freq1))
ggplot(head(datafreq1, 20), aes(x=reorder(word, freq), y=freq)) +
geom_bar(stat="identity", fill="#00AA00") +
theme(axis.text.x=element_text(angle=45, hjust=1)) +
xlab("1-gram") +
ylab("Frequency") +
ggtitle("Top 20 1-grams") +
coord_flip()
# Make the 2-gram
Tokenizer2Gram <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
bi.gram <- DocumentTermMatrix(corpus, control = list(tokenize = Tokenizer2Gram))
bi.gram
## <<DocumentTermMatrix (documents: 3000, terms: 57672)>>
## Non-/sparse entries: 82098/172933902
## Sparsity : 100%
## Maximal term length: 38
## Weighting : term frequency (tf)
# Plot the top 20 2-gram
freq2 <- sort(colSums(as.matrix(bi.gram)), decreasing=TRUE)
head(freq2, 20)
## of the in the on the to the for the at the to be and the
## 433 419 214 192 145 131 124 121
## in a it was from the is a will be of a i have it is
## 108 100 91 88 85 78 76 75
## with a and i i was with the
## 75 74 74 74
datafreq2 <- as.data.frame(data.table(word=names(freq2), freq=freq2))
ggplot(head(datafreq2, 20), aes(x=reorder(word, freq), y=freq)) +
geom_bar(stat="identity", fill="#0077AA") +
theme(axis.text.x=element_text(angle=45, hjust=1)) +
xlab("2-gram") +
ylab("Frequency") +
ggtitle("Top 20 2-grams") +
coord_flip()
# Make the 3-gram
Tokenizer3Gram <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
tri.gram <- DocumentTermMatrix(corpus, control = list(tokenize = Tokenizer3Gram))
tri.gram
## <<DocumentTermMatrix (documents: 3000, terms: 76963)>>
## Non-/sparse entries: 80646/230808354
## Sparsity : 100%
## Maximal term length: 45
## Weighting : term frequency (tf)
# Plot the top 20 3-gram
freq3 <- sort(colSums(as.matrix(tri.gram)), decreasing=TRUE)
head(freq3, 20)
## one of the a lot of be able to the end of there is a
## 34 25 18 17 16
## to be a some of the out of the i have to as well as
## 16 15 14 13 12
## going to be im going to is going to look look look the fact that
## 12 12 12 12 12
## at the end i dont know in the first it was a one of those
## 11 11 11 11 11
datafreq3 <- as.data.frame(data.table(word=names(freq3), freq=freq3))
ggplot(head(datafreq3, 20), aes(x=reorder(word, freq), y=freq)) +
geom_bar(stat="identity", fill="#AA77FF") +
theme(axis.text.x=element_text(angle=45, hjust=1)) +
xlab("3-gram") +
ylab("Frequency") +
ggtitle("Top 20 3-grams") +
coord_flip()
With this analysis and these findings I will work on the development of the prediction algorithm and the Shiny app. The general idea is that for any given input of n tokens, I will try to find a suitable (n+1)-gram to predict the next word. However, as n increases there are fewer suitable occurrences, so the algorithm will likely need to fall back to shorter n-grams when no longer match is found. The application will be implemented as a Shiny app that lets the user enter a phrase and suggests the most likely next word.
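As a first sketch of this idea (illustration only, not the final algorithm: the function name predict.next.word and the simple frequency-based back-off are assumptions of mine), the prediction could search the 3-gram table built above for entries that start with the user's last two words, and fall back to the 2-gram table when nothing matches.
# Sketch: predict the next word from the datafreq2 and datafreq3 tables above.
# Raw frequencies stand in for probabilities; no smoothing is applied yet.
predict.next.word <- function(phrase, datafreq2, datafreq3) {
  tokens <- unlist(strsplit(tolower(phrase), "\\s+"))
  n <- length(tokens)
  # Try 3-grams first: entries beginning with the last two input words
  if (n >= 2) {
    hits <- datafreq3[startsWith(datafreq3$word, paste(tokens[n-1], tokens[n], "")), ]
    if (nrow(hits) > 0)
      return(tail(strsplit(hits$word[which.max(hits$freq)], " ")[[1]], 1))
  }
  # Back off to 2-grams: entries beginning with the last input word
  hits <- datafreq2[startsWith(datafreq2$word, paste(tokens[n], "")), ]
  if (nrow(hits) > 0)
    return(tail(strsplit(hits$word[which.max(hits$freq)], " ")[[1]], 1))
  NA_character_
}
# Example: given the tables above, "one of" should suggest "the",
# since "one of the" is the most frequent 3-gram in the sample.
predict.next.word("one of", datafreq2, datafreq3)
The final app would precompute and probably smooth these lookup tables, but the structure stays the same: match the longest available context first, then back off to shorter ones.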