Introduction

This project is part of the Coursera Data Science Capstone course. Its goal is to develop a text-prediction algorithm, built from large data sets drawn from several sources of text, and to deploy it as a Shiny application.

The purpose of the project is to predict the next word (or character) given a string of previously typed input. Such an app could help reduce mistyping.

This report is the first milestone, intended to demonstrate an understanding of the text data set, how to clean and explore it, and the progress made towards the final data product. It is also meant to gather feedback on the proposed path forward for the next tasks of the project.

This report uses data sets of US tweets, blogs, and news articles. The data come from publicly available sources and were provided by Coursera. The original data sets can be found at https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip and the documentation can be found at http://www.corpora.heliohost.org/aboutcorpus.html

Task 1 - Data acquisition and cleaning

First, we load the packages needed to accomplish the tasks in this report.

library(tm)
## Loading required package: NLP
library(RWekajars)
library(RWeka)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
## 
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
## 
##     annotate
library(stringi)
library(SnowballC)
library(wordcloud)
## Loading required package: RColorBrewer

1. Data Acquisition

The data come from a corpus called HC Corpora. The English (en_US) data sets were chosen for this report.

if(!file.exists("./final/en_US/en_US.news.txt")||
            !file.exists("./final/en_US/en_US.blogs.txt")||    
            !file.exists("./final/en_US/en_US.twitter.txt"))
{  
      if(!file.exists("Coursera-SwiftKey.zip"))
      {    
            download.file(url="https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip",  destfile="Coursera-SwiftKey.zip", quiet=T)
      }
      unzip(zipfile="Coursera-SwiftKey.zip")
}
# The working directory is assumed to be the project folder that contains "final/"
# Reading the data sets into R environment
blogs <- readLines("final/en_US/en_US.blogs.txt", encoding="UTF-8")
news <- readLines("final/en_US/en_US.news.txt", encoding="UTF-8")
twitter <- readLines("final/en_US/en_US.twitter.txt", encoding="UTF-8")
## Warning in readLines("final/en_US/en_US.twitter.txt", encoding = "UTF-8"):
## line 167155 appears to contain an embedded nul
## Warning in readLines("final/en_US/en_US.twitter.txt", encoding = "UTF-8"):
## line 268547 appears to contain an embedded nul
## Warning in readLines("final/en_US/en_US.twitter.txt", encoding = "UTF-8"):
## line 1274086 appears to contain an embedded nul
## Warning in readLines("final/en_US/en_US.twitter.txt", encoding = "UTF-8"):
## line 1759032 appears to contain an embedded nul

2. Understanding the data set

We would like to know the general features of the data sets.

# Getting an overview of the data sets
length(twitter)
## [1] 2360148
length(blogs)
## [1] 899288
length(news)
## [1] 1010242
summary(blogs)
##    Length     Class      Mode 
##    899288 character character
summary(nchar(blogs))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       1      47     156     230     329   40830
summary(nchar(twitter))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.00   37.00   64.00   68.68  100.00  140.00
summary(nchar(news))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     1.0   110.0   185.0   201.2   268.0 11380.0
words_blogs <- stri_count_words(blogs)
words_news <- stri_count_words(news)
words_twitter <- stri_count_words(twitter)
summary(words_blogs)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    9.00   28.00   41.75   60.00 6726.00
summary(words_news)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00   19.00   32.00   34.41   46.00 1796.00
summary(words_twitter)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    7.00   12.00   12.75   18.00   47.00
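
For an at-a-glance comparison, the per-source counts above can be collected into a single summary table. This is a minimal sketch using only the objects already created; the data frame name is illustrative and not part of the original analysis.

# Consolidate line, word, and character counts into one overview table
data_summary <- data.frame(
  source    = c("blogs", "news", "twitter"),
  lines     = c(length(blogs), length(news), length(twitter)),
  words     = c(sum(words_blogs), sum(words_news), sum(words_twitter)),
  max_chars = c(max(nchar(blogs)), max(nchar(news)), max(nchar(twitter)))
)
data_summary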

3. Data Cleaning

Tasks to accomplish

  1. Tokenisation - identifying appropriate tokens such as words, punctuation, and numbers, and writing a function that takes a file as input and returns a tokenized version of it.
  2. Profanity filtering - removing profanity and other words we do not want to predict.

Tokenisation

We develop a function that takes a file as input and returns a tokenized version of it. The function:
  – makes a corpus object
  – converts everything to lowercase
  – removes English stop words
  – removes punctuation
  – removes numbers
  – strips extra whitespace
tokenisator <- function(x) {
  # Build a corpus from the character vector and apply the basic cleaning steps
  corpus <- Corpus(VectorSource(x))
  corpus <- tm_map(corpus, tolower)                            # convert to lowercase
  corpus <- tm_map(corpus, removeWords, stopwords("english"))  # drop English stop words
  corpus <- tm_map(corpus, removePunctuation)                  # remove punctuation
  corpus <- tm_map(corpus, removeNumbers)                      # remove numbers
  corpus <- tm_map(corpus, stripWhitespace)                    # collapse extra whitespace
  corpus                                                       # return the cleaned corpus
}
blog_token <- tokenisator(blogs)
twitter_token <- tokenisator(twitter)
news_token <- tokenisator(news)

Profanity filtering

The data contain offensive and profane words, which need to be removed before the data sets can be used to build a text-prediction app.

# Downloading a profanity word list
download.file(url="http://www.bannedwordlist.com/lists/swearWords.txt",  destfile="swearWords.txt", quiet=T)
profanity <- readLines("swearWords.txt")
## Warning in readLines("swearWords.txt"): incomplete final line found on
## 'swearWords.txt'
profanity
##  [1] "anal"         "anus"         "arse"         "ass"         
##  [5] "ballsack"     "balls"        "bastard"      "bitch"       
##  [9] "biatch"       "bloody"       "blowjob"      "blow job"    
## [13] "bollock"      "bollok"       "boner"        "boob"        
## [17] "bugger"       "bum"          "butt"         "buttplug"    
## [21] "clitoris"     "cock"         "coon"         "crap"        
## [25] "cunt"         "damn"         "dick"         "dildo"       
## [29] "dyke"         "fag"          "feck"         "fellate"     
## [33] "fellatio"     "felching"     "fuck"         "f u c k"     
## [37] "fudgepacker"  "fudge packer" "flange"       "Goddamn"     
## [41] "God damn"     "hell"         "homo"         "jerk"        
## [45] "jizz"         "knobend"      "knob end"     "labia"       
## [49] "lmao"         "lmfao"        "muff"         "nigger"      
## [53] "nigga"        "omg"          "penis"        "piss"        
## [57] "poop"         "prick"        "pube"         "pussy"       
## [61] "queer"        "scrotum"      "sex"          "shit"        
## [65] "s hit"        "sh1t"         "slut"         "smegma"      
## [69] "spunk"        "tit"          "tosser"       "turd"        
## [73] "twat"         "vagina"       "wank"         "whore"       
## [77] "wtf"
twitter_token <-tm_map(twitter_token, removeWords, profanity)
blog_token <-tm_map(blog_token, removeWords, profanity)
news_token <-tm_map(news_token, removeWords, profanity)

Before proceeding with the analysis, we take a subset of each data set, as the full data sets are too large to process on the machine used for this report.

sample_rate <- 0.05
twitter_sub <- sample(twitter_token, length(twitter_token)*sample_rate)
length(twitter_sub)
## [1] 118007
blog_sub <- sample(blog_token, length(blog_token)*sample_rate)
length(blog_sub)
## [1] 44964
news_sub <- sample(news_token, length(news_token)*sample_rate)
length(news_sub)
## [1] 50512
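
Note that sample() draws a random subset, so the exact lines selected will differ between runs. As a sketch (not part of the original run), a seed could be fixed before the sample() calls above to make the report reproducible.

# Fix the random seed before calling sample() so the same subsets are drawn every run
set.seed(1234)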

Task 2 - Exploratory Analysis

Here we perform an exploratory analysis of groups of three consecutive words (trigrams, n = 3). The same can be done for unigrams (single words) and bigrams (two-word groups); a short sketch of that follows the Twitter trigram analysis below.

Twitter data subset:

trigram_twitter <- NGramTokenizer(twitter_sub, Weka_control(min = 3, max = 3))
trigram_twitter_df <- data.frame(table(trigram_twitter))
trigram_twitter_dfsorted <- trigram_twitter_df[order(trigram_twitter_df$Freq,decreasing = TRUE),]
# We will plot only the 20 most frequent trigrams
trigram_twitter_top20 <- trigram_twitter_dfsorted[1:20,]
colnames(trigram_twitter_top20) <- c("Word","Frequency")
trigram_twitter_top20
##                          Word Frequency
## 295613      happy mothers day       168
## 378422            let us know        96
## 295645         happy new year        75
## 108338          cinco de mayo        45
## 402315 looking forward seeing        42
## 118072            come see us        35
## 354936         keep good work        34
## 687560    thanks following us        30
## 296098   happy valentines day        29
## 348394          just got back        29
## 400916    look forward seeing        29
## 295935      happy th birthday        27
## 270129  good morning everyone        26
## 92154           cant wait see        25
## 222580     follow back please        23
## 352752        just wanted say        22
## 408780         love love love        22
## 646193        st patricks day        22
## 114522   coffee coffee coffee        21
## 117550           come join us        21
wordcloud(trigram_twitter_top20$Word, trigram_twitter_top20$Freq, max.words=60)
## Warning in wordcloud(trigram_twitter_top20$Word,
## trigram_twitter_top20$Freq, : happy mothers day could not be fit on page.
## It will not be plotted.

trigram_twitter_top20 <- trigram_twitter_top20[1:20, 1:2]
trigram_twitter_top20$Word <- as.character(trigram_twitter_top20$Word)
trigram_twitter_top20$Word <- gsub(" ", "_", trigram_twitter_top20$Word)

ggplot(trigram_twitter_top20, aes(x = Word, y = Frequency)) + geom_bar(stat = "identity") + geom_text(aes(label = Frequency), vjust = -0.2) + theme(axis.text.x = element_text(angle = 45, hjust = 1))
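
As noted above, the same procedure extends to unigrams and bigrams simply by changing the Weka_control bounds. The sketch below applies it to the Twitter subset; the object names are illustrative and not part of the original analysis.

# Unigrams (single words) and bigrams (word pairs) from the Twitter subset
unigram_twitter <- NGramTokenizer(twitter_sub, Weka_control(min = 1, max = 1))
bigram_twitter  <- NGramTokenizer(twitter_sub, Weka_control(min = 2, max = 2))

# Tabulate and sort by frequency, exactly as done for the trigrams above
bigram_twitter_df <- data.frame(table(bigram_twitter))
bigram_twitter_dfsorted <- bigram_twitter_df[order(bigram_twitter_df$Freq, decreasing = TRUE), ]
head(bigram_twitter_dfsorted, 20)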

Blogs data subset:

trigram_blog <- NGramTokenizer(blog_sub, Weka_control(min = 3, max = 3))
trigram_blog_df <- data.frame(table(trigram_blog))
trigram_blog_dfsorted <- trigram_blog_df[order(trigram_blog_df$Freq,decreasing = TRUE),]
trigram_blog_top20 <- trigram_blog_dfsorted[1:20,]
colnames(trigram_blog_top20) <- c("Word","Frequency")
trigram_blog_top20
##                               Word Frequency
## 549973              new york times        35
## 549872               new york city        34
## 865043               two years ago        26
## 555248          none repeat scroll        25
## 680151        repeat scroll yellow        25
## 793790 stylebackground none repeat        25
## 773917             st patricks day        20
## 171293            couple years ago        18
## 27084          amazon services llc        17
## 171244            couple weeks ago        17
## 390857         incorporated item c        17
## 478041              love love love        17
## 483178               m pretty sure        17
## 769615             spend much time        17
## 415862             just little bit        16
## 467080               llc amazon eu        16
## 731316         services llc amazon        16
## 733690           several years ago        15
## 470078               long time ago        14
## 549931                 new york ny        14
wordcloud(trigram_blog_top20$Word, trigram_blog_top20$Freq, max.words=60)
## Warning in wordcloud(trigram_blog_top20$Word, trigram_blog_top20$Freq,
## max.words = 60): new york city could not be fit on page. It will not be
## plotted.
## Warning in wordcloud(trigram_blog_top20$Word, trigram_blog_top20$Freq,
## max.words = 60): two years ago could not be fit on page. It will not be
## plotted.
## Warning in wordcloud(trigram_blog_top20$Word, trigram_blog_top20$Freq,
## max.words = 60): stylebackground none repeat could not be fit on page. It
## will not be plotted.
## Warning in wordcloud(trigram_blog_top20$Word, trigram_blog_top20$Freq,
## max.words = 60): several years ago could not be fit on page. It will not be
## plotted.
## Warning in wordcloud(trigram_blog_top20$Word, trigram_blog_top20$Freq,
## max.words = 60): llc amazon eu could not be fit on page. It will not be
## plotted.
## Warning in wordcloud(trigram_blog_top20$Word, trigram_blog_top20$Freq,
## max.words = 60): amazon services llc could not be fit on page. It will not
## be plotted.
## Warning in wordcloud(trigram_blog_top20$Word, trigram_blog_top20$Freq,
## max.words = 60): st patricks day could not be fit on page. It will not be
## plotted.
## Warning in wordcloud(trigram_blog_top20$Word, trigram_blog_top20$Freq,
## max.words = 60): spend much time could not be fit on page. It will not be
## plotted.

trigram_blog_top20 <- trigram_blog_top20[1:20, 1:2]
trigram_blog_top20$Word <- as.character(trigram_blog_top20$Word)
trigram_blog_top20$Word <- gsub(" ", "_", trigram_blog_top20$Word)

ggplot(trigram_blog_top20, aes(x = Word, y = Frequency)) + geom_bar(stat = "identity") + geom_text(aes(label = Frequency), vjust = -0.2) + theme(axis.text.x = element_text(angle = 45, hjust = 1))

News data subset:

trigram_news <- NGramTokenizer(news_sub, Weka_control(min = 3, max = 3))
trigram_news_df <- data.frame(table(trigram_news))
trigram_news_dfsorted <- trigram_news_df[order(trigram_news_df$Freq,decreasing = TRUE),]
trigram_news_top20 <- trigram_news_dfsorted[1:20,]
colnames(trigram_news_top20) <- c("Word","Frequency")
trigram_news_top20
##                             Word Frequency
## 634246    president barack obama        76
## 544321             new york city        72
## 878347             two years ago        66
## 341395        gov chris christie        63
## 790057           st louis county        47
## 126661           cents per share        35
## 367511      high school students        35
## 939645              world war ii        34
## 295144          first time since        33
## 817196      superior court judge        31
## 878228             two weeks ago        31
## 927469           will take place        30
## 544562            new york times        27
## 181283 county prosecutors office        26
## 135395   chief executive officer        25
## 887102         us district judge        25
## 886777       us attorneys office        24
## 888226          us supreme court        23
## 939608        world trade center        22
## 715347            said last week        21
wordcloud(trigram_news_top20$Word, trigram_news_top20$Freq, max.words=60)
## Warning in wordcloud(trigram_news_top20$Word, trigram_news_top20$Freq,
## max.words = 60): president barack obama could not be fit on page. It will
## not be plotted.
## Warning in wordcloud(trigram_news_top20$Word, trigram_news_top20$Freq,
## max.words = 60): st louis county could not be fit on page. It will not be
## plotted.
## Warning in wordcloud(trigram_news_top20$Word, trigram_news_top20$Freq,
## max.words = 60): chief executive officer could not be fit on page. It will
## not be plotted.
## Warning in wordcloud(trigram_news_top20$Word, trigram_news_top20$Freq,
## max.words = 60): new york times could not be fit on page. It will not be
## plotted.
## Warning in wordcloud(trigram_news_top20$Word, trigram_news_top20$Freq,
## max.words = 60): world trade center could not be fit on page. It will not
## be plotted.
## Warning in wordcloud(trigram_news_top20$Word, trigram_news_top20$Freq,
## max.words = 60): two weeks ago could not be fit on page. It will not be
## plotted.

trigram_news_top20 <- trigram_news_top20[1:20, 1:2]
trigram_news_top20$Word <- as.character(trigram_news_top20$Word)
trigram_news_top20$Word <- gsub(" ", "_", trigram_news_top20$Word)

ggplot(trigram_news_top20, aes(x = Word, y = Frequency)) + geom_bar(stat = "identity") + geom_text(aes(label = Frequency), vjust = -0.2) + theme(axis.text.x = element_text(angle = 45, hjust = 1))

Future plan for the Shiny app and prediction algorithm

The next steps are:
  – use phrase (n-gram) frequencies to predict the probability of the next word, given the one or two preceding words;
  – apply practical machine learning with a Markov chain approach to build a statistical model of word sequences in English text;
  – deploy a Shiny app that serves the word prediction online.
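
As a first, simplified illustration of this plan, the trigram frequency tables built above can already suggest a next word by matching on the two preceding words. This is only a sketch of the idea, not the final algorithm; predict_next() and its behaviour are illustrative assumptions, and the back-off to bigrams and unigrams is left as a comment.

# Illustrative next-word lookup from a trigram frequency table
# 'trigrams' is a character vector of "w1 w2 w3" strings, 'freqs' their counts
predict_next <- function(trigrams, freqs, w1, w2) {
  prefix <- paste0(w1, " ", w2, " ")
  hits <- startsWith(as.character(trigrams), prefix)
  if (!any(hits)) return(NA_character_)   # a real model would back off to bigrams/unigrams here
  best <- as.character(trigrams)[hits][which.max(freqs[hits])]
  strsplit(best, " ")[[1]][3]             # third word of the most frequent matching trigram
}

# Example: suggest the word most likely to follow "happy mothers" in the Twitter subset
predict_next(trigram_twitter_dfsorted$trigram_twitter,
             trigram_twitter_dfsorted$Freq, "happy", "mothers")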