This is a project for the Coursera Data Science Capstone course. The goal is to develop a text-prediction algorithm, built from large data sets drawn from several sources of text, and to deploy it with Shiny.
The purpose of the project is to estimate the next character or word given a string of input history. Such an app could be a useful remedy for mistyping.
This report is the first milestone, in which we demonstrate our understanding of the text data set, show how to clean and explore it, and describe progress towards the final data product. It is also meant to gather feedback on the proposed path forward for the next tasks of the project.
This report uses a data set of US tweets, blogs, and news articles. The data come from publicly available sources and were provided by Coursera. The original data sets can be found at https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip and their documentation at http://www.corpora.heliohost.org/aboutcorpus.html
First we load all the packages needed for the tasks in this report.
library(tm)
## Loading required package: NLP
library(RWekajars)
library(RWeka)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
##
## annotate
library(stringi)
library(SnowballC)
library(wordcloud)
## Loading required package: RColorBrewer
The data come from a corpus called HC Corpora. The English data sets were chosen for this report.
if(!file.exists("./final/en_US/en_US.news.txt") ||
   !file.exists("./final/en_US/en_US.blogs.txt") ||
   !file.exists("./final/en_US/en_US.twitter.txt")) {
    if(!file.exists("Coursera-SwiftKey.zip")) {
        download.file(url="https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip",
                      destfile="Coursera-SwiftKey.zip", quiet=T)
    }
    unzip(zipfile="Coursera-SwiftKey.zip")
}
setwd("~/Documents/Rcourse/course10")  # the parent of the final/ directory, so the relative paths below resolve
# Reading the data sets into the R environment
blogs <- readLines("final/en_US/en_US.blogs.txt", encoding="UTF-8")
news <- readLines("final/en_US/en_US.news.txt", encoding="UTF-8")
twitter <- readLines("final/en_US/en_US.twitter.txt", encoding="UTF-8")
## Warning in readLines("final/en_US/en_US.twitter.txt", encoding = "UTF-8"):
## line 167155 appears to contain an embedded nul
## Warning in readLines("final/en_US/en_US.twitter.txt", encoding = "UTF-8"):
## line 268547 appears to contain an embedded nul
## Warning in readLines("final/en_US/en_US.twitter.txt", encoding = "UTF-8"):
## line 1274086 appears to contain an embedded nul
## Warning in readLines("final/en_US/en_US.twitter.txt", encoding = "UTF-8"):
## line 1759032 appears to contain an embedded nul
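The embedded-nul warnings are harmless here, but they can be avoided by telling readLines to skip nul characters (a sketch, not run in this report):
twitter <- readLines("final/en_US/en_US.twitter.txt", encoding="UTF-8", skipNul=TRUE)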
We would like to know the general features of the data sets.
# Getting an overview of the data sets
length(twitter)
## [1] 2360148
length(blogs)
## [1] 899288
length(news)
## [1] 1010242
summary(blogs)
## Length Class Mode
## 899288 character character
summary(nchar(blogs))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1 47 156 230 329 40830
summary(nchar(twitter))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.00 37.00 64.00 68.68 100.00 140.00
summary(nchar(news))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.0 110.0 185.0 201.2 268.0 11380.0
words_blogs <- stri_count_words(blogs)
words_news <- stri_count_words(news)
words_twitter <- stri_count_words(twitter)
summary(words_blogs)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 9.00 28.00 41.75 60.00 6726.00
summary(words_news)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 19.00 32.00 34.41 46.00 1796.00
summary(words_twitter)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 7.00 12.00 12.75 18.00 47.00
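To make comparison easier, the statistics above can be consolidated into a single table (a sketch using the objects defined above; the data_stats name is illustrative):
# Consolidated line counts, word counts, and longest line per source
data_stats <- data.frame(
    source = c("blogs", "news", "twitter"),
    lines = c(length(blogs), length(news), length(twitter)),
    words = c(sum(words_blogs), sum(words_news), sum(words_twitter)),
    max_chars = c(max(nchar(blogs)), max(nchar(news)), max(nchar(twitter)))
)
data_stats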
Tasks to accomplish: tokenise the data sets, convert the text to lower case, and remove stop words, punctuation, numbers, and extra whitespace.
tokenisator <- function(x) {
    # Build a corpus from the character vector and apply standard tm cleaning steps
    corpus <- Corpus(VectorSource(x))
    corpus <- tm_map(corpus, content_transformer(tolower))  # content_transformer keeps the corpus class intact in tm >= 0.6
    corpus <- tm_map(corpus, removeWords, stopwords("english"))
    corpus <- tm_map(corpus, removePunctuation)
    corpus <- tm_map(corpus, removeNumbers)
    corpus <- tm_map(corpus, stripWhitespace)
    corpus
}
blog_token <- tokenisator(blogs)
twitter_token <- tokenisator(twitter)
news_token <- tokenisator(news)
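To spot-check the cleaning, we can inspect a single cleaned document (a sketch; the exact printout depends on the tm version):
inspect(blog_token[1])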
The data contain offensive and profane words, which we need to remove before the data can be used to build a text-prediction app.
# Loading a profanity list
download.file(url="http://www.bannedwordlist.com/lists/swearWords.txt", destfile="swearWords.txt", quiet=T)
profanity <- readLines("swearWords.txt")
## Warning in readLines("swearWords.txt"): incomplete final line found on
## 'swearWords.txt'
profanity
## [1] "anal" "anus" "arse" "ass"
## [5] "ballsack" "balls" "bastard" "bitch"
## [9] "biatch" "bloody" "blowjob" "blow job"
## [13] "bollock" "bollok" "boner" "boob"
## [17] "bugger" "bum" "butt" "buttplug"
## [21] "clitoris" "cock" "coon" "crap"
## [25] "cunt" "damn" "dick" "dildo"
## [29] "dyke" "fag" "feck" "fellate"
## [33] "fellatio" "felching" "fuck" "f u c k"
## [37] "fudgepacker" "fudge packer" "flange" "Goddamn"
## [41] "God damn" "hell" "homo" "jerk"
## [45] "jizz" "knobend" "knob end" "labia"
## [49] "lmao" "lmfao" "muff" "nigger"
## [53] "nigga" "omg" "penis" "piss"
## [57] "poop" "prick" "pube" "pussy"
## [61] "queer" "scrotum" "sex" "shit"
## [65] "s hit" "sh1t" "slut" "smegma"
## [69] "spunk" "tit" "tosser" "turd"
## [73] "twat" "vagina" "wank" "whore"
## [77] "wtf"
twitter_token <- tm_map(twitter_token, removeWords, profanity)
blog_token <- tm_map(blog_token, removeWords, profanity)
news_token <- tm_map(news_token, removeWords, profanity)
Before we go ahead with the analysis, we take a 5% random sample of each data set, as the full data sets are too large to process on a typical personal computer.
sample_rate <- 0.05
twitter_sub <- sample(twitter_token, length(twitter_token)*sample_rate)
length(twitter_sub)
## [1] 118007
blog_sub <- sample(blog_token, length(blog_token)*sample_rate)
length(blog_sub)
## [1] 44964
news_sub <- sample(news_token, length(news_token)*sample_rate)
length(news_sub)
## [1] 50512
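Note that sample() draws a random subset, so the n-gram frequencies reported below will vary slightly between runs; for a fully reproducible report one would fix the random seed before sampling (a sketch, with an illustrative seed value):
set.seed(1234)  # any fixed value makes the samples reproducible
twitter_sub <- sample(twitter_token, length(twitter_token)*sample_rate)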
Here we perform an exploratory analysis of groups of three consecutive words (n-grams with n = 3, i.e. trigrams). The same approach works for unigrams (single words) and bigrams (two-word groups) simply by changing the tokenizer bounds, as sketched below for the Twitter subset.
Twitter data subset:
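The unigram and bigram variants differ only in the Weka_control bounds (a sketch, not run in this report):
unigram_twitter <- NGramTokenizer(twitter_sub, Weka_control(min = 1, max = 1))
bigram_twitter <- NGramTokenizer(twitter_sub, Weka_control(min = 2, max = 2))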
trigram_twitter <- NGramTokenizer(twitter_sub, Weka_control(min = 3, max = 3))
trigram_twitter_df <- data.frame(table(trigram_twitter))
trigram_twitter_dfsorted <- trigram_twitter_df[order(trigram_twitter_df$Freq,decreasing = TRUE),]
# We will plot only the top 20 most frequent trigrams
trigram_twitter_top20 <- trigram_twitter_dfsorted[1:20,]
colnames(trigram_twitter_top20) <- c("Word","Frequency")
trigram_twitter_top20
## Word Frequency
## 295613 happy mothers day 168
## 378422 let us know 96
## 295645 happy new year 75
## 108338 cinco de mayo 45
## 402315 looking forward seeing 42
## 118072 come see us 35
## 354936 keep good work 34
## 687560 thanks following us 30
## 296098 happy valentines day 29
## 348394 just got back 29
## 400916 look forward seeing 29
## 295935 happy th birthday 27
## 270129 good morning everyone 26
## 92154 cant wait see 25
## 222580 follow back please 23
## 352752 just wanted say 22
## 408780 love love love 22
## 646193 st patricks day 22
## 114522 coffee coffee coffee 21
## 117550 come join us 21
wordcloud(trigram_twitter_top20$Word, trigram_twitter_top20$Freq, max.words=60)
## Warning in wordcloud(trigram_twitter_top20$Word,
## trigram_twitter_top20$Freq, : happy mothers day could not be fit on page.
## It will not be plotted.
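The warnings show that some phrases are too long to fit in the cloud at the default scale; shrinking wordcloud's scale argument usually lets them fit (a sketch, not run here):
wordcloud(trigram_twitter_top20$Word, trigram_twitter_top20$Frequency, scale=c(2, 0.5), max.words=60)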
trigram_twitter_top20 <- trigram_twitter_top20[1:20, 1:2]
trigram_twitter_top20$Word <- as.character(trigram_twitter_top20$Word)
trigram_twitter_top20$Word <- gsub(" ", "_", trigram_twitter_top20$Word)
ggplot(trigram_twitter_top20, aes(x=Word, y=Frequency)) + geom_bar(stat="identity") + geom_text(aes(label=Frequency), vjust=-0.2) + theme(axis.text.x = element_text(angle = 45, hjust = 1))
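The bars above are ordered alphabetically along the x axis; ordering them by frequency instead makes the plot easier to read, e.g. by reordering the factor levels (a sketch that applies equally to the blog and news plots below):
ggplot(trigram_twitter_top20, aes(x=reorder(Word, -Frequency), y=Frequency)) + geom_bar(stat="identity") + geom_text(aes(label=Frequency), vjust=-0.2) + theme(axis.text.x = element_text(angle = 45, hjust = 1)) + xlab("Word")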
Blog data subset:
trigram_blog <- NGramTokenizer(blog_sub, Weka_control(min = 3, max = 3))
trigram_blog_df <- data.frame(table(trigram_blog))
trigram_blog_dfsorted <- trigram_blog_df[order(trigram_blog_df$Freq,decreasing = TRUE),]
trigram_blog_top20 <- trigram_blog_dfsorted[1:20,]
colnames(trigram_blog_top20) <- c("Word","Frequency")
trigram_blog_top20
## Word Frequency
## 549973 new york times 35
## 549872 new york city 34
## 865043 two years ago 26
## 555248 none repeat scroll 25
## 680151 repeat scroll yellow 25
## 793790 stylebackground none repeat 25
## 773917 st patricks day 20
## 171293 couple years ago 18
## 27084 amazon services llc 17
## 171244 couple weeks ago 17
## 390857 incorporated item c 17
## 478041 love love love 17
## 483178 m pretty sure 17
## 769615 spend much time 17
## 415862 just little bit 16
## 467080 llc amazon eu 16
## 731316 services llc amazon 16
## 733690 several years ago 15
## 470078 long time ago 14
## 549931 new york ny 14
wordcloud(trigram_blog_top20$Word, trigram_blog_top20$Freq, max.words=60)
## Warning in wordcloud(trigram_blog_top20$Word, trigram_blog_top20$Freq,
## max.words = 60): new york city could not be fit on page. It will not be
## plotted.
## Warning in wordcloud(trigram_blog_top20$Word, trigram_blog_top20$Freq,
## max.words = 60): two years ago could not be fit on page. It will not be
## plotted.
## Warning in wordcloud(trigram_blog_top20$Word, trigram_blog_top20$Freq,
## max.words = 60): stylebackground none repeat could not be fit on page. It
## will not be plotted.
## Warning in wordcloud(trigram_blog_top20$Word, trigram_blog_top20$Freq,
## max.words = 60): several years ago could not be fit on page. It will not be
## plotted.
## Warning in wordcloud(trigram_blog_top20$Word, trigram_blog_top20$Freq,
## max.words = 60): llc amazon eu could not be fit on page. It will not be
## plotted.
## Warning in wordcloud(trigram_blog_top20$Word, trigram_blog_top20$Freq,
## max.words = 60): amazon services llc could not be fit on page. It will not
## be plotted.
## Warning in wordcloud(trigram_blog_top20$Word, trigram_blog_top20$Freq,
## max.words = 60): st patricks day could not be fit on page. It will not be
## plotted.
## Warning in wordcloud(trigram_blog_top20$Word, trigram_blog_top20$Freq,
## max.words = 60): spend much time could not be fit on page. It will not be
## plotted.
trigram_blog_top20 <- trigram_blog_top20[1:20, 1:2]
trigram_blog_top20$Word <- as.character(trigram_blog_top20$Word)
trigram_blog_top20$Word <- gsub(" ", "_", trigram_blog_top20$Word)
ggplot(trigram_blog_top20, aes(x=Word, y=Frequency)) + geom_bar(stat="identity") + geom_text(aes(label=Frequency), vjust=-0.2) + theme(axis.text.x = element_text(angle = 45, hjust = 1))
News data subset:
trigram_news <- NGramTokenizer(news_sub, Weka_control(min = 3, max = 3))
trigram_news_df <- data.frame(table(trigram_news))
trigram_news_dfsorted <- trigram_news_df[order(trigram_news_df$Freq,decreasing = TRUE),]
trigram_news_top20 <- trigram_news_dfsorted[1:20,]
colnames(trigram_news_top20) <- c("Word","Frequency")
trigram_news_top20
## Word Frequency
## 634246 president barack obama 76
## 544321 new york city 72
## 878347 two years ago 66
## 341395 gov chris christie 63
## 790057 st louis county 47
## 126661 cents per share 35
## 367511 high school students 35
## 939645 world war ii 34
## 295144 first time since 33
## 817196 superior court judge 31
## 878228 two weeks ago 31
## 927469 will take place 30
## 544562 new york times 27
## 181283 county prosecutors office 26
## 135395 chief executive officer 25
## 887102 us district judge 25
## 886777 us attorneys office 24
## 888226 us supreme court 23
## 939608 world trade center 22
## 715347 said last week 21
wordcloud(trigram_news_top20$Word, trigram_news_top20$Freq, max.words=60)
## Warning in wordcloud(trigram_news_top20$Word, trigram_news_top20$Freq,
## max.words = 60): president barack obama could not be fit on page. It will
## not be plotted.
## Warning in wordcloud(trigram_news_top20$Word, trigram_news_top20$Freq,
## max.words = 60): st louis county could not be fit on page. It will not be
## plotted.
## Warning in wordcloud(trigram_news_top20$Word, trigram_news_top20$Freq,
## max.words = 60): chief executive officer could not be fit on page. It will
## not be plotted.
## Warning in wordcloud(trigram_news_top20$Word, trigram_news_top20$Freq,
## max.words = 60): new york times could not be fit on page. It will not be
## plotted.
## Warning in wordcloud(trigram_news_top20$Word, trigram_news_top20$Freq,
## max.words = 60): world trade center could not be fit on page. It will not
## be plotted.
## Warning in wordcloud(trigram_news_top20$Word, trigram_news_top20$Freq,
## max.words = 60): two weeks ago could not be fit on page. It will not be
## plotted.
trigram_news_top20 <- trigram_news_top20[1:20, 1:2]
trigram_news_top20$Word <- as.character(trigram_news_top20$Word)
trigram_news_top20$Word <- gsub(" ", "_", trigram_news_top20$Word)
ggplot(trigram_news_top20, aes(x=Word, y=Frequency)) + geom_bar(stat="identity") + geom_text(aes(label=Frequency), vjust=-0.2) + theme(axis.text.x = element_text(angle = 45, hjust = 1))
The next steps are:
- Use phrase (n-gram) frequencies to predict the probability of the next word given the preceding one or two words (a sketch of this step follows below).
- Apply practical machine learning with a Markov chain method to build a statistical model of word sequences in English text.
- Deploy the word prediction as an online Shiny app.
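As a rough sketch of the intended prediction step (the helper function, its name, and the lookup strategy are illustrative, not the final design), the trigram frequency table built above can be queried for the most frequent completion of a two-word prefix:
# Hypothetical helper: given two preceding words, return the most frequent
# third word from the trigram frequency table built above.
predict_next <- function(w1, w2, trigram_df) {
    prefix <- paste(w1, w2, "")  # trailing space so we match whole words only
    hits <- trigram_df[startsWith(as.character(trigram_df$trigram_twitter), prefix), ]
    if (nrow(hits) == 0) return(NA_character_)
    best <- as.character(hits$trigram_twitter[which.max(hits$Freq)])
    tail(strsplit(best, " ")[[1]], 1)  # last word of the best-matching trigram
}
predict_next("let", "us", trigram_twitter_df)  # should return "know" for this sample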