In this capstone project, given a set of text datasets, we will develop (a) a prediction model that suggests words to complete the phrases users are typing and (b) a Shiny app that simulates real usage.
This first milestone report describes the acquired datasets, explores their features and structure, and gives first insights into the model that will be built.
The first step is to set up the environment:
library(tm)
library(ggplot2)
library(dplyr)
library(RWeka)
library(stringi)
library(formattable)
library(SnowballC)
library(parallel)
library(wordcloud)
# Add parallel processing to minimize runtime
jobcluster <- makeCluster(detectCores())
invisible(clusterEvalQ(jobcluster, library(tm)))
invisible(clusterEvalQ(jobcluster, library(RWeka)))
In this step, we load all three datasets (blogs, news and Twitter) and, to improve performance, sample the content using 1% of the lines from the blog and news files and 0.1% from the Twitter file.
set.seed(20170517)
# Loading, Sampling and Summarizing Blog Dataset
blogs.ds <- readLines("en_US.blogs.txt")
blogs.ds.summary <- c(stri_stats_general(blogs.ds), stri_stats_latex(blogs.ds)[4])
blogs.sample <- blogs.ds[rbinom(length(blogs.ds)*0.01, length(blogs.ds), 0.50)]
blogs.sample.summary <- c(stri_stats_general(blogs.sample), stri_stats_latex(blogs.sample)[4])
# release memory
rm(blogs.ds)
# Loading, Sampling and Summarizing News Dataset
news.ds <- readLines("en_US.news.txt")
news.ds.summary <- c(stri_stats_general(news.ds), stri_stats_latex(news.ds)[4])
news.sample <- news.ds[rbinom(length(news.ds)*0.01, length(news.ds), 0.50)]
news.sample.summary <- c(stri_stats_general(news.sample), stri_stats_latex(news.sample)[4])
# release memory
rm(news.ds)
# Loading, Sampling and Summarizing Twitter Dataset
twitter.ds <- readLines("en_US.twitter.txt")
twitter.ds.summary <- c(stri_stats_general(twitter.ds), stri_stats_latex(twitter.ds)[4])
twitter.sample <- twitter.ds[rbinom(length(twitter.ds)*0.001, length(twitter.ds), 0.50)]
twitter.sample.summary <- c(stri_stats_general(twitter.sample), stri_stats_latex(twitter.sample)[4])
# release memory
rm(twitter.ds)
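One caveat about the sampling calls above: `rbinom(n, size, 0.50)` returns `n` draws from a binomial distribution, so the resulting indices cluster around the middle of each file (for the blog file, within a few hundred lines of the midpoint) and the same line can be picked many times. If a uniform sample without repetition is wanted, a minimal sketch could look like the following. The helper name `sampleLines` and the toy input are assumptions for illustration, not part of the original analysis, and the tables in this report were produced with the original `rbinom` calls.

```r
set.seed(20170517)

# Hypothetical helper (not in the original report): keep a uniform random
# fraction of the lines, drawing each line at most once.
sampleLines <- function(lines, fraction) {
  n <- floor(length(lines) * fraction)
  lines[sample.int(length(lines), n)]
}

# Toy input standing in for readLines("en_US.blogs.txt")
toy <- sprintf("line %d", 1:1000)
toy.sample <- sampleLines(toy, 0.01)

length(toy.sample)         # 10 lines, i.e. 1% of the toy input
anyDuplicated(toy.sample)  # 0, because no line is drawn twice
```

With the real data the call would be, for example, `sampleLines(blogs.ds, 0.01)`.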
| Dataset | Type | Lines | LinesNEmpty | Chars | CharsNWhite | Words |
|---|---|---|---|---|---|---|
| Blog | Full | 899288 | 899288 | 208361438 | 171926076 | 37865888 |
| Blog | Sample | 8992 | 8992 | 2088794 | 1724016 | 377286 |
| News | Full | 77259 | 77259 | 15683765 | 13117038 | 2665742 |
| News | Sample | 772 | 772 | 150140 | 125470 | 25269 |
| Twitter | Full | 2360148 | 2360148 | 162384825 | 134370864 | 30578891 |
| Twitter | Sample | 2360 | 2360 | 162182 | 134143 | 30485 |
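The columns of this table correspond to the four values returned by `stri_stats_general` (Lines, LinesNEmpty, Chars, CharsNWhite) plus the fourth element of `stri_stats_latex` (Words). As a rough sketch of how such summary vectors can be stacked into one table, assuming a hypothetical helper `summariseText` and toy input rather than the real files:

```r
library(stringi)

# Hypothetical helper: the same summary the report computes for each dataset
summariseText <- function(x) {
  c(stri_stats_general(x), stri_stats_latex(x)[4])
}

# Toy vectors standing in for a full dataset and its sample
toy.full   <- c("The quick brown fox", "", "jumps over the lazy dog")
toy.sample <- toy.full[1]

# One row per summary vector; the column names come from stringi
summary.table <- rbind(Full = summariseText(toy.full),
                       Sample = summariseText(toy.sample))
summary.table
```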
In this section, we apply some transformations to the data to reduce complexity and drop non-relevant details.
We start by removing the following pieces of data:
# Consolidate samples into one object
corpus.samples <- Corpus(VectorSource(c(blogs.sample, news.sample, twitter.sample)))
# Release memory
rm(blogs.sample, news.sample, twitter.sample)
# Function to replace patterns with whitespace
funPatternToSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
# Remove special characters
corpus.samples <- tm_map(corpus.samples, funPatternToSpace,"\"|/|@|\\|")
# Remove Numbers
corpus.samples <- tm_map(corpus.samples, removeNumbers)
# Remove Punctuation
corpus.samples <- tm_map(corpus.samples, removePunctuation)
# Remove Profanity words (from Luis von Ahn's Research Group)
profanity.ds <- read.csv(url("https://www.cs.cmu.edu/~biglou/resources/bad-words.txt"), header = FALSE, col.names = c("word"))
corpus.samples <- tm_map(corpus.samples, removeWords, profanity.ds$word)
# Release memory
rm(profanity.ds, funPatternToSpace)
# Remove Stopwords
corpus.samples <- tm_map(corpus.samples, removeWords, stopwords("english"))
# Remove Additional white spaces
corpus.samples <- tm_map(corpus.samples, stripWhitespace)
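The order of these steps matters. `removeWords` matches case-sensitively and `stopwords("english")` is all lower case, so capitalised stopwords such as "I" or a sentence-initial "The" are not removed when stopword removal runs before lowercasing. That is consistent with `i` and `the` heading the 1-gram table later in this report. The sketch below illustrates the effect on a made-up sentence; it is an illustration, not a change to the pipeline that produced the reported results.

```r
library(tm)

x <- "The cat and I sat on the mat"

# Stopword removal first, lowercasing second (the order used in this report):
# "The" and "I" do not match the lower-case stop list and stay in the text.
before <- tolower(removeWords(x, stopwords("english")))

# Lowercasing first: every stopword now matches and is removed.
after <- removeWords(tolower(x), stopwords("english"))

before
after
```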
To finish pre-processing, we transform the resulting data before analysing specific patterns. The corpus is changed in the following way:
# Transform to lower case (wrapped in content_transformer so documents keep their class)
corpus.samples <- tm_map(corpus.samples, content_transformer(tolower))
# Stemming (reduce words to their stems)
corpus.samples <- tm_map(corpus.samples, stemDocument, language="english")
# Transform again to plain text
corpus.samples <- tm_map(corpus.samples, PlainTextDocument)
corpus.df <- data.frame(text=unlist(sapply(corpus.samples, identity)),stringsAsFactors=FALSE)
# Release memory
rm(corpus.samples)
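The n-gram sections that follow call `findNGrams`, whose definition is not shown in this report. Since `RWeka` is loaded, the original probably relied on `NGramTokenizer`, but that is a guess. The sketch below is a hypothetical base-R stand-in, assuming the helper takes the data frame, the n-gram size and the number of top entries, and returns a data frame with `Words` and `Count` columns sorted by decreasing count.

```r
# Hypothetical reconstruction: the report calls findNGrams() without defining it.
# Assumed contract: return the `top` most frequent n-grams of size `n`
# as a data frame with columns Words and Count.
findNGrams <- function(df, n, top) {
  # Tokenise each document on whitespace and build n-grams within it,
  # so that no n-gram spans two documents.
  grams <- unlist(lapply(df$text, function(line) {
    tokens <- strsplit(trimws(line), "\\s+")[[1]]
    tokens <- tokens[nzchar(tokens)]
    if (length(tokens) < n) return(character(0))
    starts <- seq_len(length(tokens) - n + 1)
    vapply(starts,
           function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
           character(1))
  }))
  # Count occurrences and keep the most frequent entries
  counts <- head(sort(table(grams), decreasing = TRUE), top)
  data.frame(Words = names(counts),
             Count = as.integer(counts),
             stringsAsFactors = FALSE)
}

# Toy usage
toy.df <- data.frame(text = c("a b c", "a b", "a"), stringsAsFactors = FALSE)
findNGrams(toy.df, 2, 1)
```

Unlike the tables in this report, the result of this sketch carries no original row indices, because the top entries are rebuilt into a fresh data frame.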
#wordcloud(uniGram$Words, uniGram$Count, min.freq=100, colors=brewer.pal(6, "Dark2"))
uniGram <- findNGrams(corpus.df, 1, 20)
p <- ggplot(uniGram, aes(Words, Count)) + geom_col(fill="lightblue", color="darkblue") + labs(title="1-gram") + theme(axis.text.x = element_text(angle = 90, hjust = 1))
p
formattable(uniGram)
| | Words | Count |
|---|---|---|
| 5265 | i | 8234 |
| 10689 | the | 2103 |
| 7629 | one | 1449 |
| 11834 | will | 1309 |
| 4463 | get | 1285 |
| 6282 | like | 1213 |
| 1770 | can | 1151 |
| 10837 | time | 1140 |
| 5838 | just | 1055 |
| 4536 | go | 944 |
| 6541 | make | 862 |
| 2807 | day | 861 |
| 6430 | love | 856 |
| 7339 | new | 855 |
| 5632 | it | 840 |
| 12040 | year | 788 |
| 11914 | work | 780 |
| 11386 | use | 764 |
| 7493 | now | 757 |
| 5983 | know | 731 |
# Release memory
rm(uniGram, p)
biGrams <- findNGrams(corpus.df, 2, 20)
p <- ggplot(biGrams, aes(Words, Count)) + geom_col(fill="red", color="darkred") + labs(title="2-gram") + theme(axis.text.x = element_text(angle = 90, hjust = 1))
p
formattable(biGrams)
| | Words | Count |
|---|---|---|
| 28282 | i love | 250 |
| 28550 | i think | 227 |
| 28242 | i just | 173 |
| 28600 | i will | 170 |
| 28588 | i want | 165 |
| 28251 | i know | 158 |
| 28082 | i donât | 157 |
| 28029 | i can | 144 |
| 28084 | i dont | 142 |
| 28553 | i thought | 132 |
| 28317 | i need | 127 |
| 28577 | i use | 119 |
| 61381 | time i | 117 |
| 28158 | i get | 114 |
| 32513 | know i | 114 |
| 28129 | i find | 111 |
| 28268 | i like | 100 |
| 28121 | i feel | 98 |
| 33057 | last year | 95 |
| 28413 | i realli | 84 |
# Release memory
rm(biGrams, p)
triGrams <- findNGrams(corpus.df, 3, 20)
p <- ggplot(triGrams, aes(Words, Count)) + geom_col(fill="purple", color="darkblue") + labs(title="3-gram") + theme(axis.text.x = element_text(angle = 90, hjust = 1))
p
formattable(triGrams)
| | Words | Count |
|---|---|---|
| 35210 | i think i | 48 |
| 34347 | i know i | 47 |
| 9049 | boy big sword | 36 |
| 43054 | littl boy big | 36 |
| 33794 | i dont know | 35 |
| 34643 | i must say | 35 |
| 33780 | i donât think | 34 |
| 33802 | i dont think | 31 |
| 27119 | gaston south carolina | 30 |
| 68089 | south carolina attract | 30 |
| 40914 | last night i | 29 |
| 83576 | work incred pleas | 28 |
| 33620 | i can get | 27 |
| 33773 | i donât know | 27 |
| 58462 | pu bef th | 27 |
| 35252 | i thought i | 26 |
| 34524 | i love toast | 24 |
| 44386 | love toast mom | 24 |
| 34290 | i just love | 23 |
| 41899 | let just say | 23 |
# Release memory
rm(triGrams, p)
quadriGrams <- findNGrams(corpus.df, 4, 20)
p <- ggplot(quadriGrams, aes(Words, Count)) + geom_col(fill="green", color="black") + labs(title="4-gram") + theme(axis.text.x = element_text(angle = 90, hjust = 1))
p
formattable(quadriGrams)
| | Words | Count |
|---|---|---|
| 46968 | littl boy big sword | 36 |
| 29414 | gaston south carolina attract | 30 |
| 37607 | i love toast mom | 24 |
| 48413 | love toast mom i | 19 |
| 10981 | buy time fell th | 18 |
| 11862 | canât buy time fell | 18 |
| 79098 | th king john castl | 18 |
| 66701 | respond email data entri | 16 |
| 1343 | across page can find | 15 |
| 1345 | across photo entitl typhoon | 15 |
| 6054 | awesom pictur i ever | 15 |
| 8922 | blog regular near often | 15 |
| 11343 | came across photo entitl | 15 |
| 11588 | can find support tip | 15 |
| 15728 | complet unrel search pictur | 15 |
| 17305 | creativ kut scrap bug | 15 |
| 20905 | dont blog regular near | 15 |
| 21959 | easier life laughter hope | 15 |
| 23032 | enough i hope stumbl | 15 |
| 23172 | entitl typhoon parti okinawa | 15 |
# Release memory
rm(quadriGrams, p)
Creating the prediction algorithm:
* Segment the analysis by source type (blog, news or social media)
* Enhance data cleaning (for example, removing foreign-language text)
* Find patterns in the tokens
* Take advantage of more advanced models, such as Hidden Markov Models
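As a very rough starting point for the prediction step, the bigram counts already computed can serve as a lookup table: take the last word typed and return the most frequent second words that followed it. The sketch below assumes a hypothetical helper `suggestNext` and uses four rows copied from the 2-gram table above as toy input. It is not the final model, and because the corpus is stemmed, the suggestions are stems rather than full words.

```r
# Hypothetical helper: suggest up to k continuations for the last word of `phrase`,
# ranked by bigram count. `bigrams` needs the columns Words and Count.
suggestNext <- function(bigrams, phrase, k = 3) {
  tokens <- strsplit(tolower(trimws(phrase)), "\\s+")[[1]]
  if (length(tokens) == 0) return(character(0))
  last <- tokens[length(tokens)]

  # Split each bigram into its first and second word
  parts  <- strsplit(bigrams$Words, " ", fixed = TRUE)
  first  <- vapply(parts, `[`, character(1), 1)
  second <- vapply(parts, `[`, character(1), 2)

  # Keep the bigrams that start with the last typed word, most frequent first
  hits <- which(first == last)
  hits <- hits[order(-bigrams$Count[hits])]
  head(second[hits], k)
}

# Toy input: rows taken from the 2-gram table in this report
toy.bigrams <- data.frame(
  Words = c("i love", "i think", "i just", "last year"),
  Count = c(250, 227, 173, 95),
  stringsAsFactors = FALSE
)

suggestNext(toy.bigrams, "Yesterday I", k = 2)
```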
Developing a Shiny app:
* Create a simple user interface (based on messaging apps)
* As the user types text, the Shiny app suggests words to complete the sentence