DataScience Specialization - Capstone Project

Synopsis

Nowadays people spend more and more time on their mobile devices for a variety of activities, but typing on a mobile device can be a serious pain.
For this reason, SwiftKey provides an excellent product: a smart keyboard that makes it easy for users to type faster.
In this capstone project, given some datasets, we are going to develop (a) a prediction model that suggests words to complete the phrases users are writing and (b) a Shiny app that simulates real situations.
In this first milestone report, we show some details of the acquired dataset, explore the features and structure of the data, and start gathering insights for the model that will be created.

Environment config

The first step is to set up the environment:

Load the libraries that will support the analysis

library(tm)
library(ggplot2)
library(dplyr)
library(RWeka)
library(stringi)
library(formattable)
library(SnowballC)
library(parallel)
library(wordcloud)

# Set up a parallel cluster to reduce runtime
jobcluster <- makeCluster(detectCores())
invisible(clusterEvalQ(jobcluster, library(tm)))
invisible(clusterEvalQ(jobcluster, library(RWeka)))

Datasets

In this step, we load all three datasets (blogs, Twitter and news) and, to improve performance, sample 1% of the lines from the blog and news files and 0.1% of the lines from the Twitter file.

Loading

IMPORTANT: the data is provided by Coursera/SwiftKey and can be found at: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip
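
Once the archive is unzipped, each file can be read through a binary connection with an explicit encoding. This is a minimal sketch rather than the code used for this report: readCorpusFile is a hypothetical helper, and the assumption is that the files are UTF-8 text. On some platforms (notably Windows) reading in text mode can stop early at certain control characters, and tokens such as "donât" in the n-gram tables may indicate that curly apostrophes were decoded with the wrong encoding.

```r
# Hypothetical helper (not part of the original report): read a text file
# through a binary connection, declaring UTF-8 and skipping embedded NULs.
readCorpusFile <- function(path) {
  con <- file(path, open = "rb")
  on.exit(close(con))
  readLines(con, encoding = "UTF-8", skipNul = TRUE, warn = FALSE)
}

# Usage, assuming the file sits in the working directory:
# blogs.ds <- readCorpusFile("en_US.blogs.txt")
```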

set.seed(20170517)

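# Loading, sampling and summarizing the blog dataset
blogs.ds <- readLines("en_US.blogs.txt")
# stri_stats_latex(...)[4] is the word count ("Words")
blogs.ds.summary <- c(stri_stats_general(blogs.ds), stri_stats_latex(blogs.ds)[4])
# NOTE: rbinom(n, N, 0.50) returns n indices clustered around N/2, with repeats,
# so this is not a uniform sample of the file; sample.int(N, n) would draw uniformly.
# The same applies to the news and twitter samples below.
blogs.sample <- blogs.ds[rbinom(length(blogs.ds)*0.01, length(blogs.ds), 0.50)]
blogs.sample.summary <- c(stri_stats_general(blogs.sample), stri_stats_latex(blogs.sample)[4])
# release memory
rm(blogs.ds)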

# Loading, Sampling and Summarizing News Dataset
news.ds <- readLines("en_US.news.txt")
news.ds.summary <- c(stri_stats_general(news.ds), stri_stats_latex(news.ds)[4])
news.sample <- news.ds[rbinom(length(news.ds)*0.01, length(news.ds), 0.50)]
news.sample.summary <- c(stri_stats_general(news.sample), stri_stats_latex(news.sample)[4])
# release memory
rm(news.ds)

# Loading, Sampling and Summarizing Twitter Dataset
twitter.ds <- readLines("en_US.twitter.txt")
twitter.ds.summary <- c(stri_stats_general(twitter.ds), stri_stats_latex(twitter.ds)[4])
twitter.sample <- twitter.ds[rbinom(length(twitter.ds)*0.001, length(twitter.ds), 0.50)]
twitter.sample.summary <- c(stri_stats_general(twitter.sample), stri_stats_latex(twitter.sample)[4])
# release memory
rm(twitter.ds)
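
The sampling calls above pick line indices with rbinom, which draws from a binomial distribution centred on half the file length. The sketch below compares that with a uniform draw without replacement. It is an illustration under stated assumptions, not the code that produced the tables in this report: sampleLines is a hypothetical helper, and N is the full blog line count taken from the summary table.

```r
set.seed(20170517)

N <- 899288             # full blog line count reported in the summary table
n <- floor(N * 0.01)    # 1% sample size

# Indices as drawn in the report: binomial, concentrated near N / 2
binomIdx <- rbinom(n, N, 0.50)

# Uniform alternative: every line is equally likely, no repeats
uniformIdx <- sample.int(N, n)

range(binomIdx)            # a narrow band around the middle of the file
anyDuplicated(binomIdx)    # non-zero: some lines are picked more than once
range(uniformIdx)          # spans most of the file
anyDuplicated(uniformIdx)  # 0: no line is picked twice

# Hypothetical helper: uniform sample of a fraction of the lines
sampleLines <- function(lines, fraction) {
  lines[sample.int(length(lines), floor(length(lines) * fraction))]
}
```

With a helper of this kind, a call such as sampleLines(blogs.ds, 0.01) would return the same number of lines as the original code, spread across the whole file.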

Summary

Dataset	Type	Lines	LinesNEmpty	Chars	CharsNWhite	Words
Blog	Full	899288	899288	208361438	171926076	37865888
Blog	Sample	8992	8992	2088794	1724016	377286
News	Full	77259	77259	15683765	13117038	2665742
News	Sample	772	772	150140	125470	25269
Twitter	Full	2360148	2360148	162384825	134370864	30578891
Twitter	Sample	2360	2360	162182	134143	30485

Dataset pre-processing (cleaning and preparing data)

In this section, we apply some transformations to the data in order to reduce complexity and remove irrelevant details.

Removing

We start by removing the following pieces of data:

Special characters
Numbers
Punctuation
Profanity (word list from Luis von Ahn’s Research Group / CMU)
Stopwords
Additional white spaces

# Consolidate samples into one object
corpus.samples <- Corpus(VectorSource(c(blogs.sample, news.sample, twitter.sample)))

# Release memory
rm(blogs.sample, news.sample, twitter.sample)

# Function to replace patterns with whitespace
funPatternToSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))

# Remove special characters
corpus.samples <- tm_map(corpus.samples, funPatternToSpace,"\"|/|@|\\|")

# Remove Numbers
corpus.samples <- tm_map(corpus.samples, removeNumbers)

# Remove Punctuation
corpus.samples <- tm_map(corpus.samples, removePunctuation)

# Remove Profanity words (from Luis von Ahn's Research Group)
profanity.ds <- read.csv(url("https://www.cs.cmu.edu/~biglou/resources/bad-words.txt"), header = FALSE, col.names = c("word"))
corpus.samples <- tm_map(corpus.samples, removeWords, profanity.ds$word)

# Release memory
rm(profanity.ds, funPatternToSpace)

# Remove Stopwords
corpus.samples <- tm_map(corpus.samples, removeWords, stopwords("english"))

# Remove Additional white spaces
corpus.samples <- tm_map(corpus.samples, stripWhitespace)

Transforming

To finish pre-processing, we transform the resulting data before analysing specific patterns. We change the corpus in the following ways:

Lower case
Stemming (reduce words to their stems)
Plain text

# Transform to lower case (content_transformer keeps the documents valid for tm)
corpus.samples <- tm_map(corpus.samples, content_transformer(tolower))

# Stemming (reduce words to their stems)
corpus.samples <- tm_map(corpus.samples, stemDocument, language="english")

# Transform again to plain text
corpus.samples <- tm_map(corpus.samples, PlainTextDocument)
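
The order of these steps matters. In the pipeline above, stopwords are removed before the text is lowercased, so capitalized forms such as "I" or "The" are not matched by the lowercase stopword list and only become "i" and "the" afterwards. The sketch below illustrates the effect with base R only. dropWords and squeeze are hypothetical stand-ins for removeWords and stripWhitespace, and the three-word stopword list is purely illustrative.

```r
# Hypothetical base-R stand-in for tm's removeWords:
# whole-word, case-sensitive removal of the listed words.
dropWords <- function(x, words) {
  pattern <- sprintf("\\b(%s)\\b", paste(words, collapse = "|"))
  gsub(pattern, "", x, perl = TRUE)
}

# Hypothetical stand-in for stripWhitespace: collapse runs of spaces
squeeze <- function(x) trimws(gsub("\\s+", " ", x))

stops <- c("the", "i", "a")          # tiny illustrative stopword list
text  <- "The cat and I saw the dog"

# Order used in this report: remove stopwords, then lowercase
removedFirst <- tolower(squeeze(dropWords(text, stops)))

# Alternative order: lowercase, then remove stopwords
loweredFirst <- squeeze(dropWords(tolower(text), stops))

removedFirst   # "the cat and i saw dog"
loweredFirst   # "cat and saw dog"
```

This may explain why "i" and "the" still appear among the most frequent terms in the exploratory analysis even though stopwords were removed.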

Exploratory Analysis

n-gram

corpus.df <- data.frame(text=unlist(sapply(corpus.samples, identity)),stringsAsFactors=FALSE)

# Release memory
rm(corpus.samples)

#wordcloud(uniGram$Words, uniGram$Count, min.freq=100, colors=brewer.pal(6, "Dark2"))
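
The calls to findNGrams in the following sections rely on a helper whose definition is not shown in this report. The sketch below is one possible base-R version, written as an assumption and not as the author's implementation (which may well use NGramTokenizer from RWeka). It assumes the input is a data frame with a text column, splits on whitespace, and returns the top n-grams in a data frame with Words and Count columns, matching the shape of the tables below.

```r
# Assumed signature: findNGrams(corpus.df, n, top)
#   corpus.df  data frame with a character column `text`
#   n          n-gram size (1 = unigram, 2 = bigram, ...)
#   top        number of most frequent n-grams to return
findNGrams <- function(corpus.df, n, top) {
  # Split each line into whitespace-separated tokens
  tokens <- strsplit(trimws(corpus.df$text), "\\s+")

  # Build the n-grams of one line; lines shorter than n yield none
  lineNGrams <- function(words) {
    words <- words[nzchar(words)]
    if (length(words) < n) return(character(0))
    starts <- seq_len(length(words) - n + 1)
    vapply(starts,
           function(i) paste(words[i:(i + n - 1)], collapse = " "),
           character(1))
  }

  grams <- unlist(lapply(tokens, lineNGrams))

  # Tabulate, sort by descending count and keep the top rows
  counts <- as.data.frame(table(Words = grams),
                          responseName = "Count",
                          stringsAsFactors = FALSE)
  counts <- counts[order(-counts$Count), ]
  head(counts, top)
}
```

N-grams are built within each line, so they never span two documents. The row names of the result are the alphabetical positions of the n-grams in the full frequency table.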

1-gram

uniGram <- findNGrams(corpus.df, 1, 20)
p <- ggplot(uniGram, aes(Words, Count)) + geom_col(fill="lightblue", color="darkblue") + labs(title="1-gram") + theme(axis.text.x = element_text(angle = 90, hjust = 1))
p

formattable(uniGram)

	Words	Count
5265	i	8234
10689	the	2103
7629	one	1449
11834	will	1309
4463	get	1285
6282	like	1213
1770	can	1151
10837	time	1140
5838	just	1055
4536	go	944
6541	make	862
2807	day	861
6430	love	856
7339	new	855
5632	it	840
12040	year	788
11914	work	780
11386	use	764
7493	now	757
5983	know	731

# Release memory
rm(uniGram, p)

2-gram

biGrams <- findNGrams(corpus.df, 2, 20)
p <- ggplot(biGrams, aes(Words, Count)) + geom_col(fill="red", color="darkred") + labs(title="2-gram") + theme(axis.text.x = element_text(angle = 90, hjust = 1))
p

formattable(biGrams)

	Words	Count
28282	i love	250
28550	i think	227
28242	i just	173
28600	i will	170
28588	i want	165
28251	i know	158
28082	i donât	157
28029	i can	144
28084	i dont	142
28553	i thought	132
28317	i need	127
28577	i use	119
61381	time i	117
28158	i get	114
32513	know i	114
28129	i find	111
28268	i like	100
28121	i feel	98
33057	last year	95
28413	i realli	84

# Release memory
rm(biGrams, p)

3-gram

triGrams  <- findNGrams(corpus.df, 3, 20)
p <- ggplot(triGrams, aes(Words, Count)) + geom_col(fill="purple", color="darkblue") + labs(title="3-gram") + theme(axis.text.x = element_text(angle = 90, hjust = 1))
p

formattable(triGrams)

	Words	Count
35210	i think i	48
34347	i know i	47
9049	boy big sword	36
43054	littl boy big	36
33794	i dont know	35
34643	i must say	35
33780	i donât think	34
33802	i dont think	31
27119	gaston south carolina	30
68089	south carolina attract	30
40914	last night i	29
83576	work incred pleas	28
33620	i can get	27
33773	i donât know	27
58462	pu bef th	27
35252	i thought i	26
34524	i love toast	24
44386	love toast mom	24
34290	i just love	23
41899	let just say	23

# Release memory
rm(triGrams, p)

4-gram

quadriGrams <- findNGrams(corpus.df, 4, 20)
p <- ggplot(quadriGrams, aes(Words, Count)) + geom_col(fill="green", color="black") + labs(title="4-gram") + theme(axis.text.x = element_text(angle = 90, hjust = 1))
p

formattable(quadriGrams)

	Words	Count
46968	littl boy big sword	36
29414	gaston south carolina attract	30
37607	i love toast mom	24
48413	love toast mom i	19
10981	buy time fell th	18
11862	canât buy time fell	18
79098	th king john castl	18
66701	respond email data entri	16
1343	across page can find	15
1345	across photo entitl typhoon	15
6054	awesom pictur i ever	15
8922	blog regular near often	15
11343	came across photo entitl	15
11588	can find support tip	15
15728	complet unrel search pictur	15
17305	creativ kut scrap bug	15
20905	dont blog regular near	15
21959	easier life laughter hope	15
23032	enough i hope stumbl	15
23172	entitl typhoon parti okinawa	15

# Release memory
rm(quadriGrams, p)

Conclusion

Next Steps

Creating the prediction algorithm
    Segment the analysis by source type (blog, news or social)
    Enhance data cleaning (removing foreign-language text, for example)
    Find patterns in tokens
    Take advantage of more advanced models, such as Hidden Markov Models
Developing a Shiny app
    Create a simple user interface (based on messaging apps)
    As the user types some text, the Shiny app suggests words to complete the sentence
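
As a starting point for the prediction algorithm, a bigram frequency table of the kind shown in the exploratory analysis can already rank candidate next words. The sketch below is only an illustration of that idea, not the planned model: suggestNext is a hypothetical function, and the small table reuses four rows from the 2-gram results above.

```r
# Hypothetical helper: rank the words that most often follow `word`
# in a bigram table with columns Words ("first second") and Count.
suggestNext <- function(bigrams, word, k = 3) {
  parts  <- strsplit(bigrams$Words, " ", fixed = TRUE)
  first  <- vapply(parts, `[`, character(1), 1)
  second <- vapply(parts, `[`, character(1), 2)

  hits   <- first == tolower(word)          # bigrams starting with `word`
  ranked <- order(-bigrams$Count[hits])     # most frequent first
  head(second[hits][ranked], k)
}

# Four rows taken from the 2-gram table in this report
bigrams <- data.frame(
  Words = c("i love", "i think", "i just", "last year"),
  Count = c(250, 227, 173, 95),
  stringsAsFactors = FALSE
)

suggestNext(bigrams, "I", 2)     # "love" "think"
suggestNext(bigrams, "last")     # "year"
```

A fuller model would fall back to shorter n-grams when the typed context has not been seen, which is one reason the report builds 1-gram to 4-gram tables.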

DataScience Specialization - Capstone Project - Milestone Report

Thiago Vaz

2017 May 13th