Capstone Project - Milestone Report

Summary

The mobile phone has become the technological centerpiece of everyday life. People interact with their phones by entering text into numerous apps, which can be painful depending on the type and amount of information the app requests. Predictive text modeling is the core of smart keyboards, which are designed to make typing on mobile phones easier.

The first steps in building a predictive text model are to import a corpus of text files, explore the data, and build a training data set. This milestone report covers:

Analysis of imported files
Results of exploration of each data file
Conclusions
Next steps

File Import/Analysis

Data were imported directly from the link provided by Coursera: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip. Files containing text samples from news sites, blogs, and Twitter posts in 4 languages were downloaded and unzipped. Only the English files were used for this project.

File	File size	Line count	Word count
news.txt	205.8 MB	1010 K lines	34.8 M words
blog.txt	210.2 MB	899 K lines	38.2 M words
twitter.txt	167.1 MB	2360 K lines	30.7 M words
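The figures in the table above could be gathered with a short routine like the following. This is only a minimal sketch in Python, not the code used for this report; the function name `file_stats` is made up for illustration, and it assumes that a "word" is any whitespace-delimited token and that sizes are reported in decimal megabytes.

```python
import os
import tempfile

def file_stats(path, encoding="utf-8"):
    """Return (size in MB, line count, word count) for one text file."""
    size_mb = os.path.getsize(path) / 1e6  # decimal megabytes (assumption)
    lines = 0
    words = 0
    # Read line by line so that large files never sit in memory at once.
    with open(path, encoding=encoding, errors="ignore") as handle:
        for line in handle:
            lines += 1
            words += len(line.split())  # whitespace-delimited tokens
    return size_mb, lines, words

# Small self-check on a temporary file (not one of the corpus files).
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False,
                                 encoding="utf-8") as tmp:
    tmp.write("the quick brown fox\njumps over\n")
    demo_path = tmp.name

demo_size_mb, demo_lines, demo_words = file_stats(demo_path)
os.remove(demo_path)
```

Word counts obtained this way depend on the tokenization rule, so they may differ slightly from counts produced by other tools.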

Data Cleaning and Exploration

Data exploration was done after cleaning because the cleaned form is what would ultimately be used for modeling. The imported data set was cleaned by

removing profanity,
removing punctuation, numbers, white space, and stopwords, and
changing all letters to lower case.

Data were then converted to a data frame in order to use dplyr and tidyr packages for exploration.
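The cleaning steps listed above can be sketched as follows. This is an illustrative Python version rather than the code behind this report, and it rests on several assumptions: the `STOPWORDS` and `PROFANITY` sets are tiny placeholders (the report does not say which lists were used), the name `clean_line` is made up, and any character outside a-z is treated as punctuation, which is only reasonable for English text.

```python
import re

# Placeholder word lists for illustration only; the report does not say
# which stopword or profanity lists were used.
STOPWORDS = {"the", "a", "an", "and", "is", "to", "of", "in"}
PROFANITY = {"darn"}

def clean_line(line, stopwords=STOPWORDS, profanity=PROFANITY):
    """Lower-case a line, strip punctuation and digits, drop unwanted words."""
    text = line.lower()
    text = re.sub(r"[^a-z\s]", " ", text)  # punctuation and numbers become spaces
    tokens = text.split()                  # split() also collapses extra white space
    return [t for t in tokens if t not in stopwords and t not in profanity]

cleaned = clean_line("The 2 dogs don't like   the darn RAIN!")
# cleaned == ['dogs', 'don', 't', 'like', 'rain']
```

Note how removing the apostrophe splits "don't" into "don" and "t", which is why fragments such as "don" can show up in n-gram tables built from cleaned text.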

Unigram Summary Table

blogword	n	twitword	n1	newsword	n2
one	136401	just	149870	said	250385
can	119881	get	146138	year	128720
will	116070	can	135746	will	111046
like	111913	thank	130898	one	92363
time	108576	like	130109	time	72330
just	100496	go	128032	new	70757
get	94992	love	123791	can	70702
go	83196	day	110643	state	68145
make	81342	good	101831	two	63865
day	72572	will	95901	say	63155

A summary table was added so that the top 10 most frequent words in all 3 text files can be seen side-by-side for comparison. As expected, there is a significant overlap in words between the text files, but the overlapping words do not rank the same in each file.
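The overlap can be checked directly from the Unigram Summary Table. The short Python sketch below copies the three top-10 lists from the table and intersects them; the helper name `shared` is made up for illustration.

```python
# Top-10 unigrams copied from the Unigram Summary Table.
top10 = {
    "blog": ["one", "can", "will", "like", "time", "just", "get", "go", "make", "day"],
    "twitter": ["just", "get", "can", "thank", "like", "go", "love", "day", "good", "will"],
    "news": ["said", "year", "will", "one", "time", "new", "can", "state", "two", "say"],
}

def shared(a, b):
    """Words that appear in both top-10 lists, sorted alphabetically."""
    return sorted(set(top10[a]) & set(top10[b]))

blog_twitter = shared("blog", "twitter")
blog_news = shared("blog", "news")
twitter_news = shared("twitter", "news")
all_three = sorted(set(top10["blog"]) & set(top10["twitter"]) & set(top10["news"]))
```

With these lists, blogs and Twitter share seven of their top ten words, blogs and news share four, and Twitter and news share only two ("can" and "will"), which are also the only words common to all three.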

Unigram Word Frequency

Based on the top ten word counts for each file, it is clear that the frequency profile of words differs between files, and the frequency of appearance of the top 100 words shows the same difference. This indicates that care should be taken in building the training data set to ensure that each text file contributes an equal number of words.

Bigrams

blogbigram	n	twitbigram	n1	newsbigram	n2
look like	82	right now	177	last year	159
don know	79	last night	138	year old	146
year old	72	can wait	132	new york	114
year ago	68	thank follow	127	new jersey	111
feel like	62	look forward	122	st loui	108
last year	59	look like	118	year ago	105
right now	57	feel like	93	high school	93
make sure	47	follow back	90	last week	74
can get	46	happi birthday	87	san francisco	64
can see	46	don know	75	two year	60

Bigram Summary Table

The bigram data show fewer overlapping entries between text files than the unigram data. The more formal writing styles of the news and blog files produce more similar patterns compared with the informal style used on Twitter.
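For reference, n-gram tables of this kind can be produced with a sliding window over the cleaned tokens. The Python sketch below is a generic illustration, not the code used for this report; the function names `ngrams` and `top_ngrams` are made up, and it assumes each line has already been cleaned and split into a list of tokens. The same functions cover bigrams (n = 2) and trigrams (n = 3).

```python
from collections import Counter

def ngrams(tokens, n):
    """Return the n-grams (as space-joined strings) in one token list."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def top_ngrams(token_lines, n, k=10):
    """Count n-grams line by line so that none spans two lines."""
    counts = Counter()
    for tokens in token_lines:
        counts.update(ngrams(tokens, n))
    return counts.most_common(k)

# Toy input for illustration; real input would be the cleaned corpus lines.
sample = [["new", "york", "citi"], ["new", "york", "time"], ["last", "year"]]
bigram_counts = top_ngrams(sample, 2)
trigram_counts = top_ngrams(sample, 3)
```

Lines shorter than n tokens simply contribute no n-grams, so short tweets do not need special handling.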

Trigrams

blogtrigram	n	twittrigram	n1	newstrigram	n2
new york citi	10	let us know	29	presid barack obama	16
long time ago	8	can wait see	26	new york citi	11
incorpor item pp	7	happi mother day	22	st loui counti	11
make look like	7	happi new year	22	three year ago	10
amazon servic llc	6	book book book	19	said year old	9
can wait see	6	realli realli realli	14	first time sinc	8
coupl week ago	6	happi valentin day	12	five year ago	8
let just say	6	look forward see	12	past three year	8
one way anoth	6	can wait till	9	st charl counti	8
unit state america	6	cinco de mayo	8	two year ago	8

Trigram Summary Table

Review of the trigram results shows very little in common between the 3 files.

Conclusions

The major conclusion from the data exploration is that, because the 3 text files differ in word makeup, each file should be approximately equally represented in the training data set.
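One possible way to get roughly equal representation is to draw random lines from each source until a fixed word budget is met. The Python sketch below is only an illustration under that assumption; the sampling scheme has not been decided in this report, and the function name `balanced_sample`, the `words_per_source` parameter, and the toy data are all made up.

```python
import random

def balanced_sample(sources, words_per_source, seed=1):
    """Draw random lines from each source until a word budget is reached.

    `sources` maps a source name to a list of text lines.
    """
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    sample = {}
    for name, lines in sources.items():
        order = list(range(len(lines)))
        rng.shuffle(order)
        picked, total = [], 0
        for idx in order:
            if total >= words_per_source:
                break
            picked.append(lines[idx])
            total += len(lines[idx].split())
        sample[name] = picked
    return sample

# Toy data only; the real inputs would be the three cleaned files.
toy = {
    "blog": ["one two three four"] * 50,
    "twitter": ["one two"] * 50,
    "news": ["one two three four five"] * 50,
}
result = balanced_sample(toy, words_per_source=20)
word_totals = {name: sum(len(line.split()) for line in lines)
               for name, lines in result.items()}
```

Budgeting by words rather than by lines matters here because the line lengths differ a great deal: the Twitter file has more than twice as many lines as the other two but fewer words in total.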

Next Steps

Combine the three text files and create the training data set.
- Determine the smallest sample size from the combined data set.
Determine how to build the predictive model.
Determine how to measure the accuracy of the predictive model.
Measure RAM usage and speed.
Fine tune the predictive model and find the best balance between RAM usage and speed.
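As one candidate for the accuracy measure mentioned above, a smart keyboard that offers a few suggestions can be scored by how often the true next word is among its top k guesses. The Python sketch below illustrates that idea only; the metric has not been chosen in this report, and `top_k_accuracy`, `toy_predict`, and the lookup table are hypothetical stand-ins for a real model.

```python
def top_k_accuracy(predict, test_pairs, k=3):
    """Fraction of test cases whose true next word is among the top-k guesses.

    `predict` is any function mapping a context string to a ranked word list.
    """
    if not test_pairs:
        return 0.0
    hits = sum(1 for context, target in test_pairs
               if target in predict(context)[:k])
    return hits / len(test_pairs)

# Hypothetical stand-in predictor backed by a tiny lookup table.
lookup = {"new york": ["citi", "time", "state"], "happi mother": ["day"]}

def toy_predict(context):
    return lookup.get(context, [])

pairs = [("new york", "citi"), ("happi mother", "day"),
         ("st loui", "counti"), ("new york", "post")]
score = top_k_accuracy(toy_predict, pairs, k=3)
```

Here two of the four test cases are predicted within the top three suggestions, giving a score of 0.5. The test pairs should come from text held out of the training data set.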