Summary

The mobile phone has become the technological centerpiece of everyday life. People interact with their phones by entering text into numerous apps, and this can be painful depending on the type and amount of information requested by the app. Predictive text modeling is the centerpiece of smart keyboards, which are designed to ease the typing of information into mobile phones.

The first steps in building a predictive text model are to import a corpus of text files, explore the data, and then build a training data set. This milestone report covers the file import and analysis, the data cleaning and exploration, and the conclusions drawn for the next steps.

File Import/Analysis

Data were imported directly from the link provided by Coursera: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip. Files containing text samples from news sites, blogs, and Twitter posts in four languages were downloaded and unzipped. Only the files in English were used for this project.

File         Size      Line count  Word count
news.txt     205.8 MB  1,010 K     34.8 M
blog.txt     210.2 MB  899 K       38.2 M
twitter.txt  167.1 MB  2,360 K     30.7 M

Data Cleaning and Exploration

Data exploration was done after cleaning, since that is the form of the data that would ultimately be used for modeling. The uploaded data set was cleaned by

  • removing profanity,
  • removing punctuation, numbers, white space, and stopwords, and
  • changing all letters to lower case.

Data were then converted to a data frame in order to use the dplyr and tidyr packages for exploration.
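A minimal sketch of these cleaning and conversion steps, assuming the tm package is used (the report does not name the cleaning library) and that `profanity_list` is a hypothetical character vector of words to remove:

```r
library(tm)

# Build a corpus from the English text files (path is a placeholder).
corpus <- VCorpus(DirSource("final/en_US/"))

# Cleaning steps described above.
corpus <- tm_map(corpus, content_transformer(tolower))   # lower case
corpus <- tm_map(corpus, removeWords, profanity_list)    # profanity
corpus <- tm_map(corpus, removePunctuation)              # punctuation
corpus <- tm_map(corpus, removeNumbers)                  # numbers
corpus <- tm_map(corpus, removeWords, stopwords("en"))   # stopwords
corpus <- tm_map(corpus, stripWhitespace)                # extra white space

# Convert to a data frame for exploration with dplyr and tidyr.
text_df <- data.frame(doc  = names(corpus),
                      text = sapply(corpus, function(d) paste(content(d), collapse = " ")),
                      stringsAsFactors = FALSE)
```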

Unigram Summary Table

Blog word   n        Twitter word   n        News word   n
one         136401   just           149870   said        250385
can         119881   get            146138   year        128720
will        116070   can            135746   will        111046
like        111913   thank          130898   one         92363
time        108576   like           130109   time        72330
just        100496   go             128032   new         70757
get         94992    love           123791   can         70702
go          83196    day            110643   state       68145
make        81342    good           101831   two         63865
day         72572    will           95901    say         63155

A summary table was added so that the top 10 most frequent words in all 3 text files can be seen side-by-side for comparison. As expected, there is a significant overlap in words between the text files, but the overlapping words do not rank the same in each file.  
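A sketch of how such a side-by-side table can be built with tidytext and dplyr (assumed tooling; `blog_df`, `twit_df`, and `news_df` are hypothetical one-column data frames of cleaned text):

```r
library(dplyr)
library(tidytext)

# Top ten unigrams for one source.
top10 <- function(df) {
  df %>%
    unnest_tokens(word, text) %>%   # one token per row
    count(word, sort = TRUE) %>%    # unigram frequencies
    slice_head(n = 10)
}

# Combine the three top-ten lists column-wise; the duplicate `n` columns
# are renamed automatically when the columns are bound together.
unigram_summary <- bind_cols(
  top10(blog_df) %>% rename(blogword = word),
  top10(twit_df) %>% rename(twitword = word),
  top10(news_df) %>% rename(newsword = word)
)
```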

Unigram Word Frequency

Based on the top ten word counts for each file, it is clear that the frequency profile of words differs between files. Looking at the frequency of appearance of the top 100 words, we clearly see a difference between files. This indicates that care should be taken in building the training data set to ensure that words from each text file are equally represented.
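One way to do this is to draw the same number of lines from each file when assembling the training set. A rough sketch, with placeholder file names and a hypothetical sample size:

```r
set.seed(1234)
n_lines <- 50000   # hypothetical per-file sample size

blog    <- readLines("blog.txt",    skipNul = TRUE)
news    <- readLines("news.txt",    skipNul = TRUE)
twitter <- readLines("twitter.txt", skipNul = TRUE)

# Sample the same number of lines from each source so no single file dominates.
# This balances by line; n_lines could be adjusted per file to balance by word count.
training <- c(sample(blog,    n_lines),
              sample(news,    n_lines),
              sample(twitter, n_lines))
```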


Bigrams

Blog bigram   n    Twitter bigram    n     News bigram     n
look like     82   right now         177   last year       159
don know      79   last night        138   year old        146
year old      72   can wait          132   new york        114
year ago      68   thank follow      127   new jersey      111
feel like     62   look forward      122   st loui         108
last year     59   look like         118   year ago        105
right now     57   feel like         93    high school     93
make sure     47   follow back       90    last week       74
can get       46   happi birthday    87    san francisco   64
can see       46   don know          75    two year        60

Bigram Summary Table

Viewing the bigram data shows fewer overlapping bigrams between text files. The more formal writing styles used in the news and blog files produce a more similar pattern compared to the informal style used in Twitter.
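For reference, a sketch of how the bigram counts can be produced with tidytext (an assumed approach; `blog_df` is the hypothetical cleaned data frame from the earlier step):

```r
library(dplyr)
library(tidytext)

blog_bigrams <- blog_df %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%  # two-word tokens
  count(bigram, sort = TRUE)

head(blog_bigrams, 10)   # the ten most frequent bigrams in the blog file
```

The same call with `n = 3` produces the trigram counts shown in the next section.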

Trigrams

Blog trigram         n    Twitter trigram        n    News trigram          n
new york citi        10   let us know            29   presid barack obama   16
long time ago        8    can wait see           26   new york citi         11
incorpor item pp     7    happi mother day       22   st loui counti        11
make look like       7    happi new year         22   three year ago        10
amazon servic llc    6    book book book         19   said year old         9
can wait see         6    realli realli realli   14   first time sinc       8
coupl week ago       6    happi valentin day     12   five year ago         8
let just say         6    look forward see       12   past three year       8
one way anoth        6    can wait till          9    st charl counti       8
unit state america   6    cinco de mayo          8    two year ago          8

Trigram Summary Table

Review of the trigram results shows very little in common between the 3 files.
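The overlap can be quantified by intersecting the top trigrams from each file. A sketch, assuming `blog_trigrams`, `twit_trigrams`, and `news_trigrams` were built the same way as the bigrams but with `n = 3`:

```r
top_blog <- head(blog_trigrams$trigram, 10)
top_twit <- head(twit_trigrams$trigram, 10)
top_news <- head(news_trigrams$trigram, 10)

# Trigrams shared by all three files' top-ten lists.
Reduce(intersect, list(top_blog, top_twit, top_news))
```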


Conclusions

The major conclusion from the data exploration is that, since the 3 text files differ in word makeup, each file should be approximately equally represented in the training data set.

Next Steps