The mobile phone has become the technological centerpiece of everyday life. People interact with their phones by entering text into numerous apps, and this can be painful depending on the type and amount of information requested by the app. Predictive text modeling is the core of smart keyboards, which are designed to ease the typing of information into mobile phones.
The first steps in building a predictive text model are to import a corpus of text files, explore the data, and then build a training data set. This milestone report covers the data import, the cleaning, and the exploration of word, bigram, and trigram frequencies.
The data were imported directly from the link provided by Coursera (https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip). Files containing text samples from news sites, blogs, and Twitter posts in four languages were downloaded and unzipped. Only the English files were used for this project.
| File | Size (MB) | Lines (thousands) | Words (millions) |
|---|---|---|---|
| news.txt | 205.8 | 1010 | 34.8 |
| blog.txt | 210.2 | 899 | 38.2 |
| twitter.txt | 167.1 | 2360 | 30.7 |
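A file summary of this kind can be reproduced with a short script. The report does not show the code that produced the table, so the following is only a minimal Python sketch (the report's own tooling was R). It assumes that words are whitespace-delimited tokens and that sizes are decimal megabytes; `summarize_file` and the demo file are hypothetical names introduced for illustration.

```python
import os
import tempfile

def summarize_file(path):
    """Return size in MB, line count, and whitespace-delimited word count."""
    lines = 0
    words = 0
    # Stream the file line by line so large corpora need not fit in memory.
    with open(path, encoding="utf-8", errors="replace") as handle:
        for line in handle:
            lines += 1
            words += len(line.split())
    # Assumption: 1 MB = 1,000,000 bytes.
    size_mb = os.path.getsize(path) / 1_000_000
    return {"size_mb": size_mb, "lines": lines, "words": words}

# Demo on a tiny made-up file, not on the real corpus.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.txt")
    with open(demo_path, "w", encoding="utf-8") as handle:
        handle.write("the quick brown fox\njumps over\n")
    demo_summary = summarize_file(demo_path)

print(demo_summary["lines"], demo_summary["words"])
```

Counting while streaming keeps memory use flat, which matters for files of roughly 200 MB each.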
Data exploration was done after cleaning, as this was the form of the data that would ultimately be used for modeling. The uploaded data set was cleaned by
The data were then converted to a data frame so that the dplyr and tidyr packages could be used for exploration.
| Blog word | Count | Twitter word | Count | News word | Count |
|---|---|---|---|---|---|
| one | 136401 | just | 149870 | said | 250385 |
| can | 119881 | get | 146138 | year | 128720 |
| will | 116070 | can | 135746 | will | 111046 |
| like | 111913 | thank | 130898 | one | 92363 |
| time | 108576 | like | 130109 | time | 72330 |
| just | 100496 | go | 128032 | new | 70757 |
| get | 94992 | love | 123791 | can | 70702 |
| go | 83196 | day | 110643 | state | 68145 |
| make | 81342 | good | 101831 | two | 63865 |
| day | 72572 | will | 95901 | say | 63155 |
The summary table above places the 10 most frequent words in each of the 3 text files side by side for comparison. As expected, there is significant overlap in words between the text files, but the overlapping words do not rank the same in each file.
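The side-by-side comparison can be sketched as follows. This is an illustrative Python version using `collections.Counter`, not the dplyr/tidyr code used for the report; the sample lines are made up, and `top_words` and `rank_of` are hypothetical helper names.

```python
from collections import Counter

def top_words(lines, k=10):
    """Return the k most frequent lower-cased tokens as (word, count) pairs."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts.most_common(k)

def rank_of(word, ranked):
    """Return the 1-based rank of a word in a ranked list, or None if absent."""
    for position, (candidate, _count) in enumerate(ranked, start=1):
        if candidate == word:
            return position
    return None

# Made-up sample lines standing in for the cleaned corpus files.
sources = {
    "blog": ["one day one time", "one day can"],
    "twitter": ["just one day", "just just one"],
}
top = {name: top_words(lines, k=2) for name, lines in sources.items()}

# Words that appear in both top lists, with their rank in each source.
shared = {w for w, _ in top["blog"]} & {w for w, _ in top["twitter"]}
for word in sorted(shared):
    print(word, rank_of(word, top["blog"]), rank_of(word, top["twitter"]))
```

In this toy data the word "one" is shared but ranks first in one source and second in the other, which is the same pattern the table shows for words such as "can" and "will".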
The top ten word counts show that the frequency profile of words differs from file to file, and the same difference appears in the frequencies of the top 100 words. This indicates that care should be taken when building the training data set to ensure that each text file contributes an equal number of words.
| Blog bigram | Count | Twitter bigram | Count | News bigram | Count |
|---|---|---|---|---|---|
| look like | 82 | right now | 177 | last year | 159 |
| don know | 79 | last night | 138 | year old | 146 |
| year old | 72 | can wait | 132 | new york | 114 |
| year ago | 68 | thank follow | 127 | new jersey | 111 |
| feel like | 62 | look forward | 122 | st loui | 108 |
| last year | 59 | look like | 118 | year ago | 105 |
| right now | 57 | feel like | 93 | high school | 93 |
| make sure | 47 | follow back | 90 | last week | 74 |
| can get | 46 | happi birthday | 87 | san francisco | 64 |
| can see | 46 | don know | 75 | two year | 60 |
The bigram data show fewer overlapping entries between text files than the single-word data. The more formal writing styles of the news and blog files produce more similar patterns than the informal style used on Twitter.
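For readers unfamiliar with n-gram counting, the idea can be sketched in a few lines of Python. This is not the report's own code; it assumes the text has already been cleaned and split on whitespace, and it does not let n-grams span line boundaries. The names `ngrams` and `count_ngrams` are hypothetical.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all runs of n consecutive tokens, joined by single spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def count_ngrams(lines, n):
    """Count n-grams line by line, so no n-gram crosses a line break."""
    counts = Counter()
    for line in lines:
        counts.update(ngrams(line.split(), n))
    return counts

# Made-up, already-cleaned sample lines.
sample = ["look like rain", "look like snow", "right now"]
bigram_counts = count_ngrams(sample, 2)
trigram_counts = count_ngrams(sample, 3)

print(bigram_counts.most_common(1))
```

Setting `n=3` gives trigram counts with the same function; lines shorter than `n` tokens simply contribute nothing.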
| Blog trigram | Count | Twitter trigram | Count | News trigram | Count |
|---|---|---|---|---|---|
| new york citi | 10 | let us know | 29 | presid barack obama | 16 |
| long time ago | 8 | can wait see | 26 | new york citi | 11 |
| incorpor item pp | 7 | happi mother day | 22 | st loui counti | 11 |
| make look like | 7 | happi new year | 22 | three year ago | 10 |
| amazon servic llc | 6 | book book book | 19 | said year old | 9 |
| can wait see | 6 | realli realli realli | 14 | first time sinc | 8 |
| coupl week ago | 6 | happi valentin day | 12 | five year ago | 8 |
| let just say | 6 | look forward see | 12 | past three year | 8 |
| one way anoth | 6 | can wait till | 9 | st charl counti | 8 |
| unit state america | 6 | cinco de mayo | 8 | two year ago | 8 |
Review of the trigram results shows very little in common between the 3 files.
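One way to put a number on this is the Jaccard index (shared items divided by total distinct items) for each pair of top-10 lists. The report itself does not compute this measure; the Python sketch below is an added illustration that uses the ten trigrams per source exactly as they appear in the table above.

```python
from itertools import combinations

# Top-10 trigrams per source, copied from the table above.
top_trigrams = {
    "blog": {
        "new york citi", "long time ago", "incorpor item pp",
        "make look like", "amazon servic llc", "can wait see",
        "coupl week ago", "let just say", "one way anoth",
        "unit state america",
    },
    "twitter": {
        "let us know", "can wait see", "happi mother day",
        "happi new year", "book book book", "realli realli realli",
        "happi valentin day", "look forward see", "can wait till",
        "cinco de mayo",
    },
    "news": {
        "presid barack obama", "new york citi", "st loui counti",
        "three year ago", "said year old", "first time sinc",
        "five year ago", "past three year", "st charl counti",
        "two year ago",
    },
}

def jaccard(a, b):
    """Return |a & b| / |a | b|, or 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

overlap = {
    (x, y): jaccard(top_trigrams[x], top_trigrams[y])
    for x, y in combinations(top_trigrams, 2)
}
for pair, score in overlap.items():
    print(pair, round(score, 3))
```

Blog and Twitter share only "can wait see", blog and news share only "new york citi", and Twitter and news share nothing, so the pairwise scores are 1/19, 1/19, and 0.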
The major conclusion from the data exploration is that, since the 3 text files differ in their word makeup, each file should be approximately equally represented in the training data set.
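A balanced training set could be drawn along the following lines. The report does not specify a sampling procedure, so this Python sketch is only one possible approach: it shuffles each source and takes lines until a shared word budget is met. The function name, the budget, the seed, and the toy data are all assumptions made for illustration.

```python
import random

def sample_by_word_budget(lines, word_budget, seed=0):
    """Randomly pick lines until at least word_budget words are collected."""
    rng = random.Random(seed)  # Fixed seed keeps the sample reproducible.
    shuffled = list(lines)
    rng.shuffle(shuffled)
    sample, words = [], 0
    for line in shuffled:
        if words >= word_budget:
            break
        sample.append(line)
        words += len(line.split())
    return sample, words

# Toy sources with different line lengths, mimicking long blog posts
# and short tweets. These are not real corpus lines.
toy_sources = {
    "blog": ["a b c d e f"] * 50,
    "news": ["a b c d"] * 100,
    "twitter": ["a b"] * 200,
}

training = {}
for name, lines in toy_sources.items():
    sample, words = sample_by_word_budget(lines, word_budget=60)
    training[name] = {"lines": len(sample), "words": words}
    print(name, len(sample), words)
```

Budgeting by words rather than by lines matters because the sources differ so much in line length: the Twitter file has more than twice as many lines as the blog file but fewer words.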