In this project, we apply the data science techniques learnt in the previous nine courses to a real-world problem. Around the world, people are spending an increasing amount of time on their mobile devices for email, social networking, banking and a whole range of other activities. But typing on mobile devices can be a serious pain. We build a smart keyboard, similar to the one developed by SwiftKey, to make it easier for people to type on their mobile devices.
We use the data set provided by the course for most of our analysis. Here we download the zip file and inspect its structure.
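A minimal sketch of this step is shown below; the download URL and the renaming of the columns returned by unzip() are assumptions.

url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
if (!file.exists("Coursera-SwiftKey.zip")) {
  download.file(url, destfile = "Coursera-SwiftKey.zip")
}
# unzip(list = TRUE) returns Name, Length and Date without extracting the files
zip_contents <- unzip("Coursera-SwiftKey.zip", list = TRUE)
data.frame(filename = zip_contents$Name, uncompressed_size = zip_contents$Length)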
## filename uncompressed_size
## 1 final/ 0
## 2 final/de_DE/ 0
## 3 final/de_DE/de_DE.twitter.txt 75578341
## 4 final/de_DE/de_DE.blogs.txt 85459666
## 5 final/de_DE/de_DE.news.txt 95591959
## 6 final/ru_RU/ 0
## 7 final/ru_RU/ru_RU.blogs.txt 116855835
## 8 final/ru_RU/ru_RU.news.txt 118996424
## 9 final/ru_RU/ru_RU.twitter.txt 105182346
## 10 final/en_US/ 0
## 11 final/en_US/en_US.twitter.txt 167105338
## 12 final/en_US/en_US.news.txt 205811889
## 13 final/en_US/en_US.blogs.txt 210160014
## 14 final/fi_FI/ 0
## 15 final/fi_FI/fi_FI.news.txt 94234350
## 16 final/fi_FI/fi_FI.blogs.txt 108503595
## 17 final/fi_FI/fi_FI.twitter.txt 25331142
There are four sub-folders for four different languages. Each folder contains three .txt files, with text drawn from Twitter, blogs and news. For this assignment, we focus on the English text in the final/en_US/ folder.
After reading the three files, here are their summaries in terms of line count, total word count and the average number of words per line.
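One way to compute these summaries is sketched below; the file paths, helper name and the stri_count_words word-counting rule are assumptions.

library(stringi)

summarise_file <- function(path) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  words <- stri_count_words(lines)
  data.frame(lineCount = length(lines), wordCount = sum(words), avgWordsPerLine = mean(words))
}

files <- c(Twitter = "final/en_US/en_US.twitter.txt",
           Blogs   = "final/en_US/en_US.blogs.txt",
           News    = "final/en_US/en_US.news.txt")
cbind(fileType = names(files), do.call(rbind, lapply(files, summarise_file)))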
## fileType lineCount wordCount avgWordsPerLine
## 1 Twitter 2360148 30373543 12.86934
## 2 Blogs 77259 2643969 34.22215
## 3 News 899288 37334131 41.51521
Due to the large size of the data set, we perform random sampling to reduce the three US data tables to a more manageable size (roughly 1% of the original data set) and combine them into one data table subset.
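A sketch of the sampling step is shown below, assuming a Bernoulli sample of roughly 1% of the lines per file and a fixed seed for reproducibility; the object and function names are assumptions.

set.seed(1234)
sample_lines <- function(lines, rate = 0.01) {
  # keep each line independently with probability `rate`
  lines[rbinom(length(lines), size = 1, prob = rate) == 1]
}

twitter <- readLines("final/en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
blogs   <- readLines("final/en_US/en_US.blogs.txt",   encoding = "UTF-8", skipNul = TRUE)
news    <- readLines("final/en_US/en_US.news.txt",    encoding = "UTF-8", skipNul = TRUE)

subset_US <- c(sample_lines(twitter), sample_lines(blogs), sample_lines(news))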
The resultant data table consists of 33721 lines, with an average of about 21.07 words per line.
To explore the data, we transform the US subset into a tibble and break the text down into word tokens.
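A sketch of the tokenisation with tidytext follows; the intermediate tibble_US name is an assumption, while token_US and its columns are chosen to match the object shown below.

library(dplyr)
library(tidytext)

tibble_US <- tibble(line = seq_along(subset_US), text = subset_US)

token_US <- tibble_US %>%
  unnest_tokens(words, text) %>%   # one row per word, lower-cased by default
  count(words, sort = TRUE)        # word frequencies, most frequent first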
Across the 33721 sampled lines, there are 45731 unique words and 718056 word tokens in total. The word-frequency density curve shows that most words have low frequencies, while a few words appear very frequently, up to almost 30,000 times. We need 134 unique words in a frequency-sorted dictionary to cover 50% of all word instances in the US English subset, whereas 6771 unique words are needed to cover 90% of all word instances.
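The coverage figures can be computed from the frequency-sorted table, for example as below (the helper name is an assumption).

coverage <- function(tokens, target) {
  cum_share <- cumsum(tokens$n) / sum(tokens$n)   # cumulative share of word instances
  which(cum_share >= target)[1]                   # words needed to reach the target
}
coverage(token_US, 0.5)   # about 134 words for 50% coverage
coverage(token_US, 0.9)   # about 6771 words for 90% coverage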
Here are the 10 most frequent words.
head(token_US, 10)
## # A tibble: 10 x 2
## words n
## <chr> <int>
## 1 the 29486
## 2 to 19442
## 3 and 16230
## 4 a 15892
## 5 i 15214
## 6 of 12798
## 7 in 10408
## 8 you 8646
## 9 for 8152
## 10 is 8041
Next, we look at the frequencies of the top 10 2-grams and 3-grams in the data set.
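Continuing from the tokenisation sketch above, the 2-gram and 3-gram counts could be obtained as follows (object names are assumptions).

library(dplyr)
library(tidytext)

bigram_US <- tibble_US %>%
  unnest_tokens(ngrams, text, token = "ngrams", n = 2) %>%
  filter(!is.na(ngrams)) %>%
  count(ngrams, sort = TRUE)

trigram_US <- tibble_US %>%
  unnest_tokens(ngrams, text, token = "ngrams", n = 3) %>%
  filter(!is.na(ngrams)) %>%
  count(ngrams, sort = TRUE)

head(bigram_US, 10)
head(trigram_US, 10)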
## # A tibble: 10 x 2
## ngrams n
## <chr> <int>
## 1 of the 2462
## 2 in the 2433
## 3 for the 1450
## 4 to the 1340
## 5 on the 1246
## 6 to be 1161
## 7 at the 906
## 8 and the 808
## 9 i have 806
## 10 i was 737
## # A tibble: 10 x 2
## ngrams n
## <chr> <int>
## 1 thanks for the 248
## 2 one of the 188
## 3 a lot of 172
## 4 i want to 149
## 5 going to be 123
## 6 looking forward to 116
## 7 i have a 113
## 8 to be a 113
## 9 it was a 112
## 10 i have to 111
The frequencies of single words are much higher than those of 2-grams, and the frequencies of 2-grams are in turn higher than those of 3-grams. We can use the probabilities derived from these frequency counts as a foundation for our prediction algorithm.
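As an illustration only, and not the final model, a naive frequency-based lookup can already suggest a next word from the counts above; the function name and simple back-off rule are assumptions.

library(dplyr)
library(stringr)

predict_next <- function(phrase, trigrams, bigrams) {
  # try the most frequent 3-gram that starts with the last two words typed
  last_two <- str_extract(str_to_lower(phrase), "\\S+\\s+\\S+$")
  hit <- trigrams %>% filter(str_starts(ngrams, paste0(last_two, " ")))
  if (nrow(hit) == 0) {
    # fall back to 2-grams keyed on the last word only
    last_one <- str_extract(str_to_lower(phrase), "\\S+$")
    hit <- bigrams %>% filter(str_starts(ngrams, paste0(last_one, " ")))
  }
  if (nrow(hit) == 0) return(NA_character_)
  word(hit$ngrams[1], -1)   # last word of the best-matching n-gram
}

predict_next("thanks for", trigram_US, bigram_US)   # likely "the"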