Introduction

In this project, we apply the data science techniques learned in the previous nine courses to a real-world problem. Around the world, people are spending an increasing amount of time on their mobile devices for email, social networking, banking and a whole range of other activities, yet typing on a mobile device can be a serious pain. We build a smart keyboard that resembles the one developed by SwiftKey, making it easier for people to type on their mobile devices.

Task 1: Getting and Cleaning the Data

We use the data set assigned by the course for most of our analysis. Here we download the zip file and inspect its contents.
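A minimal sketch of this step is shown below; the download URL and object names are assumptions and may differ from the code actually used.

# Download the course zip file (URL assumed from the course description)
zipUrl  <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
zipFile <- "Coursera-SwiftKey.zip"
if (!file.exists(zipFile)) download.file(zipUrl, destfile = zipFile, mode = "wb")
# List the archive contents (file names and uncompressed sizes), then extract
zipContents <- unzip(zipFile, list = TRUE)
data.frame(filename = zipContents$Name, uncompressed_size = zipContents$Length)
unzip(zipFile, exdir = ".")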

##                         filename uncompressed_size
## 1                         final/                 0
## 2                   final/de_DE/                 0
## 3  final/de_DE/de_DE.twitter.txt          75578341
## 4    final/de_DE/de_DE.blogs.txt          85459666
## 5     final/de_DE/de_DE.news.txt          95591959
## 6                   final/ru_RU/                 0
## 7    final/ru_RU/ru_RU.blogs.txt         116855835
## 8     final/ru_RU/ru_RU.news.txt         118996424
## 9  final/ru_RU/ru_RU.twitter.txt         105182346
## 10                  final/en_US/                 0
## 11 final/en_US/en_US.twitter.txt         167105338
## 12    final/en_US/en_US.news.txt         205811889
## 13   final/en_US/en_US.blogs.txt         210160014
## 14                  final/fi_FI/                 0
## 15    final/fi_FI/fi_FI.news.txt          94234350
## 16   final/fi_FI/fi_FI.blogs.txt         108503595
## 17 final/fi_FI/fi_FI.twitter.txt          25331142

There are 4 sub-folders for 4 different languages. Each folder contains 3 .txt files with text drawn from Twitter, blogs and news. For this assignment, we focus on the English text in the final/en_US/ folder.

After reading the three files, here are their summaries in terms of line count, total word count and the average number of words per line.
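These summaries can be computed along the following lines (a sketch using the stringi package; file paths follow the listing above, other names are illustrative).

library(stringi)

# Read the three en_US files, skipping embedded nulls
twitter <- readLines("final/en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
blogs   <- readLines("final/en_US/en_US.blogs.txt",   encoding = "UTF-8", skipNul = TRUE)
news    <- readLines("final/en_US/en_US.news.txt",    encoding = "UTF-8", skipNul = TRUE)

# Line count, total word count and average words per line for one file
summariseFile <- function(x, label) {
  wordsPerLine <- stri_count_words(x)
  data.frame(fileType = label, lineCount = length(x),
             wordCount = sum(wordsPerLine),
             avgWordsPerLine = mean(wordsPerLine))
}
rbind(summariseFile(twitter, "Twitter"),
      summariseFile(blogs, "Blogs"),
      summariseFile(news, "News"))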

##   fileType lineCount wordCount avgWordsPerLine
## 1  Twitter   2360148  30373543        12.86934
## 2    Blogs     77259   2643969        34.22215
## 3     News    899288  37334131        41.51521

Due to the large size of the data set, we randomly sample roughly 1% of each of the three US files to reduce them to a more manageable size and combine the samples into a single data table subset.
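A sketch of the sampling step (the seed and object names are illustrative):

# Keep roughly 1% of the lines from each file and combine them
set.seed(123)   # illustrative seed for reproducibility
sampleFraction <- 0.01
subsetUS <- c(sample(twitter, round(length(twitter) * sampleFraction)),
              sample(blogs,   round(length(blogs)   * sampleFraction)),
              sample(news,    round(length(news)    * sampleFraction)))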

The resulting data table consists of 33,721 lines with an average of about 21.1 words per line.

Task 2: Exploratory Data Analysis

To explore the data, we convert the US subset into a tibble and break the text down into word tokens.
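A sketch of the tokenisation using the tidytext package (column and object names are assumptions based on the output shown below):

library(dplyr)
library(tidytext)

# Turn the sampled lines into a tibble, split the text into single-word tokens
# and count how often each word appears
token_US <- tibble(line = seq_along(subsetUS), text = subsetUS) %>%
  unnest_tokens(words, text) %>%
  count(words, sort = TRUE)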

Across the 33,721 sampled lines, there are 45,731 unique words and 718,056 word instances in total. The frequency distribution shows that most words appear only a few times, while a small number of words appear very frequently, up to almost 30,000 times. We need 134 unique words in a frequency-sorted dictionary to cover 50% of all word instances in the US English subset, whereas 6,771 unique words are needed to cover 90%.
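One way to compute this coverage from the word-frequency table is sketched below (assuming token_US is sorted by descending frequency, as above):

# Number of unique words needed to cover a given share of all word instances
coverage <- function(freq, threshold) {
  which(cumsum(freq) / sum(freq) >= threshold)[1]
}
coverage(token_US$n, 0.5)   # about 134 words for 50% coverage
coverage(token_US$n, 0.9)   # about 6771 words for 90% coverage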

Here are the 10 words with the highest frequency.

head(token_US, 10)
## # A tibble: 10 x 2
##    words     n
##    <chr> <int>
##  1 the   29486
##  2 to    19442
##  3 and   16230
##  4 a     15892
##  5 i     15214
##  6 of    12798
##  7 in    10408
##  8 you    8646
##  9 for    8152
## 10 is     8041

Next, we look at the 10 most frequent 2-grams and 3-grams in the dataset.
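The n-gram counts can be obtained with the same tidytext workflow (a sketch; NA rows produced by very short lines are dropped):

# Count 2-grams and 3-grams in the sampled text
bigrams_US <- tibble(text = subsetUS) %>%
  unnest_tokens(ngrams, text, token = "ngrams", n = 2) %>%
  filter(!is.na(ngrams)) %>%
  count(ngrams, sort = TRUE)
trigrams_US <- tibble(text = subsetUS) %>%
  unnest_tokens(ngrams, text, token = "ngrams", n = 3) %>%
  filter(!is.na(ngrams)) %>%
  count(ngrams, sort = TRUE)
head(bigrams_US, 10)
head(trigrams_US, 10)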

## # A tibble: 10 x 2
##    ngrams      n
##    <chr>   <int>
##  1 of the   2462
##  2 in the   2433
##  3 for the  1450
##  4 to the   1340
##  5 on the   1246
##  6 to be    1161
##  7 at the    906
##  8 and the   808
##  9 i have    806
## 10 i was     737

## # A tibble: 10 x 2
##    ngrams                 n
##    <chr>              <int>
##  1 thanks for the       248
##  2 one of the           188
##  3 a lot of             172
##  4 i want to            149
##  5 going to be          123
##  6 looking forward to   116
##  7 i have a             113
##  8 to be a              113
##  9 it was a             112
## 10 i have to            111

Next Steps

Single words occur much more frequently than 2-grams, and 2-grams in turn occur more frequently than 3-grams. We can use the relative frequencies of these n-grams to estimate probabilities and build our prediction algorithm on that foundation.
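As an illustration of the idea only (a simplified sketch, not the final algorithm): given the last words typed, we can look up the most frequent n-gram that starts with them and suggest its final word, backing off to shorter n-grams when no match is found. The helper below is hypothetical and assumes the word, bigram and trigram tables created earlier.

library(stringr)

# Hypothetical back-off lookup: try trigrams first, then bigrams,
# then fall back to the single most frequent word
predict_next <- function(phrase) {
  phrase  <- str_squish(tolower(phrase))
  lastTwo <- word(phrase, -2, -1)          # assumes at least two words typed
  hit <- filter(trigrams_US, str_starts(ngrams, paste0(lastTwo, " ")))
  if (nrow(hit) > 0) return(word(hit$ngrams[1], -1))
  lastOne <- word(phrase, -1)
  hit <- filter(bigrams_US, str_starts(ngrams, paste0(lastOne, " ")))
  if (nrow(hit) > 0) return(word(hit$ngrams[1], -1))
  token_US$words[1]
}
predict_next("thanks for")   # likely suggests "the"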