Summary

For this capstone project my goal was to build an application similar to SwiftKey. Since SwiftKey is a mobile application, model size matters as much as prediction quality.
The final application uses a backoff model that predicts the next word based on the n previous words in the text.
The application uses 2 dictionaries:

  • 3-word dictionary
  • 2-word dictionary

The dictionaries were built using the data available here. Only a sample from each source (news, blogs and Twitter) was used to build them.
The application first tries to predict the next word using the 3-word dictionary (combinations of 3 consecutive words). If it finds several possible options, it proposes the 3 most frequent ones. If it finds nothing, it backs off to the 2-word dictionary. If the 2-word dictionary also yields nothing, no prediction is made.
The main tradeoff here is dictionary size vs prediction quality, so I studied several strategies (based on training set size and prediction method) to find the optimal result. See the slide "Accuracy vs Dictionary size".
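The backoff logic can be illustrated with a short sketch (the dictionary objects and their prefix, prediction and Freq columns are illustrative assumptions, not the exact structures used in the application):

predict_next <- function(text, dict3, dict2, n_options = 3) {
  words <- unlist(strsplit(tolower(text), "\\s+"))
  # Try the 3-word dictionary first: match on the last two typed words
  if (length(words) >= 2) {
    prefix <- paste(tail(words, 2), collapse = " ")
    hits <- dict3[dict3$prefix == prefix, ]
    if (nrow(hits) > 0)
      return(head(hits$prediction[order(-hits$Freq)], n_options))
  }
  # Back off to the 2-word dictionary: match on the last typed word only
  if (length(words) >= 1) {
    hits <- dict2[dict2$prefix == tail(words, 1), ]
    if (nrow(hits) > 0)
      return(head(hits$prediction[order(-hits$Freq)], n_options))
  }
  character(0)  # no prediction at all
}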

Interface & functionality

The application simulates a mobile interface: I made it look like an iPhone using some CSS styles. Once the user has typed some text, the application proposes 3 options to choose from.
The user can either continue typing or click on one of the proposed words.
Once the user clicks on one of them, the word is appended to the text already typed in and a new prediction is made.
The application is available here
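The selection behavior could be wired up roughly as follows, assuming a Shiny front end (the widget ids and the predict_next() helper are illustrative, not the application's actual code):

library(shiny)

ui <- fluidPage(
  textAreaInput("typed", "Your text:", value = ""),
  uiOutput("suggestions")
)

server <- function(input, output, session) {
  # Render up to 3 suggestion buttons for the current text
  output$suggestions <- renderUI({
    opts <- predict_next(input$typed, dict3, dict2)
    lapply(seq_along(opts), function(i)
      actionButton(paste0("opt", i), label = opts[i]))
  })
  # Clicking a suggestion appends it to the text, which triggers a new prediction
  lapply(1:3, function(i)
    observeEvent(input[[paste0("opt", i)]], {
      opts <- predict_next(input$typed, dict3, dict2)
      if (length(opts) >= i)
        updateTextAreaInput(session, "typed",
                            value = paste(input$typed, opts[i]))
    }))
}

# shinyApp(ui, server)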

Accuracy vs Dictionary size (part 1)

Since our application is a mobile one, we need to choose an algorithm that balances prediction quality against dictionary size.
I tested 2 different strategies to find the optimal one:

  • Increasing the training set (15, 30, 45 or 60 thousand documents from each source), which enlarges the dictionaries (see the sketch after this list)
  • Increasing the length of the analysed context (making predictions based on 3-, 4-, 5- or 6-grams).
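
A minimal sketch of how such an n-gram frequency dictionary could be built from a sample of documents (the function and column names are illustrative, not the project's actual code):

build_ngram_dict <- function(docs, n = 3) {
  # Lower-case and split each document into word tokens
  tokens <- lapply(docs, function(d) {
    w <- unlist(strsplit(tolower(d), "[^a-z']+"))
    w[w != ""]
  })
  # Collect all n-grams of consecutive words
  ngrams <- unlist(lapply(tokens, function(w) {
    if (length(w) < n) return(character(0))
    vapply(seq_len(length(w) - n + 1),
           function(i) paste(w[i:(i + n - 1)], collapse = " "),
           character(1))
  }))
  # Frequency table, most frequent n-grams first
  freq <- as.data.frame(table(ngram = ngrams), stringsAsFactors = FALSE)
  freq[order(-freq$Freq), ]
}

# Example: 3-word dictionary from a toy sample
build_ngram_dict(c("i want to go home", "i want to sleep"), n = 3)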

Increasing the training set from 45K (3 x 15K) documents to 90K (3 x 30K) gives a clear boost to prediction accuracy; further increases, however, have a limited positive impact on accuracy while significantly increasing the size of the required dictionaries.
Another hypothesis I tested was that increasing the number of proposed predictions would improve prediction efficiency. After checking it I failed to reject this hypothesis, so I consider offering several options the most efficient way of improving the whole algorithm.

Accuracy vs Dictionary size (part 2)

So the final step in tuning my algorithm was choosing a proper dictionary size to be used with my prediction function, which offers between 1 and 3 options to the user.

# dplyr is needed for the pipe and the filter/select verbs below
library(dplyr)

dsBench <- read.csv2("DictionnariesSampling.csv")
levels(dsBench$DictionnarySize) <- c("15k each","30k each","45k each","60k each")
dsBench$Accuracy <- round(dsBench$Correct / dsBench$SampleSize, 2)
dsBench %>% filter(PredictionMethod == "One of 3 options") %>%
  select(-Parameters, -SampleSize, -SizeKb, -PredictionMethod)
##   DictionnarySize Correct SizeMb Accuracy EfficiencyMbPerPercent
## 1        15k each      34  15.25     0.17                   0.90
## 2        30k each      78  27.70     0.39                   0.71
## 3        60k each      87  40.11     0.44                   0.92
## 4        45k each      93  50.27     0.46                   1.08
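
The EfficiencyMbPerPercent column appears to be dictionary size divided by accuracy in percentage points, so lower is better (this is an inference from the values, not stated in the project), e.g.:

27.70 / 39   # ~0.71 Mb per percentage point of accuracy for the 30k dictionaries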

So the dictionaries built from 90K sampled documents (30K from each source) seem to be the most efficient: we get 39% accuracy for under 28 Mb.