The goal of this report is to describe the exploratory analysis of the data and the plans for the eventual app and prediction algorithm. It briefly summarizes the preprocessing pipeline and the candidate algorithm, and gives basic summary statistics about the data set.
To get started with the Data Science Capstone Project, I downloaded the Coursera SwiftKey dataset. After extraction, I chose to work with the en_US folder, which contains the following three files:
## Warning in readLines("./final/en_US/en_US.news.txt", skipNul = T): incomplete
## final line found on './final/en_US/en_US.news.txt'
##                file  size num.words num.lines
## 1   en_US.blogs.txt 200Mb  38154238    899288
## 2    en_US.news.txt 196Mb   2693898     77259
## 3 en_US.twitter.txt 160Mb  30218166   2360148
Note: the readLines warning above means en_US.news.txt was not read in completely, so its word and line counts are understated.
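For reference, here is a minimal sketch of how such a summary table can be assembled, assuming the files sit under ./final/en_US/ as in the warning above; the word count is a simple whitespace split, so exact numbers may differ:

```r
# Hedged sketch: summarise size, word count, and line count of each file.
files <- list.files("./final/en_US", pattern = "\\.txt$", full.names = TRUE)

summarise_file <- function(path) {
  lines <- readLines(path, skipNul = TRUE)
  data.frame(
    file      = basename(path),
    size      = sprintf("%.0fMb", file.size(path) / 1024^2),
    num.words = sum(lengths(strsplit(lines, "\\s+"))),
    num.lines = length(lines)
  )
}

do.call(rbind, lapply(files, summarise_file))
```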
Since the data set is quite large, preprocessing it in a single pass would be very memory intensive. So I divided the corpus into 25 smaller data sets that together constitute the entire corpus. All the preprocessing and tokenization steps are performed on these files, and the results are then compiled into one major data table.
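A minimal sketch of this chunking step, assuming the combined corpus has already been read into a character vector `lines` (the path and chunk names are illustrative):

```r
# Split the corpus into 25 roughly equal chunks and save each one separately,
# so later steps never hold the full corpus in memory at once.
dir.create("./chunks", showWarnings = FALSE)
chunk_id <- cut(seq_along(lines), breaks = 25, labels = FALSE)
chunks   <- split(lines, chunk_id)
for (i in seq_along(chunks)) {
  saveRDS(chunks[[i]], sprintf("./chunks/chunk_%02d.rds", i))
}
```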
The following preprocessing steps were performed, in order (a sketch of this pipeline follows the note below):
1. Removing URLs.
2. Removing symbols and non-ASCII characters.
3. Removing hashtags and punctuation.
4. Removing numbers.
5. Removing profanity.
Note: I haven’t removed stop words, because in this project stop words can be useful for predicting the next word.
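A minimal sketch of the cleaning pipeline in base R, assuming `lines` is one chunk of the corpus and `profanity` is a character vector of words to remove (both names are illustrative):

```r
clean_text <- function(lines, profanity) {
  lines <- gsub("(https?://|www\\.)\\S+", " ", lines)   # 1. URLs
  lines <- iconv(lines, "UTF-8", "ASCII", sub = " ")    # 2. symbols / non-ASCII
  lines <- gsub("#\\S+", " ", lines)                    # 3. hashtags
  lines <- gsub("[[:punct:]]+", " ", lines)             # 3. punctuation
  lines <- gsub("[0-9]+", " ", lines)                   # 4. numbers
  bad   <- paste0("\\b(", paste(profanity, collapse = "|"), ")\\b")
  lines <- gsub(bad, " ", lines, ignore.case = TRUE)    # 5. profanity
  gsub("\\s+", " ", trimws(lines))                      # tidy up whitespace
}
```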
After all the preprocessing and cleaning steps, N-gram tokens were built (N = 2, 3, 4, 5) to capture the position of words relative to one another.
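A hedged sketch of how one 3-gram table can be built with quanteda and data.table, assuming `clean_lines` is a cleaned chunk from the step above (the variable names are illustrative):

```r
library(quanteda)
library(data.table)

toks   <- tokens(clean_lines)                       # word tokens
grams3 <- tokens_ngrams(toks, n = 3, concatenator = "_")
freqs  <- colSums(dfm(grams3))                      # 3-gram frequencies

dt <- data.table(feature = names(freqs), freq = as.integer(freqs))
dt[, pred := sub("^.*_", "", feature)]              # last word: the prediction
dt[, base := sub("_[^_]+$", "", feature)]           # leading words: the context
dt <- dt[freq >= 4]                                 # prune rare n-grams (see note below)
setorder(dt, -freq)
```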
Here’s a sample of the 3-gram data table:
## feature freq pred base
## 1 one_of_the 19716 the one_of
## 2 a_lot_of 19015 of a_lot
## 3 thanks_for_the 13985 the thanks_for
## 4 to_be_a 13145 a to_be
## 5 going_to_be 12560 be going_to
## 6 i_want_to 11406 to i_want
Note that the N-gram tables above were pruned: only N-grams with a frequency of at least four were kept.
For the predictive model, I am considering the Stupid Backoff model. It is a simple model and very fast relative to other algorithms, though it may be somewhat less accurate. The algorithm simply scores the probability of seeing a particular word given the previous set of words.
I will deploy the 5-gram model to predict the next word; however, if fewer than 4 words have been entered, the model backs off to the lower-order N-gram tables.
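A hedged sketch of this backoff logic, assuming `tables` is a list where `tables[[k]]` holds the k-gram data.table (k = 2..5) with columns base, pred, and freq, and `unigrams` holds single-word frequencies; `predict_next()` is an illustrative name and 0.4 is the usual Stupid Backoff penalty:

```r
library(data.table)

predict_next <- function(history, tables, unigrams, n = 3, alpha = 0.4) {
  words <- strsplit(trimws(tolower(history)), "\\s+")[[1]]
  if (length(words) == 0) return(head(unigrams[order(-freq), pred], n))
  penalty <- 1
  scores  <- list()
  for (order in min(length(words), 4):1) {        # back off from 5-grams down
    ctx  <- paste(tail(words, order), collapse = "_")
    hits <- tables[[order + 1]][base == ctx]
    if (nrow(hits) > 0)
      scores[[length(scores) + 1]] <-
        hits[, .(pred, score = penalty * freq / sum(freq))]
    penalty <- penalty * alpha                    # Stupid Backoff discount
  }
  found <- rbindlist(scores)
  if (nrow(found) == 0) return(head(unigrams[order(-freq), pred], n))
  head(found[, .(score = max(score)), by = pred][order(-score), pred], n)
}
```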
For the Shiny app, I will keep it simple: it will predict the next word and present the three best candidate predictions. As soon as the user enters a space, the three predictions will be shown immediately.
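A minimal sketch of the planned interface, reusing the hypothetical `predict_next()` from above and assuming the n-gram tables are already loaded:

```r
library(shiny)

ui <- fluidPage(
  titlePanel("Next-Word Prediction"),
  textInput("phrase", "Type a phrase:"),
  tableOutput("suggestions")
)

server <- function(input, output) {
  output$suggestions <- renderTable({
    req(endsWith(input$phrase, " "))  # fire as soon as a space is entered
    data.frame(prediction = predict_next(input$phrase, tables, unigrams, n = 3))
  })
}

shinyApp(ui, server)
```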