Over the last 10-15 years, with the advent of the internet and social media, a vast amount of textual content has been generated around the world in many languages. More than at any other time in history, rich information is now available, and it is stored in digital format. Researchers, businesses and others have been mining this trove of information for predictive analytics and insights. This has led to tremendous growth in text mining and its applications.
One such application is predicting the next word in a sentence, given the words typed so far. This is particularly useful for mobile phone users who find typing on a phone difficult. If the predicted word is what the user intended to type, the user saves a few taps.
SwiftKey, the corporate sponsor of the Coursera Data Science Specialization Capstone project, has been working on this application of text mining. For this Capstone project, SwiftKey has shared content from Twitter, news sites and blogs. In this project, we will use this content to try to predict the next word in a sentence with a greater degree of accuracy.
This report is a milestone status update for this project (and is not the final project report).
The intended audience of this milestone report is management. Hence the software code used for the statistical analysis has been suppressed, and only the essential data summaries and plots are displayed.
The en_US dataset was loaded.
The number of words in the Twitter file was:
## [1] 30374206
The number of words in the news file was:
## [1] 34372720
The number of words in the blogs file was:
## [1] 37334690
The number of lines in the Twitter file was:
## [1] 2360148
The number of lines in the news file was:
## [1] 1010242
The number of lines in the blogs file was:
## [1] 899288
The words per line for Twitter, news and blogs were, respectively:
## [1] 12.86962
## [1] 34.02424
## [1] 41.51583
The words per line will be useful later in the project, when we look at how to build word termination into our prediction model.
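As a quick cross-check, the words-per-line figures can be reproduced by dividing each file's word count by its line count. The sketch below is illustrative Python, not the R code used for this report; the dictionary layout and variable names are assumptions, and the counts are copied from the output above.

```python
# Word and line counts as reported above for the en_US files.
counts = {
    "twitter": {"words": 30374206, "lines": 2360148},
    "news": {"words": 34372720, "lines": 1010242},
    "blogs": {"words": 37334690, "lines": 899288},
}

# Average words per line = total words / total lines, per file.
words_per_line = {
    name: c["words"] / c["lines"] for name, c in counts.items()
}

for name, value in words_per_line.items():
    print(f"{name}: {value:.5f}")
```

Twitter lines come out shortest and blog lines longest, which is consistent with Twitter's character limit.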
Each of the three files has between roughly 0.9 million and 2.4 million lines. In our exploratory analysis we will therefore start small. We begin by sampling only 2% of the lines from each file, without replacement. We then consolidate the three samples and create a corpus file from the consolidated sample.
We will use this corpus file for our further exploratory analysis.
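The sampling and consolidation step can be sketched as follows. This is a minimal Python illustration, not the R code behind this report; the function name `sample_lines`, the fixed seed and the toy input lists are all assumptions made for the example.

```python
import random


def sample_lines(lines, fraction=0.02, seed=42):
    """Return a random sample of `fraction` of the lines, without replacement."""
    rng = random.Random(seed)  # fixed seed (an assumption) for reproducibility
    k = int(len(lines) * fraction)
    return rng.sample(lines, k)  # random.sample never picks an element twice


# Toy stand-ins for the three source files (hypothetical data).
twitter = [f"tweet {i}" for i in range(1000)]
news = [f"news item {i}" for i in range(500)]
blogs = [f"blog post {i}" for i in range(500)]

# Sample each source separately, then consolidate into one corpus.
corpus = sample_lines(twitter) + sample_lines(news) + sample_lines(blogs)
print(len(corpus))
```

Sampling each file separately keeps every source represented in proportion to its own size.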
The corpus file is then transformed so that all text is in lower case. Other potential items of concern, such as numbers, punctuation and profane words, are stripped from the corpus file.
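The cleaning steps described above could look roughly like this. It is a sketch in Python rather than the tm-based R code used for the report, and the `PROFANITY` set is a hypothetical placeholder, not the word list actually applied.

```python
import re
import string

# Hypothetical placeholder list; the real profanity list is not shown here.
PROFANITY = {"darn", "heck"}


def clean_line(line, banned=PROFANITY):
    """Lower-case a line and strip numbers, punctuation and banned words."""
    text = line.lower()
    text = re.sub(r"\d+", " ", text)  # remove digits
    # Remove ASCII punctuation (this also drops apostrophes, so "don't" -> "dont").
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = [t for t in text.split() if t not in banned]
    return " ".join(tokens)  # also collapses repeated whitespace


print(clean_line("Darn, I paid 20 dollars!"))
```

Lower-casing comes first so that the banned-word check does not depend on capitalization.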
The file that results from this data preparation is used for the exploratory data analysis. Histograms are then prepared to show which words occur most frequently.
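The counts behind such a histogram amount to a word-frequency table. A minimal Python sketch, assuming a cleaned corpus held as a list of strings (the `cleaned_corpus` data and the `top_words` helper are hypothetical), is:

```python
from collections import Counter


def top_words(lines, n=10):
    """Return the n most frequent words across all lines as (word, count) pairs."""
    counter = Counter()
    for line in lines:
        counter.update(line.split())
    return counter.most_common(n)


# Toy cleaned corpus (hypothetical data).
cleaned_corpus = ["the cat sat", "the dog sat", "the end"]
print(top_words(cleaned_corpus, n=2))
```

The resulting pairs can be passed directly to a bar-chart function to draw the histogram.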
Because some commands from the tm package run slowly on my laptop, I have not been able to include the plots from the exploratory data analysis in this report.
In the next steps of the analysis, n-gram models will be explored further to see which n-gram size yields the best predictions and whether the model can be improved. Other models, such as Naive Bayes, will also be explored.
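To make the n-gram idea concrete, the sketch below shows the simplest case, a bigram model that predicts the most frequent follower of the last word typed. It is an illustrative Python example under stated assumptions, not the model that will be delivered; the function names and the toy sentences are hypothetical.

```python
from collections import Counter, defaultdict


def build_bigram_model(sentences):
    """Count, for every word, how often each other word follows it."""
    model = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        for prev, nxt in zip(tokens, tokens[1:]):
            model[prev][nxt] += 1
    return model


def predict_next(model, word):
    """Return the most frequent follower of `word`, or None if it was never seen."""
    followers = model.get(word)
    if not followers:
        return None
    return followers.most_common(1)[0][0]


# Toy training sentences (hypothetical data).
model = build_bigram_model(["i love data", "i love text mining", "i like data"])
print(predict_next(model, "i"))
```

Higher-order n-grams condition on more preceding words in the same way, at the cost of needing far more text before each context has been seen often enough.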