Park
25/09/2020
This is the course project for the Data Science Capstone, the final course in the Statistics and Machine Learning Specialization offered on Coursera.
The aim of this project is to predict the most probable next English word from the sentence typed in by the user, balancing prediction accuracy against response time. The algorithm was then deployed as a Shiny app.
The data set used to build the model was sponsored by SwiftKey. Although the data set includes several languages, this project is limited to sentences from English blogs, news articles and Twitter posts.
This is a frequency-based text mining algorithm. Specifically, the frequencies of all words appearing in the data set are collected. This includes single words (e.g. children, people, time) as well as phrases whose words usually occur together (e.g. United States, New York City, two weeks ago). More common words are then retained, while those that rarely occur are discarded. The result of this process is a separate frequency file for each of the unigram, bigram and trigram tables.
However, since this process is very memory- and time-consuming, the frequency tables were built from a 3% sub-sample of the whole English data set.
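A minimal sketch of how these frequency tables could be built is shown below. It assumes the `dplyr` and `tidytext` packages, the standard file names of the SwiftKey English corpus, and a minimum-count cutoff of 2 for discarding rare phrases; the actual file names, packages and cutoff used in the project may differ.

```r
library(dplyr)
library(tidytext)

# Read the three English sources and take a 3% sub-sample of the lines
lines <- c(readLines("en_US.blogs.txt",   skipNul = TRUE),
           readLines("en_US.news.txt",    skipNul = TRUE),
           readLines("en_US.twitter.txt", skipNul = TRUE))
set.seed(123)
corpus <- tibble(text = sample(lines, round(length(lines) * 0.03)))

# Count n-gram frequencies and keep only the more common phrases
ngram_freq <- function(df, n, min_count = 2) {
  df %>%
    unnest_tokens(ngram, text, token = "ngrams", n = n) %>%
    filter(!is.na(ngram)) %>%
    count(ngram, sort = TRUE, name = "freq") %>%
    filter(freq >= min_count)
}

# One table (saved to its own file) per n-gram order
saveRDS(ngram_freq(corpus, 1), "unigram_freq.rds")
saveRDS(ngram_freq(corpus, 2), "bigram_freq.rds")
saveRDS(ngram_freq(corpus, 3), "trigram_freq.rds")
```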
The Shiny application asks the user to type in a sentence, leaving out its last word. It then follows the steps below to produce a predicted word along with a frequency table (a code sketch of this back-off search follows the list):
1. Search through the trigram table and see if there are any matches.
2. If so, give the most frequent word as the prediction, along with the frequency table of all the matches.
3. If not, search through the bigram table and repeat step 2.
4. If there is no match in the bigram table either, randomly sample 20 words from the unigram table and take the most frequent of these as the predicted word.
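The sketch below illustrates this back-off lookup under the assumption that the frequency tables built earlier have columns `ngram` and `freq`; the function name `predict_word` and the file names are hypothetical, not the ones used in the deployed app.

```r
library(dplyr)
library(stringr)

# Frequency tables produced in the previous step (assumed column names: ngram, freq)
unigrams <- readRDS("unigram_freq.rds")
bigrams  <- readRDS("bigram_freq.rds")
trigrams <- readRDS("trigram_freq.rds")

predict_word <- function(sentence) {
  words <- str_split(str_to_lower(str_squish(sentence)), " ")[[1]]
  n <- length(words)

  # Steps 1-2: match the last two words of the input against the trigram table
  if (n >= 2) {
    matches <- filter(trigrams,
                      startsWith(ngram, paste(words[n - 1], words[n], "")))
    if (nrow(matches) > 0) return(matches)
  }

  # Step 3: fall back to the bigram table, matching only the last word
  if (n >= 1) {
    matches <- filter(bigrams, startsWith(ngram, paste(words[n], "")))
    if (nrow(matches) > 0) return(matches)
  }

  # Step 4: no match at all, so sample 20 unigrams and rank them by frequency
  unigrams %>% slice_sample(n = 20) %>% arrange(desc(freq))
}

# The predicted word is the last word of the top row of the returned table
head(predict_word("I would like to thank you for the"), 5)
```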
Due to the memory constraints of my computer, only a small sample (3%) of the whole data set could be taken and processed. As a consequence, the accuracy with which the predicted word matches the user's actual word choice is quite low.
To improve the performance of the model, it may be useful to take into account the part of speech of each unigram word, so that the algorithm can predict a more sensible word rather than relying on random sampling. Alternatively, accuracy would also increase if the word frequencies could be collected in a less memory-consuming manner, allowing more phrases to be used to build the algorithm.