
Executive Summary

This project explores predictive text in collaboration with SwiftKey. The idea is to use Text Mining and Natural Language Processing to predict the next word for a user. As part of this initial report, the following objectives have been accomplished:
1. Downloading the data (News, Blogs and Twitter) and storing it as a Corpus
2. Understanding how to work with a Corpus
3. Creating a sample out of the large dataset
4. Pre-processing the data to clear the clutter
5. Word tokenisation
6. Exploratory data analysis
7. Creating n-gram models
8. Outlining the way forward for the Shiny app

Dataset Summary

The dataset consists of three large text files from News, Twitter and Blogs, each available in four languages: English, German, Finnish and Russian. The idea is to use these texts as the database for building an app that can predict text.

Sampling

Since it is difficult to work with such a huge dataset, I decided to create a representative sample of the data and work on it instead. I used the rbinom function to decide, line by line, whether each line of the larger database enters the sample. The rest of the processing has been done on this SAMPLE dataset; a minimal sketch of the sampling step follows.
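The sketch below illustrates the binomial sampling idea for one of the files; the file name and the 1% sampling rate are illustrative assumptions, not values taken from this report.

```r
set.seed(1234)  # make the sample reproducible
blogs <- readLines("./final/en_US/en_US.blogs.txt",
                   encoding = "UTF-8", skipNul = TRUE)

# Flip a biased coin for every line; keep the lines that come up 1.
keep <- rbinom(length(blogs), size = 1, prob = 0.01)
blogs_sample <- blogs[keep == 1]
```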

Data Pre-processing

For data pre-processing, I have performed the following actions (a sketch of the pipeline follows the list).
1. Changing the encoding to UTF-8.
2. Changing all upper case letters to lower case.
3. Removing special characters.
4. Removing punctuation.
5. Removing numbers.
6. Removing stopwords.
7. Reading in a list of profane words and removing them.
8. Removing single letters, “ve”, “ll”, “re” (left over after the above operations) and extra whitespace.
9. Stemming words with stemDocument.
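Below is a minimal sketch of this pipeline with the tm package; `blogs_sample` comes from the sampling step above, and the order of the steps mirrors the list.

```r
library(tm)

corpus  <- VCorpus(VectorSource(blogs_sample))
profane <- readLines("./final/en_US/profane/profane_words_english.txt",
                     warn = FALSE)  # the profanity file lacks a final newline

corpus <- tm_map(corpus, content_transformer(tolower))       # lower-case
corpus <- tm_map(corpus, removePunctuation)                  # punctuation
corpus <- tm_map(corpus, removeNumbers)                      # numbers
corpus <- tm_map(corpus, removeWords, stopwords("english"))  # stopwords
corpus <- tm_map(corpus, removeWords, profane)               # profanity
corpus <- tm_map(corpus, removeWords, c("ve", "ll", "re"))   # leftovers
corpus <- tm_map(corpus, stripWhitespace)                    # whitespace
corpus <- tm_map(corpus, stemDocument)                       # stemming
```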

Below is a brief summary of the data available for creating the predictive model.

##       Name   Lines Word_Count Avg_Words_Per_Line File_Size_MB
## 1:    Blog  899288   38154238           42.42716     200.4242
## 2:    News   77259    2693898           34.86840     196.2775
## 3: Twitter 2360148   30218125           12.80349     159.3641

Exploratory Analysis

Below are barplots showing the number of lines, the word count and the file size of the three datasets obtained from Blog, News and Twitter.
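As an illustration, a barplot like the ones above can be reproduced from the summary table with ggplot2; the data frame below simply re-enters the table's values so the sketch is self-contained.

```r
library(ggplot2)

dataset_summary <- data.frame(
  Name  = c("Blog", "News", "Twitter"),
  Lines = c(899288, 77259, 2360148)
)

ggplot(dataset_summary, aes(x = Name, y = Lines, fill = Name)) +
  geom_col() +  # one bar per dataset
  labs(title = "Lines per Dataset", x = NULL, y = "Number of Lines") +
  theme(legend.position = "none")
```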

There are 30069 unique words in the sample. About 950 of them (3%) cover 50% of all word occurrences, and about 12500 (42%) cover 90%. While inspecting the Term Document Matrix I also observed a few stray characters and foreign language words that remain after cleaning.
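These coverage figures can be computed from the Term Document Matrix; a minimal sketch, assuming `corpus` is the cleaned corpus from the pre-processing step above:

```r
tdm  <- TermDocumentMatrix(corpus)
freq <- sort(slam::row_sums(tdm), decreasing = TRUE)  # count per unique word

coverage <- cumsum(freq) / sum(freq)  # cumulative share of all occurrences
length(freq)                          # number of unique words
which(coverage >= 0.5)[1]             # words needed for 50% coverage
which(coverage >= 0.9)[1]             # words needed for 90% coverage
```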

Comparison WordCloud

I have drawn a comparison wordcloud below, which shows the most frequent words from News, Blog and Twitter together in a single plot.
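A comparison cloud can be produced with the wordcloud package; this sketch assumes a term-document matrix `tdm_all` with one column per source (Blog, News, Twitter), which is not shown in this report.

```r
library(wordcloud)
library(RColorBrewer)

m <- as.matrix(tdm_all)  # terms in rows, one column per source
colnames(m) <- c("Blog", "News", "Twitter")

comparison.cloud(m, max.words = 100,
                 colors = brewer.pal(3, "Dark2"))  # one colour per source
```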

N-Gram Models

Unigram Model

Below are the most frequently occurring words in the sample dataset.

Bigram Model

Below are the most frequently occurring bigrams in the sample dataset.

Trigram Model

Below are the most frequently occurring trigrams in the sample dataset.
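One common way to extract the bigram and trigram counts behind these plots is an n-gram tokenizer; the sketch below uses RWeka, which is an assumption here since the report does not name its tokenizer.

```r
library(tm)
library(RWeka)

bigram_tok  <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
trigram_tok <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))

tdm_bi  <- TermDocumentMatrix(corpus, control = list(tokenize = bigram_tok))
tdm_tri <- TermDocumentMatrix(corpus, control = list(tokenize = trigram_tok))

head(sort(slam::row_sums(tdm_bi),  decreasing = TRUE))  # top bigrams
head(sort(slam::row_sums(tdm_tri), decreasing = TRUE))  # top trigrams
```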

Way Forward

  1. Reducing the overall running time when using the actual database (not the sample).
  2. Checking the accuracy of the predictive model.
  3. Adding new datasets if the model/app is not giving good predictions.
  4. Better cleaning of the dataset to remove special characters and foreign language words.
  5. Researching a faster way to remove stop words.
  6. Creating the Shiny app to predict text.
  7. Creating a simple, user-friendly UI where the user can enter text and see results immediately.
  8. Looking for the text to be predicted in the trigram table first, then in the bigram table and finally in the unigram table (a minimal sketch of this backoff follows the list).
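Below is a minimal sketch of the trigram-to-bigram-to-unigram backoff described in item 8; the lookup tables here are hypothetical named vectors mapping a context to its most likely next word, not the app's actual data structures.

```r
predict_next <- function(w1, w2, tri_table, bi_table, uni_top) {
  key <- paste(w1, w2)
  if (!is.na(tri_table[key])) return(tri_table[[key]])  # trigram hit
  if (!is.na(bi_table[w2]))   return(bi_table[[w2]])    # back off to bigram
  uni_top                                               # fall back to top unigram
}

# Hypothetical usage:
tri_table <- c("i love" = "you")
bi_table  <- c("love"   = "it")
predict_next("i", "love", tri_table, bi_table, "the")   # returns "you"
predict_next("we", "love", tri_table, bi_table, "the")  # backs off: "it"
```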