This project explores predictive text in collaboration with SwiftKey. The idea is to use Text Mining and Natural Language Processing to predict text for a user. As part of this initial report, the following objectives have been accomplished:

1. Downloading the data (News, Blogs and Twitter) and storing it as a Corpus
2. Understanding how to work with a Corpus
3. Creating a sample out of the large dataset
4. Pre-processing the data to clear the clutter
5. Word Tokenisation
6. Exploratory Data Analysis
7. Creating an N-Gram Model
8. Way Forward for the Shiny app
The data consists of three large text files from News, Twitter and Blogs, each available in four languages: English, German, Finnish and Russian. The idea is to use these texts as the database for building an app that can predict text.
Since it is difficult to work with such a large dataset, I decided to create a representative sample of the data and work on it instead. I used the rbinom function to draw the sample from the full data, and the rest of the processing has been done on this SAMPLE dataset.
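The sampling step looks roughly like the sketch below. The file paths and the 5% sampling rate are assumptions for illustration, not necessarily the exact values used.

```r
# Keep each line with probability p, using rbinom as a per-line coin flip.
# Paths and the 5% rate are illustrative assumptions.
set.seed(1234)
sample_lines <- function(path, p = 0.05) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  lines[rbinom(length(lines), size = 1, prob = p) == 1]
}

blogs_sample   <- sample_lines("./final/en_US/en_US.blogs.txt")
news_sample    <- sample_lines("./final/en_US/en_US.news.txt")
twitter_sample <- sample_lines("./final/en_US/en_US.twitter.txt")

sample_text <- c(blogs_sample, news_sample, twitter_sample)
```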
For data pre-processing, I have performed the following actions.
1. Changing the encoding to UTF-8.
2. Changing all upper case letters to lower case.
3. Removing special characters.
4. Removing Punctuation.
5. Removing Numbers.
6. Removing Stopwords.
7. Reading and Removing Profane Words.
8. Removing Single Letters, “ve”, “ll” and “re” (fragments left after the above operations) and Whitespace.
9. Stemming the words with stemDocument (a sketch of the pipeline is shown below).
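The pre-processing steps above correspond roughly to the tm pipeline sketched here. This is a sketch rather than the exact code used: the object names (sample_text, corpus) are assumptions carried over from the sampling step, while the profane-word file path mirrors the one used in this report.

```r
library(tm)

# Build a corpus from the sampled lines
corpus <- VCorpus(VectorSource(sample_text))

# Profane-word list; the path mirrors the one used in this report
profane_words <- readLines("./final/en_US/profane/profane_words_english.txt",
                           skipNul = TRUE, warn = FALSE)

corpus <- tm_map(corpus, content_transformer(function(x) iconv(x, to = "UTF-8", sub = "")))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, removeWords, profane_words)
corpus <- tm_map(corpus, removeWords, c("ve", "ll", "re"))   # fragments left by the steps above
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, stemDocument)
```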
Below is a brief summary of the data available for creating the predictive model.
| Name    | Lines     | Word Count | Avg. Words per Line | File Size (MB) |
|---------|----------:|-----------:|--------------------:|---------------:|
| Blog    | 899,288   | 38,154,238 | 42.43               | 200.42         |
| News    | 77,259    | 2,693,898  | 34.87               | 196.28         |
| Twitter | 2,360,148 | 30,218,125 | 12.80               | 159.36         |
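This summary table can be reproduced with a helper along the following lines. The file paths and the use of stringi for word counts are assumptions, not the exact code used.

```r
library(data.table)
library(stringi)

# Per-file summary: line count, word count, average words per line, file size
summarise_file <- function(name, path) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  words <- sum(stri_count_words(lines))
  data.table(Name           = name,
             Lines          = length(lines),
             Word_Count     = words,
             Avg_Word_Count = words / length(lines),
             File_Size_MB   = file.size(path) / 1024^2)
}

rbindlist(list(
  summarise_file("Blog",    "./final/en_US/en_US.blogs.txt"),
  summarise_file("News",    "./final/en_US/en_US.news.txt"),
  summarise_file("Twitter", "./final/en_US/en_US.twitter.txt")
))
```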
Below are the barplots giving the number of lines, word count and file size of the three datasets obtained from Blog, News and Twitter.
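As an illustration, one of these barplots could be produced with ggplot2 along the following lines, assuming the summary table above is stored in a data frame called summary_dt.

```r
library(ggplot2)

# Barplot of line counts per source; summary_dt is the assumed name of the
# summary table shown above
ggplot(summary_dt, aes(x = Name, y = Lines, fill = Name)) +
  geom_col(show.legend = FALSE) +
  labs(title = "Number of lines per source", x = "Source", y = "Lines") +
  theme_minimal()
```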
There are 30,069 unique words in the sample. About 950 unique words (3%) cover 50% of all word occurrences, and about 12,500 words (42%) cover 90%. While inspecting the Term Document Matrix I also noticed a few stray characters and foreign-language words that survived the cleaning steps.
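These coverage figures follow from a cumulative-frequency calculation like the one sketched here, assuming a TermDocumentMatrix named tdm built from the cleaned sample corpus.

```r
# Word frequencies from the term-document matrix, most frequent first
freq <- sort(slam::row_sums(tdm), decreasing = TRUE)

# Smallest number of top words whose cumulative share reaches a given coverage
words_for_coverage <- function(freq, coverage) {
  which(cumsum(freq) / sum(freq) >= coverage)[1]
}

words_for_coverage(freq, 0.5)   # roughly 950 words cover 50% of occurrences
words_for_coverage(freq, 0.9)   # roughly 12,500 words cover 90%
```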
Below is a comparison word cloud, which overlays the most frequent words from the News, Blog and Twitter samples in a single cloud.
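Such a cloud can be drawn with comparison.cloud from the wordcloud package; the sketch below assumes a term-document matrix tdm_by_source whose three columns aggregate the News, Blog and Twitter samples.

```r
library(wordcloud)

# Comparison cloud across the three sources; tdm_by_source is an assumed
# term-document matrix with one aggregated column per source
m <- as.matrix(tdm_by_source)
colnames(m) <- c("News", "Blog", "Twitter")
comparison.cloud(m, max.words = 100, random.order = FALSE)
```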
Below are the most frequently occurring words in the sample dataset.
Below are the most frequently occurring bigrams in the sample dataset.
Below are the most frequently occurring trigrams in the sample dataset.
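The bigram and trigram counts come from N-gram tokenisation of the cleaned corpus; a sketch using RWeka's NGramTokenizer with the tm corpus is shown below (the choice of tokeniser is an assumption).

```r
library(tm)
library(RWeka)

# N-gram tokenizers handed to TermDocumentMatrix via its control list
bigram_tokenizer  <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
trigram_tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))

bigram_tdm  <- TermDocumentMatrix(corpus, control = list(tokenize = bigram_tokenizer))
trigram_tdm <- TermDocumentMatrix(corpus, control = list(tokenize = trigram_tokenizer))

# Ten most frequent bigrams and trigrams in the sample
head(sort(slam::row_sums(bigram_tdm),  decreasing = TRUE), 10)
head(sort(slam::row_sums(trigram_tdm), decreasing = TRUE), 10)
```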