Introduction

This project is done as part of the Data Science Specialization on Coursera. The goal is to create a Shiny app that uses a predictive algorithm to recommend the most likely words to follow a text phrase typed by the user, based on the previous 1, 2 or 3 words. This is the first milestone of the project, and its scope is to download and clean the data. Some exploratory data analysis is also performed on the data and presented through summary tables and plots.

Data

The input set is represented by three files that contain text from different web sources (blogs, news and Twitter). The content is similar, but the texts (especially Twitter messages, which are often typed on smartphones) are characterized by the use of slang, emoticons, special characters, and so on.

This increases the difficulty of the typical pre-processing problems in NLP, such as stemming (reducing inflected forms such as singular/plural to a common base), punctuation removal, profanity filtering, and the elimination of words that do not belong to the target natural language.

Step 1: Loading the required libraries and the data. The dataset is available for download as a zip file from the link provided in the course. Check whether the corpora already exist locally; if not, download the zip file and unzip it to extract the raw data into the selected working directory.
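A minimal sketch of this step is shown below. The package choices, the placeholder `zip_url`, and the `final/en_US/` paths inside the archive are assumptions rather than the exact code used in the report.

```r
library(tm)       # text-mining framework used later for cleaning
library(stringi)  # fast string statistics

zip_url  <- "<course-provided download link>"  # placeholder for the dataset URL
zip_file <- "Coursera-SwiftKey.zip"

# Download and unzip only if the corpora are not already present
if (!file.exists("final/en_US/en_US.blogs.txt")) {
  if (!file.exists(zip_file)) download.file(zip_url, zip_file, mode = "wb")
  unzip(zip_file)
}

# Read the three English source files
blogs   <- readLines("final/en_US/en_US.blogs.txt",   encoding = "UTF-8", skipNul = TRUE)
news    <- readLines("final/en_US/en_US.news.txt",    encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("final/en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
```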

Step 2: Exploratory analysis of the data and creating a sample. Basic statistical information about the loaded data is shown first. A sample is then created from the complete data: 10% of the lines from each source are kept, and the sample statistics are displayed.
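A sketch of how the summary table and the 10% sample could be produced; the `line_stats` helper and the fixed seed are illustrative assumptions, not necessarily the exact code behind the output below.

```r
# Basic statistics for one vector of lines
line_stats <- function(lines, path) {
  c(`File Size`         = file.size(path),
    Length              = length(lines),
    `Max Char per Line` = max(nchar(lines)),
    `Total Characters`  = sum(nchar(lines)))
}

rbind(Blogs   = line_stats(blogs,   "final/en_US/en_US.blogs.txt"),
      News    = line_stats(news,    "final/en_US/en_US.news.txt"),
      Twitter = line_stats(twitter, "final/en_US/en_US.twitter.txt"))

# Keep 10% of the lines from each source
set.seed(1234)
sample_blogs   <- blogs[sample(length(blogs),     length(blogs)   %/% 10)]
sample_news    <- news[sample(length(news),       length(news)    %/% 10)]
sample_twitter <- twitter[sample(length(twitter), length(twitter) %/% 10)]
sample_all     <- c(sample_blogs, sample_news, sample_twitter)
```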

Observations:

1. The Twitter dataset is smaller in total characters and words, but has far more lines.
2. The news dataset has the longest average word length, whereas the Twitter dataset has the shortest.

## [1] "Summary of All Data"
##         File Size  Length Max Char per Line Total Characters
## Blogs   267758632  899288             40835        208361438
## News     20729472   77259              5760         15683765
## Twitter 334484736 2360148               213        162384825
## [1] "Statistical Summary of All Data"
##         Minimum 25th Quantile Median 75th Quantile Maximum Average
## Blogs         1            47    157           331   40835     232
## News          2           111    186           270    5760     203
## Twitter       2            37     64           100     213      69

##         FileSize Length MaxChar
## Blogs   26657536  89928    5038
## News     2069624   7725    1397
## Twitter 33727968 236014     168

## [1] "Statistics Before Cleaning"
##      [,1]        [,2]        [,3]     [,4]               
## [1,] "Statistic" "File Size" "Length" "Max Char per Line"
## [2,] "Corpus"    "226377680" "50000"  "32082372"

Step 3: Cleaning the sampled data. In particular we will:

a. Remove extra white space
b. Eliminate punctuation
c. Eliminate English stop words
d. Convert all text to lower case
e. Reduce the variations of words to a base form (stemming)
f. Eliminate profanities
g. Eliminate numbers
h. Eliminate URLs

A sketch of this pipeline is shown below.
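The sketch uses the `tm` package; the `profanities.txt` word list, the URL-removal pattern, and the use of `SnowballC` stemming are assumptions.

```r
library(tm)
library(SnowballC)  # provides stemDocument()

# Custom transformation: replace a regex pattern with a space
to_space <- content_transformer(function(x, pattern) gsub(pattern, " ", x))

corpus <- VCorpus(VectorSource(sample_all))
corpus <- tm_map(corpus, to_space, "http\\S+|www\\.\\S+")       # h. remove URLs
corpus <- tm_map(corpus, content_transformer(tolower))          # d. lower case
corpus <- tm_map(corpus, removeNumbers)                         # g. remove numbers
corpus <- tm_map(corpus, removePunctuation)                     # b. remove punctuation
corpus <- tm_map(corpus, removeWords, stopwords("english"))     # c. remove English stop words
profanities <- readLines("profanities.txt")                     # assumed profanity word list
corpus <- tm_map(corpus, removeWords, profanities)              # f. remove profanities
corpus <- tm_map(corpus, stemDocument)                          # e. reduce words to a base form
corpus <- tm_map(corpus, stripWhitespace)                       # a. collapse extra white space
```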

## [1] "Statistics After Cleaning"
##      [,1]        [,2]        [,3]     [,4]               
## [1,] "Statistic" "File Size" "Length" "Max Char per Line"
## [2,] "Corpus"    "220486992" "50000"  "25791108"

Step 4: Tokenizing the data into 1-, 2-, 3- and 4-grams
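A sketch of the tokenization step, assuming the `RWeka` n-gram tokenizer and a `TermDocumentMatrix` are used to count n-grams in the cleaned corpus built above; the frequency tables that follow show the 15 most common terms of each order.

```r
library(RWeka)

# Frequency table of n-grams of a given order for a tm corpus
ngram_freq <- function(corpus, n) {
  tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = n, max = n))
  tdm  <- TermDocumentMatrix(corpus, control = list(tokenize = tokenizer))
  freq <- sort(slam::row_sums(tdm), decreasing = TRUE)
  data.frame(word = names(freq), freq = freq)
}

unigrams  <- ngram_freq(corpus, 1)
bigrams   <- ngram_freq(corpus, 2)
trigrams  <- ngram_freq(corpus, 3)
quadgrams <- ngram_freq(corpus, 4)

head(unigrams, 15)
```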


##      word  freq
## the   the 13942
## one   one  9810
## like like  9253
## just just  9001
## get   get  8829
## time time  8034
## can   can  7838
## day   day  6560
## make make  6451
## love love  6326
## know know  5839
## year year  5359
## good good  5200
## now   now  5188
## work work  5071

##              word freq
## na i         na i 3375
## i think   i think 1881
## i love     i love 1574
## na thank na thank 1334
## i can       i can 1294
## i know     i know 1283
## i dont     i dont 1171
## i want     i want 1167
## i just     i just 1148
## na im       na im  889
## i like     i like  729
## i need     i need  693
## i hope     i hope  680
## i got       i got  678
## time i     time i  675

##                            word freq
## i think i             i think i  247
## na i love             na i love  242
## i dont know         i dont know  237
## i know i               i know i  190
## na i think           na i think  183
## na i dont             na i dont  154
## i feel like         i feel like  153
## na thank follow na thank follow  151
## i dont think       i dont think  148
## i wish i               i wish i  135
## na i know             na i know  133
## na i just             na i just  132
## i donâ\200\231t know   i donâ\200\231t know  126
## i thought i         i thought i  126
## na good morn       na good morn  122
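The exploratory results can also be visualized; a minimal `ggplot2` sketch for the top unigrams is shown below (the same pattern applies to the bigram and trigram tables).

```r
library(ggplot2)

top_uni <- head(unigrams, 15)
ggplot(top_uni, aes(x = reorder(word, freq), y = freq)) +
  geom_col(fill = "steelblue") +
  coord_flip() +
  labs(x = "Unigram", y = "Frequency",
       title = "15 most frequent unigrams in the sampled corpus")
```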

Next Steps and Prediction Strategy

The idea is to use the n-grams to predict the words that follow a phrase typed by the user. In particular, the algorithm will consider the last 3 typed words (or fewer) and check whether there is a four-gram whose first three words match them. If no such four-gram is found, the trigrams whose first two words match the last two typed words are searched, and so on, down to proposing the single most frequent words (unigrams).
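A simplified sketch of this back-off lookup over the frequency tables built above; the `predict_next` helper is illustrative only, and a real model would add smoothing and better handling of unseen phrases.

```r
# Return the most frequent continuations of the last 1-3 typed words
predict_next <- function(phrase, n_best = 3) {
  words <- tail(unlist(strsplit(tolower(phrase), "\\s+")), 3)

  lookup <- function(tab, prefix) {
    hits <- tab[grepl(paste0("^", prefix, " "), tab$word), ]
    head(sub(paste0("^", prefix, " "), "", hits$word), n_best)
  }

  # Try 4-grams, then trigrams, then bigrams; fall back to top unigrams
  for (k in length(words):1) {
    prefix <- paste(tail(words, k), collapse = " ")
    tab    <- list(bigrams, trigrams, quadgrams)[[k]]
    result <- lookup(tab, prefix)
    if (length(result) > 0) return(result)
  }
  head(unigrams$word, n_best)
}

predict_next("thanks for the")
```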

With this in mind, the next steps will aim at finding a representation of the n-grams in line with the objectives set and at developing a predictive model that determines the next word from the set of words typed by the user.