This project is part of the Data Science Specialization on Coursera. The goal is to create a Shiny app with a predictive algorithm that recommends the words most likely to follow a phrase typed by the user, based on the previous one, two, or three words. This is the first milestone of the project, and its scope is to download and clean the data. Some exploratory data analysis is performed on the data and represented through various plots.
The input set consists of three files containing text from different web sources (blogs, news, and Twitter). The content is similar, but the texts (especially Twitter messages, which are often typed on smartphones) are characterized by the use of slang, emoticons, special characters, and so on.
This increases the difficulty of typical NLP preprocessing tasks such as stemming (reducing inflected forms, e.g. singular/plural, to a common base), punctuation removal, profanity filtering, and the elimination of words that do not belong to the target natural language.
Step 1: Loading the required libraries and the data

The dataset is available for download as a zip file via the link provided in the course. Check whether the corpora folder already exists; if not, download the zip file and unzip it to extract the raw data into the selected working directory.
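A minimal sketch of this step (the zip name and the `final/en_US/` paths are assumptions based on the standard layout of the course dataset; use the download link from the course page):

```r
url      <- "<course dataset link>"          # placeholder for the course-provided URL
zip_file <- "Coursera-SwiftKey.zip"          # assumed zip name

# Download and unzip only if the corpora folder is not already present
if (!file.exists("final")) {
  if (!file.exists(zip_file)) {
    download.file(url, destfile = zip_file, mode = "wb")
  }
  unzip(zip_file)
}

# Read the three English input files
blogs   <- readLines("final/en_US/en_US.blogs.txt",   encoding = "UTF-8", skipNul = TRUE)
news    <- readLines("final/en_US/en_US.news.txt",    encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("final/en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
```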
Step 2: Exploratory analysis of the data and creating a sample

Basic statistical information about the loaded data is shown below. A sample is then created from the complete data: 10% of each file is drawn into the sample, and the sample statistics are displayed.
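A sketch of the sampling and per-file statistics; the helper names and the fixed seed are assumptions, and the report's actual code may differ:

```r
set.seed(1234)  # assumption: any fixed seed, for reproducibility

# Keep each line with probability 0.1, i.e. roughly 10% of every file
sample_lines <- function(lines, fraction = 0.1) {
  lines[rbinom(length(lines), size = 1, prob = fraction) == 1]
}

blogs_sample   <- sample_lines(blogs)
news_sample    <- sample_lines(news)
twitter_sample <- sample_lines(twitter)

# Per-file statistics: size in bytes, number of lines, longest line
file_stats <- function(lines) {
  c(FileSize = sum(nchar(lines, type = "bytes")),
    Length   = length(lines),
    MaxChar  = max(nchar(lines)))
}
rbind(Blogs   = file_stats(blogs_sample),
      News    = file_stats(news_sample),
      Twitter = file_stats(twitter_sample))
```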
Observations:
1. The Twitter dataset is the smallest in total characters and words, but has by far the most lines.
2. The news dataset has the longest average word length, whereas the Twitter dataset has the shortest.
## [1] "Summary of All Data"
##          File Size   Length  Max Char per Line  Total Characters
## Blogs    267758632   899288              40835         208361438
## News      20729472    77259               5760          15683765
## Twitter  334484736  2360148                213         162384825
## [1] "Statistical Summary of All Data"
##          Minimum  25th Quantile  Median  75th Quantile  Maximum  Average
## Blogs          1             47     157            331    40835      232
## News           2            111     186            270     5760      203
## Twitter        2             37      64            100      213       69
Sample statistics (10% of each file):
##           FileSize  Length  MaxChar
## Blogs     26657536   89928     5038
## News       2069624    7725     1397
## Twitter   33727968  236014      168
## [1] "Statistics Before Cleaning"
##      [,1]        [,2]        [,3]     [,4]
## [1,] "Statistic" "File Size" "Length" "Max Char per Line"
## [2,] "Corpus"    "226377680" "50000"  "32082372"
Step 3: Cleaning the sampled data

In particular, we will perform the following steps (sketched in code after this list):
a. Remove extra white space
b. Eliminate punctuation
c. Eliminate English stop words
d. Convert all text to lower case
e. Reduce inflected word variants to a base form (stemming)
f. Eliminate profanities
g. Eliminate numbers
h. Eliminate URLs
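One way to implement this pipeline is with the tm package; the sketch below assumes the combined sample `sample_text` and a profanity word list `profanities` (both names are assumptions, and the placeholder list must be replaced with a real one):

```r
library(tm)

sample_text <- c(blogs_sample, news_sample, twitter_sample)   # assumed combined sample
sample_text <- gsub("http\\S+|www\\.\\S+", " ", sample_text)  # h. URLs
profanities <- c("badword1", "badword2")   # placeholder; use a real profanity list

corpus <- VCorpus(VectorSource(sample_text))
corpus <- tm_map(corpus, content_transformer(tolower))        # d. lower case
corpus <- tm_map(corpus, removePunctuation)                   # b. punctuation
corpus <- tm_map(corpus, removeNumbers)                       # g. numbers
corpus <- tm_map(corpus, removeWords, stopwords("english"))   # c. stop words
corpus <- tm_map(corpus, removeWords, profanities)            # f. profanity
corpus <- tm_map(corpus, stemDocument)                        # e. base forms
corpus <- tm_map(corpus, stripWhitespace)                     # a. white space
```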
## [1] "Statistics After Cleaning"
##      [,1]        [,2]        [,3]     [,4]
## [1,] "Statistic" "File Size" "Length" "Max Char per Line"
## [2,] "Corpus"    "220486992" "50000"  "25791108"
Step 4: Tokenizing the data into 1-, 2-, 3-, and 4-grams

The tables below list the most frequent unigrams, bigrams, and trigrams in the cleaned sample.
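Such frequency tables can be produced with tm's TermDocumentMatrix combined with RWeka n-gram tokenizers; the helper below is a sketch under that assumption, not necessarily the report's exact code:

```r
library(RWeka)

# Tokenizer factories for bigrams and trigrams (a 4-gram variant is analogous)
bigram_tok  <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
trigram_tok <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))

# Count term frequencies in the cleaned corpus and return the top n terms
top_terms <- function(corpus, tokenizer = NULL, n = 15) {
  ctrl <- if (is.null(tokenizer)) list() else list(tokenize = tokenizer)
  tdm  <- TermDocumentMatrix(corpus, control = ctrl)
  freq <- sort(slam::row_sums(tdm), decreasing = TRUE)  # avoids a dense matrix
  head(data.frame(word = names(freq), freq = freq), n)
}

top_terms(corpus)               # unigrams
top_terms(corpus, bigram_tok)   # bigrams
top_terms(corpus, trigram_tok)  # trigrams
```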
Top 15 unigrams in the cleaned sample:
## word freq
## the the 13942
## one one 9810
## like like 9253
## just just 9001
## get get 8829
## time time 8034
## can can 7838
## day day 6560
## make make 6451
## love love 6326
## know know 5839
## year year 5359
## good good 5200
## now now 5188
## work work 5071
Top 15 bigrams:
## word freq
## na i na i 3375
## i think i think 1881
## i love i love 1574
## na thank na thank 1334
## i can i can 1294
## i know i know 1283
## i dont i dont 1171
## i want i want 1167
## i just i just 1148
## na im na im 889
## i like i like 729
## i need i need 693
## i hope i hope 680
## i got i got 678
## time i time i 675
Top 15 trigrams:
## word freq
## i think i i think i 247
## na i love na i love 242
## i dont know i dont know 237
## i know i i know i 190
## na i think na i think 183
## na i dont na i dont 154
## i feel like i feel like 153
## na thank follow na thank follow 151
## i dont think i dont think 148
## i wish i i wish i 135
## na i know na i know 133
## na i just na i just 132
## i don't know i don't know 126
## i thought i i thought i 126
## na good morn na good morn 122
Next Steps and Prediction Strategy
The idea is to use the n-grams to predict the words that follow a phrase typed by the user. In particular, the algorithm will consider the last three typed words (or fewer) and check whether there is a four-gram whose first three words match them. If no matching four-gram is found, it will look for trigrams whose first two words match the last two typed words, and so on, down to proposing the single most frequent words (unigrams).
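A sketch of this backoff lookup, assuming the n-gram tables are stored as data frames with `prefix`, `next_word`, and `freq` columns sorted by decreasing frequency (this representation is an assumption; finding the right one is exactly the next step discussed below):

```r
# Predict up to n candidate next words for a typed phrase via simple backoff.
# Assumption: fourgrams/trigrams/bigrams are data frames with columns
# prefix, next_word, freq (sorted by freq); unigrams has columns word, freq.
predict_next <- function(phrase, fourgrams, trigrams, bigrams, unigrams, n = 3) {
  typed  <- unlist(strsplit(tolower(phrase), "\\s+"))
  tables <- list(fourgrams, trigrams, bigrams)
  sizes  <- c(3, 2, 1)                     # prefix lengths, longest first

  for (i in seq_along(tables)) {
    k <- sizes[i]
    if (length(typed) < k) next            # back off if too few words typed
    prefix <- paste(tail(typed, k), collapse = " ")
    hits   <- tables[[i]][tables[[i]]$prefix == prefix, ]
    if (nrow(hits) > 0) return(head(hits$next_word, n))
  }
  head(unigrams$word, n)                   # last resort: most frequent unigrams
}
```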
With this in mind, the next steps will aim at finding a representation of the n-grams in line with these objectives and at developing a predictive model that determines the next word from the set of words typed by the user.