This milestone report is part of the Data Science Capstone project offered by Coursera and SwiftKey. The main objective of the capstone is to turn corpora of text into a next-word prediction system, based on word frequencies and context, applying data science to natural language processing. This R Markdown report describes an exploratory analysis of the sample training data set and summarizes plans for building the prediction model. The text-mining R packages tm[1] and quanteda[2] are used for cleaning, preprocessing, managing, and analyzing text. This report meets the following requirements:
Downloads and loads the data, creates a sample training data set, and preprocesses it.
Generates summary statistics about the data sets and makes basic plots such as histograms to illustrate features of the data.
Describes some interesting findings.
Reports plans for creating a prediction algorithm and Shiny app.
As part of the Data Science Capstone project, this milestone report demonstrates the work done on exploratory data analysis and modeling. To get started, I downloaded the SwiftKey dataset. After extraction, I chose to work with the en_US folder, which contains the following three files:
en_US.blogs.txt en_US.news.txt en_US.twitter.txt
File     Lines    LinesNEmpty  Chars      CharsNWhite  TotalWords
blogs    899288   899288       208361438  171926076    37865888
news     1010242  1010242      203791405  170428853    34678691
twitter  2360148  2360148      162385035  134371036    30578933
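The report does not show how these figures were produced. The column names resemble the output of stringi's stri_stats_general, but that is an assumption. As a rough sketch of what each column could mean, the base R helper below (summarize_lines is a hypothetical name) computes the same kind of statistics for a character vector of lines:

```r
# Hypothetical helper: summary statistics for one text source,
# given as a character vector with one element per line.
summarize_lines <- function(name, lines) {
  data.frame(
    File        = name,
    Lines       = length(lines),                          # all lines
    LinesNEmpty = sum(grepl("\\S", lines)),               # lines with non-white text
    Chars       = sum(nchar(lines)),                      # all characters
    CharsNWhite = sum(nchar(gsub("\\s", "", lines))),     # characters excluding white space
    TotalWords  = sum(lengths(strsplit(trimws(lines), "\\s+"))),  # white-space separated tokens
    stringsAsFactors = FALSE
  )
}

# Toy input; the real input would come from readLines() on each en_US file.
demo  <- c("Hello world", "", "one two  three")
stats <- summarize_lines("demo", demo)
print(stats)
```

Word counts here are simple white-space splits, so they may differ slightly from counts produced by a dedicated tokenizer.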
The full dataset is unwieldy to work with, so I slimmed it down to make it more manageable: I took a subset of 20K lines from each file.
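The report does not say how the 20K lines were chosen. A minimal sketch, assuming simple random sampling without replacement, might look like this (sample_lines and the seed are hypothetical, and the toy vectors stand in for the real files):

```r
set.seed(42)  # hypothetical seed, only so the sketch is reproducible

# Draw up to n lines at random from a character vector, without replacement.
sample_lines <- function(lines, n = 20000) {
  lines[sample.int(length(lines), min(n, length(lines)))]
}

# Stand-in data; in the real workflow each vector would come from
# readLines() on en_US.blogs.txt, en_US.news.txt and en_US.twitter.txt.
blogs   <- paste("blog line", 1:50)
news    <- paste("news line", 1:50)
twitter <- paste("tweet", 1:10)

# Sample 20 lines per source here (20000 in the report's setting).
training <- c(sample_lines(blogs, 20),
              sample_lines(news, 20),
              sample_lines(twitter, 20))
length(training)
```

The min() guard means a source with fewer lines than requested is returned whole instead of raising an error.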
The tm package is used to clean the text in the following steps:
Replace unused characters with spaces
Remove stopwords (non-English)
Convert to lowercase
Remove profane language, using a bad-word list loaded from an external source
Remove punctuation
Remove numbers
Remove extra whitespace
Remove ‘I’, which occurred a significant number of times while adding no value
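The cleaning steps can be sketched in base R as follows. This is a stand-in for the tm pipeline, not the report's actual code: the word lists are hypothetical placeholders, and white space is collapsed once at the end so that no gaps remain after the last removal.

```r
# Hypothetical word lists; the report's actual stop-word and bad-word
# lists are not shown.
stop_words <- c("a", "an", "and", "of", "to")
bad_words  <- c("darn", "heck")

# Replace whole-word matches of any listed word with a space.
remove_words <- function(x, words) {
  pattern <- paste0("\\b(", paste(words, collapse = "|"), ")\\b")
  gsub(pattern, " ", x, perl = TRUE)
}

clean_text <- function(x) {
  x <- iconv(x, from = "UTF-8", to = "ASCII", sub = " ")  # unusable characters -> space
  x <- remove_words(x, stop_words)          # stop words (before lowercasing, as listed)
  x <- tolower(x)                           # lowercase
  x <- remove_words(x, bad_words)           # profanity
  x <- gsub("[[:punct:]]", " ", x)          # punctuation
  x <- gsub("[[:digit:]]", " ", x)          # numbers
  x <- gsub("\\bi\\b", " ", x, perl = TRUE) # standalone "i"
  trimws(gsub("\\s+", " ", x))              # collapse and trim white space
}

cleaned <- clean_text("I went to the Store, and bought 3 darn apples!")
print(cleaned)
```

Because stop words are removed before lowercasing, capitalized forms such as "The" would survive that step and be lowercased afterwards; reordering the two steps would avoid this.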
Using the cleaned corpus, I looked at speech patterns and investigated sequenced word pairs, syllables, lettering, and so on using n-gram modeling. Unigrams, bigrams, and trigrams were created, together with their word frequencies.
After completing the n-gram modeling for unigrams, bigrams, and trigrams, the total frequencies were determined. The ten most frequent unigrams and bigrams are listed below.
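As an illustration of the n-gram counting step, here is a minimal base R sketch. The report itself relies on tm and quanteda; the helpers tokenize, ngrams, and ngram_freq are hypothetical names, and the three input lines are toy data.

```r
# Split cleaned lines into word tokens, one character vector per line.
tokenize <- function(lines) strsplit(trimws(lines), "\\s+")

# Build all n-grams of one token vector; character(0) if the line is too short.
ngrams <- function(tokens, n) {
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts,
         function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Count n-grams across all lines, sorted by decreasing frequency.
ngram_freq <- function(lines, n) {
  grams <- unlist(lapply(tokenize(lines), ngrams, n = n))
  if (length(grams) == 0) {
    return(data.frame(word = character(0), frequency = integer(0),
                      stringsAsFactors = FALSE))
  }
  counts <- sort(table(grams), decreasing = TRUE)
  data.frame(word = names(counts), frequency = as.integer(counts),
             stringsAsFactors = FALSE)
}

toy_lines <- c("new york is big", "i love new york", "new york new jersey")
uni <- ngram_freq(toy_lines, 1)
bi  <- ngram_freq(toy_lines, 2)
tri <- ngram_freq(toy_lines, 3)
print(head(bi, 3))
```

N-grams are built per line, so no n-gram spans the boundary between two lines.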
word  frequency
the   9821
said  5846
will  5487
one   5211
just  4507
can   4094
like  4061
time  3603
get   3495
new   3212
word         frequency
last year    350
new york     338
right now    327
years ago    270
high school  269
last week    220
first time   217
st louis     202
last night   191
new jersey   190
In planning the development of the prediction algorithm, I intend to use an n-gram data frame to calculate the probabilities of the next word given the previous word(s). For the Shiny app, the plan is to create an inviting, easy-to-use interface in which the user enters some text and the prediction algorithm then suggests the next word.
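One way the planned lookup could work is sketched below. It is not the final algorithm: predict_next is a hypothetical helper that takes a bigram data frame in the word/frequency layout used in this report and estimates next-word probabilities as relative frequencies among the bigrams that start with the given word. The rows are a few of the bigram counts reported above, so the probabilities are relative to this small table only.

```r
# A few bigram counts taken from the frequency table in this report.
bigrams <- data.frame(
  word      = c("last year", "new york", "last week", "last night", "new jersey"),
  frequency = c(350, 338, 220, 191, 190),
  stringsAsFactors = FALSE
)

# Hypothetical helper: rank candidate next words for one previous word.
predict_next <- function(prev_word, bigrams, k = 3) {
  parts  <- strsplit(bigrams$word, " ", fixed = TRUE)
  first  <- vapply(parts, `[`, character(1), 1)   # first word of each bigram
  second <- vapply(parts, `[`, character(1), 2)   # second word of each bigram
  hits   <- first == tolower(prev_word)
  if (!any(hits)) {
    return(data.frame(next_word = character(0), probability = numeric(0),
                      stringsAsFactors = FALSE))
  }
  out <- data.frame(
    next_word   = second[hits],
    probability = bigrams$frequency[hits] / sum(bigrams$frequency[hits]),
    stringsAsFactors = FALSE
  )
  out <- out[order(out$probability, decreasing = TRUE), , drop = FALSE]
  head(out, k)
}

after_last <- predict_next("last", bigrams)
after_new  <- predict_next("New", bigrams)
print(after_last)
```

An unseen previous word returns an empty data frame here; a fuller model would need a fallback for that case, for example backing off from trigrams to bigrams to the most frequent unigrams.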