Introduction

This milestone report is part of the Data Science Capstone project offered by Coursera and SwiftKey. The main objective of the capstone is to turn corpora of text into a next-word prediction system based on word frequencies and context, applying data science to natural language processing. This R Markdown report describes an exploratory analysis of the sampled training data and summarizes plans for building the prediction model. The text mining R packages tm[1] and quanteda[2] are used for cleaning, preprocessing, managing, and analyzing text. This report meets the following requirements:

Downloads and loads the data, creates a sample training data set, and preprocesses it.
Generates summary statistics about the data sets and makes basic plots such as histograms to illustrate features of the data.
Describes some interesting findings.
Reports plans for creating a prediction algorithm and Shiny app.

Data Acquisition and Summary Statistics

Executive Summary

As part of the Data Science Capstone project, this milestone report demonstrates the work done on exploratory data analysis and modeling. To get started, I downloaded the SwiftKey data set. After extraction, I chose to work with the en_US folder, which contains the following three files:

en_US.blogs.txt
en_US.news.txt
en_US.twitter.txt

Load the libraries

Download and Load the Course Data Sets
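
A minimal sketch of how the three files could be read into memory with base R. The helper name `read_corpus_file` and the directory `final/en_US` are assumptions made for illustration, so adjust the path to wherever the archive was extracted.

```r
# Hypothetical helper: read one text file into a character vector, one element per line.
read_corpus_file <- function(path) {
  con <- file(path, open = "rb")   # binary mode so stray control characters do not end the read early
  on.exit(close(con))
  readLines(con, encoding = "UTF-8", skipNul = TRUE, warn = FALSE)
}

# Assumed location of the extracted files (not stated in this report).
data_dir <- file.path("final", "en_US")
files <- c(blogs = "en_US.blogs.txt", news = "en_US.news.txt", twitter = "en_US.twitter.txt")
paths <- setNames(file.path(data_dir, files), names(files))

# Only load when the files are actually present at the assumed location.
if (all(file.exists(paths))) {
  corpus_lines <- lapply(paths, read_corpus_file)
}
```

The result, when the files exist, is a named list with one character vector per source (blogs, news, twitter).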

File inspection

     File   Lines LinesNEmpty     Chars CharsNWhite TotalWords
1   blogs  899288      899288 208361438   171926076   37865888
2    news 1010242     1010242 203791405   170428853   34678691
3 twitter 2360148     2360148 162385035   134371036   30578933
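
Statistics of this kind can be approximated in base R. The sketch below is an assumption about what each column means (LinesNEmpty as non-empty lines, CharsNWhite as non-whitespace characters, TotalWords as whitespace-separated tokens); it is not necessarily the function that produced the table above, and `file_stats` is a hypothetical name.

```r
# Hypothetical helper: summary statistics for a character vector of lines.
file_stats <- function(lines) {
  non_empty <- lines[nzchar(trimws(lines))]              # lines with at least one non-space character
  words <- unlist(strsplit(trimws(non_empty), "\\s+"))   # whitespace-separated tokens
  data.frame(
    Lines       = length(lines),
    LinesNEmpty = length(non_empty),
    Chars       = sum(nchar(lines)),
    CharsNWhite = sum(nchar(gsub("\\s", "", lines))),
    TotalWords  = length(words)
  )
}

demo <- c("The quick brown fox", "", "jumps over  the lazy dog")
print(file_stats(demo))
```

For the three demo lines this gives 3 lines, 2 non-empty lines, 43 characters, 35 non-whitespace characters, and 9 words.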

Sampling

Slimming down the data set makes it more manageable to work with. I subset the data to 20,000 lines from each file.
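
One way to take such a subset is a reproducible random draw. This is a minimal sketch: the helper name `sample_lines` and the seed value are assumptions, and the report does not state how the 20,000 lines were actually chosen.

```r
# Hypothetical helper: draw up to n lines at random, without replacement.
sample_lines <- function(lines, n = 20000, seed = 1234) {
  set.seed(seed)   # arbitrary fixed seed so the same sample is drawn on every run
  lines[sample.int(length(lines), min(n, length(lines)))]
}

demo <- sprintf("line %d", 1:100)
picked <- sample_lines(demo, n = 10)
length(picked)
```

Capping the draw at `length(lines)` keeps the helper safe for files shorter than the requested sample size.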

Cleaning and exploring data

The tm package is used to clean the text in the following steps:

Replace unused characters with spaces
Remove stopwords (non-English)
Convert to lowercase
Remove profane language
Remove punctuation
Remove numbers
Remove whitespace
Remove ‘I’, which occurred a significant number of times while adding no value

The bad-word list used for the profanity step is loaded from an external source.
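
The cleaning steps can be sketched with base R regular expressions. This is a stand-in for illustration, not the tm pipeline used for this report: `clean_text` is a hypothetical helper, and `drop_words` is an assumed argument that would hold the stopwords, the bad-word list, and "i" together.

```r
# Hypothetical helper: clean a character vector of raw text lines.
clean_text <- function(lines, drop_words = character(0)) {
  x <- tolower(lines)                            # convert to lowercase
  x <- gsub("[^a-z ]", " ", x, perl = TRUE)      # punctuation, digits and other characters -> space
  if (length(drop_words) > 0) {
    # Match listed words only as whole words, so "i" does not hit the "i" inside "it".
    pattern <- paste0("\\b(", paste(drop_words, collapse = "|"), ")\\b")
    x <- gsub(pattern, " ", x, perl = TRUE)
  }
  trimws(gsub(" +", " ", x))                     # collapse repeated spaces and trim the ends
}

# "darn" is a placeholder for an entry in the bad-word list.
demo <- c("I saw 3 darn cats!", "  It's  2020, Right? ")
clean_text(demo, drop_words = c("i", "darn"))
```

Note that replacing punctuation with a space splits contractions ("it's" becomes "it s"). Deleting apostrophes instead would be a reasonable alternative design.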

N-gram Tokenization

Using the cleaned corpus, I looked at language patterns by investigating word sequences through n-gram modeling. Based on word frequencies, unigram, bigram, and trigram models were created.
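
As an illustration of what n-gram tokenization produces, the sketch below builds word n-grams from a single cleaned line in base R. The helper `make_ngrams` is hypothetical; the tm and quanteda packages provide their own tokenizers.

```r
# Hypothetical helper: build word n-grams from one line of cleaned text.
make_ngrams <- function(line, n = 2) {
  words <- strsplit(trimws(line), "\\s+")[[1]]
  words <- words[nzchar(words)]                        # guard against empty tokens
  if (length(words) < n) return(character(0))          # line too short for this n
  starts <- seq_len(length(words) - n + 1)             # start index of each window
  vapply(starts, function(i) paste(words[i:(i + n - 1)], collapse = " "), character(1))
}

make_ngrams("see you next week", n = 1)   # unigrams
make_ngrams("see you next week", n = 2)   # bigrams
make_ngrams("see you next week", n = 3)   # trigrams
```

A line of four words yields four unigrams, three bigrams, and two trigrams.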

Exploratory Analysis

After completing N-gram modeling for unigrams, bigrams and trigrams, the total frequencies are determined and plotted below
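
A frequency table in the `word` / `frequency` layout used in this section can be produced from a vector of n-grams as follows. This is a minimal base R sketch with a hypothetical helper name, `top_ngrams`, and a toy input.

```r
# Hypothetical helper: count n-grams and return the k most frequent as a data frame.
top_ngrams <- function(ngrams, k = 10) {
  counts <- sort(table(ngrams), decreasing = TRUE)       # most frequent first
  counts <- counts[seq_len(min(k, length(counts)))]      # keep at most k entries
  data.frame(word = names(counts),
             frequency = as.integer(counts),
             stringsAsFactors = FALSE)
}

# Toy input for illustration only.
demo <- c("new york", "last year", "new york", "right now", "new york", "last year")
top_ngrams(demo, k = 2)
```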

Unigram Frequency (Top 10)

     word frequency
the   the      9821
said said      5846
will will      5487
one   one      5211
just just      4507
can   can      4094
like like      4061
time time      3603
get   get      3495
new   new      3212

Unigram Plot

Bigram Frequency (Top 10)

                   word frequency
last year     last year       350
new york       new york       338
right now     right now       327
years ago     years ago       270
high school high school       269
last week     last week       220
first time   first time       217
st louis       st louis       202
last night   last night       191
new jersey   new jersey       190

Bigram Plot

Trigram Frequency (Top 10)

Trigram Plot

Prediction strategies and plans for Shiny app

In planning the development of the prediction algorithm, I intend to use an n-gram data frame to calculate the probability of the next word given the previous word(s). For the Shiny app, the plan is to create an interactive and inviting interface where the user enters some text and the prediction algorithm suggests the next word.
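
The planned approach could look roughly like the sketch below, which estimates the probability of the next word given the previous word as count(w1 w2) / sum of count(w1 *) over a bigram data frame. The helper `predict_next` is hypothetical and is one possible design, not the final algorithm. The demo reuses five counts from the bigram table in this report, so the probabilities are relative to those five rows only, not to the full corpus.

```r
# Hypothetical helper: rank next-word candidates from a bigram frequency data frame.
# `bigrams` needs a `word` column ("w1 w2") and a `frequency` column.
predict_next <- function(bigrams, previous_word, k = 3) {
  parts  <- strsplit(bigrams$word, " ", fixed = TRUE)
  first  <- vapply(parts, `[`, character(1), 1)
  second <- vapply(parts, `[`, character(1), 2)
  hits <- first == tolower(previous_word)
  if (!any(hits)) {
    return(data.frame(next_word = character(0), probability = numeric(0)))
  }
  out <- data.frame(next_word = second[hits],
                    probability = bigrams$frequency[hits] / sum(bigrams$frequency[hits]),
                    stringsAsFactors = FALSE)
  out <- out[order(out$probability, decreasing = TRUE), ]   # most likely candidate first
  head(out, k)
}

bigrams <- data.frame(
  word = c("last year", "last week", "last night", "new york", "new jersey"),
  frequency = c(350, 220, 191, 338, 190),
  stringsAsFactors = FALSE
)
predict_next(bigrams, "last")
```

When the previous word has no match the helper returns an empty data frame. A fuller model would typically back off to a shorter context (for example from trigrams to bigrams to unigrams) in that case.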

Milestone Report – Data Science Capstone Project

César Fernández

March 2, 2020