This document is produced as a Milestone report of the Data Science Specialization Capstone offered by Johns Hopkins University on Coursera.
The report attempts to demonstrate that the raw data has been downloaded and loaded into R, present exploratory summaries of a sample of the data, and outline the plans for the prediction algorithm and Shiny app to be built.
A number of R libraries were used in the course of producing this report, including stringi, tm, quanteda, qdap and ggplot2.
The raw data is sourced from a corpus called HC Corpora and was downloaded via a link provided on the Data Science Capstone course page. Although the downloaded dataset contains data in multiple languages, only the English dataset was used for this project and report.
The download script first checks whether the dataset has already been downloaded; if not, it downloads the archive and then extracts some information about each file in the English dataset. Summary statistics of the raw dataset are shown below.
## filename size.MB Lines LinesNEmpty Chars CharsNWhite
## 1. en_US.blogs.txt 200.42 899288 899288 206824382 170389539
## 2. en_US.news.txt 196.28 77259 77259 15639408 13072698
## 3. en_US.twitter.txt 159.36 2360148 2360148 162096031 134082634
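A minimal sketch of how this download-and-summarise step might look is shown below; the download URL, directory layout and object names are illustrative assumptions rather than the original script.

```r
library(stringi)

## Assumed URL and paths (not taken from the original script)
url      <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
zip_file <- "Coursera-SwiftKey.zip"

## Download and extract only if not already present
if (!file.exists(zip_file)) download.file(url, zip_file, mode = "wb")
if (!dir.exists("final")) unzip(zip_file)

## Gather basic statistics for each file in the English dataset
files <- list.files("final/en_US", full.names = TRUE)
stats <- do.call(rbind, lapply(files, function(f) {
  lines <- readLines(f, encoding = "UTF-8", skipNul = TRUE)
  data.frame(filename = basename(f),
             size.MB  = round(file.size(f) / 1024^2, 2),
             t(stri_stats_general(lines)))
}))
stats
```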
As recommended in the course instructions, a sample of the dataset can be drawn to represent the entire dataset. For the purposes of this report, a function was created to read the raw data files, take a random sample of 20,000 lines from each data file (blogs, news, twitter), write the sample to local disk so it can be used for further processing, and display some general statistics of the sampled text files.
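A sketch of such a sampling function is given below; the function name, random seed and output file names are assumptions made for illustration.

```r
library(stringi)

sample_file <- function(infile, outfile, n = 20000, seed = 1234) {
  ## Read the raw file and draw a random sample of n lines
  lines <- readLines(infile, encoding = "UTF-8", skipNul = TRUE)
  set.seed(seed)
  sampled <- lines[sample(length(lines), n)]
  ## Write the sample to disk and return its basic statistics
  writeLines(sampled, outfile)
  stri_stats_general(sampled)
}

sample_file("final/en_US/en_US.blogs.txt",   "sample.blogs.txt")
sample_file("final/en_US/en_US.news.txt",    "sample.news.txt")
sample_file("final/en_US/en_US.twitter.txt", "sample.twitter.txt")
```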
Some general characteristics of the sampled files are shown below. The blogs sample has the highest average number of characters per line (4,552,997 characters over 20,000 lines, roughly 228 per line), while the twitter sample has the lowest (1,376,324 characters over 20,000 lines, roughly 69 per line).
## Lines LinesNEmpty Chars CharsNWhite
## sample.blogs 20000 20000 4552997 3750813
## sample.news 20000 20000 4063986 3396250
## sample.twitter 20000 20000 1376324 1138369
The sample data obtained in the step above was loaded into R and some initial cleaning was applied.
Further data cleaning was done automatically with the quanteda R package.
This step produced files with the following word counts:
## [1] "Word Count for Blogs sample data: 810654"
## [1] "Word Count for News sample data: 669763"
## [1] "Word Count for Twitter sample data: 249511"
In lexical analysis, as described by Wikipedia, tokenization is the process of breaking a stream of text up into words, phrases, symbols, or other meaningful elements called tokens. The list of tokens becomes input for further processing such as parsing or text mining.
The sample data was further cleaned and tokenized with the quanteda R package to generate contiguous sequences of n items from the text, referred to as n-grams. The frequencies of the resulting tokens were then computed into a data frame, which was used to plot histograms of the uni-grams, bi-grams and tri-grams.
During further processing of the sample dataset, stopwords and profanity were removed when creating uni-grams; however, they were retained when creating bi-grams and tri-grams since they may be useful for word associations. Whether this actually impacts the predictions of the model to be developed will be investigated further.
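A sketch of the n-gram generation and frequency computation described above is given below; the profanity word list and object names are assumptions made for illustration.

```r
library(quanteda)

## Read the three sample files back in and tokenise them together
txt <- unlist(lapply(c("sample.blogs.txt", "sample.news.txt", "sample.twitter.txt"),
                     readLines, encoding = "UTF-8", skipNul = TRUE))
all.toks <- tokens(tolower(txt), remove_punct = TRUE, remove_numbers = TRUE,
                   remove_symbols = TRUE, remove_url = TRUE)

## Uni-grams: drop English stopwords and a profanity list ("profanity.txt" is assumed)
profanity <- readLines("profanity.txt")
uni.toks  <- tokens_remove(all.toks, pattern = c(stopwords("en"), profanity))

## Bi-grams and tri-grams keep stopwords, as they may help with word associations
bi.toks  <- tokens_ngrams(all.toks, n = 2)
tri.toks <- tokens_ngrams(all.toks, n = 3)

## Frequencies of the most common n-grams as a data frame, ready for plotting
freq_df <- function(toks, top = 25) {
  freqs <- topfeatures(dfm(toks), top)
  data.frame(ngram = names(freqs), frequency = as.numeric(freqs))
}
uni.freq <- freq_df(uni.toks)
bi.freq  <- freq_df(bi.toks)
tri.freq <- freq_df(tri.toks)
```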
A frequency plot showing the top 25 most frequently occurring uni-grams from the tokenization process is shown below:
A frequency plot showing the top 25 most frequently occurring bi-grams from the tokenization process is shown below:
A frequency plot showing the top 25 most frequently occurring tri-grams from the tokenization process is shown below:
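A hedged sketch of how such top-25 frequency plots could be produced with ggplot2, reusing the frequency data frames from the previous sketch, is shown below.

```r
library(ggplot2)

plot_freq <- function(df, plot_title) {
  ggplot(df, aes(x = reorder(ngram, frequency), y = frequency)) +
    geom_col(fill = "steelblue") +
    coord_flip() +                         # horizontal bars keep long n-grams readable
    labs(title = plot_title, x = NULL, y = "Frequency")
}

plot_freq(uni.freq, "Top 25 uni-grams")
plot_freq(bi.freq,  "Top 25 bi-grams")
plot_freq(tri.freq, "Top 25 tri-grams")
```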
A word cloud showing the most frequently occurring words in the entire sample dataset is shown below. Only words with a minimum frequency of 100 were included in the word cloud.
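The report does not state which package produced the word cloud; the sketch below uses the wordcloud package as one possible approach, together with the uni-gram tokens from the earlier sketch.

```r
library(quanteda)
library(wordcloud)
library(RColorBrewer)

## Aggregate the count of each word from the uni-gram tokens built earlier
uni.dfm   <- dfm(uni.toks)
word.freq <- colSums(uni.dfm)

## Only words appearing at least 100 times are drawn
wordcloud(words = names(word.freq), freq = word.freq, min.freq = 100,
          random.order = FALSE, colors = brewer.pal(8, "Dark2"))
```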
On a final note, it was observed that the raw dataset is considerably large, so samples drawn from it may have to be kept small enough to save significant processing time.
The next step will involve the creation of a Shiny app to predict the next word given one or more words. More n-grams may need to be created to increase the accuracy of the prediction algorithm.
Since this report is expected to be concise and easy to understand for non-data scientists, the code for performing the various analyses is not included in the report (echo=FALSE, for data scientists); however, the detailed scripts can be found here on GitHub.