This document is produced as a Milestone report of the Data Science Specialization Capstone offered by Johns Hopkins University on Coursera.
The report attempts to demonstrate that the raw data has been downloaded and loaded into R, present exploratory summaries of a sample of the data, and outline the plans for the prediction algorithm and Shiny app to be built.
A number of R libraries were used in the course of producing this report, including stringi, tm, quanteda, qdap and ggplot2.
The raw data is sourced from a corpus called HC Corpora and was downloaded via a link provided on the Data Science Capstone course page. Although the downloaded dataset contains data in multiple languages, only the English dataset was used for this project and report.
The download script first checks whether the dataset has already been downloaded; if not, it downloads the archive and then extracts some information about each file in the English dataset. Summary statistics of the raw dataset are shown below.
## filename size.MB Lines LinesNEmpty Chars CharsNWhite
## 1. en_US.blogs.txt 200.42 899288 899288 206824382 170389539
## 2. en_US.news.txt 196.28 77259 77259 15639408 13072698
## 3. en_US.twitter.txt 159.36 2360148 2360148 162096031 134082634
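A minimal sketch of how this download-and-summarise step might look is shown below; the download URL, directory layout and object names are illustrative assumptions rather than the original script.

```r
library(stringi)

## Assumed URL and paths (not taken from the original script)
url      <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
zip_file <- "Coursera-SwiftKey.zip"

## Download and extract only if not already present
if (!file.exists(zip_file)) download.file(url, zip_file, mode = "wb")
if (!dir.exists("final")) unzip(zip_file)

## Gather basic statistics for each file in the English dataset
files <- list.files("final/en_US", full.names = TRUE)
stats <- do.call(rbind, lapply(files, function(f) {
  lines <- readLines(f, encoding = "UTF-8", skipNul = TRUE)
  data.frame(filename = basename(f),
             size.MB  = round(file.size(f) / 1024^2, 2),
             t(stri_stats_general(lines)))
}))
stats
```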
As recommended in the course instructions, a sample of the dataset can be drawn to represent the entire dataset. For the purposes of this report, a function was created to read the raw data files, take a random sample of 20,000 lines from each data file (blogs, news, twitter), write the sample to local disk so it can be used for further processing, and display some general statistics of the sampled text files.
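A sketch of such a sampling function is given below; the function name, random seed and output file names are assumptions made for illustration.

```r
library(stringi)

sample_file <- function(infile, outfile, n = 20000, seed = 1234) {
  ## Read the raw file and draw a random sample of n lines
  lines <- readLines(infile, encoding = "UTF-8", skipNul = TRUE)
  set.seed(seed)
  sampled <- lines[sample(length(lines), n)]
  ## Write the sample to disk and return its basic statistics
  writeLines(sampled, outfile)
  stri_stats_general(sampled)
}

sample_file("final/en_US/en_US.blogs.txt",   "sample.blogs.txt")
sample_file("final/en_US/en_US.news.txt",    "sample.news.txt")
sample_file("final/en_US/en_US.twitter.txt", "sample.twitter.txt")
```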
Some general characteristics of the sampled files are shown below. The blogs sample has the highest average number of characters per line (4,552,997 characters over 20,000 lines, roughly 228 per line), while the twitter sample has the lowest (1,376,324 characters over 20,000 lines, roughly 69 per line).
## Lines LinesNEmpty Chars CharsNWhite
## sample.blogs 20000 20000 4552997 3750813
## sample.news 20000 20000 4063986 3396250
## sample.twitter 20000 20000 1376324 1138369
The sample data obtained in the step above was loaded into R and some initial cleaning was applied.
Further data cleaning was done automatically with the quanteda R package.
This step produced files with the following word counts:
## [1] "Word Count for Blogs sample data: 810654"
## [1] "Word Count for News sample data: 669763"
## [1] "Word Count for Twitter sample data: 249511"
In lexical analysis, as described by Wikipedia, tokenization is the process of breaking a stream of text up into words, phrases, symbols, or other meaningful elements called tokens. The list of tokens becomes input for further processing such as parsing or text mining.
The sample data was further cleaned and tokenized with the quanteda R package to generate contiguous sequences of n items from the text, referred to as n-grams. The frequencies of the resulting tokens were then computed into a data frame, which was used to plot histograms of the uni-grams, bi-grams and tri-grams.
During further processing of the sample dataset, stopwords and profanity were removed when creating uni-grams; however, they were retained when creating bi-grams and tri-grams since they may be useful for word associations. Whether this actually impacts the predictions of the model to be developed will be investigated further.
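A sketch of the n-gram generation and frequency computation described above is given below; the profanity word list and object names are assumptions made for illustration.

```r
library(quanteda)

## Read the three sample files back in and tokenise them together
txt <- unlist(lapply(c("sample.blogs.txt", "sample.news.txt", "sample.twitter.txt"),
                     readLines, encoding = "UTF-8", skipNul = TRUE))
all.toks <- tokens(tolower(txt), remove_punct = TRUE, remove_numbers = TRUE,
                   remove_symbols = TRUE, remove_url = TRUE)

## Uni-grams: drop English stopwords and a profanity list ("profanity.txt" is assumed)
profanity <- readLines("profanity.txt")
uni.toks  <- tokens_remove(all.toks, pattern = c(stopwords("en"), profanity))

## Bi-grams and tri-grams keep stopwords, as they may help with word associations
bi.toks  <- tokens_ngrams(all.toks, n = 2)
tri.toks <- tokens_ngrams(all.toks, n = 3)

## Frequencies of the most common n-grams as a data frame, ready for plotting
freq_df <- function(toks, top = 25) {
  freqs <- topfeatures(dfm(toks), top)
  data.frame(ngram = names(freqs), frequency = as.numeric(freqs))
}
uni.freq <- freq_df(uni.toks)
bi.freq  <- freq_df(bi.toks)
tri.freq <- freq_df(tri.toks)
```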
A frequency plot showing the top 25 most frequently occurring uni-grams from the tokenization process is shown below:
A frequency plot showing the top 25 most frequently occurring bi-grams from the tokenization process is shown below:
A frequency plot showing the top 25 most frequently occurring tri-grams from the tokenization process is shown below:
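A hedged sketch of how such top-25 frequency plots could be produced with ggplot2, reusing the frequency data frames from the previous sketch, is shown below.

```r
library(ggplot2)

plot_freq <- function(df, plot_title) {
  ggplot(df, aes(x = reorder(ngram, frequency), y = frequency)) +
    geom_col(fill = "steelblue") +
    coord_flip() +                         # horizontal bars keep long n-grams readable
    labs(title = plot_title, x = NULL, y = "Frequency")
}

plot_freq(uni.freq, "Top 25 uni-grams")
plot_freq(bi.freq,  "Top 25 bi-grams")
plot_freq(tri.freq, "Top 25 tri-grams")
```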
A word cloud showing the most frequently occurring words in the entire sample dataset is shown below. Only words with a minimum frequency of 100 were included in the word cloud.
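The report does not state which package produced the word cloud; the sketch below uses the wordcloud package as one possible approach, together with the uni-gram tokens from the earlier sketch.

```r
library(quanteda)
library(wordcloud)
library(RColorBrewer)

## Aggregate the count of each word from the uni-gram tokens built earlier
uni.dfm   <- dfm(uni.toks)
word.freq <- colSums(uni.dfm)

## Only words appearing at least 100 times are drawn
wordcloud(words = names(word.freq), freq = word.freq, min.freq = 100,
          random.order = FALSE, colors = brewer.pal(8, "Dark2"))
```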
On a final note, it was observed that the raw dataset is considerably large, so samples drawn from it may have to be kept small enough to save significant processing time.
The next step will involve the creation of a Shiny app to predict the next word given one or more words. More n-grams may need to be created to increase the accuracy of the prediction algorithm.
Since this report is expected to be concise and easy to understand for non-data scientists, the code for performing the various analyses is not included in the report (echo=FALSE, for data scientists); however, the detailed scripts can be found here on GitHub.