Executive Summary

This is the Capstone Project for the Johns Hopkins University Data Science Specialization, hosted by Coursera in collaboration with SwiftKey. The overall goal of this project is to build a Natural Language Processing application that predicts and suggests the next word based on the text a user has entered. Examples of such features are commonly found in web search engines and text messaging applications.

The dataset is obtained from the HC Corpora corpus and contains three text file datasets (Blogs, News, Twitter). The files are provided in German, Russian, Finnish and English. The raw dataset is available to download here. We only use the English files for this project.

The goal of this report is to get familiar with the dataset and perform the necessary data cleaning and transformations for the project. We use text mining techniques on the HC Corpora data for the cleaning, tokenization, and analysis. Graphs and word clouds are created at the end of the report for easier visual understanding of the data.

All of the code for this project is written in R, but most of it is hidden in this report to make it easier for non-data scientists to read and understand; only the code used to download the data and load the files into R is shown in full.

If you are interested in seeing the hidden R code of this report, feel free to check and browse my GitHub project page by clicking here.

Data Preparation

We first download the data and load the three text files into R, as shown by the code below:

if (!file.exists("Courseradata")) {
    dir.create("Courseradata")
}

fileUrl <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
download.file(fileUrl, destfile = "/Users/adrianromano/Downloads/Courseradata/Coursera-SwiftKey.zip", method = "curl")

if (!file.exists("/Users/adrianromano/Downloads/Courseradata/Coursera-SwiftKey")) {
    unzip(zipfile = "/Users/adrianromano/Downloads/Courseradata/Coursera-SwiftKey.zip", 
          exdir = "/Users/adrianromano/Downloads/Courseradata")
}

blogs <- readLines("/Users/adrianromano/Downloads/Courseradata//final/en_US/en_US.blogs.txt")
news <- readLines("/Users/adrianromano/Downloads/Courseradata//final/en_US/en_US.news.txt")
twitter <- readLines("/Users/adrianromano/Downloads/Courseradata//final/en_US/en_US.twitter.txt")

Here we see the summary of the three datasets:

##   Dataset   Lines    Words FileSize.MB
## 1   Blogs  899288 37546239    200.4242
## 2    News 1010242 34762395    196.2775
## 3 Twitter 2360148 30093372    159.3641
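The code that builds this summary is hidden in the report; the sketch below shows one way the table could be computed. The use of stringi word counts and the dataSummary object name are assumptions for illustration, not the report's actual code.

library(stringi)

# Summarise line counts, word counts, and file sizes (in MB) for the three raw files
dataSummary <- data.frame(
    Dataset = c("Blogs", "News", "Twitter"),
    Lines = c(length(blogs), length(news), length(twitter)),
    Words = c(sum(stri_count_words(blogs)),
              sum(stri_count_words(news)),
              sum(stri_count_words(twitter))),
    FileSize.MB = file.size(c(
        "/Users/adrianromano/Downloads/Courseradata/final/en_US/en_US.blogs.txt",
        "/Users/adrianromano/Downloads/Courseradata/final/en_US/en_US.news.txt",
        "/Users/adrianromano/Downloads/Courseradata/final/en_US/en_US.twitter.txt")) / 1024^2
)
dataSummary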

Next we look at the first line of each sampled dataset to get an understanding of what the data looks like:

First line of the Blogs sample dataset:

## [1] "She was way too pleased with herself."

First line of the News sample dataset:

## [1] "\"I think they will learn to control corporate power,\" Stegner said, \"and to dampen the excess that has always marked their region, and will arrive at a degree of stability and a reasonably sustainable economy based on resources that they will know how to cherish and renew.\""

First line of the Twitter sample dataset:

## [1] "Brooklyn New York city, where they paint the walls with biggie?"

We then merge samples of the three datasets into one single dataset and look at the total number of lines and words of the combined sample:

## [1] "Number of Lines: 42695"
## [1] "Number of Words: 1023499"

Data Cleaning

We clean the data by applying a series of text transformations to the combined sample, including filtering out profanity; a sketch of this process follows the note below.

The profanity list of bad words banned by Google was obtained from the Free Web Headers website and can be downloaded by clicking here.

Note: In Natural Language Processing, stop words (common English words such as “the”, “and”, “for”, etc.) are usually removed in order to highlight the more important words. However, since we are trying to predict the next word, these words are actually important to include, so we proceed without removing them.
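The cleaning code is hidden in the report; the sketch below illustrates the kind of transformations described above using the tm package, applied to the combined sample. The object names and the profanity file name are assumptions made for illustration.

library(tm)

# Build a corpus from the combined sample (sampleData as sketched earlier)
corpus <- VCorpus(VectorSource(sampleData))

# Standard text-mining transformations; stop words are intentionally kept
corpus <- tm_map(corpus, content_transformer(tolower))   # convert to lowercase
corpus <- tm_map(corpus, removePunctuation)              # strip punctuation
corpus <- tm_map(corpus, removeNumbers)                  # strip numbers
corpus <- tm_map(corpus, stripWhitespace)                # collapse extra whitespace

# Filter profanity using the downloaded bad-words list (assumed local file name)
profanity <- readLines("bad-words.txt", skipNul = TRUE)
corpus <- tm_map(corpus, removeWords, profanity)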

Exploratory Data Analysis

After the data is cleaned, we then move on to Exploratory Data Analysis.

Here we create n-grams through a process called tokenization, in which we break the text down into words. For example, a 2-gram tokenization means the text is broken into two-word combinations. For this project, we create 1-grams (unigrams), 2-grams (bigrams), 3-grams (trigrams), 4-grams (fourgrams) and 5-grams (fivegrams) and then transform the unstructured text into structured data frames so that they can be used for statistical analysis and prediction models. More information about n-grams can be found on this Wikipedia link if you are interested in the technical details.
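The tokenization code is not shown; the sketch below illustrates one common way to build n-gram frequency tables with the RWeka tokenizer and the tm corpus from the cleaning step. The function and object names are assumptions, and only the bigram case is shown; the other n-grams follow the same pattern.

library(tm)
library(RWeka)

# Tokenizer factory: returns a function that splits text into n-grams of length n
ngramTokenizer <- function(n) {
    force(n)
    function(x) NGramTokenizer(x, Weka_control(min = n, max = n))
}

# Build a term-document matrix of bigrams and extract sorted frequencies
bigramTDM <- TermDocumentMatrix(corpus, control = list(tokenize = ngramTokenizer(2)))
bigramFreq <- sort(rowSums(as.matrix(bigramTDM)), decreasing = TRUE)

# Structured data frame: one row per bigram with its frequency
bigramDF <- data.frame(ngram = names(bigramFreq), frequency = bigramFreq,
                       row.names = NULL, stringsAsFactors = FALSE)
head(bigramDF)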

Barplots and word clouds are created for each n-gram to show which terms appear most frequently among the unigrams, bigrams, trigrams, fourgrams and fivegrams; a sketch of the plotting code follows the note below. In a word cloud, frequency is represented by font size, with the largest word being the most frequent.

Note: Since we do not remove stop words in the cleaning process, it is expected that the most frequent terms contain a lot of common English words.
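The plotting code is hidden as well; the sketch below shows one way the barplots and word clouds could be produced with ggplot2 and the wordcloud package, assuming the bigramDF frequency table from the previous sketch. The other n-grams are plotted the same way.

library(ggplot2)
library(wordcloud)
library(RColorBrewer)

# Barplot of the 20 most frequent bigrams
topBigrams <- head(bigramDF, 20)
ggplot(topBigrams, aes(x = reorder(ngram, frequency), y = frequency)) +
    geom_bar(stat = "identity", fill = "steelblue") +
    coord_flip() +
    labs(x = "Bigram", y = "Frequency", title = "Top 20 Most Frequent Bigrams")

# Word cloud of the most frequent bigrams; font size reflects frequency
wordcloud(words = bigramDF$ngram, freq = bigramDF$frequency,
          max.words = 100, random.order = FALSE, colors = brewer.pal(8, "Dark2"))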

Unigram

  • Barplot:

  • Wordcloud:

Bigram

  • Barplot:

  • Wordcloud:

Trigram

  • Barplot:

  • Wordcloud:

Fourgram

  • Barplot:

  • Wordcloud:

Fivegram

  • Barplot:

  • Wordcloud:

Final Comments