This is the Capstone Project for the Johns Hopkins University Data Science Specialization, hosted by Coursera in collaboration with SwiftKey. The overall goal of this project is to build a Natural Language Processing application that predicts and suggests the next word based on the text a user has typed. Examples of such features are commonly found in web search engines and text messaging.
The dataset is obtained from the HC Corpora corpus and contains three text file datasets (Blogs, News, Twitter). The files are provided in German, Russian, Finnish and English. The raw dataset is available to download here. We only use the English files for this project.
The goal of this report is to get familiar with the dataset and perform the data cleaning and transformations needed for the project. We use text mining techniques on the HC Corpora data for the cleaning, tokenization, and analysis. Graphs and word clouds are created at the end of the report for easier visual understanding of the data.
All the code for the project is written in R, but most of it is not shown in this report to make it easier for non-data scientists to read and understand. The only code shown is the code that downloads the data and loads the files into R.
If you are interested in seeing the hidden R code of the report, feel free to check and browse my GitHub project page by clicking here.
We first download and load the three text files into R, as shown by the code below:
# Create the data directory if it does not already exist
if (!file.exists("/Users/adrianromano/Downloads/Courseradata")) {
    dir.create("/Users/adrianromano/Downloads/Courseradata")
}

# Download the zipped Coursera-SwiftKey dataset
fileUrl <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
download.file(fileUrl, destfile = "/Users/adrianromano/Downloads/Courseradata/Coursera-SwiftKey.zip", method = "curl")

# Unzip the archive if it has not been extracted yet (the zip extracts a "final" directory)
if (!file.exists("/Users/adrianromano/Downloads/Courseradata/final")) {
    unzip(zipfile = "/Users/adrianromano/Downloads/Courseradata/Coursera-SwiftKey.zip",
          exdir = "/Users/adrianromano/Downloads/Courseradata")
}

# Read the three English text files line by line
blogs <- readLines("/Users/adrianromano/Downloads/Courseradata/final/en_US/en_US.blogs.txt")
news <- readLines("/Users/adrianromano/Downloads/Courseradata/final/en_US/en_US.news.txt")
twitter <- readLines("/Users/adrianromano/Downloads/Courseradata/final/en_US/en_US.twitter.txt")
Here we see the summary of the three datasets:
## Dataset Lines Words FileSize.MB
## 1 Blogs 899288 37546239 200.4242
## 2 News 1010242 34762395 196.2775
## 3 Twitter 2360148 30093372 159.3641
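The summary code itself is hidden in this report; the sketch below shows one way such a table could be built, assuming the stringi package is used for word counts (the object name dataSummary is only illustrative and is not from the original code).

library(stringi)

# Assemble line counts, word counts and file sizes for the three files
dataSummary <- data.frame(
    Dataset = c("Blogs", "News", "Twitter"),
    Lines = c(length(blogs), length(news), length(twitter)),
    Words = c(sum(stri_count_words(blogs)),
              sum(stri_count_words(news)),
              sum(stri_count_words(twitter))),
    FileSize.MB = file.size(c(
        "/Users/adrianromano/Downloads/Courseradata/final/en_US/en_US.blogs.txt",
        "/Users/adrianromano/Downloads/Courseradata/final/en_US/en_US.news.txt",
        "/Users/adrianromano/Downloads/Courseradata/final/en_US/en_US.twitter.txt")) / 1024^2
)
dataSummary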
First line of the Blogs sample dataset:
## [1] "She was way too pleased with herself."
First line of the News sample dataset:
## [1] "\"I think they will learn to control corporate power,\" Stegner said, \"and to dampen the excess that has always marked their region, and will arrive at a degree of stability and a reasonably sustainable economy based on resources that they will know how to cherish and renew.\""
First line of the Twitter sample dataset:
## [1] "Brooklyn New York city, where they paint the walls with biggie?"
We then merge the three sampled datasets into one single dataset and check the total number of lines and words of the combined sample:
## [1] "Number of Lines: 42695"
## [1] "Number of Words: 1023499"
We clean the data by applying a series of text transformations to the sample, including profanity filtering.
The profanity word list was downloaded from the Free Web Headers "bad words banned by Google" page and can be downloaded by clicking here.
Note: In Natural Language Processing, stop words (common English words such as "the", "and", "for", etc.) are usually removed in order to highlight the more important words. However, since we are building a next-word prediction model, these words are actually important to include, so we proceed without removing them.
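The cleaning code is hidden; the sketch below shows one possible cleaning pipeline using the tm package. The specific transformations (lowercasing, removing punctuation and numbers, stripping extra whitespace) plus the profanity filter are assumptions about typical steps, the profanity file name is an assumption, and sampleData refers to the sampling sketch above.

library(tm)

# Assumed file name for the downloaded profanity list
profanity <- readLines("/Users/adrianromano/Downloads/Courseradata/full-list-of-bad-words.txt")

# Build a corpus from the sampled lines and apply the cleaning transformations
corpus <- VCorpus(VectorSource(sampleData))
corpus <- tm_map(corpus, content_transformer(tolower))   # lowercase everything
corpus <- tm_map(corpus, removePunctuation)              # drop punctuation
corpus <- tm_map(corpus, removeNumbers)                  # drop numbers
corpus <- tm_map(corpus, removeWords, profanity)         # profanity filtering
corpus <- tm_map(corpus, stripWhitespace)                # collapse extra whitespace

# Back to a plain character vector for the tokenization step
cleanText <- sapply(corpus, as.character)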
After the data is cleaned, we then move on to Exploratory Data Analysis.
Here we create n-grams through a process called tokenization, in which we break texts down into words. For example, a 2-gram tokenization breaks the text into two-word combinations. For this project, we create 1-grams (unigrams), 2-grams (bigrams), 3-grams (trigrams), 4-grams (fourgrams) and 5-grams (fivegrams) and then transform the unstructured text into structured data frames so that they can be used for statistical analysis and prediction models. More information about n-grams can be found on this Wikipedia link if you are interested in the technical details.
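The tokenization code is hidden; below is a minimal base-R sketch of how n-grams can be built and counted, using bigrams as an example. It assumes the cleanText vector from the cleaning sketch above; make_ngrams and bigramFreq are illustrative names, not the original code.

# Break a line of text into overlapping n-word combinations
make_ngrams <- function(text, n = 2) {
    words <- unlist(strsplit(text, "\\s+"))
    words <- words[words != ""]
    if (length(words) < n) return(character(0))
    sapply(seq_len(length(words) - n + 1), function(i) {
        paste(words[i:(i + n - 1)], collapse = " ")
    })
}

# Example: bigrams of one sentence
make_ngrams("she was way too pleased with herself", n = 2)
# "she was" "was way" "way too" "too pleased" "pleased with" "with herself"

# Build a structured frequency data frame of bigrams across the cleaned sample
bigrams <- unlist(lapply(cleanText, make_ngrams, n = 2))
bigramFreq <- as.data.frame(table(bigrams), stringsAsFactors = FALSE)
bigramFreq <- bigramFreq[order(-bigramFreq$Freq), ]
head(bigramFreq)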
Barplots and wordclouds of each n-gram are created to see which terms appear most frequently for the unigrams, bigrams, trigrams, fourgrams and fivegrams. In a wordcloud, frequency is represented by font size, with the largest word being the most frequent.
Note: Since we did not remove stop words in the cleaning process, it is expected that the most frequent terms contain a lot of common English words.
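The plotting code is hidden; the sketch below shows how one such barplot and wordcloud could be produced for the bigrams, using the ggplot2 and wordcloud packages and the bigramFreq data frame from the tokenization sketch above. The packages and plot settings are assumptions, not the report's original code.

library(ggplot2)
library(wordcloud)
library(RColorBrewer)

# Barplot of the 20 most frequent bigrams
top20 <- head(bigramFreq, 20)
ggplot(top20, aes(x = reorder(bigrams, Freq), y = Freq)) +
    geom_col() +
    coord_flip() +
    labs(x = "Bigram", y = "Frequency", title = "20 Most Frequent Bigrams")

# Wordcloud where font size reflects frequency
wordcloud(words = bigramFreq$bigrams, freq = bigramFreq$Freq,
          max.words = 50, colors = brewer.pal(8, "Dark2"))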