Summary

The Capstone project for the Coursera Data Science Specialization, run in collaboration with SwiftKey, uses the HC Corpora dataset. The goal of the project is to build a Shiny application with text prediction capabilities. This report outlines an exploratory analysis of the dataset and the current plan for implementing the text prediction algorithm.

Description of Data

The HC Corpora dataset comprises the output of crawls of news sites, blogs and Twitter. A readme file with more specific details on how the data was generated can be found here. The dataset contains three files for each of four languages (Russian, Finnish, German and English); this project focuses on the English-language files. The names of the data files are as follows:

  1. en_US.blogs.txt
  2. en_US.twitter.txt
  3. en_US.news.txt

The datasets will be referred to as “Blogs”, “Twitter” and “News” for the remainder of this report.

Download the data

if(!file.exists("Coursera-SwiftKey.zip")){
    #Download the dataset
    download.file("https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip",
                  "Coursera-SwiftKey.zip")
    Download_Date <- Sys.time()
    print(Download_Date)
    #"2014-11-09 07:22:48 PST"
    
    #Unzip the dataset
    unzip("Coursera-SwiftKey.zip")
}else{
    print("Dataset is already downloaded!")
}

Cleaning the News Dataset

There is a minor problem with the News dataset: it contains an unusual control character (\032) on line 77,259. To address this issue, a small piece of code was written to strip out the character before the dataset is processed.

if(!file.exists("./final/en_US/en_US.news_edit.txt")){
    con <- file("./final/en_US/en_US.news.txt", "rb")
    News_Data <- readLines(con)
    close(con)
    #Remove the odd control character (\032) on line 77259
    News_Data <- gsub("\032", "", News_Data, ignore.case=F, perl=T)
    #Write the cleaned text to a new file
    out_con <- file("./final/en_US/en_US.news_edit.txt", "w")
    writeLines(News_Data, con=out_con)
    close(out_con)
    #Move the original file out of the corpus directory so that only
    #the edited version is read when the corpus is built
    file.rename("./final/en_US/en_US.news.txt", "./final/en_US.news.txt")
}else{
    con <- file("./final/en_US/en_US.news_edit.txt", "rb")
    News_Data <- readLines(con)
    close(con)
}

Characteristics of Datasets

#Load libraries
library(NLP)
library(tm)
library(stringi)
library(ggplot2)
library(RWeka)
library(data.table)

#Generate Corpus for text analysis
cname <- file.path(".", "final", "en_US")
docs <- Corpus(DirSource(cname))

The first part of this exploratory analysis is to determine the basic characteristics of each dataset. These are summarized in the table below.

Dataset   File Size (bytes)   Number of Lines   Smallest Entry (chars)   Largest Entry (chars)
Blogs     210160014           899288            1                        40833
Twitter   167105338           2360148           2                        140
News      204801643           1010242           1                        11384
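
The code that produced these summaries is not shown in this report; below is a sketch of how they could be computed, using base R's file.size() and stringi's stri_length() (character counts). The file paths and the use of the edited News file are assumptions.

library(stringi)

#Sketch (assumption): compute file size, line count and shortest/longest
#entry length (in characters) for each English-language file
files <- c(Blogs   = "./final/en_US/en_US.blogs.txt",
           Twitter = "./final/en_US/en_US.twitter.txt",
           News    = "./final/en_US/en_US.news_edit.txt")

file_summary <- function(path){
    lines <- readLines(path, skipNul=TRUE)
    data.frame(File_Size_Bytes = file.size(path),
               Number_of_Lines = length(lines),
               Smallest_Entry  = min(stri_length(lines)),
               Largest_Entry   = max(stri_length(lines)))
}

do.call(rbind, lapply(files, file_summary))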

Subsetting and Processing the Dataset

Each of the datasets (Blogs, Twitter and News) is large enough that processing time is a factor. To address this concern, a random sample of roughly 20% of each dataset was taken for the remainder of this analysis. The resulting subsets are summarized in the table below.

#Limit each dataset to a random sample of roughly 20% of its lines
set.seed(1337)
Subset <- docs
Subset[[1]]$content <- Subset[[1]]$content[as.logical(rbinom(length(Subset[[1]]$content),
                                                             1, prob=0.2))]
Subset[[2]]$content <- Subset[[2]]$content[as.logical(rbinom(length(Subset[[2]]$content),
                                                             1, prob=0.2))]
Subset[[3]]$content <- Subset[[3]]$content[as.logical(rbinom(length(Subset[[3]]$content),
                                                             1, prob=0.2))]

Dataset          File Size (bytes)   Number of Lines   Smallest Entry (chars)   Largest Entry (chars)
Subset Blogs     51951080            179180            1                        12409
Subset Twitter   63594736            471766            3                        140
Subset News      52384464            202308            1                        8949

Before the subsetted data can be fully analyzed, it needs to be pre-processed to standardize the words and characters in each dataset. An example entry from the Blogs dataset is shown below:

## [1] "Point 2: If it’s a show that your kid wants, a show about a book is always better than CRACCC."

Word Frequency
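
The word-frequency table wf used in the plot below is not shown in this report; one plausible way to build it from the cleaned corpus is sketched here (an assumption, not the original code).

#Sketch (assumption): tally word frequencies across the three documents
#into the wf data frame referenced by the plot below
dtm <- DocumentTermMatrix(Subset)
freq <- sort(slam::col_sums(dtm), decreasing=TRUE)
wf <- data.frame(word=factor(names(freq), levels=names(freq)), freq=freq)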

#Plot Word Frequencies
ggplot(wf[wf$freq>60000, ], aes(x=word, y=freq)) +
    geom_bar(stat="identity") +
    theme(axis.text.x=element_text(angle=45, hjust=1)) +
    xlab("") +
    ylab("Frequency") +
    ggtitle("Words that appear over 60,000\ntimes in the three Datasets")

The high frequency of “connecting” words such as “the”, “and” and “that” suggests that a scheme based on word frequency alone will not be sufficient for text prediction. The next analysis looks at common word combinations.

N-gram Frequency

For brevity, the N-gram analysis in this report was limited to 2-grams.
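
The 2-gram frequency table WF_Ngram used in the plot below is likewise not shown; a sketch of how it could be built with RWeka's NGramTokenizer follows (the tokenizer settings are an assumption).

#Sketch (assumption): tokenize the cleaned corpus into 2-grams with RWeka
#and tally their frequencies into the WF_Ngram data frame used below
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min=2, max=2))

tdm_2gram <- TermDocumentMatrix(Subset, control=list(tokenize=BigramTokenizer))
freq_2gram <- sort(slam::row_sums(tdm_2gram), decreasing=TRUE)
WF_Ngram <- data.frame(word=factor(names(freq_2gram), levels=names(freq_2gram)),
                       freq=freq_2gram)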

#Plot 2-gram frequencies
ggplot(WF_Ngram[WF_Ngram$freq>16000, ], aes(x=word, y=freq)) +
    geom_bar(stat="identity") +
    theme(axis.text.x=element_text(angle=45, hjust=1)) +
    xlab("") +
    ylab("Frequency") +
    ggtitle("Two Word Combinations (2-grams) that appear\nover 16,000 times in the three Datasets")

The distribution of 2-grams gives an idea of how prevalent prepositions and other connecting words are in natural language. The text prediction model will have to take this into account.

Plan

The current plan for the text prediction application is to use the frequencies of 4-grams, 3-grams and 2-grams to estimate the most likely word to follow the entered text. The difficult part will be offering sensible predictions when the entered N-gram has not been observed in the dataset; in those cases the algorithm will likely back off to shorter N-grams and, failing that, default to a list of “non-common” words (i.e. factoring out words like “the”, “and” and “that”) to estimate the best possible candidate.
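
A minimal sketch of how such a backoff lookup might work is shown below, assuming the 4-, 3- and 2-gram frequency tables have already been built as data.tables (here given the hypothetical names ngram4, ngram3 and ngram2) with columns prefix, word and freq.

#Minimal sketch (assumption): a simple backoff-style lookup over pre-built
#n-gram frequency tables; ngram4, ngram3 and ngram2 are hypothetical
#data.tables with columns prefix (preceding words), word (candidate), freq
library(data.table)

predict_next <- function(text, ngram4, ngram3, ngram2, n=3){
    words <- tolower(unlist(strsplit(text, "\\s+")))
    tables <- list(ngram4, ngram3, ngram2)
    prefix_lengths <- c(3, 2, 1)
    for(i in seq_along(tables)){
        if(length(words) < prefix_lengths[i]) next
        prefix_str <- paste(tail(words, prefix_lengths[i]), collapse=" ")
        hits <- tables[[i]][prefix == prefix_str][order(-freq)]
        if(nrow(hits) > 0) return(head(hits$word, n))
    }
    #No match at any order: fall back to a default list of candidates
    character(0)
}

Keying the tables on prefix with setkey() would keep these lookups fast as the tables grow.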

Session Information

This analysis was performed on a machine with the following characteristics:

## R version 3.3.2 (2016-10-31)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: macOS Sierra 10.12.1
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] data.table_1.10.1 RWeka_0.4-26      ggplot2_2.2.0     stringi_1.1.2    
## [5] tm_0.6-2          NLP_0.1-9        
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.8       knitr_1.15.1      magrittr_1.5     
##  [4] RWekajars_3.9.0-1 munsell_0.4.3     colorspace_1.3-1 
##  [7] stringr_1.1.0     plyr_1.8.4        tools_3.3.2      
## [10] parallel_3.3.2    grid_3.3.2        gtable_0.2.0     
## [13] htmltools_0.3.5   yaml_2.1.14       lazyeval_0.2.0   
## [16] rprojroot_1.1     digest_0.6.10     assertthat_0.1   
## [19] tibble_1.2        rJava_0.9-8       evaluate_0.10    
## [22] slam_0.1-40       rmarkdown_1.2     labeling_0.3     
## [25] scales_0.4.1      backports_1.0.4