Overview

This is the milestone report for the Data Science Capstone project. The goal of this report is to acquire the dataset and perform exploratory analysis that will guide the final prediction app.

Loading Data

The dataset is downloaded directly from the course website. It consists of text entries collected from blogs, newspapers and Twitter in four languages: English, Finnish, German and Russian. For this project, we will only be using the English files.

##Download the dataset
corpora_url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
if(!file.exists("corpora")){
        dir.create("corpora")
}
if(!file.exists("corpora/final/en_US")){
        temp_zip <- tempfile()
        download.file(corpora_url, destfile=temp_zip, mode="wb")
        unzip(temp_zip, exdir="corpora")
        unlink(temp_zip)
}
##Read the files (skipNul = TRUE avoids warnings from embedded nul characters)
con_blogs <- file("corpora/final/en_US/en_US.blogs.txt", open="r")
US_blogs <- readLines(con_blogs, skipNul = TRUE)
close(con_blogs)
con_news <- file("corpora/final/en_US/en_US.news.txt", open="r")
US_news <- readLines(con_news, skipNul = TRUE)
close(con_news)
con_twitter <- file("corpora/final/en_US/en_US.twitter.txt", open="r")
US_twitter <- readLines(con_twitter, skipNul = TRUE)
close(con_twitter)

Summary of Files

Once the corpus is downloaded and read into R, we examine some basic properties of the three files, such as the numbers of lines, words and characters.

##count lines for each file:
lines_blogs <- length(US_blogs)
lines_news <- length(US_news)
lines_twitter <- length(US_twitter)

##breakdown each file by words:
words_blogs <- unlist(strsplit(US_blogs, "\\s+"))
words_news <- unlist(strsplit(US_news, "\\s+"))
words_twitter <- unlist(strsplit(US_twitter, "\\s+"))

##count words in each file:
wordcount_blogs <- length(words_blogs)
wordcount_news <- length(words_news)
wordcount_twitter <- length(words_twitter)

##calculate average numbers of words per line
WPL_blogs <- lengths(strsplit(US_blogs, "\\s+"))
WPL_news <- lengths(strsplit(US_news, "\\s+"))
WPL_twitter <- lengths(strsplit(US_twitter, "\\s+"))
lineLength_blogs <- mean(WPL_blogs)
lineLength_news <- mean(WPL_news)
lineLength_twitter <- mean(WPL_twitter)
##count characters in each file:
chr_blogs <- sum(nchar(US_blogs))
chr_news <- sum(nchar(US_news))
chr_twitter <- sum(nchar(US_twitter))

##calculate average word length
wordlength_blogs <- mean(nchar(words_blogs))
wordlength_news <- mean(nchar(words_news))
wordlength_twitter <- mean(nchar(words_twitter))

##summary table
library(knitr)
files_summary <- data.frame(
        source_file= c("blogs", "news", "twitter"),
        line_count = c(lines_blogs, lines_news, lines_twitter),
        word_count = c(wordcount_blogs, wordcount_news, wordcount_twitter),
        mean_line_length = c(lineLength_blogs, lineLength_news, lineLength_twitter),
        character_count = c(chr_blogs, chr_news, chr_twitter),
        mean_word_length = c(wordlength_blogs, wordlength_news, wordlength_twitter)
)

kable(files_summary, align="c", caption = "Summary Statistics of English blogs, news and twitter files")
Summary Statistics of English blogs, news and twitter files

source_file    line_count    word_count    mean_line_length    character_count    mean_word_length
-----------    ----------    ----------    ----------------    ---------------    ----------------
blogs              899288      37334131            41.51521          206824505            4.563911
news              1010242      34372530            34.02406          203223159            4.941762
twitter           2360148      30373543            12.86934          203223159            4.414455

From the summary table above, it is notable that blog texts have the largest mean number of words per line, followed by news, with twitter entries the shortest. This is expected and consistent with the nature of these text types; twitter's character limit in particular keeps its entries short. The mean word lengths are relatively similar, between 4 and 5 characters per word, with news texts having the largest mean word length.

Word Frequency Analysis

As we are interested in predicting the next word based on the words already entered, we need to inspect how frequently words and phrases appear together.
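
As a toy illustration (not part of the analysis pipeline), the snippet below shows how quanteda breaks a single made-up sentence into bigrams; counting such pairs across the whole corpus tells us which word most often follows a given word.

##toy example only: bigrams of a single made-up sentence
library(quanteda)
toy_tokens <- tokens("thanks for the follow", what = "word")
tokens_ngrams(toy_tokens, n = 2)
##yields the bigrams "thanks_for", "for_the" and "the_follow"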

Pre-Process the Data

To perform n-gram analysis on the texts, we first need to clean and standardize the data by converting all letters to lower case and removing punctuation, numbers, extra whitespace and stop words. Due to the size of the files, we analyze only a 1% sample of the original data. We take 1% of each text source and combine the samples into one corpus.

library(tm)
library(quanteda)
##Create samples of 1/100 of the original texts
set.seed(1234)
sample_blogs <- sample(US_blogs, lines_blogs/100)
sample_news <- sample(US_news, lines_news/100)
sample_twitter <- sample(US_twitter, lines_twitter/100)
##Create combined corpus
BNT_corpus <- Corpus(VectorSource(c(sample_blogs, sample_news, sample_twitter)))
##converting to lower case
BNT_corpus <- tm_map(BNT_corpus, content_transformer(tolower))
##removing punctuation
BNT_corpus <- tm_map(BNT_corpus, removePunctuation)
##removing numbers
BNT_corpus <- tm_map(BNT_corpus, removeNumbers)
##removing stop words
BNT_corpus <- tm_map(BNT_corpus, removeWords, stopwords("en"))
##removing extra whitespaces
BNT_corpus <- tm_map(BNT_corpus, stripWhitespace)
##convert back to plain text
BNT_tidy <- sapply(BNT_corpus, as.character)

We can see an example of what the texts look like before and after pre-processing:

Original Texts:

sample_blogs[1:3]
## [1] "He looked back at me, his eyes were as dark as coal,"                                                                                     
## [2] "You've set up a problem without stakes. Why does she care who the voice on the phone is? Why would she even listen to him past \"hello?\""
## [3] "Yvonne Strahovski … Peg Mooring"

Texts after pre-processing:

BNT_tidy[1:3]
## [1] " looked back eyes dark coal"                                             
## [2] "youve set problem without stakes care voice phone even listen past hello"
## [3] "yvonne strahovski … peg mooring"

n-gram Analysis

After sampling and cleaning our texts, we can proceed to tokenization and n-gram analysis.

##Tokenization
BNT_tokens <- tokens(BNT_tidy, what = "word", remove_punct = TRUE)

Unigram

##create unigrams
BNT_uni <- tokens_ngrams(BNT_tokens, n=1)
##convert to DFM
BNT_uniDFM <- dfm(BNT_uni)

##Get top 10 unigrams
BNT_uniTop <- topfeatures(BNT_uniDFM, n=10)

##create a summary dataframe for plotting
uni_Top10 <- data.frame(
        unigram = names(BNT_uniTop), 
        frequency=BNT_uniTop)
##make barplot
library(ggplot2)
ggplot(uni_Top10, aes(x=unigram, y=frequency))+
        geom_bar(stat = "identity") +
        labs(title = "Top 10 Unigrams",
             x = "unigram", y = "frequency")+
        theme(axis.text.x = element_text(angle = 45, hjust = 1))

Bigram

BNT_bi <- tokens_ngrams(BNT_tokens, n=2)
BNT_biDFM <- dfm(BNT_bi)
BNT_biTop <- topfeatures(BNT_biDFM, n=10)
bi_Top10 <- data.frame(
        bigram = names(BNT_biTop),
        frequency = BNT_biTop
)
ggplot(bi_Top10, aes(x=bigram, y=frequency))+
        geom_bar(stat = "identity") + 
        labs(title = "Top 10 Bigrams",
             x = "bigram", y = "frequency")+
        theme(axis.text.x = element_text(angle = 45, hjust = 1))

Trigram

BNT_tri <- tokens_ngrams(BNT_tokens, n=3)
BNT_triDFM <- dfm(BNT_tri)
BNT_triTop <- topfeatures(BNT_triDFM, n=10)
tri_Top10 <- data.frame(trigram = names(BNT_triTop), frequency = BNT_triTop)
ggplot(tri_Top10, aes(x=trigram, y=frequency))+
        geom_bar(stat = "identity") + 
        labs(title = "Top 10 Trigrams", x = "trigram", y = "frequency")+
        theme(axis.text.x = element_text(angle = 45, hjust = 1))

We notice that profane language is present at high frequency in these texts. We do not remove it from the analysis at this step of the project, but we will develop a plan to handle it in our final model.
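
One possible approach, sketched below, is to filter the tokens against a profanity word list before building the n-grams. This is only a sketch: "profanity_list.txt" is a hypothetical local word list and not part of the current project files.

##possible approach (sketch only): drop profane tokens before building n-grams
##"profanity_list.txt" is a hypothetical word list, one word per line
profanity <- readLines("profanity_list.txt", warn = FALSE)
BNT_tokens_clean <- tokens_remove(BNT_tokens, pattern = profanity)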

Summary

In this step of the project, we downloaded the dataset, examined the files and performed exploratory analysis. We also pre-processed the data, including sampling, combining and text transformation, to create a corpus ready for further analysis. Our next step is to build a prediction model based on n-gram analysis.
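
As a rough sketch of that direction (it reuses the trigram counts in BNT_triDFM from above and a simple "most frequent continuation" rule, which is not the final model), next-word prediction could look up the most frequent trigram beginning with the last two words typed:

##rough sketch only: predict the next word as the most frequent trigram
##continuation of the last two (cleaned) words typed
tri_freq <- colSums(BNT_triDFM)
predict_next <- function(w1, w2) {
        prefix <- paste0("^", w1, "_", w2, "_")
        matches <- tri_freq[grepl(prefix, names(tri_freq))]
        if (length(matches) == 0) return(NA_character_)
        ##keep only the third word of the most frequent matching trigram
        sub(prefix, "", names(which.max(matches)))
}
##hypothetical usage
predict_next("happy", "mothers")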