Synopsis

This document is for the fulfillment of the second-week requirement of the Data Science Capstone, the last course in the Data Science Specialization offered by Johns Hopkins University and Coursera. The task for this week is to demonstrate the student’s ability to explore data, create basic summaries, present data visually using plots, and write a concise report. The ultimate goal of the Capstone Project is to create a predictive text model based on the training data. This report describes the data we will be using for this project.

Data Processing

The data is from a corpus called HC Corpora (www.corpora.heliohost.org). Details about the corpora may be obtained from http://www.corpora.heliohost.org/aboutcorpus.html.

Loading the Data

The training data was downloaded from https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip. The zipped file contained files in four different languages: English, German, Russian, and Finnish. We will be using the English database, which contains text of varying lengths from three sources: news, blogs, and Twitter.

library(data.table)
### The argument skipNul lets readLines read past the embedded nuls in lines 167155, 268547, 1274086, and 1759032 of the Twitter file.
twit <- readLines("en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
blog <- readLines("en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
news <- fread("en_US.news.txt", sep = "\n", header = FALSE)
## Read 1010228 rows and 1 (of 1) columns from 0.192 GB file in 00:00:07

Summary of the Data

The number of lines of text in each file matches the line counts given in the readme file of the dataset.

library(stringi)
### Summarize each source: file size in MB, number of lines, and the mean number of words, sentences, and characters per line.
twit_data <- data.frame(Type = c("tweets", "blog", "news"),
            Size_Mb = c(file.size("en_US.twitter.txt")/1024^2, file.size("en_US.blogs.txt")/1024^2, file.size("en_US.news.txt")/1024^2),
            Lines = c(length(twit), length(blog), length(news$V1)),
            MeanWordsperLine = c(mean(stri_count_words(twit)), mean(stri_count_words(blog)), mean(stri_count_words(news$V1))),
            MeanSentperLine = c(mean(stri_count_boundaries(twit, type = "sentence")), mean(stri_count_boundaries(blog, type = "sentence")), mean(stri_count_boundaries(news$V1, type = "sentence"))),
            MeanCharperLine = c(mean(stri_count_boundaries(twit, type = "character")), mean(stri_count_boundaries(blog, type = "character")), mean(stri_count_boundaries(news$V1, type = "character"))))
print(twit_data)
##     Type  Size_Mb   Lines MeanWordsperLine MeanSentperLine MeanCharperLine
## 1 tweets 159.3641 2360148         12.75065        1.601754        68.68042
## 2   blog 200.4242  899288         41.75108        2.647073       229.98668
## 3   news 196.2775 1010228         34.65632        1.990405       201.59272

Sampling the Data

### Take a reproducible 10% random sample of each source, then remove the full object to free memory.
set.seed(1379)
twit_samp <- sample(twit, length(twit) * 0.1)
rm(twit)
set.seed(1379)
news_samp <- sample(news$V1, length(news$V1) * 0.1)
rm(news)
set.seed(1379)
blog_samp <- sample(blog, length(blog) * 0.1)
rm(blog)

Exploring and Cleaning the Data

Exploring the individual words in the data gave me an idea of the most common abbreviations, expressions, and words. Some of them needed to be replaced with full words in order to form a logical sequence of words and thoughts, or to make their meaning explicit, particularly in the Twitter database. People express themselves differently on Twitter: new words and abbreviations are created daily, and emotions such as approval or disapproval are conveyed with emoticons.

twit_samp <- gsub("http(s?)://(.*)[.][a-z]+", " ", twit_samp)
twit_samp <- gsub("(www)[.][A-z]", " ", twit_samp)
twit_samp <- gsub("^#+(.*)", " ", twit_samp)
twit_samp <- gsub("@", "at", twit_samp)
twit_samp <- gsub("[Xx]+(.*)[Oo]", " ", twit_samp)
twit_samp <- gsub("[Ll][Oo][Ll]", " ", twit_samp)
twit_samp <- gsub(" [Uu] ", "you", twit_samp)
twit_samp <- gsub("[Tt][Hh|Nn][Xx]", "thanks", twit_samp)
twit_samp <- gsub("([Hh][Aa|Ee]){2,}", " ", twit_samp)
twit_samp <- gsub("([Bb][Ee][Cc][Uu][Zz])|([Cc][Uu][Zz])", "because", twit_samp)
twit_samp <- gsub("[Oo][Mm][Gg]", "oh my gosh",twit_samp)

I was glad for the chance to practice my skills with regular expressions and the grep and gsub functions, but I later learned that the dfm function in quanteda has arguments to handle Twitter and URL characters. I will use the code above only for the Twitter dataset, where abbreviations are commonplace, and let dfm clean the blogs and news datasets.
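As a hedged sketch of that alternative (the argument names removeTwitter and removeURL are assumptions based on the quanteda version in use and may differ in other releases), a unigram matrix for the blogs sample could be built while letting dfm do the cleaning:

library(quanteda)   ### loaded here only for this sketch; it is loaded again below
blog1words_alt <- dfm(blog_samp, ngrams = 1, verbose = FALSE, toLower = TRUE, removeNumbers = TRUE,
                      removePunct = TRUE, removeTwitter = TRUE, removeURL = TRUE)   ### removeTwitter/removeURL names assumed; check ?dfm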

Frequency of Single Words or Unigrams

The relationship of each word to the word or words before it in a sequence provides the basis for predicting the next word in the sequence. This probability-based relationship depends on the number of times the words follow each other in the corpus.
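As a minimal illustration of that idea with made-up counts (not taken from our corpus): if a hypothetical corpus contained the word “thank” 500 times and the bigram “thank you” 450 times, the estimated probability that “you” follows “thank” is simply their ratio.

count_thank     <- 500   ### hypothetical count of the word "thank"
count_thank_you <- 450   ### hypothetical count of the bigram "thank you"
count_thank_you / count_thank   ### relative frequency estimate of P(you | thank) = 0.9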

We first examine the number of times individual words appear in the corpus.

library(quanteda)
library(RColorBrewer)
twit1words <- dfm(twit_samp, ngrams = 1, verbose = FALSE, toLower = TRUE, removeNumbers = TRUE, removePunct = TRUE, removeSeparators = TRUE, stem = FALSE, concatenator = " ", stopwords = FALSE)
blog1words <- dfm(blog_samp, ngrams = 1, verbose = FALSE, toLower = TRUE, removeNumbers = TRUE, removePunct = TRUE, removeSeparators = TRUE, stem = FALSE, concatenator = " ", stopwords = FALSE)
news1words <- dfm(news_samp, ngrams = 1, verbose = FALSE, toLower = TRUE, removeNumbers = TRUE, removePunct = TRUE, removeSeparators = TRUE, stem = FALSE, concatenator = " ", stopwords = FALSE)
par(mfrow = c(1, 3))
plot(twit1words, max.words = 100, colors = brewer.pal(6, "Dark2"), scale = c(8, .5))
plot(blog1words, max.words = 100, colors = brewer.pal(6, "Dark2"), scale = c(8, .5))
plot(news1words, max.words = 100, colors = brewer.pal(6, "Dark2"), scale = c(8, .5))

fig.1. Frequency of individual words from the Twitter, blogs, and news (left to right) database samples, shown as word clouds

We can see that the most frequent single words from the three database samples are quite similar even though they originated from different sources. Words on Twitter can be quite unique and different from the more formal language used in news and blogs, but our data, after removal of extraneous characters such as hashtags, URLs, and @ signs, reveal that they are quite similar. Note: the word “the” is also a frequently occurring word in the Twitter database sample; it just wasn’t printed in the word cloud.
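To back up that note without relying on the word cloud, we can list the most frequent features directly with quanteda’s topfeatures(), which returns a named vector of feature counts in decreasing order (output omitted here):

topfeatures(twit1words, 10)   ### shows where "the" ranks among the ten most frequent unigrams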

Frequency of pairs of words or Bigrams

We next examine the words in pairs in the twitter sample.

library(ggplot2)
bitwit <- dfm(twit_samp, ngrams = 2, verbose = FALSE, toLower = TRUE, removeNumbers = TRUE, removePunct = TRUE, removeSeparators = TRUE, stem = FALSE, concatenator = " ", stopwords = FALSE)
twit2w <- as.data.frame(as.matrix(docfreq(bitwit)))
twit2wS <- sort(rowSums(twit2w), decreasing=TRUE)
twit2_FreqTable <- data.frame(Words=names(twit2wS), Frequency = twit2wS)
tbigram <- ggplot(within(twit2_FreqTable[1:15, ], Words <- factor(Words, levels=Words)), aes(Words, Frequency))
tbigram <- tbigram + geom_bar(stat="identity", fill="steelblue2") + ggtitle("Top 15 Bigrams from Twitter")
tbigram <- tbigram + theme(axis.text.x=element_text(angle=45, hjust=1))
tbigram

fig.2. Frequency of 2-word sequences or bigrams in the Twitter database sample

We see that the word “the” occurs most frequently in tandem with words like “in”, “for”, “of”, and “on”.

biblog <- dfm(blog_samp, ngrams = 2, verbose = FALSE, concatenator = " ", stopwords=FALSE)
blog2w <- as.data.frame(as.matrix(docfreq(biblog)))
blog2wS <- sort(rowSums(blog2w), decreasing=TRUE)
blog2_FreqTable <- data.frame(Words=names(blog2wS), Frequency = blog2wS)
bbigram <- ggplot(within(blog2_FreqTable[1:15, ], Words <- factor(Words, levels=Words)), aes(Words, Frequency))
bbigram <- bbigram + geom_bar(stat="identity", fill="steelblue2") + ggtitle("Top 15 Bigrams from Blogs")
bbigram <- bbigram + theme(axis.text.x=element_text(angle=45, hjust=1))
bbigram

fig.3. Frequency of 2-word sequences or bigrams in the blogs database sample

The same pairs of words occur most frequently in the sample from blogs.

binews <- dfm(news_samp, ngrams = 2, verbose = FALSE, concatenator = " ", stopwords=FALSE)
news2w <- as.data.frame(as.matrix(docfreq(binews)))
news2wS <- sort(rowSums(news2w), decreasing=TRUE)
news2_FreqTable <- data.frame(Words=names(news2wS), Frequency = news2wS)
nbigram <- ggplot(within(news2_FreqTable[1:15, ], Words <- factor(Words, levels=Words)), aes(Words, Frequency))
nbigram <- nbigram + geom_bar(stat="identity", fill="steelblue2") + ggtitle("Top 15 Bigrams from news")
nbigram <- nbigram + theme(axis.text.x=element_text(angle=45, hjust=1))
nbigram

fig.4. Frequency of 2-word sequences or bigrams in the news database sample

The same pairs of words, or bigrams, are found in the three different database sources. The top 15 are consistent with common usage in the English language. If we look at the least frequent combinations, we will likely see odd pairings. Fortunately, prediction is based on the frequency with which words occur together, that is, on how often people use certain combinations.
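As a rough sketch of how such a frequency table could drive a prediction (only an illustration, not the final model): keep the bigrams that start with a given word and return the second word of the most frequent one.

predict_next <- function(first_word, freq_table) {
    parts <- strsplit(as.character(freq_table$Words), " ")         ### split each bigram into its two words
    starts_with <- sapply(parts, function(p) p[1] == first_word)   ### keep bigrams whose first word matches
    candidates <- freq_table[starts_with, ]
    if (nrow(candidates) == 0) return(NA_character_)               ### no matching bigram, no prediction
    best <- as.character(candidates$Words[which.max(candidates$Frequency)])
    strsplit(best, " ")[[1]][2]                                    ### second word of the most frequent bigram
}
predict_next("in", twit2_FreqTable)   ### likely returns "the", judging from the plot above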

Frequency of a sequence of three words or Trigrams

We now look at the 3-word combinations or trigrams.

twit3words <- dfm(twit_samp, ngrams = 3, verbose = FALSE, toLower = TRUE, removeNumbers = TRUE, removePunct = TRUE, removeSeparators = TRUE, stem = FALSE, concatenator = " ", stopwords = FALSE)
blog3words <- dfm(blog_samp, ngrams = 3, verbose = FALSE, toLower = TRUE, removeNumbers = TRUE, removePunct = TRUE, removeSeparators = TRUE, stem = FALSE, concatenator = " ", stopwords = FALSE)
news3words <- dfm(news_samp, ngrams = 3, verbose = FALSE, toLower = TRUE, removeNumbers = TRUE, removePunct = TRUE, removeSeparators = TRUE, stem = FALSE, concatenator = " ", stopwords = FALSE)
par(mfrow = c(1, 3))
plot(twit3words, max.words = 50, colors = brewer.pal(6, "Dark2"), scale = c(8, .5))
plot(blog3words, max.words = 50, colors = brewer.pal(6, "Dark2"), scale = c(8, .5))
plot(news3words, max.words = 50, colors = brewer.pal(6, "Dark2"), scale = c(8, .5))

fig.5. Frequency of 3-word sequences or trigrams from the Twitter, blogs, and news (left to right) database samples, shown as word clouds

Earlier we found that the same single words and pairs of words occur most frequently in the samples from Twitter, blogs, and news sources. Examining the word clouds of trigrams above, we find that these single words and word pairs are combined into different three-word sequences. This time the most frequent combinations differ across the samples obtained from the three different sources. Combining the samples from the three sources would give us a better basis for predicting the next word in a sequence.
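A minimal sketch of that combination, pooling the three samples into a single character vector before building a trigram matrix (the dfm arguments mirror the ones used earlier; this is a possible next step rather than part of the present analysis):

all_samp <- c(twit_samp, blog_samp, news_samp)   ### pooled 10% samples from the three sources
all3words <- dfm(all_samp, ngrams = 3, verbose = FALSE, toLower = TRUE, removeNumbers = TRUE, removePunct = TRUE, concatenator = " ")
topfeatures(all3words, 10)   ### most frequent trigrams in the combined sample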

Conclusion

Examination of the three datasets, issued from three different sources, reveals that the same words occur most frequently despite their different origins. This illustrates that certain words are used more often than others to put sentences together, despite the variety of ways we can join them or invent new pieces to express thoughts or emotions in a unique new way.

However, differences start to manifest as the sequence of words grows longer, beginning with sequences of three words. News and blogs use more formal language than Twitter. The third word in the sequence reflects the differences in the topics of interest addressed in the different media from which the data was acquired. Twitter is a forum where everyday personal situations are discussed. Blogs are more like reflections on topics of interest, while news addresses current events that may affect national or international interests.

Twitter is the birthplace of new words, and the birth rate is quite phenomenal. Some of these new words gain favor and enter our perception and ultimately our everyday vocabulary. The presence of new words that have not yet gained acceptance gives rise to the need to clean the data further before incorporating it into our corpus. When we ultimately build our prediction model, the variety of our samples will increase the likelihood that our predictions are accurate.

It is amazing how new technology can be invented or discovered that uses text as data to come up with a brand new way of analyzing text and generating new knowledge. The content of this course is surely an eye opener for me. I often found myself exploring topics that would not help at all in my project but are very interesting.

I thought Developing Data Products was already awesome, but the organizers of this course outdid themselves. They really saved the best for last. What an appropriate project for the last course in the series, deserving to be called a capstone project.

References

n-gram model

HC Corpora, About the Corpus: http://www.corpora.heliohost.org/aboutcorpus.html

Natural language processing

Text mining infrastructure in R

Jurafsky D and Martin JH, Chapter 4 in Speech and Language Processing

Appendix

Code for downloading the data

I originally allowed the code for downloading the data to be evaluated. However, problems with a lack of memory to process the entire document persisted throughout the process. I had already decided to create three documents, one for each database sample, when some of the things I was trying started to work.

fileurl <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
download.file(fileurl, destfile = "./data/coursera_swiftkey.zip")
dateDownloaded <- date() ### Tue Jun 07 16:26:40 2016
print(dateDownloaded)
unzip("./data/coursera_swiftkey.zip")
list.files("data")
list.files("./data/final")
list.files("./data/final/en_US")
sessionInfo()
## R version 3.2.4 (2016-03-10)
## Platform: i386-w64-mingw32/i386 (32-bit)
## Running under: Windows 10 (build 10586)
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## loaded via a namespace (and not attached):
##  [1] magrittr_1.5    formatR_1.3     tools_3.2.4     htmltools_0.3  
##  [5] yaml_2.1.13     stringi_1.1.1   rmarkdown_0.9.5 knitr_1.12.3   
##  [9] stringr_1.0.0   digest_0.6.9    evaluate_0.8.3