==================================================================================================

SYNOPSIS

This report describes my exploratory analysis of the three English text files (blogs, news and twitter). Because all three files are large, the analysis randomly samples 3,000 and 5,000 lines from each of them. First, basic summaries of the three files are presented. Next, histograms show the frequency distributions of the top 20 2-grams and 3-grams. Then, two 2-gram word cloud charts are created for each file. Finally, I list some interesting findings from my sample trials.

DATA PROCESSING

1. Set up the working directory and libraries, and load the blogs, news and twitter data

setwd("~/Desktop/Coursera/Data Science Capstone/final/en_US")

stopwords_table <- read.csv("stopwords.csv", header = FALSE)
colnames(stopwords_table) <- c("stopword")

library(tm)
## Loading required package: NLP
library(ggplot2)
## 
## Attaching package: 'ggplot2'
## 
## The following object is masked from 'package:NLP':
## 
##     annotate
library(wordcloud)
## Loading required package: RColorBrewer
data.blogs <- file("en_US.blogs.txt", open="rb")
blogs <- readLines(data.blogs)
close(data.blogs)

data.news <- file("en_US.news.txt", open="rb")
news <- readLines(data.news)
close(data.news)

data.twitter <- file("en_US.twitter.txt", open="rb")
twitter <- readLines(data.twitter)
close(data.twitter)

2. Conduct some basic summaries of the three datasets

# Report the number of lines and a simple whitespace-based word count
basic_sum <- function(file) {
        line_count <- length(file)
        # split each line on spaces and count the resulting tokens
        words_per_line <- strsplit(file, " ")
        word_count <- sum(lengths(words_per_line))

        print(paste("Line Count: ", line_count))
        print(paste("Words Count:", word_count))
}
basic_sum(blogs)
## [1] "Line Count:  899288"
## [1] "Words Count: 37334131"
basic_sum(news)
## [1] "Line Count:  1010242"
## [1] "Words Count: 34372530"
basic_sum(twitter)
## [1] "Line Count:  2360148"
## [1] "Words Count: 30373543"

3. Create a stopword table

This stopword table is based on “terrier-stop.txt” (available from https://bitbucket.org/kganes2/text-mining-resources/downloads), which I found via http://www.text-analytics101.com/2014/10/all-about-stop-words-for-text-mining.html. I added some extra words that I also consider stopwords.

# Convert the stopword column to character and sort it alphabetically
stopwords_table <- data.frame(stopword = sort(as.character(stopwords_table$stopword), na.last = TRUE))

# Add a space-padded version of each stopword so whole words can be matched inside text
stopwords_table$stopword_w_blank <- paste(" ", stopwords_table$stopword, " ", sep = "")

# Plain character vectors with punctuation removed (intra-word dashes preserved)
stopwords <- removePunctuation(as.character(stopwords_table$stopword), preserve_intra_word_dashes = TRUE)
stopwords_w_blank <- removePunctuation(stopwords_table$stopword_w_blank, preserve_intra_word_dashes = TRUE)

head(stopwords_table, n = 20)
##       stopword stopword_w_blank
## 1            a               a 
## 2        abaft           abaft 
## 3      abafter         abafter 
## 4     abaftest        abaftest 
## 5         able            able 
## 6        about           about 
## 7      abouter         abouter 
## 8     aboutest        aboutest 
## 9        above           above 
## 10      abover          abover 
## 11     abovest         abovest 
## 12 accordingly     accordingly 
## 13         aer             aer 
## 14        aest            aest 
## 15       afore           afore 
## 16       after           after 
## 17     afterer         afterer 
## 18    afterest        afterest 
## 19   afterward       afterward 
## 20  afterwards      afterwards
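
The report does not echo where these stopword vectors are applied, so the following is only a plausible sketch: a hypothetical helper (not the original code) that drops stopwords from a line that has already been lower-cased and stripped of punctuation.

# Hypothetical helper: remove stopwords from an already-cleaned line
remove_stopwords <- function(line, stopword_vec = stopwords) {
        words <- unlist(strsplit(line, " "))
        words <- words[words != "" & !(words %in% stopword_vec)]
        paste(words, collapse = " ")
}

# Example: remove_stopwords("this is a very happy new year") keeps only the
# non-stopword tokens, e.g. "happy new year".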

4. N-gram function

In my n-gram function, I set a UniqWord_Count_Interval parameter (default 25,000). While building the function, I ran into the question of how many frequent words/phrases are the “ideal number” for helping with prediction without sacrificing time to include more data in the model. This page (http://www.talkenglish.com/Vocabulary/english-vocabulary.aspx) was helpful; one of its most interesting points is that “Professor Paul Nation found that a person needs to know 8,000-9,000 word families to enjoy reading a book.” If knowing 8,000-9,000 word families is enough to read books, then perhaps “knowing” 8,000-9,000 word families is also enough to predict the next word. Assuming each word family contains at least three different word forms (noun, adverb and adjective), 8,000-9,000 families times three words gives roughly 24,000-27,000 words, which is how the default UniqWord_Count_Interval of 25,000 was chosen.
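
The n-gram function itself is not echoed in this report, so the following is only a minimal sketch of how such a function could look, built on the tm helpers loaded above. The function name ngram_freq and its exact arguments are assumptions, with UniqWord_Count_Interval capping the number of unique n-grams that are kept.

# Sketch (not the original code): count n-grams in a character vector of lines
# and keep at most UniqWord_Count_Interval of the most frequent unique n-grams
ngram_freq <- function(lines, n = 2, UniqWord_Count_Interval = 25000) {
        # basic cleaning: lower-case, strip punctuation and extra whitespace
        lines <- tolower(lines)
        lines <- removePunctuation(lines, preserve_intra_word_dashes = TRUE)
        lines <- stripWhitespace(lines)

        # split each line into words and build the n-grams line by line
        ngrams <- unlist(lapply(strsplit(lines, " "), function(words) {
                words <- words[words != ""]
                if (length(words) < n) return(character(0))
                sapply(1:(length(words) - n + 1), function(i) {
                        paste(words[i:(i + n - 1)], collapse = " ")
                })
        }))

        # count each unique n-gram and keep only the most frequent ones
        freq <- sort(table(ngrams), decreasing = TRUE)
        head(freq, UniqWord_Count_Interval)
}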

RESULT

Plot histograms of the 2-gram and 3-gram frequency distributions for the samples drawn from the three files
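
The sampling and plotting code is not echoed here either; as one illustration, the snippet below is a sketch of how one of the histograms could be produced with the ggplot2 package loaded above, reusing the hypothetical ngram_freq sketch from the previous section (the seed value and sample size are assumptions).

# Sketch: sample 3,000 blog lines, count 2-grams, and plot the top 20
set.seed(1234)
blogs_sample <- sample(blogs, 3000)
blogs_2gram  <- ngram_freq(blogs_sample, n = 2)

top20 <- data.frame(ngram = names(blogs_2gram)[1:20],
                    freq  = as.numeric(blogs_2gram)[1:20])

ggplot(top20, aes(x = reorder(ngram, freq), y = freq)) +
        geom_bar(stat = "identity") +
        coord_flip() +
        labs(title = "Top 20 2-grams (Blogs, 3,000 sampled rows)",
             x = "2-gram", y = "Frequency")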

WORD CLOUD FOR FUN – 2-GRAM ONLY

Create the 2-gram word clouds from my sample results
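
The word cloud code is likewise not echoed (only its printed captions appear below); the snippet that follows is a sketch of how each cloud could be drawn with the wordcloud package loaded above, again reusing the hypothetical ngram_freq sketch.

# Sketch: 2-gram word cloud for a random sample of 3,000 Twitter rows
set.seed(1234)
twitter_2gram <- ngram_freq(sample(twitter, 3000), n = 2)

wordcloud(words = names(twitter_2gram),
          freq = as.numeric(twitter_2gram),
          max.words = 100,
          random.order = FALSE,
          colors = brewer.pal(8, "Dark2"))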

## [1] "Blogs' 2-gram Word Cloud from Random 3000 Rows"

## [1] "Blogs' 2-gram Word Cloud from Random 5000 Rows"

## [1] "Twitter's 2-gram  Word Cloud from Random 3000 Rows"

## [1] "Twitter's 2-gram Word Cloud from Random 5000 Rows"

## [1] "News' 2-gram Word Cloud from Random 3000 Rows"

## [1] "News' 2-gram Word Cloud from Random 5000 Rows"

INTERESTING FINDINGS

  1. From my sample trials, I find that the size of the n-gram tables (2-gram and 3-gram) built from 5,000 sampled rows is about twice that of the corresponding tables built from 3,000 sampled rows, for all three files.

  2. From my sample trials, the most frequent 3-grams in the Twitter data include “happy new year”, “happy christmas eve” and “happy mothers day”. I therefore believe the Twitter data was collected over a period spanning at least December through May.

  3. From my sample trials, the blogs data contains “last year” and “next year” among its most frequent 2-grams; therefore, if the blogs come from regularly updated sources, the period over which they were collected likely spans the turn of a year.

PLANS FOR CREATING A PREDICTION ALGORITHM

  1. Runtime and Efficiency
  2. Accuracy