==================================================================================================

SYNOPSIS

This report explains my exploratory analysis of the three text files (blogs, news and twitter). Because all three files are large, the analysis works with random samples of 10,000 and 30,000 lines drawn from them. First, basic summaries of the three files are presented. Next, histograms are plotted to show the frequency distributions of the top 20 2-grams and 3-grams. Then, 2-gram word cloud charts are created for each of the three files. I also list some interesting findings from my sample trials. Finally, I briefly discuss my prediction algorithm.

DATA PROCESSING

1. Set up the working directory and libraries, and load the blogs, news and twitter data

setwd("~/Desktop/Coursera/Data Science Capstone/final/en_US")

stopwords_table <- read.csv("stopwords.csv", header = FALSE)
colnames(stopwords_table) <- c("stopword")

library(tm)
## Loading required package: NLP
library(ggplot2)
## 
## Attaching package: 'ggplot2'
## 
## The following object is masked from 'package:NLP':
## 
##     annotate
library(wordcloud)
## Loading required package: RColorBrewer
# Open each file in binary mode ("rb") so that special characters embedded in
# the text do not cut the readLines() call short
data.blogs <- file("en_US.blogs.txt", open="rb")
blogs <- readLines(data.blogs)
close(data.blogs)

data.news <- file("en_US.news.txt", open="rb")
news <- readLines(data.news)
close(data.news)

data.twitter <- file("en_US.twitter.txt", open="rb")
twitter <- readLines(data.twitter)
close(data.twitter)
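
The synopsis mentions that the n-gram analysis later in this report works with random samples of 10,000 and 30,000 lines per file (the full files are summarised first). The sampling code is not echoed here; below is a minimal sketch of how such samples could be drawn. The seed, the 30,000-line sample size and the *.sample object names are illustrative assumptions, not the original code.

# Draw a reproducible random sample of lines from a corpus
# (seed, sample size and object names are illustrative assumptions)
set.seed(1234)
sample_lines <- function(text, n = 30000) {
        text[sample(seq_along(text), size = min(n, length(text)))]
}

blogs.sample   <- sample_lines(blogs)
news.sample    <- sample_lines(news)
twitter.sample <- sample_lines(twitter)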

2. Conduct basic summaries of the three datasets

basic_sum <- function(file) {
                # Number of lines in the character vector
                line_count <- length(file)
                # Split each line on spaces and add up the token counts
                file_1 <- strsplit(file, " ")
                word_count <- 0
                for (i in 1:length(file_1)) {
                        word_count <- word_count + length(file_1[[i]])
                }

                print(paste("Line Count: ", line_count))
                print(paste("Words Count:", word_count))
}
basic_sum(blogs)
## [1] "Line Count:  899288"
## [1] "Words Count: 37334131"
basic_sum(news)
## [1] "Line Count:  1010242"
## [1] "Words Count: 34372530"
basic_sum(twitter)
## [1] "Line Count:  2360148"
## [1] "Words Count: 30373543"

3. N-gram function
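
The n-gram function itself is not echoed in this report. Below is a minimal sketch of what such a function could look like, using base R plus the already loaded tm package; the name ngram_freq and its implementation are illustrative assumptions, not the original code.

# Build a frequency table of n-grams from a character vector of lines
# (illustrative sketch, not the original function)
ngram_freq <- function(lines, n = 2) {
        # Lower-case and strip punctuation, keeping intra-word dashes
        lines <- tolower(removePunctuation(lines, preserve_intra_word_dashes = TRUE))
        # Split each line into words
        tokens <- strsplit(lines, "\\s+")
        # Paste every run of n consecutive words into an n-gram
        ngrams <- unlist(lapply(tokens, function(words) {
                if (length(words) < n) return(character(0))
                sapply(seq_len(length(words) - n + 1),
                       function(i) paste(words[i:(i + n - 1)], collapse = " "))
        }))
        # Return the n-grams sorted by frequency, most frequent first
        sort(table(ngrams), decreasing = TRUE)
}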

RESULTS

Plot histograms of the top 20 2-gram and 3-gram frequency distributions for my samples from the three files
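
The plotting code is not shown in this section. A minimal sketch of how a top-20 2-gram bar chart could be produced with ggplot2 follows; it assumes the illustrative ngram_freq() helper and the blogs.sample object sketched earlier, which are not part of the original code.

# Top 20 2-grams of the blogs sample as a horizontal bar chart (illustrative sketch)
top20 <- head(ngram_freq(blogs.sample, n = 2), 20)
top20_df <- data.frame(ngram = names(top20), freq = as.numeric(top20))

ggplot(top20_df, aes(x = reorder(ngram, freq), y = freq)) +
        geom_bar(stat = "identity") +
        coord_flip() +
        labs(title = "Top 20 2-grams (blogs sample)", x = "2-gram", y = "Frequency")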

WORD CLOUD FOR FUN – 2-GRAM ONLY

Create the 2-gram word clouds of my sample results

## [1] "Blogs' 2-gram Word Cloud from Random 30000 Rows"

## [1] "Twitter's 2-gram Word Cloud from Random 30000 Rows"

## [1] "News' 2-gram Word Cloud from Random 30000 Rows"

INTERESTING FINDINGS

  1. From my sample trials, the most frequent 3-grams in the Twitter data include “happy new year”, “happy christmas eve” and “happy mothers day”. Therefore, I believe the Twitter data was collected over a period that spans at least December through May.

  2. From my sample trials, the blogs data includes “last year” and “next year” among the most frequent 2-grams. If the blogs come from recently updated sources, the period over which they were collected likely spans the turn of a year.

PLANS FOR CREATING A PREDICTION ALGORITHM

Prediction Algorithm

My prediction algorithm first tries an N-gram (4-gram, 3-gram and 2-gram) match to predict the next word. If the N-gram match produces no result, stopwords are removed from the input text so that only key words and phrases remain. The algorithm then searches a pool of sample sentences, finds the sentence that shares the most key words and phrases with the stopword-free input, and predicts the next word from that sentence.
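
To make the back-off part of this idea concrete, here is a minimal sketch of how the N-gram match could work. The function name, the structure of the frequency tables (named count vectors such as freq3["happy new year"]) and the implementation are my assumptions, not the final algorithm; the key-word fallback on the sentence pool is not shown.

# Illustrative back-off over 4-, 3- and 2-gram tables (not the original code)
# Each table is assumed to be a named vector of counts, e.g. freq3["happy new year"]
predict_next <- function(input, freq4, freq3, freq2) {
        words <- unlist(strsplit(tolower(input), "\\s+"))
        tables <- list(freq4, freq3, freq2)
        orders <- c(4, 3, 2)
        for (k in seq_along(tables)) {
                n <- orders[k]
                if (length(words) < n - 1) next
                # The last n-1 input words form the prefix to match
                prefix <- paste(tail(words, n - 1), collapse = " ")
                tab <- tables[[k]]
                hits <- tab[startsWith(names(tab), paste0(prefix, " "))]
                if (length(hits) > 0) {
                        # Return the last word of the most frequent matching n-gram
                        best <- names(hits)[which.max(hits)]
                        return(tail(unlist(strsplit(best, " ")), 1))
                }
        }
        # No N-gram match: fall back to the stopword-free key-word search (not shown)
        NA_character_
}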

Below is the stopword table (only the first 20 rows are shown), based on “terrier-stop.txt” (which can be downloaded here: https://bitbucket.org/kganes2/text-mining-resources/downloads), a list I found via http://www.text-analytics101.com/2014/10/all-about-stop-words-for-text-mining.html. I added some extra words that I also consider stopwords.

# Flatten the stopword column into a plain character vector and back into a data frame
stopwords_table <- as.data.frame(as.character(unlist(list(stopwords_table$stopword))))
colnames(stopwords_table) <- "stopword"

# Sort the stopwords alphabetically, keeping any NAs at the end
stopwords_table <- as.data.frame(stopwords_table[order(stopwords_table$stopword, na.last=TRUE), ])
colnames(stopwords_table) <- "stopword"

# Add a copy of each stopword padded with blanks, for whole-word matching in text
stopwords_table$stopword_w_blank <- paste(" ", stopwords_table$stopword, " ", sep = "")

# Plain character vectors of both versions, with punctuation stripped
stopwords <- as.character(unlist(list(stopwords_table$stopword)))
stopwords <- removePunctuation(stopwords, preserve_intra_word_dashes = TRUE)

stopwords_w_blank <- as.character(unlist(list(stopwords_table$stopword_w_blank)))
stopwords_w_blank <- removePunctuation(stopwords_w_blank, preserve_intra_word_dashes = TRUE)

head(stopwords_table, n = 20)
##       stopword stopword_w_blank
## 1            a               a 
## 2        abaft           abaft 
## 3      abafter         abafter 
## 4     abaftest        abaftest 
## 5         able            able 
## 6        about           about 
## 7      abouter         abouter 
## 8     aboutest        aboutest 
## 9        above           above 
## 10      abover          abover 
## 11     abovest         abovest 
## 12 accordingly     accordingly 
## 13         aer             aer 
## 14        aest            aest 
## 15       afore           afore 
## 16       after           after 
## 17     afterer         afterer 
## 18    afterest        afterest 
## 19   afterward       afterward 
## 20  afterwards      afterwards
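
The padded column stopword_w_blank suggests that stopwords are meant to be stripped from the input text by simple whole-word matching. A minimal sketch of such a removal step is shown below; the function name and the gsub-based approach are my assumptions, not the final implementation.

# Strip padded stopwords from a lower-cased input string (illustrative sketch)
# Padding the input with blanks lets the " word " patterns match at the boundaries
remove_stopwords <- function(text, padded_stopwords = stopwords_w_blank) {
        text <- paste0(" ", tolower(text), " ")
        for (sw in padded_stopwords) {
                text <- gsub(sw, " ", text, fixed = TRUE)
        }
        trimws(text)
}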

THINGS TO KEEP IN MIND

  1. Runtime and efficiency
  2. Accuracy

Thanks for reading!