This report describes my exploratory analysis of the three text files (blogs, news and twitter). Because all three files are large, the analysis randomly samples 10,000 and 30,000 lines from them. First, basic summaries of the three files are computed. Next, histograms are plotted to show the frequency distributions of the top 20 2-grams and 3-grams. Then 2-gram word cloud charts are created for each file. I also list some interesting findings from my sample trials. Finally, I briefly discuss my prediction algorithm.
1. Set up the working directory and libraries, and load the blogs, news and twitter data
setwd("~/Desktop/Coursera/Data Science Capstone/final/en_US")
stopwords_table <- read.csv("stopwords.csv", header = FALSE)
colnames(stopwords_table) <- c("stopword")
library(tm)
## Loading required package: NLP
library(ggplot2)
##
## Attaching package: 'ggplot2'
##
## The following object is masked from 'package:NLP':
##
## annotate
library(wordcloud)
## Loading required package: RColorBrewer
data.blogs <- file("en_US.blogs.txt", open="rb")
blogs <- readLines(data.blogs)
close(data.blogs)
data.news <- file("en_US.news.txt", open="rb")
news <- readLines(data.news)
close(data.news)
data.twitter <- file("en_US.twitter.txt", open="rb")
twitter <- readLines(data.twitter)
close(data.twitter)
2. Conduct basic summaries on the three datasets
# Count the number of lines and whitespace-separated words in a corpus
basic_sum <- function(file) {
  line_count <- length(file)
  file_1 <- strsplit(file, " ")
  word_count <- 0
  for (i in 1:length(file_1)) {
    word_count <- word_count + length(file_1[[i]])
  }
  print(paste("Line Count: ", line_count))
  print(paste("Words Count:", word_count))
}
basic_sum(blogs)
## [1] "Line Count: 899288"
## [1] "Words Count: 37334131"
basic_sum(news)
## [1] "Line Count: 1010242"
## [1] "Words Count: 34372530"
basic_sum(twitter)
## [1] "Line Count: 2360148"
## [1] "Words Count: 30373543"
3. N-gram function
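The n-gram code itself is not echoed in this report. Below is a minimal sketch of a helper that builds an n-gram frequency table from a character vector of lines, using the tm functions loaded above (the function name ngram_freq and its implementation are my assumptions, not the exact code behind the figures).
# Build a sorted frequency table of n-grams from a character vector of lines
ngram_freq <- function(lines, n = 2) {
  # basic cleaning: lower case, drop punctuation, collapse extra whitespace
  lines <- tolower(lines)
  lines <- removePunctuation(lines, preserve_intra_word_dashes = TRUE)
  tokens <- strsplit(stripWhitespace(lines), " ")
  ngrams <- unlist(lapply(tokens, function(w) {
    if (length(w) < n) return(character(0))
    sapply(1:(length(w) - n + 1), function(i) paste(w[i:(i + n - 1)], collapse = " "))
  }))
  sort(table(ngrams), decreasing = TRUE)
}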
Plot histograms of the 2-gram and 3-gram frequency distributions of my samples from the three files
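The plotting code is likewise not echoed. A sketch of how the top 20 2-gram frequencies of the blogs sample could be plotted with the ggplot2 package loaded above, assuming the ngram_freq helper and blogs_sample object sketched earlier:
# Bar chart of the 20 most frequent 2-grams in the blogs sample
freq_2gram <- ngram_freq(blogs_sample, 2)
top20 <- data.frame(ngram = names(freq_2gram)[1:20],
                    count = as.numeric(freq_2gram[1:20]))
ggplot(top20, aes(x = reorder(ngram, count), y = count)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  labs(title = "Top 20 2-grams in the blogs sample", x = "2-gram", y = "Frequency")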
Create the 2-gram word clouds of my sample results
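A sketch of how a 2-gram word cloud could be drawn with the wordcloud package loaded above (again assuming the ngram_freq helper and blogs_sample):
# 2-gram word cloud for the blogs sample
freq_2gram <- ngram_freq(blogs_sample, 2)
wordcloud(words = names(freq_2gram), freq = as.numeric(freq_2gram),
          max.words = 100, random.order = FALSE,
          colors = brewer.pal(8, "Dark2"))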
## [1] "Blogs' 2-gram Word Cloud from Random 30000 Rows"
## [1] "Twitter's 2-gram Word Cloud from Random 30000 Rows"
## [1] "News' 2-gram Word Cloud from Random 30000 Rows"
From my sample trials, the most frequent 3-grams in the twitter data include “happy new year”, “happy christmas eve” and “happy mothers day”. Therefore, I believe the twitter data was extracted at least over the period from December to May.
From my sample trials, the blogs data contains “last year” and “next year” among the most frequent 2-grams; therefore, if the blogs come from regularly updated sources, the period over which they were extracted probably spans the end of a year.
My prediction algorithm first tries an N-gram (4-gram, 3-gram and 2-gram) match to predict the next word. If the N-gram match produces no result, stopwords are removed from the input text so that only key words and phrases remain. The algorithm then searches the sample sentence pool for the sentence that shares the most key words and phrases with the stopword-free input, and takes the predicted word from that sentence.
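A minimal sketch of the N-gram back-off step described above, assuming frequency tables freq4, freq3 and freq2 whose names are space-separated 4-, 3- and 2-grams (the function name, table names and structure are my assumptions; the keyword-matching fallback is only indicated by a comment):
# Predict the next word by backing off from 4-grams to 3-grams to 2-grams.
# freq4, freq3 and freq2 are assumed to be named frequency tables of n-grams.
predict_next <- function(input, freq4, freq3, freq2) {
  words <- strsplit(tolower(stripWhitespace(input)), " ")[[1]]
  tables <- list(freq4, freq3, freq2)  # a prefix of n words uses the (n+1)-gram table
  for (n in 3:1) {
    if (length(words) < n) next
    prefix <- paste(tail(words, n), collapse = " ")
    tbl <- tables[[4 - n]]
    # candidate n-grams whose first n words equal the prefix
    hits <- tbl[startsWith(names(tbl), paste0(prefix, " "))]
    if (length(hits) > 0) {
      best <- names(hits)[which.max(hits)]
      return(tail(strsplit(best, " ")[[1]], 1))  # last word of the most frequent match
    }
  }
  NA  # no N-gram match: fall back to the keyword-matching step described above
}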
Below is the stopword table (only the first 20 rows are shown). It is based on “terrier-stop.txt” (which can be downloaded here: https://bitbucket.org/kganes2/text-mining-resources/downloads), found via http://www.text-analytics101.com/2014/10/all-about-stop-words-for-text-mining.html. I also added some extra words that I consider stopwords.
# Flatten the stopword column into a sorted character vector and add a
# version of each stopword padded with blanks for whole-word matching
stopwords_table <- as.data.frame(as.character(unlist(list(stopwords_table$stopword))))
colnames(stopwords_table) <- "stopword"
stopwords_table <- as.data.frame(stopwords_table[order(stopwords_table$stopword, na.last = TRUE), ])
colnames(stopwords_table) <- "stopword"
stopwords_table$stopword_w_blank <- paste(" ", stopwords_table$stopword, " ", sep = "")
# Plain and blank-padded stopword vectors, with punctuation removed
stopwords <- as.character(unlist(list(stopwords_table$stopword)))
stopwords <- removePunctuation(stopwords, preserve_intra_word_dashes = TRUE)
stopwords_w_blank <- as.character(unlist(list(stopwords_table$stopword_w_blank)))
stopwords_w_blank <- removePunctuation(stopwords_w_blank, preserve_intra_word_dashes = TRUE)
head(stopwords_table, n = 20)
## stopword stopword_w_blank
## 1 a a
## 2 abaft abaft
## 3 abafter abafter
## 4 abaftest abaftest
## 5 able able
## 6 about about
## 7 abouter abouter
## 8 aboutest aboutest
## 9 above above
## 10 abover abover
## 11 abovest abovest
## 12 accordingly accordingly
## 13 aer aer
## 14 aest aest
## 15 afore afore
## 16 after after
## 17 afterer afterer
## 18 afterest afterest
## 19 afterward afterward
## 20 afterwards afterwards
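The stopword_w_blank column pads each stopword with spaces so that only whole words are matched when stopwords are stripped from an input sentence. A sketch of how that removal could work with the padded vector built above (the helper name strip_stopwords is my own assumption):
# Remove whole-word stopwords from an input sentence using the padded list
strip_stopwords <- function(input) {
  text <- paste(" ", tolower(input), " ")  # pad the sentence so edge words also match
  for (sw in stopwords_w_blank) {
    text <- gsub(sw, " ", text, fixed = TRUE)
  }
  stripWhitespace(trimws(text))
}
strip_stopwords("I would like to wish you a happy new year")
# keeps only key words, e.g. something like "wish happy new year", depending on the stop list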