This document is the first report of the milestone project. It explains the major features of the data and summarizes the plan for creating the prediction algorithm and the Shiny app.
The report presents the initial steps for working with the raw data: (1) loading the data; (2) summary statistics of the data sets; and (3) other interesting findings.
Within the report, important summaries of the data set are presented in the form of tables and plots.
Load the necessary packages for text mining, n-gram tokenization and plotting:
library(tm)
## Loading required package: NLP
library(RWeka)
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
##
## annotate
library(slam)
Set working directory
setwd("/Volumes/Data/Coursera/Course10_DataProjectCapstone/")
Load the three files (which were copied into the working directory):
- en_US.blogs.txt
- en_US.news.txt
- en_US.twitter.txt
blogs <- readLines("en_US.blogs.txt", encoding = "UTF-8",skipNul = TRUE)
news <- readLines("en_US.news.txt", encoding = "UTF-8",skipNul = TRUE)
twitter <- readLines("en_US.twitter.txt", encoding = "UTF-8",skipNul = TRUE)
Summary of file sizes, number of lines and characters in each file
size = file.size(c("en_US.blogs.txt","en_US.news.txt","en_US.twitter.txt"))
size = round((size/1024)/1000) # approximate size in MB (bytes / 1024 / 1000)
lines = sapply(list(blogs,news,twitter), length)
characters <- sapply(list(blogs, news, twitter), function(x){sum(nchar(x))})
summary = data.frame(files =c("en_US.blogs.txt","en_US.news.txt","en_US.twitter.txt"),size,lines,characters)
summary
##               files size   lines characters
## 1   en_US.blogs.txt  205  899288  206824505
## 2    en_US.news.txt  201 1010242  203223159
## 3 en_US.twitter.txt  163 2360148  162096241
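The size column above divides the byte count by 1024 and then by 1000, which lands between the two usual definitions of a megabyte. The sketch below compares the three conversions on a hypothetical byte count (the value of `bytes` is made up for illustration and is not the size of any of the three files).

```r
# Hypothetical byte count, used only to compare the conversions
bytes <- 200000000

mb_decimal <- bytes / 1000^2       # megabytes (MB), powers of 1000
mib_binary <- bytes / 1024^2       # mebibytes (MiB), powers of 1024
mb_mixed   <- bytes / 1024 / 1000  # mixed divisor, as used for the table above

round(c(MB = mb_decimal, MiB = mib_binary, mixed = mb_mixed), 1)
```

The differences are a few percent, so the table is fine as a rough summary, but the unit should be stated when exact sizes matter.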
To analyze the data, it is necessary to clean it and create a corpus.
Because all 3 data files are large and would take a very long time to process, I demonstrate data cleaning and corpus creation on a subset of en_US.blogs.txt (10% of the original file).
Create subset of blogs
set.seed(1010) # Make process reproducible
sub_blogs = blogs[sample(length(blogs),length(blogs)*0.1)] # make subset
sub_blogs <- iconv(sub_blogs, "latin1", "ASCII", sub="") # pre-clean data: drop every non-ASCII byte
sub_blogs_Corpus <- VCorpus(VectorSource(sub_blogs)) # Make corpus
sub_blogs_Corpus <- tm_map(sub_blogs_Corpus, stripWhitespace) # Remove unnecessary white space
sub_blogs_Corpus <- tm_map(sub_blogs_Corpus, removePunctuation) # Remove punctuation
sub_blogs_Corpus <- tm_map(sub_blogs_Corpus, removeNumbers) # Remove numbers
sub_blogs_Corpus <- tm_map(sub_blogs_Corpus, content_transformer(tolower)) # Convert to lowercase; content_transformer() keeps each document a PlainTextDocument
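To make the effect of the cleaning steps visible, here is a minimal sketch that applies comparable transformations to a single string using base R only. The function `clean_line` and the example text are hypothetical and are not part of the pipeline above; the sketch only approximates what the `tm` transformations do.

```r
# Made-up example line, not taken from the data set
raw_line <- "Hello,   World! I bought 3 apples \u2013 caf\u00e9 style."

# Hypothetical helper that mimics the cleaning steps with base R functions
clean_line <- function(x) {
  x <- iconv(x, from = "UTF-8", to = "ASCII", sub = "")  # drop non-ASCII characters
  x <- gsub("[[:punct:]]", "", x)  # remove punctuation
  x <- gsub("[[:digit:]]", "", x)  # remove digits
  x <- tolower(x)                  # convert to lowercase
  x <- gsub("\\s+", " ", x)        # collapse runs of whitespace
  trimws(x)                        # drop leading and trailing spaces
}

clean_line(raw_line)
```

Unlike the corpus code, this sketch collapses whitespace last, so the gaps left behind by removed numbers and punctuation are cleaned up as well.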
After the cleaning process, the data need to be converted into a form that is useful for Natural Language Processing (NLP). In this study, we chose n-grams, whose counts are stored in Term Document Matrices (TDM). An n-gram is a contiguous sequence of n items from a given sequence of text or speech. An n-gram of size 1 is referred to as a “unigram”; size 2 is a “bigram”; size 3 is a “trigram”. Each TDM cell holds the frequency count of the corresponding n-gram (row) in the corresponding document (column).
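As an illustration of these definitions, the following sketch builds bigrams from two toy documents and counts them in a small term-by-document table. It uses base R only; `make_ngrams` and the toy sentences are assumptions made for this example, and the report itself relies on `NGramTokenizer` and `TermDocumentMatrix` instead.

```r
# Toy documents (made up for illustration)
docs <- c(d1 = "the cat sat on the mat", d2 = "the cat ate")

# Hypothetical helper: all n-grams of one whitespace-separated string
make_ngrams <- function(text, n) {
  words <- strsplit(text, " +")[[1]]
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

bigrams <- lapply(docs, make_ngrams, n = 2)

# Rows are bigrams, columns are documents, cells are frequency counts
tdm <- table(term = unlist(bigrams),
             doc  = rep(names(docs), lengths(bigrams)))
tdm
```

In this toy table the bigram "the cat" has a count of 1 in both documents, while "cat sat" appears only in `d1`.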
Create a function that:
- builds an n-gram tokenizer and creates the matrix of n-grams
- calculates the frequency of each n-gram
- plots the 20 most frequent n-grams as a bar plot
n_grams_plot <- function(n, data) {
  options(mc.cores=1) # run tm on a single core
  # Build the n-gram tokenizer
  tk <- function(x) NGramTokenizer(x, Weka_control(min = n, max = n))
  # Create the term-document matrix of n-grams
  ngrams_matrix <- TermDocumentMatrix(data, control=list(tokenize=tk))
  # Sum the counts over all documents to get one total per n-gram
  ngrams_matrix <- as.matrix(rollup(ngrams_matrix, 2, na.rm=TRUE, FUN=sum))
  ngrams_matrix <- data.frame(word=rownames(ngrams_matrix), freq=ngrams_matrix[,1])
  # Keep the 20 most frequent n-grams
  ngrams_matrix <- ngrams_matrix[order(-ngrams_matrix$freq), ][1:20, ]
  # Fix the factor levels so the bars stay in order of frequency
  ngrams_matrix$word <- factor(ngrams_matrix$word, as.character(ngrams_matrix$word))
  # Bar plot of the frequencies
  ggplot(ngrams_matrix, aes(x=word, y=freq)) +
    geom_bar(stat="identity") +
    theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
    xlab("n-grams") +
    ylab("Frequency")
}
Plot of frequency distribution of 1-gram
n_grams_plot(n=1, data=sub_blogs_Corpus)
Plot of frequency distribution of 2-gram
n_grams_plot(n=2, data=sub_blogs_Corpus)
Plot of frequency distribution of 3-gram
n_grams_plot(n=3, data=sub_blogs_Corpus)
Plot of frequency distribution of 4-gram
n_grams_plot(n=4, data=sub_blogs_Corpus)
The next step in developing the prediction algorithm is to use the frequencies of 4-grams, 3-grams and 2-grams to estimate the most likely word to follow the entered text. The algorithm will then be used to build the application with Shiny.
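One possible shape for that algorithm is a simple back-off lookup: try the longest n-gram that matches the end of the input, and fall back to shorter ones when there is no match. The sketch below is only an assumption about how this could work, not the final implementation. The function `predict_next`, the layout of `tables` (named count vectors keyed by n-gram order) and all counts are made up for illustration.

```r
# Toy n-gram frequency tables (made-up counts, not results from the corpus)
tables <- list(
  "4" = c("thanks for the follow" = 3, "thanks for the support" = 1),
  "3" = c("one of the" = 7, "for the first" = 4, "for the follow" = 2),
  "2" = c("of the" = 20, "the first" = 6, "the same" = 5)
)

# Hypothetical helper: most frequent continuation, backing off from 4-grams to 2-grams
predict_next <- function(input, tables, max_n = 4) {
  words <- strsplit(tolower(trimws(input)), " +")[[1]]
  for (n in max_n:2) {
    if (length(words) < n - 1) next          # not enough context for this order
    counts <- tables[[as.character(n)]]
    if (is.null(counts)) next
    prefix <- paste(tail(words, n - 1), collapse = " ")
    hits <- counts[startsWith(names(counts), paste0(prefix, " "))]
    if (length(hits) > 0) {
      best <- names(hits)[which.max(hits)]
      return(sub(".* ", "", best))           # last word of the best n-gram
    }
  }
  NA_character_                              # no n-gram matched the input
}

predict_next("Thanks for the", tables)
predict_next("waiting for the", tables)
```

With these toy tables, the first call finds a matching 4-gram, while the second has no 4-gram match and falls back to the 3-grams. A real version would also need a default for unseen input and probably some weighting between the n-gram orders.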