The goal of this report is to summarize the data, perform exploratory data analysis on the en_US text files obtained in Week 1, and lay the groundwork for a prediction algorithm. The three text files downloaded are:
1. en_US.twitter.txt
2. en_US.news.txt
3. en_US.blogs.txt
Installing the required packages
install.packages("tm")
install.packages("LaF")
install.packages("ngram")
install.packages("corpus")
install.packages("wordcloud2")
install.packages("gridExtra")
install.packages("ggplot2")
Loading the installed packages
library(tm)
library(LaF)
library(ngram)
library(corpus)
library(wordcloud2)
library(gridExtra)
library(ggplot2)
Downloading data and extracting zip
if (!file.exists("../C0/Coursera-SwiftKey.zip")){
  download.file(url = "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip",
                destfile = "../C0/Coursera-SwiftKey.zip")
  unzip("../C0/Coursera-SwiftKey.zip", exdir = "../C0")
}
Importing the data and then looking at a summary of the text documents
twitter = readLines("../C0/final/en_US/en_US.twitter.txt",encoding='UTF-8')
news = readLines("../C0/final/en_US/en_US.news.txt",encoding='UTF-8')
blog = readLines("../C0/final/en_US/en_US.blogs.txt",encoding='UTF-8')
See summary of documents
summary <- data.frame('File' = c("Twitter","Blogs","News"),
                      "File Size" = sapply(list(twitter, blog, news), function(x){format(object.size(x),"MB")}),
                      'NEntries' = sapply(list(twitter, blog, news), function(x){length(x)}),
                      'TotalCharacters' = sapply(list(twitter, blog, news), function(x){sum(nchar(x))}),
                      "MeanCharperEntry" = sapply(list(twitter, blog, news), function(x){sum(nchar(x))/length(x)}),
                      "TotalWords" = sapply(list(twitter, blog, news), function(x){sum(wordcount(x, sep = " ", count.function = sum))}),
                      "MeanWordsperEntry" = sapply(list(twitter, blog, news), function(x){sum(wordcount(x, sep = " ", count.function = sum))/length(x)})
                      )
summary
## File File.Size NEntries TotalCharacters MeanCharperEntry TotalWords
## 1 Twitter 319 Mb 2360148 162096031 68.68045 30373543
## 2 Blogs 255.4 Mb 899288 206824505 229.98695 37334131
## 3 News 19.8 Mb 77259 15639408 202.42830 2643969
## MeanWordsperEntry
## 1 12.86934
## 2 41.51521
## 3 34.22215
As suggested in Week 1, let's take a 5% sample of each document, which keeps CPU and memory demands (and run time) manageable. Let's also set the seed for reproducibility. Finally, we merge the samples; a quick check of the merged sample's size follows the sampling code below.
set.seed(2642020)
sample <- c(sample(twitter, length(twitter) * 0.05),
            sample(blog, length(blog) * 0.05),
            sample(news, length(news) * 0.05))
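To check the size of the merged 5% sample, a minimal look reusing wordcount() from the ngram package loaded above (mirroring the per-file summary computed earlier):
# Size of the merged 5% sample: number of entries and an approximate word count
length(sample)
sum(wordcount(sample, sep = " ", count.function = sum))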
With the help of the tm text-mining package, we clean the data: converting the sample to a plain-text corpus, lowercasing, stripping whitespace, and removing numbers, punctuation, stop words, URLs and profanity (a sketch of the profanity step is shown after the cleaning code).
sample <- iconv(sample, "latin1", "ASCII", sub="")
# I first need to format the data to the type of object the tm package can handle; a corpus.
sample <- Corpus(VectorSource(sample), readerControl = list(reader=readPlain, language="en_US"))
sample <- tm_map(sample, content_transformer(tolower))
sample <- tm_map(sample, removePunctuation)
sample <- tm_map(sample, removeNumbers)
sample <- tm_map(sample, stripWhitespace)
sample <- tm_map(sample, removeWords,stopwords("english"))
removeURL <- function(x) gsub("http[[:alnum:]]*","",x)
removeSpace <- function(x) gsub("\\s+"," ",x)
removeapo <- function(x) gsub("'","",x)
sample <- tm_map(sample,content_transformer(removeURL))
sample <- tm_map(sample,content_transformer(removeSpace))
sample <- tm_map(sample,content_transformer(removeapo))
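The cleaning description above mentions profanity removal, which is not shown in the chunk; a minimal sketch is given below. The file name profanity.txt is an assumption: any banned-word list with one term per line would work.
# Profanity filtering (sketch): "profanity.txt" is a placeholder word list,
# one term per line; substitute any banned-word list of your choice.
if (file.exists("profanity.txt")) {
  profanity <- readLines("profanity.txt", encoding = "UTF-8")
  sample <- tm_map(sample, removeWords, profanity)
}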
Some people use the RWeka package for n-gram tokenization; here I use term_stats() from the corpus package to build unigram, bigram and trigram frequency tables.
sampleunigram <- term_stats(sample, ngrams=1)
samplebigram <- term_stats(sample, ngrams=2)
sampletrigram <- term_stats(sample, ngrams=3)
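term_stats() returns a frequency table with the term in the first column and its count in the second; a quick peek confirms the structure before plotting:
# Sanity check of the n-gram tables (first column: term, second: count)
head(sampleunigram)
head(samplebigram)
head(sampletrigram)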
Next, a small helper function to plot the most frequent terms of each n-gram table.
plot <- function(data, title, q){
  g1 <- ggplot(data = data[1:q, ], aes(x = reorder(data[1:q, 1], -data[1:q, 2]), y = data[1:q, 2])) +
    geom_bar(stat = "identity", width = 0.75, fill = "blue", colour = "black") +
    labs(x = "Terms", y = "Count", title = paste(title, "- Count", sep = " ")) +
    theme(axis.text.x = element_text(angle = 90, hjust = 1), plot.title = element_text(hjust = 0.5))
  grid.arrange(arrangeGrob(g1))
}
For single-word frequencies, I start with a word cloud because it is quite pleasing to the eye. Here I use the wordcloud2 package, which gave me the nicest results. Let us look at the 100 most frequent words.
wordcloud2(sampleunigram[1:100,], size = 1.0, shape='circle')
Plots
plot(sampleunigram,"Top 25 Unigrams",25)
plot(samplebigram,"Top 25 Bigrams",25)
plot(sampletrigram,"Top 25 Trigrams",25)
Looking at the bar plots, overall frequencies are highest for unigrams and drop off for bigrams and trigrams; counts decrease as the order of the n-gram increases.
Because stop words were removed, the bigrams and trigrams may not reflect the most natural phrases.
In future work we will build the prediction model, a data product (a Shiny application), and a slide deck, together with a summary report; a rough sketch of the prediction idea is included below.
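As a preview of the prediction step, the sketch below shows how a simple backoff lookup over the n-gram tables built above could work. The function predict_next() and its matching logic are illustrative assumptions only, not the final model.
# Illustrative backoff predictor (assumption, not the final model):
# look for trigrams starting with the last two words of the input,
# fall back to bigrams on the last word, then to the top unigram.
predict_next <- function(phrase, uni = sampleunigram, bi = samplebigram, tri = sampletrigram) {
  words <- unlist(strsplit(tolower(phrase), "\\s+"))
  n <- length(words)
  if (n >= 2) {
    hits <- tri[grepl(paste0("^", words[n - 1], " ", words[n], " "), tri$term), ]
    if (nrow(hits) > 0) return(sub(".* ", "", as.character(hits$term[1])))
  }
  if (n >= 1) {
    hits <- bi[grepl(paste0("^", words[n], " "), bi$term), ]
    if (nrow(hits) > 0) return(sub(".* ", "", as.character(hits$term[1])))
  }
  as.character(uni$term[1])
}
# Example call (output depends on the sampled data)
predict_next("happy mothers")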