The goal of this report is to summarize the data, perform exploratory data analysis on the en_US text files obtained in Week 1, and lay the groundwork for a prediction algorithm. The three text files downloaded are:
1. en_US.twitter.txt
2. en_US.news.txt
3. en_US.blogs.txt
Installing the required packages
install.packages("tm")
install.packages("LaF")
install.packages("ngram")
install.packages("corpus")
install.packages("wordcloud2")
install.packages("gridExtra")
install.packages("ggplot2")
Loading the installed packages
library(tm)
library(LaF)
library(ngram)
library(corpus)
library(wordcloud2)
library(gridExtra)
library(ggplot2)
Downloading data and extracting zip
if (!file.exists("../C0/Coursera-SwiftKey.zip")){
  download.file(url = "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip",
                destfile = "../C0/Coursera-SwiftKey.zip")
  unzip("../C0/Coursera-SwiftKey.zip", exdir = "../C0")
}
Importing the data and then looking at a summary of the text documents
twitter = readLines("../C0/final/en_US/en_US.twitter.txt",encoding='UTF-8')
news = readLines("../C0/final/en_US/en_US.news.txt",encoding='UTF-8')
blog = readLines("../C0/final/en_US/en_US.blogs.txt",encoding='UTF-8')
See summary of documents
summary <- data.frame('File' = c("Twitter","Blogs","News"),
                      "File Size" = sapply(list(twitter, blog, news), function(x){format(object.size(x),"MB")}),
                      'NEntries' = sapply(list(twitter, blog, news), function(x){length(x)}),
                      'TotalCharacters' = sapply(list(twitter, blog, news), function(x){sum(nchar(x))}),
                      "MeanCharperEntry" = sapply(list(twitter, blog, news), function(x){sum(nchar(x))/length(x)}),
                      "TotalWords" = sapply(list(twitter, blog, news), function(x){sum(wordcount(x, sep = " ", count.function = sum))}),
                      "MeanWordsperEntry" = sapply(list(twitter, blog, news), function(x){sum(wordcount(x, sep = " ", count.function = sum))/length(x)})
                      )
summary
## File File.Size NEntries TotalCharacters MeanCharperEntry TotalWords
## 1 Twitter 319 Mb 2360148 162096031 68.68045 30373543
## 2 Blogs 255.4 Mb 899288 206824505 229.98695 37334131
## 3 News 19.8 Mb 77259 15639408 202.42830 2643969
## MeanWordsperEntry
## 1 12.86934
## 2 41.51521
## 3 34.22215
As suggested in Week 1, let's take a 5% sample of each document, which keeps CPU and memory demands (and run time) manageable. Let's also set the seed for reproducibility. Finally, we merge the samples; a quick check of the merged sample's size follows the sampling code below.
set.seed(2642020)
sample <- c(sample(twitter, length(twitter) * 0.05),
            sample(blog, length(blog) * 0.05),
            sample(news, length(news) * 0.05))
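To check the size of the merged 5% sample, a minimal look reusing wordcount() from the ngram package loaded above (mirroring the per-file summary computed earlier):
# Size of the merged 5% sample: number of entries and an approximate word count
length(sample)
sum(wordcount(sample, sep = " ", count.function = sum))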
With the help of the tm text-mining package, we clean the data: converting the sample to a plain-text corpus, lowercasing, stripping whitespace, and removing numbers, punctuation, stop words, URLs and profanity (a sketch of the profanity step is shown after the cleaning code).
sample <- iconv(sample, "latin1", "ASCII", sub="")
# I first need to format the data to the type of object the tm package can handle; a corpus.
sample <- Corpus(VectorSource(sample), readerControl = list(reader=readPlain, language="en_US"))
sample <- tm_map(sample, content_transformer(tolower))
sample <- tm_map(sample, removePunctuation)
sample <- tm_map(sample, removeNumbers)
sample <- tm_map(sample, stripWhitespace)
sample <- tm_map(sample, removeWords,stopwords("english"))
removeURL <- function(x) gsub("http[[:alnum:]]*","",x)
removeSpace <- function(x) gsub("\\s+"," ",x)
removeapo <- function(x) gsub("'","",x)
sample <- tm_map(sample,content_transformer(removeURL))
sample <- tm_map(sample,content_transformer(removeSpace))
sample <- tm_map(sample,content_transformer(removeapo))
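The cleaning description above mentions profanity removal, which is not shown in the chunk; a minimal sketch is given below. The file name profanity.txt is an assumption: any banned-word list with one term per line would work.
# Profanity filtering (sketch): "profanity.txt" is a placeholder word list,
# one term per line; substitute any banned-word list of your choice.
if (file.exists("profanity.txt")) {
  profanity <- readLines("profanity.txt", encoding = "UTF-8")
  sample <- tm_map(sample, removeWords, profanity)
}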
Some people use the RWeka package for n-gram tokenization; here I use term_stats() from the corpus package to build unigram, bigram and trigram frequency tables.
sampleunigram <- term_stats(sample, ngrams=1)
samplebigram <- term_stats(sample, ngrams=2)
sampletrigram <- term_stats(sample, ngrams=3)
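term_stats() returns a frequency table with the term in the first column and its count in the second; a quick peek confirms the structure before plotting:
# Sanity check of the n-gram tables (first column: term, second: count)
head(sampleunigram)
head(samplebigram)
head(sampletrigram)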
Next, a small helper function to plot the most frequent terms of each n-gram table.
plot <- function(data, title, q){
  g1 <- ggplot(data = data[1:q, ], aes(x = reorder(data[1:q, 1], -data[1:q, 2]), y = data[1:q, 2])) +
    geom_bar(stat = "identity", width = 0.75, fill = "blue", colour = "black") +
    labs(x = "Terms", y = "Count", title = paste(title, "- Count", sep = " ")) +
    theme(axis.text.x = element_text(angle = 90, hjust = 1), plot.title = element_text(hjust = 0.5))
  grid.arrange(arrangeGrob(g1))
}
For single-word frequencies, I start with a word cloud because it is quite pleasing to the eye. Here I use the wordcloud2 package, which gave me the nicest results. Let us look at the 100 most frequent words.
wordcloud2(sampleunigram[1:100,], size = 1.0, shape='circle')
Plots
plot(sampleunigram,"Top 25 Unigrams",25)
plot(samplebigram,"Top 25 Bigrams",25)
plot(sampletrigram,"Top 25 Trigrams",25)
Looking at the bar plots, overall frequencies are highest for unigrams and drop off for bigrams and trigrams; counts decrease as the order of the n-gram increases.
Because stop words were removed, the bigrams and trigrams may not reflect the most natural phrases.
In future work we will build the prediction model, a data product (a Shiny application), and a slide deck, together with a summary report; a rough sketch of the prediction idea is included below.
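As a preview of the prediction step, the sketch below shows how a simple backoff lookup over the n-gram tables built above could work. The function predict_next() and its matching logic are illustrative assumptions only, not the final model.
# Illustrative backoff predictor (assumption, not the final model):
# look for trigrams starting with the last two words of the input,
# fall back to bigrams on the last word, then to the top unigram.
predict_next <- function(phrase, uni = sampleunigram, bi = samplebigram, tri = sampletrigram) {
  words <- unlist(strsplit(tolower(phrase), "\\s+"))
  n <- length(words)
  if (n >= 2) {
    hits <- tri[grepl(paste0("^", words[n - 1], " ", words[n], " "), tri$term), ]
    if (nrow(hits) > 0) return(sub(".* ", "", as.character(hits$term[1])))
  }
  if (n >= 1) {
    hits <- bi[grepl(paste0("^", words[n], " "), bi$term), ]
    if (nrow(hits) > 0) return(sub(".* ", "", as.character(hits$term[1])))
  }
  as.character(uni$term[1])
}
# Example call (output depends on the sampled data)
predict_next("happy mothers")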