#1. Introduction
The goal of this project is to use tables and plots to illustrate important summaries of the data set. The project follows these steps:
1. Download the data and load it into R.
2. Create a basic report of summary statistics about the data sets.
3. Report any interesting findings amassed so far.
4. Discuss plans for creating a prediction algorithm and Shiny app.
#2. Loading Packages
library(NLP)
library(tm)
library(RColorBrewer)
library(ggplot2)
library(wordcloud)
library(formattable)
The compressed data file was downloaded from https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip, unzipped, and saved on the computer. The data files were then read into RStudio with readLines().
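For completeness, a minimal sketch of the download and unzip step is shown below, assuming the archive is saved as Coursera-SwiftKey.zip in the working directory (the destination file name is illustrative; only the URL comes from the report).
url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
if (!file.exists("Coursera-SwiftKey.zip")) {
  download.file(url, destfile = "Coursera-SwiftKey.zip", mode = "wb")
}
unzip("Coursera-SwiftKey.zip")
# The archive extracts to final/<locale>/; the three en_US files are assumed to have
# been copied into the working directory before the readLines() calls below.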
Blog <- readLines("en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE, warn = FALSE)
News <- readLines("en_US.news.txt", encoding = "UTF-8", skipNul = TRUE, warn = FALSE)
Twit <- readLines("en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE, warn = FALSE)
#3. Dataset Statistics
len_blogs <- length(Blog)
size_blogs <- object.size(Blog)
words_blogs <- length(words(Blog))
len_news <- length(News)
size_news <- object.size(News)
words_news <- length(words(News))
len_twit <- length(Twit)
size_twit <- object.size(Twit)
words_twit <- length(words(Twit))
file_name <- c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt")
size <- c(format(size_blogs, units = "auto"), format(size_news, units = "auto"), format(size_twit, units = "auto"))
lines <- c(format(len_blogs, big.mark = ","), format(len_news, big.mark = ","), format(len_twit, big.mark = ","))
numb_words <- c(format(words_blogs, big.mark = ","), format(words_news, big.mark = ","), format(words_twit, big.mark = ","))
data_stats <- data.frame(file_name, size, lines, numb_words)
colnames(data_stats) <- c('File Name', 'File Size', 'Number of Lines', 'Number of Words')
formattable(data_stats)
| File Name | File Size | Number of Lines | Number of Words |
|---|---|---|---|
| en_US.blogs.txt | 255.4 Mb | 899,288 | 37,334,131 |
| en_US.news.txt | 19.8 Mb | 77,259 | 2,643,969 |
| en_US.twitter.txt | 319 Mb | 2,360,148 | 30,373,583 |
Because these files are large and require a lot of computing resources, the data was sampled: roughly 0.3% of the blog lines, 1% of the news lines, and 0.1% of the Twitter lines, with the fraction chosen according to each file's size. The corpus is then generated from these samples.
set.seed(123)
# Randomly sample a fraction of the lines in each file
blogsSample <- sample(Blog, round(length(Blog) * 0.003))
newsSample <- sample(News, round(length(News) * 0.01))
twitSample <- sample(Twit, round(length(Twit) * 0.001))
rm(Blog, Twit, News)
Data <- c(blogsSample, newsSample, twitSample)  # combine the three samples into one character vector
mCorpus <- VCorpus(VectorSource(Data))
To clean the corpus, it was first converted to lowercase; then punctuation, numbers, English stopwords, and extra whitespace were removed, and the remaining words were stemmed.
mCorpus <- tm_map(mCorpus, content_transformer(tolower))  # lowercase, keeping the corpus structure
mCorpus <- tm_map(mCorpus, removePunctuation)
mCorpus <- tm_map(mCorpus, removeNumbers)
mCorpus <- tm_map(mCorpus, removeWords, stopwords("english"))
mCorpus <- tm_map(mCorpus, stemDocument)
mCorpus <- tm_map(mCorpus, stripWhitespace)
We created a document-term matrix and selected the most frequent words (top 100 and top 25). We visualized the word frequencies as a histogram and a word cloud.
dtm <- DocumentTermMatrix(mCorpus)
frequency <- colSums(as.matrix(dtm))
frequency <- sort(frequency, decreasing = TRUE)
Fq_df <- data.frame(word = names(frequency), frequency = frequency, row.names = NULL)
top100 <- Fq_df[1:100, ]
top25 <- Fq_df[1:25, ]
formattable(Fq_df[1:10,])
| word | frequency |
|---|---|
| will | 832 |
| said | 742 |
| one | 718 |
| get | 705 |
| time | 618 |
| year | 616 |
| just | 590 |
| can | 586 |
| like | 510 |
| day | 507 |
# Histogram of the Top 25 Unigrams
ggplot(top25, aes(x=reorder(word, frequency),y=frequency)) +
geom_bar(stat="identity", width=0.5, fill="blue") +
labs(title="Top 25 Unigrams")+
xlab("Unigrams") + ylab("Frequency") +
theme(axis.text.x=element_text(angle=45, vjust=0.6))
# Word cloud
wordcloud(mCorpus, max.words = 200, random.order = FALSE, colors=brewer.pal(8,"Dark2"))
#4. Findings
The top five most frequent words are: will, said, one, get, and time.
#5. Next Steps
The next step is to finish the exploratory analysis by looking at the most frequent two-word and three-word combinations (bigrams and trigrams). Based on that information, a prediction model will be built that suggests the next word to follow the word(s) entered by the user. This final model will be included in the Shiny app.
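As a preview of that step, here is a minimal sketch of how bigram and trigram frequencies could be computed from the same cleaned corpus, using the n-gram tokenizer pattern from the NLP and tm packages (the tokenizer and object names are illustrative assumptions, not part of the report):
# Tokenizers that split each document into two- and three-word sequences
BigramTokenizer <- function(x) unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), use.names = FALSE)
TrigramTokenizer <- function(x) unlist(lapply(ngrams(words(x), 3), paste, collapse = " "), use.names = FALSE)
# Document-term matrices built from bigrams and trigrams
bigram_dtm <- DocumentTermMatrix(mCorpus, control = list(tokenize = BigramTokenizer))
trigram_dtm <- DocumentTermMatrix(mCorpus, control = list(tokenize = TrigramTokenizer))
# Most frequent two- and three-word combinations
bigram_freq <- sort(colSums(as.matrix(bigram_dtm)), decreasing = TRUE)
trigram_freq <- sort(colSums(as.matrix(trigram_dtm)), decreasing = TRUE)
head(bigram_freq, 10)
head(trigram_freq, 10)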