#1. Introduction

The goal of this project is to use tables and plots to illustrate important summaries of the data set. The project follows these steps:

1. Download the data and load it in.
2. Create a basic report of summary statistics about the data sets.
3. Report any interesting findings amassed so far.
4. Discuss plans for creating a prediction algorithm and Shiny app.

#2. Loading Packages and Downloading the Data

2.1 Loading the needed packages

library(NLP)
library(tm)
library(RColorBrewer)
library(ggplot2)
## 
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
## 
##     annotate
library(wordcloud)
library(formattable)

2.2 Downloading the data and loading it in

The compressed data file was downloaded from this URL (https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip), unzipped and saved on the computer. The data files were then read into RStudio with the readLines() function.
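For reproducibility, a minimal sketch of that download-and-unzip step is shown below (it was not evaluated in this report). The internal final/en_US/ paths are an assumption about the archive layout; junkpaths = TRUE strips them so the text files land in the working directory, matching the readLines() calls that follow.

url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
# Download the archive once, then extract only the English files.
# The final/en_US/ paths inside the zip are assumed here.
if (!file.exists("Coursera-SwiftKey.zip")) {
    download.file(url, destfile = "Coursera-SwiftKey.zip", mode = "wb")
}
unzip("Coursera-SwiftKey.zip",
      files = c("final/en_US/en_US.blogs.txt",
                "final/en_US/en_US.news.txt",
                "final/en_US/en_US.twitter.txt"),
      junkpaths = TRUE)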

Blog <- readLines("en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE, warn = FALSE)
News <- readLines("en_US.news.txt", encoding = "UTF-8", skipNul = TRUE, warn = FALSE)
Twit <- readLines("en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE, warn = FALSE)

#3. Dataset Statistics

len_blogs <- length(Blog)
size_blogs <- object.size(Blog)
words_blogs<-length(words(Blog))

len_news <- length(News)
size_news <- object.size(News)
words_news<-length(words(News))

len_twit <- length(Twit)
size_twit <- object.size(Twit)
words_twit<-length(words(Twit))


file_name <- c("en_US.blogs.txt","en_US.news.txt","en_US.twitter.txt")
size <- c(format(size_blogs, units = "auto"), format(size_news, units = "auto"),format(size_twit, units = "auto"))
lines <- c(format(len_blogs, big.mark=","),format(len_news, big.mark=","),format(len_twit, big.mark=","))
numb_words <- c(format(words_blogs, big.mark=","),format(words_news, big.mark=","),format(words_twit, big.mark=","))
data_stats <- data.frame (file_name,size,lines,numb_words)
colnames(data_stats) <- c('File Name', 'File Size', 'Number of Lines', 'Number of words')
formattable(data_stats)
| File Name | File Size | Number of Lines | Number of words |
|-----------|-----------|-----------------|-----------------|
| en_US.blogs.txt | 255.4 Mb | 899,288 | 37,334,131 |
| en_US.news.txt | 19.8 Mb | 77,259 | 2,643,969 |
| en_US.twitter.txt | 319 Mb | 2,360,148 | 30,373,583 |

3.1 Sampling the files to help with data manipulation and analysis

As these files are large and require a lot of computing resources, the data was sampled: roughly 0.3% of the blog lines, 1% of the news lines and 0.1% of the Twitter lines, with the sampling fraction for each file chosen according to its size. The corpus is then generated from the combined samples.

set.seed(123)
# Draw a simple random sample of line indices from each file
blogsSample <- Blog[sample(length(Blog), round(length(Blog) * 0.003))]
newsSample <- News[sample(length(News), round(length(News) * 0.01))]
twitSample <- Twit[sample(length(Twit), round(length(Twit) * 0.001))]
rm(Blog, Twit, News)
# Combine the sampled lines (one document per line) and build the corpus
Data <- c(blogsSample, newsSample, twitSample)
mCorpus <- VCorpus(VectorSource(Data))

#4. Exploratory Data Analysis

4.1. Data Cleaning

To clean the corpus, it was first transformed to all lowercase; then punctuation, numbers, English stopwords and extra whitespace were removed, and the remaining words were stemmed.

mCorpus <- tm_map(mCorpus, content_transformer(tolower))       # convert text to lowercase
mCorpus <- tm_map(mCorpus, removePunctuation)                  # remove punctuation
mCorpus <- tm_map(mCorpus, removeNumbers)                      # remove numbers
mCorpus <- tm_map(mCorpus, removeWords, stopwords("english"))  # remove English stopwords
mCorpus <- tm_map(mCorpus, stemDocument)                       # stem the remaining words
mCorpus <- tm_map(mCorpus, stripWhitespace)                    # collapse extra whitespace

4.2. Summarizing Data

We created a document-term matrix and selected the most frequent words (the top 100 and the top 25). We visualized the word frequencies as a histogram and as a word cloud.

dtm <- DocumentTermMatrix(mCorpus)
frequency <- colSums(as.matrix(dtm))
frequency <- sort(frequency, decreasing = TRUE)
Fq_df <- data.frame(word = names(frequency), frequency = frequency, row.names = NULL)
top100 <- Fq_df[1:100, ]
top25 <- Fq_df[1:25, ]
formattable(Fq_df[1:10, ])
| word | frequency |
|------|-----------|
| will | 832 |
| said | 742 |
| one | 718 |
| get | 705 |
| time | 618 |
| year | 616 |
| just | 590 |
| can | 586 |
| like | 510 |
| day | 507 |

4.3 Histogram of the Top 25 Unigrams

ggplot(top25, aes(x=reorder(word, frequency),y=frequency)) +
    geom_bar(stat="identity", width=0.5, fill="blue") +
    labs(title="Top 25 Unigrams")+
    xlab("Unigrams") + ylab("Frequency") +
    theme(axis.text.x=element_text(angle=45, vjust=0.6))

4.4 Word Cloud

wordcloud(mCorpus, max.words = 200, random.order = FALSE, colors=brewer.pal(8,"Dark2"))
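
Since the term frequencies were already computed above, an alternative is to draw the cloud directly from the Fq_df frequency table rather than passing the whole corpus; a small sketch:

wordcloud(words = Fq_df$word, freq = Fq_df$frequency,
          max.words = 200, random.order = FALSE,
          colors = brewer.pal(8, "Dark2"))

Using the precomputed frequencies should give a similar cloud while avoiding having wordcloud() re-derive the term counts from the corpus.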

#5. Findings

The top five most frequent words are: will, said, one, get and time.

#6. Plans for Creating a Prediction Algorithm and Shiny App

The next step is to finish the exploratory analysis by looking at the most frequent two-word (bigram) and three-word (trigram) combinations. Then, based on that information, a prediction model will be created that suggests the next word to follow the word(s) entered by the user. This final model will be included in the Shiny app.
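
As a preview of that step, the sketch below shows one possible way to count bigrams from the existing corpus, reusing the NLP and tm packages loaded earlier. The BigramTokenizer helper is introduced here purely for illustration (trigrams would use n = 3) and is not necessarily the approach the final model will use.

# Tokenize each document into bigrams using NLP::ngrams() and NLP::words()
BigramTokenizer <- function(x) {
    unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), use.names = FALSE)
}
bigram_tdm <- TermDocumentMatrix(mCorpus, control = list(tokenize = BigramTokenizer))
bigram_freq <- sort(rowSums(as.matrix(bigram_tdm)), decreasing = TRUE)
head(bigram_freq, 10)   # most frequent two-word combinations in the sample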