#1. Introduction
The goal of this project is to use tables and plots to illustrate important summaries of the data set. The project follows these steps:
1. Download the data and load it into R.
2. Create a basic report of summary statistics about the data sets.
3. Report any interesting findings amassed so far.
4. Discuss plans for creating a prediction algorithm and Shiny app.
#2. Loading Packages
library(NLP)
library(tm)
library(RColorBrewer)
library(ggplot2)
library(wordcloud)
library(formattable)
The compressed data file was downloaded from https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip, unzipped, and saved on the computer. The data files were then read into RStudio with readLines().
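For completeness, a minimal sketch of the download and unzip step is shown below, assuming the archive is saved as Coursera-SwiftKey.zip in the working directory (the destination file name is illustrative; only the URL comes from the report).
url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
if (!file.exists("Coursera-SwiftKey.zip")) {
  download.file(url, destfile = "Coursera-SwiftKey.zip", mode = "wb")
}
unzip("Coursera-SwiftKey.zip")
# The archive extracts to final/<locale>/; the three en_US files are assumed to have
# been copied into the working directory before the readLines() calls below.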
Blog <- readLines("en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE, warn = FALSE)
News <- readLines("en_US.news.txt", encoding = "UTF-8", skipNul = TRUE, warn = FALSE)
Twit <- readLines("en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE, warn = FALSE)
#3. Dataset Statistics
len_blogs <- length(Blog)
size_blogs <- object.size(Blog)
words_blogs <- length(words(Blog))
len_news <- length(News)
size_news <- object.size(News)
words_news <- length(words(News))
len_twit <- length(Twit)
size_twit <- object.size(Twit)
words_twit <- length(words(Twit))
file_name <- c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt")
size <- c(format(size_blogs, units = "auto"), format(size_news, units = "auto"), format(size_twit, units = "auto"))
lines <- c(format(len_blogs, big.mark = ","), format(len_news, big.mark = ","), format(len_twit, big.mark = ","))
numb_words <- c(format(words_blogs, big.mark = ","), format(words_news, big.mark = ","), format(words_twit, big.mark = ","))
data_stats <- data.frame(file_name, size, lines, numb_words)
colnames(data_stats) <- c('File Name', 'File Size', 'Number of Lines', 'Number of Words')
formattable(data_stats)
| File Name | File Size | Number of Lines | Number of Words |
|---|---|---|---|
| en_US.blogs.txt | 255.4 Mb | 899,288 | 37,334,131 |
| en_US.news.txt | 19.8 Mb | 77,259 | 2,643,969 |
| en_US.twitter.txt | 319 Mb | 2,360,148 | 30,373,583 |
Because these files are large and require a lot of computing resources, the data was sampled: roughly 0.3% of the blog lines, 1% of the news lines, and 0.1% of the Twitter lines, with the fraction chosen according to each file's size. The corpus is then generated from these samples.
set.seed(123)
# Randomly sample a fraction of the lines in each file
blogsSample <- sample(Blog, round(length(Blog) * 0.003))
newsSample <- sample(News, round(length(News) * 0.01))
twitSample <- sample(Twit, round(length(Twit) * 0.001))
rm(Blog, Twit, News)
Data <- c(blogsSample, newsSample, twitSample)  # combine the three samples into one character vector
mCorpus <- VCorpus(VectorSource(Data))
To clean the corpus, it was first converted to lowercase; then punctuation, numbers, English stopwords, and extra whitespace were removed, and the remaining words were stemmed.
mCorpus <- tm_map(mCorpus, content_transformer(tolower))  # lowercase, keeping the corpus structure
mCorpus <- tm_map(mCorpus, removePunctuation)
mCorpus <- tm_map(mCorpus, removeNumbers)
mCorpus <- tm_map(mCorpus, removeWords, stopwords("english"))
mCorpus <- tm_map(mCorpus, stemDocument)
mCorpus <- tm_map(mCorpus, stripWhitespace)
We created a document-term matrix and selected the most frequent words (top 100 and top 25). We visualized the word frequencies as a histogram and a word cloud.
dtm <- DocumentTermMatrix(mCorpus)
frequency <- colSums(as.matrix(dtm))
frequency <- sort(frequency, decreasing = TRUE)
Fq_df <- data.frame(word = names(frequency), frequency = frequency, row.names = NULL)
top100 <- Fq_df[1:100, ]
top25 <- Fq_df[1:25, ]
formattable(Fq_df[1:10,])
| word | frequency |
|---|---|
| will | 832 |
| said | 742 |
| one | 718 |
| get | 705 |
| time | 618 |
| year | 616 |
| just | 590 |
| can | 586 |
| like | 510 |
| day | 507 |
# Histogram of the Top 25 Unigrams
ggplot(top25, aes(x=reorder(word, frequency),y=frequency)) +
geom_bar(stat="identity", width=0.5, fill="blue") +
labs(title="Top 25 Unigrams")+
xlab("Unigrams") + ylab("Frequency") +
theme(axis.text.x=element_text(angle=45, vjust=0.6))
# Word cloud
wordcloud(mCorpus, max.words = 200, random.order = FALSE, colors=brewer.pal(8,"Dark2"))
#4. Findings
The top five most frequent words are: will, said, one, get, and time.
#5. Next Steps
The next step is to finish the exploratory analysis by looking at the most frequent two-word and three-word combinations (bigrams and trigrams). Based on that information, a prediction model will be built that suggests the next word to follow the word(s) entered by the user. This final model will be included in the Shiny app.
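As a preview of that step, here is a minimal sketch of how bigram and trigram frequencies could be computed from the same cleaned corpus, using the n-gram tokenizer pattern from the NLP and tm packages (the tokenizer and object names are illustrative assumptions, not part of the report):
# Tokenizers that split each document into two- and three-word sequences
BigramTokenizer <- function(x) unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), use.names = FALSE)
TrigramTokenizer <- function(x) unlist(lapply(ngrams(words(x), 3), paste, collapse = " "), use.names = FALSE)
# Document-term matrices built from bigrams and trigrams
bigram_dtm <- DocumentTermMatrix(mCorpus, control = list(tokenize = BigramTokenizer))
trigram_dtm <- DocumentTermMatrix(mCorpus, control = list(tokenize = TrigramTokenizer))
# Most frequent two- and three-word combinations
bigram_freq <- sort(colSums(as.matrix(bigram_dtm)), decreasing = TRUE)
trigram_freq <- sort(colSums(as.matrix(trigram_dtm)), decreasing = TRUE)
head(bigram_freq, 10)
head(trigram_freq, 10)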