The goal of this project is simply to demonstrate that you have become familiar with the data and that you are on track to create your prediction algorithm. Please submit a report on R Pubs (http://rpubs.com/) that explains your exploratory analysis and your goals for the eventual app and algorithm. This document should be concise, explain only the major features of the data you have identified, and briefly summarize your plans for creating the prediction algorithm and Shiny app in a way that would be understandable to a non-data-scientist manager. You should make use of tables and plots to illustrate important summaries of the data set. The motivation for this project is to:
1. Demonstrate that the data have been downloaded and successfully loaded.
2. Create a basic report of summary statistics about the data sets.
3. Report any interesting findings amassed so far.
4. Get feedback on the plans for creating a prediction algorithm and Shiny app.
The first step of the project is to load all the libraries needed to complete the tasks outlined in the introduction.
library(stringi)    # fast string handling and word counting
library(NLP)        # basic natural language processing infrastructure
library(openNLP)    # Apache OpenNLP tools
library(tm)         # text mining framework (corpora, term-document matrices)
library(rJava)      # Java bridge required by RWeka
library(RWeka)      # Weka tools, including the n-gram tokenizer
library(RWekajars)  # Weka jar files used by RWeka
library(SnowballC)  # Snowball stemmer for word stemming
library(qdap)       # additional text cleaning utilities
library(ggplot2)    # plotting
The data used in this project can be obtained from the URL below.
url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
destFile <- "Coursera-SwiftKey.zip"
output <- "C://Users//bebxtaxasta//Desktop//Coursera//Project"
# Download and unzip the data only if it is not already present
if (!file.exists(destFile)) {
  download.file(url, destFile, mode = "wb")  # binary mode so the zip is not corrupted on Windows
  unzip(destFile, exdir = output)
}
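As a quick sanity check, the extracted English files can be listed (assuming the archive unzipped into the final/en_US sub-folder used throughout the rest of this report):
# List the extracted English data files
list.files(file.path(output, "final", "en_US"))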
In this part I give a basic overview of the data file statistics.
The three files used in this project are:
- en_US.blogs.txt
- en_US.news.txt
- en_US.twitter.txt
The following is a summary of each file's size in megabytes.
file.info(".//Coursera//Project//final//en_US//en_US.blogs.txt")$size / 1024^2 #size of Blogs file
## [1] 200.4242
file.info(".//Coursera//Project//final//en_US//en_US.news.txt")$size / 1024^2 #size of News file
## [1] 196.2775
file.info(".//Coursera//Project//final//en_US//en_US.twitter.txt")$size / 1024^2 #size of Twitter file
## [1] 159.3641
Here is the number of lines in each file.
blogs <- readLines(".//Coursera//Project//final//en_US//en_US.blogs.txt", encoding = "UTF-8", skipNul=TRUE)
length(blogs) #Number of rows in Blogs file
## [1] 899288
news <- readLines(".//Coursera//Project//final//en_US//en_US.news.txt", encoding = "UTF-8", skipNul=TRUE)
length(news) #Number of rows in News file
## [1] 77259
twitter <- readLines(".//Coursera//Project//final//en_US//en_US.twitter.txt", encoding = "UTF-8", skipNul=TRUE)
length(twitter) #Number of rows in Twitter file
## [1] 2360148
Below is a summary of the number of words in each file.
sum(stri_count_words(blogs)) # Number of words in Blogs file
## [1] 37546246
sum(stri_count_words(news)) # Number of words in News file
## [1] 2674536
sum(stri_count_words(twitter)) # Number of words in Twitter file
## [1] 30093410
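For readability, the same statistics can also be gathered into a single overview table. This is a small sketch that reuses the objects created above; fileSummary is just an illustrative name.
# Combine file size (MB), line count and word count into one overview table
path  <- ".//Coursera//Project//final//en_US"
files <- c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt")
fileSummary <- data.frame(
  file    = files,
  size_MB = file.info(file.path(path, files))$size / 1024^2,
  lines   = c(length(blogs), length(news), length(twitter)),
  words   = c(sum(stri_count_words(blogs)),
              sum(stri_count_words(news)),
              sum(stri_count_words(twitter)))
)
fileSummary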
Due to the sheer size of the data files, we will work with a sample of 1,000 lines from each file, giving a total sample of 3,000 lines.
set.seed(1000)
sTwitter <- sample(twitter, size = 1000, replace = TRUE)
sBlogs <- sample(blogs, size = 1000, replace = TRUE)
sNews <- sample(news, size = 1000, replace = TRUE)
sampleTotal <- c(sTwitter, sBlogs, sNews)
length(sampleTotal)
## [1] 3000
writeLines(sampleTotal, "./sample.txt")
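Since only the 3,000-line sample is needed from this point on, the full files can be removed to keep memory use manageable (an optional housekeeping step):
# The full corpora are no longer needed once the sample is saved
rm(blogs, news, twitter)
gc()  # prompt R to release the freed memory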
In this step a corpus was built from the 3,000-line sample and the following cleaning transformations were applied:
- conversion to lower case
- removal of punctuation
- removal of numbers
- stripping of extra whitespace
- removal of English stop words
- word stemming
build_corpus <- function(x) {
  corp <- VCorpus(VectorSource(x))
  corp <- tm_map(corp, content_transformer(tolower))       # convert to lower case
  corp <- tm_map(corp, removePunctuation)                  # remove punctuation
  corp <- tm_map(corp, removeNumbers)                      # remove numbers
  corp <- tm_map(corp, stripWhitespace)                    # collapse extra whitespace
  corp <- tm_map(corp, removeWords, stopwords("english"))  # remove English stop words
  corp <- tm_map(corp, stemDocument)                       # stem words to their roots
  corp
}
corpus <- build_corpus(sampleTotal)
rm(sampleTotal)
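To verify that the cleaning worked as intended, the first processed document can be inspected (a simple spot check; the text shown depends on the random sample):
# Spot check: content of the first cleaned document
as.character(corpus[[1]])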
In this section I explore the data, compute word frequencies, and plot the most common unigrams, bigrams and trigrams.
options(mc.cores = 1)  # use a single core so the RWeka tokenizers work reliably with tm

# Build a frequency table (word, freq) from a term-document matrix
getFreq <- function(tdm) {
  freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
  return(data.frame(word = names(freq), freq = freq))
}

# Tokenizers for two- and three-word sequences
bigram  <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
trigram <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))

# Bar chart of the 30 most frequent terms in a frequency table
makePlot <- function(data, label) {
  ggplot(data[1:30, ], aes(reorder(word, -freq), freq)) +
    labs(x = label, y = "Frequency") +
    theme(axis.text.x = element_text(angle = 60, size = 12, hjust = 1)) +
    geom_bar(stat = "identity", fill = I("red"))
}
freq1 <- getFreq(removeSparseTerms(TermDocumentMatrix(corpus), 0.9999))
freq2 <- getFreq(removeSparseTerms(TermDocumentMatrix(corpus, control = list(tokenize = bigram)), 0.9999))
freq3 <- getFreq(removeSparseTerms(TermDocumentMatrix(corpus, control = list(tokenize = trigram)), 0.9999))
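Before plotting, a quick look at the top of the unigram table shows what these frequency data frames contain (the exact words and counts will vary with the sampled lines):
head(freq1, 10)  # ten most frequent unigrams in the sample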
makePlot(freq1, "Top 30 Most Common Unigrams")
makePlot(freq2, "Top 30 Most Common Bigrams")
makePlot(freq3, "Top 30 Most Common Trigrams")
This concludes the preliminary analysis of the data. Because the files are so large, analysing them in full requires a substantial amount of memory.
Here are the next steps for the final project of the course:
1. Build a word-prediction algorithm based on n-gram frequencies like the ones explored above (a rough sketch of the lookup idea is shown below).
2. Develop a Shiny app around the algorithm that takes a phrase as input and suggests the most likely next word.
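To illustrate the prediction idea, the trigram frequency table built above could be used to look up the most frequent continuation of a two-word input. This is only a minimal sketch under the current preprocessing (stop words removed, words stemmed); predictNext is a hypothetical helper name, not the final algorithm.
# Minimal sketch of a next-word lookup using the trigram frequency table.
# predictNext is a hypothetical helper, not the final prediction algorithm.
predictNext <- function(phrase, trigrams = freq3) {
  phrase <- tolower(phrase)
  # keep trigrams whose first two words match the input phrase
  hits <- trigrams[grepl(paste0("^", phrase, " "), trigrams$word), ]
  if (nrow(hits) == 0) return(NA_character_)
  # the table is already sorted by frequency, so the first match is the best guess
  strsplit(as.character(hits$word[1]), " ")[[1]][3]
}
predictNext("new york")  # example call; the result depends on the processed sample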