### The Context

This document is a concise summary of the data identified for the project and briefly describes my plans for creating the prediction algorithm and Shiny app, written for a non-data-scientist audience.

### Loading and Summarizing the Data

I have downloaded the data to my desktop and read it into R. After reading the data, I produce a summary table showing the number of words, the number of lines, and the file size of each source.

library(quanteda)
## Warning: package 'quanteda' was built under R version 3.5.2
## Package version: 1.5.0
## Parallel computing: 2 of 4 threads used.
## See https://quanteda.io for tutorials and examples.
## 
## Attaching package: 'quanteda'
## The following object is masked from 'package:utils':
## 
##     View
library(tm) 
## Loading required package: NLP
## 
## Attaching package: 'tm'
## The following objects are masked from 'package:quanteda':
## 
##     as.DocumentTermMatrix, stopwords
library(stringi)
## Warning: package 'stringi' was built under R version 3.5.2
library(knitr)
## Warning: package 'knitr' was built under R version 3.5.2
library(RWeka)
## Warning: package 'RWeka' was built under R version 3.5.2
library(ggplot2)
## 
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
## 
##     annotate
setwd("~/Desktop/Desktop - SKMacBook2018/Data Science Capstone/Data/final/en_US")

# Read the blogs and Twitter data into R
blogs <- readLines("en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
news <- readLines("en_US.news.txt", encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)

# Get file sizes
blogs.size <- file.info("en_US.blogs.txt")$size / 1024 ^ 2
news.size <- file.info("en_US.news.txt")$size / 1024 ^ 2
twitter.size <- file.info("en_US.twitter.txt")$size / 1024 ^ 2

# Get words in files
blogs.words <- stri_count_words(blogs)
news.words <- stri_count_words(news)
twitter.words <- stri_count_words(twitter)

# Summary of the data sets
data.frame(source = c("blogs", "news", "twitter"),
           file.size.MB = c(blogs.size, news.size, twitter.size),
           num.lines = c(length(blogs), length(news), length(twitter)),
           num.words = c(sum(blogs.words), sum(news.words), sum(twitter.words)))
##    source file.size.MB num.lines num.words
## 1   blogs     200.4242    899288  37546239
## 2    news     196.2775   1010242  34762395
## 3 twitter     159.3641   2360148  30093413

### Data Cleansing

The objective of this exercise is to sample the data, since the source files are quite large, and then to clean it. Simply put, we remove punctuation, numbers and stop words, and convert the text to lower case, as an illustrative example.

# Selecting a sample due to file size. 
set.seed(857)
data.sample <- c(sample(blogs, length(blogs) * 0.01),
                 sample(news, length(news) * 0.01),
                 sample(twitter, length(twitter) * 0.01))

#Corpus creation and data cleansing
corpus <- VCorpus(VectorSource(data.sample))
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
corpus <- tm_map(corpus, toSpace, "(f|ht)tp(s?)://(.*)[.][a-z]+")
corpus <- tm_map(corpus, toSpace, "@[^\\s]+")
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeWords, stopwords("en"))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, stripWhitespace)

### Exploring the Data

Now that the data is prepared (i.e. cleaned), the next step is to look at the most common word combinations, often referred to as n-grams. In this instance I am looking at single words (unigrams), pairs of words (bigrams) and three-word sequences (trigrams).

I will show the plots only and not echo the code used to produce them.
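For reference, the sketch below shows one way the frequency tables (uni1, big2, tri3) and the makePlot helper used in the plot calls could be built, using RWeka tokenizers and ggplot2. The object names match the calls below, but the tokenizer settings and sparsity thresholds are assumptions, since the actual code is not echoed in this report.

# A sketch of how the n-gram frequency tables and makePlot() could be built.
# The exact settings here are assumptions, not the echoed report code.
getFreq <- function(tdm) {
  freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
  data.frame(word = names(freq), freq = freq, row.names = NULL)
}

bigramTokenizer  <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
trigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))

uni1 <- getFreq(removeSparseTerms(TermDocumentMatrix(corpus), 0.999))
big2 <- getFreq(removeSparseTerms(TermDocumentMatrix(corpus,
          control = list(tokenize = bigramTokenizer)), 0.9999))
tri3 <- getFreq(removeSparseTerms(TermDocumentMatrix(corpus,
          control = list(tokenize = trigramTokenizer)), 0.9999))

makePlot <- function(data, title) {
  # Bar chart of the 30 most frequent n-grams, most frequent first
  ggplot(head(data, 30), aes(x = reorder(word, -freq), y = freq)) +
    geom_bar(stat = "identity", fill = "steelblue") +
    labs(title = title, x = "", y = "Frequency") +
    theme(axis.text.x = element_text(angle = 45, hjust = 1))
}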

The 30 most popular unigrams are shown below.

makePlot(uni1, "The 30 Most Popular Unigrams")

The 30 most popular bigrams are shown below.

makePlot(big2, "The 30 Most Popular Bigrams")

The 30 most popular trigrams are shown below.

makePlot(tri3, "The 30 Most Popular Trigrams")

### Next Steps

The ultimate goal is to use the n-grams above to predict the next word a user will type. There are a number of strategies that can be deployed. At the moment I am leaning towards an n-gram model that relies on frequencies, and possibly other features, to make the prediction.
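To illustrate the general idea (this is a sketch of the approach, not the final model), a simple frequency lookup with backoff could reuse the frequency tables built earlier: look for a matching trigram first, fall back to bigrams, and finally to the most frequent unigram. The helper name predictNextWord is hypothetical.

# Sketch of a frequency-based next-word lookup with backoff.
# Assumes uni1, big2, tri3 are frequency tables sorted by decreasing freq,
# as built in the sketch above. predictNextWord is a hypothetical helper.
predictNextWord <- function(phrase, unigrams = uni1, bigrams = big2, trigrams = tri3) {
  words <- unlist(strsplit(tolower(phrase), "\\s+"))
  n <- length(words)
  if (n == 0) return(as.character(unigrams$word[1]))
  # Try the trigram table first: match on the last two words typed
  if (n >= 2) {
    prefix <- paste(words[n - 1], words[n])
    hits <- trigrams[startsWith(as.character(trigrams$word), paste0(prefix, " ")), ]
    if (nrow(hits) > 0) return(substring(as.character(hits$word[1]), nchar(prefix) + 2))
  }
  # Back off to the bigram table: match on the last word typed
  prefix <- words[n]
  hits <- bigrams[startsWith(as.character(bigrams$word), paste0(prefix, " ")), ]
  if (nrow(hits) > 0) return(substring(as.character(hits$word[1]), nchar(prefix) + 2))
  # Final fallback: the single most frequent unigram
  as.character(unigrams$word[1])
}

predictNextWord("thanks for the")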

Once the model is developed, a Shiny app with a simple text box will be built so that a user can type a phrase and see the predicted next word.
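A minimal sketch of what that app could look like, assuming the hypothetical predictNextWord() helper above, is shown below; the layout and labels are placeholders.

library(shiny)

# Minimal sketch of the planned app: a text box for the user's phrase and a
# reactive output showing the predicted next word. predictNextWord() is the
# hypothetical prediction function sketched above, not the final model.
ui <- fluidPage(
  titlePanel("Next Word Prediction"),
  textInput("phrase", "Enter a phrase:", value = ""),
  h4("Predicted next word:"),
  textOutput("prediction")
)

server <- function(input, output) {
  output$prediction <- renderText({
    if (nchar(trimws(input$phrase)) == 0) return("")
    predictNextWord(input$phrase)
  })
}

shinyApp(ui = ui, server = server)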

The End