This document is a concise summary and explanation of the data identified, and briefly describes my plans for creating the prediction algorithm and Shiny app, written for a non-data-scientist audience.
### Loading and Summarizing the Data
I have downloaded the data to my desktop and read it into R. After reading the data, I produce a summary table showing the file size, number of lines, and number of words for each source.
library(quanteda)
## Warning: package 'quanteda' was built under R version 3.5.2
## Package version: 1.5.0
## Parallel computing: 2 of 4 threads used.
## See https://quanteda.io for tutorials and examples.
##
## Attaching package: 'quanteda'
## The following object is masked from 'package:utils':
##
## View
library(tm)
## Loading required package: NLP
##
## Attaching package: 'tm'
## The following objects are masked from 'package:quanteda':
##
## as.DocumentTermMatrix, stopwords
library(stringi)
## Warning: package 'stringi' was built under R version 3.5.2
library(knitr)
## Warning: package 'knitr' was built under R version 3.5.2
library(RWeka)
## Warning: package 'RWeka' was built under R version 3.5.2
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
##
## annotate
setwd("~/Desktop/Desktop - SKMacBook2018/Data Science Capstone/Data/final/en_US")
# Read the blogs, news, and Twitter data into R
blogs <- readLines("en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
news <- readLines("en_US.news.txt", encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
# Get file sizes in megabytes
blogs.size <- file.info("en_US.blogs.txt")$size / 1024 ^ 2
news.size <- file.info("en_US.news.txt")$size / 1024 ^ 2
twitter.size <- file.info("en_US.twitter.txt")$size / 1024 ^ 2
# Get words in files
blogs.words <- stri_count_words(blogs)
news.words <- stri_count_words(news)
twitter.words <- stri_count_words(twitter)
# Summary of the data sets
data.frame(source = c("blogs", "news", "twitter"),
           file.size.MB = c(blogs.size, news.size, twitter.size),
           num.lines = c(length(blogs), length(news), length(twitter)),
           num.words = c(sum(blogs.words), sum(news.words), sum(twitter.words)))
## source file.size.MB num.lines num.words
## 1 blogs 200.4242 899288 37546239
## 2 news 196.2775 1010242 34762395
## 3 twitter 159.3641 2360148 30093413
The objective of this exercise is to sample the data, since the source files are quite large, and then clean it. Simply put, we strip URLs and Twitter handles, convert the text to lower case, and remove stop words, punctuation, numbers, and extra whitespace.
# Selecting a 1% sample of each source due to file size
set.seed(857)
data.sample <- c(sample(blogs, length(blogs) * 0.01),
                 sample(news, length(news) * 0.01),
                 sample(twitter, length(twitter) * 0.01))
# Corpus creation and data cleansing
corpus <- VCorpus(VectorSource(data.sample))
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
corpus <- tm_map(corpus, toSpace, "(f|ht)tp(s?)://(.*)[.][a-z]+")
corpus <- tm_map(corpus, toSpace, "@[^\\s]+")
corpus <- tm_map(corpus, content_transformer(tolower))  # wrap tolower so documents keep their class
corpus <- tm_map(corpus, removeWords, stopwords("en"))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, PlainTextDocument)
Now that the data is prepared (i.e. cleaned), the next step is to look at the words and find the most common groupings of words, often referred to as n-grams. In this instance I am looking at single words (unigrams), word pairs (bigrams), and three-word sequences (trigrams).
I will show only the plots and not echo the code used to produce them; an illustrative sketch of the approach is given below.
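For reference, the n-gram frequency tables (uni1, big2, tri3) and the makePlot helper referenced below could be built roughly as follows, using the RWeka and tm packages loaded above. This is a sketch of one possible approach rather than the exact unechoed code; the sparsity thresholds in particular are assumptions.
# Sketch: term-frequency tables for 1/2/3-grams (assumed approach)
getFreq <- function(tdm) {
  freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
  data.frame(word = names(freq), freq = freq)
}
bigram  <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
trigram <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
uni1 <- getFreq(removeSparseTerms(TermDocumentMatrix(corpus), 0.99))
big2 <- getFreq(removeSparseTerms(TermDocumentMatrix(corpus,
        control = list(tokenize = bigram)), 0.999))
tri3 <- getFreq(removeSparseTerms(TermDocumentMatrix(corpus,
        control = list(tokenize = trigram)), 0.9999))
# Sketch: bar chart of the 30 most frequent n-grams
makePlot <- function(data, label) {
  ggplot(data[1:30, ], aes(reorder(word, -freq), freq)) +
    geom_bar(stat = "identity") +
    labs(x = label, y = "Frequency") +
    theme(axis.text.x = element_text(angle = 60, hjust = 1))
}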
The 30 most popular unigrams are shown below.
makePlot(uni1, "The 30 Most Popular Unigrams")
The 30 most popular bigrams are shown below.
makePlot(big2, "The 30 Most Popular Bigrams")
The 30 most popular trigrams are shown below.
makePlot(tri3, "The 30 Most Popular Trigrams")
The ultimate goal is to use the n-grams above to predict the next word. A number of strategies could be deployed; at the moment I am leaning towards an n-gram model that relies on frequencies, and possibly other features, to make the prediction.
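To make the idea concrete, here is a minimal sketch of frequency-based prediction with a simple back-off from trigrams to bigrams. It assumes the tri3 and big2 tables sketched above, and the helper name predictNext is hypothetical; this illustrates one candidate strategy, not the final model.
# Sketch: look up the last words of a phrase in the n-gram tables
predictNext <- function(phrase, tri = tri3, bi = big2) {
  words <- tail(unlist(strsplit(tolower(phrase), "\\s+")), 2)
  # Try trigrams first: find entries starting with the last two words
  hits <- grep(paste0("^", paste(words, collapse = " "), " "),
               tri$word, value = TRUE)
  if (length(hits) == 0) {
    # Back off to bigrams: match on the last word only
    hits <- grep(paste0("^", tail(words, 1), " "), bi$word, value = TRUE)
  }
  if (length(hits) == 0) return(NA_character_)
  # Tables are sorted by frequency, so the first hit is the best guess
  tail(strsplit(hits[1], " ")[[1]], 1)
}
# Example usage: predictNext("thanks for the")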
Once the model is developed, a Shiny app with a simple text box will be built to predict the next word.
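As an illustration of the planned interface, a minimal Shiny sketch with a single text box might look like this, wired to the hypothetical predictNext() above (the final app will call the trained model):
library(shiny)
ui <- fluidPage(
  textInput("phrase", "Enter a phrase:"),
  textOutput("prediction")
)
server <- function(input, output) {
  output$prediction <- renderText({
    # Predict only once the user has typed something
    if (nchar(trimws(input$phrase)) == 0) "" else predictNext(input$phrase)
  })
}
shinyApp(ui, server)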
The End