Libraries used:
library(AppliedPredictiveModeling)
library(caret)
library(pgmm)
library(rpart)
library(tidyverse)
library(rpart.plot)
library(randomForest)
library(corrplot)
library(rattle)
library(tm)
library(downloader)
library(stringi)
library(wordcloud)
library(plotly)
Instructions
The goal of this project is to perform an exploratory analysis of the text data that will later be used to build a prediction algorithm. This document shows and explains only the major features of the data and gives a brief summary.
The main plan is to provide an understandable view of the data in order to create, in the next steps, a Shiny app that predicts the next word using natural language processing. This summary contains tables and plots to illustrate the data set.
Dataset
This dataset is fairly large. It comes from the Coursera Data Science Capstone data set and contains English, German, Russian and Finnish databases; here we focus on the English database.
Each database contains three files with a text corpus of documents to be used as training data:
- en_US.blogs.txt
- en_US.news.txt
- en_US.twitter.txt
# Chunk to download the file and unzip it
Url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
if (!file.exists('final')) {
  dir.create('final')
}
if (!file.exists("final/en_US")) {
  if (Sys.info()[['sysname']] == 'Windows') {
    download.file(Url, destfile = "Coursera-SwiftKey.zip") # for Windows
    unzip("Coursera-SwiftKey.zip")
  }
  if (Sys.info()[['sysname']] == 'Darwin') {
    download.file(Url, destfile = "Coursera-SwiftKey.zip", method = "curl") # for macOS
    unzip("Coursera-SwiftKey.zip", exdir = "./")
  }
}
To build our model, we do not need to load and use all of the data, given how large the files are.
# Files path and names
files_data <- c("./final/en_US/en_US.twitter.txt",
                "./final/en_US/en_US.blogs.txt",
                "./final/en_US/en_US.news.txt")
# Read all three data sets
twitter <- readLines(files_data[1], warn = FALSE, encoding = "UTF-8")
blogs   <- readLines(files_data[2], warn = FALSE, encoding = "UTF-8")
news    <- readLines(files_data[3], warn = FALSE, encoding = "UTF-8")
# Get file details
files_data_info <- data.frame(
file = files_data,
size_mb = round(file.info(files_data)$size/(1024^2), digits = 2))
# Add 3 new columns
files_data_info$total_lines <- 0
files_data_info$total_words <- 0
files_data_info$total_char <- 0
# Get #lines
files_data_info[1,]$total_lines <- length(twitter)
files_data_info[2,]$total_lines <- length(blogs)
files_data_info[3,]$total_lines <- length(news)
# Get #characters
files_data_info[1,]$total_char <- sum(nchar(twitter))
files_data_info[2,]$total_char <- sum(nchar(blogs))
files_data_info[3,]$total_char <- sum(nchar(news))
# Get #words (element 4 of the stri_stats_latex() output is the word count)
files_data_info[1,]$total_words <- stri_stats_latex(twitter)[4]
files_data_info[2,]$total_words <- stri_stats_latex(blogs)[4]
files_data_info[3,]$total_words <- stri_stats_latex(news)[4]
files_data_info
##                              file size_mb total_lines total_words total_char
## 1 ./final/en_US/en_US.twitter.txt  159.36     2360148    30451128  162096031
## 2   ./final/en_US/en_US.blogs.txt  200.42      899288    37570839  206824505
## 3    ./final/en_US/en_US.news.txt  196.28       77259     2651432   15639408
The plan is to get a representative sample of the data by reading in a random subset of the original data.
Below are some details of the resulting file.
# set seed for reproducibility
set.seed(3433)
# Sample 1% of each data set for better performance
twitter <- sample(twitter, length(twitter) * 0.01, replace = FALSE)
blogs <- sample(blogs, length(blogs) * 0.01, replace = FALSE)
news <- sample(news, length(news) * 0.01, replace = FALSE)
# combine all three data sets into a single data set
inTrain <- c(twitter, blogs, news)
rm(twitter, blogs, news) # remove data not needed
The cleaning consists of the following transformations: removing URLs, Twitter handles and email addresses, converting the text to lower case, and removing English stop words, punctuation, numbers and extra whitespace.
Many of these steps are performed with the tm package, which provides a comprehensive text mining framework for R. The lecture slides from the 2012 Stanford Coursera course also provide a good reference for a better understanding.
The tm package uses the concept of a so-called source to encapsulate and abstract the document input process. A corpus is created from the sampled text using the tm package, in order to take advantage of the text mining functionality that package provides.
buildCorpus <- function(dataSet) {
  # drop non-ASCII characters (and thereby most non-English text)
  dataSet <- iconv(dataSet, "latin1", "ASCII", sub = "")
  docs <- VCorpus(VectorSource(dataSet))
  # helper transformer: replace a matched pattern with a space
  toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
  # remove URLs, Twitter handles and email addresses
  docs <- tm_map(docs, toSpace, "(f|ht)tp(s?)://(.*)[.][a-z]+")
  docs <- tm_map(docs, toSpace, "@[^\\s]+")
  docs <- tm_map(docs, toSpace, "\\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}\\b")
  # lower case, stop words, punctuation, numbers, extra whitespace
  docs <- tm_map(docs, content_transformer(tolower))
  docs <- tm_map(docs, removeWords, stopwords("english"))
  docs <- tm_map(docs, removePunctuation)
  docs <- tm_map(docs, removeNumbers)
  docs <- tm_map(docs, stripWhitespace)
  return(docs)
}
corpus <- buildCorpus(inTrain) # build and clean the corpus from the sampled data
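As a quick, optional check (illustrative only, not part of the original pipeline), the first cleaned documents can be inspected to confirm that the transformations were applied:
# illustrative check: show the content of the first two cleaned documents
lapply(corpus[1:2], as.character)
# or inspect a single document with tm's inspect()
inspect(corpus[[1]])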
The exploratory analysis provides a basic summary of the data set: word counts, line counts, a basic data table and basic plots that show some features of the data.
For the resulting file, which combines the Twitter, blogs and news data using a 1% sample of each file, the summary is:
message(sprintf(" %s lines\n %s words\n %s characters\n %s MB",
                length(inTrain),
                stri_stats_latex(inTrain)[4],
                sum(nchar(inTrain)),
                round(object.size(inTrain)/1024^2, digits = 2)))
## 33365 lines
## 704006 words
## 3829031 characters
## 5.97 MB
rm(inTrain) # remove data not needed
A bar chart and a word cloud are shown to illustrate the most common words in the resulting file.
tdm <- TermDocumentMatrix(corpus)
freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
wordFreq <- data.frame(word = names(freq), freq = freq)
# plot the top 100 most frequent words
wordFreq %>%
  arrange(desc(freq)) %>%
  top_n(100, freq) %>%
  mutate(word = fct_reorder(word, freq, .desc = TRUE)) %>%
  plot_ly(x = ~word, y = ~freq, type = "bar") %>%
  layout(title = 'Top 100 most frequent words')
# construct word cloud using library(wordcloud)
suppressWarnings(
  wordcloud(words = wordFreq$word,
            freq = wordFreq$freq,
            min.freq = 1,
            max.words = 100,
            random.order = FALSE,
            rot.per = 0.35,
            colors = brewer.pal(8, "Dark2"))
)
# remove variables no longer needed to free up memory
rm(tdm, freq, wordFreq)
As mentioned earlier, the next step will be to build a Shiny app that predicts the next word using natural language processing.
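As a rough, purely illustrative sketch of how that prediction step might work (not the report's final model), a simple bigram lookup can suggest the most frequent word that follows the last word typed; `predictNextWord` and the toy text vector below are hypothetical names introduced only for this example:
# Hypothetical helper: suggest the most frequent follower of the last word of `phrase`
predictNextWord <- function(textLines, phrase) {
  # tokenize the cleaned text into lowercase word tokens
  tokens <- unlist(strsplit(tolower(textLines), "[^a-z']+"))
  tokens <- tokens[tokens != ""]
  # pair each word with the word that follows it (bigrams)
  w1 <- head(tokens, -1)
  w2 <- tail(tokens, -1)
  lastWord <- tail(unlist(strsplit(tolower(phrase), "[^a-z']+")), 1)
  candidates <- w2[w1 == lastWord]
  if (length(candidates) == 0) return(NA_character_)
  # the most frequent follower wins
  names(sort(table(candidates), decreasing = TRUE))[1]
}
# toy usage example
predictNextWord(c("i love data science", "i love data mining"), "i love")
## [1] "data"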
This first step provided an overview of the data and confirmed the need to work with a sample, given the large amount of data available.
The complete code is available at github.com/Cristianneuhaus, in the PGA-Milestone Report.RMD file.