Libraries used:

library(AppliedPredictiveModeling)
library(caret)
library(pgmm)
library(rpart)
library(tidyverse)
library(rpart.plot)
library(randomForest)
library(corrplot)
library(rattle)
library(tm)
library(downloader)
library(stringi)
library(wordcloud)
library(plotly)

Peer-graded Assignment Overview

Instructions
The goal of this project is to perform an exploratory analysis of the text data that will later be used to build a prediction algorithm. This document shows and explains only the major features of the data, together with a brief summary.
The main plan is to provide an understandable view of the data so that, in the next steps, a Shiny app can be built that predicts the next word using natural language processing. This summary contains tables and plots to illustrate the data set.

Dataset
The dataset is fairly large. It is the Capstone data set from the Coursera Data Science Capstone class and contains English, German, Russian and Finnish databases; we will focus on the English database.
Each database contains three files with a text corpus of documents to be used as training data:
- en_US.blogs.txt
- en_US.news.txt
- en_US.twitter.txt

# Code to download the file and unzip it
Url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
if (!file.exists('final')) {
    dir.create('final')
}
if (!file.exists("final/en_US")) {
    if (Sys.info()[['sysname']] == 'Windows'){
      download.file(Url, destfile = "Coursera-SwiftKey.zip") # for Windows
      unzip("Coursera-SwiftKey.zip")
    }
    if (Sys.info()[['sysname']] == 'Darwin'){
      download.file(Url, destfile = "Coursera-SwiftKey.zip", method = "curl") # for macOS
      unzip("Coursera-SwiftKey.zip", exdir = "./")
    }
}
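
If the download and extraction succeed, the three English files listed above should now sit under final/en_US. A quick, optional sanity check:

# optional sanity check: the three en_US.* files should be listed here
list.files("final/en_US")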

Exploratory data analysis

Size of files

To build our model, we don’t need to load in and use all of the data, given how large the files are. We start by looking at their size.

# Files path and names
files_data <- c("./final/en_US/en_US.twitter.txt",
                   "./final/en_US/en_US.blogs.txt",
                   "./final/en_US/en_US.news.txt")

# Read all three data sets
twitter<-readLines(files_data[1],warn=FALSE,encoding="UTF-8")
blogs<-readLines(files_data[2],warn=FALSE,encoding="UTF-8")
news<-readLines(files_data[3],warn=FALSE,encoding="UTF-8")

# Get files details
files_data_info <- data.frame(
  file = files_data, 
  size_mb = round(file.info(files_data)$size/(1024^2), digits = 2))

# Add 3 new columns
files_data_info$total_lines <- 0
files_data_info$total_words <- 0
files_data_info$total_char <- 0

# Get #rows
files_data_info[1,]$total_lines <- length(twitter)
files_data_info[2,]$total_lines <- length(blogs)
files_data_info[3,]$total_lines <- length(news)

# Get #characters
files_data_info[1,]$total_char <- sum(nchar(twitter))
files_data_info[2,]$total_char <- sum(nchar(blogs))
files_data_info[3,]$total_char <- sum(nchar(news))

# Get #words
files_data_info[1,]$total_words <- stri_stats_latex(twitter)[4]
files_data_info[2,]$total_words <- stri_stats_latex(blogs)[4]
files_data_info[3,]$total_words <- stri_stats_latex(news)[4]

files_data_info
##                              file size_mb total_lines total_words total_char
## 1 ./final/en_US/en_US.twitter.txt  159.36     2360148    30451128  162096031
## 2   ./final/en_US/en_US.blogs.txt  200.42      899288    37570839  206824505
## 3    ./final/en_US/en_US.news.txt  196.28       77259     2651432   15639408
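
Since the files are this large, an alternative (shown only as a sketch here, not used in this report) would be to read just the first portion of each file via the n argument of readLines:

# illustrative only: read the first 10,000 lines of a file instead of the whole thing
read_head <- function(path, n = 10000) {
    con <- file(path, open = "r", encoding = "UTF-8")
    on.exit(close(con))
    readLines(con, n = n, warn = FALSE)
}
# e.g. twitter_head <- read_head(files_data[1])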

Sampling

The plan is to get a representative sample of the data by reading in a random subset of the original data.
Below are some details of the resulting file.

# set seed for reproducibility
set.seed(3433)

# Sample 1% of the data for better performance
twitter <- sample(twitter, length(twitter) * 0.01, replace = FALSE)
blogs <- sample(blogs, length(blogs) * 0.01, replace = FALSE)
news <- sample(news, length(news) * 0.01, replace = FALSE)

# combine all three data sets into a single data set
inTrain <- c(twitter, blogs, news)

rm(twitter, blogs, news) # remove data not needed
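
Optionally, the combined sample could be written to disk so later steps can start from it without re-reading the raw files (a sketch only; the file name is illustrative, not part of the assignment):

# optional: persist the 1% sample for reuse (file name is illustrative)
writeLines(inTrain, "en_US.sample.txt")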

Cleaning Data & Text Normalization

The cleaning consists of the following transformations: removing non-ASCII characters, URLs, Twitter handles and e-mail addresses; converting the text to lower case; and removing English stop words, punctuation, numbers and extra whitespace.

Many of these steps are performed using the tm package, which provides a comprehensive text mining framework for R. The lecture slides from the 2012 Stanford Coursera course also provide good references for a better understanding.

The tm package uses the concept of a so-called source to encapsulate and abstract the document input process. In this way a corpus is created from the sampled text, using the tm package, in order to take advantage of the text mining functionality provided by that package.

buildCorpus <- function (dataSet) {
    dataSet <- iconv(dataSet, "latin1", "ASCII", sub = "") # drop non-ASCII characters
    docs <- VCorpus(VectorSource(dataSet))
    toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
    # replace URLs, Twitter handles and e-mail addresses with spaces
    docs <- tm_map(docs, toSpace, "(f|ht)tp(s?)://(.*)[.][a-z]+")
    docs <- tm_map(docs, toSpace, "@[^\\s]+")
    docs <- tm_map(docs, toSpace, "[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}")
    # transform to lower case; remove stop words, punctuation, numbers and extra whitespace
    docs <- tm_map(docs, content_transformer(tolower))
    docs <- tm_map(docs, removeWords, stopwords("english"))
    docs <- tm_map(docs, removePunctuation)
    docs <- tm_map(docs, removeNumbers)
    docs <- tm_map(docs, stripWhitespace)
    return(docs)
}
corpus <- buildCorpus(inTrain) # function call
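
A quick way to verify that the cleaning behaved as intended (a sketch, using the corpus just built) is to look at the content of a few cleaned documents:

# inspect the first three cleaned documents to verify the transformations
lapply(corpus[1:3], as.character)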

Exploratory Data Analysis

The exploratory analysis provides a basic summary of the data set: word counts, line counts, a basic data table and some basic plots to show features of the data.

For the resulting file, which combines the 1% samples taken from the twitter, blogs and news data, the summary is:

message(sprintf(" %s lines\n %s words\n %s characters\n %s MB",
                length(inTrain),
                stri_stats_latex(inTrain)[4],
                sum(nchar(inTrain)),
                round(object.size(inTrain)/1024^2, digits = 2)))
##  33365 lines
##  704006 words
##  3829031 characters
##  5.97 MB
rm(inTrain) # remove data not needed

Word Frequencies

A bar chart and a word cloud are shown to illustrate the most common words that appear in the resulting file.

tdm <- TermDocumentMatrix(corpus)
freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
wordFreq <- data.frame(word = names(freq), freq = freq)

# plot the top 100 most frequent words
wordFreq %>% 
  arrange(desc(freq)) %>%
  top_n(100, freq) %>% 
  mutate(
      word = fct_reorder(word, freq, .desc = TRUE)
  ) %>% 
  plot_ly(x = ~word, y = ~freq, type = "bar") %>% 
  layout(title = 'Top 100 most frequent words')
# construct word cloud using library(wordcloud)
suppressWarnings (
    wordcloud(words = wordFreq$word,
              freq = wordFreq$freq,
              min.freq = 1,
              max.words = 100,
              random.order = FALSE,
              rot.per = 0.35, 
              colors=brewer.pal(8, "Dark2"))
)

# remove variables no longer needed to free up memory
rm(tdm, freq, wordFreq)

Next Steps

As mentioned earlier, the next step will be to produce a Shiny app that predicts the next word using natural language processing.
This first step provided an overview of the data and confirmed the need to work with a sample of the data, given how much data is available.
The complete code is available at github.com/Cristianneuhaus, in the PGA-Milestone Report.RMD file.
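
As a pointer to the prediction step, the sketch below (not part of the report's code; the function name is illustrative) shows one way bigram counts could be derived from the cleaned corpus. A frequency table of word pairs like this is the raw material for a simple next-word model.

# sketch: count bigrams (adjacent word pairs) in the cleaned corpus
bigram_counts <- function(docs) {
    # tokenize each document separately so bigrams do not span documents
    bigrams <- unlist(lapply(docs, function(d) {
        tokens <- unlist(strsplit(as.character(d), "\\s+"))
        tokens <- tokens[tokens != ""]
        if (length(tokens) < 2) return(character(0))
        paste(head(tokens, -1), tail(tokens, -1))
    }))
    sort(table(bigrams), decreasing = TRUE)
}
head(bigram_counts(corpus), 10) # ten most frequent word pairs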