The goal of this milestone report is to show that I have become familiar with the data and that I am on track to create the prediction algorithm. This submission explains the exploratory data analysis and the goals for the eventual app and algorithm. The document is concise: it describes only the major features of the data identified so far and briefly summarizes the plans for creating the prediction algorithm and Shiny app in a way that is understandable to a non-data-scientist manager. The report uses tables and plots to illustrate important summaries of the data set. The motivation for this project is to demonstrate that the data have been downloaded and loaded successfully, to produce basic summary statistics about the data sets, to report any interesting findings, and to get feedback on the plans for the prediction algorithm and Shiny app.
library(stringi)
library(tm)
library(wordcloud)
#library(RColorBrewer)
library(RWeka)
library(ggplot2)
The data set contains text in four languages, but only English will be used. The English data consists of three files: en_US.blogs.txt, en_US.news.txt, and en_US.twitter.txt.
# Download and unzip file if it doesn't exist
if (!file.exists('Coursera-SwiftKey.zip')) {
  download.file('https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip',
                destfile = './Coursera-SwiftKey.zip', method = 'curl', quiet = TRUE)
  unzip('./Coursera-SwiftKey.zip')
}
# Load blogs data
blogsFileName <- "final/en_US/en_US.blogs.txt"
con <- file(blogsFileName, open = "r")
blogs <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
close(con)
# Load news data
newsFileName <- "final/en_US/en_US.news.txt"
con <- file(newsFileName, open = "r")
news <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
close(con)
# Load twitter data
twitterFileName <- "final/en_US/en_US.twitter.txt"
con <- file(twitterFileName, open = "r")
twitter <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
close(con)
# Remove variables no longer needed to free up memory
rm(con)
For each file, the following summary statistics were calculated: size in megabytes, number of lines (rows), total number of characters, average number of characters per line, total number of words, and average number of words per line.
original_summary <- data.frame(
  'File' = c("Blogs", "News", "Twitter"),
  'FileSizeinMB' = round(file.info(c(blogsFileName,
                                     newsFileName,
                                     twitterFileName))$size / 1024 ^ 2),
  'NumberofLines' = sapply(list(blogs, news, twitter), length),
  'TotalCharacters' = sapply(list(blogs, news, twitter), function(x){sum(nchar(x))}),
  'MeanCharactersinLine' = sapply(list(blogs, news, twitter),
                                  function(x){mean(unlist(lapply(x, function(y) nchar(y))))}),
  'TotalWords' = sapply(list(blogs, news, twitter), stri_stats_latex)[4,],
  'MeanWordsinLine' = sapply(list(blogs, news, twitter),
                             function(x) summary(stri_count_words(x))[c('Mean')])
)
original_summary
## File FileSizeinMB NumberofLines TotalCharacters MeanCharactersinLine
## 1 Blogs 200 899288 206824505 229.98695
## 2 News 196 77259 15639408 202.42830
## 3 Twitter 159 2360148 162096241 68.68054
## TotalWords MeanWordsinLine
## 1 37570839 41.75107
## 2 2651432 34.61779
## 3 30451170 12.75065
To build the models, it is not necessary to load and use all of the data; a random sample is representative enough to infer facts about the population. Since the data set is so large (see the summary table above), the analysis proceeds with a subset of the original data (1% of each file), because running the calculations on the full files would be very slow.
# Set seed for reproducibility
set.seed(93)
# Assign sample size
sampleSize = 0.01
# Sample all three data sets
blogs <- sample(blogs, length(blogs) * sampleSize, replace = FALSE)
news <- sample(news, length(news) * sampleSize, replace = FALSE)
twitter <- sample(twitter, length(twitter) * sampleSize, replace = FALSE)
# Remove all non-English characters as they cause issues
blogs <- iconv(blogs, "latin1", "ASCII", sub = "")
news <- iconv(news, "latin1", "ASCII", sub = "")
twitter <- iconv(twitter, "latin1", "ASCII", sub = "")
# Combine all three data sets into a single data set
sampleData <- c(blogs, news, twitter)
# Save this sample dataset
sampleDataFileName <- "final/en_US/en_US.sample.txt"
write(sampleData, sampleDataFileName)
# Remove variables no longer needed to free up memory
rm(blogs, news, twitter)
# Stats of sample data
sample_summary <- data.frame(
  'File' = c("Sample"),
  'FileSizeinMB' = round(file.info(c(sampleDataFileName))$size / 1024 ^ 2),
  'NumberofLines' = sapply(list(sampleData), length),
  'TotalCharacters' = sapply(list(sampleData), function(x){sum(nchar(x))}),
  'MeanCharactersinLine' = sapply(list(sampleData),
                                  function(x){mean(unlist(lapply(x, function(y) nchar(y))))}),
  'TotalWords' = sapply(list(sampleData), stri_stats_latex)[4,],
  'MeanWordsinLine' = sapply(list(sampleData),
                             function(x) summary(stri_count_words(x))[c('Mean')]),
  row.names = NULL
)
sample_summary
## File FileSizeinMB NumberofLines TotalCharacters MeanCharactersinLine
## 1 Sample 4 33365 3823203 114.5872
## TotalWords MeanWordsinLine
## 1 701325 21.0071
The selected text data needs to be cleaned before it can be used in the word prediction model. For this reason, the sample data will be cleaned by removing extra whitespace, numbers, URLs, punctuation, profanity, and stopwords, and by converting all text to lowercase.
# Function to build the corpus
buildCorpus <- function(dataSet) {
  data <- VCorpus(VectorSource(dataSet))
  toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
  # Remove URL, Twitter handle and email patterns
  data <- tm_map(data, toSpace, "(f|ht)tp(s?)://(.*)[.][a-z]+")
  data <- tm_map(data, toSpace, "@[^\\s]+")
  data <- tm_map(data, toSpace, "\\b[A-Z a-z 0-9._ - ]*[@](.*?)[.]{1,3} \\b")
  # Convert all words to lowercase
  data <- tm_map(data, tolower)
  # Remove profane words from the sample data set
  ### Download the profanity list if it doesn't exist
  if (!file.exists('bad-words.txt')) {
    download.file('http://www.cs.cmu.edu/~biglou/resources/bad-words.txt',
                  destfile = 'bad-words.txt', quiet = TRUE)
  }
  ### Load profanity words
  con <- file("bad-words.txt", open = "r")
  profanityWords <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
  close(con)
  ### Remove all non-English characters as they cause issues
  profanityWords <- iconv(profanityWords, "latin1", "ASCII", sub = "")
  ### Remove profane words
  data <- tm_map(data, removeWords, profanityWords)
  ### Remove stopwords, punctuation, numbers and extra spaces
  data <- tm_map(data, removeWords, stopwords("english"))
  data <- tm_map(data, removePunctuation)
  data <- tm_map(data, removeNumbers)
  data <- tm_map(data, stripWhitespace)
  data <- tm_map(data, PlainTextDocument)
  return(data)
}
# Build the corpus
sampleCorpus <- buildCorpus(sampleData)
# Remove variables no longer needed to free up memory
rm(sampleData)
Now that the corpus has been cleaned, I can perform exploratory analysis on the tidy data.
The word cloud below illustrates the most frequently occurring words in the data.
wordcloud(sampleCorpus, max.words = 200, random.order = FALSE, rot.per = 0.15,
          use.r.layout = FALSE, colors = brewer.pal(8, "Dark2"), scale = c(2.9, 0.4))
The cleaned data now needs to be converted into a format that is most useful for NLP. A convenient format is n-grams stored in term-document matrices, in which the terms (words or n-grams) are the rows, the documents are the columns, and the entries are the frequency of each term in each document. Because the number of unique terms in the corpus makes these matrices very large, separate n-gram models (unigrams, bigrams, and trigrams) are created, and sparse terms are removed, to explore word frequencies.
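As a quick illustration (not part of the analysis pipeline; the two toy sentences and the name toyCorpus are made up), the snippet below builds a term-document matrix for a two-document corpus and prints it, showing terms as rows, documents as columns, and counts as entries.

# Toy example only: a term-document matrix for a tiny two-document corpus
toyCorpus <- VCorpus(VectorSource(c("the cat sat on the mat", "the cat ran")))
inspect(TermDocumentMatrix(toyCorpus))
# Each row is a term ("cat", "mat", "ran", ...), each column is one of the two
# documents, and each entry counts how often the term occurs in that document.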
# Function to get n-gram matrix word frequencies
getMatrixFreq <- function(corpus, n, val) {
  ntokenizer <- function(x) NGramTokenizer(x, Weka_control(min = n, max = n))
  # Create term document matrix for the corpus
  termDocMatrix <- TermDocumentMatrix(corpus, control = list(tokenize = ntokenizer))
  # Eliminate sparse terms for each n-gram and get frequencies of most common n-grams
  matrixFreq <- sort(rowSums(as.matrix(removeSparseTerms(termDocMatrix, val))), decreasing = TRUE)
  matrixFreq <- data.frame(word = names(matrixFreq), freq = matrixFreq)
  return(matrixFreq)
}
The unigram analysis shows the most frequent terms and their respective frequencies; unigrams are individual words.
unigramMatrixFreq <- getMatrixFreq(sampleCorpus, 1, 0.99)
# Plot
ggplot(unigramMatrixFreq[1:20,], aes(x = reorder(word, freq), y = freq)) +
  geom_bar(stat = "identity", fill = I("grey50")) +
  coord_flip() +
  geom_text(aes(label = freq), hjust = -0.10, size = 3) +
  ggtitle("Most Common Unigrams") + xlab("") + ylab("Frequency") +
  theme(plot.title = element_text(size = 14, hjust = 0.5))
The bigram analysis shows the most frequent terms and their respective frequencies; bigrams are two-word combinations.
bigramMatrixFreq <- getMatrixFreq(sampleCorpus, 2, 0.999)
# Plot
ggplot(bigramMatrixFreq[1:20,], aes(x = reorder(word, freq), y = freq)) +
  geom_bar(stat = "identity", fill = I("grey50")) +
  coord_flip() +
  geom_text(aes(label = freq), hjust = -0.10, size = 3) +
  ggtitle("Most Common Bigrams") + xlab("") + ylab("Frequency") +
  theme(plot.title = element_text(size = 14, hjust = 0.5))
The trigram analysis shows the most frequent terms and their respective frequencies; trigrams are three-word combinations.
trigramMatrixFreq <- getMatrixFreq(sampleCorpus, 3, 0.9999)
# Plot
ggplot(trigramMatrixFreq[1:20,], aes(x = reorder(word, freq), y = freq)) +
  geom_bar(stat = "identity", fill = I("grey50")) +
  coord_flip() +
  geom_text(aes(label = freq), hjust = -0.10, size = 3) +
  ggtitle("Most Common Trigrams") + xlab("") + ylab("Frequency") +
  theme(plot.title = element_text(size = 14, hjust = 0.5))
Through this analysis we can see the most common words and infer the main topics of the texts.
The next steps of this capstone project are to finalize the predictive algorithm and to deploy it as a Shiny application. For the Shiny application, the plan is to create an app with a simple interface where the user can enter a string of text; the prediction model will then return a list of suggested words for the next word.
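To make these plans concrete, here is a minimal sketch, not the final implementation, of how the unigram, bigram, and trigram frequency tables computed above could drive a simple frequency-based backoff lookup: first try trigrams whose first two words match the last two typed words, then back off to bigrams on the last word, and finally fall back to the most frequent unigrams. The function name predictNextWord, the regular-expression matching, and the default of three suggestions are illustrative assumptions.

# Sketch only: next-word suggestions via a simple frequency-based backoff over
# the unigram/bigram/trigram tables built above (assumed to still be in memory).
predictNextWord <- function(phrase, nSuggestions = 3) {
  words <- tolower(unlist(strsplit(phrase, "\\s+")))
  words <- words[words != ""]
  if (length(words) == 0) {
    # Nothing typed yet: fall back to the most frequent single words
    return(head(as.character(unigramMatrixFreq$word), nSuggestions))
  }
  if (length(words) >= 2) {
    # Try trigrams whose first two words match the last two typed words
    pattern <- paste0("^", paste(tail(words, 2), collapse = " "), " ")
    hits <- trigramMatrixFreq[grepl(pattern, as.character(trigramMatrixFreq$word)), ]
    if (nrow(hits) > 0) {
      return(head(sapply(strsplit(as.character(hits$word), " "), `[`, 3), nSuggestions))
    }
  }
  # Back off to bigrams starting with the last typed word
  pattern <- paste0("^", tail(words, 1), " ")
  hits <- bigramMatrixFreq[grepl(pattern, as.character(bigramMatrixFreq$word)), ]
  if (nrow(hits) > 0) {
    return(head(sapply(strsplit(as.character(hits$word), " "), `[`, 2), nSuggestions))
  }
  # Final fallback: most frequent unigrams
  head(as.character(unigramMatrixFreq$word), nSuggestions)
}
predictNextWord("thanks for the")

The Shiny application could then be little more than a text box wired to such a function; the sketch below assumes only the shiny package and the hypothetical predictNextWord() helper above.

library(shiny)
shinyApp(
  ui = fluidPage(
    textInput("phrase", "Enter a phrase:"),
    tableOutput("suggestions")
  ),
  server = function(input, output) {
    output$suggestions <- renderTable({
      data.frame(Suggestion = predictNextWord(input$phrase))
    })
  }
)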