Santana’s Data Science Capstone Project: Week 2

The Capstone Project of the Johns Hopkins University Coursera Data Science Specialization focuses on Natural Language Processing (NLP). The goal of this report is to show that I have become familiar with the data and that I am on track to build my prediction algorithm by the end of the project. I downloaded the data from the Coursera syllabus page (https://www.coursera.org/learn/data-science-project/supplement/idhGA/syllabus), which led me to the SwiftKey zip file (https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip). I then accessed the data manually from my hard drive.
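Since the loading code is not shown in this report, the sketch below shows one way the three English files might be read into R after unzipping the download; the local paths are placeholders for wherever the files live on disk, while blogs, news, and tweets are the objects used in the sampling step later on.

# Assumed paths to the unzipped en_US files; adjust to your local directory
blogs  <- readLines("final/en_US/en_US.blogs.txt",   encoding = "UTF-8", skipNul = TRUE)
news   <- readLines("final/en_US/en_US.news.txt",    encoding = "UTF-8", skipNul = TRUE)
tweets <- readLines("final/en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)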

Loading Libraries

library(dplyr)
library(plyr)
library(NLP)
library(tm)
library(magrittr)
library(stringi)
library(tidytext)
library(RColorBrewer)
library(tidyverse)
library(RWeka)
library(rmarkdown)
library(markdown)
library(knitr)
library(ggplot2)
library(lubridate)

Basic File Statistics

Before beginning any data cleanup or analysis, it is a good idea to get some general statistics on the files that will be combined to create the corpus.
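As a rough sketch (using the same placeholder paths as above), the filestats object summarized below can be assembled from stringi's stri_stats_general() and stri_stats_latex() together with the file sizes on disk.

files <- c("final/en_US/en_US.blogs.txt",
           "final/en_US/en_US.news.txt",
           "final/en_US/en_US.twitter.txt")
texts <- list(blogs, news, tweets)
filestats <- data.frame(
        FileName     = c("en_US.blogs", "en_US.news", "en_US.twitter"),
        FileSizeinMB = file.size(files) / 1024^2,               # size on disk in MB
        t(sapply(texts, stri_stats_general)),                   # Lines, LinesNEmpty, Chars, CharsNWhite
        WordCount    = sapply(texts, stri_stats_latex)[4, ])    # row 4 of the LaTeX stats is the word count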

filestats <- filestats[1:3, 1:7]
kable(filestats)
FileName FileSizeinMB Lines LinesNEmpty Chars CharsNWhite WordCount
en_US.blogs 200.4242 899288 899288 206824382 170389539 37570839
en_US.news 196.2775 1010242 1010242 203223154 169860866 34494539
en_US.twitter 159.3641 2360148 2360148 162096241 134082806 30451170

Creating a Data Subset

Because the datasets are extremely large, I will select a more manageable subset, a random 0.5% sample of each file, to build the corpus.

set.seed(735)
blogsSamp <- sample(blogs, length(blogs)*.005)
newsSamp <- sample(news, length(news)*.005)
tweetsSamp <- sample(tweets, length(tweets)*.005)

SampComb <- c(blogsSamp, newsSamp, tweetsSamp)

SampCombStats <- data.frame(
        FileName = "Combined Sample",
        t(rbind(sapply(list(SampComb), stri_stats_general),
                WordCount = sapply(list(SampComb), stri_stats_latex)[4, ])))

kable(SampCombStats)
FileName Lines LinesNEmpty Chars CharsNWhite WordCount
Combined Sample 21347 21347 2848488 2361368 511077

The subset created (SampComb) is much more manageable, so it will serve as the corpus for the exploratory analysis. Before conducting the exploratory analysis, I will perform some basic cleaning on the corpus: stripping extra white space, converting to lower case, removing punctuation and numbers, and removing common English stop words.

corpus <- VCorpus(VectorSource(SampComb),
                  readerControl = list(reader = readPlain, language = "en", load = TRUE))

corpus <- tm_map(corpus, stripWhitespace)                  # collapse repeated white space
corpus <- tm_map(corpus, content_transformer(tolower))     # lower-case while keeping the documents as PlainTextDocuments
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, stopwords("english"))

Exploratory NLP Analysis

The exploratory data analysis consists of three parts: looking at the most common single words (unigrams), the most common word pairs (bigrams), and the most common word triplets (trigrams).

Unigram Analysis

unigram <- as.data.frame(as.matrix(TermDocumentMatrix(corpus)))
unigram <- sort(rowSums(unigram), decreasing = TRUE)
unigram <- data.frame(word = names(unigram), freq = unigram)
kable(unigram[1:10, ])
word freq
said 1581
will 1550
just 1504
one 1466
like 1338
can 1168
get 1115
time 1040
new 929
now 894
unigramplot <- ggplot(unigram[1:25,], aes(x=reorder(word, -freq),y=freq)) +
        geom_bar(stat = 'identity', width=0.75, fill="steelblue4") +
        theme(axis.text.x=element_text(angle=90))+
        xlab('Unigrams')+
        ylab('Frequency')+
        ggtitle("Histogram of Top 25 Unigrams in Sample Corpus")

unigramplot

Bigram Analysis

bigramfn <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
bigram <- as.data.frame(as.matrix(TermDocumentMatrix(corpus, control = list(tokenize = bigramfn))))
bigram <- sort(rowSums(bigram), decreasing = TRUE)
bigram <- data.frame(word = names(bigram), freq = bigram)
kable(bigram[1:10, ])
word freq
right now 127
new york 112
dont know 84
last year 84
cant wait 83
looking forward 70
last week 69
years ago 69
happy birthday 60
high school 60
bigramplot <- ggplot(bigram[1:25,], aes(x=reorder(word, -freq),y=freq)) +
        geom_bar(stat = 'identity', width=0.75, fill="seagreen4") +
        theme(axis.text.x=element_text(angle=90))+
        xlab('Bigrams')+
        ylab('Frequency')+
        ggtitle("Histogram of Top 25 Bigrams in Sample Corpus")
bigramplot

Trigram Analysis

trigramfn <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
trigram <- as.data.frame(as.matrix(TermDocumentMatrix(corpus, control = list(tokenize = trigramfn))))
trigram <- sort(rowSums(trigram), decreasing = TRUE)
trigram <- data.frame(word = names(trigram), freq = trigram)
kable(trigram[1:10, ])
word freq
cant wait see 21
happy mothers day 18
let us know 17
lost lost lost 16
new york city 14
please please please 11
two years ago 11
happy new year 10
hunter matt hunter 9
matt hunter matt 9
trigramplot <- ggplot(trigram[1:25,], aes(x=reorder(word, -freq),y=freq)) +
        geom_bar(stat = 'identity', width=0.75, fill="red4") +
        theme(axis.text.x=element_text(angle=90))+
        xlab('Trigrams')+
        ylab('Frequency')+
        ggtitle("Histogram of Top 25 Trigrams in Sample Corpus")

trigramplot

Plan for Future Analysis and Visualizations

As I continue working on this project, I plan to explore a variety of ways to visualize the data. One option I am currently exploring is word clouds; an example showing the most frequent unigrams in the sample corpus is included below. However, I do not think word clouds convey the information as efficiently as the bar charts above.

library(wesanderson)
library(wordcloud)
pal <- wes_palette("Moonrise3", 21, type = "continuous")
wordcloud(unigram$word, unigram$freq, max.words=30, colors=pal)

In addition to word frequency, I also plan to perform some basic sentiment analysis on the data as part of the next stage of the project.
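One possible approach, sketched below, is to join the unigram frequency table built above with tidytext's Bing sentiment lexicon and tally the total frequency of positive versus negative words; this is only a rough outline of the planned analysis, not the final implementation.

bing <- get_sentiments("bing")
unigramSentiment <- unigram %>%
        mutate(word = as.character(word)) %>%   # ensure word is character (it may be a factor) before joining
        inner_join(bing, by = "word") %>%       # keep only words found in the Bing lexicon
        count(sentiment, wt = freq)             # total frequency of positive vs. negative words
unigramSentiment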