The Capstone Project of the Johns Hopkins University Coursera Data Science Specialization is on Natural Language Processing (NLP). The goal of this milestone report is to show that I have become comfortable working with the data and that I am on track to build my prediction algorithm by the end of the project. I downloaded the data from the Coursera syllabus page (https://www.coursera.org/learn/data-science-project/supplement/idhGA/syllabus), which led me to the SwiftKey zip file (https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip). I then accessed the data manually from my hard drive.
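For reproducibility, the download and extraction can also be scripted in R; the sketch below assumes the zip is saved to the working directory, and the paths are illustrative rather than the exact ones I used.

# Download and extract the SwiftKey dataset (illustrative paths)
zipURL <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
zipFile <- "Coursera-SwiftKey.zip"
if (!file.exists(zipFile)) {
  download.file(zipURL, destfile = zipFile, mode = "wb")
}
unzip(zipFile, exdir = ".")   # creates a final/ folder with the locale subfolders, including en_US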
# Load the packages used throughout the analysis
library(dplyr)
library(plyr)
library(NLP)
library(tm)
library(magrittr)
library(stringi)
library(tidytext)
library(RColorBrewer)
library(tidyverse)
library(RWeka)
library(rmarkdown)
library(markdown)
library(knitr)
library(ggplot2)
library(lubridate)
Before beginning any data cleanup or analysis, it is a good idea to gather some general statistics on the files that will be combined to create our corpus.
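The code below is a sketch of how those statistics can be gathered with stringi (the same functions are used later for the combined sample); the file paths assume the zip was extracted to final/en_US/ in the working directory and are illustrative rather than the exact ones I used.

# Read the raw text files (paths assume the zip was extracted to final/en_US/)
blogs <- readLines("final/en_US/en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
news <- readLines("final/en_US/en_US.news.txt", encoding = "UTF-8", skipNul = TRUE)
tweets <- readLines("final/en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)

# File sizes in megabytes
sizeMB <- file.size(c("final/en_US/en_US.blogs.txt",
                      "final/en_US/en_US.news.txt",
                      "final/en_US/en_US.twitter.txt")) / 1024^2

# Line and character statistics from stringi, plus a word count
filestats <- data.frame(
  FileName = c("en_US.blogs", "en_US.news", "en_US.twitter"),
  FileSizeinMB = sizeMB,
  t(sapply(list(blogs, news, tweets), stri_stats_general)),
  WordCount = sapply(list(blogs, news, tweets), function(x) stri_stats_latex(x)[4]))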
filestats <- filestats[1:3, 1:7]
kable(filestats)
| FileName | FileSizeinMB | Lines | LinesNEmpty | Chars | CharsNWhite | WordCount |
|---|---|---|---|---|---|---|
| en_US.blogs | 200.4242 | 899288 | 899288 | 206824382 | 170389539 | 37570839 |
| en_US.news | 196.2775 | 1010242 | 1010242 | 203223154 | 169860866 | 34494539 |
| en_US.twitter | 159.3641 | 2360148 | 2360148 | 162096241 | 134082806 | 30451170 |
Because the datasets are extremely large, I will be selecting a slightly more manageable subset of data to create the corpus.
# Take a reproducible 0.5% random sample of each source and combine them into one character vector
set.seed(735)
blogsSamp <- sample(blogs, length(blogs) * 0.005)
newsSamp <- sample(news, length(news) * 0.005)
tweetsSamp <- sample(tweets, length(tweets) * 0.005)
SampComb <- c(blogsSamp, newsSamp, tweetsSamp)
# Summary statistics for the combined sample (same stringi measures as above)
SampCombStats <- data.frame(
  FileName = "Combined Sample",
  t(rbind(sapply(list(SampComb), stri_stats_general),
          WordCount = sapply(list(SampComb), stri_stats_latex)[4, ])))
kable(SampCombStats, row.names = FALSE)
| FileName | Lines | LinesNEmpty | Chars | CharsNWhite | WordCount |
|---|---|---|---|---|---|
| Combined Sample | 21347 | 21347 | 2848488 | 2361368 | 511077 |
The subset (SampComb) is much more manageable, so it will serve as the corpus for the exploratory analysis. Before conducting the analysis, I will perform some basic cleaning on the corpus: stripping extra whitespace, converting to lowercase, and removing punctuation, numbers, and common English stop words.
corpus <- VCorpus(VectorSource(SampComb),
                  readerControl = list(reader = readPlain, language = "en"))
corpus <- tm_map(corpus, stripWhitespace)                    # collapse extra whitespace
corpus <- tm_map(corpus, content_transformer(tolower))       # lowercase without breaking the corpus structure
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, stopwords("english"))  # drop common English stop words
The exploratory data analysis consists of three parts: looking at the most common single words (unigrams), the most common word pairs (bigrams), and the most common word triplets (trigrams).
# Term-document matrix of single words; summing across documents gives unigram frequencies
unigram <- as.matrix(TermDocumentMatrix(corpus))
unigram <- sort(rowSums(unigram), decreasing = TRUE)
unigram <- data.frame(word = names(unigram), freq = unigram)
kable(unigram[1:10, ], row.names = FALSE)
| word | freq |
|---|---|
| said | 1581 |
| will | 1550 |
| just | 1504 |
| one | 1466 |
| like | 1338 |
| can | 1168 |
| get | 1115 |
| time | 1040 |
| new | 929 |
| now | 894 |
unigramplot <- ggplot(unigram[1:25, ], aes(x = reorder(word, -freq), y = freq)) +
  geom_bar(stat = "identity", width = 0.75, fill = "steelblue4") +
  theme(axis.text.x = element_text(angle = 90)) +
  xlab("Unigrams") +
  ylab("Frequency") +
  ggtitle("Top 25 Unigrams in the Sample Corpus")
unigramplot
# RWeka tokenizer producing two-word sequences (bigrams)
bigramfn <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
bigram <- as.matrix(TermDocumentMatrix(corpus, control = list(tokenize = bigramfn)))
bigram <- sort(rowSums(bigram), decreasing = TRUE)
bigram <- data.frame(word = names(bigram), freq = bigram)
kable(bigram[1:10, ], row.names = FALSE)
| word | freq |
|---|---|
| right now | 127 |
| new york | 112 |
| dont know | 84 |
| last year | 84 |
| cant wait | 83 |
| looking forward | 70 |
| last week | 69 |
| years ago | 69 |
| happy birthday | 60 |
| high school | 60 |
bigramplot <- ggplot(bigram[1:25, ], aes(x = reorder(word, -freq), y = freq)) +
  geom_bar(stat = "identity", width = 0.75, fill = "seagreen4") +
  theme(axis.text.x = element_text(angle = 90)) +
  xlab("Bigrams") +
  ylab("Frequency") +
  ggtitle("Top 25 Bigrams in the Sample Corpus")
bigramplot
# RWeka tokenizer producing three-word sequences (trigrams)
trigramfn <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
trigram <- as.matrix(TermDocumentMatrix(corpus, control = list(tokenize = trigramfn)))
trigram <- sort(rowSums(trigram), decreasing = TRUE)
trigram <- data.frame(word = names(trigram), freq = trigram)
kable(trigram[1:10, ], row.names = FALSE)
| word | freq |
|---|---|
| cant wait see | 21 |
| happy mothers day | 18 |
| let us know | 17 |
| lost lost lost | 16 |
| new york city | 14 |
| please please please | 11 |
| two years ago | 11 |
| happy new year | 10 |
| hunter matt hunter | 9 |
| matt hunter matt | 9 |
trigramplot <- ggplot(trigram[1:25, ], aes(x = reorder(word, -freq), y = freq)) +
  geom_bar(stat = "identity", width = 0.75, fill = "red4") +
  theme(axis.text.x = element_text(angle = 90)) +
  xlab("Trigrams") +
  ylab("Frequency") +
  ggtitle("Top 25 Trigrams in the Sample Corpus")
trigramplot
As I continue working on this project, I plan to explore a variety of ways to visualize the data. One option I am currently exploring is word clouds; an example showing the most frequent unigrams in the sample corpus is included below. However, I do not think word clouds convey the information as effectively as the bar charts above.
library(wesanderson)
library(wordcloud)
# Word cloud of the most frequent unigrams, colored with a continuous Wes Anderson palette
pal <- wes_palette("Moonrise3", 21, type = "continuous")
wordcloud(unigram$word, unigram$freq, max.words = 30, colors = pal)
In addition to word frequency, I also plan to perform some basic sentiment analysis on the data.
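To give a sense of what that could look like, the sketch below scores the sampled text with the tidytext "bing" lexicon; the lexicon and the object names are illustrative placeholders rather than final choices.

# Basic sentiment tally using the tidytext "bing" lexicon on the sampled text (illustrative only)
sampleSentiment <- tibble(line = seq_along(SampComb), text = SampComb) %>%
  unnest_tokens(word, text) %>%                          # one word per row
  inner_join(get_sentiments("bing"), by = "word") %>%    # keep words present in the lexicon
  dplyr::count(sentiment, sort = TRUE)                   # tally positive vs. negative words
sampleSentiment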