The Capstone Project of the Johns Hopkins University Coursera Data Science Specialization is on Natural Language Processing (NLP). The goal of this milestone report is to show that I have become comfortable working with the data and that I am on track to build my prediction algorithm by the end of the project. I downloaded the data from the Coursera syllabus page (https://www.coursera.org/learn/data-science-project/supplement/idhGA/syllabus), which led me to the SwiftKey zip file (https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip). I then accessed the data manually from my hard drive.
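For reproducibility, the download and extraction can also be scripted in R; the sketch below assumes the zip is saved to the working directory, and the paths are illustrative rather than the exact ones I used.

# Download and extract the SwiftKey dataset (illustrative paths)
zipURL <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
zipFile <- "Coursera-SwiftKey.zip"
if (!file.exists(zipFile)) {
  download.file(zipURL, destfile = zipFile, mode = "wb")
}
unzip(zipFile, exdir = ".")   # creates a final/ folder with the locale subfolders, including en_US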
# Load the packages used throughout the analysis
library(dplyr)
library(plyr)
library(NLP)
library(tm)
library(magrittr)
library(stringi)
library(tidytext)
library(RColorBrewer)
library(tidyverse)
library(RWeka)
library(rmarkdown)
library(markdown)
library(knitr)
library(ggplot2)
library(lubridate)
Before beginning any data cleanup or analysis, it is a good idea to gather some general statistics on the files that will be combined to create our corpus.
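The code below is a sketch of how those statistics can be gathered with stringi (the same functions are used later for the combined sample); the file paths assume the zip was extracted to final/en_US/ in the working directory and are illustrative rather than the exact ones I used.

# Read the raw text files (paths assume the zip was extracted to final/en_US/)
blogs <- readLines("final/en_US/en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
news <- readLines("final/en_US/en_US.news.txt", encoding = "UTF-8", skipNul = TRUE)
tweets <- readLines("final/en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)

# File sizes in megabytes
sizeMB <- file.size(c("final/en_US/en_US.blogs.txt",
                      "final/en_US/en_US.news.txt",
                      "final/en_US/en_US.twitter.txt")) / 1024^2

# Line and character statistics from stringi, plus a word count
filestats <- data.frame(
  FileName = c("en_US.blogs", "en_US.news", "en_US.twitter"),
  FileSizeinMB = sizeMB,
  t(sapply(list(blogs, news, tweets), stri_stats_general)),
  WordCount = sapply(list(blogs, news, tweets), function(x) stri_stats_latex(x)[4]))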
filestats <- filestats[1:3, 1:7]
kable(filestats)
| FileName | FileSizeinMB | Lines | LinesNEmpty | Chars | CharsNWhite | WordCount |
|---|---|---|---|---|---|---|
| en_US.blogs | 200.4242 | 899288 | 899288 | 206824382 | 170389539 | 37570839 |
| en_US.news | 196.2775 | 1010242 | 1010242 | 203223154 | 169860866 | 34494539 |
| en_US.twitter | 159.3641 | 2360148 | 2360148 | 162096241 | 134082806 | 30451170 |
Because the datasets are extremely large, I will be selecting a slightly more manageable subset of data to create the corpus.
# Take a reproducible 0.5% random sample of each source and combine them into one character vector
set.seed(735)
blogsSamp <- sample(blogs, length(blogs) * 0.005)
newsSamp <- sample(news, length(news) * 0.005)
tweetsSamp <- sample(tweets, length(tweets) * 0.005)
SampComb <- c(blogsSamp, newsSamp, tweetsSamp)
# Summary statistics for the combined sample (same stringi measures as above)
SampCombStats <- data.frame(
  FileName = "Combined Sample",
  t(rbind(sapply(list(SampComb), stri_stats_general),
          WordCount = sapply(list(SampComb), stri_stats_latex)[4, ])))
kable(SampCombStats, row.names = FALSE)
| FileName | Lines | LinesNEmpty | Chars | CharsNWhite | WordCount |
|---|---|---|---|---|---|
| Combined Sample | 21347 | 21347 | 2848488 | 2361368 | 511077 |
The subset (SampComb) is much more manageable, so it will serve as the corpus for the exploratory analysis. Before conducting the analysis, I will perform some basic cleaning on the corpus: stripping extra whitespace, converting to lowercase, and removing punctuation, numbers, and common English stop words.
corpus <- VCorpus(VectorSource(SampComb),
                  readerControl = list(reader = readPlain, language = "en"))
corpus <- tm_map(corpus, stripWhitespace)                    # collapse extra whitespace
corpus <- tm_map(corpus, content_transformer(tolower))       # lowercase without breaking the corpus structure
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, stopwords("english"))  # drop common English stop words
The exploratory data analysis consists of three parts: looking at the most common single words (unigrams), the most common word pairs (bigrams), and the most common word triplets (trigrams).
# Term-document matrix of single words; summing across documents gives unigram frequencies
unigram <- as.matrix(TermDocumentMatrix(corpus))
unigram <- sort(rowSums(unigram), decreasing = TRUE)
unigram <- data.frame(word = names(unigram), freq = unigram)
kable(unigram[1:10, ], row.names = FALSE)
| word | freq |
|---|---|
| said | 1581 |
| will | 1550 |
| just | 1504 |
| one | 1466 |
| like | 1338 |
| can | 1168 |
| get | 1115 |
| time | 1040 |
| new | 929 |
| now | 894 |
unigramplot <- ggplot(unigram[1:25, ], aes(x = reorder(word, -freq), y = freq)) +
  geom_bar(stat = "identity", width = 0.75, fill = "steelblue4") +
  theme(axis.text.x = element_text(angle = 90)) +
  xlab("Unigrams") +
  ylab("Frequency") +
  ggtitle("Top 25 Unigrams in the Sample Corpus")
unigramplot
# RWeka tokenizer producing two-word sequences (bigrams)
bigramfn <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
bigram <- as.matrix(TermDocumentMatrix(corpus, control = list(tokenize = bigramfn)))
bigram <- sort(rowSums(bigram), decreasing = TRUE)
bigram <- data.frame(word = names(bigram), freq = bigram)
kable(bigram[1:10, ], row.names = FALSE)
| word | freq |
|---|---|
| right now | 127 |
| new york | 112 |
| dont know | 84 |
| last year | 84 |
| cant wait | 83 |
| looking forward | 70 |
| last week | 69 |
| years ago | 69 |
| happy birthday | 60 |
| high school | 60 |
bigramplot <- ggplot(bigram[1:25, ], aes(x = reorder(word, -freq), y = freq)) +
  geom_bar(stat = "identity", width = 0.75, fill = "seagreen4") +
  theme(axis.text.x = element_text(angle = 90)) +
  xlab("Bigrams") +
  ylab("Frequency") +
  ggtitle("Top 25 Bigrams in the Sample Corpus")
bigramplot
# RWeka tokenizer producing three-word sequences (trigrams)
trigramfn <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
trigram <- as.matrix(TermDocumentMatrix(corpus, control = list(tokenize = trigramfn)))
trigram <- sort(rowSums(trigram), decreasing = TRUE)
trigram <- data.frame(word = names(trigram), freq = trigram)
kable(trigram[1:10, ], row.names = FALSE)
| word | freq |
|---|---|
| cant wait see | 21 |
| happy mothers day | 18 |
| let us know | 17 |
| lost lost lost | 16 |
| new york city | 14 |
| please please please | 11 |
| two years ago | 11 |
| happy new year | 10 |
| hunter matt hunter | 9 |
| matt hunter matt | 9 |
trigramplot <- ggplot(trigram[1:25, ], aes(x = reorder(word, -freq), y = freq)) +
  geom_bar(stat = "identity", width = 0.75, fill = "red4") +
  theme(axis.text.x = element_text(angle = 90)) +
  xlab("Trigrams") +
  ylab("Frequency") +
  ggtitle("Top 25 Trigrams in the Sample Corpus")
trigramplot
As I continue working on this project, I plan to explore a variety of ways to visualize the data. One option I am currently exploring is word clouds; an example showing the most frequent unigrams in the sample corpus is included below. However, I do not think word clouds convey the information as effectively as the bar charts above.
library(wesanderson)
library(wordcloud)
# Word cloud of the most frequent unigrams, colored with a continuous Wes Anderson palette
pal <- wes_palette("Moonrise3", 21, type = "continuous")
wordcloud(unigram$word, unigram$freq, max.words = 30, colors = pal)
In addition to word frequency, I also plan to perform some basic sentiment analysis on the data.
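To give a sense of what that could look like, the sketch below scores the sampled text with the tidytext "bing" lexicon; the lexicon and the object names are illustrative placeholders rather than final choices.

# Basic sentiment tally using the tidytext "bing" lexicon on the sampled text (illustrative only)
sampleSentiment <- tibble(line = seq_along(SampComb), text = SampComb) %>%
  unnest_tokens(word, text) %>%                          # one word per row
  inner_join(get_sentiments("bing"), by = "word") %>%    # keep words present in the lexicon
  dplyr::count(sentiment, sort = TRUE)                   # tally positive vs. negative words
sampleSentiment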