Three data files are used, originating from blogs, news and Twitter; they can be downloaded here: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip.
The overall goal of this project is to develop a Shiny app that takes a phrase (multiple words) as input and predicts the next word using a prediction algorithm.

## Let’s load the required packages first
setwd("C:/Users/zxu3/Documents/R/data science")
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(doParallel)
## Loading required package: foreach
## Loading required package: iterators
## Loading required package: parallel
library(stringi)
library(tm)
## Loading required package: NLP
library(slam)
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
##
## annotate
library(wordcloud)
## Loading required package: RColorBrewer
library(RWeka)
library(RColorBrewer)
library(SnowballC)
library(lattice)
library(quanteda)
## Package version: 1.5.1
## Parallel computing: 2 of 12 threads used.
## See https://quanteda.io for tutorials and examples.
##
## Attaching package: 'quanteda'
## The following objects are masked from 'package:tm':
##
## as.DocumentTermMatrix, stopwords
## The following object is masked from 'package:utils':
##
## View
library(wordcloud2)
library(stringr)
First, we load the datasets into R. The data is provided in four languages; the English-language files were used as a starting point.
We then take a preliminary look at the files in terms of size, word count and number of lines.
# Read the blogs, news and Twitter data into R
blogs <- readLines("en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
news <- readLines("en_US.news.txt", encoding = "UTF-8", skipNul = TRUE)
## Warning in readLines("en_US.news.txt", encoding = "UTF-8", skipNul = TRUE):
## incomplete final line found on 'en_US.news.txt'
twitter <- readLines("en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
# Get words in files
blogs.words <- stri_count_words(blogs)
news.words <- stri_count_words(news)
twitter.words <- stri_count_words(twitter)
# Summary of the data sets
data.frame(source = c("blogs", "news", "twitter"),
           num.lines = c(length(blogs), length(news), length(twitter)),
           num.words = c(sum(blogs.words), sum(news.words), sum(twitter.words)),
           mean.num.words = c(mean(blogs.words), mean(news.words), mean(twitter.words)))
##    source num.lines num.words mean.num.words
## 1   blogs    899288  37546239       41.75107
## 2    news     77259   2674536       34.61779
## 3 twitter   2360148  30093413       12.75065
As the datasets are very large, we will randomly sample 5,000 records from each file for further processing.
# Sample the data
data.sample <- c(sample(blogs, 5000),
                 sample(news, 5000),
                 sample(twitter, 5000))
length(data.sample)
## [1] 15000
sum(stri_count_words(data.sample))
## [1] 440992
data.sample <- iconv(data.sample, "latin1", "ASCII", sub = "")
data.sample <- str_replace_all(data.sample, "[\r\n]", "")
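Note that sample() draws a different subset on each run, so the counts above will vary between runs. A minimal sketch of a reproducible version of the sampling step, assuming the same blogs, news and twitter vectors are in memory; the seed value is arbitrary:

# Reproducible variant of the sampling step (seed value is arbitrary)
set.seed(1234)
data.sample <- c(sample(blogs, 5000),
                 sample(news, 5000),
                 sample(twitter, 5000))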
To clean our sample data we could use the tm_map function in the tm package, but that turns out to be very tedious, so we will use the quanteda package instead. In natural language processing, an n-gram is a contiguous sequence of n items from a given text; an n-gram model can be used to predict the next item in a sequence from the preceding n-1 items (for example, a trigram model predicts the third word from the two words before it).
# Tokenization using the quanteda package
unigram <- tokens(data.sample, remove_punct = TRUE, ngrams = 1)
bigram <- tokens(data.sample, remove_punct = TRUE, ngrams = 2)
trigram <- tokens(data.sample, remove_punct = TRUE, ngrams = 3)
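The ngrams argument of tokens() belongs to the older quanteda API (the 1.5.1 release used here); in quanteda 2.0 and later it was removed in favour of tokens_ngrams(). A sketch of the equivalent tokenization under the newer API, assuming the same cleaned data.sample:

# Equivalent tokenization for quanteda >= 2.0 (sketch)
toks <- tokens(data.sample, remove_punct = TRUE)
unigram <- tokens_ngrams(toks, n = 1)
bigram  <- tokens_ngrams(toks, n = 2)
trigram <- tokens_ngrams(toks, n = 3)

The remaining quanteda calls in this report follow the 1.5.x API and may also need adjusting on newer releases.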
# Load profanity word list (dfm's remove argument expects a character vector)
profanity <- read.csv("profane_words.txt", header = FALSE, stringsAsFactors = FALSE)$V1
unigram <- dfm(unigram, remove = profanity)
unigramfreq <- textstat_frequency(unigram)
# Note: single-word patterns will not match inside concatenated bigram/trigram
# features; removing profanity at the token stage would be more thorough.
bigram <- dfm(bigram, remove = profanity)
bigramfreq <- textstat_frequency(bigram)
trigram <- dfm(trigram, remove = profanity)
trigramfreq <- textstat_frequency(trigram)
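These frequency tables connect directly to the prediction idea described above: given the last two words of a phrase, the trigrams that start with those words, ranked by frequency, are candidate next words. A rough illustrative sketch, assuming quanteda's default "_" feature concatenator and the feature/frequency columns returned by textstat_frequency() in quanteda 1.5.x; the example prefix is arbitrary:

# Sketch: most frequent observed continuations of a two-word prefix
predict_from_trigrams <- function(prefix, freq_table, top = 5) {
  pattern <- paste0("^", paste(prefix, collapse = "_"), "_")
  hits <- freq_table[grepl(pattern, freq_table$feature), ]
  head(hits[order(-hits$frequency), c("feature", "frequency")], top)
}
predict_from_trigrams(c("thanks", "for"), trigramfreq)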
The following charts depict the most frequent terms: first a word cloud of the sampled text, then bar plots of the top 20 unigrams, bigrams and trigrams.
wordcloud(data.sample, max.words = 300, random.order = FALSE,
          rot.per = 0.1, scale = c(2.5, 0.3), use.r.layout = FALSE,
          colors = brewer.pal(8, "Dark2"))
## Warning in tm_map.SimpleCorpus(corpus, tm::removePunctuation):
## transformation drops documents
## Warning in tm_map.SimpleCorpus(corpus, function(x) tm::removeWords(x,
## tm::stopwords())): transformation drops documents
# Barplot - unigrams with top 20 frequencies
barplot(unigramfreq[1:20, ]$frequency, names.arg = unigramfreq[1:20, ]$feature,
        main = "Top 20 Most Frequent Unigrams", col = "green2")
# Barplot - bigrams with top 20 frequencies
barplot(bigramfreq[1:20, ]$frequency, names.arg = bigramfreq[1:20, ]$feature,
        main = "Top 20 Most Frequent Bigrams", col = "tomato2")
# Barplot - trigrams with top 20 frequencies
barplot(trigramfreq[1:20, ]$frequency, names.arg = trigramfreq[1:20, ]$feature,
        main = "Top 20 Most Frequent Trigrams", col = "blue")
At this point I have imported the project data set, cleaned the data, and performed the NLP processing, including generating n-grams of one to three terms and analyzing their frequencies.
The results presented in this report are based on a sample of 5,000 lines from each source file. For the final project application, I plan to scale this sample size up.
Open issues:

- Investigate more efficient tokenization and n-gram generation techniques
- Develop the word prediction and backoff algorithm to be used in the final app (a rough sketch of the backoff idea follows below)
- Develop a Shiny app
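As a starting point for the second item, here is a minimal sketch of a backoff-style lookup over the frequency tables built above. It assumes the "_"-concatenated features and the feature/frequency columns from textstat_frequency(), and the scoring is deliberately simplified (raw counts, no discounting); the function and helper names are illustrative only.

# Sketch: back off from trigram to bigram to unigram counts (simplified scoring)
predict_next_word <- function(phrase, top = 3) {
  words <- tolower(unlist(strsplit(phrase, "\\s+")))
  lookup <- function(prefix, freq_table) {
    pattern <- paste0("^", paste(prefix, collapse = "_"), "_[^_]+$")
    hits <- freq_table[grepl(pattern, freq_table$feature), ]
    if (nrow(hits) == 0) return(character(0))
    # keep only the predicted (last) token of each matching n-gram
    sub(".*_", "", head(hits$feature[order(-hits$frequency)], top))
  }
  result <- lookup(tail(words, 2), trigramfreq)                          # try trigrams first
  if (length(result) == 0) result <- lookup(tail(words, 1), bigramfreq)  # back off to bigrams
  if (length(result) == 0) result <- head(unigramfreq$feature, top)      # fall back to top unigrams
  result
}
predict_next_word("thanks for the")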