Executive Summary

This report summarizes the results of an initial exploratory data analysis of a corpus of three English text documents: a compilation of Twitter posts, a compilation of blog entries, and a compilation of news articles.

The documents are cleaned and analyzed to determine their basic structure, features, and word frequencies, and the results are plotted. This exercise is in preparation for building a text prediction model, which will power a predictive-text Shiny application: the user types in a word phrase and the application predicts the next word.

Exploratory Data Analysis

The course zip file was downloaded from the class website:
https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip.
It was unzipped, and the English versions of the three files were read into R and merged into a single character vector for later analysis. A list of ‘bad words’ was also downloaded for use in removing profanity.

# Read the English files and the profanity list into R, then merge the text sources
twr <- readLines('en_US.twitter.txt', warn = FALSE, skipNul = TRUE)
blg <- readLines('en_US.blogs.txt', warn = FALSE, skipNul = TRUE)
nws <- readLines('en_US.news.txt', warn = FALSE, skipNul = TRUE)
badwords <- readLines('bad-words.txt', warn = FALSE, skipNul = TRUE)
merged_data <- c(twr, blg, nws)

Each file is read in as a character vector whose elements are strings containing one or more sentences. Sample lines:

Twitter: How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long.
Blog: so anyways, i am going to share some home decor inspiration that i have been storing in my folder on the puter. i have all these amazing images stored away ready to come to life when we get our home.
News: The St. Louis plant had to close. It would die of old age. Workers had been making cars there since the onset of mass automotive production in the 1920s.

Summary of raw data statistics

Text File      Characters       Words       Lines
---------     -----------  ----------   ---------
Twitter       162,385,035  30,373,583   2,360,148
Blog          208,361,438  37,334,131     899,288
News           15,683,765   2,643,969      77,259
---------     -----------  ----------   ---------
Total         386,430,238  70,351,683   3,336,695

The huge size of both the individual and merged data sets proved to be a challenge given memory and performance constraints, so a small (5%) random sample was taken for the analysis.

# Take a reproducible 5% random sample of the merged data
set.seed(1228)
merged_sample <- sample(merged_data, length(merged_data) * 0.05, replace = FALSE)

Data cleansing and Tokenizing

Clean Data for Analysis

The following operations were performed on the data to prepare it for frequency analysis using the quanteda package (a condensed sketch follows this list; the complete code appears in the Appendix).

  • Create a corpus from the merged sample
  • Convert the UTF-8 encoding to latin1 to eliminate non-English characters
  • Tokenize the corpus
  • Remove numbers, punctuation, symbols, hyphens, Twitter hashtags, and URLs
  • Remove English stopwords
  • Remove profanity
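
Condensed, these steps amount to the quanteda pipeline sketched below. This is a shortened version of the complete, commented code in the Appendix and assumes the quanteda 1.x API, in which tokens() still accepts the remove_hyphens and remove_twitter arguments.

library(quanteda)

# Condensed cleaning pipeline (full code in the Appendix)
merged_sample <- iconv(merged_sample, "UTF-8", "latin1", sub = "")  # drop non-English characters
Corpus <- corpus(merged_sample)
Tokens <- tokens(Corpus, remove_numbers = TRUE, remove_punct = TRUE,
                 remove_symbols = TRUE, remove_hyphens = TRUE,
                 remove_twitter = TRUE, remove_url = TRUE)
Tokens <- tokens_remove(Tokens, stopwords("english"))  # English stopwords
Tokens <- tokens_remove(Tokens, badwords)              # profanity list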

N-Gram Creation and Frequency Analysis

  • Create a document-feature matrix (dfm) for each n-gram (n = 1, 2, 3); a condensed sketch follows this list
  • Plot word clouds for the n-grams
  • Plot bar charts of the n-gram frequencies
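
The dfm construction condenses to the calls below, again assuming the quanteda 1.x API in which dfm() accepts an ngrams argument; the create_dfm() wrapper in the Appendix adds optional stemming.

# Condensed: build 1-, 2-, and 3-gram document-feature matrices
Dfm1 <- dfm(Tokens, tolower = TRUE, ngrams = 1, concatenator = " ")
Dfm2 <- dfm(Tokens, tolower = TRUE, ngrams = 2, concatenator = " ")
Dfm3 <- dfm(Tokens, tolower = TRUE, ngrams = 3, concatenator = " ")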

Plot N-Gram Frequency charts

Calculate Word Coverage

Word Coverage Analysis

The number of words needed for various levels of dictionary coverage was calculated. A table and plot of the results are shown below. Total number of word occurrences (tokens) in the dictionary:

## [1] 1838113
Percent Coverage    Number of Words
----------------    ---------------
              50                800
              90             14,213
              95             30,037
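
Coverage is computed by sorting the unigram frequencies in decreasing order and taking the cumulative share of total token occurrences. The calculation condenses to the snippet below (full code in the Appendix; it uses the dplyr verbs loaded there).

# Condensed word-coverage calculation
freq <- textstat_frequency(Dfm1)                  # unigram frequencies, ordered by rank
total_words <- sum(freq$frequency)                # total token occurrences
coverage <- freq %>%
  mutate(cvrg = cumsum(frequency) / total_words)  # cumulative share covered by the top words
words_50 <- nrow(subset(coverage, cvrg <= 0.50))  # words needed for 50% coverage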

Plot Word Coverage (1 Gram)

Findings

The huge size of both the individual and merged data sets will continue to be a challenge given memory and performance constraints; for that reason only a small sample was used for this initial analysis. To build the prediction models it will most likely be necessary to process the data in chunks.
The frequency charts and word clouds show that the distribution of word frequencies drops off very rapidly rather than linearly.
The word coverage chart shows that relatively few words are needed to obtain quite good coverage of the total vocabulary. Computing the 2-grams and 3-grams took far longer than the 1-grams, which will also have to be dealt with during the modeling phase.

Plans for Prediction modeling

After splitting the data into training and testing sets, I plan to start with a simple back-off model and evaluate its accuracy and performance (a rough sketch of such a back-off lookup is given below). Depending on the results, I will move to more sophisticated models and smoothing techniques until I reach an acceptable level of accuracy and performance.
Finally, I will develop a Shiny application that lets the user type in a phrase and predicts the next word with the highest probability of occurrence.
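
As a rough illustration only, and not the final implementation, a naive back-off next-word lookup over pre-computed n-gram frequency tables might look like the sketch below. The function name and the table layout (data frames with prefix, word, and frequency columns) are assumptions made for this sketch.

# Illustrative back-off lookup; ngram1/ngram2/ngram3 are assumed to be data
# frames with columns `prefix`, `word`, and `frequency`
predict_next <- function(phrase, ngram3, ngram2, ngram1) {
  words <- tolower(unlist(strsplit(trimws(phrase), "\\s+")))
  n <- length(words)
  # Try the trigram table first, keyed on the last two words
  if (n >= 2) {
    hits <- ngram3[ngram3$prefix == paste(words[n - 1], words[n]), ]
    if (nrow(hits) > 0) return(hits$word[which.max(hits$frequency)])
  }
  # Back off to the bigram table, keyed on the last word
  if (n >= 1) {
    hits <- ngram2[ngram2$prefix == words[n], ]
    if (nrow(hits) > 0) return(hits$word[which.max(hits$frequency)])
  }
  # Final fallback: the most frequent unigram
  ngram1$word[which.max(ngram1$frequency)]
}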

Appendix

References

Introduction to Text Mining Package:
http://angerhang.github.io/statsWithR/tutorials/textMiningIntro.html

quanteda: Quantitative Analysis of Textual Data
https://quanteda.io/

Speech and Language Processing, Chapter 4: N-Grams. Daniel Jurafsky & James H. Martin, 2014.
https://lagunita.stanford.edu/c4x/Engineering/CS-224N/asset/slp4.pdf

Natural Language Processing, CS6320, Lecture 6: N-Grams. Sandra Harabagiu.
http://www.hlt.utdallas.edu/~sanda/courses/NLP/Lecture06.pdf

Natural Language Processing. Lecture slides from the Stanford Coursera course by Dan Jurafsky and Christopher Manning.
https://web.stanford.edu/~jurafsky/NLPCourseraSlides.html

CRAN Task View: Natural Language Processing
https://cran.r-project.org/web/views/NaturalLanguageProcessing.html

Text Mining Infrastructure in R. Ingo Feinerer, Kurt Hornik, David Meyer.
https://www.jstatsoft.org/article/view/v025i05

R Code

Libraries

library(ggplot2)
library(tm)
library(stopwords)
library(quanteda)
library(ngram)
library(dplyr)
library(RColorBrewer)
library(wordcloud)
library(SnowballC)

Download and unzip swift.zip

# Download and unzip the Coursera-SwiftKey data set
URL <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
download.file(URL, "swift.zip")
files <- unzip("swift.zip", junkpaths = TRUE, exdir = ".")
# Download the profanity ('bad words') list
URL <- "https://www.cs.cmu.edu/~biglou/resources/bad-words.txt"
destfile <- "bad-words.txt"
download.file(URL, destfile)

Calculate raw data statistics

# Character counts
chr_twr <- sum(nchar(twr))
chr_blg <- sum(nchar(blg))
chr_nws <- sum(nchar(nws))
# Line counts
lin_twr <- length(twr)
lin_blg <- length(blg)
lin_nws <- length(nws)
# Word counts (ngram::wordcount)
wrd_twr <- wordcount(twr)
wrd_blg <- wordcount(blg)
wrd_nws <- wordcount(nws)
# Totals across the three files
tot_chr <- chr_twr + chr_blg + chr_nws
tot_lin <- lin_twr + lin_blg + lin_nws
tot_wrd <- wrd_twr + wrd_blg + wrd_nws

Create Corpus and clean Tokens

# Remove non-English characters by converting UTF-8 to latin1
merged_sample <- iconv(merged_sample, "UTF-8", "latin1", sub = "")
# Create the corpus
Corpus <- corpus(merged_sample)
# Tokenize; remove numbers, punctuation, symbols, hyphens, Twitter hashtags, and URLs
Tokens <- tokens(Corpus, remove_numbers = TRUE, remove_punct = TRUE,
                 remove_symbols = TRUE, remove_hyphens = TRUE,
                 remove_twitter = TRUE, remove_url = TRUE)
# Remove English stopwords and profanity
Tokens <- tokens_remove(Tokens, stopwords("english"))
Tokens <- tokens_remove(Tokens, badwords)

User Defined Functions

#Create dfm
create_dfm <- function(tok,n_gram,Stem){
  Dfm <- dfm(tok, tolower = TRUE, remove_punct = TRUE,stem = Stem,  
             ngrams = n_gram,concatenator = " ",verbose = FALSE)
  return(Dfm)
}
# Plot a word cloud for an n-gram dfm, with a small title panel above the cloud
plot_wrdcld <- function(Dfmx, n_words, n_gram){
  freq_tab <- textstat_frequency(Dfmx, n_words)
  layout(matrix(c(1, 2), nrow = 2), heights = c(1, 4))
  par(mar = rep(0, 4))
  plot.new()
  text(x = 0.5, y = 0.5, paste(n_gram, "Gram Wordcloud"))
  wordcloud(words = freq_tab$feature, freq = freq_tab$frequency,
            max.words = n_words, min.freq = 1, random.order = FALSE,
            rot.per = 0.35, scale = c(2.5, .2),
            colors = brewer.pal(8, "Dark2"))
}
#Plot ngram frequency chart
freq_chart <- function(Dfmx,n_gram,n_feat){
  top_freq <- textstat_frequency(Dfmx,n_feat)
  plt <- ggplot(top_freq, aes(x = reorder(feature, frequency), y = frequency)) +
    geom_bar(stat = "identity",fill= "darksalmon") + coord_flip() +
    ggtitle(paste(n_gram, "Gram Frequency \n ( Top",n_feat,")")) + 
    xlab("Word") + ylab("Frequency")
  return(plt)
}

Create dfms for 1-, 2-, and 3-grams

Dfm1 <- create_dfm(Tokens,1,Stem = FALSE)
Dfm2 <- create_dfm(Tokens,2,Stem = FALSE)
Dfm3 <- create_dfm(Tokens,3,Stem = FALSE)

Plot N-Gram Wordclouds

wrdcld1 <- plot_wrdcld(Dfm1,100,1)
wrdcld2 <- plot_wrdcld(Dfm2,50,2)
wrdcld3 <- plot_wrdcld(Dfm3,35,3)

Plot N-Gram Frequency charts

wrdfreq1 <- freq_chart(Dfm1,1,25)
wrdfreq1
wrdfreq2 <- freq_chart(Dfm2,2,25)
wrdfreq2
wrdfreq3 <- freq_chart(Dfm3,3,25)
wrdfreq3

Calculate and plot word coverage

# Unigram frequency table and total token count
freq <- textstat_frequency(Dfm1)
total_words <- sum(freq$frequency)
total_words

# Cumulative token counts by frequency rank
cum_freq <- freq %>% select(feature, rank, frequency) %>% mutate(cum_sum = cumsum(frequency))
# Fraction of all tokens covered by the top-ranked words
coverage <- cum_freq %>% mutate(cvrg = cum_sum / total_words)
coverage_95 <- subset(coverage, cvrg <= .95)
# Number of words needed for 50%, 90%, and 95% coverage
words_50 <- nrow(subset(coverage, cvrg <= .50))
words_90 <- nrow(subset(coverage, cvrg <= .90))
words_95 <- nrow(subset(coverage, cvrg <= .95))

plt <- ggplot(data = coverage_95, aes(y=cvrg, x = rank)) + 
  geom_line(color = "blue") + 
  scale_x_continuous(breaks=c(words_50,10000,words_90,20000,30000))+
  scale_y_continuous(breaks=c(0,.25,0.5,0.75,0.9,0.95,1.0))+
  geom_segment(aes(x = words_50, y = 0, xend = words_50, yend = 0.5),
    linetype="dashed", color = "black")+
  geom_segment(aes(x = words_90, y = 0, xend = words_90, yend = 0.90),
    linetype="dashed", color = "black")+
  geom_segment(aes(x = words_95, y = 0, xend = words_95, yend = 0.95),
               linetype="dashed", color = "black")+
  ggtitle("Word Coverage") +
  xlab("# of Words") + ylab("Coverage (%)")
plt