This report summarizes the results of an initial exploratory data analysis of a corpus of three English text documents: a compilation of Twitter posts, a compilation of blog entries, and a compilation of news articles.
The documents are cleaned and analyzed to determine their basic structure, features, and word frequencies, and the results are plotted. This exercise is in preparation for building a text prediction model, which will ultimately power a predictive-text Shiny application: the user types in a word or phrase and the application predicts the next word.
The course zip file was downloaded from the class website:
https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip.
It was unzipped, and the English versions of the three files were read into R and merged into a single character vector for later analysis. A list of ‘bad words’ was also downloaded for use in removing profanity.
# Read files into R and merge
twr <- readLines('en_US.twitter.txt',warn = FALSE, skipNul= TRUE)
blg <- readLines('en_US.blogs.txt',warn = FALSE, skipNul= TRUE)
nws <- readLines('en_US.news.txt',warn = FALSE, skipNul= TRUE)
badwords <- readLines("bad-words.txt",warn = FALSE, skipNul= TRUE)
merged_data <- c(twr,blg,nws)
Once read into R, each data file is a character vector whose elements are strings containing one or more sentences. Sample lines:
Twitter: How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long.
Blog: so anyways, i am going to share some home decor inspiration that i have been storing in my folder on the puter. i have all these amazing images stored away ready to come to life when we get our home.
News: The St. Louis plant had to close. It would die of old age. Workers had been making cars there since the onset of mass automotive production in the 1920s.
| Text File | Characters | Words | Lines |
|---|---|---|---|
| Twitter | 162,385,035 | 30,373,583 | 2,360,148 |
| Blog | 208,361,438 | 37,334,131 | 899,288 |
| News | 15,683,765 | 2,643,969 | 77,259 |
| Total | 386,430,238 | 70,351,683 | 3,336,695 |
The huge size of both the individual and merged data sets proved to be a challenge given memory and performance constraints, so a small random sample (5% of the lines) was taken for the analysis.
set.seed(1228)
merged_sample<- sample(merged_data, length(merged_data)*0.05, replace = FALSE)
To prepare the sample for frequency analysis with the quanteda package, it was converted from UTF-8 to Latin-1 to drop non-English characters, tokenized with numbers, punctuation, symbols, hyphens, Twitter characters, and URLs removed, and then filtered to remove English stopwords and profanity (see the code under "Create Corpus and clean Tokens" below).
The number of words needed for various levels of dictionary coverage was calculated. Here is a plot and table of the results. Total word instances in the cleaned sample:
## [1] 1838113
| Percent Coverage | Number of Words |
|---|---|
| 50 | 800 |
| 90 | 14213 |
| 95 | 30037 |
As noted above, the size of the individual and merged data sets poses a challenge given memory and performance constraints, which is why only a small sample was used for this initial analysis. To build the prediction models, it will most likely be necessary to process the full data in chunks.
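As a rough illustration of the chunked approach, the sketch below splits the merged data into pieces, cleans each piece with the same steps used elsewhere in this report, and combines the resulting document-feature matrices. The chunk count and the per-chunk steps are assumptions for illustration only, not part of the current analysis.
# Sketch only: process the merged data in 10 chunks (assumed chunk count)
n_chunks <- 10
chunk_ids <- cut(seq_along(merged_data), breaks = n_chunks, labels = FALSE)
chunk_dfms <- lapply(seq_len(n_chunks), function(i) {
  chunk <- merged_data[chunk_ids == i]
  chunk <- iconv(chunk, "UTF-8", "latin1", sub = "")   # drop non-English characters
  toks  <- tokens(corpus(chunk), remove_numbers = TRUE, remove_punct = TRUE)
  toks  <- tokens_remove(toks, stopwords("english"))
  dfm(toks)                                            # per-chunk document-feature matrix
})
# Combine the per-chunk dfm's into one object for frequency analysis
full_dfm <- do.call(rbind, chunk_dfms)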
From the frequency charts and word clouds, the distribution of word frequencies drops off rapidly rather than linearly: a small number of very common words accounts for most of the occurrences.
The word coverage chart shows that relatively few words are needed to cover a large share of the total word occurrences in the sample. Building the 2-gram and 3-gram dfm's also takes far longer than building the 1-gram dfm; this will have to be dealt with during the modeling phase.
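One simple way to quantify that cost is sketched below. It is illustrative only: it reuses the create_dfm() helper defined in the code later in this report, and the timings depend on sample size and hardware.
# Time the construction of each n-gram dfm
t1 <- system.time(create_dfm(Tokens, 1, Stem = FALSE))
t2 <- system.time(create_dfm(Tokens, 2, Stem = FALSE))
t3 <- system.time(create_dfm(Tokens, 3, Stem = FALSE))
rbind(`1-gram` = t1, `2-gram` = t2, `3-gram` = t3)   # compare elapsed times side by side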
I plan to start with a simple back-off model after splitting the data into training and testing sets, and then evaluate its accuracy and performance. Depending on the results, I will move to more sophisticated models and smoothing techniques until I reach an acceptable level of accuracy and performance.
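To make the back-off idea concrete, here is a minimal sketch that assumes the Dfm1, Dfm2, and Dfm3 objects built in the code below (with n-gram features joined by spaces); the predict_next() helper is illustrative only and is not the final model.
# Frequency tables for each n-gram order (sorted by descending frequency)
freq1 <- textstat_frequency(Dfm1)
freq2 <- textstat_frequency(Dfm2)
freq3 <- textstat_frequency(Dfm3)

predict_next <- function(phrase) {
  words <- tolower(unlist(strsplit(phrase, "\\s+")))
  n <- length(words)
  # Try the 3-gram table using the last two words of the phrase
  if (n >= 2) {
    prefix <- paste(words[(n - 1):n], collapse = " ")
    hits <- freq3[startsWith(freq3$feature, paste0(prefix, " ")), ]
    if (nrow(hits) > 0) return(sub(".* ", "", hits$feature[1]))
  }
  # Back off to the 2-gram table using the last word
  hits <- freq2[startsWith(freq2$feature, paste0(words[n], " ")), ]
  if (nrow(hits) > 0) return(sub(".* ", "", hits$feature[1]))
  # Final fallback: the most frequent single word
  freq1$feature[1]
}

predict_next("happy new")   # illustrative call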
Finally, I will develop a Shiny application that allows the user to type in a phrase and returns the next word with the highest probability of occurrence.
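A bare-bones sketch of what that application could look like is shown below. It assumes the illustrative predict_next() function from the previous sketch and the shiny package (not among the libraries loaded later in this report); it is not the final design.
library(shiny)

ui <- fluidPage(
  titlePanel("Next Word Prediction"),
  textInput("phrase", "Type a phrase:"),
  textOutput("prediction")
)

server <- function(input, output) {
  output$prediction <- renderText({
    if (nchar(input$phrase) == 0) return("")
    predict_next(input$phrase)   # illustrative back-off predictor
  })
}

shinyApp(ui = ui, server = server)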
References
Introduction to Text Mining Package. http://angerhang.github.io/statsWithR/tutorials/textMiningIntro.html
quanteda: Quantitative Analysis of Textual Data. https://quanteda.io/
Daniel Jurafsky & James H. Martin. Speech and Language Processing, Chapter 4: N-Grams. 2014. https://lagunita.stanford.edu/c4x/Engineering/CS-224N/asset/slp4.pdf
Sandra Harabagiu. Natural Language Processing, CS6320, Lecture 6: N-Grams. http://www.hlt.utdallas.edu/~sanda/courses/NLP/Lecture06.pdf
Dan Jurafsky and Christopher Manning. Natural Language Processing, lecture slides from the Stanford Coursera course. https://web.stanford.edu/~jurafsky/NLPCourseraSlides.html
CRAN Task View: Natural Language Processing. https://cran.r-project.org/web/views/NaturalLanguageProcessing.html
Ingo Feinerer, Kurt Hornik, David Meyer. Text Mining Infrastructure in R. https://www.jstatsoft.org/article/view/v025i05
Libraries
library(ggplot2)
library(tm)
library(stopwords)
library(quanteda)
library(ngram)
library(dplyr)
library(RColorBrewer)
library(wordcloud)
library(SnowballC)
Download and unzip swift.zip
# Download and unzip the SwiftKey course data
URL <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
download.file(URL, "swift.zip")
files <- unzip("swift.zip", junkpaths = TRUE, exdir = ".")
# Download the profanity list used to filter tokens
URL <- "https://www.cs.cmu.edu/~biglou/resources/bad-words.txt"
destfile <- "bad-words.txt"
download.file(URL, destfile)
Calculate raw data statistics
#Character Count
chr_twr <- sum(nchar(twr))
chr_blg <- sum(nchar(blg))
chr_nws <- sum(nchar(nws))
# Line Count
lin_twr <- length(twr)
lin_blg <- length(blg)
lin_nws <- length(nws)
#Word count
wrd_twr <- wordcount(twr)
wrd_blg <- wordcount(blg)
wrd_nws <- wordcount(nws)
#Total Counts
tot_chr <- chr_twr+chr_blg+chr_nws
tot_lin <- lin_twr+lin_blg+lin_nws
tot_wrd <- wrd_twr+wrd_blg + wrd_nws
Create Corpus and clean Tokens
# Remove non-english characters by converting UTF-8 to latin1 format
merged_sample <- iconv(merged_sample, "UTF-8", "latin1", sub="")
#Create Corpus
Corpus <- corpus(merged_sample)
#Create and clean tokens
Tokens <- tokens(Corpus,remove_numbers = TRUE, remove_punct = TRUE,
remove_symbols = TRUE, remove_hyphens = TRUE,
remove_twitter = TRUE, remove_url = TRUE)
Tokens <- tokens_remove(Tokens,stopwords("english"))
Tokens <- tokens_remove(Tokens,badwords)
User Defined Functions
#Create dfm
create_dfm <- function(tok,n_gram,Stem){
Dfm <- dfm(tok, tolower = TRUE, remove_punct = TRUE,stem = Stem,
ngrams = n_gram,concatenator = " ",verbose = FALSE)
return(Dfm)
}
#Plot Wordcloud
plot_wrdcld <- function(Dfmx,n_words,n_gram){
freq_chart <- textstat_frequency(Dfmx,n_words)
layout(matrix(c(1, 2), nrow=2), heights=c(1, 4))
par(mar=rep(0, 4))
plot.new()
title <-paste(n_gram,"Gram Wordcloud")
text(x=0.5, y=0.5, title)
plt <- wordcloud(main = "Title",max.words=n_words,words = freq_chart$feature,
freq = freq_chart$frequency,
min.freq = 1, random.order=FALSE,
rot.per=0.35, scale = c(2.5,.2),
colors=brewer.pal(8, "Dark2"))
}
#Plot ngram frequency chart
freq_chart <- function(Dfmx,n_gram,n_feat){
top_freq <- textstat_frequency(Dfmx,n_feat)
plt <- ggplot(top_freq, aes(x = reorder(feature, frequency), y = frequency)) +
geom_bar(stat = "identity",fill= "darksalmon") + coord_flip() +
ggtitle(paste(n_gram, "Gram Frequency \n ( Top",n_feat,")")) +
xlab("Word") + ylab("Frequency")
return(plt)
}
Create dfm's for 1-, 2-, and 3-grams
Dfm1 <- create_dfm(Tokens,1,Stem = FALSE)
Dfm2 <- create_dfm(Tokens,2,Stem = FALSE)
Dfm3 <- create_dfm(Tokens,3,Stem = FALSE)
Plot N-Gram Wordclouds
wrdcld1 <- plot_wrdcld(Dfm1,100,1)
wrdcld2 <- plot_wrdcld(Dfm2,50,2)
wrdcld3 <- plot_wrdcld(Dfm3,35,3)
Plot N-Gram Frequency charts
wrdfreq1 <- freq_chart(Dfm1,1,25)
wrdfreq1
wrdfreq2 <- freq_chart(Dfm2,2,25)
wrdfreq2
wrdfreq3 <- freq_chart(Dfm3,3,25)
wrdfreq3
Calculate and plot word coverage
freq <- textstat_frequency(Dfm1)
total_words <- sum(freq$frequency)
total_words
cum_freq <- freq%>%select(feature,rank,frequency)%>%mutate(cum_sum = cumsum(frequency))
coverage <- cum_freq%>%mutate(cvrg = cum_sum/total_words)
coverage_95 <- subset(coverage, cvrg <= .95)
words_50 <- nrow(subset(coverage, cvrg <= .50))
words_90 <- nrow(subset(coverage, cvrg <= .90))
words_95 <- nrow(subset(coverage, cvrg <= .95))
plt <- ggplot(data = coverage_95, aes(y=cvrg, x = rank)) +
geom_line(color = "blue") +
scale_x_continuous(breaks=c(words_50,10000,words_90,20000,30000))+
scale_y_continuous(breaks=c(0,.25,0.5,0.75,0.9,0.95,1.0))+
geom_segment(aes(x = words_50, y = 0, xend = words_50, yend = 0.5),
linetype="dashed", color = "black")+
geom_segment(aes(x = words_90, y = 0, xend = words_90, yend = 0.90),
linetype="dashed", color = "black")+
geom_segment(aes(x = words_95, y = 0, xend = words_95, yend = 0.95),
linetype="dashed", color = "black")+
ggtitle("Word Coverage") +
xlab("# of Words") + ylab("Coverage (%)")
plt