Capstone Code for Part I: Exploratory Analysis

This page explores the corpus data from the blogs, news, and Twitter feeds. The exploratory analysis is used to build tokens, understand patterns in the distribution of words in the data, and examine the tradeoff between coverage and accuracy. The next step after this analysis is to develop a language prediction model.
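
The chunks below call functions from several packages without showing the library() calls. A minimal setup sketch, assuming these packages are installed:

# packages assumed by the code on this page (library calls not shown in the original chunks)
library(stringi)    # stri_stats_latex() for word counts
library(tm)         # VCorpus(), tm_map(), TermDocumentMatrix(), stemDocument()
library(NLP)        # ngrams() and words() used by the tokenizers
library(ggplot2)    # frequency and coverage plots
library(wordcloud)  # word cloud visualizations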

Download and unzip the data file if it is not already present (checked with file.exists):

filename <- "Coursera-SwiftKey.zip"
if (!file.exists(filename)) {
  download.file("https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip", filename)
  unzip(filename, files = c("final/en_US/en_US.blogs.txt", "final/en_US/en_US.news.txt", "final/en_US/en_US.twitter.txt"))
}

Load the datasets:

blogs_data <- readLines("final/en_US/en_US.blogs.txt", encoding="UTF-8", skipNul = TRUE)
news_data <- readLines("final/en_US/en_US.news.txt", encoding="UTF-8", skipNul = TRUE)
## Warning in readLines("final/en_US/en_US.news.txt", encoding =
## "UTF-8", skipNul = TRUE): incomplete final line found on 'final/en_US/
## en_US.news.txt'
tweets_data <- readLines("final/en_US/en_US.twitter.txt", encoding="UTF-8", skipNul = TRUE)

Calculating info on size, characters, and words

Calculate the object sizes of each data set and format the data in units of “MB”

memorysize <- c(format(object.size(blogs_data), "MB"), format(object.size(news_data), "MB"), format(object.size(tweets_data), "MB"))
memorysize
## [1] "248.5 Mb" "19.2 Mb"  "301.4 Mb"

Calculate the maximum number of characters in a line of the datasets

max_characters <- c(max(nchar(blogs_data)), max(nchar(news_data)), max(nchar(tweets_data)))
max_characters
## [1] 40833  5760   140

Calculate the number of words in the datasets

num_words <- c(stri_stats_latex(blogs_data)["Words"], 
            stri_stats_latex(news_data)["Words"], 
            stri_stats_latex(tweets_data)["Words"])
num_words
##    Words    Words    Words 
## 37570839  2651432 30451170

Calculate the number of lines in the datasets

num_lines <- c(length(blogs_data), length(news_data), length(tweets_data))
num_lines
## [1]  899288   77259 2360148

Compile the above statistics into a data frame (table)

summarystats_table <- data.frame(memorysize, max_characters, num_words, num_lines, row.names = c("Blogs Data", "News Data", "Twitter Data"))
colnames(summarystats_table) <- c("Memory Size (in MB)", 
                       "Max characters in a line",
                       "Number of words",
                       "Number of lines")
knitr::kable(summarystats_table, digits = 2, caption = "Summary statistics of Blogs, News, and Twitter Datasets")
Summary statistics of Blogs, News, and Twitter Datasets

               Memory Size (in MB)  Max characters in a line  Number of words  Number of lines
Blogs Data     248.5 Mb             40833                     37570839         899288
News Data      19.2 Mb              5760                      2651432          77259
Twitter Data   301.4 Mb             140                       30451170         2360148

Create sample dataset

Since working with the full datasets would be cumbersome (the Blogs data in particular is very large), we take a sample of each dataset to work with for the subsequent analyses. I am sampling 10000 lines from each dataset; this size is somewhat arbitrary.

Set a seed so the results are reproducible when the code is run again, then draw a random sample of 10000 lines from each dataset.

set.seed(1234)
rndmsample <- 10000
blogs_data_sampled <- blogs_data[sample(1:length(blogs_data), rndmsample, replace=FALSE)]
news_data_sampled <- news_data[sample(1:length(news_data), rndmsample, replace=FALSE)]
tweets_data_sampled <- tweets_data[sample(1:length(tweets_data), rndmsample, replace=FALSE)]
rm(list=c("blogs_data", "news_data", "tweets_data"))

Defining Punctuation and Sentence Structure

Now that we have a workable sample from each dataset, we need to tell the computer how to recognize punctuation so that it knows how words are separated and related. We do this before removing punctuation so that unrelated words (from separate sentences or clauses) do not get merged together in the n-gram calculations; for the n-grams we only want to count words that appear together within the same sentence.

Here we define how to split each line into sentences so that unrelated words are not merged accidentally.

form_sentences <- function(line) {
  sentences <- line
  # if periods, commas, colons, or semicolons present in a line, thought ends, sentence "formed"
  if (grepl("[.]|[,]|[:]|[;]", line)) {
    sentences <- strsplit(line, "[.]|[,]|[:]|[;]")   #split the words at these punctuation marks
  }
  # return sentences
  return(sentences[[1]])
}

An example of splitting into sentences from each dataset

news_data_sampled[1] #This is the first example of a sentence formed
## [1] "To replace the fairs, he said Coventry will host free weeknight activities, such as music, movies and kickball, in nearby Coventry Peace Park. Thursdays will be music and movie nights this summer. Activities also are planned for other nights."
form_sentences(news_data_sampled[1]) 
## [1] "To replace the fairs"                                 
## [2] " he said Coventry will host free weeknight activities"
## [3] " such as music"                                       
## [4] " movies and kickball"                                 
## [5] " in nearby Coventry Peace Park"                       
## [6] " Thursdays will be music and movie nights this summer"
## [7] " Activities also are planned for other nights"
#Notice how it does not split at () or ' or other unintended punctuation
blogs_data_sampled[5] #This is the fifth example of a sentence formed
## [1] "I think I can believe that, though it<U+0092>s hard"
form_sentences(blogs_data_sampled[5])
## [1] "I think I can believe that" " though it<U+0092>s hard"
tweets_data_sampled[20]  #This is the 20th example of a sentence formed
## [1] "HS softball: Sun Prairie's Kristen Hoppman with solo shot to give Cardinals 5-4 lead...heading to bot of 10"
form_sentences(tweets_data_sampled[20])
## [1] "HS softball"                                                             
## [2] " Sun Prairie's Kristen Hoppman with solo shot to give Cardinals 5-4 lead"
## [3] ""                                                                        
## [4] ""                                                                        
## [5] "heading to bot of 10"

Apply the sentence splitting to every line of a dataset, based on the punctuation described above

convert_dataset <- function(data) {
  result <- vector(mode="character")
  # for each line in the dataset, convert them to individual sentences
  for (lineno in 1:length(data)) {
    result <- c(result, form_sentences(data[lineno]))
  }
  # remove empty entries
  result <- result[result != ""]
  # return the sentences
  return(result)
}
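
For reference, the same result can be produced without growing the vector inside a loop. A sketch of an equivalent version (named convert_dataset2 here purely for illustration):

convert_dataset2 <- function(data) {
  # split every line into sentences and flatten into one character vector
  result <- unlist(lapply(data, form_sentences), use.names = FALSE)
  # remove empty entries
  result[result != ""]
}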

Every line of each dataset is then divided up into sentences based on the punctuation

blogs.sentences <- convert_dataset(blogs_data_sampled)
news.sentences <- convert_dataset(news_data_sampled)
tweets.sentences <- convert_dataset(tweets_data_sampled)
rm(list=c("blogs_data_sampled", "news_data_sampled", "tweets_data_sampled"))
writeLines(blogs.sentences, con="./final/sample.blogs") 
writeLines(news.sentences, con="./final/sample.news")
writeLines(tweets.sentences, con="./final/sample.tweets")

Summary stats on these sampled datasets.

Calculate the object size of each sampled dataset (formatted in MB), the maximum number of characters in a line, the number of words, and the number of lines

smpl_sentence_mem <- c(format(object.size(blogs.sentences), "MB"),
                       format(object.size(news.sentences), "MB"), 
                       format(object.size(tweets.sentences), "MB"))

smpl_sentence_max_characters <- c(max(nchar(blogs.sentences)), 
                                  max(nchar(news.sentences)),   
                                  max(nchar(tweets.sentences)))


smpl_sentence_num_words <- c(stri_stats_latex(blogs.sentences)["Words"],  
                             stri_stats_latex(news.sentences)["Words"],  
                             stri_stats_latex(tweets.sentences)["Words"])  


smpl_sentence_num_lines <- c(length(blogs.sentences), 
                             length(news.sentences),  
                             length(tweets.sentences)) 

Combine the above calculations for the sampled sentence datasets into a summary table

smplfiles <- c("Sampled Blogs Data", "Sampled News Data", "Sampled Twitter Data")
smpl_sentence_summary_table <- data.frame(smpl_sentence_mem, smpl_sentence_max_characters, 
                                          smpl_sentence_num_words, smpl_sentence_num_lines, 
                                          row.names = smplfiles)
colnames(smpl_sentence_summary_table) <- c("Memory Size (in MB)", 
                              "Max characters in a line", 
                              "Number of words",
                              "Number of lines (ie sentences)")
knitr::kable(smpl_sentence_summary_table, digits = 2, caption = "Summary Statistics of sampled & 'cleaned' datasets")
Summary Statistics of sampled & 'cleaned' datasets

                      Memory Size (in MB)  Max characters in a line  Number of words  Number of lines (ie sentences)
Sampled Blogs Data    4.9 Mb               911                       421074           47918
Sampled News Data     4.4 Mb               367                       346047           45491
Sampled Twitter Data  1.7 Mb               140                       NA               20091

Constructing a corpus of documents

The next few steps will focus on building a collection of text documents (or Corpus). The collection of documents will be constructed using the sampled datasets with lines split into sentences. This corpus is what we will be working with for the rest of the exploration and analysis to come.

Construct the corpus using the tm package and do some cleaning: remove whitespace, profanity, numbers, and punctuation, and convert everything to lowercase

mycorpus <- VCorpus(DirSource("./final/"))
#Remove whitespaces
mycorpus <- tm_map(mycorpus, stripWhitespace)
#Remove profanity
profanity_file <- "profanity_list.txt"
# if the profanity list does not exist locally, download it
if (!file.exists(profanity_file)) {
  # download the list of profane words from the CMU website
  download.file("http://www.cs.cmu.edu/~biglou/resources/bad-words.txt", profanity_file)
}
}
# load the profane words from the file
profane_words <- readLines(profanity_file)
# remove the profane words from the corpus
mycorpus <- tm_map(mycorpus, removeWords, profane_words)
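# note: removeWords() matches case-sensitively, so capitalized variants of the
# listed words are not removed here; lowercasing the corpus first would catch them too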
# convert all characters to lowercase
mycorpus <- tm_map(mycorpus, content_transformer(tolower))
# remove numbers
mycorpus <- tm_map(mycorpus, removeNumbers)
# remove punctuation
mycorpus <- tm_map(mycorpus, removePunctuation)

The corpus of sampled text, converted into one sentence per line, is now cleaned and ready for further processing.

Tokenizing the data

In the last step we cleaned the sampled, sentence-per-line data (punctuation and numbers removed, everything lowercased, etc.). Now we are ready to look for patterns in the words of the datasets and to build visuals of those patterns. To do this we use TOKENIZATION: extracting sequences of words (n-grams) that appear together in the corpus. An n-gram of 1 (unigram) is a single word; an n-gram of 2 (bigram) is a pair of consecutive words; an n-gram of 3 (trigram) is a sequence of three consecutive words. Counting how often each n-gram occurs tells us which tokens are most common.
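
Before defining the tokenizers, here is a small illustration of what ngrams() from the NLP package produces, using a toy sentence rather than corpus data:

toy_words <- strsplit("to be or not to be", " ")[[1]]
# each bigram is a pair of consecutive words; paste() turns it into one token
sapply(ngrams(toy_words, 2), paste, collapse = " ")
## expected: "to be" "be or" "or not" "not to" "to be"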

Unigram tokenizer with frequencies

NLPUnigramTokenizer <- function(x) {
  unlist(lapply(ngrams(words(x), 1), paste, collapse=""), use.names=FALSE)
}
tdm1_NLP <- TermDocumentMatrix(mycorpus, control=list(tokenize=NLPUnigramTokenizer))
inspect(tdm1_NLP)
## <<TermDocumentMatrix (terms: 52976, documents: 3)>>
## Non-/sparse entries: 77905/81023
## Sparsity           : 51%
## Maximal term length: 89
## Weighting          : term frequency (tf)
## Sample             :
##       Docs
## Terms  sample.blogs sample.news sample.tweets
##   and         12235        8934          1797
##   but          2266        1471           529
##   for          4039        3496          1686
##   have         2480        1420           729
##   that         5056        3401          1004
##   the         20740       19591          4078
##   this         2865        1231           657
##   was          3193        2265           510
##   with         3205        2600           714
##   you          3282         933          2324

Bigram tokenizer with frequencies

NLPBigramTokenizer <- function(x) {
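  # note: collapse="" joins the two words with no separator, so terms appear as
  # e.g. "ofthe"; collapse=" " would keep a space between the words (the same
  # applies to the trigram tokenizer below)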
  unlist(lapply(ngrams(words(x), 2), paste, collapse=""), use.names=FALSE)
}
tdm2_NLP <- TermDocumentMatrix(mycorpus, control=list(tokenize=NLPBigramTokenizer))
inspect(tdm2_NLP) 
## <<TermDocumentMatrix (terms: 419250, documents: 3)>>
## Non-/sparse entries: 484180/773570
## Sparsity           : 62%
## Maximal term length: 91
## Weighting          : term frequency (tf)
## Sample             :
##          Docs
## Terms     sample.blogs sample.news sample.tweets
##   andthe           649         524            65
##   atthe            571         614           161
##   forthe           641         694           320
##   ina              508         530            84
##   inthe           1744        1771           344
##   ofthe           2000        1759           230
##   onthe            829         713           202
##   tobe             747         485           209
##   tothe           1026         840           221
##   withthe          509         434            72

Trigram tokenizer with frequencies

NLPtrigramTokenizer <- function(x) {
  unlist(lapply(ngrams(words(x), 3), paste, collapse=""), use.names=FALSE)
}
tdm3_NLP <- TermDocumentMatrix(mycorpus, control=list(tokenize=NLPtrigramTokenizer))
inspect(tdm3_NLP)
## <<TermDocumentMatrix (terms: 745951, documents: 3)>>
## Non-/sparse entries: 778240/1459613
## Sparsity           : 65%
## Maximal term length: 94
## Weighting          : term frequency (tf)
## Sample             :
##            Docs
## Terms       sample.blogs sample.news sample.tweets
##   alotof             135         110            30
##   aswellas            87          67             0
##   goingtobe           61          66            30
##   itwasa              74          50            13
##   oneofthe           169         151            26
##   outofthe            59          50            12
##   someofthe           64          52             5
##   theendof            57          40            14
##   theus               31         104             4
##   tobea               70          55            22

Plot the most frequent ngrams

As seen in the results above, inspect() shows a sample of the n-grams but not necessarily sorted by frequency. To sort the n-grams by how often they are used, we define the tdm_freq function below. We then visualize the results with freq_plot, which plots the 15 most frequently used n-grams.

tdm_freq <- function(tdm) {
  m <- as.matrix(tdm)
  wordsums <- rowSums(m)
  frequency <- sort(wordsums, decreasing=TRUE)
  return(data.frame(words=names(frequency), frequency=frequency, row.names=NULL))
}
freq_plot <- function(freq, label) {
  freq <- freq[1:15, ]
  ggplot(freq, aes(reorder(words, frequency), frequency, fill=words)) +
    geom_bar(stat = "identity") +
    geom_text(aes(label = frequency), vjust=-0.5, size=3) +
    theme(axis.text.x = element_text(angle = 45, hjust = 0.5, size = 10)) +
    labs(x=paste(label, "Tokens")) + 
    labs(y="Frequency") +
    labs(title=paste("Frequency of", label, "tokens"))
}

Here are the ngrams plots of the most frequently used tokens from the corpus (of all three datasets)

uniFreq <- tdm_freq(tdm1_NLP)
freq_plot(uniFreq, "1-gram")

BiFreq <- tdm_freq(tdm2_NLP)
freq_plot(BiFreq, "2-gram")

TriFreq <- tdm_freq(tdm3_NLP)
freq_plot(TriFreq, "3-gram")

Word Cloud visualizations of most frequent ngrams

Another way to explore the frequency of n-grams is with word clouds: a visual representation of the tokens in which more frequent tokens are drawn larger and less frequent ones smaller.

par(mfrow=c(1,3))
colorFunc <- colorRampPalette(c("red", "blue"))
colors <- colorFunc(15)
#1gram word cloud
wordcloud(uniFreq[1:15, ]$words, uniFreq[1:15, ]$frequency, 
          colors=colors, scale=c(5,1), ordered.colors=TRUE, random.order=TRUE)
text(x=0.5, y=-0.1, font=2, "Top 15 1-gram tokens")
#2gram wordcloud
wordcloud(BiFreq[1:15, ]$words, BiFreq[1:15, ]$frequency, 
          colors=colors, scale=c(4,1), ordered.colors=TRUE, random.order=TRUE)
text(x=0.5, y=-0.1, font=2, "Top 15 2-gram tokens")
#3gram wordcloud
wordcloud(TriFreq[1:15, ]$words, TriFreq[1:15, ]$frequency, 
          colors=colors, scale=c(3,1), ordered.colors=TRUE, random.order=TRUE)
text(x=0.5, y=-0.1, font=2, "Top 15 3-gram tokens")

Coverage and precision assessment (including a follow-up assessment with stemming applied)

Next we consider how many 1-gram tokens are needed to cover 50% of all word instances in the language, and how many are needed to cover 90%. We assume that the sample dataset we prepared is representative of the language as a whole. First we calculate the cumulative frequencies of the word instances and plot them.

getCumulativeFreq <- function(tokens) {
  return(cumsum(tokens$frequency))
}
getCoverage <- function(cumdist, total, target) {
  # number of tokens needed to reach the target share of all word instances
  targetfreq <- total * target
  return(which(cumdist >= targetfreq)[1])
}
# Cumulative distribution of word frequencies
cumulativedist <- getCumulativeFreq(uniFreq)
# total number of word instances
totalfreq <- cumulativedist[dim(uniFreq)[1]]
# index of word that covers 50% of word instances
c50 <- getCoverage(cumulativedist, totalfreq, 0.5)
# index of word that covers 90% of word instances
c90 <- getCoverage(cumulativedist, totalfreq, 0.9)
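
As a toy illustration of the coverage helpers above (made-up frequencies, not corpus data):

toy_freq <- data.frame(words = c("a", "b", "c", "d", "e"),
                       frequency = c(50, 30, 10, 5, 5))
toy_cum <- getCumulativeFreq(toy_freq)  # 50 80 90 95 100
getCoverage(toy_cum, 100, 0.5)          # 1: the single most frequent token covers 50%
getCoverage(toy_cum, 100, 0.9)          # 3: the top three tokens cover 90%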

Plot the cumulative frequencies of the 1gram tokens

plot_coverage <- function(uniFreq, cumulativedist, total_freq, c50, c90, ptitle) {
  # plot the cumulative distribution with the 50% and 90% coverage intercepts
  x_50 <- c50 + 3700
  y_50 <- total_freq*0.5 - 11000
  x_90 <- c90 + 3700
  y_90 <- total_freq*0.9 - 11000
  qplot(c(1:dim(uniFreq)[1]), cumulativedist, geom="line", 
        xlab="Number of 1-gram Tokens", 
        ylab = "Cumulative Frequency of the Tokens",
        main = ptitle) +
    geom_vline(aes(xintercept=c50), colour="red") +
    geom_hline(aes(yintercept=total_freq*0.5), colour="red") +
    geom_vline(aes(xintercept=c90), colour="blue") +
    geom_hline(aes(yintercept=total_freq*0.9), colour="blue") +
    annotate("text", x=x_50, y=y_50, label="50% coverage", size=3) +
    annotate("text", x=x_90, y=y_90, label="90% coverage", size=3) 
}
plot_coverage(uniFreq, cumulativedist, totalfreq, c50, c90, "Cumulative Frequencies with 50% & 90% Coverage Intercepts")

Compile the above coverage results into a data frame to make a table

nw <- c(c50, c90)
nratio <- nw*100/dim(uniFreq)[1]
label <- c("50% coverage", "90% coverage")
ndf <- data.frame(words = nw, percentage = nratio, row.names=label)
colnames(ndf) <- c("Number of tokens", "Percentage of total tokens")
knitr::kable(ndf, digits = 2, caption = "Coverage statistics")
Coverage statistics

              Number of tokens  Percentage of total tokens
50% coverage  327               0.62
90% coverage  10190             19.24

Assessment of the above coverage analysis: From the table, 50% coverage of all word instances is achieved with only 0.62% of the unique tokens in the corpus, and 90% coverage requires only 19.24% of the unique tokens.

Stemming coverage: To increase coverage, we consider stemming. Stemming maps all inflected variations of a word onto a common stem and sums their frequencies. For example, stemming reduces each of the words rain, rained, rains, and raining to just RAIN. It may be reasonable to predict stemmed tokens in order to offer quicker word predictions to people typing on their phones.
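
As a quick toy illustration (not part of the corpus pipeline), tm's stemDocument() applied to a character vector collapses the variants from the example above onto one stem:

stemDocument(c("rain", "rained", "rains", "raining"))
## expected: "rain" "rain" "rain" "rain"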

mycorpus_stemmed <- tm_map(mycorpus, stemDocument)
NLPStemmedTokenizer <- function(x){
  unlist(lapply(ngrams(words(x), 1), paste, collapse=""), use.names=FALSE)
}
stemmed.TDM <- TermDocumentMatrix(mycorpus_stemmed, control=list(tokenize=NLPStemmedTokenizer))
stemmed.freq <- tdm_freq(stemmed.TDM)
stemmed.cumulativedist <- getCumulativeFreq(stemmed.freq)
stemmed.total_freq <- stemmed.cumulativedist[dim(stemmed.freq)[1]]
stemmed.c50 <- getCoverage(stemmed.cumulativedist, stemmed.total_freq, 0.5)
stemmed.c90 <- getCoverage(stemmed.cumulativedist, stemmed.total_freq, 0.9)

plot_coverage(stemmed.freq, stemmed.cumulativedist, stemmed.total_freq, 
              stemmed.c50, stemmed.c90,
              "Cumulative Frequencies for Stemmed Corpus with Coverage Intercepts")

Compile the above stemmed coverage results into a data frame to make a table

stemmed.nw <- c(stemmed.c50, stemmed.c90)
stemmed.nratio <- stemmed.nw*100/dim(stemmed.freq)[1]
label <- c("50% coverage", "90% coverage")
stemmed.ndf <- data.frame(words = stemmed.nw, percentage = stemmed.nratio, row.names=label)
colnames(stemmed.ndf) <- c("Number of tokens", "Percentage of total tokens")
knitr::kable(stemmed.ndf, digits = 2, caption = "Coverage statistics of stemmed corpus")
Coverage statistics of stemmed corpus

              Number of tokens  Percentage of total tokens
50% coverage  250               0.84
90% coverage  5354              26.33

Assessment of the original and stemmed coverage results:

From the table, 50% coverage of the stemmed corpus is reached with only 250 tokens and 90% coverage with 5354 tokens, compared with 327 and 10190 tokens for the unstemmed corpus. Stemming therefore improves coverage: fewer distinct tokens are needed to cover the same share of word instances. The likely cost is prediction accuracy, because stemming collapses several variations of a word into one stem, so the model may not predict the exact variation the user was about to type. Further investigation is needed to understand the tradeoff between the increased coverage and the possibly decreased accuracy.

Final assessment: Influence of foreign words on the corpus

Finally, we check the ratio of non-English to English text in the sample. If the ratio is small, the presence of foreign words can be considered negligible.

url <- "http://cran.us.r-project.org/src/contrib/Archive/cldr/cldr_1.1.0.tar.gz"
pkgFile<-"cldr_1.1.0.tar.gz"
download.file(url = url, destfile = pkgFile)
install.packages(pkgs=pkgFile, type="source", repos=NULL)
## Warning: running command '"C:/Coursera/R-33~1.2/bin/x64/R" CMD INSTALL -l
## "C:\Coursera\R-3.3.2\library" "cldr_1.1.0.tar.gz"' had status 1
## Warning in install.packages(pkgs = pkgFile, type = "source", repos = NULL):
## installation of package 'cldr_1.1.0.tar.gz' had non-zero exit status
unlink(pkgFile)
library(cldr)
sentences <- c(blogs.sentences, news.sentences, tweets.sentences)
token.language <- detectLanguage(sentences)
english.words <- which(token.language$percentScore1 > 50 & token.language$detectedLanguage == "ENGLISH")
foreign.words <- which(token.language$percentScore1 > 50 & token.language$detectedLanguage != "ENGLISH")
round(length(foreign.words)*100/length(english.words), 2)
## [1] 5.32

The ratio of foreign-language to English sentences detected in the sample is 5.32%. Since this ratio is small, the presence of foreign words in the corpus should not significantly affect the algorithm.

Plan for prediction model and Shiny app: next steps

The next step in the project is to design and build a prediction model. The model will be based on the sampled training (and testing) datasets and on the n-gram frequencies calculated in this exploratory analysis. We will train the prediction model using the n-grams with the best coverage and accuracy to predict the next word a user will type on a mobile phone given some text, and we will then deploy the tool as a Shiny app.