This page contains information for exploring the corpus data in Blogs, News and Twitter feeds. This exploratory analysis is conducted to develop tokens, understand patterns in the distribution of words in the data, and understand the difference between coverage and accuracy. The next step in this analysis is to develop a language prediction model.
filename <- "Coursera-SwiftKey.zip"
if (!file.exists(filename)) {
download.file("https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip", filename)
unzip(filename, files = c("final/en_US/en_US.blogs.txt", "final/en_US/en_US.news.txt", "final/en_US/en_US.twitter.txt"))
}
blogs_data <- readLines("final/en_US/en_US.blogs.txt", encoding="UTF-8", skipNul = TRUE)
news_data <- readLines("final/en_US/en_US.news.txt", encoding="UTF-8", skipNul = TRUE)
## Warning in readLines("final/en_US/en_US.news.txt", encoding =
## "UTF-8", skipNul = TRUE): incomplete final line found on 'final/en_US/
## en_US.news.txt'
tweets_data <- readLines("final/en_US/en_US.twitter.txt", encoding="UTF-8", skipNul = TRUE)
Calculate the object sizes of each data set and format the data in units of “MB”
memorysize <- c(format(object.size(blogs_data), "MB"), format(object.size(news_data), "MB"), format(object.size(tweets_data), "MB"))
memorysize
## [1] "248.5 Mb" "19.2 Mb" "301.4 Mb"
Calculate the maximum number of characters in a line of the datasets
max_characters <- c(max(nchar(blogs_data)), max(nchar(news_data)), max(nchar(tweets_data)))
max_characters
## [1] 40833 5760 140
Calculate the number of words in the datasets
num_words <- c(stri_stats_latex(blogs_data)["Words"],
stri_stats_latex(news_data)["Words"],
stri_stats_latex(tweets_data)["Words"])
num_words
## Words Words Words
## 37570839 2651432 30451170
Calculate the number of lines in the datasets
num_lines <- c(length(blogs_data), length(news_data), length(tweets_data))
num_lines
## [1] 899288 77259 2360148
Compile all the above data statistics of the lines and words in a dataframe (table)
summarystats_table <- data.frame(memorysize, max_characters, num_words, num_lines, row.names = c("Blogs Data", "News Data", "Twitter Data"))
colnames(summarystats_table) <- c("Memory Size(in MB)",
"Max characters in a line",
"Number of words",
"Number of lines")
knitr::kable(summarystats_table, digits = 2, caption = "Summary statistics of Blogs, News, and Twitter Datasets")
Memory Size(in MB) | Max characters in a line | Number of words | Number of lines | |
---|---|---|---|---|
Blogs Data | 248.5 Mb | 40833 | 37570839 | 899288 |
News Data | 19.2 Mb | 5760 | 2651432 | 77259 |
Twitter Data | 301.4 Mb | 140 | 30451170 | 2360148 |
Since working with the entire datasets would be cumbersome (especially the Blogs data) which is very large, we will now go ahead and take samples of each dataset so we can work with those for subsequent analyses. I am sampling 10000 lines from each dataset but this is somewhat arbitrary.
set a seed number so that when I run the code again is reproducible with same numbers. Then select the random sample size to be 10000 observations. Select 10000 lines from each dataset
set.seed(1234)
set.seed(1234)
rndmsample <- 10000
blogs_data_sampled <- blogs_data[sample(1:length(blogs_data), rndmsample, replace=FALSE)]
news_data_sampled <- news_data[sample(1:length(news_data), rndmsample, replace=FALSE)]
tweets_data_sampled <- tweets_data[sample(1:length(tweets_data), rndmsample, replace=FALSE)]
rm(list=c("blogs_data", "news_data", "tweets_data"))
Now that we have a workable sample from each dataset, we now need to start telling the computer how to recognize punctuation. That way, it knows how the words are separated and related overall. We need to teach the computer about sentences and punctuation.We need to do this first before we remove punctuation so that unrelated words (from seperate sentences) do not get merged together for the ngram calculations. For the ngram calculations we only want to calculate related words from within the same sentences.
Here tell the data how to split the lines into sentences so unrelated words are not merged accidently.
form_sentences <- function(line) {
sentences <- line
# if periods, commas, colons, or semicolons present in a line, thought ends, sentence "formed"
if (grepl("[.]|[,]|[:]|[;]", line)) {
sentences <- strsplit(line, "[.]|[,]|[:]|[;]") #split the words at these punctuation marks
}
# return sentences
return(sentences[[1]])
}
An example of splitting into sentences from each dataset
news_data_sampled[1] #This is the first example of a sentence formed
## [1] "To replace the fairs, he said Coventry will host free weeknight activities, such as music, movies and kickball, in nearby Coventry Peace Park. Thursdays will be music and movie nights this summer. Activities also are planned for other nights."
form_sentences(news_data_sampled[1])
## [1] "To replace the fairs"
## [2] " he said Coventry will host free weeknight activities"
## [3] " such as music"
## [4] " movies and kickball"
## [5] " in nearby Coventry Peace Park"
## [6] " Thursdays will be music and movie nights this summer"
## [7] " Activities also are planned for other nights"
#Notice how it does not form sentences for () or ' or unintended punction
blogs_data_sampled[5] #This is the fifth example of a sentence formed
## [1] "I think I can believe that, though it<U+0092>s hard"
form_sentences(blogs_data_sampled[5])
## [1] "I think I can believe that" " though it<U+0092>s hard"
tweets_data_sampled[20] #This is the 20th example of a sentence formed
## [1] "HS softball: Sun Prairie's Kristen Hoppman with solo shot to give Cardinals 5-4 lead...heading to bot of 10"
form_sentences(tweets_data_sampled[20])
## [1] "HS softball"
## [2] " Sun Prairie's Kristen Hoppman with solo shot to give Cardinals 5-4 lead"
## [3] ""
## [4] ""
## [5] "heading to bot of 10"
Tell the program to actually split every line into a sentence based on the punctuation described above
convert_dataset <- function(data) {
result <- vector(mode="character")
# for each line in the dataset, convert them to individual sentences
for (lineno in 1:length(data)) {
result <- c(result, form_sentences(data[lineno]))
}
# remove empty entries
result <- result[result != ""]
# return the sentences
return(result)
}
Every line of each dataset is then divided up into sentences based on the punctuation
blogs.sentences <- convert_dataset(blogs_data_sampled)
news.sentences <- convert_dataset(news_data_sampled)
tweets.sentences <- convert_dataset(tweets_data_sampled)
rm(list=c("blogs_data_sampled", "news_data_sampled", "tweets_data_sampled"))
writeLines(blogs.sentences, con="./final/sample.blogs")
writeLines(news.sentences, con="./final/sample.news")
writeLines(tweets.sentences, con="./final/sample.tweets")
Calculate object sizes of each data set and format the data in “MB”, Calculate max characters, number of words, and the number of lines in each dataset
smpl_sentence_mem <- c(format(object.size(blogs.sentences), "MB"),
format(object.size(news.sentences), "MB"),
format(object.size(tweets.sentences), "MB"))
smpl_sentence_max_characters <- c(max(nchar(blogs.sentences)),
max(nchar(news.sentences)),
max(nchar(tweets.sentences)))
smpl_sentence_num_words <- c(stri_stats_latex(blogs.sentences)["Words"],
stri_stats_latex(news.sentences)["Words"],
stri_enc_toutf8(tweets.sentences)["Words"])
smpl_sentence_num_lines <- c(length(blogs.sentences),
length(news.sentences),
length(tweets.sentences))
Summarize the data from the above summary calculations of the sampled, sentence datasets and create summary table
smplfiles <- c("Sampled Blogs Data", "Sampled News Data", "Sampled Twitter Data")
smpl_sentence_summary_table <- data.frame(smpl_sentence_mem, smpl_sentence_max_characters,
smpl_sentence_num_words, smpl_sentence_num_lines,
row.names = smplfiles)
colnames(smpl_sentence_summary_table) <- c("Memory Size (in MB)",
"Max characters in a line",
"Number of words",
"Number of lines (ie sentences)")
knitr::kable(smpl_sentence_summary_table, digits = 2, caption = "Summary Statistics of sampled & 'cleaned' datasets")
Memory Size (in MB) | Max characters in a line | Number of words | Number of lines (ie sentences) | |
---|---|---|---|---|
Sampled Blogs Data | 4.9 Mb | 911 | 421074 | 47918 |
Sampled News Data | 4.4 Mb | 367 | 346047 | 45491 |
Sampled Twitter Data | 1.7 Mb | 140 | NA | 20091 |
The next few steps will focus on building a collection of text documents (or Corpus). The collection of documents will be constructed using the sampled datasets with lines split into sentences. This corpus is what we will be working with for the rest of the exploration and analysis to come.
Construct the corpus using the tm package and do some cleaning to remove white spaces, punctuation, numbers etc
mycorpus <- VCorpus(DirSource("./final/"))
#Remove whitespaces
mycorpus <- tm_map(mycorpus, stripWhitespace)
#Remove profanity
profanity_file <- "profanity_list.txt"
# if the file does not exists, then download and unzip the file
if (!file.exists(profanity_file)) {
# download the file from CMU we
download.file("http://www.cs.cmu.edu/~biglou/resources/bad-words.txt", profanity_file)
}
# load the profane words from the file
profane_words <- readLines(profanity_file)
# remove the profane words from the corpus
mycorpus <- tm_map(mycorpus, removeWords, profane_words)
# convert all characters to lowercase
mycorpus <- tm_map(mycorpus, content_transformer(tolower))
# remove numbers
mycorpus <- tm_map(mycorpus, removeNumbers)
# remove punctuations
mycorpus <- tm_map(mycorpus, removePunctuation)
MyCorpus of text that was sampled and converted into sentences by line is now cleaned and ready for further processing.
In the last step, we made sure that the sampled data that was converted into sentences by line was cleaned (all other punctuation and numbers now removed, all info to lower case etc). Now we are ready to understand patterns in the words of the datasets and we will be creating visuals to understand these trends of words.To do this, we use a process called TOKENIZATION. This means finding a sequence of words that appear together in a corpous. For example, a ngram of 1 would mean token words that appear the most in the corpus. A ngram of 2 would be a token of two words that appear together repeatedly in the corpus. A ngram of 3 would be a continuous token of three words that appear together in the corpus.
Unigram tokenizer with frequencies
NLPUnigramTokenizer <- function(x) {
unlist(lapply(ngrams(words(x), 1), paste, collapse=""), use.names=FALSE)
}
tdm1_NLP <- TermDocumentMatrix(mycorpus, control=list(tokenize=NLPUnigramTokenizer))
inspect(tdm1_NLP)
## <<TermDocumentMatrix (terms: 52976, documents: 3)>>
## Non-/sparse entries: 77905/81023
## Sparsity : 51%
## Maximal term length: 89
## Weighting : term frequency (tf)
## Sample :
## Docs
## Terms sample.blogs sample.news sample.tweets
## and 12235 8934 1797
## but 2266 1471 529
## for 4039 3496 1686
## have 2480 1420 729
## that 5056 3401 1004
## the 20740 19591 4078
## this 2865 1231 657
## was 3193 2265 510
## with 3205 2600 714
## you 3282 933 2324
Bigram tokenizer with frequencies
NLPBigramTokenizer <- function(x) {
unlist(lapply(ngrams(words(x), 2), paste, collapse=""), use.names=FALSE)
}
tdm2_NLP <- TermDocumentMatrix(mycorpus, control=list(tokenize=NLPBigramTokenizer))
inspect(tdm2_NLP)
## <<TermDocumentMatrix (terms: 419250, documents: 3)>>
## Non-/sparse entries: 484180/773570
## Sparsity : 62%
## Maximal term length: 91
## Weighting : term frequency (tf)
## Sample :
## Docs
## Terms sample.blogs sample.news sample.tweets
## andthe 649 524 65
## atthe 571 614 161
## forthe 641 694 320
## ina 508 530 84
## inthe 1744 1771 344
## ofthe 2000 1759 230
## onthe 829 713 202
## tobe 747 485 209
## tothe 1026 840 221
## withthe 509 434 72
Trigram tokenizer with frequencies
NLPtrigramTokenizer <- function(x) {
unlist(lapply(ngrams(words(x), 3), paste, collapse=""), use.names=FALSE)
}
tdm3_NLP <- TermDocumentMatrix(mycorpus, control=list(tokenize=NLPtrigramTokenizer))
inspect(tdm3_NLP)
## <<TermDocumentMatrix (terms: 745951, documents: 3)>>
## Non-/sparse entries: 778240/1459613
## Sparsity : 65%
## Maximal term length: 94
## Weighting : term frequency (tf)
## Sample :
## Docs
## Terms sample.blogs sample.news sample.tweets
## alotof 135 110 30
## aswellas 87 67 0
## goingtobe 61 66 30
## itwasa 74 50 13
## oneofthe 169 151 26
## outofthe 59 50 12
## someofthe 64 52 5
## theendof 57 40 14
## theus 31 104 4
## tobea 70 55 22
As seen in the results, the inspect call provides a view of some ngrams but not necessarily in order by frequency (popularity). To sort the ngrams in terms of most frequently used, we use the tdm freq function we will define below. Then, right after that we would want to visualize these most frequently used ngrams so we build plots of the results of tdm_freq which sorted the ngrams by frequency. (It will plot the 15 most frequently used ngrams)
tdm_freq <- function(tdm) {
m <- as.matrix(tdm)
wordsums <- rowSums(m)
frequency <- sort(wordsums, decreasing=TRUE)
return(data.frame(words=names(frequency), frequency=frequency, row.names = c()))
}
freq_plot <- function(freq, label) {
freq <- freq[1:15, ]
ggplot(freq, aes(reorder(words, frequency), frequency, fill=words)) +
geom_text(aes(label = freq$frequency), vjust=-0.5, size=3) +
geom_bar(stat = "identity") +
theme(axis.text.x = element_text(angle = 45, hjust = 0.5, size = 10)) +
labs(x=paste(label, "Tokens")) +
labs(y="Frequency") +
labs(title=paste("Frequency of", label, "tokens"))
}
Here are the ngrams plots of the most frequently used tokens from the corpus (of all three datasets)
uniFreq <- tdm_freq(tdm1_NLP)
freq_plot(uniFreq, "1-gram")
BiFreq <- tdm_freq(tdm2_NLP)
freq_plot(BiFreq, "2-gram")
TriFreq <- tdm_freq(tdm3_NLP)
freq_plot(TriFreq, "3-gram")
Another way we could consider exploring the frequency of ngrams is by using wordclouds. A wordcloud is a visual representation of words (tokens in our case) where bigger words indicate that the word appears more frequently, and smaller words appear less frequently.
par(mfrow=c(1,3))
colorFunc <- colorRampPalette(c("red", "blue"))
colors <- colorFunc(15)
#1gram word cloud
wordcloud(uniFreq[1:15, ]$words, uniFreq[1:15, ]$frequency,
colors=colors, scale=c(5,1), ordered.colors=TRUE, random.order=TRUE)
text(x=0.5, y=-0.1, font=2, "Top 15 1-gram tokens")
#2gram wordcloud
wordcloud(BiFreq[1:15, ]$words, BiFreq[1:15, ]$frequency,
colors=colors, scale=c(4,1), ordered.colors=TRUE, random.order=TRUE)
text(x=0.5, y=-0.1, font=2, "Top 15 2-gram tokens")
#3gram wordcloud
wordcloud(TriFreq[1:15, ]$words, TriFreq[1:15, ]$frequency,
colors=colors, scale=c(3,1), ordered.colors=TRUE, random.order=TRUE)
text(x=0.5, y=-0.1, font=2, "Top 15 3-gram tokens")
Next we will consider how many 1gram tokens cover 50% of the language and how many 1gram tokens cover 90% of all word instances from the language. We assume that the sample dataset that we prepared is representative of the compilation of the entire language. First we calculate the cumulative frequencies of the word instances and plot the data
getCumulativeFreq <- function(tokens) {
return(cumsum(tokens$frequency))
}
getCoverage <- function(cumdist, total, target) {
targetfreq <- totalfreq * target
return(which(cumdist >= targetfreq)[1])
}
# Cumulative distribution of word frequencies
cumulativedist <- getCumulativeFreq(uniFreq)
# total number of word instances
totalfreq <- cumulativedist[dim(uniFreq)[1]]
# index of word that covers 50% of word instances
c50 <- getCoverage(cumulativedist, totalfreq, 0.5)
# index of word that covers 90% of word instances
c90 <- getCoverage(cumulativedist, totalfreq, 0.9)
Plot the cumulative frequencies of the 1gram tokens
plot_coverage <- function(uniFreq, cumulativedist, total_freq, c50, c90, ptitle) {
# plot the cumulative distribution
x_50 <- c50 + 3700
y_50 <- totalfreq*0.5 - 11000
x_90 <- c90 + 3700
y_90 <- totalfreq*0.9 - 11000
qplot(c(1:dim(uniFreq)[1]), cumulativedist, geom="line",
xlab="Number of 1-gram Tokens",
ylab = "Cumulative Frequency of the Tokens",
main = ptitle) +
geom_vline(aes(xintercept=c50), colour="red") +
geom_hline(aes(yintercept=totalfreq*0.5), colour="red") +
geom_vline(aes(xintercept=c90), colour="blue") +
geom_hline(aes(yintercept=totalfreq*0.9), colour="blue") +
annotate("text", x=x_50, y=y_50, label="50% coverage", size=3) +
annotate("text", x=x_90, y=y_90, label="90% coverage", size=3)
}
plot_coverage(uniFreq, cumulativedist, total_freq, c50, c90, "Cumulative Frequencies with 50% & 90% Coverage Intercepts")
Compile the above coverage results into a data frame to make a table
nw <- c(c50, c90)
nratio <- nw*100/dim(uniFreq)[1]
label <- c("50% coverage", "90% coverage")
ndf <- data.frame(words = nw, percentage = nratio, row.names=label)
colnames(ndf) <- c("Number of tokens", "Percentage of total tokens")
knitr::kable(ndf, digits = 2, caption = "Coverage statistics")
Number of tokens | Percentage of total tokens | |
---|---|---|
50% coverage | 327 | 0.62 |
90% coverage | 10190 | 19.24 |
Assessment of the above coverage analysis: From the data in the table, we see that 50% coverage is achieved with 0.62%of total words in the corpus. Also, we see that it requires only 19.24% of total words to achieve 90% coverage.
Stemming coverage: To increase the coverage of words, we consider stemming. Stemming recognizes all variations of words and summarizes the collective frequency. For example, stemming will reduce each of the following words: rain, rained, rains, raining, to just RAIN. It may be reasonable to predict tokens that are stemmed in order to provide a quicker prediction of words for people who are texting on their phones.
mycorpus_stemmed <- tm_map(mycorpus, stemDocument)
NPLStemmedTokenizer <- function(x){
unlist(lapply(ngrams(words(x), 1), paste, collapse=""), use.names=FALSE)
}
stemmed.TDM <- TermDocumentMatrix(mycorpus_stemmed, control=list(tokenize=NPLStemmedTokenizer))
stemmed.freq <- tdm_freq(stemmed.TDM)
stemmed.cumulativedist <- getCumulativeFreq(stemmed.freq)
stemmed.total_freq <- stemmed.cumulativedist[dim(cumulativedist)[1]]
stemmed.c50 <- getCoverage(stemmed.cumulativedist, stemmed.total_freq, 0.5)
stemmed.c90 <- getCoverage(stemmed.cumulativedist, stemmed.total_freq, 0.9)
plot_coverage(stemmed.freq, stemmed.cumulativedist, stemmed.total_freq,
stemmed.c50, stemmed.c90,
"Cumulative Frequencies for Stemmed Corpus with Coverage Intercepts")
Compile the above stemmed coverage results into a data frame to make a table
stemmed.nw <- c(stemmed.c50, stemmed.c90)
stemmed.nratio <- nw*100/dim(stemmed.freq)[1]
label <- c("50% coverage", "90% coverage")
stemmed.ndf <- data.frame(words = stemmed.nw, percentage = stemmed.nratio, row.names=label)
colnames(stemmed.ndf) <- c("Number of tokens", "Percentage of total tokens")
knitr::kable(stemmed.ndf, digits = 2, caption = "Coverage statistics of stemmed corpus")
Number of tokens | Percentage of total tokens | |
---|---|---|
50% coverage | 250 | 0.84 |
90% coverage | 5354 | 26.33 |
Assessment of results from original and stemmed coverage assessments:
From the data in the table, we see that 50% coverage is achieved with 0.84% of total words in the corpus. Also, we see that it requires now 26.33% of total words to achieve 90% coverage. Thus the coverage was increased (percentages both went up compared to non stemming version). So coverage was increased, but prediction was decreased because stemming combines several variations of a word and may not predict the actual variation of the word you were about to type. Further investigation is needed to understand the tradeoff between the increased coverage and possibly decreased accuracy.
Finally we just want to check the ratio of non english words to english words. If the ratio is small then the presence of foreign wods can be considered negligable.
url <- "http://cran.us.r-project.org/src/contrib/Archive/cldr/cldr_1.1.0.tar.gz"
pkgFile<-"cldr_1.1.0.tar.gz"
download.file(url = url, destfile = pkgFile)
install.packages(pkgs=pkgFile, type="source", repos=NULL)
## Warning: running command '"C:/Coursera/R-33~1.2/bin/x64/R" CMD INSTALL -l
## "C:\Coursera\R-3.3.2\library" "cldr_1.1.0.tar.gz"' had status 1
## Warning in install.packages(pkgs = pkgFile, type = "source", repos = NULL):
## installation of package 'cldr_1.1.0.tar.gz' had non-zero exit status
unlink(pkgFile)
library(cldr)
sentences <- c(blogs.sentences, news.sentences, tweets.sentences)
token.language <- detectLanguage(sentences)
english.words <- which(token.language$percentScore1 > 50 & token.language$detectedLanguage == "ENGLISH")
foreign.words <- which(token.language$percentScore1 > 50 & token.language$detectedLanguage != "ENGLISH")
round(length(foreign.words)*100/length(english.words), 2)
## [1] 5.32
The ratio of foreign words over English words is 5.32%. Since the ratio is small, the presence of foreign words in the corpus does not impact the algorithm significantly.
The next step in the project is to design and create a predication model. The model will be based on the sample/training (and testing) datasets and also the ngrams freqs assessment calculated in this exploratory analysis. We will train the prediction model using the ngram with the highest coverage and accuracy to predict the next words that will be chosen by someone on their mobile phones given some text. Then we will develop this tool using the Shiny App.