This is a milestone report for the Coursera Capstone Project. Its goal is to present an initial exploratory data analysis of the US English (en_US) dataset, which includes three kinds of files: twitter, news and blogs.
For this project we will mainly use the quanteda, ggplot2, knitr and RColorBrewer packages.
library(quanteda)
library(ggplot2)
library(knitr)
library(RColorBrewer)
setwd("D:/001 -- Coursera/Capstone Project/Coursera-SwiftKey/final/en_US")
set.seed(12345) # For reproducibility
The files that we will use in the project are each larger than 150 MB. To carry out the exploratory data analysis with an acceptable runtime, I will use only 1% of the lines of each file (see the Subset Lines column in the table that follows).
These are the main characteristics of the files:
# Print the basic information about the files.
kable(dt)
| Filename | Filesize | Total Lines | Subset Lines |
|---|---|---|---|
| en_US.twitter.txt | 167105338 | 2360148 | 23601 |
| en_US.news.txt | 205811889 | 77259 | 772 |
| en_US.blogs.txt | 210160014 | 899288 | 8992 |
Accordingly, we will use only 23601, 772 and 8992 lines of the twitter, news and blogs datasets, respectively.
Let’s see some examples of the content of each file:
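The subset sizes in the table are consistent with keeping one line out of every hundred. The following is only a minimal sketch of how such a subset could be drawn, assuming simple random sampling; the vector `all_lines` is a hypothetical stand-in for the lines read from one of the files, and the report's actual sampling code is not shown here.

```r
# Minimal sketch: draw a 1% random sample of lines (base R only).
set.seed(12345)  # same seed as in the report, for reproducibility

# Hypothetical stand-in for the lines of one file; any character vector works.
all_lines <- sprintf("line %d", 1:2500)

# Integer division avoids floating-point surprises when computing 1%.
subset_size <- length(all_lines) %/% 100
subset_lines <- sample(all_lines, subset_size)

# The same rule reproduces the "Subset Lines" column of the table.
table_sizes <- c(twitter = 2360148, news = 77259, blogs = 899288) %/% 100
```

With the real data, `all_lines` would be replaced by the character vector read from each file.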
twitter.data[1:3]
## [1] "Listening to the #socialmedia experts from at 's Cyberposium now! They are awesome!"
## [2] "god I miss those"
## [3] "kay! Thanks(:"
news.data[1:3]
## [1] "But here in Minnesota, Dayton and other Democrats can't seriously mean to argue both that the policies of George Bush, who left office three years ago, are wholly responsible for today's continuing national recession, and that the policies of Tim Pawlenty, who left office 11 months ago, have had nothing whatever to do with Minnesota's above-average recovery over the past four years."
## [2] "Jets: D+"
## [3] "President Obama has been criticized -- including in a Times editorial -- for his ahistorical statement that it would be \"an unprecedented, extraordinary step\" if the court overturned \"a law that was passed by a strong majority of a democratically elected Congress.\" The law in question, of course, was \"Obamacare,\" and the president seemed to be trying to muscle the court into upholding it by lamenting \"that an unelected group of people would somehow overturn a duly constituted and passed law.\" That was unseemly."
blogs.data[1:3]
## [1] "It contains the full version of my complaint submitted to the BBC at:"
## [2] "Pasir Ris Park Town Beach and Pasir Ris Park Beach are connected by a bridge. From here, you can connect to the Bedok Park Connector Network, Tampines Park Connector Network or even go to Changi Beach and East Coast if you have the energy. At the end of this article, there is a brochure, from the National Parks, on the Eastern Coastal Park Connector."
## [3] "Cheese: Camembert Tremblaye: shhhh, a thermalized Camembert from France."
Let’s see some information about the number of characters per line in each dataset:
kable(dt)
| Filename | Filesize | Total Lines | Subset Lines | Max Chars per Line | Avg Chars per Line | Min Chars per Line |
|---|---|---|---|---|---|---|
| en_US.twitter.txt | 167105338 | 2360148 | 23601 | 140 | 64 | 4 |
| en_US.news.txt | 205811889 | 77259 | 772 | 782 | 188 | 7 |
| en_US.blogs.txt | 210160014 | 899288 | 8992 | 4596 | 159 | 1 |
We can see that the longest line belongs to the blogs dataset (4596 characters), but the news dataset has the longest lines on average (188 characters). The maximum for twitter is exactly 140, the maximum length that the platform allowed at the time of writing.
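As an illustration of how the per-line character statistics can be obtained, here is a minimal sketch using base R's `nchar()`. The vector `sample_lines` reuses three lines shown earlier in this report; the names `sample_lines`, `num_char` and `char_stats` are only illustrative and are not taken from the report's hidden code.

```r
# Minimal sketch: characters per line for a small character vector.
sample_lines <- c("god I miss those", "kay! Thanks(:", "Jets: D+")

# nchar() is vectorised, so it returns one count per line.
num_char <- nchar(sample_lines)

# Summary in the same spirit as the Max/Avg/Min columns of the table.
char_stats <- c(max = max(num_char),
                avg = round(mean(num_char)),
                min = min(num_char))
```

Applied to the full sampled datasets, the same three summaries would fill the Max, Avg and Min Chars per Line columns.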
Let’s make some plots in order to see that:
# Print Histogram of Number of Characters per Line
ggplot(data=lines_char.all, aes(x=num_char, fill=type)) +
geom_histogram() +
facet_wrap(~ type, ncol = 1, scales="free") +
labs(title="Histogram for Number of Characters per Line") +
labs(x="Number of Characters",y="Number of Lines")
ggplot(data=lines_char.all, aes(x=num_char, fill=type, colour=type)) +
geom_freqpoly() +
labs(title="Histogram for Number of Characters per Line") +
labs(x="Number of Characters",y="Number of Lines")
Let’s look at the number of words per line. For this step, we consider a “word” to be any group of characters separated by a single space (" "). We will use the following function to count the number of words per line:
f_num_words <- function(x) length(unlist(strsplit(x,split=" ")))
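The function works on one line at a time, so it has to be applied to each element of a dataset. A minimal sketch of that step, assuming `sapply()` is used (the report does not show this part), with three lines taken from the examples above:

```r
# Word-count function as defined in the report.
f_num_words <- function(x) length(unlist(strsplit(x, split = " ")))

# Three example lines shown earlier in this report.
sample_lines <- c("god I miss those", "kay! Thanks(:", "Jets: D+")

# Apply the function to every line; USE.NAMES = FALSE keeps a plain vector.
num_words <- sapply(sample_lines, f_num_words, USE.NAMES = FALSE)
```

Note that, with this definition, two consecutive spaces produce an empty string that is counted as a word, so the counts are only approximate for lines with irregular spacing.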
Let’s see the minimum, average and maximum number of words per line for each type of file:
kable(dt)
| Filename | Filesize | Total Lines | Subset Lines | Max Chars per Line | Avg Chars per Line | Min Chars per Line | Max Words per Line | Avg Words per Line | Min Words per Line |
|---|---|---|---|---|---|---|---|---|---|
| en_US.twitter.txt | 167105338 | 2360148 | 23601 | 140 | 64 | 4 | 39 | 12 | 1 |
| en_US.news.txt | 205811889 | 77259 | 772 | 782 | 188 | 7 | 135 | 32 | 1 |
| en_US.blogs.txt | 210160014 | 899288 | 8992 | 4596 | 159 | 1 | 715 | 29 | 1 |
The results are similar to those of the character analysis. In terms of average words per line, blogs and news are very similar (29 and 32), and the maximum number of words per line in the blogs dataset is much larger than in the others (715 compared with 135 and 39).
Let’s see that information with some plots:
ggplot(data=lines_word.all,aes(x=num_words, fill=type)) +
geom_histogram() +
facet_wrap(~ type, ncol = 1, scales="free") +
labs(title="Histogram for Number of Words per Line") +
labs(x="Number of Words per Line",y="Frequency")
ggplot(data=lines_word.all,aes(x=num_words,fill=type,colour=type)) +
geom_freqpoly() +
labs(title="Histogram for Number of Words per Line") +
labs(x="Number of Words per Line",y="Frequency")
Let’s build the corpus for each of the files, to be used in future analysis, using the corpus() function of the quanteda library:
twitter.docvars <- data.frame(Source = rep("twitter",lines.twitter.data))
blogs.docvars <- data.frame(Source = rep("blogs",lines.blogs.data))
news.docvars <- data.frame(Source = rep("news",lines.news.data))
twitter.corpus <- corpus(twitter.data, docvars = twitter.docvars)
news.corpus <- corpus(news.data, docvars = news.docvars)
blogs.corpus <- corpus(blogs.data, docvars = blogs.docvars)
## Let's see information about the corpus
summary(twitter.corpus,1)
## Corpus consisting of 23601 documents, showing 1 document.
##
## Text Types Tokens Sentences Source
## text1 15 16 2 twitter
##
## Source: D:/001 -- Coursera/Capstone Project/Coursera-SwiftKey/final/en_US/* on x86-64 by enrique
## Created: Sun Jun 12 17:04:53 2016
## Notes:
summary(news.corpus,1)
## Corpus consisting of 772 documents, showing 1 document.
##
## Text Types Tokens Sentences Source
## text1 54 71 1 news
##
## Source: D:/001 -- Coursera/Capstone Project/Coursera-SwiftKey/final/en_US/* on x86-64 by enrique
## Created: Sun Jun 12 17:04:53 2016
## Notes:
summary(blogs.corpus,1)
## Corpus consisting of 8992 documents, showing 1 document.
##
## Text Types Tokens Sentences Source
## text1 13 14 1 blogs
##
## Source: D:/001 -- Coursera/Capstone Project/Coursera-SwiftKey/final/en_US/* on x86-64 by enrique
## Created: Sun Jun 12 17:04:53 2016
## Notes:
We can see that the twitter, news and blogs corpora have 23601, 772 and 8992 documents, respectively (equal to the number of lines in each dataset).
With the quanteda package it is very simple to create a new corpus by combining the previous ones:
all.corpus <- (twitter.corpus + news.corpus) + blogs.corpus
summary(all.corpus,1)
## Corpus consisting of 33365 documents, showing 1 document.
##
## Text Types Tokens Sentences Source
## text1 15 16 2 twitter
##
## Source: Combination of corpuses (twitter.corpus + news.corpus) and blogs.corpus
## Created: Sun Jun 12 17:04:53 2016
## Notes:
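We can see that this new corpus has 33365 documents (the sum of the number of documents in each corpus: 23601 + 772 + 8992).
Let’s build the document-feature matrix using the dfm() function to analyze the features and their frequencies. We will also clean the data by converting the text to lowercase and removing numbers, punctuation, separators, Twitter characters and English stopwords (no stemming is applied):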
all.dfm <- dfm(all.corpus,
toLower = TRUE,
removeNumbers = TRUE,
removePunct = TRUE,
removeSeparators = TRUE,
removeTwitter = TRUE,
stem = FALSE,
language = "english",
ignoredFeatures = stopwords("english"))
## Creating a dfm from a corpus ...
## ... lowercasing
## ... tokenizing
## ... indexing documents: 33,365 documents
## ... indexing features: 44,290 feature types
## ... removed 173 features, from 174 supplied (glob) feature types
## ... created a 33365 x 44118 sparse dfm
## ... complete.
## Elapsed time: 2.07 seconds.
## Total number of features (words)
print(num.words.all <- nfeature(all.dfm))
## [1] 44118
## Information of dfm
head(all.dfm,5)
## Document-feature matrix of: 33,365 documents, 44,118 features.
## (showing first 5 documents and first 6 features)
## features
## docs listening socialmedia experts s cyberposium now
## text1 1 1 1 1 1 1
## text2 0 0 0 0 0 0
## text3 0 0 0 0 0 0
## text4 0 0 0 0 0 0
## text5 0 0 0 0 0 0
We can see that the total number of features is 44118; that is the number of distinct words included in the corpora after cleaning.
We can use the document-feature matrix created previously to analyze the frequency and accumulated frequency of words in the corpora.
Let’s see the top-20 words and the frequency related information:
head(all.words,20)
## freq acumfreq perfreq acumperfreq word numword
## just 2636 2636 0.7021862 0.7021862 just 1
## like 2327 4963 0.6198738 1.3220600 like 2
## will 2179 7142 0.5804491 1.9025091 will 3
## one 2168 9310 0.5775189 2.4800279 one 4
## can 1944 11254 0.5178490 2.9978769 can 5
## get 1854 13108 0.4938745 3.4917514 get 6
## time 1758 14866 0.4683017 3.9600532 time 7
## good 1548 16414 0.4123612 4.3724144 good 8
## love 1479 17893 0.3939808 4.7663952 love 9
## day 1477 19370 0.3934480 5.1598433 day 10
## now 1469 20839 0.3913170 5.5511602 now 11
## know 1356 22195 0.3612157 5.9123759 know 12
## new 1282 23477 0.3415033 6.2538792 new 13
## go 1206 24683 0.3212582 6.5751374 go 14
## see 1203 25886 0.3204590 6.8955964 see 15
## great 1111 26997 0.2959518 7.1915482 great 16
## people 1089 28086 0.2900913 7.4816395 people 17
## back 1088 29174 0.2898250 7.7714645 back 18
## think 1051 30225 0.2799688 8.0514333 think 19
## make 1028 31253 0.2738420 8.3252752 make 20
We can see the top-20 words. The meaning of each column is:
freq: number of times that the word appears in the corpora.
acumfreq: accumulated frequency (the word's frequency plus that of the previous n-1 words).
perfreq: percentage frequency of the word in the corpora.
acumperfreq: accumulated percentage frequency (the word's percentage plus that of the previous n-1 words).
numword: the number of words included in the accumulated values.
For example, the word now in row 11 appears 1469 times in the corpora (freq), which is equivalent to 0.39% of the total corpora size (perfreq). Because the table is sorted by frequency, the accumulated frequency of the word now is 20839, that is, the sum of the frequencies of all the previous words in the table plus its own freq value: 19370 + 1469 = 20839 (acumfreq), which is equivalent to 5.55% of the total corpora size (acumperfreq). We can also see that only 11 words are needed to cover 5.55% of the corpora (numword).
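The code that builds this table is not shown in the report. As a rough sketch of how columns with these meanings could be computed, assuming only a named vector of word counts is available, the example below uses toy counts (hypothetical values, not taken from the data) and a hypothetical data frame name `toy.words`:

```r
# Minimal sketch with toy counts (hypothetical values, not from the report's data).
freq <- sort(c(just = 5, like = 3, will = 2), decreasing = TRUE)
total <- sum(freq)

toy.words <- data.frame(
  freq        = freq,                        # occurrences of each word
  acumfreq    = cumsum(freq),                # running total of occurrences
  perfreq     = 100 * freq / total,          # share of all occurrences, in %
  acumperfreq = 100 * cumsum(freq) / total,  # running share, in %
  word        = names(freq),
  numword     = seq_along(freq)              # rank of the word in the sorted table
)
```

Sorting by decreasing frequency before taking the cumulative sums is what makes numword readable as "number of words needed to reach this coverage".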
Let’s plot the top-20 words and also make the word cloud:
## Plot Top 20 Words
ggplot(data=head(all.words,20),
aes(x=reorder(word,-freq), y=freq)) +
geom_bar(stat ="identity", position= "identity") +
theme(axis.text.x = element_text(angle = 90, vjust = 0, hjust = 1,size=12)) +
labs(title="Top 20 Words") +
labs(x="Top Words",y="Count")
## Plot word cloud with top 20 words
plot(all.dfm, max.words = 20, random.order = FALSE, colors = brewer.pal(6, "Dark2"))
We can use the previous information to find how many unique words we need in order to cover 50% and 90% of all word instances in the corpora (the numword value associated with the desired acumperfreq value):
head(subset(all.words, acumperfreq >= 50 & acumperfreq <= 51),1)
## freq acumfreq perfreq acumperfreq word numword
## hoping 74 187757 0.01971236 50.01532 hoping 820
head(subset(all.words, acumperfreq >= 90 & acumperfreq <= 91),1)
## freq acumfreq perfreq acumperfreq word numword
## disciplined 3 337860 0.0007991497 90.00024 disciplined 13493
We can see that we need only approximately 820 words to cover 50% of the word instances, and 13493 to cover 90%.
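The lookup can also be written without choosing an upper bound such as 51 or 91. The sketch below is one possible alternative, not the report's method: the helper name `words_for_coverage` and the toy cumulative percentages are hypothetical.

```r
# Minimal sketch: index of the first word whose cumulative coverage reaches a target.
# `acumperfreq` must be sorted in increasing order, as in a frequency-sorted table.
words_for_coverage <- function(acumperfreq, target) {
  which(acumperfreq >= target)[1]
}

# Toy cumulative percentages (hypothetical values).
toy.acumperfreq <- c(50, 70, 85, 95, 100)

n50 <- words_for_coverage(toy.acumperfreq, 50)
n90 <- words_for_coverage(toy.acumperfreq, 90)
```

If the table has a numword column equal to the row position, the returned index is the same value that the subset() calls report.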
In the following plot we can observe this:
ggplot(data=all.words, aes(x=numword, y=acumperfreq)) +
geom_line(stat ="identity", position= "identity",size=1.2, colour="black") +
geom_text(data=subset(all.words, numword == 820 | numword == 13493),
aes(label=paste("(",acumperfreq,",",numword,")")),
hjust = 1.2, vjust = -0.4) +
geom_vline(xintercept = 820, color="red") +
geom_hline(aes(yintercept=50), color="red") +
geom_vline(xintercept = 13493, color="blue") +
geom_hline(aes(yintercept=90), color="blue") +
labs(title="Number of Words needed to Cover all Words Instances") +
scale_x_continuous( trans = "log10") +
labs(x="Number of Words",y="%Coverage")
We can use those results in the prediction algorithm in order to speed up the processing time.
Let’s analyze the n-grams of the dataset.
To check the unigrams, bigrams and trigrams of the whole dataset, we will use the dfm() function from quanteda with its ngrams parameter.
By default, the dfm() function calculates the unigrams of the texts, so the results that we got in section 2.6 Word Analysis with Document Feature Matrix correspond to the unigrams of the corpora.
Let’s calculate the bigrams and trigrams of the corpora and plot the top-20 n-grams and the word cloud for each one:
We will calculate the bigrams of the corpora through the dfm() function, using the same cleaning options as for the unigrams.
all.bigrams.dfm <- dfm(all.corpus,
toLower = TRUE,
removeNumbers = TRUE,
removePunct = TRUE,
removeSeparators = TRUE,
removeTwitter = TRUE,
stem = FALSE,
language = "english",
ignoredFeatures = stopwords("english"),
ngrams=2)
## Creating a dfm from a corpus ...
## ... lowercasing
## ... tokenizing
## ... indexing documents: 33,365 documents
## ... indexing features: 313,479 feature types
## ... removed 178,553 features, from 174 supplied (glob) feature types
## ... created a 33365 x 134927 sparse dfm
## ... complete.
## Elapsed time: 8.9 seconds.
We can see that 313479 bigram feature types were indexed, and that the resulting dfm keeps 134927 of them after the stopword filter.
Let’s plot the top-20 bigrams and the related word cloud:
top20.bigrams <- data.frame(topfeatures(all.bigrams.dfm,20))
colnames(top20.bigrams)[1] <- "freq"
top20.bigrams[,2] <- rownames(top20.bigrams)
colnames(top20.bigrams)[2] <- "bigrams"
ggplot(data=top20.bigrams,aes(x=reorder(bigrams,-freq), y=freq)) +
geom_bar(stat ="identity", position= "identity") +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
labs(title="Top 20 Bigrams") +
labs(x="bigrams",y="Count")
## Plot word cloud with top 20 bigrams
plot(all.bigrams.dfm, max.words = 20, random.order = FALSE,
rot.per=0.35,scale=c(3,0.5),
colors = brewer.pal(6, "Dark2"))
Let’s calculate the trigrams of the Corpora, using the same dfm() function:
all.trigrams.dfm <- dfm(all.corpus,
toLower = TRUE,
removeNumbers = TRUE,
removePunct = TRUE,
removeSeparators = TRUE,
removeTwitter = TRUE,
stem = FALSE,
language = "english",
ignoredFeatures = stopwords("english"),
ngrams=3)
## Creating a dfm from a corpus ...
## ... lowercasing
## ... tokenizing
## ... indexing documents: 33,365 documents
## ... indexing features: 531,460 feature types
## ... removed 462,272 features, from 174 supplied (glob) feature types
## ... created a 33365 x 69189 sparse dfm
## ... complete.
## Elapsed time: 14.41 seconds.
We can see that 531460 trigram feature types were indexed, and that the resulting dfm keeps 69189 of them after the stopword filter.
Let’s plot the top-20 trigrams and the related word cloud:
top20.trigrams <- data.frame(topfeatures(all.trigrams.dfm,20))
colnames(top20.trigrams)[1] <- "freq"
top20.trigrams[,2] <- rownames(top20.trigrams)
colnames(top20.trigrams)[2] <- "trigrams"
ggplot(data=top20.trigrams,aes(x=reorder(trigrams,-freq), y=freq)) +
geom_bar(stat ="identity", position= "identity") +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
labs(title="Top 20 Trigrams") +
labs(x="trigrams",y="Count")
## Plot word cloud with top 20 trigrams
plot(all.trigrams.dfm, max.words = 20, random.order = FALSE, scale=c(2,0.5),
colors = brewer.pal(6, "Dark2"))
We can see that some of the trigrams should be cleaned: trigrams like happy mother’s day and happy mothers day are the same and should be counted as one, and repetitions like please please please and love love love must also be handled. This will be taken into consideration in the next steps.
One option to filter the corpora and remove the non-English words is to use an English dictionary. It is necessary to find a suitable English dictionary and work with it.
Another option is a library named textcat, which performs text categorization based on n-grams through the function textcat(). Let’s see how this works:
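One possible way to handle both cases is sketched below. It is only an illustration under stated assumptions: the helper names `normalize_ngram` and `is_repetition` are hypothetical, the n-gram words are assumed to be separated by a single space, and only the plain ASCII apostrophe is removed.

```r
# Minimal sketch: normalise apostrophes and detect repeated-word n-grams.

# Lowercase the n-gram and drop ASCII apostrophes, so that both spellings
# of "mother's" collapse into the same string.
normalize_ngram <- function(x) gsub("'", "", tolower(x), fixed = TRUE)

# TRUE when an n-gram is the same word repeated (e.g. "love love love").
# Assumes the words are separated by a single space.
is_repetition <- function(x) {
  words <- strsplit(x, " ", fixed = TRUE)[[1]]
  length(words) > 1 && length(unique(words)) == 1
}

same_trigram <- normalize_ngram("happy mother's day") == normalize_ngram("happy mothers day")
repeated <- is_repetition("please please please")
```

After normalisation, the counts of the merged spellings would need to be added together, and repeated-word n-grams could be dropped or down-weighted, depending on what works best for the prediction step.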
library(textcat)
textcat(c("This is a english sentence",
"Esto es una oracion en espanol",
"This is esto es datos",
"madre",
"father",
"bonjour",
"merci"),
p=ECIMCI_profiles)
## [1] "en" "es" "pt" "no" "en" "fr" "it"
It works well but is not perfect: the mixed sentence is labeled "pt", madre is labeled "no" and merci is labeled "it". Let’s see how this works with our data:
all.tokens <- toLower(
tokenize(all.corpus, what = "fasterword",
removeNumbers = TRUE,
removePunct = TRUE,
removeSeparators = TRUE,
removeTwitter = TRUE,
removeURL = TRUE))
all.tokens[1:8]
## $text1
## [1] "listening" "to" "the" "socialmedia" "experts"
## [6] "from" "at" "s" "cyberposium" "now"
## [11] "they" "are" "awesome"
##
## $text2
## [1] "god" "i" "miss" "those"
##
## $text3
## [1] "kay" "thanks"
##
## $text4
## [1] "halftime" "show" "strong" "production" "she"
## [6] "can" "still" "move" "like" "jagger"
##
## $text5
## [1] "this" "week" "will" "forever" "be"
## [6] "known" "as" "the" "one" "with"
## [11] "all" "the" "unexpected" "meetings" "whether"
## [16] "ill" "accomplish" "anything" "i" "planned"
## [21] "is" "yet" "to" "be" "seen"
##
## $text6
## [1] "the" "cruel" "irony" "of" "this" "game"
## [7] "it" "was" "the" "pitching" "and" "3-4"
## [13] "spot" "that" "let" "us" "down" "and"
## [19] "of" "course" "we" "wouldnt" "have" "been"
## [25] "here" "without" "em"
##
## $text7
## [1] "thinking" "if" "i" "really" "want" "people"
## [7] "to" "come" "over" "saturday"
##
## $text8
## [1] "always" "smile" "bro" "follow" "me"
## [6] "belieberboy"
textcat(all.tokens[1:8], p=ECIMCI_profiles)
## text1 text2 text3 text4 text5 text6 text7 text8
## "en" "en" "en" "en" "en" "en" "en" "en"
There are still some problems that I need to fix regarding character encoding (some characters that the function does not handle). The following text is a good example of this issue; the function does not work with it:
all.corpus[9]
## text9
## "I think the reason I love #BoyMeetsWorld so much is because Cory and Topanga remind me of and me <f0><U+009F><U+0092><U+0097><f0><U+009F><U+0098><U+0098>"
The next steps that I will perform to create a prediction algorithm are:
Consider how to fix the issues with the textcat() function in order to remove the non-English words.
Consider how to reduce the number of n-grams, using stemming or additional filtering, to improve the runtime of the prediction algorithm.
Consider how to create the prediction algorithm using the n-grams, and how to predict words that cannot be handled by the n-grams alone.
Design the GUI of the Shiny app and start working on it, taking runtime and memory restrictions into consideration.