This is the first submission for the JH Data Science Specialization Capstone project on Coursera, and it consists of an exploratory data analysis of textual data. The final goal is to build a smart keyboard application in R/Shiny, able to propose guesses for the “best” next word the user will type. The prediction will be based on the data set that I am going to analyse now.
Here we are faced with text data from different sources (twitter, news and blog posts) and in different languages. Data is in English (en_US), German (de_DE), Finnish (fi_FI) and Russian (ru_RU).
My analysis concentrates on the English portion of the data set. I have a good knowledge of languages, coming from Italy (Italian), living in Belgium (French, Dutch), and having a basic knowledge of Spanish and German. The characteristics of these languages, take German for instance, are such that a good prediction algorithm should recognise and take into account the gender of nouns and the grammatical case of the parts of the sentence (mainly nominative, accusative and dative) in order to decline articles, adjectives and sometimes nouns (which is sometimes impossible to predict). Therefore, in my opinion, some knowledge of the language should be built into a prediction algorithm, and vocabulary reduction measures like stemming would most likely not work. This is probably also true for English and other languages.
The single zipped data file was downloaded from HC Corpora and unzipped to extract the text files.
# Loading libraries, in no particular order (many packages are available; it is sometimes hard to make choices)
library(tm)
library(dplyr)
library(stringi)
library(stringr)
library(qdap)
library(quanteda)
library(qdapRegex)
library(openNLP)
library(RWeka)
# library(htmlTable)
library(ggplot2)
library(gridExtra)
library(grid)
library(gtable)
library(knitr)
# Files (Commented out, I only did this once, then I load a saved file when I resume work)
# localFile <- "Coursera-SwiftKey.zip"
# dataFile <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
# Download
# download.file(dataFile, localFile)
# Unzip localFile
# unzip(localFile)
The size of the downloaded data file is 548 MB, which uncompresses to about 1.4 GB. The first thing to say is that there is no way to run any sort of brute-force algorithm keeping all of this data in memory, at least on my PC. Some selection has to be made, because the full data set is too big to explore directly before settling on a strategy. This is already apparent from the loading time of the individual files.
**Notes**
In principle I do not want to exclude the other languages from this exploration, because they present interesting features: in normal German sentences the verb is always in second position, and detecting this specific feature could improve prediction accuracy considerably (to be seen). But I am afraid that for the moment I will have to. Finnish is also interesting, especially from a software localization point of view: the space you have to reserve for the Finnish translation in dialog boxes is almost unpredictable, because the language is very unrelated to the other European languages, for which you typically allow about 30% more room than the English string needs. I was a software localization coordinator for Microsoft for a couple of years, so I know this from experience.
# Reading the files one by one to compute basic statistics for a summary table
# Twitter en_US
twEnUSSize<-file.size("data/final/en_US/en_US.twitter.txt")
twEnUs<-readLines("data/final/en_US/en_US.twitter.txt")
statsTwEnUS<-stri_stats_general(twEnUs)
statsLTwEnUS<-stri_stats_latex(twEnUs)
# News en_US
newsEnUSSize<-file.size("data/final/en_US/en_US.news.txt")
newsEnUs<-readLines("data/final/en_US/en_US.news.txt")
statsNewsEnUS<-stri_stats_general(newsEnUs)
statsLNewsEnUS<-stri_stats_latex(newsEnUs)
# Blogs en_US
blogsEnUSSize<-file.size("data/final/en_US/en_US.blogs.txt")
blogsEnUs<-readLines("data/final/en_US/en_US.blogs.txt")
statsBlogsEnUS<-stri_stats_general(blogsEnUs)
statsLBlogsEnUS<-stri_stats_latex(blogsEnUs)
# Compose a table with the values just computed (MWPL = Mean Words Per Line)
twStatsNames<-c("Source", "Size", "Lines", "Words", "Chars", "MWPL")
twStats<-matrix(nrow=3, ncol=6)
twStats[1,] <- c("Twitter", twEnUSSize, statsTwEnUS["Lines"], statsLTwEnUS["Words"], statsTwEnUS["Chars"], statsLTwEnUS["Words"] / statsTwEnUS["Lines"])
twStats[2,] <- c("News", newsEnUSSize, statsNewsEnUS["Lines"], statsLNewsEnUS["Words"], statsNewsEnUS["Chars"], statsLNewsEnUS["Words"] / statsNewsEnUS["Lines"] )
twStats[3,] <- c("Blogs", blogsEnUSSize, statsBlogsEnUS["Lines"], statsLBlogsEnUS["Words"], statsBlogsEnUS["Chars"], statsLBlogsEnUS["Words"] / statsBlogsEnUS["Lines"])
twStats<-as.data.frame(twStats)
colnames(twStats)<-twStatsNames
# I am unsatisfied with the looks of the table here...
#htmlTable(twStats, caption="Basic statistics about the en_US files", align="l|r|r|r|r|r|")
# grid.table(twStats)
table <- tableGrob(twStats)
title <- textGrob("Text files summary statistics",gp=gpar(fontsize=20))
footnote <- textGrob("Note: Only for the English Language files", x=0, hjust=0, gp=gpar( fontface="italic"))
padding <- unit(0.5,"line")
table <- gtable_add_rows(table, heights = grobHeight(title) + padding, pos = 0)
table <- gtable_add_rows(table, heights = grobHeight(footnote)+ padding)
table <- gtable_add_grob(table, list(title, footnote), t=c(1, nrow(table)), l=c(1,2), r=ncol(table))
grid.newpage()
grid.draw(table)
As can be seen from the generated table, the document with the largest number of lines but the smallest mean number of words per line is the Twitter document. Tweets are much shorter than blog posts and news abstracts, being limited to 140 characters and in general to short phrases. They probably also contain a lot of web references, which in my opinion do not constitute valid predictors. In the same way, email addresses should be removed; it is not clear at this point whether replacing them with a space or removing the entire sentence is the better option. Blog articles are the longest on average, with about 41.78 words per line.
Another aspect to evaluate is the total number of (non-unique) words in each text, which makes the three files rather similar, all falling between 30.45 and 37.57 million words.
To make further analysis tractable on this amount of data, I will study a sample of the files.
The first decision to take is the size of the sample we are going to analyse. This is not a trivial task: I do not want to undersample, but my computing resources are limited. So, loosely based on the resources mentioned in the references, which discuss sample size in NLP tasks, I arbitrarily take a sample of 10,000 lines from each of the documents. This should be enough to draw some initial conclusions. Results may vary slightly across samples, so this choice may change and the sampling should be repeated. We can, however, use the samples drawn to estimate the coefficients of Heaps’ law and then use them to estimate the vocabulary size needed to cover 50% and 90% of the total text. According to one of the resources I have researched, Heaps’ law can be roughly approximated by sqrt(text size), which would give approximately 10,125 vocabulary words for the total text.
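As a back-of-the-envelope check of that square-root approximation, the short sketch below assumes a total text size of roughly 102.5 million words (consistent with the per-file counts in the summary table above); the total is my own rounded assumption, not an exact count.
# Rough vocabulary estimate via the sqrt approximation of Heaps' law
totalWords <- 102.5e6        # assumed total for the three en_US files combined
round(sqrt(totalWords))      # ~10,100 terms, in line with the figure quoted above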
# Load all English documents as a corpus
if(!exists("enCorpus", mode = "any"))
enCorpus <- Corpus(DirSource(file.path(".", "data","final","en_US")),
readerControl=list(reader=readPlain, language="en_US"))
# Save it - as is - for future use (load it when needed or when resuming)
if (!file.exists("encorpus")) save(enCorpus,file="encorpus")
# Execute this when restarting before sampling...
# load("encorpus")
# Sampling: See discussion above
sampleSize = 10000
# Seed: make it reproducible
set.seed(20161120)
# Documents are read in alphabetic order, verified (not relevant for the analysis but good for consistency and overall correctness)
blSampleInd <- sample(1:length(enCorpus[[1]][[1]]), sampleSize, replace=FALSE)
neSampleInd <- sample(1:length(enCorpus[[2]][[1]]), sampleSize, replace=FALSE)
twSampleInd <- sample(1:length(enCorpus[[3]][[1]]), sampleSize, replace=FALSE)
# Corpus Reduction
enCorpus[[1]][[1]] <- enCorpus[[1]][[1]][blSampleInd]
enCorpus[[2]][[1]] <- enCorpus[[2]][[1]][neSampleInd]
enCorpus[[3]][[1]] <- enCorpus[[3]][[1]][twSampleInd]
# Saving reduced Corpus
# Save it - as is - for future use (load it when needed or when resuming after sampling)
if (!file.exists("reducedencorpus")) save(enCorpus,file="reducedencorpus")
# load("reducedencorpus")
# Reduced corpus size...
memSize <- object.size(enCorpus)
memSize
## 6246944 bytes
There is no use for misspelled words, numbers, web or email addresses, or Twitter characters like @ and #. However, the order of removal is important: email addresses should be removed before the Twitter markers, otherwise stripping the @ first would leave addresses that the email pattern can no longer match.
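A toy illustration of this ordering point, using qdapRegex's rm_email on a made-up string: once the @ has been stripped, the email pattern no longer matches.
# Toy illustration of why removal order matters (made-up example string)
rm_email("contact me at jane.doe@example.com please")                   # address is removed
rm_email(gsub("@", " ", "contact me at jane.doe@example.com please"))   # address is missed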
# Package qdapRegex contains all the regular expressions needed to clean the text of web addresses, email addresses and Twitter character codes
enCorpus[[1]][[1]]<-rm_url(enCorpus[[1]][[1]], pattern=pastex("@rm_twitter_url", "@rm_url", "@rm_email"))
enCorpus[[2]][[1]]<-rm_url(enCorpus[[2]][[1]], pattern=pastex("@rm_twitter_url", "@rm_url", "@rm_email"))
enCorpus[[3]][[1]]<-rm_url(enCorpus[[3]][[1]], pattern=pastex("@rm_twitter_url", "@rm_url", "@rm_email"))
# Removing Dates
enCorpus[[1]][[1]]<-rm_date(enCorpus[[1]][[1]], pattern=pastex("@rm_date4"))
enCorpus[[2]][[1]]<-rm_date(enCorpus[[2]][[1]], pattern=pastex("@rm_date4"))
enCorpus[[3]][[1]]<-rm_date(enCorpus[[3]][[1]], pattern=pastex("@rm_date4"))
# Removing numbers and non-word characters (I have to do this separately per document, as calling it on the whole corpus brings down my R session)
enCorpus[[1]][[1]]<-rm_non_words(enCorpus[[1]][[1]])
enCorpus[[2]][[1]]<-rm_non_words(enCorpus[[2]][[1]])
enCorpus[[3]][[1]]<-rm_non_words(enCorpus[[3]][[1]])
# everything to lower case using tm
enCorpus <- tm_map(enCorpus, content_transformer(tolower))
Now I should have a clean corpus in my hands. Note that even for an empty input I should still be able to present a prediction; this is a fine-tuning feature for the final model and I will leave it for later. Notice that I did not eliminate misspelled words. In my opinion this should be done, although such words should not be very frequent (the same spelling error should not repeat very often). But I do not want to lose information at this stage by relying on specific dictionaries, so I do not attempt it.
In any case, by removing words with low frequency it is likely that many of the occasional misspellings and foreign words would also be eliminated, which would in fact strengthen the prediction algorithm. I am going to look in that direction.
I also consider that stemming and removing stopwords is not the way to go (this is my opinion), because they are part of the language(s) and a good predictor should suggest them when needed. In other words, a good prediction algorithm should be capable of proposing stopwords and fully inflected words; stemmed words are essentially misspellings from the point of view of natural language.
Also, I am not removing bad language at this moment. This will be done later in the model and in the application.
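As a note for later, here is a minimal sketch of what that profanity filter could look like, assuming tm's removeWords and a placeholder badWords vector (not an actual list I am using here).
# Sketch only: profanity filtering is deferred to the modelling stage.
# "badWords" is a hypothetical placeholder; a real list would be read from a file.
badWords <- c("badword1", "badword2")
censoredCorpus <- tm_map(enCorpus, removeWords, badWords)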
I am now creating a TDM (Term Document Matrix) from this corpus and then looking at the term frequencies from various angles. Note: in some places I settled for a different strategy to display my data; I leave the attempts that did not provide satisfactory output in the code as placeholders for future improvements.
tdm<-TermDocumentMatrix(enCorpus)
tdm
## <<TermDocumentMatrix (terms: 47309, documents: 3)>>
## Non-/sparse entries: 72496/69431
## Sparsity : 49%
## Maximal term length: 37
## Weighting : term frequency (tf)
# Having a look at Zipf's and Heap's plots...
Zipf_plot(x = tdm)
## (Intercept) x
## 13.207676 -1.256846
Heaps_plot(tdm)
## (Intercept) x
## 2.1524817 0.6402622
# Calculating the frequencies:
wordFreq<-rowSums(as.matrix(tdm))
freqord<-order(wordFreq, decreasing=TRUE)
# Looking at the most frequent words...
wordFreq <- sort(rowSums(as.matrix(tdm)), decreasing=TRUE)
wf <- data.frame(word=names(wordFreq), freq=wordFreq)
kable(head(wf,30), caption="Most frequent words", row.names = FALSE)
| word | freq |
|---|---|
| the | 44462 |
| and | 22457 |
| that | 9667 |
| for | 9280 |
| you | 6694 |
| with | 6310 |
| was | 5724 |
| this | 4674 |
| have | 4577 |
| but | 4349 |
| are | 4237 |
| not | 3578 |
| from | 3477 |
| they | 2997 |
| all | 2937 |
| said | 2916 |
| his | 2914 |
| will | 2832 |
| one | 2652 |
| about | 2568 |
| has | 2473 |
| out | 2438 |
| just | 2317 |
| when | 2313 |
| what | 2305 |
| who | 2268 |
| more | 2200 |
| your | 2174 |
| can | 2155 |
| like | 2133 |
kable(tail(wf,30), caption="Least frequent words", row.names = FALSE)
| word | freq |
|---|---|
| zuckerberg’s | 1 |
| zuckerburg | 1 |
| zuhlsdorf | 1 |
| zukor | 1 |
| zulia | 1 |
| zulu | 1 |
| zuni | 1 |
| zunilda | 1 |
| zunt | 1 |
| zürich | 1 |
| zuzana | 1 |
| zvs | 1 |
| zwane | 1 |
| zwerling | 1 |
| zydeco | 1 |
| zydrunas | 1 |
| zyl | 1 |
| zylstra | 1 |
| zynga | 1 |
| zz’s | 1 |
| αφα | 1 |
| помощи | 1 |
| рук | 1 |
| самих | 1 |
| утопающим | 1 |
| утопающих | 1 |
| 案ずるより産むが易し | 1 |
| 潘兆初法官 | 1 |
| 犀利人妻 | 1 |
| 红腐乳 | 1 |
# Table generation with GridExtra not satisfactory. Table gets truncated, could not find way around.
# table <- tableGrob(head(wf, 12), gp = gpar(fontsize = 8), rows = NULL )
# title <- textGrob("Most frequent words",gp=gpar(fontsize=14))
# footnote <- textGrob("Note: Stemming and stopwords not removed - on purpose", x=0, hjust=0, gp=gpar( fontface="italic"))
# padding <- unit(2,"line")
# table <- gtable_add_rows(table, heights = grobHeight(title) + padding, pos = 0)
# table <- gtable_add_rows(table, heights = grobHeight(footnote)+ padding)
# table <- gtable_add_grob(table, list(title, footnote), t=c(1, nrow(table)), l=c(1,2), r=ncol(table))
# grid.newpage()
# grid.draw(table)
# Looking at the least frequent words
# Table generation (I should make a function for this...)
# table <- tableGrob(tail(wf,12), theme =ttheme_default(gpar.coretext =gpar(fontsize=7), gpar.coltext=gpar(fontsize=8, fontface='bold'), gpar.rowtext=gpar(fontsize=7, fontface='bold') ), rows = NULL )
#
# title <- textGrob("Least frequent words",gp=gpar(fontsize=14))
# footnote <- textGrob("Note: Many spelling mistakes, foreign text, proper nouns etc...", x=0, hjust=0, gp=gpar( fontface="italic"))
# padding <- unit(2,"line")
# table <- gtable_add_rows(table, heights = grobHeight(title) + padding, pos = 0)
# table <- gtable_add_rows(table, heights = grobHeight(footnote)+ padding)
# table <- gtable_add_grob(table, list(title, footnote), t=c(1, nrow(table)), l=c(1,2), r=ncol(table))
# grid.newpage()
# grid.draw(table)
# These look ugly
# head(wordFreq, 30)
# wordFreq[head(freqord,30)]
# wordFreq[tail(freqord,30)]
# Plot of the words with frequency > 1500 in my sample
subset(wf, freq>1500) %>%
ggplot(aes(x=reorder(word, -freq),y=freq, fill=freq)) +
geom_bar(stat="identity", position=position_dodge(), colour="black") +
theme(axis.text.x=element_text(angle=45, hjust=1))
# How much do the words with freq == 1 count in this structure? How many are they in total?
count(wf[which(wf$freq==1),])
## # A tibble: 1 × 1
## n
## <int>
## 1 23297
# And what percentage of our data is that?
count(wf[which(wf$freq==1),]) / nrow(wf)
## n
## 1 0.4924433
# And how much does "the rest" count?
sum(wf[which(wf$freq>1),]$freq)
## [1] 669763
# What is the dictionary coverage of the words with freq==1?
#sum(wf[which(wf$freq==1),]$freq)/(sum(wf[which(wf$freq==1),]$freq) + sum(wf[which(wf$freq>1),]$freq))
sum(wf[which(wf$freq==1),]$freq)/sum(wf$freq)
## [1] 0.03361469
# If I stretch this reasoning, I could argue that there are other words with little predictive value and define a threshold (e.g. 3)
count(wf[which(wf$freq<=3),])
## # A tibble: 1 × 1
## n
## <int>
## 1 33378
# And how much does "the rest" count?
sum(wf[which(wf$freq>3),]$freq)
## [1] 646226
# How much predictive capability would I lose? (just on unigram terms)
#sum(wf[which(wf$freq<=3),]$freq)/(sum(wf[which(wf$freq<=1),]$freq) + sum(wf[which(wf$freq>3),]$freq))
sum(wf[which(wf$freq<=3),]$freq)/sum(wf$freq)
## [1] 0.06757568
# Can I generalize?
The Zipf’s law fit is only reasonably good, while the parameters of the Heaps’ law fit look very good, so the latter could really help in establishing a good sample size for the final application.
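To illustrate, the sketch below plugs the coefficients reported by Heaps_plot (log V = 2.15 + 0.64 log N) into V = K * N^beta; the full-corpus size of roughly 102.5 million words is my own assumption based on the summary statistics above, not an exact count.
# Extrapolating vocabulary size with the fitted Heaps' law coefficients
heapsK    <- exp(2.1524817)   # intercept from Heaps_plot, back-transformed
heapsBeta <- 0.6402622        # slope from Heaps_plot
predictVocab <- function(nTokens) round(heapsK * nTokens ^ heapsBeta)
predictVocab(sum(wf$freq))    # the sample itself: close to the observed 47,309 terms
predictVocab(102.5e6)         # assumed full corpus: on the order of 1.2 million terms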
The above analysis, together with the references, tells me that terms occurring only once in the text amount to about half of the vocabulary while covering less than 4% of the text. They contribute to the sparsity of the matrix and provide little predictive value. We can probably live without them and still cover 96% of the text or more, while roughly halving the number of terms in the TDM.
Even if I dropped the terms that occur 3 times or fewer, I would still cover about 93% of the text. I have also tried other sample sizes, and the results seem to hold even better as the sample size increases. See also the literature in the references about dictionary size versus language coverage.
Let us look at the coverage we can obtain with what fraction of the initial TDM, supposing that we remove the terms with freq = 1, then freq <= 2, and so on up to freq <= 50 (an arbitrary upper bound). I am almost certain that I will not remove all of these from the final model.
for (frequency in 1:50) {
# How many words with this frequency or less
qwRem <- count(wf[which(wf$freq<=frequency),])
# And how much coverage do we lose?
lostCov<-sum(wf[which(wf$freq<=frequency),]$freq)/sum(wf$freq)
# And what proportion of our data is that?
pwRem<-count(wf[which(wf$freq<=frequency),]) / nrow(wf)
# And how much does "the rest" count?
remainCov<-sum(wf[which(wf$freq>frequency),]$freq)/sum(wf$freq)
if (!exists("remUnfreqTerms")) {
remUnfreqTerms<-data.frame(frequency, qwRem, lostCov, pwRem, remainCov, nrow(wf)-qwRem)
}
else {
remUnfreqTerms<-rbind(remUnfreqTerms, data.frame(frequency, qwRem, lostCov, pwRem, remainCov, nrow(wf)-qwRem))
}
}
colnames(remUnfreqTerms)<-c("FrequencyRemoved", "Removed", "LostCoverage", "ProportionWordsRemoved", "RemainingCoverage", "TermsRemaining")
# Plotting the removed freq vs coverage and the size of the remaining dictionary
ggplot(remUnfreqTerms, aes(x=FrequencyRemoved)) +
geom_line( colour="blue", aes(y=Removed) ) +
geom_line( colour="red", aes(y=TermsRemaining)) +
labs(title="Words removed (blue) and remaining (red) losing infrequent terms")
ggplot(remUnfreqTerms, aes(x=FrequencyRemoved)) +
geom_line( colour="blue", aes(y=RemainingCoverage) ) +
geom_line( colour="red", aes(y=LostCoverage)) +
labs(title="Coverage loss (red) and remaining (blue) removing infrequent terms")
In fact, these coverage plots are just another visualization of Heaps’ law (on a linear rather than a log scale).
The number of n-grams in a text is known in advance: it is a linear function of the total number of words, since a sentence of L words contains L - (n - 1) n-grams. Therefore, if we tokenize and keep the selected vocabulary to a reasonable size, the structures holding the n-grams should also stay reasonable in size. However, I expect the n-gram matrices to be considerably sparser than the unigram TDM.
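As a quick sanity check of the L - (n - 1) rule, the toy example below tokenizes a 7-word sentence with the same RWeka tokenizer used below for the real corpus.
# Toy check of the L - (n - 1) rule on a sentence of L = 7 words
toySentence <- "the quick brown fox jumps over fences"
length(NGramTokenizer(toySentence, Weka_control(min = 2, max = 2)))   # 6 bigrams
length(NGramTokenizer(toySentence, Weka_control(min = 3, max = 3)))   # 5 trigrams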
# Sets the default number of threads to use. Thanks, Stack Overflow! I spent days on this single problem!
options(mc.cores=1)
bigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
trigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
# Finding bigrams
tdm2<- TermDocumentMatrix(enCorpus, control = list(tokenize = bigramTokenizer))
# Finding trigrams
tdm3<- TermDocumentMatrix(enCorpus, control = list(tokenize = trigramTokenizer))
# Calculating the frequencies of bi-grams
bigramFreq<-rowSums(as.matrix(tdm2))
# Looking at the bigrams frequencies...
bigramFreq <- sort(rowSums(as.matrix(tdm2)), decreasing=TRUE)
bif <- data.frame(bigram=names(bigramFreq), freq=bigramFreq)
# Looking at the most frequent bigrams...
kable(head(bif,30), caption="Most frequent bigrams", row.names = FALSE)
| bigram | freq |
|---|---|
| of the | 4197 |
| in the | 3759 |
| to the | 1965 |
| it s | 1911 |
| on the | 1681 |
| for the | 1678 |
| i m | 1447 |
| to be | 1433 |
| at the | 1322 |
| don t | 1279 |
| and the | 1230 |
| in a | 1074 |
| with the | 996 |
| is a | 918 |
| and i | 877 |
| for a | 840 |
| it is | 831 |
| it was | 829 |
| from the | 769 |
| of a | 766 |
| with a | 761 |
| i have | 751 |
| will be | 731 |
| i was | 727 |
| if you | 686 |
| one of | 680 |
| that s | 675 |
| is the | 665 |
| as a | 664 |
| that i | 622 |
# Looking at the least frequent bigrams...
kable(tail(bif,30), caption="Least frequent bigrams", row.names = FALSE)
| bigram | freq |
|---|---|
| zuma zapiro | 1 |
| zumba classes | 1 |
| zumba fitness | 1 |
| zumwalt east | 1 |
| zunilda junco | 1 |
| zuni tamaroa | 1 |
| zunt for | 1 |
| zürich today | 1 |
| zutic demands | 1 |
| zutic says | 1 |
| zuzana hejnova | 1 |
| zvs induction | 1 |
| zwane said | 1 |
| zwerling were | 1 |
| zydeco something | 1 |
| zydrunas ilgauskas | 1 |
| zyl rsa | 1 |
| zylstra ppg | 1 |
| zynga poker | 1 |
| zz s | 1 |
| zz top | 1 |
| αφα ice | 1 |
| дело помощи | 1 |
| дело рук | 1 |
| помощи утопающим | 1 |
| рук самих | 1 |
| самих утопающих | 1 |
| утопающим дело | 1 |
| 潘兆初法官 decided | 1 |
| 红腐乳 this | 1 |
# Object size
bigramSize<- object.size(bigramFreq)
bigramSize
## 20835024 bytes
# Calculating the frequencies of tri-grams
trigramFreq<-rowSums(as.matrix(tdm3))
# Looking at the tri-grams freqs...
trigramFreq <- sort(rowSums(as.matrix(tdm3)), decreasing=TRUE)
trif <- data.frame(trigram=names(trigramFreq), freq=trigramFreq)
# Looking at the most frequent trigrams...
kable(head(trif,30), caption="Most frequent trigrams", row.names = FALSE)
| trigram | freq |
|---|---|
| i don t | 455 |
| one of the | 337 |
| a lot of | 275 |
| it s a | 247 |
| i m not | 165 |
| i didn t | 162 |
| out of the | 154 |
| i can t | 152 |
| some of the | 146 |
| it s not | 144 |
| the end of | 144 |
| to be a | 143 |
| going to be | 134 |
| you don t | 134 |
| i ve been | 129 |
| the u s | 129 |
| part of the | 128 |
| it was a | 124 |
| don t know | 119 |
| i want to | 118 |
| as well as | 113 |
| be able to | 113 |
| thanks for the | 111 |
| i have to | 106 |
| don t have | 105 |
| t want to | 104 |
| this is a | 102 |
| the first time | 100 |
| the rest of | 100 |
| it s the | 98 |
# Looking at the least frequent trigrams...
kable(tail(trif,30), caption="Least frequent trigrams", row.names = FALSE)
| trigram | freq |
|---|---|
| zuma they create | 1 |
| zuma wants an | 1 |
| zuma zapiro s | 1 |
| zumba classes i | 1 |
| zumba fitness and | 1 |
| zumwalt east but | 1 |
| zumwalt north had | 1 |
| zumwalt west s | 1 |
| zunilda junco who | 1 |
| zuni tamaroa maritime | 1 |
| zunt for another | 1 |
| zürich today and | 1 |
| zutic demands move | 1 |
| zuzana hejnova cze | 1 |
| zva blog hop | 1 |
| zva blog starting | 1 |
| zvs induction heater | 1 |
| zwane said the | 1 |
| zwerling were elected | 1 |
| zydeco something on | 1 |
| zydrunas ilgauskas juwan | 1 |
| zyl rsa and | 1 |
| zz top was | 1 |
| дело помощи утопающим | 1 |
| дело рук самих | 1 |
| помощи утопающим дело | 1 |
| рук самих утопающих | 1 |
| утопающим дело рук | 1 |
| 潘兆初法官 decided and | 1 |
| 红腐乳 this recipe | 1 |
trigramSize<- object.size(trigramFreq)
trigramSize
## 41353584 bytes
Some of the bigrams and trigrams seem trivial, and yet they are wrong. Removing the punctuation also removed the apostrophe, so a trigram like “i don t”, although among the most frequent in my structure, would not be a well-spelled suggestion. This is true in many cases. I am not sure how to proceed here; possibly removal is a better option than offering a wrongly spelled prediction.
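One possible remedy, sketched below under the assumption that the non-word cleaning step is redone with gsub instead of rm_non_words, is to keep apostrophes that sit inside a word so that contractions like “don't” survive; whether the downstream tokenizers then keep “don't” as a single token would still need to be checked.
# Sketch: strip digits and punctuation but keep intra-word apostrophes,
# so contractions such as "don't" and "it's" are preserved.
keepApostrophes <- function(x) {
  x <- gsub("[^a-z' ]", " ", tolower(x))        # keep only letters, apostrophes, spaces
  x <- gsub("\\B'|'\\B", " ", x, perl = TRUE)   # drop apostrophes not inside a word
  gsub("\\s+", " ", trimws(x))                  # collapse repeated whitespace
}
keepApostrophes("I don't think it's over!")     # "i don't think it's over"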
The size of the structures also has to be kept in mind, at about 21 MB for the bigrams and 41 MB for the trigrams, so a 4-gram option may well be too much. Some reduction measure should be considered, as the bigrams and trigrams with frequency 1 are almost all useless for prediction (foreign words, proper nouns, misspellings, etc.).
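A minimal sketch of such a reduction, simply dropping the singleton n-grams from the frequency tables built above and comparing memory footprints (the threshold of 1 is just an example; droplevels is used because the n-gram column is a factor):
# Sketch: drop n-grams seen only once and compare object sizes
bifPruned  <- droplevels(bif[bif$freq > 1, ])
trifPruned <- droplevels(trif[trif$freq > 1, ])
object.size(bif);  object.size(bifPruned)
object.size(trif); object.size(trifPruned)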
# Plot of the bigrams with frequency > 1000 in my sample
subset(bif, freq>1000) %>%
ggplot(aes(x=reorder(bigram, -freq),y=freq, fill=freq)) +
geom_bar(stat="identity", position=position_dodge(), colour="black") +
theme(axis.text.x=element_text(angle=45, hjust=1))
# Plot of the trigrams with frequency > 100 in my sample
subset(trif, freq>100) %>%
ggplot(aes(x=reorder(trigram, -freq),y=freq, fill=freq)) +
geom_bar(stat="identity", position=position_dodge(), colour="black") +
theme(axis.text.x=element_text(angle=45, hjust=1))
At the moment, my orientation is as follows: