As suggested in the assignment, the goal of this project is simply to show that you have gotten used to working with the data and that you are on track to create your prediction algorithm.
The data set consists of 4 folders for different languages, each containing 3 files. The four languages are Russian, German, English, and Finnish, and the files hold blog entries, news entries, and Twitter feeds.
This report processes the English data only, using all three sources: blogs, news, and Twitter. The three sources are merged, and only 50,000 documents (lines) are randomly sampled from the combined corpus, since the purpose here is only exploratory data analysis.
The data is available from the download link. Download it, unzip the file, and move the three txt files from the en_US folder to your working directory.
library(readr)  # provides read_lines()
blog <- read_lines("en_US.blogs.txt")
news <- read_lines("en_US.news.txt")
twitter <- read_lines("en_US.twitter.txt")
Two transformations were applied to the corpus tokens: lowercasing (tokens_tolower) and stopword removal. Stemming with tokens_wordstem was not applied, because with its default settings some words were changed into forms that are not in the dictionary.
#To corpus
cor_blog <- corpus(blog)
names(cor_blog)<-gsub("text","blog",names(cor_blog))
cor_news <- corpus(news)
names(cor_news)<-gsub("text","news",names(cor_news))
cor_twitter <- corpus(twitter)
names(cor_twitter)<-gsub("text","twitter",names(cor_twitter))
#merging corpuses
cor_all <- cor_blog + cor_news + cor_twitter
#sampling
set.seed(1237)
cor_sample <- corpus_sample(cor_all,size=50000)
# 'remove_hyphens' is deprecated in favour of 'split_hyphens', and
# 'remove_twitter' is defunct, so it is dropped from the call
cor_tokens <- tokens(cor_sample, remove_numbers = TRUE, remove_punct = TRUE,
                     remove_symbols = TRUE, remove_separators = TRUE,
                     split_hyphens = TRUE, remove_url = TRUE)
# lowercase the tokens and remove stopwords
cor_tokens <- tokens_tolower(cor_tokens)
cor_tokens <- tokens_remove(cor_tokens, stopwords("en"))
# ngrams
unigrams <- cor_tokens
bigrams <- tokens_ngrams(cor_tokens, n = 2, concatenator = " ")
trigrams <- tokens_ngrams(cor_tokens, n = 3, concatenator = " ")
quadrigrams <- tokens_ngrams(cor_tokens, n = 4, concatenator = " ")
ngrams <- tokens_ngrams(cor_tokens, n = 1:4, concatenator = " ")
# Document-feature matrices
uni_dfm <- dfm(unigrams)
bi_dfm <- dfm(bigrams)
tri_dfm <- dfm(trigrams)
quadri_dfm <- dfm(quadrigrams)
corpus_dfm <- dfm(cor_tokens)
##              size wordcount sentencecount
## blog    871881856  42818152       2362935
## news    871881856  39849636       1992553
## twitter 871881856  36663968       3754212
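The code that produced the table above is not shown in the report. As a rough sketch of how such a summary could be computed per source in base R, the hypothetical helper `summarise_text` below counts whitespace-separated words and splits sentences on `.`, `!` and `?`. Both rules are simplifying assumptions, so the numbers would not match the table exactly, and the toy vectors stand in for the real `blog`, `news` and `twitter` objects.

```r
# Hypothetical helper (not from the report): size, word count and
# sentence count for one character vector of lines.
summarise_text <- function(lines) {
  c(
    size_bytes    = as.numeric(object.size(lines)),
    # words: split on runs of whitespace (a simplification)
    wordcount     = sum(lengths(strsplit(trimws(lines), "\\s+"))),
    # sentences: split on runs of ., ! or ? (a simplification)
    sentencecount = sum(lengths(strsplit(lines, "[.!?]+\\s*")))
  )
}

# Toy stand-ins for the real blog, news and twitter vectors
toy <- list(
  blog    = c("I went home. It rained.", "Nice day!"),
  news    = c("Rates rose today."),
  twitter = c("so tired", "lol")
)

# One row per source, one column per statistic
summary_table <- t(sapply(toy, summarise_text))
summary_table
```

Measuring each source separately in this way gives each row its own size value.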
To detect foreign words, I used the hunspell package, as suggested in the forum. Without any changes, hunspell uses an English dictionary, so it can be used to flag foreign words as well as typos.
foreign_uni <- hunspell_check(featnames(uni_dfm))
foreign_words <- setdiff(featnames(uni_dfm),featnames(uni_dfm)[foreign_uni])
length(foreign_words)
## [1] 28038
length(featnames(uni_dfm))
## [1] 58327
length(foreign_words)/length(featnames(uni_dfm))
## [1] 0.4807036
There were 28,038 words that are foreign (or rather, words not in the dictionary). Some of them are artifacts of preprocessing: for example, the tolower step turned UK into uk. Even allowing for that, about 48% of the distinct words are not in the dictionary, which is a lot.
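One possible way to reduce the false positives caused by lowercasing is to accept a word if any of its common capitalisations passes the dictionary check. The sketch below is only an illustration: `is_known_any_case` is a hypothetical helper, and a toy dictionary stands in for the real checker so the example runs without extra packages. In the report's setting, `hunspell_check` would be passed as the `check` argument instead.

```r
# Hypothetical helper (not from the report): a word counts as known if the
# checker accepts it as written, in Title Case, or in UPPER CASE.
is_known_any_case <- function(words, check) {
  title_case <- paste0(toupper(substring(words, 1, 1)), substring(words, 2))
  check(words) | check(title_case) | check(toupper(words))
}

# Toy stand-in for a dictionary check; an assumption for illustration only
toy_dictionary <- c("UK", "Canadian", "house")
toy_check <- function(w) w %in% toy_dictionary

words <- c("uk", "canadian", "house", "hauss")
known <- is_known_any_case(words, toy_check)

# Only words that fail in every capitalisation remain flagged
words[!known]
```

With this rule, uk and canadian are no longer flagged, while a genuinely unknown token such as hauss still is.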
half<-sum(featfreq(uni_dfm))/2
featuresfreq<-sort(featfreq(uni_dfm), decreasing = TRUE)
featuresfreq<-data.frame(featuresfreq)
featuresfreq<-featuresfreq %>% mutate(cum=cumsum(featuresfreq))
nrow(featuresfreq %>% filter(cum<half))
## [1] 1007
So the 1,008 most frequent words cover half of all word occurrences. This is about 1.73 percent of the distinct words (1008 / 58327).
ninety <- sum(featfreq(uni_dfm)) *.9
nrow(featuresfreq %>% filter(cum<ninety))
## [1] 14689
The 14,690 most frequent words are needed to cover 90 percent of all word occurrences, which is about a quarter of the distinct words.
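The two coverage calculations above follow the same pattern, so they could be wrapped in one small function. This is a minimal sketch in base R: `words_for_coverage` is a hypothetical name, and the toy frequency vector stands in for the named vector that `featfreq(uni_dfm)` returns.

```r
# Hypothetical helper (not from the report): number of most frequent words
# needed to cover a share `p` of all word occurrences.
words_for_coverage <- function(freq, p) {
  sorted <- sort(freq, decreasing = TRUE)
  cum_share <- cumsum(sorted) / sum(sorted)
  # index of the first word at which the cumulative share reaches p
  unname(which(cum_share >= p)[1])
}

# Toy frequencies standing in for featfreq(uni_dfm)
toy_freq <- c(the = 40, of = 25, data = 15, corpus = 8, token = 7, ngram = 5)

words_for_coverage(toy_freq, 0.5)  # 2
words_for_coverage(toy_freq, 0.9)  # 5
```

Calling it with `featfreq(uni_dfm)` and 0.5 or 0.9 should reproduce the counts reported above, under the assumption that the cumulative sum is taken over the same sorted frequencies.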
It is clear that the preprocessing should be more sophisticated. The sample appears to contain foreign words. Stemming should be applied as well, with a better method than the default one from quanteda. Further research is needed before modeling.
For this report, I merged all three sources and sampled only 50,000 documents, which is around 1% of the data. The following two questions therefore arise:
1) What happens if we do the same analysis on each text source separately, without merging?
2) Was the size of the sample adequate?
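To explore the first question, each source could be sampled on its own so that the same analysis runs once per source. The sketch below uses base R and toy vectors in place of the real `blog`, `news` and `twitter` objects. The per-source sample size `n_per_source` is an arbitrary value chosen for illustration, not a recommendation.

```r
set.seed(1237)

# Toy stand-ins for the real blog, news and twitter vectors
sources <- list(
  blog    = paste("blog line", 1:200),
  news    = paste("news line", 1:300),
  twitter = paste("tweet", 1:500)
)

# Assumed sample size per source (illustrative only)
n_per_source <- 50

# Draw an independent random sample from each source
samples <- lapply(sources, function(x) sample(x, size = n_per_source))

# Each element can now be tokenised and analysed separately
sapply(samples, length)
```

Repeating the analysis with several values of `n_per_source` and comparing the results would be one way to approach the second question.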
As a non-native English speaker, I found it hard to judge the predictions on my own. The unigram results seem reasonable, while the bigram and trigram results seem questionable.
Definition of language
This question is related to the first question, preprocessing. How do we decide whether a word is correct in a language? Writing uk instead of UK is clearly wrong, but should it be treated as a typo? Is canadian not Canadian?
Sophistication of preprocessing
The preprocessing needs more work, as suggested above.
For the exploratory analysis, I used only unigrams, bigrams, and trigrams. For building models, I will need higher-order n-grams, and I am thinking of using quadrigrams (4-grams) as well.
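To make concrete what a higher-order n-gram looks like, here is a small base R sketch that builds n-grams of any order from a token vector. `make_ngrams` is a hypothetical helper written for illustration. In the report itself this job is done by quanteda's `tokens_ngrams`.

```r
# Hypothetical helper (not from the report): all n-grams of order n
# from a character vector of tokens, joined with spaces.
make_ngrams <- function(tokens, n) {
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  vapply(
    starts,
    function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
    character(1)
  )
}

toy_tokens <- c("thanks", "for", "the", "follow", "back")

make_ngrams(toy_tokens, 4)
# "thanks for the follow" "for the follow back"
```

A sequence of 5 tokens yields only 2 quadrigrams, so higher orders become sparse quickly.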
For the prediction model, I am considering a Markov chain model or naive Bayes. This needs further research.
For the Shiny app, the predicted word would appear in the main panel after the user types a word in the side panel.
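As a rough idea of what the Markov chain option could look like, the sketch below builds a first-order model from bigram counts in base R and predicts the most frequent follower of a word. It is a toy illustration, not the planned model: the token vector, the helper `predict_next` and the lack of any smoothing or back-off are all simplifying assumptions.

```r
# Toy token sequence standing in for the real corpus tokens
toy_tokens <- c("i", "love", "you", "i", "love", "data", "i", "love", "you")

# Pair each word with the word that follows it and count the transitions
prev_word <- head(toy_tokens, -1)
next_word <- tail(toy_tokens, -1)
transitions <- table(prev_word, next_word)

# Hypothetical helper: most frequent follower of `word`, or NA if unseen
predict_next <- function(word, transitions) {
  if (!word %in% rownames(transitions)) return(NA_character_)
  counts <- transitions[word, ]
  names(counts)[which.max(counts)]
}

predict_next("love", transitions)   # "you"
predict_next("i", transitions)      # "love"
predict_next("hello", transitions)  # NA
```

A real model would also need a fallback for unseen words, for example backing off to the most frequent unigrams.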