Introduction

As suggested, the goal of this project is just to display that you’ve gotten used to working with the data and that you are on track to create your prediction algorithm.

The motivations are as follows:

  1. Demonstrate that you’ve downloaded the data and have successfully loaded it in.
  2. Create a basic report of summary statistics about the data sets.
  3. Report any interesting findings that you have amassed so far.
  4. Get feedback on your plans for creating a prediction algorithm and Shiny app.

About the data

The data set consists of 4 folders of different languages, each containing 3 files. The four languages are Russian, German, English, and Finnish, and the files are blog entries, news entries, and Twitter feeds.

For this report, we process the English data only, using all three sources: blogs, news, and Twitter. We merge the three sources and randomly sample only 50000 documents (lines) from the combined corpus, since this is only for the purpose of exploratory data analysis.

Download the Data

The link for downloading the data. You must download it, unzip the file, and move the three txt files in the en_US folder to your working directory.

Reading Data

library(readr)  # provides read_lines()

blog<-read_lines("en_US.blogs.txt")
news<-read_lines("en_US.news.txt")
twitter<-read_lines("en_US.twitter.txt")

Preprocessing

Two methods were applied to the corpus tokens: lowercasing (tokens_tolower) and stopword removal. The tokens_wordstem function was not applied, since with its default settings some words were changed into forms that are not in the dictionary.

# Convert each source to a corpus
library(quanteda)

cor_blog <- corpus(blog)
names(cor_blog)<-gsub("text","blog",names(cor_blog))
cor_news <- corpus(news)
names(cor_news)<-gsub("text","news",names(cor_news))
cor_twitter <- corpus(twitter)
names(cor_twitter)<-gsub("text","twitter",names(cor_twitter))

# Merge the corpora

cor_all <- cor_blog + cor_news + cor_twitter

#sampling

set.seed(1237)
cor_sample <- corpus_sample(cor_all,size=50000)
cor_tokens <- tokens(cor_sample, remove_numbers = TRUE, remove_punct = TRUE,
  remove_symbols = TRUE, remove_separators = TRUE,
  split_hyphens = TRUE, remove_url = TRUE)

# Lowercase the tokens and remove stopwords

cor_tokens <- tokens_tolower(cor_tokens)
cor_tokens <- tokens_remove(cor_tokens, stopwords("en"))

# ngrams

unigrams <- cor_tokens
bigrams <- tokens_ngrams(cor_tokens, n = 2, concatenator = " ")
trigrams <- tokens_ngrams(cor_tokens, n = 3, concatenator = " ")
quadrigrams <- tokens_ngrams(cor_tokens, n = 4, concatenator = " ")
ngrams <- tokens_ngrams(cor_tokens, n = 1:4, concatenator = " ")

# Document-feature matrices

uni_dfm <- dfm(unigrams)
bi_dfm <- dfm(bigrams)
tri_dfm <- dfm(trigrams)
quadri_dfm <- dfm(quadrigrams)
corpus_dfm <- dfm(cor_tokens)
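
To make the n-gram step concrete, here is a minimal base-R sketch of what building n-grams from a token vector amounts to. It is only an illustration, not the quanteda implementation; `make_ngrams` and the toy tokens are hypothetical names introduced for this example.

```r
# Base-R illustration of n-gram construction (not the quanteda implementation).
# `make_ngrams` is a hypothetical helper used only in this sketch.
make_ngrams <- function(tokens, n, concatenator = " ") {
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts,
         function(i) paste(tokens[i:(i + n - 1)], collapse = concatenator),
         character(1))
}

# Toy sentence, with a toy stopword list standing in for stopwords("en").
raw_tokens <- c("have", "a", "nice", "day")
toy_stopwords <- c("a")
kept_tokens <- raw_tokens[!raw_tokens %in% toy_stopwords]

bigrams_demo <- make_ngrams(kept_tokens, 2)
trigrams_demo <- make_ngrams(kept_tokens, 3)
print(bigrams_demo)
print(trigrams_demo)
```

In the pipeline above, stopwords are removed before the n-grams are built, so an n-gram can join words that were not adjacent in the original text, as with "have nice" in this toy example. That may be worth keeping in mind when reading the bigram and trigram results.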

Basic summary of the Data (without sampling)

##              size wordcount sentencecount
## blog    871881856  42818152       2362935
## news    871881856  39849636       1992553
## twitter 871881856  36663968       3754212
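
The code that produced this table is not shown in the report. As a rough sketch of how such per-source statistics could be computed in base R, assuming whitespace-separated words and sentences ending in `.`, `!` or `?`, something like the following would work. `summarise_lines` and the toy input are hypothetical and only illustrate the idea.

```r
# Sketch: per-source summary statistics in base R.
# `summarise_lines` is a hypothetical helper, not the code behind the table above.
summarise_lines <- function(lines) {
  # Words: split each line on runs of whitespace.
  words <- lengths(strsplit(trimws(lines), "\\s+"))
  # Sentences: count runs of sentence-ending punctuation (a crude assumption).
  sentences <- lengths(regmatches(lines, gregexpr("[.!?]+", lines)))
  c(size_bytes = as.numeric(object.size(lines)),
    linecount = length(lines),
    wordcount = sum(words),
    sentencecount = sum(sentences))
}

toy_lines <- c("This is a test. It has two sentences.", "One more line!")
toy_summary <- summarise_lines(toy_lines)
print(toy_summary)
```

Applied to each of `blog`, `news`, and `twitter` separately, a function like this gives one row per source, with the in-memory size measured on each object individually.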

Exploratory Analysis

unigram

bigram

trigram

Foreign Languages

To evaluate foreign-language content, I used the hunspell package, as suggested in the forum. By default, hunspell uses an English dictionary, so checking against it flags foreign words as well as typos.

library(hunspell)
# TRUE for features found in the default English dictionary
foreign_uni <- hunspell_check(featnames(uni_dfm))
foreign_words <- setdiff(featnames(uni_dfm),featnames(uni_dfm)[foreign_uni])
length(foreign_words)
## [1] 28038
length(featnames(uni_dfm))
## [1] 58327
length(foreign_words)/length(featnames(uni_dfm))
## [1] 0.4807036

There were 28038 words flagged as foreign (or rather, words not in the dictionary). Some of them are an artifact of preprocessing: for example, UK became uk after the tolower step. Even allowing for that, the share is a little over 48% of the distinct words, which is a lot.
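
One way to separate genuine unknown words from case artifacts is to accept a lowercased token if any of its case variants is in the dictionary. The sketch below illustrates the idea with a toy dictionary vector standing in for the hunspell dictionary; `toy_dict`, `in_dict`, `capitalise`, and `known_any_case` are all hypothetical names for this example.

```r
# Toy stand-in for a spell-check dictionary (the report itself uses hunspell_check()).
toy_dict <- c("UK", "Canadian", "apple", "house")
in_dict <- function(words) words %in% toy_dict  # hypothetical helper

# Capitalise the first letter of each word.
capitalise <- function(w) paste0(toupper(substring(w, 1, 1)), substring(w, 2))

# A lowercased token counts as known if any case variant is in the dictionary.
known_any_case <- function(words) {
  in_dict(words) | in_dict(toupper(words)) | in_dict(capitalise(words))
}

tokens_demo <- c("uk", "canadian", "apple", "zzxq")
unknown_demo <- tokens_demo[!known_any_case(tokens_demo)]
print(unknown_demo)
```

With the real data, `in_dict` could presumably be swapped for `hunspell_check`, which also returns one logical value per word. How much of the 48% this would explain is an open question.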

How many words do you need?

  1. To cover 50% of the word frequency
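library(dplyr)
# Half of the total token count
half<-sum(featfreq(uni_dfm))/2
# Cumulative frequency, from the most to the least frequent word
featuresfreq<-sort(featfreq(uni_dfm), decreasing = TRUE)
featuresfreq<-data.frame(featuresfreq)
featuresfreq<-featuresfreq %>% mutate(cum=cumsum(featuresfreq))
# Number of words whose cumulative frequency is still below half
nrow(featuresfreq %>% filter(cum<half))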
## [1] 1007

So with the 1008 most frequent words, you can cover half of the word frequency. This is about 1.73 percent of all the distinct words (1008 of 58327).

  2. To cover 90% of the word frequency
ninety <- sum(featfreq(uni_dfm)) *.9
nrow(featuresfreq %>% filter(cum<ninety))
## [1] 14689

The 14690 most frequent words are needed to cover 90 percent of the word frequency, which is about a quarter of all the distinct words.
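
The two calculations above follow the same pattern, so they can be wrapped in one small function. This is a base-R sketch on toy frequencies; `words_for_coverage` is a hypothetical helper name. It returns the position of the first word at which the cumulative share reaches the target, which is the same as counting the words strictly below the target and adding one.

```r
# Sketch: smallest number of top-frequency words needed to reach coverage p.
# `words_for_coverage` is a hypothetical helper, shown on toy counts.
words_for_coverage <- function(freq, p) {
  sorted <- sort(freq, decreasing = TRUE)
  coverage <- cumsum(sorted) / sum(sorted)
  # Index of the first word at which cumulative coverage reaches p.
  unname(which(coverage >= p)[1])
}

toy_freq <- c(the = 50, of = 20, and = 15, cat = 10, dog = 5)
n_half <- words_for_coverage(toy_freq, 0.5)
n_ninety <- words_for_coverage(toy_freq, 0.9)
print(c(n_half, n_ninety))
```

On the real data, the input would be the named vector returned by `featfreq(uni_dfm)`.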

Questions

  1. About preprocessing.

It is clear that the preprocessing should be more sophisticated. There seem to be many foreign words. Stemming should be applied as well, with a better method than the default one in quanteda. Further research is needed before modeling.

  2. About the data and sampling

For this report, I merged all three sources and sampled only 50000 documents, which is around 1% of the data. Two questions therefore arise:

1) What happens if we do the same analysis on each source separately, without merging?

2) Was the size of the sample adequate?

  3. As a non-native English speaker, I found it hard to judge the predictions on my own. The unigram results seem reasonable, while the bigram and trigram results seem questionable.

  4. Definition of language

This question is related to the first question, about preprocessing. How do we decide whether a word is correct or not? It is obvious that writing uk instead of UK is wrong, but should we treat it as a typo? Is canadian not Canadian?

Plans to build a model

  1. More sophisticated preprocessing. I need to work on preprocessing a bit more, as suggested above.

  2. For the exploratory analysis, I only used unigrams, bigrams, and trigrams. For building models, I would need higher-order n-grams, and I am thinking of using quadrigrams as well.

  3. For the prediction model, I am thinking of a Markov chain model or naive Bayes. I will have to do further research on this.

  4. For the Shiny app, the user types a word in the side panel and the predicted next word appears in the main panel.
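
As a starting point for the Markov chain idea in item 3, here is a minimal base-R sketch of a first-order (bigram) model that predicts the most frequent follower of a word. It is only one possible reading of the plan, not a final design; `train_bigram_model`, `predict_next`, and the toy sentences are hypothetical.

```r
# Sketch of a first-order Markov (bigram) next-word predictor in base R.
# `train_bigram_model` and `predict_next` are hypothetical helper names.
train_bigram_model <- function(sentences) {
  pairs <- do.call(rbind, lapply(sentences, function(s) {
    w <- strsplit(tolower(s), "\\s+")[[1]]
    if (length(w) < 2) return(NULL)
    data.frame(prev = head(w, -1), nxt = tail(w, -1), stringsAsFactors = FALSE)
  }))
  # Rows: previous word; columns: next word; cells: how often the pair occurs.
  table(pairs$prev, pairs$nxt)
}

predict_next <- function(model, word) {
  word <- tolower(word)
  # Unseen words get no prediction in this simple version.
  if (!word %in% rownames(model)) return(NA_character_)
  counts <- model[word, ]
  names(counts)[which.max(counts)]
}

toy_sentences <- c("i love new york", "i love new shoes",
                   "new york is big", "i love you")
toy_model <- train_bigram_model(toy_sentences)
print(predict_next(toy_model, "new"))
print(predict_next(toy_model, "love"))
```

A fuller model would probably fall back from quadrigrams to trigrams to bigrams when a longer context has not been seen, and the Shiny server function could call something like `predict_next` on the text typed in the side panel.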