The first step in building a predictive model for text is understanding the distribution of, and relationships between, the words, tokens, and phrases in the text.
This task uses the SwiftKey dataset, which contains files in four languages: English, German, Russian, and Finnish. For this task we only use the English files, en_US, which total about 556 MB. The en_US folder contains three text files: en_US.twitter.txt, en_US.blogs.txt, and en_US.news.txt.
##    Source   Lines    Words MaxWords
## 1 Twitter 2360148 30373792      213
## 2   Blogs  899288 37334441    40835
## 3    News   77259  2643972     5760
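The counts above could be produced along the following lines (a sketch only; the file name, encoding, and variable names are assumptions, since the original code for this step is not shown).

library(stringi)

# Read one source file and summarise its lines and words
twitter <- readLines("en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
words_per_line <- stri_count_words(twitter)
data.frame(Source = "Twitter",
           Lines = length(twitter),
           Words = sum(words_per_line),
           MaxWords = max(words_per_line))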
library(ggplot2)
library(gridExtra)
# Bar charts of line and word counts per source, arranged side by side
g1 <- ggplot(df, aes(x = Source, y = Lines/1e+06)) + geom_bar(stat = "identity", fill = "blue") +
  labs(title = "Count of Lines", y = "Number of Lines in Millions")
g2 <- ggplot(df, aes(x = Source, y = Words/1e+06)) + geom_bar(stat = "identity", fill = "lightblue") +
  labs(title = "Count of Words", y = "Number of Words in Millions")
grid.arrange(g1, g2, ncol = 2)
This dataset is fairly large, and it is not necessary to load the entire dataset to build the algorithm. A smaller subset of the data is taken through random sampling; a sketch of this step follows the table below.
##    Source Lines  Words
## 1 Twitter 23832 307286
## 2   Blogs  8938 371901
## 3    News   738  24947
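The sampling step could look like the sketch below, keeping each line of the full file with probability 0.01 via rbinom (as described later in this report); the seed and file name are assumptions.

set.seed(1234)
# Keep roughly 1% of the lines at random
twitter <- readLines("en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
sample_twitter <- twitter[rbinom(length(twitter), size = 1, prob = 0.01) == 1]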
matrix_tw <- as.matrix(tdm_twitter)
# Word frequencies across the Twitter sample, sorted in decreasing order
# (named word_freq to avoid masking base::sort)
word_freq <- sort(rowSums(matrix_tw), decreasing = TRUE)
df_tw <- data.frame(word = names(word_freq), freq = word_freq)
head(df_tw)
##        word freq
## just   just 1511
## get     get 1458
## like   like 1340
## can     can 1330
## thank thank 1305
## love   love 1259
g_tw <- ggplot(df_tw[1:20, ], aes(x = reorder(word, -freq), y = freq)) + geom_bar(stat = "identity") +
labs(x = "Words", y = "Frequency", title = "Most Frequent Words in Twitter")
g_tw
library(wordcloud)
wordcloud(words = df_tw$word, freq = df_tw$freq, min.freq = 4, max.words = 200,
          random.order = FALSE, scale = c(3, 0.5), colors = rainbow(3))
An N-gram is a sequence of N words. N-gram modeling assigns a probability to the occurrence of an N-gram, or to a word occurring next given the preceding words. According to the Markov assumption, the next state depends only on the current state and is independent of the earlier history. Hence, N-gram modeling helps predict the next word. Unigrams can be used to find the most frequent words (above plot). A bigram model predicts the probability of a word given the one word before it; by the same token, a trigram model predicts the probability of a word given the two words before it.
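Formally, under the Markov assumption the probability of the next word is conditioned only on the previous N-1 words, and it can be estimated from N-gram counts (this is the standard maximum-likelihood formulation, stated here for reference rather than taken from the original report):

$$
P(w_i \mid w_1, \ldots, w_{i-1}) \;\approx\; P(w_i \mid w_{i-N+1}, \ldots, w_{i-1})
\;=\; \frac{\mathrm{count}(w_{i-N+1}, \ldots, w_{i-1}, w_i)}{\mathrm{count}(w_{i-N+1}, \ldots, w_{i-1})}
$$

For a trigram model N = 3, so the prediction depends on the two preceding words.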
library(tm)        # DocumentTermMatrix
library(tidytext)  # tidy(), unnest_tokens()
library(dplyr)

# Build a document-term matrix from the Twitter corpus, tidy it, and count bigrams
dtm_tw <- DocumentTermMatrix(tw_corpus)
tw_td <- tidy(dtm_tw)
tw_bigram <- tw_td %>% unnest_tokens(bigram, term, token = "ngrams", n = 2)
bigram_count <- tw_bigram %>% count(bigram, sort = TRUE)
bigram_filtered <- bigram_count %>%
  filter(!is.na(bigram))
head(bigram_filtered)
## # A tibble: 6 x 2
##   bigram        n
##   <chr>     <int>
## 1 like look   119
## 2 know let     97
## 3 now right    94
## 4 just like    78
## 5 just know    65
## 6 good morn    51
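The trigram counts below are produced analogously to the bigram code above; the exact chunk is not shown here, so the following is a sketch with assumed variable names.

# Count trigrams in the same tidied term data
tw_trigram <- tw_td %>% unnest_tokens(trigram, term, token = "ngrams", n = 3)
trigram_count <- tw_trigram %>% count(trigram, sort = TRUE)
trigram_filtered <- trigram_count %>%
  filter(!is.na(trigram))
head(trigram_filtered)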
## # A tibble: 6 x 2
##   trigram               n
##   <chr>             <int>
## 1 day happi mother      8
## 2 happi new year        8
## 3 back follow pleas     7
## 4 just like look        7
## 5 just know let         6
## 6 know let like         6
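As a minimal illustration of how such counts could feed a next-word suggestion (a sketch under the assumption that trigram_filtered exists as above; it is not the final prediction model), each trigram can be split into a two-word prefix and a completion, keeping the most frequent completion per prefix.

library(dplyr)
library(tidyr)

# Most frequent third word for each two-word prefix
trigram_model <- trigram_filtered %>%
  separate(trigram, into = c("w1", "w2", "w3"), sep = " ") %>%
  group_by(w1, w2) %>%
  slice_max(n, n = 1, with_ties = FALSE) %>%
  ungroup()

predict_next <- function(word1, word2) {
  hit <- trigram_model[trigram_model$w1 == word1 & trigram_model$w2 == word2, ]
  if (nrow(hit) == 0) NA_character_ else hit$w3[1]
}
predict_next("happi", "new")  # would suggest "year" given the counts above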
The distribution of word frequencies can be seen in the "Most Frequent Words in Twitter/Blogs/News" plots.
The top 20 most frequent 2-grams and 3-grams are visualized in the "Most Frequent Words in Bigram/Trigram" plots.
In this task, 1% of the data was taken as a sample using the rbinom function. The total number of words in the Twitter sample is 307286. The sample data was then transformed into a corpus and preprocessed, which included removing non-ASCII characters (non-English text), special characters, URLs, numbers, punctuation, etc. After all preprocessing steps have been applied in order, the corpus is clean and contains unique words. These words are sorted and stored in a data frame. There are 171146 unique words in the Twitter sample data, covering 55% of all word instances in English. One way to increase this percentage, to 90% for example, is to increase the sample size.
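One way the coverage figures could be approached (a sketch, not the code used for the numbers quoted above) is to look at the cumulative share of word instances in the sample covered by the most frequent unique words in df_tw:

# Cumulative coverage by the top-ranked words
coverage <- cumsum(df_tw$freq) / sum(df_tw$freq)
which(coverage >= 0.90)[1]  # number of top words needed to cover 90% of the sample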
One way (perhaps not the ideal way) to estimate how many words come from foreign languages is to compare the number of unique words before removing other languages, 163593 in the sample, to the number of unique words after removing other languages, 162727 in the same sample. This suggests that about 0.5% of the words in the Twitter sample come from foreign languages.
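The share quoted above follows directly from the two vocabulary sizes:

# Fraction of unique words dropped when foreign-language words are removed
(163593 - 162727) / 163593  # about 0.0053, i.e. roughly 0.5%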
A function can be built to compare the unique words in the sample1 corpus to those in the sample2 corpus, identify the unique words that are in sample2 but not in sample1, and then add these words to sample1, thereby increasing coverage by introducing new unique words.
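A sketch of such a function is shown below; the names are illustrative, and the inputs are assumed to be character vectors of words from the two samples.

# Append words that appear in sample2 but not in sample1
add_new_words <- function(sample1_words, sample2_words) {
  new_words <- setdiff(unique(sample2_words), unique(sample1_words))
  c(sample1_words, new_words)
}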