Data Understanding

The first step in building a predictive model for text is understanding the distribution of, and the relationships between, the words, tokens, and phrases in the text.

This task uses the SwiftKey dataset, which contains files in four languages: English, German, Russian, and Finnish. Only the English files (en_US) are used here. The en_US data is 556MB in size and contains three text files: en_US.twitter.txt, en_US.blogs.txt, and en_US.news.txt.

  1. Load the data
  2. Count words and lines
  3. Plot distribution
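Below is a minimal sketch of how the loading and counting steps might be implemented. The file paths, the helper name summarize_file, and the whitespace-based word count are illustrative assumptions, not the exact code behind the summary that follows.

# Assumed location of the unzipped en_US files (hypothetical paths)
files <- c(Twitter = "final/en_US/en_US.twitter.txt",
           Blogs   = "final/en_US/en_US.blogs.txt",
           News    = "final/en_US/en_US.news.txt")

# Read each file and compute line count, word count, and the maximum words per line
summarize_file <- function(path) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  words_per_line <- sapply(strsplit(lines, "\\s+"), length)
  c(Lines = length(lines), Words = sum(words_per_line), MaxWords = max(words_per_line))
}

df <- data.frame(Source = names(files), t(sapply(files, summarize_file)), row.names = NULL)
df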
##    Source   Lines    Words MaxWords
## 1 Twitter 2360148 30373792      213
## 2   Blogs  899288 37334441    40835
## 3    News   77259  2643972     5760
library(ggplot2)
library(gridExtra)

# Bar chart of line counts per source (in millions)
g1 <- ggplot(df, aes(x = Source, y = Lines/1e+06)) + geom_bar(stat = "identity", fill = "blue") + 
  labs(title = "Count of Lines", y = "Number of Lines in Millions")

# Bar chart of word counts per source (in millions)
g2 <- ggplot(df, aes(x = Source, y = Words/1e+06)) + geom_bar(stat = "identity", fill = "lightblue") + 
  labs(title = "Count of Words", y = "Number of Words in Millions")

# Display the two charts side by side
grid.arrange(g1, g2, ncol = 2)

Data Summary

The en_US data contains three text files: en_US.twitter.txt, en_US.blogs.txt, and en_US.news.txt.

  • File Size: 558MB
  • Total number of lines: 3336695
  • Total number of words: 70352205
  • Maximum words in one line: 40835

Data Preparation

Random Sampling

This dataset is fairly large, and it is not necessary to load the entire dataset to build the algorithm. A smaller subset of the data is taken through random sampling.

  1. Create a vector with three components: Twitter, Blogs, and News
  2. Use the rbinom function with size = 1 (a single Bernoulli trial per line) to “flip a biased coin” and decide whether each line is included in the sample (1 = include the line, 0 = exclude it)
  3. Loop through the sample and store the selected lines of text, as sketched below
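A sketch of the sampling step, assuming a 1% sampling rate and that twitter, blogs, and news are character vectors holding the full text of each file (hypothetical names):

set.seed(1234)

# Flip a biased coin for each line: 1 = keep the line, 0 = discard it
sample_lines <- function(lines, rate = 0.01) {
  keep <- rbinom(n = length(lines), size = 1, prob = rate)
  lines[keep == 1]
}

sample_tw    <- sample_lines(twitter)
sample_blogs <- sample_lines(blogs)
sample_news  <- sample_lines(news)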
##    Source Lines  Words
## 1 Twitter 23832 307286
## 2   Blogs  8938 371901
## 3    News   738  24947

Data Preprocessing

  1. Convert the text into a corpus and remove any non-ASCII characters from the Twitter text
  2. For the Twitter text, replace “/”, “@”, “|”, “’s”, “#”, and URLs
  3. Convert to lower case
  4. Remove numbers and punctuation
  5. Remove stopwords
  6. Remove extra spaces
  7. Stem the text (the whole pipeline is sketched after this list)
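A sketch of this cleaning pipeline using the tm package; sample_tw stands for the sampled Twitter text from the previous step, and the exact regular expressions are assumptions for illustration:

library(tm)

# Remove non-ASCII characters, then build a corpus from the sampled Twitter text
tw_clean  <- iconv(sample_tw, from = "UTF-8", to = "ASCII", sub = "")
tw_corpus <- VCorpus(VectorSource(tw_clean))

# Replace Twitter-specific patterns ("/", "@", "|", "'s", "#", URLs) with spaces
to_space  <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
tw_corpus <- tm_map(tw_corpus, to_space, "/|@|\\|")
tw_corpus <- tm_map(tw_corpus, to_space, "'s|#")
tw_corpus <- tm_map(tw_corpus, to_space, "http\\S+|www\\.\\S+")

# Lower-case, drop numbers, punctuation, stopwords and extra whitespace, then stem
tw_corpus <- tm_map(tw_corpus, content_transformer(tolower))
tw_corpus <- tm_map(tw_corpus, removeNumbers)
tw_corpus <- tm_map(tw_corpus, removePunctuation)
tw_corpus <- tm_map(tw_corpus, removeWords, stopwords("english"))
tw_corpus <- tm_map(tw_corpus, stripWhitespace)
tw_corpus <- tm_map(tw_corpus, stemDocument)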

Word Frequency Analysis

  1. Create term-document matrix
  2. Further remove words with fewer than 2 letters
  3. Find the most frequent words
  4. Plot word frequency distribution
# Term-document matrix built from the cleaned Twitter corpus
tdm_twitter <- TermDocumentMatrix(tw_corpus)
matrix_tw <- as.matrix(tdm_twitter)

# Total count of each term across all documents, sorted in decreasing order
word_freq <- sort(rowSums(matrix_tw), decreasing = TRUE)

df_tw <- data.frame(word = names(word_freq), freq = word_freq)
head(df_tw)
##        word freq
## just   just 1511
## get     get 1458
## like   like 1340
## can     can 1330
## thank thank 1305
## love   love 1259
g_tw <- ggplot(df_tw[1:20, ], aes(x = reorder(word, -freq), y = freq)) + geom_bar(stat = "identity") + 
  labs(x = "Words", y = "Frequency", title = "Most Frequent Words in Twitter")

g_tw

wordcloud(words = df_tw$word, freq = df_tw$freq, min.freq = 4, max.words = 200,
          random.order = FALSE, scale = c(3, 0.5), colors = rainbow(3))

N-gram

An N-gram is a sequence of N words. N-gram modeling assigns a probability to the occurrence of an N-gram, or to a word occurring next given the preceding words. Under the Markov assumption, the next state depends only on the current state and is independent of the earlier history; hence N-gram models can be used to predict the next word. Unigrams can be used to find the most frequent words (plot above). A bigram model estimates the probability of a word given the one word before it; by the same token, a trigram model estimates the probability of a word given the two words before it.
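As a concrete illustration of how such counts turn into a next-word prediction: under a trigram model, the probability of the next word is estimated as the ratio of a trigram count to the count of its leading bigram. The counts below are made up for illustration only.

# Maximum-likelihood estimate under a trigram model:
# P(w3 | w1, w2) is approximately count(w1 w2 w3) / count(w1 w2)
count_trigram <- 8    # hypothetical count of "happi new year"
count_bigram  <- 10   # hypothetical count of "happi new"
count_trigram / count_bigram   # estimated probability of "year" following "happi new"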

library(tidytext)
library(dplyr)

# Document-term matrix of the cleaned Twitter corpus, converted to tidy one-term-per-row format
dtm_tw <- DocumentTermMatrix(tw_corpus)
tw_td <- tidy(dtm_tw)

# Tokenize the terms into bigrams and count their occurrences
tw_bigram <- tw_td %>% unnest_tokens(bigram, term, token = "ngrams", n = 2)
bigram_count <- tw_bigram %>% count(bigram, sort = TRUE)
bigram_filtered <- bigram_count %>%
  filter(!is.na(bigram))

head(bigram_filtered)
## # A tibble: 6 x 2
##   bigram        n
##   <chr>     <int>
## 1 like look   119
## 2 know let     97
## 3 now right    94
## 4 just like    78
## 5 just know    65
## 6 good morn    51
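The trigram counts shown next were presumably produced in the same way as the bigrams; a sketch of that step:

# Tokenize the tidy terms into trigrams and count their occurrences
tw_trigram <- tw_td %>% unnest_tokens(trigram, term, token = "ngrams", n = 3)
trigram_filtered <- tw_trigram %>%
  count(trigram, sort = TRUE) %>%
  filter(!is.na(trigram))

head(trigram_filtered)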
## # A tibble: 6 x 2
##   trigram               n
##   <chr>             <int>
## 1 day happi mother      8
## 2 happi new year        8
## 3 back follow pleas     7
## 4 just like look        7
## 5 just know let         6
## 6 know let like         6

Further Discussion

  1. The distribution of word frequencies can be seen in the “Most Frequent Words in Twitter/Blogs/News” plots.

  2. The top 20 most frequent 2-grams and 3-grams are visualized in the “Most Frequent Words in Bigram/Trigram” plots.

  3. In this task, 1% of the data was taken as a sample using the rbinom function. The Twitter sample contains 307286 words in total. The sample data was then transformed into a corpus and preprocessed: removing non-ASCII characters (non-English text), special signs, URLs, numbers, punctuation, and so on. After all of these steps were applied in order, the corpus is clean and its terms can be treated as unique words. These words were sorted by frequency and stored in a data frame. There are 171146 unique words in the Twitter sample, and they cover 55% of all word instances in English. One way to increase this coverage, to 90% for example, is to increase the sample size (a sketch of the coverage calculation is given after this list).

  4. One way (perhaps not the ideal way) to estimate how many words come from foreign languages is to compare the number of unique words before removing other languages, 163593 in the sample, with the number of unique words after removing other languages, 162727 in the same sample. This suggests that about 0.5% of the words in the sampled Twitter data come from foreign languages.

  5. A function can be built to compare the unique words in a sample1 corpus with those in a sample2 corpus, identify the words that are in sample2 but not in sample1, and then add these words to sample1, thus increasing coverage by introducing new unique words (see the second sketch below).
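A sketch of one way to compute the coverage figures mentioned in point 3, assuming df_tw holds the sample's words sorted by decreasing frequency (the helper name words_for_coverage is hypothetical):

# Number of most-frequent words needed to cover a given share of all word instances
words_for_coverage <- function(freq, target = 0.5) {
  cum_share <- cumsum(freq) / sum(freq)
  which(cum_share >= target)[1]
}

words_for_coverage(df_tw$freq, 0.5)   # words needed to cover 50% of word instances
words_for_coverage(df_tw$freq, 0.9)   # words needed to cover 90% of word instances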
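A sketch of the comparison described in point 5, assuming words_sample1 and words_sample2 are character vectors of the unique words in the two sample corpora (hypothetical names):

# Unique words that appear in sample2 but not in sample1
new_words <- setdiff(words_sample2, words_sample1)

# Merge them into sample1's vocabulary to increase coverage
words_sample1 <- union(words_sample1, new_words)
length(new_words)   # number of newly introduced unique words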