This report explores the three English datasets provided (Twitter, News, and Blogs). The datasets are large and would take a long time to analyze in full on a conventional personal computer, so smaller samples are subsetted from each dataset for exploratory data analysis.
A histogram of frequently used words in the datasets is plotted. Bigrams are created to tabulate the frequency of each unique token of two consecutive words, and the links between the first and second words of frequently observed bigrams are plotted for visualization.
We conclude that trigrams or even higher-order n-grams should be created to capture more specific contexts and potentially improve prediction accuracy. A Stupid Backoff model will be used for simplicity's sake, and ultimately a Shiny app will be created to predict the next word as the user types a sentence.
setwd("~/Desktop/DS 10 - Capstone/Coursera-SwiftKey/final/en_US")
con <- file("en_US.twitter.txt", "r");
us_twitter <- readLines(con); close(con)
con2 <- file("en_US.news.txt", "r");
us_news <- readLines(con2); close(con2)
con3 <- file("en_US.blogs.txt", "r");
us_blogs <- readLines(con3); close(con3)
library(dplyr); library(tidytext); library(ggplot2); library(tidyr)
library(igraph); library(ggraph)
summary(us_twitter)
##    Length     Class      Mode
##   2360148 character character
summary(us_news)
##    Length     Class      Mode
##   1010242 character character
summary(us_blogs)
##    Length     Class      Mode
##    899288 character character
summary(nchar(us_twitter))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    2.00   37.00   64.00   68.68  100.00  140.00
summary(nchar(us_news))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##     1.0   110.0   185.0   201.2   268.0 11384.0
summary(nchar(us_blogs))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##       1      47     156     230     329   40833
We observed that the Twitter dataset has 2,360,148 lines and that tweets are generally short (mostly 60 to 70 characters, capped at 140), while the News and Blogs datasets have 1,010,242 and 899,288 lines respectively, with generally longer entries (roughly 200 to 230 characters per entry on average).
As a start, we sample 100,000 lines from each of the 3 datasets for exploratory data analysis.
set.seed(122)
subset1 <- sample(1:length(us_twitter))[1:100000]
subset2 <- sample(1:length(us_news))[1:100000]
subset3 <- sample(1:length(us_blogs))[1:100000]
twitter_sdf <- data_frame(line = subset1, text = us_twitter[subset1], dataType = "twitter") #sample twitter df
news_sdf <- data_frame(line = subset2, text = us_news[subset2], dataType = "news") #sample news df
blogs_sdf <- data_frame(line = subset3, text = us_blogs[subset3], dataType = "blogs") #sample blogs df
sdf <- rbind(twitter_sdf, news_sdf, blogs_sdf)
First, we tokenize each row/line by breaking the text into individual tokens (in this case, single words). The tidytext library is used to transform the original data into a tidy text data frame. Punctuation is removed, and all tokens are converted to lower case for easier comparison and combination across datasets.
tidy_df1 <- sdf %>% unnest_tokens(word, text)
print(tidy_df1)
## # A tibble: 8,896,713 x 3
## line dataType word
## <int> <chr> <chr>
## 1 2135983 twitter hahaha
## 2 2135983 twitter that
## 3 2135983 twitter could
## 4 2135983 twitter have
## 5 2135983 twitter been
## 6 2135983 twitter potentially
## 7 2135983 twitter awkward
## 8 2135983 twitter to
## 9 2135983 twitter whoever
## 10 2135983 twitter is
## # ... with 8,896,703 more rows
Next, we create a unigram table, where each token's frequency is calculated and tabulated. For now, we treat the 3 datasets as independent and count token frequencies separately for each dataset (twitter/blogs/news). Stop words (such as "the", "of", "to") are removed as they are not useful for the text analysis we are doing. The list of stop words we will be excluding is taken from the stop_words dataset in the tidytext package.
count_df <- tidy_df1 %>% anti_join(stop_words) %>% group_by(dataType) %>% count(word, sort = T)
## Joining, by = "word"
print(count_df)
## Source: local data frame [255,868 x 3]
## Groups: dataType [3]
##
## # A tibble: 255,868 x 3
## dataType word n
## <chr> <chr> <int>
## 1 blogs time 10036
## 2 blogs people 6705
## 3 blogs day 5871
## 4 news time 5743
## 5 blogs love 4920
## 6 blogs life 4649
## 7 news people 4617
## 8 twitter love 4438
## 9 blogs it’s 4182
## 10 twitter day 3911
## # ... with 255,858 more rows
cplot_df <- count_df %>% filter(n > 2500) %>% mutate(word = reorder(word, n))
ggplot(cplot_df, aes(word, n, fill = dataType)) + geom_col() + coord_flip() + facet_wrap(~dataType)
We observed that only Twitter has words like "rt" and "lol". We know that "lol" is usually used informally, and hence we would expect no occurrence of such a word in the News and Blogs datasets. We now need to investigate what "rt" means.
tidy_df1 %>% filter(word == "rt")
## # A tibble: 3,778 x 3
## line dataType word
## <int> <chr> <chr>
## 1 84419 twitter rt
## 2 66080 twitter rt
## 3 940724 twitter rt
## 4 997669 twitter rt
## 5 1739729 twitter rt
## 6 995539 twitter rt
## 7 1795973 twitter rt
## 8 438468 twitter rt
## 9 769280 twitter rt
## 10 2163203 twitter rt
## # ... with 3,768 more rows
twitter_sdf %>% filter(line %in% c(84419, 66080, 940724)) %>% select(text)
## # A tibble: 3 x 1
## text
## <chr>
## 1 Wasn't funny at the time, though. RT this made me laugh.
## 2 “: RT : in March sometime.--bet aye u got a new number?” naw still got the
## 3 RT : “Write it on your heart that every day is the best day in the year.” -
So, we subset a few lines of tweets containing "rt" and observed that it is mostly capitalized. (I personally do not use Twitter; Facebook is enough to steal all my productivity away!) After some googling, I found out that RT means "Retweet" or "Real Time". So it makes sense that "rt" appears frequently in the Twitter dataset.
Words in a sentence tend to depend on one another. An n-gram treats n consecutive words as one token and tabulates the frequency of each unique token. A bigram, where n = 2, considers the frequency of every pair of 2 consecutive words in the text dataset.
bigram_df <- sdf %>% unnest_tokens(bigram ,text, token = "ngrams", n = 2)
print(bigram_df)
## # A tibble: 8,596,713 x 3
## line dataType bigram
## <int> <chr> <chr>
## 1 1 blogs in the
## 2 1 blogs the years
## 3 1 blogs years thereafter
## 4 1 blogs thereafter most
## 5 1 blogs most of
## 6 1 blogs of the
## 7 1 blogs the oil
## 8 1 blogs oil fields
## 9 1 blogs fields and
## 10 1 blogs and platforms
## # ... with 8,596,703 more rows
count_bigram_df <- bigram_df %>% count(bigram, sort = T)
print(count_bigram_df)
## # A tibble: 2,465,003 x 2
## bigram n
## <chr> <int>
## 1 of the 41779
## 2 in the 37549
## 3 to the 19419
## 4 on the 17789
## 5 for the 16508
## 6 to be 14329
## 7 at the 12614
## 8 and the 12512
## 9 in a 10986
## 10 with the 9981
## # ... with 2,464,993 more rows
We start by separating each bigram into word1 and word2. After removing the stop words, we keep only the most frequent bigrams and visualize them using a function adapted from Julia Silge and David Robinson, together with the R packages igraph and ggraph.
sep_df <- bigram_df %>% separate(bigram, c("word1", "word2"), sep = " ")
bigram_filtered <- sep_df %>%
filter(!word1 %in% stop_words$word) %>%
filter(!word2 %in% stop_words$word)
bigram_count <- bigram_filtered %>% count(word1, word2, sort = T)
bigram_graph <- bigram_count %>% filter(n > 200)
print(bigram_graph)
## Source: local data frame [41 x 3]
## Groups: word1 [33]
##
## # A tibble: 41 x 3
## word1 word2 n
## <chr> <chr> <int>
## 1 st louis 1048
## 2 1 2 799
## 3 los angeles 645
## 4 san francisco 562
## 5 30 p.m 482
## 6 health care 471
## 7 happy birthday 414
## 8 san diego 392
## 9 social media 386
## 10 ice cream 355
## # ... with 31 more rows
visualize_bigrams <- function(bigrams) {
  set.seed(2000)
  a <- grid::arrow(type = "closed", length = unit(.15, "inches"))
  bigrams %>%
    graph_from_data_frame() %>%
    ggraph(layout = "fr") +
    geom_edge_link(aes(edge_alpha = n), show.legend = FALSE, arrow = a) +
    geom_node_point(color = "lightblue", size = 5) +
    geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
    theme_void()
}
bigram_graph %>% visualize_bigrams()
From the plot above, we observed a few obvious bigrams with very high frequency, for example "ice cream", "happy birthday" and "vice president". We also noticed the need to create trigrams (n = 3) and higher-order n-grams to account for phrases such as "chief executive director" and "york city council", where the latter could stand for "new york city council".
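As a rough sketch of that extension (not run as part of this analysis), the same unnest_tokens() workflow can be reused with n = 3; the object names below (trigram_df, count_trigram_df, sep_trigram_df) are ours for illustration, not objects created above.
trigram_df <- sdf %>% unnest_tokens(trigram, text, token = "ngrams", n = 3) #sketch: 3-word tokens
count_trigram_df <- trigram_df %>% count(trigram, sort = T) #tabulate trigram frequencies
sep_trigram_df <- trigram_df %>% separate(trigram, c("word1", "word2", "word3"), sep = " ") #split into component words, mirroring the bigram step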
The next step is to create a prediction algorithm based on the Stupid Backoff model, as the Katz Backoff model is slightly more difficult to code than Stupid Backoff. Trigrams will be created and added for better accuracy.
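To make the planned scoring concrete, below is a minimal sketch of Stupid Backoff (with the usual backoff factor of 0.4), assuming unigram, bigram and trigram count tables named uni (columns word, n), bi (columns word1, word2, n) and tri (columns word1, word2, word3, n); these table and column names are assumptions for illustration, not objects created above.
#Sketch of Stupid Backoff scoring: use the trigram relative frequency if the
#trigram was seen, otherwise back off to the bigram and then the unigram,
#multiplying by lambda = 0.4 at each backoff step
stupid_backoff <- function(w1, w2, candidate, uni, bi, tri, lambda = 0.4) {
  tri_hit <- tri %>% filter(word1 == w1, word2 == w2, word3 == candidate)
  bi_ctx <- bi %>% filter(word1 == w1, word2 == w2)
  if (nrow(tri_hit) > 0 && nrow(bi_ctx) > 0) {
    return(sum(tri_hit$n) / sum(bi_ctx$n)) #trigram relative frequency
  }
  bi_hit <- bi %>% filter(word1 == w2, word2 == candidate)
  uni_ctx <- uni %>% filter(word == w2)
  if (nrow(bi_hit) > 0 && nrow(uni_ctx) > 0) {
    return(lambda * sum(bi_hit$n) / sum(uni_ctx$n)) #back off to bigram
  }
  uni_hit <- uni %>% filter(word == candidate)
  lambda^2 * sum(uni_hit$n) / sum(uni$n) #back off to unigram
}
For example, stupid_backoff("happy", "birthday", "to", uni, bi, tri) would score "to" as a candidate next word after "happy birthday".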
Ultimately, the Shiny app will predict the next word as the user types a sentence.
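As a sketch of how that could look (assumptions only: predict_next_word() is a hypothetical wrapper around the n-gram scoring, and the layout is illustrative rather than the final design), a minimal Shiny app would pair a text input with a reactive prediction output.
library(shiny)
predict_next_word <- function(phrase) {
  #hypothetical wrapper: would rank candidate words with stupid_backoff()
  #over the n-gram tables; placeholder output for this sketch
  c("the", "a", "to")
}
ui <- fluidPage(
  titlePanel("Next Word Prediction (sketch)"),
  textInput("phrase", "Type a sentence:"),
  verbatimTextOutput("prediction")
)
server <- function(input, output) {
  output$prediction <- renderPrint({
    req(input$phrase)
    predict_next_word(input$phrase)
  })
}
shinyApp(ui = ui, server = server)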