Around the world, people are spending an increasing amount of time on their mobile devices for email, social networking, banking and a whole range of other activities. But typing on mobile devices can be a serious pain. SwiftKey, our corporate partner in this project, builds a smart keyboard that makes it easier for people to type on their mobile devices. One cornerstone of their smart keyboard is predictive text models. When someone types:
“I went to the…”
the keyboard presents three options for what the next word might be. For example, the three words might be gym, store, restaurant. This is very convenient - but how is it possible to present words which are relevant and fitting for the given context?
In this project, we will work on understanding and building predictive text models like those used by SwiftKey. As a basis, we will analyse language examples collected from common internet sources. In this report, exploratory analyses of the available data sets are performed.
The data sets used in this project consist of corpora, which were collected from publicly available sources by a web crawler. The crawler checked for language, so as to mainly get texts consisting of the desired language (English). Once the raw corpus had been collected, it was parsed further to remove duplicate entries and split into individual lines. Approximately 50% of each entry was then deleted. Since entries cannot be fully recreated, they are considered as anonymised.
In our data sets, each entry was tagged with it’s date of publication. Where user comments were included, they were tagged with the date of the main entry. In addition, each entry was tagged with the type of entry, based on the type of website it is collected from (e.g. newspaper or personal blog). If possible, each entry was tagged with one or more subjects based on the title or keywords of the entry (e.g. if the entry comes from the sports section of a newspaper it was tagged with “sports” subject). In many cases it was not feasible to tag the entries or no subject was found by the automated process, in which case the entry was tagged with a ‘0’. To save space, the subject and type is given as a numerical code.
The first step in building a predictive model for text is
understanding the data sets in terms of distribution and relationship
between the words and phrases in the text. 1. Exploratory analysis
2. Understand frequencies of words and word pairs
Questions to consider
How many unique words do you need in a frequency sorted dictionary to cover 50% of all word instances in the language? 90%?
How do you evaluate how many of the words come from foreign languages?
Can you think of a way to increase the coverage – identifying words that may not be in the corpora or using a smaller number of words in the dictionary to cover the same number of phrases?
In a first step, an exploratory analysis of the data set is performed. Descriptive Statistics are given in the following table and figure:
# Data overview
# List of words
words_blogs <- unlist(tokens(blogs_clean))
words_news <- unlist(tokens(news_clean))
words_twitter <- unlist(tokens(twitter_clean))
# Comparison of documents available in the three data sets
lengths_blogs <- ntoken(blogs_clean)
lengths_news <- ntoken(news_clean)
lengths_twitter <- ntoken(twitter_clean)
summary_table <- data.frame(
Dataset = c("Blogs", "News", "Twitter"),
Documents = c(length(lengths_blogs), length(lengths_news), length(lengths_twitter)),
Total_Words = c(sum(lengths_blogs), sum(lengths_news), sum(lengths_twitter)),
Mean = c(mean(lengths_blogs), mean(lengths_news), mean(lengths_twitter)),
Median = c(median(lengths_blogs), median(lengths_news), median(lengths_twitter)),
Min = c(min(lengths_blogs), min(lengths_news), min(lengths_twitter)),
Max = c(max(lengths_blogs), max(lengths_news), max(lengths_twitter))
)
summary_table[, 4:7] <- round(summary_table[, 4:7], 1)
kable(summary_table, caption = "Descriptive Statistics",
booktabs = TRUE, digits = 3, align = "c")
| Dataset | Documents | Total_Words | Mean | Median | Min | Max |
|---|---|---|---|---|---|---|
| Blogs | 44964 | 1752804 | 39.0 | 26 | 0 | 1985 |
| News | 3862 | 123953 | 32.1 | 30 | 0 | 279 |
| 118007 | 1398008 | 11.8 | 11 | 0 | 42 |
plot1 <- boxplot(
ntoken(blogs_clean)[ntoken(blogs_clean) < 2000],
ntoken(news_clean)[ntoken(news_clean) < 1000],
ntoken(twitter_clean)[ntoken(twitter_clean) < 200],
names = c("Blogs", "News", "Twitter"),
main = "Document Length Comparison", col = "lightblue",
ylim = c(0, 300)
)
In a second step, data sets are compared regarding the most
frequent and least frequent words they contain. Interestingly, while the
most frequent words show up repeatedly in the three data sets, the least
frequent words seem to be unique and differ from one data set to
another.
# Top 20 and bottom 10 words
## Blogs
dfm_blogs <- dfm(tokens(blogs_clean))
top_blogs <- topfeatures(dfm_blogs, 20)
freq_blogs <- colSums(dfm_blogs)
freq_blogs_filt <- freq_blogs[freq_blogs > 1]
bottom10_blogs <- sort(freq_blogs_filt)[1:10]
## News
dfm_news <- dfm(tokens(news_clean))
top_news <- topfeatures(dfm_news, 20)
freq_news <- colSums(dfm_news)
freq_news_filt <- freq_news[freq_news > 1]
bottom10_news <- sort(freq_news_filt)[1:10]
## Twitter
dfm_twitter <- dfm(tokens(twitter_clean))
top_twitter <- topfeatures(dfm_twitter, 20)
freq_twitter <- colSums(dfm_twitter)
freq_twitter_filt <- freq_twitter[freq_twitter > 1]
bottom10_twitter <- sort(freq_twitter_filt)[1:10]
## Create and join data frames for Top20 words
df_blogs <- data.frame(Feature = names(top_blogs), Blogs = as.integer(top_blogs))
df_news <- data.frame(Feature = names(top_news), News = as.integer(top_news))
df_twitter <- data.frame(Feature = names(top_twitter), Twitter = as.integer(top_twitter))
df_all <- full_join(df_blogs, df_news, by = "Feature") %>%
full_join(df_twitter, by = "Feature") %>%
replace(is.na(.), 0)
## Transform data to long format
df_long <- df_all %>%
pivot_longer(cols = c(Blogs, News, Twitter),
names_to = "Quelle",
values_to = "Haeufigkeit")
## Select data
top_features <- df_long %>%
group_by(Feature) %>%
summarise(Total = sum(Haeufigkeit)) %>%
arrange(desc(Total)) %>%
slice_head(n = 10) %>%
pull(Feature)
df_plot <- df_long %>%
filter(Feature %in% top_features)
## Create chart
ggplot(df_plot, aes(x = reorder(Feature, Haeufigkeit), y = Haeufigkeit, fill = Quelle)) +
geom_col(position = "dodge") +
coord_flip() + # horizontale Balken
labs(title = "Top 10 words in Blogs, News and Twitter",
x = "Feature",
y = "Frequency") +
theme_minimal()
Top 20 words in Blogs, News and Twitter
## Create and join data frames for bottom 10 words
db_blogs <- data.frame(Feature = names(bottom10_blogs), Blogs = as.integer(bottom10_blogs))
db_news <- data.frame(Feature = names(bottom10_news), News = as.integer(bottom10_news))
db_twitter <- data.frame(Feature = names(bottom10_twitter), Twitter = as.integer(bottom10_twitter))
db_all <- full_join(db_blogs, db_news, by = "Feature") %>%
full_join(db_twitter, by = "Feature") %>%
replace(is.na(.), 0) %>%
arrange(Feature)
## Create table
kable(db_all, caption = "Least frequent Words",
booktabs = TRUE, digits = 3, align = "c")
| Feature | Blogs | News | |
|---|---|---|---|
| #survivor | 0 | 0 | 2 |
| 6’6 | 0 | 0 | 2 |
| andretti | 0 | 0 | 2 |
| babydaddy | 0 | 0 | 2 |
| bruschetta | 2 | 0 | 0 |
| cannellini | 2 | 0 | 0 |
| cheapness | 2 | 0 | 0 |
| columnist | 0 | 2 | 0 |
| crowned | 0 | 2 | 0 |
| divest | 2 | 0 | 0 |
| dusters | 2 | 0 | 0 |
| everglades | 2 | 0 | 0 |
| first-hand | 0 | 2 | 0 |
| flies | 0 | 2 | 0 |
| launchpad | 0 | 0 | 2 |
| listened | 0 | 2 | 0 |
| lobbyists | 0 | 2 | 0 |
| motto | 0 | 2 | 0 |
| outliers | 0 | 0 | 2 |
| pageant | 0 | 2 | 0 |
| pent | 2 | 0 | 0 |
| recieve | 0 | 0 | 2 |
| retrain | 0 | 0 | 2 |
| shaff | 0 | 2 | 0 |
| shelp | 0 | 2 | 0 |
| stemcells | 0 | 0 | 2 |
| succumbing | 2 | 0 | 0 |
| swivels | 2 | 0 | 0 |
| trix | 0 | 0 | 2 |
| walden | 2 | 0 | 0 |
Language can be characterized by the proportion of rare words in
it. The following table gives an respective overview of the ratio for
the three data sets. In sum, the ratio is rather typical for language
samples. Nonetheless, we find a larger proportion of seldom words in the
“News”-data set than in the “Blogs”- or the “Twitter”-data set.
# Proportion of rare words
prop_blogs <- table(words_blogs)
rare_words_b <- sum(prop_blogs == 2)
prop_news <- table(words_news)
rare_words_n <- sum(prop_news == 2)
prop_twitter <- table(words_twitter)
rare_words_t <- sum(prop_twitter == 2)
## Proportion of seldom words per data set
rare_ratio_blogs <- rare_words_b / length(prop_blogs)
rare_ratio_news <- rare_words_n / length(prop_news)
rare_ratio_twitter <- rare_words_t / length(prop_twitter)
total_ratio <- sum(c(rare_words_b, rare_words_n, rare_words_t)) / sum(c(length(prop_blogs), length(prop_news), length(prop_twitter)))
## Create Table
result <- data.frame(
Variable = c("Ratio blogs", "Ratio news", "Ratio twitter", "Ratio total"),
Wert = c(rare_ratio_blogs, rare_ratio_news, rare_ratio_twitter, total_ratio)
)
kable(result, caption = "Ratio rare words",
booktabs = TRUE, digits = 3, align = "c")
| Variable | Wert |
|---|---|
| Ratio blogs | 0.132 |
| Ratio news | 0.147 |
| Ratio twitter | 0.121 |
| Ratio total | 0.129 |
Finally, the frequencies of 2- and 3-grams in the data sets is
analysed. The following tables illustrate the respective results.
## Create bigrams
blogs_bi <- tokens_ngrams(blogs_clean, n = 2)
news_bi <- tokens_ngrams(news_clean, n = 2)
twitter_bi <- tokens_ngrams(twitter_clean, n = 2)
## Calculate frequencies
dfm_bi_blogs <- dfm(blogs_bi)
dfm_bi_news <- dfm(news_bi)
dfm_bi_twitter <- dfm(twitter_bi)
top_bi_blogs <- topfeatures(dfm_bi_blogs, 10)
top_bi_news <- topfeatures(dfm_bi_news, 10)
top_bi_twitter <- topfeatures(dfm_bi_twitter, 10)
## Create and join data frames for top 10 bigrams
## Blogs
db_bi_blogs <- data.frame(
Feature = colnames(dfm_bi_blogs),
Blogs = colSums(dfm_bi_blogs),
row.names = NULL
)
db_bi_blogs <- db_bi_blogs %>%
arrange(desc(Blogs)) %>%
slice(1:10)
## Create table
kable(db_bi_blogs, caption = "Top 10 Blog-Bigrams",
booktabs = TRUE, digits = 0, align = c("l", "c")) %>%
kable_styling(position = "left") -> tab_bi_blogs
## News
db_bi_news <- data.frame(
Feature = colnames(dfm_bi_news),
News = colSums(dfm_bi_news),
row.names = NULL
)
db_bi_news <- db_bi_news%>%
arrange(desc(News)) %>%
slice(1:10)
## Create table
kable(db_bi_news, caption = "Top 10 News-Bigrams",
booktabs = TRUE, digits = 0, align = c("l", "c")) %>%
kable_styling(position = "center") -> tab_bi_news
## Twitter
db_bi_twitter <- data.frame(
Feature = colnames(dfm_bi_twitter),
Twitter = colSums(dfm_bi_twitter),
row.names = NULL
)
db_bi_twitter <- db_bi_twitter %>%
arrange(desc(Twitter)) %>%
slice(1:10)
## Create table
kable(db_bi_twitter, caption = "Top 10 Twitter-Bigrams",
booktabs = TRUE, digits = 0,
align = c("l", "c")) %>%
kable_styling(position = "right") -> tab_bi_twitter
# Align tables in document
library(htmltools)
## Warning: Paket 'htmltools' wurde unter R Version 4.4.3 erstellt
browsable(
tagList(
div(style="display:flex; gap: 20px;",
HTML(tab_bi_blogs),
HTML(tab_bi_news),
HTML(tab_bi_twitter)
)))
| Feature | Blogs |
|---|---|
| of_the | 9189 |
| in_the | 7744 |
| to_the | 4253 |
| on_the | 3740 |
| to_be | 3390 |
| for_the | 2982 |
| and_the | 2851 |
| it_is | 2381 |
| at_the | 2287 |
| it_was | 2284 |
| Feature | News |
|---|---|
| of_the | 699 |
| in_the | 659 |
| to_the | 331 |
| for_the | 282 |
| on_the | 257 |
| at_the | 230 |
| and_the | 217 |
| with_the | 179 |
| to_be | 179 |
| he_said | 145 |
| Feature | |
|---|---|
| in_the | 3903 |
| for_the | 3707 |
| of_the | 2857 |
| on_the | 2362 |
| to_be | 2359 |
| to_the | 2136 |
| thanks_for | 2104 |
| at_the | 1904 |
| going_to | 1796 |
| thank_you | 1699 |
# Trigrams
## Create trigrams
blogs_tri <- tokens_ngrams(blogs_clean, n = 3)
news_tri <- tokens_ngrams(news_clean, n = 3)
twitter_tri <- tokens_ngrams(twitter_clean, n = 3)
## Calculate frequencies
dfm_tri_blogs <- dfm(blogs_tri)
dfm_tri_news <- dfm(news_tri)
dfm_tri_twitter <- dfm(twitter_tri)
top_tri_blogs <- topfeatures(dfm_tri_blogs, 10)
top_tri_news <- topfeatures(dfm_tri_news, 10)
top_tri_twitter <- topfeatures(dfm_tri_twitter, 10)
## Create and join data frames for top 10 trigrams
## Blogs
db_tri_blogs <- data.frame(
Feature = colnames(dfm_tri_blogs),
Blogs = colSums(dfm_tri_blogs),
row.names = NULL
)
db_tri_blogs <- db_tri_blogs %>%
arrange(desc(Blogs)) %>%
slice(1:10)
## Create table
kable(db_tri_blogs, caption = "Top 10 Blog-Trigrams",
booktabs = TRUE, digits = 0, align = c("l", "c")) %>%
kable_styling(position = "left") -> tab_tri_blogs
## News
db_tri_news <- data.frame(
Feature = colnames(dfm_tri_news),
News = colSums(dfm_tri_news),
row.names = NULL
)
db_tri_news <- db_tri_news%>%
arrange(desc(News)) %>%
slice(1:10)
## Create table
kable(db_tri_news, caption = "Top 10 News-Trigrams",
booktabs = TRUE, digits = 0, align = c("l", "c")) %>%
kable_styling(position = "center") -> tab_tri_news
## Twitter
db_tri_twitter <- data.frame(
Feature = colnames(dfm_tri_twitter),
Twitter = colSums(dfm_tri_twitter),
row.names = NULL
)
db_tri_twitter <- db_tri_twitter %>%
arrange(desc(Twitter)) %>%
slice(1:10)
## Create table
kable(db_tri_twitter, caption = "Top 10 Twitter-Trigrams",
booktabs = TRUE, digits = 0,
align = c("l", "c")) %>%
kable_styling(position = "right") -> tab_tri_twitter
# Align tables in document
library(htmltools)
browsable(
tagList(
div(style="display:flex; gap: 20px;",
HTML(tab_tri_blogs),
HTML(tab_tri_news),
HTML(tab_tri_twitter)
)))
| Feature | Blogs |
|---|---|
| one_of_the | 722 |
| as_well_as | 339 |
| out_of_the | 317 |
| some_of_the | 315 |
| the_end_of | 314 |
| be_able_to | 313 |
| going_to_be | 263 |
| the_fact_that | 258 |
| part_of_the | 248 |
| one_of_my | 240 |
| Feature | News |
|---|---|
| one_of_the | 54 |
| as_well_as | 29 |
| some_of_the | 27 |
| going_to_be | 26 |
| in_the_first | 26 |
| of_the_year | 24 |
| according_to_the | 23 |
| the_end_of | 20 |
| the_first_time | 18 |
| be_able_to | 18 |
| Feature | |
|---|---|
| thanks_for_the | 1162 |
| thank_you_for | 427 |
| looking_forward_to | 426 |
| can't_wait_to | 403 |
| going_to_be | 384 |
| for_the_follow | 355 |
| one_of_the | 296 |
| to_see_you | 267 |
| i'm_going_to | 261 |
| is_going_to | 261 |
The exploratory analysis of the three data sets (Blogs, News, and Twitter) revealed certain differences but also similarities of the data. The “Blogs”-data set contains the longest documents with the most words while the “Twitter”-data set contains the shortest documents. The most frequent words are very similar in all three data sets. The same is true for bi-grams and tri-grams as regards the “Blogs”- and “News”-data sets. However, the “Twitter”-data set seems to be slightly different and contains mainly common phrases in the top 3. The least frequent words also differ between the data sets and thereby seem to contribute discriminating information about the language used in the three data sources.
On a basis of these results, it seems to be worth considering all three data sets for further (predictive) analyses, as they contribute unique information about the English language. Therefore, for the definition of first prediction models, all three data sets will be joined. However, the result will be a large data base why it will be very important to minimize both the size and runtime of the model in order to provide a reasonable experience to the later user.