Project background

Around the world, people are spending an increasing amount of time on their mobile devices for email, social networking, banking and a whole range of other activities. But typing on mobile devices can be a serious pain. SwiftKey, our corporate partner in this project, builds a smart keyboard that makes it easier for people to type on their mobile devices. One cornerstone of their smart keyboard is predictive text models. When someone types:

“I went to the…”

the keyboard presents three options for what the next word might be. For example, the three words might be gym, store, restaurant. This is very convenient - but how is it possible to present words which are relevant and fitting for the given context?


Objectives

In this project, we will work on understanding and building predictive text models like those used by SwiftKey. As a basis, we will analyse language examples collected from common internet sources. In this report, exploratory analyses of the available data sets are performed.


Methodology

Data set: About the Corpora

The data sets used in this project consist of corpora, which were collected from publicly available sources by a web crawler. The crawler checked for language, so as to mainly get texts consisting of the desired language (English). Once the raw corpus had been collected, it was parsed further to remove duplicate entries and split into individual lines. Approximately 50% of each entry was then deleted. Since entries cannot be fully recreated, they are considered as anonymised.

In our data sets, each entry was tagged with it’s date of publication. Where user comments were included, they were tagged with the date of the main entry. In addition, each entry was tagged with the type of entry, based on the type of website it is collected from (e.g. newspaper or personal blog). If possible, each entry was tagged with one or more subjects based on the title or keywords of the entry (e.g. if the entry comes from the sports section of a newspaper it was tagged with “sports” subject). In many cases it was not feasible to tag the entries or no subject was found by the automated process, in which case the entry was tagged with a ‘0’. To save space, the subject and type is given as a numerical code.

Exploratory analyses

The first step in building a predictive model for text is understanding the data sets in terms of distribution and relationship between the words and phrases in the text. 1. Exploratory analysis
2. Understand frequencies of words and word pairs

Questions to consider

  • Some words are more frequent than others - what are the distributions of word frequencies?
  • What are the frequencies of 2-grams and 3-grams in the data set?

How many unique words do you need in a frequency sorted dictionary to cover 50% of all word instances in the language? 90%?

How do you evaluate how many of the words come from foreign languages?

Can you think of a way to increase the coverage – identifying words that may not be in the corpora or using a smaller number of words in the dictionary to cover the same number of phrases?

In a first step, an exploratory analysis of the data set is performed. Descriptive Statistics are given in the following table and figure:

# Data overview

    # List of words

      words_blogs <- unlist(tokens(blogs_clean)) 
      
      words_news <- unlist(tokens(news_clean))
      
      words_twitter <- unlist(tokens(twitter_clean))
        
    # Comparison of documents available in the three data sets

      lengths_blogs <- ntoken(blogs_clean)
      lengths_news <- ntoken(news_clean)
      lengths_twitter <- ntoken(twitter_clean)
      
      summary_table <- data.frame(
        Dataset = c("Blogs", "News", "Twitter"),
        Documents = c(length(lengths_blogs), length(lengths_news), length(lengths_twitter)),
        Total_Words = c(sum(lengths_blogs), sum(lengths_news), sum(lengths_twitter)),
        Mean = c(mean(lengths_blogs), mean(lengths_news), mean(lengths_twitter)),
        Median = c(median(lengths_blogs), median(lengths_news), median(lengths_twitter)),
        Min = c(min(lengths_blogs), min(lengths_news), min(lengths_twitter)),
        Max = c(max(lengths_blogs), max(lengths_news), max(lengths_twitter))
          )
      
     summary_table[, 4:7] <- round(summary_table[, 4:7], 1)
     
     kable(summary_table, caption = "Descriptive Statistics", 
              booktabs = TRUE, digits = 3, align = "c")
Descriptive Statistics
Dataset Documents Total_Words Mean Median Min Max
Blogs 44964 1752804 39.0 26 0 1985
News 3862 123953 32.1 30 0 279
Twitter 118007 1398008 11.8 11 0 42
     plot1 <- boxplot(
             ntoken(blogs_clean)[ntoken(blogs_clean) < 2000],
             ntoken(news_clean)[ntoken(news_clean) < 1000],
             ntoken(twitter_clean)[ntoken(twitter_clean) < 200],
          names = c("Blogs", "News", "Twitter"),
          main = "Document Length Comparison", col = "lightblue",
          ylim = c(0, 300)
              )


In a second step, data sets are compared regarding the most frequent and least frequent words they contain. Interestingly, while the most frequent words show up repeatedly in the three data sets, the least frequent words seem to be unique and differ from one data set to another.

    # Top 20 and bottom 10 words
        
      ## Blogs
     
        dfm_blogs <- dfm(tokens(blogs_clean))
        top_blogs <- topfeatures(dfm_blogs, 20)

        freq_blogs <- colSums(dfm_blogs)  
        freq_blogs_filt <- freq_blogs[freq_blogs > 1]
        bottom10_blogs <- sort(freq_blogs_filt)[1:10]
        
      ## News
     
        dfm_news <- dfm(tokens(news_clean))
        top_news <- topfeatures(dfm_news, 20)

        freq_news <- colSums(dfm_news)  
        freq_news_filt <- freq_news[freq_news > 1]
        bottom10_news <- sort(freq_news_filt)[1:10]
        
      ## Twitter
     
        dfm_twitter <- dfm(tokens(twitter_clean))
        top_twitter <- topfeatures(dfm_twitter, 20)

        freq_twitter <- colSums(dfm_twitter)  
        freq_twitter_filt <- freq_twitter[freq_twitter > 1]
        bottom10_twitter <- sort(freq_twitter_filt)[1:10]
        
     ## Create and join data frames for Top20 words
        
        df_blogs <- data.frame(Feature = names(top_blogs), Blogs = as.integer(top_blogs))
        df_news <- data.frame(Feature = names(top_news), News = as.integer(top_news))
        df_twitter <- data.frame(Feature = names(top_twitter), Twitter = as.integer(top_twitter))

        df_all <- full_join(df_blogs, df_news, by = "Feature") %>%
          full_join(df_twitter, by = "Feature") %>%
          replace(is.na(.), 0)  

        ## Transform data to long format
        
            df_long <- df_all %>%
            pivot_longer(cols = c(Blogs, News, Twitter),
               names_to = "Quelle",
               values_to = "Haeufigkeit")

        ## Select data
                  
            top_features <- df_long %>%
                  group_by(Feature) %>%
                  summarise(Total = sum(Haeufigkeit)) %>%
                  arrange(desc(Total)) %>%
                  slice_head(n = 10) %>%
                  pull(Feature)
          
                  df_plot <- df_long %>%
                        filter(Feature %in% top_features)
          
        ## Create chart
                  
            ggplot(df_plot, aes(x = reorder(Feature, Haeufigkeit), y = Haeufigkeit, fill = Quelle)) +
                  geom_col(position = "dodge") +
                  coord_flip() +  # horizontale Balken
                  labs(title = "Top 10 words in Blogs, News and Twitter",
                  x = "Feature",
                  y = "Frequency") +
                  theme_minimal()
Top 20 words in Blogs, News and Twitter

Top 20 words in Blogs, News and Twitter

    ## Create and join data frames for bottom 10 words
        
            db_blogs <- data.frame(Feature = names(bottom10_blogs), Blogs = as.integer(bottom10_blogs))
            db_news <- data.frame(Feature = names(bottom10_news), News = as.integer(bottom10_news))
            db_twitter <- data.frame(Feature = names(bottom10_twitter), Twitter = as.integer(bottom10_twitter))
          
            db_all <- full_join(db_blogs, db_news, by = "Feature") %>%
                full_join(db_twitter, by = "Feature") %>%
                replace(is.na(.), 0) %>%
                arrange(Feature)
          
        ## Create table
                  
            kable(db_all, caption = "Least frequent Words", 
                    booktabs = TRUE, digits = 3, align = "c")
Least frequent Words
Feature Blogs News Twitter
#survivor 0 0 2
6’6 0 0 2
andretti 0 0 2
babydaddy 0 0 2
bruschetta 2 0 0
cannellini 2 0 0
cheapness 2 0 0
columnist 0 2 0
crowned 0 2 0
divest 2 0 0
dusters 2 0 0
everglades 2 0 0
first-hand 0 2 0
flies 0 2 0
launchpad 0 0 2
listened 0 2 0
lobbyists 0 2 0
motto 0 2 0
outliers 0 0 2
pageant 0 2 0
pent 2 0 0
recieve 0 0 2
retrain 0 0 2
shaff 0 2 0
shelp 0 2 0
stemcells 0 0 2
succumbing 2 0 0
swivels 2 0 0
trix 0 0 2
walden 2 0 0


Language can be characterized by the proportion of rare words in it. The following table gives an respective overview of the ratio for the three data sets. In sum, the ratio is rather typical for language samples. Nonetheless, we find a larger proportion of seldom words in the “News”-data set than in the “Blogs”- or the “Twitter”-data set.

      # Proportion of rare words 
        
        prop_blogs <- table(words_blogs)
        rare_words_b <- sum(prop_blogs == 2)
        
        prop_news <- table(words_news)
        rare_words_n <- sum(prop_news == 2)
        
        prop_twitter <- table(words_twitter)
        rare_words_t <- sum(prop_twitter == 2)
        
       ## Proportion of seldom words per data set

        rare_ratio_blogs <- rare_words_b / length(prop_blogs)
        rare_ratio_news <- rare_words_n / length(prop_news)
        rare_ratio_twitter <- rare_words_t / length(prop_twitter)
        
        total_ratio <- sum(c(rare_words_b, rare_words_n, rare_words_t)) / sum(c(length(prop_blogs), length(prop_news), length(prop_twitter)))
        
        ## Create Table
        
        result <- data.frame(
            Variable = c("Ratio blogs", "Ratio news", "Ratio twitter", "Ratio total"),
            Wert = c(rare_ratio_blogs, rare_ratio_news, rare_ratio_twitter, total_ratio)
              )

        kable(result, caption = "Ratio rare words", 
              booktabs = TRUE, digits = 3, align = "c")
Ratio rare words
Variable Wert
Ratio blogs 0.132
Ratio news 0.147
Ratio twitter 0.121
Ratio total 0.129

Frequencies of 2-grams and 3-grams in the dataset


Finally, the frequencies of 2- and 3-grams in the data sets is analysed. The following tables illustrate the respective results.

      ## Create bigrams 

      blogs_bi <- tokens_ngrams(blogs_clean, n = 2)
      news_bi <- tokens_ngrams(news_clean, n = 2)
      twitter_bi <- tokens_ngrams(twitter_clean, n = 2)
      
      
      ## Calculate frequencies
      
      dfm_bi_blogs <- dfm(blogs_bi)
      dfm_bi_news <- dfm(news_bi)
      dfm_bi_twitter <- dfm(twitter_bi)
      
      top_bi_blogs <- topfeatures(dfm_bi_blogs, 10)
      top_bi_news <- topfeatures(dfm_bi_news, 10)
      top_bi_twitter <- topfeatures(dfm_bi_twitter, 10)
      
      
      ## Create and join data frames for top 10 bigrams
      
        ## Blogs
        
          db_bi_blogs <- data.frame(
                  Feature = colnames(dfm_bi_blogs),
                  Blogs   = colSums(dfm_bi_blogs),
                  row.names = NULL
                    )

          db_bi_blogs <- db_bi_blogs %>%
                  arrange(desc(Blogs)) %>%
                  slice(1:10)
          
          ## Create table
                  
          kable(db_bi_blogs, caption = "Top 10 Blog-Bigrams", 
                    booktabs = TRUE, digits = 0, align = c("l", "c")) %>%
                    kable_styling(position = "left") -> tab_bi_blogs
          
       ## News
        
          db_bi_news <- data.frame(
                  Feature = colnames(dfm_bi_news),
                  News   = colSums(dfm_bi_news),
                  row.names = NULL
                    )

          db_bi_news <- db_bi_news%>%
                  arrange(desc(News)) %>%
                  slice(1:10)
          
          ## Create table
                  
          kable(db_bi_news, caption = "Top 10 News-Bigrams", 
                    booktabs = TRUE, digits = 0, align = c("l", "c")) %>%
                    kable_styling(position = "center") -> tab_bi_news 
          
      
       ## Twitter
        
          db_bi_twitter <- data.frame(
                  Feature = colnames(dfm_bi_twitter),
                  Twitter = colSums(dfm_bi_twitter),
                  row.names = NULL
                    )

          db_bi_twitter <- db_bi_twitter %>%
                  arrange(desc(Twitter)) %>%
                  slice(1:10)
          
          ## Create table
                  
          kable(db_bi_twitter, caption = "Top 10 Twitter-Bigrams", 
                    booktabs = TRUE, digits = 0, 
                    align = c("l", "c")) %>%
                    kable_styling(position = "right") -> tab_bi_twitter   
          
      # Align tables in document
       
          library(htmltools)
## Warning: Paket 'htmltools' wurde unter R Version 4.4.3 erstellt
              browsable(
                tagList(
                  div(style="display:flex; gap: 20px;",
                      HTML(tab_bi_blogs),
                      HTML(tab_bi_news),
                      HTML(tab_bi_twitter)
                  )))
Top 10 Blog-Bigrams
Feature Blogs
of_the 9189
in_the 7744
to_the 4253
on_the 3740
to_be 3390
for_the 2982
and_the 2851
it_is 2381
at_the 2287
it_was 2284
Top 10 News-Bigrams
Feature News
of_the 699
in_the 659
to_the 331
for_the 282
on_the 257
at_the 230
and_the 217
with_the 179
to_be 179
he_said 145
Top 10 Twitter-Bigrams
Feature Twitter
in_the 3903
for_the 3707
of_the 2857
on_the 2362
to_be 2359
to_the 2136
thanks_for 2104
at_the 1904
going_to 1796
thank_you 1699
      # Trigrams

      ## Create trigrams

      blogs_tri <- tokens_ngrams(blogs_clean, n = 3)
      news_tri <- tokens_ngrams(news_clean, n = 3)
      twitter_tri <- tokens_ngrams(twitter_clean, n = 3)
      
      ## Calculate frequencies
      
      dfm_tri_blogs <- dfm(blogs_tri)
      dfm_tri_news <- dfm(news_tri)
      dfm_tri_twitter <- dfm(twitter_tri)
      
      top_tri_blogs <- topfeatures(dfm_tri_blogs, 10)
      top_tri_news <- topfeatures(dfm_tri_news, 10)
      top_tri_twitter <- topfeatures(dfm_tri_twitter, 10)
      
      
      ## Create and join data frames for top 10 trigrams
      
        ## Blogs
        
          db_tri_blogs <- data.frame(
                  Feature = colnames(dfm_tri_blogs),
                  Blogs   = colSums(dfm_tri_blogs),
                  row.names = NULL
                    )

          db_tri_blogs <- db_tri_blogs %>%
                  arrange(desc(Blogs)) %>%
                  slice(1:10)
          
          ## Create table
                  
          kable(db_tri_blogs, caption = "Top 10 Blog-Trigrams", 
                    booktabs = TRUE, digits = 0, align = c("l", "c")) %>%
                    kable_styling(position = "left") -> tab_tri_blogs
          
       ## News
        
          db_tri_news <- data.frame(
                  Feature = colnames(dfm_tri_news),
                  News   = colSums(dfm_tri_news),
                  row.names = NULL
                    )

          db_tri_news <- db_tri_news%>%
                  arrange(desc(News)) %>%
                  slice(1:10)
          
          ## Create table
                  
          kable(db_tri_news, caption = "Top 10 News-Trigrams", 
                    booktabs = TRUE, digits = 0, align = c("l", "c")) %>%
                    kable_styling(position = "center") -> tab_tri_news 
          
      
       ## Twitter
        
          db_tri_twitter <- data.frame(
                  Feature = colnames(dfm_tri_twitter),
                  Twitter = colSums(dfm_tri_twitter),
                  row.names = NULL
                    )

          db_tri_twitter <- db_tri_twitter %>%
                  arrange(desc(Twitter)) %>%
                  slice(1:10)
          
          ## Create table
                  
          kable(db_tri_twitter, caption = "Top 10 Twitter-Trigrams", 
                    booktabs = TRUE, digits = 0, 
                    align = c("l", "c")) %>%
                    kable_styling(position = "right") -> tab_tri_twitter   
       
    # Align tables in document
       
          library(htmltools)
              browsable(
                tagList(
                  div(style="display:flex; gap: 20px;",
                      HTML(tab_tri_blogs),
                      HTML(tab_tri_news),
                      HTML(tab_tri_twitter)
                  ))) 
Top 10 Blog-Trigrams
Feature Blogs
one_of_the 722
as_well_as 339
out_of_the 317
some_of_the 315
the_end_of 314
be_able_to 313
going_to_be 263
the_fact_that 258
part_of_the 248
one_of_my 240
Top 10 News-Trigrams
Feature News
one_of_the 54
as_well_as 29
some_of_the 27
going_to_be 26
in_the_first 26
of_the_year 24
according_to_the 23
the_end_of 20
the_first_time 18
be_able_to 18
Top 10 Twitter-Trigrams
Feature Twitter
thanks_for_the 1162
thank_you_for 427
looking_forward_to 426
can't_wait_to 403
going_to_be 384
for_the_follow 355
one_of_the 296
to_see_you 267
i'm_going_to 261
is_going_to 261


Results

The exploratory analysis of the three data sets (Blogs, News, and Twitter) revealed certain differences but also similarities of the data. The “Blogs”-data set contains the longest documents with the most words while the “Twitter”-data set contains the shortest documents. The most frequent words are very similar in all three data sets. The same is true for bi-grams and tri-grams as regards the “Blogs”- and “News”-data sets. However, the “Twitter”-data set seems to be slightly different and contains mainly common phrases in the top 3. The least frequent words also differ between the data sets and thereby seem to contribute discriminating information about the language used in the three data sources.


Outlook and further steps

On a basis of these results, it seems to be worth considering all three data sets for further (predictive) analyses, as they contribute unique information about the English language. Therefore, for the definition of first prediction models, all three data sets will be joined. However, the result will be a large data base why it will be very important to minimize both the size and runtime of the model in order to provide a reasonable experience to the later user.