Capstone Project

Project background

Around the world, people are spending an increasing amount of time on their mobile devices for email, social networking, banking and a whole range of other activities. But typing on mobile devices can be a serious pain. SwiftKey, our corporate partner in this project, builds a smart keyboard that makes it easier for people to type on their mobile devices. One cornerstone of their smart keyboard is predictive text models. When someone types:

“I went to the…”

the keyboard presents three options for what the next word might be. For example, the three words might be gym, store, restaurant. This is very convenient - but how is it possible to present words which are relevant and fitting for the given context?

Objectives

In this project, we will work on understanding and building predictive text models like those used by SwiftKey. As a basis, we will analyse language examples collected from common internet sources. In this report, exploratory analyses of the available data sets are performed.

Methodology

Data set: About the Corpora

The data sets used in this project consist of corpora, which were collected from publicly available sources by a web crawler. The crawler checked for language, so as to mainly get texts consisting of the desired language (English). Once the raw corpus had been collected, it was parsed further to remove duplicate entries and split into individual lines. Approximately 50% of each entry was then deleted. Since entries cannot be fully recreated, they are considered as anonymised.

In our data sets, each entry was tagged with it’s date of publication. Where user comments were included, they were tagged with the date of the main entry. In addition, each entry was tagged with the type of entry, based on the type of website it is collected from (e.g. newspaper or personal blog). If possible, each entry was tagged with one or more subjects based on the title or keywords of the entry (e.g. if the entry comes from the sports section of a newspaper it was tagged with “sports” subject). In many cases it was not feasible to tag the entries or no subject was found by the automated process, in which case the entry was tagged with a ‘0’. To save space, the subject and type is given as a numerical code.

Exploratory analyses

The first step in building a predictive model for text is understanding the data sets in terms of distribution and relationship between the words and phrases in the text. 1. Exploratory analysis
2. Understand frequencies of words and word pairs

Questions to consider

Some words are more frequent than others - what are the distributions of word frequencies?
What are the frequencies of 2-grams and 3-grams in the data set?

How many unique words do you need in a frequency sorted dictionary to cover 50% of all word instances in the language? 90%?

How do you evaluate how many of the words come from foreign languages?

Can you think of a way to increase the coverage – identifying words that may not be in the corpora or using a smaller number of words in the dictionary to cover the same number of phrases?

In a first step, an exploratory analysis of the data set is performed. Descriptive Statistics are given in the following table and figure:

# Data overview

    # List of words

      words_blogs <- unlist(tokens(blogs_clean)) 
      
      words_news <- unlist(tokens(news_clean))
      
      words_twitter <- unlist(tokens(twitter_clean))
        
    # Comparison of documents available in the three data sets

      lengths_blogs <- ntoken(blogs_clean)
      lengths_news <- ntoken(news_clean)
      lengths_twitter <- ntoken(twitter_clean)
      
      summary_table <- data.frame(
        Dataset = c("Blogs", "News", "Twitter"),
        Documents = c(length(lengths_blogs), length(lengths_news), length(lengths_twitter)),
        Total_Words = c(sum(lengths_blogs), sum(lengths_news), sum(lengths_twitter)),
        Mean = c(mean(lengths_blogs), mean(lengths_news), mean(lengths_twitter)),
        Median = c(median(lengths_blogs), median(lengths_news), median(lengths_twitter)),
        Min = c(min(lengths_blogs), min(lengths_news), min(lengths_twitter)),
        Max = c(max(lengths_blogs), max(lengths_news), max(lengths_twitter))
          )
      
     summary_table[, 4:7] <- round(summary_table[, 4:7], 1)
     
     kable(summary_table, caption = "Descriptive Statistics", 
              booktabs = TRUE, digits = 3, align = "c")

Descriptive Statistics
Dataset	Documents	Total_Words	Mean	Median	Max
Blogs	44964	1752804	39.0	26	1985
News	3862	123953	32.1	30	279
Twitter	118007	1398008	11.8	11	42

     plot1 <- boxplot(
             ntoken(blogs_clean)[ntoken(blogs_clean) < 2000],
             ntoken(news_clean)[ntoken(news_clean) < 1000],
             ntoken(twitter_clean)[ntoken(twitter_clean) < 200],
          names = c("Blogs", "News", "Twitter"),
          main = "Document Length Comparison", col = "lightblue",
          ylim = c(0, 300)
              )

In a second step, data sets are compared regarding the most frequent and least frequent words they contain. Interestingly, while the most frequent words show up repeatedly in the three data sets, the least frequent words seem to be unique and differ from one data set to another.

    # Top 20 and bottom 10 words
        
      ## Blogs
     
        dfm_blogs <- dfm(tokens(blogs_clean))
        top_blogs <- topfeatures(dfm_blogs, 20)

        freq_blogs <- colSums(dfm_blogs)  
        freq_blogs_filt <- freq_blogs[freq_blogs > 1]
        bottom10_blogs <- sort(freq_blogs_filt)[1:10]
        
      ## News
     
        dfm_news <- dfm(tokens(news_clean))
        top_news <- topfeatures(dfm_news, 20)

        freq_news <- colSums(dfm_news)  
        freq_news_filt <- freq_news[freq_news > 1]
        bottom10_news <- sort(freq_news_filt)[1:10]
        
      ## Twitter
     
        dfm_twitter <- dfm(tokens(twitter_clean))
        top_twitter <- topfeatures(dfm_twitter, 20)

        freq_twitter <- colSums(dfm_twitter)  
        freq_twitter_filt <- freq_twitter[freq_twitter > 1]
        bottom10_twitter <- sort(freq_twitter_filt)[1:10]
        
     ## Create and join data frames for Top20 words
        
        df_blogs <- data.frame(Feature = names(top_blogs), Blogs = as.integer(top_blogs))
        df_news <- data.frame(Feature = names(top_news), News = as.integer(top_news))
        df_twitter <- data.frame(Feature = names(top_twitter), Twitter = as.integer(top_twitter))

        df_all <- full_join(df_blogs, df_news, by = "Feature") %>%
          full_join(df_twitter, by = "Feature") %>%
          replace(is.na(.), 0)  

        ## Transform data to long format
        
            df_long <- df_all %>%
            pivot_longer(cols = c(Blogs, News, Twitter),
               names_to = "Quelle",
               values_to = "Haeufigkeit")

        ## Select data
                  
            top_features <- df_long %>%
                  group_by(Feature) %>%
                  summarise(Total = sum(Haeufigkeit)) %>%
                  arrange(desc(Total)) %>%
                  slice_head(n = 10) %>%
                  pull(Feature)
          
                  df_plot <- df_long %>%
                        filter(Feature %in% top_features)
          
        ## Create chart
                  
            ggplot(df_plot, aes(x = reorder(Feature, Haeufigkeit), y = Haeufigkeit, fill = Quelle)) +
                  geom_col(position = "dodge") +
                  coord_flip() +  # horizontale Balken
                  labs(title = "Top 10 words in Blogs, News and Twitter",
                  x = "Feature",
                  y = "Frequency") +
                  theme_minimal()

Top 20 words in Blogs, News and Twitter

    ## Create and join data frames for bottom 10 words
        
            db_blogs <- data.frame(Feature = names(bottom10_blogs), Blogs = as.integer(bottom10_blogs))
            db_news <- data.frame(Feature = names(bottom10_news), News = as.integer(bottom10_news))
            db_twitter <- data.frame(Feature = names(bottom10_twitter), Twitter = as.integer(bottom10_twitter))
          
            db_all <- full_join(db_blogs, db_news, by = "Feature") %>%
                full_join(db_twitter, by = "Feature") %>%
                replace(is.na(.), 0) %>%
                arrange(Feature)
          
        ## Create table
                  
            kable(db_all, caption = "Least frequent Words", 
                    booktabs = TRUE, digits = 3, align = "c")

Least frequent Words
Feature	Blogs	News	Twitter
#survivor	0	0	2
6’6	0	0	2
andretti	0	0	2
babydaddy	0	0	2
bruschetta	2	0	0
cannellini	2	0	0
cheapness	2	0	0
columnist	0	2	0
crowned	0	2	0
divest	2	0	0
dusters	2	0	0
everglades	2	0	0
first-hand	0	2	0
flies	0	2	0
launchpad	0	0	2
listened	0	2	0
lobbyists	0	2	0
motto	0	2	0
outliers	0	0	2
pageant	0	2	0
pent	2	0	0
recieve	0	0	2
retrain	0	0	2
shaff	0	2	0
shelp	0	2	0
stemcells	0	0	2
succumbing	2	0	0
swivels	2	0	0
trix	0	0	2
walden	2	0	0

Language can be characterized by the proportion of rare words in it. The following table gives an respective overview of the ratio for the three data sets. In sum, the ratio is rather typical for language samples. Nonetheless, we find a larger proportion of seldom words in the “News”-data set than in the “Blogs”- or the “Twitter”-data set.

      # Proportion of rare words 
        
        prop_blogs <- table(words_blogs)
        rare_words_b <- sum(prop_blogs == 2)
        
        prop_news <- table(words_news)
        rare_words_n <- sum(prop_news == 2)
        
        prop_twitter <- table(words_twitter)
        rare_words_t <- sum(prop_twitter == 2)
        
       ## Proportion of seldom words per data set

        rare_ratio_blogs <- rare_words_b / length(prop_blogs)
        rare_ratio_news <- rare_words_n / length(prop_news)
        rare_ratio_twitter <- rare_words_t / length(prop_twitter)
        
        total_ratio <- sum(c(rare_words_b, rare_words_n, rare_words_t)) / sum(c(length(prop_blogs), length(prop_news), length(prop_twitter)))
        
        ## Create Table
        
        result <- data.frame(
            Variable = c("Ratio blogs", "Ratio news", "Ratio twitter", "Ratio total"),
            Wert = c(rare_ratio_blogs, rare_ratio_news, rare_ratio_twitter, total_ratio)
              )

        kable(result, caption = "Ratio rare words", 
              booktabs = TRUE, digits = 3, align = "c")

Ratio rare words
Variable	Wert
Ratio blogs	0.132
Ratio news	0.147
Ratio twitter	0.121
Ratio total	0.129

Frequencies of 2-grams and 3-grams in the dataset

Finally, the frequencies of 2- and 3-grams in the data sets is analysed. The following tables illustrate the respective results.

      ## Create bigrams 

      blogs_bi <- tokens_ngrams(blogs_clean, n = 2)
      news_bi <- tokens_ngrams(news_clean, n = 2)
      twitter_bi <- tokens_ngrams(twitter_clean, n = 2)
      
      
      ## Calculate frequencies
      
      dfm_bi_blogs <- dfm(blogs_bi)
      dfm_bi_news <- dfm(news_bi)
      dfm_bi_twitter <- dfm(twitter_bi)
      
      top_bi_blogs <- topfeatures(dfm_bi_blogs, 10)
      top_bi_news <- topfeatures(dfm_bi_news, 10)
      top_bi_twitter <- topfeatures(dfm_bi_twitter, 10)
      
      
      ## Create and join data frames for top 10 bigrams
      
        ## Blogs
        
          db_bi_blogs <- data.frame(
                  Feature = colnames(dfm_bi_blogs),
                  Blogs   = colSums(dfm_bi_blogs),
                  row.names = NULL
                    )

          db_bi_blogs <- db_bi_blogs %>%
                  arrange(desc(Blogs)) %>%
                  slice(1:10)
          
          ## Create table
                  
          kable(db_bi_blogs, caption = "Top 10 Blog-Bigrams", 
                    booktabs = TRUE, digits = 0, align = c("l", "c")) %>%
                    kable_styling(position = "left") -> tab_bi_blogs
          
       ## News
        
          db_bi_news <- data.frame(
                  Feature = colnames(dfm_bi_news),
                  News   = colSums(dfm_bi_news),
                  row.names = NULL
                    )

          db_bi_news <- db_bi_news%>%
                  arrange(desc(News)) %>%
                  slice(1:10)
          
          ## Create table
                  
          kable(db_bi_news, caption = "Top 10 News-Bigrams", 
                    booktabs = TRUE, digits = 0, align = c("l", "c")) %>%
                    kable_styling(position = "center") -> tab_bi_news 
          
      
       ## Twitter
        
          db_bi_twitter <- data.frame(
                  Feature = colnames(dfm_bi_twitter),
                  Twitter = colSums(dfm_bi_twitter),
                  row.names = NULL
                    )

          db_bi_twitter <- db_bi_twitter %>%
                  arrange(desc(Twitter)) %>%
                  slice(1:10)
          
          ## Create table
                  
          kable(db_bi_twitter, caption = "Top 10 Twitter-Bigrams", 
                    booktabs = TRUE, digits = 0, 
                    align = c("l", "c")) %>%
                    kable_styling(position = "right") -> tab_bi_twitter   
          
      # Align tables in document
       
          library(htmltools)

## Warning: Paket 'htmltools' wurde unter R Version 4.4.3 erstellt

              browsable(
                tagList(
                  div(style="display:flex; gap: 20px;",
                      HTML(tab_bi_blogs),
                      HTML(tab_bi_news),
                      HTML(tab_bi_twitter)
                  )))

Top 10 Blog-Bigrams
Feature	Blogs
of_the	9189
in_the	7744
to_the	4253
on_the	3740
to_be	3390
for_the	2982
and_the	2851
it_is	2381
at_the	2287
it_was	2284

Top 10 News-Bigrams
Feature	News
of_the	699
in_the	659
to_the	331
for_the	282
on_the	257
at_the	230
and_the	217
with_the	179
to_be	179
he_said	145

Top 10 Twitter-Bigrams
Feature	Twitter
in_the	3903
for_the	3707
of_the	2857
on_the	2362
to_be	2359
to_the	2136
thanks_for	2104
at_the	1904
going_to	1796
thank_you	1699

      # Trigrams

      ## Create trigrams

      blogs_tri <- tokens_ngrams(blogs_clean, n = 3)
      news_tri <- tokens_ngrams(news_clean, n = 3)
      twitter_tri <- tokens_ngrams(twitter_clean, n = 3)
      
      ## Calculate frequencies
      
      dfm_tri_blogs <- dfm(blogs_tri)
      dfm_tri_news <- dfm(news_tri)
      dfm_tri_twitter <- dfm(twitter_tri)
      
      top_tri_blogs <- topfeatures(dfm_tri_blogs, 10)
      top_tri_news <- topfeatures(dfm_tri_news, 10)
      top_tri_twitter <- topfeatures(dfm_tri_twitter, 10)
      
      
      ## Create and join data frames for top 10 trigrams
      
        ## Blogs
        
          db_tri_blogs <- data.frame(
                  Feature = colnames(dfm_tri_blogs),
                  Blogs   = colSums(dfm_tri_blogs),
                  row.names = NULL
                    )

          db_tri_blogs <- db_tri_blogs %>%
                  arrange(desc(Blogs)) %>%
                  slice(1:10)
          
          ## Create table
                  
          kable(db_tri_blogs, caption = "Top 10 Blog-Trigrams", 
                    booktabs = TRUE, digits = 0, align = c("l", "c")) %>%
                    kable_styling(position = "left") -> tab_tri_blogs
          
       ## News
        
          db_tri_news <- data.frame(
                  Feature = colnames(dfm_tri_news),
                  News   = colSums(dfm_tri_news),
                  row.names = NULL
                    )

          db_tri_news <- db_tri_news%>%
                  arrange(desc(News)) %>%
                  slice(1:10)
          
          ## Create table
                  
          kable(db_tri_news, caption = "Top 10 News-Trigrams", 
                    booktabs = TRUE, digits = 0, align = c("l", "c")) %>%
                    kable_styling(position = "center") -> tab_tri_news 
          
      
       ## Twitter
        
          db_tri_twitter <- data.frame(
                  Feature = colnames(dfm_tri_twitter),
                  Twitter = colSums(dfm_tri_twitter),
                  row.names = NULL
                    )

          db_tri_twitter <- db_tri_twitter %>%
                  arrange(desc(Twitter)) %>%
                  slice(1:10)
          
          ## Create table
                  
          kable(db_tri_twitter, caption = "Top 10 Twitter-Trigrams", 
                    booktabs = TRUE, digits = 0, 
                    align = c("l", "c")) %>%
                    kable_styling(position = "right") -> tab_tri_twitter   
       
    # Align tables in document
       
          library(htmltools)
              browsable(
                tagList(
                  div(style="display:flex; gap: 20px;",
                      HTML(tab_tri_blogs),
                      HTML(tab_tri_news),
                      HTML(tab_tri_twitter)
                  )))

Top 10 Blog-Trigrams
Feature	Blogs
one_of_the	722
as_well_as	339
out_of_the	317
some_of_the	315
the_end_of	314
be_able_to	313
going_to_be	263
the_fact_that	258
part_of_the	248
one_of_my	240

Top 10 News-Trigrams
Feature	News
one_of_the	54
as_well_as	29
some_of_the	27
going_to_be	26
in_the_first	26
of_the_year	24
according_to_the	23
the_end_of	20
the_first_time	18
be_able_to	18

Top 10 Twitter-Trigrams
Feature	Twitter
thanks_for_the	1162
thank_you_for	427
looking_forward_to	426
can't_wait_to	403
going_to_be	384
for_the_follow	355
one_of_the	296
to_see_you	267
i'm_going_to	261
is_going_to	261

Results

The exploratory analysis of the three data sets (Blogs, News, and Twitter) revealed certain differences but also similarities of the data. The “Blogs”-data set contains the longest documents with the most words while the “Twitter”-data set contains the shortest documents. The most frequent words are very similar in all three data sets. The same is true for bi-grams and tri-grams as regards the “Blogs”- and “News”-data sets. However, the “Twitter”-data set seems to be slightly different and contains mainly common phrases in the top 3. The least frequent words also differ between the data sets and thereby seem to contribute discriminating information about the language used in the three data sources.

Outlook and further steps

On a basis of these results, it seems to be worth considering all three data sets for further (predictive) analyses, as they contribute unique information about the English language. Therefore, for the definition of first prediction models, all three data sets will be joined. However, the result will be a large data base why it will be very important to minimize both the size and runtime of the model in order to provide a reasonable experience to the later user.