Introduction

This document describes the initial work undertaken for the capstone project (module 10) of the Johns Hopkins Data Science Specialization. The aim of the project is to build an n-gram model for use as a predictive text tool, based on the twelve datasets provided. The work is presented alongside code snippets, all of which are executable unless otherwise stated.

Setting up environment and importing data

The following R packages need to be installed and loaded to complete this analysis:

library(tidyverse)
library(tidytext)
library(textdata)
library(readtext)
library(quanteda)
library(sentimentr)
library(lexicon)
library(wordcloud2)
library(ngram)
library(fpeek)
library(utils)

The data is provided as 12 large .txt files. We first look directly at a couple of these files and see that they are large, containing many lines of text separated by newline characters. We write a routine to summarise key information about the 12 files and draw a chart plotting file size against number of lines.

## This function returns information about files held in a "data" subfolder   ##
## It assumes the format of file names is LANGUAGE_COUNTRY.TYPE.ext           ##
get_file_info <- function() {
    objs <- file.info(list.files("./data", full.names=TRUE))
    file_paths <- row.names(objs)
    line_count <- sapply(file_paths, peek_count_lines)
    file_info_tibble <- tibble(fullname = str_remove(file_paths,"./data/"),
                               filename = str_remove(fullname,"\\.([^.]+)$"),
                               filetype = str_extract(fullname,"\\.([^.]+)$"),
                               language = str_extract(filename,".*(?=_)"),
                               country = str_extract(filename,"(?<=_).*(?=\\.)"),
                               type = str_extract(filename,"(?<=\\.).*"),
                               filesize_MB = round(objs$size/1000000,1),
                               lines_million = round(line_count/1000000,2))
    return(file_info_tibble)
}

file_info <- get_file_info()

# build the chart
gg <- ggplot(file_info, aes(y = lines_million, x = filesize_MB, color = country, shape = type))
gg <- gg + geom_point(size = 5)
gg <- gg + labs(title = "filesize vs lines", y = "no. of lines", x = "filesize(MB)")
gg <- gg + theme_minimal()

# print the table and chart to console
print(file_info)
## # A tibble: 12 × 8
##    fullname   filename filetype language country type  filesize_MB lines_million
##    <chr>      <chr>    <chr>    <chr>    <chr>   <chr>       <dbl>         <dbl>
##  1 de_DE.blo… de_DE.b… .txt     de       DE      blogs        85.5          0.37
##  2 de_DE.new… de_DE.n… .txt     de       DE      news         95.6          0.24
##  3 de_DE.twi… de_DE.t… .txt     de       DE      twit…        75.6          0.95
##  4 en_US.blo… en_US.b… .txt     en       US      blogs       210.           0.9 
##  5 en_US.new… en_US.n… .txt     en       US      news        206.           1.01
##  6 en_US.twi… en_US.t… .txt     en       US      twit…       167.           2.36
##  7 fi_FI.blo… fi_FI.b… .txt     fi       FI      blogs       108.           0.44
##  8 fi_FI.new… fi_FI.n… .txt     fi       FI      news         94.2          0.49
##  9 fi_FI.twi… fi_FI.t… .txt     fi       FI      twit…        25.3          0.29
## 10 ru_RU.blo… ru_RU.b… .txt     ru       RU      blogs       117.           0.34
## 11 ru_RU.new… ru_RU.n… .txt     ru       RU      news        119            0.2 
## 12 ru_RU.twi… ru_RU.t… .txt     ru       RU      twit…       105.           0.88
print(gg)

We see that the largest files are the US English language files at around 200MB, with all other files being around 100MB except for Finnish twitter, which is much smaller at 25MB. Most files are under half a million lines, four files have around a million lines, and one file (US twitter) has well over 2 million lines.

In order to begin working with these files we write a function to read some (or all) of them into a corpus. The function includes an option to cap the number of lines read per file so that we can create a corpus of cut-down versions of the texts for faster processing.

## This function takes a list of file names and builds them into an object    ##
## which is returned.  This object can be a corpus (default) or a tibble.     ##
## The user also has an option to declare a max number of lines per document. ##
## The function assumes that all files are held in a "data" sub directory.    ##

build_corpus <- function(input_data, max_lines = Inf, as_corpus = TRUE) {

    # set up a blank tibble to read data in to
    tibble0 <- tibble(doc_id = character(), text = character())
    
    # loop though each file in the list
    for (file1 in input_data) {
        
        # get the filepath in relation to working directory
        path_full = paste0("data/",file1)   
        
        # read the correct number of lines as a character vector, with one element per line
        output1 <- read_lines(path_full, n_max = max_lines, skip_empty_rows = TRUE)
        
        # collapse this to a single character string, with lines preserved using newline
        text1 <- paste0(output1, collapse = "\n")
        
        # update the tibble with this new doc filename and text
        tibble0 <- tibble0 %>% add_row(doc_id = file1, text = text1)
    }
    
    # determine whether output is a tibble or a corpus
    output1 <- tibble0
    if (as_corpus) output1 <- corpus(tibble0)

    return(output1)
}

We now build our corpora: a full corpus including all text, and then large, medium and small corpora with reduced text. We use calls to Sys.time() to note how long each takes to build, and object.size() to get the approximate size of each corpus.

input_data <- list.files('./data', pattern = '.txt$')

tt_full <- Sys.time()
my_corpus_full <- build_corpus(input_data)
tt_full <- Sys.time() - tt_full

tt_large <- Sys.time()
my_corpus_large <- build_corpus(input_data, 500000)
tt_large <- Sys.time() - tt_large

tt_medium <- Sys.time()
my_corpus_medium <- build_corpus(input_data, 100000)
tt_medium <- Sys.time() - tt_medium

tt_small <- Sys.time()
my_corpus_small <- build_corpus(input_data, 10000)
tt_small <- Sys.time() - tt_small
Corpus name   Lines     Time to build (s)   Approx size (MB)
small         10,000                  0.5              27.99
medium        100,000                5.57             280.05
large         500,000               17.95             993.64
full          all                   29.77            1400.36
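
The table above can be reproduced with a short snippet along the following lines (a sketch, assuming the corpora and tt_* timing objects created in the chunk above; exact figures will vary by machine):

corpus_summary <- tibble(
    corpus = c("small", "medium", "large", "full"),
    max_lines = c("10,000", "100,000", "500,000", "all"),
    build_time_s = sapply(list(tt_small, tt_medium, tt_large, tt_full),
                          function(t) round(as.numeric(t, units = "secs"), 2)),
    approx_size_MB = sapply(list(my_corpus_small, my_corpus_medium, my_corpus_large, my_corpus_full),
                            function(x) round(as.numeric(object.size(x)) / 1e6, 2))
)
corpus_summary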

For the rest of this work we use the medium corpus, unless otherwise stated, in the interest of keeping function running times manageable.

my_corpus_medium
## Corpus consisting of 12 documents.
## de_DE.blogs.txt :
## "Irgendwann wird es Zeit. Ich schleppe es ja auch jeden Tag m..."
## 
## de_DE.news.txt :
## "Das Rezept für ihre Schokobrezln hat die 60-Jährige schon vo..."
## 
## de_DE.twitter.txt :
## "irgendwas stimmt mut meinem internet am pc nich :( "Wir habe..."
## 
## en_US.blogs.txt :
## "In the years thereafter, most of the Oil fields and platform..."
## 
## en_US.news.txt :
## "He wasn't home alone, apparently. The St. Louis plant had to..."
## 
## en_US.twitter.txt :
## "How are you? Btw thanks for the RT. You gonna be in DC anyti..."
## 
## [ reached max_ndoc ... 6 more documents ]

Getting summary information for each document

Now that we have a corpus of documents we write functions to provide information about each document in the corpus. We first write a function to break the corpus text, held as one very long string per document, into individual lines stored in a tibble with unique line IDs.

## This function creates a tibble from a text string held in a corpus with one ##
## line per row as a default setting.  Each row is given a unique id          ##
## consisting of the 'country' and 'type' and then consecutive integers.      ##
## The user can specify whether each line is given 'bookends' which are       ##
## strings to identify the start and end of lines.                            ##            
create_tibble <- function(corpus1, file1, break_string = "\\n", add_bookends = FALSE) {

    # construct the various names required
    path_full = paste0("data/",file1)
    filename = str_remove(file1,"\\.([^.]+)$")
    ll = str_extract(filename,".*(?=_)")          # language
    cc = str_extract(filename,"(?<=_).*(?=\\.)")  # country
    tt = str_extract(filename,"(?<=\\.).*")       # type
    dd = paste0(cc,"-",tt)                  # name for this data set
    
    # get the text string from the corpus
    string1 <- corpus1[file1]
    
    # break the text string into sections by either the supplied char (\\n) as
    # default or into sentences which are identified intelligently by stringr
    if (break_string == "sentence") {
        tibble1 <- tibble(content = str_split(string1, boundary("sentence"))[[1]]) 
    } else {
        tibble1 <- tibble(content = str_split(string1, break_string)[[1]])
    }
    
    # create columns for the unique id and the smaller text strings 
    tibble1$unique <- paste(dd,seq(1:nrow(tibble1)), sep = "_")
    tibble1$content <- tibble1$content %>% str_remove("\n") %>% str_trim()
    tibble1 <- tibble1 %>% relocate(unique, .before = content)
    
    # add bookends if required
    if (add_bookends){
        tibble1$content <- paste0("strstartstr ", tibble1$content, " strendstr")
    }
    
    return(tibble1)
}

create_tibble(my_corpus_medium, "en_US.news.txt")[1:10,]
## # A tibble: 10 × 2
##    unique     content                                                           
##    <chr>      <chr>                                                             
##  1 US-news_1  "He wasn't home alone, apparently."                               
##  2 US-news_2  "The St. Louis plant had to close. It would die of old age. Worke…
##  3 US-news_3  "WSU's plans quickly became a hot topic on local online sites. Th…
##  4 US-news_4  "The Alaimo Group of Mount Holly was up for a contract last fall …
##  5 US-news_5  "And when it's often difficult to predict a law's impact, legisla…
##  6 US-news_6  "There was a certain amount of scoffing going around a few years …
##  7 US-news_7  "14915 Charlevoix, Detroit"                                       
##  8 US-news_8  "\"It’s just another in a long line of failed attempts to subsidi…
##  9 US-news_9  "But time and again in the report, Sullivan called on CPS to corr…
## 10 US-news_10 "\u0093I was just trying to hit it hard someplace,\u0094 said Riz…

Now that we have our corpus text held as a tibble we can analyse it line by line to perform different types of text analysis. The first function we write provides summary data in terms of character count, word count, n-gram count, sentence count and line count. (This is mainly practice in analysing tibbles of data and is not necessarily very interesting.) We note that the unnest_tokens command can be slow for very large amounts of text, and therefore this function takes a long time to process the full corpus.

## This function builds a summary tibble for a given document in a corpus     ##
## The tibble holds the count of chars, words, ngrams, sentences and lines    ##
## It uses the unnest_tokens function which is slow for large inputs.         ##
summary_tibble <- function(corpus1, file1) {

    # create a tibble with a row per line and unique row ids
    tibble1 <- create_tibble(corpus1, file1)
    
    # count the characters in each row
    char_count <- tibble1 %>% 
        unnest_tokens(output = "char_count", token = "characters", input = "content") %>%
        group_by(unique) %>%
        summarise(char_count = n()) %>%
        ungroup()
    tibble1 <- full_join(tibble1, char_count, by = "unique")

    # count the words in each row
    word_count <- tibble1 %>% 
        unnest_tokens(output = "word_count", token = "words", input = "content") %>%
        group_by(unique) %>%
        summarise(word_count = n()) %>%
        ungroup()
    tibble1 <- full_join(tibble1, word_count, by = "unique")

    # count the 3-grams in each row
    ngram3_count <- tibble1 %>% 
        unnest_tokens(output = "ngram_count", token = "ngrams", input = "content", n=3) %>%
        group_by(unique) %>%
        summarise(ngram3_count = n()) %>%
        ungroup() 
    tibble1 <- full_join(tibble1, ngram3_count, by = "unique")    

    # count the sentences in each row
    sentence_count <- tibble1 %>% 
        unnest_tokens(output = "sentence_count", token = "sentences", input = "content") %>%
        group_by(unique) %>%
        summarise(sentence_count = n()) %>%
        ungroup()
    tibble1 <- full_join(tibble1, sentence_count, by = "unique")

    # count the lines in each row (should be 1!)
    line_count <- tibble1 %>% 
        unnest_tokens(output = "line_count", token = "lines", input = "content") %>%
        group_by(unique) %>%
        summarise(line_count = n()) %>%
        ungroup()  
    tibble1 <- full_join(tibble1, line_count, by = "unique")

    tibble1 <- tibble1 %>% relocate(unique, .before = content)

    return(tibble1)
}

summary_medium <- summary_tibble(my_corpus_medium, "en_US.news.txt")
head(summary_medium,10)
## # A tibble: 10 × 7
##    unique  content  char_count word_count ngram3_count sentence_count line_count
##    <chr>   <chr>         <int>      <int>        <int>          <int>      <int>
##  1 US-new… "He was…         26          5            3              1          1
##  2 US-new… "The St…        121         29           27              4          1
##  3 US-new… "WSU's …        145         29           27              2          1
##  4 US-new… "The Al…        406         86           84              3          1
##  5 US-new… "And wh…        200         40           38              4          1
##  6 US-new… "There …        160         37           35              1          1
##  7 US-new… "14915 …         22          3            1              1          1
##  8 US-new… "\"It’s…        218         44           42              2          1
##  9 US-new… "But ti…        234         42           40              1          1
## 10 US-new… "\u0093…        128         32           30              3          1
summary(summary_medium)[,3:7]
##    char_count       word_count       ngram3_count     sentence_count  
##  Min.   :   1.0   Min.   :   1.00   Min.   :   1.00   Min.   : 1.000  
##  1st Qu.:  88.0   1st Qu.:  19.00   1st Qu.:  17.00   1st Qu.: 1.000  
##  Median : 149.0   Median :  32.00   Median :  30.00   Median : 2.000  
##  Mean   : 161.9   Mean   :  34.54   Mean   :  32.56   Mean   : 2.008  
##  3rd Qu.: 216.0   3rd Qu.:  46.00   3rd Qu.:  44.00   3rd Qu.: 2.000  
##  Max.   :4225.0   Max.   :1123.00   Max.   :1121.00   Max.   :28.000  
##  NA's   :3                                                            
##    line_count
##  Min.   :1   
##  1st Qu.:1   
##  Median :1   
##  Mean   :1   
##  3rd Qu.:1   
##  Max.   :1   
## 

Looking at the summary data for the medium corpus (100,000 lines) of US news data, we can see that the average character count per line is 162, the average word count is 35, the average sentence count is 2, and the average line count is exactly 1 (as expected!).

We write a second function which provides more interesting information: a profanity analysis and three types of sentiment analysis. To work properly, this function also removes words found in a predefined stop-word list. To support the function we first create the five look-up lists required: list_stopwords, list_profanity, list_afinn, list_bing, list_nrc.

## This code initialises all of the look-up lists and lexicons used throughout this work

# build list of common 'Stop Words' to exclude
# stop_words is a predefined list from tidytext
more_stop_words_list <- c("it’s", "i’m", "don’t", "i’ve", "st", "rt", "p.m.", "u.s.", "its", "im", "dont", "ive", "pm")
more_stop_words_tibble <- tibble(word = more_stop_words_list, lexicon = rep("BESPOKE", length(more_stop_words_list)))
number_tibble <- tibble(word = as.character(seq(1000)), lexicon = rep("NUMBERS",1000))
list_stopwords <- rbind(stop_words, more_stop_words_tibble, number_tibble)

# get profanity lists (from the lexicon package)
profanity_all <- c(profanity_alvarez, profanity_arr_bad, profanity_banned, profanity_racist, profanity_zac_anger)
profanity_all_clean <- sort(unique(tolower(c(profanity_all))))
list_profanity <- tibble(word = profanity_all_clean, value = rep(1,length(profanity_all_clean)))

# get sentiment score lists (via tidytext / textdata)
list_afinn <- get_sentiments("afinn")
list_bing <- get_sentiments("bing")
list_nrc <- get_sentiments("nrc")

We now use these lists in a scoring function to provide sentiment and profanity analysis.

## This function builds a scoring tibble for a given document in a corpus     ##
## This holds the key words for each line, and supplies a profanity count     ##
## and three types of sentiment score.                                        ##
scoring_tibble <- function(corpus1, file1) {

    # create a tibble with a row per line and unique row ids
    tibble1 <- create_tibble(corpus1, file1)

    kwords <- tibble1 %>%
        unnest_tokens(output = "word", token = "words", input = "content") %>%
        anti_join(list_stopwords, by = "word") %>%
        group_by(unique) %>%
        summarise(key_words = paste(word, collapse = " ")) %>%
        ungroup()
    
    tibble1 <- full_join(tibble1, kwords, by = "unique")

    word_count <- tibble1 %>% 
        unnest_tokens(output = "word_count", token = "words", input = "key_words") %>%
        group_by(unique) %>%
        summarise(word_count = n()) %>%
        ungroup()
    
    tibble1 <- full_join(tibble1, word_count, by = "unique")

    tibble1 <- tibble1 %>% select(-"content")  

    # get profanity score
    profanity_scores <- tibble1 %>% 
        unnest_tokens(output = "word", token = "words", input = "key_words") %>%
        inner_join(list_profanity, by = "word") %>%
        group_by(unique) %>%
        summarise(profanity = sum(value)) %>%
        ungroup()
    
    tibble1 <- full_join(tibble1, profanity_scores, by = "unique")

    # get AFINN sentiment scores
    afinn_scores <- tibble1 %>% 
        unnest_tokens(output = "word", token = "words", input = "key_words") %>%
        inner_join(list_afinn, by = "word") %>%
        group_by(unique) %>%
        summarise(AFINN_avg = sum(value),
                  AFINN_pos = sum(value > 0),
                  AFINN_neg = sum(value < 0),
                  AFINN_vpos = sum(value >= 4),
                  AFINN_vneg = sum(value <= -4)) %>%
        ungroup()

    # get bing sentiment scores
    bing_scores <- tibble1 %>% 
        unnest_tokens(output = "word", token = "words", input = "key_words") %>%
        inner_join(list_bing, by = "word") %>%
        group_by(unique) %>%
        summarise(BING_pos = sum(sentiment == 'positive'), 
                  BING_neg = sum(sentiment == 'negative')) %>%
        ungroup()

    # get NRC sentiment scores    
    nrc_scores <- tibble1 %>% 
        unnest_tokens(output = "word", token = "words", input = "key_words") %>%
        inner_join(list_nrc, by = "word") %>%
        group_by(unique) %>%
        summarise(NRC_pos = sum(sentiment == 'positive'),
                  NRC_neg = sum(sentiment == 'negative'),
                  NRC_ang = sum(sentiment == 'anger'),
                  NRC_anti = sum(sentiment == 'anticipation'),
                  NRC_disg= sum(sentiment == 'disgust'),
                  NRC_fear = sum(sentiment == 'fear'),
                  NRC_joy = sum(sentiment == 'joy'),
                  NRC_sad = sum(sentiment == 'sadness'),
                  NRC_surp = sum(sentiment == 'surprise'),
                  NRC_trst = sum(sentiment == 'trust')) %>%
        ungroup()

    tibble1 <- full_join(tibble1, afinn_scores, by = "unique")
    tibble1 <- full_join(tibble1, bing_scores, by = "unique")
    tibble1 <- full_join(tibble1, nrc_scores, by = "unique")

    # replace NA values in number columns with 0s
    non_char <- (summarise_all(tibble1, class) != "character")[1,]
    tibble1[,non_char] <- tibble1[,non_char] %>% replace(is.na(.), 0)

    return(tibble1)
}

scoring_medium <- scoring_tibble(my_corpus_medium, "en_US.news.txt")
head(scoring_medium)
## # A tibble: 6 × 21
##   unique key_words word_count profanity AFINN_avg AFINN_pos AFINN_neg AFINN_vpos
##   <chr>  <chr>          <int>     <dbl>     <dbl>     <int>     <int>      <int>
## 1 US-ne… home app…          2         0         0         0         0          0
## 2 US-ne… louis pl…         12         0        -3         0         1          0
## 3 US-ne… wsu's pl…         17         0        -1         1         1          0
## 4 US-ne… alaimo m…         42         0         4         2         0          0
## 5 US-ne… difficul…         13         0        -4         0         2          0
## 6 US-ne… amount s…         15         0         0         0         0          0
## # … with 13 more variables: AFINN_vneg <int>, BING_pos <int>, BING_neg <int>,
## #   NRC_pos <int>, NRC_neg <int>, NRC_ang <int>, NRC_anti <int>,
## #   NRC_disg <int>, NRC_fear <int>, NRC_joy <int>, NRC_sad <int>,
## #   NRC_surp <int>, NRC_trst <int>

This information is much more interesting and so we write an additional function which allows us to find the ‘highest scoring’ line from a given corpus text by any of the metrics or a combination of metrics.

## This function returns the highest scoring lines for the supplied cols from ##
## a scoring tibble.  The user can specify the metrics (column names) to use  ##
## to score, the number of results returned (top X lines), and whether to     ##
## rank on absolute score or score relative to number of key words            ##
get_high_score <- function(tibble1, cols, n = 100, absolute_vals = FALSE){

    # check we have been supplied valid column names
    protected_cols <- c("key_words", "unique", "word_count")
    matched_cols <- cols[cols %in% names(tibble1)]
    matched_cols <- matched_cols[!(matched_cols %in% protected_cols)]
    
    # if not then we exit with an error message
    if (!setequal(cols,matched_cols)) {
        stop(str_flatten(c("cols",setdiff(cols,matched_cols),"- are either not found or not allowed")," "))
    }
    
    # if we have correct column names then we loop through them and create three additional
    # columns for each one:  a relative%, a ranking, and a ranking based on relative%
    for (cc in cols) {
        new_name1 <- paste0(cc,"_%")
        new_name2 <- paste0(cc,"_rank")
        new_name3 <- paste0(cc,"_%rank")
        new_col1 <- tibble1[,cc]/tibble1[,"word_count"]
        new_col2 <- min_rank(desc(tibble1[,cc]))
        new_col3 <- min_rank(desc(new_col1))
        
        tibble1 <- tibble(tibble1, tibble(!!new_name1 := new_col1, !!new_name2 := new_col2, !!new_name3 := new_col3))   
    }
    
    # we set a flag to determine which ranking columns to sum across, depending
    # on whether the user has selected absolute or relative values
    if(absolute_vals){
        flag1 = "_rank"
    } else {
        flag1 = "_%rank"
    }
    
    # create an 'overall_score' column which sums the ranking for the selected columns
    tibble1 <- tibble1 %>% rowwise() %>% mutate(overall_score = sum(c_across(contains(flag1))))
    
    # now add an 'overall_rank' column which ranks the overall_score column
    rank_col <- row_number((tibble1$overall_score))
    tibble1 <- tibble(tibble1, tibble(overall_rank = rank_col))
    
    # for clarity create a second tibble reduced to the required number of rows and order
    tibble2 <- tibble1[(tibble1$overall_rank <= n),]
    tibble2 <- tibble2[order(tibble2$overall_rank),]

    return(tibble2)
}

We can now find sentences with profanity or different sentiments. Note that the profanity filter we have used is very sensitive and picks up words which could be used in a rude or racist way regardless of the actual context in which they are used.

summary_medium[scoring_medium$unique %in% get_high_score(scoring_medium, c("profanity", "BING_neg"), 3, TRUE)$unique,"content"][[1]]
## [1] "In 2010, she missed a progress hearing because she claimed her passport had been stolen while at the Cannes Film Festival, and from there things snowballed: Probation violation, another brief jail stint, rehab, failed drug test, probation violation, rehab. She started to mix it up bit: Alleged attack on a Betty Ford Clinic worker! Grand theft larceny accusations from a Venice, Calif., jewelry store! Then back to the grind: Probation violation, house arrest, failed alcohol test, probation violation, brief jail stint, missed therapy sessions, MIA for community service, brief jail stint. That's when Sautner instituted the timetable that seemed to get Lohan back on track."                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         
## [2] "Dear Margo: I recently was baptized and joined the Mormon Church. I really enjoy my new \"family\" except for one person: the bishop. This feeling is taboo because people act like he walks on water. I say he is just a man appointed to a position. This man didn't like me the minute he met me. When I started at this church, I was in crisis, and I told a couple of people about it. They said go to the bishop. He turned me down flat-out and said, \"We don't help people financially.\" That was a bald-faced lie, and I knew it. The excuses I got were that I was not a member yet, that he was stressed, blah, blah, blah. As time progressed, my problem became worse. Again, I was turned down, and he didn't even call me by the right name. I corrected him twice about my name, and he still got it wrong, which was proof to me that he was doing it on purpose. Do you think I am making something out of nothing? And to whom do I go to let someone know about this person? I think he treats me badly because I am disabled. (I think this because there are other disabled people who also dislike him.) — Disillusioned"                                                                                                           
## [3] "3. Grab a bundle of ones and prepare to stuff a wild pink triangle of a G-string. What's better than queer-friendly strip joints? Beautiful, haunting and historic North Beach has the hungry i (599 Jackson St.), a fairly typical strip joint that is couples- and women-friendly -- with, last time I looked, no silicone in sight; plus, it's where Lenny Bruce performed. The Lusty Lady (1033 Kearny St.) is a very queer-positive, no-booze peep show with hot, smart women who own the joint. But if you want lap dances where you can get drunk (on overpriced swill), you can hit the dyke-friendly Hustler Club next door (1031 Kearny St.) or the women-friendly Gold Club in SoMa (650 Howard St.). San Francisco has one better-known, all-male strip club: Nob Hill Adult Theatre (729 Bush; not to be confused with all-girl Nob Hill Lingerie) is primarily for a gay-male clientele, but it's generally friendly toward women, bois and mixed-gender couples -- they endure bachelorette parties, so they're seasoned enough to appreciate any clientele that is polite, isn't screaming in their faces and tips well. Right now, its marquee boasts a message both frightening and enticing -- something cryptic about \"Brazilian nuns.\""
summary_medium[scoring_medium$unique %in% get_high_score(scoring_medium, c("AFINN_vpos", "BING_pos", "NRC_pos"), 3, TRUE)$unique,"content"][[1]]
## [1] "Along with classes in voice, dance, and acting, there are fun theme related activities each week. The themes for this year include TV, movies, theater, and circus. Elective classes may include Minute to Win It, Dance Jeopardy, Singing Bee, American Idol, and Famous Character Talk Show. Wacky Wednesday events include Come as your Favorite TV Character, Who Dunnit Day, Be a Clown, and Super Hero Day. Fabulous Friday workshops may include puppetry, special effects, mime, clowning, and magic."                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     
## [2] "muckduck: The Ducks can now begin their dominance as an official ELITE team with that 2012 Rose Bowl win and DAT is going to be a HUGE part of this unreal offense next year!!! Samurai7: Besides the two long runs for touchdowns untouched, DAT picked up a key first down beating a defender well short of the first down marker, which set up the winning touchdown. Also, He and Barner were the \"gunners\" on the punt with 22 seconds left going downfield making the fair catch necessary and creating a long field! hermtownhomy: DAT definitely stood out, but that does not mean he should be singled out as the only reason they won. He is scary fast. Probably the fastest player in football right now and we all look forward to the future with him, but his two runs were not necessarily turning points in the game. They still could have gone on to score if those two runs hadn't happened. True turning points in the game were the interception and fumble recovery. I just don't agree with JC's point here that DAT was the reason they won. It was a team effort. RipCityDroHead85: De'Ant was prolly the MVP, I'd agree, but LaMike and Darren had pretty good games too, along with Tuinei. As far as that best duck ever, I don't know if i'd be heaping that much praise on him just yet, LaMike made a key block that sprung one of his long ones. No mention of the best duck rb ever out there blocking for another rb. portlandchiro1: I agree with the praise heaped on De'Anthony, but lets not get carried away. This was a full team win with many stars contributing. Where would Oregon have been without the two forced turnovers from the defense-including the unbelievable interception by Kiko Alonzo? How about the catches by Tunei, the grueling runs by LaMichael James, the touchdown catch by Barner, and on and on on. This was a team win and there are many heroes. Bravo to all the Ducks!!! What was your favorite moment from the Rose Bowl? What about next season are you most excited for? Post your thoughts below."
## [3] "Compressed, understated in the face of intense feeling, Tóibín's fiction places sensitive, isolated individuals in situations of extreme emotional pressure. In \"The Heather Blazing,\" winner of the 1993 Encore Award for best second novel, Eamon Redmond, a remote, aging, rigorously objective Irish high court judge, is awakened from dispassion to compassion when faced with swarming memories during a summer visit to the coast. In \"The Blackwater Lightship,\" a 1999 Booker Prize finalist, young Declan Breen, dying from complications of AIDS, brings his estranged sister, mother and grandmother together to confront the rubble of their family's past. In \"The Master,\" a 2004 Man Booker Prize finalist and winner of a 2006 Los Angeles Times Book Prize, novelist Henry James must confront public humiliation after the failure of his play while coping with his suppressed sexuality and erotic yearning."
summary_medium[scoring_medium$unique %in% get_high_score(scoring_medium, c("NRC_disg", "NRC_surp"), 3, TRUE)$unique,"content"][[1]]
## [1] "We know that innocent people have been convicted of murder in California – three were released in 2011 after serving a total of 57 years – and that innocent people have been executed in other states. Nationwide, 140 inmates from death rows have been exonerated of the crimes for which they were wrongly convicted. In light of possible innocence, using the death penalty puts all Californians at risk of perpetrating the ultimate injustice of executing an innocent person, for when the governor gives the final order to execute, he does so in the name of California residents, and the death certificate will read, \"Homicide,\" as the cause of death."                                                                                                                                                                                                                                                                                
## [2] "But (and it's a very important but) support for the death penalty, in Connecticut and elsewhere, is not as robust as it looks. When Quinnipiac asked a different question \u0096 \"Which punishment do you prefer for people convicted of murder, the death penalty or life in prison with no chance of parole?\" \u0096 only 46 percent favored the death penalty. An equal number chose life without parole. Death penalty opponents have an opening they haven't had for some time."                                                                                                                                                                                                                                                                                                                                                                                                                                                                   
## [3] "What can we do about it? Differences in genetic makeup, metabolism and even bacteria levels in the gut explain why some people gain weight and others don't. But obesity experts are quick to add that just because you've inherited a tendency to put on pounds doesn't mean you have to feed that tendency. Everyone can build lean-muscle mass through exercise, especially weight training, to boost metabolism. That will help you burn more calories even at rest. Exercise also may give brown fat a boost. Pregnant moms need to be especially mindful of excess weight gain, sugar consumption and controlling blood-sugar levels through diet and exercise so they don't pass on problems to future generations. Losing baby weight between pregnancies also would help. While Americans do their part, scientists will continue to look at ways to manipulate gut bacteria and hormone levels to help prevent weight gain — and weight regain."

We can see that these functions find interesting sentences from our texts!

Finding common words

Now we look at selecting the common words from a corpus text. We first write a function which returns the top X most common words from a given text. The function is written so the user can specify whether to remove stop words or not.

## This function creates a word list ordered by count from a text string held ##
## held in a corpus.  The user can define the length of the list, the minimum ##
## required occurrences to be in the list and whether stop words are included.##
common_words <- function(corpus1, file1, list_length = Inf, min_required = 5, remove_stop_words = TRUE) {

    # create a tibble with a row per line and unique row ids
    tibble1 <- create_tibble(corpus1, file1)
    
    # if stop word flag selected then activate stop word list, else activate a dummy list
    if (remove_stop_words) {
        sw_list <- list_stopwords
    } else {
        sw_list <- tibble(word = character(), lexicon = character())
    }
    
    # use unnest_tokens to create word list with count
    top_words <- tibble1 %>%
        unnest_tokens(output = "word", token = "words", input = "content") %>%
        anti_join(sw_list, by = "word") %>%
        group_by(word) %>%
        count(word, sort = TRUE) %>%
        ungroup()
    
    # add percentage and cumulative percentage
    top_words <- top_words %>% mutate(pct = n/sum(n),cumulative = cumsum(pct))
    top_words[,c("pct", "cumulative")] <- 100*round(top_words[,c("pct", "cumulative")],3)
    
    # only return the specified number of rows, 
    # if a negative number supplied then show the tail rather than the head
    top_words <- top_words %>% filter(n >= min_required)
    
    if (list_length > 0 && list_length < nrow(top_words)) {
        top_words <- top_words[1:list_length,]
    }
    
    if (list_length < 0 && -list_length < nrow(top_words)) {
        top_words <- top_words[(nrow(top_words)+list_length+1):nrow(top_words),]
    }

    return(top_words)
}

common_words(my_corpus_medium, "en_US.news.txt", 
             list_length = 10, min_required = 5, remove_stop_words = FALSE)
## # A tibble: 10 × 4
##    word       n   pct cumulative
##    <chr>  <int> <dbl>      <dbl>
##  1 the   195688   5.7        5.7
##  2 to     89962   2.6        8.3
##  3 and    88582   2.6       10.8
##  4 a      87102   2.5       13.4
##  5 of     76488   2.2       15.6
##  6 in     67054   1.9       17.5
##  7 for    35032   1         18.5
##  8 that   34144   1         19.5
##  9 is     28316   0.8       20.3
## 10 on     26947   0.8       21.1
common_words(my_corpus_medium, "en_US.news.txt",
             list_length = 10, min_required = 5, remove_stop_words = TRUE)
## # A tibble: 10 × 4
##    word        n   pct cumulative
##    <chr>   <int> <dbl>      <dbl>
##  1 time     5778   0.4        0.4
##  2 people   4713   0.3        0.7
##  3 city     3802   0.2        0.9
##  4 school   3519   0.2        1.1
##  5 percent  3429   0.2        1.4
##  6 game     3375   0.2        1.6
##  7 day      3191   0.2        1.8
##  8 home     3091   0.2        2  
##  9 million  3055   0.2        2.2
## 10 county   2929   0.2        2.4

We can see that, unsurprisingly, the most common words are all 'common' stop words, so removing them gives a slightly more interesting picture: the new most common words (in the US news data) are “time”, “people” and “city”.

Now we write a similar function to display the most common words for the three different media types for a specific country. Again we allow the user to view the list with or without stop words, and we also add functionality to plot a bar chart for a visual comparison.

## This function generates a tibble of the most common words for each media   ##
## type for a given country. The user can define the length of the list, the  ##
## minimum occurrences required to appear in the list, whether stop words are ##
## included, and whether a chart is shown.                                    ##
show_common_words <- function(corpus1, country = "US", list_length = 100, 
                        min_required = 5, remove_stop_words = TRUE, show_chart = TRUE) {

    # get the filenames for the required country
    blog1 <- docnames(corpus1)[str_detect(docnames(corpus1),paste0(country,".*blogs"))]
    news1 <- docnames(corpus1)[str_detect(docnames(corpus1),paste0(country,".*news"))]
    twit1 <- docnames(corpus1)[str_detect(docnames(corpus1),paste0(country,".*twitter"))]
    
    # get the common word list for each media type
    blog2 <- rename(common_words(corpus1, blog1, list_length = Inf, 
                                 min_required = 1, remove_stop_words)[,1:2], blogs = n)
    news2 <- rename(common_words(corpus1, news1, list_length = Inf, 
                                 min_required = 1, remove_stop_words)[,1:2], news = n)
    twit2 <- rename(common_words(corpus1, twit1, list_length = Inf, 
                                 min_required = 1, remove_stop_words)[,1:2], twitter = n)
    
    # combine the lists into a single table, remove any NA, add (and sort by) a 'total' column
    word_table <- full_join(blog2, news2, by = "word")
    word_table <- full_join(word_table, twit2, by = "word")
    word_table <- word_table %>% replace(is.na(.), 0)
    word_table$total <- word_table$blogs + word_table$news + word_table$twitter
    word_table <- word_table[order(word_table$total, decreasing = TRUE),]
    
    # only return the specified number of rows, 
    # if a negative number supplied then show the tail rather than the head
    word_table <- word_table %>% filter(total >= min_required)
    
    if (list_length > 0 && list_length < nrow(word_table)) {
        word_table <- word_table[1:list_length,]
    }
    
    if (list_length < 0 && -list_length < nrow(word_table)) {
        word_table <- word_table[(nrow(word_table)+list_length+1):nrow(word_table),]
    }
    
    # if we are drawing a chart 
    if (show_chart) {
        
        # reformat the data table into a longer format
        word_table2 <- word_table %>% select(-"total")
        word_table2 <- word_table2 %>% pivot_longer(cols = 2:4, names_to = "type", values_to = "count")        
        
        # build the chart and print it to the console
        gg <- ggplot(word_table2, aes(y = word, x = count, fill = type))
        gg <- gg + geom_bar(position = "stack", width = 2, stat = "identity")
        gg <- gg + scale_y_discrete(limits = rev(word_table2$word))
        gg <- gg + labs(title = paste0("Word counts for ", country, " media"), y = "common words", x = "frequency")
        gg <- gg + theme_minimal()
        gg <- gg + theme(axis.text.y=element_text(size=5))
        print(gg)
    }
    
    return(word_table)
}

show_common_words(my_corpus_medium, "US", list_length = 20, 
                  min_required = 5, remove_stop_words = FALSE, show_chart = FALSE) 
## # A tibble: 20 × 5
##    word   blogs   news twitter  total
##    <chr>  <int>  <int>   <int>  <int>
##  1 the   205790 195688   39718 441196
##  2 to    118290  89962   33100 241352
##  3 and   120757  88582   18531 227870
##  4 a      99848  87102   25860 212810
##  5 of     96752  76488   15159 188399
##  6 in     66187  67054   16172 149413
##  7 i      86293  15621   30585 132499
##  8 that   50749  34144   10089  94982
##  9 is     48157  28316   15457  91930
## 10 for    40005  35032   16500  91537
## 11 it     44834  21827   12592  79253
## 12 on     30821  26947   11760  69528
## 13 you    33073   9563   23202  65838
## 14 with   31564  25462    7325  64351
## 15 was    30946  22773    4986  58705
## 16 at     19138  21344    7956  48438
## 17 this   28673  12197    7043  47913
## 18 as     24903  18863    2868  46634
## 19 my     30310   4065   12155  46530
## 20 be     23216  15077    7898  46191
show_common_words(my_corpus_medium, "US", list_length = 20, 
                  min_required = 5, remove_stop_words = TRUE, show_chart = TRUE)  

## # A tibble: 20 × 5
##    word   blogs  news twitter total
##    <chr>  <int> <int>   <int> <int>
##  1 time   10215  5778    3384 19377
##  2 people  6556  4713    2181 13450
##  3 day     5820  3191    3925 12936
##  4 love    5096  1014    4560 10670
##  5 life    4518  1574    1431  7523
##  6 home    3101  3091    1015  7207
##  7 week    3015  2257    1242  6514
##  8 school  1981  3519     845  6345
##  9 world   3257  1611     954  5822
## 10 game    1028  3375    1317  5720
## 11 night   2202  1694    1733  5629
## 12 city    1159  3802     406  5367
## 13 lot     2456  1803     561  4820
## 14 family  2272  1999     522  4793
## 15 feel    2730   973    1025  4728
## 16 house   1987  1989     583  4559
## 17 days    2235  1437     800  4472
## 18 team     903  2802     597  4302
## 19 book    3076   622     512  4210
## 20 season   784  2898     456  4138

We can see from this that the most popular words are similar across the media types, but if we remove the stop words we start to see a difference emerging. The chart helps to make this clearer.

We write a further function to visualise popular words in a more appealing way using the wordcloud2 package.

## This function draws a wordcloud for a given word count list.  The user can ##
## define which columns are the count column and word column from a tibble.   ##
draw_wordcloud <- function(tibble1, count_col, word_col = "word") {

    # build wordcloud and print to console
    wordcloud <- wordcloud2(tibble1[,c(word_col, count_col)],
                            fontFamily = "arial",
                            fontWeight = "normal",
                            color = "random-light",
                            minRotation = -pi/4,
                            maxRotation = pi/4,
                            rotateRatio = 0.75,
                            shape = "circle")
    
    return(wordcloud)
}

us_common_words <- show_common_words(my_corpus_medium, "US", list_length = 100, 
                  min_required = 5, remove_stop_words = TRUE, show_chart = FALSE) 

wordcloud_news <- draw_wordcloud(us_common_words, "news")
wordcloud_news
wordcloud_blogs <- draw_wordcloud(us_common_words, "blogs")
wordcloud_blogs
wordcloud_twitter <- draw_wordcloud(us_common_words, "twitter")
wordcloud_twitter

We can easily see from this that the most common words are ‘time’ and ‘people’ for news; ‘time’, ‘people’ and ‘day’ for blogs; and ‘love’ and ‘day’ for twitter.

Finding common ngrams

After analysing single words we move on to looking at ngrams. These are sequences of consecutive words found in the text. We adapt our functions for finding and displaying common words to work for ngrams.

The first function lists common ngrams. We add functionality so the user can choose either the tidytext package (unnest_tokens) or the ngram package (ngram) to split out the ngrams. The ngram package runs its calculations in C rather than R and so is quicker at processing large strings of text; however, it produces slightly different results to unnest_tokens and so may be slightly out of kilter with the other functions written so far.

## This function creates an ngram list ordered by count from a text string    ##
## held in a corpus. The user can define the length of the ngram (X words)    ##
## and the length of the list produced (i.e. the top X ngrams).  The user can ##
## also decide whether to run the function fast - this uses the 'ngram'       ##
## package as opposed to unnest_tokens from the 'tidytext' package, this is   ##
## significantly quicker but produces slightly different results.  The user   ##
## can also decide whether to include bookends within the ngrams and whether  ##
## to specify a minimum number of times an ngram should appear to be included.##
common_ngrams <- function(corpus1, file1, ngram_length = 3, list_length = Inf, 
                          min_required = 5, add_bookends = TRUE, run_fast = TRUE) {

    # create a tibble with a row per line and unique row ids
    tibble1 <- create_tibble(corpus1, file1, break_string = "sentence", add_bookends = add_bookends)
    
    # if we are on a fast run then use the ngram function to produce the ngrams
    if (run_fast) {
        # create a long string of words from the text, in lower case with no punctuation 
        vec1 <- paste(tibble1$content, collapse = " ")
        vec1 <- str_to_lower(vec1)
        vec1 <- str_replace_all(vec1,"[[:punct:]]","")
        
        # split into n-grams, remove ngrams with bookends or NA, then format ouput similar to unnest_tokens
        top_ngrams <- ngram(vec1, n = ngram_length) %>%
                            get.phrasetable() %>%
                            filter(!(str_detect(ngrams, "strendstr|strstartstr"))) %>%
                            drop_na() %>%
                            rename(n = freq, ngram = ngrams) %>%
                            mutate(cumulative = cumsum(prop), ngram = str_sub(ngram,1,-2))
    } else {
        # otherwise use unnest_tokens which is slower, but consistent with other functions we have written
        top_ngrams <- tibble1 %>%
            unnest_tokens(output = "ngram", token = "ngrams", input = "content", n = ngram_length) %>%
            group_by(ngram) %>%
            count(ngram, sort = TRUE) %>%
            ungroup()
    }
    
    # only return the specified number of rows, 
    # if a negative number supplied then show the tail rather than the head
    top_ngrams <- top_ngrams %>% filter(n >= min_required)
    
    if (list_length > 0 && list_length < nrow(top_ngrams)) {
        top_ngrams <- top_ngrams[1:list_length,]
    }
    
    if (list_length < 0 && -list_length < nrow(top_ngrams)) {
        top_ngrams <- top_ngrams[(nrow(top_ngrams)+list_length+1):nrow(top_ngrams),]
    }
    
    return(top_ngrams)
}

common_ngrams(my_corpus_medium, "en_US.news.txt", ngram_length = 5, 
              list_length = 10, min_required = 5, add_bookends = TRUE, run_fast = TRUE)
##                       ngram   n         prop   cumulative
## 1         at the end of the 122 3.211269e-05 3.211269e-05
## 2     for the first time in  67 1.763566e-05 4.974835e-05
## 3      in the middle of the  61 1.605635e-05 6.580470e-05
## 4  for the first time since  61 1.605635e-05 8.186104e-05
## 5         by the end of the  49 1.289772e-05 9.475876e-05
## 6        the end of the day  43 1.131841e-05 1.060772e-04
## 7        at the time of the  42 1.105519e-05 1.171324e-04
## 8         at the top of the  36 9.475876e-06 1.266082e-04
## 9       the end of the year  34 8.949439e-06 1.355577e-04
## 10       there are a lot of  34 8.949439e-06 1.445071e-04

We can see that the most common ngrams are popular phrases, or parts of popular phrases, which is as we would expect.
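
As a rough check of the speed difference mentioned above, the two tokenisers can be timed directly on one of the documents (an illustrative sketch; timings vary by machine, and as noted the two modes produce slightly different counts):

system.time(common_ngrams(my_corpus_medium, "en_US.news.txt", ngram_length = 3, run_fast = TRUE))
system.time(common_ngrams(my_corpus_medium, "en_US.news.txt", ngram_length = 3, run_fast = FALSE))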

The second function lists the common ngrams for all three media types for a given country, and draws a chart if required.

## This function generates a tibble of the most common ngrams for each media  ##
## type for a given country.  The user can define the same attributes as in   ##
## common_ngrams, and whether a chart is shown.                               ##
show_common_ngrams <- function(corpus1, country = "US", ngram_length = 3, list_length = 100, 
                       min_required = 5, add_bookends = TRUE, run_fast = TRUE, show_chart = TRUE) {

    # get the filenames for the required country
    blog1 <- docnames(corpus1)[str_detect(docnames(corpus1),paste0(country,".*blogs"))]
    news1 <- docnames(corpus1)[str_detect(docnames(corpus1),paste0(country,".*news"))]
    twit1 <- docnames(corpus1)[str_detect(docnames(corpus1),paste0(country,".*twitter"))]
    
    # get the common ngram list for each media type
    blog2 <- rename(common_ngrams(corpus1, blog1, ngram_length, list_length = Inf, 
                                  min_required = 1, add_bookends, run_fast)[,1:2], blogs = n)
    news2 <- rename(common_ngrams(corpus1, news1, ngram_length, list_length = Inf, 
                                  min_required = 1, add_bookends, run_fast)[,1:2], news = n)
    twit2 <- rename(common_ngrams(corpus1, twit1, ngram_length, list_length = Inf, 
                                  min_required = 1, add_bookends, run_fast)[,1:2], twitter = n)
    
    # combine the lists into a single table, remove any NA, add (and sort by) a 'total' column
    ngram_table <- full_join(blog2, news2, by = "ngram")
    ngram_table <- full_join(ngram_table, twit2, by = "ngram")
    ngram_table <- ngram_table %>% replace(is.na(.), 0)
    ngram_table$total <- ngram_table$blogs + ngram_table$news + ngram_table$twitter
    ngram_table <- ngram_table[order(ngram_table$total, decreasing = TRUE),]
    
    # only return the specified number of rows, 
    # if a negative number supplied then show the tail rather than the head
    ngram_table <- ngram_table %>% filter(total >= min_required)
    
    if (list_length > 0 && list_length < nrow(ngram_table)) {
        ngram_table <- ngram_table[1:list_length,]
    }
    
    if (list_length < 0 && -list_length < nrow(ngram_table)) {
        ngram_table <- ngram_table[(nrow(ngram_table)+list_length+1):nrow(ngram_table),]
    }
    
    # if we are drawing a chart 
    if (show_chart) {
        
        # reformat the data table into a longer format
        ngram_table2 <- ngram_table %>% select(-"total")
        ngram_table2 <- ngram_table2 %>% pivot_longer(cols = 2:4, names_to = "type", values_to = "count")        
        
        # build the chart and print it to the console
        gg <- ggplot(ngram_table2, aes(y = ngram, x = count, fill = type))
        gg <- gg + geom_bar(position = "stack", width = 2, stat = "identity")
        gg <- gg + scale_y_discrete(limits = rev(ngram_table2$ngram))
        gg <- gg + labs(title = paste0("Ngram counts for ", country, " media"), y = "common ngrams", x = "frequency")
        gg <- gg + theme_minimal()
        gg <- gg + theme(axis.text.y=element_text(size=5))
        print(gg)
    }
    
    return(ngram_table)
}

show_common_ngrams(my_corpus_medium, "US", ngram_length = 5, list_length = 10,
                   min_required = 5, add_bookends = TRUE, run_fast = TRUE, show_chart = TRUE)

##                       ngram blogs news twitter total
## 1         at the end of the   209  122      24   355
## 2      in the middle of the    85   61      21   167
## 3     for the first time in    55   67      20   142
## 4        the end of the day    61   43       7   111
## 5         by the end of the    42   49      10   101
## 6        there are a lot of    42   34       9    85
## 7        is one of the most    46   28       8    82
## 8  for the first time since    15   61       5    81
## 9          is going to be a    34   27      19    80
## 10      for the rest of the    38   24      15    77

Again we see common phrases appear in the list, and start to see differences between the three media types with twitter appearing to be less formal than the other two. We compare this with the common ngrams from the full corpus:

show_common_ngrams(my_corpus_full, "US", ngram_length = 5, list_length = 10,
                   min_required = 5, add_bookends = TRUE, run_fast = TRUE, show_chart = TRUE)

##                    ngram blogs news twitter total
## 1      at the end of the  1596 1377     644  3617
## 2   in the middle of the   841  574     392  1807
## 3  for the first time in   465  695     430  1590
## 4     the end of the day   547  446     303  1296
## 5      by the end of the   434  571     231  1236
## 6    for the rest of the   428  327     383  1138
## 7      its going to be a   157  305     631  1093
## 8  thank you so much for   189    6     832  1027
## 9       is going to be a   275  196     547  1018
## 10    there are a lot of   403  388     167   958

We see that similar phrases rise to the top of the list as expected, but based upon much more data. We also look at the bottom of the two lists and see that the least frequent ngrams from the full corpus are less intuitive than those from the medium corpus: they are in effect 'rarer', i.e. they have a lower probability of occurring, since they have the same count (5) but come from a much larger sample.

show_common_ngrams(my_corpus_medium, "US", ngram_length = 5, list_length = -10,
                   min_required = 5, add_bookends = TRUE, run_fast = TRUE, show_chart = FALSE)
##                                ngram blogs news twitter total
## 5849          mothers day to all you     0    0       5     5
## 5850          i hate when people say     0    0       5     5
## 5851 available in your customer area     0    0       5     5
## 5852            wait to see you guys     0    0       5     5
## 5853        follow me and ill follow     0    0       5     5
## 5854     look forward to meeting you     0    0       5     5
## 5855       in your customer area now     0    0       5     5
## 5856              follow me so i can     0    0       5     5
## 5857         heres my number so call     0    0       5     5
## 5858           and thank you for the     0    0       5     5
show_common_ngrams(my_corpus_full, "US", ngram_length = 5, list_length = -10,
                   min_required = 5, add_bookends = TRUE, run_fast = TRUE, show_chart = FALSE)
##                               ngram blogs news twitter total
## 286772      give the gift of health     0    0       5     5
## 286773  remember all the good times     0    0       5     5
## 286774        cant you just tell me     0    0       5     5
## 286775        are so many new music     0    0       5     5
## 286776 the members of one direction     0    0       5     5
## 286777       it should be fixed now     0    0       5     5
## 286778       on my profile can help     0    0       5     5
## 286779           were happy to be a     0    0       5     5
## 286780         lift my feet off the     0    0       5     5
## 286781   do you need more followers     0    0       5     5

Predicting the next word

We now write two functions to enable us to predict next words based on ngrams. The first function creates a long ngram lookup list containing ngrams up to a specified number of words based upon the corpus text supplied. The second function uses this long ngram lookup list to quickly search for a matching ngram and the expected next word.

The first function allows the user to specify the max length of the ngram (in words) and also how many ngrams of each length to hold. These ngrams are held in a long list along with the count of the number of times each ngram appears in the text, and the ngram split into two parts: one part containing all words except the last (ngram_less1), and one part containing the last word only (ngram_last). This allows the list to be queried on the ngram_less1 column, with the possible next words then read easily from the corresponding ngram_last column.
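
For example, using the same regular expressions as the function below, the 4-gram "at the end of" splits into a 3-word lookup key and a final word (shown here as a standalone illustration):

str_remove("at the end of", "\\s(\\w+)$")   # ngram_less1: "at the end"
str_extract("at the end of", "(\\w+)$")     # ngram_last:  "of"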

## This function produces a list of ngrams for a specified country.  The user ##
## can specify the max words per ngram, and the number of ngrams of each word ##
## length to hold.  The user can also specify the minimum number of           ##
## occurrences for an ngram to appear in the list, as well as whether bookends##
## are to be included.                                                        ##

get_ngram_lookup_list <- function(corpus1, country, ngram_max = 5, ll_max = 10000, 
                           min_word_bar = 5, add_bookends = TRUE, run_fast = TRUE) {
    
    # start an empty list to append to as we go
    ngram_list = list()
    
    # ensure max ngram is at least 2
    if (ngram_max < 2) {
        ngram_max <- 2
    }
    
    # loop through each ngram length and get an ngram list for each length
    for (ng in 2:ngram_max) {
        x <- show_common_ngrams(corpus1, country = country, ngram_length = ng, 
                                list_length = ll_max, min_required = min_word_bar,
                                add_bookends = add_bookends, run_fast = run_fast, show_chart = FALSE)
        
        # add columns which split the ngram into two parts: the last word, and the rest
        x <- x %>% mutate(ngram_less1 = str_remove(ngram, "\\s(\\w+)$"), ngram_last = str_extract(ngram, "(\\w+)$"))
        x <- x %>% rename(count = total) %>% select(ngram_less1, ngram_last, count)
        
        ngram_list <- append(ngram_list, list(x))
    }

    return(ngram_list)
}

The second function allows the user to pass a short string (i.e. an ngram) and then looks for this string in the ngram_less1 column. If matches are found then these are returned along with the corresponding final words of the looked up ngram, and the count of occurrences. If no matches are found then the function looks for an ngram which is one word shorter than that supplied (by removing the first word). This continues until a match is found. If no match is found then the function returns NULL.

## This function allows the user to specify a string and then looks for a     ##
## match in an ngram list, in the ngram_less1 column.  If match(es) are found ##
## then the most common X occurrences are returned (the user can specify)     ##
## if not then the function checks for a shorter ngram (by removing the first ##
## word from the ngram).  If the function fails to find an ngram matching any ##
## component of the supplied ngram then it returns NULL.                      ##

lookup_ngram <- function(ngram_list, ngram1, results = 1) {

    # determine the ngram length
    ngram_length <- str_count(ngram1, "\\w+")
    
    # if the supplied string is longer than the longest stored context then trim
    # it to the last few words that we are actually able to look up
    if (ngram_length > length(ngram_list)) {
        ngram1 <- word(ngram1, start = ngram_length - length(ngram_list) + 1, end = ngram_length)
        ngram_length <- length(ngram_list)
    }
    
    # while we still have an ngram of 2 or more
    while (ngram_length >= 2) {
        
        # filter the ngram list for this ngram (in the less1 column) and sort by count 
        lookup_table <- ngram_list[[ngram_length]] %>% filter(ngram_less1 == ngram1) %>% arrange(desc(count))
        
        # if we have found matches then return them    
        if(nrow(lookup_table)>0) {
            return(lookup_table[1:min(results,nrow(lookup_table)),])
         } 
        
        # otherwise we reduce the ngram by one word
        ngram1 <- word(ngram1,start = 2, end = ngram_length)
        
        ngram_length <- str_count(ngram1, "\\w+")
    }   
    
    # if we find no matches then return NULL
    return(NULL)
}

We can test that this works by typing short sentences (we use the full corpus for more matches):

ngram_list_full <- get_ngram_lookup_list(my_corpus_full, "US", ngram_max = 6, ll_max = Inf, 
                           min_word_bar = 1, add_bookends = TRUE, run_fast = TRUE)

lookup_ngram(ngram_list_full, "once upon a time there", results = 10)
##              ngram_less1 ngram_last count
## 1 once upon a time there        was    35
lookup_ngram(ngram_list_full, "predicting things is", results = 10)
##    ngram_less1 ngram_last count
## 1    things is        the    30
## 2    things is        not    28
## 3    things is          a    21
## 4    things is       that    20
## 5    things is         to    12
## 6    things is       more     9
## 7    things is       just     7
## 8    things is  difficult     7
## 9    things is       when     6
## 10   things is        one     5
lookup_ngram(ngram_list_full, "red white and", results = 10)
##     ngram_less1 ngram_last count
## 1 red white and       blue   151
## 2 red white and      green     6
## 3 red white and     bluezz     6
## 4 red white and      black     5
lookup_ngram(ngram_list_full, "gobbledygook dvgibieuyebv", results = 10)
## NULL
lookup_ngram(ngram_list_full, "lucy in the sky with", results = 10)
##            ngram_less1 ngram_last count
## 1 lucy in the sky with   diamonds     5

Our prediction tool appears to be working!
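
Finally, as a hint of how this might be packaged for the eventual predictive text app, the lookup can be wrapped in a small helper. The sketch below is illustrative only (the predict_next_word name and interface are our own, not part of the functions above): it cleans the input in the same way the ngram lists were built and returns the candidate next words.

## A minimal sketch of a next-word helper built on lookup_ngram (illustrative)##
predict_next_word <- function(ngram_list, text1, results = 3) {
    # lower-case and strip punctuation, mirroring the cleaning used to build the ngram lists
    text1 <- str_to_lower(str_replace_all(text1, "[[:punct:]]", ""))
    
    # look up the cleaned string and return the candidate next words (if any)
    match1 <- lookup_ngram(ngram_list, text1, results = results)
    if (is.null(match1)) return(character(0))
    return(match1$ngram_last)
}

predict_next_word(ngram_list_full, "Red, white and")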