This document sets out the work undertaken for the capstone submission of the Johns Hopkins Data Science Specialization (module 10). The project is to build an n-gram model for use as a predictive text tool, based on the twelve datasets provided. The sections below describe the initial work undertaken, along with code snippets which are all executable unless otherwise stated.
The following R packages need to be installed and loaded to complete this analysis:
library(tidyverse)
library(tidytext)
library(textdata)
library(readtext)
library(quanteda)
library(sentimentr)
library(lexicon)
library(wordcloud2)
library(ngram)
library(fpeek)
library(utils)
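If any of these packages are missing they can be installed first; a minimal sketch (the package vector simply mirrors the list above, and utils ships with base R):
# install any required packages that are not already present
pkgs <- c("tidyverse", "tidytext", "textdata", "readtext", "quanteda",
"sentimentr", "lexicon", "wordcloud2", "ngram", "fpeek")
missing_pkgs <- setdiff(pkgs, rownames(installed.packages()))
if (length(missing_pkgs) > 0) install.packages(missing_pkgs)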
The data is provided as 12 large .txt files. We first look through a couple of these files directly and see that each contains many lines of text separated by newline characters. We write a routine to summarise key information about the 12 files, and draw a chart plotting file size against number of lines.
## This function returns information about files held in a "data" subfolder ##
## It assumes the format of file names is LANGUAGE_COUNTRY.TYPE.ext ##
get_file_info <- function() {
objs <- file.info(list.files("./data", full.names=TRUE))
file_paths <- row.names(objs)
line_count <- sapply(file_paths, peek_count_lines)
file_info_tibble <- tibble(fullname = str_remove(file_paths,"./data/"),
filename = str_remove(fullname,"\\.([^.]+)$"),
filetype = str_extract(fullname,"\\.([^.]+)$"),
language = str_extract(filename,".*(?=_)"),
country = str_extract(filename,"(?<=_).*(?=\\.)"),
type = str_extract(filename,"(?<=\\.).*"),
filesize_MB = round(objs$size/1000000,1),
lines_million = round(line_count/1000000,2))
return(file_info_tibble)
}
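To make the regular expressions above concrete, here is how they parse one of the actual file names (a small worked example):
# worked example of the filename parsing used in get_file_info
fn <- "en_US.twitter.txt"
str_remove(fn, "\\.([^.]+)$")                   # "en_US.twitter" (extension stripped)
str_extract("en_US.twitter", ".*(?=_)")         # "en" (language)
str_extract("en_US.twitter", "(?<=_).*(?=\\.)") # "US" (country)
str_extract("en_US.twitter", "(?<=\\.).*")      # "twitter" (type)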
file_info <- get_file_info()
# build the chart
gg <- ggplot(file_info, aes(y = lines_million, x = filesize_MB, color = country, shape = type))
gg <- gg + geom_point(size = 5)
gg <- gg + labs(title = "filesize vs lines", y = "no. of lines", x = "filesize(MB)")
gg <- gg + theme_minimal()
# print the table and chart to console
print(file_info)
## # A tibble: 12 × 8
## fullname filename filetype language country type filesize_MB lines_million
## <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
## 1 de_DE.blo… de_DE.b… .txt de DE blogs 85.5 0.37
## 2 de_DE.new… de_DE.n… .txt de DE news 95.6 0.24
## 3 de_DE.twi… de_DE.t… .txt de DE twit… 75.6 0.95
## 4 en_US.blo… en_US.b… .txt en US blogs 210. 0.9
## 5 en_US.new… en_US.n… .txt en US news 206. 1.01
## 6 en_US.twi… en_US.t… .txt en US twit… 167. 2.36
## 7 fi_FI.blo… fi_FI.b… .txt fi FI blogs 108. 0.44
## 8 fi_FI.new… fi_FI.n… .txt fi FI news 94.2 0.49
## 9 fi_FI.twi… fi_FI.t… .txt fi FI twit… 25.3 0.29
## 10 ru_RU.blo… ru_RU.b… .txt ru RU blogs 117. 0.34
## 11 ru_RU.new… ru_RU.n… .txt ru RU news 119 0.2
## 12 ru_RU.twi… ru_RU.t… .txt ru RU twit… 105. 0.88
print(gg)
We see that the largest files are the US English files at around 200MB, with the other files between roughly 75MB and 120MB except for the Finnish twitter file, which is much smaller at 25MB. Most files are under half a million lines, four files have around a million lines, and one file (US twitter) has over two million lines.
In order to begin working with these files we write a function to read some (or all) of them into a corpus. We include an option to cap the number of lines read per file so that we can create corpora from cut-down versions of the texts for faster processing.
## This function takes a list of file names and builds them into an object ##
## which is returned. This object can be a corpus (default) or a tibble. ##
## The user also has an option to declare a max number of lines per document. ##
## The function assumes that all files are held in a "data" sub directory. ##
build_corpus <- function(input_data, max_lines = Inf, as_corpus = TRUE) {
# set up a blank tibble to read data in to
tibble0 <- tibble(doc_id = character(), text = character())
# loop through each file in the list
for (file1 in input_data) {
# get the filepath in relation to working directory
path_full = paste0("data/",file1)
# read the correct number of lines as a character vector, with one element per line
output1 <- read_lines(path_full, n_max = max_lines, skip_empty_rows = TRUE)
# collapse this to a single character string, with lines preserved using newline
text1 <- paste0(output1, collapse = "\n")
# update the tibble with this new doc filename and text
tibble0 <- tibble0 %>% add_row(doc_id = file1, text = text1)
}
# determine whether output is a tibble or a corpus
output1 <- tibble0
if (as_corpus) output1 <- corpus(tibble0)
return(output1)
}
We now build our corpora: a full corpus including all text, and then large, medium and small corpora with reduced line counts. We use calls to Sys.time() to note how long each takes to build, and object.size() to get the approximate size of each corpus.
input_data <- list.files('./data', pattern = '.txt$')
tt_full <- Sys.time()
my_corpus_full <- build_corpus(input_data)
tt_full <- Sys.time() - tt_full
tt_large <- Sys.time()
my_corpus_large <- build_corpus(input_data, 500000)
tt_large <- Sys.time() - tt_large
tt_medium <- Sys.time()
my_corpus_medium <- build_corpus(input_data, 100000)
tt_medium <- Sys.time() - tt_medium
tt_small <- Sys.time()
my_corpus_small <- build_corpus(input_data, 10000)
tt_small <- Sys.time() - tt_small
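The sizes and times reported in the table below were collected along the following lines (a sketch; exact figures depend on the machine and R session):
# approximate size of each corpus, and build time in seconds
format(object.size(my_corpus_small), units = "MB")
format(object.size(my_corpus_medium), units = "MB")
format(object.size(my_corpus_large), units = "MB")
format(object.size(my_corpus_full), units = "MB")
as.numeric(tt_small, units = "secs")
as.numeric(tt_medium, units = "secs")
as.numeric(tt_large, units = "secs")
as.numeric(tt_full, units = "secs")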
| Corpus name | Lines | Time to build (s) | Approx size (MB) |
|---|---|---|---|
| small | 10,000 | 0.5 | 27.99 |
| medium | 100,000 | 5.57 | 280.05 |
| large | 500,000 | 17.95 | 993.64 |
| full | all | 29.77 | 1400.36 |
For the rest of this work we use the medium corpus, in the interest of keeping run times manageable, unless otherwise stated.
my_corpus_medium
## Corpus consisting of 12 documents.
## de_DE.blogs.txt :
## "Irgendwann wird es Zeit. Ich schleppe es ja auch jeden Tag m..."
##
## de_DE.news.txt :
## "Das Rezept für ihre Schokobrezln hat die 60-Jährige schon vo..."
##
## de_DE.twitter.txt :
## "irgendwas stimmt mut meinem internet am pc nich :( "Wir habe..."
##
## en_US.blogs.txt :
## "In the years thereafter, most of the Oil fields and platform..."
##
## en_US.news.txt :
## "He wasn't home alone, apparently. The St. Louis plant had to..."
##
## en_US.twitter.txt :
## "How are you? Btw thanks for the RT. You gonna be in DC anyti..."
##
## [ reached max_ndoc ... 6 more documents ]
Now that we have a corpus of documents we write functions to provide information for each document in the corpus. We first write a function to break the corpus text held as a very long string into individual lines held in a tibble with unique line IDs.
## This function creates a tibble from a text string held in a corpus with one ##
## line per row as a default setting. Each row is given a unique id ##
## consisting of the 'country' and 'type' and then consecutive integers. ##
## The user can specify whether each line is given 'bookends' which are ##
## strings to identify the start and end of lines. ##
create_tibble <- function(corpus1, file1, break_string = "\\n", add_bookends = FALSE) {
# construct the various names required
path_full = paste0("data/",file1)
filename = str_remove(file1,"\\.([^.]+)$")
ll = str_extract(filename,".*(?=_)") # language
cc = str_extract(filename,"(?<=_).*(?=\\.)") # country
tt = str_extract(filename,"(?<=\\.).*") # type
dd = paste0(cc,"-",tt) # name for this data set
# get the text string from the corpus
string1 <- corpus1[file1]
# break the text string into sections by either the supplied char (\\n) as
# default or into sentences which are identified intelligently by stringr
if (break_string == "sentence") {
tibble1 <- tibble(content = str_split(string1, boundary("sentence"))[[1]])
} else {
tibble1 <- tibble(content = str_split(string1, break_string)[[1]])
}
# create columns for the unique id and the smaller text strings
tibble1$unique <- paste(dd,seq(1:nrow(tibble1)), sep = "_")
tibble1$content <- tibble1$content %>% str_remove("\n") %>% str_trim()
tibble1 <- tibble1 %>% relocate(unique, .before = content)
# add bookends if required
if (add_bookends){
tibble1$content <- paste0("strstartstr ", tibble1$content, " strendstr")
}
return(tibble1)
}
create_tibble(my_corpus_medium, "en_US.news.txt")[1:10,]
## # A tibble: 10 × 2
## unique content
## <chr> <chr>
## 1 US-news_1 "He wasn't home alone, apparently."
## 2 US-news_2 "The St. Louis plant had to close. It would die of old age. Worke…
## 3 US-news_3 "WSU's plans quickly became a hot topic on local online sites. Th…
## 4 US-news_4 "The Alaimo Group of Mount Holly was up for a contract last fall …
## 5 US-news_5 "And when it's often difficult to predict a law's impact, legisla…
## 6 US-news_6 "There was a certain amount of scoffing going around a few years …
## 7 US-news_7 "14915 Charlevoix, Detroit"
## 8 US-news_8 "\"It’s just another in a long line of failed attempts to subsidi…
## 9 US-news_9 "But time and again in the report, Sullivan called on CPS to corr…
## 10 US-news_10 "\u0093I was just trying to hit it hard someplace,\u0094 said Riz…
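The same document can also be broken into sentences rather than lines, with bookend markers added to each row; for example (output not shown):
# split on sentence boundaries and add start/end markers to each row
create_tibble(my_corpus_medium, "en_US.news.txt", break_string = "sentence", add_bookends = TRUE)[1:5,]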
Now that we have our corpus text held as a tibble we can analyse it line by line to perform different types of text analysis. The first function we write provides summary data in terms of character count, word count, n-gram count, sentence count and line count. (This is mainly practice in analysing tibbles of data and is not necessarily very interesting.) We note that the ‘unnest_tokens’ command can be slow for very large amounts of text, and therefore this function takes a long time to process the full corpus.
## This function builds a summary tibble for a given document in a corpus ##
## The tibble holds the count of chars, words, ngrams, sentences and lines ##
## It uses the unnest_tokens function which is slow for large inputs. ##
summary_tibble <- function(corpus1, file1) {
# create a tibble with a row per line and unique row ids
tibble1 <- create_tibble(corpus1, file1)
# count the characters in each row
char_count <- tibble1 %>%
unnest_tokens(output = "char_count", token = "characters", input = "content") %>%
group_by(unique) %>%
summarise(char_count = n()) %>%
ungroup()
tibble1 <- full_join(tibble1, char_count, by = "unique")
# count the words in each row
word_count <- tibble1 %>%
unnest_tokens(output = "word_count", token = "words", input = "content") %>%
group_by(unique) %>%
summarise(word_count = n()) %>%
ungroup()
tibble1 <- full_join(tibble1, word_count, by = "unique")
# count the 3-grams in each row
ngram3_count <- tibble1 %>%
unnest_tokens(output = "ngram_count", token = "ngrams", input = "content", n=3) %>%
group_by(unique) %>%
summarise(ngram3_count = n()) %>%
ungroup()
tibble1 <- full_join(tibble1, ngram3_count, by = "unique")
# count the sentences in each row
sentence_count <- tibble1 %>%
unnest_tokens(output = "sentence_count", token = "sentences", input = "content") %>%
group_by(unique) %>%
summarise(sentence_count = n()) %>%
ungroup()
tibble1 <- full_join(tibble1, sentence_count, by = "unique")
# count the lines in each row (should be 1!)
line_count <- tibble1 %>%
unnest_tokens(output = "line_count", token = "lines", input = "content") %>%
group_by(unique) %>%
summarise(line_count = n()) %>%
ungroup()
tibble1 <- full_join(tibble1, line_count, by = "unique")
tibble1 <- tibble1 %>% relocate(unique, .before = content)
return(tibble1)
}
summary_medium <- summary_tibble(my_corpus_medium, "en_US.news.txt")
head(summary_medium,10)
## # A tibble: 10 × 7
## unique content char_count word_count ngram3_count sentence_count line_count
## <chr> <chr> <int> <int> <int> <int> <int>
## 1 US-new… "He was… 26 5 3 1 1
## 2 US-new… "The St… 121 29 27 4 1
## 3 US-new… "WSU's … 145 29 27 2 1
## 4 US-new… "The Al… 406 86 84 3 1
## 5 US-new… "And wh… 200 40 38 4 1
## 6 US-new… "There … 160 37 35 1 1
## 7 US-new… "14915 … 22 3 1 1 1
## 8 US-new… "\"It’s… 218 44 42 2 1
## 9 US-new… "But ti… 234 42 40 1 1
## 10 US-new… "\u0093… 128 32 30 3 1
summary(summary_medium)[,3:7]
## char_count word_count ngram3_count sentence_count
## Min. : 1.0 Min. : 1.00 Min. : 1.00 Min. : 1.000
## 1st Qu.: 88.0 1st Qu.: 19.00 1st Qu.: 17.00 1st Qu.: 1.000
## Median : 149.0 Median : 32.00 Median : 30.00 Median : 2.000
## Mean : 161.9 Mean : 34.54 Mean : 32.56 Mean : 2.008
## 3rd Qu.: 216.0 3rd Qu.: 46.00 3rd Qu.: 44.00 3rd Qu.: 2.000
## Max. :4225.0 Max. :1123.00 Max. :1121.00 Max. :28.000
## NA's :3
## line_count
## Min. :1
## 1st Qu.:1
## Median :1
## Mean :1
## 3rd Qu.:1
## Max. :1
##
We look at the summary data for the medium (100,000 line) sample of US news data. The average character count per line is 162, the average word count is 35, the average sentence count is 2, and the average line count is exactly 1 (as expected!).
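These averages can be reproduced directly from the summary tibble (up to rounding, and ignoring the handful of NA rows):
# recompute the column averages from the summary tibble
round(colMeans(summary_medium[, 3:7], na.rm = TRUE), 1)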
We write a second function which provides more interesting information in terms of a profanity analysis and three types of sentiment analysis. In order to work properly this function also removes a predefined list of stop words. To support the function we first create the five look-up lists required: list_stopwords, list_profanity, list_afinn, list_bing, list_nrc.
## This code initialises all of the look-up lists and lexicons used throughout this work
# build list of common 'Stop Words' to exclude
# stop_words is a predefined list from tidytext
more_stop_words_list <- c("it’s", "i’m", "don’t", "i’ve", "st", "rt", "p.m.", "u.s.", "its", "im", "dont", "ive", "pm")
more_stop_words_tibble <- tibble(word = more_stop_words_list, lexicon = rep("BESPOKE", length(more_stop_words_list)))
number_tibble <- tibble(word = as.character(seq(1000)), lexicon = rep("NUMBERS",1000))
list_stopwords <- rbind(stop_words, more_stop_words_tibble, number_tibble)
# get profanity list (from sentimentr)
profanity_all <- c(profanity_alvarez, profanity_arr_bad, profanity_banned, profanity_racist, profanity_zac_anger)
profanity_all_clean <- sort(unique(tolower(c(profanity_all))))
list_profanity <- tibble(word = profanity_all_clean, value = rep(1,length(profanity_all_clean)))
# get sentiment score lists (from sentimentr)
list_afinn <- get_sentiments("afinn")
list_bing <- get_sentiments("bing")
list_nrc <- get_sentiments("nrc")
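The scoring function below assumes that list_afinn has a numeric value column and that list_bing and list_nrc have a sentiment column; a quick structural check of all five lists (output not shown):
# peek at the first few rows of each look-up list to confirm its columns
lapply(list(stopwords = list_stopwords, profanity = list_profanity,
afinn = list_afinn, bing = list_bing, nrc = list_nrc), head, n = 3)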
We now use these lists in a scoring function to provide sentiment and profanity analysis.
## This function builds a scoring tibble for a given document in a corpus ##
## This holds the key words for each line, and supplies a profanity count ##
## and three types of sentiment score. ##
scoring_tibble <- function(corpus1, file1) {
# create a tibble with a row per line and unique row ids
tibble1 <- create_tibble(corpus1, file1)
kwords <- tibble1 %>%
unnest_tokens(output = "word", token = "words", input = "content") %>%
anti_join(list_stopwords, by = "word") %>%
group_by(unique) %>%
summarise(key_words = paste(word, collapse = " ")) %>%
ungroup()
tibble1 <- full_join(tibble1, kwords, by = "unique")
word_count <- tibble1 %>%
unnest_tokens(output = "word_count", token = "words", input = "key_words") %>%
group_by(unique) %>%
summarise(word_count = n()) %>%
ungroup()
tibble1 <- full_join(tibble1, word_count, by = "unique")
tibble1 <- tibble1 %>% select(-"content")
# get profanity score
profanity_scores <- tibble1 %>%
unnest_tokens(output = "word", token = "words", input = "key_words") %>%
inner_join(list_profanity, by = "word") %>%
group_by(unique) %>%
summarise(profanity = sum(value)) %>%
ungroup()
tibble1 <- full_join(tibble1, profanity_scores, by = "unique")
# get AFINN sentiment scores
afinn_scores <- tibble1 %>%
unnest_tokens(output = "word", token = "words", input = "key_words") %>%
inner_join(list_afinn, by = "word") %>%
group_by(unique) %>%
summarise(AFINN_avg = sum(value),
AFINN_pos = sum(value > 0),
AFINN_neg = sum(value < 0),
AFINN_vpos = sum(value >= 4),
AFINN_vneg = sum(value <= -4)) %>%
ungroup()
# get bing sentiment scores
bing_scores <- tibble1 %>%
unnest_tokens(output = "word", token = "words", input = "key_words") %>%
inner_join(list_bing, by = "word") %>%
group_by(unique) %>%
summarise(BING_pos = sum(sentiment == 'positive'),
BING_neg = sum(sentiment == 'negative')) %>%
ungroup()
# get NRC sentiment scores
nrc_scores <- tibble1 %>%
unnest_tokens(output = "word", token = "words", input = "key_words") %>%
inner_join(list_nrc, by = "word") %>%
group_by(unique) %>%
summarise(NRC_pos = sum(sentiment == 'positive'),
NRC_neg = sum(sentiment == 'negative'),
NRC_ang = sum(sentiment == 'anger'),
NRC_anti = sum(sentiment == 'anticipation'),
NRC_disg= sum(sentiment == 'disgust'),
NRC_fear = sum(sentiment == 'fear'),
NRC_joy = sum(sentiment == 'joy'),
NRC_sad = sum(sentiment == 'sadness'),
NRC_surp = sum(sentiment == 'surprise'),
NRC_trst = sum(sentiment == 'trust')) %>%
ungroup()
tibble1 <- full_join(tibble1, afinn_scores, by = "unique")
tibble1 <- full_join(tibble1, bing_scores, by = "unique")
tibble1 <- full_join(tibble1, nrc_scores, by = "unique")
# replace NA values in number columns with 0s
non_char <- (summarise_all(tibble1, class) != "character")[1,]
tibble1[,non_char] <- tibble1[,non_char] %>% replace(is.na(.), 0)
return(tibble1)
}
scoring_medium <- scoring_tibble(my_corpus_medium, "en_US.news.txt")
head(scoring_medium)
## # A tibble: 6 × 21
## unique key_words word_count profanity AFINN_avg AFINN_pos AFINN_neg AFINN_vpos
## <chr> <chr> <int> <dbl> <dbl> <int> <int> <int>
## 1 US-ne… home app… 2 0 0 0 0 0
## 2 US-ne… louis pl… 12 0 -3 0 1 0
## 3 US-ne… wsu's pl… 17 0 -1 1 1 0
## 4 US-ne… alaimo m… 42 0 4 2 0 0
## 5 US-ne… difficul… 13 0 -4 0 2 0
## 6 US-ne… amount s… 15 0 0 0 0 0
## # … with 13 more variables: AFINN_vneg <int>, BING_pos <int>, BING_neg <int>,
## # NRC_pos <int>, NRC_neg <int>, NRC_ang <int>, NRC_anti <int>,
## # NRC_disg <int>, NRC_fear <int>, NRC_joy <int>, NRC_sad <int>,
## # NRC_surp <int>, NRC_trst <int>
This information is much more interesting and so we write an additional function which allows us to find the ‘highest scoring’ line from a given corpus text by any of the metrics or a combination of metrics.
## This function returns the highest scoring lines for the supplied cols from ##
## a scoring tibble. The user can specify the metrics (column names) to use ##
## to score, the number of results returned (top X lines), and whether to ##
## rank on absolute score or score relative to number of key words ##
get_high_score <- function(tibble1, cols, n = 100, absolute_vals = FALSE){
# check we have been supplied valid column names
protected_cols <- c("key_words", "unique", "word_count")
matched_cols <- cols[cols %in% names(tibble1)]
matched_cols <- matched_cols[!(matched_cols %in% protected_cols)]
# if not then we exit with an error message
if (!setequal(cols,matched_cols)) {
stop(str_flatten(c("cols",setdiff(cols,matched_cols),"- are either not found or not allowed")," "))
}
# if we have correct column names then we loop through them and create three additional
# columns for each one: a relative%, a ranking, and a ranking based on relative%
for (cc in cols) {
new_name1 <- paste0(cc,"_%")
new_name2 <- paste0(cc,"_rank")
new_name3 <- paste0(cc,"_%rank")
new_col1 <- tibble1[,cc]/tibble1[,"word_count"]
new_col2 <- min_rank(desc(tibble1[,cc]))
new_col3 <- min_rank(desc(new_col1))
tibble1 <- tibble(tibble1, tibble(!!new_name1 := new_col1, !!new_name2 := new_col2, !!new_name3 := new_col3))
}
# we set a flag to sum across depending on whether the user has
# selected absolute or relative values
if(absolute_vals){
flag1 = "_rank"
} else {
flag1 = "_%rank"
}
# create an 'overall_score' column which sums the ranking for the selected columns
tibble1 <- tibble1 %>% rowwise() %>% mutate(overall_score = sum(c_across(contains(flag1))))
# now add an 'overall_rank' column which ranks the overall_score column
rank_col <- row_number((tibble1$overall_score))
tibble1 <- tibble(tibble1, tibble(overall_rank = rank_col))
# for clarity create a second tibble reduced to the required number of rows and order
tibble2 <- tibble1[(tibble1$overall_rank <= n),]
tibble2 <- tibble2[order(tibble2$overall_rank),]
return(tibble2)
}
We can now find lines with profanity or with different sentiments. Note that the profanity filter we have used is very sensitive and picks up words which could be used in a rude or racist way, irrespective of the actual context in which they are used.
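Individual words can also be checked against the combined profanity list directly, which gives a feel for how broad (and context-blind) it is; a small sketch using arbitrary example words:
# size of the combined profanity list and a couple of spot checks
nrow(list_profanity)
c("hell", "bloody") %in% list_profanity$word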
summary_medium[scoring_medium$unique %in% get_high_score(scoring_medium, c("profanity", "BING_neg"), 3, TRUE)$unique,"content"][[1]]
## [1] "In 2010, she missed a progress hearing because she claimed her passport had been stolen while at the Cannes Film Festival, and from there things snowballed: Probation violation, another brief jail stint, rehab, failed drug test, probation violation, rehab. She started to mix it up bit: Alleged attack on a Betty Ford Clinic worker! Grand theft larceny accusations from a Venice, Calif., jewelry store! Then back to the grind: Probation violation, house arrest, failed alcohol test, probation violation, brief jail stint, missed therapy sessions, MIA for community service, brief jail stint. That's when Sautner instituted the timetable that seemed to get Lohan back on track."
## [2] "Dear Margo: I recently was baptized and joined the Mormon Church. I really enjoy my new \"family\" except for one person: the bishop. This feeling is taboo because people act like he walks on water. I say he is just a man appointed to a position. This man didn't like me the minute he met me. When I started at this church, I was in crisis, and I told a couple of people about it. They said go to the bishop. He turned me down flat-out and said, \"We don't help people financially.\" That was a bald-faced lie, and I knew it. The excuses I got were that I was not a member yet, that he was stressed, blah, blah, blah. As time progressed, my problem became worse. Again, I was turned down, and he didn't even call me by the right name. I corrected him twice about my name, and he still got it wrong, which was proof to me that he was doing it on purpose. Do you think I am making something out of nothing? And to whom do I go to let someone know about this person? I think he treats me badly because I am disabled. (I think this because there are other disabled people who also dislike him.) — Disillusioned"
## [3] "3. Grab a bundle of ones and prepare to stuff a wild pink triangle of a G-string. What's better than queer-friendly strip joints? Beautiful, haunting and historic North Beach has the hungry i (599 Jackson St.), a fairly typical strip joint that is couples- and women-friendly -- with, last time I looked, no silicone in sight; plus, it's where Lenny Bruce performed. The Lusty Lady (1033 Kearny St.) is a very queer-positive, no-booze peep show with hot, smart women who own the joint. But if you want lap dances where you can get drunk (on overpriced swill), you can hit the dyke-friendly Hustler Club next door (1031 Kearny St.) or the women-friendly Gold Club in SoMa (650 Howard St.). San Francisco has one better-known, all-male strip club: Nob Hill Adult Theatre (729 Bush; not to be confused with all-girl Nob Hill Lingerie) is primarily for a gay-male clientele, but it's generally friendly toward women, bois and mixed-gender couples -- they endure bachelorette parties, so they're seasoned enough to appreciate any clientele that is polite, isn't screaming in their faces and tips well. Right now, its marquee boasts a message both frightening and enticing -- something cryptic about \"Brazilian nuns.\""
summary_medium[scoring_medium$unique %in% get_high_score(scoring_medium, c("AFINN_vpos", "BING_pos", "NRC_pos"), 3, TRUE)$unique,"content"][[1]]
## [1] "Along with classes in voice, dance, and acting, there are fun theme related activities each week. The themes for this year include TV, movies, theater, and circus. Elective classes may include Minute to Win It, Dance Jeopardy, Singing Bee, American Idol, and Famous Character Talk Show. Wacky Wednesday events include Come as your Favorite TV Character, Who Dunnit Day, Be a Clown, and Super Hero Day. Fabulous Friday workshops may include puppetry, special effects, mime, clowning, and magic."
## [2] "muckduck: The Ducks can now begin their dominance as an official ELITE team with that 2012 Rose Bowl win and DAT is going to be a HUGE part of this unreal offense next year!!! Samurai7: Besides the two long runs for touchdowns untouched, DAT picked up a key first down beating a defender well short of the first down marker, which set up the winning touchdown. Also, He and Barner were the \"gunners\" on the punt with 22 seconds left going downfield making the fair catch necessary and creating a long field! hermtownhomy: DAT definitely stood out, but that does not mean he should be singled out as the only reason they won. He is scary fast. Probably the fastest player in football right now and we all look forward to the future with him, but his two runs were not necessarily turning points in the game. They still could have gone on to score if those two runs hadn't happened. True turning points in the game were the interception and fumble recovery. I just don't agree with JC's point here that DAT was the reason they won. It was a team effort. RipCityDroHead85: De'Ant was prolly the MVP, I'd agree, but LaMike and Darren had pretty good games too, along with Tuinei. As far as that best duck ever, I don't know if i'd be heaping that much praise on him just yet, LaMike made a key block that sprung one of his long ones. No mention of the best duck rb ever out there blocking for another rb. portlandchiro1: I agree with the praise heaped on De'Anthony, but lets not get carried away. This was a full team win with many stars contributing. Where would Oregon have been without the two forced turnovers from the defense-including the unbelievable interception by Kiko Alonzo? How about the catches by Tunei, the grueling runs by LaMichael James, the touchdown catch by Barner, and on and on on. This was a team win and there are many heroes. Bravo to all the Ducks!!! What was your favorite moment from the Rose Bowl? What about next season are you most excited for? Post your thoughts below."
## [3] "Compressed, understated in the face of intense feeling, Tóibín's fiction places sensitive, isolated individuals in situations of extreme emotional pressure. In \"The Heather Blazing,\" winner of the 1993 Encore Award for best second novel, Eamon Redmond, a remote, aging, rigorously objective Irish high court judge, is awakened from dispassion to compassion when faced with swarming memories during a summer visit to the coast. In \"The Blackwater Lightship,\" a 1999 Booker Prize finalist, young Declan Breen, dying from complications of AIDS, brings his estranged sister, mother and grandmother together to confront the rubble of their family's past. In \"The Master,\" a 2004 Man Booker Prize finalist and winner of a 2006 Los Angeles Times Book Prize, novelist Henry James must confront public humiliation after the failure of his play while coping with his suppressed sexuality and erotic yearning."
summary_medium[scoring_medium$unique %in% get_high_score(scoring_medium, c("NRC_disg", "NRC_surp"), 3, TRUE)$unique,"content"][[1]]
## [1] "We know that innocent people have been convicted of murder in California – three were released in 2011 after serving a total of 57 years – and that innocent people have been executed in other states. Nationwide, 140 inmates from death rows have been exonerated of the crimes for which they were wrongly convicted. In light of possible innocence, using the death penalty puts all Californians at risk of perpetrating the ultimate injustice of executing an innocent person, for when the governor gives the final order to execute, he does so in the name of California residents, and the death certificate will read, \"Homicide,\" as the cause of death."
## [2] "But (and it's a very important but) support for the death penalty, in Connecticut and elsewhere, is not as robust as it looks. When Quinnipiac asked a different question \u0096 \"Which punishment do you prefer for people convicted of murder, the death penalty or life in prison with no chance of parole?\" \u0096 only 46 percent favored the death penalty. An equal number chose life without parole. Death penalty opponents have an opening they haven't had for some time."
## [3] "What can we do about it? Differences in genetic makeup, metabolism and even bacteria levels in the gut explain why some people gain weight and others don't. But obesity experts are quick to add that just because you've inherited a tendency to put on pounds doesn't mean you have to feed that tendency. Everyone can build lean-muscle mass through exercise, especially weight training, to boost metabolism. That will help you burn more calories even at rest. Exercise also may give brown fat a boost. Pregnant moms need to be especially mindful of excess weight gain, sugar consumption and controlling blood-sugar levels through diet and exercise so they don't pass on problems to future generations. Losing baby weight between pregnancies also would help. While Americans do their part, scientists will continue to look at ways to manipulate gut bacteria and hormone levels to help prevent weight gain — and weight regain."
We can see that these functions find interesting sentences from our texts!
Now we look at selecting the common words from a corpus text. We first write a function which returns the top X most common words from a given text. The function is written so the user can specify whether to remove stop words or not.
## This function creates a word list ordered by count from a text string ##
## held in a corpus. The user can define the length of the list, the minimum ##
## required occurrences to be in the list and whether stop words are included.##
common_words <- function(corpus1, file1, list_length = Inf, min_required = 5, remove_stop_words = TRUE) {
# create a tibble with a row per line and unique row ids
tibble1 <- create_tibble(corpus1, file1)
# if stop word flag selected then activate stop word list, else activate a dummy list
if (remove_stop_words) {
sw_list <- list_stopwords
} else {
sw_list <- tibble(word = character(), lexicon = character())
}
# use unnest_tokens to create word list with count
top_words <- tibble1 %>%
unnest_tokens(output = "word", token = "words", input = "content") %>%
anti_join(sw_list, by = "word") %>%
group_by(word) %>%
count(word, sort = TRUE) %>%
ungroup()
# add percentage and cumulative percentage
top_words <- top_words %>% mutate(pct = n/sum(n),cumulative = cumsum(pct))
top_words[,c("pct", "cumulative")] <- 100*round(top_words[,c("pct", "cumulative")],3)
# only return the specified number of rows,
# if a negative number supplied then show the tail rather than the head
top_words <- top_words %>% filter(n >= min_required)
if (list_length > 0 && list_length < nrow(top_words)) {
top_words <- top_words[1:list_length,]
}
if (list_length < 0 && -list_length < nrow(top_words)) {
top_words <- top_words[(nrow(top_words)+list_length+1):nrow(top_words),]
}
return(top_words)
}
common_words(my_corpus_medium, "en_US.news.txt",
list_length = 10, min_required = 5, remove_stop_words = FALSE)
## # A tibble: 10 × 4
## word n pct cumulative
## <chr> <int> <dbl> <dbl>
## 1 the 195688 5.7 5.7
## 2 to 89962 2.6 8.3
## 3 and 88582 2.6 10.8
## 4 a 87102 2.5 13.4
## 5 of 76488 2.2 15.6
## 6 in 67054 1.9 17.5
## 7 for 35032 1 18.5
## 8 that 34144 1 19.5
## 9 is 28316 0.8 20.3
## 10 on 26947 0.8 21.1
common_words(my_corpus_medium, "en_US.news.txt",
list_length = 10, min_required = 5, remove_stop_words = TRUE)
## # A tibble: 10 × 4
## word n pct cumulative
## <chr> <int> <dbl> <dbl>
## 1 time 5778 0.4 0.4
## 2 people 4713 0.3 0.7
## 3 city 3802 0.2 0.9
## 4 school 3519 0.2 1.1
## 5 percent 3429 0.2 1.4
## 6 game 3375 0.2 1.6
## 7 day 3191 0.2 1.8
## 8 home 3091 0.2 2
## 9 million 3055 0.2 2.2
## 10 county 2929 0.2 2.4
We can see from this that, unsurprisingly, the most common words are all stop words, so removing them gives a more interesting list. We see that our new most common words (in the US news data) are “time”, “people” and “city”.
Now we write a similar function to display the most common words across the three media types for a given country. Again we allow the user to include or exclude stop words, and we also add functionality to plot a bar chart for visual comparison.
## This function generates a tibble of the most common words for each media ##
## type for a given country. The user can define the length of the list, the ##
## minimum occurrences required to appear in the list, whether stop words are ##
## included, and whether a chart is shown. ##
show_common_words <- function(corpus1, country = "US", list_length = 100,
min_required = 5, remove_stop_words = TRUE, show_chart = TRUE) {
# get the filenames for the required country
blog1 <- docnames(corpus1)[str_detect(docnames(corpus1),paste0(country,".*blogs"))]
news1 <- docnames(corpus1)[str_detect(docnames(corpus1),paste0(country,".*news"))]
twit1 <- docnames(corpus1)[str_detect(docnames(corpus1),paste0(country,".*twitter"))]
# get the common word list for each media type
blog2 <- rename(common_words(corpus1, blog1, list_length = Inf,
min_required = 1, remove_stop_words)[,1:2], blogs = n)
news2 <- rename(common_words(corpus1, news1, list_length = Inf,
min_required = 1, remove_stop_words)[,1:2], news = n)
twit2 <- rename(common_words(corpus1, twit1, list_length = Inf,
min_required = 1, remove_stop_words)[,1:2], twitter = n)
# combine the lists into a single table, remove any NA, add (and sort by) a 'total' column
word_table <- full_join(blog2, news2, by = "word")
word_table <- full_join(word_table, twit2, by = "word")
word_table <- word_table %>% replace(is.na(.), 0)
word_table$total <- word_table$blogs + word_table$news + word_table$twitter
word_table <- word_table[order(word_table$total, decreasing = TRUE),]
# only return the specified number of rows,
# if a negative number supplied then show the tail rather than the head
word_table <- word_table %>% filter(total >= min_required)
if (list_length > 0 && list_length < nrow(word_table)) {
word_table <- word_table[1:list_length,]
}
if (list_length < 0 && -list_length < nrow(word_table)) {
word_table <- word_table[(nrow(word_table)+list_length+1):nrow(word_table),]
}
# if we are drawing a chart
if (show_chart) {
# reformat the data table into a longer format
word_table2 <- word_table %>% select(-"total")
word_table2 <- word_table2 %>% pivot_longer(cols = 2:4, names_to = "type", values_to = "count")
# build the chart and print it to the console
gg <- ggplot(word_table2, aes(y = word, x = count, fill = type))
gg <- gg + geom_bar(position = "stack", width = 2, stat = "identity")
gg <- gg + scale_y_discrete(limits = rev(word_table2$word))
gg <- gg + labs(title = paste0("Word counts for ", country, " media"), y = "common words", x = "frequency")
gg <- gg + theme_minimal()
gg <- gg + theme(axis.text.y=element_text(size=5))
print(gg)
}
return(word_table)
}
show_common_words(my_corpus_medium, "US", list_length = 20,
min_required = 5, remove_stop_words = FALSE, show_chart = FALSE)
## # A tibble: 20 × 5
## word blogs news twitter total
## <chr> <int> <int> <int> <int>
## 1 the 205790 195688 39718 441196
## 2 to 118290 89962 33100 241352
## 3 and 120757 88582 18531 227870
## 4 a 99848 87102 25860 212810
## 5 of 96752 76488 15159 188399
## 6 in 66187 67054 16172 149413
## 7 i 86293 15621 30585 132499
## 8 that 50749 34144 10089 94982
## 9 is 48157 28316 15457 91930
## 10 for 40005 35032 16500 91537
## 11 it 44834 21827 12592 79253
## 12 on 30821 26947 11760 69528
## 13 you 33073 9563 23202 65838
## 14 with 31564 25462 7325 64351
## 15 was 30946 22773 4986 58705
## 16 at 19138 21344 7956 48438
## 17 this 28673 12197 7043 47913
## 18 as 24903 18863 2868 46634
## 19 my 30310 4065 12155 46530
## 20 be 23216 15077 7898 46191
show_common_words(my_corpus_medium, "US", list_length = 20,
min_required = 5, remove_stop_words = TRUE, show_chart = TRUE)
## # A tibble: 20 × 5
## word blogs news twitter total
## <chr> <int> <int> <int> <int>
## 1 time 10215 5778 3384 19377
## 2 people 6556 4713 2181 13450
## 3 day 5820 3191 3925 12936
## 4 love 5096 1014 4560 10670
## 5 life 4518 1574 1431 7523
## 6 home 3101 3091 1015 7207
## 7 week 3015 2257 1242 6514
## 8 school 1981 3519 845 6345
## 9 world 3257 1611 954 5822
## 10 game 1028 3375 1317 5720
## 11 night 2202 1694 1733 5629
## 12 city 1159 3802 406 5367
## 13 lot 2456 1803 561 4820
## 14 family 2272 1999 522 4793
## 15 feel 2730 973 1025 4728
## 16 house 1987 1989 583 4559
## 17 days 2235 1437 800 4472
## 18 team 903 2802 597 4302
## 19 book 3076 622 512 4210
## 20 season 784 2898 456 4138
We can see from this that the most popular words are similar across the media types, but once we remove the stop words differences start to emerge. The chart helps to make this clearer.
We write a further function to visualise popular words in a more appealing way using the wordcloud2 package.
## This function draws a wordcloud for a given word count list. The user can ##
## define which columns are the count column and word column from a tibble. ##
draw_wordcloud <- function(tibble1, count_col, word_col = "word") {
# build wordcloud and print to console
wordcloud <- wordcloud2(tibble1[,c(word_col, count_col)],
fontFamily = "arial",
fontWeight = "normal",
color = "random-light",
minRotation = -pi/4,
maxRotation = pi/4,
rotateRatio = 0.75,
shape = "circle")
return(wordcloud)
}
us_common_words <- show_common_words(my_corpus_medium, "US", list_length = 100,
min_required = 5, remove_stop_words = TRUE, show_chart = FALSE)
wordcloud_news <- draw_wordcloud(us_common_words, "news")
wordcloud_news
wordcloud_blogs <- draw_wordcloud(us_common_words, "blogs")
wordcloud_blogs
wordcloud_twitter <- draw_wordcloud(us_common_words, "twitter")
wordcloud_twitter
We can easily see from this that the most common words are ‘time’ and ‘people’ for news; ‘time’, ‘people’ and ‘day’ for blogs; and ‘love’ and ‘day’ for twitter.
After analysing single words we move on to looking at ngrams, which are sequences of consecutive words found in the text. We adapt our functions for finding and displaying common words to work for ngrams.
The first function lists common ngrams. We add an option so the user can choose between the tidytext package (unnest_tokens) and the ngram package (ngram) to split out the ngrams. The ngram package runs its calculations in C rather than R, so it is much quicker at processing large strings of text; however it produces slightly different results from unnest_tokens and so may be slightly out of kilter with the other functions written so far.
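The difference between the two tokenisers can be seen on a toy example; a minimal sketch comparing both routes on a short string:
# compare the two ways of extracting bigrams from a toy string
toy <- "the cat sat on the cat mat"
get.phrasetable(ngram(toy, n = 2))
tibble(text = toy) %>%
unnest_tokens(output = "bigram", token = "ngrams", input = "text", n = 2) %>%
count(bigram, sort = TRUE)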
## This function creates an ngram list ordered by count from a text string ##
## held in a corpus. The user can define the length of the ngram (X words) ##
## and the length of the list produced (i.e. the top X ngrams). The user can ##
## also decide whether to run the function fast - this uses the 'ngram' ##
## package as opposed to unnest_tokens from the 'tidytext' package, this is ##
## significantly quicker but produces slightly different results. The user can ##
## also decide whether to include bookends within the ngrams and whether to ##
## specify a minimum number of times a word should appear to be included. ##
common_ngrams <- function(corpus1, file1, ngram_length = 3, list_length = Inf,
min_required = 5, add_bookends = TRUE, run_fast = TRUE) {
# create a tibble with a row per line and unique row ids
tibble1 <- create_tibble(corpus1, file1, break_string = "sentence", add_bookends = add_bookends)
# if we are on a fast run then use the ngram function to produce the ngrams
if (run_fast) {
# create a long string of words from the text, in lower case with no punctuation
vec1 <- paste(tibble1$content, collapse = " ")
vec1 <- str_to_lower(vec1)
vec1 <- str_replace_all(vec1,"[[:punct:]]","")
# split into n-grams, remove ngrams with bookends or NA, then format output similar to unnest_tokens
top_ngrams <- ngram(vec1, n = ngram_length) %>%
get.phrasetable() %>%
filter(!(str_detect(ngrams, "strendstr|strstartstr"))) %>%
drop_na() %>%
rename(n = freq, ngram = ngrams) %>%
mutate(cumulative = cumsum(prop), ngram = str_sub(ngram,1,-2))
} else {
# otherwise use unnest_tokens which is slower, but consistent with other functions we have written
top_ngrams <- tibble1 %>%
unnest_tokens(output = "ngram", token = "ngrams", input = "content", n = ngram_length) %>%
group_by(ngram) %>%
count(ngram, sort = TRUE) %>%
ungroup()
}
# only return the specified number of rows,
# if a negative number supplied then show the tail rather than the head
top_ngrams <- top_ngrams %>% filter(n >= min_required)
if (list_length > 0 && list_length < nrow(top_ngrams)) {
top_ngrams <- top_ngrams[1:list_length,]
}
if (list_length < 0 && -list_length < nrow(top_ngrams)) {
top_ngrams <- top_ngrams[(nrow(top_ngrams)+list_length+1):nrow(top_ngrams),]
}
return(top_ngrams)
}
common_ngrams(my_corpus_medium, "en_US.news.txt", ngram_length = 5,
list_length = 10, min_required = 5, add_bookends = TRUE, run_fast = TRUE)
## ngram n prop cumulative
## 1 at the end of the 122 3.211269e-05 3.211269e-05
## 2 for the first time in 67 1.763566e-05 4.974835e-05
## 3 in the middle of the 61 1.605635e-05 6.580470e-05
## 4 for the first time since 61 1.605635e-05 8.186104e-05
## 5 by the end of the 49 1.289772e-05 9.475876e-05
## 6 the end of the day 43 1.131841e-05 1.060772e-04
## 7 at the time of the 42 1.105519e-05 1.171324e-04
## 8 at the top of the 36 9.475876e-06 1.266082e-04
## 9 the end of the year 34 8.949439e-06 1.355577e-04
## 10 there are a lot of 34 8.949439e-06 1.445071e-04
We can see that the most common ngrams are popular phrases, or parts of popular phrases, which is as we would expect.
The second function lists the common ngrams for all three media types for a given country, and draws a chart if required.
## This function generates a tibble of the most common ngrams for each media ##
## type for a given country. The user can define the same attributes as in ##
## common_ngrams, and whether a chart is shown. ##
show_common_ngrams <- function(corpus1, country = "US", ngram_length = 3, list_length = 100,
min_required = 5, add_bookends = TRUE, run_fast = TRUE, show_chart = TRUE) {
# get the filenames for the required country
blog1 <- docnames(corpus1)[str_detect(docnames(corpus1),paste0(country,".*blogs"))]
news1 <- docnames(corpus1)[str_detect(docnames(corpus1),paste0(country,".*news"))]
twit1 <- docnames(corpus1)[str_detect(docnames(corpus1),paste0(country,".*twitter"))]
# get the common ngram list for each media type
blog2 <- rename(common_ngrams(corpus1, blog1, ngram_length, list_length = Inf,
min_required = 1, add_bookends, run_fast)[,1:2], blogs = n)
news2 <- rename(common_ngrams(corpus1, news1, ngram_length, list_length = Inf,
min_required = 1, add_bookends, run_fast)[,1:2], news = n)
twit2 <- rename(common_ngrams(corpus1, twit1, ngram_length, list_length = Inf,
min_required = 1, add_bookends, run_fast)[,1:2], twitter = n)
# combine the lists into a single table, remove any NA, add (and sort by) a 'total' column
ngram_table <- full_join(blog2, news2, by = "ngram")
ngram_table <- full_join(ngram_table, twit2, by = "ngram")
ngram_table <- ngram_table %>% replace(is.na(.), 0)
ngram_table$total <- ngram_table$blogs + ngram_table$news + ngram_table$twitter
ngram_table <- ngram_table[order(ngram_table$total, decreasing = TRUE),]
# only return the specified number of rows,
# if a negative number supplied then show the tail rather than the head
ngram_table <- ngram_table %>% filter(total >= min_required)
if (list_length > 0 && list_length < nrow(ngram_table)) {
ngram_table <- ngram_table[1:list_length,]
}
if (list_length < 0 && -list_length < nrow(ngram_table)) {
ngram_table <- ngram_table[(nrow(ngram_table)+list_length+1):nrow(ngram_table),]
}
# if we are drawing a chart
if (show_chart) {
# reformat the data table into a longer format
ngram_table2 <- ngram_table %>% select(-"total")
ngram_table2 <- ngram_table2 %>% pivot_longer(cols = 2:4, names_to = "type", values_to = "count")
# build the chart and print it to the console
gg <- ggplot(ngram_table2, aes(y = ngram, x = count, fill = type))
gg <- gg + geom_bar(position = "stack", width = 2, stat = "identity")
gg <- gg + scale_y_discrete(limits = rev(ngram_table2$ngram))
gg <- gg + labs(title = paste0("Ngram counts for ", country, " media"), y = "common ngrams", x = "frequency")
gg <- gg + theme_minimal()
gg <- gg + theme(axis.text.y=element_text(size=5))
print(gg)
}
return(ngram_table)
}
show_common_ngrams(my_corpus_medium, "US", ngram_length = 5, list_length = 10,
min_required = 5, add_bookends = TRUE, run_fast = TRUE, show_chart = TRUE)
## ngram blogs news twitter total
## 1 at the end of the 209 122 24 355
## 2 in the middle of the 85 61 21 167
## 3 for the first time in 55 67 20 142
## 4 the end of the day 61 43 7 111
## 5 by the end of the 42 49 10 101
## 6 there are a lot of 42 34 9 85
## 7 is one of the most 46 28 8 82
## 8 for the first time since 15 61 5 81
## 9 is going to be a 34 27 19 80
## 10 for the rest of the 38 24 15 77
Again we see common phrases appear in the list, and start to see differences between the three media types with twitter appearing to be less formal than the other two. We compare this with the common ngrams from the full corpus:
show_common_ngrams(my_corpus_full, "US", ngram_length = 5, list_length = 10,
min_required = 5, add_bookends = TRUE, run_fast = TRUE, show_chart = TRUE)
## ngram blogs news twitter total
## 1 at the end of the 1596 1377 644 3617
## 2 in the middle of the 841 574 392 1807
## 3 for the first time in 465 695 430 1590
## 4 the end of the day 547 446 303 1296
## 5 by the end of the 434 571 231 1236
## 6 for the rest of the 428 327 383 1138
## 7 its going to be a 157 305 631 1093
## 8 thank you so much for 189 6 832 1027
## 9 is going to be a 275 196 547 1018
## 10 there are a lot of 403 388 167 958
We see that similar phrases rise to the top of the list as expected, but now based upon much more data. We also look at the bottom of the two lists and see that the least frequent ngrams from the full corpus are less intuitive than those from the medium corpus: they have the same count (5) but come from a much larger sample, so they are in effect ‘rarer’, i.e. they have a lower probability of occurring.
show_common_ngrams(my_corpus_medium, "US", ngram_length = 5, list_length = -10,
min_required = 5, add_bookends = TRUE, run_fast = TRUE, show_chart = FALSE)
## ngram blogs news twitter total
## 5849 mothers day to all you 0 0 5 5
## 5850 i hate when people say 0 0 5 5
## 5851 available in your customer area 0 0 5 5
## 5852 wait to see you guys 0 0 5 5
## 5853 follow me and ill follow 0 0 5 5
## 5854 look forward to meeting you 0 0 5 5
## 5855 in your customer area now 0 0 5 5
## 5856 follow me so i can 0 0 5 5
## 5857 heres my number so call 0 0 5 5
## 5858 and thank you for the 0 0 5 5
show_common_ngrams(my_corpus_full, "US", ngram_length = 5, list_length = -10,
min_required = 5, add_bookends = TRUE, run_fast = TRUE, show_chart = FALSE)
## ngram blogs news twitter total
## 286772 give the gift of health 0 0 5 5
## 286773 remember all the good times 0 0 5 5
## 286774 cant you just tell me 0 0 5 5
## 286775 are so many new music 0 0 5 5
## 286776 the members of one direction 0 0 5 5
## 286777 it should be fixed now 0 0 5 5
## 286778 on my profile can help 0 0 5 5
## 286779 were happy to be a 0 0 5 5
## 286780 lift my feet off the 0 0 5 5
## 286781 do you need more followers 0 0 5 5
We now write two functions to enable us to predict next words based on ngrams. The first function creates a long ngram lookup list containing ngrams up to a specified number of words, based upon the corpus text supplied. The second function uses this lookup list to quickly search for a matching ngram and the expected next word.
The first function allows the user to specify the maximum length of the ngram (in words) and how many ngrams of each length to hold. Each ngram is stored along with a count of the number of times it appears in the text, and is also split into two parts: one containing all words except the last (ngram_less1), and one containing the last word only (ngram_last). This allows the list to be queried on the ngram_less1 column and the possible next words to be read easily from the corresponding ngram_last column.
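The split into ngram_less1 and ngram_last uses the same regular expressions as the function below; for example:
# splitting a 5-gram into its look-up key and its final word
ng <- "at the end of the"
str_remove(ng, "\\s(\\w+)$") # "at the end of" -> ngram_less1
str_extract(ng, "(\\w+)$")   # "the"           -> ngram_last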
## This function produces a list of ngrams for a specified country, The user ##
## can specify the max words per ngram, and the number of ngrams of each word ##
## length to hold. The user can also specify the minimum number of ##
## occurrences for an ngram to appear in the list, as well as whether bookends##
## are to be included. ##
get_ngram_lookup_list <- function(corpus1, country, ngram_max = 5, ll_max = 10000,
min_word_bar = 5, add_bookends = TRUE, run_fast = TRUE) {
# start an empty list to append to as we go
ngram_list = list()
# ensure max ngram is at least 2
if (ngram_max < 2) {
ngram_max <- 2
}
# loop through each ngram length and get an ngram list for each length
for (ng in 2:ngram_max) {
x <- show_common_ngrams(corpus1, country = country, ngram_length = ng,
list_length = ll_max, add_bookends = add_bookends,
run_fast = run_fast, show_chart = FALSE)
# add columns which split the ngram into two parts: the last word, and the rest
x <- x %>% mutate(ngram_less1 = str_remove(ngram, "\\s(\\w+)$"), ngram_last = str_extract(ngram, "(\\w+)$"))
x <- x %>% rename(count = total) %>% select(ngram_less1, ngram_last, count)
ngram_list <- append(ngram_list, list(x))
}
return(ngram_list)
}
The second function allows the user to pass a short string (i.e. an ngram) and then looks for this string in the ngram_less1 column. If matches are found then these are returned along with the corresponding final words of the looked-up ngrams and the count of occurrences. If no matches are found then the function looks for an ngram which is one word shorter than the one supplied (by removing the first word). This continues until a match is found; if the string is reduced below two words without a match then the function returns NULL.
## This function allows the user to specify a string and then looks for a ##
## match in an ngram list, in the ngram_less1 column. If match(es) are found ##
## then the most common X occurrences are returned (the user can specify) ##
## if not then the function checks for a shorter ngram (by removing the first ##
## word from the ngram). If the function fails to find an ngram matching any ##
## component of the supplied ngram then it returns NULL. ##
lookup_ngram <- function(ngram_list, ngram1, results = 1) {
# determine the ngram length
ngram_length <- str_count(ngram1, "\\w+")
# while we still have an ngram of 2 or more
while (ngram_length >= 2) {
# filter the ngram list for this ngram (in the less1 column) and sort by count
lookup_table <- ngram_list[[ngram_length]] %>% filter(ngram_less1 == ngram1) %>% arrange(desc(count))
# if we have found matches then return them
if(nrow(lookup_table)>0) {
return(lookup_table[1:min(results,nrow(lookup_table)),])
}
# otherwise we reduce the ngram by one word
ngram1 <- word(ngram1,start = 2, end = ngram_length)
ngram_length <- str_count(ngram1, "\\w+")
}
# if we find no matches then return NULL
return(NULL)
}
We can test that this works by typing short sentences (we use the full corpus for more matches):
ngram_list_full <- get_ngram_lookup_list(my_corpus_full, "US", ngram_max = 6, ll_max = Inf,
min_word_bar = 1, add_bookends = TRUE, run_fast = TRUE)
lookup_ngram(ngram_list_full, "once upon a time there", results = 10)
## ngram_less1 ngram_last count
## 1 once upon a time there was 35
lookup_ngram(ngram_list_full, "predicting things is", results = 10)
## ngram_less1 ngram_last count
## 1 things is the 30
## 2 things is not 28
## 3 things is a 21
## 4 things is that 20
## 5 things is to 12
## 6 things is more 9
## 7 things is just 7
## 8 things is difficult 7
## 9 things is when 6
## 10 things is one 5
lookup_ngram(ngram_list_full, "red white and", results = 10)
## ngram_less1 ngram_last count
## 1 red white and blue 151
## 2 red white and green 6
## 3 red white and bluezz 6
## 4 red white and black 5
lookup_ngram(ngram_list_full, "gobbledygook dvgibieuyebv", results = 10)
## NULL
lookup_ngram(ngram_list_full, "lucy in the sky with", results = 10)
## ngram_less1 ngram_last count
## 1 lucy in the sky with diamonds 5
Our prediction tool appears to be working!
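As a final illustration, the lookup can be wrapped into a simple next-word predictor; a minimal sketch (predict_next_word is a name we introduce here, and the input is normalised in the same way the ngram lists were built):
# a thin wrapper returning just the predicted next word(s) for a phrase
predict_next_word <- function(ngram_list, phrase, results = 3) {
# lower case and strip punctuation, matching the run_fast ngram processing
phrase <- str_replace_all(str_to_lower(phrase), "[[:punct:]]", "")
match1 <- lookup_ngram(ngram_list, phrase, results = results)
if (is.null(match1)) return(NA_character_)
return(match1$ngram_last)
}
predict_next_word(ngram_list_full, "red white and")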