In Honeycomb Puzzles, The goal is to identify how many words meet the following criteria:
1. The word must be at least four letters long.
2. The word must include the central letter.
3. The word cannot include any letter beyond the seven given letters.
In this example, our honeycomb is the letters a, p, x, m, e, l and center letter g
First task: How many words can be made? What are the highest scoring words? Feel free to explore anything else you think is interesting
Second task: Which seven-letter honeycomb set-up results in the highest possible game score? To be a valid choice of seven letters, no letter can be repeated, it must not contain the letter S (that would be too easy) and there must be at least one pangram.
Third task: If we assume the lowest possible scoring honeycomb is one with only one 7 letter solution - it would score 14 points. Can such a puzzle exist? If so, what is the pangram?
We will take each task one by one. Our strategy for the first task will be to create a regular expression for the example honeycomb which contains a ‘lookahead’ to ensure that we only match with words that have the letter g in them and do not contain any other letters. But first, let’s load our libraries, read the dictionary, convert it to a tibble and filter out words shorter than 4 characters .
##TASK ONE
library(tidyverse)
library(dplyr)
library(readr)
library(stringr)
library(stringi)
library(hash)
library(gtools)
#import the dictionary directly from the URL with the data.table package
# library(data.table)
# dictionary <- fread("https://norvig.com/ngrams/enable1.txt", header = FALSE)
# dictionary <- dictionary %>% as_tibble() %>%rename(words = "V1")
# setwd("C:/Users/jp168/projects/share_creative")
file <- "enable1.txt"
dictionary <- read_table(file, col_names = FALSE)
dictionary <- dictionary %>% as_tibble()%>% rename(words = "X1")%>%
filter(stri_length(words) > 3)
head(dictionary)
## # A tibble: 6 x 1
## words
## <chr>
## 1 aahed
## 2 aahing
## 3 aahs
## 4 aalii
## 5 aaliis
## 6 aals
Now we need to write the regular expression, and subset our dictionary by the words the pattern matches with:
original_honeycomb <- (c("l", "a","p","x","m","e","g"))
pattern <-"(?=.*[g])^[lapxmeg]*$"
sum(str_detect(dictionary$words, pattern))
## [1] 43
dictionary$words[str_detect(dictionary$words, pattern)]
## [1] "agama" "agapae" "agape" "agee" "agleam" "aglee" "agma"
## [8] "alga" "algae" "algal" "allege" "amalgam" "eagle" "egal"
## [15] "gaga" "gage" "gaggle" "gala" "galax" "gale" "galea"
## [22] "galeae" "gall" "gama" "game" "gamma" "gamp" "gape"
## [29] "gelee" "gemma" "gemmae" "gleam" "glee" "gleg" "legal"
## [36] "mage" "magma" "page" "peag" "peage" "pelage" "plagal"
## [43] "plage"
To figure out which word scores highest, we need to check for pangrams, take each pangram’s length, add 7 to it, then see which word has the highest score. By eye it looks like there are no pangrams, so we check to see whether our pattern failed to detect megaplex, or if megaplex is not in the dictionary that we are using.
sum(str_detect(dictionary$words, "megaplex"))
## [1] 0
Megaplex is not in our dictionary (a manual search confirms this), but let’s confirm that none of the other words our pattern matched with are pangrams. We can write a new regular expression which matches only words that contain every letter of the original honeycomb, by increasing the amount of lookaheads:
pangram <- "(?=.*[l])(?=.*[a])(?=.*[p])(?=.*[x])(?=.*[m])(?=.*[e])(?=.*[g])^[lapxmeg]+$"
sum(str_detect(dictionary$words, pangram))
## [1] 0
It looks like there are no pangrams, but let’s check that our regular expressions are working as intended:
str_detect("megaplex", pangram)#TRUE
str_detect("megaplexo", pangram) #FALSE
str_detect("egaplex", pangram) #FALSE
Given that there are no pangrams in our dictionary, the highest scoring word will be the longest word:
matches <- as_tibble(dictionary$words[str_detect(dictionary$words, pattern)])
top_5_matches <- matches %>%
arrange(desc(stri_length(value)))%>%
head()%>%
pull()
top_5_matches #amalgam, agapee, agleam, allege, gaggle, galeae
## [1] "amalgam" "agapae" "agleam" "allege" "gaggle" "galeae"
Now that we’ve seen our top words from the original honeycomb, and in anticipation of task two, we’re going to quickly investigate the frequency of letters in our dictionary. We will write a function which generates a regular expression for each letter of the alphabet, searches the dictionary, and finally prints the proportion of words in which it appears:
alphabet = c("a", "b","c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m","n","o","p","q", "r", "s", "t","u","v","w","x","y","z")
frequency <- function(x){
x = toString(x)
quantity <- str_detect(dictionary$words, sprintf("%s",x))
mean(quantity)
}
Rather than plug and chug each letter into the function, we will apply the alphabet to the function:
letter_frequency <- lapply(X = alphabet, FUN = frequency)
letter_frequency <- unlist(letter_frequency, use.names=FALSE)
letter_frequency
## [1] 0.54448856 0.15407099 0.32078229 0.27323699 0.70274582 0.10008617
## [7] 0.22130164 0.19453049 0.59297708 0.01418324 0.07332666 0.39855140
## [13] 0.23033793 0.48950231 0.46061763 0.23559551 0.01469561 0.52735922
## [19] 0.60416764 0.48446015 0.27024431 0.08605431 0.06484350 0.02650915
## [25] 0.14194303 0.04079137
Now let’s map each letter to its frequency:
#map each letter to its frequency
letter_frequency <- hash(alphabet, letter_frequency)
#turn the hashed object into a data frame
letter_frequency <- as.data.frame(as.list(letter_frequency))
#We don't want each letter as a column, so let's tidy the data up
#first we'll order it by the alphabet then turn it into a tibble
letter_frequency <- letter_frequency[alphabet]
letter_frequency <- letter_frequency %>%
gather(letter, frequency, names(letter_frequency))
letter_frequency <- as_tibble(letter_frequency)
letter_frequency
## # A tibble: 26 x 2
## letter frequency
## <chr> <dbl>
## 1 a 0.544
## 2 b 0.154
## 3 c 0.321
## 4 d 0.273
## 5 e 0.703
## 6 f 0.100
## 7 g 0.221
## 8 h 0.195
## 9 i 0.593
## 10 j 0.0142
## # ... with 16 more rows
We can visualise the distribution with a histogram
letter_frequency %>%
ggplot(aes(frequency))+
geom_histogram(binwidth=0.1, color = "black")+
theme_bw()+
theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())
A bar chart will be more informative for the second task
letter_frequency%>%
ggplot(aes(x= reorder(letter, frequency), y = frequency))+
geom_col()+
coord_flip()+
xlab("letter")+
ylab("frequency")
This raises the question of how much g being the central letter limited the original honeycomb. Let’s have a look what would happen if e were the central letter instead:
#This raises a question, how much did g being the central hexagon affect our matches?
patterne <- "(?=.*[e])^[lapxmeg]*$"
#before we had 43 matches
nrow(matches)
## [1] 43
sum(str_detect(dictionary$words, patterne)) #Now we have 78
## [1] 78
matches_with_e <- dictionary$words[str_detect(dictionary$words, patterne)]
matches_with_e
## [1] "agapae" "agape" "agee" "agleam" "aglee" "alae"
## [7] "alee" "algae" "allee" "allege" "allele" "alme"
## [13] "ample" "apex" "appeal" "appel" "appellee" "apple"
## [19] "axel" "axle" "eagle" "egal" "empale" "epee"
## [25] "exam" "example" "exempla" "expel" "expellee" "gage"
## [31] "gaggle" "gale" "galea" "galeae" "game" "gape"
## [37] "gelee" "gemma" "gemmae" "gleam" "glee" "gleg"
## [43] "lame" "lamella" "lamellae" "lapel" "leal" "leap"
## [49] "legal" "lemma" "lexeme" "mage" "male" "mallee"
## [55] "mammae" "mammee" "maple" "meal" "melee" "mell"
## [61] "paella" "page" "pale" "palea" "paleae" "paleal"
## [67] "peag" "peage" "peal" "peel" "peep" "pelage"
## [73] "pele" "pellmell" "pepla" "plage" "plea" "plexal"
##Task Two
Which seven-letter honeycomb set-up results in the highest possible game score? To be a valid choice of seven letters, no letter can be repeated, it must not contain the letter S (that would be too easy) and there must be at least one pangram.
For this task we need an efficient method to generate patterns and pangram patterns, detect matches and produce a score. Removing s from the equation, we have 25 letters and 7 spaces available, which means we have 25 choose 7 permutations, or approximately 33 billion, or alternatively ~500k combinations (without repeats). Without making some assumptions, each seven letter honeycomb has seven permutations - each letter must be the central letter. However, as we are trying to maximise the possible score, we can dismiss out of hand the lowest frequency letters like j, q, x, w… etc. reducing the pool of permutations. Another assumption we are going to make, is that the most frequent letter, e, will be our central letter.
Now the question is where should our cut-off point be? If we take a look at the frequency table, the 15th most frequent letter is g, which appears in just over 1 in 5 words in our dictionary. Although the choice is somewhat arbitrary, we are going to include g as we recognise that after s/es and ed, g is the most common suffix, which means a disproportionate amount of high scoring words will have g in them. As e will be our central letter, we are now looking at 13 (the top 15 in our frequency table minus e and s) choose 6 combinations, with another assumption we can reduce this number further.
We are going to assume that a, i, and r will join e in every honeycomb as they are the next most frequent letters.
Let’s generate our ten common letters from which we will compute all combinations, and then add e, a, i and r
letter_frequency <- letter_frequency %>% arrange(desc(frequency))
common_letters <- letter_frequency$letter[5:15]
honeycombs <- combinations(11, 3, common_letters, repeats.allowed = FALSE)
honeycombs <- as_tibble(honeycombs) %>%
mutate(V4 = "e", V5 = "a", V6 = "i", V7 = "r")
honeycombs
## # A tibble: 165 x 7
## V1 V2 V3 V4 V5 V6 V7
## <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 c d g e a i r
## 2 c d l e a i r
## 3 c d m e a i r
## 4 c d n e a i r
## 5 c d o e a i r
## 6 c d p e a i r
## 7 c d r e a i r
## 8 c d t e a i r
## 9 c d u e a i r
## 10 c g l e a i r
## # ... with 155 more rows
#We are going to convert honeycombs to a matrix, the reason will become clearer later
honeycombs <- as.matrix(honeycombs)
Now we need to take our 165 honeycombs and create regular expressions from them. Taking a look at our original honeycomb, we notice that pangram contains everything pattern contains, and a bit extra. So if we build our function for generating the pangram first, we will be able to duplicate the same logic for our pattern generator.
pattern
## [1] "(?=.*[g])^[lapxmeg]*$"
pangram
## [1] "(?=.*[l])(?=.*[a])(?=.*[p])(?=.*[x])(?=.*[m])(?=.*[e])(?=.*[g])^[lapxmeg]+$"
This function will take an input in the same form as our original honeycomb and generate a pangram pattern:
original_honeycomb # a reminder of our original honeycomb
## [1] "l" "a" "p" "x" "m" "e" "g"
generate_pangram <- function(x){
pattern <- sprintf("(?=.*[%s])", x) #will create a lookahead for every letter in our input vector
pattern <- toString(pattern) #turns our pattern into a string
pattern <- str_remove_all(pattern, ",") #removes the commas separating the lookaheads
pattern <- str_remove_all(pattern, " ") #removes the spaces
pasted <- paste(x, collapse = "") #creates a string from our input
second_half <- sprintf("^[%s]+$", pasted) #creates the second part of our regex pattern from the input string
final <- c(pattern, second_half)
final <- paste(final, collapse = "")
final
}
generate_pangram(honeycombs[1, 1:7]) # test
## [1] "(?=.*[c])(?=.*[d])(?=.*[g])(?=.*[e])(?=.*[a])(?=.*[i])(?=.*[r])^[cdgeair]+$"
#let's check if it matches our original pangram
identical(generate_pangram(c("l", "a", "p", "x", "m", "e", "g")), pangram)
## [1] TRUE
Great, our pangram generator seems to be working, now let’s write the function for our pattern generator
generate_pattern <- function(x){
pattern <- sprintf("(?=.*[%s])", x)
pattern <- toString(pattern)
pattern <- str_remove_all(pattern, ",")
pattern <- str_remove_all(pattern, " ")
pasted <- paste(x, collapse = "")
second_half <- sprintf("(?=.*[g])^[%s]*$", pasted)
second_half
}
generate_pattern(c("l", "a", "p", "x", "m", "e", "g")) #looks good
## [1] "(?=.*[g])^[lapxmeg]*$"
identical(pattern,generate_pattern(c("l", "a", "p", "x", "m", "e", "g")))
## [1] TRUE
The pattern generator is working as intended, but remembering that we made an assumption at the beginning of this task, we rewrite the function so that e is our central letter:
generate_pattern <- function(x){
pattern <- sprintf("(?=.*[%s])", x)
pattern <- toString(pattern)
pattern <- str_remove_all(pattern, ",")
pattern <- str_remove_all(pattern, " ")
pasted <- paste(x, collapse = "")
second_half <- sprintf("(?=.*[e])^[%s]*$", pasted)
second_half
}
original_test <- generate_pattern(original_honeycomb)
original_test
## [1] "(?=.*[e])^[lapxmeg]*$"
Now we can write a function which will take the outputs of our generated patterns and match them against our dictionary:
get_matches <- function(x){
matches <- str_detect(dictionary$words, x)
matches <- as_tibble(dictionary$words[matches])
matches
}
get_matches(original_test) #73 matches, why is this different to the 43 we had before? Because e is our central letter
## # A tibble: 78 x 1
## value
## <chr>
## 1 agapae
## 2 agape
## 3 agee
## 4 agleam
## 5 aglee
## 6 alae
## 7 alee
## 8 algae
## 9 allee
## 10 allege
## # ... with 68 more rows
#do we have any pangrams?
original_pangram <- generate_pangram(original_honeycomb)
get_matches(original_pangram) #0 pangrams
## # A tibble: 0 x 1
## # ... with 1 variable: value <chr>
Now that our individual functions are built, we need to write the logic for our scoring function, we will combine the three functions that we just wrote, and add some additional logic to create a function that takes a honeycomb in the original honeycomb’s format and calculates its score directly:
scoring <- function(x){
total <- 0 #start a running total
scoring_pattern <- generate_pattern(x)
scoring_pangram <- generate_pangram(x)
total = total + (nrow(get_matches(scoring_pangram))*7)#edit the total to be the number of pangrams * 7
scoring_matches <- get_matches(scoring_pattern) #get out pattern matches
for(match in scoring_matches$value){
if(stri_length(match)==4){
total = total + 1
} else if(stri_length(match)>4){
total = total + (stri_length(match))
}
}
print(total)
}
scoring(original_honeycomb)
## [1] 302
Our function says the original honeycomb, with a central letter of e, would be worth 302 points, we should check that manually using a different method to our running total logic in the scoring function:
sum(str_detect(dictionary$words, original_test)) # 78 words, looks good
## [1] 78
z <- as_tibble(dictionary$words[str_detect(dictionary$words, original_test)])
z_four <- z %>% filter(stri_length(z$value) == 4)%>%
nrow()
#we have 31 4 letter words
z_more_than_four <- z %>% filter(stri_length(z$value)>4)
sum(stri_length(z_more_than_four$value))+z_four #302, our scoring function seems to be working
## [1] 302
Now that we’ve generated our functions and are confident they are working, we need a way to take the honeycombs we generated and pass them into the function all at once to see which honeycomb is the highest scorer:
#now we need to figure out what the ultimate honeycomb is
scores <- apply(honeycombs, MARGIN = 1, FUN = scoring)
Applying this function to many more honeycombs would take a long time at current processing speed. Let’s map the scores to the honeycombs:
#add our scores to our honeycombs
honeycombs_with_scores <- as_tibble(honeycombs) %>% mutate(scores = scores)
honeycombs_with_scores <- honeycombs_with_scores %>%
arrange(desc(scores))
honeycombs_with_scores
## # A tibble: 165 x 8
## V1 V2 V3 V4 V5 V6 V7 scores
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl>
## 1 g n t e a i r 3769
## 2 d n t e a i r 3672
## 3 d g n e a i r 3271
## 4 d l t e a i r 3194
## 5 c d t e a i r 3023
## 6 l n t e a i r 2983
## 7 d p t e a i r 2961
## 8 c n t e a i r 2945
## 9 n p t e a i r 2886
## 10 m n t e a i r 2837
## # ... with 155 more rows
# our highest scoring honeycomb without s is:
unname(unlist(honeycombs_with_scores %>% slice(1)))
## [1] "g" "n" "t" "e" "a" "i" "r" "3769"
highest_honeycomb <- c("g", "n", "r", "e","a","i","t")
get_matches(generate_pangram(highest_honeycomb)) # 50 pangrams!
## # A tibble: 50 x 1
## value
## <chr>
## 1 aerating
## 2 aggregating
## 3 argentine
## 4 argentite
## 5 entertaining
## 6 entraining
## 7 entreating
## 8 garnierite
## 9 gartering
## 10 generating
## # ... with 40 more rows
We ought to remember that we made a fairly large assumption at the beginning of task two to limit the amount of computations we would have to make. We assumed that e would be the best central letter. However, now that we have our best honeycomb, we notice that g is included despite it being low down on the frequency list, it won’t hurt to check whether changing the central letter results in a better pangram.
#Results:
#3095 with g
#3898 with r
#3372 with a
#3406 with i
#3782 with n
#3421 with t
#3769 with e
As we can see, e was only the third best option in our honeycomb, the best honeycomb was in fact “granite” with r as the central letter!
Let’s have a peek at the distribution for fun:
honeycombs_with_scores %>%
ggplot(aes(scores))+
geom_histogram(binwidth = 100, color="black")+
theme_bw()+
theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())
And a density plot for added fun:
honeycombs_with_scores %>%
ggplot(aes(scores))+
geom_density()+
theme_bw()+
theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())
Third task: If we assume the lowest possible scoring honeycomb is one with only one 7 letter solution - it would score 14 points. Can such a puzzle exist? If so, what is the pangram?
There are perhaps many ways to tackle this task, one way would be to take all of the 7 letter words in our dictionary, split them apart, turn them into sets, re-order them and remove any sets with string length < 7, order them alphabetically and remove duplicates. This would give us a smaller pool of 7 letter words (or honeycombs) to work from.
However, when considering permutations the pool would still be quite large.Applying the remaining permutations to our scoring function and filtering for scores == 14 would take a very long time. Before resorting to this solution, perhaps we can combine some knowledge of the quirks of the English language and hack a solution - we just need to find one pangram with a score 14.
A low frequency letter as the central letter can dramatically reduce the number of words we can make from any given honeycomb. Another strategy might be to look for a seven letter word made only of unique consonants.
Whilst j is the lowest frequency letter, if we notice that q is nearly always accompanied by u in the English language, perhaps if we can find seven letter words that have q in them but no u, and no repeat characters, we might find our solution:
#we grab our seven letter words
seven_letter_words <- dictionary %>%
filter(stri_length(words) == 7)
no_u <- "(?=.*q)((?!u).){7}"
no_u_match <- str_detect(seven_letter_words$words, no_u)
seven_letter_words$words[no_u_match]
## [1] "qindars" "qintars" "qwertys"
candidates <- seven_letter_words$words[no_u_match]
candidates
## [1] "qindars" "qintars" "qwertys"
We have some candidates, if we set q as the central letter, we may get lucky:
qindars_pattern <-"(?=.*[q])^[qindars]*$"
qintars_pattern <-"(?=.*[q])^[qintars]*$"
qwertys_pattern <-"(?=.*[q])^[qwertys]*$"
get_matches(qindars_pattern) #4 matches
## # A tibble: 4 x 1
## value
## <chr>
## 1 qaid
## 2 qaids
## 3 qindar
## 4 qindars
get_matches(qintars_pattern) #7 matches
## # A tibble: 7 x 1
## value
## <chr>
## 1 qanat
## 2 qanats
## 3 qats
## 4 qintar
## 5 qintars
## 6 tranq
## 7 tranqs
get_matches(qwertys_pattern) #close, but no cigar!
## # A tibble: 2 x 1
## value
## <chr>
## 1 qwerty
## 2 qwertys
We try a few more combinations of low frequency letters:
xz <- ("(?=.*x)(?=.*z)")
seven_letter_words$words[str_detect(seven_letter_words$words, xz)]
## [1] "oxazine" "oxidize"
#oxazine #let's try with both x and z as the central letters
oxazine <- "(?=.*[z])^[oxazine]*$"
seven_letter_words$words[str_detect(seven_letter_words$words, oxazine)] #ozonize
## [1] "oxazine" "ozonize"
oxazine2 <- "(?=.*[x])^[oxazine]*$"
seven_letter_words$words[str_detect(seven_letter_words$words, oxazine2)]
## [1] "oxazine"
dictionary$words[(str_detect(dictionary$words, oxazine2))] #nope
## [1] "annex" "annexe" "anoxia" "axion" "axon" "axone" "exine"
## [8] "exon" "ixia" "nixe" "nixie" "oxazine" "oxen" "xenia"
## [15] "xenon"
jq <- ("(?=.*j)(?=.*q)")
seven_letter_words$words[str_detect(seven_letter_words$words, jq)]
## [1] "jonquil"
jonquil <- "(?=.*[j])^[jonquil]*$"
dictionary$words[(str_detect(dictionary$words, jonquil))] #nope
## [1] "jill" "jillion" "jinn" "jinni" "join" "jonquil" "juju"
jx <- ("(?=.*j)(?=.*x)")
seven_letter_words$words[str_detect(seven_letter_words$words, jx)]
## [1] "jinxing" "jukebox" "outjinx"
outjinx <- "(?=.*[j])^[outjinx]*$"
dictionary$words[(str_detect(dictionary$words, outjinx))] # nope, and there is jinx so x won't work as the central letter
## [1] "jinn" "jinni" "jinx" "join" "joint" "juju" "junto"
## [8] "outjinx" "outjut" "unjoint"
xq <- ("(?=.*q)(?=.*x)")
seven_letter_words$words[str_detect(seven_letter_words$words, xq)]
## [1] "equinox" "quixote"
equinox <- "(?=.*[x])^[equinox]*$"
dictionary$words[(str_detect(dictionary$words, equinox))]
## [1] "equinox" "exine" "exon" "nixe" "nixie" "oxen" "xenon"
quixote <- "(?=.*[x])^[quixote]*$"
dictionary$words[(str_detect(dictionary$words, quixote))] #exit and text
## [1] "exit" "quixote" "text"
It’s looking like our attempts to hack the solution aren’t going to work, and that we may need to depend on brute force. But, if we think outside the box and remember that megaplex was not in our dictionary, perhaps there are some seven letter words with q in them but no u in the English language that we haven’t picked up. A google search gives us the word ‘kamotiq’
kamotiq <- c("k","a", "m","o","q","i","t")
kamotiq_pattern <- "(?=.*[q])^[kamotiq]*$"
sum(str_detect(dictionary$words, kamotiq_pattern))
## [1] 0
#quick check to see whether with a as the central letter we find some matches
sum(str_detect(dictionary$words, "(?=.*[a])^[kamotiq]*$"))
## [1] 30
It seems like kamotiq, with q as the central letter, would fit our criteria for a perfect low scoring pangram.