Problem Statement from the Riddler

The New York Times recently launched some new word puzzles, one of which is Spelling Bee. In this game, seven letters are arranged in a honeycomb lattice, with one letter in the center.

The goal is to identify as many words that meet the following criteria:

  1. The word must be at least four letters long.
  2. The word must include the central letter.
  3. The word cannot include any letter beyond the seven given letters.

Note that letters can be repeated. For example, the words GAME and AMALGAM are both acceptable words. Four-letter words are worth 1 point each, while five-letter words are worth 5 points, six-letter words are worth 6 points, seven-letter words are worth 7 points, etc. Words that use all of the seven letters in the honeycomb are known as “pangrams” and earn 7 bonus points (in addition to the points for the length of the word). So in the above example, MEGAPLEX is worth 15 points.

Which seven-letter honeycomb results in the highest possible game score? To be a valid choice of seven letters, no letter can be repeated, it must not contain the letter S (that would be too easy) and there must be at least one pangram.

For consistency, please use the word list at https://norvig.com/ngrams/enable1.txt to check your game score.

The Words

We read in the words from the website given.

words <- url("https://norvig.com/ngrams/enable1.txt")
pangrams <- readLines(words, warn = FALSE)
length(pangrams)
## [1] 172820

As we will not be using words that are of length less than 4, we will remove all of these from the list as they are not worth any points. To do this, we need a function that takes a word as an input and outputs the number of letters in it. While we’re at it, we’ll build a function that also outputs the number of unique letters in the word.

## Function for number of unique letters in a word
unqLet <- function(x){
  return(length(unique(strsplit(x,split="")[[1]])))
}

## Function for total number of letters in a word
numLet <- function(x){
  return(length(strsplit(x,split="")[[1]]))
}

## Apply to all the words 
NumLets <- sapply(pangrams, numLet)

## Reduce the word list to those with 4 or more letters.
pangrams <- pangrams[NumLets>=4]
length(pangrams)
## [1] 171752

We also will not consider words that have more than 7 unique letters since these will be outside the bounds of our game.

## Apply unique number of letters to all words
UniqLets <- sapply(pangrams, unqLet)

## Reduce word list further to include only those with 7 or fewer unique letters
pangrams <- pangrams[UniqLets<=7]

length(pangrams)
## [1] 98141

Since our valid set must not include the letter ‘s’, we should get rid of all words that include an ‘s’.

# A function that returns TRUE if the word contains an 's'
# and FALSE if it does not. 
containS <- function(word){
   return("s" %in% strsplit(word, split="")[[1]])
}

# which of the words contain S?
Swords <- sapply(pangrams, containS)

# keep only those that do not contain S
pangrams <- pangrams[!Swords]
length(pangrams)
## [1] 44585

Letter Distribution

Let’s analyze the unique letters in each word in the list of 98,141 available words to see which letters are used the most often.

let.Dist <- rep(0,26)
names(let.Dist) <- letters

for(word in pangrams){
  lets <- unique(strsplit(word, split = "")[[1]])
  let.Dist[lets] <- let.Dist[lets] + 1
}

let.Dist <- sort(let.Dist, decreasing = TRUE)
let.Dist[1:12]
##     e     a     i     r     n     l     t     o     d     c     g     u 
## 28203 21719 21128 20977 18735 16503 16232 16038 13894 11644 10042  9627

It should not be a surprise to see that the letters e, a, i, r, n, l, and t are used the most often (when s is taken away).
## Points per word

For further analysis, we will need a function that will take a word, the set of available letters, and the center letter, and then return the number of points that word is rewarded.

## Input word as a string,
## set as a vector of 7 letters, 
## and center as the center letter. 
wordPoints <- function(word, set, center){
   # WordL will be the letters with repeats in the word
  WordL <- strsplit(word, split="")[[1]] 
  # UWordL will be the unique letters in the word
  UWordL <- unique(WordL)
  # In order to get any points at all, all the unique letters must be in the set AND
  # the center letter must be one of the unique letters.
  anyPnts <- all(UWordL %in% set) & center%in% UWordL
  if(anyPnts){
     # when the above conditions are satisfied, then there is a bonus if the number
     # of unique letters is 7 (which means it matches all those in the set). Otherwise
     # the points is simply the length of the original word itself. 
    WrdPnts <- ifelse(length(UWordL)==7,7+length(WordL),length(WordL))
    return(WrdPnts)
  }
  else{return(0)}
}

Suppose we use the top 7 letters in the distribution with ‘e’ as the center. Let’s find out how many points that will give us.

TopSet <- names(let.Dist)[1:7]
TopSet 
## [1] "e" "a" "i" "r" "n" "l" "t"
Tot.Word.Points <- sapply(pangrams, 
                          function(x) return(wordPoints(x,TopSet,"e")))
sum(Tot.Word.Points)
## [1] 3166

While this could be the answer, let’s check other possibilities. Out of the top 10 letters used, we’ll form all combinations of 7 letters, and then cycle through each individual letter in the set as a possible center keeping track of the maximum points along with the maximal set.

## This creates all the combinations of 7 letters chosen from the top 10
setMat <- matrix(combn(names(let.Dist)[1:10], 7), ncol=7, byrow=TRUE)

## initial states of empty set and zero points
maxSet <- c()
maxPoints <- 0

## A function that used the wordPoints but applies it to the entire
## pangrams list
Tot_Points <- function(set, cent){
  setPoints <- sapply(pangrams, function(x) return(wordPoints(x, set, cent)))
  return(sum(setPoints))
}

## There are 120 rows and 7 columns in setMat
## To save time on the calculations, we search only the top 3
## letters as centers in each set. 
for(r in 1:120){
  for(c in 1:2){
    t <- Tot_Points(setMat[r,],setMat[r,c])
    if(t>maxPoints){
      maxPoints <- t
      ## Not only do we want the maximal set, but the center
      ## which will be a repeat value tacked on the end. 
      maxSet <- c(setMat[r,],setMat[r,c])
    }
  }
}
maxSet
## [1] "e" "a" "i" "r" "n" "t" "d" "e"
TopSet <- maxSet[-8]
maxPoints
## [1] 3867

Do any words contain all letters from the maximal set?

## Function to return TRUE of a word contains all the letters 
## in the maximal set
containAll <- function(word){
   return(all(TopSet %in% strsplit(word, split = "")[[1]]))
}

## Which words contain all the letters?
Words.Containing.All <- sapply(pangrams, containAll)

## Print these words
pangrams[Words.Containing.All]
##  [1] "andradite"   "antired"     "attainder"   "daintier"    "detainer"   
##  [6] "detrain"     "detrained"   "entertained" "entrained"   "intenerated"
## [11] "intreated"   "irredenta"   "irridenta"   "itinerated"  "nitrated"   
## [16] "reattained"  "reinitiated" "retained"    "retrained"   "trained"

Trusting that enough sets have been checked, the set will most likely be that containing e, a, i, r, n, t, and d giving us 3867 points.