The Goal

Each day, a new five letter word is the day’s Wordle. Thanks to The Riddler’s fascination with the Riddler, the Riddler universe has access to a list of words that we are allowed to guess along with a list of words that could be the mystery Wordle word.

Let’s read in the mystery words and the words we can use to guess. Thanks Jason Ash for the text files online!

mysteryWords <- url("https://jtash.vercel.app/wordle-solutions.txt")
guessWords <- url("https://jtash.vercel.app/wordle-guesses.txt")

Mystery <- readLines(mysteryWords, warn = FALSE)
Guess <- readLines(guessWords, warn = FALSE)

Distribution of Letters in Mystery Words

To come up with a good first guess word, we’ll do a quick analysis of the distribution of letters in the 1st position, 2nd, and so on.

letterDist <- rep(0, 26)
names(letterDist) <- letters

## this function takes a word list and a number between 1 and 5 and then
## creates a distribution of letters for that given spot

distribute <- function(wordList, letNum){
   letterDist <- rep(0, 26)
   names(letterDist) <- letters
   k <- length(wordList)
   for(i in 1:k){
      x <- strsplit(wordList[i], split="")[[1]][letNum]
      letterDist[x] <- letterDist[x]+1
   }
   return(letterDist)
}

(letter1Dist <- distribute(Mystery, 1))
##   a   b   c   d   e   f   g   h   i   j   k   l   m   n   o   p   q   r   s   t 
## 141 173 198 111  72 136 115  69  34  20  20  88 107  37  41 142  23 105 366 149 
##   u   v   w   x   y   z 
##  33  43  83   0   6   3
(letter2Dist <- distribute(Mystery, 2))
##   a   b   c   d   e   f   g   h   i   j   k   l   m   n   o   p   q   r   s   t 
## 304  16  40  20 242   8  12 144 202   2  10 201  38  87 279  61   5 267  16  77 
##   u   v   w   x   y   z 
## 186  15  44  14  23   2
(letter3Dist <- distribute(Mystery, 3))
##   a   b   c   d   e   f   g   h   i   j   k   l   m   n   o   p   q   r   s   t 
## 307  57  56  75 177  25  67   9 266   3  12 112  61 139 244  58   1 163  80 111 
##   u   v   w   x   y   z 
## 165  49  26  12  29  11
(letter4Dist <- distribute(Mystery, 4))
##   a   b   c   d   e   f   g   h   i   j   k   l   m   n   o   p   q   r   s   t 
## 163  24 152  69 318  35  76  28 158   2  55 162  68 182 132  50   0 152 171 139 
##   u   v   w   x   y   z 
##  82  46  25   3   3  20
(letter5Dist <- distribute(Mystery, 5))
##   a   b   c   d   e   f   g   h   i   j   k   l   m   n   o   p   q   r   s   t 
##  64  11  31 118 424  26  41 139  11   0 113 156  42 130  58  56   0 212  36 253 
##   u   v   w   x   y   z 
##   1   0  17   8 364   4

We see that ‘s’ is the most popular first letter while ‘a’ is the most popular 2nd and 3rd letters, ‘e’ the most popular 4th and 5th letters. If we look at the next most popular letters in the 3rd and 4th spots, we find ‘i’ and ‘n’, respectively. It turns out ‘saine’ is a guessable word which is an archaic term meaning to make the sign of the cross over oneself to protect from evil.

To illustrate the process of guessing the Wordle, I will walk through my guessing procedure for 1/17/2022. My first guess was ‘saine’, and it says that the ‘s’, ‘i’, and ‘e’ are all correct AND in the correct places while the ‘a’ and the ‘n’ are not in the word.

## A function that takes a word as input along with a vector of letters, and returns TRUE if any of the letters are in the word, False otherwise.  This will be used to remove words.
remWords <- function(x, r){
   return(!any(r %in% x))
}

## A function that takes a word and vector of letters as input and returns TRUE if all the letters are in the word, and FALSE otherwise. Will be used to find words to keep. 
keepWords <- function(x, k){
   return(all(k %in% x))
}

## A function that takes a word, a letter, and a position as input and returns TRUE if the word has that letter in that position and FALSE otherwise. 
exactSpot <- function(x, e, pos){
   return(x[pos] == e)
}

## A function that takes a word, a letter, and a position as input and returns TRUE if the word does not have that letter in that position. 
notExactSpot <- function(x, e, pos){
   return(x[pos] != e)
}

## inputs gw = guess word, MystDist is the distribution of Mystery words, and then px is 0 if letter in position x is not included, 1 if letter in position x is included but not in that position, and 2 if letter in position x is included and it the correct position. 

filterList <- function(gW, MystDist, p1, p2, p3, p4, p5){
   Mdist <- MystDist
   gLets <- strsplit(gW, "")[[1]]
   P <- c(p1, p2, p3, p4, p5)
   i <- 1
   for(p in P){
      if(p==0){
         Mdist <- Mdist[sapply(strsplit(Mdist, ""), 
                               function(x) remWords(x, gLets[i]))]
      }
      if(p==1){
         Mdist <- Mdist[sapply(strsplit(Mdist, ""), 
                               function(x) keepWords(x, gLets[i]))]
         Mdist <- Mdist[sapply(strsplit(Mdist, ""), 
                               function(x) notExactSpot(x, gLets[i],i))]
      }
      if(p==2){
         Mdist <- Mdist[sapply(strsplit(Mdist, ""), 
                               function(x) exactSpot(x, gLets[i],i))]
      }
      i <- i+1
   }
   return(Mdist)
}

Using the function now for the wordle on 1/17/2022, I input ‘saine’, the Mystery word list, and then 2,0,2,0,2.

(MystRnd2 <- filterList("saine", Mystery, 2, 0, 2, 0, 2))
##  [1] "seize" "shire" "slice" "slide" "slime" "smile" "smite" "spice" "spike"
## [10] "spire" "spite" "suite"

Guess 2

I would like to analyze each of these words as potential next guess words. If I were to have ‘seize’ as my guess word, what would be the maximum sized remaining third round distribution of words? Then, I choose the minimum over all guess words.

listFilter <- function(Mword, Gword, potentialWords){
   Glets <- strsplit(Gword, "")[[1]]
   Mlets <- strsplit(Mword, "")[[1]]
   notInLets <- unique(Glets[!Glets %in% Mlets])
   InLets <- unique(Glets[Glets %in% Mlets])
   exactLets <- Glets[Glets == Mlets]
   notexactLets <- setdiff(InLets, exactLets)
   # Refine the word distribution according to the rules
   NewDist <- potentialWords[sapply(strsplit(potentialWords, split=""), function(x) remWords(x, notInLets))]
   NewDist <- NewDist[sapply(strsplit(NewDist, split=""), function(x) keepWords(x, InLets))]
   if(length(exactLets)>0){
      for(a in exactLets){
         for(a1 in which(a==Mlets)){
            NewDist <- NewDist[sapply(strsplit(NewDist, split=""), function(x) exactSpot(x, a, a1))]
         }
      }
   }
   if(length(notexactLets)>0){
      for(a in notexactLets){
         NewDist <- NewDist[sapply(strsplit(NewDist, split=""), function(x) notExactSpot(x, a, which(a==Glets)))]
      }
   }
   return(NewDist)
}

nextGuess <- function(GuessList, MysteryList){
   L <- length(MysteryList)
   for(g in GuessList){
      M <- 1
      for(m in MysteryList){
         gL <- listFilter(m, g, MysteryList)
         if(length(gL)>M){
            M <- length(gL)
         }
      }
      if(M <= L){
         if(M<L){G <- g}
         else{ G <- c(G,g)}
         L <- M
      }
   }
   return(list(G, L))
}

nextGuess(MystRnd2, MystRnd2)
## [[1]]
## [1] "spite"
## 
## [[2]]
## [1] 6

If this had given me a 1 or 2, I would have considered it. However, seeing that I could reasonably have a 1/6th chance of getting it on the third try I will instead analyze the distribution of the second and fourth letters in these words to come up with a better guess.

(letter2Dist_r2 <- distribute(MystRnd2,2))
## a b c d e f g h i j k l m n o p q r s t u v w x y z 
## 0 0 0 0 1 0 0 1 0 0 0 3 2 0 0 4 0 0 0 0 1 0 0 0 0 0
(letter4Dist_r2 <- distribute(MystRnd2,4))
## a b c d e f g h i j k l m n o p q r s t u v w x y z 
## 0 0 2 1 0 0 0 0 0 0 1 1 1 0 0 0 0 2 0 3 0 0 0 0 0 1

Using the above information, I will filter the guess list down to words that do not have any of the letters that have a zero above.

Gref <- Guess[sapply(strsplit(Guess, split=""), function(x) remWords(x, c("a", "b", "f", "g", "i", "j", "n", "o","q","s", "v", "w", "x", "y")))]

length(Gref)
## [1] 247

With 247 guess words to work with, let’s see what happens if we use the words in this list to narrow down the mystery word.

nextGuess(Gref, MystRnd2)
## [[1]]
## [1] "clept" "crump"
## 
## [[2]]
## [1] 2

By using either ‘clept’ or ‘crump’ as our second guess, the maximum number of mystery words left will be two. Although we have zero possibility of getting it on the second guess, we have a much higher probability of guessing the mystery word on the third.

I’ve guessed ‘clept’ as my second round and the result is that there is no ‘c’, ‘l’, ‘p’, or ‘t’ in the word. We already know that ‘e’ is in the last spot. I can only cross my fingers and hope that the filter will reduce it list down to one.

Let’s use the function to eliminate further.

(MystRnd3 <- filterList("clept", MystRnd2, 0, 0, 1, 0, 0))
## [1] "seize" "shire"

Well, darn. I have two to choose from. Going with the first, I ended up being wrong. In hindsight, had I guessed ‘clump’ on the second try I would have got it on the third!

filterList("crump", MystRnd2, 0, 1, 0, 0, 0)
## [1] "shire"

This is the third time I’ve used this strategy, and there are times in which using one of the 2nd round mystery words is more beneficial than using a guess word. I have yet to automate the middle step in which we select whether the 2nd guess should come from the Mystery or Guess list, and as such, calculating the probability of getting it in three or less would be pretty difficult.