0. INTRODUCTION

Wordle, while not quite as popular as it was in early February, is still played enough to generate more than 150,000 daily Twitter posts by users itching to showcase their 5-letter word-guessing prowess. Within some key boundaries, the rules are straightforward: each game challenges a player to identify a hidden word in as few tries as possible.

As there is only a single Wordle solution each day, sharing results to social media allows players to rank themselves against global competitors, whose results are posted daily by the automated Twitter account @WordleStats. The craze has spawned a flurry of imitators trying to create a similar buzz by challenging your solving abilities across a range of topics. As of late March, the count is up to 18 spin-off versions. Some of the more popular titles include:

  • Quordle: similar to Wordle, but you’re solving four puzzles at once,

  • Worldle: geography themed,

  • Nerdle: math themed,

  • Lewdle: like the original, but all the solutions are NSFW words.

Along with creating clones, there’s been an ever-increasing effort to figure out how to outwit the game by determining the most effective solving strategies. These include identifying the best starter words, letter combos, and multiple processes to eliminate erroneous letters. To this end, statisticians and data science gurus have taken to the web to post pointers on how to be the best wordler you can be. Much of this work applies statistical methods to data sets built from a combination of common word lists and the full Wordle solution set. Dictionaries outside of Wordle are used to determine averages of letter combinations across all known 5-letter words. An alternative is to use the word lists supplied by the game itself. Within the game’s code are both the solution set of 2,315 words and an allowable set of just under 13,000 words. The allowable set contains every 5-letter word that the game will accept, with just over 2,300 of them classified as winning solutions. Using these word lists provides the best opportunity to statistically understand any potential patterns that could lead to more efficient strategies.

One of the more comprehensive articles on the various strategies players use was written by Chris Chow about a month into Wordle mania. Through a literature review of other gamers’ studies, he determined that in addition to having a great starter or “seed” word, the best players used a combination of tactics to play towards multiple goals or definitions of success. Some play to solve it in the fewest guesses, while others focus on gaining information to ensure success by the final round (playing not to lose). He created a gaming bot to simulate various tactics and determined that personal preference was as big a factor in a successful gaming strategy as what letters/words were chosen at the beginning of a round.

The following case study relies on text network analysis techniques in an attempt to identify latent or hidden relationships between Wordle characters. It follows the five steps of the data-intensive workflow process:

  1. Prepare: Prior to analysis, I’ll explain the context from which the data came, formulate some research questions, and introduce the R packages that will enable analysis.

  2. Wrangle: In section 2, I’ll import the Wordle data set taken from Twitter, tidy it, and tokenize it into elements that can be statistically analyzed as well as input into network graphing models.

  3. Analyze: In section 3, I’ll explore the data elements in an effort to describe trends that can shape the tactics in building the graphing models.

  4. Model: I’ll wrap up the analysis in Section 4 by introducing networks of letter combinations. These tools will assist in visualizing relationships between letters and may uncover insights into effective letter or word choice in the game.

  5. Communicate: Finally, I’ll conclude by consolidating artifacts from the analysis to address the research focus from section 1b.


1. PREPARE

1a. Research Context

As a relatively new student of the data science field (and a Wordle addict), I’m throwing my hat into the analytic ring, but I’m handicapping myself a bit. As a gaming purist, I want to do as little damage as possible to the Wordle experience, so I am refraining from digging into the coded word lists and building my analysis on only the games that have been played to date.

1b. Guiding Questions

The goal of this analysis is to determine whether or not text network analysis can be used to describe the latent relationships between letters or letter combinations and thereby indicate those combinations most conducive to solving Wordle puzzles faster.

1c. Load Libraries

library(plotly)
library(tidyverse)
library(tidytext)
library(htmlwidgets)
library(dplyr)
library(here)
library(igraph)
library(ggraph)
library(zoo)
library(wordcloud2)
library(kableExtra)

2. WRANGLE

2a. Import Data

The data for this case study was generated from Kevin O’Connor’s (@gooeyblob) automated @WordleStats account, which I saved to a .csv file for ease of import. The data was collected between 7 January and 23 March 2022 for a total of 76 observations.

wordle_raw <- read_csv(here("data", "wordle.csv"))

wordle_raw %>% 
  kbl() %>% 
  kable_styling() %>% 
  scroll_box(width = "800px", height = "500px")
Date ID Word n Hmode 1 2 3 4 5 6 X
23-Mar 277 PURGE 156785 8555 0.01 0.04 0.22 0.35 0.26 0.11 0.02
22-Mar 276 SLOSH 160161 8807 0.00 0.02 0.19 0.36 0.27 0.13 0.02
21-Mar 275 THEIR 173636 9200 0.02 0.14 0.36 0.30 0.13 0.04 0.00
20-Mar 274 RENEW 154987 8417 0.00 0.04 0.20 0.33 0.27 0.13 0.02
19-Mar 273 ALLOW 156311 8515 0.00 0.05 0.21 0.32 0.26 0.14 0.03
18-Mar 272 SAUTE 179830 9304 0.01 0.08 0.31 0.34 0.19 0.06 0.01
17-Mar 271 MOVIE 169071 8847 0.01 0.05 0.18 0.30 0.26 0.16 0.03
16-Mar 270 CATER 217856 11234 0.01 0.07 0.19 0.22 0.19 0.18 0.15
15-Mar 269 TEASE 202855 10024 0.01 0.16 0.32 0.30 0.16 0.06 0.01
14-Mar 268 SMELT 185406 9373 0.00 0.05 0.19 0.33 0.28 0.13 0.02
13-Mar 267 FOCUS 179436 8937 0.01 0.04 0.23 0.36 0.24 0.10 0.01
12-Mar 266 TODAY 192049 9353 0.01 0.07 0.29 0.35 0.20 0.07 0.01
11-Mar 265 WATCH 226349 12400 0.01 0.06 0.14 0.18 0.17 0.24 0.20
10-Mar 264 LAPSE 208884 9960 0.00 0.08 0.31 0.34 0.19 0.07 0.01
9-Mar 263 MONTH 201799 9435 0.01 0.05 0.26 0.37 0.22 0.08 0.01
8-Mar 262 SWEET 207473 9767 0.01 0.05 0.18 0.31 0.28 0.15 0.02
7-Mar 261 HOARD 218595 9823 0.01 0.09 0.30 0.34 0.19 0.07 0.01
6-Mar 260 CLOTH 218595 9911 0.01 0.08 0.33 0.34 0.17 0.07 0.01
5-Mar 259 BRINE 229895 10405 0.01 0.09 0.25 0.29 0.22 0.12 0.03
4-Mar 258 AHEAD 203730 9396 0.01 0.05 0.20 0.35 0.26 0.12 0.02
3-Mar 257 MOURN 240018 10465 0.01 0.08 0.29 0.34 0.19 0.08 0.01
2-Mar 256 NASTY 257304 10813 0.01 0.07 0.26 0.31 0.21 0.11 0.02
1-Mar 255 RUPEE 240137 10577 0.01 0.02 0.17 0.35 0.30 0.13 0.02
28-Feb 254 CHOKE 251094 10521 0.01 0.08 0.30 0.36 0.18 0.06 0.01
27-Feb 253 CHANT 250413 10438 0.01 0.09 0.33 0.33 0.16 0.07 0.01
26-Feb 252 SPILL 248363 10087 0.01 0.05 0.26 0.34 0.22 0.10 0.02
25-Feb 251 VIVID 255907 11687 0.01 0.02 0.10 0.29 0.33 0.21 0.04
24-Feb 250 BLOKE 250674 10405 0.01 0.06 0.21 0.32 0.25 0.12 0.02
23-Feb 249 TROVE 277576 11411 0.01 0.05 0.16 0.24 0.25 0.22 0.08
22-Feb 248 THORN 309356 11814 0.01 0.14 0.38 0.30 0.12 0.04 0.00
21-Feb 247 OTHER 278731 10887 0.01 0.09 0.26 0.30 0.21 0.10 0.02
20-Feb 246 TACIT 273306 11094 0.01 0.04 0.21 0.32 0.26 0.14 0.03
19-Feb 245 SWILL 282327 11241 0.01 0.01 0.08 0.19 0.31 0.30 0.10
18-Feb 244 DODGE 265238 10220 0.01 0.03 0.15 0.29 0.27 0.19 0.07
17-Feb 243 SHAKE 342003 12767 0.01 0.06 0.16 0.23 0.24 0.21 0.09
16-Feb 242 CAULK 289721 10740 0.01 0.04 0.20 0.31 0.26 0.15 0.03
15-Feb 241 AROMA 287836 10343 0.01 0.06 0.25 0.33 0.22 0.11 0.02
14-Feb 240 CYNIC 261521 10030 0.01 0.02 0.11 0.33 0.34 0.17 0.03
13-Feb 239 ROBIN 277471 9249 0.01 0.06 0.29 0.34 0.21 0.08 0.01
12-Feb 238 ULTRA 269885 9310 0.01 0.07 0.23 0.34 0.24 0.10 0.01
11-Feb 237 ULCER 278826 10631 0.01 0.04 0.18 0.30 0.28 0.16 0.03
10-Feb 236 PAUSE 304830 13480 0.01 0.08 0.26 0.32 0.21 0.10 0.02
9-Feb 235 HUMOR 305372 13846 0.01 0.05 0.22 0.34 0.25 0.11 0.02
8-Feb 234 FRAME 336236 15369 0.01 0.10 0.20 0.24 0.24 0.17 0.03
7-Feb 233 ELDER 288228 13340 0.01 0.03 0.13 0.24 0.30 0.24 0.05
6-Feb 232 SKILL 311018 13716 0.01 0.03 0.17 0.33 0.27 0.16 0.03
5-Feb 231 ALOFT 319698 13708 0.01 0.04 0.22 0.36 0.25 0.11 0.02
4-Feb 230 PLEAT 359679 14813 0.01 0.10 0.28 0.31 0.19 0.09 0.02
3-Feb 229 SHARD 358176 14609 0.01 0.07 0.22 0.28 0.25 0.14 0.04
2-Feb 228 MOIST 361908 14205 0.03 0.13 0.32 0.29 0.16 0.07 0.01
1-Feb 227 THOSE 351663 13606 0.01 0.13 0.34 0.30 0.15 0.06 0.01
31-Jan 226 LIGHT 341314 13347 0.01 0.10 0.25 0.27 0.19 0.12 0.05
30-Jan 225 WRUNG 294687 11524 0.00 0.02 0.18 0.39 0.27 0.12 0.02
29-Jan 224 COULD 313220 11592 0.01 0.07 0.29 0.35 0.20 0.08 0.01
28-Jan 223 PERKY 296968 11148 0.01 0.04 0.17 0.30 0.27 0.17 0.04
27-Jan 222 MOUNT 331844 11451 0.01 0.09 0.29 0.33 0.19 0.07 0.01
26-Jan 221 WHACK 302348 10163 0.01 0.04 0.22 0.37 0.24 0.10 0.02
25-Jan 220 SUGAR 276404 8708 0.01 0.06 0.25 0.34 0.23 0.09 0.01
24-Jan 219 KNOLL 258038 8317 0.01 0.01 0.11 0.29 0.33 0.21 0.04
23-Jan 218 CRIMP 269929 7630 0.01 0.05 0.28 0.38 0.20 0.07 0.01
22-Jan 217 WINCE 241489 6850 0.01 0.03 0.17 0.33 0.29 0.15 0.03
21-Jan 216 PRICK 273727 7409 0.01 0.08 0.30 0.33 0.19 0.07 0.01
20-Jan 215 ROBOT 243964 6589 0.01 0.08 0.29 0.34 0.20 0.08 0.01
19-Jan 214 POINT 280622 7094 0.01 0.16 0.37 0.28 0.12 0.04 0.01
18-Jan 213 PROXY 220950 6206 0.01 0.02 0.11 0.24 0.31 0.26 0.06
17-Jan 212 SHIRE 222197 5640 0.01 0.08 0.32 0.32 0.18 0.08 0.02
16-Jan 211 SOLAR 209609 4955 0.01 0.09 0.32 0.32 0.18 0.07 0.01
15-Jan 210 PANIC 205880 4655 0.01 0.09 0.35 0.34 0.16 0.05 0.01
14-Jan 209 TANGY 169484 3985 0.01 0.04 0.21 0.30 0.24 0.15 0.05
13-Jan 208 ABBEY 132726 3345 0.01 0.02 0.13 0.29 0.31 0.20 0.03
12-Jan 207 FAVOR 137586 3073 0.01 0.04 0.15 0.26 0.29 0.21 0.04
11-Jan 206 DRINK 153880 3017 0.01 0.09 0.35 0.34 0.16 0.05 0.01
10-Jan 205 QUERY 107134 2242 0.01 0.04 0.16 0.30 0.30 0.17 0.02
9-Jan 204 GORGE 91477 1913 0.01 0.03 0.13 0.27 0.30 0.22 0.04
8-Jan 203 CRANK 101503 1763 0.01 0.05 0.23 0.31 0.24 0.14 0.02
7-Jan 202 SLUMP 80630 1362 0.01 0.03 0.23 0.39 0.24 0.09 0.01


This data is still quite raw. To better understand player engagement and results, we must tidy it a bit by doing some computations and modifying a few columns to make plotting simpler.

2b. Tidy Data

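# Convert date column from character to Date class.
# The raw values carry no year (e.g. "23-Mar"), so append 2022 explicitly;
# otherwise as.Date() fills in the current year.
wordle_date <- mutate(wordle_raw, Date = as.Date(paste0(Date, "-2022"), "%d-%b-%Y"))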

# Rename column 'n' to 'Players'
guesses_by_date <- wordle_date %>% 
  rename(Players = n)

# Rename guess variables
names(guesses_by_date)[6:12] <- c("one", "two", "three", "four", "five", "six", "wrong")

# Convert each guess column from a share of players to a player count
wordle_guesses <- guesses_by_date %>% 
  mutate(across(one:wrong, ~ .x * Players))

# Add column for average guess by word
wordle_guesses <- wordle_guesses %>% 
  mutate(avg_guess = (one + 2 * two + 3 * three + 4 * four +
         5 * five + 6 * six + 7 * wrong) / Players) %>% 
  relocate(avg_guess, .before = Players)

# Pivot the guess columns into long format (one row per word and attempt)
wordle_tidy <- wordle_guesses %>% 
  pivot_longer(one:wrong, names_to = "name", values_to = "guesses")

wordle_guesses %>% 
  kbl() %>% 
  kable_styling() %>% 
  scroll_box(width = "800px", height = "500px")
Date ID Word avg_guess Players Hmode one two three four five six wrong
2022-03-23 277 PURGE 4.25 156785 8555 1567.85 6271.40 34492.70 54874.75 40764.10 17246.35 3135.70
2022-03-22 276 SLOSH 4.32 160161 8807 0.00 3203.22 30430.59 57657.96 43243.47 20820.93 3203.22
2022-03-21 275 THEIR 3.47 173636 9200 3472.72 24309.04 62508.96 52090.80 22572.68 6945.44 0.00
2022-03-20 274 RENEW 4.27 154987 8417 0.00 6199.48 30997.40 51145.71 41846.49 20148.31 3099.74
2022-03-19 273 ALLOW 4.36 156311 8515 0.00 7815.55 32825.31 50019.52 40640.86 21883.54 4689.33
2022-03-18 272 SAUTE 3.84 179830 9304 1798.30 14386.40 55747.30 61142.20 34167.70 10789.80 1798.30
2022-03-17 271 MOVIE 4.32 169071 8847 1690.71 8453.55 30432.78 50721.30 43958.46 27051.36 5072.13
2022-03-16 270 CATER 4.68 217856 11234 2178.56 15249.92 41392.64 47928.32 41392.64 39214.08 32678.40
2022-03-15 269 TEASE 3.72 202855 10024 2028.55 32456.80 64913.60 60856.50 32456.80 12171.30 2028.55
2022-03-14 268 SMELT 4.31 185406 9373 0.00 9270.30 35227.14 61183.98 51913.68 24102.78 3708.12
2022-03-13 267 FOCUS 4.09 179436 8937 1794.36 7177.44 41270.28 64596.96 43064.64 17943.60 1794.36
2022-03-12 266 TODAY 3.91 192049 9353 1920.49 13443.43 55694.21 67217.15 38409.80 13443.43 1920.49
2022-03-11 265 WATCH 4.96 226349 12400 2263.49 13580.94 31688.86 40742.82 38479.33 54323.76 45269.80
2022-03-10 264 LAPSE 3.89 208884 9960 0.00 16710.72 64754.04 71020.56 39687.96 14621.88 2088.84
2022-03-09 263 MONTH 4.02 201799 9435 2017.99 10089.95 52467.74 74665.63 44395.78 16143.92 2017.99
2022-03-08 262 SWEET 4.33 207473 9767 2074.73 10373.65 37345.14 64316.63 58092.44 31120.95 4149.46
2022-03-07 261 HOARD 3.89 218595 9823 2185.95 19673.55 65578.50 74322.30 41533.05 15301.65 2185.95
2022-03-06 260 CLOTH 3.86 218595 9911 2185.95 17487.60 72136.35 74322.30 37161.15 15301.65 2185.95
2022-03-05 259 BRINE 4.13 229895 10405 2298.95 20690.55 57473.75 66669.55 50576.90 27587.40 6896.85
2022-03-04 258 AHEAD 4.27 203730 9396 2037.30 10186.50 40746.00 71305.50 52969.80 24447.60 4074.60
2022-03-03 257 MOURN 3.90 240018 10465 2400.18 19201.44 69605.22 81606.12 45603.42 19201.44 2400.18
2022-03-02 256 NASTY 4.02 257304 10813 2573.04 18011.28 66899.04 79764.24 54033.84 28303.44 5146.08
2022-03-01 255 RUPEE 4.38 240137 10577 2401.37 4802.74 40823.29 84047.95 72041.10 31217.81 4802.74
2022-02-28 254 CHOKE 3.84 251094 10521 2510.94 20087.52 75328.20 90393.84 45196.92 15065.64 2510.94
2022-02-27 253 CHANT 3.79 250413 10438 2504.13 22537.17 82636.29 82636.29 40066.08 17528.91 2504.13
2022-02-26 252 SPILL 4.09 248363 10087 2483.63 12418.15 64574.38 84443.42 54639.86 24836.30 4967.26
2022-02-25 251 VIVID 4.70 255907 11687 2559.07 5118.14 25590.70 74213.03 84449.31 53740.47 10236.28
2022-02-24 250 BLOKE 4.15 250674 10405 2506.74 15040.44 52641.54 80215.68 62668.50 30080.88 5013.48
2022-02-23 249 TROVE 4.68 277576 11411 2775.76 13878.80 44412.16 66618.24 69394.00 61066.72 22206.08
2022-02-22 248 THORN 3.47 309356 11814 3093.56 43309.84 117555.28 92806.80 37122.72 12374.24 0.00
2022-02-21 247 OTHER 3.96 278731 10887 2787.31 25085.79 72470.06 83619.30 58533.51 27873.10 5574.62
2022-02-20 246 TACIT 4.35 273306 11094 2733.06 10932.24 57394.26 87457.92 71059.56 38262.84 8199.18
2022-02-19 245 SWILL 5.08 282327 11241 2823.27 2823.27 22586.16 53642.13 87521.37 84698.10 28232.70
2022-02-18 244 DODGE 4.66 265238 10220 2652.38 7957.14 39785.70 76919.02 71614.26 50395.22 18566.66
2022-02-17 243 SHAKE 4.62 342003 12767 3420.03 20520.18 54720.48 78660.69 82080.72 71820.63 30780.27
2022-02-16 242 CAULK 4.34 289721 10740 2897.21 11588.84 57944.20 89813.51 75327.46 43458.15 8691.63
2022-02-15 241 AROMA 4.10 287836 10343 2878.36 17270.16 71959.00 94985.88 63323.92 31661.96 5756.72
2022-02-14 240 CYNIC 4.63 261521 10030 2615.21 5230.42 28767.31 86301.93 88917.14 44458.57 7845.63
2022-02-13 239 ROBIN 3.96 277471 9249 2774.71 16648.26 80466.59 94340.14 58268.91 22197.68 2774.71
2022-02-12 238 ULTRA 4.07 269885 9310 2698.85 18891.95 62073.55 91760.90 64772.40 26988.50 2698.85
2022-02-11 237 ULCER 4.40 278826 10631 2788.26 11153.04 50188.68 83647.80 78071.28 44612.16 8364.78
2022-02-10 236 PAUSE 4.02 304830 13480 3048.30 24386.40 79255.80 97545.60 64014.30 30483.00 6096.60
2022-02-09 235 HUMOR 4.18 305372 13846 3053.72 15268.60 67181.84 103826.48 76343.00 33590.92 6107.44
2022-02-08 234 FRAME 4.20 336236 15369 3362.36 33623.60 67247.20 80696.64 80696.64 57160.12 10087.08
2022-02-07 233 ELDER 4.71 288228 13340 2882.28 8646.84 37469.64 69174.72 86468.40 69174.72 14411.40
2022-02-06 232 SKILL 4.42 311018 13716 3110.18 9330.54 52873.06 102635.94 83974.86 49762.88 9330.54
2022-02-05 231 ALOFT 4.24 319698 13708 3196.98 12787.92 70333.56 115091.28 79924.50 35166.78 6393.96
2022-02-04 230 PLEAT 3.92 359679 14813 3596.79 35967.90 100710.12 111500.49 68339.01 32371.11 7193.58
2022-02-03 229 SHARD 4.30 358176 14609 3581.76 25072.32 78798.72 100289.28 89544.00 50144.64 14327.04
2022-02-02 228 MOIST 3.70 361908 14205 10857.24 47048.04 115810.56 104953.32 57905.28 25333.56 3619.08
2022-02-01 227 THOSE 3.67 351663 13606 3516.63 45716.19 119565.42 105498.90 52749.45 21099.78 3516.63
2022-01-31 226 LIGHT 4.06 341314 13347 3413.14 34131.40 85328.50 92154.78 64849.66 40957.68 17065.70
2022-01-30 225 WRUNG 4.35 294687 11524 0.00 5893.74 53043.66 114927.93 79565.49 35362.44 5893.74
2022-01-29 224 COULD 3.97 313220 11592 3132.20 21925.40 90833.80 109627.00 62644.00 25057.60 3132.20
2022-01-28 223 PERKY 4.45 296968 11148 2969.68 11878.72 50484.56 89090.40 80181.36 50484.56 11878.72
2022-01-27 222 MOUNT 3.82 331844 11451 3318.44 29865.96 96234.76 109508.52 63050.36 23229.08 3318.44
2022-01-26 221 WHACK 4.17 302348 10163 3023.48 12093.92 66516.56 111868.76 72563.52 30234.80 6046.96
2022-01-25 220 SUGAR 4.00 276404 8708 2764.04 16584.24 69101.00 93977.36 63572.92 24876.36 2764.04
2022-01-24 219 KNOLL 4.71 258038 8317 2580.38 2580.38 28384.18 74831.02 85152.54 54187.98 10321.52
2022-01-23 218 CRIMP 3.96 269929 7630 2699.29 13496.45 75580.12 102573.02 53985.80 18895.03 2699.29
2022-01-22 217 WINCE 4.46 241489 6850 2414.89 7244.67 41053.13 79691.37 70031.81 36223.35 7244.67
2022-01-21 216 PRICK 3.83 273727 7409 2737.27 21898.16 82118.10 90329.91 52008.13 19160.89 2737.27
2022-01-20 215 ROBOT 3.95 243964 6589 2439.64 19517.12 70749.56 82947.76 48792.80 19517.12 2439.64
2022-01-19 214 POINT 3.47 280622 7094 2806.22 44899.52 103830.14 78574.16 33674.64 11224.88 2806.22
2022-01-18 213 PROXY 4.87 220950 6206 2209.50 4419.00 24304.50 53028.00 68494.50 57447.00 13257.00
2022-01-17 212 SHIRE 3.93 222197 5640 2221.97 17775.76 71103.04 71103.04 39995.46 17775.76 4443.94
2022-01-16 211 SOLAR 3.82 209609 4955 2096.09 18864.81 67074.88 67074.88 37729.62 14672.63 2096.09
2022-01-15 210 PANIC 3.77 205880 4655 2058.80 18529.20 72058.00 69999.20 32940.80 10294.00 2058.80
2022-01-14 209 TANGY 4.37 169484 3985 1694.84 6779.36 35591.64 50845.20 40676.16 25422.60 8474.20
2022-01-13 208 ABBEY 4.56 132726 3345 1327.26 2654.52 17254.38 38490.54 41145.06 26545.20 3981.78
2022-01-12 207 FAVOR 4.57 137586 3073 1375.86 5503.44 20637.90 35772.36 39899.94 28893.06 5503.44
2022-01-11 206 DRINK 3.77 153880 3017 1538.80 13849.20 53858.00 52319.20 24620.80 7694.00 1538.80
2022-01-10 205 QUERY 4.43 107134 2242 1071.34 4285.36 17141.44 32140.20 32140.20 18212.78 2142.68
2022-01-09 204 GORGE 4.64 91477 1913 914.77 2744.31 11892.01 24698.79 27443.10 20124.94 3659.08
2022-01-08 203 CRANK 4.22 101503 1763 1015.03 5075.15 23345.69 31465.93 24360.72 14210.42 2030.06
2022-01-07 202 SLUMP 4.13 80630 1362 806.30 2418.90 18544.90 31445.70 19351.20 7256.70 806.30


Note that this data describes only those games that were shared on Twitter. The Players column captures the number of players sharing their results on that day. While not reported anywhere officially, the total number of Wordle players is thought to be roughly 15 times the number shared to social media. Hmode counts those players who chose to play in hard mode, which forces players to keep using letters that have been correctly identified. In normal mode, players can enter any valid word, which is helpful if your preference is to eliminate letters in an information-gathering strategy. The columns one through six give the number of players who solved the puzzle on attempts 1 through 6 (the reported share multiplied by Players). The column wrong counts those players who failed to guess the correct word by the 6th attempt.
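As a rough illustration of how these columns relate, the base-R sketch below takes the 23 March (PURGE) row from the raw table and derives the hard-mode share, an estimated total player count, and the average guess. The 15x multiplier is the unofficial estimate mentioned above, and the object names are illustrative rather than part of the analysis pipeline.

```r
# Values copied from the 23 March (PURGE) row of the raw table
players <- 156785                      # results shared on Twitter
hmode   <- 8555                        # of which were played in hard mode
shares  <- c(one = 0.01, two = 0.04, three = 0.22, four = 0.35,
             five = 0.26, six = 0.11, wrong = 0.02)

# Share of reporting players who used hard mode
hard_share <- hmode / players

# Assumption: total players are roughly 15x the Twitter-reported count
est_total_players <- players * 15

# Average guess, counting a failed game as a 7th attempt
avg_guess <- sum(seq_along(shares) * shares)

round(hard_share, 3)
est_total_players
avg_guess
```

The average guess works out to 4.25, which matches the avg_guess value for PURGE in the tidied table. Because the published shares are rounded to two decimals, they do not always sum to exactly 1.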

2c. Tokenize Words

The final step in Section 2 involves breaking the words into individual characters to enable network analysis in Section 4. This tokenization will isolate individual letters (unigrams) and/or group them into bigrams (letter pairs) to explore latent relationships.

# Reduce dataframe to only answers
wordle_answers <- wordle_raw %>%
  select(ID, Word)
  
# Tokenize characters-unigrams
wordle_unigram <- wordle_answers %>%   
  unnest_tokens(letter, Word, token = "characters", to_lower = FALSE)

unigram_top_tokens <- wordle_unigram %>% 
  count(letter, sort = TRUE) %>% 
  top_n(26)

unigram_top_tokens %>% 
  kbl() %>% 
  kable_styling(fixed_thead = T) %>% 
  kable_paper() %>% 
  scroll_box(width = "25%", height = "250px")
letter n
E 36
R 33
A 31
O 31
T 27
L 25
S 21
I 20
C 18
H 18
N 18
U 16
P 13
K 11
M 11
D 10
G 8
W 8
Y 8
B 6
V 5
F 4
Q 1
X 1
# Tokenize characters-bigrams
wordle_bigram <- wordle_answers %>%   
  unnest_tokens(bigram, Word, token = "character_shingles", n = 2) 
  
# unnest_tokens() lowercases by default, so restore upper case
wordle_bigram$bigram <- toupper(wordle_bigram$bigram)

bigram_top_tokens <- wordle_bigram %>% 
  count(bigram, sort = TRUE) %>% 
  top_n(31)

bigram_top_tokens %>% 
  kbl() %>% 
  kable_styling(fixed_thead = T) %>% 
  kable_paper() %>% 
  scroll_box(width = "25%", height = "250px")
bigram n
ER 6
MO 6
TH 6
AR 5
IN 5
LL 5
LO 5
RO 5
AN 4
HA 4
HO 4
NT 4
OR 4
RI 4
SE 4
SH 4
UL 4
AT 3
AU 3
CH 3
EA 3
GE 3
HE 3
IC 3
IL 3
KE 3
OT 3
OU 3
RA 3
TE 3
VI 3
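
To make the tokenization concrete, here is a minimal base-R sketch of what the unigram and bigram steps produce for a single answer. It is not the tidytext call used above, only a hand-rolled equivalent for one word, and the variable names are illustrative.

```r
# Split one answer into letters (unigrams) and adjacent letter pairs (bigrams)
word <- "PURGE"

unigrams <- strsplit(word, "")[[1]]

# Pair each letter with the one that follows it
bigrams <- paste0(head(unigrams, -1), tail(unigrams, -1))

unigrams
bigrams
```

Each 5-letter answer yields five unigrams and four bigrams, so the 76 answers contribute 380 letters in total, which is what the unigram counts above sum to.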

3. ANALYZE

3a. Player and Score Descriptors

Since one of our goals is to understand which letter or combinations of letters lead to better outcomes, tidying the data allows us to explore player engagement at the word and attempt level.

wordle_tidy$name <- factor(wordle_tidy$name, 
                           levels = c("one",
                                      "two",
                                      "three",
                                      "four",
                                      "five",
                                      "six",
                                      "wrong"))

wordle_tidy %>%
  ggplot(aes(fill=name, y=guesses, x=Date)) + 
  geom_bar(position="stack", stat="identity") +
  labs(subtitle = "Twitter-Reported Wordle Scores by Date", x = "", y = "Players") +
  scale_fill_discrete(name = "Guesses") +
  scale_y_continuous(labels = scales::comma)

Wordle’s moment in the sun may be waning, but there are still over 150,000 players posting their scores to Twitter on a daily basis. If shared results represent roughly one in fifteen players, as estimated in Section 2b, that implies upwards of 2 million players globally. Wordle was purchased by the New York Times (NYT) at the end of January, which coincided with the beginning of a decline in player interest. That is probably less about the NYT and more about the nature of game fads.

A rumor circulated that the NYT had made the game more difficult, causing players to lose interest. Let’s see what the data says (hover over each column to see that day’s word and score).

scores_bar_graph <- wordle_guesses %>% 
  ggplot(aes(y= avg_guess, x = Date)) +
  geom_col(aes(text = Word), fill = "pink") +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  geom_ribbon(stat = "smooth", method = "lm", alpha = .15) +
  scale_y_continuous(labels = scales::comma) +
  coord_cartesian(ylim = c(3,5)) +
  labs(title = "Wordle Averages by Date", x = "", y = "Average Score")

ggplotly(scores_bar_graph, tooltip = c("text", "Date", "avg_guess"))

When we take all scores into account from early January through 23 March, the trend line does not imply an increase in difficulty. In fact, if you hover over the ends of the trend line, you can see that the most recent fitted average, 4.16, is slightly lower than the fitted average in early January, 4.20. That is fairly flat and indicates no real change in difficulty level for this data set. But with all those peaks and troughs, a rolling average is probably a better way to capture short-term effects that could have shaped how players perceived game difficulty.

rolling_bar_graph <- wordle_guesses %>% 
  # Centered 5-day rolling mean; NA (not 0) where a full window is unavailable
  mutate(five_day_avg = rollmean(avg_guess, 5,
                                 align="center",
                                 fill=NA)) %>%
  ggplot(aes(y= avg_guess, x = Date)) +
  geom_col(aes(text = Word), fill = "pink") +
  geom_line(aes(y = five_day_avg),
            color = "red") +
  scale_y_continuous(labels = scales::comma) +
  coord_cartesian(ylim = c(3,5)) +
  labs(title = "5-Day Rolling Wordle Average", x = "", y = "Average Score")

ggplotly(rolling_bar_graph, tooltip = c("text", "Date", "avg_guess"))

This tells a very different story, though the ending is probably the same. Again, there are numerous peaks and dips that, over time, average out to a fairly flat trend. But depending on when you started playing, your experience could vary greatly from someone else’s. If we focus on the time around the NYT purchase of the game, you can see a steady increase in difficulty from the beginning of February until about the 18th. For anyone paying attention, it would absolutely look and feel like the game got harder for about 2 weeks after the NYT took the reins. The game then re-balanced over the following few weeks, returning to the flat trend seen over the course of the data set.
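
For readers who want to check the mid-February peak by hand, the sketch below computes a centered 5-day moving average in base R for the week of 15 to 21 February, using avg_guess values copied from the tidied table. It uses stats::filter() as a stand-in for zoo::rollmean(), and the object names are illustrative.

```r
# Average guesses for 15-21 February 2022, copied from the tidied table
avg_guess <- c(4.10, 4.34, 4.62, 4.66, 5.08, 4.35, 3.96)

# Centered 5-day moving average; positions without a full window stay NA
rolling_avg <- as.numeric(stats::filter(avg_guess, rep(1 / 5, 5), sides = 2))

round(rolling_avg, 2)
```

Only the three middle days (17 to 19 February) have a full window here, and their rolling averages come out at 4.56, 4.61, and 4.53, well above the overall trend of roughly 4.2.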

As it turns out, the NYT did nothing to make the game any more difficult. In fact, they removed solution words they felt were too obscure and could make the game less fun to play. The vast majority of original solution words are still in play in the same order the original creator laid out.

3b. Letter Frequencies

Prior to applying text networking models to the data, it may be informative to better understand key letters and letter combinations that could influence Wordle guess choice. Logic dictates that the most common letters (or combinations thereof) would make good options for early word selection. For this last piece of exploratory analysis, I’ll examine single letters (unigrams) and letter pairs (bigrams).

unigram_top_tokens <- wordle_unigram %>% 
  count(letter, sort = TRUE) %>% 
  top_n(26)

top_unigrams <- unigram_top_tokens %>% 
  mutate(freq = round(n / sum(n), 3)) %>% 
  arrange(desc(freq)) 

top_unigrams %>% 
  ggplot(aes(x = reorder(letter, -n), y = freq)) +
  geom_bar(stat = "identity", fill = "palegreen") +
  scale_y_continuous(breaks = c(0, .02, .04, .06, .08, .10)) +
  labs(title = "Top Wordle Letters (Unigrams)", subtitle = "7 Jan - 23 Mar '22", 
       y = "Frequency of Appearance", x = "Letter")

The top five most used letters to date are E, R, O, A, and T. Odds are that if you use those letters early, you will see some success identifying usable clues. However, very few actual words can be formed from exactly those five letters, ORATE and ROATE among them. I highly doubt that ROATE is one of the solution words, so this significantly limits a player’s ability to guess right on the first attempt. Let’s try this on bigrams to see if we uncover any additional clues.
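
One way to turn these letter counts into a rough ranking of opening words is to sum the counts of each word’s distinct letters. The base-R sketch below does that with the counts from the unigram table. The scoring rule, the function name score_word, and the candidate list are all illustrative assumptions, not a method used elsewhere in this case study.

```r
# Observed letter counts from the unigram table (letters not listed were never seen)
letter_n <- c(E = 36, R = 33, A = 31, O = 31, T = 27, L = 25, S = 21, I = 20,
              C = 18, H = 18, N = 18, U = 16, P = 13, K = 11, M = 11, D = 10,
              G = 8, W = 8, Y = 8, B = 6, V = 5, F = 4, Q = 1, X = 1)

# Score a word by summing the counts of its distinct letters
score_word <- function(word) {
  letters_in_word <- unique(strsplit(toupper(word), "")[[1]])
  counts <- letter_n[letters_in_word]
  sum(counts, na.rm = TRUE)   # unseen letters (e.g. J, Z) contribute 0
}

# Hypothetical candidate openers, chosen only for illustration
candidates <- c("ORATE", "EARTH", "SLATE", "QUERY")
sort(sapply(candidates, score_word), decreasing = TRUE)
```

Under this heuristic ORATE scores 158, ahead of EARTH (145), SLATE (140), and QUERY (94). The score ignores letter position and says nothing about whether a word is a likely solution.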

bigram_top_tokens <- wordle_bigram %>% 
  count(bigram, sort = TRUE) %>% 
  top_n(31)

top_bigrams <- bigram_top_tokens %>% 
  mutate(freq = round(n / sum(n), 3)) %>% 
  arrange(desc(freq)) 

top_bigrams %>% 
  ggplot(aes(x = reorder(bigram, -n), y = freq)) +
  geom_bar(stat = "identity", fill = "palegreen") +
  scale_y_continuous(breaks = c(0, .02, .04, .06, .08, .10)) +
  labs(title = "Top Wordle Letter Pairs (Bigrams)", subtitle = "7 Jan - 23 Mar '22",
       y = "Frequency of Appearance", x = "Letter Pair")

The top three bigrams (ER, MO, and TH) are recognized as fairly common beginning or ending parts of words. Again, this makes sense, but as they’ve only been seen 6 times each out of 76 games, it doesn’t appear that bigrams alone are the key to victory in the first couple of guesses either.

Now that we’ve explored individual letters and bigram relationships mathematically, let’s transition to more visual representations. Section 4 takes this same data and experiments with network visuals in an attempt to identify latent relationships that standard statistical charts struggle to reveal.


4. MODEL

To visualize relationships (edges) between Wordle letters (nodes), we’ll need three pieces of information:

  • from: the letter an edge is coming from

  • to: the letter an edge is going towards

  • weight: a numeric value associated with each edge
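
As a small illustration of the from, to, and weight columns, the base-R sketch below builds an edge list by hand from five answers in the data set. The words were picked only to keep the example short, and the object names are illustrative.

```r
# A handful of answers from the data set, used only to keep the example small
words <- c("THORN", "OTHER", "THOSE", "MOUNT", "MONTH")

# Split each word into adjacent letter pairs: one row per directed edge
pairs <- do.call(rbind, lapply(words, function(w) {
  ch <- strsplit(w, "")[[1]]
  data.frame(from = head(ch, -1), to = tail(ch, -1))
}))

# Count identical pairs to get the edge weight
edges <- aggregate(weight ~ from + to,
                   data = cbind(pairs, weight = 1), FUN = sum)
edges[order(-edges$weight), ]
```

In this small sample the edge from T to H has a weight of 4, while most other edges appear once. A data frame of this shape can be passed to igraph’s graph_from_data_frame(), which reads the first two columns as edge endpoints and any further columns as edge attributes.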

4a. Transform Frequency Counts into Network Data

We need to transform our dataset (wordle_bigram) into these variables in the following way: from is the “letter1”, to is the “letter2”, and weight is “n”.

The function graph_from_data_frame enables the transformation:

# Separate bigrams
bigram_separated <- wordle_bigram %>%
  separate(bigram, c("letter1", "letter2"), sep = 1)

# Count Bigrams
bigram_counts <- bigram_separated %>% 
  count(letter1, letter2, sort = TRUE)

# Create graph
bigram_graph <- bigram_counts %>%
  graph_from_data_frame()

4b. Visualize Bigram Network

set.seed(100)
bigram_graph |>
  ggraph(layout = "stress") + 
  geom_node_text(aes(label = name)) +
  geom_edge_link(aes(edge_alpha = n, start_cap = circle(2, 'mm'), end_cap = circle(2, 'mm')), 
                 arrow = arrow(length = unit(2, 'mm')), color = "blue") + 
  theme_graph() +
  labs(title = "Wordle Letter Pairs", edge_alpha = "#Connections")

Line weights are shaded to emphasize the number of connections between letters. You’ll see the top five letters (E, R, O, A, and T) from the frequency chart still play a prominent role in this visual, but a few more connections are also easily seen with this type of chart. The letters N, H, L, M, and I now appear to be quite an important part of the network. If we filter out some of the least common relationships, we can zero in on the most numerous letter pairs:

bigram_graph_filtered <- bigram_counts %>%
  filter(n > 2) %>%
  graph_from_data_frame()

set.seed(200)
bigram_graph_filtered |>
  ggraph(layout = "fr") + 
  geom_node_text(aes(label = name)) +
  geom_edge_link(aes(edge_alpha = n, start_cap = circle(2, 'mm'), end_cap = circle(2, 'mm')), 
                 arrow = arrow(length = unit(2, 'mm')), color = "red") + 
  theme_graph() +
  labs(title = "Most Common Wordle Letter Pairs", edge_alpha = "#Connections")

This relationship visual highlights letters and letter combinations that were not apparent in earlier charts. The letters H, N, and M, for example, don’t appear in our frequency chart until roughly the tenth position or later.
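
The visual impression can be backed up with a simple count. The sketch below, a rough check rather than a formal centrality analysis, tallies how many of the frequent letter pairs (those seen more than twice, copied from the bigram table) touch each letter. The names degree and strength are used loosely and the code is base R, not igraph.

```r
# Bigrams seen more than twice, with counts copied from the bigram table
bigram_n <- c(ER = 6, MO = 6, TH = 6, AR = 5, IN = 5, LL = 5, LO = 5, RO = 5,
              AN = 4, HA = 4, HO = 4, NT = 4, OR = 4, RI = 4, SE = 4, SH = 4,
              UL = 4, AT = 3, AU = 3, CH = 3, EA = 3, GE = 3, HE = 3, IC = 3,
              IL = 3, KE = 3, OT = 3, OU = 3, RA = 3, TE = 3, VI = 3)

from <- substr(names(bigram_n), 1, 1)
to   <- substr(names(bigram_n), 2, 2)

# Degree: number of edge endpoints touching each letter (LL counts twice for L)
degree <- table(c(from, to))

# Strength: the same tally weighted by how often each pair occurred
strength <- tapply(rep(bigram_n, 2), c(from, to), sum)

sort(degree, decreasing = TRUE)
sort(strength, decreasing = TRUE)
```

H touches six of the 31 frequent pairs, as many as R and only one fewer than E, A, and O, even though it sits outside the top eight letters by raw count.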


5. COMMUNICATE

While letter counts and frequency analysis give insights into the most common letters, they fall short in describing the key letter pairings that lead to effective word choice for early guessing.

Frequency analysis identified the top five most common letters as E, R, A, O, and T and the most common bigrams as ER, MO, and TH. According to the Scrabble dictionary, 22 words can be built by combining these results. After removing uncommon words and words with repeated letters, 10 potential starter words remain:

earth, harem, hater, heart, homer, other, metro, tamer, torah, and orate.
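
The filtering step described above can be sketched in a few lines of base R. The letter pool combines the top five letters with the extra letters from the top bigrams (M and H). The short dictionary vector is a hypothetical stand-in for a real Scrabble word list, and the function name is_candidate is illustrative; the exact rules used to arrive at the counts of 22 and 10 are not reproduced here.

```r
# Letter pool: top five unigrams plus the extra letters in the top bigrams (MO, TH)
pool <- c("E", "R", "A", "O", "T", "M", "H")

# Keep a word only if it has 5 letters, no repeats, and uses only pool letters
is_candidate <- function(word) {
  ch <- strsplit(toupper(word), "")[[1]]
  length(ch) == 5 && !anyDuplicated(ch) && all(ch %in% pool)
}

# Hypothetical word list standing in for a dictionary lookup
dictionary <- c("earth", "otter", "metro", "trace", "torah", "tamer")
Filter(is_candidate, dictionary)
```

Here "otter" is dropped for its repeated T and "trace" for the letter C, leaving earth, metro, torah, and tamer.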

Network analysis, on the other hand, identified letter combinations above and beyond those surfaced by the frequency counts. Adding the letters L, I, and N grows the potential starter word list to 72.

Conclusion

This case study was designed to answer two questions:

  • Can text network analysis identify latent relationships between Wordle letters?

  • Can those identified relationships lead to more effective Wordle solutions?

The first question can be answered in the affirmative, as the network visuals aided in identifying letter relationships that were not apparent in the frequency analysis at either the unigram or bigram level. Those additional combinations increased the number of potential starter words roughly sevenfold (from 10 to 72).

The second question, however, is a little more difficult to answer, and I believe the case study needs to be expanded to address the aspect of efficiency. Just having more potential words doesn’t necessarily equate to solving the puzzle quicker. One aspect of the game this analysis did not address was letter positioning. Letter frequency analysis focused on how often letters appeared regardless of position within a word. An additional level of analysis could explore positions one through five to determine where each letter appears most often, thereby reducing the potential word list to those terms that most closely match observed letter placement.
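
As a starting point for that positional analysis, the base-R sketch below cross-tabulates letters against their position for the ten most recent answers in the data set. It is only an outline of the idea on a small sample, and the object names are illustrative.

```r
# The ten most recent answers in the data set (14-23 March 2022)
words <- c("PURGE", "SLOSH", "THEIR", "RENEW", "ALLOW",
           "SAUTE", "MOVIE", "CATER", "TEASE", "SMELT")

# One row per word, one column per letter position (1 to 5)
letters_by_pos <- do.call(rbind, strsplit(words, ""))

# Cross-tabulate letter against position
pos_counts <- table(letter   = as.vector(letters_by_pos),
                    position = as.vector(col(letters_by_pos)))

pos_counts["E", ]
pos_counts["S", ]
```

Even in this small sample, position matters: E never opens a word but closes four of the ten, while S appears first in three words and fourth in two.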

Lastly, this case study has specific limitations stemming from the data set. My goal was to understand how text networks could be applied at the character level rather than understanding numerical absolutes. As a result, I specifically chose to remain within the puzzles that had already been solved so as not to spoil the game for those that may still be playing. This method restricted my word list and therefore my letter counts significantly. I also didn’t use any external lexicons or dictionaries to determine “average” letter or bigram frequencies.

References

Bernoff, J. (2022, January 20). A mathematical analysis of the best first guess for Wordle. Without Bullshit. Retrieved March 25, 2022, from https://withoutbullshit.com/blog/a-mathematical-analysis-of-the-best-first-guess-for-wordle

Chow, C. (2022, February 11). Loaded words in Wordle. Medium. Retrieved March 25, 2022, from https://towardsdatascience.com/loaded-words-in-wordle-e78cb36f1e3c#:~:text=In%20Wordle%2C%20there%20are%202%2C315,(%E2%80%9Csupport%20words%E2%80%9D).

Frias, J. (2022, March 23). Forget luck: Optimized wordle strategy. Medium. Retrieved March 26, 2022, from https://betterprogramming.pub/forget-luck-optimized-wordle-strategy-using-bigquery-c676771e316f

Gupta, R. (2022, January 25). WORDLE-Vision: Simple Analytics to up your Wordle Game. Medium. Retrieved March 25, 2022, from https://towardsdatascience.com/wordle-vision-simple-analytics-to-up-your-wordle-game-65daf4f1aa6f

Hinton, L. (2022, March 27). Wordle word answer - what’s the Wordle today? (March 27). Gfinity Esports. Retrieved March 25, 2022, from https://www.gfinityesports.com/wordle/answer-list/

Lesser, R. (2022, March 8). Wordle, 15 million tweets later. Observable. Retrieved March 25, 2022, from https://observablehq.com/@rlesser/wordle-twitter-exploration

The New York Times. (n.d.). Wordle - a daily word game. The New York Times. Retrieved March 23, 2022, from https://www.nytimes.com/games/wordle/index.html