Unit 4 Independent Analysis: Can Text Networks Predict Wordle Success?

0. INTRODUCTION

Wordle, while not quite as popular as it was in early February, is still played enough to generate more than 150,000 daily Twitter posts from users itching to showcase their five-letter word-guessing prowess. The rules are straightforward: each game challenges a player to identify a hidden word in as few tries as possible, up to a maximum of six.

As there is only a single Wordle solution each day, sharing results to social media allows players to rank themselves against global competitors whose results are posted daily by the automated Twitter account, @WordleStats. The craze has spawned a flurry of imitators trying to create a similar buzz by challenging players’ solving abilities across a range of topics. As of late March, the count is up to 18 spin-off versions. Some of the more popular titles include:

Quordle: similar to Wordle, but you’re solving four puzzles at once,
Worldle: geography themed,
Nerdle: math themed,
Lewdle: like the original, but all the solutions are NSFW words.

Along with creating clones, there’s been an ever-increasing effort to figure out how to outwit the game by determining the most effective solving strategies. These include identifying the best starter words, the best letter combinations, and various processes for eliminating erroneous letters. To this end, statisticians and data science gurus have taken to the web to post pointers on how to be the best wordler you can be. Much of this work applies statistical methods to data sets built from a combination of common word lists and the full Wordle solution set. Dictionaries outside of Wordle are used to determine averages of letter combinations across all known 5-letter words. An alternative is to use the word lists supplied by the game itself. The game’s code contains both the solution set of 2,315 words and an allowable set of just under 13,000 words. The allowable words are all the 5-letter words that the game will accept, with just over 2,300 of them classified as winning solutions. Using these word lists provides the best opportunity to statistically understand any potential patterns that could lead to more efficient strategies.

One of the more comprehensive articles on the various strategies players use was written by Chris Chow about a month into Wordle mania. Through a literature review of other gamers’ studies, he determined that in addition to having a great starter or “seed” word, the best players used a combination of tactics to play towards multiple goals or definitions of success. Some play to solve it in the fewest guesses, while others focus on gaining information to ensure success by the final round (playing not to lose). He created a gaming bot to simulate various tactics and determined that personal preference was as big a factor in a successful gaming strategy as what letters/words were chosen at the beginning of a round.

The following case study relies on text network analysis techniques in an attempt to identify latent or hidden relationships between Wordle characters. My method explores the topics below, each tied to a step of the data-intensive workflow process:

Prepare: Prior to analysis, I’ll explain the context from which the data came, formulate some research questions, and introduce the R packages that will enable analysis.
Wrangle: In section 2, I’ll import the Wordle data set taken from Twitter, tidy it, and tokenize it into elements that can be statistically analyzed as well as input into network graphing models.
Analyze: In section 3, I’ll explore the data elements in an effort to describe trends that can shape the tactics in building the graphing models.
Model: I’ll wrap up the analysis in Section 4 by introducing networks of letter combinations. These tools will assist in visualizing relationships between letters and may uncover insights into effective letter or word choice in the game.
Communicate: Finally, I’ll conclude by consolidating artifacts from the analysis to address the research focus from section 1b.

1. PREPARE

1a. Research Context

As a relatively new student of the data science field (and a Wordle addict), I’m throwing my hat into the analytic ring, but I’m handicapping myself a bit. As a gaming purist, I want to do as little damage as possible to the Wordle experience, so I am refraining from digging into the coded word lists and building my analysis on only the games that have been played to date.

1b. Guiding Questions

The goal of this analysis is to determine whether or not text network analysis can be used to describe the latent relationships between letters or letter combinations and thereby indicate those combinations most conducive to solving Wordle puzzles faster.

1c. Load Libraries

library(plotly)
library(tidyverse)
library(tidytext)
library(htmlwidgets)
library(dplyr)
library(here)
library(igraph)
library(ggraph)
library(zoo)
library(wordcloud2)
library(kableExtra)

2. WRANGLE

2a. Import Data

The data for this case study was generated by Kevin O’Connor’s (@gooeyblob) automated @WordleStats account, which I put into a .csv file for ease of input. The data was collected between 7 January and 23 March 2022 for a total of 76 observations.

wordle_raw <- read_csv(here("data", "wordle.csv"))

wordle_raw %>% 
  kbl() %>% 
  kable_styling() %>% 
  scroll_box(width = "800px", height = "500px")

Date	ID	Word	n	Hmode	1	2	3	4	5	6	X
23-Mar	277	PURGE	156785	8555	0.01	0.04	0.22	0.35	0.26	0.11	0.02
22-Mar	276	SLOSH	160161	8807	0.00	0.02	0.19	0.36	0.27	0.13	0.02
21-Mar	275	THEIR	173636	9200	0.02	0.14	0.36	0.30	0.13	0.04	0.00
20-Mar	274	RENEW	154987	8417	0.00	0.04	0.20	0.33	0.27	0.13	0.02
19-Mar	273	ALLOW	156311	8515	0.00	0.05	0.21	0.32	0.26	0.14	0.03
18-Mar	272	SAUTE	179830	9304	0.01	0.08	0.31	0.34	0.19	0.06	0.01
17-Mar	271	MOVIE	169071	8847	0.01	0.05	0.18	0.30	0.26	0.16	0.03
16-Mar	270	CATER	217856	11234	0.01	0.07	0.19	0.22	0.19	0.18	0.15
15-Mar	269	TEASE	202855	10024	0.01	0.16	0.32	0.30	0.16	0.06	0.01
14-Mar	268	SMELT	185406	9373	0.00	0.05	0.19	0.33	0.28	0.13	0.02
13-Mar	267	FOCUS	179436	8937	0.01	0.04	0.23	0.36	0.24	0.10	0.01
12-Mar	266	TODAY	192049	9353	0.01	0.07	0.29	0.35	0.20	0.07	0.01
11-Mar	265	WATCH	226349	12400	0.01	0.06	0.14	0.18	0.17	0.24	0.20
10-Mar	264	LAPSE	208884	9960	0.00	0.08	0.31	0.34	0.19	0.07	0.01
9-Mar	263	MONTH	201799	9435	0.01	0.05	0.26	0.37	0.22	0.08	0.01
8-Mar	262	SWEET	207473	9767	0.01	0.05	0.18	0.31	0.28	0.15	0.02
7-Mar	261	HOARD	218595	9823	0.01	0.09	0.30	0.34	0.19	0.07	0.01
6-Mar	260	CLOTH	218595	9911	0.01	0.08	0.33	0.34	0.17	0.07	0.01
5-Mar	259	BRINE	229895	10405	0.01	0.09	0.25	0.29	0.22	0.12	0.03
4-Mar	258	AHEAD	203730	9396	0.01	0.05	0.20	0.35	0.26	0.12	0.02
3-Mar	257	MOURN	240018	10465	0.01	0.08	0.29	0.34	0.19	0.08	0.01
2-Mar	256	NASTY	257304	10813	0.01	0.07	0.26	0.31	0.21	0.11	0.02
1-Mar	255	RUPEE	240137	10577	0.01	0.02	0.17	0.35	0.30	0.13	0.02
28-Feb	254	CHOKE	251094	10521	0.01	0.08	0.30	0.36	0.18	0.06	0.01
27-Feb	253	CHANT	250413	10438	0.01	0.09	0.33	0.33	0.16	0.07	0.01
26-Feb	252	SPILL	248363	10087	0.01	0.05	0.26	0.34	0.22	0.10	0.02
25-Feb	251	VIVID	255907	11687	0.01	0.02	0.10	0.29	0.33	0.21	0.04
24-Feb	250	BLOKE	250674	10405	0.01	0.06	0.21	0.32	0.25	0.12	0.02
23-Feb	249	TROVE	277576	11411	0.01	0.05	0.16	0.24	0.25	0.22	0.08
22-Feb	248	THORN	309356	11814	0.01	0.14	0.38	0.30	0.12	0.04	0.00
21-Feb	247	OTHER	278731	10887	0.01	0.09	0.26	0.30	0.21	0.10	0.02
20-Feb	246	TACIT	273306	11094	0.01	0.04	0.21	0.32	0.26	0.14	0.03
19-Feb	245	SWILL	282327	11241	0.01	0.01	0.08	0.19	0.31	0.30	0.10
18-Feb	244	DODGE	265238	10220	0.01	0.03	0.15	0.29	0.27	0.19	0.07
17-Feb	243	SHAKE	342003	12767	0.01	0.06	0.16	0.23	0.24	0.21	0.09
16-Feb	242	CAULK	289721	10740	0.01	0.04	0.20	0.31	0.26	0.15	0.03
15-Feb	241	AROMA	287836	10343	0.01	0.06	0.25	0.33	0.22	0.11	0.02
14-Feb	240	CYNIC	261521	10030	0.01	0.02	0.11	0.33	0.34	0.17	0.03
13-Feb	239	ROBIN	277471	9249	0.01	0.06	0.29	0.34	0.21	0.08	0.01
12-Feb	238	ULTRA	269885	9310	0.01	0.07	0.23	0.34	0.24	0.10	0.01
11-Feb	237	ULCER	278826	10631	0.01	0.04	0.18	0.30	0.28	0.16	0.03
10-Feb	236	PAUSE	304830	13480	0.01	0.08	0.26	0.32	0.21	0.10	0.02
9-Feb	235	HUMOR	305372	13846	0.01	0.05	0.22	0.34	0.25	0.11	0.02
8-Feb	234	FRAME	336236	15369	0.01	0.10	0.20	0.24	0.24	0.17	0.03
7-Feb	233	ELDER	288228	13340	0.01	0.03	0.13	0.24	0.30	0.24	0.05
6-Feb	232	SKILL	311018	13716	0.01	0.03	0.17	0.33	0.27	0.16	0.03
5-Feb	231	ALOFT	319698	13708	0.01	0.04	0.22	0.36	0.25	0.11	0.02
4-Feb	230	PLEAT	359679	14813	0.01	0.10	0.28	0.31	0.19	0.09	0.02
3-Feb	229	SHARD	358176	14609	0.01	0.07	0.22	0.28	0.25	0.14	0.04
2-Feb	228	MOIST	361908	14205	0.03	0.13	0.32	0.29	0.16	0.07	0.01
1-Feb	227	THOSE	351663	13606	0.01	0.13	0.34	0.30	0.15	0.06	0.01
31-Jan	226	LIGHT	341314	13347	0.01	0.10	0.25	0.27	0.19	0.12	0.05
30-Jan	225	WRUNG	294687	11524	0.00	0.02	0.18	0.39	0.27	0.12	0.02
29-Jan	224	COULD	313220	11592	0.01	0.07	0.29	0.35	0.20	0.08	0.01
28-Jan	223	PERKY	296968	11148	0.01	0.04	0.17	0.30	0.27	0.17	0.04
27-Jan	222	MOUNT	331844	11451	0.01	0.09	0.29	0.33	0.19	0.07	0.01
26-Jan	221	WHACK	302348	10163	0.01	0.04	0.22	0.37	0.24	0.10	0.02
25-Jan	220	SUGAR	276404	8708	0.01	0.06	0.25	0.34	0.23	0.09	0.01
24-Jan	219	KNOLL	258038	8317	0.01	0.01	0.11	0.29	0.33	0.21	0.04
23-Jan	218	CRIMP	269929	7630	0.01	0.05	0.28	0.38	0.20	0.07	0.01
22-Jan	217	WINCE	241489	6850	0.01	0.03	0.17	0.33	0.29	0.15	0.03
21-Jan	216	PRICK	273727	7409	0.01	0.08	0.30	0.33	0.19	0.07	0.01
20-Jan	215	ROBOT	243964	6589	0.01	0.08	0.29	0.34	0.20	0.08	0.01
19-Jan	214	POINT	280622	7094	0.01	0.16	0.37	0.28	0.12	0.04	0.01
18-Jan	213	PROXY	220950	6206	0.01	0.02	0.11	0.24	0.31	0.26	0.06
17-Jan	212	SHIRE	222197	5640	0.01	0.08	0.32	0.32	0.18	0.08	0.02
16-Jan	211	SOLAR	209609	4955	0.01	0.09	0.32	0.32	0.18	0.07	0.01
15-Jan	210	PANIC	205880	4655	0.01	0.09	0.35	0.34	0.16	0.05	0.01
14-Jan	209	TANGY	169484	3985	0.01	0.04	0.21	0.30	0.24	0.15	0.05
13-Jan	208	ABBEY	132726	3345	0.01	0.02	0.13	0.29	0.31	0.20	0.03
12-Jan	207	FAVOR	137586	3073	0.01	0.04	0.15	0.26	0.29	0.21	0.04
11-Jan	206	DRINK	153880	3017	0.01	0.09	0.35	0.34	0.16	0.05	0.01
10-Jan	205	QUERY	107134	2242	0.01	0.04	0.16	0.30	0.30	0.17	0.02
9-Jan	204	GORGE	91477	1913	0.01	0.03	0.13	0.27	0.30	0.22	0.04
8-Jan	203	CRANK	101503	1763	0.01	0.05	0.23	0.31	0.24	0.14	0.02
7-Jan	202	SLUMP	80630	1362	0.01	0.03	0.23	0.39	0.24	0.09	0.01

This data is still quite raw. To better understand player engagement and results, we must tidy it a bit by doing some computations and modifying a few columns to make plotting simpler.

2b. Tidy Data

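# Convert date column from character to Date class.
# The raw dates carry no year (e.g. "23-Mar"), so append 2022 explicitly;
# otherwise as.Date() fills in the current year.
wordle_date <- mutate(wordle_raw, Date = as.Date(paste0(Date, "-2022"), "%d-%b-%Y"))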

# Rename column 'n' to 'Players'
guesses_by_date <- wordle_date %>% 
  rename(Players = n)

# Rename guess variables
names(guesses_by_date)[6:12] <- c("one", "two", "three", "four", "five", "six", "wrong")

# Convert each guess share into a player count and replace that column
wordle_guesses <- guesses_by_date %>% 
  mutate(across(one:wrong, ~ .x * Players))

# Add column for average guess by word
wordle_guesses <- wordle_guesses %>% 
  mutate(avg_guess = (one + 2 * two + 3 * three + 4 * four +
         5 * five + 6 * six + 7 * wrong) / Players) %>% 
  relocate(avg_guess, .before = Players)

# Reshape to long format: one row per word and guess outcome
# (the outcome labels land in the default `name` column)
wordle_tidy <- wordle_guesses %>% 
  pivot_longer(one:last_col(), values_to = "guesses")

wordle_guesses %>% 
  kbl() %>% 
  kable_styling() %>% 
  scroll_box(width = "800px", height = "500px")

Date	ID	Word	avg_guess	Players	Hmode	one	two	three	four	five	six	wrong
2022-03-23	277	PURGE	4.25	156785	8555	1567.85	6271.40	34492.70	54874.75	40764.10	17246.35	3135.70
2022-03-22	276	SLOSH	4.32	160161	8807	0.00	3203.22	30430.59	57657.96	43243.47	20820.93	3203.22
2022-03-21	275	THEIR	3.47	173636	9200	3472.72	24309.04	62508.96	52090.80	22572.68	6945.44	0.00
2022-03-20	274	RENEW	4.27	154987	8417	0.00	6199.48	30997.40	51145.71	41846.49	20148.31	3099.74
2022-03-19	273	ALLOW	4.36	156311	8515	0.00	7815.55	32825.31	50019.52	40640.86	21883.54	4689.33
2022-03-18	272	SAUTE	3.84	179830	9304	1798.30	14386.40	55747.30	61142.20	34167.70	10789.80	1798.30
2022-03-17	271	MOVIE	4.32	169071	8847	1690.71	8453.55	30432.78	50721.30	43958.46	27051.36	5072.13
2022-03-16	270	CATER	4.68	217856	11234	2178.56	15249.92	41392.64	47928.32	41392.64	39214.08	32678.40
2022-03-15	269	TEASE	3.72	202855	10024	2028.55	32456.80	64913.60	60856.50	32456.80	12171.30	2028.55
2022-03-14	268	SMELT	4.31	185406	9373	0.00	9270.30	35227.14	61183.98	51913.68	24102.78	3708.12
2022-03-13	267	FOCUS	4.09	179436	8937	1794.36	7177.44	41270.28	64596.96	43064.64	17943.60	1794.36
2022-03-12	266	TODAY	3.91	192049	9353	1920.49	13443.43	55694.21	67217.15	38409.80	13443.43	1920.49
2022-03-11	265	WATCH	4.96	226349	12400	2263.49	13580.94	31688.86	40742.82	38479.33	54323.76	45269.80
2022-03-10	264	LAPSE	3.89	208884	9960	0.00	16710.72	64754.04	71020.56	39687.96	14621.88	2088.84
2022-03-09	263	MONTH	4.02	201799	9435	2017.99	10089.95	52467.74	74665.63	44395.78	16143.92	2017.99
2022-03-08	262	SWEET	4.33	207473	9767	2074.73	10373.65	37345.14	64316.63	58092.44	31120.95	4149.46
2022-03-07	261	HOARD	3.89	218595	9823	2185.95	19673.55	65578.50	74322.30	41533.05	15301.65	2185.95
2022-03-06	260	CLOTH	3.86	218595	9911	2185.95	17487.60	72136.35	74322.30	37161.15	15301.65	2185.95
2022-03-05	259	BRINE	4.13	229895	10405	2298.95	20690.55	57473.75	66669.55	50576.90	27587.40	6896.85
2022-03-04	258	AHEAD	4.27	203730	9396	2037.30	10186.50	40746.00	71305.50	52969.80	24447.60	4074.60
2022-03-03	257	MOURN	3.90	240018	10465	2400.18	19201.44	69605.22	81606.12	45603.42	19201.44	2400.18
2022-03-02	256	NASTY	4.02	257304	10813	2573.04	18011.28	66899.04	79764.24	54033.84	28303.44	5146.08
2022-03-01	255	RUPEE	4.38	240137	10577	2401.37	4802.74	40823.29	84047.95	72041.10	31217.81	4802.74
2022-02-28	254	CHOKE	3.84	251094	10521	2510.94	20087.52	75328.20	90393.84	45196.92	15065.64	2510.94
2022-02-27	253	CHANT	3.79	250413	10438	2504.13	22537.17	82636.29	82636.29	40066.08	17528.91	2504.13
2022-02-26	252	SPILL	4.09	248363	10087	2483.63	12418.15	64574.38	84443.42	54639.86	24836.30	4967.26
2022-02-25	251	VIVID	4.70	255907	11687	2559.07	5118.14	25590.70	74213.03	84449.31	53740.47	10236.28
2022-02-24	250	BLOKE	4.15	250674	10405	2506.74	15040.44	52641.54	80215.68	62668.50	30080.88	5013.48
2022-02-23	249	TROVE	4.68	277576	11411	2775.76	13878.80	44412.16	66618.24	69394.00	61066.72	22206.08
2022-02-22	248	THORN	3.47	309356	11814	3093.56	43309.84	117555.28	92806.80	37122.72	12374.24	0.00
2022-02-21	247	OTHER	3.96	278731	10887	2787.31	25085.79	72470.06	83619.30	58533.51	27873.10	5574.62
2022-02-20	246	TACIT	4.35	273306	11094	2733.06	10932.24	57394.26	87457.92	71059.56	38262.84	8199.18
2022-02-19	245	SWILL	5.08	282327	11241	2823.27	2823.27	22586.16	53642.13	87521.37	84698.10	28232.70
2022-02-18	244	DODGE	4.66	265238	10220	2652.38	7957.14	39785.70	76919.02	71614.26	50395.22	18566.66
2022-02-17	243	SHAKE	4.62	342003	12767	3420.03	20520.18	54720.48	78660.69	82080.72	71820.63	30780.27
2022-02-16	242	CAULK	4.34	289721	10740	2897.21	11588.84	57944.20	89813.51	75327.46	43458.15	8691.63
2022-02-15	241	AROMA	4.10	287836	10343	2878.36	17270.16	71959.00	94985.88	63323.92	31661.96	5756.72
2022-02-14	240	CYNIC	4.63	261521	10030	2615.21	5230.42	28767.31	86301.93	88917.14	44458.57	7845.63
2022-02-13	239	ROBIN	3.96	277471	9249	2774.71	16648.26	80466.59	94340.14	58268.91	22197.68	2774.71
2022-02-12	238	ULTRA	4.07	269885	9310	2698.85	18891.95	62073.55	91760.90	64772.40	26988.50	2698.85
2022-02-11	237	ULCER	4.40	278826	10631	2788.26	11153.04	50188.68	83647.80	78071.28	44612.16	8364.78
2022-02-10	236	PAUSE	4.02	304830	13480	3048.30	24386.40	79255.80	97545.60	64014.30	30483.00	6096.60
2022-02-09	235	HUMOR	4.18	305372	13846	3053.72	15268.60	67181.84	103826.48	76343.00	33590.92	6107.44
2022-02-08	234	FRAME	4.20	336236	15369	3362.36	33623.60	67247.20	80696.64	80696.64	57160.12	10087.08
2022-02-07	233	ELDER	4.71	288228	13340	2882.28	8646.84	37469.64	69174.72	86468.40	69174.72	14411.40
2022-02-06	232	SKILL	4.42	311018	13716	3110.18	9330.54	52873.06	102635.94	83974.86	49762.88	9330.54
2022-02-05	231	ALOFT	4.24	319698	13708	3196.98	12787.92	70333.56	115091.28	79924.50	35166.78	6393.96
2022-02-04	230	PLEAT	3.92	359679	14813	3596.79	35967.90	100710.12	111500.49	68339.01	32371.11	7193.58
2022-02-03	229	SHARD	4.30	358176	14609	3581.76	25072.32	78798.72	100289.28	89544.00	50144.64	14327.04
2022-02-02	228	MOIST	3.70	361908	14205	10857.24	47048.04	115810.56	104953.32	57905.28	25333.56	3619.08
2022-02-01	227	THOSE	3.67	351663	13606	3516.63	45716.19	119565.42	105498.90	52749.45	21099.78	3516.63
2022-01-31	226	LIGHT	4.06	341314	13347	3413.14	34131.40	85328.50	92154.78	64849.66	40957.68	17065.70
2022-01-30	225	WRUNG	4.35	294687	11524	0.00	5893.74	53043.66	114927.93	79565.49	35362.44	5893.74
2022-01-29	224	COULD	3.97	313220	11592	3132.20	21925.40	90833.80	109627.00	62644.00	25057.60	3132.20
2022-01-28	223	PERKY	4.45	296968	11148	2969.68	11878.72	50484.56	89090.40	80181.36	50484.56	11878.72
2022-01-27	222	MOUNT	3.82	331844	11451	3318.44	29865.96	96234.76	109508.52	63050.36	23229.08	3318.44
2022-01-26	221	WHACK	4.17	302348	10163	3023.48	12093.92	66516.56	111868.76	72563.52	30234.80	6046.96
2022-01-25	220	SUGAR	4.00	276404	8708	2764.04	16584.24	69101.00	93977.36	63572.92	24876.36	2764.04
2022-01-24	219	KNOLL	4.71	258038	8317	2580.38	2580.38	28384.18	74831.02	85152.54	54187.98	10321.52
2022-01-23	218	CRIMP	3.96	269929	7630	2699.29	13496.45	75580.12	102573.02	53985.80	18895.03	2699.29
2022-01-22	217	WINCE	4.46	241489	6850	2414.89	7244.67	41053.13	79691.37	70031.81	36223.35	7244.67
2022-01-21	216	PRICK	3.83	273727	7409	2737.27	21898.16	82118.10	90329.91	52008.13	19160.89	2737.27
2022-01-20	215	ROBOT	3.95	243964	6589	2439.64	19517.12	70749.56	82947.76	48792.80	19517.12	2439.64
2022-01-19	214	POINT	3.47	280622	7094	2806.22	44899.52	103830.14	78574.16	33674.64	11224.88	2806.22
2022-01-18	213	PROXY	4.87	220950	6206	2209.50	4419.00	24304.50	53028.00	68494.50	57447.00	13257.00
2022-01-17	212	SHIRE	3.93	222197	5640	2221.97	17775.76	71103.04	71103.04	39995.46	17775.76	4443.94
2022-01-16	211	SOLAR	3.82	209609	4955	2096.09	18864.81	67074.88	67074.88	37729.62	14672.63	2096.09
2022-01-15	210	PANIC	3.77	205880	4655	2058.80	18529.20	72058.00	69999.20	32940.80	10294.00	2058.80
2022-01-14	209	TANGY	4.37	169484	3985	1694.84	6779.36	35591.64	50845.20	40676.16	25422.60	8474.20
2022-01-13	208	ABBEY	4.56	132726	3345	1327.26	2654.52	17254.38	38490.54	41145.06	26545.20	3981.78
2022-01-12	207	FAVOR	4.57	137586	3073	1375.86	5503.44	20637.90	35772.36	39899.94	28893.06	5503.44
2022-01-11	206	DRINK	3.77	153880	3017	1538.80	13849.20	53858.00	52319.20	24620.80	7694.00	1538.80
2022-01-10	205	QUERY	4.43	107134	2242	1071.34	4285.36	17141.44	32140.20	32140.20	18212.78	2142.68
2022-01-09	204	GORGE	4.64	91477	1913	914.77	2744.31	11892.01	24698.79	27443.10	20124.94	3659.08
2022-01-08	203	CRANK	4.22	101503	1763	1015.03	5075.15	23345.69	31465.93	24360.72	14210.42	2030.06
2022-01-07	202	SLUMP	4.13	80630	1362	806.30	2418.90	18544.90	31445.70	19351.20	7256.70	806.30

Note that this data describes only those games that were shared on Twitter. The Players column captures the number of players sharing their results on that day. While not reported anywhere officially, the total number of Wordle players is thought to be in the neighborhood of 15 times the number who share to social media. Hmode denotes the number of players who chose to play in hard mode, which forces players to keep using letters that have been correctly identified. In normal mode, players can enter any valid word, which is helpful if your preference is to eliminate letters in an information-gathering strategy. The columns one through six give the number of players who solved the puzzle on attempts 1 through 6. The column wrong counts the players who failed to guess the correct word by the 6th attempt; avg_guess scores those failures as 7 guesses.
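As a worked check of the avg_guess column, here is a minimal base R sketch for a single row (PURGE, 23 March), using the shares and player count shown in the raw table. The variable names are mine and purely illustrative. It also illustrates one caveat: because @WordleStats reports rounded shares, they do not always sum to exactly 1, so the weighted average can drift slightly from a normalized one.

```r
# Shares for PURGE (23-Mar) copied from the raw table; names are illustrative
shares  <- c(one = 0.01, two = 0.04, three = 0.22, four = 0.35,
             five = 0.26, six = 0.11, wrong = 0.02)
players <- 156785

# Player counts per outcome, as in the tidy step
counts <- shares * players

# Failures ("wrong") are scored as 7 guesses
guess_weights <- 1:7
avg_guess <- sum(guess_weights * counts) / players

# The rounded shares sum to slightly more than 1 for this row
share_total <- sum(shares)

# Assumption: dividing by the share total is one way to correct for rounding
avg_normalized <- sum(guess_weights * shares) / share_total

round(c(avg_guess = avg_guess, share_total = share_total,
        avg_normalized = avg_normalized), 2)
```

The unnormalized value matches the 4.25 reported in the table; the normalized value is a little lower, which suggests the averages are best read as approximate.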

2c. Tokenize Words

The final step in Section 2 involves breaking the words into individual characters to enable network analysis in Section 4. This tokenization will isolate individual letters (unigrams) and/or group them into bigrams (letter pairs) to explore latent relationships.

# Reduce dataframe to only answers
wordle_answers <- wordle_raw %>%
  select(ID, Word)
  
# Tokenize characters-unigrams
wordle_unigram <- wordle_answers %>%   
  unnest_tokens(letter, Word, token = "characters", to_lower = FALSE)

unigram_top_tokens <- wordle_unigram %>% 
  count(letter, sort = TRUE) %>% 
  top_n(26)

unigram_top_tokens %>% 
  kbl() %>% 
  kable_styling(fixed_thead = T) %>% 
  kable_paper() %>% 
  scroll_box(width = "25%", height = "250px")

letter	n
E	36
R	33
A	31
O	31
T	27
L	25
S	21
I	20
C	18
H	18
N	18
U	16
P	13
K	11
M	11
D	10
G	8
W	8
Y	8
B	6
V	5
F	4
Q	1
X	1

# Tokenize characters-bigrams
wordle_bigram <- wordle_answers %>%   
  unnest_tokens(bigram, Word, token = "character_shingles", n = 2) 
  
# unnest_tokens() lowercases tokens by default; restore upper case
wordle_bigram$bigram <- toupper(wordle_bigram$bigram)

bigram_top_tokens <- wordle_bigram %>% 
  count(bigram, sort = TRUE) %>% 
  top_n(31)

bigram_top_tokens %>% 
  kbl() %>% 
  kable_styling(fixed_thead = T) %>% 
  kable_paper() %>% 
  scroll_box(width = "25%", height = "250px")

bigram	n
ER	6
MO	6
TH	6
AR	5
IN	5
LL	5
LO	5
RO	5
AN	4
HA	4
HO	4
NT	4
OR	4
RI	4
SE	4
SH	4
UL	4
AT	3
AU	3
CH	3
EA	3
GE	3
HE	3
IC	3
IL	3
KE	3
OT	3
OU	3
RA	3
TE	3
VI	3
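To make the tokenization concrete, the following base R sketch splits a single answer into unigrams and overlapping bigrams (character shingles). It is only an illustration of what the tokens look like, not a replacement for the tidytext calls above.

```r
word <- "PURGE"

# Unigrams: one token per character
unigrams <- strsplit(word, "")[[1]]

# Bigrams: every pair of adjacent characters, so a 5-letter word yields 4
n_chars <- nchar(word)
bigrams <- substring(word, 1:(n_chars - 1), 2:n_chars)

unigrams
bigrams
```

Each 5-letter answer therefore contributes five unigrams and four bigrams, which is why the 76 answers produce 380 letters in the unigram table.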

3. ANALYZE

3a. Player and Score Descriptors

Since one of our goals is to understand which letter or combinations of letters lead to better outcomes, tidying the data allows us to explore player engagement at the word and attempt level.

wordle_tidy$name <- factor(wordle_tidy$name, 
                           levels = c("one",
                                      "two",
                                      "three",
                                      "four",
                                      "five",
                                      "six",
                                      "wrong"))

wordle_tidy %>%
  ggplot(aes(fill=name, y=guesses, x=Date)) + 
  geom_bar(position="stack", stat="identity") +
  labs(subtitle = "Twitter-Reported Wordle Scores by Date", x = "", y = "Players") +
  scale_fill_discrete(name = "Guesses") +
  scale_y_continuous(labels = scales::comma)

Wordle’s moment in the sun may be waning, but there are still over 150,000 players posting their scores to Twitter on a daily basis. If the 15-to-1 estimate from Section 2 holds, that implies upwards of 2 million players globally. Wordle was purchased by the New York Times (NYT) at the end of January, which coincided with the beginning of a decline in player interest. That is probably less about the NYT and more about the nature of game fads.

There was a rumor or belief that the NYT made the game more difficult, leading to players losing interest. Well, let’s see what the data says… (hover over each column to see that day’s word and scores)

scores_bar_graph <- wordle_guesses %>% 
  ggplot(aes(y= avg_guess, x = Date)) +
  geom_col(aes(text = Word), fill = "pink") +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  geom_ribbon(stat = "smooth", method = "lm", alpha = .15) +
  scale_y_continuous(labels = scales::comma) +
  coord_cartesian(ylim = c(3,5)) +
  labs(title = "Wordle Averages by Date", x = "", y = "Average Score")

ggplotly(scores_bar_graph, tooltip = c("text", "Date", "avg_guess"))

When we take all scores into account from early January through 23 March, the trend line does not imply an increase in difficulty. In fact, if you hover over the ends of the trend line, you can see that the most recent fitted average, 4.16, is lower than the fitted average in early January, 4.20. That is fairly flat and indicates no real change in difficulty level for this data set. But with all those peaks and troughs, it is probably a bit more accurate to look at a rolling average to capture short-term effects that could have influenced how players perceived game difficulty.

rolling_bar_graph <- wordle_guesses %>% 
  # 5-day centered rolling mean; pad the two days at each end with NA
  # so the line is not dragged down to zero there
  mutate(five_avg = rollmean(avg_guess, 5,
                             align = "center",
                             fill = NA)) %>%
  ggplot(aes(y = avg_guess, x = Date)) +
  geom_col(aes(text = Word), fill = "pink") +
  geom_line(aes(y = five_avg),
            color = "red") +
  scale_y_continuous(labels = scales::comma) +
  coord_cartesian(ylim = c(3,5)) +
  labs(title = "5-Day Rolling Wordle Average", x = "", y = "Average Score")

ggplotly(rolling_bar_graph, tooltip = c("text", "Date", "avg_guess"))

This shows a very different adventure, though the ending is probably the same. Again, there are numerous peaks and dips that, over time, average out to a fairly flat trend. But depending on when you started playing, your experience could vary greatly from someone else’s. If we focus on the time around the NYT purchase of the game, you can see a steady increase in difficulty from the beginning of February until about the 18th. For anyone paying attention, it would absolutely look and feel like the game got harder for about 2 weeks after the NYT took the reins. The game then re-balanced over the following few weeks, falling back in line with what has been a flat trend over the course of the game.

As it turns out, the NYT did nothing to make the game any more difficult. In fact, they removed solution words they felt were too obscure and could make the game less fun to play. The vast majority of original solution words are still in play in the same order the original creator laid out.
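For readers who want to see what a 5-day centered rolling mean does to the numbers, here is a minimal base R sketch using the seven most recent avg_guess values from the table in Section 2 (17 to 23 March, in date order). It uses stats::filter() rather than zoo, purely as an illustration; the window is incomplete for the first two and last two days, so those positions come back as NA.

```r
# avg_guess for 17-Mar through 23-Mar, copied from the tidy table
avg_guess <- c(4.32, 3.84, 4.36, 4.27, 3.47, 4.32, 4.25)

# Equal weights over a 5-day window; sides = 2 centers the window.
# stats:: is spelled out because dplyr also exports a filter().
roll5 <- as.numeric(stats::filter(avg_guess, rep(1 / 5, 5), sides = 2))

round(roll5, 3)
```

Only the three middle days get a value, and the easy day (THEIR, 3.47) is absorbed into its neighbors, which is the smoothing effect visible in the rolling chart.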

3b. Letter Frequencies

Prior to applying text networking models to the data, it may be informative to better understand key letters and letter combinations that could influence Wordle guess choice. Logic dictates that the most common letters (or combinations thereof) would make good options for early word selection. For this last piece of exploratory analysis, I’ll examine single letters (unigrams) and letter pairs (bigrams).

unigram_top_tokens <- wordle_unigram %>% 
  count(letter, sort = TRUE) %>% 
  top_n(26)

top_unigrams <- unigram_top_tokens %>% 
  mutate(freq = round(n / sum(n), 3)) %>% 
  arrange(desc(freq)) 

top_unigrams %>% 
  ggplot(aes(x = reorder(letter, -n), y = freq)) +
  geom_bar(stat = "identity", fill = "palegreen") +
  scale_y_continuous(breaks = c(0, .02, .04, .06, .08, .10)) +
  labs(title = "Top Wordle Letters (Unigrams)", subtitle = "7 Jan - 23 Mar '22", 
       y = "Frequency of Appearance", x = "Letter")

The five most-used letters to date are E, R, O, A, and T. Odds are that if you use those letters early, you will see some success identifying usable clues. However, only a handful of actual words can be formed from those five letters, among them ORATE and ROATE. I highly doubt that ROATE is one of the solution words, so this significantly limits a player’s ability to guess right on the first attempt. Let’s try this on bigrams to see if we uncover any additional clues.

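# Compute each bigram's share of ALL bigrams before keeping the top ones;
# otherwise the frequencies are relative to the top-31 subset only
bigram_top_tokens <- wordle_bigram %>% 
  count(bigram, sort = TRUE) %>% 
  mutate(freq = round(n / sum(n), 3)) %>% 
  top_n(31, n)

top_bigrams <- bigram_top_tokens %>% 
  arrange(desc(freq)) 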

top_bigrams %>% 
  ggplot(aes(x = reorder(bigram, -n), y = freq)) +
  geom_bar(stat = "identity", fill = "palegreen") +
  scale_y_continuous(breaks = c(0, .02, .04, .06, .08, .10)) +
  labs(title = "Top Wordle Letter Pairs (Bigrams)", subtitle = "7 Jan - 23 Mar '22",
       y = "Frequency of Appearance", x = "Letter Pair")

The top three bigrams (ER, MO, and TH) are recognized as fairly common beginning or ending parts of words. Again, this makes sense, but as they’ve only been seen 6 times each out of 76 games, it doesn’t appear that bigrams alone are the key to victory in the first couple of guesses either.

Now that we’ve explored individual letters and bigram relationships mathematically, let’s transition to more visual representations. Section 4 takes this same data and experiments with network visuals in an attempt to identify latent relationships that standard statistical charts struggle to reveal.

4. MODEL

To visualize relationships between Wordle letters (the letters are the nodes; the pairings between them are the edges), we’ll need three pieces of information:

from: the letter an edge is coming from
to: the letter an edge is going towards
weight: a numeric value associated with each edge

4a. Transform Frequency Counts into Network Data

We need to transform our dataset (wordle_bigram) into these variables in the following way: from is the “letter1”, to is the “letter2”, and weight is “n”.

The function graph_from_data_frame enables the transformation:

# Separate bigrams
bigram_separated <- wordle_bigram %>%
  separate(bigram, c("letter1", "letter2"), sep = 1)

# Count Bigrams
bigram_counts <- bigram_separated %>% 
  count(letter1, letter2, sort = TRUE)

# Create graph
bigram_graph <- bigram_counts %>%
  graph_from_data_frame()
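To show the shape of the edge list that graph_from_data_frame() receives (the first two columns are read as from and to, and any further columns become edge attributes), here is a small base R sketch built from three answers in the data set. The object names are illustrative only, and the column is called weight here rather than n to mirror the description above.

```r
words <- c("THORN", "OTHER", "THOSE")

# One row per adjacent letter pair: from = first letter, to = second letter
edges <- do.call(rbind, lapply(strsplit(words, ""), function(ch) {
  data.frame(from = head(ch, -1), to = tail(ch, -1), weight = 1L)
}))

# Collapse duplicate pairs so weight counts how often each pair occurs
edge_list <- aggregate(weight ~ from + to, data = edges, FUN = sum)
edge_list <- edge_list[order(-edge_list$weight), ]

edge_list
```

In this tiny sample the T to H edge appears in all three words, so it carries the largest weight, just as TH sits near the top of the full bigram counts.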

4b. Visualize Bigram Network

set.seed(100)
bigram_graph |>
  ggraph(layout = "stress") + 
  geom_node_text(aes(label = name)) +
  geom_edge_link(aes(edge_alpha = n, start_cap = circle(2, 'mm'), end_cap = circle(2, 'mm')), 
                 arrow = arrow(length = unit(2, 'mm')), color = "blue") + 
  theme_graph() +
  labs(title = "Wordle Letter Pairs", edge_alpha = "#Connections")

Line weights are shaded to emphasize the number of connections between letters. You’ll see the top five letters (E, R, O, A, and T) from the frequency chart still play a prominent role in this visual, but a few more connections are also easily seen with this type of chart. The letters N, H, L, M, and I now appear to be quite an important part of the network. If we filter out some of the least common relationships, we can zero in on the most numerous letter pairs:

bigram_graph_filtered <- bigram_counts %>%
  filter(n > 2) %>%
  graph_from_data_frame()

set.seed(200)
bigram_graph_filtered |>
  ggraph(layout = "fr") + 
  geom_node_text(aes(label = name)) +
  geom_edge_link(aes(edge_alpha = n, start_cap = circle(2, 'mm'), end_cap = circle(2, 'mm')), 
                 arrow = arrow(length = unit(2, 'mm')), color = "red") + 
  theme_graph() +
  labs(title = "Most Common Wordle Letter Pairs", edge_alpha = "#Connections")

This relationship visual highlights letters and letter combinations that were not apparent in earlier charts. The letters H, N, and M don’t appear in our unigram frequency chart until the tenth position or later.
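One way to put a number on how prominent each letter is in the filtered network is its weighted degree (sometimes called strength): the sum of the weights on every edge entering or leaving the letter. This is a supplementary metric, not something computed in the analysis above. The sketch below uses base R and the bigram counts of 3 or more copied from the table in Section 2c; the variable names are illustrative.

```r
# Bigrams seen 3+ times, copied from the bigram table in section 2c
bigrams <- c(ER = 6, MO = 6, TH = 6, AR = 5, IN = 5, LL = 5, LO = 5, RO = 5,
             AN = 4, HA = 4, HO = 4, NT = 4, OR = 4, RI = 4, SE = 4, SH = 4,
             UL = 4, AT = 3, AU = 3, CH = 3, EA = 3, GE = 3, HE = 3, IC = 3,
             IL = 3, KE = 3, OT = 3, OU = 3, RA = 3, TE = 3, VI = 3)

from <- substr(names(bigrams), 1, 1)
to   <- substr(names(bigrams), 2, 2)

# Total weight leaving and entering each letter
out_strength <- tapply(bigrams, from, sum)
in_strength  <- tapply(bigrams, to, sum)

# Combine into one named vector; a self-loop such as LL counts on both sides
node_names <- sort(union(from, to))
strength <- setNames(numeric(length(node_names)), node_names)
strength[names(out_strength)] <- strength[names(out_strength)] + as.vector(out_strength)
strength[names(in_strength)]  <- strength[names(in_strength)] + as.vector(in_strength)

sort(strength, decreasing = TRUE)
```

By this measure H ranks well above its position in the unigram chart, which is consistent with the visual impression, while M owes its presence to the single MO pairing.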

5. COMMUNICATE

While letter counts and frequency analysis give insights into the most common letters, they fall short in describing the key letter pairings that lead to effective word choice for early guessing.

Frequency analysis identified the five most common letters as E, R, A, O, and T and the most common bigrams as ER, MO, and TH. According to the Scrabble dictionary, 22 words can be built by combining these results. After removing uncommon words and words with repeated letters, 10 potential starter words remain:

earth, harem, hater, heart, homer, other, metro, tamer, torah, and orate.
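One possible way to rank these ten candidates, offered as a sketch rather than as part of the original analysis, is to score each word by the summed unigram counts of its distinct letters. The counts are copied from the unigram table in Section 2c, and score_word is a hypothetical helper written for this illustration.

```r
# Unigram counts copied from the frequency table in section 2c
letter_counts <- c(E = 36, R = 33, A = 31, O = 31, T = 27, L = 25, S = 21,
                   I = 20, C = 18, H = 18, N = 18, U = 16, P = 13, K = 11,
                   M = 11, D = 10, G = 8, W = 8, Y = 8, B = 6, V = 5,
                   F = 4, Q = 1, X = 1)

starters <- c("earth", "harem", "hater", "heart", "homer",
              "other", "metro", "tamer", "torah", "orate")

# Hypothetical helper: sum the counts of a word's distinct letters
score_word <- function(word, counts) {
  chars <- unique(strsplit(toupper(word), "")[[1]])
  sum(counts[chars], na.rm = TRUE)
}

starter_scores <- sort(
  vapply(starters, score_word, numeric(1), counts = letter_counts),
  decreasing = TRUE
)

starter_scores
```

Under this simple scoring, orate comes out on top because it uses exactly the five most frequent letters. The score ignores letter position, so it should be read as a rough screen only.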

Network analysis, on the other hand, identified letter combinations above and beyond those surfaced by the frequency counts. Adding the letters L, I, and N grows the potential starter word list to 72.

Conclusion

This case study was designed to answer two questions:

Can text network analysis identify latent relationships between Wordle letters?
Can those identified relationships lead to more effective Wordle solutions?

The first question can be answered in the affirmative, as the network visuals aided in identifying letter relationships that were not apparent in the frequency analysis at either the unigram or bigram level. Those additional combinations increased the number of potential words that could be used early in the game roughly sevenfold.

The second question, however, is a little more difficult to answer, and I believe the case study needs to be expanded to address the aspect of efficiency. Just having more potential words doesn’t necessarily equate to solving the puzzle quicker. One aspect of the game this analysis did not address was letter positioning. Letter frequency analysis focused on how often letters appeared regardless of position within a word. An additional level of analysis could explore positions one through five to determine where each letter appears most often, thereby reducing the potential word list to those terms that closely match letter placement.
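A minimal sketch of what such a positional analysis might look like, in base R, using only the five most recent answers from the data set. The object names are illustrative, and a real analysis would run over all 76 answers.

```r
# Five most recent answers from the data set (19-Mar through 23-Mar)
words <- c("PURGE", "SLOSH", "THEIR", "RENEW", "ALLOW")

# One row per letter, tagged with its position (1 to 5) within the word
chars <- strsplit(words, "")
long <- data.frame(
  letter   = unlist(chars),
  position = rep(1:5, times = length(words))
)

# Rows are letters, columns are positions, cells are counts
pos_table <- table(long$letter, long$position)

pos_table
```

Even in this tiny sample, W appears only in the final position and E never appears first, the kind of pattern that could narrow a candidate word list.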

Lastly, there were specific limitations on this case study based on the data set. My goal was to understand how text networks could be applied at the character level rather than to establish numerical absolutes. As a result, I specifically chose to remain within the puzzles that had already been solved so as not to spoil the game for those who may still be playing. This choice restricted my word list, and therefore my letter counts, significantly. I also didn’t use any external lexicons or dictionaries to determine “average” letter or bigram frequencies.

References

Bernoff, J. (2022, January 20). A mathematical analysis of the best first guess for Wordle. Without Bullshit. Retrieved March 25, 2022, from https://withoutbullshit.com/blog/a-mathematical-analysis-of-the-best-first-guess-for-wordle

Chow, C. (2022, February 11). Loaded words in Wordle. Medium. Retrieved March 25, 2022, from https://towardsdatascience.com/loaded-words-in-wordle-e78cb36f1e3c

Frias, J. (2022, March 23). Forget luck: Optimized wordle strategy. Medium. Retrieved March 26, 2022, from https://betterprogramming.pub/forget-luck-optimized-wordle-strategy-using-bigquery-c676771e316f

Gupta, R. (2022, January 25). WORDLE-Vision: Simple Analytics to up your Wordle Game. Medium. Retrieved March 25, 2022, from https://towardsdatascience.com/wordle-vision-simple-analytics-to-up-your-wordle-game-65daf4f1aa6f

Hinton, L. (2022, March 27). Wordle word answer - what’s the Wordle today? (March 27). Gfinity Esports. Retrieved March 25, 2022, from https://www.gfinityesports.com/wordle/answer-list/

Lesser, R. (2022, March 8). Wordle, 15 million tweets later. Observable. Retrieved March 25, 2022, from https://observablehq.com/@rlesser/wordle-twitter-exploration

The New York Times. (n.d.). Wordle - a daily word game. The New York Times. Retrieved March 23, 2022, from https://www.nytimes.com/games/wordle/index.html