Rubik’s Cube Records

Author

Erin Morrison

Image

Source: cubelelo

Introduction & Background Information

Rubik’s Cube competitions have a long global tradition, with the first World Championship held in 1982. These competitions include a variety of events in which people race to get the fastest solve times. The above image shows the World Cube Association’s (WCA) full competition setup. For the 2x2 through 5x5 events, competitors get 5 attempts, and the middle 3 times are averaged for their final score. There are many rules regulating the difficulty of scrambles, the handling of unfinished solves, and pre-solve inspection time. Every event has judges who facilitate each competitor’s solve and ensure rules are followed and times are correctly recorded (Source 1). This project’s data set is from the WCA website and contains the 100 best individual solve times for the 2x2 through 7x7 events. It also includes the competitor’s name, when and where the record was set, the rank, and the event (Source 2). The data needed a lot of cleaning to get all the variables into the correct format, especially when reading the years from the competition names and converting the solve times to seconds. I also had to combine the events into one big data set. I chose this topic because there was a time when I was obsessed with learning how to solve all different types of Rubik’s cubes and improving my solve times. While I never came close to competition speed, I still loved it.
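For reference, the WCA “average of 5” format drops the fastest and slowest of the five attempts and averages the remaining three. A minimal sketch of that calculation is below; the attempt times are made up and DNF handling is ignored.

# Minimal sketch of a WCA-style average of 5: drop the fastest and slowest
# attempts and average the middle three (attempt times below are made up)
attempts <- c(5.12, 4.87, 6.03, 5.44, 5.20)
mean(sort(attempts)[2:4])  # about 5.25 seconds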

Libraries & Data

# Load libraries
library(tidyverse)
library(ggfortify)
library(plotly)
library(highcharter)

# Set working directory
setwd("~/DATA 110")

# Read in data
cube2x2 <- read_csv("worldcube2x2.csv")
cube3x3 <- read_csv("worldcube3x3.csv")
cube4x4 <- read_csv("worldcube4x4.csv")
cube5x5 <- read_csv("worldcube5x5.csv")
cube6x6 <- read_csv("worldcube6x6.csv")
cube7x7 <- read_csv("worldcube7x7.csv")

Data Cleaning

Add an event variable

cube2x2$Event <- "2x2"
cube3x3$Event <- "3x3"
cube4x4$Event <- "4x4"
cube5x5$Event <- "5x5"
cube6x6$Event <- "6x6"
cube7x7$Event <- "7x7"

Convert all solve times to seconds

# 6x6
# Get vector of incorrectly formatted solve times
times6x6 <- cube6x6 |>
  filter(!Number %in% c(1, 2)) |>
  pull(Result)
# Convert to seconds
times6x6 <- as.character(times6x6)
seconds6x6 <- sapply(strsplit(times6x6, ":"),
                    function(x) as.numeric(x[1]) * 60 + as.numeric(x[2]))
# Combine cleaned solve times with other variables
cube6x6c <- cube6x6
cube6x6c$Result[(3:100)] <- seconds6x6
cube6x6c$Result <- as.numeric(cube6x6c$Result)

# 7x7
# Get vector of incorrectly formatted solve times
times7x7 <- cube7x7 |>
  pull(Result)
# Convert to seconds
times7x7 <- as.character(times7x7)
seconds7x7 <- sapply(strsplit(times7x7, ":"),
                    function(x) as.numeric(x[1]) * 60 + as.numeric(x[2]))
# Combine cleaned solve times with other variables
cube7x7c <- cube7x7
cube7x7c$Result <- seconds7x7
cube7x7c$Result <- as.numeric(cube7x7c$Result)
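As an aside, the same conversion could be written as one general helper that treats anything containing a colon as minutes:seconds and leaves plain second values alone. The hypothetical helper below (to_seconds) is only a sketch of an alternative, not what was used for the analysis.

# Sketch of a general converter: "minutes:seconds" strings become seconds,
# anything without a colon is assumed to already be in seconds
to_seconds <- function(x) {
  sapply(strsplit(as.character(x), ":"), function(p) {
    if (length(p) == 2) as.numeric(p[1]) * 60 + as.numeric(p[2])
    else as.numeric(p[1])
  })
}
to_seconds(c("58.03", "1:34.15"))  # 58.03 and 94.15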

Combine Data

cubeData <- full_join(cube2x2, cube3x3)
Joining with `by = join_by(Number, Name, Result, Region, Competition, Event)`
cubeData <- full_join(cubeData, cube4x4)
Joining with `by = join_by(Number, Name, Result, Region, Competition, Event)`
cubeData <- full_join(cubeData, cube5x5)
Joining with `by = join_by(Number, Name, Result, Region, Competition, Event)`
cubeData <- full_join(cubeData, cube6x6c)
Joining with `by = join_by(Number, Name, Result, Region, Competition, Event)`
cubeData <- full_join(cubeData, cube7x7c)
Joining with `by = join_by(Number, Name, Result, Region, Competition, Event)`
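Because the six event data frames share exactly the same columns, the same combined data set could also be built by stacking them. The sketch below stores the result under a separate (hypothetical) name so it does not interfere with the joins above.

# Equivalent alternative to the chain of full joins: stack the six
# event data frames, which all share the same columns
cubeDataStacked <- bind_rows(cube2x2, cube3x3, cube4x4, cube5x5, cube6x6c, cube7x7c)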

Make a Year Variable

# Read year from competition variable
cubeData <- cubeData |>
  mutate(Year = str_extract(Competition, "\\d{4}")) # Source 3
cubeData$Year <- as.numeric(cubeData$Year)
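To make the pattern concrete, "\\d{4}" simply grabs the first run of four digits in the competition name. A quick check on a made-up competition string is below.

# Quick check of the year pattern on a made-up competition name
str_extract("Example Cube Open 2023", "\\d{4}")  # returns "2023"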

Remove Leading Unidentified Characters

# Region (only the first 200 rows have the stray leading character)
cubeData$Region[1:200] <- str_sub(cubeData$Region[1:200], start = 2L)
# Competition
cubeData$Competition[1:200] <- str_sub(cubeData$Competition[1:200], start = 2L)
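A more general alternative, sketched below, would strip any leading non-alphanumeric characters wherever they occur instead of indexing specific rows; the result is stored under a hypothetical name (cleaned) so it does not overwrite the working data.

# Sketch of a more general fix: drop any leading non-alphanumeric
# characters from every row (alternative to the row-indexed version above)
cleaned <- cubeData |>
  mutate(Region = str_replace(Region, "^[^[:alnum:]]+", ""),
         Competition = str_replace(Competition, "^[^[:alnum:]]+", ""))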

Linear Regression

# Make model
mod <- lm(Result ~ Event + Number, data = cubeData)
summary(mod)

Call:
lm(formula = Result ~ Event + Number, data = cubeData)

Residuals:
    Min      1Q  Median      3Q     Max 
-7.0607 -0.4821  0.2181  0.7185  1.9553 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -1.192516   0.135916  -8.774   <2e-16 ***
Event3x3      3.085689   0.159378  19.361   <2e-16 ***
Event4x4     17.224002   0.159407 108.050   <2e-16 ***
Event5x5     33.496205   0.159414 210.121   <2e-16 ***
Event6x6     66.245578   0.159423 415.534   <2e-16 ***
Event7x7    101.583124   0.159424 637.190   <2e-16 ***
Number        0.037671   0.001615  23.319   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.127 on 593 degrees of freedom
Multiple R-squared:  0.9991,    Adjusted R-squared:  0.999 
F-statistic: 1.046e+05 on 6 and 593 DF,  p-value: < 2.2e-16
# Prediction vs actual graph
modGraph <- cubeData |> ggplot(aes(Result, predict(mod))) +
  geom_point() +
  geom_abline()
ggplotly(modGraph)
# Diagnostic plots
autoplot(mod, 1:4, nrow=2, ncol=2)

Equation

solve time = -1.192516 + 0.037671(Number) + 3.085689(3x3) + 17.224002(4x4) + 33.496205(5x5) + 66.245578(6x6) + 101.583124(7x7)

Here each event term is a 0/1 indicator relative to the 2x2 baseline: the intercept (-1.192516) applies to every prediction, Number is the rank, and an event’s coefficient is added only when the solve belongs to that event.
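To see the equation in action, the fitted model can be used to work out a prediction at a chosen event and rank; the example below uses the 4x4 event at rank 50, values picked purely for illustration.

# Worked example: predicted 4x4 solve time at rank 50
# (event and rank chosen purely for illustration)
predict(mod, newdata = data.frame(Event = "4x4", Number = 50))
# which is -1.192516 + 17.224002 + 0.037671 * 50, about 17.9 seconds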

Analysis

This model attempts to predict the solve time from the event and the rank. All the p-values are very low, which makes sense since both variables are directly related to the result (bigger cubes take longer, and better ranks mean better times). By the same reasoning the equation makes sense: the bigger cubes have bigger coefficients and the rank has a small positive coefficient. The adjusted R^2 is 0.999, meaning the model explains 99.9% of the variability in the data; however, I feel this number is very misleading. Since each event has a very different average time, the model appears to predict quite well, landing within a few seconds of the true value. But because times within an event are so close together, a few seconds is actually a big difference. For example, it predicts a 2x2 solve time of 2.46 seconds when the actual time is 0.64 seconds, nearly four times the true value. While this is an extreme example and the accuracy improves for the larger cubes, it shows that the model may not be as accurate as the R^2 suggests. The diagnostic plots suggest to me that a linear model may not be appropriate, seeing as there are many outliers and the Q-Q plot shows a curved pattern. Another sign the model may be flawed is that, based on that Q-Q plot, the residuals do not appear normally distributed.
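One way to check this concern about the overall R^2 is to summarize the model’s error within each event separately and compare it to the typical solve time for that event. A minimal sketch of that comparison is below; it assumes every row has a complete Result, and the exact numbers will depend on the data as read in.

# Sketch: typical prediction error vs. typical solve time within each event,
# to show the fit is much looser than the overall R^2 suggests
cubeData |>
  mutate(Predicted = predict(mod),
         AbsError = abs(Result - Predicted)) |>
  group_by(Event) |>
  summarize(MeanResult = mean(Result),
            MeanAbsError = mean(AbsError),
            RelativeError = MeanAbsError / MeanResult)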

Exploration

# Times between events plot
cubeData |> 
  ggplot(aes(Event, Result)) +
  geom_point()

# Filter for top 5 times in each event
top5 <- cubeData |>
  filter(Number <= 5)
# Same plot but with only top5 to show competitor names
p2 <- top5 |>
  ggplot(aes(Event, Result, color = Name)) +
  geom_point(alpha = 0.5)
ggplotly(p2)

Final Visualization

# Fix rounded numbers in the top5 dataset (6x6 & 7x7)
top5$Result[21:30] <- c(58.03, 59.74, 60.00, 60.33, 60.86, 94.15, 94.59, 95.51, 95.68, 96.19)

# Clean names
top5$Name[2] <- "Vako Marchilashvili"
top5$Name[c(6, 9, 10)] <- "Yiheng Wang"
top5$Name[4] <- "Guanbo Wang"
top5$Name[8] <- "Ruihang Xu"
top5$Name[c(13, 16, 17)] <- "Tymon Kolasiński"

# Make color palette (colors from source 4)
cols <- c("#6a00ff", "#ff0040", "#ff9500", "#ffff00", "#00ff15", "#00ffff")

# Final graph
hchart(top5, "scatter", hcaes(x = Event, y = Result, group = Event)) |> # Source 5
  hc_colors(cols) |>
  hc_title(text = "Best Rubik's Cube Solve Times (2x2-7x7)") |>
  hc_yAxis(title = list(text = "Solve Time (seconds)"), maxPadding = 0) |> # Source 6
  hc_xAxis(title = list(text = "Cube Dimensions")) |>
  hc_tooltip(formatter = JS("function(){
                            return ('<b>Competitor: </b>' + this.point.Name + 
                            '<br><b>Rank: </b>' + this.point.Number +
                            '<br><b>Year: </b>' + this.point.Year +
                            '<br><b>Solve Time: </b>' + this.point.Result +
                            '<br><b>Cube Dimensions: </b>' + this.point.Event)}")) |> # Source 7
  hc_caption(text = "Source: World Cube Association")

Analysis

This visualization is a scatter plot with the cube dimensions on the x-axis and the solve time in seconds on the y-axis. I find it cool that the groups are aligned in such a way that they could be connected by an exponential-looking curve. This shape makes sense, as the total number of pieces in a Rubik’s cube also grows quickly with its dimensions. It was also interesting to see Max Park’s dominance, especially on the larger cubes: he holds all of the top 5 records for 6x6 and 7x7 and appears in the top 5 for 3x3, 4x4, and 5x5. I think highcharter’s ability to select groups to focus on greatly benefits this graph, since it allows the viewer to see the overall shape and then zoom in for more detail. I would have liked to change the color or shape of the points to represent the competitor and add that as a second legend, but I couldn’t figure out how to group the points by two variables.
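One partial workaround for the two-variable grouping issue would be to group the scatter by competitor instead of by event, which produces a Name legend at the cost of the per-event palette. The sketch below reuses the same top5 data and is only an alternative, not the chart used above.

# Sketch of an alternative grouping: legend and colors by competitor
# instead of by event (loses the per-event palette)
hchart(top5, "scatter", hcaes(x = Event, y = Result, group = Name)) |>
  hc_title(text = "Best Rubik's Cube Solve Times by Competitor") |>
  hc_yAxis(title = list(text = "Solve Time (seconds)")) |>
  hc_xAxis(title = list(text = "Cube Dimensions"))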

Sources

  1. https://www.cubelelo.com/blogs/cubing/ultimate-guide-to-wca-competition

  2. https://www.worldcubeassociation.org/results/rankings/777/single?show=100+results

  3. https://stackoverflow.com/questions/48802732/extract-a-year-number-from-a-string-that-is-surrounded-by-special-characters

  4. https://coolors.co/palettes/popular/bright

  5. https://jkunst.com/highcharter/articles/hchart.html

  6. https://api.highcharts.com/highcharts/yAxis.maxPadding

  7. https://stackoverflow.com/questions/67609684/r-highcharter-customize-tool-tip-captions