Introduction to the Datasaurus Dozen
Anscombe’s Quartet demonstrates how important it is to visualize your data. All four data sets have nearly identical summary statistics, but have very different distributions and look different when graphed.
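The "nearly identical summary statistics" claim is easy to check, because base R ships the quartet as the built-in `anscombe` data frame. The following is a minimal sketch; `quartet_stats` is just an illustrative name.

```r
# The built-in `anscombe` data frame has columns x1..x4 and y1..y4,
# one x/y pair per data set in the quartet.
quartet_stats <- do.call(rbind, lapply(1:4, function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  data.frame(set = i,
             mean_x = mean(x), mean_y = mean(y),
             sd_x = sd(x), sd_y = sd(y),
             cor_xy = cor(x, y))
}))

# Rounded to two decimals, the four rows are hard to tell apart
print(round(quartet_stats, 2))
```

Plotting each pair (for example with `plot(anscombe$x1, anscombe$y1)`) is what reveals how different the four sets really are.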
The Datasaurus Dozen is similar, but with more data sets, crazier shapes and, most importantly, a dynamite dinosaur!
I was inspired to create something similar to what AutoDesk created in this link.
Below we’ll walk through how to create and animate similar plots while discussing the trials and tribulations of everyday problems with data.
Dinosaur Puns Anyone?
Q: What do you call it when a dinosaur’s data is a complete mess? A: A tyrannosaurus wreck!
Alright, alright, terrible Dino puns aside, let’s pull in the data.
Here’s where you can find the data that’s visualized here. The code below will download the zip file to your working directory.
download.file('https://www.autodeskresearch.com/sites/default/files/SameStatsDataAndImages.zip', destfile = 'dinosaur.zip')
Acquainting Ourselves With The Data
Once you’ve downloaded the .zip file to your working directory and extracted it, we need a way to bring the files into our environment.
After seeing how many files there were, I was also inspired to create a solution for importing multiple data sources.
When reading in multiple .csv files, I’ve always manually coded each file in: xyz <- read_csv("xyz.csv"), rinse and repeat. This is easy with a couple of data sets, but imagine trying to bring in hundreds of them!
I decided enough is enough and set out to pull these .tsv (tab-separated values) files into R’s Global Environment.
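library(tidyverse) # read_tsv, %>% and ggplot2
library(glue)
wd <- getwd()
folder <- glue('{wd}/datasets')
# "\\.tsv$" is a regular expression matching file names that end in .tsv
zipF <- list.files(path = folder, pattern = "\\.tsv$", full.names = TRUE)
zipF
## [1] "C:/Users/Ty/Desktop/Anscombe/datasets/BoxPlots.tsv"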
## [2] "C:/Users/Ty/Desktop/Anscombe/datasets/DatasaurusDozen-wide.tsv"
## [3] "C:/Users/Ty/Desktop/Anscombe/datasets/DatasaurusDozen.tsv"
## [4] "C:/Users/Ty/Desktop/Anscombe/datasets/SimpsonsParadox-Wide.tsv"
## [5] "C:/Users/Ty/Desktop/Anscombe/datasets/SimpsonsParadox.tsv"
## [6] "C:/Users/Ty/Desktop/Anscombe/datasets/TwelveFromSlant-Alternate-long.tsv"
## [7] "C:/Users/Ty/Desktop/Anscombe/datasets/TwelveFromSlant-Alternate-wide.tsv"
## [8] "C:/Users/Ty/Desktop/Anscombe/datasets/TwelveFromSlant-long.tsv"
## [9] "C:/Users/Ty/Desktop/Anscombe/datasets/TwelveFromSlant-wide.tsv"
When Data Throws You For A Loop
Looks like I have 9 different files I could pull in. For this exercise, let’s bring the bundle in.
A common rule of thumb: if you have copied and pasted a block of code more than twice, it is time to convert it to a function.
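One way to act on that rule of thumb is a small helper that reads every file in a folder into a single named list. This is only a sketch and not the approach used in the rest of this post: `read_all_tsv` is a hypothetical name, the demo files are made up, and it uses base R's `read.delim` so it runs without extra packages.

```r
# Hypothetical helper: read every .tsv file in `dir` into one named list.
read_all_tsv <- function(dir) {
  paths <- list.files(dir, pattern = "\\.tsv$", full.names = TRUE)
  # Name each element after its file, minus the directory and the extension
  names(paths) <- tools::file_path_sans_ext(basename(paths))
  lapply(paths, read.delim, stringsAsFactors = FALSE)
}

# Demo on two made-up files written to a temporary folder
demo_dir <- file.path(tempdir(), "tsv_demo")
dir.create(demo_dir, showWarnings = FALSE)
write.table(data.frame(x = 1:3, y = 4:6), file.path(demo_dir, "first.tsv"),
            sep = "\t", row.names = FALSE, quote = FALSE)
write.table(data.frame(x = 7:8, y = 9:10), file.path(demo_dir, "second.tsv"),
            sep = "\t", row.names = FALSE, quote = FALSE)

demo_data <- read_all_tsv(demo_dir)
names(demo_data)
```

A named list keeps the global environment tidy, and each data frame is still one `$` away, for example `demo_data$first`.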
Loading these files into my environment isn’t difficult, but the resulting names of my data frames are frustrating. As you can see below, each name is the full path of the file.
assign(zipF[1], read_tsv(zipF[1]))
ls()[1]
## [1] "C:/Users/Ty/Desktop/Anscombe/datasets/BoxPlots.tsv"
Below is a quick loop to read in all of your delimited files while naming each data frame after its file.
For each .tsv file in our list, zipF, we’ll use gsub to strip everything to the left of the name we want and to replace .tsv with nothing.
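To see what the two substitutions do, it can help to try them on a single path first. This is a small sketch using one path copied from the file listing above; the variable names are illustrative, and the last line shows a base R alternative that is not what this post uses.

```r
# Example path copied from the file listing above
path <- "C:/Users/Ty/Desktop/Anscombe/datasets/BoxPlots.tsv"

# Drop everything up to and including "datasets/"
no_dir <- gsub(".*datasets/", "", path)
# Drop the trailing ".tsv" (the dot is escaped so it matches a literal dot)
clean <- gsub("\\.tsv$", "", no_dir)
clean

# Base R alternative that does not depend on the folder being called "datasets"
alt <- tools::file_path_sans_ext(basename(path))
alt
```

Both `clean` and `alt` come out as "BoxPlots", which is the name we want for the data frame.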
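for (i in seq_along(zipF)) {
  # Strip the directory and the .tsv extension to get a clean object name
  name <- gsub(".*datasets/", "", zipF[i]) %>%
    gsub("\\.tsv$", "", .)
  # Read the file and store it in the Global Environment under that name
  assign(name, read_tsv(zipF[i]))
}
Datasaurus Dozen In Action!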
Now that we have all of our data sets imported with names that are simple and make sense, let’s take the data set of interest, DatasaurusDozen, and dig in. If you’re familiar with the tidyverse, which includes ggplot2, then the first part below should be familiar.
In addition to the plot, the two lines below, both from gganimate, are what bring it to life.
transition_states(dataset,3,3) + ease_aes('cubic-in-out')
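library(gganimate)
p <- DatasaurusDozen %>%
  ggplot(aes(x, y)) +
  geom_point()
theme_set(theme_bw())
scatter_viz <- p +
  # Move between data sets; the two 3s are transition_length and state_length
  transition_states(dataset, 3, 3) +
  ease_aes('cubic-in-out') +
  # {closest_state} is replaced by the data set closest to the current frame
  labs(title = "{closest_state}") +
  theme(plot.title = element_text(size = 22, hjust = 0.5),
        axis.title.x = element_blank(),
        axis.title.y = element_blank())
scatter_viz
Here’s The Final Product!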
I wasn’t able to animate a table, so I decided to use a bar chart to represent the values. Also, I couldn’t figure out how to round the interpolated values between the animated frames. Here we use animate (from gganimate) along with image_read and image_append (from magick) to create the viz below.
Source for the code to append gifs: link
All in all, I’m pretty happy with the outcome. Whether it’s a scatter plot or box plot, this gif demonstrates the value of visualizing your data.
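library(magick)
# box_viz and bar_viz are the animated box plot and bar chart; their code is not shown in this post
a_gif <- animate(box_viz, width = 350, height = 350)
b_gif <- animate(scatter_viz, width = 350, height = 350)
c_gif <- animate(bar_viz, width = 350, height = 350)
# Convert each rendered animation into a magick image with one element per frame
a_mgif <- image_read(a_gif)
b_mgif <- image_read(b_gif)
c_mgif <- image_read(c_gif)
# Place the first frame of each animation side by side
new_gif <- image_append(c(a_mgif[1], b_mgif[1], c_mgif[1]))
# animate() renders 100 frames by default; append the remaining frames one at a time
for (i in 2:100) {
  combined <- image_append(c(a_mgif[i], b_mgif[i], c_mgif[i]))
  new_gif <- c(new_gif, combined)
}
new_gif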
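To check the "same stats" idea numerically rather than visually, one option is a per-data-set summary table. The sketch below assumes a long data frame with the columns dataset, x and y used in the plots above; `summarise_by_dataset` is a hypothetical helper, and the toy data is made up purely to show the call.

```r
# Hypothetical helper: one row of summary statistics per data set.
# Assumes a long data frame with columns `dataset`, `x`, and `y`.
summarise_by_dataset <- function(df) {
  pieces <- split(df, df$dataset)
  do.call(rbind, lapply(names(pieces), function(nm) {
    d <- pieces[[nm]]
    data.frame(dataset = nm,
               mean_x = mean(d$x), mean_y = mean(d$y),
               sd_x = sd(d$x), sd_y = sd(d$y),
               cor_xy = cor(d$x, d$y))
  }))
}

# Made-up toy data, only to show the call; these are not the Datasaurus values
toy <- data.frame(dataset = rep(c("a", "b"), each = 4),
                  x = c(1, 2, 3, 4, 4, 3, 2, 1),
                  y = c(2, 4, 6, 8, 2, 4, 6, 8))
toy_stats <- summarise_by_dataset(toy)
toy_stats
```

With the imported data, the call would be `summarise_by_dataset(DatasaurusDozen)`. In the toy example the two groups share the same means and standard deviations but have opposite correlations, which is a small reminder that a handful of summary numbers can hide very different shapes.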