Project 1 Outlaws Draft

Author

Z Griffin

Intro

My data is from 17Lands, a data tracker and aggregator for Magic Arena, one of the online forms of the trading card game Magic the Gathering for traditional draft for the set Outlaws at Thunder Junction (OTJ or Outlaws). The set was released in April 2024 and has a wild west theme.

In the traditional draft format, each player gets three packs, containing 14 playable cards. They open the first, pick one card, and pass the remaining the to next player. Pick and pass, repeat until all cards are picked. Repeat for next pack, pass in opposite direction. Repeat for third. Players than use their collected pool of cards along with (provided) basic land cards to build a minimum 40 card deck.

Each card has a rarity, how common the card is/how frequent it is to see in packs/ from common to mythic. Theoretically, the rarer the card, the better it is. This is not always true, sometimes due to cards being designed more for a different format. Each pack is ‘balanced’ by having guaranteed ratios of rarities: six commons; most likely a 7th common, 3 uncommons, one rare or mythic, a basic land, one wild card, and one foil wild card, and finally one non-playable slot. One theory of drafting is “first take the rare, then [for next one handed to you] choose the next best card”. If going off the assumption that rares and mythics will be chosen first, I am interested in seeing when do commons get picked, and what influences how soon they are picked.

My data set contained each card in OTJ’s “name”, “color”, “rarity”, and 17Land’s aggregated statistics of the number of times each card was seen in the draft format (“# seen”), the number of times 17Land’s users picked the card (“# picked”, and –what I am interested in– the average taken at (“ATA”), and average pick per pack (“ALSA”, renamed to “APP”).

The average taken at is the average draft pick number that players pick the card from the other available cards; with 14 playable cards per 3 packs, there are 42 picks in each draft. Average pick per pack measures the same thing, but only per each pack of 14 cards – for any elevation in ATA due to a card being in opened in later packs. For example, a card that was picked at pick 17 would have a value of 17 towards it’s ATA but a value of 3 (pack 2, pick 3) for APP. I ended up not even using the APP, as I only had time for one of the two.

The rest of the variables I needed are card/spell details, which come from the MtG card database api I acquired from Kaggle. This csv had the same basic card details as the 17Lands, along with a more nuanced “color_identity”, mana_cost (as printed on the card, mana color dependent), “cmc” or converted mana cost (the numerical only value) (think of mana as energy or currency needed to spend to play a spell). “Type” contained all information about a card’s type that I had to extract some into “base_type”, these effect when a card can be played as well as how it can be used, “power” and “toughness” are specifically for Creature types and are the attack and defense values. Finally, “text” contained the card’s text and any instructions or abilities. I had to extract variables for these. I grouped, and created “keywords” for the presence of any of typical keyword abilities and mechanics, “mana_gen” for any text that indicated the generation of mana, and finally “mechanics” for set theme/setting related game mechanics.

Data

Load libraries and set working directory, then load data sets

library(tidyverse)
library(tidymodels)
setwd("~/Schol Stuff/Montgomery College 2025/Data 110 Data Visualization/mtgdraft Project 1")

OTJ is the set code for Outlaws at Thunder Junction. otj_trad is the draft information from 17Lands, all_cards is the mtg card database from the Magic api.

otj_trad <- read_csv("outlaws_traddraft.csv")
all_cards <- read_csv("all_mtg_cards.csv") 

Data Cleaning:

##17Lands Draft Data:

Using head to get a sense of the columns and variables.

head(otj_trad)
# A tibble: 6 × 19
  Name        Color Rarity `# Seen`  ALSA `# Picked`   ATA `# GP` `% GP` `GP WR`
  <chr>       <chr> <chr>     <dbl> <dbl>      <dbl> <dbl>  <dbl> <chr>  <chr>  
1 Another Ro… W     R         11871  6.42       1466  9.92    297 2.8%   <NA>   
2 Archangel … W     M          2131  2.34        688  2.52   3856 79.9%  57.7%  
3 Armored Ar… W     C         69302  7.96       7091 11.9    5050 10.0%  53.4%  
4 Aven Inter… W     R          4872  2.55       1749  2.76  10613 84.9%  59.7%  
5 Bounding F… W     U         19153  4.35       3461  6.01  18958 76.9%  58.9%  
6 Bovine Int… W     U         25848  6.57       2494 10.4    3435 19.2%  51.8%  
# ℹ 9 more variables: `# OH` <dbl>, `OH WR` <chr>, `# GD` <dbl>, `GD WR` <chr>,
#   `# GIH` <dbl>, `GIH WR` <chr>, `# GNS` <dbl>, `GNS WR` <chr>, IIH <chr>

Rename the # Seen, # Picked to remove the number sign. Rename ALSA to better reflect what it measure: average pick (or taken) per pack. Select only the draft-relevant columns, saves me from having to fix the game play column names. Replace the color and rarity single letter codes with full words, and fill the color NAs with colorless.

otj_draft <- otj_trad |>
  rename(totseen = `# Seen`) |>
  rename(totpicked = `# Picked`) |>
  rename(APP = ALSA) |>
  rename(name = Name) |> #changing to match the second dataset so I can join based on name
  

  select(name, Color, Rarity, totseen, totpicked, APP, ATA) 
  
  otj_draft$Color[is.na(otj_draft$Color)] <- 'Colorless'
  
  otj_draft$Color[otj_draft$Color == 'B'] <- 'Black'
  otj_draft$Color[otj_draft$Color == 'U'] <- 'Blue'
  otj_draft$Color[otj_draft$Color == 'G'] <- 'Green'
  otj_draft$Color[otj_draft$Color == 'R'] <- 'Red'
  otj_draft$Color[otj_draft$Color == 'W'] <- 'White'
  
  otj_draft$Rarity[otj_draft$Rarity == 'C'] <- 'common'
  otj_draft$Rarity[otj_draft$Rarity == 'U'] <- 'uncommon'
  otj_draft$Rarity[otj_draft$Rarity == 'R'] <- 'rare'
  otj_draft$Rarity[otj_draft$Rarity == 'M'] <- 'mythic'
 
head(otj_draft)
# A tibble: 6 × 7
  name                Color Rarity   totseen totpicked   APP   ATA
  <chr>               <chr> <chr>      <dbl>     <dbl> <dbl> <dbl>
1 Another Round       White rare       11871      1466  6.42  9.92
2 Archangel of Tithes White mythic      2131       688  2.34  2.52
3 Armored Armadillo   White common     69302      7091  7.96 11.9 
4 Aven Interrupter    White rare        4872      1749  2.55  2.76
5 Bounding Felidar    White uncommon   19153      3461  4.35  6.01
6 Bovine Intervention White uncommon   25848      2494  6.57 10.4 

A Basic visualization: Draft Distribution by Rarity,

rarity_count <- otj_draft |>
  group_by(Rarity) |>
  summarize(set = n(), seen = sum(totseen), picked=sum(totpicked)) |>
  arrange(c('commmon', 'uncommon', 'rare', 'mythic'))
rarity_count
# A tibble: 4 × 4
  Rarity     set    seen picked
  <chr>    <int>   <dbl>  <dbl>
1 common      91 5233589 805215
2 uncommon   120 2485259 453768
3 rare        90  549350 122236
4 mythic      75  148581  41275
rarity_count$Rarity <- factor(rarity_count$Rarity, levels = c('common', 'uncommon', 'rare', 'mythic'))

ggplot(rarity_count, aes (x = Rarity, y = set)) +
  geom_col() +
  labs(title = 'Set Distribution in Outlaws of Thunder Junction by Rarity',
       x = 'Rarity',
       y = 'Card Count',
       caption = "Source: 17Lands") +
  theme_bw()

ggplot(data = rarity_count ) +
  geom_col(aes(x = Rarity, y= picked, fill = Rarity)) +
  labs(title = 'Draft Pick Count by Card Rarity',
       subtitle = 'for 17Lands drafters for OTJ',
       x = 'Rarity',
       y = 'Number',
       fill = " ",
       caption = 'Source: 17Lands') +
  theme_bw() +
  scale_color_brewer(palette = "Accent")

Filter once more for just the commons:

commons_draft <- otj_draft |>
  filter(Rarity == 'common')
head(commons_draft)
# A tibble: 6 × 7
  name                Color Rarity totseen totpicked   APP   ATA
  <chr>               <chr> <chr>    <dbl>     <dbl> <dbl> <dbl>
1 Armored Armadillo   White common   69302      7091  7.96 11.9 
2 Bridled Bighorn     White common   63485      8409  6.03  8.93
3 Eriette's Lullaby   White common   60898      7383  5.72  8.77
4 Holy Cow            White common   53490      8803  4.77  6.74
5 Inventive Wingsmith White common   69362      7656  7.78 11.6 
6 Mystical Tether     White common   39016     10822  3.48  4.38

the database

Now the card database! This database contains decades of Magic set releases, and I need one set in particular.

I need to filter for for the Outlaws (OTJ) cards in the card database. There are actually three more set names/codes associated with this release (which are included in the draft dataset), but the ten Special Guest (SPG) reprints and 30 (90 with variants) The Big Score (BIG) cards are all mythic, and 65 (80 with variants) the Breaking News! (OTP) are at least uncommon.

And then I remembered I ultimately going to be running my analysis on the commons, and all of those special cards not commons.

otj_baseset <- all_cards |>
  filter(set == 'OTJ')

otj_baseset
# A tibble: 374 × 39
   name   multiverse_id layout names mana_cost   cmc colors color_identity type 
   <chr>          <dbl> <chr>  <lgl> <chr>     <dbl> <chr>  <chr>          <chr>
 1 Anoth…        654942 normal NA    {X}{X}{2…     3 ['W']  ['W']          Sorc…
 2 Archa…        654943 normal NA    {1}{W}{W…     4 ['W']  ['W']          Crea…
 3 Armor…        654944 normal NA    {W}           1 ['W']  ['W']          Crea…
 4 Aven …        654945 normal NA    {1}{W}{W}     3 ['W']  ['W']          Crea…
 5 Bound…        654946 normal NA    {5}{W}        6 ['W']  ['W']          Crea…
 6 Bovin…        654947 normal NA    {1}{W}        2 ['W']  ['W']          Inst…
 7 Bridl…        654948 normal NA    {3}{W}        4 ['W']  ['W']          Crea…
 8 Claim…        654949 normal NA    {2}{W}        3 ['W']  ['W']          Crea…
 9 Dust …        654950 normal NA    {1}{W}        2 ['W']  ['W']          Crea…
10 Eriet…        654951 normal NA    {1}{W}        2 ['W']  ['W']          Sorc…
# ℹ 364 more rows
# ℹ 30 more variables: supertypes <chr>, subtypes <chr>, rarity <chr>,
#   text <chr>, flavor <chr>, artist <chr>, number <chr>, power <chr>,
#   toughness <chr>, loyalty <dbl>, variations <chr>, watermark <chr>,
#   border <lgl>, timeshifted <lgl>, hand <dbl>, life <dbl>, reserved <lgl>,
#   release_date <lgl>, starter <lgl>, rulings <chr>, foreign_names <chr>,
#   printings <chr>, original_text <chr>, original_type <chr>, …

Note that there are 374 rows, two less than otj_draft; and that this without the 105 supplementary set cards. I probably have duplicate card names due to art variations. However, I think all of those are not commons, so I should be able to solve this by filtering for just commons

Filter for just the commons:

otj_commons <- otj_baseset |>
  filter(rarity == 'Common')

head(otj_commons)
# A tibble: 6 × 39
  name    multiverse_id layout names mana_cost   cmc colors color_identity type 
  <chr>           <dbl> <chr>  <lgl> <chr>     <dbl> <chr>  <chr>          <chr>
1 Armore…        654944 normal NA    {W}           1 ['W']  ['W']          Crea…
2 Bridle…        654948 normal NA    {3}{W}        4 ['W']  ['W']          Crea…
3 Eriett…        654951 normal NA    {1}{W}        2 ['W']  ['W']          Sorc…
4 Holy C…        654957 normal NA    {2}{W}        3 ['W']  ['W']          Crea…
5 Invent…        654958 normal NA    {2}{W}        3 ['W']  ['W']          Crea…
6 Mystic…        654960 normal NA    {2}{W}        3 ['W']  ['W']          Ench…
# ℹ 30 more variables: supertypes <chr>, subtypes <chr>, rarity <chr>,
#   text <chr>, flavor <chr>, artist <chr>, number <chr>, power <chr>,
#   toughness <chr>, loyalty <dbl>, variations <chr>, watermark <chr>,
#   border <lgl>, timeshifted <lgl>, hand <dbl>, life <dbl>, reserved <lgl>,
#   release_date <lgl>, starter <lgl>, rulings <chr>, foreign_names <chr>,
#   printings <chr>, original_text <chr>, original_type <chr>,
#   legalities <chr>, source <lgl>, image_url <chr>, set <chr>, …

Global environment shows I have 106 rows compared to the 91 in commons_draft: investigating shows I forgot about the basic lands.

otj_commons2 <- otj_commons |>
  filter(!(name %in% c('Island', 'Forest', 'Mountain', 'Plains', 'Swamp')))

length(otj_commons2$name)
[1] 91

Now the two dataframes match in length, and joining will be possible. But first, I need to clean up some of these columns.

colnames(otj_commons2)
 [1] "name"           "multiverse_id"  "layout"         "names"         
 [5] "mana_cost"      "cmc"            "colors"         "color_identity"
 [9] "type"           "supertypes"     "subtypes"       "rarity"        
[13] "text"           "flavor"         "artist"         "number"        
[17] "power"          "toughness"      "loyalty"        "variations"    
[21] "watermark"      "border"         "timeshifted"    "hand"          
[25] "life"           "reserved"       "release_date"   "starter"       
[29] "rulings"        "foreign_names"  "printings"      "original_text" 
[33] "original_type"  "legalities"     "source"         "image_url"     
[37] "set"            "set_name"       "id"            

The commons should only have layout normal and no supertypes: confirm wth unique

unique(otj_commons2$layout)
[1] "normal"
unique(otj_commons2$supertypes)
[1] NA

choosing only the columns I need so my final df isn’t huge and re-ordering The distinction the magic api makes between colors and color_identity does not apply to any any of the common cards except the dual lands, and I would consider the lands to be colored rather than colorless, so I don’t need both columns.

Also filing the NAs of color_identity with “colorless”, and replacing the single letter codes for colors with words

otj_com3 <- otj_commons2 |>
  select(name, color_identity, mana_cost, cmc, type, subtypes, power, toughness, text) 
  
 
otj_com3$color_identity[is.na(otj_com3$color_identity)] <- 'Colorless'

otj_com3$color_identity[otj_com3$identiy == "['B']"] <- 'Black'
Warning: Unknown or uninitialised column: `identiy`.
otj_com3$color_identity[otj_com3$identiy == "['U']"] <- 'Blue'
Warning: Unknown or uninitialised column: `identiy`.
otj_com3$color_identity[otj_com3$identiy == "['G']"] <- 'Green'
Warning: Unknown or uninitialised column: `identiy`.
otj_com3$color_identity[otj_com3$identiy == "['R']"] <- 'Red'
Warning: Unknown or uninitialised column: `identiy`.
otj_com3$color_identity[otj_com3$identiy == "['W']"] <- 'White'
Warning: Unknown or uninitialised column: `identiy`.
head(otj_com3)
# A tibble: 6 × 9
  name       color_identity mana_cost   cmc type  subtypes power toughness text 
  <chr>      <chr>          <chr>     <dbl> <chr> <chr>    <chr> <chr>     <chr>
1 Armored A… ['W']          {W}           1 Crea… ['Armad… 0     4         "War…
2 Bridled B… ['W']          {3}{W}        4 Crea… ['Sheep… 3     4         "Vig…
3 Eriette's… ['W']          {1}{W}        2 Sorc… <NA>     <NA>  <NA>      "Des…
4 Holy Cow   ['W']          {2}{W}        3 Crea… ['Ox', … 2     2         "Fla…
5 Inventive… ['W']          {2}{W}        3 Crea… ['Dwarf… 2     4         "At …
6 Mystical … ['W']          {2}{W}        3 Ench… <NA>     <NA>  <NA>      "You…

Well that didn’t work, and I haven’t been able to fix it. Thankfully, I have multiple color related colors.

Join the data sets

comms_joined <- left_join(commons_draft, otj_com3, by = 'name')

head(comms_joined)
# A tibble: 6 × 15
  name       Color Rarity totseen totpicked   APP   ATA color_identity mana_cost
  <chr>      <chr> <chr>    <dbl>     <dbl> <dbl> <dbl> <chr>          <chr>    
1 Armored A… White common   69302      7091  7.96 11.9  ['W']          {W}      
2 Bridled B… White common   63485      8409  6.03  8.93 ['W']          {3}{W}   
3 Eriette's… White common   60898      7383  5.72  8.77 ['W']          {1}{W}   
4 Holy Cow   White common   53490      8803  4.77  6.74 ['W']          {2}{W}   
5 Inventive… White common   69362      7656  7.78 11.6  ['W']          {2}{W}   
6 Mystical … White common   39016     10822  3.48  4.38 ['W']          {2}{W}   
# ℹ 6 more variables: cmc <dbl>, type <chr>, subtypes <chr>, power <chr>,
#   toughness <chr>, text <chr>

More Cleaning: Getting my variables in a Usable State

I can’t drop NAs, as any NA value such as Eriette’s Lullaby and Mystical Tether for power, and toughness is because Instants and Enchantments do not have combat stats of power and toughness. I don’t want to fill them with zeros because 0/0 for power/toughness means something completely different.
I can fill the land mana costs with zeros if needed, though. That’s inaccurate but true.Subtypes I’m not planning on using as is, except to make a new column

Trying to remove the extra information from type. Using str_replace and mutate to create a new column with just the main card type (artifact, artifact creature, enchantment, instant, sorcery, land)

join2 <- comms_joined |>
  mutate(base_type = str_replace(type, "—.*$", ""), .after = 'subtypes')
         
head(join2)
# A tibble: 6 × 16
  name       Color Rarity totseen totpicked   APP   ATA color_identity mana_cost
  <chr>      <chr> <chr>    <dbl>     <dbl> <dbl> <dbl> <chr>          <chr>    
1 Armored A… White common   69302      7091  7.96 11.9  ['W']          {W}      
2 Bridled B… White common   63485      8409  6.03  8.93 ['W']          {3}{W}   
3 Eriette's… White common   60898      7383  5.72  8.77 ['W']          {1}{W}   
4 Holy Cow   White common   53490      8803  4.77  6.74 ['W']          {2}{W}   
5 Inventive… White common   69362      7656  7.78 11.6  ['W']          {2}{W}   
6 Mystical … White common   39016     10822  3.48  4.38 ['W']          {2}{W}   
# ℹ 7 more variables: cmc <dbl>, type <chr>, subtypes <chr>, base_type <chr>,
#   power <chr>, toughness <chr>, text <chr>

Grouping by type, and getting totals for the set, what was seen in the drafts, what users picked , mostly to see if the string replace worked.

type_counts <- join2 |>
  group_by(base_type) |>
  summarize(count=n(), seen = sum(totseen), picked = sum(totpicked))
type_counts
# A tibble: 8 × 4
  base_type            count    seen picked
  <chr>                <int>   <dbl>  <dbl>
1 "Artifact "              1   69466   8316
2 "Artifact Creature "     3  201959  27524
3 "Creature "             48 2950412 425706
4 "Enchantment"            1   39016  10822
5 "Enchantment "           2  134630  16936
6 "Instant"               16  967872 148991
7 "Land "                 12  402571  89168
8 "Sorcery"                8  467663  77752

I do not know why it is still breaking the enchantments up into two different groups. There were two Enchantment-Auras and one regular Enchantment, but the str_replace otherwise worked perfectly. Let’s see what the difference is via unique.

unique(join2$base_type)
[1] "Creature "          "Sorcery"            "Enchantment"       
[4] "Instant"            "Enchantment "       "Artifact "         
[7] "Artifact Creature " "Land "             

Remove the extra space from one of the enchantment basetypes, and make them both named Enchantment.

join2$base_type[join2$base_type == 'Enchantment '] <- "Enchantment"
join2$base_type[join2$base_type == "Enchantment"] <- "Enchantment"

unique(join2$base_type)
[1] "Creature "          "Sorcery"            "Enchantment"       
[4] "Instant"            "Artifact "          "Artifact Creature "
[7] "Land "             

NOw I need to get the rest of my variables: Extract Keywords: Magic has multiple keywords in a card’s text that give them abilities that are advantageous, such as Flying and Deathtouch. I didn’t expect every single common to have ability or effect text, so I need to extract those keywords where they exist and create a new column.

While having a column per keyword would be more accurate, that would leave me with way too many variables to deal for project. So I am grouping,

join4 <- join2 |>
  #common magic keywords, 
  mutate(keywords = if_else(grepl("counter on|Double strike|First strike|Flying|Haste|Indestructible|Lifelink|Reach|Trample |Vigilance|Ward",text, ignore.case = TRUE), "keyword", "no"), .before = text) |>
  
  
  #mana generators, very important. need grepl for it's case ignoring. Grepl kept giving errors, so str_detect and learning how to make it ignore case it is
  mutate(mana_gen = case_when(
    str_detect(text, "(?i)add one mana") ~"yes", 
    str_detect(text, "(?i)add \\{") ~ "yes", # whn the card specifies the color of the mana, the curly brackets were helpful! 
    str_detect(text, "(?i)create a treasure") ~ "yes",
    .default = "no"), .before = text) |>
  
  # set specific mechanics and themes
  #grouping the mechanics of Outlaw(a creature grouping), Mount/saddle, Plot,Spree, and Commit A Crimes. this code will grab spells that care about having creatures that fulfill the criteria, and those creatures.
  #Commit a Crime: cards that give bonuses for having done crimes, and the cards that qualify as crimes
  # Mostly this means targeting an opponent or something they control, which while some spells specify, others just say "target creature" or "target artifact", meaning you could choose your own (it might even be generally used that way). I'm having to specify some exclusions-- thankfully this is a small dataset. 
  mutate(mechanics = case_when(
    str_detect(text, "outlaw") ~ "yes",
    str_detect(subtypes, "Assassin") ~ "yes",
    str_detect(subtypes, "Mercenary") ~ "yes",
    str_detect(subtypes, "Pirate") ~ "yes",
    str_detect(subtypes, "Rogue") ~ "yes",
    str_detect(subtypes, "Warlock") ~ "yes",
    str_detect(text, "Spree") ~ "yes",
    str_detect(text, "Plot") ~ "yes",
    str_detect(subtypes, "Mount") ~ "yes", # creatures that ARE mounts
    str_detect(text, "(?!)mount") ~ "yes", # cards that care if OTHERS are mounts
    str_detect(text, "commit") ~ "yes",
    str_detect(text, "target creature you") ~ "no", #excluding spells that only apply to the owner's, no cards in this dataset specified other types the owner controlled
    str_detect(text, "becomes the target") ~ "no", # the explanatory text for Ward includes the word target, needs be excluded before I look for all instances of the word target
    str_detect(text, "target") ~ "yes",
    .default = "no"), .before =text)  

head(join4)
# A tibble: 6 × 19
  name       Color Rarity totseen totpicked   APP   ATA color_identity mana_cost
  <chr>      <chr> <chr>    <dbl>     <dbl> <dbl> <dbl> <chr>          <chr>    
1 Armored A… White common   69302      7091  7.96 11.9  ['W']          {W}      
2 Bridled B… White common   63485      8409  6.03  8.93 ['W']          {3}{W}   
3 Eriette's… White common   60898      7383  5.72  8.77 ['W']          {1}{W}   
4 Holy Cow   White common   53490      8803  4.77  6.74 ['W']          {2}{W}   
5 Inventive… White common   69362      7656  7.78 11.6  ['W']          {2}{W}   
6 Mystical … White common   39016     10822  3.48  4.38 ['W']          {2}{W}   
# ℹ 10 more variables: cmc <dbl>, type <chr>, subtypes <chr>, base_type <chr>,
#   power <chr>, toughness <chr>, keywords <chr>, mana_gen <chr>,
#   mechanics <chr>, text <chr>

I’m not entirely happy with this approach, I know that I personally don’t weigh all keywords equally, but it’s the best I have at the moment.

Finally doing some things with the pick data

summary(join4)
     name              Color              Rarity             totseen     
 Length:91          Length:91          Length:91          Min.   :27576  
 Class :character   Class :character   Class :character   1st Qu.:50054  
 Mode  :character   Mode  :character   Mode  :character   Median :62702  
                                                          Mean   :57512  
                                                          3rd Qu.:67355  
                                                          Max.   :70918  
   totpicked          APP             ATA         color_identity    
 Min.   : 6086   Min.   :3.290   Min.   : 3.920   Length:91         
 1st Qu.: 7649   1st Qu.:4.765   1st Qu.: 6.605   Class :character  
 Median : 8483   Median :6.030   Median : 8.760   Mode  :character  
 Mean   : 8849   Mean   :5.951   Mean   : 8.488                     
 3rd Qu.:10008   3rd Qu.:7.185   3rd Qu.:10.405                     
 Max.   :12642   Max.   :8.360   Max.   :12.300                     
  mana_cost              cmc            type             subtypes        
 Length:91          Min.   :0.000   Length:91          Length:91         
 Class :character   1st Qu.:1.000   Class :character   Class :character  
 Mode  :character   Median :2.000   Mode  :character   Mode  :character  
                    Mean   :2.396                                        
                    3rd Qu.:3.000                                        
                    Max.   :6.000                                        
  base_type            power            toughness           keywords        
 Length:91          Length:91          Length:91          Length:91         
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
   mana_gen          mechanics             text          
 Length:91          Length:91          Length:91         
 Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character  
                                                         
                                                         
                                                         

Honestly that’s a lower max ATA than I expected, considering there are 42-45 picks. Also, power and toughness need to become numerical

join4$power <- as.numeric(join4$power)
join4$toughness <- as.numeric(join4$toughness)
  
summary(Filter(is.numeric,join4)) 
    totseen        totpicked          APP             ATA        
 Min.   :27576   Min.   : 6086   Min.   :3.290   Min.   : 3.920  
 1st Qu.:50054   1st Qu.: 7649   1st Qu.:4.765   1st Qu.: 6.605  
 Median :62702   Median : 8483   Median :6.030   Median : 8.760  
 Mean   :57512   Mean   : 8849   Mean   :5.951   Mean   : 8.488  
 3rd Qu.:67355   3rd Qu.:10008   3rd Qu.:7.185   3rd Qu.:10.405  
 Max.   :70918   Max.   :12642   Max.   :8.360   Max.   :12.300  
                                                                 
      cmc            power         toughness    
 Min.   :0.000   Min.   :0.000   Min.   :1.000  
 1st Qu.:1.000   1st Qu.:2.000   1st Qu.:2.000  
 Median :2.000   Median :3.000   Median :2.000  
 Mean   :2.396   Mean   :2.608   Mean   :2.706  
 3rd Qu.:3.000   3rd Qu.:3.000   3rd Qu.:4.000  
 Max.   :6.000   Max.   :6.000   Max.   :5.000  
                 NA's   :40      NA's   :40     

Finally, start on the actual model

Start with converted mana cost vs average taken at.

###A Scatter Plot:

scatter1 <- join4 |>
  ggplot(aes(x = cmc, y = ATA) )+
  labs(title = 'Average Taken At vs Converted Mana Cost \n for 17Lands MTG Outlaws of Thunder Junction players',
       x = 'Converted Mana Cost ',
       y = 'Average Taken At',
       caption = 'source: 17Lands & Wizards of the Coast') +
  theme_minimal() +
  geom_point(size = 2.5, alpha = 0.6) 
  
  

scatter1 

That’s not linear at all but we’ll run the smoother.

p2 <- scatter1 + geom_smooth(method='lm',formula=y~x)

p2

Well that’s pretty terrible. Fit summary:

fit1 <- lm(ATA ~ cmc, data = join4)
summary(fit1)

Call:
lm(formula = ATA ~ cmc, data = join4)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.0276 -1.4074  0.2558  1.7125  4.3542 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   7.5574     0.4233  17.853   <2e-16 ***
cmc           0.3884     0.1484   2.616   0.0104 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.191 on 89 degrees of freedom
Multiple R-squared:  0.07142,   Adjusted R-squared:  0.06098 
F-statistic: 6.845 on 1 and 89 DF,  p-value: 0.01044

As expected, cmc is not doing much to help this linear model. I need more variables! Most of my variables are categorical. Power and toughness are the only quantitative, and only Creatures and Artifact Creatures have those. Recall that I did not fill the NAs for power and toughness for the other spell types

#convert categorical variables to factors
join4$color_identity <- as.factor(join4$color_identity)
join4$base_type <- as.factor(join4$base_type)
join4$keywords <- as.factor(join4$keywords)
join4$mana_gen <- as.factor(join4$mana_gen)
join4$mechanics<- as.factor(join4$mechanics)

fit2 <- lm(ATA ~ cmc + power + toughness + color_identity + base_type + keywords + mana_gen + mechanics, data = join4)
#autoplot(fit2, 1:4, nrow=2, ncol=2)
summary(fit2)

Call:
lm(formula = ATA ~ cmc + power + toughness + color_identity + 
    base_type + keywords + mana_gen + mechanics, data = join4)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.1575 -1.0707  0.3149  1.0707  2.6108 

Coefficients: (1 not defined because of singularities)
                        Estimate Std. Error t value Pr(>|t|)    
(Intercept)              9.09312    1.02268   8.892 6.39e-11 ***
cmc                     -0.46118    0.45624  -1.011  0.31833    
power                    0.08695    0.40125   0.217  0.82957    
toughness                0.67681    0.29360   2.305  0.02656 *  
color_identity['G']     -3.03486    0.89044  -3.408  0.00153 ** 
color_identity['R']      0.26577    0.85626   0.310  0.75792    
color_identity['U']      1.40648    0.83260   1.689  0.09915 .  
color_identity['W']     -0.60916    0.86179  -0.707  0.48386    
color_identityColorless  0.26566    1.23892   0.214  0.83133    
base_typeCreature             NA         NA      NA       NA    
keywordsno               0.01836    0.56289   0.033  0.97414    
mana_genyes              0.27627    0.74978   0.368  0.71452    
mechanicsyes            -0.50079    0.60950  -0.822  0.41627    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.626 on 39 degrees of freedom
  (40 observations deleted due to missingness)
Multiple R-squared:  0.5224,    Adjusted R-squared:  0.3877 
F-statistic: 3.879 on 11 and 39 DF,  p-value: 0.0008024
  1. I don’t understand why base_typeCreature didn’t get coefficients, nor why none of the other types are represented, so I’m going to take out base_type as well.
  2. largest p-value is keywordsno, so that get’s removed from the model
fit3<- lm(ATA ~ cmc + power + toughness + base_type + color_identity+ mana_gen + mechanics, data = join4)
#autoplot(fit3, 1:4, nrow=2, ncol=2)
summary(fit3)

Call:
lm(formula = ATA ~ cmc + power + toughness + base_type + color_identity + 
    mana_gen + mechanics, data = join4)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.1628 -1.0712  0.3114  1.0591  2.6188 

Coefficients: (1 not defined because of singularities)
                        Estimate Std. Error t value Pr(>|t|)    
(Intercept)              9.37769    1.09696   8.549 1.45e-10 ***
cmc                     -0.46210    0.44964  -1.028  0.31026    
power                    0.08936    0.38942   0.229  0.81966    
toughness                0.67455    0.28177   2.394  0.02144 *  
base_typeCreature       -0.27915    1.15314  -0.242  0.80996    
color_identity['G']     -3.02912    0.86186  -3.515  0.00111 ** 
color_identity['R']      0.27289    0.81755   0.334  0.74028    
color_identity['U']      1.41167    0.80703   1.749  0.08792 .  
color_identity['W']     -0.60947    0.85091  -0.716  0.47800    
color_identityColorless       NA         NA      NA       NA    
mana_genyes              0.27956    0.73366   0.381  0.70518    
mechanicsyes            -0.49934    0.60022  -0.832  0.41039    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.605 on 40 degrees of freedom
  (40 observations deleted due to missingness)
Multiple R-squared:  0.5224,    Adjusted R-squared:  0.403 
F-statistic: 4.376 on 10 and 40 DF,  p-value: 0.0003611

I’m still wondering abot what’s with the na’s on Base_type, let’s try removing that

fit4<- lm(ATA ~ cmc + power + toughness + color_identity + mana_gen + mechanics, data = join4)
summary(fit4)

Call:
lm(formula = ATA ~ cmc + power + toughness + color_identity + 
    mana_gen + mechanics, data = join4)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.1628 -1.0712  0.3114  1.0591  2.6188 

Coefficients:
                        Estimate Std. Error t value Pr(>|t|)    
(Intercept)              9.09854    0.99639   9.132 2.48e-11 ***
cmc                     -0.46210    0.44964  -1.028  0.31026    
power                    0.08936    0.38942   0.229  0.81966    
toughness                0.67455    0.28177   2.394  0.02144 *  
color_identity['G']     -3.02912    0.86186  -3.515  0.00111 ** 
color_identity['R']      0.27289    0.81755   0.334  0.74028    
color_identity['U']      1.41167    0.80703   1.749  0.08792 .  
color_identity['W']     -0.60947    0.85091  -0.716  0.47800    
color_identityColorless  0.27915    1.15314   0.242  0.80996    
mana_genyes              0.27956    0.73366   0.381  0.70518    
mechanicsyes            -0.49934    0.60022  -0.832  0.41039    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.605 on 40 degrees of freedom
  (40 observations deleted due to missingness)
Multiple R-squared:  0.5224,    Adjusted R-squared:  0.403 
F-statistic: 4.376 on 10 and 40 DF,  p-value: 0.0003611
plot(fit4) #getting an error on rendering autoplot, so having to change to plot

absolutely no change in adjusted R squared, which did increase from fit 2 to fit3, so leaving base_type out.

largest p-value is POWER, remove that for fit5

fit5<- lm(ATA ~ cmc + toughness + color_identity+ mana_gen + mechanics, data = join4)
summary(fit5)

Call:
lm(formula = ATA ~ cmc + toughness + color_identity + mana_gen + 
    mechanics, data = join4)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.1215 -1.0900  0.2713  1.1031  2.5889 

Coefficients:
                        Estimate Std. Error t value Pr(>|t|)    
(Intercept)               9.0749     0.9795   9.264 1.32e-11 ***
cmc                      -0.3742     0.2327  -1.608 0.115569    
toughness                 0.6585     0.2697   2.441 0.019046 *  
color_identity['G']      -2.9581     0.7951  -3.720 0.000597 ***
color_identity['R']       0.3022     0.7981   0.379 0.706935    
color_identity['U']       1.4035     0.7969   1.761 0.085657 .  
color_identity['W']      -0.6045     0.8408  -0.719 0.476226    
color_identityColorless   0.2872     1.1392   0.252 0.802247    
mana_genyes               0.2385     0.7032   0.339 0.736229    
mechanicsyes             -0.4820     0.5885  -0.819 0.417542    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.587 on 41 degrees of freedom
  (40 observations deleted due to missingness)
Multiple R-squared:  0.5218,    Adjusted R-squared:  0.4168 
F-statistic: 4.971 on 9 and 41 DF,  p-value: 0.0001556
plot(fit5)

Ajusted R-squared went up to .4168, next highest p-value is in the color_identityColorless; but Color_identity[‘G’] has my lowest p-values, so I don’t think I want to get rid of those. If I do…

fit6<- lm(ATA ~ cmc + toughness + mana_gen + mechanics, data = join4)
summary(fit6)

Call:
lm(formula = ATA ~ cmc + toughness + mana_gen + mechanics, data = join4)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.4114 -1.5134  0.2917  1.6787  2.8322 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)   8.32907    0.96317   8.648 3.35e-11 ***
cmc          -0.34754    0.29243  -1.188   0.2408    
toughness     0.60552    0.32505   1.863   0.0689 .  
mana_genyes  -0.08053    0.85270  -0.094   0.9252    
mechanicsyes  0.14833    0.66373   0.223   0.8241    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.089 on 46 degrees of freedom
  (40 observations deleted due to missingness)
Multiple R-squared:  0.07033,   Adjusted R-squared:  -0.01051 
F-statistic:  0.87 on 4 and 46 DF,  p-value: 0.4892
plot(fit6)

yeah, my instincts were right: adjusted r-squared tanks to a negative number (how? it’s squared?!) and the residuals vs fitted plot curves even worse.

Let’s try taking the second highest p-value away from fit5 instead: mana_gen

fit7<- lm(ATA ~ cmc + toughness + color_identity + mechanics, data = join4)
summary(fit7)

Call:
lm(formula = ATA ~ cmc + toughness + color_identity + mechanics, 
    data = join4)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.1549 -0.9441  0.3321  1.0809  2.5270 

Coefficients:
                        Estimate Std. Error t value Pr(>|t|)    
(Intercept)               9.1278     0.9568   9.540 4.49e-12 ***
cmc                      -0.3797     0.2297  -1.653 0.105770    
toughness                 0.6625     0.2666   2.485 0.017028 *  
color_identity['G']      -2.9476     0.7861  -3.750 0.000535 ***
color_identity['R']       0.3506     0.7769   0.451 0.654092    
color_identity['U']       1.3695     0.7822   1.751 0.087266 .  
color_identity['W']      -0.6453     0.8233  -0.784 0.437545    
color_identityColorless   0.3259     1.1215   0.291 0.772774    
mechanicsyes             -0.4985     0.5803  -0.859 0.395213    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.57 on 42 degrees of freedom
  (40 observations deleted due to missingness)
Multiple R-squared:  0.5205,    Adjusted R-squared:  0.4291 
F-statistic: 5.698 on 8 and 42 DF,  p-value: 6.429e-05
plot(fit7)

My residuals plot is still not great, but R squared is getting better. Mechanics is the only thing left other than color_identity, try taking that out:

fit8<- lm(ATA ~ cmc + toughness + color_identity, data = join4)
summary(fit8)

Call:
lm(formula = ATA ~ cmc + toughness + color_identity, data = join4)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.1691 -0.8897  0.3514  1.0907  2.8247 

Coefficients:
                        Estimate Std. Error t value Pr(>|t|)    
(Intercept)               8.6411     0.7686  11.243 2.19e-14 ***
cmc                      -0.3914     0.2286  -1.712 0.094090 .  
toughness                 0.6754     0.2654   2.545 0.014603 *  
color_identity['G']      -2.7598     0.7528  -3.666 0.000672 ***
color_identity['R']       0.3382     0.7744   0.437 0.664529    
color_identity['U']       1.4690     0.7712   1.905 0.063507 .  
color_identity['W']      -0.3888     0.7649  -0.508 0.613806    
color_identityColorless   0.6519     1.0521   0.620 0.538785    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.565 on 43 degrees of freedom
  (40 observations deleted due to missingness)
Multiple R-squared:  0.512, Adjusted R-squared:  0.4326 
F-statistic: 6.446 on 7 and 43 DF,  p-value: 3.266e-05
plot(fit8)

Adjusted R squared went up down a little – it went up earlier and I haven’t changed anything— and it’s back to being higher in fit8 again– so I have no explanation for this! This means that now fit8 is once again best version of the model, even though it’s still pretty terrible – and the residual plot shows that this is simply not a very good data set for a linear model. I probaly need a transformation.

Visualization!

#join4$Color <- factor(join4$Color, levels = c("Black", "Blue", "Green", "Red", "White", "Colorless"))

violins <- join4 |>
  ggplot() +
  geom_violin(aes(x = Color, y = ATA, fill = Color)) +
  labs(title = 'Average Taken At for Commons of OTJ by Card Color',
       x = 'Color',
       y = 'Average Taken At',
       fill = " ",
       caption = "Source: 17Lands and Wizards of the Coast") +
  scale_fill_manual(values = c("Black" ="#4f3369", "Blue" = "#009bfc", "Green" = "#02db47", "Red"= "#d9161c", "White" = "#d9d502", "Colorless" = "#C0C0C0")) +
  theme_bw()

violins

Data cleaning and Challenges

For data cleaning, I had to filter to first rename columns in the 17Lands dataset. I also had to fill in the NAs in Color and rename the single letter codes to words. After selecting just the variables I needed, filter for the commons.The card database proved more difficult. I first filtered for the Outlaws at Thunder Junction set, and then the rarity of commons. Then, when there were still more cards than in the draft data set, I remembered about basic lands, and filter to exclude those. At that point, I could figure out which columns I needed from the Magic card database (there were many I did not), and selected those I did. After, the two dataframes into comms_joined. Comms_joined still needed data cleaning. I had to create a more useful Type variable, that did not list the cards entire type via mutate and str_replace. I had to use grepl and str_detect a lot to create the rest of my variables via mutate for join4.

My visualization represents the distribution of average picks within each color, Color identity proved to be the most influential variable in my linear model, despite a linear model being an absolutely terrible model at all for this dataset. I am absolutely not surprised by this result, but I am surprised at how little the other variables mattered. Possibly I over corrected for having too many variables by grouping too many together (and I definitely forgot some mechanics!) but I couldn’t think of good ways for smaller groupings.

I wanted to use the Duskmourn set, as I really like that set. The magic api didn’t have support for R, and I had no idea what I was doing. Another card database site that I was able to figure out how to use the api and get a JSON only gave me a list of cards, but not the full information. So I resorted to using a set that was on a csv file already acquired from the magic api and on kaggle that sadly was too old to have Duskmourn. This actually may have turned out to be a good thing, as one of the new mechanics in Duskmourn involved more split cards. I had a hard time at first trying to figure out what on earth I was going to do with split-spell cards– cards that have two spells on them– as they were represeted twice in the card database, once for each spell name– but only once on the Draft data set. After spending quite some time thinking on this, I realized none of those cards were commons, an it didn’t matter at all. I got a bit lost in data cleaning and making tables and visualizations and forgot what my actual goal was, which while informative and education, also lost me a lot of time and is disastrous when a migrainer in Maryland in the Spring.

I had multiple issues with grepl and str_detect that were incredibly frustrating, and usually because of a comma in the wrong place.

I’m now thinking it might have been better to do something with the total number seen or picked, possibly try to include that either in the linear model, and/or use for the visualization, but it is too late now.

References:

https://www.17lands.com/metrics_definitions

https://magic.wizards.com/en/news/making-magic/what-are-play-boosters

https://scryfall.com/sets/otp?order=name&as=checklist

https://www.kaggle.com/datasets/patrickgendotti/mtg-all-cards