library(tidyverse)
library(tidymodels)
setwd("~/Schol Stuff/Montgomery College 2025/Data 110 Data Visualization/mtgdraft Project 1")Project 1 Outlaws Draft
Intro
My data is from 17Lands, a data tracker and aggregator for Magic Arena, one of the online forms of the trading card game Magic the Gathering for traditional draft for the set Outlaws at Thunder Junction (OTJ or Outlaws). The set was released in April 2024 and has a wild west theme.
In the traditional draft format, each player gets three packs, containing 14 playable cards. They open the first, pick one card, and pass the remaining the to next player. Pick and pass, repeat until all cards are picked. Repeat for next pack, pass in opposite direction. Repeat for third. Players than use their collected pool of cards along with (provided) basic land cards to build a minimum 40 card deck.
Each card has a rarity, how common the card is/how frequent it is to see in packs/ from common to mythic. Theoretically, the rarer the card, the better it is. This is not always true, sometimes due to cards being designed more for a different format. Each pack is ‘balanced’ by having guaranteed ratios of rarities: six commons; most likely a 7th common, 3 uncommons, one rare or mythic, a basic land, one wild card, and one foil wild card, and finally one non-playable slot. One theory of drafting is “first take the rare, then [for next one handed to you] choose the next best card”. If going off the assumption that rares and mythics will be chosen first, I am interested in seeing when do commons get picked, and what influences how soon they are picked.
My data set contained each card in OTJ’s “name”, “color”, “rarity”, and 17Land’s aggregated statistics of the number of times each card was seen in the draft format (“# seen”), the number of times 17Land’s users picked the card (“# picked”, and –what I am interested in– the average taken at (“ATA”), and average pick per pack (“ALSA”, renamed to “APP”).
The average taken at is the average draft pick number that players pick the card from the other available cards; with 14 playable cards per 3 packs, there are 42 picks in each draft. Average pick per pack measures the same thing, but only per each pack of 14 cards – for any elevation in ATA due to a card being in opened in later packs. For example, a card that was picked at pick 17 would have a value of 17 towards it’s ATA but a value of 3 (pack 2, pick 3) for APP. I ended up not even using the APP, as I only had time for one of the two.
The rest of the variables I needed are card/spell details, which come from the MtG card database api I acquired from Kaggle. This csv had the same basic card details as the 17Lands, along with a more nuanced “color_identity”, mana_cost (as printed on the card, mana color dependent), “cmc” or converted mana cost (the numerical only value) (think of mana as energy or currency needed to spend to play a spell). “Type” contained all information about a card’s type that I had to extract some into “base_type”, these effect when a card can be played as well as how it can be used, “power” and “toughness” are specifically for Creature types and are the attack and defense values. Finally, “text” contained the card’s text and any instructions or abilities. I had to extract variables for these. I grouped, and created “keywords” for the presence of any of typical keyword abilities and mechanics, “mana_gen” for any text that indicated the generation of mana, and finally “mechanics” for set theme/setting related game mechanics.
Data
Load libraries and set working directory, then load data sets
OTJ is the set code for Outlaws at Thunder Junction. otj_trad is the draft information from 17Lands, all_cards is the mtg card database from the Magic api.
otj_trad <- read_csv("outlaws_traddraft.csv")
all_cards <- read_csv("all_mtg_cards.csv") Data Cleaning:
##17Lands Draft Data:
Using head to get a sense of the columns and variables.
head(otj_trad)# A tibble: 6 × 19
Name Color Rarity `# Seen` ALSA `# Picked` ATA `# GP` `% GP` `GP WR`
<chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
1 Another Ro… W R 11871 6.42 1466 9.92 297 2.8% <NA>
2 Archangel … W M 2131 2.34 688 2.52 3856 79.9% 57.7%
3 Armored Ar… W C 69302 7.96 7091 11.9 5050 10.0% 53.4%
4 Aven Inter… W R 4872 2.55 1749 2.76 10613 84.9% 59.7%
5 Bounding F… W U 19153 4.35 3461 6.01 18958 76.9% 58.9%
6 Bovine Int… W U 25848 6.57 2494 10.4 3435 19.2% 51.8%
# ℹ 9 more variables: `# OH` <dbl>, `OH WR` <chr>, `# GD` <dbl>, `GD WR` <chr>,
# `# GIH` <dbl>, `GIH WR` <chr>, `# GNS` <dbl>, `GNS WR` <chr>, IIH <chr>
Rename the # Seen, # Picked to remove the number sign. Rename ALSA to better reflect what it measure: average pick (or taken) per pack. Select only the draft-relevant columns, saves me from having to fix the game play column names. Replace the color and rarity single letter codes with full words, and fill the color NAs with colorless.
otj_draft <- otj_trad |>
rename(totseen = `# Seen`) |>
rename(totpicked = `# Picked`) |>
rename(APP = ALSA) |>
rename(name = Name) |> #changing to match the second dataset so I can join based on name
select(name, Color, Rarity, totseen, totpicked, APP, ATA)
otj_draft$Color[is.na(otj_draft$Color)] <- 'Colorless'
otj_draft$Color[otj_draft$Color == 'B'] <- 'Black'
otj_draft$Color[otj_draft$Color == 'U'] <- 'Blue'
otj_draft$Color[otj_draft$Color == 'G'] <- 'Green'
otj_draft$Color[otj_draft$Color == 'R'] <- 'Red'
otj_draft$Color[otj_draft$Color == 'W'] <- 'White'
otj_draft$Rarity[otj_draft$Rarity == 'C'] <- 'common'
otj_draft$Rarity[otj_draft$Rarity == 'U'] <- 'uncommon'
otj_draft$Rarity[otj_draft$Rarity == 'R'] <- 'rare'
otj_draft$Rarity[otj_draft$Rarity == 'M'] <- 'mythic'
head(otj_draft)# A tibble: 6 × 7
name Color Rarity totseen totpicked APP ATA
<chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 Another Round White rare 11871 1466 6.42 9.92
2 Archangel of Tithes White mythic 2131 688 2.34 2.52
3 Armored Armadillo White common 69302 7091 7.96 11.9
4 Aven Interrupter White rare 4872 1749 2.55 2.76
5 Bounding Felidar White uncommon 19153 3461 4.35 6.01
6 Bovine Intervention White uncommon 25848 2494 6.57 10.4
A Basic visualization: Draft Distribution by Rarity,
rarity_count <- otj_draft |>
group_by(Rarity) |>
summarize(set = n(), seen = sum(totseen), picked=sum(totpicked)) |>
arrange(c('commmon', 'uncommon', 'rare', 'mythic'))
rarity_count# A tibble: 4 × 4
Rarity set seen picked
<chr> <int> <dbl> <dbl>
1 common 91 5233589 805215
2 uncommon 120 2485259 453768
3 rare 90 549350 122236
4 mythic 75 148581 41275
rarity_count$Rarity <- factor(rarity_count$Rarity, levels = c('common', 'uncommon', 'rare', 'mythic'))
ggplot(rarity_count, aes (x = Rarity, y = set)) +
geom_col() +
labs(title = 'Set Distribution in Outlaws of Thunder Junction by Rarity',
x = 'Rarity',
y = 'Card Count',
caption = "Source: 17Lands") +
theme_bw()ggplot(data = rarity_count ) +
geom_col(aes(x = Rarity, y= picked, fill = Rarity)) +
labs(title = 'Draft Pick Count by Card Rarity',
subtitle = 'for 17Lands drafters for OTJ',
x = 'Rarity',
y = 'Number',
fill = " ",
caption = 'Source: 17Lands') +
theme_bw() +
scale_color_brewer(palette = "Accent")Filter once more for just the commons:
commons_draft <- otj_draft |>
filter(Rarity == 'common')
head(commons_draft)# A tibble: 6 × 7
name Color Rarity totseen totpicked APP ATA
<chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 Armored Armadillo White common 69302 7091 7.96 11.9
2 Bridled Bighorn White common 63485 8409 6.03 8.93
3 Eriette's Lullaby White common 60898 7383 5.72 8.77
4 Holy Cow White common 53490 8803 4.77 6.74
5 Inventive Wingsmith White common 69362 7656 7.78 11.6
6 Mystical Tether White common 39016 10822 3.48 4.38
the database
Now the card database! This database contains decades of Magic set releases, and I need one set in particular.
I need to filter for for the Outlaws (OTJ) cards in the card database. There are actually three more set names/codes associated with this release (which are included in the draft dataset), but the ten Special Guest (SPG) reprints and 30 (90 with variants) The Big Score (BIG) cards are all mythic, and 65 (80 with variants) the Breaking News! (OTP) are at least uncommon.
And then I remembered I ultimately going to be running my analysis on the commons, and all of those special cards not commons.
otj_baseset <- all_cards |>
filter(set == 'OTJ')
otj_baseset# A tibble: 374 × 39
name multiverse_id layout names mana_cost cmc colors color_identity type
<chr> <dbl> <chr> <lgl> <chr> <dbl> <chr> <chr> <chr>
1 Anoth… 654942 normal NA {X}{X}{2… 3 ['W'] ['W'] Sorc…
2 Archa… 654943 normal NA {1}{W}{W… 4 ['W'] ['W'] Crea…
3 Armor… 654944 normal NA {W} 1 ['W'] ['W'] Crea…
4 Aven … 654945 normal NA {1}{W}{W} 3 ['W'] ['W'] Crea…
5 Bound… 654946 normal NA {5}{W} 6 ['W'] ['W'] Crea…
6 Bovin… 654947 normal NA {1}{W} 2 ['W'] ['W'] Inst…
7 Bridl… 654948 normal NA {3}{W} 4 ['W'] ['W'] Crea…
8 Claim… 654949 normal NA {2}{W} 3 ['W'] ['W'] Crea…
9 Dust … 654950 normal NA {1}{W} 2 ['W'] ['W'] Crea…
10 Eriet… 654951 normal NA {1}{W} 2 ['W'] ['W'] Sorc…
# ℹ 364 more rows
# ℹ 30 more variables: supertypes <chr>, subtypes <chr>, rarity <chr>,
# text <chr>, flavor <chr>, artist <chr>, number <chr>, power <chr>,
# toughness <chr>, loyalty <dbl>, variations <chr>, watermark <chr>,
# border <lgl>, timeshifted <lgl>, hand <dbl>, life <dbl>, reserved <lgl>,
# release_date <lgl>, starter <lgl>, rulings <chr>, foreign_names <chr>,
# printings <chr>, original_text <chr>, original_type <chr>, …
Note that there are 374 rows, two less than otj_draft; and that this without the 105 supplementary set cards. I probably have duplicate card names due to art variations. However, I think all of those are not commons, so I should be able to solve this by filtering for just commons
Filter for just the commons:
otj_commons <- otj_baseset |>
filter(rarity == 'Common')
head(otj_commons)# A tibble: 6 × 39
name multiverse_id layout names mana_cost cmc colors color_identity type
<chr> <dbl> <chr> <lgl> <chr> <dbl> <chr> <chr> <chr>
1 Armore… 654944 normal NA {W} 1 ['W'] ['W'] Crea…
2 Bridle… 654948 normal NA {3}{W} 4 ['W'] ['W'] Crea…
3 Eriett… 654951 normal NA {1}{W} 2 ['W'] ['W'] Sorc…
4 Holy C… 654957 normal NA {2}{W} 3 ['W'] ['W'] Crea…
5 Invent… 654958 normal NA {2}{W} 3 ['W'] ['W'] Crea…
6 Mystic… 654960 normal NA {2}{W} 3 ['W'] ['W'] Ench…
# ℹ 30 more variables: supertypes <chr>, subtypes <chr>, rarity <chr>,
# text <chr>, flavor <chr>, artist <chr>, number <chr>, power <chr>,
# toughness <chr>, loyalty <dbl>, variations <chr>, watermark <chr>,
# border <lgl>, timeshifted <lgl>, hand <dbl>, life <dbl>, reserved <lgl>,
# release_date <lgl>, starter <lgl>, rulings <chr>, foreign_names <chr>,
# printings <chr>, original_text <chr>, original_type <chr>,
# legalities <chr>, source <lgl>, image_url <chr>, set <chr>, …
Global environment shows I have 106 rows compared to the 91 in commons_draft: investigating shows I forgot about the basic lands.
otj_commons2 <- otj_commons |>
filter(!(name %in% c('Island', 'Forest', 'Mountain', 'Plains', 'Swamp')))
length(otj_commons2$name)[1] 91
Now the two dataframes match in length, and joining will be possible. But first, I need to clean up some of these columns.
colnames(otj_commons2) [1] "name" "multiverse_id" "layout" "names"
[5] "mana_cost" "cmc" "colors" "color_identity"
[9] "type" "supertypes" "subtypes" "rarity"
[13] "text" "flavor" "artist" "number"
[17] "power" "toughness" "loyalty" "variations"
[21] "watermark" "border" "timeshifted" "hand"
[25] "life" "reserved" "release_date" "starter"
[29] "rulings" "foreign_names" "printings" "original_text"
[33] "original_type" "legalities" "source" "image_url"
[37] "set" "set_name" "id"
The commons should only have layout normal and no supertypes: confirm wth unique
unique(otj_commons2$layout)[1] "normal"
unique(otj_commons2$supertypes)[1] NA
choosing only the columns I need so my final df isn’t huge and re-ordering The distinction the magic api makes between colors and color_identity does not apply to any any of the common cards except the dual lands, and I would consider the lands to be colored rather than colorless, so I don’t need both columns.
Also filing the NAs of color_identity with “colorless”, and replacing the single letter codes for colors with words
otj_com3 <- otj_commons2 |>
select(name, color_identity, mana_cost, cmc, type, subtypes, power, toughness, text)
otj_com3$color_identity[is.na(otj_com3$color_identity)] <- 'Colorless'
otj_com3$color_identity[otj_com3$identiy == "['B']"] <- 'Black'Warning: Unknown or uninitialised column: `identiy`.
otj_com3$color_identity[otj_com3$identiy == "['U']"] <- 'Blue'Warning: Unknown or uninitialised column: `identiy`.
otj_com3$color_identity[otj_com3$identiy == "['G']"] <- 'Green'Warning: Unknown or uninitialised column: `identiy`.
otj_com3$color_identity[otj_com3$identiy == "['R']"] <- 'Red'Warning: Unknown or uninitialised column: `identiy`.
otj_com3$color_identity[otj_com3$identiy == "['W']"] <- 'White'Warning: Unknown or uninitialised column: `identiy`.
head(otj_com3)# A tibble: 6 × 9
name color_identity mana_cost cmc type subtypes power toughness text
<chr> <chr> <chr> <dbl> <chr> <chr> <chr> <chr> <chr>
1 Armored A… ['W'] {W} 1 Crea… ['Armad… 0 4 "War…
2 Bridled B… ['W'] {3}{W} 4 Crea… ['Sheep… 3 4 "Vig…
3 Eriette's… ['W'] {1}{W} 2 Sorc… <NA> <NA> <NA> "Des…
4 Holy Cow ['W'] {2}{W} 3 Crea… ['Ox', … 2 2 "Fla…
5 Inventive… ['W'] {2}{W} 3 Crea… ['Dwarf… 2 4 "At …
6 Mystical … ['W'] {2}{W} 3 Ench… <NA> <NA> <NA> "You…
Well that didn’t work, and I haven’t been able to fix it. Thankfully, I have multiple color related colors.
Join the data sets
comms_joined <- left_join(commons_draft, otj_com3, by = 'name')
head(comms_joined)# A tibble: 6 × 15
name Color Rarity totseen totpicked APP ATA color_identity mana_cost
<chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
1 Armored A… White common 69302 7091 7.96 11.9 ['W'] {W}
2 Bridled B… White common 63485 8409 6.03 8.93 ['W'] {3}{W}
3 Eriette's… White common 60898 7383 5.72 8.77 ['W'] {1}{W}
4 Holy Cow White common 53490 8803 4.77 6.74 ['W'] {2}{W}
5 Inventive… White common 69362 7656 7.78 11.6 ['W'] {2}{W}
6 Mystical … White common 39016 10822 3.48 4.38 ['W'] {2}{W}
# ℹ 6 more variables: cmc <dbl>, type <chr>, subtypes <chr>, power <chr>,
# toughness <chr>, text <chr>
More Cleaning: Getting my variables in a Usable State
I can’t drop NAs, as any NA value such as Eriette’s Lullaby and Mystical Tether for power, and toughness is because Instants and Enchantments do not have combat stats of power and toughness. I don’t want to fill them with zeros because 0/0 for power/toughness means something completely different.
I can fill the land mana costs with zeros if needed, though. That’s inaccurate but true.Subtypes I’m not planning on using as is, except to make a new column
Trying to remove the extra information from type. Using str_replace and mutate to create a new column with just the main card type (artifact, artifact creature, enchantment, instant, sorcery, land)
join2 <- comms_joined |>
mutate(base_type = str_replace(type, "—.*$", ""), .after = 'subtypes')
head(join2)# A tibble: 6 × 16
name Color Rarity totseen totpicked APP ATA color_identity mana_cost
<chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
1 Armored A… White common 69302 7091 7.96 11.9 ['W'] {W}
2 Bridled B… White common 63485 8409 6.03 8.93 ['W'] {3}{W}
3 Eriette's… White common 60898 7383 5.72 8.77 ['W'] {1}{W}
4 Holy Cow White common 53490 8803 4.77 6.74 ['W'] {2}{W}
5 Inventive… White common 69362 7656 7.78 11.6 ['W'] {2}{W}
6 Mystical … White common 39016 10822 3.48 4.38 ['W'] {2}{W}
# ℹ 7 more variables: cmc <dbl>, type <chr>, subtypes <chr>, base_type <chr>,
# power <chr>, toughness <chr>, text <chr>
Grouping by type, and getting totals for the set, what was seen in the drafts, what users picked , mostly to see if the string replace worked.
type_counts <- join2 |>
group_by(base_type) |>
summarize(count=n(), seen = sum(totseen), picked = sum(totpicked))
type_counts# A tibble: 8 × 4
base_type count seen picked
<chr> <int> <dbl> <dbl>
1 "Artifact " 1 69466 8316
2 "Artifact Creature " 3 201959 27524
3 "Creature " 48 2950412 425706
4 "Enchantment" 1 39016 10822
5 "Enchantment " 2 134630 16936
6 "Instant" 16 967872 148991
7 "Land " 12 402571 89168
8 "Sorcery" 8 467663 77752
I do not know why it is still breaking the enchantments up into two different groups. There were two Enchantment-Auras and one regular Enchantment, but the str_replace otherwise worked perfectly. Let’s see what the difference is via unique.
unique(join2$base_type)[1] "Creature " "Sorcery" "Enchantment"
[4] "Instant" "Enchantment " "Artifact "
[7] "Artifact Creature " "Land "
Remove the extra space from one of the enchantment basetypes, and make them both named Enchantment.
join2$base_type[join2$base_type == 'Enchantment '] <- "Enchantment"
join2$base_type[join2$base_type == "Enchantment"] <- "Enchantment"
unique(join2$base_type)[1] "Creature " "Sorcery" "Enchantment"
[4] "Instant" "Artifact " "Artifact Creature "
[7] "Land "
NOw I need to get the rest of my variables: Extract Keywords: Magic has multiple keywords in a card’s text that give them abilities that are advantageous, such as Flying and Deathtouch. I didn’t expect every single common to have ability or effect text, so I need to extract those keywords where they exist and create a new column.
While having a column per keyword would be more accurate, that would leave me with way too many variables to deal for project. So I am grouping,
join4 <- join2 |>
#common magic keywords,
mutate(keywords = if_else(grepl("counter on|Double strike|First strike|Flying|Haste|Indestructible|Lifelink|Reach|Trample |Vigilance|Ward",text, ignore.case = TRUE), "keyword", "no"), .before = text) |>
#mana generators, very important. need grepl for it's case ignoring. Grepl kept giving errors, so str_detect and learning how to make it ignore case it is
mutate(mana_gen = case_when(
str_detect(text, "(?i)add one mana") ~"yes",
str_detect(text, "(?i)add \\{") ~ "yes", # whn the card specifies the color of the mana, the curly brackets were helpful!
str_detect(text, "(?i)create a treasure") ~ "yes",
.default = "no"), .before = text) |>
# set specific mechanics and themes
#grouping the mechanics of Outlaw(a creature grouping), Mount/saddle, Plot,Spree, and Commit A Crimes. this code will grab spells that care about having creatures that fulfill the criteria, and those creatures.
#Commit a Crime: cards that give bonuses for having done crimes, and the cards that qualify as crimes
# Mostly this means targeting an opponent or something they control, which while some spells specify, others just say "target creature" or "target artifact", meaning you could choose your own (it might even be generally used that way). I'm having to specify some exclusions-- thankfully this is a small dataset.
mutate(mechanics = case_when(
str_detect(text, "outlaw") ~ "yes",
str_detect(subtypes, "Assassin") ~ "yes",
str_detect(subtypes, "Mercenary") ~ "yes",
str_detect(subtypes, "Pirate") ~ "yes",
str_detect(subtypes, "Rogue") ~ "yes",
str_detect(subtypes, "Warlock") ~ "yes",
str_detect(text, "Spree") ~ "yes",
str_detect(text, "Plot") ~ "yes",
str_detect(subtypes, "Mount") ~ "yes", # creatures that ARE mounts
str_detect(text, "(?!)mount") ~ "yes", # cards that care if OTHERS are mounts
str_detect(text, "commit") ~ "yes",
str_detect(text, "target creature you") ~ "no", #excluding spells that only apply to the owner's, no cards in this dataset specified other types the owner controlled
str_detect(text, "becomes the target") ~ "no", # the explanatory text for Ward includes the word target, needs be excluded before I look for all instances of the word target
str_detect(text, "target") ~ "yes",
.default = "no"), .before =text)
head(join4)# A tibble: 6 × 19
name Color Rarity totseen totpicked APP ATA color_identity mana_cost
<chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
1 Armored A… White common 69302 7091 7.96 11.9 ['W'] {W}
2 Bridled B… White common 63485 8409 6.03 8.93 ['W'] {3}{W}
3 Eriette's… White common 60898 7383 5.72 8.77 ['W'] {1}{W}
4 Holy Cow White common 53490 8803 4.77 6.74 ['W'] {2}{W}
5 Inventive… White common 69362 7656 7.78 11.6 ['W'] {2}{W}
6 Mystical … White common 39016 10822 3.48 4.38 ['W'] {2}{W}
# ℹ 10 more variables: cmc <dbl>, type <chr>, subtypes <chr>, base_type <chr>,
# power <chr>, toughness <chr>, keywords <chr>, mana_gen <chr>,
# mechanics <chr>, text <chr>
I’m not entirely happy with this approach, I know that I personally don’t weigh all keywords equally, but it’s the best I have at the moment.
Finally doing some things with the pick data
summary(join4) name Color Rarity totseen
Length:91 Length:91 Length:91 Min. :27576
Class :character Class :character Class :character 1st Qu.:50054
Mode :character Mode :character Mode :character Median :62702
Mean :57512
3rd Qu.:67355
Max. :70918
totpicked APP ATA color_identity
Min. : 6086 Min. :3.290 Min. : 3.920 Length:91
1st Qu.: 7649 1st Qu.:4.765 1st Qu.: 6.605 Class :character
Median : 8483 Median :6.030 Median : 8.760 Mode :character
Mean : 8849 Mean :5.951 Mean : 8.488
3rd Qu.:10008 3rd Qu.:7.185 3rd Qu.:10.405
Max. :12642 Max. :8.360 Max. :12.300
mana_cost cmc type subtypes
Length:91 Min. :0.000 Length:91 Length:91
Class :character 1st Qu.:1.000 Class :character Class :character
Mode :character Median :2.000 Mode :character Mode :character
Mean :2.396
3rd Qu.:3.000
Max. :6.000
base_type power toughness keywords
Length:91 Length:91 Length:91 Length:91
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
mana_gen mechanics text
Length:91 Length:91 Length:91
Class :character Class :character Class :character
Mode :character Mode :character Mode :character
Honestly that’s a lower max ATA than I expected, considering there are 42-45 picks. Also, power and toughness need to become numerical
join4$power <- as.numeric(join4$power)
join4$toughness <- as.numeric(join4$toughness)
summary(Filter(is.numeric,join4)) totseen totpicked APP ATA
Min. :27576 Min. : 6086 Min. :3.290 Min. : 3.920
1st Qu.:50054 1st Qu.: 7649 1st Qu.:4.765 1st Qu.: 6.605
Median :62702 Median : 8483 Median :6.030 Median : 8.760
Mean :57512 Mean : 8849 Mean :5.951 Mean : 8.488
3rd Qu.:67355 3rd Qu.:10008 3rd Qu.:7.185 3rd Qu.:10.405
Max. :70918 Max. :12642 Max. :8.360 Max. :12.300
cmc power toughness
Min. :0.000 Min. :0.000 Min. :1.000
1st Qu.:1.000 1st Qu.:2.000 1st Qu.:2.000
Median :2.000 Median :3.000 Median :2.000
Mean :2.396 Mean :2.608 Mean :2.706
3rd Qu.:3.000 3rd Qu.:3.000 3rd Qu.:4.000
Max. :6.000 Max. :6.000 Max. :5.000
NA's :40 NA's :40
Finally, start on the actual model
Start with converted mana cost vs average taken at.
###A Scatter Plot:
scatter1 <- join4 |>
ggplot(aes(x = cmc, y = ATA) )+
labs(title = 'Average Taken At vs Converted Mana Cost \n for 17Lands MTG Outlaws of Thunder Junction players',
x = 'Converted Mana Cost ',
y = 'Average Taken At',
caption = 'source: 17Lands & Wizards of the Coast') +
theme_minimal() +
geom_point(size = 2.5, alpha = 0.6)
scatter1 That’s not linear at all but we’ll run the smoother.
p2 <- scatter1 + geom_smooth(method='lm',formula=y~x)
p2Well that’s pretty terrible. Fit summary:
fit1 <- lm(ATA ~ cmc, data = join4)
summary(fit1)
Call:
lm(formula = ATA ~ cmc, data = join4)
Residuals:
Min 1Q Median 3Q Max
-5.0276 -1.4074 0.2558 1.7125 4.3542
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.5574 0.4233 17.853 <2e-16 ***
cmc 0.3884 0.1484 2.616 0.0104 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.191 on 89 degrees of freedom
Multiple R-squared: 0.07142, Adjusted R-squared: 0.06098
F-statistic: 6.845 on 1 and 89 DF, p-value: 0.01044
As expected, cmc is not doing much to help this linear model. I need more variables! Most of my variables are categorical. Power and toughness are the only quantitative, and only Creatures and Artifact Creatures have those. Recall that I did not fill the NAs for power and toughness for the other spell types
#convert categorical variables to factors
join4$color_identity <- as.factor(join4$color_identity)
join4$base_type <- as.factor(join4$base_type)
join4$keywords <- as.factor(join4$keywords)
join4$mana_gen <- as.factor(join4$mana_gen)
join4$mechanics<- as.factor(join4$mechanics)
fit2 <- lm(ATA ~ cmc + power + toughness + color_identity + base_type + keywords + mana_gen + mechanics, data = join4)
#autoplot(fit2, 1:4, nrow=2, ncol=2)
summary(fit2)
Call:
lm(formula = ATA ~ cmc + power + toughness + color_identity +
base_type + keywords + mana_gen + mechanics, data = join4)
Residuals:
Min 1Q Median 3Q Max
-3.1575 -1.0707 0.3149 1.0707 2.6108
Coefficients: (1 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 9.09312 1.02268 8.892 6.39e-11 ***
cmc -0.46118 0.45624 -1.011 0.31833
power 0.08695 0.40125 0.217 0.82957
toughness 0.67681 0.29360 2.305 0.02656 *
color_identity['G'] -3.03486 0.89044 -3.408 0.00153 **
color_identity['R'] 0.26577 0.85626 0.310 0.75792
color_identity['U'] 1.40648 0.83260 1.689 0.09915 .
color_identity['W'] -0.60916 0.86179 -0.707 0.48386
color_identityColorless 0.26566 1.23892 0.214 0.83133
base_typeCreature NA NA NA NA
keywordsno 0.01836 0.56289 0.033 0.97414
mana_genyes 0.27627 0.74978 0.368 0.71452
mechanicsyes -0.50079 0.60950 -0.822 0.41627
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.626 on 39 degrees of freedom
(40 observations deleted due to missingness)
Multiple R-squared: 0.5224, Adjusted R-squared: 0.3877
F-statistic: 3.879 on 11 and 39 DF, p-value: 0.0008024
- I don’t understand why base_typeCreature didn’t get coefficients, nor why none of the other types are represented, so I’m going to take out base_type as well.
- largest p-value is keywordsno, so that get’s removed from the model
fit3<- lm(ATA ~ cmc + power + toughness + base_type + color_identity+ mana_gen + mechanics, data = join4)
#autoplot(fit3, 1:4, nrow=2, ncol=2)
summary(fit3)
Call:
lm(formula = ATA ~ cmc + power + toughness + base_type + color_identity +
mana_gen + mechanics, data = join4)
Residuals:
Min 1Q Median 3Q Max
-3.1628 -1.0712 0.3114 1.0591 2.6188
Coefficients: (1 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 9.37769 1.09696 8.549 1.45e-10 ***
cmc -0.46210 0.44964 -1.028 0.31026
power 0.08936 0.38942 0.229 0.81966
toughness 0.67455 0.28177 2.394 0.02144 *
base_typeCreature -0.27915 1.15314 -0.242 0.80996
color_identity['G'] -3.02912 0.86186 -3.515 0.00111 **
color_identity['R'] 0.27289 0.81755 0.334 0.74028
color_identity['U'] 1.41167 0.80703 1.749 0.08792 .
color_identity['W'] -0.60947 0.85091 -0.716 0.47800
color_identityColorless NA NA NA NA
mana_genyes 0.27956 0.73366 0.381 0.70518
mechanicsyes -0.49934 0.60022 -0.832 0.41039
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.605 on 40 degrees of freedom
(40 observations deleted due to missingness)
Multiple R-squared: 0.5224, Adjusted R-squared: 0.403
F-statistic: 4.376 on 10 and 40 DF, p-value: 0.0003611
I’m still wondering abot what’s with the na’s on Base_type, let’s try removing that
fit4<- lm(ATA ~ cmc + power + toughness + color_identity + mana_gen + mechanics, data = join4)
summary(fit4)
Call:
lm(formula = ATA ~ cmc + power + toughness + color_identity +
mana_gen + mechanics, data = join4)
Residuals:
Min 1Q Median 3Q Max
-3.1628 -1.0712 0.3114 1.0591 2.6188
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 9.09854 0.99639 9.132 2.48e-11 ***
cmc -0.46210 0.44964 -1.028 0.31026
power 0.08936 0.38942 0.229 0.81966
toughness 0.67455 0.28177 2.394 0.02144 *
color_identity['G'] -3.02912 0.86186 -3.515 0.00111 **
color_identity['R'] 0.27289 0.81755 0.334 0.74028
color_identity['U'] 1.41167 0.80703 1.749 0.08792 .
color_identity['W'] -0.60947 0.85091 -0.716 0.47800
color_identityColorless 0.27915 1.15314 0.242 0.80996
mana_genyes 0.27956 0.73366 0.381 0.70518
mechanicsyes -0.49934 0.60022 -0.832 0.41039
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.605 on 40 degrees of freedom
(40 observations deleted due to missingness)
Multiple R-squared: 0.5224, Adjusted R-squared: 0.403
F-statistic: 4.376 on 10 and 40 DF, p-value: 0.0003611
plot(fit4) #getting an error on rendering autoplot, so having to change to plotabsolutely no change in adjusted R squared, which did increase from fit 2 to fit3, so leaving base_type out.
largest p-value is POWER, remove that for fit5
fit5<- lm(ATA ~ cmc + toughness + color_identity+ mana_gen + mechanics, data = join4)
summary(fit5)
Call:
lm(formula = ATA ~ cmc + toughness + color_identity + mana_gen +
mechanics, data = join4)
Residuals:
Min 1Q Median 3Q Max
-3.1215 -1.0900 0.2713 1.1031 2.5889
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 9.0749 0.9795 9.264 1.32e-11 ***
cmc -0.3742 0.2327 -1.608 0.115569
toughness 0.6585 0.2697 2.441 0.019046 *
color_identity['G'] -2.9581 0.7951 -3.720 0.000597 ***
color_identity['R'] 0.3022 0.7981 0.379 0.706935
color_identity['U'] 1.4035 0.7969 1.761 0.085657 .
color_identity['W'] -0.6045 0.8408 -0.719 0.476226
color_identityColorless 0.2872 1.1392 0.252 0.802247
mana_genyes 0.2385 0.7032 0.339 0.736229
mechanicsyes -0.4820 0.5885 -0.819 0.417542
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.587 on 41 degrees of freedom
(40 observations deleted due to missingness)
Multiple R-squared: 0.5218, Adjusted R-squared: 0.4168
F-statistic: 4.971 on 9 and 41 DF, p-value: 0.0001556
plot(fit5)Ajusted R-squared went up to .4168, next highest p-value is in the color_identityColorless; but Color_identity[‘G’] has my lowest p-values, so I don’t think I want to get rid of those. If I do…
fit6<- lm(ATA ~ cmc + toughness + mana_gen + mechanics, data = join4)
summary(fit6)
Call:
lm(formula = ATA ~ cmc + toughness + mana_gen + mechanics, data = join4)
Residuals:
Min 1Q Median 3Q Max
-4.4114 -1.5134 0.2917 1.6787 2.8322
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 8.32907 0.96317 8.648 3.35e-11 ***
cmc -0.34754 0.29243 -1.188 0.2408
toughness 0.60552 0.32505 1.863 0.0689 .
mana_genyes -0.08053 0.85270 -0.094 0.9252
mechanicsyes 0.14833 0.66373 0.223 0.8241
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.089 on 46 degrees of freedom
(40 observations deleted due to missingness)
Multiple R-squared: 0.07033, Adjusted R-squared: -0.01051
F-statistic: 0.87 on 4 and 46 DF, p-value: 0.4892
plot(fit6)yeah, my instincts were right: adjusted r-squared tanks to a negative number (how? it’s squared?!) and the residuals vs fitted plot curves even worse.
Let’s try taking the second highest p-value away from fit5 instead: mana_gen
fit7<- lm(ATA ~ cmc + toughness + color_identity + mechanics, data = join4)
summary(fit7)
Call:
lm(formula = ATA ~ cmc + toughness + color_identity + mechanics,
data = join4)
Residuals:
Min 1Q Median 3Q Max
-3.1549 -0.9441 0.3321 1.0809 2.5270
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 9.1278 0.9568 9.540 4.49e-12 ***
cmc -0.3797 0.2297 -1.653 0.105770
toughness 0.6625 0.2666 2.485 0.017028 *
color_identity['G'] -2.9476 0.7861 -3.750 0.000535 ***
color_identity['R'] 0.3506 0.7769 0.451 0.654092
color_identity['U'] 1.3695 0.7822 1.751 0.087266 .
color_identity['W'] -0.6453 0.8233 -0.784 0.437545
color_identityColorless 0.3259 1.1215 0.291 0.772774
mechanicsyes -0.4985 0.5803 -0.859 0.395213
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.57 on 42 degrees of freedom
(40 observations deleted due to missingness)
Multiple R-squared: 0.5205, Adjusted R-squared: 0.4291
F-statistic: 5.698 on 8 and 42 DF, p-value: 6.429e-05
plot(fit7)My residuals plot is still not great, but R squared is getting better. Mechanics is the only thing left other than color_identity, try taking that out:
fit8<- lm(ATA ~ cmc + toughness + color_identity, data = join4)
summary(fit8)
Call:
lm(formula = ATA ~ cmc + toughness + color_identity, data = join4)
Residuals:
Min 1Q Median 3Q Max
-3.1691 -0.8897 0.3514 1.0907 2.8247
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 8.6411 0.7686 11.243 2.19e-14 ***
cmc -0.3914 0.2286 -1.712 0.094090 .
toughness 0.6754 0.2654 2.545 0.014603 *
color_identity['G'] -2.7598 0.7528 -3.666 0.000672 ***
color_identity['R'] 0.3382 0.7744 0.437 0.664529
color_identity['U'] 1.4690 0.7712 1.905 0.063507 .
color_identity['W'] -0.3888 0.7649 -0.508 0.613806
color_identityColorless 0.6519 1.0521 0.620 0.538785
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.565 on 43 degrees of freedom
(40 observations deleted due to missingness)
Multiple R-squared: 0.512, Adjusted R-squared: 0.4326
F-statistic: 6.446 on 7 and 43 DF, p-value: 3.266e-05
plot(fit8)Adjusted R squared went up down a little – it went up earlier and I haven’t changed anything— and it’s back to being higher in fit8 again– so I have no explanation for this! This means that now fit8 is once again best version of the model, even though it’s still pretty terrible – and the residual plot shows that this is simply not a very good data set for a linear model. I probaly need a transformation.
Visualization!
#join4$Color <- factor(join4$Color, levels = c("Black", "Blue", "Green", "Red", "White", "Colorless"))
violins <- join4 |>
ggplot() +
geom_violin(aes(x = Color, y = ATA, fill = Color)) +
labs(title = 'Average Taken At for Commons of OTJ by Card Color',
x = 'Color',
y = 'Average Taken At',
fill = " ",
caption = "Source: 17Lands and Wizards of the Coast") +
scale_fill_manual(values = c("Black" ="#4f3369", "Blue" = "#009bfc", "Green" = "#02db47", "Red"= "#d9161c", "White" = "#d9d502", "Colorless" = "#C0C0C0")) +
theme_bw()
violinsData cleaning and Challenges
For data cleaning, I had to filter to first rename columns in the 17Lands dataset. I also had to fill in the NAs in Color and rename the single letter codes to words. After selecting just the variables I needed, filter for the commons.The card database proved more difficult. I first filtered for the Outlaws at Thunder Junction set, and then the rarity of commons. Then, when there were still more cards than in the draft data set, I remembered about basic lands, and filter to exclude those. At that point, I could figure out which columns I needed from the Magic card database (there were many I did not), and selected those I did. After, the two dataframes into comms_joined. Comms_joined still needed data cleaning. I had to create a more useful Type variable, that did not list the cards entire type via mutate and str_replace. I had to use grepl and str_detect a lot to create the rest of my variables via mutate for join4.
My visualization represents the distribution of average picks within each color, Color identity proved to be the most influential variable in my linear model, despite a linear model being an absolutely terrible model at all for this dataset. I am absolutely not surprised by this result, but I am surprised at how little the other variables mattered. Possibly I over corrected for having too many variables by grouping too many together (and I definitely forgot some mechanics!) but I couldn’t think of good ways for smaller groupings.
I wanted to use the Duskmourn set, as I really like that set. The magic api didn’t have support for R, and I had no idea what I was doing. Another card database site that I was able to figure out how to use the api and get a JSON only gave me a list of cards, but not the full information. So I resorted to using a set that was on a csv file already acquired from the magic api and on kaggle that sadly was too old to have Duskmourn. This actually may have turned out to be a good thing, as one of the new mechanics in Duskmourn involved more split cards. I had a hard time at first trying to figure out what on earth I was going to do with split-spell cards– cards that have two spells on them– as they were represeted twice in the card database, once for each spell name– but only once on the Draft data set. After spending quite some time thinking on this, I realized none of those cards were commons, an it didn’t matter at all. I got a bit lost in data cleaning and making tables and visualizations and forgot what my actual goal was, which while informative and education, also lost me a lot of time and is disastrous when a migrainer in Maryland in the Spring.
I had multiple issues with grepl and str_detect that were incredibly frustrating, and usually because of a comma in the wrong place.
I’m now thinking it might have been better to do something with the total number seen or picked, possibly try to include that either in the linear model, and/or use for the visualization, but it is too late now.
References:
https://www.17lands.com/metrics_definitions
https://magic.wizards.com/en/news/making-magic/what-are-play-boosters
https://scryfall.com/sets/otp?order=name&as=checklist
https://www.kaggle.com/datasets/patrickgendotti/mtg-all-cards