To begin, here is a summary of the tasks I’ll be completing for this project:
Project 2 - Data 607: Summary of Tasks
Dataset Selection:
Choose any three “wide” datasets from the Week 6 Discussion items. Avoid using the Sample Post dataset (already used in Week 6).
Data Preparation:
For each chosen dataset, create a CSV file (or optionally a MySQL database) that contains all the dataset information. Use a “wide” structure that mirrors how the data appears in the discussion item.
Data Import & Transformation:
Read each CSV file into R. Use the tidyr and dplyr packages to tidy and transform your data. (Note: This step carries the most weight in your grade.)
Analysis & Documentation:
Perform the analysis as requested in the discussion item for each dataset. Prepare an R Markdown file that includes your code, narrative descriptions of your data cleanup, analysis, and conclusions. Publish the R Markdown file to rpubs.com.
library(dplyr)
library(tidyverse)
With the correct packages installed, it’s time to bring in the Data. I will be using:
D&D Data I shared in Discussion Cheeses Data shared by a classmate Rolling Stone Top Album data shared by a classmate.
I downloaded all the datasets in their untidy forms and loaded them into my Github for easy access. Time to pull the untidy data into this project and check that it loaded in properly:
DungeonsAndDragons <- "https://raw.githubusercontent.com/tcgraham-data/Data-607-Project-2/refs/heads/main/dnd_chars_all.csv"
Cheese <- "https://raw.githubusercontent.com/tcgraham-data/Data-607-Project-2/refs/heads/main/cheeses.csv"
ROllingStoneTopAlbums <- "https://raw.githubusercontent.com/tcgraham-data/Data-607-Project-2/refs/heads/main/Rolling%20Stone%20500%20(public)%20-%20Data.csv"
dnddata <- read.csv(DungeonsAndDragons)
cheesedata <- read.csv(Cheese)
albumdata <- read.csv(ROllingStoneTopAlbums)
head(dnddata)
## ip finger hash name race
## 1 ed15f9d fe3ed6570067d2cd808bcee0a4396824 ee1e382c Hill Dwarf
## 2 ed15f9d aa656cef94740b1da714d82782d8a8c4 ee1e382c Hill Dwarf
## 3 6b5d3f4 d922658 04b9abba9ce759ef13319c520bf0eb69 f1f6ff43 Human
## 4 9b7218f b5d19a0 ba92372c716a832e03e866447cb2d102 f92bdd74 Fallen Aasimar
## 5 9b7218f b5d19a0 2f4a637dd42b9aa945abf870fe34edcf f92bdd74 Fallen Aasimar
## 6 bf0845c 6594cbd de16fbfebad327876582875339ad1e1f 5b8cf3ca Variant
## background date class
## 1 Guild Member - Justice 2022-08-23T20:02:11Z Sorcerer 13|Cleric 1
## 2 Guild Member - Justice 2022-08-23T19:43:25Z Sorcerer 13|Cleric 1
## 3 Noble 2022-08-22T14:57:09Z Fighter 13
## 4 Outlander 2022-08-22T12:12:53Z Sorcerer 3|Druid 2
## 5 Outlander 2022-08-22T12:07:21Z Sorcerer 3|Druid 2
## 6 Entertainer 2022-08-22T03:19:30Z Bard 1
## justClass subclass level
## 1 Sorcerer|Cleric Clockwork Soul|Order Domain 14
## 2 Sorcerer|Cleric Clockwork Soul|Order Domain 14
## 3 Fighter Eldritch Knight 13
## 4 Sorcerer|Druid Shadow Magic|Circle of Wildfire 5
## 5 Sorcerer|Druid Shadow Magic|Circle of Wildfire 5
## 6 Bard 1
## feats HP AC Str Dex Con Int Wis Cha
## 1 Fey Touched|War Caster|Metamagic Adept 146 10 9 11 20 14 14 20
## 2 133 10 9 11 18 14 14 20
## 3 Heavy Armor Master 140 21 20 12 19 14 11 10
## 4 34 16 8 10 14 10 16 16
## 5 34 16 8 10 14 10 16 16
## 6 10 13 10 16 14 12 8 16
## alignment skills
## 1 Arcana|Religion|Intimidation
## 2 Arcana|Religion|Intimidation
## 3 CG History|Insight|Perception|Persuasion
## 4 Caltico Neutro Athletics|Arcana|Insight|Survival
## 5 Caltico Neutro Athletics|Arcana|Insight|Survival
## 6 Acrobatics|Sleight of Hand|Perception|Deception|Performance
## weapons
## 1 Crossbow, light|Dagger
## 2 Crossbow, light|Dagger
## 3 Longsword +1|Dagger|Crossbow, light|Mace Of Disruption
## 4 Dagger
## 5 Dagger
## 6 Longsword|Dagger|Longbow
## spells
## 1 Alarm*1|Protection from Evil and Good*1|Command*1|Heroism*1|Aid*2|Lesser Restoration*2|Dispel Magic*3|Protection from Energy*3|Freedom of Movement*4|Summon Construct*4|Greater Restoration*5|Wall of Force*5
## 2 Alarm*1|Protection from Evil and Good*1|Command*1|Heroism*1|Aid*2|Lesser Restoration*2|Dispel Magic*3|Protection from Energy*3|Freedom of Movement*4|Summon Construct*4|Greater Restoration*5|Wall of Force*5
## 3 Prestidigitation*0|Mage Hand*0|Blade Ward*0|Light*0|Alarm*1|Burning Hands*1|Unseen Servant*1|Chromatic Orb*1|Magic Missile*1|Identify*1|Detect Magic*1|Absorb Elements*1|Armor of Agathys*1|Shatter*2|Misty Step*2|Enlarge/Reduce*2|Catnap*3
## 4 Burning Hands*1|Cure Wounds*1|Darkness*2
## 5 Burning Hands*1|Cure Wounds*1|Darkness*2
## 6
## castingStat
## 1 Cha
## 2 Cha
## 3 Int
## 4 Cha
## 5 Cha
## 6 Cha
## choices
## 1 metamagic/Twinned Spell*Subtle Spell*Quickened Spell*Extended Spell
## 2 metamagic/Twinned Spell*Subtle Spell*Quickened Spell
## 3 fighting style/Defense
## 4 metamagic/Transmuted Spell*Seeking Spell
## 5 metamagic/Transmuted Spell*Seeking Spell
## 6
## country countryCode processedAlignment good lawful processedRace
## 1 Canada CA NA NA Dwarf
## 2 Canada CA NA NA Dwarf
## 3 United States US CG NA NA Human
## 4 Brazil BR NA NA Aasimar
## 5 Brazil BR NA NA Aasimar
## 6 Canada CA NA NA Half-Elf
## processedSpells
## 1 Alarm*1|Protection from Evil and Good*1|Command*1|Heroism*1|Aid*2|Lesser Restoration*2|Dispel Magic*3|Protection from Energy*3|Freedom of Movement*4|*|Greater Restoration*5|Wall of Force*5
## 2 Alarm*1|Protection from Evil and Good*1|Command*1|Heroism*1|Aid*2|Lesser Restoration*2|Dispel Magic*3|Protection from Energy*3|Freedom of Movement*4|*|Greater Restoration*5|Wall of Force*5
## 3 Prestidigitation*0|Mage Hand*0|Blade Ward*0|Light*0|Alarm*1|Burning Hands*1|Unseen Servant*1|Chromatic Orb*1|Magic Missile*1|Identify*1|Detect Magic*1|Absorb Elements*1|Armor of Agathys*1|Shatter*2|Misty Step*2|Enlarge/Reduce*2|Catnap*3
## 4 Burning Hands*1|Cure Wounds*1|Darkness*2
## 5 Burning Hands*1|Cure Wounds*1|Darkness*2
## 6
## processedWeapons alias
## 1 Crossbow, Light|Dagger thirsty_davinci
## 2 Crossbow, Light|Dagger thirsty_davinci
## 3 Longsword|Dagger|Crossbow, Light|Mace cool_bhabha
## 4 Dagger vigilant_mcclintock
## 5 Dagger vigilant_mcclintock
## 6 Longsword|Dagger|Longbow quirky_hertz
head(cheesedata)
## cheese url milk
## 1 Aarewasser https://www.cheese.com/aarewasser/ cow
## 2 Abbaye de Belloc https://www.cheese.com/abbaye-de-belloc/ sheep
## 3 Abbaye de Belval https://www.cheese.com/abbaye-de-belval/ cow
## 4 Abbaye de Citeaux https://www.cheese.com/abbaye-de-citeaux/ cow
## 5 Abbaye de Tamié https://www.cheese.com/tamie/ cow
## 6 Abbaye de Timadeuc https://www.cheese.com/abbaye-de-timadeuc/ cow
## country region family type
## 1 Switzerland <NA> <NA> semi-soft
## 2 France Pays Basque <NA> semi-hard, artisan
## 3 France <NA> <NA> semi-hard
## 4 France Burgundy <NA> semi-soft, artisan, brined
## 5 France Savoie <NA> soft, artisan
## 6 France province of Brittany <NA> semi-hard
## fat_content calcium_content texture rind color
## 1 <NA> <NA> buttery washed yellow
## 2 <NA> <NA> creamy, dense, firm natural yellow
## 3 40-46% <NA> elastic washed ivory
## 4 <NA> <NA> creamy, dense, smooth washed white
## 5 <NA> <NA> creamy, open, smooth washed white
## 6 <NA> <NA> soft washed pale yellow
## flavor aroma vegetarian vegan
## 1 sweet buttery FALSE FALSE
## 2 burnt caramel lanoline TRUE FALSE
## 3 <NA> aromatic FALSE FALSE
## 4 acidic, milky, smooth barnyardy, earthy FALSE FALSE
## 5 fruity, nutty perfumed, pungent FALSE FALSE
## 6 salty, smooth nutty FALSE FALSE
## synonyms alt_spellings
## 1 <NA> <NA>
## 2 Abbaye Notre-Dame de Belloc <NA>
## 3 <NA> <NA>
## 4 <NA> <NA>
## 5 <NA> Tamié, Trappiste de Tamie, Abbey of Tamie
## 6 <NA> <NA>
## producers
## 1 Jumi
## 2 <NA>
## 3 <NA>
## 4 <NA>
## 5 <NA>
## 6 Abbaye Cistercienne NOTRE-DAME DE TIMADEUC
head(albumdata)
## Sort.Name Clean.Name Album X2003.Rank.Old
## 1 Sinatra, Frank Frank Sinatra In the Wee Small Hours 101
## 2 Diddley, Bo Bo Diddley Bo Diddley / Go Bo Diddley 212
## 3 Presley, Elvis Elvis Presley Elvis Presley 56
## 4 Sinatra, Frank Frank Sinatra Songs for Swingin' Lovers! 302
## 5 Little Richard Little Richard Here's Little Richard 50
## 6 Beyonce Beyonce Lemonade NA
## X2003.Rank X2012.Rank X2020.Rank X2020.2003.Differential Release.Year
## 1 100 101 282 -182 1955
## 2 214 216 455 -241 1955
## 3 55 56 332 -277 1956
## 4 306 308 NA -195 1956
## 5 50 50 227 -177 1957
## 6 NA NA 32 469 2016
## Album.Genre Album.Type Wks.on.Billboard
## 1 Big Band/Jazz Studio 14
## 2 Rock n' Roll/Rhythm & Blues Studio -
## 3 Rock n' Roll/Rhythm & Blues Studio 100
## 4 Big Band/Jazz Studio ?
## 5 Studio 5
## 6 Studio 87
## Peak.Billboard.Position Spotify.Popularity
## 1 2 48
## 2 201 50
## 3 1 58
## 4 2 62
## 5 13 64
## 6 1 73
## Spotify.URI
## 1 spotify:album:3GmwKB1tgPZgXeRJZSm9WX
## 2 spotify:album:1cbtDEwxCjMhglb49OgNBR
## 3 spotify:album:7GXP5OhYyPVLmcVfO9Iqin
## 4 spotify:album:4kca7vXd1Wo5GE2DMafvMc
## 5 spotify:album:18tV6PLXYvVjsdOVk0S7M8
## 6 spotify:album:7dK54iZuOxXFarGhXwEXfF
## Chartmetric.Link Artist.Member.Count
## 1 https://app.chartmetric.com/album/3115110/about 1
## 2 https://app.chartmetric.com/album/4137525/about 1
## 3 https://app.chartmetric.com/album/1338334/about 1
## 4 https://app.chartmetric.com/album/1075976/about 1
## 5 https://app.chartmetric.com/album/3128391/about 1
## 6 https://app.chartmetric.com/album/4160994/about 1
## Artist.Gender Artist.Birth.Year.Sum Debut.Album.Release.Year
## 1 Male 1915 1946
## 2 Male 1928 1955
## 3 Male 1935 1956
## 4 Male 1915 1946
## 5 Male 1932 1957
## 6 Female 1981 2003
## Avg..Age.at.Top.500.Album Years.Between.Debut.and.Top.500.Album
## 1 40 9
## 2 27 0
## 3 21 0
## 4 41 10
## 5 25 0
## 6 35 13
## Album.ID Album.ID.Quoted X X.1
## 1 3GmwKB1tgPZgXeRJZSm9WX "3GmwKB1tgPZgXeRJZSm9WX",
## 2 1cbtDEwxCjMhglb49OgNBR "1cbtDEwxCjMhglb49OgNBR",
## 3 7GXP5OhYyPVLmcVfO9Iqin "7GXP5OhYyPVLmcVfO9Iqin",
## 4 4kca7vXd1Wo5GE2DMafvMc "4kca7vXd1Wo5GE2DMafvMc",
## 5 18tV6PLXYvVjsdOVk0S7M8 "18tV6PLXYvVjsdOVk0S7M8",
## 6 7dK54iZuOxXFarGhXwEXfF "7dK54iZuOxXFarGhXwEXfF",
Now that our data is loaded, we need to clean it up with the intent of doing something with it. Since a lot of the discussions didn’t necessarily have a request built in, I have developed my own questions to answer:
D&D Data - What are the top Character classes and races by percent? Further, can we figure out what the top class/race pairs are by percent?
Cheese Data - Who on earth makes the most cheese styles?
Top Album Data - Which albums have had the most percent change on the rank list from 2003 to 2020?
First thing we need to do is remove columns that aren’t providing us with any real data related to the game itself. This can be done by manually look at the columns and striking those which aren’t relevant using dplyr:
dnd_data_clean <- dnddata %>%
select(-matches("^(name|ip|finger|hash|date|alias)"))
head(dnd_data_clean)
## race background class justClass
## 1 Hill Dwarf Guild Member - Justice Sorcerer 13|Cleric 1 Sorcerer|Cleric
## 2 Hill Dwarf Guild Member - Justice Sorcerer 13|Cleric 1 Sorcerer|Cleric
## 3 Human Noble Fighter 13 Fighter
## 4 Fallen Aasimar Outlander Sorcerer 3|Druid 2 Sorcerer|Druid
## 5 Fallen Aasimar Outlander Sorcerer 3|Druid 2 Sorcerer|Druid
## 6 Variant Entertainer Bard 1 Bard
## subclass level feats
## 1 Clockwork Soul|Order Domain 14 Fey Touched|War Caster|Metamagic Adept
## 2 Clockwork Soul|Order Domain 14
## 3 Eldritch Knight 13 Heavy Armor Master
## 4 Shadow Magic|Circle of Wildfire 5
## 5 Shadow Magic|Circle of Wildfire 5
## 6 1
## HP AC Str Dex Con Int Wis Cha alignment
## 1 146 10 9 11 20 14 14 20
## 2 133 10 9 11 18 14 14 20
## 3 140 21 20 12 19 14 11 10 CG
## 4 34 16 8 10 14 10 16 16 Caltico Neutro
## 5 34 16 8 10 14 10 16 16 Caltico Neutro
## 6 10 13 10 16 14 12 8 16
## skills
## 1 Arcana|Religion|Intimidation
## 2 Arcana|Religion|Intimidation
## 3 History|Insight|Perception|Persuasion
## 4 Athletics|Arcana|Insight|Survival
## 5 Athletics|Arcana|Insight|Survival
## 6 Acrobatics|Sleight of Hand|Perception|Deception|Performance
## weapons
## 1 Crossbow, light|Dagger
## 2 Crossbow, light|Dagger
## 3 Longsword +1|Dagger|Crossbow, light|Mace Of Disruption
## 4 Dagger
## 5 Dagger
## 6 Longsword|Dagger|Longbow
## spells
## 1 Alarm*1|Protection from Evil and Good*1|Command*1|Heroism*1|Aid*2|Lesser Restoration*2|Dispel Magic*3|Protection from Energy*3|Freedom of Movement*4|Summon Construct*4|Greater Restoration*5|Wall of Force*5
## 2 Alarm*1|Protection from Evil and Good*1|Command*1|Heroism*1|Aid*2|Lesser Restoration*2|Dispel Magic*3|Protection from Energy*3|Freedom of Movement*4|Summon Construct*4|Greater Restoration*5|Wall of Force*5
## 3 Prestidigitation*0|Mage Hand*0|Blade Ward*0|Light*0|Alarm*1|Burning Hands*1|Unseen Servant*1|Chromatic Orb*1|Magic Missile*1|Identify*1|Detect Magic*1|Absorb Elements*1|Armor of Agathys*1|Shatter*2|Misty Step*2|Enlarge/Reduce*2|Catnap*3
## 4 Burning Hands*1|Cure Wounds*1|Darkness*2
## 5 Burning Hands*1|Cure Wounds*1|Darkness*2
## 6
## castingStat
## 1 Cha
## 2 Cha
## 3 Int
## 4 Cha
## 5 Cha
## 6 Cha
## choices
## 1 metamagic/Twinned Spell*Subtle Spell*Quickened Spell*Extended Spell
## 2 metamagic/Twinned Spell*Subtle Spell*Quickened Spell
## 3 fighting style/Defense
## 4 metamagic/Transmuted Spell*Seeking Spell
## 5 metamagic/Transmuted Spell*Seeking Spell
## 6
## country countryCode processedAlignment good lawful processedRace
## 1 Canada CA NA NA Dwarf
## 2 Canada CA NA NA Dwarf
## 3 United States US CG NA NA Human
## 4 Brazil BR NA NA Aasimar
## 5 Brazil BR NA NA Aasimar
## 6 Canada CA NA NA Half-Elf
## processedSpells
## 1 Alarm*1|Protection from Evil and Good*1|Command*1|Heroism*1|Aid*2|Lesser Restoration*2|Dispel Magic*3|Protection from Energy*3|Freedom of Movement*4|*|Greater Restoration*5|Wall of Force*5
## 2 Alarm*1|Protection from Evil and Good*1|Command*1|Heroism*1|Aid*2|Lesser Restoration*2|Dispel Magic*3|Protection from Energy*3|Freedom of Movement*4|*|Greater Restoration*5|Wall of Force*5
## 3 Prestidigitation*0|Mage Hand*0|Blade Ward*0|Light*0|Alarm*1|Burning Hands*1|Unseen Servant*1|Chromatic Orb*1|Magic Missile*1|Identify*1|Detect Magic*1|Absorb Elements*1|Armor of Agathys*1|Shatter*2|Misty Step*2|Enlarge/Reduce*2|Catnap*3
## 4 Burning Hands*1|Cure Wounds*1|Darkness*2
## 5 Burning Hands*1|Cure Wounds*1|Darkness*2
## 6
## processedWeapons
## 1 Crossbow, Light|Dagger
## 2 Crossbow, Light|Dagger
## 3 Longsword|Dagger|Crossbow, Light|Mace
## 4 Dagger
## 5 Dagger
## 6 Longsword|Dagger|Longbow
With that done, we need to get an “NA” added to any cell that is currently empty. That way I don’t have to worry about cells getting weird on me:
dnd_data_clean <- dnd_data_clean %>%
mutate(across(where(is.character), ~na_if(., "")))
head(dnd_data_clean)
## race background class justClass
## 1 Hill Dwarf Guild Member - Justice Sorcerer 13|Cleric 1 Sorcerer|Cleric
## 2 Hill Dwarf Guild Member - Justice Sorcerer 13|Cleric 1 Sorcerer|Cleric
## 3 Human Noble Fighter 13 Fighter
## 4 Fallen Aasimar Outlander Sorcerer 3|Druid 2 Sorcerer|Druid
## 5 Fallen Aasimar Outlander Sorcerer 3|Druid 2 Sorcerer|Druid
## 6 Variant Entertainer Bard 1 Bard
## subclass level feats
## 1 Clockwork Soul|Order Domain 14 Fey Touched|War Caster|Metamagic Adept
## 2 Clockwork Soul|Order Domain 14 <NA>
## 3 Eldritch Knight 13 Heavy Armor Master
## 4 Shadow Magic|Circle of Wildfire 5 <NA>
## 5 Shadow Magic|Circle of Wildfire 5 <NA>
## 6 <NA> 1 <NA>
## HP AC Str Dex Con Int Wis Cha alignment
## 1 146 10 9 11 20 14 14 20 <NA>
## 2 133 10 9 11 18 14 14 20 <NA>
## 3 140 21 20 12 19 14 11 10 CG
## 4 34 16 8 10 14 10 16 16 Caltico Neutro
## 5 34 16 8 10 14 10 16 16 Caltico Neutro
## 6 10 13 10 16 14 12 8 16 <NA>
## skills
## 1 Arcana|Religion|Intimidation
## 2 Arcana|Religion|Intimidation
## 3 History|Insight|Perception|Persuasion
## 4 Athletics|Arcana|Insight|Survival
## 5 Athletics|Arcana|Insight|Survival
## 6 Acrobatics|Sleight of Hand|Perception|Deception|Performance
## weapons
## 1 Crossbow, light|Dagger
## 2 Crossbow, light|Dagger
## 3 Longsword +1|Dagger|Crossbow, light|Mace Of Disruption
## 4 Dagger
## 5 Dagger
## 6 Longsword|Dagger|Longbow
## spells
## 1 Alarm*1|Protection from Evil and Good*1|Command*1|Heroism*1|Aid*2|Lesser Restoration*2|Dispel Magic*3|Protection from Energy*3|Freedom of Movement*4|Summon Construct*4|Greater Restoration*5|Wall of Force*5
## 2 Alarm*1|Protection from Evil and Good*1|Command*1|Heroism*1|Aid*2|Lesser Restoration*2|Dispel Magic*3|Protection from Energy*3|Freedom of Movement*4|Summon Construct*4|Greater Restoration*5|Wall of Force*5
## 3 Prestidigitation*0|Mage Hand*0|Blade Ward*0|Light*0|Alarm*1|Burning Hands*1|Unseen Servant*1|Chromatic Orb*1|Magic Missile*1|Identify*1|Detect Magic*1|Absorb Elements*1|Armor of Agathys*1|Shatter*2|Misty Step*2|Enlarge/Reduce*2|Catnap*3
## 4 Burning Hands*1|Cure Wounds*1|Darkness*2
## 5 Burning Hands*1|Cure Wounds*1|Darkness*2
## 6 <NA>
## castingStat
## 1 Cha
## 2 Cha
## 3 Int
## 4 Cha
## 5 Cha
## 6 Cha
## choices
## 1 metamagic/Twinned Spell*Subtle Spell*Quickened Spell*Extended Spell
## 2 metamagic/Twinned Spell*Subtle Spell*Quickened Spell
## 3 fighting style/Defense
## 4 metamagic/Transmuted Spell*Seeking Spell
## 5 metamagic/Transmuted Spell*Seeking Spell
## 6 <NA>
## country countryCode processedAlignment good lawful processedRace
## 1 Canada CA <NA> NA NA Dwarf
## 2 Canada CA <NA> NA NA Dwarf
## 3 United States US CG NA NA Human
## 4 Brazil BR <NA> NA NA Aasimar
## 5 Brazil BR <NA> NA NA Aasimar
## 6 Canada CA <NA> NA NA Half-Elf
## processedSpells
## 1 Alarm*1|Protection from Evil and Good*1|Command*1|Heroism*1|Aid*2|Lesser Restoration*2|Dispel Magic*3|Protection from Energy*3|Freedom of Movement*4|*|Greater Restoration*5|Wall of Force*5
## 2 Alarm*1|Protection from Evil and Good*1|Command*1|Heroism*1|Aid*2|Lesser Restoration*2|Dispel Magic*3|Protection from Energy*3|Freedom of Movement*4|*|Greater Restoration*5|Wall of Force*5
## 3 Prestidigitation*0|Mage Hand*0|Blade Ward*0|Light*0|Alarm*1|Burning Hands*1|Unseen Servant*1|Chromatic Orb*1|Magic Missile*1|Identify*1|Detect Magic*1|Absorb Elements*1|Armor of Agathys*1|Shatter*2|Misty Step*2|Enlarge/Reduce*2|Catnap*3
## 4 Burning Hands*1|Cure Wounds*1|Darkness*2
## 5 Burning Hands*1|Cure Wounds*1|Darkness*2
## 6 <NA>
## processedWeapons
## 1 Crossbow, Light|Dagger
## 2 Crossbow, Light|Dagger
## 3 Longsword|Dagger|Crossbow, Light|Mace
## 4 Dagger
## 5 Dagger
## 6 Longsword|Dagger|Longbow
Since I’m really only interested in the “race” and “justClass” columns for my observation, I want to make sure we remove any rows where those are empty. Since you can’t functionally play the game without those two elements, there’s not really a reason to measure them, even as “empty”:
dnd_data_clean <- dnd_data_clean %>%
filter(!is.na(race) & race != "empty",
!is.na(justClass) & justClass != "empty")
# Display the cleaned dataset
head(dnd_data_clean)
## race background class justClass
## 1 Hill Dwarf Guild Member - Justice Sorcerer 13|Cleric 1 Sorcerer|Cleric
## 2 Hill Dwarf Guild Member - Justice Sorcerer 13|Cleric 1 Sorcerer|Cleric
## 3 Human Noble Fighter 13 Fighter
## 4 Fallen Aasimar Outlander Sorcerer 3|Druid 2 Sorcerer|Druid
## 5 Fallen Aasimar Outlander Sorcerer 3|Druid 2 Sorcerer|Druid
## 6 Variant Entertainer Bard 1 Bard
## subclass level feats
## 1 Clockwork Soul|Order Domain 14 Fey Touched|War Caster|Metamagic Adept
## 2 Clockwork Soul|Order Domain 14 <NA>
## 3 Eldritch Knight 13 Heavy Armor Master
## 4 Shadow Magic|Circle of Wildfire 5 <NA>
## 5 Shadow Magic|Circle of Wildfire 5 <NA>
## 6 <NA> 1 <NA>
## HP AC Str Dex Con Int Wis Cha alignment
## 1 146 10 9 11 20 14 14 20 <NA>
## 2 133 10 9 11 18 14 14 20 <NA>
## 3 140 21 20 12 19 14 11 10 CG
## 4 34 16 8 10 14 10 16 16 Caltico Neutro
## 5 34 16 8 10 14 10 16 16 Caltico Neutro
## 6 10 13 10 16 14 12 8 16 <NA>
## skills
## 1 Arcana|Religion|Intimidation
## 2 Arcana|Religion|Intimidation
## 3 History|Insight|Perception|Persuasion
## 4 Athletics|Arcana|Insight|Survival
## 5 Athletics|Arcana|Insight|Survival
## 6 Acrobatics|Sleight of Hand|Perception|Deception|Performance
## weapons
## 1 Crossbow, light|Dagger
## 2 Crossbow, light|Dagger
## 3 Longsword +1|Dagger|Crossbow, light|Mace Of Disruption
## 4 Dagger
## 5 Dagger
## 6 Longsword|Dagger|Longbow
## spells
## 1 Alarm*1|Protection from Evil and Good*1|Command*1|Heroism*1|Aid*2|Lesser Restoration*2|Dispel Magic*3|Protection from Energy*3|Freedom of Movement*4|Summon Construct*4|Greater Restoration*5|Wall of Force*5
## 2 Alarm*1|Protection from Evil and Good*1|Command*1|Heroism*1|Aid*2|Lesser Restoration*2|Dispel Magic*3|Protection from Energy*3|Freedom of Movement*4|Summon Construct*4|Greater Restoration*5|Wall of Force*5
## 3 Prestidigitation*0|Mage Hand*0|Blade Ward*0|Light*0|Alarm*1|Burning Hands*1|Unseen Servant*1|Chromatic Orb*1|Magic Missile*1|Identify*1|Detect Magic*1|Absorb Elements*1|Armor of Agathys*1|Shatter*2|Misty Step*2|Enlarge/Reduce*2|Catnap*3
## 4 Burning Hands*1|Cure Wounds*1|Darkness*2
## 5 Burning Hands*1|Cure Wounds*1|Darkness*2
## 6 <NA>
## castingStat
## 1 Cha
## 2 Cha
## 3 Int
## 4 Cha
## 5 Cha
## 6 Cha
## choices
## 1 metamagic/Twinned Spell*Subtle Spell*Quickened Spell*Extended Spell
## 2 metamagic/Twinned Spell*Subtle Spell*Quickened Spell
## 3 fighting style/Defense
## 4 metamagic/Transmuted Spell*Seeking Spell
## 5 metamagic/Transmuted Spell*Seeking Spell
## 6 <NA>
## country countryCode processedAlignment good lawful processedRace
## 1 Canada CA <NA> NA NA Dwarf
## 2 Canada CA <NA> NA NA Dwarf
## 3 United States US CG NA NA Human
## 4 Brazil BR <NA> NA NA Aasimar
## 5 Brazil BR <NA> NA NA Aasimar
## 6 Canada CA <NA> NA NA Half-Elf
## processedSpells
## 1 Alarm*1|Protection from Evil and Good*1|Command*1|Heroism*1|Aid*2|Lesser Restoration*2|Dispel Magic*3|Protection from Energy*3|Freedom of Movement*4|*|Greater Restoration*5|Wall of Force*5
## 2 Alarm*1|Protection from Evil and Good*1|Command*1|Heroism*1|Aid*2|Lesser Restoration*2|Dispel Magic*3|Protection from Energy*3|Freedom of Movement*4|*|Greater Restoration*5|Wall of Force*5
## 3 Prestidigitation*0|Mage Hand*0|Blade Ward*0|Light*0|Alarm*1|Burning Hands*1|Unseen Servant*1|Chromatic Orb*1|Magic Missile*1|Identify*1|Detect Magic*1|Absorb Elements*1|Armor of Agathys*1|Shatter*2|Misty Step*2|Enlarge/Reduce*2|Catnap*3
## 4 Burning Hands*1|Cure Wounds*1|Darkness*2
## 5 Burning Hands*1|Cure Wounds*1|Darkness*2
## 6 <NA>
## processedWeapons
## 1 Crossbow, Light|Dagger
## 2 Crossbow, Light|Dagger
## 3 Longsword|Dagger|Crossbow, Light|Mace
## 4 Dagger
## 5 Dagger
## 6 Longsword|Dagger|Longbow
I do want to make it clear I want to keep all this data because you never know when you might need it. You never know when you just NEED to know what percentage of DnD players are Eldritch Knights…But since I’m not going to use all of it, I want to make a unique datafram with only the information I’m interested in:
dnd_subset <- dnd_data_clean %>% select(race, justClass)
head(dnd_subset)
## race justClass
## 1 Hill Dwarf Sorcerer|Cleric
## 2 Hill Dwarf Sorcerer|Cleric
## 3 Human Fighter
## 4 Fallen Aasimar Sorcerer|Druid
## 5 Fallen Aasimar Sorcerer|Druid
## 6 Variant Bard
Now we’re getting somewhere. The last thing I see that’s a problem is that some characters have dual-classed. How do I treat that? In DnD parlance, the first listed class is the primary class, and the second is a secondary classing. Now we’ll want to make a new column that splits up any dual classes by looking for a pipe and extracting info after the pipe:
dnd_subset <- dnd_subset %>%
mutate(`Secondary Class` = ifelse(grepl("\\|", justClass),
sub(".*\\|", "", justClass),
NA))
head(dnd_subset)
## race justClass Secondary Class
## 1 Hill Dwarf Sorcerer|Cleric Cleric
## 2 Hill Dwarf Sorcerer|Cleric Cleric
## 3 Human Fighter <NA>
## 4 Fallen Aasimar Sorcerer|Druid Druid
## 5 Fallen Aasimar Sorcerer|Druid Druid
## 6 Variant Bard <NA>
That bit of research worked well. But I still have the extra secondary data in justClass and need to remove it. And I want to rename “justClass” to “Primary Class”
dnd_subset <- dnd_subset %>%
mutate(justClass = sub("\\|.*", "", justClass)) %>%
rename(`Primary Class` = justClass)
head(dnd_subset)
## race Primary Class Secondary Class
## 1 Hill Dwarf Sorcerer Cleric
## 2 Hill Dwarf Sorcerer Cleric
## 3 Human Fighter <NA>
## 4 Fallen Aasimar Sorcerer Druid
## 5 Fallen Aasimar Sorcerer Druid
## 6 Variant Bard <NA>
And to just clean this up a little bit more, so we can capture who does and doesn’t have a Second class:
dnd_subset <- dnd_subset %>%
mutate(`Secondary Class` = ifelse(is.na(`Secondary Class`) | `Secondary Class` == "", "none", `Secondary Class`))
head(dnd_subset)
## race Primary Class Secondary Class
## 1 Hill Dwarf Sorcerer Cleric
## 2 Hill Dwarf Sorcerer Cleric
## 3 Human Fighter none
## 4 Fallen Aasimar Sorcerer Druid
## 5 Fallen Aasimar Sorcerer Druid
## 6 Variant Bard none
Let’s start answering some questions. We have 10,867 records here. So this is a pretty solid statistical sample.
##DnD Question 1 - What’s the most popular race?
library(ggplot2)
ggplot(dnd_subset, aes(x = race)) +
geom_bar(fill = "skyblue", color = "black") +
labs(title = "Distribution of Race", x = "Race", y = "Count") +
theme_minimal()
That was unhelpful. I had no idea there were that many race options in
the game. Color me impressed. I guess we’ll just tabulate this
differently. Let’s make a new dataset that lists all races and then
counts each instance and descends from most to least. Then we can graph
the top ten.
dnd_race_count <- dnd_subset %>%
count(race) %>%
rename(Count = n) %>%
arrange(desc(Count))
print(dnd_race_count)
## race Count
## 1 Human 2068
## 2 Half-Elf 688
## 3 Wood Elf 572
## 4 Dragonborn 486
## 5 Tiefling 449
## 6 Half-Orc 425
## 7 High Elf 378
## 8 Mountain Dwarf 368
## 9 Tabaxi 320
## 10 Hill Dwarf 319
## 11 Lightfoot Halfling 318
## 12 Forest Gnome 219
## 13 Goliath 198
## 14 Warforged 194
## 15 Goblin 181
## 16 Dark Elf 179
## 17 Firbolg 167
## 18 Turtle 139
## 19 Protector Aasimar 138
## 20 Variant 118
## 21 Rock Gnome 116
## 22 Kenku 106
## 23 Lizardfolk 106
## 24 Triton 103
## 25 Kobold 102
## 26 Ghastly Halfling 97
## 27 Minotaur 95
## 28 Stout Halfling 90
## 29 Fallen Aasimar 89
## 30 Bugbear 82
## 31 Fire Genasi 77
## 32 Scourge Aasimar 70
## 33 Deep Gnome 69
## 34 Serpentblood 69
## 35 Shadow Elf 68
## 36 Orc 63
## 37 Aarakocra 60
## 38 Changeling 60
## 39 Eladrin ToF 59
## 40 Custom 52
## 41 Lineage - Glas 41
## 42 Water Genasi 41
## 43 Elephantine 40
## 44 Sea Elf 40
## 45 Air Genasi 39
## 46 Birdfolk 39
## 47 Aasimar 38
## 48 Centaur 37
## 49 Hobgoblin 34
## 50 Earth Genasi 33
## 51 Lineage - Zari 31
## 52 Animal Hybrid 28
## 53 Dreamtouched 28
## 54 Lineage - Meph 28
## 55 Satyr 25
## 56 Leonine 23
## 57 Lineage - Levi 22
## 58 Lineage - Fier 20
## 59 Chromatic 19
## 60 Gray Dwarf 19
## 61 Eladrin DMG 18
## 62 Detection 17
## 63 Dhampir 17
## 64 Fairy 15
## 65 Geth - Yan 15
## 66 Harefolk 15
## 67 Gem 12
## 68 Lineage - Disp 12
## 69 Pale Elf 12
## 70 Reborn Lineage 11
## 71 Sentinel 11
## 72 Tortle 11
## 73 Half-elf 10
## 74 Stride 10
## 75 Asgardian 9
## 76 Healing 9
## 77 Hexed Lineage 9
## 78 Lineage - Baal 9
## 79 Lineage - Mamm 9
## 80 Lotus Halfling 9
## 81 Passage 9
## 82 Human Variant 8
## 83 Storm 8
## 84 Violetken 8
## 85 Dire wolf 7
## 86 Hunt 7
## 87 Making 7
## 88 Drow 6
## 89 Geth - Zer 6
## 90 Githzerai 6
## 91 Half-Dark Elf 6
## 92 Hospitality 6
## 93 Kalashtar 6
## 94 Loxodon 6
## 95 Metallic 6
## 96 Tooth 6
## 97 Dragonborn (Draconblood) 5
## 98 Envoy 5
## 99 Lineage - Zariel 5
## 100 Shadow 5
## 101 Shifter (Wildhunt) 5
## 102 Warding 5
## 103 Eladrin 4
## 104 Half Elf 4
## 105 Humana 4
## 106 Variant Half Elf 4
## 107 Vedalken 4
## 108 CA 3
## 109 Copper Dragon 3
## 110 Dwarf 3
## 111 Ghostwise Halfling 3
## 112 Goblyn 3
## 113 Grung 3
## 114 Half-Elf (Wood) 3
## 115 Half-Khindra 3
## 116 Hide 3
## 117 Imaskari Human 3
## 118 Leonin 3
## 119 Meadowguard 3
## 120 Myconid 3
## 121 Ratfolk (white) 3
## 122 Scribing 3
## 123 Shifter 3
## 124 Succubus 3
## 125 Variant Half-Elf 3
## 126 Variant Human 3
## 127 BR 2
## 128 Dhampir (Illithid) 2
## 129 Div Spawn 2
## 130 Finding 2
## 131 Giant Wolf Spider 2
## 132 Gnoll 2
## 133 Gray-Elf 2
## 134 Grove Cervan 2
## 135 Half-Dwarf 2
## 136 Half-Elf(Drow) 2
## 137 High Moon Elf 2
## 138 Human (Dwarf) 2
## 139 Human (Mark of Finding) 2
## 140 Human - Custom 2
## 141 Human Hybrid 2
## 142 Human Tiefling (Glasya) 2
## 143 Human Undertownian 2
## 144 Human Undertownian Ancestor 2
## 145 Juggeenaut 2
## 146 Juggernaut 2
## 147 Juggernaut 2
## 148 Kor 2
## 149 Longtooth 2
## 150 Longtooth Shifter 2
## 151 Luxodon 2
## 152 Making Mark Human 2
## 153 Meio-Orc 2
## 154 Minotaurus 2
## 155 Ogre 2
## 156 Owlfolk 2
## 157 Plasmoid 2
## 158 Reborn 2
## 159 Reborn (high elf) 2
## 160 Sprite Lord 2
## 161 US 2
## 162 Variant Human 2
## 163 Wald Elf 2
## 164 Wand Golem Thingy 2
## 165 Wolf 2
## 166 Wood Elf(Aereni) 2
## 167 Wooden Puppet 2
## 168 Yuan-Ti 2
## 169 Yuan-ti pureblood 2
## 170 firbolg 2
## 171 1/2 Elf--Var. 1
## 172 Aaracokra 1
## 173 Aassling 1
## 174 Animal Hybrid (Bear) 1
## 175 Asgardian (MD) 1
## 176 Augmented Combatant 1
## 177 Axolotl Hybrid 1
## 178 Bearfolk 1
## 179 Beast 1
## 180 Bird Hybrid 1
## 181 Blefus 1
## 182 Bonedevil Tiefling 1
## 183 Boofreen 1
## 184 Caliban 1
## 185 Cambiaformas 1
## 186 Caprien (Goatfolk) 1
## 187 Changeling 2.0 1
## 188 Child Of The Old Ones 1
## 189 Chitine 1
## 190 Chromatic(green) 1
## 191 Cloaker 1
## 192 Clockwork 1
## 193 Clockwork Dancer 1
## 194 Custom - Desert Lizardfolk 1
## 195 Custom - Fish in a suit of armour 1
## 196 Desert Elf 1
## 197 Dhampire 1
## 198 Dire Wolf Hengenyokai 1
## 199 Direwolf 1
## 200 Doll 1
## 201 Dragonforged 1
## 202 Drakin 1
## 203 Drow 1
## 204 Duergar 1
## 205 Earth Genasi (MotM) 1
## 206 Eladrin Elf 1
## 207 Elf 1
## 208 Fallen Aasimar (goblin) 1
## 209 Flying Snake 1
## 210 Forest Dragon 1
## 211 Freed Undead 1
## 212 Frog man 1
## 213 Gem (Sapphire) 1
## 214 Genasi-null 1
## 215 Gethzerai 1
## 216 Ghost 1
## 217 Glowing One (Celestial) 1
## 218 Goron 1
## 219 Grovetender 1
## 220 Halb-Ork 1
## 221 Half Elf/Half Siren 1
## 222 Half Orc Finding 1
## 223 Half Orc Fire Gensai 1
## 224 Half Sea High Elf 1
## 225 Half-Elf (Drow) 1
## 226 Half-Elf Variant 1
## 227 Half-Giant 1
## 228 Half-Orc (Custom lineage) 1
## 229 Halfling Gnome 1
## 230 Hengenyokai- Sabre Toothed Tiger 1
## 231 High Elf/Skeleton 1
## 232 High Elf/Vampire 1
## 233 Higher Elfin 1
## 234 Hill Giant 1
## 235 Hin 1
## 236 Human (Elven Ancestry) 1
## 237 Human (Variant) 1
## 238 Human (Variant) 1
## 239 Human (dragon blood?) 1
## 240 Human (variant) 1
## 241 Human 1 inspo 1
## 242 Human Age 31 1
## 243 Human Var. 1
## 244 Human Walldweller 1
## 245 Human/Marsh-Wiggle 1
## 246 Human/Shou 1
## 247 Jackalfolk 1
## 248 Juggernaut Warforged 1
## 249 Kenku/half elf 1
## 250 Kitsune 1
## 251 Kitty 1
## 252 Kobold (Volvo's) 1
## 253 Kobold Scout (Blue Scale) 1
## 254 Krijg 1
## 255 Lightfoot Hin 1
## 256 Lineage - Fierna 1
## 257 Lineage - Glasya 1
## 258 Locathah 1
## 259 Loxodon 1
## 260 Luxodont 1
## 261 Mark of Warding Dwarf 1
## 262 Mechlock 1
## 263 Medio Orco 1
## 264 Medusa 1
## 265 Meio-Elfo Variante 1
## 266 Minotaur (Death Knight) 1
## 267 Moon Elf 1
## 268 Mothfolk 1
## 269 Mountain Dwarf 1
## 270 Mycanid 1
## 271 Naga 1
## 272 Nature Tiefling (Dryad?) 1
## 273 Nautilid 1
## 274 Octopus 1
## 275 Ogrillon Zombi 1
## 276 Olympian, Ares 1
## 277 Oryk 1
## 278 Pandaren 1
## 279 Penguin 1
## 280 Protector Aasimar: Iluskan Parents 1
## 281 Ratfolk(underfolk) 1
## 282 Shadar-Kai Elf 1
## 283 Shadar-kai elf 1
## 284 Sharkfolk 1
## 285 Sheikah 1
## 286 Shiya 1
## 287 Simic Hybrid 1
## 288 Skeleton 1
## 289 Skellington 1
## 290 SpiderFolk 1
## 291 Staemmige 1
## 292 Swiftstride Shifter 1
## 293 Synthetic Human 1
## 294 Tauren 1
## 295 Teifling - Levistus 1
## 296 Tieffelin 1
## 297 Tiefling (Mammon) 1
## 298 Tiefling - Zariel 1
## 299 Tiefling Levistus 1
## 300 Tiefling Lineage - Baal 1
## 301 Tortle - Razorback Tribe 1
## 302 Troll 1
## 303 Turtlebear 1
## 304 UA Eladrin 1
## 305 Undead 1
## 306 Unlucky 1
## 307 Vampire 1
## 308 Vampire (Zendikar) 1
## 309 Variant HalfElf 1
## 310 Vedalken 1
## 311 Warforged Bear 1
## 312 Warforged Official 1
## 313 Water Genasi 27 Ans 1
## 314 Winter Orc 1
## 315 Yordle 1
## 316 Yuan-ti Pureblood 1
## 317 moon Elf 1
## 318 naga 1
## 319 nyad 1
## 320 tortle 1
## 321 warforged (envoy) 1
library(ggplot2)
top_races <- dnd_race_count %>%
slice_max(Count, n = 10)
ggplot(top_races, aes(x = reorder(race, Count), y = Count)) +
geom_bar(stat = "identity", fill = "steelblue") +
coord_flip() +
labs(title = "Top 10 Races by Count", x = "Race", y = "Count") +
theme_minimal()
This is cool. And it makes a lot of sense. In the game, humans get +1 to all stats for a total of +6 overall. Other races have a total of +3 to a stat. The “upside” to choosing a race other than human is that while humans have a blanket +1 to every stat, each race has one stat that receives a +2 perk. In general, this makes humans easier to play for new players. Most new players are encouraged to start with a human and then choose a more interesting race on the “next” game. I wonder what the percentage is between “human” and “all others:
human_vs_other <- dnd_race_count %>%
mutate(race_group = ifelse(tolower(race) == "human", "Human", "Other")) %>%
group_by(race_group) %>%
summarise(Total_Count = sum(Count))
human_vs_other
## # A tibble: 2 × 2
## race_group Total_Count
## <chr> <int>
## 1 Human 2068
## 2 Other 8808
total_count <- sum(human_vs_other$Total_Count)
human_count <- human_vs_other %>%
filter(race_group == "Human") %>%
pull(Total_Count)
human_percentage <- (human_count / total_count) * 100
human_percentage
## [1] 19.01434
That’s much more in line with what I thought. The bar chart makes it look like Humans are far and away the top choice. Which they are for a single racial class. However, the reality is that it’s going to be the largest instance because it is the most friendly to beginner players. So it’s almost better to think of “human” as beginners and all others as veteran players. I would believe that 81% of DnD players more likely to play more complex and challenging characters.
dnd_class_count <- dnd_subset %>%
count(`Primary Class`) %>%
rename(Count = n) %>%
arrange(desc(Count))
dnd_class_count
## Primary Class Count
## 1 Fighter 1463
## 2 Rogue 1308
## 3 Cleric 1003
## 4 Barbarian 956
## 5 Paladin 891
## 6 Ranger 749
## 7 Wizard 747
## 8 Druid 722
## 9 Monk 716
## 10 Bard 701
## 11 Sorcerer 658
## 12 Warlock 649
## 13 Artificer 218
## 14 Revised Ranger 44
## 15 Blood Hunter 26
## 16 Mystic 13
## 17 Dwarf 2
## 18 Gunslinger 2
## 19 Half-Elf 2
## 20 Battle Clown 1
## 21 Crafting Commoner 1
## 22 Dragonborn 1
## 23 Genasi 1
## 24 Human 1
## 25 commoner 1
top_classes <- dnd_class_count %>%
slice_max(Count, n = 10)
top_classes
## Primary Class Count
## 1 Fighter 1463
## 2 Rogue 1308
## 3 Cleric 1003
## 4 Barbarian 956
## 5 Paladin 891
## 6 Ranger 749
## 7 Wizard 747
## 8 Druid 722
## 9 Monk 716
## 10 Bard 701
ggplot(top_classes, aes(x = reorder(`Primary Class`, Count), y = Count)) +
geom_bar(stat = "identity", fill = "steelblue") +
coord_flip() +
labs(title = "Top 10 Primary Classes", x = "Primary Class", y = "Count") +
theme_minimal()
Answer: Fighter. That makes a lot of sense. It’s one of the most flexible classes to generally do what you want to do. About all you can’t do well as a “fighter” is cast magic. But magic users are hard because they get real squishy, real fast. OK. I’ve spent way too much time on DnD data. Time to tackle CHeese.
OK. I have to get more focused on what I’m doing here. I don’t want this to ramble too much. My main question here was:
Let’s begin be tidying up our data. There are a few columns here that don’t provide us with much information, nor do we really need it because of how little infor we get out of it. So let’s scrub some data:
cheese_data_clean <- cheesedata %>%
select(-matches("^(url|region|family|fat_content|calcium_content|synonyms|alt_spellings|producers|vegetarian|vegan)"))
head(cheese_data_clean)
## cheese milk country type
## 1 Aarewasser cow Switzerland semi-soft
## 2 Abbaye de Belloc sheep France semi-hard, artisan
## 3 Abbaye de Belval cow France semi-hard
## 4 Abbaye de Citeaux cow France semi-soft, artisan, brined
## 5 Abbaye de Tamié cow France soft, artisan
## 6 Abbaye de Timadeuc cow France semi-hard
## texture rind color flavor
## 1 buttery washed yellow sweet
## 2 creamy, dense, firm natural yellow burnt caramel
## 3 elastic washed ivory <NA>
## 4 creamy, dense, smooth washed white acidic, milky, smooth
## 5 creamy, open, smooth washed white fruity, nutty
## 6 soft washed pale yellow salty, smooth
## aroma
## 1 buttery
## 2 lanoline
## 3 aromatic
## 4 barnyardy, earthy
## 5 perfumed, pungent
## 6 nutty
Now we need to update this so that any empty cells are given the null of “NA”
cheese_data_clean <- cheese_data_clean %>%
mutate(across(where(is.character), ~na_if(., "")))
head(cheese_data_clean)
## cheese milk country type
## 1 Aarewasser cow Switzerland semi-soft
## 2 Abbaye de Belloc sheep France semi-hard, artisan
## 3 Abbaye de Belval cow France semi-hard
## 4 Abbaye de Citeaux cow France semi-soft, artisan, brined
## 5 Abbaye de Tamié cow France soft, artisan
## 6 Abbaye de Timadeuc cow France semi-hard
## texture rind color flavor
## 1 buttery washed yellow sweet
## 2 creamy, dense, firm natural yellow burnt caramel
## 3 elastic washed ivory <NA>
## 4 creamy, dense, smooth washed white acidic, milky, smooth
## 5 creamy, open, smooth washed white fruity, nutty
## 6 soft washed pale yellow salty, smooth
## aroma
## 1 buttery
## 2 lanoline
## 3 aromatic
## 4 barnyardy, earthy
## 5 perfumed, pungent
## 6 nutty
Perfect. Now we have solid data. But it’s still wide and we need to make it long to get the answer we seek. And I need to make the question harder so I can have fun exploding columns. Let’s keep cheese, country, and type:
cheese_data_subset <- cheese_data_clean %>%
select(cheese, country, type)
head(cheese_data_subset)
## cheese country type
## 1 Aarewasser Switzerland semi-soft
## 2 Abbaye de Belloc France semi-hard, artisan
## 3 Abbaye de Belval France semi-hard
## 4 Abbaye de Citeaux France semi-soft, artisan, brined
## 5 Abbaye de Tamié France soft, artisan
## 6 Abbaye de Timadeuc France semi-hard
Great. Now I want to generate new columns for type such that each column only has one descriptor:
cheese_data_subset <- cheese_data_subset %>%
mutate(type = trimws(type)) %>%
separate(type, into = c("type1", "type2", "type3"), sep = ",", fill = "right", extra = "drop")
head(cheese_data_subset)
## cheese country type1 type2 type3
## 1 Aarewasser Switzerland semi-soft <NA> <NA>
## 2 Abbaye de Belloc France semi-hard artisan <NA>
## 3 Abbaye de Belval France semi-hard <NA> <NA>
## 4 Abbaye de Citeaux France semi-soft artisan brined
## 5 Abbaye de Tamié France soft artisan <NA>
## 6 Abbaye de Timadeuc France semi-hard <NA> <NA>
OK this is getting fun. type1 really seams to be about cheese hardness. So let’s rename the columns and then sort data accordingly:
cheese_data_subset <- cheese_data_subset %>%
rename(hardness = type1,
category = type2,
type = type3)
head(cheese_data_subset)
## cheese country hardness category type
## 1 Aarewasser Switzerland semi-soft <NA> <NA>
## 2 Abbaye de Belloc France semi-hard artisan <NA>
## 3 Abbaye de Belval France semi-hard <NA> <NA>
## 4 Abbaye de Citeaux France semi-soft artisan brined
## 5 Abbaye de Tamié France soft artisan <NA>
## 6 Abbaye de Timadeuc France semi-hard <NA> <NA>
unique_hardness <- unique(cheese_data_subset$hardness)
unique_category <- unique(cheese_data_subset$category)
unique_type <- unique(cheese_data_subset$type)
cat("Unique Hardness values:\n")
## Unique Hardness values:
print(unique_hardness)
## [1] "semi-soft" "semi-hard" "soft" "hard" "firm"
## [6] "fresh soft" "semi-firm" NA "fresh firm" "artisan"
cat("\nUnique Category values:\n")
##
## Unique Category values:
print(unique_category)
## [1] NA " artisan" " soft-ripened" " brined"
## [5] " blue-veined" " smear-ripened" " semi-soft" " processed"
## [9] " firm" " hard" " whey" " semi-hard"
## [13] " organic"
cat("\nUnique Type values:\n")
##
## Unique Type values:
print(unique_type)
## [1] NA " brined" " organic" " blue-veined"
## [5] " artisan" " soft-ripened" " processed" " smear-ripened"
Now we need to start moving data around to be in the correct columns.
cheese_data_subset <- cheese_data_subset %>%
mutate(
category = if_else(
hardness == "artisan",
if_else(category %in% c("empty", "", NA_character_),
"artisan",
paste(category, "artisan", sep = ", ")),
category
),
hardness = if_else(hardness == "artisan", NA_character_, hardness)
)
head(cheese_data_subset)
## cheese country hardness category type
## 1 Aarewasser Switzerland semi-soft <NA> <NA>
## 2 Abbaye de Belloc France semi-hard artisan <NA>
## 3 Abbaye de Belval France semi-hard <NA> <NA>
## 4 Abbaye de Citeaux France semi-soft artisan brined
## 5 Abbaye de Tamié France soft artisan <NA>
## 6 Abbaye de Timadeuc France semi-hard <NA> <NA>
Ok this worked so well that I decided I wanted to scan all three columns for all keywords. Then I wanted to seperate each key word into the right column. I thought this would be easy. It was not. I turned to the internet for help and discovered some new libraries to get this done. I don’t know if this is OK or not…but it tidies my data how I wanted it. So I guess there’s that.
library(stringr)
library(purrr)
hardness_keywords <- c("semi-soft", "semi-hard", "soft", "hard", "firm", "fresh soft", "semi-firm", "fresh firm")
category_keywords <- c("artisan", "processed", "organic", "smear-ripened")
type_keywords <- c("soft-ripened", "brined", "blue-veined", "smear-ripened", "whey")
reassign_keywords <- function(hardness_val, category_val, type_val) {
all_vals <- c(hardness_val, category_val, type_val)
all_vals <- unlist(str_split(all_vals, ","))
all_vals <- str_trim(all_vals)
all_vals <- all_vals[all_vals != ""]
new_hardness <- paste(intersect(all_vals, hardness_keywords), collapse = ", ")
new_category <- paste(intersect(all_vals, category_keywords), collapse = ", ")
new_type <- paste(intersect(all_vals, type_keywords), collapse = ", ")
tibble(new_hardness = new_hardness,
new_category = new_category,
new_type = new_type)
}
new_values <- cheese_data_subset %>%
select(hardness, category, type) %>%
pmap_dfr(reassign_keywords)
cheese_data_subset_updated <- cheese_data_subset %>%
bind_cols(new_values) %>%
select(-hardness, -category, -type) %>%
rename(
hardness = new_hardness,
category = new_category,
type = new_type
)
cheese_data_subset_updated <- cheese_data_subset_updated %>%
mutate(across(everything(), ~ifelse(. == "", NA_character_, .)))
head(cheese_data_subset_updated)
## cheese country hardness category type
## 1 Aarewasser Switzerland semi-soft <NA> <NA>
## 2 Abbaye de Belloc France semi-hard artisan <NA>
## 3 Abbaye de Belval France semi-hard <NA> <NA>
## 4 Abbaye de Citeaux France semi-soft artisan brined
## 5 Abbaye de Tamié France soft artisan <NA>
## 6 Abbaye de Timadeuc France semi-hard <NA> <NA>
PERFECT. This is the coolest thing I’ve done in R. I’m really happy with this. OK. Now to get a bar chart of cheeses by country of origin:
country_counts <- cheese_data_subset_updated %>%
count(country) %>%
arrange(desc(n)) %>%
slice_max(n, n = 20)
ggplot(country_counts, aes(x = reorder(country, n), y = n)) +
geom_bar(stat = "identity", fill = "steelblue") +
coord_flip() +
labs(title = "Top 20 Countries by Number of Rows",
x = "Country",
y = "Count") +
theme_minimal()
Oh crud. Turns our the country column was also comma dilineated. Let’s
truncate that and redo the graph:
cheese_data_subset_updated <- cheese_data_subset_updated %>%
mutate(country = sub(",.*", "", country))
head(cheese_data_subset_updated$country)
## [1] "Switzerland" "France" "France" "France" "France"
## [6] "France"
country_counts <- cheese_data_subset_updated %>%
count(country) %>%
arrange(desc(n)) %>%
slice_max(n, n = 20)
ggplot(country_counts, aes(x = reorder(country, n), y = n)) +
geom_bar(stat = "identity", fill = "steelblue") +
coord_flip() +
labs(title = "Top 20 Countries by Number of Rows",
x = "Country",
y = "Count") +
theme_minimal()
Better. I suppose if I’m being thorough. I should collapse “Ireland”
“Scotland” and “England” all into “United Kingdom” and see how it looks.
It’ll have an effect…
cheese_data_subset_updated <- cheese_data_subset_updated %>%
mutate(country = case_when(
country %in% c("Ireland", "Scotland", "England") ~ "United Kingdom",
TRUE ~ country
))
head(cheese_data_subset_updated$country)
## [1] "Switzerland" "France" "France" "France" "France"
## [6] "France"
Let’s try the chart one more time:
country_counts <- cheese_data_subset_updated %>%
count(country) %>%
arrange(desc(n)) %>%
slice_max(n, n = 20)
ggplot(country_counts, aes(x = reorder(country, n), y = n)) +
geom_bar(stat = "identity", fill = "steelblue") +
coord_flip() +
labs(title = "Top 20 Countries by Number of Rows",
x = "Country",
y = "Count") +
theme_minimal()
Who in the world makes the most cheese? The United States. King of
Cheese.
Onto the last bit of data.
##Rolling Stone Top Album Data
I am refusing to go off on any wild data goose chases. I’m like 5 hours into this project and ready to wrap it up. So here’s the question we want to answer, knowing that we have the top album data from 2003 to 2020. I want to know what the differential in movement is and I want to know the top movers up and top movers down. Like, over the course of 17 years, who got more popular and who got less popular.
##Question - Who were the top movers up and down over the course of 17 years?
Let’s begin by scrubbing data that has no real meaning for us:
album_data_clean <- albumdata %>%
select(-matches("^(Album.ID|Album.ID.Quoted)"))
head(album_data_clean)
## Sort.Name Clean.Name Album X2003.Rank.Old
## 1 Sinatra, Frank Frank Sinatra In the Wee Small Hours 101
## 2 Diddley, Bo Bo Diddley Bo Diddley / Go Bo Diddley 212
## 3 Presley, Elvis Elvis Presley Elvis Presley 56
## 4 Sinatra, Frank Frank Sinatra Songs for Swingin' Lovers! 302
## 5 Little Richard Little Richard Here's Little Richard 50
## 6 Beyonce Beyonce Lemonade NA
## X2003.Rank X2012.Rank X2020.Rank X2020.2003.Differential Release.Year
## 1 100 101 282 -182 1955
## 2 214 216 455 -241 1955
## 3 55 56 332 -277 1956
## 4 306 308 NA -195 1956
## 5 50 50 227 -177 1957
## 6 NA NA 32 469 2016
## Album.Genre Album.Type Wks.on.Billboard
## 1 Big Band/Jazz Studio 14
## 2 Rock n' Roll/Rhythm & Blues Studio -
## 3 Rock n' Roll/Rhythm & Blues Studio 100
## 4 Big Band/Jazz Studio ?
## 5 Studio 5
## 6 Studio 87
## Peak.Billboard.Position Spotify.Popularity
## 1 2 48
## 2 201 50
## 3 1 58
## 4 2 62
## 5 13 64
## 6 1 73
## Spotify.URI
## 1 spotify:album:3GmwKB1tgPZgXeRJZSm9WX
## 2 spotify:album:1cbtDEwxCjMhglb49OgNBR
## 3 spotify:album:7GXP5OhYyPVLmcVfO9Iqin
## 4 spotify:album:4kca7vXd1Wo5GE2DMafvMc
## 5 spotify:album:18tV6PLXYvVjsdOVk0S7M8
## 6 spotify:album:7dK54iZuOxXFarGhXwEXfF
## Chartmetric.Link Artist.Member.Count
## 1 https://app.chartmetric.com/album/3115110/about 1
## 2 https://app.chartmetric.com/album/4137525/about 1
## 3 https://app.chartmetric.com/album/1338334/about 1
## 4 https://app.chartmetric.com/album/1075976/about 1
## 5 https://app.chartmetric.com/album/3128391/about 1
## 6 https://app.chartmetric.com/album/4160994/about 1
## Artist.Gender Artist.Birth.Year.Sum Debut.Album.Release.Year
## 1 Male 1915 1946
## 2 Male 1928 1955
## 3 Male 1935 1956
## 4 Male 1915 1946
## 5 Male 1932 1957
## 6 Female 1981 2003
## Avg..Age.at.Top.500.Album Years.Between.Debut.and.Top.500.Album X X.1
## 1 40 9
## 2 27 0
## 3 21 0
## 4 41 10
## 5 25 0
## 6 35 13
OK….Now all this is interesting, but we really only need a dataframe that covers what we’re interested in, which is Clean.Name, Album, and X2020.2003.Differential
album_subset <- album_data_clean %>%
select(Clean.Name, Album, X2020.2003.Differential)
head(album_subset)
## Clean.Name Album X2020.2003.Differential
## 1 Frank Sinatra In the Wee Small Hours -182
## 2 Bo Diddley Bo Diddley / Go Bo Diddley -241
## 3 Elvis Presley Elvis Presley -277
## 4 Frank Sinatra Songs for Swingin' Lovers! -195
## 5 Little Richard Here's Little Richard -177
## 6 Beyonce Lemonade 469
Terrific. Let’s organize by descending.
top_positive <- album_subset %>%
filter(X2020.2003.Differential > 0) %>%
slice_max(order_by = X2020.2003.Differential, n = 15)
top_negative <- album_subset %>%
filter(X2020.2003.Differential < 0) %>%
slice_min(order_by = X2020.2003.Differential, n = 15)
top_positive
## Clean.Name Album X2020.2003.Differential
## 1 Kanye West My Beautiful Dark Twisted Fantasy 484
## 2 Kendrick Lamar To Pimp a Butterfly 482
## 3 Beyonce Lemonade 469
## 4 Amy Winehouse Back to Black 468
## 5 D'Angelo Voodoo 460
## 6 OutKast Aquemini 452
## 7 Kate Bush Hounds of Love 433
## 8 Kanye West The College Dropout 427
## 9 Frank Ocean Blonde 422
## 10 Beyonce Beyonce 420
## 11 Jay-Z The Blueprint 414
## 12 Erykah Badu Baduizm 412
## 13 Missy Elliott Supa Dupa Fly 408
## 14 Radiohead Kid A 408
## 15 Drake Take Care 406
top_negative
## Clean.Name Album
## 1 Robert Johnson The Complete Recordings
## 2 Boz Scaggs Boz Scaggs
## 3 Ian Dury and the Blockheads New Boots & Panties!!
## 4 Bonnie Raitt Give It Up
## 5 B.B. King Live in Cook County Jail
## 6 Bob Dylan Modern Times
## 7 Bruce Springsteen The Rising
## 8 Radiohead Amnesiac
## 9 Vampire Weekend Vampire Weekend
## 10 M.I.A. Kala
## 11 Arctic Monkeys Whatever People Say I Am, That's What I'm Not
## 12 MGMT Oracular Spectacular
## 13 The White Stripes White Blood Cells
## 14 My Morning Jacket Z
## 15 Creedence Clearwater Revival Chronicle: The 20 Greatest Hits
## 16 Manu Chao Proxima estacion: Esperanza
## 17 The Beach Boys The Smile Sessions
## X2020.2003.Differential
## 1 -501
## 2 -501
## 3 -501
## 4 -501
## 5 -501
## 6 -501
## 7 -501
## 8 -501
## 9 -501
## 10 -501
## 11 -501
## 12 -501
## 13 -501
## 14 -501
## 15 -501
## 16 -501
## 17 -501
OK. Well that was interesting. I had to take a second look at the data, because I don’t understand how we can have 15 records at -501. Turns out that if an album was not on the chart in 2003 OR again in 2020 it had a net score of -501. This is possible because they are on the list from the stats pulled in 2012. So all 15 albums listed in top_negative all popped onto the list in 2012 only. They weren’t there in 2003 and were gone again by 2012. I guess the differntial column is less useful than I would have hoped?
SO let’s rerun our dataframe with the 2003 and 2020 data. I’m not interested in albums that were on and off between those two periods. I also want to calculate my own differential. So really, I’m asking that of the 2003 albums that were still on the list by 2020…who were the top movers. So we need to reclean our data a bit:
album_subset <- album_data_clean %>%
select(Clean.Name, Album, X2003.Rank, X2020.Rank)
head(album_subset)
## Clean.Name Album X2003.Rank X2020.Rank
## 1 Frank Sinatra In the Wee Small Hours 100 282
## 2 Bo Diddley Bo Diddley / Go Bo Diddley 214 455
## 3 Elvis Presley Elvis Presley 55 332
## 4 Frank Sinatra Songs for Swingin' Lovers! 306 NA
## 5 Little Richard Here's Little Richard 50 227
## 6 Beyonce Lemonade NA 32
OK. Let’s make this an exploration of classic rock rather than the new stuff. If you aren’t on the list in 2003, I’m not interested. Also…if you aren’t on the list in 2020, I’m not interested. So let’s remove those:
album_subset <- album_subset %>%
filter(!is.na(X2003.Rank) & !is.na(X2020.Rank))
head(album_subset)
## Clean.Name Album X2003.Rank X2020.Rank
## 1 Frank Sinatra In the Wee Small Hours 100 282
## 2 Bo Diddley Bo Diddley / Go Bo Diddley 214 455
## 3 Elvis Presley Elvis Presley 55 332
## 4 Little Richard Here's Little Richard 50 227
## 5 Miles Davis Kind of Blue 12 31
## 6 John Coltrane Giant Steps 102 232
Now that we have that, let’s calculate the differential between 2003 and 2020:
album_subset <- album_subset %>%
mutate(Difference = X2003.Rank - X2020.Rank)
head(album_subset)
## Clean.Name Album X2003.Rank X2020.Rank Difference
## 1 Frank Sinatra In the Wee Small Hours 100 282 -182
## 2 Bo Diddley Bo Diddley / Go Bo Diddley 214 455 -241
## 3 Elvis Presley Elvis Presley 55 332 -277
## 4 Little Richard Here's Little Richard 50 227 -177
## 5 Miles Davis Kind of Blue 12 31 -19
## 6 John Coltrane Giant Steps 102 232 -130
Alright…not let’s look at the top movers on the high and low sides:
top_positive_movers <- album_subset %>%
filter(Difference > 0) %>%
slice_max(order_by = Difference, n = 15)
top_negative_movers <- album_subset %>%
filter(Difference < 0) %>%
slice_min(order_by = Difference, n = 15)
top_positive_movers
## Clean.Name Album X2003.Rank X2020.Rank
## 1 D'Angelo Voodoo 488 28
## 2 Jay-Z The Blueprint 464 50
## 3 Radiohead Kid A 428 20
## 4 Hole Live Through This 466 106
## 5 Wu-Tang Clan Enter the Wu-Tang: 36 Chambers 386 27
## 6 Nas Illmatic 400 44
## 7 Funkadelic Maggot Brain 486 136
## 8 Fugees The Score 477 134
## 9 George Michael Faith 480 151
## 10 Cyndi Lauper She's So Unusual 494 184
## 11 John Prince John Prine 458 149
## 12 The Notorious B.I.G. Life After Death 483 179
## 13 Lauryn Hill The Miseducation of Lauryn Hill 312 10
## 14 The Police Synchronicity 455 159
## 15 OutKast Stankonia 359 64
## Difference
## 1 460
## 2 414
## 3 408
## 4 360
## 5 359
## 6 356
## 7 350
## 8 343
## 9 329
## 10 310
## 11 309
## 12 304
## 13 302
## 14 296
## 15 295
top_negative_movers
## Clean.Name Album X2003.Rank
## 1 Muddy Waters The Anthology 38
## 2 Phil Spector Back to Mono (1958-1969) 64
## 3 Al Green Greatest Hits 52
## 4 Robert Johnson King of the Delta Blues Singers 27
## 5 Linda Ronstadt Heart Like a Wheel 164
## 6 Jefferson Airplane Surrealistic Pillow 146
## 7 Howlin' Wolf Moanin' in the Moonlight 153
## 8 The Stooges The Stooges 185
## 9 Sly & The Family Stone Greatest Hits 60
## 10 Ramones Rocket to Russia 105
## 11 Elvis Presley Elvis Presley 55
## 12 The Flying Burrito Brothers The Gilded Palace of Sin 192
## 13 Bonnie Raitt Nick of Time 229
## 14 Elvis Costello My Aim Is True 168
## 15 Marvin Gaye Let's Get It On 165
## X2020.Rank Difference
## 1 483 -445
## 2 489 -425
## 3 456 -404
## 4 374 -347
## 5 490 -326
## 6 471 -325
## 7 477 -324
## 8 488 -303
## 9 343 -283
## 10 385 -280
## 11 332 -277
## 12 462 -270
## 13 492 -263
## 14 430 -262
## 15 422 -257
OK, this is very cool.
##Answer
We can see how D’Angelo, Jay-Z and Radiohead had massive resurgences from 2003 to 2020. And looking at the other direction, Muddy Waters, Phil Spector and Al Green lost a lot of ground.
I honestly don’t know what to make of most of this information, but there are the top album movers over a fun 17 year span.
##Conclusion
This was a cool exercise. I feel so much more comfortable cleaning data. To the extent that I have a much better understanding of what that even means. More crucially, I have total confidence in my ability to just move data around to get the kinds of answers I want out of it, or better, the answers paying clients will want out of it. This was neat. It was hard and challenging, but neat. Thanks for the exercise!!