myData <- read.csv("../00_data/boardgames_details.csv") %>% as_tibble()
myData %>% skimr::skim()
| Name | Piped data |
| Number of rows | 21631 |
| Number of columns | 23 |
| _______________________ | |
| Column type frequency: | |
| character | 10 |
| numeric | 13 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| primary | 0 | 1.00 | 1 | 107 | 0 | 21236 | 0 |
| description | 1 | 1.00 | 49 | 16144 | 0 | 21615 | 0 |
| boardgamecategory | 283 | 0.99 | 8 | 216 | 0 | 6730 | 0 |
| boardgamemechanic | 1590 | 0.93 | 8 | 478 | 0 | 8291 | 0 |
| boardgamefamily | 3761 | 0.83 | 13 | 2768 | 0 | 11285 | 0 |
| boardgameexpansion | 16125 | 0.25 | 7 | 18150 | 0 | 5264 | 0 |
| boardgameimplementation | 16769 | 0.22 | 6 | 890 | 0 | 4247 | 0 |
| boardgamedesigner | 596 | 0.97 | 7 | 332 | 0 | 9136 | 0 |
| boardgameartist | 5907 | 0.73 | 6 | 8408 | 0 | 9080 | 0 |
| boardgamepublisher | 1 | 1.00 | 6 | 3744 | 0 | 11265 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| num | 0 | 1 | 10815.00 | 6244.48 | 0 | 5407.5 | 10815 | 16222.5 | 21630 | ▇▇▇▇▇ |
| id | 0 | 1 | 118133.09 | 105310.42 | 1 | 12280.5 | 105187 | 207013.0 | 350992 | ▇▂▃▃▂ |
| yearpublished | 0 | 1 | 1986.09 | 210.04 | -3500 | 2001.0 | 2011 | 2017.0 | 2023 | ▁▁▁▁▇ |
| minplayers | 0 | 1 | 2.01 | 0.69 | 0 | 2.0 | 2 | 2.0 | 10 | ▇▁▁▁▁ |
| maxplayers | 0 | 1 | 5.71 | 15.10 | 0 | 4.0 | 4 | 6.0 | 999 | ▇▁▁▁▁ |
| playingtime | 0 | 1 | 90.51 | 534.83 | 0 | 25.0 | 45 | 90.0 | 60000 | ▇▁▁▁▁ |
| minplaytime | 0 | 1 | 63.65 | 447.21 | 0 | 20.0 | 30 | 60.0 | 60000 | ▇▁▁▁▁ |
| maxplaytime | 0 | 1 | 90.51 | 534.83 | 0 | 25.0 | 45 | 90.0 | 60000 | ▇▁▁▁▁ |
| minage | 0 | 1 | 9.61 | 3.64 | 0 | 8.0 | 10 | 12.0 | 25 | ▂▇▆▁▁ |
| owned | 0 | 1 | 1487.92 | 5395.08 | 0 | 150.0 | 322 | 903.5 | 168364 | ▇▁▁▁▁ |
| trading | 0 | 1 | 43.59 | 102.41 | 0 | 5.0 | 13 | 38.0 | 2508 | ▇▁▁▁▁ |
| wanting | 0 | 1 | 42.03 | 117.94 | 0 | 3.0 | 9 | 29.0 | 2011 | ▇▁▁▁▁ |
| wishing | 0 | 1 | 233.66 | 800.66 | 0 | 14.0 | 39 | 131.0 | 19325 | ▇▁▁▁▁ |
A primary key in my data is “num”
Divide it using dplyr::select in a way the two have a common variable, which you could use to join the two.
myData_part1 <- myData %>% select(num, id:minage)
myData_part2 <- myData %>% select(num, boardgamecategory:wishing)
Use tidyr::left_join or other joining functions.
myData_joined <- left_join(myData_part1, myData_part2, by = "num")
# Confirm myData_joined = data
setdiff(myData_joined, myData)
## # A tibble: 0 × 23
## # … with 23 variables: num <int>, id <int>, primary <chr>, description <chr>,
## # yearpublished <int>, minplayers <int>, maxplayers <int>, playingtime <int>,
## # minplaytime <int>, maxplaytime <int>, minage <int>,
## # boardgamecategory <chr>, boardgamemechanic <chr>, boardgamefamily <chr>,
## # boardgameexpansion <chr>, boardgameimplementation <chr>,
## # boardgamedesigner <chr>, boardgameartist <chr>, boardgamepublisher <chr>,
## # owned <int>, trading <int>, wanting <int>, wishing <int>
myData %>%
summarise(sum(str_detect(minplayers, "^2")))
## # A tibble: 1 × 1
## `sum(str_detect(minplayers, "^2"))`
## <int>
## 1 14834
str_detect(myData_part1$yearpublished, "2008$") %>% head(10)
## [1] TRUE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE
sum(str_detect(myData_part1$yearpublished, "2008$"))
## [1] 582
myData %>%
mutate(new_var_withN = str_extract(yearpublished, "[5]+")) %>%
select(yearpublished, new_var_withN)
## # A tibble: 21,631 × 2
## yearpublished new_var_withN
## <int> <chr>
## 1 2008 <NA>
## 2 2000 <NA>
## 3 1995 5
## 4 2010 <NA>
## 5 2008 <NA>
## 6 2004 <NA>
## 7 2015 5
## 8 2016 <NA>
## 9 2015 5
## 10 2007 <NA>
## # … with 21,621 more rows
myData_part1 %>% mutate(yearpublished_rep = yearpublished %>% str_replace("[0-5]$", "-"))
## # A tibble: 21,631 × 12
## num id primary descr…¹ yearp…² minpl…³ maxpl…⁴ playi…⁵ minpl…⁶ maxpl…⁷
## <int> <int> <chr> <chr> <int> <int> <int> <int> <int> <int>
## 1 0 30549 Pandemic In Pan… 2008 2 4 45 45 45
## 2 1 822 Carcass… Carcas… 2000 2 5 45 30 45
## 3 2 13 Catan In CAT… 1995 3 4 120 60 120
## 4 3 68448 7 Wonde… You ar… 2010 2 7 30 30 30
## 5 4 36218 Dominion "… 2008 2 4 30 30 30
## 6 5 9209 Ticket … With e… 2004 2 5 60 30 60
## 7 6 178900 Codenam… Codena… 2015 2 8 15 15 15
## 8 7 167791 Terrafo… In the… 2016 1 5 120 120 120
## 9 8 173346 7 Wonde… In man… 2015 2 2 30 30 30
## 10 9 31260 Agricola Descri… 2007 1 5 150 30 150
## # … with 21,621 more rows, 2 more variables: minage <int>,
## # yearpublished_rep <chr>, and abbreviated variable names ¹description,
## # ²yearpublished, ³minplayers, ⁴maxplayers, ⁵playingtime, ⁶minplaytime,
## # ⁷maxplaytime