Import your data

myData <- read.csv("../00_data/boardgames_details.csv") %>% as_tibble()

myData %>% skimr::skim()
Data summary
Name Piped data
Number of rows 21631
Number of columns 23
_______________________
Column type frequency:
character 10
numeric 13
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
primary 0 1.00 1 107 0 21236 0
description 1 1.00 49 16144 0 21615 0
boardgamecategory 283 0.99 8 216 0 6730 0
boardgamemechanic 1590 0.93 8 478 0 8291 0
boardgamefamily 3761 0.83 13 2768 0 11285 0
boardgameexpansion 16125 0.25 7 18150 0 5264 0
boardgameimplementation 16769 0.22 6 890 0 4247 0
boardgamedesigner 596 0.97 7 332 0 9136 0
boardgameartist 5907 0.73 6 8408 0 9080 0
boardgamepublisher 1 1.00 6 3744 0 11265 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
num 0 1 10815.00 6244.48 0 5407.5 10815 16222.5 21630 ▇▇▇▇▇
id 0 1 118133.09 105310.42 1 12280.5 105187 207013.0 350992 ▇▂▃▃▂
yearpublished 0 1 1986.09 210.04 -3500 2001.0 2011 2017.0 2023 ▁▁▁▁▇
minplayers 0 1 2.01 0.69 0 2.0 2 2.0 10 ▇▁▁▁▁
maxplayers 0 1 5.71 15.10 0 4.0 4 6.0 999 ▇▁▁▁▁
playingtime 0 1 90.51 534.83 0 25.0 45 90.0 60000 ▇▁▁▁▁
minplaytime 0 1 63.65 447.21 0 20.0 30 60.0 60000 ▇▁▁▁▁
maxplaytime 0 1 90.51 534.83 0 25.0 45 90.0 60000 ▇▁▁▁▁
minage 0 1 9.61 3.64 0 8.0 10 12.0 25 ▂▇▆▁▁
owned 0 1 1487.92 5395.08 0 150.0 322 903.5 168364 ▇▁▁▁▁
trading 0 1 43.59 102.41 0 5.0 13 38.0 2508 ▇▁▁▁▁
wanting 0 1 42.03 117.94 0 3.0 9 29.0 2011 ▇▁▁▁▁
wishing 0 1 233.66 800.66 0 14.0 39 131.0 19325 ▇▁▁▁▁

Chapter 13

What are primary keys in your data?

A primary key in my data is “num”

Can you divide your data into two?

Divide it using dplyr::select in a way the two have a common variable, which you could use to join the two.

myData_part1 <- myData %>% select(num, id:minage)
myData_part2 <- myData %>% select(num, boardgamecategory:wishing)

Can you join the two together?

Use tidyr::left_join or other joining functions.

myData_joined <- left_join(myData_part1, myData_part2, by = "num")

# Confirm myData_joined = data
setdiff(myData_joined, myData)
## # A tibble: 0 × 23
## # … with 23 variables: num <int>, id <int>, primary <chr>, description <chr>,
## #   yearpublished <int>, minplayers <int>, maxplayers <int>, playingtime <int>,
## #   minplaytime <int>, maxplaytime <int>, minage <int>,
## #   boardgamecategory <chr>, boardgamemechanic <chr>, boardgamefamily <chr>,
## #   boardgameexpansion <chr>, boardgameimplementation <chr>,
## #   boardgamedesigner <chr>, boardgameartist <chr>, boardgamepublisher <chr>,
## #   owned <int>, trading <int>, wanting <int>, wishing <int>

Chapter 14

Tools

Decect Matches

myData %>%
    summarise(sum(str_detect(minplayers, "^2")))
## # A tibble: 1 × 1
##   `sum(str_detect(minplayers, "^2"))`
##                                 <int>
## 1                               14834
str_detect(myData_part1$yearpublished, "2008$") %>% head(10)
##  [1]  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE
sum(str_detect(myData_part1$yearpublished, "2008$"))
## [1] 582

Extract matches

myData %>%
    mutate(new_var_withN = str_extract(yearpublished, "[5]+")) %>%
    select(yearpublished, new_var_withN)
## # A tibble: 21,631 × 2
##    yearpublished new_var_withN
##            <int> <chr>        
##  1          2008 <NA>         
##  2          2000 <NA>         
##  3          1995 5            
##  4          2010 <NA>         
##  5          2008 <NA>         
##  6          2004 <NA>         
##  7          2015 5            
##  8          2016 <NA>         
##  9          2015 5            
## 10          2007 <NA>         
## # … with 21,621 more rows

Replacing Matches

myData_part1 %>% mutate(yearpublished_rep = yearpublished %>% str_replace("[0-5]$", "-"))
## # A tibble: 21,631 × 12
##      num     id primary  descr…¹ yearp…² minpl…³ maxpl…⁴ playi…⁵ minpl…⁶ maxpl…⁷
##    <int>  <int> <chr>    <chr>     <int>   <int>   <int>   <int>   <int>   <int>
##  1     0  30549 Pandemic In Pan…    2008       2       4      45      45      45
##  2     1    822 Carcass… Carcas…    2000       2       5      45      30      45
##  3     2     13 Catan    In CAT…    1995       3       4     120      60     120
##  4     3  68448 7 Wonde… You ar…    2010       2       7      30      30      30
##  5     4  36218 Dominion &quot;…    2008       2       4      30      30      30
##  6     5   9209 Ticket … With e…    2004       2       5      60      30      60
##  7     6 178900 Codenam… Codena…    2015       2       8      15      15      15
##  8     7 167791 Terrafo… In the…    2016       1       5     120     120     120
##  9     8 173346 7 Wonde… In man…    2015       2       2      30      30      30
## 10     9  31260 Agricola Descri…    2007       1       5     150      30     150
## # … with 21,621 more rows, 2 more variables: minage <int>,
## #   yearpublished_rep <chr>, and abbreviated variable names ¹​description,
## #   ²​yearpublished, ³​minplayers, ⁴​maxplayers, ⁵​playingtime, ⁶​minplaytime,
## #   ⁷​maxplaytime