Principles and good practices of data cleaning

class: center, middle, inverse, title-slide

.title[
# Principles and good practices of data cleaning
]
.author[
### Shanlong Ding
]
.institute[
### HID/DSI/WPRO/WHO
]
.date[
### 09 Dec 2024
]

---

# A cat story

.pull-left[
![](data:image/png;base64,#image/cat-2.jpg)
]

.pull-right[
![](data:image/png;base64,#image/cat-1.jpg)
- Textual variables: Name, Address

- CATegorical variables: Gender, Breed

- Numeric variables: Birth date, Body weight, Body length

]

---

# A data set

<div class="datatables html-widget html-fill-item" id="htmlwidget-ff79dcaea72469f7acb1" style="width:100%;height:auto;"></div>
<script type="application/json" data-for="htmlwidget-ff79dcaea72469f7acb1">{"x":{"filter":"none","vertical":false,"data":[["1","2","3","4","5","6"],["9527A","2581D","2581D","3394C","2218B","1195A"],["Hubei, China","Shandong, la Chine","I don't know","China Shandong","Madrid, Spain","London, UK"],["M","Mâle","M","Female","Male","F"],["Maine","Américain à poil court","I don't know","American shorthair","Ginger","Tuxedo"],["2021-05-06","2016-01-19","19 Jan 2016","25 Apr 2016","18/09/2020","05 Apr 2020"],["12 kg","7 kg","7kg","198.4 lbs","I don't know","11.02 lbs"]],"container":"<table class=\"display\">\n  <thead>\n    <tr>\n      <th> <\/th>\n      <th>ID<\/th>\n      <th>Address<\/th>\n      <th>Gender<\/th>\n      <th>Breed<\/th>\n      <th>Birth date<\/th>\n      <th>Body weight<\/th>\n    <\/tr>\n  <\/thead>\n<\/table>","options":{"searching":false,"paging":false,"info":false,"columnDefs":[{"orderable":false,"targets":0},{"name":" ","targets":0},{"name":"ID","targets":1},{"name":"Address","targets":2},{"name":"Gender","targets":3},{"name":"Breed","targets":4},{"name":"Birth date","targets":5},{"name":"Body weight","targets":6}],"order":[],"autoWidth":false,"orderClasses":false}},"evals":[],"jsHooks":[]}</script>

---

# Cleaned data

<div class="datatables html-widget html-fill-item" id="htmlwidget-e6945680959dc1eeb5e0" style="width:100%;height:auto;"></div>
<script type="application/json" data-for="htmlwidget-e6945680959dc1eeb5e0">{"x":{"filter":"none","vertical":false,"data":[["1","2","3","4","5"],["9527A","2581D","3394C","2218B","1195A"],["Giant","Maomao","Mimi","Puss","Kitty"],["Hubei","Shandong","Shandong","Madrid","London"],["China","China","China","Spain","UK"],["M","M","F","M","F"],["Maine","American Shorthair","American Shorthair","Ginger","Tuxedo"],["2021-05-06","2016-01-19","2016-04-25","2020-09-18","2020-04-05"],[12,7,9,null,5]],"container":"<table class=\"display\">\n  <thead>\n    <tr>\n      <th> <\/th>\n      <th>id<\/th>\n      <th>name<\/th>\n      <th>city<\/th>\n      <th>country<\/th>\n      <th>gender<\/th>\n      <th>breed<\/th>\n      <th>birth_date<\/th>\n      <th>weight_kg<\/th>\n    <\/tr>\n  <\/thead>\n<\/table>","options":{"searching":false,"paging":false,"info":false,"scrollX":true,"columnDefs":[{"className":"dt-right","targets":8},{"orderable":false,"targets":0},{"name":" ","targets":0},{"name":"id","targets":1},{"name":"name","targets":2},{"name":"city","targets":3},{"name":"country","targets":4},{"name":"gender","targets":5},{"name":"breed","targets":6},{"name":"birth_date","targets":7},{"name":"weight_kg","targets":8}],"order":[],"autoWidth":false,"orderClasses":false}},"evals":[],"jsHooks":[]}</script>

- **Each row:** a single observation.

- **Each column:** a single variable - no mixed information.

- **Each cell:** a single value - no duplicated or nested information.

---

# Data science process

![](data:image/png;base64,#image/data-science-process.png)

.center[####Garbage In, Garbage Out.####]

- Data cleaning is the most time-consuming task in data analysis.

- Have clean data can increase the efficiency of all subsequent work several times over.

---

# Basic principles

- Understand your data

- Rename variables

- Recognize NAs, outliers, and duplicated data

- Standardization

- Transform

- Merge

- Test and save

---

# Understand your data

- Data resource?

- Why use the data?

- Structure and basic information?

``` r
glimpse(df_raw)
```

```
## Rows: 6
## Columns: 6
## $ ID            <chr> "9527A", "2581D", "2581D", "3394C", "2218B", "1195A"
## $ Address       <chr> "Hubei, China", "Shandong, la Chine", "I don't know", "C…
## $ Gender        <chr> "M", "Mâle", "M", "Female", "Male", "F"
## $ Breed         <chr> "Maine", "Américain à poil court", "I don't know", "Amer…
## $ `Birth date`  <chr> "2021-05-06", "2016-01-19", "19 Jan 2016", "25 Apr 2016"…
## $ `Body weight` <chr> "12 kg", "7 kg", "7kg", "198.4 lbs", "I don't know", "11…
```

---

# Rename variables

``` r
names(df_raw)
```

```
## [1] "ID"          "Address"     "Gender"      "Breed"       "Birth date" 
## [6] "Body weight"
```

``` r
df_raw_1 <- clean_names(df_raw)
names(df_raw_1)
```

```
## [1] "id"          "address"     "gender"      "breed"       "birth_date" 
## [6] "body_weight"
```

- All letters: lowercase.

- Word separate: "_" instead of " ".

- Simple and understandable.

---

# Rename variables

|id    |address            |gender |breed                  |birth_date  |body_weight  |
|:-----|:------------------|:------|:----------------------|:-----------|:------------|
|9527A |Hubei, China       |M      |Maine                  |2021-05-06  |12 kg        |
|2581D |Shandong, la Chine |Mâle   |Américain à poil court |2016-01-19  |7 kg         |
|2581D |I don't know       |M      |I don't know           |19 Jan 2016 |7kg          |
|3394C |China Shandong     |Female |American shorthair     |25 Apr 2016 |198.4 lbs    |
|2218B |Madrid, Spain      |Male   |Ginger                 |18/09/2020  |I don't know |
|1195A |London, UK         |F      |Tuxedo                 |05 Apr 2020 |11.02 lbs    |

---

# Recognize NAs, outliers, and duplicated data

- Keep, delete, or fill NAs.

- Recognize outliers and correct them.

- Remove duplicated data.

``` r
df_raw_2 <- df_raw_1
df_raw_2[df_raw_2 == "I don't know"] <- NA
df_raw_2 <- distinct(df_raw_2, id, .keep_all = TRUE)
df_raw_2[df_raw_2$id == "3394C", "body_weight"] <- "19.84 lbs"
```

---

# Recognize NAs, outliers, and duplicated data

|id    |address            |gender |breed                  |birth_date  |body_weight |
|:-----|:------------------|:------|:----------------------|:-----------|:-----------|
|9527A |Hubei, China       |M      |Maine                  |2021-05-06  |12 kg       |
|2581D |Shandong, la Chine |Mâle   |Américain à poil court |2016-01-19  |7 kg        |
|3394C |China Shandong     |Female |American shorthair     |25 Apr 2016 |19.84 lbs   |
|2218B |Madrid, Spain      |Male   |Ginger                 |18/09/2020  |NA          |
|1195A |London, UK         |F      |Tuxedo                 |05 Apr 2020 |11.02 lbs   |

---

# Standardization

- Date: YYYY-MM-DD, whatever

- Text: language, upper case, address

- CATegorical variables: levels

``` r
df_raw_3 <- df_raw_2
df_raw_3[2, 2] <- "Shandong, China"
df_raw_3[3, 2] <- "Shandong, China"
df_raw_3[2, 4] <- "American shorthair"
df_raw_3$gender <- fct_collapse(
  df_raw_3$gender, 
  "M" = c("M", "Male", "Mâle"), "F" = c("F", "Female")
)
df_raw_3$birth_date <- as_date(
  ifelse(
    str_detect(df_raw_3$birth_date, "^\\d{4}"), 
    ymd(df_raw_3$birth_date), dmy(df_raw_3$birth_date)
  )
)
df_raw_3$breed <- factor(str_to_title(df_raw_3$breed))
```

---

# Standardization

|id    |address         |gender |breed              |birth_date |body_weight |
|:-----|:---------------|:------|:------------------|:----------|:-----------|
|9527A |Hubei, China    |M      |Maine              |2021-05-06 |12 kg       |
|2581D |Shandong, China |M      |American Shorthair |2016-01-19 |7 kg        |
|3394C |Shandong, China |F      |American Shorthair |2016-04-25 |19.84 lbs   |
|2218B |Madrid, Spain   |M      |Ginger             |2020-09-18 |NA          |
|1195A |London, UK      |F      |Tuxedo             |2020-04-05 |11.02 lbs   |

---

# Transform

``` r
df_raw_4 <- df_raw_3 %>% 
  separate(address, c("city", "country"), sep = ", ") %>% 
  separate(body_weight, c("weight", "unit"), sep = " ") %>% 
  mutate(
    weight = as.numeric(weight), 
    weight_kg = ifelse(unit == "lbs", weight / 2.205, weight)
  ) %>% 
  select(-weight, -unit)
```

- Unify all units.

- Split variables containing multiple pieces of information.

---

# Transform

|id    |city     |country |gender |breed              |birth_date | weight_kg|
|:-----|:--------|:-------|:------|:------------------|:----------|---------:|
|9527A |Hubei    |China   |M      |Maine              |2021-05-06 | 12.000000|
|2581D |Shandong |China   |M      |American Shorthair |2016-01-19 |  7.000000|
|3394C |Shandong |China   |F      |American Shorthair |2016-04-25 |  8.997732|
|2218B |Madrid   |Spain   |M      |Ginger             |2020-09-18 |        NA|
|1195A |London   |UK      |F      |Tuxedo             |2020-04-05 |  4.997732|

---

# Merge

.pull-left[

|id    |name   |
|:-----|:------|
|9527A |Giant  |
|2581D |Maomao |
|2894D |Claus  |
|3394C |Mimi   |
|1107B |Kate   |
|2218B |Puss   |
|1195A |Lucas  |
|9956C |Kitty  |
]

.pull-right[

- Find unique keys:

- A single variable, such as ID.
  
  - A combination of variables, such as name, birth date, gender, etc.

- Merge different tables in a right way.

]

<br>

``` r
df_final <- df_raw_4 %>% 
  left_join(df_id_name, by = "id") %>% 
  relocate(id, name)
```

---

# Test and save

``` r
kable(df_final)
```

|id    |name   |city     |country |gender |breed              |birth_date | weight_kg|
|:-----|:------|:--------|:-------|:------|:------------------|:----------|---------:|
|9527A |Giant  |Hubei    |China   |M      |Maine              |2021-05-06 | 12.000000|
|2581D |Maomao |Shandong |China   |M      |American Shorthair |2016-01-19 |  7.000000|
|3394C |Mimi   |Shandong |China   |F      |American Shorthair |2016-04-25 |  8.997732|
|2218B |Puss   |Madrid   |Spain   |M      |Ginger             |2020-09-18 |        NA|
|1195A |Lucas  |London   |UK      |F      |Tuxedo             |2020-04-05 |  4.997732|

``` r
write_csv(df_final, "cat-cleaned.csv")
```

---

# Summary

- **Clean data, clear insights.**

- **Follow principles:**

- Understand
  
  - Standardize
  
  - Transform
  
  - Test
  
- **Iterate and Improve:**

- Data quality is a continuous process, not a one-time task.
  
<br>

## .center[Good data is the foundation of great decisions!]

---

class: inverse, center, middle

# Thank you for listening!