class: center, middle, inverse, title-slide .title[ # Principles and good practices of data cleaning ] .author[ ### Shanlong Ding ] .institute[ ### HID/DSI/WPRO/WHO ] .date[ ### 09 Dec 2024 ] --- # A cat story .pull-left[  ] .pull-right[  - Textual variables: Name, Address - CATegorical variables: Gender, Breed - Numeric variables: Birth date, Body weight, Body length ] --- # A data set
--- # Cleaned data
-- - **Each row:** a single observation. -- - **Each column:** a single variable - no mixed information. -- - **Each cell:** a single value - no duplicated or nested information. --- # Data science process  -- .center[####Garbage In, Garbage Out.####] -- - Data cleaning is the most time-consuming task in data analysis. - Have clean data can increase the efficiency of all subsequent work several times over. --- # Basic principles - Understand your data - Rename variables - Recognize NAs, outliers, and duplicated data - Standardization - Transform - Merge - Test and save --- # Understand your data - Data resource? - Why use the data? - Structure and basic information? ``` r glimpse(df_raw) ``` ``` ## Rows: 6 ## Columns: 6 ## $ ID <chr> "9527A", "2581D", "2581D", "3394C", "2218B", "1195A" ## $ Address <chr> "Hubei, China", "Shandong, la Chine", "I don't know", "C… ## $ Gender <chr> "M", "Mâle", "M", "Female", "Male", "F" ## $ Breed <chr> "Maine", "Américain à poil court", "I don't know", "Amer… ## $ `Birth date` <chr> "2021-05-06", "2016-01-19", "19 Jan 2016", "25 Apr 2016"… ## $ `Body weight` <chr> "12 kg", "7 kg", "7kg", "198.4 lbs", "I don't know", "11… ``` --- # Rename variables ``` r names(df_raw) ``` ``` ## [1] "ID" "Address" "Gender" "Breed" "Birth date" ## [6] "Body weight" ``` ``` r df_raw_1 <- clean_names(df_raw) names(df_raw_1) ``` ``` ## [1] "id" "address" "gender" "breed" "birth_date" ## [6] "body_weight" ``` -- - All letters: lowercase. - Word separate: "_" instead of " ". - Simple and understandable. --- # Rename variables |id |address |gender |breed |birth_date |body_weight | |:-----|:------------------|:------|:----------------------|:-----------|:------------| |9527A |Hubei, China |M |Maine |2021-05-06 |12 kg | |2581D |Shandong, la Chine |Mâle |Américain à poil court |2016-01-19 |7 kg | |2581D |I don't know |M |I don't know |19 Jan 2016 |7kg | |3394C |China Shandong |Female |American shorthair |25 Apr 2016 |198.4 lbs | |2218B |Madrid, Spain |Male |Ginger |18/09/2020 |I don't know | |1195A |London, UK |F |Tuxedo |05 Apr 2020 |11.02 lbs | --- # Recognize NAs, outliers, and duplicated data - Keep, delete, or fill NAs. - Recognize outliers and correct them. - Remove duplicated data. ``` r df_raw_2 <- df_raw_1 df_raw_2[df_raw_2 == "I don't know"] <- NA df_raw_2 <- distinct(df_raw_2, id, .keep_all = TRUE) df_raw_2[df_raw_2$id == "3394C", "body_weight"] <- "19.84 lbs" ``` --- # Recognize NAs, outliers, and duplicated data |id |address |gender |breed |birth_date |body_weight | |:-----|:------------------|:------|:----------------------|:-----------|:-----------| |9527A |Hubei, China |M |Maine |2021-05-06 |12 kg | |2581D |Shandong, la Chine |Mâle |Américain à poil court |2016-01-19 |7 kg | |3394C |China Shandong |Female |American shorthair |25 Apr 2016 |19.84 lbs | |2218B |Madrid, Spain |Male |Ginger |18/09/2020 |NA | |1195A |London, UK |F |Tuxedo |05 Apr 2020 |11.02 lbs | --- # Standardization - Date: YYYY-MM-DD, whatever - Text: language, upper case, address - CATegorical variables: levels ``` r df_raw_3 <- df_raw_2 df_raw_3[2, 2] <- "Shandong, China" df_raw_3[3, 2] <- "Shandong, China" df_raw_3[2, 4] <- "American shorthair" df_raw_3$gender <- fct_collapse( df_raw_3$gender, "M" = c("M", "Male", "Mâle"), "F" = c("F", "Female") ) df_raw_3$birth_date <- as_date( ifelse( str_detect(df_raw_3$birth_date, "^\\d{4}"), ymd(df_raw_3$birth_date), dmy(df_raw_3$birth_date) ) ) df_raw_3$breed <- factor(str_to_title(df_raw_3$breed)) ``` --- # Standardization |id |address |gender |breed |birth_date |body_weight | |:-----|:---------------|:------|:------------------|:----------|:-----------| |9527A |Hubei, China |M |Maine |2021-05-06 |12 kg | |2581D |Shandong, China |M |American Shorthair |2016-01-19 |7 kg | |3394C |Shandong, China |F |American Shorthair |2016-04-25 |19.84 lbs | |2218B |Madrid, Spain |M |Ginger |2020-09-18 |NA | |1195A |London, UK |F |Tuxedo |2020-04-05 |11.02 lbs | --- # Transform ``` r df_raw_4 <- df_raw_3 %>% separate(address, c("city", "country"), sep = ", ") %>% separate(body_weight, c("weight", "unit"), sep = " ") %>% mutate( weight = as.numeric(weight), weight_kg = ifelse(unit == "lbs", weight / 2.205, weight) ) %>% select(-weight, -unit) ``` - Unify all units. - Split variables containing multiple pieces of information. --- # Transform |id |city |country |gender |breed |birth_date | weight_kg| |:-----|:--------|:-------|:------|:------------------|:----------|---------:| |9527A |Hubei |China |M |Maine |2021-05-06 | 12.000000| |2581D |Shandong |China |M |American Shorthair |2016-01-19 | 7.000000| |3394C |Shandong |China |F |American Shorthair |2016-04-25 | 8.997732| |2218B |Madrid |Spain |M |Ginger |2020-09-18 | NA| |1195A |London |UK |F |Tuxedo |2020-04-05 | 4.997732| --- # Merge .pull-left[ |id |name | |:-----|:------| |9527A |Giant | |2581D |Maomao | |2894D |Claus | |3394C |Mimi | |1107B |Kate | |2218B |Puss | |1195A |Lucas | |9956C |Kitty | ] .pull-right[ - Find unique keys: - A single variable, such as ID. - A combination of variables, such as name, birth date, gender, etc. - Merge different tables in a right way. ] <br> ``` r df_final <- df_raw_4 %>% left_join(df_id_name, by = "id") %>% relocate(id, name) ``` --- # Test and save ``` r kable(df_final) ``` |id |name |city |country |gender |breed |birth_date | weight_kg| |:-----|:------|:--------|:-------|:------|:------------------|:----------|---------:| |9527A |Giant |Hubei |China |M |Maine |2021-05-06 | 12.000000| |2581D |Maomao |Shandong |China |M |American Shorthair |2016-01-19 | 7.000000| |3394C |Mimi |Shandong |China |F |American Shorthair |2016-04-25 | 8.997732| |2218B |Puss |Madrid |Spain |M |Ginger |2020-09-18 | NA| |1195A |Lucas |London |UK |F |Tuxedo |2020-04-05 | 4.997732| ``` r write_csv(df_final, "cat-cleaned.csv") ``` --- # Summary - **Clean data, clear insights.** - **Follow principles:** - Understand - Standardize - Transform - Test - **Iterate and Improve:** - Data quality is a continuous process, not a one-time task. <br> -- ## .center[Good data is the foundation of great decisions!] --- class: inverse, center, middle # Thank you for listening!