data cleaning training

1. Purpose.

The purpose of this noteboook is to provide examples of useful data cleaning steps.

2. Load libraries and view practice dataset.

library(dplyr)
library(stringr)
library(tidyr)
df <- tibble::tribble(~`fruitName`, ~`Count (Tonnes!)`,
        "apple GREEN", 1, 
        "apple Red", 2, 
        "carrot_std", 3 
      )

3. Get data into a tidy structure using tidyr functions.

Tidy data is where:

Each variable forms a column.
Each observation forms a row.
Each type of observational unit forms a table.

It can help to think of it in terms of being easy to use in ggplot2. Could each column be a x, y, fill, colour, or facet variable?

It is useful if data is always in this format (with the exception of spatial line or polygon data, where a tidy format can make objects get too large in size).

See tidyr_training for further guidance.

table4a

## # A tibble: 3 x 3
##   country     `1999` `2000`
## * <chr>        <int>  <int>
## 1 Afghanistan    745   2666
## 2 Brazil       37737  80488
## 3 China       212258 213766

gather(table4a, key="year", value="cases", `1999`:`2000`)

## # A tibble: 6 x 3
##   country     year   cases
##   <chr>       <chr>  <int>
## 1 Afghanistan 1999     745
## 2 Brazil      1999   37737
## 3 China       1999  212258
## 4 Afghanistan 2000    2666
## 5 Brazil      2000   80488
## 6 China       2000  213766

table2

## # A tibble: 12 x 4
##    country      year type            count
##    <chr>       <int> <chr>           <int>
##  1 Afghanistan  1999 cases             745
##  2 Afghanistan  1999 population   19987071
##  3 Afghanistan  2000 cases            2666
##  4 Afghanistan  2000 population   20595360
##  5 Brazil       1999 cases           37737
##  6 Brazil       1999 population  172006362
##  7 Brazil       2000 cases           80488
##  8 Brazil       2000 population  174504898
##  9 China        1999 cases          212258
## 10 China        1999 population 1272915272
## 11 China        2000 cases          213766
## 12 China        2000 population 1280428583

spread(table2, key=type, value=count)

## # A tibble: 6 x 4
##   country      year  cases population
##   <chr>       <int>  <int>      <int>
## 1 Afghanistan  1999    745   19987071
## 2 Afghanistan  2000   2666   20595360
## 3 Brazil       1999  37737  172006362
## 4 Brazil       2000  80488  174504898
## 5 China        1999 212258 1272915272
## 6 China        2000 213766 1280428583

4. Change all variable names to snake case.

Only lower case or underscores, and remove all special characters.

colnames(df) <- snakecase::to_snake_case(colnames(df), sep_out = "_")
df

## # A tibble: 3 x 2
##   fruit_name  count_tonnes
##   <chr>              <dbl>
## 1 apple GREEN            1
## 2 apple Red              2
## 3 carrot_std             3

5. Convert all character values to sentence case with spaces.

df <- df %>% 
  mutate_if(funs(is.character), funs(stringi::stri_trans_totitle(., opts_brkiter = stringi::stri_opts_brkiter(type = "sentence")))) %>% 
  mutate_if(funs(is.character), funs(str_replace_all(., "_", " ")))
df

## # A tibble: 3 x 2
##   fruit_name  count_tonnes
##   <chr>              <dbl>
## 1 Apple green            1
## 2 Apple red              2
## 3 Carrot std             3

6. Change character variables to factors and doubles to integers, as required.

df <- df %>% mutate_at(vars("fruit_name"), as.factor)
df <- df %>% mutate_at(vars("count_tonnes"), as.integer)
df

## # A tibble: 3 x 2
##   fruit_name  count_tonnes
##   <fct>              <int>
## 1 Apple green            1
## 2 Apple red              2
## 3 Carrot std             3

7. mutate any new useful columns.

df <- df %>% mutate(count_kg = count_tonnes * 1000)
df

## # A tibble: 3 x 3
##   fruit_name  count_tonnes count_kg
##   <fct>              <int>    <dbl>
## 1 Apple green            1     1000
## 2 Apple red              2     2000
## 3 Carrot std             3     3000