The purpose of this noteboook is to provide examples of useful data cleaning steps.
library(dplyr)
library(stringr)
library(tidyr)
df <- tibble::tribble(~`fruitName`, ~`Count (Tonnes!)`,
"apple GREEN", 1,
"apple Red", 2,
"carrot_std", 3
)
Tidy data is where:
It can help to think of it in terms of being easy to use in ggplot2. Could each column be a x, y, fill, colour, or facet variable?
It is useful if data is always in this format (with the exception of spatial line or polygon data, where a tidy format can make objects get too large in size).
See tidyr_training for further guidance.
table4a
## # A tibble: 3 x 3
## country `1999` `2000`
## * <chr> <int> <int>
## 1 Afghanistan 745 2666
## 2 Brazil 37737 80488
## 3 China 212258 213766
gather(table4a, key="year", value="cases", `1999`:`2000`)
## # A tibble: 6 x 3
## country year cases
## <chr> <chr> <int>
## 1 Afghanistan 1999 745
## 2 Brazil 1999 37737
## 3 China 1999 212258
## 4 Afghanistan 2000 2666
## 5 Brazil 2000 80488
## 6 China 2000 213766
table2
## # A tibble: 12 x 4
## country year type count
## <chr> <int> <chr> <int>
## 1 Afghanistan 1999 cases 745
## 2 Afghanistan 1999 population 19987071
## 3 Afghanistan 2000 cases 2666
## 4 Afghanistan 2000 population 20595360
## 5 Brazil 1999 cases 37737
## 6 Brazil 1999 population 172006362
## 7 Brazil 2000 cases 80488
## 8 Brazil 2000 population 174504898
## 9 China 1999 cases 212258
## 10 China 1999 population 1272915272
## 11 China 2000 cases 213766
## 12 China 2000 population 1280428583
spread(table2, key=type, value=count)
## # A tibble: 6 x 4
## country year cases population
## <chr> <int> <int> <int>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
Only lower case or underscores, and remove all special characters.
colnames(df) <- snakecase::to_snake_case(colnames(df), sep_out = "_")
df
## # A tibble: 3 x 2
## fruit_name count_tonnes
## <chr> <dbl>
## 1 apple GREEN 1
## 2 apple Red 2
## 3 carrot_std 3
df <- df %>%
mutate_if(funs(is.character), funs(stringi::stri_trans_totitle(., opts_brkiter = stringi::stri_opts_brkiter(type = "sentence")))) %>%
mutate_if(funs(is.character), funs(str_replace_all(., "_", " ")))
df
## # A tibble: 3 x 2
## fruit_name count_tonnes
## <chr> <dbl>
## 1 Apple green 1
## 2 Apple red 2
## 3 Carrot std 3
df <- df %>% mutate_at(vars("fruit_name"), as.factor)
df <- df %>% mutate_at(vars("count_tonnes"), as.integer)
df
## # A tibble: 3 x 2
## fruit_name count_tonnes
## <fct> <int>
## 1 Apple green 1
## 2 Apple red 2
## 3 Carrot std 3
df <- df %>% mutate(count_kg = count_tonnes * 1000)
df
## # A tibble: 3 x 3
## fruit_name count_tonnes count_kg
## <fct> <int> <dbl>
## 1 Apple green 1 1000
## 2 Apple red 2 2000
## 3 Carrot std 3 3000