tidy.knit

Tools needed for Cleaning, Organizing and Transforming Data in R

library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

## ✔ ggplot2 3.3.6     ✔ purrr   0.3.4
## ✔ tibble  3.1.7     ✔ dplyr   1.0.9
## ✔ tidyr   1.2.0     ✔ stringr 1.4.0
## ✔ readr   2.1.2     ✔ forcats 0.5.1

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

Cleaning Data Tools

rename()
rename_with()
glimpse()
select()
clean_names()
skim_without_charts()
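
A minimal sketch of how some of these cleaning functions might be combined. The tibble `raw` and its column names are made up for illustration. clean_names() and skim_without_charts() are left out because they come from the janitor and skimr packages, which are not loaded by library(tidyverse).

```r
library(tidyverse)

# Hypothetical raw data with inconsistent column names
raw <- tibble(
  `Customer ID` = c(1, 2, 3),
  `Total Sales` = c(250, 125, 310),
  Notes         = c("a", "b", "c")
)

cleaned <- raw %>%
  rename(customer_id = `Customer ID`) %>%         # rename one column: new = old
  rename_with(~ tolower(gsub(" ", "_", .x))) %>%  # apply a function to every column name
  select(customer_id, total_sales)                # keep only the columns of interest

glimpse(cleaned)                                  # compact overview of columns and types
```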

Organizing Data Tools

filter()
max()
mean()
summarize()
arrange()
group_by()
drop_na()
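
One way these organizing functions could fit together, sketched on a small made-up tibble (`scores` is hypothetical, not data from these notes):

```r
library(tidyverse)

# Hypothetical data with one missing value
scores <- tibble(
  team  = c("a", "a", "b", "b", "c"),
  score = c(10, 20, 30, NA, 50)
)

team_summary <- scores %>%
  drop_na(score) %>%                     # remove rows where score is missing
  filter(score > 10) %>%                 # keep rows matching a condition
  group_by(team) %>%                     # later summaries are computed per team
  summarize(mean_score = mean(score),
            max_score  = max(score)) %>%
  arrange(desc(mean_score))              # sort, highest mean first

team_summary
```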

Transforming Data Tools

%>%
unite()
separate()
mutate()
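
A short sketch of the transforming functions chained with the pipe. The tibble `people` and the derived column `age_in_2000` are made up for illustration.

```r
library(tidyverse)

# Hypothetical data
people <- tibble(
  first = c("Ada", "Alan"),
  last  = c("Lovelace", "Turing"),
  born  = c("1815-12-10", "1912-06-23")
)

transformed <- people %>%
  unite("full_name", first, last, sep = " ") %>%                   # combine two columns into one
  separate(born, into = c("year", "month", "day"), sep = "-") %>%  # split one column into several
  mutate(year        = as.numeric(year),                           # separate() returns character columns
         age_in_2000 = 2000 - year)                                # add a derived column

transformed
```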

Checking for Bias Tools

library(SimDesign)

bias(actual, predicted)
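
To make the idea concrete, the sketch below computes the average signed difference between two vectors in base R. The values are made up, and it is an assumption (worth checking against the SimDesign documentation) that bias(actual, predicted) returns the same number for two equal-length vectors.

```r
# Hypothetical observed and predicted values (made up for illustration)
actual    <- c(10, 12, 14, 16)
predicted <- c(9, 12, 13, 15)

# Average signed difference between the two vectors
manual_bias <- mean(actual - predicted)
manual_bias
```

A result close to 0 suggests little systematic bias. Here the result is positive, meaning the predictions are on average lower than the actual values.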

Summary statistics and visualization

Anscombe’s quartet has four datasets that have nearly identical summary statistics.

library(Tmisc)

data(quartet)

quartet %>% group_by(set) %>% summarize(mean(x), sd(x), mean(y), sd(y), cor(x,y))

## # A tibble: 4 × 6
##   set   `mean(x)` `sd(x)` `mean(y)` `sd(y)` `cor(x, y)`
##   <fct>     <dbl>   <dbl>     <dbl>   <dbl>       <dbl>
## 1 I             9    3.32      7.50    2.03       0.816
## 2 II            9    3.32      7.50    2.03       0.816
## 3 III           9    3.32      7.5     2.03       0.816
## 4 IV            9    3.32      7.50    2.03       0.817

The standard deviation measures the spread of values in a dataset: it shows how far values typically lie from the mean.
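
As a sketch of what sd() computes, the sample standard deviation can be worked out by hand. This uses base R's built-in anscombe data frame, where x1 holds the x values of the first set, rather than the quartet data from Tmisc.

```r
# anscombe ships with base R; x1 is the x column of the first set
x <- anscombe$x1

deviations <- x - mean(x)                                # distance of each value from the mean
manual_sd  <- sqrt(sum(deviations^2) / (length(x) - 1))  # sample standard deviation

manual_sd
sd(x)  # built-in function, agrees with manual_sd
```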

ggplot(quartet, aes(x,y)) + geom_point() + geom_smooth(method=lm, se=FALSE) + facet_wrap(~set)

## `geom_smooth()` using formula 'y ~ x'

search() lists the attached packages and datasets (the search path).

detach() removes an attached package or dataset from the search path. It is the opposite of attach().
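
A small sketch of attach(), search() and detach() working together, using the built-in mtcars dataset. The variable names `attached` and `still_attached` are only for illustration.

```r
attach(mtcars)                            # put the columns of mtcars on the search path
attached <- "mtcars" %in% search()        # TRUE while mtcars is attached
mean(mpg)                                 # columns can now be referenced by name

detach(mtcars)                            # remove it from the search path again
still_attached <- "mtcars" %in% search()  # FALSE after detach()
```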

library(datasauRus)

ggplot(datasaurus_dozen, aes(x=x, y=y, colour=dataset)) + geom_point() +
  theme_void() + theme(legend.position = "none") + facet_wrap(~dataset, ncol = 3)