Introduction

This is my first attempt at creating an R Notebook (from RStudio) and tests the new janitor package from Samuel Firke. This package contains ‘Simple Tools for Examining and Cleaning Dirty Data’ and is the most popular debutant on CRAN over the past 10 days

This notebook takes some of the examples from the vignette and documentation but, of course, the code is easily amendable and alternative data can be selected for processing.

Just open up the - downloadable from Dropbox - janitor.nb.html in Rstudio or in the R console with the rmarkdown package and an editable notebook version is created for you to mess around with. You will need to have the three libraries listed below installed and loaded

library(janitor)
library(readxl)
library(dplyr)

Dataset

Here is the familiar dataset that is used in the vignette. Adding the print statement results in a table (when in notebook format) with the option of the traditional layout

print(glimpse(mtcars))

## Observations: 32
## Variables: 11
## $ mpg  <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19....
## $ cyl  <dbl> 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, ...
## $ disp <dbl> 160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 360.0, 146.7, 1...
## $ hp   <dbl> 110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, ...
## $ drat <dbl> 3.90, 3.90, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.9...
## $ wt   <dbl> 2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3...
## $ qsec <dbl> 16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20.00, 2...
## $ vs   <dbl> 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, ...
## $ am   <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, ...
## $ gear <dbl> 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, ...
## $ carb <dbl> 4, 4, 1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4, 4, 4, 1, 2, ...
##                      mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
## Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
## Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
## Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
## Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
## Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
## Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
## Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
## Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
## Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
## Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
## Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
## Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
## Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
## Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
## Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
## AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
## Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
## Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
## Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
## Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
## Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
## Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
## Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
## Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
## Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2

Functions

clean_names

Resulting names of data.frame are unique and consist only of the _ character, lowercase letters, and numbers.

# Create a data.frame with dirty names
test_df <- as.data.frame(matrix(ncol = 6))
names(test_df) <- c("hIgHlo", "REPEAT VALUE", "REPEAT VALUE",
                    "% successful (2009)",  "abc@!*", "")
glimpse(test_df)

## Observations: 1
## Variables: 6
## $ hIgHlo              <lgl> NA
## $ REPEAT VALUE        <lgl> NA
## $ REPEAT VALUE        <lgl> NA
## $ % successful (2009) <lgl> NA
## $ abc@!*              <lgl> NA
## $                     <lgl> NA

test_df %>%
  clean_names()

##   highlo repeat_value repeat_value_2 percent_successful_2009 abc  x
## 1     NA           NA             NA                      NA  NA NA

tabyl

An enhanced version of base table with added features including being returned as a data.frame and automatically calculating percentages

mtcars %>%
  filter(gear > 3) %>%
  tabyl(cyl)

##   cyl  n   percent
## 1   4 10 0.5882353
## 2   6  5 0.2941176
## 3   8  2 0.1176471

get_dupes

Hunts down and examines duplicate records during data cleaning. Confirms there is no problem with this data

mtcars %>%
  get_dupes(wt, cyl)

## # A tibble: 4 × 12
##      wt   cyl dupe_count   mpg  disp    hp  drat  qsec    vs    am  gear
##   <dbl> <dbl>      <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1  3.44     6          2  19.2 167.6   123  3.92 18.30     1     0     4
## 2  3.44     6          2  17.8 167.6   123  3.92 18.90     1     0     4
## 3  3.57     8          2  14.3 360.0   245  3.21 15.84     0     0     3
## 4  3.57     8          2  15.0 301.0   335  3.54 14.60     0     1     5
## # ... with 1 more variables: carb <dbl>

crosstab

Create a crosstab, displaying either frequencies or percentages calculated by row, column, or overall.
Just choose another combination e.g. vs & carb in the code and it immediately gets amended.

mtcars %>%
  filter(am == 0) %>%
  crosstab(vs, carb)

##   vs 1 2 3 4
## 1  0 0 4 3 5
## 2  1 3 2 0 2

adorn_crosstab

Designed to run on the output of a call to crosstab, this adds formatting, percentage sign, Ns, totals row/column, and custom rounding to a table of numeric values. The result is no longer clean data, but it saves time in reporting table results.

# showing with all parameters
mtcars %>%
 crosstab(gear, cyl) %>%
 adorn_crosstab(., denom = "col", rounding = "half up", show_n = FALSE, digits = 2)

##   gear      4      6      8
## 1    3  9.09% 28.57% 85.71%
## 2    4 72.73% 57.14%  0.00%
## 3    5 18.18% 14.29% 14.29%

add_totals_row

This function excludes the first column of the input data.frame, assuming that it contains a descriptive variable not to be summed.

# showing with all parameters
mtcars %>%
  crosstab(am, cyl) %>%
  add_totals_row

##      am  4 6  8
## 1     0  3 4 12
## 2     1  8 3  2
## 3 Total 11 7 14

add_totals_col

This function excludes the first column of the input data.frame, assuming that it contains a descriptive variable not to be summed.

mtcars %>%
  crosstab(am, cyl) %>%
  add_totals_col

##   am 4 6  8 Total
## 1  0 3 4 12    19
## 2  1 8 3  2    13

Other functions

There are a few more covered in the vignette and some of the functions above also may have additional parameters. Check them out

Author

Andrew Clark is an independent R developer based in North Vancouver

Introductions to his work can be found under the mytinyshinys umbrella and a set of recent flexdashboards

janitor Notebook

Andrew Clark

7th October 2016