This is my first attempt at creating an R Notebook (from RStudio) and tests the new janitor package from Samuel Firke. This package contains ‘Simple Tools for Examining and Cleaning Dirty Data’ and is the most popular debutant on CRAN over the past 10 days
This notebook takes some of the examples from the vignette and documentation but, of course, the code is easily amendable and alternative data can be selected for processing.
Just open up the - downloadable from Dropbox - janitor.nb.html in Rstudio or in the R console with the rmarkdown package and an editable notebook version is created for you to mess around with. You will need to have the three libraries listed below installed and loaded
library(janitor)
library(readxl)
library(dplyr)
Here is the familiar dataset that is used in the vignette. Adding the print statement results in a table (when in notebook format) with the option of the traditional layout
print(glimpse(mtcars))
## Observations: 32
## Variables: 11
## $ mpg <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19....
## $ cyl <dbl> 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, ...
## $ disp <dbl> 160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 360.0, 146.7, 1...
## $ hp <dbl> 110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, ...
## $ drat <dbl> 3.90, 3.90, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.9...
## $ wt <dbl> 2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3...
## $ qsec <dbl> 16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20.00, 2...
## $ vs <dbl> 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, ...
## $ am <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, ...
## $ gear <dbl> 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, ...
## $ carb <dbl> 4, 4, 1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4, 4, 4, 1, 2, ...
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
## Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
## Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
## Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
## Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
## Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
## Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
## Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
## Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
## Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
## Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
## Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
## Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
## Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
## Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
## Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
## AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
## Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
## Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
## Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
## Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
## Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
## Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
## Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
Resulting names of data.frame are unique and consist only of the _ character, lowercase letters, and numbers.
# Create a data.frame with dirty names
test_df <- as.data.frame(matrix(ncol = 6))
names(test_df) <- c("hIgHlo", "REPEAT VALUE", "REPEAT VALUE",
"% successful (2009)", "abc@!*", "")
glimpse(test_df)
## Observations: 1
## Variables: 6
## $ hIgHlo <lgl> NA
## $ REPEAT VALUE <lgl> NA
## $ REPEAT VALUE <lgl> NA
## $ % successful (2009) <lgl> NA
## $ abc@!* <lgl> NA
## $ <lgl> NA
test_df %>%
clean_names()
## highlo repeat_value repeat_value_2 percent_successful_2009 abc x
## 1 NA NA NA NA NA NA
An enhanced version of base table with added features including being returned as a data.frame and automatically calculating percentages
mtcars %>%
filter(gear > 3) %>%
tabyl(cyl)
## cyl n percent
## 1 4 10 0.5882353
## 2 6 5 0.2941176
## 3 8 2 0.1176471
Hunts down and examines duplicate records during data cleaning. Confirms there is no problem with this data
mtcars %>%
get_dupes(wt, cyl)
## # A tibble: 4 × 12
## wt cyl dupe_count mpg disp hp drat qsec vs am gear
## <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 3.44 6 2 19.2 167.6 123 3.92 18.30 1 0 4
## 2 3.44 6 2 17.8 167.6 123 3.92 18.90 1 0 4
## 3 3.57 8 2 14.3 360.0 245 3.21 15.84 0 0 3
## 4 3.57 8 2 15.0 301.0 335 3.54 14.60 0 1 5
## # ... with 1 more variables: carb <dbl>
Create a crosstab, displaying either frequencies or percentages calculated by row, column, or overall.
Just choose another combination e.g. vs & carb in the code and it immediately gets amended.
mtcars %>%
filter(am == 0) %>%
crosstab(vs, carb)
## vs 1 2 3 4
## 1 0 0 4 3 5
## 2 1 3 2 0 2
Designed to run on the output of a call to crosstab, this adds formatting, percentage sign, Ns, totals row/column, and custom rounding to a table of numeric values. The result is no longer clean data, but it saves time in reporting table results.
# showing with all parameters
mtcars %>%
crosstab(gear, cyl) %>%
adorn_crosstab(., denom = "col", rounding = "half up", show_n = FALSE, digits = 2)
## gear 4 6 8
## 1 3 9.09% 28.57% 85.71%
## 2 4 72.73% 57.14% 0.00%
## 3 5 18.18% 14.29% 14.29%
This function excludes the first column of the input data.frame, assuming that it contains a descriptive variable not to be summed.
# showing with all parameters
mtcars %>%
crosstab(am, cyl) %>%
add_totals_row
## am 4 6 8
## 1 0 3 4 12
## 2 1 8 3 2
## 3 Total 11 7 14
This function excludes the first column of the input data.frame, assuming that it contains a descriptive variable not to be summed.
mtcars %>%
crosstab(am, cyl) %>%
add_totals_col
## am 4 6 8 Total
## 1 0 3 4 12 19
## 2 1 8 3 2 13
There are a few more covered in the vignette and some of the functions above also may have additional parameters. Check them out