Working with the Janitor Package in R

An Overview

The janitor package is a small but incredibly useful and convenient tool when it comes to cleaning and maintaining data frames within RStudio. According to experts, data scientists spend nearly 50% to 80% of their time collecting and preparing unruly digital data, before getting a chance to turn them into useful information. The janitor package contains simple functions and is optimized for user-friendliness. It is also a tidyverse oriented package, so it plays nicely with the pipe %>% operator.

Installation

You can install the most recent officially-released version from CRAN by:

#install.packages("janitor")
library(janitor)
## 
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

Major Functions in Janitor

Cleaning Data Frames

One can invoke the clean_names() function every time an unseen dataset is loaded. This one function handles various issues related to the column names of a dataset such as:

  • Converting all column names to a consistent format, where the default is snake_case
  • Handling special characters and spaces in the names
  • Handling duplicate column names by appending numbers
#Creating a dummy data.frame with dirty column names
dset <- as.data.frame(matrix(ncol = 7))
names(dset) <- c("first column", "SEC0nd!", " % valueColumn","SAME NAME","SAME NAME", "Age", "$Salary$") 
dset %>% clean_names()
##   first_column sec0nd percent_value_column same_name same_name_2 age salary
## 1           NA     NA                   NA        NA          NA  NA     NA

Finding Duplicates

With the use of the get_dupes() function, we can identify and hunt down any duplicate records and then determine if we keep those for further analysis. The get_dupes() function will return the list of duplicate rows along with their count.

This is a useful function when you have multiple employee records. People can have similar names but not the same employeeID. Exploring a combination of columns can help avoid unconscious mistakes.

get_dupes(starwars, eye_color, hair_color, skin_color, sex, homeworld)
## # A tibble: 7 × 15
##   eye_color hair_color skin_color sex    homeworld dupe_count name  height  mass
##   <chr>     <chr>      <chr>      <chr>  <chr>          <int> <chr>  <int> <dbl>
## 1 blue      black      yellow     female Mirial             2 Lumi…    170  56.2
## 2 blue      black      yellow     female Mirial             2 Barr…    166  50  
## 3 blue      blond      fair       male   Tatooine           2 Luke…    172  77  
## 4 blue      blond      fair       male   Tatooine           2 Anak…    188  84  
## 5 brown     brown      light      female Naboo              3 Cordé    157  NA  
## 6 brown     brown      light      female Naboo              3 Dormé    165  NA  
## 7 brown     brown      light      female Naboo              3 Padm…    165  45  
## # … with 6 more variables: birth_year <dbl>, gender <chr>, species <chr>,
## #   films <list>, vehicles <list>, starships <list>

Removing Empty Rows & Columns

Similar to the na.omit function, but easier to remember! By default, the function removes both empty rows and columns which can be specified to our preferences.

dset2 <- data.frame(x = c(1,NA,4),
                    y = c(NA,NA,3),
                    z = c(NA, NA, NA))

dset2 %>% remove_empty(c("rows","cols"))
##   x  y
## 1 1 NA
## 3 4  3
#Uncomment this line to see a different output!
#dset2 %>% remove_empty(c("cols"))

Converting Date Time Formats

If you ever had to deal with data directly imported as a CSV from google sheets, then you’ll be familiar with the pain of handling different date-time formats. Luckily, janitor provides two different functions to handle dates.

The excel_numeric_to_date function deals with encoded date values from Excel like 41103

excel_numeric_to_date(42223)
## [1] "2015-08-07"

The convert_to_date function proves to be especially useful when you’re dealing with non-uniform date-times (eg: MM-DD-YY/DD-MM-YYYY) and converts them all to the YYYY-MM-DD format.

convert_to_date(c("23.03.1997","40000"),character_fun = lubridate::dmy)
## [1] "1997-03-23" "2009-07-06"

Conclusion

Although the janitor package comes with a limited set of functions to clean a dataset, it is clear that they are very user-friendly, easy to remember, and incredibly powerful. It also contains a variety of functions to help with data exploration as an alternative to many existing R functions but with more capabilities. You can use the help(janitor) command to dive into the details of this package!

help(janitor)

References

Firke, Sam. 2021. Janitor: Simple Tools for Examining and Cleaning Dirty Data. https://CRAN.R-project.org/package=janitor.

convert_to_date: Convert many date and datetime formats as may be received from Microsoft Excel. https://www.rdocumentation.org/packages/janitor/versions/2.1.0/topics/convert_to_date

Overview of Janitor Functions. https://cran.r-project.org/web/packages/janitor/vignettes/janitor.html

Rapp, Albert. 2022. Showcasing the janitor package. https://www.r-bloggers.com/2022/01/showcasing-the-janitor-package/