Working with the Janitor Package in R
An Overview
The janitor package is a small but incredibly useful and convenient tool when it comes to cleaning and maintaining data frames within RStudio. According to experts, data scientists spend nearly 50% to 80% of their time collecting and preparing unruly digital data, before getting a chance to turn them into useful information. The janitor package contains simple functions and is optimized for user-friendliness. It is also a tidyverse oriented package, so it plays nicely with the pipe %>% operator.
Installation
You can install the most recent officially-released version from CRAN by:
#install.packages("janitor")
library(janitor)##
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
library(dplyr)##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
Major Functions in Janitor
Cleaning Data Frames
One can invoke the clean_names() function every time an unseen dataset is loaded. This one function handles various issues related to the column names of a dataset such as:
- Converting all column names to a consistent format, where the default is snake_case
- Handling special characters and spaces in the names
- Handling duplicate column names by appending numbers
#Creating a dummy data.frame with dirty column names
dset <- as.data.frame(matrix(ncol = 7))
names(dset) <- c("first column", "SEC0nd!", " % valueColumn","SAME NAME","SAME NAME", "Age", "$Salary$")
dset %>% clean_names()## first_column sec0nd percent_value_column same_name same_name_2 age salary
## 1 NA NA NA NA NA NA NA
Finding Duplicates
With the use of the get_dupes() function, we can identify and hunt down any duplicate records and then determine if we keep those for further analysis. The get_dupes() function will return the list of duplicate rows along with their count.
This is a useful function when you have multiple employee records. People can have similar names but not the same employeeID. Exploring a combination of columns can help avoid unconscious mistakes.
get_dupes(starwars, eye_color, hair_color, skin_color, sex, homeworld)## # A tibble: 7 × 15
## eye_color hair_color skin_color sex homeworld dupe_count name height mass
## <chr> <chr> <chr> <chr> <chr> <int> <chr> <int> <dbl>
## 1 blue black yellow female Mirial 2 Lumi… 170 56.2
## 2 blue black yellow female Mirial 2 Barr… 166 50
## 3 blue blond fair male Tatooine 2 Luke… 172 77
## 4 blue blond fair male Tatooine 2 Anak… 188 84
## 5 brown brown light female Naboo 3 Cordé 157 NA
## 6 brown brown light female Naboo 3 Dormé 165 NA
## 7 brown brown light female Naboo 3 Padm… 165 45
## # … with 6 more variables: birth_year <dbl>, gender <chr>, species <chr>,
## # films <list>, vehicles <list>, starships <list>
Removing Empty Rows & Columns
Similar to the na.omit function, but easier to remember! By default, the function removes both empty rows and columns which can be specified to our preferences.
dset2 <- data.frame(x = c(1,NA,4),
y = c(NA,NA,3),
z = c(NA, NA, NA))
dset2 %>% remove_empty(c("rows","cols"))## x y
## 1 1 NA
## 3 4 3
#Uncomment this line to see a different output!
#dset2 %>% remove_empty(c("cols"))Converting Date Time Formats
If you ever had to deal with data directly imported as a CSV from google sheets, then you’ll be familiar with the pain of handling different date-time formats. Luckily, janitor provides two different functions to handle dates.
The excel_numeric_to_date function deals with encoded date values from Excel like 41103
excel_numeric_to_date(42223)## [1] "2015-08-07"
The convert_to_date function proves to be especially useful when you’re dealing with non-uniform date-times (eg: MM-DD-YY/DD-MM-YYYY) and converts them all to the YYYY-MM-DD format.
convert_to_date(c("23.03.1997","40000"),character_fun = lubridate::dmy)## [1] "1997-03-23" "2009-07-06"
Conclusion
Although the janitor package comes with a limited set of functions to clean a dataset, it is clear that they are very user-friendly, easy to remember, and incredibly powerful. It also contains a variety of functions to help with data exploration as an alternative to many existing R functions but with more capabilities. You can use the help(janitor) command to dive into the details of this package!
help(janitor)References
Firke, Sam. 2021. Janitor: Simple Tools for Examining and Cleaning Dirty Data. https://CRAN.R-project.org/package=janitor.
convert_to_date: Convert many date and datetime formats as may be received from Microsoft Excel. https://www.rdocumentation.org/packages/janitor/versions/2.1.0/topics/convert_to_date
Overview of Janitor Functions. https://cran.r-project.org/web/packages/janitor/vignettes/janitor.html
Rapp, Albert. 2022. Showcasing the janitor package. https://www.r-bloggers.com/2022/01/showcasing-the-janitor-package/