Extending skimr

2022-01-04

Introduction

The skim() function summarizes data types contained within data frames and objects that have as.data.frame() methods to coerce them into data frames. It comes with a set of default summary functions for a wide variety of data types, but this is not comprehensive.

Package authors (and advanced users) can add support for skimming their specific non-data-frame objects in their packages, and they can provide different defaults in their own summary functions. This will require including skimr as a dependency.

Skimming objects that are not coercible to data frames

This section illustrates the approach by adding support for the lm object produced by lm(). For any object this involves two required elements and one optional element. This is a simple example; other types of objects may involve much more complexity.

If you are adding skim support to a package you will also need to add skimr to the list of imports.

library(skimr)

The lm() function produces a complex object with class “lm”.

results <- lm(weight ~ feed, data = chickwts)
class(results)
## [1] "lm"
attributes(results)
## $names
##  [1] "coefficients"  "residuals"     "effects"       "rank"         
##  [5] "fitted.values" "assign"        "qr"            "df.residual"  
##  [9] "contrasts"     "xlevels"       "call"          "terms"        
## [13] "model"        
## 
## $class
## [1] "lm"

There is no as.data.frame method for an lm object.

as.data.frame(results)

Error in as.data.frame.default(results) : cannot coerce class ‘“lm”’ to a data.frame
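One way to check ahead of time whether an object can be coerced is to look for an S3 method directly. The sketch below uses only base R and assumes a fresh session in which no attached package registers an as.data.frame method for lm; the variable names are illustrative.

```r
# Fit the same model used in the text
results <- lm(weight ~ feed, data = chickwts)

# With optional = TRUE, getS3method() returns NULL instead of raising an
# error when no method is registered for the class
lm_method <- utils::getS3method("as.data.frame", "lm", optional = TRUE)
is.null(lm_method)

# Coercion therefore falls through to as.data.frame.default(), which fails
attempt <- try(as.data.frame(results), silent = TRUE)
inherits(attempt, "try-error")
```

Both checks should return TRUE under the stated assumption, which confirms that some extra work is needed before skim() can be used on the fitted model.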

The “Using skimr” vignette shows how to handle a new type of data in a column of a simple data frame (for which we would create an sfl). This is a different type of challenge: an object that we might wish to skim, but that cannot be directly skimmed. Therefore we need to make it into an object that is either a data frame or coercible to a data frame.

In the case of the lm object, the model element is already a data frame. So a very simple way to solve the challenge is to skim results$model directly.

skim(results$model)
Data summary
Name results$model
Number of rows 71
Number of columns 2
_______________________
Column type frequency:
factor 1
numeric 1
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
feed 0 1 FALSE 6 soy: 14, cas: 12, lin: 12, sun: 12

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
weight 0 1 261.31 78.07 108 204.5 258 323.5 423 ▆▆▇▇▃

This works, but we could go one step further and create a new function for doing this directly.

skim_lm <- function(.data){
   .data <- .data$model
   skim(.data)
}

lm(weight ~ feed, data = chickwts) %>% skim_lm()
Data summary
Name .data
Number of rows 71
Number of columns 2
_______________________
Column type frequency:
factor 1
numeric 1
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
feed 0 1 FALSE 6 soy: 14, cas: 12, lin: 12, sun: 12

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
weight 0 1 261.31 78.07 108 204.5 258 323.5 423 ▆▆▇▇▃

If desired, a more complex function can be created. For example, the lm object also contains fitted values and residuals. We could incorporate these in the data frame.

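skim_lm <- function(.data, fit = FALSE){
   # Keep the fitted model so that its fitted values and residuals stay
   # reachable after .data is replaced by the model frame
   results <- .data
   .data <- results$model
   if (isTRUE(fit)){
   .data <- .data %>% 
            dplyr::bind_cols(
              fitted =
                data.frame(results$fitted.values),
              residuals =
                data.frame(results$residuals))
   }
   skim(.data)
}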
skim_lm(results, fit = TRUE)
Data summary
Name .data
Number of rows 71
Number of columns 4
_______________________
Column type frequency:
factor 1
numeric 3
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
feed 0 1 FALSE 6 soy: 14, cas: 12, lin: 12, sun: 12

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
weight 0 1 261.31 78.07 108.00 204.50 258.00 323.50 423.00 ▆▆▇▇▃
results.fitted.values 0 1 261.31 57.46 160.20 218.75 246.43 323.58 328.92 ▃▃▅▃▇
results.residuals 0 1 0.00 52.86 -123.91 -34.41 1.57 38.17 103.09 ▂▅▇▅▃
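The augmented data frame can also be assembled with base R alone, which avoids the dplyr dependency and gives direct control over the column names. This is a sketch rather than the approach used above; the helper name lm_frame() and the column names fitted and residual are assumptions chosen for illustration.

```r
# Hypothetical helper: return the model frame of an lm fit, optionally
# with fitted values and residuals attached as extra columns
lm_frame <- function(model, fit = FALSE) {
  stopifnot(inherits(model, "lm"))
  out <- model$model
  if (isTRUE(fit)) {
    out$fitted <- unname(model$fitted.values)
    out$residual <- unname(model$residuals)
  }
  out
}

results <- lm(weight ~ feed, data = chickwts)
augmented <- lm_frame(results, fit = TRUE)

dim(augmented)   # 71 rows, 4 columns
names(augmented) # "weight" "feed" "fitted" "residual"

# Fitted values plus residuals reproduce the response
all.equal(augmented$fitted + augmented$residual, augmented$weight)
```

The result is an ordinary data frame, so it can be passed to skim() like any other.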

A second example of the need for a special function is with dist objects. The UScitiesD data set is an example of this.

class(UScitiesD)
## [1] "dist"
UScitiesD
##               Atlanta Chicago Denver Houston LosAngeles Miami NewYork
## Chicago           587                                                
## Denver           1212     920                                        
## Houston           701     940    879                                 
## LosAngeles       1936    1745    831    1374                         
## Miami             604    1188   1726     968       2339              
## NewYork           748     713   1631    1420       2451  1092        
## SanFrancisco     2139    1858    949    1645        347  2594    2571
## Seattle          2182    1737   1021    1891        959  2734    2408
## Washington.DC     543     597   1494    1220       2300   923     205
##               SanFrancisco Seattle
## Chicago                           
## Denver                            
## Houston                           
## LosAngeles                        
## Miami                             
## NewYork                           
## SanFrancisco                      
## Seattle                678        
## Washington.DC         2442    2329

A dist object is most often, as in this case, a lower triangular matrix of distances, which can be measured in various ways. There are many packages that produce dist objects and/or take dist objects as inputs, including those for cluster analysis and multidimensional scaling.
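To see how a dist object stores its values, the following base R sketch inspects UScitiesD. It relies only on the datasets and stats packages that ship with R; the variable names are illustrative.

```r
# A dist object is a plain numeric vector holding only the lower triangle
n_cities <- attr(UScitiesD, "Size")
n_cities                      # 10 cities
length(UScitiesD)             # 45 stored distances
n_cities * (n_cities - 1) / 2 # also 45: one value per pair of cities

# as.matrix() expands it into a full symmetric matrix with a zero diagonal
city_matrix <- as.matrix(UScitiesD)
dim(city_matrix)
isSymmetric(city_matrix)
all(diag(city_matrix) == 0)
```

The zero diagonal matters later on, because it represents the distance from each city to itself.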

A simple solution to this is to follow a similar design to that for lm objects.

skim_dist <- function(.data) {
    .data <- data.frame(as.matrix(.data))
    skimr::skim(.data)
    
}

However, this has the limitation of treating the dist data as though it is simple numeric data.
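The limitation is easy to see with base R. In the sketch below (variable names are illustrative), every column of the converted data frame is an ordinary numeric vector, and the zero on the diagonal becomes the minimum of every column, so the usual numeric summaries would report each city as being at distance 0 from its nearest neighbor.

```r
city_frame <- data.frame(as.matrix(UScitiesD))

# Every column is plain numeric; nothing marks it as distance data
unique(vapply(city_frame, function(col) class(col)[1], character(1)))

# The self-distance of 0 dominates the minimum of each column
vapply(city_frame, min, numeric(1))
```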

What we might want to do instead is to create a new class, for example “distance”, that is specifically for distance data. This will allow it to have its own sfl, or skimr function list.

As handling gets more complex, rather than writing a new function it can be more powerful to define an as.data.frame S3 method for dist objects. This integrates more completely with skimr and allows users to call the skim() function directly. In a package you will want to export this.

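as.data.frame.dist <- function(x, ...) {
    # Expand the lower triangle into a full, symmetric matrix
    .data <- data.frame(as.matrix(x))
    To <- colnames(.data)

    # Tag every column with the new "distance" class
    .data <- lapply(.data[, To],
                    `class<-`, "distance")
    # Reassemble the list of columns into a tibble
    .data <- dplyr::bind_rows(.data)
    .data <- dplyr::bind_cols(.data)

    .data
}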
as.data.frame(UScitiesD)
## # A tibble: 10 × 10
##    Atlanta  Chicago Denver Houston LosAngeles Miami NewYork SanFrancisco Seattle
##    <distan> <dista> <dist> <dista> <distance> <dis> <dista> <distance>   <dista>
##  1    0      587    1212    701    1936        604   748    2139         2182   
##  2  587        0     920    940    1745       1188   713    1858         1737   
##  3 1212      920       0    879     831       1726  1631     949         1021   
##  4  701      940     879      0    1374        968  1420    1645         1891   
##  5 1936     1745     831   1374       0       2339  2451     347          959   
##  6  604     1188    1726    968    2339          0  1092    2594         2734   
##  7  748      713    1631   1420    2451       1092     0    2571         2408   
##  8 2139     1858     949   1645     347       2594  2571       0          678   
##  9 2182     1737    1021   1891     959       2734  2408     678            0   
## 10  543      597    1494   1220    2300        923   205    2442         2329   
## # … with 1 more variable: Washington.DC <distance>

However, until an sfl is created, skimr will not recognize the class and will fall back to treating the data as if it were character data.

skim(UScitiesD)
## Warning: Couldn't find skimmers for class: distance; No user-defined `sfl`
## provided. Falling back to `character`.

## Warning: Couldn't find skimmers for class: distance; No user-defined `sfl`
## provided. Falling back to `character`.

## Warning: Couldn't find skimmers for class: distance; No user-defined `sfl`
## provided. Falling back to `character`.

## Warning: Couldn't find skimmers for class: distance; No user-defined `sfl`
## provided. Falling back to `character`.

## Warning: Couldn't find skimmers for class: distance; No user-defined `sfl`
## provided. Falling back to `character`.

## Warning: Couldn't find skimmers for class: distance; No user-defined `sfl`
## provided. Falling back to `character`.

## Warning: Couldn't find skimmers for class: distance; No user-defined `sfl`
## provided. Falling back to `character`.

## Warning: Couldn't find skimmers for class: distance; No user-defined `sfl`
## provided. Falling back to `character`.

## Warning: Couldn't find skimmers for class: distance; No user-defined `sfl`
## provided. Falling back to `character`.

## Warning: Couldn't find skimmers for class: distance; No user-defined `sfl`
## provided. Falling back to `character`.
Data summary
Name UScitiesD
Number of rows 10
Number of columns 10
_______________________
Column type frequency:
character 10
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
Atlanta 0 1 1 4 0 10 0
Chicago 0 1 1 4 0 10 0
Denver 0 1 1 4 0 10 0
Houston 0 1 1 4 0 10 0
LosAngeles 0 1 1 4 0 10 0
Miami 0 1 1 4 0 10 0
NewYork 0 1 1 4 0 10 0
SanFrancisco 0 1 1 4 0 10 0
Seattle 0 1 1 4 0 10 0
Washington.DC 0 1 1 4 0 10 0

The solution to this is to define an sfl (skimr function list) specifically for the distance class.

Defining sfls for a package

skimr has an opinionated list of functions for each class (e.g. numeric, factor) of data. The core package supports many commonly used classes, but there are many others. You can investigate these defaults by calling get_default_skimmer_names().

What if your data type, like distance, isn’t covered by defaults? skimr usually falls back to treating the type as a character, which isn’t necessarily helpful. In this case, you’re best off adding your data type with skim_with().

Before we begin, note that we’ll be using the following custom summary statistics throughout. For each location, these functions find the distance to the nearest and to the furthest other location.

One thing that is important to be aware of when creating statistics functions for skimr is that skimr largely uses tibbles rather than base data frames. This means that many base operations do not work as expected.

get_nearest <- function(column) {
  column <- as.numeric(column)
  # Ignore zeros, which are distances from a place to itself
  closest <- min(column[column != 0])
  which_rows <- which(column == closest)
  # Collapse ties into one string so the result has length 1
  paste(column[which_rows], collapse = ",")
}

get_furthest <- function(column) {
  column <- as.numeric(column)
  furthest <- max(column[column != 0])
  which_rows <- which(column == furthest)
  # Collapse ties into one string so the result has length 1
  paste(column[which_rows], collapse = ",")
}

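These functions, like all summary functions used by skimr, have two notable features: each accepts a vector as its single argument, and each returns a scalar, that is, a vector of length 1.

Many functions fulfill these criteria, including existing functions from base R, stats, or other packages, and custom functions that you define yourself.

Not fulfilling the two criteria can lead to some very confusing behavior within skimr. Beware! An example of this issue is the base quantile() function: because it can return several values at once, the default skimr percentiles are produced by calling quantile() five times, once per percentile. In the case of get_nearest() and get_furthest(), there could be ties, which would result in returning vectors that have length greater than 1. This is handled by collapsing all of the tied values into a single string.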
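The requirement that a summary function take a vector and return a single value can be checked directly in base R. The sketch below is illustrative only: p25() and nearest_nonzero() are hypothetical helpers, not part of skimr. It shows quantile() returning several values by default, a wrapper that returns exactly one, and ties being collapsed into one string.

```r
x <- c(2, 4, 4, 8, 16)

# By default quantile() returns five values, so on its own it does not
# satisfy the length-1 requirement
length(quantile(x))

# Wrapping it to request a single probability yields a length-1 result
p25 <- function(column) unname(quantile(column, probs = 0.25, na.rm = TRUE))
length(p25(x))

# Ties would give several values; collapsing them keeps the result scalar
nearest_nonzero <- function(column) {
  column <- as.numeric(column)
  closest <- min(column[column != 0])
  paste(column[column == closest], collapse = ",")
}
nearest_nonzero(c(0, 5, 5, 9)) # "5,5"
```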

Notice, also, that in the case of distance data we may wish to exclude distances of 0, which indicate the distance from a place to itself. In finding the minimum our function looks only at the distance to other places.
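To make the effect of excluding zeros concrete, here is a base R sketch on the Atlanta column. It also looks up which city each extreme distance belongs to, which the summary functions above do not report; the variable names are illustrative.

```r
city_matrix <- as.matrix(UScitiesD)
atlanta <- city_matrix[, "Atlanta"]

# Without the filter, the minimum is the distance from Atlanta to itself
min(atlanta) # 0

# Dropping zeros leaves only distances to other cities
others <- atlanta[atlanta != 0]
min(others) # 543
max(others) # 2182

# The element names identify the cities involved
names(others)[which.min(others)] # "Washington.DC"
names(others)[which.max(others)] # "Seattle"
```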

There are at least two ways that you might want to customize skimr handling of a special data type within a package or your own work. The first is to create a custom skimming function.

skim_with_dist <- skim_with(
  distance = sfl(
    nearest = get_nearest,
    furthest = get_furthest
  )
)
## Creating new skimming functions for the following classes: distance.
## They did not have recognized defaults. Call get_default_skimmers() for more information.
skim_with_dist(UScitiesD)
Data summary
Name UScitiesD
Number of rows 10
Number of columns 10
_______________________
Column type frequency:
distance 10
________________________
Group variables None

Variable type: distance

skim_variable n_missing complete_rate nearest furthest
Atlanta 0 1 543 2182
Chicago 0 1 587 1858
Denver 0 1 831 1726
Houston 0 1 701 1891
LosAngeles 0 1 347 2451
Miami 0 1 604 2734
NewYork 0 1 205 2571
SanFrancisco 0 1 347 2594
Seattle 0 1 678 2734
Washington.DC 0 1 205 2442

The example above creates a new function, and you can call that function on a specific column with distance data to get the appropriate summary statistics. The skim_with() factory also uses the default skimmers for things like factors, characters, and numerics. Therefore our skim_with_dist() is like the regular skim() function with the added ability to summarize distance columns.

While this works for any data type and you can also include it within any package (assuming your users load skimr), there is a second, even better, approach. To take full advantage of skimr, we’ll dig a bit into its API.

Adding new methods

skimr has a lookup mechanism, based on the function get_skimmers(), to find default summary functions for each class. This is based on the S3 class system. You can learn more about it in Advanced R.

This requires that you add skimr to your list of dependencies.
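For readers less familiar with S3, the following toy sketch shows the kind of class-based lookup being described. It is not skimr's implementation: summary_names() is a hypothetical generic written in base R, standing in for the role that get_skimmers() plays.

```r
# Hypothetical generic: dispatches on the class of its argument
summary_names <- function(column) UseMethod("summary_names")

# Fallback used when no method matches the class
summary_names.default <- function(column) c("n_unique")

# Method for numeric columns
summary_names.numeric <- function(column) c("mean", "sd")

# Method for a custom class, as a package author might add
summary_names.distance <- function(column) c("nearest", "furthest")

distances <- structure(c(0, 587, 1212), class = "distance")

summary_names(distances)    # "nearest" "furthest"
summary_names(c(1.5, 2.5))  # "mean" "sd"
summary_names(letters)      # "n_unique"
```

Adding support for a new class only requires defining one more method; the generic and the existing methods stay untouched, which is the property that lets a separate package extend the defaults.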

To export a new set of defaults for a data type, create a method for the generic function get_skimmers(). Each of those methods returns an sfl (skimr function list). This is the same list-like data structure used in the skim_with() example above. But note! There is one key difference: when adding a method we also want to identify the skim_type in the sfl. You will probably want to use skimr::get_skimmers.distance() but that will not work in a vignette. In a package you will want to export this.

#' @importFrom skimr get_skimmers
#' @export
get_skimmers.distance <- function(column) {
  sfl(
    skim_type = "distance",
    nearest = get_nearest,
    furthest = get_furthest
  )
}

The same strategy follows for other data types.

Users of your package should load skimr to get the skim() function (although you could import and reexport it). Once loaded, a call to get_default_skimmer_names() will return defaults for your data types as well!

get_default_skimmer_names()
## $AsIs
## [1] "n_unique"   "min_length" "max_length"
## 
## $character
## [1] "min"        "max"        "empty"      "n_unique"   "whitespace"
## 
## $complex
## [1] "mean"
## 
## $Date
## [1] "min"      "max"      "median"   "n_unique"
## 
## $difftime
## [1] "min"      "max"      "median"   "n_unique"
## 
## $distance
## [1] "nearest"  "furthest"
## 
## $factor
## [1] "ordered"    "n_unique"   "top_counts"
## 
## $list
## [1] "n_unique"   "min_length" "max_length"
## 
## $logical
## [1] "mean"  "count"
## 
## $numeric
## [1] "mean" "sd"   "p0"   "p25"  "p50"  "p75"  "p100" "hist"
## 
## $POSIXct
## [1] "min"      "max"      "median"   "n_unique"
## 
## $Timespan
## [1] "min"      "max"      "median"   "n_unique"
## 
## $ts
##  [1] "start"      "end"        "frequency"  "deltat"     "mean"      
##  [6] "sd"         "min"        "max"        "median"     "line_graph"

They will then be able to use skim() directly.

skim(UScitiesD)
Data summary
Name UScitiesD
Number of rows 10
Number of columns 10
_______________________
Column type frequency:
distance 10
________________________
Group variables None

Variable type: distance

skim_variable n_missing complete_rate nearest furthest
Atlanta 0 1 543 2182
Chicago 0 1 587 1858
Denver 0 1 831 1726
Houston 0 1 701 1891
LosAngeles 0 1 347 2451
Miami 0 1 604 2734
NewYork 0 1 205 2571
SanFrancisco 0 1 347 2594
Seattle 0 1 678 2734
Washington.DC 0 1 205 2442

Conclusion

This is a very simple example. For some packages the custom statistics will likely be much more complex. The flexibility of skimr allows you to manage that.

Thanks to Jakub Nowosad, Tiernan Martin, Edzer Pebesma, Michael Sumner, and Kyle Butts for inspiring and helping with the development of this code.