Introduction

When you start working with a dataset in R, one of the first things to do after getting an overview with glimpse() and str() is to look at basic descriptive statistics. In the past I used summary() extensively for this. While going through a tutorial on EDA and linear regression, I noticed the author used the skimr package, which I thought was a great alternative for quickly getting basic statistics.

Personally I prefer the structure of the skimr output and how it can be customised for your own needs.

Why use skimr

Some examples

In the examples to follow I use different tidyverse verbs and different methods of displaying data.

Default output

Below is the default output of skim() when run on the mpg dataset (from ggplot2). The output is split into a summary section followed by one section per variable type, character and numeric in this instance. Note the inclusion of sparklines and some additional statistics not included in the output of the summary() function.

library(tidyverse)  # dplyr verbs and the mpg dataset (via ggplot2)
library(skimr)

skim(mpg)
Data summary
Name mpg
Number of rows 234
Number of columns 11
_______________________
Column type frequency:
character 6
numeric 5
________________________
Group variables None

Variable type: character

skim_variable  n_missing  complete_rate  min  max  empty  n_unique  whitespace
manufacturer           0              1    4   10      0        15           0
model                  0              1    2   22      0        38           0
trans                  0              1    8   10      0        10           0
drv                    0              1    1    1      0         3           0
fl                     0              1    1    1      0         5           0
class                  0              1    3   10      0         7           0

Variable type: numeric

skim_variable  n_missing  complete_rate     mean    sd      p0     p25     p50     p75  p100  hist
displ                  0              1     3.47  1.29     1.6     2.4     3.3     4.6     7  ▇▆▆▃▁
year                   0              1  2003.50  4.51  1999.0  1999.0  2003.5  2008.0  2008  ▇▁▁▁▇
cyl                    0              1     5.89  1.61     4.0     4.0     6.0     8.0     8  ▇▁▇▁▇
cty                    0              1    16.86  4.26     9.0    14.0    17.0    19.0    35  ▆▇▃▁▁
hwy                    0              1    23.44  5.95    12.0    18.0    24.0    27.0    44  ▅▅▇▁▁
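
For comparison, here is a minimal sketch of what summary() gives for a single column next to the matching skim() values. It assumes the ggplot2 and skimr packages are installed, and hwy_skim is just an illustrative name of my own.

```r
library(ggplot2)  # provides the mpg dataset
library(skimr)

# Base R: minimum, quartiles, median, mean and maximum
summary(mpg$hwy)

# skimr: the same column, skimmed into a one-row skim_df
hwy_skim <- skim(mpg, hwy)

# Statistics that summary() does not report for this column
hwy_skim$numeric.sd      # standard deviation
hwy_skim$n_missing       # number of missing values
hwy_skim$complete_rate   # share of non-missing values
```

The values are stored unrounded in the skim_df; the rounding only happens when the object is printed.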

What did we return?

Because skim() returns a skim_df object, the result is pipeable and open to further manipulation. Looking at the structure of the skim_df helps us understand how it is made up. Below I use to_long() to look at skim_type, skim_variable and stat.

In the output below, n_missing and complete_rate are base skimmers. The rest are type-based skimmers, and we need to use the skim_type prefix (for example numeric.mean) to refer to the correct column.

to_long(skim(mpg, model, hwy)) %>%
  select(skim_type, skim_variable, stat) %>%
  arrange(skim_type)
## # A tibble: 17 x 3
##    skim_type skim_variable stat                
##    <chr>     <chr>         <chr>               
##  1 character model         n_missing           
##  2 character model         complete_rate       
##  3 character model         character.min       
##  4 character model         character.max       
##  5 character model         character.empty     
##  6 character model         character.n_unique  
##  7 character model         character.whitespace
##  8 numeric   hwy           n_missing           
##  9 numeric   hwy           complete_rate       
## 10 numeric   hwy           numeric.mean        
## 11 numeric   hwy           numeric.sd          
## 12 numeric   hwy           numeric.p0          
## 13 numeric   hwy           numeric.p25         
## 14 numeric   hwy           numeric.p50         
## 15 numeric   hwy           numeric.p75         
## 16 numeric   hwy           numeric.p100        
## 17 numeric   hwy           numeric.hist
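
Another way to see the same structure is to look at the column names of the wide skim_df directly. The sketch below assumes ggplot2 and skimr are installed; sk is a name I made up for the example.

```r
library(ggplot2)  # provides the mpg dataset
library(skimr)

sk <- skim(mpg, model, hwy)

# A skim_df is a tibble with an extra class on top
inherits(sk, "skim_df")
inherits(sk, "tbl_df")

# Base skimmers appear unprefixed, type-based skimmers carry
# their skim_type as a prefix (numeric.*, character.*)
names(sk)
```

Each variable only has values in the columns for its own type, so the numeric.* columns are NA in the row for model.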

Selecting from skim_df

Below is an example of what I mean by selecting type-based and base skimmers. Note that n_missing is our base skimmer, while numeric.mean and character.n_unique are our type-based skimmers.

skim(mpg) %>%
  select(skim_type, skim_variable, n_missing, numeric.mean, character.n_unique)
Data summary
Name mpg
Number of rows 234
Number of columns 11
_______________________
Column type frequency:
character 6
numeric 5
________________________
Group variables None

Variable type: character

skim_variable  n_missing  n_unique
manufacturer           0        15
model                  0        38
trans                  0        10
drv                    0         3
fl                     0         5
class                  0         7

Variable type: numeric

skim_variable  n_missing     mean
displ                  0     3.47
year                   0  2003.50
cyl                    0     5.89
cty                    0    16.86
hwy                    0    23.44
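
Since the skim_df behaves like a tibble, other dplyr verbs can be combined with select() in the same way. This is only a sketch, assuming dplyr, ggplot2 and skimr are installed; most_spread is an illustrative name.

```r
library(dplyr)
library(ggplot2)  # provides the mpg dataset
library(skimr)

# Keep the numeric variables and order them by standard deviation,
# using the prefixed column names of the type-based skimmers
most_spread <- skim(mpg) %>%
  filter(skim_type == "numeric") %>%
  arrange(desc(numeric.sd)) %>%
  select(skim_type, skim_variable, numeric.mean, numeric.sd)

most_spread
```

I keep skim_type and skim_variable in the selection, as in the example above, so the result still identifies which variable each row describes.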

Selecting specific columns from our data

We can skim only specific columns if desired. There are several ways to do this: here I pass the column name to skim(), but we can also use the pipe and select().

skim(mpg,hwy)
Data summary
Name mpg
Number of rows 234
Number of columns 11
_______________________
Column type frequency:
numeric 1
________________________
Group variables None

Variable type: numeric

skim_variable  n_missing  complete_rate   mean    sd  p0  p25  p50  p75  p100  hist
hwy                    0              1  23.44  5.95  12   18   24   27    44  ▅▅▇▁▁
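
Two of the other ways of picking columns could look like the sketch below: select() before skim(), or a tidyselect helper passed to skim() itself. It assumes dplyr, ggplot2 and skimr are installed; the object names are my own.

```r
library(dplyr)
library(ggplot2)  # provides the mpg dataset
library(skimr)

# Option 1: choose the columns first, then skim whatever is left
piped <- mpg %>%
  select(hwy, cty) %>%
  skim()

# Option 2: pass a tidyselect helper straight to skim()
c_columns <- skim(mpg, starts_with("c"))

piped
c_columns
```

With the first option the summary reports the number of columns of the reduced data rather than of the full mpg dataset, because skim() only ever sees the selected columns.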

Skim after grouping data

We can group the data and display the statistics by group. Note below how the pipe and group_by() are used.

mpg %>%
  group_by(drv) %>%
  skim(hwy)
Data summary
Name Piped data
Number of rows 234
Number of columns 11
_______________________
Column type frequency:
numeric 1
________________________
Group variables drv

Variable type: numeric

skim_variable  drv  n_missing  complete_rate   mean    sd  p0  p25  p50  p75  p100  hist
hwy            4            0              1  19.17  4.08  12   17   18   22    28  ▃▇▅▁▅
hwy            f            0              1  28.16  4.21  17   26   28   29    44  ▁▇▇▁▁
hwy            r            0              1  21.00  3.66  15   17   21   24    26  ▇▂▃▃▇
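
The grouped result is still a skim_df with one row per variable per group, so it can be filtered like any other. A small sketch, assuming dplyr, ggplot2 and skimr are installed, with by_drv and hwy_by_drv as illustrative names:

```r
library(dplyr)
library(ggplot2)  # provides the mpg dataset
library(skimr)

# Skim two variables within each drive type
by_drv <- mpg %>%
  group_by(drv) %>%
  skim(hwy, cty)

# The grouping column (drv) sits next to skim_variable, so we can
# pull out a compact per-group comparison for one variable
hwy_by_drv <- by_drv %>%
  filter(skim_variable == "hwy") %>%
  select(skim_type, skim_variable, drv, numeric.mean, numeric.sd)

hwy_by_drv
```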

Excluding charts

If you don't want the charts, you can use skim_without_charts().

skim_without_charts(mpg) %>% filter(skim_variable == "hwy")
Data summary
Name mpg
Number of rows 234
Number of columns 11
_______________________
Column type frequency:
numeric 1
________________________
Group variables None

Variable type: numeric

skim_variable  n_missing  complete_rate   mean    sd  p0  p25  p50  p75  p100
hwy                    0              1  23.44  5.95  12   18   24   27    44

Select only the sections we want to see using yank

If we only want to see the numeric section we can yank that section.

mpg %>% skim() %>% yank("numeric")

Variable type: numeric

skim_variable  n_missing  complete_rate     mean    sd      p0     p25     p50     p75  p100  hist
displ                  0              1     3.47  1.29     1.6     2.4     3.3     4.6     7  ▇▆▆▃▁
year                   0              1  2003.50  4.51  1999.0  1999.0  2003.5  2008.0  2008  ▇▁▁▁▇
cyl                    0              1     5.89  1.61     4.0     4.0     6.0     8.0     8  ▇▁▇▁▇
cty                    0              1    16.86  4.26     9.0    14.0    17.0    19.0    35  ▆▇▃▁▁
hwy                    0              1    23.44  5.95    12.0    18.0    24.0    27.0    44  ▅▅▇▁▁
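
If we want every section as a separate object rather than just one, skimr also has partition(), which splits a skim_df into a list with one element per variable type. A minimal sketch, assuming ggplot2 and skimr are installed; parts is an illustrative name.

```r
library(ggplot2)  # provides the mpg dataset
library(skimr)

# Split the skim result into one table per variable type
parts <- partition(skim(mpg))

# One list element per type found in the data
names(parts)

# Each element can be used on its own
parts$character
parts$numeric
```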

Specify your own statistics using skim_with and sfl

Using skim_with() we can specify our own statistics. For example, we can make use of functions from R’s stats package. Note that by default your statistics are appended to the default statistics that skim() returns. By setting append = FALSE we return only the statistics we specify.

my_skim <- skim_with(numeric = sfl(iqr = IQR, mad = mad, p99 = ~ quantile(., probs = .99)),
  append = FALSE)

my_skim(mpg,hwy)
Data summary
Name mpg
Number of rows 234
Number of columns 11
_______________________
Column type frequency:
numeric 1
________________________
Group variables None

Variable type: numeric

skim_variable  n_missing  complete_rate  iqr   mad    p99
hwy                    0              1    9  7.41  39.68

We can also exclude statistics we don’t want. Below we set p25 and p75 to NULL. Because we didn’t specify append = FALSE, our IQR statistic is appended to the remaining default output.

my_skim <- skim_with(numeric = sfl(iqr = IQR, p25 = NULL, p75 = NULL))

my_skim(mpg, hwy)
Data summary
Name mpg
Number of rows 234
Number of columns 11
_______________________
Column type frequency:
numeric 1
________________________
Group variables None

Variable type: numeric

skim_variable  n_missing  complete_rate   mean    sd  p0  p50  p100  hist   iqr
hwy                    0              1  23.44  5.95  12   24    44  ▅▅▇▁▁    9
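
The two ideas can be combined in one call: add a statistic written as a formula and drop a default one at the same time. The sketch below adds a coefficient of variation and removes the histogram. The cv statistic and the name cv_skim are my own choices for illustration, not something skimr ships with, and the code assumes ggplot2 and skimr are installed.

```r
library(ggplot2)  # provides the mpg dataset
library(skimr)

# cv is a user-defined statistic (sd divided by mean);
# hist = NULL drops the default histogram column
cv_skim <- skim_with(
  numeric = sfl(
    cv = ~ sd(., na.rm = TRUE) / mean(., na.rm = TRUE),
    hist = NULL
  )
)

cv_result <- cv_skim(mpg, hwy, cty)
cv_result
```

Because append is left at its default, the remaining default statistics are kept and cv is added after them as numeric.cv.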

Keep the original data set with skim_tee

We can also use skim_tee() to print the skim summary and return the original data unchanged.

mpg_tee <- mpg %>% skim_tee()  
## -- Data Summary ------------------------
##                            Values
## Name                       data  
## Number of rows             234   
## Number of columns          11    
## _______________________          
## Column type frequency:           
##   character                6     
##   numeric                  5     
## ________________________         
## Group variables            None  
## 
## -- Variable type: character ----------------------------------------------------
## # A tibble: 6 x 8
##   skim_variable n_missing complete_rate   min   max empty n_unique whitespace
## * <chr>             <int>         <dbl> <int> <int> <int>    <int>      <int>
## 1 manufacturer          0             1     4    10     0       15          0
## 2 model                 0             1     2    22     0       38          0
## 3 trans                 0             1     8    10     0       10          0
## 4 drv                   0             1     1     1     0        3          0
## 5 fl                    0             1     1     1     0        5          0
## 6 class                 0             1     3    10     0        7          0
## 
## -- Variable type: numeric ------------------------------------------------------
## # A tibble: 5 x 11
##   skim_variable n_missing complete_rate    mean    sd     p0    p25    p50
## * <chr>             <int>         <dbl>   <dbl> <dbl>  <dbl>  <dbl>  <dbl>
## 1 displ                 0             1    3.47  1.29    1.6    2.4    3.3
## 2 year                  0             1 2004.    4.51 1999   1999   2004. 
## 3 cyl                   0             1    5.89  1.61    4      4      6  
## 4 cty                   0             1   16.9   4.26    9     14     17  
## 5 hwy                   0             1   23.4   5.95   12     18     24  
##      p75  p100 hist 
## *  <dbl> <dbl> <chr>
## 1    4.6     7 <U+2587><U+2586><U+2586><U+2583><U+2581>
## 2 2008    2008 <U+2587><U+2581><U+2581><U+2581><U+2587>
## 3    8       8 <U+2587><U+2581><U+2587><U+2581><U+2587>
## 4   19      35 <U+2586><U+2587><U+2583><U+2581><U+2581>
## 5   27      44 <U+2585><U+2585><U+2587><U+2581><U+2581>
head(mpg_tee) 
## # A tibble: 6 x 11
##   manufacturer model displ  year   cyl trans      drv     cty   hwy fl    class 
##   <chr>        <chr> <dbl> <int> <int> <chr>      <chr> <int> <int> <chr> <chr> 
## 1 audi         a4      1.8  1999     4 auto(l5)   f        18    29 p     compa~
## 2 audi         a4      1.8  1999     4 manual(m5) f        21    29 p     compa~
## 3 audi         a4      2    2008     4 manual(m6) f        20    31 p     compa~
## 4 audi         a4      2    2008     4 auto(av)   f        21    30 p     compa~
## 5 audi         a4      2.8  1999     6 auto(l5)   f        16    26 p     compa~
## 6 audi         a4      2.8  1999     6 manual(m5) f        18    26 p     compa~
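
Because the data comes back unchanged, skim_tee() can also sit in the middle of a longer pipeline as a quick checkpoint. A minimal sketch, assuming dplyr, ggplot2 and skimr are installed; the filter step and the name mpg_2008 are only there for illustration.

```r
library(dplyr)
library(ggplot2)  # provides the mpg dataset
library(skimr)

# Print a summary of the full data, then carry on with the same data
mpg_2008 <- mpg %>%
  skim_tee() %>%
  filter(year == 2008)

nrow(mpg_2008)
```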

Issues encountered

The one issue I encountered was with the sparklines. If you look at the skim_tee() example, the sparklines are displayed as raw Unicode escapes, for example <U+2587><U+2586><U+2586><U+2583><U+2581>.

The skimr README at https://cran.r-project.org/web/packages/skimr/readme/README.html gives the following reason: "This longstanding problem originates in the low-level code for printing dataframes. While skimr can render the histograms to the console and in RMarkdown documents, it cannot in other circumstances." The README goes on to list the circumstances in which this happens.

In these cases we can use skim_without_charts() as detailed above.