R skimr package introduction

Some examples

In the examples to follow I use different tidyverse verbs and different methods of displaying data.
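The examples do not show their setup, so here is a minimal sketch of what I assume is loaded. The mpg dataset ships with ggplot2, and the verbs and the pipe come from dplyr; loading the whole tidyverse instead would work just as well.

```r
# Packages assumed by the examples in this post
library(skimr)    # skim(), yank(), skim_with(), sfl(), skim_tee(), to_long()
library(dplyr)    # %>%, select(), filter(), arrange(), group_by()
library(ggplot2)  # provides the mpg dataset

# Quick sanity check on the data used throughout: rows and columns
mpg_dims <- dim(mpg)
mpg_dims
```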

Default output

Below is the default output of skim() when run on a dataset, here the mpg dataset from ggplot2. The output is separated into a summary section followed by one section per variable type, character and numeric in this instance. Note the inclusion of spark-line histograms (the hist column) and some additional statistics not included in the summary() function.

skim(mpg)

Data summary
Name	mpg
Number of rows	234
Number of columns	11
_______________________
Column type frequency:
character	6
numeric	5
________________________
Group variables	None

Variable type: character

skim_variable	complete_rate	min	max	n_unique
manufacturer	1	4	10	15
model	1	2	22	38
trans	1	8	10	10
drv	1	1	1	3
fl	1	1	1	5
class	1	3	10	7

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
displ	1	3.47	1.29	1.6	2.4	3.3	4.6	7	▇▆▆▃▁
year	1	2003.50	4.51	1999.0	1999.0	2003.5	2008.0	2008	▇▁▁▁▇
cyl	1	5.89	1.61	4.0	4.0	6.0	8.0	8	▇▁▇▁▇
cty	1	16.86	4.26	9.0	14.0	17.0	19.0	35	▆▇▃▁▁
hwy	1	23.44	5.95	12.0	18.0	24.0	27.0	44	▅▅▇▁▁

What did we return?

Because skim() returns a skim_df object, the result is pipeable and open to further manipulation. Looking at the structure of the skim_df helps us understand how it is made up. Below I use to_long() to get a look at skim_type, skim_variable and stat.

In the output below, n_missing and complete_rate are base skimmers, which are computed for every column regardless of type. The rest are type-based skimmers, and we need to use the skim_type prefix (for example numeric.mean) to refer to the correct column.

to_long(skim(mpg,model,hwy)) %>% select(skim_type,skim_variable,stat) %>% arrange(skim_type)

## # A tibble: 17 x 3
##    skim_type skim_variable stat                
##    <chr>     <chr>         <chr>               
##  1 character model         n_missing           
##  2 character model         complete_rate       
##  3 character model         character.min       
##  4 character model         character.max       
##  5 character model         character.empty     
##  6 character model         character.n_unique  
##  7 character model         character.whitespace
##  8 numeric   hwy           n_missing           
##  9 numeric   hwy           complete_rate       
## 10 numeric   hwy           numeric.mean        
## 11 numeric   hwy           numeric.sd          
## 12 numeric   hwy           numeric.p0          
## 13 numeric   hwy           numeric.p25         
## 14 numeric   hwy           numeric.p50         
## 15 numeric   hwy           numeric.p75         
## 16 numeric   hwy           numeric.p100        
## 17 numeric   hwy           numeric.hist
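As a small sketch of the same idea in the wide form, we can check the class of the result and list its column names. The name skimmed is just a variable I chose for illustration.

```r
library(skimr)
library(ggplot2)  # for the mpg dataset

skimmed <- skim(mpg, model, hwy)

# A skim_df is still a data frame underneath, so the usual tools apply
inherits(skimmed, "skim_df")
inherits(skimmed, "data.frame")

# Column names of the wide form: base skimmers have no prefix,
# type-based skimmers carry the skim_type prefix
names(skimmed)
```

The prefixed names listed here are the ones to use inside select() or filter().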

Selecting from skim_df

Below is an example of selecting both base and type-based skimmers. Note that n_missing is our base skimmer, while numeric.mean and character.n_unique are our type-based skimmers.

skim(mpg) %>% select(skim_type,skim_variable,n_missing,numeric.mean,character.n_unique)

Data summary
Name	mpg
Number of rows	234
Number of columns	11
_______________________
Column type frequency:
character	6
numeric	5
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	n_unique
manufacturer	0	15
model	0	38
trans	0	10
drv	0	3
fl	0	5
class	0	7

Variable type: numeric

skim_variable	n_missing	mean
displ	0	3.47
year	0	2003.50
cyl	0	5.89
cty	0	16.86
hwy	0	23.44
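The same prefixed column names work with filter(). The sketch below keeps only the variables whose mean is above 10; the threshold and the name high_means are arbitrary choices of mine. Character variables drop out because their numeric.mean is NA.

```r
library(skimr)
library(dplyr)
library(ggplot2)  # for the mpg dataset

high_means <- skim(mpg) %>%
  filter(numeric.mean > 10) %>%                     # NA for character rows, so they are dropped
  select(skim_type, skim_variable, numeric.mean)    # keep the identifying columns

high_means
```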

Selecting specific columns from our data

We can skim only specific columns if desired. There are many ways to do this; here the column name is passed directly to skim(), but we could also use the pipe and select().

skim(mpg,hwy)

Data summary
Name	mpg
Number of rows	234
Number of columns	11
_______________________
Column type frequency:
numeric	1
________________________
Group variables	None

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
hwy	0	1	23.44	5.95	12	18	24	27	44	▅▅▇▁▁
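Two alternative ways of picking columns, sketched under the assumption that dplyr is loaded: selecting before skimming, and passing a tidyselect helper straight to skim(). The variable names are mine.

```r
library(skimr)
library(dplyr)
library(ggplot2)  # for the mpg dataset

# Select columns first, then skim whatever is left
selected_first <- mpg %>%
  select(hwy, cty) %>%
  skim()

# skim() also accepts tidyselect helpers in place of column names
starts_with_c <- skim(mpg, starts_with("c"))

selected_first
starts_with_c
```

One visible difference is that the piped version reports its data name as "Piped data" rather than mpg in the summary.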

Skim after grouping data

We can group the data and display the relevant statistics by group. Note below how the pipe and group_by() are used.

mpg  %>% group_by(drv) %>%  skim(hwy)

Data summary
Name	Piped data
Number of rows	234
Number of columns	11
_______________________
Column type frequency:
numeric	1
________________________
Group variables	drv

Variable type: numeric

skim_variable	drv	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
hwy	4	1	19.17	4.08	12	17	18	22	28	▃▇▅▁▅
hwy	f	1	28.16	4.21	17	26	28	29	44	▁▇▇▁▁
hwy	r	1	21.00	3.66	15	17	21	24	26	▇▂▃▃▇
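Grouping is not limited to one variable. As a sketch, grouping by both drv and year gives one row of statistics for each combination present in the data; by_drv_year is just a name I picked.

```r
library(skimr)
library(dplyr)
library(ggplot2)  # for the mpg dataset

by_drv_year <- mpg %>%
  group_by(drv, year) %>%   # both grouping columns appear in the result
  skim(hwy)

by_drv_year
```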

Excluding charts

If you don't want the charts, you can use skim_without_charts().

skim_without_charts(mpg) %>% filter(skim_variable == "hwy")

Data summary
Name	mpg
Number of rows	234
Number of columns	11
_______________________
Column type frequency:
numeric	1
________________________
Group variables	None

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100
hwy	0	1	23.44	5.95	12	18	24	27	44

Select only the sections we want to see using yank

If we only want to see the numeric section we can yank that section.

mpg %>% skim() %>% yank("numeric")

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
displ	1	3.47	1.29	1.6	2.4	3.3	4.6	7	▇▆▆▃▁
year	1	2003.50	4.51	1999.0	1999.0	2003.5	2008.0	2008	▇▁▁▁▇
cyl	1	5.89	1.61	4.0	4.0	6.0	8.0	8	▇▁▇▁▇
cty	1	16.86	4.26	9.0	14.0	17.0	19.0	35	▆▇▃▁▁
hwy	1	23.44	5.95	12.0	18.0	24.0	27.0	44	▅▅▇▁▁
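The same call works for the other section. A short sketch pulling out only the character variables follows; note that after yank() the columns lose their type prefix, so n_unique replaces character.n_unique.

```r
library(skimr)
library(dplyr)
library(ggplot2)  # for the mpg dataset

chars <- mpg %>%
  skim() %>%
  yank("character")   # keep only the character section

chars
names(chars)
```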

Specify your own statistics using skim_with and sfl

Using skim_with() we can specify our own statistics. For example, we can make use of functions from R’s stats package. Note that the default behaviour is to append your statistics to the default statistics that skim() returns. By setting append = FALSE we return only the statistics we specify.

my_skim <- skim_with(numeric = sfl(iqr = IQR, mad = mad, p99 = ~ quantile(., probs = .99)),
  append = FALSE)

my_skim(mpg,hwy)

Data summary
Name	mpg
Number of rows	234
Number of columns	11
_______________________
Column type frequency:
numeric	1
________________________
Group variables	None

Variable type: numeric

skim_variable	n_missing	complete_rate	iqr	mad	p99
hwy	0	1	9	7.41	39.68

We can also exclude statistics we don’t want. Note below that we set p25 and p75 to NULL. Because we didn’t specify append = FALSE, our addition of iqr is appended to the default output.

my_skim <- skim_with(numeric = sfl(iqr = IQR, p25 = NULL, p75 = NULL))


my_skim(mpg,hwy)

Data summary
Name	mpg
Number of rows	234
Number of columns	11
_______________________
Column type frequency:
numeric	1
________________________
Group variables	None

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p50	p100	hist	iqr
hwy	0	1	23.44	5.95	12	24	44	▅▅▇▁▁	9
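Statistics do not have to come from an existing function. As a sketch, the formula syntax can define one inline; here cv is a name I made up for the coefficient of variation (standard deviation divided by mean), and cv_skim is likewise my own name.

```r
library(skimr)
library(ggplot2)  # for the mpg dataset

# cv is a hypothetical statistic name; the formula is applied to each numeric column
cv_skim <- skim_with(numeric = sfl(cv = ~ sd(.) / mean(.)))

cv_result <- cv_skim(mpg, hwy, cty)
cv_result
```

Since append is left at its default, the new statistic shows up as numeric.cv alongside the usual numeric columns.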

Keep the original data set with skim_tee

We can also use skim_tee() to print the skim summary and return the original data unchanged.

mpg_tee <- mpg %>% skim_tee()

## -- Data Summary ------------------------
##                            Values
## Name                       data  
## Number of rows             234   
## Number of columns          11    
## _______________________          
## Column type frequency:           
##   character                6     
##   numeric                  5     
## ________________________         
## Group variables            None  
## 
## -- Variable type: character ----------------------------------------------------
## # A tibble: 6 x 8
##   skim_variable n_missing complete_rate   min   max empty n_unique whitespace
## * <chr>             <int>         <dbl> <int> <int> <int>    <int>      <int>
## 1 manufacturer          0             1     4    10     0       15          0
## 2 model                 0             1     2    22     0       38          0
## 3 trans                 0             1     8    10     0       10          0
## 4 drv                   0             1     1     1     0        3          0
## 5 fl                    0             1     1     1     0        5          0
## 6 class                 0             1     3    10     0        7          0
## 
## -- Variable type: numeric ------------------------------------------------------
## # A tibble: 5 x 11
##   skim_variable n_missing complete_rate    mean    sd     p0    p25    p50
## * <chr>             <int>         <dbl>   <dbl> <dbl>  <dbl>  <dbl>  <dbl>
## 1 displ                 0             1    3.47  1.29    1.6    2.4    3.3
## 2 year                  0             1 2004.    4.51 1999   1999   2004. 
## 3 cyl                   0             1    5.89  1.61    4      4      6  
## 4 cty                   0             1   16.9   4.26    9     14     17  
## 5 hwy                   0             1   23.4   5.95   12     18     24  
##      p75  p100 hist 
## *  <dbl> <dbl> <chr>
## 1    4.6     7 ▇▆▆▃▁
## 2 2008    2008 ▇▁▁▁▇
## 3    8       8 ▇▁▇▁▇
## 4   19      35 ▆▇▃▁▁
## 5   27      44 ▅▅▇▁▁

head(mpg_tee)

## # A tibble: 6 x 11
##   manufacturer model displ  year   cyl trans      drv     cty   hwy fl    class 
##   <chr>        <chr> <dbl> <int> <int> <chr>      <chr> <int> <int> <chr> <chr> 
## 1 audi         a4      1.8  1999     4 auto(l5)   f        18    29 p     compa~
## 2 audi         a4      1.8  1999     4 manual(m5) f        21    29 p     compa~
## 3 audi         a4      2    2008     4 manual(m6) f        20    31 p     compa~
## 4 audi         a4      2    2008     4 auto(av)   f        21    30 p     compa~
## 5 audi         a4      2.8  1999     6 auto(l5)   f        16    26 p     compa~
## 6 audi         a4      2.8  1999     6 manual(m5) f        18    26 p     compa~
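Because the data comes back untouched, skim_tee() can sit in the middle of a pipeline. A minimal sketch, where counting by drv is just an arbitrary next step I chose:

```r
library(skimr)
library(dplyr)
library(ggplot2)  # for the mpg dataset

drv_counts <- mpg %>%
  skim_tee(hwy) %>%   # prints the summary of hwy, then passes mpg along unchanged
  count(drv)          # the rest of the pipeline sees the original data

drv_counts
```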

David Timewell

13/08/2020
