skimr Package for Data Exploration

Data analysis starts with understanding your data, and R has many packages and functions for exploring it. I came across some packages, such as DataExplorer, that write a report on your data to HTML for initial exploration. While those can be useful, I thought I’d focus on a fairly straightforward package for exploring data: skimr.

The skimr package takes data as input and quickly creates summary statistics. One of its benefits is that it goes beyond the statistics generated by summary().

Install

The skimr package can be installed from CRAN using install.packages("skimr").

if (!require("skimr")) install.packages("skimr")
## Loading required package: skimr
library(skimr)

Exploring all your data using the skim function

The output from skim() is a skim_df object. Its printed summary first reports the number of rows and columns and how many columns there are of each variable type (e.g. factor, numeric). It then gives per-variable statistics for each type, including the number of missing values.
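
The examples that follow call skim() on an object named dat, which is never shown being created. Judging by the printed summary (344 rows; species, island and sex plus four body measurements) and the data citation at the end, it looks like the Palmer penguins data. Here is a minimal sketch of one way to build a matching object. It assumes the palmerpenguins package is installed and that the year column was dropped to arrive at seven columns; both are assumptions, not something the original code confirms.

```r
library(skimr)

# Assumption: `dat` is the penguins data frame from the palmerpenguins
# package, with the `year` column removed so that seven columns remain.
dat <- palmerpenguins::penguins
dat <- dat[, setdiff(names(dat), "year")]

dim(dat)
```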

All of the data

skim(dat)
Data summary
Name dat
Number of rows 344
Number of columns 7
_______________________
Column type frequency:
factor 3
numeric 4
________________________
Group variables None

Variable type: factor

skim_variable  n_missing  complete_rate  ordered  n_unique  top_counts
species        0          1.00           FALSE    3         Ade: 152, Gen: 124, Chi: 68
island         0          1.00           FALSE    3         Bis: 168, Dre: 124, Tor: 52
sex            11         0.97           FALSE    2         mal: 168, fem: 165

Variable type: numeric

skim_variable      n_missing  complete_rate  mean     sd      p0      p25      p50      p75     p100    hist
bill_length_mm     2          0.99           43.92    5.46    32.1    39.23    44.45    48.5    59.6    ▃▇▇▆▁
bill_depth_mm      2          0.99           17.15    1.97    13.1    15.60    17.30    18.7    21.5    ▅▅▇▇▂
flipper_length_mm  2          0.99           200.92   14.06   172.0   190.00   197.00   213.0   231.0   ▂▇▃▅▂
body_mass_g        2          0.99           4201.75  801.95  2700.0  3550.00  4050.00  4750.0  6300.0  ▃▇▆▃▂
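
Because the result is a data-frame-like object, the summary statistics can also be pulled out programmatically rather than only read off the console. The following is a small sketch using a made-up data frame; toy and its columns are hypothetical and not part of the penguins data.

```r
library(skimr)

# Hypothetical toy data with one missing value in each column
toy <- data.frame(
  group = factor(c("a", "b", "a", NA)),
  value = c(1.5, NA, 3.0, 4.5)
)

skimmed <- skim(toy)

# One row per variable; convert to a plain data frame to pick columns freely
missing_info <- as.data.frame(skimmed)[, c("skim_variable", "n_missing", "complete_rate")]
missing_info
```

Working from the plain data frame avoids relying on how the skim_df class handles column subsetting.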

One variable

Suppose you only want information on one variable. In that case, add the variable name to the skim() call. Here I wanted to look at island. The output for factors lets you quickly see the number of unique values (levels) in each variable, as well as the number of observations for each level.

skim(dat, island)
Data summary
Name dat
Number of rows 344
Number of columns 7
_______________________
Column type frequency:
factor 1
________________________
Group variables None

Variable type: factor

skim_variable  n_missing  complete_rate  ordered  n_unique  top_counts
island         0          1              FALSE    3         Bis: 168, Dre: 124, Tor: 52
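
skim() accepts more than one column name, and the skimr documentation describes support for tidyselect helpers in the same position. Below is a short sketch using the built-in iris data as a stand-in; with the data used here, the equivalent call would be something like skim(dat, island, sex).

```r
library(skimr)

# Two columns named explicitly
two_vars <- skim(iris, Species, Sepal.Length)

# A tidyselect helper, namespaced so nothing else needs to be attached
sepal_vars <- skim(iris, tidyselect::starts_with("Sepal"))

sepal_vars$skim_variable
```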

Combining functions using pipes

One cool thing about the skimr package is that skim() works in pipelines alongside tidyverse packages such as dplyr. In particular, skim() respects groups created with dplyr::group_by(), so you can get summaries per group. In this case, the output is grouped by species, so you can see the number of males and females per species or which islands each species was observed on.

library(dplyr)  # provides %>% and group_by()

dat %>%
  group_by(species) %>%
  skim()
Data summary
Name Piped data
Number of rows 344
Number of columns 7
_______________________
Column type frequency:
factor 2
numeric 4
________________________
Group variables species

Variable type: factor

skim_variable  species    n_missing  complete_rate  ordered  n_unique  top_counts
island         Adelie     0          1.00           FALSE    3         Dre: 56, Tor: 52, Bis: 44
island         Chinstrap  0          1.00           FALSE    1         Dre: 68, Bis: 0, Tor: 0
island         Gentoo     0          1.00           FALSE    1         Bis: 124, Dre: 0, Tor: 0
sex            Adelie     6          0.96           FALSE    2         fem: 73, mal: 73
sex            Chinstrap  0          1.00           FALSE    2         fem: 34, mal: 34
sex            Gentoo     5          0.96           FALSE    2         mal: 61, fem: 58

Variable type: numeric

skim_variable      species    n_missing  complete_rate  mean     sd      p0      p25      p50      p75      p100    hist
bill_length_mm     Adelie     1          0.99           38.79    2.66    32.1    36.75    38.80    40.75    46.0    ▁▆▇▆▁
bill_length_mm     Chinstrap  0          1.00           48.83    3.34    40.9    46.35    49.55    51.08    58.0    ▂▇▇▅▁
bill_length_mm     Gentoo     1          0.99           47.50    3.08    40.9    45.30    47.30    49.55    59.6    ▃▇▆▁▁
bill_depth_mm      Adelie     1          0.99           18.35    1.22    15.5    17.50    18.40    19.00    21.5    ▂▆▇▃▁
bill_depth_mm      Chinstrap  0          1.00           18.42    1.14    16.4    17.50    18.45    19.40    20.8    ▅▇▇▆▂
bill_depth_mm      Gentoo     1          0.99           14.98    0.98    13.1    14.20    15.00    15.70    17.3    ▅▇▇▆▂
flipper_length_mm  Adelie     1          0.99           189.95   6.54    172.0   186.00   190.00   195.00   210.0   ▁▆▇▅▁
flipper_length_mm  Chinstrap  0          1.00           195.82   7.13    178.0   191.00   196.00   201.00   212.0   ▁▅▇▅▂
flipper_length_mm  Gentoo     1          0.99           217.19   6.48    203.0   212.00   216.00   221.00   231.0   ▂▇▇▆▃
body_mass_g        Adelie     1          0.99           3700.66  458.57  2850.0  3350.00  3700.00  4000.00  4775.0  ▅▇▇▃▂
body_mass_g        Chinstrap  0          1.00           3733.09  384.34  2700.0  3487.50  3700.00  3950.00  4800.0  ▁▅▇▃▁
body_mass_g        Gentoo     1          0.99           5076.02  504.12  3950.0  4700.00  5000.00  5500.00  6300.0  ▃▇▇▇▂
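
The grouped output is also just a table underneath, so per-group statistics can be extracted for further use. This sketch uses the built-in iris data as a stand-in for dat. The numeric.mean and numeric.sd column names follow the type-prefixed naming that skim_df objects use internally; species_stats is an illustrative name.

```r
library(dplyr)
library(skimr)

by_species <- iris %>%
  group_by(Species) %>%
  skim(Sepal.Length)

# Drop the skim_df class, then keep the group label and two statistics
species_stats <- by_species %>%
  as.data.frame() %>%
  select(Species, numeric.mean, numeric.sd)

species_stats
```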

Deeper dive in the data

If you wanted to dig further in, you could group the data by species, filter by island and just look at the flipper length variable.

dat %>%
  dplyr::group_by(species) %>%
  dplyr::filter(island == "Biscoe") %>%
  skim(flipper_length_mm)
Data summary
Name Piped data
Number of rows 168
Number of columns 7
_______________________
Column type frequency:
numeric 1
________________________
Group variables species

Variable type: numeric

skim_variable      species  n_missing  complete_rate  mean    sd    p0   p25     p50    p75  p100  hist
flipper_length_mm  Adelie   0          1.00           188.80  6.73  172  184.75  189.5  193  203   ▁▅▇▇▃
flipper_length_mm  Gentoo   1          0.99           217.19  6.48  203  212.00  216.0  221  231   ▂▇▇▆▃

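Finally, the skim() function can be combined with dplyr::glimpse() to get a different view of the output. Instead of the formatted summary, glimpse() shows the columns of the underlying skim_df, one per line.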

dat %>%
  dplyr::group_by(species) %>%
  dplyr::filter(island == "Biscoe") %>%
  skim(sex) %>%
  dplyr::glimpse()
## Rows: 2
## Columns: 8
## $ skim_type         <chr> "factor", "factor"
## $ skim_variable     <chr> "sex", "sex"
## $ species           <fct> Adelie, Gentoo
## $ n_missing         <int> 0, 5
## $ complete_rate     <dbl> 1.0000000, 0.9596774
## $ factor.ordered    <lgl> FALSE, FALSE
## $ factor.n_unique   <int> 2, 2
## $ factor.top_counts <chr> "fem: 22, mal: 22", "mal: 61, fem: 58"

Limitations

While skimr could prove useful for initial exploratory analysis, there are some limitations. One limitation that I quickly came across is that the spark-histogram characters do not always print. When I tried to create a “nicer” looking table using pander, they came out as Unicode escape codes such as <U+2581>.

dat %>%
  dplyr::group_by(species) %>%
  dplyr::filter(island == "Biscoe") %>%
  skim(flipper_length_mm) %>%
  pander::pander()
Table continues below

skim_type  skim_variable      species  n_missing  complete_rate
numeric    flipper_length_mm  Adelie   0          1
numeric    flipper_length_mm  Gentoo   1          0.9919

Table continues below

numeric.mean  numeric.sd  numeric.p0  numeric.p25  numeric.p50
188.8         6.729       172         184.8        189.5
217.2         6.485       203         212          216

numeric.p75  numeric.p100  numeric.hist
193          203           <U+2581><U+2585><U+2587><U+2587><U+2583>
221          231           <U+2582><U+2587><U+2587><U+2586><U+2583>

One workaround is the skim_without_charts() function, which produces the same summary without the histograms. The trade-off is that many people find the histograms useful.

dat %>%
  dplyr::group_by(species) %>%
  dplyr::filter(island == "Biscoe") %>%
  skim_without_charts(flipper_length_mm) %>%
  pander::pander()
Table continues below

skim_type  skim_variable      species  n_missing  complete_rate
numeric    flipper_length_mm  Adelie   0          1
numeric    flipper_length_mm  Gentoo   1          0.9919

Table continues below

numeric.mean  numeric.sd  numeric.p0  numeric.p25  numeric.p50
188.8         6.729       172         184.8        189.5
217.2         6.485       203         212          216

numeric.p75  numeric.p100
193          203
221          231
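
If the histograms are the only obstacle, another option is to build a custom skimming function with skim_with() and sfl(), both exported by skimr. Setting a summary function to NULL removes it, and new ones can be added in the same call. The sketch below uses iris as stand-in data; my_skim and the iqr column are illustrative names, not from the original analysis.

```r
library(skimr)

# Custom skimmer: drop the histogram, add the interquartile range
my_skim <- skim_with(numeric = sfl(hist = NULL, iqr = IQR))

out <- my_skim(iris, Sepal.Length)
names(out)
```

This keeps control over exactly which statistics reach the table-rendering step, rather than removing all charts at once.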

Final Thoughts

This was just a brief look at some of the functionality of the skimr package. Just a few of the commands above can provide a quick but useful look at your data. Combining skimr with tidyverse functions greatly expands the summary information that can be generated, providing greater insight into your data.

Resources Utilized

palmerpenguins data:

Gorman KB, Williams TD, Fraser WR (2014) Ecological Sexual Dimorphism and Environmental Variability within a Community of Antarctic Penguins (Genus Pygoscelis). PLoS ONE 9(3): e90081. https://doi.org/10.1371/journal.pone.0090081

skimr package: https://cran.r-project.org/web/packages/skimr/readme/README.html

skimr vignette: https://cran.r-project.org/web/packages/skimr/vignettes/skimr.html