Code Through

skimr Package for Data Exploration

Data analysis starts with understanding your data. There are many packages and functions that can be useful for exploring your data in R. I came across some packages, such as DataExplorer, that write an HTML report on your data for initial exploration. While those can be useful, I thought I’d focus on a fairly straightforward package for exploring data: skimr.

The skimr package takes a data frame as input and quickly creates summary statistics. One of the benefits of the skimr package is that it goes beyond the statistics generated by summary().

Install

The skimr package can be installed from CRAN using install.packages("skimr").

if (!require("skimr")) install.packages("skimr")

## Loading required package: skimr

library(skimr)

Exploring all your data using the skim function

The output from skim() is a skim_df object. The summary at the top reports the number of rows and columns and how many columns there are of each variable type (e.g. factor, numeric). The tables that follow give one row per variable, including information on missing data (n_missing and complete_rate).
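The calls that follow all use an object named dat, but the code that creates it is not shown in this write-up. Here is a minimal sketch, assuming dat is the penguins data from the palmerpenguins package (see Resources Utilized) with its year column dropped, which would match the 344 rows and 7 columns reported in the summaries:

```r
# Assumption: dat is palmerpenguins::penguins without the year column.
# The original code-through does not show this step.
# Requires the palmerpenguins package: install.packages("palmerpenguins")
dat <- palmerpenguins::penguins
dat$year <- NULL   # drop the eighth column so that seven remain

dim(dat)           # number of rows, then number of columns
names(dat)
```

If your own dat was built differently, the row and column counts in the output below are the thing to compare against.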

All of the data

skim(dat)

Data summary
Name	dat
Number of rows	344
Number of columns	7
_______________________
Column type frequency:
factor	3
numeric	4
________________________
Group variables	None

Variable type: factor

skim_variable	n_missing	complete_rate	ordered	n_unique	top_counts
species	0	1.00	FALSE	3	Ade: 152, Gen: 124, Chi: 68
island	0	1.00	FALSE	3	Bis: 168, Dre: 124, Tor: 52
sex	11	0.97	FALSE	2	mal: 168, fem: 165

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
bill_length_mm	2	0.99	43.92	5.46	32.1	39.23	44.45	48.5	59.6	▃▇▇▆▁
bill_depth_mm	2	0.99	17.15	1.97	13.1	15.60	17.30	18.7	21.5	▅▅▇▇▂
flipper_length_mm	2	0.99	200.92	14.06	172.0	190.00	197.00	213.0	231.0	▂▇▃▅▂
body_mass_g	2	0.99	4201.75	801.95	2700.0	3550.00	4050.00	4750.0	6300.0	▃▇▆▃▂

One variable

Suppose you just want information on one variable. In that case, add the variable name to the skim() call. Here I wanted to look at island. The output for factors lets you quickly see the number of unique values (levels) in each variable, as well as the number of observations for each level.

skim(dat, island)

Data summary
Name	dat
Number of rows	344
Number of columns	7
_______________________
Column type frequency:
factor	1
________________________
Group variables	None

Variable type: factor

skim_variable	n_missing	complete_rate	ordered	n_unique	top_counts
island	0	1	FALSE	3	Bis: 168, Dre: 124, Tor: 52
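skim() also accepts more than one column, and in skimr 2.x the column arguments go through tidyselect, so helpers such as starts_with() should work as well. A small sketch, assuming dat is the palmerpenguins data without the year column (the object names two_factors and bill_cols are just for illustration):

```r
library(skimr)

# Assumption: dat is palmerpenguins::penguins without the year column
dat <- palmerpenguins::penguins
dat$year <- NULL

# Two named columns
two_factors <- skim(dat, island, sex)

# All columns whose names start with "bill"
bill_cols <- skim(dat, dplyr::starts_with("bill"))

two_factors$skim_variable
bill_cols$skim_variable
```

Printing two_factors or bill_cols gives the same style of summary as above, limited to the selected columns.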

Combining functions using pipes

One cool thing about the skimr package is that it works with tidyverse functions, such as those in dplyr, using pipes. For example, skim() respects groups created with dplyr::group_by(). In this case, the data are grouped by species, so you can see the numbers of males and females per species, or how many penguins of each species were recorded on each island.

library(dplyr)  # provides %>% and group_by()

dat %>%
  group_by(species) %>%
  skim()

Data summary
Name	Piped data
Number of rows	344
Number of columns	7
_______________________
Column type frequency:
factor	2
numeric	4
________________________
Group variables	species

Variable type: factor

skim_variable	species	n_missing	complete_rate	ordered	n_unique	top_counts
island	Adelie	0	1.00	FALSE	3	Dre: 56, Tor: 52, Bis: 44
island	Chinstrap	0	1.00	FALSE	1	Dre: 68, Bis: 0, Tor: 0
island	Gentoo	0	1.00	FALSE	1	Bis: 124, Dre: 0, Tor: 0
sex	Adelie	6	0.96	FALSE	2	fem: 73, mal: 73
sex	Chinstrap	0	1.00	FALSE	2	fem: 34, mal: 34
sex	Gentoo	5	0.96	FALSE	2	mal: 61, fem: 58

Variable type: numeric

skim_variable	species	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
bill_length_mm	Adelie	1	0.99	38.79	2.66	32.1	36.75	38.80	40.75	46.0	▁▆▇▆▁
bill_length_mm	Chinstrap	0	1.00	48.83	3.34	40.9	46.35	49.55	51.08	58.0	▂▇▇▅▁
bill_length_mm	Gentoo	1	0.99	47.50	3.08	40.9	45.30	47.30	49.55	59.6	▃▇▆▁▁
bill_depth_mm	Adelie	1	0.99	18.35	1.22	15.5	17.50	18.40	19.00	21.5	▂▆▇▃▁
bill_depth_mm	Chinstrap	0	1.00	18.42	1.14	16.4	17.50	18.45	19.40	20.8	▅▇▇▆▂
bill_depth_mm	Gentoo	1	0.99	14.98	0.98	13.1	14.20	15.00	15.70	17.3	▅▇▇▆▂
flipper_length_mm	Adelie	1	0.99	189.95	6.54	172.0	186.00	190.00	195.00	210.0	▁▆▇▅▁
flipper_length_mm	Chinstrap	0	1.00	195.82	7.13	178.0	191.00	196.00	201.00	212.0	▁▅▇▅▂
flipper_length_mm	Gentoo	1	0.99	217.19	6.48	203.0	212.00	216.00	221.00	231.0	▂▇▇▆▃
body_mass_g	Adelie	1	0.99	3700.66	458.57	2850.0	3350.00	3700.00	4000.00	4775.0	▅▇▇▃▂
body_mass_g	Chinstrap	0	1.00	3733.09	384.34	2700.0	3487.50	3700.00	3950.00	4800.0	▁▅▇▃▁
body_mass_g	Gentoo	1	0.99	5076.02	504.12	3950.0	4700.00	5000.00	5500.00	6300.0	▃▇▇▇▂
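It also works in the other direction: the skim_df is a data frame with one row per variable, so it can be handed to dplyr verbs. As a sketch (assuming dat is the palmerpenguins data without the year column; with_missing is a name I made up), this keeps only the variables that have missing values:

```r
library(skimr)

# Assumption: dat is palmerpenguins::penguins without the year column
dat <- palmerpenguins::penguins
dat$year <- NULL

skimmed <- skim(dat)

# Keep only the summary rows for columns that contain missing values
with_missing <- dplyr::filter(skimmed, n_missing > 0)

with_missing$skim_variable
with_missing$n_missing
```

Based on the full summary shown earlier, this should leave sex and the four numeric measurements, and drop species and island.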

Deeper dive in the data

If you wanted to dig further in, you could group the data by species, filter by island and just look at the flipper length variable.

dat %>%
  dplyr::group_by(species) %>%
  dplyr::filter(island == "Biscoe") %>%
  skim(flipper_length_mm)

Data summary
Name	Piped data
Number of rows	168
Number of columns	7
_______________________
Column type frequency:
numeric	1
________________________
Group variables	species

Variable type: numeric

skim_variable	species	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
flipper_length_mm	Adelie	0	1.00	188.80	6.73	172	184.75	189.5	193	203	▁▅▇▇▃
flipper_length_mm	Gentoo	1	0.99	217.19	6.48	203	212.00	216.0	221	231	▂▇▇▆▃

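Finally, the skim() function can be combined with dplyr::glimpse() to create a different view of the output: each column of the underlying skim_df is listed on its own line, with type-specific statistics prefixed by the variable type (e.g. factor.top_counts).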

dat %>%
  dplyr::group_by(species) %>%
  dplyr::filter(island == "Biscoe") %>%
  skim(sex) %>%
  dplyr::glimpse()

## Rows: 2
## Columns: 8
## $ skim_type         <chr> "factor", "factor"
## $ skim_variable     <chr> "sex", "sex"
## $ species           <fct> Adelie, Gentoo
## $ n_missing         <int> 0, 5
## $ complete_rate     <dbl> 1.0000000, 0.9596774
## $ factor.ordered    <lgl> FALSE, FALSE
## $ factor.n_unique   <int> 2, 2
## $ factor.top_counts <chr> "fem: 22, mal: 22", "mal: 61, fem: 58"
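The factor. prefixes in this view are there because a skim_df keeps the statistics for every variable type in one wide table. If you only want one type, skimr 2.x has yank(), which should return just that type's rows with the prefix removed from the column names. A sketch, assuming dat is the palmerpenguins data without the year column:

```r
library(skimr)

# Assumption: dat is palmerpenguins::penguins without the year column
dat <- palmerpenguins::penguins
dat$year <- NULL

skimmed <- skim(dat)

# Pull out only the factor summaries
factors <- yank(skimmed, "factor")

names(factors)
factors$skim_variable
```

The companion function partition() does the same for all types at once and returns them as a list.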

Limitations

While skimr could prove useful for initial exploratory analysis, there are some limitations. One limitation that I quickly came across was that the spark-histogram characters do not always print correctly. This happened when I tried to create a “nicer” looking table using pander: the histogram column came out as raw Unicode code points instead of bars.

dat %>%
  dplyr::group_by(species) %>%
  dplyr::filter(island == "Biscoe") %>%
  skim(flipper_length_mm) %>%
  pander::pander()

Table continues below
skim_type	skim_variable	species	n_missing	complete_rate
numeric	flipper_length_mm	Adelie	0	1
numeric	flipper_length_mm	Gentoo	1	0.9919

Table continues below
numeric.mean	numeric.sd	numeric.p0	numeric.p25	numeric.p50
188.8	6.729	172	184.8	189.5
217.2	6.485	203	212	216

numeric.p75	numeric.p100	numeric.hist
193	203	<U+2581><U+2585><U+2587><U+2587><U+2583>
221	231	<U+2582><U+2587><U+2587><U+2586><U+2583>

One workaround that skimr provides is the skim_without_charts() function, which produces the same output without the histogram column. The cost is that you lose the histograms, which many people find useful.

dat %>%
  dplyr::group_by(species) %>%
  dplyr::filter(island == "Biscoe") %>%
  skim_without_charts(flipper_length_mm) %>%
  pander::pander()

Table continues below
skim_type	skim_variable	species	n_missing	complete_rate
numeric	flipper_length_mm	Adelie	0	1
numeric	flipper_length_mm	Gentoo	1	0.9919

Table continues below
numeric.mean	numeric.sd	numeric.p0	numeric.p25	numeric.p50
188.8	6.729	172	184.8	189.5
217.2	6.485	203	212	216

numeric.p75	numeric.p100
193	203
221	231
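skim_without_charts() is one ready-made variant. If you want to choose the statistics yourself, skimr also has skim_with() and sfl() for building a custom skim function. The sketch below drops the histogram and adds an interquartile range. The names my_skim, no_hist and iqr are my own choices for illustration, and dat is again assumed to be the palmerpenguins data without the year column:

```r
library(skimr)

# Assumption: dat is palmerpenguins::penguins without the year column
dat <- palmerpenguins::penguins
dat$year <- NULL

# Custom skimmer: setting hist to NULL removes the spark histogram,
# and iqr is appended to the default numeric statistics.
my_skim <- skim_with(
  numeric = sfl(
    iqr  = function(x) IQR(x, na.rm = TRUE),
    hist = NULL
  )
)

no_hist <- my_skim(dat, flipper_length_mm)
names(no_hist)
```

The result has no numeric.hist column, so it can be passed to pander() in the same way as the skim_without_charts() output.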

Final Thoughts

This was just a brief look at some of the functionality of the skimr package. Even a few of the commands above can provide a quick but useful look at your data. Combining skimr with tidyverse functions greatly expands the summary information that can be generated, providing greater insight into your data.

Resources Utilized

palmerpenguins data:

Gorman KB, Williams TD, Fraser WR (2014) Ecological Sexual Dimorphism and Environmental Variability within a Community of Antarctic Penguins (Genus Pygoscelis). PLoS ONE 9(3): e90081. https://doi.org/10.1371/journal.pone.0090081

skimr package: https://cran.r-project.org/web/packages/skimr/readme/README.html

skimr vignette: https://cran.r-project.org/web/packages/skimr/vignettes/skimr.html

Code Through - skimr

Jill Sherwood

6/26/2020
