Data analysis starts with understanding your data. There are many packages and functions that can be useful for exploring your data in R. I came across some packages, such as DataExplorer, that write an HTML report on your data for initial exploration. While those can be useful, I thought I’d focus on a fairly straightforward package for exploring data: skimr.
The skimr package can be installed from CRAN using install.packages("skimr").
## Loading required package: skimr
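The setup chunk is not shown in this post, so here is a minimal sketch of what it could look like. It rests on an assumption: the summaries in this post report a data frame called dat with 344 rows and 7 columns, which matches the penguins data from the palmerpenguins package with the year column dropped. The select() step is my guess, not something the post states.

```r
# install.packages(c("skimr", "dplyr", "palmerpenguins"))  # run once if needed
library(skimr)
library(dplyr)
library(palmerpenguins)

# Assumption: `dat` is penguins without `year`, which gives the
# 344 rows and 7 columns reported in the skim summaries.
dat <- penguins %>% select(-year)

dim(dat)  # 344 rows, 7 columns
```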
The skimr package takes data input and quickly creates summary statistics. One of the benefits of the skimr package is that it goes beyond the statistics generated by summary(). The output from skimr is a skim_df object. When printed, it starts with an overview of the data (the number of rows and columns) and lists how many variables of each type are present (e.g. factor, numeric). It then gives summary statistics for each variable, including information on missing data.
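A call along these lines would produce the summary that follows. This is a sketch rather than the original chunk, and it assumes dat is the penguins data without the year column.

```r
library(skimr)
library(dplyr)
library(palmerpenguins)

# Assumption: `dat` is penguins without the `year` column.
dat <- penguins %>% select(-year)

# skim() returns a skim_df; printing it shows the overview and
# one table per variable type.
skimmed <- skim(dat)
skimmed

# The result is also a tibble, with one row per variable.
class(skimmed)
```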
| Name | dat |
| Number of rows | 344 |
| Number of columns | 7 |
| _______________________ | |
| Column type frequency: | |
| factor | 3 |
| numeric | 4 |
| ________________________ | |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| species | 0 | 1.00 | FALSE | 3 | Ade: 152, Gen: 124, Chi: 68 |
| island | 0 | 1.00 | FALSE | 3 | Bis: 168, Dre: 124, Tor: 52 |
| sex | 11 | 0.97 | FALSE | 2 | mal: 168, fem: 165 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| bill_length_mm | 2 | 0.99 | 43.92 | 5.46 | 32.1 | 39.23 | 44.45 | 48.5 | 59.6 | ▃▇▇▆▁ |
| bill_depth_mm | 2 | 0.99 | 17.15 | 1.97 | 13.1 | 15.60 | 17.30 | 18.7 | 21.5 | ▅▅▇▇▂ |
| flipper_length_mm | 2 | 0.99 | 200.92 | 14.06 | 172.0 | 190.00 | 197.00 | 213.0 | 231.0 | ▂▇▃▅▂ |
| body_mass_g | 2 | 0.99 | 4201.75 | 801.95 | 2700.0 | 3550.00 | 4050.00 | 4750.0 | 6300.0 | ▃▇▆▃▂ |
Suppose you just want information on one variable. In that case, add the variable name to the skim() call. Here, I wanted to look at island and get more information about that variable. The output for factors lets you quickly see the number of unique values (levels) in the variable, as well as the number of observations for each level.
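As a sketch (again assuming dat is the penguins data without year), naming the column after the data argument restricts the summary to that variable:

```r
library(skimr)
library(dplyr)
library(palmerpenguins)

# Assumption: `dat` is penguins without the `year` column.
dat <- penguins %>% select(-year)

# Columns are selected by bare name after the data argument.
island_skim <- skim(dat, island)
island_skim
```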
| Name | dat |
| Number of rows | 344 |
| Number of columns | 7 |
| _______________________ | |
| Column type frequency: | |
| factor | 1 |
| ________________________ | |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| island | 0 | 1 | FALSE | 3 | Bis: 168, Dre: 124, Tor: 52 |
One cool thing about the skimr package is that both skim() and its skim_df output work with tidyverse functions, such as those in dplyr, using pipes. This allows you to get more information about the data using functions such as dplyr::group_by(). In this case, the output is grouped by species, so you can see the number of males and females per species or how many penguins of each species were observed on each island.
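A grouped summary like the one shown here can be sketched as follows, under the same assumption about dat. Because the data is piped in, the printed summary names it "Piped data".

```r
library(skimr)
library(dplyr)
library(palmerpenguins)

# Assumption: `dat` is penguins without the `year` column.
dat <- penguins %>% select(-year)

# Grouping first makes skim() summarise every remaining variable
# separately for each species.
by_species <- dat %>%
  group_by(species) %>%
  skim()

by_species
```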
| Name | Piped data |
| Number of rows | 344 |
| Number of columns | 7 |
| _______________________ | |
| Column type frequency: | |
| factor | 2 |
| numeric | 4 |
| ________________________ | |
| Group variables | species |
Variable type: factor
| skim_variable | species | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|---|
| island | Adelie | 0 | 1.00 | FALSE | 3 | Dre: 56, Tor: 52, Bis: 44 |
| island | Chinstrap | 0 | 1.00 | FALSE | 1 | Dre: 68, Bis: 0, Tor: 0 |
| island | Gentoo | 0 | 1.00 | FALSE | 1 | Bis: 124, Dre: 0, Tor: 0 |
| sex | Adelie | 6 | 0.96 | FALSE | 2 | fem: 73, mal: 73 |
| sex | Chinstrap | 0 | 1.00 | FALSE | 2 | fem: 34, mal: 34 |
| sex | Gentoo | 5 | 0.96 | FALSE | 2 | mal: 61, fem: 58 |
Variable type: numeric
| skim_variable | species | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|---|
| bill_length_mm | Adelie | 1 | 0.99 | 38.79 | 2.66 | 32.1 | 36.75 | 38.80 | 40.75 | 46.0 | ▁▆▇▆▁ |
| bill_length_mm | Chinstrap | 0 | 1.00 | 48.83 | 3.34 | 40.9 | 46.35 | 49.55 | 51.08 | 58.0 | ▂▇▇▅▁ |
| bill_length_mm | Gentoo | 1 | 0.99 | 47.50 | 3.08 | 40.9 | 45.30 | 47.30 | 49.55 | 59.6 | ▃▇▆▁▁ |
| bill_depth_mm | Adelie | 1 | 0.99 | 18.35 | 1.22 | 15.5 | 17.50 | 18.40 | 19.00 | 21.5 | ▂▆▇▃▁ |
| bill_depth_mm | Chinstrap | 0 | 1.00 | 18.42 | 1.14 | 16.4 | 17.50 | 18.45 | 19.40 | 20.8 | ▅▇▇▆▂ |
| bill_depth_mm | Gentoo | 1 | 0.99 | 14.98 | 0.98 | 13.1 | 14.20 | 15.00 | 15.70 | 17.3 | ▅▇▇▆▂ |
| flipper_length_mm | Adelie | 1 | 0.99 | 189.95 | 6.54 | 172.0 | 186.00 | 190.00 | 195.00 | 210.0 | ▁▆▇▅▁ |
| flipper_length_mm | Chinstrap | 0 | 1.00 | 195.82 | 7.13 | 178.0 | 191.00 | 196.00 | 201.00 | 212.0 | ▁▅▇▅▂ |
| flipper_length_mm | Gentoo | 1 | 0.99 | 217.19 | 6.48 | 203.0 | 212.00 | 216.00 | 221.00 | 231.0 | ▂▇▇▆▃ |
| body_mass_g | Adelie | 1 | 0.99 | 3700.66 | 458.57 | 2850.0 | 3350.00 | 3700.00 | 4000.00 | 4775.0 | ▅▇▇▃▂ |
| body_mass_g | Chinstrap | 0 | 1.00 | 3733.09 | 384.34 | 2700.0 | 3487.50 | 3700.00 | 3950.00 | 4800.0 | ▁▅▇▃▁ |
| body_mass_g | Gentoo | 1 | 0.99 | 5076.02 | 504.12 | 3950.0 | 4700.00 | 5000.00 | 5500.00 | 6300.0 | ▃▇▇▇▂ |
If you wanted to dig further in, you could group the data by species, filter by island and just look at the flipper length variable.
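The pipeline below follows the one used later in this post with pander(), minus the final formatting step. The only assumption is the definition of dat.

```r
library(skimr)
library(dplyr)
library(palmerpenguins)

# Assumption: `dat` is penguins without the `year` column.
dat <- penguins %>% select(-year)

# Keep only Biscoe island, then summarise flipper length per species.
biscoe_flipper <- dat %>%
  group_by(species) %>%
  filter(island == "Biscoe") %>%
  skim(flipper_length_mm)

biscoe_flipper
```

Only Adelie and Gentoo penguins were recorded on Biscoe, so the grouped result has two rows.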
| Name | Piped data |
| Number of rows | 168 |
| Number of columns | 7 |
| _______________________ | |
| Column type frequency: | |
| numeric | 1 |
| ________________________ | |
| Group variables | species |
Variable type: numeric
| skim_variable | species | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|---|
| flipper_length_mm | Adelie | 0 | 1.00 | 188.80 | 6.73 | 172 | 184.75 | 189.5 | 193 | 203 | ▁▅▇▇▃ |
| flipper_length_mm | Gentoo | 1 | 0.99 | 217.19 | 6.48 | 203 | 212.00 | 216.0 | 221 | 231 | ▂▇▇▆▃ |
Finally, the skim() function can be combined with dplyr::glimpse() to create a different view of the output.
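The original chunk is not shown, so this is a reconstruction from the output: it lists sex for Adelie and Gentoo only, which is consistent with the Biscoe filter used earlier. Treat the filter and the choice of sex as assumptions, along with the definition of dat.

```r
library(skimr)
library(dplyr)
library(palmerpenguins)

# Assumption: `dat` is penguins without the `year` column.
dat <- penguins %>% select(-year)

# glimpse() prints one line per column of the skim_df and
# returns its input invisibly, so the result can still be stored.
sex_skim <- dat %>%
  group_by(species) %>%
  filter(island == "Biscoe") %>%  # assumed filter, inferred from the output
  skim(sex) %>%
  glimpse()
```

This view shows the full column names, such as factor.top_counts, which is handy if you want to pull values out of the skim_df later.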
## Rows: 2
## Columns: 8
## $ skim_type <chr> "factor", "factor"
## $ skim_variable <chr> "sex", "sex"
## $ species <fct> Adelie, Gentoo
## $ n_missing <int> 0, 5
## $ complete_rate <dbl> 1.0000000, 0.9596774
## $ factor.ordered <lgl> FALSE, FALSE
## $ factor.n_unique <int> 2, 2
## $ factor.top_counts <chr> "fem: 22, mal: 22", "mal: 61, fem: 58"
While skimr could prove useful for initial exploratory analysis, there are some limitations. One limitation that I quickly came across was the inability to print the spark-histogram characters. This happened when I tried to create a “nicer” looking table using pander.
    dat %>%
      dplyr::group_by(species) %>%
      filter(island == "Biscoe") %>%
      skim(flipper_length_mm) %>%
      pander()

| skim_type | skim_variable | species | n_missing | complete_rate |
|---|---|---|---|---|
| numeric | flipper_length_mm | Adelie | 0 | 1 |
| numeric | flipper_length_mm | Gentoo | 1 | 0.9919 |
| numeric.mean | numeric.sd | numeric.p0 | numeric.p25 | numeric.p50 |
|---|---|---|---|---|
| 188.8 | 6.729 | 172 | 184.8 | 189.5 |
| 217.2 | 6.485 | 203 | 212 | 216 |
| numeric.p75 | numeric.p100 | numeric.hist |
|---|---|---|
| 193 | 203 | <U+2581><U+2585><U+2587><U+2587><U+2583> |
| 221 | 231 | <U+2582><U+2587><U+2587><U+2586><U+2583> |
One workaround is the skim_without_charts() function, which returns the same summary without the histogram column. The catch is that many people find the histograms useful.
    dat %>%
      dplyr::group_by(species) %>%
      filter(island == "Biscoe") %>%
      skim_without_charts(flipper_length_mm) %>%
      pander()

| skim_type | skim_variable | species | n_missing | complete_rate |
|---|---|---|---|---|
| numeric | flipper_length_mm | Adelie | 0 | 1 |
| numeric | flipper_length_mm | Gentoo | 1 | 0.9919 |
| numeric.mean | numeric.sd | numeric.p0 | numeric.p25 | numeric.p50 |
|---|---|---|---|---|
| 188.8 | 6.729 | 172 | 184.8 | 189.5 |
| 217.2 | 6.485 | 203 | 212 | 216 |
| numeric.p75 | numeric.p100 |
|---|---|
| 193 | 203 |
| 221 | 231 |
This was just a brief look at some of the functionality of the skimr package. Just using a few of the commands above can provide a quick but useful look at your data. Combining skimr with tidyverse functions greatly expands the summary information that can be generated, providing greater insight into your data.
palmerpenguins data:
Gorman KB, Williams TD, Fraser WR (2014) Ecological Sexual Dimorphism and Environmental Variability within a Community of Antarctic Penguins (Genus Pygoscelis). PLoS ONE 9(3): e90081. https://doi.org/10.1371/journal.pone.0090081
skimr package: https://cran.r-project.org/web/packages/skimr/readme/README.html
skimr vignette: https://cran.r-project.org/web/packages/skimr/vignettes/skimr.html