When beginning to use R, one of the first things to do after getting a view of the dataset you are dealing with, using glimpse() and str(), is to start looking at basic descriptive statistics. In the past I used summary() extensively to do this. I was going through a tutorial on EDA and linear regression where the author used the skimr package, which I thought was a great alternative for quickly getting basic statistics.
Personally I prefer the structure of the skimr output and how it can be customised for your own needs.
In the examples to follow I use different tidyverse verbs and different methods of displaying data.
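The examples assume the relevant packages are already attached. A minimal setup sketch follows; the post does not show its library() calls, so the exact set here is an assumption (mpg is taken from ggplot2, and loading the whole tidyverse would work just as well):

```r
# Setup assumed by the examples in this post (not shown in the original)
library(dplyr)    # %>%, select(), filter(), arrange(), group_by()
library(ggplot2)  # ships the mpg dataset
library(skimr)    # skim() and its helpers

# Quick structural look before moving on to descriptive statistics
glimpse(mpg)
dim(mpg)
```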
Below is the default output from skim() when run on a dataset, in this case mpg. The output is separated into a summary section and then split by variable type, character and numeric in this instance. Note the inclusion of sparkline histograms and some additional statistics not included in summary().
skim(mpg)
| Name | mpg |
| Number of rows | 234 |
| Number of columns | 11 |
| _______________________ | |
| Column type frequency: | |
| character | 6 |
| numeric | 5 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| manufacturer | 0 | 1 | 4 | 10 | 0 | 15 | 0 |
| model | 0 | 1 | 2 | 22 | 0 | 38 | 0 |
| trans | 0 | 1 | 8 | 10 | 0 | 10 | 0 |
| drv | 0 | 1 | 1 | 1 | 0 | 3 | 0 |
| fl | 0 | 1 | 1 | 1 | 0 | 5 | 0 |
| class | 0 | 1 | 3 | 10 | 0 | 7 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| displ | 0 | 1 | 3.47 | 1.29 | 1.6 | 2.4 | 3.3 | 4.6 | 7 | ▇▆▆▃▁ |
| year | 0 | 1 | 2003.50 | 4.51 | 1999.0 | 1999.0 | 2003.5 | 2008.0 | 2008 | ▇▁▁▁▇ |
| cyl | 0 | 1 | 5.89 | 1.61 | 4.0 | 4.0 | 6.0 | 8.0 | 8 | ▇▁▇▁▇ |
| cty | 0 | 1 | 16.86 | 4.26 | 9.0 | 14.0 | 17.0 | 19.0 | 35 | ▆▇▃▁▁ |
| hwy | 0 | 1 | 23.44 | 5.95 | 12.0 | 18.0 | 24.0 | 27.0 | 44 | ▅▅▇▁▁ |
Because skim() returns a skim_df object, the result is pipeable and open to additional manipulation. Looking at the structure of the skim_df shows how it is made up. Below I use to_long() to get a look at skim_type, skim_variable and stat.
In the output below, n_missing and complete_rate are base skimmers. The rest are type-based skimmers, and we need to use the skim_type prefix (for example numeric.mean) to refer to the correct column.
to_long(skim(mpg,model,hwy)) %>% select(skim_type,skim_variable,stat) %>% arrange(skim_type)
## # A tibble: 17 x 3
## skim_type skim_variable stat
## <chr> <chr> <chr>
## 1 character model n_missing
## 2 character model complete_rate
## 3 character model character.min
## 4 character model character.max
## 5 character model character.empty
## 6 character model character.n_unique
## 7 character model character.whitespace
## 8 numeric hwy n_missing
## 9 numeric hwy complete_rate
## 10 numeric hwy numeric.mean
## 11 numeric hwy numeric.sd
## 12 numeric hwy numeric.p0
## 13 numeric hwy numeric.p25
## 14 numeric hwy numeric.p50
## 15 numeric hwy numeric.p75
## 16 numeric hwy numeric.p100
## 17 numeric hwy numeric.hist
Below is an example of selecting type-based and base skimmers. Note that n_missing is a base skimmer, while numeric.mean and character.n_unique are type-based skimmers.
skim(mpg) %>% select(skim_type,skim_variable,n_missing,numeric.mean,character.n_unique)
| Name | mpg |
| Number of rows | 234 |
| Number of columns | 11 |
| _______________________ | |
| Column type frequency: | |
| character | 6 |
| numeric | 5 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | n_unique |
|---|---|---|
| manufacturer | 0 | 15 |
| model | 0 | 38 |
| trans | 0 | 10 |
| drv | 0 | 3 |
| fl | 0 | 5 |
| class | 0 | 7 |
Variable type: numeric
| skim_variable | n_missing | mean |
|---|---|---|
| displ | 0 | 3.47 |
| year | 0 | 2003.50 |
| cyl | 0 | 5.89 |
| cty | 0 | 16.86 |
| hwy | 0 | 23.44 |
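skimr also provides focus(), a select()-style helper meant for skim_df objects that keeps the skim_type and skim_variable metadata columns for you. A small sketch of the same selection as above, assuming a recent skimr release where focus() is available:

```r
library(skimr)
library(ggplot2)  # for the mpg dataset

# Same three statistics as the select() call, without naming the metadata columns
focused <- focus(skim(mpg), n_missing, numeric.mean, character.n_unique)
names(focused)
focused
```

Whether to use select() or focus() is mostly a matter of taste; focus() just saves typing the two metadata columns each time.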
We can also skim only specific columns. There are many ways to do this: here the column is passed to skim() directly, but we could equally use the pipe and select().
skim(mpg,hwy)
| Name | mpg |
| Number of rows | 234 |
| Number of columns | 11 |
| _______________________ | |
| Column type frequency: | |
| numeric | 1 |
| ________________________ | |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| hwy | 0 | 1 | 23.44 | 5.95 | 12 | 18 | 24 | 27 | 44 | ▅▅▇▁▁ |
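A few of the other ways to restrict the columns can be sketched as follows. The choice of columns is only for illustration, and the helpers starts_with() and where() are the usual tidyselect ones made available by dplyr:

```r
library(dplyr)
library(ggplot2)  # for the mpg dataset
library(skimr)

# 1. Select first, then skim whatever is left
mpg %>% select(cty, hwy) %>% skim()

# 2. Pass a tidyselect helper straight to skim()
c_cols <- skim(mpg, starts_with("c"))  # matches cyl, cty and class
c_cols$skim_variable

# 3. Skim by column type
skim(mpg, where(is.numeric))
```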
We can use grouping and display the relevant information by group. Note below how the pipe and group_by() are used.
mpg %>% group_by(drv) %>% skim(hwy)
| Name | Piped data |
| Number of rows | 234 |
| Number of columns | 11 |
| _______________________ | |
| Column type frequency: | |
| numeric | 1 |
| ________________________ | |
| Group variables | drv |
Variable type: numeric
| skim_variable | drv | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|---|
| hwy | 4 | 0 | 1 | 19.17 | 4.08 | 12 | 17 | 18 | 22 | 28 | ▃▇▅▁▅ |
| hwy | f | 0 | 1 | 28.16 | 4.21 | 17 | 26 | 28 | 29 | 44 | ▁▇▇▁▁ |
| hwy | r | 0 | 1 | 21.00 | 3.66 | 15 | 17 | 21 | 24 | 26 | ▇▂▃▃▇ |
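As a sanity check, the grouped statistics from skim() should match what a plain summarise() gives. A minimal sketch, where skimmed and by_hand are just illustrative names:

```r
library(dplyr)
library(ggplot2)  # for the mpg dataset
library(skimr)

skimmed <- mpg %>% group_by(drv) %>% skim(hwy)

# The same statistics computed by hand with summarise()
by_hand <- mpg %>%
  group_by(drv) %>%
  summarise(mean = mean(hwy), sd = sd(hwy))

# Both should agree row for row (groups 4, f, r)
all.equal(skimmed$numeric.mean, by_hand$mean)
all.equal(skimmed$numeric.sd, by_hand$sd)
```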
If you don’t want the charts, you can use skim_without_charts().
skim_without_charts(mpg) %>% filter(skim_variable == "hwy")
| Name | mpg |
| Number of rows | 234 |
| Number of columns | 11 |
| _______________________ | |
| Column type frequency: | |
| numeric | 1 |
| ________________________ | |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 |
|---|---|---|---|---|---|---|---|---|---|
| hwy | 0 | 1 | 23.44 | 5.95 | 12 | 18 | 24 | 27 | 44 |
If we only want to see the numeric section we can yank that section.
mpg %>% skim() %>% yank("numeric")
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| displ | 0 | 1 | 3.47 | 1.29 | 1.6 | 2.4 | 3.3 | 4.6 | 7 | ▇▆▆▃▁ |
| year | 0 | 1 | 2003.50 | 4.51 | 1999.0 | 1999.0 | 2003.5 | 2008.0 | 2008 | ▇▁▁▁▇ |
| cyl | 0 | 1 | 5.89 | 1.61 | 4.0 | 4.0 | 6.0 | 8.0 | 8 | ▇▁▇▁▇ |
| cty | 0 | 1 | 16.86 | 4.26 | 9.0 | 14.0 | 17.0 | 19.0 | 35 | ▆▇▃▁▁ |
| hwy | 0 | 1 | 23.44 | 5.95 | 12.0 | 18.0 | 24.0 | 27.0 | 44 | ▅▅▇▁▁ |
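The same idea extends to the other types. A short sketch that yanks the character section instead, and uses partition() to split the skim into one data frame per type (the object names are only for illustration):

```r
library(ggplot2)  # for the mpg dataset
library(skimr)

skimmed <- skim(mpg)

# yank() works for any type present in the skim
chr <- yank(skimmed, "character")
chr

# partition() splits the skim into one data frame per type
parts <- partition(skimmed)
names(parts)
parts$numeric
```

In both cases the type prefix is dropped from the column names, so numeric.mean becomes mean, as in the table above.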
Using skim_with() we can specify our own statistics. For example, we can make use of functions from R’s stats package. Note the default behaviour is to append your statistics to the default statistics that skim() returns. By setting append = FALSE we return only the statistics we specify.
my_skim <- skim_with(
  numeric = sfl(iqr = IQR, mad = mad, p99 = ~ quantile(., probs = .99)),
  append = FALSE
)
my_skim(mpg,hwy)
| Name | mpg |
| Number of rows | 234 |
| Number of columns | 11 |
| _______________________ | |
| Column type frequency: | |
| numeric | 1 |
| ________________________ | |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | iqr | mad | p99 |
|---|---|---|---|---|---|
| hwy | 0 | 1 | 9 | 7.41 | 39.68 |
We can also exclude statistics we don’t want. Below we set p25 and p75 to NULL. Because we didn’t specify append = FALSE, our addition of iqr gets appended to the default output.
my_skim <- skim_with(numeric = sfl(iqr = IQR, p25 = NULL, p75 = NULL))
my_skim(mpg,hwy)
| Name | mpg |
| Number of rows | 234 |
| Number of columns | 11 |
| _______________________ | |
| Column type frequency: | |
| numeric | 1 |
| ________________________ | |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p50 | p100 | hist | iqr |
|---|---|---|---|---|---|---|---|---|---|
| hwy | 0 | 1 | 23.44 | 5.95 | 12 | 24 | 44 | ▅▅▇▁▁ | 9 |
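Any function that returns a single value per column can be plugged in the same way. As a sketch, here is a coefficient of variation added via a formula; cv and cv_skim are hypothetical names chosen for this example, not something skimr provides:

```r
library(ggplot2)  # for the mpg dataset
library(skimr)

# cv is a hypothetical extra statistic: standard deviation relative to the mean
cv_skim <- skim_with(
  numeric = sfl(cv = ~ sd(., na.rm = TRUE) / mean(., na.rm = TRUE))
)

hwy_cv <- cv_skim(mpg, hwy)
hwy_cv$numeric.cv  # appended after the default numeric statistics
```

The na.rm = TRUE arguments are a precaution for columns with missing values; mpg has none, so they make no difference here.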
We can also use skim_tee() to print the skim and return the original data.
mpg_tee <- mpg %>% skim_tee()
## -- Data Summary ------------------------
## Values
## Name data
## Number of rows 234
## Number of columns 11
## _______________________
## Column type frequency:
## character 6
## numeric 5
## ________________________
## Group variables None
##
## -- Variable type: character ----------------------------------------------------
## # A tibble: 6 x 8
## skim_variable n_missing complete_rate min max empty n_unique whitespace
## * <chr> <int> <dbl> <int> <int> <int> <int> <int>
## 1 manufacturer 0 1 4 10 0 15 0
## 2 model 0 1 2 22 0 38 0
## 3 trans 0 1 8 10 0 10 0
## 4 drv 0 1 1 1 0 3 0
## 5 fl 0 1 1 1 0 5 0
## 6 class 0 1 3 10 0 7 0
##
## -- Variable type: numeric ------------------------------------------------------
## # A tibble: 5 x 11
## skim_variable n_missing complete_rate mean sd p0 p25 p50
## * <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 displ 0 1 3.47 1.29 1.6 2.4 3.3
## 2 year 0 1 2004. 4.51 1999 1999 2004.
## 3 cyl 0 1 5.89 1.61 4 4 6
## 4 cty 0 1 16.9 4.26 9 14 17
## 5 hwy 0 1 23.4 5.95 12 18 24
## p75 p100 hist
## * <dbl> <dbl> <chr>
## 1 4.6 7 <U+2587><U+2586><U+2586><U+2583><U+2581>
## 2 2008 2008 <U+2587><U+2581><U+2581><U+2581><U+2587>
## 3 8 8 <U+2587><U+2581><U+2587><U+2581><U+2587>
## 4 19 35 <U+2586><U+2587><U+2583><U+2581><U+2581>
## 5 27 44 <U+2585><U+2585><U+2587><U+2581><U+2581>
head(mpg_tee)
## # A tibble: 6 x 11
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 audi a4 1.8 1999 4 auto(l5) f 18 29 p compa~
## 2 audi a4 1.8 1999 4 manual(m5) f 21 29 p compa~
## 3 audi a4 2 2008 4 manual(m6) f 20 31 p compa~
## 4 audi a4 2 2008 4 auto(av) f 21 30 p compa~
## 5 audi a4 2.8 1999 6 auto(l5) f 16 26 p compa~
## 6 audi a4 2.8 1999 6 manual(m5) f 18 26 p compa~
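Because skim_tee() passes the data through unchanged, it can sit in the middle of a longer pipeline. A minimal sketch, where the filter threshold and the name efficient are arbitrary choices for illustration:

```r
library(dplyr)
library(ggplot2)  # for the mpg dataset
library(skimr)

# skim_tee() prints the summary, then hands mpg on to the next step untouched
efficient <- mpg %>%
  skim_tee() %>%
  filter(hwy > 30) %>%
  count(class)

efficient
```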
The one issue I encountered was with the sparklines. If you look at the skim_tee() example, the sparklines are displayed as <U+2587><U+2586><U+2586><U+2583><U+2581>, for example.
Looking at https://cran.r-project.org/web/packages/skimr/readme/README.html, the reason given is as follows: "This longstanding problem originates in the low-level code for printing dataframes.
While skimr can render the histograms to the console and in RMarkdown documents, it cannot in other circumstances." The README then lists the affected cases.
In these cases we can use skim_without_charts() as detailed above.
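A sketch of that workaround when the summary has to leave the console, for instance as a CSV file. The temporary file is only for illustration:

```r
library(ggplot2)  # for the mpg dataset
library(skimr)

no_charts <- skim_without_charts(mpg)

# The histogram column is simply absent, so nothing can be mis-encoded
"numeric.hist" %in% names(no_charts)

# Export the summary as a plain data frame, here to a temporary CSV
out_file <- tempfile(fileext = ".csv")
write.csv(as.data.frame(no_charts), out_file, row.names = FALSE)
```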