Summarytools

Libraries

library(tidyverse)
library(dplyr)
library(summarytools)
library(kableExtra)

Overview

summarytools provides tools to quickly and neatly summarize data: a coherent set of functions centered on data exploration and reporting. The package has four main functions:

  • dfSummary() - Extensive Data Frame Summaries featuring type-specific information for all variables in a data frame: univariate statistics and/or frequency distributions, bar charts or histograms, as well as missing data counts and proportions. Very useful to quickly detect anomalies and identify trends at a glance
  • descr() - Descriptive (Univariate) Statistics for numerical data, featuring common measures of central tendency and dispersion
  • freq() - Frequency Tables featuring counts, proportions, as well as missing data information
  • ctable() - Cross-Tabulations (joint frequencies) between pairs of discrete/categorical variables, featuring marginal sums as well as row, column or total proportions

dfSummary

Used to summarize an entire dataset. For each variable, dfSummary() reports the variable type, descriptive statistics or value frequencies, and the number of valid and missing values, and it automatically creates a plot showing the distribution of the data. The output can be controlled using various arguments.

print(dfSummary(tobacco), method = "render")

Data Frame Summary

tobacco

Dimensions: 1000 x 9
Duplicates: 2

| No | Variable | Stats / Values | Freqs (% of Valid) | Valid | Missing |
|---:|:---------|:---------------|:-------------------|:------|:--------|
| 1 | gender [factor] | 1. F; 2. M | 489 (50.0%); 489 (50.0%) | 978 (97.8%) | 22 (2.2%) |
| 2 | age [numeric] | Mean (sd): 49.6 (18.3); min < med < max: 18 < 50 < 80; IQR (CV): 32 (0.4) | 63 distinct values | 975 (97.5%) | 25 (2.5%) |
| 3 | age.gr [factor] | 1. 18-34; 2. 35-50; 3. 51-70; 4. 71 + | 258 (26.5%); 241 (24.7%); 317 (32.5%); 159 (16.3%) | 975 (97.5%) | 25 (2.5%) |
| 4 | BMI [numeric] | Mean (sd): 25.7 (4.5); min < med < max: 8.8 < 25.6 < 39.4; IQR (CV): 5.7 (0.2) | 974 distinct values | 974 (97.4%) | 26 (2.6%) |
| 5 | smoker [factor] | 1. Yes; 2. No | 298 (29.8%); 702 (70.2%) | 1000 (100%) | 0 (0%) |
| 6 | cigs.per.day [numeric] | Mean (sd): 6.8 (11.9); min < med < max: 0 < 0 < 40; IQR (CV): 11 (1.8) | 37 distinct values | 965 (96.5%) | 35 (3.5%) |
| 7 | diseased [factor] | 1. Yes; 2. No | 224 (22.4%); 776 (77.6%) | 1000 (100%) | 0 (0%) |
| 8 | disease [character] | 1. Hypertension; 2. Cancer; 3. Cholesterol; 4. Heart; 5. Pulmonary; 6. Musculoskeletal; 7. Diabetes; 8. Hearing; 9. Digestive; 10. Hypotension; [ 3 others ] | 36 (16.2%); 34 (15.3%); 21 (9.5%); 20 (9.0%); 20 (9.0%); 19 (8.6%); 14 (6.3%); 14 (6.3%); 12 (5.4%); 11 (5.0%); 21 (9.5%) | 222 (22.2%) | 778 (77.8%) |
| 9 | samp.wgts [numeric] | Mean (sd): 1 (0.1); min < med < max: 0.9 < 1 < 1.1; IQR (CV): 0.2 (0.1) | 0.86!: 267 (26.7%); 1.04!: 249 (24.9%); 1.05!: 324 (32.4%); 1.06!: 160 (16.0%); ! rounded | 1000 (100%) | 0 (0%) |

Generated by summarytools 0.9.6 (R version 4.0.0)
2020-05-15
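
To illustrate how arguments shape the output, here is a minimal sketch that assumes the tobacco data shipped with summarytools. It uses the dfSummary() arguments graph.col, valid.col, and varnumbers to hide three of the columns; which columns to hide is only an illustrative choice, and the object name compact_summary is made up for this example.

```r
library(summarytools)

# Compact summary: hide the graph, valid-count, and variable-number columns.
# (Illustrative choice of columns; adjust to taste.)
compact_summary <- dfSummary(
  tobacco,
  graph.col  = FALSE,
  valid.col  = FALSE,
  varnumbers = FALSE
)

# The default print method writes plain text to the console
# rather than rendering HTML.
print(compact_summary)
```

The result still has one row per variable, so it can be used wherever the full summary would be too wide, such as a console session or a narrow report column.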

descr

Used to generate more detailed statistical metrics for the numerical variables of a dataset.

descr(tobacco$BMI)
## Descriptive Statistics  
## tobacco$BMI  
## N: 1000  
## 
##                        BMI
## ----------------- --------
##              Mean    25.73
##           Std.Dev     4.49
##               Min     8.83
##                Q1    22.93
##            Median    25.62
##                Q3    28.65
##               Max    39.44
##               MAD     4.18
##               IQR     5.72
##                CV     0.17
##          Skewness     0.02
##       SE.Skewness     0.08
##          Kurtosis     0.26
##           N.Valid   974.00
##         Pct.Valid    97.40
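
descr() also accepts a whole data frame. The sketch below assumes the tobacco data and uses the stats and transpose arguments to request a handful of statistics for every numeric column at once. The particular statistics chosen and the name numeric_stats are illustrative assumptions.

```r
library(summarytools)

# Selected statistics for every numeric column of the data frame;
# non-numeric columns are skipped by descr().
numeric_stats <- descr(
  tobacco,
  stats     = c("mean", "sd", "min", "med", "max"),
  transpose = TRUE  # one row per variable, one column per statistic
)

numeric_stats
```

Transposing is a matter of taste: with several variables and only a few statistics, one row per variable tends to be easier to scan.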

freq

Used to generate a frequency table with counts, proportions, and missing data information.

freq(tobacco$gender, plain.ascii = FALSE, style = "rmarkdown")
## ### Frequencies  
## #### tobacco$gender  
## **Type:** Factor  
## 
## |     &nbsp; | Freq | % Valid | % Valid Cum. | % Total | % Total Cum. |
## |-----------:|-----:|--------:|-------------:|--------:|-------------:|
## |      **F** |  489 |   50.00 |        50.00 |   48.90 |        48.90 |
## |      **M** |  489 |   50.00 |       100.00 |   48.90 |        97.80 |
## | **\<NA\>** |   22 |         |              |    2.20 |       100.00 |
## |  **Total** | 1000 |  100.00 |       100.00 |  100.00 |       100.00 |
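
The same function can sort the table and leave out the missing-value row. The following is a small sketch, assuming the tobacco data, that uses the order and report.nas arguments of freq() on the disease variable; the name disease_freq is made up for this example.

```r
library(summarytools)

# Disease frequencies, most common value first,
# printed without the <NA> row and without the "Total" proportion columns.
disease_freq <- freq(
  tobacco$disease,
  order      = "freq",
  report.nas = FALSE
)

disease_freq
```

Sorting by frequency is handy for variables with many levels, where the default ordering buries the most common values.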

ctable

Used to cross-tabulate frequencies across pairs of categorical variables.

Example: disease ~ gender

print(ctable(tobacco$disease, tobacco$gender),  method = "render")

Cross-Tabulation, Row Proportions

disease * gender

Data Frame: tobacco

| disease | gender: F | gender: M | gender: <NA> | Total |
|:----------------|-------------:|-------------:|-----------:|---------------:|
| Cancer | 16 (47.1%) | 18 (52.9%) | 0 (0.0%) | 34 (100.0%) |
| Cholesterol | 10 (47.6%) | 11 (52.4%) | 0 (0.0%) | 21 (100.0%) |
| Diabetes | 8 (57.1%) | 5 (35.7%) | 1 (7.1%) | 14 (100.0%) |
| Digestive | 5 (41.7%) | 7 (58.3%) | 0 (0.0%) | 12 (100.0%) |
| Hearing | 5 (35.7%) | 9 (64.3%) | 0 (0.0%) | 14 (100.0%) |
| Heart | 9 (45.0%) | 11 (55.0%) | 0 (0.0%) | 20 (100.0%) |
| Hypertension | 18 (50.0%) | 17 (47.2%) | 1 (2.8%) | 36 (100.0%) |
| Hypotension | 7 (63.6%) | 4 (36.4%) | 0 (0.0%) | 11 (100.0%) |
| Musculoskeletal | 8 (42.1%) | 10 (52.6%) | 1 (5.3%) | 19 (100.0%) |
| Neurological | 7 (70.0%) | 3 (30.0%) | 0 (0.0%) | 10 (100.0%) |
| Other | 1 (50.0%) | 1 (50.0%) | 0 (0.0%) | 2 (100.0%) |
| Pulmonary | 9 (45.0%) | 11 (55.0%) | 0 (0.0%) | 20 (100.0%) |
| Vision | 6 (66.7%) | 3 (33.3%) | 0 (0.0%) | 9 (100.0%) |
| <NA> | 380 (48.8%) | 379 (48.7%) | 19 (2.4%) | 778 (100.0%) |
| Total | 489 (48.9%) | 489 (48.9%) | 22 (2.2%) | 1000 (100.0%) |

Generated by summarytools 0.9.6 (R version 4.0.0)
2020-05-15
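
The table above shows row proportions, which is the default. As a hedged sketch, again assuming the tobacco data, the prop and useNA arguments of ctable() can switch to column proportions and drop missing values. The variable pairing and the name gender_by_smoker are illustrative choices.

```r
library(summarytools)

# Gender by smoking status with column proportions ("c");
# rows with a missing value in either variable are excluded.
gender_by_smoker <- ctable(
  tobacco$gender,
  tobacco$smoker,
  prop  = "c",
  useNA = "no"
)

print(gender_by_smoker)
```

Column proportions answer a different question than row proportions: here, the share of each gender within smokers and within non-smokers, rather than the share of smokers within each gender.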

I found this the fastest and most efficient way to explore the statistical metrics of any given dataset.