Tools to Quickly and Neatly Summarize Data. summarytools provides a coherent set of functions centered on data exploration and reporting. The package has four main functions:
- dfSummary() - Extensive data frame summaries featuring type-specific information for all variables in a data frame: univariate statistics and/or frequency distributions, bar charts or histograms, as well as missing data counts and proportions. Very useful to quickly detect anomalies and identify trends at a glance.
- descr() - Descriptive (univariate) statistics for numerical data, featuring common measures of central tendency and dispersion.
- freq() - Frequency tables featuring counts, proportions, as well as missing data information.
- ctable() - Cross-tabulations (joint frequencies) between pairs of discrete/categorical variables, featuring marginal sums as well as row, column or total proportions.

dfSummary() summarizes an entire dataset: for each variable it reports the variable type, descriptive statistics, frequencies, and the number of missing values, and it automatically creates plots showing the distribution of the data. The output can be controlled using various arguments.
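A minimal sketch of how a summary like the one below could be produced. It assumes the `tobacco` example data frame that ships with summarytools; the argument choices in the second call (`graph.col`, `varnumbers`) are illustrative and not necessarily the ones used for this post.

```r
library(summarytools)

# `tobacco` is an example data frame bundled with summarytools
data("tobacco")

# Build the summary object: one row per variable in the data frame
dfs <- dfSummary(tobacco)

# Print to the console; inside an R Markdown document,
# print(dfs, method = "render") produces an HTML table instead
print(dfs)

# Arguments control the output, e.g. drop the graph column
# and hide the variable numbers
dfs_compact <- dfSummary(tobacco, graph.col = FALSE, varnumbers = FALSE)
print(dfs_compact)
```

Keeping the summary in an object (here `dfs`) makes it possible to print it in more than one way without recomputing it.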
| No | Variable | Stats / Values | Freqs (% of Valid) | Graph | Valid | Missing |
|---|---|---|---|---|---|---|
| 1 | gender [factor] | 1. F 2. M | | | 978 (97.8%) | 22 (2.2%) |
| 2 | age [numeric] | Mean (sd) : 49.6 (18.3) min < med < max: 18 < 50 < 80 IQR (CV) : 32 (0.4) | 63 distinct values | | 975 (97.5%) | 25 (2.5%) |
| 3 | age.gr [factor] | 1. 18-34 2. 35-50 3. 51-70 4. 71 + | | | 975 (97.5%) | 25 (2.5%) |
| 4 | BMI [numeric] | Mean (sd) : 25.7 (4.5) min < med < max: 8.8 < 25.6 < 39.4 IQR (CV) : 5.7 (0.2) | 974 distinct values | | 974 (97.4%) | 26 (2.6%) |
| 5 | smoker [factor] | 1. Yes 2. No | | | 1000 (100%) | 0 (0%) |
| 6 | cigs.per.day [numeric] | Mean (sd) : 6.8 (11.9) min < med < max: 0 < 0 < 40 IQR (CV) : 11 (1.8) | 37 distinct values | | 965 (96.5%) | 35 (3.5%) |
| 7 | diseased [factor] | 1. Yes 2. No | | | 1000 (100%) | 0 (0%) |
| 8 | disease [character] | 1. Hypertension 2. Cancer 3. Cholesterol 4. Heart 5. Pulmonary 6. Musculoskeletal 7. Diabetes 8. Hearing 9. Digestive 10. Hypotension [ 3 others ] | | | 222 (22.2%) | 778 (77.8%) |
| 9 | samp.wgts [numeric] | Mean (sd) : 1 (0.1) min < med < max: 0.9 < 1 < 1.1 IQR (CV) : 0.2 (0.1) | | | 1000 (100%) | 0 (0%) |
Generated by summarytools 0.9.6 (R version 4.0.0)
2020-05-15
descr() generates more detailed statistical metrics for the numerical variables of a dataset.
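As a sketch, output like the block below could come from a call of the following form. It assumes the bundled `tobacco` data; the second call, with `stats = "common"` and `transpose = TRUE`, is only an illustration of narrowing the output and is not taken from the original post.

```r
library(summarytools)

data("tobacco")

# All available statistics for a single numeric variable
bmi_stats <- descr(tobacco$BMI)
print(bmi_stats)

# Applied to a whole data frame, descr() keeps the numeric columns only;
# here the statistics are limited to a common subset and the table is
# transposed so that each variable becomes a row
all_stats <- descr(tobacco, stats = "common", transpose = TRUE)
print(all_stats)
```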
## Descriptive Statistics
## tobacco$BMI
## N: 1000
##
## BMI
## ----------------- --------
## Mean 25.73
## Std.Dev 4.49
## Min 8.83
## Q1 22.93
## Median 25.62
## Q3 28.65
## Max 39.44
## MAD 4.18
## IQR 5.72
## CV 0.17
## Skewness 0.02
## SE.Skewness 0.08
## Kurtosis 0.26
## N.Valid 974.00
## Pct.Valid 97.40
freq() generates a frequency table with counts, proportions, and missing data information.
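A hedged sketch of a call that yields a markdown-style table like the one below, again assuming the bundled `tobacco` data. The `style = "rmarkdown"` setting is inferred from the look of the output, and the second call is only an example of trimming the table.

```r
library(summarytools)

data("tobacco")

# Frequency table for one categorical variable, formatted for R Markdown
gender_freq <- freq(tobacco$gender, style = "rmarkdown")
print(gender_freq)

# Leave out the missing-value row and the cumulative columns
gender_freq_valid <- freq(tobacco$gender, report.nas = FALSE, cumul = FALSE)
print(gender_freq_valid)
```

In an R Markdown document, the `rmarkdown` style is normally paired with the chunk option `results = 'asis'` so the table is rendered rather than shown as raw text.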
## ### Frequencies
## #### tobacco$gender
## **Type:** Factor
##
## | | Freq | % Valid | % Valid Cum. | % Total | % Total Cum. |
## |-----------:|-----:|--------:|-------------:|--------:|-------------:|
## | **F** | 489 | 50.00 | 50.00 | 48.90 | 48.90 |
## | **M** | 489 | 50.00 | 100.00 | 48.90 | 97.80 |
## | **\<NA\>** | 22 | | | 2.20 | 100.00 |
## | **Total** | 1000 | 100.00 | 100.00 | 100.00 | 100.00 |
ctable() cross-tabulates frequencies across pairs of categorical variables.
Example: disease ~ gender
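A minimal sketch of this cross-tabulation, assuming the bundled `tobacco` data. Row proportions (`prop = "r"`) are chosen because the percentages in the table below sum to 100% across each row; the column-proportion variant is added only for comparison.

```r
library(summarytools)

data("tobacco")

# disease in rows, gender in columns, with row proportions
ct_row <- ctable(x = tobacco$disease, y = tobacco$gender, prop = "r")
print(ct_row)

# Same table with column proportions instead
ct_col <- ctable(x = tobacco$disease, y = tobacco$gender, prop = "c")
print(ct_col)
```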
| disease (rows) / gender (columns) | F | M | <NA> | Total |
|---|---|---|---|---|
| Cancer | 16 (47.1%) | 18 (52.9%) | 0 (0.0%) | 34 (100.0%) |
| Cholesterol | 10 (47.6%) | 11 (52.4%) | 0 (0.0%) | 21 (100.0%) |
| Diabetes | 8 (57.1%) | 5 (35.7%) | 1 (7.1%) | 14 (100.0%) |
| Digestive | 5 (41.7%) | 7 (58.3%) | 0 (0.0%) | 12 (100.0%) |
| Hearing | 5 (35.7%) | 9 (64.3%) | 0 (0.0%) | 14 (100.0%) |
| Heart | 9 (45.0%) | 11 (55.0%) | 0 (0.0%) | 20 (100.0%) |
| Hypertension | 18 (50.0%) | 17 (47.2%) | 1 (2.8%) | 36 (100.0%) |
| Hypotension | 7 (63.6%) | 4 (36.4%) | 0 (0.0%) | 11 (100.0%) |
| Musculoskeletal | 8 (42.1%) | 10 (52.6%) | 1 (5.3%) | 19 (100.0%) |
| Neurological | 7 (70.0%) | 3 (30.0%) | 0 (0.0%) | 10 (100.0%) |
| Other | 1 (50.0%) | 1 (50.0%) | 0 (0.0%) | 2 (100.0%) |
| Pulmonary | 9 (45.0%) | 11 (55.0%) | 0 (0.0%) | 20 (100.0%) |
| Vision | 6 (66.7%) | 3 (33.3%) | 0 (0.0%) | 9 (100.0%) |
| <NA> | 380 (48.8%) | 379 (48.7%) | 19 (2.4%) | 778 (100.0%) |
| Total | 489 (48.9%) | 489 (48.9%) | 22 (2.2%) | 1000 (100.0%) |
I found this the fastest and most efficient way to explore the statistical metrics of any given dataset.