Descriptive statistics is a branch of statistics that aims to summarize, describe and present a series of values or a dataset. It is often the first step, and an important part, of any statistical analysis. It lets you check the quality of the data and helps you “understand” the data by giving a clear overview of it. Well-presented descriptive statistics are already a good starting point for further analyses. There exist many measures to summarize a dataset. They are divided into two types: location measures and dispersion measures.
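To make the idea of summary measures concrete, here is a minimal base R sketch that computes a few measures of location (where the values are centered) and of dispersion (how spread out they are). The vector `ages` is made up for this example and does not come from any dataset used in this post.

```r
# Hypothetical sample of ages (illustration only)
ages <- c(18, 22, 25, 31, 40, 47, 52, 66, 80)

# Location measures: where the values are centered
location <- c(mean = mean(ages), median = median(ages))

# Dispersion measures: how spread out the values are
dispersion <- c(sd = sd(ages), IQR = IQR(ages), range = diff(range(ages)))

print(round(location, 2))
print(round(dispersion, 2))
```

Packages such as summarytools compute these same kinds of measures, together with counts of valid and missing values, for every column of a data frame at once.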
Without descriptive statistics or appropriate plots, it's hard to build the right model. There are several base functions and packages for this kind of analysis and visualization. In this post, I will look at the summarytools package, which provides a few functions to neatly summarize data. It is especially useful in an R Markdown report, since it can render the summary output directly to HTML for display in the report or application.
The most important function of this package is dfSummary(). It handles both categorical and numeric variables and prints a tidy summary in the console with the most commonly used summary statistics, plus information on sample sizes and missingness. There’s even a “text graph” intended to show distributions. These graphs are not as pretty as the sparklines that the skimr package tries to show, but they have the advantage of working right away on Windows machines.
At its core, this package has 4 main functions:
- dfSummary()
- freq()
- ctable()
- descr()
Let’s see each of them in action:
dfSummary
dfSummary() creates a summary table with statistics, frequencies and graphs for all variables in a data frame. The information displayed is type-specific (character, factor, numeric, date) and also varies according to the number of distinct values.
To see the results in RStudio’s Viewer (or in the default web browser if working in another IDE or from a terminal window), use the view() function. In an R Markdown document, use print() with method = "render" instead.
library(summarytools)
print(dfSummary(tobacco), method = "render")
No | Variable | Stats / Values | Freqs (% of Valid) | Graph | Valid | Missing
---|---|---|---|---|---|---
1 | gender [factor] | 1. F<br>2. M | | | 978 (97.8%) | 22 (2.2%)
2 | age [numeric] | Mean (sd) : 49.6 (18.3)<br>min < med < max: 18 < 50 < 80<br>IQR (CV) : 32 (0.4) | 63 distinct values | | 975 (97.5%) | 25 (2.5%)
3 | age.gr [factor] | 1. 18-34<br>2. 35-50<br>3. 51-70<br>4. 71 + | | | 975 (97.5%) | 25 (2.5%)
4 | BMI [numeric] | Mean (sd) : 25.7 (4.5)<br>min < med < max: 8.8 < 25.6 < 39.4<br>IQR (CV) : 5.7 (0.2) | 974 distinct values | | 974 (97.4%) | 26 (2.6%)
5 | smoker [factor] | 1. Yes<br>2. No | | | 1000 (100%) | 0 (0%)
6 | cigs.per.day [numeric] | Mean (sd) : 6.8 (11.9)<br>min < med < max: 0 < 0 < 40<br>IQR (CV) : 11 (1.8) | 37 distinct values | | 965 (96.5%) | 35 (3.5%)
7 | diseased [factor] | 1. Yes<br>2. No | | | 1000 (100%) | 0 (0%)
8 | disease [character] | 1. Hypertension<br>2. Cancer<br>3. Cholesterol<br>4. Heart<br>5. Pulmonary<br>6. Musculoskeletal<br>7. Diabetes<br>8. Hearing<br>9. Digestive<br>10. Hypotension<br>[ 3 others ] | | | 222 (22.2%) | 778 (77.8%)
9 | samp.wgts [numeric] | Mean (sd) : 1 (0.1)<br>min < med < max: 0.9 < 1 < 1.1<br>IQR (CV) : 0.2 (0.1) | | | 1000 (100%) | 0 (0%)
Generated by summarytools 0.9.6 (R version 3.6.3)
2020-05-19
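The two display routes mentioned above can be sketched as follows. This assumes summarytools is installed and uses the tobacco data frame that ships with it. Dropping the graph and validity columns via `graph.col` and `valid.col` is only an illustrative choice, and `results='asis'` is the chunk option typically suggested for rendered output in R Markdown.

```r
library(summarytools)

# Build the summary once; 'tobacco' is a sample data frame bundled with summarytools
s <- dfSummary(tobacco, graph.col = FALSE, valid.col = FALSE)

# Interactive session: open the HTML version in the Viewer pane or the browser
if (interactive()) view(s)

# R Markdown chunk (typically with results='asis'): emit the HTML directly
print(s, method = "render")
```

Storing the summary in an object first means the same result can be viewed interactively and rendered in a report without recomputing it.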
freq
freq() is used to obtain more detailed statistics on categorical variables. It generates frequency tables with counts, proportions, and missing-data information.
freq(tobacco$disease, order = "freq", rows = 1:5)
## Frequencies
## tobacco$disease
## Type: Character
##
##                      Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
## ------------------ ------ --------- -------------- --------- --------------
##       Hypertension     36     16.22          16.22      3.60           3.60
##             Cancer     34     15.32          31.53      3.40           7.00
##        Cholesterol     21      9.46          40.99      2.10           9.10
##              Heart     20      9.01          50.00      2.00          11.10
##          Pulmonary     20      9.01          59.01      2.00          13.10
##            (Other)     91     40.99         100.00      9.10          22.20
##               <NA>    778                              77.80         100.00
##              Total   1000    100.00         100.00    100.00         100.00
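The difference between the "% Valid" and "% Total" columns comes down to the denominator: non-missing values only, or all rows. In the output above, Hypertension has 36 cases, which is 36 / 222 = 16.22% of valid values but 36 / 1000 = 3.60% of all rows. The base R sketch below reproduces that logic on a small made-up vector; the names `disease`, `freq_tab` and the column names are hypothetical and are not part of summarytools.

```r
# Hypothetical character vector with missing values (illustration only)
disease <- c("Cancer", "Heart", "Cancer", NA, "Heart",
             "Cancer", NA, NA, "Diabetes", NA)

# table() drops NA by default; sort so the most frequent value comes first
counts <- sort(table(disease), decreasing = TRUE)
n <- as.integer(counts)

n_valid <- sum(!is.na(disease))  # denominator for "% Valid"
n_total <- length(disease)       # denominator for "% Total"

freq_tab <- data.frame(
  value         = names(counts),
  freq          = n,
  pct_valid     = round(100 * n / n_valid, 2),
  pct_valid_cum = round(100 * cumsum(n) / n_valid, 2),
  pct_total     = round(100 * n / n_total, 2)
)
print(freq_tab)
```

The cumulative column is computed from the raw counts rather than from the rounded percentages, so rounding errors do not accumulate.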
descr
descr() is used to obtain more detailed statistics on numerical variables. It generates descriptive (univariate) statistics, i.e. common measures of central tendency and of dispersion. It accepts single vectors as well as data frames; in the latter case, all non-numerical columns are ignored, with a message to that effect.
descr(tobacco$age)
## Descriptive Statistics
## tobacco$age
## N: 1000
##
##                        age
## ----------------- --------
##              Mean    49.60
##           Std.Dev    18.29
##               Min    18.00
##                Q1    34.00
##            Median    50.00
##                Q3    66.00
##               Max    80.00
##               MAD    23.72
##               IQR    32.00
##                CV     0.37
##          Skewness    -0.04
##       SE.Skewness     0.08
##          Kurtosis    -1.26
##           N.Valid   975.00
##         Pct.Valid    97.50
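Several of these statistics are easy to recompute in base R, which helps when reading the output. The sketch below uses a made-up vector `x`. It assumes that CV is the standard deviation divided by the mean and that MAD uses R's default scaling constant of 1.4826, which is consistent with the figures above (18.29 / 49.60 ≈ 0.37), but it is not taken from the package's source.

```r
# Hypothetical numeric vector with one missing value (illustration only)
x <- c(18, 25, 34, 50, 50, 66, 72, 80, NA)

valid <- x[!is.na(x)]  # statistics are computed on non-missing values

stats <- c(
  Mean      = mean(valid),
  Std.Dev   = sd(valid),
  Median    = median(valid),
  MAD       = mad(valid),                 # median absolute deviation, scaled by 1.4826
  IQR       = IQR(valid),                 # Q3 - Q1
  CV        = sd(valid) / mean(valid),    # coefficient of variation
  N.Valid   = length(valid),
  Pct.Valid = 100 * length(valid) / length(x)
)
print(round(stats, 2))
```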
ctable
ctable() generates cross-tabulations (joint frequencies) for a pair of categorical variables.
Since markdown does not support multiline table headings (but does accept HTML code), we’ll use the HTML rendering feature for this section.
Using the tobacco data frame, we’ll cross-tabulate the two categorical variables smoker and diseased.
print(ctable(x = tobacco$smoker, y = tobacco$diseased, prop = "r"),
method = "render")
smoker | diseased: Yes | diseased: No | Total
---|---|---|---
Yes | 125 (41.9%) | 173 (58.1%) | 298 (100.0%)
No | 99 (14.1%) | 603 (85.9%) | 702 (100.0%)
Total | 224 (22.4%) | 776 (77.6%) | 1000 (100.0%)
Generated by summarytools 0.9.6 (R version 3.6.3)
2020-05-19
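The row percentages requested with prop = "r" can be checked by hand in base R. The sketch below copies the four counts from the cross-tabulation above into a matrix (the object names `tab`, `row_pct` and `col_pct` are just for illustration) and computes row proportions, with column proportions added for comparison.

```r
# Counts copied from the smoker x diseased cross-tabulation above
tab <- matrix(
  c(125, 173,
     99, 603),
  nrow = 2, byrow = TRUE,
  dimnames = list(smoker = c("Yes", "No"), diseased = c("Yes", "No"))
)

# Row proportions: each row sums to 100%
row_pct <- round(100 * prop.table(tab, margin = 1), 1)
print(row_pct)

# Column proportions, for comparison: each column sums to 100%
col_pct <- round(100 * prop.table(tab, margin = 2), 1)
print(col_pct)
```

Read row-wise, 41.9% of smokers are diseased against 14.1% of non-smokers, which is the comparison the row-proportion layout is meant to surface.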