Descriptive statistics is a branch of statistics that aims to summarize, describe and present a series of values or a dataset. It is often the first step, and an important part, of any statistical analysis. It lets you check the quality of the data and helps you “understand” the data by giving a clear overview of it. Well presented, descriptive statistics are already a good starting point for further analyses. There exist many measures to summarize a dataset. They are commonly divided into two types: measures of central tendency (location) and measures of dispersion.

Without looking at descriptive statistics or appropriate plots, it’s hard to build the right model. There are several base functions and packages for such analysis and visualization. In this post, I will look at the summarytools package, which provides a few functions to neatly summarize the data. This is especially useful in an R Markdown report, as it can render the summary outputs directly to HTML and display them in the application or report.

The most important function of this package is dfSummary(). It can deal with both categorical and numeric variables and provides a pretty output in the console with the most commonly used summary statistics, plus information on sample sizes and missingness. There’s even a “text graph” intended to show distributions. These graphs are not as beautiful as the sparklines that the skimr package tries to show, but they have the advantage of working right away on Windows machines.

At its core, this package has 4 main functions:

- dfSummary()
- freq()
- ctable()
- descr()

Let’s see each of them in action:

dfSummary

dfSummary() creates a summary table with statistics, frequencies and graphs for all variables in a data frame. The information displayed is type-specific (character, factor, numeric, date) and also varies according to the number of distinct values.

To see the results in RStudio’s Viewer (or in the default web browser if working in another IDE or from a terminal window), we use the view() function. In an R Markdown document, use print() with method = "render" instead.
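As a quick sketch of the interactive route (assuming summarytools is installed and you are working in an interactive session; the can_view variable is my own helper, not part of the package):

```r
# Sketch only: runs the viewer call when summarytools is available
# and the session is interactive (e.g. RStudio or an R console).
can_view <- requireNamespace("summarytools", quietly = TRUE) && interactive()

if (can_view) {
  library(summarytools)
  # Opens the HTML summary in RStudio's Viewer or the default web browser
  view(dfSummary(tobacco))
}
```

The guard is only there so the snippet does nothing in a non-interactive run; at the console you would normally just call view(dfSummary(tobacco)).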

library(summarytools)

print(dfSummary(tobacco), method = "render")

Data Frame Summary
tobacco
Dimensions: 1000 x 9
Duplicates: 2

No  Variable                 Stats / Values                    Freqs (% of Valid)    Valid          Missing
1   gender [factor]          1. F                              489 (50.0%)           978 (97.8%)    22 (2.2%)
                             2. M                              489 (50.0%)
2   age [numeric]            Mean (sd) : 49.6 (18.3)           63 distinct values    975 (97.5%)    25 (2.5%)
                             min < med < max: 18 < 50 < 80
                             IQR (CV) : 32 (0.4)
3   age.gr [factor]          1. 18-34                          258 (26.5%)           975 (97.5%)    25 (2.5%)
                             2. 35-50                          241 (24.7%)
                             3. 51-70                          317 (32.5%)
                             4. 71 +                           159 (16.3%)
4   BMI [numeric]            Mean (sd) : 25.7 (4.5)            974 distinct values   974 (97.4%)    26 (2.6%)
                             min < med < max: 8.8 < 25.6 < 39.4
                             IQR (CV) : 5.7 (0.2)
5   smoker [factor]          1. Yes                            298 (29.8%)           1000 (100%)    0 (0%)
                             2. No                             702 (70.2%)
6   cigs.per.day [numeric]   Mean (sd) : 6.8 (11.9)            37 distinct values    965 (96.5%)    35 (3.5%)
                             min < med < max: 0 < 0 < 40
                             IQR (CV) : 11 (1.8)
7   diseased [factor]        1. Yes                            224 (22.4%)           1000 (100%)    0 (0%)
                             2. No                             776 (77.6%)
8   disease [character]      1. Hypertension                   36 (16.2%)            222 (22.2%)    778 (77.8%)
                             2. Cancer                         34 (15.3%)
                             3. Cholesterol                    21 (9.5%)
                             4. Heart                          20 (9.0%)
                             5. Pulmonary                      20 (9.0%)
                             6. Musculoskeletal                19 (8.6%)
                             7. Diabetes                       14 (6.3%)
                             8. Hearing                        14 (6.3%)
                             9. Digestive                      12 (5.4%)
                             10. Hypotension                   11 (5.0%)
                             [ 3 others ]                      21 (9.5%)
9   samp.wgts [numeric]      Mean (sd) : 1 (0.1)               0.86!: 267 (26.7%)    1000 (100%)    0 (0%)
                             min < med < max: 0.9 < 1 < 1.1    1.04!: 249 (24.9%)
                             IQR (CV) : 0.2 (0.1)              1.05!: 324 (32.4%)
                                                               1.06!: 160 (16.0%)
                                                               ! rounded

Generated by summarytools 0.9.6 (R version 3.6.3)
2020-05-19

freq

freq() is used to obtain more detailed statistics on categorical variables. It generates frequency tables with counts, proportions, and missing-data information.

freq(tobacco$disease, order = "freq", rows = 1:5)
## Frequencies  
## tobacco$disease  
## Type: Character  
## 
##                      Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
## ------------------ ------ --------- -------------- --------- --------------
##       Hypertension     36     16.22          16.22      3.60           3.60
##             Cancer     34     15.32          31.53      3.40           7.00
##        Cholesterol     21      9.46          40.99      2.10           9.10
##              Heart     20      9.01          50.00      2.00          11.10
##          Pulmonary     20      9.01          59.01      2.00          13.10
##            (Other)     91     40.99         100.00      9.10          22.20
##               <NA>    778                              77.80         100.00
##              Total   1000    100.00         100.00    100.00         100.00
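The difference between the % Valid and % Total columns is just the denominator: for Hypertension above, 36 / 222 non-missing values gives 16.22%, while 36 / 1000 rows gives 3.60%. Here is a minimal base R sketch of that idea on a hypothetical toy vector (made-up values, not the tobacco data, and not necessarily how freq() computes things internally):

```r
# Hypothetical toy vector with two missing values (not the tobacco data)
disease <- c("Cancer", "Heart", "Cancer", NA, "Heart", "Cancer", NA, "Diabetes")

n_total <- length(disease)          # all observations, missing included
n_valid <- sum(!is.na(disease))     # non-missing observations only

valid_counts <- table(disease)                   # NA is dropped by default
all_counts   <- table(disease, useNA = "ifany")  # NA kept as its own category

# Same counts, different denominators
pct_valid <- round(100 * valid_counts / n_valid, 2)
pct_total <- round(100 * all_counts / n_total, 2)

print(pct_valid)
print(pct_total)
```

With many missing values, as in the disease column, the two percentages can differ a lot, so it is worth checking which one you are quoting.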

descr

descr() is used to obtain more detailed statistics on numerical variables. It generates descriptive (univariate) statistics, i.e. common measures of central tendency and dispersion. It accepts single vectors as well as data frames; in the latter case, all non-numerical columns are ignored, with a message to that effect.

descr(tobacco$age)
## Descriptive Statistics  
## tobacco$age  
## N: 1000  
## 
##                        age
## ----------------- --------
##              Mean    49.60
##           Std.Dev    18.29
##               Min    18.00
##                Q1    34.00
##            Median    50.00
##                Q3    66.00
##               Max    80.00
##               MAD    23.72
##               IQR    32.00
##                CV     0.37
##          Skewness    -0.04
##       SE.Skewness     0.08
##          Kurtosis    -1.26
##           N.Valid   975.00
##         Pct.Valid    97.50
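To make a few of these rows less mysterious, here is a rough base R sketch of the same kind of statistics on a hypothetical toy vector (made-up ages, not the tobacco data). I am assuming the usual definitions here: CV as standard deviation divided by mean, and MAD as R's mad() with its default scaling constant. The output above looks consistent with that (18.29 / 49.60 is about 0.37), but treat it as an illustration rather than a description of the package internals.

```r
# Hypothetical ages with one missing value (not the tobacco data)
age <- c(23, 35, 41, 50, 62, 78, NA)

n_valid   <- sum(!is.na(age))            # number of non-missing values
pct_valid <- 100 * n_valid / length(age) # share of non-missing values

stats <- c(
  Mean    = mean(age, na.rm = TRUE),
  Std.Dev = sd(age, na.rm = TRUE),
  Median  = median(age, na.rm = TRUE),
  MAD     = mad(age, na.rm = TRUE),   # median absolute deviation, scaled by 1.4826
  IQR     = IQR(age, na.rm = TRUE),   # Q3 - Q1
  CV      = sd(age, na.rm = TRUE) / mean(age, na.rm = TRUE)
)

print(round(stats, 2))
```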

ctable

ctable() is used to cross-tabulate frequencies (joint frequencies) for a pair of categorical variables.

Since Markdown does not support multiline table headings (but does accept HTML code), we’ll use the HTML rendering feature for this section.

Using the tobacco data frame, we’ll cross-tabulate the two categorical variables smoker and diseased.

print(ctable(x = tobacco$smoker, y = tobacco$diseased, prop = "r"),
      method = "render")

Cross-Tabulation, Row Proportions
smoker * diseased
Data Frame: tobacco

          diseased
smoker    Yes              No               Total
Yes       125 (41.9%)      173 (58.1%)       298 (100.0%)
No         99 (14.1%)      603 (85.9%)       702 (100.0%)
Total     224 (22.4%)      776 (77.6%)      1000 (100.0%)

Generated by summarytools 0.9.6 (R version 3.6.3)
2020-05-19
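If you want to sanity-check what prop = "r" means, the row proportions can be reproduced in base R from the counts shown above. This is only a sketch of the arithmetic (each cell divided by its row total), with the counts typed in by hand rather than computed from the data frame:

```r
# Counts copied by hand from the cross-tabulation above
tab <- matrix(c(125, 173,
                 99, 603),
              nrow = 2, byrow = TRUE,
              dimnames = list(smoker   = c("Yes", "No"),
                              diseased = c("Yes", "No")))

# margin = 1 divides each cell by its row total (row proportions)
row_pct <- round(100 * prop.table(tab, margin = 1), 1)

print(row_pct)
```

Each row sums to 100%, which is why the Total column in the ctable() output reads 100.0% for both smokers and non-smokers.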