Blog 3

Descriptive statistics is a branch of statistics that aims to summarize, describe and present a series of values or a dataset. It is often the first step, and an important part, of any statistical analysis. It lets you check the quality of the data and helps you “understand” the data by giving a clear overview of it. Well-presented descriptive statistics are already a good starting point for further analyses. There are many measures for summarizing a dataset. They fall into two types:

- location measures
- dispersion measures
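To make the two families concrete, here is a minimal base R sketch. The vector `x` is a hypothetical toy sample made up for illustration; it is not taken from any dataset used in this post.

```r
# Hypothetical toy sample; the values are made up for illustration.
x <- c(18, 25, 34, 50, 50, 66, 80)

# Location measures: where the values sit.
location <- c(mean = mean(x), median = median(x))

# Dispersion measures: how spread out the values are.
dispersion <- c(sd = sd(x), iqr = IQR(x), range = max(x) - min(x))

print(round(location, 2))
print(round(dispersion, 2))
```

The mean and median answer "where is the centre?", while the standard deviation, interquartile range and range answer "how far do values stray from it?".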

Without descriptive statistics or appropriate plots, it’s hard to build the right model. There are several base functions and packages for this kind of analysis and visualization. In this blog, I will look at the summarytools package, which provides a few functions to neatly summarize the data. This is especially useful in an R Markdown report, as it can render the summary outputs directly to HTML and display them in the application or report.

The most important function of this package is dfSummary(). It handles both categorical and numeric variables and prints a tidy summary in the console with the most commonly used summary statistics, plus information on sample sizes and missingness. There’s even a “text graph” intended to show distributions. These graphs are not as beautiful as the sparklines that the skimr package tries to show, but they have the advantage of working right away on Windows machines.

At its core, this package has 4 main functions:

- dfSummary()
- freq()
- ctable()
- descr()

Let’s see each of them in action:

`dfSummary`

dfSummary() creates a summary table with statistics, frequencies and graphs for all variables in a data frame. The information displayed is type-specific (character, factor, numeric, date) and also varies according to the number of distinct values.

To see the results in RStudio’s Viewer (or in the default Web browser if working in another IDE or from a terminal window), we use the view() function. Inside an R Markdown document, we instead call print() with method = "render", as below.

library(summarytools)

print(dfSummary(tobacco), method = "render")

Data Frame Summary: tobacco

Dimensions: 1000 x 9
Duplicates: 2

| Variable | Stats / Values | Freqs (% of Valid) | Valid | Missing |
|---|---|---|---|---|
| gender [factor] | 1. F; 2. M | 489 (50.0%); 489 (50.0%) | 978 (97.8%) | 22 (2.2%) |
| age [numeric] | Mean (sd) : 49.6 (18.3); min < med < max: 18 < 50 < 80; IQR (CV) : 32 (0.4) | 63 distinct values | 975 (97.5%) | 25 (2.5%) |
| age.gr [factor] | 1. 18-34; 2. 35-50; 3. 51-70; 4. 71 + | 258 (26.5%); 241 (24.7%); 317 (32.5%); 159 (16.3%) | 975 (97.5%) | 25 (2.5%) |
| BMI [numeric] | Mean (sd) : 25.7 (4.5); min < med < max: 8.8 < 25.6 < 39.4; IQR (CV) : 5.7 (0.2) | 974 distinct values | 974 (97.4%) | 26 (2.6%) |
| smoker [factor] | 1. Yes; 2. No | 298 (29.8%); 702 (70.2%) | 1000 (100%) | 0 (0%) |
| cigs.per.day [numeric] | Mean (sd) : 6.8 (11.9); min < med < max: 0 < 0 < 40; IQR (CV) : 11 (1.8) | 37 distinct values | 965 (96.5%) | 35 (3.5%) |
| diseased [factor] | 1. Yes; 2. No | 224 (22.4%); 776 (77.6%) | 1000 (100%) | 0 (0%) |
| disease [character] | 1. Hypertension; 2. Cancer; 3. Cholesterol; 4. Heart; 5. Pulmonary; 6. Musculoskeletal; 7. Diabetes; 8. Hearing; 9. Digestive; 10. Hypotension; [ 3 others ] | 36 (16.2%); 34 (15.3%); 21 (9.5%); 20 (9.0%); 20 (9.0%); 19 (8.6%); 14 (6.3%); 14 (6.3%); 12 (5.4%); 11 (5.0%); 21 (9.5%) | 222 (22.2%) | 778 (77.8%) |
| samp.wgts [numeric] | Mean (sd) : 1 (0.1); min < med < max: 0.9 < 1 < 1.1; IQR (CV) : 0.2 (0.1) | 0.86! : 267 (26.7%); 1.04! : 249 (24.9%); 1.05! : 324 (32.4%); 1.06! : 160 (16.0%); ! rounded | 1000 (100%) | 0 (0%) |

Generated by summarytools 0.9.6 (R version 3.6.3)
2020-05-19
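The Dimensions, Duplicates, Valid and Missing figures in a summary like this can be cross-checked by hand. The following is a rough base R sketch of that bookkeeping, not the package's own implementation. It assumes a duplicate is a row identical to an earlier row, and `toy` is a hypothetical data frame made up for illustration.

```r
# Hypothetical toy data frame; values are made up for illustration.
toy <- data.frame(
  gender = c("F", "M", NA, "F", "M"),
  age    = c(34, 50, 50, NA, 50),
  stringsAsFactors = FALSE
)

dims      <- dim(toy)               # number of rows, number of columns
n_dups    <- sum(duplicated(toy))   # rows identical to an earlier row (assumed definition)
valid     <- colSums(!is.na(toy))   # non-missing count per column
missing   <- colSums(is.na(toy))    # missing count per column
pct_valid <- round(100 * valid / nrow(toy), 1)

print(dims)
print(n_dups)
print(data.frame(valid, missing, pct_valid))
```

Running the same few lines on a real data frame is a quick sanity check before trusting any rendered report.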

`freq` is used to obtain more detailed statistics on categorical variables.

The freq() function generates frequency tables with counts, proportions, as well as missing data information.

freq(tobacco$disease, order = "freq", rows = 1:5)

## Frequencies  
## tobacco$disease  
## Type: Character  
## 
##                      Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
## ------------------ ------ --------- -------------- --------- --------------
##       Hypertension     36     16.22          16.22      3.60           3.60
##             Cancer     34     15.32          31.53      3.40           7.00
##        Cholesterol     21      9.46          40.99      2.10           9.10
##              Heart     20      9.01          50.00      2.00          11.10
##          Pulmonary     20      9.01          59.01      2.00          13.10
##            (Other)     91     40.99         100.00      9.10          22.20
##               <NA>    778                              77.80         100.00
##              Total   1000    100.00         100.00    100.00         100.00
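The difference between the "% Valid" and "% Total" columns is only the denominator: valid percentages divide by the non-missing count, total percentages divide by all rows. A small base R sketch, using the counts copied from the freq() output above (the variable names are my own):

```r
# Counts copied from the freq() output above.
counts <- c(Hypertension = 36, Cancer = 34, Cholesterol = 21,
            Heart = 20, Pulmonary = 20, Other = 91)
n_missing <- 778

n_valid <- sum(counts)             # non-missing observations
n_total <- n_valid + n_missing     # all rows, including <NA>

pct_valid <- round(100 * counts / n_valid, 2)          # "% Valid"
cum_valid <- round(100 * cumsum(counts) / n_valid, 2)  # "% Valid Cum."
pct_total <- round(100 * counts / n_total, 2)          # "% Total"

print(data.frame(Freq = counts, pct_valid, cum_valid, pct_total))
```

With 778 of 1000 values missing, the two percentage columns differ a lot, which is why freq() reports both.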

`descr` is used to obtain more detailed statistics on numerical variables.

descr() generates descriptive / univariate statistics, i.e. common central tendency statistics and measures of dispersion. It accepts single vectors as well as data frames; in the latter case, all non-numerical columns are ignored, with a message to that effect.

descr(tobacco$age)

## Descriptive Statistics  
## tobacco$age  
## N: 1000  
## 
##                        age
## ----------------- --------
##              Mean    49.60
##           Std.Dev    18.29
##               Min    18.00
##                Q1    34.00
##            Median    50.00
##                Q3    66.00
##               Max    80.00
##               MAD    23.72
##               IQR    32.00
##                CV     0.37
##          Skewness    -0.04
##       SE.Skewness     0.08
##          Kurtosis    -1.26
##           N.Valid   975.00
##         Pct.Valid    97.50
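Most rows of this table map onto base R functions. The sketch below is an approximation under stated assumptions, not the package's code: `age` is a hypothetical toy vector (not the tobacco data), and missing values are dropped before computing anything. CV is taken as the standard deviation divided by the mean, which agrees with the output above (18.29 / 49.60 ≈ 0.37).

```r
# Hypothetical toy vector with one missing value; not the tobacco data.
age <- c(18, 34, 50, 66, 80, NA)

valid <- age[!is.na(age)]   # drop missing values first

stats <- c(
  Mean      = mean(valid),
  Std.Dev   = sd(valid),
  Min       = min(valid),
  Median    = median(valid),
  Max       = max(valid),
  MAD       = mad(valid),               # base R scales by 1.4826 by default
  IQR       = IQR(valid),
  CV        = sd(valid) / mean(valid),  # coefficient of variation
  N.Valid   = length(valid),
  Pct.Valid = 100 * length(valid) / length(age)
)

print(round(stats, 2))
```

Skewness and kurtosis are left out because base R has no built-in functions for them.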

`ctable` is used to cross-tabulate frequencies for a pair of categorical variables.

ctable() generates cross-tabulations (joint frequencies) for pairs of categorical variables.

Since Markdown does not support multiline table headings (but does accept HTML code), we’ll use the HTML rendering feature for this section.

Using the tobacco data frame, we’ll cross-tabulate the two categorical variables smoker and diseased.

print(ctable(x = tobacco$smoker, y = tobacco$diseased, prop = "r"),
      method = "render")

Cross-Tabulation, Row Proportions

smoker * diseased

Data Frame: tobacco

| smoker | diseased: Yes | diseased: No | Total |
|---|---|---|---|
| Yes | 125 (41.9%) | 173 (58.1%) | 298 (100.0%) |
| No | 99 (14.1%) | 603 (85.9%) | 702 (100.0%) |
| Total | 224 (22.4%) | 776 (77.6%) | 1000 (100.0%) |
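Row proportions (prop = "r") mean that each count is divided by its row total, so every row sums to 100%. A minimal base R sketch that reproduces the percentages from the counts in the cross-tabulation above (the object names are my own):

```r
# Counts copied from the cross-tabulation above.
tab <- matrix(c(125, 173,
                 99, 603),
              nrow = 2, byrow = TRUE,
              dimnames = list(smoker   = c("Yes", "No"),
                              diseased = c("Yes", "No")))

# Row proportions: divide each cell by its row total.
row_pct <- round(100 * prop.table(tab, margin = 1), 1)

# The "Total" row: column totals as a share of all observations.
total_pct <- round(100 * colSums(tab) / sum(tab), 1)

print(row_pct)
print(total_pct)
```

Read this way, 41.9% of smokers are diseased against 14.1% of non-smokers, which is the comparison row proportions are designed to show.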


Abdelmalek Hajjam

5/19/2020