Data 621 - Blog Post #4

Libraries

library(tidyverse)
library(dplyr)
library(summarytools)
library(kableExtra)

Overview

summarytools provides tools to quickly and neatly summarize data: a coherent set of functions centered on data exploration and reporting. The package has four main functions:

dfSummary() - Extensive Data Frame Summaries featuring type-specific information for all variables in a data frame: univariate statistics and/or frequency distributions, bar charts or histograms, as well as missing data counts and proportions. Very useful to quickly detect anomalies and identify trends at a glance
descr() - Descriptive (Univariate) Statistics for numerical data, featuring common measures of central tendency and dispersion
freq() - Frequency Tables featuring counts, proportions, as well as missing data information
ctable() - Cross-Tabulations (joint frequencies) between pairs of discrete/categorical variables, featuring marginal sums as well as row, column or total proportions

dfSummary

dfSummary() summarizes an entire dataset. For each variable it reports the type, descriptive statistics, frequencies, and the number of missing values, and it automatically draws a plot of the variable's distribution. The output can be controlled through various arguments.

print(dfSummary(tobacco), method = "render")

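Data Frame Summary

tobacco

Dimensions: 1000 x 9
Duplicates: 2

+----+------------------------+------------------------------------+---------------------+-------------+-------------+
| No | Variable               | Stats / Values                     | Freqs (% of Valid)  | Valid       | Missing     |
+----+------------------------+------------------------------------+---------------------+-------------+-------------+
| 1  | gender [factor]        | 1. F                               | 489 (50.0%)         | 978 (97.8%) | 22 (2.2%)   |
|    |                        | 2. M                               | 489 (50.0%)         |             |             |
+----+------------------------+------------------------------------+---------------------+-------------+-------------+
| 2  | age [numeric]          | Mean (sd) : 49.6 (18.3)            | 63 distinct values  | 975 (97.5%) | 25 (2.5%)   |
|    |                        | min < med < max: 18 < 50 < 80      |                     |             |             |
|    |                        | IQR (CV) : 32 (0.4)                |                     |             |             |
+----+------------------------+------------------------------------+---------------------+-------------+-------------+
| 3  | age.gr [factor]        | 1. 18-34                           | 258 (26.5%)         | 975 (97.5%) | 25 (2.5%)   |
|    |                        | 2. 35-50                           | 241 (24.7%)         |             |             |
|    |                        | 3. 51-70                           | 317 (32.5%)         |             |             |
|    |                        | 4. 71 +                            | 159 (16.3%)         |             |             |
+----+------------------------+------------------------------------+---------------------+-------------+-------------+
| 4  | BMI [numeric]          | Mean (sd) : 25.7 (4.5)             | 974 distinct values | 974 (97.4%) | 26 (2.6%)   |
|    |                        | min < med < max: 8.8 < 25.6 < 39.4 |                     |             |             |
|    |                        | IQR (CV) : 5.7 (0.2)               |                     |             |             |
+----+------------------------+------------------------------------+---------------------+-------------+-------------+
| 5  | smoker [factor]        | 1. Yes                             | 298 (29.8%)         | 1000 (100%) | 0 (0%)      |
|    |                        | 2. No                              | 702 (70.2%)         |             |             |
+----+------------------------+------------------------------------+---------------------+-------------+-------------+
| 6  | cigs.per.day [numeric] | Mean (sd) : 6.8 (11.9)             | 37 distinct values  | 965 (96.5%) | 35 (3.5%)   |
|    |                        | min < med < max: 0 < 0 < 40        |                     |             |             |
|    |                        | IQR (CV) : 11 (1.8)                |                     |             |             |
+----+------------------------+------------------------------------+---------------------+-------------+-------------+
| 7  | diseased [factor]      | 1. Yes                             | 224 (22.4%)         | 1000 (100%) | 0 (0%)      |
|    |                        | 2. No                              | 776 (77.6%)         |             |             |
+----+------------------------+------------------------------------+---------------------+-------------+-------------+
| 8  | disease [character]    | 1. Hypertension                    | 36 (16.2%)          | 222 (22.2%) | 778 (77.8%) |
|    |                        | 2. Cancer                          | 34 (15.3%)          |             |             |
|    |                        | 3. Cholesterol                     | 21 (9.5%)           |             |             |
|    |                        | 4. Heart                           | 20 (9.0%)           |             |             |
|    |                        | 5. Pulmonary                       | 20 (9.0%)           |             |             |
|    |                        | 6. Musculoskeletal                 | 19 (8.6%)           |             |             |
|    |                        | 7. Diabetes                        | 14 (6.3%)           |             |             |
|    |                        | 8. Hearing                         | 14 (6.3%)           |             |             |
|    |                        | 9. Digestive                       | 12 (5.4%)           |             |             |
|    |                        | 10. Hypotension                    | 11 (5.0%)           |             |             |
|    |                        | [ 3 others ]                       | 21 (9.5%)           |             |             |
+----+------------------------+------------------------------------+---------------------+-------------+-------------+
| 9  | samp.wgts [numeric]    | Mean (sd) : 1 (0.1)                | 0.86! : 267 (26.7%) | 1000 (100%) | 0 (0%)      |
|    |                        | min < med < max: 0.9 < 1 < 1.1     | 1.04! : 249 (24.9%) |             |             |
|    |                        | IQR (CV) : 0.2 (0.1)               | 1.05! : 324 (32.4%) |             |             |
|    |                        |                                    | 1.06! : 160 (16.0%) |             |             |
|    |                        |                                    | ! rounded           |             |             |
+----+------------------------+------------------------------------+---------------------+-------------+-------------+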

Generated by summarytools 0.9.6 (R version 4.0.0)
2020-05-15
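
The remark that the output can be controlled through arguments can be illustrated with a short sketch. It assumes summarytools is installed and uses a few dfSummary() arguments (graph.col, valid.col, varnumbers, max.distinct.values) as documented for the package; defaults and exact behavior may differ between versions, so treat it as a starting point rather than a reference.

```r
library(summarytools)

# tobacco is the example data frame shipped with summarytools
compact <- dfSummary(
  tobacco,
  graph.col           = FALSE, # leave out the Graph column
  valid.col           = FALSE, # leave out the Valid column
  varnumbers          = FALSE, # leave out the No column
  max.distinct.values = 5      # show frequencies for at most 5 distinct values
)

# Plain-text output for the console instead of method = "render"
print(compact, method = "pander")
```

Dropping the graph column is mainly useful for console work, where the plots cannot be shown as images anyway.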

descr

descr() generates more detailed descriptive statistics for the numeric variables in a dataset.

descr(tobacco$BMI)

## Descriptive Statistics  
## tobacco$BMI  
## N: 1000  
## 
##                        BMI
## ----------------- --------
##              Mean    25.73
##           Std.Dev     4.49
##               Min     8.83
##                Q1    22.93
##            Median    25.62
##                Q3    28.65
##               Max    39.44
##               MAD     4.18
##               IQR     5.72
##                CV     0.17
##          Skewness     0.02
##       SE.Skewness     0.08
##          Kurtosis     0.26
##           N.Valid   974.00
##         Pct.Valid    97.40
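
To make a few of these rows concrete, here is a minimal base R sketch of how such statistics are usually defined. The vector x is hypothetical toy data (not the tobacco dataset), and the formulas are the common textbook definitions, which I am assuming match what descr() reports. The last three lines redo the arithmetic on the numbers printed above for tobacco$BMI.

```r
# Hypothetical toy vector with one missing value (not the tobacco data)
x <- c(21.5, 24.0, 26.5, 28.0, 30.0, NA)

n_valid <- sum(!is.na(x))

stats <- c(
  Mean      = mean(x, na.rm = TRUE),
  Std.Dev   = sd(x, na.rm = TRUE),
  Median    = median(x, na.rm = TRUE),
  IQR       = IQR(x, na.rm = TRUE),                      # Q3 - Q1
  CV        = sd(x, na.rm = TRUE) / mean(x, na.rm = TRUE), # coefficient of variation
  N.Valid   = n_valid,
  Pct.Valid = 100 * n_valid / length(x)
)
print(round(stats, 2))

# Same arithmetic on the values printed above for tobacco$BMI
cv_bmi  <- 4.49 / 25.73      # Std.Dev / Mean, rounds to 0.17
iqr_bmi <- 28.65 - 22.93     # Q3 - Q1, rounds to 5.72
pct_bmi <- 100 * 974 / 1000  # N.Valid / N, gives 97.4
```

Note that N is the full length of the variable (1000), while every statistic is computed on the 974 non-missing values only.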

freq

freq() generates a frequency table with counts, proportions, and missing-data information.

freq(tobacco$gender, plain.ascii = FALSE, style = "rmarkdown")

## ### Frequencies  
## #### tobacco$gender  
## **Type:** Factor  
## 
## |     &nbsp; | Freq | % Valid | % Valid Cum. | % Total | % Total Cum. |
## |-----------:|-----:|--------:|-------------:|--------:|-------------:|
## |      **F** |  489 |   50.00 |        50.00 |   48.90 |        48.90 |
## |      **M** |  489 |   50.00 |       100.00 |   48.90 |        97.80 |
## | **\<NA\>** |   22 |         |              |    2.20 |       100.00 |
## |  **Total** | 1000 |  100.00 |       100.00 |  100.00 |       100.00 |
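
The difference between the "% Valid" and "% Total" columns is only the denominator: valid percentages divide by the non-missing count, total percentages divide by all rows. The base R sketch below shows this on a hypothetical toy factor g (not the tobacco data); it mimics the two columns rather than reproducing how freq() is implemented.

```r
# Hypothetical toy factor: 3 F, 2 M, 1 missing (not the tobacco data)
g <- factor(c("F", "M", "F", NA, "M", "F"))

counts_valid <- table(g)                    # NA excluded
counts_total <- table(g, useNA = "always")  # NA kept as its own row

pct_valid <- 100 * prop.table(counts_valid) # denominator: non-missing values
pct_total <- 100 * prop.table(counts_total) # denominator: all values

print(round(pct_valid, 2))
print(round(pct_total, 2))

# Same arithmetic on the counts printed above for tobacco$gender
f_valid <- 100 * 489 / 978   # 50.00, the % Valid for F
f_total <- 100 * 489 / 1000  # 48.90, the % Total for F
```

With 22 missing values out of 1000, the two columns differ by about one percentage point per level here; with heavier missingness the gap grows quickly.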

ctable

ctable() cross-tabulates frequencies across pairs of categorical variables.

Example: disease ~ gender

print(ctable(tobacco$disease, tobacco$gender),  method = "render")

Cross-Tabulation, Row Proportions

disease * gender

Data Frame: tobacco

                 gender
disease                    F            M       <NA>          Total
---------------  -----------  -----------  ---------  -------------
Cancer            16 (47.1%)   18 (52.9%)   0 (0.0%)    34 (100.0%)
Cholesterol       10 (47.6%)   11 (52.4%)   0 (0.0%)    21 (100.0%)
Diabetes           8 (57.1%)    5 (35.7%)   1 (7.1%)    14 (100.0%)
Digestive          5 (41.7%)    7 (58.3%)   0 (0.0%)    12 (100.0%)
Hearing            5 (35.7%)    9 (64.3%)   0 (0.0%)    14 (100.0%)
Heart              9 (45.0%)   11 (55.0%)   0 (0.0%)    20 (100.0%)
Hypertension      18 (50.0%)   17 (47.2%)   1 (2.8%)    36 (100.0%)
Hypotension        7 (63.6%)    4 (36.4%)   0 (0.0%)    11 (100.0%)
Musculoskeletal    8 (42.1%)   10 (52.6%)   1 (5.3%)    19 (100.0%)
Neurological       7 (70.0%)    3 (30.0%)   0 (0.0%)    10 (100.0%)
Other              1 (50.0%)    1 (50.0%)   0 (0.0%)     2 (100.0%)
Pulmonary          9 (45.0%)   11 (55.0%)   0 (0.0%)    20 (100.0%)
Vision             6 (66.7%)    3 (33.3%)   0 (0.0%)     9 (100.0%)
<NA>             380 (48.8%)  379 (48.7%)  19 (2.4%)   778 (100.0%)
Total            489 (48.9%)  489 (48.9%)  22 (2.2%)  1000 (100.0%)

Generated by summarytools 0.9.6 (R version 4.0.0)
2020-05-15
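
The table above reports row proportions, so each row sums to 100% (for example, 16 of the 34 cancer cases are female, which gives 47.1%). A minimal base R sketch of the same idea follows, using a hypothetical toy data frame df rather than the tobacco data. ctable() exposes the choice between row, column, and total proportions through its prop argument; the base R version here only illustrates what those choices mean.

```r
# Hypothetical toy data (not the tobacco data)
df <- data.frame(
  smoker = c("Yes", "Yes", "No", "No", "No", "Yes"),
  gender = c("F", "M", "F", "F", "M", "M")
)

# Joint counts for the two categorical variables
counts <- table(df$smoker, df$gender, dnn = c("smoker", "gender"))

row_prop <- prop.table(counts, margin = 1)  # each row sums to 1
col_prop <- prop.table(counts, margin = 2)  # each column sums to 1

print(round(100 * row_prop, 1))
print(round(100 * col_prop, 1))

# Same arithmetic on the Cancer row printed above
cancer_f <- 100 * 16 / 34  # rounds to 47.1
```

Row proportions answer "within each disease, how are the genders split?", while column proportions would answer "within each gender, how are the diseases split?".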

I found this to be the fastest and most efficient way to explore the summary statistics of any given dataset.

Joseph Simone