Summary statistics are the usual starting point for a statistical analysis. In R, the descriptr package provides a comprehensive set of tools for computing them. This vignette uses the iris data set and the flights data set from the nycflights13 package to illustrate the functions in descriptr. Before generating summary statistics, the first step is to screen the data to get an intuitive idea of its structure.
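# Load descriptr, the example data, and dplyr for the %>% pipe
library(descriptr)
library(nycflights13)
library(dplyr)
# Screen iris: column types, factor levels, and missing values
iris %>%
  ds_screener()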
## ---------------------------------------------------------------------------------------
## | Column Name | Data Type | Levels | Missing | Missing (%) |
## ---------------------------------------------------------------------------------------
## | Sepal.Length | numeric | NA | 0 | 0 |
## | Sepal.Width | numeric | NA | 0 | 0 |
## | Petal.Length | numeric | NA | 0 | 0 |
## | Petal.Width | numeric | NA | 0 | 0 |
## | Species | factor |setosa versicolor virginica| 0 | 0 |
## ---------------------------------------------------------------------------------------
##
## Overall Missing Values 0
## Percentage of Missing Values 0 %
## Rows with Missing Values 0
## Columns With Missing Values 0
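The totals in the screening output can be cross-checked with base R. The following is a minimal sketch under that assumption, not descriptr's own implementation; the variable names are illustrative.

```r
# Cross-check of the screening summary using base R only.
# Illustrative sketch; not how descriptr computes its output.
data(iris)

# Number and percentage of missing values in each column
missing_per_column <- colSums(is.na(iris))
missing_pct <- 100 * missing_per_column / nrow(iris)

# Overall count, plus how many rows and columns contain at least one NA
overall_missing <- sum(is.na(iris))
rows_with_missing <- sum(!complete.cases(iris))
cols_with_missing <- sum(missing_per_column > 0)

print(data.frame(missing = missing_per_column, percent = missing_pct))
```

For iris all of these counts are zero, which agrees with the screener output.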
Missing values need to be dealt with before analysis, because they can cause computational errors. In descriptr, the ds_screener function reports the missing values in each column of the data set, along with each column's data type and factor levels. After screening the data set, we can delve into the summary statistics for each variable. Taking the Petal.Width column from the iris data set as an example:
iris %>%
  ds_summary_stats(Petal.Width)
## ────────────────────────── Variable: Petal.Width ──────────────────────────
##
## Univariate Analysis
##
## N 150.00 Variance 0.58
## Missing 0.00 Std Deviation 0.76
## Mean 1.20 Range 2.40
## Median 1.30 Interquartile Range 1.50
## Mode 0.20 Uncorrected SS 302.33
## Trimmed Mean 1.19 Corrected SS 86.57
## Skewness -0.10 Coeff Variation 63.56
## Kurtosis -1.34 Std Error Mean 0.06
##
## Quantiles
##
## Quantile Value
##
## Max 2.50
## 99% 2.50
## 95% 2.30
## 90% 2.20
## Q3 1.80
## Median 1.30
## Q1 0.30
## 10% 0.20
## 5% 0.20
## 1% 0.10
## Min 0.10
##
## Extreme Values
##
## Low High
##
## Obs Value Obs Value
## 10 0.1 101 2.5
## 13 0.1 110 2.5
## 14 0.1 145 2.5
## 33 0.1 115 2.4
## 38 0.1 137 2.4
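Several entries of the univariate table follow from standard formulas, which can be sketched in base R. This is an illustrative reconstruction rather than descriptr's code; skewness and kurtosis are left out because their definitions vary between packages.

```r
# Reproducing a few entries of the univariate summary with base R.
# Illustrative sketch; descriptr's internal formulas may differ in detail.
x <- iris$Petal.Width
n <- length(x)

uncorrected_ss <- sum(x^2)               # sum of squared values
corrected_ss   <- sum((x - mean(x))^2)   # sum of squared deviations from the mean
coeff_var      <- 100 * sd(x) / mean(x)  # standard deviation as a percentage of the mean
std_error_mean <- sd(x) / sqrt(n)        # standard error of the mean

print(round(c(mean = mean(x), median = median(x), variance = var(x),
              std_dev = sd(x), range = diff(range(x)), iqr = IQR(x),
              uncorrected_ss = uncorrected_ss, corrected_ss = corrected_ss,
              coeff_var = coeff_var, std_error_mean = std_error_mean), 2))
```

Rounded to two decimals, these values should line up with the corresponding entries printed for Petal.Width.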
To see the summary statistics for every numeric variable in the data set, call ds_summary_stats without naming any variables:
iris %>%
  ds_summary_stats()
## ────────────────────────── Variable: Sepal.Length ─────────────────────────
##
## Univariate Analysis
##
## N 150.00 Variance 0.69
## Missing 0.00 Std Deviation 0.83
## Mean 5.84 Range 3.60
## Median 5.80 Interquartile Range 1.30
## Mode 5.00 Uncorrected SS 5223.85
## Trimmed Mean 5.82 Corrected SS 102.17
## Skewness 0.31 Coeff Variation 14.17
## Kurtosis -0.55 Std Error Mean 0.07
##
## Quantiles
##
## Quantile Value
##
## Max 7.90
## 99% 7.70
## 95% 7.25
## 90% 6.90
## Q3 6.40
## Median 5.80
## Q1 5.10
## 10% 4.80
## 5% 4.60
## 1% 4.40
## Min 4.30
##
## Extreme Values
##
## Low High
##
## Obs Value Obs Value
## 14 4.3 132 7.9
## 9 4.4 118 7.7
## 39 4.4 119 7.7
## 43 4.4 123 7.7
## 42 4.5 136 7.7
##
##
##
## ────────────────────────── Variable: Sepal.Width ──────────────────────────
##
## Univariate Analysis
##
## N 150.00 Variance 0.19
## Missing 0.00 Std Deviation 0.44
## Mean 3.06 Range 2.40
## Median 3.00 Interquartile Range 0.50
## Mode 3.00 Uncorrected SS 1430.40
## Trimmed Mean 3.05 Corrected SS 28.31
## Skewness 0.32 Coeff Variation 14.26
## Kurtosis 0.23 Std Error Mean 0.04
##
## Quantiles
##
## Quantile Value
##
## Max 4.40
## 99% 4.15
## 95% 3.80
## 90% 3.61
## Q3 3.30
## Median 3.00
## Q1 2.80
## 10% 2.50
## 5% 2.34
## 1% 2.20
## Min 2.00
##
## Extreme Values
##
## Low High
##
## Obs Value Obs Value
## 61 2 16 4.4
## 63 2.2 34 4.2
## 69 2.2 33 4.1
## 120 2.2 15 4
## 42 2.3 6 3.9
##
##
##
## ────────────────────────── Variable: Petal.Length ─────────────────────────
##
## Univariate Analysis
##
## N 150.00 Variance 3.12
## Missing 0.00 Std Deviation 1.77
## Mean 3.76 Range 5.90
## Median 4.35 Interquartile Range 3.50
## Mode 1.40 Uncorrected SS 2582.71
## Trimmed Mean 3.75 Corrected SS 464.33
## Skewness -0.27 Coeff Variation 46.97
## Kurtosis -1.40 Std Error Mean 0.14
##
## Quantiles
##
## Quantile Value
##
## Max 6.90
## 99% 6.70
## 95% 6.10
## 90% 5.80
## Q3 5.10
## Median 4.35
## Q1 1.60
## 10% 1.40
## 5% 1.30
## 1% 1.15
## Min 1.00
##
## Extreme Values
##
## Low High
##
## Obs Value Obs Value
## 23 1 119 6.9
## 14 1.1 118 6.7
## 15 1.2 123 6.7
## 36 1.2 106 6.6
## 3 1.3 132 6.4
##
##
##
## ────────────────────────── Variable: Petal.Width ──────────────────────────
##
## Univariate Analysis
##
## N 150.00 Variance 0.58
## Missing 0.00 Std Deviation 0.76
## Mean 1.20 Range 2.40
## Median 1.30 Interquartile Range 1.50
## Mode 0.20 Uncorrected SS 302.33
## Trimmed Mean 1.19 Corrected SS 86.57
## Skewness -0.10 Coeff Variation 63.56
## Kurtosis -1.34 Std Error Mean 0.06
##
## Quantiles
##
## Quantile Value
##
## Max 2.50
## 99% 2.50
## 95% 2.30
## 90% 2.20
## Q3 1.80
## Median 1.30
## Q1 0.30
## 10% 0.20
## 5% 0.20
## 1% 0.10
## Min 0.10
##
## Extreme Values
##
## Low High
##
## Obs Value Obs Value
## 10 0.1 101 2.5
## 13 0.1 110 2.5
## 14 0.1 145 2.5
## 33 0.1 115 2.4
## 38 0.1 137 2.4
For a continuous variable, a frequency table can be created to reflect the underlying distribution of the data. The ds_freq_table function in descriptr splits the variable into bins and reports the frequency, cumulative frequency, percentage and cumulative percentage for each bin. The second argument sets the number of bins, here 10:
iris %>%
  ds_freq_table(Petal.Width, 10)
## Variable: Petal.Width
## |-----------------------------------------------------------------------|
## | Bins | Frequency | Cum Frequency | Percent | Cum Percent |
## |-----------------------------------------------------------------------|
## | 0.1 - 0.3 | 41 | 41 | 27.33 | 27.33 |
## |-----------------------------------------------------------------------|
## | 0.3 - 0.6 | 8 | 49 | 5.33 | 32.67 |
## |-----------------------------------------------------------------------|
## | 0.6 - 0.8 | 1 | 50 | 0.67 | 33.33 |
## |-----------------------------------------------------------------------|
## | 0.8 - 1.1 | 7 | 57 | 4.67 | 38 |
## |-----------------------------------------------------------------------|
## | 1.1 - 1.3 | 21 | 78 | 14 | 52 |
## |-----------------------------------------------------------------------|
## | 1.3 - 1.5 | 33 | 111 | 22 | 74 |
## |-----------------------------------------------------------------------|
## | 1.5 - 1.8 | 6 | 117 | 4 | 78 |
## |-----------------------------------------------------------------------|
## | 1.8 - 2 | 23 | 140 | 15.33 | 93.33 |
## |-----------------------------------------------------------------------|
## | 2 - 2.3 | 9 | 149 | 6 | 99.33 |
## |-----------------------------------------------------------------------|
## | 2.3 - 2.5 | 14 | 163 | 9.33 | 108.67 |
## |-----------------------------------------------------------------------|
## | Total | 150 | - | 100.00 | - |
## |-----------------------------------------------------------------------|
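The same kind of table can be sketched in base R with cut() and table(). The binning rule below (equal-width, right-closed intervals with the minimum included in the first bin) is an assumption chosen for illustration and is not taken from descriptr's source. Under this rule every observation falls in exactly one bin, so the frequencies sum to the 150 rows of iris. The frequencies printed above sum to 163, which suggests (an inference, not something the source states) that values on bin boundaries are counted more than once in that output.

```r
# Frequency table for a continuous variable, built with base R.
# Illustrative sketch; the binning rule here is an assumption, not descriptr's code.
x <- iris$Petal.Width
n_bins <- 10

# Equal-width break points from the minimum to the maximum
breaks <- seq(min(x), max(x), length.out = n_bins + 1)

# Right-closed intervals; include.lowest = TRUE keeps the minimum in the first bin
bins <- cut(x, breaks = breaks, include.lowest = TRUE)

freq <- as.vector(table(bins))
freq_table <- data.frame(
  bin         = levels(bins),
  frequency   = freq,
  cum_freq    = cumsum(freq),
  percent     = round(100 * freq / length(x), 2),
  cum_percent = round(100 * cumsum(freq) / length(x), 2)
)
print(freq_table)
```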
When the number of bins is not supplied, ds_freq_table defaults to 5. Once the frequency table has been generated, it can be plotted as a histogram, which gives an intuitive view of the distribution of the variable. Taking the Petal.Width column from the iris data frame as an example:
# Store the frequency table, then plot it as a histogram
hist_width <- iris %>%
  ds_freq_table(Petal.Width, 10)
plot(hist_width)
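For comparison, a similar histogram can be drawn with base graphics. This is a sketch under the assumption of 10 equal-width bins; the plot produced by descriptr may be styled and binned differently.

```r
# Histogram of Petal.Width with base graphics.
# Illustrative sketch; plot(hist_width) in descriptr may look different.
x <- iris$Petal.Width
breaks <- seq(min(x), max(x), length.out = 11)  # 10 equal-width bins

# hist() draws the plot and invisibly returns the bin counts
h <- hist(x, breaks = breaks,
          main = "Histogram of Petal.Width",
          xlab = "Petal.Width", col = "grey80")
print(h$counts)
```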
The ds_freq_table function handles only one variable at a time. To examine several variables at once, the ds_auto_summary_stats function in descriptr prints the summary statistics and the frequency table for each selected variable. Taking Petal.Length and Petal.Width from the iris data frame as an example:
iris %>%
  ds_auto_summary_stats(Petal.Length, Petal.Width)
## ────────────────────────── Variable: Petal.Length ─────────────────────────
##
## ──────────────────────────── Summary Statistics ───────────────────────────
##
## ────────────────────────── Variable: Petal.Length ─────────────────────────
##
## Univariate Analysis
##
## N 150.00 Variance 3.12
## Missing 0.00 Std Deviation 1.77
## Mean 3.76 Range 5.90
## Median 4.35 Interquartile Range 3.50
## Mode 1.40 Uncorrected SS 2582.71
## Trimmed Mean 3.75 Corrected SS 464.33
## Skewness -0.27 Coeff Variation 46.97
## Kurtosis -1.40 Std Error Mean 0.14
##
## Quantiles
##
## Quantile Value
##
## Max 6.90
## 99% 6.70
## 95% 6.10
## 90% 5.80
## Q3 5.10
## Median 4.35
## Q1 1.60
## 10% 1.40
## 5% 1.30
## 1% 1.15
## Min 1.00
##
## Extreme Values
##
## Low High
##
## Obs Value Obs Value
## 23 1 119 6.9
## 14 1.1 118 6.7
## 15 1.2 123 6.7
## 36 1.2 106 6.6
## 3 1.3 132 6.4
##
##
##
## NULL
##
##
## ────────────────────────── Frequency Distribution ─────────────────────────
##
## Variable: Petal.Length
## |-----------------------------------------------------------------------|
## | Bins | Frequency | Cum Frequency | Percent | Cum Percent |
## |-----------------------------------------------------------------------|
## | 1 - 2.2 | 50 | 50 | 33.33 | 33.33 |
## |-----------------------------------------------------------------------|
## | 2.2 - 3.4 | 3 | 53 | 2 | 35.33 |
## |-----------------------------------------------------------------------|
## | 3.4 - 4.5 | 34 | 87 | 22.67 | 58 |
## |-----------------------------------------------------------------------|
## | 4.5 - 5.7 | 47 | 134 | 31.33 | 89.33 |
## |-----------------------------------------------------------------------|
## | 5.7 - 6.9 | 16 | 150 | 10.67 | 100 |
## |-----------------------------------------------------------------------|
## | Total | 150 | - | 100.00 | - |
## |-----------------------------------------------------------------------|
##
##
## ────────────────────────── Variable: Petal.Width ──────────────────────────
##
## ──────────────────────────── Summary Statistics ───────────────────────────
##
## ────────────────────────── Variable: Petal.Width ──────────────────────────
##
## Univariate Analysis
##
## N 150.00 Variance 0.58
## Missing 0.00 Std Deviation 0.76
## Mean 1.20 Range 2.40
## Median 1.30 Interquartile Range 1.50
## Mode 0.20 Uncorrected SS 302.33
## Trimmed Mean 1.19 Corrected SS 86.57
## Skewness -0.10 Coeff Variation 63.56
## Kurtosis -1.34 Std Error Mean 0.06
##
## Quantiles
##
## Quantile Value
##
## Max 2.50
## 99% 2.50
## 95% 2.30
## 90% 2.20
## Q3 1.80
## Median 1.30
## Q1 0.30
## 10% 0.20
## 5% 0.20
## 1% 0.10
## Min 0.10
##
## Extreme Values
##
## Low High
##
## Obs Value Obs Value
## 10 0.1 101 2.5
## 13 0.1 110 2.5
## 14 0.1 145 2.5
## 33 0.1 115 2.4
## 38 0.1 137 2.4
##
##
##
## NULL
##
##
## ────────────────────────── Frequency Distribution ─────────────────────────
##
## Variable: Petal.Width
## |-----------------------------------------------------------------------|
## | Bins | Frequency | Cum Frequency | Percent | Cum Percent |
## |-----------------------------------------------------------------------|
## | 0.1 - 0.6 | 49 | 49 | 32.67 | 32.67 |
## |-----------------------------------------------------------------------|
## | 0.6 - 1.1 | 8 | 57 | 5.33 | 38 |
## |-----------------------------------------------------------------------|
## | 1.1 - 1.5 | 41 | 98 | 27.33 | 65.33 |
## |-----------------------------------------------------------------------|
## | 1.5 - 2 | 29 | 127 | 19.33 | 84.67 |
## |-----------------------------------------------------------------------|
## | 2 - 2.5 | 23 | 150 | 15.33 | 100 |
## |-----------------------------------------------------------------------|
## | Total | 150 | - | 100.00 | - |
## |-----------------------------------------------------------------------|
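The idea of summarising several variables in one call can be sketched in base R by applying one helper to each selected column. The helper name summarise_var is hypothetical, and the set of statistics is a small subset chosen for illustration.

```r
# Summary statistics for several variables at once, using base R.
# Illustrative sketch; the helper name summarise_var is hypothetical.
summarise_var <- function(x) {
  c(n = length(x), mean = mean(x), median = median(x),
    sd = sd(x), min = min(x), max = max(x))
}

vars <- c("Petal.Length", "Petal.Width")
multi_stats <- sapply(iris[vars], summarise_var)  # one column of results per variable
print(round(multi_stats, 2))
```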
Apart from continuous variables, descriptr also provides functions for analysing categorical variables in a data frame. A categorical variable records which level or class each observation belongs to. The ds_group_summary() function generates summary statistics of a continuous variable conditional on the levels of a categorical variable. Taking Species and Sepal.Length from the iris data frame as an example:
iris %>%
  ds_group_summary(Species, Sepal.Length)
## Sepal.Length by Species
## -----------------------------------------------------------------------------------------
## | Statistic/Levels| setosa| versicolor| virginica|
## -----------------------------------------------------------------------------------------
## | Obs| 50| 50| 50|
## | Minimum| 4.3| 4.9| 4.9|
## | Maximum| 5.8| 7| 7.9|
## | Mean| 5.01| 5.94| 6.59|
## | Median| 5| 5.9| 6.5|
## | Mode| 5| 5.5| 6.3|
## | Std. Deviation| 0.35| 0.52| 0.64|
## | Variance| 0.12| 0.27| 0.4|
## | Skewness| 0.12| 0.11| 0.12|
## | Kurtosis| -0.25| -0.53| 0.03|
## | Uncorrected SS| 1259.09| 1774.86| 2189.9|
## | Corrected SS| 6.09| 13.06| 19.81|
## | Coeff Variation| 7.04| 8.7| 9.65|
## | Std. Error Mean| 0.05| 0.07| 0.09|
## | Range| 1.5| 2.1| 3|
## | Interquartile Range| 0.4| 0.7| 0.67|
## -----------------------------------------------------------------------------------------
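A grouped summary of this kind amounts to splitting the continuous variable by the factor and applying the same statistics to each piece. The following base R sketch shows that idea for a subset of the rows in the table; it is not descriptr's implementation.

```r
# Group-wise statistics of Sepal.Length by Species with base R.
# Illustrative sketch, not descriptr's implementation.
groups <- split(iris$Sepal.Length, iris$Species)

group_stats <- sapply(groups, function(x) {
  c(obs = length(x), minimum = min(x), maximum = max(x),
    mean = mean(x), median = median(x),
    std_dev = sd(x), variance = var(x))
})
print(round(group_stats, 2))
```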
The plot function can also be applied to the output of ds_group_summary to visualise the grouped summary statistics. It draws a box plot of the continuous variable for each level of the categorical variable.
# Store the grouped summary, then draw one box plot per species
iris_box <- iris %>%
  ds_group_summary(Species, Sepal.Length)
plot(iris_box)
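A comparable set of box plots can be produced with base graphics through the formula interface of boxplot(). This is a sketch for comparison only; the descriptr plot may be styled differently.

```r
# Box plots of Sepal.Length for each Species with base graphics.
# Illustrative sketch; plot(iris_box) in descriptr may be styled differently.
bp <- boxplot(Sepal.Length ~ Species, data = iris,
              main = "Sepal.Length by Species",
              xlab = "Species", ylab = "Sepal.Length")

# boxplot() invisibly returns the five-number summaries it drew, one column per group
colnames(bp$stats) <- bp$names
print(bp$stats)
```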
If the data set contains more than one categorical variable, the ds_cross_table() function can be used to create two-way tables of categorical variables. The example below uses the flights data set from the nycflights13 package, which records flights departing from New York in 2013. A temporary data frame containing two categorical variables, carrier and month, is built to demonstrate ds_cross_table, so the table counts departing flights for each carrier in each month. The plot function can also be applied to the output of ds_cross_table to visualise the result.
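# Build a data frame holding the two categorical variables
temp_df <- data.frame(Carrier = as.factor(flights$carrier),
                      Month = as.factor(flights$month))
# Two-way table of carrier by month, then its plot
temp_tb <- ds_cross_table(temp_df, Carrier, Month)
plot(temp_tb)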
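The counting behind a two-way table can be sketched with base R's table(). To keep the sketch runnable without extra packages, it uses the built-in mtcars data as a stand-in; that data set and the names cars_df and cross_tab are assumptions for illustration and are not part of the original example.

```r
# Two-way table of two categorical variables with base R.
# mtcars is used as a stand-in here (an assumption for portability);
# it is not part of the original example.
cars_df <- data.frame(Cylinders = as.factor(mtcars$cyl),
                      Gears     = as.factor(mtcars$gear))

# Cell counts, then the same table with row and column totals added
cross_tab <- table(cars_df$Cylinders, cars_df$Gears,
                   dnn = c("Cylinders", "Gears"))
print(addmargins(cross_tab))

# Each cell as a percentage of all observations
print(round(100 * prop.table(cross_tab), 2))
```

With nycflights13 installed, `table(temp_df$Carrier, temp_df$Month)` applies the same idea to the carrier and month columns.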
This vignette covered the summary statistics that descriptr produces for a data set and showed how to visualise the output of its functions. These steps serve as a first pass before exploring a data set in more depth.