Descriptive statistics with R

Introduction

Descriptive statistics describe or summarize a data set, typically in the form of tables. They do not, however, allow us to draw conclusions beyond the data we have analyzed, or to reach conclusions about any hypotheses we might have made. They are usually grouped into the following three categories (some authors add a fourth, known as measures of position).

1. Measures of central tendency: A measure of central tendency, sometimes referred to as a measure of location, is a single numerical value that describes a set of data by identifying the central position within it. The statistics under this category are listed below.

Mean: The mean (or average) is the best known and most commonly used measure of central tendency. It can be used with both discrete (integer values) and continuous (real numbers) data, although it is most often used with continuous data. The mean is very sensitive to outliers (values that lie far from the rest of the scores), and as such it should not be used for data that contains outliers or, more generally, for data that is skewed (non-normal data).
Median: The median is the middle score for a set of data that has been arranged in order of magnitude, either ascending or descending. Just like the mean, the median is used with both discrete and continuous data, but it can also be used for categorical data measured on an ordinal scale (categories where order is important, e.g. education level: secondary/college/university, socio-economic status: low/middle/high, etc.). It is an alternative to the mean when data contains outliers, since it is less affected by their presence.
Mode: This is the most frequently occurring score in a data set. It is used with categorical data to reveal the most common category. The mode is rarely used for discrete or continuous data because such data may have more than one mode (multi-modal) or no mode at all.
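As a minimal sketch of the three measures, the base R code below uses a small made-up vector of scores (it is not taken from the data set used later). Base R has no built-in function for the statistical mode, so the helper stat_mode() is a hypothetical name introduced here for illustration only.

```r
# Made-up scores with one extreme value (500) acting as an outlier
scores <- c(2, 3, 3, 4, 5, 500)

# Hypothetical helper: returns the most frequent value(s) of a vector
stat_mode <- function(x) {
  counts <- table(x)                     # frequency of each distinct value
  names(counts)[counts == max(counts)]   # value(s) with the highest frequency
}

mean(scores)       # pulled towards the outlier
median(scores)     # stays close to the bulk of the scores
stat_mode(scores)  # most frequent value, returned as a character string
```

Because stat_mode() returns every value that ties for the highest frequency, a multi-modal vector yields more than one result.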

2. Measures of dispersion: These are ways of summarizing a group of data by describing how spread out the scores are. They are sometimes referred to as measures of variability or measures of spread. They include the following:

Range
Variance
Standard deviation
Standard error of mean
Coefficient of variation
Quartiles/percentiles
Interquartile range
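The measures listed above can all be computed in base R. The sketch below uses a made-up vector x; the variable names are chosen here for illustration.

```r
x <- c(4, 8, 15, 16, 23, 42)   # made-up values

rng <- diff(range(x))                     # range as maximum - minimum
v   <- var(x)                             # sample variance (n - 1 denominator)
s   <- sd(x)                              # standard deviation = sqrt(variance)
se  <- s / sqrt(length(x))                # standard error of the mean
cv  <- 100 * s / mean(x)                  # coefficient of variation, in percent
q   <- quantile(x, c(0.25, 0.50, 0.75))   # quartiles (R's default method)
iqr <- IQR(x)                             # interquartile range = Q3 - Q1
```

Note that quantile() supports several calculation methods through its type argument, so quartiles from other software may differ slightly.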

3. Measures of distribution: These describe the shape of the distribution of a data set. They are rarely used in practice. There are two statistics under this category:

Skewness: Shows whether the distribution of the data set is symmetric or skewed (left or right skewed). A value of skewness equal (or close) to zero indicates a symmetric distribution, as is the case for the normal distribution, while negative and positive values of skewness indicate left skew (tail is to the left) and right skew (tail is to the right) respectively.
Kurtosis: Kurtosis is a statistical measure that describes the degree to which the scores of a data set cluster in the tails or the peak of a frequency distribution. The peak is the tallest part of the distribution, and the tails are the ends of the distribution. There are three types of kurtosis (the values below refer to excess kurtosis, that is, kurtosis measured relative to the normal distribution):
- Mesokurtic: Distributions of moderate breadth with a medium-height peak. Such a distribution has a kurtosis of zero (or close to zero).
- Leptokurtic: More values in the distribution tails and more values close to the mean (i.e. sharply peaked with heavy tails). This has a positive value of kurtosis.
- Platykurtic: Fewer values in the tails and fewer values close to the mean (i.e. the curve has a flat peak and more dispersed scores with lighter tails). This has a negative value of kurtosis.
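Base R does not ship functions for skewness or kurtosis. The sketch below computes them from central moments, which is one common definition; packages and other software may apply bias corrections and give slightly different values. The helper names central_moment(), skewness_moment() and excess_kurtosis_moment() are hypothetical and introduced here for illustration.

```r
# k-th central moment (divides by n)
central_moment <- function(x, k) mean((x - mean(x))^k)

# Moment-based skewness: m3 / m2^(3/2)
skewness_moment <- function(x) {
  central_moment(x, 3) / central_moment(x, 2)^1.5
}

# Moment-based excess kurtosis: m4 / m2^2 - 3 (zero for a normal distribution)
excess_kurtosis_moment <- function(x) {
  central_moment(x, 4) / central_moment(x, 2)^2 - 3
}

symmetric    <- c(1, 2, 3, 4, 5)       # made-up symmetric values
right_skewed <- c(1, 1, 2, 2, 3, 10)   # made-up values with a long right tail

skewness_moment(symmetric)         # zero: no skew
skewness_moment(right_skewed)      # positive: right skew
excess_kurtosis_moment(symmetric)  # negative: flatter than a normal curve
```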

Install required packages

The following packages are required and should be loaded first. If they are not already installed, install them with the install.packages() function, e.g. install.packages("janitor").

library(janitor) # contains tabyl() function for tabulation
library(dplyr) # operations on data frames e.g. %>%
options(dplyr.summarise.inform = FALSE) # suppress summarise() grouping messages
library(arsenal) # tableby() function
library(kableExtra) # display table formatting
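One possible way to check which of these packages still need installing is sketched below. The helper missing_packages() is a hypothetical name, not part of any package; the install line is left commented out so that nothing is downloaded unintentionally.

```r
# Packages used in this tutorial
pkgs <- c("janitor", "dplyr", "arsenal", "kableExtra")

# Hypothetical helper: names of the packages that are not yet installed
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(pkgs)
if (length(to_install) > 0) {
  message("Not installed: ", paste(to_install, collapse = ", "))
  # install.packages(to_install)   # uncomment to install from CRAN
}
```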

Data set

A description of the data set and the variable name labels can be found here. The code below downloads the data set and then displays the first 6 observations and a few of its columns.

setwd("D:/stemresearch/R/analysis/descriptive-statistics")
nhanesdata <- readRDS(file = url("http://drmathematics.com/learning/datasets/nhanesdata.RDS"))

kbl(nhanesdata[1:6, c(1, 2, 11:15)], 
    caption = "Table 1: Showing first 6 observations and a few variables.") %>%
  kable_styling(bootstrap_options = "striped", full_width = FALSE, position = "left")

Table 1: Showing first 6 observations and a few variables.
id	region	highbp	sex	race	age	agegroup
1400	South	No	Male	White	54	50-59
1401	South	No	Female	White	41	40-49
1402	South	No	Female	Other	21	20-29
1404	South	Yes	Female	White	63	60-69
1405	South	No	Female	White	64	60-69
1406	South	Yes	Female	White	63	60-69

Descriptive statistics: Minimum, maximum, mean and standard deviation

The syntax below uses the functions of the dplyr package to calculate the minimum, maximum, mean and standard deviation of serum cholesterol by the health status of respondents. Results are presented in Table 2.

table2 <- nhanesdata %>% 
    group_by(health) %>% 
    summarise(Frequency = n(),
              Minimum = min(cholesterol),
              Maximum = max(cholesterol),
              Mean = mean(cholesterol), 
              SD = sd(cholesterol))
names(table2)[1] = c("Health status")

kbl(table2, digits = 2, 
    caption = "Table 2: Descriptive statistics of serum cholesterol by health status.") %>%
  kable_styling(bootstrap_options = "striped", full_width = FALSE, position = "left")

Table 2: Descriptive statistics of serum cholesterol by health status.
Health status	Frequency	Minimum	Maximum	Mean	SD
Poor	729	80	457	229.49	53.21
Fair	1670	88	828	228.49	52.59
Average	2940	85	644	220.35	49.36
Good	2591	92	492	212.64	47.52
Excellent	2407	84	426	208.65	45.39
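Functions such as min(), max(), mean() and sd() return NA for a group when any of its values are missing. Whether nhanesdata contains missing cholesterol values is not stated here, so the sketch below uses a small made-up data frame (toy) to show the na.rm = TRUE argument.

```r
library(dplyr)

# Made-up data with one missing cholesterol value
toy <- data.frame(
  health      = c("Poor", "Poor", "Good", "Good", "Good"),
  cholesterol = c(200, NA, 180, 190, 200)
)

toy_summary <- toy %>%
  group_by(health) %>%
  summarise(Frequency = n(),                          # counts rows, including the NA row
            Mean = mean(cholesterol, na.rm = TRUE),   # NA dropped before averaging
            SD   = sd(cholesterol, na.rm = TRUE))     # NA when only one value remains
```

Note that n() still counts the row with the missing value, so the frequency and the number of values used for the mean can differ.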

Descriptive statistics: Minimum, maximum, median and interquartile range

In this example, we again use the functions of the dplyr package to calculate the minimum, maximum, median and interquartile range of serum cholesterol by the health status of respondents. Note that the interquartile range is calculated as the difference between the upper and lower quartiles with the code IQR = diff(quantile(cholesterol, c(1, 3)/4)), as it appears within the summarise() function in the code below. Results are presented in Table 3.

table3 <- nhanesdata %>% 
    group_by(health) %>% 
    summarise(Frequency = n(),
              Minimum = min(cholesterol),
              Maximum = max(cholesterol),
              Median = median(cholesterol), 
              IQR = diff(quantile(cholesterol, c(1, 3)/4)))
names(table3)[1] = c("Health status")

kbl(table3, digits = 2, 
    caption = "Table 3: Descriptive statistics of serum cholesterol by health status.") %>%
  kable_styling(bootstrap_options = "striped", full_width = FALSE, position = "left")

Table 3: Descriptive statistics of serum cholesterol by health status.
Health status	Frequency	Minimum	Maximum	Median	IQR
Poor	729	80	457	223	68
Fair	1670	88	828	223	66
Average	2940	85	644	216	65
Good	2591	92	492	208	61
Excellent	2407	84	426	204	61
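Base R also provides the IQR() function, which by default gives the same value as the quantile difference used above. The short check below uses a made-up vector x to compare the two.

```r
x <- c(80, 150, 204, 216, 223, 457)   # made-up values

iqr_manual  <- diff(quantile(x, c(1, 3) / 4))   # upper quartile - lower quartile
iqr_builtin <- IQR(x)                           # built-in equivalent

# The manual version keeps the name "75%", so drop names before comparing
all.equal(unname(iqr_manual), iqr_builtin)
```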

Descriptive statistics: Multiple categorical variables

When the descriptive statistics of interest are to be summarized by multiple categorical variables, all of those variables should be included in the group_by() function. In the code below, we summarize the descriptive statistics by two categorical variables, namely the race and sex of respondents. The results of running the code are presented in Table 4.

table4 <- nhanesdata %>% 
    group_by(race, sex) %>% 
    summarise(Frequency = n(),
              Minimum = min(cholesterol),
              Maximum = max(cholesterol),
              Mean = mean(cholesterol), 
              SD = sd(cholesterol))
names(table4)[1:2] = c("Race", "Sex")

kbl(table4, 
    digits = 2, 
    caption = "Table 4: Descriptive statistics of serum cholesterol by race and sex.") %>%
  kable_styling(bootstrap_options = "striped", full_width = FALSE, position = "left")

Table 4: Descriptive statistics of serum cholesterol by race and sex.
Race	Sex	Frequency	Minimum	Maximum	Mean	SD
White	Male	4306	84	468	213.57	45.62
White	Female	4745	88	828	222.47	52.02
Black	Male	500	105	418	209.12	50.22
Black	Female	586	80	471	217.27	52.27
Other	Male	103	113	315	215.86	38.47
Other	Female	97	117	316	211.70	45.39
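When grouping by more than one variable, summarise() by default drops only the last grouping level, so the result remains grouped by the first variable. The sketch below, on a made-up data frame (toy), shows the .groups = "drop" argument (available in dplyr 1.0.0 and later), which returns an ungrouped result.

```r
library(dplyr)

# Made-up data: one row per race/sex combination
toy <- data.frame(
  race        = c("White", "White", "Black", "Black"),
  sex         = c("Male", "Female", "Male", "Female"),
  cholesterol = c(210, 220, 205, 215)
)

# Default behaviour: result is still grouped by race
by_default <- toy %>%
  group_by(race, sex) %>%
  summarise(Mean = mean(cholesterol))

# Explicitly drop all grouping
dropped <- toy %>%
  group_by(race, sex) %>%
  summarise(Mean = mean(cholesterol), .groups = "drop")

group_vars(by_default)   # remaining grouping variable(s)
group_vars(dropped)      # no grouping left
```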

Descriptive statistics: Multiple quantitative variables

In the above examples, we have just considered multiple descriptive statistics of a single quantitative variable by one or two categorical variables. In the next examples, we illustrate how to compute two statistics for multiple quantitative variables by a single categorical variable. The first example shows the calculation of the mean and standard deviation while the second shows the calculation of the median and interquartile range.

We begin by creating the function combinestats(), which combines the mean and standard deviation as "mean (standard deviation)" in the first example, and the median and interquartile range as "median (interquartile range)" in the second example. The function accepts two arguments, data and statistic, which specify a vector with the data values and the statistic to compute (mean/median) respectively.

combinestats = function(data, statistic = "mean"){
  
    # combinestats function returns the mean (standard deviation) or the 
    # median (interquartile range) as a single character string
    # Inputs:
        # data: vector containing the data values
        # statistic: statistic to calculate (mean/median), default is mean
    # Outputs:
        # mean (standard deviation) or median (interquartile range)
  
    if (statistic == "mean"){
        data.mean = round(mean(data), 2)
        data.sd = round(sd(data), 2)
        mean.sd = paste0(data.mean, " (", data.sd, ")")
        return(mean.sd)
    }else if (statistic == "median"){
        data.median = round(median(data), 2)
        data.iqr = round(diff(quantile(data, c(1, 3)/4)), 2)
        median.iqr = paste0(data.median, " (", data.iqr, ")")
        return(median.iqr)
    }else{
        # stop() raises an error, so no return value is needed here
        stop(statistic, " is an invalid value. Valid values are: mean or median.")
    }
}

We now apply the above function to the data set as shown in the code below in order to calculate the mean and standard deviation of the specified variables. Results are presented in Table 5, where the values in brackets are the standard deviations.

table5 <- nhanesdata %>% 
    group_by(health) %>% 
    summarise(Frequency = n(),
              Age = combinestats(age),
              BMI = combinestats(bmi),
              SystolicBP = combinestats(bpsystolic),
              DiastolicBP = combinestats(bpdiastolic),
              Cholesterol = combinestats(cholesterol)) %>%
  ungroup() %>%
  mutate(health = as.character(health)) %>%  # convert so the "Total" row can be appended
  bind_rows(summarise(nhanesdata, 
                      health = "Total", 
                      Frequency = n(),
                      Age = combinestats(age),
                      BMI = combinestats(bmi),
                      SystolicBP = combinestats(bpsystolic),
                      DiastolicBP = combinestats(bpdiastolic),
                      Cholesterol = combinestats(cholesterol)))
names(table5)[1] = c("Health status")

kbl(table5, digits = 2, align = c("l", rep("r", times = ncol(table5)-1)),
    caption = "Table 5: Mean (standard deviation) of multiple quantitative variables by health status.") %>%
  kable_styling(bootstrap_options = "striped", full_width = FALSE, position = "left")

Table 5: Mean (standard deviation) of multiple quantitative variables by health status.
Health status	Frequency	Age	BMI	SystolicBP	DiastolicBP	Cholesterol
Poor	729	59.99 (11.48)	26.48 (5.91)	140.74 (26.06)	84.96 (13.73)	229.49 (53.21)
Fair	1670	56.37 (14.65)	26.59 (5.52)	139.04 (26.06)	84.31 (13.58)	228.49 (52.59)
Average	2940	49.14 (16.58)	25.81 (4.93)	132.34 (22.88)	82.69 (13.05)	220.35 (49.36)
Good	2591	43.5 (17.18)	25.2 (4.66)	127.3 (21.6)	80.36 (12.41)	212.64 (47.52)
Excellent	2407	40.14 (15.95)	24.56 (4.1)	124.32 (19.67)	79.22 (11.93)	208.65 (45.39)
Total	10337	47.56 (17.22)	25.54 (4.91)	130.88 (23.34)	81.72 (12.93)	217.66 (49.4)
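Because round() drops trailing zeros, some entries in Table 5 appear as, for example, 43.5 (17.18) rather than 43.50 (17.18). One possible alternative, sketched below, formats the values with sprintf() so that a fixed number of decimals is always shown. The function name format_mean_sd() is hypothetical and is not part of the code above.

```r
# Hypothetical helper: "mean (sd)" with a fixed number of decimal places
format_mean_sd <- function(x, digits = 2) {
  fmt <- paste0("%.", digits, "f (%.", digits, "f)")   # e.g. "%.2f (%.2f)"
  sprintf(fmt, mean(x), sd(x))
}

x <- c(1, 2, 3)   # made-up values with mean 2 and standard deviation 1

format_mean_sd(x)                                    # fixed decimals
paste0(round(mean(x), 2), " (", round(sd(x), 2), ")")   # trailing zeros dropped
```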

In the following code, we again apply the combinestats() function but this time, to calculate the median and interquartile range for the specified variables. Results are presented in Table 6 with values in brackets representing the interquartile ranges.

table6 <- nhanesdata %>% 
    group_by(health) %>% 
    summarise(Frequency = n(),
              Age = combinestats(age, statistic = "median"),
              BMI = combinestats(bmi, statistic = "median"),
              SystolicBP = combinestats(bpsystolic, statistic = "median"),
              DiastolicBP = combinestats(bpdiastolic, statistic = "median"),
              Cholesterol = combinestats(cholesterol, statistic = "median")) %>%
  ungroup() %>%
  mutate(health = as.character(health)) %>%  # convert so the "Total" row can be appended
  bind_rows(summarise(nhanesdata, 
                      health = "Total", 
                      Frequency = n(),
                      Age = combinestats(age, statistic = "median"),
                      BMI = combinestats(bmi, statistic = "median"),
                      SystolicBP = combinestats(bpsystolic, statistic = "median"),
                      DiastolicBP = combinestats(bpdiastolic, statistic = "median"),
                      Cholesterol = combinestats(cholesterol, statistic = "median")))
names(table6)[1] = c("Health status")

kbl(table6, digits = 2, align = c("l", rep("r", times = ncol(table6)-1)),
    caption = "Table 6: Median (interquartile range) of quantitative variables by health status.") %>%
  kable_styling(bootstrap_options = "striped", full_width = FALSE, position = "left")

Table 6: Median (interquartile range) of quantitative variables by health status.
Health status	Frequency	Age	BMI	SystolicBP	DiastolicBP	Cholesterol
Poor	729	62 (12)	25.6 (7.32)	138 (35)	84 (18)	223 (68)
Fair	1670	62 (19)	25.85 (6.65)	135.5 (32)	84 (16.5)	223 (66)
Average	2940	52 (30)	25.21 (5.94)	130 (30)	80 (18)	216 (65)
Good	2591	40 (33)	24.55 (5.65)	124 (29.5)	80 (20)	208 (61)
Excellent	2407	36 (26)	23.92 (5.01)	120 (26)	80 (18)	204 (61)
Total	10337	49 (32)	24.82 (5.89)	128 (28)	80 (20)	213 (64)
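The repeated calls, one per variable, can be condensed with dplyr's across() function (dplyr 1.0.0 or later). The sketch below uses a made-up data frame (toy) and a hypothetical helper med_iqr() rather than the data set and function above, so it illustrates the pattern only.

```r
library(dplyr)

# Made-up data: two groups, two quantitative variables
toy <- data.frame(
  g = c("a", "a", "b", "b"),
  x = c(1, 3, 5, 7),
  y = c(2, 4, 6, 10)
)

# Hypothetical helper: "median (interquartile range)" as one string
med_iqr <- function(v) paste0(median(v), " (", IQR(v), ")")

toy_table <- toy %>%
  group_by(g) %>%
  summarise(Frequency = n(),
            across(c(x, y), med_iqr))   # apply med_iqr to each listed column
```

Adding a variable to the table then only requires adding its name inside c().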

Calculating p values

Tests of differences in group means are used to determine whether the means of groups in a sample are different at a specified level of significance. We consider two such tests that are commonly used:

One way analysis of variance
Kruskal-Wallis one way analysis of variance

Begin by labeling a few variables of interest in the nhanesdata data frame. This step just ensures that we have easy to understand labels in the output table. Note that by default, variable names are used as labels in the output table results.

labels(nhanesdata)  = c(age = "Age in years", 
                        cholesterol = "Serum cholesterol", 
                        bmi = "Body mass index")

By default, the code below computes the mean (standard deviation) and the range (minimum - maximum), then performs a one-way analysis of variance to compare the means of each quantitative variable by sex (male/female). Interpretation of statistical significance is discussed in the section Hypothesis testing and p-values.

tab6 = tableby(sex ~ age + bmi + cholesterol, 
               data = nhanesdata)

summary(tab6, digits = 2, digits.p = 3, 
        title = 'Table 7: Descriptive statistics of quantitative variables and p-values', 
        pfootnote = TRUE)

Table 7: Descriptive statistics of quantitative variables and p-values
	Male (N=4909)	Female (N=5428)	Total (N=10337)	p value
Age in years				0.365¹
Mean (SD)	47.40 (17.17)	47.71 (17.26)	47.56 (17.22)
Range	20.00 - 74.00	20.00 - 74.00	20.00 - 74.00
Body mass index				0.575¹
Mean (SD)	25.51 (4.02)	25.56 (5.60)	25.54 (4.91)
Range	12.39 - 53.04	14.13 - 61.13	12.39 - 61.13
Serum cholesterol				< 0.001¹
Mean (SD)	213.17 (45.98)	221.71 (51.97)	217.66 (49.40)
Range	84.00 - 468.00	80.00 - 828.00	80.00 - 828.00

1. Linear Model ANOVA
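To illustrate the kind of test behind these p-values, the sketch below runs a one-way analysis of variance in base R on PlantGrowth, a data set that ships with R. It is not the data set used above, and how tableby() computes its p-values internally is not shown in this article.

```r
# One-way ANOVA of plant weight by treatment group (built-in data)
fit_aov <- aov(weight ~ group, data = PlantGrowth)
p_aov   <- summary(fit_aov)[[1]][["Pr(>F)"]][1]   # p-value for the group effect

# The same test expressed as a linear model
fit_lm <- lm(weight ~ group, data = PlantGrowth)
p_lm   <- anova(fit_lm)[["Pr(>F)"]][1]

all.equal(p_aov, p_lm)   # both approaches give the same p-value
```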

Specifying a test

By default, the analysis of variance test (option anova) is applied when variables are continuous. You can also request the Kruskal-Wallis test by specifying the option kwt as shown in the code below (the part kwt(cholesterol, "N", "median", "iqr")). Note that we choose to calculate the median and interquartile range (iqr) for this test instead of the mean and standard deviation because the Kruskal-Wallis test is used when the data fails to meet the assumptions of parametric tests, particularly the normality assumption.

tab7 <- tableby(race ~ kwt(cholesterol, "N","median", "iqr") + 
                    anova(age, "N","mean", "sd") +
                    anova(bmi, "N","mean", "sd"), 
                    data = nhanesdata)

summary(tab7, digits = 2, digits.p = 3, 
        title = 'Table 8: Descriptive statistics of quantitative variables and p-values', 
        pfootnote = TRUE)

Table 8: Descriptive statistics of quantitative variables and p-values
	White (N=9051)	Black (N=1086)	Other (N=200)	Total (N=10337)	p value
Serum cholesterol					0.003¹
N	9051	1086	200	10337
Median	213.00	208.00	211.50	213.00
IQR	64.00	64.00	57.25	64.00
Age in years					< 0.001²
N	9051	1086	200	10337
Mean	47.83	45.95	44.09	47.56
SD	17.17	17.45	17.33	17.22
Body mass index					< 0.001²
N	9051	1086	200	10337
Mean	25.43	26.72	24.02	25.54
SD	4.76	5.97	4.29	4.91

1. Kruskal-Wallis rank sum test
2. Linear Model ANOVA
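For comparison, the Kruskal-Wallis rank sum test is also available directly in base R through kruskal.test(). The sketch below applies it to the built-in PlantGrowth data set, not to the data set used above.

```r
# Kruskal-Wallis rank sum test of plant weight by treatment group (built-in data)
kw <- kruskal.test(weight ~ group, data = PlantGrowth)

kw$statistic   # Kruskal-Wallis chi-squared statistic
kw$parameter   # degrees of freedom = number of groups - 1
kw$p.value     # p-value of the test
```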

STEM Research
https://stemresearchs.com