Descriptive statistics describe or summarize a data set, typically as tables of summary measures. They do not, however, allow us to draw conclusions beyond the data we have analyzed or to test any hypotheses we might have made. They are commonly grouped into the following three categories (some authors add a fourth category known as measures of position).
1. Measures of central tendency: A measure of central tendency, sometimes referred to as a measure of location, is a single numerical value that describes a data set by identifying its central position. The most common statistics in this category are the mean, the median and the mode.
2. Measures of dispersion: These summarize a data set by describing how spread out the scores are. They are sometimes referred to as measures of variability or measures of spread. Common examples are the range, interquartile range, variance and standard deviation.
3. Measures of distribution: Measures of distribution describe the shape of the distribution of a data set. These measures are rarely used in practice. There are two statistics in this category:
Skewness: Measures the asymmetry of the distribution of a data set. A skewness equal (or close) to zero indicates a roughly symmetric distribution, as is the case for the normal distribution, although zero skewness alone does not establish normality. Negative and positive values indicate left skew (the longer tail is to the left) and right skew (the longer tail is to the right) respectively.
Kurtosis: Kurtosis describes the degree to which the scores of a data set cluster in the tails or the peak of a frequency distribution. The peak is the tallest part of the distribution, and the tails are its ends. The values quoted below refer to excess kurtosis, that is, kurtosis minus 3, so that a normal distribution scores zero. There are three types of kurtosis:
Mesokurtic: A distribution of moderate breadth with a peak of medium height, like the normal distribution. Such a distribution has an excess kurtosis of zero (or close to zero).
Leptokurtic: More values in the distribution tails and more values close to the mean (i.e. sharply peaked with heavy tails). This has a positive excess kurtosis.
Platykurtic: Fewer values in the tails and fewer values close to the mean (i.e. the curve has a flat peak and more dispersed scores, with lighter tails). This has a negative excess kurtosis.
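As a rough illustration of the three categories, the sketch below computes each kind of measure in base R on a small made-up vector (the values are not from the NHANES data). The skewness and excess kurtosis here are the simple moment-based versions; this is an assumption, since add-on packages may use slightly different estimators and some report kurtosis without subtracting 3.

```r
# Small illustrative sample (made-up values, not from the NHANES data)
x <- c(2, 3, 3, 4, 5, 5, 5, 7, 9, 12)

# Measures of central tendency
x_mean   <- mean(x)
x_median <- median(x)
x_mode   <- as.numeric(names(which.max(table(x))))  # most frequent value

# Measures of dispersion
x_range <- diff(range(x))  # maximum minus minimum
x_var   <- var(x)          # sample variance (n - 1 denominator)
x_sd    <- sd(x)           # square root of the sample variance
x_iqr   <- IQR(x)          # upper quartile minus lower quartile

# Measures of distribution (moment-based estimators, n denominator)
m2 <- mean((x - x_mean)^2)
m3 <- mean((x - x_mean)^3)
m4 <- mean((x - x_mean)^4)
x_skewness        <- m3 / m2^1.5    # > 0 means right skew
x_excess_kurtosis <- m4 / m2^2 - 3  # 0 for a normal distribution

c(mean = x_mean, median = x_median, mode = x_mode,
  range = x_range, sd = x_sd, iqr = x_iqr,
  skewness = x_skewness, excess_kurtosis = x_excess_kurtosis)
```

Because the sample has a few large values (9 and 12), the mean sits above the median and the skewness comes out positive.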
The following packages are required and should be loaded first. If they are not already installed, install them with the install.packages() function, e.g. install.packages("janitor").
library(janitor) # contains tabyl() function for tabulation
library(dplyr) # operations on data frames e.g. %>%
options(dplyr.summarise.inform = FALSE) # suppress summarise() grouping messages
library(arsenal) # tableby() function
library(kableExtra) # display table formatting
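If it is unclear which of these packages are already present, one possible approach is to check first and install only what is missing. The helper name missing_packages() below is hypothetical, and the install call is left commented out so that nothing is downloaded unintentionally.

```r
# Packages used in this tutorial
required_packages <- c("janitor", "dplyr", "arsenal", "kableExtra")

# Hypothetical helper: return the packages that are not yet installed.
# requireNamespace() returns TRUE/FALSE instead of stopping with an error.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(required_packages)
if (length(to_install) > 0) {
  message("Not yet installed: ", paste(to_install, collapse = ", "))
  # install.packages(to_install)  # uncomment to install them
}
```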
A description of the data set and the variable name labels can be found here. The code below downloads the data set and then displays the first 6 observations and a few columns of this data set.
setwd("D:/stemresearch/R/analysis/descriptive-statistics")
nhanesdata <- readRDS(file = url("http://drmathematics.com/learning/datasets/nhanesdata.RDS"))
kbl(nhanesdata[1:6, c(1, 2, 11:15)],
caption = "Table 1: Showing first 6 observations and a few variables.") %>%
kable_styling(bootstrap_options = "striped", full_width = FALSE, position = "left")
| id | region | highbp | sex | race | age | agegroup |
|---|---|---|---|---|---|---|
| 1400 | South | No | Male | White | 54 | 50-59 |
| 1401 | South | No | Female | White | 41 | 40-49 |
| 1402 | South | No | Female | Other | 21 | 20-29 |
| 1404 | South | Yes | Female | White | 63 | 60-69 |
| 1405 | South | No | Female | White | 64 | 60-69 |
| 1406 | South | Yes | Female | White | 63 | 60-69 |
The syntax below uses the functions of the dplyr package to calculate the minimum, maximum, mean and standard deviation of serum cholesterol by the health status of respondents. Results are presented in Table 2.
table2 <- nhanesdata %>%
group_by(health) %>%
summarise(Frequency = n(),
Minimum = min(cholesterol),
Maximum = max(cholesterol),
Mean = mean(cholesterol),
SD = sd(cholesterol))
names(table2)[1] = c("Health status")
kbl(table2, digits = 2,
caption = "Table 2: Descriptive statistics of serum cholesterol by health status.") %>%
kable_styling(bootstrap_options = "striped", full_width = FALSE, position = "left")
| Health status | Frequency | Minimum | Maximum | Mean | SD |
|---|---|---|---|---|---|
| Poor | 729 | 80 | 457 | 229.49 | 53.21 |
| Fair | 1670 | 88 | 828 | 228.49 | 52.59 |
| Average | 2940 | 85 | 644 | 220.35 | 49.36 |
| Good | 2591 | 92 | 492 | 212.64 | 47.52 |
| Excellent | 2407 | 84 | 426 | 208.65 | 45.39 |
In this example, we again use functions from the dplyr package to calculate the minimum, maximum, median and interquartile range of serum cholesterol by the health status of respondents. The interquartile range is the difference between the upper and lower quartiles and is computed with IQR = diff(quantile(cholesterol, c(1, 3)/4)) inside the summarise() function in the code below. Results are presented in Table 3.
table3 <- nhanesdata %>%
group_by(health) %>%
summarise(Frequency = n(),
Minimum = min(cholesterol),
Maximum = max(cholesterol),
Median = median(cholesterol),
IQR = diff(quantile(cholesterol, c(1, 3)/4)))
names(table3)[1] = c("Health status")
kbl(table3, digits = 2,
caption = "Table 3: Descriptive statistics of serum cholesterol by health status.") %>%
kable_styling(bootstrap_options = "striped", full_width = FALSE, position = "left")
| Health status | Frequency | Minimum | Maximum | Median | IQR |
|---|---|---|---|---|---|
| Poor | 729 | 80 | 457 | 223 | 68 |
| Fair | 1670 | 88 | 828 | 223 | 66 |
| Average | 2940 | 85 | 644 | 216 | 65 |
| Good | 2591 | 92 | 492 | 208 | 61 |
| Excellent | 2407 | 84 | 426 | 204 | 61 |
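The quartile difference used for the IQR column can be checked against base R's built-in IQR() function. The sketch below uses a short made-up vector rather than the NHANES data; both calls rely on the default quantile definition in R, so they should agree.

```r
# Made-up cholesterol-like values, for illustration only
chol <- c(180, 195, 204, 210, 223, 236, 250, 268)

# Lower (25%) and upper (75%) quartiles
quartiles <- quantile(chol, probs = c(1, 3) / 4)

# Interquartile range computed two ways
iqr_manual  <- unname(diff(quartiles))  # upper quartile minus lower quartile
iqr_builtin <- IQR(chol)                # built-in equivalent

all.equal(iqr_manual, iqr_builtin)
```

Either form could be used inside summarise(); the built-in IQR() is simply shorter to write.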
When descriptive statistics are to be summarized by multiple categorical variables, all of these variables should be included in the group_by() function. In the code below, we summarize the descriptive statistics by two categorical variables: race and sex of respondents. The results are presented in Table 4.
table4 <- nhanesdata %>%
group_by(race, sex) %>%
summarise(Frequency = n(),
Minimum = min(cholesterol),
Maximum = max(cholesterol),
Mean = mean(cholesterol),
SD = sd(cholesterol))
names(table4)[1:2] = c("Race", "Sex")
kbl(table4,
digits = 2,
caption = "Table 4: Descriptive statistics of serum cholesterol by race and sex.") %>%
kable_styling(bootstrap_options = "striped", full_width = FALSE, position = "left")
| Race | Sex | Frequency | Minimum | Maximum | Mean | SD |
|---|---|---|---|---|---|---|
| White | Male | 4306 | 84 | 468 | 213.57 | 45.62 |
| White | Female | 4745 | 88 | 828 | 222.47 | 52.02 |
| Black | Male | 500 | 105 | 418 | 209.12 | 50.22 |
| Black | Female | 586 | 80 | 471 | 217.27 | 52.27 |
| Other | Male | 103 | 113 | 315 | 215.86 | 38.47 |
| Other | Female | 97 | 117 | 316 | 211.70 | 45.39 |
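One detail worth knowing when grouping by more than one variable: by default summarise() removes only the last grouping level, so the result of the code above is still grouped by race. The sketch below, on a made-up data frame standing in for nhanesdata and assuming dplyr 1.0.0 or later, uses the .groups argument to return an ungrouped table instead.

```r
library(dplyr)

# Made-up data standing in for nhanesdata (illustration only)
toy <- data.frame(
  race        = c("White", "White", "White", "Black", "Black", "Black"),
  sex         = c("Male", "Female", "Female", "Male", "Male", "Female"),
  cholesterol = c(210, 225, 231, 198, 204, 240)
)

toy_summary <- toy %>%
  group_by(race, sex) %>%
  summarise(Frequency = n(),
            Mean = mean(cholesterol),
            .groups = "drop")  # return an ungrouped data frame

toy_summary
```

Dropping the groups explicitly avoids surprises if further mutate() or summarise() steps are chained onto the table later.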
In the above examples, we have just considered multiple descriptive statistics of a single quantitative variable by one or two categorical variables. In the next examples, we illustrate how to compute two statistics for multiple quantitative variables by a single categorical variable. The first example shows the calculation of the mean and standard deviation while the second shows the calculation of the median and interquartile range.
We begin by creating the function combinestats(), which combines the mean and standard deviation in the first example, and the median and interquartile range in the second, into a single string of the form statistic (spread). The function accepts two arguments, data and statistic, which specify a vector of data values and the statistic to compute (mean/median) respectively.
combinestats = function(data, statistic = "mean"){
# combinestats function returns the mean (standard deviation) or the
# median (interquartile range)
# Inputs:
# data: vector containing the data values
# statistic: statistic to calculate (mean/median), default is mean
# Outputs:
# mean (standard deviation) or median (interquartile range)
if (statistic == "mean"){
data.mean = round(mean(data), 2)
data.sd = round(sd(data), 2)
mean.sd = cbind(paste0(c(data.mean, " (", data.sd, ")"), collapse = ""))
return(mean.sd)
}else if (statistic == "median"){
data.median = round(median(data), 2)
data.iqr = round(diff(quantile(data, c(1, 3)/4)), 2)
median.iqr = cbind(paste0(c(data.median, " (", data.iqr, ")"), collapse = ""))
return(median.iqr)
}else{
stop(statistic, " is an invalid value. Valid values are: mean or median.")
}
}
We now apply the above function to the data set as shown in the code below in order to calculate the mean and standard deviation of the specified variables. Results are presented in Table 5, where the values in brackets are the standard deviations.
table5 <- nhanesdata %>%
group_by(health) %>%
summarise(Frequency = n(),
Age = combinestats(age),
BMI = combinestats(bmi),
SystolicBP = combinestats(bpsystolic),
DiastolicBP = combinestats(bpdiastolic),
Cholesterol = combinestats(cholesterol)) %>%
ungroup() %>%
mutate_at(vars(health), list(~as.character(.))) %>%
bind_rows(summarise(health = "Total",
nhanesdata,
Frequency = n(),
Age = combinestats(age),
BMI = combinestats(bmi),
SystolicBP = combinestats(bpsystolic),
DiastolicBP = combinestats(bpdiastolic),
Cholesterol = combinestats(cholesterol)))
names(table5)[1] = c("Health status")
kbl(table5, digits = 2, align = c("l", rep("r", times = ncol(table5)-1)),
caption = "Table 5: Mean (standard deviation) of multiple quantitative variables by health status.") %>%
kable_styling(bootstrap_options = "striped", full_width = FALSE, position = "left")
| Health status | Frequency | Age | BMI | SystolicBP | DiastolicBP | Cholesterol |
|---|---|---|---|---|---|---|
| Poor | 729 | 59.99 (11.48) | 26.48 (5.91) | 140.74 (26.06) | 84.96 (13.73) | 229.49 (53.21) |
| Fair | 1670 | 56.37 (14.65) | 26.59 (5.52) | 139.04 (26.06) | 84.31 (13.58) | 228.49 (52.59) |
| Average | 2940 | 49.14 (16.58) | 25.81 (4.93) | 132.34 (22.88) | 82.69 (13.05) | 220.35 (49.36) |
| Good | 2591 | 43.5 (17.18) | 25.2 (4.66) | 127.3 (21.6) | 80.36 (12.41) | 212.64 (47.52) |
| Excellent | 2407 | 40.14 (15.95) | 24.56 (4.1) | 124.32 (19.67) | 79.22 (11.93) | 208.65 (45.39) |
| Total | 10337 | 47.56 (17.22) | 25.54 (4.91) | 130.88 (23.34) | 81.72 (12.93) | 217.66 (49.4) |
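Some entries in Table 5, such as 43.5 (17.18) and 24.56 (4.1), show fewer than two decimals because round() returns a number and trailing zeros are dropped when it is pasted into text. If a fixed number of decimals is preferred, one option is to format with sprintf(). The function name format_stats() below is hypothetical and is only a sketch of an alternative to combinestats().

```r
# Hypothetical alternative to combinestats() that always shows two decimals
format_stats <- function(data, statistic = c("mean", "median")) {
  statistic <- match.arg(statistic)  # errors on anything but mean/median
  if (statistic == "mean") {
    sprintf("%.2f (%.2f)", mean(data), sd(data))
  } else {
    sprintf("%.2f (%.2f)", median(data), IQR(data))
  }
}

format_stats(c(1, 2, 3, 4))                        # "2.50 (1.29)"
format_stats(c(1, 2, 3, 4), statistic = "median")  # "2.50 (1.50)"
```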
In the following code, we again apply the combinestats() function but this time, to calculate the median and interquartile range for the specified variables. Results are presented in Table 6 with values in brackets representing the interquartile ranges.
table6 <- nhanesdata %>%
group_by(health) %>%
summarise(Frequency = n(),
Age = combinestats(age, statistic = "median"),
BMI = combinestats(bmi, statistic = "median"),
SystolicBP = combinestats(bpsystolic, statistic = "median"),
DiastolicBP = combinestats(bpdiastolic, statistic = "median"),
Cholesterol = combinestats(cholesterol, statistic = "median")) %>%
ungroup() %>%
mutate_at(vars(health), list(~as.character(.))) %>%
bind_rows(summarise(health = "Total",
nhanesdata,
Frequency = n(),
Age = combinestats(age, statistic = "median"),
BMI = combinestats(bmi, statistic = "median"),
SystolicBP = combinestats(bpsystolic, statistic = "median"),
DiastolicBP = combinestats(bpdiastolic, statistic = "median"),
Cholesterol = combinestats(cholesterol, statistic = "median")))
names(table6)[1] = c("Health status")
kbl(table6, digits = 2, align = c("l", rep("r", times = ncol(table6)-1)),
caption = "Table 6: Median (interquartile range) of quantitative variables by health status.") %>%
kable_styling(bootstrap_options = "striped", full_width = FALSE, position = "left")
| Health status | Frequency | Age | BMI | SystolicBP | DiastolicBP | Cholesterol |
|---|---|---|---|---|---|---|
| Poor | 729 | 62 (12) | 25.6 (7.32) | 138 (35) | 84 (18) | 223 (68) |
| Fair | 1670 | 62 (19) | 25.85 (6.65) | 135.5 (32) | 84 (16.5) | 223 (66) |
| Average | 2940 | 52 (30) | 25.21 (5.94) | 130 (30) | 80 (18) | 216 (65) |
| Good | 2591 | 40 (33) | 24.55 (5.65) | 124 (29.5) | 80 (20) | 208 (61) |
| Excellent | 2407 | 36 (26) | 23.92 (5.01) | 120 (26) | 80 (18) | 204 (61) |
| Total | 10337 | 49 (32) | 24.82 (5.89) | 128 (28) | 80 (20) | 213 (64) |
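The code for Tables 5 and 6 repeats the same call once per variable. As a possible shortcut, dplyr's across() can apply one function to several columns at once. The sketch below assumes dplyr 1.0.0 or later and uses a made-up data frame and a hypothetical helper mean_sd() in place of nhanesdata and combinestats().

```r
library(dplyr)

# Made-up data standing in for nhanesdata (illustration only)
toy <- data.frame(
  health = c("Good", "Good", "Poor", "Poor"),
  age    = c(30, 40, 60, 70),
  bmi    = c(22, 24, 27, 31)
)

# Hypothetical helper: "mean (standard deviation)" as text
mean_sd <- function(v) sprintf("%.2f (%.2f)", mean(v), sd(v))

toy_table <- toy %>%
  group_by(health) %>%
  summarise(Frequency = n(),
            across(c(age, bmi), mean_sd),  # same function for each column
            .groups = "drop")

toy_table
```

The output columns keep the original variable names (age, bmi), so they would still need renaming if labels such as SystolicBP are wanted.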
Tests of differences in group means are used to determine whether the means of groups in a sample are different at a specified level of significance. We consider two tests that are commonly used for this purpose: the one-way analysis of variance (ANOVA) and the Kruskal-Wallis test.
Begin by labeling a few variables of interest in the nhanesdata data frame. This step just ensures that we have easy to understand labels in the output table. Note that by default, variable names are used as labels in the output table results.
labels(nhanesdata) = c(age = "Age in years",
cholesterol = "Serum cholesterol",
bmi = "Body mass index")
By default, the code below computes the mean (standard deviation) and the range (minimum - maximum) of each quantitative variable, then performs a one-way analysis of variance to compare its mean by sex (male/female). Interpretation of statistical significance is discussed in the section Hypothesis testing and p-values.
tab6 = tableby(sex ~ age + bmi + cholesterol,
data = nhanesdata)
summary(tab6, digits = 2, digits.p = 3,
title = 'Table 7: Descriptive statistics of quantitative variables and p-values',
pfootnote = TRUE)
| | Male (N=4909) | Female (N=5428) | Total (N=10337) | p value |
|---|---|---|---|---|
| Age in years | | | | 0.365¹ |
| Mean (SD) | 47.40 (17.17) | 47.71 (17.26) | 47.56 (17.22) | |
| Range | 20.00 - 74.00 | 20.00 - 74.00 | 20.00 - 74.00 | |
| Body mass index | | | | 0.575¹ |
| Mean (SD) | 25.51 (4.02) | 25.56 (5.60) | 25.54 (4.91) | |
| Range | 12.39 - 53.04 | 14.13 - 61.13 | 12.39 - 61.13 | |
| Serum cholesterol | | | | < 0.001¹ |
| Mean (SD) | 213.17 (45.98) | 221.71 (51.97) | 217.66 (49.40) | |
| Range | 84.00 - 468.00 | 80.00 - 828.00 | 80.00 - 828.00 | |

¹ Analysis of variance (ANOVA)
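To see where a p-value of this kind comes from, the sketch below runs a one-way analysis of variance directly in base R. It uses the ToothGrowth data set that ships with R rather than nhanesdata, which is an assumption made so the example runs without a download. With only two groups, as with sex above, the one-way ANOVA p-value matches that of a two-sample t-test assuming equal variances.

```r
# ToothGrowth ships with R; supp is a factor with two levels (OJ, VC)
fit <- aov(len ~ supp, data = ToothGrowth)

# The p-value sits in the "Pr(>F)" column of the ANOVA table
p_anova <- summary(fit)[[1]][["Pr(>F)"]][1]

# Equal-variance two-sample t-test on the same data
p_ttest <- t.test(len ~ supp, data = ToothGrowth, var.equal = TRUE)$p.value

c(anova = p_anova, t_test = p_ttest)
all.equal(p_anova, p_ttest)
```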
By default, the analysis of variance test (option anova) is applied when variables are continuous. You can instead request the Kruskal-Wallis test by specifying the option kwt, as in the part kwt(cholesterol, "N", "median", "iqr") of the code below. Note that we report the median and interquartile range (iqr) for this test instead of the mean and standard deviation because the Kruskal-Wallis test is used when the data fail to meet the assumptions of parametric tests, particularly the normality assumption.
tab7 <- tableby(race ~ kwt(cholesterol, "N","median", "iqr") +
anova(age, "N","mean", "sd") +
anova(bmi, "N","mean", "sd"),
data = nhanesdata)
summary(tab7, digits = 2, digits.p = 3,
title = 'Table 8: Descriptive statistics of quantitative variables and p-values',
pfootnote = TRUE)
| | White (N=9051) | Black (N=1086) | Other (N=200) | Total (N=10337) | p value |
|---|---|---|---|---|---|
| Serum cholesterol | | | | | 0.003¹ |
| N | 9051 | 1086 | 200 | 10337 | |
| Median | 213.00 | 208.00 | 211.50 | 213.00 | |
| IQR | 64.00 | 64.00 | 57.25 | 64.00 | |
| Age in years | | | | | < 0.001² |
| N | 9051 | 1086 | 200 | 10337 | |
| Mean | 47.83 | 45.95 | 44.09 | 47.56 | |
| SD | 17.17 | 17.45 | 17.33 | 17.22 | |
| Body mass index | | | | | < 0.001² |
| N | 9051 | 1086 | 200 | 10337 | |
| Mean | 25.43 | 26.72 | 24.02 | 25.54 | |
| SD | 4.76 | 5.97 | 4.29 | 4.91 | |

¹ Kruskal-Wallis test
² Analysis of variance (ANOVA)
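The Kruskal-Wallis test can also be run on its own with base R's kruskal.test(). The sketch below uses the PlantGrowth data set that ships with R (three groups) instead of nhanesdata, purely as a stand-in, and reports the group medians and interquartile ranges alongside the test.

```r
# PlantGrowth ships with R: plant weight under three conditions
kw <- kruskal.test(weight ~ group, data = PlantGrowth)

kw$statistic  # Kruskal-Wallis chi-squared statistic
kw$parameter  # degrees of freedom: number of groups minus 1
kw$p.value    # p-value of the test

# Group medians and interquartile ranges, the summaries paired with this test
group_summary <- aggregate(weight ~ group, data = PlantGrowth,
                           FUN = function(w) c(median = median(w), iqr = IQR(w)))
group_summary
```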
STEM Research
https://stemresearchs.com