Introduction


Descriptive statistics are numerical summaries that describe the main features of a data set, usually presented in tables. They do not, however, allow us to draw conclusions beyond the data we have analyzed or to reach conclusions about any hypotheses we might have made. They fall into the following three categories (some authors add a fourth category known as measures of position).

1. Measures of central tendency: A measure of central tendency, sometimes referred to as a measure of location, is a single numerical value that describes a set of data by identifying the central position within that set. The common statistics in this category are the mean, the median and the mode.

2. Measures of dispersion: These summarize a group of data by describing how spread out the scores are. They are sometimes referred to as measures of variability or measures of spread. They include the range, the interquartile range, the variance and the standard deviation.

3. Measures of distribution: These describe the shape of the distribution of a data set. They are rarely used in practice. The two statistics in this category are skewness and kurtosis.
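
As a rough illustration of the three categories, the sketch below computes a few statistics from each on a small made-up vector, using base R only. The vector x and the helpers stat_mode(), skewness() and kurtosis() are assumptions for this example: base R has no built-in function for the statistical mode, skewness or kurtosis, so simple moment-based versions are written out here.

```r
# Toy data (made up for illustration; not from the NHANES data set)
x <- c(2, 4, 4, 5, 7, 9, 11)

# Hypothetical helper: most frequent value (the first one if there are ties)
stat_mode <- function(v) {
    counts <- table(v)
    as.numeric(names(counts)[which.max(counts)])
}

# Hypothetical helpers: moment-based skewness and kurtosis
skewness <- function(v) {
    d <- v - mean(v)
    mean(d^3) / mean(d^2)^1.5
}
kurtosis <- function(v) {
    d <- v - mean(v)
    mean(d^4) / mean(d^2)^2
}

# 1. Measures of central tendency
central <- c(mean = mean(x), median = median(x), mode = stat_mode(x))

# 2. Measures of dispersion
dispersion <- c(range = diff(range(x)), iqr = IQR(x),
                variance = var(x), sd = sd(x))

# 3. Measures of distribution
shape <- c(skewness = skewness(x), kurtosis = kurtosis(x))

print(central)
print(dispersion)
print(shape)
```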

Install required packages


The following packages are required and should be loaded first. If they are not already installed, install them with the install.packages() function, e.g. install.packages("janitor").

library(janitor) # contains tabyl() function for tabulation
library(dplyr) # operations on data frames e.g. %>%
options(dplyr.summarise.inform = FALSE) # suppress the grouping messages printed by summarise()
library(arsenal) # tableby() function
library(kableExtra) # display table formatting
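
Before loading the packages, it can help to check which of them are missing. The following is a minimal sketch of one way to do that; the helper name missing_packages() is an assumption for this example, and the install.packages() call is left commented out so that nothing is installed without your say-so.

```r
# Packages used in this tutorial
required <- c("janitor", "dplyr", "arsenal", "kableExtra")

# Hypothetical helper: return the packages in `pkgs` that are not installed
missing_packages <- function(pkgs) {
    installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
    pkgs[!installed]
}

to_install <- missing_packages(required)
if (length(to_install) > 0) {
    message("Not installed: ", paste(to_install, collapse = ", "))
    # install.packages(to_install)  # uncomment to install them
}
```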

Data set


A description of the data set and the variable name labels can be found here. The code below downloads the data set and then displays the first 6 observations and a few columns of the data set.

setwd("D:/stemresearch/R/analysis/descriptive-statistics") # adjust this path to a folder on your own machine
nhanesdata <- readRDS(file = url("http://drmathematics.com/learning/datasets/nhanesdata.RDS"))

kbl(nhanesdata[1:6, c(1, 2, 11:15)], 
    caption = "Table 1: Showing first 6 observations and a few variables.") %>%
  kable_styling(bootstrap_options = "striped", full_width = FALSE, position = "left")
Table 1: Showing first 6 observations and a few variables.
id    region  highbp  sex     race   age  agegroup
1400  South   No      Male    White   54  50-59
1401  South   No      Female  White   41  40-49
1402  South   No      Female  Other   21  20-29
1404  South   Yes     Female  White   63  60-69
1405  South   No      Female  White   64  60-69
1406  South   Yes     Female  White   63  60-69

Descriptive statistics: Minimum, maximum, mean and standard deviation


The syntax below uses the functions of the dplyr package to calculate the minimum, maximum, mean and standard deviation of serum cholesterol by the health status of respondents. Results are presented in Table 2.

table2 <- nhanesdata %>% 
    group_by(health) %>% 
    summarise(Frequency = n(),
              Minimum = min(cholesterol),
              Maximum = max(cholesterol),
              Mean = mean(cholesterol), 
              SD = sd(cholesterol))
names(table2)[1] = c("Health status")

kbl(table2, digits = 2, 
    caption = "Table 2: Descriptive statistics of serum cholesterol by health status.") %>%
  kable_styling(bootstrap_options = "striped", full_width = FALSE, position = "left")
Table 2: Descriptive statistics of serum cholesterol by health status.
Health status  Frequency  Minimum  Maximum    Mean     SD
Poor                 729       80      457  229.49  53.21
Fair                1670       88      828  228.49  52.59
Average             2940       85      644  220.35  49.36
Good                2591       92      492  212.64  47.52
Excellent           2407       84      426  208.65  45.39
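
The summary functions used above return NA for any group that contains a missing value. The serum cholesterol results show no NA values, but other variables may behave differently. The base R sketch below, on made-up data, shows the effect of the na.rm argument; the toy data frame and its values are assumptions for this example.

```r
# Toy data (made up): one missing cholesterol value in the "Fair" group
toy <- data.frame(
    health      = c("Poor", "Poor", "Fair", "Fair", "Fair"),
    cholesterol = c(200, 240, 180, NA, 220)
)

# Without na.rm, the group containing the NA yields NA
means_default <- tapply(toy$cholesterol, toy$health, mean)

# With na.rm = TRUE, missing values are dropped before the mean is computed
means_na_rm <- tapply(toy$cholesterol, toy$health, mean, na.rm = TRUE)

print(means_default)
print(means_na_rm)
```

The same argument can be passed inside summarise(), e.g. mean(cholesterol, na.rm = TRUE).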

Descriptive statistics: Minimum, maximum, median and interquartile range


In this example, we again use the functions of the dplyr package to calculate the minimum, maximum, median and interquartile range of serum cholesterol by the health status of respondents. Note that the interquartile range is the difference between the upper and lower quartiles, computed with the code IQR = diff(quantile(cholesterol, c(1, 3)/4)) inside the summarise() function in the code below. Results are presented in Table 3.

table3 <- nhanesdata %>% 
    group_by(health) %>% 
    summarise(Frequency = n(),
              Minimum = min(cholesterol),
              Maximum = max(cholesterol),
              Median = median(cholesterol), 
              IQR = diff(quantile(cholesterol, c(1, 3)/4)))
names(table3)[1] = c("Health status")

kbl(table3, digits = 2, 
    caption = "Table 3: Descriptive statistics of serum cholesterol by health status.") %>%
  kable_styling(bootstrap_options = "striped", full_width = FALSE, position = "left")
Table 3: Descriptive statistics of serum cholesterol by health status.
Health status  Frequency  Minimum  Maximum  Median  IQR
Poor                 729       80      457     223   68
Fair                1670       88      828     223   66
Average             2940       85      644     216   65
Good                2591       92      492     208   61
Excellent           2407       84      426     204   61
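
The expression diff(quantile(x, c(1, 3)/4)) gives the same result as the IQR() function from the stats package, which is loaded with base R. A quick check on made-up values (the vector x is an assumption for this example):

```r
# Toy data (made up for illustration)
x <- c(150, 180, 200, 210, 225, 240, 300)

# Difference between the upper (75%) and lower (25%) quartiles
iqr_manual <- unname(diff(quantile(x, c(1, 3) / 4)))

# Built-in equivalent from the stats package
iqr_builtin <- IQR(x)

print(c(manual = iqr_manual, builtin = iqr_builtin))
```

Either form can be used inside summarise(); IQR(cholesterol) is simply shorter to read.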

Descriptive statistics: Multiple categorical variables


When the descriptive statistics of interest are to be summarized by multiple categorical variables, all of those variables should be included in the group_by() function. In the code below, we summarize the descriptive statistics by two categorical variables: the race and sex of respondents. The results of running the code are presented in Table 4.

table4 <- nhanesdata %>% 
    group_by(race, sex) %>% 
    summarise(Frequency = n(),
              Minimum = min(cholesterol),
              Maximum = max(cholesterol),
              Mean = mean(cholesterol), 
              SD = sd(cholesterol))
names(table4)[1:2] = c("Race", "Sex")

kbl(table4, 
    digits = 2, 
    caption = "Table 4: Descriptive statistics of serum cholesterol by race and sex.") %>%
  kable_styling(bootstrap_options = "striped", full_width = FALSE, position = "left")
Table 4: Descriptive statistics of serum cholesterol by race and sex.
Race   Sex     Frequency  Minimum  Maximum    Mean     SD
White  Male         4306       84      468  213.57  45.62
White  Female       4745       88      828  222.47  52.02
Black  Male          500      105      418  209.12  50.22
Black  Female        586       80      471  217.27  52.27
Other  Male          103      113      315  215.86  38.47
Other  Female         97      117      316  211.70  45.39
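
When summarise() follows a group_by() with more than one variable, dplyr by default drops only the last grouping level, so the result is still grouped by the first variable (race in the example above). The message about this is what options(dplyr.summarise.inform = FALSE) suppresses. The sketch below, on made-up data, makes the choice explicit with the .groups argument (available in dplyr 1.0.0 and later); the toy data frame is an assumption for this example.

```r
library(dplyr)

# Toy data (made up for illustration)
toy <- data.frame(
    race        = c("White", "White", "Black", "Black", "White", "Black"),
    sex         = c("Male", "Female", "Male", "Female", "Male", "Female"),
    cholesterol = c(210, 220, 205, 215, 230, 225)
)

# .groups = "drop" returns an ungrouped data frame
toy_summary <- toy %>%
    group_by(race, sex) %>%
    summarise(Frequency = n(),
              Mean = mean(cholesterol),
              .groups = "drop")

print(toy_summary)
print(is_grouped_df(toy_summary))
```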

Descriptive statistics: Multiple quantitative variables


In the above examples, we have just considered multiple descriptive statistics of a single quantitative variable by one or two categorical variables. In the next examples, we illustrate how to compute two statistics for multiple quantitative variables by a single categorical variable. The first example shows the calculation of the mean and standard deviation while the second shows the calculation of the median and interquartile range.

We begin by creating the function combinestats() that will be used to combine the mean and (standard deviation) in the first example, and the median and (interquartile range) in the second example. This function accepts two arguments, data and statistic, which specify a vector of data values and the statistic to compute (mean/median) respectively.

combinestats = function(data, statistic = "mean"){
  
    # combinestats function returns the mean (standard deviation) or the 
    # median (interquartile range)
    # Inputs:
        # data: vector containing the data values
        # statistic: statistic to calculate (mean/median), default is mean
    # Outputs:
        # mean (standard deviation) or median (interquartile range)
  
    if (statistic == "mean"){
        data.mean = round(mean(data), 2)
        data.sd = round(sd(data), 2)
        mean.sd = cbind(paste0(c(data.mean, " (", data.sd, ")"), collapse = ""))
        return(mean.sd)
    }else if (statistic == "median"){
        data.median = round(median(data), 2)
        data.iqr = round(diff(quantile(data, c(1, 3)/4)), 2)
        median.iqr = cbind(paste0(c(data.median, " (", data.iqr, ")"), collapse = ""))
        return(median.iqr)
    }else{
        stop(statistic, " is an invalid value. Valid values are: mean or median.")
    }
}

We now apply the above function to the data set as shown in the code below in order to calculate the mean and standard deviation of the specified variables. Results are presented in Table 5, where the values in brackets are the standard deviations.

table5 <- nhanesdata %>% 
    group_by(health) %>% 
    summarise(Frequency = n(),
              Age = combinestats(age),
              BMI = combinestats(bmi),
              SystolicBP = combinestats(bpsystolic),
              DiastolicBP = combinestats(bpdiastolic),
              Cholesterol = combinestats(cholesterol)) %>%
  ungroup() %>%
  mutate(health = as.character(health)) %>%   # convert to character so a "Total" row can be appended
  bind_rows(summarise(nhanesdata,             # overall row computed on the full data set
                      health = "Total", 
                      Frequency = n(),
                      Age = combinestats(age),
                      BMI = combinestats(bmi),
                      SystolicBP = combinestats(bpsystolic),
                      DiastolicBP = combinestats(bpdiastolic),
                      Cholesterol = combinestats(cholesterol)))
names(table5)[1] = c("Health status")

kbl(table5, digits = 2, align = c("l", rep("r", times = ncol(table5)-1)),
    caption = "Table 5: Mean (standard deviation) of multiple quantitative variables by health status.") %>%
  kable_styling(bootstrap_options = "striped", full_width = FALSE, position = "left")
Table 5: Mean (standard deviation) of multiple quantitative variables by health status.
Health status  Frequency            Age           BMI      SystolicBP    DiastolicBP     Cholesterol
Poor                 729  59.99 (11.48)  26.48 (5.91)  140.74 (26.06)  84.96 (13.73)  229.49 (53.21)
Fair                1670  56.37 (14.65)  26.59 (5.52)  139.04 (26.06)  84.31 (13.58)  228.49 (52.59)
Average             2940  49.14 (16.58)  25.81 (4.93)  132.34 (22.88)  82.69 (13.05)  220.35 (49.36)
Good                2591   43.5 (17.18)   25.2 (4.66)    127.3 (21.6)  80.36 (12.41)  212.64 (47.52)
Excellent           2407  40.14 (15.95)   24.56 (4.1)  124.32 (19.67)  79.22 (11.93)  208.65 (45.39)
Total              10337  47.56 (17.22)  25.54 (4.91)  130.88 (23.34)  81.72 (12.93)   217.66 (49.4)
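
Some entries in Table 5 (for example 43.5 and 127.3) show fewer decimal places than others because round() does not keep trailing zeros once a number is pasted into a string. If a fixed number of decimals is preferred, sprintf() is one option. The helper format_mean_sd() below is a hypothetical alternative written for this example, not part of the combinestats() function above.

```r
# Hypothetical helper: "mean (sd)" with a fixed number of decimal places
format_mean_sd <- function(x, digits = 2) {
    fmt <- paste0("%.", digits, "f (%.", digits, "f)")
    sprintf(fmt, mean(x), sd(x))
}

# Toy data (made up for illustration)
x <- c(43, 43.5, 44)

# round() drops trailing zeros once the number is pasted into a string
rounded <- paste0(round(mean(x), 2), " (", round(sd(x), 2), ")")
print(rounded)            # "43.5 (0.5)"

# sprintf() keeps them
fixed <- format_mean_sd(x)
print(fixed)              # "43.50 (0.50)"
```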

In the following code, we again apply the combinestats() function but this time, to calculate the median and interquartile range for the specified variables. Results are presented in Table 6 with values in brackets representing the interquartile ranges.

table6 <- nhanesdata %>% 
    group_by(health) %>% 
    summarise(Frequency = n(),
              Age = combinestats(age, statistic = "median"),
              BMI = combinestats(bmi, statistic = "median"),
              SystolicBP = combinestats(bpsystolic, statistic = "median"),
              DiastolicBP = combinestats(bpdiastolic, statistic = "median"),
              Cholesterol = combinestats(cholesterol, statistic = "median")) %>%
  ungroup() %>%
  mutate(health = as.character(health)) %>%   # convert to character so a "Total" row can be appended
  bind_rows(summarise(nhanesdata,             # overall row computed on the full data set
                      health = "Total", 
                      Frequency = n(),
                      Age = combinestats(age, statistic = "median"),
                      BMI = combinestats(bmi, statistic = "median"),
                      SystolicBP = combinestats(bpsystolic, statistic = "median"),
                      DiastolicBP = combinestats(bpdiastolic, statistic = "median"),
                      Cholesterol = combinestats(cholesterol, statistic = "median")))
names(table6)[1] = c("Health status")

kbl(table6, digits = 2, align = c("l", rep("r", times = ncol(table6)-1)),
    caption = "Table 6: Median (interquartile range) of quantitative variables by health status.") %>%
  kable_styling(bootstrap_options = "striped", full_width = FALSE, position = "left")
Table 6: Median (interquartile range) of quantitative variables by health status.
Health status  Frequency      Age           BMI  SystolicBP  DiastolicBP  Cholesterol
Poor                 729  62 (12)   25.6 (7.32)    138 (35)      84 (18)     223 (68)
Fair                1670  62 (19)  25.85 (6.65)  135.5 (32)    84 (16.5)     223 (66)
Average             2940  52 (30)  25.21 (5.94)    130 (30)      80 (18)     216 (65)
Good                2591  40 (33)  24.55 (5.65)  124 (29.5)      80 (20)     208 (61)
Excellent           2407  36 (26)  23.92 (5.01)    120 (26)      80 (18)     204 (61)
Total              10337  49 (32)  24.82 (5.89)    128 (28)      80 (20)     213 (64)
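
The two examples above repeat the same function call for every variable. As a possible alternative, dplyr's across() (dplyr 1.0.0 and later) applies one function to several columns at once. The sketch below uses made-up data and a simplified, hypothetical stand-in for combinestats() named median_iqr(); both are assumptions for this example.

```r
library(dplyr)

# Hypothetical stand-in for combinestats(x, statistic = "median")
median_iqr <- function(x) {
    paste0(round(median(x), 2), " (", round(IQR(x), 2), ")")
}

# Toy data (made up for illustration)
toy <- data.frame(
    health = c("Poor", "Poor", "Poor", "Good", "Good", "Good"),
    age    = c(60, 62, 70, 35, 40, 41),
    bmi    = c(26.1, 27.3, 25.0, 23.9, 24.5, 22.8)
)

# across() applies median_iqr() to each listed column within each group
toy_table <- toy %>%
    group_by(health) %>%
    summarise(Frequency = n(),
              across(c(age, bmi), median_iqr),
              .groups = "drop")

print(toy_table)
```

The output columns keep the original variable names (age and bmi), so they would still need renaming if labels such as Age and BMI are wanted.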

Calculating p values


Tests of differences in group means are used to determine whether the means of groups in a sample are different at a specified level of significance. We consider two tests that are commonly used for this purpose: the one-way analysis of variance (ANOVA) and the Kruskal-Wallis test.

Begin by labeling a few variables of interest in the nhanesdata data frame. This step just ensures that we have easy to understand labels in the output table. Note that by default, variable names are used as labels in the output table results.

labels(nhanesdata)  = c(age = "Age in years", 
                        cholesterol = "Serum cholesterol", 
                        bmi = "Body mass index")

The code below by default computes the mean (standard deviation) and the range (minimum-maximum), then performs a one-way analysis of variance to compare the means of each quantitative variable by sex (male/female). Interpretation of statistical significance is discussed in the section Hypothesis testing and p-values.

tab6 = tableby(sex ~ age + bmi + cholesterol, 
               data = nhanesdata)

summary(tab6, digits = 2, digits.p = 3, 
        title = 'Table 7: Descriptive statistics of quantitative variables and p-values', 
        pfootnote = TRUE)
Table 7: Descriptive statistics of quantitative variables and p-values
                   Male (N=4909)   Female (N=5428)  Total (N=10337)  p value
Age in years                                                         0.365¹
   Mean (SD)       47.40 (17.17)   47.71 (17.26)    47.56 (17.22)
   Range           20.00 - 74.00   20.00 - 74.00    20.00 - 74.00
Body mass index                                                      0.575¹
   Mean (SD)       25.51 (4.02)    25.56 (5.60)     25.54 (4.91)
   Range           12.39 - 53.04   14.13 - 61.13    12.39 - 61.13
Serum cholesterol                                                    < 0.001¹
   Mean (SD)       213.17 (45.98)  221.71 (51.97)   217.66 (49.40)
   Range           84.00 - 468.00  80.00 - 828.00   80.00 - 828.00
  1. Linear Model ANOVA
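
With only two groups (male and female), a one-way ANOVA is equivalent to the equal-variance two-sample t test, so both give the same p value. The base R sketch below checks this on made-up data. It illustrates the test itself and is not the arsenal implementation; the toy data frame is an assumption for this example.

```r
# Toy data (made up for illustration)
toy <- data.frame(
    sex         = factor(rep(c("Male", "Female"), each = 5)),
    cholesterol = c(200, 210, 215, 220, 230, 205, 225, 230, 240, 250)
)

# One-way ANOVA p value
anova_fit <- aov(cholesterol ~ sex, data = toy)
p_anova <- summary(anova_fit)[[1]][["Pr(>F)"]][1]

# Equal-variance two-sample t test p value
p_ttest <- t.test(cholesterol ~ sex, data = toy, var.equal = TRUE)$p.value

print(c(anova = p_anova, t_test = p_ttest))
```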

Specifying a test


By default, the analysis of variance test (option anova) is applied when variables are continuous. You can also request the Kruskal-Wallis test by specifying the option kwt as shown in the code below (the part kwt(cholesterol, "N", "median", "iqr")). Note that we choose to calculate the median and interquartile range (iqr) for this test instead of the mean and standard deviation because the Kruskal-Wallis test is used when the data fail to meet the assumptions of parametric tests, particularly the normality assumption.

tab7 <- tableby(race ~ kwt(cholesterol, "N","median", "iqr") + 
                    anova(age, "N","mean", "sd") +
                    anova(bmi, "N","mean", "sd"), 
                    data = nhanesdata)

summary(tab7, digits = 2, digits.p = 3, 
        title = 'Table 8: Descriptive statistics of quantitative variables and p-values', 
        pfootnote = TRUE)
Table 8: Descriptive statistics of quantitative variables and p-values
                   White (N=9051)  Black (N=1086)  Other (N=200)  Total (N=10337)  p value
Serum cholesterol                                                                  0.003¹
   N               9051            1086            200            10337
   Median          213.00          208.00          211.50         213.00
   IQR             64.00           64.00           57.25          64.00
Age in years                                                                       < 0.001²
   N               9051            1086            200            10337
   Mean            47.83           45.95           44.09          47.56
   SD              17.17           17.45           17.33          17.22
Body mass index                                                                    < 0.001²
   N               9051            1086            200            10337
   Mean            25.43           26.72           24.02          25.54
   SD              4.76            5.97            4.29           4.91
  1. Kruskal-Wallis rank sum test
  2. Linear Model ANOVA
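
For readers who want to see the Kruskal-Wallis test outside of tableby(), the base R function kruskal.test() runs the same kind of rank-based comparison. The sketch below uses made-up data with three non-overlapping groups and shows the ANOVA p value alongside for comparison; the toy data frame is an assumption for this example.

```r
# Toy data (made up): three groups whose values do not overlap
toy <- data.frame(
    group = factor(rep(c("A", "B", "C"), each = 3)),
    value = c(1, 2, 3, 4, 5, 6, 7, 8, 9)
)

# Kruskal-Wallis rank sum test (compares groups using ranks, not raw values)
kw <- kruskal.test(value ~ group, data = toy)

# One-way ANOVA on the same data, for comparison
p_anova <- summary(aov(value ~ group, data = toy))[[1]][["Pr(>F)"]][1]

print(kw)
print(c(kruskal_wallis = kw$p.value, anova = p_anova))
```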

STEM Research
https://stemresearchs.com