Introduction

This module covers the fundamental building blocks of statistical analysis: descriptive statistics. For MSc students in Applied Statistics, Medical Statistics, and Data Science, understanding the mathematical underpinnings of these measures and their implementation in R is crucial for robust data management and analysis.

Simulated Data Generation

We will generate a simulated health dataset representing Serum Creatinine levels (mg/dL) and Hospital Stay Duration (days) for 200 patients. This dataset is used for the dataset-level computations throughout the module, alongside small vectors for manual verification.

library(knitr) # Provides kable() for the tables below

set.seed(2024) # Ensure reproducibility

n <- 200
sim_data <- data.frame(
  Patient_ID = 1:n,
  # Log-normal distribution often fits biological markers (right-skewed)
  Creatinine = round(rlnorm(n, meanlog = 0, sdlog = 0.4), 2), 
  # Poisson distribution for count data (days)
  Stay_Duration = rpois(n, lambda = 5) + 1,
  # Categorical variable
  Condition = sample(c("Mild", "Moderate", "Severe"), n, replace = TRUE, prob = c(0.5, 0.3, 0.2))
)

# Display first few rows
kable(head(sim_data), caption = "Preview of Simulated Health Data")
Preview of Simulated Health Data
Patient_ID  Creatinine  Stay_Duration  Condition
1           1.48        6              Moderate
2           1.21        9              Moderate
3           0.96        7              Mild
4           0.92        12             Moderate
5           1.59        4              Mild
6           1.68        3              Moderate

4.1 Measures of Central Tendency

Central tendency describes the center or “typical value” of a frequency distribution.

4.1.1 The Arithmetic Mean

Definition: The sum of all observations divided by the number of observations.

Formula:

\(\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i\)

Usage: Used for continuous data without significant outliers.

4.1.2 The Geometric Mean

Definition: The \(n\)-th root of the product of \(n\) numbers.

Formula:

\(GM = \sqrt[n]{\prod_{i=1}^{n} x_i} = \exp\left(\frac{1}{n}\sum_{i=1}^{n} \ln(x_i)\right)\)

Usage: Used for growth rates, financial indices, or highly skewed biological data (e.g., titers, bacterial counts). It is defined only for strictly positive values.
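
As a small illustration of the growth-rate use case, the sketch below uses three hypothetical yearly growth factors (not taken from the simulated dataset). It suggests why the geometric mean is the natural average here: compounding it over the three periods reproduces the total growth, whereas the arithmetic mean overstates it.

```r
# Hypothetical yearly growth factors: +10%, +50%, -20%
growth <- c(1.10, 1.50, 0.80)

# Geometric mean via the log form (requires strictly positive values)
gm <- exp(mean(log(growth)))

# Total growth over the three periods is the product of the factors
total_growth <- prod(growth)

# Compounding the geometric mean over three periods recovers the total growth
compounded_gm <- gm^length(growth)

# The arithmetic mean is larger, so compounding it would overstate growth
am <- mean(growth)

print(c(geometric = gm, arithmetic = am, total = total_growth))
```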

4.1.3 The Harmonic Mean

Definition: The reciprocal of the arithmetic mean of the reciprocals.

Formula:

\(HM = \frac{n}{\sum_{i=1}^{n} \frac{1}{x_i}}\)

Usage: Used for rates and ratios (e.g., average speed over equal distances; the F1-score is the harmonic mean of precision and recall).
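
The average-speed case can be checked directly. The sketch below assumes a hypothetical trip with two legs of equal distance (60 km each) driven at 30 km/h and 60 km/h; the harmonic mean of the two speeds matches total distance divided by total time, while the arithmetic mean does not.

```r
# Hypothetical trip: two legs of equal length at different speeds
speeds <- c(30, 60)   # km/h on each leg
leg_km <- 60          # distance of each leg in km

# Harmonic mean as defined above
hm <- length(speeds) / sum(1 / speeds)

# True average speed: total distance divided by total time
total_time <- sum(leg_km / speeds)             # hours
true_avg   <- (length(speeds) * leg_km) / total_time

# Arithmetic mean of the speeds, for comparison
am <- mean(speeds)

print(c(harmonic = hm, true_average = true_avg, arithmetic = am))
```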

4.1.4 The Median

Definition: The middle value when data is ordered.

Formula:

If \(n\) is odd: \(x_{(\frac{n+1}{2})}\)

If \(n\) is even: \(\frac{x_{(n/2)} + x_{(n/2 + 1)}}{2}\)

Usage: Preferred for skewed distributions or data with outliers.
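
The odd and even cases can be written out as a short function. This is a minimal sketch for illustration only (the helper name `manual_median` is hypothetical); in practice base R's `median()` does the same job.

```r
# Median from the order statistics, following the odd/even rule
manual_median <- function(x) {
  s <- sort(x)
  n <- length(s)
  if (n %% 2 == 1) {
    s[(n + 1) / 2]                  # odd n: the single middle value
  } else {
    (s[n / 2] + s[n / 2 + 1]) / 2   # even n: average of the two middle values
  }
}

x_odd  <- c(2, 4, 4, 8, 100)
x_even <- c(2, 4, 6, 8)

med_odd  <- manual_median(x_odd)
med_even <- manual_median(x_even)

print(c(odd = med_odd, even = med_even))
```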

4.1.5 The Mode

Definition: The value that appears most frequently.

Usage: Used for categorical data or discrete distributions.
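
A dataset can have more than one mode. The sketch below, using a hypothetical helper `all_modes` and made-up category labels, returns every value tied for the highest count rather than only the first one encountered.

```r
# Return all values that share the maximum frequency
# (values are returned as character labels, as stored in the table names)
all_modes <- function(v) {
  counts <- table(v)
  names(counts)[as.vector(counts) == max(counts)]
}

# Hypothetical categorical data with a tie between "B" and "C"
categories <- c("A", "B", "B", "C", "C")
modes <- all_modes(categories)

print(modes)
```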

4.1.6 The Mid-range

Definition: The average of the maximum and minimum values.

Formula:

\(MR = \frac{\max(x) + \min(x)}{2}\)

Usage: A quick estimate of center, highly sensitive to outliers.
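
To make the sensitivity to outliers concrete, the following sketch compares the mean, median, and mid-range on a small hypothetical vector with and without one extreme value. The helper `mid_range` is defined here purely for illustration.

```r
# Hypothetical small vectors: the second adds one extreme value
x_clean   <- c(2, 4, 4, 8)
x_outlier <- c(2, 4, 4, 8, 100)

# Mid-range as defined above: average of the minimum and maximum
mid_range <- function(x) (min(x) + max(x)) / 2

sensitivity <- data.frame(
  Measure     = c("Mean", "Median", "Mid-range"),
  Without_100 = c(mean(x_clean), median(x_clean), mid_range(x_clean)),
  With_100    = c(mean(x_outlier), median(x_outlier), mid_range(x_outlier))
)

print(sensitivity)
```

The median is unchanged by the added value, the mean shifts substantially, and the mid-range shifts the most because it depends only on the two extremes.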

Computation in R

# 1. Manual Computation on a small vector for verification
x_small <- c(2, 4, 4, 8, 100) # Note the outlier '100'

# Manual Arithmetic Mean
man_mean <- sum(x_small) / length(x_small)

# Manual Geometric Mean
man_geo <- exp(mean(log(x_small)))

# Manual Harmonic Mean
man_harm <- length(x_small) / sum(1/x_small)

print(paste("Manual Mean:", man_mean))
## [1] "Manual Mean: 23.6"
print(paste("Manual Geo Mean:", round(man_geo, 2)))
## [1] "Manual Geo Mean: 7.61"
print(paste("Manual Harmonic Mean:", round(man_harm, 2)))
## [1] "Manual Harmonic Mean: 4.41"
# 2. Computation on Simulated Data (Creatinine)
# R does not have built-in Geo/Harmonic mean in base, so we define them:
geo_mean <- function(x) { exp(mean(log(x))) }
harm_mean <- function(x) { 1 / mean(1/x) }
get_mode <- function(v) {
  uniqv <- unique(v)
  uniqv[which.max(tabulate(match(v, uniqv)))]
}

stats_central <- data.frame(
  Measure = c("Arithmetic Mean", "Geometric Mean", "Harmonic Mean", "Median", "Mode", "Mid-range"),
  Value = c(
    mean(sim_data$Creatinine),
    geo_mean(sim_data$Creatinine),
    harm_mean(sim_data$Creatinine),
    median(sim_data$Creatinine),
    get_mode(sim_data$Creatinine),
    (min(sim_data$Creatinine) + max(sim_data$Creatinine)) / 2
  )
)

kable(stats_central, caption = "Central Tendency for Creatinine")
Central Tendency for Creatinine
Measure          Value
Arithmetic Mean  1.0947500
Geometric Mean   1.0096571
Harmonic Mean    0.9262291
Median           1.0200000
Mode             1.2400000
Mid-range        1.4350000

Visualization

library(ggplot2)

ggplot(sim_data, aes(x = Creatinine)) +
  # after_stat(density) replaces the deprecated ..density.. notation (ggplot2 >= 3.3.0)
  geom_histogram(aes(y = after_stat(density)), bins = 30, fill = "lightblue", color = "black") +
  geom_density(alpha = .2, fill = "#FF6666") +
  # linewidth replaces size for line geoms (ggplot2 >= 3.4.0)
  geom_vline(aes(xintercept = mean(Creatinine)), color = "blue", linetype = "dashed", linewidth = 1) +
  geom_vline(aes(xintercept = median(Creatinine)), color = "red", linetype = "dashed", linewidth = 1) +
  annotate("text", x = 3, y = 0.5, label = "Blue: Mean\nRed: Median") +
  theme_minimal() +
  labs(title = "Distribution of Creatinine with Central Tendency")


4.2 Measures of Dispersion (Variation)

Dispersion measures how spread out the data is around the center.

4.2.1 Range

Definition: The difference between the maximum and minimum values.

Formula:

\(Range = x_{max} - x_{min}\)

4.2.2 Variance (\(s^2\))

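Definition: The average of the squared deviations from the mean. For a sample, the sum of squared deviations is divided by \(n - 1\) rather than \(n\), which makes \(s^2\) an unbiased estimator of the population variance.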

Formula (Sample):

\(s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}\)

Interpretation: Measures the spread in squared units.
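
The difference between dividing by \(n - 1\) and by \(n\) can be seen on a small hypothetical vector. The sketch below computes both versions by hand and compares the sample version with base R's `var()`, which uses the \(n - 1\) denominator.

```r
x <- c(2, 4, 4, 8)   # hypothetical small sample
n <- length(x)

# Sum of squared deviations from the mean
ss <- sum((x - mean(x))^2)

sample_var <- ss / (n - 1)   # sample variance (n - 1 denominator)
pop_var    <- ss / n         # population-style variance (n denominator)

print(c(sum_sq = ss, sample = sample_var, population = pop_var, base_R = var(x)))
```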

4.2.3 Standard Deviation (\(s\))

Definition: The square root of the variance.

Formula:

\(s = \sqrt{s^2}\)

Interpretation: Measures spread in the original units of the data.

4.2.4 Coefficient of Variation (CV)

Definition: The ratio of the standard deviation to the mean.

Formula:

\(CV = \left(\frac{s}{\bar{x}}\right) \times 100\%\)

Usage: Used to compare variability between datasets with different units or widely different means. It is meaningful only for ratio-scale data with a positive mean.
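
Because the CV is a ratio, it does not change when the data are rescaled to another unit. The sketch below uses hypothetical creatinine values and the approximate conversion factor of 88.4 from mg/dL to umol/L; the helper `cv` is defined only for this illustration.

```r
# Coefficient of variation in percent
cv <- function(x) sd(x) / mean(x) * 100

creat_mgdl <- c(0.8, 1.0, 1.2, 1.5)   # hypothetical values in mg/dL
creat_umol <- creat_mgdl * 88.4       # approximate conversion to umol/L

# The standard deviation depends on the unit; the CV does not
unit_comparison <- data.frame(
  Unit = c("mg/dL", "umol/L"),
  SD   = c(sd(creat_mgdl), sd(creat_umol)),
  CV   = c(cv(creat_mgdl), cv(creat_umol))
)

print(unit_comparison)
```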

Computation in R

# 1. Manual Computation (Small vector: 2, 4, 4, 8)
x_disp <- c(2, 4, 4, 8)
n_disp <- length(x_disp)
mean_disp <- mean(x_disp) # 4.5
# Sum of squared diffs: (2-4.5)^2 + (4-4.5)^2 + (4-4.5)^2 + (8-4.5)^2
# = 6.25 + 0.25 + 0.25 + 12.25 = 19
ss_disp <- sum((x_disp - mean_disp)^2) # 19
var_manual <- ss_disp / (n_disp - 1)   # 19 / 3 = 6.33
sd_manual <- sqrt(var_manual)          # 2.52

# 2. Computation on Simulated Data
creat_var <- var(sim_data$Creatinine)
creat_sd <- sd(sim_data$Creatinine)
creat_cv <- (creat_sd / mean(sim_data$Creatinine)) * 100
creat_range <- max(sim_data$Creatinine) - min(sim_data$Creatinine)

disp_results <- data.frame(
  Measure = c("Range", "Variance", "Standard Deviation", "CV (%)"),
  Value = c(creat_range, creat_var, creat_sd, creat_cv)
)

kable(disp_results, caption = "Measures of Dispersion for Creatinine")
Measures of Dispersion for Creatinine
Measure             Value
Range               2.3300000
Variance            0.1962130
Standard Deviation  0.4429594
CV (%)              40.4621483

4.3 Measures of Position (Location)

These measures indicate the relative position of a data point within the dataset.

4.3.1 Percentiles

Definition: Values below which a certain percentage of the data falls. The \(k\)-th percentile (\(P_k\)) separates the lowest \(k\%\) of data.
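
Sample percentiles are not uniquely defined, and software packages differ in how they interpolate between observations. As an illustration on hypothetical data, the sketch below computes the 25th percentile of the integers 1 to 10 with three of the definitions available through the `type` argument of base R's `quantile()`.

```r
x <- 1:10   # hypothetical data

# Same probability, three of the nine definitions offered by quantile()
q_type7 <- unname(quantile(x, probs = 0.25, type = 7))  # R default: linear interpolation
q_type6 <- unname(quantile(x, probs = 0.25, type = 6))  # interpolation based on p * (n + 1)
q_type1 <- unname(quantile(x, probs = 0.25, type = 1))  # inverse empirical CDF, no interpolation

print(c(type7 = q_type7, type6 = q_type6, type1 = q_type1))
```

With large samples the definitions usually agree closely; with small samples it is worth stating which one was used.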

4.3.2 Quartiles

Definition: Divide the data into four equal parts.

  • \(Q_1\) (25th percentile): Lower quartile.
  • \(Q_2\) (50th percentile): Median.
  • \(Q_3\) (75th percentile): Upper quartile.
  • Interquartile Range (IQR): \(Q_3 - Q_1\). The width of the interval containing the middle 50% of the data.
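
One common use of the quartiles is flagging potential outliers. The sketch below applies the widely used 1.5 × IQR rule (an assumption introduced here for illustration, not a rule stated in this module) to a small hypothetical vector, using R's default quantile definition.

```r
x <- c(2, 4, 4, 8, 100)   # hypothetical data with one extreme value

q1  <- unname(quantile(x, 0.25))
q3  <- unname(quantile(x, 0.75))
iqr <- q3 - q1

# Fences at 1.5 * IQR beyond the quartiles (a common convention)
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# Observations outside the fences are flagged as potential outliers
outliers <- x[x < lower_fence | x > upper_fence]

print(c(Q1 = q1, Q3 = q3, IQR = iqr, lower = lower_fence, upper = upper_fence))
print(outliers)
```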

4.3.3 Deciles

Definition: Divide the data into ten equal parts (\(D_1\) to \(D_9\)). \(D_1\) is the 10th percentile.

Computation in R

# Quartiles
q <- quantile(sim_data$Creatinine, probs = c(0.25, 0.5, 0.75))

# Deciles
d <- quantile(sim_data$Creatinine, probs = seq(0.1, 0.9, by = 0.1))

# Specific percentile (e.g., the 95th, sometimes used as an upper reference limit)
p95 <- quantile(sim_data$Creatinine, probs = 0.95)

print(q)
##  25%  50%  75% 
## 0.77 1.02 1.33
print(paste("IQR:", IQR(sim_data$Creatinine)))
## [1] "IQR: 0.56"

Visualization: Boxplot

The Boxplot is the standard visual for measures of position.

ggplot(sim_data, aes(x = Condition, y = Creatinine, fill = Condition)) +
  geom_boxplot() +
  theme_minimal() +
  labs(title = "Boxplot of Creatinine by Condition",
       subtitle = "Visualizing Quartiles, Median, and Outliers")


4.4 Measures of Shape

Measures of shape describe the distribution’s form beyond simple center and spread.

4.4.1 Moments

  • 1st Moment: Mean (Center).
  • 2nd Central Moment: Variance (Spread).
  • 3rd Standardized Central Moment: Skewness (Asymmetry).
  • 4th Standardized Central Moment: Kurtosis (Tail weight, historically described as peakedness).

4.4.2 Skewness

Definition: Measure of the asymmetry of the probability distribution.

Formula:

\(\text{Skew} = \frac{\frac{1}{n}\sum(x_i - \bar{x})^3}{s^3}\)

Interpretation:

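  • 0: No asymmetry detected. Symmetric distributions such as the normal have zero skewness, but zero skewness alone does not imply normality.
  • > 0: Right-skewed (tail extends right; typically Mean > Median).
  • < 0: Left-skewed (tail extends left; typically Mean < Median).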
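
The skewness formula can be computed by hand in base R. The sketch below is an illustration under the assumption that \(s\) is the sample standard deviation (the \(n - 1\) version returned by `sd()`); the helper name `skew_manual` and the two test vectors are hypothetical.

```r
# Skewness as in the formula above: third central moment (divided by n) over s^3
skew_manual <- function(x) {
  n  <- length(x)
  m3 <- sum((x - mean(x))^3) / n
  m3 / sd(x)^3
}

x_symmetric <- c(1, 2, 3, 4, 5)       # deviations cancel out
x_right     <- c(2, 4, 4, 8, 100)     # long right tail

skew_symmetric <- skew_manual(x_symmetric)
skew_right     <- skew_manual(x_right)

print(c(symmetric = skew_symmetric, right_skewed = skew_right))
```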

4.4.3 Kurtosis

Definition: Measure of the “tailedness” of a distribution relative to a normal distribution. It is often described as peakedness, but it mainly reflects the weight of the tails.

Formula:

\(\text{Kurt} = \frac{\frac{1}{n}\sum(x_i - \bar{x})^4}{s^4} - 3\) (Excess Kurtosis)

Types:

  • Mesokurtic (~0): Tail weight similar to the normal distribution.
  • Leptokurtic (>0): Heavier tails than the normal, often with a sharper peak.
  • Platykurtic (<0): Lighter tails than the normal, often with a flatter peak.
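
As with skewness, the excess kurtosis formula can be evaluated directly. The sketch below again assumes \(s\) is the sample standard deviation from `sd()`; the helper `kurt_manual` and both vectors are hypothetical and chosen to produce one light-tailed and one heavy-tailed case.

```r
# Excess kurtosis as in the formula above: fourth central moment (divided by n) over s^4, minus 3
kurt_manual <- function(x) {
  n  <- length(x)
  m4 <- sum((x - mean(x))^4) / n
  m4 / sd(x)^4 - 3
}

x_light <- c(-1, 1, -1, 1)             # all mass at two points, no tails
x_heavy <- c(rep(0, 18), -10, 10)      # mostly central values with two extremes

kurt_light <- kurt_manual(x_light)
kurt_heavy <- kurt_manual(x_heavy)

print(c(light_tails = kurt_light, heavy_tails = kurt_heavy))
```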

Computation in R

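# Using the e1071 package; its default (type = 3) scales the central moments by s^3 and s^4,
# where s is the sample standard deviation, as in the formulas above
library(e1071)
sk_val <- skewness(sim_data$Creatinine)
ku_val <- kurtosis(sim_data$Creatinine)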

shape_res <- data.frame(
  Measure = c("Skewness", "Excess Kurtosis"),
  Value = c(sk_val, ku_val),
  Interpretation = c(
    ifelse(sk_val > 0.5, "Right Skewed", ifelse(sk_val < -0.5, "Left Skewed", "Symmetric")),
    ifelse(ku_val > 0.5, "Leptokurtic (Peaked)", ifelse(ku_val < -0.5, "Platykurtic (Flat)", "Mesokurtic"))
  )
)

kable(shape_res, caption = "Shape Statistics for Creatinine")
Shape Statistics for Creatinine
Measure          Value      Interpretation
Skewness         0.8651310  Right Skewed
Excess Kurtosis  0.7588392  Leptokurtic (Peaked)

Visualization: QQ-Plot

A Q-Q plot compares the data against a theoretical normal distribution to assess shape visually.

ggplot(sim_data, aes(sample = Creatinine)) +
  stat_qq() +
  stat_qq_line(color = "red") +
  theme_minimal() +
  labs(title = "Q-Q Plot for Normality Check",
       subtitle = "Deviations from the red line indicate non-normality (Skewness/Kurtosis)")


4.5 Measures of Frequency

In Health Data Science, categorical data is summarized using frequencies.

4.5.1 Absolute Frequency

Definition: The raw count of observations in a category (\(n_i\)).

4.5.2 Relative Frequency

Definition: The proportion of total observations in a category.

Formula:

\(f_i = \frac{n_i}{N}\)

4.5.3 Cumulative Frequency

Definition: The running total of frequencies across ordered categories or values.

4.5.4 Contingency Tables (Cross-tabulation)

Definition: A table displaying the joint frequency distribution of two or more categorical variables.
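
Counts in a contingency table are often easier to compare as row proportions. The sketch below builds a tiny two-way table from hypothetical vectors (not the simulated dataset) and converts it with base R's `prop.table()` using `margin = 1`, so each row sums to 1.

```r
# Hypothetical categorical data for five patients
condition <- c("Mild", "Mild", "Mild", "Severe", "Severe")
long_stay <- c("No", "No", "Yes", "Yes", "Yes")

# Joint counts
tab <- table(condition, long_stay)

# Row proportions: distribution of long_stay within each condition
row_prop <- prop.table(tab, margin = 1)

print(tab)
print(round(row_prop, 3))
```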

Computation in R

# 1. One-way Table (Condition)
abs_freq <- table(sim_data$Condition)
rel_freq <- prop.table(abs_freq)
cum_freq <- cumsum(rel_freq)

freq_table <- data.frame(
  Condition = names(abs_freq),
  Count = as.vector(abs_freq),
  Proportion = round(as.vector(rel_freq), 3),
  Cumulative_Prop = round(as.vector(cum_freq), 3)
)

kable(freq_table, caption = "Frequency Table for Patient Condition")
Frequency Table for Patient Condition
Condition  Count  Proportion  Cumulative_Prop
Mild       101    0.505       0.505
Moderate   67     0.335       0.840
Severe     32     0.160       1.000
# 2. Two-way Table (Condition vs Stay Duration > 5 days)
sim_data$Long_Stay <- ifelse(sim_data$Stay_Duration > 5, "Yes", "No")
cross_tab <- table(sim_data$Condition, sim_data$Long_Stay)

kable(cross_tab, caption = "Contingency Table: Condition vs Long Hospital Stay")
Contingency Table: Condition vs Long Hospital Stay
Condition  No  Yes
Mild       42  59
Moderate   24  43
Severe     8   24

Visualization: Bar Chart

ggplot(freq_table, aes(x = Condition, y = Count, fill = Condition)) +
  geom_col() + # Equivalent to geom_bar(stat = "identity")
  geom_text(aes(label = paste0(Proportion*100, "%")), vjust = -0.5) +
  theme_minimal() +
  labs(title = "Frequency of Patient Conditions", y = "Count")


End of Module IV