This module covers the fundamental building blocks of statistical analysis: descriptive statistics. For MSc students in Applied Statistics, Medical Statistics, and Data Science, understanding the mathematical underpinnings of these measures and their implementation in R is crucial for robust data management and analysis.
We will generate a simulated health dataset representing Serum Creatinine levels (mg/dL) and Hospital Stay Duration (days) for 200 patients. This dataset is used for all computations in this module.
set.seed(2024) # Ensure reproducibility
n <- 200
sim_data <- data.frame(
  Patient_ID = 1:n,
  # Log-normal distribution often fits biological markers (right-skewed)
  Creatinine = round(rlnorm(n, meanlog = 0, sdlog = 0.4), 2),
  # Poisson distribution for count data (days)
  Stay_Duration = rpois(n, lambda = 5) + 1,
  # Categorical variable
  Condition = sample(c("Mild", "Moderate", "Severe"), n, replace = TRUE, prob = c(0.5, 0.3, 0.2))
)
# Display first few rows
kable(head(sim_data), caption = "Preview of Simulated Health Data")

| Patient_ID | Creatinine | Stay_Duration | Condition |
|---|---|---|---|
| 1 | 1.48 | 6 | Moderate |
| 2 | 1.21 | 9 | Moderate |
| 3 | 0.96 | 7 | Mild |
| 4 | 0.92 | 12 | Moderate |
| 5 | 1.59 | 4 | Mild |
| 6 | 1.68 | 3 | Moderate |
Central tendency describes the center or “typical value” of a frequency distribution.
Definition: The sum of all observations divided by the number of observations.
Formula:
\(\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i\)
Usage: Used for continuous data without significant outliers.
Definition: The \(n\)-th root of the product of \(n\) numbers.
Formula:
\(GM = \sqrt[n]{\prod_{i=1}^{n} x_i} = \exp\left(\frac{1}{n}\sum \ln(x_i)\right)\)
Usage: Used for growth rates, financial indices, or highly skewed biological data (e.g., titers, bacterial counts).
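As a small illustration with hypothetical growth factors, the geometric mean captures the true multiplicative change where the arithmetic mean overstates it:
growth <- c(1.10, 0.90)        # hypothetical growth factors: +10%, then -10%
exp(mean(log(growth)))         # geometric mean ~ 0.995: slight net decline per period
mean(growth)                   # arithmetic mean = 1.00: wrongly suggests no net change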
Definition: The reciprocal of the arithmetic mean of the reciprocals.
Formula:
\(HM = \frac{n}{\sum_{i=1}^{n} \frac{1}{x_i}}\)
Usage: Used for rates and ratios (e.g., speed, precision in F1-score).
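For example, the harmonic mean gives the correct average rate over equal distances; a small sketch with hypothetical speeds:
speeds <- c(40, 60)              # hypothetical speeds (km/h) over two equal-distance legs
length(speeds) / sum(1 / speeds) # harmonic mean = 48 km/h: the true average speed
mean(speeds)                     # arithmetic mean = 50 km/h: overstates it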
Definition: The middle value when the data is ordered.
Formula:
If \(n\) is odd: \(x_{\left(\frac{n+1}{2}\right)}\); if \(n\) is even: \(\frac{x_{(n/2)} + x_{(n/2 + 1)}}{2}\)
Usage: Preferred for skewed distributions or data with outliers.
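A quick check of the two cases on toy vectors (illustrative values only):
median(c(2, 4, 7))     # odd n: the middle ordered value -> 4
median(c(2, 4, 7, 10)) # even n: mean of the two middle values -> 5.5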
Definition: The value that appears most frequently.
Usage: Used for categorical data or discrete distributions.
Definition: The average of the maximum and minimum values.
Formula:
\(MR = \frac{\max(x) + \min(x)}{2}\)
Usage: A quick estimate of center, highly sensitive to outliers.
# 1. Manual Computation on a small vector for verification
x_small <- c(2, 4, 4, 8, 100) # Note the outlier '100'
# Manual Arithmetic Mean
man_mean <- sum(x_small) / length(x_small)
# Manual Geometric Mean
man_geo <- exp(mean(log(x_small)))
# Manual Harmonic Mean
man_harm <- length(x_small) / sum(1/x_small)
print(paste("Manual Mean:", man_mean))## [1] "Manual Mean: 23.6"
## [1] "Manual Geo Mean: 7.61"
## [1] "Manual Harmonic Mean: 4.41"
# 2. Computation on Simulated Data (Creatinine)
# Base R has no built-in geometric/harmonic mean functions, so we define them:
geo_mean <- function(x) { exp(mean(log(x))) }
harm_mean <- function(x) { 1 / mean(1/x) }
get_mode <- function(v) {
  uniqv <- unique(v)
  uniqv[which.max(tabulate(match(v, uniqv)))]
}
stats_central <- data.frame(
  Measure = c("Arithmetic Mean", "Geometric Mean", "Harmonic Mean", "Median", "Mode", "Mid-range"),
  Value = c(
    mean(sim_data$Creatinine),
    geo_mean(sim_data$Creatinine),
    harm_mean(sim_data$Creatinine),
    median(sim_data$Creatinine),
    get_mode(sim_data$Creatinine),
    (min(sim_data$Creatinine) + max(sim_data$Creatinine)) / 2
  )
)
kable(stats_central, caption = "Central Tendency for Creatinine")

| Measure | Value |
|---|---|
| Arithmetic Mean | 1.0947500 |
| Geometric Mean | 1.0096571 |
| Harmonic Mean | 0.9262291 |
| Median | 1.0200000 |
| Mode | 1.2400000 |
| Mid-range | 1.4350000 |
ggplot(sim_data, aes(x = Creatinine)) +
  geom_histogram(aes(y = ..density..), bins = 30, fill = "lightblue", color = "black") +
  geom_density(alpha = .2, fill = "#FF6666") +
  geom_vline(aes(xintercept = mean(Creatinine)), color = "blue", linetype = "dashed", size = 1) +
  geom_vline(aes(xintercept = median(Creatinine)), color = "red", linetype = "dashed", size = 1) +
  annotate("text", x = 3, y = 0.5, label = "Blue: Mean\nRed: Median") +
  theme_minimal() +
  labs(title = "Distribution of Creatinine with Central Tendency")

Dispersion measures how spread out the data is around the center.
Definition: The difference between the maximum and minimum values.
Formula:
\(Range = x_{max} - x_{min}\)
Definition: The average of the squared differences from the Mean.
Formula (Sample):
\(s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}\)
Interpretation: Measures the spread in squared units.
Definition: The square root of the variance.
Formula:
\(s = \sqrt{s^2}\)
Interpretation: Measures spread in the original units of the data.
Definition: The ratio of the standard deviation to the mean.
Formula:
\(CV = (\frac{s}{\bar{x}}) \times 100\%\)
Usage: Used to compare variability between datasets with different units or widely different means.
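Because the CV is unit-free, it can directly compare the variability of Creatinine (mg/dL) with that of Stay_Duration (days); a minimal sketch using a small helper defined here for illustration:
cv <- function(x) { sd(x) / mean(x) * 100 } # coefficient of variation in %
cv(sim_data$Creatinine)    # relative variability of creatinine
cv(sim_data$Stay_Duration) # relative variability of hospital stay duration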
# 1. Manual Computation (Small vector: 2, 4, 4, 8)
x_disp <- c(2, 4, 4, 8)
n_disp <- length(x_disp)
mean_disp <- mean(x_disp) # 4.5
# Sum of squared diffs: (2-4.5)^2 + (4-4.5)^2 + (4-4.5)^2 + (8-4.5)^2
# = 6.25 + 0.25 + 0.25 + 12.25 = 19
var_manual <- 19 / (n_disp - 1) # 19 / 3 = 6.33
sd_manual <- sqrt(var_manual) # approx. 2.52
# 2. Computation on Simulated Data
creat_var <- var(sim_data$Creatinine)
creat_sd <- sd(sim_data$Creatinine)
creat_cv <- (creat_sd / mean(sim_data$Creatinine)) * 100
creat_range <- max(sim_data$Creatinine) - min(sim_data$Creatinine)
disp_results <- data.frame(
  Measure = c("Range", "Variance", "Standard Deviation", "CV (%)"),
  Value = c(creat_range, creat_var, creat_sd, creat_cv)
)
kable(disp_results, caption = "Measures of Dispersion for Creatinine")

| Measure | Value |
|---|---|
| Range | 2.3300000 |
| Variance | 0.1962130 |
| Standard Deviation | 0.4429594 |
| CV (%) | 40.4621483 |
These measures indicate the relative position of a data point within the dataset.
Definition: Values below which a certain percentage of the data falls. The \(k\)-th percentile (\(P_k\)) separates the lowest \(k\%\) of data.
Definition: Values that divide the ordered data into four equal parts (\(Q_1\), \(Q_2\), \(Q_3\)).
Definition: Values that divide the ordered data into ten equal parts (\(D_1\) to \(D_9\)); \(D_1\) is the 10th percentile.
# Quartiles
q <- quantile(sim_data$Creatinine, probs = c(0.25, 0.5, 0.75))
# Deciles
d <- quantile(sim_data$Creatinine, probs = seq(0.1, 0.9, by = 0.1))
# Specific Percentile (e.g., 95th for reference intervals)
p95 <- quantile(sim_data$Creatinine, probs = 0.95)
print(q)
## 25% 50% 75%
## 0.77 1.02 1.33
print(paste("IQR:", IQR(sim_data$Creatinine))) # Interquartile Range: Q3 - Q1
## [1] "IQR: 0.56"
The Boxplot is the standard visual for measures of position.
ggplot(sim_data, aes(x = Condition, y = Creatinine, fill = Condition)) +
  geom_boxplot() +
  theme_minimal() +
  labs(title = "Boxplot of Creatinine by Condition",
       subtitle = "Visualizing Quartiles, Median, and Outliers")

Measures of shape describe the distribution’s form beyond simple center and spread.
Definition: Measure of the asymmetry of the probability distribution.
Formula:
\(\text{Skew} = \frac{\frac{1}{n}\sum(x_i - \bar{x})^3}{s^3}\)
Interpretation: A positive value indicates a right-skewed distribution (long right tail), a negative value a left-skewed distribution, and a value near zero an approximately symmetric one; the code below treats values within ±0.5 as roughly symmetric.
Definition: Measure of the “tailedness” or peakedness relative to a normal distribution.
Formula:
\(\text{Kurt} = \frac{\frac{1}{n}\sum(x_i - \bar{x})^4}{s^4} - 3\) (Excess Kurtosis)
Types: Leptokurtic (excess kurtosis > 0: heavier tails and a sharper peak than the normal), Platykurtic (excess kurtosis < 0: lighter tails and flatter), and Mesokurtic (excess kurtosis ≈ 0: comparable to the normal distribution).
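As with the earlier measures, the moment formulas can be checked by hand before relying on a package; a minimal sketch following the formulas above (note that e1071's skewness() and kurtosis() accept a type argument whose small-sample adjustments may differ slightly from this plain version):
x_shape <- sim_data$Creatinine
skew_manual <- mean((x_shape - mean(x_shape))^3) / sd(x_shape)^3     # skewness
kurt_manual <- mean((x_shape - mean(x_shape))^4) / sd(x_shape)^4 - 3 # excess kurtosis
c(Skewness = skew_manual, Excess_Kurtosis = kurt_manual)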
library(e1071) # Provides skewness() and kurtosis()
sk_val <- skewness(sim_data$Creatinine)
ku_val <- kurtosis(sim_data$Creatinine)
shape_res <- data.frame(
  Measure = c("Skewness", "Excess Kurtosis"),
  Value = c(sk_val, ku_val),
  Interpretation = c(
    ifelse(sk_val > 0.5, "Right Skewed", ifelse(sk_val < -0.5, "Left Skewed", "Symmetric")),
    ifelse(ku_val > 0.5, "Leptokurtic (Peaked)", ifelse(ku_val < -0.5, "Platykurtic (Flat)", "Mesokurtic"))
  )
)
kable(shape_res, caption = "Shape Statistics for Creatinine")

| Measure | Value | Interpretation |
|---|---|---|
| Skewness | 0.8651310 | Right Skewed |
| Excess Kurtosis | 0.7588392 | Leptokurtic (Peaked) |
A Q-Q plot compares the data against a theoretical normal distribution to assess shape visually.
ggplot(sim_data, aes(sample = Creatinine)) +
  stat_qq() +
  stat_qq_line(color = "red") +
  theme_minimal() +
  labs(title = "Q-Q Plot for Normality Check",
       subtitle = "Deviations from the red line indicate non-normality (Skewness/Kurtosis)")

In Health Data Science, categorical data is summarized using frequencies.
Definition: The raw count of observations in a category (\(n_i\)).
Definition: The proportion of total observations in a category.
Formula:
\(f_i = \frac{n_i}{N}\)
Definition: The running total of frequencies.
Definition: A matrix (contingency table) displaying the joint frequency distribution of two or more categorical variables.
# 1. One-way Table (Condition)
abs_freq <- table(sim_data$Condition)
rel_freq <- prop.table(abs_freq)
cum_freq <- cumsum(rel_freq)
freq_table <- data.frame(
  Condition = names(abs_freq),
  Count = as.vector(abs_freq),
  Proportion = round(as.vector(rel_freq), 3),
  Cumulative_Prop = round(as.vector(cum_freq), 3)
)
kable(freq_table, caption = "Frequency Table for Patient Condition")

| Condition | Count | Proportion | Cumulative_Prop |
|---|---|---|---|
| Mild | 101 | 0.505 | 0.505 |
| Moderate | 67 | 0.335 | 0.840 |
| Severe | 32 | 0.160 | 1.000 |
# 2. Two-way Table (Condition vs Stay Duration > 5 days)
sim_data$Long_Stay <- ifelse(sim_data$Stay_Duration > 5, "Yes", "No")
cross_tab <- table(sim_data$Condition, sim_data$Long_Stay)
kable(cross_tab, caption = "Contingency Table: Condition vs Long Hospital Stay")

| | No | Yes |
|---|---|---|
| Mild | 42 | 59 |
| Moderate | 24 | 43 |
| Severe | 8 | 24 |
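Because the condition groups differ in size, raw counts are hard to compare directly; as a minimal sketch, base R's prop.table() converts the contingency table above into row proportions (each row sums to 1):
row_prop <- prop.table(cross_tab, margin = 1) # proportion of long stays within each condition
kable(round(row_prop, 3), caption = "Row Proportions: Long Stay by Condition")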