In the applied statistics pipeline, descriptive statistics is the first step of Exploratory Data Analysis (EDA). While inferential statistics lets us draw conclusions about a population from a sample, descriptive statistics provides the tools to summarize, organize, and visualize the features of the dataset at hand.
For an MSc-level understanding, we move beyond simple averages and examine the properties of a distribution through the lens of statistical moments and robustness.
Throughout this chapter, we will use a simulated medical dataset representing a Cardiovascular Health Study. This dataset contains variables common in biostatistics:
1. Age (Numeric)
2. Cholesterol (Numeric, mg/dL)
3. SystolicBP (Numeric, mmHg)
4. Risk_Group (Categorical: Low, High)
set.seed(42)
n <- 200
# Simulate Data
data <- data.frame(
ID = 1:n,
Age = round(rnorm(n, mean = 55, sd = 10)),
Cholesterol = round(rgamma(n, shape = 10, scale = 20) + 50), # Right skewed
Risk_Group = sample(c("Low", "High"), n, replace = TRUE, prob = c(0.6, 0.4))
)
# Introduce a linear relationship for Systolic BP based on Age and Noise
data$SystolicBP <- round(100 + 0.5 * data$Age + rnorm(n, 0, 10))
# Introduce outliers manually for demonstration
data$Cholesterol[c(1, 10)] <- c(600, 580)
head(data)
## ID Age Cholesterol Risk_Group SystolicBP
## 1 1 69 600 Low 135
## 2 2 49 228 Low 109
## 3 3 59 388 Low 109
## 4 4 61 165 Low 136
## 5 5 59 253 Low 129
## 6 6 54 278 High 121
Central tendency describes the center of the data distribution.
The arithmetic mean (\(\bar{x}\)) is the most common measure of central tendency, representing the “center of gravity” of the data. Mathematically, for a sample vector \(x\) of size \(n\):
\[ \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i \]
Properties: 1. It uses every value in the data. 2. \(\sum(x_i - \bar{x}) = 0\). 3. Sensitive to outliers.
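The properties of the mean listed above can be checked on a small toy vector. This is a minimal sketch; the values are hypothetical and not taken from the study data.

```r
# Toy sample (hypothetical values, not drawn from the study data)
x <- c(180, 200, 220, 240, 260)

x_bar   <- mean(x)            # arithmetic mean: sum(x) / length(x)
dev_sum <- sum(x - x_bar)     # deviations from the mean cancel out

# Replace the largest value with an extreme one to show sensitivity
x_out <- c(180, 200, 220, 240, 600)

cat("Mean:               ", x_bar, "\n")
cat("Sum of deviations:  ", dev_sum, "\n")
cat("Mean with outlier:  ", mean(x_out), "\n")
cat("Median with outlier:", median(x_out), "\n")
```

A single extreme value shifts the mean substantially while the median stays where it was.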
The median is the value separating the higher half from the lower half of the data. It is the \(0.5\) quantile.
Properties: 1. Robust to outliers. 2. Minimizes the sum of absolute deviations: the median is a solution of \(\min_{c} \sum_{i=1}^{n} |x_i - c|\).
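The minimization property can be illustrated numerically. The sketch below uses base R's `optimize()` on a hypothetical sample; the helper names `sad` and `ssd` are introduced here purely for illustration. For contrast, it also minimizes the sum of squared deviations, whose minimizer is the mean.

```r
x <- c(2, 3, 5, 8, 40)   # hypothetical sample with one extreme value

# Sum of absolute deviations around a candidate centre
sad <- function(centre) sum(abs(x - centre))

# Sum of squared deviations around a candidate centre
ssd <- function(centre) sum((x - centre)^2)

# One-dimensional numerical minimisation over the observed range
opt_abs <- optimize(sad, interval = range(x))
opt_sq  <- optimize(ssd, interval = range(x))

cat("Minimiser of absolute deviations:", round(opt_abs$minimum, 2),
    "| median:", median(x), "\n")
cat("Minimiser of squared deviations: ", round(opt_sq$minimum, 2),
    "| mean:  ", mean(x), "\n")
```

Because `optimize()` is a numerical routine, the minimizers agree with the median and mean only up to its tolerance.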
Let’s compare the mean and median of our Cholesterol variable, which contains outliers.
# Calculate statistics
mean_chol <- mean(data$Cholesterol)
median_chol <- median(data$Cholesterol)
cat("Mean Cholesterol: ", round(mean_chol, 2), "\n")
## Mean Cholesterol: 242.77
cat("Median Cholesterol:", round(median_chol, 2), "\n")
## Median Cholesterol: 231.5
MSc Insight: Notice the discrepancy. The mean is pulled upward by the right tail of the distribution and by the two injected outliers (600, 580), whereas the median remains representative of the “typical” patient. In applied statistics, reporting both gives a quick indication of the skewness of the data.
Dispersion quantifies the spread or variability of the data.
The sample variance (\(s^2\)) measures the average squared deviation from the mean. We use Bessel’s correction (\(n-1\)) to ensure \(s^2\) is an unbiased estimator of the population variance \(\sigma^2\).
\[ s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 \]
The Standard Deviation (\(s\)) is simply \(\sqrt{s^2}\).
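To connect the formula to R's built-in functions, the following sketch computes the variance by hand on a hypothetical sample and compares the \(n-1\) and \(n\) denominators. `var()` and `sd()` in base R use the \(n-1\) denominator.

```r
x <- c(4, 8, 15, 16, 23, 42)   # hypothetical sample
n <- length(x)

ss <- sum((x - mean(x))^2)     # sum of squared deviations from the mean

s2_manual <- ss / (n - 1)      # Bessel-corrected sample variance
s2_biased <- ss / n            # divides by n; systematically smaller

cat("Manual s^2 (n - 1):", s2_manual, "\n")
cat("var(x):            ", var(x), "\n")
cat("Divide-by-n value: ", s2_biased, "\n")
cat("Manual s:          ", sqrt(s2_manual), " sd(x):", sd(x), "\n")
```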
The Interquartile Range (IQR) is the distance between the 75th percentile (\(Q_3\)) and the 25th percentile (\(Q_1\)).
\[ IQR = Q_3 - Q_1 \]
This is a robust measure of spread.
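A small sketch of what "robust" means here: adding one extreme value to a hypothetical sample inflates the standard deviation far more than the IQR. Note that `quantile()` and `IQR()` default to R's type 7 quantile definition, so other software may report slightly different quartiles.

```r
x     <- c(150, 170, 190, 210, 230, 250, 270)   # hypothetical sample
x_out <- c(x, 600)                              # same sample plus one extreme value

# IQR computed from the quartiles (default quantile type = 7)
q <- quantile(x, probs = c(0.25, 0.75))
iqr_manual <- unname(q[2] - q[1])

# Relative inflation of each spread measure caused by the extreme value
sd_ratio  <- sd(x_out) / sd(x)
iqr_ratio <- IQR(x_out) / IQR(x)

cat("IQR (manual):", iqr_manual, " IQR():", IQR(x), "\n")
cat("SD inflation factor: ", round(sd_ratio, 2), "\n")
cat("IQR inflation factor:", round(iqr_ratio, 2), "\n")
```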
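The Coefficient of Variation (CV) expresses the standard deviation as a percentage of the mean. It is useful for comparing variability between datasets with different units or widely different means, and is meaningful only for variables measured on a ratio scale with a positive mean.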
\[ CV = \frac{s}{\bar{x}} \times 100\% \]
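Because the CV is a ratio, it does not change when the data are rescaled to different units. The sketch below illustrates this with hypothetical cholesterol values; the conversion factor `k` is an approximate value assumed for illustration, and `cv` is a helper defined here, not a base R function.

```r
chol_mgdl <- c(180, 200, 220, 240, 260)   # hypothetical values in mg/dL
k <- 38.67                                # approximate mg/dL per mmol/L (assumption for illustration)
chol_mmol <- chol_mgdl / k                # same values on a different scale

# Coefficient of variation in percent (helper defined for this example)
cv <- function(x) sd(x) / mean(x) * 100

cat("SD (mg/dL): ", round(sd(chol_mgdl), 2), "\n")
cat("SD (mmol/L):", round(sd(chol_mmol), 2), "\n")
cat("CV (mg/dL): ", round(cv(chol_mgdl), 2), "%\n")
cat("CV (mmol/L):", round(cv(chol_mmol), 2), "%\n")
```

The standard deviation depends on the unit of measurement, while the CV is the same on both scales.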
sd_chol <- sd(data$Cholesterol)
iqr_chol <- IQR(data$Cholesterol)
cv_chol <- (sd_chol / mean(data$Cholesterol)) * 100
cat("Standard Deviation:", round(sd_chol, 2), "\n")
## Standard Deviation: 73.37
cat("IQR: ", round(iqr_chol, 2), "\n")
## IQR: 79
cat("CV (%): ", round(cv_chol, 2), "%\n")
## CV (%): 30.22 %
To fully understand a distribution, we look beyond center and spread.
Skewness (\(\gamma_1\)) measures asymmetry.
\[ \gamma_1 = E\left[\left(\frac{X-\mu}{\sigma}\right)^3\right] \]
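Kurtosis (\(\beta_2\)) measures the “tailedness” of a distribution relative to a normal distribution; it reflects the weight of the tails rather than the sharpness of the peak. The kurtosis() function from the moments package returns the raw kurtosis (Normal = 3), whereas some other packages report excess kurtosis, so check the documentation of the function you use. Excess kurtosis is \(\beta_2 - 3\).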
\[ \beta_2 = E\left[\left(\frac{X-\mu}{\sigma}\right)^4\right] \]
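The two expectations above can be estimated by replacing them with sample averages. The following is a minimal base-R sketch of these moment-based estimators; `moment_shape` is a hypothetical helper written for this example, and the input vectors are made up. Other estimators with small-sample corrections exist and will give slightly different values.

```r
# Moment-based sample skewness and (raw) kurtosis
moment_shape <- function(x) {
  d  <- x - mean(x)        # deviations from the sample mean
  m2 <- mean(d^2)          # second central moment (divides by n)
  m3 <- mean(d^3)          # third central moment
  m4 <- mean(d^4)          # fourth central moment
  c(skewness = m3 / m2^(3 / 2),
    kurtosis = m4 / m2^2)  # raw kurtosis; subtract 3 for excess kurtosis
}

x_sym  <- c(1, 2, 3, 4, 5)          # symmetric toy sample
x_skew <- c(1, 1, 2, 2, 3, 10)      # toy sample with a long right tail

shape_sym  <- moment_shape(x_sym)
shape_skew <- moment_shape(x_skew)

print(round(shape_sym, 3))
print(round(shape_skew, 3))
```

The symmetric sample has zero skewness, while the sample with a long right tail has positive skewness.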
library(moments)  # provides skewness() and kurtosis()
skew_val <- skewness(data$Cholesterol)
kurt_val <- kurtosis(data$Cholesterol)
cat("Skewness:", round(skew_val, 2), "(Positive indicates right-skew)\n")
## Skewness: 1.49 (Positive indicates right-skew)
cat("Kurtosis:", round(kurt_val, 2), "(>3 indicates heavy tails)\n")
## Kurtosis: 7.24 (>3 indicates heavy tails)
Visuals are often more powerful than summary tables. We use ggplot2 for publication-quality figures.
Visualizing the distribution of Cholesterol.
library(ggplot2)
p1 <- ggplot(data, aes(x = Cholesterol)) +
  # after_stat(density) replaces the deprecated ..density.. notation
  geom_histogram(aes(y = after_stat(density)), binwidth = 20, fill = "skyblue", color = "black", alpha = 0.7) +
  # linewidth replaces size for line geoms in recent ggplot2 versions
  geom_density(color = "red", linewidth = 1) +
  geom_vline(xintercept = mean(data$Cholesterol), color = "blue", linetype = "dashed", linewidth = 1) +
  geom_vline(xintercept = median(data$Cholesterol), color = "darkgreen", linetype = "dashed", linewidth = 1) +
  labs(title = "Cholesterol Distribution", subtitle = "Blue Dashed: Mean, Green Dashed: Median") +
  theme_minimal()
print(p1)
Figure 1: Distribution of Cholesterol Levels showing right-skewness.
Boxplots are excellent for detecting outliers and comparing distributions across groups.
p2 <- ggplot(data, aes(x = Risk_Group, y = SystolicBP, fill = Risk_Group)) +
geom_boxplot(outlier.colour = "red", outlier.shape = 8, outlier.size = 3) +
labs(title = "Systolic BP by Risk Group", y = "Systolic BP (mmHg)") +
theme_minimal()
print(p2)
Figure 2: Boxplot of Systolic BP by Risk Group.
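Boxplots conventionally flag points lying more than 1.5 × IQR beyond the quartiles (Tukey's rule). The sketch below applies that rule by hand to a hypothetical vector of blood pressure readings and compares it with base R's `boxplot.stats()`, which uses hinges rather than type 7 quartiles and can therefore differ slightly on other data.

```r
x <- c(110, 118, 121, 125, 127, 130, 133, 136, 140, 185)  # hypothetical BP readings

q1  <- unname(quantile(x, 0.25))
q3  <- unname(quantile(x, 0.75))
iqr <- q3 - q1

# Tukey fences: points outside these limits are flagged as potential outliers
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

flagged <- x[x < lower_fence | x > upper_fence]

cat("Fences: [", lower_fence, ",", upper_fence, "]\n")
cat("Flagged by manual rule:   ", flagged, "\n")
cat("Flagged by boxplot.stats():", boxplot.stats(x)$out, "\n")
```

A flagged point is a candidate for inspection, not automatically an error; the rule is a screening convention.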
Descriptive statistics also examine relationships between variables.
Covariance measures the joint variability of two random variables. Its sample version is \[ Cov(X, Y) = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) \]
Correlation is a standardized measure of the relationship between two variables (bounds: -1 to +1).
Pearson’s \(r\): Measures linear relationship. Sensitive to outliers. \[ r = \frac{Cov(X,Y)}{s_x s_y} \]
Spearman’s \(\rho\): Rank-based correlation. Measures monotonic relationships. Robust to outliers.
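The definitions above can be reproduced by hand: Pearson's \(r\) is the covariance divided by the product of the standard deviations, and Spearman's \(\rho\) is Pearson's \(r\) computed on the ranks. A minimal sketch on hypothetical values, with one extreme observation in `y`:

```r
x <- c(45, 50, 55, 60, 65, 70)          # hypothetical ages
y <- c(118, 125, 122, 131, 138, 190)    # hypothetical BP; last value is extreme
n <- length(x)

# Sample covariance with the n - 1 denominator
cov_manual <- sum((x - mean(x)) * (y - mean(y))) / (n - 1)

# Pearson: covariance standardized by both standard deviations
r_manual <- cov_manual / (sd(x) * sd(y))

# Spearman: Pearson correlation applied to the ranks
rho_manual <- cor(rank(x), rank(y))

cat("Covariance:", cov_manual, " cov():", cov(x, y), "\n")
cat("Pearson r: ", round(r_manual, 3), " cor():", round(cor(x, y), 3), "\n")
cat("Spearman:  ", round(rho_manual, 3),
    " cor(method = 'spearman'):", round(cor(x, y, method = "spearman"), 3), "\n")
```

Because ranks only record order, the size of the extreme value in `y` has no effect on Spearman's \(\rho\) beyond its rank.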
# Pearson Correlation
cor_pearson <- cor(data$Age, data$SystolicBP, method = "pearson")
# Spearman Correlation
cor_spearman <- cor(data$Age, data$SystolicBP, method = "spearman")
cat("Pearson Correlation (Age vs BP): ", round(cor_pearson, 3), "\n")
## Pearson Correlation (Age vs BP): 0.454
cat("Spearman Correlation (Age vs BP):", round(cor_spearman, 3), "\n")
## Spearman Correlation (Age vs BP): 0.462
p3 <- ggplot(data, aes(x = Age, y = SystolicBP)) +
geom_point(alpha = 0.6) +
geom_smooth(method = "lm", color = "blue", se = TRUE) +
labs(title = "Correlation: Age vs Systolic BP",
subtitle = paste("Pearson r =", round(cor_pearson, 2))) +
theme_minimal()
print(p3)
Figure 3: Scatter plot showing the relationship between Age and Systolic BP.
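In this chapter, we explored the fundamental building blocks of applied statistics: measures of central tendency (mean, median), measures of dispersion (variance, standard deviation, IQR, CV), measures of shape (skewness, kurtosis), graphical summaries (histograms, boxplots, scatter plots), and bivariate association (covariance, Pearson and Spearman correlation).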
Applied Exercise: Given the presence of outliers in the ‘Cholesterol’ variable shown in Figure 1, which measure of central tendency would you recommend reporting to a Chief Medical Officer?
Answer: The Median, as it represents the typical patient without being distorted by the extreme outlier cases.
End of Chapter 2
To reproduce this chapter, first run
install.packages(c("rmarkdown", "ggplot2", "moments", "dplyr", "gridExtra"))
in your R console. The source file is Descriptive_Statistics.Rmd; rendering it will generate an HTML document with embedded mathematical formulas, code, calculated results, and plots.