Introduction

In the pipeline of Applied Statistics, Descriptive Statistics serves as the initial step of Exploratory Data Analysis (EDA). While Inferential Statistics allows us to draw conclusions about a population based on a sample, Descriptive Statistics provides the tools to summarize, organize, and visualize the specific features of the dataset at hand.

For an MSc-level understanding, we move beyond simple averages and examine the properties of a distribution through the lens of statistical moments and robustness.

Real-Life Case Study Data

Throughout this chapter, we will use a simulated medical dataset representing a Cardiovascular Health Study. This dataset contains variables common in biostatistics:

  1. Age (Numeric)
  2. Cholesterol (Numeric, mg/dL)
  3. SystolicBP (Numeric, mmHg)
  4. Risk_Group (Categorical: Low, High)

set.seed(42)
n <- 200

# Simulate Data
data <- data.frame(
  ID = 1:n,
  Age = round(rnorm(n, mean = 55, sd = 10)),
  Cholesterol = round(rgamma(n, shape = 10, scale = 20) + 50), # Right skewed
  Risk_Group = sample(c("Low", "High"), n, replace = TRUE, prob = c(0.6, 0.4))
)

# Introduce a linear relationship for Systolic BP based on Age and Noise
data$SystolicBP <- round(100 + 0.5 * data$Age + rnorm(n, 0, 10))

# Introduce outliers manually for demonstration
data$Cholesterol[c(1, 10)] <- c(600, 580) 

head(data)
##   ID Age Cholesterol Risk_Group SystolicBP
## 1  1  69         600        Low        135
## 2  2  49         228        Low        109
## 3  3  59         388        Low        109
## 4  4  61         165        Low        136
## 5  5  59         253        Low        129
## 6  6  54         278       High        121

Measures of Central Tendency (First Moment)

Central tendency describes the center of the data distribution.

The Arithmetic Mean

The arithmetic mean is the most common measure and represents the “center of gravity” of the data. Mathematically, for a sample vector \(x\) of size \(n\):

\[ \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i \]

Properties:

  1. It uses every value in the data.
  2. The deviations from the mean sum to zero: \(\sum(x_i - \bar{x}) = 0\).
  3. It is sensitive to outliers.
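The properties of the mean can be checked numerically. The sketch below uses a small hypothetical vector of cholesterol-like values (not taken from the chapter's dataset) to show that the deviations sum to zero and that a single extreme value shifts the mean substantially.

```r
# Hypothetical cholesterol-like values (mg/dL); illustrative only
x <- c(180, 195, 210, 225, 240)

x_bar   <- mean(x)          # arithmetic mean
dev_sum <- sum(x - x_bar)   # deviations from the mean sum to zero

# Replace the largest value with an extreme one and recompute the mean
x_out      <- replace(x, 5, 600)
mean_shift <- mean(x_out) - x_bar

cat("Mean:", x_bar, "\n")
cat("Sum of deviations:", dev_sum, "\n")
cat("Shift in mean after one outlier:", mean_shift, "\n")
```

Changing one of five observations moves the mean by 72 mg/dL here, which is the sensitivity that the third property describes.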

The Median

The value separating the higher half from the lower half of the data. It is the \(0.5\) quantile.

Properties:

  1. It is robust to outliers.
  2. It minimizes the sum of absolute deviations: \(\min_{c} \sum_{i=1}^{n} |x_i - c|\) is attained at the median.
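The minimization property can be illustrated with a brute-force search. This is a minimal sketch on hypothetical values: the helper names `sum_abs_dev` and `sum_sq_dev` are introduced here for illustration, and the grid search is used only for transparency, not as a recommended way to compute a median.

```r
# Hypothetical values with one extreme observation; illustrative only
x <- c(180, 195, 210, 225, 600)

sum_abs_dev <- function(center) sum(abs(x - center))   # L1 criterion
sum_sq_dev  <- function(center) sum((x - center)^2)    # L2 criterion

# Evaluate both criteria on an integer grid covering the data range
grid     <- seq(min(x), max(x), by = 1)
best_abs <- grid[which.min(sapply(grid, sum_abs_dev))]
best_sq  <- grid[which.min(sapply(grid, sum_sq_dev))]

cat("Minimizer of absolute deviations:", best_abs, "| median:", median(x), "\n")
cat("Minimizer of squared deviations: ", best_sq,  "| mean:  ", mean(x), "\n")
```

The absolute-deviation criterion is minimized at the median, while the squared-deviation criterion is minimized at the mean, which is one way to see why the mean reacts more strongly to the extreme value.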

R Implementation and Interpretation

Let’s compare Mean and Median for our Cholesterol variable, which contains outliers.

# Calculate statistics
mean_chol <- mean(data$Cholesterol)
median_chol <- median(data$Cholesterol)

cat("Mean Cholesterol:  ", round(mean_chol, 2), "\n")
## Mean Cholesterol:   242.77
cat("Median Cholesterol:", round(median_chol, 2), "\n")
## Median Cholesterol: 231.5

MSc Insight: Notice the discrepancy. The mean is pulled upward by the outliers (600, 580), whereas the median remains representative of the “typical” patient. In applied statistics, reporting both often provides insight into the skewness of the data.


Measures of Dispersion (Second Moment)

Dispersion quantifies the spread or variability of the data.

Variance and Standard Deviation

The sample variance (\(s^2\)) measures the average squared deviation from the mean. We use Bessel’s correction (\(n-1\)) to ensure \(s^2\) is an unbiased estimator of the population variance \(\sigma^2\).

\[ s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 \]

The Standard Deviation (\(s\)) is simply \(\sqrt{s^2}\).
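As a rough check of the formula and of Bessel's correction, the sketch below computes the variance by hand on a hypothetical vector and then runs a small simulation. The simulation settings (standard normal data, samples of size 5, 10,000 replicates, seed 1) are arbitrary assumptions chosen for illustration.

```r
# Hypothetical values; illustrative only
x <- c(180, 195, 210, 225, 240)
n <- length(x)

ss           <- sum((x - mean(x))^2)   # sum of squared deviations
var_unbiased <- ss / (n - 1)           # Bessel-corrected, matches var()
var_biased   <- ss / n                 # divides by n instead
sd_manual    <- sqrt(var_unbiased)     # matches sd()

# Simulation: average both estimators over many samples from N(0, 1),
# where the true variance is 1
set.seed(1)
sims <- replicate(10000, {
  s <- rnorm(5)
  c(unbiased = sum((s - mean(s))^2) / 4,
    biased   = sum((s - mean(s))^2) / 5)
})
avg_est <- rowMeans(sims)
print(round(avg_est, 3))
```

On average the \(n-1\) version should land close to the true variance of 1, while the \(n\) version should fall short by a factor of about \((n-1)/n\).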

Interquartile Range (IQR)

The IQR is the difference between the 75th percentile (\(Q_3\)) and the 25th percentile (\(Q_1\)).

\[ IQR = Q_3 - Q_1 \]

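Because it depends only on the middle 50% of the data, the IQR is a robust measure of spread: extreme values in either tail do not affect it.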
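A short sketch of this robustness, using hypothetical values: the IQR is computed from the quartiles (R's default quantile definition, type 7) and compared with the standard deviation before and after one observation is replaced by an extreme value.

```r
# Hypothetical values; illustrative only
x <- c(180, 195, 210, 225, 240, 255, 270, 285)

q          <- quantile(x, probs = c(0.25, 0.75))   # Q1 and Q3 (type 7)
iqr_manual <- unname(q[2] - q[1])                   # same as IQR(x)

# Replace the largest value with an extreme one
x_out <- replace(x, length(x), 600)

cat("IQR before / after:", IQR(x), "/", IQR(x_out), "\n")
cat("SD  before / after:", round(sd(x), 2), "/", round(sd(x_out), 2), "\n")
```

The quartiles do not move when only the maximum changes, so the IQR is unchanged while the standard deviation increases.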

Coefficient of Variation (CV)

The CV is useful for comparing variability between datasets with different units or widely different means. It is meaningful only for ratio-scale variables with a positive mean.

\[ CV = \frac{s}{\bar{x}} \times 100\% \]

sd_chol <- sd(data$Cholesterol)
iqr_chol <- IQR(data$Cholesterol)
cv_chol <- (sd_chol / mean(data$Cholesterol)) * 100

cat("Standard Deviation:", round(sd_chol, 2), "\n")
## Standard Deviation: 73.37
cat("IQR:               ", round(iqr_chol, 2), "\n")
## IQR:                79
cat("CV (%):            ", round(cv_chol, 2), "%\n")
## CV (%):             30.22 %
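To illustrate why the CV helps when units differ, the sketch below expresses the same hypothetical cholesterol values in mg/dL and in mmol/L. The conversion factor of roughly 38.67 mg/dL per mmol/L is an approximation assumed here, and `cv()` is a helper defined only for this example.

```r
# Hypothetical cholesterol values in mg/dL; illustrative only
chol_mgdl <- c(180, 195, 210, 225, 240)

# Approximate conversion factor for cholesterol (assumed for illustration)
mgdl_per_mmol <- 38.67
chol_mmol     <- chol_mgdl / mgdl_per_mmol

cv <- function(x) sd(x) / mean(x) * 100   # coefficient of variation in %

cat("SD (mg/dL): ", round(sd(chol_mgdl), 2), "\n")
cat("SD (mmol/L):", round(sd(chol_mmol), 2), "\n")
cat("CV (mg/dL): ", round(cv(chol_mgdl), 2), "%\n")
cat("CV (mmol/L):", round(cv(chol_mmol), 2), "%\n")
```

The standard deviation changes with the unit of measurement, whereas the CV is identical in both scales because the scaling factor cancels in the ratio.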

Shape of Distribution (Higher Moments)

To fully understand a distribution, we look beyond center and spread.

Skewness (Third Moment)

Skewness (\(\gamma_1\)) measures asymmetry.

\[ \gamma_1 = E\left[\left(\frac{X-\mu}{\sigma}\right)^3\right] \]

  • \(\gamma_1 \approx 0\): Symmetrical (e.g., Normal).
  • \(\gamma_1 > 0\): Positively skewed (Right-tail is longer).
  • \(\gamma_1 < 0\): Negatively skewed (Left-tail is longer).
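The three cases above can be reproduced with a plug-in (method-of-moments) estimate of \(\gamma_1\). The function `skew_manual` and the example vectors are hypothetical; packages may apply small-sample corrections, so their results can differ slightly from this sketch.

```r
# Plug-in estimate of skewness: third central moment over m2^(3/2)
skew_manual <- function(x) {
  z  <- x - mean(x)
  m2 <- mean(z^2)
  m3 <- mean(z^3)
  m3 / m2^(3 / 2)
}

# Hypothetical vectors; illustrative only
sym_x   <- c(1, 2, 3, 4, 5)     # symmetric around 3
right_x <- c(1, 2, 3, 4, 15)    # one large value stretches the right tail
left_x  <- -right_x             # mirror image stretches the left tail

cat("Symmetric:   ", round(skew_manual(sym_x), 3), "\n")
cat("Right-skewed:", round(skew_manual(right_x), 3), "\n")
cat("Left-skewed: ", round(skew_manual(left_x), 3), "\n")
```

Mirroring the data flips the sign of the estimate but not its magnitude, which matches the interpretation of the sign as the direction of the longer tail.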

Kurtosis (Fourth Moment)

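Kurtosis (\(\beta_2\)) measures the “tailedness” of a distribution relative to a normal distribution; it is often described as peakedness, but it is driven mainly by the tails. The kurtosis function in the moments package returns the raw kurtosis (Normal = 3), whereas some other packages (e.g., e1071) return excess kurtosis. Excess Kurtosis is \(\beta_2 - 3\).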

\[ \beta_2 = E\left[\left(\frac{X-\mu}{\sigma}\right)^4\right] \]

  • Leptokurtic (>3): Heavier tails than the normal distribution (more prone to outliers).
  • Platykurtic (<3): Lighter tails than the normal distribution.

library(moments)  # provides skewness() and kurtosis()

skew_val <- skewness(data$Cholesterol)
kurt_val <- kurtosis(data$Cholesterol)  # raw kurtosis (Normal = 3)

cat("Skewness:", round(skew_val, 2), "(Positive indicates right-skew)\n")
## Skewness: 1.49 (Positive indicates right-skew)
cat("Kurtosis:", round(kurt_val, 2), "(>3 indicates heavy tails)\n")
## Kurtosis: 7.24 (>3 indicates heavy tails)
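For intuition about the benchmark value of 3, the sketch below computes a plug-in estimate of raw kurtosis on simulated draws from three distributions. The function `kurt_manual`, the seed, and the sample size are assumptions made for illustration; sample estimates for heavy-tailed data can vary considerably from run to run.

```r
# Plug-in estimate of raw kurtosis: fourth central moment over m2^2
kurt_manual <- function(x) {
  z <- x - mean(x)
  mean(z^4) / mean(z^2)^2
}

set.seed(1)
n_draws     <- 1e5
k_normal    <- kurt_manual(rnorm(n_draws))        # theoretical value 3
k_uniform   <- kurt_manual(runif(n_draws))        # theoretical value 1.8
k_student_t <- kurt_manual(rt(n_draws, df = 5))   # theoretical value 9

cat("Normal: ", round(k_normal, 2),    "| excess:", round(k_normal - 3, 2), "\n")
cat("Uniform:", round(k_uniform, 2),   "| excess:", round(k_uniform - 3, 2), "\n")
cat("t(5):   ", round(k_student_t, 2), "| excess:", round(k_student_t - 3, 2), "\n")
```

The uniform distribution has no tails beyond its bounds and comes out platykurtic, while the t distribution with 5 degrees of freedom has heavier tails than the normal and comes out leptokurtic.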

Data Visualization

Plots often reveal features, such as skewness or outliers, that summary tables hide. We use ggplot2 for publication-quality figures.

Histogram and Density Plots

The histogram below shows the distribution of Cholesterol with a kernel density overlay; dashed vertical lines mark the mean and the median.

library(ggplot2)  # needed for all plots in this chapter

# after_stat() and linewidth replace the deprecated ..density.. and size (ggplot2 >= 3.4.0)
p1 <- ggplot(data, aes(x = Cholesterol)) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 20, fill = "skyblue", color = "black", alpha = 0.7) +
  geom_density(color = "red", linewidth = 1) +
  # Pass the statistics as constants so each reference line is drawn once
  geom_vline(xintercept = mean(data$Cholesterol), color = "blue", linetype = "dashed", linewidth = 1) +
  geom_vline(xintercept = median(data$Cholesterol), color = "darkgreen", linetype = "dashed", linewidth = 1) +
  labs(title = "Cholesterol Distribution", subtitle = "Blue Dashed: Mean, Green Dashed: Median") +
  theme_minimal()

print(p1)

Figure 1: Distribution of Cholesterol Levels showing right-skewness.

Boxplots

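Boxplots are excellent for detecting outliers and comparing distributions across groups. By default, points lying more than 1.5 × IQR beyond the quartiles are drawn individually as potential outliers.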

p2 <- ggplot(data, aes(x = Risk_Group, y = SystolicBP, fill = Risk_Group)) +
  geom_boxplot(outlier.colour = "red", outlier.shape = 8, outlier.size = 3) +
  labs(title = "Systolic BP by Risk Group", y = "Systolic BP (mmHg)") +
  theme_minimal()

print(p2)

Figure 2: Boxplot of Systolic BP by Risk Group.
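The points a boxplot draws individually are usually chosen by the 1.5 × IQR rule. The sketch below applies that rule by hand to hypothetical systolic BP readings and compares the result with base R's `boxplot.stats()`; the vector and the variable names are illustrative assumptions. Hinges and quartiles can differ slightly for some sample sizes, although they coincide for this example.

```r
# Hypothetical systolic BP readings (mmHg); illustrative only
sbp <- c(110, 115, 118, 120, 122, 125, 128, 130, 170)

q1 <- unname(quantile(sbp, 0.25))
q3 <- unname(quantile(sbp, 0.75))
iqr_sbp <- q3 - q1

# Fences: values outside [lower_fence, upper_fence] are flagged
lower_fence <- q1 - 1.5 * iqr_sbp
upper_fence <- q3 + 1.5 * iqr_sbp
flagged     <- sbp[sbp < lower_fence | sbp > upper_fence]

cat("Fences:", lower_fence, "to", upper_fence, "\n")
cat("Flagged by hand:         ", flagged, "\n")
cat("Flagged by boxplot.stats:", boxplot.stats(sbp)$out, "\n")
```

A flagged point is a candidate for inspection, not automatically an error; whether it should be kept depends on the clinical context.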


Multivariate Descriptive Statistics

Descriptive statistics also examine relationships between variables.

Covariance

The sample covariance measures the joint variability of two variables:

\[ Cov(X, Y) = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) \]
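A minimal sketch of the covariance formula on hypothetical age and blood pressure values, checked against `cov()`. It also rescales age from years to months to show that covariance depends on the units of measurement, which is the motivation for a standardized measure.

```r
# Hypothetical paired observations; illustrative only
age <- c(40, 50, 60, 70, 80)          # years
sbp <- c(118, 127, 129, 138, 143)     # mmHg

n <- length(age)
cov_manual <- sum((age - mean(age)) * (sbp - mean(sbp))) / (n - 1)

# Same data with age expressed in months
cov_months <- cov(age * 12, sbp)

cat("Covariance by hand:  ", cov_manual, "\n")
cat("Covariance via cov():", cov(age, sbp), "\n")
cat("Covariance with age in months:", cov_months, "\n")
```

Expressing age in months multiplies the covariance by 12 even though the strength of the relationship is unchanged.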

Correlation Coefficients

Correlation coefficients are standardized measures of association, bounded between -1 and +1.

  1. Pearson’s \(r\): Measures linear relationship. Sensitive to outliers. \[ r = \frac{Cov(X,Y)}{s_x s_y} \]

  2. Spearman’s \(\rho\): Rank-based correlation. Measures monotonic relationships. More robust to outliers than Pearson’s \(r\).

# Pearson Correlation
cor_pearson <- cor(data$Age, data$SystolicBP, method = "pearson")

# Spearman Correlation
cor_spearman <- cor(data$Age, data$SystolicBP, method = "spearman")

cat("Pearson Correlation (Age vs BP): ", round(cor_pearson, 3), "\n")
## Pearson Correlation (Age vs BP):  0.454
cat("Spearman Correlation (Age vs BP):", round(cor_spearman, 3), "\n")
## Spearman Correlation (Age vs BP): 0.462
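The difference between the two coefficients is easiest to see on a relationship that is monotonic but not linear. The sketch below uses hypothetical data where `y_val` is the cube of `x_val`; it also checks that Pearson's \(r\) equals the covariance divided by the product of the standard deviations, and that Spearman's \(\rho\) is Pearson's \(r\) computed on ranks.

```r
# Hypothetical data: perfectly monotonic, but curved; illustrative only
x_val <- 1:8
y_val <- x_val^3

r_pearson  <- cor(x_val, y_val, method = "pearson")
r_spearman <- cor(x_val, y_val, method = "spearman")

# Pearson's r from its definition
r_manual <- cov(x_val, y_val) / (sd(x_val) * sd(y_val))

# Spearman's rho as Pearson's r on the ranks
rho_from_ranks <- cor(rank(x_val), rank(y_val), method = "pearson")

cat("Pearson r:   ", round(r_pearson, 3), "\n")
cat("Spearman rho:", round(r_spearman, 3), "\n")
```

Spearman's coefficient reaches 1 because the ordering of the two variables agrees perfectly, while Pearson's stays below 1 because the points do not lie on a straight line.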

Visualizing Correlation

p3 <- ggplot(data, aes(x = Age, y = SystolicBP)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "lm", color = "blue", se = TRUE) +
  labs(title = "Correlation: Age vs Systolic BP", 
       subtitle = paste("Pearson r =", round(cor_pearson, 2))) +
  theme_minimal()

print(p3)

Figure 3: Scatter plot showing the relationship between Age and Systolic BP.


Summary and Conclusion

In this chapter, we explored the fundamental building blocks of applied statistics:

  1. Central Tendency: We distinguished between the Mean (sensitive) and Median (robust).
  2. Dispersion: We quantified spread using Variance, SD, IQR, and the Coefficient of Variation.
  3. Shape: We utilized higher moments (Skewness and Kurtosis) to characterize the asymmetry and tail weight of the distribution.
  4. Multivariate Analysis: We explored relationships using Covariance and Correlation.

Applied Exercise: Given the presence of outliers in the ‘Cholesterol’ variable shown in Figure 1, which measure of central tendency would you recommend reporting to a Chief Medical Officer?

Answer: The Median, as it represents the typical patient without being distorted by the extreme outlier cases.


End of Chapter 2
