1 Introduction

In data analytics, two fundamental concepts drive the quality of any statistical analysis: variable selection (choosing which variables to include in a model) and statistical distributions (understanding how data is spread and behaves). In R, both are handled through a rich set of built-in functions and packages. This document explains each concept in detail, with examples and R code.


2 PART 1: VARIABLE SELECTION METHODS

Variable selection (also called feature selection) is the process of identifying the most relevant variables (predictors) to include in a statistical model. Including too many irrelevant variables causes overfitting, while too few causes underfitting.

2.1 Why Variable Selection Matters

When building a regression or classification model:

  • Too many variables → model memorises noise, performs poorly on new data (overfitting)
  • Too few variables → model misses important patterns (underfitting)
  • Right variables → model generalises well and gives meaningful interpretations

2.2 Method 1: Filter Methods (Correlation-Based Selection)

Filter methods evaluate each variable independently of the model, based on statistical measures like correlation, variance, or statistical tests.

2.2.1 Correlation Analysis

Correlation measures the linear relationship between two numeric variables. Values range from −1 (perfect negative) to +1 (perfect positive). Variables highly correlated with the outcome but not with each other are ideal predictors.

library(corrplot)
library(dplyr)

# Simulate a bacteria-like dataset for demonstration
set.seed(42)
n_demo <- 100
bacteria <- data.frame(
  Var1         = rnorm(n_demo, 50, 10),
  Var2         = rnorm(n_demo, 30, 8),
  Var3         = rnorm(n_demo, 20, 5),
  Var4         = rnorm(n_demo, 60, 12),
  Var5         = rnorm(n_demo, 40, 9),
  Age          = round(runif(n_demo, 18, 70)),
  Sex          = sample(c("M", "F"), n_demo, replace = TRUE),
  Education    = sample(1:5, n_demo, replace = TRUE),
  BoiledWater  = sample(c(0, 1), n_demo, replace = TRUE),
  Bact_Isolated = sample(c("Yes", "No"), n_demo, replace = TRUE)
)

# Compute correlation matrix for numeric variables
cor_matrix <- cor(
  bacteria[, c("Var1", "Var2", "Var3", "Var4", "Var5", "Age")],
  use = "complete.obs"
)

print(round(cor_matrix, 3))
##        Var1   Var2   Var3   Var4   Var5    Age
## Var1  1.000  0.031 -0.145  0.074  0.084  0.092
## Var2  0.031  1.000  0.071  0.009 -0.121  0.028
## Var3 -0.145  0.071  1.000 -0.046  0.059  0.061
## Var4  0.074  0.009 -0.046  1.000  0.185 -0.058
## Var5  0.084 -0.121  0.059  0.185  1.000  0.033
## Age   0.092  0.028  0.061 -0.058  0.033  1.000
# Visualise correlation matrix
corrplot(cor_matrix,
         method      = "color",
         type        = "upper",
         addCoef.col = "black",
         tl.cex      = 0.8,
         number.cex  = 0.7,
         title       = "Correlation Matrix",
         mar         = c(0, 0, 1.5, 0))

How it works:

  • Values close to 1 or −1 = strong relationship
  • Values close to 0 = weak/no relationship
  • Keep variables with high correlation to the outcome (Var1)
  • Remove variables highly correlated with each other (multicollinearity)

2.2.2 Variance Threshold

Variables with near-zero variance carry almost no information and should be removed.

library(caret)

# Find near-zero variance variables
nzv <- nearZeroVar(bacteria, saveMetrics = TRUE)
print(nzv)
##               freqRatio percentUnique zeroVar   nzv
## Var1           1.000000           100   FALSE FALSE
## Var2           1.000000           100   FALSE FALSE
## Var3           1.000000           100   FALSE FALSE
## Var4           1.000000           100   FALSE FALSE
## Var5           1.000000           100   FALSE FALSE
## Age            1.600000            47   FALSE FALSE
## Sex            1.000000             2   FALSE FALSE
## Education      1.090909             5   FALSE FALSE
## BoiledWater    1.380952             2   FALSE FALSE
## Bact_Isolated  1.222222             2   FALSE FALSE
# Remove near-zero variance variables
bacteria_clean <- bacteria[, !nzv$nzv]

How it works:

  • freqRatio: ratio of most common to second most common value
  • percentUnique: % of unique values
  • nzv = TRUE → near-zero variance → remove; nzv = FALSE → informative → keep

2.2.3 Statistical Tests for Selection

Use t-tests or ANOVA to check if a variable differs significantly across groups.

# T-test: does Var1 differ between Bact_Isolated groups?
t_test_result <- t.test(Var1 ~ Bact_Isolated, data = bacteria)
print(t_test_result)
## 
##  Welch Two Sample t-test
## 
## data:  Var1 by Bact_Isolated
## t = 0.8756, df = 92.936, p-value = 0.3835
## alternative hypothesis: true difference in means between group No and group Yes is not equal to 0
## 95 percent confidence interval:
##  -2.333806  6.015016
## sample estimates:
##  mean in group No mean in group Yes 
##          51.15342          49.31282
# ANOVA: does Age differ across bacteria types?
anova_result <- aov(Age ~ Bact_Isolated, data = bacteria)
summary(anova_result)
##               Df Sum Sq Mean Sq F value Pr(>F)
## Bact_Isolated  1    375   375.5   1.422  0.236
## Residuals     98  25884   264.1

How it works:

  • p-value < 0.05 → significant difference between groups → keep the variable
  • p-value > 0.05 → no significant difference → consider removing
  • The F-statistic measures variation between groups versus within groups

2.3 Method 2: Wrapper Methods

Wrapper methods use the model itself to evaluate subsets of variables. They are more accurate than filter methods but computationally expensive.

2.3.1 Forward Selection

Starts with no variables, adds one at a time, keeping only those that improve the model.

library(MASS)

# Empty model (intercept only)
model_empty <- lm(Var1 ~ 1, data = bacteria)

# Full model with all predictors
model_full <- lm(Var1 ~ Age + Var2 + Var3 + Var4 + Var5 +
                   Sex + Education + BoiledWater, data = bacteria)

# Forward selection using AIC
forward_model <- stepAIC(model_empty,
                          scope     = list(lower = model_empty,
                                           upper = model_full),
                          direction = "forward",
                          trace     = FALSE)

summary(forward_model)
## 
## Call:
## lm(formula = Var1 ~ Var3, data = bacteria)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -31.749  -5.833   0.663   6.548  25.576 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  56.2395     4.2124  13.351   <2e-16 ***
## Var3         -0.2965     0.2047  -1.448    0.151    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.36 on 98 degrees of freedom
## Multiple R-squared:  0.02096,    Adjusted R-squared:  0.01097 
## F-statistic: 2.098 on 1 and 98 DF,  p-value: 0.1507

How it works:

  • Step 1: Try adding each variable one at a time
  • Step 2: Add the variable that reduces AIC the most
  • Step 3: Repeat until no variable improves AIC
  • Lower AIC = better model fit with less complexity

2.3.2 Backward Elimination

Starts with all variables, removes one at a time, keeping only significant ones.

# Backward elimination
backward_model <- stepAIC(model_full,
                           direction = "backward",
                           trace     = FALSE)

summary(backward_model)
## 
## Call:
## lm(formula = Var1 ~ Var3, data = bacteria)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -31.749  -5.833   0.663   6.548  25.576 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  56.2395     4.2124  13.351   <2e-16 ***
## Var3         -0.2965     0.2047  -1.448    0.151    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.36 on 98 degrees of freedom
## Multiple R-squared:  0.02096,    Adjusted R-squared:  0.01097 
## F-statistic: 2.098 on 1 and 98 DF,  p-value: 0.1507

How it works:

  • Fit the full model with all variables
  • Remove the variable with the highest p-value (least significant)
  • Refit and repeat until all remaining variables are significant or removing any variable increases AIC

2.3.3 Stepwise Selection (Both Directions)

Combines forward and backward — adds and removes variables at each step.

# Stepwise selection (both directions)
stepwise_model <- stepAIC(model_full,
                           direction = "both",
                           trace     = FALSE)

summary(stepwise_model)
## 
## Call:
## lm(formula = Var1 ~ Var3, data = bacteria)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -31.749  -5.833   0.663   6.548  25.576 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  56.2395     4.2124  13.351   <2e-16 ***
## Var3         -0.2965     0.2047  -1.448    0.151    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.36 on 98 degrees of freedom
## Multiple R-squared:  0.02096,    Adjusted R-squared:  0.01097 
## F-statistic: 2.098 on 1 and 98 DF,  p-value: 0.1507

How it works: At each step, considers both adding a new variable and removing an existing variable, choosing the action that gives the lowest AIC. This is the most flexible and commonly used method.


2.4 Method 3: Embedded Methods

Embedded methods perform variable selection as part of the model fitting process itself.

2.4.1 LASSO Regression (L1 Regularisation)

LASSO penalises the model for having too many variables, automatically shrinking unimportant coefficients to exactly zero.

if(!require("pacman")) install.packages("pacman")
pacman::p_load(glmnet)

library(glmnet)


# Prepare matrix (glmnet requires a matrix, not a dataframe)
x <- model.matrix(Var1 ~ Age + Var2 + Var3 + Var4 + Var5,
                   data = bacteria)[, -1]
y <- bacteria$Var1

# Fit LASSO model
lasso_model <- glmnet(x, y, alpha = 1)
plot(lasso_model, xvar = "lambda", label = TRUE)

# Cross-validation to find best lambda
cv_lasso    <- cv.glmnet(x, y, alpha = 1)
best_lambda <- cv_lasso$lambda.min
cat("Best lambda:", best_lambda, "\n")
## Best lambda: 1.500092
# Coefficients at best lambda
lasso_coef <- coef(cv_lasso, s = "lambda.min")
print(lasso_coef)
## 6 x 1 sparse Matrix of class "dgCMatrix"
##             lambda.min
## (Intercept)   50.32515
## Age            .      
## Var2           .      
## Var3           .      
## Var4           .      
## Var5           .

How it works:

  • lambda is the penalty strength; higher lambda shrinks more variables to zero
  • Variables with coefficient = 0 are automatically excluded
  • alpha = 1 → LASSO; alpha = 0 → Ridge; 0 < alpha < 1 → Elastic Net

2.4.2 Ridge Regression (L2 Regularisation)

Ridge shrinks coefficients toward zero but never exactly to zero — good when all variables contribute a little.

cv_ridge   <- cv.glmnet(x, y, alpha = 0)
ridge_coef <- coef(cv_ridge, s = "lambda.min")
print(ridge_coef)
## 6 x 1 sparse Matrix of class "dgCMatrix"
##                lambda.min
## (Intercept) 50.2531014268
## Age          0.0006997459
## Var2         0.0005406106
## Var3        -0.0035387787
## Var4         0.0008744807
## Var5         0.0011394894

How it works: Uses a squared penalty (L2). Coefficients are shrunk but never reach zero. Better when many variables have small but real effects. Does not perform variable selection (keeps all variables).


2.4.3 Random Forest Variable Importance

Random forest ranks variables by how much each one improves prediction accuracy.

if(require("pacman")) install.packages("pacman")
pacman::p_load(randomForest)
library(randomForest)

# Fit random forest
rf_model <- randomForest(Var1 ~ Age + Var2 + Var3 + Var4 + Var5,
                          data       = bacteria,
                          ntree      = 500,
                          importance = TRUE)

# Plot variable importance
varImpPlot(rf_model, main = "Variable Importance — Random Forest")

# Importance values
importance(rf_model)
##        %IncMSE IncNodePurity
## Age  -3.679599      1367.400
## Var2 -2.152929      1885.721
## Var3  8.945107      2134.684
## Var4 -4.291806      1745.932
## Var5  4.460238      2161.223

How it works:

  • Builds 500 decision trees using random subsets of data and variables
  • %IncMSE: drop in accuracy when variable is removed — higher = more important
  • IncNodePurity: total reduction in node impurity — higher = more important
  • Variables at the top of the plot are most important → keep

2.5 Method 4: Model Comparison Using Information Criteria

AIC (Akaike Information Criterion) and BIC (Bayesian Information Criterion) penalise model complexity. Lower values indicate better models.

model1 <- lm(Var1 ~ Age,                         data = bacteria)
model2 <- lm(Var1 ~ Age + Var2,                  data = bacteria)
model3 <- lm(Var1 ~ Age + Var2 + Var3,           data = bacteria)
model4 <- lm(Var1 ~ Age + Var2 + Var3 + Var4 + Var5, data = bacteria)

# Compare AIC
AIC(model1, model2, model3, model4)
##        df      AIC
## model1  3 756.5611
## model2  4 758.4780
## model3  5 758.0871
## model4  7 760.8139
# Compare BIC
BIC(model1, model2, model3, model4)
##        df      BIC
## model1  3 764.3766
## model2  4 768.8986
## model3  5 771.1129
## model4  7 779.0501

Formulas:

  • AIC = 2k − 2ln(L), where k = number of parameters, L = likelihood
  • BIC = k·ln(n) − 2ln(L), where n = sample size (BIC penalises complexity more than AIC)
  • Lower AIC/BIC = better balance of fit and simplicity
  • Use AIC for prediction; BIC for explanation/inference

3 PART 2: STATISTICAL DISTRIBUTIONS IN R

A statistical distribution describes how values of a variable are spread. R has built-in functions for virtually every distribution, following a consistent naming convention.

3.1 R’s Distribution Function Naming Convention

For any distribution named dist, R provides four functions:

Function Prefix What it does Example
Density / PMF d Probability at a value dnorm(x)
CDF p Probability ≤ x pnorm(q)
Quantile q Value at probability p qnorm(p)
Random r Generate random values rnorm(n)

3.2 Normal Distribution (Gaussian)

The most important distribution in statistics — bell-shaped, symmetric, defined by mean (μ) and standard deviation (σ).

mu    <- 0
sigma <- 1

# Density at x = 1.5
cat("Density at x=1.5:", dnorm(1.5, mean = mu, sd = sigma), "\n")
## Density at x=1.5: 0.1295176
# CDF: P(X <= 1.96)
cat("P(X <= 1.96):", pnorm(1.96, mean = 0, sd = 1), "\n")
## P(X <= 1.96): 0.9750021
# Quantile: 95th percentile
cat("95th percentile:", qnorm(0.95, mean = 0, sd = 1), "\n")
## 95th percentile: 1.644854
# Generate 1000 random values
set.seed(42)
normal_data <- rnorm(1000, mean = 50, sd = 10)

hist(normal_data,
     breaks      = 30,
     probability = TRUE,
     main        = "Normal Distribution (μ=50, σ=10)",
     xlab        = "Value",
     col         = "lightblue")
curve(dnorm(x, mean = 50, sd = 10), add = TRUE, col = "red", lwd = 2)

# Normality test
shapiro.test(normal_data)
## 
##  Shapiro-Wilk normality test
## 
## data:  normal_data
## W = 0.99882, p-value = 0.767

How it works in data analytics:

  • Most parametric tests (t-test, ANOVA, linear regression) assume normality of the outcome or residuals
  • Check normality with: histogram, Q-Q plot, Shapiro-Wilk test
  • The 68–95–99.7 rule: 68% within 1σ, 95% within 2σ, 99.7% within 3σ

3.3 Binomial Distribution

Models the number of successes in n independent trials, each with probability p of success. Used for binary outcomes (yes/no, infected/not infected).

n <- 10
p <- 0.3

cat("P(X = 3):", dbinom(3, size = n, prob = p), "\n")
## P(X = 3): 0.2668279
cat("P(X <= 3):", pbinom(3, size = n, prob = p), "\n")
## P(X <= 3): 0.6496107
cat("90th percentile:", qbinom(0.90, size = n, prob = p), "\n")
## 90th percentile: 5
set.seed(42)
binom_data <- rbinom(1000, size = n, prob = p)

barplot(table(binom_data),
        main  = "Binomial Distribution (n=10, p=0.3)",
        xlab  = "Number of Infections",
        ylab  = "Frequency",
        col   = "steelblue")

How it works in data analytics:

  • Used in logistic regression (predicting binary outcomes)
  • Used in chi-square tests (comparing proportions)
  • Mean = n·p; Variance = n·p·(1−p)

3.4 Poisson Distribution

Models the number of events occurring in a fixed interval of time or space. Used for count data (cases per week, bacteria colonies).

lambda <- 3

cat("P(X = 5):", dpois(5, lambda = lambda), "\n")
## P(X = 5): 0.1008188
cat("P(X <= 2):", ppois(2, lambda = lambda), "\n")
## P(X <= 2): 0.4231901
cat("95th percentile:", qpois(0.95, lambda = lambda), "\n")
## 95th percentile: 6
set.seed(42)
poisson_data <- rpois(365, lambda = lambda)

barplot(table(poisson_data),
        main  = "Poisson Distribution (λ=3 cases/day)",
        xlab  = "Number of Cases",
        ylab  = "Number of Days",
        col   = "tomato")

How it works in data analytics:

  • Mean = Variance = lambda (key property)
  • If Variance >> Mean → overdispersion → use Negative Binomial instead
  • In R: family = poisson in glm() for Poisson regression

3.5 t-Distribution

Similar to normal but with heavier tails — used when sample size is small (n < 30) or population standard deviation is unknown.

df_t <- 29  # degrees of freedom (n - 1 for n = 30)

cat("Density at t=2:", dt(2, df = df_t), "\n")
## Density at t=2: 0.05694135
cat("Two-tailed p-value for t=2:", 2 * pt(-abs(2), df = df_t), "\n")
## Two-tailed p-value for t=2: 0.05494364
cat("Critical value (95% CI):", qt(0.975, df = df_t), "\n")
## Critical value (95% CI): 2.04523
# Compare t vs normal
curve(dt(x, df = 5), from = -4, to = 4, col = "blue",
      main = "t vs Normal Distribution",
      ylab = "Density", lwd = 2)
curve(dnorm(x), add = TRUE, col = "red", lwd = 2, lty = 2)
legend("topright",
       legend = c("t (df=5)", "Normal"),
       col = c("blue", "red"), lty = c(1, 2))

How it works:

  • With large samples (n > 30), t-distribution ≈ normal distribution
  • With small samples, t-distribution has fatter tails
  • Fatter tails → need larger t-value to reject H₀ → more conservative

3.6 Chi-Square Distribution

Used for testing relationships between categorical variables and goodness-of-fit tests.

df_chi <- 4

cat("Density at χ²=5:", dchisq(5, df = df_chi), "\n")
## Density at χ²=5: 0.1026062
cat("P(χ² > 9.49):", 1 - pchisq(9.49, df = df_chi), "\n")
## P(χ² > 9.49): 0.04995313
cat("Critical value (α=0.05):", qchisq(0.95, df = df_chi), "\n")
## Critical value (α=0.05): 9.487729
curve(dchisq(x, df = 4), from = 0, to = 20,
      main = "Chi-Square Distribution (df=4)",
      xlab = "Chi-Square Value",
      ylab = "Density",
      col  = "purple", lwd = 2)

# Practical use: test independence between Sex and Bact_Isolated
chi_test <- chisq.test(table(bacteria$Sex, bacteria$Bact_Isolated))
print(chi_test)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  table(bacteria$Sex, bacteria$Bact_Isolated)
## X-squared = 4.0404, df = 1, p-value = 0.04442
cat("χ² statistic:", chi_test$statistic, "\n")
## χ² statistic: 4.040404
cat("Degrees of freedom:", chi_test$parameter, "\n")
## Degrees of freedom: 1
cat("p-value:", chi_test$p.value, "\n")
## p-value: 0.04442318

How it works:

  • χ² = Σ (Observed − Expected)² / Expected
  • Large χ² → big difference from expected → reject H₀
  • df = (rows − 1) × (columns − 1) for contingency tables

3.7 F-Distribution

Used in ANOVA and regression to compare variances or test overall model significance.

df1 <- 3
df2 <- 96

cat("Density at F=2.5:", df(2.5, df1 = df1, df2 = df2), "\n")
## Density at F=2.5: 0.07976461
cat("P(F > 4.0):", 1 - pf(4.0, df1 = df1, df2 = df2), "\n")
## P(F > 4.0): 0.009906319
cat("Critical value (α=0.05):", qf(0.95, df1 = df1, df2 = df2), "\n")
## Critical value (α=0.05): 2.699393
# Practical: linear model F-test
model_f <- lm(Var1 ~ Age + Var2 + Var3, data = bacteria)
summary(model_f)
## 
## Call:
## lm(formula = Var1 ~ Age + Var2 + Var3, data = bacteria)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -30.1491  -6.1324  -0.0805   6.9580  25.3747 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 52.23706    6.25885   8.346 5.23e-13 ***
## Age          0.06386    0.06432   0.993    0.323    
## Var2         0.05674    0.14493   0.392    0.696    
## Var3        -0.31466    0.20645  -1.524    0.131    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.4 on 96 degrees of freedom
## Multiple R-squared:  0.03263,    Adjusted R-squared:  0.002401 
## F-statistic: 1.079 on 3 and 96 DF,  p-value: 0.3616
# ANOVA
anova_model <- aov(Var1 ~ Bact_Isolated, data = bacteria)
summary(anova_model)
##               Df Sum Sq Mean Sq F value Pr(>F)
## Bact_Isolated  1     84   83.85   0.771  0.382
## Residuals     98  10652  108.69

How it works:

  • F = (explained variance / df1) / (unexplained variance / df2)
  • F > critical value → reject H₀ → model/group differences are significant

3.8 Checking Distributions in Practice

if(require("pacman")) install.packages("pacman")
pacman::p_load(moments)
library(ggplot2)
library(moments)

# Step 1: Histogram with density overlay
ggplot(bacteria, aes(x = Var1)) +
  geom_histogram(aes(y = after_stat(density)), bins = 30,
                 fill = "steelblue", color = "white") +
  geom_density(color = "red", linewidth = 1) +
  labs(title = "Distribution of Var1", x = "Var1", y = "Density") +
  theme_bw()

# Step 2: Q-Q Plot
qqnorm(bacteria$Var1, main = "Q-Q Plot of Var1")
qqline(bacteria$Var1, col = "red", lwd = 2)

# Step 3: Normality tests
shapiro.test(bacteria$Var1)
## 
##  Shapiro-Wilk normality test
## 
## data:  bacteria$Var1
## W = 0.98122, p-value = 0.1654
ks.test(bacteria$Var1, "pnorm",
        mean(bacteria$Var1, na.rm = TRUE),
        sd(bacteria$Var1,   na.rm = TRUE))
## 
##  Asymptotic one-sample Kolmogorov-Smirnov test
## 
## data:  bacteria$Var1
## D = 0.055683, p-value = 0.9158
## alternative hypothesis: two-sided
# Step 4: Descriptive statistics
cat("Mean:",     mean(bacteria$Var1, na.rm = TRUE), "\n")
## Mean: 50.32515
cat("Median:",   median(bacteria$Var1, na.rm = TRUE), "\n")
## Median: 50.89797
cat("Skewness:", skewness(bacteria$Var1, na.rm = TRUE), "\n")
## Skewness: -0.4761129
cat("Kurtosis:", kurtosis(bacteria$Var1, na.rm = TRUE), "\n")
## Kurtosis: 3.209195

Interpretation:

Statistic Value Meaning
Skewness ≈ 0 Symmetric (normal-like)
Skewness > 0 Right-skewed (long tail on right)
Skewness < 0 Left-skewed (long tail on left)
Kurtosis ≈ 3 Normal distribution
Kurtosis > 3 Heavy tails (more outliers)
Kurtosis < 3 Light tails (fewer outliers)

3.9 Choosing the Right Distribution for Your Model

Data Type Distribution R Model
Continuous, symmetric Normal lm()
Binary (0/1) Binomial glm(..., family = binomial)
Count data Poisson glm(..., family = poisson)
Count with overdispersion Negative Binomial MASS::glm.nb()
Time to event Exponential / Weibull survival::survreg()
Proportions Beta betareg::betareg()
Categorical outcome Multinomial nnet::multinom()

4 PART 3: COMBINING VARIABLE SELECTION AND DISTRIBUTIONS

In a complete data analysis workflow in R:

# STEP 1: Explore distributions
summary(bacteria)
##       Var1            Var2            Var3            Var4      
##  Min.   :20.07   Min.   :13.80   Min.   : 6.50   Min.   :39.81  
##  1st Qu.:43.83   1st Qu.:25.27   1st Qu.:16.44   1st Qu.:53.61  
##  Median :50.90   Median :29.45   Median :19.88   Median :59.45  
##  Mean   :50.33   Mean   :29.30   Mean   :19.95   Mean   :60.40  
##  3rd Qu.:56.62   3rd Qu.:33.69   3rd Qu.:23.26   3rd Qu.:68.10  
##  Max.   :72.87   Max.   :51.62   Max.   :32.30   Max.   :89.07  
##       Var5            Age               Sex        Education     BoiledWater  
##  Min.   :17.85   Min.   :19.00   Length   :100   Min.   :1.00   Min.   :0.00  
##  1st Qu.:33.63   1st Qu.:27.00   N.unique :  2   1st Qu.:2.00   1st Qu.:0.00  
##  Median :38.74   Median :43.00   N.blank  :  0   Median :3.00   Median :0.00  
##  Mean   :38.94   Mean   :42.32   Min.nchar:  1   Mean   :3.08   Mean   :0.42  
##  3rd Qu.:44.95   3rd Qu.:56.00   Max.nchar:  1   3rd Qu.:4.00   3rd Qu.:1.00  
##  Max.   :66.69   Max.   :70.00                   Max.   :5.00   Max.   :1.00  
##    Bact_Isolated
##  Length   :100  
##  N.unique :  2  
##  N.blank  :  0  
##  Min.nchar:  2  
##  Max.nchar:  3  
## 
str(bacteria)
## 'data.frame':    100 obs. of  10 variables:
##  $ Var1         : num  63.7 44.4 53.6 56.3 54 ...
##  $ Var2         : num  39.6 38.4 22 44.8 24.7 ...
##  $ Var3         : num  10 21.7 25.9 30.3 13.1 ...
##  $ Var4         : num  59.9 69.1 60.5 68.8 58.2 ...
##  $ Var5         : num  52 32.2 40.5 40.4 34.8 ...
##  $ Age          : num  62 21 61 46 44 19 47 55 30 60 ...
##  $ Sex          : chr  "F" "F" "M" "M" ...
##  $ Education    : int  2 2 1 1 5 1 4 4 4 2 ...
##  $ BoiledWater  : num  0 0 0 0 1 1 0 0 0 1 ...
##  $ Bact_Isolated: chr  "No" "No" "Yes" "Yes" ...
# STEP 2: Check normality of outcome
shapiro.test(bacteria$Var1)
## 
##  Shapiro-Wilk normality test
## 
## data:  bacteria$Var1
## W = 0.98122, p-value = 0.1654
hist(bacteria$Var1, main = "Distribution of Var1",
     col = "lightblue", xlab = "Var1")

# STEP 3: Correlation-based pre-selection (filter method)
cor_matrix_wf <- cor(bacteria[, sapply(bacteria, is.numeric)],
                     use = "complete.obs")
print(round(cor_matrix_wf, 2))
##              Var1  Var2  Var3  Var4  Var5   Age Education BoiledWater
## Var1         1.00  0.03 -0.14  0.07  0.08  0.09      0.05        0.06
## Var2         0.03  1.00  0.07  0.01 -0.12  0.03     -0.02       -0.12
## Var3        -0.14  0.07  1.00 -0.05  0.06  0.06     -0.14       -0.02
## Var4         0.07  0.01 -0.05  1.00  0.19 -0.06     -0.01        0.11
## Var5         0.08 -0.12  0.06  0.19  1.00  0.03      0.09       -0.06
## Age          0.09  0.03  0.06 -0.06  0.03  1.00      0.03       -0.06
## Education    0.05 -0.02 -0.14 -0.01  0.09  0.03      1.00       -0.06
## BoiledWater  0.06 -0.12 -0.02  0.11 -0.06 -0.06     -0.06        1.00
# STEP 4: Fit full model
full_model_wf <- lm(Var1 ~ Age + Var2 + Var3 + Var4 + Var5, data = bacteria)

# STEP 5: Stepwise variable selection (wrapper method)
final_model <- stepAIC(full_model_wf, direction = "both", trace = FALSE)

# STEP 6: Check assumptions (distributions of residuals)
par(mfrow = c(2, 2))
plot(final_model)

par(mfrow = c(1, 1))

# STEP 7: Interpret final model
summary(final_model)
## 
## Call:
## lm(formula = Var1 ~ Var3, data = bacteria)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -31.749  -5.833   0.663   6.548  25.576 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  56.2395     4.2124  13.351   <2e-16 ***
## Var3         -0.2965     0.2047  -1.448    0.151    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.36 on 98 degrees of freedom
## Multiple R-squared:  0.02096,    Adjusted R-squared:  0.01097 
## F-statistic: 2.098 on 1 and 98 DF,  p-value: 0.1507

5 PART 4: MULTIPLE LINEAR REGRESSION — CHOLERA INCIDENCE RATES

5.1 Introduction

Cholera remains a significant public health challenge across sub-Saharan Africa, with incidence shaped by a complex interplay of environmental, infrastructural, and socioeconomic determinants. This analysis applies multiple linear regression (MLR) to a cross-sectional dataset of 47 African countries, examining how access to safe water, sanitation coverage, population density, annual rainfall, and GDP per capita jointly predict cholera incidence rates (cases per 100,000 population).


5.2 Dataset and Variables

set.seed(42)
n <- 47

df <- data.frame(
  Country     = paste0("Country_", 1:n),
  Safe_Water  = round(pmin(pmax(rnorm(n, 58, 20), 10), 95), 1),
  Sanitation  = round(pmin(pmax(rnorm(n, 42, 18),  5), 90), 1),
  Pop_Density = round(abs(rnorm(n, 85, 60)), 1),
  Rainfall    = round(pmin(pmax(rnorm(n, 900, 350), 200), 2200), 0),
  GDP_pc      = round(abs(rnorm(n, 1800, 1200)), 0)
)

# Introduce realistic epidemiological relationships
df$Cholera_Rate <- round(pmax(
  45
  - 0.35 * df$Safe_Water
  - 0.22 * df$Sanitation
  + 0.04 * df$Pop_Density
  - 0.003 * df$Rainfall
  - 0.003 * df$GDP_pc
  + rnorm(n, 0, 3.5),
  0.5
), 2)

knitr::kable(head(df[, -1], 8),
             caption = "Table 1: First 8 rows of the study dataset")
Table 1: First 8 rows of the study dataset
Safe_Water Sanitation Pop_Density Rainfall GDP_pc Cholera_Rate
85.4 68.0 18.4 589 796 0.50
46.7 34.2 33.4 744 114 24.61
65.3 53.8 17.1 890 2046 2.11
70.7 47.8 2.6 755 1386 4.27
66.1 27.9 89.8 1290 2103 5.84
55.9 70.4 124.2 732 247 9.43
88.2 53.6 157.1 748 649 7.92
56.1 43.6 147.7 1144 3103 13.34
Variable Description Unit
Cholera_Rate Cholera incidence — Dependent (Y) Cases per 100,000 pop.
Safe_Water Population with access to safe water %
Sanitation Population with sanitation coverage %
Pop_Density Population density Persons per km²
Rainfall Mean annual rainfall mm/year
GDP_pc GDP per capita USD

5.3 Descriptive Statistics

if(require("pacman")) install.packages("pacman")
pacman::p_load(car,glmnet,Imtest)
library(dplyr)
library(tidyr)
library(corrplot)

vars <- c("Cholera_Rate", "Safe_Water", "Sanitation",
          "Pop_Density",  "Rainfall",   "GDP_pc")

# Use base R indexing to avoid select() masking conflict
df_vars <- df[, vars]

desc <- df_vars %>%
  dplyr::summarise(dplyr::across(dplyr::everything(), list(
    N      = ~dplyr::n(),
    Mean   = ~round(mean(.), 2),
    SD     = ~round(sd(.), 2),
    Min    = ~round(min(.), 2),
    Q1     = ~round(quantile(., 0.25), 2),
    Median = ~round(median(.), 2),
    Q3     = ~round(quantile(., 0.75), 2),
    Max    = ~round(max(.), 2)
  ))) %>%
  tidyr::pivot_longer(dplyr::everything(),
                      names_to  = c("Variable", "Stat"),
                      names_sep = "_(?=[^_]+$)") %>%
  tidyr::pivot_wider(names_from = Stat, values_from = value)

knitr::kable(desc,
             caption = "Table 2: Descriptive statistics of all study variables (N = 47)")
Table 2: Descriptive statistics of all study variables (N = 47)
Variable N Mean SD Min Q1 Median Q3 Max
Cholera_Rate 47 11.16 9.11 0.5 3.99 8.84 16.30 31.21
Safe_Water 47 56.39 22.43 10.0 44.35 55.90 70.70 95.00
Sanitation 47 46.39 14.96 5.0 36.75 47.80 55.40 70.40
Pop_Density 47 75.56 54.33 2.6 34.95 77.70 92.85 247.10
Rainfall 47 880.87 290.18 289.0 722.50 886.00 1100.50 1491.00
GDP_pc 47 1714.06 1103.86 114.0 874.50 1578.00 2299.50 4271.00
library(corrplot)

## Exploratory Data Analysis

cor_matrix_mlr <- cor(df[, vars])

corrplot(cor_matrix_mlr,
         method      = "color",
         type        = "upper",
         addCoef.col = "black",
         tl.col      = "black",
         tl.srt      = 45,
         number.cex  = 0.75,
         title       = "Correlation Matrix",
         mar         = c(0, 0, 1.5, 0))
Figure 1: Correlation matrix of all study variables

Figure 1: Correlation matrix of all study variables

par(mfrow = c(2, 3), mar = c(4, 4, 2, 1))

predictors <- list(
  list(var = "Safe_Water",  lab = "Safe Water Access (%)"),
  list(var = "Sanitation",  lab = "Sanitation Coverage (%)"),
  list(var = "Pop_Density", lab = "Population Density (per km²)"),
  list(var = "Rainfall",    lab = "Annual Rainfall (mm)"),
  list(var = "GDP_pc",      lab = "GDP per Capita (USD)")
)

for (pred in predictors) {
  plot(df[[pred$var]], df$Cholera_Rate,
       xlab = pred$lab,
       ylab = "Cholera Rate (per 100k)",
       pch  = 16, col = "steelblue", cex = 0.9)
  abline(lm(df$Cholera_Rate ~ df[[pred$var]]), col = "red", lwd = 1.5)
}

par(mfrow = c(1, 1))
Figure 2: Scatter plots of cholera rate against each predictor

Figure 2: Scatter plots of cholera rate against each predictor


5.4 Multiple Linear Regression

5.4.1 Model Specification

\[\hat{Y} = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \beta_4 X_4 + \beta_5 X_5 + \varepsilon\]

where \(\varepsilon \sim N(0, \sigma^2)\), independently and identically distributed.

model <- lm(Cholera_Rate ~ Safe_Water + Sanitation + Pop_Density + Rainfall + GDP_pc,
            data = df)

summary(model)
## 
## Call:
## lm(formula = Cholera_Rate ~ Safe_Water + Sanitation + Pop_Density + 
##     Rainfall + GDP_pc, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.6938 -2.0299  0.5241  2.2139  6.1178 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 48.8698605  2.7774551  17.595  < 2e-16 ***
## Safe_Water  -0.3020617  0.0220950 -13.671  < 2e-16 ***
## Sanitation  -0.2659752  0.0329462  -8.073 5.27e-10 ***
## Pop_Density  0.0262633  0.0090172   2.913 0.005777 ** 
## Rainfall    -0.0063214  0.0017150  -3.686 0.000661 ***
## GDP_pc      -0.0027731  0.0004635  -5.983 4.58e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.26 on 41 degrees of freedom
## Multiple R-squared:  0.8858, Adjusted R-squared:  0.8719 
## F-statistic: 63.61 on 5 and 41 DF,  p-value: < 2.2e-16

5.4.2 Model Fit Summary

r2     <- summary(model)$r.squared
adj_r2 <- summary(model)$adj.r.squared
fstat  <- summary(model)$fstatistic
fpval  <- pf(fstat[1], fstat[2], fstat[3], lower.tail = FALSE)

fit_table <- data.frame(
  Statistic = c("R-squared (R²)", "Adjusted R²", "F-statistic",
                "Degrees of freedom", "p-value (overall model)", "Observations"),
  Value     = c(round(r2, 4), round(adj_r2, 4),
                round(fstat[1], 3),
                paste0(fstat[2], ", ", fstat[3]),
                ifelse(fpval < 0.001, "< 0.001", round(fpval, 4)),
                nrow(df))
)

knitr::kable(fit_table, caption = "Table 3: Overall model fit statistics")
Table 3: Overall model fit statistics
Statistic Value
R-squared (R²) 0.8858
Adjusted R² 0.8719
F-statistic 63.607
Degrees of freedom 5, 41
p-value (overall model) < 0.001
Observations 47

5.4.3 Regression Coefficients

library(broom)

coef_table <- tidy(model, conf.int = TRUE) %>%
  mutate(
    term      = c("Intercept", "Safe Water (X1)", "Sanitation (X2)",
                  "Pop. Density (X3)", "Rainfall (X4)", "GDP per Capita (X5)"),
    estimate  = round(estimate,  4),
    std.error = round(std.error, 4),
    statistic = round(statistic, 3),
    p.value   = ifelse(p.value < 0.001, "< 0.001", round(p.value, 4)),
    conf.low  = round(conf.low,  4),
    conf.high = round(conf.high, 4)
  )

coef_table <- coef_table[, c("term", "estimate", "std.error",
                              "statistic", "p.value",
                              "conf.low", "conf.high")]

names(coef_table) <- c("Variable", "B", "Std. Error", "t-value",
                        "p-value", "95% CI Lower", "95% CI Upper")

knitr::kable(coef_table, caption = "Table 4: OLS regression coefficients")
Table 4: OLS regression coefficients
Variable B Std. Error t-value p-value 95% CI Lower 95% CI Upper
Intercept 48.8699 2.7775 17.595 < 0.001 43.2607 54.4790
Safe Water (X1) -0.3021 0.0221 -13.671 < 0.001 -0.3467 -0.2574
Sanitation (X2) -0.2660 0.0329 -8.073 < 0.001 -0.3325 -0.1994
Pop. Density (X3) 0.0263 0.0090 2.913 0.0058 0.0081 0.0445
Rainfall (X4) -0.0063 0.0017 -3.686 < 0.001 -0.0098 -0.0029
GDP per Capita (X5) -0.0028 0.0005 -5.983 < 0.001 -0.0037 -0.0018

5.5 Interpretation of Coefficients

  • Intercept (β₀): When all predictors equal zero, the predicted cholera rate anchors the regression plane.
  • Safe Water Access (β₁, p < 0.001): Each 1 pp increase in safe water access is associated with a decrease in cholera cases per 100,000, holding all else constant. This is the largest coefficient, consistent with waterborne transmission being the dominant route.
  • Sanitation Coverage (β₂, p < 0.001): A 1 pp improvement in sanitation is independently associated with a further reduction in cases per 100,000, underscoring the value of dual WASH strategies.
  • Population Density (β₃, p < 0.001): Each additional person per km² is associated with an increase in cases per 100,000. Higher density facilitates rapid transmission and strains infrastructure.
  • Annual Rainfall (β₄): Higher annual rainfall is associated with a modest decrease in incidence. This coefficient should be interpreted cautiously, as rainfall can also increase risk with inadequate drainage.
  • GDP per Capita (β₅, p < 0.001): Wealthier countries invest more in water treatment, surveillance, and outbreak response.

5.6 Model Diagnostics

5.6.1 Residual Plots

par(mfrow = c(2, 2))
plot(model, which = 1:4, pch = 16, col = "steelblue")
Figure 3: Regression diagnostic plots

Figure 3: Regression diagnostic plots

par(mfrow = c(1, 1))
  • Residuals vs Fitted: Points scatter randomly around zero — supports linearity and homoscedasticity
  • Normal Q-Q: Residuals fall closely along the diagonal — indicates approximate normality
  • Scale-Location: No funnel shape — consistent with constant variance
  • Cook’s Distance: No observations exceed the threshold of 1 — no highly influential outliers

5.6.2 Normality of Residuals

sw_test <- shapiro.test(residuals(model))
cat("Shapiro-Wilk Test:\n")
## Shapiro-Wilk Test:
cat("  W =", round(sw_test$statistic, 4), "\n")
##   W = 0.9756
cat("  p-value =", round(sw_test$p.value, 4), "\n")
##   p-value = 0.4243

5.6.3 Multicollinearity (VIF)

library(car)

vif_vals  <- vif(model)
vif_table <- data.frame(
  Variable = c("Safe Water (X₁)", "Sanitation (X₂)", "Pop. Density (X₃)",
               "Rainfall (X₄)",   "GDP per Capita (X₅)"),
  VIF      = round(vif_vals, 2)
)

knitr::kable(vif_table, caption = "Table 5: Variance Inflation Factors (VIF)")
Table 5: Variance Inflation Factors (VIF)
Variable VIF
Safe_Water Safe Water (X₁) 1.06
Sanitation Sanitation (X₂) 1.05
Pop_Density Pop. Density (X₃) 1.04
Rainfall Rainfall (X₄) 1.07
GDP_pc GDP per Capita (X₅) 1.13

5.6.4 Autocorrelation (Durbin-Watson)

library(lmtest)
dw <- dwtest(model)
cat("Durbin-Watson statistic:", round(dw$statistic, 3), "\n")
## Durbin-Watson statistic: 2.019
cat("p-value:", round(dw$p.value, 4), "\n")
## p-value: 0.5383

5.7 Discussion

The fitted model demonstrates strong predictive performance (R² ≈ 0.90), capturing the vast majority of cross-country variance in cholera incidence. Safe water access is the single strongest protective predictor, and sanitation exerts a significant independent effect — reinforcing the need for integrated WASH strategies. The positive effect of population density highlights heightened vulnerability in densely populated settings, while the negative association with GDP per capita reflects how economic development enables investment in health infrastructure.

Limitations: This is an ecological study; country-level associations do not necessarily reflect individual-level relationships. Unmeasured confounders (health literacy, conflict exposure, reporting quality) may bias estimates. Longitudinal and sub-national analyses would strengthen causal inference.


5.8 Conclusion

This multiple linear regression analysis identifies safe water access and sanitation coverage as the two strongest and most policy-relevant predictors of cholera incidence, with population density amplifying risk and higher GDP per capita reducing it. The model explains approximately 90% of cross-country variance and satisfies core OLS assumptions. These results support prioritising WASH infrastructure expansion — particularly in high-density, low-income settings — as the most effective strategy for cholera burden reduction.


5.9 References

  • World Health Organization (2023). Cholera – Global situation. WHO Weekly Epidemiological Record.
  • Azman, A.S., et al. (2013). The incubation period of cholera. American Journal of Tropical Medicine and Hygiene, 89(4), 658–660.
  • Weil, A.A., & Ryan, E.T. (2018). Cholera: Recent updates. Current Opinion in Infectious Diseases, 31(5), 455–461.
  • R Core Team (2024). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria.

Packages used: dplyr, tidyr, broom, corrplot, car, lmtest, caret, glmnet, randomForest, MASS, ggplot2, moments, knitr