In data analytics, two fundamental concepts drive the quality of any statistical analysis: variable selection (choosing which variables to include in a model) and statistical distributions (understanding how data is spread and behaves). In R, both are handled through a rich set of built-in functions and packages. This document explains each concept in detail, with examples and R code.
Variable selection (also called feature selection) is the process of identifying the most relevant variables (predictors) to include in a statistical model. Including too many irrelevant variables causes overfitting, while too few causes underfitting.
When building a regression or classification model:
Filter methods evaluate each variable independently of the model, based on statistical measures like correlation, variance, or statistical tests.
Correlation measures the linear relationship between two numeric variables. Values range from −1 (perfect negative) to +1 (perfect positive). Variables highly correlated with the outcome but not with each other are ideal predictors.
library(corrplot)
library(dplyr)
# Simulate a bacteria-like dataset for demonstration
set.seed(42)
n_demo <- 100
bacteria <- data.frame(
Var1 = rnorm(n_demo, 50, 10),
Var2 = rnorm(n_demo, 30, 8),
Var3 = rnorm(n_demo, 20, 5),
Var4 = rnorm(n_demo, 60, 12),
Var5 = rnorm(n_demo, 40, 9),
Age = round(runif(n_demo, 18, 70)),
Sex = sample(c("M", "F"), n_demo, replace = TRUE),
Education = sample(1:5, n_demo, replace = TRUE),
BoiledWater = sample(c(0, 1), n_demo, replace = TRUE),
Bact_Isolated = sample(c("Yes", "No"), n_demo, replace = TRUE)
)
# Compute correlation matrix for numeric variables
cor_matrix <- cor(
bacteria[, c("Var1", "Var2", "Var3", "Var4", "Var5", "Age")],
use = "complete.obs"
)
print(round(cor_matrix, 3))## Var1 Var2 Var3 Var4 Var5 Age
## Var1 1.000 0.031 -0.145 0.074 0.084 0.092
## Var2 0.031 1.000 0.071 0.009 -0.121 0.028
## Var3 -0.145 0.071 1.000 -0.046 0.059 0.061
## Var4 0.074 0.009 -0.046 1.000 0.185 -0.058
## Var5 0.084 -0.121 0.059 0.185 1.000 0.033
## Age 0.092 0.028 0.061 -0.058 0.033 1.000
# Visualise correlation matrix
corrplot(cor_matrix,
method = "color",
type = "upper",
addCoef.col = "black",
tl.cex = 0.8,
number.cex = 0.7,
title = "Correlation Matrix",
mar = c(0, 0, 1.5, 0))How it works:
Variables with near-zero variance carry almost no information and should be removed.
library(caret)
# Find near-zero variance variables
nzv <- nearZeroVar(bacteria, saveMetrics = TRUE)
print(nzv)## freqRatio percentUnique zeroVar nzv
## Var1 1.000000 100 FALSE FALSE
## Var2 1.000000 100 FALSE FALSE
## Var3 1.000000 100 FALSE FALSE
## Var4 1.000000 100 FALSE FALSE
## Var5 1.000000 100 FALSE FALSE
## Age 1.600000 47 FALSE FALSE
## Sex 1.000000 2 FALSE FALSE
## Education 1.090909 5 FALSE FALSE
## BoiledWater 1.380952 2 FALSE FALSE
## Bact_Isolated 1.222222 2 FALSE FALSE
How it works:
freqRatio: ratio of most common to second most common
valuepercentUnique: % of unique valuesnzv = TRUE → near-zero variance → remove;
nzv = FALSE → informative → keepUse t-tests or ANOVA to check if a variable differs significantly across groups.
# T-test: does Var1 differ between Bact_Isolated groups?
t_test_result <- t.test(Var1 ~ Bact_Isolated, data = bacteria)
print(t_test_result)##
## Welch Two Sample t-test
##
## data: Var1 by Bact_Isolated
## t = 0.8756, df = 92.936, p-value = 0.3835
## alternative hypothesis: true difference in means between group No and group Yes is not equal to 0
## 95 percent confidence interval:
## -2.333806 6.015016
## sample estimates:
## mean in group No mean in group Yes
## 51.15342 49.31282
# ANOVA: does Age differ across bacteria types?
anova_result <- aov(Age ~ Bact_Isolated, data = bacteria)
summary(anova_result)## Df Sum Sq Mean Sq F value Pr(>F)
## Bact_Isolated 1 375 375.5 1.422 0.236
## Residuals 98 25884 264.1
How it works:
Wrapper methods use the model itself to evaluate subsets of variables. They are more accurate than filter methods but computationally expensive.
Starts with no variables, adds one at a time, keeping only those that improve the model.
library(MASS)
# Empty model (intercept only)
model_empty <- lm(Var1 ~ 1, data = bacteria)
# Full model with all predictors
model_full <- lm(Var1 ~ Age + Var2 + Var3 + Var4 + Var5 +
Sex + Education + BoiledWater, data = bacteria)
# Forward selection using AIC
forward_model <- stepAIC(model_empty,
scope = list(lower = model_empty,
upper = model_full),
direction = "forward",
trace = FALSE)
summary(forward_model)##
## Call:
## lm(formula = Var1 ~ Var3, data = bacteria)
##
## Residuals:
## Min 1Q Median 3Q Max
## -31.749 -5.833 0.663 6.548 25.576
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 56.2395 4.2124 13.351 <2e-16 ***
## Var3 -0.2965 0.2047 -1.448 0.151
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.36 on 98 degrees of freedom
## Multiple R-squared: 0.02096, Adjusted R-squared: 0.01097
## F-statistic: 2.098 on 1 and 98 DF, p-value: 0.1507
How it works:
Starts with all variables, removes one at a time, keeping only significant ones.
# Backward elimination
backward_model <- stepAIC(model_full,
direction = "backward",
trace = FALSE)
summary(backward_model)##
## Call:
## lm(formula = Var1 ~ Var3, data = bacteria)
##
## Residuals:
## Min 1Q Median 3Q Max
## -31.749 -5.833 0.663 6.548 25.576
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 56.2395 4.2124 13.351 <2e-16 ***
## Var3 -0.2965 0.2047 -1.448 0.151
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.36 on 98 degrees of freedom
## Multiple R-squared: 0.02096, Adjusted R-squared: 0.01097
## F-statistic: 2.098 on 1 and 98 DF, p-value: 0.1507
How it works:
Combines forward and backward — adds and removes variables at each step.
# Stepwise selection (both directions)
stepwise_model <- stepAIC(model_full,
direction = "both",
trace = FALSE)
summary(stepwise_model)##
## Call:
## lm(formula = Var1 ~ Var3, data = bacteria)
##
## Residuals:
## Min 1Q Median 3Q Max
## -31.749 -5.833 0.663 6.548 25.576
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 56.2395 4.2124 13.351 <2e-16 ***
## Var3 -0.2965 0.2047 -1.448 0.151
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.36 on 98 degrees of freedom
## Multiple R-squared: 0.02096, Adjusted R-squared: 0.01097
## F-statistic: 2.098 on 1 and 98 DF, p-value: 0.1507
How it works: At each step, considers both adding a new variable and removing an existing variable, choosing the action that gives the lowest AIC. This is the most flexible and commonly used method.
Embedded methods perform variable selection as part of the model fitting process itself.
LASSO penalises the model for having too many variables, automatically shrinking unimportant coefficients to exactly zero.
if(!require("pacman")) install.packages("pacman")
pacman::p_load(glmnet)
library(glmnet)
# Prepare matrix (glmnet requires a matrix, not a dataframe)
x <- model.matrix(Var1 ~ Age + Var2 + Var3 + Var4 + Var5,
data = bacteria)[, -1]
y <- bacteria$Var1
# Fit LASSO model
lasso_model <- glmnet(x, y, alpha = 1)
plot(lasso_model, xvar = "lambda", label = TRUE)# Cross-validation to find best lambda
cv_lasso <- cv.glmnet(x, y, alpha = 1)
best_lambda <- cv_lasso$lambda.min
cat("Best lambda:", best_lambda, "\n")## Best lambda: 1.500092
## 6 x 1 sparse Matrix of class "dgCMatrix"
## lambda.min
## (Intercept) 50.32515
## Age .
## Var2 .
## Var3 .
## Var4 .
## Var5 .
How it works:
lambda is the penalty strength; higher lambda shrinks
more variables to zeroalpha = 1 → LASSO; alpha = 0 → Ridge;
0 < alpha < 1 → Elastic NetRidge shrinks coefficients toward zero but never exactly to zero — good when all variables contribute a little.
cv_ridge <- cv.glmnet(x, y, alpha = 0)
ridge_coef <- coef(cv_ridge, s = "lambda.min")
print(ridge_coef)## 6 x 1 sparse Matrix of class "dgCMatrix"
## lambda.min
## (Intercept) 50.2531014268
## Age 0.0006997459
## Var2 0.0005406106
## Var3 -0.0035387787
## Var4 0.0008744807
## Var5 0.0011394894
How it works: Uses a squared penalty (L2). Coefficients are shrunk but never reach zero. Better when many variables have small but real effects. Does not perform variable selection (keeps all variables).
Random forest ranks variables by how much each one improves prediction accuracy.
if(require("pacman")) install.packages("pacman")
pacman::p_load(randomForest)
library(randomForest)
# Fit random forest
rf_model <- randomForest(Var1 ~ Age + Var2 + Var3 + Var4 + Var5,
data = bacteria,
ntree = 500,
importance = TRUE)
# Plot variable importance
varImpPlot(rf_model, main = "Variable Importance — Random Forest")## %IncMSE IncNodePurity
## Age -3.679599 1367.400
## Var2 -2.152929 1885.721
## Var3 8.945107 2134.684
## Var4 -4.291806 1745.932
## Var5 4.460238 2161.223
How it works:
%IncMSE: drop in accuracy when variable is removed —
higher = more importantIncNodePurity: total reduction in node impurity —
higher = more importantAIC (Akaike Information Criterion) and BIC (Bayesian Information Criterion) penalise model complexity. Lower values indicate better models.
model1 <- lm(Var1 ~ Age, data = bacteria)
model2 <- lm(Var1 ~ Age + Var2, data = bacteria)
model3 <- lm(Var1 ~ Age + Var2 + Var3, data = bacteria)
model4 <- lm(Var1 ~ Age + Var2 + Var3 + Var4 + Var5, data = bacteria)
# Compare AIC
AIC(model1, model2, model3, model4)## df AIC
## model1 3 756.5611
## model2 4 758.4780
## model3 5 758.0871
## model4 7 760.8139
## df BIC
## model1 3 764.3766
## model2 4 768.8986
## model3 5 771.1129
## model4 7 779.0501
Formulas:
A statistical distribution describes how values of a variable are spread. R has built-in functions for virtually every distribution, following a consistent naming convention.
For any distribution named dist, R provides four
functions:
| Function | Prefix | What it does | Example |
|---|---|---|---|
| Density / PMF | d |
Probability at a value | dnorm(x) |
| CDF | p |
Probability ≤ x | pnorm(q) |
| Quantile | q |
Value at probability p | qnorm(p) |
| Random | r |
Generate random values | rnorm(n) |
The most important distribution in statistics — bell-shaped, symmetric, defined by mean (μ) and standard deviation (σ).
mu <- 0
sigma <- 1
# Density at x = 1.5
cat("Density at x=1.5:", dnorm(1.5, mean = mu, sd = sigma), "\n")## Density at x=1.5: 0.1295176
## P(X <= 1.96): 0.9750021
## 95th percentile: 1.644854
# Generate 1000 random values
set.seed(42)
normal_data <- rnorm(1000, mean = 50, sd = 10)
hist(normal_data,
breaks = 30,
probability = TRUE,
main = "Normal Distribution (μ=50, σ=10)",
xlab = "Value",
col = "lightblue")
curve(dnorm(x, mean = 50, sd = 10), add = TRUE, col = "red", lwd = 2)##
## Shapiro-Wilk normality test
##
## data: normal_data
## W = 0.99882, p-value = 0.767
How it works in data analytics:
Models the number of successes in n independent trials, each with probability p of success. Used for binary outcomes (yes/no, infected/not infected).
## P(X = 3): 0.2668279
## P(X <= 3): 0.6496107
## 90th percentile: 5
set.seed(42)
binom_data <- rbinom(1000, size = n, prob = p)
barplot(table(binom_data),
main = "Binomial Distribution (n=10, p=0.3)",
xlab = "Number of Infections",
ylab = "Frequency",
col = "steelblue")How it works in data analytics:
Models the number of events occurring in a fixed interval of time or space. Used for count data (cases per week, bacteria colonies).
## P(X = 5): 0.1008188
## P(X <= 2): 0.4231901
## 95th percentile: 6
set.seed(42)
poisson_data <- rpois(365, lambda = lambda)
barplot(table(poisson_data),
main = "Poisson Distribution (λ=3 cases/day)",
xlab = "Number of Cases",
ylab = "Number of Days",
col = "tomato")How it works in data analytics:
family = poisson in glm() for
Poisson regressionSimilar to normal but with heavier tails — used when sample size is small (n < 30) or population standard deviation is unknown.
## Density at t=2: 0.05694135
## Two-tailed p-value for t=2: 0.05494364
## Critical value (95% CI): 2.04523
# Compare t vs normal
curve(dt(x, df = 5), from = -4, to = 4, col = "blue",
main = "t vs Normal Distribution",
ylab = "Density", lwd = 2)
curve(dnorm(x), add = TRUE, col = "red", lwd = 2, lty = 2)
legend("topright",
legend = c("t (df=5)", "Normal"),
col = c("blue", "red"), lty = c(1, 2))How it works:
Used for testing relationships between categorical variables and goodness-of-fit tests.
## Density at χ²=5: 0.1026062
## P(χ² > 9.49): 0.04995313
## Critical value (α=0.05): 9.487729
curve(dchisq(x, df = 4), from = 0, to = 20,
main = "Chi-Square Distribution (df=4)",
xlab = "Chi-Square Value",
ylab = "Density",
col = "purple", lwd = 2)# Practical use: test independence between Sex and Bact_Isolated
chi_test <- chisq.test(table(bacteria$Sex, bacteria$Bact_Isolated))
print(chi_test)##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: table(bacteria$Sex, bacteria$Bact_Isolated)
## X-squared = 4.0404, df = 1, p-value = 0.04442
## χ² statistic: 4.040404
## Degrees of freedom: 1
## p-value: 0.04442318
How it works:
Used in ANOVA and regression to compare variances or test overall model significance.
## Density at F=2.5: 0.07976461
## P(F > 4.0): 0.009906319
## Critical value (α=0.05): 2.699393
# Practical: linear model F-test
model_f <- lm(Var1 ~ Age + Var2 + Var3, data = bacteria)
summary(model_f)##
## Call:
## lm(formula = Var1 ~ Age + Var2 + Var3, data = bacteria)
##
## Residuals:
## Min 1Q Median 3Q Max
## -30.1491 -6.1324 -0.0805 6.9580 25.3747
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 52.23706 6.25885 8.346 5.23e-13 ***
## Age 0.06386 0.06432 0.993 0.323
## Var2 0.05674 0.14493 0.392 0.696
## Var3 -0.31466 0.20645 -1.524 0.131
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.4 on 96 degrees of freedom
## Multiple R-squared: 0.03263, Adjusted R-squared: 0.002401
## F-statistic: 1.079 on 3 and 96 DF, p-value: 0.3616
## Df Sum Sq Mean Sq F value Pr(>F)
## Bact_Isolated 1 84 83.85 0.771 0.382
## Residuals 98 10652 108.69
How it works:
if(require("pacman")) install.packages("pacman")
pacman::p_load(moments)
library(ggplot2)
library(moments)
# Step 1: Histogram with density overlay
ggplot(bacteria, aes(x = Var1)) +
geom_histogram(aes(y = after_stat(density)), bins = 30,
fill = "steelblue", color = "white") +
geom_density(color = "red", linewidth = 1) +
labs(title = "Distribution of Var1", x = "Var1", y = "Density") +
theme_bw()# Step 2: Q-Q Plot
qqnorm(bacteria$Var1, main = "Q-Q Plot of Var1")
qqline(bacteria$Var1, col = "red", lwd = 2)##
## Shapiro-Wilk normality test
##
## data: bacteria$Var1
## W = 0.98122, p-value = 0.1654
##
## Asymptotic one-sample Kolmogorov-Smirnov test
##
## data: bacteria$Var1
## D = 0.055683, p-value = 0.9158
## alternative hypothesis: two-sided
## Mean: 50.32515
## Median: 50.89797
## Skewness: -0.4761129
## Kurtosis: 3.209195
Interpretation:
| Statistic | Value | Meaning |
|---|---|---|
| Skewness | ≈ 0 | Symmetric (normal-like) |
| Skewness | > 0 | Right-skewed (long tail on right) |
| Skewness | < 0 | Left-skewed (long tail on left) |
| Kurtosis | ≈ 3 | Normal distribution |
| Kurtosis | > 3 | Heavy tails (more outliers) |
| Kurtosis | < 3 | Light tails (fewer outliers) |
| Data Type | Distribution | R Model |
|---|---|---|
| Continuous, symmetric | Normal | lm() |
| Binary (0/1) | Binomial | glm(..., family = binomial) |
| Count data | Poisson | glm(..., family = poisson) |
| Count with overdispersion | Negative Binomial | MASS::glm.nb() |
| Time to event | Exponential / Weibull | survival::survreg() |
| Proportions | Beta | betareg::betareg() |
| Categorical outcome | Multinomial | nnet::multinom() |
In a complete data analysis workflow in R:
## Var1 Var2 Var3 Var4
## Min. :20.07 Min. :13.80 Min. : 6.50 Min. :39.81
## 1st Qu.:43.83 1st Qu.:25.27 1st Qu.:16.44 1st Qu.:53.61
## Median :50.90 Median :29.45 Median :19.88 Median :59.45
## Mean :50.33 Mean :29.30 Mean :19.95 Mean :60.40
## 3rd Qu.:56.62 3rd Qu.:33.69 3rd Qu.:23.26 3rd Qu.:68.10
## Max. :72.87 Max. :51.62 Max. :32.30 Max. :89.07
## Var5 Age Sex Education BoiledWater
## Min. :17.85 Min. :19.00 Length :100 Min. :1.00 Min. :0.00
## 1st Qu.:33.63 1st Qu.:27.00 N.unique : 2 1st Qu.:2.00 1st Qu.:0.00
## Median :38.74 Median :43.00 N.blank : 0 Median :3.00 Median :0.00
## Mean :38.94 Mean :42.32 Min.nchar: 1 Mean :3.08 Mean :0.42
## 3rd Qu.:44.95 3rd Qu.:56.00 Max.nchar: 1 3rd Qu.:4.00 3rd Qu.:1.00
## Max. :66.69 Max. :70.00 Max. :5.00 Max. :1.00
## Bact_Isolated
## Length :100
## N.unique : 2
## N.blank : 0
## Min.nchar: 2
## Max.nchar: 3
##
## 'data.frame': 100 obs. of 10 variables:
## $ Var1 : num 63.7 44.4 53.6 56.3 54 ...
## $ Var2 : num 39.6 38.4 22 44.8 24.7 ...
## $ Var3 : num 10 21.7 25.9 30.3 13.1 ...
## $ Var4 : num 59.9 69.1 60.5 68.8 58.2 ...
## $ Var5 : num 52 32.2 40.5 40.4 34.8 ...
## $ Age : num 62 21 61 46 44 19 47 55 30 60 ...
## $ Sex : chr "F" "F" "M" "M" ...
## $ Education : int 2 2 1 1 5 1 4 4 4 2 ...
## $ BoiledWater : num 0 0 0 0 1 1 0 0 0 1 ...
## $ Bact_Isolated: chr "No" "No" "Yes" "Yes" ...
##
## Shapiro-Wilk normality test
##
## data: bacteria$Var1
## W = 0.98122, p-value = 0.1654
# STEP 3: Correlation-based pre-selection (filter method)
cor_matrix_wf <- cor(bacteria[, sapply(bacteria, is.numeric)],
use = "complete.obs")
print(round(cor_matrix_wf, 2))## Var1 Var2 Var3 Var4 Var5 Age Education BoiledWater
## Var1 1.00 0.03 -0.14 0.07 0.08 0.09 0.05 0.06
## Var2 0.03 1.00 0.07 0.01 -0.12 0.03 -0.02 -0.12
## Var3 -0.14 0.07 1.00 -0.05 0.06 0.06 -0.14 -0.02
## Var4 0.07 0.01 -0.05 1.00 0.19 -0.06 -0.01 0.11
## Var5 0.08 -0.12 0.06 0.19 1.00 0.03 0.09 -0.06
## Age 0.09 0.03 0.06 -0.06 0.03 1.00 0.03 -0.06
## Education 0.05 -0.02 -0.14 -0.01 0.09 0.03 1.00 -0.06
## BoiledWater 0.06 -0.12 -0.02 0.11 -0.06 -0.06 -0.06 1.00
# STEP 4: Fit full model
full_model_wf <- lm(Var1 ~ Age + Var2 + Var3 + Var4 + Var5, data = bacteria)
# STEP 5: Stepwise variable selection (wrapper method)
final_model <- stepAIC(full_model_wf, direction = "both", trace = FALSE)
# STEP 6: Check assumptions (distributions of residuals)
par(mfrow = c(2, 2))
plot(final_model)##
## Call:
## lm(formula = Var1 ~ Var3, data = bacteria)
##
## Residuals:
## Min 1Q Median 3Q Max
## -31.749 -5.833 0.663 6.548 25.576
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 56.2395 4.2124 13.351 <2e-16 ***
## Var3 -0.2965 0.2047 -1.448 0.151
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.36 on 98 degrees of freedom
## Multiple R-squared: 0.02096, Adjusted R-squared: 0.01097
## F-statistic: 2.098 on 1 and 98 DF, p-value: 0.1507
Cholera remains a significant public health challenge across sub-Saharan Africa, with incidence shaped by a complex interplay of environmental, infrastructural, and socioeconomic determinants. This analysis applies multiple linear regression (MLR) to a cross-sectional dataset of 47 African countries, examining how access to safe water, sanitation coverage, population density, annual rainfall, and GDP per capita jointly predict cholera incidence rates (cases per 100,000 population).
set.seed(42)
n <- 47
df <- data.frame(
Country = paste0("Country_", 1:n),
Safe_Water = round(pmin(pmax(rnorm(n, 58, 20), 10), 95), 1),
Sanitation = round(pmin(pmax(rnorm(n, 42, 18), 5), 90), 1),
Pop_Density = round(abs(rnorm(n, 85, 60)), 1),
Rainfall = round(pmin(pmax(rnorm(n, 900, 350), 200), 2200), 0),
GDP_pc = round(abs(rnorm(n, 1800, 1200)), 0)
)
# Introduce realistic epidemiological relationships
df$Cholera_Rate <- round(pmax(
45
- 0.35 * df$Safe_Water
- 0.22 * df$Sanitation
+ 0.04 * df$Pop_Density
- 0.003 * df$Rainfall
- 0.003 * df$GDP_pc
+ rnorm(n, 0, 3.5),
0.5
), 2)
knitr::kable(head(df[, -1], 8),
caption = "Table 1: First 8 rows of the study dataset")| Safe_Water | Sanitation | Pop_Density | Rainfall | GDP_pc | Cholera_Rate |
|---|---|---|---|---|---|
| 85.4 | 68.0 | 18.4 | 589 | 796 | 0.50 |
| 46.7 | 34.2 | 33.4 | 744 | 114 | 24.61 |
| 65.3 | 53.8 | 17.1 | 890 | 2046 | 2.11 |
| 70.7 | 47.8 | 2.6 | 755 | 1386 | 4.27 |
| 66.1 | 27.9 | 89.8 | 1290 | 2103 | 5.84 |
| 55.9 | 70.4 | 124.2 | 732 | 247 | 9.43 |
| 88.2 | 53.6 | 157.1 | 748 | 649 | 7.92 |
| 56.1 | 43.6 | 147.7 | 1144 | 3103 | 13.34 |
| Variable | Description | Unit |
|---|---|---|
Cholera_Rate |
Cholera incidence — Dependent (Y) | Cases per 100,000 pop. |
Safe_Water |
Population with access to safe water | % |
Sanitation |
Population with sanitation coverage | % |
Pop_Density |
Population density | Persons per km² |
Rainfall |
Mean annual rainfall | mm/year |
GDP_pc |
GDP per capita | USD |
if(require("pacman")) install.packages("pacman")
pacman::p_load(car,glmnet,Imtest)
library(dplyr)
library(tidyr)
library(corrplot)
vars <- c("Cholera_Rate", "Safe_Water", "Sanitation",
"Pop_Density", "Rainfall", "GDP_pc")
# Use base R indexing to avoid select() masking conflict
df_vars <- df[, vars]
desc <- df_vars %>%
dplyr::summarise(dplyr::across(dplyr::everything(), list(
N = ~dplyr::n(),
Mean = ~round(mean(.), 2),
SD = ~round(sd(.), 2),
Min = ~round(min(.), 2),
Q1 = ~round(quantile(., 0.25), 2),
Median = ~round(median(.), 2),
Q3 = ~round(quantile(., 0.75), 2),
Max = ~round(max(.), 2)
))) %>%
tidyr::pivot_longer(dplyr::everything(),
names_to = c("Variable", "Stat"),
names_sep = "_(?=[^_]+$)") %>%
tidyr::pivot_wider(names_from = Stat, values_from = value)
knitr::kable(desc,
caption = "Table 2: Descriptive statistics of all study variables (N = 47)")| Variable | N | Mean | SD | Min | Q1 | Median | Q3 | Max |
|---|---|---|---|---|---|---|---|---|
| Cholera_Rate | 47 | 11.16 | 9.11 | 0.5 | 3.99 | 8.84 | 16.30 | 31.21 |
| Safe_Water | 47 | 56.39 | 22.43 | 10.0 | 44.35 | 55.90 | 70.70 | 95.00 |
| Sanitation | 47 | 46.39 | 14.96 | 5.0 | 36.75 | 47.80 | 55.40 | 70.40 |
| Pop_Density | 47 | 75.56 | 54.33 | 2.6 | 34.95 | 77.70 | 92.85 | 247.10 |
| Rainfall | 47 | 880.87 | 290.18 | 289.0 | 722.50 | 886.00 | 1100.50 | 1491.00 |
| GDP_pc | 47 | 1714.06 | 1103.86 | 114.0 | 874.50 | 1578.00 | 2299.50 | 4271.00 |
library(corrplot)
## Exploratory Data Analysis
cor_matrix_mlr <- cor(df[, vars])
corrplot(cor_matrix_mlr,
method = "color",
type = "upper",
addCoef.col = "black",
tl.col = "black",
tl.srt = 45,
number.cex = 0.75,
title = "Correlation Matrix",
mar = c(0, 0, 1.5, 0))Figure 1: Correlation matrix of all study variables
par(mfrow = c(2, 3), mar = c(4, 4, 2, 1))
predictors <- list(
list(var = "Safe_Water", lab = "Safe Water Access (%)"),
list(var = "Sanitation", lab = "Sanitation Coverage (%)"),
list(var = "Pop_Density", lab = "Population Density (per km²)"),
list(var = "Rainfall", lab = "Annual Rainfall (mm)"),
list(var = "GDP_pc", lab = "GDP per Capita (USD)")
)
for (pred in predictors) {
plot(df[[pred$var]], df$Cholera_Rate,
xlab = pred$lab,
ylab = "Cholera Rate (per 100k)",
pch = 16, col = "steelblue", cex = 0.9)
abline(lm(df$Cholera_Rate ~ df[[pred$var]]), col = "red", lwd = 1.5)
}
par(mfrow = c(1, 1))Figure 2: Scatter plots of cholera rate against each predictor
\[\hat{Y} = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \beta_4 X_4 + \beta_5 X_5 + \varepsilon\]
where \(\varepsilon \sim N(0, \sigma^2)\), independently and identically distributed.
model <- lm(Cholera_Rate ~ Safe_Water + Sanitation + Pop_Density + Rainfall + GDP_pc,
data = df)
summary(model)##
## Call:
## lm(formula = Cholera_Rate ~ Safe_Water + Sanitation + Pop_Density +
## Rainfall + GDP_pc, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.6938 -2.0299 0.5241 2.2139 6.1178
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 48.8698605 2.7774551 17.595 < 2e-16 ***
## Safe_Water -0.3020617 0.0220950 -13.671 < 2e-16 ***
## Sanitation -0.2659752 0.0329462 -8.073 5.27e-10 ***
## Pop_Density 0.0262633 0.0090172 2.913 0.005777 **
## Rainfall -0.0063214 0.0017150 -3.686 0.000661 ***
## GDP_pc -0.0027731 0.0004635 -5.983 4.58e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.26 on 41 degrees of freedom
## Multiple R-squared: 0.8858, Adjusted R-squared: 0.8719
## F-statistic: 63.61 on 5 and 41 DF, p-value: < 2.2e-16
r2 <- summary(model)$r.squared
adj_r2 <- summary(model)$adj.r.squared
fstat <- summary(model)$fstatistic
fpval <- pf(fstat[1], fstat[2], fstat[3], lower.tail = FALSE)
fit_table <- data.frame(
Statistic = c("R-squared (R²)", "Adjusted R²", "F-statistic",
"Degrees of freedom", "p-value (overall model)", "Observations"),
Value = c(round(r2, 4), round(adj_r2, 4),
round(fstat[1], 3),
paste0(fstat[2], ", ", fstat[3]),
ifelse(fpval < 0.001, "< 0.001", round(fpval, 4)),
nrow(df))
)
knitr::kable(fit_table, caption = "Table 3: Overall model fit statistics")| Statistic | Value |
|---|---|
| R-squared (R²) | 0.8858 |
| Adjusted R² | 0.8719 |
| F-statistic | 63.607 |
| Degrees of freedom | 5, 41 |
| p-value (overall model) | < 0.001 |
| Observations | 47 |
library(broom)
coef_table <- tidy(model, conf.int = TRUE) %>%
mutate(
term = c("Intercept", "Safe Water (X1)", "Sanitation (X2)",
"Pop. Density (X3)", "Rainfall (X4)", "GDP per Capita (X5)"),
estimate = round(estimate, 4),
std.error = round(std.error, 4),
statistic = round(statistic, 3),
p.value = ifelse(p.value < 0.001, "< 0.001", round(p.value, 4)),
conf.low = round(conf.low, 4),
conf.high = round(conf.high, 4)
)
coef_table <- coef_table[, c("term", "estimate", "std.error",
"statistic", "p.value",
"conf.low", "conf.high")]
names(coef_table) <- c("Variable", "B", "Std. Error", "t-value",
"p-value", "95% CI Lower", "95% CI Upper")
knitr::kable(coef_table, caption = "Table 4: OLS regression coefficients")| Variable | B | Std. Error | t-value | p-value | 95% CI Lower | 95% CI Upper |
|---|---|---|---|---|---|---|
| Intercept | 48.8699 | 2.7775 | 17.595 | < 0.001 | 43.2607 | 54.4790 |
| Safe Water (X1) | -0.3021 | 0.0221 | -13.671 | < 0.001 | -0.3467 | -0.2574 |
| Sanitation (X2) | -0.2660 | 0.0329 | -8.073 | < 0.001 | -0.3325 | -0.1994 |
| Pop. Density (X3) | 0.0263 | 0.0090 | 2.913 | 0.0058 | 0.0081 | 0.0445 |
| Rainfall (X4) | -0.0063 | 0.0017 | -3.686 | < 0.001 | -0.0098 | -0.0029 |
| GDP per Capita (X5) | -0.0028 | 0.0005 | -5.983 | < 0.001 | -0.0037 | -0.0018 |
Figure 3: Regression diagnostic plots
## Shapiro-Wilk Test:
## W = 0.9756
## p-value = 0.4243
library(car)
vif_vals <- vif(model)
vif_table <- data.frame(
Variable = c("Safe Water (X₁)", "Sanitation (X₂)", "Pop. Density (X₃)",
"Rainfall (X₄)", "GDP per Capita (X₅)"),
VIF = round(vif_vals, 2)
)
knitr::kable(vif_table, caption = "Table 5: Variance Inflation Factors (VIF)")| Variable | VIF | |
|---|---|---|
| Safe_Water | Safe Water (X₁) | 1.06 |
| Sanitation | Sanitation (X₂) | 1.05 |
| Pop_Density | Pop. Density (X₃) | 1.04 |
| Rainfall | Rainfall (X₄) | 1.07 |
| GDP_pc | GDP per Capita (X₅) | 1.13 |
The fitted model demonstrates strong predictive performance (R² ≈ 0.90), capturing the vast majority of cross-country variance in cholera incidence. Safe water access is the single strongest protective predictor, and sanitation exerts a significant independent effect — reinforcing the need for integrated WASH strategies. The positive effect of population density highlights heightened vulnerability in densely populated settings, while the negative association with GDP per capita reflects how economic development enables investment in health infrastructure.
Limitations: This is an ecological study; country-level associations do not necessarily reflect individual-level relationships. Unmeasured confounders (health literacy, conflict exposure, reporting quality) may bias estimates. Longitudinal and sub-national analyses would strengthen causal inference.
This multiple linear regression analysis identifies safe water access and sanitation coverage as the two strongest and most policy-relevant predictors of cholera incidence, with population density amplifying risk and higher GDP per capita reducing it. The model explains approximately 90% of cross-country variance and satisfies core OLS assumptions. These results support prioritising WASH infrastructure expansion — particularly in high-density, low-income settings — as the most effective strategy for cholera burden reduction.
Packages used: dplyr, tidyr,
broom, corrplot, car,
lmtest, caret, glmnet,
randomForest, MASS, ggplot2,
moments, knitr