1 Introduction

In data analytics, two fundamental concepts drive the quality of any statistical analysis: variable selection (choosing which variables to include in a model) and statistical distributions (understanding how data is spread and behaves). In R, both are handled through a rich set of built-in functions and packages. This document explains each concept in detail, with examples and R code.

2 PART 1: VARIABLE SELECTION METHODS

Variable selection (also called feature selection) is the process of identifying the most relevant variables (predictors) to include in a statistical model. Including too many irrelevant variables causes overfitting, while too few causes underfitting.

2.1 Why Variable Selection Matters

When building a regression or classification model:

Too many variables → model memorises noise, performs poorly on new data (overfitting)
Too few variables → model misses important patterns (underfitting)
Right variables → model generalises well and gives meaningful interpretations

2.2 Method 1: Filter Methods (Correlation-Based Selection)

Filter methods evaluate each variable independently of the model, based on statistical measures like correlation, variance, or statistical tests.

2.2.1 Correlation Analysis

Correlation measures the linear relationship between two numeric variables. Values range from −1 (perfect negative) to +1 (perfect positive). Variables highly correlated with the outcome but not with each other are ideal predictors.

library(corrplot)
library(dplyr)

# Simulate a bacteria-like dataset for demonstration
set.seed(42)
n_demo <- 100
bacteria <- data.frame(
  Var1         = rnorm(n_demo, 50, 10),
  Var2         = rnorm(n_demo, 30, 8),
  Var3         = rnorm(n_demo, 20, 5),
  Var4         = rnorm(n_demo, 60, 12),
  Var5         = rnorm(n_demo, 40, 9),
  Age          = round(runif(n_demo, 18, 70)),
  Sex          = sample(c("M", "F"), n_demo, replace = TRUE),
  Education    = sample(1:5, n_demo, replace = TRUE),
  BoiledWater  = sample(c(0, 1), n_demo, replace = TRUE),
  Bact_Isolated = sample(c("Yes", "No"), n_demo, replace = TRUE)
)

# Compute correlation matrix for numeric variables
cor_matrix <- cor(
  bacteria[, c("Var1", "Var2", "Var3", "Var4", "Var5", "Age")],
  use = "complete.obs"
)

print(round(cor_matrix, 3))

##        Var1   Var2   Var3   Var4   Var5    Age
## Var1  1.000  0.031 -0.145  0.074  0.084  0.092
## Var2  0.031  1.000  0.071  0.009 -0.121  0.028
## Var3 -0.145  0.071  1.000 -0.046  0.059  0.061
## Var4  0.074  0.009 -0.046  1.000  0.185 -0.058
## Var5  0.084 -0.121  0.059  0.185  1.000  0.033
## Age   0.092  0.028  0.061 -0.058  0.033  1.000

# Visualise correlation matrix
corrplot(cor_matrix,
         method      = "color",
         type        = "upper",
         addCoef.col = "black",
         tl.cex      = 0.8,
         number.cex  = 0.7,
         title       = "Correlation Matrix",
         mar         = c(0, 0, 1.5, 0))

How it works:

Values close to 1 or −1 = strong relationship
Values close to 0 = weak/no relationship
Keep variables with high correlation to the outcome (Var1)
Remove variables highly correlated with each other (multicollinearity)

2.2.2 Variance Threshold

Variables with near-zero variance carry almost no information and should be removed.

library(caret)

# Find near-zero variance variables
nzv <- nearZeroVar(bacteria, saveMetrics = TRUE)
print(nzv)

##               freqRatio percentUnique zeroVar   nzv
## Var1           1.000000           100   FALSE FALSE
## Var2           1.000000           100   FALSE FALSE
## Var3           1.000000           100   FALSE FALSE
## Var4           1.000000           100   FALSE FALSE
## Var5           1.000000           100   FALSE FALSE
## Age            1.600000            47   FALSE FALSE
## Sex            1.000000             2   FALSE FALSE
## Education      1.090909             5   FALSE FALSE
## BoiledWater    1.380952             2   FALSE FALSE
## Bact_Isolated  1.222222             2   FALSE FALSE

# Remove near-zero variance variables
bacteria_clean <- bacteria[, !nzv$nzv]

How it works:

freqRatio: ratio of most common to second most common value
percentUnique: % of unique values
nzv = TRUE → near-zero variance → remove; nzv = FALSE → informative → keep

2.2.3 Statistical Tests for Selection

Use t-tests or ANOVA to check if a variable differs significantly across groups.

# T-test: does Var1 differ between Bact_Isolated groups?
t_test_result <- t.test(Var1 ~ Bact_Isolated, data = bacteria)
print(t_test_result)

## 
##  Welch Two Sample t-test
## 
## data:  Var1 by Bact_Isolated
## t = 0.8756, df = 92.936, p-value = 0.3835
## alternative hypothesis: true difference in means between group No and group Yes is not equal to 0
## 95 percent confidence interval:
##  -2.333806  6.015016
## sample estimates:
##  mean in group No mean in group Yes 
##          51.15342          49.31282

# ANOVA: does Age differ across bacteria types?
anova_result <- aov(Age ~ Bact_Isolated, data = bacteria)
summary(anova_result)

##               Df Sum Sq Mean Sq F value Pr(>F)
## Bact_Isolated  1    375   375.5   1.422  0.236
## Residuals     98  25884   264.1

How it works:

p-value < 0.05 → significant difference between groups → keep the variable
p-value > 0.05 → no significant difference → consider removing
The F-statistic measures variation between groups versus within groups

2.3 Method 2: Wrapper Methods

Wrapper methods use the model itself to evaluate subsets of variables. They are more accurate than filter methods but computationally expensive.

2.3.1 Forward Selection

Starts with no variables, adds one at a time, keeping only those that improve the model.

library(MASS)

# Empty model (intercept only)
model_empty <- lm(Var1 ~ 1, data = bacteria)

# Full model with all predictors
model_full <- lm(Var1 ~ Age + Var2 + Var3 + Var4 + Var5 +
                   Sex + Education + BoiledWater, data = bacteria)

# Forward selection using AIC
forward_model <- stepAIC(model_empty,
                          scope     = list(lower = model_empty,
                                           upper = model_full),
                          direction = "forward",
                          trace     = FALSE)

summary(forward_model)

## 
## Call:
## lm(formula = Var1 ~ Var3, data = bacteria)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -31.749  -5.833   0.663   6.548  25.576 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  56.2395     4.2124  13.351   <2e-16 ***
## Var3         -0.2965     0.2047  -1.448    0.151    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.36 on 98 degrees of freedom
## Multiple R-squared:  0.02096,    Adjusted R-squared:  0.01097 
## F-statistic: 2.098 on 1 and 98 DF,  p-value: 0.1507

How it works:

Step 1: Try adding each variable one at a time
Step 2: Add the variable that reduces AIC the most
Step 3: Repeat until no variable improves AIC
Lower AIC = better model fit with less complexity

2.3.2 Backward Elimination

Starts with all variables, removes one at a time, keeping only significant ones.

# Backward elimination
backward_model <- stepAIC(model_full,
                           direction = "backward",
                           trace     = FALSE)

summary(backward_model)

## 
## Call:
## lm(formula = Var1 ~ Var3, data = bacteria)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -31.749  -5.833   0.663   6.548  25.576 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  56.2395     4.2124  13.351   <2e-16 ***
## Var3         -0.2965     0.2047  -1.448    0.151    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.36 on 98 degrees of freedom
## Multiple R-squared:  0.02096,    Adjusted R-squared:  0.01097 
## F-statistic: 2.098 on 1 and 98 DF,  p-value: 0.1507

How it works:

Fit the full model with all variables
Remove the variable with the highest p-value (least significant)
Refit and repeat until all remaining variables are significant or removing any variable increases AIC

2.3.3 Stepwise Selection (Both Directions)

Combines forward and backward — adds and removes variables at each step.

# Stepwise selection (both directions)
stepwise_model <- stepAIC(model_full,
                           direction = "both",
                           trace     = FALSE)

summary(stepwise_model)

## 
## Call:
## lm(formula = Var1 ~ Var3, data = bacteria)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -31.749  -5.833   0.663   6.548  25.576 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  56.2395     4.2124  13.351   <2e-16 ***
## Var3         -0.2965     0.2047  -1.448    0.151    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.36 on 98 degrees of freedom
## Multiple R-squared:  0.02096,    Adjusted R-squared:  0.01097 
## F-statistic: 2.098 on 1 and 98 DF,  p-value: 0.1507

How it works: At each step, considers both adding a new variable and removing an existing variable, choosing the action that gives the lowest AIC. This is the most flexible and commonly used method.

2.4 Method 3: Embedded Methods

Embedded methods perform variable selection as part of the model fitting process itself.

2.4.1 LASSO Regression (L1 Regularisation)

LASSO penalises the model for having too many variables, automatically shrinking unimportant coefficients to exactly zero.

if(!require("pacman")) install.packages("pacman")
pacman::p_load(glmnet)

library(glmnet)


# Prepare matrix (glmnet requires a matrix, not a dataframe)
x <- model.matrix(Var1 ~ Age + Var2 + Var3 + Var4 + Var5,
                   data = bacteria)[, -1]
y <- bacteria$Var1

# Fit LASSO model
lasso_model <- glmnet(x, y, alpha = 1)
plot(lasso_model, xvar = "lambda", label = TRUE)

# Cross-validation to find best lambda
cv_lasso    <- cv.glmnet(x, y, alpha = 1)
best_lambda <- cv_lasso$lambda.min
cat("Best lambda:", best_lambda, "\n")

## Best lambda: 1.500092

# Coefficients at best lambda
lasso_coef <- coef(cv_lasso, s = "lambda.min")
print(lasso_coef)

## 6 x 1 sparse Matrix of class "dgCMatrix"
##             lambda.min
## (Intercept)   50.32515
## Age            .      
## Var2           .      
## Var3           .      
## Var4           .      
## Var5           .

How it works:

lambda is the penalty strength; higher lambda shrinks more variables to zero
Variables with coefficient = 0 are automatically excluded
alpha = 1 → LASSO; alpha = 0 → Ridge; 0 < alpha < 1 → Elastic Net

2.4.2 Ridge Regression (L2 Regularisation)

Ridge shrinks coefficients toward zero but never exactly to zero — good when all variables contribute a little.

cv_ridge   <- cv.glmnet(x, y, alpha = 0)
ridge_coef <- coef(cv_ridge, s = "lambda.min")
print(ridge_coef)

## 6 x 1 sparse Matrix of class "dgCMatrix"
##                lambda.min
## (Intercept) 50.2531014268
## Age          0.0006997459
## Var2         0.0005406106
## Var3        -0.0035387787
## Var4         0.0008744807
## Var5         0.0011394894

How it works: Uses a squared penalty (L2). Coefficients are shrunk but never reach zero. Better when many variables have small but real effects. Does not perform variable selection (keeps all variables).

2.4.3 Random Forest Variable Importance

Random forest ranks variables by how much each one improves prediction accuracy.

if(require("pacman")) install.packages("pacman")
pacman::p_load(randomForest)
library(randomForest)

# Fit random forest
rf_model <- randomForest(Var1 ~ Age + Var2 + Var3 + Var4 + Var5,
                          data       = bacteria,
                          ntree      = 500,
                          importance = TRUE)

# Plot variable importance
varImpPlot(rf_model, main = "Variable Importance — Random Forest")

# Importance values
importance(rf_model)

##        %IncMSE IncNodePurity
## Age  -3.679599      1367.400
## Var2 -2.152929      1885.721
## Var3  8.945107      2134.684
## Var4 -4.291806      1745.932
## Var5  4.460238      2161.223

How it works:

Builds 500 decision trees using random subsets of data and variables
%IncMSE: drop in accuracy when variable is removed — higher = more important
IncNodePurity: total reduction in node impurity — higher = more important
Variables at the top of the plot are most important → keep

2.5 Method 4: Model Comparison Using Information Criteria

AIC (Akaike Information Criterion) and BIC (Bayesian Information Criterion) penalise model complexity. Lower values indicate better models.

model1 <- lm(Var1 ~ Age,                         data = bacteria)
model2 <- lm(Var1 ~ Age + Var2,                  data = bacteria)
model3 <- lm(Var1 ~ Age + Var2 + Var3,           data = bacteria)
model4 <- lm(Var1 ~ Age + Var2 + Var3 + Var4 + Var5, data = bacteria)

# Compare AIC
AIC(model1, model2, model3, model4)

##        df      AIC
## model1  3 756.5611
## model2  4 758.4780
## model3  5 758.0871
## model4  7 760.8139

# Compare BIC
BIC(model1, model2, model3, model4)

##        df      BIC
## model1  3 764.3766
## model2  4 768.8986
## model3  5 771.1129
## model4  7 779.0501

Formulas:

AIC = 2k − 2ln(L), where k = number of parameters, L = likelihood
BIC = k·ln(n) − 2ln(L), where n = sample size (BIC penalises complexity more than AIC)
Lower AIC/BIC = better balance of fit and simplicity
Use AIC for prediction; BIC for explanation/inference

3 PART 2: STATISTICAL DISTRIBUTIONS IN R

A statistical distribution describes how values of a variable are spread. R has built-in functions for virtually every distribution, following a consistent naming convention.

3.1 R’s Distribution Function Naming Convention

For any distribution named dist, R provides four functions:

Function	Prefix	What it does	Example
Density / PMF	`d`	Probability at a value	`dnorm(x)`
CDF	`p`	Probability ≤ x	`pnorm(q)`
Quantile	`q`	Value at probability p	`qnorm(p)`
Random	`r`	Generate random values	`rnorm(n)`

3.2 Normal Distribution (Gaussian)

The most important distribution in statistics — bell-shaped, symmetric, defined by mean (μ) and standard deviation (σ).

mu    <- 0
sigma <- 1

# Density at x = 1.5
cat("Density at x=1.5:", dnorm(1.5, mean = mu, sd = sigma), "\n")

## Density at x=1.5: 0.1295176

# CDF: P(X <= 1.96)
cat("P(X <= 1.96):", pnorm(1.96, mean = 0, sd = 1), "\n")

## P(X <= 1.96): 0.9750021

# Quantile: 95th percentile
cat("95th percentile:", qnorm(0.95, mean = 0, sd = 1), "\n")

## 95th percentile: 1.644854

# Generate 1000 random values
set.seed(42)
normal_data <- rnorm(1000, mean = 50, sd = 10)

hist(normal_data,
     breaks      = 30,
     probability = TRUE,
     main        = "Normal Distribution (μ=50, σ=10)",
     xlab        = "Value",
     col         = "lightblue")
curve(dnorm(x, mean = 50, sd = 10), add = TRUE, col = "red", lwd = 2)

# Normality test
shapiro.test(normal_data)

## 
##  Shapiro-Wilk normality test
## 
## data:  normal_data
## W = 0.99882, p-value = 0.767

How it works in data analytics:

Most parametric tests (t-test, ANOVA, linear regression) assume normality of the outcome or residuals
Check normality with: histogram, Q-Q plot, Shapiro-Wilk test
The 68–95–99.7 rule: 68% within 1σ, 95% within 2σ, 99.7% within 3σ

3.3 Binomial Distribution

Models the number of successes in n independent trials, each with probability p of success. Used for binary outcomes (yes/no, infected/not infected).

n <- 10
p <- 0.3

cat("P(X = 3):", dbinom(3, size = n, prob = p), "\n")

## P(X = 3): 0.2668279

cat("P(X <= 3):", pbinom(3, size = n, prob = p), "\n")

## P(X <= 3): 0.6496107

cat("90th percentile:", qbinom(0.90, size = n, prob = p), "\n")

## 90th percentile: 5

set.seed(42)
binom_data <- rbinom(1000, size = n, prob = p)

barplot(table(binom_data),
        main  = "Binomial Distribution (n=10, p=0.3)",
        xlab  = "Number of Infections",
        ylab  = "Frequency",
        col   = "steelblue")

How it works in data analytics:

Used in logistic regression (predicting binary outcomes)
Used in chi-square tests (comparing proportions)
Mean = n·p; Variance = n·p·(1−p)

3.4 Poisson Distribution

Models the number of events occurring in a fixed interval of time or space. Used for count data (cases per week, bacteria colonies).

lambda <- 3

cat("P(X = 5):", dpois(5, lambda = lambda), "\n")

## P(X = 5): 0.1008188

cat("P(X <= 2):", ppois(2, lambda = lambda), "\n")

## P(X <= 2): 0.4231901

cat("95th percentile:", qpois(0.95, lambda = lambda), "\n")

## 95th percentile: 6

set.seed(42)
poisson_data <- rpois(365, lambda = lambda)

barplot(table(poisson_data),
        main  = "Poisson Distribution (λ=3 cases/day)",
        xlab  = "Number of Cases",
        ylab  = "Number of Days",
        col   = "tomato")

How it works in data analytics:

Mean = Variance = lambda (key property)
If Variance >> Mean → overdispersion → use Negative Binomial instead
In R: family = poisson in glm() for Poisson regression

3.5 t-Distribution

Similar to normal but with heavier tails — used when sample size is small (n < 30) or population standard deviation is unknown.

df_t <- 29  # degrees of freedom (n - 1 for n = 30)

cat("Density at t=2:", dt(2, df = df_t), "\n")

## Density at t=2: 0.05694135

cat("Two-tailed p-value for t=2:", 2 * pt(-abs(2), df = df_t), "\n")

## Two-tailed p-value for t=2: 0.05494364

cat("Critical value (95% CI):", qt(0.975, df = df_t), "\n")

## Critical value (95% CI): 2.04523

# Compare t vs normal
curve(dt(x, df = 5), from = -4, to = 4, col = "blue",
      main = "t vs Normal Distribution",
      ylab = "Density", lwd = 2)
curve(dnorm(x), add = TRUE, col = "red", lwd = 2, lty = 2)
legend("topright",
       legend = c("t (df=5)", "Normal"),
       col = c("blue", "red"), lty = c(1, 2))

How it works:

With large samples (n > 30), t-distribution ≈ normal distribution
With small samples, t-distribution has fatter tails
Fatter tails → need larger t-value to reject H₀ → more conservative

3.6 Chi-Square Distribution

Used for testing relationships between categorical variables and goodness-of-fit tests.

df_chi <- 4

cat("Density at χ²=5:", dchisq(5, df = df_chi), "\n")

## Density at χ²=5: 0.1026062

cat("P(χ² > 9.49):", 1 - pchisq(9.49, df = df_chi), "\n")

## P(χ² > 9.49): 0.04995313

cat("Critical value (α=0.05):", qchisq(0.95, df = df_chi), "\n")

## Critical value (α=0.05): 9.487729

curve(dchisq(x, df = 4), from = 0, to = 20,
      main = "Chi-Square Distribution (df=4)",
      xlab = "Chi-Square Value",
      ylab = "Density",
      col  = "purple", lwd = 2)

# Practical use: test independence between Sex and Bact_Isolated
chi_test <- chisq.test(table(bacteria$Sex, bacteria$Bact_Isolated))
print(chi_test)

## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  table(bacteria$Sex, bacteria$Bact_Isolated)
## X-squared = 4.0404, df = 1, p-value = 0.04442

cat("χ² statistic:", chi_test$statistic, "\n")

## χ² statistic: 4.040404

cat("Degrees of freedom:", chi_test$parameter, "\n")

## Degrees of freedom: 1

cat("p-value:", chi_test$p.value, "\n")

## p-value: 0.04442318

How it works:

χ² = Σ (Observed − Expected)² / Expected
Large χ² → big difference from expected → reject H₀
df = (rows − 1) × (columns − 1) for contingency tables

3.7 F-Distribution

Used in ANOVA and regression to compare variances or test overall model significance.

df1 <- 3
df2 <- 96

cat("Density at F=2.5:", df(2.5, df1 = df1, df2 = df2), "\n")

## Density at F=2.5: 0.07976461

cat("P(F > 4.0):", 1 - pf(4.0, df1 = df1, df2 = df2), "\n")

## P(F > 4.0): 0.009906319

cat("Critical value (α=0.05):", qf(0.95, df1 = df1, df2 = df2), "\n")

## Critical value (α=0.05): 2.699393

# Practical: linear model F-test
model_f <- lm(Var1 ~ Age + Var2 + Var3, data = bacteria)
summary(model_f)

## 
## Call:
## lm(formula = Var1 ~ Age + Var2 + Var3, data = bacteria)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -30.1491  -6.1324  -0.0805   6.9580  25.3747 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 52.23706    6.25885   8.346 5.23e-13 ***
## Age          0.06386    0.06432   0.993    0.323    
## Var2         0.05674    0.14493   0.392    0.696    
## Var3        -0.31466    0.20645  -1.524    0.131    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.4 on 96 degrees of freedom
## Multiple R-squared:  0.03263,    Adjusted R-squared:  0.002401 
## F-statistic: 1.079 on 3 and 96 DF,  p-value: 0.3616

# ANOVA
anova_model <- aov(Var1 ~ Bact_Isolated, data = bacteria)
summary(anova_model)

##               Df Sum Sq Mean Sq F value Pr(>F)
## Bact_Isolated  1     84   83.85   0.771  0.382
## Residuals     98  10652  108.69

How it works:

F = (explained variance / df1) / (unexplained variance / df2)
F > critical value → reject H₀ → model/group differences are significant

3.8 Checking Distributions in Practice

if(require("pacman")) install.packages("pacman")
pacman::p_load(moments)
library(ggplot2)
library(moments)

# Step 1: Histogram with density overlay
ggplot(bacteria, aes(x = Var1)) +
  geom_histogram(aes(y = after_stat(density)), bins = 30,
                 fill = "steelblue", color = "white") +
  geom_density(color = "red", linewidth = 1) +
  labs(title = "Distribution of Var1", x = "Var1", y = "Density") +
  theme_bw()

# Step 2: Q-Q Plot
qqnorm(bacteria$Var1, main = "Q-Q Plot of Var1")
qqline(bacteria$Var1, col = "red", lwd = 2)

# Step 3: Normality tests
shapiro.test(bacteria$Var1)

## 
##  Shapiro-Wilk normality test
## 
## data:  bacteria$Var1
## W = 0.98122, p-value = 0.1654

ks.test(bacteria$Var1, "pnorm",
        mean(bacteria$Var1, na.rm = TRUE),
        sd(bacteria$Var1,   na.rm = TRUE))

## 
##  Asymptotic one-sample Kolmogorov-Smirnov test
## 
## data:  bacteria$Var1
## D = 0.055683, p-value = 0.9158
## alternative hypothesis: two-sided

# Step 4: Descriptive statistics
cat("Mean:",     mean(bacteria$Var1, na.rm = TRUE), "\n")

## Mean: 50.32515

cat("Median:",   median(bacteria$Var1, na.rm = TRUE), "\n")

## Median: 50.89797

cat("Skewness:", skewness(bacteria$Var1, na.rm = TRUE), "\n")

## Skewness: -0.4761129

cat("Kurtosis:", kurtosis(bacteria$Var1, na.rm = TRUE), "\n")

## Kurtosis: 3.209195

Interpretation:

Statistic	Value	Meaning
Skewness	≈ 0	Symmetric (normal-like)
Skewness	> 0	Right-skewed (long tail on right)
Skewness	< 0	Left-skewed (long tail on left)
Kurtosis	≈ 3	Normal distribution
Kurtosis	> 3	Heavy tails (more outliers)
Kurtosis	< 3	Light tails (fewer outliers)

3.9 Choosing the Right Distribution for Your Model

Data Type	Distribution	R Model
Continuous, symmetric	Normal	`lm()`
Binary (0/1)	Binomial	`glm(..., family = binomial)`
Count data	Poisson	`glm(..., family = poisson)`
Count with overdispersion	Negative Binomial	`MASS::glm.nb()`
Time to event	Exponential / Weibull	`survival::survreg()`
Proportions	Beta	`betareg::betareg()`
Categorical outcome	Multinomial	`nnet::multinom()`

4 PART 3: COMBINING VARIABLE SELECTION AND DISTRIBUTIONS

In a complete data analysis workflow in R:

# STEP 1: Explore distributions
summary(bacteria)

##       Var1            Var2            Var3            Var4      
##  Min.   :20.07   Min.   :13.80   Min.   : 6.50   Min.   :39.81  
##  1st Qu.:43.83   1st Qu.:25.27   1st Qu.:16.44   1st Qu.:53.61  
##  Median :50.90   Median :29.45   Median :19.88   Median :59.45  
##  Mean   :50.33   Mean   :29.30   Mean   :19.95   Mean   :60.40  
##  3rd Qu.:56.62   3rd Qu.:33.69   3rd Qu.:23.26   3rd Qu.:68.10  
##  Max.   :72.87   Max.   :51.62   Max.   :32.30   Max.   :89.07  
##       Var5            Age               Sex        Education     BoiledWater  
##  Min.   :17.85   Min.   :19.00   Length   :100   Min.   :1.00   Min.   :0.00  
##  1st Qu.:33.63   1st Qu.:27.00   N.unique :  2   1st Qu.:2.00   1st Qu.:0.00  
##  Median :38.74   Median :43.00   N.blank  :  0   Median :3.00   Median :0.00  
##  Mean   :38.94   Mean   :42.32   Min.nchar:  1   Mean   :3.08   Mean   :0.42  
##  3rd Qu.:44.95   3rd Qu.:56.00   Max.nchar:  1   3rd Qu.:4.00   3rd Qu.:1.00  
##  Max.   :66.69   Max.   :70.00                   Max.   :5.00   Max.   :1.00  
##    Bact_Isolated
##  Length   :100  
##  N.unique :  2  
##  N.blank  :  0  
##  Min.nchar:  2  
##  Max.nchar:  3  
##

str(bacteria)

## 'data.frame':    100 obs. of  10 variables:
##  $ Var1         : num  63.7 44.4 53.6 56.3 54 ...
##  $ Var2         : num  39.6 38.4 22 44.8 24.7 ...
##  $ Var3         : num  10 21.7 25.9 30.3 13.1 ...
##  $ Var4         : num  59.9 69.1 60.5 68.8 58.2 ...
##  $ Var5         : num  52 32.2 40.5 40.4 34.8 ...
##  $ Age          : num  62 21 61 46 44 19 47 55 30 60 ...
##  $ Sex          : chr  "F" "F" "M" "M" ...
##  $ Education    : int  2 2 1 1 5 1 4 4 4 2 ...
##  $ BoiledWater  : num  0 0 0 0 1 1 0 0 0 1 ...
##  $ Bact_Isolated: chr  "No" "No" "Yes" "Yes" ...

# STEP 2: Check normality of outcome
shapiro.test(bacteria$Var1)

## 
##  Shapiro-Wilk normality test
## 
## data:  bacteria$Var1
## W = 0.98122, p-value = 0.1654

hist(bacteria$Var1, main = "Distribution of Var1",
     col = "lightblue", xlab = "Var1")

# STEP 3: Correlation-based pre-selection (filter method)
cor_matrix_wf <- cor(bacteria[, sapply(bacteria, is.numeric)],
                     use = "complete.obs")
print(round(cor_matrix_wf, 2))

##              Var1  Var2  Var3  Var4  Var5   Age Education BoiledWater
## Var1         1.00  0.03 -0.14  0.07  0.08  0.09      0.05        0.06
## Var2         0.03  1.00  0.07  0.01 -0.12  0.03     -0.02       -0.12
## Var3        -0.14  0.07  1.00 -0.05  0.06  0.06     -0.14       -0.02
## Var4         0.07  0.01 -0.05  1.00  0.19 -0.06     -0.01        0.11
## Var5         0.08 -0.12  0.06  0.19  1.00  0.03      0.09       -0.06
## Age          0.09  0.03  0.06 -0.06  0.03  1.00      0.03       -0.06
## Education    0.05 -0.02 -0.14 -0.01  0.09  0.03      1.00       -0.06
## BoiledWater  0.06 -0.12 -0.02  0.11 -0.06 -0.06     -0.06        1.00

# STEP 4: Fit full model
full_model_wf <- lm(Var1 ~ Age + Var2 + Var3 + Var4 + Var5, data = bacteria)

# STEP 5: Stepwise variable selection (wrapper method)
final_model <- stepAIC(full_model_wf, direction = "both", trace = FALSE)

# STEP 6: Check assumptions (distributions of residuals)
par(mfrow = c(2, 2))
plot(final_model)

par(mfrow = c(1, 1))

# STEP 7: Interpret final model
summary(final_model)

## 
## Call:
## lm(formula = Var1 ~ Var3, data = bacteria)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -31.749  -5.833   0.663   6.548  25.576 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  56.2395     4.2124  13.351   <2e-16 ***
## Var3         -0.2965     0.2047  -1.448    0.151    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.36 on 98 degrees of freedom
## Multiple R-squared:  0.02096,    Adjusted R-squared:  0.01097 
## F-statistic: 2.098 on 1 and 98 DF,  p-value: 0.1507

5 PART 4: MULTIPLE LINEAR REGRESSION — CHOLERA INCIDENCE RATES

5.1 Introduction

Cholera remains a significant public health challenge across sub-Saharan Africa, with incidence shaped by a complex interplay of environmental, infrastructural, and socioeconomic determinants. This analysis applies multiple linear regression (MLR) to a cross-sectional dataset of 47 African countries, examining how access to safe water, sanitation coverage, population density, annual rainfall, and GDP per capita jointly predict cholera incidence rates (cases per 100,000 population).

5.2 Dataset and Variables

set.seed(42)
n <- 47

df <- data.frame(
  Country     = paste0("Country_", 1:n),
  Safe_Water  = round(pmin(pmax(rnorm(n, 58, 20), 10), 95), 1),
  Sanitation  = round(pmin(pmax(rnorm(n, 42, 18),  5), 90), 1),
  Pop_Density = round(abs(rnorm(n, 85, 60)), 1),
  Rainfall    = round(pmin(pmax(rnorm(n, 900, 350), 200), 2200), 0),
  GDP_pc      = round(abs(rnorm(n, 1800, 1200)), 0)
)

# Introduce realistic epidemiological relationships
df$Cholera_Rate <- round(pmax(
  45
  - 0.35 * df$Safe_Water
  - 0.22 * df$Sanitation
  + 0.04 * df$Pop_Density
  - 0.003 * df$Rainfall
  - 0.003 * df$GDP_pc
  + rnorm(n, 0, 3.5),
  0.5
), 2)

knitr::kable(head(df[, -1], 8),
             caption = "Table 1: First 8 rows of the study dataset")

Table 1: First 8 rows of the study dataset
Safe_Water	Sanitation	Pop_Density	Rainfall	GDP_pc	Cholera_Rate
85.4	68.0	18.4	589	796	0.50
46.7	34.2	33.4	744	114	24.61
65.3	53.8	17.1	890	2046	2.11
70.7	47.8	2.6	755	1386	4.27
66.1	27.9	89.8	1290	2103	5.84
55.9	70.4	124.2	732	247	9.43
88.2	53.6	157.1	748	649	7.92
56.1	43.6	147.7	1144	3103	13.34

Variable	Description	Unit
`Cholera_Rate`	Cholera incidence — Dependent (Y)	Cases per 100,000 pop.
`Safe_Water`	Population with access to safe water	%
`Sanitation`	Population with sanitation coverage	%
`Pop_Density`	Population density	Persons per km²
`Rainfall`	Mean annual rainfall	mm/year
`GDP_pc`	GDP per capita	USD

5.3 Descriptive Statistics

if(require("pacman")) install.packages("pacman")
pacman::p_load(car,glmnet,Imtest)
library(dplyr)
library(tidyr)
library(corrplot)

vars <- c("Cholera_Rate", "Safe_Water", "Sanitation",
          "Pop_Density",  "Rainfall",   "GDP_pc")

# Use base R indexing to avoid select() masking conflict
df_vars <- df[, vars]

desc <- df_vars %>%
  dplyr::summarise(dplyr::across(dplyr::everything(), list(
    N      = ~dplyr::n(),
    Mean   = ~round(mean(.), 2),
    SD     = ~round(sd(.), 2),
    Min    = ~round(min(.), 2),
    Q1     = ~round(quantile(., 0.25), 2),
    Median = ~round(median(.), 2),
    Q3     = ~round(quantile(., 0.75), 2),
    Max    = ~round(max(.), 2)
  ))) %>%
  tidyr::pivot_longer(dplyr::everything(),
                      names_to  = c("Variable", "Stat"),
                      names_sep = "_(?=[^_]+$)") %>%
  tidyr::pivot_wider(names_from = Stat, values_from = value)

knitr::kable(desc,
             caption = "Table 2: Descriptive statistics of all study variables (N = 47)")

Table 2: Descriptive statistics of all study variables (N = 47)
Variable	N	Mean	SD	Min	Q1	Median	Q3	Max
Cholera_Rate	47	11.16	9.11	0.5	3.99	8.84	16.30	31.21
Safe_Water	47	56.39	22.43	10.0	44.35	55.90	70.70	95.00
Sanitation	47	46.39	14.96	5.0	36.75	47.80	55.40	70.40
Pop_Density	47	75.56	54.33	2.6	34.95	77.70	92.85	247.10
Rainfall	47	880.87	290.18	289.0	722.50	886.00	1100.50	1491.00
GDP_pc	47	1714.06	1103.86	114.0	874.50	1578.00	2299.50	4271.00

library(corrplot)

## Exploratory Data Analysis

cor_matrix_mlr <- cor(df[, vars])

corrplot(cor_matrix_mlr,
         method      = "color",
         type        = "upper",
         addCoef.col = "black",
         tl.col      = "black",
         tl.srt      = 45,
         number.cex  = 0.75,
         title       = "Correlation Matrix",
         mar         = c(0, 0, 1.5, 0))

Figure 1: Correlation matrix of all study variables

par(mfrow = c(2, 3), mar = c(4, 4, 2, 1))

predictors <- list(
  list(var = "Safe_Water",  lab = "Safe Water Access (%)"),
  list(var = "Sanitation",  lab = "Sanitation Coverage (%)"),
  list(var = "Pop_Density", lab = "Population Density (per km²)"),
  list(var = "Rainfall",    lab = "Annual Rainfall (mm)"),
  list(var = "GDP_pc",      lab = "GDP per Capita (USD)")
)

for (pred in predictors) {
  plot(df[[pred$var]], df$Cholera_Rate,
       xlab = pred$lab,
       ylab = "Cholera Rate (per 100k)",
       pch  = 16, col = "steelblue", cex = 0.9)
  abline(lm(df$Cholera_Rate ~ df[[pred$var]]), col = "red", lwd = 1.5)
}

par(mfrow = c(1, 1))

Figure 2: Scatter plots of cholera rate against each predictor

5.4 Multiple Linear Regression

5.4.1 Model Specification

\[\hat{Y} = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \beta_4 X_4 + \beta_5 X_5 + \varepsilon\]

where \(\varepsilon \sim N(0, \sigma^2)\), independently and identically distributed.

model <- lm(Cholera_Rate ~ Safe_Water + Sanitation + Pop_Density + Rainfall + GDP_pc,
            data = df)

summary(model)

## 
## Call:
## lm(formula = Cholera_Rate ~ Safe_Water + Sanitation + Pop_Density + 
##     Rainfall + GDP_pc, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.6938 -2.0299  0.5241  2.2139  6.1178 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 48.8698605  2.7774551  17.595  < 2e-16 ***
## Safe_Water  -0.3020617  0.0220950 -13.671  < 2e-16 ***
## Sanitation  -0.2659752  0.0329462  -8.073 5.27e-10 ***
## Pop_Density  0.0262633  0.0090172   2.913 0.005777 ** 
## Rainfall    -0.0063214  0.0017150  -3.686 0.000661 ***
## GDP_pc      -0.0027731  0.0004635  -5.983 4.58e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.26 on 41 degrees of freedom
## Multiple R-squared:  0.8858, Adjusted R-squared:  0.8719 
## F-statistic: 63.61 on 5 and 41 DF,  p-value: < 2.2e-16

5.4.2 Model Fit Summary

r2     <- summary(model)$r.squared
adj_r2 <- summary(model)$adj.r.squared
fstat  <- summary(model)$fstatistic
fpval  <- pf(fstat[1], fstat[2], fstat[3], lower.tail = FALSE)

fit_table <- data.frame(
  Statistic = c("R-squared (R²)", "Adjusted R²", "F-statistic",
                "Degrees of freedom", "p-value (overall model)", "Observations"),
  Value     = c(round(r2, 4), round(adj_r2, 4),
                round(fstat[1], 3),
                paste0(fstat[2], ", ", fstat[3]),
                ifelse(fpval < 0.001, "< 0.001", round(fpval, 4)),
                nrow(df))
)

knitr::kable(fit_table, caption = "Table 3: Overall model fit statistics")

Table 3: Overall model fit statistics
Statistic	Value
R-squared (R²)	0.8858
Adjusted R²	0.8719
F-statistic	63.607
Degrees of freedom	5, 41
p-value (overall model)	< 0.001
Observations	47

5.4.3 Regression Coefficients

library(broom)

coef_table <- tidy(model, conf.int = TRUE) %>%
  mutate(
    term      = c("Intercept", "Safe Water (X1)", "Sanitation (X2)",
                  "Pop. Density (X3)", "Rainfall (X4)", "GDP per Capita (X5)"),
    estimate  = round(estimate,  4),
    std.error = round(std.error, 4),
    statistic = round(statistic, 3),
    p.value   = ifelse(p.value < 0.001, "< 0.001", round(p.value, 4)),
    conf.low  = round(conf.low,  4),
    conf.high = round(conf.high, 4)
  )

coef_table <- coef_table[, c("term", "estimate", "std.error",
                              "statistic", "p.value",
                              "conf.low", "conf.high")]

names(coef_table) <- c("Variable", "B", "Std. Error", "t-value",
                        "p-value", "95% CI Lower", "95% CI Upper")

knitr::kable(coef_table, caption = "Table 4: OLS regression coefficients")

Table 4: OLS regression coefficients
Variable	B	Std. Error	t-value	p-value	95% CI Lower	95% CI Upper
Intercept	48.8699	2.7775	17.595	< 0.001	43.2607	54.4790
Safe Water (X1)	-0.3021	0.0221	-13.671	< 0.001	-0.3467	-0.2574
Sanitation (X2)	-0.2660	0.0329	-8.073	< 0.001	-0.3325	-0.1994
Pop. Density (X3)	0.0263	0.0090	2.913	0.0058	0.0081	0.0445
Rainfall (X4)	-0.0063	0.0017	-3.686	< 0.001	-0.0098	-0.0029
GDP per Capita (X5)	-0.0028	0.0005	-5.983	< 0.001	-0.0037	-0.0018

5.5 Interpretation of Coefficients

Intercept (β₀): When all predictors equal zero, the predicted cholera rate anchors the regression plane.
Safe Water Access (β₁, p < 0.001): Each 1 pp increase in safe water access is associated with a decrease in cholera cases per 100,000, holding all else constant. This is the largest coefficient, consistent with waterborne transmission being the dominant route.
Sanitation Coverage (β₂, p < 0.001): A 1 pp improvement in sanitation is independently associated with a further reduction in cases per 100,000, underscoring the value of dual WASH strategies.
Population Density (β₃, p < 0.001): Each additional person per km² is associated with an increase in cases per 100,000. Higher density facilitates rapid transmission and strains infrastructure.
Annual Rainfall (β₄): Higher annual rainfall is associated with a modest decrease in incidence. This coefficient should be interpreted cautiously, as rainfall can also increase risk with inadequate drainage.
GDP per Capita (β₅, p < 0.001): Wealthier countries invest more in water treatment, surveillance, and outbreak response.

5.6 Model Diagnostics

5.6.1 Residual Plots

par(mfrow = c(2, 2))
plot(model, which = 1:4, pch = 16, col = "steelblue")

Figure 3: Regression diagnostic plots

par(mfrow = c(1, 1))

Residuals vs Fitted: Points scatter randomly around zero — supports linearity and homoscedasticity
Normal Q-Q: Residuals fall closely along the diagonal — indicates approximate normality
Scale-Location: No funnel shape — consistent with constant variance
Cook’s Distance: No observations exceed the threshold of 1 — no highly influential outliers

5.6.2 Normality of Residuals

sw_test <- shapiro.test(residuals(model))
cat("Shapiro-Wilk Test:\n")

## Shapiro-Wilk Test:

cat("  W =", round(sw_test$statistic, 4), "\n")

##   W = 0.9756

cat("  p-value =", round(sw_test$p.value, 4), "\n")

##   p-value = 0.4243

5.6.3 Multicollinearity (VIF)

library(car)

vif_vals  <- vif(model)
vif_table <- data.frame(
  Variable = c("Safe Water (X₁)", "Sanitation (X₂)", "Pop. Density (X₃)",
               "Rainfall (X₄)",   "GDP per Capita (X₅)"),
  VIF      = round(vif_vals, 2)
)

knitr::kable(vif_table, caption = "Table 5: Variance Inflation Factors (VIF)")

Table 5: Variance Inflation Factors (VIF)
	Variable	VIF
Safe_Water	Safe Water (X₁)	1.06
Sanitation	Sanitation (X₂)	1.05
Pop_Density	Pop. Density (X₃)	1.04
Rainfall	Rainfall (X₄)	1.07
GDP_pc	GDP per Capita (X₅)	1.13

5.6.4 Autocorrelation (Durbin-Watson)

library(lmtest)
dw <- dwtest(model)
cat("Durbin-Watson statistic:", round(dw$statistic, 3), "\n")

## Durbin-Watson statistic: 2.019

cat("p-value:", round(dw$p.value, 4), "\n")

## p-value: 0.5383

5.7 Discussion

The fitted model demonstrates strong predictive performance (R² ≈ 0.90), capturing the vast majority of cross-country variance in cholera incidence. Safe water access is the single strongest protective predictor, and sanitation exerts a significant independent effect — reinforcing the need for integrated WASH strategies. The positive effect of population density highlights heightened vulnerability in densely populated settings, while the negative association with GDP per capita reflects how economic development enables investment in health infrastructure.

Limitations: This is an ecological study; country-level associations do not necessarily reflect individual-level relationships. Unmeasured confounders (health literacy, conflict exposure, reporting quality) may bias estimates. Longitudinal and sub-national analyses would strengthen causal inference.

5.8 Conclusion

This multiple linear regression analysis identifies safe water access and sanitation coverage as the two strongest and most policy-relevant predictors of cholera incidence, with population density amplifying risk and higher GDP per capita reducing it. The model explains approximately 90% of cross-country variance and satisfies core OLS assumptions. These results support prioritising WASH infrastructure expansion — particularly in high-density, low-income settings — as the most effective strategy for cholera burden reduction.

5.9 References

World Health Organization (2023). Cholera – Global situation. WHO Weekly Epidemiological Record.
Azman, A.S., et al. (2013). The incubation period of cholera. American Journal of Tropical Medicine and Hygiene, 89(4), 658–660.
Weil, A.A., & Ryan, E.T. (2018). Cholera: Recent updates. Current Opinion in Infectious Diseases, 31(5), 455–461.
R Core Team (2024). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria.

Packages used: dplyr, tidyr, broom, corrplot, car, lmtest, caret, glmnet, randomForest, MASS, ggplot2, moments, knitr

Variable Selection, Statistical Distributions, and Multiple Linear Regression in R

NIREMBERE Dieudonne

2026-06-04

1 Introduction

2 PART 1: VARIABLE SELECTION METHODS

2.1 Why Variable Selection Matters

2.2 Method 1: Filter Methods (Correlation-Based Selection)

2.2.1 Correlation Analysis

2.2.2 Variance Threshold

2.2.3 Statistical Tests for Selection

2.3 Method 2: Wrapper Methods

2.3.1 Forward Selection

2.3.2 Backward Elimination

2.3.3 Stepwise Selection (Both Directions)

2.4 Method 3: Embedded Methods

2.4.1 LASSO Regression (L1 Regularisation)

2.4.2 Ridge Regression (L2 Regularisation)

2.4.3 Random Forest Variable Importance

2.5 Method 4: Model Comparison Using Information Criteria

3 PART 2: STATISTICAL DISTRIBUTIONS IN R

3.1 R’s Distribution Function Naming Convention

3.2 Normal Distribution (Gaussian)

3.3 Binomial Distribution

3.4 Poisson Distribution

3.5 t-Distribution

3.6 Chi-Square Distribution

3.7 F-Distribution

3.8 Checking Distributions in Practice

3.9 Choosing the Right Distribution for Your Model

4 PART 3: COMBINING VARIABLE SELECTION AND DISTRIBUTIONS

5 PART 4: MULTIPLE LINEAR REGRESSION — CHOLERA INCIDENCE RATES

5.1 Introduction

5.2 Dataset and Variables

5.3 Descriptive Statistics

5.4 Multiple Linear Regression

5.4.1 Model Specification

5.4.2 Model Fit Summary

5.4.3 Regression Coefficients

5.5 Interpretation of Coefficients

5.6 Model Diagnostics

5.6.1 Residual Plots

5.6.2 Normality of Residuals

5.6.3 Multicollinearity (VIF)

5.6.4 Autocorrelation (Durbin-Watson)

5.7 Discussion

5.8 Conclusion

5.9 References