Understanding Climatic Factors and Malaria Prevalence using PCA

Author

Geoffrey Manda

0. Introduction

This tutorial demonstrates how to use Principal Component Analysis (PCA) to explore relationships between climatic factors and to simulate their influence on malaria prevalence. We’ll break down each step with clear explanations and formulas, assuming no prior knowledge of PCA. We first load the necessary packages.

# Load required libraries
library(tidyverse)
library(flextable)
library(ggplot2)
library(gridExtra)

1. Simulating Climatic Data

We start by creating a dataset of 10 different climatic factors.

Imagine we’ve collected 100 measurements for each factor, like temperature, humidity, and rainfall. We use a function called rnorm() to generate random numbers that follow a bell curve (normal distribution), which gives each factor a realistic amount of variability around its mean. Because each column is drawn independently, the simulated factors are uncorrelated by construction, which is worth keeping in mind when reading the results below. set.seed(123) ensures we get the same random numbers every time we run the code, making our results reproducible.

# For reproducibility
set.seed(123)

# Simulate data for 10 climatic factors and 100 observations
data <- data.frame(
  Temperature = rnorm(100, mean = 25, sd = 5),
  Humidity = rnorm(100, mean = 60, sd = 10),
  Rainfall = rnorm(100, mean = 100, sd = 20),
  WindSpeed = rnorm(100, mean = 10, sd = 2),
  SolarRadiation = rnorm(100, mean = 200, sd = 30),
  AirPressure = rnorm(100, mean = 1013, sd = 5),
  CloudCover = rnorm(100, mean = 50, sd = 15),
  SoilMoisture = rnorm(100, mean = 30, sd = 8),
  Evaporation = rnorm(100, mean = 5, sd = 1),
  DewPoint = rnorm(100, mean = 15, sd = 3)
)

2. Exploring Relationships: Covariance Matrix

Let’s examine how these climate factors are related to each other. The covariance measures how much two variables change together. A positive covariance indicates that the variables tend to increase or decrease together, while a negative covariance means one variable tends to increase as the other decreases.¹

The formula for the covariance between two variables \(X\) and \(Y\) is:

\[ \text{Cov}(X, Y) = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{n - 1} \]

where:

\(X_i\) and \(Y_i\): individual data points,
\(\bar{X}\) and \(\bar{Y}\): the means of \(X\) and \(Y\),
\(n\): the number of data points.
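To make the formula concrete, the following minimal sketch applies it to two short vectors and compares the result with R's built-in cov(). The vectors x and y are hypothetical illustration values, not part of the tutorial's dataset.

```r
# Two small illustrative vectors (hypothetical values, not the tutorial data)
x <- c(22, 25, 27, 30, 31)  # e.g. temperature readings
y <- c(55, 60, 58, 66, 71)  # e.g. humidity readings

# Apply the formula directly: sum of the products of deviations, over n - 1
n <- length(x)
cov_manual <- sum((x - mean(x)) * (y - mean(y))) / (n - 1)

# Base R's cov() uses the same n - 1 denominator
cov_builtin <- cov(x, y)

print(c(manual = cov_manual, builtin = cov_builtin))
```

Both values agree, and the positive sign reflects that larger x values tend to come with larger y values in this toy example.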

We can calculate the covariance between all pairs of climate factors and organize them in a table called a covariance matrix.

# Compute the covariance matrix (call print(cov_matrix) to display it)
cov_matrix <- cov(data)

A heatmap is a visual representation of this matrix, making it easier to observe the relationships between the factors.

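# Visualize covariance matrix with a heatmap (scale = "none" keeps the raw
# covariances; by default heatmap() standardizes each row before coloring)
heatmap(cov_matrix, main = "Covariance Matrix Heatmap", scale = "none",
        col = colorRampPalette(c("blue", "white", "red"))(50))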
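Covariances depend on the units of measurement, so factors with a large spread (such as SolarRadiation) dominate a covariance heatmap. One common remedy, sketched here under the assumption that unit-free comparisons are wanted, is to convert the covariance matrix to a correlation matrix with cov2cor(). The data frame toy and its seed are hypothetical stand-ins for the tutorial's data.

```r
# Hypothetical three-column stand-in for the climate data
set.seed(42)
toy <- data.frame(
  Temperature    = rnorm(100, mean = 25,  sd = 5),
  SolarRadiation = rnorm(100, mean = 200, sd = 30),
  Evaporation    = rnorm(100, mean = 5,   sd = 1)
)

# Covariance matrix: the diagonal holds variances in squared original units
cov_toy <- cov(toy)

# Rescale to correlations: every entry now lies between -1 and 1
cor_toy <- cov2cor(cov_toy)

print(round(cov_toy, 2))
print(round(cor_toy, 2))
```

The correlation matrix is also what PCA effectively works with once the data are scaled.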

3. Dimensionality Reduction: Principal Component Analysis (PCA)

Imagine trying to visualize the relationships between all 10 climate factors at once: it would be quite difficult! This is where PCA comes in. PCA is a technique that finds new variables (called principal components) that capture most of the information in our original data. These principal components are linear combinations of our original variables. Think of it like summarizing a long story with a few key sentences. PCA helps us reduce the number of variables we need to consider while still keeping the most important information.

We use the prcomp() function to perform PCA. Scaling the data ensures that all variables are treated equally, regardless of their units (e.g., temperature in degrees Celsius vs. rainfall in millimeters).

# Perform PCA
pca_result <- prcomp(data, scale. = TRUE) # Scaling ensures variables are on the same scale

# Summary of PCA
summary(pca_result)

Importance of components:
                          PC1    PC2    PC3    PC4    PC5     PC6     PC7
Standard deviation     1.2152 1.1359 1.1131 1.0689 1.0235 0.96101 0.94237
Proportion of Variance 0.1477 0.1290 0.1239 0.1143 0.1047 0.09235 0.08881
Cumulative Proportion  0.1477 0.2767 0.4006 0.5149 0.6196 0.71198 0.80078
                           PC8     PC9    PC10
Standard deviation     0.90868 0.79380 0.73238
Proportion of Variance 0.08257 0.06301 0.05364
Cumulative Proportion  0.88335 0.94636 1.00000
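For readers curious about what prcomp() computes, the sketch below checks that PCA on scaled data matches an eigen-decomposition of the correlation matrix. The variances of the components are the eigenvalues, and the rotation columns are the eigenvectors, up to sign. The data frame toy is a hypothetical example, not the tutorial's dataset.

```r
# Hypothetical small dataset with three variables on different scales
set.seed(1)
toy <- data.frame(
  a = rnorm(50, mean = 25,  sd = 5),
  b = rnorm(50, mean = 60,  sd = 10),
  c = rnorm(50, mean = 100, sd = 20)
)

# PCA on standardized variables
pca_toy <- prcomp(toy, scale. = TRUE)

# Eigen-decomposition of the correlation matrix
eig <- eigen(cor(toy))

# Component variances equal the eigenvalues
print(all.equal(pca_toy$sdev^2, eig$values))

# Rotation columns equal the eigenvectors, up to an arbitrary sign per column
print(all.equal(abs(unname(pca_toy$rotation)), abs(eig$vectors)))
```

The sign of each component is arbitrary, which is why the comparison uses absolute values.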

3.1 Explained Variance and Scree Plot

Each principal component explains a certain amount of the variability in our data. The eigenvalue associated with each principal component tells us how much variance it explains.

# Eigenvalues
eigenvalues <- pca_result$sdev^2
print(eigenvalues)

 [1] 1.4767870 1.2903483 1.2390709 1.1425356 1.0474732 0.9235375 0.8880634
 [8] 0.8256940 0.6301157 0.5363744

# Proportion of variance explained
explained_variance <- eigenvalues / sum(eigenvalues)
cumulative_variance <- cumsum(explained_variance)
print(cumulative_variance)

 [1] 0.1476787 0.2767135 0.4006206 0.5148742 0.6196215 0.7119752 0.8007816
 [8] 0.8833510 0.9463626 1.0000000

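A scree plot shows the variance explained by each principal component, in decreasing order. We often look for an “elbow”, the point after which additional components add little, to decide how many principal components to keep. Here the curve declines gradually without a sharp elbow, which is what we expect when the factors are simulated independently.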

# Scree plot to visualize variance explained
plot(explained_variance, type = "b",
     main = "Scree Plot",
     xlab = "Principal Component",
     ylab = "Proportion of Variance Explained")
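When the elbow is not obvious, numerical rules of thumb can complement the plot. The sketch below applies two conventions that are assumptions of this example rather than part of the original analysis: keeping components with eigenvalue above 1 (often called the Kaiser rule) and keeping enough components to reach 80% cumulative variance. The eigenvalues are copied from the output printed above.

```r
# Eigenvalues copied from the PCA output shown earlier
eigenvalues <- c(1.4767870, 1.2903483, 1.2390709, 1.1425356, 1.0474732,
                 0.9235375, 0.8880634, 0.8256940, 0.6301157, 0.5363744)

# Cumulative proportion of variance explained
cumulative_variance <- cumsum(eigenvalues / sum(eigenvalues))

# Rule 1 (assumed convention): keep components with eigenvalue > 1
n_kaiser <- sum(eigenvalues > 1)

# Rule 2 (assumed threshold): smallest number of components reaching 80%
n_80 <- which(cumulative_variance >= 0.80)[1]

cat("Eigenvalue > 1 rule keeps:", n_kaiser, "components\n")
cat("80% variance rule keeps:  ", n_80, "components\n")
```

With these eigenvalues the first rule keeps 5 components and the second keeps 7, so the two rules disagree. That is typical when the variance is spread almost evenly across components.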

3.2 Biplot

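A biplot shows the observations and the original variables on the same pair of principal component axes. Each labelled point is an observation, placed by its scores on PC1 and PC2, and each arrow is a climate factor, drawn from its loadings on those two components. Arrows pointing in similar directions indicate factors that are positively correlated in this two-component view, arrows pointing in opposite directions indicate negative correlation, and roughly perpendicular arrows indicate little correlation.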

# Biplot for PCA
biplot(pca_result, scale = 0, main = "PCA Biplot")

3.3 Factor Loadings

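Factor loadings tell us how strongly each original variable is associated with each principal component. Because we scaled the data, the loadings computed below (the rotation matrix multiplied by the component standard deviations) are the correlation coefficients between the original variables and the principal components.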

# Extract and print factor loadings
factor_loadings <- pca_result$rotation %*% diag(pca_result$sdev)
print(factor_loadings)

                      [,1]        [,2]        [,3]        [,4]         [,5]
Temperature     0.51727818 -0.47686125 -0.05236879  0.04650169 -0.186960920
Humidity       -0.06811801  0.01018334 -0.44315791 -0.64299891  0.328306357
Rainfall       -0.30555789  0.25142217 -0.28884935  0.06867251  0.466619485
WindSpeed       0.51313822  0.57925968  0.17895339 -0.18625860  0.168369510
SolarRadiation -0.50234554  0.30879720  0.47385833 -0.06316775 -0.431716220
AirPressure    -0.33651519 -0.12459868  0.40976572 -0.66895860  0.003619476
CloudCover     -0.03684068  0.15834986 -0.52510214 -0.12523309 -0.529279279
SoilMoisture    0.55705501 -0.21393552  0.20037016 -0.42369572 -0.107462909
Evaporation    -0.18101412 -0.51969646  0.33014334  0.14972551  0.382002190
DewPoint        0.37130395  0.46074396  0.32734764  0.13565581  0.186143005
                      [,6]        [,7]        [,8]        [,9]       [,10]
Temperature     0.28542100  0.07701321 -0.48931890  0.36854947  0.05009456
Humidity       -0.34506259  0.13681129 -0.22008286  0.14182736 -0.26701621
Rainfall        0.66435281 -0.21474857  0.06190823  0.20969586 -0.04744548
WindSpeed      -0.11662930  0.17378476  0.20985093  0.33196844  0.32869213
SolarRadiation  0.03330958  0.11415011 -0.04198625  0.37183953 -0.28847047
AirPressure     0.20601262 -0.16654690 -0.22119255 -0.16133065  0.33267589
CloudCover     -0.13706437 -0.58166473  0.07311173  0.14185528  0.13938150
SoilMoisture    0.26610925 -0.18157707  0.45437223 -0.06073745 -0.31429164
Evaporation    -0.35176353 -0.39086093  0.18044151  0.33054715  0.03848443
DewPoint       -0.10558399 -0.47143663 -0.44148262 -0.15043607 -0.19672769

# Heatmap of factor loadings
heatmap(as.matrix(factor_loadings),
        main = "Factor Loadings Heatmap",
        col = colorRampPalette(c("blue", "white", "red"))(50),
        margins = c(10, 10))
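The claim that loadings are correlations can be checked directly. The following sketch, on a hypothetical data frame toy, compares the loadings with cor() between the original variables and the component scores, and confirms that each variable's squared loadings sum to 1 when all components are kept.

```r
# Hypothetical small dataset
set.seed(1)
toy <- data.frame(
  a = rnorm(50, mean = 25,  sd = 5),
  b = rnorm(50, mean = 60,  sd = 10),
  c = rnorm(50, mean = 100, sd = 20)
)

pca_toy <- prcomp(toy, scale. = TRUE)

# Loadings: rotation (eigenvectors) scaled by the component standard deviations
loadings_toy <- pca_toy$rotation %*% diag(pca_toy$sdev)

# Correlations between the original variables and the component scores
cor_toy <- cor(toy, pca_toy$x)

# The two matrices agree
print(all.equal(unname(loadings_toy), unname(cor_toy)))

# Across all components, each variable's squared loadings sum to 1
print(rowSums(loadings_toy^2))
```

A squared loading is the share of a variable's variance captured by that component.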

4. Simulating Malaria Prevalence

Now, let’s imagine that malaria prevalence is influenced by some of our climate factors. We create a new variable called MalariaPrevalence that is a combination of Temperature, Humidity, and Rainfall, with some random noise added for realism. This simulates how malaria prevalence might be higher in warmer, more humid, and rainier conditions.

# Simulate malaria prevalence
malaria_prevalence <- 0.3 * data$Temperature +
  0.5 * data$Humidity +
  0.2 * data$Rainfall +
  rnorm(100, mean = 0, sd = 5) # Add random noise

# Add malaria prevalence to the dataset
data$MalariaPrevalence <- malaria_prevalence
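Because the outcome is simulated from known weights, a quick sanity check is to regress it on the three driving factors and see whether the weights 0.3, 0.5 and 0.2 are roughly recovered. This is a minimal self-contained sketch: the data frame sim and the seed are hypothetical, so the numbers will not match the tutorial's dataset exactly.

```r
# Re-create a simplified version of the simulation (hypothetical seed)
set.seed(2024)
n <- 100
sim <- data.frame(
  Temperature = rnorm(n, mean = 25,  sd = 5),
  Humidity    = rnorm(n, mean = 60,  sd = 10),
  Rainfall    = rnorm(n, mean = 100, sd = 20)
)

# Same generating equation as in the tutorial: weighted sum plus noise
sim$MalariaPrevalence <- 0.3 * sim$Temperature +
  0.5 * sim$Humidity +
  0.2 * sim$Rainfall +
  rnorm(n, mean = 0, sd = 5)

# Ordinary least squares on the original variables
fit <- lm(MalariaPrevalence ~ Temperature + Humidity + Rainfall, data = sim)

# Estimated slopes (intercept dropped); expect values near 0.3, 0.5, 0.2
est <- coef(fit)[-1]
print(round(est, 2))
```

The estimates are noisy with only 100 observations, but they should land near the true weights, which gives a reference point for the PCA-based results that follow.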

5. Identifying Key Climatic Factors

We can use PCA to figure out which climate factors are most important for predicting malaria prevalence. We do this by fitting a series of regression models, each using a different number of principal components. The Akaike Information Criterion (AIC) helps us choose the best model – the one that balances model complexity and goodness of fit.

# Initialize variables to store results
aic_values <- numeric()
models <- list()

# Loop over number of components (1 to total components)
for (i in 1:10) {
  # Use the first i principal components (drop = FALSE keeps a matrix when i = 1)
  pca_data <- data.frame(pca_result$x[, 1:i, drop = FALSE])
  pca_data$MalariaPrevalence <- data$MalariaPrevalence
  
  # Fit regression model
  model <- lm(MalariaPrevalence ~ ., data = pca_data)
  
  # Store model and its AIC
  models[[i]] <- model
  aic_values[i] <- AIC(model)
}

# Identify the optimal number of components
optimal_components <- which.min(aic_values)
cat("Optimal number of components:", optimal_components, "\n")

Optimal number of components: 10
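For a linear model, AIC is 2k - 2 log(L), where log(L) is the maximized log-likelihood and k is the number of estimated parameters (the regression coefficients plus the residual variance). Lower values are better. The sketch below reproduces AIC() by hand on a hypothetical one-predictor model, to show what the loop above is comparing.

```r
# Hypothetical data: one predictor with a true slope of 2
set.seed(1)
x <- rnorm(100)
y <- 2 * x + rnorm(100)

fit <- lm(y ~ x)

# Number of estimated parameters: coefficients plus the residual variance
k <- length(coef(fit)) + 1

# AIC from its definition
aic_manual <- 2 * k - 2 * as.numeric(logLik(fit))

print(c(manual = aic_manual, builtin = AIC(fit)))
```

Each extra component adds 2 to the penalty term, so a component is kept only if it improves the log-likelihood by more than 1.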

5.1 Contribution of Original Variables

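Finally, we can see how much each original climate factor contributes to predicting malaria prevalence in our best model. Since AIC selected all 10 components, this model carries the same information as a regression on the 10 standardized climate factors. Mapping the component coefficients back through the loadings shows which factors are most influential.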

# Extract factor loadings for optimal number of components
optimal_model <- models[[optimal_components]]
coefficients <- optimal_model$coefficients[-1] # Exclude intercept
loadings <- pca_result$rotation[, 1:optimal_components]

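# Map coefficients back to original variables
# (this product gives the regression coefficients on the standardized variables)
original_contributions <- loadings %*% coefficients

# Scale contributions to original variables
# Caution: this multiplies the j-th variable's value by the standard deviation of
# the j-th component, so treat the result as a rough ranking, not as coefficients
# in the variables' original units
scaled_contributions <- original_contributions * pca_result$sdev[1:optimal_components]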
names(scaled_contributions) <- colnames(data[, 1:10])

# Display contributions
print(scaled_contributions)

                     [,1]
Temperature     2.0222455
Humidity        5.4686957
Rainfall        3.4515077
WindSpeed       0.7668720
SolarRadiation  0.3553193
AirPressure    -0.1682604
CloudCover      0.4503427
SoilMoisture    0.5031792
Evaporation    -0.2066410
DewPoint        0.5118944
attr(,"names")
 [1] "Temperature"    "Humidity"       "Rainfall"       "WindSpeed"     
 [5] "SolarRadiation" "AirPressure"    "CloudCover"     "SoilMoisture"  
 [9] "Evaporation"    "DewPoint"

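# Visualize contributions (take the single column as a named vector so that
# barplot() draws one bar per variable instead of one stacked bar)
barplot(abs(scaled_contributions[, 1]),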
        main = "Contribution of Original Variables to Malaria Prevalence",
        ylab = "Absolute Contribution",
        las = 2,
        col = "steelblue")
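To express contributions in the original units of each factor, one option is to divide the standardized coefficients by the scaling factors stored in the prcomp() object. The sketch below is an illustration under stated assumptions rather than the tutorial's own method. It uses a hypothetical three-factor dataset and keeps all components, in which case the back-transformed coefficients match an ordinary regression on the original variables.

```r
# Hypothetical three-factor version of the simulation
set.seed(1)
n <- 100
toy <- data.frame(
  Temperature = rnorm(n, mean = 25,  sd = 5),
  Humidity    = rnorm(n, mean = 60,  sd = 10),
  Rainfall    = rnorm(n, mean = 100, sd = 20)
)
y <- 0.3 * toy$Temperature + 0.5 * toy$Humidity + 0.2 * toy$Rainfall +
  rnorm(n, mean = 0, sd = 5)

# PCA on standardized factors, then regression on all component scores
pca_toy <- prcomp(toy, scale. = TRUE)
pc_fit <- lm(y ~ ., data = data.frame(pca_toy$x, y = y))
gamma <- coef(pc_fit)[-1]  # coefficients on the components

# Coefficients on the standardized variables
beta_std <- drop(pca_toy$rotation %*% gamma)

# Coefficients in original units: divide by each variable's standard deviation
beta_orig <- beta_std / pca_toy$scale

# Direct regression on the original variables for comparison
direct <- coef(lm(y ~ ., data = cbind(toy, y = y)))[-1]

print(round(rbind(from_pca = beta_orig, direct = direct), 4))
```

When fewer components are kept, the two rows no longer match, because the discarded components remove part of the signal.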

6. Conclusion

This tutorial showed how PCA can be used to analyze simulated climate data and to relate the resulting components to a simulated malaria prevalence. Applied to real climate and surveillance records, the same workflow could help identify the climatic factors most strongly associated with malaria prevalence and so inform public health interventions.

Footnotes

¹ Covariance is an essential concept in statistics for understanding linear relationships between variables.