The goal of this analysis is to determine whether environmental conditions at the time of sampling influence the number of mosquitoes captured.
Given the high correlation between environmental variables, a Principal Component Analysis (PCA) approach was applied to reduce dimensionality and identify key environmental factors driving mosquito abundance. Additionally, clustering analysis was performed to group locations based on environmental similarities.
Note: After completing the analysis, I’m still uncertain about the relevance of including these results in the article. The PCA biplot shows that higher wind speeds are associated with reduced mosquito counts, which is already established in the literature. Additionally, the PCA biplot doesn’t seem to show a clear impact of temperature and humidity on mosquito counts. I’m also unsure how to interpret the clustering analysis; we can’t easily generalize the result that ‘this cluster likely represents locations with lower PC1 (temperature-humidity contrast) and lower PC2 (wind speed influence)’ since the analysis only covers 7 days and doesn’t fully capture the weather system of the location.
data <- read.csv("dataset_environmental_variables.csv", header = TRUE)
The dataset includes 111 records collected from 9 locations over a span of 3 years, with mosquito sampling conducted either in the morning (AM) or at night (PM). Environmental data were recorded during each sampling session.
A total of 81% of the records are from 2022, and 60% of the sampling occurred during the PM period.
There were minimal missing values in the cloud_cover and period variables (n = 2).
| Variable | Type | Sort | Description |
|---|---|---|---|
| Location | Independent | Categorical nominal | Sampling location |
| Year | Independent | Numeric discrete | Sampling year |
| Week | Independent | Numeric discrete | Sampling week number |
| Period | Independent | Categorical nominal | Sampling period (AM/PM) |
| Temperature | Independent | Numeric continuous | Air temperature in Celsius |
| Humidity | Independent | Numeric continuous | Relative humidity in percentage |
| Wind speed | Independent | Numeric continuous | Wind speed in km/h |
| Cloud cover | Independent | Categorical ordinal | Cloud cover in 5 categories |
| Mosquito Count | Dependent | Numeric discrete | Number of mosquitoes caught |
Here’s an overview of the mean, median, variance and standard deviation for the 5 numerical variables.
| Variable | Mean | Median | Variance | Standard_Deviation |
|---|---|---|---|---|
| temperature | 17.579279 | 18.2 | 21.228930 | 4.607486 |
| humidity | 57.451351 | 54.8 | 299.715975 | 17.312307 |
| wind_speed | 3.064324 | 1.4 | 17.035734 | 4.127437 |
| mosquito_count | 37.216216 | 7.0 | 10327.716462 | 101.625373 |
| week_num | 30.135135 | 30.0 | 1.572482 | 1.253986 |
TEMPERATURE: The mean temperature (17.58) is close to the median (18.2), suggesting a roughly symmetric distribution with no strong skew. The standard deviation (4.61) indicates moderate variability around the mean.
HUMIDITY: The mean (57.45) and the median (54.8) are close, indicating a relatively symmetric distribution of humidity values. The high variance (299.72) and standard deviation (17.31) suggest that humidity values are quite spread out, meaning the humidity fluctuates considerably between measurements.
WIND SPEED: The mean (3.06) is higher than the median (1.4), indicating a right-skewed distribution, with some very high speed values. The large variance (17.04) and standard deviation (4.13) suggest that while most wind speeds are relatively low (close to 1.4 km/h), there are some extreme values driving the mean up.
MOSQUITO COUNT: The large difference between the mean (37.22) and the median (7.0) suggests a highly skewed distribution with some extreme counts that are pulling the mean upwards. The high variance (10327.72) and standard deviation (101.63) confirm that the mosquito counts vary greatly across the samples, with some locations or conditions seeing much higher mosquito populations than others.
WEEK NUMBER: The week number has a mean (30.14), which is nearly identical to the median (30.0), suggesting a symmetric distribution with little to no skewness. The standard deviation of 1.25 indicates low variability, meaning the week numbers are fairly consistent and close to the average, with only slight deviations.
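The code that produced the summary table is not shown in this report. As a minimal sketch, a table of this shape could be computed as follows; the `toy` data frame and the helper `summarise_numeric()` are hypothetical stand-ins, not the real dataset or the author's function.

```r
# Hypothetical toy data standing in for the real dataset (values are made up)
toy <- data.frame(
  temperature    = c(12.1, 18.2, 21.5, 16.0, 19.3),
  mosquito_count = c(0, 7, 3, 150, 12)
)

# Compute mean, median, variance and standard deviation for each column
summarise_numeric <- function(df) {
  data.frame(
    Mean               = sapply(df, mean, na.rm = TRUE),
    Median             = sapply(df, median, na.rm = TRUE),
    Variance           = sapply(df, var, na.rm = TRUE),
    Standard_Deviation = sapply(df, sd, na.rm = TRUE)
  )
}

summary_table <- summarise_numeric(toy)
print(summary_table)
```

Applied to the real data, the same helper would take the five numerical columns, for example the `data_subset` object built for the scatterplot matrix.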
OVERVIEW: A scatterplot matrix helps us understand the correlation between numerical variables by visually displaying relationships between all possible pairs. It allows us to spot patterns, trends, and potential multicollinearity, which is important for statistical analysis. The plots also help identify outliers that might affect correlation results. Instead of examining variables one by one, a scatterplot matrix provides a quick and comprehensive way to compare multiple relationships at once, making it easier to interpret how variables interact.
# Install and load the necessary package
install.packages("GGally", repos = "https://cran.rstudio.com/")
## package 'GGally' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\carol\AppData\Local\Temp\RtmpmScKxm\downloaded_packages
library(GGally)
# Create a subset of the data with the selected variables
data_subset <- data.frame(
temperature = data$temperature,
humidity = data$humidity,
wind_speed = data$wind_speed,
mosquito_count = data$mosquito_count,
week_num = data$week_num
)
# Create the scatterplot matrix
ggpairs(na.omit(data_subset))
Diagonal: It shows the distribution of each individual variable in the form of a density plot. It gives you a sense of the distribution of each variable, such as whether it’s normally distributed, skewed, or has multiple modes.
Upper triangle: It shows the correlation coefficients between the pairs of variables, which quantify the strength and direction of the linear relationship between them. These values range from -1 to 1.
Lower triangle: It shows the scatterplots for each pair of variables. It provides a visual representation of the relationship between two variables. If there’s a strong relationship, the points will follow a clear trend (e.g., a straight line or curve).
The most notable relationship is between humidity and temperature, with a moderate negative correlation (-0.533), suggesting that as temperature increases, humidity tends to decrease.
There is another notable relationship between mosquito count and week number, with a moderate negative correlation (-0.366), suggesting that…? Note: The scatterplot doesn’t show much of a trend, so I’m not sure there’s a relationship.
Most of the correlations involving mosquito count are weak (near 0), suggesting no strong linear association between mosquito count and temperature, humidity, or wind speed.
Wind speed shows weak negative correlations with both temperature (-0.182) and humidity (-0.128), indicating a weak inverse relationship between these variables.
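The coefficients in the upper triangle are Pearson correlations, which measure linear association and can be sensitive to skewed variables such as the mosquito counts. As a hedged sketch, the correlation matrix can also be computed directly with `cor()`, and a rank-based (Spearman) version can serve as a quick cross-check. The data below are simulated for illustration and are not the real measurements.

```r
set.seed(1)

# Simulated stand-in data (not the real measurements)
n <- 50
temperature <- rnorm(n, mean = 18, sd = 4)
humidity    <- 100 - 2.5 * temperature + rnorm(n, sd = 10)
toy <- data.frame(temperature, humidity)

# Pearson measures linear association; Spearman works on ranks
pearson  <- cor(toy, method = "pearson")
spearman <- cor(toy, method = "spearman")

print(round(pearson, 3))
print(round(spearman, 3))
```

With the real data, `toy` would be replaced by `na.omit(data_subset)`.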
Here, I tested the normality of the variable mosquito_count using the Shapiro-Wilk Test.
##
## Shapiro-Wilk normality test
##
## data: data$mosquito_count
## W = 0.38462, p-value < 2.2e-16
W = 0.38462: This statistic measures how well the data fit a normal distribution. Values close to 1 suggest that the data are approximately normally distributed; our value is far below 1, which indicates that the mosquito count is not normally distributed.
p-value < 2.2e-16: This represents the probability of obtaining a test statistic as extreme as the one observed, assuming that the data is normally distributed. Our result (less than 0.05) suggests that the null hypothesis that the data follows a normal distribution should be rejected.
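Only the output of the test is shown above. A minimal sketch of how such a test is run with `shapiro.test()` from base R follows; the counts are simulated from a negative binomial distribution purely for illustration, so the resulting statistic will differ from the one reported here.

```r
set.seed(42)

# Simulated overdispersed counts standing in for mosquito_count (not the real data)
toy_counts <- rnbinom(111, size = 0.3, mu = 37)

# Shapiro-Wilk test: the null hypothesis is that the data are normally distributed
sw <- shapiro.test(toy_counts)
print(sw)

# Reject normality at the 5% level when the p-value is below 0.05
reject_normality <- sw$p.value < 0.05
print(reject_normality)
```

On the real dataset the call would be `shapiro.test(data$mosquito_count)`, matching the `data:` line in the output above.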
OVERVIEW: Principal Component Analysis (PCA) is a dimensionality reduction technique that helps identify patterns in high-dimensional datasets by transforming them into a set of new uncorrelated variables called **principal components (PCs)**. These components capture the most variance in the data while reducing redundancy.
Note: I didn’t include the variable ‘week_num’ in the analysis, as it is not considered an environmental variable.
This step prepares the dataset for PCA by selecting the numerical variables, removing rows with missing values, and standardizing the variables (centering and scaling) so that they carry equal weight.
# Select numerical variables for PCA
pca_data <- data[, c("temperature", "humidity", "wind_speed", "mosquito_count")]
# Remove rows with missing values
pca_data <- na.omit(pca_data)
# Standardize the data
pca_data_scaled <- scale(pca_data)
Standardization adjusts all variables to have the same scale, so no single variable dominates the analysis just because it has larger numbers. For example, in our dataset temperature ranges from roughly 5 to 30, humidity from roughly 30 to 90, and mosquito count from 0 to 611.
Since mosquito count has much larger values, it could dominate the PCA results, making it harder to see the real patterns in the data. To make sure that all variables contribute equally to the PCA, each variable is transformed to have a mean of 0 and a standard deviation of 1.
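As a small illustration of what `scale()` does, the sketch below standardizes a made-up data frame and checks that every column ends up with mean 0 and standard deviation 1. The values are invented to mimic the differing ranges described above.

```r
# Made-up values with very different ranges (not the real data)
toy <- data.frame(
  temperature    = c(5, 12, 18, 24, 30),
  humidity       = c(90, 75, 60, 45, 30),
  mosquito_count = c(0, 3, 7, 40, 500)
)

# scale() subtracts each column mean and divides by each column standard deviation
toy_scaled <- scale(toy)

# After scaling, column means are (numerically) zero and standard deviations are one
print(round(colMeans(toy_scaled), 10))
print(apply(toy_scaled, 2, sd))
```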
This step reduces the dimensionality of the dataset while preserving as much variance as possible by transforming correlated variables into a smaller set of uncorrelated principal components.
# pca_data_scaled is already standardized, so center/scale. are redundant here but harmless
pca_result <- prcomp(pca_data_scaled, center = TRUE, scale. = TRUE)
This provides key statistics, including the proportion of variance explained by each principal component, helping determine how many components to retain for analysis.
summary(pca_result) # Check the proportion of variance explained
## Importance of components:
## PC1 PC2 PC3 PC4
## Standard deviation 1.2551 1.0565 0.9598 0.6223
## Proportion of Variance 0.3938 0.2790 0.2303 0.0968
## Cumulative Proportion 0.3938 0.6729 0.9032 1.0000
STANDARD DEVIATION: This tells us how much each PC spreads out the data. A higher standard deviation means that the PC captures more variation in the data. Here, **PC1 has the highest standard deviation (1.2551), meaning it captures the most variation in the dataset**.
PROPORTION OF VARIANCE: This tells us how much of the total variation in the data is explained by each principal component. PC1 explains 39.38% of the variance, PC2 explains 27.90% of the variance, PC3 explains 23.03% of the variance, and PC4 explains 9.68% of the variance.
CUMULATIVE PROPORTION: This tells us how much variance is explained when we add up the components. PC1 alone explains 39.38% of the variance, PC1+PC2 explain 67.29% of the variance, PC1+PC2+PC3 explain 90.32% of the variance, and PC1+PC2+PC3+PC4 explain 100% of the variance.
Since PC1 and PC2 together explain about 67% of the variation, they contain the most important patterns in the dataset.
If we want to reduce the number of dimensions in our data (for visualization and analysis), we can focus only on PC1 and PC2 since they explain most of the variation.
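The three rows of the summary are linked: each proportion of variance is the squared standard deviation of a component divided by the sum of all squared standard deviations. The sketch below redoes that arithmetic from the rounded standard deviations printed in the summary, so the results agree with the reported proportions only up to rounding.

```r
# Standard deviations of the components, copied from the summary output above
sdev <- c(PC1 = 1.2551, PC2 = 1.0565, PC3 = 0.9598, PC4 = 0.6223)

# The variance captured by each component is its squared standard deviation
eigenvalues <- sdev^2

# Proportion of variance and cumulative proportion
prop_var <- eigenvalues / sum(eigenvalues)
cum_var  <- cumsum(prop_var)

print(round(rbind(prop_var, cum_var), 3))
```

With the fitted object, the same quantities come from `pca_result$sdev` instead of the hand-copied vector.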
This identifies which variables have the greatest influence on each principal component, revealing patterns and relationships in the data.
# Squared loadings (eigenvectors)
loadings_squared <- pca_result$rotation^2
# Compute % contribution of each variable to each PC
contributions <- sweep(loadings_squared, 2, colSums(loadings_squared), "/") * 100
# Print contributions
print(contributions)
## PC1 PC2 PC3 PC4
## temperature 47.1879676 7.013986 0.1856846 45.6123615
## humidity 46.1846388 2.159676 9.0423669 42.6133185
## wind_speed 0.1147859 71.976111 16.2519860 11.6571169
## mosquito_count 6.5126077 18.850227 74.5199625 0.1172031
Temperature (47%) and humidity (46%) contribute most to PC1.
Wind speed (72%) contributes most to PC2, followed by mosquito count (19%).
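Because each column of the rotation matrix returned by `prcomp()` has unit length, the squared loadings in a column already sum to 1, so the percentage contributions are simply the squared loadings times 100. The sketch below checks this on simulated data and picks out the top contributor for each component; the variables `a`, `b` and `c` are hypothetical.

```r
set.seed(7)

# Simulated data with some correlation between a and b (not the real data)
toy <- data.frame(a = rnorm(30), b = rnorm(30), c = rnorm(30))
toy$b <- toy$b - 0.6 * toy$a

pca_toy <- prcomp(toy, center = TRUE, scale. = TRUE)

# Squared loadings: each column sums to 1 because eigenvectors have unit length
loadings_sq <- pca_toy$rotation^2
print(colSums(loadings_sq))

# Percentage contribution of each variable to each component
contrib <- loadings_sq * 100

# Variable with the largest contribution to each component
top_variable <- rownames(contrib)[apply(contrib, 2, which.max)]
names(top_variable) <- colnames(contrib)
print(top_variable)
```

Reading off the largest value per column in this way is a quick guard against attributing a component to the wrong variable.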
These plots visually represent how variables and observations relate to the principal components, aiding interpretation of patterns and groupings in the data.
library(ggbiplot)
ggbiplot(pca_result, labels = rownames(pca_data), ellipse = TRUE, circle = TRUE) +
theme_minimal() +
labs(title = "PCA Biplot")
The arrows represent the original variables in the dataset.
Arrows pointing in the same direction are positively correlated.
Angles between arrows: small angles (close to 0°) = strong positive correlation, right angles (90°) = no correlation, opposite directions (180°) = strong negative correlation.
Length of the arrows: longer arrows mean the variable is well represented by PC1 and PC2; shorter arrows mean the variable is less important in these components.
Each dot represents an observation (sampling event).
Observations close together mean that they have similar values for the variables, while observations far apart mean that they are more different in terms of their environmental conditions.
If a point is in the same direction as a variable’s arrow, it means that the observation has a high value for that variable.
Clusters: If some points group together, it may indicate natural patterns in the dataset.
Extreme points: Outliers might appear far from the rest, indicating unusual conditions in those observations.
When looking at the PC1 axis, the arrows for Humidity and Temperature are pointing in opposite directions, meaning higher temperatures are associated with lower humidity. This agrees with our previous correlation analysis.
When looking at the PC2 axis, the arrows for Wind speed and Mosquito count are pointing in opposite directions, meaning that higher wind speeds are associated with fewer mosquitoes caught.
There seems to be a cluster of points around 0 on the PC2 axis, suggesting that…? Note: I’m not sure how to interpret that.
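The link between arrow angles and correlations can be checked numerically. For standardized data, scaling each loading by the standard deviation of its component gives arrows whose inner products reproduce the correlation matrix exactly when all components are used, and only approximately when just PC1 and PC2 are kept, as in a biplot. The sketch below uses simulated data and assumes this particular arrow scaling; `ggbiplot` may scale arrows differently depending on its options.

```r
set.seed(11)

# Simulated data: x1 and x2 negatively related, x3 independent (not the real data)
n  <- 40
x1 <- rnorm(n)
x2 <- -0.5 * x1 + rnorm(n)
x3 <- rnorm(n)
toy <- data.frame(x1, x2, x3)

pca_toy <- prcomp(toy, center = TRUE, scale. = TRUE)

# Variable arrows: loadings multiplied by the standard deviation of each PC
arrows_full <- sweep(pca_toy$rotation, 2, pca_toy$sdev, "*")

# Using all PCs, inner products between arrows equal the correlations
cor_from_arrows <- tcrossprod(arrows_full)

# Using only PC1 and PC2 gives the approximation that a 2D biplot displays
cor_from_2d <- tcrossprod(arrows_full[, 1:2])

print(round(cor(toy), 3))
print(round(cor_from_arrows, 3))
print(round(cor_from_2d, 3))
```

The gap between the last two matrices indicates how much of each correlation the two-dimensional view leaves out.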
Clustering analysis, such as K-means, helps to group similar observations based on their principal component (PCA) scores. Here’s why it’s useful:
When you have multiple variables, clustering in the original space can be challenging due to noise and redundant information.
PCA transforms data into a set of uncorrelated principal components which makes patterns easier to detect.
K-means clustering on PCA-reduced data is often more accurate and interpretable because it removes irrelevant noise.
PCA helps visualize relationships between observations, but clustering adds structure by grouping similar data points.
K-means assigns each observation to a cluster, allowing us to find natural groupings in the data that may not be obvious.
This is particularly useful in environmental data to identify patterns in mosquito populations or group similar environmental conditions.
K-means clustering works best in lower-dimensional spaces because it relies on distance calculations.
Running K-means on PCA-reduced data reduces the number of dimensions, making clustering faster and more efficient.
After clustering, we can analyze what characterizes each group based on PCA components.
This helps in making data-driven decisions, such as: “Which environmental factors influence mosquito abundance?” or “Are there distinct weather patterns associated with high/low mosquito counts?”
Plotting clusters on a PCA biplot helps visually confirm whether the K-means clustering aligns well with the data structure.
If distinct clusters appear, this means that the PCA transformation effectively captured meaningful patterns in the dataset.
The Elbow Method is a technique used to determine the optimal number of clusters (k) in K-means clustering. It helps find the best balance between minimizing within-cluster variation and avoiding too many clusters.
# Load necessary libraries
library(ggplot2)
# Extract the first few Principal Components (e.g., PC1 and PC2)
pca_data_k <- pca_result$x[, 1:2] # Select first 2 PCs
# Define a range for k (number of clusters)
wss <- numeric(10) # To store within-cluster sum of squares
for (k in 1:10) {
kmeans_result <- kmeans(pca_data_k, centers = k, nstart = 25)
wss[k] <- kmeans_result$tot.withinss # Total within-cluster sum of squares
}
# Create an Elbow Plot
elbow_plot <- ggplot(data.frame(k = 1:10, wss = wss), aes(x = k, y = wss)) +
geom_point(color = "black", size = 3) + # Black points
geom_line(color = "blue", linewidth = 0.8) + # Blue curve
scale_x_continuous(breaks = 1:10) + # Ensure x-axis is divided by 1
scale_y_continuous(breaks = seq(0, max(wss), by = 50)) + # Y-axis increments by 50
labs(title = "Figure 4. Elbow Method for Optimal k",
x = "Number of clusters (k)",
y = "Total (within-cluster sum of squares)") +
theme_minimal() +
theme(
axis.line = element_line(color = "black"), # Black axis lines
axis.ticks = element_line(color = "black") # Black axis ticks
)
# Print the plot
print(elbow_plot)
The spatial distribution on the PC1 vs. PC2 plot helps visualize how observations group based on their principal component scores. Points that are closer together share similar characteristics, while those farther apart indicate greater differences in the dataset.
# Extract the first two principal components
pca_data <- as.data.frame(pca_result$x[, 1:2]) # Use PC1 and PC2
data$location_ab <- substr(data$location, 1, 2) # Create a new column with the first 2 characters of each entry
pca_data$Location <- data$location_ab # Add location names
# Apply K-means clustering (choose k based on the Elbow Method)
set.seed(123) # For reproducibility
k <- 3 # Set the optimal number of clusters based on previous analysis
kmeans_result <- kmeans(pca_data[, 1:2], centers = k, nstart = 25)
# Add cluster labels to the PCA dataset
pca_data$Cluster <- as.factor(kmeans_result$cluster)
# Plot PCA with clusters
ggplot(pca_data, aes(x = PC1, y = PC2, color = Cluster)) +
geom_point(aes(shape = factor(Cluster)), size = 2, alpha = 0.7) + # Scatter plot points
geom_text(aes(label = Location), vjust = -1, size = 3, show.legend = FALSE) + # Remove location label legend
labs(
title = "Figure 5: Clustering of locations based on PCA",
x = "Principal Component 1 (temperature & humidity contrast)",
y = "Principal Component 2 (wind speed influence)"
) +
theme_minimal() +
theme(
axis.line = element_line(color = "black"),
axis.ticks = element_line(color = "black")
) +
scale_shape_manual(
values = c(16, 17, 18),
labels = c("Cluster 1 (n=16)", "Cluster 2 (n=61)", "Cluster 3 (n=34)") # Adding labels for shape legend
) + # Adjust cluster shapes
scale_color_manual(
values = c("darkorange2", "cadetblue3", "darkseagreen3"),
labels = c("Cluster 1 (n=16)", "Cluster 2 (n=61)", "Cluster 3 (n=34)") # Adding labels for color legend
) + # Customize colors of clusters
guides(
shape = "none", # Remove legend for shape
color = guide_legend(title = "Cluster") # Title for the color legend
)
The location’s distribution across PC1 (x-axis) and PC2 (y-axis) tells us about the environment contrast captured by the principal components.
Cluster 1 is spread from -2.5 to 1.5 (PC1) and from 1 to 5 (PC2), meaning this cluster represents locations with moderate temperature/humidity contrast (PC1) and higher wind speed influence (PC2). The locations likely experience a balance between these two factors, but are less extreme than those in Cluster 2.
Cluster 2 is spread from 0 to 2 (PC1) and from -0.5 to 1 (PC2), meaning this cluster represents locations with higher PC1 values (greater temperature and humidity contrast) and moderate PC2 values (influenced by wind speed). This could indicate regions with a more variable climate, possibly with higher temperatures or more diverse humidity conditions.
Cluster 3 is concentrated around -2 to -1 (PC1) and -1 to 0 (PC2), meaning this cluster likely represents locations with lower PC1 (temperature & humidity contrast) and lower PC2 (wind speed influence). These locations might be cooler, less variable in temperature/humidity, and less affected by the wind.
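Rather than reading cluster extents off the plot, the cluster centers and per-cluster ranges can be taken from the K-means result itself. The sketch below shows one way to do this on simulated PC scores; `toy_scores` is a hypothetical stand-in for the PC1 and PC2 columns, and the cluster numbering it produces is arbitrary.

```r
set.seed(123)

# Simulated PC scores forming three loose groups (not the real scores)
toy_scores <- data.frame(
  PC1 = c(rnorm(20, -2, 0.4),   rnorm(20, 1, 0.4), rnorm(20, 0, 0.4)),
  PC2 = c(rnorm(20, -0.5, 0.4), rnorm(20, 0, 0.4), rnorm(20, 3, 0.4))
)

km <- kmeans(toy_scores, centers = 3, nstart = 25)

# Cluster centers: the mean PC1 and PC2 score of each cluster
print(km$centers)

# Minimum and maximum PC1 and PC2 score within each cluster
cluster_ranges <- aggregate(toy_scores, by = list(Cluster = km$cluster), FUN = range)
print(cluster_ranges)

# Number of observations per cluster
print(km$size)
```

In the analysis above, the equivalent objects would be `kmeans_result$centers` and `kmeans_result$size`.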
This step explores whether specific clusters or data points are linked to particular locations, revealing potential spatial trends or geographic influences on the dataset.
library(dplyr)
# Step 1: Calculate the count of each location per cluster
location_count_summary <- pca_data %>%
group_by(Cluster, Location) %>%
count() %>%
arrange(Location)
# Step 2: Calculate the total count of each location
total_locations <- location_count_summary %>%
group_by(Location) %>%
summarise(total_count = sum(n))
# Step 3: Merge the counts with the total locations to calculate the percentage
location_count_summary <- location_count_summary %>%
left_join(total_locations, by = "Location") %>%
mutate(percentage = (n / total_count) * 100)
# Step 4: Create a bar plot to visualize the percentage of each location associated with each cluster
ggplot(location_count_summary, aes(x = Location, y = percentage, fill = factor(Cluster))) +
geom_bar(stat = "identity", position = "stack") +
labs(title = "Figure 7: Cluster association by location",
x = "Location",
y = "Percentage of Location in Each Cluster") +
scale_fill_manual(values = c("darkorange2", "cadetblue3", "darkseagreen3")) + # Customize colors for clusters
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) # Rotate x-axis labels for readability
Note: The percentages represent the proportion of data from each specific location that is associated with each cluster. For example, 64% of the data from Cambridge Bay is assigned to Cluster 1.
Cluster 1 is associated with Cambridge Bay (64%), Karrak Lake (50%)
Cluster 2 is associated with Churchill (79%), Fairbanks (89%), Toolik (79%), Yellowknife (93%) and Karrak Lake (50%)
Cluster 3 is associated with Whitehorse, Kuujjuaq (60%), Kobberfjord (100%)
Out of a total of 111 data points, 16 were assigned to Cluster 1 (14%), 61 to Cluster 2 (55%), and 34 to Cluster 3 (31%).
This step examines how mosquito counts vary among different clusters by summarizing key statistics such as total count, range, mean, and variability. It helps identify patterns in mosquito abundance across clusters, revealing potential differences in environmental conditions or habitat suitability.
| Cluster | Total | Min | Max | Mean | SD | Variance |
|---|---|---|---|---|---|---|
| 1 | 237 | 0 | 76 | 14.8 | 22.3 | 497.4 |
| 2 | 3166 | 0 | 611 | 93.1 | 170.2 | 28963.7 |
| 3 | 728 | 0 | 89 | 11.9 | 18.0 | 325.3 |
Cluster 1: A total of 237 mosquitoes were caught, with counts per sampling event ranging from 0 to 76. The mean count (14.8) is relatively low, and the standard deviation (22.3), which exceeds the mean, suggests considerable variation in mosquito abundance within this cluster (Cambridge Bay & Karrak Lake).
Cluster 2: Has the highest total count (3166) and a wide range (0–611). The mean (93.1) is much higher than in the other clusters, with a large standard deviation (170.2) and variance (28963.7), indicating extreme variability in mosquito abundance within the cluster (Churchill, Fairbanks, Toolik, Yellowknife, & Karrak Lake).
Cluster 3: A total of 728 mosquitoes were caught, with counts per sampling event between 0 and 89. The mean (11.9) is similar to Cluster 1, but the standard deviation (18.0) and variance (325.3) suggest slightly less variability in mosquito abundance within this cluster (Whitehorse, Kuujjuaq, Kobberfjord).
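The code behind the per-cluster summary table is not shown. A minimal base R sketch of how such a table could be built follows, with an extra `N` column so that the number of sampling events is not confused with the total number of mosquitoes. The `toy` data frame is invented for illustration.

```r
# Made-up counts with cluster labels (not the real data)
toy <- data.frame(
  Cluster        = factor(c(1, 1, 1, 2, 2, 2, 3, 3)),
  mosquito_count = c(0, 10, 76, 0, 120, 611, 5, 89)
)

# Summarise the counts within each cluster
per_cluster <- lapply(split(toy$mosquito_count, toy$Cluster), function(x) {
  data.frame(
    N        = length(x),  # number of sampling events
    Total    = sum(x),     # total mosquitoes caught
    Min      = min(x),
    Max      = max(x),
    Mean     = mean(x),
    SD       = sd(x),
    Variance = var(x)
  )
})

cluster_summary <- do.call(rbind, per_cluster)
cluster_summary <- cbind(Cluster = rownames(cluster_summary), cluster_summary)
print(cluster_summary)
```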
Note: I’m unsure if we can effectively use this information, as it may not provide additional insights worth reporting in the article. We already know how mosquito counts vary by location.
When deciding between Principal Component Analysis (PCA) and Generalized Linear Models (GLM) / Regression, several factors about the dataset must be considered. Below are the key reasons why PCA was a better approach for this dataset.
Problem with Regression/GLM in our data:
When we use regression or GLM with multiple predictor variables that are highly correlated with each other (multicollinearity), it becomes difficult to determine how much each individual variable is actually influencing the outcome. This is because the model struggles to separate out the effects of each variable, as they are providing overlapping information. As a result, the estimates for the coefficients (which tell us how much a variable is contributing to the outcome) may become misleading, making it hard to draw accurate conclusions.
Our dataset contains correlated variables (temperature & humidity = -0.533) that could potentially impact a GLM if both variables are included as predictors in the model. This correlation means that the two variables are moderately negatively related, and including both in a GLM could lead to multicollinearity.
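One common way to quantify how much a correlation between predictors inflates coefficient variances is the variance inflation factor (VIF). As a rough sketch, and assuming a model with only temperature and humidity as predictors, the VIF of either one is 1 / (1 - r^2), where r is their correlation. A model that also includes wind speed or other predictors would need the full regression-based VIF instead.

```r
# Correlation between temperature and humidity, as reported above
r <- -0.533

# With only two predictors, the R-squared of one regressed on the other is r^2
r_squared <- r^2

# Variance inflation factor for either predictor in the two-predictor case
vif <- 1 / (1 - r_squared)
print(round(vif, 2))
```

The result can then be compared against whichever VIF threshold is considered acceptable for the analysis.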
Advantages of PCA with our data:
High dimensionality relative to sample size occurs when we have a large number of predictors (variables) in our model but relatively few observations (data points). In our case, we have 111 observations but multiple predictors like temperature, humidity, wind speed, cloud cover, location, and time period.
Problem with Regression/GLM in our data:
Overfitting: When you use too many predictors relative to the number of observations, your model can become overly complex and fit the noise in the data rather than the true underlying pattern. This is known as overfitting, and it can lead to poor generalization, meaning the model won’t perform well on new or unseen data.
Instability of Coefficients: In high-dimensional data (many predictors), it becomes harder for the model to estimate the true effect of each predictor because of the potential for multicollinearity or redundancy among predictors. The model might produce unstable or unreliable coefficient estimates, making the results less interpretable (as previously mentioned).
Data Scarcity: With only 111 observations, there’s not enough data to support a large number of predictors. This lack of data can prevent the model from accurately learning the relationships between predictors and the outcome variable.
Advantages of PCA with our data:
It helps address this issue by reducing the dimensionality of the data. Instead of using all the original predictors, PCA combines them into fewer principal components that still capture most of the variability in the data. This achieves two things:
Reduces overfitting risks: By reducing the number of predictors, PCA makes the model less complex and helps prevent overfitting.
Keeps most of the variance: PCA transforms the data into new components that account for the most important variations in the data, so we retain the critical information without the noise.
Problem with Regression/GLM in our data:
The mosquito count variable is highly skewed because it contains extreme values (counts range from 0 to 611). This type of distribution is not normal, meaning it doesn’t follow a typical bell-shaped curve. With such a skewed response, ordinary linear regression may not be the best fit because it assumes approximately normally distributed residuals.
Poisson model: It assumes that the mean and variance are equal. But in our case, the mosquito counts show overdispersion, where the variance is much larger than the mean, which violates this assumption about the underlying distribution of the data.
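The degree of overdispersion can be gauged from the summary statistics reported earlier by dividing the variance of the counts by their mean. Under a Poisson distribution this ratio would be close to 1. The sketch below uses the mean and variance of `mosquito_count` copied from the summary table.

```r
# Mean and variance of mosquito_count, copied from the summary table
mean_count <- 37.216216
var_count  <- 10327.716462

# Under a Poisson distribution, variance / mean would be approximately 1
dispersion_ratio <- var_count / mean_count
print(round(dispersion_ratio, 1))
```

This is a crude marginal check; it ignores any predictors, so it describes the raw counts rather than the residual dispersion of a fitted model.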
Advantages of PCA with our data:
It does not rely on specific assumptions about the distribution of the data, such as normality or equal mean-variance relationships. Instead, PCA looks for patterns of variance in the data and compresses the information into a smaller set of principal components that capture the most important features of the data.
It can be particularly helpful in cases where the data is non-normal or exhibits overdispersion because it avoids these distributional issues. It provides a way to transform the data and make it more stable and interpretable for further analysis. This makes PCA a more flexible and robust tool in cases of non-normality and overdispersion.
Problem with Regression/GLM in our data:
In regression models or GLMs, we’re forced to assume a specific relationship between the dependent variable (e.g., mosquito count) and the predictors (e.g., temperature, humidity). This could be linear or another predefined functional form.
However, this approach might not reflect the true structure in our data. For example, we might be incorrectly assuming a linear relationship, when the actual relationship could be more complex or non-linear.
Advantages of PCA with our data:
PCA is an unsupervised method, meaning it doesn’t rely on any predefined relationships between the variables. Unlike in supervised learning, where we typically have a dependent variable (like mosquito counts) that we’re trying to predict based on independent variables (like temperature, humidity, etc.), PCA works without any target variable.
This allows PCA to explore the data freely, finding patterns and relationships among the variables based purely on how they co-vary with each other, rather than on any assumptions or hypotheses about how they should be related.
PCA can reveal important underlying patterns in the data by creating principal components (PCs), which are combinations of the original variables. These components capture the most important sources of variation in the data. For example, PC1 might capture a contrast between temperature and humidity, showing that these two variables are related and have opposite effects on mosquito counts.
Problem with Regression/GLM in our data:
In GLM/regression models, we would often treat categorical variables like location (e.g., Cambridge Bay vs. Churchill) as fixed groups. However, this can introduce unnecessary complexity into the model without providing deeper insight into how different locations are actually related to mosquito behavior.
Simply categorizing locations without accounting for the underlying environmental factors may obscure the true environmental conditions affecting mosquito populations.
Advantages of PCA with our data:
Clustering based on environmental conditions: Instead of directly predicting mosquito counts, PCA allows us to cluster data points based on environmental conditions, which helps reveal how mosquitoes behave differently under varying factors.
K-Means clustering on PCA components: When we apply K-Means clustering to the principal components derived from PCA, we can identify three major environmental conditions where mosquitoes exhibit distinct behaviors. These clusters are based on the actual variation in the data, leading to more meaningful groupings.
Improved understanding of locations: Rather than treating locations like Cambridge Bay and Churchill as simple categorical variables, PCA allows us to see how these locations differ in terms of environmental conditions. For example, it may show that these locations cluster into distinct groups based on temperature, humidity, and other environmental factors, providing deeper insights into how these locations are related to mosquito behavior.
PCA can be a valuable tool for improving regression models. By reducing the number of predictors and transforming the data into principal components, PCA addresses issues like multicollinearity and redundancy, which can lead to more stable and interpretable models. The new set of transformed variables often retains most of the variance in the data, allowing for a more streamlined and effective regression model.
Instead of using multiple environmental variables, the regression model can be simplified to something like: