Environmental Data Analysis

1. Aim of the analysis

The goal of this analysis is to determine whether environmental conditions at the time of sampling influence the number of mosquitoes captured.

Given the high correlation between environmental variables, a Principal Component Analysis (PCA) approach was applied to reduce dimensionality and identify key environmental factors driving mosquito abundance. Additionally, clustering analysis was performed to group locations based on environmental similarities.

Note: After completing the analysis, I’m still uncertain about the relevance of including these results in the article. The PCA biplot shows that higher wind speeds are associated with reduced mosquito counts, which is already established in the literature. Additionally, the PCA biplot doesn’t seem to show a clear impact of temperature and humidity on mosquito counts. I’m also unsure how to interpret the clustering analysis; we can’t easily generalize the result that ‘this cluster likely represents locations with lower PC1 (temperature-humidity contrast) and lower PC2 (wind speed influence)’ since the analysis only covers 7 days and doesn’t fully capture the weather system of the location.

Open and rename our dataset:

data <- read.csv("dataset_environmental_variables.csv", header = TRUE)

2. Data overview

The dataset includes 111 records collected from 9 locations over a span of 3 years, with mosquito sampling conducted either in the morning (AM) or at night (PM). Environmental data were recorded during each sampling session.

A total of 81% of the records are from 2022, and 60% of the sampling occurred during the PM period.
There were minimal missing values in the cloud_cover and period variables (n = 2).
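
The percentages and missing-value counts above can be obtained with a few base R calls. The sketch below uses a small invented data frame (`toy`), not the real dataset; the column names `year`, `period`, and `cloud_cover` are assumed to match those in `data`.

```r
# Toy stand-in for the real dataset (values are invented for illustration)
toy <- data.frame(
  year        = c(2021, 2022, 2022, 2022, 2023),
  period      = c("AM", "PM", "PM", NA, "PM"),
  cloud_cover = c(1, 3, NA, 2, 5)
)

# Share of records per year, in percent
year_pct <- round(100 * prop.table(table(toy$year)), 1)
print(year_pct)

# Share of records per sampling period (missing values are excluded by table())
period_pct <- round(100 * prop.table(table(toy$period)), 1)
print(period_pct)

# Number of missing values per column
na_counts <- colSums(is.na(toy))
print(na_counts)
```

Replacing `toy` with `data` should give the figures reported in this section, assuming the column names match.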

**Table 1: Variable Information**
Variable	Role	Scale	Description
Location	Independent	Categorical nominal	Sampling location
Year	Independent	Numeric discrete	Sampling year
Week	Independent	Numeric discrete	Sampling week number
Period	Independent	Categorical nominal	Sampling period (AM/PM)
Temperature	Independent	Numeric continuous	Air temperature in Celsius
Humidity	Independent	Numeric continuous	Relative humidity in percentage
Wind speed	Independent	Numeric continuous	Wind speed in km/h
Cloud cover	Independent	Categorical ordinal	Cloud cover in 5 categories
Mosquito Count	Dependent	Numeric discrete	Number of mosquitoes caught

2.1 Understanding the distribution of our variables

Here’s an overview of the mean, median, variance and standard deviation for the 5 numerical variables.

**Table 2: Basic Statistics for Numerical Variables**
	Mean	Median	Variance	Standard_Deviation
temperature	17.579279	18.2	21.228930	4.607486
humidity	57.451351	54.8	299.715975	17.312307
wind_speed	3.064324	1.4	17.035734	4.127437
mosquito_count	37.216216	7.0	10327.716462	101.625373
week_num	30.135135	30.0	1.572482	1.253986
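
A table like Table 2 can be built by applying the four statistics to each numeric column. This is a minimal sketch on invented values; the data frame `toy` and its contents are assumptions, and the real analysis would pass the numeric columns of `data` instead.

```r
# Toy numeric data (invented values), standing in for columns of `data`
toy <- data.frame(
  temperature    = c(12.1, 18.2, 20.5, 15.0, 22.3),
  mosquito_count = c(0, 7, 3, 250, 12)
)

# One row per variable, one column per statistic
basic_stats <- t(sapply(toy, function(x) {
  c(Mean               = mean(x, na.rm = TRUE),
    Median             = median(x, na.rm = TRUE),
    Variance           = var(x, na.rm = TRUE),
    Standard_Deviation = sd(x, na.rm = TRUE))
}))
print(basic_stats)
```

Because the standard deviation is the square root of the variance, the last two columns can be used to cross-check each other.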

2.1.1 Result interpretations

TEMPERATURE: The mean (17.58) is close to the median (18.2), suggesting a roughly symmetric distribution without strong skewness. The standard deviation (4.61) indicates moderate variability around the mean.
HUMIDITY: The mean (57.45) and the median (54.8) are close, indicating a relatively symmetric distribution of humidity values. The high variance (299.72) and standard deviation (17.31) show that humidity values are widely spread, meaning that humidity fluctuates considerably between measurements.
WIND SPEED: The mean (3.06) is higher than the median (1.4), indicating a right-skewed distribution with some very high wind speeds. The large variance (17.04) and standard deviation (4.13) suggest that while most wind speeds are relatively low (close to 1.4 km/h), a few extreme values drive the mean up.
MOSQUITO COUNT: The large difference between the mean (37.22) and the median (7.0) suggests a highly right-skewed distribution, with some extreme counts pulling the mean upwards. The high variance (10327.72) and standard deviation (101.63) confirm that mosquito counts vary greatly across samples, with some locations or conditions yielding much higher counts than others.
WEEK NUMBER: The week number has a mean (30.14), which is nearly identical to the median (30.0), suggesting a symmetric distribution with little to no skewness. The standard deviation of 1.25 indicates low variability, meaning the week numbers are fairly consistent and close to the average, with only slight deviations.

2.2 Understanding the correlation of our numeric variables

OVERVIEW: A scatterplot matrix helps us understand the correlation between numerical variables by visually displaying relationships between all possible pairs. It allows us to spot patterns, trends, and potential multicollinearity, which is important for statistical analysis. The plots also help identify outliers that might affect correlation results. Instead of examining variables one by one, a scatterplot matrix provides a quick and comprehensive way to compare multiple relationships at once, making it easier to interpret how variables interact.

# Install and load the necessary package
install.packages("GGally", repos = "https://cran.rstudio.com/")

library(GGally)

# Create a subset of the data with the selected variables
data_subset <- data.frame(
  temperature = data$temperature,
  humidity = data$humidity,
  wind_speed = data$wind_speed,
  mosquito_count = data$mosquito_count,
  week_num = data$week_num
)

# Create the scatterplot matrix
ggpairs(na.omit(data_subset))

2.2.1 How to read the scatterplot matrix

Diagonal: It shows the distribution of each individual variable in the form of a density plot. It gives you a sense of the distribution of each variable, such as whether it’s normally distributed, skewed, or has multiple modes.
Upper triangle: It shows the correlation coefficients between the pairs of variables, which quantify the strength and direction of the linear relationship between them. These values range from -1 to 1.
Lower triangle: It shows the scatterplots for each pair of variables. It provides a visual representation of the relationship between two variables. If there’s a strong relationship, the points will follow a clear trend (e.g., a straight line or curve).

2.2.2 Results interpretations

The most notable relationship is between humidity and temperature, with a moderate negative correlation (-0.533), suggesting that as temperature increases, humidity tends to decrease.
There is another notable relationship between mosquito count and week number, with a moderate negative correlation (-0.366), suggesting that…? Note: The scatterplot doesn’t show much of a trend, so I’m not sure there’s a relationship.
Most of the correlations involving mosquito count are weak (near 0), meaning that mosquito count shows no strong linear association with temperature, humidity, or wind speed.
Wind speed shows weak negative correlations with both temperature (-0.182) and humidity (-0.128), indicating a weak inverse relationship between these variables.
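
The coefficients shown in the upper triangle can also be computed directly with `cor()`. The sketch below uses invented values in a data frame called `toy` (an assumption, not the real dataset). It also shows the rank-based Spearman coefficient, which is less sensitive to skewed variables and outliers and might be worth comparing for a variable such as mosquito count.

```r
# Toy stand-in (invented values), not the real dataset
toy <- data.frame(
  temperature = c(10, 14, 18, 21, 25, 17),
  humidity    = c(80, 72, 60, 55, 41, 63),
  wind_speed  = c(1.2, 0.5, 6.0, 1.4, 2.2, 0.8)
)

# Pearson correlation matrix, using only complete rows
cor_pearson <- cor(toy, use = "complete.obs", method = "pearson")
print(round(cor_pearson, 3))

# Rank-based alternative, less sensitive to skew and outliers
cor_spearman <- cor(toy, use = "complete.obs", method = "spearman")
print(round(cor_spearman, 3))
```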

2.3 Checking the normality of the dependent variable

Here, I tested the normality of the variable mosquito_count using the Shapiro-Wilk Test.

## 
##  Shapiro-Wilk normality test
## 
## data:  data$mosquito_count
## W = 0.38462, p-value < 2.2e-16

2.3.1 Result interpretations

W = 0.38462: This statistic measures how well the data fit a normal distribution. A value close to 1 suggests that the data are approximately normally distributed; our value is far below 1, indicating that the mosquito count is not normally distributed.
p-value < 2.2e-16: This is the probability of obtaining a test statistic at least as extreme as the one observed, assuming that the data are normally distributed. Because it is far below 0.05, the null hypothesis that the data follow a normal distribution is rejected.
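
The call that produces this kind of output is `shapiro.test()` from base R. The sketch below runs it on an invented, strongly right-skewed vector (`toy_counts`, an assumption standing in for `data$mosquito_count`) and shows how to extract the two values discussed above.

```r
# Invented right-skewed counts (111 values), standing in for mosquito_count
toy_counts <- c(rep(0, 30), rep(2, 25), rep(7, 25), rep(20, 15), rep(90, 10),
                150, 250, 300, 400, 500, 611)

sw <- shapiro.test(toy_counts)
print(sw)

# The W statistic and the p-value can be extracted for reporting
w_stat  <- unname(sw$statistic)
p_value <- sw$p.value
```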

3. Principal Component Analysis (PCA)

OVERVIEW: Principal Component Analysis (PCA) is a dimensionality reduction technique that helps identify patterns in high-dimensional datasets by transforming them into a set of new uncorrelated variables called **principal components (PCs).** These components capture the most variance in the data while reducing redundancy.

Note: I didn’t include the variable ‘week_num’ in the analysis, as it is not considered an environmental variable.

3.1 Prepare the data

This step prepares the dataset for PCA by selecting the numerical variables, removing rows with missing values, and standardizing the variables (centering and scaling) so that they carry equal weight.

# Select numerical variables for PCA
pca_data <- data[, c("temperature", "humidity", "wind_speed", "mosquito_count")]

# Remove rows with missing values
pca_data <- na.omit(pca_data)

# Standardize the data
pca_data_scaled <- scale(pca_data)

3.1.1 Notes on data standardization

Standardization puts all variables on the same scale, so no single variable dominates the analysis just because it has larger numbers. For example, in our dataset, temperature ranges from 5 to 30, humidity ranges from 30 to 90, and mosquito count ranges from 0 to 500.
Since mosquito count has much larger values, it could dominate the PCA results, making it harder to see the real patterns in the data. To make sure that all variables contribute equally to the PCA, each variable is transformed to have a mean of 0 and a standard deviation of 1.
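
The transformation can be checked by hand: each value has its column mean subtracted and is then divided by the column standard deviation. The sketch below uses invented values (`toy` is an assumption, not the real dataset) and compares the manual result with `scale()`.

```r
# Toy values on very different scales (invented)
toy <- data.frame(
  temperature    = c(5, 12, 18, 24, 30),
  mosquito_count = c(0, 4, 10, 60, 500)
)

# Manual z-score: subtract the column mean, divide by the column standard deviation
z_manual <- as.data.frame(lapply(toy, function(x) (x - mean(x)) / sd(x)))

# scale() performs the same centering and scaling
z_scale <- scale(toy)

col_means <- colMeans(z_scale)       # approximately 0 for every column
col_sds   <- apply(z_scale, 2, sd)   # 1 for every column
same      <- all.equal(as.matrix(z_manual), z_scale, check.attributes = FALSE)
print(same)
```

After standardization, a mosquito count of 500 and a temperature of 30 are both expressed as a number of standard deviations above their own column mean, which makes them comparable.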

3.2 Perform PCA

This step reduces the dimensionality of the dataset while preserving as much variance as possible by transforming correlated variables into a smaller set of uncorrelated principal components.

pca_result <- prcomp(pca_data_scaled, center = TRUE, scale. = TRUE)

3.3 View PCA Summary

This provides key statistics, including the proportion of variance explained by each principal component, helping determine how many components to retain for analysis.

summary(pca_result)  # Check the proportion of variance explained

## Importance of components:
##                           PC1    PC2    PC3    PC4
## Standard deviation     1.2551 1.0565 0.9598 0.6223
## Proportion of Variance 0.3938 0.2790 0.2303 0.0968
## Cumulative Proportion  0.3938 0.6729 0.9032 1.0000

3.3.1 Results interpretations

STANDARD DEVIATION: This tells us how much each PC spreads out the data. A higher standard deviation means that the PC captures more variation in the data. Here, **PC1 has the highest standard deviation (1.2551), meaning it captures the most variation in the dataset**.
PROPORTION OF VARIANCE: This tells us how much of the total variation in the data is explained by each principal component. PC1 explains 39.38% of the variance, PC2 explains 27.90% of the variance, PC3 explains 23.03% of the variance, and PC4 explains 9.68% of the variance.
CUMULATIVE PROPORTION: This tells us how much variance is explained when we add up the components. PC1 alone explains 39.38% of the variance, PC1+PC2 explain 67.29% of the variance, PC1+PC2+PC3 explain 90.32% of the variance, and PC1+PC2+PC3+PC4 explain 100% of the variance.
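
The three rows of the summary are linked: the variance captured by a PC is its standard deviation squared, and the proportion of variance is that value divided by the total. The sketch below recomputes the proportions from the standard deviations copied from the summary output; since those inputs are rounded to four decimals, the results are expected to match the reported proportions only to about three decimals. With the fitted object, the unrounded values would be available as `pca_result$sdev`.

```r
# Standard deviations of the four PCs, copied from the summary output
sdev <- c(PC1 = 1.2551, PC2 = 1.0565, PC3 = 0.9598, PC4 = 0.6223)

eigenvalues <- sdev^2                           # variance captured by each PC
prop_var    <- eigenvalues / sum(eigenvalues)   # proportion of variance
cum_var     <- cumsum(prop_var)                 # cumulative proportion

print(round(rbind(prop_var, cum_var), 3))
```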

3.3.2 Results summary

Since PC1 and PC2 together explain about 67% of the variation, they contain the most important patterns in the dataset.
If we want to reduce the number of dimensions in our data (for visualization and analysis), we can focus only on PC1 and PC2 since they explain most of the variation.

3.4 Get the contribution of each variable to the PCA

This identifies which variables have the greatest influence on each principal component, revealing patterns and relationships in the data.

# Squared loadings (eigenvectors)
loadings_squared <- pca_result$rotation^2 

# Compute % contribution of each variable to each PC
contributions <- sweep(loadings_squared, 2, colSums(loadings_squared), "/") * 100

# Print contributions
print(contributions)

##                       PC1       PC2        PC3        PC4
## temperature    47.1879676  7.013986  0.1856846 45.6123615
## humidity       46.1846388  2.159676  9.0423669 42.6133185
## wind_speed      0.1147859 71.976111 16.2519860 11.6571169
## mosquito_count  6.5126077 18.850227 74.5199625  0.1172031

3.4.1 Results interpretations

Temperature (47%) and humidity (46%) contribute most to PC1.
Wind speed (72%) contributes most to PC2, followed by mosquito count (19%).
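
Each column of the rotation matrix returned by `prcomp()` is a unit vector, so its squared entries already sum to 1 and the percentage contributions in each column sum to 100. The sketch below illustrates this on the built-in `USArrests` dataset, which is used purely as a stand-in (an assumption, not the mosquito data), and picks out the top contributor per component.

```r
# Stand-in data: the built-in USArrests dataset (not the mosquito data)
pca_demo <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# Squared loadings sum to 1 within each principal component
sq_sums <- colSums(pca_demo$rotation^2)
print(sq_sums)

# Percentage contribution of each variable to each PC
contrib_demo <- 100 * pca_demo$rotation^2
print(round(contrib_demo, 1))

# Variable contributing most to each PC
top_var <- rownames(contrib_demo)[apply(contrib_demo, 2, which.max)]
names(top_var) <- colnames(contrib_demo)
print(top_var)
```

Reading the largest value per column in this way avoids picking the wrong row label when interpreting a contribution table by eye.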

3.5 PCA biplot (visualizing principal components)

These plots visually represent how variables and observations relate to the principal components, aiding interpretation of patterns and groupings in the data.

library(ggbiplot)

ggbiplot(pca_result, labels = rownames(pca_data), ellipse = TRUE, circle = TRUE) +
  theme_minimal() +
  labs(title = "PCA Biplot")

3.5.1 Interpreting the biplot

Interpreting variables (arrows)

The arrows represent the original variables in the dataset.
Arrows pointing in the same direction are positively correlated.
Angles between arrows: small angles (close to 0°) = strong positive correlation, right angles (90°) = no correlation, opposite directions (180°) = strong negative correlation.
Length of the arrows: a longer arrow means the variable is well represented by PC1 and PC2; a shorter arrow means the variable is less important in these components.

Interpreting observations (points)

Each dot represents an observation (sampling event).
Observations close together mean that they have similar values for the variables, while observations far apart mean that they are more different in terms of their environmental conditions.
If a point is in the same direction as a variable’s arrow, it means that the observation has a high value for that variable.

Identifying patterns in the data

Clusters: If some points group together, it may indicate natural patterns in the dataset.
Extreme points: Outliers might appear far from the rest, indicating unusual conditions in those observations.

3.5.2 Results interpretations

When looking at the PC1 axis, the arrows for humidity and temperature point in opposite directions, meaning that higher temperatures are associated with lower humidity. This agrees with our previous correlation analysis.
When looking at the PC2 axis, the arrows for wind speed and mosquito count point in opposite directions, meaning that higher wind speeds are associated with fewer mosquitoes caught.
There seems to be a cluster around 0 on the PC2 axis, suggesting that…? Note: I’m not sure how to interpret that.

4. Clustering analysis on PCA results using K-Means

Clustering analysis, such as K-means, helps to group similar observations based on their principal component (PCA) scores. Here’s why it’s useful:

PCA reduces dimensionality, making clustering more effective

When you have multiple variables, clustering in the original space can be challenging due to noise and redundant information.
PCA transforms data into a set of uncorrelated principal components which makes patterns easier to detect.
K-means clustering on PCA-reduced data is often more stable and interpretable because the discarded components tend to carry mostly noise.

Identifying hidden patterns or groups in the data

PCA helps visualize relationships between observations, but clustering adds structure by grouping similar data points.
K-means assigns each observation to a cluster, allowing us to find natural groupings in the data that may not be obvious.
This is particularly useful in environmental data to identify patterns in mosquito populations or group similar environmental conditions.

Improves computational efficiency

K-means clustering works best in lower-dimensional spaces because it relies on distance calculations.
Running K-means on PCA-reduced data reduces the number of dimensions, making clustering faster and more efficient.

Helps in data interpretation and decision-making

After clustering, we can analyze what characterizes each group based on PCA components.
This helps in making data-driven decisions, such as: “Which environmental factors influence mosquito abundance?” or “Are there distinct weather patterns associated with high/low mosquito counts?”

Visualization of clusters in PCA space

Plotting clusters on a PCA biplot helps visually confirm whether the K-means clustering aligns well with the data structure.
If distinct clusters appear, this means that the PCA transformation effectively captured meaningful patterns in the dataset.

4.1 The Elbow Method in K-means clustering

The Elbow Method is a technique used to determine the optimal number of clusters (k) in K-means clustering. It helps find the best balance between minimizing within-cluster variation and avoiding too many clusters.

# Load necessary libraries
library(ggplot2)

# Extract the first few Principal Components (e.g., PC1 and PC2)
pca_data_k <- pca_result$x[, 1:2]  # Select first 2 PCs

# Fix the random seed so the random starts of kmeans() are reproducible
set.seed(123)

# Define a range for k (number of clusters)
wss <- numeric(10)  # To store within-cluster sum of squares
for (k in 1:10) {
  kmeans_result <- kmeans(pca_data_k, centers = k, nstart = 25)
  wss[k] <- kmeans_result$tot.withinss  # Total within-cluster sum of squares
}

# Create an Elbow Plot
elbow_plot <- ggplot(data.frame(k = 1:10, wss = wss), aes(x = k, y = wss)) +
  geom_point(color = "black", size = 3) +       # Black points
  geom_line(color = "blue", linewidth = 0.8) +  # Blue curve
  scale_x_continuous(breaks = 1:10) +           # Ensure x-axis is divided by 1
  scale_y_continuous(breaks = seq(0, max(wss), by = 50)) + # Y-axis increments by 50
  labs(title = "Figure 4. Elbow Method for Optimal k",
       x = "Number of clusters (k)",
       y = "Total (within-cluster sum of squares)") +
  theme_minimal() +
  theme(
    axis.line = element_line(color = "black"),  # Black axis lines
    axis.ticks = element_line(color = "black")  # Black axis ticks
  )

# Print the plot
print(elbow_plot)

4.1.1 How to interpret the Elbow Plot?

The biggest drop is between k=2 and k=3, but the drop somewhat flattens after k=3, meaning that k=3 is likely the best choice.
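
Reading the elbow by eye can be backed up by tabulating how much the within-cluster sum of squares falls at each step. The sketch below is an illustration on stand-in data: it uses the first two PCs of the built-in `USArrests` dataset (an assumption, not the mosquito data), and the names `scores_demo` and `wss_demo` are hypothetical.

```r
set.seed(123)  # kmeans() uses random starts

# Stand-in for PCA scores: first two PCs of the built-in USArrests data
scores_demo <- prcomp(USArrests, scale. = TRUE)$x[, 1:2]

# Total within-cluster sum of squares for k = 1 to 10
wss_demo <- sapply(1:10, function(k) {
  kmeans(scores_demo, centers = k, nstart = 25)$tot.withinss
})

# Absolute and relative drop in WSS when going from k - 1 to k clusters
drop_abs <- -diff(wss_demo)
drop_rel <- drop_abs / wss_demo[-length(wss_demo)]
drops <- data.frame(k = 2:10,
                    drop_abs = round(drop_abs, 2),
                    drop_rel = round(drop_rel, 3))
print(drops)
```

The elbow is the value of k after which the drops become small compared with the earlier ones; the same table could be produced from the `wss` vector computed above.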

4.2 Spatial distribution on the plot (PC1 vs PC2)

The spatial distribution on the PC1 vs. PC2 plot helps visualize how observations group based on their principal component scores. Points that are closer together share similar characteristics, while those farther apart indicate greater differences in the dataset.

# Extract the first two principal components
pca_data <- as.data.frame(pca_result$x[, 1:2])  # Use PC1 and PC2
data$location_ab <- substr(data$location, 1, 2) # Create a new column with the first 2 characters of each entry
pca_data$Location <- data$location_ab  # Add location names

# Apply K-means clustering (choose k based on the Elbow Method)
set.seed(123)  # For reproducibility
k <- 3  # Set the optimal number of clusters based on previous analysis
kmeans_result <- kmeans(pca_data[, 1:2], centers = k, nstart = 25)

# Add cluster labels to the PCA dataset
pca_data$Cluster <- as.factor(kmeans_result$cluster)

# Plot PCA with clusters
ggplot(pca_data, aes(x = PC1, y = PC2, color = Cluster)) +
  geom_point(aes(shape = factor(Cluster)), size = 2, alpha = 0.7) +  # Scatter plot points
  geom_text(aes(label = Location), vjust = -1, size = 3, show.legend = FALSE) +  # Remove location label legend
  labs(
    title = "Figure 5: Clustering of locations based on PCA",
    x = "Principal Component 1 (temperature & humidity contrast)",
    y = "Principal Component 2 (wind speed influence)"
  ) +
  theme_minimal() +
  theme(
    axis.line = element_line(color = "black"),
    axis.ticks = element_line(color = "black")
  ) +
  scale_shape_manual(
    values = c(16, 17, 18), 
    labels = c("Cluster 1 (n=16)", "Cluster 2 (n=61)", "Cluster 3 (n=34)")  # Adding labels for shape legend
  ) +  # Adjust cluster shapes
  scale_color_manual(
    values = c("darkorange2", "cadetblue3", "darkseagreen3"),
    labels = c("Cluster 1 (n=16)", "Cluster 2 (n=61)", "Cluster 3 (n=34)")  # Adding labels for color legend
  ) +  # Customize colors of clusters
  guides(
    shape = "none",  # Remove legend for shape
    color = guide_legend(title = "Cluster")  # Title for the color legend
  )

4.2.1 Results interpretations

The distribution of the locations across PC1 (x-axis) and PC2 (y-axis) tells us about the environmental contrast captured by the principal components.

Cluster 1 is spread between -2.5 and 1.5 (PC1) and between 1 and 5 (PC2), meaning this cluster represents locations with moderate temperature/humidity contrast (PC1) and higher wind speed influence (PC2). The locations likely experience a balance between these two factors, but are less extreme than those in Cluster 2.
Cluster 2 is spread between 0 and 2 (PC1) and between -0.5 and 1 (PC2), meaning this cluster represents locations with higher PC1 values (greater temperature and humidity contrast) and moderate PC2 values (influenced by wind speed). This could indicate regions with a more variable climate, possibly with higher temperatures or more diverse humidity conditions.
Cluster 3 is concentrated around -2 to -1 (PC1) and -1 to 0 (PC2), meaning this cluster likely represents locations with lower PC1 (temperature & humidity contrast) and lower PC2 (wind speed influence). These locations might be cooler, less variable in temperature/humidity, and less affected by the wind.
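
Instead of reading ranges off the plot, the cluster centres and the observed PC1/PC2 range per cluster can be extracted from the fitted object. This is a sketch on stand-in data: it uses the first two PCs of the built-in `USArrests` dataset (an assumption, not the mosquito data), and `scores_demo` and `km_demo` are hypothetical names.

```r
set.seed(123)  # kmeans() uses random starts

# Stand-in scores: first two PCs of the built-in USArrests data
scores_demo <- as.data.frame(prcomp(USArrests, scale. = TRUE)$x[, 1:2])
km_demo <- kmeans(scores_demo, centers = 3, nstart = 25)

# Cluster centres in PC space: one row per cluster, one column per PC
centres <- round(km_demo$centers, 2)
print(centres)

# Observed minimum and maximum of PC1 and PC2 within each cluster
ranges <- aggregate(scores_demo, by = list(Cluster = km_demo$cluster), FUN = range)
print(ranges)
```

Applied to `kmeans_result` and `pca_data`, the same calls would give the numbers behind the verbal descriptions above.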

4.3 Cluster distribution and location associations

This step explores whether specific clusters or data points are linked to particular locations, revealing potential spatial trends or geographic influences on the dataset.

library(dplyr)

# Step 1: Calculate the count of each location per cluster
location_count_summary <- pca_data %>%
  group_by(Cluster, Location) %>%
  count() %>%
  arrange(Location)

# Step 2: Calculate the total count of each location
total_locations <- location_count_summary %>%
  group_by(Location) %>%
  summarise(total_count = sum(n))

# Step 3: Merge the counts with the total locations to calculate the percentage
location_count_summary <- location_count_summary %>%
  left_join(total_locations, by = "Location") %>%
  mutate(percentage = (n / total_count) * 100)

# Step 4: Create a bar plot to visualize the percentage of each location associated with each cluster
ggplot(location_count_summary, aes(x = Location, y = percentage, fill = factor(Cluster))) +
  geom_bar(stat = "identity", position = "stack") +
  labs(title = "Figure 7: Cluster association by location",
       x = "Location",
       y = "Percentage of Location in Each Cluster") +
  scale_fill_manual(values = c("darkorange2", "cadetblue3", "darkseagreen3")) +  # Customize colors for clusters
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))  # Rotate x-axis labels for readability

4.3.1 Results interpretations

Note: The percentages represent the proportion of data from each specific location that is associated with each cluster. For example, 64% of the data from Cambridge Bay is assigned to Cluster 1.

Cluster 1 is associated with Cambridge Bay (64%) and Karrak Lake (50%).
Cluster 2 is associated with Churchill (79%), Fairbanks (89%), Toolik (79%), Yellowknife (93%), and Karrak Lake (50%).
Cluster 3 is associated with Whitehorse, Kuujjuaq (60%), and Kobberfjord (100%).

Out of a total of 111 data points, 16 were assigned to Cluster 1 (14%), 61 to Cluster 2 (55%), and 34 to Cluster 3 (31%).
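
These shares follow directly from the cluster sizes. A short check, using the three counts stated above (the vector name `cluster_n` is hypothetical):

```r
# Cluster sizes as reported in the text
cluster_n <- c("1" = 16, "2" = 61, "3" = 34)

total_n     <- sum(cluster_n)                  # 111
cluster_pct <- round(100 * cluster_n / total_n)
print(cluster_pct)                             # 14, 55, 31
```

On the fitted data, the sizes themselves could be obtained with `table(pca_data$Cluster)`.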

4.4 Mosquito counts across clusters

This step examines how mosquito counts vary among different clusters by summarizing key statistics such as total count, range, mean, and variability. It helps identify patterns in mosquito abundance across clusters, revealing potential differences in environmental conditions or habitat suitability.

**Table 3: Summary statistics of mosquito counts across clusters**
Cluster	Total	Max	Mean	SD	Variance
1	237	76	14.8	22.3	497.4
2	3166	611	93.1	170.2	28963.7
3	728	89	11.9	18.0	325.3
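
A summary of this kind can be computed by splitting the counts by cluster and applying the statistics to each group. The sketch below uses invented values (`toy` is an assumption, not the real dataset) and adds the number of observations per cluster (`N`), which helps to keep the total count of mosquitoes distinct from the number of sampling events.

```r
# Toy stand-in (invented values): mosquito counts with a cluster label
toy <- data.frame(
  Cluster        = factor(c(1, 1, 1, 2, 2, 2, 3, 3)),
  mosquito_count = c(0, 10, 20, 5, 100, 300, 2, 8)
)

# One row of summary statistics per cluster
cluster_stats <- do.call(rbind, lapply(
  split(toy$mosquito_count, toy$Cluster),
  function(x) {
    data.frame(N = length(x), Total = sum(x), Max = max(x),
               Mean = mean(x), SD = sd(x), Variance = var(x))
  }
))
print(round(cluster_stats, 1))
```

Dividing `Total` by `N` reproduces `Mean`, which is a quick consistency check for any table of this form.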

4.4.1 Results interpretations

Cluster 1: A total of 237 mosquitoes were caught, with counts ranging from 0 to 76. The mean count (14.8) is relatively low, and the standard deviation (22.3), which exceeds the mean, suggests considerable variation in mosquito abundance within this cluster (Cambridge Bay & Karrak Lake).
Cluster 2: Has the highest total count (3166) and a wide range (0–611). The mean (93.1) is much higher than in the other clusters, with a large standard deviation (170.2) and variance (28963.7), indicating extreme variability in mosquito abundance within the cluster (Churchill, Fairbanks, Toolik, Yellowknife, & Karrak Lake).
Cluster 3: A total of 728 mosquitoes were caught, with counts between 0 and 89. The mean (11.9) is similar to that of Cluster 1, and the standard deviation (18.0) and variance (325.3) suggest slightly less variability in mosquito abundance within this cluster (Whitehorse, Kuujjuaq, Kobberfjord).

Note: I’m unsure if we can effectively use this information, as it may not provide additional insights worth reporting in the article. We already know how mosquito counts vary by location.

5. Thoughts on why PCA is more suitable than direct regression or GLM for this data

When deciding between Principal Component Analysis (PCA) and Generalized Linear Models (GLM) / Regression, several factors about the dataset must be considered. Below are the key reasons why PCA was a better approach for this dataset.