This material corresponds to Week 6 of the labs and demos for Bio 4259F (Undergraduate) and Bio 9915A (Graduate) at Western University.
The datasets used in this exercise — VegSamplesV1.txt and SparrowsElphick.txt — originate from the examples provided in the article “A protocol for data exploration to avoid common statistical problems” by Zuur, Ieno, and Elphick (2010). They were originally collected to illustrate real ecological scenarios involving vegetation structure and bird morphology. The workflow applied here is an adaptation of that paper’s protocol, designed to guide students step-by-step through a technical data-exploration process: checking data quality, identifying outliers, diagnosing collinearity, simplifying models, and visualizing relationships to ensure that final statistical analyses are robust and interpretable.
Note: it’s a good idea to load libraries in a separate code chunk, so you don’t have to mess with the setup chunk again.
library(here)
library(lattice)
library(car)
library(ggplot2)
library(readr)
library(dplyr)
library(GGally)
library(plotly)
Read VegSamplesV1.txt into veg; quickly peek at the data with head() and str().
veg <- read.table("data/VegSamplesV1.txt", header = TRUE)
head(veg)
## Year Site UniversalPlotName Banded PtCountsum Avgmaxht Avgdens
## 1 2002 EastRiverMarsh 02PLOT012 31 53 33.19 23.30
## 2 2002 EastRiverMarsh 02PLOT013 40 40 48.22 27.63
## 3 2002 EastRiverMarsh 02PLOT014 15 15 33.81 26.40
## 4 2002 EastRiverMarsh 02PLOT015 27 33 28.34 45.63
## 5 2002 EastRiverMarsh 02PLOT016 20 17 28.75 38.88
## 6 2002 EastRiverMarsh 02PLOT017 17 30 24.19 53.60
## ht.thatch S.patens Distichlis S.alternifloraShort S.alternifloraTall Juncus
## 1 4.67 21.31 41.47 32.22 2.78 0.00
## 2 6.56 11.67 52.78 23.61 9.72 0.00
## 3 3.33 29.09 17.80 35.89 0.00 2.22
## 4 7.33 52.22 19.76 5.56 3.33 7.78
## 5 3.44 47.84 12.78 15.49 10.00 0.00
## 6 3.44 60.42 0.01 26.44 0.00 0.00
## Bare Other Phragmites Shrub Tallsedge Water
## 1 1.67 0.56 0 0.00 0 0.00
## 2 0.00 0.00 0 0.00 0 2.22
## 3 8.33 0.00 0 6.67 0 0.00
## 4 2.22 8.89 0 0.00 0 0.25
## 5 10.14 3.75 0 0.00 0 0.00
## 6 2.01 0.00 0 0.00 0 11.11
What to look for.
Do variable names match what we expect?
Are numeric vs. factor types sensible?
Any obvious formatting issues?
Interpretation. If types and names look correct, we can proceed to explore content (values and distributions).
str(veg)
## 'data.frame': 60 obs. of 19 variables:
## $ Year : int 2002 2002 2002 2002 2002 2002 2002 2002 2002 2002 ...
## $ Site : chr "EastRiverMarsh" "EastRiverMarsh" "EastRiverMarsh" "EastRiverMarsh" ...
## $ UniversalPlotName : chr "02PLOT012" "02PLOT013" "02PLOT014" "02PLOT015" ...
## $ Banded : int 31 40 15 27 20 17 17 29 13 37 ...
## $ PtCountsum : int 53 40 15 33 17 30 49 39 22 23 ...
## $ Avgmaxht : num 33.2 48.2 33.8 28.3 28.8 ...
## $ Avgdens : num 23.3 27.6 26.4 45.6 38.9 ...
## $ ht.thatch : num 4.67 6.56 3.33 7.33 3.44 3.44 2.11 5.89 5.22 0.89 ...
## $ S.patens : num 21.3 11.7 29.1 52.2 47.8 ...
## $ Distichlis : num 41.5 52.8 17.8 19.8 12.8 ...
## $ S.alternifloraShort: num 32.22 23.61 35.89 5.56 15.49 ...
## $ S.alternifloraTall : num 2.78 9.72 0 3.33 10 ...
## $ Juncus : num 0 0 2.22 7.78 0 0 0.56 5.56 0 0 ...
## $ Bare : num 1.67 0 8.33 2.22 10.14 ...
## $ Other : num 0.56 0 0 8.89 3.75 0 0.56 0 1.91 0.26 ...
## $ Phragmites : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Shrub : num 0 0 6.67 0 0 0 0 0 0 0 ...
## $ Tallsedge : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Water : num 0 2.22 0 0.25 0 ...
#glimpse(veg)
Goal… Understand the basic scale and spread of the response variable Banded.
#summary(veg)
#colSums(is.na(veg))
range(veg$Banded) # response variable
## [1] 0 51
#range(veg$Avgmaxht) # example covariate
Outliers at Y (points far from the bulk).
par(mfrow=c(1,2))
boxplot(veg$Banded, ylab="Banded sparrows")
dotchart(veg$Banded, xlab="Banded sparrows")
There a nicer way to see it. It uses ggplotly(), which converts each ggplot into an interactive object. subplot() from plotly arranges both side-by-side (like your mfrow = c(1,2)). Hovering over points will show each observation’s value (and its position), so you can see and identify outliers interactively.
p_box <- ggplot(veg, aes(y = Banded)) +
geom_boxplot(outlier.color = "red", outlier.shape = 16, fill = "lightgray") +
labs(title = "Boxplot of Banded Sparrows",
y = "Number of banded sparrows") +
theme_minimal(base_size = 13)
# Interactive dotchart (using rank order)
p_dot <- ggplot(veg, aes(x = seq_along(Banded), y = Banded)) +
geom_point(color = "darkblue", size = 2, alpha = 0.7) +
labs(title = "Dotchart of Banded Sparrows",
x = "Observation index", y = "Number of banded sparrows") +
theme_minimal(base_size = 13)
# Combine interactively with plotly
subplot(ggplotly(p_box), ggplotly(p_dot), nrows = 1, margin = 0.05, titleX = TRUE, titleY = TRUE)
Also take a brief look at Skewness (long tails) or multimodality.Values that seem impossible for your study design.
hist(veg$Banded, breaks=20, col="white", border="black")
Interpretation. If outliers exist, note them; don’t
delete them automatically — they may be real! Skewness suggests
transformations or non-Gaussian models later.
Goal… Screen all vegetation covariates for pairwise correlation and individual shapes.
Then first see which variable we have…
colnames(veg)
## [1] "Year" "Site" "UniversalPlotName"
## [4] "Banded" "PtCountsum" "Avgmaxht"
## [7] "Avgdens" "ht.thatch" "S.patens"
## [10] "Distichlis" "S.alternifloraShort" "S.alternifloraTall"
## [13] "Juncus" "Bare" "Other"
## [16] "Phragmites" "Shrub" "Tallsedge"
## [19] "Water"
Define numvars (the numeric vegetation predictors).We recommend using GGally::ggpairs() to see histograms/densities on the diagonal, scatterplots below, and correlations above.
numvars <- c("Avgmaxht","Avgdens","ht.thatch","S.patens","Distichlis",
"S.alternifloraShort","S.alternifloraTall","Juncus","Bare",
"Other","Phragmites","Shrub","Tallsedge","Water")
GGally::ggpairs(
veg[, numvars],
progress = FALSE,
title = "Correlation and distribution of vegetation variables",
upper = list(continuous = wrap("cor", size = 3.5)),
lower = list(continuous = wrap("points", alpha = 0.3, size = 0.8)),
diag = list(continuous = wrap("densityDiag"))
)
What to look for: 1. Strong positive/negative correlations (|r| close to 1).
About (1), the interpretation should be high pairwise correlations hint at multicollinearity, which we’ll test formally with VIF next… but gets you an approach.
#### I you dont wanna have a lot of time for the correlation try
#pairs(veg[, numvars])
#cor(veg[, numvars], use="pairwise")
Quantify multicollinearity among predictors.Tere is too may ways to do the same. So if you just want to see directly the vif you can go as:
vif(lm(Banded ~ ., data = veg[, c("Banded", numvars)]))
## Avgmaxht Avgdens ht.thatch S.patens
## 6.120018 3.206401 1.671224 159.350658
## Distichlis S.alternifloraShort S.alternifloraTall Juncus
## 53.754540 121.463782 159.382860 44.995377
## Bare Other Phragmites Shrub
## 12.058665 5.817015 3.749027 2.781804
## Tallsedge Water
## 4.409398 17.067777
It is just a way to code it. Try to see what you do there. your are just making the modelling and add it directly to the finction vif() from car package. What to look for,
VIF > 10 is a red flag; > 5 merits attention (rules of thumb).
In this dataset, S.patens and S.alternifloraTall show an extremely high VIF, signaling redundancy with other vegetation covers…. Take a brief look…
xyplot(Banded ~ S.patens, data=veg, type=c("p","smooth"),
xlab="X variable", ylab="Banded sparrows")
Interpretation.Severe multicollinearity inflates standard errors and destabilizes coefficients.We will remove the highest-VIF variable(s) and re-fit. So let’s start with a normal workflow.
Our goal at this step is to fit the initial model and check assumptions.
Fit m1 with all 14 predictors:
m1 <- lm(Banded ~ Avgmaxht + Avgdens + ht.thatch + S.patens + Distichlis +
S.alternifloraShort + S.alternifloraTall + Juncus + Bare + Other +
Phragmites + Shrub + Tallsedge + Water, data = veg)
Plot residuals vs fitted (homoscedasticity) and QQ plot of residuals (normality)
plot(fitted(m1), resid(m1), pch=16,
xlab="Fitted values", ylab="Residuals"); abline(h=0,lty=2)
qqnorm(resid(m1)); qqline(resid(m1))
If
assumptions look poor, as you’d already learned we may transform Y or
use alternative error structures later. But first, we’ll fix
collinearity (it’s the major problem here).
We use vif function as follows:
car::vif(m1)
## Avgmaxht Avgdens ht.thatch S.patens
## 6.120018 3.206401 1.671224 159.350658
## Distichlis S.alternifloraShort S.alternifloraTall Juncus
## 53.754540 121.463782 159.382860 44.995377
## Bare Other Phragmites Shrub
## 12.058665 5.817015 3.749027 2.781804
## Tallsedge Water
## 4.409398 17.067777
Drop the strongest source of collinearity and reassess.In our case we just selected ‘S.patens’.
m2 <- update(m1, . ~ . - S.patens)
car::vif(m2)
## Avgmaxht Avgdens ht.thatch Distichlis
## 6.093919 3.189566 1.659956 1.484444
## S.alternifloraShort S.alternifloraTall Juncus Bare
## 2.951472 4.431942 1.722569 1.652923
## Other Phragmites Shrub Tallsedge
## 1.649813 1.538195 1.315428 2.178762
## Water
## 1.366047
Try also to use summary(m1)
Take a meticulous look at the structure, specially “- S.patens”, what it means? Now we have a new model called m2, does it? So all remaining VIFs ideally < 5 (certainly < 10). Coefficients are more stable and interpretable now.
Removing S.patens (highest VIF): m2 <- update(m1, . ~ . - S.patens). VIFs collapse to acceptable levels for almost all predictors and but conserving similar summary() results.
summary(m2)
##
## Call:
## lm(formula = Banded ~ Avgmaxht + Avgdens + ht.thatch + Distichlis +
## S.alternifloraShort + S.alternifloraTall + Juncus + Bare +
## Other + Phragmites + Shrub + Tallsedge + Water, data = veg)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.7188 -6.5508 0.8511 6.8449 19.4148
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.97307 18.56317 0.106 0.91582
## Avgmaxht 0.37554 0.36389 1.032 0.30746
## Avgdens 0.03792 0.17628 0.215 0.83062
## ht.thatch -0.09645 0.63933 -0.151 0.88075
## Distichlis 0.10054 0.14618 0.688 0.49505
## S.alternifloraShort 0.08114 0.13559 0.598 0.55247
## S.alternifloraTall -0.14622 0.14606 -1.001 0.32201
## Juncus 0.45380 0.16801 2.701 0.00965 **
## Bare 0.10026 0.32785 0.306 0.76112
## Other 0.22432 0.53688 0.418 0.67802
## Phragmites 0.38886 0.66022 0.589 0.55875
## Shrub -1.65800 0.88250 -1.879 0.06662 .
## Tallsedge -0.11198 0.84680 -0.132 0.89537
## Water 0.14307 0.31120 0.460 0.64787
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.75 on 46 degrees of freedom
## Multiple R-squared: 0.3443, Adjusted R-squared: 0.159
## F-statistic: 1.858 on 13 and 46 DF, p-value: 0.06187
As you can see “Juncus” typically shows a significant positive effect.
After removing S.patens (and optionally Avgmaxht), coefficients are more stable and interpretable…. here we eliminate “Avgmaxht”,
m3 <- update(m2, . ~ . - Avgmaxht)
car::vif(m3)
## Avgdens ht.thatch Distichlis S.alternifloraShort
## 3.164151 1.482111 1.479446 2.865455
## S.alternifloraTall Juncus Bare Other
## 2.823777 1.443508 1.461686 1.281845
## Phragmites Shrub Tallsedge Water
## 1.377497 1.314531 1.305986 1.201262
summary(m3)
##
## Call:
## lm(formula = Banded ~ Avgdens + ht.thatch + Distichlis + S.alternifloraShort +
## S.alternifloraTall + Juncus + Bare + Other + Phragmites +
## Shrub + Tallsedge + Water, data = veg)
##
## Residuals:
## Min 1Q Median 3Q Max
## -19.6720 -6.4526 0.2663 7.1947 19.3166
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 16.12555 12.52004 1.288 0.20406
## Avgdens 0.02168 0.17570 0.123 0.90230
## ht.thatch 0.11952 0.60453 0.198 0.84413
## Distichlis 0.09179 0.14604 0.629 0.53271
## S.alternifloraShort 0.05726 0.13369 0.428 0.67041
## S.alternifloraTall -0.05542 0.11667 -0.475 0.63695
## Juncus 0.52359 0.15391 3.402 0.00138 **
## Bare -0.01482 0.30851 -0.048 0.96188
## Other -0.03734 0.47356 -0.079 0.93748
## Phragmites 0.60909 0.62522 0.974 0.33494
## Shrub -1.63422 0.88281 -1.851 0.07044 .
## Tallsedge 0.44113 0.65606 0.672 0.50462
## Water 0.03152 0.29203 0.108 0.91450
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.75 on 47 degrees of freedom
## Multiple R-squared: 0.3292, Adjusted R-squared: 0.1579
## F-statistic: 1.922 on 12 and 47 DF, p-value: 0.05578
Understand which predictors matter after collinearity reduction. Our model now looks like m_final <- lm(Banded ~ Juncus + Shrub, data = veg) as suggested by the cleaned results and ecological rationale. Then let’s formulate it,
m_final <- lm(Banded ~ Juncus + Shrub, data = veg)
summary(m_final)
##
## Call:
## lm(formula = Banded ~ Juncus + Shrub, data = veg)
##
## Residuals:
## Min 1Q Median 3Q Max
## -19.022 -7.495 1.283 7.978 20.978
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 19.0220 1.5944 11.930 < 2e-16 ***
## Juncus 0.5354 0.1204 4.445 4.11e-05 ***
## Shrub -1.3237 0.7240 -1.828 0.0727 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.1 on 57 degrees of freedom
## Multiple R-squared: 0.2822, Adjusted R-squared: 0.2571
## F-statistic: 11.21 on 2 and 57 DF, p-value: 7.858e-05
car::vif(m_final)
## Juncus Shrub
## 1.002039 1.002039
“Juncus” typically shows a significant positive effect. “Shrub” often shows a negative effect (sometimes marginal).
Please start creatting stories about the results… for example here is ours interpretation,
More Juncus gerardii → more Seaside Sparrows banded; and for some reason somehow more shrub cover → fewer banded sparrows, but is not necessary true all the time (more detail is needed to start saying that). Perhaps we need to see if there is something at shrub making this birds going there sometimes. Finally, other vegetation variables contribute little once redundancy is removed.
xyplot(Banded ~ Juncus, data=veg, type=c("p","smooth"),
xlab="% Juncus cover", ylab="Banded sparrows")
xyplot(Banded ~ Shrub, data=veg, type=c("p","smooth"),
xlab="% Shrub cover", ylab="Banded sparrows")
To illustrate the concept of outliers in Y and order of the data, we switch to the sparrow morphology file. The “order” (row position) is often arbitrary (e.g., the order the field sheets were entered). Visualizing values vs. their row order can reveal accidental sorting or glitches. We’ll randomize the dataset first so the order isn’t structured.
Sparrows <- read.table("data/SparrowsElphick.txt", header = TRUE)
head(Sparrows)
## wingcrd flatwing tarsus head culmen nalospi wt bandstat initials Year Month
## 1 59.0 60.0 22.3 31.2 12.3 13.0 9.5 1 2 2002 9
## 2 54.0 55.0 20.3 28.3 10.8 7.8 12.2 1 2 2002 10
## 3 53.0 54.0 21.6 30.2 12.5 8.5 13.8 1 2 2002 10
## 4 55.0 56.0 19.7 30.4 12.1 8.3 13.8 1 8 2002 7
## 5 55.0 56.0 20.3 28.7 11.2 8.0 14.1 1 3 2002 10
## 6 53.5 54.5 20.8 30.6 12.8 8.6 14.8 1 7 2004 8
## Day Location SpeciesCode Sex Age
## 1 19 4 1 0 2
## 2 4 4 3 0 2
## 3 4 4 3 0 2
## 4 30 9 1 0 2
## 5 4 4 3 0 2
## 6 2 1 1 0 2
colnames(Sparrows)
## [1] "wingcrd" "flatwing" "tarsus" "head" "culmen"
## [6] "nalospi" "wt" "bandstat" "initials" "Year"
## [11] "Month" "Day" "Location" "SpeciesCode" "Sex"
## [16] "Age"
# Randomize row order (so "order" has meaning to demonstrate)
set.seed(4259) # for reproducibility in class
Sparrows <- Sparrows[sample(1:nrow(Sparrows)), ]
Sparrows$ord <- order(Sparrows$wingcrd)
head(Sparrows[, c("wingcrd","ord")])
## wingcrd ord
## 874 57.0 1277
## 638 57.0 4
## 819 58.0 568
## 36 51.5 808
## 571 58.0 308
## 612 57.0 603
What order() does (in plain words)??
p_ord <- ggplot(Sparrows,
aes(x = wingcrd, y = ord,
text = paste0("Wing length: ", wingcrd, " mm\nOrder: ", ord))) +
geom_point() +
labs(
title = "Wing length vs. order of the data (SparrowsElphick)",
subtitle = "We use a second dataset to illustrate the 'order' concept, after randomizing row order",
x = "Wing length (mm)",
y = "Order of the data"
) +
theme_minimal(base_size = 13) +
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank())
ggplotly(p_ord, tooltip = c("text", "x", "y"))
Do you see any gaps or clusters (e.g., many similar wing lengths) when plotted against the order index? If the file had been pre-sorted (e.g., by wing length), you’d see a nearly perfect line.
Replace wingcrd with another numeric variable (e.g., tarsus, head, wt). Update both order() and the aes() mapping.Try rank() instead of order(), and see what happens.
Compute:
Sparrows$rnk <- rank(Sparrows$wingcrd, ties.method = "average")
Plot y = rnk and compare with ord. Which one is more informative when there are many ties?
If Sex is recorded (sometimes as character “4/5” or factor), color points:
0 → sex not recorded or undetermined 4 → male 5 → female
p_ord2 <- ggplot(Sparrows,
aes(x = wingcrd, y = ord, color = factor(Sex))) +
geom_point(alpha = 0.8) +
labs(color = "Sex")
ggplotly(p_ord2)
Try to do so for the other morphology measurements Do you see a pattern across them? Explained below:
We could also plot all of then into one graph using cbind and other Technics, as follows:
# combine variables into a matrix
Z <- cbind(Sparrows$wingcrd, Sparrows$tarsus, Sparrows$head,
Sparrows$culmen, Sparrows$nalospi, Sparrows$wt)
# add column names
colnames(Z) <- c("wing length", "tarsus length", "head length",
"culmen length", "nalospi to bill tip", "weight")
dotplot(as.matrix(Z), groups = FALSE,
strip = strip.custom(bg = 'white',
par.strip.text = list(cex = 0.8)),
scales = list(x = list(relation = "free"),
y = list(relation = "free"),
draw = FALSE),
col = 1, cex = 0.5, pch = 16,
xlab = "Value of the variable",
ylab = "Order of the data from text file") # Buena suerte, pura vida
Try to guess, what does cbind do? is possible to use subplot() and ggplotly() to create a multiple interactive object? if yes please try to do so…
Sparrows morphometrics
Using the Sparrows dataset, apply the same data-exploration workflow you used for the vegetation data. Choose one numeric variable as your response (for example, wing length or weight) and a few continuous and categorical predictors. Examine the distributions and correlations, fit an initial linear model, and check for multicollinearity using car::vif(). Take into account that sex should be categorical no numerical… Because neither is more than other.
Then iteratively simplify the model with update() until all predictors have acceptable VIF values and meaningful coefficients. Verify that residuals are roughly normal and homoscedastic, and visualize key relationships with ggplot() or ggplotly(). Your goal is to produce a clean, interpretable final model and one figure that best represents its fitted relationship.