This material corresponds to Week 6 of the labs and demos for Bio 4259F (Undergraduate) and Bio 9915A (Graduate) at Western University.

Example Workflow at an exploratory approach

The datasets used in this exercise — VegSamplesV1.txt and SparrowsElphick.txt — originate from the examples provided in the article “A protocol for data exploration to avoid common statistical problems” by Zuur, Ieno, and Elphick (2010). They were originally collected to illustrate real ecological scenarios involving vegetation structure and bird morphology. The workflow applied here is an adaptation of that paper’s protocol, designed to guide students step-by-step through a technical data-exploration process: checking data quality, identifying outliers, diagnosing collinearity, simplifying models, and visualizing relationships to ensure that final statistical analyses are robust and interpretable.

Load libraries

Note: it’s a good idea to load libraries in a separate code chunk, so you don’t have to mess with the setup chunk again.

library(here)
library(lattice)
library(car)
library(ggplot2)
library(readr)
library(dplyr)
library(GGally)
library(plotly)

Read VegSamplesV1.txt into veg; quickly peek at the data with head() and str().

veg <- read.table("data/VegSamplesV1.txt", header = TRUE)
head(veg)
##   Year           Site UniversalPlotName Banded PtCountsum Avgmaxht Avgdens
## 1 2002 EastRiverMarsh         02PLOT012     31         53    33.19   23.30
## 2 2002 EastRiverMarsh         02PLOT013     40         40    48.22   27.63
## 3 2002 EastRiverMarsh         02PLOT014     15         15    33.81   26.40
## 4 2002 EastRiverMarsh         02PLOT015     27         33    28.34   45.63
## 5 2002 EastRiverMarsh         02PLOT016     20         17    28.75   38.88
## 6 2002 EastRiverMarsh         02PLOT017     17         30    24.19   53.60
##   ht.thatch S.patens Distichlis S.alternifloraShort S.alternifloraTall Juncus
## 1      4.67    21.31      41.47               32.22               2.78   0.00
## 2      6.56    11.67      52.78               23.61               9.72   0.00
## 3      3.33    29.09      17.80               35.89               0.00   2.22
## 4      7.33    52.22      19.76                5.56               3.33   7.78
## 5      3.44    47.84      12.78               15.49              10.00   0.00
## 6      3.44    60.42       0.01               26.44               0.00   0.00
##    Bare Other Phragmites Shrub Tallsedge Water
## 1  1.67  0.56          0  0.00         0  0.00
## 2  0.00  0.00          0  0.00         0  2.22
## 3  8.33  0.00          0  6.67         0  0.00
## 4  2.22  8.89          0  0.00         0  0.25
## 5 10.14  3.75          0  0.00         0  0.00
## 6  2.01  0.00          0  0.00         0 11.11

What to look for.

  1. Do variable names match what we expect?

  2. Are numeric vs. factor types sensible?

  3. Any obvious formatting issues?

Interpretation. If types and names look correct, we can proceed to explore content (values and distributions).

str(veg)
## 'data.frame':    60 obs. of  19 variables:
##  $ Year               : int  2002 2002 2002 2002 2002 2002 2002 2002 2002 2002 ...
##  $ Site               : chr  "EastRiverMarsh" "EastRiverMarsh" "EastRiverMarsh" "EastRiverMarsh" ...
##  $ UniversalPlotName  : chr  "02PLOT012" "02PLOT013" "02PLOT014" "02PLOT015" ...
##  $ Banded             : int  31 40 15 27 20 17 17 29 13 37 ...
##  $ PtCountsum         : int  53 40 15 33 17 30 49 39 22 23 ...
##  $ Avgmaxht           : num  33.2 48.2 33.8 28.3 28.8 ...
##  $ Avgdens            : num  23.3 27.6 26.4 45.6 38.9 ...
##  $ ht.thatch          : num  4.67 6.56 3.33 7.33 3.44 3.44 2.11 5.89 5.22 0.89 ...
##  $ S.patens           : num  21.3 11.7 29.1 52.2 47.8 ...
##  $ Distichlis         : num  41.5 52.8 17.8 19.8 12.8 ...
##  $ S.alternifloraShort: num  32.22 23.61 35.89 5.56 15.49 ...
##  $ S.alternifloraTall : num  2.78 9.72 0 3.33 10 ...
##  $ Juncus             : num  0 0 2.22 7.78 0 0 0.56 5.56 0 0 ...
##  $ Bare               : num  1.67 0 8.33 2.22 10.14 ...
##  $ Other              : num  0.56 0 0 8.89 3.75 0 0.56 0 1.91 0.26 ...
##  $ Phragmites         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Shrub              : num  0 0 6.67 0 0 0 0 0 0 0 ...
##  $ Tallsedge          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Water              : num  0 2.22 0 0.25 0 ...
#glimpse(veg)

First pass on the response: range & distribution

Goal… Understand the basic scale and spread of the response variable Banded.

#summary(veg)
#colSums(is.na(veg))
range(veg$Banded)          # response variable
## [1]  0 51
#range(veg$Avgmaxht)        # example covariate

What to look for.

Outliers at Y (points far from the bulk).

par(mfrow=c(1,2))
boxplot(veg$Banded, ylab="Banded sparrows")
dotchart(veg$Banded, xlab="Banded sparrows")

There a nicer way to see it. It uses ggplotly(), which converts each ggplot into an interactive object. subplot() from plotly arranges both side-by-side (like your mfrow = c(1,2)). Hovering over points will show each observation’s value (and its position), so you can see and identify outliers interactively.

p_box <- ggplot(veg, aes(y = Banded)) +
  geom_boxplot(outlier.color = "red", outlier.shape = 16, fill = "lightgray") +
  labs(title = "Boxplot of Banded Sparrows",
       y = "Number of banded sparrows") +
  theme_minimal(base_size = 13)

# Interactive dotchart (using rank order)
p_dot <- ggplot(veg, aes(x = seq_along(Banded), y = Banded)) +
  geom_point(color = "darkblue", size = 2, alpha = 0.7) +
  labs(title = "Dotchart of Banded Sparrows",
       x = "Observation index", y = "Number of banded sparrows") +
  theme_minimal(base_size = 13)

# Combine interactively with plotly
subplot(ggplotly(p_box), ggplotly(p_dot), nrows = 1, margin = 0.05, titleX = TRUE, titleY = TRUE)

Also take a brief look at Skewness (long tails) or multimodality.Values that seem impossible for your study design.

hist(veg$Banded, breaks=20, col="white", border="black")

Interpretation. If outliers exist, note them; don’t delete them automatically — they may be real! Skewness suggests transformations or non-Gaussian models later.

Covariates at a glance: correlation & univariate shapes

Goal… Screen all vegetation covariates for pairwise correlation and individual shapes.

Then first see which variable we have…

colnames(veg)
##  [1] "Year"                "Site"                "UniversalPlotName"  
##  [4] "Banded"              "PtCountsum"          "Avgmaxht"           
##  [7] "Avgdens"             "ht.thatch"           "S.patens"           
## [10] "Distichlis"          "S.alternifloraShort" "S.alternifloraTall" 
## [13] "Juncus"              "Bare"                "Other"              
## [16] "Phragmites"          "Shrub"               "Tallsedge"          
## [19] "Water"

Define numvars (the numeric vegetation predictors).We recommend using GGally::ggpairs() to see histograms/densities on the diagonal, scatterplots below, and correlations above.

numvars <- c("Avgmaxht","Avgdens","ht.thatch","S.patens","Distichlis",
             "S.alternifloraShort","S.alternifloraTall","Juncus","Bare",
             "Other","Phragmites","Shrub","Tallsedge","Water")



GGally::ggpairs(
  veg[, numvars],
  progress = FALSE,
  title = "Correlation and distribution of vegetation variables",
  upper = list(continuous = wrap("cor", size = 3.5)),
  lower = list(continuous = wrap("points", alpha = 0.3, size = 0.8)),
  diag  = list(continuous = wrap("densityDiag"))
)

What to look for: 1. Strong positive/negative correlations (|r| close to 1).

  1. Nonlinear clouds or unusual point patterns.

About (1), the interpretation should be high pairwise correlations hint at multicollinearity, which we’ll test formally with VIF next… but gets you an approach.

#### I you dont wanna have a lot of time for the correlation try

#pairs(veg[, numvars])
#cor(veg[, numvars], use="pairwise")

Formal collinearity check: VIF

Quantify multicollinearity among predictors.Tere is too may ways to do the same. So if you just want to see directly the vif you can go as:

vif(lm(Banded ~ ., data = veg[, c("Banded", numvars)]))
##            Avgmaxht             Avgdens           ht.thatch            S.patens 
##            6.120018            3.206401            1.671224          159.350658 
##          Distichlis S.alternifloraShort  S.alternifloraTall              Juncus 
##           53.754540          121.463782          159.382860           44.995377 
##                Bare               Other          Phragmites               Shrub 
##           12.058665            5.817015            3.749027            2.781804 
##           Tallsedge               Water 
##            4.409398           17.067777

It is just a way to code it. Try to see what you do there. your are just making the modelling and add it directly to the finction vif() from car package. What to look for,

VIF > 10 is a red flag; > 5 merits attention (rules of thumb).

In this dataset, S.patens and S.alternifloraTall show an extremely high VIF, signaling redundancy with other vegetation covers…. Take a brief look…

xyplot(Banded ~ S.patens, data=veg, type=c("p","smooth"),
       xlab="X variable", ylab="Banded sparrows")

Interpretation.Severe multicollinearity inflates standard errors and destabilizes coefficients.We will remove the highest-VIF variable(s) and re-fit. So let’s start with a normal workflow.

Full model exploratory fit + residual diagnostics

Our goal at this step is to fit the initial model and check assumptions.

Fit m1 with all 14 predictors:

m1 <- lm(Banded ~ Avgmaxht + Avgdens + ht.thatch + S.patens + Distichlis +
           S.alternifloraShort + S.alternifloraTall + Juncus + Bare + Other +
           Phragmites + Shrub + Tallsedge + Water, data = veg)

Plot residuals vs fitted (homoscedasticity) and QQ plot of residuals (normality)

plot(fitted(m1), resid(m1), pch=16,
     xlab="Fitted values", ylab="Residuals"); abline(h=0,lty=2)

qqnorm(resid(m1)); qqline(resid(m1))

If assumptions look poor, as you’d already learned we may transform Y or use alternative error structures later. But first, we’ll fix collinearity (it’s the major problem here).

We use vif function as follows:

car::vif(m1)
##            Avgmaxht             Avgdens           ht.thatch            S.patens 
##            6.120018            3.206401            1.671224          159.350658 
##          Distichlis S.alternifloraShort  S.alternifloraTall              Juncus 
##           53.754540          121.463782          159.382860           44.995377 
##                Bare               Other          Phragmites               Shrub 
##           12.058665            5.817015            3.749027            2.781804 
##           Tallsedge               Water 
##            4.409398           17.067777

Reduce multicollinearity: VIF-guided pruning

Drop the strongest source of collinearity and reassess.In our case we just selected ‘S.patens’.

m2 <- update(m1, . ~ . - S.patens)
car::vif(m2)
##            Avgmaxht             Avgdens           ht.thatch          Distichlis 
##            6.093919            3.189566            1.659956            1.484444 
## S.alternifloraShort  S.alternifloraTall              Juncus                Bare 
##            2.951472            4.431942            1.722569            1.652923 
##               Other          Phragmites               Shrub           Tallsedge 
##            1.649813            1.538195            1.315428            2.178762 
##               Water 
##            1.366047

Try also to use summary(m1)

Take a meticulous look at the structure, specially “- S.patens”, what it means? Now we have a new model called m2, does it? So all remaining VIFs ideally < 5 (certainly < 10). Coefficients are more stable and interpretable now.

Interpret a second modelling reducing variables

Removing S.patens (highest VIF): m2 <- update(m1, . ~ . - S.patens). VIFs collapse to acceptable levels for almost all predictors and but conserving similar summary() results.

summary(m2)
## 
## Call:
## lm(formula = Banded ~ Avgmaxht + Avgdens + ht.thatch + Distichlis + 
##     S.alternifloraShort + S.alternifloraTall + Juncus + Bare + 
##     Other + Phragmites + Shrub + Tallsedge + Water, data = veg)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -18.7188  -6.5508   0.8511   6.8449  19.4148 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)   
## (Intercept)          1.97307   18.56317   0.106  0.91582   
## Avgmaxht             0.37554    0.36389   1.032  0.30746   
## Avgdens              0.03792    0.17628   0.215  0.83062   
## ht.thatch           -0.09645    0.63933  -0.151  0.88075   
## Distichlis           0.10054    0.14618   0.688  0.49505   
## S.alternifloraShort  0.08114    0.13559   0.598  0.55247   
## S.alternifloraTall  -0.14622    0.14606  -1.001  0.32201   
## Juncus               0.45380    0.16801   2.701  0.00965 **
## Bare                 0.10026    0.32785   0.306  0.76112   
## Other                0.22432    0.53688   0.418  0.67802   
## Phragmites           0.38886    0.66022   0.589  0.55875   
## Shrub               -1.65800    0.88250  -1.879  0.06662 . 
## Tallsedge           -0.11198    0.84680  -0.132  0.89537   
## Water                0.14307    0.31120   0.460  0.64787   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.75 on 46 degrees of freedom
## Multiple R-squared:  0.3443, Adjusted R-squared:  0.159 
## F-statistic: 1.858 on 13 and 46 DF,  p-value: 0.06187

As you can see “Juncus” typically shows a significant positive effect.

After removing S.patens (and optionally Avgmaxht), coefficients are more stable and interpretable…. here we eliminate “Avgmaxht”,

m3 <- update(m2, . ~ . - Avgmaxht)
car::vif(m3)
##             Avgdens           ht.thatch          Distichlis S.alternifloraShort 
##            3.164151            1.482111            1.479446            2.865455 
##  S.alternifloraTall              Juncus                Bare               Other 
##            2.823777            1.443508            1.461686            1.281845 
##          Phragmites               Shrub           Tallsedge               Water 
##            1.377497            1.314531            1.305986            1.201262
summary(m3)
## 
## Call:
## lm(formula = Banded ~ Avgdens + ht.thatch + Distichlis + S.alternifloraShort + 
##     S.alternifloraTall + Juncus + Bare + Other + Phragmites + 
##     Shrub + Tallsedge + Water, data = veg)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -19.6720  -6.4526   0.2663   7.1947  19.3166 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)   
## (Intercept)         16.12555   12.52004   1.288  0.20406   
## Avgdens              0.02168    0.17570   0.123  0.90230   
## ht.thatch            0.11952    0.60453   0.198  0.84413   
## Distichlis           0.09179    0.14604   0.629  0.53271   
## S.alternifloraShort  0.05726    0.13369   0.428  0.67041   
## S.alternifloraTall  -0.05542    0.11667  -0.475  0.63695   
## Juncus               0.52359    0.15391   3.402  0.00138 **
## Bare                -0.01482    0.30851  -0.048  0.96188   
## Other               -0.03734    0.47356  -0.079  0.93748   
## Phragmites           0.60909    0.62522   0.974  0.33494   
## Shrub               -1.63422    0.88281  -1.851  0.07044 . 
## Tallsedge            0.44113    0.65606   0.672  0.50462   
## Water                0.03152    0.29203   0.108  0.91450   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.75 on 47 degrees of freedom
## Multiple R-squared:  0.3292, Adjusted R-squared:  0.1579 
## F-statistic: 1.922 on 12 and 47 DF,  p-value: 0.05578

Interpret the cleaned model

Understand which predictors matter after collinearity reduction. Our model now looks like m_final <- lm(Banded ~ Juncus + Shrub, data = veg) as suggested by the cleaned results and ecological rationale. Then let’s formulate it,

m_final <- lm(Banded ~ Juncus + Shrub, data = veg)
summary(m_final)
## 
## Call:
## lm(formula = Banded ~ Juncus + Shrub, data = veg)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -19.022  -7.495   1.283   7.978  20.978 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  19.0220     1.5944  11.930  < 2e-16 ***
## Juncus        0.5354     0.1204   4.445 4.11e-05 ***
## Shrub        -1.3237     0.7240  -1.828   0.0727 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.1 on 57 degrees of freedom
## Multiple R-squared:  0.2822, Adjusted R-squared:  0.2571 
## F-statistic: 11.21 on 2 and 57 DF,  p-value: 7.858e-05
car::vif(m_final)
##   Juncus    Shrub 
## 1.002039 1.002039

“Juncus” typically shows a significant positive effect. “Shrub” often shows a negative effect (sometimes marginal).

Please start creatting stories about the results… for example here is ours interpretation,

More Juncus gerardii → more Seaside Sparrows banded; and for some reason somehow more shrub cover → fewer banded sparrows, but is not necessary true all the time (more detail is needed to start saying that). Perhaps we need to see if there is something at shrub making this birds going there sometimes. Finally, other vegetation variables contribute little once redundancy is removed.

xyplot(Banded ~ Juncus, data=veg, type=c("p","smooth"),
       xlab="% Juncus cover", ylab="Banded sparrows")

xyplot(Banded ~ Shrub, data=veg, type=c("p","smooth"),
       xlab="% Shrub cover", ylab="Banded sparrows")

Bonus: “Order of the data” (using a second dataset)

To illustrate the concept of outliers in Y and order of the data, we switch to the sparrow morphology file. The “order” (row position) is often arbitrary (e.g., the order the field sheets were entered). Visualizing values vs. their row order can reveal accidental sorting or glitches. We’ll randomize the dataset first so the order isn’t structured.

Sparrows <- read.table("data/SparrowsElphick.txt", header = TRUE)
head(Sparrows)
##   wingcrd flatwing tarsus head culmen nalospi   wt bandstat initials Year Month
## 1    59.0     60.0   22.3 31.2   12.3    13.0  9.5        1        2 2002     9
## 2    54.0     55.0   20.3 28.3   10.8     7.8 12.2        1        2 2002    10
## 3    53.0     54.0   21.6 30.2   12.5     8.5 13.8        1        2 2002    10
## 4    55.0     56.0   19.7 30.4   12.1     8.3 13.8        1        8 2002     7
## 5    55.0     56.0   20.3 28.7   11.2     8.0 14.1        1        3 2002    10
## 6    53.5     54.5   20.8 30.6   12.8     8.6 14.8        1        7 2004     8
##   Day Location SpeciesCode Sex Age
## 1  19        4           1   0   2
## 2   4        4           3   0   2
## 3   4        4           3   0   2
## 4  30        9           1   0   2
## 5   4        4           3   0   2
## 6   2        1           1   0   2
colnames(Sparrows)
##  [1] "wingcrd"     "flatwing"    "tarsus"      "head"        "culmen"     
##  [6] "nalospi"     "wt"          "bandstat"    "initials"    "Year"       
## [11] "Month"       "Day"         "Location"    "SpeciesCode" "Sex"        
## [16] "Age"
# Randomize row order (so "order" has meaning to demonstrate)
set.seed(4259)  # for reproducibility in class
Sparrows <- Sparrows[sample(1:nrow(Sparrows)), ]
Sparrows$ord <- order(Sparrows$wingcrd)
head(Sparrows[, c("wingcrd","ord")])
##     wingcrd  ord
## 874    57.0 1277
## 638    57.0    4
## 819    58.0  568
## 36     51.5  808
## 571    58.0  308
## 612    57.0  603

What order() does (in plain words)??

Interactive “dotchart” with ggplotly

p_ord <- ggplot(Sparrows,
                aes(x = wingcrd, y = ord,
                    text = paste0("Wing length: ", wingcrd, " mm\nOrder: ", ord))) +
  geom_point() +
  labs(
    title = "Wing length vs. order of the data (SparrowsElphick)",
    subtitle = "We use a second dataset to illustrate the 'order' concept, after randomizing row order",
    x = "Wing length (mm)",
    y = "Order of the data"
  ) +
  theme_minimal(base_size = 13) +
  theme(panel.grid.major = element_blank(),
        panel.grid.minor = element_blank())

ggplotly(p_ord, tooltip = c("text", "x", "y"))

Do you see any gaps or clusters (e.g., many similar wing lengths) when plotted against the order index? If the file had been pre-sorted (e.g., by wing length), you’d see a nearly perfect line.

Student task (edit & explore)

Replace wingcrd with another numeric variable (e.g., tarsus, head, wt). Update both order() and the aes() mapping.Try rank() instead of order(), and see what happens.

Compute:

Sparrows$rnk <- rank(Sparrows$wingcrd, ties.method = "average")

Plot y = rnk and compare with ord. Which one is more informative when there are many ties?

Color by group.

If Sex is recorded (sometimes as character “4/5” or factor), color points:

0 → sex not recorded or undetermined 4 → male 5 → female

p_ord2 <- ggplot(Sparrows,
                 aes(x = wingcrd, y = ord, color = factor(Sex))) +
  geom_point(alpha = 0.8) +
  labs(color = "Sex")
ggplotly(p_ord2)

Try to do so for the other morphology measurements Do you see a pattern across them? Explained below:

We could also plot all of then into one graph using cbind and other Technics, as follows:

# combine variables into a matrix
Z <- cbind(Sparrows$wingcrd, Sparrows$tarsus,  Sparrows$head,
           Sparrows$culmen,  Sparrows$nalospi, Sparrows$wt)

# add column names
colnames(Z) <- c("wing length", "tarsus length", "head length",
                 "culmen length", "nalospi to bill tip", "weight")

dotplot(as.matrix(Z), groups = FALSE,
        strip = strip.custom(bg = 'white',
        par.strip.text = list(cex = 0.8)),
        scales = list(x = list(relation = "free"),
                      y = list(relation = "free"),
                      draw = FALSE),
        col = 1, cex  = 0.5, pch = 16,
        xlab = "Value of the variable",
        ylab = "Order of the data from text file") # Buena suerte, pura vida

Try to guess, what does cbind do? is possible to use subplot() and ggplotly() to create a multiple interactive object? if yes please try to do so…

Independent exploration task — Keep going with new dataset by yourself

Sparrows morphometrics

Using the Sparrows dataset, apply the same data-exploration workflow you used for the vegetation data. Choose one numeric variable as your response (for example, wing length or weight) and a few continuous and categorical predictors. Examine the distributions and correlations, fit an initial linear model, and check for multicollinearity using car::vif(). Take into account that sex should be categorical no numerical… Because neither is more than other.

Then iteratively simplify the model with update() until all predictors have acceptable VIF values and meaningful coefficients. Verify that residuals are roughly normal and homoscedastic, and visualize key relationships with ggplot() or ggplotly(). Your goal is to produce a clean, interpretable final model and one figure that best represents its fitted relationship.