BGEN516 - Correlation Lab

Author
Affiliation

University of Montana

Published

October 28, 2025

Lab Overview

In this lab, I’ll utilize correlation analysis to explore variable relationships in the SAFI dataset. We have worked with this dataset in previous modules (e.g., in the Week 3 data wrangling tutorials) so we are relatively familiar with it at this point:

SAFI (Studying African Farmer-Led Irrigation) is a study looking at farming and irrigation methods in Tanzania and Mozambique. The survey data was collected through interviews conducted between November 2016 and June 2017.

Prep workspace

To start, I’ll load required packages and import the data. Note that I’ll demonstrate the use of a new (to us) package, corrplot. I’ll also examine the data to better understand its structure and contents.

library(here)
library(tidyverse)
library(skimr)
library(corrr)
library(corrplot)
library(gt)
library(cowplot)

safi_data <- read_csv(here("data", "SAFI_clean.csv"))

glimpse(safi_data)
Rows: 131
Columns: 14
$ key_ID               <dbl> 1, 1, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15…
$ village              <chr> "God", "God", "God", "God", "God", "God", "God", …
$ interview_date       <dttm> 2016-11-17, 2016-11-17, 2016-11-17, 2016-11-17, …
$ no_membrs            <dbl> 3, 7, 10, 7, 7, 3, 6, 12, 8, 12, 6, 7, 6, 10, 5, …
$ years_liv            <dbl> 4, 9, 15, 6, 40, 3, 38, 70, 6, 23, 20, 20, 8, 20,…
$ respondent_wall_type <chr> "muddaub", "muddaub", "burntbricks", "burntbricks…
$ rooms                <dbl> 1, 1, 1, 1, 1, 1, 1, 3, 1, 5, 1, 3, 1, 3, 2, 1, 1…
$ memb_assoc           <chr> "NULL", "yes", "NULL", "NULL", "NULL", "NULL", "n…
$ affect_conflicts     <chr> "NULL", "once", "NULL", "NULL", "NULL", "NULL", "…
$ liv_count            <dbl> 1, 3, 1, 2, 4, 1, 1, 2, 3, 2, 2, 2, 3, 3, 3, 4, 1…
$ items_owned          <chr> "bicycle;television;solar_panel;table", "cow_cart…
$ no_meals             <dbl> 2, 2, 2, 2, 2, 2, 3, 2, 3, 3, 2, 3, 2, 3, 2, 3, 2…
$ months_lack_food     <chr> "Jan", "Jan;Sept;Oct;Nov;Dec", "Jan;Feb;Mar;Oct;N…
$ instanceID           <chr> "uuid:ec241f2c-0609-46ed-b5e8-fe575f6cefef", "uui…

This output suggest that some of columns in the dataset contain missing values. Based on our prior work with this dataset in Week 3 tutorials, we know that NULL is used to represent missing values. However, we didn’t specify this when importing the data, so R treats NULL as a string value in character columns rather than recognizing it as a missing value indicator. We can confirm this by inspecting the unique values in one of the variables. For example:

unique(safi_data$affect_conflicts)
[1] "NULL"       "once"       "never"      "more_once"  "frequently"

These values represent all distinct entries in the affect_conflicts column, allowing us to see that NULL appears as a regular string value rather than being treated as NA.

We can also confirm that R isn’t treating NULL as NA by counting missing values across all columns:

colSums(is.na(safi_data))
              key_ID              village       interview_date 
                   0                    0                    0 
           no_membrs            years_liv respondent_wall_type 
                   0                    0                    0 
               rooms           memb_assoc     affect_conflicts 
                   0                    0                    0 
           liv_count          items_owned             no_meals 
                   0                    0                    0 
    months_lack_food           instanceID 
                   0                    0 

All columns returned zero missing values, indicating that R isn’t recognizing any entries as NA. Recall that we could have avoided this issue during import by specifying na = "NULL" in the read_csv() function, which tells R to treat NULL as a missing value rather than a string.

Filter the data

Next, I’ll clean and transform the data. To be safe, I’ll explicitly tell R to filter out NA values from the numeric columns that represent key variables for this lab.

safi_filtered <- safi_data %>%
  # Remove rows with missing values
  filter(
    !is.na(no_membrs) & no_membrs != "NULL",
    !is.na(years_liv) & years_liv != "NULL",
    !is.na(rooms) & rooms != "NULL",
    !is.na(liv_count) & liv_count != "NULL",
    !is.na(no_meals) & no_meals != "NULL"
    ) %>%
  # Remove outliers in household size
  filter(no_membrs > 1, no_membrs < 15) %>%
  # Create new variables
  mutate(
    livestock_per_member = liv_count/no_membrs,
    meals_per_member = no_meals/no_membrs
    ) %>%
  select(
    no_membrs, 
    liv_count, 
    no_meals, 
    years_liv, 
    livestock_per_member, 
    meals_per_member
    )

glimpse(safi_filtered)
Rows: 126
Columns: 6
$ no_membrs            <dbl> 3, 7, 10, 7, 7, 3, 6, 12, 8, 12, 6, 7, 6, 10, 5, …
$ liv_count            <dbl> 1, 3, 1, 2, 4, 1, 1, 2, 3, 2, 2, 2, 3, 3, 3, 4, 1…
$ no_meals             <dbl> 2, 2, 2, 2, 2, 2, 3, 2, 3, 3, 2, 3, 2, 3, 2, 3, 2…
$ years_liv            <dbl> 4, 9, 15, 6, 40, 3, 38, 70, 6, 23, 20, 20, 8, 20,…
$ livestock_per_member <dbl> 0.3333333, 0.4285714, 0.1000000, 0.2857143, 0.571…
$ meals_per_member     <dbl> 0.6666667, 0.2857143, 0.2000000, 0.2857143, 0.285…

Looks good!

Compute correlations

Next, I’ll create a correlation matrix using correlate from the corrr package. The following functions are also provided by corrr:

  • shave() - “Convert the upper or lower triangle of a correlation data frame (cor_df) to missing values.”

  • fashion() - “For the purpose of printing, convert a correlation data frame into a noquote matrix with the correlations cleanly formatted (leading zeros removed; spaced for signs) and the diagonal (or any NA) left blank.”

  • rplot() - “Plot a correlation data frame using ggplot2.”

cor_matrix <- safi_filtered %>%
  # Shorten variable name for clean printing
  rename(liv_per_membr = "livestock_per_member") %>%
  # Create correlation matrix
  correlate() %>%
  # Chave upper triangle
  shave() %>%
  # Remove empty column
  select(-meals_per_member)

# Numeric representation
fashion(cor_matrix)
              term no_membrs liv_count no_meals years_liv liv_per_membr
1        no_membrs                                                     
2        liv_count       .21                                           
3         no_meals       .09       .10                                 
4        years_liv       .13       .32      .04                        
5    liv_per_membr      -.50       .60     -.02       .16              
6 meals_per_member      -.79      -.19      .31      -.15           .49
# Visual representation
rplot(cor_matrix)

Most of the correlations are relatively weak, making them hard to see in the plot. The strongest correlation is between no_membrs and meals_per_member. This is the strongest correlation because meals_per_member is directly calculated from no_meals divided by no_membrs, creating a mathematical relationship between the two variables. Similarly, liv_per_member is calculated from liv_count divided by no_members, so those variables are directly related!

corrplot Package

There are of course options other than those provided by corrr to visualize a correlation matrix. In the code below, I use the base R cor() function with corrplot from the corrplot package.

cor_matrix <- safi_filtered %>%
  select(
    no_membrs, 
    liv_count, 
    no_meals, 
    years_liv, 
    livestock_per_member, 
    meals_per_member
    ) %>%
  cor() %>%
  corrplot(
    method = "number",
    type = "lower",
    diag = FALSE,
    tl.col = "black",
    tl.srt = 45
    )

Identify two correlations

The assignment instructions state that that two pairs of variables with strong correlations should be selected for further analysis. The two strongest correlations are:

  • meals_per_member and no_membrs: \(r =\) -0.79

  • livestock_per_member and liv_count: \(r =\) 0.6

These pairs have strong correlations because each variable in the pair was created using the other, so their values are closely linked by definition. Computing a derived variable and examining its correlation with the original variable helps us see the relationship between variables and specifically what strong association looks like. However, including both variables in the same statistical model would create severe multicollinearity, making it difficult to separate their individual effects and interpret results accurately.

With that in mind, I’ll select two other correlations to also include in the remainder of my analysis:

  • no_membrs and liv_count: \(r =\) 0.21

  • liv_count and years_liv: \(r =\) 0.32

Next, I’ll calculate both Pearson and Spearman correlations for these selected pairs.

pearson_corr <- safi_filtered %>%
  summarise(
  cor_meals_no_membrs = cor(meals_per_member, no_membrs),
  cor_livestock_liv_count = cor(livestock_per_member, liv_count),
  cor_no_membrs_liv_count = cor(no_membrs, liv_count),
  cor_liv_count_years_liv = cor(liv_count, years_liv)
  ) %>%
  pivot_longer(
    cols = everything(),
    names_to = "Pairs",
    values_to = "r"
    ) %>%
  mutate(r = round(r, digits = 2))
  
pearson_corr %>% gt()
Pairs r
cor_meals_no_membrs -0.79
cor_livestock_liv_count 0.60
cor_no_membrs_liv_count 0.21
cor_liv_count_years_liv 0.32
spearman_corr <- safi_filtered %>%
  summarise(
  cor_meals_no_membrs = cor(meals_per_member, no_membrs, method = "spearman"),
  cor_livestock_liv_count = cor(livestock_per_member, liv_count, method = "spearman"),
  cor_no_membrs_liv_count = cor(no_membrs, liv_count, method = "spearman"),
  cor_liv_count_years_liv = cor(liv_count, years_liv, method = "spearman")
  ) %>%
  pivot_longer(
    cols = everything(),
    names_to = "Pairs",
    values_to = "rho"
    ) %>%
  mutate(rho = round(rho, digits = 2))

spearman_corr %>% gt() %>% cols_label(rho = "{{:rho:}}")
Pairs ρ
cor_meals_no_membrs -0.88
cor_livestock_liv_count 0.72
cor_no_membrs_liv_count 0.23
cor_liv_count_years_liv 0.28

So, which method should we actually use with this data?

The choice of method depends on the data: Pearson correlation requires assumptions such as linearity, normality, and homoscedasticity, while Spearman correlation requires monotonicity and appropriate measurement level. The method should thus be chosen based on which set of assumptions the data satisfy.

Check assumptions

Let’s start with linearity. The key question: do the data exhibit a linear relationship?

Note that for plots C and D, I used geom_jitter rather than geom_point because visualized variables are discrete, and overlapping points make it difficult to see the true distribution of the data.

Most of the relationships between variables appear roughly linear. Let’s move on to the next assumption, normality.

The key question: do the data follow a normal distribution?

It’s clear that both liv_count and no_meals do not follow a normal distribution. The points for both variables deviate noticeably from the line, reflecting departures due to their discrete and skewed nature.

livestock_per_member, meals_per_member, and years_liv show moderate deviations, particularly in the tails, indicating some skewness or outliers.

The no_membrs variable closely follows the line, suggesting it is approximately normally distributed. Unlike the other variables, the points generally remain close to the line. However, the observed pattern results directly from removing outliers for this variable. It is important to recognize that this adjustment is appropriate only when there is a valid justification for removal, such as a data entry or measurement error.

Overall, the Q-Q plots suggest that the normality assumption is questionable for most of these variables. Because most variables are not normally distributed and some are discrete, Spearman correlation is more appropriate than Pearson correlation. The data appear to exhibit a monotonic relationship and satisfy the required measurement type. Let’s return to our Spearman correlation coefficients.

Interpret coefficients

Pairs ρ
meals per member & household size -0.88
total livestock & livestock per member 0.72
household size & total livestock 0.23
total livestock & years living in village 0.28

The correlations vary notably in strength and direction.

meals_per_member and no_membrs show a strong negative correlation (-0.88). This value reflects the calculation of meals per person rather than an effect, because meals per member is derived from the total number of meals divided by household size.

Similarly, liv_count and livestock_per_member have a strong positive correlation (0.72), reflecting the fact that livestock per member is derived from total livestock rather than indicating a direct relationship with household size.

no_membrs with liv_count show a weak positive correlation (0.23), suggesting that larger households may have slightly more livestock, but other factors also influence livestock ownership.

Finally, liv_count and years_liv show a weak positive correlation (0.28), indicating that households that have lived longer in the area tend to have slightly more livestock, possibly reflecting the time needed to accumulate animals.