Summarizing some of the important concept on Confounding variables, these are some of the basic concepts I can find.

What it is

According to PSDS Chapter 4 (Peter Bruce and Gedeck 2020), Confounding variable is “an important predictor that, when omitted, leads to spurious relationships in a regression equation.” The host in Z Statistics (Z Statistics 2020) presented an example that helps to explain this definition. In his illustration, the initial hypothesis was that ice cream sales has a positive relationship with the amount of sunburn presentations. According to our common sense, this relationship is spurious. Then he pointed out the factor that was causing the high ice cream sales and sunburn presentations could be sun exposure. Since sun exposure was missing in the model, the model cannot be explained and questionable.

Why does it matter

According to (Stephanie Glen, n.d.), the issues with missing variables in the model are:

Your model may be missing an important predictor and thus reduced in its power of inference or prediction;
Some predictors may exhibit high variance
Your model may be biased
Your model may even be invalid

Criteria of a confounding variable

According to (Data Learner, n.d.a), to be qualified as confounding variable,

the variable should be associated with some predictor variables
the variable should be associated with the response variable
the variable should not be part of the causal pathway from other predictors to the response

Ways to eliminate confounding variable

According to (Stephanie Glen, n.d.), the ways to eliminate the effects of confounding variables depend on the phase of project ones is in.

During experiment design phase

randomization
restriction / exclusion
matching
stratification

During analysis phase

stratified analysis
multivariate analysis

Finding confounding variable in the grape yield dataset

Our group project aimed to determine the factors (geographical, climatic and soil) that could influence grape yield in Australia. To apply the technique in identifying a confounding variable, we will start with the initial hypothesis that yield is significantly influenced by the nitrogen content in soil. From there, I will try to examine if rainfall in November is a confounding variable that will influence the relationship between yield and nitrogen.

Data preparation

## -- Attaching packages ------------------------------------------------------------------------------------------------ tidyverse 1.3.0 --

## v ggplot2 3.3.2     v purrr   0.3.4
## v tibble  3.0.3     v dplyr   1.0.2
## v tidyr   1.1.2     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.5.0

## -- Conflicts --------------------------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

## here() starts at N:/Data_Work/20200808 36103 Statistical Thinking for DS/AT2/AT2C/STDS-AT2C

## 
## Attaching package: 'janitor'

## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test

## Loading required package: lattice

## 
## Attaching package: 'caret'

## The following object is masked from 'package:purrr':
## 
##     lift

## Loading required package: Matrix

## 
## Attaching package: 'Matrix'

## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack

## Loaded glmnet 4.0-2

## Loading required package: zoo

## 
## Attaching package: 'zoo'

## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric

# remove objects from current environment
remove(list = ls())

# read data ----
yield_by_region <- readRDS(here("project/src/output/yield_by_region.rds"))

# remove one outlier and one significant observation
yield_by_region <- yield_by_region[-222,]
yield_by_region <- yield_by_region[-180,]

# yield_by_region <- dplyr::mutate(yield_by_region, nitroBand = ifelse(lu_nitrogen < median(yield_by_region$lu_nitrogen),'low','high'))
yield_by_region <- dplyr::mutate(yield_by_region, nitroBand = ifelse(lu_nitrogen < 0.12,'low','high'))

# set train and test ----
train_size <- floor(0.7 * nrow(yield_by_region))

set.seed(10) 
train_n <- sample(seq_len(nrow(yield_by_region)), size = train_size)

train <- yield_by_region[train_n, ]
test <- yield_by_region[-train_n, ]

Relationship between November rain and nitrogen

The plot below shows that November rainfall seems to have a non-linear relationship with nitrogen concentration in soil. This satisfies Criteria #1 above.

# simple plot between Yield and Rain grouped by nitroBand
yield_by_region %>%
  ggplot(aes(lu_mean_rain_nov, lu_nitrogen)) +
  geom_smooth(mapping = aes(lu_mean_rain_nov, yield, group = lu_nitrogen, linetype = nitroBand)) +
  geom_line(color = 'red')

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Relationship between yield and November rain

The plot below shows that November rainfall impacts yield in a negative way. Beyond the 60 point, the impact levels off. This satisfies Criteria #2 above.

# simple plot between Yield and Rain
yield_by_region %>%
  ggplot(aes(lu_mean_rain_nov, yield)) +
  geom_smooth(mapping = aes(lu_mean_rain_nov, yield)) +
  geom_line(color = 'red')

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Relationship between yield and nitrogen, with banding

Now let’s look at the relationship between nitrogen and yield, as in our inital model. The plot below shows that low nitrogen content does impact yield in a negative way. Beyond the 0.12 point (ie the ‘high’ band), the impact levels off. There is a single point of higher nitrogen value which corresponds with the lowest yield. This seems a bit counter-intuitive to me as I would expect the use of fertilizer to increase nitrogen was supposed to increase grape production, not to reduce it! Does the November rainfall confound the negative relationship here?

Interestingly, with the very low concentration of nitrogen below 0.5, increase in nitrogen pushes up yield. We’ll ignore this observation for now. But this could be investigated more in the part 2 of this study.

# simple plot between Yield and Rain grouped by nitroBand
yield_by_region %>%
  ggplot(aes(lu_nitrogen, yield)) +
  geom_smooth(mapping = aes(lu_nitrogen, yield, group = nitroBand, linetype = nitroBand)) +
  geom_line(color = 'red')

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Relationship between yield and November rain, with nitrogen bands

The plot below shows that November rainfall impacts yield in a negative way. Beyond the 60 point, the impact levels off.

# simple plot between Yield and Rain grouped by nitroBand
yield_by_region %>%
  ggplot(aes(lu_mean_rain_nov, yield)) +
  geom_smooth(mapping = aes(lu_mean_rain_nov, yield, group = nitroBand, linetype = nitroBand)) +
  geom_line(color = 'red')

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Showcasing the effect of rain on the relatioship between yield and nitrogen

# get average yield for low nitrogen band
avg_yield_nitro_low <- mean(yield_by_region[yield_by_region$nitroBand == 'low',"yield"])
# get average yield for high nitrogen band
avg_yield_nitro_high <- mean(yield_by_region[yield_by_region$nitroBand == 'high',"yield"])

yield_by_region %>%
  ggplot(aes(lu_mean_rain_nov, yield, color = nitroBand)) +
  geom_smooth(method = 'lm') +
  geom_point() +
  geom_hline(yintercept = avg_yield_nitro_low, linetype = "dashed", colour = "blue") +
  geom_hline(yintercept = avg_yield_nitro_high, linetype = "dashed", colour = "red")

## `geom_smooth()` using formula 'y ~ x'

Comparing the LM coefficients

fit_nitro <- lm(yield ~ nitroBand, data = yield_by_region)
summary(fit_nitro)$coef

##              Estimate Std. Error   t value     Pr(>|t|)
## (Intercept)  6.053356  0.3946808 15.337344 1.680388e-36
## nitroBandlow 4.576785  0.6947494  6.587677 3.300848e-10

fit_nitro_rain <- lm(yield ~ nitroBand + lu_mean_rain_nov, data = yield_by_region)
summary(fit_nitro_rain)$coef

##                     Estimate Std. Error   t value     Pr(>|t|)
## (Intercept)      10.40786315  1.0703158  9.724105 8.801742e-19
## nitroBandlow      2.87742310  0.7736546  3.719261 2.544322e-04
## lu_mean_rain_nov -0.07508064  0.0172562 -4.350937 2.086650e-05

LM coefficients of original nitrogen variable (for comparison)

fit_nitro_orig <- lm(yield ~ lu_nitrogen, data = yield_by_region)
summary(fit_nitro_orig)$coef

##              Estimate Std. Error   t value     Pr(>|t|)
## (Intercept)  13.64501   1.017786 13.406560 2.743241e-30
## lu_nitrogen -42.44180   6.690468 -6.343622 1.278475e-09

fit_nitro_orig_rain <- lm(yield ~ lu_nitrogen + lu_mean_rain_nov, data = yield_by_region)
summary(fit_nitro_orig_rain)$coef

##                      Estimate Std. Error   t value     Pr(>|t|)
## (Intercept)       14.75914328  1.0203478 14.464817 1.188748e-33
## lu_nitrogen      -23.60777206  7.9421475 -2.972467 3.287612e-03
## lu_mean_rain_nov  -0.07550434  0.0185074 -4.079684 6.335170e-05

Conclusion

The increase in nitrogen concentration in soil seems to cause decrease in grape yield when under the heavy influence of November rainfall. This can be supported by comparing the coefficients of nitrogen in the two models. The coefficient of nitrogen band dropped from 4.576785 to 2.87742310 which is a 37% drop. Since it is greater than 10%, the November Rain variable is a confounding variable to the association of yield with nitrogen content.

Instead of settling on this conclusion, the results ask for more investigation into the problem. For example, it requires more research into the time of applying fertilizers, timing of rainfall which will optimal for best yield, and the growth cycle of grapes. The domain knowledge of grape growing will provide clearer directions of identifying the important factors influencing grape yield.

References

Caffo, Brian. n.d. Regression Models for Data Science in R. Leanpub. https://leanpub.com/regmods/read#leanpub-auto-adjustment.

Data Learner. n.d.a. Week 9 : CONFOUNDING: DEFINITION. YouTube. https://www.youtube.com/watch?v=hNZVFMVKVlc.

———. n.d.b. Week 9 : CONFOUNDING: INTRO. YouTube. https://www.youtube.com/watch?v=Cn18J9PoNOE.

Elizabeth Lynch. n.d.a. Adjusting for Confounding Variables. YouTube. https://www.youtube.com/watch?v=E0oazq_ehMk.

———. n.d.b. Confounding. YouTube. https://www.youtube.com/watch?v=1Cn1smM3kbQ.

———. n.d.c. Dealing with Confounding. YouTube. https://www.youtube.com/watch?v=M7zWJpXPTWc.

Halpern, Benjamin S., Helen M. Regan, Hugh P. Possingham, and Michael A. McCarthy. 2006. “Accounting for Uncertainty in Marine Reserve Design.” Ecology Letters 9 (1): 2–11. https://doi.org/10.1111/j.1461-0248.2005.00827.x.

Peter Bruce, Andrew Bruce, and Peter Gedeck. 2020. Practical Statistics for Data Scientists Second Edition. O’Reilly. https://www.oreilly.com/library/view/practical-statistics-for/9781491952955/.

R Core Team. 2019. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org.

Stephanie Glen. n.d. What Is a Confounding Variable. YouTube. https://www.youtube.com/watch?v=sKkCZqqZ3qE.

Z Statistics. 2020. What Is a Confounding Variable?? (Health Stats Iq). YouTube. https://www.youtube.com/watch?v=SjIbrNYAd5Y.

Confounding Variable Basics