Summarizing some of the important concept on Confounding variables, these are some of the basic concepts I can find.
According to PSDS Chapter 4 (Peter Bruce and Gedeck 2020), Confounding variable is “an important predictor that, when omitted, leads to spurious relationships in a regression equation.” The host in Z Statistics (Z Statistics 2020) presented an example that helps to explain this definition. In his illustration, the initial hypothesis was that ice cream sales has a positive relationship with the amount of sunburn presentations. According to our common sense, this relationship is spurious. Then he pointed out the factor that was causing the high ice cream sales and sunburn presentations could be sun exposure. Since sun exposure was missing in the model, the model cannot be explained and questionable.
According to (Stephanie Glen, n.d.), the issues with missing variables in the model are:
According to (Data Learner, n.d.a), to be qualified as confounding variable,
According to (Stephanie Glen, n.d.), the ways to eliminate the effects of confounding variables depend on the phase of project ones is in.
Our group project aimed to determine the factors (geographical, climatic and soil) that could influence grape yield in Australia. To apply the technique in identifying a confounding variable, we will start with the initial hypothesis that yield is significantly influenced by the nitrogen content in soil. From there, I will try to examine if rainfall in November is a confounding variable that will influence the relationship between yield and nitrogen.
## -- Attaching packages ------------------------------------------------------------------------------------------------ tidyverse 1.3.0 --
## v ggplot2 3.3.2 v purrr 0.3.4
## v tibble 3.0.3 v dplyr 1.0.2
## v tidyr 1.1.2 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.5.0
## -- Conflicts --------------------------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
## here() starts at N:/Data_Work/20200808 36103 Statistical Thinking for DS/AT2/AT2C/STDS-AT2C
##
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
## Loaded glmnet 4.0-2
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
# remove objects from current environment
remove(list = ls())
# read data ----
yield_by_region <- readRDS(here("project/src/output/yield_by_region.rds"))
# remove one outlier and one significant observation
yield_by_region <- yield_by_region[-222,]
yield_by_region <- yield_by_region[-180,]
# yield_by_region <- dplyr::mutate(yield_by_region, nitroBand = ifelse(lu_nitrogen < median(yield_by_region$lu_nitrogen),'low','high'))
yield_by_region <- dplyr::mutate(yield_by_region, nitroBand = ifelse(lu_nitrogen < 0.12,'low','high'))
# set train and test ----
train_size <- floor(0.7 * nrow(yield_by_region))
set.seed(10)
train_n <- sample(seq_len(nrow(yield_by_region)), size = train_size)
train <- yield_by_region[train_n, ]
test <- yield_by_region[-train_n, ]
The plot below shows that November rainfall seems to have a non-linear relationship with nitrogen concentration in soil. This satisfies Criteria #1 above.
# simple plot between Yield and Rain grouped by nitroBand
yield_by_region %>%
ggplot(aes(lu_mean_rain_nov, lu_nitrogen)) +
geom_smooth(mapping = aes(lu_mean_rain_nov, yield, group = lu_nitrogen, linetype = nitroBand)) +
geom_line(color = 'red')
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
The plot below shows that November rainfall impacts yield in a negative way. Beyond the 60 point, the impact levels off. This satisfies Criteria #2 above.
# simple plot between Yield and Rain
yield_by_region %>%
ggplot(aes(lu_mean_rain_nov, yield)) +
geom_smooth(mapping = aes(lu_mean_rain_nov, yield)) +
geom_line(color = 'red')
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
Now let’s look at the relationship between nitrogen and yield, as in our inital model. The plot below shows that low nitrogen content does impact yield in a negative way. Beyond the 0.12 point (ie the ‘high’ band), the impact levels off. There is a single point of higher nitrogen value which corresponds with the lowest yield. This seems a bit counter-intuitive to me as I would expect the use of fertilizer to increase nitrogen was supposed to increase grape production, not to reduce it! Does the November rainfall confound the negative relationship here?
Interestingly, with the very low concentration of nitrogen below 0.5, increase in nitrogen pushes up yield. We’ll ignore this observation for now. But this could be investigated more in the part 2 of this study.
# simple plot between Yield and Rain grouped by nitroBand
yield_by_region %>%
ggplot(aes(lu_nitrogen, yield)) +
geom_smooth(mapping = aes(lu_nitrogen, yield, group = nitroBand, linetype = nitroBand)) +
geom_line(color = 'red')
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
The plot below shows that November rainfall impacts yield in a negative way. Beyond the 60 point, the impact levels off.
# simple plot between Yield and Rain grouped by nitroBand
yield_by_region %>%
ggplot(aes(lu_mean_rain_nov, yield)) +
geom_smooth(mapping = aes(lu_mean_rain_nov, yield, group = nitroBand, linetype = nitroBand)) +
geom_line(color = 'red')
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
# get average yield for low nitrogen band
avg_yield_nitro_low <- mean(yield_by_region[yield_by_region$nitroBand == 'low',"yield"])
# get average yield for high nitrogen band
avg_yield_nitro_high <- mean(yield_by_region[yield_by_region$nitroBand == 'high',"yield"])
yield_by_region %>%
ggplot(aes(lu_mean_rain_nov, yield, color = nitroBand)) +
geom_smooth(method = 'lm') +
geom_point() +
geom_hline(yintercept = avg_yield_nitro_low, linetype = "dashed", colour = "blue") +
geom_hline(yintercept = avg_yield_nitro_high, linetype = "dashed", colour = "red")
## `geom_smooth()` using formula 'y ~ x'
fit_nitro <- lm(yield ~ nitroBand, data = yield_by_region)
summary(fit_nitro)$coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.053356 0.3946808 15.337344 1.680388e-36
## nitroBandlow 4.576785 0.6947494 6.587677 3.300848e-10
fit_nitro_rain <- lm(yield ~ nitroBand + lu_mean_rain_nov, data = yield_by_region)
summary(fit_nitro_rain)$coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.40786315 1.0703158 9.724105 8.801742e-19
## nitroBandlow 2.87742310 0.7736546 3.719261 2.544322e-04
## lu_mean_rain_nov -0.07508064 0.0172562 -4.350937 2.086650e-05
fit_nitro_orig <- lm(yield ~ lu_nitrogen, data = yield_by_region)
summary(fit_nitro_orig)$coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.64501 1.017786 13.406560 2.743241e-30
## lu_nitrogen -42.44180 6.690468 -6.343622 1.278475e-09
fit_nitro_orig_rain <- lm(yield ~ lu_nitrogen + lu_mean_rain_nov, data = yield_by_region)
summary(fit_nitro_orig_rain)$coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 14.75914328 1.0203478 14.464817 1.188748e-33
## lu_nitrogen -23.60777206 7.9421475 -2.972467 3.287612e-03
## lu_mean_rain_nov -0.07550434 0.0185074 -4.079684 6.335170e-05
The increase in nitrogen concentration in soil seems to cause decrease in grape yield when under the heavy influence of November rainfall. This can be supported by comparing the coefficients of nitrogen in the two models. The coefficient of nitrogen band dropped from 4.576785 to 2.87742310 which is a 37% drop. Since it is greater than 10%, the November Rain variable is a confounding variable to the association of yield with nitrogen content.
Instead of settling on this conclusion, the results ask for more investigation into the problem. For example, it requires more research into the time of applying fertilizers, timing of rainfall which will optimal for best yield, and the growth cycle of grapes. The domain knowledge of grape growing will provide clearer directions of identifying the important factors influencing grape yield.
Caffo, Brian. n.d. Regression Models for Data Science in R. Leanpub. https://leanpub.com/regmods/read#leanpub-auto-adjustment.
Data Learner. n.d.a. Week 9 : CONFOUNDING: DEFINITION. YouTube. https://www.youtube.com/watch?v=hNZVFMVKVlc.
———. n.d.b. Week 9 : CONFOUNDING: INTRO. YouTube. https://www.youtube.com/watch?v=Cn18J9PoNOE.
Elizabeth Lynch. n.d.a. Adjusting for Confounding Variables. YouTube. https://www.youtube.com/watch?v=E0oazq_ehMk.
———. n.d.b. Confounding. YouTube. https://www.youtube.com/watch?v=1Cn1smM3kbQ.
———. n.d.c. Dealing with Confounding. YouTube. https://www.youtube.com/watch?v=M7zWJpXPTWc.
Halpern, Benjamin S., Helen M. Regan, Hugh P. Possingham, and Michael A. McCarthy. 2006. “Accounting for Uncertainty in Marine Reserve Design.” Ecology Letters 9 (1): 2–11. https://doi.org/10.1111/j.1461-0248.2005.00827.x.
Peter Bruce, Andrew Bruce, and Peter Gedeck. 2020. Practical Statistics for Data Scientists Second Edition. O’Reilly. https://www.oreilly.com/library/view/practical-statistics-for/9781491952955/.
R Core Team. 2019. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org.
Stephanie Glen. n.d. What Is a Confounding Variable. YouTube. https://www.youtube.com/watch?v=sKkCZqqZ3qE.
Z Statistics. 2020. What Is a Confounding Variable?? (Health Stats Iq). YouTube. https://www.youtube.com/watch?v=SjIbrNYAd5Y.