In this lab we will: 1) Simulate data to use to demonstrate correlation and ANOVA 2) Create exploratory visualiztions of correlations 3) Use correlation to look at the association between two interval-level variables; 4) Use ANOVA to look at the effect of a categorical independent variable on an interval-level dependent variable.
We’re going to use simulated data to work through an analysis of plant growth. First we’ll visualize associations, then we’ll look at correlations, and finally we’ll do ANOVA.
Imagine we’re testing a pesticide to see how it influences plant growth. We run an experiment and vary 1. how much light each plant gets, 2. whether there is a pesticide administered, and 3. other random factors. This simulation imagines an experiment of 100 rows of 4 plants, each row is given a different amount of light, and within each row, each of the four plants is given a different amount of pesticide. The plants are grown in a greenhouse environment where pests themselves are not an issue.
I can create fake data of the relationship between those variables, which we might call simulated data:
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.3.3
## Warning: package 'tibble' was built under R version 4.3.3
## Warning: package 'tidyr' was built under R version 4.3.3
## Warning: package 'readr' was built under R version 4.3.3
## Warning: package 'purrr' was built under R version 4.3.3
## Warning: package 'dplyr' was built under R version 4.3.3
## Warning: package 'stringr' was built under R version 4.3.3
## Warning: package 'lubridate' was built under R version 4.3.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.2 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# simulated data for light and water levels
df <- tibble(light = rep(runif(100, .5, 1), each = 4), # observed light amounts between .5 half light and 1- full light
pesticide =rep(c(0,2,6,12), 100)) %>%
# simulated growth in inches
# plants grow more with more light, and less with more pesticide
mutate(growth = 8*light + -.2*pesticide + rnorm(400, 3, 2)) # 'other factors' are in the rnorm call here we don't know what they were
summary(df$growth)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.08991 6.70025 8.11288 8.22638 9.90234 15.31704
In my simulation, plants grew from a min value of 0.0899146 to a max value of 15.3170375 with an average of 8.2263786.
Now that we have the data, let’s visualize the relationship between our focal outcome – growth – and our two independent variables, light and pesticide. These are called exploratory visualizations because they give us the chance to observe–visually not analytically–possible quantitative associations.
Light is a continuous variable, so we’ll use a scatterplot.
df %>% ggplot(aes(light, growth))+
geom_point()
This plot is basic, but it does the job. I can see some kind of
correlation there, I think.
Pesticide technically is also an interval variable, but it only takes 4 values in my data: 0,2,6,12 so it might be more interesting to make a boxplot. I’ll have to adjust the water variable from numeric to factor to get it to work.
df <- df %>% mutate(cat_pesticide = as_factor(case_when(
pesticide == 0 ~ 'none',
pesticide == 2 ~ '2 grams',
pesticide == 6 ~ '6 grams',
pesticide == 12 ~ '12 grams')))
df %>%
ggplot(aes(cat_pesticide, growth))+
geom_boxplot()
It looks like there’s an association here too. Plants with more pesticide grew less, which makes sense.
Now we’ll try some correlation. We can use the cor.test
function to both calculate th correlation and to do a hypothesis test
for its significance.
lg_test <- cor.test(df$light,df$growth)
lg_test
##
## Pearson's product-moment correlation
##
## data: df$light and df$growth
## t = 9.7623, df = 398, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.3568648 0.5153769
## sample estimates:
## cor
## 0.4395366
Here we see a positive, significant correlation of $= $ `r 0.4395366. The magnitude suggests a moderate correlation, and the fact that it’s positive suggests plants which had a lot of light also tended to grow more. That it’s significant means that we can (or could if it were real) expect that these results would be similar in the population of plants.
Let’s try with pesticide.
pg_test <- cor.test(df$pesticide,df$growth)
pg_test
##
## Pearson's product-moment correlation
##
## data: df$pesticide and df$growth
## t = -8.5519, df = 398, p-value = 2.638e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.4737453 -0.3078358
## sample estimates:
## cor
## -0.3939952
Here we see a negative, significant correlation of $= $ `r -0.3939952. The magnitude suggests a moderate to weak correlation, and the fact that it’s negative suggests that when pesticide increases, plant growth tends to decrease. That it’s significant means that we can (or could if it were real) expect that these results would be similar in the population of plants.
In the case of the pesticide plants, we also might wonder whether there was a significant difference between plants that had different amounts of pesticide. We could run an ANOVA to see if the difference between groups of plants that got different amounts of pesticides was larger that the difference within those groups. This is another example of exploratory visualization.
We’ll focus here on the three groups of plants that got pesticides – ignoring the plants with 0 pesticides.
Let’s start with a plot of the distributions of the groups.
# filter to plants with at least some pesticides
df_somepest <- df %>% filter(pesticide > 0) %>%
# calculate the group means for the plot
group_by(cat_pesticide) %>% mutate(growth_mean = mean(growth)) %>% ungroup()
df_somepest %>% ggplot(aes(growth, color = cat_pesticide, group = cat_pesticide))+
geom_density(alpha = 0.5)+
geom_vline(aes(xintercept = growth_mean,color = cat_pesticide), linetype = 2)
We can see that while we have clearly different means – so variation
between groups, we also see that each distribution is quite spread out,
so a fair amount of variation within groups too.
We can use oneway.test to run an ANOVA test to see
whether the differences are significant.
# we use a 'formula' here to tell r to do an anova of growth 'by' pesticide
anova1 <- oneway.test(growth ~ pesticide, data = df_somepest)
anova1
##
## One-way analysis of means (not assuming equal variances)
##
## data: growth and pesticide
## F = 26.029, num df = 2.00, denom df = 197.39, p-value = 9.268e-11
In our simulation, these differences are significant with an observed test statistic of 26.0294956 - so many times more variation between groups than within groups. Our p-value is 9.2678051^{-11}, so very small and significant at \(\alpha = 0.001\).
For this lab I will use the data prepared for the “Neighborhoods and Health” topic. This file combines data from the PLACES data which contains information on the health characteristics of populations within census tracts (neighborhoods) and NaNDA data which brings together information on the demographic, economic, social, and ecological characteristics of these neighborhoods. The datafile for this class includes a random sample of 3,000 census tracts drawn from the population of all 74k census tracts in the US. See more about the data using the resources on the course webpage.
load("data/PLACES_NaNDA_sample.Rdata")
Load that data and create a new cleaned dataframe that has four variables: a tract id, the poverty proportion as measured in the 2013-17 ACS, the prevalence of asthma in the tract, and the number of open parks in the tract.
clean_places <- PLACES_NaNDA_sample %>%
filter(!is.na(ppov13_17), !is.na(asthma), !is.na(count_open_parks)) %>%
select(LocationID, ppov13_17, asthma, count_open_parks)
Create two visualizations. One that shows the relationship between poverty and asthma and one that shows the relationship between the number of parks in a tract and asthma.
Visualization 1: Scatterplot Showing Relationship Btwn Poverty & Asthma
clean_places <- clean_places %>%
mutate(cat_poverty = as.factor(case_when(
ppov13_17<=.05 ~ 'low poverty',
ppov13_17<.20 ~ 'moderate poverty',
ppov13_17>= .20 ~ 'high poverty')))
clean_places <- clean_places %>%
mutate(cat_poverty = ordered(cat_poverty,
levels = c("low poverty", "moderate poverty",
"high poverty")))
clean_places <- clean_places %>%
mutate(cat_asthma = as.factor(case_when(
asthma<=6.5 ~ 'low asthma',
asthma<=9.5 ~ 'mild asthma',
asthma<=12.5 ~ 'moderate asthma',
asthma>12.5 ~ 'severe asthma')))
clean_places <- clean_places %>%
mutate(cat_asthma = ordered(cat_asthma,
levels = c("low asthma", "mild asthma","moderate asthma",
"severe asthma")))
clean_places %>%
ggplot(aes(cat_poverty, cat_asthma))+
geom_point()
Visualation # 2: Boxplot Showing Relationship Between # of Parks &
Asthma
clean_places <- clean_places %>%
mutate(cat_open_parks = as.factor(case_when(
count_open_parks<=10 ~ 'very few or no parks',
count_open_parks<20 ~ 'some parks',
count_open_parks<30 ~ 'a number of parks',
count_open_parks>30 ~ 'lots of parks')))
clean_places <- clean_places %>%
mutate(cat_open_parks = ordered(cat_open_parks,
levels = c("very few or no parks", "some parks",
"a number of parks", "lots of parks")))
clean_places <- clean_places %>%
mutate(cat_asthma = ordered(cat_asthma,
levels = c("low asthma", "mild asthma","moderate asthma",
"severe asthma")))
clean_places %>%
ggplot(aes(cat_open_parks, cat_asthma))+
geom_boxplot()
Examine and interpret the correlation for asthma and poverty and for asthma and parks.
cor.test(clean_places$asthma, clean_places$ppov13_17)
##
## Pearson's product-moment correlation
##
## data: clean_places$asthma and clean_places$ppov13_17
## t = 49.247, df = 2992, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.6488237 0.6884167
## sample estimates:
## cor
## 0.6690947
Interpretation of Correlation Test for Asthma & Poverty: Based on the p-value, which is significantly less than 0.05, the result of the correlation test is statistically significant. This means we can reject the null hypothesis that there is no correlation between asthma and poverty.
cor.test(clean_places$asthma, clean_places$count_open_parks)
##
## Pearson's product-moment correlation
##
## data: clean_places$asthma and clean_places$count_open_parks
## t = -4.4393, df = 2992, p-value = 9.354e-06
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.11637672 -0.04520024
## sample estimates:
## cor
## -0.08089161
Interpretation of Correlation Test for Asthma and Parks: Interpretation of Correlation Test for Asthma & Poverty: Based on the p-value, which is significantly less than 0.05, the result of the correlation test is statistically significant. This means we can reject the null hypothesis that there is no correlation between asthma and poverty.
Conduct and interpret an analysis seeing if there’s a significant difference in asthma prevalence across tracts that 1) have no parks, 2) have 1 park or 3) have 2 or more parks. You’ll need to recode parks into a categorical variable to do this, and then run an ANOVA test.
clean_places <- clean_places %>%
mutate(cat_open_parks = as.factor(case_when(
count_open_parks==0 ~ 'no parks',
count_open_parks==1 ~ '1 park',
count_open_parks>=2 ~ '2 or more parks')))
clean_places <- clean_places %>%
mutate(cat_open_parks = ordered(cat_open_parks,
levels = c("no parks", "1 park", "2 or more parks")))
anova <- oneway.test(asthma ~ cat_open_parks, data = clean_places)
anova
##
## One-way analysis of means (not assuming equal variances)
##
## data: asthma and cat_open_parks
## F = 6.1003, num df = 2.0, denom df = 1673.5, p-value = 0.002292
Interpretation for ANOVA test (asthma and parks < 3)
Return to the simulation. The line
mutate(growth = 8*light + -.2*pesticide + rnorm(400, 3, 2))
defines the relationship between growth and the other variables. Read
through the line and see if you understand it. In our demonstration the
correlations and the ANOVA were significant.
What could you do to make them no longer statistically significant? Try out your idea.
Could you make the correlation with light stronger?
Could you make the amount of variation within pesticide groups smaller?
Hints are in part 0 this time :).