Lab 8 Goals

Prep data for correlation and ANOVA analysis
Create exploratory visualizations
Conduct and interpret Correlation Analysis
Conduct ant interpret ANOVA

INTRODUCTION

In this lab we will: 1) Simulate data to use to demonstrate correlation and ANOVA 2) Create exploratory visualiztions of correlations 3) Use correlation to look at the association between two interval-level variables; 4) Use ANOVA to look at the effect of a categorical independent variable on an interval-level dependent variable.

Part 1: Create the simulation

We’re going to use simulated data to work through an analysis of plant growth. First we’ll visualize associations, then we’ll look at correlations, and finally we’ll do ANOVA.

1.1 Experiment Description

Imagine we’re testing a pesticide to see how it influences plant growth. We run an experiment and vary 1. how much light each plant gets, 2. whether there is a pesticide administered, and 3. other random factors. This simulation imagines an experiment of 100 rows of 4 plants, each row is given a different amount of light, and within each row, each of the four plants is given a different amount of pesticide. The plants are grown in a greenhouse environment where pests themselves are not an issue.

1.2 simulating data

I can create fake data of the relationship between those variables, which we might call simulated data:

library(tidyverse)

## Warning: package 'tidyverse' was built under R version 4.3.3

## Warning: package 'tibble' was built under R version 4.3.3

## Warning: package 'tidyr' was built under R version 4.3.3

## Warning: package 'readr' was built under R version 4.3.3

## Warning: package 'purrr' was built under R version 4.3.3

## Warning: package 'dplyr' was built under R version 4.3.3

## Warning: package 'stringr' was built under R version 4.3.3

## Warning: package 'lubridate' was built under R version 4.3.3

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.4     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

# simulated data for light and water levels
df <- tibble(light = rep(runif(100, .5, 1), each = 4), # observed light amounts between .5 half light and 1- full light
             pesticide =rep(c(0,2,6,12), 100)) %>%
  # simulated growth in inches
  # plants grow more with more light, and less with more pesticide
  mutate(growth = 8*light + -.2*pesticide + rnorm(400, 3, 2)) # 'other factors' are in the rnorm call here we don't know what they were
summary(df$growth)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  0.08991  6.70025  8.11288  8.22638  9.90234 15.31704

In my simulation, plants grew from a min value of 0.0899146 to a max value of 15.3170375 with an average of 8.2263786.

Part 2: Exploratory Visualization

2.1 Scatterplot

Now that we have the data, let’s visualize the relationship between our focal outcome – growth – and our two independent variables, light and pesticide. These are called exploratory visualizations because they give us the chance to observe–visually not analytically–possible quantitative associations.

Light is a continuous variable, so we’ll use a scatterplot.

df %>% ggplot(aes(light, growth))+
  geom_point()

This plot is basic, but it does the job. I can see some kind of correlation there, I think.

2.2 Boxplot

Pesticide technically is also an interval variable, but it only takes 4 values in my data: 0,2,6,12 so it might be more interesting to make a boxplot. I’ll have to adjust the water variable from numeric to factor to get it to work.

df <- df %>% mutate(cat_pesticide = as_factor(case_when(
  pesticide == 0 ~ 'none',
  pesticide == 2 ~ '2 grams',
  pesticide == 6 ~ '6 grams',
  pesticide == 12 ~ '12 grams')))
df %>%
  ggplot(aes(cat_pesticide, growth))+
  geom_boxplot()

It looks like there’s an association here too. Plants with more pesticide grew less, which makes sense.

Part 3: Correlation on simulated data

3.1 Light and Growth Correlation

Now we’ll try some correlation. We can use the cor.test function to both calculate th correlation and to do a hypothesis test for its significance.

lg_test <- cor.test(df$light,df$growth)
lg_test

## 
##  Pearson's product-moment correlation
## 
## data:  df$light and df$growth
## t = 9.7623, df = 398, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.3568648 0.5153769
## sample estimates:
##       cor 
## 0.4395366

Here we see a positive, significant correlation of $= $ `r 0.4395366. The magnitude suggests a moderate correlation, and the fact that it’s positive suggests plants which had a lot of light also tended to grow more. That it’s significant means that we can (or could if it were real) expect that these results would be similar in the population of plants.

3.2 Pesticide and Growth Correlation

Let’s try with pesticide.

pg_test <- cor.test(df$pesticide,df$growth)
pg_test

## 
##  Pearson's product-moment correlation
## 
## data:  df$pesticide and df$growth
## t = -8.5519, df = 398, p-value = 2.638e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.4737453 -0.3078358
## sample estimates:
##        cor 
## -0.3939952

Here we see a negative, significant correlation of $= $ `r -0.3939952. The magnitude suggests a moderate to weak correlation, and the fact that it’s negative suggests that when pesticide increases, plant growth tends to decrease. That it’s significant means that we can (or could if it were real) expect that these results would be similar in the population of plants.

Part 4: ANOVA on simulated data

In the case of the pesticide plants, we also might wonder whether there was a significant difference between plants that had different amounts of pesticide. We could run an ANOVA to see if the difference between groups of plants that got different amounts of pesticides was larger that the difference within those groups. This is another example of exploratory visualization.

We’ll focus here on the three groups of plants that got pesticides – ignoring the plants with 0 pesticides.

4.1 Exploratory Visualization Again

Let’s start with a plot of the distributions of the groups.

# filter to plants with at least some pesticides
df_somepest <- df %>% filter(pesticide > 0) %>% 
  # calculate the group means for the plot
  group_by(cat_pesticide) %>% mutate(growth_mean = mean(growth)) %>% ungroup()
  
df_somepest %>% ggplot(aes(growth, color = cat_pesticide, group = cat_pesticide))+
  geom_density(alpha = 0.5)+
  geom_vline(aes(xintercept = growth_mean,color = cat_pesticide), linetype = 2)

We can see that while we have clearly different means – so variation between groups, we also see that each distribution is quite spread out, so a fair amount of variation within groups too.

4.2 ANOVA Analysis and Interpretation

We can use oneway.test to run an ANOVA test to see whether the differences are significant.

# we use a 'formula' here to tell r to do an anova of growth 'by' pesticide
anova1 <- oneway.test(growth ~ pesticide, data = df_somepest)
anova1

## 
##  One-way analysis of means (not assuming equal variances)
## 
## data:  growth and pesticide
## F = 26.029, num df = 2.00, denom df = 197.39, p-value = 9.268e-11

In our simulation, these differences are significant with an observed test statistic of 26.0294956 - so many times more variation between groups than within groups. Our p-value is 9.2678051^{-11}, so very small and significant at $\alpha = 0.001$.

LAB 8 Checkout

1) Prep your data

For this lab I will use the data prepared for the “Neighborhoods and Health” topic. This file combines data from the PLACES data which contains information on the health characteristics of populations within census tracts (neighborhoods) and NaNDA data which brings together information on the demographic, economic, social, and ecological characteristics of these neighborhoods. The datafile for this class includes a random sample of 3,000 census tracts drawn from the population of all 74k census tracts in the US. See more about the data using the resources on the course webpage.

load("data/PLACES_NaNDA_sample.Rdata")

Load that data and create a new cleaned dataframe that has four variables: a tract id, the poverty proportion as measured in the 2013-17 ACS, the prevalence of asthma in the tract, and the number of open parks in the tract.

clean_places <- PLACES_NaNDA_sample %>%
  filter(!is.na(ppov13_17), !is.na(asthma), !is.na(count_open_parks)) %>%
  select(LocationID, ppov13_17, asthma, count_open_parks)

2) Visualize the data

Create two visualizations. One that shows the relationship between poverty and asthma and one that shows the relationship between the number of parks in a tract and asthma.

Visualization 1: Scatterplot Showing Relationship Btwn Poverty & Asthma

clean_places <- clean_places %>%
  mutate(cat_poverty = as.factor(case_when(
         ppov13_17<=.05 ~ 'low poverty',
         ppov13_17<.20 ~ 'moderate poverty',
         ppov13_17>= .20 ~ 'high poverty'))) 

clean_places <- clean_places %>% 
  mutate(cat_poverty = ordered(cat_poverty,
          levels = c("low poverty", "moderate poverty",
                     "high poverty")))
         
clean_places <- clean_places %>%
  mutate(cat_asthma = as.factor(case_when(
         asthma<=6.5 ~ 'low asthma',
         asthma<=9.5 ~ 'mild asthma',
         asthma<=12.5 ~ 'moderate asthma',
         asthma>12.5 ~ 'severe asthma')))  

clean_places <- clean_places %>% 
  mutate(cat_asthma = ordered(cat_asthma,
          levels = c("low asthma", "mild asthma","moderate asthma", 
                      "severe asthma")))

clean_places %>%
  ggplot(aes(cat_poverty, cat_asthma))+
  geom_point()

Visualation # 2: Boxplot Showing Relationship Between # of Parks & Asthma

clean_places <- clean_places %>%
  mutate(cat_open_parks = as.factor(case_when(
         count_open_parks<=10 ~ 'very few or no parks',
         count_open_parks<20 ~ 'some parks',
         count_open_parks<30 ~ 'a number of parks',
         count_open_parks>30 ~ 'lots of parks')))
         

clean_places <- clean_places %>% 
  mutate(cat_open_parks = ordered(cat_open_parks,
          levels = c("very few or no parks", "some parks", 
                     "a number of parks", "lots of parks")))
         

clean_places <- clean_places %>% 
  mutate(cat_asthma = ordered(cat_asthma,
          levels = c("low asthma", "mild asthma","moderate asthma", 
                      "severe asthma")))

clean_places %>%
  ggplot(aes(cat_open_parks, cat_asthma))+
  geom_boxplot()

3) Correlation Analysis

Examine and interpret the correlation for asthma and poverty and for asthma and parks.

cor.test(clean_places$asthma, clean_places$ppov13_17)

## 
##  Pearson's product-moment correlation
## 
## data:  clean_places$asthma and clean_places$ppov13_17
## t = 49.247, df = 2992, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.6488237 0.6884167
## sample estimates:
##       cor 
## 0.6690947

Interpretation of Correlation Test for Asthma & Poverty: Based on the p-value, which is significantly less than 0.05, the result of the correlation test is statistically significant. This means we can reject the null hypothesis that there is no correlation between asthma and poverty.

cor.test(clean_places$asthma, clean_places$count_open_parks)

## 
##  Pearson's product-moment correlation
## 
## data:  clean_places$asthma and clean_places$count_open_parks
## t = -4.4393, df = 2992, p-value = 9.354e-06
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.11637672 -0.04520024
## sample estimates:
##         cor 
## -0.08089161

Interpretation of Correlation Test for Asthma and Parks: Interpretation of Correlation Test for Asthma & Poverty: Based on the p-value, which is significantly less than 0.05, the result of the correlation test is statistically significant. This means we can reject the null hypothesis that there is no correlation between asthma and poverty.

4) ANOVA

Conduct and interpret an analysis seeing if there’s a significant difference in asthma prevalence across tracts that 1) have no parks, 2) have 1 park or 3) have 2 or more parks. You’ll need to recode parks into a categorical variable to do this, and then run an ANOVA test.

clean_places <- clean_places %>% 
  mutate(cat_open_parks = as.factor(case_when(
    count_open_parks==0 ~ 'no parks',
    count_open_parks==1 ~ '1 park',
    count_open_parks>=2 ~ '2 or more parks')))


clean_places <- clean_places %>% 
  mutate(cat_open_parks = ordered(cat_open_parks,
          levels = c("no parks", "1 park", "2 or more parks")))


anova <- oneway.test(asthma ~ cat_open_parks, data = clean_places)
anova

## 
##  One-way analysis of means (not assuming equal variances)
## 
## data:  asthma and cat_open_parks
## F = 6.1003, num df = 2.0, denom df = 1673.5, p-value = 0.002292

Interpretation for ANOVA test (asthma and parks < 3)

BONUS

Return to the simulation. The line mutate(growth = 8*light + -.2*pesticide + rnorm(400, 3, 2)) defines the relationship between growth and the other variables. Read through the line and see if you understand it. In our demonstration the correlations and the ANOVA were significant.

What could you do to make them no longer statistically significant? Try out your idea.

Could you make the correlation with light stronger?

Could you make the amount of variation within pesticide groups smaller?

HINTS

Hints are in part 0 this time :).

R Lab 8 - Correlation plus ANOVA

Ian Kennedy

July 11, 2024