Lab 7 Goals:

Introduction

In this lab we will: 1) Construct some new categorical versions of my focal variables; 2) Produce a nice, complete bivariate table; and 3) Perform a chi-square test for the statistical significance of my focal association.

For more information on the techniques and concepts illustrated in this lab, see Hands-On Programming in R at https://rstudio-education.github.io/hopr/basics.html

0.1 Loading Data

For this lab I will use the data prepared for the “Neighborhoods and Health” topic. This file combines data from the PLACES data which contains information on the health characteristics of populations within census tracts (neighborhoods) and NaNDA data which brings together information on the demographic, economic, social, and ecological characteristics of these neighborhoods. The datafile for this class includes a random sample of 3,000 census tracts drawn from the population of all 74k census tracts in the US. See more about the data using the resources on the course webpage.

Once you download the data to the data file in your working directory, opening the file is easy:

load("data/PLACES_NaNDA_sample.Rdata") # loading the data for the "Neighborhoods and Health" project

0.2 Checking for missing data and cleaning

In this example we are interested in looking at the association between the level of POVERTY in the neighborhood and the prevalence of ASTHMA in the area.

First, check for any missing values and then do any necessary cleaning on our variables of interest.

We’re focused on two variables: asthma = # people with asthma per 1,000 population pov13_17 = Proportion people w/ income past 12 months below the federal poverty line

library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.3.3
## Warning: package 'tibble' was built under R version 4.3.3
## Warning: package 'tidyr' was built under R version 4.3.3
## Warning: package 'readr' was built under R version 4.3.3
## Warning: package 'purrr' was built under R version 4.3.3
## Warning: package 'dplyr' was built under R version 4.3.3
## Warning: package 'stringr' was built under R version 4.3.3
## Warning: package 'lubridate' was built under R version 4.3.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.4     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
clean_places <- PLACES_NaNDA_sample %>%
  filter(!is.na(ppov13_17), !is.na(asthma)) %>%
  select(LocationID, ppov13_17, asthma)

Once you have the clean dataset, let’s walk through the steps here. Remember, a \(\chi^2\) (spelled chi-squared, pronounced ‘kai squared’) test is a way of testing for an association between two categorical variables. Specifically, it’s a way to see if the different groups in the independent variable have different proportions on the dependent variable. In our example, we will test whether neighborhoods in different poverty categories also have different rates of asthma. Like our other hypotheses tests, we’re testing whether the variation we observe in the sample would be unlikely under a null hypothesis of no association between groups–in our case under the null hypothesis that the proportion of neighborhoods with high or low asthma is constant across various levels of poverty.

PART 1) Construct some new categorical versions of my focal variables;

1.1 Categorical Recoding

Chi-square only works with categorical variables so we need to create categorical versions of both focal variables.

I’ll create a new factor (categorical) variable for the IV, neighborhood poverty: 1) low poverty = at or below 5% poverty 2) middle poverty = above 5% but at or below 20% 3) high poverty = above 20% poverty

Then – and this is important – I run some code to make sure that my recode worked as expected.

this code will only work if you also named your new dataframe ‘clean_places’

# recode
clean_places <- clean_places %>% 
  mutate(cat_pov = case_when(ppov13_17<=.05 ~'low poverty',
                                       ppov13_17<.20 ~ 'moderate poverty',
                                       ppov13_17>=.20 ~ 'high poverty'))
# testing to make sure that the range is right for each cat and that I have some observations in each cat
clean_places %>% group_by(cat_pov) %>% summarise(max_pov = max(ppov13_17), min_pov = min(ppov13_17), n = n())
## # A tibble: 3 × 4
##   cat_pov          max_pov min_pov     n
##   <chr>              <dbl>   <dbl> <int>
## 1 high poverty      1       0.200    859
## 2 low poverty       0.0500  0        494
## 3 moderate poverty  0.200   0.0501  1644

What am I checking for here? I want to be sure that my “high poverty” category actually has the high poverty tracts and that it doesn’t overlap with the other categories. This checks out. The chart shows that low poverty tracts run from 0% poverty to almost 5%, moderate from 5% to almost 20% and high from 20% to 100%, just like I coded it.

1.2 Making an Ordinal Var from a Categorical Var

Since we know that that worked, now lets tell R that it has an order to it.

clean_places <- clean_places %>% 
  mutate(cat_pov = ordered(cat_pov,
          levels = c("low poverty", "moderate poverty",
                     "high poverty")))

1.3 Practice

Ok, based on that template from working with the poverty variable, create a new factor (categorical) variable for the DV, asthma cases per 1000:

  1. low asthma = at or below the mean asthma level for all tracts in the sample
  2. high asthma = above the mean asthma level for all tracts in the sample

First make the variable, check to make sure that it worked, and then give it an order. These are the same steps that we took with the poverty var above, so you can use that as a framework.

clean_places <- clean_places %>% 
  mutate(cat_asthma = case_when(asthma <= mean(asthma) ~ 'low asthma',
                                asthma > mean(asthma) ~ 'high asthma'),
         cat_asthma = ordered(cat_asthma, levels = c('low asthma', 'high asthma')))
                                                               
clean_places %>% mutate(asthma_mean = mean(asthma)) %>%
  group_by(cat_asthma) %>%
  summarize(mean = first(asthma_mean), max_as = max(asthma), min_as = min(asthma), n = n())
## # A tibble: 2 × 5
##   cat_asthma   mean max_as min_as     n
##   <ord>       <dbl>  <dbl>  <dbl> <int>
## 1 low asthma   9.89    9.8    6.1  1611
## 2 high asthma  9.89   17.1    9.9  1386

PART 2) Produce a bivariate table

Creating a bivariate table showing the cross-classification of the two new categorical variables.

2.1 A simple table

There are a couple ways of doing this. The easiest way is to use the table function. We need to put the DV–asthma–first so that it will show up as the rows of the table.

# simple way using "base" r
base_tab <- table(clean_places$cat_asthma, clean_places$cat_pov)
base_tab
##              
##               low poverty moderate poverty high poverty
##   low asthma          445              985          181
##   high asthma          49              659          678

2.2 Tidyverse table

But we can also do it in the tidyverse, though it’s a little more involved. The count function will get us all of the values we need–the cell counts–but not in the normal format. The pivot_wider function pivots our table, making new columns from the column we assing to ‘names_from’ and giving them the values in ‘values_from’.

# tidyverse way
tidy_tab <- clean_places %>% count(cat_asthma,cat_pov) %>% 
  pivot_wider(names_from = cat_pov, values_from = n)
tidy_tab
## # A tibble: 2 × 4
##   cat_asthma  `low poverty` `moderate poverty` `high poverty`
##   <ord>               <int>              <int>          <int>
## 1 low asthma            445                985            181
## 2 high asthma            49                659            678

2.3 Nicely knitted table

These tables look… medium when knitted. We can use the knitr function ‘kable’ to make it look nice. This works with either kind of table.

knitr::kable(base_tab)
low poverty moderate poverty high poverty
low asthma 445 985 181
high asthma 49 659 678

2.4 Table proportions

Regardless of which table we look at, it’s kind of hard to see the association just looking at the counts–it’s much easier looking at the proportions.

We can use the following lines to use prop.table to get a table with proportions instead of frequencies. The base R command for doing this is call prop.table, which only works by passing a table through it.

So the following line takes our ‘base_tab’ from above and passes it to the base R function prop.table. The “,2” thing at the end of the code requests column proportions, since “,1” would give you row proportions.

base_prop <- prop.table(base_tab,2)
base_prop
##              
##               low poverty moderate poverty high poverty
##   low asthma   0.90080972       0.59914842   0.21071013
##   high asthma  0.09919028       0.40085158   0.78928987

Looking at the proportions we can see that there’s definitely an association in the sample–rates are similar in moderate poverty neighborhoods, by wayy differnt in low and high poverty neighborhoods.

2.5 Table proportions in the tidyverse nicely knitted

Using the tidyverse, we can have a bit more control over the format of the table. For instance, we could create a table that had the raw count first, and then the column proportion in parentheses. And we can pipe that to kable to make it look nice knitted

library(glue)
## Warning: package 'glue' was built under R version 4.3.3
clean_places %>% 
  # this does the actual counting
  count(cat_asthma,cat_pov) %>% 
  # These next two lines calculate the proportions
  group_by(cat_pov) %>% 
  mutate(prop = n/sum(n), 
         # this uses a glue function to format our text, putting n, the count first
         # and then a rounded version of the proportion next
         text = glue("{n} ({round(prop,2)})")) %>%
  # we use select to remove two variables that we don't want in our table
  select(-n, -prop) %>%
  # then we change the layout of the table by pivoting it
  pivot_wider(names_from = cat_pov, values_from = text) %>% 
  # finally, we can pipe it to a kable to make it look nice knitted
  knitr::kable()
cat_asthma low poverty moderate poverty high poverty
low asthma 445 (0.9) 985 (0.6) 181 (0.21)
high asthma 49 (0.1) 659 (0.4) 678 (0.79)

PART 3) Perform a chi-square test for the statistical significance of my focal association

3.1 Running the Chi-Square

Finally performing Chi-square test on my table. The null hypothesis is that the variables are statistically independent. The “correct=FALSE” part of the code just turns off the default Yates Continuity Correction

chi_2 <- chisq.test(clean_places$cat_asthma, clean_places$cat_pov, correct=FALSE)
chi_2
## 
##  Pearson's Chi-squared test
## 
## data:  clean_places$cat_asthma and clean_places$cat_pov
## X-squared = 656.45, df = 2, p-value < 2.2e-16

3.2 Chi-Square Interpretation

We can see that the \(\chi^2\) statistic is 656.45, which is huge. At 2 degrees of freedom, we have a very low p value.

So that means that we can reject the null hypotheses that there is no relationship between asthma and poverty in the population. Instead, our analysis supports the alternative hypotheses that there is an association. Since these are ordinal categorical variables, we can even say that there is a positive association – low poverty neighborhoods are more likely to also have low rates of asthma, while high poverty neighborhoods are more likely to have high rates of asthma.

Lab 7 Checkout

Ok, we started by considering asthma and neighborhood poverty. Now, please conduct a similar analysis that looks instead at rates of cancer and neighborhood poverty. Follow these steps:

  1. State your null and alternative hypotheses

Alt hypothesis: Our alternative hypothesis is that there is a relationship between cancer rates and neighborhood poverty in the population.

Null hypothesis: Our null hypothesis is that there is no relationship between cancer rates and neighborhood poverty in the population.

  1. Create a clean dataset with just the variables you’ll be using
clean_places <- PLACES_NaNDA_sample %>%
  filter(!is.na(cancer), !is.na(ppov13_17)) %>%
  select(LocationID, cancer, ppov13_17)
  1. Create categorical versions of cancer and poverty rates (you already have poverty!). Include code that let’s you confirm that your variables are as you expected.
library(dplyr)

clean_places <- clean_places %>% 
  mutate(cat_pov = case_when(ppov13_17 <=.05 ~ 'low poverty',
                             ppov13_17 <.20 ~ 'moderate poverty',
                             ppov13_17 >=.20 ~ 'high poverty')) 

clean_places %>% group_by(cat_pov) %>% summarise(max_pov = max(ppov13_17), min_pov = min(ppov13_17), n = n())
## # A tibble: 3 × 4
##   cat_pov          max_pov min_pov     n
##   <chr>              <dbl>   <dbl> <int>
## 1 high poverty      1       0.200    859
## 2 low poverty       0.0500  0        494
## 3 moderate poverty  0.200   0.0501  1644
clean_places <- clean_places %>%
  mutate(cat_cancer = case_when(cancer <=3.0 ~ 'low cancer',
                                cancer <=7.9 ~ 'moderate cancer',
                                cancer >7.9 ~ ' high cancer'))

clean_places %>% group_by(cat_cancer) %>% summarise(max_cancer = max(cancer), min_cancer = min(cancer), n = n())
## # A tibble: 3 × 4
##   cat_cancer        max_cancer min_cancer     n
##   <chr>                  <dbl>      <dbl> <int>
## 1 " high cancer"          17.5        8     579
## 2 "low cancer"             3          0.7    61
## 3 "moderate cancer"        7.9        3.1  2357
  1. Present a table showing the bivariate relationship between these variables. Show both a table of counts and a table of proportions (or a table that shows both as in 2.5 above)
library(glue)

clean_places %>%
  count(cat_cancer, cat_pov) %>%
  group_by(cat_pov) %>% 
  mutate(prop = n/sum(n), text = glue("{n} ({round(prop, 2)})")) %>%
  select(-n, -prop) %>%
  pivot_wider(names_from = cat_pov, values_from = text) %>%
  knitr:: kable()
cat_cancer high poverty low poverty moderate poverty
high cancer 62 (0.07) 138 (0.28) 379 (0.23)
low cancer 46 (0.05) 2 (0) 13 (0.01)
moderate cancer 751 (0.87) 354 (0.72) 1252 (0.76)
  1. Conduct a \(\chi^2\) hypotheses test and interpret the results.
chi_2 <- chisq.test(clean_places$cat_cancer, clean_places$cat_pov, correct=FALSE)
chi_2
## 
##  Pearson's Chi-squared test
## 
## data:  clean_places$cat_cancer and clean_places$cat_pov
## X-squared = 174.22, df = 4, p-value < 2.2e-16

Question 5 Interpretation:

The result of our chi square test shows a chi statistic (critical value) of 174.22, 4 degrees of freedom, and a p-value less than 2.2e-16 (which is less than 0.01). This result is highly statistically significant. Because the p-value is so small, it makes the chances of observing a chi statistic/critical value of 174.22 very low. Based on this information, we can conclude there IS a relationship between cancer rates and neighborhood poverty, and reject the null hypothesis that a relationship between the two variables does not exist.

What you finished already!?!

Bonus things to do:
- plot and explore the \(\chi^2\) distribution, like we did for the normal in Distribution Plots.R
- simulate the sampling distribution for a bivariate table, or for a table with more cols and rows, like we did for the normal in lab 5 and 6
– does it look like a \(\chi^2\) distribution?
– can you confirm the accuracy of, say, the critical value when \(\alpha = 0.05\)?

HINTS

Intro

maybe something like

library(tidyverse)
clean_places <- PLACES_NaNDA_sample %>% filter(!is.na(asthma), !is.na(ppov13_17)) %>%
  select(LocationID, asthma, ppov13_17)

Part 1

clean_places <- clean_places %>% 
  mutate(cat_asthma = case_when(asthma <= mean(asthma) ~ 'low asthma',
                                asthma > mean(asthma) ~ 'high asthma'),
         cat_asthma = ordered(cat_asthma, levels = c('low asthma', 'high asthma')))
                                                               
clean_places %>% mutate(asthma_mean = mean(asthma)) %>%
  group_by(cat_asthma) %>%
  summarize(mean = first(asthma_mean), max_as = max(asthma), min_as = min(asthma), n = n())

Part 4

clean_places <- PLACES_NaNDA_sample %>% filter(!is.na(cancer), !is.na(ppov13_17)) %>%
  select(LocationID, cancer, ppov13_17)

clean_places <- clean_places %>%   
  mutate(cat_cancer = case_when(cancer <= mean(cancer) ~ 'low cancer',
                                cancer > mean(cancer) ~ 'high cancer'),
         cat_cancer = ordered(cat_cancer, levels = c('low cancer', 'high cancer')),
         cat_pov = case_when(ppov13_17<=.05 ~'low poverty',
                                       ppov13_17<.20 ~ 'moderate poverty',
                                       ppov13_17>=.20 ~ 'high poverty'),
         cat_pov = ordered(cat_pov,
          levels = c("low poverty", "moderate poverty",
                     "high poverty")))

table(clean_places$cat_cancer, clean_places$cat_pov)
prop.table(table(clean_places$cat_cancer, clean_places$cat_pov),2)
chisq.test(clean_places$cat_cancer, clean_places$cat_pov, correct = FALSE)