In this lab we will: 1) Construct some new categorical versions of my focal variables; 2) Produce a nice, complete bivariate table; and 3) Perform a chi-square test for the statistical significance of my focal association.
For more information on the techniques and concepts illustrated in this lab, see Hands-On Programming in R at https://rstudio-education.github.io/hopr/basics.html
For this lab I will use the data prepared for the “Neighborhoods and Health” topic. This file combines data from the PLACES data which contains information on the health characteristics of populations within census tracts (neighborhoods) and NaNDA data which brings together information on the demographic, economic, social, and ecological characteristics of these neighborhoods. The datafile for this class includes a random sample of 3,000 census tracts drawn from the population of all 74k census tracts in the US. See more about the data using the resources on the course webpage.
Once you download the data to the data file in your working directory, opening the file is easy:
load("data/PLACES_NaNDA_sample.Rdata") # loading the data for the "Neighborhoods and Health" project
In this example we are interested in looking at the association between the level of POVERTY in the neighborhood and the prevalence of ASTHMA in the area.
First, check for any missing values and then do any necessary cleaning on our variables of interest.
We’re focused on two variables: asthma = # people with asthma per 1,000 population pov13_17 = Proportion people w/ income past 12 months below the federal poverty line
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.3.3
## Warning: package 'tibble' was built under R version 4.3.3
## Warning: package 'tidyr' was built under R version 4.3.3
## Warning: package 'readr' was built under R version 4.3.3
## Warning: package 'purrr' was built under R version 4.3.3
## Warning: package 'dplyr' was built under R version 4.3.3
## Warning: package 'stringr' was built under R version 4.3.3
## Warning: package 'lubridate' was built under R version 4.3.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.2 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
clean_places <- PLACES_NaNDA_sample %>%
filter(!is.na(ppov13_17), !is.na(asthma)) %>%
select(LocationID, ppov13_17, asthma)
Once you have the clean dataset, let’s walk through the steps here. Remember, a \(\chi^2\) (spelled chi-squared, pronounced ‘kai squared’) test is a way of testing for an association between two categorical variables. Specifically, it’s a way to see if the different groups in the independent variable have different proportions on the dependent variable. In our example, we will test whether neighborhoods in different poverty categories also have different rates of asthma. Like our other hypotheses tests, we’re testing whether the variation we observe in the sample would be unlikely under a null hypothesis of no association between groups–in our case under the null hypothesis that the proportion of neighborhoods with high or low asthma is constant across various levels of poverty.
Chi-square only works with categorical variables so we need to create categorical versions of both focal variables.
I’ll create a new factor (categorical) variable for the IV, neighborhood poverty: 1) low poverty = at or below 5% poverty 2) middle poverty = above 5% but at or below 20% 3) high poverty = above 20% poverty
Then – and this is important – I run some code to make sure that my recode worked as expected.
this code will only work if you also named your new dataframe ‘clean_places’
# recode
clean_places <- clean_places %>%
mutate(cat_pov = case_when(ppov13_17<=.05 ~'low poverty',
ppov13_17<.20 ~ 'moderate poverty',
ppov13_17>=.20 ~ 'high poverty'))
# testing to make sure that the range is right for each cat and that I have some observations in each cat
clean_places %>% group_by(cat_pov) %>% summarise(max_pov = max(ppov13_17), min_pov = min(ppov13_17), n = n())
## # A tibble: 3 × 4
## cat_pov max_pov min_pov n
## <chr> <dbl> <dbl> <int>
## 1 high poverty 1 0.200 859
## 2 low poverty 0.0500 0 494
## 3 moderate poverty 0.200 0.0501 1644
What am I checking for here? I want to be sure that my “high poverty” category actually has the high poverty tracts and that it doesn’t overlap with the other categories. This checks out. The chart shows that low poverty tracts run from 0% poverty to almost 5%, moderate from 5% to almost 20% and high from 20% to 100%, just like I coded it.
Since we know that that worked, now lets tell R that it has an order to it.
clean_places <- clean_places %>%
mutate(cat_pov = ordered(cat_pov,
levels = c("low poverty", "moderate poverty",
"high poverty")))
Ok, based on that template from working with the poverty variable, create a new factor (categorical) variable for the DV, asthma cases per 1000:
First make the variable, check to make sure that it worked, and then give it an order. These are the same steps that we took with the poverty var above, so you can use that as a framework.
clean_places <- clean_places %>%
mutate(cat_asthma = case_when(asthma <= mean(asthma) ~ 'low asthma',
asthma > mean(asthma) ~ 'high asthma'),
cat_asthma = ordered(cat_asthma, levels = c('low asthma', 'high asthma')))
clean_places %>% mutate(asthma_mean = mean(asthma)) %>%
group_by(cat_asthma) %>%
summarize(mean = first(asthma_mean), max_as = max(asthma), min_as = min(asthma), n = n())
## # A tibble: 2 × 5
## cat_asthma mean max_as min_as n
## <ord> <dbl> <dbl> <dbl> <int>
## 1 low asthma 9.89 9.8 6.1 1611
## 2 high asthma 9.89 17.1 9.9 1386
Creating a bivariate table showing the cross-classification of the two new categorical variables.
There are a couple ways of doing this. The easiest way is to use the table function. We need to put the DV–asthma–first so that it will show up as the rows of the table.
# simple way using "base" r
base_tab <- table(clean_places$cat_asthma, clean_places$cat_pov)
base_tab
##
## low poverty moderate poverty high poverty
## low asthma 445 985 181
## high asthma 49 659 678
But we can also do it in the tidyverse, though it’s a little more involved. The count function will get us all of the values we need–the cell counts–but not in the normal format. The pivot_wider function pivots our table, making new columns from the column we assing to ‘names_from’ and giving them the values in ‘values_from’.
# tidyverse way
tidy_tab <- clean_places %>% count(cat_asthma,cat_pov) %>%
pivot_wider(names_from = cat_pov, values_from = n)
tidy_tab
## # A tibble: 2 × 4
## cat_asthma `low poverty` `moderate poverty` `high poverty`
## <ord> <int> <int> <int>
## 1 low asthma 445 985 181
## 2 high asthma 49 659 678
These tables look… medium when knitted. We can use the knitr function ‘kable’ to make it look nice. This works with either kind of table.
knitr::kable(base_tab)
| low poverty | moderate poverty | high poverty | |
|---|---|---|---|
| low asthma | 445 | 985 | 181 |
| high asthma | 49 | 659 | 678 |
Regardless of which table we look at, it’s kind of hard to see the association just looking at the counts–it’s much easier looking at the proportions.
We can use the following lines to use prop.table to get a table with proportions instead of frequencies. The base R command for doing this is call prop.table, which only works by passing a table through it.
So the following line takes our ‘base_tab’ from above and passes it to the base R function prop.table. The “,2” thing at the end of the code requests column proportions, since “,1” would give you row proportions.
base_prop <- prop.table(base_tab,2)
base_prop
##
## low poverty moderate poverty high poverty
## low asthma 0.90080972 0.59914842 0.21071013
## high asthma 0.09919028 0.40085158 0.78928987
Looking at the proportions we can see that there’s definitely an association in the sample–rates are similar in moderate poverty neighborhoods, by wayy differnt in low and high poverty neighborhoods.
Using the tidyverse, we can have a bit more control over the format
of the table. For instance, we could create a table that had the raw
count first, and then the column proportion in parentheses. And we can
pipe that to kable to make it look nice knitted
library(glue)
## Warning: package 'glue' was built under R version 4.3.3
clean_places %>%
# this does the actual counting
count(cat_asthma,cat_pov) %>%
# These next two lines calculate the proportions
group_by(cat_pov) %>%
mutate(prop = n/sum(n),
# this uses a glue function to format our text, putting n, the count first
# and then a rounded version of the proportion next
text = glue("{n} ({round(prop,2)})")) %>%
# we use select to remove two variables that we don't want in our table
select(-n, -prop) %>%
# then we change the layout of the table by pivoting it
pivot_wider(names_from = cat_pov, values_from = text) %>%
# finally, we can pipe it to a kable to make it look nice knitted
knitr::kable()
| cat_asthma | low poverty | moderate poverty | high poverty |
|---|---|---|---|
| low asthma | 445 (0.9) | 985 (0.6) | 181 (0.21) |
| high asthma | 49 (0.1) | 659 (0.4) | 678 (0.79) |
Finally performing Chi-square test on my table. The null hypothesis is that the variables are statistically independent. The “correct=FALSE” part of the code just turns off the default Yates Continuity Correction
chi_2 <- chisq.test(clean_places$cat_asthma, clean_places$cat_pov, correct=FALSE)
chi_2
##
## Pearson's Chi-squared test
##
## data: clean_places$cat_asthma and clean_places$cat_pov
## X-squared = 656.45, df = 2, p-value < 2.2e-16
We can see that the \(\chi^2\) statistic is 656.45, which is huge. At 2 degrees of freedom, we have a very low p value.
So that means that we can reject the null hypotheses that there is no relationship between asthma and poverty in the population. Instead, our analysis supports the alternative hypotheses that there is an association. Since these are ordinal categorical variables, we can even say that there is a positive association – low poverty neighborhoods are more likely to also have low rates of asthma, while high poverty neighborhoods are more likely to have high rates of asthma.
Ok, we started by considering asthma and neighborhood poverty. Now, please conduct a similar analysis that looks instead at rates of cancer and neighborhood poverty. Follow these steps:
Alt hypothesis: Our alternative hypothesis is that there is a relationship between cancer rates and neighborhood poverty in the population.
Null hypothesis: Our null hypothesis is that there is no relationship between cancer rates and neighborhood poverty in the population.
clean_places <- PLACES_NaNDA_sample %>%
filter(!is.na(cancer), !is.na(ppov13_17)) %>%
select(LocationID, cancer, ppov13_17)
library(dplyr)
clean_places <- clean_places %>%
mutate(cat_pov = case_when(ppov13_17 <=.05 ~ 'low poverty',
ppov13_17 <.20 ~ 'moderate poverty',
ppov13_17 >=.20 ~ 'high poverty'))
clean_places %>% group_by(cat_pov) %>% summarise(max_pov = max(ppov13_17), min_pov = min(ppov13_17), n = n())
## # A tibble: 3 × 4
## cat_pov max_pov min_pov n
## <chr> <dbl> <dbl> <int>
## 1 high poverty 1 0.200 859
## 2 low poverty 0.0500 0 494
## 3 moderate poverty 0.200 0.0501 1644
clean_places <- clean_places %>%
mutate(cat_cancer = case_when(cancer <=3.0 ~ 'low cancer',
cancer <=7.9 ~ 'moderate cancer',
cancer >7.9 ~ ' high cancer'))
clean_places %>% group_by(cat_cancer) %>% summarise(max_cancer = max(cancer), min_cancer = min(cancer), n = n())
## # A tibble: 3 × 4
## cat_cancer max_cancer min_cancer n
## <chr> <dbl> <dbl> <int>
## 1 " high cancer" 17.5 8 579
## 2 "low cancer" 3 0.7 61
## 3 "moderate cancer" 7.9 3.1 2357
library(glue)
clean_places %>%
count(cat_cancer, cat_pov) %>%
group_by(cat_pov) %>%
mutate(prop = n/sum(n), text = glue("{n} ({round(prop, 2)})")) %>%
select(-n, -prop) %>%
pivot_wider(names_from = cat_pov, values_from = text) %>%
knitr:: kable()
| cat_cancer | high poverty | low poverty | moderate poverty |
|---|---|---|---|
| high cancer | 62 (0.07) | 138 (0.28) | 379 (0.23) |
| low cancer | 46 (0.05) | 2 (0) | 13 (0.01) |
| moderate cancer | 751 (0.87) | 354 (0.72) | 1252 (0.76) |
chi_2 <- chisq.test(clean_places$cat_cancer, clean_places$cat_pov, correct=FALSE)
chi_2
##
## Pearson's Chi-squared test
##
## data: clean_places$cat_cancer and clean_places$cat_pov
## X-squared = 174.22, df = 4, p-value < 2.2e-16
Question 5 Interpretation:
The result of our chi square test shows a chi statistic (critical value) of 174.22, 4 degrees of freedom, and a p-value less than 2.2e-16 (which is less than 0.01). This result is highly statistically significant. Because the p-value is so small, it makes the chances of observing a chi statistic/critical value of 174.22 very low. Based on this information, we can conclude there IS a relationship between cancer rates and neighborhood poverty, and reject the null hypothesis that a relationship between the two variables does not exist.
Bonus things to do:
- plot and explore the \(\chi^2\)
distribution, like we did for the normal in Distribution
Plots.R
- simulate the sampling distribution for a bivariate table, or for a
table with more cols and rows, like we did for the normal in lab 5 and
6
– does it look like a \(\chi^2\)
distribution?
– can you confirm the accuracy of, say, the critical value when \(\alpha = 0.05\)?
maybe something like
library(tidyverse)
clean_places <- PLACES_NaNDA_sample %>% filter(!is.na(asthma), !is.na(ppov13_17)) %>%
select(LocationID, asthma, ppov13_17)
clean_places <- clean_places %>%
mutate(cat_asthma = case_when(asthma <= mean(asthma) ~ 'low asthma',
asthma > mean(asthma) ~ 'high asthma'),
cat_asthma = ordered(cat_asthma, levels = c('low asthma', 'high asthma')))
clean_places %>% mutate(asthma_mean = mean(asthma)) %>%
group_by(cat_asthma) %>%
summarize(mean = first(asthma_mean), max_as = max(asthma), min_as = min(asthma), n = n())
clean_places <- PLACES_NaNDA_sample %>% filter(!is.na(cancer), !is.na(ppov13_17)) %>%
select(LocationID, cancer, ppov13_17)
clean_places <- clean_places %>%
mutate(cat_cancer = case_when(cancer <= mean(cancer) ~ 'low cancer',
cancer > mean(cancer) ~ 'high cancer'),
cat_cancer = ordered(cat_cancer, levels = c('low cancer', 'high cancer')),
cat_pov = case_when(ppov13_17<=.05 ~'low poverty',
ppov13_17<.20 ~ 'moderate poverty',
ppov13_17>=.20 ~ 'high poverty'),
cat_pov = ordered(cat_pov,
levels = c("low poverty", "moderate poverty",
"high poverty")))
table(clean_places$cat_cancer, clean_places$cat_pov)
prop.table(table(clean_places$cat_cancer, clean_places$cat_pov),2)
chisq.test(clean_places$cat_cancer, clean_places$cat_pov, correct = FALSE)