I am testing, at the 5% significance level, for a difference in the proportion of children who had an allergic reaction at the end of the study between those who ate peanuts and those who avoided them.
My hypothesis is that those who eat peanuts will have fewer allergic reactions than those who avoid them.
\(H_0: p_1 = p_2\)
\(H_a: p_1 < p_2\)
where \(p_1\) is the proportion of patients who have an allergic reaction after 5 years of consuming peanuts, and \(p_2\) is the proportion of patients who have an allergic reaction after 5 years of avoiding peanuts.
I will be using the Learning Early about Peanut Allergy (LEAP) dataset from the OpenIntro website: https://www.openintro.org/data/index.php?data=LEAP
Here is some more information about the study, quoted directly from the OpenIntro website.
“The study team enrolled children in the United Kingdom between 2006 and 2009, selecting 640 infants with eczema, egg allergy, or both. Each child was randomly assigned to a treatment group (peanut consumption) or the control group (peanut avoidance); children in the treatment group were fed at least 6 grams of peanut protein daily until 5 years of age, while children in the control group were to avoid consuming peanut protein until 5 years of age.
At 5 years of age, each child was tested for peanut allergy using an oral food challenge (OFC): 5 grams of peanut protein in a single dose.
This dataset only contains the patients in the primary ITT analysis in the New England Journal of Medicine paper. This means it only includes the children eligible for the study because they are positive for an egg allergy and/or eczema and negative for skin test of peanut allergy.”
There are 530 observations and 6 variables in the dataset, including the child’s age, sex, and primary ethnicity. I will be using 2 variables, treatment.group (did the patient consume peanuts or not) and overall.V60.outcome (did the patient have an allergic reaction), to answer my question.
I chose it because I wanted to know whether the proportion of patients who have an allergic reaction after 5 years of consuming peanuts would be lower than the proportion among those who had no exposure to peanuts.
I am going to look at the dimensions and head of the data and check for any missing values. I will then use group_by() and summarise() to count the allergic reactions after 5 years for those who ate peanuts and those who did not.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.2.0 ✔ readr 2.1.6
## ✔ forcats 1.0.1 ✔ stringr 1.6.0
## ✔ ggplot2 4.0.2 ✔ tibble 3.3.1
## ✔ lubridate 1.9.5 ✔ tidyr 1.3.2
## ✔ purrr 1.2.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(corrplot)
## corrplot 0.95 loaded
library(lubridate)
library(dplyr)
setwd("~/Downloads/Data 101 Course materials/Data Sets")
LEAP <- read.csv("LEAP.csv")
str(LEAP)
## 'data.frame': 530 obs. of 6 variables:
## $ participant.ID : chr "LEAP_100522" "LEAP_103358" "LEAP_105069" "LEAP_105328" ...
## $ treatment.group : chr "Peanut Consumption" "Peanut Consumption" "Peanut Avoidance" "Peanut Consumption" ...
## $ age.months : num 6.08 7.59 5.98 7.03 6.41 ...
## $ sex : chr "Female" "Female" "Male" "Female" ...
## $ primary.ethnicity : chr "Black" "White" "White" "White" ...
## $ overall.V60.outcome: chr "PASS OFC" "PASS OFC" "PASS OFC" "PASS OFC" ...
head(LEAP)
## participant.ID treatment.group age.months sex primary.ethnicity
## 1 LEAP_100522 Peanut Consumption 6.0780 Female Black
## 2 LEAP_103358 Peanut Consumption 7.5893 Female White
## 3 LEAP_105069 Peanut Avoidance 5.9795 Male White
## 4 LEAP_105328 Peanut Consumption 7.0308 Female White
## 5 LEAP_106377 Peanut Avoidance 6.4066 Male White
## 6 LEAP_107031 Peanut Consumption 6.0452 Female White
## overall.V60.outcome
## 1 PASS OFC
## 2 PASS OFC
## 3 PASS OFC
## 4 PASS OFC
## 5 PASS OFC
## 6 PASS OFC
colSums(is.na(LEAP))
## participant.ID treatment.group age.months sex
## 0 0 0 0
## primary.ethnicity overall.V60.outcome
## 0 0
There are no missing data.
df <- LEAP |>
  group_by(overall.V60.outcome, treatment.group) |>
  summarise(outcomes = n())
## `summarise()` has regrouped the output.
## ℹ Summaries were computed grouped by overall.V60.outcome and treatment.group.
## ℹ Output is grouped by overall.V60.outcome.
## ℹ Use `summarise(.groups = "drop_last")` to silence this message.
## ℹ Use `summarise(.by = c(overall.V60.outcome, treatment.group))` for
## per-operation grouping (`?dplyr::dplyr_by`) instead.
df
## # A tibble: 4 × 3
## # Groups: overall.V60.outcome [2]
## overall.V60.outcome treatment.group outcomes
## <chr> <chr> <int>
## 1 FAIL OFC Peanut Avoidance 36
## 2 FAIL OFC Peanut Consumption 5
## 3 PASS OFC Peanut Avoidance 227
## 4 PASS OFC Peanut Consumption 262
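The counts above can be turned into the proportion of failed oral food challenges in each group. The following is a minimal sketch, assuming the four counts are typed in by hand from the table rather than re-read from LEAP.csv; the names `leap_counts`, `reaction_rates`, `n_children`, `n_fail`, and `prop_fail` are illustrative and not part of the original analysis.

```r
library(dplyr)

# Assumption: counts copied by hand from the summary table above
leap_counts <- data.frame(
  treatment.group     = c("Peanut Avoidance", "Peanut Consumption",
                          "Peanut Avoidance", "Peanut Consumption"),
  overall.V60.outcome = c("FAIL OFC", "FAIL OFC", "PASS OFC", "PASS OFC"),
  outcomes            = c(36, 5, 227, 262)
)

# Proportion of children in each treatment group who failed the OFC
reaction_rates <- leap_counts |>
  group_by(treatment.group) |>
  summarise(
    n_children = sum(outcomes),
    n_fail     = sum(outcomes[overall.V60.outcome == "FAIL OFC"]),
    prop_fail  = n_fail / n_children
  )

reaction_rates
```

With the row-level data, the same proportions could likely be obtained more directly by grouping LEAP by treatment.group and taking the mean of `overall.V60.outcome == "FAIL OFC"`.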
Only 5 children who ate peanuts had an allergic reaction, which is right at the border of the usual minimum cell count for the difference-in-proportions test. If the count were below 5, an alternative test, such as Fisher's exact test, would be needed.
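As a rough check on this condition, the sketch below looks at the expected cell counts under the null hypothesis (one common way of stating the rule of thumb) and runs Fisher's exact test as a possible alternative. It assumes the same four counts as the table above; the names `counts`, `expected`, and `fisher` are illustrative, and Fisher's exact test is offered here as one option rather than as part of the original analysis.

```r
# Assumption: 2x2 table typed in from the counts shown above
counts <- matrix(
  c(5, 36,
    262, 227),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    c("Had Allergic Reaction", "No Allergic Reaction"),
    c("Consumed Peanuts", "Avoided Peanuts")
  )
)

# Expected cell counts if both groups shared the same reaction rate
expected <- chisq.test(counts, correct = FALSE)$expected
expected

# Exact test that does not rely on a large-sample approximation.
# alternative = "less" tests whether the odds of a reaction are lower
# in the first column (consumption) than in the second (avoidance).
fisher <- fisher.test(counts, alternative = "less")
fisher$p.value
```

If every expected count is at least 5, the large-sample test is usually considered reasonable; the exact test is a useful cross-check when an observed count is this small.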
names(df) <- gsub("\\.", "_", names(df))
names(df)
## [1] "overall_V60_outcome" "treatment_group" "outcomes"
data <- matrix(c(5, 36, 262, 227), nrow = 2, byrow = TRUE)
colnames(data) <- c("Consumed Peanuts", "Avoided Peanuts")
rownames(data) <- c("Had Allergic Reaction", "No Allergic Reaction")
data
## Consumed Peanuts Avoided Peanuts
## Had Allergic Reaction 5 36
## No Allergic Reaction 262 227
barplot(data,
        beside = FALSE,
        col = c("skyblue", "orange"),
        legend.text = TRUE,  # label the stacked segments using the row names
        main = "OFC Outcome at 5 Years by Treatment Group",
        xlab = "Treatment Group",
        ylab = "Number of Children")
My hypothesis is that those who eat peanuts will have fewer allergic reactions than those who avoid them.
\(H_0: p_1 = p_2\)
\(H_a: p_1 < p_2\)
where \(p_1\) is the proportion of patients who have an allergic reaction after 5 years of consuming peanuts, and \(p_2\) is the proportion of patients who have an allergic reaction after 5 years of avoiding peanuts.
prop.test(c(5, 36), c(267, 263), alternative = "less")
##
## 2-sample test for equality of proportions with continuity correction
##
## data: c(5, 36) out of c(267, 263)
## X-squared = 24.286, df = 1, p-value = 4.151e-07
## alternative hypothesis: less
## 95 percent confidence interval:
## -1.00000000 -0.07694385
## sample estimates:
## prop 1 prop 2
## 0.01872659 0.13688213
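To show where the X-squared statistic and p-value come from, the test can be reproduced by hand. This is a sketch, assuming a pooled standard error under \(H_0\) and the same continuity correction that prop.test() applies by default; the variable names (`p_pool`, `se`, `yates`, `z`, and so on) are illustrative.

```r
# Counts taken from the prop.test() call above
x <- c(5, 36)     # children who failed the OFC: consumption, avoidance
n <- c(267, 263)  # group sizes: consumption, avoidance

p_hat  <- x / n            # sample proportions
p_pool <- sum(x) / sum(n)  # pooled proportion under H0: p1 = p2
se     <- sqrt(p_pool * (1 - p_pool) * sum(1 / n))

# Continuity correction: shrink the observed difference toward zero
diff_hat <- p_hat[1] - p_hat[2]
yates    <- 0.5 * sum(1 / n)
z        <- sign(diff_hat) * (abs(diff_hat) - yates) / se

x_squared <- z^2
p_value   <- pnorm(z)      # lower tail, matching alternative = "less"

c(x_squared = x_squared, p_value = p_value)
```

Under these assumptions the hand calculation should agree with the X-squared = 24.286 and p-value = 4.151e-07 reported by prop.test().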
The p-value is 4.151e-07, which is less than alpha = 0.05.
Therefore, the results are statistically significant, and we reject the null hypothesis that there is no difference between the two groups. About 1.9% of the consumption group had an allergic reaction, compared with about 13.7% of the avoidance group. Because the children were randomly assigned to the groups, we have evidence that consuming peanuts early causes fewer allergic reactions than avoiding peanuts in those at risk of developing a peanut allergy.
This is helpful information for parents: introducing peanuts at an early age may help prevent at-risk children from developing a peanut allergy later on.
It may be interesting to look at whether allergic reactions differ between males and females, both among those who ate peanuts and among those who avoided them. It would also be worthwhile to test the children for an allergic reaction at a younger age to see whether the length of time spent consuming peanuts plays a significant role.
OpenIntro. Learning Early about Peanut Allergy (LEAP) dataset. https://www.openintro.org/data/index.php?data=LEAP
Du Toit, George, et al. "Randomized trial of peanut consumption in infants at risk for peanut allergy." New England Journal of Medicine 372.9 (2015): 803-813.