Final_Project

Introduction

Research Question: Is there an association between dogs’ risk of cancer and exposure to the herbicide 2,4-Dichlorophenoxyacetic acid (2,4-D)?

The data set I selected comes from a study of whether dogs exposed to the herbicide 2,4-Dichlorophenoxyacetic acid (2,4-D) have an increased risk of cancer. 2,4-D is a herbicide used to kill plants (particularly weeds) by “changing the way certain cells grow.” This herbicide can be harmful to humans and animals. People who were exposed to 2,4-D “vomited, had diarrhea, headaches, and were confused or aggressive. Some people also had kidney failure and skeletal muscle damage.” Dogs and other pets are exposed when they come into contact with plants that are still wet from being sprayed with the herbicide and then groom themselves. Pets may also ingest the herbicide by eating treated grass or accidentally drinking it. Dogs and cats that consume 2,4-D may experience “vomiting, diarrhea, loss of appetite, lethargy, drooling, staggering, or convulsions.”

This data set is meant to help study one long-term effect of exposure to 2,4-D, specifically cancer. I will analyze it to see if there is an association between cancer and exposure to 2,4-D in dogs. I will conduct a Chi-Square test for association using the two categorical variables provided: order and response. Order has two levels, which indicate whether or not the dog was exposed to 2,4-D. Response has two levels, which indicate whether or not the dog got cancer.

The link to the source: https://www.openintro.org/data/index.php?data=cancer_in_dogs. This dataset was retrieved from the site OpenIntro.org Data Sets.

References: https://www.openintro.org/data/index.php?data=cancer_in_dogs and https://npic.orst.edu/factsheets/24Dgen.html

1. Import data set

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.2.0     ✔ readr     2.1.6
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.2     ✔ tibble    3.3.1
## ✔ lubridate 1.9.5     ✔ tidyr     1.3.2
## ✔ purrr     1.2.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

cancer <- read_csv("C:/DATA101/cancer_in_dogs.csv")

## Rows: 1436 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): order, response
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

2. Explore data set with EDA Functions

str(cancer)

## spc_tbl_ [1,436 × 2] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ order   : chr [1:1436] "2,4-D" "2,4-D" "2,4-D" "2,4-D" ...
##  $ response: chr [1:1436] "cancer" "cancer" "cancer" "cancer" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   order = col_character(),
##   ..   response = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>

head(cancer)

## # A tibble: 6 × 2
##   order response
##   <chr> <chr>   
## 1 2,4-D cancer  
## 2 2,4-D cancer  
## 3 2,4-D cancer  
## 4 2,4-D cancer  
## 5 2,4-D cancer  
## 6 2,4-D cancer

summary(cancer)

##     order             response        
##  Length:1436        Length:1436       
##  Class :character   Class :character  
##  Mode  :character   Mode  :character
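Because both columns are character vectors, summary() only reports their length and class. A minimal sketch of a more informative check is to tabulate the levels of each variable. So that it runs without the CSV file, the example rebuilds the data from the four cell counts reported later in this analysis; the name cancer_demo is a hypothetical stand-in for cancer.

```r
# Rebuild the data from the four cell counts reported in this analysis
# (an assumption made so the example runs without the CSV file).
cell_sizes <- c(191, 304, 300, 641)
cancer_demo <- data.frame(
  order    = rep(c("2,4-D", "2,4-D", "no 2,4-D", "no 2,4-D"),
                 times = cell_sizes),
  response = rep(c("cancer", "no cancer", "cancer", "no cancer"),
                 times = cell_sizes)
)

# Frequency of each level, one variable at a time
order_counts    <- table(cancer_demo$order)
response_counts <- table(cancer_demo$response)

order_counts
response_counts
```

With the real data, table(cancer$order) and table(cancer$response) would give the same kind of summary.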

3. Clean data set

Analysis: After exploring my data set with EDA functions, I prepared it for analysis by cleaning it. Neither variable had NAs, so I could go straight to using my dplyr functions. I began by renaming order to exposure because the name order did not make much sense to me, and exposure better describes what the variable indicates: whether or not the dog was exposed to 2,4-D. Since this data set was pretty much ready to use from the get-go, all I did after that was group and arrange the data by response. I then created a table of my observed counts. I decided to use the Chi-Square test because I wanted to test for an association between two categorical variables, which is what the Chi-Square test for association is designed to do. I checked that all my expected counts were at least 5 to make sure I could conduct the Chi-Square test. They all were, so I conducted the test and created a bar graph from the table.

colSums(is.na(cancer))

##    order response 
##        0        0

Note: There are no NAs in either variable.

4. Analyze the data set

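# Rename order to exposure, then group and sort the rows by response
cancer_new <- cancer |>
  rename(exposure = order) |>
  group_by(response) |>
  arrange(response)

# Cross-tabulate exposure (rows) against response (columns)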

observed_dataset <- table(cancer_new$exposure, cancer_new$response)
observed_dataset

##           
##            cancer no cancer
##   2,4-D       191       304
##   no 2,4-D    300       641
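Because the two exposure groups differ in size (495 versus 941 dogs), the raw counts are easier to compare as row proportions. The sketch below rebuilds the observed table as a matrix (observed_demo is a hypothetical name used so the example runs on its own) and computes the share of dogs with cancer within each exposure group.

```r
# Observed counts copied from the table above; rows = exposure, cols = response
observed_demo <- matrix(
  c(191, 300, 304, 641),   # filled column by column
  nrow = 2,
  dimnames = list(exposure = c("2,4-D", "no 2,4-D"),
                  response = c("cancer", "no cancer"))
)

# margin = 1 divides each cell by its row total
row_props <- prop.table(observed_demo, margin = 1)
round(row_props, 3)
```

About 38.6% of the exposed dogs had cancer (191 of 495), compared with about 31.9% of the unexposed dogs (300 of 941).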

5. Chi-Square Test for Association

Expected count: (row total)(column total) / sample size

Row Total 2,4-D: 191 + 304 = 495

Row Total no 2,4-D: 300 + 641 = 941

Column Total cancer: 191 + 300 = 491

Column Total no cancer: 304 + 641 = 945

Total = 1436

Expected count for cancer and 2,4-D: (495 * 491)/1436 = 169.25

Expected count for no cancer and 2,4-D: (495 * 945)/1436 = 325.75

Expected count for cancer and no 2,4-D: (941 * 491)/1436 = 321.75

Expected count for no cancer and no 2,4-D: (941 * 945)/1436 = 619.25

All expected counts are much greater than 5, so the expected-count condition for the Chi-Square test is met.
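The hand calculation above can be cross-checked in R. This is a minimal sketch, assuming the observed counts are re-entered as a matrix (observed_demo is a hypothetical name); chisq.test() stores the expected counts in the expected element of its result, and the same values come from (row total)(column total) / sample size.

```r
# Observed counts; rows = exposure, cols = response
observed_demo <- matrix(
  c(191, 300, 304, 641),
  nrow = 2,
  dimnames = list(exposure = c("2,4-D", "no 2,4-D"),
                  response = c("cancer", "no cancer"))
)

# Expected counts as computed by chisq.test()
expected_demo <- chisq.test(observed_demo)$expected

# The same expected counts from the formula: outer product of the margins / n
expected_by_hand <- outer(rowSums(observed_demo), colSums(observed_demo)) /
  sum(observed_demo)

round(expected_demo, 2)

# Condition check: every expected count should be at least 5
all(expected_demo >= 5)
```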

\(H_0\) : 2,4-D exposure is not associated with cancer in dogs

\(H_a\) : 2,4-D exposure is associated with cancer in dogs

chisq.test(observed_dataset)

## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  observed_dataset
## X-squared = 6.1861, df = 1, p-value = 0.01288

The test has 1 degree of freedom because the data form a 2 x 2 contingency table: df = (rows - 1)(columns - 1) = (2 - 1)(2 - 1) = 1. Once the row and column totals are fixed, only one cell frequency can vary freely.

The p-value is 0.01288, which is less than the significance level of 0.05, so I reject \(H_0\): there is sufficient evidence of an association between exposure to 2,4-D and cancer in dogs.
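To show where the reported statistic comes from, the sketch below recomputes it by hand. For a 2 x 2 table, chisq.test() applies Yates' continuity correction by default, which subtracts 0.5 from each |observed - expected| before squaring. The names observed_demo, x2_yates, and p_value are hypothetical and used only for illustration.

```r
# Observed counts; rows = exposure, cols = response
observed_demo <- matrix(
  c(191, 300, 304, 641),
  nrow = 2,
  dimnames = list(exposure = c("2,4-D", "no 2,4-D"),
                  response = c("cancer", "no cancer"))
)

# Expected counts under H_0: (row total)(column total) / sample size
expected_demo <- outer(rowSums(observed_demo), colSums(observed_demo)) /
  sum(observed_demo)

# Yates-corrected statistic: sum of (|O - E| - 0.5)^2 / E over the four cells
x2_yates <- sum((abs(observed_demo - expected_demo) - 0.5)^2 / expected_demo)

# Upper-tail p-value from the chi-square distribution with df = 1
p_value <- pchisq(x2_yates, df = 1, lower.tail = FALSE)

x2_yates
p_value
```

Calling chisq.test(observed_demo, correct = FALSE) would give the uncorrected Pearson statistic instead.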

6. Visualization

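# Columns of the table (response) form the bar groups on the x-axis;
# rows (exposure) form the colored bars within each group.
barplot(observed_dataset,
        col = c("blue", "red"),
        beside = TRUE,
        main = "Exposure to 2,4-D vs Cancer",
        xlab = "Response",
        ylab = "Count")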

legend("topleft",
       legend = rownames(observed_dataset),
       fill = c("blue", "red"))
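Since the exposed and unexposed groups differ in size (495 versus 941 dogs), a plot of within-group proportions can make the comparison easier to read than raw counts. This is a hedged sketch of one alternative, not the plot produced above; observed_demo and cancer_props are hypothetical names, and the counts are re-entered so the example runs on its own.

```r
# Observed counts; rows = exposure, cols = response
observed_demo <- matrix(
  c(191, 300, 304, 641),
  nrow = 2,
  dimnames = list(exposure = c("2,4-D", "no 2,4-D"),
                  response = c("cancer", "no cancer"))
)

# Proportion of each response within each exposure group, then transpose
# so that exposure groups sit on the x-axis and response colors the bars
cancer_props <- t(prop.table(observed_demo, margin = 1))

barplot(cancer_props,
        beside = TRUE,
        col = c("blue", "red"),
        ylim = c(0, 1),
        main = "Cancer Status by 2,4-D Exposure",
        xlab = "Exposure",
        ylab = "Proportion of dogs",
        legend.text = rownames(cancer_props),
        args.legend = list(x = "topleft"))
```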

Conclusion

In conclusion, my Chi-Square test for association found sufficient evidence of an association between cancer in dogs and exposure to 2,4-D. My p-value was 0.01288, which isn’t extremely small but is smaller than the significance level of 0.05.

This information is important for people who own dogs and use these herbicides, because they may be placing their dogs’ lives at risk. It is also important for anyone who applies 2,4-D in a public space, such as a park or plantings near a sidewalk. Most dog owners walk their dogs in these parks and on these sidewalks, so the dogs may come into contact with a herbicide that is associated with a higher rate of cancer. The test shows an association; on its own it does not show that 2,4-D causes cancer.

For future research, studies could include other factors such as the dog’s breed or age. These factors may play a role in how much 2,4-D exposure raises a dog’s risk of cancer. Other avenues of research could include how 2,4-D impacts other household pets such as cats. Although many cats are indoor cats, there are still plenty of outdoor cats, which makes this research meaningful.