Final Project

library(dplyr)

cancer_in_dogs <- read.csv("cancer_in_dogs.csv")

Introduction:

Question: Is there a significant association between 2,4-D herbicide exposure and cancer occurrence in dogs?

This dataset is from OpenIntro.org, and it has 1,436 observations and 2 variables. It comes from a 1994 study that tested whether dogs exposed to the herbicide 2,4-dichlorophenoxyacetic acid (2,4-D) had an increased risk of cancer. The study examined 491 dogs that had cancer and 945 dogs without cancer as the control group. The column names are order and response. The order column shows whether the dog was exposed to 2,4-D, and the response column shows whether the dog had cancer. This herbicide is often used on lawns and gardens to control weeds. Dogs can be exposed to it in numerous ways, especially on walks, where they might sniff and lick treated grass. I chose this topic because I have a dog and want to keep him safe, especially knowing that small things like this can cause problems for him. This is the link to the dataset: https://www.openintro.org/data/index.php?data=cancer_in_dogs

Data Analysis

In this section, I use “str” to check the structure of the dataset. Everything looks good, and there is nothing unusual about the columns or rows. Next, I use “head” to look at the first few rows and check for confusing names or values. We can also see that order and response are both stored as character (chr) columns, which is appropriate for categorical variables. Then I use “colSums” with “is.na” to check for missing values (NAs), and there are none in the dataset. After that, I use three dplyr functions: filter, group_by, and summarize. The filter step keeps only the rows that meet a condition, in this case rows where order is either “2,4-D” (exposed) or “no 2,4-D” (not exposed). Every row has one of these two values, so no rows are dropped. The group_by step groups the data by response (cancer or no cancer), and summarize counts the total number of dogs in each group and how many of them were exposed. Finally, I use a chi-square test of independence to test the association.

### Check structure
str(cancer_in_dogs)

## 'data.frame':    1436 obs. of  2 variables:
##  $ order   : chr  "2,4-D" "2,4-D" "2,4-D" "2,4-D" ...
##  $ response: chr  "cancer" "cancer" "cancer" "cancer" ...

### Check head of dataset
head(cancer_in_dogs)

##   order response
## 1 2,4-D   cancer
## 2 2,4-D   cancer
## 3 2,4-D   cancer
## 4 2,4-D   cancer
## 5 2,4-D   cancer
## 6 2,4-D   cancer

### Check for N/A's
colSums(is.na(cancer_in_dogs))

##    order response 
##        0        0

### Dplyr functions
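cancer_in_dogs |>
  # Keep only rows with a recognized exposure level
  filter(order %in% c("2,4-D", "no 2,4-D")) |>
  # One group per outcome (cancer / no cancer)
  group_by(response) |>
  summarize(
    total_cases = n(),                     # dogs in each outcome group
    exposed_count = sum(order == "2,4-D")  # of those, dogs exposed to 2,4-D
  )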

## # A tibble: 2 × 3
##   response  total_cases exposed_count
##   <chr>           <int>         <int>
## 1 cancer            491           191
## 2 no cancer         945           304
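
The summary above reports only the exposed counts. A full 2x2 table of outcome by exposure could be sketched with base R's table(). To keep the example self-contained, the data frame below is rebuilt from the four cell counts used in this project, which is an assumption standing in for reading cancer_in_dogs.csv. The names `dogs` and `tab` are illustrative.

```r
# Rebuild a data frame with the same four cell counts used in this project
# (a stand-in for read.csv("cancer_in_dogs.csv"); names are illustrative)
dogs <- data.frame(
  order    = rep(c("2,4-D", "no 2,4-D", "2,4-D", "no 2,4-D"),
                 times = c(191, 300, 304, 641)),
  response = rep(c("cancer", "cancer", "no cancer", "no cancer"),
                 times = c(191, 300, 304, 641))
)

# Cross-tabulate outcome (rows) by exposure (columns)
tab <- table(dogs$response, dogs$order)
tab

# Same table with row and column totals added
addmargins(tab)
```

With the real data, replacing `dogs` with `cancer_in_dogs` should give the same table, and `tab` can be passed directly to chisq.test().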

Chi Square Test of Independence

Hypotheses

H₀: There is no association between herbicide exposure and cancer occurrence in dogs.

Hₐ: There is an association between herbicide exposure and cancer occurrence in dogs.

### Bar graph plot: Used AI (Allowed in instructions)
library(ggplot2)

# Create the data for plotting
# Counts from the contingency table:
# cancer/exposed = 191, cancer/not exposed = 300
# no cancer/exposed = 304, no cancer/not exposed = 641

plot_data <- data.frame(
  Cancer_Status = rep(c("Cancer", "No Cancer"), each = 2),
  Exposure = rep(c("Exposed", "Not Exposed"), 2),
  Count = c(191, 300, 304, 641)
)

# Create side-by-side bar graph (geom_col plots the counts as given)
ggplot(plot_data, aes(x = Cancer_Status, y = Count, fill = Exposure)) +
  geom_col(position = position_dodge(width = 0.8), width = 0.7) +
  labs(
    title = "Cancer Status by Herbicide Exposure in Dogs",
    x = "Cancer Status",
    y = "Number of Dogs",
    fill = "Exposure Status"
  ) +
  scale_fill_manual(values = c("Exposed" = "#E74C3C", "Not Exposed" = "#3498DB")) +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5, size = 14, face = "bold"),
    axis.title = element_text(size = 12),
    axis.text = element_text(size = 11),
    legend.position = "top"
  )

# Rows: cancer, no cancer. Columns: exposed, not exposed.
observed <- matrix(c(191, 300, 304, 641), nrow = 2, byrow = TRUE)

### Check expected cell count assumptions
expected <- chisq.test(observed, correct = FALSE)$expected
expected

##          [,1]     [,2]
## [1,] 169.2514 321.7486
## [2,] 325.7486 619.2514

### Test of independence
chisq.test(observed, correct = FALSE)

## 
##  Pearson's Chi-squared test
## 
## data:  observed
## X-squared = 6.4806, df = 1, p-value = 0.01091
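
As a check on the output above, the Pearson statistic can be recomputed by hand from the observed counts. This is a minimal sketch, not part of the original analysis; `dof` and `p_value` are illustrative names.

```r
# Observed counts: rows = cancer / no cancer, columns = exposed / not exposed
observed <- matrix(c(191, 300, 304, 641), nrow = 2, byrow = TRUE)

# Expected count for each cell = (row total * column total) / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Pearson statistic: sum over cells of (observed - expected)^2 / expected
x2 <- sum((observed - expected)^2 / expected)

# Degrees of freedom = (rows - 1) * (columns - 1)
dof <- (nrow(observed) - 1) * (ncol(observed) - 1)

# Upper-tail probability under the chi-square distribution
p_value <- pchisq(x2, df = dof, lower.tail = FALSE)

round(c(x2 = x2, dof = dof, p_value = p_value), 4)
```

The values should agree with chisq.test(observed, correct = FALSE), since that function uses the same formula when the continuity correction is turned off.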

Interpretation:

Looking at the results of the test, X² = 6.481 with 1 degree of freedom and a p-value of 0.0109. Since the p-value is less than 0.05, we reject the null hypothesis, meaning there is a statistically significant association between 2,4-D exposure and cancer in dogs. The size of X² tells us how strong the evidence against independence is, not how strong the association itself is. With 1,436 dogs, even a modest difference between the groups can be statistically significant. For the cell count assumption, all of the expected counts are greater than 5 (the smallest is about 169), so the chi-square approximation is appropriate. The plot shows that in the cancer group there are more unexposed dogs (300) than exposed dogs (191), a difference of 109. In the no cancer group the gap is much larger: 641 unexposed dogs versus 304 exposed dogs, a difference of 337. Non-exposure is more common in both groups, but exposed dogs make up a larger share of the cancer group than of the no cancer group.
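
To put a number on the strength of the association, two common effect-size summaries for a 2x2 table are the odds ratio and the phi coefficient. Neither is part of the original analysis. The sketch below is added as an illustration using the same counts, and the variable names are hypothetical.

```r
# Observed counts: rows = cancer / no cancer, columns = exposed / not exposed
observed <- matrix(c(191, 300, 304, 641), nrow = 2, byrow = TRUE)

# Odds of exposure within each outcome group
odds_cancer    <- observed[1, 1] / observed[1, 2]
odds_no_cancer <- observed[2, 1] / observed[2, 2]

# Odds ratio: how much higher the odds of exposure are among dogs with cancer
odds_ratio <- odds_cancer / odds_no_cancer

# Phi coefficient for a 2x2 table: sqrt(X-squared / n)
x2  <- unname(chisq.test(observed, correct = FALSE)$statistic)
phi <- sqrt(x2 / sum(observed))

round(c(odds_ratio = odds_ratio, phi = phi), 3)
```

The odds ratio comes out at about 1.34 and phi at about 0.07, which points to a real but fairly weak association.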

Conclusion and Future Directions

In all, we found a statistically significant association between herbicide exposure and cancer occurrence in dogs (X² = 6.48, p-value = 0.0109). There is a 6.7 percentage point difference in exposure rate between the two groups. I divided 191 (exposed + cancer) by 491 (total dogs with cancer), which gave 38.9%, then divided 304 (exposed + no cancer) by 945 (total dogs with no cancer), which gave 32.2%, so 38.9% - 32.2% = 6.7%. That 6.7 points means herbicide exposure is more common in dogs with cancer than in dogs without. This suggests that herbicide exposure (specifically 2,4-D in this case) is linked to higher cancer risk in dogs, although the dogs were not randomly assigned to exposure, so the study cannot show that the herbicide causes cancer. For future research, I think it would be interesting to dig deeper into factors like breed, age, diet, and health problems that the dog already has. This would give us more insight into what owners should look out for when caring for their dogs.
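
The percentage calculation above can be reproduced in a few lines of R. As a hedged extra that is not part of the original analysis, prop.test() from the stats package compares the two exposure proportions directly. Without the continuity correction, its statistic matches the Pearson chi-square for a 2x2 table. The variable names are illustrative.

```r
# Exposed dogs and total dogs in each outcome group
exposed <- c(cancer = 191, no_cancer = 304)
totals  <- c(cancer = 491, no_cancer = 945)

# Share of each group that was exposed to 2,4-D, as a percentage
exposure_rate <- exposed / totals
round(100 * exposure_rate, 1)

# Difference in exposure rate, in percentage points
diff_points <- unname(100 * (exposure_rate["cancer"] - exposure_rate["no_cancer"]))
round(diff_points, 1)

# Two-sample test of equal proportions, without continuity correction
result <- prop.test(exposed, totals, correct = FALSE)
result$statistic
result$conf.int  # confidence interval for the difference in proportions
```

The confidence interval gives a range of plausible values for the difference in exposure rates, which is more informative than the p-value alone.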