Project 2

Data Analysis

In this section, I cleaned and explored the overdose dataset to prepare it for statistical analysis. I started by loading the data and standardizing the column names, then filtered the dataset to keep only the rows for the male and female categories. I used dplyr functions such as filter() and select() to prepare the data for the comparison. I also calculated summary statistics to better understand the overdose death rates per 100,000 population for each sex. Lastly, to visualize the data, I created a boxplot comparing the death rate distributions of males and females, and a histogram of the overall distribution of death rates.

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.2
## ✔ ggplot2   4.0.0     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

setwd("~/Downloads")
overdose <- read_csv("Drug_overdose_death_rates__by_drug_type__sex__age__race__and_Hispanic_origin__United_States (2).csv")

## Rows: 6228 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): INDICATOR, PANEL, UNIT, STUB_NAME, STUB_LABEL, AGE, FLAG
## dbl (8): PANEL_NUM, UNIT_NUM, STUB_NAME_NUM, STUB_LABEL_NUM, YEAR, YEAR_NUM,...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

names(overdose) <- tolower(names(overdose))
names(overdose) <- gsub(" ", "_", names(overdose))
names(overdose) <- gsub("/", "_", names(overdose))
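The three renaming steps above can also be folded into one reusable function. This is a minimal sketch, assuming a hypothetical helper called `clean_names_simple()` (not part of the original analysis) and made-up column names for the demonstration:

```r
# Hypothetical helper: lowercase the names, then replace spaces or slashes with "_".
clean_names_simple <- function(x) {
  x <- tolower(x)
  gsub("[ /]", "_", x)
}

# Illustrative names only; the real file's headers may differ.
example_names <- c("STUB_LABEL", "Age Group", "Rate/100k")
cleaned <- clean_names_simple(example_names)
print(cleaned)
```

With the real data this would be applied as `names(overdose) <- clean_names_simple(names(overdose))`.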

Filter the rows and select the columns needed

overdose_sex <- overdose %>%
  filter(stub_label %in% c("Male", "Female")) %>%
  select(stub_label, estimate, year, age, panel)
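Because `panel`, `age`, and `year` are kept alongside `estimate`, it may help to check what each sex group is made of before comparing means. The sketch below uses a small made-up data frame (`toy_sex`, with illustrative labels and values, not rows from the real file) to show one way to count rows per sex and panel:

```r
library(dplyr)

# Toy stand-in for the filtered data; labels and values are illustrative only.
toy_sex <- data.frame(
  stub_label = c("Male", "Female", "Male", "Female", "Male", "Female"),
  panel      = c("Panel A", "Panel A", "Panel B", "Panel B", "Panel B", "Panel B"),
  age        = "All ages",
  year       = c(1999, 1999, 1999, 1999, 2000, 2000),
  estimate   = c(8.0, 4.0, 1.2, 0.4, 1.1, 0.3)
)

# Count rows per sex and panel to see how each group is composed.
composition <- toy_sex %>%
  count(stub_label, panel, name = "n_rows")

print(composition)
```

Running the same `count()` call on `overdose_sex` would show whether the 240 rows per sex come from one panel or several.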

Summary (EDA)

summary(overdose_sex$estimate)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.200   1.000   2.300   4.576   6.325  29.100

overdose_sex %>%
  group_by(stub_label) %>%
  summarise(
    mean_rate = mean(estimate, na.rm = TRUE),
    sd_rate = sd(estimate, na.rm = TRUE),
    min_rate = min(estimate, na.rm = TRUE),
    max_rate = max(estimate, na.rm = TRUE),
    n = n()
  )

## # A tibble: 2 × 6
##   stub_label mean_rate sd_rate min_rate max_rate     n
##   <chr>          <dbl>   <dbl>    <dbl>    <dbl> <int>
## 1 Female          3.15    3.42      0.2     14.4   240
## 2 Male            6.00    6.36      0.3     29.1   240
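The overall mean (4.576) is well above the median (2.300), which suggests a right-skewed distribution, so medians and interquartile ranges by sex could complement the means and standard deviations. A minimal base R sketch on made-up rates (the `toy` data frame is illustrative, not taken from the dataset):

```r
# Toy rates for illustration only.
toy <- data.frame(
  stub_label = rep(c("Female", "Male"), each = 5),
  estimate   = c(0.2, 1.0, 2.0, 3.0, 14.0, 0.3, 2.0, 4.0, 8.0, 29.0)
)

# Median and interquartile range per sex; both are less sensitive to extreme values.
medians <- tapply(toy$estimate, toy$stub_label, median)
iqrs    <- tapply(toy$estimate, toy$stub_label, IQR)

robust <- data.frame(
  stub_label  = names(medians),
  median_rate = as.numeric(medians),
  iqr_rate    = as.numeric(iqrs)
)
print(robust)
```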

Visualization

ggplot(overdose_sex, aes(x = stub_label, y = estimate, fill = stub_label)) +
  geom_boxplot() +
  labs(
    title = "Overdose Death Rates per 100,000 Population by Sex",
    x = "Sex",
    y = "Death Rate per 100,000 Population"
  ) +
  theme_minimal()

ggplot(overdose_sex, aes(x = estimate)) +
  geom_histogram(bins = 30, color = "black", fill = "skyblue") +
  labs(
    title = "Distribution of Overdose Death Rates",
    x = "Death Rate per 100,000 Population",
    y = "Count"
  ) +
  theme_minimal()

Statistical Analysis

To determine whether there is a significant difference in overdose death rates between males and females, I conducted an independent samples t-test. The quantitative variable in this analysis is the overdose death rate per 100,000 population, and the categorical variable is sex, which includes the groups “Male” and “Female”.

Hypotheses

\(H_0\): \(\mu_1 = \mu_2\)

\(H_a\): \(\mu_1 \neq \mu_2\)

\(\mu_1\) = mean overdose death rate for males

\(\mu_2\) = mean overdose death rate for females

t_test <- t.test(estimate ~ stub_label, data = overdose_sex)
t_test

## 
##  Welch Two Sample t-test
## 
## data:  estimate by stub_label
## t = -6.1193, df = 366.82, p-value = 2.414e-09
## alternative hypothesis: true difference in means between group Female and group Male is not equal to 0
## 95 percent confidence interval:
##  -3.767509 -1.934991
## sample estimates:
## mean in group Female   mean in group Male 
##             3.150417             6.001667

t_test$p.value

## [1] 2.414115e-09
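As a cross-check, the Welch statistic can be recomputed by hand from the group summaries reported earlier. This is a sketch under the assumption that the rounded standard deviations (3.42 and 6.36) are close enough to the exact ones, so the results should agree with the output above only approximately:

```r
# Summary statistics copied from the output above (SDs are rounded to 2 decimals).
mean_f <- 3.150417; mean_m <- 6.001667
sd_f   <- 3.42;     sd_m   <- 6.36
n_f    <- 240;      n_m    <- 240

# Variance of each group's sample mean.
v_f <- sd_f^2 / n_f
v_m <- sd_m^2 / n_m

# Welch t statistic for the Female - Male difference.
se     <- sqrt(v_f + v_m)
t_stat <- (mean_f - mean_m) / se

# Welch-Satterthwaite degrees of freedom.
df <- (v_f + v_m)^2 / (v_f^2 / (n_f - 1) + v_m^2 / (n_m - 1))

# Two-sided p-value and 95% confidence interval.
p_value <- 2 * pt(-abs(t_stat), df)
ci <- (mean_f - mean_m) + c(-1, 1) * qt(0.975, df) * se

print(c(t = t_stat, df = df, p = p_value))
print(ci)
```

Small discrepancies from the `t.test()` output are expected because the standard deviations were rounded before being reused here.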

The Welch two-sample t-test comparing mean overdose death rates between males and females gave a p-value of 2.41e-09, which is less than the significance level of α = 0.05. Because the p-value is below α, I reject the null hypothesis and conclude that there is strong statistical evidence supporting the alternative hypothesis: the mean overdose death rates for males and females are significantly different. The sample means show that males have a higher average overdose death rate (6.00 per 100,000) than females (3.15 per 100,000). The 95% confidence interval for the difference in means (female minus male, the order in which R reports the groups) ranges from −3.77 to −1.93. Because the interval does not include zero, the difference is statistically significant at the 5% level. These results indicate that, in this dataset, males have higher overdose death rates than females in the United States.
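The negative interval reflects group order: with a character grouping variable, `t.test()` orders the groups alphabetically, so the difference is Female minus Male. A minimal sketch on simulated data (the `toy` data frame and its parameters are assumptions for illustration) showing how setting the factor levels reports Male minus Female instead:

```r
set.seed(1)  # simulated data for illustration only

toy <- data.frame(
  stub_label = rep(c("Female", "Male"), each = 50),
  estimate   = c(rnorm(50, mean = 3, sd = 1), rnorm(50, mean = 6, sd = 2))
)

# Default: groups are ordered alphabetically, so the difference is Female - Male.
tt_fm <- t.test(estimate ~ stub_label, data = toy)

# Put "Male" first so the difference is Male - Female.
toy$stub_label <- factor(toy$stub_label, levels = c("Male", "Female"))
tt_mf <- t.test(estimate ~ stub_label, data = toy)

print(tt_fm$conf.int)
print(tt_mf$conf.int)
```

Only the sign and order of the interval endpoints change; the p-value and the conclusion stay the same.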

Conclusion and Future Analysis

This analysis showed a statistically significant difference in overdose death rates between males and females in the United States. The t-test gave a p-value less than α = 0.05, which led me to reject the null hypothesis and conclude that males and females do not share the same average overdose death rate. Males had a higher average rate than females. Future research should explore additional factors such as age group, race, or specific drug type to understand which populations face the highest risks. Looking at trends over time or expanding the analysis to include more demographic variables could give deeper insight into overdoses in the US and support prevention efforts.
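One possible starting point for the trend analysis is to compute the mean rate per year and sex, then the male-to-female ratio for each year. The sketch below uses made-up numbers (the `toy` data frame is hypothetical and not drawn from the dataset) and base R only:

```r
# Made-up rates for illustration only.
toy <- data.frame(
  stub_label = rep(c("Female", "Male"), times = 3),
  year       = rep(c(1999, 2008, 2017), each = 2),
  estimate   = c(2, 4, 3, 6, 5, 10)
)

# Mean rate for every year-by-sex combination (rows = years, columns = sexes).
by_year <- with(toy, tapply(estimate, list(year, stub_label), mean))

# Male-to-female rate ratio per year.
ratio_mf <- by_year[, "Male"] / by_year[, "Female"]

print(by_year)
print(ratio_mf)
```

The same pattern could be repeated with `age` or `panel` in place of `year` to compare groups within a single category.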

Reference: Drug overdose death rates, by drug type, sex, age, race, and Hispanic origin: United States https://catalog.data.gov/dataset/drug-overdose-death-rates-by-drug-type-sex-age-race-and-hispanic-origin-united-states-3f72f

Project 2

Jathiya Hamidi

Introduction

Data Analysis