library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.2
## ✔ ggplot2   4.0.0     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(corrplot)
## corrplot 0.95 loaded
library(ggplot2)

# Load the dataset
avandia <- read_csv("avandia.csv")
## Rows: 227571 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): treatment, cardiovascular_problems
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Research Question: Does the type of diabetes medicine cause cardiovascular problems for elderly patients?

Introduction

The data set I chose to answer my research question is named avandia. It records cardiovascular problems among patients taking one of two diabetes medicines and contains 227,571 cases. The two medicines are Pioglitazone and Rosiglitazone, which are the two values of the treatment variable. The other variable, cardiovascular_problems, takes the value yes if the patient had a cardiovascular problem and no otherwise.

Source for the data set

“Cardiovascular Problems for Two Types of Diabetes Medicines.” Data Sets, www.openintro.org/data/index.php?data=avandia. Accessed 18 Nov. 2025.

Data Analysis

In my data analysis, I will use exploratory data analysis to identify which diabetes medicine is associated with more cardiovascular problems among elderly patients. At the end of the data analysis, I will show a bar graph of the proportion of patients who had cardiovascular problems, by the medicine they took. First, I will make sure the data is clean and has no missing values (NA). Next, I will prepare the data for the statistical analysis by creating variables that hold the number of patients who took each medicine and the number of patients in each medicine group who had cardiovascular problems.

Clean the data by removing rows where treatment or cardiovascular_problems is missing (NA).

avandia_clean <- avandia |>
  filter(!is.na(treatment)) |>
  filter(!is.na(cardiovascular_problems))
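
Before filtering, it can help to count the missing values in each column, so it is clear how many rows the cleaning step removes. The sketch below is only an illustration: `toy` is a small made-up tibble, not the real avandia data, and `drop_na()` from tidyr is shown as one possible equivalent of the two chained `filter()` calls.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data standing in for avandia.csv (values invented for illustration)
toy <- tibble(
  treatment = c("Pioglitazone", "Rosiglitazone", NA, "Rosiglitazone"),
  cardiovascular_problems = c("no", "yes", "no", NA)
)

# Number of missing values in each column
na_counts <- colSums(is.na(toy))
na_counts

# drop_na() removes rows with a missing value in any of the named columns,
# which gives the same rows as chaining two filter(!is.na(...)) calls
toy_clean <- toy |>
  drop_na(treatment, cardiovascular_problems)

nrow(toy_clean)
```

If every entry of `na_counts` is zero for the real data, the cleaned data set has the same number of rows as the original.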

Split the data into two data sets, one per medicine, and count the total number of patients in each group and the number who had cardiovascular problems.

pioglitazone <- avandia_clean |>
  filter(treatment == "Pioglitazone")

# Stored as a one-row tibble here; converted to a plain number below
total_pioglitazone <- count(pioglitazone)

pioglitazone_yes <- pioglitazone |>
  filter(cardiovascular_problems == "yes")

count(pioglitazone_yes)
## # A tibble: 1 × 1
##       n
##   <int>
## 1  5386
rosiglitazone <- avandia_clean |>
  filter(treatment == "Rosiglitazone")

count(rosiglitazone)
## # A tibble: 1 × 1
##       n
##   <int>
## 1 67593
rosiglitazone_yes <- rosiglitazone |>
  filter(cardiovascular_problems == "yes")

count(rosiglitazone_yes)
## # A tibble: 1 × 1
##       n
##   <int>
## 1  2593

Prepare variables for Statistical Analysis.

# nrow() returns the number of rows (patients) as a plain integer
pioglitazone_hasCP <- nrow(pioglitazone_yes)
total_pioglitazone <- nrow(pioglitazone)

rosiglitazone_hasCP <- nrow(rosiglitazone_yes)
total_rosiglitazone <- nrow(rosiglitazone)
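
The same four numbers could also be produced in one grouped summary instead of four separate objects. The sketch below is an alternative, not the approach used in this report. To stay self-contained it starts from the cell counts printed above. The Pioglitazone total of 159978 is not printed in the output, so it is an assumption here, derived as the 227571 total rows minus the 67593 Rosiglitazone rows.

```r
library(dplyr)

# Cell counts taken from the output above; the Pioglitazone total (159978)
# is an assumption derived as 227571 total rows minus 67593 Rosiglitazone rows
cell_counts <- tibble(
  treatment = c("Pioglitazone", "Pioglitazone", "Rosiglitazone", "Rosiglitazone"),
  cardiovascular_problems = c("yes", "no", "yes", "no"),
  n = c(5386, 159978 - 5386, 2593, 67593 - 2593)
)

# One row per medicine: patients with problems, group size, and proportion
summary_tbl <- cell_counts |>
  group_by(treatment) |>
  summarise(
    has_cp = sum(n[cardiovascular_problems == "yes"]),
    total = sum(n),
    prop = has_cp / total
  )
summary_tbl
```

With the row-level data, `avandia_clean |> count(treatment, cardiovascular_problems)` should give a table shaped like `cell_counts`, so the hand-typed counts would not be needed.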

Create a bar plot showing the proportion of patients in each medicine group who had cardiovascular problems.

ggplot(avandia_clean, aes(x = treatment, fill = cardiovascular_problems)) +
  geom_bar(position = "fill") +
  labs(
    title = "Proportion of Cardiovascular Problems by Treatment",
    x = "Treatment",
    y = "Proportion"
  ) +
  theme_minimal()

Statistical Analysis

Hypothesis: Does the proportion of patients who have cardiovascular problems differ between those who take the diabetes medicine Rosiglitazone and Pioglitazone?

\(H_0\): \(p_1\) = \(p_2\)

\(H_a\): \(p_1\) ≠ \(p_2\)

\(p_1\) = proportion of patients who got cardiovascular problems after taking Pioglitazone

\(p_2\) = proportion of patients who got cardiovascular problems after taking Rosiglitazone

prop.test(
  c(pioglitazone_hasCP, rosiglitazone_hasCP),
  c(total_pioglitazone, total_rosiglitazone),
  correct = FALSE, alternative = "two.sided"
)
## 
##  2-sample test for equality of proportions without continuity correction
## 
## data:  c(pioglitazone_hasCP, rosiglitazone_hasCP) out of c(total_pioglitazone, total_rosiglitazone)
## X-squared = 30.957, df = 1, p-value = 2.639e-08
## alternative hypothesis: two.sided
## 95 percent confidence interval:
##  -0.006391230 -0.002998432
## sample estimates:
##     prop 1     prop 2 
## 0.03366713 0.03836196
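
To see where these numbers come from, the test can be reproduced by hand in base R. This is a sketch of the standard two-proportion z-test, not code from the original analysis. It uses the counts shown earlier, and the Pioglitazone group size `n1 = 159978` is an assumption derived as 227571 total rows minus the 67593 Rosiglitazone rows. Because `correct = FALSE` was used, no continuity correction is applied.

```r
# Counts from the output above; n1 = 159978 is an assumption derived as
# 227571 total rows minus the 67593 Rosiglitazone rows
x1 <- 5386; n1 <- 159978   # Pioglitazone
x2 <- 2593; n2 <- 67593    # Rosiglitazone

p1 <- x1 / n1
p2 <- x2 / n2

# Pooled proportion under H0: p1 = p2
p_pool <- (x1 + x2) / (n1 + n2)

# z statistic with the pooled standard error; z^2 corresponds to X-squared
se_pool <- sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
z <- (p1 - p2) / se_pool
z^2

# Two-sided p-value from the standard normal distribution
p_value <- 2 * pnorm(-abs(z))
p_value

# The 95% confidence interval for p1 - p2 uses the unpooled standard error
se_unpooled <- sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
ci <- (p1 - p2) + c(-1, 1) * qnorm(0.975) * se_unpooled
ci
```

The pooled standard error is used for the test statistic because the null hypothesis assumes a single common proportion, while the confidence interval estimates the two proportions separately.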

The significance level is α = 0.05. The p-value is 2.639e-08, which is less than 0.05. Therefore, there is enough evidence to reject the null hypothesis and conclude that the proportion of patients who get cardiovascular problems differs between those who take Pioglitazone and those who take Rosiglitazone. The sample estimates show the direction of the difference: the rate is 3.37% for patients who took Pioglitazone and 3.84% for patients who took Rosiglitazone. The 95% confidence interval for \(p_1 - p_2\), from -0.0064 to -0.0030, lies entirely below zero, which is consistent with a higher rate for Rosiglitazone. The bar plot created in the Data Analysis stage shows the same pattern.

Conclusion and Future Directions

In conclusion, there is enough evidence to reject the null hypothesis: the proportion of patients who get cardiovascular problems differs significantly between those who take Pioglitazone and those who take Rosiglitazone. This answers our research question about how the type of diabetes medication could possibly cause cardiovascular problems. The two rates, 3.37% and 3.84%, are not far apart. The difference is still statistically significant because the data set is so large (227,571 patients), which makes even a small difference sufficient evidence of a difference in proportions between the two medications.
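
To put a number on how far apart the two rates are, the difference can be expressed in percentage points and as a risk ratio. This is a small illustrative sketch using the sample proportions reported by prop.test() above. The variable names are my own, and the risk ratio is an extra descriptive summary, not part of the hypothesis test.

```r
# Sample proportions reported by prop.test() above
p_pio  <- 0.03366713   # Pioglitazone
p_rosi <- 0.03836196   # Rosiglitazone

# Absolute difference in percentage points
abs_diff_pp <- (p_rosi - p_pio) * 100

# Risk ratio: rate for Rosiglitazone relative to Pioglitazone
risk_ratio <- p_rosi / p_pio

round(abs_diff_pp, 2)
round(risk_ratio, 2)
```

The absolute difference is about 0.47 percentage points and the risk ratio is about 1.14, so the rate in the Rosiglitazone group is roughly 14% higher in relative terms even though the gap is small in absolute terms.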

In this project, I explored only two variables. For better results, future research and analysis should take more variables into consideration, such as age, BMI, or family genetic history. That way, we could paint a better picture of what causes cardiovascular problems for patients.
