Introduction

For this assignment, we used the heart disease dataset to explore how patient characteristics relate to heart disease. The dataset includes clinical and demographic variables such as age, sex, resting blood pressure, cholesterol, maximum heart rate, exercise-induced angina, and oldpeak, which make it possible to compare patients with and without heart disease. Using R, we cleaned, organized, and analyzed the dataset with summary statistics, graphs, and Pearson correlation. The purpose of the analysis was to identify which variables provide the most meaningful insight into heart disease in this dataset.

Key Terms Used in the Dataset

Before presenting the analysis, a few key terms from the dataset are defined below for clarity. Some of the variable names are abbreviated clinical terms, so these definitions help make the results easier to understand.

  • target – indicates whether heart disease is present (1 = disease, 0 = no disease)
  • age – age of the patient in years
  • sex – sex of the patient (1 = male, 0 = female)
  • cp – chest pain type
  • trestbps – resting blood pressure
  • chol – serum cholesterol level
  • fbs – fasting blood sugar greater than 120 mg/dL (1 = true, 0 = false)
  • restecg – resting electrocardiographic results
  • thalach – maximum heart rate achieved during exercise
  • exang – exercise-induced angina (1 = yes, 0 = no)
  • oldpeak – ST depression induced by exercise relative to rest
  • slope – the slope of the peak exercise ST segment
  • ca – number of major blood vessels colored by fluoroscopy
  • thal – categorical test result, generally understood to come from the thallium stress test (some descriptions of this dataset label it thalassemia)

Not in the dataset but important to understand:

  • ST segment – a part of the ECG wave that helps show how the heart is responding electrically after a heartbeat. Abnormal changes in this part of the tracing can sometimes indicate heart-related problems.
  • ST depression – when the ST segment on an ECG drops below its usual level. This may indicate an abnormal heart response during exercise or stress.

2. List the Variables in the Dataset

# Used by Mae, Ola, and Mencha
names(heart)
##  [1] "age"      "sex"      "cp"       "trestbps" "chol"     "fbs"     
##  [7] "restecg"  "thalach"  "exang"    "oldpeak"  "slope"    "ca"      
## [13] "thal"     "target"

4. Write a User-Defined Function Using Any Variable from the Dataset

# Based on the combined group direction
# Simple function to classify resting blood pressure
bp_category <- function(bp) {
  if (bp < 120) {
    return("Normal")
  } else if (bp < 140) {
    return("Elevated")
  } else {
    return("High")
  }
}

bp_category(130)
## [1] "Elevated"
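
The function above relies on `if`, which tests a single value, so it does not work element-wise on a whole column such as `heart$trestbps`. A minimal sketch of a vectorized variant using base R's `cut()` is shown below. The name `bp_category_vec` and the sample readings are illustrative assumptions, and the cut points simply mirror the ones in `bp_category`.

```r
# Hypothetical vectorized variant of bp_category(); same cut points as above.
bp_category_vec <- function(bp) {
  # right = FALSE makes each interval closed on the left:
  # [-Inf, 120), [120, 140), [140, Inf)
  as.character(cut(bp,
                   breaks = c(-Inf, 120, 140, Inf),
                   labels = c("Normal", "Elevated", "High"),
                   right = FALSE))
}

# Made-up readings, including the boundary values 120 and 140
bp_labels <- bp_category_vec(c(110, 120, 130, 140, 165))
bp_labels
```

With a function like this, a category column could be added in one call, for example `heart$bp_cat <- bp_category_vec(heart$trestbps)` (the column name `bp_cat` is also hypothetical).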

5. Use Data Manipulation Techniques and Filter Rows Based on Logical Criteria

# Ola's code idea:
# filter high-risk male patients using sex, cholesterol, and age
high_risk_males <- heart %>%
  filter(sex == 1, chol > 240, age > 55)

cat("Number of high-risk male patients:", nrow(high_risk_males), "\n")
## Number of high-risk male patients: 164
head(high_risk_males, 10)
##    age sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca thal
## 1   58   1  0      114  318   0       2     140     0     4.4     0  3    1
## 2   56   1  2      130  256   1       0     142     1     0.6     1  1    1
## 3   70   1  2      160  269   0       1     112     1     2.9     1  1    3
## 4   59   1  0      138  271   0       0     182     0     0.0     2  0    2
## 5   64   1  0      128  263   0       1     105     1     0.2     1  1    3
## 6   67   1  0      100  299   0       0     125     1     0.9     1  2    2
## 7   59   1  3      170  288   0       0     159     0     0.2     1  0    3
## 8   59   1  0      170  326   0       0     140     1     3.4     0  0    3
## 9   56   1  0      125  249   1       0     144     1     1.2     1  1    2
## 10  65   1  0      110  248   0       0     158     0     0.6     2  2    1
##    target
## 1       0
## 2       0
## 3       0
## 4       1
## 5       1
## 6       0
## 7       0
## 8       0
## 9       0
## 10      0

6. Identify the Dependent and Independent Variables, Use Reshaping Techniques, and Create a New Data Frame

# Mae's code idea:
# target as dependent variable
dependentvariable <- data.frame(Target = heart$target)

head(dependentvariable, n = 10)
##    Target
## 1       0
## 2       0
## 3       0
## 4       0
## 5       0
## 6       1
## 7       0
## 8       0
## 9       0
## 10      0
# Ola's code idea:
# reshape selected variables into long format
vars <- heart %>%
  select(age, thalach, oldpeak, exang, target)

heart_long <- vars %>%
  pivot_longer(
    cols = c(thalach, oldpeak, exang),
    names_to = "variable",
    values_to = "value"
  )

head(heart_long, 15)
## # A tibble: 15 × 4
##      age target variable value
##    <int>  <int> <chr>    <dbl>
##  1    52      0 thalach  168  
##  2    52      0 oldpeak    1  
##  3    52      0 exang      0  
##  4    53      0 thalach  155  
##  5    53      0 oldpeak    3.1
##  6    53      0 exang      1  
##  7    70      0 thalach  125  
##  8    70      0 oldpeak    2.6
##  9    70      0 exang      1  
## 10    61      0 thalach  161  
## 11    61      0 oldpeak    0  
## 12    61      0 exang      0  
## 13    62      0 thalach  106  
## 14    62      0 oldpeak    1.9
## 15    62      0 exang      0
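
The step above builds a data frame for the dependent variable only. As a rough sketch of the other half, the independent variables can be taken as every column except the outcome. The small data frame `toy` below is made up to stand in for `heart`, and the object names are illustrative assumptions.

```r
# Made-up stand-in for `heart` with only a few columns
toy <- data.frame(
  age     = c(52, 61),
  thalach = c(168, 161),
  target  = c(0, 1)
)

# Everything except the outcome column is treated as an independent variable.
# drop = FALSE keeps the result a data frame even if only one column remains.
independent_vars <- toy[, setdiff(names(toy), "target"), drop = FALSE]
dependent_var    <- toy[, "target", drop = FALSE]

names(independent_vars)
names(dependent_var)
```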

7. Remove Missing Values in the Dataset

# Ola and Mencha both checked missing values
colSums(is.na(heart))
##      age      sex       cp trestbps     chol      fbs  restecg  thalach 
##        0        0        0        0        0        0        0        0 
##    exang  oldpeak    slope       ca     thal   target 
##        0        0        0        0        0        0
heart_clean <- na.omit(heart)

cat("Rows before:", nrow(heart), "| Rows after:", nrow(heart_clean))
## Rows before: 1025 | Rows after: 1025

8. Identify and Remove Duplicated Data in the Dataset

# Ola and Mencha both removed duplicates
cat("Number of duplicated rows:", sum(duplicated(heart_clean)), "\n")
## Number of duplicated rows: 723
heart_clean <- heart_clean[!duplicated(heart_clean), ]

cat("Rows after removing duplicates:", nrow(heart_clean))
## Rows after removing duplicates: 302
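
Because the filter in step 5 ran on `heart` before duplicates were removed, its count of 164 includes repeated rows. The sketch below uses a small made-up data frame (not the real dataset) to show how the same filter can give a different count once duplicates are dropped. The helper name `is_high_risk_male` is an assumption for illustration.

```r
# Made-up data: row 3 is an exact copy of row 1, and row 5 of row 2
toy <- data.frame(
  age  = c(60, 58, 60, 45, 58),
  sex  = c(1, 1, 1, 0, 1),
  chol = c(250, 300, 250, 210, 300)
)

# Same logical criteria as the step 5 filter
is_high_risk_male <- function(df) df$sex == 1 & df$chol > 240 & df$age > 55

n_before   <- sum(is_high_risk_male(toy))   # duplicates are counted
toy_unique <- toy[!duplicated(toy), ]       # same idiom as above
n_after    <- sum(is_high_risk_male(toy_unique))

cat("Before dedup:", n_before, "| After dedup:", n_after, "\n")
```

Running the filter on `heart_clean` instead of `heart` would apply the same idea to the real data.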

9. Reorder Multiple Rows in Descending Order

# Mencha's code idea:
# sort by cholesterol and age
heart_sorted <- heart_clean %>%
  arrange(desc(chol), desc(age))

head(heart_sorted, 10)
##    age sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca thal
## 1   67   0  2      115  564   0       0     160     0     1.6     1  0    3
## 2   65   0  2      140  417   1       0     157     0     0.8     2  1    2
## 3   56   0  0      134  409   0       0     150     1     1.9     1  2    3
## 4   63   0  0      150  407   0       0     154     0     4.0     1  3    3
## 5   62   0  0      140  394   0       0     157     0     1.2     1  0    2
## 6   65   0  2      160  360   0       0     151     0     0.8     2  0    2
## 7   57   0  0      120  354   0       1     163     1     0.6     2  0    2
## 8   55   1  0      132  353   0       1     132     1     1.2     1  1    3
## 9   55   0  1      132  342   0       1     166     0     1.2     2  0    2
## 10  43   0  0      132  341   1       0     136     1     3.0     1  0    3
##    target
## 1       1
## 2       1
## 3       0
## 4       0
## 5       1
## 6       1
## 7       1
## 8       0
## 9       1
## 10      0

10. Rename Some of the Column Names in the Dataset

# Mencha and Ola both renamed columns for readability
heart_renamed <- heart_clean %>%
  rename(
    Age = age,
    Sex = sex,
    ChestPainType = cp,
    RestingBP = trestbps,
    Cholesterol = chol,
    FastingBS = fbs,
    RestECG = restecg,
    MaxHeartRate = thalach,
    ExerciseAngina = exang,
    STDepression = oldpeak,
    HeartDisease = target
  )

names(heart_renamed)
##  [1] "Age"            "Sex"            "ChestPainType"  "RestingBP"     
##  [5] "Cholesterol"    "FastingBS"      "RestECG"        "MaxHeartRate"  
##  [9] "ExerciseAngina" "STDepression"   "slope"          "ca"            
## [13] "thal"           "HeartDisease"

11. Add New Variables in the Data Frame Using a Mathematical Function

# Ola's code idea:
# create Age_Group
heart_clean$Age_Group <- case_when(
  heart_clean$age < 40 ~ "Young",
  heart_clean$age >= 40 & heart_clean$age < 60 ~ "Middle-aged",
  TRUE ~ "Senior"
)

# Mencha's code idea:
# create cholesterol category and mathematical variables
heart_clean$chol_category <- case_when(
  heart_clean$chol < 200 ~ "Desirable",
  heart_clean$chol >= 200 & heart_clean$chol < 240 ~ "Borderline High",
  TRUE ~ "High"
)

heart_clean$chol_double <- heart_clean$chol * 2
heart_clean$bp_hr_ratio <- round(heart_clean$trestbps / heart_clean$thalach, 3)

# Ola's code idea:
# create a risk score
heart_clean$risk_score <- 0.3 * heart_clean$age +
                          0.4 * heart_clean$chol +
                          0.3 * heart_clean$trestbps

head(heart_clean[, c("age", "chol", "chol_double", "trestbps", "thalach", "bp_hr_ratio", "risk_score")], 10)
##    age chol chol_double trestbps thalach bp_hr_ratio risk_score
## 1   52  212         424      125     168       0.744      137.9
## 2   53  203         406      140     155       0.903      139.1
## 3   70  174         348      145     125       1.160      134.1
## 4   61  203         406      148     161       0.919      143.9
## 5   62  294         588      138     106       1.302      177.6
## 6   58  248         496      100     122       0.820      146.6
## 7   58  318         636      114     140       0.814      178.8
## 8   55  289         578      160     145       1.103      180.1
## 9   46  249         498      120     144       0.833      149.4
## 10  54  286         572      122     116       1.052      167.2

12. Create a Training Set Using Random Number Generator Engine

# Ola's code idea
set.seed(1234)

# floor() makes the 70% sample size an explicit whole number (211 of 302 rows)
train_index <- sample(seq_len(nrow(heart_clean)), size = floor(0.70 * nrow(heart_clean)))
TrainingSet <- heart_clean[train_index, ]
TestingSet  <- heart_clean[-train_index, ]

dim(TrainingSet)
## [1] 211  19
dim(TestingSet)
## [1] 91 19
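
A quick sanity check after a random split is to confirm that the two sets do not overlap and to compare the share of disease cases in each. The sketch below does this on a made-up data frame with the same number of rows as `heart_clean`; the `id` column and the alternating `target` values are assumptions for illustration only.

```r
set.seed(1234)

# Made-up stand-in: 302 rows with alternating outcome values
toy <- data.frame(id = seq_len(302), target = rep(c(0, 1), length.out = 302))

n_train     <- floor(0.70 * nrow(toy))
train_index <- sample(seq_len(nrow(toy)), size = n_train)
train <- toy[train_index, ]
test  <- toy[-train_index, ]

# No row should appear in both sets, and together they should cover every row
stopifnot(length(intersect(train$id, test$id)) == 0)
stopifnot(nrow(train) + nrow(test) == nrow(toy))

# Share of disease cases (target == 1) in each set
round(c(train = mean(train$target), test = mean(test$target)), 3)
```

If the two shares differ a lot, a stratified split may be worth considering, though that goes beyond what this assignment required.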

14. Perform Statistical Functions: Mean, Median, Mode, and Range

# Mencha and Mae both used a custom mode function
get_mode <- function(x) {
  uniq_vals <- unique(x)
  uniq_vals[which.max(tabulate(match(x, uniq_vals)))]
}

cat("=== Cholesterol ===\n")
## === Cholesterol ===
cat("Mean: ", mean(heart_clean$chol), "\n")
## Mean:  246.5
cat("Median: ", median(heart_clean$chol), "\n")
## Median:  240.5
cat("Mode: ", get_mode(heart_clean$chol), "\n")
## Mode:  204
cat("Range: ", range(heart_clean$chol), "\n\n")
## Range:  126 564
cat("=== Maximum Heart Rate ===\n")
## === Maximum Heart Rate ===
cat("Mean: ", mean(heart_clean$thalach), "\n")
## Mean:  149.5695
cat("Median: ", median(heart_clean$thalach), "\n")
## Median:  152.5
cat("Mode: ", get_mode(heart_clean$thalach), "\n")
## Mode:  162
cat("Range: ", range(heart_clean$thalach), "\n")
## Range:  71 202
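
One caveat about `get_mode`: when two or more values are tied for most frequent, `which.max` returns only the first of them. The sketch below, on a made-up vector, compares it with a hypothetical helper `get_all_modes` that returns every tied value. The helper assumes numeric input, since it converts the names produced by `table()` back to numbers.

```r
# Copy of the function used above, repeated so this sketch runs on its own
get_mode <- function(x) {
  uniq_vals <- unique(x)
  uniq_vals[which.max(tabulate(match(x, uniq_vals)))]
}

# Hypothetical helper: returns all values tied for the highest count
get_all_modes <- function(x) {
  counts <- table(x)
  # names(counts) are character strings, so convert back to numeric
  as.numeric(names(counts)[counts == max(counts)])
}

# Made-up vector: 7 and 3 each appear twice, 5 appears once
x <- c(7, 3, 7, 3, 5)

single_mode <- get_mode(x)       # only the first tied value, 7
all_modes   <- get_all_modes(x)  # both tied values, in sorted order
single_mode
all_modes
```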

15. Plot a Scatter Plot for Any 2 Variables in the Dataset

# Combined idea
# instead of the weaker age vs cholesterol version
ggplot(heart_clean, aes(x = age, y = thalach, color = as.factor(target))) +
  geom_point(alpha = 0.6, size = 2) +
  labs(
    title = "Scatter Plot: Age vs Maximum Heart Rate",
    x = "Age",
    y = "Maximum Heart Rate",
    color = "Heart Disease"
  ) +
  scale_color_discrete(labels = c("No Disease", "Disease")) +
  theme_minimal()

Analysis: Age vs. Maximum Heart Rate

The scatter plot shows the relationship between age and maximum heart rate, with points colored by heart disease status. The points trend downward: as age increases, maximum heart rate tends to decrease, so the two variables are inversely related. The coloring shows how patients with and without heart disease are distributed across that trend. Because the plot relates maximum heart rate to age rather than directly to disease status, it suggests, but does not by itself establish, that maximum heart rate is a useful variable for examining heart disease in this dataset.

16. Plot a Bar Plot for Any 2 Variables in the Dataset

# Combined idea

bar_data <- heart_clean %>%
  group_by(target) %>%
  summarise(avg_oldpeak = mean(oldpeak))

ggplot(bar_data, aes(x = factor(target), y = avg_oldpeak, fill = factor(target))) +
  geom_col() +
  labs(
    title = "Average Oldpeak by Heart Disease Status",
    x = "Heart Disease Status",
    y = "Average Oldpeak",
    fill = "Heart Disease Status"
  ) +
  scale_x_discrete(labels = c("No Disease", "Disease")) +
  scale_fill_discrete(labels = c("No Disease", "Disease")) +
  theme_minimal()

Analysis: Average Oldpeak by Heart Disease Status

The bar plot compares the average oldpeak of patients with and without heart disease. The no-disease group has the higher average oldpeak in this dataset, so oldpeak does differ between the groups, but a difference in averages does not show how much the two groups overlap. Oldpeak on its own is therefore unlikely to be enough to explain heart disease status here, and heart disease is probably better understood by looking at several variables together rather than relying on a single measure.
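
An average can be pulled up by a few large values, so it can help to look at the median next to the mean for each group. The sketch below uses made-up oldpeak values (not taken from the dataset) in which the group means differ while the medians are identical. The object names are illustrative assumptions.

```r
# Made-up oldpeak values for two groups (0 = no disease, 1 = disease)
toy <- data.frame(
  target  = c(0, 0, 0, 1, 1, 1),
  oldpeak = c(0.0, 1.0, 5.0, 0.5, 1.0, 1.5)
)

# Mean and median of oldpeak within each target group
group_means   <- tapply(toy$oldpeak, toy$target, mean)
group_medians <- tapply(toy$oldpeak, toy$target, median)

# One row per statistic, one column per group
rbind(mean = group_means, median = group_medians)
```

The same two `tapply()` calls could be run on `heart_clean$oldpeak` and `heart_clean$target` to see whether the difference in the bar plot also holds for the medians.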

17. Find the Correlation Between Any 2 Variables by Applying Pearson Correlation

# combined idea
# age vs maximum heart rate 
cor_value <- cor(heart_clean$age, heart_clean$thalach, method = "pearson")
cor_value
## [1] -0.3952352

Analysis: Correlation Between Age and Maximum Heart Rate

The Pearson correlation coefficient between age and maximum heart rate is -0.3952352 (about -0.40), which indicates a moderate negative linear relationship: older patients in this dataset generally had lower maximum heart rates than younger patients. Squaring the coefficient gives about 0.16, so a straight-line relationship with age accounts for roughly 16% of the variation in maximum heart rate. The relationship is not very strong, but it is strong enough to show a noticeable pattern, and it agrees with the downward trend in the scatter plot.
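
To make the coefficient less of a black box, the sketch below computes Pearson's r by hand and compares it with `cor()`. The four age and heart rate values are made up purely to illustrate the arithmetic, and the variable names are assumptions.

```r
# Made-up values, chosen only to illustrate the calculation
age     <- c(40, 50, 60, 70)
thalach <- c(180, 165, 160, 140)

# Deviations of each value from its own mean
dx <- age - mean(age)
dy <- thalach - mean(thalach)

# Pearson's r: sum of cross-products divided by the square root of the
# product of the two sums of squared deviations
r_manual <- sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))

# Should agree with the built-in function
all.equal(r_manual, cor(age, thalach, method = "pearson"))

# Squaring r gives the share of variance a straight-line fit accounts for
r_manual^2
```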

Conclusion

Based on our analysis, heart disease in this dataset does not seem to be explained by a single factor such as cholesterol. A clearer picture came from looking at several variables together, especially age, maximum heart rate, and oldpeak, with resting blood pressure and exercise-induced angina as further candidates. The results suggest that exercise-related and cardiovascular response variables may be especially helpful in distinguishing patients with and without heart disease. For example, the moderate negative correlation between age and maximum heart rate (about -0.40) indicates that older patients generally reached lower maximum heart rates in this dataset. Overall, the findings suggest that heart disease is associated with a combination of factors, and this assignment demonstrated how R can be used to clean, organize, and analyze healthcare data in a meaningful way.