1. Loading Libraries and Dataset

library(dplyr)
library(tidyr)
library(ggplot2)
library(reshape2)
library(caret)

heart <- read.csv("heart.csv")

2. Printing the Structure of the Dataset

str(heart)

## 'data.frame':    1025 obs. of  14 variables:
##  $ age     : int  52 53 70 61 62 58 58 55 46 54 ...
##  $ sex     : int  1 1 1 1 0 0 1 1 1 1 ...
##  $ cp      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ trestbps: int  125 140 145 148 138 100 114 160 120 122 ...
##  $ chol    : int  212 203 174 203 294 248 318 289 249 286 ...
##  $ fbs     : int  0 1 0 0 1 0 0 0 0 0 ...
##  $ restecg : int  1 0 1 1 1 0 2 0 0 0 ...
##  $ thalach : int  168 155 125 161 106 122 140 145 144 116 ...
##  $ exang   : int  0 1 1 0 0 0 0 1 0 1 ...
##  $ oldpeak : num  1 3.1 2.6 0 1.9 1 4.4 0.8 0.8 3.2 ...
##  $ slope   : int  2 0 0 2 1 1 0 1 2 1 ...
##  $ ca      : int  2 0 0 1 3 0 3 1 0 2 ...
##  $ thal    : int  3 3 3 3 2 2 1 3 3 2 ...
##  $ target  : int  0 0 0 0 0 1 0 0 0 0 ...

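The str() function reveals the data types and dimensions of the dataset: 1025 observations of 14 variables. Thirteen variables are stored as integers (e.g., age, sex, cp); only oldpeak is stored as a continuous numeric value, representing ST depression. Several of the integer columns, such as sex, cp, and target, are category codes rather than measured quantities.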
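
The same information can be pulled out programmatically when a compact tally of column types is more convenient than reading the str() listing. The following is a minimal sketch on a hypothetical three-column toy data frame (not the real heart.csv), using only base R:

```r
# Toy data frame mimicking the types of three heart columns (values are illustrative).
toy <- data.frame(
  age     = c(52L, 53L, 70L),
  sex     = c(1L, 1L, 0L),
  oldpeak = c(1.0, 3.1, 2.6)
)

dim(toy)                            # number of rows, then number of columns
col_classes <- sapply(toy, class)   # named character vector: one class per column
col_classes
table(col_classes)                  # how many columns fall into each class
```

Applied to heart, the same calls would be expected to report 13 integer columns and 1 numeric column, matching the str() output above.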

3. Listing the Variables in the Dataset

names(heart)

##  [1] "age"      "sex"      "cp"       "trestbps" "chol"     "fbs"     
##  [7] "restecg"  "thalach"  "exang"    "oldpeak"  "slope"    "ca"      
## [13] "thal"     "target"

The names() function returns all 14 column names in the dataset, representing clinical attributes collected during patient cardiac assessments.

4. Printing the Top 15 Rows of the Dataset

head(heart, 15)

##    age sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca thal
## 1   52   1  0      125  212   0       1     168     0     1.0     2  2    3
## 2   53   1  0      140  203   1       0     155     1     3.1     0  0    3
## 3   70   1  0      145  174   0       1     125     1     2.6     0  0    3
## 4   61   1  0      148  203   0       1     161     0     0.0     2  1    3
## 5   62   0  0      138  294   1       1     106     0     1.9     1  3    2
## 6   58   0  0      100  248   0       0     122     0     1.0     1  0    2
## 7   58   1  0      114  318   0       2     140     0     4.4     0  3    1
## 8   55   1  0      160  289   0       0     145     1     0.8     1  1    3
## 9   46   1  0      120  249   0       0     144     0     0.8     2  0    3
## 10  54   1  0      122  286   0       0     116     1     3.2     1  2    2
## 11  71   0  0      112  149   0       1     125     0     1.6     1  0    2
## 12  43   0  0      132  341   1       0     136     1     3.0     1  0    3
## 13  34   0  1      118  210   0       1     192     0     0.7     2  0    2
## 14  51   1  0      140  298   0       1     122     1     4.2     1  3    3
## 15  52   1  0      128  204   1       1     156     1     1.0     1  0    0
##    target
## 1       0
## 2       0
## 3       0
## 4       0
## 5       0
## 6       1
## 7       0
## 8       0
## 9       0
## 10      0
## 11      1
## 12      0
## 13      1
## 14      0
## 15      0

The head() function displays the first 15 rows of the dataset, providing a quick preview of the data values across all variables.

5. User-Defined Function

classify_cholesterol <- function(chol_value) {
  category <- ifelse(chol_value < 200, "Desirable",
              ifelse(chol_value < 240, "Borderline High", "High"))
  return(category)
}

heart$chol_category <- classify_cholesterol(heart$chol)
table(heart$chol_category)

## 
## Borderline High       Desirable            High 
##             339             169             517

This custom function classifies each patient’s cholesterol into clinical risk categories based on medical guidelines: Desirable (below 200 mg/dL), Borderline High (200–239 mg/dL), or High (240+ mg/dL). The table() output shows the distribution across these categories.
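
As a possible alternative to nested ifelse() calls, the same thresholds can be expressed with base R's cut(). This is only a sketch: the function name classify_cholesterol_cut is hypothetical and is not part of the original analysis.

```r
# Hypothetical alternative to classify_cholesterol(), built on cut().
classify_cholesterol_cut <- function(chol_value) {
  cut(chol_value,
      breaks = c(-Inf, 200, 240, Inf),
      labels = c("Desirable", "Borderline High", "High"),
      right  = FALSE)   # intervals are [lower, upper), so 200 falls in Borderline High
}

# Check the behaviour exactly at the category boundaries.
boundary_values <- c(199, 200, 239, 240)
boundary_labels <- as.character(classify_cholesterol_cut(boundary_values))
boundary_labels
```

Because cut() returns a factor with levels in the order given, table() on its result lists the categories from Desirable to High rather than alphabetically.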

6. Data Manipulation: Filtering Rows Based on Logical Criteria

high_risk_males <- heart %>%
  filter(sex == 1, trestbps > 140, age > 50)

cat("Number of high-risk male patients:", nrow(high_risk_males), "\n")

## Number of high-risk male patients: 120

head(high_risk_males, 10)

##    age sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca thal
## 1   70   1  0      145  174   0       1     125     1     2.6     0  0    3
## 2   61   1  0      148  203   0       1     161     0     0.0     2  1    3
## 3   55   1  0      160  289   0       0     145     1     0.8     1  1    3
## 4   70   1  2      160  269   0       1     112     1     2.9     1  1    3
## 5   67   1  2      152  212   0       0     150     0     0.8     1  0    3
## 6   57   1  1      154  232   0       0     164     0     0.0     2  1    2
## 7   59   1  2      150  212   1       1     157     0     1.6     2  0    2
## 8   59   1  3      170  288   0       0     159     0     0.2     1  0    3
## 9   59   1  0      170  326   0       0     140     1     3.4     0  0    3
## 10  68   1  0      144  193   1       1     141     0     3.4     1  2    3
##    target   chol_category
## 1       0       Desirable
## 2       0 Borderline High
## 3       0            High
## 4       0            High
## 5       0 Borderline High
## 6       0 Borderline High
## 7       1 Borderline High
## 8       0            High
## 9       0            High
## 10      0       Desirable

This filters male patients (sex == 1) over age 50 with resting blood pressure above 140 mmHg, a clinically relevant criterion for identifying hypertensive males at elevated cardiovascular risk. The filter runs on the raw data, before duplicate rows are removed in step 9, so the count of 120 may include repeated records. head() limits the output to 10 rows to keep the report concise.
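
Comma-separated conditions inside filter() are combined with a logical AND. A base R sketch of the same logic on a small invented data frame (the four rows are illustrative, not taken from heart.csv) makes the strict inequalities visible:

```r
# Toy rows chosen so that each condition excludes exactly one patient.
toy <- data.frame(
  age      = c(70L, 45L, 61L, 66L),
  sex      = c(1L, 1L, 0L, 1L),
  trestbps = c(145L, 150L, 160L, 140L)
)

# Equivalent of filter(sex == 1, trestbps > 140, age > 50) in base R.
high_risk <- toy[toy$sex == 1 & toy$trestbps > 140 & toy$age > 50, ]
nrow(high_risk)   # only the first row passes all three conditions
```

Note that a patient with trestbps of exactly 140 is excluded, because the comparison is strictly greater than.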

7. Identifying Dependent & Independent Variables and Reshaping

selected_vars <- heart %>%
  select(age, chol, trestbps, thalach, target)

heart_long <- selected_vars %>%
  pivot_longer(cols = c(age, chol, trestbps, thalach),
               names_to = "variable",
               values_to = "value")

head(heart_long, 12)

## # A tibble: 12 × 3
##    target variable value
##     <int> <chr>    <int>
##  1      0 age         52
##  2      0 chol       212
##  3      0 trestbps   125
##  4      0 thalach    168
##  5      0 age         53
##  6      0 chol       203
##  7      0 trestbps   140
##  8      0 thalach    155
##  9      0 age         70
## 10      0 chol       174
## 11      0 trestbps   145
## 12      0 thalach    125

The dependent variable is target (heart disease diagnosis: 1 = present, 0 = absent). The independent variables are the clinical predictors. Here, four key numeric predictors (age, chol, trestbps, thalach) are selected and reshaped from wide to long format using pivot_longer(). The result is a data frame in which each row holds one variable's value for one patient, so every patient contributes four rows. This layout is useful for grouped visualizations and comparisons.
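
To make the wide-to-long reshaping concrete, here is a minimal base R sketch on a two-patient toy data frame with only two measurement columns. It is an illustration of the resulting layout under those assumptions, not a reimplementation of pivot_longer().

```r
# Two patients, two measurement columns (values are illustrative).
toy <- data.frame(
  target = c(0L, 1L),
  age    = c(52L, 58L),
  chol   = c(212L, 248L)
)
measure_cols <- c("age", "chol")

# One output row per patient per measurement column.
long <- data.frame(
  target   = rep(toy$target, each = length(measure_cols)),
  variable = rep(measure_cols, times = nrow(toy)),
  # t() puts patients in columns, so as.vector() walks patient by patient.
  value    = as.vector(t(as.matrix(toy[, measure_cols])))
)
long
```

The long table has nrow(toy) * length(measure_cols) rows, which is why 1025 patients and four predictors would give 4100 rows in heart_long.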

8. Removing Missing Values

cat("Missing values per column:\n")

## Missing values per column:

colSums(is.na(heart))

##           age           sex            cp      trestbps          chol 
##             0             0             0             0             0 
##           fbs       restecg       thalach         exang       oldpeak 
##             0             0             0             0             0 
##         slope            ca          thal        target chol_category 
##             0             0             0             0             0

heart_clean <- na.omit(heart)
cat("\nRows before:", nrow(heart), "| Rows after:", nrow(heart_clean))

## 
## Rows before: 1025 | Rows after: 1025

The colSums(is.na()) function counts missing values in each column. The na.omit() function removes any rows containing missing data. In this dataset there are no missing values, so the row count remains unchanged. This is still an important validation step in any data analysis workflow.
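
Since heart.csv has no missing values, the effect of these functions is easier to see on a toy data frame with NAs planted deliberately. The data below are invented for illustration:

```r
# Toy data with one NA in each column, on different rows.
toy <- data.frame(
  age  = c(52L, NA, 70L, 61L),
  chol = c(212L, 203L, NA, 203L)
)

na_counts <- colSums(is.na(toy))   # missing values per column
na_counts

toy_clean <- na.omit(toy)          # drops every row that contains at least one NA
nrow(toy) - nrow(toy_clean)        # number of rows removed
```

Two rows are dropped even though each column has only one NA, because na.omit() works row-wise.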

9. Identifying and Removing Duplicated Data

cat("Number of duplicated rows:", sum(duplicated(heart_clean)), "\n")

## Number of duplicated rows: 723

heart_clean <- heart_clean[!duplicated(heart_clean), ]
cat("Rows after removing duplicates:", nrow(heart_clean))

## Rows after removing duplicates: 302

The duplicated() function flags rows that are exact copies of earlier rows. Of the 1025 rows, 723 are duplicates, leaving 302 unique records; this is a known issue with the Kaggle version of the UCI Heart Disease dataset. Removing them prevents repeated records from being counted more than once, which would otherwise skew statistical results and bias any predictive model built from this data.
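
A small sketch on invented data shows what duplicated() does and does not flag: only the second and later copies of a fully identical row are marked, and a row that differs in any column is kept.

```r
# Row 3 is an exact copy of row 1; row 4 shares the age but not the cholesterol.
toy <- data.frame(
  age  = c(52L, 53L, 52L, 52L),
  chol = c(212L, 203L, 212L, 230L)
)

dup_flags  <- duplicated(toy)    # TRUE only for repeats of an earlier row
sum(dup_flags)                   # number of duplicated rows

toy_unique <- toy[!dup_flags, ]  # keep the first occurrence of each row
nrow(toy_unique)
```

The same bookkeeping applies to the full dataset: 1025 rows minus 723 flagged duplicates leaves 302.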

10. Reorder Multiple Rows in Descending Order

heart_sorted <- heart_clean %>%
  arrange(desc(chol), desc(age))

head(heart_sorted, 10)

##    age sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca thal
## 1   67   0  2      115  564   0       0     160     0     1.6     1  0    3
## 2   65   0  2      140  417   1       0     157     0     0.8     2  1    2
## 3   56   0  0      134  409   0       0     150     1     1.9     1  2    3
## 4   63   0  0      150  407   0       0     154     0     4.0     1  3    3
## 5   62   0  0      140  394   0       0     157     0     1.2     1  0    2
## 6   65   0  2      160  360   0       0     151     0     0.8     2  0    2
## 7   57   0  0      120  354   0       1     163     1     0.6     2  0    2
## 8   55   1  0      132  353   0       1     132     1     1.2     1  1    3
## 9   55   0  1      132  342   0       1     166     0     1.2     2  0    2
## 10  43   0  0      132  341   1       0     136     1     3.0     1  0    3
##    target chol_category
## 1       1          High
## 2       1          High
## 3       0          High
## 4       0          High
## 5       1          High
## 6       1          High
## 7       1          High
## 8       0          High
## 9       1          High
## 10      0          High

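The dataset is sorted by cholesterol in descending order, with ties broken by age in descending order. This places the patients with the highest cholesterol at the top (among patients with equal cholesterol, the oldest comes first), which is useful for quickly spotting extreme cases such as the 564 mg/dL maximum. Age only affects the order within ties, so the top rows are not necessarily the oldest patients.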
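
The tie-breaking behaviour can be checked with base R's order() on a toy data frame. The values below are invented so that two patients share the same cholesterol:

```r
# Two patients share chol = 341, so age decides their relative order.
toy <- data.frame(
  age  = c(55L, 67L, 43L, 60L),
  chol = c(341L, 564L, 341L, 300L)
)

# Negating numeric keys sorts them in descending order.
sorted <- toy[order(-toy$chol, -toy$age), ]
sorted
```

The 60-year-old ends up last despite being older than two other patients, because cholesterol is the primary sort key.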

11. Renaming Column Names

heart_renamed <- heart_clean %>%
  rename(
    Age = age,
    Sex = sex,
    ChestPainType = cp,
    RestingBP = trestbps,
    Cholesterol = chol,
    FastingBS = fbs,
    RestingECG = restecg,
    MaxHeartRate = thalach,
    ExerciseAngina = exang,
    STDepression = oldpeak,
    HeartDisease = target
  )

names(heart_renamed)

##  [1] "Age"            "Sex"            "ChestPainType"  "RestingBP"     
##  [5] "Cholesterol"    "FastingBS"      "RestingECG"     "MaxHeartRate"  
##  [9] "ExerciseAngina" "STDepression"   "slope"          "ca"            
## [13] "thal"           "HeartDisease"   "chol_category"

The rename() function replaces shorthand column codes with descriptive clinical labels. This improves readability for anyone reviewing the analysis who may not be familiar with the original UCI dataset abbreviations.
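
For comparison, a lookup-based rename in base R might look like the sketch below. The toy data frame and the new_names lookup vector are hypothetical; the point is that columns absent from the lookup keep their original names, which is why slope, ca, thal and chol_category appear unchanged in the output above.

```r
# Toy data frame with three of the original column names.
toy <- data.frame(age = 52L, cp = 0L, slope = 2L)

# Lookup: names are the old column names, values are the new ones.
new_names <- c(age = "Age", cp = "ChestPainType")

# Find where each old name sits, then overwrite just those positions.
hits <- match(names(new_names), names(toy))
names(toy)[hits] <- unname(new_names)
names(toy)
```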

12. Adding New Variables Using a Mathematical Function

heart_clean$chol_double <- heart_clean$chol * 2
heart_clean$bp_hr_ratio <- round(heart_clean$trestbps / heart_clean$thalach, 3)

head(heart_clean[, c("chol", "chol_double", "trestbps", "thalach", "bp_hr_ratio")], 10)

##    chol chol_double trestbps thalach bp_hr_ratio
## 1   212         424      125     168       0.744
## 2   203         406      140     155       0.903
## 3   174         348      145     125       1.160
## 4   203         406      148     161       0.919
## 5   294         588      138     106       1.302
## 6   248         496      100     122       0.820
## 7   318         636      114     140       0.814
## 8   289         578      160     145       1.103
## 9   249         498      120     144       0.833
## 10  286         572      122     116       1.052

Two new variables are created: chol_double multiplies cholesterol by 2 (demonstrating a basic mathematical transformation), and bp_hr_ratio divides resting blood pressure by maximum heart rate, rounded to three decimals. The BP-to-HR ratio is an illustrative derived metric rather than an established clinical index: a higher ratio means a higher resting pressure relative to the peak heart rate achieved, which may hint at reduced cardiovascular efficiency.
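
The arithmetic behind the first three rows of the table can be reproduced directly. This sketch copies the trestbps, thalach and chol values from those rows into plain vectors:

```r
# Values from rows 1-3 of the output above.
chol     <- c(212, 203, 174)
trestbps <- c(125, 140, 145)
thalach  <- c(168, 155, 125)

chol_double <- chol * 2                        # element-wise multiplication
bp_hr_ratio <- round(trestbps / thalach, 3)    # element-wise division, 3 decimals

chol_double
bp_hr_ratio
```

Both operations are vectorised, so no loop is needed to apply them to every row of a column.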

13. Creating a Training Set Using Random Number Generator

set.seed(123)

# floor() makes the truncation of 0.7 * 302 = 211.4 to 211 rows explicit
train_index <- sample(seq_len(nrow(heart_clean)), size = floor(0.7 * nrow(heart_clean)))
train_set <- heart_clean[train_index, ]
test_set  <- heart_clean[-train_index, ]

cat("Training set rows:", nrow(train_set), "\n")

## Training set rows: 211

cat("Test set rows:", nrow(test_set), "\n")

## Test set rows: 91

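The set.seed(123) call ensures reproducibility by fixing the random number generator so the same split is produced every time the code runs. The dataset is split roughly 70/30 into training and test sets using sample(): 0.7 × 302 = 211.4, which is truncated to 211 training rows, leaving the remaining 91 rows for testing.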
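
The split sizes can be verified without the data itself. The sketch below assumes only the row count of 302 reported earlier and checks that the two index sets are disjoint and together cover every row:

```r
set.seed(123)

n       <- 302L                  # rows after removing duplicates
n_train <- floor(0.7 * n)        # 211.4 truncated to 211

train_index <- sample(seq_len(n), size = n_train)   # sampling without replacement
test_index  <- setdiff(seq_len(n), train_index)     # every row not in the training set

length(train_index)
length(test_index)
length(intersect(train_index, test_index))          # 0 means no row is in both sets
```

Negative indexing, as in heart_clean[-train_index, ], selects the same rows as test_index here.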

14. Printing the Summary Statistics of the Dataset

summary(heart_clean)

##       age             sex               cp            trestbps    
##  Min.   :29.00   Min.   :0.0000   Min.   :0.0000   Min.   : 94.0  
##  1st Qu.:48.00   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:120.0  
##  Median :55.50   Median :1.0000   Median :1.0000   Median :130.0  
##  Mean   :54.42   Mean   :0.6821   Mean   :0.9636   Mean   :131.6  
##  3rd Qu.:61.00   3rd Qu.:1.0000   3rd Qu.:2.0000   3rd Qu.:140.0  
##  Max.   :77.00   Max.   :1.0000   Max.   :3.0000   Max.   :200.0  
##       chol            fbs           restecg          thalach     
##  Min.   :126.0   Min.   :0.000   Min.   :0.0000   Min.   : 71.0  
##  1st Qu.:211.0   1st Qu.:0.000   1st Qu.:0.0000   1st Qu.:133.2  
##  Median :240.5   Median :0.000   Median :1.0000   Median :152.5  
##  Mean   :246.5   Mean   :0.149   Mean   :0.5265   Mean   :149.6  
##  3rd Qu.:274.8   3rd Qu.:0.000   3rd Qu.:1.0000   3rd Qu.:166.0  
##  Max.   :564.0   Max.   :1.000   Max.   :2.0000   Max.   :202.0  
##      exang           oldpeak          slope             ca        
##  Min.   :0.0000   Min.   :0.000   Min.   :0.000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.000   1st Qu.:1.000   1st Qu.:0.0000  
##  Median :0.0000   Median :0.800   Median :1.000   Median :0.0000  
##  Mean   :0.3278   Mean   :1.043   Mean   :1.397   Mean   :0.7185  
##  3rd Qu.:1.0000   3rd Qu.:1.600   3rd Qu.:2.000   3rd Qu.:1.0000  
##  Max.   :1.0000   Max.   :6.200   Max.   :2.000   Max.   :4.0000  
##       thal           target      chol_category       chol_double    
##  Min.   :0.000   Min.   :0.000   Length:302         Min.   : 252.0  
##  1st Qu.:2.000   1st Qu.:0.000   Class :character   1st Qu.: 422.0  
##  Median :2.000   Median :1.000   Mode  :character   Median : 481.0  
##  Mean   :2.315   Mean   :0.543                      Mean   : 493.0  
##  3rd Qu.:3.000   3rd Qu.:1.000                      3rd Qu.: 549.5  
##  Max.   :3.000   Max.   :1.000                      Max.   :1128.0  
##   bp_hr_ratio    
##  Min.   :0.5250  
##  1st Qu.:0.7580  
##  Median :0.8655  
##  Mean   :0.9050  
##  3rd Qu.:0.9928  
##  Max.   :1.8220

The summary() function provides the five-number summary (minimum, 1st quartile, median, 3rd quartile, maximum) plus the mean for each numeric variable. Key observations: the median age is 55.5, median cholesterol is 240.5 mg/dL, and median maximum heart rate is 152.5 bpm. The target variable has a mean of 0.543, meaning about 54% of the records in this cleaned dataset are coded target = 1. Because chol_category is a character column, summary() reports only its length and class.
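
Two properties of summary() used in the interpretation above can be checked on small invented vectors: the six reported statistics, and the fact that the mean of a 0/1 variable equals the proportion of ones.

```r
# A small vector with one large value.
x <- c(1, 2, 3, 4, 10)
s <- summary(x)
s                          # Min, 1st Qu., Median, Mean, 3rd Qu., Max

# For a binary 0/1 variable, the mean is the share of ones.
flags <- c(0, 1, 1, 0, 1)
mean(flags)                # 3 ones out of 5 values
```

Here the mean of x (4) exceeds its median (3) because of the single large value, the same pattern seen for chol.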

15. Statistical Functions: Mean, Median, Mode, Range

get_mode <- function(x) {
  uniq_vals <- unique(x)
  uniq_vals[which.max(tabulate(match(x, uniq_vals)))]
}

cat("=== Cholesterol (chol) ===\n")

## === Cholesterol (chol) ===

cat("Mean:  ", mean(heart_clean$chol), "\n")

## Mean:   246.5

cat("Median:", median(heart_clean$chol), "\n")

## Median: 240.5

cat("Mode:  ", get_mode(heart_clean$chol), "\n")

## Mode:   204

cat("Range: ", range(heart_clean$chol), "\n\n")

## Range:  126 564

cat("=== Maximum Heart Rate (thalach) ===\n")

## === Maximum Heart Rate (thalach) ===

cat("Mean:  ", mean(heart_clean$thalach), "\n")

## Mean:   149.5695

cat("Median:", median(heart_clean$thalach), "\n")

## Median: 152.5

cat("Mode:  ", get_mode(heart_clean$thalach), "\n")

## Mode:   162

cat("Range: ", range(heart_clean$thalach), "\n")

## Range:  71 202

A custom get_mode() function is defined because R has no built-in function for the statistical mode; the base mode() function returns an object's storage mode instead. The statistics are computed on cholesterol and maximum heart rate. Cholesterol shows a mean (246.5) higher than the median (240.5), suggesting a right-skewed distribution in which a few patients have extremely high values. For maximum heart rate, the mean (149.6) and median (152.5) both sit near 150 bpm, consistent with a middle-aged patient population undergoing cardiac stress testing.
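
One detail of get_mode() worth knowing is how it handles ties. The sketch below repeats the function definition so it runs on its own, then applies it to invented vectors:

```r
get_mode <- function(x) {
  uniq_vals <- unique(x)
  uniq_vals[which.max(tabulate(match(x, uniq_vals)))]
}

# 3 and 1 both occur twice; which.max() picks the first maximum,
# and unique() keeps order of first appearance, so 3 is returned.
tied <- c(3, 1, 1, 3, 2)
get_mode(tied)

# A single large value pulls the mean above the median (right skew).
skewed <- c(200, 210, 220, 230, 564)
mean(skewed)
median(skewed)
```

When several values are equally frequent, the reported mode therefore depends on the order of the data, which is a reason to treat a single mode value for a near-continuous variable such as chol with some caution.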

16. Deriving Scatter Plot: Age vs Maximum Heart Rate

ggplot(heart_clean, aes(x = age, y = thalach, color = as.factor(target))) +
  geom_point(alpha = 0.6, size = 2) +
  labs(title = "Scatter Plot: Age vs Maximum Heart Rate",
       x = "Age",
       y = "Maximum Heart Rate (thalach)",
       color = "Heart Disease") +
  scale_color_manual(values = c("0" = "#E74C3C", "1" = "#2ECC71"),
                     labels = c("No Disease", "Disease")) +
  theme_minimal()

The scatter plot shows a negative relationship between age and maximum heart rate: as patients get older, their maximum achieved heart rate decreases, which is physiologically expected. The color coding shows that records coded target = 1 (green, labeled Disease) tend to cluster at higher heart rates for their age than records coded target = 0 (red, labeled No Disease). Since achieving a higher heart rate during stress testing may be associated with better cardiac function, this pattern runs against what the Disease label would suggest, so the target coding is worth checking against the dataset documentation before drawing clinical conclusions.

17. Plotting Bar Chart: Chest Pain Type by Heart Disease Status

ggplot(heart_clean, aes(x = as.factor(cp), fill = as.factor(target))) +
  geom_bar(position = "dodge") +
  labs(title = "Bar Plot: Chest Pain Type by Heart Disease Status",
       x = "Chest Pain Type",
       y = "Count",
       fill = "Heart Disease") +
  scale_fill_manual(values = c("0" = "#3498DB", "1" = "#E67E22"),
                    labels = c("No Disease", "Disease")) +
  scale_x_discrete(labels = c("0" = "Typical Angina", "1" = "Atypical", 
                               "2" = "Non-Anginal", "3" = "Asymptomatic")) +
  theme_minimal()

The bar plot shows the distribution of patients across four chest pain types, grouped by heart disease status. Notably, patients in category 0 (labeled Typical Angina by scale_x_discrete()) are predominantly in the no-disease group, while patients with atypical angina and non-anginal pain show higher proportions of heart disease. This suggests that chest pain type is a strong differentiator in cardiac diagnosis.
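
Statements about proportions are easier to verify from a table than from bar heights. A possible companion to the plot is a row-wise proportion table; the sketch below uses a small invented data frame, so its numbers say nothing about the real dataset.

```r
# Invented toy data: 8 records across three chest pain codes.
toy <- data.frame(
  cp     = c(0L, 0L, 0L, 0L, 1L, 1L, 2L, 2L),
  target = c(0L, 0L, 0L, 1L, 1L, 1L, 1L, 0L)
)

counts    <- table(cp = toy$cp, target = toy$target)   # counts per cp/target pair
row_props <- prop.table(counts, margin = 1)            # each row sums to 1
round(row_props, 2)
```

With margin = 1, each row gives the share of target = 0 and target = 1 within one chest pain type, which is the quantity the paragraph above describes.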

18. Pearson Correlation: Age vs Maximum Heart Rate

cor_value <- cor(heart_clean$age, heart_clean$thalach, method = "pearson")
cat("Pearson Correlation (Age vs Max Heart Rate):", cor_value, "\n")

## Pearson Correlation (Age vs Max Heart Rate): -0.3952352

cor.test(heart_clean$age, heart_clean$thalach, method = "pearson")

## 
##  Pearson's product-moment correlation
## 
## data:  heart_clean$age and heart_clean$thalach
## t = -7.4525, df = 300, p-value = 9.858e-13
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.4864024 -0.2955546
## sample estimates:
##        cor 
## -0.3952352

The Pearson correlation coefficient between age and maximum heart rate is r = -0.395, a moderate negative correlation that confirms the inverse relationship visible in the scatter plot. The cor.test() function reports the coefficient along with a p-value and a 95% confidence interval (-0.486 to -0.296). The p-value of 9.858e-13 is far below 0.05, so a correlation this strong would be very unlikely if the true correlation were zero. This finding is clinically expected: maximum achievable heart rate naturally declines with age and is commonly estimated by the formula 220 minus age.
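
The test statistic reported by cor.test() follows from the coefficient and the sample size through t = r * sqrt(n - 2) / sqrt(1 - r^2). The sketch below recomputes it from the values printed above, and separately checks the definition of Pearson's r on a small invented pair of vectors (the age and hr values are illustrative only):

```r
# Values copied from the cor.test() output above.
r  <- -0.3952352
df <- 300                                  # n - 2, with n = 302 rows

t_stat  <- r * sqrt(df) / sqrt(1 - r^2)    # should be close to the reported t
p_value <- 2 * pt(-abs(t_stat), df)        # two-sided p-value from the t distribution
t_stat
p_value

# Pearson's r is the covariance scaled by both standard deviations.
age <- c(30, 40, 50, 60, 70)
hr  <- c(190, 175, 172, 160, 150)
r_manual <- cov(age, hr) / (sd(age) * sd(hr))
all.equal(r_manual, cor(age, hr))
```

The recomputed t is approximately -7.45, in line with the reported value.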

COMP4028 – Assignment 1: Heart Disease Data Analysis

Mencha Tembong

2026-03-22