Weekly Assignment #5

Due: Friday, Oct 17th, 2025 @ 11:59PM

Loading Required Libraries

Before running the code, we will load the necessary R packages. The tidyverse package provides a cohesive set of functions for data manipulation, cleaning, and reshaping. It also attaches ggplot2, which we use to create customizable plots, so the separate library(ggplot2) call below is redundant but harmless. DT renders interactive tables that users can sort, search, and scroll through. Finally, rsconnect allows deployment of our Shiny app to shinyapps.io. Loading these libraries up front ensures that later function calls do not fail because of missing dependencies.

#install.packages('rsconnect')

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   4.0.0     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
library(DT)
library(rsconnect)

After executing this code, the R environment will have all necessary packages loaded. This setup ensures that the dataset can be manipulated efficiently, visualizations can be generated with full flexibility, tables will be interactive for the user, and the Shiny app can be deployed. Users will have access to all interactive and plotting features without function errors.
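If any of these packages is not installed, library() stops with an error. A minimal sketch of a pre-flight check, using only base R; the names requiredPkgs, isInstalled, and missingPkgs are illustrative and not part of the assignment code.

```r
# Packages the assignment expects (taken from the library() calls above)
requiredPkgs <- c("tidyverse", "DT", "rsconnect")

# requireNamespace() returns FALSE instead of erroring when a package is absent
isInstalled <- vapply(requiredPkgs, requireNamespace, logical(1), quietly = TRUE)
missingPkgs <- requiredPkgs[!isInstalled]

if (length(missingPkgs) > 0) {
  message("Install before continuing: ", paste(missingPkgs, collapse = ", "))
} else {
  message("All required packages are available.")
}
```

Running this once before the library() calls gives a single readable message rather than a failure partway through the script.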

Loading and Preparing the Heart Disease Dataset

Before running this code, we will import the Heart Disease dataset for analysis. We will use read.csv() with sep = " " and header = FALSE because the dataset does not have headers and uses spaces as separators. We will then assign descriptive column names using colnames(). After importing, we will convert integer-coded variables into factors using factor(), providing human-readable labels (e.g., converting 1 and 0 in the sex variable to “M” and “F”). This will allow R to correctly treat these as categorical variables for plotting and summarization. Finally, we will add a unique PatientID using mutate() and move it to the first column with relocate(). We will also inspect the dataset using summary() and head().

#Loading and Preparing the Heart Disease Dataset

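# Read the space-separated file; it has no header row
heart.dat <- read.csv("heart.dat.csv", sep = " ", header = FALSE)

# Descriptive column names (stored in colNames to avoid masking base::names())
colNames <- c("age", "sex", "cp", "restbp", 
  "chol", "fbs", "restecg", "maxach", "exang", "oldpeak", "slope", "num", 
  "thal", "disease")

colnames(heart.dat) <- colNames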

heart.dat$sex <- factor(heart.dat$sex, labels=c("F", "M"))
heart.dat$cp <- factor(heart.dat$cp, 
  labels=c("Typ", "Atyp", "Non-Ang", "Asymp"))

# fbs is coded 0 = false, 1 = true; labels follow the sorted codes, so "F" comes first
heart.dat$fbs <- factor(heart.dat$fbs, labels=c("F", "T"))

heart.dat$restecg <- factor(heart.dat$restecg, 
  labels=c("Normal", "Abnorm", "Hypertrophy"))

heart.dat$exang <- factor(heart.dat$exang, labels=c("N", "Y"))

heart.dat$slope <- factor(heart.dat$slope, 
  labels=c("Up", "Flat", "Down"))

heart.dat$thal <- factor(heart.dat$thal, 
  labels=c("Normal", "Fixed", "Reversible"))

heart.dat$disease <- factor(heart.dat$disease, labels=c("H", "S"))

heart.dat <- heart.dat %>% 
  mutate(PatientID = 1:n()) %>%
  relocate(PatientID, .before = 1)


summary(heart.dat)
##    PatientID           age        sex           cp          restbp     
##  Min.   :  1.00   Min.   :29.00   F: 87   Typ    : 20   Min.   : 94.0  
##  1st Qu.: 68.25   1st Qu.:48.00   M:183   Atyp   : 42   1st Qu.:120.0  
##  Median :135.50   Median :55.00           Non-Ang: 79   Median :130.0  
##  Mean   :135.50   Mean   :54.43           Asymp  :129   Mean   :131.3  
##  3rd Qu.:202.75   3rd Qu.:61.00                         3rd Qu.:140.0  
##  Max.   :270.00   Max.   :77.00                         Max.   :200.0  
##       chol       fbs            restecg        maxach      exang  
##  Min.   :126.0   F:230   Normal     :131   Min.   : 71.0   N:181  
##  1st Qu.:213.0   T: 40   Abnorm     :  2   1st Qu.:133.0   Y: 89  
##  Median :245.0           Hypertrophy:137   Median :153.5          
##  Mean   :249.7                             Mean   :149.7          
##  3rd Qu.:280.0                             3rd Qu.:166.0          
##  Max.   :564.0                             Max.   :202.0          
##     oldpeak      slope          num                 thal     disease
##  Min.   :0.00   Up  :130   Min.   :0.0000   Normal    :152   H:150  
##  1st Qu.:0.00   Flat:122   1st Qu.:0.0000   Fixed     : 14   S:120  
##  Median :0.80   Down: 18   Median :0.0000   Reversible:104          
##  Mean   :1.05              Mean   :0.6704                           
##  3rd Qu.:1.60              3rd Qu.:1.0000                           
##  Max.   :6.20              Max.   :3.0000
head(heart.dat)
##   PatientID age sex      cp restbp chol fbs     restecg maxach exang oldpeak
## 1         1  70   M   Asymp    130  322   F Hypertrophy    109     N     2.4
## 2         2  67   F Non-Ang    115  564   F Hypertrophy    160     N     1.6
## 3         3  57   M    Atyp    124  261   F      Normal    141     N     0.3
## 4         4  64   M   Asymp    128  263   F      Normal    105     Y     0.2
## 5         5  74   F    Atyp    120  269   F Hypertrophy    121     Y     0.2
## 6         6  65   M   Asymp    120  177   F      Normal    140     N     0.4
##   slope num       thal disease
## 1  Flat   3     Normal       S
## 2  Flat   0 Reversible       H
## 3    Up   0 Reversible       S
## 4  Flat   1 Reversible       H
## 5    Up   1     Normal       H
## 6    Up   0 Reversible       H

After executing this code, the dataset will be clean and ready for analysis. All categorical variables now have descriptive labels, making plots easier to interpret. The PatientID column provides a unique identifier for each row. The summary() and head() output will allow users to verify the variable types, observe ranges for continuous variables, and see that the dataset has been successfully prepared for visualization and analysis.
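factor() attaches labels to the sorted unique codes, so a label vector written in the wrong order silently swaps categories. A small sketch on made-up codes that spells out levels to guard against this; fbsRaw is hypothetical data, and the coding 0 = false, 1 = true is an assumption stated for the example.

```r
# Toy codes standing in for a 0/1 column such as fbs (not the real data)
fbsRaw <- c(0, 1, 0, 0, 1)

# Naming the levels makes the code-to-label mapping explicit:
# 0 is paired with "F" and 1 with "T", regardless of sort order
fbs <- factor(fbsRaw, levels = c(0, 1), labels = c("F", "T"))

# Counts per label, useful for checking against the expected class balance
table(fbs)
```

Comparing the table() counts with what is known about the data (for example, which class should be the minority) is a quick way to catch a swapped label.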

Scatter Plot Function (Two Continuous Variables Coloured by a Categorical Variable)

We will now create a scatter plot function to allow users to explore the relationships between two continuous variables (e.g., age and chol). This function will also colour points based on a categorical variable (e.g., sex) to reveal patterns or group differences. To organize the data, we will define continuousVars and categoricalVars as vectors containing the appropriate continuous and categorical variables. For this plot, varX and varY will be set to variables from continuousVars, while varCol will be chosen from categoricalVars. We will use ggplot() to initiate the plot, aes() with .data[[varX]], .data[[varY]], and .data[[varCol]] to map variables dynamically, and geom_point() to plot individual data points. The theme_minimal() function will give the plot a clean, uncluttered appearance, and labs() will be used to generate dynamic axis labels and a descriptive plot title.

#Scatter Plot Function (Two Continuous Variables Coloured by a Categorical Variable)

#Main variable groups
continuousVars <- c("age", "restbp", "chol", "maxach", "oldpeak")
categoricalVars <- c("disease", "exang", "fbs", "thal", "cp", "sex", "restecg")

#Allowed choices for varX, varY, varCol (each function argument takes one name from its vector)
varX <- continuousVars
varY <- continuousVars
varCol <- categoricalVars


#Make Graph
myScatterPlot <- function(varX, varY, varCol) {
  heart.dat %>%
    ggplot(aes(x = .data[[varX]],
               y = .data[[varY]],
               colour = .data[[varCol]])) +
    geom_point(size = 3, alpha = 0.7) +
    theme_minimal() +
    labs(
      x = varX,
      y = varY,
      colour = varCol,
      title = paste("Scatter plot of", varY, "vs", varX, "coloured by", varCol)
    )
}


#Example
myScatterPlot("age", "chol", "sex")

After running this code, users will be able to generate a scatter plot of any two continuous variables and color points by a categorical variable. The plot will show the relationship between the variables, highlight clusters or trends, and help users visually detect patterns in the data.
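Because the plotting function accepts any string, a mistyped or categorical name only fails later inside ggplot. A minimal sketch of an up-front check, assuming base R only; checkChoice() is a hypothetical helper and is not part of the assignment code or of ggplot2.

```r
# Hypothetical helper: stop early if `var` is not exactly one allowed name
checkChoice <- function(var, allowed) {
  if (!is.character(var) || length(var) != 1 || !(var %in% allowed)) {
    stop("'", paste(var, collapse = ", "), "' must be one of: ",
         paste(allowed, collapse = ", "), call. = FALSE)
  }
  invisible(var)
}

# Same vector as defined in the scatter plot section
continuousVars <- c("age", "restbp", "chol", "maxach", "oldpeak")

# A valid continuous variable passes silently
checkChoice("age", continuousVars)

# A categorical variable is rejected; capture the message instead of halting
result <- tryCatch(checkChoice("sex", continuousVars),
                   error = function(e) conditionMessage(e))
print(result)
```

Calling such a check at the top of a plotting function would turn an obscure plotting error into a message that lists the valid choices.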

Box Plot Function (Categorical vs Continuous)

We will now create a box plot function that lets users compare the distribution of a continuous variable across the levels of a categorical variable, with each box further split by a second categorical variable. ggplot() will be used with geom_boxplot() to display medians, quartiles, and outliers, and the fill aesthetic will colour the boxes by the second categorical variable. This function will help users quickly see differences and variability in continuous measures across categories.

#Box Plot Function (Categorical vs Continuous, Filled by a Second Categorical Variable)

#Allowed choices for varX, varY, varCol (each function argument takes one name from its vector)
varX <- categoricalVars
varY <- continuousVars
varCol <- categoricalVars

#Make Graph
myBoxPlot <- function(varX, varY, varCol) {
  heart.dat %>%
    ggplot(aes(x = .data[[varX]], 
               y = .data[[varY]], 
               fill = .data[[varCol]])) +
    geom_boxplot() +
    theme_minimal() +
    labs(x = varX, y = varY, fill = varCol,
         title = paste("Boxplot of", varY, "by", varX))
}

#Example
myBoxPlot("exang", "chol", "sex")

After executing this code, users will be able to create boxplots showing the distribution of a continuous variable for each category. The boxes will be coloured to reflect any chosen categorical variable, making comparisons between groups visually intuitive.
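The box in each boxplot spans the first to third quartile, with the middle line at the median. A short sketch of how those three numbers can be computed per group, using base R on a made-up data frame; toy and its values are invented for illustration and are not taken from the Heart Disease dataset.

```r
# Made-up data: a continuous measure and a two-level grouping variable
toy <- data.frame(
  exang = c("N", "N", "N", "N", "N", "Y", "Y", "Y"),
  chol  = c(200, 220, 240, 260, 300, 210, 230, 250)
)

# Split the continuous values by group, then take the quartiles of each group
quartiles <- sapply(split(toy$chol, toy$exang), quantile,
                    probs = c(0.25, 0.5, 0.75))

# Rows are the 25%, 50%, and 75% quantiles; columns are the groups
print(quartiles)
```

Printing these values next to a boxplot is a simple way to confirm what the box edges and centre line represent.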

Categorical Plot Function (One or Two Categorical Variables)

We will now create a function, myCatPlot(), to allow users to explore one or two categorical variables in the Heart Disease dataset. The goal of this function will be to provide a clear overview of category counts, enabling users to understand the distribution of participants within each category and to detect potential patterns when two categorical variables are combined. We will design the function so that the second variable, varFill, is optional. If a second categorical variable is provided, we will combine it with the first using interaction() to create a new factor representing all possible category combinations. This approach will allow users to examine joint distributions dynamically while keeping the function reusable for different variable selections.

We will use ggplot() to create the visualization, mapping the combined variable to both the x-axis and the fill color via aes(x = Combined, fill = Combined). The geom_bar() function will generate the bar plot showing the counts for each category or combination of categories. We will apply theme_minimal() to produce a clean, uncluttered plot, and use labs() to dynamically generate descriptive axis labels and a title based on the selected variables. Finally, we will rotate the x-axis labels with element_text(angle = 45, hjust = 1) to prevent overlapping text, ensuring readability even when category names are long or when many combinations exist. This function will allow users to intuitively explore categorical distributions and relationships, providing a foundation for further analysis and comparison across the dataset.

#Categorical Plot Function (One or Two Categorical Variables)

#Allowed choices for varX and varFill (each function argument takes one name from its vector)
varX <- categoricalVars
varFill <- categoricalVars

myCatPlot <- function(varX, varFill = NULL) {
  df <- heart.dat
  
  # TRUE when a second categorical variable was supplied
  hasFill <- !is.null(varFill) && varFill != ""
  
  # Create combined variable (if varFill is given)
  df$Combined <- if (hasFill) {
    interaction(df[[varX]], df[[varFill]], sep = "_")
  } else {
    df[[varX]]
  }
  
  # Labels and title
  label <- if (hasFill) paste(varX, "&", varFill) else varX
  title <- paste("Bar plot of", label)
  
  # Make graph
  ggplot(df, aes(x = Combined, fill = Combined)) +
    geom_bar() +
    theme_minimal() +
    labs(x = label, y = "Count", fill = label, title = title) +
    theme(axis.text.x = element_text(angle = 45, hjust = 1))
}

#Examples
myCatPlot("sex")       # Single categorical variable

myCatPlot("sex", "cp") # Combine two categorical variables

After executing this code, users will see a bar plot displaying the counts of participants for the selected categorical variable(s). If only a single variable (varX) is chosen, the plot will show a separate bar for each category, with the height of each bar representing the number of participants in that category. If a second variable (varFill) is selected, the plot will display bars for all combinations of the two variables, allowing users to see the joint distribution of categories.

The fill color of the bars will correspond to the category or combination of categories, making it easier to visually distinguish groups. The x-axis labels will be rotated at a 45-degree angle to prevent overlapping text, ensuring that all categories are readable, even when there are many or long category names. The plot title and axis labels will dynamically update based on the variables selected, providing context without requiring users to manually annotate the plot.

Overall, this function will allow users to quickly explore the composition of categorical variables, identify imbalances or dominant groups, and detect patterns or relationships between two categorical variables, all within a single, clean visual representation.
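To see what interaction() produces before it reaches the plot, the combined factor can be built and tabulated on its own. A minimal sketch with made-up values; the vectors sex and cp below are toy data, not the real columns.

```r
# Toy factors standing in for two categorical columns
sex <- factor(c("F", "M", "M", "F", "M"))
cp  <- factor(c("Typ", "Asymp", "Asymp", "Typ", "Typ"))

# One level per combination of the two factors, joined with "_"
combined <- interaction(sex, cp, sep = "_")

# Levels cover all 2 x 2 combinations, including ones that never occur
levels(combined)

# Counts per combination; these are the bar heights in a bar plot
table(combined)
```

Note that interaction() keeps levels for combinations with zero observations, which is worth knowing when reading the legend or level list.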

Interactive Data Table Function

We will create a function to allow users to select a range of columns and view the data as an interactive table. The select() function will subset the dataset by column range, and datatable() will render the data interactively, supporting scrolling, sorting, and searching.

# Make a table
makeTable <- function(startVar, endVar) {
  heart.dat %>%
    select(all_of(startVar):all_of(endVar)) %>%
    datatable(
      options = list(scrollX = TRUE),
      class = "cell-border stripe",
      rownames = FALSE
    )
}

# Example
makeTable("PatientID", "chol")

After executing this code, users will be able to interactively explore the data in a table format. Columns within the selected range will be displayed, and users can scroll horizontally, search for values, and sort data for easier exploration.
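The range startVar:endVar selects every column positioned between the two names, so the result depends on column order. A base R sketch of the same idea on a toy data frame; selectRange() is a hypothetical helper written for illustration and is not part of dplyr or DT.

```r
# Toy data frame with a few columns in a fixed order
toy <- data.frame(
  PatientID = 1:3,
  age       = c(70, 67, 57),
  sex       = c("M", "F", "M"),
  chol      = c(322, 564, 261)
)

# Hypothetical helper: keep all columns from startVar to endVar by position
selectRange <- function(df, startVar, endVar) {
  pos <- match(c(startVar, endVar), names(df))
  if (anyNA(pos)) stop("Unknown column name", call. = FALSE)
  df[, pos[1]:pos[2], drop = FALSE]
}

# Columns between "age" and "chol", inclusive
names(selectRange(toy, "age", "chol"))
```

The unknown-name check mirrors what all_of() does in the tidyverse version, which also errors when a requested column does not exist.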

Creating a Glossary Table of Variables

We will create a glossary to describe each variable in the dataset. A data.frame will store variable names and descriptions. datatable() will render it interactively. This will provide users a quick reference to understand what each variable represents before creating plots or tables.

# Create a glossary table of variables 

glossary <- data.frame(
  Variable = c("PatientID", "age", "sex", "cp", "restbp", "chol",
               "fbs", "restecg", "maxach", "exang", "oldpeak",
               "slope", "num", "thal", "disease"),
  Description = c(
    "Unique patient identifier",
    "Age in years",
    "Sex (F = female, M = male)",
    "Chest pain type (Typ = typical, Atyp = atypical, Non-Ang = non-anginal, Asymp = asymptomatic)",
    "Resting blood pressure (mm Hg)",
    "Serum cholesterol (mg/dl)",
    "Fasting blood sugar > 120 mg/dl (T = true, F = false)",
    "Resting electrocardiographic results (Normal, Abnorm, Hypertrophy)",
    "Maximum heart rate achieved",
    "Exercise induced angina (N = no, Y = yes)",
    "ST depression induced by exercise relative to rest",
    "Slope of the peak exercise ST segment (Up, Flat, Down)",
    "Number of major vessels coloured by fluoroscopy (0–3)",
    "Thallium stress test result (Normal, Fixed = fixed defect, Reversible = reversible defect)",
    "Presence of heart disease (H = healthy, S = sick)"
  ),
  stringsAsFactors = FALSE
)
# Display as interactive table
datatable(glossary, options = list(scrollX = TRUE, pageLength = 5), rownames = FALSE)

After running this code, users will see an interactive table with variable names and detailed descriptions. The table will allow horizontal scrolling, sorting, and searching. This glossary ensures that users can understand each variable before selecting it for plots or tables, improving the clarity and usability of the Shiny app.
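A glossary is only useful if it covers every column, so it can help to compare its Variable column against the dataset's column names. A small sketch using setdiff() on made-up inputs; dataCols and toyGlossary are illustrative stand-ins, not the real objects.

```r
# Stand-in for names(heart.dat)
dataCols <- c("PatientID", "age", "sex", "chol")

# Stand-in glossary that is missing one entry
toyGlossary <- data.frame(
  Variable    = c("PatientID", "age", "sex"),
  Description = c("Unique patient identifier", "Age in years", "Sex"),
  stringsAsFactors = FALSE
)

# Columns present in the data but absent from the glossary
undocumented <- setdiff(dataCols, toyGlossary$Variable)
print(undocumented)
```

An empty result would indicate that every column has a glossary entry.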