Advanced Epidemiology with R: From Zero to Hero
Module 1: Introduction to R, RStudio, and Basic Concepts
1.1. What is R and Why Use It?
R is a programming language and statistical computing environment widely used in data science, statistics, and machine learning.
🔹 Advantages of Using R:
✅ Free and Open Source
✅ Easy for Statistical Analysis with hundreds of specialized libraries
✅ Active Community with forums, tutorials, and documentation
✅ Data-Oriented, designed specifically to efficiently handle and analyze data
1.2. Working Environment: RStudio
RStudio is an IDE (Integrated Development Environment) that facilitates working with R. In RStudio, you will typically see four main panels:
Script/Editor:
This is similar to a document where you write and save your code so that you can reproduce it at any time and share it with other researchers, allowing them to replicate the same steps that led to your conclusions. Some people prefer to have multiple scripts for different parts of their analysis (e.g., data loading and cleaning, exploratory analysis, creation of clean datasets for further analysis, descriptive analysis, modeling). When you create a script, it is usually located in the upper-left quadrant, but you can drag it to any section that is most comfortable for you. Scripts are saved as .R files.
Console:
This is where you execute the code and see the results in real time. All the code in your script is actually run in the console, and here you will see the results or warning messages. However, you can also execute code directly in the console without saving it in your script, which is useful when you want to quickly check an analysis or output before deciding whether to include it in your final script for replication. The console is usually located in the lower-left quadrant.
Environment/History:
This panel displays the objects you have created through your code (data frames, matrices, lists, variables, plots, etc.) and the history of executed commands. It is generally located in the upper-right quadrant.
Plots/Viewer/Files/Help:
This section is usually located in the lower-right quadrant and is useful for viewing plots, navigating files, or previewing reports. In the help section, you can explore the libraries you have installed and the functions available within them. You will delve deeper into this later, but to give you a general idea, a library is essentially a package of pre-programmed functions designed to streamline coding tasks. For example, someone might have created a library with a set of functions that allows you to perform an exploratory analysis with a single line of code instead of writing hundreds of lines yourself. As you progress, you will discover different libraries that suit your coding style and data handling needs, incorporating them into your workflow.
💡 Typical Workflow in RStudio:
1️⃣ Write your code in a .R (script) file.
2️⃣ Execute line by line or in blocks (`Ctrl + Enter` on Windows or `Cmd + Enter` on Mac).
3️⃣ Observe results in the console and "Environment".
1.3. First Steps in the Console
The R console displays a > symbol when it is ready to receive a command.
🔹 Example: Basic Operations
Try executing these operations in the console:
# Basic operations in R
2 + 2
5 * 3
sqrt(25) # Square root
💡 Run each line and observe the results in the console.
🔥 Exercise 1.1: Write and Execute Your First Script in R
Open RStudio and create a new script (File > New File > R Script).
In that script, write the following:
# Print "Hello, R" to the console
print("Hello, R")
# Perform an arithmetic operation of your choice
result <- 10 * 3
result
Run the code and observe the result in the console.
✅ Solution 1.1
If done correctly, you should see the following output:
[1] "Hello, R"
[1] 30
This simple exercise familiarizes you with the idea of writing in the script and executing in the console.
🎯 Congratulations! You have completed Module 1. In the next module, we will explore data structures in R.
Module 2: Data Structures in R
In R, there are different types of data structures. Think of them as different “containers” for storing information. Here, you will learn about the most important ones.
2.1. Vectors
- A vector is the most basic structure.
- All elements must be of the same type (numeric, text, logical, etc.).
- It can be seen as a “row” or “column” of values in Excel, but in R, you don’t need to think about rows/columns until using it in a data frame or matrix.
Useful functions for creating vectors:
- c(...): Combines elements to form a vector.
- rep(x, times): Repeats the object x the number of times given by times.
- seq(from, to, by): Creates numeric sequences.
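For instance, a quick sketch of these three helpers (expected output shown as comments):
c(1, 5, 9)                       # Vector with the elements 1, 5, 9
rep(0, times = 4)                # 0 0 0 0
seq(from = 2, to = 10, by = 2)   # 2 4 6 8 10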
Operations with vectors:
# Create a numeric vector
numbers <- c(1, 2, 3, 4, 5)
# Sum of all elements
sum_numbers <- sum(numbers)
# Mean
mean_numbers <- mean(numbers)
sum_numbers # 15
mean_numbers # 3
2.2. Matrices
- A matrix is a two-dimensional object: it has rows and columns, but all elements must be of the same type.
- We can use matrix() or combine vectors with cbind() (column binding) or rbind() (row binding).
Creating matrices:
# Create a 3x3 matrix
my_matrix <- matrix(1:9, nrow = 3, ncol = 3)
my_matrix
1:9 generates a vector from 1 to 9, which fills the matrix by columns unless specified otherwise with byrow = TRUE.
Indexing in matrices:
- my_matrix[1, ]: Complete first row.
- my_matrix[, 2]: Complete second column.
- my_matrix[2, 2]: Element in row 2, column 2.
2.3. Data Frames
- A data frame is very similar to a spreadsheet.
- Each column can have a different data type (numeric, character, factor, etc.).
- Each row typically corresponds to an observation or case.
Functions for data frames:
- data.frame(...): Creates a data frame.
- head(df, n) / tail(df, n): Displays the first or last n rows.
- str(df): Shows the internal structure.
- dim(df): Dimensions (number of rows and columns).
Example:
patients <- data.frame(
ID = 1:3,
Age = c(23, 45, 31),
Gender = c("M", "F", "M")
)
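# A quick look at the new data frame using the helper functions listed above (sketch)
head(patients, 2)   # First two rows
str(patients)       # Internal structure: column types and example values
dim(patients)       # 3 3 (three rows, three columns)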
patients
2.4. Lists
- A list can contain any type of object: vectors, matrices, data frames, even other lists!
- They are very flexible containers.
my_list <- list(
numeric_vector = c(1, 2, 3),
example_matrix = matrix(1:4, nrow = 2),
df_patients = patients
)
my_list
🔥 Exercise 2.1
- Create a vector with 10 random numeric values using rnorm(10).
- Convert this vector into a matrix with 5 rows and 2 columns.
- Create a data frame with two columns:
- The first: values from the first column of the created matrix.
- The second: categorical values (e.g., “A”, “B”, “C”).
- Create a list containing the original vector, the matrix, and the data frame.
✅ Solution 2.1
# 1. Vector with 10 random values
set.seed(123)
random_vec <- rnorm(10)
random_vec
# 2. Convert to matrix
matrix_5x2 <- matrix(random_vec, nrow = 5, ncol = 2)
matrix_5x2
# 3. Data frame
df_example <- data.frame(
values = matrix_5x2[, 1],
category = c("A", "B", "A", "C", "B")
)
df_example
# 4. List containing all objects
final_list <- list(
vector = random_vec,
matrix = matrix_5x2,
df = df_example
)
final_list
🎯 Congratulations! You have completed Module 2. In the next module, we will explore Programming in R: Assignment, Functions, Loops, and Control Flow.
Module 3: Programming in R (Assignment, Functions, Loops, Control Flow)
In this module, we will explore programming logic to automate tasks: how to assign values, create functions, and use loops.
3.1. Variable Assignment
To assign a value to a variable in R, use <- or =.
# Variable assignment
x <- 10
y = 20
z <- x + y # 30
🔹 Note: The R convention is to use <-, but = also works.
3.2. Creating Functions
A function is a reusable block of code.
Structure of a Function in R
function_name <- function(arg1, arg2, ...) {
# Instructions
# Return value
}
🔹 Key Components:
- Arguments: Input values for the function (e.g., arg1, arg2).
- Body: The internal code of the function.
- return(): Specifies the output of the function. If omitted, the last computed value is returned.
Example: Function to Calculate Hypotenuse
hypotenuse <- function(side1, side2) {
# side1: length of the first side
# side2: length of the second side
# Returns: the hypotenuse
h <- sqrt(side1^2 + side2^2)
return(h)
}
hypotenuse(3, 4) # 5
🔹 Default Values: You can define optional values in a function:
# If side2 is not provided, it defaults to 1
hypotenuse <- function(side1, side2 = 1) {
sqrt(side1^2 + side2^2)
}
3.3. Control Flow: Conditionals
Conditionals allow the program to make decisions based on certain conditions.
1️⃣ if - else
x <- 5
if (x > 3) {
print("x is greater than 3")
} else {
print("x is less than or equal to 3")
}
2️⃣ ifelse(): Vectorized Version
numbers <- 1:5
ifelse(numbers %% 2 == 0, "Even", "Odd")
# c("Odd", "Even", "Odd", "Even", "Odd")3.4. Loops
Loops allow repeating a series of steps automatically.
3.4.1. for Loop
🔹 Structure:
for (iterator in sequence) {
# Code block
}
🔹 Example:
for (i in 1:5) {
print(i^2)
}
Here, i sequentially takes the values 1, 2, 3, 4, 5, and print(i^2) is executed for each.
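As an additional sketch (not part of the original example), a for loop is often combined with a pre-allocated vector to store results:
squares <- numeric(5)      # Pre-allocate a numeric vector of length 5
for (i in 1:5) {
  squares[i] <- i^2        # Store each squared value in position i
}
squares                    # 1 4 9 16 25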
3.4.2. while Loop
🔹 Repeats while a condition is true:
j <- 1
while (j <= 5) {
print(j^2)
j <- j + 1
}
⚠️ Warning: If you do not update the control variable (j <- j + 1), the loop will never end.
🔥 Exercise 3.1
- Create a vector with values from 1 to 10.
- Use a for loop to print the square root of each value.
- Inside the loop, use an if statement to print a special message if the value is greater than 8.
✅ Solution 3.1
values <- 1:10
for (i in values) {
cat("The square root of", i, "is:", sqrt(i), "\n")
if (i > 8) {
cat("This value is greater than 8!\n")
}
}
🎯 Congratulations! You have completed Module 3. In the next module, we will explore basic operations and data manipulation in R.
Module 4: Basic Operations, Package Management, and Initial Exploration
Before conducting complex analyses, it is essential to explore the data, check the number of rows, identify variable types, and detect missing values. In this module, we will also learn how to install and load packages in R. Next, we will see how to download a package called dplyr and load it into the current session. This package is very useful for data manipulation.
4.1. Installing and Loading Packages
In R, many advanced functions are organized into packages or libraries that must be installed and loaded before use.
🔹 Install a package (only once):
# Before running any command, it is important to install and load the necessary libraries. Remember to remove the # symbol before running the code. If the software asks you to restart the session before installing the libraries, do it at least once. This can prevent future conflicts due to the creation of different paths for library storage.
#install.packages("dplyr") # To run it, you must remove the `#` symbol that appears before the `install.packages` function. In R, we often want to add comments to our code without executing them in the console. One way to do this is by adding the `#` symbol before the function or line of code that we do not want to execute.
🔹 Load a package into the current session:
library(dplyr)
💡 Note: install.packages("package_name") downloads the package from CRAN (official repository), while library(package_name) loads it into the R session. If you try to run code and it doesn't work because the library isn't installed, you'll need to install it and load it before attempting to run the code again.
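As an optional pattern (a small sketch, not required by the course), base R's requireNamespace() lets you install a package only when it is missing:
# Install dplyr only if it is not already available, then load it
if (!requireNamespace("dplyr", quietly = TRUE)) {
  install.packages("dplyr")
}
library(dplyr)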
4.2. Key Packages for Data Exploration
4.2.1. DataExplorer
🔹 Purpose: Automatically generate Exploratory Data Analysis (EDA) reports.
🔹 Key functions:
- create_report(data, output_file = "Report.html"): Generates an HTML report with distributions, plots, missing values, etc.
- plot_intro(data): Displays the number of rows, columns, and variable types.
- plot_missing(data): Visualizes missing values (NA).
🔹 Example:
library(DataExplorer)
data(iris) # Example dataset (flowers)
plot_intro(iris) # Basic information
plot_missing(iris) # Are there missing values?
# create_report(iris) # This function generates a complete report of the exploratory analysis, which will be saved as an HTML file and can be viewed in your browser.
4.2.2. skimr
🔹 Purpose: Generate a detailed summary (similar to base R's summary()) of each column in a dataset.
🔹 Main function: skim(data)
🔹 Example:
library(skimr)
skim(iris)
🔹 Expected output: Information about each variable, including:
- Mean, median, maximum, and minimum values.
- Number of unique values and missing values (NA).
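skim() also respects dplyr groups, so a grouped summary is possible; a short sketch (assuming dplyr is loaded):
library(dplyr)
# Summary statistics of each variable, computed separately for each species
iris %>%
  group_by(Species) %>%
  skim()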
4.2.3. knitr
🔹 Purpose: Create reproducible reports in R Markdown.
🔹 Key functions:
- kable(data): Displays data frames in a table format.
- include_graphics(path): Allows the inclusion of images in an R Markdown report.
🔹 Example:
library(knitr)
kable(head(mtcars), caption = "Table with the first rows of mtcars")
🎯 Congratulations! You have completed Module 4. In the next module, we will explore data manipulation and transformation in R using dplyr, tidyr, and gtsummary.
Module 5: Libraries for Data Manipulation and Transformation
Learning to clean, transform, and present data efficiently is essential in data analysis. These steps often represent the majority of the work in a real-world analytics project. In this module, we will explore three key libraries for data manipulation and transformation in R: dplyr, tidyr, and gtsummary. These libraries are part of the tidyverse, a collection of R packages designed for data science. Each library has specific functions to facilitate data manipulation, transformation, and summary table creation. Here we will explore some of the main functions available in each library and provide practical examples.
5.1. dplyr
🔹 Objective: Facilitate data frame manipulation in a readable and efficient way.
🔹 Key functions: In the tidyverse philosophy, each of these functions is called a "verb":
1. filter(): Filter rows based on a condition.
2. select(): Select columns by name or range.
3. mutate(): Create or modify columns.
4. arrange(): Reorder rows.
5. group_by() + summarise(): Group by one or more variables and calculate aggregate statistics.
🔹 Example:
library(dplyr)
# Creating an example data frame
df <- data.frame(
subject = 1:6,
age = c(23, 35, 29, 41, 37, 50),
weight = c(60, 80, 70, 85, 76, 90)
)
# Applying transformations with dplyr
df %>%
filter(age > 30) %>% # Only those older than 30
select(subject, weight) %>% # Select only these two columns
mutate(BMI = weight / (1.70^2)) %>% # Create BMI column
arrange(desc(BMI)) # Order by BMI descending
💡 Note: We use the pipe operator %>%, which is read as "and then".
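The pipeline above does not use group_by() and summarise(); here is a minimal sketch with the same df (age_group is a hypothetical grouping variable created on the fly):
df %>%
  mutate(age_group = ifelse(age > 35, "Over 35", "35 or younger")) %>%  # Create a grouping variable
  group_by(age_group) %>%                                               # Group by it
  summarise(mean_weight = mean(weight),                                 # Aggregate statistics per group
            n = n())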
5.2. tidyr
🔹 Objective: Restructure data frames, making it easier to convert between wide and long formats.
🔹 Key functions:
- pivot_longer(): Converts columns into rows.
- pivot_wider(): Converts rows into columns.
- separate(): Splits a column into multiple columns.
- unite(): Combines multiple columns into one.
# Load the necessary libraries
library(tidyr)
# library(dplyr) # We already loaded it before, but if not, you would need to load it again for the pipeline `%>%` to work.
# We will create a dataset in wide format
# Assume we have sales data by trimester
df_wide <- data.frame(
Producto = c("A", "B", "C"),
Q1 = c(100, 200, 150),
Q2 = c(120, 210, 180),
Q3 = c(130, 190, 160)
)
# Convert the dataset from wide to long format
# This allows us to work with data in a more organized format
df_long <- df_wide %>%
pivot_longer(cols = Q1:Q3, names_to = "Trimestre", values_to = "Ventas")
# We display the dataset in long format
print(df_long)
# Now we will convert the dataset back to wide format
df_wide2 <- df_long %>%
pivot_wider(names_from = "Trimestre", values_from = "Ventas")
# We display the dataset in wide format
print(df_wide2)
# Example of `separate()`: splitting a column into several
# We create a dataset with names in the format "First_Last"
df_names <- data.frame(Nombre = c("Juan_Pérez", "Ana_López", "Carlos_García"))
df_separated <- df_names %>% separate(Nombre, into = c("Primer_Nombre", "Apellido"), sep = "_")
print(df_separated)
# Example of `unite()`: combining columns
df_united <- df_separated %>% unite("Nombre_Completo", Primer_Nombre, Apellido, sep = " ")
print(df_united)
5.3. gtsummary
🔹 Objective: Create summary tables ready for reports or publication.
🔹 Key functions:
- tbl_summary(data, by = variable): Generates a summary table of statistics for each column, differentiating by groups (by).
- add_overall(): Adds a column with the global total.
- add_p(): Calculates and adds p-values for subgroup comparisons.
🔹 Example:
# Load the necessary libraries
library(gtsummary) # For creating summary tables
library(dplyr) # For data manipulation
# Create a sample dataset (only if df is not defined)
set.seed(123) # For reproducibility
df <- data.frame(
subject = 1:100,
age = sample(20:80, 100, replace = TRUE),
weight = sample(50:100, 100, replace = TRUE)
)
# Create a categorical variable based on weight
df <- df %>%
mutate(weight_category = ifelse(weight > 75, "Above 75", "75 or Below"))
# Generate the summary table with gtsummary
tbl_summary(df, by = weight_category) %>% # Group by the new categorical variable
add_overall() %>% # Add the overall total column
add_p() # Calculate and add p-values for subgroup comparisons
💡 Explanation: Here, we divide the sample into weight > 75 and weight <= 75, add the Overall column, and calculate a p-value for subgroup comparisons.
🎯 Congratulations! You have completed Module 5. In the next module, we will explore Data Visualization and Graph Management with ggplot2.
Module 6: Data Visualization and Graph Management with ggplot2
Data visualization is an essential tool in statistical analysis, as it allows us to understand patterns, detect anomalies, and effectively communicate findings.
In this module, we will learn to:
✅ Understand the fundamental concepts of ggplot2.
✅ Create basic and advanced graphs.
✅ Customize visualizations to enhance interpretation.
✅ Explore different types of charts depending on data nature.
6.1 Introduction to ggplot2
ggplot2 is the most widely used R package for data visualization. It is based on the Grammar of Graphics, allowing users to construct plots in a structured and modular way.
The philosophy of ggplot2 is that a plot is built using layers, where each layer adds specific visual information.
Basic Structure of ggplot2
A plot in ggplot2 follows a syntax with the following elements:
ggplot(data, aes(x = variable_x, y = variable_y)) +
geom_type_of_plot() +
other_customization_functions()
Key Components:
- ggplot(data, aes(...)): Defines the dataset and the variables to represent. aes(x = ..., y = ...) specifies which variable is mapped to each axis.
- geom_*() (geometry of the plot): Specifies the type of plot to create, such as geom_point() (scatter plot), geom_bar() (bar chart), geom_boxplot() (boxplot), geom_histogram() (histogram), or geom_line() (line plot).
- Additional functions for customization: labs() adds titles and labels; theme() modifies the appearance of the plot; facet_wrap() splits the plot into panels by a categorical variable; scale_*() customizes colors, sizes, and axes.
A minimal example combining these layers is sketched below.
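The sketch below uses the built-in iris dataset and combines a data layer, an aesthetic mapping, a geometry, and a labels layer:
library(ggplot2)
ggplot(iris, aes(x = Sepal.Length, y = Petal.Length, color = Species)) +  # Data and aesthetic mapping
  geom_point() +                                                          # Geometry layer
  labs(title = "Layered structure of a ggplot2 plot",                     # Customization layer
       x = "Sepal length", y = "Petal length")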
Installation and Loading of the Package
If ggplot2 is not installed, you can install it with:
# install.packages("ggplot2") # Remove the # symbol preceding the command before running the code if you don't have the package installed.
To load it in your R session:
library(ggplot2)
Key Arguments Inside aes()
The aes() function (aesthetic mappings) defines how variables are mapped to visual elements.
| Argument | Description |
|---|---|
| x = | Variable on the X-axis |
| y = | Variable on the Y-axis |
| color = | Color of points/lines by category |
| fill = | Fill color in bars or areas |
| size = | Size of points/lines |
| alpha = | Transparency (1 = solid, 0 = transparent) |
| shape = | Shape of points in scatter plots |
Available geom_*() Types
| Function geom_*() | Description |
|---|---|
| geom_point() | Scatter plot |
| geom_bar() | Bar chart |
| geom_boxplot() | Boxplot (distribution) |
| geom_histogram() | Histogram |
| geom_line() | Line plot |
| geom_smooth() | Trend line |
Customization with labs(), theme(), scale_*()
Once the plot is created, we can enhance it visually:
- labs(): Adds titles and custom labels.
- theme_minimal(), theme_classic(): Changes the appearance of the plot.
- scale_fill_manual(), scale_color_manual(): Modifies colors.
- facet_wrap(~ variable): Splits plots into panels based on a categorical variable.
An example combining several of these options follows.
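A short sketch using the built-in iris dataset (the colors chosen here are arbitrary):
library(ggplot2)
ggplot(iris, aes(x = Sepal.Length, y = Petal.Length, color = Species)) +
  geom_point() +
  facet_wrap(~ Species) +                                              # One panel per species
  scale_color_manual(values = c("steelblue", "tomato", "darkgreen")) + # Manual colors
  theme_classic() +                                                    # Alternative theme
  labs(title = "Sepal vs. Petal Length by Species")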
Basic Structure So Far
✅ ggplot2 allows the creation of layered plots, combining data and visual elements.
✅ The aes() function defines the mapping of variables to graphical elements.
✅ geom_*() specifies the type of plot to use.
✅ Plots can be customized using labs(), theme(), scale_*(), and facet_wrap().
Now that we understand the basic structure of ggplot2, we will explore practical examples in the following sections.
To generate plots, we will use the dat dataset from the NHANES database. This dataset contains information on blood pressure, age, sleep hours, physical activity, and smoking status in a sample of the U.S. population.
# Install and load the necessary libraries. Remember to remove the # symbol before running the code. If the software asks you to restart the session before installing the libraries, do it at least once. This can prevent future conflicts due to the creation of different paths for the libraries.
#install.packages(c("NHANES", "dplyr", "gtsummary", "skimr", "DataExplorer"))
library(NHANES)
library(dplyr)
library(gtsummary)
library(skimr)
library(DataExplorer)
# Select variables of interest
dat <- NHANES::NHANESraw %>%
dplyr::select(BPSysAve, Age, SleepHrsNight, PhysActive,
SmokeNow, Smoke100, Gender, Race1, Poverty, Diabetes)
head(dat)
## Handling missing values
# Count missing values in a specific variable
sum(is.na(dat$Gender))
# Summary of missing values in the entire dataset
colSums(is.na(dat))
# Visualization with DataExplorer
plot_missing(dat)
# Alternative with skimr
skim(dat)
# Convert variables to factors
dat$Gender <- as.factor(dat$Gender)
dat$Race1 <- as.factor(dat$Race1)
# Create a new variable for smoking status
dat <- dat %>%
mutate(SmokingStatus = case_when(
SmokeNow == "Yes" ~ "Current",
Smoke100 == "Yes" ~ "Former",
TRUE ~ "Never"
)) %>%
mutate(SmokingStatus = factor(SmokingStatus))
6.2. Basic Graphs with ggplot2
Next, we will explore different types of charts with practical examples.
6.2.1. Bar Chart
Represents categorical variable distributions.
ggplot(dat, aes(x = Gender)) +
geom_bar(fill = "steelblue") +
labs(title = "Gender Distribution",
x = "Gender",
y = "Frequency") +
theme_minimal()
🔹 Stacked Bar Chart by Category
ggplot(dat, aes(x = Gender, fill = SmokingStatus)) +
geom_bar(position = "stack") +
labs(title = "Gender and Smoking Status") +
theme_minimal()
🔹 Proportional Bar Chart
ggplot(dat, aes(x = Gender, fill = SmokingStatus)) +
geom_bar(position = "fill") +
labs(title = "Proportion of Smoking Status by Gender",
y = "Proportion") +
theme_minimal()
6.2.2. Pie Chart
Although pie charts are not recommended in scientific publications, they can be useful in visual reports.
library(dplyr)
df_diab <- dat %>%
count(Diabetes) %>%
mutate(prop = n / sum(n))
ggplot(df_diab, aes(x = "", y = prop, fill = Diabetes)) +
geom_bar(stat = "identity") +
coord_polar("y") +
theme_void() +
labs(title = "Diabetes Distribution")6.2.3. Boxplot
Compares distributions of a quantitative variable by a categorical variable.
ggplot(dat, aes(x = Gender, y = BPSysAve)) +
geom_boxplot(fill = "tomato", alpha = 0.6) +
labs(title = "Systolic Blood Pressure by Gender",
x = "Gender", y = "Systolic Blood Pressure") +
theme_minimal()
6.2.4. Scatterplot
To analyze the relationship between two numerical variables.
ggplot(dat, aes(x = Age, y = BPSysAve)) +
geom_point(alpha = 0.5, color = "blue") +
geom_smooth(method = "lm", se = FALSE, color = "red") +
labs(title = "Relationship Between Age and Systolic Blood Pressure",
x = "Age", y = "Systolic Blood Pressure") +
theme_minimal()
6.2.5. Histogram and Density Curve
🔹 Histogram
ggplot(dat, aes(x = BPSysAve)) +
geom_histogram(bins = 30, fill = "lightblue", color = "black") +
labs(title = "Histogram of Systolic Blood Pressure",
x = "Systolic Blood Pressure", y = "Frequency") +
theme_minimal()
🔹 Density Curve
ggplot(dat, aes(x = BPSysAve)) +
geom_density(fill = "lightgreen", alpha = 0.5) +
labs(title = "Density of Systolic Blood Pressure") +
theme_minimal()
6.3. Advanced Customization in ggplot2
ggplot(dat, aes(x = Age, y = BPSysAve)) +
geom_point(color = "darkblue") +
theme_classic() +
theme(
text = element_text(size = 12),
axis.text.x = element_text(color = "blue", size = 11),
axis.text.y = element_text(color = "blue", size = 11),
plot.title = element_text(face = "bold", hjust = 0.5)
) +
labs(title = "Relationship Between Age and Systolic Blood Pressure")6.4. Practical Exercise
Perform the following tasks using ggplot2:
✅ Create a bar chart to visualize Race1 distribution.
✅ Generate a boxplot comparing SleepHrsNight by PhysActive.
✅ Build a scatterplot between Age and SleepHrsNight with a trend line.
✅ Apply customization to the graphs.
🔹 Module Summary
✅ Understood the grammar of ggplot2.
✅ Learned to create bar charts, boxplots, histograms, and scatter plots.
✅ Explored how to customize graphs for better communication.
✅ Completed practical exercises to reinforce learning.
🎯 Congratulations! You have completed Module 6. In the next module, we will explore Exploratory Data Analysis (EDA).
Module 7: Exploratory Data Analysis (EDA)
Exploratory Data Analysis (EDA) is a fundamental step in any statistical or machine learning analysis. Its purpose is to understand the structure of the data, identify patterns, detect outliers, and verify data quality before applying models.
This module will guide students through a structured EDA strategy and provide practical examples using R tools.
7.1. Exploratory Data Analysis Strategy
A well-structured exploratory analysis follows logical steps to assess data quality and relationships:
1️⃣ Identifying Missing Values (NAs)
Missing values can impact analysis results. It is crucial to detect them and decide how to handle them (removal, imputation, etc.).
# Loading a built-in dataset
df_iris <- iris # In this line, we are copying the iris example dataset available in R to a new object called df_iris.
# If you have a dataset saved in Excel format, you can use read_excel:
# df_your_data <- readxl::read_excel("path_to_your_file.xlsx")
# If you have a dataset saved in CSV format, you can use read_csv:
# df_your_data <- readr::read_csv("path_to_your_file.csv")
# Make sure to replace "path_to_your_file" with the actual path to your data file.
colSums(is.na(df_iris))
plot_missing(df_iris)
Note: While we're using the built-in iris dataset for convenience, you may want to load your own data if you have it stored elsewhere. For example, if your data is saved in an Excel file, you can use functions like readxl::read_excel("path_to_file.xlsx"). Similarly, if you have a CSV file, you might use readr::read_csv("path_to_file.csv"). In both cases, you'll need to specify the correct file path so R can locate and load the dataset.
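Once missing values are located, two common options are removal and simple imputation. A small sketch (df_iris actually has no missing values, so these lines change nothing here):
# 1) Remove every row that contains at least one missing value
df_complete <- na.omit(df_iris)
# 2) Impute missing values of a numeric column with its median
df_iris$Sepal.Length[is.na(df_iris$Sepal.Length)] <-
  median(df_iris$Sepal.Length, na.rm = TRUE)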
2️⃣ Univariate Analysis: Variable Distributions
To understand the distribution of a variable, we use: - Numerical variables: Histograms and boxplots. - Categorical variables: Frequency tables and bar charts.
We can also create new variables based on existing ones for analysis or to generate tables to compare groups in our data.
# Histograms for numerical variables
plot_histogram(df_iris)
# Boxplots by category
plot_boxplot(df_iris, by = "Species")
# Create a dummy variable to simulate a binary outcome
# The 'Outcome' variable will take the value 'High' if Sepal.Length > 5.8 and 'Low' otherwise.
# This allows us to analyze the distribution of the categorical variable in relation to the species.
df_iris$Outcome <- ifelse(df_iris$Sepal.Length > 5.8, "High", "Low")
# Convert the variable into a factor for categorical analysis
# This facilitates its use in frequency tables and graphs.
df_iris$Outcome <- as.factor(df_iris$Outcome)
# Bar chart with percentages by outcome (this is more a bivariate analysis, but it's useful to understand the distribution of the outcome variable by species)
library(ggplot2)
ggplot(df_iris, aes(x = Species, fill = Outcome)) +
geom_bar(position = "fill") +
scale_y_continuous(labels = scales::percent) +
labs(title = "Outcome Distribution by Species",
x = "Species",
y = "Percentage",
fill = "Outcome") +
theme_minimal()
# Frequency tables
# Count table
table(df_iris$Species)
# Proportion table
prop.table(table(df_iris$Species), margin = 1)
# Frequency tables by outcome (this is more a bivariate analysis, but it's useful to understand the distribution of the outcome variable by species)
# Count table
table(df_iris$Species, df_iris$Outcome)
# Proportion table by outcome (this is more a bivariate analysis, but it's useful to understand the distribution of the outcome variable by species)
prop.table(table(df_iris$Species, df_iris$Outcome), margin = 1)
3️⃣ Bivariate Analysis: Relationships Between Variables
In step 2, we showed how to run some bivariate analysis to understand the distribution of the outcome variable by species. We can also explore relationships between different variables using:
- Numerical variables: Correlation matrices, scatterplots.
- Categorical variables: Contingency tables, independence tests.
- Numerical vs. Categorical: Boxplots by category, statistical tests.
# Bivariate Analysis: Relationships Between Variables
## 1 Correlation Matrix for Numerical Variables
# Removing the 'Species' column to avoid confusion (since it's categorical)
df_numeric <- df_iris[, sapply(df_iris, is.numeric)] # Keeps only numeric columns
# Visualizing correlation matrix without categorical variables
plot_correlation(df_numeric)
# Alternatively, using cor() to compute the correlation matrix numerically
cor(df_numeric, use = "complete.obs")
## 2 Scatterplot to Visualize Relationship Between Sepal Length and Sepal Width
library(ggplot2)
ggplot(df_iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE) + # Adds a linear regression line
labs(title = "Scatterplot of Sepal Length vs. Sepal Width",
x = "Sepal Length",
y = "Sepal Width",
color = "Species") +
theme_minimal()
## 3 Boxplot: Numerical vs. Categorical Variable
# Comparing Sepal.Length distribution across Species
ggplot(df_iris, aes(x = Species, y = Sepal.Length, fill = Species)) +
geom_boxplot() +
labs(title = "Boxplot of Sepal Length by Species",
x = "Species",
y = "Sepal Length") +
theme_minimal()
# Comparing Sepal.Width distribution across Outcome categories
ggplot(df_iris, aes(x = Outcome, y = Sepal.Width, fill = Outcome)) +
geom_boxplot() +
labs(title = "Boxplot of Sepal Width by Outcome",
x = "Outcome",
y = "Sepal Width") +
theme_minimal()
## 4. Contingency Table for Categorical Variables
# Analyzing the relationship between Species and Outcome
table(df_iris$Species, df_iris$Outcome)
# Convert table to proportions (row-wise percentage)
prop.table(table(df_iris$Species, df_iris$Outcome), margin = 1)
# Visualizing categorical relationships with a stacked bar chart
ggplot(df_iris, aes(x = Species, fill = Outcome)) +
geom_bar(position = "stack") +
labs(title = "Stacked Bar Chart of Outcome by Species",
x = "Species",
y = "Count",
fill = "Outcome") +
theme_minimal()
# Chi-square test for independence between Species and Outcome
chisq.test(table(df_iris$Species, df_iris$Outcome))
4️⃣ Detecting Outliers and Inconsistent Data
Extreme values (outliers) can significantly impact statistical analyses. Identifying them is crucial for deciding whether to remove, transform, or keep them based on domain knowledge.
Outliers are typically detected using:
- Boxplots, which provide a visual summary of distributions.
- Tukey’s Criterion, which defines outliers as values outside 1.5 times the interquartile range (IQR).
- Density Plots, which can reveal extreme values and multimodal distributions.
Another type of graph we can use is the violin plot, which is similar to a boxplot but also shows the probability density of the data at different values. In the example below, we will use a violin plot to show the distribution of Sepal Length by Species, and we will highlight the outliers in a scatter plot.
# Load necessary library
library(ggplot2)
# Violin plot with outliers highlighted
ggplot(df_iris, aes(x = Species, y = Sepal.Length, fill = Species)) +
geom_violin(alpha = 0.5) + # Adds violin plot for better distribution visualization
geom_jitter(width = 0.1, aes(color = Species), size = 1.5, alpha = 0.7) + # Adds individual data points
geom_boxplot(width = 0.2, outlier.shape = NA, alpha = 0.6) + # Adds boxplot without outliers
labs(title = "Distribution of Sepal Length by Species",
x = "Species",
y = "Sepal Length") +
theme_minimal()
# Calculate the interquartile range (IQR)
Q1 <- quantile(df_iris$Sepal.Length, 0.25)
Q3 <- quantile(df_iris$Sepal.Length, 0.75)
IQR <- Q3 - Q1
# Define lower and upper bounds for outliers
lower_bound <- Q1 - 1.5 * IQR
upper_bound <- Q3 + 1.5 * IQR
# Identify outliers
outliers <- df_iris %>%
filter(Sepal.Length < lower_bound | Sepal.Length > upper_bound)
# Display detected outliers
print(outliers)
# Highlight outliers in a scatter plot
ggplot(df_iris, aes(x = Species, y = Sepal.Length, color = Sepal.Length < lower_bound | Sepal.Length > upper_bound)) +
geom_jitter(width = 0.2, size = 2) +
scale_color_manual(values = c("black", "red"), labels = c("Normal", "Outlier")) +
labs(title = "Outlier Detection in Sepal Length",
x = "Species",
y = "Sepal Length",
color = "Status") +
theme_minimal()
🔹 Module Summary
✅ EDA is key before any analysis or modeling.
✅ We assess data quality (missing values, outliers, inconsistencies).
✅ We analyze distributions and relationships between variables (correlations, frequency tables).
✅ We use tools like DataExplorer, skimr, dplyr, and gtsummary for a quick and effective analysis.
✅ Practical exercise: Apply EDA to another dataset to reinforce learning.
🎯 Congratulations! You have completed Module 7. In the next module, we will explore Descriptive and Analytical Statistical Analysis in R.
Module 8: Descriptive and Analytical Statistical Analysis
Descriptive and analytical statistical analysis allows us to summarize, explore, and compare data before applying more complex models. In this module, we will learn how to:
✅ Import and manage data.
✅ Handle missing data.
✅ Create and recode variables.
✅ Perform univariate descriptive analysis.
✅ Compare subgroups through bivariate analysis.
✅ Build customized descriptive tables with gtsummary.
For this module, we will use the same dat dataset from the NHANES database that we used when we reviewed the ggplot2 functions. This dataset contains information on blood pressure, age, sleep hours, physical activity, and smoking status in a sample of the U.S. population.
# Install and load the necessary libraries. Remember to remove the # symbol before running the code. If the software asks you to restart the session before installing the libraries, do it at least once. This can prevent future conflicts due to the creation of different paths for the libraries.
#install.packages(c("NHANES", "dplyr", "gtsummary", "skimr", "DataExplorer"))
library(NHANES)
library(dplyr)
library(gtsummary)
library(skimr)
library(DataExplorer)
# Select variables of interest
dat <- NHANES::NHANESraw %>%
dplyr::select(BPSysAve, Age, SleepHrsNight, PhysActive,
SmokeNow, Smoke100, Gender, Race1, Poverty, Diabetes)
head(dat)
8.1. Reminder: Data Importation
For this course, we will work with dat from NHANES. However, in real life, you can import your own data using the following functions:
# Install and load the necessary libraries
# install.packages("readxl")
# library(readxl)
# Import your excel files
# df_excel <- read_excel("data/example.xlsx") # You should replace "data/example.xlsx" with the path and name of your file.
8.2. Handling Missing Data
Missing values can impact analysis. First, we identify them:
# Count missing values in a specific variable
sum(is.na(dat$Gender))
# Summary of missing values in the entire dataset
colSums(is.na(dat))
# Visualization with DataExplorer
plot_missing(dat)
# Alternative with skimr
skim(dat)
If there are missing data, we can remove them (na.omit(dat)) or impute them using the median:
dat <- dat %>%
mutate(SleepHrsNight = ifelse(is.na(SleepHrsNight), median(SleepHrsNight, na.rm = TRUE), SleepHrsNight))
8.3. Creating and Recoding Variables
Converting Categorical Variables
dat$Gender <- as.factor(dat$Gender)
dat$Race1 <- as.factor(dat$Race1)
Creating a Smoking Status Variable
dat <- dat %>%
mutate(SmokingStatus = case_when(
SmokeNow == "Yes" ~ "Current",
Smoke100 == "Yes" ~ "Former",
TRUE ~ "Never"
)) %>%
mutate(SmokingStatus = factor(SmokingStatus))
8.4. Univariate Descriptive Analysis
Frequencies and Basic Statistics
# Gender Frequency
table(dat$Gender)
# Smoking Status Frequency
table(dat$SmokingStatus)
# Summary of Age
summary(dat$Age)
# Mean and standard deviation of systolic blood pressure
mean(dat$BPSysAve, na.rm = TRUE)
sd(dat$BPSysAve, na.rm = TRUE)
8.5. Bivariate Analysis
Comparison of a Numerical Variable Between Groups (t-test)
t.test(Age ~ Gender, data = dat)
ANOVA for More Than Two Groups
anova_model <- aov(Age ~ Race1, data = dat)
summary(anova_model)
Contingency Tables and Chi-Square Test
tab <- table(dat$SmokingStatus, dat$Gender)
chisq.test(tab)
xtabs(~ SmokingStatus + Gender, data = dat)
8.6. Descriptive Analysis with gtsummary
Univariate Summary
# View the first rows of the data
head(dat)
# Create a summary table with gtsummary
tbl1 <- dat %>%
dplyr::select(BPSysAve, Age, Gender, Diabetes, SmokingStatus) %>%
tbl_summary()
# You should be able to see a table with the summary statistics for the selected variables in your computer running the following command:
# tbl1
Comparison by Group
tbl2 <- dat %>%
select(BPSysAve, Age, Gender, Diabetes, SmokingStatus) %>%
tbl_summary(
by = Gender,
missing = "ifany",
statistic = list(
all_continuous() ~ "{mean} ({sd})",
all_categorical() ~ "{n} / {N} ({p}%)"
),
label = list(
BPSysAve ~ "Blood Pressure Sys. Ave",
SmokingStatus ~ "Smoking Status"
)
) %>%
add_overall() %>%
add_p() %>%
modify_header(label ~ "**Variable**")
# You should be able to see this table in your computer running the following command.
# tbl2
🔥 Exercise 8.1: Application to Another Dataset
data("airquality")
df_airquality <- airquality
# Step 1: Identify missing values
plot_missing(df_airquality)
# Step 2: Convert Temperature into a Category
df_airquality <- df_airquality %>%
mutate(TempCat = ifelse(Temp > median(Temp, na.rm = TRUE), "High", "Low"))
# Step 3: Comparison of Ozone by TempCat
t.test(Ozone ~ TempCat, data = df_airquality) # Rows with missing Ozone values are dropped automatically
# Step 4: Descriptive Table
tbl_summary(df_airquality, by = "TempCat")
🔹 Module Summary
✅ We explored missing data and how to handle them.
✅ We created and recoded variables to improve interpretation.
✅ We performed univariate and bivariate descriptive analysis.
✅ We generated professional tables with gtsummary.
✅ Practical exercise to reinforce learning.
🎯 Congratulations! You have completed Module 8. In the next module, we will explore Fundamentals of Statistical Modeling.
Module 9: Fundamentals of Statistical Modeling
Before moving on to more advanced machine learning techniques, it is essential to understand the principles of statistical modeling. In this module, we will explore different types of regression, their applications, assumptions, and how to implement them in R.
9.1. Linear Regression: Modeling Continuous Relationships
🔹 Concept and Application Linear regression is one of the simplest and most widely used models in statistics and machine learning. It is used when the dependent (response) variable is continuous and is assumed to have a linear relationship with one or more predictor variables.
The linear regression model equation is: \[ Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p + \varepsilon \]
Where:
- \( \beta_0 \) is the intercept, the value of \( Y \) when all variables \( X_i \) are 0.
- \( \beta_i \) represents the change in \( Y \) per unit change in \( X_i \), assuming all other variables remain constant.
- \( \varepsilon \) is the random error term, assumed to be normally distributed with mean 0.
🔹 Key Assumptions
1. Linearity: The relationship between the response variable and the predictors is linear.
2. Independence: Observations are independent of each other.
3. Homoscedasticity: The variance of residuals is constant.
4. Normality of Errors: Residuals should follow a normal distribution for valid inferences.
🔹 Implementation in R
# Load data
data <- read.csv("blood_pressure.csv")
# Fit model
mod_lin <- lm(SystolicPressure ~ Age + BMI + Gender, data = data)
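# Optional check of the assumptions listed above (sketch):
# plot() on an lm object shows residuals vs. fitted values, a Q-Q plot of residuals, etc.
par(mfrow = c(2, 2))
plot(mod_lin)
par(mfrow = c(1, 1))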
# Model summary
summary(mod_lin)
9.2. Model Evaluation: Mean Squared Error (MSE)
Mean Squared Error (MSE) measures the average squared differences between actual and predicted values: \[ MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \]
- A lower MSE indicates a better fit.
- A very low MSE may indicate overfitting.
🔹 Calculation in R
# Calculate MSE
mse <- mean((data$SystolicPressure - predict(mod_lin))^2)
print(mse)
9.3. Logistic Regression: Modeling Binary Categorical Variables
🔹 Concept and Application When the response variable is binary (e.g., presence or absence of a disease), logistic regression models the probability of an event occurring using the logistic function: \[ P(Y=1) = \frac{e^{\beta_0 + \beta_1 X_1 + \dots + \beta_p X_p}}{1 + e^{\beta_0 + \beta_1 X_1 + \dots + \beta_p X_p}} \]
🔹 Example in R: Predicting Diabetes
data$Diabetes <- factor(data$Diabetes, levels = c("No", "Yes"))
mod_logit <- glm(Diabetes ~ Age + BMI + Gender, data = data, family = binomial)
summary(mod_logit)
# Presentation with gtsummary
library(gtsummary)
tbl_regression(mod_logit, exponentiate = TRUE)
9.4. Multinomial Regression: Variables with More than Two Categories
If the response variable has more than two categories, we use multinomial regression.
library(nnet)
data$SmokingStatus <- relevel(data$SmokingStatus, ref = "Never")
mod_multinom <- multinom(SmokingStatus ~ Age + BMI, data = data)
summary(mod_multinom)
9.5. Poisson Regression: Modeling Count Data
For count data (e.g., number of hospital visits), Poisson regression is used:
mod_pois <- glm(HospitalVisits ~ Age + BMI, data = data, family = poisson(link = "log"))
summary(mod_pois)
To obtain prevalence ratios (PR) with robust estimates, use:
library(sandwich)
library(lmtest)
mod_pois_robust <- glm(Diabetes ~ Age + BMI, data = data, family = poisson(link = "log"))
coeftest(mod_pois_robust, vcov = vcovHC(mod_pois_robust, type = "HC0"))
9.6. Cox Regression: Survival Analysis
To analyze time-to-event data (e.g., time until hospitalization), we use Cox regression:
library(survival)
# Simulating survival data
data$Time <- round(runif(nrow(data), 1, 1000))
data$Event <- rbinom(nrow(data), 1, 0.3)
mod_cox <- coxph(Surv(Time, Event) ~ Age + BMI, data = data)
summary(mod_cox)
9.7. Cross-Validation: Evaluating Generalization
🔹 Concept and Application Cross-validation evaluates the predictive capacity of a model by repeatedly splitting the data into training and validation subsets, rather than relying on a single train/test split.
🔹 Implementation in R
library(caret)
ctrl <- trainControl(method = "cv", number = 10)
model_cv <- train(SystolicPressure ~ Age + BMI + Gender, data = data, method = "lm", trControl = ctrl)
print(model_cv)
🎯 Congratulations! You have completed Module 9. In the next module, we will explore Introduction to Machine Learning Libraries and Hyperparameter Tuning.
Module 10: Fundamentals of Model Selection and Machine Learning Algorithms
Before optimizing hyperparameters (fine-tuning), it is essential to understand machine learning models and their foundations. This module covers model selection, penalized regressions, decision trees, and ensemble models.
10.1 Introduction
Understanding different approaches to model selection is key to building efficient models and reducing overfitting.
10.2 Model and Variable Selection
10.2.1 Approaches to Variable Selection
- Best Subset Selection: Tests all possible combinations of variables and selects the best one.
- Forward Selection: Starts with one variable and progressively adds the one that improves the model.
- Backward Selection: Begins with all variables and removes the least relevant ones.
Example in R
library(leaps)
bestmodels <- regsubsets(BMI ~ ., data = df, nvmax = 8)
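# 'df' is assumed to be a data frame containing a numeric BMI column and its candidate predictors.
# Forward and backward selection use the same function through the 'method' argument (sketch):
fwd_models <- regsubsets(BMI ~ ., data = df, nvmax = 8, method = "forward")
bwd_models <- regsubsets(BMI ~ ., data = df, nvmax = 8, method = "backward")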
summary(bestmodels)
10.3 Penalized Regressions: Ridge and Lasso
10.3.1 Concept of Regularization
Regularization reduces overfitting by adding a penalty term to the loss function.
- Lasso (L1): Encourages variable selection.
- Ridge (L2): Reduces variance without eliminating variables.
10.3.2 Implementation in R
library(glmnet)
x <- model.matrix(BMI ~ ., df)[, -1]
y <- df$BMI
lasso.mod <- glmnet(x, y, alpha = 1)
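# Ridge regression (L2) is the same call with alpha = 0 (sketch)
ridge.mod <- glmnet(x, y, alpha = 0)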
plot(lasso.mod)
10.4 Decision Trees and CART
Decision trees iteratively split data based on the variable that maximizes separation.
10.4.1 Key Hyperparameters
- Max Depth: Depth of the tree.
- Min Samples Split: Minimum number of observations to split a node.
- Criterion: Splitting metric (Gini or entropy).
10.4.2 Implementation in R
library(rpart)
modelo_cart <- rpart(BMI ~ ., data = df, method = "anova")
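# The hyperparameters listed above can be set through rpart.control() (sketch)
modelo_cart_tuned <- rpart(BMI ~ ., data = df, method = "anova",
                           control = rpart.control(maxdepth = 4, minsplit = 20))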
plot(modelo_cart); text(modelo_cart)
10.5 Ensemble Models: Random Forest and Boosting
10.5.1 Random Forest
Combines multiple trees to improve stability.
library(randomForest)
modelo_rf <- randomForest(BMI ~ ., data = df, ntree = 500)
importance(modelo_rf)
10.5.2 Boosting (XGBoost)
library(xgboost)
dtrain <- xgb.DMatrix(data = x, label = y)
modelo_xgb <- xgboost(data = dtrain, nrounds = 100, max_depth = 3, eta = 0.1)
10.6 Fundamentals of Model Optimization
10.6.1 Cross-Validation
Evaluates model performance on different data partitions.
library(caret)
control <- trainControl(method = "cv", number = 5)
modelo_cv <- train(BMI ~ ., data = df, method = "rf", trControl = control)
10.6.2 Evaluation Metrics
- Regression: Mean Squared Error (MSE), R².
- Classification: Accuracy, F1-score, ROC-AUC.
pred <- predict(modelo_rf, df)
mean((df$BMI - pred)^2) # MSE
10.7 Conclusion
This module provides the foundation for understanding machine learning models before fine-tuning hyperparameters. In the next module, we will explore advanced fine-tuning strategies to improve model performance.
Module 11: Fine-Tuning and Optimization of Machine Learning Models
11.1 Introduction
Fine-tuning is the process of optimizing hyperparameters to improve model performance. In this module, we will explore methods for selecting optimal values and how R libraries implement this optimization.
11.2 Importance of Hyperparameter Selection
Hyperparameters significantly impact model performance and must be correctly adjusted. Common methods include:
- Grid Search: Evaluates all possible combinations within a defined range.
- Random Search: Selects random combinations and evaluates performance.
- Bayesian Optimization: Models the loss function and iteratively adjusts hyperparameters.
11.3 Methods for Model Tuning
11.3.1 Cross-Validation
Cross-validation assesses performance on different data partitions.
library(caret)
control <- trainControl(method = "cv", number = 5)
model_cv <- train(BMI ~ ., data = df, method = "rf", trControl = control)
Here, trainControl defines the cross-validation scheme, and train fits the model with the defined parameters.
11.3.2 Best Model Selection with Grid Search
param_grid <- expand.grid(mtry = c(2, 4, 6, 8),
                          splitrule = "variance",
                          min.node.size = c(1, 3, 5))
model_tuned <- train(BMI ~ ., data = df, method = "ranger",
                     trControl = control, tuneGrid = param_grid)
Here, expand.grid generates combinations of hyperparameters, and tuneGrid defines the grid explored by train. Note that caret's "rf" method only tunes mtry; the "ranger" method is used here because its grid accepts mtry, splitrule, and min.node.size.
11.3.3 Selection with Random Search
model_random <- train(BMI ~ ., data = df, method = "rf",
                      trControl = trainControl(method = "cv", number = 5, search = "random"),
                      tuneLength = 10)
Here, setting search = "random" in trainControl enables random search, and tuneLength = 10 means 10 random hyperparameter combinations will be tested.
11.3.4 Bayesian Optimization with mlr3mbo
library(mlr3)
library(mlr3tuning)
library(mlr3learners) # Provides the "classif.ranger" learner
library(mlr3mbo)      # Provides the "mbo" (Bayesian optimization) tuner
library(paradox)      # Provides ps() and p_int() for the search space
# 'task' is assumed to be a previously defined mlr3 classification task
learner <- lrn("classif.ranger", predict_type = "prob")
search_space <- ps(mtry = p_int(2, 8),
                   min.node.size = p_int(1, 10))
instance <- TuningInstanceSingleCrit$new(task = task,
                                         learner = learner,
                                         resampling = rsmp("cv", folds = 5),
                                         measure = msr("classif.auc"),
                                         search_space = search_space,
                                         terminator = trm("evals", n_evals = 20))
tuner <- tnr("mbo")
tuner$optimize(instance)
Here, mlr3tuning uses Bayesian optimization to find the best combination of hyperparameters.
11.4 Hyperparameter Selection in Different Models
11.4.1 Penalized Regressions (Lasso and Ridge)
library(glmnet)
x <- model.matrix(BMI ~ ., df)[, -1]
y <- df$BMI
cv_model <- cv.glmnet(x, y, alpha = 1)
best_lambda <- cv_model$lambda.min
Here, cv.glmnet uses cross-validation to select the best lambda value.
11.4.2 Decision Trees and Random Forest
library(randomForest)
model_rf <- randomForest(BMI ~ ., data = df, ntree = 500, mtry = 4)
model_rf$importance
Here, mtry defines the number of predictors considered at each split and can be optimized using cross-validation.
11.4.3 XGBoost with Parameter Optimization
library(xgboost)
dtrain <- xgb.DMatrix(data = as.matrix(df[, -1]), label = df$BMI)
best_model <- xgb.cv(params = list(eta = 0.1, max_depth = 3),
data = dtrain, nrounds = 50, nfold = 5,
                   objective = "reg:squarederror")
Here, xgb.cv evaluates the chosen eta and max_depth values by cross-validation; to tune them, repeat xgb.cv over a grid of candidate values and keep the best-performing combination.
11.5 Model Evaluation
Metrics depend on the type of problem:
- Regression: MSE, R².
- Classification: Accuracy, F1-score, ROC-AUC.
pred <- predict(model_rf, df)
mse <- mean((df$BMI - pred)^2) # MSE Calculation
11.6 Conclusion
This module provides an in-depth look at hyperparameter selection and tuning using various strategies. Depending on the model and data, different techniques may be more effective in optimizing model performance.
🎯 Congratulations! You have completed Module 11. In the next module, we will explore Model Stacking and Super Learner.
Module 12: Model Stacking and Super Learner
Stacking is an ensemble technique that combines predictions from multiple base models (such as LASSO regression, Random Forest, XGBoost) and learns an optimal combination of their predictions instead of selecting a single “best model.”
Ensembles often outperform individual models as they leverage strengths and reduce weaknesses. In this module, we will explore Super Learner, an implementation of this technique in R, using the NHANES dataset.
12.1. What is Super Learner?
Super Learner is a statistical learning method based on internal cross-validation, selecting and combining multiple predictive models (learners) to enhance the final model’s performance.
🔹 Differences between Discrete Super Learner and Super Learner
Discrete Super Learner:
- Compares the performance of several models and selects only one: the best-performing one in cross-validation.
- This is a model selection approach rather than a combination approach.
- Useful for interpretability but may not always provide the best predictive performance.
Super Learner:
- Does not rely on a single model but instead combines multiple models using a meta-learner (typically a weighted regression).
- Uses cross-validation to assign optimal weights to each model, usually outperforming any individual model.
- It is an adaptive ensemble technique that adjusts the combination based on the data.
👉 Conclusion: Super Learner is generally more powerful than Discrete Super Learner, as it leverages information from multiple models instead of relying on just one.
12.2. Components of a Super Learner
The SuperLearner package in R facilitates stacking by incorporating three main components:
- Base models (learners): Individual models that are trained and whose predictions are combined (e.g., SL.lm, SL.glmnet, SL.ranger, SL.xgboost).
- Meta-learner: A final model that learns how to combine the predictions from base models. This can be a linear regression, nnls (Non-Negative Least Squares), or a more complex model.
- Internal cross-validation: Super Learner partitions the data into multiple subsets to estimate the performance of each model before combining them.
12.3. Implementation in NHANES: Predicting Blood Pressure
📌 Step 1: Load Packages and NHANES Dataset
# Install and load required libraries
install.packages(c("SuperLearner", "tidyverse", "nhanesA", "glmnet", "ranger", "xgboost"))
library(SuperLearner)
library(tidyverse)
library(nhanesA)
library(glmnet)
library(ranger)
library(xgboost)
# Download and prepare NHANES dataset
df <- nhanes("BPX") # Extract blood pressure data
# Select relevant variables
df <- df %>%
select(BPXDI1, Age, Gender, BMI, Race) %>%
drop_na() # Remove missing values
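# Note (assumption): the BPX table alone does not contain Age, Gender, BMI, or Race.
# In practice, these variables come from other NHANES tables (e.g., DEMO, BMX) and would be
# merged with BPX by the respondent identifier (SEQN) before this select() step.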
# Define response variable and predictors
Y <- df$BPXDI1 # Diastolic blood pressure
X <- df %>% select(-BPXDI1) # Predictor variables
📌 Step 2: Define Base Models (Learners)
SLlist <- c("SL.lm", "SL.glmnet", "SL.ranger", "SL.xgboost")📌 Step 3: Train the Super Learner
set.seed(123)
SLres <- SuperLearner(
Y = Y,
X = X,
SL.library = SLlist,
family = gaussian()
)
print(SLres)
📌 Step 4: Analyze Model Combination
The output will show weights assigned to each model. A model with a high weight contributes more to the final prediction.
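To inspect those weights and the cross-validated risk of each learner directly, the fitted object exposes coef and cvRisk (a short sketch):
# Weights assigned to each base learner and their estimated cross-validated risks
SLres$coef
SLres$cvRisk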
12.4. Tuning Hyperparameters in Base Models
We can improve base model performance by tuning their hyperparameters before combining them in the Super Learner.
📌 Example: Tuning Hyperparameters in LASSO and Random Forest
# Create customized versions of base models with tuned hyperparameters
SL.lasso <- create.Learner("SL.glmnet", tune = list(alpha = 1, lambda = 0.01))
SL.rf <- create.Learner("SL.ranger", tune = list(num.trees = 500, mtry = 2))
# Define new set of learners
SLlist_tuned <- c(SL.lasso$names, SL.rf$names, "SL.xgboost")
# Train the Super Learner with tuned models
set.seed(123)
SLres_tuned <- SuperLearner(
Y = Y,
X = X,
SL.library = SLlist_tuned,
family = gaussian()
)
print(SLres_tuned)
12.5. Evaluating Super Learner with Cross-Validation
To estimate generalization error, we use cross-validation with CV.SuperLearner.
SLcv <- CV.SuperLearner(
Y = Y,
X = X,
SL.library = SLlist_tuned,
family = gaussian(),
cvControl = list(V = 5) # 5-fold CV
)
# Compute Mean Squared Error (MSE) of Super Learner
mse <- mean((Y - SLcv$SL.predict)^2)
print(mse)
A lower MSE indicates a better model fit.
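Beyond a single MSE, summary() on a CV.SuperLearner object compares the Super Learner, the discrete Super Learner, and each base learner (a short sketch):
# Cross-validated risk of the ensemble versus each individual learner
summary(SLcv)
# plot(SLcv)  # Optional graphical comparison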
12.6. Meta-Learner with nnls
The meta-learner learns to combine predictions from base models. One of the most used is nnls (Non-Negative Least Squares), which imposes positive coefficient constraints.
SLres_nnls <- SuperLearner(
Y = Y,
X = X,
SL.library = SLlist_tuned,
method = "method.NNLS",
family = gaussian()
)
print(SLres_nnls)
Conclusions and Final Thoughts
✅ Super Learner combines multiple models to improve predictive accuracy.
✅ Tuning hyperparameters in base models enhances their performance within the ensemble.
✅ The meta-learner optimizes prediction combinations, and nnls ensures non-negative coefficients.
✅ Cross-validation with CV.SuperLearner helps estimate the model’s predictive capability.
Module 13: Complete Integrated Example (From EDA to SuperLearner)
This module aims to consolidate the knowledge acquired throughout the course by applying the entire analysis workflow, from data exploration to creating a predictive model using SuperLearner. The NHANES dataset will be used.
13.1. Data Loading and Exploration
1. Import Libraries and Data
# Install and load required libraries
install.packages(c("tidyverse", "nhanesA", "DataExplorer", "skimr", "gtsummary", "SuperLearner"))
library(tidyverse)
library(nhanesA)
library(DataExplorer)
library(skimr)
library(gtsummary)
library(SuperLearner)
# Load data from NHANES
nhanes_data <- nhanes("BPX")
2. Select and Review Relevant Variables
# Select relevant variables
nhanes_df <- nhanes_data %>%
select(BPXDI1, Age, Gender, BMI, Race, Diabetes) %>%
drop_na()
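# Note (assumption): as in Module 12, Age, Gender, BMI, Race, and Diabetes are not part of the
# BPX table itself; they would need to be merged in from other NHANES tables (e.g., DEMO) by SEQN.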
# Data structure and summary
str(nhanes_df)
skim(nhanes_df)
create_report(nhanes_df)
13.2. Data Preprocessing
3. Convert Categorical Variables
nhanes_df <- nhanes_df %>%
mutate(Gender = as.factor(Gender),
Race = as.factor(Race),
Diabetes = as.factor(Diabetes))
4. Missing Values Analysis
plot_missing(nhanes_df)
5. Descriptive Analysis
nhanes_df %>%
tbl_summary(by = Gender) %>%
add_overall() %>%
add_p()
13.3. Data Visualization with ggplot2
6. Blood Pressure Distribution
ggplot(nhanes_df, aes(x = BPXDI1)) +
geom_histogram(bins = 30, fill = "steelblue", color = "black") +
labs(title = "Distribution of Diastolic Blood Pressure")7. Relationship Between Age and Blood Pressure
ggplot(nhanes_df, aes(x = Age, y = BPXDI1)) +
geom_point(alpha = 0.5) +
geom_smooth(method = "lm", se = FALSE, color = "red") +
labs(title = "Relationship Between Age and Blood Pressure")13.4. Statistical Modeling
8. Linear Regression
mod_lin <- lm(BPXDI1 ~ Age + BMI + Gender + Race, data = nhanes_df)
summary(mod_lin)
9. Logistic Regression (Diabetes Prediction)
mod_logit <- glm(Diabetes ~ Age + BMI + Gender + Race, data = nhanes_df, family = binomial)
tbl_regression(mod_logit, exponentiate = TRUE)
13.5. Introduction to Machine Learning and Super Learner
10. Define Base Models
SLlist <- c("SL.lm", "SL.glmnet", "SL.ranger", "SL.xgboost")11. Create the Super Learner
set.seed(123)
SLres <- SuperLearner(
Y = nhanes_df$BPXDI1,
X = nhanes_df %>% select(-BPXDI1),
SL.library = SLlist,
family = gaussian()
)
print(SLres)
12. Cross-Validation of the Super Learner
SLcv <- CV.SuperLearner(
Y = nhanes_df$BPXDI1,
X = nhanes_df %>% select(-BPXDI1),
SL.library = SLlist,
family = gaussian(),
cvControl = list(V = 5)
)
# Calculate MSE
mse <- mean((nhanes_df$BPXDI1 - SLcv$SL.predict)^2)
print(mse)
13.6. Final Reflection
- Which model performed best? Compare the Super Learner’s error with individual models.
- What other variables could improve the model? Identify potential additional predictors.
- How could the model fit be improved? Hyperparameter tuning and selection of better learners.
🚀📊 “Congratulations on completing this journey through advanced epidemiology with R. You now have the tools to explore data, build models, and make evidence-based decisions. Knowledge grows when applied, so keep exploring, analyzing, and creating!” 🚀📊