EPI 553 - Introduction to R Programming

Author

Muntasir Masum

Published

January 22, 2026

1 Introduction

1.1 Course Overview

R is a powerful programming language and environment for statistical computing and data visualization. In this course, you will learn:

  • How to interact with R using RStudio
  • How to create and manipulate R objects
  • How to work with different data types and structures
  • How to load and work with real datasets

This lecture provides a foundational introduction to R programming that you’ll use throughout the course.

Learning Objectives: By the end of this lecture, you should be able to:

  • Use RStudio effectively
  • Create and manipulate R objects
  • Understand R data types and structures
  • Write and run R code in scripts
  • Load and explore datasets

2 R Programming Fundamentals

2.1 What is R?

R is:

  • A programming language for statistical computing and data visualization
  • Open-source and free software
  • An interpreted language: R runs your code as you submit it, with no separate compilation step
  • Platform independent (Windows, Mac, Linux)

RStudio is an integrated development environment (IDE) for R that makes it much easier to use R. We will focus on using R through RStudio.

2.2 Why Learn R?

R is a standard tool for data analysis and statistical computing in:

  • Academic research
  • Data science and analytics
  • Biostatistics and epidemiology
  • Business intelligence

Key advantages include:

  • Extensive package ecosystem (more than 20,000 packages on CRAN)
  • Publication-quality graphics
  • Reproducible research capabilities
  • Strong community support

2.3 The RStudio Interface

When you open RStudio, you’ll see a window with multiple panes arranged in a grid.

The four main panes are:

  1. Console (bottom-left): Where R code is executed and results displayed
  2. Script Editor (top-left): Where you write and save R scripts
  3. Environment/History (top-right): Shows objects you’ve created and command history
  4. Files/Plots/Packages/Help (bottom-right): File browser, plot display, package manager, and help viewer

2.3.1 Customizing Your Workspace

You can customize the appearance and layout of RStudio:

  • Tools → Global Options - Change theme, fonts, pane layout
  • View menu - Show/hide panes as needed
  • Recommended: Use a dark theme to reduce eye strain during long coding sessions

3 Working with the Console

3.1 Basic R Commands

The console is where you type commands and see results. Let’s start with simple arithmetic:

# Simple addition
1 + 1
[1] 2
# Division
10 / 2
[1] 5
# Exponentiation
2 ^ 3
[1] 8
# Square root
sqrt(16)
[1] 4
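
Once an expression combines several operators, the order in which R evaluates them matters. The short sketch below uses arbitrary numbers (not taken from the lecture) to show how parentheses change a result:

```r
# Multiplication is evaluated before addition
2 + 3 * 4      # 14

# Parentheses force the addition to happen first
(2 + 3) * 4    # 20

# Exponentiation is evaluated before the leading minus sign
-2^2           # -4
(-2)^2         # 4
```

When in doubt, add parentheses; they cost nothing and make the intent explicit.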

3.2 Understanding R Output

Notice the [1] prefix in the output. This indicates the position of the first value displayed on that line.

# When displaying many values, you see multiple indices
100:130
 [1] 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123
[25] 124 125 126 127 128 129 130

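In this output, [1] marks position 1 (the value 100) and [25] marks position 25 (the value 124), which is the first value on the second line. The exact labels you see depend on the width of your console.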

Tip: The [1] notation becomes especially useful when working with vectors and matrices where you need to track the position of values.
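
To confirm that a bracketed label is a position in the vector, you can store the same sequence and pull out elements by index. The object name x is an arbitrary choice for this sketch:

```r
# Store the sequence so we can index into it
x <- 100:130

length(x)   # 31 values in total
x[1]        # 100, the value labelled [1]
x[25]       # 124, the 25th value
```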


3.3 Incomplete Commands

If you type an incomplete command and press Enter, R displays a + prompt, waiting for you to complete it:

# Type 5 - and press Enter; the prompt changes from > to +
5 -
# R is now waiting for the rest of the expression
# Type 1 and press Enter to complete it
1

When you do this in R, you’ll see:

> 5 -
+ 1
[1] 4

To cancel an incomplete command, press Escape and start over.


3.4 Error Messages

If you type a command that R doesn’t recognize, you’ll get an error message:

> 3 % 5
Error: unexpected input in "3 % 5"

Error messages are helpful! They tell you what went wrong. Don’t be intimidated by them; they’re just R’s way of saying it didn’t understand your command. Here, a single % is not an operator in R, so 3 % 5 is not valid syntax. R’s special operators are wrapped in a pair of percent signs, such as %% (remainder) and %/% (integer division).
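
For contrast with the invalid 3 % 5, here are a few operators that R does define with percent signs. The numbers are arbitrary and chosen only for illustration:

```r
# Remainder after division (modulo)
17 %% 5              # 2

# Integer division (quotient without the remainder)
17 %/% 5             # 3

# Membership test: is the value on the left found on the right?
3 %in% c(1, 3, 5)    # TRUE
```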


3.5 Canceling Long-Running Commands

Some R commands take a long time to run. You can cancel a running command by pressing:

  • Esc (RStudio console)
  • Ctrl + C (R running in a terminal on Windows/Linux)
  • Command + . (period) (Mac)

Or click the STOP button in the console. Note that canceling may take a moment.


4 R Objects and Assignment

4.1 Creating Objects

R lets you save data by storing it in objects. An object is simply a name you can use to retrieve stored data.

You assign values to objects using <- or =:

# Create an object named 'a' with value 1
a <- 1

# View the contents
a
[1] 1
# Do arithmetic with the object
a + 2
[1] 3
# Create another object
b <- 10
a + b
[1] 11

Best Practice: Use <- for assignment in R. While = also works, <- is the R convention and makes your code more readable to other R users.

4.1.1 Reassigning Objects

R will overwrite an object without asking for confirmation:

# First assignment
a <- 1
a
[1] 1
# Reassign a new value
a <- 2
a
[1] 2
# You can overwrite with different data types
a <- "text"
a
[1] "text"

4.2 Naming Conventions

Object names in R have a few rules:

A valid name:

  • Starts with a letter, or with a period that is not followed by a number (.variable_name)
  • Contains only letters, numbers, underscores, and periods

Invalid names:

  • Start with a number: 2variables
  • Contain special symbols or spaces: my-var, my$var, my@var, my var

# Valid names
my_variable <- 5
my.variable <- 5
myVariable <- 5
x1 <- 10

# Invalid names (these will produce errors)
1variable <- 5        # Error: starts with number
my-variable <- 5      # Error: hyphen not allowed
my variable <- 5      # Error: space not allowed
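
If you are unsure whether a name is valid, base R's make.names() can check it for you: it returns a syntactically valid version of each string, so any name that comes back changed was not valid to begin with. A small sketch, where the candidate names are examples chosen for illustration:

```r
# Candidate names, some of which break R's naming rules
candidates <- c("my_variable", "2variables", "my-var", "my variable")

# make.names() returns a syntactically valid version of each name
fixed <- make.names(candidates)
fixed
# "my_variable" "X2variables" "my.var" "my.variable"

# TRUE where the original name was already valid
candidates == fixed
```

This is also what read.csv() does by default to column headers that are not valid R names.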

4.2.1 Case Sensitivity

R is case-sensitive, so name and Name are different objects:

name <- "lowercase"
Name <- "uppercase"

name
[1] "lowercase"
Name
[1] "uppercase"

4.3 Listing Objects in Your Environment

Use ls() to see all objects you’ve created:

# Create some objects
x <- 5
y <- 10
z <- "hello"

# List all objects
ls()
[1] "a"    "b"    "name" "Name" "x"    "y"    "z"   
# Remove a specific object
rm(z)

# Verify it's gone
ls()
[1] "a"    "b"    "name" "Name" "x"    "y"   

5 R Data Types and Structures

5.1 Data Types

R has several basic data types:

5.1.1 Numeric

# Numeric (default for numbers)
x <- 3.14
y <- 42

class(x)
[1] "numeric"
class(y)
[1] "numeric"

5.1.2 Character

# Character (text)
name <- "Alice"
greeting <- 'Hello, world!'

class(name)
[1] "character"
class(greeting)
[1] "character"

5.1.3 Logical

# Logical (TRUE/FALSE)
is_patient <- TRUE
treatment_received <- FALSE

class(is_patient)
[1] "logical"
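
Two related details are worth knowing early: whole numbers are stored as integers only if you ask for it, and values can be converted between types. The sketch below uses base R functions only; the specific values are arbitrary:

```r
# A whole number is stored as "numeric" unless you add the L suffix
class(42)      # "numeric"
class(42L)     # "integer"

# is.*() functions test a type; as.*() functions convert between types
is.numeric(42L)          # TRUE: integers also count as numeric
as.numeric("3.14") + 1   # 4.14
as.character(140)        # "140"
as.logical("TRUE")       # TRUE
```

Conversions matter in practice because numbers read from a file sometimes arrive as text.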

5.2 Vectors

A vector is a collection of values of the same type. Create vectors using c():

# Numeric vector
ages <- c(25, 30, 35, 40, 45)

# Character vector
names <- c("Alice", "Bob", "Carol", "Dave", "Eve")

# Logical vector
is_smoker <- c(TRUE, FALSE, TRUE, FALSE, TRUE)

# Check length
length(ages)
[1] 5
# Access individual elements
names[2]      # Second element
[1] "Bob"
ages[c(1, 3)] # First and third elements
[1] 25 35
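
Because every element of a vector has the same type, R can apply an operation to all elements at once. The sketch below reuses the ages vector from above; the cutoff of 30 is an arbitrary value for illustration:

```r
ages <- c(25, 30, 35, 40, 45)

# Arithmetic is applied to every element at once
ages + 1          # 26 31 36 41 46
mean(ages)        # 35

# A logical condition returns one TRUE/FALSE per element
ages > 30         # FALSE FALSE TRUE TRUE TRUE

# Use that logical vector to keep only the matching elements
ages[ages > 30]   # 35 40 45

# Mixing types in c() coerces everything to a single type
class(c(1, "a", TRUE))   # "character"
```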

5.3 Sequences and Repetition

# Create sequences
1:10           # Sequence from 1 to 10
 [1]  1  2  3  4  5  6  7  8  9 10
seq(1, 10, by = 2)  # Sequence with step
[1] 1 3 5 7 9
# Repeat values
rep(1, 5)      # Repeat 1 five times
[1] 1 1 1 1 1
rep(c("A", "B"), 3)  # Repeat vector three times
[1] "A" "B" "A" "B" "A" "B"
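
seq() and rep() take a few more arguments that come up often. This sketch shows some of them with arbitrary values:

```r
# Ask for a number of evenly spaced values instead of a step size
seq(0, 1, length.out = 5)           # 0.00 0.25 0.50 0.75 1.00

# seq_len() counts from 1 up to n
seq_len(4)                          # 1 2 3 4

# each = repeats every element before moving to the next one
rep(c("A", "B"), each = 3)          # "A" "A" "A" "B" "B" "B"

# times = can give a separate count for each element
rep(c("A", "B"), times = c(2, 1))   # "A" "A" "B"
```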

5.4 Lists

A list is a collection where elements can have different types:

# Create a list
person <- list(
  name = "Alice",
  age = 30,
  is_student = FALSE
)

# Access elements
person$name
[1] "Alice"
person[[2]]  # Access by position
[1] 30
person[["age"]]  # Access by name
[1] 30
# View the list structure
str(person)
List of 3
 $ name      : chr "Alice"
 $ age       : num 30
 $ is_student: logi FALSE
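
A common source of confusion is the difference between single and double brackets on a list. The sketch below rebuilds the person list and also shows how to add or change elements; the visits element and its values are made up for this example:

```r
person <- list(name = "Alice", age = 30, is_student = FALSE)

# Single brackets return a smaller list
class(person["age"])      # "list"

# Double brackets (or $) return the element itself
class(person[["age"]])    # "numeric"

# Add a new element by assigning to a new name
person$visits <- c(2, 5, 1)

# Change an existing element
person$age <- 31

names(person)             # "name" "age" "is_student" "visits"
length(person)            # 4
```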

5.5 Matrices

A matrix is a two-dimensional collection of elements all of the same type:

# Create a 3x3 matrix
m <- matrix(c(1, 2, 3, 4, 5, 6, 7, 8, 9), nrow = 3, ncol = 3)
m
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9
# Access elements
m[1, 2]      # Row 1, Column 2
[1] 4
m[2, ]       # All of Row 2
[1] 2 5 8
m[, 3]       # All of Column 3
[1] 7 8 9
# Matrix operations
t(m)         # Transpose
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9
m %*% m      # Matrix multiplication
     [,1] [,2] [,3]
[1,]   30   66  102
[2,]   36   81  126
[3,]   42   96  150
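
Note that matrix() fills values column by column unless told otherwise, and that * and %*% do different things. A brief sketch, using a hypothetical 2 x 3 matrix named m_row:

```r
# Fill by row instead of the default column-by-column order
m_row <- matrix(1:6, nrow = 2, byrow = TRUE)
m_row
dim(m_row)           # 2 3

# * multiplies element by element
m_row * m_row

# %*% is matrix multiplication: (2 x 3) %*% (3 x 2) gives a 2 x 2 result
m_row %*% t(m_row)
```

Matrix multiplication requires the number of columns on the left to match the number of rows on the right, which is why the second operand is transposed here.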

5.6 Data Frames

Data frames are the most important data structure in R for statistical analysis. They’re like spreadsheets where:

  • Each column is a variable
  • Each row is an observation
  • Columns can have different data types

# Create a simple data frame
df <- data.frame(
  ID = 1:5,
  Name = c("Alice", "Bob", "Carol", "Dave", "Eve"),
  Age = c(28, 35, 42, 31, 29),
  Diagnosis = c("Yes", "No", "Yes", "No", "No")
)

df
  ID  Name Age Diagnosis
1  1 Alice  28       Yes
2  2   Bob  35        No
3  3 Carol  42       Yes
4  4  Dave  31        No
5  5   Eve  29        No
# Check structure
str(df)
'data.frame':   5 obs. of  4 variables:
 $ ID       : int  1 2 3 4 5
 $ Name     : chr  "Alice" "Bob" "Carol" "Dave" ...
 $ Age      : num  28 35 42 31 29
 $ Diagnosis: chr  "Yes" "No" "Yes" "No" ...
# Dimensions
dim(df)
[1] 5 4
nrow(df)
[1] 5
ncol(df)
[1] 4
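
After checking the structure, typical next steps are to convert text categories to factors and to add derived columns. The sketch below rebuilds the same df; the Over30 column and its cutoff are assumptions made for illustration:

```r
df <- data.frame(
  ID = 1:5,
  Name = c("Alice", "Bob", "Carol", "Dave", "Eve"),
  Age = c(28, 35, 42, 31, 29),
  Diagnosis = c("Yes", "No", "Yes", "No", "No")
)

# Column names
names(df)

# Convert a text column to a factor (categorical variable)
df$Diagnosis <- factor(df$Diagnosis, levels = c("No", "Yes"))

# Count observations in each category
table(df$Diagnosis)    # No: 3, Yes: 2

# Add a derived column
df$Over30 <- df$Age > 30

# Numeric summary of one column
mean(df$Age)           # 33
```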

5.7 Accessing Data Frame Elements

# Access a column using $
df$Name
[1] "Alice" "Bob"   "Carol" "Dave"  "Eve"  
# Access by row and column
df[1, 2]      # Row 1, Column 2
[1] "Alice"
# Access entire column
df[, "Age"]
[1] 28 35 42 31 29
df[, 3]
[1] 28 35 42 31 29
# Access entire row
df[2, ]
  ID Name Age Diagnosis
2  2  Bob  35        No
# Subset rows where Age > 30
df[df$Age > 30, ]
  ID  Name Age Diagnosis
2  2   Bob  35        No
3  3 Carol  42       Yes
4  4  Dave  31        No
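
Row conditions can be combined, and rows and columns can be selected in the same step. A short sketch using the same df (rebuilt here so the block runs on its own):

```r
df <- data.frame(
  ID = 1:5,
  Name = c("Alice", "Bob", "Carol", "Dave", "Eve"),
  Age = c(28, 35, 42, 31, 29),
  Diagnosis = c("Yes", "No", "Yes", "No", "No")
)

# Combine conditions with & (and) or | (or), and pick columns by name
older_no_dx <- df[df$Age > 30 & df$Diagnosis == "No", c("Name", "Age")]
older_no_dx

# subset() expresses the same request without repeating df$
subset(df, Age > 30 & Diagnosis == "No", select = c(Name, Age))

# Count the rows that meet a condition
sum(df$Age > 30)   # 3
```

Both forms return the rows for Bob and Dave. The object name older_no_dx is a hypothetical choice for this example.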

6 Reading Data

6.1 Loading CSV Files

In epidemiological research, you’ll usually start with existing data:

# Read a CSV file
data <- read.csv("path/to/file.csv")

# If first row contains variable names (default)
data <- read.csv("data.csv")

# If no header row
data <- read.csv("data.csv", header = FALSE)

# Specify missing value codes
data <- read.csv("data.csv", na.strings = c("", "NA", "."))
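
To see what na.strings does without needing a file of your own, you can write a tiny CSV to a temporary location and read it back. Everything in this sketch (the column names, the values, and the object names csv_path and patients) is invented for illustration:

```r
# Write a tiny CSV to a temporary file so the example is self-contained
csv_path <- tempfile(fileext = ".csv")
writeLines(c("id,age,sbp",
             "1,34,114",
             "2,.,118",
             "3,49,"),
           csv_path)

# Treat empty cells, "NA", and "." as missing values
patients <- read.csv(csv_path, na.strings = c("", "NA", "."))

patients
str(patients)

# Count missing values in each column
colSums(is.na(patients))   # id 0, age 1, sbp 1
```

Without "." in na.strings, the age column would be read as text rather than as numbers.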

6.1.1 Using Import Dataset in RStudio

RStudio also offers a point-and-click method:

  1. Click Environment tab → Import Dataset → From Text (readr)…
  2. Browse to your file
  3. Preview the data and adjust settings
  4. Click Import - RStudio generates the code

Copy this code to your script for reproducibility:

# Code generated by RStudio
library(readr)
mydata <- read_csv("myfile.csv")

Always save the import code in your script. This creates a reproducible record of how you loaded your data.


7 R Scripts

7.1 What is an R Script?

An R script is a plain text file containing R code. Scripts allow you to:

  • Write and save your code
  • Create a reproducible record of your analysis
  • Rerun your entire analysis anytime
  • Share your code with others

7.1.1 Creating a New Script

Method 1 (RStudio):

  • File → New File → R Script
  • Or use Ctrl + Shift + N (Windows/Linux) / Cmd + Shift + N (Mac)

Method 2:

  • Click the blank file icon in the toolbar

7.2 Best Practices for Scripts

7.2.1 1. Add Comments

Use # to add comments that explain your code:

# Create sample patient data
set.seed(123)
data <- data.frame(
  age = rnorm(100, mean = 55, sd = 15),
  weight = rnorm(100, mean = 80, sd = 12),
  height = rnorm(100, mean = 1.70, sd = 0.10)
)

# Calculate BMI
data$BMI <- data$weight / (data$height^2)

# Subset to patients over 50 years old
older_patients <- data[data$age > 50, ]

# Summarize BMI for older patients
summary(older_patients$BMI)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   17.8    24.6    27.0    27.3    30.3    39.4 

7.2.2 2. Organize Your Script

Start with header information and load all packages:

# =====================================================
# Analysis of NHANES Data
# Author: Your Name
# Date: February 2025
# =====================================================

# Load required packages
library(dplyr)
library(ggplot2)
library(NHANES)

# -----
# 1. Load and Prepare Data
# -----
data(NHANES)
head(NHANES)

# -----
# 2. Exploratory Data Analysis
# -----
summary(NHANES)
NHANES %>%
  group_by(Gender) %>%
  summarise(mean_age = mean(Age, na.rm = TRUE),
            mean_bp_sys = mean(BPSys1, na.rm = TRUE))

# -----
# 3. Statistical Tests
# -----
t.test(BPSys1 ~ Gender, data = NHANES)

7.2.3 3. Use Meaningful Names

# Good names
systolic_bp <- 140
patient_age <- 55
calculate_bmi <- function(weight, height) { weight / height^2 }

# Poor names
sb <- 140          # Unclear abbreviation
a <- 55            # Single letter
f1 <- function(w, h) { w / h^2 }  # Cryptic
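
Good names pay off when you call the function. The sketch below repeats calculate_bmi with comments documenting its inputs; the units (kilograms and meters) are an assumption based on the values used in the earlier BMI example:

```r
# Compute body mass index (BMI) in kg/m^2
# weight: body weight in kilograms (assumed unit)
# height: height in meters (assumed unit)
calculate_bmi <- function(weight, height) {
  weight / height^2
}

# Naming the arguments in the call makes the units easy to check
calculate_bmi(weight = 80, height = 1.70)

# The function is vectorized, so it works on whole columns at once
calculate_bmi(weight = c(60, 80, 100), height = c(1.60, 1.70, 1.80))
```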

7.3 Running Code from a Script

Run a single line:

  • Click the line and press Ctrl + Enter (Windows/Linux) / Cmd + Return (Mac)
  • Or click the Run button

Run multiple lines:

  • Highlight them and press Ctrl + Enter / Cmd + Return

Run the entire script:

  • Click the Source button (or Ctrl + Shift + S / Cmd + Shift + S)
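
Sourcing a script can also be done from code with base R's source() function. The sketch below writes a two-line script to a temporary file and runs it; the file contents and object names are made up for this example:

```r
# Write a two-line script to a temporary file
script_path <- tempfile(fileext = ".R")
writeLines(c("systolic_bp <- c(118, 126, 140)",
             "mean_bp <- mean(systolic_bp)"),
           script_path)

# Run every line of the script in order
source(script_path)

# Objects created by the script are now available
mean_bp   # 128
```

In your own work you would pass the path of a saved script, for example one created with File → New File → R Script.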

8 Practical Example

8.1 Complete Workflow

Let’s work through a complete analysis workflow using the NHANES (National Health and Nutrition Examination Survey) dataset:

# Load required packages
library(NHANES)
library(dplyr)

# 1. Load the data
data(NHANES)

# 2. Examine the data
head(NHANES)
# A tibble: 6 × 76
     ID SurveyYr Gender   Age AgeDecade AgeMonths Race1 Race3 Education    MaritalStatus HHIncome   
  <int> <fct>    <fct>  <int> <fct>         <int> <fct> <fct> <fct>        <fct>         <fct>      
1 51624 2009_10  male      34 " 30-39"        409 White <NA>  High School  Married       25000-34999
2 51624 2009_10  male      34 " 30-39"        409 White <NA>  High School  Married       25000-34999
3 51624 2009_10  male      34 " 30-39"        409 White <NA>  High School  Married       25000-34999
4 51625 2009_10  male       4 " 0-9"           49 Other <NA>  <NA>         <NA>          20000-24999
5 51630 2009_10  female    49 " 40-49"        596 White <NA>  Some College LivePartner   35000-44999
6 51638 2009_10  male       9 " 0-9"          115 White <NA>  <NA>         <NA>          75000-99999
# ℹ 65 more variables: HHIncomeMid <int>, Poverty <dbl>, HomeRooms <int>, HomeOwn <fct>,
#   Work <fct>, Weight <dbl>, Length <dbl>, HeadCirc <dbl>, Height <dbl>, BMI <dbl>,
#   BMICatUnder20yrs <fct>, BMI_WHO <fct>, Pulse <int>, BPSysAve <int>, BPDiaAve <int>,
#   BPSys1 <int>, BPDia1 <int>, BPSys2 <int>, BPDia2 <int>, BPSys3 <int>, BPDia3 <int>,
#   Testosterone <dbl>, DirectChol <dbl>, TotChol <dbl>, UrineVol1 <int>, UrineFlow1 <dbl>,
#   UrineVol2 <int>, UrineFlow2 <dbl>, Diabetes <fct>, DiabetesAge <int>, HealthGen <fct>,
#   DaysPhysHlthBad <int>, DaysMentHlthBad <int>, LittleInterest <fct>, Depressed <fct>, …
str(NHANES)
tibble [10,000 × 76] (S3: tbl_df/tbl/data.frame)
 $ ID              : int [1:10000] 51624 51624 51624 51625 51630 51638 51646 51647 51647 51647 ...
 $ SurveyYr        : Factor w/ 2 levels "2009_10","2011_12": 1 1 1 1 1 1 1 1 1 1 ...
 $ Gender          : Factor w/ 2 levels "female","male": 2 2 2 2 1 2 2 1 1 1 ...
 $ Age             : int [1:10000] 34 34 34 4 49 9 8 45 45 45 ...
 $ AgeDecade       : Factor w/ 8 levels " 0-9"," 10-19",..: 4 4 4 1 5 1 1 5 5 5 ...
 $ AgeMonths       : int [1:10000] 409 409 409 49 596 115 101 541 541 541 ...
 $ Race1           : Factor w/ 5 levels "Black","Hispanic",..: 4 4 4 5 4 4 4 4 4 4 ...
 $ Race3           : Factor w/ 6 levels "Asian","Black",..: NA NA NA NA NA NA NA NA NA NA ...
 $ Education       : Factor w/ 5 levels "8th Grade","9 - 11th Grade",..: 3 3 3 NA 4 NA NA 5 5 5 ...
 $ MaritalStatus   : Factor w/ 6 levels "Divorced","LivePartner",..: 3 3 3 NA 2 NA NA 3 3 3 ...
 $ HHIncome        : Factor w/ 12 levels " 0-4999"," 5000-9999",..: 6 6 6 5 7 11 9 11 11 11 ...
 $ HHIncomeMid     : int [1:10000] 30000 30000 30000 22500 40000 87500 60000 87500 87500 87500 ...
 $ Poverty         : num [1:10000] 1.36 1.36 1.36 1.07 1.91 1.84 2.33 5 5 5 ...
 $ HomeRooms       : int [1:10000] 6 6 6 9 5 6 7 6 6 6 ...
 $ HomeOwn         : Factor w/ 3 levels "Own","Rent","Other": 1 1 1 1 2 2 1 1 1 1 ...
 $ Work            : Factor w/ 3 levels "Looking","NotWorking",..: 2 2 2 NA 2 NA NA 3 3 3 ...
 $ Weight          : num [1:10000] 87.4 87.4 87.4 17 86.7 29.8 35.2 75.7 75.7 75.7 ...
 $ Length          : num [1:10000] NA NA NA NA NA NA NA NA NA NA ...
 $ HeadCirc        : num [1:10000] NA NA NA NA NA NA NA NA NA NA ...
 $ Height          : num [1:10000] 165 165 165 105 168 ...
 $ BMI             : num [1:10000] 32.2 32.2 32.2 15.3 30.6 ...
 $ BMICatUnder20yrs: Factor w/ 4 levels "UnderWeight",..: NA NA NA NA NA NA NA NA NA NA ...
 $ BMI_WHO         : Factor w/ 4 levels "12.0_18.5","18.5_to_24.9",..: 4 4 4 1 4 1 2 3 3 3 ...
 $ Pulse           : int [1:10000] 70 70 70 NA 86 82 72 62 62 62 ...
 $ BPSysAve        : int [1:10000] 113 113 113 NA 112 86 107 118 118 118 ...
 $ BPDiaAve        : int [1:10000] 85 85 85 NA 75 47 37 64 64 64 ...
 $ BPSys1          : int [1:10000] 114 114 114 NA 118 84 114 106 106 106 ...
 $ BPDia1          : int [1:10000] 88 88 88 NA 82 50 46 62 62 62 ...
 $ BPSys2          : int [1:10000] 114 114 114 NA 108 84 108 118 118 118 ...
 $ BPDia2          : int [1:10000] 88 88 88 NA 74 50 36 68 68 68 ...
 $ BPSys3          : int [1:10000] 112 112 112 NA 116 88 106 118 118 118 ...
 $ BPDia3          : int [1:10000] 82 82 82 NA 76 44 38 60 60 60 ...
 $ Testosterone    : num [1:10000] NA NA NA NA NA NA NA NA NA NA ...
 $ DirectChol      : num [1:10000] 1.29 1.29 1.29 NA 1.16 1.34 1.55 2.12 2.12 2.12 ...
 $ TotChol         : num [1:10000] 3.49 3.49 3.49 NA 6.7 4.86 4.09 5.82 5.82 5.82 ...
 $ UrineVol1       : int [1:10000] 352 352 352 NA 77 123 238 106 106 106 ...
 $ UrineFlow1      : num [1:10000] NA NA NA NA 0.094 ...
 $ UrineVol2       : int [1:10000] NA NA NA NA NA NA NA NA NA NA ...
 $ UrineFlow2      : num [1:10000] NA NA NA NA NA NA NA NA NA NA ...
 $ Diabetes        : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
 $ DiabetesAge     : int [1:10000] NA NA NA NA NA NA NA NA NA NA ...
 $ HealthGen       : Factor w/ 5 levels "Excellent","Vgood",..: 3 3 3 NA 3 NA NA 2 2 2 ...
 $ DaysPhysHlthBad : int [1:10000] 0 0 0 NA 0 NA NA 0 0 0 ...
 $ DaysMentHlthBad : int [1:10000] 15 15 15 NA 10 NA NA 3 3 3 ...
 $ LittleInterest  : Factor w/ 3 levels "None","Several",..: 3 3 3 NA 2 NA NA 1 1 1 ...
 $ Depressed       : Factor w/ 3 levels "None","Several",..: 2 2 2 NA 2 NA NA 1 1 1 ...
 $ nPregnancies    : int [1:10000] NA NA NA NA 2 NA NA 1 1 1 ...
 $ nBabies         : int [1:10000] NA NA NA NA 2 NA NA NA NA NA ...
 $ Age1stBaby      : int [1:10000] NA NA NA NA 27 NA NA NA NA NA ...
 $ SleepHrsNight   : int [1:10000] 4 4 4 NA 8 NA NA 8 8 8 ...
 $ SleepTrouble    : Factor w/ 2 levels "No","Yes": 2 2 2 NA 2 NA NA 1 1 1 ...
 $ PhysActive      : Factor w/ 2 levels "No","Yes": 1 1 1 NA 1 NA NA 2 2 2 ...
 $ PhysActiveDays  : int [1:10000] NA NA NA NA NA NA NA 5 5 5 ...
 $ TVHrsDay        : Factor w/ 7 levels "0_hrs","0_to_1_hr",..: NA NA NA NA NA NA NA NA NA NA ...
 $ CompHrsDay      : Factor w/ 7 levels "0_hrs","0_to_1_hr",..: NA NA NA NA NA NA NA NA NA NA ...
 $ TVHrsDayChild   : int [1:10000] NA NA NA 4 NA 5 1 NA NA NA ...
 $ CompHrsDayChild : int [1:10000] NA NA NA 1 NA 0 6 NA NA NA ...
 $ Alcohol12PlusYr : Factor w/ 2 levels "No","Yes": 2 2 2 NA 2 NA NA 2 2 2 ...
 $ AlcoholDay      : int [1:10000] NA NA NA NA 2 NA NA 3 3 3 ...
 $ AlcoholYear     : int [1:10000] 0 0 0 NA 20 NA NA 52 52 52 ...
 $ SmokeNow        : Factor w/ 2 levels "No","Yes": 1 1 1 NA 2 NA NA NA NA NA ...
 $ Smoke100        : Factor w/ 2 levels "No","Yes": 2 2 2 NA 2 NA NA 1 1 1 ...
 $ Smoke100n       : Factor w/ 2 levels "Non-Smoker","Smoker": 2 2 2 NA 2 NA NA 1 1 1 ...
 $ SmokeAge        : int [1:10000] 18 18 18 NA 38 NA NA NA NA NA ...
 $ Marijuana       : Factor w/ 2 levels "No","Yes": 2 2 2 NA 2 NA NA 2 2 2 ...
 $ AgeFirstMarij   : int [1:10000] 17 17 17 NA 18 NA NA 13 13 13 ...
 $ RegularMarij    : Factor w/ 2 levels "No","Yes": 1 1 1 NA 1 NA NA 1 1 1 ...
 $ AgeRegMarij     : int [1:10000] NA NA NA NA NA NA NA NA NA NA ...
 $ HardDrugs       : Factor w/ 2 levels "No","Yes": 2 2 2 NA 2 NA NA 1 1 1 ...
 $ SexEver         : Factor w/ 2 levels "No","Yes": 2 2 2 NA 2 NA NA 2 2 2 ...
 $ SexAge          : int [1:10000] 16 16 16 NA 12 NA NA 13 13 13 ...
 $ SexNumPartnLife : int [1:10000] 8 8 8 NA 10 NA NA 20 20 20 ...
 $ SexNumPartYear  : int [1:10000] 1 1 1 NA 1 NA NA 0 0 0 ...
 $ SameSex         : Factor w/ 2 levels "No","Yes": 1 1 1 NA 2 NA NA 2 2 2 ...
 $ SexOrientation  : Factor w/ 3 levels "Bisexual","Heterosexual",..: 2 2 2 NA 2 NA NA 1 1 1 ...
 $ PregnantNow     : Factor w/ 3 levels "Yes","No","Unknown": NA NA NA NA NA NA NA NA NA NA ...
# 3. Calculate summary statistics by gender
summary_stats <- NHANES %>%
  group_by(Gender) %>%
  summarise(
    n = n(),
    mean_age = mean(Age, na.rm = TRUE),
    sd_age = sd(Age, na.rm = TRUE),
    mean_bp_sys = mean(BPSys1, na.rm = TRUE),
    sd_bp_sys = sd(BPSys1, na.rm = TRUE),
    mean_bmi = mean(BMI, na.rm = TRUE),
    sd_bmi = sd(BMI, na.rm = TRUE)
  )

print(summary_stats)
# A tibble: 2 × 8
  Gender     n mean_age sd_age mean_bp_sys sd_bp_sys mean_bmi sd_bmi
  <fct>  <int>    <dbl>  <dbl>       <dbl>     <dbl>    <dbl>  <dbl>
1 female  5020     37.6   22.7        117.      18.1     26.8   7.90
2 male    4980     35.8   22.0        121.      16.6     26.5   6.81
# 4. Visualize the data
library(ggplot2)

ggplot(NHANES, aes(x = Gender, y = BPSys1, fill = Gender)) +
  geom_boxplot(alpha = 0.7) +
  geom_jitter(width = 0.2, alpha = 0.2) +
  labs(title = "Systolic Blood Pressure by Gender",
       y = "Systolic Blood Pressure (mmHg)",
       x = "Gender") +
  theme_minimal() +
  theme(legend.position = "none")

Figure: Boxplot of systolic blood pressure comparing male and female participants

# 5. Conduct statistical test
# Remove rows with missing blood pressure data
nhanes_clean <- NHANES %>%
  filter(!is.na(BPSys1))

# Compare systolic blood pressure between genders
t_test <- t.test(BPSys1 ~ Gender, data = nhanes_clean)
print(t_test)

    Welch Two Sample t-test

data:  BPSys1 by Gender
t = -9.3, df = 8172, p-value <2e-16
alternative hypothesis: true difference in means between group female and group male is not equal to 0
95 percent confidence interval:
 -4.332 -2.829
sample estimates:
mean in group female   mean in group male 
               117.3                120.9 
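
The object returned by t.test() is a list, so you can pull out individual results instead of copying them from the printed output. So that it runs without the NHANES package, this sketch uses simulated data; the data frame bp, its columns, and all values are made up and are not NHANES results:

```r
# Simulated blood pressure data (values are invented, not NHANES)
set.seed(553)
bp <- data.frame(
  group = rep(c("A", "B"), each = 50),
  sbp = c(rnorm(50, mean = 117, sd = 17), rnorm(50, mean = 121, sd = 17))
)

result <- t.test(sbp ~ group, data = bp)

# Individual pieces of the result, ready to reuse in text or tables
result$statistic    # t statistic
result$parameter    # degrees of freedom
result$p.value      # p-value
result$conf.int     # 95% confidence interval for the difference in means
result$estimate     # mean of each group
```

The same element names should work on the t_test object created above.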

9 Additional Resources

9.1 Learning R

Free Online Resources:

Interactive Learning:

  • swirl package: Learn R interactively in the console

install.packages("swirl")
library(swirl)
swirl()

9.2 Useful Packages for Epidemiology

  • dplyr - Data manipulation
  • ggplot2 - Data visualization
  • tidyr - Data reshaping
  • epiDisplay - Epidemiological tables
  • epitools - Epidemiological calculations
  • survival - Survival analysis

10 Key Concepts Summary

10.1 Essential Points to Remember

  1. R Objects: Use <- to assign values to named objects

  2. Vectors: Combine multiple values with c()

  3. Functions: Call functions with syntax function_name(argument1, argument2)

  4. Data Frames: The primary structure for statistical analysis in R

  5. Scripts: Always write and save your code in scripts for reproducibility

  6. Help: Use ?function_name or help(function_name) for documentation

  7. Comments: Use # to explain your code for future reference

  8. Working Directory: Understand where R looks for files with getwd() and setwd()
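
The working directory in point 8 determines how relative file paths are resolved. A minimal sketch, in which the folder name "data" and the file name "survey.csv" are placeholders rather than files from this course:

```r
# Where is R currently looking for files?
getwd()

# Build a path relative to the working directory
csv_path <- file.path("data", "survey.csv")
csv_path                 # "data/survey.csv"

# Check that a file exists before trying to read it
file.exists(csv_path)

# setwd("path/to/project") would change the working directory (not run here)
```

If file.exists() returns FALSE, check getwd() first: the file is often present but R is looking in a different folder.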


11 Session Information

sessionInfo()
R version 4.5.2 (2025-10-31)
Platform: aarch64-apple-darwin20
Running under: macOS Tahoe 26.2

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.1

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/New_York
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] ggplot2_4.0.1 dplyr_1.1.4   NHANES_2.1.0 

loaded via a namespace (and not attached):
 [1] vctrs_0.6.5        cli_3.6.5          knitr_1.51         rlang_1.1.6        xfun_0.55         
 [6] otel_0.2.0         generics_0.1.4     S7_0.2.1           jsonlite_2.0.0     labeling_0.4.3    
[11] glue_1.8.0         htmltools_0.5.9    scales_1.4.0       rmarkdown_2.30     grid_4.5.2        
[16] evaluate_1.0.5     tibble_3.3.0       fastmap_1.2.0      yaml_2.3.12        lifecycle_1.0.4   
[21] compiler_4.5.2     RColorBrewer_1.1-3 htmlwidgets_1.6.4  pkgconfig_2.0.3    rstudioapi_0.17.1 
[26] farver_2.1.2       digest_0.6.39      R6_2.6.1           tidyselect_1.2.1   utf8_1.2.6        
[31] pillar_1.11.1      magrittr_2.0.4     withr_3.0.2        gtable_0.3.6       tools_4.5.2       

Last updated: January 20, 2026

This lecture provides the foundation for all the statistical computing we’ll do in EPI 553. In the next lecture, we’ll review biostatistical foundations essential for understanding advanced modeling techniques.