How to use this course: Each concept is explained in three layers — what it is (Theory), why it matters (Importance), and when to use it (Context) — followed by practical R code examples. Read the theory first, then study the code, then try modifying the examples yourself.


1 Module 1: Introduction to R & RStudio

1.1 What is R?

R is a free, open-source programming language created in 1993 by Ross Ihaka and Robert Gentleman at the University of Auckland. It was designed from the ground up for statistical computing, data manipulation, and data visualization. Unlike general-purpose languages such as Python or Java, R’s entire design philosophy revolves around data — every feature, every data structure, every built-in function exists to help you work with data more effectively.

R is one of the most important tools in data analysis because:

  • It has 20,000+ packages on CRAN (Comprehensive R Archive Network) covering every analytical need imaginable — from basic statistics to machine learning, genomics, finance, and social science.
  • It produces publication-quality visualizations through packages like ggplot2.
  • It enables reproducible research — your entire analysis, from raw data to final report, lives in one document (R Markdown).
  • It is free and open-source, used by top universities, research institutions, and companies like Google, Facebook, and the New York Times.
  • It has a massive, supportive community (Stack Overflow, R-bloggers, RStudio Community).

Choose R when your work involves statistical analysis, academic research, data visualization, bioinformatics, economics, or when you need to produce reports that combine code, results, and narrative automatically.

1.2 The RStudio Interface

RStudio is an Integrated Development Environment (IDE) for R. Think of R as the engine and RStudio as the car — R does the computation, RStudio provides the dashboard, controls, and comfort that make working with R much easier. RStudio is not R itself; it is a tool that sits on top of R.

RStudio is divided into 4 panes, each serving a specific purpose:

Pane                              Location       Purpose
Source Editor                     Top-left       Write, edit, and save R scripts (.R) or R Markdown (.Rmd) files
Console                           Bottom-left    Run R commands interactively and see immediate output
Environment / History             Top-right      See all objects currently in memory; browse command history
Files / Plots / Packages / Help   Bottom-right   Browse files, view plots, manage packages, read documentation

The Source Editor is where you write reusable, repeatable code. The Console is where you experiment quickly. This separation is important: always write your final analysis in a script (Source Editor) so you can re-run it later. Code typed only into the Console is lost when you close RStudio.

1.3 Your First R Commands

Every journey in R begins with understanding that R is an interpreted language — you write a command, press Enter (or click Run), and R immediately executes it and shows the result. There is no compilation step. This makes R excellent for interactive data exploration.

# The hash symbol # starts a comment. R ignores everything after it.
# Comments are essential — they explain WHY you wrote the code, not just what it does.

# R as a calculator — the most basic use
2 + 3        # addition
#> [1] 5
100 - 47     # subtraction
#> [1] 53
8 * 7        # multiplication
#> [1] 56
22 / 7       # division — notice R gives decimal result
#> [1] 3.142857
2 ^ 10       # exponentiation: 2 to the power of 10
#> [1] 1024
17 %% 5      # modulus: remainder after dividing 17 by 5
#> [1] 2
17 %/% 5     # integer division: how many times 5 fits in 17
#> [1] 3
# Printing output explicitly
print("Welcome to R!")
#> [1] "Welcome to R!"
# R also has built-in mathematical constants
pi           # 3.14159...
#> [1] 3.141593
exp(1)       # Euler's number e = 2.71828...
#> [1] 2.718282

2 Module 2: R Basics — Variables, Data Types, and Operators

2.1 Variables and the Assignment Operator

A variable is a named storage location in computer memory. When you create a variable in R, you are telling R: “Reserve a space in memory, put this value there, and let me refer to it by this name.” In R, the standard assignment operator is <- (read as “gets”). You can also use =, but <- is the R convention and is strongly preferred by the community because = is also used for function arguments, which can create confusion.

Variables are the foundation of all programming. Without variables, you would have to retype the same value every time you needed it, and your code could not store or reuse results. Variables also make your code readable (using average_salary is clearer than using 47500 everywhere) and maintainable (change the value in one place and it updates everywhere).

Use variables whenever a value: (1) is used more than once, (2) might change and you want to update it in one place, (3) is the result of a calculation that you will use later, or (4) has a meaningful name that makes your code self-documenting.

# The <- operator assigns a value to a variable name
student_name  <- "Alice Mukamana"    # text value
student_age   <- 22                  # whole number
student_gpa   <- 3.85                # decimal number
is_enrolled   <- TRUE                # yes/no (logical) value

# To see the value of a variable, just type its name
student_name
#> [1] "Alice Mukamana"
student_age
#> [1] 22
# Or use print() for explicit output
print(student_gpa)
#> [1] 3.85
# You can use a variable in a calculation immediately after creating it
monthly_salary <- 450000             # in Rwandan Francs
annual_salary  <- monthly_salary * 12
annual_salary
#> [1] 5400000
# Update a variable — the old value is replaced
student_age <- 23                    # Alice had a birthday
student_age
#> [1] 23
# Variable naming rules:
# use_underscores (snake_case — recommended in R)
# can start with a letter or dot
# cannot start with a number (2nd_value is invalid)
# cannot contain spaces (student name is invalid)
# avoid special characters except _ and .
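# Why <- is preferred: a small sketch of the distinction. At the top level
# both assign, but inside a function call = names an argument instead of
# creating an object.
mean(x = c(2, 4, 6))    # x here is mean()'s argument, not a workspace object
#> [1] 4
x <- c(2, 4, 6)         # <- creates an object named x in your workspace
mean(x)
#> [1] 4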

2.2 Data Types in R

A data type tells R what kind of value a variable holds and therefore what operations are valid on it. R has 5 main atomic data types. The word “atomic” means these are the simplest, indivisible building blocks — all more complex data structures in R are built from these.

  • Numeric: Any real number (3.14, -7, 0, 1000000)
  • Integer: Whole numbers only, stored more efficiently (written with L suffix: 5L)
  • Character: Text strings, enclosed in single or double quotes (“Hello”, ‘World’)
  • Logical: Boolean values representing TRUE or FALSE (must be capitalized in R)
  • Complex: Numbers with real and imaginary parts (rarely used in data analysis)

Understanding data types is critical because R behaves differently depending on type. You cannot do arithmetic on text. You cannot sort numbers alphabetically. When data is imported from a spreadsheet, numbers are sometimes read as text (“123” instead of 123), causing silent errors in your calculations. Knowing types helps you diagnose and fix these problems.

Always check data types when importing data (class(), str()), before doing calculations (ensure columns are numeric), and when joining datasets (types must match for merging to work correctly).

# Numeric
temperature <- 36.6
population  <- 13400000
class(temperature)       # tells you the type
#> [1] "numeric"
is.numeric(temperature)  # TRUE/FALSE check
#> [1] TRUE
# Integer — the L suffix tells R to store as integer (more memory-efficient)
num_students <- 45L
class(num_students)
#> [1] "integer"
is.integer(num_students)
#> [1] TRUE
# Character (String)
country   <- "Rwanda"
capital   <- 'Kigali'          # single or double quotes both work
class(country)
#> [1] "character"
# IMPORTANT: Numbers stored as characters cannot be used in math!
wrong_num <- "42"              # this looks like a number but it is text
# wrong_num + 1               # this would cause an ERROR
as.numeric(wrong_num) + 1     # convert first, then add
#> [1] 43
# Logical — must be ALL CAPS: TRUE or FALSE
passed_exam  <- TRUE
has_degree   <- FALSE
class(passed_exam)
#> [1] "logical"
# Logical values are stored as 1 (TRUE) and 0 (FALSE)
# This is useful for counting: sum(logical_vector) counts the TRUEs
TRUE + TRUE + FALSE    # = 2
#> [1] 2
# Type Conversion — R can convert between types (called coercion)
as.numeric("3.14")    # character to numeric
#> [1] 3.14
as.character(100)     # numeric to character: "100"
#> [1] "100"
as.logical(0)         # 0 = FALSE
#> [1] FALSE
as.logical(1)         # any non-zero = TRUE
#> [1] TRUE
as.integer(3.9)       # truncates to 3 (does NOT round!)
#> [1] 3
# Checking Types
class(3.14)      # "numeric"
#> [1] "numeric"
class(TRUE)      # "logical"
#> [1] "logical"
class("hello")   # "character"
#> [1] "character"

2.3 Operators in R

Operators are symbols that perform operations on values or variables. R has four categories of operators: arithmetic (math calculations), relational/comparison (comparing values), logical (combining TRUE/FALSE values), and assignment (storing values). Each produces a specific type of result — arithmetic gives numbers, relational gives TRUE/FALSE, logical gives TRUE/FALSE.

Operators are the verbs of programming — they express actions. Relational operators are especially important in data analysis because they are the foundation of filtering: “show me all sales where revenue > 10000” translates directly into R as data[data$revenue > 10000, ].

# Arithmetic Operators
x <- 17
y <- 5
x + y    # 22  — addition
#> [1] 22
x - y    # 12  — subtraction
#> [1] 12
x * y    # 85  — multiplication
#> [1] 85
x / y    # 3.4 — division (always decimal in R)
#> [1] 3.4
x %% y   # 2   — modulus: remainder of 17 divided by 5
#> [1] 2
x %/% y  # 3   — integer division: how many times 5 goes into 17
#> [1] 3
x ^ y    # exponentiation
#> [1] 1419857
# Real-world arithmetic example
price_per_kg   <- 1200  # RWF per kg of beans
quantity_kg    <- 50
discount_rate  <- 0.10  # 10% discount

subtotal  <- price_per_kg * quantity_kg
discount  <- subtotal * discount_rate
total     <- subtotal - discount

cat("Subtotal: RWF", subtotal, "\n")
#> Subtotal: RWF 60000
cat("Discount: RWF", discount, "\n")
#> Discount: RWF 6000
cat("Total:    RWF", total, "\n")
#> Total:    RWF 54000
# Relational (Comparison) Operators — always return TRUE or FALSE
a <- 10
b <- 20
a == b    # FALSE — equal to (double == for comparison)
#> [1] FALSE
a != b    # TRUE  — not equal to
#> [1] TRUE
a <  b    # TRUE  — less than
#> [1] TRUE
a >  b    # FALSE — greater than
#> [1] FALSE
a <= 10   # TRUE  — less than or equal
#> [1] TRUE
a >= 10   # TRUE  — greater than or equal
#> [1] TRUE
# Logical Operators
# & (AND): BOTH conditions must be TRUE
# | (OR):  AT LEAST ONE condition must be TRUE
# ! (NOT): reverses TRUE/FALSE
age    <- 25
income <- 500000  # RWF per month

is_eligible <- (age >= 18) & (income >= 300000)
is_eligible
#> [1] TRUE
qualifies <- (age < 30) | (income > 1000000)
qualifies
#> [1] TRUE
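# ! (NOT) reverses a logical value, completing the trio described above
is_minor <- !(age >= 18)    # age is 25, so age >= 18 is TRUE, and ! flips it
is_minor
#> [1] FALSE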

3 Module 3: Data Structures — How R Organizes Data

Real-world data is never a single number — it comes in collections: a column of 1000 measurements, a table with rows and columns, a nested survey response. R has 5 core data structures to handle these cases. Choosing the right data structure is one of the most important decisions in data analysis because it affects how efficiently you can store, access, and manipulate your data.

3.1 Vectors — The Backbone of R

A vector is R’s most fundamental data structure: an ordered sequence of elements that are all the same type. If you mix types, R will silently convert everything to the most flexible type (this is called type coercion). Vectors are “atomic” — they cannot hold other vectors or mixed types. Almost everything in R is built on top of vectors — even a single number like 42 is actually a vector of length 1.

Vectors are important because R is designed for vectorized operations — operations that apply automatically to every element without writing a loop. scores * 2 doubles every score in a vector instantly. This makes R code both concise and extremely fast compared to loop-based approaches. Understanding vectors deeply is the single most important step for thinking in R.

Use a vector when you have a single variable measured across multiple observations — exam scores for 30 students, daily temperatures for a month, sales figures for 12 months. If your data has multiple variables, you need a data frame.

# Creating Vectors with c() — "combine" or "concatenate"
exam_scores   <- c(78, 85, 92, 67, 88, 74, 91, 55, 83, 76)
student_names <- c("Alice", "Bob", "Carol", "David", "Eve",
                   "Frank", "Grace", "Henry", "Irene", "James")
passed        <- c(TRUE, TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, FALSE, TRUE, TRUE)

# Useful shortcuts for creating sequences
days_in_month <- 1:31                      # consecutive integers
even_numbers  <- seq(2, 20, by = 2)        # 2, 4, 6, ..., 20
rep(0, times = 5)                          # repeat a value: 0 0 0 0 0
#> [1] 0 0 0 0 0
rep(c(1, 2), times = 3)                    # repeat a pattern: 1 2 1 2 1 2
#> [1] 1 2 1 2 1 2
# Exploring a Vector
length(exam_scores)   # how many elements?
#> [1] 10
class(exam_scores)    # what type?
#> [1] "numeric"
sum(exam_scores)      # total
#> [1] 789
mean(exam_scores)     # average
#> [1] 78.9
max(exam_scores)      # highest
#> [1] 92
min(exam_scores)      # lowest
#> [1] 55
sd(exam_scores)       # standard deviation
#> [1] 11.55133
# Accessing Elements — R uses 1-based indexing (first element is [1], NOT [0])
exam_scores[1]               # first score
#> [1] 78
exam_scores[2:5]             # scores 2 through 5 (inclusive)
#> [1] 85 92 67 88
exam_scores[c(1, 3, 7)]     # scores at positions 1, 3, 7
#> [1] 78 92 91
exam_scores[-1]              # ALL scores EXCEPT the first
#> [1] 85 92 67 88 74 91 55 83 76
# Accessing by Name
names(exam_scores) <- student_names
exam_scores["Alice"]
#> Alice 
#>    78
exam_scores[c("Bob", "Eve")]
#> Bob Eve 
#>  85  88
# Logical Indexing (Filtering) — the KEY to filtering data
exam_scores[exam_scores >= 80]             # scores 80 and above
#>   Bob Carol   Eve Grace Irene 
#>    85    92    88    91    83
student_names[exam_scores >= 80]          # names of students who scored 80+
#> [1] "Bob"   "Carol" "Eve"   "Grace" "Irene"
student_names[!passed]                    # names of students who failed
#> [1] "David" "Henry"
# Vectorized Operations — R's Superpower: no loop needed
exam_scores + 5             # add 5 bonus points to everyone
#> Alice   Bob Carol David   Eve Frank Grace Henry Irene James 
#>    83    90    97    72    93    79    96    60    88    81
exam_scores >= 60           # TRUE/FALSE: did each student pass?
#> Alice   Bob Carol David   Eve Frank Grace Henry Irene James 
#>  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE
sum(exam_scores >= 60)      # how many students passed? (counts TRUEs)
#> [1] 9
# ifelse() for element-wise conditional logic
letter_grades <- ifelse(exam_scores >= 90, "A",
                 ifelse(exam_scores >= 80, "B",
                 ifelse(exam_scores >= 70, "C",
                 ifelse(exam_scores >= 60, "D", "F"))))
letter_grades
#> Alice   Bob Carol David   Eve Frank Grace Henry Irene James 
#>   "C"   "B"   "A"   "D"   "B"   "C"   "A"   "F"   "B"   "C"

3.2 Matrices — Two-Dimensional Data of One Type

A matrix is a 2-dimensional extension of a vector: it has rows and columns, but like a vector, all elements must be the same type. A matrix is essentially a vector that has been given a shape (dimensions). Matrices are used heavily in linear algebra, statistics, and machine learning algorithms.

Matrices are important in data analysis when working with numerical computations that require row/column operations — correlation matrices, distance matrices, linear algebra for regression coefficients, or image data. Many statistical functions (cor(), cov(), solve()) work directly on matrices.

Use matrices when: (1) all your data is numeric and of the same type, (2) you need matrix algebra operations (transpose, multiplication, inversion), (3) you are working with correlation or covariance matrices. For data with mixed types (numbers and text), use a data frame instead.

# Creating a Matrix — fills column by column by default
score_matrix <- matrix(
  c(85, 78, 90, 72,    # Math scores
    88, 82, 79, 85,    # Science scores
    74, 91, 83, 69),   # English scores
  nrow = 4,
  ncol = 3
)

rownames(score_matrix) <- c("Alice", "Bob", "Carol", "David")
colnames(score_matrix) <- c("Math", "Science", "English")
score_matrix
#>       Math Science English
#> Alice   85      88      74
#> Bob     78      82      91
#> Carol   90      79      83
#> David   72      85      69
# Exploring a Matrix
dim(score_matrix)      # c(rows, cols)
#> [1] 4 3
nrow(score_matrix)     # 4 rows
#> [1] 4
ncol(score_matrix)     # 3 columns
#> [1] 3
# Accessing Elements: [row, column] — leave blank for entire row or column
score_matrix[1, ]                    # Alice's scores (row 1, all columns)
#>    Math Science English 
#>      85      88      74
score_matrix[, 2]                    # Everyone's Science score (all rows, col 2)
#> Alice   Bob Carol David 
#>    88    82    79    85
score_matrix[2, 3]                   # Bob's English score
#> [1] 91
score_matrix["Carol", "Math"]        # Carol's Math score (by name)
#> [1] 90
# Matrix Operations with apply()
apply(score_matrix, 1, mean)         # MARGIN=1: average per student (across rows)
#>    Alice      Bob    Carol    David 
#> 82.33333 83.66667 84.00000 75.33333
apply(score_matrix, 2, mean)         # MARGIN=2: average per subject (across columns)
#>    Math Science English 
#>   81.25   83.50   79.25
t(score_matrix)                      # transpose: rows become columns
#>         Alice Bob Carol David
#> Math       85  78    90    72
#> Science    88  82    79    85
#> English    74  91    83    69
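# A small sketch of the matrix algebra operations mentioned above
m <- matrix(1:4, nrow = 2)    # 2x2 matrix, filled column by column: 1 3 / 2 4
m %*% m                       # matrix multiplication (element-wise would be m * m)
#>      [,1] [,2]
#> [1,]    7   15
#> [2,]   10   22
solve(m)                      # matrix inverse
#>      [,1] [,2]
#> [1,]   -2  1.5
#> [2,]    1 -0.5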

3.3 Lists — Flexible Containers for Mixed Data

A list is R’s most flexible data structure. Unlike vectors and matrices, a list can hold elements of different types and different lengths — including other lists. A list is like a filing cabinet: each drawer (element) can contain something completely different — a number, a sentence, a vector, a data frame, even another list.

Lists are essential because real-world data is rarely uniform. A person’s profile has a name (text), age (number), scores (vector), and address (nested structure) — a list can hold all of this together. Many R functions return lists as output (regression models, t-tests, etc.) because they need to return multiple different types of results at once.

Use lists when: (1) you need to store heterogeneous data (mixed types) together, (2) a function needs to return multiple results of different types, (3) you are working with the output of statistical functions like lm(), t.test(), or summary(), (4) you need nested or hierarchical data structures.

# Creating a List — notice the mixed types
student_profile <- list(
  name        = "Alice Mukamana",
  age         = 22,
  gpa         = 3.85,
  courses     = c("Statistics", "R Programming", "Data Mining"),
  scores      = c(88, 92, 79),
  is_enrolled = TRUE,
  address     = list(city = "Kigali", district = "Gasabo")  # nested list!
)

# Accessing List Elements — Three Methods
student_profile$name              # $ operator (most common for named lists)
#> [1] "Alice Mukamana"
student_profile[["gpa"]]          # [[ ]] double brackets (returns the element)
#> [1] 3.85
student_profile[[4]]              # by position: 4th element = courses
#> [1] "Statistics"    "R Programming" "Data Mining"
# Accessing nested elements
student_profile$address$city      # "Kigali"
#> [1] "Kigali"
# Exploring a List
length(student_profile)           # how many top-level elements?
#> [1] 7
names(student_profile)            # names of all elements
#> [1] "name"        "age"         "gpa"         "courses"     "scores"     
#> [6] "is_enrolled" "address"
# Adding and modifying elements
student_profile$graduation_year <- 2025
student_profile$age <- 23

# Why functions return lists — example with t.test()
test_result <- t.test(c(75, 82, 88, 79, 91), mu = 75)
class(test_result)                # "htest" — a list
#> [1] "htest"
names(test_result)                # all available components
#>  [1] "statistic"   "parameter"   "p.value"     "conf.int"    "estimate"   
#>  [6] "null.value"  "stderr"      "alternative" "method"      "data.name"
test_result$p.value               # extract just the p-value
#> [1] 0.05169353
test_result$conf.int              # extract the confidence interval
#> [1] 74.90534 91.09466
#> attr(,"conf.level")
#> [1] 0.95
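# One more access pattern worth knowing: single brackets [ ] return a sub-LIST,
# while double brackets [[ ]] return the element itself
student_profile["gpa"]       # a list of length 1
#> $gpa
#> [1] 3.85
student_profile[["gpa"]]     # the numeric value inside
#> [1] 3.85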

3.4 Data Frames — The Workhorse of Data Analysis

A data frame is R’s representation of a rectangular dataset — like a spreadsheet or database table. It has rows (observations) and columns (variables). The crucial difference from a matrix is that columns can have different types: one column can be text (names), another numeric (scores), another logical (passed?), another factor (category). Every column is a vector of the same length, and together they form the table.

Data frames are the single most important data structure for data analysis. When you import a CSV file, it becomes a data frame. When you analyze survey data, each respondent is a row and each question is a column. The entire tidyverse ecosystem (dplyr, ggplot2, tidyr) is built around data frames. Mastering data frames is mastering R data analysis.

Use data frames when: (1) your data has multiple variables of different types (which is almost always), (2) you are working with imported datasets (CSV, Excel, database), (3) you need to filter rows, select columns, merge datasets, or create summaries — essentially any realistic data analysis task.

# Creating a Data Frame — each column is a named vector of the SAME length
employees <- data.frame(
  id         = 1:6,
  name       = c("Alice", "Bob", "Carol", "David", "Eve", "Frank"),
  department = c("Finance", "IT", "HR", "Finance", "IT", "HR"),
  salary     = c(850000, 920000, 780000, 1100000, 880000, 750000),
  years_exp  = c(3, 5, 2, 8, 4, 1),
  promoted   = c(FALSE, TRUE, FALSE, TRUE, FALSE, FALSE),
  stringsAsFactors = FALSE
)
employees
#>   id  name department  salary years_exp promoted
#> 1  1 Alice    Finance  850000         3    FALSE
#> 2  2   Bob         IT  920000         5     TRUE
#> 3  3 Carol         HR  780000         2    FALSE
#> 4  4 David    Finance 1100000         8     TRUE
#> 5  5   Eve         IT  880000         4    FALSE
#> 6  6 Frank         HR  750000         1    FALSE
# First Look at the Data
nrow(employees); ncol(employees); dim(employees)
#> [1] 6
#> [1] 6
#> [1] 6 6
# str() is your MOST IMPORTANT tool for understanding a new dataset
# Shows: dimensions, column names, types, and a preview of values
str(employees)
#> 'data.frame':    6 obs. of  6 variables:
#>  $ id        : int  1 2 3 4 5 6
#>  $ name      : chr  "Alice" "Bob" "Carol" "David" ...
#>  $ department: chr  "Finance" "IT" "HR" "Finance" ...
#>  $ salary    : num  850000 920000 780000 1100000 880000 750000
#>  $ years_exp : num  3 5 2 8 4 1
#>  $ promoted  : logi  FALSE TRUE FALSE TRUE FALSE FALSE
# summary() gives statistical summaries for each column
summary(employees)
#>        id           name            department            salary       
#>  Min.   :1.00   Length:6           Length:6           Min.   : 750000  
#>  1st Qu.:2.25   Class :character   Class :character   1st Qu.: 797500  
#>  Median :3.50   Mode  :character   Mode  :character   Median : 865000  
#>  Mean   :3.50                                         Mean   : 880000  
#>  3rd Qu.:4.75                                         3rd Qu.: 910000  
#>  Max.   :6.00                                         Max.   :1100000  
#>    years_exp      promoted      
#>  Min.   :1.000   Mode :logical  
#>  1st Qu.:2.250   FALSE:4        
#>  Median :3.500   TRUE :2        
#>  Mean   :3.833                  
#>  3rd Qu.:4.750                  
#>  Max.   :8.000                  
# Accessing Columns
employees$name
#> [1] "Alice" "Bob"   "Carol" "David" "Eve"   "Frank"
employees[["salary"]]
#> [1]  850000  920000  780000 1100000  880000  750000
# Accessing Rows
employees[1, ]            # first row
#>   id  name department salary years_exp promoted
#> 1  1 Alice    Finance 850000         3    FALSE
employees[3:5, ]          # rows 3 to 5
#>   id  name department  salary years_exp promoted
#> 3  3 Carol         HR  780000         2    FALSE
#> 4  4 David    Finance 1100000         8     TRUE
#> 5  5   Eve         IT  880000         4    FALSE
employees[2, "salary"]    # Bob's salary specifically
#> [1] 920000
# Filtering Rows — the KEY operation in data analysis
employees[employees$department == "Finance", ]
#>   id  name department  salary years_exp promoted
#> 1  1 Alice    Finance  850000         3    FALSE
#> 4  4 David    Finance 1100000         8     TRUE
employees[employees$salary > 900000, ]
#>   id  name department  salary years_exp promoted
#> 2  2   Bob         IT  920000         5     TRUE
#> 4  4 David    Finance 1100000         8     TRUE
employees[employees$department == "IT" & employees$years_exp > 3, ]
#>   id name department salary years_exp promoted
#> 2  2  Bob         IT 920000         5     TRUE
#> 5  5  Eve         IT 880000         4    FALSE
# Adding a New Column
employees$annual_bonus    <- employees$salary * 0.10
employees$total_package   <- employees$salary + employees$annual_bonus
head(employees)
#>   id  name department  salary years_exp promoted annual_bonus total_package
#> 1  1 Alice    Finance  850000         3    FALSE        85000        935000
#> 2  2   Bob         IT  920000         5     TRUE        92000       1012000
#> 3  3 Carol         HR  780000         2    FALSE        78000        858000
#> 4  4 David    Finance 1100000         8     TRUE       110000       1210000
#> 5  5   Eve         IT  880000         4    FALSE        88000        968000
#> 6  6 Frank         HR  750000         1    FALSE        75000        825000

3.5 Factors — Categorical Data with Defined Levels

A factor is R’s data structure for categorical variables — variables that take on a limited set of predefined values called “levels”. Examples: gender (Male/Female), education level (Primary/Secondary/University), rating (Poor/Fair/Good/Excellent). Internally, R stores factors as integers with a lookup table of labels, making them memory-efficient. Ordered factors have levels with a meaningful order; unordered factors do not.

Factors matter because:

  1. Statistical correctness: Many functions treat factors differently from text — lm(), glm(), and aov() automatically create indicator (dummy) variables from factors, which is necessary for correct statistical modeling.
  2. Visualization control: ggplot2 uses factor levels to determine the order of categories in plots. Without factors, bars and categories appear in alphabetical order; with factors, you control the order.
  3. Memory efficiency: Storing “Male”/“Female” as 1/2 with a lookup table uses less memory than repeating the full strings.

Use factors when: (1) a column has a fixed, known set of possible values (categories), (2) order matters (education level, satisfaction rating, income bracket), (3) you are building statistical models that include categorical predictors, (4) you want to control the display order in plots.

# Unordered Factor (Nominal) — for categories with no natural order
department <- factor(c("Finance", "IT", "HR", "Finance", "IT", "HR", "Finance"))
department
#> [1] Finance IT      HR      Finance IT      HR      Finance
#> Levels: Finance HR IT
levels(department)       # the unique categories
#> [1] "Finance" "HR"      "IT"
nlevels(department)      # how many levels?
#> [1] 3
table(department)        # frequency count of each level
#> department
#> Finance      HR      IT 
#>       3       2       2
# Ordered Factor (Ordinal) — for categories with a meaningful order
satisfaction <- factor(
  c("Good", "Excellent", "Fair", "Good", "Poor", "Excellent", "Fair"),
  levels  = c("Poor", "Fair", "Good", "Excellent"),  # define the order explicitly
  ordered = TRUE
)
satisfaction
#> [1] Good      Excellent Fair      Good      Poor      Excellent Fair     
#> Levels: Poor < Fair < Good < Excellent
# Now comparison operators work meaningfully
satisfaction > "Fair"       # which responses are better than Fair?
#> [1]  TRUE  TRUE FALSE  TRUE FALSE  TRUE FALSE
satisfaction >= "Good"      # which are Good or better?
#> [1]  TRUE  TRUE FALSE  TRUE FALSE  TRUE FALSE
min(satisfaction)           # worst rating
#> [1] Poor
#> Levels: Poor < Fair < Good < Excellent
max(satisfaction)           # best rating
#> [1] Excellent
#> Levels: Poor < Fair < Good < Excellent
# Relevel: change the reference category for modeling
department <- relevel(department, ref = "Finance")
levels(department)          # Finance is now first (= reference in regression)
#> [1] "Finance" "HR"      "IT"
# Drop unused levels after filtering
small_dept <- department[department != "HR"]
droplevels(small_dept)      # "HR" level is now removed
#> [1] Finance IT      Finance IT      Finance
#> Levels: Finance IT
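# Factor level order controls display order (importance point 2 above).
# A base R sketch; ggplot2 respects the same ordering
table(satisfaction)          # counts follow level order, not alphabetical order
#> satisfaction
#>      Poor      Fair      Good Excellent 
#>         1         2         2         2
# barplot(table(satisfaction)) would draw the bars in the same order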

4 Module 4: Control Flow — Making R Think and Decide

Control flow refers to the order in which R executes statements. By default, R runs code top to bottom, one line at a time. Control flow structures let you change that order based on conditions (if/else) or repeat code automatically multiple times (loops). This is what makes programs intelligent — they can make decisions and handle repetitive tasks.

4.1 Conditional Statements: if / else if / else

An if statement evaluates a logical condition (TRUE or FALSE) and executes a block of code only if the condition is TRUE. else if checks additional conditions when the first is FALSE. else is the fallback — it runs only when all previous conditions were FALSE. Only one branch ever executes.

Conditional logic is the basis of all decision-making in code. In data analysis, you use it to: classify data into categories (pass/fail), handle special cases (what to do if a value is NA), apply different formulas based on conditions (discount tiers based on purchase amount), or validate inputs.

Use if/else when you need your code to behave differently depending on a single value or condition. For applying a condition to an entire vector (e.g., categorize all rows in a column), use ifelse() instead — it is vectorized and applies to every element simultaneously.

# Basic if / else if / else — income tax bracket example
monthly_income <- 650000  # RWF

if (monthly_income >= 1000000) {
  tax_rate <- 0.30
  bracket  <- "High income"
} else if (monthly_income >= 500000) {
  tax_rate <- 0.20
  bracket  <- "Middle income"
} else if (monthly_income >= 100000) {
  tax_rate <- 0.10
  bracket  <- "Low income"
} else {
  tax_rate <- 0.00
  bracket  <- "Tax exempt"
}

tax_amount <- monthly_income * tax_rate
cat("Income:", monthly_income, "RWF\n")
#> Income: 650000 RWF
cat("Bracket:", bracket, "\n")
#> Bracket: Middle income
cat("Tax rate:", tax_rate * 100, "%\n")
#> Tax rate: 20 %
cat("Tax owed:", tax_amount, "RWF\n")
#> Tax owed: 130000 RWF
# ifelse() for Vectorized Conditions — applies to every element of a vector
scores <- c(45, 72, 88, 58, 91, 63, 77)

pass_fail <- ifelse(scores >= 60, "Pass", "Fail")
pass_fail
#> [1] "Fail" "Pass" "Pass" "Fail" "Pass" "Pass" "Pass"
# Nested ifelse() for multiple categories
grade <- ifelse(scores >= 90, "A",
         ifelse(scores >= 80, "B",
         ifelse(scores >= 70, "C",
         ifelse(scores >= 60, "D", "F"))))
grade
#> [1] "F" "C" "B" "F" "A" "D" "C"
# switch() — cleaner alternative to many else-if branches for exact value matching
get_currency_symbol <- function(country) {
  switch(country,
    "Rwanda"  = "RWF",
    "USA"     = "USD",
    "UK"      = "GBP",
    "France"  = "EUR",
    "Unknown currency"
  )
}
get_currency_symbol("Rwanda")
#> [1] "RWF"
get_currency_symbol("Japan")
#> [1] "Unknown currency"

4.2 Loops: for, while, and repeat

A loop is a control structure that executes a block of code repeatedly. A for loop repeats a fixed number of times (once for each element in a sequence). A while loop repeats as long as a condition remains TRUE — the number of iterations is not known in advance. A repeat loop runs forever until explicitly stopped with break.

Loops are essential when you need to perform the same operation many times — reading multiple files, applying a function to each group, building results iteratively, or running simulations. However, in R, loops can be slow on large datasets. Wherever possible, prefer vectorized operations or the apply family over explicit loops for better performance.

Use for when you know how many iterations you need (looping over a list of files, items, or indices). Use while when you loop until a condition changes (convergence in optimization). Use repeat/break rarely — usually for polling or until-type logic. Prefer vectorized operations for speed when working with vectors and data frames.

# for loop — iterates over each element in a sequence
months <- c("Jan", "Feb", "Mar", "Apr", "May", "Jun")

for (month in months) {
  cat("Processing month:", month, "\n")
}
#> Processing month: Jan 
#> Processing month: Feb 
#> Processing month: Mar 
#> Processing month: Apr 
#> Processing month: May 
#> Processing month: Jun
# for loop — accumulating a single result across iterations
# (when a loop builds a whole VECTOR of results, pre-allocate it; see the sketch below)
n            <- 10
factorial_n  <- 1

for (i in 1:n) {
  factorial_n <- factorial_n * i
}
cat("10! =", factorial_n, "\n")
#> 10! = 3628800
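# Pre-allocation sketch: when a loop fills a vector of results,
# create it at full length first instead of growing it with c()
squares <- numeric(10)       # allocate all 10 slots up front
for (i in 1:10) {
  squares[i] <- i^2
}
squares
#>  [1]   1   4   9  16  25  36  49  64  81 100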
# for loop — iterating over rows of a data frame
monthly_sales <- data.frame(
  region = c("North", "South", "East", "West"),
  Q1 = c(450, 380, 520, 410), Q2 = c(480, 400, 490, 430),
  Q3 = c(510, 420, 540, 450), Q4 = c(600, 450, 580, 490)
)

monthly_sales$annual_total <- 0
for (i in 1:nrow(monthly_sales)) {
  monthly_sales$annual_total[i] <- sum(monthly_sales[i, 2:5])
}
monthly_sales
#>   region  Q1  Q2  Q3  Q4 annual_total
#> 1  North 450 480 510 600         2040
#> 2  South 380 400 420 450         1650
#> 3   East 520 490 540 580         2130
#> 4   West 410 430 450 490         1780
# while loop — repeats until condition changes
# Example: How many years does it take for an investment to double?
investment <- 100000   # initial investment in RWF
rate       <- 0.08     # 8% annual growth
years      <- 0
target     <- 200000

while (investment < target) {
  investment <- investment * (1 + rate)
  years      <- years + 1
}
cat("Investment doubles after", years, "years\n")
#> Investment doubles after 10 years
cat("Final value: RWF", round(investment, 0), "\n")
#> Final value: RWF 215892
# break and next — controlling loop flow
# next: skip this iteration; break: exit the loop entirely
scores <- c(88, NA, 72, NA, 91, 55, NA, 78)

cat("Valid scores above 60: ")
#> Valid scores above 60:
for (s in scores) {
  if (is.na(s)) next    # skip missing values
  if (s < 60) next      # skip failing scores
  cat(s, "")
}
#> 88 72 91 78
cat("\n")

5 Module 5: Functions — Writing Reusable Code

A function is a named, reusable block of code that performs a specific task. Functions take inputs (called arguments or parameters), process them, and return an output. R comes with thousands of built-in functions (mean(), sum(), paste()), and you can write your own. The DRY principle — Don’t Repeat Yourself — is the key motivation: if you write the same code more than twice, it should be a function.

Functions are the foundation of good programming practice:

  1. Reusability: Write once, use anywhere — in the same script, across projects, or share with colleagues.
  2. Readability: calculate_bmi(weight, height) is far clearer than the raw formula repeated everywhere.
  3. Maintainability: If the logic changes, update it in one place, not in 20 places.
  4. Testability: You can test a function independently to make sure it works correctly.
  5. Abstraction: Functions let you build complexity step by step, hiding details behind meaningful names.

5.1 Built-in Functions

# Mathematical Functions
abs(-42)             # absolute value: 42
#> [1] 42
sqrt(144)            # square root: 12
#> [1] 12
log(100)             # natural log (base e)
#> [1] 4.60517
log10(1000)          # base-10 log: 3
#> [1] 3
exp(1)               # e^1 = 2.718...
#> [1] 2.718282
ceiling(3.2)         # round UP to nearest integer: 4
#> [1] 4
floor(3.9)           # round DOWN to nearest integer: 3
#> [1] 3
round(3.14159, digits = 2)   # round to 2 decimal places: 3.14
#> [1] 3.14
# Statistical Functions
x <- c(23, 45, 12, 67, 34, 56, 28, 41, 38, 52)
mean(x)              # arithmetic average
#> [1] 39.6
median(x)            # middle value when sorted
#> [1] 39.5
var(x)               # variance
#> [1] 267.8222
sd(x)                # standard deviation
#> [1] 16.36527
sum(x)               # total
#> [1] 396
cumsum(x)            # running total (cumulative sum)
#>  [1]  23  68  80 147 181 237 265 306 344 396
quantile(x)          # 0%, 25%, 50%, 75%, 100%
#>    0%   25%   50%   75%  100% 
#> 12.00 29.50 39.50 50.25 67.00
quantile(x, 0.90)    # 90th percentile
#>  90% 
#> 57.1
# String Functions
greeting <- "Hello, Data Analyst!"
nchar(greeting)                      # number of characters
#> [1] 20
toupper(greeting)                    # ALL CAPS
#> [1] "HELLO, DATA ANALYST!"
tolower(greeting)                    # all lowercase
#> [1] "hello, data analyst!"
substr(greeting, 1, 5)              # extract characters 1-5: "Hello"
#> [1] "Hello"
gsub("a", "@", greeting)            # replace all 'a' with '@'
#> [1] "Hello, D@t@ An@lyst!"
trimws("  spaces around  ")         # remove leading/trailing whitespace
#> [1] "spaces around"
paste("R", "is", "great")          # join with spaces
#> [1] "R is great"
paste0("file_", 1:3, ".csv")       # join without separator -> vector
#> [1] "file_1.csv" "file_2.csv" "file_3.csv"
sprintf("Score: %.1f%%", 87.5)     # formatted string
#> [1] "Score: 87.5%"
# Utility Functions
sort(x)                              # sort ascending
#>  [1] 12 23 28 34 38 41 45 52 56 67
sort(x, decreasing = TRUE)          # sort descending
#>  [1] 67 56 52 45 41 38 34 28 23 12
unique(c(1,2,2,3,3,3))             # remove duplicates: 1 2 3
#> [1] 1 2 3
which(x > 40)                       # INDICES where condition is TRUE
#> [1]  2  4  6  8 10
which.min(x); which.max(x)         # index of min/max value
#> [1] 3
#> [1] 4

5.2 Writing Custom Functions

# Basic Function Structure:
# function_name <- function(argument1, argument2, ...) {
#   body: code that does the work
#   return(result)
# }

# Example 1: Unit Conversion
celsius_to_fahrenheit <- function(celsius) {
  fahrenheit <- (celsius * 9/5) + 32
  return(fahrenheit)
}
celsius_to_fahrenheit(0)    # 32 degrees F (freezing)
#> [1] 32
celsius_to_fahrenheit(100)  # 212 degrees F (boiling)
#> [1] 212
celsius_to_fahrenheit(37)   # 98.6 degrees F (body temperature)
#> [1] 98.6
# Example 2: Default Arguments — used when caller does not provide a value
calculate_loan_payment <- function(principal, rate_annual = 0.12, years = 5) {
  rate_monthly <- rate_annual / 12
  n_months     <- years * 12
  payment <- principal * (rate_monthly * (1 + rate_monthly)^n_months) /
             ((1 + rate_monthly)^n_months - 1)
  return(round(payment, 2))
}

calculate_loan_payment(1000000)              # uses defaults: 12% rate, 5 years
#> [1] 22244.45
calculate_loan_payment(1000000, rate_annual = 0.09)
#> [1] 20758.36
calculate_loan_payment(1000000, years = 3)
#> [1] 33214.31
# Example 3: Returning Multiple Values via a List
describe_vector <- function(x, label = "Data") {
  x_clean <- x[!is.na(x)]   # remove NAs first
  list(
    label     = label,
    n         = length(x_clean),
    n_missing = sum(is.na(x)),
    mean      = round(mean(x_clean), 2),
    median    = round(median(x_clean), 2),
    sd        = round(sd(x_clean), 2),
    min       = min(x_clean),
    max       = max(x_clean)
  )
}

exam_scores <- c(78, 85, NA, 92, 67, 88, NA, 74, 91)
stats <- describe_vector(exam_scores, label = "Exam Scores")
stats$mean
#> [1] 82.14
stats$n_missing
#> [1] 2
# Example 4: Input Validation — always check your inputs!
safe_divide <- function(numerator, denominator) {
  if (!is.numeric(numerator) || !is.numeric(denominator)) {  # scalar || (not vectorized |) inside if()
    stop("Both arguments must be numeric.")   # stop() throws an error
  }
  if (denominator == 0) {
    warning("Division by zero — returning NA.")  # warning() alerts but continues
    return(NA)
  }
  return(numerator / denominator)
}

safe_divide(10, 2)
#> [1] 5
safe_divide(10, 0)      # triggers warning
#> [1] NA

5.3 The Apply Family — Replacing Loops with Elegance

The apply family is a set of functions that apply a function to elements of a data structure — rows of a matrix, elements of a list, groups of a vector — without writing an explicit for loop. The main members are apply() (matrices), lapply() (lists, returns list), sapply() (lists, returns simplified vector), and tapply() (vectors with groups).

The apply family produces cleaner, more readable code than equivalent for loops, and it sidesteps common loop mistakes such as growing a result object one element at a time. More importantly, it forces you to write your operation as a function, which encourages modular, reusable code. In the tidyverse, purrr::map() is the modern equivalent and is even more powerful.

# apply() — apply a function over rows (margin=1) or columns (margin=2) of a matrix
scores_matrix <- matrix(c(85, 78, 90, 72, 88, 82, 79, 85, 74, 91, 83, 69),
                        nrow = 4, byrow = TRUE,
                        dimnames = list(c("Alice","Bob","Carol","David"),
                                        c("Math","Science","English")))
scores_matrix
#>       Math Science English
#> Alice   85      78      90
#> Bob     72      88      82
#> Carol   79      85      74
#> David   91      83      69
apply(scores_matrix, 1, mean)    # average per student (across each row)
#>    Alice      Bob    Carol    David 
#> 84.33333 80.66667 79.33333 81.00000
apply(scores_matrix, 2, mean)    # average per subject (across each column)
#>    Math Science English 
#>   81.75   83.50   78.75
apply(scores_matrix, 2, max)     # highest score in each subject
#>    Math Science English 
#>      91      88      90
# lapply() — apply function to each list element, returns a LIST
monthly_revenues <- list(Jan = c(450, 380, 520), Feb = c(480, 400, 490), Mar = c(510, 420, 540))
lapply(monthly_revenues, sum)     # total per month (returns list)
#> $Jan
#> [1] 1350
#> 
#> $Feb
#> [1] 1370
#> 
#> $Mar
#> [1] 1470
# sapply() — same as lapply but simplifies result to a VECTOR or MATRIX
sapply(monthly_revenues, sum)     # returns a named vector — cleaner!
#>  Jan  Feb  Mar 
#> 1350 1370 1470
sapply(monthly_revenues, mean)    # average per month
#>      Jan      Feb      Mar 
#> 450.0000 456.6667 490.0000
# tapply() — apply function to groups — extremely useful for group-wise statistics
employee_salaries <- c(850000, 920000, 780000, 1100000, 880000, 750000)
departments       <- c("Finance", "IT", "HR", "Finance", "IT", "HR")

tapply(employee_salaries, departments, mean)    # average salary per department
#> Finance      HR      IT 
#>  975000  765000  900000
tapply(employee_salaries, departments, length)  # count per department
#> Finance      HR      IT 
#>       2       2       2
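# sapply() with your own (anonymous) function: range of revenue per month
sapply(monthly_revenues, function(v) max(v) - min(v))
#> Jan Feb Mar 
#> 140  90 120
# purrr::map_dbl(monthly_revenues, sum) is the tidyverse equivalent noted above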

6 Module 6: Importing & Exporting Data

In real data analysis, data almost never comes from code — it comes from external sources: CSV files from databases, Excel reports from colleagues, API responses, survey exports, database queries. R has extensive tools for importing data from virtually any format. Exporting allows you to share cleaned data, model results, or reports with others.

No matter how good your analysis is, if you cannot get data into R and results out of R, you cannot work in a real environment. Understanding import/export is the bridge between R and the rest of the data ecosystem — spreadsheets, databases, BI tools, and stakeholder reports.

# read.csv() — Base R, always available, works for any CSV file
sales_data <- read.csv(
  "sales_2024.csv",
  header           = TRUE,   # first row contains column names
  sep              = ",",    # field separator (use ";" for European CSVs)
  stringsAsFactors = FALSE,  # keep text as character, not factor (the default since R 4.0)
  na.strings       = c("NA", "", "N/A", "-", "null")  # what counts as missing?
)

# readr::read_csv() — faster and smarter (tidyverse version)
library(readr)
sales_data <- read_csv("sales_2024.csv")  # note: read_csv not read.csv

# Reading Excel files (install once: install.packages("readxl"))
library(readxl)
budget <- read_excel("budget_2024.xlsx", sheet = "Q1 Data")
excel_sheets("budget_2024.xlsx")          # see what sheets are available

# Exporting Results
write.csv(sales_data, "sales_clean.csv", row.names = FALSE)

library(writexl)
write_xlsx(list("Sales" = sales_data, "Budget" = budget), "report.xlsx")

# Save and reload R objects
saveRDS(sales_data, "sales_clean.rds")        # save single object
loaded <- readRDS("sales_clean.rds")          # load it back
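# A fully runnable round trip using a temporary file (no external data needed)
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(x = 1:3, y = c("a", "b", "c")), tmp, row.names = FALSE)
read.csv(tmp)
#>   x y
#> 1 1 a
#> 2 2 b
#> 3 3 c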
# Using Built-in Datasets for Learning and Practice
data(mtcars)    # Motor Trend car road tests (1974)
head(mtcars)
#>                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
#> Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
#> Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
#> Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
#> Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
#> Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
#> Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
str(mtcars)
#> 'data.frame':    32 obs. of  11 variables:
#>  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
#>  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
#>  $ disp: num  160 160 108 258 360 ...
#>  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
#>  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
#>  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
#>  $ qsec: num  16.5 17 18.6 19.4 17 ...
#>  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
#>  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
#>  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
#>  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...
data(iris)      # Fisher's iris flower dataset — classic in statistics and ML
head(iris)
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1          5.1         3.5          1.4         0.2  setosa
#> 2          4.9         3.0          1.4         0.2  setosa
#> 3          4.7         3.2          1.3         0.2  setosa
#> 4          4.6         3.1          1.5         0.2  setosa
#> 5          5.0         3.6          1.4         0.2  setosa
#> 6          5.4         3.9          1.7         0.4  setosa

7 Module 7: Data Cleaning & Manipulation

Data cleaning is the process of detecting and correcting problems in raw data — missing values, wrong types, duplicate records, inconsistent formatting, and outliers. Data manipulation is reshaping, transforming, and restructuring data to make it suitable for analysis. Together, these tasks typically consume 60-80% of a data analyst’s time. Clean, well-structured data is the foundation of valid analysis — garbage in, garbage out.

library(dplyr)    # data manipulation
library(tidyr)    # data reshaping
library(stringr)  # string manipulation

7.1 Handling Missing Values (NA)

In R, missing values are represented by NA (Not Available). NA is not zero, not an empty string, not -999 — it is a genuine placeholder that says “we don’t know what this value is.” NA is contagious: any arithmetic involving NA returns NA (5 + NA = NA). This is intentional — if you don’t know one value, you cannot know the result of a calculation involving it.

Unhandled missing values silently corrupt your analysis. A mean calculated without removing NAs returns NA. A model trained on data with NAs may error or produce biased results. Understanding why data is missing (Missing Completely at Random vs Not at Random) determines the right strategy for handling it.

Remove rows (na.omit) only when missingness is rare and random, and you have enough data to afford losing rows. Impute with mean/median when the variable is numeric and missingness is random. Impute with mode or “Unknown” for categorical variables. For modeling, never impute before splitting into training and test sets: compute the imputation values on the training data only, otherwise information from the test set leaks into training (data leakage).
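
The contagion rule in action, as a quick sketch; most statistical functions accept na.rm = TRUE to drop NAs before computing:

# NA propagates through arithmetic; na.rm = TRUE is the escape hatch
5 + NA
#> [1] NA
mean(c(70, 80, NA))
#> [1] NA
mean(c(70, 80, NA), na.rm = TRUE)
#> [1] 75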

# Creating Data with Missing Values
survey <- data.frame(
  respondent = 1:8,
  age        = c(25, NA, 30, 22, NA, 35, 28, NA),
  income     = c(500000, 800000, NA, 450000, 600000, NA, 750000, 400000),
  education  = c("University", "Secondary", "University", NA,
                 "Primary", "University", NA, "Secondary")
)
survey
#>   respondent age income  education
#> 1          1  25 500000 University
#> 2          2  NA 800000  Secondary
#> 3          3  30     NA University
#> 4          4  22 450000       <NA>
#> 5          5  NA 600000    Primary
#> 6          6  35     NA University
#> 7          7  28 750000       <NA>
#> 8          8  NA 400000  Secondary
# Detecting Missing Values
is.na(survey$age)               # TRUE/FALSE vector: which are NA?
#> [1] FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE
sum(is.na(survey$age))          # count NAs in one column
#> [1] 3
colSums(is.na(survey))          # count NAs in EVERY column at once
#> respondent        age     income  education 
#>          0          3          2          2
round(colMeans(is.na(survey)) * 100, 1)   # percentage missing per column
#> respondent        age     income  education 
#>        0.0       37.5       25.0       25.0
# Strategy 1: Remove rows with NAs (use carefully — loses data)
survey_complete <- na.omit(survey)
cat("Rows after na.omit:", nrow(survey_complete), "(was 8)\n")
#> Rows after na.omit: 1 (was 8)
# Strategy 2: Mean/Median Imputation for numeric columns
survey$age_imputed    <- ifelse(is.na(survey$age),
                                round(mean(survey$age, na.rm = TRUE)),
                                survey$age)
survey$income_imputed <- ifelse(is.na(survey$income),
                                median(survey$income, na.rm = TRUE),
                                survey$income)

# Strategy 3: Replace NA with a constant (for categorical)
survey$education[is.na(survey$education)] <- "Unknown"

survey[, c("respondent", "age", "age_imputed", "income", "income_imputed", "education")]
#>   respondent age age_imputed income income_imputed  education
#> 1          1  25          25 500000         500000 University
#> 2          2  NA          28 800000         800000  Secondary
#> 3          3  30          30     NA         550000 University
#> 4          4  22          22 450000         450000    Unknown
#> 5          5  NA          28 600000         600000    Primary
#> 6          6  35          35     NA         550000 University
#> 7          7  28          28 750000         750000    Unknown
#> 8          8  NA          28 400000         400000  Secondary

7.2 Data Manipulation with dplyr

dplyr is the tidyverse package for data manipulation. It provides a consistent set of verbs (functions) that each perform one clear operation on a data frame. These verbs are designed to be combined using the pipe operator %>% (read as “then”), which chains operations left-to-right in the order you think about them, making complex transformations highly readable.

Without dplyr, complex data manipulation requires nested function calls that read inside-out — confusing and error-prone. With dplyr and the pipe, the same operations read in plain English from left to right. Additionally, dplyr works with databases (via dbplyr) using the exact same syntax, so your skills transfer directly to SQL-backed data sources.

data(mtcars)
mtcars <- tibble::rownames_to_column(mtcars, "car_model")

# The Pipe Operator %>% — makes code read like plain English
# Without pipe (inside-out, hard to read):
# arrange(select(filter(mtcars, cyl == 6), car_model, mpg, cyl), desc(mpg))

# With pipe (reads: take mtcars, THEN filter, THEN select, THEN arrange):
mtcars %>%
  filter(cyl == 6) %>%
  select(car_model, mpg, cyl) %>%
  arrange(desc(mpg))
#>        car_model  mpg cyl
#> 1 Hornet 4 Drive 21.4   6
#> 2      Mazda RX4 21.0   6
#> 3  Mazda RX4 Wag 21.0   6
#> 4   Ferrari Dino 19.7   6
#> 5       Merc 280 19.2   6
#> 6        Valiant 18.1   6
#> 7      Merc 280C 17.8   6
# filter() — keep rows meeting a condition (like SQL WHERE)
mtcars %>% filter(cyl == 8, mpg > 15, hp > 200)
#>        car_model  mpg cyl disp  hp drat   wt qsec vs am gear carb
#> 1 Ford Pantera L 15.8   8  351 264 4.22 3.17 14.5  0  1    5    4
mtcars %>% filter(cyl == 4 | cyl == 6)              # OR condition
#>         car_model  mpg cyl  disp  hp drat    wt  qsec vs am gear carb
#> 1       Mazda RX4 21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
#> 2   Mazda RX4 Wag 21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
#> 3      Datsun 710 22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
#> 4  Hornet 4 Drive 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
#> 5         Valiant 18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
#> 6       Merc 240D 24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
#> 7        Merc 230 22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
#> 8        Merc 280 19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
#> 9       Merc 280C 17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
#> 10       Fiat 128 32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
#> 11    Honda Civic 30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
#> 12 Toyota Corolla 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
#> 13  Toyota Corona 21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
#> 14      Fiat X1-9 27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
#> 15  Porsche 914-2 26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
#> 16   Lotus Europa 30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
#> 17   Ferrari Dino 19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
#> 18     Volvo 142E 21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2
# select() — choose columns to keep (like SQL SELECT)
mtcars %>% select(car_model, mpg, hp, wt)
#>              car_model  mpg  hp    wt
#> 1            Mazda RX4 21.0 110 2.620
#> 2        Mazda RX4 Wag 21.0 110 2.875
#> 3           Datsun 710 22.8  93 2.320
#> 4       Hornet 4 Drive 21.4 110 3.215
#> 5    Hornet Sportabout 18.7 175 3.440
#> 6              Valiant 18.1 105 3.460
#> 7           Duster 360 14.3 245 3.570
#> 8            Merc 240D 24.4  62 3.190
#> 9             Merc 230 22.8  95 3.150
#> 10            Merc 280 19.2 123 3.440
#> 11           Merc 280C 17.8 123 3.440
#> 12          Merc 450SE 16.4 180 4.070
#> 13          Merc 450SL 17.3 180 3.730
#> 14         Merc 450SLC 15.2 180 3.780
#> 15  Cadillac Fleetwood 10.4 205 5.250
#> 16 Lincoln Continental 10.4 215 5.424
#> 17   Chrysler Imperial 14.7 230 5.345
#> 18            Fiat 128 32.4  66 2.200
#> 19         Honda Civic 30.4  52 1.615
#> 20      Toyota Corolla 33.9  65 1.835
#> 21       Toyota Corona 21.5  97 2.465
#> 22    Dodge Challenger 15.5 150 3.520
#> 23         AMC Javelin 15.2 150 3.435
#> 24          Camaro Z28 13.3 245 3.840
#> 25    Pontiac Firebird 19.2 175 3.845
#> 26           Fiat X1-9 27.3  66 1.935
#> 27       Porsche 914-2 26.0  91 2.140
#> 28        Lotus Europa 30.4 113 1.513
#> 29      Ford Pantera L 15.8 264 3.170
#> 30        Ferrari Dino 19.7 175 2.770
#> 31       Maserati Bora 15.0 335 3.570
#> 32          Volvo 142E 21.4 109 2.780
mtcars %>% select(-qsec, -vs, -am)                  # keep all EXCEPT these
#>              car_model  mpg cyl  disp  hp drat    wt gear carb
#> 1            Mazda RX4 21.0   6 160.0 110 3.90 2.620    4    4
#> 2        Mazda RX4 Wag 21.0   6 160.0 110 3.90 2.875    4    4
#> 3           Datsun 710 22.8   4 108.0  93 3.85 2.320    4    1
#> 4       Hornet 4 Drive 21.4   6 258.0 110 3.08 3.215    3    1
#> 5    Hornet Sportabout 18.7   8 360.0 175 3.15 3.440    3    2
#> 6              Valiant 18.1   6 225.0 105 2.76 3.460    3    1
#> 7           Duster 360 14.3   8 360.0 245 3.21 3.570    3    4
#> 8            Merc 240D 24.4   4 146.7  62 3.69 3.190    4    2
#> 9             Merc 230 22.8   4 140.8  95 3.92 3.150    4    2
#> 10            Merc 280 19.2   6 167.6 123 3.92 3.440    4    4
#> 11           Merc 280C 17.8   6 167.6 123 3.92 3.440    4    4
#> 12          Merc 450SE 16.4   8 275.8 180 3.07 4.070    3    3
#> 13          Merc 450SL 17.3   8 275.8 180 3.07 3.730    3    3
#> 14         Merc 450SLC 15.2   8 275.8 180 3.07 3.780    3    3
#> 15  Cadillac Fleetwood 10.4   8 472.0 205 2.93 5.250    3    4
#> 16 Lincoln Continental 10.4   8 460.0 215 3.00 5.424    3    4
#> 17   Chrysler Imperial 14.7   8 440.0 230 3.23 5.345    3    4
#> 18            Fiat 128 32.4   4  78.7  66 4.08 2.200    4    1
#> 19         Honda Civic 30.4   4  75.7  52 4.93 1.615    4    2
#> 20      Toyota Corolla 33.9   4  71.1  65 4.22 1.835    4    1
#> 21       Toyota Corona 21.5   4 120.1  97 3.70 2.465    3    1
#> 22    Dodge Challenger 15.5   8 318.0 150 2.76 3.520    3    2
#> 23         AMC Javelin 15.2   8 304.0 150 3.15 3.435    3    2
#> 24          Camaro Z28 13.3   8 350.0 245 3.73 3.840    3    4
#> 25    Pontiac Firebird 19.2   8 400.0 175 3.08 3.845    3    2
#> 26           Fiat X1-9 27.3   4  79.0  66 4.08 1.935    4    1
#> 27       Porsche 914-2 26.0   4 120.3  91 4.43 2.140    5    2
#> 28        Lotus Europa 30.4   4  95.1 113 3.77 1.513    5    2
#> 29      Ford Pantera L 15.8   8 351.0 264 4.22 3.170    5    4
#> 30        Ferrari Dino 19.7   6 145.0 175 3.62 2.770    5    6
#> 31       Maserati Bora 15.0   8 301.0 335 3.54 3.570    5    8
#> 32          Volvo 142E 21.4   4 121.0 109 4.11 2.780    4    2
# mutate() — create new columns or modify existing ones
mtcars %>%
  mutate(
    kpl        = round(mpg * 0.425, 2),              # convert MPG to km/liter
    hp_per_cyl = round(hp / cyl, 1),                 # power per cylinder
    wt_kg      = round(wt * 453.6, 0),              # weight in kg
    efficiency = ifelse(mpg > 20, "High", "Low")
  ) %>%
  select(car_model, mpg, kpl, hp_per_cyl, wt_kg, efficiency)
#>              car_model  mpg   kpl hp_per_cyl wt_kg efficiency
#> 1            Mazda RX4 21.0  8.92       18.3  1188       High
#> 2        Mazda RX4 Wag 21.0  8.92       18.3  1304       High
#> 3           Datsun 710 22.8  9.69       23.2  1052       High
#> 4       Hornet 4 Drive 21.4  9.09       18.3  1458       High
#> 5    Hornet Sportabout 18.7  7.95       21.9  1560        Low
#> 6              Valiant 18.1  7.69       17.5  1569        Low
#> 7           Duster 360 14.3  6.08       30.6  1619        Low
#> 8            Merc 240D 24.4 10.37       15.5  1447       High
#> 9             Merc 230 22.8  9.69       23.8  1429       High
#> 10            Merc 280 19.2  8.16       20.5  1560        Low
#> 11           Merc 280C 17.8  7.57       20.5  1560        Low
#> 12          Merc 450SE 16.4  6.97       22.5  1846        Low
#> 13          Merc 450SL 17.3  7.35       22.5  1692        Low
#> 14         Merc 450SLC 15.2  6.46       22.5  1715        Low
#> 15  Cadillac Fleetwood 10.4  4.42       25.6  2381        Low
#> 16 Lincoln Continental 10.4  4.42       26.9  2460        Low
#> 17   Chrysler Imperial 14.7  6.25       28.8  2424        Low
#> 18            Fiat 128 32.4 13.77       16.5   998       High
#> 19         Honda Civic 30.4 12.92       13.0   733       High
#> 20      Toyota Corolla 33.9 14.41       16.2   832       High
#> 21       Toyota Corona 21.5  9.14       24.2  1118       High
#> 22    Dodge Challenger 15.5  6.59       18.8  1597        Low
#> 23         AMC Javelin 15.2  6.46       18.8  1558        Low
#> 24          Camaro Z28 13.3  5.65       30.6  1742        Low
#> 25    Pontiac Firebird 19.2  8.16       21.9  1744        Low
#> 26           Fiat X1-9 27.3 11.60       16.5   878       High
#> 27       Porsche 914-2 26.0 11.05       22.8   971       High
#> 28        Lotus Europa 30.4 12.92       28.2   686       High
#> 29      Ford Pantera L 15.8  6.72       33.0  1438        Low
#> 30        Ferrari Dino 19.7  8.37       29.2  1256        Low
#> 31       Maserati Bora 15.0  6.38       41.9  1619        Low
#> 32          Volvo 142E 21.4  9.09       27.2  1261       High
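
The ifelse() call above handles only two categories. For three or more tiers, dplyr's case_when() is the idiomatic extension. A minimal sketch; the cutoffs are illustrative, not part of the course analysis:

# case_when(): a vectorized multi-way ifelse; conditions are checked in order
mtcars %>%
  mutate(efficiency = case_when(
    mpg >= 25 ~ "High",        # illustrative cutoffs
    mpg >= 18 ~ "Medium",
    TRUE      ~ "Low"          # fallback for everything else
  )) %>%
  select(car_model, mpg, efficiency) %>%
  head(3)
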
# arrange() — sort rows
mtcars %>% arrange(desc(mpg)) %>% select(car_model, mpg) %>% head(5)
#>        car_model  mpg
#> 1 Toyota Corolla 33.9
#> 2       Fiat 128 32.4
#> 3    Honda Civic 30.4
#> 4   Lotus Europa 30.4
#> 5      Fiat X1-9 27.3
# group_by() + summarise() — aggregate by groups (like SQL GROUP BY)
# This is one of the most powerful and frequently used patterns in dplyr
mtcars %>%
  group_by(cyl) %>%
  summarise(
    n_cars   = n(),
    avg_mpg  = round(mean(mpg), 1),
    avg_hp   = round(mean(hp), 1),
    best_mpg = max(mpg)
  ) %>%
  arrange(cyl)
#> # A tibble: 3 × 5
#>     cyl n_cars avg_mpg avg_hp best_mpg
#>   <dbl>  <int>   <dbl>  <dbl>    <dbl>
#> 1     4     11    26.7   82.6     33.9
#> 2     6      7    19.7  122.      21.4
#> 3     8     14    15.1  209.      19.2
# Chaining everything — real-world example:
# "Among powerful cars, which engine type has the best fuel efficiency?"
mtcars %>%
  filter(hp >= 100) %>%
  mutate(engine_type = ifelse(vs == 1, "Straight", "V-shape")) %>%
  group_by(engine_type) %>%
  summarise(count = n(), avg_mpg = round(mean(mpg), 1), avg_hp = round(mean(hp), 1)) %>%
  arrange(desc(avg_mpg))
#> # A tibble: 2 × 4
#>   engine_type count avg_mpg avg_hp
#>   <chr>       <int>   <dbl>  <dbl>
#> 1 Straight        6    21.4   114.
#> 2 V-shape        17    16.1   196.

7.3 Reshaping Data with tidyr

Data can be stored in wide format (one row per subject, many columns for different measurements) or long format (one row per observation, fewer columns). Wide format is easy for humans to read; long format is what ggplot2 and many modeling functions expect. tidyr provides pivot_longer() (wide to long) and pivot_wider() (long to wide) to convert between the two.

Convert to long format when: plotting multiple variables with ggplot2 (which needs all values in one column and the variable name in another), running repeated-measures ANOVA, or applying the same operation to multiple columns. Convert to wide format when: presenting a summary table or creating a human-readable report.

# Wide Format: one row per student, one column per subject
exam_results_wide <- data.frame(
  student = c("Alice", "Bob", "Carol", "David"),
  Math    = c(85, 70, 90, 75), Science = c(88, 75, 82, 68),
  English = c(78, 80, 95, 72), History = c(82, 65, 88, 77)
)
exam_results_wide
#>   student Math Science English History
#> 1   Alice   85      88      78      82
#> 2     Bob   70      75      80      65
#> 3   Carol   90      82      95      88
#> 4   David   75      68      72      77
# pivot_longer(): Wide to Long — one row per student-subject combination
exam_results_long <- exam_results_wide %>%
  pivot_longer(
    cols      = -student,      # pivot all columns EXCEPT student
    names_to  = "subject",     # column names become a new 'subject' column
    values_to = "score"        # values go into a new 'score' column
  )
exam_results_long
#> # A tibble: 16 × 3
#>    student subject score
#>    <chr>   <chr>   <dbl>
#>  1 Alice   Math       85
#>  2 Alice   Science    88
#>  3 Alice   English    78
#>  4 Alice   History    82
#>  5 Bob     Math       70
#>  6 Bob     Science    75
#>  7 Bob     English    80
#>  8 Bob     History    65
#>  9 Carol   Math       90
#> 10 Carol   Science    82
#> 11 Carol   English    95
#> 12 Carol   History    88
#> 13 David   Math       75
#> 14 David   Science    68
#> 15 David   English    72
#> 16 David   History    77
# Now easily calculate per-subject and per-student statistics
exam_results_long %>% group_by(subject) %>%
  summarise(avg_score = mean(score), top_score = max(score))
#> # A tibble: 4 × 3
#>   subject avg_score top_score
#>   <chr>       <dbl>     <dbl>
#> 1 English      81.2        95
#> 2 History      78          88
#> 3 Math         80          90
#> 4 Science      78.2        88
# pivot_wider(): Long to Wide — reverse the operation
exam_results_long %>% pivot_wider(names_from = subject, values_from = score)
#> # A tibble: 4 × 5
#>   student  Math Science English History
#>   <chr>   <dbl>   <dbl>   <dbl>   <dbl>
#> 1 Alice      85      88      78      82
#> 2 Bob        70      75      80      65
#> 3 Carol      90      82      95      88
#> 4 David      75      68      72      77
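
The round trip works cleanly here because every student has a score in every subject. When some combinations are missing, pivot_wider() fills the gaps with NA by default; its values_fill argument substitutes a different placeholder. A small sketch, dropping one row purely for illustration:

# Remove one student-subject combination, then widen with a custom fill
exam_results_long %>%
  filter(!(student == "Bob" & subject == "Math")) %>%
  pivot_wider(names_from = subject, values_from = score,
              values_fill = 0)     # Bob's missing Math score becomes 0 instead of NA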

7.4 String Manipulation with stringr

stringr provides a consistent set of functions for working with text data (strings). All function names start with str_, making them easy to discover. String manipulation is critical for cleaning messy text: fixing inconsistent capitalization, trimming extra whitespace, extracting patterns, replacing values, and splitting or joining text.

texts <- c("  Hello World  ", "data ANALYSIS", "r programming 2024", "Kigali, Rwanda")

str_trim(texts)                           # remove leading/trailing whitespace
#> [1] "Hello World"        "data ANALYSIS"      "r programming 2024"
#> [4] "Kigali, Rwanda"
str_to_title(texts)                       # Title Case (first letter of each word)
#> [1] "  Hello World  "    "Data Analysis"      "R Programming 2024"
#> [4] "Kigali, Rwanda"
str_to_upper(texts[2])                    # UPPER CASE
#> [1] "DATA ANALYSIS"
str_to_lower(texts[2])                    # lower case
#> [1] "data analysis"
str_replace(texts[3], "r", "R")          # replace first occurrence
#> [1] "R programming 2024"
str_replace_all(texts[1], "l", "L")      # replace ALL occurrences
#> [1] "  HeLLo WorLd  "
str_detect(texts, "data|r program")      # TRUE/FALSE: does pattern exist?
#> [1] FALSE  TRUE  TRUE FALSE
str_length(texts)                         # number of characters per string
#> [1] 15 13 18 14
str_sub(texts[4], 1, 6)                 # extract characters 1-6: "Kigali"
#> [1] "Kigali"
str_split(texts[4], ", ")               # split by comma-space
#> [[1]]
#> [1] "Kigali" "Rwanda"
str_c("R", "is", "great", sep = " ")    # concatenate strings
#> [1] "R is great"
str_count(texts, "a")                    # count occurrences of "a" per string
#> [1] 0 2 1 3
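
The introduction above mentions extracting patterns; the function for that is str_extract(), which returns the first regex match per string (NA if none), while str_extract_all() returns every match. A short sketch on the same texts vector:

str_extract(texts, "\\d+")               # first run of digits: NA NA "2024" NA
str_extract_all(texts[3], "[a-z]+")      # every lowercase word: "r" "programming"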

8 Module 8: Exploratory Data Analysis (EDA)

Exploratory Data Analysis (EDA) is the process of examining and summarizing a dataset to understand its structure, spot patterns, identify anomalies, and form hypotheses — before applying any formal statistical models. It was championed by statistician John Tukey in his 1977 book of the same name. EDA is fundamentally about asking questions of your data with an open mind.

EDA is critical because: (1) it reveals data quality issues (missing values, wrong types, outliers) before they corrupt your model; (2) it suggests which variables are related and worth modeling; (3) it builds your intuition about the data; (4) it often reveals the answer to your question directly, without needing complex models. Never skip EDA — it is the foundation of trustworthy analysis.

data(iris)

# Step 1: Understand the Structure
str(iris)       # types, dimensions, preview of values — always start here
#> 'data.frame':    150 obs. of  5 variables:
#>  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
#>  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
#>  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
#>  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
#>  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
# Step 2: Summary Statistics — different output for numeric vs factor
summary(iris)
#>   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
#>  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
#>  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
#>  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
#>  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
#>  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
#>  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
#>        Species  
#>  setosa    :50  
#>  versicolor:50  
#>  virginica :50  
#>                 
#>                 
#> 
# Deep dive into a specific column
cat("\n=== Sepal Length Statistics ===\n")
#> 
#> === Sepal Length Statistics ===
cat("Mean:    ", mean(iris$Sepal.Length), "\n")
#> Mean:     5.843333
cat("Median:  ", median(iris$Sepal.Length), "\n")
#> Median:   5.8
cat("Std Dev: ", round(sd(iris$Sepal.Length), 3), "\n")
#> Std Dev:  0.828
cat("IQR:     ", IQR(iris$Sepal.Length), "\n")
#> IQR:      1.3
quantile(iris$Sepal.Length)
#>   0%  25%  50%  75% 100% 
#>  4.3  5.1  5.8  6.4  7.9
# Step 3: Frequency Table for Categorical Variables
table(iris$Species)
#> 
#>     setosa versicolor  virginica 
#>         50         50         50
prop.table(table(iris$Species)) * 100      # percentages
#> 
#>     setosa versicolor  virginica 
#>   33.33333   33.33333   33.33333
# Step 4: Correlation Analysis
# Measures LINEAR relationship: -1 = perfect negative, 0 = none, +1 = perfect positive
cor_matrix <- cor(iris[, 1:4])
round(cor_matrix, 2)
#>              Sepal.Length Sepal.Width Petal.Length Petal.Width
#> Sepal.Length         1.00       -0.12         0.87        0.82
#> Sepal.Width         -0.12        1.00        -0.43       -0.37
#> Petal.Length         0.87       -0.43         1.00        0.96
#> Petal.Width          0.82       -0.37         0.96        1.00
cat("\nPetal.Length vs Petal.Width correlation:",
    round(cor(iris$Petal.Length, iris$Petal.Width), 3),
    "\n-> Strong positive: larger petals tend to be both long AND wide\n")
#> 
#> Petal.Length vs Petal.Width correlation: 0.963 
#> -> Strong positive: larger petals tend to be both long AND wide
# Step 5: Group-wise Comparison
iris %>%
  group_by(Species) %>%
  summarise(
    mean_sepal_L = round(mean(Sepal.Length), 2),
    mean_petal_L = round(mean(Petal.Length), 2),
    mean_petal_W = round(mean(Petal.Width), 2)
  )
#> # A tibble: 3 × 4
#>   Species    mean_sepal_L mean_petal_L mean_petal_W
#>   <fct>             <dbl>        <dbl>        <dbl>
#> 1 setosa             5.01         1.46         0.25
#> 2 versicolor         5.94         4.26         1.33
#> 3 virginica          6.59         5.55         2.03
# Step 6: Outlier Detection Using the IQR (Interquartile Range) Method
# Values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR are flagged as outliers
detect_outliers <- function(x, column_name) {
  Q1  <- quantile(x, 0.25)
  Q3  <- quantile(x, 0.75)
  IQR_val <- Q3 - Q1
  lower <- Q1 - 1.5 * IQR_val
  upper <- Q3 + 1.5 * IQR_val
  outliers <- x[x < lower | x > upper]
  cat(column_name, "| Fences: [", round(lower,2), ",", round(upper,2), "]",
      "| Outliers found:", length(outliers), "\n")
  return(outliers)
}

detect_outliers(iris$Sepal.Length, "Sepal.Length")
#> Sepal.Length | Fences: [ 3.15 , 8.35 ] | Outliers found: 0
#> numeric(0)
detect_outliers(iris$Sepal.Width,  "Sepal.Width")
#> Sepal.Width | Fences: [ 2.05 , 4.05 ] | Outliers found: 4
#> [1] 4.4 4.1 4.2 2.0
detect_outliers(iris$Petal.Length, "Petal.Length")
#> Petal.Length | Fences: [ -3.65 , 10.35 ] | Outliers found: 0
#> numeric(0)

9 Module 9: Data Visualization

Data visualization is the graphical representation of data to communicate patterns, trends, relationships, and distributions visually. As Card, Mackinlay, and Shneiderman put it, “the purpose of visualization is insight, not pictures.” R is world-famous for its visualization capabilities, especially through the ggplot2 package, which implements Leland Wilkinson’s Grammar of Graphics — a systematic way to describe any chart as a combination of data, aesthetics, and geometric objects.

A well-designed chart lets readers grasp in seconds a pattern that would take pages of numbers to describe. Visualization serves two roles: exploratory (for your own understanding during EDA) and explanatory (communicating findings to others). Both are essential skills for data analysts.

9.1 Base R Plots — Quick Visual Checks

# Histogram — shows the distribution (spread and shape) of a single numeric variable
# Use when: checking normality, understanding value spread, spotting skewness
hist(iris$Sepal.Length,
     main   = "Distribution of Sepal Length",
     xlab   = "Sepal Length (cm)", ylab = "Frequency",
     col    = "steelblue", border = "white", breaks = 15)

# Boxplot — shows median, quartiles, and outliers; great for group comparison
# Use when: comparing distributions across groups, spotting outliers
boxplot(Sepal.Length ~ Species, data = iris,
        main = "Sepal Length by Species",
        xlab = "Species", ylab = "Sepal Length (cm)",
        col  = c("#3498db", "#2ecc71", "#e74c3c"))

9.2 ggplot2 — Professional, Grammar-Based Visualization

Every ggplot2 chart is built from three essential components:

  1. Data: the data frame you are plotting
  2. Aesthetics aes(): mappings from data variables to visual properties (x-axis, y-axis, color, size, shape)
  3. Geoms: the geometric objects that represent data points (geom_point() for dots, geom_bar() for bars, geom_line() for lines, geom_boxplot() for boxplots, etc.)

Additional layers — scales, themes, labels, facets — are added with +.

library(ggplot2)
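
Before the full-featured examples below, here is the bare skeleton with only the three required components; everything else in this module is layered onto this same pattern:

ggplot(data = iris,                                  # 1. Data
       aes(x = Sepal.Length, y = Petal.Length)) +    # 2. Aesthetics
  geom_point()                                       # 3. Geom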

# Scatter Plot — shows relationship between two numeric variables
# Use when: exploring correlation, showing how two measures relate
ggplot(iris, aes(x = Sepal.Length, y = Petal.Length, color = Species, shape = Species)) +
  geom_point(size = 2.5, alpha = 0.8) +
  geom_smooth(method = "lm", se = FALSE, linetype = "dashed", linewidth = 0.8) +
  labs(title    = "Sepal Length vs Petal Length by Species",
       subtitle = "Linear trend lines shown per species",
       x        = "Sepal Length (cm)", y = "Petal Length (cm)") +
  theme_minimal(base_size = 12) +
  scale_color_brewer(palette = "Set1")

# Overlapping Histograms with Density Curves — compare distributions across groups
# Use when: understanding how the same variable differs between groups
ggplot(iris, aes(x = Sepal.Width, fill = Species)) +
  geom_histogram(aes(y = after_stat(density)), bins = 20, alpha = 0.5, position = "identity") +
  geom_density(aes(color = Species), linewidth = 1, fill = NA) +
  labs(title    = "Sepal Width Distribution by Species",
       subtitle = "Histogram with smoothed density curves",
       x        = "Sepal Width (cm)", y = "Density") +
  theme_minimal() +
  scale_fill_brewer(palette = "Set2") + scale_color_brewer(palette = "Set2")

# Boxplot with Jitter — shows distribution AND individual data points
# Use when: comparing distributions across groups, especially with small samples
ggplot(iris, aes(x = Species, y = Petal.Width, fill = Species)) +
  geom_boxplot(alpha = 0.6, outlier.shape = NA) +
  geom_jitter(width = 0.15, alpha = 0.5, size = 1.5) +
  labs(title    = "Petal Width by Species",
       subtitle = "Boxplot with individual data points overlaid",
       x        = "Species", y = "Petal Width (cm)") +
  theme_classic() + theme(legend.position = "none") +
  scale_fill_brewer(palette = "Pastel1")

# Bar Chart with Error Bars — compare averages across categories
# Use when: showing a summary statistic (mean, total) across discrete groups
mtcars$cyl <- as.factor(mtcars$cyl)   # treat cylinder count as a category, not a number

mtcars %>%
  group_by(cyl) %>%
  summarise(avg_mpg = mean(mpg), se = sd(mpg) / sqrt(n())) %>%
  ggplot(aes(x = cyl, y = avg_mpg, fill = cyl)) +
  geom_col(alpha = 0.85, width = 0.6) +
  geom_errorbar(aes(ymin = avg_mpg - se, ymax = avg_mpg + se), width = 0.2) +
  geom_text(aes(label = round(avg_mpg, 1)), vjust = -0.8, fontface = "bold") +
  labs(title    = "Average Fuel Efficiency by Number of Cylinders",
       subtitle = "Error bars represent +/- 1 standard error",
       x        = "Number of Cylinders", y = "Average MPG") +
  theme_minimal() + theme(legend.position = "none") +
  scale_fill_brewer(palette = "Blues")

# Faceted Plot — small multiples: same chart for each subgroup side-by-side
# Use when: you want to compare the same relationship across multiple categories
ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_point(aes(color = Species), alpha = 0.7) +
  geom_smooth(method = "lm", se = TRUE, color = "gray40") +
  facet_wrap(~ Species, nrow = 1) +
  labs(title    = "Sepal Dimensions by Species (Faceted)",
       subtitle = "Each panel shows one species independently",
       x        = "Sepal Length (cm)", y = "Sepal Width (cm)") +
  theme_bw() + theme(legend.position = "none")
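
None of these charts are saved automatically. To export the most recent plot, ggplot2 provides ggsave(), which infers the file format from the extension. A sketch; the file name and dimensions are placeholders:

# ggsave() writes the last plot displayed (or a plot object passed explicitly) to disk
ggsave("sepal_facets.png", width = 8, height = 4, dpi = 300)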


10 Module 10: Statistical Analysis

Statistical analysis is the process of collecting, examining, summarizing, and interpreting data to discover patterns and draw conclusions. It divides into descriptive statistics (summarizing what the data shows) and inferential statistics (drawing conclusions about a population from a sample, quantifying uncertainty through p-values, confidence intervals, and hypothesis tests).

10.1 Hypothesis Testing

A hypothesis test is a formal procedure for deciding whether data provides enough evidence to reject a null hypothesis (H0). The null hypothesis is always the “nothing interesting is happening” claim. The alternative hypothesis (H1) is what you suspect is actually true. The p-value is the probability of observing results at least as extreme as yours if the null hypothesis were actually true. A p-value below 0.05 is conventionally taken as evidence to reject H0.

# One-Sample t-test — Is the mean of a sample different from a known value?
# Question: Is the average exam score significantly different from 75?
# H0: population mean = 75; H1: population mean is not 75
exam_scores <- c(78, 85, 72, 91, 68, 88, 74, 82, 79, 86, 77, 83)
t_result <- t.test(exam_scores, mu = 75)
t_result
#> 
#>  One Sample t-test
#> 
#> data:  exam_scores
#> t = 2.6547, df = 11, p-value = 0.0224
#> alternative hypothesis: true mean is not equal to 75
#> 95 percent confidence interval:
#>  75.89729 84.60271
#> sample estimates:
#> mean of x 
#>     80.25
cat("\nSample mean:", round(mean(exam_scores), 2), "\n")
#> 
#> Sample mean: 80.25
cat("p-value:", round(t_result$p.value, 4), "\n")
#> p-value: 0.0224
if (t_result$p.value < 0.05) {
  cat("-> p < 0.05: Reject H0. Mean is significantly different from 75.\n")
} else {
  cat("-> p >= 0.05: Fail to reject H0. No significant difference from 75.\n")
}
#> -> p < 0.05: Reject H0. Mean is significantly different from 75.
# Two-Sample t-test — Are two groups significantly different?
# Question: Do students who attended tutoring score higher?
tutored     <- c(82, 88, 79, 91, 85, 90, 87, 83)
not_tutored <- c(72, 68, 75, 78, 70, 73, 65, 71)

t_two <- t.test(tutored, not_tutored, alternative = "greater")
t_two
#> 
#>  Welch Two Sample t-test
#> 
#> data:  tutored and not_tutored
#> t = 6.9118, df = 13.991, p-value = 3.606e-06
#> alternative hypothesis: true difference in means is greater than 0
#> 95 percent confidence interval:
#>  10.52541      Inf
#> sample estimates:
#> mean of x mean of y 
#>    85.625    71.500
cat("Group means — Tutored:", round(mean(tutored),1),
    "| Not tutored:", round(mean(not_tutored),1), "\n")
#> Group means — Tutored: 85.6 | Not tutored: 71.5
# Chi-Square Test — Is there a relationship between two categorical variables?
# Question: Is there a relationship between gender and promotion?
promo_table <- matrix(c(15, 10, 25, 30), nrow = 2,
  dimnames = list(c("Promoted", "Not Promoted"), c("Male", "Female")))
promo_table
#>              Male Female
#> Promoted       15     25
#> Not Promoted   10     30
chi_result <- chisq.test(promo_table)
chi_result
#> 
#>  Pearson's Chi-squared test with Yates' continuity correction
#> 
#> data:  promo_table
#> X-squared = 0.93091, df = 1, p-value = 0.3346
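
The chi-square output above stops at the test statistic; applying the same decision rule as the t-test example makes the conclusion explicit:

if (chi_result$p.value < 0.05) {
  cat("-> p < 0.05: Reject H0. Evidence of an association.\n")
} else {
  cat("-> p >= 0.05: Fail to reject H0. No significant association detected.\n")
}
#> -> p >= 0.05: Fail to reject H0. No significant association detected.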

10.2 ANOVA — Comparing More Than Two Groups

ANOVA (Analysis of Variance) tests whether the means of three or more groups are significantly different. It compares the variation between groups to the variation within groups; the F-statistic is the ratio of the two. If between-group variation is much larger than within-group variation, the groups are likely truly different. After ANOVA rejects H0, a post-hoc test (like Tukey’s HSD) identifies which specific pairs of groups differ.

# Question: Are petal lengths significantly different across iris species?
# H0: all three species have the same mean petal length
anova_model <- aov(Petal.Length ~ Species, data = iris)
summary(anova_model)
#>              Df Sum Sq Mean Sq F value Pr(>F)    
#> Species       2  437.1  218.55    1180 <2e-16 ***
#> Residuals   147   27.2    0.19                   
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
tapply(iris$Petal.Length, iris$Species, mean)   # group means
#>     setosa versicolor  virginica 
#>      1.462      4.260      5.552
# Post-hoc test: WHICH pairs differ?
TukeyHSD(anova_model)   # all three pairs differ significantly
#>   Tukey multiple comparisons of means
#>     95% family-wise confidence level
#> 
#> Fit: aov(formula = Petal.Length ~ Species, data = iris)
#> 
#> $Species
#>                       diff     lwr     upr p adj
#> versicolor-setosa    2.798 2.59422 3.00178     0
#> virginica-setosa     4.090 3.88622 4.29378     0
#> virginica-versicolor 1.292 1.08822 1.49578     0

10.3 Linear Regression — Modeling Relationships

Linear regression models the relationship between a response variable (Y) and one or more predictor variables (X). It fits a linear equation to the data: Y = B0 + B1X1 + B2X2 + ... + error. Each coefficient B tells you how much Y changes for a one-unit increase in the corresponding X, holding the other predictors constant. R-squared measures how well the model explains the variation in Y (0 = explains nothing, 1 = perfect explanation).

Linear regression is the foundation of predictive modeling. Even when more complex models are used, regression is typically the baseline to beat. Its output is interpretable — the coefficients directly tell you the quantitative relationship between predictors and outcome — which is essential for business decision-making and scientific communication.

# Simple Linear Regression: predict MPG from horsepower alone
model_simple <- lm(mpg ~ hp, data = mtcars)
summary(model_simple)
#> 
#> Call:
#> lm(formula = mpg ~ hp, data = mtcars)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -5.7121 -2.1122 -0.8854  1.5819  8.2360 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept) 30.09886    1.63392  18.421  < 2e-16 ***
#> hp          -0.06823    0.01012  -6.742 1.79e-07 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 3.863 on 30 degrees of freedom
#> Multiple R-squared:  0.6024, Adjusted R-squared:  0.5892 
#> F-statistic: 45.46 on 1 and 30 DF,  p-value: 1.788e-07
coefs <- coef(model_simple)
cat("\n=== Interpretation ===\n")
#> 
#> === Interpretation ===
cat("Intercept:", round(coefs[1], 3), "-> Expected MPG when hp = 0\n")
#> Intercept: 30.099 -> Expected MPG when hp = 0
cat("hp coefficient:", round(coefs[2], 4),
    "-> Each extra HP reduces MPG by", abs(round(coefs[2], 4)), "\n")
#> hp coefficient: -0.0682 -> Each extra HP reduces MPG by 0.0682
cat("R-squared:", round(summary(model_simple)$r.squared, 3),
    "-> HP explains", round(summary(model_simple)$r.squared*100, 1), "% of MPG variance\n")
#> R-squared: 0.602 -> HP explains 60.2 % of MPG variance
# Visualization of the regression line
ggplot(mtcars, aes(x = hp, y = mpg)) +
  geom_point(color = "#2980b9", size = 2.5, alpha = 0.8) +
  geom_smooth(method = "lm", color = "#e74c3c", se = TRUE, linewidth = 1.2) +
  labs(title    = "Simple Linear Regression: MPG ~ Horsepower",
       subtitle = paste0("R2 = ", round(summary(model_simple)$r.squared, 3),
                         "  |  Each +1 HP reduces MPG by ",
                         abs(round(coefs[2], 3))),
       x = "Horsepower (hp)", y = "Miles Per Gallon (mpg)") +
  theme_minimal()

# Multiple Linear Regression: add more predictors
model_multi <- lm(mpg ~ hp + wt + factor(cyl), data = mtcars)
summary(model_multi)
#> 
#> Call:
#> lm(formula = mpg ~ hp + wt + factor(cyl), data = mtcars)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -4.2612 -1.0320 -0.3210  0.9281  5.3947 
#> 
#> Coefficients:
#>              Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)  35.84600    2.04102  17.563 2.67e-16 ***
#> hp           -0.02312    0.01195  -1.934 0.063613 .  
#> wt           -3.18140    0.71960  -4.421 0.000144 ***
#> factor(cyl)6 -3.35902    1.40167  -2.396 0.023747 *  
#> factor(cyl)8 -3.18588    2.17048  -1.468 0.153705    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 2.44 on 27 degrees of freedom
#> Multiple R-squared:  0.8572, Adjusted R-squared:  0.8361 
#> F-statistic: 40.53 on 4 and 27 DF,  p-value: 4.869e-11
cat("Multiple R-squared:", round(summary(model_multi)$r.squared, 3),
    "-> hp + wt + cyl explains",
    round(summary(model_multi)$r.squared * 100, 1), "% of MPG variance\n")
#> Multiple R-squared: 0.857 -> hp + wt + cyl explains 85.7 % of MPG variance
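
Since regression is framed above as the foundation of predictive modeling, it is worth closing the loop with a prediction. predict() applies a fitted lm to new data; the car below is hypothetical, invented purely for illustration:

# A hypothetical car: 150 hp, 3,200 lbs, 6 cylinders
new_car <- data.frame(hp = 150, wt = 3.2, cyl = 6)
predict(model_multi, newdata = new_car, interval = "prediction")
# Returns the point estimate (fit) plus a 95% prediction interval (lwr, upr)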

11 Module 11: Advanced Topics

11.1 Writing Efficient R Code

Code efficiency in R means writing code that executes quickly and uses memory well. The single most important performance idea in R is vectorization: R is designed to operate on entire vectors at once using optimized compiled code (C/Fortran under the hood). Explicit for loops in R are slow because they are interpreted one iteration at a time; the same operation expressed as a vectorized call can be 10-1000x faster.

As datasets grow (thousands to millions of rows), poorly written code becomes impractically slow. A loop over a million rows that takes 30 seconds in R might take 0.3 seconds vectorized. For production data pipelines and big data workflows, efficiency is not optional — it is the difference between a pipeline that runs in minutes versus hours.

# Benchmarking: measure how long code takes with system.time()
n <- 500000
x <- runif(n)     # 500,000 random numbers

# Method 1: for loop — slow (interpreted one iteration at a time)
time_loop <- system.time({
  result_loop <- numeric(n)   # pre-allocate! Growing vectors is very slow
  for (i in seq_along(x)) result_loop[i] <- sqrt(x[i])
})

# Method 2: Vectorized — fast (compiled C code on the whole vector)
time_vec <- system.time({
  result_vec <- sqrt(x)
})

cat("Loop time:       ", time_loop["elapsed"], "seconds\n")
#> Loop time:        0.08 seconds
cat("Vectorized time: ", time_vec["elapsed"],  "seconds\n")
#> Vectorized time:  0.02 seconds
cat("Speedup factor:  ", round(time_loop["elapsed"] / max(time_vec["elapsed"], 0.001)), "x\n")
#> Speedup factor:   4 x
# data.table — high-performance data manipulation for large datasets
# Syntax: dt[filter_rows, compute_columns, by = group_columns]
library(data.table)
dt <- as.data.table(mtcars)

dt[cyl == 6, .(avg_mpg = mean(mpg), avg_hp = mean(hp), n = .N)]
#>     avg_mpg   avg_hp     n
#>       <num>    <num> <int>
#> 1: 19.74286 122.2857     7
dt[, .(avg_mpg = mean(mpg)), by = cyl]    # average MPG per cylinder count
#>       cyl  avg_mpg
#>    <fctr>    <num>
#> 1:      6 19.74286
#> 2:      4 26.66364
#> 3:      8 15.10000
dt[hp > 150, .N, by = cyl]               # count high-HP cars per cylinder group
#>       cyl     N
#>    <fctr> <int>
#> 1:      8    12
#> 2:      6     1
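
One further data.table idiom worth knowing is :=, which adds or modifies columns by reference, without copying the table; avoiding copies is a large part of data.table's speed advantage. A brief sketch:

dt[, kpl := round(mpg * 0.425, 2)]   # new column created in place; no copy returned
dt[1:3, .(mpg, kpl)]                 # inspect the first three rows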

11.2 Introduction to Machine Learning with caret

Machine learning (ML) is the practice of training algorithms to learn patterns from data and make predictions or decisions without being explicitly programmed for each case. The caret package provides a unified interface to 200+ ML algorithms, handling train/test splitting, cross-validation, preprocessing, and model evaluation. The key concept: ML models learn from training data and are evaluated on test data they have never seen — simulating real-world deployment.

Traditional statistics focuses on understanding relationships (what factors affect Y, and by how much?). Machine learning focuses on prediction accuracy (given X, what will Y be?). Both are valuable — the choice depends on whether your goal is explanation or prediction. Understanding ML is increasingly essential for data analysts in business and research.

library(caret)

# Step 1: Split Data into Training (80%) and Test (20%) Sets
set.seed(42)    # set.seed() ensures reproducibility of random operations
train_index <- createDataPartition(iris$Species, p = 0.8, list = FALSE)
train_data  <- iris[train_index, ]
test_data   <- iris[-train_index, ]
cat("Training samples:", nrow(train_data), "| Test samples:", nrow(test_data), "\n")
#> Training samples: 120 | Test samples: 30
# Step 2: Define Training Control — 5-fold Cross-Validation
# CV splits training data into 5 folds: trains on 4, validates on 1, repeats 5 times
train_control <- trainControl(method = "cv", number = 5)

# Step 3: Train a k-Nearest Neighbors (kNN) Classifier
# kNN: classify based on the k closest examples in training data
model_knn <- train(
  Species ~ .,                          # predict Species using all other columns
  data       = train_data,
  method     = "knn",
  trControl  = train_control,
  preProcess = c("center", "scale")     # standardize features (critical for kNN)
)
model_knn
#> k-Nearest Neighbors 
#> 
#> 120 samples
#>   4 predictor
#>   3 classes: 'setosa', 'versicolor', 'virginica' 
#> 
#> Pre-processing: centered (4), scaled (4) 
#> Resampling: Cross-Validated (5 fold) 
#> Summary of sample sizes: 96, 96, 96, 96, 96 
#> Resampling results across tuning parameters:
#> 
#>   k  Accuracy   Kappa 
#>   5  0.9500000  0.9250
#>   7  0.9583333  0.9375
#>   9  0.9500000  0.9250
#> 
#> Accuracy was used to select the optimal model using the largest value.
#> The final value used for the model was k = 7.
# Step 4: Evaluate on Test Data — data the model has NEVER seen
predictions <- predict(model_knn, newdata = test_data)
conf_matrix <- confusionMatrix(predictions, test_data$Species)
conf_matrix
#> Confusion Matrix and Statistics
#> 
#>             Reference
#> Prediction   setosa versicolor virginica
#>   setosa         10          0         0
#>   versicolor      0          9         3
#>   virginica       0          1         7
#> 
#> Overall Statistics
#>                                           
#>                Accuracy : 0.8667          
#>                  95% CI : (0.6928, 0.9624)
#>     No Information Rate : 0.3333          
#>     P-Value [Acc > NIR] : 2.296e-09       
#>                                           
#>                   Kappa : 0.8             
#>                                           
#>  Mcnemar's Test P-Value : NA              
#> 
#> Statistics by Class:
#> 
#>                      Class: setosa Class: versicolor Class: virginica
#> Sensitivity                 1.0000            0.9000           0.7000
#> Specificity                 1.0000            0.8500           0.9500
#> Pos Pred Value              1.0000            0.7500           0.8750
#> Neg Pred Value              1.0000            0.9444           0.8636
#> Prevalence                  0.3333            0.3333           0.3333
#> Detection Rate              0.3333            0.3000           0.2333
#> Detection Prevalence        0.3333            0.4000           0.2667
#> Balanced Accuracy           1.0000            0.8750           0.8250
cat("\nOverall Accuracy:", round(conf_matrix$overall["Accuracy"] * 100, 1), "%\n")
#> 
#> Overall Accuracy: 86.7 %

12 Module 12: Capstone Project — End-to-End Data Analysis

A capstone project integrates all skills from the course into one complete, real-world workflow: data loading, inspection, cleaning, exploratory analysis, visualization, statistical modeling, and interpretation. This mirrors what data analysts actually do on the job. The dataset is mtcars (Motor Trend, 1974), comparing fuel efficiency, engine specs, and performance of 32 car models.

Research Question: What factors most strongly predict a car’s fuel efficiency (MPG), and how well can we model it?

# STEP 1: DATA LOADING & FIRST LOOK
data(mtcars)
mtcars <- tibble::rownames_to_column(mtcars, "car_model")
cat("Dataset: Motor Trend Car Road Tests (1974)\n")
#> Dataset: Motor Trend Car Road Tests (1974)
cat("Dimensions:", nrow(mtcars), "rows x", ncol(mtcars), "columns\n\n")
#> Dimensions: 32 rows x 12 columns
cat("Key variables:\n")
#> Key variables:
cat("  mpg  = Miles per gallon  [TARGET variable]\n")
#>   mpg  = Miles per gallon  [TARGET variable]
cat("  cyl  = Number of cylinders (4, 6, 8)\n")
#>   cyl  = Number of cylinders (4, 6, 8)
cat("  hp   = Gross horsepower\n")
#>   hp   = Gross horsepower
cat("  wt   = Weight (1000 lbs)\n")
#>   wt   = Weight (1000 lbs)
cat("  am   = Transmission (0=Automatic, 1=Manual)\n")
#>   am   = Transmission (0=Automatic, 1=Manual)
# STEP 2: CLEANING & PREPARATION
cat("Missing values:", sum(is.na(mtcars)), "\n")
#> Missing values: 0
mtcars <- mtcars %>%
  mutate(
    cyl  = factor(cyl, levels = c(4, 6, 8)),
    am   = factor(am,  labels = c("Automatic", "Manual")),
    vs   = factor(vs,  labels = c("V-shape", "Straight")),
    gear = factor(gear), carb = factor(carb)
  )
str(mtcars[, c("mpg", "cyl", "hp", "wt", "am")])
#> 'data.frame':    32 obs. of  5 variables:
#>  $ mpg: num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
#>  $ cyl: Factor w/ 3 levels "4","6","8": 2 2 1 2 3 2 3 1 1 2 ...
#>  $ hp : num  110 110 93 110 175 105 245 62 95 123 ...
#>  $ wt : num  2.62 2.88 2.32 3.21 3.44 ...
#>  $ am : Factor w/ 2 levels "Automatic","Manual": 2 2 2 1 1 1 1 1 1 1 ...
# STEP 3: EXPLORATORY DATA ANALYSIS
cat("=== MPG Summary ===\n")
#> === MPG Summary ===
cat("Mean:", round(mean(mtcars$mpg), 2), " | Median:", median(mtcars$mpg),
    " | SD:", round(sd(mtcars$mpg), 2), "\n\n")
#> Mean: 20.09  | Median: 19.2  | SD: 6.03
cat("=== MPG by Cylinder Count ===\n")
#> === MPG by Cylinder Count ===
mtcars %>% group_by(cyl) %>%
  summarise(n = n(), avg_mpg = round(mean(mpg), 1), sd_mpg = round(sd(mpg), 1),
            min_mpg = min(mpg), max_mpg = max(mpg)) %>% print()
#> # A tibble: 3 × 6
#>   cyl       n avg_mpg sd_mpg min_mpg max_mpg
#>   <fct> <int>   <dbl>  <dbl>   <dbl>   <dbl>
#> 1 4        11    26.7    4.5    21.4    33.9
#> 2 6         7    19.7    1.5    17.8    21.4
#> 3 8        14    15.1    2.6    10.4    19.2
cat("\n=== MPG by Transmission ===\n")
#> 
#> === MPG by Transmission ===
mtcars %>% group_by(am) %>%
  summarise(n = n(), avg_mpg = round(mean(mpg), 1)) %>% print()
#> # A tibble: 2 × 3
#>   am            n avg_mpg
#>   <fct>     <int>   <dbl>
#> 1 Automatic    19    17.1
#> 2 Manual       13    24.4
# STEP 4: VISUALIZATION

# Chart 1: MPG Distribution
ggplot(mtcars, aes(x = mpg, fill = cyl)) +
  geom_histogram(bins = 12, alpha = 0.8, color = "white") +
  labs(title = "Distribution of Fuel Efficiency (MPG)",
       subtitle = "Colored by cylinder count",
       x = "Miles Per Gallon", y = "Count", fill = "Cylinders") +
  theme_minimal() + scale_fill_brewer(palette = "Set1")

# Chart 2: Weight vs MPG — the strongest predictor
ggplot(mtcars, aes(x = wt, y = mpg, color = cyl, shape = am)) +
  geom_point(size = 3, alpha = 0.85) +
  geom_smooth(method = "lm", se = FALSE, color = "gray40", linetype = "dashed") +
  labs(title    = "Fuel Efficiency vs Vehicle Weight",
       subtitle = "Heavier cars consistently have lower MPG",
       x = "Weight (1000 lbs)", y = "Miles Per Gallon",
       color = "Cylinders", shape = "Transmission") +
  theme_minimal() + scale_color_brewer(palette = "Set1")

# Chart 3: Transmission comparison
ggplot(mtcars, aes(x = am, y = mpg, fill = am)) +
  geom_boxplot(alpha = 0.7) + geom_jitter(width = 0.1, alpha = 0.6, size = 2) +
  labs(title    = "MPG: Automatic vs Manual Transmission",
       subtitle = "Manual cars tend to be more fuel-efficient",
       x = "Transmission", y = "Miles Per Gallon") +
  theme_classic() + theme(legend.position = "none") +
  scale_fill_manual(values = c("#3498db", "#e74c3c"))

# STEP 5: STATISTICAL MODELING — build and compare models

m1 <- lm(mpg ~ wt, data = mtcars)
m2 <- lm(mpg ~ wt + hp, data = mtcars)
m3 <- lm(mpg ~ wt + hp + cyl + am, data = mtcars)

cat("=== Model Comparison ===\n")
#> === Model Comparison ===
cat(sprintf("%-35s R2: %.3f  AIC: %.1f\n", "Model 1: mpg ~ wt",
            summary(m1)$r.squared, AIC(m1)))
#> Model 1: mpg ~ wt                   R2: 0.753  AIC: 166.0
cat(sprintf("%-35s R2: %.3f  AIC: %.1f\n", "Model 2: mpg ~ wt + hp",
            summary(m2)$r.squared, AIC(m2)))
#> Model 2: mpg ~ wt + hp              R2: 0.827  AIC: 156.7
cat(sprintf("%-35s R2: %.3f  AIC: %.1f\n", "Model 3: mpg ~ wt + hp + cyl + am",
            summary(m3)$r.squared, AIC(m3)))
#> Model 3: mpg ~ wt + hp + cyl + am   R2: 0.866  AIC: 154.5
cat("\n=== Best Model Summary ===\n")
#> 
#> === Best Model Summary ===
summary(m3)
#> 
#> Call:
#> lm(formula = mpg ~ wt + hp + cyl + am, data = mtcars)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -3.9387 -1.2560 -0.4013  1.1253  5.0513 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept) 33.70832    2.60489  12.940 7.73e-13 ***
#> wt          -2.49683    0.88559  -2.819  0.00908 ** 
#> hp          -0.03211    0.01369  -2.345  0.02693 *  
#> cyl6        -3.03134    1.40728  -2.154  0.04068 *  
#> cyl8        -2.16368    2.28425  -0.947  0.35225    
#> amManual     1.80921    1.39630   1.296  0.20646    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 2.41 on 26 degrees of freedom
#> Multiple R-squared:  0.8659, Adjusted R-squared:  0.8401 
#> F-statistic: 33.57 on 5 and 26 DF,  p-value: 1.506e-10
# STEP 6: CONCLUSIONS
cat("KEY FINDINGS\n")
#> KEY FINDINGS
cat("============\n\n")
#> ============
cat("1. WEIGHT is the strongest predictor:\n")
#> 1. WEIGHT is the strongest predictor:
cat("   Each additional 1000 lbs reduces MPG by ~",
    abs(round(coef(m3)["wt"], 1)), "miles\n\n")
#>    Each additional 1000 lbs reduces MPG by ~ 2.5 miles
cat("2. CYLINDER COUNT matters significantly:\n")
#> 2. CYLINDER COUNT matters significantly:
cat("   4-cyl:", round(mean(mtcars$mpg[mtcars$cyl==4]), 1), "MPG |",
    "6-cyl:", round(mean(mtcars$mpg[mtcars$cyl==6]), 1), "MPG |",
    "8-cyl:", round(mean(mtcars$mpg[mtcars$cyl==8]), 1), "MPG\n\n")
#>    4-cyl: 26.7 MPG | 6-cyl: 19.7 MPG | 8-cyl: 15.1 MPG
cat("3. TRANSMISSION effect:\n")
#> 3. TRANSMISSION effect:
auto_mpg   <- mean(mtcars$mpg[mtcars$am == "Automatic"])
manual_mpg <- mean(mtcars$mpg[mtcars$am == "Manual"])
cat("   Manual cars average", round(manual_mpg, 1), "MPG vs",
    round(auto_mpg, 1), "MPG for automatic\n\n")
#>    Manual cars average 24.4 MPG vs 17.1 MPG for automatic
cat("4. MODEL PERFORMANCE:\n")
#> 4. MODEL PERFORMANCE:
cat("   Combined model explains",
    round(summary(m3)$r.squared * 100, 1), "% of MPG variation\n\n")
#>    Combined model explains 86.6 % of MPG variation
cat("RECOMMENDATION: To maximize fuel efficiency, prioritize lighter\n")
#> RECOMMENDATION: To maximize fuel efficiency, prioritize lighter
cat("vehicles with fewer cylinders. Weight is the dominant factor.\n")
#> vehicles with fewer cylinders. Weight is the dominant factor.

13 Best Practices & Next Steps

13.1 Best Practices Checklist

Category Practice Why It Matters
Organization Use RStudio Projects Keeps all files together; makes file paths reliable and portable
Reproducibility Set set.seed() before random operations Anyone can reproduce your exact results (see the sketch after this checklist)
Code Style Use snake_case for variable names Community standard; consistent and readable
Documentation Comment the why, not just the what Future you and colleagues will understand the reasoning
Data Safety Never overwrite raw data files Always save cleaned data to a new file; preserve the original
Version Control Use Git for all R projects Track changes; revert mistakes; enable collaboration
Performance Prefer vectorized operations over loops 10-100x faster for large datasets
Packages Put all library() calls at the top of scripts Clear dependencies; easy for others to install what is needed
Reporting Use R Markdown for final analyses Code + results + narrative in one reproducible document
Validation Always inspect results for sanity Check that outputs make sense before sharing or acting on them
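
As a tiny demonstration of the reproducibility practice above: the same seed always produces the same "random" numbers, so seeded analyses give identical results on any machine.

set.seed(123); runif(3)
#> [1] 0.2875775 0.7883051 0.4089769
set.seed(123); runif(3)   # same seed, identical draws
#> [1] 0.2875775 0.7883051 0.4089769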

13.2 Useful Resources

Resource URL What It Offers
R for Data Science (free book) r4ds.had.co.nz Complete tidyverse guide by Hadley Wickham
TidyTuesday github.com/rfordatascience/tidytuesday Weekly real-world datasets to practice
R-bloggers r-bloggers.com Community tutorials and news
RStudio Cheatsheets posit.co/resources/cheatsheets Quick reference cards for all major packages
Stack Overflow stackoverflow.com/questions/tagged/r Q&A for specific problems
CRAN Task Views cran.r-project.org/web/views Packages organized by topic area

# Always include session info at the end of an analysis
# This documents exactly which R version and package versions were used,
# so that others can reproduce your results exactly
sessionInfo()
#> R version 4.6.0 (2026-04-24 ucrt)
#> Platform: x86_64-w64-mingw32/x64
#> Running under: Windows 11 x64 (build 26100)
#> 
#> Matrix products: default
#>   LAPACK version 3.12.1
#> 
#> locale:
#> [1] LC_COLLATE=English_Rwanda.utf8  LC_CTYPE=English_Rwanda.utf8   
#> [3] LC_MONETARY=English_Rwanda.utf8 LC_NUMERIC=C                   
#> [5] LC_TIME=English_Rwanda.utf8    
#> 
#> time zone: Africa/Kigali
#> tzcode source: internal
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] caret_7.0-1         lattice_0.22-9      data.table_1.18.2.1
#> [4] ggplot2_4.0.3       stringr_1.6.0       tidyr_1.3.2        
#> [7] dplyr_1.2.1        
#> 
#> loaded via a namespace (and not attached):
#>  [1] gtable_0.3.6         xfun_0.57            bslib_0.10.0        
#>  [4] recipes_1.3.2        vctrs_0.7.3          tools_4.6.0         
#>  [7] generics_0.1.4       stats4_4.6.0         parallel_4.6.0      
#> [10] proxy_0.4-29         tibble_3.3.1         ModelMetrics_1.2.2.2
#> [13] pkgconfig_2.0.3      Matrix_1.7-5         RColorBrewer_1.1-3  
#> [16] S7_0.2.2             lifecycle_1.0.5      compiler_4.6.0      
#> [19] farver_2.1.2         codetools_0.2-20     htmltools_0.5.9     
#> [22] class_7.3-23         sass_0.4.10          yaml_2.3.12         
#> [25] prodlim_2026.03.11   pillar_1.11.1        jquerylib_0.1.4     
#> [28] MASS_7.3-65          cachem_1.1.0         gower_1.0.2         
#> [31] iterators_1.0.14     rpart_4.1.27         foreach_1.5.2       
#> [34] nlme_3.1-169         parallelly_1.47.0    lava_1.9.0          
#> [37] tidyselect_1.2.1     digest_0.6.39        stringi_1.8.7       
#> [40] future_1.70.0        reshape2_1.4.5       purrr_1.2.2         
#> [43] listenv_0.10.1       labeling_0.4.3       splines_4.6.0       
#> [46] fastmap_1.2.0        grid_4.6.0           cli_3.6.6           
#> [49] magrittr_2.0.5       survival_3.8-6       utf8_1.2.6          
#> [52] e1071_1.7-17         future.apply_1.20.2  withr_3.0.2         
#> [55] scales_1.4.0         lubridate_1.9.5      timechange_0.4.0    
#> [58] rmarkdown_2.31       globals_0.19.1       nnet_7.3-20         
#> [61] timeDate_4052.112    evaluate_1.0.5       knitr_1.51          
#> [64] hardhat_1.4.3        mgcv_1.9-4           rlang_1.2.0         
#> [67] Rcpp_1.1.1-1.1       glue_1.8.1           pROC_1.19.0.1       
#> [70] ipred_0.9-15         rstudioapi_0.18.0    jsonlite_2.0.0      
#> [73] R6_2.6.1             plyr_1.8.9

End of Course — You now have the complete foundation to tackle real-world data analysis with R. Keep practicing, stay curious, and always ask what your data is telling you. Happy coding!