How to use this course: Each concept is explained in three layers — what it is (Theory), why it matters (Importance), and when to use it (Context) — followed by practical R code examples. Read the theory first, then study the code, then try modifying the examples yourself.
R is a free, open-source programming language created in 1993 by Ross Ihaka and Robert Gentleman at the University of Auckland. It was designed from the ground up for statistical computing, data manipulation, and data visualization. Unlike general-purpose languages like Python or Java, R’s entire design philosophy revolves around data — every feature, every data structure, every built-in function exists to help you work with data more effectively.
R is one of the most important tools in data analysis: it was designed for statistics from the ground up, it is free and open source, it has a vast package ecosystem for virtually every analytical task, and it offers world-class visualization through ggplot2.
Choose R when your work involves statistical analysis, academic research, data visualization, bioinformatics, economics, or when you need to produce reports that combine code, results, and narrative automatically.
RStudio is an Integrated Development Environment (IDE) for R. Think of R as the engine and RStudio as the car — R does the computation, RStudio provides the dashboard, controls, and comfort that make working with R much easier. RStudio is not R itself; it is a tool that sits on top of R.
RStudio is divided into 4 panes, each serving a specific purpose:
| Pane | Location | Purpose |
|---|---|---|
| Source Editor | Top-left | Write, edit, and save R scripts (.R) or R Markdown (.Rmd) files |
| Console | Bottom-left | Run R commands interactively and see immediate output |
| Environment / History | Top-right | See all objects currently in memory; browse command history |
| Files / Plots / Packages / Help | Bottom-right | Browse files, view plots, manage packages, read documentation |
The Source Editor is where you write reusable, repeatable code. The Console is where you experiment quickly. This separation is important: always write your final analysis in a script (Source Editor) so you can re-run it later. Code typed only into the Console is lost when you close RStudio.
Every journey in R begins with understanding that R is an interpreted language — you write a command, press Enter (or click Run), and R immediately executes it and shows the result. There is no compilation step. This makes R excellent for interactive data exploration.
# The hash symbol # starts a comment. R ignores everything after it.
# Comments are essential — they explain WHY you wrote the code, not just what it does.
# R as a calculator — the most basic use
2 + 3 # addition
#> [1] 5
56 - 3 # subtraction
#> [1] 53
7 * 8 # multiplication
#> [1] 56
22 / 7 # division
#> [1] 3.142857
2 ^ 10 # exponentiation
#> [1] 1024
17 %% 5 # modulo (remainder of division)
#> [1] 2
17 %/% 5 # integer division
#> [1] 3
print("Welcome to R!")
#> [1] "Welcome to R!"
pi # built-in constant
#> [1] 3.141593
exp(1) # Euler's number e
#> [1] 2.718282
A variable is a named storage location in computer
memory. When you create a variable in R, you are telling R: “Reserve a
space in memory, put this value there, and let me refer to it by this
name.” In R, the standard assignment operator is <-
(read as “gets”). You can also use =, but
<- is the R convention and is strongly preferred by the
community because = is also used for function arguments,
which can create confusion.
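For example, this minimal sketch (with throwaway names chosen for illustration) shows the two roles side by side:
# '=' inside a call names an argument; '<-' creates an object you can reuse
mean(x = c(1, 5, 9)) # here x is just the name of mean()'s argument
#> [1] 5
values <- c(1, 5, 9) # here values becomes a real object in memory
mean(values)
#> [1] 5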
Variables are the foundation of all programming. Without variables,
you would have to retype the same value every time you needed it, and
your code could not store or reuse results. Variables also make your
code readable (using average_salary is
clearer than using 47500 everywhere) and
maintainable (change the value in one place and it
updates everywhere).
Use variables whenever a value: (1) is used more than once, (2) might change and you want to update it in one place, (3) is the result of a calculation that you will use later, or (4) has a meaningful name that makes your code self-documenting.
# The <- operator assigns a value to a variable name
student_name <- "Alice Mukamana" # text value
student_age <- 22 # whole number
student_gpa <- 3.85 # decimal number
is_enrolled <- TRUE # yes/no (logical) value
# To see the value of a variable, just type its name
student_name
#> [1] "Alice Mukamana"
student_age
#> [1] 22
student_gpa
#> [1] 3.85
# You can use a variable in a calculation immediately after creating it
monthly_salary <- 450000 # in Rwandan Francs
annual_salary <- monthly_salary * 12
annual_salary
#> [1] 5400000
# Update a variable — the old value is replaced
student_age <- 23 # Alice had a birthday
student_age
#> [1] 23
A data type tells R what kind of value a variable holds and therefore what operations are valid on it. R has 5 main atomic data types. The word “atomic” means these are the simplest, indivisible building blocks — all more complex data structures in R are built from these.
Understanding data types is critical because R behaves differently depending on type. You cannot do arithmetic on text. You cannot sort numbers alphabetically. When data is imported from a spreadsheet, numbers are sometimes read as text (“123” instead of 123), causing silent errors in your calculations. Knowing types helps you diagnose and fix these problems.
Always check data types when importing data (class(),
str()), before doing calculations (ensure columns are
numeric), and when joining datasets (types must match for merging to
work correctly).
#> [1] "numeric"
#> [1] TRUE
# Integer — the L suffix tells R to store as integer (more memory-efficient)
num_students <- 45L
class(num_students)
#> [1] "integer"
is.integer(num_students)
#> [1] TRUE
# Character (String)
country <- "Rwanda"
capital <- 'Kigali' # single or double quotes both work
class(country)
#> [1] "character"
# IMPORTANT: Numbers stored as characters cannot be used in math!
wrong_num <- "42" # this looks like a number but it is text
# wrong_num + 1 # this would cause an ERROR
as.numeric(wrong_num) + 1 # convert first, then add
#> [1] 43
# Logical — must be ALL CAPS: TRUE or FALSE
passed_exam <- TRUE
has_degree <- FALSE
class(passed_exam)
#> [1] "logical"
# Logical values are stored as 1 (TRUE) and 0 (FALSE)
# This is useful for counting: sum(logical_vector) counts the TRUEs
TRUE + TRUE + FALSE # = 2
#> [1] 2
# Type Conversion — R can convert between types (called coercion)
as.numeric("3.14") # character to numeric#> [1] 3.14
#> [1] "100"
#> [1] FALSE
#> [1] TRUE
#> [1] 3
#> [1] "numeric"
#> [1] "logical"
#> [1] "character"
Operators are symbols that perform operations on values or variables. R has four categories of operators: arithmetic (math calculations), relational/comparison (comparing values), logical (combining TRUE/FALSE values), and assignment (storing values). Each produces a specific type of result — arithmetic gives numbers, relational gives TRUE/FALSE, logical gives TRUE/FALSE.
Operators are the verbs of programming — they express actions.
Relational operators are especially important in data analysis because
they are the foundation of filtering: “show me all
sales where revenue > 10000” translates directly into R as
data[data$revenue > 10000, ].
# Arithmetic Operators
x <- 17
y <- 5
x + y # addition
#> [1] 22
x - y # subtraction
#> [1] 12
x * y # multiplication
#> [1] 85
x / y # division
#> [1] 3.4
x %% y # modulo (remainder)
#> [1] 2
x %/% y # integer division
#> [1] 3
x ^ y # exponentiation
#> [1] 1419857
# Real-world arithmetic example
price_per_kg <- 1200 # RWF per kg of beans
quantity_kg <- 50
discount_rate <- 0.10 # 10% discount
subtotal <- price_per_kg * quantity_kg
discount <- subtotal * discount_rate
total <- subtotal - discount
cat("Subtotal: RWF", subtotal, "\n")#> Subtotal: RWF 60000
#> Discount: RWF 6000
#> Total: RWF 54000
# Relational (Comparison) Operators — always return TRUE or FALSE
a <- 10
b <- 20
a == b # equal to (double == for comparison)
#> [1] FALSE
a != b # not equal to
#> [1] TRUE
a < b # less than
#> [1] TRUE
a > b # greater than
#> [1] FALSE
a <= b # less than or equal to
#> [1] TRUE
b >= a # greater than or equal to
#> [1] TRUE
# Logical Operators
# & (AND): BOTH conditions must be TRUE
# | (OR): AT LEAST ONE condition must be TRUE
# ! (NOT): reverses TRUE/FALSE
age <- 25
income <- 500000 # RWF per month
is_eligible <- (age >= 18) & (income >= 300000)
is_eligible
#> [1] TRUE
# OR example: at least one condition TRUE is enough
(age >= 65) | (income >= 300000)
#> [1] TRUE
Real-world data is never a single number — it comes in collections: a column of 1000 measurements, a table with rows and columns, a nested survey response. R has 5 core data structures to handle these cases. Choosing the right data structure is one of the most important decisions in data analysis because it affects how efficiently you can store, access, and manipulate your data.
A vector is R’s most fundamental data structure: an
ordered sequence of elements that are all the same
type. If you mix types, R will silently convert everything to
the most flexible type (this is called type coercion).
Vectors are “atomic” — they cannot hold other vectors or mixed types.
Almost everything in R is built on top of vectors — even a single number
like 42 is actually a vector of length 1.
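A quick sketch of that coercion rule (values chosen purely for illustration):
c(1, 2, "three") # the numbers are silently converted to character
#> [1] "1" "2" "three"
c(TRUE, FALSE, 10) # the logicals are silently converted to numeric
#> [1] 1 0 10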
Vectors are important because R is designed for vectorized
operations — operations that apply automatically to every
element without writing a loop. scores * 2 doubles every
score in a vector instantly. This makes R code both concise and
extremely fast compared to loop-based approaches. Understanding vectors
deeply is the single most important step for thinking in R.
Use a vector when you have a single variable measured across multiple observations — exam scores for 30 students, daily temperatures for a month, sales figures for 12 months. If your data has multiple variables, you need a data frame.
# Creating Vectors with c() — "combine" or "concatenate"
exam_scores <- c(78, 85, 92, 67, 88, 74, 91, 55, 83, 76)
student_names <- c("Alice", "Bob", "Carol", "David", "Eve",
"Frank", "Grace", "Henry", "Irene", "James")
passed <- c(TRUE, TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, FALSE, TRUE, TRUE)
# Useful shortcuts for creating sequences
days_in_month <- 1:31 # consecutive integers
even_numbers <- seq(2, 20, by = 2) # 2, 4, 6, ..., 20
rep(0, times = 5) # repeat a value: 0 0 0 0 0
#> [1] 0 0 0 0 0
rep(1:2, times = 3) # repeat a pattern
#> [1] 1 2 1 2 1 2
# Vector properties and summary functions
length(exam_scores) # how many elements?
#> [1] 10
class(exam_scores)
#> [1] "numeric"
sum(exam_scores)
#> [1] 789
mean(exam_scores)
#> [1] 78.9
max(exam_scores)
#> [1] 92
min(exam_scores)
#> [1] 55
sd(exam_scores) # standard deviation
#> [1] 11.55133
# Accessing Elements — R uses 1-based indexing (first element is [1], NOT [0])
exam_scores[1] # first score
#> [1] 78
exam_scores[2:5] # elements 2 through 5
#> [1] 85 92 67 88
exam_scores[c(1, 3, 7)] # specific positions
#> [1] 78 92 91
exam_scores[-1] # everything EXCEPT the first element
#> [1] 85 92 67 88 74 91 55 83 76
# Named vectors — label each element, then access by name
names(exam_scores) <- student_names
exam_scores["Alice"]
#> Alice
#> 78
exam_scores[c("Bob", "Eve")]
#> Bob Eve
#> 85 88
# Logical Indexing (Filtering) — the KEY to filtering data
exam_scores[exam_scores >= 80] # scores 80 and above
#> Bob Carol Eve Grace Irene
#> 85 92 88 91 83
student_names[exam_scores >= 80] # names of the high scorers
#> [1] "Bob" "Carol" "Eve" "Grace" "Irene"
student_names[!passed] # ! negates: students who did NOT pass
#> [1] "David" "Henry"
# Vectorized Operations — R's Superpower: no loop needed
exam_scores + 5 # add 5 bonus points to everyone
#> Alice Bob Carol David Eve Frank Grace Henry Irene James
#> 83 90 97 72 93 79 96 60 88 81
exam_scores >= 60 # did each student pass?
#> Alice Bob Carol David Eve Frank Grace Henry Irene James
#> TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE
sum(exam_scores >= 60) # TRUEs count as 1: how many passed
#> [1] 9
# ifelse() for element-wise conditional logic
letter_grades <- ifelse(exam_scores >= 90, "A",
ifelse(exam_scores >= 80, "B",
ifelse(exam_scores >= 70, "C",
ifelse(exam_scores >= 60, "D", "F"))))
letter_grades
#> Alice Bob Carol David Eve Frank Grace Henry Irene James
#> "C" "B" "A" "D" "B" "C" "A" "F" "B" "C"
A matrix is a 2-dimensional extension of a vector: it has rows and columns, but like a vector, all elements must be the same type. A matrix is essentially a vector that has been given a shape (dimensions). Matrices are used heavily in linear algebra, statistics, and machine learning algorithms.
Matrices are important in data analysis when working with
numerical computations that require row/column
operations — correlation matrices, distance matrices, linear algebra for
regression coefficients, or image data. Many statistical functions
(cor(), cov(), solve()) work
directly on matrices.
Use matrices when: (1) all your data is numeric and of the same type, (2) you need matrix algebra operations (transpose, multiplication, inversion), (3) you are working with correlation or covariance matrices. For data with mixed types (numbers and text), use a data frame instead.
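As a minimal sketch of that last point, cor() can be applied directly to a numeric matrix (the values here are invented for illustration):
m <- matrix(c(1, 2, 3, 4, 2, 4, 6, 9), ncol = 2) # two numeric columns
round(cor(m), 3) # correlation matrix of the columns
#>       [,1]  [,2]
#> [1,] 1.000 0.994
#> [2,] 0.994 1.000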
# Creating a Matrix — fills column by column by default
score_matrix <- matrix(
c(85, 78, 90, 72, # Math scores
88, 82, 79, 85, # Science scores
74, 91, 83, 69), # English scores
nrow = 4,
ncol = 3
)
rownames(score_matrix) <- c("Alice", "Bob", "Carol", "David")
colnames(score_matrix) <- c("Math", "Science", "English")
score_matrix
#> Math Science English
#> Alice 85 88 74
#> Bob 78 82 91
#> Carol 90 79 83
#> David 72 85 69
dim(score_matrix) # rows and columns
#> [1] 4 3
nrow(score_matrix)
#> [1] 4
ncol(score_matrix)
#> [1] 3
# Accessing Elements: [row, column] — leave blank for entire row or column
score_matrix[1, ] # Alice's scores (row 1, all columns)
#> Math Science English
#> 85 88 74
score_matrix[, "Science"] # everyone's Science score (all rows, one column)
#> Alice Bob Carol David
#> 88 82 79 85
score_matrix["Bob", "English"] # a single cell by name
#> [1] 91
score_matrix[3, 1] # a single cell by position (row 3, column 1)
#> [1] 90
# Matrix Operations with apply()
apply(score_matrix, 1, mean) # MARGIN=1: average per student (across rows)
#> Alice Bob Carol David
#> 82.33333 83.66667 84.00000 75.33333
apply(score_matrix, 2, mean) # MARGIN=2: average per subject (down columns)
#> Math Science English
#> 81.25 83.50 79.25
t(score_matrix) # transpose — swap rows and columns
#> Alice Bob Carol David
#> Math 85 78 90 72
#> Science 88 82 79 85
#> English 74 91 83 69
A list is R’s most flexible data structure. Unlike vectors and matrices, a list can hold elements of different types and different lengths — including other lists. A list is like a filing cabinet: each drawer (element) can contain something completely different — a number, a sentence, a vector, a data frame, even another list.
Lists are essential because real-world data is rarely uniform. A person’s profile has a name (text), age (number), scores (vector), and address (nested structure) — a list can hold all of this together. Many R functions return lists as output (regression models, t-tests, etc.) because they need to return multiple different types of results at once.
Use lists when: (1) you need to store heterogeneous data (mixed
types) together, (2) a function needs to return multiple results of
different types, (3) you are working with the output of statistical
functions like lm(), t.test(), or
summary(), (4) you need nested or hierarchical data
structures.
# Creating a List — notice the mixed types
student_profile <- list(
name = "Alice Mukamana",
age = 22,
gpa = 3.85,
courses = c("Statistics", "R Programming", "Data Mining"),
scores = c(88, 92, 79),
is_enrolled = TRUE,
address = list(city = "Kigali", district = "Gasabo") # nested list!
)
# Accessing List Elements — Three Methods
student_profile$name # $ operator (most common for named lists)
#> [1] "Alice Mukamana"
student_profile$gpa
#> [1] 3.85
student_profile[["courses"]] # [[ ]] — by name or by position
#> [1] "Statistics" "R Programming" "Data Mining"
student_profile$address$city # drill into the nested list
#> [1] "Kigali"
length(student_profile) # number of top-level elements
#> [1] 7
names(student_profile)
#> [1] "name" "age" "gpa" "courses" "scores"
#> [6] "is_enrolled" "address"
# Adding and modifying elements
student_profile$graduation_year <- 2025
student_profile$age <- 23
# Why functions return lists — example with t.test()
test_result <- t.test(c(75, 82, 88, 79, 91), mu = 75)
class(test_result) # "htest" — a list
#> [1] "htest"
names(test_result) # every piece of the test output
#> [1] "statistic" "parameter" "p.value" "conf.int" "estimate"
#> [6] "null.value" "stderr" "alternative" "method" "data.name"
test_result$p.value # pull out just the p-value
#> [1] 0.05169353
test_result$conf.int # and the confidence interval
#> [1] 74.90534 91.09466
#> attr(,"conf.level")
#> [1] 0.95
A data frame is R’s representation of a rectangular dataset — like a spreadsheet or database table. It has rows (observations) and columns (variables). The crucial difference from a matrix is that columns can have different types: one column can be text (names), another numeric (scores), another logical (passed?), another factor (category). Every column is a vector of the same length, and together they form the table.
Data frames are the single most important data structure for data
analysis. When you import a CSV file, it becomes a data frame. When you
analyze survey data, each respondent is a row and each question is a
column. The entire tidyverse ecosystem (dplyr, ggplot2,
tidyr) is built around data frames. Mastering data frames is
mastering R data analysis.
Use data frames when: (1) your data has multiple variables of different types (which is almost always), (2) you are working with imported datasets (CSV, Excel, database), (3) you need to filter rows, select columns, merge datasets, or create summaries — essentially any realistic data analysis task.
# Creating a Data Frame — each column is a named vector of the SAME length
employees <- data.frame(
id = 1:6,
name = c("Alice", "Bob", "Carol", "David", "Eve", "Frank"),
department = c("Finance", "IT", "HR", "Finance", "IT", "HR"),
salary = c(850000, 920000, 780000, 1100000, 880000, 750000),
years_exp = c(3, 5, 2, 8, 4, 1),
promoted = c(FALSE, TRUE, FALSE, TRUE, FALSE, FALSE),
stringsAsFactors = FALSE
)
employees
#> id name department salary years_exp promoted
#> 1 1 Alice Finance 850000 3 FALSE
#> 2 2 Bob IT 920000 5 TRUE
#> 3 3 Carol HR 780000 2 FALSE
#> 4 4 David Finance 1100000 8 TRUE
#> 5 5 Eve IT 880000 4 FALSE
#> 6 6 Frank HR 750000 1 FALSE
nrow(employees) # number of rows
#> [1] 6
ncol(employees) # number of columns
#> [1] 6
dim(employees) # rows and columns together
#> [1] 6 6
# str() is your MOST IMPORTANT tool for understanding a new dataset
# Shows: dimensions, column names, types, and a preview of values
str(employees)
#> 'data.frame': 6 obs. of 6 variables:
#> $ id : int 1 2 3 4 5 6
#> $ name : chr "Alice" "Bob" "Carol" "David" ...
#> $ department: chr "Finance" "IT" "HR" "Finance" ...
#> $ salary : num 850000 920000 780000 1100000 880000 750000
#> $ years_exp : num 3 5 2 8 4 1
#> $ promoted : logi FALSE TRUE FALSE TRUE FALSE FALSE
# summary() — quick overview of every column
summary(employees)
#> id name department salary years_exp
#> Min. :1.00 Length :6 Length :6 Min. : 750000 Min. :1.000
#> 1st Qu.:2.25 N.unique :6 N.unique :3 1st Qu.: 797500 1st Qu.:2.250
#> Median :3.50 N.blank :0 N.blank :0 Median : 865000 Median :3.500
#> Mean :3.50 Min.nchar:3 Min.nchar:2 Mean : 880000 Mean :3.833
#> 3rd Qu.:4.75 Max.nchar:5 Max.nchar:7 3rd Qu.: 910000 3rd Qu.:4.750
#> Max. :6.00 Max. :1100000 Max. :8.000
#> promoted
#> Mode :logical
#> FALSE:4
#> TRUE :2
#>
#>
#>
#> [1] "Alice" "Bob" "Carol" "David" "Eve" "Frank"
#> [1] 850000 920000 780000 1100000 880000 750000
# Accessing Rows with [row, ]
employees[1, ] # first row
#> id name department salary years_exp promoted
#> 1 1 Alice Finance 850000 3 FALSE
employees[3:5, ] # rows 3 through 5
#> id name department salary years_exp promoted
#> 3 3 Carol HR 780000 2 FALSE
#> 4 4 David Finance 1100000 8 TRUE
#> 5 5 Eve IT 880000 4 FALSE
employees$salary[2] # a single cell: Bob's salary
#> [1] 920000
# Filtering Rows — the KEY operation in data analysis
employees[employees$department == "Finance", ]#> id name department salary years_exp promoted
#> 1 1 Alice Finance 850000 3 FALSE
#> 4 4 David Finance 1100000 8 TRUE
#> id name department salary years_exp promoted
#> 2 2 Bob IT 920000 5 TRUE
#> 4 4 David Finance 1100000 8 TRUE
#> id name department salary years_exp promoted
#> 2 2 Bob IT 920000 5 TRUE
#> 5 5 Eve IT 880000 4 FALSE
# Adding a New Column
employees$annual_bonus <- employees$salary * 0.10
employees$total_package <- employees$salary + employees$annual_bonus
head(employees)
#> id name department salary years_exp promoted annual_bonus total_package
#> 1 1 Alice Finance 850000 3 FALSE 85000 935000
#> 2 2 Bob IT 920000 5 TRUE 92000 1012000
#> 3 3 Carol HR 780000 2 FALSE 78000 858000
#> 4 4 David Finance 1100000 8 TRUE 110000 1210000
#> 5 5 Eve IT 880000 4 FALSE 88000 968000
#> 6 6 Frank HR 750000 1 FALSE 75000 825000
A factor is R’s data structure for categorical variables — variables that take on a limited set of predefined values called “levels”. Examples: gender (Male/Female), education level (Primary/Secondary/University), rating (Poor/Fair/Good/Excellent). Internally, R stores factors as integers with a lookup table of labels, making them memory-efficient. Ordered factors have levels with a meaningful order; unordered factors do not.
Factors matter because: (1) they make the valid set of categories explicit, so counts, tables, and plots treat every level correctly; (2) modeling functions such as lm(), glm(), and aov() automatically create indicator (dummy) variables from factors, which is necessary for correct statistical modeling.
Use factors when: (1) a column has a fixed, known set of possible values (categories), (2) order matters (education level, satisfaction rating, income bracket), (3) you are building statistical models that include categorical predictors, (4) you want to control the display order in plots.
# Unordered Factor (Nominal) — for categories with no natural order
department <- factor(c("Finance", "IT", "HR", "Finance", "IT", "HR", "Finance"))
department
#> [1] Finance IT HR Finance IT HR Finance
#> Levels: Finance HR IT
levels(department) # the distinct categories (alphabetical by default)
#> [1] "Finance" "HR" "IT"
nlevels(department)
#> [1] 3
table(department) # frequency of each level
#> department
#> Finance HR IT
#> 3 2 2
# Ordered Factor (Ordinal) — for categories with a meaningful order
satisfaction <- factor(
c("Good", "Excellent", "Fair", "Good", "Poor", "Excellent", "Fair"),
levels = c("Poor", "Fair", "Good", "Excellent"), # define the order explicitly
ordered = TRUE
)
satisfaction
#> [1] Good Excellent Fair Good Poor Excellent Fair
#> Levels: Poor < Fair < Good < Excellent
# Now comparison operators work meaningfully
satisfaction > "Fair" # which responses are better than Fair?#> [1] TRUE TRUE FALSE TRUE FALSE TRUE FALSE
#> [1] TRUE TRUE FALSE TRUE FALSE TRUE FALSE
#> [1] Poor
#> Levels: Poor < Fair < Good < Excellent
#> [1] Excellent
#> Levels: Poor < Fair < Good < Excellent
# Relevel: change the reference category for modeling
department <- relevel(department, ref = "Finance")
levels(department) # Finance is now first (= reference in regression)
#> [1] "Finance" "HR" "IT"
# Drop unused levels after filtering
small_dept <- department[department != "HR"]
droplevels(small_dept) # "HR" level is now removed
#> [1] Finance IT Finance IT Finance
#> Levels: Finance IT
Control flow refers to the order in which R executes statements. By default, R runs code top to bottom, one line at a time. Control flow structures let you change that order based on conditions (if/else) or repeat code automatically multiple times (loops). This is what makes programs intelligent — they can make decisions and handle repetitive tasks.
An if statement evaluates a logical condition (TRUE
or FALSE) and executes a block of code only if the condition is TRUE.
else if checks additional conditions when the first is
FALSE. else is the fallback — it runs only when all
previous conditions were FALSE. Only one branch ever executes.
Conditional logic is the basis of all decision-making in code. In data analysis, you use it to: classify data into categories (pass/fail), handle special cases (what to do if a value is NA), apply different formulas based on conditions (discount tiers based on purchase amount), or validate inputs.
Use if/else when you need your code to behave differently depending
on a single value or condition. For applying a condition to an
entire vector (e.g., categorize all rows in a column),
use ifelse() instead — it is vectorized and applies to
every element simultaneously.
# Basic if / else if / else — income tax bracket example
monthly_income <- 650000 # RWF
if (monthly_income >= 1000000) {
tax_rate <- 0.30
bracket <- "High income"
} else if (monthly_income >= 500000) {
tax_rate <- 0.20
bracket <- "Middle income"
} else if (monthly_income >= 100000) {
tax_rate <- 0.10
bracket <- "Low income"
} else {
tax_rate <- 0.00
bracket <- "Tax exempt"
}
tax_amount <- monthly_income * tax_rate
cat("Income:", monthly_income, "RWF\n")#> Income: 650000 RWF
#> Bracket: Middle income
#> Tax rate: 20 %
#> Tax owed: 130000 RWF
# ifelse() for Vectorized Conditions — applies to every element of a vector
scores <- c(45, 72, 88, 58, 91, 63, 77)
pass_fail <- ifelse(scores >= 60, "Pass", "Fail")
pass_fail
#> [1] "Fail" "Pass" "Pass" "Fail" "Pass" "Pass" "Pass"
# Nested ifelse() for multiple categories
grade <- ifelse(scores >= 90, "A",
ifelse(scores >= 80, "B",
ifelse(scores >= 70, "C",
ifelse(scores >= 60, "D", "F"))))
grade
#> [1] "F" "C" "B" "F" "A" "D" "C"
# switch() — cleaner alternative to many else-if branches for exact value matching
get_currency_symbol <- function(country) {
switch(country,
"Rwanda" = "RWF",
"USA" = "USD",
"UK" = "GBP",
"France" = "EUR",
"Unknown currency"
)
}
get_currency_symbol("Rwanda")#> [1] "RWF"
#> [1] "Unknown currency"
A loop is a control structure that executes a block
of code repeatedly. A for loop repeats a fixed number of
times (once for each element in a sequence). A while loop
repeats as long as a condition remains TRUE — the number of iterations
is not known in advance. A repeat loop runs forever until
explicitly stopped with break.
Loops are essential when you need to perform the same operation many
times — reading multiple files, applying a function to each group,
building results iteratively, or running simulations. However, in R,
loops can be slow on large datasets. Wherever possible, prefer
vectorized operations or the apply family over
explicit loops for better performance.
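A minimal sketch of that trade-off (sizes chosen for illustration): both versions below compute the same squares, but the vectorized form is a single expression and runs much faster on large inputs.
nums <- 1:100000
squares_loop <- numeric(length(nums)) # pre-allocate the result
for (i in seq_along(nums)) {
squares_loop[i] <- nums[i]^2 # one element per iteration
}
squares_vec <- nums^2 # vectorized: all elements at once
identical(squares_loop, squares_vec)
#> [1] TRUE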
Use for when you know how many iterations you need
(looping over a list of files, items, or indices). Use
while when you loop until a condition changes (convergence
in optimization). Use repeat/break rarely — usually for
polling or until-type logic. Prefer vectorized operations for speed when
working with vectors and data frames.
# for loop — iterates over each element in a sequence
months <- c("Jan", "Feb", "Mar", "Apr", "May", "Jun")
for (month in months) {
cat("Processing month:", month, "\n")
}
#> Processing month: Jan
#> Processing month: Feb
#> Processing month: Mar
#> Processing month: Apr
#> Processing month: May
#> Processing month: Jun
# for loop — accumulating a result across iterations
# (when a loop builds a vector of results, pre-allocate it for efficiency)
n <- 10
factorial_n <- 1
for (i in 1:n) {
factorial_n <- factorial_n * i
}
cat("10! =", factorial_n, "\n")#> 10! = 3628800
# for loop — iterating over rows of a data frame
monthly_sales <- data.frame(
region = c("North", "South", "East", "West"),
Q1 = c(450, 380, 520, 410), Q2 = c(480, 400, 490, 430),
Q3 = c(510, 420, 540, 450), Q4 = c(600, 450, 580, 490)
)
monthly_sales$annual_total <- 0
for (i in 1:nrow(monthly_sales)) {
monthly_sales$annual_total[i] <- sum(monthly_sales[i, 2:5])
}
monthly_sales
#> region Q1 Q2 Q3 Q4 annual_total
#> 1 North 450 480 510 600 2040
#> 2 South 380 400 420 450 1650
#> 3 East 520 490 540 580 2130
#> 4 West 410 430 450 490 1780
# while loop — repeats until condition changes
# Example: How many years does it take for an investment to double?
investment <- 100000 # initial investment in RWF
rate <- 0.08 # 8% annual growth
years <- 0
target <- 200000
while (investment < target) {
investment <- investment * (1 + rate)
years <- years + 1
}
cat("Investment doubles after", years, "years\n")#> Investment doubles after 10 years
#> Final value: RWF 215892
# break and next — controlling loop flow
# next: skip this iteration; break: exit the loop entirely
scores <- c(88, NA, 72, NA, 91, 55, NA, 78)
cat("Valid scores above 60: ")#> Valid scores above 60:
for (s in scores) {
if (is.na(s)) next # skip missing values
if (s < 60) next # skip failing scores
cat(s, "")
}
#> 88 72 91 78
A function is a named, reusable block of code that
performs a specific task. Functions take inputs (called
arguments or parameters), process them, and return an
output. R comes with thousands of built-in functions
(mean(), sum(), paste()), and you
can write your own. The DRY principle — Don’t Repeat
Yourself — is the key motivation: if you write the same code
more than twice, it should be a function.
Functions are the foundation of good programming practice: they eliminate repetition, isolate logic so it can be tested, and give complex formulas readable names — calculate_bmi(weight, height) is far clearer than the raw formula repeated everywhere.
# Built-in Math Functions
abs(-42) # absolute value
#> [1] 42
sqrt(144) # square root
#> [1] 12
log(100) # natural logarithm
#> [1] 4.60517
log10(1000) # base-10 logarithm
#> [1] 3
exp(1) # e raised to a power
#> [1] 2.718282
ceiling(3.2) # round up
#> [1] 4
floor(3.7) # round down
#> [1] 3
round(pi, 2) # round to 2 decimal places
#> [1] 3.14
# Built-in Statistical Functions
x <- c(23, 45, 12, 67, 34, 56, 28, 41, 38, 52)
mean(x)
#> [1] 39.6
median(x)
#> [1] 39.5
var(x) # variance
#> [1] 267.8222
sd(x) # standard deviation
#> [1] 16.36527
sum(x)
#> [1] 396
cumsum(x) # running total
#> [1] 23 68 80 147 181 237 265 306 344 396
quantile(x) # quartiles
#> 0% 25% 50% 75% 100%
#> 12.00 29.50 39.50 50.25 67.00
quantile(x, 0.90) # 90th percentile
#> 90%
#> 57.1
# Built-in String Functions
msg <- "Hello, Data Analyst!"
nchar(msg) # number of characters
#> [1] 20
toupper(msg)
#> [1] "HELLO, DATA ANALYST!"
tolower(msg)
#> [1] "hello, data analyst!"
substr(msg, 1, 5) # extract characters 1 through 5
#> [1] "Hello"
gsub("a", "@", msg) # replace every match
#> [1] "Hello, D@t@ An@lyst!"
trimws("  spaces around  ")
#> [1] "spaces around"
paste("R", "is", "great") # join strings with spaces
#> [1] "R is great"
paste0("file_", 1:3, ".csv") # vectorized joining, no separator
#> [1] "file_1.csv" "file_2.csv" "file_3.csv"
paste0("Score: ", 87.5, "%")
#> [1] "Score: 87.5%"
# Sorting and Searching
sort(x)
#> [1] 12 23 28 34 38 41 45 52 56 67
sort(x, decreasing = TRUE)
#> [1] 67 56 52 45 41 38 34 28 23 12
unique(c(1, 2, 2, 3, 3, 3)) # distinct values
#> [1] 1 2 3
which(x > 40) # positions where the condition holds
#> [1] 2 4 6 8 10
which.min(x) # position of the smallest value
#> [1] 3
which.max(x) # position of the largest value
#> [1] 4
# Basic Function Structure:
# function_name <- function(argument1, argument2, ...) {
# body: code that does the work
# return(result)
# }
# Example 1: Unit Conversion
celsius_to_fahrenheit <- function(celsius) {
fahrenheit <- (celsius * 9/5) + 32
return(fahrenheit)
}
celsius_to_fahrenheit(0) # 32 degrees F (freezing)
#> [1] 32
celsius_to_fahrenheit(100) # boiling point
#> [1] 212
celsius_to_fahrenheit(37) # human body temperature
#> [1] 98.6
# Example 2: Default Arguments — used when caller does not provide a value
calculate_loan_payment <- function(principal, rate_annual = 0.12, years = 5) {
rate_monthly <- rate_annual / 12
n_months <- years * 12
payment <- principal * (rate_monthly * (1 + rate_monthly)^n_months) /
((1 + rate_monthly)^n_months - 1)
return(round(payment, 2))
}
calculate_loan_payment(1000000) # uses defaults: 12% rate, 5 years
#> [1] 22244.45
calculate_loan_payment(1000000, rate_annual = 0.09) # lower interest rate
#> [1] 20758.36
calculate_loan_payment(1000000, years = 3) # shorter term, default rate
#> [1] 33214.31
# Example 3: Returning Multiple Values via a List
describe_vector <- function(x, label = "Data") {
x_clean <- x[!is.na(x)] # remove NAs first
list(
label = label,
n = length(x_clean),
n_missing = sum(is.na(x)),
mean = round(mean(x_clean), 2),
median = round(median(x_clean), 2),
sd = round(sd(x_clean), 2),
min = min(x_clean),
max = max(x_clean)
)
}
exam_scores <- c(78, 85, NA, 92, 67, 88, NA, 74, 91)
stats <- describe_vector(exam_scores, label = "Exam Scores")
stats$mean
#> [1] 82.14
stats$n_missing
#> [1] 2
# Example 4: Input Validation — always check your inputs!
safe_divide <- function(numerator, denominator) {
if (!is.numeric(numerator) | !is.numeric(denominator)) {
stop("Both arguments must be numeric.") # stop() throws an error
}
if (denominator == 0) {
warning("Division by zero — returning NA.") # warning() alerts but continues
return(NA)
}
return(numerator / denominator)
}
safe_divide(10, 2)
#> [1] 5
safe_divide(10, 0) # triggers the warning and returns NA
#> [1] NA
The apply family is a set of functions that apply a
function to elements of a data structure — rows of a matrix, elements of
a list, groups of a vector — without writing an explicit for loop. The
main members are apply() (matrices), lapply()
(lists, returns list), sapply() (lists, returns simplified
vector), and tapply() (vectors with groups).
The apply family produces cleaner, more readable
code than equivalent for loops and is often faster. More
importantly, it forces you to write your operation as a function, which
encourages modular, reusable code. In the tidyverse,
purrr::map() is the modern equivalent and is even more
powerful.
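A minimal sketch of that purrr equivalent, assuming the purrr package is installed (the list here mirrors the monthly_revenues example below):
library(purrr)
revenues <- list(Jan = c(450, 380, 520), Feb = c(480, 400, 490))
map(revenues, sum) # like lapply(): returns a list
map_dbl(revenues, sum) # like sapply(): returns a named numeric vector
#> Jan Feb
#> 1350 1370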
# apply() — apply a function over rows (margin=1) or columns (margin=2) of a matrix
scores_matrix <- matrix(c(85, 78, 90, 72, 88, 82, 79, 85, 74, 91, 83, 69),
nrow = 4, byrow = TRUE,
dimnames = list(c("Alice","Bob","Carol","David"),
c("Math","Science","English")))
scores_matrix
#> Math Science English
#> Alice 85 78 90
#> Bob 72 88 82
#> Carol 79 85 74
#> David 91 83 69
apply(scores_matrix, 1, mean) # row means: average per student
#> Alice Bob Carol David
#> 84.33333 80.66667 79.33333 81.00000
apply(scores_matrix, 2, mean) # column means: average per subject
#> Math Science English
#> 81.75 83.50 78.75
apply(scores_matrix, 2, max) # best score per subject
#> Math Science English
#> 91 88 90
# lapply() — apply function to each list element, returns a LIST
monthly_revenues <- list(Jan = c(450, 380, 520), Feb = c(480, 400, 490), Mar = c(510, 420, 540))
lapply(monthly_revenues, sum) # total per month (returns list)
#> $Jan
#> [1] 1350
#>
#> $Feb
#> [1] 1370
#>
#> $Mar
#> [1] 1470
# sapply() — same as lapply but simplifies result to a VECTOR or MATRIX
sapply(monthly_revenues, sum) # returns a named vector — cleaner!
#> Jan Feb Mar
#> 1350 1370 1470
sapply(monthly_revenues, mean) # average revenue per month
#> Jan Feb Mar
#> 450.0000 456.6667 490.0000
# tapply() — apply function to groups — extremely useful for group-wise statistics
employee_salaries <- c(850000, 920000, 780000, 1100000, 880000, 750000)
departments <- c("Finance", "IT", "HR", "Finance", "IT", "HR")
tapply(employee_salaries, departments, mean) # average salary per department
#> Finance HR IT
#> 975000 765000 900000
tapply(employee_salaries, departments, length) # headcount per department
#> Finance HR IT
#> 2 2 2
In real data analysis, data almost never comes from code — it comes from external sources: CSV files from databases, Excel reports from colleagues, API responses, survey exports, database queries. R has extensive tools for importing data from virtually any format. Exporting allows you to share cleaned data, model results, or reports with others.
No matter how good your analysis is, if you cannot get data into R and results out of R, you cannot work in a real environment. Understanding import/export is the bridge between R and the rest of the data ecosystem — spreadsheets, databases, BI tools, and stakeholder reports.
# read.csv() — Base R, always available, works for any CSV file
sales_data <- read.csv(
"sales_2024.csv",
header = TRUE, # first row contains column names
sep = ",", # field separator (use ";" for European CSVs)
stringsAsFactors = FALSE, # keep text as character, not factor (best practice)
na.strings = c("NA", "", "N/A", "-", "null") # what counts as missing?
)
# readr::read_csv() — faster and smarter (tidyverse version)
library(readr)
sales_data <- read_csv("sales_2024.csv") # note: read_csv not read.csv
# Reading Excel files (install once: install.packages("readxl"))
library(readxl)
budget <- read_excel("budget_2024.xlsx", sheet = "Q1 Data")
excel_sheets("budget_2024.xlsx") # see what sheets are available
# Exporting Results
write.csv(sales_data, "sales_clean.csv", row.names = FALSE)
library(writexl)
write_xlsx(list("Sales" = sales_data, "Budget" = budget), "report.xlsx")
# Save and reload R objects
saveRDS(sales_data, "sales_clean.rds") # save single object
loaded <- readRDS("sales_clean.rds") # load it back# Using Built-in Datasets for Learning and Practice
data(mtcars) # Motor Trend car road tests (1974)
head(mtcars)
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
#> Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
#> Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
#> Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
#> Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
#> Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
str(mtcars)
#> 'data.frame': 32 obs. of 11 variables:
#> $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
#> $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
#> $ disp: num 160 160 108 258 360 ...
#> $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
#> $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
#> $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
#> $ qsec: num 16.5 17 18.6 19.4 17 ...
#> $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
#> $ am : num 1 1 1 0 0 0 0 0 0 0 ...
#> $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
#> $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
head(iris) # iris: flower measurements for 3 species
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1 5.1 3.5 1.4 0.2 setosa
#> 2 4.9 3.0 1.4 0.2 setosa
#> 3 4.7 3.2 1.3 0.2 setosa
#> 4 4.6 3.1 1.5 0.2 setosa
#> 5 5.0 3.6 1.4 0.2 setosa
#> 6 5.4 3.9 1.7 0.4 setosa
Data cleaning is the process of detecting and correcting problems in raw data — missing values, wrong types, duplicate records, inconsistent formatting, and outliers. Data manipulation is reshaping, transforming, and restructuring data to make it suitable for analysis. Together, these tasks typically consume 60-80% of a data analyst’s time. Clean, well-structured data is the foundation of valid analysis — garbage in, garbage out.
library(dplyr) # data manipulation
library(tidyr) # data reshaping
library(stringr) # string manipulation
In R, missing values are represented by NA (Not
Available). NA is not zero, not an empty string, not
-999 — it is a genuine placeholder that says “we don’t know what this
value is.” NA is contagious: any arithmetic involving NA returns NA
(5 + NA = NA). This is intentional — if you don’t know one
value, you cannot know the result of a calculation involving it.
Unhandled missing values silently corrupt your analysis. A mean calculated without removing NAs returns NA. A model trained on data with NAs may error or produce biased results. Understanding why data is missing (Missing Completely at Random vs Not at Random) determines the right strategy for handling it.
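A quick sketch of that contagion, and the na.rm escape hatch most summary functions provide:
5 + NA # arithmetic with NA yields NA
#> [1] NA
mean(c(10, 20, NA)) # summary functions are affected too
#> [1] NA
mean(c(10, 20, NA), na.rm = TRUE) # drop NAs before computing
#> [1] 15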
Remove rows (na.omit) only when missingness is rare and random, and you have enough data to afford losing rows. Impute with mean/median when the variable is numeric and missingness is random. Impute with mode or “Unknown” for categorical variables. Never impute before splitting data for modeling — it causes data leakage.
# Creating Data with Missing Values
survey <- data.frame(
respondent = 1:8,
age = c(25, NA, 30, 22, NA, 35, 28, NA),
income = c(500000, 800000, NA, 450000, 600000, NA, 750000, 400000),
education = c("University", "Secondary", "University", NA,
"Primary", "University", NA, "Secondary")
)
survey
#> respondent age income education
#> 1 1 25 500000 University
#> 2 2 NA 800000 Secondary
#> 3 3 30 NA University
#> 4 4 22 450000 <NA>
#> 5 5 NA 600000 Primary
#> 6 6 35 NA University
#> 7 7 28 750000 <NA>
#> 8 8 NA 400000 Secondary
# Detecting Missing Values
is.na(survey$age) # TRUE wherever a value is missing
#> [1] FALSE TRUE FALSE FALSE TRUE FALSE FALSE TRUE
sum(is.na(survey$age)) # count NAs in one column
#> [1] 3
colSums(is.na(survey)) # NAs per column
#> respondent age income education
#> 0 3 2 2
colMeans(is.na(survey)) * 100 # percentage missing per column
#> respondent age income education
#> 0.0 37.5 25.0 25.0
# Strategy 1: Remove rows with NAs (use carefully — loses data)
survey_complete <- na.omit(survey)
cat("Rows after na.omit:", nrow(survey_complete), "(was 8)\n")#> Rows after na.omit: 1 (was 8)
# Strategy 2: Mean/Median Imputation for numeric columns
survey$age_imputed <- ifelse(is.na(survey$age),
round(mean(survey$age, na.rm = TRUE)),
survey$age)
survey$income_imputed <- ifelse(is.na(survey$income),
median(survey$income, na.rm = TRUE),
survey$income)
# Strategy 3: Replace NA with a constant (for categorical)
survey$education[is.na(survey$education)] <- "Unknown"
survey[, c("respondent", "age", "age_imputed", "income", "income_imputed", "education")]#> respondent age age_imputed income income_imputed education
#> 1 1 25 25 500000 500000 University
#> 2 2 NA 28 800000 800000 Secondary
#> 3 3 30 30 NA 550000 University
#> 4 4 22 22 450000 450000 Unknown
#> 5 5 NA 28 600000 600000 Primary
#> 6 6 35 35 NA 550000 University
#> 7 7 28 28 750000 750000 Unknown
#> 8 8 NA 28 400000 400000 Secondary
dplyr is the tidyverse package for data manipulation. It
provides a consistent set of verbs (functions) that
each perform one clear operation on a data frame. These verbs are
designed to be combined using the pipe operator
%>% (read as “then”), which chains operations
left-to-right in the order you think about them, making complex
transformations highly readable.
Without dplyr, complex data manipulation requires nested function
calls that read inside-out — confusing and error-prone. With dplyr and
the pipe, the same operations read in plain English from left to right.
Additionally, dplyr works with databases (via dbplyr) using
the exact same syntax, so your skills transfer directly to SQL-backed
data sources.
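As a minimal sketch of that database claim (assuming the DBI, RSQLite, and dbplyr packages are installed; the in-memory database and table name are purely illustrative):
library(DBI)
library(dplyr)
con <- dbConnect(RSQLite::SQLite(), ":memory:") # throwaway in-memory database
copy_to(con, mtcars, "mtcars_db") # upload a data frame as a table
tbl(con, "mtcars_db") %>% # the same dplyr verbs
filter(cyl == 6) %>%
summarise(avg_mpg = mean(mpg, na.rm = TRUE)) %>%
show_query() # dbplyr translates the chain to SQL
dbDisconnect(con)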
data(mtcars)
mtcars <- tibble::rownames_to_column(mtcars, "car_model")
# The Pipe Operator %>% — makes code read like plain English
# Without pipe (inside-out, hard to read):
# arrange(select(filter(mtcars, cyl == 6), car_model, mpg, cyl), desc(mpg))
# With pipe (reads: take mtcars, THEN filter, THEN select, THEN arrange):
mtcars %>%
filter(cyl == 6) %>%
select(car_model, mpg, cyl) %>%
arrange(desc(mpg))
#> car_model mpg cyl
#> 1 Hornet 4 Drive 21.4 6
#> 2 Mazda RX4 21.0 6
#> 3 Mazda RX4 Wag 21.0 6
#> 4 Ferrari Dino 19.7 6
#> 5 Merc 280 19.2 6
#> 6 Valiant 18.1 6
#> 7 Merc 280C 17.8 6
# filter() — keep rows meeting a condition (like SQL WHERE)
mtcars %>% filter(cyl == 8, mpg > 15, hp > 200)
#> car_model mpg cyl disp hp drat wt qsec vs am gear carb
#> 1 Ford Pantera L 15.8 8 351 264 4.22 3.17 14.5 0 1 5 4
mtcars %>% filter(cyl != 8) # all 4- and 6-cylinder cars
#> car_model mpg cyl disp hp drat wt qsec vs am gear carb
#> 1 Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
#> 2 Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
#> 3 Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
#> 4 Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
#> 5 Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
#> 6 Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
#> 7 Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
#> 8 Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
#> 9 Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
#> 10 Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
#> 11 Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
#> 12 Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
#> 13 Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
#> 14 Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
#> 15 Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
#> 16 Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
#> 17 Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
#> 18 Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
# select() — keep only the columns you need (like SQL SELECT)
mtcars %>% select(car_model, mpg, hp, wt)
#> car_model mpg hp wt
#> 1 Mazda RX4 21.0 110 2.620
#> 2 Mazda RX4 Wag 21.0 110 2.875
#> 3 Datsun 710 22.8 93 2.320
#> 4 Hornet 4 Drive 21.4 110 3.215
#> 5 Hornet Sportabout 18.7 175 3.440
#> 6 Valiant 18.1 105 3.460
#> 7 Duster 360 14.3 245 3.570
#> 8 Merc 240D 24.4 62 3.190
#> 9 Merc 230 22.8 95 3.150
#> 10 Merc 280 19.2 123 3.440
#> 11 Merc 280C 17.8 123 3.440
#> 12 Merc 450SE 16.4 180 4.070
#> 13 Merc 450SL 17.3 180 3.730
#> 14 Merc 450SLC 15.2 180 3.780
#> 15 Cadillac Fleetwood 10.4 205 5.250
#> 16 Lincoln Continental 10.4 215 5.424
#> 17 Chrysler Imperial 14.7 230 5.345
#> 18 Fiat 128 32.4 66 2.200
#> 19 Honda Civic 30.4 52 1.615
#> 20 Toyota Corolla 33.9 65 1.835
#> 21 Toyota Corona 21.5 97 2.465
#> 22 Dodge Challenger 15.5 150 3.520
#> 23 AMC Javelin 15.2 150 3.435
#> 24 Camaro Z28 13.3 245 3.840
#> 25 Pontiac Firebird 19.2 175 3.845
#> 26 Fiat X1-9 27.3 66 1.935
#> 27 Porsche 914-2 26.0 91 2.140
#> 28 Lotus Europa 30.4 113 1.513
#> 29 Ford Pantera L 15.8 264 3.170
#> 30 Ferrari Dino 19.7 175 2.770
#> 31 Maserati Bora 15.0 335 3.570
#> 32 Volvo 142E 21.4 109 2.780
mtcars %>% select(-qsec, -vs, -am) # a minus sign drops columns
#> car_model mpg cyl disp hp drat wt gear carb
#> 1 Mazda RX4 21.0 6 160.0 110 3.90 2.620 4 4
#> 2 Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 4 4
#> 3 Datsun 710 22.8 4 108.0 93 3.85 2.320 4 1
#> 4 Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 3 1
#> 5 Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 3 2
#> 6 Valiant 18.1 6 225.0 105 2.76 3.460 3 1
#> 7 Duster 360 14.3 8 360.0 245 3.21 3.570 3 4
#> 8 Merc 240D 24.4 4 146.7 62 3.69 3.190 4 2
#> 9 Merc 230 22.8 4 140.8 95 3.92 3.150 4 2
#> 10 Merc 280 19.2 6 167.6 123 3.92 3.440 4 4
#> 11 Merc 280C 17.8 6 167.6 123 3.92 3.440 4 4
#> 12 Merc 450SE 16.4 8 275.8 180 3.07 4.070 3 3
#> 13 Merc 450SL 17.3 8 275.8 180 3.07 3.730 3 3
#> 14 Merc 450SLC 15.2 8 275.8 180 3.07 3.780 3 3
#> 15 Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 3 4
#> 16 Lincoln Continental 10.4 8 460.0 215 3.00 5.424 3 4
#> 17 Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 3 4
#> 18 Fiat 128 32.4 4 78.7 66 4.08 2.200 4 1
#> 19 Honda Civic 30.4 4 75.7 52 4.93 1.615 4 2
#> 20 Toyota Corolla 33.9 4 71.1 65 4.22 1.835 4 1
#> 21 Toyota Corona 21.5 4 120.1 97 3.70 2.465 3 1
#> 22 Dodge Challenger 15.5 8 318.0 150 2.76 3.520 3 2
#> 23 AMC Javelin 15.2 8 304.0 150 3.15 3.435 3 2
#> 24 Camaro Z28 13.3 8 350.0 245 3.73 3.840 3 4
#> 25 Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 3 2
#> 26 Fiat X1-9 27.3 4 79.0 66 4.08 1.935 4 1
#> 27 Porsche 914-2 26.0 4 120.3 91 4.43 2.140 5 2
#> 28 Lotus Europa 30.4 4 95.1 113 3.77 1.513 5 2
#> 29 Ford Pantera L 15.8 8 351.0 264 4.22 3.170 5 4
#> 30 Ferrari Dino 19.7 6 145.0 175 3.62 2.770 5 6
#> 31 Maserati Bora 15.0 8 301.0 335 3.54 3.570 5 8
#> 32 Volvo 142E 21.4 4 121.0 109 4.11 2.780 4 2
# mutate() — create new columns or modify existing ones
mtcars %>%
mutate(
kpl = round(mpg * 0.425, 2), # convert MPG to km/liter
hp_per_cyl = round(hp / cyl, 1), # power per cylinder
wt_kg = round(wt * 453.6, 0), # weight in kg
efficiency = ifelse(mpg > 20, "High", "Low")
) %>%
select(car_model, mpg, kpl, hp_per_cyl, wt_kg, efficiency)
#> car_model mpg kpl hp_per_cyl wt_kg efficiency
#> 1 Mazda RX4 21.0 8.92 18.3 1188 High
#> 2 Mazda RX4 Wag 21.0 8.92 18.3 1304 High
#> 3 Datsun 710 22.8 9.69 23.2 1052 High
#> 4 Hornet 4 Drive 21.4 9.09 18.3 1458 High
#> 5 Hornet Sportabout 18.7 7.95 21.9 1560 Low
#> 6 Valiant 18.1 7.69 17.5 1569 Low
#> 7 Duster 360 14.3 6.08 30.6 1619 Low
#> 8 Merc 240D 24.4 10.37 15.5 1447 High
#> 9 Merc 230 22.8 9.69 23.8 1429 High
#> 10 Merc 280 19.2 8.16 20.5 1560 Low
#> 11 Merc 280C 17.8 7.57 20.5 1560 Low
#> 12 Merc 450SE 16.4 6.97 22.5 1846 Low
#> 13 Merc 450SL 17.3 7.35 22.5 1692 Low
#> 14 Merc 450SLC 15.2 6.46 22.5 1715 Low
#> 15 Cadillac Fleetwood 10.4 4.42 25.6 2381 Low
#> 16 Lincoln Continental 10.4 4.42 26.9 2460 Low
#> 17 Chrysler Imperial 14.7 6.25 28.8 2424 Low
#> 18 Fiat 128 32.4 13.77 16.5 998 High
#> 19 Honda Civic 30.4 12.92 13.0 733 High
#> 20 Toyota Corolla 33.9 14.41 16.2 832 High
#> 21 Toyota Corona 21.5 9.14 24.2 1118 High
#> 22 Dodge Challenger 15.5 6.59 18.8 1597 Low
#> 23 AMC Javelin 15.2 6.46 18.8 1558 Low
#> 24 Camaro Z28 13.3 5.65 30.6 1742 Low
#> 25 Pontiac Firebird 19.2 8.16 21.9 1744 Low
#> 26 Fiat X1-9 27.3 11.60 16.5 878 High
#> 27 Porsche 914-2 26.0 11.05 22.8 971 High
#> 28 Lotus Europa 30.4 12.92 28.2 686 High
#> 29 Ford Pantera L 15.8 6.72 33.0 1438 Low
#> 30 Ferrari Dino 19.7 8.37 29.2 1256 Low
#> 31 Maserati Bora 15.0 6.38 41.9 1619 Low
#> 32 Volvo 142E 21.4 9.09 27.2 1261 High
# arrange() + head() — the top 5 most fuel-efficient cars
mtcars %>% arrange(desc(mpg)) %>% select(car_model, mpg) %>% head(5)
#> car_model mpg
#> 1 Toyota Corolla 33.9
#> 2 Fiat 128 32.4
#> 3 Honda Civic 30.4
#> 4 Lotus Europa 30.4
#> 5 Fiat X1-9 27.3
# group_by() + summarise() — aggregate by groups (like SQL GROUP BY)
# This is one of the most powerful and frequently used patterns in dplyr
mtcars %>%
group_by(cyl) %>%
summarise(
n_cars = n(),
avg_mpg = round(mean(mpg), 1),
avg_hp = round(mean(hp), 1),
best_mpg = max(mpg)
) %>%
arrange(cyl)
#> # A tibble: 3 × 5
#> cyl n_cars avg_mpg avg_hp best_mpg
#> <dbl> <int> <dbl> <dbl> <dbl>
#> 1 4 11 26.7 82.6 33.9
#> 2 6 7 19.7 122. 21.4
#> 3 8 14 15.1 209. 19.2
# Chaining everything — real-world example:
# "Among powerful cars, which engine type has the best fuel efficiency?"
mtcars %>%
filter(hp >= 100) %>%
mutate(engine_type = ifelse(vs == 1, "Straight", "V-shape")) %>%
group_by(engine_type) %>%
summarise(count = n(), avg_mpg = round(mean(mpg), 1), avg_hp = round(mean(hp), 1)) %>%
arrange(desc(avg_mpg))
#> # A tibble: 2 × 4
#> engine_type count avg_mpg avg_hp
#> <chr> <int> <dbl> <dbl>
#> 1 Straight 6 21.4 114.
#> 2 V-shape 17 16.1 196.
Data can be stored in wide format (one row per
subject, many columns for different measurements) or long
format (one row per observation, fewer columns). Wide format is
human-readable; long format is required by most R plotting and modeling
functions. tidyr provides pivot_longer() (wide
to long) and pivot_wider() (long to wide) to convert
between formats.
Convert to long format when: plotting multiple variables with ggplot2 (which needs all values in one column and the variable name in another), running repeated-measures ANOVA, or applying the same operation to multiple columns. Convert to wide format when: presenting a summary table or creating a human-readable report.
# Wide Format: one row per student, one column per subject
exam_results_wide <- data.frame(
student = c("Alice", "Bob", "Carol", "David"),
Math = c(85, 70, 90, 75), Science = c(88, 75, 82, 68),
English = c(78, 80, 95, 72), History = c(82, 65, 88, 77)
)
exam_results_wide
#> student Math Science English History
#> 1 Alice 85 88 78 82
#> 2 Bob 70 75 80 65
#> 3 Carol 90 82 95 88
#> 4 David 75 68 72 77
# pivot_longer(): Wide to Long — one row per student-subject combination
exam_results_long <- exam_results_wide %>%
pivot_longer(
cols = -student, # pivot all columns EXCEPT student
names_to = "subject", # column names become a new 'subject' column
values_to = "score" # values go into a new 'score' column
)
exam_results_long
#> # A tibble: 16 × 3
#> student subject score
#> <chr> <chr> <dbl>
#> 1 Alice Math 85
#> 2 Alice Science 88
#> 3 Alice English 78
#> 4 Alice History 82
#> 5 Bob Math 70
#> 6 Bob Science 75
#> 7 Bob English 80
#> 8 Bob History 65
#> 9 Carol Math 90
#> 10 Carol Science 82
#> 11 Carol English 95
#> 12 Carol History 88
#> 13 David Math 75
#> 14 David Science 68
#> 15 David English 72
#> 16 David History 77
# Now easily calculate per-subject and per-student statistics
exam_results_long %>% group_by(subject) %>%
summarise(avg_score = mean(score), top_score = max(score))
#> # A tibble: 4 × 3
#> subject avg_score top_score
#> <chr> <dbl> <dbl>
#> 1 English 81.2 95
#> 2 History 78 88
#> 3 Math 80 90
#> 4 Science 78.2 88
# pivot_wider(): Long to Wide — reverse the operation
exam_results_long %>% pivot_wider(names_from = subject, values_from = score)
#> # A tibble: 4 × 5
#> student Math Science English History
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 Alice 85 88 78 82
#> 2 Bob 70 75 80 65
#> 3 Carol 90 82 95 88
#> 4 David 75 68 72 77
stringr provides a consistent set of functions for
working with text data (strings). All function names start with
str_, making them easy to discover. String manipulation is
critical for cleaning messy text: inconsistent capitalization, extra
whitespace, extracting patterns, replacing values, splitting and joining
text.
texts <- c(" Hello World ", "data ANALYSIS", "r programming 2024", "Kigali, Rwanda")
str_trim(texts) # remove leading/trailing whitespace#> [1] "Hello World" "data ANALYSIS" "r programming 2024"
#> [4] "Kigali, Rwanda"
#> [1] " Hello World " "Data Analysis" "R Programming 2024"
#> [4] "Kigali, Rwanda"
#> [1] "DATA ANALYSIS"
#> [1] "data analysis"
#> [1] "R programming 2024"
#> [1] " HeLLo WorLd "
#> [1] FALSE TRUE TRUE FALSE
#> [1] 15 13 18 14
#> [1] "Kigali"
#> [[1]]
#> [1] "Kigali" "Rwanda"
#> [1] "R is great"
#> [1] 0 2 1 3
Exploratory Data Analysis (EDA) is the process of examining and summarizing a dataset to understand its structure, spot patterns, identify anomalies, and form hypotheses — before applying any formal statistical models. It was championed by statistician John Tukey in his 1977 book of the same name. EDA is fundamentally about asking questions of your data with an open mind.
EDA is critical because: (1) it reveals data quality issues (missing values, wrong types, outliers) before they corrupt your model; (2) it suggests which variables are related and worth modeling; (3) it builds your intuition about the data; (4) it often reveals the answer to your question directly, without needing complex models. Never skip EDA — it is the foundation of trustworthy analysis.
data(iris)
# Step 1: Understand the Structure
str(iris) # types, dimensions, preview of values — always start here
#> 'data.frame': 150 obs. of 5 variables:
#> $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
#> $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
#> $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
#> $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
#> $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
# Step 2: Summary statistics for every column
summary(iris)
#> Sepal.Length Sepal.Width Petal.Length Petal.Width
#> Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
#> 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
#> Median :5.800 Median :3.000 Median :4.350 Median :1.300
#> Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
#> 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
#> Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
#> Species
#> setosa :50
#> versicolor:50
#> virginica :50
#>
#>
#>
#>
# Step 3: Key statistics for a single variable
cat("=== Sepal Length Statistics ===\n")
#> === Sepal Length Statistics ===
cat("Mean:", mean(iris$Sepal.Length), "\n")
#> Mean: 5.843333
cat("Median:", median(iris$Sepal.Length), "\n")
#> Median: 5.8
cat("Std Dev:", round(sd(iris$Sepal.Length), 3), "\n")
#> Std Dev: 0.828
cat("IQR:", IQR(iris$Sepal.Length), "\n")
#> IQR: 1.3
quantile(iris$Sepal.Length)
#> 0% 25% 50% 75% 100%
#> 4.3 5.1 5.8 6.4 7.9
# Frequencies of a categorical variable
table(iris$Species)
#>
#> setosa versicolor virginica
#> 50 50 50
prop.table(table(iris$Species)) * 100 # as percentages
#>
#> setosa versicolor virginica
#> 33.33333 33.33333 33.33333
# Step 4: Correlation Analysis
# Measures LINEAR relationship: -1 = perfect negative, 0 = none, +1 = perfect positive
cor_matrix <- cor(iris[, 1:4])
round(cor_matrix, 2)
#> Sepal.Length Sepal.Width Petal.Length Petal.Width
#> Sepal.Length 1.00 -0.12 0.87 0.82
#> Sepal.Width -0.12 1.00 -0.43 -0.37
#> Petal.Length 0.87 -0.43 1.00 0.96
#> Petal.Width 0.82 -0.37 0.96 1.00
cat("\nPetal.Length vs Petal.Width correlation:",
round(cor(iris$Petal.Length, iris$Petal.Width), 3),
"\n-> Strong positive: larger petals tend to be both long AND wide\n")#>
#> Petal.Length vs Petal.Width correlation: 0.963
#> -> Strong positive: larger petals tend to be both long AND wide
# Step 5: Group-wise Comparison
iris %>%
group_by(Species) %>%
summarise(
mean_sepal_L = round(mean(Sepal.Length), 2),
mean_petal_L = round(mean(Petal.Length), 2),
mean_petal_W = round(mean(Petal.Width), 2)
)
#> # A tibble: 3 × 4
#> Species mean_sepal_L mean_petal_L mean_petal_W
#> <fct> <dbl> <dbl> <dbl>
#> 1 setosa 5.01 1.46 0.25
#> 2 versicolor 5.94 4.26 1.33
#> 3 virginica 6.59 5.55 2.03
# Step 6: Outlier Detection Using the IQR (Interquartile Range) Method
# Values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR are flagged as outliers
detect_outliers <- function(x, column_name) {
Q1 <- quantile(x, 0.25)
Q3 <- quantile(x, 0.75)
IQR_val <- Q3 - Q1
lower <- Q1 - 1.5 * IQR_val
upper <- Q3 + 1.5 * IQR_val
outliers <- x[x < lower | x > upper]
cat(column_name, "| Fences: [", round(lower,2), ",", round(upper,2), "]",
"| Outliers found:", length(outliers), "\n")
return(outliers)
}
detect_outliers(iris$Sepal.Length, "Sepal.Length")
#> Sepal.Length | Fences: [ 3.15 , 8.35 ] | Outliers found: 0
#> numeric(0)
detect_outliers(iris$Sepal.Width, "Sepal.Width")
#> Sepal.Width | Fences: [ 2.05 , 4.05 ] | Outliers found: 4
#> [1] 4.4 4.1 4.2 2.0
detect_outliers(iris$Petal.Length, "Petal.Length")
#> Petal.Length | Fences: [ -3.65 , 10.35 ] | Outliers found: 0
#> numeric(0)
Data visualization is the graphical representation of data to communicate patterns, trends, relationships, and distributions visually. As the visualization literature puts it, the purpose of a visualization is insight, not pictures. R is world-famous for its visualization capabilities, especially through the ggplot2 package, which implements Leland Wilkinson's Grammar of Graphics — a systematic way to describe any chart as a combination of data, aesthetics, and geometric objects.
Humans process visual information dramatically faster than text. A well-made chart can reveal a pattern in seconds that would take pages of numbers to describe. Visualization serves two roles: exploratory (for your own understanding during EDA) and explanatory (communicating findings to others). Both are essential skills for data analysts.
# Histogram — shows the distribution (spread and shape) of a single numeric variable
# Use when: checking normality, understanding value spread, spotting skewness
hist(iris$Sepal.Length,
main = "Distribution of Sepal Length",
xlab = "Sepal Length (cm)", ylab = "Frequency",
col = "steelblue", border = "white", breaks = 15)# Boxplot — shows median, quartiles, and outliers; great for group comparison
# Use when: comparing distributions across groups, spotting outliers
boxplot(Sepal.Length ~ Species, data = iris,
main = "Sepal Length by Species",
xlab = "Species", ylab = "Sepal Length (cm)",
col = c("#3498db", "#2ecc71", "#e74c3c"))Every ggplot2 chart is built from three essential components:
aes(): mappings from data
variables to visual properties (x-axis, y-axis, color, size, shape)geom_point() for dots, geom_bar() for
bars, geom_line() for lines, geom_boxplot()
for boxplots, etc.)Additional layers — scales, themes, labels, facets — are added with
+.
library(ggplot2)
# Scatter Plot — shows relationship between two numeric variables
# Use when: exploring correlation, showing how two measures relate
ggplot(iris, aes(x = Sepal.Length, y = Petal.Length, color = Species, shape = Species)) +
geom_point(size = 2.5, alpha = 0.8) +
geom_smooth(method = "lm", se = FALSE, linetype = "dashed", linewidth = 0.8) +
labs(title = "Sepal Length vs Petal Length by Species",
subtitle = "Linear trend lines shown per species",
x = "Sepal Length (cm)", y = "Petal Length (cm)") +
theme_minimal(base_size = 12) +
scale_color_brewer(palette = "Set1")# Overlapping Histograms with Density Curves — compare distributions across groups
# Use when: understanding how the same variable differs between groups
ggplot(iris, aes(x = Sepal.Width, fill = Species)) +
geom_histogram(aes(y = after_stat(density)), bins = 20, alpha = 0.5, position = "identity") +
geom_density(aes(color = Species), linewidth = 1, fill = NA) +
labs(title = "Sepal Width Distribution by Species",
subtitle = "Histogram with smoothed density curves",
x = "Sepal Width (cm)", y = "Density") +
theme_minimal() +
scale_fill_brewer(palette = "Set2") + scale_color_brewer(palette = "Set2")# Boxplot with Jitter — shows distribution AND individual data points
# Use when: comparing distributions across groups, especially with small samples
ggplot(iris, aes(x = Species, y = Petal.Width, fill = Species)) +
geom_boxplot(alpha = 0.6, outlier.shape = NA) +
geom_jitter(width = 0.15, alpha = 0.5, size = 1.5) +
labs(title = "Petal Width by Species",
subtitle = "Boxplot with individual data points overlaid",
x = "Species", y = "Petal Width (cm)") +
theme_classic() + theme(legend.position = "none") +
scale_fill_brewer(palette = "Pastel1")# Bar Chart with Error Bars — compare averages across categories
# Use when: showing a summary statistic (mean, total) across discrete groups
mtcars$cyl <- as.factor(mtcars$cyl)
mtcars %>%
group_by(cyl) %>%
summarise(avg_mpg = mean(mpg), se = sd(mpg) / sqrt(n())) %>%
ggplot(aes(x = cyl, y = avg_mpg, fill = cyl)) +
geom_col(alpha = 0.85, width = 0.6) +
geom_errorbar(aes(ymin = avg_mpg - se, ymax = avg_mpg + se), width = 0.2) +
geom_text(aes(label = round(avg_mpg, 1)), vjust = -0.8, fontface = "bold") +
labs(title = "Average Fuel Efficiency by Number of Cylinders",
subtitle = "Error bars represent +/- 1 standard error",
x = "Number of Cylinders", y = "Average MPG") +
theme_minimal() + theme(legend.position = "none") +
scale_fill_brewer(palette = "Blues")# Faceted Plot — small multiples: same chart for each subgroup side-by-side
# Use when: you want to compare the same relationship across multiple categories
ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width)) +
geom_point(aes(color = Species), alpha = 0.7) +
geom_smooth(method = "lm", se = TRUE, color = "gray40") +
facet_wrap(~ Species, nrow = 1) +
labs(title = "Sepal Dimensions by Species (Faceted)",
subtitle = "Each panel shows one species independently",
x = "Sepal Length (cm)", y = "Sepal Width (cm)") +
  theme_bw() + theme(legend.position = "none")
Statistical analysis is the process of collecting, examining, summarizing, and interpreting data to discover patterns and draw conclusions. It divides into descriptive statistics (summarizing what the data shows) and inferential statistics (drawing conclusions about a population from a sample, quantifying uncertainty through p-values, confidence intervals, and hypothesis tests).
A hypothesis test is a formal procedure for deciding whether data provides enough evidence to reject a null hypothesis (H0). The null hypothesis is always the "nothing interesting is happening" claim. The alternative hypothesis (H1) is what you suspect is actually true. The p-value is the probability of observing results at least as extreme as yours if the null hypothesis were actually true. A p-value below 0.05 (the conventional significance level) is taken as evidence to reject H0.
# One-Sample t-test — Is the mean of a sample different from a known value?
# Question: Is the average exam score significantly different from 75?
# H0: population mean = 75; H1: population mean is not 75
exam_scores <- c(78, 85, 72, 91, 68, 88, 74, 82, 79, 86, 77, 83)
t_result <- t.test(exam_scores, mu = 75)
t_result
#> 
#> One Sample t-test
#>
#> data: exam_scores
#> t = 2.6547, df = 11, p-value = 0.0224
#> alternative hypothesis: true mean is not equal to 75
#> 95 percent confidence interval:
#> 75.89729 84.60271
#> sample estimates:
#> mean of x
#> 80.25
#>
#> Sample mean: 80.25
#> p-value: 0.0224
if (t_result$p.value < 0.05) {
cat("-> p < 0.05: Reject H0. Mean is significantly different from 75.\n")
} else {
cat("-> p >= 0.05: Fail to reject H0. No significant difference from 75.\n")
}
#> -> p < 0.05: Reject H0. Mean is significantly different from 75.
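To see where that number comes from, here is a sketch that recomputes the same two-sided p-value directly from the t distribution (df = n - 1 = 11):
# Probability of a t-statistic at least this extreme if H0 were true
2 * pt(abs(t_result$statistic), df = 11, lower.tail = FALSE)   # ~0.0224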
# Two-Sample t-test — Are two groups significantly different?
# Question: Do students who attended tutoring score higher?
tutored <- c(82, 88, 79, 91, 85, 90, 87, 83)
not_tutored <- c(72, 68, 75, 78, 70, 73, 65, 71)
t_two <- t.test(tutored, not_tutored, alternative = "greater")
t_two
#> 
#> Welch Two Sample t-test
#>
#> data: tutored and not_tutored
#> t = 6.9118, df = 13.991, p-value = 3.606e-06
#> alternative hypothesis: true difference in means is greater than 0
#> 95 percent confidence interval:
#> 10.52541 Inf
#> sample estimates:
#> mean of x mean of y
#> 85.625 71.500
cat("Group means — Tutored:", round(mean(tutored),1),
"| Not tutored:", round(mean(not_tutored),1), "\n")#> Group means — Tutored: 85.6 | Not tutored: 71.5
# Chi-Square Test — Is there a relationship between two categorical variables?
# Question: Is there a relationship between gender and promotion?
promo_table <- matrix(c(15, 10, 25, 30), nrow = 2,
dimnames = list(c("Promoted", "Not Promoted"), c("Male", "Female")))
promo_table
#>              Male Female
#> Promoted       15     25
#> Not Promoted   10     30
chisq.test(promo_table)
#> 
#> Pearson's Chi-squared test with Yates' continuity correction
#>
#> data: promo_table
#> X-squared = 0.93091, df = 1, p-value = 0.3346
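As a quick assumption check (a sketch, not part of the original output), the chi-square approximation is generally reliable when every expected cell count is at least 5:
# Expected counts under independence of gender and promotion status
chisq.test(promo_table)$expected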
ANOVA (Analysis of Variance) tests whether the means of three or more groups are significantly different. It compares the variation between groups to the variation within groups. If between-group variation is much larger than within-group variation (captured by the F-statistic), the groups are likely truly different. After ANOVA rejects H0, a post-hoc test (like Tukey’s HSD) identifies which specific pairs of groups differ.
# Question: Are petal lengths significantly different across iris species?
# H0: all three species have the same mean petal length
anova_model <- aov(Petal.Length ~ Species, data = iris)
summary(anova_model)
#>              Df Sum Sq Mean Sq F value   Pr(>F)
#> Species 2 437.1 218.55 1180 <2e-16 ***
#> Residuals 147 27.2 0.19
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
tapply(iris$Petal.Length, iris$Species, mean)   # mean petal length per species
#>     setosa versicolor  virginica 
#>      1.462      4.260      5.552
TukeyHSD(anova_model)
#>   Tukey multiple comparisons of means
#> 95% family-wise confidence level
#>
#> Fit: aov(formula = Petal.Length ~ Species, data = iris)
#>
#> $Species
#> diff lwr upr p adj
#> versicolor-setosa 2.798 2.59422 3.00178 0
#> virginica-setosa 4.090 3.88622 4.29378 0
#> virginica-versicolor 1.292 1.08822 1.49578 0
Linear regression models the relationship between a response variable (Y) and one or more predictor variables (X). It fits a linear equation to the data: Y = B0 + B1X1 + B2X2 + error. Each coefficient B tells you how much Y changes for a one-unit increase in the corresponding X, holding the other predictors constant. R-squared measures how well the model explains the variation in Y (0 = explains nothing, 1 = perfect explanation).
Linear regression is the foundation of predictive modeling. Even when more complex models are used, regression is typically the baseline to beat. Its output is interpretable — the coefficients directly tell you the quantitative relationship between predictors and outcome — which is essential for business decision-making and scientific communication.
# Simple Linear Regression: predict MPG from horsepower alone
model_simple <- lm(mpg ~ hp, data = mtcars)
summary(model_simple)
#> 
#> Call:
#> lm(formula = mpg ~ hp, data = mtcars)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -5.7121 -2.1122 -0.8854 1.5819 8.2360
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 30.09886 1.63392 18.421 < 2e-16 ***
#> hp -0.06823 0.01012 -6.742 1.79e-07 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 3.863 on 30 degrees of freedom
#> Multiple R-squared: 0.6024, Adjusted R-squared: 0.5892
#> F-statistic: 45.46 on 1 and 30 DF, p-value: 1.788e-07
#> 
coefs <- coef(model_simple)   # extract intercept and slope for interpretation
cat("=== Interpretation ===\n")
cat("Intercept:", round(coefs[1], 3), "-> Expected MPG when hp = 0\n")
#> === Interpretation ===
#> Intercept: 30.099 -> Expected MPG when hp = 0
cat("hp coefficient:", round(coefs[2], 4),
"-> Each extra HP reduces MPG by", abs(round(coefs[2], 4)), "\n")#> hp coefficient: -0.0682 -> Each extra HP reduces MPG by 0.0682
cat("R-squared:", round(summary(model_simple)$r.squared, 3),
"-> HP explains", round(summary(model_simple)$r.squared*100, 1), "% of MPG variance\n")#> R-squared: 0.602 -> HP explains 60.2 % of MPG variance
# Visualization of the regression line
ggplot(mtcars, aes(x = hp, y = mpg)) +
geom_point(color = "#2980b9", size = 2.5, alpha = 0.8) +
geom_smooth(method = "lm", color = "#e74c3c", se = TRUE, linewidth = 1.2) +
labs(title = "Simple Linear Regression: MPG ~ Horsepower",
subtitle = paste0("R2 = ", round(summary(model_simple)$r.squared, 3),
" | Each +1 HP reduces MPG by ",
abs(round(coefs[2], 3))),
x = "Horsepower (hp)", y = "Miles Per Gallon (mpg)") +
  theme_minimal()
# Multiple Linear Regression: add more predictors
model_multi <- lm(mpg ~ hp + wt + factor(cyl), data = mtcars)
summary(model_multi)
#> 
#> Call:
#> lm(formula = mpg ~ hp + wt + factor(cyl), data = mtcars)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -4.2612 -1.0320 -0.3210 0.9281 5.3947
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 35.84600 2.04102 17.563 2.67e-16 ***
#> hp -0.02312 0.01195 -1.934 0.063613 .
#> wt -3.18140 0.71960 -4.421 0.000144 ***
#> factor(cyl)6 -3.35902 1.40167 -2.396 0.023747 *
#> factor(cyl)8 -3.18588 2.17048 -1.468 0.153705
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 2.44 on 27 degrees of freedom
#> Multiple R-squared: 0.8572, Adjusted R-squared: 0.8361
#> F-statistic: 40.53 on 4 and 27 DF, p-value: 4.869e-11
cat("Multiple R-squared:", round(summary(model_multi)$r.squared, 3),
"-> hp + wt + cyl explains",
    round(summary(model_multi)$r.squared * 100, 1), "% of MPG variance\n")
#> Multiple R-squared: 0.857 -> hp + wt + cyl explains 85.7 % of MPG variance
Code efficiency in R means writing code that executes quickly and uses memory well. R’s biggest performance insight is vectorization — R is designed to operate on entire vectors at once using optimized compiled code (C/Fortran under the hood). Explicit for loops in R are slow because they are interpreted one iteration at a time. The same operation expressed as a vectorized call can be 10-1000x faster.
As datasets grow (thousands to millions of rows), poorly written code becomes impractically slow. A loop over a million rows that takes 30 seconds in R might take 0.3 seconds vectorized. For production data pipelines and big data workflows, efficiency is not optional — it is the difference between a pipeline that runs in minutes versus hours.
# Benchmarking: measure how long code takes with system.time()
n <- 500000
x <- runif(n) # 500,000 random numbers
# Method 1: for loop — slow (interpreted one iteration at a time)
time_loop <- system.time({
result_loop <- numeric(n) # pre-allocate! Growing vectors is very slow
for (i in seq_along(x)) result_loop[i] <- sqrt(x[i])
})
# Method 2: Vectorized — fast (compiled C code on the whole vector)
time_vec <- system.time({
result_vec <- sqrt(x)
})
cat("Loop time: ", time_loop["elapsed"], "seconds\n")#> Loop time: 0.08 seconds
#> Vectorized time: 0.02 seconds
#> Speedup factor: 4 x
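The same principle extends beyond arithmetic. As a sketch, a vectorized conditional replaces an element-by-element if/else loop:
# ifelse() evaluates the condition across the whole vector at once
labels <- ifelse(x > 0.5, "high", "low")
table(labels)   # counts of each label across all 500,000 values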
# data.table — high-performance data manipulation for large datasets
# Syntax: dt[filter_rows, compute_columns, by = group_columns]
library(data.table)
dt <- as.data.table(mtcars)
dt[cyl == 6, .(avg_mpg = mean(mpg), avg_hp = mean(hp), n = .N)]
#>     avg_mpg   avg_hp     n
#>       <num>    <num> <int>
#> 1: 19.74286 122.2857     7
dt[, .(avg_mpg = mean(mpg)), by = cyl]   # mean mpg per cylinder group
#>       cyl  avg_mpg
#>    <fctr>    <num>
#> 1:      6 19.74286
#> 2:      4 26.66364
#> 3:      8 15.10000
dt[hp > 150, .N, by = cyl]   # car counts with hp > 150, per cylinder group
#>       cyl     N
#>    <fctr> <int>
#> 1:      8    12
#> 2:      6     1
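data.table can also add or modify columns in place with the := operator, avoiding the copy that a normal assignment makes. A sketch (the kpl column is an illustrative addition, not part of the original analysis):
dt[, kpl := mpg * 0.4251]    # convert miles/gallon to km/liter, by reference
head(dt[, .(mpg, kpl)], 3)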
Machine learning (ML) is the practice of training
algorithms to learn patterns from data and make predictions or decisions
without being explicitly programmed for each case. The
caret package provides a unified interface to 200+ ML
algorithms, handling train/test splitting, cross-validation,
preprocessing, and model evaluation. The key concept: ML models learn
from training data and are evaluated on test
data they have never seen — simulating real-world
deployment.
Traditional statistics focuses on understanding relationships (what factors affect Y, and by how much?). Machine learning focuses on prediction accuracy (given X, what will Y be?). Both are valuable — the choice depends on whether your goal is explanation or prediction. Understanding ML is increasingly essential for data analysts in business and research.
library(caret)
# Step 1: Split Data into Training (80%) and Test (20%) Sets
set.seed(42) # set.seed() ensures reproducibility of random operations
train_index <- createDataPartition(iris$Species, p = 0.8, list = FALSE)
train_data <- iris[train_index, ]
test_data <- iris[-train_index, ]
cat("Training samples:", nrow(train_data), "| Test samples:", nrow(test_data), "\n")#> Training samples: 120 | Test samples: 30
# Step 2: Define Training Control — 5-fold Cross-Validation
# CV splits training data into 5 folds: trains on 4, validates on 1, repeats 5 times
train_control <- trainControl(method = "cv", number = 5)
# Step 3: Train a k-Nearest Neighbors (kNN) Classifier
# kNN: classify based on the k closest examples in training data
model_knn <- train(
Species ~ ., # predict Species using all other columns
data = train_data,
method = "knn",
trControl = train_control,
preProcess = c("center", "scale") # standardize features (critical for kNN)
)
model_knn
#> k-Nearest Neighbors
#>
#> 120 samples
#> 4 predictor
#> 3 classes: 'setosa', 'versicolor', 'virginica'
#>
#> Pre-processing: centered (4), scaled (4)
#> Resampling: Cross-Validated (5 fold)
#> Summary of sample sizes: 96, 96, 96, 96, 96
#> Resampling results across tuning parameters:
#>
#> k Accuracy Kappa
#> 5 0.9500000 0.9250
#> 7 0.9583333 0.9375
#> 9 0.9500000 0.9250
#>
#> Accuracy was used to select the optimal model using the largest value.
#> The final value used for the model was k = 7.
# Step 4: Evaluate on Test Data — data the model has NEVER seen
predictions <- predict(model_knn, newdata = test_data)
conf_matrix <- confusionMatrix(predictions, test_data$Species)
conf_matrix
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction setosa versicolor virginica
#> setosa 10 0 0
#> versicolor 0 9 3
#> virginica 0 1 7
#>
#> Overall Statistics
#>
#> Accuracy : 0.8667
#> 95% CI : (0.6928, 0.9624)
#> No Information Rate : 0.3333
#> P-Value [Acc > NIR] : 2.296e-09
#>
#> Kappa : 0.8
#>
#> Mcnemar's Test P-Value : NA
#>
#> Statistics by Class:
#>
#> Class: setosa Class: versicolor Class: virginica
#> Sensitivity 1.0000 0.9000 0.7000
#> Specificity 1.0000 0.8500 0.9500
#> Pos Pred Value 1.0000 0.7500 0.8750
#> Neg Pred Value 1.0000 0.9444 0.8636
#> Prevalence 0.3333 0.3333 0.3333
#> Detection Rate 0.3333 0.3000 0.2333
#> Detection Prevalence 0.3333 0.4000 0.2667
#> Balanced Accuracy 1.0000 0.8750 0.8250
#> 
cat("Overall Accuracy:", round(conf_matrix$overall["Accuracy"] * 100, 1), "%\n")
#> Overall Accuracy: 86.7 %
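Once trained, the model scores brand-new observations the same way. A sketch with hypothetical flower measurements (caret applies the centering and scaling automatically):
new_flower <- data.frame(Sepal.Length = 5.9, Sepal.Width = 3.0,
                         Petal.Length = 4.2, Petal.Width = 1.3)
predict(model_knn, newdata = new_flower)   # returns the predicted species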
A capstone project integrates all skills from the
course into one complete, real-world workflow: data loading, inspection,
cleaning, exploratory analysis, visualization, statistical modeling, and
interpretation. This mirrors what data analysts actually do on the job.
The dataset is mtcars (Motor Trend, 1974), comparing fuel
efficiency, engine specs, and performance of 32 car models.
Research Question: What factors most strongly predict a car’s fuel efficiency (MPG), and how well can we model it?
# STEP 1: DATA LOADING & FIRST LOOK
data(mtcars)
mtcars <- tibble::rownames_to_column(mtcars, "car_model")
cat("Dataset: Motor Trend Car Road Tests (1974)\n")#> Dataset: Motor Trend Car Road Tests (1974)
#> Dimensions: 32 rows x 12 columns
#> Key variables:
#> mpg = Miles per gallon [TARGET variable]
#> cyl = Number of cylinders (4, 6, 8)
#> hp = Gross horsepower
#> wt = Weight (1000 lbs)
#> am = Transmission (0=Automatic, 1=Manual)
#> Missing values: 0
mtcars <- mtcars %>%
mutate(
cyl = factor(cyl, levels = c(4, 6, 8)),
am = factor(am, labels = c("Automatic", "Manual")),
vs = factor(vs, labels = c("V-shape", "Straight")),
gear = factor(gear), carb = factor(carb)
)
str(mtcars[, c("mpg", "cyl", "hp", "wt", "am")])
#> 'data.frame':    32 obs. of  5 variables:
#> $ mpg: num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
#> $ cyl: Factor w/ 3 levels "4","6","8": 2 2 1 2 3 2 3 1 1 2 ...
#> $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
#> $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
#> $ am : Factor w/ 2 levels "Automatic","Manual": 2 2 2 1 1 1 1 1 1 1 ...
cat("=== MPG Summary ===\n")
#> === MPG Summary ===
cat("Mean:", round(mean(mtcars$mpg), 2), " | Median:", median(mtcars$mpg),
" | SD:", round(sd(mtcars$mpg), 2), "\n\n")#> Mean: 20.09 | Median: 19.2 | SD: 6.03
#> === MPG by Cylinder Count ===
mtcars %>% group_by(cyl) %>%
summarise(n = n(), avg_mpg = round(mean(mpg), 1), sd_mpg = round(sd(mpg), 1),
            min_mpg = min(mpg), max_mpg = max(mpg)) %>% print()
#> # A tibble: 3 × 6
#> cyl n avg_mpg sd_mpg min_mpg max_mpg
#> <fct> <int> <dbl> <dbl> <dbl> <dbl>
#> 1 4 11 26.7 4.5 21.4 33.9
#> 2 6 7 19.7 1.5 17.8 21.4
#> 3 8 14 15.1 2.6 10.4 19.2
#> 
cat("=== MPG by Transmission ===\n")
#> === MPG by Transmission ===
mtcars %>% group_by(am) %>%
  summarise(n = n(), avg_mpg = round(mean(mpg), 1)) %>% print()
#> # A tibble: 2 × 3
#> am n avg_mpg
#> <fct> <int> <dbl>
#> 1 Automatic 19 17.1
#> 2 Manual 13 24.4
# STEP 4: VISUALIZATION
# Chart 1: MPG Distribution
ggplot(mtcars, aes(x = mpg, fill = cyl)) +
geom_histogram(bins = 12, alpha = 0.8, color = "white") +
labs(title = "Distribution of Fuel Efficiency (MPG)",
subtitle = "Colored by cylinder count",
x = "Miles Per Gallon", y = "Count", fill = "Cylinders") +
theme_minimal() + scale_fill_brewer(palette = "Set1")# Chart 2: Weight vs MPG — the strongest predictor
ggplot(mtcars, aes(x = wt, y = mpg, color = cyl, shape = am)) +
geom_point(size = 3, alpha = 0.85) +
geom_smooth(method = "lm", se = FALSE, color = "gray40", linetype = "dashed") +
labs(title = "Fuel Efficiency vs Vehicle Weight",
subtitle = "Heavier cars consistently have lower MPG",
x = "Weight (1000 lbs)", y = "Miles Per Gallon",
color = "Cylinders", shape = "Transmission") +
theme_minimal() + scale_color_brewer(palette = "Set1")# Chart 3: Transmission comparison
ggplot(mtcars, aes(x = am, y = mpg, fill = am)) +
geom_boxplot(alpha = 0.7) + geom_jitter(width = 0.1, alpha = 0.6, size = 2) +
labs(title = "MPG: Automatic vs Manual Transmission",
subtitle = "Manual cars tend to be more fuel-efficient",
x = "Transmission", y = "Miles Per Gallon") +
theme_classic() + theme(legend.position = "none") +
  scale_fill_manual(values = c("#3498db", "#e74c3c"))
# STEP 5: STATISTICAL MODELING — build and compare models
m1 <- lm(mpg ~ wt, data = mtcars)
m2 <- lm(mpg ~ wt + hp, data = mtcars)
m3 <- lm(mpg ~ wt + hp + cyl + am, data = mtcars)
cat("=== Model Comparison ===\n")#> === Model Comparison ===
#> Model 1: mpg ~ wt R2: 0.753 AIC: 166.0
cat(sprintf("%-35s R2: %.3f AIC: %.1f\n", "Model 2: mpg ~ wt + hp",
            summary(m2)$r.squared, AIC(m2)))
#> Model 2: mpg ~ wt + hp              R2: 0.827 AIC: 156.7
cat(sprintf("%-35s R2: %.3f AIC: %.1f\n", "Model 3: mpg ~ wt + hp + cyl + am",
            summary(m3)$r.squared, AIC(m3)))
#> Model 3: mpg ~ wt + hp + cyl + am   R2: 0.866 AIC: 154.5
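Because the three models are nested (each adds predictors to the previous one), a formal comparison is possible. This sketch uses anova() to F-test whether each addition significantly improves the fit:
anova(m1, m2, m3)   # sequential F-tests between nested models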
cat("\n=== Best Model Summary ===\n")
#> 
#> === Best Model Summary ===
summary(m3)
#> 
#> Call:
#> lm(formula = mpg ~ wt + hp + cyl + am, data = mtcars)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -3.9387 -1.2560 -0.4013 1.1253 5.0513
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 33.70832 2.60489 12.940 7.73e-13 ***
#> wt -2.49683 0.88559 -2.819 0.00908 **
#> hp -0.03211 0.01369 -2.345 0.02693 *
#> cyl6 -3.03134 1.40728 -2.154 0.04068 *
#> cyl8 -2.16368 2.28425 -0.947 0.35225
#> amManual 1.80921 1.39630 1.296 0.20646
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 2.41 on 26 degrees of freedom
#> Multiple R-squared: 0.8659, Adjusted R-squared: 0.8401
#> F-statistic: 33.57 on 5 and 26 DF, p-value: 1.506e-10
#> KEY FINDINGS
#> ============
#> 1. WEIGHT is the strongest predictor:
#> Each additional 1000 lbs reduces MPG by ~ 2.5 miles
#> 2. CYLINDER COUNT matters significantly:
cat(" 4-cyl:", round(mean(mtcars$mpg[mtcars$cyl==4]), 1), "MPG |",
"6-cyl:", round(mean(mtcars$mpg[mtcars$cyl==6]), 1), "MPG |",
"8-cyl:", round(mean(mtcars$mpg[mtcars$cyl==8]), 1), "MPG\n\n")#> 4-cyl: 26.7 MPG | 6-cyl: 19.7 MPG | 8-cyl: 15.1 MPG
#> 3. TRANSMISSION effect:
auto_mpg <- mean(mtcars$mpg[mtcars$am == "Automatic"])
manual_mpg <- mean(mtcars$mpg[mtcars$am == "Manual"])
cat(" Manual cars average", round(manual_mpg, 1), "MPG vs",
    round(auto_mpg, 1), "MPG for automatic\n\n")
#> Manual cars average 24.4 MPG vs 17.1 MPG for automatic
#> 4. MODEL PERFORMANCE:
#> Combined model explains 86.6 % of MPG variation
#> RECOMMENDATION: To maximize fuel efficiency, prioritize lighter
#> vehicles with fewer cylinders. Weight is the dominant factor.
| Category | Practice | Why It Matters |
|---|---|---|
| Organization | Use RStudio Projects | Keeps all files together; makes file paths reliable and portable |
| Reproducibility | Set set.seed() before random operations | Anyone can reproduce your exact results |
| Code Style | Use snake_case for variable names | Community standard; consistent and readable |
| Documentation | Comment the why, not just the what | Future you and colleagues will understand the reasoning |
| Data Safety | Never overwrite raw data files | Always save cleaned data to a new file; preserve the original |
| Version Control | Use Git for all R projects | Track changes; revert mistakes; enable collaboration |
| Performance | Prefer vectorized operations over loops | 10-100x faster for large datasets |
| Packages | Put all library() calls at the top of scripts | Clear dependencies; easy for others to install what is needed |
| Reporting | Use R Markdown for final analyses | Code + results + narrative in one reproducible document |
| Validation | Always inspect results for sanity | Check that outputs make sense before sharing or acting on them |
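One practice from the table is worth seeing in action. A minimal sketch of how set.seed() makes "random" results reproducible:
set.seed(123); runif(3)   # three pseudo-random numbers
set.seed(123); runif(3)   # resetting the seed reproduces them exactly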
Beginner (Months 1-2)
- Master base R: vectors, data frames, functions, control flow
- Practice tidyverse: dplyr + ggplot2 daily
- Work through built-in datasets (iris, mtcars, gapminder)
Intermediate (Months 3-4)
- Statistical modeling: regression, ANOVA, hypothesis testing
- R Markdown for reproducible reports
- Real datasets from Kaggle or TidyTuesday weekly challenge
Advanced (Months 5-6+)
- Machine learning: caret, tidymodels
- Interactive dashboards: Shiny
- Big data: data.table, arrow, sparklyr
- Functional programming: purrr
| Resource | URL | What It Offers |
|---|---|---|
| R for Data Science (free book) | r4ds.had.co.nz | Complete tidyverse guide by Hadley Wickham |
| TidyTuesday | github.com/rfordatascience/tidytuesday | Weekly real-world datasets to practice |
| R-bloggers | r-bloggers.com | Community tutorials and news |
| RStudio Cheatsheets | posit.co/resources/cheatsheets | Quick reference cards for all major packages |
| Stack Overflow | stackoverflow.com/questions/tagged/r | Q&A for specific problems |
| CRAN Task Views | cran.r-project.org/web/views | Packages organized by topic area |
# Always include session info at the end of an analysis
# This documents exactly which R version and package versions were used,
# so that others can reproduce your results.
sessionInfo()
#> R version 4.6.0 (2026-04-24 ucrt)
#> Platform: x86_64-w64-mingw32/x64
#> Running under: Windows 11 x64 (build 26100)
#>
#> Matrix products: default
#> LAPACK version 3.12.1
#>
#> locale:
#> [1] LC_COLLATE=English_Rwanda.utf8 LC_CTYPE=English_Rwanda.utf8
#> [3] LC_MONETARY=English_Rwanda.utf8 LC_NUMERIC=C
#> [5] LC_TIME=English_Rwanda.utf8
#>
#> time zone: Africa/Kigali
#> tzcode source: internal
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] caret_7.0-1 lattice_0.22-9 data.table_1.18.2.1
#> [4] ggplot2_4.0.3 stringr_1.6.0 tidyr_1.3.2
#> [7] dplyr_1.2.1
#>
#> loaded via a namespace (and not attached):
#> [1] gtable_0.3.6 xfun_0.57 bslib_0.10.0
#> [4] recipes_1.3.2 vctrs_0.7.3 tools_4.6.0
#> [7] generics_0.1.4 stats4_4.6.0 parallel_4.6.0
#> [10] proxy_0.4-29 tibble_3.3.1 ModelMetrics_1.2.2.2
#> [13] pkgconfig_2.0.3 Matrix_1.7-5 RColorBrewer_1.1-3
#> [16] S7_0.2.2 lifecycle_1.0.5 compiler_4.6.0
#> [19] farver_2.1.2 codetools_0.2-20 htmltools_0.5.9
#> [22] class_7.3-23 sass_0.4.10 yaml_2.3.12
#> [25] prodlim_2026.03.11 pillar_1.11.1 jquerylib_0.1.4
#> [28] MASS_7.3-65 cachem_1.1.0 gower_1.0.2
#> [31] iterators_1.0.14 rpart_4.1.27 foreach_1.5.2
#> [34] nlme_3.1-169 parallelly_1.47.0 lava_1.9.0
#> [37] tidyselect_1.2.1 digest_0.6.39 stringi_1.8.7
#> [40] future_1.70.0 reshape2_1.4.5 purrr_1.2.2
#> [43] listenv_0.10.1 labeling_0.4.3 splines_4.6.0
#> [46] fastmap_1.2.0 grid_4.6.0 cli_3.6.6
#> [49] magrittr_2.0.5 survival_3.8-6 utf8_1.2.6
#> [52] e1071_1.7-17 future.apply_1.20.2 withr_3.0.2
#> [55] scales_1.4.0 lubridate_1.9.5 timechange_0.4.0
#> [58] rmarkdown_2.31 globals_0.19.1 nnet_7.3-20
#> [61] timeDate_4052.112 evaluate_1.0.5 knitr_1.51
#> [64] hardhat_1.4.3 mgcv_1.9-4 rlang_1.2.0
#> [67] Rcpp_1.1.1-1.1 glue_1.8.1 pROC_1.19.0.1
#> [70] ipred_0.9-15 rstudioapi_0.18.0 jsonlite_2.0.0
#> [73] R6_2.6.1 plyr_1.8.9
End of Course — You now have the complete foundation to tackle real-world data analysis with R. Keep practicing, stay curious, and always ask what your data is telling you. Happy coding!