Code
my_number <- 42
my_text <- "NISR"
my_logical <- TRUE
class(my_number)[1] "numeric"
Code
class(my_text)[1] "character"
Code
class(my_logical)[1] "logical"
Practical Notes and Exercises
Welcome to the Intermediate R Programming Training.
This course is designed for participants registered for intermediate R training, focusing on data cleaning, manipulation, and visualization in R.
R is a powerful, free software environment for statistical computing and graphics. It is widely used among statisticians, data analysts and data scientists for different purposes.
Why R?
ggplot2, R can produce publication-quality graphics.RStudio IDE
We will be using RStudio, which is an Integrated Development Environment (IDE) for R. It provides a user-friendly interface with four main panels:
1. Source Editor (Top-Left): Where you write your R scripts and .Rmd files.
2. Console (Bottom-Left): Where you can type and execute R code directly.
3. Environment/History (Top-Right): Shows the objects (data, variables) you have created.
4. Files/Plots/Packages/Help (Bottom-Right): Helps you manage files, view plots, install packages, and get help.
R has several basic data types. The most common ones are: * numeric: For numbers (e.g. 10.5, 55). * character: For text (strings), enclosed in quotes (e.g., "hello", 'Rwanda'). * logical: For TRUE or FALSE values.
You can check the type of a variable with the class() function.
my_number <- 42
my_text <- "NISR"
my_logical <- TRUE
class(my_number)[1] "numeric"
class(my_text)[1] "character"
class(my_logical)[1] "logical"
The most fundamental data structure in R is the vector.
A vector is like a single column of data that contains elements all of the same type (all numbers, all text, or all logical values).
You can create a vector using the c() function, which stands for combine.
# A numeric vector (numbers only)
provinces_population <- c(26, 24, 24, 17, 9)
# A character vector (text only)
provinces_names <- c("Southern", "Western", "Eastern", "Northern", "Kigali City")
# Print the vectors
provinces_population[1] 26 24 24 17 9
provinces_names[1] "Southern" "Western" "Eastern" "Northern" "Kigali City"
Vectors are vectorized in R — this means you can perform arithmetic on all elements at once. Operations happen element-by-element.
a <- c(1, 2, 3)
b <- c(10, 20, 30)
# Addition (adds 1st to 1st, 2nd to 2nd, etc.)
a + b[1] 11 22 33
# [1] 11 22 33
# Multiplication
a * b[1] 10 40 90
# [1] 10 40 90If one vector is shorter than the other, R will recycle (repeat) elements of the shorter vector until it matches the length of the longer one.
a <- c(1, 2, 3, 4)
b <- c(10, 20)
a + b [1] 11 22 13 24
# First recycle b: (10, 20, 10, 20)
# Result: (1+10, 2+20, 3+10, 4+20)if the longer vector’s length is not a multiple of the shorter vector’s length, R will warn you.
You can use square brackets [ ] to select elements from a vector.
# Get the first province
provinces_names[1][1] "Southern"
# Get the 2nd and 4th populations
provinces_population[c(2, 4)][1] 24 17
# Exclude the 3rd element
provinces_names[-3][1] "Southern" "Western" "Northern" "Kigali City"
# Change the population of the first province
provinces_population[1] <- 27
# Add a new province (creates a longer vector)
provinces_population <- c(provinces_population, 8)The most important data structure for data analysis in R is the data frame.
A data frame is like:
A spreadsheet in Excel
A table in a database
A tibble in tidyverse (we’ll see later)
It is a two-dimensional table:
Rows = observations/records
Columns = variables/features
Each column can have a different type (numeric, character, logical, etc.)
You can create a data frame using the data.frame() function.
# Create a simple data frame
employee_data <- data.frame(
id = c(1, 2, 3), # Numeric column
name = c("John", "Jane", "Peter"), # Character column
salary = c(50000, 55000, 52000) # Numeric column
)
employee_data id name salary
1 1 John 50000
2 2 Jane 55000
3 3 Peter 52000
# View structure
str(employee_data)'data.frame': 3 obs. of 3 variables:
$ id : num 1 2 3
$ name : chr "John" "Jane" "Peter"
$ salary: num 50000 55000 52000
# View summary statistics
summary(employee_data) id name salary
Min. :1.0 Length:3 Min. :50000
1st Qu.:1.5 Class :character 1st Qu.:51000
Median :2.0 Mode :character Median :52000
Mean :2.0 Mean :52333
3rd Qu.:2.5 3rd Qu.:53500
Max. :3.0 Max. :55000
There are several ways to access parts of a data frame:
# Access a column by $ name
employee_data$name[1] "John" "Jane" "Peter"
# Access by index (row, column)
employee_data[1, 2] # Row 1, column 2[1] "John"
employee_data[ , "salary"] # All rows, salary column[1] 50000 55000 52000
# Multiple columns
employee_data[ , c("name", "salary")] name salary
1 John 50000
2 Jane 55000
3 Peter 52000
# Multiple rows
employee_data[1:2, ] id name salary
1 1 John 50000
2 2 Jane 55000
# Add a new column
employee_data$department <- c("HR", "Finance", "IT")
# Add a new row
new_row <- data.frame(id=4, name="Alice", salary=60000, department="Marketing")
employee_data <- rbind(employee_data, new_row)In practice, we don’t type all our data by hand — we read it from files:
# Read CSV file
my_data <- read.csv("Data.csv")
# Read Excel file
# install.packages("readxl")
library(readxl)
my_data <- read_excel("Data.xlsx")One of R’s biggest strengths is its community-contributed packages.
A package is like an app for R — it contains:
Functions (tools to do specific tasks)
Data sets (ready-to-use examples)
Documentation (help files and guides)
Before using a package, you install it once on your computer (like downloading an app from an app store).
# Install a package (done only once, unless you update R or reinstall)
#install.packages("haven")
#install.packages("tidyverse")Every time you start a new R session (or reopen RStudio), you must load the package to use it./
# Load packages for use in this session
library(haven) # For reading Stata, SPSS, SAS files
library(tidyverse) # For data manipulation & visualizationhaven and tidyverse?haven: Reads data from Stata (.dta), SPSS, and SAS files while preserving labels.
tidyverse: A collection of packages for data manipulation (dplyr), visualization (ggplot2), data tidying (tidyr), and more.
# Check installed packages
# installed.packages()
# Or quickly check one package
"haven" %in% rownames(installed.packages())[1] TRUE
Updating Packages
# Update all installed packages
# update.packages()One of the first steps in any data analysis project is getting your data into R.
R can read many file formats: .csv, .xlsx, .dta, .sav, .json, and more.
In this course, we’ll start with a Stata (.dta) file.
We’ll use read_dta() from the haven package (which is already loaded) to import it.
# If the file is in your working directory, you can just write its name. Otherwise, provide the full path or a relative path.
file_path <- "cs_s1_s2_s3_s4_s6a_s6e_s6f_person.dta"
# Read the Stata file into R
df <- read_dta(file_path)
# View the first few rows
head(df)File not found
Check your working directory with getwd()
Set it with setwd("path/to/your/folder")
Wrong slashes in path
/ instead of \ or double them: "C:/Users/..."Package not loaded
library(haven) is run before read_dta().1. CSV Files
Most widely used format. Use read_csv() from the readr package (part of tidyverse):
df_csv <- read_csv("Data.csv")
head(df_csv)2. Excel Files
For .xlsx or .xls, use read_excel() from the readxl package:
library(readxl)
df_excel <- read_excel("Data.xlsx")
head(df_excel)3. SPSS Files (.sav)
Also supported by haven:
df_spss <- read_sav("RWPR81FL.SAV")
head(df_spss)# Save as CSV
write_csv(Your_df, "cleaned_data.csv")
# Save as Excel (needs openxlsx or writexl package)
library(writexl)
write_xlsx(Your_df, "cleaned_data.xlsx")
# Save as Stata (.dta) – needs haven
library(haven)
write_dta(Your_df, "cleaned_data.dta")
# Save as SPSS (.sav) – also haven
write_sav(Your_df, "cleaned_data.sav")
# Save as RDS (R's native format, keeps everything)
saveRDS(Your_df, "cleaned_data.rds")
# To load back RDS
df_loaded <- readRDS("cleaned_data.rds")After loading your dataset, the next step is to understand what’s inside it — its size, structure, and basic patterns.
This step helps you decide how to clean and analyze the data.
Here are some R equivalents of the common pandas commands from Python:
| pandas in Python | R equivalent | What it does |
|---|---|---|
df.head() |
head(df) |
Shows the first few rows (default: 6) |
df.shape |
dim(df) |
Shows the number of rows and columns |
df.info() |
str(df) |
Gives the structure: column names, types, and a preview of values |
df.describe() |
summary(df) |
Summary statistics for each column |
See column names
names(df)Check the number of rows only
nrow(df)[1] 66081
Check the number of columns only
ncol(df)[1] 127
Check the first few rows of a specific column
head(df$region)<labelled<double>[6]>: Region
[1] 1 1 1 1 1 1
Labels:
value label
1 Kigali City
2 Other Urban
3 Rural Southern
4 Rural Western
5 Rural Northern
6 Rural Eastern
View unique values in a column
unique(df$province)<labelled<double>[5]>: Province
[1] 1 2 3 4 5
Labels:
value label
1 Kigali City
2 Southern Province
3 Western Province
4 Northern Province
5 Eastern Province
When working with data, one of the most common tasks is to pick the columns you need, rename them for clarity, and filter rows based on conditions.
In Base R, you can do all of this without additional packages — although later we’ll also learn the tidyverse way.
You can select columns from a data frame by name or by their index number.
Example: Selecting specific columns by name
# Select important columns for our analysis
df_selected <- subset(
df,
select = c(hhid, province, district, ur2_2012, quintile, poverty, Consumption, s1q1, s1q4, s3q3, s1q3y, s4aq3, s4aq6a)
)
# Preview the new data frame
head(df_selected)# A tibble: 6 × 13
hhid province district ur2_2012 quintile poverty Consumption s1q1
<dbl> <dbl+lbl> <dbl+lbl> <dbl+lb> <dbl+lb> <dbl> <dbl> <dbl+l>
1 100001 1 [Kigali City] 11 [Nyar… 1 [Urba… 4 [Q4] 3 305391. 2 [Fem…
2 100001 1 [Kigali City] 11 [Nyar… 1 [Urba… 4 [Q4] 3 305391. 1 [Mal…
3 100001 1 [Kigali City] 11 [Nyar… 1 [Urba… 4 [Q4] 3 305391. 1 [Mal…
4 100001 1 [Kigali City] 11 [Nyar… 1 [Urba… 4 [Q4] 3 305391. 2 [Fem…
5 100001 1 [Kigali City] 11 [Nyar… 1 [Urba… 4 [Q4] 3 305391. 1 [Mal…
6 100002 1 [Kigali City] 11 [Nyar… 1 [Urba… 5 [Q5] 3 372962. 1 [Mal…
# ℹ 5 more variables: s1q4 <dbl+lbl>, s3q3 <dbl+lbl>, s1q3y <dbl+lbl>,
# s4aq3 <dbl+lbl>, s4aq6a <dbl+lbl>
Clear, descriptive column names make your analysis easier to follow.
You can change column names one at a time or all at once.
Example: Renaming specific columns
# Rename specific columns
names(df_selected)[names(df_selected) == "s1q1"] <- "Sex"
names(df_selected)[names(df_selected) == "s1q4"] <- "Marital_status"
names(df_selected)[names(df_selected) == "s3q3"] <- "Health_insurance"
names(df_selected)[names(df_selected) == "s1q3y"] <- "Age"
names(df_selected)[names(df_selected) == "s4aq3"] <- "Diploma_obtained"
names(df_selected)[names(df_selected) == "s4aq6a"] <- "Class_attended_2012"
# Check new names
names(df_selected) [1] "hhid" "province" "district"
[4] "ur2_2012" "quintile" "poverty"
[7] "Consumption" "Sex" "Marital_status"
[10] "Health_insurance" "Age" "Diploma_obtained"
[13] "Class_attended_2012"
Renaming all columns at once
names(df_selected) <- c("HHID", "Province", "District", "UrbanRural",
"Quintile", "PovertyStatus", "Consumption",
"Sex", "Marital_status", "Health_insurance",
"Age", "Diploma_obtained", "Class_attended_2012")You can filter rows by specifying conditions.
Example: Filtering by category (Province)
lets first convert province names to actual names
# Check the structure of the variable
str(df_selected$Province) dbl+lbl [1:66081] 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
@ label : chr "Province"
@ format.stata: chr "%10.0g"
@ labels : Named num [1:5] 1 2 3 4 5
..- attr(*, "names")= chr [1:5] "Kigali City" "Southern Province" "Western Province" "Northern Province" ...
# Convert to factor with labels
#df_selected$Province <- factor(df_selected$Province,
# levels = c(1, 2, 3, 4, 5),
# labels = c("Kigali City",
# "Southern Province",
# "Western Province",
# "Northern Province",
# "Eastern Province"))
# Convert province from numeric codes to labels
df_selected$Province <- as_factor(df_selected$Province)
# Now the variable is readable
table(df_selected$Province)
Kigali City Southern Province Western Province Northern Province
6156 17211 15959 11071
Eastern Province
15684
then we can filter by category,
# Only rows for Northern Province
northern_province_data <- subset(df_selected, Province == "Northern Province")
# Rows for Kigali City OR Northern Province
kigali_and_north <- subset(df_selected, Province == "Kigali City" | Province == "Northern Province")
# Cleaner way: using %in%
provinces_to_select <- c("Kigali City", "Northern Province")
kigali_and_north_alt <- subset(df_selected, Province %in% provinces_to_select)Example: Filtering by numeric condition
# Only people aged 16 and above
age_above_16 <- subset(df_selected, Age >= 16)| Task | Code Example | Notes |
|---|---|---|
| Select columns by name | df[, c("col1", "col2")] |
Uses vector of names |
| Select columns by index | df[, c(1, 3, 5)] |
Index starts at 1 |
Select with subset() |
subset(df, select = c(col1, col2)) |
More readable |
| Rename one column | names(df)[names(df) == "old"] <- "new" |
Match name exactly |
| Rename all columns | names(df) <- c("new1", "new2", "new3") |
Order matters |
| Filter rows (one value) | subset(df, col == "Value") |
Exact match |
| Filter rows (multiple) | subset(df, col %in% c("A", "B")) |
%in% for multiple |
| Filter numeric | subset(df, age >= 18) |
Works with >, <, >=, <= |
| Multiple conditions (AND) | subset(df, col1 == "A" & col2 > 10) |
Both must be TRUE |
| Multiple conditions (OR) | subset(df, col1 == "A" | col2 > 10) |
Either can be TRUE |
Without subset() |
df[df$col == "Value", ] |
Base R indexing |
Categorical variables in R are often stored as factors.
A factor is a special type of vector that can only contain predefined values called levels.
They are memory efficient for repeated text values.
They are useful for modeling (e.g., regression, classification) because they tell R which variables are categorical.
They also make plots more meaningful by ordering categories automatically.
Basic Factor Operations
# 1 Check the class of 'province'
# haven (used for reading Stata, SPSS, SAS) often reads categories as 'labelled'
class(df_selected$Province)[1] "factor"
# 2 Convert 'province' to a factor
# df_selected$Province <- as.factor(df_selected$Province)
# 3 See the levels (unique categories)
levels(df_selected$Province)[1] "Kigali City" "Southern Province" "Western Province"
[4] "Northern Province" "Eastern Province"
More Factor Tasks
# Reorder factor levels (manually)
df_selected$Province <- factor(df_selected$Province,
levels = c("Kigali City", "Northern Province",
"Southern Province", "Eastern Province",
"Western Province"))
# Rename factor levels
levels(df_selected$Province) <- c("Kigali", "North", "South", "East", "West")
# Count the number of occurrences of each level
table(df_selected$Province)
Kigali North South East West
6156 11071 17211 15684 15959
# Check if a column is a factor
is.factor(df_selected$Province) # TRUE / FALSE[1] TRUE
Frequency tables help us summarize how many times each category appears in a dataset.
They are essential for understanding distributions before doing further analysis.
# Count how many observations are in each province
province_table <- table(df_selected$Province)
province_table
Kigali North South East West
6156 11071 17211 15684 15959
# Convert counts into proportions (relative frequencies)
province_table_pc <- prop.table(province_table) * 100
province_table_pc
Kigali North South East West
9.31584 16.75368 26.04531 23.73451 24.15066
# Convert province from numeric codes to labels
df_selected$Sex <- as_factor(df_selected$Sex)
# Cross-tabulation of Province by Sex
prov_sex_table <- table(df_selected$Province, df_selected$Sex)
prov_sex_table
Male Female Missing
Kigali 3008 3148 0
North 5290 5781 0
South 8165 9046 0
East 7533 8151 0
West 7548 8411 0
What this shows:
province and Sex.# Row percentages: within each province, what % are male/female?
prop.table(prov_sex_table, margin = 1) * 100
Male Female Missing
Kigali 48.86290 51.13710 0.00000
North 47.78249 52.21751 0.00000
South 47.44059 52.55941 0.00000
East 48.02984 51.97016 0.00000
West 47.29620 52.70380 0.00000
Example Output Interpretation
| Province | Male | Female |
|---|---|---|
| Kigali City | 48.86% | 51.13% |
| Northern Province | 47.78% | 52.21% |
| … | … | … |
This means that in Kigali City, 48.86% of respondents are male, 51.13% are female.
Sometimes we need to calculate summary statistics for groups in the data.
In Base R, the aggregate() function is a quick way to do this.
aggregate(formula, data, FUN)formula: y ~ x means “calculate something for y grouped by x”.
data: the data frame where the variables are stored.
FUN: the function to apply (e.g., mean, sum, max, min).
Example – Average Household Consumption by Province
# Average household consumption for each province
avg_consumption_by_province <- aggregate(
Consumption ~ Province,
data = df_selected,
FUN = mean
)
avg_consumption_by_province Province Consumption
1 Kigali 630632.5
2 North 237683.3
3 South 275067.9
4 East 264665.3
5 West 256286.5
Other Useful Variations
# Median consumption by province
aggregate(Consumption ~ Province, data = df_selected, FUN = median) Province Consumption
1 Kigali 339105.2
2 North 169252.5
3 South 189248.8
4 East 188566.6
5 West 170559.5
# Total consumption by province
aggregate(Consumption ~ Province, data = df_selected, FUN = sum) Province Consumption
1 Kigali 3882173609
2 North 2631391853
3 South 4734192910
4 East 4151010371
5 West 4090075625
# Average age by province
aggregate(Age ~ Province, data = df_selected, FUN = mean) Province Age
1 Kigali 22.43080
2 North 23.35146
3 South 24.02353
4 East 22.23495
5 West 22.50649
The tidyverse is a collection of R packages built specifically for data science tasks such as data cleaning, transformation, visualization, and reporting.
All packages in the tidyverse follow a consistent design philosophy, making it easier to learn and switch between them.
Some of the most popular tidyverse packages include:
| Package | Purpose |
|---|---|
| dplyr | Data manipulation (filtering, selecting, summarizing, etc.) |
| ggplot2 | Data visualization |
| tidyr | Reshaping and tidying data |
| readr | Reading CSV and other text files |
| stringr | Working with strings |
| forcats | Working with factors |
| lubridate | Working with dates and times |
More readable and concise syntax compared to base R.
Functions are designed to chain together naturally
Works well with the pipe operator for a step-by-step approach.
%>%One of the most useful features in tidyverse workflows is the pipe operator (%>%), provided by the magrittr package (automatically loaded with tidyverse).
The pipe lets you write code like a sentence, where the output of one function becomes the input of the next.
Without Pipes (Base R Style)
head(subset(df_selected, Province == "Kigali"))# A tibble: 6 × 13
HHID Province District UrbanRural Quintile PovertyStatus Consumption Sex
<dbl> <fct> <dbl+lbl> <dbl+lbl> <dbl+lb> <dbl> <dbl> <fct>
1 100001 Kigali 11 [Nyaru… 1 [Urban] 4 [Q4] 3 305391. Fema…
2 100001 Kigali 11 [Nyaru… 1 [Urban] 4 [Q4] 3 305391. Male
3 100001 Kigali 11 [Nyaru… 1 [Urban] 4 [Q4] 3 305391. Male
4 100001 Kigali 11 [Nyaru… 1 [Urban] 4 [Q4] 3 305391. Fema…
5 100001 Kigali 11 [Nyaru… 1 [Urban] 4 [Q4] 3 305391. Male
6 100002 Kigali 11 [Nyaru… 1 [Urban] 5 [Q5] 3 372962. Male
# ℹ 5 more variables: Marital_status <dbl+lbl>, Health_insurance <dbl+lbl>,
# Age <dbl+lbl>, Diploma_obtained <dbl+lbl>, Class_attended_2012 <dbl+lbl>
This can be harder to read because operations are nested inside each other.
library(dplyr)
df_selected %>%
filter(Province == "Kigali") %>%
tail()# A tibble: 6 × 13
HHID Province District UrbanRural Quintile PovertyStatus Consumption Sex
<dbl> <fct> <dbl+lbl> <dbl+lbl> <dbl+lb> <dbl> <dbl> <fct>
1 103079 Kigali 13 [Kicuk… 1 [Urban] 5 [Q5] 3 543027. Fema…
2 103079 Kigali 13 [Kicuk… 1 [Urban] 5 [Q5] 3 543027. Fema…
3 103079 Kigali 13 [Kicuk… 1 [Urban] 5 [Q5] 3 543027. Fema…
4 103079 Kigali 13 [Kicuk… 1 [Urban] 5 [Q5] 3 543027. Male
5 103080 Kigali 13 [Kicuk… 1 [Urban] 5 [Q5] 3 418005. Male
6 103080 Kigali 13 [Kicuk… 1 [Urban] 5 [Q5] 3 418005. Fema…
# ℹ 5 more variables: Marital_status <dbl+lbl>, Health_insurance <dbl+lbl>,
# Age <dbl+lbl>, Diploma_obtained <dbl+lbl>, Class_attended_2012 <dbl+lbl>
Here, each step is on its own line, making it easier to read and debug.
The pipe takes the result on the left and passes it as the first argument to the function on the right.
For example:
# This:
x %>% f(y)
# Is equivalent to:
f(x, y)dplyrThe dplyr package is one of the core tidyverse tools for data manipulation.
It provides a set of intuitive “verbs” that let you describe what you want to do with your data, without worrying too much about how R does it internally.
Here are the most important verbs in dplyr:
select() – Choose which columns to keep or reorder.
rename() – Rename columns to something more meaningful.
filter() – Select rows that meet certain conditions.
mutate() – Add new columns or modify existing ones.
arrange() – Sort rows in ascending or descending order.
group_by() – Split the dataset into groups.
summarise() – Collapse each group into a single summary row.
Example: Selecting, Renaming, and Filtering in One Chain
# Load the tidyverse library
library(tidyverse)
library(haven)
df$province <- as.factor(df$province)
# Now province shows names, not just numbers
table(df$province)
1 2 3 4 5
6156 17211 15959 11071 15684
# Select certain columns, rename some, and filter rows for Northern Province
analysis_df_dplyr <- df %>%
select(
hhid, # Household ID
province, # Province
district, # District
ur2_2012, # Urban/Rural classification
quintile, # Wealth quintile
poverty, # Poverty status
Consumption, # Household consumption
Sex = s1q1, # Rename s1q1 to 'Sex'
Marital_status = s1q4, # Rename s1q4 to 'Marital_status'
Health_insurance = s3q3,# Rename s3q3 to 'Health_insurance'
Age = s1q3y # Rename s1q3y to 'Age'
) %>%
filter(province == "2") # Keep only rows from Northern Province
# Preview the result
head(analysis_df_dplyr)# A tibble: 6 × 11
hhid province district ur2_2012 quintile poverty Consumption Sex
<dbl> <fct> <dbl+lbl> <dbl+lbl> <dbl+lbl> <dbl> <dbl> <dbl+lbl>
1 100393 2 21 [Nyanza] 2 [Rural] 5 [Q5] 3 2384791. 1 [Male]
2 100393 2 21 [Nyanza] 2 [Rural] 5 [Q5] 3 2384791. 2 [Female]
3 100394 2 21 [Nyanza] 2 [Rural] 3 [Q3] 3 172956. 2 [Female]
4 100394 2 21 [Nyanza] 2 [Rural] 3 [Q3] 3 172956. 2 [Female]
5 100394 2 21 [Nyanza] 2 [Rural] 3 [Q3] 3 172956. 2 [Female]
6 100395 2 21 [Nyanza] 2 [Rural] 4 [Q4] 3 266837. 1 [Male]
# ℹ 3 more variables: Marital_status <dbl+lbl>, Health_insurance <dbl+lbl>,
# Age <dbl+lbl>
What’s Happening Here?
select() picks only the columns we need (and can rename them at the same time).
filter() keeps only the rows where province equals "Northern Province".
The pipe %>% sends the result of each step into the next one.
The output is a cleaner, smaller data frame ready for analysis.
dplyrIn base R, we used the aggregate() function to get averages.
In dplyr, we combine:
group_by() – to split the data into groups
summarise() – to calculate summary statistics for each group
Example: Average Consumption by Province
# Calculate average consumption by province
avg_cons_dplyr <- df_selected %>%
group_by(Province) %>%
summarise(
mean_consumption = mean(Consumption, na.rm = TRUE) # ignore NA values
)
avg_cons_dplyr# A tibble: 5 × 2
Province mean_consumption
<fct> <dbl>
1 Kigali 630632.
2 North 237683.
3 South 275068.
4 East 264665.
5 West 256286.
How This Works
group_by(province)
Creates groups for each province in the dataset.
summarise(mean_consumption = mean(Consumption, na.rm = TRUE))
Calculates the mean consumption for each group.
The result is a condensed table with one row per province.
Key Differences from Base R’s aggregate()
dplyr syntax is more readable and works naturally with pipes.
You can summarise multiple variables at once.
Works seamlessly with other tidyverse tools.
With summarise(), you can calculate several statistics at the same time.
# Summarise with multiple statistics
province_summary <- df_selected %>%
group_by(Province) %>%
summarise(
mean_consumption = mean(Consumption, na.rm = TRUE),
median_consumption = median(Consumption, na.rm = TRUE),
households = n(), # Count rows in each group
.groups = "drop" # Remove grouping in the output
)
province_summary# A tibble: 5 × 4
Province mean_consumption median_consumption households
<fct> <dbl> <dbl> <int>
1 Kigali 630632. 339105. 6156
2 North 237683. 169252. 11071
3 South 275068. 189249. 17211
4 East 264665. 188567. 15684
5 West 256286. 170560. 15959
mutate()mutate() is perfect for creating calculated columns without altering the original data.
# Add a new column for daily consumption
df_with_daily <- df_selected %>%
mutate(
daily_consumption = Consumption / 365 # Assuming annual data
)
head(df_with_daily)# A tibble: 6 × 14
HHID Province District UrbanRural Quintile PovertyStatus Consumption Sex
<dbl> <fct> <dbl+lbl> <dbl+lbl> <dbl+lb> <dbl> <dbl> <fct>
1 100001 Kigali 11 [Nyaru… 1 [Urban] 4 [Q4] 3 305391. Fema…
2 100001 Kigali 11 [Nyaru… 1 [Urban] 4 [Q4] 3 305391. Male
3 100001 Kigali 11 [Nyaru… 1 [Urban] 4 [Q4] 3 305391. Male
4 100001 Kigali 11 [Nyaru… 1 [Urban] 4 [Q4] 3 305391. Fema…
5 100001 Kigali 11 [Nyaru… 1 [Urban] 4 [Q4] 3 305391. Male
6 100002 Kigali 11 [Nyaru… 1 [Urban] 5 [Q5] 3 372962. Male
# ℹ 6 more variables: Marital_status <dbl+lbl>, Health_insurance <dbl+lbl>,
# Age <dbl+lbl>, Diploma_obtained <dbl+lbl>, Class_attended_2012 <dbl+lbl>,
# daily_consumption <dbl>
This keeps all original columns but adds a new daily_consumption column.
arrange()You can arrange your results ascending or descending.
# Sort provinces from highest to lowest mean consumption
province_summary_sorted <- province_summary %>%
arrange(desc(mean_consumption))
province_summary_sorted# A tibble: 5 × 4
Province mean_consumption median_consumption households
<fct> <dbl> <dbl> <int>
1 Kigali 630632. 339105. 6156
2 South 275068. 189249. 17211
3 East 264665. 188567. 15684
4 West 256286. 170560. 15959
5 North 237683. 169252. 11071
Sometimes you want only certain groups after calculation — for example, provinces with high average consumption.
# Keep only provinces with mean consumption above 265000
high_consumption_provinces <- province_summary %>%
filter(mean_consumption > 265000)
high_consumption_provinces# A tibble: 2 × 4
Province mean_consumption median_consumption households
<fct> <dbl> <dbl> <int>
1 Kigali 630632. 339105. 6156
2 South 275068. 189249. 17211
n() is a special function in dplyr to count rows per group — useful for quick summaries.
# Count households per province
households_per_province <- df_selected %>%
group_by(Province) %>%
summarise(total_households = n())
households_per_province# A tibble: 5 × 2
Province total_households
<fct> <int>
1 Kigali 6156
2 North 11071
3 South 17211
4 East 15684
5 West 15959
Here’s an all-in-one pipeline combining everything:
final_summary <- df_selected %>%
mutate(daily_consumption = Consumption / 365) %>%
group_by(Province) %>%
summarise(
mean_annual = mean(Consumption, na.rm = TRUE),
mean_daily = mean(daily_consumption, na.rm = TRUE),
median_annual = median(Consumption, na.rm = TRUE),
households = n(),
.groups = "drop"
) %>%
arrange(desc(mean_annual)) %>%
filter(mean_annual > 265000)
final_summary# A tibble: 2 × 5
Province mean_annual mean_daily median_annual households
<fct> <dbl> <dbl> <dbl> <int>
1 Kigali 630632. 1728. 339105. 6156
2 South 275068. 754. 189249. 17211
What This Shows:
How to calculate multiple statistics in one step
How to add new calculated columns before summarising
How to sort results and filter summaries
How to count households per group
Data visualization helps you spot patterns, detect anomalies, and communicate findings more effectively.
We’ll explore two main approaches in R:
Base R plotting functions – Quick and simple.
ggplot2 from the tidyverse – More flexible and elegant.
Example: Base R Plots
# Turn off scientific notation
options(scipen = 999)
# Histogram of Consumption
hist(df_selected$Age,
main = "Distribution of Age",
xlab = "Age",
col = "skyblue", border = "white")# Boxplot of Consumption by Province
boxplot(Consumption ~ Province,
data = df_selected,
main = "Consumption by Province",
xlab = "Province", ylab = "Consumption",
col = "lightgreen")Example: ggplot2 Plots
library(ggplot2)
# Histogram using ggplot2
ggplot(df_selected, aes(x = Consumption)) +
geom_histogram(binwidth = 5000, fill = "steelblue", color = "green") +
labs(title = "Distribution of Household Consumption", x = "Consumption", y = "Count")# Boxplot using ggplot2
ggplot(df_selected, aes(x = Province, y = Consumption, fill = Province)) +
geom_boxplot() +
labs(title = "Consumption by Province", x = "Province", y = "Consumption") +
theme(legend.position = "none")ggplot2Why ggplot2?
ggplot2 is part of the tidyverse and is based on the Grammar of Graphics.
This means you build a plot layer by layer:
Start with your data and aesthetic mappings (aes()).
Add a geometry (bars, points, lines, etc.).
Add layers for labels, themes, and customization.
Think of it like assembling a sandwich:
Bread → data & aes()
Fillings → geom_*() layer
Sauces & toppings → labels, colors, and themes.
Basic Structure
ggplot(data = <DATA>, aes(x = <X-VARIABLE>, y = <Y-VARIABLE>)) +
geom_<TYPE>() +
other_layersQuick Reference: Common geom_ Functions in ggplot2
| Function | Best for | Example Variables |
|---|---|---|
geom_bar() |
Counts for categorical variables | Province, Sex |
geom_col() |
Bar charts when you already have counts or values | Consumption |
geom_histogram() |
Distribution of a single numeric variable | Consumption |
geom_density() |
Smoothed distribution curve | Consumption, Age |
geom_boxplot() |
Comparing distributions across groups | Consumption by Province |
geom_violin() |
Distribution + density shape | Consumption by PovertyStatus |
geom_point() |
Relationship between two numeric variables | Age vs Consumption |
geom_line() |
Trends over continuous variables (time, distance) | Year vs GDP |
geom_area() |
Filled area charts for trends | Year vs Rainfall |
geom_text() / geom_label() |
Adding text annotations to plots | Data labels for bars or points |
In real-world datasets, missing data is common, it could be due to incomplete surveys, data entry errors, or system failures.
In R, missing values are represented by the special symbol NA.
Understanding how to detect, quantify, and handle missing values is critical to ensure accurate analysis.
We can check for missing values in entire datasets, specific columns, or rows.
# Count missing values in each column
colSums(is.na(df_selected))
# Check if the dataset contains any missing values at all
anyNA(df_selected)
# Count missing values in a specific column
sum(is.na(df_selected$Marital_status))There are several strategies depending on the context:
(A) Removing Missing Values
If the missing data is small and random, you can drop those rows/columns.
# Remove rows with any missing values
df_no_na <- na.omit(df_selected)
# Remove rows with NA in a specific column
df_no_marital_na <- df_selected %>%
filter(!is.na(Marital_status))(B) Replacing Missing Values with a Category or Value
If NA means “Not Applicable” or “Unknown,” we can replace it with a label.
df_clean <- df_selected %>%
mutate(Marital_status = ifelse(is.na(Marital_status),
"Not Applicable",
as.character(Marital_status)))(C) Imputing Numeric Missing Values
Sometimes it’s reasonable to replace missing numbers with mean, median, or a calculated value.
df_imputed <- df_selected %>%
mutate(Consumption = ifelse(is.na(Consumption),
mean(Consumption, na.rm = TRUE),
Consumption))This avoids dropping rows, but be careful: imputation changes your data.
(D) Advanced Missing Data Handling
For more complex datasets, packages like mice or missForest can predict missing values using statistical models.
Example (not running here, just for reference):
library(mice)
imputed <- mice(df_selected, m = 5, method = 'pmm', seed = 123)
df_filled <- complete(imputed, 1)After handling missing data, always confirm the changes:
table(df_clean$Marital_status, useNA = "ifany")
colSums(is.na(df_clean))Outliers are extreme values that deviate significantly from most of the data.
They can occur due to:
Data entry errors (typos, wrong units)
Measurement errors (faulty instruments, recording mistakes).
Genuine extreme values (e.g., very high income households).
The most common way is to use boxplots or summary statistics:
# Basic summary
summary(df_selected$Consumption) Min. 1st Qu. Median Mean 3rd Qu. Max.
7339 125860 187121 294924 297387 12809880
# Boxplot to visualize outliers
ggplot(df_selected, aes(y = Consumption)) +
geom_boxplot(fill = "steelblue") +
labs(title = "Boxplot of Household Consumption")👉 Values beyond the whiskers (1.5 × IQR rule) are considered potential outliers.
Investigate
. If it’s a data entry error, fix or remove it.
. If it’s genuine (e.g., a rich household), keep it but consider robust methods
Remove
. Drop extreme rows, but risk losing valuable data.
Cap (Winsorize)
. Replace outliers with a threshold percentile.
. Useful when you want to reduce extreme influence without removing data.
Here we “cap” values below the 10th percentile and above the 90th percentile:
# Define the 10th and 90th percentiles
lower_bound <- quantile(df_selected$Consumption, 0.10, na.rm = TRUE)
upper_bound <- quantile(df_selected$Consumption, 0.90, na.rm = TRUE)
# Create a new capped column
df_capped <- df_selected %>%
mutate(cons_capped = ifelse(Consumption > upper_bound, upper_bound, Consumption)) %>%
mutate(cons_capped = ifelse(cons_capped < lower_bound, lower_bound, cons_capped))
# Compare boxplots: before vs after
par(mfrow = c(1, 2))
boxplot(df_selected$Consumption, main = "Original", col = "lightblue")
boxplot(df_capped$cons_capped, main = "Capped", col = "lightcoral")👉 Notice how the capped version looks “cleaner,” with fewer extreme values.
Another common approach is to flag outliers based on standard deviations from the mean:
# Compute z-scores
df_selected_z <- df_selected %>%
mutate(z_score = (Consumption - mean(Consumption, na.rm = TRUE)) / sd(Consumption, na.rm = TRUE))
# Filter out rows where z-score is greater than 3 (too extreme)
df_no_outliers_z <- df_selected_z %>%
filter(abs(z_score) < 3)
nrow(df_selected_z); nrow(df_no_outliers_z)[1] 66081
[1] 65245
👉 This keeps only data within ±3 standard deviations.
If your analysis is about inequality, poverty, or income distribution, outliers are important signals.
Removing them could bias results.
In such cases, you may report results with and without outliers to show their effect.
Lets load a messy dataset Data.csv into R:
library(tidyverse)
file_path <- "Data.csv"
data <- read_csv(file_path)
glimpse(data)Rows: 26
Columns: 5
$ ID <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1, 2, 3, 6, 7…
$ sex <chr> "Male", "Male", "Fema", "5", "Female", "Male", "Male…
$ Province <chr> "Souther", "Kiga City", "Kigali City", "Eastern", "N…
$ `Education level` <chr> "pr", "pr", "pr", "Uni", "Uni", "Uni", "Sec", "Sec",…
$ Age <chr> "less than one year", "24", "33", "45", "20", "24", …
ID columnProblems: IDs are repeated (e.g., 1,2,3 appear twice). Some are missing (not continuous).
solution: Create a new unique ID based on row number.
data <- data %>%
mutate(ID = row_number())sex columnProblems: Inconsistent values: Male, Female, Fema, m, F.
unique(data$sex)[1] "Male" "Fema" "5" "Female" "m" "F"
# check with deplyr
#data %>%
# distinct(sex)Solution: Standardize all values to "Male" or "Female".
data <- data %>%
mutate(sex = case_when(
sex %in% c("Male", "m") ~ "Male",
sex %in% c("Female", "F", "Fema") ~ "Female",
sex %in% c("5") ~ NA_character_, # invalid value replaced with NA
TRUE ~ NA_character_ # any other unexpected value
))
# Check results:
table(data$sex, useNA = "ifany")
Female Male <NA>
7 18 1
Province columnProblems: Typos: Souther, Northen, Weastern, Kiga City and Mixed spellings.
Solution: Standardize names properly.
data <- data %>%
mutate(Province = case_when(
str_detect(Province, "Kiga") ~ "Kigali City",
str_detect(Province, "South") | str_detect(Province, "Souther") ~ "Southern Province",
str_detect(Province, "North") | str_detect(Province, "Northen") ~ "Northern Province",
str_detect(Province, "East") ~ "Eastern Province",
str_detect(Province, "West") | str_detect(Province, "Weast") | str_detect(Province, "Weastern") ~ "Western Province",
TRUE ~ Province
))
# check
table(data$Province, useNA = "ifany")
Eastern Province Kigali City Northern Province Southern Province
4 8 4 4
Western Province
6
Education level columnProblems: Mixed categories: pr, Uni, Sec, Nursary, Non educ, blank values. Some are inconsistent abbreviations.
Solution: Map categories to consistent labels.
#Use rename() in dplyr
data <- data %>%
rename(Education_level = `Education level`)
# if you want to clean all column names automatically (good for larger datasets)
#library(janitor)
#data <- data %>%
# janitor::clean_names()
data <- data %>%
mutate(Education_level = case_when(
Education_level %in% c("pr") ~ "Primary",
Education_level %in% c("Sec") ~ "Secondary",
Education_level %in% c("Uni") ~ "University",
str_detect(Education_level, "Nurs") ~ "Nursery",
str_detect(Education_level, "Non") ~ "No Education",
Education_level == "" | is.na(Education_level) ~ NA_character_,
TRUE ~ Education_level
))
#check
table(data$Education_level, useNA = "ifany")
No Education Nursery Primary Secondary University <NA>
3 3 8 6 4 2
Age columnProblems: Some rows have Age recorded as text: "less than one year". Some missing values.
Solution: Convert to numeric, and handle "less than one year".
data <- data %>%
mutate(Age = case_when(
Age == "less than one year" ~ "0",
TRUE ~ Age
)) %>%
mutate(Age = as.numeric(Age))
#Check
summary(data$Age) Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0.00 24.00 24.00 28.58 33.00 45.00 2
#Final Cleaned Data
glimpse(data)Rows: 26
Columns: 5
$ ID <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,…
$ sex <chr> "Male", "Male", "Female", NA, "Female", "Male", "Male"…
$ Province <chr> "Southern Province", "Kigali City", "Kigali City", "Ea…
$ Education_level <chr> "Primary", "Primary", "Primary", "University", "Univer…
$ Age <dbl> 0, 24, 33, 45, 20, 24, 24, 33, 45, 20, 24, NA, 33, 45,…
1.1.2.1 Comments
In R, comments start with a hash symbol (
#). Anything after the#on that line is ignored by R.Code
Variable Assignment
Variables are like labeled boxes where you can store data or results so you can use them later without retyping.
In R, the primary assignment operator is
<-(read as “gets”).Code
Why
<-instead of=?In R,
<-is the traditional assignment operator.=can also assign values, but is more commonly used for setting arguments in functions (e.g.,mean(x, na.rm = TRUE)).Using
<-makes your code easier to read and avoids confusion.Code
Naming variables
Names should start with a letter, not a number.
Use letters, numbers, and underscores (_), but no spaces.
R is case-sensitive: Age and age are different variables.
Valid names:
Code
Invalid names:
Code
Updating variables
You can overwrite a variable by assigning it a new value:
Code
Assigning results of calculations
Code
Create a variable
length_mwith a value of2.5(representing meters).Create another variable
width_mwith a value of1.8.Calculate the area (
length_m * width_m) and store it inarea_m2.Print
area_m2.Tip: In RStudio, typing
<-is easy: pressAlt+-(Windows) orOption+-(Mac).