Intermediate R Programming Training

Practical Notes and Exercises

Author

Prosper AYINEBYONA

Published

August 17, 2025

1 Introduction

Welcome to the Intermediate R Programming Training.
This course is designed for participants registered for intermediate R training, focusing on data cleaning, manipulation, and visualization in R.

1.1 Introduction to R and Core Concepts

1.1.1 Welcome to R!

R is a powerful, free software environment for statistical computing and graphics. It is widely used by statisticians, data analysts, and data scientists for tasks such as data cleaning, analysis, modeling, and visualization.

Why R?

  • Open-Source and Free: Anyone can use it, and a global community contributes to it.
  • Powerful for Statistics: R was built by statisticians for statisticians. It contains a vast collection of packages for almost any statistical task imaginable.
  • Excellent for Visualization: With packages like ggplot2, R can produce publication-quality graphics.
  • Reproducible Research: R, combined with tools such as R Markdown, makes it easy to create reproducible reports that combine your code, output, and analysis.

RStudio IDE

We will be using RStudio, which is an Integrated Development Environment (IDE) for R. It provides a user-friendly interface with four main panels:

1. Source Editor (Top-Left): Where you write your R scripts and .Rmd files.
2. Console (Bottom-Left): Where you can type and execute R code directly.
3. Environment/History (Top-Right): Shows the objects (data, variables) you have created.
4. Files/Plots/Packages/Help (Bottom-Right): Helps you manage files, view plots, install packages, and get help.

1.1.2 Basic R Syntax

1.1.2.1 Comments

In R, comments start with a hash symbol (#). Anything after the # on that line is ignored by R.

Code
# This is a comment. R will not run this line.

Variable Assignment

Variables are like labeled boxes where you can store data or results so you can use them later without retyping.

In R, the primary assignment operator is <- (read as “gets”).

Code
# Assign the value 10 to a variable named 'x'
x <- 10

# Print the value of x
x
[1] 10

Why <- instead of =?

  • In R, <- is the traditional assignment operator.

  • = can also assign values, but is more commonly used for setting arguments in functions (e.g., mean(x, na.rm = TRUE)).

  • Using <- makes your code easier to read and avoids confusion.

Code
# Assignment with '=' also works
y = 20
y
[1] 20

Naming variables

  • Names should start with a letter, not a number.

  • Use letters, numbers, and underscores (_), but no spaces.

  • R is case-sensitive: Age and age are different variables.

Valid names:

Code
height_cm <- 175
Age <- 30
mean_income <- 450000

Invalid names:

Code
2x <- 5        # starts with a number ❌
my age <- 25   # contains a space ❌

Updating variables

You can overwrite a variable by assigning it a new value:

Code
x <- 10
x <- x + 5  # increase x by 5
x
[1] 15

Assigning results of calculations

Code
a <- 4
b <- 6
sum_ab <- a + b
sum_ab
[1] 10
📚Exercise 1.1
  1. Create a variable length_m with a value of 2.5 (representing meters).

  2. Create another variable width_m with a value of 1.8.

  3. Calculate the area (length_m * width_m) and store it in area_m2.

  4. Print area_m2.

Tip: In RStudio, typing <- is easy: press Alt + - (Windows) or Option + - (Mac).

1.1.3 R Data Types

R has several basic data types. The most common ones are:

  • numeric: For numbers (e.g., 10.5, 55).
  • character: For text (strings), enclosed in quotes (e.g., "hello", 'Rwanda').
  • logical: For TRUE or FALSE values.

You can check the type of a variable with the class() function.

Code
my_number <- 42
my_text <- "NISR"
my_logical <- TRUE

class(my_number)
[1] "numeric"
Code
class(my_text)
[1] "character"
Code
class(my_logical)
[1] "logical"

1.1.4 Vectors

The most fundamental data structure in R is the vector.
A vector is like a single column of data that contains elements all of the same type (all numbers, all text, or all logical values).

You can create a vector using the c() function, which stands for combine.

Code
# A numeric vector (numbers only)
provinces_population <- c(26, 24, 24, 17, 9)

# A character vector (text only)
provinces_names <- c("Southern", "Western", "Eastern", "Northern", "Kigali City")

# Print the vectors
provinces_population
[1] 26 24 24 17  9
Code
provinces_names
[1] "Southern"    "Western"     "Eastern"     "Northern"    "Kigali City"
Note

If you mix types, R will automatically convert everything to the most flexible type.

For example:

Code
mix <- c(1, "two", 3)
mix  # All elements become characters
[1] "1"   "two" "3"  

1.1.5 Vector Arithmetic

Arithmetic in R is vectorized: you can perform operations on all elements of a vector at once, and the operations happen element by element.

Code
a <- c(1, 2, 3)
b <- c(10, 20, 30)

# Addition (adds 1st to 1st, 2nd to 2nd, etc.)
a + b
[1] 11 22 33
Code
# [1] 11 22 33

# Multiplication
a * b
[1] 10 40 90
Code
# [1] 10 40 90

1.1.6 Recycling Rule

If one vector is shorter than the other, R will recycle (repeat) elements of the shorter vector until it matches the length of the longer one.

Code
a <- c(1, 2, 3, 4)
b <- c(10, 20)

a + b  
[1] 11 22 13 24
Code
# First recycle b: (10, 20, 10, 20)
# Result: (1+10, 2+20, 3+10, 4+20)
Be careful

If the longer vector's length is not a multiple of the shorter vector's length, R will warn you.
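For example, a quick sketch of the warning case (output shown here as comments):

Code
x <- c(1, 2, 3)    # length 3
y <- c(10, 20)     # length 2, not a multiple of 3
x + y
# [1] 11 22 13
# Warning message:
# In x + y : longer object length is not a multiple of shorter object length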

1.1.7 Accessing Elements

You can use square brackets [ ] to select elements from a vector.

Code
# Get the first province
provinces_names[1]
[1] "Southern"
Code
# Get the 2nd and 4th populations
provinces_population[c(2, 4)]
[1] 24 17
Code
# Exclude the 3rd element
provinces_names[-3]
[1] "Southern"    "Western"     "Northern"    "Kigali City"

1.1.8 Modifying Vectors

Code
# Change the population of the first province
provinces_population[1] <- 27

# Add a new province (creates a longer vector)
provinces_population <- c(provinces_population, 8)

📚Exercise 1.2

Q1 Create Variables

Create three variables:

  1. my_name containing your name as a character string.
  2. my_age containing your age as a number.
  3. is_statistician with a logical value (TRUE or FALSE).

Print out the class of each variable.

Q2: Vector Operations

  1. Create a vector named expenses with the following values: 1500, 2000, 1200, 3000.
  2. Create another vector named income with the value 10000.
  3. Calculate your total savings by subtracting the sum of expenses from income. (Hint: use the sum() function).

Q3

  1. Create a numeric vector sales_q1 with the values: 120, 150, 90.

  2. Create another numeric vector sales_q2 with the values: 130, 160, 95.

  3. Calculate the total sales for each store by adding the two vectors.

  4. Calculate the percentage increase from Q1 to Q2:

    ((Q2 - Q1) / Q1) * 100

  5. Extract only the results for stores with an increase greater than 10%.

💡 Tip

Vectors are everywhere in R — even a single number like 42 is technically a vector of length 1.
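You can verify this yourself; a minimal check:

Code
is.vector(42)   # TRUE: a single number is a vector
length(42)      # 1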


2 Data Frames and Data Import/Export

2.1 Data Frames

The most important data structure for data analysis in R is the data frame.
A data frame is like:

  • A spreadsheet in Excel

  • A table in a database

  • A tibble in tidyverse (we’ll see later)

It is a two-dimensional table:

  • Rows = observations/records

  • Columns = variables/features

  • Each column can have a different type (numeric, character, logical, etc.)

2.1.1 Creating a Data Frame

You can create a data frame using the data.frame() function.

Code
# Create a simple data frame
employee_data <- data.frame(
  id     = c(1, 2, 3),                   # Numeric column
  name   = c("John", "Jane", "Peter"),   # Character column
  salary = c(50000, 55000, 52000)        # Numeric column
)

employee_data
  id  name salary
1  1  John  50000
2  2  Jane  55000
3  3 Peter  52000

2.1.2 Checking Structure and Summary

Code
# View structure
str(employee_data)
'data.frame':   3 obs. of  3 variables:
 $ id    : num  1 2 3
 $ name  : chr  "John" "Jane" "Peter"
 $ salary: num  50000 55000 52000
Code
# View summary statistics
summary(employee_data)
       id          name               salary     
 Min.   :1.0   Length:3           Min.   :50000  
 1st Qu.:1.5   Class :character   1st Qu.:51000  
 Median :2.0   Mode  :character   Median :52000  
 Mean   :2.0                      Mean   :52333  
 3rd Qu.:2.5                      3rd Qu.:53500  
 Max.   :3.0                      Max.   :55000  

2.1.3 Accessing Data

There are several ways to access parts of a data frame:

Code
# Access a column by $ name
employee_data$name
[1] "John"  "Jane"  "Peter"
Code
# Access by index (row, column)
employee_data[1, 2]     # Row 1, column 2
[1] "John"
Code
employee_data[ , "salary"]  # All rows, salary column
[1] 50000 55000 52000
Code
# Multiple columns
employee_data[ , c("name", "salary")]
   name salary
1  John  50000
2  Jane  55000
3 Peter  52000
Code
# Multiple rows
employee_data[1:2, ]
  id name salary
1  1 John  50000
2  2 Jane  55000

2.1.4 Adding Columns and Rows

Code
# Add a new column
employee_data$department <- c("HR", "Finance", "IT")

# Add a new row
new_row <- data.frame(id=4, name="Alice", salary=60000, department="Marketing")
employee_data <- rbind(employee_data, new_row)

2.1.5 Importing Real Data

In practice, we don’t type all our data by hand — we read it from files:

Code
# Read CSV file
my_data <- read.csv("Data.csv")

# Read Excel file
# install.packages("readxl")
library(readxl)
my_data <- read_excel("Data.xlsx")
📚 Exercises 2.1
  1. Create a data frame named students with columns:
  • id (1 to 5)

  • name (five student names)

  • grade (five numeric grades)

  2. Extract the grades of the first three students.
  3. Add a new column pass that is TRUE if grade ≥ 50, otherwise FALSE.
  4. Add a new row for a sixth student.

2.2 Installing and Loading Packages

One of R’s biggest strengths is its community-contributed packages.
A package is like an app for R — it contains:

  • Functions (tools to do specific tasks)

  • Data sets (ready-to-use examples)

  • Documentation (help files and guides)

2.2.1 Step 1: Installing a Package

Before using a package, you install it once on your computer (like downloading an app from an app store).

Code
# Install a package (done only once, unless you update R or reinstall)

#install.packages("haven")
#install.packages("tidyverse")
💡 Tip

The package name must be in quotes (" ") when installing. Installing requires an internet connection.

2.2.2 Step 2: Loading a Package

Every time you start a new R session (or reopen RStudio), you must load the package to use it.

Code
# Load packages for use in this session

library(haven)       # For reading Stata, SPSS, SAS files
library(tidyverse)   # For data manipulation & visualization
💡 Tip

When loading a package, no quotes are needed.

2.2.3 Why haven and tidyverse?

  • haven: Reads data from Stata (.dta), SPSS, and SAS files while preserving labels.

  • tidyverse: A collection of packages for data manipulation (dplyr), visualization (ggplot2), data tidying (tidyr), and more.

2.2.4 Checking if a Package is Installed

Code
# Check installed packages
# installed.packages()

# Or quickly check one package
"haven" %in% rownames(installed.packages())
[1] TRUE

Updating Packages

Code
# Update all installed packages

# update.packages()
📚 Exercises 2.3
  1. Install the package readxl (for reading Excel files).

  2. Load it into your session.

  3. Check if the package ggplot2 is already installed on your system.

💡 Tip

If you get the error “there is no package called …”, it means you need to install it first.

2.3 Importing Data

One of the first steps in any data analysis project is getting your data into R.
R can read many file formats: .csv, .xlsx, .dta, .sav, .json, and more.

In this course, we’ll start with a Stata (.dta) file.

We’ll use read_dta() from the haven package (which is already loaded) to import it.

Code
# If the file is in your working directory, you can just write its name. Otherwise, provide the full path or a relative path.

file_path <- "cs_s1_s2_s3_s4_s6a_s6e_s6f_person.dta"

# Read the Stata file into R
df <- read_dta(file_path)

# View the first few rows
head(df)

2.3.1 Common issues when importing:

  1. File not found

    • Check your working directory with getwd()

    • Set it with setwd("path/to/your/folder") (see the sketch after this list)

  2. Wrong slashes in path

    • On Windows, use / instead of \ or double them: "C:/Users/..."
  3. Package not loaded

    • Make sure library(haven) is run before read_dta().
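A minimal sketch of checking and setting the working directory (the path below is only an example; replace it with a folder on your own computer):

Code
# Where is R currently looking for files?
getwd()

# Point R to the folder that contains your data
# (example path; adjust it to your machine)
# setwd("C:/Users/yourname/Documents/R-training")

# List the files R can see in the working directory
list.files()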

2.3.2 Other Common Data Formats

1. CSV Files

Most widely used format. Use read_csv() from the readr package (part of tidyverse):

Code
df_csv <- read_csv("Data.csv")
head(df_csv)

2. Excel Files

For .xlsx or .xls, use read_excel() from the readxl package:

Code
library(readxl)
df_excel <- read_excel("Data.xlsx")
head(df_excel)

3. SPSS Files (.sav)

Also supported by haven:

Code
df_spss <- read_sav("RWPR81FL.SAV")
head(df_spss)
💡 Tip

You can check the type of file you have before importing it:

Code
tools::file_ext("Data.csv")  # Returns "csv"
[1] "csv"
💻Task
  1. Try loading a .csv file from your computer into R.

  2. View its first 6 rows with head().

  3. Check its structure with str().

2.3.3 Exporting Data in R

Code
# Save as CSV
write_csv(Your_df, "cleaned_data.csv")

# Save as Excel (needs openxlsx or writexl package)
library(writexl)
write_xlsx(Your_df, "cleaned_data.xlsx")

# Save as Stata (.dta) – needs haven
library(haven)
write_dta(Your_df, "cleaned_data.dta")

# Save as SPSS (.sav) – also haven
write_sav(Your_df, "cleaned_data.sav")

# Save as RDS (R's native format, keeps everything)
saveRDS(Your_df, "cleaned_data.rds")

# To load back RDS
df_loaded <- readRDS("cleaned_data.rds")

2.4 Inspecting Data Frames

After loading your dataset, the next step is to understand what’s inside it — its size, structure, and basic patterns.
This step helps you decide how to clean and analyze the data.

Here are some R equivalents of the common pandas commands from Python:

pandas in Python | R equivalent | What it does
df.head() | head(df) | Shows the first few rows (default: 6)
df.shape | dim(df) | Shows the number of rows and columns
df.info() | str(df) | Gives the structure: column names, types, and a preview of values
df.describe() | summary(df) | Summary statistics for each column
📄Examples
Code
# Show the first few rows
head(df)

# You can also control how many rows to show
head(df, 10)  # First 10 rows

# Get the dimensions (rows, columns)
dim(df)  

# Get the structure of the data frame
str(df)  

# Summary statistics for each column
summary(df)

2.4.1 Extra Useful Inspection Functions

See column names

Code
names(df)

Check the number of rows only

Code
nrow(df)
[1] 66081

Check the number of columns only

Code
ncol(df)
[1] 127

Check the first few rows of a specific column

Code
head(df$region)
<labelled<double>[6]>: Region
[1] 1 1 1 1 1 1

Labels:
 value          label
     1    Kigali City
     2    Other Urban
     3 Rural Southern
     4  Rural Western
     5 Rural Northern
     6  Rural Eastern

View unique values in a column

Code
unique(df$province)
<labelled<double>[5]>: Province
[1] 1 2 3 4 5

Labels:
 value             label
     1       Kigali City
     2 Southern Province
     3  Western Province
     4 Northern Province
     5  Eastern Province
💡 Tips for Inspecting Large Datasets
  • Too many rows? Use head() and tail() to see only the top and bottom of the data.

  • Too wide? Use glimpse(df) from dplyr for a horizontal preview:

    Code
    library(dplyr)
    glimpse(df)
  • Want quick statistics? Use summary() to spot outliers, missing values, or strange categories


📚 Exercises 2.4

Load and Inspect

  1. Load the cs_s0_s5_household.dta dataset into a data frame called eicv_data.
  2. Use dim(), head(), str() and glimpse() to answer the following questions:
    • How many rows and columns are in the dataset?
    • What is the data type of the province column?
    • What are the names of the first 6 columns? (Hint: use colnames()).
    • Get a quick preview of the data.

3 Data Manipulation with Base R

When working with data, one of the most common tasks is to pick the columns you need, rename them for clarity, and filter rows based on conditions.

In Base R, you can do all of this without additional packages — although later we’ll also learn the tidyverse way.

3.0.1 Selecting Columns

You can select columns from a data frame by name or by their index number.

Example: Selecting specific columns by name

Code
# Select important columns for our analysis
df_selected <- subset(
  df,
  select = c(hhid, province, district, ur2_2012, quintile, poverty, Consumption, s1q1, s1q4, s3q3, s1q3y, s4aq3, s4aq6a)
)

# Preview the new data frame
head(df_selected)
# A tibble: 6 × 13
    hhid province        district  ur2_2012 quintile poverty Consumption s1q1   
   <dbl> <dbl+lbl>       <dbl+lbl> <dbl+lb> <dbl+lb>   <dbl>       <dbl> <dbl+l>
1 100001 1 [Kigali City] 11 [Nyar… 1 [Urba… 4 [Q4]         3     305391. 2 [Fem…
2 100001 1 [Kigali City] 11 [Nyar… 1 [Urba… 4 [Q4]         3     305391. 1 [Mal…
3 100001 1 [Kigali City] 11 [Nyar… 1 [Urba… 4 [Q4]         3     305391. 1 [Mal…
4 100001 1 [Kigali City] 11 [Nyar… 1 [Urba… 4 [Q4]         3     305391. 2 [Fem…
5 100001 1 [Kigali City] 11 [Nyar… 1 [Urba… 4 [Q4]         3     305391. 1 [Mal…
6 100002 1 [Kigali City] 11 [Nyar… 1 [Urba… 5 [Q5]         3     372962. 1 [Mal…
# ℹ 5 more variables: s1q4 <dbl+lbl>, s3q3 <dbl+lbl>, s1q3y <dbl+lbl>,
#   s4aq3 <dbl+lbl>, s4aq6a <dbl+lbl>
💡 Tips
  • subset() is great for readability, but you can also use df[, c("col1", "col2")].

  • If you know the column positions: df[, c(1, 2, 5)] selects columns 1, 2, and 5.

3.1 Renaming Columns

Clear, descriptive column names make your analysis easier to follow.
You can change column names one at a time or all at once.

Example: Renaming specific columns

Code
# Rename specific columns

names(df_selected)[names(df_selected) == "s1q1"]  <- "Sex"
names(df_selected)[names(df_selected) == "s1q4"]  <- "Marital_status"
names(df_selected)[names(df_selected) == "s3q3"]  <- "Health_insurance"
names(df_selected)[names(df_selected) == "s1q3y"] <- "Age"
names(df_selected)[names(df_selected) == "s4aq3"] <- "Diploma_obtained"
names(df_selected)[names(df_selected) == "s4aq6a"] <- "Class_attended_2012"

# Check new names
names(df_selected)
 [1] "hhid"                "province"            "district"           
 [4] "ur2_2012"            "quintile"            "poverty"            
 [7] "Consumption"         "Sex"                 "Marital_status"     
[10] "Health_insurance"    "Age"                 "Diploma_obtained"   
[13] "Class_attended_2012"

Renaming all columns at once

Code
names(df_selected) <- c("HHID", "Province", "District", "UrbanRural", 
                        "Quintile", "PovertyStatus", "Consumption", 
                        "Sex", "Marital_status", "Health_insurance", 
                        "Age", "Diploma_obtained", "Class_attended_2012")
💡 Tip

Always rename columns right after selecting them; it saves you from having to remember cryptic names later.

3.1.1 Filtering Rows

You can filter rows by specifying conditions.

Example: Filtering by category (Province)

Let's first convert the province codes into their actual names:

Code
# Check the structure of the variable
str(df_selected$Province)
 dbl+lbl [1:66081] 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
 @ label       : chr "Province"
 @ format.stata: chr "%10.0g"
 @ labels      : Named num [1:5] 1 2 3 4 5
  ..- attr(*, "names")= chr [1:5] "Kigali City" "Southern Province" "Western Province" "Northern Province" ...
Code
# Convert to factor with labels
#df_selected$Province <- factor(df_selected$Province,
#                               levels = c(1, 2, 3, 4, 5),
#                               labels = c("Kigali City",
#                                          "Southern Province",
#                                          "Western Province",
#                                          "Northern Province",
#                                          "Eastern Province"))

# Convert province from numeric codes to labels
df_selected$Province <- as_factor(df_selected$Province)

# Now the variable is readable
table(df_selected$Province)

      Kigali City Southern Province  Western Province Northern Province 
             6156             17211             15959             11071 
 Eastern Province 
            15684 

Then we can filter by category:

Code
# Only rows for Northern Province
northern_province_data <- subset(df_selected, Province == "Northern Province")

# Rows for Kigali City OR Northern Province
kigali_and_north <- subset(df_selected, Province == "Kigali City" | Province == "Northern Province")

# Cleaner way: using %in%
provinces_to_select <- c("Kigali City", "Northern Province")
kigali_and_north_alt <- subset(df_selected, Province %in% provinces_to_select)

Example: Filtering by numeric condition

Code
# Only people aged 16 and above
age_above_16 <- subset(df_selected, Age >= 16)
💡 Tips for Filtering
  • Use == for exact matches.

  • Use %in% when matching against a list of values.

  • Combine conditions with & (AND) or | (OR); see the sketch after this list.

  • For more complex filtering, subset() is readable, but df[df$col > 10, ] is also valid.
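A small sketch of combining conditions (the cutoffs here are only for illustration):

Code
# AND: people in the Western Province who are 65 or older (both conditions must be TRUE)
western_elderly <- subset(df_selected, Province == "Western Province" & Age >= 65)

# OR: people who live in Kigali City or are under 5 (either condition is enough)
kigali_or_young <- subset(df_selected, Province == "Kigali City" | Age < 5)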

📚 Exercises 3.1
  1. From df_selected, create a new data frame with only Sex, Age, and Province.

  2. Rename Sex to Gender.

  3. Filter only people in Kigali City who are aged 18 or older.

3.1.3 Base R Data Frame Manipulation Cheat Sheet

Task | Code Example | Notes
Select columns by name | df[, c("col1", "col2")] | Uses a vector of names
Select columns by index | df[, c(1, 3, 5)] | Index starts at 1
Select with subset() | subset(df, select = c(col1, col2)) | More readable
Rename one column | names(df)[names(df) == "old"] <- "new" | Match the name exactly
Rename all columns | names(df) <- c("new1", "new2", "new3") | Order matters
Filter rows (one value) | subset(df, col == "Value") | Exact match
Filter rows (multiple) | subset(df, col %in% c("A", "B")) | %in% for multiple values
Filter numeric | subset(df, age >= 18) | Works with >, <, >=, <=
Multiple conditions (AND) | subset(df, col1 == "A" & col2 > 10) | Both must be TRUE
Multiple conditions (OR) | subset(df, col1 == "A" | col2 > 10) | Either can be TRUE
Without subset() | df[df$col == "Value", ] | Base R indexing

3.2 Factors in R

Categorical variables in R are often stored as factors.
A factor is a special type of vector that can only contain predefined values called levels.

  • They are memory efficient for repeated text values.

  • They are useful for modeling (e.g., regression, classification) because they tell R which variables are categorical.

  • They also make plots more meaningful by ordering categories automatically.

Basic Factor Operations

Code
# 1 Check the class of 'province'
# haven (used for reading Stata, SPSS, SAS) often reads categories as 'labelled'
class(df_selected$Province)
[1] "factor"
Code
# 2 Convert 'province' to a factor
# df_selected$Province <- as.factor(df_selected$Province)

# 3 See the levels (unique categories)
levels(df_selected$Province)
[1] "Kigali City"       "Southern Province" "Western Province" 
[4] "Northern Province" "Eastern Province" 

More Factor Tasks

Code
# Reorder factor levels (manually)
df_selected$Province <- factor(df_selected$Province,
                               levels = c("Kigali City", "Northern Province", 
                                          "Southern Province", "Eastern Province",
                                          "Western Province"))

# Rename factor levels
levels(df_selected$Province) <- c("Kigali", "North", "South", "East", "West")

# Count the number of occurrences of each level
table(df_selected$Province)

Kigali  North  South   East   West 
  6156  11071  17211  15684  15959 
Code
# Check if a column is a factor
is.factor(df_selected$Province)   # TRUE / FALSE
[1] TRUE
💡Tip
  • as.factor() → converts a variable to a factor.

  • factor() → creates a factor from scratch or reorders levels (see the sketch after this list).

  • levels() → views or updates factor levels.

  • table() → counts how many times each category appears.
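A minimal sketch of creating a factor from scratch, using made-up values:

Code
# A small character vector turned into a factor with explicit levels
answers <- factor(c("Yes", "No", "Yes", "Yes"), levels = c("No", "Yes"))

levels(answers)   # "No"  "Yes"
table(answers)    # counts per level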

📚 Exercises 3.2

Q1: Select and Rename

  1. From the df_selected data frame, select the columns District, PovertyStatus, and Age. Store this in a new data frame called analysis_data.
  2. Rename the PovertyStatus column to Poverty.

Q2: Filter Data

  1. From the analysis_data data frame, create a new data frame called muhanga_data that contains only the data for the “Muhanga” district.
  2. How many people in the Muhanga district are older than 50?

4 Data Aggregation & Introduction to the Tidyverse

4.1 One-Way and Two-Way Tables (Base R)

Frequency tables help us summarize how many times each category appears in a dataset.
They are essential for understanding distributions before doing further analysis.

4.1.1 1. One-Way Table, Counts for a Single Variable

Code
# Count how many observations are in each province
province_table <- table(df_selected$Province)
province_table

Kigali  North  South   East   West 
  6156  11071  17211  15684  15959 

4.1.2 2. Proportions Table, Percentages Instead of Counts

Code
# Convert counts into proportions (relative frequencies)
province_table_pc <- prop.table(province_table) * 100
province_table_pc

  Kigali    North    South     East     West 
 9.31584 16.75368 26.04531 23.73451 24.15066 

4.1.3 3. Two-Way Table, Relationship Between Two Variables

Code
# Convert province from numeric codes to labels
df_selected$Sex <- as_factor(df_selected$Sex)

# Cross-tabulation of Province by Sex
prov_sex_table <- table(df_selected$Province, df_selected$Sex)
prov_sex_table
        
         Male Female Missing
  Kigali 3008   3148       0
  North  5290   5781       0
  South  8165   9046       0
  East   7533   8151       0
  West   7548   8411       0

What this shows:

  • A matrix where each cell contains the number of people for a given combination of province and Sex.

4.1.4 4. Row-Wise Percentages – Compare Within Each Group

Code
# Row percentages: within each province, what % are male/female?
prop.table(prov_sex_table, margin = 1) * 100
        
             Male   Female  Missing
  Kigali 48.86290 51.13710  0.00000
  North  47.78249 52.21751  0.00000
  South  47.44059 52.55941  0.00000
  East   48.02984 51.97016  0.00000
  West   47.29620 52.70380  0.00000

Example Output Interpretation

Province | Male | Female
Kigali City | 48.86% | 51.14%
Northern Province | 47.78% | 52.22%

This means that in Kigali City, 48.86% of the respondents are male and 51.14% are female.

💡Tip
  • table() → for counts

  • prop.table() → for percentages

  • margin = 1 → row percentages (by group)

  • margin = 2 → column percentages (by category); see the sketch below
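A quick sketch of column percentages, reusing the prov_sex_table created above:

Code
# Column percentages: within each sex, what % live in each province?
prop.table(prov_sex_table, margin = 2) * 100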

4.2 Aggregating Data (Base R)

Sometimes we need to calculate summary statistics for groups in the data.
In Base R, the aggregate() function is a quick way to do this.

4.2.1 Basic Syntax

Code
aggregate(formula, data, FUN)
  • formula: y ~ x means “calculate something for y grouped by x”.

  • data: the data frame where the variables are stored.

  • FUN: the function to apply (e.g., mean, sum, max, min).

Example – Average Household Consumption by Province

Code
# Average household consumption for each province
avg_consumption_by_province <- aggregate(
  Consumption ~ Province, 
  data = df_selected, 
  FUN = mean
)

avg_consumption_by_province
  Province Consumption
1   Kigali    630632.5
2    North    237683.3
3    South    275067.9
4     East    264665.3
5     West    256286.5

Other Useful Variations

Code
# Median consumption by province
aggregate(Consumption ~ Province, data = df_selected, FUN = median)
  Province Consumption
1   Kigali    339105.2
2    North    169252.5
3    South    189248.8
4     East    188566.6
5     West    170559.5
Code
# Total consumption by province
aggregate(Consumption ~ Province, data = df_selected, FUN = sum)
  Province Consumption
1   Kigali  3882173609
2    North  2631391853
3    South  4734192910
4     East  4151010371
5     West  4090075625
Code
# Average age by province
aggregate(Age ~ Province, data = df_selected, FUN = mean)
  Province      Age
1   Kigali 22.43080
2    North 23.35146
3    South 24.02353
4     East 22.23495
5     West 22.50649
💡Tip

You can aggregate multiple variables at once by passing cbind(var1, var2) in the formula.

Code
# Average consumption and age by province
aggregate(cbind(Consumption, Age) ~ Province, data = df_selected, FUN = mean)
  Province Consumption      Age
1   Kigali    630632.5 22.43080
2    North    237683.3 23.35146
3    South    275067.9 24.02353
4     East    264665.3 22.23495
5     West    256286.5 22.50649

4.3 Introduction to the Tidyverse

The tidyverse is a collection of R packages built specifically for data science tasks such as data cleaning, transformation, visualization, and reporting.
All packages in the tidyverse follow a consistent design philosophy, making it easier to learn and switch between them.

Some of the most popular tidyverse packages include:

Package | Purpose
dplyr | Data manipulation (filtering, selecting, summarizing, etc.)
ggplot2 | Data visualization
tidyr | Reshaping and tidying data
readr | Reading CSV and other text files
stringr | Working with strings
forcats | Working with factors
lubridate | Working with dates and times

4.3.1 Why Use the tidyverse?

  • More readable and concise syntax compared to base R.

  • Functions are designed to chain together naturally.

  • Works well with the pipe operator for a step-by-step approach.

4.3.2 The Pipe Operator: %>%

One of the most useful features in tidyverse workflows is the pipe operator (%>%), provided by the magrittr package (automatically loaded with tidyverse).

The pipe lets you write code like a sentence, where the output of one function becomes the input of the next.

Without Pipes (Base R Style)

Code
head(subset(df_selected, Province == "Kigali"))
# A tibble: 6 × 13
    HHID Province District   UrbanRural Quintile PovertyStatus Consumption Sex  
   <dbl> <fct>    <dbl+lbl>  <dbl+lbl>  <dbl+lb>         <dbl>       <dbl> <fct>
1 100001 Kigali   11 [Nyaru… 1 [Urban]  4 [Q4]               3     305391. Fema…
2 100001 Kigali   11 [Nyaru… 1 [Urban]  4 [Q4]               3     305391. Male 
3 100001 Kigali   11 [Nyaru… 1 [Urban]  4 [Q4]               3     305391. Male 
4 100001 Kigali   11 [Nyaru… 1 [Urban]  4 [Q4]               3     305391. Fema…
5 100001 Kigali   11 [Nyaru… 1 [Urban]  4 [Q4]               3     305391. Male 
6 100002 Kigali   11 [Nyaru… 1 [Urban]  5 [Q5]               3     372962. Male 
# ℹ 5 more variables: Marital_status <dbl+lbl>, Health_insurance <dbl+lbl>,
#   Age <dbl+lbl>, Diploma_obtained <dbl+lbl>, Class_attended_2012 <dbl+lbl>

This can be harder to read because operations are nested inside each other.

4.3.3 With Pipes (Tidyverse Style)

Code
library(dplyr)

df_selected %>%
  filter(Province == "Kigali") %>%
  tail()
# A tibble: 6 × 13
    HHID Province District   UrbanRural Quintile PovertyStatus Consumption Sex  
   <dbl> <fct>    <dbl+lbl>  <dbl+lbl>  <dbl+lb>         <dbl>       <dbl> <fct>
1 103079 Kigali   13 [Kicuk… 1 [Urban]  5 [Q5]               3     543027. Fema…
2 103079 Kigali   13 [Kicuk… 1 [Urban]  5 [Q5]               3     543027. Fema…
3 103079 Kigali   13 [Kicuk… 1 [Urban]  5 [Q5]               3     543027. Fema…
4 103079 Kigali   13 [Kicuk… 1 [Urban]  5 [Q5]               3     543027. Male 
5 103080 Kigali   13 [Kicuk… 1 [Urban]  5 [Q5]               3     418005. Male 
6 103080 Kigali   13 [Kicuk… 1 [Urban]  5 [Q5]               3     418005. Fema…
# ℹ 5 more variables: Marital_status <dbl+lbl>, Health_insurance <dbl+lbl>,
#   Age <dbl+lbl>, Diploma_obtained <dbl+lbl>, Class_attended_2012 <dbl+lbl>

Here, each step is on its own line, making it easier to read and debug.

4.3.4 How the Pipe Works

The pipe takes the result on the left and passes it as the first argument to the function on the right.

For example:

Code
# This:
x %>% f(y)

# Is equivalent to:
f(x, y)
💡Tip

The pipe works best when each function’s first argument is a data frame (which is the tidyverse convention).
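For example, these two lines do the same thing (a small sketch using df_selected from earlier):

Code
head(df_selected, 10)        # classic style: the data frame is the first argument
df_selected %>% head(10)     # pipe style: the data frame flows in from the left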

4.3.5 Data Manipulation with dplyr

The dplyr package is one of the core tidyverse tools for data manipulation.

It provides a set of intuitive “verbs” that let you describe what you want to do with your data, without worrying too much about how R does it internally.

4.3.6 Core dplyr Verbs

Here are the most important verbs in dplyr:

  • select() – Choose which columns to keep or reorder.

  • rename() – Rename columns to something more meaningful.

  • filter() – Select rows that meet certain conditions.

  • mutate() – Add new columns or modify existing ones.

  • arrange() – Sort rows in ascending or descending order.

  • group_by() – Split the dataset into groups.

  • summarise() – Collapse each group into a single summary row.

Example: Selecting, Renaming, and Filtering in One Chain

Code
# Load the tidyverse library
library(tidyverse)
library(haven)

df$province <- as.factor(df$province)
# Note: as.factor() keeps the numeric codes (1-5) as levels; use haven::as_factor() if you want the label names
table(df$province)

    1     2     3     4     5 
 6156 17211 15959 11071 15684 
Code
# Select certain columns, rename some, and filter rows for Southern Province (code 2)
analysis_df_dplyr <- df %>%
  select(
    hhid,           # Household ID
    province,       # Province
    district,       # District
    ur2_2012,       # Urban/Rural classification
    quintile,       # Wealth quintile
    poverty,        # Poverty status
    Consumption,    # Household consumption
    Sex = s1q1,     # Rename s1q1 to 'Sex'
    Marital_status = s1q4,  # Rename s1q4 to 'Marital_status'
    Health_insurance = s3q3,# Rename s3q3 to 'Health_insurance'
    Age = s1q3y             # Rename s1q3y to 'Age'
  ) %>%
  filter(province == "2")  # Keep only rows where the province code is 2 (Southern Province)

# Preview the result
head(analysis_df_dplyr)
# A tibble: 6 × 11
    hhid province district    ur2_2012  quintile  poverty Consumption Sex       
   <dbl> <fct>    <dbl+lbl>   <dbl+lbl> <dbl+lbl>   <dbl>       <dbl> <dbl+lbl> 
1 100393 2        21 [Nyanza] 2 [Rural] 5 [Q5]          3    2384791. 1 [Male]  
2 100393 2        21 [Nyanza] 2 [Rural] 5 [Q5]          3    2384791. 2 [Female]
3 100394 2        21 [Nyanza] 2 [Rural] 3 [Q3]          3     172956. 2 [Female]
4 100394 2        21 [Nyanza] 2 [Rural] 3 [Q3]          3     172956. 2 [Female]
5 100394 2        21 [Nyanza] 2 [Rural] 3 [Q3]          3     172956. 2 [Female]
6 100395 2        21 [Nyanza] 2 [Rural] 4 [Q4]          3     266837. 1 [Male]  
# ℹ 3 more variables: Marital_status <dbl+lbl>, Health_insurance <dbl+lbl>,
#   Age <dbl+lbl>

What’s Happening Here?

  1. select() picks only the columns we need (and can rename them at the same time).

  2. filter() keeps only the rows where the province code equals "2" (Southern Province).

  3. The pipe %>% sends the result of each step into the next one.

  4. The output is a cleaner, smaller data frame ready for analysis.

💡Tip

You can chain as many steps as you want — dplyr encourages writing code that reads like a sentence.

4.3.7 Aggregation with dplyr

In base R, we used the aggregate() function to get averages.
In dplyr, we combine:

  • group_by() – to split the data into groups

  • summarise() – to calculate summary statistics for each group

Example: Average Consumption by Province

Code
# Calculate average consumption by province
avg_cons_dplyr <- df_selected %>%
  group_by(Province) %>%
  summarise(
    mean_consumption = mean(Consumption, na.rm = TRUE)  # ignore NA values
  )

avg_cons_dplyr
# A tibble: 5 × 2
  Province mean_consumption
  <fct>               <dbl>
1 Kigali            630632.
2 North             237683.
3 South             275068.
4 East              264665.
5 West              256286.

How This Works

  1. group_by(Province)
    Creates groups for each province in the dataset.

  2. summarise(mean_consumption = mean(Consumption, na.rm = TRUE))
    Calculates the mean consumption for each group.

  3. The result is a condensed table with one row per province.

Key Differences from Base R’s aggregate()

  • dplyr syntax is more readable and works naturally with pipes.

  • You can summarise multiple variables at once.

  • Works seamlessly with other tidyverse tools.

4.3.8 1. Multiple Summaries in One Go

With summarise(), you can calculate several statistics at the same time.

Code
# Summarise with multiple statistics
province_summary <- df_selected %>%
  group_by(Province) %>%
  summarise(
    mean_consumption = mean(Consumption, na.rm = TRUE),
    median_consumption = median(Consumption, na.rm = TRUE),
    households = n(),  # Count rows in each group
    .groups = "drop"   # Remove grouping in the output
  )

province_summary
# A tibble: 5 × 4
  Province mean_consumption median_consumption households
  <fct>               <dbl>              <dbl>      <int>
1 Kigali            630632.            339105.       6156
2 North             237683.            169252.      11071
3 South             275068.            189249.      17211
4 East              264665.            188567.      15684
5 West              256286.            170560.      15959

4.3.9 2. Adding New Columns with mutate()

mutate() is perfect for creating calculated columns without altering the original data.

Code
# Add a new column for daily consumption
df_with_daily <- df_selected %>%
  mutate(
    daily_consumption = Consumption / 365  # Assuming annual data
  )

head(df_with_daily)
# A tibble: 6 × 14
    HHID Province District   UrbanRural Quintile PovertyStatus Consumption Sex  
   <dbl> <fct>    <dbl+lbl>  <dbl+lbl>  <dbl+lb>         <dbl>       <dbl> <fct>
1 100001 Kigali   11 [Nyaru… 1 [Urban]  4 [Q4]               3     305391. Fema…
2 100001 Kigali   11 [Nyaru… 1 [Urban]  4 [Q4]               3     305391. Male 
3 100001 Kigali   11 [Nyaru… 1 [Urban]  4 [Q4]               3     305391. Male 
4 100001 Kigali   11 [Nyaru… 1 [Urban]  4 [Q4]               3     305391. Fema…
5 100001 Kigali   11 [Nyaru… 1 [Urban]  4 [Q4]               3     305391. Male 
6 100002 Kigali   11 [Nyaru… 1 [Urban]  5 [Q5]               3     372962. Male 
# ℹ 6 more variables: Marital_status <dbl+lbl>, Health_insurance <dbl+lbl>,
#   Age <dbl+lbl>, Diploma_obtained <dbl+lbl>, Class_attended_2012 <dbl+lbl>,
#   daily_consumption <dbl>

This keeps all original columns but adds a new daily_consumption column.

4.3.10 3. Sorting Results with arrange()

You can arrange your results ascending or descending.

Code
# Sort provinces from highest to lowest mean consumption
province_summary_sorted <- province_summary %>%
  arrange(desc(mean_consumption))

province_summary_sorted
# A tibble: 5 × 4
  Province mean_consumption median_consumption households
  <fct>               <dbl>              <dbl>      <int>
1 Kigali            630632.            339105.       6156
2 South             275068.            189249.      17211
3 East              264665.            188567.      15684
4 West              256286.            170560.      15959
5 North             237683.            169252.      11071

4.3.11 4. Filtering Groups after Summarising

Sometimes you want only certain groups after calculation — for example, provinces with high average consumption.

Code
# Keep only provinces with mean consumption above 265000
high_consumption_provinces <- province_summary %>%
  filter(mean_consumption > 265000)

high_consumption_provinces
# A tibble: 2 × 4
  Province mean_consumption median_consumption households
  <fct>               <dbl>              <dbl>      <int>
1 Kigali            630632.            339105.       6156
2 South             275068.            189249.      17211

4.4 5. Counting Rows in Each Group

n() is a special function in dplyr to count rows per group — useful for quick summaries.

Code
# Count households per province
households_per_province <- df_selected %>%
  group_by(Province) %>%
  summarise(total_households = n())

households_per_province
# A tibble: 5 × 2
  Province total_households
  <fct>               <int>
1 Kigali               6156
2 North               11071
3 South               17211
4 East                15684
5 West                15959

4.4.1 Bringing It All Together

Here’s an all-in-one pipeline combining everything:

Code
final_summary <- df_selected %>%
  mutate(daily_consumption = Consumption / 365) %>%
  group_by(Province) %>%
  summarise(
    mean_annual = mean(Consumption, na.rm = TRUE),
    mean_daily = mean(daily_consumption, na.rm = TRUE),
    median_annual = median(Consumption, na.rm = TRUE),
    households = n(),
    .groups = "drop"
  ) %>%
  arrange(desc(mean_annual)) %>%
  filter(mean_annual > 265000)

final_summary
# A tibble: 2 × 5
  Province mean_annual mean_daily median_annual households
  <fct>          <dbl>      <dbl>         <dbl>      <int>
1 Kigali       630632.      1728.       339105.       6156
2 South        275068.       754.       189249.      17211

What This Shows:

  • How to calculate multiple statistics in one step

  • How to add new calculated columns before summarising

  • How to sort results and filter summaries

  • How to count households per group

📚 Exercise 4

Data Aggregation & Introduction to the Tidyverse

You will use the dataset df_selected from the examples in this section.
Unless stated otherwise, you may solve each exercise using either Base R or dplyr (but try dplyr for practice).

Q1 – Average Consumption by Province

Calculate the mean monthly consumption for each province.

  • In Base R, use aggregate()

  • In dplyr, use group_by() + summarise()

Q2 – Multiple Summary Statistics

For each province, calculate:

  • Mean consumption

  • Median consumption

  • Number of households

Q3 – Filter by Condition

Select only provinces where the mean monthly consumption is above 1,300.

  • Arrange results from highest to lowest mean consumption.

Q4 – Adding New Variables

Create a new variable daily_consumption (assuming Consumption is monthly).
Then, calculate the average daily consumption per province.

Q5 – Household Count

Count how many households are in each province.

  • Hint: Use n() inside summarise() in dplyr.

Q6 – Combine Operations in a Single Pipeline

Using dplyr, create a single pipeline that:

  1. Starts with df_selected

  2. Creates daily_consumption

  3. Groups by province

  4. Calculates mean monthly, mean daily, and household count

  5. Filters provinces where mean monthly > 1,300

  6. Sorts the results in descending order of mean monthly consumption

Q7 – Optional

Find the top 3 provinces with the highest median consumption.

  • Hint: After summarise(), use arrange() + slice_head(n = 3)

5 Data Visualization, Missing Values, and Outliers

5.1 Data Visualization in R

Data visualization helps you spot patterns, detect anomalies, and communicate findings more effectively.
We’ll explore two main approaches in R:

  1. Base R plotting functions – Quick and simple.

  2. ggplot2 from the tidyverse – More flexible and elegant.

Example: Base R Plots

Code
# Turn off scientific notation
options(scipen = 999)

# Histogram of Age
hist(df_selected$Age,
     main = "Distribution of Age",
     xlab = "Age",
     col = "skyblue", border = "white")

Code
# Boxplot of Consumption by Province
boxplot(Consumption ~ Province,
        data = df_selected,
        main = "Consumption by Province",
        xlab = "Province", ylab = "Consumption",
        col = "lightgreen")

Example: ggplot2 Plots

Code
library(ggplot2)

# Histogram using ggplot2
ggplot(df_selected, aes(x = Consumption)) +
  geom_histogram(binwidth = 5000, fill = "steelblue", color = "green") +
  labs(title = "Distribution of Household Consumption", x = "Consumption", y = "Count")

Code
# Boxplot using ggplot2
ggplot(df_selected, aes(x = Province, y = Consumption, fill = Province)) +
  geom_boxplot() +
  labs(title = "Consumption by Province", x = "Province", y = "Consumption") +
  theme(legend.position = "none")

💡Tip

Use binwidth in histograms to control the level of detail.

Smaller binwidth → more detail, but may look noisy. Larger binwidth → smoother look.
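A quick sketch comparing two binwidths on the same variable (run both and compare the plots):

Code
# Fine detail: one-year bins
ggplot(df_selected, aes(x = Age)) +
  geom_histogram(binwidth = 1, fill = "steelblue", color = "white")

# Smoother: ten-year bins
ggplot(df_selected, aes(x = Age)) +
  geom_histogram(binwidth = 10, fill = "steelblue", color = "white")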

5.2 Data Visualization with ggplot2

Why ggplot2?

ggplot2 is part of the tidyverse and is based on the Grammar of Graphics.
This means you build a plot layer by layer:

  1. Start with your data and aesthetic mappings (aes()).

  2. Add a geometry (bars, points, lines, etc.).

  3. Add layers for labels, themes, and customization.

Think of it like assembling a sandwich:

  • Bread → data & aes()

  • Fillings → geom_*() layer

  • Sauces & toppings → labels, colors, and themes.

Basic Structure

Code
ggplot(data = <DATA>, aes(x = <X-VARIABLE>, y = <Y-VARIABLE>)) +
  geom_<TYPE>() +
  other_layers
Example 1: Histogram

Histograms are great for seeing how values are distributed.

Code
ggplot(df_selected, aes(x = Age)) +
  geom_histogram(binwidth = 5, fill = "steelblue", color = "white") +
  labs(title = "Distribution of Household Age",
       x = "Age (yrs)",
       y = "Number of Households")

What to notice:

  • binwidth controls bar width.

  • fill and color make it visually appealing.

Example 2: Bar Chart for Categorical Data
Code
ggplot(df_selected, aes(x = Province)) +
  geom_bar(fill = "lightgreen") +
  labs(title = "Number of Households by Province",
       x = "Province",
       y = "Count")

Here:

  • geom_bar() automatically counts observations.

  • Works well for categorical variables like province.

Example 3: Boxplot to Compare Groups

Boxplots are perfect for spotting differences and outliers between groups.

Code
ggplot(df_selected, aes(x = Province, y = Consumption, fill = Province)) +
  geom_boxplot() +
  labs(title = "Household Consumption by Province",
       x = "Province",
       y = "Consumption (RWF)") +
  theme(legend.position = "none")

What you get:

  • Median → thick line inside the box.

  • Interquartile range → box height.

  • Outliers → dots outside whiskers.

Example 4: Scatter Plot

If you have two numeric variables, scatter plots reveal relationships.

Code
ggplot(df_selected, aes(x = Age, y = Consumption)) +
  geom_point(alpha = 0.6, color = "purple") +
  labs(title = "Relationship between Age and Consumption",
       x = "Age of Household Head",
       y = "Consumption (RWF)")

The alpha setting controls transparency, useful for overlapping points.

Example 5: Adding Color and Facets

You can color points by category or split plots into subplots (facets).

Code
ggplot(df_selected, aes(x = Age, y = Consumption, color = Province)) +
  geom_point(alpha = 0.6) +
  facet_wrap(~ Province) +
  labs(title = "Consumption vs Age, by Province")

Benefits:

  • Color helps identify groups within the same chart.

  • Facets give each group its own panel.

💡 Key takeaway

ggplot2 allows you to layer information and customize everything, which is powerful for both exploration and presentation.

Quick Reference: Common geom_ Functions in ggplot2

Function | Best for | Example Variables
geom_bar() | Counts for categorical variables | Province, Sex
geom_col() | Bar charts when you already have counts or values | Consumption
geom_histogram() | Distribution of a single numeric variable | Consumption
geom_density() | Smoothed distribution curve | Consumption, Age
geom_boxplot() | Comparing distributions across groups | Consumption by Province
geom_violin() | Distribution + density shape | Consumption by PovertyStatus
geom_point() | Relationship between two numeric variables | Age vs Consumption
geom_line() | Trends over continuous variables (time, distance) | Year vs GDP
geom_area() | Filled area charts for trends | Year vs Rainfall
geom_text() / geom_label() | Adding text annotations to plots | Data labels for bars or points
💡 Key takeaway
  • Use geom_bar() when your data is raw counts and ggplot2 will do the counting.

  • Use geom_col() when you already calculated the values (e.g., from summarise()); see the sketch after this list.

  • Always start with aes() to define mappings (X, Y, color, fill, size, shape).

  • Build your plots layer by layer for flexibility.
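A tiny sketch of the geom_bar() vs geom_col() difference, using a made-up data frame so it stays self-contained:

Code
# geom_bar(): give ggplot2 the raw observations and let it do the counting
raw <- data.frame(group = c("A", "A", "B", "C", "C", "C"))
ggplot(raw, aes(x = group)) +
  geom_bar()

# geom_col(): give ggplot2 values you have already calculated
counts <- data.frame(group = c("A", "B", "C"), n = c(2, 1, 3))
ggplot(counts, aes(x = group, y = n)) +
  geom_col()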

📚 Exercises 5.1 Data Visualization with ggplot2

Dataset to use: df_selected

Q1. Bar Chart of Provinces

  1. Create a bar chart showing the number of respondents in each province.

  2. Color the bars by province and add a title “Number of Respondents per Province”.

Hint: Use geom_bar().

Q2. Average Consumption per Province

  1. Using dplyr, calculate the average consumption by province.

  2. Create a column chart using geom_col() with:

  • X-axis: province

  • Y-axis: average consumption

  • Bars filled by province

Hint: This is a good place to use %>% to chain group_by(), then summarise(), and pass the result to ggplot().

Q3. Histogram of Household Consumption

  1. Create a histogram of the Consumption variable.

  2. Use binwidth = 5000 for better granularity.

  3. Fill bars in light blue.

Q4. Boxplot of Consumption by Province

  1. Create a boxplot of Consumption grouped by province.

  2. Fill boxes by province and make the outline black.

  3. Add axis labels and title.

Q5. Challenge

  1. Create a facet wrapped bar chart showing the number of respondents by Sex within each province.

  2. Use facet_wrap(~ province) to create separate small charts for each province.

5.3 Handling Missing Values in R

In real-world datasets, missing data is common; it can result from incomplete surveys, data entry errors, or system failures.
In R, missing values are represented by the special symbol NA.

Understanding how to detect, quantify, and handle missing values is critical to ensure accurate analysis.

5.3.1 Step 1: Detect Missing Values

We can check for missing values in entire datasets, specific columns, or rows.

Code
# Count missing values in each column
colSums(is.na(df_selected))

# Check if the dataset contains any missing values at all
anyNA(df_selected)

# Count missing values in a specific column
sum(is.na(df_selected$Marital_status))

5.3.2 Step 2: Handling Missing Values

There are several strategies depending on the context:

(A) Removing Missing Values

If the missing data is small and random, you can drop those rows/columns.

Code
# Remove rows with any missing values
df_no_na <- na.omit(df_selected)

# Remove rows with NA in a specific column
df_no_marital_na <- df_selected %>%
  filter(!is.na(Marital_status))
⚠ Caution

This approach can cause data loss if many rows have missing values.

(B) Replacing Missing Values with a Category or Value

If NA means “Not Applicable” or “Unknown,” we can replace it with a label.

Code
df_clean <- df_selected %>%
  mutate(Marital_status = ifelse(is.na(Marital_status),
                                 "Not Applicable",
                                 as.character(Marital_status)))

(C) Imputing Numeric Missing Values

Sometimes it’s reasonable to replace missing numbers with mean, median, or a calculated value.

Code
df_imputed <- df_selected %>%
  mutate(Consumption = ifelse(is.na(Consumption),
                              mean(Consumption, na.rm = TRUE),
                              Consumption))

This avoids dropping rows, but be careful: imputation changes your data.
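A quick check you can run to see the effect, comparing the summaries before and after imputation:

Code
# Before: original column (with NAs, if any)
summary(df_selected$Consumption)

# After: imputed column from the step above
summary(df_imputed$Consumption)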

(D) Advanced Missing Data Handling

For more complex datasets, packages like mice or missForest can predict missing values using statistical models.

Example (not running here, just for reference):

Code
library(mice)
imputed <- mice(df_selected, m = 5, method = 'pmm', seed = 123)
df_filled <- complete(imputed, 1)

5.3.3 Step 3: Verifying Changes

After handling missing data, always confirm the changes:

Code
table(df_clean$Marital_status, useNA = "ifany")
colSums(is.na(df_clean))
💡Tip

Missing values are not just an annoyance — they can contain useful information (e.g., “Not Applicable” might mean a respondent is a child). Think carefully before deleting or replacing them.

📚Exercises 5.2 Handling Missing Values in R

Q1. Detecting Missing Values

  1. Use anyNA() to check if your dataset df_selected has any missing values.

  2. Count how many missing values are in the Age column.

  3. Which column has the highest number of missing values? (Hint: use colSums(is.na(...)))

Q2. Removing Missing Values

  1. Create a new dataset df_no_na by removing all rows with missing values.

  2. Create another dataset df_no_age_na that only removes rows where Age is missing, but keeps the rest of the data intact.

  3. Compare the number of rows in df_selected, df_no_na, and df_no_age_na. What do you observe?

Q3. Replacing Missing Categorical Values

  1. In the Marital_status column, replace missing values (NA) with "Not Applicable".

  2. Use table() to check how many entries were replaced.

  3. Why might "Not Applicable" be more informative than simply dropping those rows?

Q4. Imputing Numeric Missing Values

  1. In the Consumption column, replace missing values with the mean of that column.

  2. Now, try replacing missing values with the median instead.

  3. Which imputation method (mean vs. median) would you prefer if the data is skewed? Why?

Q5

  • Imagine you are working with a dataset where Income has 30% missing values.

  • Would you drop rows, impute with mean/median, or use an advanced method (e.g., mice)

5.4 Handling Outliers

Outliers are extreme values that deviate significantly from most of the data.
They can occur due to:

  • Data entry errors (typos, wrong units)

  • Measurement errors (faulty instruments, recording mistakes).

  • Genuine extreme values (e.g., very high income households).

⚠️

Outliers can skew averages, inflate variances, and mislead models. That’s why detecting and handling them is an important preprocessing step.

5.4.1 1. Detecting Outliers

The most common way is to use boxplots or summary statistics:

Code
# Basic summary
summary(df_selected$Consumption)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
    7339   125860   187121   294924   297387 12809880 
Code
# Boxplot to visualize outliers
ggplot(df_selected, aes(y = Consumption)) +
  geom_boxplot(fill = "steelblue") +
  labs(title = "Boxplot of Household Consumption")

👉 Values beyond the whiskers (1.5 × IQR rule) are considered potential outliers.
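A minimal sketch of the 1.5 × IQR rule applied to Consumption:

Code
# Quartiles and interquartile range
q1  <- quantile(df_selected$Consumption, 0.25, na.rm = TRUE)
q3  <- quantile(df_selected$Consumption, 0.75, na.rm = TRUE)
iqr <- q3 - q1

# Whisker limits
lower <- q1 - 1.5 * iqr
upper <- q3 + 1.5 * iqr

# How many observations fall outside the whiskers?
sum(df_selected$Consumption < lower | df_selected$Consumption > upper, na.rm = TRUE)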

5.4.2 2. Options for Handling Outliers

  1. Investigate
    • If it's a data entry error, fix or remove it.
    • If it's genuine (e.g., a rich household), keep it but consider robust methods.

  2. Remove
    • Drop extreme rows, but risk losing valuable data.

  3. Cap (Winsorize)
    • Replace outliers with a threshold percentile.
    • Useful when you want to reduce extreme influence without removing data.

5.4.3 3. Capping Outliers (Winsorization)

Here we “cap” values below the 10th percentile and above the 90th percentile:

Code
# Define the 10th and 90th percentiles
lower_bound <- quantile(df_selected$Consumption, 0.10, na.rm = TRUE)
upper_bound <- quantile(df_selected$Consumption, 0.90, na.rm = TRUE)

# Create a new capped column
df_capped <- df_selected %>%
  mutate(cons_capped = ifelse(Consumption > upper_bound, upper_bound, Consumption)) %>%
  mutate(cons_capped = ifelse(cons_capped < lower_bound, lower_bound, cons_capped))

# Compare boxplots: before vs after
par(mfrow = c(1, 2))
boxplot(df_selected$Consumption, main = "Original", col = "lightblue")
boxplot(df_capped$cons_capped, main = "Capped", col = "lightcoral")

👉 Notice how the capped version looks “cleaner,” with fewer extreme values.

5.4.4 4. Alternative: Z-score Method

Another common approach is to flag outliers based on standard deviations from the mean:

Code
# Compute z-scores
df_selected_z <- df_selected %>%
  mutate(z_score = (Consumption - mean(Consumption, na.rm = TRUE)) / sd(Consumption, na.rm = TRUE))

# Filter out rows where z-score is greater than 3 (too extreme)
df_no_outliers_z <- df_selected_z %>%
  filter(abs(z_score) < 3)

nrow(df_selected_z); nrow(df_no_outliers_z)
[1] 66081
[1] 65245

👉 This keeps only data within ±3 standard deviations.

5.4.5 5. When to Keep Outliers

  • If your analysis is about inequality, poverty, or income distribution, outliers are important signals.

  • Removing them could bias results.

  • In such cases, you may report results with and without outliers to show their effect.

📚Exercises: Handling Outliers
  1. Detection

    • Use a boxplot to visualize outliers in the Consumption column.
  2. Capping

    • Cap Consumption values below the 5th percentile and above the 95th percentile, and create a new column cons_capped95.
  3. Z-score Method

    • Remove rows where the z-score of Consumption is greater than 3 in absolute value.

    • Compare the number of rows before and after.


5.5 Practical Data Cleaning Example

Let's load a messy dataset, Data.csv, into R:

Code
library(tidyverse)
file_path <- "Data.csv"
data <- read_csv(file_path)

glimpse(data)
Rows: 26
Columns: 5
$ ID                <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1, 2, 3, 6, 7…
$ sex               <chr> "Male", "Male", "Fema", "5", "Female", "Male", "Male…
$ Province          <chr> "Souther", "Kiga City", "Kigali City", "Eastern", "N…
$ `Education level` <chr> "pr", "pr", "pr", "Uni", "Uni", "Uni", "Sec", "Sec",…
$ Age               <chr> "less than one year", "24", "33", "45", "20", "24", …
  1. Cleaning the ID column

Problems: IDs are repeated (e.g., 1,2,3 appear twice). Some are missing (not continuous).

Solution: Create a new unique ID based on row number.

Code
data <- data %>%
  mutate(ID = row_number())
  2. Cleaning the sex column

Problems: Inconsistent values: Male, Female, Fema, m, F.

Code
unique(data$sex)
[1] "Male"   "Fema"   "5"      "Female" "m"      "F"     
Code
# check with dplyr
#data %>%
#  distinct(sex)

Solution: Standardize all values to "Male" or "Female".

Code
data <- data %>%
  mutate(sex = case_when(
    sex %in% c("Male", "m") ~ "Male",
    sex %in% c("Female", "F", "Fema") ~ "Female",
    sex %in% c("5") ~ NA_character_,   # invalid value replaced with NA
    TRUE ~ NA_character_               # any other unexpected value
  ))


# Check results:
table(data$sex, useNA = "ifany")

Female   Male   <NA> 
     7     18      1 
  3. Cleaning the Province column

Problems: Typos (Souther, Northen, Weastern, Kiga City) and mixed spellings.

Solution: Standardize names properly.

Code
data <- data %>%
  mutate(Province = case_when(
    str_detect(Province, "Kiga") ~ "Kigali City",
    str_detect(Province, "South") | str_detect(Province, "Souther") ~ "Southern Province",
    str_detect(Province, "North") | str_detect(Province, "Northen") ~ "Northern Province",
    str_detect(Province, "East") ~ "Eastern Province",
    str_detect(Province, "West") | str_detect(Province, "Weast") | str_detect(Province, "Weastern") ~ "Western Province",
    TRUE ~ Province
  ))

# check
table(data$Province, useNA = "ifany")

 Eastern Province       Kigali City Northern Province Southern Province 
                4                 8                 4                 4 
 Western Province 
                6 
  4. Cleaning the Education level column

Problems: Mixed categories (pr, Uni, Sec, Nursary, Non educ) and blank values. Some are inconsistent abbreviations.

Solution: Map categories to consistent labels.

Code
#Use rename() in dplyr
data <- data %>%
  rename(Education_level = `Education level`)


# if you want to clean all column names automatically (good for larger datasets)
#library(janitor)
#data <- data %>%
#  janitor::clean_names()


data <- data %>%
  mutate(Education_level = case_when(
    Education_level %in% c("pr") ~ "Primary",
    Education_level %in% c("Sec") ~ "Secondary",
    Education_level %in% c("Uni") ~ "University",
    str_detect(Education_level, "Nurs") ~ "Nursery",
    str_detect(Education_level, "Non") ~ "No Education",
    Education_level == "" | is.na(Education_level) ~ NA_character_,
    TRUE ~ Education_level
  ))

#check
table(data$Education_level, useNA = "ifany")

No Education      Nursery      Primary    Secondary   University         <NA> 
           3            3            8            6            4            2 
  5. Cleaning the Age column

Problems: Some rows have Age recorded as text: "less than one year". Some missing values.

Solution: Convert to numeric, and handle "less than one year".

Code
data <- data %>%
  mutate(Age = case_when(
    Age == "less than one year" ~ "0",
    TRUE ~ Age
  )) %>%
  mutate(Age = as.numeric(Age))

#Check
summary(data$Age)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
   0.00   24.00   24.00   28.58   33.00   45.00       2 
Code
#Final Cleaned Data
glimpse(data)
Rows: 26
Columns: 5
$ ID              <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,…
$ sex             <chr> "Male", "Male", "Female", NA, "Female", "Male", "Male"…
$ Province        <chr> "Southern Province", "Kigali City", "Kigali City", "Ea…
$ Education_level <chr> "Primary", "Primary", "Primary", "University", "Univer…
$ Age             <dbl> 0, 24, 33, 45, 20, 24, 24, 33, 45, 20, 24, NA, 33, 45,…
👩🏻‍💻Final Assignment

Q1. Data Selection with dplyr

Load the dataset cs_s1_s2_s3_s4_s6a_s6e_s6f_person.dta.
Using dplyr, create a new data frame that contains only individuals from the Western Province who are older than 30 years.
Select only the following columns:

  • district,

  • Age (s1q3y)

  • Consumption.

Q2. Aggregation

Using the data frame from Question 1, calculate the average Consumption per district in the Western Province.

Q3. Visualization

Create a bar chart with ggplot2 showing the average Consumption per district (from Question 2).

Q4. Missing Values

Check the dataset for missing values in the Marital_status column:

  • How many NAs are there?

  • Replace them with "Not Applicable" and confirm your change.

Q5. Outliers

Inspect the Consumption column:

  1. Create a boxplot of Consumption values.

  2. Cap values at the 5th and 95th percentiles and save them in a new column cons_capped95.

  3. Why might it sometimes be better to keep outliers instead of removing or capping them? Provide one example.