1 What is R?

R is a statistical programming language created at the University of Auckland (New Zealand) by Ross Ihaka and Robert Gentleman.
It evolved from the S language developed at Bell Labs.
At the time of writing, the current R version is R version 4.5.1 (2025-06-13 ucrt).

1.1 Introduction to R and Core Concepts

Welcome to R!

R is a powerful, free environment for statistical computing and graphics.

1.1.1 Why R?

  • Open-source and free
  • Powerful for statistics
  • Excellent visualization (e.g., ggplot2)
  • Large community and thousands of packages

1.1.2 Reproducible reports with R Markdown

With R Markdown you can mix:

  • Text (explanations, comments)
  • Code (R chunks)
  • Outputs (tables, plots)

to create reproducible reports.

RStudio IDE has four panes:
Source (scripts, Rmd), Console, Environment/History, and Files/Plots/Packages/Help.


2 Operators

R can perform calculations directly in the console.

2.1 Arithmetic operators

Addition: +

1 + 1
## [1] 2

Subtraction: -

2 - 5
## [1] -3

Multiplication: *

5 * 6
## [1] 30

Division: /

9 / 3
## [1] 3

Modulus (remainder of a division): %%

6 %% 2
## [1] 0

Exponent: ^ or **

2 ^ 10   # or 2 ** 10
## [1] 1024

Integer division: %/%

1035 %/% 3
## [1] 345
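
Integer division and modulus are two halves of the same calculation: the quotient times the divisor, plus the remainder, gives back the original number. A minimal sketch with values chosen only for illustration:

```r
# Hypothetical values, chosen so that the division is not exact
a <- 1037
b <- 3

quotient  <- a %/% b   # whole number of times b fits into a
remainder <- a %% b    # what is left over

# The two results reconstruct the original value
check <- quotient * b + remainder
print(c(quotient = quotient, remainder = remainder, check = check))
```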

2.2 Comparison (relational) operators

Less than: <

1 < 0
## [1] FALSE

Less than or equal to: <=

1 <= 1
## [1] TRUE

Greater than: >

4 > 5
## [1] FALSE

Greater than or equal to: >=

3 >= 3
## [1] TRUE

Exactly equal to: ==

"R" == "r"
## [1] FALSE

The equality operator is vectorized, so a single value can be compared against every element of a vector:

"Species" == c("Sepal.Length", "Sepal.Width", "Petal.Length", 
               "Petal.Width", "Species")
## [1] FALSE FALSE FALSE FALSE  TRUE

Not equal to: !=

5 != 5
## [1] FALSE

2.3 Negation (NOT)

The ! operator flips TRUE to FALSE and FALSE to TRUE.

!TRUE   # or !T
## [1] FALSE
!(T & F)   # this is TRUE
## [1] TRUE
!(F | T)   # this is FALSE
## [1] FALSE

2.4 AND: &

TRUE & TRUE
## [1] TRUE
TRUE & FALSE
## [1] FALSE
FALSE & FALSE
## [1] FALSE
FALSE & TRUE
## [1] FALSE

2.5 OR: |

TRUE | TRUE
## [1] TRUE
TRUE | FALSE
## [1] TRUE
FALSE | FALSE
## [1] FALSE
FALSE | TRUE
## [1] TRUE
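
Both & and | also work element by element on logical vectors, which is what makes them useful for combining conditions. A minimal sketch with made-up ages (the vector is hypothetical):

```r
ages <- c(25, 40, 60)   # hypothetical values

in_range <- ages >= 30 & ages < 50    # both conditions must hold
outside  <- ages < 30 | ages >= 50    # at least one condition holds

print(in_range)
print(outside)
print(ages[in_range])   # keep only the elements where in_range is TRUE
```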

2.6 Value Matching

In R, we have built-in functions to match elements in a vector.

The first is match(). It returns the position of the first match of its first argument in its second argument.

match("Species", c("Sepal.Length", "Sepal.Width", "Petal.Length", 
                   "Petal.Width", "Species"))
## [1] 5

The second is %in%, which checks the existence of a value in a vector.

"Species" %in% c("Sepal.Length", "Sepal.Width", "Petal.Length", 
                 "Petal.Width", "Species")
## [1] TRUE
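
It may help to see what the two functions do when a value is absent, or when several values are looked up at once. The sketch below uses a shortened column list and a made-up name ("Petal.Area") that is deliberately not in it:

```r
cols <- c("Sepal.Length", "Sepal.Width", "Species")

# match() returns NA when the value is not found
pos_missing <- match("Petal.Area", cols)

# Both functions accept several values on the left-hand side
positions <- match(c("Species", "Sepal.Width"), cols)
present   <- c("Species", "Petal.Area") %in% cols

print(pos_missing)
print(positions)
print(present)
```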

3 R objects and assignment

In R we can use <-, = (single equal sign!), and -> to assign a value to a variable.

A variable name:

  • must begin with a letter or a dot (a leading dot cannot be followed by a digit).
  • should not contain spaces (use _ or . instead).
  • can contain numbers, but cannot start with a number.

# This will give an error because of the space:
# t training <- "r programming"

Valid examples:

a  <- 5
b  <- 6
0  -> .a
a1 = 0.2

4 Data types

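In R, the basic (atomic) data types are:

  • numeric
  • integer
  • complex
  • character
  • logical
  • raw

A factor is not an atomic type but a class built on top of integer vectors. It is covered alongside the basic types because it is used so often.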

4.1 Numeric / double

Examples: 15.5, 505, 38, pi

q <- 10.7
print(class(q))
## [1] "numeric"
print(typeof(q))
## [1] "double"

4.2 Integer

You can create an integer by adding L, e.g. 1L, 5L, 10L.

q <- 5L
print(class(q))
## [1] "integer"
print(typeof(q))
## [1] "integer"

4.3 Complex

Example: 3 + 1i, where 3 is the real part and 1i is the imaginary part (i denotes the imaginary unit).

q <- 3 + 1i
print(class(q))
## [1] "complex"
print(typeof(q))
## [1] "complex"
p1 <- a + 1i * b
print(p1)
## [1] 5+6i

4.4 Character / string

string <- "I am Learning R"
class(string)
## [1] "character"

Remember: "LeaRning" is different from "Learning" – R is case-sensitive.

4.5 Logical / Boolean (TRUE or FALSE)

TRUE   # or T
## [1] TRUE
FALSE  # or F
## [1] FALSE

Logical output often comes from comparisons:

"LeaRning" == "Learning"
## [1] FALSE

4.6 Raw

text <- "Christian Mugisha."
(raw_text <- charToRaw(text))
##  [1] 43 68 72 69 73 74 69 61 6e 20 4d 75 67 69 73 68 61 2e
class(raw_text)
## [1] "raw"

Converting raw back to text:

rawToChar(raw_text)
## [1] "Christian Mugisha."

4.7 Factors

Factors represent categorical variables (e.g., gender, levels, ratings).

Gender <- factor(c("Female", "Male"))
print(Gender)
## [1] Female Male  
## Levels: Female Male
class(Gender)
## [1] "factor"
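
By default, factor levels are sorted alphabetically. For categories with a natural order (such as ratings) the order can be given explicitly. A minimal sketch, assuming a hypothetical three-level rating scale:

```r
# Hypothetical ratings, for illustration only
ratings <- factor(
  c("Low", "High", "Medium", "Low"),
  levels  = c("Low", "Medium", "High"),  # explicit order instead of alphabetical
  ordered = TRUE
)

print(levels(ratings))
print(table(ratings))
print(ratings[2] > ratings[1])  # comparisons are allowed for ordered factors
```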

4.8 Logical example

v <- TRUE
w <- FALSE

class(v); typeof(v)
## [1] "logical"
## [1] "logical"
!v
## [1] FALSE
isTRUE(w)
## [1] FALSE

5 Creating objects

5.1 Numeric object

t <- 10
x <- numeric(t)   # creates a numeric vector of length t
print(x)
##  [1] 0 0 0 0 0 0 0 0 0 0
# assigning values to x:
x[1] <- 2.5
print(x)
##  [1] 2.5 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

5.2 Integer

n <- 5
x <- integer(n)   # creates an integer vector of length n
print(x)
## [1] 0 0 0 0 0
class(x)
## [1] "integer"
# assigning values to x:
x[1] <- 2.5       # a non-integer value coerces the whole vector to numeric
class(x)
## [1] "numeric"
print(x)
## [1] 2.5 0.0 0.0 0.0 0.0

6 Conversions

To convert data types we use as.<type>() functions, e.g.:

  • as.character()
  • as.numeric()
  • as.factor()
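
The sketch below shows these conversions on small made-up vectors, including two common pitfalls: text that cannot be parsed becomes NA, and a factor holding digits must be converted through character to recover its values.

```r
num_text <- c("10", "2.5", "abc")   # hypothetical input

# "abc" cannot be parsed, so it becomes NA (R would also emit a warning)
as_num <- suppressWarnings(as.numeric(num_text))

as_chr <- as.character(c(1.5, 2))
as_fac <- as.factor(c("b", "a", "b"))

# Converting a factor of digits: go through character first,
# otherwise you get the internal integer codes
f      <- factor(c("10", "20", "10"))
codes  <- as.numeric(f)                 # internal codes
values <- as.numeric(as.character(f))   # the values the labels spell out

print(as_num)
print(as_chr)
print(levels(as_fac))
print(codes)
print(values)
```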

7 Scalars and vectors (1D)

  • A scalar is a single value (vector of length 1).
  • A vector is a collection of objects of the same type.

Let’s create a vector with elements of different types to see how R will handle them.

7.0.1 Numerics, characters, logicals

# ?c  # help for c()
v <- c(1, "R", TRUE, FALSE, NA)
print(v)
## [1] "1"     "R"     "TRUE"  "FALSE" NA
class(v)
## [1] "character"

R coerces every element to character. NA stays a missing value (it becomes a character NA), which is why it prints without quotes.

7.0.2 Numeric and logical

v2 <- c(1, 4, 8, FALSE, TRUE, FALSE, FALSE, TRUE, "R" == "r")
print(v2)
## [1] 1 4 8 0 1 0 0 1 0

7.1 Vector arithmetic

Arithmetic on vectors is vectorized in R: operations apply element by element.

a <- c(5, 6, 7)
b <- c(10, 20, 30)

# Addition (1st+1st, 2nd+2nd, etc.)
a + b
## [1] 15 26 37
# Multiplication
a * b
## [1]  50 120 210

7.1.1 Recycling rule

If one vector is shorter, R recycles it:

a <- c(1, 2, 3, 4)
b <- c(10, 20)

a + b
## [1] 11 22 13 24
# b is recycled: (10, 20, 10, 20)

⚠️ Be careful!
If the longer vector’s length is not a multiple of the shorter one, R will show a warning.
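
A small sketch of that situation, using made-up vectors of length 3 and 2. R still returns a result, so the warning is easy to overlook. Here the warning text is captured with tryCatch() so it can be inspected:

```r
a <- c(1, 2, 3)
b <- c(10, 20)

# Capture the warning message instead of letting it scroll past
msg <- tryCatch(a + b, warning = function(w) conditionMessage(w))

# The computation itself still succeeds: b is recycled as (10, 20, 10)
result <- suppressWarnings(a + b)

print(msg)
print(result)
```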

7.1.2 Accessing elements

# Get the first item in v2
v2[1]
## [1] 1
# Get the 2nd and 4th elements
v2[c(2, 4)]
## [1] 4 0
# Exclude the 3rd element
v2[-3]
## [1] 1 4 0 1 0 0 1 0

7.1.3 Modifying vectors

# Change the first element
v2[1] <- 27
print(v2)
## [1] 27  4  8  0  1  0  0  1  0
# Add a new element
v21 <- c(v2, 81)
v21
##  [1] 27  4  8  0  1  0  0  1  0 81

📘 Exercise 1.2

Q1. Create Variables
Create:
  • my_name containing your name (character)
  • my_age containing your age (numeric)
  • is_statistician with TRUE/FALSE

Then print class() of each.

Q2. Vector Operations
  1. Create expenses <- c(1500, 2000, 1200, 3000).
  2. Create income <- 10000.
  3. Compute savings: income - sum(expenses).

Q3. Store Sales
  1. Create sales_q1 <- c(120, 150, 90).
  2. Create sales_q2 <- c(130, 160, 95).
  3. Compute total per store: sales_q1 + sales_q2.
  4. Compute percentage increase: ((sales_q2 - sales_q1) / sales_q1) * 100.
  5. Extract the increases that are greater than 10%.

💡 Tip

In R, even a single number like 42 is a vector of length 1.


8 Data Frames and Data Import/Export

8.1 Data frames

The most important data structure for data analysis in R is the data frame:

  • Like a spreadsheet (rows = observations, columns = variables)
  • Each column can have a different type

8.2 Creating a data frame

employee_data <- data.frame(
  id     = c(1, 2, 3),
  name   = c("John", "Jane", "Peter"),
  salary = c(50000, 55000, 52000)
)

employee_data
##   id  name salary
## 1  1  John  50000
## 2  2  Jane  55000
## 3  3 Peter  52000

8.2.1 Structure and summary

str(employee_data)
## 'data.frame':    3 obs. of  3 variables:
##  $ id    : num  1 2 3
##  $ name  : chr  "John" "Jane" "Peter"
##  $ salary: num  50000 55000 52000
summary(employee_data)
##        id          name               salary     
##  Min.   :1.0   Length:3           Min.   :50000  
##  1st Qu.:1.5   Class :character   1st Qu.:51000  
##  Median :2.0   Mode  :character   Median :52000  
##  Mean   :2.0                      Mean   :52333  
##  3rd Qu.:2.5                      3rd Qu.:53500  
##  Max.   :3.0                      Max.   :55000

8.2.2 Accessing data

# Column by name
employee_data$name
## [1] "John"  "Jane"  "Peter"
# By index (row, column)
employee_data[1, 2]          # Row 1, column 2
## [1] "John"
employee_data[ , "salary"]   # All rows, salary column
## [1] 50000 55000 52000
employee_data[ , c("name", "salary")]  # Multiple columns
##    name salary
## 1  John  50000
## 2  Jane  55000
## 3 Peter  52000
employee_data[1:2, ]         # First two rows
##   id name salary
## 1  1 John  50000
## 2  2 Jane  55000

8.2.3 Adding columns and rows

# Add a new column
employee_data$department <- c("HR", "Finance", "IT")

# Add a new row
new_row <- data.frame(
  id         = 4,
  name       = "Alice",
  salary     = 60000,
  department = "Marketing"
)
employee_data <- rbind(employee_data, new_row)
employee_data
##   id  name salary department
## 1  1  John  50000         HR
## 2  2  Jane  55000    Finance
## 3  3 Peter  52000         IT
## 4  4 Alice  60000  Marketing

8.3 Importing real data

In practice, we read from files:

# Read CSV file
# my_data <- read.csv("Data.csv")

# Read Excel file
# library(readxl)
# my_data <- read_excel("Data.xlsx")
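
The section title also mentions export. A minimal round-trip sketch, assuming a small made-up data frame and a temporary file so that nothing is written to your project folder:

```r
# Small hypothetical data frame
staff <- data.frame(
  id   = 1:3,
  name = c("John", "Jane", "Peter")
)

csv_path <- tempfile(fileext = ".csv")          # throwaway file location
write.csv(staff, csv_path, row.names = FALSE)   # export without row numbers

staff_back <- read.csv(csv_path)                # import again
print(staff_back)
```

Setting row.names = FALSE avoids an extra unnamed column of row numbers in the file.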

📘 Exercise 2 – Data Frames

Q1. Create a data frame students with:
  • id (1 to 5)
  • name (5 student names)
  • grade (5 numeric grades)

Q2. Extract the grades of the first three students.

Q3. Add a new column pass that is TRUE if grade ≥ 50, FALSE otherwise.

Q4. Add a new row for a sixth student.


9 Installing and Loading Packages

One of R’s biggest strengths is its community-contributed packages.
A package is like an app for R. It bundles:

  • Functions (tools)
  • Datasets
  • Documentation

9.1 Installing packages

# Install (run once per machine)
# install.packages("haven")
# install.packages("tidyverse")

💡 Tip

Every time you start a new R session, you must load the package with library().

# Load packages for this session
library(haven)       # For reading Stata, SPSS, SAS files
library(tidyverse)   # For data manipulation & visualization

9.1.1 Why haven and tidyverse?

  • haven: Reads data from Stata (.dta), SPSS (.sav), SAS files, preserving labels.
  • readr: Part of the tidyverse, provides read_csv() for reading CSVs.
  • tidyverse: A collection of packages for manipulation (dplyr), visualization (ggplot2), tidying (tidyr), and more.

9.2 Checking if a package is installed

# Check installed packages
# installed.packages()

# Quickly check one package
"haven"   %in% rownames(installed.packages())
## [1] TRUE
"ggplot2" %in% rownames(installed.packages())
## [1] TRUE
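
An alternative that avoids building the full table of installed packages is requireNamespace(), which returns TRUE or FALSE without attaching anything. A minimal sketch: "stats" ships with R, while the second name is made up to show the FALSE case.

```r
# TRUE if the package can be loaded, FALSE otherwise (nothing is attached)
has_stats <- requireNamespace("stats", quietly = TRUE)

# A made-up package name, used only to show the FALSE case
has_fake <- requireNamespace("notARealPackage123", quietly = TRUE)

print(has_stats)
print(has_fake)
```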

Updating packages:

# Update all installed packages
# update.packages()

📘 Exercise 3 – Packages

  • Install the package readxl.
  • Load it with library(readxl).
  • Check if ggplot2 is installed.

💡 Tip

If you get the error “there is no package called …”, you must install it first with install.packages().


10 Importing the Rwanda Teachers Data

One of the first steps in data analysis is importing data.
R can read: .csv, .xlsx, .dta, .sav, .json, etc.

In this course, we use a CSV file: rwanda_teachers_500.csv.

We will use base R read.csv() here (you could also use readr::read_csv()).

10.1 Working directory

setwd("C:\\Users\\HP\\Desktop\\R PROGRAMING") # set working directory
getwd()                                       # confirm
## [1] "C:/Users/HP/Desktop/R PROGRAMING"
list.files()                                  # list files in the folder
## [1] "data_cleaned.csv"                "rwanda_teachers_500.csv"        
## [3] "rwanda_teachers_500_cleaned.csv"

10.2 Importing CSV data

# Read the CSV file (with the working directory set as above, the file name alone would also work)
data <- read.csv("C:\\Users\\HP\\Desktop\\R PROGRAMING\\rwanda_teachers_500.csv")

# Display first 5 rows
head(data, 5)
##      Province   District  Sector      Teacher_Name Teacher_ID Education_Level
## 1 Kigali City     Gasabo   Rongi   Aline Munyaneza  T25000138              A2
## 2    Southern   Gisagara  Remera Noella Nkurunziza  T25000239              A1
## 3     Western Nyamasheke Gahanga  Leodomir Murerwa  T25000345              A0
## 4    Northern    Gicumbi Gatenga  Sophie Niyonzima  T25000423              A2
## 5 Kigali City Nyarugenge Gatsibo   Tijara Habimana  T25000558              A2
##   School_ID School_Level   Subject_Taught Date_of_Birth Gender
## 1  SCH00001      Primary        Geography    1998-04-15   Male
## 2  SCH00002  Pre-primary           French    2002-08-21 Female
## 3  SCH00003  Pre-primary      Kinyarwanda    1998-10-19 Female
## 4  SCH00004  Pre-primary        Chemistry    1982-11-11 Female
## 5  SCH00005      Primary Entrepreneurship    1983-05-04 Female
# Look at the structure
str(data)
## 'data.frame':    500 obs. of  11 variables:
##  $ Province       : chr  "Kigali City" "Southern" "Western" "Northern" ...
##  $ District       : chr  "Gasabo" "Gisagara" "Nyamasheke" "Gicumbi" ...
##  $ Sector         : chr  "Rongi" "Remera" "Gahanga" "Gatenga" ...
##  $ Teacher_Name   : chr  "Aline Munyaneza" "Noella Nkurunziza" "Leodomir Murerwa" "Sophie Niyonzima" ...
##  $ Teacher_ID     : chr  "T25000138" "T25000239" "T25000345" "T25000423" ...
##  $ Education_Level: chr  "A2" "A1" "A0" "A2" ...
##  $ School_ID      : chr  "SCH00001" "SCH00002" "SCH00003" "SCH00004" ...
##  $ School_Level   : chr  "Primary" "Pre-primary" "Pre-primary" "Pre-primary" ...
##  $ Subject_Taught : chr  "Geography" "French" "Kinyarwanda" "Chemistry" ...
##  $ Date_of_Birth  : chr  "1998-04-15" "2002-08-21" "1998-10-19" "1982-11-11" ...
##  $ Gender         : chr  "Male" "Female" "Female" "Female" ...

You should see 500 rows and 11 columns. All 11 columns are read as chr (character).
That’s normal after reading a CSV, but we should:

  • Turn Date_of_Birth into a real Date
  • Turn categories (Province, District, …) into factors
  • Check duplicates, missing values, and simple counts

⚠️ Be careful!
If you have typos in labels (e.g., “Nyarugenge” vs “Nyarugunga”), clean the strings first, then convert to factors.
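
A minimal sketch of that order of operations, using a made-up vector of district labels (the stray spaces, lower-case entry, and the misspelling "Nyarugnge" are all hypothetical):

```r
# Hypothetical labels with stray spaces, mixed case and one typo
district_raw <- c("Gasabo", " gasabo", "Nyarugenge ", "Nyarugnge")

# 1. Trim whitespace and standardise capitalisation
district_clean <- trimws(district_raw)
district_clean <- paste0(toupper(substr(district_clean, 1, 1)),
                         tolower(substring(district_clean, 2)))

# 2. Fix known typos explicitly
district_clean[district_clean == "Nyarugnge"] <- "Nyarugenge"

# 3. Only now convert to factor
district <- factor(district_clean)
print(levels(district))
print(table(district))
```

Converting first and cleaning afterwards would leave four separate levels instead of two.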


11 Data Cleaning on Teachers Data

11.1 Converting variables to proper types

# Convert Date_of_Birth to Date
data$Date_of_Birth  <- as.Date(data$Date_of_Birth, format = "%Y-%m-%d")

# Convert to factors
data$Subject_Taught <- as.factor(data$Subject_Taught)
data$Sector         <- as.factor(data$Sector)

# Quick check
summary(data$Subject_Taught)
##          Biology        Chemistry Computer Science          English 
##               48               41               44               45 
## Entrepreneurship           French        Geography          History 
##               50               52               41               51 
##      Kinyarwanda      Mathematics          Physics 
##               46               49               33
summary(data$Sector)
##    Bigogwe   Bugarama    Bumbogo     Busoro  Bwishyura    Gahanga   Gashonga 
##         10          5          8         15          5          8          7 
##    Gashora    Gatenga    Gatsibo   Gihundwe     Gikoma    Gikondo    Gikonko 
##          5          8         12          5          7          5          9 
##     Gisozi     Jabana       Jali      Jenda  Kabarondo   Kabarore    Kabatwa 
##         12          4          9          5         13          6         11 
##    Kacyiru    Kanombe     Karama  Karangazi   Kibilizi   Kigabiro   Kigarama 
##          6          9         11         17         12          7         17 
##  Kimironko   Kinyinya     Kitabi     Kiyovu    Kiyumba   Kiziguro      Mamba 
##          3         10          6          7          5          5         11 
##     Masaka    Matimba   Mugesera  Mukarange     Mukura    Murunda      Musha 
##          6          6          5          6          8         10          8 
##      Ndera     Ntyazo  Nyamabuye    Nyamata Nyamirambo     Remera     Rilima 
##          8          8          8          7         10         10         11 
##      Rongi  Rubengera   Ruhashya   Rusororo  Rwinkwavu       Save     Shangi 
##         12          6         15          9          6          7         11 
##    Shyogor    Shyogwe      Tumba 
##          8         14          6
str(data$Date_of_Birth)
##  Date[1:500], format: "1998-04-15" "2002-08-21" "1998-10-19" "1982-11-11" "1983-05-04" ...
# Create Age in years
data$Age <- as.integer(floor((Sys.Date() - data$Date_of_Birth) / 365.25))
summary(data$Age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   22.00   33.00   43.00   44.08   54.00   65.00

Why 365.25?

  • 3 normal years (365 days), 1 leap year (366 days)
  • Average = (3×365 + 366) / 4 = 365.25 days per year
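
The same calculation for a single date of birth, as a sketch. A fixed reference date is assumed here instead of Sys.Date() so that the result does not change from day to day:

```r
dob      <- as.Date("1998-04-15")
ref_date <- as.Date("2025-10-01")   # fixed reference date, assumed for this sketch

days_lived <- as.numeric(ref_date - dob)             # difference in days
age        <- as.integer(floor(days_lived / 365.25)) # completed years (approximate)

print(days_lived)
print(age)
```

Dividing by 365.25 is an approximation, so the result can be off by one year for someone whose birthday falls on or right next to the reference date.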

# Convert more categorical variables
data$Education_Level <- as.factor(data$Education_Level)
data$Province        <- as.factor(data$Province)
data$District        <- as.factor(data$District)
data$School_Level    <- as.factor(data$School_Level)

str(data)
## 'data.frame':    500 obs. of  12 variables:
##  $ Province       : Factor w/ 5 levels "Eastern","Kigali City",..: 2 4 5 3 2 3 2 2 1 1 ...
##  $ District       : Factor w/ 30 levels "Bugesera","Burera",..: 4 7 21 6 23 27 12 23 19 16 ...
##  $ Sector         : Factor w/ 59 levels "Bigogwe","Bugarama",..: 50 48 6 9 10 28 16 44 49 21 ...
##  $ Teacher_Name   : chr  "Aline Munyaneza" "Noella Nkurunziza" "Leodomir Murerwa" "Sophie Niyonzima" ...
##  $ Teacher_ID     : chr  "T25000138" "T25000239" "T25000345" "T25000423" ...
##  $ Education_Level: Factor w/ 3 levels "A0","A1","A2": 3 2 1 3 3 2 3 3 1 2 ...
##  $ School_ID      : chr  "SCH00001" "SCH00002" "SCH00003" "SCH00004" ...
##  $ School_Level   : Factor w/ 4 levels "Lower Secondary",..: 3 2 2 2 3 1 3 3 4 3 ...
##  $ Subject_Taught : Factor w/ 11 levels "Biology","Chemistry",..: 7 6 9 2 5 4 5 5 11 5 ...
##  $ Date_of_Birth  : Date, format: "1998-04-15" "2002-08-21" ...
##  $ Gender         : chr  "Male" "Female" "Female" "Female" ...
##  $ Age            : int  27 23 27 43 42 26 44 41 63 32 ...

11.2 Age groups

summary(data$Age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   22.00   33.00   43.00   44.08   54.00   65.00

The summary shows a minimum of 22 and a maximum of 65, so the breaks below are chosen to cover that range.

data$Age_Group <- cut(
  data$Age,
  breaks = c(22, 27, 37, 47, 57, 66),   # 66 to include 65
  labels = c("22-26", "27-36", "37-46", "47-56", "57-65"),
  right  = FALSE                        # [22,27), [27,37), etc.
)

# Check distribution
table(data$Age_Group)
## 
## 22-26 27-36 37-46 47-56 57-65 
##    38   119   132   103   108
prop.table(table(data$Age_Group))
## 
## 22-26 27-36 37-46 47-56 57-65 
## 0.076 0.238 0.264 0.206 0.216
# Sanity checks
summary(data$Age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   22.00   33.00   43.00   44.08   54.00   65.00
sum(is.na(data$Age_Group))
## [1] 0
levels(data$Age_Group)
## [1] "22-26" "27-36" "37-46" "47-56" "57-65"

⚠️ Be careful with cut():
right = FALSE → bins are [lower, upper).
right = TRUE → bins are (lower, upper].
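
The difference shows up for values that sit exactly on a break. A minimal sketch with three made-up ages and simplified breaks:

```r
ages   <- c(22, 27, 65)   # hypothetical values on or near the breaks
breaks <- c(22, 27, 66)

left_closed  <- cut(ages, breaks = breaks, right = FALSE)  # bins [22,27) and [27,66)
right_closed <- cut(ages, breaks = breaks, right = TRUE)   # bins (22,27] and (27,66]

print(data.frame(ages, left_closed, right_closed))
```

With right = TRUE the lowest value (22) falls outside every bin and becomes NA, unless include.lowest = TRUE is also set.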


11.3 Checking missing values

any(is.na(data))          # any missing at all?
## [1] FALSE
sum(is.na(data))          # total missing cells
## [1] 0
colSums(is.na(data))      # per column
##        Province        District          Sector    Teacher_Name      Teacher_ID 
##               0               0               0               0               0 
## Education_Level       School_ID    School_Level  Subject_Taught   Date_of_Birth 
##               0               0               0               0               0 
##          Gender             Age       Age_Group 
##               0               0               0
rowSums(is.na(data))      # per row
##   [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
##  [38] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
##  [75] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [112] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [149] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [186] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [223] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [260] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [297] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [334] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [371] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [408] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [445] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [482] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

We no longer need Date_of_Birth since we created Age and Age_Group.

data <- subset(data, select = -Date_of_Birth)
names(data)
##  [1] "Province"        "District"        "Sector"          "Teacher_Name"   
##  [5] "Teacher_ID"      "Education_Level" "School_ID"       "School_Level"   
##  [9] "Subject_Taught"  "Gender"          "Age"             "Age_Group"

11.4 Duplicates

any(duplicated(data))
## [1] FALSE
data[duplicated(data), ]
##  [1] Province        District        Sector          Teacher_Name   
##  [5] Teacher_ID      Education_Level School_ID       School_Level   
##  [9] Subject_Taught  Gender          Age             Age_Group      
## <0 rows> (or 0-length row.names)
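
duplicated(data) only flags rows that are identical in every column. A record entered twice with one differing field would slip through, so it can help to check the identifier column on its own. A minimal sketch with a made-up data frame (the IDs and sectors are hypothetical):

```r
# Hypothetical records: T002 appears twice with a different sector
teachers <- data.frame(
  Teacher_ID = c("T001", "T002", "T002", "T003"),
  Sector     = c("Remera", "Gikondo", "Kacyiru", "Remera")
)

whole_row_dup <- any(duplicated(teachers))       # no two rows are fully identical
dup_ids       <- duplicated(teachers$Teacher_ID) # flags the second occurrence of an ID

print(whole_row_dup)
print(teachers[dup_ids, ])
```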

12 Descriptive Statistics on Teachers Data

12.1 Frequency counts

table(data$Province)        # Province distribution
## 
##     Eastern Kigali City    Northern    Southern     Western 
##          93          92         105         107         103
table(data$Education_Level) # Education level
## 
##  A0  A1  A2 
## 142 213 145
table(data$Subject_Taught)  # Subject taught
## 
##          Biology        Chemistry Computer Science          English 
##               48               41               44               45 
## Entrepreneurship           French        Geography          History 
##               50               52               41               51 
##      Kinyarwanda      Mathematics          Physics 
##               46               49               33

12.2 Summary statistics for Age

summary(data$Age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   22.00   33.00   43.00   44.08   54.00   65.00
mean(data$Age, na.rm = TRUE)
## [1] 44.082
median(data$Age, na.rm = TRUE)
## [1] 43

12.3 Cross‑tabulations

# Teachers by Province × Gender
table(data$Province, data$Gender)
##              
##               Female Male
##   Eastern         41   52
##   Kigali City     38   54
##   Northern        49   56
##   Southern        49   58
##   Western         41   62
# Teachers by School_Level × Subject_Taught
table(data$School_Level, data$Subject_Taught)
##                  
##                   Biology Chemistry Computer Science English Entrepreneurship
##   Lower Secondary       8         7               11      14                9
##   Pre-primary           4         6                2       2                4
##   Primary              22        18               23      22               30
##   Upper Secondary      14        10                8       7                7
##                  
##                   French Geography History Kinyarwanda Mathematics Physics
##   Lower Secondary     14        14      14          22           8       9
##   Pre-primary          3         1       6           2           6       4
##   Primary             23        17      20          16          21      10
##   Upper Secondary     12         9      11           6          14      10

12.4 Proportions and percentages

prop.table(table(data$Age_Group))
## 
## 22-26 27-36 37-46 47-56 57-65 
## 0.076 0.238 0.264 0.206 0.216

Interpretation: About half of the teachers (50.2%) are in the 27–46 age range, so the workforce is mainly mid-career, with few very young teachers (22–26, 7.6%) and a sizeable share nearing retirement (57–65, 21.6%).
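
To report percentages rather than proportions, multiply by 100 and round. The sketch below re-enters the age-group counts from the table() output shown earlier as a named vector, so it runs without the data file:

```r
# Counts copied from the table(data$Age_Group) output
age_counts <- c("22-26" = 38, "27-36" = 119, "37-46" = 132,
                "47-56" = 103, "57-65" = 108)

age_pct <- round(100 * prop.table(age_counts), 1)
print(age_pct)
```

On the real data, the same idea would be round(100 * prop.table(table(data$Age_Group)), 1).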


13 Basic Plots (Base R)

13.1 Age distribution

hist(data$Age,
     main = "Age Distribution of Teachers",
     xlab = "Age",
     col  = "skyblue")

13.2 Age groups (bar plot)

barplot(table(data$Age_Group),
        main = "Teacher Counts by Age Group",
        col  = "lightgreen")

13.3 Subject distribution

barplot(table(data$Subject_Taught),
        las  = 2,
        main = "Teachers per Subject",
        col  = "green")

13.4 Gender distribution

barplot(table(data$Gender),
        main = "Gender Distribution of Teachers",
        col  = c("green", "blue"))

13.5 Age by Province (boxplot)

boxplot(Age ~ Province,
        data = data,
        main = "Age Distribution by Province",
        col  = "lightgray")

13.6 Pie chart of Education Level

pie(table(data$Education_Level),
    main = "Education Level Distribution")


14 Better Plots with ggplot2

library(ggplot2)

14.1 Age distribution (histogram)

ggplot(data, aes(x = Age)) +
  geom_histogram(binwidth = 5, fill = "skyblue", color = "black") +
  labs(title = "Age Distribution of Teachers", x = "Age", y = "Count") +
  theme_minimal()

14.2 Age groups (bar plot)

ggplot(data, aes(x = Age_Group)) +
  geom_bar(fill = "lightgreen", color = "black") +
  labs(title = "Teacher Counts by Age Group", x = "Age Group", y = "Count") +
  theme_minimal()

14.3 Province distribution

ggplot(data, aes(x = Province)) +
  geom_bar(fill = "orange", color = "black") +
  labs(title = "Teachers per Province", x = "Province", y = "Count") +
  theme_minimal()

14.4 Education with counts

edu_counts <- table(data$Education_Level)
edu_counts
## 
##  A0  A1  A2 
## 142 213 145
pie(edu_counts,
    labels = paste(names(edu_counts), edu_counts),
    main   = "Education Level Distribution")

14.5 Subject distribution

ggplot(data, aes(x = Subject_Taught)) +
  geom_bar(fill = "purple", color = "black") +
  labs(title = "Teachers per Subject", x = "Subject", y = "Count") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

14.6 Gender distribution

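ggplot(data, aes(x = Gender, fill = Gender)) +
  geom_bar(color = "black", show.legend = FALSE) +
  # map colours by name so they stay correct if the set of bars changes
  scale_fill_manual(values = c(Female = "pink", Male = "lightblue")) +
  labs(title = "Gender Distribution of Teachers", x = "Gender", y = "Count") +
  theme_minimal()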

14.7 Age by Province (boxplot)

ggplot(data, aes(x = Province, y = Age)) +
  geom_boxplot(fill = "lightgray") +
  labs(title = "Age Distribution by Province", x = "Province", y = "Age") +
  theme_minimal()


15 Starting Advanced Features – dplyr

Packages to use:

  • dplyr → clean, transform, summarise tables
  • tidyr → reshape tables long↔︎wide
  • ggplot2 → charts

15.1 The pipe %>%

  • %>% means “and then…” and makes steps readable.

data %>%
  filter(Gender == "Female") %>%            # keep only female teachers
  group_by(Province) %>%                    # group by Province
  summarise(Number_of_Female = n(), .groups = "drop")
## # A tibble: 5 × 2
##   Province    Number_of_Female
##   <fct>                  <int>
## 1 Eastern                   41
## 2 Kigali City               38
## 3 Northern                  49
## 4 Southern                  49
## 5 Western                   41

Explanation:

  • filter() keeps only rows where Gender == "Female".
  • group_by(Province) organizes data by province.
  • summarise() creates a summary table for each group.
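
To see the "and then" idea without any packages, the sketch below uses base R's native pipe |>, assuming R 4.1 or later. It is not the same operator as %>%, but it reads the same way in a simple chain like this one:

```r
x <- c(1, 4, 9)

nested <- sum(sqrt(x))          # read inside-out
piped  <- x |> sqrt() |> sum()  # read left to right: take x, then sqrt, then sum

print(nested)
print(piped)
```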

15.2 filter() – keep rows that meet a condition

# Teachers in Gikondo sector (example)
data_gikondo <- data %>%
  filter(Sector == "Gikondo")

# Female upper secondary teachers only
female_sec <- data %>%
  filter(Gender == "Female", School_Level == "Upper Secondary")

data_gikondo
##      Province   District  Sector      Teacher_Name Teacher_ID Education_Level
## 1    Southern    Ruhango Gikondo     Ngabo Rukundo  T25008597              A2
## 2     Eastern  Rwamagana Gikondo Aline Mukantagara  T25010353              A0
## 3 Kigali City Nyarugenge Gikondo   Cassien Rukundo  T25011097              A0
## 4     Eastern  Rwamagana Gikondo  Leodomir Uwimana  T25019966              A0
## 5 Kigali City Nyarugenge Gikondo    Samuel Mugisha  T25049974              A1
##   School_ID    School_Level   Subject_Taught Gender Age Age_Group
## 1  SCH00085         Primary Computer Science Female  25     22-26
## 2  SCH00103         Primary      Mathematics   Male  37     37-46
## 3  SCH00110 Upper Secondary          English   Male  32     27-36
## 4  SCH00199 Upper Secondary        Chemistry Female  45     37-46
## 5  SCH00499 Upper Secondary Computer Science   Male  59     57-65
female_sec
##       Province   District    Sector             Teacher_Name Teacher_ID
## 1     Southern       Huye Kabarondo         Olive Byiringiro  T25001369
## 2      Eastern    Gatsibo  Mugesera            Alice Uwimana  T25001679
## 3      Eastern    Kayonza      Save       Patrick Byiringiro  T25001964
## 4     Southern   Gisagara  Kinyinya            Noella Kagabo  T25003929
## 5     Northern    Rulindo   Kabatwa      Patrick Mukantagara  T25004189
## 6  Kigali City   Kicukiro     Musha       Sylvie Twizerimana  T25004561
## 7     Northern    Rulindo  Ruhashya        Sylvie Byiringiro  T25005352
## 8     Southern     Nyanza    Shangi          Samuel Mutabazi  T25005788
## 9     Southern    Kamonyi Rubengera           Leodomir Uwase  T25006489
## 10    Northern    Gicumbi    Shangi      Samuel Ndayishimiye  T25006645
## 11     Eastern     Kirehe Nyamabuye           Aline Mukamana  T25006931
## 12     Eastern    Gatsibo    Masaka          Jean Hagenimana  T25008145
## 13    Southern    Ruhango Karangazi             Joan Uwimana  T25008416
## 14     Western    Karongi   Gatsibo         Noella Niyonzima  T25009742
## 15    Southern    Ruhango   Bumbogo       Leodomir Rwitabiri  T25010228
## 16    Northern     Burera Rubengera Christianne Ndayishimiye  T25011745
## 17    Northern    Rulindo     Mamba             Ange Uwayezu  T25012131
## 18    Northern    Gicumbi  Bugarama          Gaelle Mutabazi  T25012219
## 19     Western    Nyabihu   Bigogwe         Kadete Munyaneza  T25012592
## 20     Western    Nyabihu  Ruhashya       Gaelle Mukantagara  T25014255
## 21    Southern    Ruhango    Shangi            Ngabo Murerwa  T25015967
## 22     Western    Rutsiro   Gatenga             Joan Ishimwe  T25016784
## 23     Eastern      Ngoma Mukarange        Lema Mbarushimana  T25018395
## 24     Eastern  Rwamagana   Gikondo         Leodomir Uwimana  T25019966
## 25    Southern    Ruhango   Gatsibo         Sophie Niyonzima  T25021187
## 26    Northern     Burera Kabarondo            Gigi Mutabazi  T25021880
## 27     Western  Ngororero    Rilima          Ange Byiringiro  T25022150
## 28 Kigali City   Kicukiro Karangazi           Sylvie Uwayezu  T25023271
## 29 Kigali City     Gasabo    Busoro          Ngabo Munyaneza  T25023678
## 30     Eastern   Bugesera      Jali          Olive Munyaneza  T25024267
## 31    Southern     Nyanza Mukarange           Eric Rwitabiri  T25024570
## 32     Western    Karongi   Kacyiru          Andrew Mukamana  T25025099
## 33    Southern   Gisagara  Kigabiro           Andrew Uwimana  T25025127
## 34    Northern     Burera     Ndera      Tijara Ndayishimiye  T25025377
## 35     Eastern   Bugesera    Jabana       Olive Mbarushimana  T25026487
## 36     Western     Rusizi  Ruhashya        Tijara Nkurunziza  T25026571
## 37    Southern  Nyamagabe   Gatenga         Sophie Niyonzima  T25027039
## 38     Western     Rusizi   Matimba          Lema Byiringiro  T25032127
## 39     Eastern  Nyagatare     Rongi       Yvette Mukantagara  T25033688
## 40 Kigali City     Gasabo    Kitabi        Kadete Byiringiro  T25033887
## 41    Southern     Nyanza Kabarondo               Ange Uwase  T25035132
## 42    Southern  Nyamagabe    Gisozi       Sophie Twizerimana  T25035740
## 43 Kigali City Nyarugenge   Nyamata            Aline Uwimana  T25036413
## 44    Southern    Kamonyi     Musha           Sophie Uwimana  T25036945
## 45     Eastern     Kirehe  Kigarama       Sylvie Twizerimana  T25037141
## 46     Western    Karongi  Kiziguro         Yvette Niyonzima  T25038419
## 47    Southern    Ruhango   Kabatwa           Olive Mukamana  T25038779
## 48     Eastern     Kirehe  Rusororo             Jean Ishimwe  T25038980
## 49     Eastern  Nyagatare     Ndera           Kadete Mugisha  T25041538
## 50     Western Nyamasheke  Kibilizi           Gaelle Mugisha  T25041775
## 51    Northern    Gicumbi   Gatsibo       Christian Habimana  T25044332
## 52    Northern    Rulindo    Kiyovu          Herve Munyakazi  T25047242
## 53     Western  Ngororero   Bigogwe      Christian Rwitabiri  T25050043
##    Education_Level School_ID    School_Level   Subject_Taught Gender Age
## 1               A1  SCH00013 Upper Secondary      Mathematics Female  37
## 2               A0  SCH00016 Upper Secondary      Mathematics Female  59
## 3               A0  SCH00019 Upper Secondary           French Female  57
## 4               A1  SCH00039 Upper Secondary Computer Science Female  61
## 5               A0  SCH00041 Upper Secondary           French Female  57
## 6               A1  SCH00045 Upper Secondary        Chemistry Female  44
## 7               A2  SCH00053 Upper Secondary           French Female  43
## 8               A0  SCH00057 Upper Secondary          Physics Female  24
## 9               A1  SCH00064 Upper Secondary Computer Science Female  27
## 10              A0  SCH00066 Upper Secondary          Biology Female  33
## 11              A1  SCH00069 Upper Secondary          English Female  54
## 12              A0  SCH00081 Upper Secondary        Geography Female  60
## 13              A1  SCH00084 Upper Secondary          English Female  34
## 14              A1  SCH00097 Upper Secondary          Physics Female  42
## 15              A1  SCH00102 Upper Secondary        Geography Female  41
## 16              A2  SCH00117 Upper Secondary      Mathematics Female  48
## 17              A1  SCH00121 Upper Secondary          History Female  45
## 18              A2  SCH00122 Upper Secondary Entrepreneurship Female  63
## 19              A2  SCH00125 Upper Secondary          Biology Female  42
## 20              A0  SCH00142 Upper Secondary          Physics Female  33
## 21              A1  SCH00159 Upper Secondary Entrepreneurship Female  36
## 22              A1  SCH00167 Upper Secondary          Biology Female  61
## 23              A2  SCH00183 Upper Secondary          English Female  41
## 24              A0  SCH00199 Upper Secondary        Chemistry Female  45
## 25              A2  SCH00211 Upper Secondary        Chemistry Female  65
## 26              A1  SCH00218 Upper Secondary Entrepreneurship Female  45
## 27              A2  SCH00221 Upper Secondary          Biology Female  41
## 28              A2  SCH00232 Upper Secondary          Biology Female  54
## 29              A0  SCH00236 Upper Secondary      Mathematics Female  25
## 30              A0  SCH00242 Upper Secondary          Biology Female  36
## 31              A0  SCH00245 Upper Secondary           French Female  48
## 32              A2  SCH00250 Upper Secondary          English Female  45
## 33              A0  SCH00251 Upper Secondary Entrepreneurship Female  38
## 34              A2  SCH00253 Upper Secondary      Mathematics Female  63
## 35              A1  SCH00264 Upper Secondary           French Female  60
## 36              A2  SCH00265 Upper Secondary          Biology Female  49
## 37              A1  SCH00270 Upper Secondary          History Female  47
## 38              A0  SCH00321 Upper Secondary      Mathematics Female  61
## 39              A1  SCH00336 Upper Secondary          English Female  23
## 40              A1  SCH00338 Upper Secondary      Mathematics Female  56
## 41              A0  SCH00351 Upper Secondary        Geography Female  36
## 42              A1  SCH00357 Upper Secondary      Kinyarwanda Female  34
## 43              A0  SCH00364 Upper Secondary        Geography Female  59
## 44              A1  SCH00369 Upper Secondary      Mathematics Female  33
## 45              A1  SCH00371 Upper Secondary      Mathematics Female  64
## 46              A2  SCH00384 Upper Secondary      Kinyarwanda Female  40
## 47              A1  SCH00387 Upper Secondary Computer Science Female  53
## 48              A0  SCH00389 Upper Secondary        Chemistry Female  29
## 49              A1  SCH00415 Upper Secondary          Biology Female  37
## 50              A0  SCH00417 Upper Secondary        Chemistry Female  64
## 51              A0  SCH00443 Upper Secondary        Chemistry Female  46
## 52              A2  SCH00472 Upper Secondary          History Female  27
## 53              A1  SCH00500 Upper Secondary          History Female  57
##    Age_Group
## 1      37-46
## 2      57-65
## 3      57-65
## 4      57-65
## 5      57-65
## 6      37-46
## 7      37-46
## 8      22-26
## 9      27-36
## 10     27-36
## 11     47-56
## 12     57-65
## 13     27-36
## 14     37-46
## 15     37-46
## 16     47-56
## 17     37-46
## 18     57-65
## 19     37-46
## 20     27-36
## 21     27-36
## 22     57-65
## 23     37-46
## 24     37-46
## 25     57-65
## 26     37-46
## 27     37-46
## 28     47-56
## 29     22-26
## 30     27-36
## 31     47-56
## 32     37-46
## 33     37-46
## 34     57-65
## 35     57-65
## 36     47-56
## 37     47-56
## 38     57-65
## 39     22-26
## 40     47-56
## 41     27-36
## 42     27-36
## 43     57-65
## 44     27-36
## 45     57-65
## 46     37-46
## 47     47-56
## 48     27-36
## 49     37-46
## 50     57-65
## 51     37-46
## 52     27-36
## 53     57-65
nrow(data_gasabo)
## [1] 5
nrow(female_sec)
## [1] 53

15.3 arrange() – sort rows

# Oldest teachers first
data %>%
  arrange(desc(Age)) %>%
  head(5)
##      Province   District   Sector      Teacher_Name Teacher_ID Education_Level
## 1 Kigali City   Kicukiro    Ndera     Ngabo Uwayezu  T25004614              A2
## 2    Southern    Kamonyi  Kiyumba      Gigi Rukundo  T25007444              A1
## 3     Eastern     Kirehe   Karama Christian Mugisha  T25015136              A1
## 4    Northern    Gicumbi Kinyinya  Herve Hagenimana  T25015265              A2
## 5     Western Nyamasheke  Gikonko      Olive Kagabo  T25015491              A1
##   School_ID    School_Level Subject_Taught Gender Age Age_Group
## 1  SCH00046         Primary      Chemistry   Male  65     57-65
## 2  SCH00074 Upper Secondary        History   Male  65     57-65
## 3  SCH00151         Primary    Kinyarwanda Female  65     57-65
## 4  SCH00152         Primary         French   Male  65     57-65
## 5  SCH00154         Primary    Mathematics Female  65     57-65
# Youngest teachers first
data %>%
  arrange(Age) %>%
  head(5)
##      Province  District  Sector       Teacher_Name Teacher_ID Education_Level
## 1 Kigali City    Gasabo Matimba   Gaelle Munyakazi  T25014157              A0
## 2    Southern  Gisagara  Remera  Noella Nkurunziza  T25000239              A1
## 3     Eastern Nyagatare   Rongi Yvette Mukantagara  T25033688              A1
## 4     Eastern Rwamagana Murunda  Patrick Niyonzima  T25037917              A2
## 5 Kigali City  Kicukiro   Mamba      Ngabo Umutesi  T25043231              A2
##   School_ID    School_Level   Subject_Taught Gender Age Age_Group
## 1  SCH00141 Lower Secondary Computer Science Female  22     22-26
## 2  SCH00002     Pre-primary           French Female  23     22-26
## 3  SCH00336 Upper Secondary          English Female  23     22-26
## 4  SCH00379 Lower Secondary      Kinyarwanda Female  23     22-26
## 5  SCH00432     Pre-primary          Biology   Male  23     22-26
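Several teachers in the outputs above share the same age (65 or 23), so their relative order simply follows the original row order. As a minimal sketch, assuming a small made-up data frame (`toy` and its values are hypothetical, not taken from the teacher dataset), a second column can be passed to arrange() to break such ties:

```r
library(dplyr)

# Hypothetical toy data, not taken from the teacher dataset
toy <- data.frame(
  Teacher_Name = c("C", "A", "B", "D"),
  Age          = c(65, 65, 23, 40),
  stringsAsFactors = FALSE
)

# Sort by Age (oldest first), then alphabetically by name within equal ages
sorted <- toy %>%
  arrange(desc(Age), Teacher_Name)

sorted$Teacher_Name
```

Columns are applied left to right: the second column only matters for rows that tie on the first.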

15.4 mutate() – create new variables

# Add a Yes/No flag for teachers aged 57 or older
data <- data %>%
  mutate(
    Near_Retirement = if_else(Age >= 57, "Yes", "No")
  )
head(data, 5)
##      Province   District  Sector      Teacher_Name Teacher_ID Education_Level
## 1 Kigali City     Gasabo   Rongi   Aline Munyaneza  T25000138              A2
## 2    Southern   Gisagara  Remera Noella Nkurunziza  T25000239              A1
## 3     Western Nyamasheke Gahanga  Leodomir Murerwa  T25000345              A0
## 4    Northern    Gicumbi Gatenga  Sophie Niyonzima  T25000423              A2
## 5 Kigali City Nyarugenge Gatsibo   Tijara Habimana  T25000558              A2
##   School_ID School_Level   Subject_Taught Gender Age Age_Group Near_Retirement
## 1  SCH00001      Primary        Geography   Male  27     27-36              No
## 2  SCH00002  Pre-primary           French Female  23     22-26              No
## 3  SCH00003  Pre-primary      Kinyarwanda Female  27     27-36              No
## 4  SCH00004  Pre-primary        Chemistry Female  43     37-46              No
## 5  SCH00005      Primary Entrepreneurship Female  42     37-46              No
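if_else() handles a two-way split. When a new variable needs more than two categories, case_when() is one option. The sketch below is illustrative only: the `toy` data frame, the `Career_Stage` column, and the age cut-offs are assumptions, not part of the teacher dataset.

```r
library(dplyr)

# Hypothetical toy data, not taken from the teacher dataset
toy <- data.frame(Age = c(23, 42, 57, 65))

# Conditions are checked top to bottom; the first match wins
toy <- toy %>%
  mutate(
    Career_Stage = case_when(
      Age >= 57 ~ "Near retirement",
      Age >= 37 ~ "Mid career",
      TRUE      ~ "Early career"   # fallback for all remaining rows
    )
  )

toy$Career_Stage
```

Because the first matching condition is used, the conditions should run from the most specific to the most general.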

15.5 select() – choose / reorder columns

# Keep only the identifier and location columns, in the order listed
small <- data %>%
  select(Teacher_ID, Teacher_Name, Province, District, Sector)

head(small, 2)
##   Teacher_ID      Teacher_Name    Province District Sector
## 1  T25000138   Aline Munyaneza Kigali City   Gasabo  Rongi
## 2  T25000239 Noella Nkurunziza    Southern Gisagara Remera
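select() can also reorder columns without listing them all, or drop a column by name. A minimal sketch on a made-up data frame (`toy` and its values are hypothetical):

```r
library(dplyr)

# Hypothetical toy data, not taken from the teacher dataset
toy <- data.frame(
  Teacher_ID = c("T1", "T2"),
  Province   = c("Eastern", "Western"),
  Age        = c(30, 45)
)

# Move Age to the front and keep the remaining columns in their current order
reordered <- toy %>%
  select(Age, everything())

# Drop a single column with a minus sign
without_age <- toy %>%
  select(-Age)

names(reordered)
names(without_age)
```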

16 Exporting Cleaned Data

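# Save the cleaned data as a CSV file.
# The path below is specific to one computer; change it to a folder on your own machine.
write.csv(
  data,
  "C:/Users/HP/Desktop/R PROGRAMING/data_cleaned.csv",
  row.names = FALSE
)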
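An absolute path only works on the computer where that folder exists. The sketch below shows one way to build a path from parts with file.path() and to read the file back as a quick check. The `toy` data frame and the file name are hypothetical, and tempdir() is used only so that the example runs anywhere.

```r
# Hypothetical toy data standing in for the cleaned dataset
toy <- data.frame(
  Teacher_ID = c("T1", "T2"),
  Age        = c(30, 45)
)

# Build the path from parts; tempdir() is a throwaway folder for this sketch
out_path <- file.path(tempdir(), "data_cleaned_example.csv")

write.csv(toy, out_path, row.names = FALSE)

# Read the file back to confirm the export kept all rows and columns
check <- read.csv(out_path)
nrow(check) == nrow(toy)
identical(names(check), names(toy))
```

In a real project, tempdir() would be replaced by the project folder so that the file persists after the session ends.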

You now have a cleaned and documented teacher dataset, ready for more advanced analysis.