R is a statistical programming language created at the University of
Auckland (New Zealand) by Ross Ihaka and Robert Gentleman.
It evolved from the S language developed at Bell Labs.
The R version used here is 4.5.1 (2025-06-13 ucrt).
Welcome to R!
R is a powerful, free environment for statistical computing and graphics.
With R Markdown you can mix narrative text, R code, and its output to create reproducible reports.
RStudio IDE has four panes:
Source (scripts, Rmd), Console, Environment/History, and Files/Plots/Packages/Help.
R can perform calculations directly in the console.
Addition: +
1 + 1
## [1] 2
Subtraction: -
2 - 5
## [1] -3
Multiplication: *
5 * 6
## [1] 30
Division: /
9 / 3
## [1] 3
Modulus (remainder of a division): %%
6 %% 2
## [1] 0
Exponent: ^ or **
2 ^ 10 # or 2 ** 10
## [1] 1024
Integer division: %/%
1035 %/% 3
## [1] 345
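Integer division and modulus are two halves of the same division: the quotient times the divisor, plus the remainder, gives back the original number. A minimal sketch of that relationship; the values are illustrative and not taken from the text above.

```r
# Illustrative values (assumptions, not from the course material)
a <- 1035
b <- 4

quotient  <- a %/% b   # how many whole times b fits into a
remainder <- a %% b    # what is left over

print(quotient)
print(remainder)

# Reassembling the two pieces recovers the original number
print(quotient * b + remainder == a)

# With a negative number, %/% rounds down (towards minus infinity),
# so the remainder takes the sign of the divisor
print(-7 %/% 3)
print(-7 %% 3)
```

Here the quotient is 258 and the remainder is 3; for -7 and 3 the results are -3 and 2.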
Less than: <
1 < 0
## [1] FALSE
Less than or equal to: <=
1 <= 1
## [1] TRUE
Greater than: >
4 > 5
## [1] FALSE
Greater than or equal to: >=
3 >= 3
## [1] TRUE
Exactly equal to: ==
"R" == "r"
## [1] FALSE
The equality operator can also be used to match one element with multiple elements:
"Species" == c("Sepal.Length", "Sepal.Width", "Petal.Length",
               "Petal.Width", "Species")
## [1] FALSE FALSE FALSE FALSE TRUE
Not equal to: !=
5 != 5
## [1] FALSE
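One caveat with == is worth a quick illustration: decimal numbers are stored in binary, so two values that look equal can differ by a tiny rounding error. The sketch below uses base R's all.equal(), which compares within a small tolerance; the numbers are illustrative.

```r
# 0.1 + 0.2 is not stored as exactly 0.3
x <- 0.1 + 0.2

exact_match <- x == 0.3                   # strict comparison
close_match <- isTRUE(all.equal(x, 0.3))  # comparison within a tolerance

print(exact_match)
print(close_match)
```

The strict comparison returns FALSE while the tolerant one returns TRUE, so for computed decimals all.equal() wrapped in isTRUE() is usually the safer test.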
Not: ! (used to flip TRUE ↔ FALSE)
!TRUE # or !T
## [1] FALSE
!(T & F) # this is TRUE
## [1] TRUE
!(F | T) # this is FALSE
## [1] FALSE
And: &
TRUE & TRUE
## [1] TRUE
TRUE & FALSE
## [1] FALSE
FALSE & FALSE
## [1] FALSE
FALSE & TRUE
## [1] FALSE
Or: |
TRUE | TRUE
## [1] TRUE
TRUE | FALSE
## [1] TRUE
FALSE | FALSE
## [1] FALSE
FALSE | TRUE
## [1] TRUE
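The & and | operators also work element by element on whole logical vectors, and any() and all() collapse the result to a single value. A small sketch with made-up data; the variable names are assumptions for illustration.

```r
# Hypothetical results for three students
passed_exam    <- c(TRUE, TRUE, FALSE)
passed_project <- c(TRUE, FALSE, FALSE)

both   <- passed_exam & passed_project   # TRUE only where both are TRUE
either <- passed_exam | passed_project   # TRUE where at least one is TRUE

print(both)
print(either)

# Collapse a logical vector to one value
print(any(both))     # did anyone pass both?
print(all(either))   # did everyone pass at least one?
```

Here both is TRUE FALSE FALSE and either is TRUE TRUE FALSE, so any(both) is TRUE and all(either) is FALSE.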
In R, we have built-in functions to match elements in a vector.
The first is match(). It returns the position of the first match of its first argument in its second argument.
match("Species", c("Sepal.Length", "Sepal.Width", "Petal.Length",
                   "Petal.Width", "Species"))
## [1] 5
The second is %in%, which checks whether a value exists in a vector.
"Species" %in% c("Sepal.Length", "Sepal.Width", "Petal.Length",
                 "Petal.Width", "Species")
## [1] TRUE
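In R we can use <-, = (a single equals sign!), and -> to assign a value to a variable.
A variable name:
must start with a letter, or with a dot that is not followed by a digit;
may contain letters, digits, . and _;
cannot contain spaces (use _ or . instead).
# This will give an error because of the space:
# t training <- "r programming"
Valid examples:
a <- 5
b <- 6
0 -> .a
a1 = 0.2
In R we have the following basic data types: numeric, integer, complex, character, logical, and raw.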
Numeric examples: 15.5, 505, 38, pi
q <- 10.7
print(class(q))
## [1] "numeric"
print(typeof(q))
## [1] "double"
You can create an integer by appending L to a number, e.g. 1L, 5L, 10L.
q <- 5L
print(class(q))
## [1] "integer"
print(typeof(q))
## [1] "integer"
Example of a complex number: 3 + 1i, where 1i is the imaginary part.
q <- 3 + 1i
print(class(q))
## [1] "complex"
print(typeof(q))
## [1] "complex"
p1 <- a + 1i * b # uses a = 5 and b = 6 assigned earlier
print(p1)
## [1] 5+6i
string <- "I am Learning R"
class(string)
## [1] "character"
Remember: "LeaRning" is different from "Learning" because R is case-sensitive.
TRUE # or T
## [1] TRUE
FALSE # or F
## [1] FALSE
Logical output often comes from comparisons:
"LeaRning" == "Learning"
## [1] FALSE
text <- "Christian Mugisha."
(raw_text <- charToRaw(text))
## [1] 43 68 72 69 73 74 69 61 6e 20 4d 75 67 69 73 68 61 2e
class(raw_text)
## [1] "raw"
Converting raw back to text:
rawToChar(raw_text)
## [1] "Christian Mugisha."
Factors represent categorical variables (e.g., gender, levels, ratings).
Gender <- factor(c("Female", "Male"))
print(Gender)
## [1] Female Male
## Levels: Female Male
class(Gender)
## [1] "factor"
v <- TRUE
w <- FALSE
class(v); typeof(v)
## [1] "logical"
## [1] "logical"
!v
## [1] FALSE
isTRUE(w)
## [1] FALSE
t <- 10
x <- numeric(t) # creates a numeric vector of length t
print(x)
## [1] 0 0 0 0 0 0 0 0 0 0
# assigning values to x:
x[1] <- 2.5
print(x)
## [1] 2.5 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
n <- 5
x <- integer(n) # creates an integer vector of length n
print(x)
## [1] 0 0 0 0 0
class(x)
## [1] "integer"
# assigning values to x:
x[1] <- 2.5 # R will convert to numeric if needed
class(x)
## [1] "numeric"
print(x)
## [1] 2.5 0.0 0.0 0.0 0.0
To convert data types we use as.<type>() functions, e.g.:
as.character()
as.numeric()
as.factor()
Let’s create a vector with elements of different types to see how R will handle them.
# ?c # help for c()
v <- c(1, "R", TRUE, FALSE, NA)
print(v)
## [1] "1" "R" "TRUE" "FALSE" NA
class(v)
## [1] "character"
R coerces every element to character. NA stays missing: it becomes a character NA, which prints without quotes.
v2 <- c(1, 4, 8, FALSE, TRUE, FALSE, FALSE, TRUE, "R" == "r")
print(v2)
## [1] 1 4 8 0 1 0 0 1 0
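The two vectors above follow a general rule: when types are mixed, c() coerces everything to the most flexible type present (logical, then integer, then double, then character). The sketch below shows that ordering and an explicit conversion with as.numeric(); the example values are illustrative.

```r
# Implicit coercion: the most flexible type wins
type_logical_integer <- typeof(c(TRUE, 2L))    # logical + integer
type_integer_double  <- typeof(c(2L, 3.5))     # integer + double
type_double_char     <- typeof(c(3.5, "a"))    # double + character

print(type_logical_integer)
print(type_integer_double)
print(type_double_char)

# Explicit conversion with as.<type>()
good <- as.numeric("12.5")                     # a string that looks like a number
bad  <- suppressWarnings(as.numeric("twelve")) # cannot be parsed, gives NA

print(good)
print(bad)
```

Without suppressWarnings(), the failed conversion also prints a warning that NAs were introduced by coercion, which is a useful signal when cleaning real data.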
Arithmetic on vectors is vectorized in R: operations apply element by element.
a <- c(5, 6, 7)
b <- c(10, 20, 30)
# Addition (1st+1st, 2nd+2nd, etc.)
a + b
## [1] 15 26 37
# Multiplication
a * b
## [1] 50 120 210
If one vector is shorter, R recycles it:
a <- c(1, 2, 3, 4)
b <- c(10, 20)
a + b
## [1] 11 22 13 24
# b is recycled: (10, 20, 10, 20)
⚠️ Be careful!
If the longer vector’s length is not a multiple of the shorter one, R will show a warning.
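To see that warning in practice, the sketch below adds a vector of length 3 to a vector of length 2. The values are illustrative; tryCatch() is used only to capture the warning text so it can be printed.

```r
a <- c(1, 2, 3)
b <- c(10, 20)

# Capture the text of the warning that R raises
warning_text <- tryCatch(a + b, warning = function(w) conditionMessage(w))
print(warning_text)

# The addition still runs: b is recycled as (10, 20, 10)
result <- suppressWarnings(a + b)
print(result)
```

The result is 11 22 13, so R completes the calculation but flags that the recycling was uneven, which often points to a mistake in the data.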
# Get the first item in v2
v2[1]
## [1] 1
# Get the 2nd and 4th elements
v2[c(2, 4)]
## [1] 4 0
# Exclude the 3rd element
v2[-3]
## [1] 1 4 0 1 0 0 1 0
# Change the first element
v2[1] <- 27
print(v2)
## [1] 27 4 8 0 1 0 0 1 0
# Add a new element
v21 <- c(v2, 81)
v21
## [1] 27 4 8 0 1 0 0 1 0 81
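Besides positions, a vector can be indexed with a logical condition, which is how most filtering is done in practice. A minimal sketch that recreates v2 as it stands after the changes above:

```r
# v2 as it looks after the first element was changed to 27
v2 <- c(27, 4, 8, 0, 1, 0, 0, 1, 0)

# Keep only the elements that satisfy a condition
large <- v2[v2 > 1]
print(large)

# which() returns the positions where the condition is TRUE
zero_positions <- which(v2 == 0)
print(zero_positions)

# Summing a logical vector counts the TRUE values
n_large <- sum(v2 > 1)
print(n_large)
```

Here large is 27 4 8, the zeros sit at positions 4, 6, 7 and 9, and n_large is 3.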
Create the following variables:
my_name containing your name (character)
my_age containing your age (numeric)
is_statistician with TRUE/FALSE
Then print class() of each.
Define expenses <- c(1500, 2000, 1200, 3000).
Define income <- 10000.
Compute income - sum(expenses).
Define sales_q1 <- c(120, 150, 90).
Define sales_q2 <- c(130, 160, 95).
Compute sales_q1 + sales_q2.
Compute the percentage change ((sales_q2 - sales_q1) / sales_q1) * 100.
In R, even a single number like 42 is a vector of length
1.
The most important data structure for data analysis in R is the data frame:
employee_data <- data.frame(
id = c(1, 2, 3),
name = c("John", "Jane", "Peter"),
salary = c(50000, 55000, 52000)
)
employee_data
## id name salary
## 1 1 John 50000
## 2 2 Jane 55000
## 3 3 Peter 52000
str(employee_data)
## 'data.frame': 3 obs. of 3 variables:
## $ id : num 1 2 3
## $ name : chr "John" "Jane" "Peter"
## $ salary: num 50000 55000 52000
summary(employee_data)
## id name salary
## Min. :1.0 Length:3 Min. :50000
## 1st Qu.:1.5 Class :character 1st Qu.:51000
## Median :2.0 Mode :character Median :52000
## Mean :2.0 Mean :52333
## 3rd Qu.:2.5 3rd Qu.:53500
## Max. :3.0 Max. :55000
# Column by name
employee_data$name
## [1] "John" "Jane" "Peter"
# By index (row, column)
employee_data[1, 2] # Row 1, column 2
## [1] "John"
employee_data[ , "salary"] # All rows, salary column
## [1] 50000 55000 52000
employee_data[ , c("name", "salary")] # Multiple columns
## name salary
## 1 John 50000
## 2 Jane 55000
## 3 Peter 52000
employee_data[1:2, ] # First two rows
## id name salary
## 1 1 John 50000
## 2 2 Jane 55000
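Rows can also be selected by a condition instead of by position. The sketch below rebuilds the same employee_data and keeps only salaries above 51000; the threshold is an arbitrary value chosen for illustration.

```r
employee_data <- data.frame(
  id     = c(1, 2, 3),
  name   = c("John", "Jane", "Peter"),
  salary = c(50000, 55000, 52000)
)

# Logical condition in the row position, all columns kept
high_paid <- employee_data[employee_data$salary > 51000, ]
print(high_paid)

# subset() expresses the same filter and can pick columns at the same time
high_paid_names <- subset(employee_data, salary > 51000, select = c(name, salary))
print(high_paid_names)
```

Both versions return the rows for Jane and Peter; subset() is convenient interactively, while the bracket form is more explicit inside functions.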
# Add a new column
employee_data$department <- c("HR", "Finance", "IT")
# Add a new row
new_row <- data.frame(
id = 4,
name = "Alice",
salary = 60000,
department = "Marketing"
)
employee_data <- rbind(employee_data, new_row)
employee_data
## id name salary department
## 1 1 John 50000 HR
## 2 2 Jane 55000 Finance
## 3 3 Peter 52000 IT
## 4 4 Alice 60000 Marketing
In practice, we read from files:
# Read CSV file
# my_data <- read.csv("Data.csv")
# Read Excel file
# library(readxl)
# my_data <- read_excel("Data.xlsx")
Q1. Create a data frame students with:
id (1 to 5)
name (5 student names)
grade (5 numeric grades)
Q2. Extract the grades of the first three students.
Q3. Add a new column pass that is TRUE
if grade ≥ 50, FALSE otherwise.
Q4. Add a new row for a sixth student.
One of R’s biggest strengths is its community-contributed
packages.
A package is like an app for R.
# Install (run once per machine)
# install.packages("haven")
# install.packages("tidyverse")
Every time you start a new R session, you must load the package with library().
# Load packages for this session
library(haven) # For reading Stata, SPSS, SAS files
library(tidyverse) # For data manipulation & visualization
The tidyverse includes readr, which provides read_csv() for reading CSVs.
# Check installed packages
# installed.packages()
# Quickly check one package
"haven" %in% rownames(installed.packages())
## [1] TRUE
"ggplot2" %in% rownames(installed.packages())
## [1] TRUE
Updating packages:
# Update all installed packages
# update.packages()
Install readxl.
Load it with library(readxl).
Check whether ggplot2 is installed.
If you get the error “there is no package called …”, you must install it first with install.packages().
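One way to avoid that error is to check for a package before loading it. The sketch below uses base R's requireNamespace(); the helper name is_installed and the deliberately non-existent package name are assumptions made for illustration.

```r
# Hypothetical helper: TRUE if the package can be loaded, FALSE otherwise
is_installed <- function(pkg) {
  requireNamespace(pkg, quietly = TRUE)
}

has_stats <- is_installed("stats")                     # ships with R
has_fake  <- is_installed("aPackageThatDoesNotExist")  # made-up name

print(has_stats)
print(has_fake)

# Typical use (left commented out so nothing is installed by accident):
# if (!is_installed("haven")) install.packages("haven")
# library(haven)
```

Unlike scanning installed.packages(), this check is quick and reads naturally at the top of a script.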
One of the first steps in data analysis is importing data.
R can read: .csv, .xlsx,
.dta, .sav, .json,
etc.
In this course, we use a CSV file:
rwanda_teachers_500.csv.
We will use base R read.csv() here (you could also use
readr::read_csv()).
setwd("C:\\Users\\HP\\Desktop\\R PROGRAMING") # set working directory
getwd() # confirm
## [1] "C:/Users/HP/Desktop/R PROGRAMING"
list.files() # list files in the folder
## [1] "data_cleaned.csv" "rwanda_teachers_500.csv"
## [3] "rwanda_teachers_500_cleaned.csv"
# Read the CSV file
data <- read.csv("C:\\Users\\HP\\Desktop\\R PROGRAMING\\rwanda_teachers_500.csv")
# Display first 5 rows
head(data, 5)
## Province District Sector Teacher_Name Teacher_ID Education_Level
## 1 Kigali City Gasabo Rongi Aline Munyaneza T25000138 A2
## 2 Southern Gisagara Remera Noella Nkurunziza T25000239 A1
## 3 Western Nyamasheke Gahanga Leodomir Murerwa T25000345 A0
## 4 Northern Gicumbi Gatenga Sophie Niyonzima T25000423 A2
## 5 Kigali City Nyarugenge Gatsibo Tijara Habimana T25000558 A2
## School_ID School_Level Subject_Taught Date_of_Birth Gender
## 1 SCH00001 Primary Geography 1998-04-15 Male
## 2 SCH00002 Pre-primary French 2002-08-21 Female
## 3 SCH00003 Pre-primary Kinyarwanda 1998-10-19 Female
## 4 SCH00004 Pre-primary Chemistry 1982-11-11 Female
## 5 SCH00005 Primary Entrepreneurship 1983-05-04 Female
# Look at the structure
str(data)
## 'data.frame': 500 obs. of 11 variables:
## $ Province : chr "Kigali City" "Southern" "Western" "Northern" ...
## $ District : chr "Gasabo" "Gisagara" "Nyamasheke" "Gicumbi" ...
## $ Sector : chr "Rongi" "Remera" "Gahanga" "Gatenga" ...
## $ Teacher_Name : chr "Aline Munyaneza" "Noella Nkurunziza" "Leodomir Murerwa" "Sophie Niyonzima" ...
## $ Teacher_ID : chr "T25000138" "T25000239" "T25000345" "T25000423" ...
## $ Education_Level: chr "A2" "A1" "A0" "A2" ...
## $ School_ID : chr "SCH00001" "SCH00002" "SCH00003" "SCH00004" ...
## $ School_Level : chr "Primary" "Pre-primary" "Pre-primary" "Pre-primary" ...
## $ Subject_Taught : chr "Geography" "French" "Kinyarwanda" "Chemistry" ...
## $ Date_of_Birth : chr "1998-04-15" "2002-08-21" "1998-10-19" "1982-11-11" ...
## $ Gender : chr "Male" "Female" "Female" "Female" ...
You should see 500 rows and 11 columns. Many columns are chr (character).
That’s normal after reading a CSV, but we should:
convert Date_of_Birth into a real Date
convert categorical columns (Province, District, …) into factors
⚠️ Be careful!
If you have typos in labels (e.g., “Nyarugenge” vs “Nyarugunga”), clean the strings first, then convert to factors.
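To illustrate why cleaning comes first, the sketch below uses a small made-up vector with stray spaces and inconsistent capitalisation; none of these values come from the teacher dataset. It handles only whitespace and letter case, so genuine misspellings would still need to be recoded by hand.

```r
# Hypothetical messy labels (not from the real data)
sector_raw <- c("Gikondo", " gikondo", "GIKONDO ", "Remera", "remera")

# Converted as-is, every variant becomes its own level
levels_before <- nlevels(factor(sector_raw))
print(levels_before)

# Clean: trim spaces, lower-case, then capitalise the first letter
trimmed <- trimws(sector_raw)
lowered <- tolower(trimmed)
cleaned <- paste0(toupper(substring(lowered, 1, 1)), substring(lowered, 2))

# Now the factor has one level per real category
sector <- factor(cleaned)
print(levels(sector))
```

Five spellings collapse to the two levels Gikondo and Remera, which is what tables and plots should count.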
# Convert Date_of_Birth to Date
data$Date_of_Birth <- as.Date(data$Date_of_Birth, format = "%Y-%m-%d")
# Convert to factors
data$Subject_Taught <- as.factor(data$Subject_Taught)
data$Sector <- as.factor(data$Sector)
# Quick check
summary(data$Subject_Taught)
## Biology Chemistry Computer Science English
## 48 41 44 45
## Entrepreneurship French Geography History
## 50 52 41 51
## Kinyarwanda Mathematics Physics
## 46 49 33
summary(data$Sector)
## Bigogwe Bugarama Bumbogo Busoro Bwishyura Gahanga Gashonga
## 10 5 8 15 5 8 7
## Gashora Gatenga Gatsibo Gihundwe Gikoma Gikondo Gikonko
## 5 8 12 5 7 5 9
## Gisozi Jabana Jali Jenda Kabarondo Kabarore Kabatwa
## 12 4 9 5 13 6 11
## Kacyiru Kanombe Karama Karangazi Kibilizi Kigabiro Kigarama
## 6 9 11 17 12 7 17
## Kimironko Kinyinya Kitabi Kiyovu Kiyumba Kiziguro Mamba
## 3 10 6 7 5 5 11
## Masaka Matimba Mugesera Mukarange Mukura Murunda Musha
## 6 6 5 6 8 10 8
## Ndera Ntyazo Nyamabuye Nyamata Nyamirambo Remera Rilima
## 8 8 8 7 10 10 11
## Rongi Rubengera Ruhashya Rusororo Rwinkwavu Save Shangi
## 12 6 15 9 6 7 11
## Shyogor Shyogwe Tumba
## 8 14 6
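str(data$Date_of_Birth)
## Date[1:500], format: "1998-04-15" "2002-08-21" "1998-10-19" "1982-11-11" "1983-05-04" ...
# Create Age in years
data$Age <- as.integer(floor((Sys.Date() - data$Date_of_Birth) / 365.25))
summary(data$Age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 22.00 33.00 43.00 44.08 54.00 65.00
Why 365.25? A calendar year has 365 days, plus one leap day roughly every four years, so the average year is about 365.25 days long. Dividing the number of days lived by 365.25 therefore gives an approximate age in years. Because Age depends on Sys.Date(), the values change with the day the code is run.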
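Dividing by 365.25 is an approximation, and a fixed reference date makes that easy to check. In the sketch below the reference dates are assumptions chosen for illustration; the second pair shows the one place the shortcut can be off, namely exactly on a birthday.

```r
# First birth date is from the data preview; the second is made up
birth     <- as.Date(c("1998-04-15", "2001-01-01"))
# Fixed reference dates instead of Sys.Date(), so the result is reproducible
reference <- as.Date(c("2025-06-13", "2002-01-01"))

# Number of days between the two dates
days_lived <- as.numeric(reference - birth, units = "days")

# Approximate age in completed years
age <- floor(days_lived / 365.25)
print(age)
```

The first person comes out as 27. The second is exactly one year old on the reference date, yet the formula gives 0 because 365 / 365.25 is just under 1. For summaries such as age groups this rarely matters, but exact ages need a calendar-based calculation.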
# Convert more categorical variables
data$Education_Level <- as.factor(data$Education_Level)
data$Province <- as.factor(data$Province)
data$District <- as.factor(data$District)
data$School_Level <- as.factor(data$School_Level)
str(data)
## 'data.frame': 500 obs. of 12 variables:
## $ Province : Factor w/ 5 levels "Eastern","Kigali City",..: 2 4 5 3 2 3 2 2 1 1 ...
## $ District : Factor w/ 30 levels "Bugesera","Burera",..: 4 7 21 6 23 27 12 23 19 16 ...
## $ Sector : Factor w/ 59 levels "Bigogwe","Bugarama",..: 50 48 6 9 10 28 16 44 49 21 ...
## $ Teacher_Name : chr "Aline Munyaneza" "Noella Nkurunziza" "Leodomir Murerwa" "Sophie Niyonzima" ...
## $ Teacher_ID : chr "T25000138" "T25000239" "T25000345" "T25000423" ...
## $ Education_Level: Factor w/ 3 levels "A0","A1","A2": 3 2 1 3 3 2 3 3 1 2 ...
## $ School_ID : chr "SCH00001" "SCH00002" "SCH00003" "SCH00004" ...
## $ School_Level : Factor w/ 4 levels "Lower Secondary",..: 3 2 2 2 3 1 3 3 4 3 ...
## $ Subject_Taught : Factor w/ 11 levels "Biology","Chemistry",..: 7 6 9 2 5 4 5 5 11 5 ...
## $ Date_of_Birth : Date, format: "1998-04-15" "2002-08-21" ...
## $ Gender : chr "Male" "Female" "Female" "Female" ...
## $ Age : int 27 23 27 43 42 26 44 41 63 32 ...
summary(data$Age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 22.00 33.00 43.00 44.08 54.00 65.00
Suppose min = 22 and max = 65.
data$Age_Group <- cut(
data$Age,
breaks = c(22, 27, 37, 47, 57, 66), # 66 to include 65
labels = c("22-26", "27-36", "37-46", "47-56", "57-65"),
right = FALSE # [22,27), [27,37), etc.
)
# Check distribution
table(data$Age_Group)
##
## 22-26 27-36 37-46 47-56 57-65
## 38 119 132 103 108
prop.table(table(data$Age_Group))
##
## 22-26 27-36 37-46 47-56 57-65
## 0.076 0.238 0.264 0.206 0.216
# Sanity checks
summary(data$Age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 22.00 33.00 43.00 44.08 54.00 65.00
sum(is.na(data$Age_Group))
## [1] 0
levels(data$Age_Group)
## [1] "22-26" "27-36" "37-46" "47-56" "57-65"
⚠️ Be careful with cut():
right = FALSE → bins are [lower, upper).
right = TRUE → bins are (lower, upper].
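The difference matters most at the boundaries. The sketch below runs the same breaks with both settings on three illustrative ages (22, 27 and 65), and then adds include.lowest = TRUE, a standard cut() argument that closes the first interval.

```r
ages   <- c(22, 27, 65)
breaks <- c(22, 27, 37, 47, 57, 66)

# [lower, upper): 22 falls in the first bin, 27 in the second
left_closed <- cut(ages, breaks = breaks, right = FALSE)
print(left_closed)

# (lower, upper]: 22 is outside every bin and becomes NA, 27 is in the first bin
right_closed <- cut(ages, breaks = breaks, right = TRUE)
print(right_closed)

# include.lowest = TRUE closes the first interval so 22 is kept
with_lowest <- cut(ages, breaks = breaks, right = TRUE, include.lowest = TRUE)
print(with_lowest)
```

A quick sum(is.na(...)) after cut(), as in the sanity checks above, catches values that slipped through a boundary.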
any(is.na(data)) # any missing at all?
## [1] FALSE
sum(is.na(data)) # total missing cells
## [1] 0
colSums(is.na(data)) # per column
## Province District Sector Teacher_Name Teacher_ID
## 0 0 0 0 0
## Education_Level School_ID School_Level Subject_Taught Date_of_Birth
## 0 0 0 0 0
## Gender Age Age_Group
## 0 0 0
rowSums(is.na(data)) # per row
## [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [38] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [75] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [112] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [149] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [186] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [223] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [260] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [297] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [334] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [371] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [408] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [445] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [482] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
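The teacher data has no missing values, so these checks all return zero. To show what they look like when something is missing, here is a sketch on a small made-up data frame; the name toy and its values are assumptions for illustration.

```r
# Hypothetical data with two missing cells
toy <- data.frame(
  id     = 1:4,
  age    = c(25, NA, 40, 31),
  gender = c("Female", "Male", NA, "Male")
)

# Missing cells per column
missing_per_column <- colSums(is.na(toy))
print(missing_per_column)

# Rows that contain at least one missing value
rows_with_na <- which(rowSums(is.na(toy)) > 0)
print(rows_with_na)

# Keep only fully observed rows
complete <- toy[complete.cases(toy), ]
print(complete)

# Many summary functions need na.rm = TRUE to skip missing values
mean_age <- mean(toy$age, na.rm = TRUE)
print(mean_age)
```

Rows 2 and 3 are flagged, two complete rows remain, and the mean age of the three known values is 32.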
We no longer need Date_of_Birth since we created
Age and Age_Group.
data <- subset(data, select = -Date_of_Birth)
names(data)
## [1] "Province" "District" "Sector" "Teacher_Name"
## [5] "Teacher_ID" "Education_Level" "School_ID" "School_Level"
## [9] "Subject_Taught" "Gender" "Age" "Age_Group"
any(duplicated(data))
## [1] FALSE
data[duplicated(data), ]
## [1] Province District Sector Teacher_Name
## [5] Teacher_ID Education_Level School_ID School_Level
## [9] Subject_Taught Gender Age Age_Group
## <0 rows> (or 0-length row.names)
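Since the real data has no duplicates, the sketch below shows the same functions on a made-up data frame that does contain one; the name toy and its values are assumptions for illustration.

```r
# Hypothetical data: the third row repeats the second
toy <- data.frame(
  teacher_id = c("T1", "T2", "T2", "T3"),
  subject    = c("Math", "French", "French", "Physics")
)

# duplicated() marks the second and later copies of a row
dup_flags <- duplicated(toy)
print(dup_flags)

# Keep the first occurrence of every row
deduped <- toy[!duplicated(toy), ]
print(deduped)

# Duplicates can also be checked on a key column only
id_repeated <- any(duplicated(toy$teacher_id))
print(id_repeated)
```

Checking a key such as an ID column is often stricter than checking whole rows, because two records can share an ID while differing elsewhere.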
table(data$Province) # Province distribution
##
## Eastern Kigali City Northern Southern Western
## 93 92 105 107 103
table(data$Education_Level) # Education level
##
## A0 A1 A2
## 142 213 145
table(data$Subject_Taught) # Subject taught
##
## Biology Chemistry Computer Science English
## 48 41 44 45
## Entrepreneurship French Geography History
## 50 52 41 51
## Kinyarwanda Mathematics Physics
## 46 49 33
summary(data$Age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 22.00 33.00 43.00 44.08 54.00 65.00
mean(data$Age, na.rm = TRUE)
## [1] 44.082
median(data$Age, na.rm = TRUE)
## [1] 43
# Teachers by Province × Gender
table(data$Province, data$Gender)
##
## Female Male
## Eastern 41 52
## Kigali City 38 54
## Northern 49 56
## Southern 49 58
## Western 41 62
# Teachers by School_Level × Subject_Taught
table(data$School_Level, data$Subject_Taught)
##
## Biology Chemistry Computer Science English Entrepreneurship
## Lower Secondary 8 7 11 14 9
## Pre-primary 4 6 2 2 4
## Primary 22 18 23 22 30
## Upper Secondary 14 10 8 7 7
##
## French Geography History Kinyarwanda Mathematics Physics
## Lower Secondary 14 14 14 22 8 9
## Pre-primary 3 1 6 2 6 4
## Primary 23 17 20 16 21 10
## Upper Secondary 12 9 11 6 14 10
prop.table(table(data$Age_Group))
##
## 22-26 27-36 37-46 47-56 57-65
## 0.076 0.238 0.264 0.206 0.216
Interpretation: About half of the teachers (50.2%) are in the 27–46 age range, so the workforce is mainly mid-career. Very young teachers are the smallest group (7.6% aged 22–26), while a sizeable share is nearing retirement (21.6% aged 57–65).
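Proportions are often easier to read as percentages. The sketch below re-enters the age-group counts from the table above as a named vector, so it runs without the dataset, and converts them with prop.table() and round().

```r
# Counts copied from the age-group table above
age_counts <- c("22-26" = 38, "27-36" = 119, "37-46" = 132,
                "47-56" = 103, "57-65" = 108)

# Proportions, then percentages rounded to one decimal
age_pct <- round(100 * prop.table(age_counts), 1)
print(age_pct)

# Combined share of the two mid-career groups
mid_career <- sum(age_pct[c("27-36", "37-46")])
print(mid_career)
```

The two mid-career groups add up to 50.2%, which is the figure behind the phrase "about half" in the interpretation.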
hist(data$Age,
     main = "Age Distribution of Teachers",
     xlab = "Age",
     col = "skyblue")
barplot(table(data$Age_Group),
        main = "Teacher Counts by Age Group",
        col = "lightgreen")
barplot(table(data$Subject_Taught),
        las = 2,
        main = "Teachers per Subject",
        col = "green")
barplot(table(data$Gender),
        main = "Gender Distribution of Teachers",
        col = c("green", "blue"))
boxplot(Age ~ Province,
        data = data,
        main = "Age Distribution by Province",
        col = "lightgray")
pie(table(data$Education_Level),
    main = "Education Level Distribution")
library(ggplot2)
ggplot(data, aes(x = Age)) +
  geom_histogram(binwidth = 5, fill = "skyblue", color = "black") +
  labs(title = "Age Distribution of Teachers", x = "Age", y = "Count") +
  theme_minimal()
ggplot(data, aes(x = Age_Group)) +
  geom_bar(fill = "lightgreen", color = "black") +
  labs(title = "Teacher Counts by Age Group", x = "Age Group", y = "Count") +
  theme_minimal()
ggplot(data, aes(x = Province)) +
  geom_bar(fill = "orange", color = "black") +
  labs(title = "Teachers per Province", x = "Province", y = "Count") +
  theme_minimal()
edu_counts <- table(data$Education_Level)
edu_counts
##
## A0 A1 A2
## 142 213 145
pie(edu_counts,
    labels = paste(names(edu_counts), edu_counts),
    main = "Education Level Distribution")
ggplot(data, aes(x = Subject_Taught)) +
  geom_bar(fill = "purple", color = "black") +
  labs(title = "Teachers per Subject", x = "Subject", y = "Count") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
ggplot(data, aes(x = Gender)) +
  geom_bar(fill = c("pink", "lightblue"), color = "black") +
  labs(title = "Gender Distribution of Teachers", x = "Gender", y = "Count") +
  theme_minimal()
ggplot(data, aes(x = Province, y = Age)) +
  geom_boxplot(fill = "lightgray") +
  labs(title = "Age Distribution by Province", x = "Province", y = "Age") +
  theme_minimal()
Packages to use: dplyr (part of the tidyverse), which provides the pipe operator %>%.
%>% means “and then…” and makes steps readable.
data %>%
  filter(Gender == "Female") %>% # keep only female teachers
  group_by(Province) %>% # group by Province
  summarise(Number_of_Female = n(), .groups = "drop")
## # A tibble: 5 × 2
## Province Number_of_Female
## <fct> <int>
## 1 Eastern 41
## 2 Kigali City 38
## 3 Northern 49
## 4 Southern 49
## 5 Western 41
Explanation:
filter() keeps only rows where Gender == "Female".
group_by(Province) organizes the data by province.
summarise() creates one summary row for each group.
# Teachers in Gikondo sector (example)
# Note: despite its name, data_gasabo holds the Gikondo sector rows
data_gasabo <- data %>%
  filter(Sector == "Gikondo")
# Female upper secondary teachers only
female_sec <- data %>%
  filter(Gender == "Female", School_Level == "Upper Secondary")
data_gasabo
## Province District Sector Teacher_Name Teacher_ID Education_Level
## 1 Southern Ruhango Gikondo Ngabo Rukundo T25008597 A2
## 2 Eastern Rwamagana Gikondo Aline Mukantagara T25010353 A0
## 3 Kigali City Nyarugenge Gikondo Cassien Rukundo T25011097 A0
## 4 Eastern Rwamagana Gikondo Leodomir Uwimana T25019966 A0
## 5 Kigali City Nyarugenge Gikondo Samuel Mugisha T25049974 A1
## School_ID School_Level Subject_Taught Gender Age Age_Group
## 1 SCH00085 Primary Computer Science Female 25 22-26
## 2 SCH00103 Primary Mathematics Male 37 37-46
## 3 SCH00110 Upper Secondary English Male 32 27-36
## 4 SCH00199 Upper Secondary Chemistry Female 45 37-46
## 5 SCH00499 Upper Secondary Computer Science Male 59 57-65
female_sec
## Province District Sector Teacher_Name Teacher_ID
## 1 Southern Huye Kabarondo Olive Byiringiro T25001369
## 2 Eastern Gatsibo Mugesera Alice Uwimana T25001679
## 3 Eastern Kayonza Save Patrick Byiringiro T25001964
## 4 Southern Gisagara Kinyinya Noella Kagabo T25003929
## 5 Northern Rulindo Kabatwa Patrick Mukantagara T25004189
## 6 Kigali City Kicukiro Musha Sylvie Twizerimana T25004561
## 7 Northern Rulindo Ruhashya Sylvie Byiringiro T25005352
## 8 Southern Nyanza Shangi Samuel Mutabazi T25005788
## 9 Southern Kamonyi Rubengera Leodomir Uwase T25006489
## 10 Northern Gicumbi Shangi Samuel Ndayishimiye T25006645
## 11 Eastern Kirehe Nyamabuye Aline Mukamana T25006931
## 12 Eastern Gatsibo Masaka Jean Hagenimana T25008145
## 13 Southern Ruhango Karangazi Joan Uwimana T25008416
## 14 Western Karongi Gatsibo Noella Niyonzima T25009742
## 15 Southern Ruhango Bumbogo Leodomir Rwitabiri T25010228
## 16 Northern Burera Rubengera Christianne Ndayishimiye T25011745
## 17 Northern Rulindo Mamba Ange Uwayezu T25012131
## 18 Northern Gicumbi Bugarama Gaelle Mutabazi T25012219
## 19 Western Nyabihu Bigogwe Kadete Munyaneza T25012592
## 20 Western Nyabihu Ruhashya Gaelle Mukantagara T25014255
## 21 Southern Ruhango Shangi Ngabo Murerwa T25015967
## 22 Western Rutsiro Gatenga Joan Ishimwe T25016784
## 23 Eastern Ngoma Mukarange Lema Mbarushimana T25018395
## 24 Eastern Rwamagana Gikondo Leodomir Uwimana T25019966
## 25 Southern Ruhango Gatsibo Sophie Niyonzima T25021187
## 26 Northern Burera Kabarondo Gigi Mutabazi T25021880
## 27 Western Ngororero Rilima Ange Byiringiro T25022150
## 28 Kigali City Kicukiro Karangazi Sylvie Uwayezu T25023271
## 29 Kigali City Gasabo Busoro Ngabo Munyaneza T25023678
## 30 Eastern Bugesera Jali Olive Munyaneza T25024267
## 31 Southern Nyanza Mukarange Eric Rwitabiri T25024570
## 32 Western Karongi Kacyiru Andrew Mukamana T25025099
## 33 Southern Gisagara Kigabiro Andrew Uwimana T25025127
## 34 Northern Burera Ndera Tijara Ndayishimiye T25025377
## 35 Eastern Bugesera Jabana Olive Mbarushimana T25026487
## 36 Western Rusizi Ruhashya Tijara Nkurunziza T25026571
## 37 Southern Nyamagabe Gatenga Sophie Niyonzima T25027039
## 38 Western Rusizi Matimba Lema Byiringiro T25032127
## 39 Eastern Nyagatare Rongi Yvette Mukantagara T25033688
## 40 Kigali City Gasabo Kitabi Kadete Byiringiro T25033887
## 41 Southern Nyanza Kabarondo Ange Uwase T25035132
## 42 Southern Nyamagabe Gisozi Sophie Twizerimana T25035740
## 43 Kigali City Nyarugenge Nyamata Aline Uwimana T25036413
## 44 Southern Kamonyi Musha Sophie Uwimana T25036945
## 45 Eastern Kirehe Kigarama Sylvie Twizerimana T25037141
## 46 Western Karongi Kiziguro Yvette Niyonzima T25038419
## 47 Southern Ruhango Kabatwa Olive Mukamana T25038779
## 48 Eastern Kirehe Rusororo Jean Ishimwe T25038980
## 49 Eastern Nyagatare Ndera Kadete Mugisha T25041538
## 50 Western Nyamasheke Kibilizi Gaelle Mugisha T25041775
## 51 Northern Gicumbi Gatsibo Christian Habimana T25044332
## 52 Northern Rulindo Kiyovu Herve Munyakazi T25047242
## 53 Western Ngororero Bigogwe Christian Rwitabiri T25050043
## Education_Level School_ID School_Level Subject_Taught Gender Age
## 1 A1 SCH00013 Upper Secondary Mathematics Female 37
## 2 A0 SCH00016 Upper Secondary Mathematics Female 59
## 3 A0 SCH00019 Upper Secondary French Female 57
## 4 A1 SCH00039 Upper Secondary Computer Science Female 61
## 5 A0 SCH00041 Upper Secondary French Female 57
## 6 A1 SCH00045 Upper Secondary Chemistry Female 44
## 7 A2 SCH00053 Upper Secondary French Female 43
## 8 A0 SCH00057 Upper Secondary Physics Female 24
## 9 A1 SCH00064 Upper Secondary Computer Science Female 27
## 10 A0 SCH00066 Upper Secondary Biology Female 33
## 11 A1 SCH00069 Upper Secondary English Female 54
## 12 A0 SCH00081 Upper Secondary Geography Female 60
## 13 A1 SCH00084 Upper Secondary English Female 34
## 14 A1 SCH00097 Upper Secondary Physics Female 42
## 15 A1 SCH00102 Upper Secondary Geography Female 41
## 16 A2 SCH00117 Upper Secondary Mathematics Female 48
## 17 A1 SCH00121 Upper Secondary History Female 45
## 18 A2 SCH00122 Upper Secondary Entrepreneurship Female 63
## 19 A2 SCH00125 Upper Secondary Biology Female 42
## 20 A0 SCH00142 Upper Secondary Physics Female 33
## 21 A1 SCH00159 Upper Secondary Entrepreneurship Female 36
## 22 A1 SCH00167 Upper Secondary Biology Female 61
## 23 A2 SCH00183 Upper Secondary English Female 41
## 24 A0 SCH00199 Upper Secondary Chemistry Female 45
## 25 A2 SCH00211 Upper Secondary Chemistry Female 65
## 26 A1 SCH00218 Upper Secondary Entrepreneurship Female 45
## 27 A2 SCH00221 Upper Secondary Biology Female 41
## 28 A2 SCH00232 Upper Secondary Biology Female 54
## 29 A0 SCH00236 Upper Secondary Mathematics Female 25
## 30 A0 SCH00242 Upper Secondary Biology Female 36
## 31 A0 SCH00245 Upper Secondary French Female 48
## 32 A2 SCH00250 Upper Secondary English Female 45
## 33 A0 SCH00251 Upper Secondary Entrepreneurship Female 38
## 34 A2 SCH00253 Upper Secondary Mathematics Female 63
## 35 A1 SCH00264 Upper Secondary French Female 60
## 36 A2 SCH00265 Upper Secondary Biology Female 49
## 37 A1 SCH00270 Upper Secondary History Female 47
## 38 A0 SCH00321 Upper Secondary Mathematics Female 61
## 39 A1 SCH00336 Upper Secondary English Female 23
## 40 A1 SCH00338 Upper Secondary Mathematics Female 56
## 41 A0 SCH00351 Upper Secondary Geography Female 36
## 42 A1 SCH00357 Upper Secondary Kinyarwanda Female 34
## 43 A0 SCH00364 Upper Secondary Geography Female 59
## 44 A1 SCH00369 Upper Secondary Mathematics Female 33
## 45 A1 SCH00371 Upper Secondary Mathematics Female 64
## 46 A2 SCH00384 Upper Secondary Kinyarwanda Female 40
## 47 A1 SCH00387 Upper Secondary Computer Science Female 53
## 48 A0 SCH00389 Upper Secondary Chemistry Female 29
## 49 A1 SCH00415 Upper Secondary Biology Female 37
## 50 A0 SCH00417 Upper Secondary Chemistry Female 64
## 51 A0 SCH00443 Upper Secondary Chemistry Female 46
## 52 A2 SCH00472 Upper Secondary History Female 27
## 53 A1 SCH00500 Upper Secondary History Female 57
## Age_Group
## 1 37-46
## 2 57-65
## 3 57-65
## 4 57-65
## 5 57-65
## 6 37-46
## 7 37-46
## 8 22-26
## 9 27-36
## 10 27-36
## 11 47-56
## 12 57-65
## 13 27-36
## 14 37-46
## 15 37-46
## 16 47-56
## 17 37-46
## 18 57-65
## 19 37-46
## 20 27-36
## 21 27-36
## 22 57-65
## 23 37-46
## 24 37-46
## 25 57-65
## 26 37-46
## 27 37-46
## 28 47-56
## 29 22-26
## 30 27-36
## 31 47-56
## 32 37-46
## 33 37-46
## 34 57-65
## 35 57-65
## 36 47-56
## 37 47-56
## 38 57-65
## 39 22-26
## 40 47-56
## 41 27-36
## 42 27-36
## 43 57-65
## 44 27-36
## 45 57-65
## 46 37-46
## 47 47-56
## 48 27-36
## 49 37-46
## 50 57-65
## 51 37-46
## 52 27-36
## 53 57-65
nrow(data_gasabo)
## [1] 5
nrow(female_sec)
## [1] 53
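The same "filter, then count" idea can be sketched without any packages, using the native pipe |> that base R has offered since version 4.1. This is a swapped-in alternative to the dplyr code above, not a replacement for it, and the small data frame toy is made up for illustration.

```r
# Hypothetical mini dataset
toy <- data.frame(
  Province = c("Eastern", "Eastern", "Western", "Western", "Western"),
  Gender   = c("Female", "Male", "Female", "Female", "Male")
)

# Read as: take toy, and then keep the female rows, and then count by province
female_by_province <- toy |>
  subset(Gender == "Female") |>
  with(table(Province))

print(female_by_province)
```

The result counts one female teacher in Eastern and two in Western. With larger pipelines, dplyr's filter(), group_by() and summarise() remain easier to read and extend.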
# Oldest teachers first
data %>%
arrange(desc(Age)) %>%
  head(5)
## Province District Sector Teacher_Name Teacher_ID Education_Level
## 1 Kigali City Kicukiro Ndera Ngabo Uwayezu T25004614 A2
## 2 Southern Kamonyi Kiyumba Gigi Rukundo T25007444 A1
## 3 Eastern Kirehe Karama Christian Mugisha T25015136 A1
## 4 Northern Gicumbi Kinyinya Herve Hagenimana T25015265 A2
## 5 Western Nyamasheke Gikonko Olive Kagabo T25015491 A1
## School_ID School_Level Subject_Taught Gender Age Age_Group
## 1 SCH00046 Primary Chemistry Male 65 57-65
## 2 SCH00074 Upper Secondary History Male 65 57-65
## 3 SCH00151 Primary Kinyarwanda Female 65 57-65
## 4 SCH00152 Primary French Male 65 57-65
## 5 SCH00154 Primary Mathematics Female 65 57-65
# Youngest teachers first
data %>%
arrange(Age) %>%
  head(5)
## Province District Sector Teacher_Name Teacher_ID Education_Level
## 1 Kigali City Gasabo Matimba Gaelle Munyakazi T25014157 A0
## 2 Southern Gisagara Remera Noella Nkurunziza T25000239 A1
## 3 Eastern Nyagatare Rongi Yvette Mukantagara T25033688 A1
## 4 Eastern Rwamagana Murunda Patrick Niyonzima T25037917 A2
## 5 Kigali City Kicukiro Mamba Ngabo Umutesi T25043231 A2
## School_ID School_Level Subject_Taught Gender Age Age_Group
## 1 SCH00141 Lower Secondary Computer Science Female 22 22-26
## 2 SCH00002 Pre-primary French Female 23 22-26
## 3 SCH00336 Upper Secondary English Female 23 22-26
## 4 SCH00379 Lower Secondary Kinyarwanda Female 23 22-26
## 5 SCH00432 Pre-primary Biology Male 23 22-26
data <- data %>%
mutate(
Near_Retirement = if_else(Age >= 57, "Yes", "No")
)
head(data, 5)
## Province District Sector Teacher_Name Teacher_ID Education_Level
## 1 Kigali City Gasabo Rongi Aline Munyaneza T25000138 A2
## 2 Southern Gisagara Remera Noella Nkurunziza T25000239 A1
## 3 Western Nyamasheke Gahanga Leodomir Murerwa T25000345 A0
## 4 Northern Gicumbi Gatenga Sophie Niyonzima T25000423 A2
## 5 Kigali City Nyarugenge Gatsibo Tijara Habimana T25000558 A2
## School_ID School_Level Subject_Taught Gender Age Age_Group Near_Retirement
## 1 SCH00001 Primary Geography Male 27 27-36 No
## 2 SCH00002 Pre-primary French Female 23 22-26 No
## 3 SCH00003 Pre-primary Kinyarwanda Female 27 27-36 No
## 4 SCH00004 Pre-primary Chemistry Female 43 37-46 No
## 5 SCH00005 Primary Entrepreneurship Female 42 37-46 No
small <- data %>%
select(Teacher_ID, Teacher_Name, Province, District, Sector)
head(small, 2)
## Teacher_ID Teacher_Name Province District Sector
## 1 T25000138 Aline Munyaneza Kigali City Gasabo Rongi
## 2 T25000239 Noella Nkurunziza Southern Gisagara Remera
write.csv(
data,
"C:/Users/HP/Desktop/R PROGRAMING/data_cleaned.csv",
row.names = FALSE
)
You now have a cleaned and documented teacher dataset, ready for more advanced analysis.
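One thing to keep in mind when reloading a saved CSV: the file stores only text, so factor columns come back as character and need converting again. The sketch below shows this round trip on a small made-up data frame written to a temporary file; the names toy and path are assumptions for illustration.

```r
# Hypothetical data with one factor column
toy <- data.frame(
  id    = 1:3,
  level = factor(c("A2", "A1", "A2"))
)

# Write to a temporary file and read it back
path <- tempfile(fileext = ".csv")
write.csv(toy, path, row.names = FALSE)
restored <- read.csv(path)

class_before <- class(toy$level)       # type before saving
class_after  <- class(restored$level)  # type after reloading
print(class_before)
print(class_after)

# Convert back to a factor after loading
restored$level <- as.factor(restored$level)
print(levels(restored$level))

# Remove the temporary file
unlink(path)
```

If the column types must survive exactly, saveRDS() and readRDS() store the R object itself rather than plain text.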