R is a statistical programming language created at the University of Auckland (New Zealand) by Ross Ihaka and Robert Gentleman. It evolved from the S language developed at Bell Labs. The R version used here is R 4.5.1 (2025-06-13 ucrt).
Welcome to R!
R is a powerful, free environment for statistical computing and graphics.
Open-source and free
Powerful for statistics
Excellent visualization (e.g., ggplot2)
RStudio IDE has four panes: Source, Console, Environment/History, and Files/Plots/Packages/Help.
R can perform calculations in the console.
1 + 1
## [1] 2
2 - 5
## [1] -3
5 * 6
## [1] 30
9 / 3
## [1] 3
6 %% 2 # remainder (modulo)
## [1] 0
2 ^ 10 # or 2 ** 10
## [1] 1024
1035 %/% 3 # integer division
## [1] 345
1 < 0
## [1] FALSE
1 <= 1
## [1] TRUE
4 > 5
## [1] FALSE
3 >= 3
## [1] TRUE
"R" == "r"
## [1] FALSE
The equality operator is vectorized, so it can also compare one value against every element of a vector.
"Species" == c("Sepal.Length", "Sepal.Width", "Petal.Length",
               "Petal.Width", "Species")
## [1] FALSE FALSE FALSE FALSE TRUE
5 != 5
## [1] FALSE
The negation operator ! changes a TRUE condition to FALSE (and a FALSE condition to TRUE).
!TRUE # or !T
## [1] FALSE
!(T & F) # T & F is FALSE, so this is TRUE
## [1] TRUE
!(F | T) # F | T is TRUE, so this is FALSE
## [1] FALSE
TRUE & TRUE
## [1] TRUE
TRUE & FALSE
## [1] FALSE
F & F
## [1] FALSE
F & T
## [1] FALSE
T | T
## [1] TRUE
T | F
## [1] TRUE
F | F
## [1] FALSE
F | T
## [1] TRUE
R also has built-in functions that help match the elements of a given vector. The first is match(). You can check its documentation with help("match") or ?match, which says: match returns a vector of the positions of (first) matches of its first argument in its second.
match("Species", c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width", "Species"))
## [1] 5
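match() is vectorized over its first argument, so several values can be looked up at once. A minimal sketch, assuming a made-up lookup vector called wanted (the name "Color" is deliberately not one of the column names):

```r
cols <- c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width", "Species")
wanted <- c("Species", "Petal.Width", "Color")  # hypothetical lookup values

# One position per element of wanted; NA marks a value with no match
positions <- match(wanted, cols)
print(positions)

# nomatch replaces the NA with a value of your choice (here 0)
positions0 <- match(wanted, cols, nomatch = 0)
print(positions0)
```

The result always has the same length as the first argument, which makes it convenient for looking up many names in one call.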
The second, the %in% operator, checks whether a value exists in a given vector of values.
"Species" %in% c("Sepal.Length", "Sepal.Width", "Petal.Length",
                 "Petal.Width", "Species")
## [1] TRUE
In R we can use <-, = (single equal sign !) and -> to assign a value to a variable.
A variable name:
# t trainind <- "r programming" # error because of the space: unexpected symbol in "t trainind"
a <- 5
b <- 6
0 -> .a
a1 = .2
In R we have the following data types: numeric, integer, complex, character, logical, raw, and factor.
Examples of numeric values are 15.5, 505, 38, and pi.
q <- 10.7
print(class(q))
## [1] "numeric"
print(typeof(q))
## [1] "double"
q <- 5L
print(class(q))
## [1] "integer"
print(typeof(q))
## [1] "integer"
An example of a complex number is 3+1i, where 3 is the real part, 1 is the imaginary part, and i is the imaginary unit. Multiplying a real number by 1i turns it into a complex number.
q <- 3+1i
print(class(q))
## [1] "complex"
print(typeof(q))
## [1] "complex"
p1 <- a + 1i*b
print(a1)
## [1] 0.2
string <- "I am Learning R"
class(string)
## [1] "character"
Remember: R is case-sensitive, so "LeaRning" is different from "Learning".
TRUE # or T
## [1] TRUE
FALSE # or F
## [1] FALSE
A logical value can also be the outcome of a test. For example, to check whether "LeaRning" == "Learning":
"LeaRning" == "Learning"
## [1] FALSE
text <- "Christian Mugisha."
(raw_text <- charToRaw(text))
## [1] 43 68 72 69 73 74 69 61 6e 20 4d 75 67 69 73 68 61 2e
class(raw_text)
## [1] "raw"
Converting raw to text:
rawToChar(raw_text)
## [1] "Christian Mugisha."
Factors are a data type used to represent qualitative (categorical) values such as colors, good and bad, or course and movie ratings. They are useful in statistical modeling.
Gender <- factor(c("Female", "Male"))
print(Gender)
## [1] Female Male
## Levels: Female Male
class(Gender)
## [1] "factor"
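By default, factor levels are sorted alphabetically. When the categories have a natural order (such as ratings), the order can be set explicitly. A minimal sketch, assuming a made-up ratings vector:

```r
ratings <- c("good", "bad", "good", "average")  # hypothetical ratings

# levels = fixes the order; ordered = TRUE makes comparisons meaningful
rating_f <- factor(ratings, levels = c("bad", "average", "good"), ordered = TRUE)

print(levels(rating_f))   # level order as specified above
print(table(rating_f))    # counts per level, in level order
print(rating_f > "bad")   # comparison follows the level order
```

Setting the levels explicitly also controls the order in which categories appear in tables and plots.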
v <- TRUE
w <- FALSE
class(v); typeof(v)
## [1] "logical"
## [1] "logical"
!v
## [1] FALSE
isTRUE(w)
## [1] FALSE
t <- 10
x <- numeric(t) # creates a numeric vector of length t, filled with zeros
print(x)
## [1] 0 0 0 0 0 0 0 0 0 0
# assigning values to x:
x[1] <- 2.5
print(x)
## [1] 2.5 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
n <- 5
x <- integer(n) # creates an integer vector of length n, filled with zeros
print(x)
## [1] 0 0 0 0 0
class(x)
## [1] "integer"
# assigning values to x:
x[1] <- 2.5 # R automatically coerces the whole vector from integer to numeric (double)
class(x)
## [1] "numeric"
print(x)
## [1] 2.5 0.0 0.0 0.0 0.0
To convert between data types in R, use the base functions whose names start with as. followed by the target type (for example, as.numeric() or as.character()).
Numeric to character
Character to numeric
Factor to character
Character to factor
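The four conversions listed above can be sketched as follows. The values are made up for illustration; the last two lines show a common pitfall when a factor holds digits.

```r
num_chr <- as.character(15.5)                     # numeric to character
chr_num <- as.numeric("38")                       # character to numeric
fac_chr <- as.character(factor(c("low", "high"))) # factor to character
chr_fac <- as.factor(c("A2", "A1", "A2"))         # character to factor

print(num_chr); print(chr_num); print(fac_chr); print(chr_fac)

# Pitfall: as.numeric() on a factor returns the internal level codes,
# so convert to character first to recover the original numbers
years <- factor(c("2020", "2010", "2020"))
codes  <- as.numeric(years)                # level codes, not the years
values <- as.numeric(as.character(years))  # the years themselves
print(codes); print(values)
```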
A scalar is a single number from N, Z, D, Q, R, or C (complex numbers appear, for example, in quantum mechanics).
Vectors: a vector is a collection of objects of the same type. A vector can also be a sequence. Let’s create a vector with elements of different types to see how R deals with them.
Numerics and characters
# ?c
v <- c(1, "R", T, FALSE, NA)
# print v
print(v)
## [1] "1" "R" "TRUE" "FALSE" NA
# what is the class of v?
class(v)
## [1] "character"
R converts every element to character, except NA, which stays NA because a missing value can exist in any type.
v2 <- c(1, 4, 8, FALSE, TRUE, FALSE, FALSE, TRUE, "R" == "r")
print(v2)
## [1] 1 4 8 0 1 0 0 1 0
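The two results above follow R's coercion order: logical, then integer, then double, then character. When types are mixed, every element is converted to the most general type present. A short sketch with made-up values:

```r
lgl_int <- c(TRUE, 2L)    # logical + integer  -> integer
int_dbl <- c(2L, 3.5)     # integer + double   -> numeric (double)
dbl_chr <- c(3.5, "a")    # double + character -> character

print(class(lgl_int))
print(class(int_dbl))
print(class(dbl_chr))

# Because TRUE becomes 1 and FALSE becomes 0, sum() counts the TRUE values
n_true <- sum(c(TRUE, FALSE, TRUE))
print(n_true)
```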
Arithmetic on vectors is vectorized in R: you can perform arithmetic on all elements at once. Operations happen element by element.
a <- c(5, 6, 7)
b <- c(10, 20, 30)
# Addition (adds 1st to 1st, 2nd to 2nd, etc.)
a + b
## [1] 15 26 37
# Multiplication
a * b
## [1] 50 120 210
If one vector is shorter than the other, R will recycle (repeat) elements of the shorter vector until it matches the length of the longer one.
a <- c(1, 2, 3, 4)
b <- c(10, 20)
a + b
## [1] 11 22 13 24
# First recycle b: (10, 20, 10, 20)
# Result: (1+10, 2+20, 3+10, 4+20)
⚠️ Be careful!
If the longer vector’s length is not a multiple of the shorter vector’s length, R still recycles the shorter vector but issues a warning.
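To see that warning without interrupting a script, the sketch below captures its text while still keeping the result. The vectors long and short are made up; withCallingHandlers() and invokeRestart("muffleWarning") are base R.

```r
long  <- c(1, 2, 3)
short <- c(10, 20)   # length 2 does not divide length 3

# Store the warning text instead of printing it, then continue the computation
warning_text <- NULL
result <- withCallingHandlers(
  long + short,
  warning = function(w) {
    warning_text <<- conditionMessage(w)
    invokeRestart("muffleWarning")
  }
)

print(result)        # short is recycled as (10, 20, 10)
print(warning_text)  # the message R would have shown
```

The calculation still returns a value, so an unnoticed warning of this kind can hide a length mistake in the data.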
You can use square brackets [ ] to select elements from a vector.
# Get the first item in v2
v2[1]
## [1] 1
# Get the 2nd and 4th in v2
v2[c(2, 4)]
## [1] 4 0
# Exclude the 3rd element
v2[-3]
## [1] 1 4 0 1 0 0 1 0
# Change the first element of v2
v2[1] <- 27
print(v2)
## [1] 27 4 8 0 1 0 0 1 0
# Add a new number (creates a longer vector)
v21 <- c(v2, 81)
v21
## [1] 27 4 8 0 1 0 0 1 0 81
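Square brackets also accept a logical vector, which keeps only the positions that are TRUE. This is how elements are selected by condition. A minimal sketch, assuming a made-up vector of scores:

```r
scores <- c(45, 72, 58, 90, 33)   # hypothetical values

passed <- scores >= 50            # logical vector, one TRUE/FALSE per element
print(passed)

above_50 <- scores[passed]        # keep only elements where passed is TRUE
print(above_50)

positions <- which(passed)        # positions of the TRUE values
print(positions)

mid_range <- scores[scores >= 50 & scores < 80]  # combine conditions with &
print(mid_range)
```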
Q1: Create Variables
Create three variables, one of which is logical (TRUE or FALSE).
Print out the class of each variable.
Q2: Vector Operations
Create a vector expenses with the values: 1500, 2000, 1200, 3000.
Create a variable income with the value 10000.
Subtract the total expenses from income (hint: use sum()).
Q3
Create a numeric vector sales_q1 with the values: 120, 150, 90.
Create another numeric vector sales_q2 with the values: 130, 160, 95.
Calculate the total sales for each store by adding the two vectors.
Calculate the percentage increase from Q1 to Q2: ((Q2 - Q1) / Q1) * 100
Extract only the results for stores with an increase greater than 10%.
Vectors are everywhere in R — even a single number like
42 is technically a vector of length 1.
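This can be checked directly in the console. A short sketch:

```r
x <- 42

n <- length(x)                 # number of elements
first <- x[1]                  # indexing works as on any vector
same <- identical(x, c(42))    # 42 and c(42) are the same object

print(n); print(is.vector(x)); print(first); print(same)
```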
The most important data structure for data analysis in R is the data frame. A data frame is like:
A spreadsheet in Excel
A table in a database
A tibble in tidyverse (we’ll see later)
It is a two-dimensional table:
Rows = observations/records
Columns = variables/features
Each column can have a different type (numeric, character, logical, etc.)
You can create a data frame using the data.frame() function.
# Create a simple data frame
employee_data <- data.frame(
id = c(1, 2, 3), # Numeric column
name = c("John", "Jane", "Peter"), # Character column
salary = c(50000, 55000, 52000) # Numeric column
)
employee_data
##   id  name salary
## 1  1  John  50000
## 2  2  Jane  55000
## 3  3 Peter  52000
# View structure
str(employee_data)
## 'data.frame': 3 obs. of 3 variables:
##  $ id    : num 1 2 3
##  $ name  : chr "John" "Jane" "Peter"
##  $ salary: num 50000 55000 52000
# View summary statistics
summary(employee_data)
##        id          name               salary
##  Min.   :1.0   Length:3           Min.   :50000
##  1st Qu.:1.5   Class :character   1st Qu.:51000
##  Median :2.0   Mode  :character   Median :52000
##  Mean   :2.0                      Mean   :52333
##  3rd Qu.:2.5                      3rd Qu.:53500
##  Max.   :3.0                      Max.   :55000
There are several ways to access parts of a data frame:
# Access a column by $ name
employee_data$name
## [1] "John" "Jane" "Peter"
# Access by index (row, column)
employee_data[1, 2] # Row 1, column 2
## [1] "John"
employee_data[ , "salary"] # All rows, salary column
## [1] 50000 55000 52000
# Multiple columns
employee_data[ , c("name", "salary")]
##    name salary
## 1  John  50000
## 2  Jane  55000
## 3 Peter  52000
# Multiple rows
employee_data[1:2, ]
##   id name salary
## 1  1 John  50000
## 2  2 Jane  55000
# Add a new column
employee_data$department <- c("HR", "Finance", "IT")
# Add a new row
new_row <- data.frame(id=4, name="Alice", salary=60000, department="Marketing")
employee_data <- rbind(employee_data, new_row)
employee_data
##   id  name salary department
## 1  1  John  50000         HR
## 2  2  Jane  55000    Finance
## 3  3 Peter  52000         IT
## 4  4 Alice  60000  Marketing
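Rows can also be selected by a condition, and a new column can be computed from an existing one. A minimal sketch on a small copy of the example data (named staff here so it does not overwrite employee_data; the 10% bonus rate is a made-up figure):

```r
staff <- data.frame(
  id = c(1, 2, 3),
  name = c("John", "Jane", "Peter"),
  salary = c(50000, 55000, 52000)
)

# Keep only the rows where the condition is TRUE (all columns)
high_paid <- staff[staff$salary > 51000, ]
print(high_paid)

# Vectorized arithmetic fills the new column for every row at once
staff$bonus <- staff$salary * 0.10
print(staff)
```

The comma inside the brackets matters: the condition goes before it (rows), and leaving the part after it empty keeps every column.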
In practice, we don’t type all our data by hand — we read it from files:
# Read CSV file
#my_data <- read.csv("Data.csv")
# Read Excel file
# install.packages("readxl")
library(readxl)
#my_data <- read_excel("Data.xlsx")
Q1. Create a data frame named students with columns:
id (1 to 5)
name (five student names)
grade (five numeric grades)
Q2. Extract the grades of the first three students.
Q3. Add a new column pass that is TRUE if grade ≥ 50, otherwise FALSE.
Q4. Add a new row for a sixth student.
One of R’s biggest strengths is its community-contributed packages. A package is like an app for R — it contains:
Functions (tools to do specific tasks)
Data sets (ready-to-use examples)
Documentation (help files and guides)
Before using a package, you install it once on your computer (like downloading an app from an app store).
# Install a package (done only once, unless you update R or reinstall)
#install.packages("haven")
#install.packages("tidyverse")
Every time you start a new R session (or reopen RStudio), you must load the package to use it.
# Load packages for use in this session
library(haven) # For reading Stata, SPSS, SAS files
library(tidyverse) # For data manipulation & visualization
haven: Reads data from Stata (.dta), SPSS, and SAS files while preserving labels.
tidyverse: A collection of packages for data manipulation (dplyr), visualization (ggplot2), data tidying (tidyr), and more.
# Check installed packages
# installed.packages()
# Or quickly check one package
"haven" %in% rownames(installed.packages())
## [1] TRUE
Updating Packages
# Update all installed packages
# update.packages()
Install the package readxl (for reading Excel files).
Load it into your session.
Check if the package ggplot2 is already installed on your system.
If you get the error “there is no package called …”, it means you need to install it first.
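A script can check for a package before trying to load it, which avoids that error. A minimal sketch: is_installed() is a hypothetical helper written for this example, built on base R's requireNamespace(); the package name "notARealPackage123" is made up to show the negative case.

```r
# Hypothetical helper: TRUE if the package's namespace can be loaded, FALSE otherwise
is_installed <- function(pkg) {
  requireNamespace(pkg, quietly = TRUE)
}

print(is_installed("stats"))               # stats ships with R
print(is_installed("notARealPackage123"))  # made-up name, not installed

# Typical use: tell the user what to do instead of failing at library()
if (!is_installed("readxl")) {
  message("readxl is not installed; run install.packages(\"readxl\") first")
}
```

This is usually faster than scanning installed.packages(), which builds a table of every installed package.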
One of the first steps in any data analysis project is getting your data into R. R can read many file formats: .csv, .xlsx, .dta, .sav, .json, and more.
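In this course, we’ll start with a CSV (.csv) file.
We’ll use read.csv() from base R to import it. The tidyverse alternative is read_csv() from the readr package (loaded with the tidyverse); haven is only needed for Stata, SPSS, and SAS files.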
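If the course file is not at hand, the import step can be tried on a small temporary file. A minimal sketch, assuming a made-up two-row data frame (the IDs and values are illustrative only):

```r
# Write a tiny CSV to a temporary location, then read it back
tmp <- tempfile(fileext = ".csv")
toy <- data.frame(
  Teacher_ID = c("T001", "T002"),               # hypothetical IDs
  Province = c("Kigali City", "Southern"),
  Date_of_Birth = c("1998-04-15", "2002-08-21")
)
write.csv(toy, tmp, row.names = FALSE)

imported <- read.csv(tmp)
str(imported)   # dates arrive as character and must be converted explicitly

unlink(tmp)     # remove the temporary file
```

Using tempfile() avoids hard-coded absolute paths, which only work on the computer where they were written.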
library(dplyr)
data <- read.csv("C:\\Users\\HP\\Desktop\\R PROGRAMING\\rwanda_teachers_500.csv") # read the CSV file into a data frame
head(data,5) # display first 5 rows
##      Province   District  Sector      Teacher_Name Teacher_ID Education_Level
## 1 Kigali City     Gasabo   Rongi   Aline Munyaneza  T25000138              A2
## 2    Southern   Gisagara  Remera Noella Nkurunziza  T25000239              A1
## 3     Western Nyamasheke Gahanga  Leodomir Murerwa  T25000345              A0
## 4    Northern    Gicumbi Gatenga  Sophie Niyonzima  T25000423              A2
## 5 Kigali City Nyarugenge Gatsibo   Tijara Habimana  T25000558              A2
##   School_ID School_Level   Subject_Taught Date_of_Birth Gender
## 1  SCH00001      Primary        Geography    1998-04-15   Male
## 2  SCH00002  Pre-primary           French    2002-08-21 Female
## 3  SCH00003  Pre-primary      Kinyarwanda    1998-10-19 Female
## 4  SCH00004  Pre-primary        Chemistry    1982-11-11 Female
## 5  SCH00005      Primary Entrepreneurship    1983-05-04 Female
# After loading the data, inspect its structure first
str(data)
## 'data.frame': 500 obs. of 11 variables:
## $ Province : chr "Kigali City" "Southern" "Western" "Northern" ...
## $ District : chr "Gasabo" "Gisagara" "Nyamasheke" "Gicumbi" ...
## $ Sector : chr "Rongi" "Remera" "Gahanga" "Gatenga" ...
## $ Teacher_Name : chr "Aline Munyaneza" "Noella Nkurunziza" "Leodomir Murerwa" "Sophie Niyonzima" ...
## $ Teacher_ID : chr "T25000138" "T25000239" "T25000345" "T25000423" ...
## $ Education_Level: chr "A2" "A1" "A0" "A2" ...
## $ School_ID : chr "SCH00001" "SCH00002" "SCH00003" "SCH00004" ...
## $ School_Level : chr "Primary" "Pre-primary" "Pre-primary" "Pre-primary" ...
## $ Subject_Taught : chr "Geography" "French" "Kinyarwanda" "Chemistry" ...
## $ Date_of_Birth : chr "1998-04-15" "2002-08-21" "1998-10-19" "1982-11-11" ...
## $ Gender : chr "Male" "Female" "Female" "Female" ...
You’ve seen that the data has 500 rows and 11 columns, and right now every column is a character (chr). That is expected here, because read.csv() reads text columns as character by default, but we should:
turn Date_of_Birth into a real Date
turn categories (Province, District, …) into factors (we convert those columns to factors because they are labels, not free text, and factors make summaries, plots, and models work the way you expect)
quickly check duplicates, missing values, and simple counts
⚠️ Be careful!
If you’re still cleaning typos (e.g., “Nyarugenge” vs “Nyarugunga”), fix the strings first, then convert to factor so you don’t bake in bad levels.
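The effect of cleaning before converting can be sketched on a made-up vector with stray spaces and inconsistent capitalization. The entries below are illustrative and not taken from the dataset; real typos usually also need manual recoding.

```r
sector_raw <- c("Remera", "remera ", " Remera", "Gisozi")  # hypothetical messy entries

# Converting straight away treats every variant as its own level
n_before <- nlevels(factor(sector_raw))
print(n_before)

# Clean first: trim whitespace, then capitalize only the first letter
trimmed <- trimws(sector_raw)
sector_clean <- paste0(toupper(substring(trimmed, 1, 1)),
                       tolower(substring(trimmed, 2)))

sector_f <- factor(sector_clean)
print(levels(sector_f))
```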
## Converting columns to appropriate data types
# The data frame here is called data
data$Date_of_Birth <- as.Date(data$Date_of_Birth, format = "%Y-%m-%d")
data$Subject_Taught <- as.factor(data$Subject_Taught) # as a factor, summary() counts teachers per subject
data$Sector <- as.factor(data$Sector) # as a factor, summary() counts teachers per sector (or use table(data$Sector))
# Quick check
summary(data$Subject_Taught)
##          Biology        Chemistry Computer Science          English
##               48               41               44               45
## Entrepreneurship           French        Geography          History
##               50               52               41               51
##      Kinyarwanda      Mathematics          Physics
##               46               49               33
summary(data$Sector)
## Bigogwe Bugarama Bumbogo Busoro Bwishyura Gahanga Gashonga
## 10 5 8 15 5 8 7
## Gashora Gatenga Gatsibo Gihundwe Gikoma Gikondo Gikonko
## 5 8 12 5 7 5 9
## Gisozi Jabana Jali Jenda Kabarondo Kabarore Kabatwa
## 12 4 9 5 13 6 11
## Kacyiru Kanombe Karama Karangazi Kibilizi Kigabiro Kigarama
## 6 9 11 17 12 7 17
## Kimironko Kinyinya Kitabi Kiyovu Kiyumba Kiziguro Mamba
## 3 10 6 7 5 5 11
## Masaka Matimba Mugesera Mukarange Mukura Murunda Musha
## 6 6 5 6 8 10 8
## Ndera Ntyazo Nyamabuye Nyamata Nyamirambo Remera Rilima
## 8 8 8 7 10 10 11
## Rongi Rubengera Ruhashya Rusororo Rwinkwavu Save Shangi
## 12 6 15 9 6 7 11
## Shyogor Shyogwe Tumba
## 8 14 6
str(data$Date_of_Birth)
## Date[1:500], format: "1998-04-15" "2002-08-21" "1998-10-19" "1982-11-11" "1983-05-04" ...
## Converting the date of birth into age (years)
# Create a new integer variable Age: days since birth divided by 365.25, rounded down
data$Age <- as.integer(floor((Sys.Date() - data$Date_of_Birth) / 365.25))
Why divide by 365.25?
In each 4-year cycle, 3 years have 365 days
and 1 year (the leap year) has 366 days.
On average: (3 × 365 + 366) / 4 = 365.25 days per year.
So, using 365.25 gives a more accurate conversion from days to years than 365 when calculating ages.
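The same calculation can be checked on a few hand-picked dates. In this sketch, age_years() is a hypothetical helper and the reference date is fixed (an assumption, so the result does not change from day to day as it does with Sys.Date()):

```r
# Hypothetical helper: whole years between dob and ref, using 365.25 days per year
age_years <- function(dob, ref) {
  as.integer(floor(as.numeric(ref - dob) / 365.25))
}

ref <- as.Date("2025-01-01")                    # fixed reference date
dob <- as.Date(c("2000-01-01", "2000-06-15"))   # made-up birth dates

ages <- age_years(dob, ref)
print(ages)   # 25 and 24: the second person has not yet had a 2025 birthday

# Limitation: 365.25 is an average, so the result can be off on the birthday itself
edge <- age_years(as.Date("2001-01-01"), as.Date("2002-01-01"))
print(edge)   # 365 / 365.25 is below 1, so this gives 0 on the first birthday
```

For summaries and age groups this approximation is usually adequate; where the exact birthday matters, comparing year, month, and day directly is safer.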
## Converting education level, province, district, and school level to factors
# The data frame here is called data
data$Education_Level <- as.factor(data$Education_Level)
data$Province <- as.factor(data$Province)
data$District <- as.factor(data$District)
data$School_Level <- as.factor(data$School_Level)
# Check the structure
str(data)
## 'data.frame': 500 obs. of 12 variables:
## $ Province : Factor w/ 5 levels "Eastern","Kigali City",..: 2 4 5 3 2 3 2 2 1 1 ...
## $ District : Factor w/ 30 levels "Bugesera","Burera",..: 4 7 21 6 23 27 12 23 19 16 ...
## $ Sector : Factor w/ 59 levels "Bigogwe","Bugarama",..: 50 48 6 9 10 28 16 44 49 21 ...
## $ Teacher_Name : chr "Aline Munyaneza" "Noella Nkurunziza" "Leodomir Murerwa" "Sophie Niyonzima" ...
## $ Teacher_ID : chr "T25000138" "T25000239" "T25000345" "T25000423" ...
## $ Education_Level: Factor w/ 3 levels "A0","A1","A2": 3 2 1 3 3 2 3 3 1 2 ...
## $ School_ID : chr "SCH00001" "SCH00002" "SCH00003" "SCH00004" ...
## $ School_Level : Factor w/ 4 levels "Lower Secondary",..: 3 2 2 2 3 1 3 3 4 3 ...
## $ Subject_Taught : Factor w/ 11 levels "Biology","Chemistry",..: 7 6 9 2 5 4 5 5 11 5 ...
## $ Date_of_Birth : Date, format: "1998-04-15" "2002-08-21" ...
## $ Gender : chr "Male" "Female" "Female" "Female" ...
## $ Age : int 27 23 27 43 42 26 44 41 63 32 ...
Before plotting age distributions, it helps to create age groups. To choose appropriate group boundaries, first check the minimum and maximum ages with the summary() function, which returns descriptive statistics for a variable (or for every column of a data frame). For customized summaries, especially on grouped data frames, use summarise() from the dplyr package.
summary(data$Age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   22.00   33.00   43.00   44.08   54.00   65.00
The minimum age is 22 and the maximum is 65, so the age-group boundaries must cover this range.
# Define age groups
data$Age_Group <- cut(
data$Age,
breaks = c(22, 27, 37, 47, 57, 66), # upper bound is exclusive, so use 66 to include 65
labels = c("22-26", "27-36", "37-46", "47-56", "57-65"),
right = FALSE # ensures intervals are [22,27), [27,37), etc.
)
# Check distribution
table(data$Age_Group)
##
## 22-26 27-36 37-46 47-56 57-65
##    38   119   132   103   108
⚠️ Be careful!
With the same breaks in cut(), right = FALSE makes left-closed, right-open bins (e.g., "22-26" = [22,27)), so the first bin includes ages 22 to 26 and excludes 27. By contrast, right = TRUE makes left-open, right-closed bins (e.g., "22-26" = (22,27]), so the first bin excludes 22 but includes 23 to 27. If you need to keep right = TRUE and still include 22 in the first bin, add include.lowest = TRUE.
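The three settings can be compared side by side on the boundary ages. A minimal sketch using the same breaks as above, with the default interval labels left in place so the bin edges are visible:

```r
ages <- c(22, 26, 27, 65)               # boundary cases
breaks <- c(22, 27, 37, 47, 57, 66)

left_closed  <- cut(ages, breaks = breaks, right = FALSE)   # [22,27), [27,37), ...
right_closed <- cut(ages, breaks = breaks, right = TRUE)    # (22,27], (27,37], ...
with_lowest  <- cut(ages, breaks = breaks, right = TRUE,
                    include.lowest = TRUE)                  # first bin becomes [22,27]

print(data.frame(ages, left_closed, right_closed, with_lowest))
```

In the right_closed column, age 22 becomes NA because it falls outside (22,27], which is why the sanity check sum(is.na(...)) after cut() is worth keeping.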
# 2) Check distribution of groups
table(data$Age_Group)
##
## 22-26 27-36 37-46 47-56 57-65
##    38   119   132   103   108
prop.table(table(data$Age_Group)) # proportions (multiply by 100 for percentages)
##
## 22-26 27-36 37-46 47-56 57-65
## 0.076 0.238 0.264 0.206 0.216
# 3) Sanity checks
summary(data$Age) # Min/Qu/Median/Mean/Max
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   22.00   33.00   43.00   44.08   54.00   65.00
sum(is.na(data$Age_Group)) # should be 0 with ages 22-65
## [1] 0
levels(data$Age_Group) # confirm label order
## [1] "22-26" "27-36" "37-46" "47-56" "57-65"
any(is.na(data)) # Check if any missing values exist
## [1] FALSE
sum(is.na(data)) # Count total missing values
## [1] 0
colSums(is.na(data)) # Missing values per column
##        Province        District          Sector    Teacher_Name      Teacher_ID
##               0               0               0               0               0
## Education_Level       School_ID    School_Level  Subject_Taught   Date_of_Birth
##               0               0               0               0               0
##          Gender             Age       Age_Group
##               0               0               0
rowSums(is.na(data)) # Missing values per row
## [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [38] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [75] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [112] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [149] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [186] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [223] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [260] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [297] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [334] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [371] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [408] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [445] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [482] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
#library(naniar)
#vis_miss(data) # plots the missing values
A good analyst starts from the research question and knows which variables are needed to answer it. Here we don’t need the date of birth because we already have the age and the age group, so we drop the Date_of_Birth variable.
# You can remove a variable in several ways, such as by column index, but that is fragile; subset() with the column name is recommended
data <- subset(data, select = -Date_of_Birth)
names(data) # lists all column names, now without Date_of_Birth
## [1] "Province" "District" "Sector" "Teacher_Name"
## [5] "Teacher_ID" "Education_Level" "School_ID" "School_Level"
## [9] "Subject_Taught" "Gender" "Age" "Age_Group"
data[duplicated(data), ]
## [1] Province District Sector Teacher_Name
## [5] Teacher_ID Education_Level School_ID School_Level
## [9] Subject_Taught Gender Age Age_Group
## <0 rows> (or 0-length row.names)
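An empty result like the one above means no row is an exact copy of an earlier row. What duplicated() does when copies exist can be sketched on a made-up data frame (the IDs are hypothetical):

```r
toy <- data.frame(
  Teacher_ID = c("T001", "T002", "T001", "T003"),   # hypothetical IDs; row 3 repeats row 1
  Province   = c("Eastern", "Western", "Eastern", "Western")
)

dup_flags <- duplicated(toy)   # TRUE for rows that repeat an earlier row
print(dup_flags)
print(toy[dup_flags, ])        # show the repeated rows

deduped <- unique(toy)         # keep the first occurrence of each row
print(nrow(deduped))

# Checking a single key column catches repeated IDs even if other columns differ
n_dup_ids <- sum(duplicated(toy$Teacher_ID))
print(n_dup_ids)
```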
# Province distribution
table(data$Province)
##
##     Eastern Kigali City    Northern    Southern     Western
##          93          92         105         107         103
# Education level distribution
table(data$Education_Level)
##
##  A0  A1  A2
## 142 213 145
# Subject taught distribution
table(data$Subject_Taught)
##
##          Biology        Chemistry Computer Science          English
##               48               41               44               45
## Entrepreneurship           French        Geography          History
##               50               52               41               51
##      Kinyarwanda      Mathematics          Physics
##               46               49               33
Purpose: See how teachers are distributed across categories.
# Age summary
summary(data$Age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   22.00   33.00   43.00   44.08   54.00   65.00
# Mean and median age
mean(data$Age, na.rm = TRUE)
## [1] 44.076
median(data$Age, na.rm = TRUE)
## [1] 43
table(data$Province, data$Gender)
##
##               Female Male
##   Eastern         41   52
##   Kigali City     38   54
##   Northern        49   56
##   Southern        49   58
##   Western         41   62
table(data$School_Level, data$Subject_Taught)
##
##                   Biology Chemistry Computer Science English Entrepreneurship
##   Lower Secondary       8         7               11      14                9
##   Pre-primary           4         6                2       2                4
##   Primary              22        18               23      22               30
##   Upper Secondary      14        10                8       7                7
##
##                   French Geography History Kinyarwanda Mathematics Physics
##   Lower Secondary     14        14      14          22           8       9
##   Pre-primary          3         1       6           2           6       4
##   Primary             23        17      20          16          21      10
##   Upper Secondary     12         9      11           6          14      10
prop.table(table(data$Age_Group))
##
## 22-26 27-36 37-46 47-56 57-65
## 0.076 0.238 0.264 0.206 0.216
About half of the teachers (50.2%) are in the 27-46 age range, so the workforce is mainly mid-career, with few very young teachers (7.6% aged 22-26) and a sizeable share nearing retirement (21.6% aged 57-65).
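The shares quoted above can be recomputed from the age-group counts. A short sketch that re-enters the counts from the table(data$Age_Group) output by hand, so it runs without the dataset:

```r
# Counts copied from the table(data$Age_Group) output
counts <- c("22-26" = 38, "27-36" = 119, "37-46" = 132, "47-56" = 103, "57-65" = 108)

pct <- round(100 * counts / sum(counts), 1)   # percentages, one decimal place
print(pct)

mid_career <- sum(pct[c("27-36", "37-46")])   # combined share of the two middle groups
print(mid_career)
```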
Purpose: Get central tendency and spread of ages.
hist(data$Age, main="Age Distribution of Teachers", xlab="Age", col="skyblue")
### Age Groups (Bar Plot)
barplot(table(data$Age_Group), main="Teacher Counts by Age Group", col="lightgreen")
### Subject Distribution
barplot(table(data$Subject_Taught), las=2, main="Teachers per Subject", col="purple")
### Gender Distribution
barplot(table(data$Gender), main="Gender Distribution of Teachers", col=c("pink","lightblue"))
### Boxplot of Age by Province
boxplot(Age ~ Province, data=data, main="Age Distribution by Province", col="lightgray")
### Pie chart of Education Level
pie(table(data$Education_Level), main="Education Level Distribution")
## Using ggplot2 as library for better drawing
ggplot(data, aes(x = Age)) +
  geom_histogram(binwidth = 5, fill = "skyblue", color = "black") +
  labs(title = "Age Distribution of Teachers", x = "Age", y = "Count") +
  theme_minimal()
ggplot(data, aes(x = Age_Group)) +
  geom_bar(fill = "lightgreen", color = "black") +
  labs(title = "Teacher Counts by Age Group", x = "Age Group", y = "Count") +
  theme_minimal()
### Province Distribution
ggplot(data, aes(x = Province)) +
geom_bar(fill = "orange", color = "black") +
labs(title = "Teachers per Province", x = "Province", y = "Count") +
theme_minimal()
### Education with counts
# Frequency table
edu_counts <- table(data$Education_Level)
edu_counts
##
##  A0  A1  A2
## 142 213 145
# Pie chart with counts
pie(edu_counts,
    labels = paste(names(edu_counts), edu_counts),
    main = "Education Level Distribution")
ggplot(data, aes(x = Subject_Taught)) +
  geom_bar(fill = "purple", color = "black") +
  labs(title = "Teachers per Subject", x = "Subject", y = "Count") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
ggplot(data, aes(x = Gender)) +
  geom_bar(fill = c("pink", "lightblue"), color = "black") +
  labs(title = "Gender Distribution of Teachers", x = "Gender", y = "Count") +
  theme_minimal()
### Age by Province (Boxplot)
ggplot(data, aes(x = Province, y = Age)) +
geom_boxplot(fill = "lightgray") +
labs(title = "Age Distribution by Province", x = "Province", y = "Age") +
theme_minimal()