1 What is R?

R is a statistical programming language created at the University of Auckland (New Zealand) by Ross Ihaka and Robert Gentleman. It evolved from the S language developed at Bell Labs. The current R version is R version 4.5.1 (2025-06-13 ucrt).

1.1 Introduction to R and Core Concepts

Welcome to R!

R is a powerful, free environment for statistical computing and graphics.

1.1.1 Why R?

  • Open-source and free

  • Powerful for statistics

  • Excellent visualization (e.g., ggplot2)

1.1.2 Reproducible reports with R Markdown

The RStudio IDE has four panes: Source, Console, Environment/History, and Files/Plots/Packages/Help.

1.1.3 Operators

R can perform calculations in the console.

1.1.4 Arithmetic operators

1 + 1 # addition: +
## [1] 2

1.1.5 Subtraction: -

2 - 5
## [1] -3

1.1.6 Multiplication: *

5 * 6
## [1] 30

1.1.7 Division: /

9 / 3
## [1] 3

1.1.8 Modulus (remainder of a division): %%

6 %% 2
## [1] 0

1.1.9 Exponent: ^ or **

2 ^ 10 # or 2 ** 10
## [1] 1024

1.1.10 Integer division: %/%

1035 %/% 3
## [1] 345
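
Integer division and modulus are two halves of the same division: the quotient times the divisor, plus the remainder, gives back the original number. The sketch below illustrates this with values chosen here purely for illustration (n, d, quotient, and remainder are not names from the course material).

```r
# Integer division and modulus describe the same division.
n <- 17
d <- 5

quotient  <- n %/% d   # how many whole times d fits into n
remainder <- n %% d    # what is left over

quotient                        # 3
remainder                       # 2
quotient * d + remainder == n   # TRUE

# With a negative number, R rounds the quotient down (towards -Inf),
# so the remainder takes the sign of the divisor.
(-7) %/% 3   # -3
(-7) %% 3    # 2
```

A common use of %% is testing whether a number is even: x %% 2 == 0.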

1.2 Logical operators

1.2.1 Less than: <

1 < 0
## [1] FALSE

1.2.2 Less than or equal to: <=

1 <= 1
## [1] TRUE

1.2.3 Greater than: >

4 > 5
## [1] FALSE

1.2.4 Greater than or equal to: >=

3 >= 3
## [1] TRUE

1.2.5 Exactly equal to: ==

"R" == "r"
## [1] FALSE

The equality operator can also compare one element against each element of a vector:

"Species" == c("Sepal.Length", "Sepal.Width", "Petal.Length", 
                   "Petal.Width", "Species")
## [1] FALSE FALSE FALSE FALSE  TRUE

1.2.6 Not equal to: !=

5 != 5
## [1] FALSE

1.3 Negation (NOT): !

Negation changes a TRUE condition to FALSE and a FALSE condition to TRUE.

!TRUE # or !T
## [1] FALSE
!(T & F) # this is TRUE
## [1] TRUE
!(F | T) # is FALSE
## [1] FALSE

1.4 AND: &

TRUE & TRUE
## [1] TRUE
TRUE & FALSE
## [1] FALSE
F & F
## [1] FALSE
F & T
## [1] FALSE

1.5 OR: |

T | T
## [1] TRUE
T | F
## [1] TRUE
F | F
## [1] FALSE
F | T
## [1] TRUE
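
The operators !, & and | also work element by element on longer logical vectors, which is how they are mostly used in data analysis. A minimal sketch, assuming a made-up vector of exam scores (scores, passed, and top_marks are illustrative names):

```r
# Hypothetical exam scores, used only for illustration.
scores <- c(35, 72, 50, 91)

passed    <- scores >= 50   # FALSE TRUE TRUE TRUE
top_marks <- scores >= 90   # FALSE FALSE FALSE TRUE

passed & !top_marks   # passed, but not with top marks
passed | top_marks    # passed or top marks (or both)

any(top_marks)   # TRUE: at least one element is TRUE
all(passed)      # FALSE: not every element is TRUE
sum(passed)      # 3: TRUE counts as 1 and FALSE as 0
```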

1.5.1 Value Matching

R also has built-in functions for matching the elements of a vector. The first is match(). You can check its documentation with help("match") or ?match, which says: match returns a vector of the positions of (first) matches of its first argument in its second.

match("Species", c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width", "Species"))
## [1] 5

The second function, %in%, checks whether a value exists in a given vector (of values).

"Species" %in% c("Sepal.Length", "Sepal.Width", "Petal.Length", 
                   "Petal.Width", "Species")
## [1] TRUE
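
Both functions also accept several values on the left-hand side. The sketch below looks up three names at once, one of which ("Colour") is deliberately absent, to show what each function returns when there is no match. The names cols and wanted are illustrative.

```r
cols   <- c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width", "Species")
wanted <- c("Species", "Petal.Width", "Colour")   # "Colour" is not in cols

positions <- match(wanted, cols)   # position of each match, NA when there is none
present   <- wanted %in% cols      # TRUE/FALSE for each element of wanted

positions          # 5 4 NA
present            # TRUE TRUE FALSE
wanted[!present]   # the values that were not found
```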

2 R objects and assignment

In R we can use <-, = (a single equals sign!) and -> to assign a value to a variable.

A variable name:

  • can begin with a letter or a dot (a leading dot must not be followed by a digit). Ex: a <- 1, 0 -> .a
  • should not contain spaces. Replace a space with _ or a dot (.).
# t trainind <- "r programming"  # fails because of the space: Error: unexpected symbol in "t trainind"
  • can contain numbers. Ex: a1 <- 1.
a <- 5
b <- 6
0 -> .a
a1 = .2

2.1 Data types

In R we have the following data types: numeric, integer, complex, character, logical, raw, and factor.

2.1.1 Numeric/double

Examples of numeric values are 15.5, 505, 38, pi.

q <- 10.7
print(class(q))
## [1] "numeric"
print(typeof(q))
## [1] "double"

2.1.2 Integer

  • Examples: 1L, 5L, 10L, where the letter L declares the number as an integer.
  • Check the class of q <- 5L. What do you see?
q <- 5L
print(class(q))
## [1] "integer"
print(typeof(q))
## [1] "integer"

2.1.3 Complex

An example of a complex number is 3+1i, where i is the imaginary unit and 1 is the imaginary part. Multiplying a real number by 1i turns it into a complex number.

q <- 3+1i
print(class(q))
## [1] "complex"
print(typeof(q))
## [1] "complex"
p1 <- a + 1i*b # a and b are the variables assigned earlier (a = 5, b = 6)
print(p1)
## [1] 5+6i

2.1.4 Character/string

string <- "I am Learning R"
class(string)
## [1] "character"

Remember: R is case-sensitive, so LeaRning is different from Learning.

2.1.5 Logical/Boolean - (TRUE or FALSE)

TRUE # or T
## [1] TRUE
FALSE # or F
## [1] FALSE

A logical value can also be the result of a test. Example: check whether "LeaRning" == "Learning".

"LeaRning" == "Learning"
## [1] FALSE

2.1.6 Raw

text <- "Christian Mugisha."
(raw_text <- charToRaw(text))
##  [1] 43 68 72 69 73 74 69 61 6e 20 4d 75 67 69 73 68 61 2e
class(raw_text)
## [1] "raw"

Converting raw to text:

rawToChar(raw_text)
## [1] "Christian Mugisha."

2.1.7 Factors

Factors are a data type for qualitative (categorical) variables such as colors, good/bad, or course and movie ratings. They are useful in statistical modeling.

Gender <- factor(c("Female", "Male"))
print(Gender)
## [1] Female Male  
## Levels: Female Male
class(Gender)
## [1] "factor"

2.1.8 Logical

v <- TRUE
w <- FALSE

class(v); typeof(v)
## [1] "logical"
## [1] "logical"
!v
## [1] FALSE
isTRUE(w)
## [1] FALSE

3 Creating objects

  • Numeric object
t <- 10
x <- numeric(t) # creates a numeric object of size t
print(x)
##  [1] 0 0 0 0 0 0 0 0 0 0
# assigning values to x:
x[1] <- 2.5
print(x)
##  [1] 2.5 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
  • Integer
n <- 5
x <- integer(n) # creates an integer vector of length n
print(x)
## [1] 0 0 0 0 0
class(x)
## [1] "integer"
# assigning values to x:
x[1] <- 2.5 # R will automatically convert integer to numeric
class(x)
## [1] "numeric"
print(x)
## [1] 2.5 0.0 0.0 0.0 0.0

3.1 Conversions

To convert data in R, we can use the base functions whose names start with as. followed by the target data type (for example, as.numeric() or as.character()).

  • Numeric to character

  • Character to numeric

  • Factor to character

  • Character to factor
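
The four conversions listed above can be sketched as follows. The values and variable names (num, f, years) are made up for illustration. The last example shows a common pitfall: as.numeric() applied directly to a factor returns the internal level codes rather than the labels.

```r
# Numeric to character
num <- 3.14
as.character(num)   # "3.14"

# Character to numeric
as.numeric("42")                            # 42
suppressWarnings(as.numeric("forty-two"))   # NA: this text is not a number

# Factor to character
f <- factor(c("low", "high", "low"))
as.character(f)   # "low" "high" "low"

# Character to factor
as.factor(c("yes", "no", "yes"))

# Pitfall: a factor whose labels look like numbers
years <- factor(c("2020", "2015", "2020"))
as.numeric(years)                 # 2 1 2 (level codes, not the years)
as.numeric(as.character(years))   # 2020 2015 2020
```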

3.2 Scalars and vectors (1D):

  • A scalar is a single number, whether natural (N), integer (Z), decimal (D), rational (Q), real (R), or complex (C, as used in quantum mechanics).

  • Vectors: a collection of elements of the same type. A vector can also be a sequence. Let's create a vector with elements of different types to see how R deals with them.

  • Numerics and characters

 # ?c
v <- c(1, "R", T, FALSE, NA)
# print v
print(v)
## [1] "1"     "R"     "TRUE"  "FALSE" NA
# what is the class of v?
class(v)
## [1] "character"

R converts every element to character. NA is the exception: it stays a missing value, because NA exists for every type.

  • Numeric and logical
v2 <- c(1, 4, 8, FALSE, TRUE, FALSE, FALSE, TRUE, "R" == "r")
print(v2)
## [1] 1 4 8 0 1 0 0 1 0
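
When types are mixed, R coerces every element to the most flexible type present, in the order logical, integer, double, complex, character. The short check below illustrates that order; the vectors are arbitrary examples.

```r
# Each pair mixes two types; class() shows which one wins.
class(c(TRUE, 2L))    # "integer":   logical becomes integer
class(c(2L, 2.5))     # "numeric":   integer becomes double
class(c(2.5, 1i))     # "complex":   double becomes complex
class(c(2.5, "a"))    # "character": everything becomes text

# TRUE becomes 1 and FALSE becomes 0 when mixed with numbers
c(10, TRUE, FALSE)    # 10 1 0
```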

3.3 Vector Arithmetic

Arithmetic in R is vectorized: you can perform a calculation on all elements of a vector at once. Operations happen element by element.

a <- c(5, 6, 7)
b <- c(10, 20, 30)

# Addition (adds 1st to 1st, 2nd to 2nd, etc.)
a + b
## [1] 15 26 37
# Multiplication
a * b
## [1]  50 120 210

3.3.1 Recycling Rule

If one vector is shorter than the other, R will recycle (repeat) elements of the shorter vector until it matches the length of the longer one.

a <- c(1, 2, 3, 4)
b <- c(10, 20)

a + b  
## [1] 11 22 13 24
# First recycle b: (10, 20, 10, 20)
# Result: (1+10, 2+20, 3+10, 4+20)
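
Recycling also happens when the longer length is not a multiple of the shorter one, but R then issues a warning. The sketch below catches that warning so the message can be seen next to the result; the vector names odd and short are illustrative.

```r
short <- c(10, 20)
odd   <- c(1, 2, 3)

# Lengths 3 and 2 are not multiples of each other: R still recycles,
# but warns that the longer length is not a multiple of the shorter one.
res <- withCallingHandlers(
  odd + short,
  warning = function(w) {
    message("Warning caught: ", conditionMessage(w))
    invokeRestart("muffleWarning")   # continue after reporting the warning
  }
)
res   # 11 22 13

# A length-1 vector is recycled to every element, with no warning.
odd * 2   # 2 4 6
```

Such a warning usually signals a mistake in the data, so it is worth checking the lengths with length() rather than ignoring it.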

3.3.2 Accessing Elements

You can use square brackets [ ] to select elements from a vector.

# Get the first item in v2
v2[1]
## [1] 1
# Get the 2nd and 4th in v2
v2[c(2, 4)]
## [1] 4 0
# Exclude the 3rd element
v2[-3]
## [1] 1 4 0 1 0 0 1 0
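
Square brackets also accept a logical vector, which keeps only the elements where the condition is TRUE. A minimal sketch, assuming a made-up vector of temperatures (temps and is_warm are illustrative names):

```r
temps <- c(18.5, 24.1, 30.2, 15.0, 27.8)   # hypothetical temperatures

is_warm <- temps > 25   # FALSE FALSE TRUE FALSE TRUE
temps[is_warm]          # 30.2 27.8
temps[temps > 25]       # the same selection in one step
which(temps > 25)       # 3 5: the positions rather than the values
```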

3.3.3 Modifying Vectors

# Change the first element of v2
v2[1] <- 27
print(v2)
## [1] 27  4  8  0  1  0  0  1  0
# Add a new number (creates a longer vector)
v21 <- c(v2, 81)

v21
##  [1] 27  4  8  0  1  0  0  1  0 81
📘 Exercise 1.2

Q1 Create Variables

Create three variables:

  1. my_name containing your name as a character string.
  2. my_age containing your age as a number.
  3. is_statistician with a logical value (TRUE or FALSE).

Print out the class of each variable.

Q2: Vector Operations

  1. Create a vector named expenses with the values: 1500, 2000, 1200, 3000.
  2. Create another vector named income with the value 10000.
  3. Calculate your total savings by subtracting the sum of expenses from income (hint: use sum()).

Q3

  1. Create a numeric vector sales_q1 with the values: 120, 150, 90.

  2. Create another numeric vector sales_q2 with the values: 130, 160, 95.

  3. Calculate the total sales for each store by adding the two vectors.

  4. Calculate the percentage increase from Q1 to Q2: ((Q2 - Q1) / Q1) * 100

  5. Extract only the results for stores with an increase greater than 10%.

💡 Tip

Vectors are everywhere in R — even a single number like 42 is technically a vector of length 1.

4 Data Frames and Data Import/Export

4.1 Data Frames

The most important data structure for data analysis in R is the data frame. A data frame is like:

  • A spreadsheet in Excel

  • A table in a database

  • A tibble in tidyverse (we’ll see later)

It is a two-dimensional table:

  • Rows = observations/records

  • Columns = variables/features

  • Each column can have a different type (numeric, character, logical, etc.)

4.2 Creating a Data Frame

You can create a data frame using the data.frame() function.

# Create a simple data frame
employee_data <- data.frame(
  id     = c(1, 2, 3),                   # Numeric column
  name   = c("John", "Jane", "Peter"),   # Character column
  salary = c(50000, 55000, 52000)        # Numeric column
)

employee_data
##   id  name salary
## 1  1  John  50000
## 2  2  Jane  55000
## 3  3 Peter  52000

4.2.1 Checking Structure and Summary

# View structure
str(employee_data)
## 'data.frame':    3 obs. of  3 variables:
##  $ id    : num  1 2 3
##  $ name  : chr  "John" "Jane" "Peter"
##  $ salary: num  50000 55000 52000
# View summary statistics
summary(employee_data)
##        id          name               salary     
##  Min.   :1.0   Length:3           Min.   :50000  
##  1st Qu.:1.5   Class :character   1st Qu.:51000  
##  Median :2.0   Mode  :character   Median :52000  
##  Mean   :2.0                      Mean   :52333  
##  3rd Qu.:2.5                      3rd Qu.:53500  
##  Max.   :3.0                      Max.   :55000

4.2.2 Accessing Data

There are several ways to access parts of a data frame:

# Access a column by $ name
employee_data$name
## [1] "John"  "Jane"  "Peter"
# Access by index (row, column)
employee_data[1, 2]     # Row 1, column 2
## [1] "John"
employee_data[ , "salary"]  # All rows, salary column
## [1] 50000 55000 52000
# Multiple columns
employee_data[ , c("name", "salary")]
##    name salary
## 1  John  50000
## 2  Jane  55000
## 3 Peter  52000
# Multiple rows
employee_data[1:2, ]
##   id name salary
## 1  1 John  50000
## 2  2 Jane  55000
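
Rows can also be selected with a condition instead of row numbers. The sketch below rebuilds the original three-row employee_data so that it runs on its own; high_paid is an illustrative name.

```r
employee_data <- data.frame(
  id     = c(1, 2, 3),
  name   = c("John", "Jane", "Peter"),
  salary = c(50000, 55000, 52000)
)

# Keep the rows where the condition is TRUE; the empty slot after the
# comma means "all columns".
high_paid <- employee_data[employee_data$salary > 51000, ]
high_paid

# Combine a row condition with a column selection
employee_data[employee_data$salary > 51000, "name"]   # "Jane" "Peter"

# subset() expresses the same filter without repeating the data frame name
subset(employee_data, salary > 51000, select = c(name, salary))
```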

4.2.3 Adding Columns and Rows

# Add a new column
employee_data$department <- c("HR", "Finance", "IT")

# Add a new row
new_row <- data.frame(id=4, name="Alice", salary=60000, department="Marketing")
employee_data <- rbind(employee_data, new_row)
employee_data
##   id  name salary department
## 1  1  John  50000         HR
## 2  2  Jane  55000    Finance
## 3  3 Peter  52000         IT
## 4  4 Alice  60000  Marketing

5 Importing Real Data

In practice, we don’t type all our data by hand — we read it from files:

# Read CSV file
#my_data <- read.csv("Data.csv")

# Read Excel file
# install.packages("readxl")
library(readxl)
#my_data <- read_excel("Data.xlsx")
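
Exporting works the same way in reverse. The sketch below writes a small data frame to a CSV file and reads it back. It uses tempfile() so that it runs anywhere; the names employees, csv_path, and employees_back are illustrative, and in practice you would replace the temporary path with your own file name.

```r
# A small data frame to export (values are illustrative)
employees <- data.frame(
  id     = 1:3,
  name   = c("John", "Jane", "Peter"),
  salary = c(50000, 55000, 52000)
)

# tempfile() returns a throwaway path; replace it with your own file name
csv_path <- tempfile(fileext = ".csv")

# row.names = FALSE stops R from writing the row numbers as an extra column
write.csv(employees, csv_path, row.names = FALSE)

# Read the file back and compare with the original
employees_back <- read.csv(csv_path)
str(employees_back)
identical(employees$name, employees_back$name)   # TRUE
```
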
📘 Exercise 2

Q1 Create a data frame named students with columns:

  • id (1 to 5)

  • name (five student names)

  • grade (five numeric grades)

Q2. Extract the grades of the first three students.

Q3. Add a new column pass that is TRUE if grade ≥ 50, otherwise FALSE.

Q4. Add a new row for a sixth student.

5.1 Installing and Loading Packages

One of R’s biggest strengths is its community-contributed packages. A package is like an app for R — it contains:

  • Functions (tools to do specific tasks)

  • Data sets (ready-to-use examples)

  • Documentation (help files and guides)

5.1.1 Step 1: Installing a Package

Before using a package, you install it once on your computer (like downloading an app from an app store).

# Install a package (done only once, unless you update R or reinstall)

#install.packages("haven")
#install.packages("tidyverse")
💡 Tip

Every time you start a new R session (or reopen RStudio), you must load the package again before you can use it.

# Load packages for use in this session

library(haven)       # For reading Stata, SPSS, SAS files
library(tidyverse)   # For data manipulation & visualization

5.1.2 Why haven and tidyverse?

  • haven: Reads data from Stata (.dta), SPSS, and SAS files while preserving labels.

  • tidyverse: A collection of packages for data manipulation (dplyr), visualization (ggplot2), data tidying (tidyr), and more.

5.1.3 Checking if a Package is Installed

# Check installed packages
# installed.packages()

# Or quickly check one package
"haven" %in% rownames(installed.packages())
## [1] TRUE
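
An alternative check is requireNamespace(), which returns TRUE or FALSE without attaching the package, so it can be used inside a script to decide what to do next. A minimal sketch (the variable pkg is illustrative):

```r
# requireNamespace() returns TRUE when the package can be loaded and
# FALSE otherwise, without attaching it to the session.
pkg <- "haven"

if (requireNamespace(pkg, quietly = TRUE)) {
  message(pkg, " is installed")
} else {
  message(pkg, " is not installed; run install.packages(\"", pkg, "\") first")
}

# "stats" ships with every R installation, so this check is always TRUE
requireNamespace("stats", quietly = TRUE)
```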

Updating Packages

# Update all installed packages

# update.packages()
📘 Exercise 3
  • Install the package readxl (for reading Excel files).

  • Load it into your session.

  • Check if the package ggplot2 is already installed on your system.

💡 Tip

If you get the error “there is no package called …”, it means you need to install it first.

5.2 Importing Data

One of the first steps in any data analysis project is getting your data into R. R can read many file formats: .csv, .xlsx, .dta, .sav, .json, and more.

In this course, we’ll start with a CSV (.csv) file.

We’ll use read.csv() from base R to import it; read_csv() from the readr package (loaded with tidyverse) is an alternative.

5.2.1 Importing csv data

library(dplyr)
data <- read.csv("C:\\Users\\HP\\Desktop\\R PROGRAMING\\rwanda_teachers_500.csv")  # read the CSV file (adjust the path to where your copy is saved)
head(data, 5) # display the first 5 rows
##      Province   District  Sector      Teacher_Name Teacher_ID Education_Level
## 1 Kigali City     Gasabo   Rongi   Aline Munyaneza  T25000138              A2
## 2    Southern   Gisagara  Remera Noella Nkurunziza  T25000239              A1
## 3     Western Nyamasheke Gahanga  Leodomir Murerwa  T25000345              A0
## 4    Northern    Gicumbi Gatenga  Sophie Niyonzima  T25000423              A2
## 5 Kigali City Nyarugenge Gatsibo   Tijara Habimana  T25000558              A2
##   School_ID School_Level   Subject_Taught Date_of_Birth Gender
## 1  SCH00001      Primary        Geography    1998-04-15   Male
## 2  SCH00002  Pre-primary           French    2002-08-21 Female
## 3  SCH00003  Pre-primary      Kinyarwanda    1998-10-19 Female
## 4  SCH00004  Pre-primary        Chemistry    1982-11-11 Female
## 5  SCH00005      Primary Entrepreneurship    1983-05-04 Female
# After loading the data, inspect its structure first
str(data)
## 'data.frame':    500 obs. of  11 variables:
##  $ Province       : chr  "Kigali City" "Southern" "Western" "Northern" ...
##  $ District       : chr  "Gasabo" "Gisagara" "Nyamasheke" "Gicumbi" ...
##  $ Sector         : chr  "Rongi" "Remera" "Gahanga" "Gatenga" ...
##  $ Teacher_Name   : chr  "Aline Munyaneza" "Noella Nkurunziza" "Leodomir Murerwa" "Sophie Niyonzima" ...
##  $ Teacher_ID     : chr  "T25000138" "T25000239" "T25000345" "T25000423" ...
##  $ Education_Level: chr  "A2" "A1" "A0" "A2" ...
##  $ School_ID      : chr  "SCH00001" "SCH00002" "SCH00003" "SCH00004" ...
##  $ School_Level   : chr  "Primary" "Pre-primary" "Pre-primary" "Pre-primary" ...
##  $ Subject_Taught : chr  "Geography" "French" "Kinyarwanda" "Chemistry" ...
##  $ Date_of_Birth  : chr  "1998-04-15" "2002-08-21" "1998-10-19" "1982-11-11" ...
##  $ Gender         : chr  "Male" "Female" "Female" "Female" ...

You have seen that the data has 500 rows and 11 columns, and that every column is currently character (chr). That is expected here, because read.csv() reads text columns as character, but we should:

  • turn Date_of_Birth into a real Date

  • turn categories (Province, District, …) into factors. These columns hold labels rather than free text, and factors make summaries, plots, and models treat them as categories.

  • quickly check duplicates, missing values, and simple counts

5.2.2 Data Cleaning

5.2.2.1 Converting each variable to its proper data type

## Convert Date_of_Birth to the Date type
# The data frame is called data
data$Date_of_Birth <- as.Date(data$Date_of_Birth, format = "%Y-%m-%d")
data$Subject_Taught <- as.factor(data$Subject_Taught) # summary() or table() then counts teachers per subject
data$Sector         <- as.factor(data$Sector) # summary() or table() then counts teachers per sector

# Quick check
summary(data$Subject_Taught)
##          Biology        Chemistry Computer Science          English 
##               48               41               44               45 
## Entrepreneurship           French        Geography          History 
##               50               52               41               51 
##      Kinyarwanda      Mathematics          Physics 
##               46               49               33
summary(data$Sector)
##    Bigogwe   Bugarama    Bumbogo     Busoro  Bwishyura    Gahanga   Gashonga 
##         10          5          8         15          5          8          7 
##    Gashora    Gatenga    Gatsibo   Gihundwe     Gikoma    Gikondo    Gikonko 
##          5          8         12          5          7          5          9 
##     Gisozi     Jabana       Jali      Jenda  Kabarondo   Kabarore    Kabatwa 
##         12          4          9          5         13          6         11 
##    Kacyiru    Kanombe     Karama  Karangazi   Kibilizi   Kigabiro   Kigarama 
##          6          9         11         17         12          7         17 
##  Kimironko   Kinyinya     Kitabi     Kiyovu    Kiyumba   Kiziguro      Mamba 
##          3         10          6          7          5          5         11 
##     Masaka    Matimba   Mugesera  Mukarange     Mukura    Murunda      Musha 
##          6          6          5          6          8         10          8 
##      Ndera     Ntyazo  Nyamabuye    Nyamata Nyamirambo     Remera     Rilima 
##          8          8          8          7         10         10         11 
##      Rongi  Rubengera   Ruhashya   Rusororo  Rwinkwavu       Save     Shangi 
##         12          6         15          9          6          7         11 
##    Shyogor    Shyogwe      Tumba 
##          8         14          6
str(data$Date_of_Birth)
##  Date[1:500], format: "1998-04-15" "2002-08-21" "1998-10-19" "1982-11-11" "1983-05-04" ...
## Convert the date of birth into age (years)
# Create a new variable Age, stored as an integer number of completed years
data$Age <- as.integer(floor((Sys.Date() - data$Date_of_Birth) / 365.25))

Why divide by 365.25?

In every four-year cycle:

  • 3 years have 365 days

  • 1 year (the leap year) has 366 days

On average: (3 × 365 + 366) / 4 = 365.25 days per year.

So dividing by 365.25 accounts for leap years and gives a close approximation when converting days to years.
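
The 365.25 rule is an approximation, and it can be off by one on or near a birthday. The sketch below compares it with an exact calculation in base R. The helper age_in_years() is a hypothetical function written for this illustration, not part of base R or of the course code, and the dates are made up.

```r
# Hypothetical helper: age in completed years on a given day.
age_in_years <- function(dob, today = Sys.Date()) {
  dob   <- as.POSIXlt(dob)
  today <- as.POSIXlt(today)
  age <- today$year - dob$year
  # Has this year's birthday already happened (month first, then day)?
  birthday_passed <- (today$mon > dob$mon) |
    (today$mon == dob$mon & today$mday >= dob$mday)
  age - ifelse(birthday_passed, 0L, 1L)
}

dob   <- as.Date("2000-06-15")
today <- as.Date("2025-06-15")   # exactly the 25th birthday

age_in_years(dob, today)                    # 25
floor(as.numeric(today - dob) / 365.25)     # 24: 9131 days / 365.25 is just under 25
age_in_years(dob, as.Date("2025-06-14"))    # 24: the day before the birthday
```

For summaries such as age groups the difference rarely matters, but it does when an exact age decides eligibility.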

## Convert education level, province, district, and school level to factors
# The data frame is called data
data$Education_Level <- as.factor(data$Education_Level)
data$Province        <- as.factor(data$Province)
data$District        <- as.factor(data$District)
data$School_Level    <- as.factor(data$School_Level)

# Check the structure
str(data)
## 'data.frame':    500 obs. of  12 variables:
##  $ Province       : Factor w/ 5 levels "Eastern","Kigali City",..: 2 4 5 3 2 3 2 2 1 1 ...
##  $ District       : Factor w/ 30 levels "Bugesera","Burera",..: 4 7 21 6 23 27 12 23 19 16 ...
##  $ Sector         : Factor w/ 59 levels "Bigogwe","Bugarama",..: 50 48 6 9 10 28 16 44 49 21 ...
##  $ Teacher_Name   : chr  "Aline Munyaneza" "Noella Nkurunziza" "Leodomir Murerwa" "Sophie Niyonzima" ...
##  $ Teacher_ID     : chr  "T25000138" "T25000239" "T25000345" "T25000423" ...
##  $ Education_Level: Factor w/ 3 levels "A0","A1","A2": 3 2 1 3 3 2 3 3 1 2 ...
##  $ School_ID      : chr  "SCH00001" "SCH00002" "SCH00003" "SCH00004" ...
##  $ School_Level   : Factor w/ 4 levels "Lower Secondary",..: 3 2 2 2 3 1 3 3 4 3 ...
##  $ Subject_Taught : Factor w/ 11 levels "Biology","Chemistry",..: 7 6 9 2 5 4 5 5 11 5 ...
##  $ Date_of_Birth  : Date, format: "1998-04-15" "2002-08-21" ...
##  $ Gender         : chr  "Male" "Female" "Female" "Female" ...
##  $ Age            : int  27 23 27 43 42 26 44 41 63 32 ...

Before plotting age distributions, it is best practice to create age groups. To determine appropriate group boundaries, first check the minimum and maximum ages in the dataset using the summary() function. This function provides descriptive statistics for a single variable. When working with the entire dataset or grouped data frames, use summarise() from the dplyr package to generate customized summaries.

5.2.2.2 Forming age groups

5.2.2.3 Checking the minimum and maximum age

summary(data$Age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   22.00   33.00   43.00   44.08   54.00   65.00

So the minimum age is 22 and the maximum is 65; the group boundaries must cover this range.

# Define age groups
data$Age_Group <- cut(
  data$Age,
  breaks = c(22, 27, 37, 47, 57, 66),   # upper bound is exclusive, so use 66 to include 65
  labels = c("22-26", "27-36", "37-46", "47-56", "57-65"),
  right = FALSE   # ensures intervals are [22,27), [27,37), etc.
)

# Check the distribution of groups
table(data$Age_Group)
## 
## 22-26 27-36 37-46 47-56 57-65 
##    38   119   132   103   108
prop.table(table(data$Age_Group))  # proportions (multiply by 100 for percentages)
## 
## 22-26 27-36 37-46 47-56 57-65 
## 0.076 0.238 0.264 0.206 0.216
# Sanity checks
summary(data$Age)                  # Min/Qu/Median/Mean/Max
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   22.00   33.00   43.00   44.08   54.00   65.00
sum(is.na(data$Age_Group))         # should be 0 with ages 22–65
## [1] 0
levels(data$Age_Group)             # confirm label order
## [1] "22-26" "27-36" "37-46" "47-56" "57-65"
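
To see how right = FALSE treats the boundaries, it helps to run cut() on a few ages that sit exactly on them. The sketch below uses the same breaks and labels as above on a made-up vector (ages and groups are illustrative names).

```r
ages <- c(22, 26, 27, 56, 57, 65, 66)   # test values placed on the boundaries

groups <- cut(
  ages,
  breaks = c(22, 27, 37, 47, 57, 66),
  labels = c("22-26", "27-36", "37-46", "47-56", "57-65"),
  right  = FALSE
)

data.frame(age = ages, group = groups)
# 22 and 26 fall in "22-26", 27 in "27-36", 56 in "47-56", 57 and 65 in "57-65"
# 66 becomes NA, because the last interval [57, 66) excludes its upper bound
```

An NA in the result is therefore a sign that some value lies outside the breaks.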

5.2.3 Checking missing values

any(is.na(data))  #Check if any missing values exist
## [1] FALSE
sum(is.na(data))  #Count total missing values
## [1] 0
colSums(is.na(data))  # Missing values per column
##        Province        District          Sector    Teacher_Name      Teacher_ID 
##               0               0               0               0               0 
## Education_Level       School_ID    School_Level  Subject_Taught   Date_of_Birth 
##               0               0               0               0               0 
##          Gender             Age       Age_Group 
##               0               0               0
rowSums(is.na(data))  # Missing values per row
##   [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
##  [38] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
##  [75] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [112] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [149] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [186] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [223] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [260] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [297] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [334] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [371] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [408] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [445] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [482] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
#library(naniar)
#vis_miss(data)   # plots the missing values

A good analyst lets the research question decide which variables are needed. Here we no longer need the date of birth, because we already have Age and Age_Group, so we drop the Date_of_Birth variable.

# You can remove a variable in several ways, such as by column index, but subset() is easier to read and less error-prone
data <- subset(data, select = -Date_of_Birth)
names(data) # lists all column names, now without Date_of_Birth
##  [1] "Province"        "District"        "Sector"          "Teacher_Name"   
##  [5] "Teacher_ID"      "Education_Level" "School_ID"       "School_Level"   
##  [9] "Subject_Taught"  "Gender"          "Age"             "Age_Group"

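5.2.4 Checking duplicates

# any(duplicated(data)) returns TRUE if at least one row is an exact duplicate;
# the line below lists the duplicated rows (none here)
data[duplicated(data), ]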
##  [1] Province        District        Sector          Teacher_Name   
##  [5] Teacher_ID      Education_Level School_ID       School_Level   
##  [9] Subject_Taught  Gender          Age             Age_Group      
## <0 rows> (or 0-length row.names)
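
Because the teachers data has no duplicates, the result above is empty. The sketch below shows what the same checks return on a toy data frame that does contain a repeated row; teachers and teachers_unique are illustrative names and the values are made up.

```r
# Toy data with one repeated row (values are illustrative)
teachers <- data.frame(
  Teacher_ID = c("T001", "T002", "T002", "T003"),
  Subject    = c("Physics", "French", "French", "History")
)

duplicated(teachers)               # FALSE FALSE TRUE FALSE: only the second copy is flagged
any(duplicated(teachers))          # TRUE
teachers[duplicated(teachers), ]   # the duplicated row(s)

# Keep the first occurrence of each row
teachers_unique <- teachers[!duplicated(teachers), ]
nrow(teachers_unique)              # 3

# Duplicates can also be checked on a key column only
any(duplicated(teachers$Teacher_ID))   # TRUE
```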

5.3 Frequency Counts for Categorical Variables

# Province distribution
table(data$Province)
## 
##     Eastern Kigali City    Northern    Southern     Western 
##          93          92         105         107         103
# Education level distribution
table(data$Education_Level)
## 
##  A0  A1  A2 
## 142 213 145
# Subject taught distribution
table(data$Subject_Taught)
## 
##          Biology        Chemistry Computer Science          English 
##               48               41               44               45 
## Entrepreneurship           French        Geography          History 
##               50               52               41               51 
##      Kinyarwanda      Mathematics          Physics 
##               46               49               33

Purpose: See how teachers are distributed across categories.

5.4 Summary Statistics for Numeric Variables

# Age summary
summary(data$Age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   22.00   33.00   43.00   44.08   54.00   65.00
# Mean and median age
mean(data$Age, na.rm = TRUE)
## [1] 44.076
median(data$Age, na.rm = TRUE)
## [1] 43

5.5 Cross‑Tabulations

5.5.1 Teachers by Province × Gender:

table(data$Province, data$Gender)
##              
##               Female Male
##   Eastern         41   52
##   Kigali City     38   54
##   Northern        49   56
##   Southern        49   58
##   Western         41   62
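
A cross-tabulation of counts is often easier to read with totals or row percentages added. The sketch below uses addmargins() and prop.table() on a small made-up example rather than the teachers data; province, gender, and counts are illustrative names.

```r
# Toy data (illustrative values, not the teachers dataset)
province <- c("Eastern", "Eastern", "Western", "Western", "Western", "Eastern")
gender   <- c("Female", "Male", "Male", "Male", "Female", "Male")

counts <- table(province, gender)
counts

addmargins(counts)                                # adds row and column totals ("Sum")
prop.table(counts, margin = 1)                    # row proportions: each province sums to 1
round(100 * prop.table(counts, margin = 1), 1)    # the same as percentages
```

Use margin = 2 instead to get proportions within each column (here, within each gender).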

5.5.2 Teachers by School_Level × Subject_Taught:

table(data$School_Level, data$Subject_Taught)
##                  
##                   Biology Chemistry Computer Science English Entrepreneurship
##   Lower Secondary       8         7               11      14                9
##   Pre-primary           4         6                2       2                4
##   Primary              22        18               23      22               30
##   Upper Secondary      14        10                8       7                7
##                  
##                   French Geography History Kinyarwanda Mathematics Physics
##   Lower Secondary     14        14      14          22           8       9
##   Pre-primary          3         1       6           2           6       4
##   Primary             23        17      20          16          21      10
##   Upper Secondary     12         9      11           6          14      10

5.6 Proportions and Percentages

  • Relative frequencies:
prop.table(table(data$Age_Group))
## 
## 22-26 27-36 37-46 47-56 57-65 
## 0.076 0.238 0.264 0.206 0.216

About half of the teachers (50.2%) are in the 27–46 age range, showing that the workforce is mainly mid-career, with fewer very young teachers (22–26) and a moderate number nearing retirement (57–65).

Purpose: See what share of teachers falls into each age group.

5.7 Basic Plots

5.7.1 Age Distribution

hist(data$Age, main="Age Distribution of Teachers", xlab="Age", col="skyblue")

Age Groups (Bar Plot)

barplot(table(data$Age_Group), main="Teacher Counts by Age Group", col="lightgreen")

Subject Distribution

barplot(table(data$Subject_Taught), las=2, main="Teachers per Subject", col="purple")

Gender Distribution

barplot(table(data$Gender), main="Gender Distribution of Teachers", col=c("pink","lightblue"))

Boxplot of Age by Province

boxplot(Age ~ Province, data=data, main="Age Distribution by Province", col="lightgray")

Pie Chart of Education Level

pie(table(data$Education_Level), main="Education Level Distribution")

Using ggplot2 for more polished plots

5.7.2 Load ggplot2 with library(ggplot2)

5.7.3 Age Distribution (Histogram)

ggplot(data, aes(x = Age)) +
  geom_histogram(binwidth = 5, fill = "skyblue", color = "black") +
  labs(title = "Age Distribution of Teachers", x = "Age", y = "Count") +
  theme_minimal()

5.7.4 Age Groups (Bar Plot)

ggplot(data, aes(x = Age_Group)) +
  geom_bar(fill = "lightgreen", color = "black") +
  labs(title = "Teacher Counts by Age Group", x = "Age Group", y = "Count") +
  theme_minimal()

Province Distribution

ggplot(data, aes(x = Province)) +
  geom_bar(fill = "orange", color = "black") +
  labs(title = "Teachers per Province", x = "Province", y = "Count") +
  theme_minimal()

Education Level with Counts

# Frequency table
edu_counts <- table(data$Education_Level)
edu_counts
## 
##  A0  A1  A2 
## 142 213 145
# Pie chart with counts
pie(edu_counts,
    labels = paste(names(edu_counts), edu_counts),
    main = "Education Level Distribution")

5.7.5 Subject Distribution

ggplot(data, aes(x = Subject_Taught)) +
  geom_bar(fill = "purple", color = "black") +
  labs(title = "Teachers per Subject", x = "Subject", y = "Count") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

5.7.6 Gender Distribution

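# Map fill to Gender so each bar gets its colour by name, not by position
ggplot(data, aes(x = Gender, fill = Gender)) +
  geom_bar(color = "black", show.legend = FALSE) +
  scale_fill_manual(values = c(Female = "pink", Male = "lightblue")) +
  labs(title = "Gender Distribution of Teachers", x = "Gender", y = "Count") +
  theme_minimal()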

Age by Province (Boxplot)

ggplot(data, aes(x = Province, y = Age)) +
  geom_boxplot(fill = "lightgray") +
  labs(title = "Age Distribution by Province", x = "Province", y = "Age") +
  theme_minimal()