Executive Summary

This analysis follows the CRISP-DM (Cross-Industry Standard Process for Data Mining) methodology to investigate factors affecting mathematics performance in selected schools in Ekiti State. I examined six research questions using descriptive and inferential statistics to understand the relationships between continuous assessment frequency, demographic factors, and mathematics unified examination scores.

1. Business Understanding

1.1 Project Objectives

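The primary objective of this analysis is to understand the factors that influence mathematics performance in the selected Ekiti State schools, specifically continuous assessment frequency, gender, school location, and school type.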

1.2 Research Questions and Hypotheses

Research Question 1 (RQ1): Continuous Assessment Frequency

Question: How often do Continuous Assessments occur in mathematics classes? Purpose: To quantify and describe the distribution of continuous assessment frequency across the dataset.

Research Question 2 (RQ2): Mathematics Performance Overview

Question: What is the overall distribution and characteristics of Mathematics Unified Examination scores? Purpose: To provide descriptive statistics and understand the central tendencies and variability in mathematics performance.

Research Question 3 (RQ3): Impact of Assessment Frequency

Question: Does the frequency of Continuous Assessments affect Unified Examination scores? Null Hypothesis (H₀₃): Mean UE scores are equal across the two CA frequency groups (two versus three assessments). Alternative Hypothesis (H₁₃): Mean UE scores differ significantly between the two CA frequency groups. Statistical Test: Independent samples t-test (α = 0.05)

Research Question 4 (RQ4): Gender Differences

Question: Are there significant gender differences in Mathematics Unified Examination scores? Null Hypothesis (H₀₄): Mean UE scores are equal between male and female students. Alternative Hypothesis (H₁₄): Mean UE scores differ significantly between genders. Statistical Test: Independent samples t-test (α = 0.05)

Research Question 5 (RQ5): Location Effects

Question: Do students from urban and rural locations perform differently on Mathematics Unified Examinations? Null Hypothesis (H₀₅): Mean UE scores are equal between urban and rural students. Alternative Hypothesis (H₁₅): Mean UE scores differ significantly between locations. Statistical Test: Independent samples t-test (α = 0.05)

Research Question 6 (RQ6): School Type Impact

Question: Is there a significant difference in Mathematics performance between public and private schools? Null Hypothesis (H₀₆): Mean UE scores are equal between public and private schools. Alternative Hypothesis (H₁₆): Mean UE scores differ significantly between school types. Statistical Test: Independent samples t-test (α = 0.05)

Data preprocessing

Loading important libraries and the data

# Load necessary libraries for data manipulation, visualization, and reading Excel files.
library(readxl)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.2     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.4     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(knitr)
library(psych) 
## 
## Attaching package: 'psych'
## 
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha
file_path <- "C:/Users/MUSAAB-TECH/OneDrive/Ado/merged_school_data.xlsx"
school_data <- read_excel(file_path)

# Rename columns 
school_data <- school_data %>%
  rename(
    school_type = `School Type`,
    location = `School Location`,
    gender = GENDER,
    UE_score = `MATH SCORE`,
    CA_frequency = `FREQUENCY OF CONTINOUS ASSESSMENT`,
    Student_No= `STUDENT S/N` 
    
  )

# Display the first few rows to verify it loaded correctly
head(school_data)
## # A tibble: 6 × 9
##   Student_No Schools  `School Names`  school_type location gender UE_score   AGE
##        <dbl> <chr>    <chr>           <chr>       <chr>    <chr>     <dbl> <dbl>
## 1          1 School A Muslim Grammar… Public      Urban    F            66    15
## 2          2 School A Muslim Grammar… Public      Urban    M            67    18
## 3          3 School A Muslim Grammar… Public      Urban    F            71    15
## 4          4 School A Muslim Grammar… Public      Urban    M            69    15
## 5          5 School A Muslim Grammar… Public      Urban    M            67    18
## 6          6 School A Muslim Grammar… Public      Urban    M            69    16
## # ℹ 1 more variable: CA_frequency <dbl>
# Structure of the data
str(school_data)
## tibble [2,135 × 9] (S3: tbl_df/tbl/data.frame)
##  $ Student_No  : num [1:2135] 1 2 3 4 5 6 7 8 9 10 ...
##  $ Schools     : chr [1:2135] "School A" "School A" "School A" "School A" ...
##  $ School Names: chr [1:2135] "Muslim Grammar School Ado" "Muslim Grammar School Ado" "Muslim Grammar School Ado" "Muslim Grammar School Ado" ...
##  $ school_type : chr [1:2135] "Public" "Public" "Public" "Public" ...
##  $ location    : chr [1:2135] "Urban" "Urban" "Urban" "Urban" ...
##  $ gender      : chr [1:2135] "F" "M" "F" "M" ...
##  $ UE_score    : num [1:2135] 66 67 71 69 67 69 71 72 69 68 ...
##  $ AGE         : num [1:2135] 15 18 15 15 18 16 15 14 14 16 ...
##  $ CA_frequency: num [1:2135] 3 3 3 3 3 3 3 3 3 3 ...
summary(school_data)
##    Student_No      Schools          School Names       school_type       
##  Min.   :  1.0   Length:2135        Length:2135        Length:2135       
##  1st Qu.: 46.0   Class :character   Class :character   Class :character  
##  Median : 92.0   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :100.8                                                           
##  3rd Qu.:144.0                                                           
##  Max.   :558.0                                                           
##    location            gender             UE_score          AGE       
##  Length:2135        Length:2135        Min.   : 6.00   Min.   :14.00  
##  Class :character   Class :character   1st Qu.:58.00   1st Qu.:16.00  
##  Mode  :character   Mode  :character   Median :66.00   Median :16.00  
##                                        Mean   :64.29   Mean   :16.46  
##                                        3rd Qu.:68.00   3rd Qu.:17.00  
##                                        Max.   :88.00   Max.   :20.00  
##   CA_frequency  
##  Min.   :2.000  
##  1st Qu.:2.000  
##  Median :3.000  
##  Mean   :2.628  
##  3rd Qu.:3.000  
##  Max.   :3.000
describe(school_data)
##               vars    n   mean    sd median trimmed   mad min max range  skew
## Student_No       1 2135 100.76 67.47     92   95.51 72.65   1 558   557  0.69
## Schools*         2 2135   6.10  3.63      6    6.05  5.93   1  12    11  0.04
## School Names*    3 2135   6.25  3.32      6    6.24  4.45   1  12    11 -0.02
## school_type*     4 2135   1.97  0.18      2    2.00  0.00   1   2     1 -5.08
## location*        5 2135   1.84  0.37      2    1.92  0.00   1   2     1 -1.84
## gender*          6 2133   1.49  0.50      1    1.49  0.00   1   2     1  0.03
## UE_score         7 2135  64.29  7.22     66   64.38  5.93   6  88    82 -0.40
## AGE              8 2135  16.46  1.23     16   16.45  1.48  14  20     6  0.06
## CA_frequency     9 2135   2.63  0.48      3    2.66  0.00   2   3     1 -0.53
##               kurtosis   se
## Student_No        0.45 1.46
## Schools*         -1.36 0.08
## School Names*    -1.31 0.07
## school_type*     23.86 0.00
## location*         1.38 0.01
## gender*          -2.00 0.01
## UE_score          2.13 0.16
## AGE              -0.64 0.03
## CA_frequency     -1.72 0.01

Converting Character to factor

# Convert categorical columns (including the numeric CA_frequency codes) into factors for analysis.
school_data <- school_data %>%
  mutate(
    CA_frequency = as.factor(CA_frequency),
    gender = as.factor(gender),
    location = as.factor(location),
    school_type = as.factor(school_type)
  )

# Check for missing values
colSums(is.na(school_data))
##   Student_No      Schools School Names  school_type     location       gender 
##            0            0            0            0            0            2 
##     UE_score          AGE CA_frequency 
##            0            0            0

Check structure after conversion

# Structure of the data
str(school_data)
## tibble [2,135 × 9] (S3: tbl_df/tbl/data.frame)
##  $ Student_No  : num [1:2135] 1 2 3 4 5 6 7 8 9 10 ...
##  $ Schools     : chr [1:2135] "School A" "School A" "School A" "School A" ...
##  $ School Names: chr [1:2135] "Muslim Grammar School Ado" "Muslim Grammar School Ado" "Muslim Grammar School Ado" "Muslim Grammar School Ado" ...
##  $ school_type : Factor w/ 2 levels "Private","Public": 2 2 2 2 2 2 2 2 2 2 ...
##  $ location    : Factor w/ 2 levels "Rural","Urban": 2 2 2 2 2 2 2 2 2 2 ...
##  $ gender      : Factor w/ 2 levels "F","M": 1 2 1 2 2 2 2 1 2 2 ...
##  $ UE_score    : num [1:2135] 66 67 71 69 67 69 71 72 69 68 ...
##  $ AGE         : num [1:2135] 15 18 15 15 18 16 15 14 14 16 ...
##  $ CA_frequency: Factor w/ 2 levels "2","3": 2 2 2 2 2 2 2 2 2 2 ...
summary(school_data)
##    Student_No      Schools          School Names        school_type  
##  Min.   :  1.0   Length:2135        Length:2135        Private:  74  
##  1st Qu.: 46.0   Class :character   Class :character   Public :2061  
##  Median : 92.0   Mode  :character   Mode  :character                 
##  Mean   :100.8                                                       
##  3rd Qu.:144.0                                                       
##  Max.   :558.0                                                       
##   location     gender        UE_score          AGE        CA_frequency
##  Rural: 345   F   :1082   Min.   : 6.00   Min.   :14.00   2: 795      
##  Urban:1790   M   :1051   1st Qu.:58.00   1st Qu.:16.00   3:1340      
##               NA's:   2   Median :66.00   Median :16.00               
##                           Mean   :64.29   Mean   :16.46               
##                           3rd Qu.:68.00   3rd Qu.:17.00               
##                           Max.   :88.00   Max.   :20.00
describe(school_data)
##               vars    n   mean    sd median trimmed   mad min max range  skew
## Student_No       1 2135 100.76 67.47     92   95.51 72.65   1 558   557  0.69
## Schools*         2 2135   6.10  3.63      6    6.05  5.93   1  12    11  0.04
## School Names*    3 2135   6.25  3.32      6    6.24  4.45   1  12    11 -0.02
## school_type*     4 2135   1.97  0.18      2    2.00  0.00   1   2     1 -5.08
## location*        5 2135   1.84  0.37      2    1.92  0.00   1   2     1 -1.84
## gender*          6 2133   1.49  0.50      1    1.49  0.00   1   2     1  0.03
## UE_score         7 2135  64.29  7.22     66   64.38  5.93   6  88    82 -0.40
## AGE              8 2135  16.46  1.23     16   16.45  1.48  14  20     6  0.06
## CA_frequency*    9 2135   1.63  0.48      2    1.66  0.00   1   2     1 -0.53
##               kurtosis   se
## Student_No        0.45 1.46
## Schools*         -1.36 0.08
## School Names*    -1.31 0.07
## school_type*     23.86 0.00
## location*         1.38 0.01
## gender*          -2.00 0.01
## UE_score          2.13 0.16
## AGE              -0.64 0.03
## CA_frequency*    -1.72 0.01

Check for duplicated rows

df_dup <- school_data   # alias for brevity

# Count duplicated rows (entire-row duplicates)
n_dup_rows <- sum(duplicated(df_dup))
cat("Number of fully duplicated rows:", n_dup_rows, "\n")
## Number of fully duplicated rows: 0
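
The check above only flags rows that are identical in every column. Since Student_No restarts within each school (its maximum is 558 across 2,135 rows), a complementary check could look for repeated student identifiers within a school. The sketch below uses a small made-up data frame, and it assumes that the pair (Schools, Student_No) is meant to identify one student, which the source does not state.

```r
# Toy data (hypothetical values, not taken from the real dataset)
toy <- data.frame(
  Schools    = c("School A", "School A", "School B", "School A"),
  Student_No = c(1, 2, 1, 2),
  UE_score   = c(66, 67, 70, 71)
)

# Entire-row duplicates: every column must match
n_full_dups <- sum(duplicated(toy))

# Key duplicates: same school and student number, other columns may differ
# (assumes Schools + Student_No should be unique per student)
n_key_dups <- sum(duplicated(toy[, c("Schools", "Student_No")]))

cat("Full-row duplicates:", n_full_dups, "\n")
cat("Key duplicates:", n_key_dups, "\n")
```

In this toy example the fourth row repeats the key of the second row but has a different score, so it is missed by the full-row check and caught by the key check.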

Outlier detection

# Select numeric columns (excluding Student_No and AGE) and reshape to long format.
# After the factor conversion, UE_score is the only numeric variable left.

num_long <- school_data %>%
  select(where(is.numeric), -Student_No, -AGE) %>%
  pivot_longer(
    cols      = everything(),
    names_to  = "variable",
    values_to = "value"
  )

# Combined boxplot
ggplot(num_long, aes(x = variable, y = value)) +
  geom_boxplot(outlier.colour = "red", fill = "skyblue", alpha = 0.7) +
  coord_flip() +                               # puts variables on the y-axis for readability
  labs(
    title = "Box-plots of all numeric variables (outliers in red)",
    x     = NULL,
    y     = "Value"
  ) +
  theme_minimal(base_size = 13)
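
The boxplot marks outliers visually. To list them explicitly, the conventional 1.5 × IQR rule can be applied directly. The following is a minimal sketch on made-up scores; the helper name `flag_iqr_outliers` and the toy values are illustrative assumptions, not part of the original analysis.

```r
# Flag values outside [Q1 - k*IQR, Q3 + k*IQR]; k = 1.5 is the usual boxplot convention
flag_iqr_outliers <- function(x, k = 1.5) {
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  iqr <- q[2] - q[1]
  x < q[1] - k * iqr | x > q[2] + k * iqr
}

# Toy scores (hypothetical): one very low value among otherwise typical scores
toy_scores <- c(6, 58, 60, 62, 64, 66, 68, 70)
outliers <- toy_scores[flag_iqr_outliers(toy_scores)]
print(outliers)
```

On the real data the same helper could be called on `school_data$UE_score` to obtain the flagged scores for inspection before deciding whether to keep them.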

Replace missing values with the mode

# Check for missing values before imputation
colSums(is.na(school_data))
##   Student_No      Schools School Names  school_type     location       gender 
##            0            0            0            0            0            2 
##     UE_score          AGE CA_frequency 
##            0            0            0
# Impute missing gender values with the mode

# Calculate the mode for the 'gender' column (most frequent value).
gender_table <- table(school_data$gender, useNA = "no")
mode_gender <- names(gender_table)[which.max(gender_table)]

# Announce the mode that will be used for imputation.
cat("The mode for 'gender' is:", mode_gender, "\nNow replacing NAs...\n")
## The mode for 'gender' is: F 
## Now replacing NAs...
# Replace NA values in the 'gender' column with the calculated mode.

school_data <- school_data %>%
  mutate(gender = forcats::fct_na_value_to_level(gender, level = mode_gender))

# Check for missing values again to confirm imputation in the 'gender' column.
print("Missing values after imputing gender:")
## [1] "Missing values after imputing gender:"
colSums(is.na(school_data))
##   Student_No      Schools School Names  school_type     location       gender 
##            0            0            0            0            0            0 
##     UE_score          AGE CA_frequency 
##            0            0            0
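
For readers who want to see the imputation logic without forcats, the same idea can be sketched in base R. The toy factor below is made up for illustration; with only 2 of 2,135 gender values missing in the real data, the choice of imputation method is unlikely to change the results much.

```r
# Toy factor with one missing value (hypothetical data)
toy_gender <- factor(c("F", "M", NA, "F", "M", "F"))

# table() drops NA by default, so the mode is computed on observed values only
counts <- table(toy_gender)
mode_level <- names(counts)[which.max(counts)]

# Replace NA entries with the most frequent level
imputed <- toy_gender
imputed[is.na(imputed)] <- mode_level

print(table(imputed, useNA = "ifany"))
```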

Analysis and Modeling

Here, we address each research question by performing statistical tests and creating visualizations.

RQ1: How often do Continuous Assessments occur?

# Create a frequency table of CA occurrences
ca_frequency_table <- school_data %>%
  count(CA_frequency, name = "Frequency") %>%
  mutate(Percentage = (Frequency / sum(Frequency)) * 100)

# Print the table using kable for a clean format
kable(ca_frequency_table, caption = "Frequency of Continuous Assessments")
Frequency of Continuous Assessments
CA_frequency Frequency Percentage
2 795 37.23653
3 1340 62.76347
# Visualize the distribution with a bar chart
ggplot(ca_frequency_table, aes(x = reorder(CA_frequency, -Frequency), y = Frequency, fill = CA_frequency)) +
  geom_bar(stat = "identity", show.legend = FALSE) +
  geom_text(aes(label = paste0(round(Percentage, 1), "%")), vjust = -0.5) +
  labs(
    title = "Distribution of Continuous Assessment Frequency",
    x = "CA Frequency",
    y = "Number of Students"
  ) +
  theme_minimal()
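
As a quick arithmetic check, the percentages in the table can be reproduced from the two reported counts alone. This sketch only re-derives figures already shown in the output above.

```r
# Counts reported in the frequency table (795 students with 2 CAs, 1340 with 3)
ca_counts <- c("2" = 795, "3" = 1340)

# Share of each group as a percentage, rounded to one decimal place
ca_pct <- round(100 * prop.table(ca_counts), 1)
print(ca_pct)
```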

RQ2: What is the overall performance in Mathematics?

# Calculate descriptive statistics for the Unified Examination scores
summary_stats <- summary(school_data$UE_score)
# Build a two-column table (statistic name, value) for a clean kable
kable(
  data.frame(Statistic = names(summary_stats), Value = as.numeric(summary_stats)),
  caption = "Descriptive Statistics for Mathematics UE Scores"
)
Descriptive Statistics for Mathematics UE Scores
Statistic Value
Min. 6.00000
1st Qu. 58.00000
Median 66.00000
Mean 64.28525
3rd Qu. 68.00000
Max. 88.00000
# Visualize the distribution of scores with a histogram and density plot
ggplot(school_data, aes(x = UE_score)) +
  geom_histogram(aes(y = after_stat(density)), bins = 20, fill = "skyblue", color = "white") +
  geom_density(alpha = 0.5, fill = "lightblue") +
  geom_vline(aes(xintercept = mean(UE_score)), color = "red", linetype = "dashed", linewidth = 1) +
  annotate("text", x = mean(school_data$UE_score) * 1.1, y = 0.01, label = paste("Mean =", round(mean(school_data$UE_score), 2)), color = "red") +
  labs(
    title = "Distribution of Mathematics Unified Examination Scores",
    x = "UE Score",
    y = "Density"
  ) +
  theme_minimal()

library(scales)          # for percent_format()
## 
## Attaching package: 'scales'
## The following objects are masked from 'package:psych':
## 
##     alpha, rescale
## The following object is masked from 'package:purrr':
## 
##     discard
## The following object is masked from 'package:readr':
## 
##     col_factor
# Create 25-point bands

score_bands <- school_data %>%                              
  filter(!is.na(UE_score)) %>%                           
  mutate(
    band = cut(
      UE_score,
      breaks = c(-Inf, 25, 50, 75, 100),                  # band edges
      labels = c("0–25", "26–50", "51–75", "76–100"),
      right  = TRUE, include.lowest = TRUE
    )
  )
# Count & percentage per band

band_summary <- score_bands %>% 
  count(band, name = "n") %>% 
  mutate(
    pct = n / sum(n),
    label = percent(pct, accuracy = 0.1)
  )

print(band_summary)
## # A tibble: 4 × 4
##   band       n      pct label
##   <fct>  <int>    <dbl> <chr>
## 1 0–25       1 0.000468 0.0% 
## 2 26–50     63 0.0295   3.0% 
## 3 51–75   1966 0.921    92.1%
## 4 76–100   105 0.0492   4.9%
# Percentage bar chart, emphasising 0–25 band

ggplot(band_summary, aes(band, pct, fill = band == "0–25")) +
  geom_col(colour = "white") +
  geom_text(aes(label = label), vjust = -0.4, fontface = "bold") +
  scale_y_continuous(labels = percent_format(), expand = expansion(mult = c(0, 0.08))) +
  scale_fill_manual(values = c("TRUE" = "firebrick", "FALSE" = "steelblue"), guide = "none") +
  labs(
    title = "Students Performance in Mathematics Unified Examination",
    x = "Maths Unified-Exam Score",
    y = "Percentage of Students"
  ) +
  theme_minimal(base_size = 14)
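
Because `cut()` is called with `right = TRUE`, each band includes its upper edge, so a score of exactly 75 falls in the 51–75 band rather than the top band. The sketch below checks this on a few boundary scores; the scores are made up, and plain hyphens are used in the labels only to keep the example ASCII-safe.

```r
# Boundary scores (hypothetical) to see which band each edge value lands in
edge_scores <- c(25, 26, 50, 51, 75, 76)

bands <- cut(
  edge_scores,
  breaks = c(-Inf, 25, 50, 75, 100),
  labels = c("0-25", "26-50", "51-75", "76-100"),
  right  = TRUE   # intervals are (lower, upper], so the upper edge is included
)

print(data.frame(score = edge_scores, band = bands))
```

This is why the share of students in the top band can be smaller than the share with a grade defined as 75 and above.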

According to the WAEC Grading System, the grades and score ranges are:
A1 – 75–100
B2 – 70–74
B3 – 65–69
C4 – 60–64
C5 – 55–59
C6 – 50–54
D7 – 45–49
E8 – 40–44
F9 – 0–39

# Build WAEC summary 
waec_bands <- school_data %>%
  filter(!is.na(UE_score)) %>%
  mutate(
    grade = case_when(
      UE_score >= 75 ~ "A1",
      UE_score >= 70 & UE_score <= 74 ~ "B2",
      UE_score >= 65 & UE_score <= 69 ~ "B3",
      UE_score >= 60 & UE_score <= 64 ~ "C4",
      UE_score >= 55 & UE_score <= 59 ~ "C5",
      UE_score >= 50 & UE_score <= 54 ~ "C6",
      UE_score >= 45 & UE_score <= 49 ~ "D7",
      UE_score >= 40 & UE_score <= 44 ~ "E8",
      UE_score < 40 ~ "F9"
    ),
    color_group = case_when(
      grade %in% c("D7","E8","F9") ~ "red",
      grade %in% c("C6","C5","C4") ~ "lightblue",
      grade %in% c("B3","B2")      ~ "darkblue",
      TRUE                         ~ "darkgreen"   # A1
    )
  )

waec_summary <- waec_bands %>%
  count(grade, color_group, name = "n") %>%
  mutate(
    pct        = n / sum(n),
    pct_lbl    = scales::percent(pct, accuracy = 0.1),
    range_lbl  = dplyr::recode(grade,
      "A1"="75–100","B2"="70–74","B3"="65–69",
      "C4"="60–64","C5"="55–59","C6"="50–54",
      "D7"="45–49","E8"="40–44","F9"="0–39"
    ),
    # label positions: score range just above the bar, percentage above the score range
    y_score = pct + 0.010,
    y_pct   = pct + 0.035
  )

waec_summary$grade <- factor(
  waec_summary$grade,
  levels = c("A1","B2","B3","C4","C5","C6","D7","E8","F9")
)

# Plot
ggplot(waec_summary, aes(x = grade, y = pct, fill = color_group)) +
  geom_col(color = "white") +
  # score range just above each bar
  geom_text(aes(y = y_score, label = paste0("(", range_lbl, ")")),
            vjust = 0, fontface = "bold") +
  # percentage ABOVE the score-range
  geom_text(aes(y = y_pct, label = pct_lbl),
            vjust = 0, fontface = "bold") +
  scale_y_continuous(labels = scales::percent_format(),
                     expand = expansion(mult = c(0, 0.30))) +  # extra headroom for labels
  scale_fill_identity() +
  labs(
    title = "Percentage of Students by WAEC Grade in Mathematics",
    x = "WAEC Grade",
    y = "Percentage of Students"
  ) +
  coord_cartesian(clip = "off") +
  theme_minimal(base_size = 14) +
  theme(plot.margin = margin(t = 20, r = 10, b = 10, l = 10))
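
The `case_when()` ranges above assume whole-number scores: a fractional score such as 74.5 would match no condition and become NA. If fractional scores were possible, one alternative is a single `cut()` call with left-closed intervals. This is a sketch on made-up scores, and assigning fractional scores to the lower grade is an assumption, not a WAEC rule stated in the source.

```r
# Toy scores (hypothetical), including fractional values that fall between integer ranges
toy_scores <- c(39, 40, 49.5, 64, 74.5, 75)

grades <- cut(
  toy_scores,
  breaks = c(-Inf, 40, 45, 50, 55, 60, 65, 70, 75, Inf),
  labels = c("F9", "E8", "D7", "C6", "C5", "C4", "B3", "B2", "A1"),
  right  = FALSE   # intervals are [lower, upper), so 75 starts A1 and 74.5 stays in B2
)

print(data.frame(score = toy_scores, grade = grades))
```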

RQ3: Does Assessment Frequency impact Exam Scores?

An independent samples t-test (Welch's version, which t.test() applies by default) is used to test whether mean UE scores differ between the two CA frequency groups.

# Visualize the relationship with boxplots
ggplot(school_data, aes(x = CA_frequency, y = UE_score, fill = CA_frequency)) +
  geom_boxplot(show.legend = FALSE) +
  labs(
    title = "UE Scores by Continuous Assessment Frequency",
    x = "CA Frequency",
    y = "Unified Examination Score"
  ) +
  theme_minimal()

# Perform independent t-test
t_test_CA_frequency <- t.test(UE_score ~ CA_frequency, data = school_data)
t_test_CA_frequency
## 
##  Welch Two Sample t-test
## 
## data:  UE_score by CA_frequency
## t = -15.551, df = 1306.3, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group 2 and group 3 is not equal to 0
## 95 percent confidence interval:
##  -5.745706 -4.458405
## sample estimates:
## mean in group 2 mean in group 3 
##        61.08302        66.18507

Interpretation: The p-value (< 2.2e-16) is less than 0.05, so we reject the null hypothesis (H₀₃) and conclude that there is a statistically significant difference in mean UE scores between the two CA frequency groups. Students who had three continuous assessments had a higher mean score (66.19) than those who had two (61.08).
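
With a sample this large, even small differences reach statistical significance, so a standardized effect size helps in judging practical importance. The following is a minimal sketch of Cohen's d with a pooled standard deviation, run on made-up scores; the helper `cohens_d` is illustrative and was not part of the original analysis.

```r
# Cohen's d: difference in means divided by the pooled standard deviation
cohens_d <- function(x, y) {
  nx <- length(x)
  ny <- length(y)
  pooled_sd <- sqrt(((nx - 1) * var(x) + (ny - 1) * var(y)) / (nx + ny - 2))
  (mean(y) - mean(x)) / pooled_sd
}

# Toy groups (hypothetical values): scores for students with 2 and 3 assessments
two_ca   <- c(60, 62, 64)
three_ca <- c(65, 67, 69)

d <- cohens_d(two_ca, three_ca)
print(d)
```

On the real data, the two arguments would be the UE_score values for CA_frequency "2" and "3" respectively.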

RQ4: Are there gender differences in performance?

An independent t-test is used to compare the mean scores of male and female students.

# Visualize the relationship with boxplots
ggplot(school_data, aes(x = gender, y = UE_score, fill = gender)) +
  geom_boxplot(show.legend = FALSE) +
  labs(
    title = "Mathematics UE Scores by Gender",
    x = "Gender",
    y = "Unified Examination Score"
  ) +
  theme_minimal()

# Perform independent t-test
t_test_gender <- t.test(UE_score ~ gender, data = school_data)
t_test_gender
## 
##  Welch Two Sample t-test
## 
## data:  UE_score by gender
## t = 0.54994, df = 2131.2, p-value = 0.5824
## alternative hypothesis: true difference in means between group F and group M is not equal to 0
## 95 percent confidence interval:
##  -0.4413944  0.7854333
## sample estimates:
## mean in group F mean in group M 
##        64.36993        64.19791

Interpretation: The p-value from the t-test is 0.5824, which is greater than 0.05, so we fail to reject the null hypothesis (H₀₄). There is no statistically significant difference in mean mathematics scores between genders.
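
The confidence interval in the output can be re-derived from the reported means, t statistic, and degrees of freedom, which shows how these pieces fit together. This sketch uses only the numbers printed above; small discrepancies from the reported interval are expected because the inputs are rounded.

```r
# Values copied from the Welch t-test output for gender
mean_f   <- 64.36993
mean_m   <- 64.19791
t_stat   <- 0.54994
df_welch <- 2131.2

# t = difference / standard error, so the standard error can be recovered
diff_means <- mean_f - mean_m
se_diff    <- diff_means / t_stat

# 95% interval: difference plus or minus the critical t value times the standard error
margin <- qt(0.975, df_welch) * se_diff
ci <- c(lower = diff_means - margin, upper = diff_means + margin)
print(round(ci, 3))
```

The interval spans zero, which is consistent with failing to reject the null hypothesis.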

RQ5: Do location effects exist?

An independent t-test is used to compare the mean scores of students from urban and rural locations.

# Visualize the relationship with boxplots
ggplot(school_data, aes(x = location, y = UE_score, fill = location)) +
  geom_boxplot(show.legend = FALSE) +
  labs(
    title = "Mathematics UE Scores by Location",
    x = "Location",
    y = "Unified Examination Score"
  ) +
  theme_minimal()

# Perform independent t-test
t_test_location <- t.test(UE_score ~ location, data = school_data)
t_test_location
## 
##  Welch Two Sample t-test
## 
## data:  UE_score by location
## t = -9.2214, df = 445.31, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group Rural and group Urban is not equal to 0
## 95 percent confidence interval:
##  -5.168756 -3.352643
## sample estimates:
## mean in group Rural mean in group Urban 
##            60.71304            64.97374

Interpretation: The p-value (< 2.2e-16) is less than 0.05, so we reject the null hypothesis (H₀₅) and conclude there is a significant difference in mean scores between urban and rural students. Urban students performed better, with a mean score of 64.97 compared with 60.71 for rural students.

RQ6: Does school type impact performance?

An independent t-test is used to compare the mean scores of students from public and private schools.

# Visualize the relationship with boxplots
ggplot(school_data, aes(x = school_type, y = UE_score, fill = school_type)) +
  geom_boxplot(show.legend = FALSE) +
  labs(
    title = "Mathematics UE Scores by School Type",
    x = "School Type",
    y = "Unified Examination Score"
  ) +
  theme_minimal()

# Perform independent t-test
t_test_school_type <- t.test(UE_score ~ school_type, data = school_data)
t_test_school_type
## 
##  Welch Two Sample t-test
## 
## data:  UE_score by school_type
## t = 5.0025, df = 79.242, p-value = 3.334e-06
## alternative hypothesis: true difference in means between group Private and group Public is not equal to 0
## 95 percent confidence interval:
##  2.384500 5.535743
## sample estimates:
## mean in group Private  mean in group Public 
##              68.10811              64.14799

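Interpretation: The p-value of 3.334e-06 is less than 0.05, so we reject the null hypothesis (H₀₆) and conclude there is a significant difference in mean scores between public and private schools. Private school students scored higher on average (68.11 versus 64.15), although the private group is small (74 of 2,135 students).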
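
Four hypothesis tests were run at α = 0.05, so a reader may want to check whether the conclusions hold after adjusting for multiple comparisons. The sketch below applies a Holm adjustment to the p-values reported above. This adjustment was not part of the original analysis, and for RQ3 and RQ5 the printed bound 2.2e-16 is used as a conservative stand-in for the true p-value.

```r
# p-values as reported in the four t-test outputs
# (RQ3 and RQ5 were printed as "< 2.2e-16"; the bound is used here)
p_raw <- c(RQ3 = 2.2e-16, RQ4 = 0.5824, RQ5 = 2.2e-16, RQ6 = 3.334e-06)

# Holm step-down adjustment controls the family-wise error rate
p_holm <- p.adjust(p_raw, method = "holm")

print(data.frame(p_raw, p_holm, significant = p_holm < 0.05))
```

Under this adjustment the same three tests (RQ3, RQ5, RQ6) remain significant and RQ4 remains non-significant.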

Evaluation and Conclusion

Based on the analysis:

  • RQ1: About 63 percent of students (1,340 of 2,135) had three continuous assessments; the remaining 37 percent had two.

  • RQ2: 92.1 percent of students scored within the range of 51–75, about 3.0 percent scored 50 or below, and only 4.9 percent scored above 75. Under the WAEC grading system, performance clusters in the mid-to-upper bands: B3 (65–69) is the modal grade at 36.4%, followed by C5 (55–59) at 17.4%, C4 (60–64) at 17.2%, and C6 (50–54) at 8.3%. High performers are fewer, with A1 (≥75) at 6.0% and B2 (70–74) at 13.4% (together 19.4%). Very few students fall into the lower bands, with D7–F9 (≤49) totaling just 1.2% (D7 0.9%, E8 0.2%, F9 0.1%).

  • RQ3 (Assessment Frequency): The independent samples t-test indicates a statistically significant difference in exam scores by assessment frequency, with students who had three continuous assessments scoring higher on average (66.19 versus 61.08).

  • RQ4 (Gender): The t-test shows no statistically significant gender gap in mathematics performance.

  • RQ5 (Location): The t-test shows that a student’s location (urban/rural) is a significant factor in their scores, with urban students performing better.

  • RQ6 (School Type): The t-test shows a significant performance difference between public and private school students, with private school students performing better.

The conclusions from these tests provide actionable insights for educators and policymakers to address disparities and optimize teaching strategies.