Introduction

This report demonstrates two data science applications in education using R:

Predicting University Course Recommendations: Using student performance in science courses (Physics, Chemistry, Biology, Math) to recommend suitable university courses (e.g., Medicine, Engineering) with a Random Forest model.
Tracking Subject Teacher Performance: Analyzing Mathematics scores over time to assess student improvement across years and terms using trend analysis and statistical tests.

Part 1: Predicting University Course Recommendations

Objective

Step 1: Generate and Load Sample Data

Step 2: Exploratory Data Analysis

Step 3: Feature Engineering

Step 4: Prepare Data for Modeling

Step 5: Train Random Forest Model

Step 6: Evaluate the Model

Step 7: Feature Importance

Step 8: Predict for a New Student

Step 1: Generate and Load Sample Data

We simulate a dataset of 100 students across five courses (Medicine, Engineering, Computer Science, Biology, Physics), with 20 students per course. Each student has scores in Physics, Chemistry, Biology, and Math, tailored to reflect course-specific strengths.

# STEP 1
# Load required libraries
library(dplyr)
library(randomForest)
library(caret)
library(ggplot2)

# Set seed for reproducibility
set.seed(123)

# Define courses and number of students per course
n_per_course <- 20
courses <- c("Medicine", "Engineering", "Computer Science", "Biology", "Physics")

# Function to generate scores based on course-specific strengths
generate_scores <- function(course, n) {
  if (course == "Medicine") {
    data.frame(
      StudentID = seq((which(courses == course) - 1) * n_per_course + 1, 
                      which(courses == course) * n_per_course),
      Physics = round(rnorm(n, mean = 75, sd = 10), 0),
      Chemistry = round(rnorm(n, mean = 85, sd = 8), 0),
      Biology = round(rnorm(n, mean = 90, sd = 7), 0),
      Math = round(rnorm(n, mean = 70, sd = 10), 0),
      Course = course
    )
  } else if (course == "Engineering") {
    data.frame(
      StudentID = seq((which(courses == course) - 1) * n_per_course + 1, 
                      which(courses == course) * n_per_course),
      Physics = round(rnorm(n, mean = 85, sd = 8), 0),
      Chemistry = round(rnorm(n, mean = 75, sd = 10), 0),
      Biology = round(rnorm(n, mean = 65, sd = 12), 0),
      Math = round(rnorm(n, mean = 85, sd = 8), 0),
      Course = course
    )
  } else if (course == "Computer Science") {
    data.frame(
      StudentID = seq((which(courses == course) - 1) * n_per_course + 1, 
                      which(courses == course) * n_per_course),
      Physics = round(rnorm(n, mean = 75, sd = 10), 0),
      Chemistry = round(rnorm(n, mean = 65, sd = 12), 0),
      Biology = round(rnorm(n, mean = 60, sd = 12), 0),
      Math = round(rnorm(n, mean = 90, sd = 7), 0),
      Course = course
    )
  } else if (course == "Biology") {
    data.frame(
      StudentID = seq((which(courses == course) - 1) * n_per_course + 1, 
                      which(courses == course) * n_per_course),
      Physics = round(rnorm(n, mean = 65, sd = 12), 0),
      Chemistry = round(rnorm(n, mean = 80, sd = 9), 0),
      Biology = round(rnorm(n, mean = 85, sd = 8), 0),
      Math = round(rnorm(n, mean = 65, sd = 12), 0),
      Course = course
    )
  } else if (course == "Physics") {
    data.frame(
      StudentID = seq((which(courses == course) - 1) * n_per_course + 1, 
                      which(courses == course) * n_per_course),
      Physics = round(rnorm(n, mean = 90, sd = 7), 0),
      Chemistry = round(rnorm(n, mean = 70, sd = 10), 0),
      Biology = round(rnorm(n, mean = 65, sd = 12), 0),
      Math = round(rnorm(n, mean = 80, sd = 9), 0),
      Course = course
    )
  }
}

# Generate and combine data
data_list <- lapply(courses, function(x) generate_scores(x, n_per_course))
data <- do.call(rbind, data_list)

# Ensure scores are between 0 and 100
data <- data %>%
  mutate(across(c(Physics, Chemistry, Biology, Math), ~ pmin(pmax(.x, 0), 100)))

# Shuffle data
data <- data[sample(1:nrow(data)), ]

# Save to CSV
write.csv(data, "student_scores.csv", row.names = FALSE)

# Preview data
head(data)

##    StudentID Physics Chemistry Biology Math           Course
## 41        41      86        52      86   86 Computer Science
## 12        12      79        83      90   47         Medicine
## 8          8      62        86      87   71         Medicine
## 71        71      60        88      94   81          Biology
## 15        15      69        92      88   63         Medicine
## 83        83      81        87      68   82          Physics

##Step 2: Exploratory Data Analysis We explore the data to understand score distributions and course-specific patterns.

# Summary statistics by course
data %>%
  group_by(Course) %>%
  summarise(across(c(Physics, Chemistry, Biology, Math), mean, .names = "{.col}_Mean"))

## # A tibble: 5 × 5
##   Course           Physics_Mean Chemistry_Mean Biology_Mean Math_Mean
##   <chr>                   <dbl>          <dbl>        <dbl>     <dbl>
## 1 Biology                  64.8           82.0         88.0      64.6
## 2 Computer Science         76.3           63.4         60.8      89.6
## 3 Engineering              87.8           71.4         64.4      83.6
## 4 Medicine                 76.4           84.6         90.4      68.8
## 5 Physics                  90.4           69.2         65.4      78.8

# Boxplot of Physics scores by course
ggplot(data, aes(x = Course, y = Physics, fill = Course)) +
  geom_boxplot() +
  labs(title = "Physics Scores by Course", x = "Course", y = "Physics Score") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

# Boxplot of Math scores by course
ggplot(data, aes(x = Course, y = Math, fill = Course)) +
  geom_boxplot() +
  labs(title = "Math Scores by Course", x = "Course", y = "Math Score") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

data <- data %>%
  mutate(Avg_Science = rowMeans(select(., Physics, Chemistry, Biology)))

##Step 4: Prepare Data for Modeling We convert the target variable (Course) to a factor and split the data into training (70%) and testing (30%) sets.

# Convert Course to factor
data$Course <- as.factor(data$Course)

# Select features and target
features <- data %>% select(Physics, Chemistry, Biology, Math, Avg_Science)
target <- data$Course

# Split data
set.seed(123)
trainIndex <- createDataPartition(target, p = 0.7, list = FALSE)
X_train <- features[trainIndex, ]
X_test <- features[-trainIndex, ]
y_train <- target[trainIndex]
y_test <- target[-trainIndex]

##Step 5: Train Random Forest Model We use a Random Forest classifier to predict the recommended course based on subject scores.

rf_model <- randomForest(x = X_train, y = y_train, ntree = 100, importance = TRUE)

##Step 6: Evaluate the Model We assess model performance using a confusion matrix and accuracy metrics.

# Predict on test set
predictions <- predict(rf_model, X_test)

# Confusion matrix
confusionMatrix(predictions, y_test)

## Confusion Matrix and Statistics
## 
##                   Reference
## Prediction         Biology Computer Science Engineering Medicine Physics
##   Biology                4                0           0        0       0
##   Computer Science       0                5           0        0       0
##   Engineering            0                1           2        1       3
##   Medicine               2                0           1        5       0
##   Physics                0                0           3        0       3
## 
## Overall Statistics
##                                           
##                Accuracy : 0.6333          
##                  95% CI : (0.4386, 0.8007)
##     No Information Rate : 0.2             
##     P-Value [Acc > NIR] : 2.843e-07       
##                                           
##                   Kappa : 0.5417          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: Biology Class: Computer Science Class: Engineering
## Sensitivity                  0.6667                  0.8333            0.33333
## Specificity                  1.0000                  1.0000            0.79167
## Pos Pred Value               1.0000                  1.0000            0.28571
## Neg Pred Value               0.9231                  0.9600            0.82609
## Prevalence                   0.2000                  0.2000            0.20000
## Detection Rate               0.1333                  0.1667            0.06667
## Detection Prevalence         0.1333                  0.1667            0.23333
## Balanced Accuracy            0.8333                  0.9167            0.56250
##                      Class: Medicine Class: Physics
## Sensitivity                   0.8333         0.5000
## Specificity                   0.8750         0.8750
## Pos Pred Value                0.6250         0.5000
## Neg Pred Value                0.9545         0.8750
## Prevalence                    0.2000         0.2000
## Detection Rate                0.1667         0.1000
## Detection Prevalence          0.2667         0.2000
## Balanced Accuracy             0.8542         0.6875

##Step 7: Feature Importance We examine which features contribute most to the predictions.

# Extract and plot feature importance
importance_scores <- importance(rf_model)
print(importance_scores)

##                Biology Computer Science Engineering  Medicine    Physics
## Physics      8.3330570         2.128525    2.656980 -1.134798  3.9303901
## Chemistry    0.1649704         1.064516    1.433061  4.109752 -0.3353620
## Biology      3.2124209         2.074211    2.938108  6.703361  2.3355521
## Math         2.8731989         6.667960    2.280213  1.583301 -0.1335484
## Avg_Science -1.6960029         5.446759    2.232918  3.577386 -0.4227447
##             MeanDecreaseAccuracy MeanDecreaseGini
## Physics                 7.272648        13.858734
## Chemistry               2.562912         8.785018
## Biology                 7.811637        12.441063
## Math                    5.637878        10.149962
## Avg_Science             4.077527         9.861509

barplot(importance_scores[, "MeanDecreaseGini"], 
        names.arg = rownames(importance_scores),
        main = "Feature Importance",
        xlab = "Features", ylab = "Mean Decrease in Gini",
        col = "lightblue", las = 2)

##Step 8: Predict for a New Student We demonstrate the model by predicting a course for a hypothetical student.

new_student <- data.frame(
  Physics = 85,
  Chemistry = 80,
  Biology = 90,
  Math = 75,
  Avg_Science = mean(c(85, 80, 90))
)

predicted_course <- predict(rf_model, new_student)
cat("Predicted University Course:", as.character(predicted_course), "\n")

## Predicted University Course: Medicine

Part 2: Tracking Mathematics Teacher Performance

Objective

Track student performance in Mathematics over time (2021–2023, three terms per year) to assess improvement using trend analysis and statistical tests.

Step 1: Generate and Load Sample Data

Step 2: Calculate Term-Wise Averages

Step 3: Visualize the Trend

Step 4: Linear Regression for Trend Analysis

Step 5: Compare Term 1 vs. Term 3

Step 6: Percentage of Students with Improved Scores

##Step 1: Generate and Load Sample Data We simulate a dataset of 20 students with Math scores across three years and three terms per year, incorporating a slight upward trend.

# Set seed for reproducibility
set.seed(123)

# Define parameters
years <- 2021:2023
terms <- 1:3
students <- paste0("S", 1:20)

# Generate data
data <- expand.grid(Student_ID = students, Year = years, Term = terms)
data$Math_Score <- round(
  rnorm(nrow(data), 
        mean = 60 + 5 * (data$Year - 2021) + 2 * data$Term, 
        sd = 10), 
  0
)

# Ensure scores are between 0 and 100
data$Math_Score <- pmin(pmax(data$Math_Score, 0), 100)

# Save to CSV
write.csv(data, "math_scores.csv", row.names = FALSE)

# Preview data
head(data)

##   Student_ID Year Term Math_Score
## 1         S1 2021    1         56
## 2         S2 2021    1         60
## 3         S3 2021    1         78
## 4         S4 2021    1         63
## 5         S5 2021    1         63
## 6         S6 2021    1         79

##Step 2: Calculate Term-Wise Averages We compute average Math scores for each year and term to analyze trends.

# Calculate averages and create time variable
term_avg <- data %>%
  group_by(Year, Term) %>%
  summarise(Avg_Score = mean(Math_Score, na.rm = TRUE), .groups = "drop") %>%
  mutate(Time = Year + (Term - 1) / 3)

head(term_avg)

## # A tibble: 6 × 4
##    Year  Term Avg_Score  Time
##   <int> <int>     <dbl> <dbl>
## 1  2021     1      63.4 2021 
## 2  2021     2      62.8 2021.
## 3  2021     3      65.4 2022.
## 4  2022     1      66.6 2022 
## 5  2022     2      72.7 2022.
## 6  2022     3      69.2 2023.

Step 3: Visualize the Trend

We plot the average Math scores over time to visualize improvement.

ggplot(term_avg, aes(x = Time, y = Avg_Score)) +
  geom_line(color = "blue") +
  geom_point(color = "blue", size = 3) +
  labs(title = "Mathematics Performance Trend",
       x = "Time (Year + Term/3)",
       y = "Average Math Score") +
  theme_minimal()

Observation: The plot shows a gradual upward trend in average scores, suggesting improvement over time.

Step 4: Linear Regression for Trend Analysis

We use linear regression to test whether the upward trend is statistically significant.

lm_model <- lm(Avg_Score ~ Time, data = term_avg)
summary(lm_model)

## 
## Call:
## lm(formula = Avg_Score ~ Time, data = term_avg)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.3706 -1.3406 -0.3456  0.9344  3.6944 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -9668.5294  1735.4501  -5.571 0.000841 ***
## Time            4.8150     0.8581   5.611 0.000807 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.216 on 7 degrees of freedom
## Multiple R-squared:  0.8181, Adjusted R-squared:  0.7921 
## F-statistic: 31.48 on 1 and 7 DF,  p-value: 0.0008067

# Extract slope and p-value
slope <- coef(lm_model)["Time"]
p_value <- summary(lm_model)$coefficients["Time", "Pr(>|t|)"]

cat("Slope:", round(slope, 2), "\n")

## Slope: 4.81

cat("P-value:", round(p_value, 4), "\n")

## P-value: 8e-04

if (slope > 0 && p_value < 0.05) {
  cat("Significant improvement in performance detected.\n")
} else {
  cat("No significant improvement detected.\n")
}

## Significant improvement in performance detected.

Interpretation: A positive slope with a low p-value (< 0.05) confirms a statistically significant improvement in Math scores over time.

Step 5: Compare Term 1 vs. Term 3

We compare scores between Term 1 and Term 3 within each year to assess within-year improvement.

# Filter and pivot data
term_comparison <- data %>%
  filter(Term %in% c(1, 3)) %>%
  group_by(Year, Term) %>%
  summarise(Avg_Score = mean(Math_Score, na.rm = TRUE), .groups = "drop") %>%
  tidyr::pivot_wider(names_from = Term, values_from = Avg_Score, names_prefix = "Term")

# Paired t-test
t_test <- t.test(term_comparison$Term3, term_comparison$Term1, paired = TRUE)
cat("Paired t-test P-value:", round(t_test$p.value, 4), "\n")

## Paired t-test P-value: 0.0396

if (t_test$p.value < 0.05 && t_test$estimate > 0) {
  cat("Term 3 scores are significantly higher than Term 1.\n")
} else {
  cat("No significant difference between Term 3 and Term 1.\n")
}

## Term 3 scores are significantly higher than Term 1.

Result: A significant p-value indicates that Term 3 scores are consistently higher than Term 1, suggesting within-year progress.

Step 6: Percentage of Students with Improved Scores

We calculate the percentage of students whose scores improved from Term 1 to Term 3 each year.

student_progress <- data %>%
  filter(Term %in% c(1, 3)) %>%
  tidyr::pivot_wider(names_from = Term, values_from = Math_Score, names_prefix = "Term") %>%
  mutate(Improved = Term3 > Term1) %>%
  group_by(Year) %>%
  summarise(Pct_Improved = mean(Improved, na.rm = TRUE) * 100, .groups = "drop")

print(student_progress)

## # A tibble: 3 × 2
##    Year Pct_Improved
##   <int>        <dbl>
## 1  2021           50
## 2  2022           55
## 3  2023           70

Observation: A high percentage of students (e.g., 70–90%) show improved scores from Term 1 to Term 3, reinforcing the trend of progress.

Conclusion

Course Recommendation: The Random Forest model effectively predicts university courses based on science subject performance, with Math and Biology being key predictors. Schools can use this model to guide students toward suitable academic paths.
Performance Tracking: The analysis confirms significant improvement in Mathematics scores over time, both across years and within terms. This suggests effective teaching or curriculum enhancements.

This framework can be extended with real data, additional features (e.g., attendance, extracurriculars), or more advanced models (e.g., neural networks) for enhanced accuracy.

Data Science Applications in Education

Mustapha Yusuf

2025-05-13

Introduction

Part 1: Predicting University Course Recommendations

Objective

Step 1: Generate and Load Sample Data

Step 2: Exploratory Data Analysis

Step 3: Feature Engineering

Step 4: Prepare Data for Modeling

Step 5: Train Random Forest Model

Step 6: Evaluate the Model

Step 7: Feature Importance

Step 8: Predict for a New Student

Step 1: Generate and Load Sample Data

Part 2: Tracking Mathematics Teacher Performance

Objective

Step 1: Generate and Load Sample Data

Step 2: Calculate Term-Wise Averages

Step 3: Visualize the Trend

Step 4: Linear Regression for Trend Analysis

Step 5: Compare Term 1 vs. Term 3

Step 6: Percentage of Students with Improved Scores

Step 3: Visualize the Trend

Step 4: Linear Regression for Trend Analysis

Step 5: Compare Term 1 vs. Term 3

Step 6: Percentage of Students with Improved Scores

Conclusion