import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
# Make plots look nicer
sns.set_theme(style="whitegrid")
# Display all columns
pd.set_option("display.max_columns", None)

Monitoring and Evaluation of Student Performance Drivers and Prediction of Placement Outcomes Using Data Analytics and Machine Learning
Introduction
Educational institutions increasingly rely on data-driven approaches to monitor student performance, evaluate factors affecting academic outcomes, and improve placement success rates.
This project applies Monitoring and Evaluation (M&E) principles, exploratory data analysis, and machine learning techniques to understand the factors influencing student academic performance and placement outcomes.
The analysis focuses on student engagement indicators such as study habits, attendance, assignment completion, sleep patterns, and internet usage. These factors are evaluated to identify key drivers of academic success and develop predictive models for placement outcomes.
Objectives
General Objective
To evaluate factors influencing student academic performance and predict placement outcomes using data analytics and machine learning techniques.
Specific Objectives
- Assess student engagement indicators and their relationship with academic performance.
- Examine patterns in attendance, study habits, and assignment completion.
- Identify the strongest predictors of placement outcomes.
- Develop machine learning models to predict placement status.
- Generate recommendations for improving student success outcomes.
Data Dictionary
| Variable | Meaning |
|---|---|
| study_hours | Average daily study hours |
| attendance_percentage | Student attendance rate |
| sleep_hours | Average sleep duration |
| internet_usage | Daily internet usage hours |
| assignments_completed | Number of assignments completed |
| previous_academic_score | Previous academic performance |
| final_exam_score | Final examination result |
| placement_status | Whether student was placed |
The necessary libraries are imported at the top of the notebook.
Next, load the dataset.
df = pd.read_csv("../data/students_data.csv")

Let’s familiarize ourselves with the data.
df.head()

| | study_hours | attendance_percentage | sleep_hours | internet_usage | assignments_completed | previous_academic_score | final_exam_score | placement_status |
|---|---|---|---|---|---|---|---|---|
| 0 | 7 | 56 | 8 | 7 | 10 | 62 | 100.00 | Placed |
| 1 | 4 | 69 | 5 | 3 | 8 | 56 | 100.00 | Placed |
| 2 | 11 | 60 | 7 | 6 | 10 | 45 | 100.00 | Placed |
| 3 | 8 | 99 | 9 | 8 | 4 | 55 | 90.17 | Placed |
| 4 | 5 | 52 | 8 | 6 | 8 | 40 | 78.82 | Placed |
df.shape
(10000, 8)
The dataset contains 10,000 rows and 8 columns. That is large enough to support trend analysis and predictive modeling. Next, let’s list the available variables to identify the target variables and the candidate predictors.
df.columns
Index(['study_hours', 'attendance_percentage', 'sleep_hours', 'internet_usage',
'assignments_completed', 'previous_academic_score', 'final_exam_score',
'placement_status'],
dtype='str')
The dataset contains behavioral, academic, and outcome-related variables for understanding the factors that influence student academic success.
Next, inspect the data types.
df.info()
<class 'pandas.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 study_hours 10000 non-null int64
1 attendance_percentage 10000 non-null int64
2 sleep_hours 10000 non-null int64
3 internet_usage 10000 non-null int64
4 assignments_completed 10000 non-null int64
5 previous_academic_score 10000 non-null int64
6 final_exam_score 10000 non-null float64
7 placement_status 10000 non-null str
dtypes: float64(1), int64(6), str(1)
memory usage: 625.1 KB
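Interpretation: All 8 columns have 10,000 non-null values, so no variable has missing entries, which is consistent with reliable data capture. Seven columns are numeric and placement_status is a string label.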
DESCRIPTIVE ANALYSIS
df.describe()

| | study_hours | attendance_percentage | sleep_hours | internet_usage | assignments_completed | previous_academic_score | final_exam_score |
|---|---|---|---|---|---|---|---|
| count | 10000.000000 | 10000.00000 | 10000.000000 | 10000.000000 | 10000.000000 | 10000.00000 | 10000.000000 |
| mean | 5.989600 | 69.88460 | 6.498500 | 6.062600 | 9.988400 | 64.91100 | 86.704207 |
| std | 3.163589 | 17.61653 | 1.709354 | 3.138163 | 6.034145 | 17.50302 | 15.058383 |
| min | 1.000000 | 40.00000 | 4.000000 | 1.000000 | 0.000000 | 35.00000 | 26.670000 |
| 25% | 3.000000 | 55.00000 | 5.000000 | 3.000000 | 5.000000 | 50.00000 | 76.727500 |
| 50% | 6.000000 | 70.00000 | 6.500000 | 6.000000 | 10.000000 | 65.00000 | 92.120000 |
| 75% | 9.000000 | 85.00000 | 8.000000 | 9.000000 | 15.000000 | 80.00000 | 100.000000 |
| max | 11.000000 | 100.00000 | 9.000000 | 11.000000 | 20.000000 | 95.00000 | 100.000000 |
Interpretation:
- The final exam score has a mean of 86.7, a median of 92.1, a minimum of 26.7, a maximum of 100, and a standard deviation of 15.1.
- Both the mean and the median are high, suggesting generally strong academic performance across students. The 75th percentile already equals the maximum of 100, so at least a quarter of students scored full marks.
- The mean is lower than the median, which suggests that a small group of low-performing students is pulling the average down.
- Students generally perform well, but the long lower tail points to a possible hidden inequality: a smaller group is significantly underperforming. Whether these are low-engagement students (low attendance, few assignments, little study time) cannot be told from these summary statistics alone.
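The mean-versus-median comparison above can be backed by two quick numbers: the sample skewness and the share of scores sitting at the ceiling. The following is a minimal sketch, not the notebook's own code. The helper name `summarize_skew` and the toy scores are made up for illustration, and the ceiling of 100 is an assumption based on the maximum shown in the summary table.

```python
import pandas as pd


def summarize_skew(scores: pd.Series, ceiling: float = 100.0) -> dict:
    """Return mean, median, sample skewness, and the share of values at the ceiling."""
    return {
        "mean": float(scores.mean()),
        "median": float(scores.median()),
        "skewness": float(scores.skew()),  # negative values indicate a left skew
        "share_at_ceiling": float((scores >= ceiling).mean()),
    }


# Made-up scores for illustration only (not taken from the dataset).
toy_scores = pd.Series([100, 100, 100, 95, 92, 90, 85, 70, 45, 30], dtype=float)
summary = summarize_skew(toy_scores)
print(summary)
```

In the notebook, the corresponding call would be `summarize_skew(df["final_exam_score"])`. A negative skewness together with a mean below the median supports the left-skew reading.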
Monitoring and Evaluation Framework
| Indicator Level | Variable |
|---|---|
| Input Indicator | Study Hours |
| Process Indicator | Attendance Percentage |
| Output Indicator | Assignments Completed |
| Outcome Indicator | Final Exam Score |
| Impact Indicator | Placement Status |
The framework assumes that increased student engagement and academic participation contribute to improved academic performance and placement outcomes.
Data Quality Assessment
Let’s check for missing values.
df.isnull().sum()
study_hours 0
attendance_percentage 0
sleep_hours 0
internet_usage 0
assignments_completed 0
previous_academic_score 0
final_exam_score 0
placement_status 0
dtype: int64
Interpretation: There are no missing values in the data, suggesting reliable data collection processes.
Check for duplicates
df.duplicated().sum()
np.int64(0)
Interpretation: No duplicate records detected, suggesting good data integrity.
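Missing values and duplicates are only two aspects of data quality. A third is whether values fall in plausible ranges. The sketch below counts out-of-range values per column. The ranges in `EXPECTED_RANGES` are assumptions (the source does not state valid ranges), the helper `count_out_of_range` is hypothetical, and the toy frame contains deliberately invalid values for illustration.

```python
import pandas as pd

# Assumed plausible ranges; these are not stated in the source and should be
# adjusted to match the real data dictionary.
EXPECTED_RANGES = {
    "attendance_percentage": (0, 100),
    "final_exam_score": (0, 100),
    "sleep_hours": (0, 24),
    "study_hours": (0, 24),
}


def count_out_of_range(frame: pd.DataFrame, ranges: dict) -> pd.Series:
    """Count values outside [low, high] for each listed column present in the frame."""
    counts = {
        col: int((~frame[col].between(low, high)).sum())
        for col, (low, high) in ranges.items()
        if col in frame.columns
    }
    return pd.Series(counts, dtype="int64")


# Toy frame with two deliberately invalid values (120% attendance, -1 study hours).
toy = pd.DataFrame({
    "attendance_percentage": [56, 69, 120],
    "final_exam_score": [100.0, 90.17, 78.82],
    "sleep_hours": [8, 5, 7],
    "study_hours": [7, 4, -1],
})
violations = count_out_of_range(toy, EXPECTED_RANGES)
print(violations)
```

Applied to the real data, `count_out_of_range(df, EXPECTED_RANGES)` returning all zeros would add range validity to the completeness and uniqueness checks above.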
Check the distribution of the data
sns.histplot(df['final_exam_score'])
Interpretation: The histogram shows a left-skewed distribution: most students perform well, but a small group of low-performing students pulls the average down, indicating inequality in learning outcomes.
Food for Thought:
- Is the system generally strong or unequal?
- Are we masking poor performance with high averages?
- What does this mean for policy or intervention?
My findings so far: The dataset appears to be of high quality, with no missing values and no duplicate records. The wide spread in final exam scores (26.7 to 100) suggests the presence of both high- and low-performing subgroups. The distribution of final exam scores is left-skewed: overall performance is strong, but a smaller group of low-performing students affects equity in learning outcomes.
CORRELATION ANALYSIS
First, note that the target variable for this part of the analysis is “final_exam_score”.
corr = df.corr(numeric_only=True)
corr

| | study_hours | attendance_percentage | sleep_hours | internet_usage | assignments_completed | previous_academic_score | final_exam_score |
|---|---|---|---|---|---|---|---|
| study_hours | 1.000000 | 0.003801 | -0.005255 | 0.006684 | -0.000425 | 0.009451 | 0.562528 |
| attendance_percentage | 0.003801 | 1.000000 | -0.003981 | -0.014539 | -0.012892 | 0.001588 | 0.223367 |
| sleep_hours | -0.005255 | -0.003981 | 1.000000 | -0.000038 | -0.008234 | 0.023916 | 0.144675 |
| internet_usage | 0.006684 | -0.014539 | -0.000038 | 1.000000 | 0.020536 | 0.005371 | -0.151896 |
| assignments_completed | -0.000425 | -0.012892 | -0.008234 | 0.020536 | 1.000000 | 0.004178 | 0.387609 |
| previous_academic_score | 0.009451 | 0.001588 | 0.023916 | 0.005371 | 0.004178 | 1.000000 | 0.318805 |
| final_exam_score | 0.562528 | 0.223367 | 0.144675 | -0.151896 | 0.387609 | 0.318805 | 1.000000 |
import seaborn as sns
import matplotlib.pyplot as plt
plt.figure(figsize=(10,6))
sns.heatmap(corr, annot=True, cmap="coolwarm", center=0)
plt.title("Correlation Matrix - Student Performance")
plt.show()
Interpretations:
- Study hours show the strongest positive relationship with final exam performance (r ≈ 0.56), suggesting that time spent studying is the most important behavioral factor associated with academic outcomes in this dataset: more study hours tend to go with higher scores.
- Assignments completed has a moderate positive relationship with exam performance (r ≈ 0.39), indicating that consistent academic engagement through assignments is associated with better outcomes.
- Prior academic performance moderately predicts final exam outcomes (r ≈ 0.32), suggesting some persistence in academic ability, but also room for change through current behavior.
- Attendance shows a weak positive relationship with performance (r ≈ 0.22), indicating that it plays a supportive but not dominant role. Attendance alone is not enough, but it still matters.
- Sleep hours show a weak positive relationship with performance (r ≈ 0.14), suggesting minimal direct influence on exam outcomes in this dataset. Sleep might matter indirectly, but that is not visible here.
- Internet usage shows a weak negative relationship with exam performance (r ≈ -0.15), suggesting that higher internet usage may be associated with slightly lower academic outcomes.
- The predictors are essentially uncorrelated with one another (all pairwise |r| < 0.03), so multicollinearity is not a concern here.
Overall Insights: Academic performance is most strongly associated with study habits and active engagement rather than with passive factors. Study hours is the strongest correlate of final exam score, followed by assignment completion and prior academic performance. Attendance and sleep show weaker positive associations, while higher internet usage is slightly negatively associated with performance, suggesting possible distraction effects. These are correlations, so they indicate association rather than proof of cause.
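The ranking described above can be read directly off the correlation matrix by sorting the target column. Because the exam score is capped at 100, a rank-based (Spearman) correlation is a reasonable robustness check alongside Pearson. This is a minimal sketch on synthetic data, not a reproduction of the notebook's results. The helper `rank_by_target_correlation` and the generated values are illustrative assumptions.

```python
import numpy as np
import pandas as pd


def rank_by_target_correlation(frame: pd.DataFrame, target: str) -> pd.DataFrame:
    """Pearson and Spearman correlation of each numeric column with the target,
    ordered by absolute Pearson correlation (strongest first)."""
    numeric = frame.select_dtypes("number")
    pearson = numeric.corr(method="pearson")[target].drop(target)
    spearman = numeric.corr(method="spearman")[target].drop(target)
    table = pd.DataFrame({"pearson": pearson, "spearman": spearman})
    order = table["pearson"].abs().sort_values(ascending=False).index
    return table.reindex(order)


# Synthetic data for illustration only: the score rises with study hours and
# falls with internet usage, then is clipped to the 0-100 range.
rng = np.random.default_rng(0)
n = 500
study = rng.integers(1, 12, n)
internet = rng.integers(1, 12, n)
score = np.clip(50 + 4 * study - 2 * internet + rng.normal(0, 5, n), 0, 100)
toy = pd.DataFrame({
    "study_hours": study,
    "internet_usage": internet,
    "final_exam_score": score,
})

ranking = rank_by_target_correlation(toy, "final_exam_score")
print(ranking)
```

On the real data, `rank_by_target_correlation(df, "final_exam_score")` would list the predictors in the order discussed above. If the Pearson and Spearman columns agree closely, the ceiling at 100 is not distorting the ranking.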
MY RECOMMENDATIONS SO FAR:
- Schools should promote structured study routines and time management programs.
- Strengthen assignment tracking and ensure timely feedback.
- Encourage guided academic internet use and digital discipline.
- Focus on active participation, not just attendance tracking.
VISUALIZATION
import matplotlib.pyplot as plt
import seaborn as sns
sns.histplot(df["final_exam_score"], kde=True)
plt.title("Distribution of Final Exam Scores")
plt.show()
Interpretation: Most students are performing well, but a small group of low performers is pulling the average down.
Attendance vs Final Exam Score
sns.scatterplot(x=df["attendance_percentage"], y=df["final_exam_score"])
plt.title("Attendance vs Final Exam Score")
plt.show()
Interpretation: Attendance shows a weak positive relationship with performance (r ≈ 0.22), suggesting that while attendance supports learning, it is not sufficient on its own to guarantee high performance.
Study hours vs Performance
sns.regplot(x="study_hours", y="final_exam_score", data=df)
plt.title("Relationship Between Study Hours and Performance")
plt.show()
Interpretation: There is a clear positive relationship between study hours and exam performance, meaning students who study more tend to perform better, although the variation suggests other factors also play a role.
PREDICTION MODELLING
Objective: Can we predict student placement based on academic behavior and performance?
1.0: Import Machine Learning Tools
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
2.0: Let’s check the target variable
df["placement_status"].value_counts()
placement_status
Placed 8356
Not Placed 1644
Name: count, dtype: int64
The classes are imbalanced: about 84% of students are Placed and 16% are Not Placed. Let’s encode the label as a number (1 = Placed, 0 = Not Placed).
df["placement_status"] = df["placement_status"].map({"Placed": 1, "Not Placed": 0})
Next, let’s select the features for the model. The target is placement status, and every remaining column except final_exam_score is used as a predictor.
# Features: every column except the target and final_exam_score
X = df.drop(["placement_status", "final_exam_score"], axis=1)
# Target: 1 = Placed, 0 = Not Placed
y = df["placement_status"]
I dropped the final_exam_score column to avoid data leakage, because the final exam score often directly determines placement.
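Before splitting, it can help to check how the class imbalance carries over into the train and test sets. The sketch below uses synthetic labels (not the real data) with roughly the same 84/16 mix, and shows that passing `stratify` to `train_test_split` keeps the test-set proportion close to the overall one. The names `X_toy` and `y_toy` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic labels with roughly the 84/16 class mix reported above (illustration only).
y_toy = pd.Series([1] * 836 + [0] * 164, name="placement_status")
X_toy = pd.DataFrame({"feature": np.arange(len(y_toy))})

# stratify=y_toy asks scikit-learn to preserve the class proportions in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

overall_rate = y_toy.mean()
test_rate = y_te.mean()
print(f"overall placed rate: {overall_rate:.3f}, test placed rate: {test_rate:.3f}")
```

Without `stratify`, the test-set class mix depends on the random draw, which makes minority-class metrics such as recall for Not Placed students slightly less stable from one split to another.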
Split the data…
X_train, X_test, y_train, y_test = train_test_split(
X, y,
test_size=0.2,
random_state=42
)
Next, train the model using logistic regression.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
LogisticRegression(max_iter=1000)
Then predictions
y_pred = model.predict(X_test)
Evaluate the model
Accuracy:
accuracy_score(y_test, y_pred)
0.9025
print(classification_report(y_test, y_pred))

| | precision | recall | f1-score | support |
|---|---|---|---|---|
| 0 | 0.78 | 0.61 | 0.68 | 345 |
| 1 | 0.92 | 0.96 | 0.94 | 1655 |
| accuracy | | | 0.90 | 2000 |
| macro avg | 0.85 | 0.79 | 0.81 | 2000 |
| weighted avg | 0.90 | 0.90 | 0.90 | 2000 |
sns.heatmap(confusion_matrix(y_test, y_pred), annot=True, fmt="d")
plt.title("Confusion Matrix")
plt.show()
Insights so far (Not Placed Students Group):
- Accuracy = 0.9025 (≈ 90%): Across both classes, the model predicts placement outcomes correctly about 9 times out of 10.
- Recall = 0.61: The model correctly identifies only 61% of students who are actually Not Placed, so 39% of at-risk students are missed.
- Precision = 0.78: When the model predicts “Not Placed”, it is correct 78% of the time.
- The model struggles most with identifying at-risk students. This is critical in M&E contexts, where early intervention matters.
Insights so far (Placed Students Group):
- Recall = 0.96: The model correctly identifies almost all students who will be placed.
- Precision = 0.92: Most predicted “Placed” students are truly placed.
- The model is highly effective at identifying successful students, but less effective at detecting those at risk of not being placed.
Conclusion so far: The classification model achieves strong predictive performance with an overall accuracy of 90%. However, class-wise evaluation reveals imbalance in performance, with significantly stronger results for the “Placed” category compared to “Not Placed.” While the model effectively identifies successful students, it is less sensitive in detecting at-risk students, correctly identifying only 61% of them. This limitation is important in an M&E context, where early identification of at-risk individuals is critical for intervention planning. Additionally, the dataset exhibits class imbalance, which may contribute to this performance disparity.
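A useful reference point for the 90% accuracy is a baseline that always predicts the majority class. Using the test-set class counts from the support column of the report (345 Not Placed, 1,655 Placed), such a baseline would score 1655 / 2000 = 0.8275 accuracy while identifying none of the Not Placed students. The sketch below shows this with scikit-learn's `DummyClassifier`. The labels are rebuilt from those counts, and the placeholder feature matrix is an assumption made only so the estimator can be fitted.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Labels rebuilt from the support column of the first classification report.
y_true = np.array([0] * 345 + [1] * 1655)

# Placeholder features: the baseline ignores them entirely.
X_placeholder = np.zeros((len(y_true), 1))

# Always predict the most frequent class. It is fitted on these labels purely
# to learn which class is the majority (Placed is also the majority overall).
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_placeholder, y_true)
y_base = baseline.predict(X_placeholder)

baseline_accuracy = accuracy_score(y_true, y_base)
baseline_recall_not_placed = recall_score(y_true, y_base, pos_label=0)
print(f"baseline accuracy: {baseline_accuracy:.4f}")
print(f"baseline recall (Not Placed): {baseline_recall_not_placed:.2f}")
```

Seen against this baseline, the logistic regression gains roughly 7.5 percentage points of accuracy, and its real value lies in the 61% recall on the Not Placed class, which the baseline cannot deliver at all.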
Let’s improve the model
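Let’s tell the model to pay more attention to the minority class by setting class_weight="balanced", and stratify the split so that the train and test sets keep the same class proportions.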
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
X = df.drop(["placement_status", "final_exam_score"], axis=1)
y = df["placement_status"]
X_train, X_test, y_train, y_test = train_test_split(
X, y,
test_size=0.2,
random_state=42,
stratify=y
)
model = LogisticRegression(max_iter=1000, class_weight="balanced")
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))

| | precision | recall | f1-score | support |
|---|---|---|---|---|
| 0 | 0.54 | 0.90 | 0.67 | 329 |
| 1 | 0.98 | 0.85 | 0.91 | 1671 |
| accuracy | | | 0.86 | 2000 |
| macro avg | 0.76 | 0.87 | 0.79 | 2000 |
| weighted avg | 0.90 | 0.86 | 0.87 | 2000 |
Model Performance Update: The improved classification model achieves an overall accuracy of 86%. After addressing class imbalance, the model is much more sensitive to at-risk students, correctly identifying 90% of “Not Placed” cases (up from 61%). This comes at the cost of reduced precision for that class (0.54, down from 0.78): more students who are in fact placed are now flagged as at-risk. The model still identifies successful students reliably, with high precision (0.98) for the “Placed” class, although its recall for that class drops from 0.96 to 0.85. Because the split is now stratified, the test set differs slightly from the first one (329 rather than 345 Not Placed students), so the two reports are not evaluated on identical data.
Interpretation: The model prioritizes early detection of at-risk students, which is critical in monitoring and evaluation contexts where intervention is more important than strict classification accuracy.
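Class weighting is one way to trade precision for recall. Another is to keep the model as is and move the decision threshold on the predicted probability. The sketch below illustrates the idea on synthetic data from `make_classification`, not on the student dataset. The thresholds and all variable names are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data: class 0 plays the role of "Not Placed" (about 16%).
X_syn, y_syn = make_classification(
    n_samples=2000, n_features=6, n_informative=4,
    weights=[0.16, 0.84], random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=42, stratify=y_syn
)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Probability of class 0; the columns of predict_proba follow clf.classes_.
p_not_placed = clf.predict_proba(X_te)[:, list(clf.classes_).index(0)]


def metrics_at(threshold: float) -> tuple:
    """Precision and recall for class 0 when flagging p(class 0) >= threshold."""
    y_hat = np.where(p_not_placed >= threshold, 0, 1)
    precision = precision_score(y_te, y_hat, pos_label=0, zero_division=0)
    recall = recall_score(y_te, y_hat, pos_label=0)
    return precision, recall


results = {t: metrics_at(t) for t in (0.5, 0.3, 0.15)}
for t, (precision, recall) in results.items():
    print(f"threshold={t:.2f}  precision={precision:.2f}  recall={recall:.2f}")
```

Lowering the threshold flags more students as at-risk, so recall for that class can only stay the same or rise, typically at the expense of precision. An M&E team can pick the threshold that matches how many follow-ups it can afford.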
What drives placement? Let’s do feature importance
import pandas as pd
import numpy as np
feature_importance = pd.Series(
model.coef_[0],
index=X.columns
).sort_values()
feature_importance.plot(kind="barh", figsize=(8,5))
plt.title("Feature Importance for Placement Prediction")
plt.show()
Interpretation (the coefficients are fitted on unscaled features, so their magnitudes depend on each variable’s units and should be compared with caution):
- Study hours: Students who dedicate more time to studying are significantly more likely to secure placement. This highlights that individual effort and time investment are the strongest determinants of success in this dataset.
- Sleep hours: Students who maintain adequate sleep patterns tend to perform better in placement outcomes. This suggests that rest and recovery play an important role in cognitive performance and academic success.
- Assignments completed: Students who consistently complete assignments demonstrate higher engagement and understanding, leading to better placement outcomes.
- Previous academic score: While prior performance contributes to placement outcomes, it is not the dominant factor. This suggests that current behavior and engagement are more important than historical academic results.
- Attendance: Being physically present in learning environments contributes slightly to success, but attendance alone is not sufficient to guarantee placement.
- Internet usage: Higher internet usage is associated with lower chances of placement, possibly indicating distraction, reduced study focus, or non-academic screen time.
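Raw logistic regression coefficients are expressed per unit of each feature, so a variable measured in percentage points (such as attendance) gets a smaller coefficient than one measured in hours even when its overall effect is similar. One common remedy is to standardize the features first so that each coefficient refers to a one-standard-deviation change. The sketch below shows this on synthetic data, which was generated only for illustration. It swaps in a `StandardScaler` pipeline that the original notebook does not use.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: two features on very different scales with similar overall effect.
rng = np.random.default_rng(0)
n = 2000
toy_X = pd.DataFrame({
    "study_hours": rng.integers(1, 12, n),
    "attendance_percentage": rng.integers(40, 101, n),
})
logit = 0.3 * (toy_X["study_hours"] - 6) + 0.05 * (toy_X["attendance_percentage"] - 70)
toy_y = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

# Coefficients on the raw scale: change in log-odds per hour or per percentage point.
raw_model = LogisticRegression(max_iter=1000).fit(toy_X, toy_y)
raw_coefs = pd.Series(raw_model.coef_[0], index=toy_X.columns)

# Coefficients after standardization: change in log-odds per one standard deviation.
scaled_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_model.fit(toy_X, toy_y)
std_coefs = pd.Series(scaled_model[-1].coef_[0], index=toy_X.columns)

print(pd.DataFrame({"raw": raw_coefs, "standardized": std_coefs}))
```

In this toy example the raw attendance coefficient looks several times smaller than the study-hours one, while the standardized coefficients are of similar size. Applying the same pipeline to `X` and `y` from the notebook would show whether the ranking in the bar chart survives once units are removed.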
Possible M&E Interventions:
- Interventions that encourage structured study habits are likely to have the highest impact on improving placement outcomes.
- Well-being programs promoting healthy sleep habits may indirectly improve academic and employment outcomes.
- Continuous assessment and assignment tracking are critical for improving student performance.
- Attendance should be complemented with active engagement strategies rather than being used as a sole performance indicator.
- The negative association between internet usage and placement points to the value of digital discipline and controlled online engagement among students.
Overall, the findings suggest that student success is driven more by discipline, consistency, and healthy behavioral habits than by passive or historical academic indicators. This implies that interventions aimed at improving study habits, promoting well-being, and encouraging academic engagement are likely to have the greatest impact on improving placement outcomes.