Monitoring and Evaluation of Student Performance Drivers and Prediction of Placement Outcomes Using Data Analytics and Machine Learning

Introduction

Educational institutions increasingly rely on data-driven approaches to monitor student performance, evaluate factors affecting academic outcomes, and improve placement success rates.

This project applies Monitoring and Evaluation (M&E) principles, exploratory data analysis, and machine learning techniques to understand the factors influencing student academic performance and placement outcomes.

The analysis focuses on student engagement indicators such as study habits, attendance, assignment completion, sleep patterns, and internet usage. These factors are evaluated to identify key drivers of academic success and develop predictive models for placement outcomes.

Objectives

General Objective

To evaluate factors influencing student academic performance and predict placement outcomes using data analytics and machine learning techniques.

Specific Objectives

Assess student engagement indicators and their relationship with academic performance.
Examine patterns in attendance, study habits, and assignment completion.
Identify the strongest predictors of placement outcomes.
Develop machine learning models to predict placement status.
Generate recommendations for improving student success outcomes.

The Dataset Library

Variable	Meaning
study_hours	Average daily study hours
attendance_percentage	Student attendance rate
sleep_hours	Average sleep duration
internet_usage	Daily internet usage hours
assignments_completed	Number of assignments completed
previous_academic_score	Previous academic performance
final_exam_score	Final examination result
placement_status	Whether student was placed

Let’s import the necessary libraries

import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split

# Make plots look nicer
sns.set_theme(style="whitegrid")

# Display all columns
pd.set_option("display.max_columns", None)

Next, load the dataset

df = pd.read_csv("../data/students_data.csv")

Let’s familiarize ourselves with the data now.

df.head()

	study_hours	attendance_percentage	sleep_hours	internet_usage	assignments_completed	previous_academic_score	final_exam_score	placement_status
0	7	56	8	7	10	62	100.00	Placed
1	4	69	5	3	8	56	100.00	Placed
2	11	60	7	6	10	45	100.00	Placed
3	8	99	9	8	4	55	90.17	Placed
4	5	52	8	6	8	40	78.82	Placed

df.shape

(10000, 8)

The dataset contains 10,000 rows and 8 columns. With this many rows, the dataset is large enough to support trend analysis and predictive modeling. Next, let’s list the available variables to identify the target variables and the possible predictors.

df.columns

Index(['study_hours', 'attendance_percentage', 'sleep_hours', 'internet_usage',
       'assignments_completed', 'previous_academic_score', 'final_exam_score',
       'placement_status'],
      dtype='str')

The dataset contains behavioral, academic, and outcome-related variables for understanding the factors that influence student academic success.

Next, data types

df.info()

<class 'pandas.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 8 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   study_hours              10000 non-null  int64  
 1   attendance_percentage    10000 non-null  int64  
 2   sleep_hours              10000 non-null  int64  
 3   internet_usage           10000 non-null  int64  
 4   assignments_completed    10000 non-null  int64  
 5   previous_academic_score  10000 non-null  int64  
 6   final_exam_score         10000 non-null  float64
 7   placement_status         10000 non-null  str    
dtypes: float64(1), int64(6), str(1)
memory usage: 625.1 KB

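Interpretation: All eight variables have 10,000 non-null values, so no column has missing entries, which suggests reliable data capture. Six columns are integers, final_exam_score is a float, and placement_status is text.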

DESCRIPTIVE ANALYSIS

df.describe()

	study_hours	attendance_percentage	sleep_hours	internet_usage	assignments_completed	previous_academic_score	final_exam_score
count	10000.000000	10000.00000	10000.000000	10000.000000	10000.000000	10000.00000	10000.000000
mean	5.989600	69.88460	6.498500	6.062600	9.988400	64.91100	86.704207
std	3.163589	17.61653	1.709354	3.138163	6.034145	17.50302	15.058383
min	1.000000	40.00000	4.000000	1.000000	0.000000	35.00000	26.670000
25%	3.000000	55.00000	5.000000	3.000000	5.000000	50.00000	76.727500
50%	6.000000	70.00000	6.500000	6.000000	10.000000	65.00000	92.120000
75%	9.000000	85.00000	8.000000	9.000000	15.000000	80.00000	100.000000
max	11.000000	100.00000	9.000000	11.000000	20.000000	95.00000	100.000000

Interpretation:
- The final exam score has a mean of 86.7, a median of 92.1, a minimum of 26.7, a maximum of 100, and a standard deviation of 15.1.
- Both the mean and the median are high, suggesting generally strong academic performance across students.
- The mean is lower than the median, which suggests that a smaller group of low-scoring students is pulling the average down.
- The 75th percentile equals the maximum of 100, so at least a quarter of students scored full marks. Scores are bunched at the ceiling.
- Students generally perform well, but the average hides a low-scoring group. Whether that group is also low on engagement (attendance, assignments, study hours) cannot be read from this summary table alone and needs the correlation analysis.
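The mean-versus-median reading above can be backed by a direct skewness figure. The following is a minimal sketch, not part of the original notebook: the helper name summarize_skew and the demo scores are hypothetical, and on the real data it would be called as summarize_skew(df["final_exam_score"]).

```python
import pandas as pd


def summarize_skew(scores: pd.Series) -> dict:
    """Return statistics that describe the shape of a score distribution."""
    return {
        "mean": scores.mean(),
        "median": scores.median(),
        # Negative skewness indicates a longer tail on the low side (left skew)
        "skewness": scores.skew(),
        # Share of students sitting exactly at the highest observed score
        "share_at_max": (scores == scores.max()).mean(),
    }


# Hypothetical scores for illustration only, not taken from the dataset
demo_scores = pd.Series([100.0, 100.0, 100.0, 92.0, 88.0, 75.0, 40.0])
summary = summarize_skew(demo_scores)
print(summary)
```

A negative skewness together with a large share_at_max would point to a ceiling effect as well as a low-scoring tail.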

Monitoring and Evaluation Framework

Indicator Level	Variable
Input Indicator	Study Hours
Process Indicator	Attendance Percentage
Output Indicator	Assignments Completed
Outcome Indicator	Final Exam Score
Impact Indicator	Placement Status

The framework assumes that increased student engagement and academic participation contribute to improved academic performance and placement outcomes.
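One possible way to keep the framework next to the data is a small mapping from indicator level to column. This is only a sketch: the names ME_FRAMEWORK and indicator_summary and the demo rows are hypothetical, and the choice of the mean as the summary measure is an assumption.

```python
import pandas as pd

# Indicator levels from the framework table, mapped to dataset columns
ME_FRAMEWORK = {
    "Input": "study_hours",
    "Process": "attendance_percentage",
    "Output": "assignments_completed",
    "Outcome": "final_exam_score",
    "Impact": "placement_status",
}


def indicator_summary(data: pd.DataFrame) -> pd.DataFrame:
    """Summarize one headline figure per indicator level."""
    rows = []
    for level, column in ME_FRAMEWORK.items():
        series = data[column]
        if pd.api.types.is_numeric_dtype(series):
            measure, value = "mean", series.mean()
        else:
            # Text labels: report the share of students marked "Placed"
            measure, value = "share placed", (series == "Placed").mean()
        rows.append(
            {"level": level, "variable": column, "measure": measure, "value": value}
        )
    return pd.DataFrame(rows)


# Hypothetical rows for illustration only
demo = pd.DataFrame(
    {
        "study_hours": [7, 4, 11, 8],
        "attendance_percentage": [56, 69, 60, 99],
        "assignments_completed": [10, 8, 10, 4],
        "final_exam_score": [95.0, 80.0, 60.0, 90.0],
        "placement_status": ["Placed", "Placed", "Not Placed", "Placed"],
    }
)
framework_summary = indicator_summary(demo)
print(framework_summary)
```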

Data Quality Assessment

Let’s check for missing values.

df.isnull().sum()

study_hours                0
attendance_percentage      0
sleep_hours                0
internet_usage             0
assignments_completed      0
previous_academic_score    0
final_exam_score           0
placement_status           0
dtype: int64

Interpretation: There are no missing values in the data, suggesting reliable data collection processes.

Check for duplicates

df.duplicated().sum()

np.int64(0)

Interpretation: No duplicate records detected, suggesting good data integrity.
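Missing values and duplicates are only two aspects of data quality. A range check can sit alongside them. The sketch below is an illustration under assumptions: the function name quality_report, the ranges in EXPECTED_RANGES, and the demo rows are all hypothetical and should be adapted to the institution's own definitions.

```python
import pandas as pd

# Assumed plausible ranges (hypothetical, adjust as needed)
EXPECTED_RANGES = {
    "attendance_percentage": (0, 100),
    "final_exam_score": (0, 100),
    "sleep_hours": (0, 24),
}


def quality_report(data: pd.DataFrame, ranges: dict = EXPECTED_RANGES) -> dict:
    """Count missing cells, duplicate rows, and out-of-range values."""
    report = {
        "missing_values": int(data.isnull().sum().sum()),
        "duplicate_rows": int(data.duplicated().sum()),
    }
    for column, (low, high) in ranges.items():
        if column in data.columns:
            values = data[column].dropna()  # missing cells are counted separately
            outside = ~values.between(low, high)  # bounds are inclusive
            report[f"{column}_out_of_range"] = int(outside.sum())
    return report


# Hypothetical rows for illustration; the 120 is a deliberate bad value
demo = pd.DataFrame(
    {
        "attendance_percentage": [56, 69, 120],
        "final_exam_score": [100.0, 90.17, 78.82],
        "sleep_hours": [8, 5, 7],
    }
)
report = quality_report(demo)
print(report)
```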

Check the distribution of the data

sns.histplot(df['final_exam_score'])

Interpretation: The histogram shows a left-skewed distribution: most students score high, while a smaller group of low-scoring students forms a long tail that pulls the average down, indicating inequality in learning outcomes.

Food for Thought:
- Is the system generally strong or unequal?
- Are we masking poor performance with high averages?
- What does this mean for policy or intervention?

My findings so far: The dataset appears to be of high quality, with no missing values and no duplicate records. Applying the 1.5 × IQR rule to the summary statistics above, the engagement indicators stay within their fences, while final_exam_score has low values below its lower fence of about 41.8 (the minimum is 26.67). The distribution of final exam scores is left-skewed, indicating that while overall performance is strong, a smaller group of low-performing students affects equity in learning outcomes.
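Statements about outliers can be checked with the common 1.5 × IQR rule. This is a minimal sketch rather than the notebook's own method: the helper name iqr_outlier_counts and the demo values are hypothetical, and on the real data it would be called as iqr_outlier_counts(df).

```python
import pandas as pd


def iqr_outlier_counts(data: pd.DataFrame, k: float = 1.5) -> pd.Series:
    """Count values per numeric column that fall outside the IQR fences."""
    numeric = data.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    # Compare every column against its own lower and upper fence
    below = numeric.lt(q1 - k * iqr, axis="columns")
    above = numeric.gt(q3 + k * iqr, axis="columns")
    return (below | above).sum()


# Hypothetical values for illustration; the 20 is a deliberate low outlier
demo = pd.DataFrame(
    {
        "score": [90, 92, 95, 97, 100, 20],
        "hours": [1, 2, 3, 4, 5, 6],
    }
)
outlier_counts = iqr_outlier_counts(demo)
print(outlier_counts)
```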

CORRELATION ANALYSIS

First, my target variable for the correlation analysis is “final_exam_score”.

corr = df.corr(numeric_only=True)
corr

	study_hours	attendance_percentage	sleep_hours	internet_usage	assignments_completed	previous_academic_score	final_exam_score
study_hours	1.000000	0.003801	-0.005255	0.006684	-0.000425	0.009451	0.562528
attendance_percentage	0.003801	1.000000	-0.003981	-0.014539	-0.012892	0.001588	0.223367
sleep_hours	-0.005255	-0.003981	1.000000	-0.000038	-0.008234	0.023916	0.144675
internet_usage	0.006684	-0.014539	-0.000038	1.000000	0.020536	0.005371	-0.151896
assignments_completed	-0.000425	-0.012892	-0.008234	0.020536	1.000000	0.004178	0.387609
previous_academic_score	0.009451	0.001588	0.023916	0.005371	0.004178	1.000000	0.318805
final_exam_score	0.562528	0.223367	0.144675	-0.151896	0.387609	0.318805	1.000000

import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(10,6))
sns.heatmap(corr, annot=True, cmap="coolwarm", center=0)
plt.title("Correlation Matrix - Student Performance")
plt.show()

Interpretations:
- Study hours have the strongest positive relationship with final exam performance (r = 0.56), making time spent studying the most important behavioral factor in this dataset: more study hours are associated with higher scores.
- Assignments completed has a moderate positive relationship with exam performance (r = 0.39), indicating that consistent academic engagement through assignments is associated with better outcomes.
- Prior academic performance moderately predicts final exam outcomes (r = 0.32), suggesting some persistence in academic ability but also room for change through current behavior.
- Attendance shows a weak positive relationship with performance (r = 0.22), so it plays a supportive but not dominant role. Attendance alone is not enough, but it still matters.
- Sleep hours show a weak positive relationship with performance (r = 0.14), suggesting little direct influence on exam outcomes in this dataset. Sleep might matter indirectly, but that is not visible here.
- Internet usage shows a weak negative relationship with exam performance (r = -0.15), suggesting that higher internet usage may be associated with slightly lower academic outcomes.
- The predictors are almost uncorrelated with one another (every pairwise |r| is below 0.03), so each relationship above is largely independent of the others.

Overall Insights: Student performance is most strongly associated with study hours, followed by assignment completion and prior academic performance. Attendance and sleep show weaker positive associations, and higher internet usage is slightly negatively associated with performance, suggesting possible distraction effects. Overall, the results indicate that active study behavior and academic engagement matter more than passive factors such as attendance alone. These are correlations, so they describe association and do not by themselves prove cause.
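To make the ranking explicit, the correlations with final_exam_score can be squared to give the share of variance each variable accounts for on its own. The sketch below copies the coefficients from the correlation matrix output; the variable names r and ranked are illustrative.

```python
import pandas as pd

# Correlations with final_exam_score, copied from the matrix output
r = pd.Series(
    {
        "study_hours": 0.562528,
        "attendance_percentage": 0.223367,
        "sleep_hours": 0.144675,
        "internet_usage": -0.151896,
        "assignments_completed": 0.387609,
        "previous_academic_score": 0.318805,
    }
)

# r squared is the share of variance in the score linked to each variable alone
ranked = pd.DataFrame({"r": r, "r_squared": r**2}).sort_values(
    "r_squared", ascending=False
)
print(ranked.round(3))
```

On these figures, study hours alone account for roughly 32% of the variance in final exam scores, while sleep and internet usage each account for about 2%.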

MY RECOMMENDATIONS SO FAR:
- Schools should promote structured study routines and time management programs.
- Strengthen assignment tracking and ensure timely feedback.
- Encourage guided/academic internet use and digital discipline.
- Focus on active participation, not just attendance tracking.

VISUALIZATION

import matplotlib.pyplot as plt
import seaborn as sns

sns.histplot(df["final_exam_score"], kde=True)
plt.title("Distribution of Final Exam Scores")
plt.show()

Interpretation: Most students are performing well, but a small group of low performers is pulling the average down.

Attendance vs Final Exam Score

sns.scatterplot(x=df["attendance_percentage"], y=df["final_exam_score"])
plt.title("Attendance vs Final Exam Score")
plt.show()

Interpretation: Attendance shows a weak positive relationship with performance (r = 0.22), suggesting that while attendance supports learning, it is not sufficient on its own to guarantee high performance.

Study hours vs Performance

sns.regplot(x="study_hours", y="final_exam_score", data=df)
plt.title("Relationship Between Study Hours and Performance")
plt.show()

Interpretation: There is a clear positive relationship between study hours and exam performance, meaning students who study more tend to perform better, although the spread around the line suggests other factors also play a role.
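The slope of the regression line drawn by regplot can be recovered from figures already reported, because for a simple linear fit the slope equals r × (sd of y / sd of x). The sketch below uses the values from df.describe() and the correlation matrix; the variable names are illustrative, and a straight line is only an approximation here because scores are capped at 100.

```python
# Values copied from the describe() and corr() outputs
r = 0.562528                       # correlation: study_hours vs final_exam_score
sd_study, sd_score = 3.163589, 15.058383
mean_study, mean_score = 5.989600, 86.704207

# Ordinary least squares identities for a single predictor
slope = r * sd_score / sd_study
intercept = mean_score - slope * mean_study

print(f"Each extra daily study hour is associated with about {slope:.2f} more points")
print(f"Fitted line: score = {intercept:.2f} + {slope:.2f} * study_hours")
```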

PREDICTION MODELLING

Objective: Can we predict student placement based on academic behavior and performance?

1.0: Import Machine Learning Tools

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

2.0: Let’s check the target variable

df["placement_status"].value_counts()

placement_status
Placed        8356
Not Placed    1644
Name: count, dtype: int64

Let’s convert it into numerical variables

df["placement_status"] = df["placement_status"].map({"Placed":1, "Not Placed":0})
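One caveat with .map() is that any label missing from the dictionary silently becomes NaN, which also happens if the cell is run a second time on values that are already 0 and 1. A guarded variant might look like the sketch below; the function name encode_placement and the demo labels are hypothetical.

```python
import pandas as pd


def encode_placement(status: pd.Series) -> pd.Series:
    """Map placement labels to 1/0 and fail loudly on unexpected values."""
    mapping = {"Placed": 1, "Not Placed": 0}
    # Strip stray whitespace from text labels, leave other values untouched
    cleaned = status.map(lambda v: v.strip() if isinstance(v, str) else v)
    encoded = cleaned.map(mapping)
    unexpected = cleaned[encoded.isna()].unique()
    if len(unexpected) > 0:
        raise ValueError(f"Unexpected placement labels: {list(unexpected)}")
    return encoded.astype(int)


# Hypothetical labels for illustration, including stray whitespace
demo_status = pd.Series(["Placed", " Not Placed", "Placed"])
encoded_demo = encode_placement(demo_status)
print(encoded_demo.tolist())
```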

Next, let’s select the features for the model. The goal is to predict placement status.

X = df[[
    "study_hours",
    "attendance_percentage",
    "sleep_hours",
    "internet_usage",
    "assignments_completed",
    "previous_academic_score",
    "final_exam_score"
]]

y = df["placement_status"]

Define the X and y variables. This definition supersedes the selection above, which still included final_exam_score.

X = df.drop(["placement_status", "final_exam_score"], axis=1)
y = df["placement_status"]

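I dropped the final_exam_score column to avoid data leakage, because the final exam score may directly determine placement. A model that sees it could look accurate without telling us anything about the behaviors that lead to placement.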
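The leakage concern can be examined rather than assumed by looking at the placement rate within bands of final exam score. The following is a sketch under assumptions: the function name placement_rate_by_band, the band edges, and the demo rows are hypothetical, and placement_status is expected to be already encoded as 1/0.

```python
import pandas as pd


def placement_rate_by_band(data: pd.DataFrame, bins=(0, 50, 70, 85, 100)) -> pd.Series:
    """Share of placed students within each final exam score band."""
    bands = pd.cut(data["final_exam_score"], bins=list(bins), include_lowest=True)
    return data.groupby(bands, observed=True)["placement_status"].mean()


# Hypothetical rows for illustration only
demo = pd.DataFrame(
    {
        "final_exam_score": [30, 45, 60, 65, 80, 90, 100, 100],
        "placement_status": [0, 0, 0, 1, 1, 1, 1, 1],
    }
)
rates = placement_rate_by_band(demo)
print(rates)
```

If the rate jumps from near 0 to near 1 at a single score, the score is effectively a restatement of the outcome, and leaving it out of the features is the safer choice.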

Split the data…

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42
)

Next, train the model through Logistic Regression

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

LogisticRegression(max_iter=1000)


Then predictions

y_pred = model.predict(X_test)

Evaluate the model

Accuracy:

accuracy_score(y_test, y_pred)

0.9025

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.78      0.61      0.68       345
           1       0.92      0.96      0.94      1655

    accuracy                           0.90      2000
   macro avg       0.85      0.79      0.81      2000
weighted avg       0.90      0.90      0.90      2000

sns.heatmap(confusion_matrix(y_test, y_pred), annot=True, fmt="d")
plt.title("Confusion Matrix")
plt.show()

Insights so far (Not Placed Students Group):
- Accuracy = 0.9025 (≈ 90%): The model correctly predicts placement outcomes about 9 out of 10 times.
- Recall = 0.61: The model correctly identifies only 61% of students who are actually NOT placed. This means 39% of at-risk students are being missed.
- Precision = 0.78: When the model predicts “Not Placed”, it is correct 78% of the time.
- The model struggles most with identifying at-risk students, which is critical in M&E contexts where early intervention matters.

Insights so far (Placed Students Group):
- Recall = 0.96: The model correctly identifies almost all students who will be placed.
- Precision = 0.92: Most predicted “Placed” students are truly placed.
- The model is highly effective at identifying successful students, but less effective at detecting those at risk of not being placed.
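The f1-score column in the report is the harmonic mean of precision and recall, and the recall figure translates directly into a headcount of missed students. The short check below uses the rounded values from the classification report, so the results are approximate; the function name f1_score_from is illustrative.

```python
def f1_score_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Rounded values copied from the first classification report
not_placed_f1 = f1_score_from(precision=0.78, recall=0.61)
placed_f1 = f1_score_from(precision=0.92, recall=0.96)

# Approximate number of "Not Placed" students the model fails to flag
not_placed_support = 345
missed_students = round((1 - 0.61) * not_placed_support)

print(f"F1 (Not Placed): {not_placed_f1:.2f}")
print(f"F1 (Placed): {placed_f1:.2f}")
print(f"At-risk students missed in the test set: about {missed_students}")
```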

Conclusion so far: The classification model achieves strong predictive performance with an overall accuracy of 90%. However, class-wise evaluation reveals imbalance in performance, with significantly stronger results for the “Placed” category compared to “Not Placed.” While the model effectively identifies successful students, it is less sensitive in detecting at-risk students, correctly identifying only 61% of them. This limitation is important in an M&E context, where early identification of at-risk individuals is critical for intervention planning. Additionally, the dataset exhibits class imbalance, which may contribute to this performance disparity.
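With imbalanced classes, accuracy is easier to judge against a baseline that always predicts the majority class. The arithmetic below uses the support counts and recalls from the classification report; the variable names are illustrative.

```python
# Test-set class sizes copied from the first classification report
support = {"Not Placed": 345, "Placed": 1655}

# Accuracy of a trivial model that always predicts the majority class
baseline_accuracy = max(support.values()) / sum(support.values())

model_accuracy = 0.9025
lift_over_baseline = model_accuracy - baseline_accuracy

# Balanced accuracy is the average of the per-class recalls
balanced_accuracy = (0.61 + 0.96) / 2

print(f"Majority-class baseline: {baseline_accuracy:.4f}")
print(f"Lift over baseline: {lift_over_baseline:.4f}")
print(f"Balanced accuracy: {balanced_accuracy:.3f}")
```

On these figures the model is about 7.5 percentage points better than always predicting “Placed”, which is a more modest gain than the 90% headline suggests.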

Let’s improve the model

Let’s tell the model to pay more attention to the minority class

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

X = df.drop(["placement_status", "final_exam_score"], axis=1)
y = df["placement_status"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,
    stratify=y
)

model = LogisticRegression(max_iter=1000, class_weight="balanced")
model.fit(X_train, y_train)

y_pred = model.predict(X_test)

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.54      0.90      0.67       329
           1       0.98      0.85      0.91      1671

    accuracy                           0.86      2000
   macro avg       0.76      0.87      0.79      2000
weighted avg       0.90      0.86      0.87      2000

Model Performance Update: The improved classification model achieves an overall accuracy of 86%. After addressing class imbalance, the model is far more sensitive to at-risk students, correctly identifying 90% of “Not Placed” cases. This improvement comes at the cost of reduced precision for that class (0.54): roughly 46% of students flagged as “Not Placed” were actually placed, so false alarms increase. The model still identifies successful students reliably, with high precision (0.98) for the “Placed” class, although its recall for that class drops to 0.85. Because this run also uses a stratified split, its test set differs slightly from the first one (329 versus 345 “Not Placed” students), so the two reports are not an exact like-for-like comparison.

Interpretation: The model prioritizes early detection of at-risk students, which is critical in monitoring and evaluation contexts where intervention is more important than strict classification accuracy.
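For reference, class_weight="balanced" weights each class as n_samples / (n_classes * np.bincount(y)), as described in the scikit-learn documentation. The sketch below applies that formula to the training-set counts, which are derived here by subtracting the stratified test support (329 and 1671) from the full class counts (1644 and 8356); the variable names are illustrative.

```python
import numpy as np

# Training-set class counts: index 0 = Not Placed, index 1 = Placed
# Derived as full counts minus test support: 1644 - 329 and 8356 - 1671
class_counts = np.array([1315, 6685])

n_samples = class_counts.sum()
n_classes = len(class_counts)

# Same formula scikit-learn documents for class_weight="balanced"
weights = n_samples / (n_classes * class_counts)

print(f"Weight for Not Placed: {weights[0]:.3f}")
print(f"Weight for Placed: {weights[1]:.3f}")
```

Each “Not Placed” student therefore counts roughly five times as much as a “Placed” student during training, which explains the shift toward higher recall for the minority class.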

What drives placement? Let’s do feature importance

import pandas as pd
import numpy as np

feature_importance = pd.Series(
    model.coef_[0],
    index=X.columns
).sort_values()

feature_importance.plot(kind="barh", figsize=(8,5))
plt.title("Feature Importance for Placement Prediction")
plt.show()

Interpretation:
- Study hours: Students who dedicate more time to studying are more likely to secure placement. This highlights individual effort and time investment as the strongest factors in this dataset.
- Sleep hours: Students who maintain adequate sleep patterns tend to have better placement outcomes. This suggests that rest and recovery may play a role in cognitive performance and academic success.
- Assignments completed: Students who consistently complete assignments show higher engagement and are more likely to be placed.
- Previous academic score: Prior performance contributes to placement outcomes, but it is not the dominant factor. This suggests that current behavior and engagement matter more than historical academic results.
- Attendance: Being physically present in learning environments contributes slightly to success, but attendance alone is not sufficient to guarantee placement.
- Internet usage: Higher internet usage is associated with lower chances of placement, possibly indicating distraction, reduced study focus, or non-academic screen time.

Note: The model was fitted on unscaled features, so each coefficient is a per-unit effect (per hour, per percentage point, per assignment). Coefficient sizes are therefore not directly comparable across features measured on different scales, and the ranking above should be read with that in mind.
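Raw logistic regression coefficients depend on each feature's units, so a fairer comparison standardizes the features first. The following is a sketch of that alternative, not the method used in the notebook: the function name standardized_coefficients is hypothetical and the demo data is synthetic. On the real data it would be called with X_train and y_train.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def standardized_coefficients(features: pd.DataFrame, target: pd.Series) -> pd.Series:
    """Fit on z-scored features so coefficients share one scale."""
    pipeline = make_pipeline(
        StandardScaler(),
        LogisticRegression(max_iter=1000, class_weight="balanced"),
    )
    pipeline.fit(features, target)
    coefs = pipeline.named_steps["logisticregression"].coef_[0]
    # Each value is the change in log-odds per one standard deviation
    return pd.Series(coefs, index=features.columns).sort_values()


# Synthetic data for illustration: the outcome depends on study_hours only
rng = np.random.default_rng(0)
demo_X = pd.DataFrame(
    {
        "study_hours": rng.integers(1, 12, size=500),
        "attendance_percentage": rng.integers(40, 101, size=500),
    }
)
demo_y = (demo_X["study_hours"] + rng.normal(0, 2, size=500) > 6).astype(int)

demo_coefs = standardized_coefficients(demo_X, demo_y)
print(demo_coefs)
```

Applying np.exp to these coefficients gives odds ratios per standard deviation, which are often easier to communicate to non-technical stakeholders.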

Possible M&E Interventions:
- Interventions that encourage structured study habits are likely to have the highest impact on improving placement outcomes.
- Well-being programs promoting healthy sleep habits may indirectly improve academic and employment outcomes.
- Continuous assessment and assignment tracking are critical for improving student performance.
- Attendance should be complemented with active engagement strategies rather than being used as a sole performance indicator.
- The negative association with internet usage points to the value of digital discipline and guided online engagement among students.

Overall, the findings suggest that student success is driven more by discipline, consistency, and healthy behavioral habits than by passive or historical academic indicators. This implies that interventions aimed at improving study habits, promoting well-being, and encouraging academic engagement are likely to have the greatest impact on improving placement outcomes.
