Let’s import the necessary libraries.
import pandas as pd
import numpy as np
Next, load the dataset.
df = pd.read_csv("../data/students_data.csv")
Let’s familiarize ourselves with the data now.
df.head()
| | study_hours | attendance_percentage | sleep_hours | internet_usage | assignments_completed | previous_academic_score | final_exam_score | placement_status |
|---|---|---|---|---|---|---|---|---|
| 0 | 7 | 56 | 8 | 7 | 10 | 62 | 100.00 | Placed |
| 1 | 4 | 69 | 5 | 3 | 8 | 56 | 100.00 | Placed |
| 2 | 11 | 60 | 7 | 6 | 10 | 45 | 100.00 | Placed |
| 3 | 8 | 99 | 9 | 8 | 4 | 55 | 90.17 | Placed |
| 4 | 5 | 52 | 8 | 6 | 8 | 40 | 78.82 | Placed |
df.shape
(10000, 8)
The dataset contains 10,000 rows and 8 columns. With this many rows, it is large enough to support trend analysis and predictive modeling. Next, let’s list the available variables to identify the target variables and possible predictors.
df.columns
Index(['study_hours', 'attendance_percentage', 'sleep_hours', 'internet_usage',
'assignments_completed', 'previous_academic_score', 'final_exam_score',
'placement_status'],
dtype='str')
The dataset contains behavioral, academic, and outcome-related variables that can help explain which factors influence student academic success.
Next, let’s inspect the data types.
df.info()
<class 'pandas.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 study_hours 10000 non-null int64
1 attendance_percentage 10000 non-null int64
2 sleep_hours 10000 non-null int64
3 internet_usage 10000 non-null int64
4 assignments_completed 10000 non-null int64
5 previous_academic_score 10000 non-null int64
6 final_exam_score 10000 non-null float64
7 placement_status 10000 non-null str
dtypes: float64(1), int64(6), str(1)
memory usage: 625.1 KB
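Interpretation: Every column has 10,000 non-null entries, so the data is complete, which suggests consistent data capture. Seven columns are numeric (six int64, one float64) and placement_status is a string column.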
DESCRIPTIVE ANALYSIS
df.describe()
| | study_hours | attendance_percentage | sleep_hours | internet_usage | assignments_completed | previous_academic_score | final_exam_score |
|---|---|---|---|---|---|---|---|
| count | 10000.000000 | 10000.00000 | 10000.000000 | 10000.000000 | 10000.000000 | 10000.00000 | 10000.000000 |
| mean | 5.989600 | 69.88460 | 6.498500 | 6.062600 | 9.988400 | 64.91100 | 86.704207 |
| std | 3.163589 | 17.61653 | 1.709354 | 3.138163 | 6.034145 | 17.50302 | 15.058383 |
| min | 1.000000 | 40.00000 | 4.000000 | 1.000000 | 0.000000 | 35.00000 | 26.670000 |
| 25% | 3.000000 | 55.00000 | 5.000000 | 3.000000 | 5.000000 | 50.00000 | 76.727500 |
| 50% | 6.000000 | 70.00000 | 6.500000 | 6.000000 | 10.000000 | 65.00000 | 92.120000 |
| 75% | 9.000000 | 85.00000 | 8.000000 | 9.000000 | 15.000000 | 80.00000 | 100.000000 |
| max | 11.000000 | 100.00000 | 9.000000 | 11.000000 | 20.000000 | 95.00000 | 100.000000 |
Interpretation:
- The final exam score has a mean of 86.7, a median of 92.1, a minimum of 26.7, a maximum of 100, and a standard deviation of 15.1.
- Both the mean and the median are high, suggesting generally strong academic performance across students.
- The 75th percentile equals the maximum (100), so at least a quarter of students achieved the top score.
- The mean is lower than the median, which indicates a left-skewed distribution: a smaller group of low-scoring students pulls the average down.
- This points to a possible hidden inequality, where a smaller group of low-engagement students (low attendance, few completed assignments, little study time) underperforms. The summary statistics alone cannot confirm this link, so it needs to be checked against the other variables.
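The mean-versus-median reading above can be backed by two direct measurements: the sample skewness and the share of scores sitting at the maximum. The following is a minimal sketch, not part of the original notebook. The helper name `summarize_skew`, the `ceiling` parameter, and the toy scores are assumptions made for illustration only.

```python
import pandas as pd


def summarize_skew(scores: pd.Series, ceiling: float = 100.0) -> dict:
    """Return mean, median, sample skewness, and the share of scores at the ceiling."""
    return {
        "mean": scores.mean(),
        "median": scores.median(),
        "skew": scores.skew(),  # a negative value means a longer left tail
        "share_at_ceiling": (scores >= ceiling).mean(),
    }


# Toy scores for illustration only; these are NOT values from the dataset.
toy_scores = pd.Series([100.0, 100.0, 100.0, 95.0, 90.0, 85.0, 60.0, 30.0])
summary = summarize_skew(toy_scores)
print(summary)
```

In the notebook, the same helper could be called as `summarize_skew(df["final_exam_score"])`. A negative skew together with a large share at the ceiling would be consistent with the interpretation given here.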
Let’s check for missing values.
df.isnull().sum()
study_hours 0
attendance_percentage 0
sleep_hours 0
internet_usage 0
assignments_completed 0
previous_academic_score 0
final_exam_score 0
placement_status 0
dtype: int64
Interpretation: There are no missing values in any column.
Next, check for duplicate rows.
df.duplicated().sum()
np.int64(0)
Interpretation: The data contains no duplicate rows.
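The missing-value and duplicate checks can be bundled into one reusable report, which is handy when the same checks are rerun after cleaning. This is a sketch under assumptions: the function name `quality_report` and the toy frame are hypothetical and not part of the original notebook.

```python
import pandas as pd


def quality_report(frame: pd.DataFrame) -> dict:
    """Collect row count, missing values per column, and duplicate row count."""
    return {
        "rows": len(frame),
        "missing_per_column": frame.isnull().sum().to_dict(),
        "duplicate_rows": int(frame.duplicated().sum()),
    }


# Toy frame for illustration only: one missing value and one repeated row.
toy = pd.DataFrame(
    {
        "study_hours": [7, 4, 4, None],
        "sleep_hours": [8, 5, 5, 6],
    }
)
report = quality_report(toy)
print(report)
```

Calling `quality_report(df)` on the students dataset should report zero missing values and zero duplicate rows, matching the outputs shown in this section.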
The Data Dictionary
| Variable | Meaning |
|---|---|
| study_hours | Average daily study hours |
| attendance_percentage | Student attendance rate |
| sleep_hours | Average sleep duration |
| internet_usage | Daily internet usage hours |
| assignments_completed | Number of assignments completed |
| previous_academic_score | Previous academic performance |
| final_exam_score | Final examination result |
| placement_status | Whether student was placed |
Identify Target Variables
Regression Target: final_exam_score
Classification Target: placement_status
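Once the two targets are named, the predictors can be separated from them before any modeling. The sketch below drops both target columns from the feature matrix, which is one possible choice and an assumption here; whether `final_exam_score` should serve as a predictor for placement is a separate modeling decision. The function name, the constants, and the toy rows (including the "Not Placed" label) are hypothetical.

```python
import pandas as pd

REGRESSION_TARGET = "final_exam_score"
CLASSIFICATION_TARGET = "placement_status"


def split_features_targets(frame: pd.DataFrame):
    """Return (features, regression target, classification target)."""
    features = frame.drop(columns=[REGRESSION_TARGET, CLASSIFICATION_TARGET])
    return features, frame[REGRESSION_TARGET], frame[CLASSIFICATION_TARGET]


# Toy rows for illustration only; "Not Placed" is a hypothetical label.
toy = pd.DataFrame(
    {
        "study_hours": [7, 4, 11],
        "attendance_percentage": [56, 69, 60],
        "final_exam_score": [100.0, 100.0, 78.82],
        "placement_status": ["Placed", "Placed", "Not Placed"],
    }
)
X, y_reg, y_clf = split_features_targets(toy)
print(list(X.columns))
```

In the notebook this would be `X, y_reg, y_clf = split_features_targets(df)`.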
OBSERVATIONS SO FAR:
- The dataset contains academic and behavioral indicators.
- No missing values or duplicate rows were found.
- Placement status appears suitable for classification modeling.
- Final exam score can be modeled as a regression problem.
- Attendance and assignment completion may strongly influence outcomes; this still needs to be tested.
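The last observation is a hypothesis, and a first check is to rank the numeric predictors by their Pearson correlation with the final exam score. The following is a rough sketch with a hypothetical helper name and toy values; correlation only measures linear association and does not establish influence.

```python
import pandas as pd


def correlations_with_target(frame: pd.DataFrame, target: str) -> pd.Series:
    """Pearson correlation of each numeric column with the target, highest first."""
    numeric = frame.select_dtypes(include="number")  # skips string columns
    return numeric.corr()[target].drop(target).sort_values(ascending=False)


# Toy values for illustration only; these are NOT values from the dataset.
toy = pd.DataFrame(
    {
        "attendance_percentage": [40, 60, 80, 100],
        "internet_usage": [9, 7, 4, 2],
        "final_exam_score": [50.0, 65.0, 80.0, 95.0],
        "placement_status": ["Placed", "Placed", "Placed", "Placed"],
    }
)
ranking = correlations_with_target(toy, "final_exam_score")
print(ranking)
```

Running `correlations_with_target(df, "final_exam_score")` on the real data would show whether attendance and assignment completion actually rank near the top.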