Let’s import the necessary libraries

import pandas as pd
import numpy as np

Next, load the dataset

df = pd.read_csv("../data/students_data.csv")

Let’s familiarize ourselves with the data.

df.head()
   study_hours  attendance_percentage  sleep_hours  internet_usage  assignments_completed  previous_academic_score  final_exam_score  placement_status
0            7                     56            8               7                     10                       62            100.00            Placed
1            4                     69            5               3                      8                       56            100.00            Placed
2           11                     60            7               6                     10                       45            100.00            Placed
3            8                     99            9               8                      4                       55             90.17            Placed
4            5                     52            8               6                      8                       40             78.82            Placed
df.shape
(10000, 8)

The dataset contains 10,000 rows and 8 columns. With this many rows, it is large enough to support trend analysis and predictive modeling. Next, let’s list the available variables to identify the target variables and possible predictors.

df.columns
Index(['study_hours', 'attendance_percentage', 'sleep_hours', 'internet_usage',
       'assignments_completed', 'previous_academic_score', 'final_exam_score',
       'placement_status'],
      dtype='str')

The dataset contains behavioral, academic, and outcome-related variables that can be used to study the factors influencing student academic success.

Next, let’s check the data types.

df.info()
<class 'pandas.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 8 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   study_hours              10000 non-null  int64  
 1   attendance_percentage    10000 non-null  int64  
 2   sleep_hours              10000 non-null  int64  
 3   internet_usage           10000 non-null  int64  
 4   assignments_completed    10000 non-null  int64  
 5   previous_academic_score  10000 non-null  int64  
 6   final_exam_score         10000 non-null  float64
 7   placement_status         10000 non-null  str    
dtypes: float64(1), int64(6), str(1)
memory usage: 625.1 KB

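Interpretation: All 8 columns have 10,000 non-null values, so no values are missing. Seven columns are numeric (six int64, one float64), and placement_status is a string column.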

DESCRIPTIVE ANALYSIS

df.describe()
        study_hours  attendance_percentage   sleep_hours  internet_usage  assignments_completed  previous_academic_score  final_exam_score
count  10000.000000            10000.00000  10000.000000    10000.000000           10000.000000              10000.00000      10000.000000
mean       5.989600               69.88460      6.498500        6.062600               9.988400                 64.91100         86.704207
std        3.163589               17.61653      1.709354        3.138163               6.034145                 17.50302         15.058383
min        1.000000               40.00000      4.000000        1.000000               0.000000                 35.00000         26.670000
25%        3.000000               55.00000      5.000000        3.000000               5.000000                 50.00000         76.727500
50%        6.000000               70.00000      6.500000        6.000000              10.000000                 65.00000         92.120000
75%        9.000000               85.00000      8.000000        9.000000              15.000000                 80.00000        100.000000
max       11.000000              100.00000      9.000000       11.000000              20.000000                 95.00000        100.000000

Interpretation:
- The final exam score has a mean of 86.7, a median of 92.1, a minimum of 26.7, a maximum of 100, and a standard deviation of 15.1.
- Both the mean and the median are high, suggesting generally strong academic performance. The 75th percentile equals the maximum (100), so at least a quarter of students achieved the top score.
- The mean is lower than the median, which points to a left-skewed distribution: a smaller group of low-scoring students is pulling the average down.
- These low scorers may be low-engagement students (low attendance, few assignments completed, few study hours), but the summary statistics alone cannot confirm this. It needs to be checked against those variables.
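The mean-versus-median reasoning above can be checked more directly with the sample skewness and the share of students at the maximum score. The following is a minimal sketch: the helper name `summarize_skew`, the `ceiling` parameter, and the toy values are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd


def summarize_skew(scores: pd.Series, ceiling: float = 100.0) -> dict:
    """Return simple shape statistics for a score column."""
    return {
        "mean": scores.mean(),
        "median": scores.median(),
        # Negative skewness indicates a longer tail on the low side.
        "skewness": scores.skew(),
        # Fraction of values at (or above) the assumed maximum score.
        "share_at_ceiling": (scores >= ceiling).mean(),
    }


# Toy values for illustration only (not taken from the dataset).
toy_scores = pd.Series([100.0, 100.0, 100.0, 92.0, 78.0, 40.0])
print(summarize_skew(toy_scores))

# In the notebook, the same helper would be applied to the real column:
# summarize_skew(df["final_exam_score"])
```

A negative skewness together with a large share at the ceiling would support the reading that most students score very high while a smaller group scores much lower.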

Let’s check for missing values.

df.isnull().sum()
study_hours                0
attendance_percentage      0
sleep_hours                0
internet_usage             0
assignments_completed      0
previous_academic_score    0
final_exam_score           0
placement_status           0
dtype: int64

Interpretation: There are no missing values.

Check for duplicates

df.duplicated().sum()
np.int64(0)

Interpretation: There are no duplicate rows in the data.

The Dataset Library

Variable                 Meaning
study_hours              Average daily study hours
attendance_percentage    Student attendance rate
sleep_hours              Average sleep duration
internet_usage           Daily internet usage hours
assignments_completed    Number of assignments completed
previous_academic_score  Previous academic performance
final_exam_score         Final examination result
placement_status         Whether student was placed
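With the meaning of each variable written down, a quick range check can confirm that the values are plausible. The sketch below is only an illustration: the bounds in `ASSUMED_BOUNDS` are assumptions (a percentage and a score out of 100), and the helper name and toy frame are hypothetical.

```python
import pandas as pd

# Assumed plausible bounds; these are not stated anywhere in the dataset.
ASSUMED_BOUNDS = {
    "attendance_percentage": (0, 100),
    "final_exam_score": (0, 100),
}


def out_of_range_counts(frame: pd.DataFrame, bounds: dict) -> dict:
    """Count values outside [low, high] (inclusive) for each listed column."""
    counts = {}
    for column, (low, high) in bounds.items():
        # between() is inclusive on both ends by default; NaN counts as outside.
        outside = ~frame[column].between(low, high)
        counts[column] = int(outside.sum())
    return counts


# Toy frame for illustration only; the 120 is a deliberately invalid value.
toy = pd.DataFrame({
    "attendance_percentage": [56, 69, 120],
    "final_exam_score": [100.0, 90.17, 78.82],
})
toy_counts = out_of_range_counts(toy, ASSUMED_BOUNDS)
print(toy_counts)

# In the notebook: out_of_range_counts(df, ASSUMED_BOUNDS)
```

Based on the min and max values reported by df.describe(), both columns in the real data stay within these assumed bounds.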

Identify Target Variables

Regression Target: final_exam_score

Classification Target: placement_status
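Once the two targets are named, the predictors can be separated from them and the class balance of the classification target inspected. This is a minimal sketch under assumptions: the helper `split_features_and_targets` and the toy frame are hypothetical, and the "Not Placed" label is invented for illustration because only "Placed" appears in the rows shown so far.

```python
import pandas as pd

REGRESSION_TARGET = "final_exam_score"
CLASSIFICATION_TARGET = "placement_status"


def split_features_and_targets(frame: pd.DataFrame):
    """Return (X, y_reg, y_clf) with both target columns removed from X."""
    X = frame.drop(columns=[REGRESSION_TARGET, CLASSIFICATION_TARGET])
    y_reg = frame[REGRESSION_TARGET]
    y_clf = frame[CLASSIFICATION_TARGET]
    return X, y_reg, y_clf


# Toy frame for illustration only; "Not Placed" is an assumed label.
toy = pd.DataFrame({
    "study_hours": [7, 4, 11],
    "attendance_percentage": [56, 69, 60],
    "final_exam_score": [100.0, 100.0, 100.0],
    "placement_status": ["Placed", "Placed", "Not Placed"],
})
X, y_reg, y_clf = split_features_and_targets(toy)
print(list(X.columns))

# Share of each class; a strong imbalance would affect the choice of metric.
class_shares = y_clf.value_counts(normalize=True)
print(class_shares)
```

Dropping final_exam_score from the predictors for the placement model is a design choice made here for simplicity. It could instead be kept as a predictor if the exam result is known before placement is decided.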

OBSERVATIONS SO FAR:
- The dataset contains academic and behavioral indicators.
- No missing values or duplicate rows were found.
- Placement status is suitable as a classification target.
- Final exam score can be modeled as a regression problem.
- Attendance and assignment completion may strongly influence outcomes. This is a hypothesis still to be tested, not yet a finding.
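The last observation can be given a first, rough test by ranking the numeric variables by their correlation with the final exam score. The sketch below uses Pearson correlation, which only captures linear association. The helper name `correlations_with_target` and the toy values are assumptions for illustration.

```python
import pandas as pd


def correlations_with_target(frame: pd.DataFrame, target: str) -> pd.Series:
    """Pearson correlation of each numeric column with the target,
    ordered from strongest to weakest by absolute value."""
    numeric = frame.select_dtypes(include="number")
    corr = numeric.corr()[target].drop(target)
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)


# Toy values for illustration only (not taken from the dataset).
toy = pd.DataFrame({
    "attendance_percentage": [50, 60, 70, 80, 90],
    "sleep_hours": [6, 8, 5, 9, 7],
    "final_exam_score": [55.0, 65.0, 75.0, 85.0, 95.0],
})
ranked = correlations_with_target(toy, "final_exam_score")
print(ranked)

# In the notebook: correlations_with_target(df, "final_exam_score")
```

Because many students score exactly 100, the relationship may not be linear, so a rank-based measure such as `numeric.corr(method="spearman")` may be worth comparing against.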