data_understanding

Let’s import the necessary libraries

import pandas as pd
import numpy as np

Next, load the dataset

df = pd.read_csv("../data/students_data.csv")

Let’s familiarize ourselves with the data.

df.head()

	study_hours	attendance_percentage	sleep_hours	internet_usage	assignments_completed	previous_academic_score	final_exam_score	placement_status
0	7	56	8	7	10	62	100.00	Placed
1	4	69	5	3	8	56	100.00	Placed
2	11	60	7	6	10	45	100.00	Placed
3	8	99	9	8	4	55	90.17	Placed
4	5	52	8	6	8	40	78.82	Placed

df.shape

(10000, 8)

The dataset contains 10,000 rows and 8 columns. With this many rows, it is large enough to support trend analysis and predictive modeling. Next, let's list the available variables to identify the target variables and candidate predictors.

df.columns

Index(['study_hours', 'attendance_percentage', 'sleep_hours', 'internet_usage',
       'assignments_completed', 'previous_academic_score', 'final_exam_score',
       'placement_status'],
      dtype='str')

The dataset contains behavioral, academic, and outcome-related variables aimed at understanding factors influencing student academic success

Next, check the data types.

df.info()

<class 'pandas.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 8 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   study_hours              10000 non-null  int64  
 1   attendance_percentage    10000 non-null  int64  
 2   sleep_hours              10000 non-null  int64  
 3   internet_usage           10000 non-null  int64  
 4   assignments_completed    10000 non-null  int64  
 5   previous_academic_score  10000 non-null  int64  
 6   final_exam_score         10000 non-null  float64
 7   placement_status         10000 non-null  str    
dtypes: float64(1), int64(6), str(1)
memory usage: 625.1 KB

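Interpretation: All 8 columns have 10,000 non-null values, so no entries are missing, which is consistent with reliable data capture. Six columns are integers, final_exam_score is a float, and placement_status is a string.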

DESCRIPTIVE ANALYSIS

df.describe()

	study_hours	attendance_percentage	sleep_hours	internet_usage	assignments_completed	previous_academic_score	final_exam_score
count	10000.000000	10000.00000	10000.000000	10000.000000	10000.000000	10000.00000	10000.000000
mean	5.989600	69.88460	6.498500	6.062600	9.988400	64.91100	86.704207
std	3.163589	17.61653	1.709354	3.138163	6.034145	17.50302	15.058383
min	1.000000	40.00000	4.000000	1.000000	0.000000	35.00000	26.670000
25%	3.000000	55.00000	5.000000	3.000000	5.000000	50.00000	76.727500
50%	6.000000	70.00000	6.500000	6.000000	10.000000	65.00000	92.120000
75%	9.000000	85.00000	8.000000	9.000000	15.000000	80.00000	100.000000
max	11.000000	100.00000	9.000000	11.000000	20.000000	95.00000	100.000000

Interpretation:
- The final exam score has a mean of 86.7, a median of 92.1, a minimum of 26.7, a maximum of 100, and a standard deviation of 15.1.
- Both the mean and the median are high, suggesting generally strong academic performance across students.
- The mean is lower than the median, which indicates a left-skewed distribution: a smaller group of low-scoring students pulls the average down.
- The 75th percentile equals the maximum (100), so at least a quarter of students achieved the top score and scores are bunched at the ceiling.
- Whether the low-scoring group is also the low-engagement group (low attendance, few assignments completed, few study hours) cannot be read from these summary statistics alone and needs to be checked against those variables.
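The mean-versus-median reading above can be checked directly by computing the skewness and the share of scores at the maximum. The sketch below is only illustrative: so that it runs on its own, it uses the five final_exam_score values printed by df.head() as a stand-in for the full column, and skew_summary is a hypothetical helper name, not part of pandas.

```python
import pandas as pd

# Stand-in for the full dataset: the five scores printed by df.head().
# In the notebook you would pass df["final_exam_score"] instead.
sample = pd.DataFrame({
    "final_exam_score": [100.00, 100.00, 100.00, 90.17, 78.82],
})

def skew_summary(scores: pd.Series) -> dict:
    """Summarize the shape of a score distribution (hypothetical helper)."""
    return {
        "mean": scores.mean(),
        "median": scores.median(),
        # Negative skewness means a longer tail toward low scores.
        "skew": scores.skew(),
        # Fraction of students sitting exactly at the highest observed score.
        "share_at_max": (scores == scores.max()).mean(),
    }

summary = skew_summary(sample["final_exam_score"])
print(summary)
```

A negative skew together with a large share_at_max would support the left-skew and ceiling interpretation; on the full data the numbers will differ from this five-row sample.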

Let's check for missing values.

df.isnull().sum()

study_hours                0
attendance_percentage      0
sleep_hours                0
internet_usage             0
assignments_completed      0
previous_academic_score    0
final_exam_score           0
placement_status           0
dtype: int64

Interpretation: There are no missing values.

Check for duplicates

df.duplicated().sum()

np.int64(0)

Interpretation: There are no duplicate rows in the data.
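The missing-value and duplicate checks can be bundled with a simple range check into one reusable report. This is a minimal sketch, not the notebook's method: quality_report and ASSUMED_RANGES are hypothetical names, the valid ranges are assumptions (percentages and exam scores between 0 and 100), and the five rows from df.head() stand in for the full dataset so the snippet runs on its own.

```python
import pandas as pd

# Stand-in for the full dataset: the five rows printed by df.head().
sample = pd.DataFrame({
    "study_hours": [7, 4, 11, 8, 5],
    "attendance_percentage": [56, 69, 60, 99, 52],
    "sleep_hours": [8, 5, 7, 9, 8],
    "internet_usage": [7, 3, 6, 8, 6],
    "assignments_completed": [10, 8, 10, 4, 8],
    "previous_academic_score": [62, 56, 45, 55, 40],
    "final_exam_score": [100.00, 100.00, 100.00, 90.17, 78.82],
    "placement_status": ["Placed"] * 5,
})

# Assumed valid ranges (not stated in the source data description).
ASSUMED_RANGES = {
    "attendance_percentage": (0, 100),
    "final_exam_score": (0, 100),
}

def quality_report(frame: pd.DataFrame, ranges: dict) -> dict:
    """Count missing cells, duplicate rows, and out-of-range values."""
    out_of_range = {
        # between() is inclusive on both ends; NaN counts as out of range.
        col: int((~frame[col].between(low, high)).sum())
        for col, (low, high) in ranges.items()
    }
    return {
        "missing": int(frame.isnull().sum().sum()),
        "duplicates": int(frame.duplicated().sum()),
        "out_of_range": out_of_range,
    }

report = quality_report(sample, ASSUMED_RANGES)
print(report)
```

In the notebook, calling quality_report(df, ASSUMED_RANGES) would run the same checks on all 10,000 rows.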

Data Dictionary

Variable	Meaning
study_hours	Average daily study hours
attendance_percentage	Student attendance rate
sleep_hours	Average sleep duration
internet_usage	Daily internet usage hours
assignments_completed	Number of assignments completed
previous_academic_score	Previous academic performance
final_exam_score	Final examination result
placement_status	Whether student was placed

Identify Target Variables

Regression Target: final_exam_score

Classification Target: placement_status
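With the two targets named, one possible next step is to separate predictors from targets and look at the class balance of placement_status. The sketch below uses the five rows from df.head() as a stand-in so it runs on its own; note that all five of those rows are "Placed", so the balance printed here says nothing about the full dataset. Dropping both targets from the feature matrix is an assumption made here to avoid leakage; whether final_exam_score should be a predictor for the placement model is a modeling decision the analysis has not made yet.

```python
import pandas as pd

# Stand-in for the full dataset: the five rows printed by df.head().
sample = pd.DataFrame({
    "study_hours": [7, 4, 11, 8, 5],
    "attendance_percentage": [56, 69, 60, 99, 52],
    "sleep_hours": [8, 5, 7, 9, 8],
    "internet_usage": [7, 3, 6, 8, 6],
    "assignments_completed": [10, 8, 10, 4, 8],
    "previous_academic_score": [62, 56, 45, 55, 40],
    "final_exam_score": [100.00, 100.00, 100.00, 90.17, 78.82],
    "placement_status": ["Placed"] * 5,
})

targets = ["final_exam_score", "placement_status"]

# Predictors: every column that is not a target (assumption, see lead-in).
X = sample.drop(columns=targets)
y_reg = sample["final_exam_score"]   # regression target
y_clf = sample["placement_status"]   # classification target

# Share of each class; a heavily skewed split would call for
# stratified sampling or class weighting later on.
class_share = y_clf.value_counts(normalize=True)
print(class_share)
```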

OBSERVATIONS SO FAR:
- The dataset contains academic and behavioral indicators.
- No missing values or duplicate rows were found.
- Placement status is suitable for classification modeling.
- Final exam score can be modeled as a regression problem.
- Attendance and assignment completion may strongly influence outcomes; this is a hypothesis to test, since the analysis so far has not examined relationships between variables.
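A first test of the idea that attendance and assignment completion influence outcomes is to rank the numeric predictors by their correlation with final_exam_score. The sketch below is illustrative only: it runs on the five rows from df.head() as a stand-in, so the coefficients it prints are not meaningful estimates, and Pearson correlation captures only linear association.

```python
import pandas as pd

# Stand-in for the full dataset: the five rows printed by df.head().
sample = pd.DataFrame({
    "study_hours": [7, 4, 11, 8, 5],
    "attendance_percentage": [56, 69, 60, 99, 52],
    "sleep_hours": [8, 5, 7, 9, 8],
    "internet_usage": [7, 3, 6, 8, 6],
    "assignments_completed": [10, 8, 10, 4, 8],
    "previous_academic_score": [62, 56, 45, 55, 40],
    "final_exam_score": [100.00, 100.00, 100.00, 90.17, 78.82],
    "placement_status": ["Placed"] * 5,
})

# Keep numeric columns only; placement_status is a string and is excluded.
numeric = sample.select_dtypes(include="number")

# Pearson correlation of each predictor with the regression target,
# ordered from strongest to weakest by absolute value.
corr_with_score = (
    numeric.corr()["final_exam_score"]
    .drop("final_exam_score")
    .sort_values(key=lambda s: s.abs(), ascending=False)
)
print(corr_with_score)
```

On the full data, the same ranking computed from df would show whether attendance_percentage and assignments_completed stand out among the predictors.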