2026-03-23

Introduction

Data Description and Cleaning

For this project, I chose a data set from Kaggle called Student Habits vs Academic Performance. The data is stored as a CSV file and is imported using the following code.

# Load the data set (comma-separated, with a header row)
student_data_raw <- read.csv(
  "Student_Habits_vs_Academic_Performance.csv",
  sep = ",", header = TRUE)

The following variables are the ones I will be analyzing in this project:

  • age

  • gender

  • study_hours_per_day

  • social_media_hours

  • attendance_percentage

  • parental_education_level

  • mental_health_rating

  • exam_score
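A minimal sketch of how these variables could be selected and checked for missing values, assuming the column names above match the CSV headers exactly. The `toy` data frame, its `student_id` column, and all of its values are invented so that the snippet runs on its own; in the report it would be replaced by `student_data_raw`. Dropping incomplete rows with `na.omit()` is only one possible cleaning choice, not necessarily the one used here.

```r
# Hypothetical stand-in for student_data_raw; every value is invented.
toy <- data.frame(
  student_id = c("S1", "S2", "S3", "S4"),
  age = c(18, 21, 19, 22),
  gender = c("Female", "Male", "Female", "Male"),
  study_hours_per_day = c(2.5, 4.0, NA, 3.1),
  social_media_hours = c(3.0, 1.5, 2.2, 2.8),
  attendance_percentage = c(82.0, 95.5, 70.1, 88.3),
  parental_education_level = c("High School", "Bachelor", NA, "Master"),
  mental_health_rating = c(6, 8, 4, 7),
  exam_score = c(61.2, 88.4, 52.0, 74.9)
)

# The variables analyzed in this project
vars <- c("age", "gender", "study_hours_per_day", "social_media_hours",
          "attendance_percentage", "parental_education_level",
          "mental_health_rating", "exam_score")

# Keep only those columns
student_data <- toy[, vars]

# Count missing values in each column
na_counts <- colSums(is.na(student_data))
print(na_counts)

# Drop every row that has at least one missing value
student_data_clean <- na.omit(student_data)
print(dim(student_data_clean))
```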

plotly pie chart : Age and Gender Distribution of Students

2D scatter plot using plotly (attendance_percentage vs exam_score)

Observations of the 2D Scatterplot and five number summary

Visual Inference: The points located beyond 60 on the x axis and beyond 80 on the y axis make up a large proportion of the total data points.

Variable-specific Commentary: Most of the students with a perfect exam score have an attendance percentage above 75%. More generally, a large number of students with a high attendance percentage also have a high exam score.

## [1] "Exam Scores 5 Number Summary"
##      0%     25%     50%     75%    100% 
##  18.400  58.475  70.500  81.325 100.000
## [1] "Attendance Percentage 5 Number Summary"
##      0%     25%     50%     75%    100% 
##  56.000  78.000  84.400  91.025 100.000
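Output in this shape is what base R's `quantile()` prints with its default probabilities. The sketch below shows one way such summaries could be produced; the two short vectors are invented for illustration and are not taken from the data set. Because a five-number summary describes each variable on its own, the sketch also computes `cor()` as one simple way to quantify how the two variables move together.

```r
# Invented values, for illustration only (not from the data set)
exam_score <- c(55, 62, 70, 81, 93)
attendance_percentage <- c(60, 75, 84, 90, 98)

# With default arguments, quantile() returns the minimum, the three
# quartiles, and the maximum, labeled 0%, 25%, 50%, 75%, 100%
print("Exam Scores 5 Number Summary")
exam_summary <- quantile(exam_score)
print(exam_summary)

print("Attendance Percentage 5 Number Summary")
attendance_summary <- quantile(attendance_percentage)
print(attendance_summary)

# Pearson correlation between the two variables
r <- cor(attendance_percentage, exam_score)
print(r)
```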

Conclusion: Read together with the scatter plot, the two five-number summaries suggest that most students with a higher attendance percentage have a higher exam score. The typical student has an attendance percentage in the 80s (median 84.4) and an exam score in the 70s (median 70.5).

3D Plotly scatterplot

2D ggplot, Study Hours vs Mental Health

Discussion: In this plot, I aim to see if there is any correlation between a student's mental health rating and the number of hours they study per day.
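As a numeric companion to the plot, a correlation coefficient could be computed for the two variables. This is a sketch on invented values, not a result from the data set. Since `mental_health_rating` is a rating scale, the rank-based Spearman coefficient is shown next to the default Pearson coefficient; treating the rating as ordinal is an assumption made here.

```r
# Invented values, for illustration only (not from the data set)
study_hours_per_day <- c(1.0, 2.5, 3.0, 4.5, 5.0, 6.5)
mental_health_rating <- c(3, 7, 5, 8, 4, 6)

# Pearson correlation: strength of the linear association
pearson_r <- cor(study_hours_per_day, mental_health_rating)

# Spearman correlation: association between the ranks,
# often preferred when one variable is a rating scale
spearman_rho <- cor(study_hours_per_day, mental_health_rating,
                    method = "spearman")

print(c(pearson = pearson_r, spearman = spearman_rho))
```

Values near 0 would indicate little association, while values near 1 or -1 would indicate a strong positive or negative association.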

ggplot Boxplot, Parental Education Level vs Exam Scores

Discussion: In this plot, I aim to see if there is a correlation between a student's exam score and their parents' education level. I feel parental education level is an important factor because in India, the country I come from, many people have not completed basic education. Their children have the opportunity to do so, but many of them follow the same path. I want to see whether my observation has any merit.
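The center line of each box in a boxplot is the group median, so the same comparison can be read off numerically. A minimal sketch using base R's `tapply()`, with invented scores and example category labels that may not match the levels in the data set:

```r
# Invented values and example labels, for illustration only
parental_education_level <- c("High School", "Bachelor", "Master",
                              "High School", "Bachelor", "Master")
exam_score <- c(60, 72, 80, 64, 70, 90)

# Median exam score within each parental education level
median_by_level <- tapply(exam_score, parental_education_level, median)
print(median_by_level)

# Number of students in each level, to judge how reliable each median is
count_by_level <- table(parental_education_level)
print(count_by_level)
```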

plotly 2D scatterplot : Study Hours vs Exam Scores

Regression analysis of the scatterplot

##         (Intercept) study_hours_per_day 
##            35.91016             9.49025
## [1] "P Value ="
## [1] 4.595701e-250

Visual Analysis: The fitted line is exam_score = 35.91 + 9.49 × study_hours_per_day, so each additional hour of study per day is associated with an exam score about 9.5 points higher. Looking at the scatter plot, the linear fit seems to be a good prediction. The fit line passes through the densest area of markers, which tells us that most of the data lies on or around the fit line.

Residual Analysis: The p-value for the slope is extremely close to 0, which is strong evidence of a linear relationship between the two variables. The p-value measures statistical significance rather than goodness of fit, so on its own it does not tell us how tightly the points cluster around the line.

Conclusion: By looking at the fit line and the p-value, we can confidently say that there is a linear relationship between the study hours per day and the exam scores.
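A sketch of how a regression like this could be fitted and inspected with base R's `lm()`, assuming a simple linear model of exam score on study hours. The values are invented, so the numbers it prints will differ from the ones reported above. Besides the coefficients and the slope's p-value, it extracts R-squared and the residuals, which speak more directly to goodness of fit.

```r
# Invented values, for illustration only (not from the data set)
study_hours_per_day <- c(1, 2, 3, 4, 5, 6, 7, 8)
exam_score <- c(45, 52, 66, 70, 85, 88, 99, 100)

# Fit exam_score = intercept + slope * study_hours_per_day
fit <- lm(exam_score ~ study_hours_per_day)
print(coef(fit))

# p-value for the slope, taken from the coefficient table
fit_summary <- summary(fit)
p_value <- fit_summary$coefficients["study_hours_per_day", "Pr(>|t|)"]
print("P Value =")
print(p_value)

# R-squared: share of the variance in exam_score explained by the line
r_squared <- fit_summary$r.squared
print(r_squared)

# Residuals: observed minus fitted values; they should scatter around 0
res <- residuals(fit)
print(summary(res))
```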

Sources