Source:https://www.kaggle.com/code/amrmohameddiab/comprehensive-analysis-of-student-performance
Title: Comprehensive Analysis of Student Performance
This dataset comes from a private educational provider and includes information about 5,000 students. It is packed with data that helps analyze how well students are doing in school and what affects their grades.
What’s in the dataset?
Basic Information: Each student’s record includes their name, student ID, email, gender, and age.
Academic Performance: It tracks how students perform on various tests and assignments throughout the school year, including midterms, finals, and project scores.
Study and Lifestyle Factors: The dataset also looks at how many hours students study each week, whether they do activities outside of school, if they have internet at home, and the education level of their parents.
Health and Wellbeing: There are details about how much stress students feel and how much sleep they get each night.
Things to consider:
Missing Information: Some parts of the data, like how often students attend class or their assignment scores, are missing for some records. This may require fixing the data before using it.
Uneven Distribution: Some departments have more students than others, which might affect the analysis.
# Load the required libraries, read in the data, and print a summary of every column.
library(dplyr)
library(ggplot2)
library(readr)
library(viridis)
library(plotly)
data <- read_csv("/Users/jayfreeportillo/Downloads/archive/Students_Grading_Dataset.csv")
summary(data)
## Student_ID First_Name Last_Name Email
## Length:5000 Length:5000 Length:5000 Length:5000
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## Gender Age Department Attendance (%)
## Length:5000 Min. :18.00 Length:5000 Min. : 50.01
## Class :character 1st Qu.:19.00 Class :character 1st Qu.: 63.27
## Mode :character Median :21.00 Mode :character Median : 75.72
## Mean :21.05 Mean : 75.43
## 3rd Qu.:23.00 3rd Qu.: 87.47
## Max. :24.00 Max. :100.00
## NA's :516
## Midterm_Score Final_Score Assignments_Avg Quizzes_Avg
## Min. :40.00 Min. :40.00 Min. :50.00 Min. :50.03
## 1st Qu.:55.46 1st Qu.:54.67 1st Qu.:62.09 1st Qu.:62.49
## Median :70.51 Median :69.73 Median :74.81 Median :74.69
## Mean :70.33 Mean :69.64 Mean :74.80 Mean :74.91
## 3rd Qu.:84.97 3rd Qu.:84.50 3rd Qu.:86.97 3rd Qu.:87.63
## Max. :99.98 Max. :99.98 Max. :99.98 Max. :99.96
## NA's :517
## Participation_Score Projects_Score Total_Score Grade
## Min. : 0.000 Min. : 50.01 Min. :50.02 Length:5000
## 1st Qu.: 2.440 1st Qu.: 62.32 1st Qu.:62.84 Class :character
## Median : 4.955 Median : 74.98 Median :75.39 Mode :character
## Mean : 4.980 Mean : 74.92 Mean :75.12
## 3rd Qu.: 7.500 3rd Qu.: 87.37 3rd Qu.:87.65
## Max. :10.000 Max. :100.00 Max. :99.99
##
## Study_Hours_per_Week Extracurricular_Activities Internet_Access_at_Home
## Min. : 5.00 Length:5000 Length:5000
## 1st Qu.:11.40 Class :character Class :character
## Median :17.50 Mode :character Mode :character
## Mean :17.66
## 3rd Qu.:24.10
## Max. :30.00
##
## Parent_Education_Level Family_Income_Level Stress_Level (1-10)
## Length:5000 Length:5000 Min. : 1.000
## Class :character Class :character 1st Qu.: 3.000
## Mode :character Mode :character Median : 5.000
## Mean : 5.481
## 3rd Qu.: 8.000
## Max. :10.000
##
## Sleep_Hours_per_Night
## Min. :4.000
## 1st Qu.:5.200
## Median :6.500
## Mean :6.488
## 3rd Qu.:7.700
## Max. :9.000
##
# Drop every row that has at least one missing value (5,000 rows become 3,230).
clean_data <- na.omit(data)
print(summary(clean_data))
## Student_ID First_Name Last_Name Email
## Length:3230 Length:3230 Length:3230 Length:3230
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## Gender Age Department Attendance (%)
## Length:3230 Min. :18.00 Length:3230 Min. : 50.01
## Class :character 1st Qu.:19.00 Class :character 1st Qu.: 63.36
## Mode :character Median :21.00 Mode :character Median : 75.73
## Mean :21.03 Mean : 75.42
## 3rd Qu.:23.00 3rd Qu.: 87.30
## Max. :24.00 Max. :100.00
## Midterm_Score Final_Score Assignments_Avg Quizzes_Avg
## Min. :40.01 Min. :40.00 Min. :50.00 Min. :50.03
## 1st Qu.:55.39 1st Qu.:54.66 1st Qu.:62.07 1st Qu.:62.48
## Median :69.89 Median :69.75 Median :74.98 Median :74.81
## Mean :70.10 Mean :69.49 Mean :74.98 Mean :74.88
## 3rd Qu.:84.59 3rd Qu.:84.00 3rd Qu.:87.42 3rd Qu.:87.35
## Max. :99.97 Max. :99.98 Max. :99.98 Max. :99.95
## Participation_Score Projects_Score Total_Score Grade
## Min. : 0.000 Min. : 50.01 Min. :50.03 Length:3230
## 1st Qu.: 2.502 1st Qu.: 62.22 1st Qu.:62.84 Class :character
## Median : 5.020 Median : 74.97 Median :74.92 Mode :character
## Mean : 5.010 Mean : 74.85 Mean :74.90
## 3rd Qu.: 7.518 3rd Qu.: 87.14 3rd Qu.:87.48
## Max. :10.000 Max. :100.00 Max. :99.99
## Study_Hours_per_Week Extracurricular_Activities Internet_Access_at_Home
## Min. : 5.00 Length:3230 Length:3230
## 1st Qu.:11.50 Class :character Class :character
## Median :17.40 Mode :character Mode :character
## Mean :17.68
## 3rd Qu.:24.10
## Max. :30.00
## Parent_Education_Level Family_Income_Level Stress_Level (1-10)
## Length:3230 Length:3230 Min. : 1.000
## Class :character Class :character 1st Qu.: 3.000
## Mode :character Mode :character Median : 5.000
## Mean : 5.482
## 3rd Qu.: 8.000
## Max. :10.000
## Sleep_Hours_per_Night
## Min. :4.000
## 1st Qu.:5.200
## Median :6.500
## Mean :6.476
## 3rd Qu.:7.700
## Max. :9.000
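The row count falls from 5,000 to 3,230 after na.omit(), yet summary() reports only 516 and 517 missing values for the two numeric columns. Because summary() does not list NA counts for character columns, the remaining dropped rows presumably have gaps there. A minimal sketch for counting missing values in every column before dropping rows, using a small made-up data frame (`toy` and all of its values are hypothetical stand-ins for `data`):

```r
# Hypothetical stand-in for `data`; the values are made up for illustration.
toy <- data.frame(
  Attendance = c(90.5, NA, 75.0, 60.2, NA),
  Assignments_Avg = c(80.1, 70.3, NA, 65.0, 88.8),
  Parent_Education_Level = c("Bachelor's", NA, "High School", NA, "PhD"),
  stringsAsFactors = FALSE
)

# Missing values per column, including character columns
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# Rows that survive na.omit() compared with the original row count
rows_before <- nrow(toy)
rows_after <- nrow(na.omit(toy))
cat("Rows kept:", rows_after, "of", rows_before, "\n")
```

Running the same colSums(is.na(...)) call on the real `data` would show which columns account for the dropped rows.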
# Tab 1: heatmap of total score by binned study hours and attendance
clean_data <- clean_data %>%
  mutate(Study_Hours_Binned = cut(Study_Hours_per_Week, breaks = 5),
         Attendance_Binned = cut(`Attendance (%)`, breaks = 5))
heatmap_plot <- ggplot(clean_data, aes(x = Study_Hours_Binned, y = Attendance_Binned, fill = Total_Score)) +
  geom_tile(color = "black") +
  scale_fill_gradient(low = "white", high = "red") +
  labs(title = "Student Performance Heatmap",
       subtitle = "Impact of Study Hours and Attendance on Scores",
       x = "Study Hours per Week (Binned)",
       y = "Attendance (%) (Binned)") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
heatmap_plot
Insights from the Heatmap: Impact of Study Hours and Attendance on Total Scores
This heatmap shows how attendance percentage (y-axis) and study hours per week (x-axis) relate to total scores (color intensity). Darker red areas represent higher total scores, while lighter areas represent lower scores.
Key Takeaways:
Higher Attendance Is Generally Associated with Higher Scores
The top rows (80-100% attendance) have darker red shades, meaning students who attend class more often tend to score higher. Lower attendance groups (50-60%) still have some strong scores but generally show more variation.
Study Hours Have a Mixed Impact on Scores
Some students who study less (5-10 hours per week) still achieve high scores, but higher study hours (20-25 per week) also show strong performance in certain cases. This suggests that studying more does not by itself guarantee higher scores; other factors like class participation or study quality may matter.
Strongest Performers: High Attendance + Moderate Study Hours
The highest scores (dark red areas) appear when students have both high attendance (80-100%) and study around 10-20 hours per week. This suggests that a balance of attending class and consistent studying goes with better performance.
Lower Scores in Some High Study Groups
Some students with high study hours (25-30 per week) and moderate attendance (60-80%) still have lower scores (lighter shades). This could mean that studying long hours without attending class regularly is not the most effective approach.
Conclusions:
Attendance appears to be one of the strongest factors associated with higher scores.
Studying more helps, but only up to a point; quality of study may be more important than the number of hours.
Students with both high attendance and consistent study habits tend to perform the best.
If a student is studying a lot but still getting low scores, they might need to improve their study methods rather than just increasing hours.
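Each combination of study-hours bin and attendance bin contains many students, and geom_tile() draws one tile per row of data, so tiles that share a bin are drawn on top of one another. One way to make each cell's color represent every student in the bin is to average Total_Score per bin before plotting. A minimal sketch on randomly generated stand-in data (the `toy` data frame and its uniform values are assumptions, not the real dataset):

```r
# Hypothetical stand-in for `clean_data`; values are random, not real scores.
set.seed(42)
n <- 500
toy <- data.frame(
  Study_Hours_per_Week = runif(n, 5, 30),
  Attendance = runif(n, 50, 100),
  Total_Score = runif(n, 50, 100)
)

# Same binning as in the heatmap code: five equal-width bins per axis
toy$Study_Hours_Binned <- cut(toy$Study_Hours_per_Week, breaks = 5)
toy$Attendance_Binned <- cut(toy$Attendance, breaks = 5)

# Mean score and number of students in each bin combination
bin_means <- aggregate(Total_Score ~ Study_Hours_Binned + Attendance_Binned,
                       data = toy, FUN = mean)
bin_counts <- aggregate(Total_Score ~ Study_Hours_Binned + Attendance_Binned,
                        data = toy, FUN = length)
names(bin_means)[3] <- "Mean_Total_Score"
names(bin_counts)[3] <- "Students"

# One row per bin combination, joined on the two bin columns
bin_summary <- merge(bin_means, bin_counts)
print(head(bin_summary))
```

A table like `bin_summary` could then be passed to ggplot() with fill = Mean_Total_Score, and the Students column shows how many observations stand behind each cell.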
# Tab 2: histograms of weekly study hours, faceted by department (rows) and grade (columns)
study_hours_plot <- ggplot(clean_data, aes(x = Study_Hours_per_Week, fill = Grade)) +
  geom_histogram(bins = 10, position = position_dodge(), alpha = 0.7) +
  facet_grid(Department ~ Grade) +
  scale_fill_brewer(palette = "Set1") +
  labs(title = "Distribution of Study Hours by Department and Grade",
       x = "Study Hours per Week",
       y = "Frequency") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        strip.text.x = element_text(size = 8),
        strip.text.y = element_text(size = 8))
print(study_hours_plot)
Insights from the Visualization: Faceted histogram plot
Study Hours by Department and Grade
This visualization shows how many hours students study per week across different departments (Business, CS, Engineering, Mathematics) and how study time relates to grades (A, B, C, D, F). Study hours in this dataset range from 5 to 30 per week.
Key Takeaways:
Students with A Grades Study More but Not Excessively
In most departments, students who get A grades (red) tend to study consistently in the 10-30 hour range per week. There is no clear pattern of the heaviest studying (close to the 30-hour maximum) leading to A grades, meaning quality of study may matter more than just studying longer.
F Grades Appear at Both Low and High Study Hours
Students who received F grades (orange) are present at both very low (less than 10 hours) and very high study hours (close to 30 hours). This suggests that studying too little is associated with failure, but studying a lot without the right methods might also not be effective.
D and F Grades Have a More Spread-Out Study Pattern
D (purple) and F (orange) grade students have a wider spread of study hours, meaning there is no single pattern for struggling students. Some D students study around 20 hours but still don’t perform well, suggesting that other factors like participation, attendance, or study methods could be important.
B and C Grades Show Balanced Study Hours
B grade students (blue) have a fairly even distribution of study hours, often between 10-25 hours per week. C grade students (green) also show a mix but are more spread out, suggesting that they might not have a consistent study pattern.
Differences by Department
CS and Engineering students show a wider range of study hours for all grades, indicating more variation in how students manage their time. Mathematics and Business students have fewer high study hour cases, meaning they might achieve their grades with a more consistent study routine.
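Histogram shapes can be cross-checked with a few numbers per grade. A minimal sketch that summarizes weekly study hours by grade, on randomly generated stand-in data (the `toy` data frame is hypothetical; with the real data one would use `clean_data` instead):

```r
# Hypothetical stand-in for `clean_data`; grades and hours are random.
set.seed(7)
n <- 300
toy <- data.frame(
  Grade = sample(c("A", "B", "C", "D", "F"), n, replace = TRUE),
  Study_Hours_per_Week = round(runif(n, 5, 30), 1)
)

# Split study hours into one vector per grade, then summarize each vector
by_grade <- split(toy$Study_Hours_per_Week, toy$Grade)
hours_by_grade <- data.frame(
  Median = sapply(by_grade, median),
  IQR = sapply(by_grade, IQR),
  Max = sapply(by_grade, max)
)
print(hours_by_grade)
```

Similar medians and interquartile ranges across grades would indicate that study hours alone separate the grades only weakly.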
# Tab 3: trellis pie charts of grade distribution by department and gender
library(plotly)
library(dplyr)
# Count students in each Department / Gender / Grade combination
grades_summary <- clean_data %>%
  group_by(Department, Gender, Grade) %>%
  summarise(Count = n(), .groups = 'drop')
# Determine the number of unique departments
num_departments <- length(unique(grades_summary$Department))
# Add normalized coordinates for pie chart placement:
# one column per department, male students in the top row, all others in the bottom row
grades_summary <- grades_summary %>%
  mutate(xmin = (as.numeric(as.factor(Department)) - 1) / num_departments,
         xmax = as.numeric(as.factor(Department)) / num_departments,
         ymin = if_else(Gender == "Male", 0.5, 0),
         ymax = if_else(Gender == "Male", 1, 0.5))
# Create the trellis pie charts, one pie per department and gender
plot <- plot_ly()
for (dept in unique(grades_summary$Department)) {
  for (gender in unique(grades_summary$Gender)) {
    subset_data <- grades_summary %>%
      filter(Department == dept, Gender == gender)
    plot <- plot %>%
      add_pie(data = subset_data, labels = ~Grade, values = ~Count,
              domain = list(x = c(min(subset_data$xmin), max(subset_data$xmax)),
                            y = c(min(subset_data$ymin), max(subset_data$ymax))),
              textinfo = 'label+percent', hoverinfo = 'label+percent+name',
              name = paste(dept, gender))
  }
}
plot <- plot %>%
  layout(title = 'Distribution of Grades by Department and Gender',
         showlegend = TRUE)
# Display the plot
plot
Insights from the Visualization: Trellis Pie Charts
(You can hover over the Charts for details)
Distribution of Grades by Department and Gender
This visualization shows grade distributions by department and gender, with the top row representing male students and the bottom row representing female students. Each pie chart breaks down the percentage of students receiving grades A, B, C, D, and F across different departments.
Key Observations & Gender-Based Insights:
Higher A Grades for Females: Across most departments, female students have a slightly higher percentage of A grades compared to male students. For example, in some departments, females have 33.1%-33.2% A grades, while males are slightly lower at 30.6%-31.3%. This suggests that female students may be performing better academically overall.
Higher D and F Grades for Males: Males tend to have a slightly higher proportion of D and F grades compared to females. This is evident in several departments where male students’ D and F grades make up a larger portion of the distribution compared to females in the same department. This could indicate differences in study habits, engagement, or external factors affecting performance.
Balanced B and C Grade Distribution: The percentage of B and C grades appears relatively similar between males and females across most departments. This suggests that both genders have a comparable middle-performing student group.
More Consistency Among Female Students: Female students have a more consistent distribution of grades, with fewer extreme variations across different departments. The distribution of grades among male students fluctuates more, with some departments having higher D and F rates.
Potential Factors Influencing Gender Differences:
Study Habits: Female students may be spending more time studying, leading to slightly better performance.
Class Engagement: Participation scores and engagement levels could vary by gender.
Course Difficulty or Interest: Certain majors might attract more male or female students, leading to different performance trends.
External Responsibilities: Factors like work, stress, or extracurricular involvement could impact male and female students differently.
Possible Conclusions:
Female students tend to have a higher percentage of A grades and lower D/F grades, suggesting stronger academic performance overall.
Male students have more variation in grades, with some departments showing a higher percentage of students struggling with lower grades.
The difference is not extreme, but it is noticeable and could warrant further investigation into study habits, course engagement, and external factors affecting student performance.
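Gaps of two or three percentage points between pie charts can arise by chance. One possible check, which is not part of the original analysis, is a chi-squared test of independence between gender and grade. A minimal sketch on randomly generated stand-in data (`toy` is hypothetical and deliberately built with no real gender effect):

```r
# Hypothetical stand-in for `clean_data`; gender and grade are drawn independently.
set.seed(123)
n <- 1000
toy <- data.frame(
  Gender = sample(c("Female", "Male"), n, replace = TRUE),
  Grade = sample(c("A", "B", "C", "D", "F"), n, replace = TRUE)
)

# Counts per gender and grade, then row percentages (each gender sums to 100)
grade_counts <- table(toy$Gender, toy$Grade)
grade_shares <- round(100 * prop.table(grade_counts, margin = 1), 1)
print(grade_shares)

# Chi-squared test of independence between gender and grade
test_result <- chisq.test(grade_counts)
print(test_result$p.value)
```

Even with independent draws, the printed shares differ somewhat between the two rows, which is why a large p-value on the real data would argue against reading much into small gaps.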
# Aggregate data by Department
aggregated_data <- clean_data %>%
  group_by(Department) %>%
  summarise(Avg_Midterm_Score = mean(Midterm_Score, na.rm = TRUE),
            Avg_Final_Score = mean(Final_Score, na.rm = TRUE))
# Create the plot (departments ordered by descending average midterm score)
plot <- ggplot(aggregated_data, aes(x = reorder(Department, -Avg_Midterm_Score))) +
  geom_point(aes(y = Avg_Midterm_Score, color = "Midterm Score"), size = 4, shape = 16) +
  geom_point(aes(y = Avg_Final_Score, color = "Final Score"), size = 4, shape = 17) +
  geom_line(aes(y = Avg_Midterm_Score, group = 1, color = "Midterm Score"), linetype = "solid") +
  geom_line(aes(y = Avg_Final_Score, group = 1, color = "Final Score"), linetype = "dashed") +
  scale_y_continuous(name = "Average Score",
                     sec.axis = sec_axis(~ ., name = "Average Final Score (dashed line)")) +
  labs(x = "Department", color = "Score Type") +
  theme_minimal() +
  theme(legend.position = "top",
        legend.title = element_blank(),
        text = element_text(size = 12),
        axis.title = element_text(size = 14))
# Print the plot
print(plot)
Insights from the Visualization: Dual-axis scatter plot with connecting lines
Midterm vs. Final Scores by Department
This chart compares average midterm scores (solid blue line) and final scores (dashed red line with triangles) across different departments (Mathematics, Engineering, CS, Business). The secondary axis is an identity transform of the primary one, so both axes share the same scale.
Key Takeaways:
Midterm Scores Decrease Across Departments
Midterm scores are highest in Mathematics and then drop across Engineering, CS, and Business. The departments are sorted on the x-axis by descending average midterm score, so the downward slope reflects that ordering rather than a trend. The finding is the ranking itself: students in Mathematics have the highest average midterm score.
Final Scores Show More Variation
Unlike midterms, final scores do not follow a clear pattern: they are higher than midterm scores in some departments (Engineering, Business) and lower in others (Mathematics, CS). Business students show the largest jump in final scores, while Mathematics students see the lowest final scores.
Departments with Midterm to Final Score Drops
Mathematics and CS students’ final scores are lower than their midterm scores, suggesting that they may struggle more in final exams. This could mean that finals are harder than midterms in these subjects, or students do not prepare as well for them.
Departments with Midterm to Final Score Increases
Engineering and Business students improve from midterm to final exams. This suggests that students in these departments may adapt better to finals, possibly due to better study strategies or easier final exams.
Conclusions:
Mathematics and CS students may need more support for final exams, as their scores drop compared to midterms.
Business students perform better in finals than midterms, which could mean better preparation, an easier final exam, or improved understanding over time.
Engineering students also improve in final exams, though not as drastically as Business students.
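Whether a midterm-to-final change is large enough to matter depends on how much individual scores vary. A minimal sketch that computes the mean change per department together with its standard error, on randomly generated stand-in data (the `toy` data frame is hypothetical and contains no real department effect):

```r
# Hypothetical stand-in for `clean_data`; scores are random, not real results.
set.seed(2024)
n <- 400
toy <- data.frame(
  Department = sample(c("Business", "CS", "Engineering", "Mathematics"),
                      n, replace = TRUE),
  Midterm_Score = runif(n, 40, 100),
  Final_Score = runif(n, 40, 100)
)

# Per-student change from midterm to final
toy$Change <- toy$Final_Score - toy$Midterm_Score

# Mean change and its standard error within each department
by_dept <- split(toy$Change, toy$Department)
change_by_dept <- data.frame(
  Students = sapply(by_dept, length),
  Mean_Change = sapply(by_dept, mean),
  Std_Error = sapply(by_dept, function(x) sd(x) / sqrt(length(x)))
)
print(round(change_by_dept, 2))
```

A mean change that lies within roughly two standard errors of zero is hard to distinguish from noise, so a table like this helps decide which department differences deserve an explanation.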
# Line plot of average total score by department, one line per family income level
library(ggplot2)
average_scores_by_income <- clean_data %>%
  group_by(Department, Family_Income_Level) %>%
  summarise(Average_Score = mean(Total_Score, na.rm = TRUE), .groups = 'drop')
# Colors are assigned in alphabetical order of income level: High, Low, Medium
colors_mixed <- c("#2171b5", "#cb181d", "#238b45")
# Plotting the data
ggplot(average_scores_by_income, aes(x = Department, y = Average_Score, group = Family_Income_Level, color = Family_Income_Level)) +
  geom_line(size = 2) +
  geom_point(size = 4, shape = 21, fill = "white") +
  labs(title = "Average Total Scores by Department and Family Income Level",
       x = "Department",
       y = "Average Total Score",
       color = "Family Income Level") +
  theme_minimal() +
  scale_color_manual(values = colors_mixed)
Insights from the Visualization: Plot Lines
Average Total Scores by Department and Family Income Level
This chart shows how average total scores vary across different departments (Business, CS, Engineering, Mathematics) based on family income level (High, Medium, Low).
Key Takeaways:
Students from Low-Income Families Perform Best Overall
In most departments, students from low-income families (red line) have the highest total scores, especially in Mathematics and Engineering. This could suggest that students from low-income backgrounds put in extra effort to achieve high scores.
High-Income Students Have the Lowest Scores in Mathematics and Business
The blue line (high-income students) drops sharply in Mathematics and Business, showing that these students tend to score lower in these fields. This might indicate less academic pressure or different study habits among high-income students.
Medium-Income Students Perform Steadily
The green line (medium-income students) stays fairly balanced, showing consistent performance across departments. They do not have extreme highs or lows, meaning they might have more stable academic habits.
CS Department Has the Most Balanced Scores Across Income Levels
In CS, all income groups perform similarly, with little difference between high, medium, and low-income students. This might suggest equal access to resources, study habits, or grading fairness in CS.
Conclusions:
Students from low-income families tend to have higher scores, possibly due to greater motivation or harder work.
High-income students perform well in CS but struggle in Mathematics and Business, which could mean less academic pressure or different priorities.
Medium-income students have steady performance across all departments, showing a balanced academic approach.
More research is needed to see if external factors (like access to tutors, work responsibilities, or school support) affect these trends.
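Before interpreting gaps between the income lines, it can help to ask whether the group means differ by more than chance would produce. One possible check, not used in the original analysis, is a one-way ANOVA of total score on family income level. A minimal sketch on randomly generated stand-in data (`toy` is hypothetical and has no built-in income effect):

```r
# Hypothetical stand-in for `clean_data`; income level and score are independent.
set.seed(99)
n <- 600
toy <- data.frame(
  Family_Income_Level = sample(c("Low", "Medium", "High"), n, replace = TRUE),
  Total_Score = runif(n, 50, 100)
)

# Mean total score per income level
group_means <- aggregate(Total_Score ~ Family_Income_Level, data = toy, FUN = mean)
print(group_means)

# One-way ANOVA: does mean Total_Score differ across income levels?
fit <- aov(Total_Score ~ Family_Income_Level, data = toy)
p_value <- summary(fit)[[1]][["Pr(>F)"]][1]
print(p_value)
```

The group means in the toy data are not identical even though income has no effect there, so small gaps between lines in the chart should be read with the same caution.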
After analyzing all the visualizations, we can conclude the following key findings:
Attendance Matters – Students with higher attendance tend to have better grades, which indicates that being in class regularly is important for success.
Study Hours Should Be Balanced – Studying too little leads to lower grades, but studying too much doesn’t always mean better scores. The quality of study matters more than just time spent.
Grades Vary by Department – Some departments, like Mathematics and CS, show a drop in scores from midterms to finals, meaning students may struggle with final exams more. Engineering and Business students improve in finals, suggesting better preparation or easier exams.
Income Level Affects Performance – Low-income students tend to have the highest scores, possibly due to higher motivation. High-income students perform worse in some areas (like Mathematics and Business), which could mean less academic pressure or different study habits.
Gender Differences in Grades – Female students tend to get slightly more A grades, while male students have slightly more D and F grades. This could reflect different learning approaches or levels of engagement between genders, though the gap is small.
To explore these findings further, here are some key questions to ask:
What other factors influence grades? – Could stress levels, sleep, or extracurricular activities affect performance?
Why do some departments see score drops from midterms to finals? – Are final exams harder, or do students not prepare well?
How do study methods impact grades? – Are students who study smarter (not longer) performing better?
What support can improve struggling students’ performance? – Should there be tutoring, attendance incentives, or study workshops to help?
Why do low-income students score higher? – Are they more motivated, or do they use better study habits?
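The first question, whether stress, sleep, or other habits relate to performance, can be given a quick first pass with pairwise correlations. A minimal sketch on randomly generated stand-in data (the `toy` data frame is hypothetical; in the real data the stress column is named `Stress_Level (1-10)` and would need backticks):

```r
# Hypothetical stand-in for the numeric columns of `clean_data`; values are random.
set.seed(5)
n <- 300
toy <- data.frame(
  Total_Score = runif(n, 50, 100),
  Stress_Level = sample(1:10, n, replace = TRUE),
  Sleep_Hours_per_Night = round(runif(n, 4, 9), 1),
  Study_Hours_per_Week = round(runif(n, 5, 30), 1)
)

# Pearson correlation of every column with Total_Score
cor_with_score <- cor(toy)[, "Total_Score"]
print(round(cor_with_score, 3))
```

Correlations close to zero would suggest that a factor has little linear relationship with total score, although they cannot rule out non-linear effects.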
This dataset gives us a lot of useful information about what affects student success. We saw that attendance, study habits, gender, and income level all play a role in grades. Some students perform better in finals, while others struggle, showing that different students may need different types of support. Studying more doesn’t always mean better grades—how students study is just as important as how long they study.
The data also challenges assumptions. Low-income students often perform well, which suggests that factors such as motivation and effort, which this dataset does not measure directly, may matter as much as resources. Meanwhile, some departments see changes between midterm and final scores, suggesting that preparation and test difficulty matter.
Going forward, this data could help schools and educators create better strategies to support students. By focusing on improving study habits, encouraging attendance, and understanding different learning needs, schools can help students succeed, no matter their background.