Introduction

Source: https://www.kaggle.com/code/amrmohameddiab/comprehensive-analysis-of-student-performance

Title: Comprehensive Analysis of Student Performance

This dataset comes from a private educational provider and includes information about 5,000 students. It combines academic results with study, lifestyle, and wellbeing variables, so it can be used to analyze how students perform and which factors are associated with their grades.

What’s in the dataset?

Basic Information: Each student’s record includes their name, student ID, email, gender, and age.

Academic Performance: It tracks how students perform on various tests and assignments throughout the school year, including midterms, finals, and project scores.

Study and Lifestyle Factors: The dataset also looks at how many hours students study each week, whether they do activities outside of school, if they have internet at home, and the education level of their parents.

Health and Wellbeing: There are details about how much stress students feel and how much sleep they get each night.

Things to consider:

Missing Information: Some parts of the data are missing for some records: attendance is missing for 516 students and the assignment average for 517. This may require fixing the data before using it.

Uneven Distribution: Some departments have more students than others, which might affect the analysis.
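Before deciding how to handle the gaps, it can help to count the missing values per column. The following is a minimal sketch on a made-up data frame; `toy` and its values are hypothetical stand-ins, not rows from the real file.

```r
# Toy stand-in for the real data; the values are invented for illustration
toy <- data.frame(
  Student_ID = c("S1", "S2", "S3", "S4"),
  Attendance = c(91.2, NA, 75.0, NA),
  Assignments_Avg = c(80.5, 62.1, NA, 70.3),
  stringsAsFactors = FALSE
)

# Number of missing values in each column
na_counts <- colSums(is.na(toy))
print(na_counts)

# Number of rows with no missing value in any column
# (these are the rows that na.omit() would keep)
rows_kept <- sum(complete.cases(toy))
print(rows_kept)
```

Applied to the real data frame, the same two calls would show which columns drive the row loss when incomplete records are removed.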

# Load the required libraries, read in the data, and print summary statistics.
library(dplyr)
library(ggplot2)
library(readr)
library(viridis)
library(plotly)
data <- read_csv("/Users/jayfreeportillo/Downloads/archive/Students_Grading_Dataset.csv")
summary(data)
##   Student_ID         First_Name         Last_Name            Email          
##  Length:5000        Length:5000        Length:5000        Length:5000       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##     Gender               Age         Department        Attendance (%)  
##  Length:5000        Min.   :18.00   Length:5000        Min.   : 50.01  
##  Class :character   1st Qu.:19.00   Class :character   1st Qu.: 63.27  
##  Mode  :character   Median :21.00   Mode  :character   Median : 75.72  
##                     Mean   :21.05                      Mean   : 75.43  
##                     3rd Qu.:23.00                      3rd Qu.: 87.47  
##                     Max.   :24.00                      Max.   :100.00  
##                                                        NA's   :516     
##  Midterm_Score    Final_Score    Assignments_Avg  Quizzes_Avg   
##  Min.   :40.00   Min.   :40.00   Min.   :50.00   Min.   :50.03  
##  1st Qu.:55.46   1st Qu.:54.67   1st Qu.:62.09   1st Qu.:62.49  
##  Median :70.51   Median :69.73   Median :74.81   Median :74.69  
##  Mean   :70.33   Mean   :69.64   Mean   :74.80   Mean   :74.91  
##  3rd Qu.:84.97   3rd Qu.:84.50   3rd Qu.:86.97   3rd Qu.:87.63  
##  Max.   :99.98   Max.   :99.98   Max.   :99.98   Max.   :99.96  
##                                  NA's   :517                    
##  Participation_Score Projects_Score    Total_Score       Grade          
##  Min.   : 0.000      Min.   : 50.01   Min.   :50.02   Length:5000       
##  1st Qu.: 2.440      1st Qu.: 62.32   1st Qu.:62.84   Class :character  
##  Median : 4.955      Median : 74.98   Median :75.39   Mode  :character  
##  Mean   : 4.980      Mean   : 74.92   Mean   :75.12                     
##  3rd Qu.: 7.500      3rd Qu.: 87.37   3rd Qu.:87.65                     
##  Max.   :10.000      Max.   :100.00   Max.   :99.99                     
##                                                                         
##  Study_Hours_per_Week Extracurricular_Activities Internet_Access_at_Home
##  Min.   : 5.00        Length:5000                Length:5000            
##  1st Qu.:11.40        Class :character           Class :character       
##  Median :17.50        Mode  :character           Mode  :character       
##  Mean   :17.66                                                          
##  3rd Qu.:24.10                                                          
##  Max.   :30.00                                                          
##                                                                         
##  Parent_Education_Level Family_Income_Level Stress_Level (1-10)
##  Length:5000            Length:5000         Min.   : 1.000     
##  Class :character       Class :character    1st Qu.: 3.000     
##  Mode  :character       Mode  :character    Median : 5.000     
##                                             Mean   : 5.481     
##                                             3rd Qu.: 8.000     
##                                             Max.   :10.000     
##                                                                
##  Sleep_Hours_per_Night
##  Min.   :4.000        
##  1st Qu.:5.200        
##  Median :6.500        
##  Mean   :6.488        
##  3rd Qu.:7.700        
##  Max.   :9.000        
## 

Datasets

# Remove every row that has a missing value in any column (5,000 rows become 3,230).
clean_data <- na.omit(data)
print(summary(clean_data))
##   Student_ID         First_Name         Last_Name            Email          
##  Length:3230        Length:3230        Length:3230        Length:3230       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##     Gender               Age         Department        Attendance (%)  
##  Length:3230        Min.   :18.00   Length:3230        Min.   : 50.01  
##  Class :character   1st Qu.:19.00   Class :character   1st Qu.: 63.36  
##  Mode  :character   Median :21.00   Mode  :character   Median : 75.73  
##                     Mean   :21.03                      Mean   : 75.42  
##                     3rd Qu.:23.00                      3rd Qu.: 87.30  
##                     Max.   :24.00                      Max.   :100.00  
##  Midterm_Score    Final_Score    Assignments_Avg  Quizzes_Avg   
##  Min.   :40.01   Min.   :40.00   Min.   :50.00   Min.   :50.03  
##  1st Qu.:55.39   1st Qu.:54.66   1st Qu.:62.07   1st Qu.:62.48  
##  Median :69.89   Median :69.75   Median :74.98   Median :74.81  
##  Mean   :70.10   Mean   :69.49   Mean   :74.98   Mean   :74.88  
##  3rd Qu.:84.59   3rd Qu.:84.00   3rd Qu.:87.42   3rd Qu.:87.35  
##  Max.   :99.97   Max.   :99.98   Max.   :99.98   Max.   :99.95  
##  Participation_Score Projects_Score    Total_Score       Grade          
##  Min.   : 0.000      Min.   : 50.01   Min.   :50.03   Length:3230       
##  1st Qu.: 2.502      1st Qu.: 62.22   1st Qu.:62.84   Class :character  
##  Median : 5.020      Median : 74.97   Median :74.92   Mode  :character  
##  Mean   : 5.010      Mean   : 74.85   Mean   :74.90                     
##  3rd Qu.: 7.518      3rd Qu.: 87.14   3rd Qu.:87.48                     
##  Max.   :10.000      Max.   :100.00   Max.   :99.99                     
##  Study_Hours_per_Week Extracurricular_Activities Internet_Access_at_Home
##  Min.   : 5.00        Length:3230                Length:3230            
##  1st Qu.:11.50        Class :character           Class :character       
##  Median :17.40        Mode  :character           Mode  :character       
##  Mean   :17.68                                                          
##  3rd Qu.:24.10                                                          
##  Max.   :30.00                                                          
##  Parent_Education_Level Family_Income_Level Stress_Level (1-10)
##  Length:3230            Length:3230         Min.   : 1.000     
##  Class :character       Class :character    1st Qu.: 3.000     
##  Mode  :character       Mode  :character    Median : 5.000     
##                                             Mean   : 5.482     
##                                             3rd Qu.: 8.000     
##                                             Max.   :10.000     
##  Sleep_Hours_per_Night
##  Min.   :4.000        
##  1st Qu.:5.200        
##  Median :6.500        
##  Mean   :6.476        
##  3rd Qu.:7.700        
##  Max.   :9.000
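Dropping every incomplete row removes 1,770 of the 5,000 records. The first summary reports 516 missing attendance values and 517 missing assignment averages, which can explain at most 1,033 of those rows; `summary()` does not print NA counts for character columns, so the remainder presumably sits there. As a hedged alternative, the sketch below drops rows only when a column that a given plot actually needs is missing. The data frame `toy` and the choice of `needed_cols` are assumptions for illustration.

```r
# Toy data; which columns contain NA here is an assumption for illustration
toy <- data.frame(
  Total_Score = c(88, 72, 65, 91),
  Attendance = c(95, NA, 70, 85),
  Parent_Education_Level = c("Master's", "PhD", NA, NA),
  stringsAsFactors = FALSE
)

# Columns a particular analysis needs (hypothetical choice)
needed_cols <- c("Total_Score", "Attendance")

# Keep rows that are complete in the needed columns only
targeted <- toy[complete.cases(toy[, needed_cols]), ]

# For comparison: na.omit() drops a row if any column is missing
strict <- na.omit(toy)

print(nrow(targeted))
print(nrow(strict))
```

The trade-off is that different plots may then be based on different subsets of students, which should be stated alongside each result.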

Heatmap

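# Bin study hours and attendance into five equal-width intervals each
clean_data <- clean_data %>%
  mutate(Study_Hours_Binned = cut(Study_Hours_per_Week, breaks = 5),
         Attendance_Binned = cut(`Attendance (%)`, breaks = 5))

# Average Total_Score within each cell. Mapping the raw rows to geom_tile()
# would stack one tile per student, so only the last one drawn would be visible.
heatmap_data <- clean_data %>%
  group_by(Study_Hours_Binned, Attendance_Binned) %>%
  summarise(Avg_Total_Score = mean(Total_Score), .groups = "drop")

heatmap_plot <- ggplot(heatmap_data, aes(x = Study_Hours_Binned, y = Attendance_Binned, fill = Avg_Total_Score)) +
  geom_tile(color = "black") +
  scale_fill_gradient(low = "white", high = "red") +
  labs(title = "Student Performance Heatmap",
       subtitle = "Impact of Study Hours and Attendance on Scores",
       x = "Study Hours per Week (Binned)",
       y = "Attendance (%) (Binned)",
       fill = "Average Total Score") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

heatmap_plot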

Insights from the Heatmap: Impact of Study Hours and Attendance on Total Scores

This heatmap shows how attendance percentage (on the y-axis) and study hours per week (on the x-axis) relate to total scores (color intensity). The darker red areas represent higher total scores, while lighter areas represent lower scores.
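The bins on both axes come from `cut()`. With a single number for `breaks`, R derives the interval edges from the range of the data, so the labels are rarely round numbers. A minimal sketch, assuming explicit 10-point edges are wanted for attendance (the vector `attendance` is invented for illustration):

```r
# Hypothetical attendance percentages
attendance <- c(50.01, 59.9, 60, 75.7, 88.2, 100)

# Explicit edges at 50, 60, ..., 100; include.lowest keeps a value of
# exactly 50 inside the first interval
bins <- cut(attendance, breaks = seq(50, 100, by = 10), include.lowest = TRUE)

# Interval labels and the number of values in each
print(levels(bins))
print(table(bins))
```

Intervals are closed on the right by default, so a value of exactly 60 is counted in the first bin rather than the second.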

Key Takeaways:

Higher Attendance Generally Leads to Higher Scores

The top rows (80-100% attendance) have darker red shades, meaning students who attend class more often tend to score higher.

Lower attendance groups (50-60%) still have some strong scores but generally show more variation.

Study Hours Have a Mixed Impact on Scores

Some students who study less (5-10 hours per week) still achieve high scores, but higher study hours (20-25 per week) also show strong performance in certain cases. This suggests that just studying more doesn’t guarantee higher scores; other factors like class participation or study quality may matter.

Strongest Performers: High Attendance + Moderate Study Hours

The highest scores (dark red areas) appear when students have both high attendance (80-100%) and study around 10-20 hours per week. This suggests that a balance of attending class and consistent studying leads to better performance.

Lower Scores in Some High Study Groups

Some students with high study hours (25-30 per week) and moderate attendance (60-80%) still have lower scores (lighter shades). This could mean that studying long hours without attending class regularly might not be the most effective approach.

Conclusions:

Attendance appears to be one of the factors most strongly associated with higher scores. Studying more helps, but there is a limit; quality of study may be more important than just the number of hours.

Students with both high attendance and consistent study habits tend to perform the best. If a student is studying a lot but still getting low scores, they might need to improve their study methods rather than just increasing hours.
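A heatmap conveys these patterns visually, but a correlation test can put a number on them. The sketch below uses simulated data in which attendance is built to influence the score and study hours are not, so its output says nothing about the real dataset; the data frame `sim` and its column names are assumptions for illustration.

```r
# Simulated data for illustration only (not the real student records)
set.seed(42)
n <- 200
sim <- data.frame(
  attendance = runif(n, 50, 100),
  study_hours = runif(n, 5, 30)
)
# In this simulation the score depends on attendance plus random noise
sim$total_score <- 40 + 0.4 * sim$attendance + rnorm(n, sd = 8)

# Pearson correlation of each predictor with the score
att_test <- cor.test(sim$attendance, sim$total_score)
study_test <- cor.test(sim$study_hours, sim$total_score)

print(att_test$estimate)
print(study_test$estimate)
```

On the real data, the equivalent calls would use `clean_data$Total_Score` together with the attendance and study-hours columns.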

Faceted histogram plot

# Histogram of weekly study hours, faceted by department (rows) and grade (columns)
study_hours_plot <- ggplot(clean_data, aes(x = Study_Hours_per_Week, fill = Grade)) +
  geom_histogram(bins = 10, position = position_dodge(), alpha = 0.7) +  
  facet_grid(Department ~ Grade) +  
  scale_fill_brewer(palette = "Set1") +  
  labs(title = "Distribution of Study Hours by Department and Grade",
       x = "Study Hours per Week",
       y = "Frequency") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        strip.text.x = element_text(size = 8),
        strip.text.y = element_text(size = 8))
print(study_hours_plot)

Insights from the Visualization: Faceted histogram plot

Study Hours by Department and Grade

This visualization shows how many hours students study per week across different departments (Business, CS, Engineering, Mathematics) and how study time relates to grades (A, B, C, D, F).

Key Takeaways:

Students with A Grades Study More but Not Excessively

In most departments, students who get A grades (red) tend to study consistently in the 10-30 hour range per week.

There is no clear pattern of extreme studying (close to the 30-hour maximum in the data) leading to A grades, meaning quality of study may matter more than just studying longer.

F Grades Appear at Both Low and High Study Hours

Students who received F grades (orange) are present at both very low (less than 10 hours) and very high study hours (close to 30 hours). This suggests that studying too little is associated with failure, but studying a lot without the right methods might also not be effective.

D and F Grades Have a More Spread-Out Study Pattern

D (purple) and F (orange) grade students have a wider spread of study hours, meaning there is no single pattern for struggling students. Some D students study around 20 hours but still don’t perform well, suggesting that other factors like participation, attendance, or study methods could be important.

B and C Grades Show Balanced Study Hours

B grade students (blue) have a fairly even distribution of study hours, often between 10-25 hours per week.

C grade students (green) also show a mix but are more spread out, suggesting that they might not have a consistent study pattern.

Differences by Department

CS and Engineering students show a wider range of study hours for all grades, indicating more variation in how students manage their time. Mathematics and Business students have fewer high study hour cases, meaning they might achieve their grades with a more consistent study routine.
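Reading the spread of study hours off twenty small histograms is error-prone, so a numeric summary per grade is a useful cross-check. This is a minimal sketch in base R on invented values; `toy` is a hypothetical stand-in for `clean_data`.

```r
# Invented study hours for three grades
toy <- data.frame(
  Grade = c("A", "A", "B", "B", "F", "F"),
  Study_Hours_per_Week = c(20, 24, 15, 17, 6, 28)
)

# Mean, minimum, and maximum study hours within each grade
mean_hours <- tapply(toy$Study_Hours_per_Week, toy$Grade, mean)
min_hours <- tapply(toy$Study_Hours_per_Week, toy$Grade, min)
max_hours <- tapply(toy$Study_Hours_per_Week, toy$Grade, max)

summary_table <- data.frame(
  Grade = names(mean_hours),
  mean_hours = as.numeric(mean_hours),
  min_hours = as.numeric(min_hours),
  max_hours = as.numeric(max_hours)
)
print(summary_table)
```

If the per-grade means and ranges in the real data turn out to be nearly identical, that would be a sign that the histograms show noise rather than distinct study patterns.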

Trellis Pie Charts

# Trellis pie charts: one pie per department and gender.
# plotly and dplyr are already loaded at the top of the document.

# Count students in each Department x Gender x Grade combination
grades_summary <- clean_data %>%
  group_by(Department, Gender, Grade) %>%
  summarise(Count = n(), .groups = 'drop')

# Determine the number of unique departments
num_departments <- length(unique(grades_summary$Department))

# Adding normalized coordinates for pie chart placement
grades_summary <- grades_summary %>%
  mutate(xmin = (as.numeric(as.factor(Department)) - 1) / num_departments,
         xmax = as.numeric(as.factor(Department)) / num_departments,
         ymin = if_else(Gender == "Male", 0.5, 0),
         ymax = if_else(Gender == "Male", 1, 0.5))

# Add one pie per department and gender, placed using the coordinates above
plot <- plot_ly() 
for(dept in unique(grades_summary$Department)) {
  for(gender in unique(grades_summary$Gender)) {
    subset_data <- grades_summary %>%
      filter(Department == dept, Gender == gender)
    plot <- plot %>%
      add_pie(data = subset_data, labels = ~Grade, values = ~Count,
              domain = list(x = c(min(subset_data$xmin), max(subset_data$xmax)),
                            y = c(min(subset_data$ymin), max(subset_data$ymax))),
              textinfo = 'label+percent', hoverinfo = 'label+percent+name',
              name = paste(dept, gender))
  }
}

plot <- plot %>%
  layout(title = 'Distribution of Grades by Department and Gender',
         showlegend = TRUE)

# Display the plot
plot

Insights from the Visualization: Trellis Pie Charts

(You can hover over the Charts for details)

Distribution of Grades by Department and Gender

This visualization shows grade distributions by department and gender, with the top row representing male students and the bottom row representing female students. Each pie chart breaks down the percentage of students receiving grades A, B, C, D, and F across different departments.

Key Observations & Gender-Based Insights:

Higher A Grades for Females: Across most departments, female students have a slightly higher percentage of A grades compared to male students.

For example, in some departments, females have 33.1%-33.2% A grades, while males are slightly lower at 30.6%-31.3%.

This suggests that female students may be performing better academically overall.

Higher D and F Grades for Males: Males tend to have a slightly higher proportion of D and F grades compared to females. This is evident in several departments where male students’ D and F grades make up a larger portion of the distribution compared to females in the same department. This could indicate differences in study habits, engagement, or external factors affecting performance.

Balanced B and C Grade Distribution: The percentage of B and C grades appears relatively similar between males and females across most departments.

This suggests that both genders have a comparable middle-performing student group.

More Consistency Among Female Students: Female students have a more consistent distribution of grades, with fewer extreme variations across different departments. The distribution of grades among male students fluctuates more, with some departments having higher D and F rates.

Potential Factors Influencing Gender Differences:

Study Habits: Female students may be spending more time studying, leading to slightly better performance.

Class Engagement: Participation scores and engagement levels could vary by gender.

Course Difficulty or Interest: Certain majors might attract more male or female students, leading to different performance trends.

External Responsibilities: Factors like work, stress, or extracurricular involvement could impact male and female students differently.

Possible Conclusions:

Female students tend to have a higher percentage of A grades and lower D/F grades, suggesting stronger academic performance overall.

Male students have more variation in grades, with some departments showing a higher percentage of students struggling with lower grades.

The difference is not extreme, but it is noticeable and could warrant further investigation into study habits, course engagement, and external factors affecting student performance.
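Because the gaps between the pies are only a few percentage points, it is worth checking whether they exceed what chance alone would produce. One common option is a chi-squared test of independence between gender and grade. The counts below are made up for illustration and are not taken from the dataset.

```r
# Hypothetical grade counts by gender (not the real figures)
grade_counts <- matrix(
  c(130, 110, 90, 50, 20,
    120, 112, 92, 54, 22),
  nrow = 2, byrow = TRUE,
  dimnames = list(Gender = c("Female", "Male"),
                  Grade = c("A", "B", "C", "D", "F"))
)

# Row percentages, comparable to the slices of each pie
row_pct <- round(100 * prop.table(grade_counts, margin = 1), 1)
print(row_pct)

# Test whether the grade distribution differs by gender
result <- chisq.test(grade_counts)
print(result)
```

With the real data, the input would be `table(clean_data$Gender, clean_data$Grade)`; a large p-value would mean the observed differences are compatible with random variation.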

Dual-axis scatter plot with connecting lines

# Aggregate data by Department
aggregated_data <- clean_data %>%
  group_by(Department) %>%
  summarise(Avg_Midterm_Score = mean(Midterm_Score, na.rm = TRUE),
            Avg_Final_Score = mean(Final_Score, na.rm = TRUE))

# Create the plot
plot <- ggplot(aggregated_data, aes(x = reorder(Department, -Avg_Midterm_Score))) +
  geom_point(aes(y = Avg_Midterm_Score, color = "Midterm Score"), size = 4, shape = 16) +
  geom_point(aes(y = Avg_Final_Score, color = "Final Score"), size = 4, shape = 17) +
  geom_line(aes(y = Avg_Midterm_Score, group = 1, color = "Midterm Score"), linetype = "solid") +
  geom_line(aes(y = Avg_Final_Score, group = 1, color = "Final Score"), linetype = "dashed") +
  scale_y_continuous(name = "Average Score",
                     sec.axis = sec_axis(~ ., name = "Average Final Score (dashed line)")) +
  labs(x = "Department", color = "Score Type") +
  theme_minimal() +
  theme(legend.position = "top",
        legend.title = element_blank(),
        text = element_text(size = 12),
        axis.title = element_text(size = 14))

# Print the plot
print(plot)

Insights from the Visualization: Dual-axis scatter plot with connecting lines

Midterm vs. Final Scores by Department

This chart compares average midterm scores (solid blue line) and final scores (dashed red line with triangles) across different departments (Mathematics, Engineering, CS, Business).

Key Takeaways:

Midterm Scores Are Highest in Mathematics

The departments on the x-axis are sorted by descending average midterm score, so the midterm line falls from left to right by construction: Mathematics has the highest average midterm score, followed by Engineering, CS, and Business.

This suggests that students in Mathematics tend to perform better in midterms compared to other departments.

Final Scores Show More Variation

Unlike midterms, final scores do not follow a clear trend; relative to midterms they increase in some departments (Engineering, Business) and decrease in others (Mathematics, CS). Business students show a noticeable jump in final scores, while Mathematics students see the lowest final scores.

Departments with Midterm to Final Score Drops

Mathematics and CS students’ final scores are lower than their midterm scores, suggesting that they may struggle more in final exams. This could mean that finals are harder than midterms in these subjects, or students do not prepare as well for them.

Departments with Midterm to Final Score Increases

Engineering and Business students improve from midterm to final exams, showing an upward trend.

This suggests that students in these departments may adapt better to finals, possibly due to better study strategies or easier final exams.

Conclusions:

Mathematics and CS students may need more support for final exams, as their scores drop compared to midterms.

Business students perform better in finals than midterms, which could mean better preparation, an easier final exam, or improved understanding over time. Engineering students also improve in final exams, though not as drastically as Business students.
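The chart compares department averages, but each student has both a midterm and a final score, so the change can also be computed per student and then averaged. The sketch below does this on invented scores; `toy` is a hypothetical stand-in for `clean_data`.

```r
# Invented scores for two departments
toy <- data.frame(
  Department = c("CS", "CS", "CS", "Business", "Business", "Business"),
  Midterm_Score = c(70, 80, 65, 60, 75, 68),
  Final_Score = c(66, 78, 60, 72, 70, 74)
)

# Per-student change from midterm to final
toy$Change <- toy$Final_Score - toy$Midterm_Score

# Average change within each department
mean_change <- tapply(toy$Change, toy$Department, mean)
print(mean_change)

# Paired t-test across all students: is the mean change different from zero?
paired <- t.test(toy$Final_Score, toy$Midterm_Score, paired = TRUE)
print(paired)
```

Running the paired test separately within each department of the real data would indicate whether the drops and gains described above are larger than the student-to-student variation.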

Plot lines

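# Average total score for each department and family income level
average_scores_by_income <- clean_data %>%
  group_by(Department, Family_Income_Level) %>%
  summarise(Average_Score = mean(Total_Score, na.rm = TRUE), .groups = 'drop')

# One color per income level; unnamed values are assigned in the alphabetical
# order of the levels (High, Low, Medium)
colors_mixed <- c("#2171b5", "#cb181d", "#238b45")

# Plotting the data
ggplot(average_scores_by_income, aes(x = Department, y = Average_Score, group = Family_Income_Level, color = Family_Income_Level)) +
  # linewidth replaces size for lines in ggplot2 3.4.0 and later
  geom_line(linewidth = 2) +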
  geom_point(size = 4, shape = 21, fill = "white") +  
  labs(title = "Average Total Scores by Department and Family Income Level",
       x = "Department",
       y = "Average Total Score",
       color = "Family Income Level") +
  theme_minimal() +
  scale_color_manual(values = colors_mixed)

Insights from the Visualization: Plot Lines

Average Total Scores by Department and Family Income Level

This chart shows how average total scores vary across different departments (Business, CS, Engineering, Mathematics) based on family income level (High, Medium, Low).

Key Takeaways:

Students from Low-Income Families Perform Best Overall

In most departments, students from low-income families (red line) have the highest total scores, especially in Mathematics and Engineering.

This could suggest that students from low-income backgrounds put in extra effort to achieve high scores.

High-Income Students Have the Lowest Scores in Mathematics and Business

The blue line (high-income students) drops sharply in Mathematics and Business, showing that these students tend to score lower in these fields.

This might indicate less academic pressure or different study habits among high-income students.

Medium-Income Students Perform Steadily

The green line (medium-income students) stays fairly balanced, showing consistent performance across departments.

They do not have extreme highs or lows, meaning they might have more stable academic habits.

CS Department Has the Most Balanced Scores Across Income Levels

In CS, all income groups perform similarly, with little difference between high, medium, and low-income students. This might suggest equal access to resources, study habits, or grading fairness in CS.

Conclusions:

Students from low-income families tend to have higher scores, possibly due to greater motivation or harder work.

High-income students perform well in CS but struggle in Mathematics and Business, which could mean less academic pressure or different priorities.

Medium-income students have steady performance across all departments, showing a balanced academic approach.

More research is needed to see if external factors (like access to tutors, work responsibilities, or school support) affect these trends.
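Differences between group averages are easier to judge next to the group sizes and standard errors. The following sketch computes both on invented scores and then runs a one-way ANOVA; the values in `toy` are hypothetical and carry no information about the real dataset.

```r
# Invented total scores for three income levels
toy <- data.frame(
  Family_Income_Level = c("Low", "Low", "Low",
                          "High", "High", "High",
                          "Medium", "Medium", "Medium"),
  Total_Score = c(78, 74, 82, 70, 76, 73, 75, 77, 73)
)

# Group size, mean, and standard error of the mean
n_by_group <- tapply(toy$Total_Score, toy$Family_Income_Level, length)
mean_by_group <- tapply(toy$Total_Score, toy$Family_Income_Level, mean)
sd_by_group <- tapply(toy$Total_Score, toy$Family_Income_Level, sd)

income_summary <- data.frame(
  Family_Income_Level = names(mean_by_group),
  n = as.integer(n_by_group),
  mean_score = as.numeric(mean_by_group),
  se = as.numeric(sd_by_group / sqrt(n_by_group))
)
print(income_summary)

# One-way ANOVA: do the group means differ more than expected by chance?
fit <- aov(Total_Score ~ Family_Income_Level, data = toy)
print(summary(fit))
```

If the gaps between the lines in the chart are within one or two standard errors, interpretations such as extra effort or lower academic pressure would rest on weak evidence.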

Findings

After analyzing all the visualizations, we can conclude the following key findings:

Attendance Matters – Students with higher attendance tend to have better grades, which indicates that showing up to class is important for success.

Study Hours Should Be Balanced – Studying too little leads to lower grades, but studying too much doesn’t always mean better scores. The quality of study matters more than just time spent.

Grades Vary by Department – Some departments, like Mathematics and CS, show a drop in scores from midterms to finals, meaning students may struggle with final exams more. Engineering and Business students improve in finals, suggesting better preparation or easier exams.

Income Level Affects Performance – Low-income students tend to have the highest scores, possibly due to higher motivation. High-income students perform worse in some areas (like Mathematics and Business), which could mean less academic pressure or different study habits.

Gender Differences in Grades – Female students tend to get more A grades, while male students have more D and F grades. This suggests different learning approaches or levels of engagement between genders.

To explore these findings further, here are some key questions to ask:

What other factors influence grades? – Could stress levels, sleep, or extracurricular activities affect performance?

Why do some departments see score drops from midterms to finals? – Are final exams harder, or do students not prepare well?

How do study methods impact grades? – Are students who study smarter (not longer) performing better?

What support can improve struggling students’ performance? – Should there be tutoring, attendance incentives, or study workshops to help?

Why do low-income students score higher? – Are they more motivated, or do they use better study habits?
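Several of these questions ask how multiple factors relate to scores at the same time, which a multiple linear regression can begin to address. The sketch below fits one on simulated data in which only attendance is built to matter; the data frame `sim`, its column names, and the coefficients are assumptions for illustration.

```r
# Simulated data for illustration only (not the real student records)
set.seed(1)
n <- 300
sim <- data.frame(
  stress = sample(1:10, n, replace = TRUE),
  sleep_hours = runif(n, 4, 9),
  attendance = runif(n, 50, 100)
)
# In this simulation only attendance influences the score
sim$total_score <- 50 + 0.3 * sim$attendance + rnorm(n, sd = 10)

# Regress the score on all three candidate factors at once
fit <- lm(total_score ~ stress + sleep_hours + attendance, data = sim)

# Estimates, standard errors, t-values, and p-values per predictor
coef_table <- summary(fit)$coefficients
print(round(coef_table, 3))
```

On the real data the predictors would be columns such as `Stress_Level (1-10)` and `Sleep_Hours_per_Night`. A regression of this kind describes associations and cannot by itself establish why a group scores higher.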

This dataset gives us a lot of useful information about what affects student success. We saw that attendance, study habits, gender, and income level are all associated with grades. Some students perform better in finals, while others struggle, showing that different students may need different types of support. Studying more doesn’t always mean better grades; how students study is just as important as how long they study.

The data also challenges assumptions. Low-income students often perform well, which suggests that resources alone do not determine outcomes; motivation and effort may matter as well, although this dataset does not measure them directly. Meanwhile, some departments see big changes between midterm and final scores, suggesting that preparation and test difficulty matter.

Going forward, this data could help schools and educators create better strategies to support students. By focusing on improving study habits, encouraging attendance, and understanding different learning needs, schools can help students succeed, no matter their background.