Python Assignment (Module 2)

Description of Dataset:

The dataset I used for my project was the Student Performance & Behavior Dataset. This dataset contains 5,000 student records from a private learning provider, capturing key details about their academic performance, study habits, and background. It includes scores from midterms, finals, assignments, quizzes, and projects, along with factors like attendance, study hours, stress levels, sleep, and family income.

Visualization 1

import os
os.environ['QT_QPA_PLATFORM_PLUGIN_PATH'] = '/Users/ayomide3/Anaconda3/Library/plugins/platforms'

import pandas as pd

file_path = "/Users/ayomide3/Downloads/archive/StudentsDataset.csv"

df = pd.read_csv(file_path)

# Visualization 1
import matplotlib.pyplot as plt

pie_df = df['Department'].value_counts()

colors = plt.cm.Paired.colors  


fig = plt.figure(figsize = (7,7))
fig.add_subplot(1, 1, 1)
plt.pie(pie_df, labels = pie_df.index, autopct = '%1.1f%%', startangle = 140, colors = colors, wedgeprops={'edgecolor': 'white'})

center_circle = plt.Circle((0,0), 0.3, fc = 'white')
plt.gca().add_artist(center_circle)

plt.title('Student Distribution by Department')
plt.show()

The graph above is a donut chart of the student distribution across their respective departments. Computer Science is the largest department, with approximately 40.4% of the student population. Engineering is second with 29.4%, followed by Business with 20.1%, while Mathematics is the smallest with 10.1%.
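The percentages printed on the chart can also be checked numerically. The snippet below is a minimal sketch that assumes a small made-up `sample` DataFrame in place of the real dataset; only the `Department` column name comes from the code above.

```python
import pandas as pd

# Hypothetical stand-in for the real dataset: 10 students in 4 departments.
sample = pd.DataFrame({
    "Department": ["CS"] * 4 + ["Engineering"] * 3 + ["Business"] * 2 + ["Mathematics"],
})

# normalize=True returns fractions instead of raw counts;
# multiplying by 100 converts them to percentages.
dept_share = sample["Department"].value_counts(normalize=True).mul(100).round(1)
print(dept_share)
```

Running the same two lines on `df['Department']` should reproduce the labels that `autopct` draws on the wedges.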

Visualization 2

import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

fig = plt.figure(figsize=(18, 10))
ax1 = fig.add_subplot(1, 1, 1)
ax2 = ax1.twinx()

top_25 = df.sort_values(by='Attendance (%)', ascending=False).head(25)

bar_width = 0.4
x = np.arange(len(top_25))


ax1.bar(x - bar_width/2, top_25['Attendance (%)'], bar_width, color='skyblue', label='Attendance (%)')
ax1.set_ylabel('Attendance (%)', color='skyblue')
ax1.tick_params(axis='y', labelcolor='skyblue')


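# Reuse the twin axis created above; calling twinx() again would add a redundant third axis
ax2.bar(x + bar_width/2, top_25['Total_Score'], bar_width, color='gray', label='Total Score')
# Match the axis label and tick colors to the bar color
ax2.set_ylabel('Total Score', color='gray')
ax2.tick_params(axis='y', labelcolor='gray')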


ax1.set_xlabel('Student ID')
ax1.set_xticks(x)
ax1.set_xticklabels(top_25['Student_ID'], rotation=45)
plt.title('Top 25 Students: Attendance vs Total Score')

fig.tight_layout()
plt.show()

The graph above is a dual-axis bar chart comparing attendance and total score for the top 25 students by attendance. Because the dataset is fairly large, more than 25 students have an attendance rate of 100%; however, only one of them had a total score of 100%. Additionally, five other students had a total score lower than 60%. The majority of the students had a total score between 60% and 80%, with about 7 students exceeding 80%. Overall, I concluded that, among these students, attendance does not appear to correlate with total score, and other factors such as hours of sleep and hours spent studying may play a role.
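A chart of 25 students can only suggest a relationship, so one way to check the conclusion is to compute the Pearson correlation over all rows. The following is a rough sketch that uses randomly generated, independent values as a stand-in for the real columns; the data and the variable `r` are illustrative assumptions, not results from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical data: two independent columns, so the correlation should be near 0.
rng = np.random.default_rng(0)
sample = pd.DataFrame({
    "Attendance (%)": rng.uniform(50, 100, size=500),
    "Total_Score": rng.uniform(50, 100, size=500),
})

# Series.corr computes the Pearson correlation coefficient by default.
r = sample["Attendance (%)"].corr(sample["Total_Score"])
print(f"Pearson r = {r:.3f}")
```

Applied to `df`, a value of `r` close to 0 would support the reading of the bar chart, while a value close to 1 or -1 would contradict it.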

Visualization 3

import matplotlib.pyplot as plt
import seaborn as sns

numerical_cols = ['Attendance (%)', 'Midterm_Score', 'Final_Score', 'Assignments_Avg', 'Quizzes_Avg', 
                  'Participation_Score', 'Projects_Score', 'Total_Score', 'Study_Hours_per_Week', 
                  'Stress_Level (1-10)', 'Sleep_Hours_per_Night']

corr_matrix = df[numerical_cols].corr()

plt.figure(figsize=(10,6))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt='.2f', linewidths=0.5)

plt.title('Correlation Heatmap of Academic Performance Variables')
plt.show()

The graph above is a heatmap of the correlations between the academic performance variables. Excluding the diagonal, where each variable is compared with itself, the heatmap shows little to no correlation between the variables. This contradicts my earlier hypothesis that the students' total scores are affected by variables such as hours of sleep and study hours per week, because none of the off-diagonal correlations exceeds 0.02.
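Reading the largest off-diagonal value from a heatmap by eye is error-prone, so it can help to compute it directly. This is a minimal sketch on randomly generated data; the column names are borrowed from the list above, but the values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: four independent columns named after variables in the report.
rng = np.random.default_rng(1)
cols = ["Total_Score", "Study_Hours_per_Week", "Sleep_Hours_per_Night", "Stress_Level (1-10)"]
sample = pd.DataFrame(rng.normal(size=(200, 4)), columns=cols)

corr = sample.corr()

# Hide the diagonal (each variable against itself, always 1.0) before searching.
off_diag_mask = ~np.eye(len(corr), dtype=bool)
off_diag = corr.where(off_diag_mask)

# Largest absolute correlation between two different variables.
max_abs = off_diag.abs().max().max()
print(f"Largest off-diagonal |r| = {max_abs:.3f}")
```

The same masking step applied to `corr_matrix` would give the exact figure to compare against the 0.02 threshold mentioned above.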

Visualization 4

import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

bins = [0, 5, 10, 15, 20, 25, 30]  
labels = ['0-5', '6-10', '11-15', '16-20', '21-25', '26-30']

df['Study_Hours_Binned'] = pd.cut(df['Study_Hours_per_Week'], bins=bins, labels=labels, include_lowest=True)

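# observed=False keeps every bin in the result, even an empty one, and avoids a pandas FutureWarning
study_hours_avg = df.groupby('Study_Hours_Binned', observed=False)[['Midterm_Score', 'Final_Score', 'Total_Score']].mean()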

plt.figure(figsize=(10,5))
plt.plot(study_hours_avg.index, study_hours_avg['Midterm_Score'], marker='o', label='Midterm Score')
plt.plot(study_hours_avg.index, study_hours_avg['Final_Score'], marker='o', label='Final Score')
plt.plot(study_hours_avg.index, study_hours_avg['Total_Score'], marker='o', label='Total Score')

plt.xlabel('Study Hours per Week (Binned)')
plt.ylabel('Average Scores')
plt.title('Study Hours vs Academic Performance (Binned)')
plt.legend()
plt.grid(True)
plt.show()

The graph above is a binned multiple-line plot of study hours per week against the mean midterm, final, and total scores. Surprisingly, students who studied 0-5 hours had the highest mean total score (approximately 76%) and the highest mean midterm score (approximately 74%). The remaining bins had a mean total score of about 74%, while the mean midterm score dropped from 74% in the 0-5 bin to between 70% and 71% in the other bins. The final scores were more varied: bins 11-15 and 21-25 had the highest means, a little above 70%; bin 0-5 had a mean of 65%; and bins 6-10, 16-20, and 26-30 averaged around 69%.
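A bin mean is only as reliable as the number of students behind it, so reporting the count next to each mean is a useful check on a surprising result like the 0-5 bin. The sketch below uses ten made-up rows; the bin edges and labels are the ones from the code above. Note that `pd.cut` intervals are closed on the right, so the bin labelled '6-10' actually covers values greater than 5 and up to 10.

```python
import pandas as pd

# Hypothetical rows standing in for the real dataset.
sample = pd.DataFrame({
    "Study_Hours_per_Week": [2, 4, 7, 9, 12, 14, 18, 22, 27, 29],
    "Total_Score": [76, 74, 70, 72, 75, 73, 71, 74, 69, 77],
})

bins = [0, 5, 10, 15, 20, 25, 30]
labels = ['0-5', '6-10', '11-15', '16-20', '21-25', '26-30']
sample["Study_Hours_Binned"] = pd.cut(
    sample["Study_Hours_per_Week"], bins=bins, labels=labels, include_lowest=True
)

# Report the mean together with the number of students in each bin.
summary = sample.groupby("Study_Hours_Binned", observed=False)["Total_Score"].agg(["mean", "count"])
print(summary)
```

If one bin in the real data turns out to hold far fewer students than the others, its mean deserves less weight in the interpretation.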

Visualization 5

import matplotlib.pyplot as plt
import pandas as pd

grade_dist = df.groupby(['Department', 'Grade']).size().unstack()

grade_dist.plot(kind='bar', stacked=True, figsize=(10,6), colormap='plasma')

plt.xlabel('Department')
plt.ylabel('Number of Students')
plt.title('Letter Grades by Department')
plt.legend(title='Grade')
plt.xticks(rotation=45)

plt.show()

The final graph above is a stacked bar chart of letter grades by department. In every department, A is the largest grade group, with over 500 CS students earning an A. Unsurprisingly, since it is the smallest department, Mathematics had the fewest students in each letter grade, with grades B, C, D, and F fairly evenly distributed within the department. In the other departments, the letter grades other than A are also fairly evenly distributed. We can infer from this graph that the majority of the students within each department did not earn an A.
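Because the departments differ so much in size, raw counts make them hard to compare. One option is to convert each department's row to proportions. This is a small sketch on six made-up rows; only the `Department` and `Grade` column names come from the code above.

```python
import pandas as pd

# Hypothetical rows standing in for the real dataset.
sample = pd.DataFrame({
    "Department": ["CS", "CS", "CS", "CS", "Mathematics", "Mathematics"],
    "Grade": ["A", "A", "B", "C", "A", "B"],
})

# normalize="index" divides each row by its total, so every department sums to 1.
grade_share = pd.crosstab(sample["Department"], sample["Grade"], normalize="index")
print(grade_share)
```

Plotting `grade_share` with `kind='bar', stacked=True` would give bars of equal height, which makes the share of A grades directly comparable across departments.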