The dataset used for these visualizations records various factors in students’ lives, including academic, social, socioeconomic, and lifestyle factors, as well as their performance on exams. Some of the pivotal variables are exam score, motivation level, parental involvement, attendance, and sleep hours. Using these variables, I built several visualizations to show how they relate to exam performance.
Looking at this bar chart, we can see that total hours studied peaks around exam scores of ~65-70. This makes sense: most students score in this range, so even with typical study hours per student, their combined total is the highest.
It also makes sense that the total drops off on either side of this range. Since few students score above 90, their combined total is lower even if each of them studies more than a student scoring 70. The same logic applies to students scoring 60 or under.
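The reasoning above comes down to the fact that each bar's total is the number of students at that score multiplied by their mean hours studied. A minimal sketch of that decomposition follows; the `toy` values are invented purely for illustration and are not taken from the real CSV.

```python
import pandas as pd

# Toy data, invented for illustration only (not from StudentPerformanceFactors.csv).
toy = pd.DataFrame({
    "Exam_Score":    [60, 60, 68, 68, 68, 68, 68, 68, 95],
    "Hours_Studied": [15, 17, 20, 19, 21, 20, 22, 18, 30],
})

summary = toy.groupby("Exam_Score")["Hours_Studied"].agg(
    Count="count", Hours_Mean="mean", Hours_Sum="sum"
).reset_index()

# The total is always count x mean, so a rare score with a high mean
# can still end up with a small total.
summary["Count_x_Mean"] = summary["Count"] * summary["Hours_Mean"]
print(summary)
```

In this toy frame the single student at 95 has the highest mean hours but the smallest total, which is the same pattern described for the tails of the chart.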
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
import numpy as np
import pandas as pd
from matplotlib.ticker import FuncFormatter
path = '/Users/willtroisi/PyCharmMiscProject/sources/StudentPerformanceFactors.csv'
dfTrue = pd.read_csv(path)
df = dfTrue[['Exam_Score', 'Hours_Studied', 'Previous_Scores']]
x = df.groupby(['Exam_Score']).agg({'Exam_Score':['count'], 'Hours_Studied':['sum', 'mean'], 'Previous_Scores':['sum', 'mean']}).reset_index()
x.columns = ['Exam_Score', 'Count', 'Hours_Sum', 'Hours_Mean', 'PrevScore_Sum', 'PrevScore_Mean']
#Vertical Bar Chart
def pick_colors_according_to_mean_count(this_data):
    # Color each bar by how its student count compares with the mean count (within 1% counts as average)
    colors = []
    avg = this_data['Count'].mean()
    for each in this_data['Count']:
        if each > avg * 1.01:
            colors.append('lightblue')
        elif each < avg * 0.99:
            colors.append('salmon')
        else:
            colors.append('orange')
    return colors
bottom1 = 1
top1 = 100
d1 = x.loc[bottom1:top1]
my_colors = pick_colors_according_to_mean_count(d1)
fig = plt.figure(figsize = (18,16))
ax1 = fig.add_subplot(1,1,1)
ax1.bar(d1.Exam_Score, d1.Hours_Sum, color = my_colors)
plt.axhline(d1.Hours_Sum.mean(), color='black', linestyle='dashed')
ax1.spines['right'].set_visible(False)
ax1.spines['top'].set_visible(False)
ax1.axes.get_xaxis().tick_bottom()
ax1.set_title('Total Hours Studied by Exam Score', size=20)
# Legend labels follow the coloring rule above: bars are colored by student count relative to the mean count
above_patch = mpatches.Patch(color='lightblue', label='Above average student count')
near_patch = mpatches.Patch(color='orange', label='Within 1% of average student count')
below_patch = mpatches.Patch(color='salmon', label='Below average student count')
ax1.legend(handles=[above_patch, near_patch, below_patch], fontsize=15)
ax1.text(top1-10, d1.Hours_Sum.mean()+5, f'Mean = {d1.Hours_Sum.mean():,.1f}', rotation=0, fontsize=14)
comma_fmt = FuncFormatter(lambda val, pos: format(int(val), ','))  # To add commas to the labels on the y axis
ax1.yaxis.set_major_formatter(comma_fmt)
plt.show()
Looking at this visualization, it’s important to note a few things. First, very few students scored 57 or under on this exam. Second, students at every score of 65 or under had roughly the same average score of ~70 on the previous exam. In other words, for students scoring 65 or under, the previous exam score shows no real correlation with how they performed on this exam.
That said, because some of the lowest scores on this chart have so few observations, their averages may not be reliable for students scoring 57 or less.
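One way to make the small-sample caveat concrete is to report each group's size and the standard error of its mean next to the mean itself. The sketch below is only an illustration: `flag_small_groups` is a hypothetical helper, the `min_count` threshold is an arbitrary assumption, and the `toy` values are invented.

```python
import pandas as pd

def flag_small_groups(df, group_col, value_col, min_count=30):
    """Per-group mean of value_col, its standard error, and a size flag.

    min_count is an arbitrary cutoff chosen for illustration.
    """
    out = df.groupby(group_col)[value_col].agg(
        Count="count", Mean="mean", Std="std"
    ).reset_index()
    # Standard error of the mean: grows as the group gets smaller.
    out["Std_Err"] = out["Std"] / out["Count"] ** 0.5
    out["Reliable"] = out["Count"] >= min_count
    return out

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "Exam_Score":      [55, 55, 65, 65, 65, 65, 65, 65],
    "Previous_Scores": [50, 90, 68, 72, 70, 69, 71, 70],
})
report = flag_small_groups(toy, "Exam_Score", "Previous_Scores", min_count=5)
print(report)
```

Both toy groups average 70 on the previous exam, but the two-student group has a far larger standard error, so its average says much less.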
#Dual Axis Bar Chart
bottom2 = 0
top2 = 10
d2 = x.loc[bottom2:top2]  # .loc slicing is inclusive, so this keeps the 11 lowest exam scores
fig = plt.figure(figsize=(18, 16))
ax1 = fig.add_subplot(1, 1, 1)
ax2 = ax1.twinx()
bar_width = 0.4
x_pos = np.arange(len(d2))
ax1.bar(x_pos - (0.5 * bar_width), d2['Count'], width=bar_width, color='lightblue', label='Count')
ax2.bar(x_pos + (0.5 * bar_width), d2['PrevScore_Mean'], width=bar_width, color='salmon', label='Avg Previous Score')
ax1.set_xticks(x_pos)
ax1.set_xticklabels(d2['Exam_Score'], rotation=45, fontsize=8)
ax1.set_ylabel('Number of Students', fontsize=18, color='black')
ax2.set_ylabel('Average Previous Score', fontsize=18, color='black')
ax1.set_title('Student Count vs Average Previous Score (Lowest 11 Exam Scores)', fontsize=30)
lines1, labels1 = ax1.get_legend_handles_labels()
lines2, labels2 = ax2.get_legend_handles_labels()
ax1.legend(lines1 + lines2, labels1 + labels2, fontsize=20, loc='upper left', shadow=True)
ax1.tick_params(axis='both', labelsize=18) #These are for making the labels bigger
ax2.tick_params(axis='both', labelsize=18)
plt.show()
Looking at this visualization, there is a clear positive relationship between attendance and exam performance. Additionally, while not a one-to-one correlation, motivation level also appears to have a positive effect on exam performance. This makes intuitive sense, and the chart shows that high motivation combined with high attendance yields higher overall exam performance than low motivation combined with low attendance.
Furthermore, given how close together the three motivation lines are, attendance seems to matter far more than motivation. It’s also important to note that at low attendance levels the lines look more erratic, likely because few students fall into these ranges, which makes the averages noisier.
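The comparison between the two factors can be put into numbers by measuring, for each factor, the gap between its best and worst group mean. This is a rough sketch rather than a formal test: `mean_score_range` is a hypothetical helper and the `toy` values are invented for illustration.

```python
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "Attendance":       [60, 60, 60, 80, 80, 80, 100, 100, 100],
    "Motivation_Level": ["Low", "Medium", "High"] * 3,
    "Exam_Score":       [62, 63, 63, 66, 67, 68, 71, 71, 72],
})

def mean_score_range(df, factor):
    """Gap between the highest and lowest group mean of Exam_Score for one factor."""
    means = df.groupby(factor)["Exam_Score"].mean()
    return means.max() - means.min()

attendance_gap = mean_score_range(toy, "Attendance")
motivation_gap = mean_score_range(toy, "Motivation_Level")
print(f"Attendance gap: {attendance_gap:.2f}, Motivation gap: {motivation_gap:.2f}")
```

On the real data, attendance takes many distinct values, so grouping it into a few bins first (for example with `pd.cut`) would likely give steadier group means at the sparse low end.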
df_multi = dfTrue[['Exam_Score', 'Attendance', 'Motivation_Level']]
grouped = df_multi.groupby(['Motivation_Level', 'Attendance'])['Exam_Score'].mean().reset_index()
fig = plt.figure(figsize=(18, 10))
ax = fig.add_subplot(1, 1, 1)
my_colors = {'High': 'blue',
             'Medium': 'red',
             'Low': 'green'}
for key, grp in grouped.groupby('Motivation_Level'):
    grp_sorted = grp.sort_values('Attendance')
    grp_sorted.plot(ax=ax, kind='line', x='Attendance', y='Exam_Score',
                    color=my_colors[key], label=key, marker='o')
ax.set_title('Avg Exam Score by Attendance Grouped by Motivation Level', size=20)
ax.set_xlabel('Attendance (%)', fontsize=14)
ax.set_ylabel('Avg Exam Score', fontsize=14)
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)
ax.legend(title='Motivation Level', fontsize=13)
plt.show()
This visualization aims to determine whether motivation on its own, without attendance, has an impact on exam performance. We can once again see that it does have an effect, but the effect is minuscule: roughly a 0.4% difference between the High and Medium motivation levels and 0.5% between Medium and Low.
Additionally, since the overall average exam score is 67.2 and the average for Medium is 67.3, it stands to reason that the average motivation level falls between Medium and Low.
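The link between the overall average and the per-level averages is that the overall mean is a count-weighted mean of the group means, so it gets pulled toward whichever levels hold the most students. A minimal sketch, using invented `toy` values rather than the real dataset:

```python
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "Motivation_Level": ["Low"] * 3 + ["Medium"] * 5 + ["High"] * 2,
    "Exam_Score":       [66, 67, 65, 67, 68, 67, 66, 68, 69, 67],
})

stats = toy.groupby("Motivation_Level")["Exam_Score"].agg(Count="count", Mean="mean")

# Weight each group mean by how many students are in the group.
weighted_mean = (stats["Count"] * stats["Mean"]).sum() / stats["Count"].sum()
overall_mean = toy["Exam_Score"].mean()

print(stats)
print(f"Weighted mean of group means: {weighted_mean:.2f}")
print(f"Overall mean:                 {overall_mean:.2f}")
```

Here the overall mean lands just below the Medium mean because the Low group is larger than the High group, which mirrors the argument made about the real data.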
mean_scores = dfTrue.groupby('Motivation_Level')['Exam_Score'].mean()
fig = plt.figure(figsize=(10, 10))
ax = fig.add_subplot(1, 1, 1)
colormap = plt.get_cmap("tab20c")
number_colors = len(mean_scores)
color_ref_number = np.arange(number_colors) * 4
outer_colors = colormap(color_ref_number)
# Each wedge is a level's share of the summed group means; the label converts that share back into the group mean
mean_scores.plot(
    kind='pie', radius=1, colors=outer_colors, pctdistance=0.75, labeldistance=1.1,
    wedgeprops=dict(edgecolor='w', width=0.5), textprops={'fontsize': 18},
    autopct=lambda p: '{:.2f}%\n(Avg: {:.1f})'.format(p, (p / 100) * mean_scores.sum()),
    startangle=90, ax=ax)
hole = plt.Circle((0, 0), 0.3, fc='white')
ax.add_artist(hole)
ax.yaxis.set_visible(False)
ax.text(0, 0, f'Avg\nExam Score\n{dfTrue["Exam_Score"].mean():.1f}', size=18, ha='center', va='center')
ax.set_title('Avg Exam Score by Motivation Level', size=20)
ax.axis('equal')
plt.tight_layout()
plt.show()
This visualization compares Access to Resources and Motivation Level against exam performance and suggests that Access to Resources matters more. While we saw earlier that higher motivation typically goes with higher exam performance, this visualization shows that greater Access to Resources goes with better overall exam performance, and that combining the two gives the best results. We can also see that Low Motivation + High Access to Resources outperforms High Motivation + Low Access to Resources, which further supports this.
The lowest values on the heatmap fall at the lowest level of Access to Resources, and the same holds for Motivation.
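The same comparison can be read off the pivot table directly by averaging along each axis. The sketch below uses a small hand-made table with invented values; the `Low`/`Medium`/`High` labels for both factors are an assumption about how the columns are coded.

```python
import pandas as pd

levels = ["Low", "Medium", "High"]

# Toy table of average exam scores, invented for illustration only.
# Rows are motivation levels, columns are access-to-resources levels.
heat = pd.DataFrame(
    [[65.0, 66.5, 67.5],
     [65.5, 67.0, 68.0],
     [66.0, 67.5, 68.5]],
    index=pd.Index(levels, name="Motivation_Level"),
    columns=pd.Index(levels, name="Access_to_Resources"),
)

# Average change in score when moving from Low to High along each axis.
resource_effect = heat["High"].mean() - heat["Low"].mean()
motivation_effect = heat.loc["High"].mean() - heat.loc["Low"].mean()

# Low motivation with high resources versus high motivation with low resources.
cross_gap = heat.loc["Low", "High"] - heat.loc["High", "Low"]

print(f"Resources effect:  {resource_effect:.2f}")
print(f"Motivation effect: {motivation_effect:.2f}")
print(f"Cross-cell gap:    {cross_gap:.2f}")
```

A positive cross-cell gap together with a larger resources effect is the pattern the heatmap is being read for; with the real data, the same three lines can be run on the grouped table.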
import seaborn as sns
from matplotlib.ticker import FuncFormatter
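heat = dfTrue.groupby(['Motivation_Level', 'Access_to_Resources'])['Exam_Score'].mean().unstack()
# Order both axes Low -> Medium -> High instead of alphabetically (assumes both columns use exactly these three labels)
level_order = ['Low', 'Medium', 'High']
heat = heat.reindex(index=level_order, columns=level_order)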
fig, ax = plt.subplots(figsize=(18, 10))
comma_fmt = FuncFormatter(lambda val, pos: format(int(val), ','))
ax = sns.heatmap(heat, linewidths=0.2, annot=True, cmap='coolwarm', fmt='.1f',
                 square=True, annot_kws={'size': 11},
                 cbar_kws={'format': comma_fmt, 'orientation': 'vertical'},
                 ax=ax)
plt.title('Heatmap of Avg Exam Score by Motivation Level and Access to Resources', fontsize=18)
plt.xlabel('Access to Resources', fontsize=18, labelpad=10)
plt.ylabel('Motivation Level', fontsize=18, labelpad=10)
plt.yticks(rotation=0, size=14)
plt.xticks(size=14)
ax.invert_yaxis()
cbar = ax.collections[0].colorbar
max_count = heat.to_numpy().max()
my_colorbar_ticks = [*range(60, int(max_count), 2)]
cbar.set_ticks(my_colorbar_ticks)
my_colorbar_tick_labels = ['{:,}'.format(each) for each in my_colorbar_ticks]
cbar.set_ticklabels(my_colorbar_tick_labels)
cbar.set_label('Avg Exam Score', rotation=270, fontsize=14, color='black', labelpad=20)
plt.show()
We are now done with the charts. Here are my takeaways from the output.
Based on these visualizations, we can infer a few things about this dataset. First, attendance seems to have a very strong correlation with exam performance, stronger than motivation's. While Motivation is an indicator of exam performance with a generally positive correlation, the effect is minuscule, as the donut chart shows. The heatmap shows that Access to Resources has a stronger correlation with exam performance than Motivation does. We also know that the average exam performance is around 65-70. A surprising takeaway from the dual-axis bar chart is that average or low performance on the current exam did not seem to correlate with performance on the previous exam: students averaged ~70 on the previous exam regardless of where they scored between 55 and 65 on the current one.
Overall, the takeaway is that Motivation surprisingly has little impact on Exam Performance, whereas Attendance and Access to Resources each have a strong positive correlation with it.