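library(tidyverse)  # provides read_delim(), the dplyr verbs, and ggplot2

# The file is semicolon-delimited, so the delimiter is set explicitly
studentperformance <- read_delim("./Portuguese Student.csv", delim = ";")
studentperformance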
Numeric Summary of Data For At Least 2 Columns
Categorical Summary for Mother’s Job
mjob_summary <- studentperformance |>
  group_by(Mjob) |>
  summarise(count = n()) |>
  mutate(percentage = round(count / sum(count) * 100, 2))
print(arrange(mjob_summary, count))
## # A tibble: 5 × 3
##   Mjob     count percentage
##   <chr>    <int>      <dbl>
## 1 health      48        7.4
## 2 teacher     72       11.1
## 3 at_home    135       20.8
## 4 services   136       21.0
## 5 other      258       39.8
This categorical summary shows the count and percentage of students for
each category of mother’s job. The largest group (about 40%) falls in
the ‘other’ category. About 21% of mothers work in ‘services’, which
could include administrative or police work, and another 21% are
stay-at-home mothers. About 11% are teachers and 7.4% work in the
healthcare sector. Overall, about 80% of mothers in this data set are
employed, but roughly half of those are pooled into the ‘other’
category, so we know less about the nature of their work than we do for
the named categories. It will be interesting to explore whether a
mother’s job has any statistically significant relationship with
student performance, absences, and other outcomes.
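As a starting point for that exploration, final grades could be summarised per job category and compared with a rank-based test. The sketch below is only an illustration: the helper `g3_by_mjob()` is hypothetical, the Kruskal-Wallis test is our choice rather than something the report commits to, and the toy data frame is made up so that the code runs on its own. In the report, `studentperformance` would be passed in instead.

```r
library(dplyr)

# Hypothetical helper (not part of the original report):
# count, mean and median of G3 for each category of Mjob
g3_by_mjob <- function(data) {
  data |>
    group_by(Mjob) |>
    summarise(
      Student_Count = n(),
      Mean_G3 = mean(G3),
      Median_G3 = median(G3),
      .groups = "drop"
    )
}

# Toy data, invented purely so the sketch is runnable; NOT real results
toy <- tibble(
  Mjob = c("teacher", "teacher", "other", "other", "at_home", "at_home"),
  G3   = c(14, 16, 11, 13, 9, 10)
)

toy_summary <- g3_by_mjob(toy)
print(toy_summary)

# Kruskal-Wallis rank sum test: do the G3 distributions differ across groups?
kw <- kruskal.test(G3 ~ Mjob, data = toy)
print(kw)
```

A rank-based test is used here because G3 contains a cluster of zeros; a one-way ANOVA would be a reasonable alternative if its assumptions look acceptable.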
Numerical Summary for Final/Term 3 Grade (G3)
g3_summary <- studentperformance |>
  summarise(
    Min = min(G3),
    Max = max(G3),
    Median = median(G3),
    Mean = mean(G3),
    First_Quartile = quantile(G3, 0.25),
    Third_Quartile = quantile(G3, 0.75),
    Std_Dev = sd(G3)
  )
print(g3_summary)
## # A tibble: 1 × 7
##     Min   Max Median  Mean First_Quartile Third_Quartile Std_Dev
##   <dbl> <dbl>  <dbl> <dbl>          <dbl>          <dbl>   <dbl>
## 1     0    19     12  11.9             10             14    3.23
This numerical summary describes the distribution of final (term 3)
grades. The minimum is zero, which suggests that some students either
failed or were not assigned a final grade. The grade scale goes up to
20, but the maximum in the data set is 19. The mean (11.9) and median
(12) are nearly identical, and the first and third quartiles (10 and 14)
are equidistant from the median, so the middle of the distribution
appears fairly symmetric. As a rough rule of thumb for this data set and
grading scale, grades of 13 or higher can be considered fairly good,
while grades below 11 are less ideal. Term 3 grades will be a valuable
variable because they will likely serve as the target in some of our
analyses.
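The symmetry argument can be made explicit with a quartile-based skewness measure. The sketch below uses Bowley's coefficient, which is our choice of measure and not something the report computed; the function name `bowley_skewness()` is hypothetical. The inputs are the quartiles and median from the summary printed above.

```r
# Bowley (quartile) skewness: compares the distance from the median to each quartile.
# A value of 0 means the two quartiles are equidistant from the median.
bowley_skewness <- function(q1, med, q3) {
  ((q3 - med) - (med - q1)) / (q3 - q1)
}

# Quartiles and median taken from the G3 summary table above
g3_skew <- bowley_skewness(q1 = 10, med = 12, q3 = 14)
print(g3_skew)  # 0
```

This measure only looks at the middle 50% of the data, so it says nothing about the tails. The zero grades at the low end are not reflected in it.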
A Set of At Least 3 Novel Questions
1. Does a desire for higher education impact a student’s final grade?
2. What is the nature of the relationship between demographic
characteristics and student absences?
3. Do lifestyle choices actually affect academic performance?
Aggregation Function (Which answers question 1 above)
higher_education <- studentperformance |>
  group_by(higher) |>
  summarise(
    Average_G3 = mean(G3),
    Student_Count = n()
  )
print(higher_education)
## # A tibble: 2 × 3
##   higher Average_G3 Student_Count
##   <chr>       <dbl>         <int>
## 1 no           8.80            69
## 2 yes         12.3            580
This aggregation shows that students who want to pursue higher
education had a higher average term 3 grade (12.3 versus 8.80). The
group sizes are very unequal (580 versus 69), so the comparison should
be interpreted with some caution, although 69 students in the ‘no’
category is not a trivially small group. Unequal group sizes do not
create a difference in means by themselves, but they do make the
estimate for the smaller group less precise. In the future, it would be
valuable to test whether the difference between the two groups is
statistically significant and whether it can be attributed to other
factors.
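One way to check whether a difference like this holds up with unequal group sizes is Welch's two-sample t-test, which does not assume equal variances. This is a minimal sketch of that idea and not an analysis the report performed. The toy data frame is invented so the code runs on its own; in the report, `studentperformance` would take its place.

```r
# Toy data, invented purely so the sketch is runnable; NOT real results.
# The groups are deliberately unequal in size, as in the report.
toy <- data.frame(
  higher = c(rep("no", 4), rep("yes", 8)),
  G3     = c(8, 9, 7, 10, 12, 13, 11, 14, 12, 15, 10, 13)
)

# t.test() uses var.equal = FALSE by default, which gives Welch's test
welch <- t.test(G3 ~ higher, data = toy)
print(welch)
```

The printed result includes a confidence interval for the difference in means, which is often more informative than the p-value alone. Because G3 contains a cluster of zeros, a rank-based alternative such as `wilcox.test()` may also be worth considering.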
Visual Summaries
Distribution of Final/Term 3 Grades
ggplot(studentperformance, aes(x = G3)) +
  geom_histogram(binwidth = 1, fill = "blue", col = "black") +
  labs(
    title = "Distribution of Term 3 Grades (G3)",
    x = "Grade (G3)",
    y = "Count"
  ) +
  theme_minimal()

This plot shows the distribution of term 3 grades. There are a fair
number of grades of 0, which likely pull the mean down slightly, but the
rest of the distribution still looks fairly symmetric. Grades to the
right of the mean/median of 12 taper off gradually as they approach the
maximum. Grades to the left drop off more quickly: a large share of
grades are 10 or 11, with far fewer at 8 or 9. It will be interesting to
investigate the relationship between term 3 grades and other variables.
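The effect of the zero grades on the centre of the distribution can be checked by computing the mean and median with and without them. The sketch below shows the mechanics only. The helper `summarise_g3()` is hypothetical and the toy grades are invented so the code runs on its own; in the report, `studentperformance` would be used instead.

```r
library(dplyr)

# Hypothetical helper (not part of the original report):
# count, mean and median of G3 for a labelled subset of the data
summarise_g3 <- function(data, label) {
  data |>
    summarise(
      Subset = label,
      Student_Count = n(),
      Mean = mean(G3),
      Median = median(G3)
    )
}

# Toy grades, invented purely so the sketch is runnable; NOT real results
toy <- tibble(G3 = c(0, 0, 10, 11, 12, 13, 14))

comparison <- bind_rows(
  summarise_g3(toy, "All grades"),
  summarise_g3(filter(toy, G3 > 0), "Zeros excluded")
)
print(comparison)
```

If the mean moves noticeably once the zeros are excluded while the median barely changes, that supports the reading that the zeros drag the mean down.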
Interaction between Gender, Desire for Higher Education and G3
ggplot(studentperformance, aes(x = higher, y = G3, fill = sex)) +
  geom_boxplot() +
  labs(
    title = "Term 3 Grades by Desire for Higher Education and Gender",
    x = "Wants Higher Education",
    y = "Term 3/Final Grade",
    fill = "Gender"
  ) +
  theme_light() +
  scale_fill_manual(values = c("F" = "orange", "M" = "blue"))

This plot shows the relationship between the desire for higher
education and term 3 grades, split by gender. There is a clear
difference in term 3 grades between students who want higher education
and those who do not, which is consistent with the aggregation from
earlier. There is also a difference between male and female students:
females typically have higher term 3 grades regardless of their desire
for higher education. Visually, wanting higher education appears to be
associated with a larger difference in final grades than gender is. In
the future, it may be worth testing the statistical significance of
these observations.
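One possible way to test these observations is a linear model with both factors and their interaction, followed by an ANOVA table. This is a sketch under our own assumptions and not an analysis the report performed. The toy data frame is invented so the code runs on its own; in the report, `studentperformance` would be used instead.

```r
# Toy data, invented purely so the sketch is runnable; NOT real results.
# Three students in each combination of higher ("no"/"yes") and sex ("F"/"M").
toy <- data.frame(
  higher = rep(c("no", "yes"), each = 6),
  sex    = rep(c("F", "M"), times = 6),
  G3     = c(9, 8, 10, 7, 9, 8, 13, 12, 14, 11, 13, 13)
)

# higher * sex expands to higher + sex + higher:sex (main effects plus interaction)
fit <- lm(G3 ~ higher * sex, data = toy)

# Sequential (Type I) ANOVA table: one row per term plus the residuals
anova_table <- anova(fit)
print(anova_table)
```

The toy data are balanced, but the real groups are not (69 versus 580 students), and with unbalanced data the sequential sums of squares depend on the order of the terms. That should be kept in mind when reading the table.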