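# read_delim(), the dplyr verbs, and ggplot() used below all come from the tidyverse
library(tidyverse)

studentperformance <- read_delim("./Portuguese Student.csv", delim = ";")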

studentperformance

Numeric Summary of Data For At Least 2 Columns


Categorical Summary for Mother’s Job

mjob_summary = studentperformance |>
  group_by(Mjob) |>
  summarise(count = n()) |>
  mutate(percentage = round(count / sum(count) * 100, 2))

print(arrange(mjob_summary, count))
## # A tibble: 5 × 3
##   Mjob     count percentage
##   <chr>    <int>      <dbl>
## 1 health      48        7.4
## 2 teacher     72       11.1
## 3 at_home    135       20.8
## 4 services   136       21.0
## 5 other      258       39.8

This categorical summary of the Mother’s Job variable (Mjob) shows the count and percentage of students whose mothers fall into each job category. The largest group, about 40%, sits in the ‘other’ category. About 21% are in ‘services’, which could include administrative or police work, and another 21% are stay-at-home mothers. About 11% are teachers and 7.4% work in health care. In total, about 79% of mothers in this data set are employed, but roughly half of those are pooled into ‘other’, so we know less about the nature of their work than we do for the named categories. It will be interesting to explore whether a mother’s job has any statistically significant association with student performance, absences, and similar outcomes.
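The employment figures quoted above can be checked directly from the printed counts. The following is a minimal sketch that assumes ‘at_home’ is the only category representing mothers who are not employed; the variable names are ours and are not part of the original analysis.

```r
# Counts copied from the mjob_summary output above.
mjob_counts <- c(health = 48, teacher = 72, at_home = 135, services = 136, other = 258)

total <- sum(mjob_counts)                      # all students in the data set
employed <- total - mjob_counts[["at_home"]]   # assumption: everyone except at_home is employed

# Share of all mothers who are employed, in percent.
employed_share <- round(employed / total * 100, 1)

# Share of employed mothers who fall into the 'other' category, in percent.
other_share_of_employed <- round(mjob_counts[["other"]] / employed * 100, 1)

print(c(employed_share = employed_share,
        other_share_of_employed = other_share_of_employed))
```

This gives roughly 79% employed, with about half of the employed mothers in ‘other’.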


Numerical Summary for Final/Term 3 Grade (G3)

g3_summary = studentperformance |>
  summarise(
    Min = min(G3),
    Max = max(G3),
    Median = median(G3),
    Mean = mean(G3),
    First_Quartile = quantile(G3, 0.25),
    Third_Quartile = quantile(G3, 0.75),
    Std_Dev = sd(G3)
  )

print(g3_summary)
## # A tibble: 1 × 7
##     Min   Max Median  Mean First_Quartile Third_Quartile Std_Dev
##   <dbl> <dbl>  <dbl> <dbl>          <dbl>          <dbl>   <dbl>
## 1     0    19     12  11.9             10             14    3.23

This numerical summary describes the distribution of final (term 3) grades. The minimum is zero, which suggests that some students either failed or were not assigned a final grade. The grade scale goes up to 20, but the maximum in the data set is 19. The mean (11.9) and median (12) are nearly identical, and the first and third quartiles (10 and 14) are equidistant from the median, which suggests that the middle of the distribution is fairly symmetric. In the context of this data set and grading scale, we treat grades of 13 or higher as fairly good and grades below 11 as less ideal. Term 3 grades will be a valuable feature, as G3 will likely be a target variable in some of our analysis.
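One way to put a number on the "equidistant quartiles" observation is the quartile (Bowley) skewness coefficient, (Q3 + Q1 - 2 * median) / (Q3 - Q1). This is a small sketch rather than part of the original analysis, and the helper name `bowley_skewness` is our own.

```r
# Quartile (Bowley) skewness: 0 means the quartiles are evenly spaced
# around the median; positive values mean the upper quartile is farther away.
bowley_skewness <- function(q1, med, q3) {
  (q3 + q1 - 2 * med) / (q3 - q1)
}

# Quartiles and median copied from the g3_summary output above.
g3_quartile_skew <- bowley_skewness(q1 = 10, med = 12, q3 = 14)
print(g3_quartile_skew)
```

The coefficient is 0 for these values. It only reflects the middle 50% of the grades, so it says nothing about the zeros in the lower tail.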

A Set of At Least 3 Novel Questions


Does a desire for higher education impact a student’s final grade?

What is the nature of the relationship between demographic characteristics and student absences?

Do lifestyle choices actually affect academic performance?


Aggregation Function (Which answers question 1 above)

higher_education = studentperformance |>
  group_by(higher) |>
  summarise(
    Average_G3 = mean(G3),
    Student_Count = n()
  )

print(higher_education)
## # A tibble: 2 × 3
##   higher Average_G3 Student_Count
##   <chr>       <dbl>         <int>
## 1 no           8.80            69
## 2 yes         12.3            580

This aggregation shows that students who want to pursue higher education had a higher average term 3 grade (12.3 versus 8.80). The two groups differ substantially in size (580 versus 69 students), so the comparison should be taken with a grain of salt, although 69 students in the ‘no’ group is not too small a sample. In the future, it might be valuable to test whether this difference is larger than we would expect from chance given the unequal group sizes, or whether it can be attributed to other variables.
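A Welch two-sample t-test is one possible follow-up, since it does not assume equal variances and handles unequal group sizes. The sketch below is not part of the original analysis: `compare_g3_by_higher` is a hypothetical helper, and the `toy` data frame is simulated purely so the example runs on its own.

```r
# Hypothetical helper: Welch two-sample t-test of G3 between the two `higher` groups.
compare_g3_by_higher <- function(data) {
  t.test(G3 ~ higher, data = data, var.equal = FALSE)
}

# Simulated placeholder data for illustration only (NOT the student data set).
set.seed(1)
toy <- data.frame(
  higher = rep(c("no", "yes"), times = c(20, 80)),
  G3 = c(rnorm(20, mean = 9, sd = 3), rnorm(80, mean = 12, sd = 3))
)

result <- compare_g3_by_higher(toy)
print(result$estimate)   # the two group means
print(result$p.value)    # p-value for the difference in means
```

On the real data the call would be `compare_g3_by_higher(studentperformance)`. The zero grades may strain the test's assumptions, so a rank-based alternative such as `wilcox.test()` could be worth comparing.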

Visual Summaries


Distribution of Final/Term 3 Grades

ggplot(studentperformance, aes(x = G3)) +
  geom_histogram(binwidth = 1, fill = 'blue', col = 'black') +
  labs(title = 'Distribution of Term 3 Grades (G3)', x = "Grade (G3)", y = 'Count') +
  theme_minimal()

This plot shows the distribution of term 3 grades. There are a fair number of 0 grades, which likely pull the mean down a bit, but the rest of the distribution still looks fairly symmetric. Grades to the right of the mean/median of 12 taper off gradually as they approach the maximum. Grades to the left taper off more quickly: a large share of grades are 10 or 11, with far fewer at 9 and 8. It will be interesting to investigate the relationship between term 3 grades and other variables.


Interaction between Gender, Desire for Higher Education and G3

ggplot(studentperformance, aes(x = higher, y = G3, fill = sex)) +
  geom_boxplot() +
  labs(title = "Term 3 Grades by Desire for Higher Education and Gender", x = "Wants Higher Education", y = "Term 3/Final Grade", fill = "Gender") +
  theme_light() +
  scale_fill_manual(values = c("F" = 'orange', "M" = "blue"))

This plot shows the relationship between the desire for higher education and term 3 grades for each gender. There is a clear difference in term 3 grades between the group that wants higher education and the group that doesn’t, which is consistent with our aggregation from earlier. We also see a difference between male and female students: females typically have higher term 3 grades regardless of their desire for higher education. Visually, wanting higher education appears to be associated with a larger difference in final grades than gender is. In the future, it may be worth investigating the statistical significance of this observation.
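One possible way to test this observation is a linear model with both factors and their interaction, summarised in an ANOVA table. The sketch below is our own illustration, not the original analysis: `g3_interaction_table` is a hypothetical helper, and the `toy` data frame is simulated so the example is self-contained.

```r
# Hypothetical helper: fit G3 on higher, sex, and their interaction,
# then return the sequential ANOVA table for the three terms.
g3_interaction_table <- function(data) {
  fit <- lm(G3 ~ higher * sex, data = data)
  anova(fit)
}

# Simulated placeholder data for illustration only (NOT the student data set).
set.seed(2)
toy <- expand.grid(higher = c("no", "yes"), sex = c("F", "M"), rep = 1:25)
toy$G3 <- 10 + 3 * (toy$higher == "yes") - 1 * (toy$sex == "M") +
  rnorm(nrow(toy), sd = 3)

interaction_table <- g3_interaction_table(toy)
print(interaction_table)
```

On the real data the call would be `g3_interaction_table(studentperformance)`. The `higher:sex` row indicates whether the gender gap differs between the two higher-education groups. Because the real groups are unbalanced, the sequential table depends on the order of the terms.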