library(tidyverse)  # provides read_delim(), the dplyr verbs, and ggplot2
studentperformance <- read_delim("./Portuguese Student.csv", delim = ";")
g1 = studentperformance |>
  group_by(school, sex) |>
  summarize(count = n(), avg_G3 = mean(G3), .groups = 'drop')
g1
## # A tibble: 4 × 4
##   school sex   count avg_G3
##   <chr>  <chr> <int>  <dbl>
## 1 GP     F       237  13.0
## 2 GP     M       186  12.0
## 3 MS     F       146  11.0
## 4 MS     M        80   9.95
ggplot(g1, aes(x = school, y = avg_G3, fill = sex)) +
  geom_bar(stat = "identity", position = "dodge") +
  scale_fill_manual(values = c("orange", "blue")) +
  labs(title = "Average G3 Grade for Males and Females at Each School",
       x = "School", y = "Average G3 Grade")
According to the plot, males have lower average G3 grades than females at both schools. Average grades are also higher overall at school GP than at school MS. The table also shows that the male group at school MS is the smallest, at n = 80. This group represents only about 12% of the students in the data set, so a student picked at random is least likely to be a male at school MS. It may be worth investigating why one school produces higher grades than the other.
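The group shares mentioned above can be checked directly from the printed counts. The sketch below rebuilds the count column of `g1` by hand (the data frame `g1_counts` and the column `pct` are names introduced here for illustration) and expresses each group as a percentage of all students.

```r
library(dplyr)

# Counts copied from the printed g1 table; g1_counts is an illustrative name
g1_counts <- data.frame(
  school = c("GP", "GP", "MS", "MS"),
  sex    = c("F", "M", "F", "M"),
  count  = c(237, 186, 146, 80)
)

# Share of each school/sex group among all students, smallest group first
g1_share <- g1_counts |>
  mutate(pct = round(100 * count / sum(count), 1)) |>
  arrange(pct)

g1_share
```

Sorting by `pct` puts the smallest group in the first row, which makes it easy to confirm which school/sex combination the 12% figure refers to.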
g2 = studentperformance |>
  group_by(Mjob) |>
  summarize(count = n(), avg_absences = mean(absences),
            avg_studytime = mean(studytime), .groups = 'drop')
g2
## # A tibble: 5 × 4
##   Mjob     count avg_absences avg_studytime
##   <chr>    <int>        <dbl>         <dbl>
## 1 at_home    135         3.57          1.87
## 2 health      48         2.10          1.88
## 3 other      258         3.81          1.93
## 4 services   136         4.30          1.98
## 5 teacher     72         3.11          2.01
ggplot(g2, aes(x = reorder(Mjob, avg_absences), y = avg_absences, fill = Mjob)) +
  geom_bar(stat = "identity", show.legend = FALSE) +
  # No colour aesthetic is mapped, so only the title and axis labels are set
  labs(title = "Average Student Absences for Each Mother's Job Type",
       x = "Mother's Job", y = "Average Student Absences")
The smallest group by count is mothers with a job in the ‘health’ sector, at n = 48. This represents about 7% of all mothers’ jobs within the data set. It may reflect the difficulty of obtaining the qualifications needed for a job in the health sector. Investigating further demographic information for the country may give better insight into whether these trends hold for the population.
studentperformance = studentperformance |>
  mutate(age_group = cut(age, breaks = c(14, 16, 18, 20, 22, 24),
                         labels = c("15-16", "17-18", "19-20", "21-22", "23+")))
g3 = studentperformance |>
  group_by(age_group) |>
  summarize(count = n(), avg_Walc = mean(Walc), .groups = 'drop')
g3
## # A tibble: 4 × 3
##   age_group count avg_Walc
##   <fct>     <int>    <dbl>
## 1 15-16       289     2.15
## 2 17-18       319     2.41
## 3 19-20        38     2.21
## 4 21-22         3     2.67
ggplot(g3, aes(x = reorder(age_group, avg_Walc), y = avg_Walc, fill = age_group)) +
  geom_bar(stat = "identity", show.legend = FALSE) +
  # No colour aesthetic is mapped, so only the title and axis labels are set
  labs(title = "Average Weekend Alcohol Consumption by Age Group",
       x = "Age Group", y = "Average Weekend Alcohol Consumption")
There are only 3 students in the ‘21-22’ age group, which represents less than 1% of the students in the data set. Students older than 20 are likely rare because students usually graduate at 18 or 19 years old. Students in this higher age group may have started school later or repeated years due to performance struggles. An investigation into the reasons for late graduation in this country specifically may yield relevant context.
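To see how the age bins are formed, it helps to run the same `cut()` call on a few example ages. This is a minimal sketch: the vector `ages` is a hypothetical set of values chosen for illustration, not the real `age` column. By default `cut()` uses right-closed intervals, so the breaks 14, 16, 18, ... produce the bins (14,16], (16,18], and so on.

```r
# Hypothetical example ages, not taken from the data set
ages <- 15:22

# Same breaks and labels as in the mutate() call above
age_group <- cut(ages,
                 breaks = c(14, 16, 18, 20, 22, 24),
                 labels = c("15-16", "17-18", "19-20", "21-22", "23+"))

# Show which label each age receives
data.frame(age = ages, age_group = age_group)

# Unused levels such as "23+" still appear here, with a count of 0
table(age_group)
```

Because the intervals are closed on the right, a 20-year-old falls in ‘19-20’ and the ‘21-22’ bin only contains students aged 21 or 22.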
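The ‘21-22’ group has the highest average weekend alcohol consumption (2.67), but it contains only 3 students, and the averages do not rise steadily with age: the ‘19-20’ group (2.21) is below the ‘17-18’ group (2.41). Reaching legal drinking age may contribute to higher consumption among older students, but these data do not clearly support that conclusion.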
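One rough way to examine how age relates to weekend alcohol consumption is to pool the printed groups into two broader bands and weight each average by its group size. The sketch below does this using only the rounded values from the `g3` table, so the results are approximate. The names `g3_printed`, `band`, and `pooled` are introduced here for illustration.

```r
library(dplyr)

# Counts and rounded averages copied from the printed g3 table
g3_printed <- data.frame(
  age_group = c("15-16", "17-18", "19-20", "21-22"),
  count     = c(289, 319, 38, 3),
  avg_Walc  = c(2.15, 2.41, 2.21, 2.67)
)

# Pool into two bands and take the count-weighted mean of the group averages.
# avg_Walc is computed before n so that `count` still refers to the group sizes.
pooled <- g3_printed |>
  mutate(band = ifelse(age_group %in% c("15-16", "17-18"), "15-18", "19-22")) |>
  group_by(band) |>
  summarize(avg_Walc = weighted.mean(avg_Walc, w = count),
            n = sum(count),
            .groups = "drop")

pooled
```

With these rounded inputs the pooled averages come out at roughly 2.29 for ages 15-18 and 2.24 for ages 19-22, so the 41 oldest students do not stand out as heavier weekend drinkers overall.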
job_combinations = studentperformance |>
  group_by(Mjob, Fjob) |>
  summarize(count = n(), .groups = 'drop')
ggplot(job_combinations, aes(x = Mjob, y = Fjob, fill = count)) +
  geom_tile() +
  geom_text(aes(label = count), col = "white") +
  scale_fill_viridis_c() +
  labs(title = "Count of Parent Job Combinations", x = "Mother's Job",
       y = "Father's Job")
There are no missing combinations of mother’s and father’s job. The other/other combination is the largest, but several other combinations also have sizeable counts. For example, the services/services group has 59 while the at_home/services combination has 36. There is a fair amount of diversity in parent jobs within this data set.
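The claim that no combination is missing can also be checked in code rather than by reading the heatmap. The sketch below compares a table of observed pairs against the full grid of possible pairs. It runs on a hypothetical toy table (`toy_combinations`, with one pair deliberately removed) instead of the real `job_combinations`, and it assumes that `Fjob` uses the same five categories shown for `Mjob`.

```r
library(dplyr)

# Job categories as printed for Mjob; assumed (not confirmed) to match Fjob
jobs <- c("at_home", "health", "other", "services", "teacher")

# Every possible mother/father job pair (5 x 5 = 25 rows)
all_pairs <- expand.grid(Mjob = jobs, Fjob = jobs, stringsAsFactors = FALSE)

# Hypothetical toy counts, NOT the real data: one pair is deliberately dropped
toy_combinations <- all_pairs |>
  filter(!(Mjob == "health" & Fjob == "at_home")) |>
  mutate(count = 1L)

# Pairs in the full grid that never appear in the observed table
missing_pairs <- anti_join(all_pairs, toy_combinations, by = c("Mjob", "Fjob"))
missing_pairs
```

Replacing `toy_combinations` with the real `job_combinations` should return zero rows if every pair occurs at least once.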