Week 3 Data Dive

library(tidyverse)  # provides read_delim(), the dplyr verbs, and ggplot2
studentperformance <- read_delim("./Portuguese Student.csv", delim = ";")

Group 1

group_by data frame

g1 <- studentperformance |>
  group_by(school, sex) |>
  summarize(count = n(), avg_G3 = mean(G3), .groups = "drop")

g1

## # A tibble: 4 × 4
##   school sex   count avg_G3
##   <chr>  <chr> <int>  <dbl>
## 1 GP     F       237  13.0 
## 2 GP     M       186  12.0 
## 3 MS     F       146  11.0 
## 4 MS     M        80   9.95

Visualization

ggplot(g1, aes(x = school, y = avg_G3, fill = sex)) +
  geom_col(position = "dodge") +
  scale_fill_manual(values = c(F = "orange", M = "blue")) +
  labs(title = "Average G3 Grade for Males and Females at Each School",
       x = "School", y = "Average G3 Grade")

The plot shows that males have lower average G3 grades than females at both schools, and that grades are higher overall at school GP than at school MS. The table also shows that males at school MS form the smallest group, at n = 80. This group represents only about 12% of the 649 students in the data set, so a student picked at random is least likely to be a male at school MS. It may be worth investigating why one school produces higher grades than the other.
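The group shares quoted above can be checked from the counts in the summary table. The following is a minimal base R sketch; `g1_counts` and `smallest` are illustrative names, and the counts are copied by hand from the table rather than recomputed from the CSV.

```r
# Counts copied from the g1 summary table (school x sex).
g1_counts <- data.frame(
  school = c("GP", "GP", "MS", "MS"),
  sex    = c("F", "M", "F", "M"),
  count  = c(237, 186, 146, 80)
)

# Share of each group among all students in the table.
g1_counts$share <- g1_counts$count / sum(g1_counts$count)

# The smallest group is also the least likely pick for a random student.
smallest <- g1_counts[which.min(g1_counts$count), ]
print(smallest)
```

The four counts sum to 649, and the smallest group (80 students) works out to roughly 12% of them.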

Group 2

group_by data frame

g2 <- studentperformance |>
  group_by(Mjob) |>
  summarize(
    count = n(), avg_absences = mean(absences),
    avg_studytime = mean(studytime), .groups = "drop"
  )

g2

## # A tibble: 5 × 4
##   Mjob     count avg_absences avg_studytime
##   <chr>    <int>        <dbl>         <dbl>
## 1 at_home    135         3.57          1.87
## 2 health      48         2.10          1.88
## 3 other      258         3.81          1.93
## 4 services   136         4.30          1.98
## 5 teacher     72         3.11          2.01

Visualization

ggplot(g2, aes(x = reorder(Mjob, avg_absences), y = avg_absences, fill = Mjob)) +
  geom_bar(stat = "identity", show.legend = FALSE) +
  labs(title = "Average Student Absences for Each Mother's Job Type", x = "Mother's Job", y = "Average Student Absences")

The smallest group by count is students whose mothers work in the ‘health’ sector, at n = 48, or about 7% of the 649 students in the data set. This may reflect how difficult it is to obtain the qualifications for a health profession and then a job in that sector, although the data cannot confirm that. These students also have the lowest average absences (2.10), while students whose mothers work in ‘services’ have the highest (4.30). Investigating further demographic information for the country may give better insight into whether these trends hold for the population.
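Whether the roughly 7% share would hold beyond this data set depends on how the students were sampled. As a rough sketch, and assuming the 649 students can be treated as a simple random sample (an assumption the source does not state), an exact binomial interval gives a sense of the uncertainty around that share. `health_n`, `total_n`, and `ci` are illustrative names.

```r
# Counts copied from the g2 summary table.
health_n <- 48    # students whose mother works in the health sector
total_n  <- 649   # all students (sum of the count column)

# Exact (Clopper-Pearson) 95% interval for the health share.
# Assumes a simple random sample, which is an assumption here.
result <- binom.test(health_n, total_n)
ci <- result$conf.int

cat(sprintf("share = %.3f, 95%% CI = [%.3f, %.3f]\n",
            health_n / total_n, ci[1], ci[2]))
```

A wide interval would suggest that the sample alone says little about the population share; a narrow one still depends on the sampling assumption being reasonable.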

Group 3

group_by data frame

# cut() bins are right-closed: (14, 16] holds ages 15 and 16, and so on
studentperformance <- studentperformance |>
  mutate(age_group = cut(age, breaks = c(14, 16, 18, 20, 22, 24),
                         labels = c("15-16", "17-18", "19-20", "21-22", "23+")))

g3 = studentperformance |>
  group_by(age_group) |>
  summarize(count = n(), avg_Walc = mean(Walc), .groups = 'drop')

g3

## # A tibble: 4 × 3
##   age_group count avg_Walc
##   <fct>     <int>    <dbl>
## 1 15-16       289     2.15
## 2 17-18       319     2.41
## 3 19-20        38     2.21
## 4 21-22         3     2.67

Visualization

ggplot(g3, aes(x = age_group, y = avg_Walc, fill = age_group)) +
  geom_bar(stat = "identity", show.legend = FALSE) +
  labs(title = "Average Weekend Alcohol Consumption by Age Group", x = "Age Group", y = "Average Weekend Alcohol Consumption")

There are only 3 students in the ‘21-22’ age group, which is less than 1% of all students in the data set. Students aged 20 or older are rare because students usually graduate at 18 or 19. Students in this older group may have started school later or repeated years due to performance struggles. With so few students, the group's average of 2.67 should be read with caution. Investigating the reasons for late graduation in this country specifically may provide relevant context.
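Because the ‘21-22’ group has only 3 students, its average is sensitive to any single response. The arithmetic below is a small sketch of that point; it assumes Walc is recorded as whole-number ratings, which this report does not state explicitly, and the variable names are illustrative.

```r
n_students <- 3      # count for the 21-22 group in the g3 table
avg_walc   <- 2.67   # reported group mean, rounded to two decimals
total_n    <- 649    # all students (sum of the count column)

# Share of the data set that this group represents.
share <- n_students / total_n

# If ratings are whole numbers, the three ratings sum to this total.
rating_total <- as.integer(round(avg_walc * n_students))

# One student changing their rating by one point moves the mean this much.
step <- 1 / n_students

cat(sprintf("share = %.4f, rating total = %d, mean step = %.2f\n",
            share, rating_total, step))
```

A shift of about a third of a point from one response is larger than the gaps between the other age groups' averages, so the 21-22 bar carries little weight on its own.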

Testable Hypothesis

Older students report higher weekend alcohol consumption (Walc) than younger students, possibly because more of them have reached the legal drinking age.
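One way to test this hypothesis is a one-sided Spearman rank correlation between age and Walc, which suits ordinal ratings with many ties. This is a sketch of one possible test, not an analysis carried out in this report; `test_age_walc` is a hypothetical helper, and it assumes the data frame has numeric `age` and `Walc` columns.

```r
# Hypothetical helper: one-sided Spearman test of "older means higher Walc".
test_age_walc <- function(df) {
  cor.test(
    df$age, df$Walc,
    method = "spearman",
    alternative = "greater",  # H1: the rank correlation is positive
    exact = FALSE             # ratings contain ties, so use the approximation
  )
}

# Run only if the data frame from this report is already loaded.
if (exists("studentperformance")) {
  print(test_age_walc(studentperformance))
}
```

A positive estimate with a small p-value would be consistent with the hypothesis, but it would not show that reaching the legal drinking age is the cause.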

Categorical Variables

job_combinations <- studentperformance |>
  group_by(Mjob, Fjob) |>
  summarize(count = n(), .groups = "drop")

ggplot(job_combinations, aes(x = Mjob, y = Fjob, fill = count)) +
  geom_tile() +
  geom_text(aes(label = count), col = "white") +
  scale_fill_viridis_c() +
  labs(title = "Count of Parent Job Combinations", x = "Mother's Job",
       y = "Father's Job")

No combination of mother’s and father’s job is missing from the data. Most students fall into the other/other combination, but several other combinations have sizeable counts. For example, the services/services group has 59 students, while the at_home/services combination has 36. Parent jobs in this data set are fairly diverse.
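The claim that no job combination is missing can be checked by cross-tabulating the two columns and looking for zero counts. A minimal base R sketch follows; `missing_job_pairs` is a hypothetical helper name, and it only considers job types that appear at least once in each column.

```r
# Hypothetical helper: list Mjob/Fjob pairs that never occur together.
missing_job_pairs <- function(df) {
  # table() counts every pairing of the observed Mjob and Fjob values,
  # including pairings that occur zero times.
  counts <- as.data.frame(
    table(Mjob = df$Mjob, Fjob = df$Fjob),
    stringsAsFactors = FALSE
  )
  counts[counts$Freq == 0, c("Mjob", "Fjob")]
}

# Run only if the data frame from this report is already loaded.
if (exists("studentperformance")) {
  print(missing_job_pairs(studentperformance))
}
```

A result with zero rows means every combination occurs at least once.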

February 2nd, 2026
