Week 5 Data Dive

library(tidyverse)

## Warning: package 'tidyverse' was built under R version 4.4.3

## Warning: package 'ggplot2' was built under R version 4.4.3

## Warning: package 'lubridate' was built under R version 4.4.3

studentperformance <- read_delim("./Portuguese Student.csv", delim = ";")

Question 1

Before reading the documentation, I didn’t understand the study time (studytime) and travel time to school (traveltime) variables. Both are ordinal categorical variables in which each value stands for a range of values (binning), so I had to check the documentation to learn what each value actually meant.

One reason they may have encoded these variables this way is uncertainty about the exact value for each student. A student who doesn’t know exactly how long the trip to school takes, but knows roughly which bin applies, can answer more easily. The same applies to study time: students may not know exactly how much time they spend studying, but they can pick a rough bin.

If I hadn’t read the documentation, I wouldn’t have understood what the values meant. Study time takes the values 1, 2, 3 and 4, and I would have assumed those were hours. Instead, 1 is <2 hours, 2 is 2 to 5 hours, 3 is 5 to 10 hours and 4 is >10 hours. I also learned that it measures study time per week rather than per day. The same applies to travel time, whose values also stand for time ranges rather than raw amounts.
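To make the coding concrete, here is a minimal sketch of turning the numeric codes into labeled, ordered bins. The sample vector `studytime_codes` is hypothetical (in the report the codes would come from the study time column), and the labels are the ones quoted from the documentation above.

```r
# Hypothetical sample of coded values, not taken from the data set.
studytime_codes <- c(1, 2, 2, 3, 4, 1)

# Bin labels for weekly study time, as described in the documentation.
studytime_labels <- c("<2 hours", "2 to 5 hours", "5 to 10 hours", ">10 hours")

# ordered = TRUE keeps the ranking of the bins (1 < 2 < 3 < 4).
studytime_factor <- factor(studytime_codes,
                           levels = 1:4,
                           labels = studytime_labels,
                           ordered = TRUE)

# Count how many sample values fall into each labeled bin.
table(studytime_factor)
```

Keeping the factor ordered means comparisons such as `studytime_factor >= "5 to 10 hours"` still respect the ranking of the bins.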

Question 2

One of the variables I still don’t understand even after reading the documentation is number of past class failures (failures). I am a little bit confused as to what it means to fail a class in this specific context. In the United States, failing a class means getting a certain grade in a semester-long class. In the context of Portuguese students and this data set, we are given period 1, 2 and 3 grades. Are there three periods in a semester? Are there two semesters in a year? Is a failure considered getting a bad grade in one period or all three periods? The answer to these questions is important if we want to use this variable for meaningful analysis.

Question 3

# Produce a plot of the final grades against past class failures.
ggplot(studentperformance,
       aes(x = as.factor(failures), y = G3, fill = as.factor(failures))) +
  geom_boxplot() +
  labs(title = "Final Grade (G3) vs. Past Class Failures",
       x = "Number of Past Class Failures",
       y = "Final Grade (G3)",
       fill = "Failures")

# Define a failure using an arbitrary number for the sake of the assignment.
failure_rates <- studentperformance |>
  group_by(failures) |>
  summarise(
    G1_Failure_Rate = mean(G1 < 10) * 100,
    G2_Failure_Rate = mean(G2 < 10) * 100,
    G3_Failure_Rate = mean(G3 < 10) * 100
  ) |>
  pivot_longer(cols = starts_with("G"),
               names_to = "Period",
               values_to = "Failure_Percentage")

# Produce a plot which shows the percentage of students failing against their past failures.
ggplot(failure_rates, aes(x = as.factor(failures), y = Failure_Percentage, fill = Period)) +
  geom_col(position = "dodge") +
  scale_fill_viridis_d(labels = c("G1 (Period 1)", "G2 (Period 2)", "G3 (Period 3)")) +
  labs(title = "Percentage of Students Failing Current Periods by Past Failures",
       subtitle = "A failure is defined here as a score below 10",
       x = "Past Failures",
       y = "Percentage of Students Currently Failing") +
  theme_dark()

We still don’t know what counts as a failure in the past class failures variable. For the second visualization, I treated a period grade below 10 as a failure, but that cutoff is arbitrary. Without this context, it is hard to know the full story the visualizations are telling. The box plots show that students with no past class failures have much higher final grades overall. The second plot agrees: students with no past failures are far less likely to be failing in any of the three periods. However, because we don’t understand how past failures were recorded, we have to be very careful about drawing conclusions. Since I am not a domain expert, I can’t make claims about the significance of these results.
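Because the cutoff of 10 is arbitrary, one way to see how much it matters is to recompute the failure percentage under a few different cutoffs. The sketch below is only illustrative: the `grades` vector is hypothetical, and `failure_rate()` is a helper introduced here, not part of the analysis above.

```r
# Hypothetical grades on the 0 to 20 scale, not taken from the data set.
grades <- c(0, 8, 9, 10, 11, 12, 14, 15, 9, 10)

# Hypothetical helper: percentage of grades strictly below a cutoff.
failure_rate <- function(x, cutoff) mean(x < cutoff) * 100

# Try the cutoff used in the plot (10) plus one lower and one higher value.
cutoffs <- c(8, 10, 12)
rates <- sapply(cutoffs, function(k) failure_rate(grades, k))
names(rates) <- paste0("cutoff_", cutoffs)

rates
```

If the pattern across past failures holds under several reasonable cutoffs, the choice of 10 matters less. If it changes, the definition of a failure is doing a lot of the work.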

Question 4

sp <- studentperformance

# Find the unique values for each categorical column and then cross check it with the possible values in the documentation.
unique(sp$school)

## [1] "GP" "MS"

unique(sp$address)

## [1] "U" "R"

unique(sp$famsize)

## [1] "GT3" "LE3"

unique(sp$Pstatus)

## [1] "A" "T"

unique(sp$Mjob)

## [1] "at_home"  "health"   "other"    "services" "teacher"

unique(sp$Fjob)

## [1] "teacher"  "other"    "services" "health"   "at_home"

unique(sp$reason)

## [1] "course"     "other"      "home"       "reputation"

unique(sp$guardian)

## [1] "mother" "father" "other"

The documentation for the UCI-provided ‘Student Performance’ data set, which is the one I am using, says that there are no missing values. When I double-checked, I found no NA, NaN or NULL values in any of the categorical columns, so none of them contain explicitly missing values.
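A check of this kind can be sketched as follows. The data frame `toy` is a hypothetical stand-in for the categorical columns, with one NA planted on purpose so that the check has something to find.

```r
# Hypothetical stand-in for a few categorical columns; the NA is deliberate.
toy <- data.frame(
  school  = c("GP", "MS", "GP"),
  address = c("U", NA, "R"),
  stringsAsFactors = FALSE
)

# Count explicitly missing values per column.
na_counts <- colSums(is.na(toy))

na_counts
```

A column with a count of zero has no explicit missing values. Note that `is.na()` does not catch placeholders such as empty strings, which would need a separate check.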

There aren’t any implicitly missing rows or missing groups either because every categorical column in this data set has at least one instance of every possible value/category listed in the documentation.
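The cross-check against the documentation can also be done programmatically instead of by eye. This is a minimal sketch under assumptions: the `documented` list covers only two columns, with categories copied from the output above, and `observed` is a hypothetical data frame that deliberately lacks one category.

```r
# Categories for two columns, copied from the unique() output above.
documented <- list(
  school   = c("GP", "MS"),
  guardian = c("mother", "father", "other")
)

# Hypothetical observed data in which no student has guardian "other".
observed <- data.frame(
  school   = c("GP", "MS", "GP"),
  guardian = c("mother", "father", "mother"),
  stringsAsFactors = FALSE
)

# For each column, list documented categories that never appear in the data.
missing_groups <- lapply(names(documented), function(col) {
  setdiff(documented[[col]], unique(observed[[col]]))
})
names(missing_groups) <- names(documented)

missing_groups
```

An empty result for a column means every documented category appears at least once, which is the situation described for this data set.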

Given that this is a UCI-provided data set, it isn’t surprising that there are no explicitly or implicitly missing values or groups.

Question 5

sp |>
  group_by(G3) |>
  summarize(count = n())

## # A tibble: 17 × 2
##       G3 count
##    <dbl> <int>
##  1     0    15
##  2     1     1
##  3     5     1
##  4     6     3
##  5     7    10
##  6     8    35
##  7     9    35
##  8    10    97
##  9    11   104
## 10    12    72
## 11    13    82
## 12    14    63
## 13    15    49
## 14    16    36
## 15    17    29
## 16    18    15
## 17    19     2

ggplot(sp, aes(x = G3)) + geom_histogram(binwidth = 1)

For any of the period grades (G1, G2 or G3), a score of zero could be considered an outlier because it deviates substantially from the rest of the distribution. In the counts of G3 grades, there are 15 zeros, one score of 1 and no scores of 2, 3 or 4, while 619 of the 649 scores (about 95%) are 8 or higher. The histogram confirms the table and shows that the zeros give the distribution a long left tail. Whether or not we keep these outliers in future analysis will depend on the context.
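One common way to make "deviates from the rest of the distribution" concrete is the 1.5 × IQR rule. The text above does not commit to this rule, so it is an assumption here. The sketch rebuilds the G3 values from the counts in the table above rather than reading the file.

```r
# G3 values and their counts, copied from the table above.
g3_values <- c(0, 1, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19)
g3_counts <- c(15, 1, 1, 3, 10, 35, 35, 97, 104, 72, 82, 63, 49, 36, 29, 15, 2)

# Rebuild the full vector of 649 grades.
g3 <- rep(g3_values, times = g3_counts)

# Assumed outlier rule: flag points beyond 1.5 * IQR from the quartiles.
q <- quantile(g3, probs = c(0.25, 0.75), names = FALSE)
iqr <- q[2] - q[1]
lower_fence <- q[1] - 1.5 * iqr
upper_fence <- q[2] + 1.5 * iqr

flagged <- g3[g3 < lower_fence | g3 > upper_fence]

# Which grades are flagged, and how often?
table(flagged)
```

Under this rule, the zeros and the single score of 1 are flagged, while scores of 5 and above are not. A different rule could draw the line elsewhere.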