library(tidyverse)
studentperformance <- read_delim("./Portuguese Student.csv", delim = ";")
set.seed(432)
# Initialize an empty data frame to hold the combined sub-samples.
random_samples <- data.frame()
# Fraction of the full data set used for each sub-sample.
frac <- 0.20
# A counter for the while loop.
i <- 1
# Run until the counter reaches 6, giving 5 random sub-samples. There are
# other ways to do this, but a while loop felt most natural to me.
while(i < 6){
  # Sample row indices from the full data set, with replacement, using frac
  # to determine the sample size.
  indices <- sample(1:nrow(studentperformance), size = frac * nrow(studentperformance), replace = TRUE)
  # Use the indices to pull the actual rows from the original data.
  current_sample <- studentperformance[indices,]
  # Add a sample_id column so rows can be traced back to their sub-sample
  # once everything is combined.
  current_sample <- current_sample |>
    mutate(sample_id = i)
  # Bind the current sub-sample onto the growing random_samples data frame
  # (on the first pass this just binds to the empty data frame).
  random_samples <- bind_rows(random_samples, current_sample)
  # Increment the counter so the loop doesn't run forever.
  i <- i + 1
}
# Create a comparison data frame with the mean and standard deviation of the
# final-term grades (G3) in each sub-sample, along with other summary statistics.
comp <- random_samples |>
  group_by(sample_id) |>
  summarize(count = n(),
            mean_G3 = mean(G3),
            sd_G3 = sd(G3),
            iqr_G3 = IQR(G3),
            min_G3 = min(G3),
            max_G3 = max(G3),
            study_cor = cor(studytime, G3),
            avg_absences = mean(absences),
            .groups = 'drop')
print(comp)
## # A tibble: 5 × 9
## sample_id count mean_G3 sd_G3 iqr_G3 min_G3 max_G3 study_cor avg_absences
## <dbl> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 129 11.7 3.02 4 0 18 0.237 3.50
## 2 2 129 12.1 2.94 4 0 19 0.282 3.81
## 3 3 129 11.8 3.80 4 0 19 0.286 3.73
## 4 4 129 11.7 3.19 4 0 19 0.221 3.74
## 5 5 129 11.9 3.19 4 0 19 0.277 3.40
In my example, I have produced a data frame with 5 different sub-samples. Each sub-sample has 129 rows, or about 20% of my full data set, which makes the combined data frame 645 rows. I’ve put the mean and standard deviation of each sub-sample into a data frame alongside a couple of other metrics so I can look for differences between the sub-samples. Running the chunk multiple times produces similar results, with some variation between runs. Generally speaking, the mean of the G3 scores in each sub-sample falls between 11.5 and 12.5, with the standard deviations ranging from roughly 2.75 to 3.75.
A score difference of 1 point on a 20-point scale can actually be fairly large, since it represents about 5% of a grade. Context and scale matter here: a one-unit difference may mean nothing in one data set and everything in another. In this case, sample means that vary by up to about a point represent a decent amount of variation between sub-samples. Under the common American grading system, a 5% change could be the difference between a B- (82%) and a B+ (87%), or between a B (85%) and an A- (90%). It’s also worth noting that the standard deviation is 2.94 in sample 2 but 3.80 in sample 3, indicating that the scores in sample 2 are more tightly clustered and more predictable.
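To put a rough number on that spread, the comparison table itself can be summarized. This is just a quick check on the comp data frame built above, using the column names defined there.
# Spread of the sub-sample means and standard deviations across the 5 samples.
comp |>
  summarize(mean_G3_range = max(mean_G3) - min(mean_G3),
            sd_G3_range = max(sd_G3) - min(sd_G3))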
In sample 1, the maximum G3 score is 18, while it is 19 in each of the other 4 samples. An analysis based only on sample 1 would therefore never see a score of 19, so a 19 would look like an anomaly there, whereas in the other 4 samples it is an observed value.
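If we wanted to see how rare a final grade of 19 actually is, we could count it directly in the full data set. This is a quick side check rather than part of the analysis above, and it assumes the grade column is G3 as used earlier.
# Count how many students in the full data set earned a final grade of 19.
studentperformance |>
  summarize(n_19 = sum(G3 == 19))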
Each sample has a minimum score of zero, meaning each sample picked up at least one of the students who failed out or didn’t complete the semester. The study_cor column, which measures the correlation between study time and G3 grades, is positive for every sample. The strength of the correlation is also fairly consistent across the sub-samples, which suggests that more study time could lead to better grades. (It’s important to recognize that study time is not the only factor affecting G3 grades, so this simple correlation doesn’t prove causation.)
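As an aside, the same style of resampling can be written more compactly with dplyr’s slice_sample() and purrr’s map_dfr(), both loaded with the tidyverse. This is only an alternative sketch of the loop above, not the code that produced the output shown, so the exact rows drawn would differ.
# Alternative sketch: draw 5 sub-samples of 20% of the rows, with replacement,
# tag each with its sample_id, and bind them together in one step.
random_samples_alt <- map_dfr(1:5, function(i) {
  studentperformance |>
    slice_sample(prop = 0.20, replace = TRUE) |>
    mutate(sample_id = i)
})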
set.seed(432)
# Initialize an empty data frame to hold the combined sub-samples.
random_samples <- data.frame()
# A vector of fractions, each defining a different sub-sample size.
frac <- c(0.10, 0.5, 0.75)
# Loop through the different sample size proportions in the frac vector.
for(f in frac){
  # Loop through 1:5 so we get 5 sub-samples at each proportion.
  for(i in 1:5){
    # Sample row indices with replacement, sized by the current fraction f.
    indices <- sample(1:nrow(studentperformance), size = f * nrow(studentperformance), replace = TRUE)
    current_sample <- studentperformance[indices,]
    # Record the sample_id and the fraction used for this sub-sample.
    current_sample <- current_sample |>
      mutate(sample_id = i,
             frac_size = f)
    random_samples <- bind_rows(random_samples, current_sample)
  }
}
# Create a comparison data frame with the mean and standard deviation of the
# final-term grades (G3) in each sub-sample, along with other summary statistics.
comp <- random_samples |>
  group_by(frac_size, sample_id) |>
  summarize(count = n(),
            mean_G3 = mean(G3),
            sd_G3 = sd(G3),
            iqr_G3 = IQR(G3),
            min_G3 = min(G3),
            max_G3 = max(G3),
            study_cor = cor(studytime, G3),
            avg_absences = mean(absences),
            .groups = 'drop')
print(comp)
## # A tibble: 15 × 10
## frac_size sample_id count mean_G3 sd_G3 iqr_G3 min_G3 max_G3 study_cor
## <dbl> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.1 1 64 11.7 2.74 4 0 18 0.175
## 2 0.1 2 64 11.7 3.30 4 0 18 0.273
## 3 0.1 3 64 12.0 2.69 4 5 18 0.352
## 4 0.1 4 64 12.2 3.22 4 0 19 0.238
## 5 0.1 5 64 12.4 2.98 5 7 19 0.166
## 6 0.5 1 324 11.7 3.46 4 0 19 0.274
## 7 0.5 2 324 11.9 3.26 4 0 19 0.385
## 8 0.5 3 324 11.9 3.19 4 0 19 0.250
## 9 0.5 4 324 11.8 3.32 4 0 18 0.272
## 10 0.5 5 324 11.9 3.28 4 0 19 0.248
## 11 0.75 1 486 11.9 3.14 4 0 19 0.268
## 12 0.75 2 486 12.0 2.98 4 0 19 0.219
## 13 0.75 3 486 11.6 3.35 4 0 19 0.193
## 14 0.75 4 486 11.9 3.31 4 0 18 0.288
## 15 0.75 5 486 11.9 3.23 4 0 19 0.235
## # ℹ 1 more variable: avg_absences <dbl>
When we increase our relative sub-sample size, the G3 mean becomes more consistent across the samples. This is because the larger our sample, the closer it is to our original data set. If we were to treat the original data as the ‘entire population’ (even though it’s not), then increasing the sample size brings the sample mean closer to the true population mean; this is the law of large numbers at work. The study_cor column seems to tighten its range of values as we increase the sample size, but avg_absences actually seems to broaden its range. This would need further runs and investigation to see whether it happened by chance or reflects a real pattern.
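One way to check this claim more directly is to summarize how much the 5 sub-sample means vary within each fraction size. This is a quick sketch against the comp table built above rather than a formal test.
# How much do the 5 sub-sample means of G3 vary at each fraction size?
comp |>
  group_by(frac_size) |>
  summarize(mean_G3_spread = max(mean_G3) - min(mean_G3),
            sd_of_means = sd(mean_G3),
            .groups = 'drop')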
This investigation highlights how much samples can differ from one another, especially when the samples are small. In many data sets, including this one, certain metrics carry nuance that can affect an analysis. For example, if we were to take a 20% sample of the data, run statistical tests, and fit statistical models, we might get noticeably different results than if we had taken a different 20% sample. It is important to be careful when drawing conclusions, because a trend that appears in one sample may not be present, or may not matter, in another.
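As a rough illustration of that point, we could fit the same simple model to two independently drawn 20% sub-samples and compare the coefficients. This is only a sketch: the model choice (G3 on studytime and absences) is arbitrary, and the estimates will change from run to run.
# Fit the same simple linear model to two different 20% sub-samples; the
# studytime coefficient (and its apparent strength) can differ noticeably.
sample_a <- slice_sample(studentperformance, prop = 0.20, replace = TRUE)
sample_b <- slice_sample(studentperformance, prop = 0.20, replace = TRUE)
fit_a <- lm(G3 ~ studytime + absences, data = sample_a)
fit_b <- lm(G3 ~ studytime + absences, data = sample_b)
summary(fit_a)$coefficients
summary(fit_b)$coefficients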