A Regression Model for Volume of Breast Milk Produced

Background

Exclusive pumping refers to a parent feeding a baby breast milk that has been pumped rather than breastfeeding directly. One advantage of this (to a data analyst!) is that you can measure the amount of milk produced at each pump. This model aims to identify factors that affect the volume of milk pumped.

Data Source

The data set was provided by an exclusively pumping mum who used the Huckleberry App which allows the user to extract a CSV of all entered data. She used the app to record the volumes of milk pumped as well as other information about her baby including his growth and volume of milk consumed.

Data Preparation

Load and Clean Data

df <- read_csv("ronan.csv", show_col_types = FALSE) %>%
  filter(Type == "Pump")

data <- df %>%
  filter(Type == "Pump") %>%
  mutate(
    start = strptime(Start, format = "%d/%m/%y %H:%M"),
    date = as.Date(start),
    volume = as.numeric(str_extract(`Start Condition`, "\\d+"))) %>%
  arrange(start) %>%
  mutate(
    age_days = as.numeric(difftime(start, 
                                   as.Date("2023-03-02"), 
                                   units = "days")),
    minutes = as.numeric(difftime(start, lag(start), units = "mins")),
    hours = minutes/60,
    first_pump = ifelse(date != lag(date), 1, 0),
    first_pump = ifelse(is.na(first_pump), 1, first_pump)) %>%
  filter(!is.na(minutes)) %>%
  select(start, date, volume, age_days, minutes, hours, first_pump)
variables <- data.frame(first_column = c("start", "date", "volume", "age_days", "minutes", "hours", "first_pump"), second_column = c("Start - Time and Date", "Date without Time", "Volume of Pumped Milk in mL", "Age of Baby in Days", "Minutes Since Previous Pump", "Hours Since Previous Pump (calculated from minutes)", "1 = First Pump of the Day, 0 = Subsequent Pump"))

colnames(variables) <- c('Variable Name','Description') 

variables %>%
  kbl(caption = "Data Key") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
Data Key
Variable Name Description
start Start - Time and Date
date Date without Time
volume Volume of Pumped Milk in mL
age_days Age of Baby in Days
minutes Minutes Since Previous Pump
hours Hours Since Previous Pump (calculated from minutes)
first_pump 1 = First Pump of the Day, 0 = Subsequent Pump
rm(variables)

Summary of Data

stat <- skim(data)
num <- stat %>%
  select(skim_variable, skim_type, n_missing, numeric.mean, numeric.sd, numeric.p0, numeric.p25, numeric.p50, numeric.p75, numeric.p100, numeric.hist) %>%
  mutate(across(where(is.numeric), ~ round(., 1))) %>%
  filter(skim_type == "numeric") %>%
  select(-skim_type) %>%
  filter(skim_variable != "first_pump") %>%
  rename(Variable = skim_variable,
         NAs = n_missing,
         Mean = numeric.mean,
         SD = numeric.sd,
         P0 = numeric.p0,
         P25 = numeric.p25,
         P50 = numeric.p50,
         P75 = numeric.p75,
         P100 = numeric.p100,
         Histogram = numeric.hist) %>%
  kbl(caption = "Numeric Variable Statistics", align = "c") %>%
  column_spec(1, width = "2cm") %>%
  column_spec(2, width = "2cm") %>%
  column_spec(3, width = "2cm") %>%
  column_spec(4, width = "2cm") %>%
  column_spec(5, width = "2cm") %>%
  column_spec(6, width = "2cm") %>%
  column_spec(7, width = "2cm") %>%
  column_spec(8, width = "2cm") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed")) 

num
Numeric Variable Statistics
Variable NAs Mean SD P0 P25 P50 P75 P100 Histogram
volume 0 219.7 109.3 10.0 150.0 180.0 240.0 570.0 ▂▇▁▂▁
age_days 0 112.9 67.2 6.7 51.2 110.5 170.5 265.5 ▇▆▆▆▂
minutes 0 388.7 604.5 0.0 200.0 257.5 537.8 13079.0 ▇▁▁▁▁
hours 0 6.5 10.1 0.0 3.3 4.3 9.0 218.0 ▇▁▁▁▁
nnum <- stat %>%
  select(skim_variable, skim_type, n_missing) %>%
  filter(skim_variable %in% c("date", "start", "first_pump")) %>%
  select(-skim_type) %>%
  rename(Variable = skim_variable,
         NAs = n_missing) %>%
  kbl(caption = "Non-Numeric Variable Statistics", align = "c") %>%
  column_spec(1, width = "3cm") %>%
  column_spec(2, width = "3cm") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed")) 

nnum
Non-Numeric Variable Statistics
Variable NAs
date 0
start 0
first_pump 0
rm(stat, num, nnum)

There are no missing values in the table. The column for minutes since last pump (‘minutes’) does appear to have some extreme values, as the maximum value of 13,079 is much higher than the 75th percentile of 538. The minimum value is 0 which also doesn’t make sense for the time between two pumps. These values need to be inspected further and some may need to be removed. The hours since last pump (‘hours’) has the same concerns but as it’s a derivative of the ‘minutes’ column, only ‘minutes’ needs to be modified. Volume also has a slightly skewed distribution; this will be inspected too.

Visual Check for Extreme Values

colours <- c("#DDF093", "#003249", "#BFD7EA", "#847E89")

p1 <- ggplot(data, aes(y = volume, x= "")) +
  geom_boxplot(fill = colours[1]) +
  labs(title = "Distribution of Volume", y = "Volume (mL)") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5), axis.title.x = element_blank())

p2 <- ggplot(data, aes(y=hours, x= "")) +
  geom_boxplot(fill = colours[3]) +
  labs(title = "Distribution of Time Between Pumps", y = "Time (Hours)") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5), axis.title.x = element_blank())

p3 <- ggplot(data, aes(x=hours, y=volume)) +
  labs(title = "Relationship Between Time Between Pumps and Volume Produced", 
       x = "Time (Hours)", y = "Volume (mL)") +
  geom_point(color = colours[2], alpha=0.6) +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5))
(plot_spacer() | p1 | plot_spacer() | p2 | plot_spacer()) + plot_layout(widths = c(1,3,1,3,1))

p3

rm(p1, p2, p3)

The distribution of volume does have more large values, however there’s no indication that these are errors in data entry or otherwise invalid so I’m not going to make any changes. The distribution of time between pumps is clearly has some extreme values that make it difficult to visualise anything else in the plot. It’s also clear that there’s a relationship between the time since last pump and volume produced when ignoring those extreme values.

Further Inspection

In the first table I’m ordering the table by highest minutes and in the second I’m ordering by lowest.

mean_min <- mean(data$minutes)
sd_min <- sd(data$minutes)

t1 <- data %>% 
  arrange(desc(minutes)) %>%
  select(date, minutes) %>%
  mutate(`SD From Mean` = abs(minutes - mean_min)/sd_min) %>%
  head(10) %>%
  rename(Date = date,
       `Mins Since Last Pump` = minutes)


t2 <- data %>% 
  arrange(minutes) %>%
  select(date, minutes) %>%
  mutate(`SD From Mean` = (minutes - mean_min)/sd_min) %>%
  head(10) %>%
  rename(Date = date,
       `Mins Since Last Pump` = minutes)
t1 %>%
  kbl() %>%
  column_spec(1, width = "4cm") %>%
  column_spec(2, width = "5cm") %>%
  column_spec(3, width = "4cm") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
Date Mins Since Last Pump SD From Mean
2023-11-22 13079 20.993342
2023-11-13 11130 17.769156
2023-11-05 3052 4.405908
2023-11-03 2886 4.131298
2023-11-01 2519 3.524178
2023-10-30 1874 2.457170
2023-10-29 1741 2.237151
2023-10-17 1694 2.159400
2023-10-22 1479 1.803730
2023-10-27 1475 1.797113
t2 %>%
  kbl() %>%
  column_spec(1, width = "4cm") %>%
  column_spec(2, width = "5cm") %>%
  column_spec(3, width = "4cm") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
Date Mins Since Last Pump SD From Mean
2023-07-14 0 -0.6429450
2023-07-17 16 -0.6164766
2023-03-20 72 -0.5238371
2023-03-23 75 -0.5188743
2023-04-24 76 -0.5172200
2023-10-19 76 -0.5172200
2023-03-15 79 -0.5122572
2023-03-21 80 -0.5106029
2023-03-22 80 -0.5106029
2023-03-19 84 -0.5039858
rm(mean_min, sd_min, t1, t2)

The large values of time since last pump are over 10,000 hours - more than a week. Mum explained that towards the end of the supplied data she was intentionally pumping less often to reduce her supply and we established that I can use the count of pumps per day to work out the exact date that she started decreasing her supply - this is the first day than she only pumped twice in a day.

Filtering Data

Here I’m grouping by date and summarising by count to get the number of pumps per day, then filtering for days with 2 or fewer pumps. then I’ve arranged by date to get the first date where there was a reduction.

day_data <- data %>%
  group_by(date) %>%
  summarise(`Pump Count` = n()) %>%
  filter(`Pump Count` <= 2) %>%
  arrange(date) %>%
  head(5) %>%
  rename(Date = date)
day_data %>%
  kbl() %>%
  column_spec(1, width = "4cm") %>%
  column_spec(2, width = "4cm") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
Date Pump Count
2023-10-13 2
2023-10-14 2
2023-10-15 2
2023-10-16 1
2023-10-17 2

Storing the minimum value to use later.

# Grab minimum date value

minimum <- min(day_data$Date)

Removing Extreme Values

Here I’m removing dates that are after the minimum date I found before, then filtering for rows with minutes greater than 0.

data_clean <- data %>%
  filter(date < minimum) %>%
  filter(minutes > 0)

Checking that the values of n <= 2 have been removed.

data_check <- data_clean %>%
  mutate(`Minimum Minutes Between Pumps` = min(minutes)) %>%
  group_by(date) %>%
  reframe(count = n(), `Minimum Minutes Between Pumps`) %>%
  mutate(`Minimum Count Per Day` = min(count)) %>% 
  head(1) %>%
  pivot_longer(cols = c(`Minimum Minutes Between Pumps`, `Minimum Count Per Day`)) %>%
  select(name, value) %>%
  mutate(`Successful?` = case_when(
    name == "Minimum Minutes Between Pumps" & value > 0 |
    name == "Minimum Count Per Day" & value > 2 ~ "Yes",
    TRUE ~ "No"
  )) %>%
  rename(Measure = name,
         Value = value)
data_check %>%
  kbl(caption = "Modified Values") %>%
  column_spec(1, width = "8cm") %>%
  column_spec(2, width = "4cm") %>%
  column_spec(3, width = "4cm") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
Modified Values
Measure Value Successful?
Minimum Minutes Between Pumps 16 Yes
Minimum Count Per Day 3 Yes
rm(data_check, minimum, day_data)

Visualise Cleaned Data

p4 <- ggplot(data_clean, aes(y = volume, x= "")) +
  geom_boxplot(fill = colours[1]) +
  labs(title = "Distribution of Volume", y = "Volume (mL)") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5), axis.title.x = element_blank())

p5 <- ggplot(data_clean, aes(y=hours, x= "")) +
  geom_boxplot(fill = colours[3]) +
  labs(title = "Distribution of Time Between Pumps", y = "Time (Hours)") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5), axis.title.x = element_blank())

p6 <- ggplot(data_clean, aes(x=hours, y=volume)) +
  labs(title = "Time Between Pumps v Volume Produced", x = "Time (Hours)", y = "Volume (mL)") +
  geom_point(color = colours[2], alpha=0.6) +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5))
(plot_spacer() | p4 | plot_spacer() | p5 | plot_spacer()) + plot_layout(widths = c(1,3,1,3,1))

p6

After removing those values the distribution of hours between pumps still has some apparent outliers but overall it looks more reasonable. The relationship between hours since last pump and volume pumped is much clearer. This does mean that this model will only be applicable to the selected data i.e. before mum attempted to decrease supply.

Visualise Age Variable

ggplot(data_clean, aes(x=age_days, y=volume)) +
  geom_point(color = colours[2], alpha = 0.6) +
  labs(title = "Baby's Age v Volume Produced", x = "Baby's Age (Days)", y = "Volume (mL)") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5))

rm(p4, p5, p6)

There is not a clear relationship between baby’s age and the volume pumped. I don’t think this will make sense to include in the final model but I’ll still test it to see if it adds any value.

Modelling

Normalisation and Correlation Matrix

Here I’m normalising the minutes since last pump and age of baby in days (scaling values to between 0 and 1). This puts these on the same scale which prevents the model becoming biased due to the difference in scale. The correlation matrix shows the correlation between the predictor variables to see if they’re related to each other separately from their relationships to the volume.

data_norm <- data_clean %>%
  mutate(min = scales::rescale(minutes),
         age = scales::rescale(age_days))

cor_matrix <- cor(data_norm[, c("first_pump", "min", "age")], use = "complete.obs")

pheatmap(cor_matrix, main = "Linear Model Correlation Matrix", display_numbers = T, 
         color = colorRampPalette(c('#9de24f', '#ffff66', '#ffbd55', '#ff6666'))(100), 
         fontsize_number = 15)

rm(cor_matrix)

The correlation matrix shows that there’s a strong relationship between the minute since last pump and whether it’s the first pump of the day. This does make sense as there is a longer gap overnight while sleeping. This could make it difficult to separate their effects on the volume. The age has a low correlation with any other variable but I’m also not convinced that it has a strong effect. First I’m going to test this relationship.

Relationship Between Baby’s Age and Volume Pumped

model_age <- lm(volume ~ age, data=data_norm)
summary(model_age)
## 
## Call:
## lm(formula = volume ~ age, data = data_norm)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -146.73  -73.38  -42.47   32.31  326.17 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  195.661      6.678  29.301  < 2e-16 ***
## age           52.623     12.088   4.353 1.49e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 108.4 on 925 degrees of freedom
## Multiple R-squared:  0.02008,    Adjusted R-squared:  0.01902 
## F-statistic: 18.95 on 1 and 925 DF,  p-value: 1.49e-05

The model suggests there’s a statistically significant relationship between baby’s age and the volume pumped, but the R2 of 0.02 tells us that the linear model explains very little of the variance in volume pumped. It’s possible that there’s a non-linear relationship between volume and baby’s age. For the time being I’m going to exclude the age as beyond the scope of this project.

Variance Inflation Factor (VIF)

The VIF quantifies how much the variance of the estimated regression coefficient increases if your predictors are correlated. This essentially means that it tells us how much the accuracy of the predictions of the model due to the multicollinearity (relationship between) two predictor variables.

model <- lm(volume ~ first_pump + min, data = data_norm)
vif_values <- vif(model)
vif_df <- data.frame(Variable = names(vif_values), VIF = as.vector(vif_values))
vif_df %>% 
  kbl() %>%
  column_spec(1, width = "4cm") %>%
  column_spec(2, width = "4cm") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
Variable VIF
first_pump 8.356252
min 8.356252

The VIFs for the first_pump & minutes model of >8 means that the relationship between the two variables will have an effect on the accuracy of the model. One way of counteracting this is to include an interaction variable, created by multiplying the variables.

VIF for Interaction Model

model_int <- lm(volume ~ min + first_pump + (min*first_pump), data = data_norm)
vif_df <- as.data.frame(vif(model_int, type = "predictor"))
vif_df %>% 
  kbl() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
GVIF Df GVIF^(1/(2*Df)) Interacts With Other Predictors
min 1 3 1 first_pump
first_pump 1 3 1 min

The GVIF (generalised VIF) is greatly reduced in this model which suggests the interaction term has helped to reduce the multicollinearity.

Model Summary

summary(model_int)
## 
## Call:
## lm(formula = volume ~ min + first_pump + (min * first_pump), 
##     data = data_norm)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -162.576  -18.852    1.284   19.075  113.470 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      90.006      4.182  21.523  < 2e-16 ***
## min             318.536     17.056  18.676  < 2e-16 ***
## first_pump       20.665     21.228   0.973  0.33057    
## min:first_pump   86.742     33.459   2.592  0.00968 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 33.89 on 923 degrees of freedom
## Multiple R-squared:  0.9045, Adjusted R-squared:  0.9042 
## F-statistic:  2913 on 3 and 923 DF,  p-value: < 2.2e-16

Including the interaction variable has made the first_pump variable statistically insignificant suggesting that the interaction between the minutes since last pump and first_pump explains most of the effect of the first_pump variable.

Model Without First_Pump

This model removes the statistically insignificant variable.

model_int_2 <- lm(volume ~ min + min:first_pump, data = data_norm)

Final Model Summary

summary(model_int_2)
## 
## Call:
## lm(formula = volume ~ min + min:first_pump, data = data_norm)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -160.359  -18.799    1.349   18.709  113.954 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       90.81       4.10  22.149   <2e-16 ***
## min              315.42      16.75  18.828   <2e-16 ***
## min:first_pump   117.16      11.96   9.799   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 33.88 on 924 degrees of freedom
## Multiple R-squared:  0.9044, Adjusted R-squared:  0.9042 
## F-statistic:  4369 on 2 and 924 DF,  p-value: < 2.2e-16

This model explains about 90% of the variation in volume pumped. Both variables are statistically significant.

Normality Test

Here I’m testing whether the residuals (difference between the model’s predicted value and the actual value) are normally distributed. In this model we assume that there’s a linear relationship between the variables; this is a way of verifying that assumption.

residuals_int_df <- as.data.frame(residuals(model_int_2))

ggplot(residuals_int_df, aes(x = residuals(model_int_2))) +
  geom_density(fill = colours[1], color = colours[4]) +
  labs(title = "Density Plot of Residuals",
       x = "Residuals",
       y = "Density")

The density plot shows that the model looks to be approximately, although not exactly, normally distributed.

Cross-Validation

Here K-fold cross validation is used to test the model. I’ve set the number of folds to 10. This splits the data into 10 groups (folds), trains the model on 9 sets, and tests it on the 10th set. This is repeated 10 times, using each group as the testing set once. This helps to make sure that I haven’t overfitted the model to the data set.

set.seed(123)

num_folds <- 10

ctrl <- trainControl(method = "cv", number = num_folds)

cv_results <- train(volume ~ min + min:first_pump, 
                    data = data_norm, method = "lm", trControl = ctrl)
cv_results
## Linear Regression 
## 
## 927 samples
##   2 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 835, 834, 834, 835, 834, 835, ... 
## Resampling results:
## 
##   RMSE      Rsquared   MAE     
##   33.90813  0.9050475  25.40757
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE

The cross validation model R2 of 0.905 is the average R2 across all the folds. This is very close to the original model R2 (0.904) and this reassures us that the model is not overfitted. It tells us that the model performs reliably across different subsets of the data suggesting that the model is reliable.

Summary

The final model for predicting the volume of milk produced has two independent variables - minutes since last pump and minutes since last pump x first_pump.

The model equation is:

\[ \text{Volume} = 90.81 + 315.42 \left(\frac{\text{minutes} - 16}{930}\right) + 117.16 \left(\frac{\text{minutes} - 16}{930}\right) \times \text{first_pump} \]
For example, when the time since the last pump is 500 minutes and it’s not the first pump of the day (i.e. first_pump is 0) the expected volume is: \[ \begin{align*} \text{Volume} &= 90.81 + 315.42 \left(\frac{500 - 16}{914}\right) + 117.16 \left(\frac{500 - 16}{914}\right) \times 0 \\ &= 90.81 + 315.42 \times 0.5295405\ + \text{0} \\ &= 257.8377 \\ &\approx 258 \, \text{mL} \end{align*} \]
When the time since the last pump is 500 minutes and it is the first pump of the day (i.e. first_pump is 1), the expected volume is: \[ \begin{align*} \text{Volume} &= 90.81 + 315.42 \left(\frac{500 - 16}{914}\right) + 117.16 \left(\frac{500 - 16}{914}\right) \times 1 \\ &= 90.81 + 315.42 \times 0.5295405\ + \text{117.16}\ \times \text{0.5295405} \times 1\\ &= 90.81 + 167.0277\ + 62.04096 \\ &= 319.8787 \\ &\approx 320 \, \text{mL} \end{align*} \] As the first_pump variable is either yes (1) or no (0), this part of the equation: \[ \begin{align*} 117.16 \left(\frac{\text{minutes} - 16}{930}\right) \times \text{first_pump} \end{align*} \] will be 0 when first_pump is 0. This means that while first_pump doesn’t have an effect on volume on its own, it increases the effect of the minutes variable.

Predicted vs Actual Volumes

This plot shows the actual volumes compared to the volumes predicted by my model.

predictions <- predict(model_int_2, newdata = data_norm)

data_norm$prediction <- predictions

ggplot(data_norm, aes(x = volume, y = prediction)) +
  geom_point(color = colours[2], alpha=0.6) +
  geom_abline(intercept = 0, slope = 1, linetype = "dashed", 
              linewidth = 1.5, color = "darkgrey") +
  labs(
    title = "Predicted vs. Actual Volumes",
    x = "Actual Volume (mL)",
    y = "Predicted Volume (mL)") +
  theme_minimal()