Exclusive pumping refers to a parent feeding a baby breast milk that has been pumped rather than breastfeeding directly. One advantage of this (to a data analyst!) is that you can measure the amount of milk produced at each pump. This model aims to identify factors that affect the volume of milk pumped.
The data set was provided by an exclusively pumping mum who used the Huckleberry App which allows the user to extract a CSV of all entered data. She used the app to record the volumes of milk pumped as well as other information about her baby including his growth and volume of milk consumed.
df <- read_csv("ronan.csv", show_col_types = FALSE) %>%
filter(Type == "Pump")
data <- df %>%
filter(Type == "Pump") %>%
mutate(
start = strptime(Start, format = "%d/%m/%y %H:%M"),
date = as.Date(start),
volume = as.numeric(str_extract(`Start Condition`, "\\d+"))) %>%
arrange(start) %>%
mutate(
age_days = as.numeric(difftime(start,
as.Date("2023-03-02"),
units = "days")),
minutes = as.numeric(difftime(start, lag(start), units = "mins")),
hours = minutes/60,
first_pump = ifelse(date != lag(date), 1, 0),
first_pump = ifelse(is.na(first_pump), 1, first_pump)) %>%
filter(!is.na(minutes)) %>%
select(start, date, volume, age_days, minutes, hours, first_pump)variables <- data.frame(first_column = c("start", "date", "volume", "age_days", "minutes", "hours", "first_pump"), second_column = c("Start - Time and Date", "Date without Time", "Volume of Pumped Milk in mL", "Age of Baby in Days", "Minutes Since Previous Pump", "Hours Since Previous Pump (calculated from minutes)", "1 = First Pump of the Day, 0 = Subsequent Pump"))
colnames(variables) <- c('Variable Name','Description')
variables %>%
kbl(caption = "Data Key") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))| Variable Name | Description |
|---|---|
| start | Start - Time and Date |
| date | Date without Time |
| volume | Volume of Pumped Milk in mL |
| age_days | Age of Baby in Days |
| minutes | Minutes Since Previous Pump |
| hours | Hours Since Previous Pump (calculated from minutes) |
| first_pump | 1 = First Pump of the Day, 0 = Subsequent Pump |
stat <- skim(data)
num <- stat %>%
select(skim_variable, skim_type, n_missing, numeric.mean, numeric.sd, numeric.p0, numeric.p25, numeric.p50, numeric.p75, numeric.p100, numeric.hist) %>%
mutate(across(where(is.numeric), ~ round(., 1))) %>%
filter(skim_type == "numeric") %>%
select(-skim_type) %>%
filter(skim_variable != "first_pump") %>%
rename(Variable = skim_variable,
NAs = n_missing,
Mean = numeric.mean,
SD = numeric.sd,
P0 = numeric.p0,
P25 = numeric.p25,
P50 = numeric.p50,
P75 = numeric.p75,
P100 = numeric.p100,
Histogram = numeric.hist) %>%
kbl(caption = "Numeric Variable Statistics", align = "c") %>%
column_spec(1, width = "2cm") %>%
column_spec(2, width = "2cm") %>%
column_spec(3, width = "2cm") %>%
column_spec(4, width = "2cm") %>%
column_spec(5, width = "2cm") %>%
column_spec(6, width = "2cm") %>%
column_spec(7, width = "2cm") %>%
column_spec(8, width = "2cm") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
num| Variable | NAs | Mean | SD | P0 | P25 | P50 | P75 | P100 | Histogram |
|---|---|---|---|---|---|---|---|---|---|
| volume | 0 | 219.7 | 109.3 | 10.0 | 150.0 | 180.0 | 240.0 | 570.0 | ▂▇▁▂▁ |
| age_days | 0 | 112.9 | 67.2 | 6.7 | 51.2 | 110.5 | 170.5 | 265.5 | ▇▆▆▆▂ |
| minutes | 0 | 388.7 | 604.5 | 0.0 | 200.0 | 257.5 | 537.8 | 13079.0 | ▇▁▁▁▁ |
| hours | 0 | 6.5 | 10.1 | 0.0 | 3.3 | 4.3 | 9.0 | 218.0 | ▇▁▁▁▁ |
nnum <- stat %>%
select(skim_variable, skim_type, n_missing) %>%
filter(skim_variable %in% c("date", "start", "first_pump")) %>%
select(-skim_type) %>%
rename(Variable = skim_variable,
NAs = n_missing) %>%
kbl(caption = "Non-Numeric Variable Statistics", align = "c") %>%
column_spec(1, width = "3cm") %>%
column_spec(2, width = "3cm") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
nnum| Variable | NAs |
|---|---|
| date | 0 |
| start | 0 |
| first_pump | 0 |
There are no missing values in the table. The column for minutes since last pump (‘minutes’) does appear to have some extreme values, as the maximum value of 13,079 is much higher than the 75th percentile of 538. The minimum value is 0 which also doesn’t make sense for the time between two pumps. These values need to be inspected further and some may need to be removed. The hours since last pump (‘hours’) has the same concerns but as it’s a derivative of the ‘minutes’ column, only ‘minutes’ needs to be modified. Volume also has a slightly skewed distribution; this will be inspected too.
colours <- c("#DDF093", "#003249", "#BFD7EA", "#847E89")
p1 <- ggplot(data, aes(y = volume, x= "")) +
geom_boxplot(fill = colours[1]) +
labs(title = "Distribution of Volume", y = "Volume (mL)") +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5), axis.title.x = element_blank())
p2 <- ggplot(data, aes(y=hours, x= "")) +
geom_boxplot(fill = colours[3]) +
labs(title = "Distribution of Time Between Pumps", y = "Time (Hours)") +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5), axis.title.x = element_blank())
p3 <- ggplot(data, aes(x=hours, y=volume)) +
labs(title = "Relationship Between Time Between Pumps and Volume Produced",
x = "Time (Hours)", y = "Volume (mL)") +
geom_point(color = colours[2], alpha=0.6) +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5))The distribution of volume does have more large values, however there’s no indication that these are errors in data entry or otherwise invalid so I’m not going to make any changes. The distribution of time between pumps is clearly has some extreme values that make it difficult to visualise anything else in the plot. It’s also clear that there’s a relationship between the time since last pump and volume produced when ignoring those extreme values.
In the first table I’m ordering the table by highest minutes and in the second I’m ordering by lowest.
mean_min <- mean(data$minutes)
sd_min <- sd(data$minutes)
t1 <- data %>%
arrange(desc(minutes)) %>%
select(date, minutes) %>%
mutate(`SD From Mean` = abs(minutes - mean_min)/sd_min) %>%
head(10) %>%
rename(Date = date,
`Mins Since Last Pump` = minutes)
t2 <- data %>%
arrange(minutes) %>%
select(date, minutes) %>%
mutate(`SD From Mean` = (minutes - mean_min)/sd_min) %>%
head(10) %>%
rename(Date = date,
`Mins Since Last Pump` = minutes)t1 %>%
kbl() %>%
column_spec(1, width = "4cm") %>%
column_spec(2, width = "5cm") %>%
column_spec(3, width = "4cm") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))| Date | Mins Since Last Pump | SD From Mean |
|---|---|---|
| 2023-11-22 | 13079 | 20.993342 |
| 2023-11-13 | 11130 | 17.769156 |
| 2023-11-05 | 3052 | 4.405908 |
| 2023-11-03 | 2886 | 4.131298 |
| 2023-11-01 | 2519 | 3.524178 |
| 2023-10-30 | 1874 | 2.457170 |
| 2023-10-29 | 1741 | 2.237151 |
| 2023-10-17 | 1694 | 2.159400 |
| 2023-10-22 | 1479 | 1.803730 |
| 2023-10-27 | 1475 | 1.797113 |
t2 %>%
kbl() %>%
column_spec(1, width = "4cm") %>%
column_spec(2, width = "5cm") %>%
column_spec(3, width = "4cm") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))| Date | Mins Since Last Pump | SD From Mean |
|---|---|---|
| 2023-07-14 | 0 | -0.6429450 |
| 2023-07-17 | 16 | -0.6164766 |
| 2023-03-20 | 72 | -0.5238371 |
| 2023-03-23 | 75 | -0.5188743 |
| 2023-04-24 | 76 | -0.5172200 |
| 2023-10-19 | 76 | -0.5172200 |
| 2023-03-15 | 79 | -0.5122572 |
| 2023-03-21 | 80 | -0.5106029 |
| 2023-03-22 | 80 | -0.5106029 |
| 2023-03-19 | 84 | -0.5039858 |
The large values of time since last pump are over 10,000 hours - more than a week. Mum explained that towards the end of the supplied data she was intentionally pumping less often to reduce her supply and we established that I can use the count of pumps per day to work out the exact date that she started decreasing her supply - this is the first day than she only pumped twice in a day.
Here I’m grouping by date and summarising by count to get the number of pumps per day, then filtering for days with 2 or fewer pumps. then I’ve arranged by date to get the first date where there was a reduction.
day_data <- data %>%
group_by(date) %>%
summarise(`Pump Count` = n()) %>%
filter(`Pump Count` <= 2) %>%
arrange(date) %>%
head(5) %>%
rename(Date = date)day_data %>%
kbl() %>%
column_spec(1, width = "4cm") %>%
column_spec(2, width = "4cm") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))| Date | Pump Count |
|---|---|
| 2023-10-13 | 2 |
| 2023-10-14 | 2 |
| 2023-10-15 | 2 |
| 2023-10-16 | 1 |
| 2023-10-17 | 2 |
Storing the minimum value to use later.
Here I’m removing dates that are after the minimum date I found before, then filtering for rows with minutes greater than 0.
Checking that the values of n <= 2 have been removed.
data_check <- data_clean %>%
mutate(`Minimum Minutes Between Pumps` = min(minutes)) %>%
group_by(date) %>%
reframe(count = n(), `Minimum Minutes Between Pumps`) %>%
mutate(`Minimum Count Per Day` = min(count)) %>%
head(1) %>%
pivot_longer(cols = c(`Minimum Minutes Between Pumps`, `Minimum Count Per Day`)) %>%
select(name, value) %>%
mutate(`Successful?` = case_when(
name == "Minimum Minutes Between Pumps" & value > 0 |
name == "Minimum Count Per Day" & value > 2 ~ "Yes",
TRUE ~ "No"
)) %>%
rename(Measure = name,
Value = value)data_check %>%
kbl(caption = "Modified Values") %>%
column_spec(1, width = "8cm") %>%
column_spec(2, width = "4cm") %>%
column_spec(3, width = "4cm") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))| Measure | Value | Successful? |
|---|---|---|
| Minimum Minutes Between Pumps | 16 | Yes |
| Minimum Count Per Day | 3 | Yes |
p4 <- ggplot(data_clean, aes(y = volume, x= "")) +
geom_boxplot(fill = colours[1]) +
labs(title = "Distribution of Volume", y = "Volume (mL)") +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5), axis.title.x = element_blank())
p5 <- ggplot(data_clean, aes(y=hours, x= "")) +
geom_boxplot(fill = colours[3]) +
labs(title = "Distribution of Time Between Pumps", y = "Time (Hours)") +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5), axis.title.x = element_blank())
p6 <- ggplot(data_clean, aes(x=hours, y=volume)) +
labs(title = "Time Between Pumps v Volume Produced", x = "Time (Hours)", y = "Volume (mL)") +
geom_point(color = colours[2], alpha=0.6) +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5))
After removing those values the distribution of hours between pumps
still has some apparent outliers but overall it looks more reasonable.
The relationship between hours since last pump and volume pumped is much
clearer. This does mean that this model will only be applicable to the
selected data i.e. before mum attempted to decrease supply.
ggplot(data_clean, aes(x=age_days, y=volume)) +
geom_point(color = colours[2], alpha = 0.6) +
labs(title = "Baby's Age v Volume Produced", x = "Baby's Age (Days)", y = "Volume (mL)") +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5))There is not a clear relationship between baby’s age and the volume pumped. I don’t think this will make sense to include in the final model but I’ll still test it to see if it adds any value.
Here I’m normalising the minutes since last pump and age of baby in days (scaling values to between 0 and 1). This puts these on the same scale which prevents the model becoming biased due to the difference in scale. The correlation matrix shows the correlation between the predictor variables to see if they’re related to each other separately from their relationships to the volume.
data_norm <- data_clean %>%
mutate(min = scales::rescale(minutes),
age = scales::rescale(age_days))
cor_matrix <- cor(data_norm[, c("first_pump", "min", "age")], use = "complete.obs")
pheatmap(cor_matrix, main = "Linear Model Correlation Matrix", display_numbers = T,
color = colorRampPalette(c('#9de24f', '#ffff66', '#ffbd55', '#ff6666'))(100),
fontsize_number = 15)The correlation matrix shows that there’s a strong relationship between the minute since last pump and whether it’s the first pump of the day. This does make sense as there is a longer gap overnight while sleeping. This could make it difficult to separate their effects on the volume. The age has a low correlation with any other variable but I’m also not convinced that it has a strong effect. First I’m going to test this relationship.
##
## Call:
## lm(formula = volume ~ age, data = data_norm)
##
## Residuals:
## Min 1Q Median 3Q Max
## -146.73 -73.38 -42.47 32.31 326.17
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 195.661 6.678 29.301 < 2e-16 ***
## age 52.623 12.088 4.353 1.49e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 108.4 on 925 degrees of freedom
## Multiple R-squared: 0.02008, Adjusted R-squared: 0.01902
## F-statistic: 18.95 on 1 and 925 DF, p-value: 1.49e-05
The model suggests there’s a statistically significant relationship between baby’s age and the volume pumped, but the R2 of 0.02 tells us that the linear model explains very little of the variance in volume pumped. It’s possible that there’s a non-linear relationship between volume and baby’s age. For the time being I’m going to exclude the age as beyond the scope of this project.
The VIF quantifies how much the variance of the estimated regression coefficient increases if your predictors are correlated. This essentially means that it tells us how much the accuracy of the predictions of the model due to the multicollinearity (relationship between) two predictor variables.
model <- lm(volume ~ first_pump + min, data = data_norm)
vif_values <- vif(model)
vif_df <- data.frame(Variable = names(vif_values), VIF = as.vector(vif_values))
vif_df %>%
kbl() %>%
column_spec(1, width = "4cm") %>%
column_spec(2, width = "4cm") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))| Variable | VIF |
|---|---|
| first_pump | 8.356252 |
| min | 8.356252 |
The VIFs for the first_pump & minutes model of >8 means that the relationship between the two variables will have an effect on the accuracy of the model. One way of counteracting this is to include an interaction variable, created by multiplying the variables.
model_int <- lm(volume ~ min + first_pump + (min*first_pump), data = data_norm)
vif_df <- as.data.frame(vif(model_int, type = "predictor"))| GVIF | Df | GVIF^(1/(2*Df)) | Interacts With | Other Predictors | |
|---|---|---|---|---|---|
| min | 1 | 3 | 1 | first_pump | – |
| first_pump | 1 | 3 | 1 | min | – |
The GVIF (generalised VIF) is greatly reduced in this model which suggests the interaction term has helped to reduce the multicollinearity.
##
## Call:
## lm(formula = volume ~ min + first_pump + (min * first_pump),
## data = data_norm)
##
## Residuals:
## Min 1Q Median 3Q Max
## -162.576 -18.852 1.284 19.075 113.470
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 90.006 4.182 21.523 < 2e-16 ***
## min 318.536 17.056 18.676 < 2e-16 ***
## first_pump 20.665 21.228 0.973 0.33057
## min:first_pump 86.742 33.459 2.592 0.00968 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 33.89 on 923 degrees of freedom
## Multiple R-squared: 0.9045, Adjusted R-squared: 0.9042
## F-statistic: 2913 on 3 and 923 DF, p-value: < 2.2e-16
Including the interaction variable has made the first_pump variable statistically insignificant suggesting that the interaction between the minutes since last pump and first_pump explains most of the effect of the first_pump variable.
This model removes the statistically insignificant variable.
##
## Call:
## lm(formula = volume ~ min + min:first_pump, data = data_norm)
##
## Residuals:
## Min 1Q Median 3Q Max
## -160.359 -18.799 1.349 18.709 113.954
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 90.81 4.10 22.149 <2e-16 ***
## min 315.42 16.75 18.828 <2e-16 ***
## min:first_pump 117.16 11.96 9.799 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 33.88 on 924 degrees of freedom
## Multiple R-squared: 0.9044, Adjusted R-squared: 0.9042
## F-statistic: 4369 on 2 and 924 DF, p-value: < 2.2e-16
This model explains about 90% of the variation in volume pumped. Both variables are statistically significant.
Here I’m testing whether the residuals (difference between the model’s predicted value and the actual value) are normally distributed. In this model we assume that there’s a linear relationship between the variables; this is a way of verifying that assumption.
residuals_int_df <- as.data.frame(residuals(model_int_2))
ggplot(residuals_int_df, aes(x = residuals(model_int_2))) +
geom_density(fill = colours[1], color = colours[4]) +
labs(title = "Density Plot of Residuals",
x = "Residuals",
y = "Density")The density plot shows that the model looks to be approximately, although not exactly, normally distributed.
Here K-fold cross validation is used to test the model. I’ve set the number of folds to 10. This splits the data into 10 groups (folds), trains the model on 9 sets, and tests it on the 10th set. This is repeated 10 times, using each group as the testing set once. This helps to make sure that I haven’t overfitted the model to the data set.
set.seed(123)
num_folds <- 10
ctrl <- trainControl(method = "cv", number = num_folds)
cv_results <- train(volume ~ min + min:first_pump,
data = data_norm, method = "lm", trControl = ctrl)## Linear Regression
##
## 927 samples
## 2 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 835, 834, 834, 835, 834, 835, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 33.90813 0.9050475 25.40757
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
The cross validation model R2 of 0.905 is the average R2 across all the folds. This is very close to the original model R2 (0.904) and this reassures us that the model is not overfitted. It tells us that the model performs reliably across different subsets of the data suggesting that the model is reliable.
The final model for predicting the volume of milk produced has two independent variables - minutes since last pump and minutes since last pump x first_pump.
The model equation is:
\[
\text{Volume} = 90.81 + 315.42 \left(\frac{\text{minutes} -
16}{930}\right) + 117.16 \left(\frac{\text{minutes} - 16}{930}\right)
\times \text{first_pump}
\]
For example, when the time since the last pump is 500
minutes and it’s not the first pump of the day (i.e. first_pump is 0)
the expected volume is: \[
\begin{align*}
\text{Volume} &= 90.81 + 315.42 \left(\frac{500 - 16}{914}\right) +
117.16 \left(\frac{500 - 16}{914}\right) \times 0 \\
&= 90.81 + 315.42 \times 0.5295405\ + \text{0} \\
&= 257.8377 \\
&\approx 258 \, \text{mL}
\end{align*}
\]
When the time since the last pump is 500 minutes and it
is the first pump of the day (i.e. first_pump is 1), the expected
volume is: \[
\begin{align*}
\text{Volume} &= 90.81 + 315.42 \left(\frac{500 - 16}{914}\right) +
117.16 \left(\frac{500 - 16}{914}\right) \times 1 \\
&= 90.81 + 315.42 \times 0.5295405\ + \text{117.16}\ \times
\text{0.5295405} \times 1\\
&= 90.81 + 167.0277\ + 62.04096 \\
&= 319.8787 \\
&\approx 320 \, \text{mL}
\end{align*}
\] As the first_pump variable is either yes (1) or no (0), this
part of the equation: \[
\begin{align*}
117.16 \left(\frac{\text{minutes} - 16}{930}\right) \times
\text{first_pump}
\end{align*}
\] will be 0 when first_pump is 0. This means that while
first_pump doesn’t have an effect on volume on its own, it increases the
effect of the minutes variable.
This plot shows the actual volumes compared to the volumes predicted by my model.
predictions <- predict(model_int_2, newdata = data_norm)
data_norm$prediction <- predictions
ggplot(data_norm, aes(x = volume, y = prediction)) +
geom_point(color = colours[2], alpha=0.6) +
geom_abline(intercept = 0, slope = 1, linetype = "dashed",
linewidth = 1.5, color = "darkgrey") +
labs(
title = "Predicted vs. Actual Volumes",
x = "Actual Volume (mL)",
y = "Predicted Volume (mL)") +
theme_minimal()