Analyze biodiesel potential across algae groups and test for differences.
## **Analyzed groups:** diatom, Thalassiosira, Chlorella, Nannochloropsis, Scenedesmus
## # A tibble: 5 × 4
##   group           mean_bf  sd_bf     n
##   <fct>             <dbl>  <dbl> <int>
## 1 diatom            0.286 0.0287    96
## 2 Thalassiosira     0.299 0.0284   119
## 3 Chlorella         0.463 0.0363    76
## 4 Nannochloropsis   0.600 0.0409    74
## 5 Scenedesmus       0.437 0.0311    55
Let \(Y_{gi}\) be biodiesel fraction for observation \(i\) in group \(g\).
Group mean and variance: \[\bar Y_g=\frac{1}{n_g}\sum_{i=1}^{n_g}Y_{gi},\quad s_g^2=\frac{1}{n_g-1}\sum_{i=1}^{n_g}(Y_{gi}-\bar Y_g)^2.\]
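The two formulas above can be checked numerically. The following is a minimal base-R sketch on a made-up toy data frame (the group labels `A`/`B` and all values are hypothetical, not the study data); it applies the definitions of \(\bar Y_g\) and \(s_g^2\) directly so the result can be compared with R's built-in `var()`.

```r
# Toy data: hypothetical values for illustration only, not the study data
toy <- data.frame(
  group = factor(c("A", "A", "A", "B", "B", "B", "B")),
  biodiesel_fraction = c(0.28, 0.30, 0.29, 0.45, 0.47, 0.46, 0.44)
)

# Split the response by group so each element holds Y_{g1}, ..., Y_{g n_g}
by_group <- split(toy$biodiesel_fraction, toy$group)

group_stats <- data.frame(
  group   = names(by_group),
  n       = unname(vapply(by_group, length, integer(1))),
  mean_bf = unname(vapply(by_group, mean, numeric(1))),
  # Explicit formula: sum of squared deviations divided by (n_g - 1)
  var_bf  = unname(vapply(
    by_group,
    function(y) sum((y - mean(y))^2) / (length(y) - 1),
    numeric(1)
  ))
)
group_stats$sd_bf <- sqrt(group_stats$var_bf)

print(group_stats)
```

The `sd_bf` column here plays the same role as `sd_bf` in the summary table, namely \(s_g=\sqrt{s_g^2}\).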
ggplot(df, aes(group, biodiesel_fraction)) +
  geom_violin(trim = FALSE) +
  geom_boxplot(width = 0.2, outlier.shape = NA) +
  geom_point(position = position_jitter(width = 0.15), alpha = 0.35) +
  labs(x = "Group", y = "Biodiesel fraction")
A separate OLS line is fitted within each group: \[Y_{gi}=\beta_{0g}+\beta_{1g} L_{gi}+\varepsilon_{gi},\] where \(L_{gi}\) is lipid% and \(Y_{gi}\) is biodiesel fraction. Each plotted line shows the fitted values \(\hat Y_{gi}=\hat\beta_{0g}+\hat\beta_{1g}L_{gi}\).
ggplot(df, aes(lipid_pct, biodiesel_fraction, color = group)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(x = "Lipid %", y = "Biodiesel fraction")
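To read off the per-group intercepts and slopes that the plot draws, the same fits can be done explicitly with `lm()`. This is a sketch on simulated data: the groups `A`/`B`, the true coefficients, and the noise level are all assumptions chosen for illustration, and only the column names follow the plotting code.

```r
set.seed(1)

# Simulated data (hypothetical): two groups with different intercepts and slopes
toy <- data.frame(
  group     = rep(c("A", "B"), each = 30),
  lipid_pct = runif(60, min = 10, max = 40)
)
toy$biodiesel_fraction <- ifelse(
  toy$group == "A",
  0.10 + 0.005 * toy$lipid_pct,
  0.20 + 0.008 * toy$lipid_pct
) + rnorm(60, sd = 0.01)

# One OLS fit per group: Y = beta_0g + beta_1g * L + error
fits <- lapply(
  split(toy, toy$group),
  function(d) lm(biodiesel_fraction ~ lipid_pct, data = d)
)

# Collect the estimates into a matrix with one row per group
coef_table <- t(sapply(fits, coef))
colnames(coef_table) <- c("beta0_hat", "beta1_hat")
print(coef_table)
```

Fitting by group like this is equivalent to a single model with a group-by-lipid interaction, but keeping the fits separate mirrors the one-line-per-colour plot more directly.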
For observation \(i\) with energy input \(E_i\) (kWh/L) and biodiesel fraction \(Y_i\), \[\text{yield\_per\_kWh}_i=\dfrac{Y_i}{E_i}.\] Because \(Y_i\) is a dimensionless fraction, the ratio has units of L/kWh. The violin and box plots summarize its distribution by group.
# Keep rows with a usable response and a strictly positive energy input,
# then compute the yield ratio Y / E
yd <- df %>%
  filter(!is.na(biodiesel_fraction),
         !is.na(energy_kWh_per_L),
         energy_kWh_per_L > 0) %>%
  mutate(yield_per_kWh = biodiesel_fraction / energy_kWh_per_L)

ggplot(yd, aes(group, yield_per_kWh, fill = group)) +
  geom_violin(trim = FALSE, alpha = 0.4) +
  geom_boxplot(width = 0.2, outlier.shape = NA) +
  stat_summary(fun = mean, geom = "point", size = 2) +  # group means
  labs(x = "Group", y = "Biodiesel fraction per unit energy (L/kWh)", title = "Higher is better") +
  theme(legend.position = "none")
A plane fit corresponds to multiple regression: \[Y_i=\beta_0+\beta_1 L_i+\beta_2 r_i+\varepsilon_i,\] with \(L\)=lipid% and \(r\)=growth rate per day.
plot_ly(df, x = ~lipid_pct, y = ~growth_rate_per_day, z = ~biodiesel_fraction,
        type = "scatter3d", mode = "markers", color = ~group)
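The plane itself can be estimated with an ordinary two-predictor `lm()`. The sketch below uses simulated data: the true coefficients and noise level are assumptions for illustration, and only the column names match the plotting code. It also evaluates the fitted plane on a small grid, which is the form a surface trace would need.

```r
set.seed(2)
n <- 100

# Simulated data (hypothetical coefficients, not estimates from the study)
toy <- data.frame(
  lipid_pct           = runif(n, min = 10, max = 40),
  growth_rate_per_day = runif(n, min = 0.2, max = 1.0)
)
toy$biodiesel_fraction <- 0.05 +
  0.01 * toy$lipid_pct +
  0.10 * toy$growth_rate_per_day +
  rnorm(n, sd = 0.01)

# Multiple regression: Y = beta_0 + beta_1 * L + beta_2 * r + error
plane_fit <- lm(biodiesel_fraction ~ lipid_pct + growth_rate_per_day, data = toy)
print(coef(plane_fit))

# Evaluate the fitted plane on a 5 x 5 grid of (L, r) values
grid <- expand.grid(
  lipid_pct           = seq(10, 40, length.out = 5),
  growth_rate_per_day = seq(0.2, 1.0, length.out = 5)
)
grid$fitted <- predict(plane_fit, newdata = grid)
head(grid)
```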
## # A tibble: 5 × 5
##   term                 estimate std.error statistic   p.value
##   <chr>                   <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept)            0.286    0.00335      85.4 2.50e-265
## 2 groupThalassiosira     0.0135   0.00450      2.99 2.93e-  3
## 3 groupChlorella         0.177    0.00504      35.2 1.03e-126
## 4 groupNannochloropsis   0.314    0.00507      61.9 9.11e-212
## 5 groupScenedesmus       0.151    0.00555      27.3 1.07e- 94
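The term names in the coefficient table suggest a one-way model with diatom as the reference level. This is an inference from the output, since the fitting call is not shown. Under that assumption the model is \[Y=\beta_0+\sum_{g\ne \text{diatom}}\beta_g\mathbf{1}\{G=g\}+\varepsilon,\] so the intercept estimates the diatom mean and each \(\hat\beta_g\) estimates the difference \(\bar Y_g-\bar Y_{\text{diatom}}\). The reported estimates reproduce the group means in the summary table up to rounding, for example \[\hat\beta_0+\hat\beta_{\text{Chlorella}}=0.286+0.177=0.463,\qquad \hat\beta_0+\hat\beta_{\text{Nannochloropsis}}=0.286+0.314=0.600.\]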
We also adjust for lipid% (\(L\)), growth rate (\(r\)), and total energy (\(E\)). \[ \begin{aligned} Y &= \beta_0 \\ &\quad + \sum_{g\ne \text{diatom}} \beta_g \mathbf{1}\{G=g\} \\ &\quad + \gamma_1 L + \gamma_2 r + \gamma_3 E + \varepsilon. \end{aligned} \]
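A model of this form can be fitted by adding the covariates to the group factor in `lm()`. The sketch below is run on simulated data: the three-group subset, the effect sizes, and the noise level are assumptions for illustration, and only the column names follow the code elsewhere in this report. Placing diatom first in the factor levels makes it the reference group, matching the sum over \(g\ne\text{diatom}\).

```r
set.seed(3)
n <- 150
groups <- c("diatom", "Chlorella", "Nannochloropsis")

# Simulated data (hypothetical effect sizes, not estimates from the study)
toy <- data.frame(
  group               = factor(sample(groups, n, replace = TRUE), levels = groups),
  lipid_pct           = runif(n, min = 10, max = 40),
  growth_rate_per_day = runif(n, min = 0.2, max = 1.0),
  energy_kWh_per_L    = runif(n, min = 0.5, max = 2.0)
)
shift <- c(diatom = 0, Chlorella = 0.15, Nannochloropsis = 0.30)
toy$biodiesel_fraction <- 0.10 +
  unname(shift[as.character(toy$group)]) +
  0.004 * toy$lipid_pct +
  0.05 * toy$growth_rate_per_day +
  rnorm(n, sd = 0.02)  # energy has no true effect in this simulation

# Adjusted model: group indicators (diatom = reference) plus L, r, and E
adj_fit <- lm(
  biodiesel_fraction ~ group + lipid_pct + growth_rate_per_day + energy_kWh_per_L,
  data = toy
)

# Extract the group contrasts: adjusted differences from diatom
coefs <- summary(adj_fit)$coefficients
group_rows <- grepl("^group", rownames(coefs))
print(coefs[group_rows, c("Estimate", "Pr(>|t|)")])
```

Each group estimate is a difference from diatom at fixed values of lipid%, growth rate, and energy, so it can differ from the corresponding unadjusted contrast.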
## * Winner by adjusted model: **Scenedesmus** (coef = 0.156, p = 3.59e-79)
## * Lipid% and growth rate are positive predictors; energy impact is small in this sample.
## * Group distributions overlap once observation-level noise is taken into account; the ranking reflects a higher group mean, not dominance for every observation.