1. (Note to self: add in text citations for these parts)
Algae are an ever-growing part of agriculture, cultivated for both food and fuel. Research on algae is expanding, but many studies are conducted at small scales; as the scale of cultivation increases, so does the risk, and procedures must be defined to reduce that risk in large-scale development. This is where the ATP3 studies come in: they studied algae across 5 geographical regions with a coordinated experimental design, standardizing the algae harvest operation so that the results could serve as a benchmark for future algae cultivation. The studies took place over 19 months between 2014 and 2015. The overall goal was to standardize algae harvest by comparing algal productivity, although smaller studies were conducted throughout. This paper focuses on the main goal of comparing algal biomass productivity in identical ponds.
2.
Inocula for three different algae strains, Nannochloropsis oceanica KA32, Chlorella vulgaris LRB-AZ-1201, and Desmodesmus sp. C046, were grown in indoor columns using growth media suited to each strain, then scaled up and transferred to identical outdoor ponds at 5 geographical locations: Arizona State University (ASU), California Polytechnic State University (CP), Georgia Institute of Technology (GT), Cellana LLC (CELL), and Florida Algae (FA). The initiation phase tested whether each location could implement the same procedures consistently, and all of them could. Outliers were highlighted automatically in the data, and the spreadsheets were passed between workers for multiple rounds of review. The data were then combined and assessed using R. The scripts and primary individual spreadsheets are available upon request. The authors avoided removing many outliers or instrumentation errors so that the data could not be called curated rather than purely objective.
3.
The statistic I will analyze is the Algal Harvest Yield Productivity (g/m^2/day), calculated as instructed in the authors' usage guidelines. This excludes the first and last two harvest data points, to avoid interference from the grow-out batch composition from before the culture adapted to the pond and to remove the impact of contamination on the harvested biomass composition. (I have no clue how this explains the last two or how it has to do with contamination; it is just what it told me. Do you have any ideas?) This will be compared to numerous other variables in the dataset, which are all listed below.
(This is just copied and pasted 8 facts, I want to adjust the facts to better match my work now that I have a better idea. This is just a start for me, feel free to ignore for grading and suggestions.)
Algae can be used as sources of biodiesel, nutraceuticals, cosmetics, pharmaceuticals, fertilizers, and food, depending on their characteristics. Fatty acid methyl esters derived from microalgal lipids are what is called biodiesel. Native strains are preferred for ecological and economic security in production. Chlorophytes are extensively studied because triacylglycerols (TAGs) are their main carbon storage molecules. More than 200,000 microalgae species exist, but only a limited number have been studied due to their morphological features and growth characteristics. The authors analyzed harvesting data collected under natural conditions and culturing rather than under experimental conditions designed to optimize growth and lipid content. These datasets represent a conservative, non-optimized estimation of typical algal areal harvest yield productivities and compositions that can be achieved with these strains at the scale, locations, and operational parameters chosen.
6.
I am still unsure of my overarching question in my data set. I found one graph that would be really cool to reproduce, but the study definitely did not only have one aim and I think it was a bit of a bad choice. I am considering just choosing something to study using the data rather than recreating something from the study. We should discuss this Monday.
Part 2 - Your Work with the data
7./8. Create initial summary graphs/statistics
In the study I chose, the authors use all of the ATP3 data to study algal productivity over the course of the entire project. To study all of the harvest data, I will therefore combine every ATP3 harvest dataset into one.
Rows: 4175 Columns: 53
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (9): SiteID, ExperimentID, PondID, DateTime, StrainID, SourceID, BatchI...
dbl (35): time.between.harvests.days, Harvest., Harvest.Vol..L., AFDW..g.L.,...
lgl (9): crash, Comments, NH4.mg.L, NH4.PCT.RSD, OD680, OD680.PCT.RSD, NH4....
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Remove extra rows so that harvest4 matches the other datasets
harvest4 <- harvest4 |> filter(!is.na(AFDW..g.))
# Remove extra columns so that harvest4 matches the other datasets
harvest4 <- harvest4[-c(9, 15, 17:53)]
# Rename variables to match the other datasets
names(harvest4)[names(harvest4) == "DateTime"] <- "Date"
names(harvest4)[names(harvest4) == "Duration.days"] <- "time..d."
Rows: 113 Columns: 14
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (9): SiteID, ExperimentID, StrainID, SourceID, BatchID, Date, PondID, Tr...
dbl (5): Harvest., time..d., Harvest.Vol..L., AFDW..g.L., AFDW..g.
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 328 Columns: 14
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (9): SiteID, ExperimentID, StrainID, SourceID, BatchID, Date, PondID, Tr...
dbl (5): Harvest., time..d., Harvest.Vol..L., AFDW..g.L., AFDW..g.
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 326 Columns: 14
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (9): SiteID, ExperimentID, StrainID, SourceID, BatchID, Date, PondID, Tr...
dbl (5): Harvest., time..d., Harvest.Vol..L., AFDW..g.L., AFDW..g.
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 292 Columns: 14
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (9): SiteID, ExperimentID, StrainID, SourceID, BatchID, Date, PondID, Tr...
dbl (5): Harvest., time..d., Harvest.Vol..L., AFDW..g.L., AFDW..g.
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 117 Columns: 14
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (9): SiteID, ExperimentID, StrainID, SourceID, BatchID, Date, PondID, Tr...
dbl (5): Harvest., time..d., Harvest.Vol..L., AFDW..g.L., AFDW..g.
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 270 Columns: 14
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (9): SiteID, ExperimentID, StrainID, SourceID, BatchID, Date, PondID, Tr...
dbl (5): Harvest., time..d., Harvest.Vol..L., AFDW..g.L., AFDW..g.
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 142 Columns: 14
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (9): SiteID, ExperimentID, StrainID, SourceID, BatchID, Date, PondID, Tr...
dbl (5): Harvest., time..d., Harvest.Vol..L., AFDW..g.L., AFDW..g.
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 127 Columns: 14
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (9): SiteID, ExperimentID, StrainID, SourceID, BatchID, Date, PondID, Tr...
dbl (5): Harvest., time..d., Harvest.Vol..L., AFDW..g.L., AFDW..g.
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
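The combining step described above can be sketched on toy data. This is only an illustration under stated assumptions: `harvest_a`, `harvest_b`, and `extra_column` are made-up stand-ins, the values are not ATP3 data, and the sketch uses base R so it runs without extra packages (with dplyr loaded, `bind_rows()` would do the stacking).

```r
# Toy stand-ins for two harvest files (hypothetical values, not ATP3 data)
harvest_a <- data.frame(
  SiteID = c("ASU", "ASU"),
  PondID = c("P1", "P1"),
  Harvest. = c(1, 2),
  AFDW..g. = c(10.5, 12.0),
  extra_column = c("x", "y")  # a column the other file does not have
)
harvest_b <- data.frame(
  SiteID = c("CP", "CP"),
  PondID = c("P2", "P2"),
  Harvest. = c(1, 2),
  AFDW..g. = c(8.0, NA)       # a row with no mass, dropped below
)

# Keep only the columns that every dataset shares
datasets <- list(harvest_a, harvest_b)
shared_cols <- Reduce(intersect, lapply(datasets, names))
trimmed <- lapply(datasets, function(d) d[, shared_cols, drop = FALSE])

# Stack the datasets, then drop rows with a missing ash-free dry weight
harvest <- do.call(rbind, trimmed)
harvest <- harvest[!is.na(harvest$AFDW..g.), ]
print(harvest)
```

Selecting the shared columns by name rather than by position is one way to avoid dropping the wrong column if a file's column order ever changes.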
Productivity
We would like to compare the productivity of each algal strain under the different treatments in the study, so we combined the datasets into one. The study reports productivity without the first and last two harvests to account for the initial grow-out and contamination, so I will do additional calculations to obtain that.
# Sum the mass, grouped by pond and experiment. The group_by() only has to be
# done in this chunk; it stays in place for the rest of the calculations.
# To undo it we would need ungroup().
current <- harvest |>
  group_by(PondID, SourceID) |>
  mutate(`Ash-free_dry_mass_sum` = sum(`Ash-free_dry_mass_of_algal_harvest_g`))
# Adjust for the initial grow out by removing the first harvest, then sum the masses.
currentf <- current |>
  filter(Harvest. != 1) |>
  mutate(`Ash-free_dry_mass_sum_f` = sum(`Ash-free_dry_mass_of_algal_harvest_g`))
# Adjust for the initial grow out and the final harvest by also removing the
# final harvest, then sum the masses.
currentfl <- currentf |>
  filter(Harvest. != max(Harvest.)) |>
  mutate(`Ash-free_dry_mass_sum_fl` = sum(`Ash-free_dry_mass_of_algal_harvest_g`))
# Divide the unadjusted and adjusted sums by the pond area.
current <- current |> mutate(`Ash-free_dry_mass_div_pond_area` = `Ash-free_dry_mass_sum` / 4.2)
currentf <- currentf |> mutate(`Ash-free_dry_mass_div_pond_area_f` = `Ash-free_dry_mass_sum_f` / 4.2)
currentfl <- currentfl |> mutate(`Ash-free_dry_mass_div_pond_area_fl` = `Ash-free_dry_mass_sum_fl` / 4.2)
# Define the max harvest to find the total experimental duration per experiment.
current <- current |> mutate(maxharvest = max(Harvest.))
# Define the min harvest to subtract from the final for the adjusted experimental duration.
current <- current |> mutate(minharvest = min(Harvest.))
# Limit the rows to the ones with the total experimental duration per experiment to divide.
current2 <- current[current$Harvest. == current$maxharvest, ]
# Rename this variable to be more accurate, as we will be joining it later.
names(current2)[names(current2) == "experimental_duration"] <- "tot_experimental_duration"
# Limit the rows to the ones with the time required to reach the first harvest per
# experiment, to subtract from the final for the adjusted experimental duration.
current3 <- current[current$Harvest. == current$minharvest, ]
# Rename this variable to be more accurate, as we will be joining it later.
names(current3)[names(current3) == "experimental_duration"] <- "first_experimental_duration"
# Define and keep the 2nd-to-max harvest to find the total experimental duration
# per experiment excluding the final harvest.
current4 <- current |>
  arrange(desc(Harvest.)) |>
  mutate(n = row_number()) |>
  ungroup() |>
  filter(n == 2)
# Rename this variable to be more accurate, as we will be joining it later.
names(current4)[names(current4) == "experimental_duration"] <- "2nd_last_experimental_duration"
Bringing them back together and filling each pond/source to do calculations straight across rows for productivity.
# Replace NAs with the total experimental duration for its experiment
joined1 <- joined |>
  group_by(PondID, SourceID) |>
  tidyr::fill(tot_experimental_duration, .direction = "downup") |>
  ungroup()
# Replace NAs with the first experimental duration for its experiment
joined2 <- joined1 |>
  group_by(PondID, SourceID) |>
  tidyr::fill(first_experimental_duration, .direction = "downup") |>
  ungroup()
# Not run: replacing NAs with the second-to-last experimental duration for its experiment.
# joined3 <- joined2 |>
#   group_by(PondID, SourceID) |>
#   tidyr::fill(`2nd_last_experimental_duration`, .direction = "downup") |>
#   ungroup()
# Instead of filling, drop the NA rows, which reduces the data to one observation
# per experiment for averaging.
joined3 <- joined2 |> filter(!is.na(`2nd_last_experimental_duration`))
# Divide by the total experimental duration to get productivity in g/m^2/d.
joined3 <- joined3 |>
  mutate(`Ash-free_dry_mass_div_time_g_m2_day` = `Ash-free_dry_mass_div_pond_area` / `tot_experimental_duration`)
# Repeat for the adjusted experimental duration by subtracting the first harvest
# duration from the total. This gives the productivity without the initial grow out,
# in the same units (g/m^2/d).
joined3 <- joined3 |>
  mutate(`Ash-free_dry_mass_div_time_g_m2_day_f` = `Ash-free_dry_mass_div_pond_area_f` / (`tot_experimental_duration` - `first_experimental_duration`))
# Repeat for the 2nd adjusted experimental duration by subtracting the first harvest
# duration from the 2nd-to-last total duration. This gives the productivity without the
# initial grow out and final harvest, in the same units (g/m^2/d).
joined3 <- joined3 |>
  mutate(`Ash-free_dry_mass_div_time_g_m2_day_fl` = `Ash-free_dry_mass_div_pond_area_fl` / (`2nd_last_experimental_duration` - `first_experimental_duration`))
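To check the three productivity calculations above by hand, here is a minimal sketch on a single made-up pond. The masses and durations are hypothetical, and the names `toy`, `mass_g`, `prod_all`, `prod_f`, and `prod_fl` are introduced only for this illustration; the 4.2 pond area is the value used in the code above. It assumes `experimental_duration` is cumulative days since the start of the experiment.

```r
pond_area_m2 <- 4.2  # pond area used in the calculations above

# Hypothetical pond: harvest number, ash-free dry mass (g), and
# cumulative experimental duration (days) at each harvest
toy <- data.frame(
  Harvest. = 1:4,
  mass_g = c(84, 42, 63, 21),
  experimental_duration = c(10, 14, 19, 24)
)

first_day <- toy$experimental_duration[toy$Harvest. == min(toy$Harvest.)]
last_day <- toy$experimental_duration[toy$Harvest. == max(toy$Harvest.)]
second_last_day <- sort(toy$experimental_duration, decreasing = TRUE)[2]

# Unadjusted: all harvests over the total duration
prod_all <- sum(toy$mass_g) / pond_area_m2 / last_day

# Without the first harvest (initial grow out)
keep_f <- toy$Harvest. != min(toy$Harvest.)
prod_f <- sum(toy$mass_g[keep_f]) / pond_area_m2 / (last_day - first_day)

# Without the first and the final harvest
keep_fl <- keep_f & toy$Harvest. != max(toy$Harvest.)
prod_fl <- sum(toy$mass_g[keep_fl]) / pond_area_m2 / (second_last_day - first_day)

print(c(all = prod_all, no_first = prod_f, no_first_or_final = prod_fl))
```

Each value is a mass per area per day (g/m^2/d), so the three numbers can be compared directly to see how much the first and final harvests shift the estimate.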
In the study they conclude that there is no significant difference in mean productivity between ponds so I will be attempting to reproduce those results.
# Remove NAs so they do not interfere with the calculations (I think this is an appropriate choice)
joined4 <- joined3 |> filter(!is.na(`Ash-free_dry_mass_sum_f`))
library(RColorBrewer)
coul <- brewer.pal(7, "Set2")
barplot(height = treatment_mean_f$mean_productivity, names.arg = treatment_mean_f$PondID, col = coul, xlab = "Pond", ylab = "Algae Ash-free Dry Mass g/m^2/day", main = "Algae Productivity by Pond \n w/o Initial Grow Out", ylim = c(0, 25))
# Plot for algae growth per meter^2 per day by treatment without the initial harvest factored in.
library(RColorBrewer)
coul <- brewer.pal(7, "Set2")
barplot(height = treatment_mean_fl$mean_productivity, names.arg = treatment_mean_fl$PondID, col = coul, xlab = "Pond", ylab = "Algae Ash-free Dry Mass g/m^2/day", main = "Algae Productivity by Pond \n w/o Initial Grow Out and Final Harvest", ylim = c(0, 25))
# Plot for algae growth per meter^2 per day by treatment without the initial and final harvest factored in.
WHY IS THIS MISSING TWO THINGS? IT SAYS INF WHEN I OPEN IT UP??
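One possible cause, offered as a guess that cannot be confirmed from the output shown: R returns Inf whenever a non-zero value is divided by zero, which would happen in the last productivity calculation if the second-to-last and first experimental durations are equal for a pond. A mean that includes Inf is itself Inf, and a bar of infinite height cannot be drawn. The sketch below uses made-up values and hypothetical column names (`area_normalised_mass`, `span_days`) to show the effect and one way to flag it.

```r
# Hypothetical per-pond values (not ATP3 data)
ponds <- data.frame(
  PondID = c("P1", "P2", "P3"),
  area_normalised_mass = c(25, 30, 12),             # g/m^2
  first_experimental_duration = c(10, 10, 7),       # days
  second_last_experimental_duration = c(19, 21, 7)  # P3: same day as the first
)

# Time span used as the denominator; it is zero for P3
ponds$span_days <- ponds$second_last_experimental_duration -
  ponds$first_experimental_duration
ponds$productivity <- ponds$area_normalised_mass / ponds$span_days

# Dividing by a zero-day span gives Inf, and a mean that includes it is Inf too
is_bad <- !is.finite(ponds$productivity)
mean_with_inf <- mean(ponds$productivity)

# One option: look at the flagged rows, then leave them out of the average
print(ponds[is_bad, ])
mean_finite <- mean(ponds$productivity[!is_bad])
```

Checking which rows have a zero (or missing) span before averaging would show whether this is what is happening in the real data.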
9. Define the parameter or parameters you are trying to estimate.
There is no difference in the algal productivity between different ponds.
10. I know I can use ANOVA and kruskal wallis but I’m not sure what else could answer the same question.
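As a starting point for this question, here is a minimal sketch of the two tests named above, plus two other options from base R's stats package that address a similar question: Welch's one-way test (`oneway.test()` with `var.equal = FALSE`, which does not assume equal variances) and Tukey's HSD for pairwise comparisons after ANOVA. The data are simulated and the pond names are hypothetical; whether any of these tests suits the real data depends on assumptions that would need checking.

```r
set.seed(1)

# Simulated productivities (g/m^2/day) for three hypothetical ponds
toy <- data.frame(
  PondID = factor(rep(c("P1", "P2", "P3"), each = 8)),
  productivity = c(rnorm(8, 10, 2), rnorm(8, 10.5, 2), rnorm(8, 9.5, 2))
)

# One-way ANOVA: compares pond means, assumes normality and equal variances
anova_fit <- aov(productivity ~ PondID, data = toy)
anova_p <- summary(anova_fit)[[1]][["Pr(>F)"]][1]

# Kruskal-Wallis: rank-based alternative with weaker assumptions
kw_p <- kruskal.test(productivity ~ PondID, data = toy)$p.value

# Welch's one-way test: like ANOVA but without the equal-variance assumption
welch_p <- oneway.test(productivity ~ PondID, data = toy, var.equal = FALSE)$p.value

# Tukey's HSD: pairwise differences between ponds, following the ANOVA fit
tukey <- TukeyHSD(anova_fit)

print(c(anova = anova_p, kruskal_wallis = kw_p, welch = welch_p))
print(tukey)
```

A large p-value from these tests means the data do not show a difference between ponds; it does not by itself prove the ponds are the same.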
Part 3 - The conclusion
11.
12.
13.
14.
15.
Hotos, G., Avramidou, D., Mastropetros, S. G., Tsigkou, K., Kouvara, K., Makridis, P., & Kornaros, M. (2023). Isolation, identification, and chemical composition analysis of nine microalgal and cyanobacterial species isolated in lagoons of Western Greece. Algal Research, 69, 102935. https://doi.org/10.1016/j.algal.2022.102935
Knoshaug, E., Wolfrum, E., Laurens, L., et al. (2018). Unified field studies of the algae testbed public-private partnership as the benchmark for algae agronomics. Scientific Data, 5, 180267. https://doi.org/10.1038/sdata.2018.267
Knoshaug, E., Laurens, L., Kinchin, C., & Davis, R. (2016). Use of Cultivation Data from the Algae Testbed Public Private Partnership as Utilized in NREL’s Algae State of Technology Assessments. https://doi.org/10.2172/1330992
Lloyd, C., Tan, K. H., Lim, K. L., Valu, V. G., Fun, S. M., Chye, T. R., Mak, H. M., Sim, W. X., Musa, S. L., Ng, J. J., Bte Nordin, N. S., Bte Md Aidzil, N., Eng, Z. Y., Manickavasagam, P., & New, J. Y. (2021). Identification of microalgae cultured in bold’s basal medium from freshwater samples, from a high-rise city. Scientific Reports, 11(1). https://doi.org/10.1038/s41598-021-84112-0
Sero, E. T., Siziba, N., Bunhu, T., & Shoko, R. (2021). Isolation and screening of microalgal species, native to Zimbabwe, with potential use in biodiesel production. All Life, 14(1), 256–264. https://doi.org/10.1080/26895293.2021.1911862