Victory Farms - Egg Yield vs Pond Quality

DSB Group

Evan Peters / Hao Lu / Yang Sun

Background

Victory Farms (VF) is a rapidly growing tilapia fish farm on track to become the largest tilapia producer in Africa within the next 2 years. Established in 2015, VF is building its operation on the Kenyan side of Lake Victoria and its distribution capacity throughout Kenya. VF is rapidly expanding its farm operation, fish processing, and sales & marketing capabilities. By the end of 2018, VF was already the largest aquaculture farm in East Africa. VF is employing world-class technologies, people, and processes to build the leading tilapia business globally. The company has the highest standards for performance, execution, culture, and integrity.

Problem

Optimizing egg yields is of utmost importance. It is likely that pond water quality plays a role in egg yields, but the exact relationship is unknown. We intend to learn more about that potential relationship.

Data

In this project we will use pond quality data to predict tilapia egg yields.

The data and data dictionary are provided by Victory Farms.

| Column Name | Data Type | Description |
|---|---|---|
| Batch Code | Text | Unique identity code for egg collection dates |
| Egg Collection Date | Date | Date eggs are collected from the ponds |
| Pond | Text | Name of pond from which eggs are collected |
| Hapa area | Number | Area in square metres of the pond hapa from which eggs are collected |
| Hapa Number | Number | Number of pond hapas from which eggs are collected |
| Stage 1 | Number | Stage 1 of egg collection |
| Stage 2 | Number | Stage 2 of egg collection |
| Grams Collected | Number | Total mass of eggs collected from stages 1 and 2 in grams |
| Kg collected | Number | Total mass of eggs collected from stages 1 and 2 in kilograms |
| g/m2 | Number | Mass of eggs collected per unit area of the pond, in g/m2 |
| Month | Number | Month of collection |
| Week | Number | Week of collection |
| Est pcs | Number | Estimated number of egg pieces collected from the stated pond |
| Pond Line | Text | Larger pond classification |
| Culling Code | Text | Code for culled eggs |
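Because several columns in the dictionary are derived from others, a quick consistency check may be worthwhile before modeling. The sketch below uses invented rows and hypothetical snake_case column names. It also assumes that g/m2 is grams collected divided by hapa area, which the dictionary does not state outright.

```r
# Hypothetical sample rows; column names are snake_case stand-ins for the
# dictionary entries, and all values are invented for illustration.
eggs <- data.frame(
  batch_code      = c("B001", "B002", "B003"),
  hapa_area       = c(120, 120, 200),       # square metres
  grams_collected = c(2400, 1800, 5000),
  kg_collected    = c(2.4, 1.8, 5.0),
  g_per_m2        = c(20, 15, 25)
)

# Kilograms should equal grams / 1000.
kg_ok <- abs(eggs$kg_collected - eggs$grams_collected / 1000) < 1e-8

# Assumption: g/m2 is grams collected divided by hapa area.
density_ok <- abs(eggs$g_per_m2 - eggs$grams_collected / eggs$hapa_area) < 1e-8

# Rows failing either check deserve a manual look before modeling.
suspect <- eggs[!(kg_ok & density_ok), ]
print(nrow(suspect))
```

If the density check fails on real data, the assumed definition of g/m2 is the first thing to revisit.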

Project Details

As stated in the project prompt, mass per m^2 of pond is the dependent (response) variable to model. Feature selection, combination, and creation will require an understanding of the data, and some variables may be dropped or combined. For example, oxygen ratings are taken at 2 separate times; perhaps the delta between these values is more valuable than the two values themselves. We will explore this.

We may convert the continuous yield variable into a categorical one (i.e., very low, low, average, high, and very high yields) for simplicity and comprehension. This is optional. Pond data will be normalized where sensible; for example, temperature may be evaluated as deviation from the mean rather than as a raw numeric value.

Depending on how much data we have, we can check whether collection rates are seasonal. This effect may simply be coded as a dummy variable for month (we don’t want to use month or week as a numeric variable).

Initial observation of the data will be necessary and useful. Pairwise plots are often a great way of looking for correlations, such as the sample below with Fisher’s iris data:

# Scatterplot matrix of the four numeric iris columns (column 5, Species, is dropped)
pairs(iris[, -5])
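The feature ideas described in this section (oxygen delta, temperature deviation, month as a dummy variable, and optional yield classes) can be sketched in base R. The data here is synthetic and every column name is hypothetical, since the real pond quality columns are not yet known. Binning at quintiles is only one possible way to define the five yield classes.

```r
set.seed(1)
n <- 24

# Synthetic pond readings; every column name here is hypothetical.
pond <- data.frame(
  month     = rep(1:12, times = 2),
  oxygen_am = runif(n, 3, 6),
  oxygen_pm = runif(n, 6, 10),
  temp_c    = rnorm(n, mean = 27, sd = 1.5),
  g_per_m2  = runif(n, 5, 40)
)

# Delta between the two daily oxygen readings.
pond$oxygen_delta <- pond$oxygen_pm - pond$oxygen_am

# Temperature as deviation from the overall mean.
pond$temp_dev <- pond$temp_c - mean(pond$temp_c)

# Month as a factor, so model functions expand it into dummy variables
# instead of treating it as a number.
pond$month_f <- factor(pond$month, levels = 1:12, labels = month.abb)

# Optional: bin yield into five classes at the 20/40/60/80% quantiles.
breaks <- quantile(pond$g_per_m2, probs = seq(0, 1, by = 0.2))
pond$yield_class <- cut(
  pond$g_per_m2,
  breaks = breaks,
  include.lowest = TRUE,
  labels = c("very low", "low", "average", "high", "very high")
)

str(pond)
```

Quantile-based bins give roughly equal class sizes; fixed thresholds chosen with the farm's staff may be easier to interpret.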

The goal of the project is to determine which environmental factors, if any, contribute to egg yield.

If factors contribute positively to an enhanced egg yield, extra care can be taken to ensure that those conditions are met during the fish raising process.
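One simple way to screen for contributing factors is a linear regression, reading the coefficient estimates and p-values. The sketch below uses synthetic data in which yield depends on a hypothetical `oxygen_delta` feature by construction and not on `temp_dev`. It illustrates the idea only and is not necessarily the model this project will end up using.

```r
set.seed(42)
n <- 200

# Synthetic data: yield depends on oxygen_delta by construction,
# and temp_dev has no effect. Both feature names are hypothetical.
sim <- data.frame(
  oxygen_delta = rnorm(n),
  temp_dev     = rnorm(n)
)
sim$g_per_m2 <- 20 + 3 * sim$oxygen_delta + rnorm(n, sd = 1)

# Fit yield against both candidate factors.
fit <- lm(g_per_m2 ~ oxygen_delta + temp_dev, data = sim)

# Coefficient table: estimate, standard error, t value, p-value.
coefs <- summary(fit)$coefficients
print(round(coefs, 3))
```

A large, precisely estimated coefficient (as for `oxygen_delta` here) suggests a factor worth controlling; tree-based importance measures could serve as a cross-check.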

Preparation

Without access to the data, it is hard to estimate accurately what preparation will be necessary. Here are some guesses. Pond name and culling code are likely not important variables.

Data will be split randomly into two sets for training and testing. We suggest an 80/20 training/testing split.

Packages

  • dplyr
  • caret
  • randomForest
  • ctree (a function provided by the party and partykit packages)
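A random 80/20 split can be done in base R with `sample()`. The sketch below uses a synthetic data frame with hypothetical column names; the seed is fixed only so the split is reproducible.

```r
set.seed(123)
n <- 100

# Synthetic stand-in for the real data set; column names are hypothetical.
dat <- data.frame(id = seq_len(n), g_per_m2 = runif(n, 5, 40))

# Draw 80% of the row indices at random, without replacement.
train_idx <- sample(seq_len(n), size = floor(0.8 * n))

train <- dat[train_idx, ]   # 80% of rows for fitting
test  <- dat[-train_idx, ]  # remaining 20% held out for evaluation

print(c(train = nrow(train), test = nrow(test)))
```

caret's `createDataPartition()` is an alternative that stratifies the split on the outcome, which may help if yields are unevenly distributed.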

Concerns:

  • matching collection times with pond data timestamps
  • missing data might be tricky to fill in or interpolate
  • if regressions fail to find relationships between the dependent and independent variables, more data may be needed, possibly including aerial photo data
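The first two concerns can be sketched in base R: `approx()` for linear interpolation of a missing reading, and `findInterval()` to match each collection to the most recent reading at or before it. The timestamps, values, and column names below are invented for illustration. The matching step assumes the readings are sorted by time and that every collection has at least one earlier reading.

```r
# Hypothetical sensor log with one missing oxygen reading.
readings <- data.frame(
  time   = as.POSIXct(c("2019-03-01 06:00", "2019-03-01 12:00",
                        "2019-03-01 18:00", "2019-03-02 06:00"), tz = "UTC"),
  oxygen = c(4.0, NA, 8.0, 5.0)
)

# Hypothetical egg collections to be matched against the log.
collections <- data.frame(
  batch_code = c("B001", "B002"),
  time       = as.POSIXct(c("2019-03-01 13:30", "2019-03-02 07:15"), tz = "UTC")
)

# Fill the gap by linear interpolation over time, using only known readings.
known <- !is.na(readings$oxygen)
readings$oxygen_filled <- approx(
  x    = as.numeric(readings$time[known]),
  y    = readings$oxygen[known],
  xout = as.numeric(readings$time)
)$y

# Index of the latest reading at or before each collection time.
# Assumes readings are sorted and each collection has a prior reading.
idx <- findInterval(as.numeric(collections$time), as.numeric(readings$time))
collections$oxygen_at_collection <- readings$oxygen_filled[idx]

print(collections)
```

Linear interpolation is reasonable only for short gaps; long runs of missing readings may be better dropped than filled.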