DSB Group
Evan Peters / Hao Lu / Yang Sun
Victory Farms (VF) is a rapidly growing tilapia farm on track to become the largest tilapia producer in Africa within the next two years. Established in 2015, VF is building its operation on the Kenyan side of Lake Victoria and its distribution capacity throughout Kenya. VF is rapidly expanding its farm operation, fish processing, and sales & marketing capabilities. By the end of 2018, VF was already the largest aquaculture farm in East Africa. VF is employing world-class technologies, people, and processes to build the leading tilapia business globally. The company holds the highest standards for performance, execution, culture, and integrity.
Optimizing egg yields is of utmost importance. It is likely that pond water quality plays a role in egg yields, but the exact relationship is unknown. We intend to learn more about that potential relationship.
In this project we will use pond water quality data to predict tilapia egg yields.
The data and data dictionary are provided by Victory Farms.
| Column Name | Data Type | Description |
|---|---|---|
| Batch Code | Text | Unique identity code for egg collection dates |
| Egg Collection Date | Date | Date eggs are collected from the ponds |
| Pond | Text | Name of pond from which eggs are collected |
| Hapa area | Number | Area in square metres for pond hapa from which eggs are collected |
| Hapa Number | Number | Number of ponds hapa from which eggs are collected |
| Stage 1 | Number | Stage 1 of egg collection |
| Stage 2 | Number | Stage 2 of egg collection |
| Grams Collected | Number | Total mass of eggs collected from stages 1 and 2 in grams |
| Kg collected | Number | Total mass of eggs collected from stages 1 and 2 in kilograms |
| g/m2 | Number | The per unit mass of eggs collected in g/m2 of the pond |
| Month | Number | Month of collection |
| Week | Number | Week of collection |
| Est pcs | Number | Estimated number of eggs (pieces) collected from the stated pond |
| Pond Line | Text | Larger pond classification |
| Culling Code | Text | Code for culled eggs |
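
The derived columns in the dictionary can be cross-checked against the raw ones once the data arrives. Below is a minimal sketch of that check in base R. The column names (`stage_1`, `hapa_area`, and so on) are hypothetical R-friendly renamings, the values are made up, and two things are assumptions rather than facts from the dictionary: that the stage columns hold masses in grams, and that g/m2 is grams collected divided by hapa area.

```r
# Toy rows with hypothetical column names; all values are made up.
eggs <- data.frame(
  stage_1   = c(120, 300, 80),  # assumed to be grams collected at stage 1
  stage_2   = c(60, 150, 40),   # assumed to be grams collected at stage 2
  hapa_area = c(90, 150, 60)    # square metres
)

# Total mass in grams: sum of the two stages (assumption, see above).
eggs$grams_collected <- eggs$stage_1 + eggs$stage_2

# Same mass expressed in kilograms.
eggs$kg_collected <- eggs$grams_collected / 1000

# Yield per unit area; dividing by hapa area is an assumption.
eggs$g_per_m2 <- eggs$grams_collected / eggs$hapa_area

print(eggs)
```

If the recomputed values disagree with the supplied `Grams Collected`, `Kg collected`, or `g/m2` columns, that would be worth raising with Victory Farms before modeling.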
As stated in the project prompt, egg mass per m^2 of pond is the dependent (response) variable to model. Feature selection, combination, and creation will require understanding the data, and some variables may be dropped or combined. For example, oxygen readings are taken at two separate times; the difference between these readings may be more informative than the two values themselves, and we will explore this. We may also convert the continuous yield variable into a categorical one (i.e., very low, low, average, high, and very high yield) for simplicity and interpretability; this is optional. Pond data will be normalized where sensible. For example, temperature may be expressed as deviation from the mean rather than as a raw value. Depending on how much data we have, we can check whether collection rates are seasonal. This effect may simply be coded as dummy variables for month, since we do not want to treat month or week as numeric variables. An initial visual inspection of the data will be useful. Pairwise scatterplots are often a good way to look for correlations, as in the sample below with Fisher’s iris data:
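```r
# Pairwise scatterplots of the four numeric iris columns (column 5, Species, is dropped).
pairs(iris[, -5])
```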
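The feature ideas discussed above (an oxygen delta, temperature as deviation from the mean, month as a categorical variable, and an optional binned yield) could be sketched in base R roughly as follows. The data here is synthetic and every column name (`do_am`, `do_pm`, `temp_c`, `g_per_m2`) is a hypothetical placeholder, since the real pond quality columns are not yet known. Binning by quintiles is one possible choice, not a decision made in this proposal.

```r
set.seed(1)
n <- 24

# Synthetic stand-in for the pond data; names and ranges are assumptions.
pond <- data.frame(
  month    = rep(1:12, times = 2),
  do_am    = runif(n, 3, 6),      # first oxygen reading of the day
  do_pm    = runif(n, 6, 10),     # second oxygen reading of the day
  temp_c   = rnorm(n, 27, 1.5),   # water temperature
  g_per_m2 = runif(n, 0.5, 5)     # response: egg yield per unit area
)

# Difference between the two oxygen readings.
pond$do_delta <- pond$do_pm - pond$do_am

# Temperature as deviation from the overall mean.
pond$temp_dev <- pond$temp_c - mean(pond$temp_c)

# Month as a factor so models treat it as categorical, not numeric.
pond$month_f <- factor(pond$month, levels = 1:12)

# Dummy coding: an intercept column plus 11 month indicators.
month_dummies <- model.matrix(~ month_f, data = pond)

# Optional: bin the continuous yield into five classes by quintile.
breaks <- quantile(pond$g_per_m2, probs = seq(0, 1, by = 0.2))
pond$yield_class <- cut(
  pond$g_per_m2,
  breaks = breaks,
  labels = c("very low", "low", "average", "high", "very high"),
  include.lowest = TRUE
)

table(pond$yield_class)
```

Quintile bins give roughly equal class sizes; fixed thresholds agreed with the farm might be easier to interpret if the categorical version is used.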
The goal of the project is to determine which environmental factors, if any, contribute to egg yield.
If certain factors contribute positively to egg yield, extra care can be taken to ensure that those conditions are met during the fish-raising process.
Without access to the data, it is hard to estimate accurately what preparation will be necessary. Here are some guesses: pond name and culling code are likely not important variables.
Data will be split randomly into training and testing sets. We suggest an 80/20 training/testing split.

#### Packages

* dplyr
* caret
* randomForest
* party or partykit (for `ctree`)
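
A minimal sketch of the random 80/20 split described above, using only base R so that it runs without extra packages. The data is synthetic, the column names are hypothetical, and the linear model is only a placeholder to show how the held-out set would be used; it is not the model this proposal commits to.

```r
set.seed(42)
n <- 100

# Synthetic stand-in for the engineered pond data; names are assumptions.
pond <- data.frame(
  temp_dev = rnorm(n),
  do_delta = rnorm(n, mean = 3, sd = 1)
)
pond$g_per_m2 <- 2 + 0.3 * pond$do_delta + rnorm(n, sd = 0.5)

# Randomly pick 80% of the row indices for training.
train_idx <- sample(seq_len(n), size = floor(0.8 * n))
train <- pond[train_idx, ]
test  <- pond[-train_idx, ]

# Placeholder model fitted on the training rows only.
fit <- lm(g_per_m2 ~ temp_dev + do_delta, data = train)

# Evaluate on the held-out rows with root mean squared error.
pred <- predict(fit, newdata = test)
rmse <- sqrt(mean((test$g_per_m2 - pred)^2))
rmse
```

Fixing the seed keeps the split reproducible across runs. The same `train` and `test` data frames could later be passed to a random forest or conditional inference tree from the packages listed above.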
Concerns: