Victory Farms - Egg Yield vs Pond Quality

DSB Group

Evan Peters / Hao Lu / Yang Sun

URL: ## Business Problem Victory Farms raises fish. Optimizing egg yields is of utmost importance. It is likely that pond water quality plays a role in egg yields, but the exact relationship is unknown. We intend to learn more about that potential relationship.

Data

There are two data sets, one that tracks egg yield, and one that tracks pond water quality.

Clean data to factors, dates, and anything else. (showing a small snippet of trying to normalize the dates from character)

sapply(pond,class)
##              Timestamp                   Date                   Pond 
##            "character"            "character"            "character" 
##      Temperature..7am. Dissolved.Oxygen..7am.      Temperature..3pm. 
##              "numeric"              "numeric"              "numeric" 
## Dissolved.Oxygen..3pm.          Ammonia..mg.L          Nitrite..mg.L 
##            "character"              "numeric"              "numeric" 
##          Nitrate..mg.L        Phosphate..mg.L                pH..AM. 
##              "numeric"              "numeric"              "numeric" 
## Total.Alkalinity..mg.L        Visibility..cm.     Un.ionised.Ammonia 
##              "numeric"              "numeric"              "integer" 
##                pH..PM. 
##              "numeric"
cols <- c("Pond")
pond[cols] <- lapply(pond[cols],as.factor)
# change columns for pond TO DATE
# current text fails, so use the Timestamp column, split and then convert to date
# seems weird, but i tried it several times and this is the only thing that works
pond <- pond %>%
  extract(col="Timestamp", into = c("NewDate","Hours"), regex="^(\\S+)\\s+(.*)")
pond$NewDate <- as.Date(pond$NewDate,"%m/%d/%Y")
pond$Dissolved.Oxygen..3pm.[6779] <- 14.48
pond$Dissolved.Oxygen..3pm.[9920] <- 11.31
cols <- c("Dissolved.Oxygen..3pm.")
pond[cols] <- lapply(pond[cols],as.numeric)
## Warning in lapply(pond[cols], as.numeric): NAs introduced by coercion
# change columns for yield TO FACTOR
cols <- c("Batch.Code","Pond","Pond.Line","Week","Month","Culling.Code","Pond.1")
yield[cols] <- lapply(yield[cols],factor)

# change columns for yield TO DATE
cols <- c("Egg.Collection.Date")
yield[cols] <- lapply(yield[cols],as.Date,"%m/%d/%y")
pond %>%
  summarise_all(funs(sum(is.na(.)))) / nrow(pond)
## Warning: `funs()` was deprecated in dplyr 0.8.0.
## Please use a list of either functions or lambdas: 
## 
##   # Simple named list: 
##   list(mean = mean, median = median)
## 
##   # Auto named with `tibble::lst()`: 
##   tibble::lst(mean, median)
## 
##   # Using lambdas
##   list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.
##   NewDate Hours         Date Pond Temperature..7am. Dissolved.Oxygen..7am.
## 1       0     0 0.0001369207    0        0.02512494             0.01560895
##   Temperature..3pm. Dissolved.Oxygen..3pm. Ammonia..mg.L Nitrite..mg.L
## 1         0.3126583              0.3160813     0.7163004     0.7278702
##   Nitrate..mg.L Phosphate..mg.L   pH..AM. Total.Alkalinity..mg.L
## 1     0.8336414       0.7755186 0.4508113              0.9970562
##   Visibility..cm. Un.ionised.Ammonia  pH..PM.
## 1       0.8834121          0.9994523 0.559937
# create accurate date data, normalizing to the min date
# minimum is "2019-02-19"
mindate <- min(pond$NewDate)
# create time difference and NewWeek relative to minimum date
pond$TimeDiff <- as.integer(pond$NewDate - mindate)
pond$NewWeek <- floor(pond$TimeDiff/7)

yield$TimeDiff <- as.integer(yield$Egg.Collection.Date - mindate)
yield$NewWeek <- floor(yield$TimeDiff/7)

Also, each pond has many sub-ponds called “Hapas”. We need to rollup the yields from the Hapas into one target, called AvgYield. AvgYield is the average g/m2 from a given pond in a given egg collection day

# Some aggregated fields on yield
yield <- yield %>% group_by(Egg.Collection.Date, Pond) %>% 
  mutate(TotalGramsDate = sum(Grams.Collected)) 
yield <- yield %>% group_by(Week, Pond) %>% 
  mutate(TotalGramsWeek = sum(Grams.Collected)) 
yield <- yield %>% group_by(Egg.Collection.Date, Pond) %>% 
  mutate(AvgYield = mean(g.m2)) %>%ungroup()
# # the yield data should be rolled up on day and pond and week and month
yield <- yield %>% group_by(Pond,NewWeek) %>% 
   mutate(AvgWeekYield = mean(g.m2,na.rm=T)) 
yield <- yield %>% group_by(Pond,Month) %>% 
  mutate(AvgMonthYield = mean(g.m2,na.rm=T)) 

Data Summary

Pond quality data is sparse First, the number of distinct days of egg collection against the number of distinct pond quality days in the data

length(unique(yield$Egg.Collection.Date))
## [1] 305
sum(unique(pond$NewDate) %in% unique(yield$Egg.Collection.Date))
## [1] 173
sum(unique(pond$NewDate) %in% unique(yield$Egg.Collection.Date))/length(unique(yield$Egg.Collection.Date))
## [1] 0.5672131

And the relative sparsity of other data

nrow(pond)
## [1] 14607
pond %>%
  summarise_all(funs(sum(is.na(.)))) / nrow(pond)
##   NewDate Hours         Date Pond Temperature..7am. Dissolved.Oxygen..7am.
## 1       0     0 0.0001369207    0        0.02512494             0.01560895
##   Temperature..3pm. Dissolved.Oxygen..3pm. Ammonia..mg.L Nitrite..mg.L
## 1         0.3126583              0.3160813     0.7163004     0.7278702
##   Nitrate..mg.L Phosphate..mg.L   pH..AM. Total.Alkalinity..mg.L
## 1     0.8336414       0.7755186 0.4508113              0.9970562
##   Visibility..cm. Un.ionised.Ammonia  pH..PM. TimeDiff NewWeek
## 1       0.8834121          0.9994523 0.559937        0       0

Sparisty and data collection timing is obviously going to be an issue.

Trying some relationships / simple findings

# summary of ponds
yield %>% group_by(Pond) %>% summarise(n = n(),avg_yield = mean(AvgYield),var=var(AvgYield))
## # A tibble: 25 × 4
##    Pond      n avg_yield   var
##    <fct> <int>     <dbl> <dbl>
##  1 A1       64      3.18 0.603
##  2 A13     344      3.51 2.46 
##  3 A14     360      3.14 1.73 
##  4 A15     349      2.54 1.13 
##  5 A16     366      3.46 2.85 
##  6 A17     132      2.33 1.44 
##  7 A18     272      3.32 2.16 
##  8 A6      176      3.92 1.21 
##  9 A8      240      3.47 1.47 
## 10 A9      336      3.73 1.74 
## # … with 15 more rows
small.model <- lm(AvgYield ~ Pond, data = yield)
summary(small.model)
## 
## Call:
## lm(formula = AvgYield ~ Pond, data = yield)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.3610 -0.9317 -0.1228  0.7618  5.0992 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      3.182813   0.162065  19.639  < 2e-16 ***
## PondA13          0.330378   0.176498   1.872 0.061259 .  
## PondA14         -0.039861   0.175882  -0.227 0.820712    
## PondA15         -0.640549   0.176300  -3.633 0.000281 ***
## PondA16          0.280439   0.175664   1.596 0.110422    
## PondA17         -0.857244   0.197483  -4.341 1.43e-05 ***
## PondA18          0.140717   0.180125   0.781 0.434695    
## PondA6           0.738210   0.189251   3.901 9.66e-05 ***
## PondA8           0.282500   0.182398   1.549 0.121462    
## PondA9           0.543155   0.176827   3.072 0.002135 ** 
## PondAlex's Pond  1.913322   0.178310  10.730  < 2e-16 ***
## PondB1           0.034854   0.168838   0.206 0.836455    
## PondB10         -0.513422   0.171451  -2.995 0.002755 ** 
## PondB11          0.375028   0.173452   2.162 0.030632 *  
## PondB2          -1.453300   0.173303  -8.386  < 2e-16 ***
## PondB3          -0.034200   0.175083  -0.195 0.845136    
## PondB4           0.692480   0.169478   4.086 4.43e-05 ***
## PondB5          -0.020494   0.171203  -0.120 0.904720    
## PondB6           0.009772   0.172047   0.057 0.954708    
## PondB7           0.099987   0.170700   0.586 0.558061    
## PondB8          -0.233454   0.171642  -1.360 0.173823    
## PondB9          -0.694608   0.171537  -4.049 5.18e-05 ***
## PondMB12        -0.793229   0.232983  -3.405 0.000665 ***
## PondMB14        -1.239375   0.261322  -4.743 2.14e-06 ***
## PondW5          -0.093152   0.175316  -0.531 0.595199    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.297 on 9333 degrees of freedom
## Multiple R-squared:  0.1828, Adjusted R-squared:  0.1807 
## F-statistic: 86.96 on 24 and 9333 DF,  p-value: < 2.2e-16
# Smaller ponds yield slightly more fish on average
yield %>% group_by(Hapa.Area) %>% summarise(n = n(),avg_yield = mean(AvgYield),var_avg_yield = var(AvgYield))
## # A tibble: 2 × 4
##   Hapa.Area     n avg_yield var_avg_yield
##       <int> <int>     <dbl>         <dbl>
## 1        40  2647      3.31          2.09
## 2        80  6711      3.14          2.03
small.model <- lm(AvgYield ~ Hapa.Area, data = yield)
summary(small.model)
## 
## Call:
## lm(formula = AvgYield ~ Hapa.Area, data = yield)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.1374 -1.0599 -0.1582  0.8482  6.5712 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.4741765  0.0582817  59.610  < 2e-16 ***
## Hapa.Area   -0.0042096  0.0008208  -5.129 2.97e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.43 on 9356 degrees of freedom
## Multiple R-squared:  0.002804,   Adjusted R-squared:  0.002697 
## F-statistic: 26.31 on 1 and 9356 DF,  p-value: 2.973e-07
# time vs yield: some weeks are better than others
yield %>% group_by(NewWeek) %>% summarise(n = n(),avg_yield = mean(AvgWeekYield),var_avg_yield = var(AvgWeekYield))
## # A tibble: 59 × 4
##    NewWeek     n avg_yield var_avg_yield
##      <dbl> <int>     <dbl>         <dbl>
##  1     101    40      2.73         0.578
##  2     102   179      4.23         3.62 
##  3     103   175      3.92         3.72 
##  4     104   175      2.87         2.70 
##  5     105   183      3.36         1.48 
##  6     106   171      2.71         1.19 
##  7     107   186      2.76         1.53 
##  8     108   201      1.87         0.913
##  9     109   199      2.18         1.02 
## 10     110   189      3.08         1.87 
## # … with 49 more rows
small.model <- lm(AvgWeekYield ~ NewWeek, data = yield)
summary(small.model)
## 
## Call:
## lm(formula = AvgWeekYield ~ NewWeek, data = yield)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.0060 -1.0796 -0.1722  0.8689  6.5579 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2.564863   0.130396  19.670  < 2e-16 ***
## NewWeek     0.004923   0.001028   4.787 1.72e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.43 on 9356 degrees of freedom
## Multiple R-squared:  0.002443,   Adjusted R-squared:  0.002337 
## F-statistic: 22.91 on 1 and 9356 DF,  p-value: 1.72e-06
# we see while theres plenty of significance between week and yield, it explains none of the variance
plot(yield$NewWeek,yield$AvgWeekYield)

# time vs yield: some months are better than others
yield %>% group_by(Month) %>% summarise(n = n(),avg_yield = mean(AvgMonthYield),var_avg_yield = var(AvgMonthYield))
## # A tibble: 12 × 4
##    Month     n avg_yield var_avg_yield
##    <fct> <int>     <dbl>         <dbl>
##  1 Apr     837      2.85         1.70 
##  2 Aug     838      3.17         0.715
##  3 Dec     709      3.20         0.491
##  4 Feb     720      3.60         1.57 
##  5 Jan     507      2.20         0.352
##  6 Jul     854      3.66         1.61 
##  7 Jun     936      2.99         1.66 
##  8 Mar     849      2.41         0.399
##  9 May     865      3.17         2.15 
## 10 Nov     721      3.90         0.618
## 11 Oct     745      3.58         0.971
## 12 Sep     777      3.33         0.747
small.model <- lm(AvgMonthYield ~ Month, data = yield)
summary(small.model)
## 
## Call:
## lm(formula = AvgMonthYield ~ Month, data = yield)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.5813 -0.6817 -0.0318  0.7091  3.8836 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.85010    0.03673  77.597  < 2e-16 ***
## MonthAug     0.32151    0.05193   6.191 6.21e-10 ***
## MonthDec     0.34613    0.05424   6.382 1.83e-10 ***
## MonthFeb     0.75374    0.05401  13.955  < 2e-16 ***
## MonthJan    -0.64583    0.05980 -10.800  < 2e-16 ***
## MonthJul     0.80617    0.05168  15.598  < 2e-16 ***
## MonthJun     0.14487    0.05055   2.866  0.00417 ** 
## MonthMar    -0.43983    0.05176  -8.498  < 2e-16 ***
## MonthMay     0.31994    0.05152   6.210 5.53e-10 ***
## MonthNov     1.04855    0.05399  19.420  < 2e-16 ***
## MonthOct     0.73252    0.05352  13.686  < 2e-16 ***
## MonthSep     0.48279    0.05294   9.120  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.063 on 9346 degrees of freedom
## Multiple R-squared:  0.1555, Adjusted R-squared:  0.1545 
## F-statistic: 156.5 on 11 and 9346 DF,  p-value: < 2.2e-16

This was all without joining pond quality data. Lets join some data. Remember we only have a fraction of the data for egg collection dates…

# join on exact day, INNER JOIN to ignore days w.o pond data w. rolledup data for dayjoin
dayjoin <- inner_join(yield,
                      pond,
                      by = c("Egg.Collection.Date"="NewDate","Pond"="Pond"))

Split train/test.

# create train/test

trainIndex <- createDataPartition(dayjoin$AvgYield, p = .8,list=F)
dayjoin_train <- dayjoin[ trainIndex, ]
dayjoin_test  <- dayjoin[-trainIndex, ]

Train model and test. No algorithm was able to run due to sparsity constraints. Most common errors were “all rows removed due to NA” Or “contrasts” error which is related to lack of factors.

Conclusions

Relationships exist between ponds