DSB Group
Evan Peters / Hao Lu / Yang Sun
URL: ## Business Problem Victory Farms raises fish. Optimizing egg yields is of utmost importance. It is likely that pond water quality plays a role in egg yields, but the exact relationship is unknown. We intend to learn more about that potential relationship.
There are two data sets, one that tracks egg yield, and one that tracks pond water quality.
Clean data to factors, dates, and anything else. (showing a small snippet of trying to normalize the dates from character)
sapply(pond,class)
## Timestamp Date Pond
## "character" "character" "character"
## Temperature..7am. Dissolved.Oxygen..7am. Temperature..3pm.
## "numeric" "numeric" "numeric"
## Dissolved.Oxygen..3pm. Ammonia..mg.L Nitrite..mg.L
## "character" "numeric" "numeric"
## Nitrate..mg.L Phosphate..mg.L pH..AM.
## "numeric" "numeric" "numeric"
## Total.Alkalinity..mg.L Visibility..cm. Un.ionised.Ammonia
## "numeric" "numeric" "integer"
## pH..PM.
## "numeric"
cols <- c("Pond")
pond[cols] <- lapply(pond[cols],as.factor)
# change columns for pond TO DATE
# current text fails, so use the Timestamp column, split and then convert to date
# seems weird, but i tried it several times and this is the only thing that works
pond <- pond %>%
extract(col="Timestamp", into = c("NewDate","Hours"), regex="^(\\S+)\\s+(.*)")
pond$NewDate <- as.Date(pond$NewDate,"%m/%d/%Y")
pond$Dissolved.Oxygen..3pm.[6779] <- 14.48
pond$Dissolved.Oxygen..3pm.[9920] <- 11.31
cols <- c("Dissolved.Oxygen..3pm.")
pond[cols] <- lapply(pond[cols],as.numeric)
## Warning in lapply(pond[cols], as.numeric): NAs introduced by coercion
# change columns for yield TO FACTOR
cols <- c("Batch.Code","Pond","Pond.Line","Week","Month","Culling.Code","Pond.1")
yield[cols] <- lapply(yield[cols],factor)
# change columns for yield TO DATE
cols <- c("Egg.Collection.Date")
yield[cols] <- lapply(yield[cols],as.Date,"%m/%d/%y")
pond %>%
summarise_all(funs(sum(is.na(.)))) / nrow(pond)
## Warning: `funs()` was deprecated in dplyr 0.8.0.
## Please use a list of either functions or lambdas:
##
## # Simple named list:
## list(mean = mean, median = median)
##
## # Auto named with `tibble::lst()`:
## tibble::lst(mean, median)
##
## # Using lambdas
## list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.
## NewDate Hours Date Pond Temperature..7am. Dissolved.Oxygen..7am.
## 1 0 0 0.0001369207 0 0.02512494 0.01560895
## Temperature..3pm. Dissolved.Oxygen..3pm. Ammonia..mg.L Nitrite..mg.L
## 1 0.3126583 0.3160813 0.7163004 0.7278702
## Nitrate..mg.L Phosphate..mg.L pH..AM. Total.Alkalinity..mg.L
## 1 0.8336414 0.7755186 0.4508113 0.9970562
## Visibility..cm. Un.ionised.Ammonia pH..PM.
## 1 0.8834121 0.9994523 0.559937
# create accurate date data, normalizing to the min date
# minimum is "2019-02-19"
mindate <- min(pond$NewDate)
# create time difference and NewWeek relative to minimum date
pond$TimeDiff <- as.integer(pond$NewDate - mindate)
pond$NewWeek <- floor(pond$TimeDiff/7)
yield$TimeDiff <- as.integer(yield$Egg.Collection.Date - mindate)
yield$NewWeek <- floor(yield$TimeDiff/7)
Also, each pond has many sub-ponds called “Hapas”. We need to rollup the yields from the Hapas into one target, called AvgYield. AvgYield is the average g/m2 from a given pond in a given egg collection day
# Some aggregated fields on yield
yield <- yield %>% group_by(Egg.Collection.Date, Pond) %>%
mutate(TotalGramsDate = sum(Grams.Collected))
yield <- yield %>% group_by(Week, Pond) %>%
mutate(TotalGramsWeek = sum(Grams.Collected))
yield <- yield %>% group_by(Egg.Collection.Date, Pond) %>%
mutate(AvgYield = mean(g.m2)) %>%ungroup()
# # the yield data should be rolled up on day and pond and week and month
yield <- yield %>% group_by(Pond,NewWeek) %>%
mutate(AvgWeekYield = mean(g.m2,na.rm=T))
yield <- yield %>% group_by(Pond,Month) %>%
mutate(AvgMonthYield = mean(g.m2,na.rm=T))
Pond quality data is sparse First, the number of distinct days of egg collection against the number of distinct pond quality days in the data
length(unique(yield$Egg.Collection.Date))
## [1] 305
sum(unique(pond$NewDate) %in% unique(yield$Egg.Collection.Date))
## [1] 173
sum(unique(pond$NewDate) %in% unique(yield$Egg.Collection.Date))/length(unique(yield$Egg.Collection.Date))
## [1] 0.5672131
And the relative sparsity of other data
nrow(pond)
## [1] 14607
pond %>%
summarise_all(funs(sum(is.na(.)))) / nrow(pond)
## NewDate Hours Date Pond Temperature..7am. Dissolved.Oxygen..7am.
## 1 0 0 0.0001369207 0 0.02512494 0.01560895
## Temperature..3pm. Dissolved.Oxygen..3pm. Ammonia..mg.L Nitrite..mg.L
## 1 0.3126583 0.3160813 0.7163004 0.7278702
## Nitrate..mg.L Phosphate..mg.L pH..AM. Total.Alkalinity..mg.L
## 1 0.8336414 0.7755186 0.4508113 0.9970562
## Visibility..cm. Un.ionised.Ammonia pH..PM. TimeDiff NewWeek
## 1 0.8834121 0.9994523 0.559937 0 0
Sparisty and data collection timing is obviously going to be an issue.
# summary of ponds
yield %>% group_by(Pond) %>% summarise(n = n(),avg_yield = mean(AvgYield),var=var(AvgYield))
## # A tibble: 25 × 4
## Pond n avg_yield var
## <fct> <int> <dbl> <dbl>
## 1 A1 64 3.18 0.603
## 2 A13 344 3.51 2.46
## 3 A14 360 3.14 1.73
## 4 A15 349 2.54 1.13
## 5 A16 366 3.46 2.85
## 6 A17 132 2.33 1.44
## 7 A18 272 3.32 2.16
## 8 A6 176 3.92 1.21
## 9 A8 240 3.47 1.47
## 10 A9 336 3.73 1.74
## # … with 15 more rows
small.model <- lm(AvgYield ~ Pond, data = yield)
summary(small.model)
##
## Call:
## lm(formula = AvgYield ~ Pond, data = yield)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.3610 -0.9317 -0.1228 0.7618 5.0992
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.182813 0.162065 19.639 < 2e-16 ***
## PondA13 0.330378 0.176498 1.872 0.061259 .
## PondA14 -0.039861 0.175882 -0.227 0.820712
## PondA15 -0.640549 0.176300 -3.633 0.000281 ***
## PondA16 0.280439 0.175664 1.596 0.110422
## PondA17 -0.857244 0.197483 -4.341 1.43e-05 ***
## PondA18 0.140717 0.180125 0.781 0.434695
## PondA6 0.738210 0.189251 3.901 9.66e-05 ***
## PondA8 0.282500 0.182398 1.549 0.121462
## PondA9 0.543155 0.176827 3.072 0.002135 **
## PondAlex's Pond 1.913322 0.178310 10.730 < 2e-16 ***
## PondB1 0.034854 0.168838 0.206 0.836455
## PondB10 -0.513422 0.171451 -2.995 0.002755 **
## PondB11 0.375028 0.173452 2.162 0.030632 *
## PondB2 -1.453300 0.173303 -8.386 < 2e-16 ***
## PondB3 -0.034200 0.175083 -0.195 0.845136
## PondB4 0.692480 0.169478 4.086 4.43e-05 ***
## PondB5 -0.020494 0.171203 -0.120 0.904720
## PondB6 0.009772 0.172047 0.057 0.954708
## PondB7 0.099987 0.170700 0.586 0.558061
## PondB8 -0.233454 0.171642 -1.360 0.173823
## PondB9 -0.694608 0.171537 -4.049 5.18e-05 ***
## PondMB12 -0.793229 0.232983 -3.405 0.000665 ***
## PondMB14 -1.239375 0.261322 -4.743 2.14e-06 ***
## PondW5 -0.093152 0.175316 -0.531 0.595199
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.297 on 9333 degrees of freedom
## Multiple R-squared: 0.1828, Adjusted R-squared: 0.1807
## F-statistic: 86.96 on 24 and 9333 DF, p-value: < 2.2e-16
# Smaller ponds yield slightly more fish on average
yield %>% group_by(Hapa.Area) %>% summarise(n = n(),avg_yield = mean(AvgYield),var_avg_yield = var(AvgYield))
## # A tibble: 2 × 4
## Hapa.Area n avg_yield var_avg_yield
## <int> <int> <dbl> <dbl>
## 1 40 2647 3.31 2.09
## 2 80 6711 3.14 2.03
small.model <- lm(AvgYield ~ Hapa.Area, data = yield)
summary(small.model)
##
## Call:
## lm(formula = AvgYield ~ Hapa.Area, data = yield)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.1374 -1.0599 -0.1582 0.8482 6.5712
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.4741765 0.0582817 59.610 < 2e-16 ***
## Hapa.Area -0.0042096 0.0008208 -5.129 2.97e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.43 on 9356 degrees of freedom
## Multiple R-squared: 0.002804, Adjusted R-squared: 0.002697
## F-statistic: 26.31 on 1 and 9356 DF, p-value: 2.973e-07
# time vs yield: some weeks are better than others
yield %>% group_by(NewWeek) %>% summarise(n = n(),avg_yield = mean(AvgWeekYield),var_avg_yield = var(AvgWeekYield))
## # A tibble: 59 × 4
## NewWeek n avg_yield var_avg_yield
## <dbl> <int> <dbl> <dbl>
## 1 101 40 2.73 0.578
## 2 102 179 4.23 3.62
## 3 103 175 3.92 3.72
## 4 104 175 2.87 2.70
## 5 105 183 3.36 1.48
## 6 106 171 2.71 1.19
## 7 107 186 2.76 1.53
## 8 108 201 1.87 0.913
## 9 109 199 2.18 1.02
## 10 110 189 3.08 1.87
## # … with 49 more rows
small.model <- lm(AvgWeekYield ~ NewWeek, data = yield)
summary(small.model)
##
## Call:
## lm(formula = AvgWeekYield ~ NewWeek, data = yield)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.0060 -1.0796 -0.1722 0.8689 6.5579
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.564863 0.130396 19.670 < 2e-16 ***
## NewWeek 0.004923 0.001028 4.787 1.72e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.43 on 9356 degrees of freedom
## Multiple R-squared: 0.002443, Adjusted R-squared: 0.002337
## F-statistic: 22.91 on 1 and 9356 DF, p-value: 1.72e-06
# we see while theres plenty of significance between week and yield, it explains none of the variance
plot(yield$NewWeek,yield$AvgWeekYield)
# time vs yield: some months are better than others
yield %>% group_by(Month) %>% summarise(n = n(),avg_yield = mean(AvgMonthYield),var_avg_yield = var(AvgMonthYield))
## # A tibble: 12 × 4
## Month n avg_yield var_avg_yield
## <fct> <int> <dbl> <dbl>
## 1 Apr 837 2.85 1.70
## 2 Aug 838 3.17 0.715
## 3 Dec 709 3.20 0.491
## 4 Feb 720 3.60 1.57
## 5 Jan 507 2.20 0.352
## 6 Jul 854 3.66 1.61
## 7 Jun 936 2.99 1.66
## 8 Mar 849 2.41 0.399
## 9 May 865 3.17 2.15
## 10 Nov 721 3.90 0.618
## 11 Oct 745 3.58 0.971
## 12 Sep 777 3.33 0.747
small.model <- lm(AvgMonthYield ~ Month, data = yield)
summary(small.model)
##
## Call:
## lm(formula = AvgMonthYield ~ Month, data = yield)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.5813 -0.6817 -0.0318 0.7091 3.8836
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.85010 0.03673 77.597 < 2e-16 ***
## MonthAug 0.32151 0.05193 6.191 6.21e-10 ***
## MonthDec 0.34613 0.05424 6.382 1.83e-10 ***
## MonthFeb 0.75374 0.05401 13.955 < 2e-16 ***
## MonthJan -0.64583 0.05980 -10.800 < 2e-16 ***
## MonthJul 0.80617 0.05168 15.598 < 2e-16 ***
## MonthJun 0.14487 0.05055 2.866 0.00417 **
## MonthMar -0.43983 0.05176 -8.498 < 2e-16 ***
## MonthMay 0.31994 0.05152 6.210 5.53e-10 ***
## MonthNov 1.04855 0.05399 19.420 < 2e-16 ***
## MonthOct 0.73252 0.05352 13.686 < 2e-16 ***
## MonthSep 0.48279 0.05294 9.120 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.063 on 9346 degrees of freedom
## Multiple R-squared: 0.1555, Adjusted R-squared: 0.1545
## F-statistic: 156.5 on 11 and 9346 DF, p-value: < 2.2e-16
This was all without joining pond quality data. Lets join some data. Remember we only have a fraction of the data for egg collection dates…
# join on exact day, INNER JOIN to ignore days w.o pond data w. rolledup data for dayjoin
dayjoin <- inner_join(yield,
pond,
by = c("Egg.Collection.Date"="NewDate","Pond"="Pond"))
Split train/test.
# create train/test
trainIndex <- createDataPartition(dayjoin$AvgYield, p = .8,list=F)
dayjoin_train <- dayjoin[ trainIndex, ]
dayjoin_test <- dayjoin[-trainIndex, ]
Train model and test. No algorithm was able to run due to sparsity constraints. Most common errors were “all rows removed due to NA” Or “contrasts” error which is related to lack of factors.
Relationships exist between ponds