Abstract

Bees are essential to human food production and to the maintenance of natural ecosystems. This paper proposes a method to predict the health level of honeybee colonies using data from internal and external beehive sensors and from on-site (in loco) inspections by beekeepers. The data set was built by joining inspection records with internal and external sensor measurements on the date of collection. However, frequent inspections are not feasible because they stress the colony, especially in periods such as winter, when the beehive is more sensitive. As a solution, the health status of each beehive was obtained through a partitioning clustering method and then validated against the in loco inspection data already available. We propose a logistic regression model with an elastic net penalty, which combines the lasso (L1) and ridge (L2) penalties. Compared with ordinary logistic regression, the resulting model is more flexible and robust, and it serves as a diagnostic tool that can avoid unnecessary inspections and, consequently, reduce stress on the beehives.
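
For reference, the elastic net penalized logistic regression named in the abstract is commonly written as below. This is a sketch in the parametrization used by the glmnet package, which is an assumption: the text does not state which parametrization or software the authors used. Here y_i in {0, 1} is the health label, x_i the vector of predictors, lambda >= 0 the overall penalty strength, and alpha in [0, 1] the mixing weight (alpha = 1 gives the lasso, alpha = 0 gives ridge).

```latex
\min_{\beta_0,\,\beta}\;
-\frac{1}{n}\sum_{i=1}^{n}
\Big[\, y_i\,(\beta_0 + x_i^{\top}\beta)
      - \log\!\big(1 + e^{\beta_0 + x_i^{\top}\beta}\big) \Big]
\;+\; \lambda\Big[\, \alpha\,\lVert\beta\rVert_1
      + \tfrac{1-\alpha}{2}\,\lVert\beta\rVert_2^{2} \Big]
```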

Packages

Preprocessing

Description of dataset

The data set was obtained by merging the internal sensor, external sensor, and inspection data using an algorithm written in Python in Google Colab. It was then preprocessed (cleaning, consistency checks, imputation, …) in R.
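
The merge by collection date can be sketched as follows. This is a minimal illustration, not the authors' algorithm: the records, field names, and the `merge_by_date` helper are all hypothetical, and the real data would also need a hive identifier and a finer time key. It shows why inspection variables end up with many missing values: dates without an inspection simply have no inspection fields.

```python
from collections import defaultdict
from datetime import date

# Hypothetical toy records; field names are illustrative, not the authors' schema.
internal = [
    {"date": date(2020, 1, 1), "hive_temp": 20.0, "hive_humidity": 60.0},
    {"date": date(2020, 1, 2), "hive_temp": 25.0, "hive_humidity": 65.0},
]
external = [
    {"date": date(2020, 1, 1), "ext_temperature": 5.0},
]
inspections = [
    {"date": date(2020, 1, 2), "queen": 1, "brood": 1},
]


def merge_by_date(*sources):
    """Outer-join lists of records on their 'date' key.

    A field that no source provides for a given date is absent from the
    merged record, which corresponds to a missing value (NA).
    """
    merged = defaultdict(dict)
    for source in sources:
        for record in source:
            merged[record["date"]].update(record)
    # Return one merged record per date, in chronological order.
    return [merged[d] for d in sorted(merged)]


rows = merge_by_date(internal, external, inspections)
for row in rows:
    print(row)
```

An outer join is assumed here because the summary of the data set reports a large number of missing inspection values, which is what one would expect if sensor rows without a matching inspection were kept.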

Some information about the data set:

Shift          Count      %
dia (day)      18107   50.5
noite (night)  17744   49.5
##   TurnDay        Brood_Temp     Brood_Humidity    Hive_Temp     
##  dia  :18107   Min.   :-3.467   Min.   :22.00   Min.   :-5.744  
##  noite:17744   1st Qu.:22.133   1st Qu.:62.00   1st Qu.:21.961  
##  NA's :    4   Median :30.072   Median :67.00   Median :28.539  
##                Mean   :27.198   Mean   :66.21   Mean   :26.440  
##                3rd Qu.:33.528   3rd Qu.:71.00   3rd Qu.:33.144  
##                Max.   :39.950   Max.   :89.00   Max.   :39.928  
##                NA's   :12                       NA's   :12      
##  Hive_Humidity       Weight        Ext_Temperature     DewPoint      
##  Min.   :19.00   Min.   :  1.034   Min.   :-10.00   Min.   :-10.000  
##  1st Qu.:60.00   1st Qu.: 23.179   1st Qu.:  2.50   1st Qu.:  0.600  
##  Median :66.00   Median : 28.191   Median : 12.20   Median :  2.330  
##  Mean   :65.51   Mean   : 28.122   Mean   : 13.19   Mean   :  5.733  
##  3rd Qu.:72.00   3rd Qu.: 31.856   3rd Qu.: 22.80   3rd Qu.: 12.200  
##  Max.   :93.00   Max.   :129.936   Max.   : 36.00   Max.   : 20.000  
##                  NA's   :3588      NA's   :6945     NA's   :14397    
##  WindDirection     WindSpeed         Brood            Bees      
##  Min.   :  0.0   Min.   : 0.00   Min.   :0.000   Min.   :0.000  
##  1st Qu.:  0.0   1st Qu.: 0.00   1st Qu.:1.000   1st Qu.:1.000  
##  Median : 70.0   Median :15.00   Median :1.000   Median :1.000  
##  Mean   :114.4   Mean   :17.23   Mean   :0.852   Mean   :0.926  
##  3rd Qu.:220.0   3rd Qu.:31.00   3rd Qu.:1.000   3rd Qu.:1.000  
##  Max.   :360.0   Max.   :99.00   Max.   :1.000   Max.   :1.000  
##  NA's   :8295    NA's   :6861    NA's   :18420   NA's   :18420  
##      Queen            Food         Stressors         Space      
##  Min.   :0.000   Min.   :0.000   Min.   :0.000   Min.   :0.000  
##  1st Qu.:1.000   1st Qu.:1.000   1st Qu.:0.000   1st Qu.:0.000  
##  Median :1.000   Median :1.000   Median :0.000   Median :1.000  
##  Mean   :0.912   Mean   :0.958   Mean   :0.441   Mean   :0.724  
##  3rd Qu.:1.000   3rd Qu.:1.000   3rd Qu.:1.000   3rd Qu.:1.000  
##  Max.   :1.000   Max.   :1.000   Max.   :1.000   Max.   :1.000  
##  NA's   :18420   NA's   :18420   NA's   :18420   NA's   :18420

PCA

## [1] 17298
##  Bartlett's Test of Sphericity
## 
## Call: bart_spher(x = dataset_pca)
## 
##      X2 = 74462.215
##      df = 36
## p-value < 2.22e-16
## Importance of components:
##                           PC1    PC2    PC3    PC4     PC5     PC6     PC7
## Standard deviation     1.7133 1.4186 1.1962 0.9652 0.74898 0.71839 0.59391
## Proportion of Variance 0.3261 0.2236 0.1590 0.1035 0.06233 0.05734 0.03919
## Cumulative Proportion  0.3261 0.5498 0.7087 0.8122 0.87457 0.93192 0.97111
##                           PC8     PC9
## Standard deviation     0.3759 0.34461
## Proportion of Variance 0.0157 0.01319
## Cumulative Proportion  0.9868 1.00000
## [1] 2.9353069 2.0125505 1.4307853 0.9315423 0.5609767 0.5160843 0.3527307
## [8] 0.1412693 0.1187540

##      
## Pareto chart analysis for summary(pca)$importance[2, ]
##       Frequency Cum.Freq. Percentage Cum.Percent.
##   PC1   0.32615   0.32615   32.61500     32.61500
##   PC2   0.22362   0.54977   22.36200     54.97700
##   PC3   0.15898   0.70875   15.89800     70.87500
##   PC4   0.10350   0.81225   10.35000     81.22500
##   PC5   0.06233   0.87458    6.23300     87.45800
##   PC6   0.05734   0.93192    5.73400     93.19200
##   PC7   0.03919   0.97111    3.91900     97.11100
##   PC8   0.01570   0.98681    1.57000     98.68100
##   PC9   0.01319   1.00000    1.31900    100.00000
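
The quantities in the PCA output above are related by simple arithmetic, sketched here in plain Python. The standard deviations are copied from the component summary; everything else is recomputed. Two points are interpretation rather than statements from the source: that the nine eigenvalues sum to roughly 9 suggests the PCA was run on standardized variables, and the degrees of freedom of Bartlett's test are taken to follow the usual p(p - 1)/2 formula for p variables.

```python
# Standard deviations of the nine principal components, copied from the summary.
sdev = [1.7133, 1.4186, 1.1962, 0.9652, 0.74898,
        0.71839, 0.59391, 0.3759, 0.34461]

p = len(sdev)

# Bartlett's test of sphericity has p(p - 1)/2 degrees of freedom.
bartlett_df = p * (p - 1) // 2

# Each eigenvalue is the squared standard deviation of its component.
eigenvalues = [s ** 2 for s in sdev]
total_variance = sum(eigenvalues)

# Proportion of variance and its running (cumulative) sum.
proportion = [ev / total_variance for ev in eigenvalues]
cumulative = []
running = 0.0
for prop in proportion:
    running += prop
    cumulative.append(running)

print("Bartlett df:", bartlett_df)
print("Total variance:", round(total_variance, 3))
for i in range(p):
    print(f"PC{i + 1}: eigenvalue={eigenvalues[i]:.4f} "
          f"proportion={proportion[i]:.4f} cumulative={cumulative[i]:.4f}")
```

Small differences from the printed eigenvalues are expected because the standard deviations above are rounded.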

High-dimensional data projected onto two dimensions with t-distributed Stochastic Neighbor Embedding (t-SNE)