Abstract
Bees are essential to human food production and to the maintenance of natural ecosystems. This paper presents a proposal to predict the health level of honeybee colonies using data from internal and external beehive sensors and from in loco inspections by beekeepers. The data set was built by joining inspection records with internal and external sensor measurements on the date of collection. However, frequent inspections are not feasible because they stress the beehive, especially in periods such as winter, when the colony is more sensitive. As a solution, the health status of each beehive was obtained through a partitioning clustering method and then validated against the in loco inspection data already available. We propose a logistic regression model with an elastic net penalty, which combines the lasso (L1) and ridge (L2) penalties. The result is a model that is more flexible and robust than ordinary logistic regression, and a diagnostic tool that can avoid unnecessary inspections and, consequently, reduce stress on the beehives.
Packages
Preprocessing
Description of dataset
The data set was obtained by merging the internal sensor, external sensor, and inspection data with an algorithm written in Python in Google Colab. It was then preprocessed (cleaning, consistency checks, imputation…) in R.
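The merge itself was done in Python and is not shown in this report. As a rough illustration only, the idea of attaching inspection records to sensor readings by collection date can be sketched in R. The data frames, column names, and values below are hypothetical, not the authors' data or code.

```r
# Hypothetical sensor readings (several per day); names are assumptions.
sensors <- data.frame(
  date = as.Date(c("2020-01-01", "2020-01-01", "2020-01-02", "2020-01-03")),
  Brood_Temp = c(30.1, 29.8, 31.0, 28.5)
)

# Hypothetical inspection records (at most one per inspected day).
inspections <- data.frame(
  date = as.Date(c("2020-01-01", "2020-01-03")),
  Queen = c(1, 0)
)

# Left join on the date: every sensor reading is kept, and readings taken
# on a day without an inspection get NA in the inspection columns.
merged <- merge(sensors, inspections, by = "date", all.x = TRUE)

# Rows with an inspection are "labeled"; the rest are "unlabeled".
n_labeled <- sum(!is.na(merged$Queen))
n_unlabeled <- sum(is.na(merged$Queen))
```

A left join of this kind would explain why the inspection variables share a single block of missing values: every reading without a matching inspection date is missing all of them at once.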
Some information about the data set:
- Total number of observations: 35855
- Number of labeled observations (with an associated inspection): 17435
- Number of unlabeled observations (without an associated inspection): 18420
- Variables in study:
  - TurnDay:
    - Type: factor.
    - Description: Period of the day (day or night shift) associated with the time of the measurement.
    - Missing Values: 0.01%.
    - Prevalence:

      Shift | Count | % |
      ---|---|---|
      dia (day) | 18107 | 50.5 |
      noite (night) | 17744 | 49.5 |
  - Brood Temp:
    - Type: numeric.
    - Description: Temperature sensor, in °C, at the center of the hive.
    - Missing Values: 0.03%.
    - Visualization: [figure]
  - Brood Humidity:
    - Type: numeric.
    - Description: Humidity sensor, in …, at the center of the hive.
    - Missing Values: 0%.
    - Visualization: [figure]
  - Hive Temp:
    - Type: numeric.
    - Description: Temperature sensor, in °C, on the inner wall of the hive.
    - Missing Values: 0.03%.
    - Visualization: [figure]
  - Hive Humidity:
    - Type: numeric.
    - Description: Humidity sensor, in …, on the inner wall of the hive.
    - Missing Values: 0%.
    - Visualization: [figure]
  - Weight:
    - Type: numeric.
    - Description: Hive weight sensor.
    - Missing Values: 10.01%.
    - Visualization: [figure]
  - Ext Temperature:
    - Type: numeric.
    - Description: Temperature sensor, in °C, outside the hive.
    - Missing Values: 19.37%.
    - Visualization: [figure]
  - Dew Point:
    - Type: numeric.
    - Description: Dew point.
    - Missing Values: 40.15%.
    - Visualization: [figure]
  - Wind Direction:
    - Type: numeric.
    - Description: Wind direction.
    - Missing Values: 23.13%.
    - Visualization: [figure]
  - Wind Speed:
    - Type: numeric.
    - Description: Wind speed.
    - Missing Values: 19.14%.
    - Visualization: [figure]
  - Brood:
    - Type: factor.
    - Description: All brood stages present in appropriate amounts.
    - Missing Values: 51.37%.
  - Bees:
    - Type: factor.
    - Description: Enough adult bees, with a good age structure, to care for the brood and perform all colony tasks.
    - Missing Values: 51.37%.
  - Queen:
    - Type: factor.
    - Description: A young, productive queen is present.
    - Missing Values: 51.37%.
  - Food:
    - Type: factor.
    - Description: Sufficient water, forage, and stored food available.
    - Missing Values: 51.37%.
  - Stressors:
    - Type: factor.
    - Description: No apparent stressors present that could reduce the colony population and/or affect its growth potential.
    - Missing Values: 51.37%.
  - Space:
    - Type: factor.
    - Description: Adequate space for the expected colony size in the short and medium term that is sanitary, defensible, and has room for eggs.
    - Missing Values: 51.37%.
- Dataset summary:
## TurnDay Brood_Temp Brood_Humidity Hive_Temp
## dia :18107 Min. :-3.467 Min. :22.00 Min. :-5.744
## noite:17744 1st Qu.:22.133 1st Qu.:62.00 1st Qu.:21.961
## NA's : 4 Median :30.072 Median :67.00 Median :28.539
## Mean :27.198 Mean :66.21 Mean :26.440
## 3rd Qu.:33.528 3rd Qu.:71.00 3rd Qu.:33.144
## Max. :39.950 Max. :89.00 Max. :39.928
## NA's :12 NA's :12
## Hive_Humidity Weight Ext_Temperature DewPoint
## Min. :19.00 Min. : 1.034 Min. :-10.00 Min. :-10.000
## 1st Qu.:60.00 1st Qu.: 23.179 1st Qu.: 2.50 1st Qu.: 0.600
## Median :66.00 Median : 28.191 Median : 12.20 Median : 2.330
## Mean :65.51 Mean : 28.122 Mean : 13.19 Mean : 5.733
## 3rd Qu.:72.00 3rd Qu.: 31.856 3rd Qu.: 22.80 3rd Qu.: 12.200
## Max. :93.00 Max. :129.936 Max. : 36.00 Max. : 20.000
## NA's :3588 NA's :6945 NA's :14397
## WindDirection WindSpeed Brood Bees
## Min. : 0.0 Min. : 0.00 Min. :0.000 Min. :0.000
## 1st Qu.: 0.0 1st Qu.: 0.00 1st Qu.:1.000 1st Qu.:1.000
## Median : 70.0 Median :15.00 Median :1.000 Median :1.000
## Mean :114.4 Mean :17.23 Mean :0.852 Mean :0.926
## 3rd Qu.:220.0 3rd Qu.:31.00 3rd Qu.:1.000 3rd Qu.:1.000
## Max. :360.0 Max. :99.00 Max. :1.000 Max. :1.000
## NA's :8295 NA's :6861 NA's :18420 NA's :18420
## Queen Food Stressors Space
## Min. :0.000 Min. :0.000 Min. :0.000 Min. :0.000
## 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:0.000 1st Qu.:0.000
## Median :1.000 Median :1.000 Median :0.000 Median :1.000
## Mean :0.912 Mean :0.958 Mean :0.441 Mean :0.724
## 3rd Qu.:1.000 3rd Qu.:1.000 3rd Qu.:1.000 3rd Qu.:1.000
## Max. :1.000 Max. :1.000 Max. :1.000 Max. :1.000
## NA's :18420 NA's :18420 NA's :18420 NA's :18420
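The missing-value percentages listed for each variable can be cross-checked against the NA counts in this summary. The sketch below assumes the counts and the total are copied by hand from the output above, with each count assigned to its column by matching the listed percentages. The helper `pct_na` and the toy data frame are hypothetical illustrations, not part of the original preprocessing.

```r
# Total number of observations, as reported in the data set description.
n_total <- 35855

# NA counts copied from the summary output (column assignment inferred
# by matching the percentages listed per variable).
na_counts <- c(
  TurnDay = 4, Brood_Temp = 12, Weight = 3588, Ext_Temperature = 6945,
  DewPoint = 14397, WindDirection = 8295, WindSpeed = 6861, Brood = 18420
)

# Percentage of missing values per variable, rounded to two decimals.
pct_missing <- round(100 * na_counts / n_total, 2)

# Labeled observations are those with inspection data present.
n_labeled <- n_total - na_counts[["Brood"]]

# Hypothetical helper: the same percentage computed from a data frame.
pct_na <- function(df) round(100 * colMeans(is.na(df)), 2)
toy <- data.frame(x = c(1, NA, 3, 4), y = c(NA, NA, 1, 2))
toy_pct <- pct_na(toy)
```

With these inputs the percentages agree with the values listed per variable (for example, 10.01% for Weight and 51.37% for the inspection variables), and the labeled count matches the 17435 reported earlier.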
PCA
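set.seed(101)  # fix the random seed for reproducibility
# Keep columns 2 to 10 (the nine numeric sensor variables) and only the rows with no missing values in them
dataset_pca <- dataset_clean[complete.cases(dataset_clean[, 2:10]), 2:10]
nrow(dataset_pca)  # number of complete observations used in the PCA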
## [1] 17298
## Bartlett's Test of Sphericity
##
## Call: bart_spher(x = dataset_pca)
##
## X2 = 74462.215
## df = 36
## p-value < 2.22e-16
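The output above comes from `bart_spher`. To make the statistic concrete, here is a minimal base R sketch of Bartlett's test of sphericity, which tests whether the correlation matrix is the identity. It runs on synthetic data, not the beehive data, and the variable names are illustrative assumptions.

```r
set.seed(101)
n <- 200  # number of observations (synthetic)
p <- 9    # number of variables, matching the nine sensor variables

# Synthetic correlated data: a shared factor plus independent noise.
f <- rnorm(n)
X <- sapply(1:p, function(j) 0.6 * f + rnorm(n))

# Sample correlation matrix.
R <- cor(X)

# Bartlett's statistic: -(n - 1 - (2p + 5) / 6) * log(det(R)),
# compared against a chi-squared distribution with p(p - 1) / 2 df.
chi2 <- -((n - 1) - (2 * p + 5) / 6) * log(det(R))
df <- p * (p - 1) / 2
p_value <- pchisq(chi2, df = df, lower.tail = FALSE)
```

With nine variables the degrees of freedom are 9 × 8 / 2 = 36, as in the reported output. A very small p-value rejects the hypothesis of uncorrelated variables, which supports applying PCA.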
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.7133 1.4186 1.1962 0.9652 0.74898 0.71839 0.59391
## Proportion of Variance 0.3261 0.2236 0.1590 0.1035 0.06233 0.05734 0.03919
## Cumulative Proportion 0.3261 0.5498 0.7087 0.8122 0.87457 0.93192 0.97111
## PC8 PC9
## Standard deviation 0.3759 0.34461
## Proportion of Variance 0.0157 0.01319
## Cumulative Proportion 0.9868 1.00000
## [1] 2.9353069 2.0125505 1.4307853 0.9315423 0.5609767 0.5160843 0.3527307
## [8] 0.1412693 0.1187540
##
## Pareto chart analysis for summary(pca)$importance[2, ]
## Frequency Cum.Freq. Percentage Cum.Percent.
## PC1 0.32615 0.32615 32.61500 32.61500
## PC2 0.22362 0.54977 22.36200 54.97700
## PC3 0.15898 0.70875 15.89800 70.87500
## PC4 0.10350 0.81225 10.35000 81.22500
## PC5 0.06233 0.87458 6.23300 87.45800
## PC6 0.05734 0.93192 5.73400 93.19200
## PC7 0.03919 0.97111 3.91900 97.11100
## PC8 0.01570 0.98681 1.57000 98.68100
## PC9 0.01319 1.00000 1.31900 100.00000
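The eigenvalues and variance proportions above follow directly from the component standard deviations. A short sketch, assuming the standard deviations are copied by hand from the "Importance of components" output, shows the relationship and two common retention rules (eigenvalue greater than 1, and a cumulative variance threshold). The 80% threshold is an illustrative choice, not one stated in the report.

```r
# Standard deviations of the nine components, copied from the output above.
sdev <- c(1.7133, 1.4186, 1.1962, 0.9652, 0.74898,
          0.71839, 0.59391, 0.3759, 0.34461)

# Eigenvalues of the correlation matrix are the squared standard deviations.
eigenvalues <- sdev^2

# Proportion and cumulative proportion of variance explained.
prop_var <- eigenvalues / sum(eigenvalues)
cum_var <- cumsum(prop_var)

# Kaiser criterion: keep components with eigenvalue greater than 1.
n_kaiser <- sum(eigenvalues > 1)

# Smallest number of components reaching 80% cumulative variance
# (the threshold is an assumption for illustration).
n_80 <- which(cum_var >= 0.80)[1]
```

Because the variables are standardized, the eigenvalues sum to about 9, the number of variables. Under these rules, three components have an eigenvalue above 1 and four are needed to pass 80% of the variance.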
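library(factoextra)  # provides fviz_pca_var()
# `a` is the PCA object fitted earlier in the report (its definition is not shown here)
fviz_pca_var(a, col.var = "contrib",  # color variables by their contribution to the components
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             repel = TRUE  # avoid text overlapping
)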