Rows: 163
Columns: 11
$ Day <dbl> 33, 32, 31, 30, 29, 28, 27, 26, 25, 24, 23, 22, 21, 20, 19, …
$ Mortality <dbl> 170, 42, 62, 60, 69, 65, 73, 74, 78, 65, 68, 92, 67, 68, 101…
$ Disease <chr> "Stroke", "Stroke", "Split", "Stroke", "Split", "Stroke", "S…
$ pm25 <dbl> 0.000000, 28.148649, 0.000000, 33.926950, 11.618608, 12.2558…
$ pm10 <dbl> 0.000000, 25.445946, 0.000000, 31.201382, 9.922674, 10.46782…
$ temp <dbl> 0.00000, 27.00270, 0.00000, 26.58406, 25.46414, 25.84336, 25…
$ humidity <dbl> 0.00000, 87.09730, 0.00000, 86.86027, 83.44864, 86.62773, 84…
$ co2 <dbl> 0.0000, 397.2297, 0.0000, 462.0533, 552.4929, 539.3997, 550.…
$ nh3 <dbl> 0.000000, 17.702703, 0.000000, 21.301579, 21.072397, 18.4188…
$ h2s <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ velocity <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
Development of a Machine Learning Model for Classifying Poultry Diseases
Data Gathering
A data set was collected from a poultry house using an IoT-based decision support system. It contains the following variables:
Day of the cycle (typically 1 - 33 per cycle)
Daily Mortality Count
Disease (the disease causing the majority of the mortality: Split or Stroke)
Ammonia (ppm)
Carbon Dioxide (ppm)
Hydrogen Sulfide (ppm)
Particulate Matter 2.5 (ug/m^3)
Particulate Matter 10 (ug/m^3)
Temperature (Celsius)
Humidity (%)
Wind Velocity (m/s)
The Data Set
The data gathered contains 163 rows and 11 columns.
The first 10 rows of the data are shown below.
# A tibble: 10 × 11
Day Mortality Disease pm25 pm10 temp humidity co2 nh3 h2s velocity
<dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 33 170 Stroke 0 0 0 0 0 0 0 0
2 32 42 Stroke 28.1 25.4 27.0 87.1 397. 17.7 0 0
3 31 62 Split 0 0 0 0 0 0 0 0
4 30 60 Stroke 33.9 31.2 26.6 86.9 462. 21.3 0 0
5 29 69 Split 11.6 9.92 25.5 83.4 552. 21.1 0 0
6 28 65 Stroke 12.3 10.5 25.8 86.6 539. 18.4 0 0
7 27 73 Stroke 17.2 15.1 25.8 84.9 550. 17.2 0 0
8 26 74 Split 34.8 30.8 25.8 82.1 561. 16.5 0 0
9 25 78 Split 12.3 10.5 25.6 84.2 555. 14.3 0 0
10 24 65 Stroke 10.4 9.15 25.8 84.1 544. 11.3 0 0
Data Cleaning
Count of NA values.
[1] 44
Some observations (rows) contain NA (missing) values, so they are removed before the modeling process.
Rows: 159
Columns: 11
$ Day <dbl> 33, 32, 31, 30, 29, 28, 27, 26, 25, 24, 23, 22, 21, 20, 19, …
$ Mortality <dbl> 170, 42, 62, 60, 69, 65, 73, 74, 78, 65, 68, 92, 67, 68, 101…
$ Disease <chr> "Stroke", "Stroke", "Split", "Stroke", "Split", "Stroke", "S…
$ pm25 <dbl> 0.000000, 28.148649, 0.000000, 33.926950, 11.618608, 12.2558…
$ pm10 <dbl> 0.000000, 25.445946, 0.000000, 31.201382, 9.922674, 10.46782…
$ temp <dbl> 0.00000, 27.00270, 0.00000, 26.58406, 25.46414, 25.84336, 25…
$ humidity <dbl> 0.00000, 87.09730, 0.00000, 86.86027, 83.44864, 86.62773, 84…
$ co2 <dbl> 0.0000, 397.2297, 0.0000, 462.0533, 552.4929, 539.3997, 550.…
$ nh3 <dbl> 0.000000, 17.702703, 0.000000, 21.301579, 21.072397, 18.4188…
$ h2s <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ velocity <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
After removing NA, there are now 159 observations in the data set for analysis.
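The counting and removal step can be sketched in base R as follows. The `toy` data frame and its values are made up for illustration and are not the poultry-house data. The sketch separates the number of missing cells from the number of affected rows, since the report gives 44 missing values but only 4 dropped rows (163 to 159), which would be consistent with four rows missing in all 11 columns.

```r
# Toy stand-in for the sensor data; the values are illustrative only.
toy <- data.frame(
  Day       = c(1, 2, 3, 4, 5),
  Mortality = c(12, NA, 9, 15, 11),
  Disease   = c("Split", "Stroke", NA, "Stroke", "Split"),
  nh3       = c(17.7, 21.3, 21.1, NA, 14.3),
  stringsAsFactors = FALSE
)

# Total number of missing cells across the whole table.
na_cells <- sum(is.na(toy))

# Number of rows that contain at least one missing value.
na_rows <- sum(!complete.cases(toy))

# Keep only the rows with no missing values.
toy_clean <- toy[complete.cases(toy), ]

print(na_cells)
print(na_rows)
print(nrow(toy_clean))
```

In a tidyverse pipeline, `tidyr::drop_na()` would do the same job as the `complete.cases()` filter.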
Exploratory Data Analysis
Disease Distribution
We plot the distribution of the Disease label, the variable we want to classify. Based on the plot, Split and Stroke occur at similar frequencies, so the classes are roughly balanced and accuracy is a reasonable metric for assessing model performance.
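The balance check can also be made numerically. The sketch below uses a hypothetical label vector, because the report does not give the actual class counts. It computes the class shares and the accuracy of always predicting the most frequent class, which is the baseline any classifier has to beat.

```r
# Hypothetical labels; the real class counts are not given in the report.
disease <- factor(c("Split", "Stroke", "Stroke", "Split", "Stroke",
                    "Split", "Stroke", "Split", "Split", "Stroke"))

# Count and share of each class.
class_counts <- table(disease)
class_share  <- prop.table(class_counts)

# Accuracy obtained by always predicting the most frequent class.
baseline_accuracy <- max(class_share)

print(class_counts)
print(round(class_share, 3))
print(baseline_accuracy)
```

With balanced classes the baseline is close to 0.5, so accuracy values near 0.5 indicate a model that is not learning much.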
Cycle Day versus Mortality
Based on the plot, the daily mortality count appears to be correlated with the cycle day. However, the points do not separate clearly by the disease that caused the mortality.
library(ggplot2)
# Scatter plot of cycle day against daily mortality, colored by disease
mldata |>
  ggplot() +
  geom_point(aes(x = Mortality, y = Day, col = Disease))
Harmful Gases versus Mortality
There is no visible distinction between the diseases that caused the mortality based on ammonia, carbon dioxide, and hydrogen sulfide.
Microclimatic Parameters versus Mortality
Similarly, there is no visual distinction between the diseases that caused the mortality based on the microclimatic parameters temperature, humidity, and wind velocity.
Air Quality Parameters versus Mortality
The same is true for the air quality parameters pm2.5 and pm10.
No single variable, taken on its own, stands out as a potential predictor of the mortality-causing disease. We therefore use a multivariable model to look for patterns across combinations of predictors.
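The per-variable comparisons in this section could be drawn in a single faceted figure. The sketch below is one way to do that, assuming the ggplot2 and tidyr packages are installed. The `toy` data frame is simulated and only borrows the column names from the data set, so the resulting plot says nothing about the real measurements.

```r
library(ggplot2)
library(tidyr)

set.seed(1)
n <- 40

# Simulated stand-in for the cleaned data (hypothetical values).
toy <- data.frame(
  Mortality = rpois(n, 60),
  Disease   = sample(c("Split", "Stroke"), n, replace = TRUE),
  nh3       = rnorm(n, 18, 3),
  co2       = rnorm(n, 500, 60),
  pm25      = runif(n, 10, 35)
)

# Reshape to long form: one row per observation and sensor parameter.
long <- pivot_longer(toy, cols = c(nh3, co2, pm25),
                     names_to = "parameter", values_to = "value")

# One panel per parameter, colored by disease.
p <- ggplot(long, aes(x = value, y = Mortality, colour = Disease)) +
  geom_point(alpha = 0.7) +
  facet_wrap(~ parameter, scales = "free_x") +
  labs(x = "Sensor reading", y = "Daily mortality count")

print(p)
```

Free x-axis scales are used because the parameters are measured in different units.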
Predictive Modeling using Machine Learning
We train and tune a logistic regression model using the glmnet computational engine and compare performance across different combinations of tuning parameters.
The results below are generated using a framework for machine learning and modeling in R.
Modeling Results
Training Workflow
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: logistic_reg()
── Preprocessor ────────────────────────────────────────────────────────────────
4 Recipe Steps
• step_dummy()
• step_nzv()
• step_zv()
• step_normalize()
── Model ───────────────────────────────────────────────────────────────────────
Logistic Regression Model Specification (classification)
Main Arguments:
penalty = tune()
mixture = tune()
Computational engine: glmnet
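A workflow like the one printed above could be specified roughly as follows, assuming the tidymodels packages are installed. The `toy` tibble is simulated and only mimics the column names. The column selectors inside each recipe step are assumptions, since the printout lists the steps but not their arguments.

```r
library(tidymodels)

set.seed(123)
n <- 60

# Simulated stand-in for the cleaned data (hypothetical values).
toy <- tibble(
  Disease  = factor(sample(c("Split", "Stroke"), n, replace = TRUE)),
  pm25     = runif(n, 10, 35),
  pm10     = runif(n, 9, 32),
  temp     = rnorm(n, 26, 1),
  humidity = rnorm(n, 85, 2),
  co2      = rnorm(n, 500, 60),
  nh3      = rnorm(n, 18, 3),
  h2s      = 0,   # constant columns, as in the glimpse output
  velocity = 0
)

# Four preprocessing steps in the order shown in the printout.
# The selectors are assumptions; the report does not show them.
rec <- recipe(Disease ~ ., data = toy) |>
  step_dummy(all_nominal_predictors()) |>
  step_nzv(all_predictors()) |>
  step_zv(all_predictors()) |>
  step_normalize(all_numeric_predictors())

# Penalized logistic regression with both hyperparameters left for tuning.
spec <- logistic_reg(penalty = tune(), mixture = tune()) |>
  set_engine("glmnet") |>
  set_mode("classification")

wf <- workflow() |>
  add_recipe(rec) |>
  add_model(spec)

print(wf)
```

Here `penalty` is the overall regularization strength and `mixture` is the share of lasso (L1) versus ridge (L2) penalty, so tuning both covers the range from ridge to lasso.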
Models Generated (Parameter Tuning)
# A tibble: 20 × 8
penalty mixture .metric .estimator mean n std_err .config
<dbl> <dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
1 2.21e- 2 0.0813 accuracy binary 0.536 10 0.0309 Preprocessor1_Model…
2 2.21e- 2 0.0813 roc_auc binary 0.564 10 0.0339 Preprocessor1_Model…
3 5.73e- 8 0.145 accuracy binary 0.502 10 0.0372 Preprocessor1_Model…
4 5.73e- 8 0.145 roc_auc binary 0.540 10 0.0437 Preprocessor1_Model…
5 4.70e- 6 0.274 accuracy binary 0.502 10 0.0372 Preprocessor1_Model…
6 4.70e- 6 0.274 roc_auc binary 0.540 10 0.0437 Preprocessor1_Model…
7 6.65e- 1 0.358 accuracy binary 0.509 10 0.00606 Preprocessor1_Model…
8 6.65e- 1 0.358 roc_auc binary 0.5 10 0 Preprocessor1_Model…
9 9.09e- 3 0.458 accuracy binary 0.527 10 0.0347 Preprocessor1_Model…
10 9.09e- 3 0.458 roc_auc binary 0.566 10 0.0354 Preprocessor1_Model…
11 1.45e- 4 0.562 accuracy binary 0.502 10 0.0372 Preprocessor1_Model…
12 1.45e- 4 0.562 roc_auc binary 0.540 10 0.0437 Preprocessor1_Model…
13 1.05e-10 0.701 accuracy binary 0.502 10 0.0372 Preprocessor1_Model…
14 1.05e-10 0.701 roc_auc binary 0.540 10 0.0437 Preprocessor1_Model…
15 1.21e- 9 0.808 accuracy binary 0.502 10 0.0372 Preprocessor1_Model…
16 1.21e- 9 0.808 roc_auc binary 0.540 10 0.0437 Preprocessor1_Model…
17 4.35e- 7 0.841 accuracy binary 0.502 10 0.0372 Preprocessor1_Model…
18 4.35e- 7 0.841 roc_auc binary 0.540 10 0.0437 Preprocessor1_Model…
19 5.19e- 5 0.971 accuracy binary 0.502 10 0.0372 Preprocessor1_Model…
20 5.19e- 5 0.971 roc_auc binary 0.540 10 0.0437 Preprocessor1_Model…
Plot for Best Model Selection
Best Model Selection
# A tibble: 5 × 8
penalty mixture .metric .estimator mean n std_err .config
<dbl> <dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
1 0.0221 0.0813 accuracy binary 0.536 10 0.0309 Preprocessor1_Mo…
2 0.00909 0.458 accuracy binary 0.527 10 0.0347 Preprocessor1_Mo…
3 0.665 0.358 accuracy binary 0.509 10 0.00606 Preprocessor1_Mo…
4 0.0000000573 0.145 accuracy binary 0.502 10 0.0372 Preprocessor1_Mo…
5 0.00000470 0.274 accuracy binary 0.502 10 0.0372 Preprocessor1_Mo…
Model Performance on Test Set
# A tibble: 2 × 4
.metric .estimator .estimate .config
<chr> <chr> <dbl> <chr>
1 accuracy binary 0.490 Preprocessor1_Model1
2 roc_auc binary 0.472 Preprocessor1_Model1
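The tuning, selection, and test-set evaluation reported above could be reproduced with a pipeline along these lines. It is a sketch under several assumptions: the data are simulated, the 70/30 split, the stratification, and the seed are arbitrary choices, and 10-fold cross-validation is assumed only because the tables show `n = 10` resamples. It also requires the tidymodels and glmnet packages.

```r
library(tidymodels)

set.seed(2024)
n <- 120

# Simulated stand-in for the cleaned data (hypothetical values).
toy <- tibble(
  Disease  = factor(sample(c("Split", "Stroke"), n, replace = TRUE)),
  pm25     = runif(n, 10, 35),
  temp     = rnorm(n, 26, 1),
  humidity = rnorm(n, 85, 2),
  nh3      = rnorm(n, 18, 3)
)

# Hold out a test set, then build 10 folds on the training portion.
split <- initial_split(toy, prop = 0.7, strata = Disease)
train <- training(split)
folds <- vfold_cv(train, v = 10, strata = Disease)

rec <- recipe(Disease ~ ., data = train) |>
  step_zv(all_predictors()) |>
  step_normalize(all_numeric_predictors())

spec <- logistic_reg(penalty = tune(), mixture = tune()) |>
  set_engine("glmnet")

wf <- workflow() |>
  add_recipe(rec) |>
  add_model(spec)

metrics <- metric_set(accuracy, roc_auc)

# Evaluate 10 candidate (penalty, mixture) pairs on every fold.
tuned <- tune_grid(wf, resamples = folds, grid = 10, metrics = metrics)

# Pick the candidate with the highest mean cross-validated accuracy.
best <- select_best(tuned, metric = "accuracy")

# Refit on the full training set and score once on the held-out test set.
final_fit <- wf |>
  finalize_workflow(best) |>
  last_fit(split, metrics = metrics)

test_metrics <- collect_metrics(final_fit)

print(show_best(tuned, metric = "accuracy", n = 5))
print(test_metrics)
```

Because the simulated labels are random, the metrics from this sketch should hover around chance. The value of the example is the order of operations: tune on resamples of the training set, select once, and touch the test set only at the end.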
Confusion Matrix
Predictor Importance
The plot shows that pm25 is the most important predictor, followed by pm10 and nh3.
Sensitivity
Sensitivity measures how well the model identifies the Split disease: it is the proportion of actual Split cases that the model predicts as Split.
# A tibble: 1 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 sens binary 0.208
Specificity
Specificity measures how well the model identifies the Stroke disease: it is the proportion of actual Stroke cases that the model predicts as Stroke.
# A tibble: 1 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 spec binary 0.76
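To show how the two metrics relate to overall accuracy, the sketch below computes all three from a set of confusion-matrix counts. The counts are an assumption: the confusion matrix is not reproduced in this report, and these values were chosen only because they are consistent with the reported sensitivity, specificity, and test accuracy. Split is treated as the positive class, matching the definitions above.

```r
# Assumed counts, chosen to be consistent with the reported metrics.
# They are not taken from the report's confusion matrix.
split_correct  <- 5   # actual Split predicted as Split
split_missed   <- 19  # actual Split predicted as Stroke
stroke_correct <- 19  # actual Stroke predicted as Stroke
stroke_missed  <- 6   # actual Stroke predicted as Split

# Sensitivity: share of actual Split cases predicted as Split.
sensitivity <- split_correct / (split_correct + split_missed)

# Specificity: share of actual Stroke cases predicted as Stroke.
specificity <- stroke_correct / (stroke_correct + stroke_missed)

# Accuracy: share of all cases predicted correctly.
accuracy <- (split_correct + stroke_correct) /
  (split_correct + split_missed + stroke_correct + stroke_missed)

print(round(c(sens = sensitivity, spec = specificity, accuracy = accuracy), 3))
```

Under these assumed counts the model predicts Stroke for 38 of 49 test cases, which would explain why specificity is high while sensitivity is low.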
Interpretation
The best logistic regression model obtained a mean accuracy of 53.6% across the 10 training resamples and 49.0% on the test set, using the environmental predictors below:
Ammonia (ppm)
Carbon Dioxide (ppm)
Hydrogen Sulfide (ppm)
Particulate Matter 2.5 (ug/m^3)
Particulate Matter 10 (ug/m^3)
Temperature (Celsius)
Humidity (%)
Wind Velocity (m/s)
The low accuracy is driven largely by the model’s poor performance in predicting the Split disease, with a sensitivity of only 20.8%. The model does better on the Stroke disease, with a specificity of 76%, but the test ROC AUC of 0.472 indicates that overall it does not separate the two diseases better than chance.
Further research is needed, starting with the collection of more data.