Development of a Machine Learning Model for Classifying Poultry Diseases

University of Southeastern Philippines

Author

Jamal Kay B. Rogers

Published

September 15, 2023

Data Gathering

Using an IoT-based decision support system, a data set was collected from a poultry house. The data set contains the following variables:

Day of the cycle (typically 1 to 33 per cycle)
Daily Mortality Count
Disease (the disease causing the majority of the mortality: Split or Stroke)
Ammonia (ppm)
Carbon Dioxide (ppm)
Hydrogen Sulfide (ppm)
Particulate Matter 2.5 (µg/m³)
Particulate Matter 10 (µg/m³)
Temperature (Celsius)
Humidity (%)
Wind Velocity (m/s)

The Data Set

The data gathered contains 163 rows and 11 columns.

Rows: 163
Columns: 11
$ Day       <dbl> 33, 32, 31, 30, 29, 28, 27, 26, 25, 24, 23, 22, 21, 20, 19, …
$ Mortality <dbl> 170, 42, 62, 60, 69, 65, 73, 74, 78, 65, 68, 92, 67, 68, 101…
$ Disease   <chr> "Stroke", "Stroke", "Split", "Stroke", "Split", "Stroke", "S…
$ pm25      <dbl> 0.000000, 28.148649, 0.000000, 33.926950, 11.618608, 12.2558…
$ pm10      <dbl> 0.000000, 25.445946, 0.000000, 31.201382, 9.922674, 10.46782…
$ temp      <dbl> 0.00000, 27.00270, 0.00000, 26.58406, 25.46414, 25.84336, 25…
$ humidity  <dbl> 0.00000, 87.09730, 0.00000, 86.86027, 83.44864, 86.62773, 84…
$ co2       <dbl> 0.0000, 397.2297, 0.0000, 462.0533, 552.4929, 539.3997, 550.…
$ nh3       <dbl> 0.000000, 17.702703, 0.000000, 21.301579, 21.072397, 18.4188…
$ h2s       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ velocity  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …

The first 10 rows of the data are shown below.

# A tibble: 10 × 11
     Day Mortality Disease  pm25  pm10  temp humidity   co2   nh3   h2s velocity
   <dbl>     <dbl> <chr>   <dbl> <dbl> <dbl>    <dbl> <dbl> <dbl> <dbl>    <dbl>
 1    33       170 Stroke    0    0      0        0      0    0       0        0
 2    32        42 Stroke   28.1 25.4   27.0     87.1  397.  17.7     0        0
 3    31        62 Split     0    0      0        0      0    0       0        0
 4    30        60 Stroke   33.9 31.2   26.6     86.9  462.  21.3     0        0
 5    29        69 Split    11.6  9.92  25.5     83.4  552.  21.1     0        0
 6    28        65 Stroke   12.3 10.5   25.8     86.6  539.  18.4     0        0
 7    27        73 Stroke   17.2 15.1   25.8     84.9  550.  17.2     0        0
 8    26        74 Split    34.8 30.8   25.8     82.1  561.  16.5     0        0
 9    25        78 Split    12.3 10.5   25.6     84.2  555.  14.3     0        0
10    24        65 Stroke   10.4  9.15  25.8     84.1  544.  11.3     0        0

Data Cleaning

Count of NA values.

[1] 44

Some observations (rows) contain NA (missing) values. The 44 missing values fall in 4 of the 163 rows, and these rows have to be removed before the modeling process.

Rows: 159
Columns: 11
$ Day       <dbl> 33, 32, 31, 30, 29, 28, 27, 26, 25, 24, 23, 22, 21, 20, 19, …
$ Mortality <dbl> 170, 42, 62, 60, 69, 65, 73, 74, 78, 65, 68, 92, 67, 68, 101…
$ Disease   <chr> "Stroke", "Stroke", "Split", "Stroke", "Split", "Stroke", "S…
$ pm25      <dbl> 0.000000, 28.148649, 0.000000, 33.926950, 11.618608, 12.2558…
$ pm10      <dbl> 0.000000, 25.445946, 0.000000, 31.201382, 9.922674, 10.46782…
$ temp      <dbl> 0.00000, 27.00270, 0.00000, 26.58406, 25.46414, 25.84336, 25…
$ humidity  <dbl> 0.00000, 87.09730, 0.00000, 86.86027, 83.44864, 86.62773, 84…
$ co2       <dbl> 0.0000, 397.2297, 0.0000, 462.0533, 552.4929, 539.3997, 550.…
$ nh3       <dbl> 0.000000, 17.702703, 0.000000, 21.301579, 21.072397, 18.4188…
$ h2s       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ velocity  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …

After removing the rows that contain NA values, 159 observations remain in the data set for analysis.
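The counting and removal steps above can be sketched in base R as follows. This is a minimal illustration on a made-up data frame; the name `raw` and its values are assumptions, not the study's data, and the original analysis may have used a different function for the same purpose.

```r
# Toy data frame standing in for the raw data (values are made up)
raw <- data.frame(
  Day       = c(1, 2, NA, 4),
  Mortality = c(10, NA, NA, 7),
  Disease   = c("Split", "Stroke", NA, "Split")
)

# Total number of missing cells across all columns,
# the same kind of count as the single number reported above
n_missing <- sum(is.na(raw))

# Keep only the rows that have no missing value in any column
clean <- raw[complete.cases(raw), ]

print(n_missing)
print(nrow(clean))
```

Counting cells rather than rows explains why the NA count can be much larger than the number of rows dropped: a row with every column missing contributes one NA per column.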

Exploratory Data Analysis

Disease Distribution

We plot the distribution of the Disease label, the variable we want to classify. Based on the plot, the Split and Stroke classes are nearly balanced, which makes accuracy a reasonable metric for assessing model performance.
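A numeric check of class balance can complement the plot. The sketch below uses hypothetical class counts that sum to 159; the actual counts are not given in this report. It also computes the accuracy of always predicting the larger class, which is the baseline a useful classifier has to beat.

```r
# Hypothetical labels: the per-class counts are assumed, not taken from the data
disease <- c(rep("Split", 78), rep("Stroke", 81))

# Share of each class among all observations
class_share <- prop.table(table(disease))

# Accuracy obtained by always predicting the most frequent class
baseline_accuracy <- max(class_share)

print(round(class_share, 3))
print(round(baseline_accuracy, 3))
```

When the two shares are close to 0.5, accuracy is not dominated by one class, which is the reasoning behind using it here.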

Cycle Day versus Mortality

Based on the plot, there is a correlation between the cycle day and the mortality count. However, the plot shows no clear distinction between the diseases that caused the mortality.

library(ggplot2)
# Scatter plot of mortality count (x) against cycle day (y), colored by disease
mldata |>
  ggplot() +
  geom_point(aes(x = Mortality, y = Day, color = Disease))

Harmful Gases versus Mortality

There is no visible distinction between the diseases that caused the mortality based on ammonia, carbon dioxide, and hydrogen sulfide.

Microclimatic Parameters versus Mortality

Similarly, there is no visual distinction between the diseases that caused the mortality based on the microclimatic parameters temperature, humidity, and wind velocity.

Air Quality Parameters versus Mortality

The same is true of the air quality parameters PM2.5 and PM10.

No single variable, taken on its own, can be identified as a potential predictor of the mortality-causing disease. We therefore use a more robust analysis that combines multiple predictors to identify patterns.

Predictive Modeling using Machine Learning

We train and tune a logistic regression model using the glmnet computational engine and compare performance across different combinations of the tuning parameters penalty and mixture.

The results below are generated using a framework for machine learning and modeling in R.
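For readers who want to reproduce this kind of analysis, the sketch below shows one way such a pipeline could be written with the tidymodels packages. It is an assumption-laden illustration, not the author's script: the data are randomly simulated (`mldata_sim` is a hypothetical name), and the train/test split proportion, the stratification, and the random seed are all guesses. Only the four recipe steps, the glmnet engine, the two tuned parameters, the 10 resamples, and the 10 parameter combinations are taken from the output in this report.

```r
library(tidymodels)

set.seed(123)

# Simulated stand-in for the cleaned data; values are random, not the study's data.
# h2s and velocity are held at zero here as an assumption for illustration.
n <- 159
mldata_sim <- tibble(
  Disease  = factor(sample(c("Split", "Stroke"), n, replace = TRUE)),
  pm25     = runif(n, 0, 40),
  pm10     = runif(n, 0, 35),
  temp     = runif(n, 24, 30),
  humidity = runif(n, 75, 95),
  co2      = runif(n, 350, 600),
  nh3      = runif(n, 5, 25),
  h2s      = 0,
  velocity = 0
)

# Hold out a test set (default proportion; the original split is not stated)
data_split <- initial_split(mldata_sim, strata = Disease)
train_data <- training(data_split)

# Preprocessing with the same four steps listed in the workflow output
rec <- recipe(Disease ~ ., data = train_data) |>
  step_dummy(all_nominal_predictors()) |>
  step_nzv(all_predictors()) |>
  step_zv(all_predictors()) |>
  step_normalize(all_numeric_predictors())

# Penalized logistic regression with both parameters left for tuning
spec <- logistic_reg(penalty = tune(), mixture = tune()) |>
  set_engine("glmnet")

wf <- workflow() |>
  add_recipe(rec) |>
  add_model(spec)

# 10-fold cross-validation over 10 candidate parameter combinations
folds <- vfold_cv(train_data, v = 10, strata = Disease)
tuned <- tune_grid(
  wf,
  resamples = folds,
  grid = 10,
  metrics = metric_set(accuracy, roc_auc)
)

# Rank candidates by accuracy, then refit the best one and score the test set
show_best(tuned, metric = "accuracy", n = 5)
best <- select_best(tuned, metric = "accuracy")
final_fit <- wf |>
  finalize_workflow(best) |>
  last_fit(data_split, metrics = metric_set(accuracy, roc_auc))

collect_metrics(final_fit)
```

Because the labels in this sketch are random, its metrics should hover around chance level and carry no meaning; the point is only the sequence of steps.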

Modeling Results

Training Workflow

══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: logistic_reg()

── Preprocessor ────────────────────────────────────────────────────────────────
4 Recipe Steps

• step_dummy()
• step_nzv()
• step_zv()
• step_normalize()

── Model ───────────────────────────────────────────────────────────────────────
Logistic Regression Model Specification (classification)

Main Arguments:
  penalty = tune()
  mixture = tune()

Computational engine: glmnet

Models Generated (Parameter Tuning)

# A tibble: 20 × 8
    penalty mixture .metric  .estimator  mean     n std_err .config             
      <dbl>   <dbl> <chr>    <chr>      <dbl> <int>   <dbl> <chr>               
 1 2.21e- 2  0.0813 accuracy binary     0.536    10 0.0309  Preprocessor1_Model…
 2 2.21e- 2  0.0813 roc_auc  binary     0.564    10 0.0339  Preprocessor1_Model…
 3 5.73e- 8  0.145  accuracy binary     0.502    10 0.0372  Preprocessor1_Model…
 4 5.73e- 8  0.145  roc_auc  binary     0.540    10 0.0437  Preprocessor1_Model…
 5 4.70e- 6  0.274  accuracy binary     0.502    10 0.0372  Preprocessor1_Model…
 6 4.70e- 6  0.274  roc_auc  binary     0.540    10 0.0437  Preprocessor1_Model…
 7 6.65e- 1  0.358  accuracy binary     0.509    10 0.00606 Preprocessor1_Model…
 8 6.65e- 1  0.358  roc_auc  binary     0.5      10 0       Preprocessor1_Model…
 9 9.09e- 3  0.458  accuracy binary     0.527    10 0.0347  Preprocessor1_Model…
10 9.09e- 3  0.458  roc_auc  binary     0.566    10 0.0354  Preprocessor1_Model…
11 1.45e- 4  0.562  accuracy binary     0.502    10 0.0372  Preprocessor1_Model…
12 1.45e- 4  0.562  roc_auc  binary     0.540    10 0.0437  Preprocessor1_Model…
13 1.05e-10  0.701  accuracy binary     0.502    10 0.0372  Preprocessor1_Model…
14 1.05e-10  0.701  roc_auc  binary     0.540    10 0.0437  Preprocessor1_Model…
15 1.21e- 9  0.808  accuracy binary     0.502    10 0.0372  Preprocessor1_Model…
16 1.21e- 9  0.808  roc_auc  binary     0.540    10 0.0437  Preprocessor1_Model…
17 4.35e- 7  0.841  accuracy binary     0.502    10 0.0372  Preprocessor1_Model…
18 4.35e- 7  0.841  roc_auc  binary     0.540    10 0.0437  Preprocessor1_Model…
19 5.19e- 5  0.971  accuracy binary     0.502    10 0.0372  Preprocessor1_Model…
20 5.19e- 5  0.971  roc_auc  binary     0.540    10 0.0437  Preprocessor1_Model…

Plot for Best Model Selection

Best Model Selection

# A tibble: 5 × 8
       penalty mixture .metric  .estimator  mean     n std_err .config          
         <dbl>   <dbl> <chr>    <chr>      <dbl> <int>   <dbl> <chr>            
1 0.0221        0.0813 accuracy binary     0.536    10 0.0309  Preprocessor1_Mo…
2 0.00909       0.458  accuracy binary     0.527    10 0.0347  Preprocessor1_Mo…
3 0.665         0.358  accuracy binary     0.509    10 0.00606 Preprocessor1_Mo…
4 0.0000000573  0.145  accuracy binary     0.502    10 0.0372  Preprocessor1_Mo…
5 0.00000470    0.274  accuracy binary     0.502    10 0.0372  Preprocessor1_Mo…

Model Performance on Test Set

# A tibble: 2 × 4
  .metric  .estimator .estimate .config             
  <chr>    <chr>          <dbl> <chr>               
1 accuracy binary         0.490 Preprocessor1_Model1
2 roc_auc  binary         0.472 Preprocessor1_Model1

Confusion Matrix

Predictor Importance

The plot shows that pm25 is the most important predictor, followed by pm10 and nh3.

Sensitivity

Sensitivity measures the model’s performance in predicting the Split disease: it is the proportion of actual Split cases that the model classifies as Split.

# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 sens    binary         0.208

Specificity

Specificity measures the model’s performance in predicting the Stroke disease: it is the proportion of actual Stroke cases that the model classifies as Stroke.

# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 spec    binary          0.76
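To make the link between sensitivity, specificity, and accuracy concrete, the sketch below computes all three from confusion matrix counts. The four counts are hypothetical: they were chosen only because they reproduce the rounded metrics reported here, and they are not taken from the study's confusion matrix, which is not shown in this text. Split is treated as the positive class, matching the definitions above.

```r
# Hypothetical test-set counts (assumed, not from the source)
tp <- 5   # actual Split, predicted Split
fn <- 19  # actual Split, predicted Stroke
tn <- 19  # actual Stroke, predicted Stroke
fp <- 6   # actual Stroke, predicted Split

# Share of actual Split cases predicted correctly
sensitivity <- tp / (tp + fn)

# Share of actual Stroke cases predicted correctly
specificity <- tn / (tn + fp)

# Share of all cases predicted correctly
accuracy <- (tp + tn) / (tp + fn + tn + fp)

print(round(c(sens = sensitivity, spec = specificity, acc = accuracy), 3))
```

With nearly balanced classes, accuracy sits close to the average of sensitivity and specificity, so a low sensitivity pulls overall accuracy down even when specificity is high.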

Interpretation

The best logistic regression model obtained a mean cross-validated accuracy of 53.6% on the training set and an accuracy of 49.0% on the test set (ROC AUC of 0.472), using the environmental predictors below:

Ammonia (ppm)
Carbon Dioxide (ppm)
Hydrogen Sulfide (ppm)
Particulate Matter 2.5 (µg/m³)
Particulate Matter 10 (µg/m³)
Temperature (Celsius)
Humidity (%)
Wind Velocity (m/s)

The low accuracy can be attributed to the model’s poor performance in predicting the Split disease, with a sensitivity of only 20.8%. The model, however, performs well in predicting the Stroke disease, with a specificity of 76%.

Further research is needed, starting with the gathering of more data.