Applied Predictive
Modeling
Chapter 3 Data Pre-processing
he UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe
The data can be accessed via:
library(mlbench)
## Warning: package 'mlbench' was built under R version 4.5.2
data(Glass)
str(Glass)
## 'data.frame': 214 obs. of 10 variables:
## $ RI : num 1.52 1.52 1.52 1.52 1.52 ...
## $ Na : num 13.6 13.9 13.5 13.2 13.3 ...
## $ Mg : num 4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
## $ Al : num 1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
## $ Si : num 71.8 72.7 73 72.6 73.1 ...
## $ K : num 0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
## $ Ca : num 8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
## $ Ba : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Fe : num 0 0 0 0 0 0.26 0 0 0 0.11 ...
## $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
knit_table(Glass, 'View Glass Data')
| RI | Na | Mg | Al | Si | K | Ca | Ba | Fe | Type |
|---|---|---|---|---|---|---|---|---|---|
| 1.52101 | 13.64 | 4.49 | 1.10 | 71.78 | 0.06 | 8.75 | 0.00 | 0.00 | 1 |
| 1.51761 | 13.89 | 3.60 | 1.36 | 72.73 | 0.48 | 7.83 | 0.00 | 0.00 | 1 |
| 1.51618 | 13.53 | 3.55 | 1.54 | 72.99 | 0.39 | 7.78 | 0.00 | 0.00 | 1 |
| 1.51766 | 13.21 | 3.69 | 1.29 | 72.61 | 0.57 | 8.22 | 0.00 | 0.00 | 1 |
| 1.51742 | 13.27 | 3.62 | 1.24 | 73.08 | 0.55 | 8.07 | 0.00 | 0.00 | 1 |
| 1.51596 | 12.79 | 3.61 | 1.62 | 72.97 | 0.64 | 8.07 | 0.00 | 0.26 | 1 |
| 1.51743 | 13.30 | 3.60 | 1.14 | 73.09 | 0.58 | 8.17 | 0.00 | 0.00 | 1 |
| 1.51756 | 13.15 | 3.61 | 1.05 | 73.24 | 0.57 | 8.24 | 0.00 | 0.00 | 1 |
| 1.51918 | 14.04 | 3.58 | 1.37 | 72.08 | 0.56 | 8.30 | 0.00 | 0.00 | 1 |
| 1.51755 | 13.00 | 3.60 | 1.36 | 72.99 | 0.57 | 8.40 | 0.00 | 0.11 | 1 |
| 1.51571 | 12.72 | 3.46 | 1.56 | 73.20 | 0.67 | 8.09 | 0.00 | 0.24 | 1 |
| 1.51763 | 12.80 | 3.66 | 1.27 | 73.01 | 0.60 | 8.56 | 0.00 | 0.00 | 1 |
| 1.51589 | 12.88 | 3.43 | 1.40 | 73.28 | 0.69 | 8.05 | 0.00 | 0.24 | 1 |
| 1.51748 | 12.86 | 3.56 | 1.27 | 73.21 | 0.54 | 8.38 | 0.00 | 0.17 | 1 |
| 1.51763 | 12.61 | 3.59 | 1.31 | 73.29 | 0.58 | 8.50 | 0.00 | 0.00 | 1 |
| 1.51761 | 12.81 | 3.54 | 1.23 | 73.24 | 0.58 | 8.39 | 0.00 | 0.00 | 1 |
| 1.51784 | 12.68 | 3.67 | 1.16 | 73.11 | 0.61 | 8.70 | 0.00 | 0.00 | 1 |
| 1.52196 | 14.36 | 3.85 | 0.89 | 71.36 | 0.15 | 9.15 | 0.00 | 0.00 | 1 |
| 1.51911 | 13.90 | 3.73 | 1.18 | 72.12 | 0.06 | 8.89 | 0.00 | 0.00 | 1 |
| 1.51735 | 13.02 | 3.54 | 1.69 | 72.73 | 0.54 | 8.44 | 0.00 | 0.07 | 1 |
| 1.51750 | 12.82 | 3.55 | 1.49 | 72.75 | 0.54 | 8.52 | 0.00 | 0.19 | 1 |
| 1.51966 | 14.77 | 3.75 | 0.29 | 72.02 | 0.03 | 9.00 | 0.00 | 0.00 | 1 |
| 1.51736 | 12.78 | 3.62 | 1.29 | 72.79 | 0.59 | 8.70 | 0.00 | 0.00 | 1 |
| 1.51751 | 12.81 | 3.57 | 1.35 | 73.02 | 0.62 | 8.59 | 0.00 | 0.00 | 1 |
| 1.51720 | 13.38 | 3.50 | 1.15 | 72.85 | 0.50 | 8.43 | 0.00 | 0.00 | 1 |
| 1.51764 | 12.98 | 3.54 | 1.21 | 73.00 | 0.65 | 8.53 | 0.00 | 0.00 | 1 |
| 1.51793 | 13.21 | 3.48 | 1.41 | 72.64 | 0.59 | 8.43 | 0.00 | 0.00 | 1 |
| 1.51721 | 12.87 | 3.48 | 1.33 | 73.04 | 0.56 | 8.43 | 0.00 | 0.00 | 1 |
| 1.51768 | 12.56 | 3.52 | 1.43 | 73.15 | 0.57 | 8.54 | 0.00 | 0.00 | 1 |
| 1.51784 | 13.08 | 3.49 | 1.28 | 72.86 | 0.60 | 8.49 | 0.00 | 0.00 | 1 |
| 1.51768 | 12.65 | 3.56 | 1.30 | 73.08 | 0.61 | 8.69 | 0.00 | 0.14 | 1 |
| 1.51747 | 12.84 | 3.50 | 1.14 | 73.27 | 0.56 | 8.55 | 0.00 | 0.00 | 1 |
| 1.51775 | 12.85 | 3.48 | 1.23 | 72.97 | 0.61 | 8.56 | 0.09 | 0.22 | 1 |
| 1.51753 | 12.57 | 3.47 | 1.38 | 73.39 | 0.60 | 8.55 | 0.00 | 0.06 | 1 |
| 1.51783 | 12.69 | 3.54 | 1.34 | 72.95 | 0.57 | 8.75 | 0.00 | 0.00 | 1 |
| 1.51567 | 13.29 | 3.45 | 1.21 | 72.74 | 0.56 | 8.57 | 0.00 | 0.00 | 1 |
| 1.51909 | 13.89 | 3.53 | 1.32 | 71.81 | 0.51 | 8.78 | 0.11 | 0.00 | 1 |
| 1.51797 | 12.74 | 3.48 | 1.35 | 72.96 | 0.64 | 8.68 | 0.00 | 0.00 | 1 |
| 1.52213 | 14.21 | 3.82 | 0.47 | 71.77 | 0.11 | 9.57 | 0.00 | 0.00 | 1 |
| 1.52213 | 14.21 | 3.82 | 0.47 | 71.77 | 0.11 | 9.57 | 0.00 | 0.00 | 1 |
| 1.51793 | 12.79 | 3.50 | 1.12 | 73.03 | 0.64 | 8.77 | 0.00 | 0.00 | 1 |
| 1.51755 | 12.71 | 3.42 | 1.20 | 73.20 | 0.59 | 8.64 | 0.00 | 0.00 | 1 |
| 1.51779 | 13.21 | 3.39 | 1.33 | 72.76 | 0.59 | 8.59 | 0.00 | 0.00 | 1 |
| 1.52210 | 13.73 | 3.84 | 0.72 | 71.76 | 0.17 | 9.74 | 0.00 | 0.00 | 1 |
| 1.51786 | 12.73 | 3.43 | 1.19 | 72.95 | 0.62 | 8.76 | 0.00 | 0.30 | 1 |
| 1.51900 | 13.49 | 3.48 | 1.35 | 71.95 | 0.55 | 9.00 | 0.00 | 0.00 | 1 |
| 1.51869 | 13.19 | 3.37 | 1.18 | 72.72 | 0.57 | 8.83 | 0.00 | 0.16 | 1 |
| 1.52667 | 13.99 | 3.70 | 0.71 | 71.57 | 0.02 | 9.82 | 0.00 | 0.10 | 1 |
| 1.52223 | 13.21 | 3.77 | 0.79 | 71.99 | 0.13 | 10.02 | 0.00 | 0.00 | 1 |
| 1.51898 | 13.58 | 3.35 | 1.23 | 72.08 | 0.59 | 8.91 | 0.00 | 0.00 | 1 |
| 1.52320 | 13.72 | 3.72 | 0.51 | 71.75 | 0.09 | 10.06 | 0.00 | 0.16 | 1 |
| 1.51926 | 13.20 | 3.33 | 1.28 | 72.36 | 0.60 | 9.14 | 0.00 | 0.11 | 1 |
| 1.51808 | 13.43 | 2.87 | 1.19 | 72.84 | 0.55 | 9.03 | 0.00 | 0.00 | 1 |
| 1.51837 | 13.14 | 2.84 | 1.28 | 72.85 | 0.55 | 9.07 | 0.00 | 0.00 | 1 |
| 1.51778 | 13.21 | 2.81 | 1.29 | 72.98 | 0.51 | 9.02 | 0.00 | 0.09 | 1 |
| 1.51769 | 12.45 | 2.71 | 1.29 | 73.70 | 0.56 | 9.06 | 0.00 | 0.24 | 1 |
| 1.51215 | 12.99 | 3.47 | 1.12 | 72.98 | 0.62 | 8.35 | 0.00 | 0.31 | 1 |
| 1.51824 | 12.87 | 3.48 | 1.29 | 72.95 | 0.60 | 8.43 | 0.00 | 0.00 | 1 |
| 1.51754 | 13.48 | 3.74 | 1.17 | 72.99 | 0.59 | 8.03 | 0.00 | 0.00 | 1 |
| 1.51754 | 13.39 | 3.66 | 1.19 | 72.79 | 0.57 | 8.27 | 0.00 | 0.11 | 1 |
| 1.51905 | 13.60 | 3.62 | 1.11 | 72.64 | 0.14 | 8.76 | 0.00 | 0.00 | 1 |
| 1.51977 | 13.81 | 3.58 | 1.32 | 71.72 | 0.12 | 8.67 | 0.69 | 0.00 | 1 |
| 1.52172 | 13.51 | 3.86 | 0.88 | 71.79 | 0.23 | 9.54 | 0.00 | 0.11 | 1 |
| 1.52227 | 14.17 | 3.81 | 0.78 | 71.35 | 0.00 | 9.69 | 0.00 | 0.00 | 1 |
| 1.52172 | 13.48 | 3.74 | 0.90 | 72.01 | 0.18 | 9.61 | 0.00 | 0.07 | 1 |
| 1.52099 | 13.69 | 3.59 | 1.12 | 71.96 | 0.09 | 9.40 | 0.00 | 0.00 | 1 |
| 1.52152 | 13.05 | 3.65 | 0.87 | 72.22 | 0.19 | 9.85 | 0.00 | 0.17 | 1 |
| 1.52152 | 13.05 | 3.65 | 0.87 | 72.32 | 0.19 | 9.85 | 0.00 | 0.17 | 1 |
| 1.52152 | 13.12 | 3.58 | 0.90 | 72.20 | 0.23 | 9.82 | 0.00 | 0.16 | 1 |
| 1.52300 | 13.31 | 3.58 | 0.82 | 71.99 | 0.12 | 10.17 | 0.00 | 0.03 | 1 |
| 1.51574 | 14.86 | 3.67 | 1.74 | 71.87 | 0.16 | 7.36 | 0.00 | 0.12 | 2 |
| 1.51848 | 13.64 | 3.87 | 1.27 | 71.96 | 0.54 | 8.32 | 0.00 | 0.32 | 2 |
| 1.51593 | 13.09 | 3.59 | 1.52 | 73.10 | 0.67 | 7.83 | 0.00 | 0.00 | 2 |
| 1.51631 | 13.34 | 3.57 | 1.57 | 72.87 | 0.61 | 7.89 | 0.00 | 0.00 | 2 |
| 1.51596 | 13.02 | 3.56 | 1.54 | 73.11 | 0.72 | 7.90 | 0.00 | 0.00 | 2 |
| 1.51590 | 13.02 | 3.58 | 1.51 | 73.12 | 0.69 | 7.96 | 0.00 | 0.00 | 2 |
| 1.51645 | 13.44 | 3.61 | 1.54 | 72.39 | 0.66 | 8.03 | 0.00 | 0.00 | 2 |
| 1.51627 | 13.00 | 3.58 | 1.54 | 72.83 | 0.61 | 8.04 | 0.00 | 0.00 | 2 |
| 1.51613 | 13.92 | 3.52 | 1.25 | 72.88 | 0.37 | 7.94 | 0.00 | 0.14 | 2 |
| 1.51590 | 12.82 | 3.52 | 1.90 | 72.86 | 0.69 | 7.97 | 0.00 | 0.00 | 2 |
| 1.51592 | 12.86 | 3.52 | 2.12 | 72.66 | 0.69 | 7.97 | 0.00 | 0.00 | 2 |
| 1.51593 | 13.25 | 3.45 | 1.43 | 73.17 | 0.61 | 7.86 | 0.00 | 0.00 | 2 |
| 1.51646 | 13.41 | 3.55 | 1.25 | 72.81 | 0.68 | 8.10 | 0.00 | 0.00 | 2 |
| 1.51594 | 13.09 | 3.52 | 1.55 | 72.87 | 0.68 | 8.05 | 0.00 | 0.09 | 2 |
| 1.51409 | 14.25 | 3.09 | 2.08 | 72.28 | 1.10 | 7.08 | 0.00 | 0.00 | 2 |
| 1.51625 | 13.36 | 3.58 | 1.49 | 72.72 | 0.45 | 8.21 | 0.00 | 0.00 | 2 |
| 1.51569 | 13.24 | 3.49 | 1.47 | 73.25 | 0.38 | 8.03 | 0.00 | 0.00 | 2 |
| 1.51645 | 13.40 | 3.49 | 1.52 | 72.65 | 0.67 | 8.08 | 0.00 | 0.10 | 2 |
| 1.51618 | 13.01 | 3.50 | 1.48 | 72.89 | 0.60 | 8.12 | 0.00 | 0.00 | 2 |
| 1.51640 | 12.55 | 3.48 | 1.87 | 73.23 | 0.63 | 8.08 | 0.00 | 0.09 | 2 |
| 1.51841 | 12.93 | 3.74 | 1.11 | 72.28 | 0.64 | 8.96 | 0.00 | 0.22 | 2 |
| 1.51605 | 12.90 | 3.44 | 1.45 | 73.06 | 0.44 | 8.27 | 0.00 | 0.00 | 2 |
| 1.51588 | 13.12 | 3.41 | 1.58 | 73.26 | 0.07 | 8.39 | 0.00 | 0.19 | 2 |
| 1.51590 | 13.24 | 3.34 | 1.47 | 73.10 | 0.39 | 8.22 | 0.00 | 0.00 | 2 |
| 1.51629 | 12.71 | 3.33 | 1.49 | 73.28 | 0.67 | 8.24 | 0.00 | 0.00 | 2 |
| 1.51860 | 13.36 | 3.43 | 1.43 | 72.26 | 0.51 | 8.60 | 0.00 | 0.00 | 2 |
| 1.51841 | 13.02 | 3.62 | 1.06 | 72.34 | 0.64 | 9.13 | 0.00 | 0.15 | 2 |
| 1.51743 | 12.20 | 3.25 | 1.16 | 73.55 | 0.62 | 8.90 | 0.00 | 0.24 | 2 |
| 1.51689 | 12.67 | 2.88 | 1.71 | 73.21 | 0.73 | 8.54 | 0.00 | 0.00 | 2 |
| 1.51811 | 12.96 | 2.96 | 1.43 | 72.92 | 0.60 | 8.79 | 0.14 | 0.00 | 2 |
| 1.51655 | 12.75 | 2.85 | 1.44 | 73.27 | 0.57 | 8.79 | 0.11 | 0.22 | 2 |
| 1.51730 | 12.35 | 2.72 | 1.63 | 72.87 | 0.70 | 9.23 | 0.00 | 0.00 | 2 |
| 1.51820 | 12.62 | 2.76 | 0.83 | 73.81 | 0.35 | 9.42 | 0.00 | 0.20 | 2 |
| 1.52725 | 13.80 | 3.15 | 0.66 | 70.57 | 0.08 | 11.64 | 0.00 | 0.00 | 2 |
| 1.52410 | 13.83 | 2.90 | 1.17 | 71.15 | 0.08 | 10.79 | 0.00 | 0.00 | 2 |
| 1.52475 | 11.45 | 0.00 | 1.88 | 72.19 | 0.81 | 13.24 | 0.00 | 0.34 | 2 |
| 1.53125 | 10.73 | 0.00 | 2.10 | 69.81 | 0.58 | 13.30 | 3.15 | 0.28 | 2 |
| 1.53393 | 12.30 | 0.00 | 1.00 | 70.16 | 0.12 | 16.19 | 0.00 | 0.24 | 2 |
| 1.52222 | 14.43 | 0.00 | 1.00 | 72.67 | 0.10 | 11.52 | 0.00 | 0.08 | 2 |
| 1.51818 | 13.72 | 0.00 | 0.56 | 74.45 | 0.00 | 10.99 | 0.00 | 0.00 | 2 |
| 1.52664 | 11.23 | 0.00 | 0.77 | 73.21 | 0.00 | 14.68 | 0.00 | 0.00 | 2 |
| 1.52739 | 11.02 | 0.00 | 0.75 | 73.08 | 0.00 | 14.96 | 0.00 | 0.00 | 2 |
| 1.52777 | 12.64 | 0.00 | 0.67 | 72.02 | 0.06 | 14.40 | 0.00 | 0.00 | 2 |
| 1.51892 | 13.46 | 3.83 | 1.26 | 72.55 | 0.57 | 8.21 | 0.00 | 0.14 | 2 |
| 1.51847 | 13.10 | 3.97 | 1.19 | 72.44 | 0.60 | 8.43 | 0.00 | 0.00 | 2 |
| 1.51846 | 13.41 | 3.89 | 1.33 | 72.38 | 0.51 | 8.28 | 0.00 | 0.00 | 2 |
| 1.51829 | 13.24 | 3.90 | 1.41 | 72.33 | 0.55 | 8.31 | 0.00 | 0.10 | 2 |
| 1.51708 | 13.72 | 3.68 | 1.81 | 72.06 | 0.64 | 7.88 | 0.00 | 0.00 | 2 |
| 1.51673 | 13.30 | 3.64 | 1.53 | 72.53 | 0.65 | 8.03 | 0.00 | 0.29 | 2 |
| 1.51652 | 13.56 | 3.57 | 1.47 | 72.45 | 0.64 | 7.96 | 0.00 | 0.00 | 2 |
| 1.51844 | 13.25 | 3.76 | 1.32 | 72.40 | 0.58 | 8.42 | 0.00 | 0.00 | 2 |
| 1.51663 | 12.93 | 3.54 | 1.62 | 72.96 | 0.64 | 8.03 | 0.00 | 0.21 | 2 |
| 1.51687 | 13.23 | 3.54 | 1.48 | 72.84 | 0.56 | 8.10 | 0.00 | 0.00 | 2 |
| 1.51707 | 13.48 | 3.48 | 1.71 | 72.52 | 0.62 | 7.99 | 0.00 | 0.00 | 2 |
| 1.52177 | 13.20 | 3.68 | 1.15 | 72.75 | 0.54 | 8.52 | 0.00 | 0.00 | 2 |
| 1.51872 | 12.93 | 3.66 | 1.56 | 72.51 | 0.58 | 8.55 | 0.00 | 0.12 | 2 |
| 1.51667 | 12.94 | 3.61 | 1.26 | 72.75 | 0.56 | 8.60 | 0.00 | 0.00 | 2 |
| 1.52081 | 13.78 | 2.28 | 1.43 | 71.99 | 0.49 | 9.85 | 0.00 | 0.17 | 2 |
| 1.52068 | 13.55 | 2.09 | 1.67 | 72.18 | 0.53 | 9.57 | 0.27 | 0.17 | 2 |
| 1.52020 | 13.98 | 1.35 | 1.63 | 71.76 | 0.39 | 10.56 | 0.00 | 0.18 | 2 |
| 1.52177 | 13.75 | 1.01 | 1.36 | 72.19 | 0.33 | 11.14 | 0.00 | 0.00 | 2 |
| 1.52614 | 13.70 | 0.00 | 1.36 | 71.24 | 0.19 | 13.44 | 0.00 | 0.10 | 2 |
| 1.51813 | 13.43 | 3.98 | 1.18 | 72.49 | 0.58 | 8.15 | 0.00 | 0.00 | 2 |
| 1.51800 | 13.71 | 3.93 | 1.54 | 71.81 | 0.54 | 8.21 | 0.00 | 0.15 | 2 |
| 1.51811 | 13.33 | 3.85 | 1.25 | 72.78 | 0.52 | 8.12 | 0.00 | 0.00 | 2 |
| 1.51789 | 13.19 | 3.90 | 1.30 | 72.33 | 0.55 | 8.44 | 0.00 | 0.28 | 2 |
| 1.51806 | 13.00 | 3.80 | 1.08 | 73.07 | 0.56 | 8.38 | 0.00 | 0.12 | 2 |
| 1.51711 | 12.89 | 3.62 | 1.57 | 72.96 | 0.61 | 8.11 | 0.00 | 0.00 | 2 |
| 1.51674 | 12.79 | 3.52 | 1.54 | 73.36 | 0.66 | 7.90 | 0.00 | 0.00 | 2 |
| 1.51674 | 12.87 | 3.56 | 1.64 | 73.14 | 0.65 | 7.99 | 0.00 | 0.00 | 2 |
| 1.51690 | 13.33 | 3.54 | 1.61 | 72.54 | 0.68 | 8.11 | 0.00 | 0.00 | 2 |
| 1.51851 | 13.20 | 3.63 | 1.07 | 72.83 | 0.57 | 8.41 | 0.09 | 0.17 | 2 |
| 1.51662 | 12.85 | 3.51 | 1.44 | 73.01 | 0.68 | 8.23 | 0.06 | 0.25 | 2 |
| 1.51709 | 13.00 | 3.47 | 1.79 | 72.72 | 0.66 | 8.18 | 0.00 | 0.00 | 2 |
| 1.51660 | 12.99 | 3.18 | 1.23 | 72.97 | 0.58 | 8.81 | 0.00 | 0.24 | 2 |
| 1.51839 | 12.85 | 3.67 | 1.24 | 72.57 | 0.62 | 8.68 | 0.00 | 0.35 | 2 |
| 1.51769 | 13.65 | 3.66 | 1.11 | 72.77 | 0.11 | 8.60 | 0.00 | 0.00 | 3 |
| 1.51610 | 13.33 | 3.53 | 1.34 | 72.67 | 0.56 | 8.33 | 0.00 | 0.00 | 3 |
| 1.51670 | 13.24 | 3.57 | 1.38 | 72.70 | 0.56 | 8.44 | 0.00 | 0.10 | 3 |
| 1.51643 | 12.16 | 3.52 | 1.35 | 72.89 | 0.57 | 8.53 | 0.00 | 0.00 | 3 |
| 1.51665 | 13.14 | 3.45 | 1.76 | 72.48 | 0.60 | 8.38 | 0.00 | 0.17 | 3 |
| 1.52127 | 14.32 | 3.90 | 0.83 | 71.50 | 0.00 | 9.49 | 0.00 | 0.00 | 3 |
| 1.51779 | 13.64 | 3.65 | 0.65 | 73.00 | 0.06 | 8.93 | 0.00 | 0.00 | 3 |
| 1.51610 | 13.42 | 3.40 | 1.22 | 72.69 | 0.59 | 8.32 | 0.00 | 0.00 | 3 |
| 1.51694 | 12.86 | 3.58 | 1.31 | 72.61 | 0.61 | 8.79 | 0.00 | 0.00 | 3 |
| 1.51646 | 13.04 | 3.40 | 1.26 | 73.01 | 0.52 | 8.58 | 0.00 | 0.00 | 3 |
| 1.51655 | 13.41 | 3.39 | 1.28 | 72.64 | 0.52 | 8.65 | 0.00 | 0.00 | 3 |
| 1.52121 | 14.03 | 3.76 | 0.58 | 71.79 | 0.11 | 9.65 | 0.00 | 0.00 | 3 |
| 1.51776 | 13.53 | 3.41 | 1.52 | 72.04 | 0.58 | 8.79 | 0.00 | 0.00 | 3 |
| 1.51796 | 13.50 | 3.36 | 1.63 | 71.94 | 0.57 | 8.81 | 0.00 | 0.09 | 3 |
| 1.51832 | 13.33 | 3.34 | 1.54 | 72.14 | 0.56 | 8.99 | 0.00 | 0.00 | 3 |
| 1.51934 | 13.64 | 3.54 | 0.75 | 72.65 | 0.16 | 8.89 | 0.15 | 0.24 | 3 |
| 1.52211 | 14.19 | 3.78 | 0.91 | 71.36 | 0.23 | 9.14 | 0.00 | 0.37 | 3 |
| 1.51514 | 14.01 | 2.68 | 3.50 | 69.89 | 1.68 | 5.87 | 2.20 | 0.00 | 5 |
| 1.51915 | 12.73 | 1.85 | 1.86 | 72.69 | 0.60 | 10.09 | 0.00 | 0.00 | 5 |
| 1.52171 | 11.56 | 1.88 | 1.56 | 72.86 | 0.47 | 11.41 | 0.00 | 0.00 | 5 |
| 1.52151 | 11.03 | 1.71 | 1.56 | 73.44 | 0.58 | 11.62 | 0.00 | 0.00 | 5 |
| 1.51969 | 12.64 | 0.00 | 1.65 | 73.75 | 0.38 | 11.53 | 0.00 | 0.00 | 5 |
| 1.51666 | 12.86 | 0.00 | 1.83 | 73.88 | 0.97 | 10.17 | 0.00 | 0.00 | 5 |
| 1.51994 | 13.27 | 0.00 | 1.76 | 73.03 | 0.47 | 11.32 | 0.00 | 0.00 | 5 |
| 1.52369 | 13.44 | 0.00 | 1.58 | 72.22 | 0.32 | 12.24 | 0.00 | 0.00 | 5 |
| 1.51316 | 13.02 | 0.00 | 3.04 | 70.48 | 6.21 | 6.96 | 0.00 | 0.00 | 5 |
| 1.51321 | 13.00 | 0.00 | 3.02 | 70.70 | 6.21 | 6.93 | 0.00 | 0.00 | 5 |
| 1.52043 | 13.38 | 0.00 | 1.40 | 72.25 | 0.33 | 12.50 | 0.00 | 0.00 | 5 |
| 1.52058 | 12.85 | 1.61 | 2.17 | 72.18 | 0.76 | 9.70 | 0.24 | 0.51 | 5 |
| 1.52119 | 12.97 | 0.33 | 1.51 | 73.39 | 0.13 | 11.27 | 0.00 | 0.28 | 5 |
| 1.51905 | 14.00 | 2.39 | 1.56 | 72.37 | 0.00 | 9.57 | 0.00 | 0.00 | 6 |
| 1.51937 | 13.79 | 2.41 | 1.19 | 72.76 | 0.00 | 9.77 | 0.00 | 0.00 | 6 |
| 1.51829 | 14.46 | 2.24 | 1.62 | 72.38 | 0.00 | 9.26 | 0.00 | 0.00 | 6 |
| 1.51852 | 14.09 | 2.19 | 1.66 | 72.67 | 0.00 | 9.32 | 0.00 | 0.00 | 6 |
| 1.51299 | 14.40 | 1.74 | 1.54 | 74.55 | 0.00 | 7.59 | 0.00 | 0.00 | 6 |
| 1.51888 | 14.99 | 0.78 | 1.74 | 72.50 | 0.00 | 9.95 | 0.00 | 0.00 | 6 |
| 1.51916 | 14.15 | 0.00 | 2.09 | 72.74 | 0.00 | 10.88 | 0.00 | 0.00 | 6 |
| 1.51969 | 14.56 | 0.00 | 0.56 | 73.48 | 0.00 | 11.22 | 0.00 | 0.00 | 6 |
| 1.51115 | 17.38 | 0.00 | 0.34 | 75.41 | 0.00 | 6.65 | 0.00 | 0.00 | 6 |
| 1.51131 | 13.69 | 3.20 | 1.81 | 72.81 | 1.76 | 5.43 | 1.19 | 0.00 | 7 |
| 1.51838 | 14.32 | 3.26 | 2.22 | 71.25 | 1.46 | 5.79 | 1.63 | 0.00 | 7 |
| 1.52315 | 13.44 | 3.34 | 1.23 | 72.38 | 0.60 | 8.83 | 0.00 | 0.00 | 7 |
| 1.52247 | 14.86 | 2.20 | 2.06 | 70.26 | 0.76 | 9.76 | 0.00 | 0.00 | 7 |
| 1.52365 | 15.79 | 1.83 | 1.31 | 70.43 | 0.31 | 8.61 | 1.68 | 0.00 | 7 |
| 1.51613 | 13.88 | 1.78 | 1.79 | 73.10 | 0.00 | 8.67 | 0.76 | 0.00 | 7 |
| 1.51602 | 14.85 | 0.00 | 2.38 | 73.28 | 0.00 | 8.76 | 0.64 | 0.09 | 7 |
| 1.51623 | 14.20 | 0.00 | 2.79 | 73.46 | 0.04 | 9.04 | 0.40 | 0.09 | 7 |
| 1.51719 | 14.75 | 0.00 | 2.00 | 73.02 | 0.00 | 8.53 | 1.59 | 0.08 | 7 |
| 1.51683 | 14.56 | 0.00 | 1.98 | 73.29 | 0.00 | 8.52 | 1.57 | 0.07 | 7 |
| 1.51545 | 14.14 | 0.00 | 2.68 | 73.39 | 0.08 | 9.07 | 0.61 | 0.05 | 7 |
| 1.51556 | 13.87 | 0.00 | 2.54 | 73.23 | 0.14 | 9.41 | 0.81 | 0.01 | 7 |
| 1.51727 | 14.70 | 0.00 | 2.34 | 73.28 | 0.00 | 8.95 | 0.66 | 0.00 | 7 |
| 1.51531 | 14.38 | 0.00 | 2.66 | 73.10 | 0.04 | 9.08 | 0.64 | 0.00 | 7 |
| 1.51609 | 15.01 | 0.00 | 2.51 | 73.05 | 0.05 | 8.83 | 0.53 | 0.00 | 7 |
| 1.51508 | 15.15 | 0.00 | 2.25 | 73.50 | 0.00 | 8.34 | 0.63 | 0.00 | 7 |
| 1.51653 | 11.95 | 0.00 | 1.19 | 75.18 | 2.70 | 8.93 | 0.00 | 0.00 | 7 |
| 1.51514 | 14.85 | 0.00 | 2.42 | 73.72 | 0.00 | 8.39 | 0.56 | 0.00 | 7 |
| 1.51658 | 14.80 | 0.00 | 1.99 | 73.11 | 0.00 | 8.28 | 1.71 | 0.00 | 7 |
| 1.51617 | 14.95 | 0.00 | 2.27 | 73.30 | 0.00 | 8.71 | 0.67 | 0.00 | 7 |
| 1.51732 | 14.95 | 0.00 | 1.80 | 72.99 | 0.00 | 8.61 | 1.55 | 0.00 | 7 |
| 1.51645 | 14.94 | 0.00 | 1.87 | 73.11 | 0.00 | 8.67 | 1.38 | 0.00 | 7 |
| 1.51831 | 14.39 | 0.00 | 1.82 | 72.86 | 1.41 | 6.47 | 2.88 | 0.00 | 7 |
| 1.51640 | 14.37 | 0.00 | 2.74 | 72.85 | 0.00 | 9.45 | 0.54 | 0.00 | 7 |
| 1.51623 | 14.14 | 0.00 | 2.88 | 72.61 | 0.08 | 9.18 | 1.06 | 0.00 | 7 |
| 1.51685 | 14.92 | 0.00 | 1.99 | 73.06 | 0.00 | 8.40 | 1.59 | 0.00 | 7 |
| 1.52065 | 14.36 | 0.00 | 2.02 | 73.42 | 0.00 | 8.44 | 1.64 | 0.00 | 7 |
| 1.51651 | 14.38 | 0.00 | 1.94 | 73.61 | 0.00 | 8.48 | 1.57 | 0.00 | 7 |
| 1.51711 | 14.23 | 0.00 | 2.08 | 73.36 | 0.00 | 8.62 | 1.67 | 0.00 | 7 |
Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors
glass_pred = Glass |>
dplyr::select(-Type)
pairs(glass_pred, lower.panel=panel.smooth)
glass_long <- glass_pred |>
pivot_longer(cols = everything(), names_to = "predictors", values_to = "value")
ggplot(glass_long, aes(x = value)) +
geom_histogram(bins=30, fill = "#516c94") +
facet_wrap(~predictors, scales = "free") +
labs(title='Glass Predictor Histograms') +
theme_minimal()
ggplot(glass_long, aes(x = value)) +
geom_boxplot(fill='#516c94') +
facet_wrap(~predictors, scales = "free") +
labs(title='Glass Predictor Boxplots') +
theme_minimal()
ggpairs(
data=Glass,
columns=1:9,
upper = "blank",
title="Pairs Plot of Glass Predictors",
)
glass_pred |>
cor() |>
corrplot(
type = 'lower',
diag = FALSE,
col = COL2('PiYG', 10), tl.col = 'black',
title='Glass Predictor Correlation Heatmap'
)
Do there appear to be any outliers in the data? Are any predictors skewed?
count_outliers_iqr <- function(x) {
if (!is.numeric(x)) {
return(NA) # Skip non-numeric columns
}
Q1 <- quantile(x, 0.25, na.rm = TRUE)
Q3 <- quantile(x, 0.75, na.rm = TRUE)
IQR_val <- Q3 - Q1
lower_bound <- Q1 - (1.5 * IQR_val)
upper_bound <- Q3 + (1.5 * IQR_val)
outlier_count <- sum(x < lower_bound | x > upper_bound, na.rm = TRUE)
return(outlier_count)
}
outlier_counts <- apply(glass_pred, 2, count_outliers_iqr)
# Convert the results to a data frame
outlier_counts_df <- as.data.frame(as.list(outlier_counts))
outlier_counts_df = as.data.frame(t(rbind(outlier_counts_df, (outlier_counts_df / nrow(glass_pred)) * 100)))
outlier_counts_df = outlier_counts_df |> arrange(desc(V1))
colnames(outlier_counts_df) = c('Outlier Count', '% Outlier')
knit_table(
outlier_counts_df,
'IQR Outlier Stats per Predictor')
| Outlier Count | % Outlier | |
|---|---|---|
| Ba | 38 | 17.757009 |
| Ca | 26 | 12.149533 |
| Al | 18 | 8.411215 |
| RI | 17 | 7.943925 |
| Si | 12 | 5.607477 |
| Fe | 12 | 5.607477 |
| Na | 7 | 3.271028 |
| K | 7 | 3.271028 |
| Mg | 0 | 0.000000 |
zero_cnts = colSums(glass_pred==0)|>sort(decreasing = TRUE)
zero_cnts = as.data.frame(t(rbind(zero_cnts, (zero_cnts / nrow(glass_pred)) * 100)))
colnames(zero_cnts) = c('Zero Value Count', '% Zero Values')
knit_table(
zero_cnts,
'Count of 0 Values per Predictor'
)
| Zero Value Count | % Zero Values | |
|---|---|---|
| Ba | 176 | 82.24299 |
| Fe | 144 | 67.28972 |
| Mg | 42 | 19.62617 |
| K | 30 | 14.01869 |
| RI | 0 | 0.00000 |
| Na | 0 | 0.00000 |
| Al | 0 | 0.00000 |
| Si | 0 | 0.00000 |
| Ca | 0 | 0.00000 |
skew_df = as.data.frame(
sapply(glass_pred, skewness) |>
sort(decreasing = TRUE)
)
skew_df = cbind(skew_df, rownames(skew_df))
colnames(skew_df) = c('skew_value', 'predictor')
skew_df = skew_df |>
mutate(skew_direction=ifelse(skew_value>0.5, 'Right', ifelse(-0.5>skew_value, 'Left', 'Symmetric')))
skew_df |>
ggplot(aes(x = reorder(predictor, -skew_value), y = skew_value)) +
# annotate("rect", xmin = -Inf, xmax = Inf, ymin=-0.5, ymax=0.5, alpha=.2, fill="purple") +
# annotate("text", x='Mg', y=0.3, color="purple", label='Roughly symmetric', size=3) +
annotate("rect", xmin = -Inf, xmax = Inf, ymin=-0.5, ymax=-1, alpha=.15, fill="darkorange") +
annotate("rect", xmin = -Inf, xmax = Inf, ymin=0.5, ymax=1, alpha=.15, fill="darkorange") +
annotate("text", x='Si', y=0.7, color="darkorange", label='Moderately skewed', size=4) +
annotate("rect", xmin = -Inf, xmax = Inf, ymin=1, ymax=max(skew_df$skew_value), alpha=.15, fill="red") +
annotate("rect", xmin = -Inf, xmax = Inf, ymin=-1, ymax=min(skew_df$skew_value), alpha=.15, fill="red") +
annotate("text", x='Si', y=1.3, color="red", label='Highly skewed', size=4) +
geom_col(aes(fill=skew_direction)) +
scale_fill_manual(values = c("Left" = "#22a298", "Right" = "#516c94", "Symmetric"='grey')) +
labs(
title= "Glass Predictor Skewness (Fisher-Pearson)",
x = "Predictor",
y='Coefficient of Skewness',
fill='Skew Direction') +
theme_minimal()
All predictors other than Mg appear to have outliers. According to the IQR outlier evaluation, Ba has the most outliers with a count of 38, followed by Ca, Al, RI, Si & Fe, and Na & K. Ba likely has the highest outlier count due to having the most instances of 0% and thus is heavily right-skewed. Fe experiences a similar phenomenon and is right-skewed with a high count of 0 values. Ca also appears to be right skewed with a median around 9, although not as extreme as Ba and Fe.
According to the Fisher-Pearson coefficient of skewness, K is the highest skewed predictor and has a right skew followed by Ba, Ca, Fe, and RI as predictors that are highly skewed. Mg also is considered to be highlight left-skewed. Al (right skewed) and Si (left skewed) are the only predictors that are considered to be moderately skewed. Na is the only predictor that has a roughly symmetric or normal distribution.
Fisher-Pearson coefficient of skewness (g~1) Interpretation - Skewness Strength - ± 0.5: Generally considered roughly symmetric - 0.5 to 1 or -0.5 to -1: Moderately skewed - >1 or <-1: Highly skewed - Skew Direction - 0: Symmetric distribution (e.g., normal distribution). - Positive: Skewed right (long tail on the right). - Negative: Skewed left (long tail on the left).
Are there any relevant transformations of one or more predictors that might improve the classification model?
bc_lambda_list = c()
for (i in colnames(glass_pred)){
bc = BoxCoxTrans(glass_pred[, i])
bc_lambda_list = c(bc_lambda_list, bc$lambda)
og_plot = glass_pred |> gather() |>
ggplot(aes(x = value)) +
geom_histogram(bins=30, fill = "#516c94") +
labs(title=paste0(i, ' Histogram')) +
theme_minimal()
# print(og_plot)
trans = predict(bc, glass_pred[, i])
fp_skew = skewness(trans)
# print(fp_skew)
trans = as.data.frame(trans)
sapply(glass_pred, skewness)
trans_plot = trans |> gather() |>
ggplot(aes(x = value)) +
geom_histogram(bins=30, fill = "#516c94") +
labs(
title=paste0(i, ' Transformed Histogram'),
subtitle = paste0('Lambda: ', round(bc$lambda, 2), ', Skewness: ', round(fp_skew, 2))) +
theme_minimal()
if(!is.na(bc$lambda)){
print(og_plot + trans_plot)
}
}
bc_lambda_df = data.frame('Predictor'=colnames(glass_pred), 'Lambda_Value'=bc_lambda_list) |>
arrange(Lambda_Value)
knit_table(bc_lambda_df, 'Box Cox Predictor Lambda Values')
| Predictor | Lambda_Value |
|---|---|
| RI | -2.0 |
| Ca | -1.1 |
| Na | -0.1 |
| Al | 0.5 |
| Si | 2.0 |
| Mg | NA |
| K | NA |
| Ba | NA |
| Fe | NA |
The Box-Cox transformations of Na, Al, Si, and Ca appear to have assisted in normalizing their respective values and reducing skewness; however, their distribution remain not entirely symmetric. Box-Cox transformation for Ba, Fe, Mg, and K where not able to be calculated due to these predictors having values of zero.
The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., left spots, mold growth). The outcome labels consist of 19 distinct classes.
The data can be loaded via:
data(Soybean)
## See ?Soybean for details
knit_table(head(Soybean), 'View Soybean Data Head')
| Class | date | plant.stand | precip | temp | hail | crop.hist | area.dam | sever | seed.tmt | germ | plant.growth | leaves | leaf.halo | leaf.marg | leaf.size | leaf.shread | leaf.malf | leaf.mild | stem | lodging | stem.cankers | canker.lesion | fruiting.bodies | ext.decay | mycelium | int.discolor | sclerotia | fruit.pods | fruit.spots | seed | mold.growth | seed.discolor | seed.size | shriveling | roots |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| diaporthe-stem-canker | 6 | 0 | 2 | 1 | 0 | 1 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 2 | 2 | 0 | 0 | 0 | 1 | 1 | 3 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 4 | 0 | 0 | 0 | 0 | 0 | 0 |
| diaporthe-stem-canker | 4 | 0 | 2 | 1 | 0 | 2 | 0 | 2 | 1 | 1 | 1 | 1 | 0 | 2 | 2 | 0 | 0 | 0 | 1 | 0 | 3 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 4 | 0 | 0 | 0 | 0 | 0 | 0 |
| diaporthe-stem-canker | 3 | 0 | 2 | 1 | 0 | 1 | 0 | 2 | 1 | 2 | 1 | 1 | 0 | 2 | 2 | 0 | 0 | 0 | 1 | 0 | 3 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 4 | 0 | 0 | 0 | 0 | 0 | 0 |
| diaporthe-stem-canker | 3 | 0 | 2 | 1 | 0 | 1 | 0 | 2 | 0 | 1 | 1 | 1 | 0 | 2 | 2 | 0 | 0 | 0 | 1 | 0 | 3 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 4 | 0 | 0 | 0 | 0 | 0 | 0 |
| diaporthe-stem-canker | 6 | 0 | 2 | 1 | 0 | 2 | 0 | 1 | 0 | 2 | 1 | 1 | 0 | 2 | 2 | 0 | 0 | 0 | 1 | 0 | 3 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 4 | 0 | 0 | 0 | 0 | 0 | 0 |
| diaporthe-stem-canker | 5 | 0 | 2 | 1 | 0 | 3 | 0 | 1 | 0 | 1 | 1 | 1 | 0 | 2 | 2 | 0 | 0 | 0 | 1 | 0 | 3 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 4 | 0 | 0 | 0 | 0 | 0 | 0 |
Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?
library(nombre)
soy_pred = Soybean |>
dplyr::select(-Class)
sapply(soy_pred, n_distinct)
## date plant.stand precip temp hail
## 8 3 4 4 3
## crop.hist area.dam sever seed.tmt germ
## 5 5 4 4 4
## plant.growth leaves leaf.halo leaf.marg leaf.size
## 3 2 4 4 4
## leaf.shread leaf.malf leaf.mild stem lodging
## 3 3 4 3 3
## stem.cankers canker.lesion fruiting.bodies ext.decay mycelium
## 5 5 3 4 3
## int.discolor sclerotia fruit.pods fruit.spots seed
## 4 3 5 5 3
## mold.growth seed.discolor seed.size shriveling roots
## 3 3 3 3 4
soy_unique = as.data.frame(sapply(soy_pred, n_distinct))
colnames(soy_unique) = c('unique_cnt')
soy_unique = soy_unique |>
mutate(unique_percent = round(unique_cnt/nrow(soy_pred)*100, 2))
freq_ratios = c()
for (col in colnames(soy_pred)){
df_x = data.frame(soy_pred[, col])
colnames(df_x) = c('column')
col_ratio = df_x |>
group_by(column) |>
summarise(cnt = n()) |>
arrange(desc(cnt)) |>
mutate(nrow = nom_ord(row_number())) |>
dplyr::select(cnt, nrow) |>
pivot_wider(names_from = nrow, values_from = cnt) |>
mutate(ratio = first/second) |>
pull(ratio)
freq_ratios = c(freq_ratios, col_ratio)
}
freq_ratios
## [1] 1.137405 1.208191 4.098214 1.879397 3.425197 1.004587 1.213904
## [8] 1.651282 1.373874 1.103627 1.951327 7.870130 1.547511 1.615385
## [15] 1.479638 4.870000 6.595238 4.953704 1.253378 4.297521 1.984293
## [22] 1.807910 4.462264 3.681481 16.815789 13.204545 16.447368 3.130769
## [29] 3.254717 4.139130 5.695652 4.839623 5.782609 5.084906 6.406977
soy_freq_dist_df = cbind(soy_unique, freq_ratios)
knit_table(soy_freq_dist_df, 'Spybean Predictor Stats')
| unique_cnt | unique_percent | freq_ratios | |
|---|---|---|---|
| date | 8 | 1.17 | 1.137405 |
| plant.stand | 3 | 0.44 | 1.208191 |
| precip | 4 | 0.59 | 4.098214 |
| temp | 4 | 0.59 | 1.879397 |
| hail | 3 | 0.44 | 3.425197 |
| crop.hist | 5 | 0.73 | 1.004587 |
| area.dam | 5 | 0.73 | 1.213904 |
| sever | 4 | 0.59 | 1.651282 |
| seed.tmt | 4 | 0.59 | 1.373874 |
| germ | 4 | 0.59 | 1.103627 |
| plant.growth | 3 | 0.44 | 1.951327 |
| leaves | 2 | 0.29 | 7.870130 |
| leaf.halo | 4 | 0.59 | 1.547511 |
| leaf.marg | 4 | 0.59 | 1.615385 |
| leaf.size | 4 | 0.59 | 1.479638 |
| leaf.shread | 3 | 0.44 | 4.870000 |
| leaf.malf | 3 | 0.44 | 6.595238 |
| leaf.mild | 4 | 0.59 | 4.953704 |
| stem | 3 | 0.44 | 1.253378 |
| lodging | 3 | 0.44 | 4.297521 |
| stem.cankers | 5 | 0.73 | 1.984293 |
| canker.lesion | 5 | 0.73 | 1.807910 |
| fruiting.bodies | 3 | 0.44 | 4.462264 |
| ext.decay | 4 | 0.59 | 3.681481 |
| mycelium | 3 | 0.44 | 16.815790 |
| int.discolor | 4 | 0.59 | 13.204546 |
| sclerotia | 3 | 0.44 | 16.447368 |
| fruit.pods | 5 | 0.73 | 3.130769 |
| fruit.spots | 5 | 0.73 | 3.254717 |
| seed | 3 | 0.44 | 4.139130 |
| mold.growth | 3 | 0.44 | 5.695652 |
| seed.discolor | 3 | 0.44 | 4.839623 |
| seed.size | 3 | 0.44 | 5.782609 |
| shriveling | 3 | 0.44 | 5.084906 |
| roots | 4 | 0.59 | 6.406977 |
None of the categorical predictors have a degenerate distribution. A couple predictors (mycelium and mycelium) are close to having near-zero variance predictors with the fraction of unique values over the sample size being less than 1% and having the ratio of the frequency of the most prevalent value to the frequency of the second most prevalent value around 16.
Roughly 18 % of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?
missing_counts <- as.data.frame(colSums(is.na(soy_pred)))
missing_counts$variable <- rownames(missing_counts)
colnames(missing_counts) <- c("count", "variable")
ggplot(missing_counts, aes(x =reorder(variable, count), y = count)) +
geom_bar(stat = "identity", fill = "#516c94") +
coord_flip() +
labs(title = "Missing Value Counts by Predictor", y = "Missing Value Count", x = "Predictor") +
theme_minimal()
library(reshape2)
## Warning: package 'reshape2' was built under R version 4.5.1
##
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
##
## smiths
Soybean |>
group_by(Class) |>
summarise(across(everything(), ~ sum(is.na(.)) / n()), .groups = "drop") |>
melt(id.vars = "Class") |>
ggplot(aes(x=Class, y=variable, fill=value)) +
geom_tile() +
scale_fill_gradient2(
low = "white",
mid='lightblue',
high = "darkblue",
midpoint = 0.5,
limits = c(0, 1)
) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
The predictors sever, seed.tmt, lodging, and hail all have the highest
missing value count of 112 which is equal to about 18% of the
dataset.
The classes 2-4-d-injury, cyst-nematode, herbicide-injury, diaporthe-pod-&-stem-blight, and phytophthora-rot all have high null counts.
Develop a strategy for handling missing data, either by eliminating predictors or imputation.
The high null count predictors with the classes that have high null counts may prove to be beneficial predictors if the null value is treated as its own value. With that, one strategy would be to keep the predictors that have a high null percentage consistently across classes and plan to use a predictive model that can account for missing data. For the other predictors, with high null values, imputation using K-Nearesr Neighbors would be a good place to start given the missing data account for less than 20% of the dataset and the dataset is relativley small.