Exercises

Applied Predictive Modeling
Chapter 3 Data Pre-processing

3.1

he UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe

The data can be accessed via:

library(mlbench)
## Warning: package 'mlbench' was built under R version 4.5.2
data(Glass)
str(Glass)
## 'data.frame':    214 obs. of  10 variables:
##  $ RI  : num  1.52 1.52 1.52 1.52 1.52 ...
##  $ Na  : num  13.6 13.9 13.5 13.2 13.3 ...
##  $ Mg  : num  4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
##  $ Al  : num  1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
##  $ Si  : num  71.8 72.7 73 72.6 73.1 ...
##  $ K   : num  0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
##  $ Ca  : num  8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
##  $ Ba  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Fe  : num  0 0 0 0 0 0.26 0 0 0 0.11 ...
##  $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
knit_table(Glass, 'View Glass Data')
View Glass Data
RI Na Mg Al Si K Ca Ba Fe Type
1.52101 13.64 4.49 1.10 71.78 0.06 8.75 0.00 0.00 1
1.51761 13.89 3.60 1.36 72.73 0.48 7.83 0.00 0.00 1
1.51618 13.53 3.55 1.54 72.99 0.39 7.78 0.00 0.00 1
1.51766 13.21 3.69 1.29 72.61 0.57 8.22 0.00 0.00 1
1.51742 13.27 3.62 1.24 73.08 0.55 8.07 0.00 0.00 1
1.51596 12.79 3.61 1.62 72.97 0.64 8.07 0.00 0.26 1
1.51743 13.30 3.60 1.14 73.09 0.58 8.17 0.00 0.00 1
1.51756 13.15 3.61 1.05 73.24 0.57 8.24 0.00 0.00 1
1.51918 14.04 3.58 1.37 72.08 0.56 8.30 0.00 0.00 1
1.51755 13.00 3.60 1.36 72.99 0.57 8.40 0.00 0.11 1
1.51571 12.72 3.46 1.56 73.20 0.67 8.09 0.00 0.24 1
1.51763 12.80 3.66 1.27 73.01 0.60 8.56 0.00 0.00 1
1.51589 12.88 3.43 1.40 73.28 0.69 8.05 0.00 0.24 1
1.51748 12.86 3.56 1.27 73.21 0.54 8.38 0.00 0.17 1
1.51763 12.61 3.59 1.31 73.29 0.58 8.50 0.00 0.00 1
1.51761 12.81 3.54 1.23 73.24 0.58 8.39 0.00 0.00 1
1.51784 12.68 3.67 1.16 73.11 0.61 8.70 0.00 0.00 1
1.52196 14.36 3.85 0.89 71.36 0.15 9.15 0.00 0.00 1
1.51911 13.90 3.73 1.18 72.12 0.06 8.89 0.00 0.00 1
1.51735 13.02 3.54 1.69 72.73 0.54 8.44 0.00 0.07 1
1.51750 12.82 3.55 1.49 72.75 0.54 8.52 0.00 0.19 1
1.51966 14.77 3.75 0.29 72.02 0.03 9.00 0.00 0.00 1
1.51736 12.78 3.62 1.29 72.79 0.59 8.70 0.00 0.00 1
1.51751 12.81 3.57 1.35 73.02 0.62 8.59 0.00 0.00 1
1.51720 13.38 3.50 1.15 72.85 0.50 8.43 0.00 0.00 1
1.51764 12.98 3.54 1.21 73.00 0.65 8.53 0.00 0.00 1
1.51793 13.21 3.48 1.41 72.64 0.59 8.43 0.00 0.00 1
1.51721 12.87 3.48 1.33 73.04 0.56 8.43 0.00 0.00 1
1.51768 12.56 3.52 1.43 73.15 0.57 8.54 0.00 0.00 1
1.51784 13.08 3.49 1.28 72.86 0.60 8.49 0.00 0.00 1
1.51768 12.65 3.56 1.30 73.08 0.61 8.69 0.00 0.14 1
1.51747 12.84 3.50 1.14 73.27 0.56 8.55 0.00 0.00 1
1.51775 12.85 3.48 1.23 72.97 0.61 8.56 0.09 0.22 1
1.51753 12.57 3.47 1.38 73.39 0.60 8.55 0.00 0.06 1
1.51783 12.69 3.54 1.34 72.95 0.57 8.75 0.00 0.00 1
1.51567 13.29 3.45 1.21 72.74 0.56 8.57 0.00 0.00 1
1.51909 13.89 3.53 1.32 71.81 0.51 8.78 0.11 0.00 1
1.51797 12.74 3.48 1.35 72.96 0.64 8.68 0.00 0.00 1
1.52213 14.21 3.82 0.47 71.77 0.11 9.57 0.00 0.00 1
1.52213 14.21 3.82 0.47 71.77 0.11 9.57 0.00 0.00 1
1.51793 12.79 3.50 1.12 73.03 0.64 8.77 0.00 0.00 1
1.51755 12.71 3.42 1.20 73.20 0.59 8.64 0.00 0.00 1
1.51779 13.21 3.39 1.33 72.76 0.59 8.59 0.00 0.00 1
1.52210 13.73 3.84 0.72 71.76 0.17 9.74 0.00 0.00 1
1.51786 12.73 3.43 1.19 72.95 0.62 8.76 0.00 0.30 1
1.51900 13.49 3.48 1.35 71.95 0.55 9.00 0.00 0.00 1
1.51869 13.19 3.37 1.18 72.72 0.57 8.83 0.00 0.16 1
1.52667 13.99 3.70 0.71 71.57 0.02 9.82 0.00 0.10 1
1.52223 13.21 3.77 0.79 71.99 0.13 10.02 0.00 0.00 1
1.51898 13.58 3.35 1.23 72.08 0.59 8.91 0.00 0.00 1
1.52320 13.72 3.72 0.51 71.75 0.09 10.06 0.00 0.16 1
1.51926 13.20 3.33 1.28 72.36 0.60 9.14 0.00 0.11 1
1.51808 13.43 2.87 1.19 72.84 0.55 9.03 0.00 0.00 1
1.51837 13.14 2.84 1.28 72.85 0.55 9.07 0.00 0.00 1
1.51778 13.21 2.81 1.29 72.98 0.51 9.02 0.00 0.09 1
1.51769 12.45 2.71 1.29 73.70 0.56 9.06 0.00 0.24 1
1.51215 12.99 3.47 1.12 72.98 0.62 8.35 0.00 0.31 1
1.51824 12.87 3.48 1.29 72.95 0.60 8.43 0.00 0.00 1
1.51754 13.48 3.74 1.17 72.99 0.59 8.03 0.00 0.00 1
1.51754 13.39 3.66 1.19 72.79 0.57 8.27 0.00 0.11 1
1.51905 13.60 3.62 1.11 72.64 0.14 8.76 0.00 0.00 1
1.51977 13.81 3.58 1.32 71.72 0.12 8.67 0.69 0.00 1
1.52172 13.51 3.86 0.88 71.79 0.23 9.54 0.00 0.11 1
1.52227 14.17 3.81 0.78 71.35 0.00 9.69 0.00 0.00 1
1.52172 13.48 3.74 0.90 72.01 0.18 9.61 0.00 0.07 1
1.52099 13.69 3.59 1.12 71.96 0.09 9.40 0.00 0.00 1
1.52152 13.05 3.65 0.87 72.22 0.19 9.85 0.00 0.17 1
1.52152 13.05 3.65 0.87 72.32 0.19 9.85 0.00 0.17 1
1.52152 13.12 3.58 0.90 72.20 0.23 9.82 0.00 0.16 1
1.52300 13.31 3.58 0.82 71.99 0.12 10.17 0.00 0.03 1
1.51574 14.86 3.67 1.74 71.87 0.16 7.36 0.00 0.12 2
1.51848 13.64 3.87 1.27 71.96 0.54 8.32 0.00 0.32 2
1.51593 13.09 3.59 1.52 73.10 0.67 7.83 0.00 0.00 2
1.51631 13.34 3.57 1.57 72.87 0.61 7.89 0.00 0.00 2
1.51596 13.02 3.56 1.54 73.11 0.72 7.90 0.00 0.00 2
1.51590 13.02 3.58 1.51 73.12 0.69 7.96 0.00 0.00 2
1.51645 13.44 3.61 1.54 72.39 0.66 8.03 0.00 0.00 2
1.51627 13.00 3.58 1.54 72.83 0.61 8.04 0.00 0.00 2
1.51613 13.92 3.52 1.25 72.88 0.37 7.94 0.00 0.14 2
1.51590 12.82 3.52 1.90 72.86 0.69 7.97 0.00 0.00 2
1.51592 12.86 3.52 2.12 72.66 0.69 7.97 0.00 0.00 2
1.51593 13.25 3.45 1.43 73.17 0.61 7.86 0.00 0.00 2
1.51646 13.41 3.55 1.25 72.81 0.68 8.10 0.00 0.00 2
1.51594 13.09 3.52 1.55 72.87 0.68 8.05 0.00 0.09 2
1.51409 14.25 3.09 2.08 72.28 1.10 7.08 0.00 0.00 2
1.51625 13.36 3.58 1.49 72.72 0.45 8.21 0.00 0.00 2
1.51569 13.24 3.49 1.47 73.25 0.38 8.03 0.00 0.00 2
1.51645 13.40 3.49 1.52 72.65 0.67 8.08 0.00 0.10 2
1.51618 13.01 3.50 1.48 72.89 0.60 8.12 0.00 0.00 2
1.51640 12.55 3.48 1.87 73.23 0.63 8.08 0.00 0.09 2
1.51841 12.93 3.74 1.11 72.28 0.64 8.96 0.00 0.22 2
1.51605 12.90 3.44 1.45 73.06 0.44 8.27 0.00 0.00 2
1.51588 13.12 3.41 1.58 73.26 0.07 8.39 0.00 0.19 2
1.51590 13.24 3.34 1.47 73.10 0.39 8.22 0.00 0.00 2
1.51629 12.71 3.33 1.49 73.28 0.67 8.24 0.00 0.00 2
1.51860 13.36 3.43 1.43 72.26 0.51 8.60 0.00 0.00 2
1.51841 13.02 3.62 1.06 72.34 0.64 9.13 0.00 0.15 2
1.51743 12.20 3.25 1.16 73.55 0.62 8.90 0.00 0.24 2
1.51689 12.67 2.88 1.71 73.21 0.73 8.54 0.00 0.00 2
1.51811 12.96 2.96 1.43 72.92 0.60 8.79 0.14 0.00 2
1.51655 12.75 2.85 1.44 73.27 0.57 8.79 0.11 0.22 2
1.51730 12.35 2.72 1.63 72.87 0.70 9.23 0.00 0.00 2
1.51820 12.62 2.76 0.83 73.81 0.35 9.42 0.00 0.20 2
1.52725 13.80 3.15 0.66 70.57 0.08 11.64 0.00 0.00 2
1.52410 13.83 2.90 1.17 71.15 0.08 10.79 0.00 0.00 2
1.52475 11.45 0.00 1.88 72.19 0.81 13.24 0.00 0.34 2
1.53125 10.73 0.00 2.10 69.81 0.58 13.30 3.15 0.28 2
1.53393 12.30 0.00 1.00 70.16 0.12 16.19 0.00 0.24 2
1.52222 14.43 0.00 1.00 72.67 0.10 11.52 0.00 0.08 2
1.51818 13.72 0.00 0.56 74.45 0.00 10.99 0.00 0.00 2
1.52664 11.23 0.00 0.77 73.21 0.00 14.68 0.00 0.00 2
1.52739 11.02 0.00 0.75 73.08 0.00 14.96 0.00 0.00 2
1.52777 12.64 0.00 0.67 72.02 0.06 14.40 0.00 0.00 2
1.51892 13.46 3.83 1.26 72.55 0.57 8.21 0.00 0.14 2
1.51847 13.10 3.97 1.19 72.44 0.60 8.43 0.00 0.00 2
1.51846 13.41 3.89 1.33 72.38 0.51 8.28 0.00 0.00 2
1.51829 13.24 3.90 1.41 72.33 0.55 8.31 0.00 0.10 2
1.51708 13.72 3.68 1.81 72.06 0.64 7.88 0.00 0.00 2
1.51673 13.30 3.64 1.53 72.53 0.65 8.03 0.00 0.29 2
1.51652 13.56 3.57 1.47 72.45 0.64 7.96 0.00 0.00 2
1.51844 13.25 3.76 1.32 72.40 0.58 8.42 0.00 0.00 2
1.51663 12.93 3.54 1.62 72.96 0.64 8.03 0.00 0.21 2
1.51687 13.23 3.54 1.48 72.84 0.56 8.10 0.00 0.00 2
1.51707 13.48 3.48 1.71 72.52 0.62 7.99 0.00 0.00 2
1.52177 13.20 3.68 1.15 72.75 0.54 8.52 0.00 0.00 2
1.51872 12.93 3.66 1.56 72.51 0.58 8.55 0.00 0.12 2
1.51667 12.94 3.61 1.26 72.75 0.56 8.60 0.00 0.00 2
1.52081 13.78 2.28 1.43 71.99 0.49 9.85 0.00 0.17 2
1.52068 13.55 2.09 1.67 72.18 0.53 9.57 0.27 0.17 2
1.52020 13.98 1.35 1.63 71.76 0.39 10.56 0.00 0.18 2
1.52177 13.75 1.01 1.36 72.19 0.33 11.14 0.00 0.00 2
1.52614 13.70 0.00 1.36 71.24 0.19 13.44 0.00 0.10 2
1.51813 13.43 3.98 1.18 72.49 0.58 8.15 0.00 0.00 2
1.51800 13.71 3.93 1.54 71.81 0.54 8.21 0.00 0.15 2
1.51811 13.33 3.85 1.25 72.78 0.52 8.12 0.00 0.00 2
1.51789 13.19 3.90 1.30 72.33 0.55 8.44 0.00 0.28 2
1.51806 13.00 3.80 1.08 73.07 0.56 8.38 0.00 0.12 2
1.51711 12.89 3.62 1.57 72.96 0.61 8.11 0.00 0.00 2
1.51674 12.79 3.52 1.54 73.36 0.66 7.90 0.00 0.00 2
1.51674 12.87 3.56 1.64 73.14 0.65 7.99 0.00 0.00 2
1.51690 13.33 3.54 1.61 72.54 0.68 8.11 0.00 0.00 2
1.51851 13.20 3.63 1.07 72.83 0.57 8.41 0.09 0.17 2
1.51662 12.85 3.51 1.44 73.01 0.68 8.23 0.06 0.25 2
1.51709 13.00 3.47 1.79 72.72 0.66 8.18 0.00 0.00 2
1.51660 12.99 3.18 1.23 72.97 0.58 8.81 0.00 0.24 2
1.51839 12.85 3.67 1.24 72.57 0.62 8.68 0.00 0.35 2
1.51769 13.65 3.66 1.11 72.77 0.11 8.60 0.00 0.00 3
1.51610 13.33 3.53 1.34 72.67 0.56 8.33 0.00 0.00 3
1.51670 13.24 3.57 1.38 72.70 0.56 8.44 0.00 0.10 3
1.51643 12.16 3.52 1.35 72.89 0.57 8.53 0.00 0.00 3
1.51665 13.14 3.45 1.76 72.48 0.60 8.38 0.00 0.17 3
1.52127 14.32 3.90 0.83 71.50 0.00 9.49 0.00 0.00 3
1.51779 13.64 3.65 0.65 73.00 0.06 8.93 0.00 0.00 3
1.51610 13.42 3.40 1.22 72.69 0.59 8.32 0.00 0.00 3
1.51694 12.86 3.58 1.31 72.61 0.61 8.79 0.00 0.00 3
1.51646 13.04 3.40 1.26 73.01 0.52 8.58 0.00 0.00 3
1.51655 13.41 3.39 1.28 72.64 0.52 8.65 0.00 0.00 3
1.52121 14.03 3.76 0.58 71.79 0.11 9.65 0.00 0.00 3
1.51776 13.53 3.41 1.52 72.04 0.58 8.79 0.00 0.00 3
1.51796 13.50 3.36 1.63 71.94 0.57 8.81 0.00 0.09 3
1.51832 13.33 3.34 1.54 72.14 0.56 8.99 0.00 0.00 3
1.51934 13.64 3.54 0.75 72.65 0.16 8.89 0.15 0.24 3
1.52211 14.19 3.78 0.91 71.36 0.23 9.14 0.00 0.37 3
1.51514 14.01 2.68 3.50 69.89 1.68 5.87 2.20 0.00 5
1.51915 12.73 1.85 1.86 72.69 0.60 10.09 0.00 0.00 5
1.52171 11.56 1.88 1.56 72.86 0.47 11.41 0.00 0.00 5
1.52151 11.03 1.71 1.56 73.44 0.58 11.62 0.00 0.00 5
1.51969 12.64 0.00 1.65 73.75 0.38 11.53 0.00 0.00 5
1.51666 12.86 0.00 1.83 73.88 0.97 10.17 0.00 0.00 5
1.51994 13.27 0.00 1.76 73.03 0.47 11.32 0.00 0.00 5
1.52369 13.44 0.00 1.58 72.22 0.32 12.24 0.00 0.00 5
1.51316 13.02 0.00 3.04 70.48 6.21 6.96 0.00 0.00 5
1.51321 13.00 0.00 3.02 70.70 6.21 6.93 0.00 0.00 5
1.52043 13.38 0.00 1.40 72.25 0.33 12.50 0.00 0.00 5
1.52058 12.85 1.61 2.17 72.18 0.76 9.70 0.24 0.51 5
1.52119 12.97 0.33 1.51 73.39 0.13 11.27 0.00 0.28 5
1.51905 14.00 2.39 1.56 72.37 0.00 9.57 0.00 0.00 6
1.51937 13.79 2.41 1.19 72.76 0.00 9.77 0.00 0.00 6
1.51829 14.46 2.24 1.62 72.38 0.00 9.26 0.00 0.00 6
1.51852 14.09 2.19 1.66 72.67 0.00 9.32 0.00 0.00 6
1.51299 14.40 1.74 1.54 74.55 0.00 7.59 0.00 0.00 6
1.51888 14.99 0.78 1.74 72.50 0.00 9.95 0.00 0.00 6
1.51916 14.15 0.00 2.09 72.74 0.00 10.88 0.00 0.00 6
1.51969 14.56 0.00 0.56 73.48 0.00 11.22 0.00 0.00 6
1.51115 17.38 0.00 0.34 75.41 0.00 6.65 0.00 0.00 6
1.51131 13.69 3.20 1.81 72.81 1.76 5.43 1.19 0.00 7
1.51838 14.32 3.26 2.22 71.25 1.46 5.79 1.63 0.00 7
1.52315 13.44 3.34 1.23 72.38 0.60 8.83 0.00 0.00 7
1.52247 14.86 2.20 2.06 70.26 0.76 9.76 0.00 0.00 7
1.52365 15.79 1.83 1.31 70.43 0.31 8.61 1.68 0.00 7
1.51613 13.88 1.78 1.79 73.10 0.00 8.67 0.76 0.00 7
1.51602 14.85 0.00 2.38 73.28 0.00 8.76 0.64 0.09 7
1.51623 14.20 0.00 2.79 73.46 0.04 9.04 0.40 0.09 7
1.51719 14.75 0.00 2.00 73.02 0.00 8.53 1.59 0.08 7
1.51683 14.56 0.00 1.98 73.29 0.00 8.52 1.57 0.07 7
1.51545 14.14 0.00 2.68 73.39 0.08 9.07 0.61 0.05 7
1.51556 13.87 0.00 2.54 73.23 0.14 9.41 0.81 0.01 7
1.51727 14.70 0.00 2.34 73.28 0.00 8.95 0.66 0.00 7
1.51531 14.38 0.00 2.66 73.10 0.04 9.08 0.64 0.00 7
1.51609 15.01 0.00 2.51 73.05 0.05 8.83 0.53 0.00 7
1.51508 15.15 0.00 2.25 73.50 0.00 8.34 0.63 0.00 7
1.51653 11.95 0.00 1.19 75.18 2.70 8.93 0.00 0.00 7
1.51514 14.85 0.00 2.42 73.72 0.00 8.39 0.56 0.00 7
1.51658 14.80 0.00 1.99 73.11 0.00 8.28 1.71 0.00 7
1.51617 14.95 0.00 2.27 73.30 0.00 8.71 0.67 0.00 7
1.51732 14.95 0.00 1.80 72.99 0.00 8.61 1.55 0.00 7
1.51645 14.94 0.00 1.87 73.11 0.00 8.67 1.38 0.00 7
1.51831 14.39 0.00 1.82 72.86 1.41 6.47 2.88 0.00 7
1.51640 14.37 0.00 2.74 72.85 0.00 9.45 0.54 0.00 7
1.51623 14.14 0.00 2.88 72.61 0.08 9.18 1.06 0.00 7
1.51685 14.92 0.00 1.99 73.06 0.00 8.40 1.59 0.00 7
1.52065 14.36 0.00 2.02 73.42 0.00 8.44 1.64 0.00 7
1.51651 14.38 0.00 1.94 73.61 0.00 8.48 1.57 0.00 7
1.51711 14.23 0.00 2.08 73.36 0.00 8.62 1.67 0.00 7

a

Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors

glass_pred = Glass |>
  dplyr::select(-Type)

pairs(glass_pred, lower.panel=panel.smooth)

glass_long <- glass_pred |>
  pivot_longer(cols = everything(), names_to = "predictors", values_to = "value")

ggplot(glass_long, aes(x = value)) +
  geom_histogram(bins=30, fill = "#516c94") +
  facet_wrap(~predictors, scales = "free") +
  labs(title='Glass Predictor Histograms') +
  theme_minimal()

ggplot(glass_long, aes(x = value)) +
  geom_boxplot(fill='#516c94') +
  facet_wrap(~predictors, scales = "free") +
  labs(title='Glass Predictor Boxplots') +
  theme_minimal()

ggpairs(
  data=Glass,
  columns=1:9,
  upper = "blank",
  title="Pairs Plot of Glass Predictors", 
  )

glass_pred |>
  cor() |>
  corrplot(
    type = 'lower',
    diag = FALSE,
    col = COL2('PiYG', 10), tl.col = 'black',
    title='Glass Predictor Correlation Heatmap'
    ) 

b

Do there appear to be any outliers in the data? Are any predictors skewed?

count_outliers_iqr <- function(x) {
  if (!is.numeric(x)) {
    return(NA) # Skip non-numeric columns
  }
  Q1 <- quantile(x, 0.25, na.rm = TRUE)
  Q3 <- quantile(x, 0.75, na.rm = TRUE)
  IQR_val <- Q3 - Q1
  lower_bound <- Q1 - (1.5 * IQR_val)
  upper_bound <- Q3 + (1.5 * IQR_val)
  
  outlier_count <- sum(x < lower_bound | x > upper_bound, na.rm = TRUE)
  return(outlier_count)
}

outlier_counts <- apply(glass_pred, 2, count_outliers_iqr)

# Convert the results to a data frame
outlier_counts_df <- as.data.frame(as.list(outlier_counts))

outlier_counts_df = as.data.frame(t(rbind(outlier_counts_df, (outlier_counts_df / nrow(glass_pred)) * 100)))
outlier_counts_df = outlier_counts_df |> arrange(desc(V1))
colnames(outlier_counts_df) = c('Outlier Count', '% Outlier')
knit_table(
  outlier_counts_df,
  'IQR Outlier Stats per Predictor')
IQR Outlier Stats per Predictor
Outlier Count % Outlier
Ba 38 17.757009
Ca 26 12.149533
Al 18 8.411215
RI 17 7.943925
Si 12 5.607477
Fe 12 5.607477
Na 7 3.271028
K 7 3.271028
Mg 0 0.000000
zero_cnts = colSums(glass_pred==0)|>sort(decreasing = TRUE)
zero_cnts = as.data.frame(t(rbind(zero_cnts, (zero_cnts / nrow(glass_pred)) * 100)))
colnames(zero_cnts) = c('Zero Value Count', '% Zero Values')

knit_table(
  zero_cnts,
  'Count of 0 Values per Predictor'
)
Count of 0 Values per Predictor
Zero Value Count % Zero Values
Ba 176 82.24299
Fe 144 67.28972
Mg 42 19.62617
K 30 14.01869
RI 0 0.00000
Na 0 0.00000
Al 0 0.00000
Si 0 0.00000
Ca 0 0.00000
skew_df = as.data.frame(
 sapply(glass_pred, skewness) |>
  sort(decreasing = TRUE)
)
skew_df = cbind(skew_df, rownames(skew_df))
colnames(skew_df) = c('skew_value', 'predictor')
skew_df = skew_df |>
  mutate(skew_direction=ifelse(skew_value>0.5, 'Right', ifelse(-0.5>skew_value, 'Left', 'Symmetric')))

skew_df |>
  ggplot(aes(x = reorder(predictor, -skew_value), y = skew_value)) +
  # annotate("rect", xmin = -Inf, xmax = Inf, ymin=-0.5, ymax=0.5, alpha=.2, fill="purple") +
  # annotate("text", x='Mg', y=0.3, color="purple", label='Roughly symmetric', size=3) +
  annotate("rect", xmin = -Inf, xmax = Inf, ymin=-0.5, ymax=-1, alpha=.15, fill="darkorange") +
  annotate("rect", xmin = -Inf, xmax = Inf, ymin=0.5, ymax=1, alpha=.15, fill="darkorange") +
  annotate("text", x='Si', y=0.7, color="darkorange", label='Moderately skewed', size=4) +
  annotate("rect", xmin = -Inf, xmax = Inf, ymin=1, ymax=max(skew_df$skew_value), alpha=.15, fill="red") +
  annotate("rect", xmin = -Inf, xmax = Inf, ymin=-1, ymax=min(skew_df$skew_value), alpha=.15, fill="red") +
  annotate("text", x='Si', y=1.3, color="red", label='Highly skewed', size=4) +
  geom_col(aes(fill=skew_direction)) +
  scale_fill_manual(values = c("Left" = "#22a298", "Right" = "#516c94", "Symmetric"='grey')) +
  labs(
    title= "Glass Predictor Skewness (Fisher-Pearson)",
    x = "Predictor",
    y='Coefficient of Skewness',
    fill='Skew Direction') +
  theme_minimal()

All predictors other than Mg appear to have outliers. According to the IQR outlier evaluation, Ba has the most outliers with a count of 38, followed by Ca, Al, RI, Si & Fe, and Na & K. Ba likely has the highest outlier count due to having the most instances of 0% and thus is heavily right-skewed. Fe experiences a similar phenomenon and is right-skewed with a high count of 0 values. Ca also appears to be right skewed with a median around 9, although not as extreme as Ba and Fe.

According to the Fisher-Pearson coefficient of skewness, K is the highest skewed predictor and has a right skew followed by Ba, Ca, Fe, and RI as predictors that are highly skewed. Mg also is considered to be highlight left-skewed. Al (right skewed) and Si (left skewed) are the only predictors that are considered to be moderately skewed. Na is the only predictor that has a roughly symmetric or normal distribution.

Fisher-Pearson coefficient of skewness (g~1) Interpretation - Skewness Strength - ± 0.5: Generally considered roughly symmetric - 0.5 to 1 or -0.5 to -1: Moderately skewed - >1 or <-1: Highly skewed - Skew Direction - 0: Symmetric distribution (e.g., normal distribution). - Positive: Skewed right (long tail on the right). - Negative: Skewed left (long tail on the left).

c

Are there any relevant transformations of one or more predictors that might improve the classification model?

bc_lambda_list = c()
for (i in colnames(glass_pred)){
  bc = BoxCoxTrans(glass_pred[, i])
  bc_lambda_list = c(bc_lambda_list, bc$lambda)

  og_plot = glass_pred |> gather() |>
    ggplot(aes(x = value)) +
    geom_histogram(bins=30, fill = "#516c94") +
    labs(title=paste0(i, ' Histogram')) +
    theme_minimal()
  # print(og_plot)
  
  trans = predict(bc, glass_pred[, i])
  fp_skew = skewness(trans)
  # print(fp_skew)
  trans = as.data.frame(trans)
  sapply(glass_pred, skewness)
  trans_plot = trans |> gather() |>
    ggplot(aes(x = value)) +
    geom_histogram(bins=30, fill = "#516c94") +
    labs(
      title=paste0(i, ' Transformed Histogram'),
      subtitle = paste0('Lambda: ', round(bc$lambda, 2), ', Skewness: ', round(fp_skew, 2))) +
    theme_minimal()
  
  if(!is.na(bc$lambda)){
    print(og_plot + trans_plot)
  }
}

bc_lambda_df = data.frame('Predictor'=colnames(glass_pred), 'Lambda_Value'=bc_lambda_list) |>
  arrange(Lambda_Value)

knit_table(bc_lambda_df, 'Box Cox Predictor Lambda Values')
Box Cox Predictor Lambda Values
Predictor Lambda_Value
RI -2.0
Ca -1.1
Na -0.1
Al 0.5
Si 2.0
Mg NA
K NA
Ba NA
Fe NA

The Box-Cox transformations of Na, Al, Si, and Ca appear to have assisted in normalizing their respective values and reducing skewness; however, their distribution remain not entirely symmetric. Box-Cox transformation for Ba, Fe, Mg, and K where not able to be calculated due to these predictors having values of zero.


3.2

The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., left spots, mold growth). The outcome labels consist of 19 distinct classes.

The data can be loaded via:

data(Soybean)
## See ?Soybean for details

knit_table(head(Soybean), 'View Soybean Data Head')
View Soybean Data Head
Class date plant.stand precip temp hail crop.hist area.dam sever seed.tmt germ plant.growth leaves leaf.halo leaf.marg leaf.size leaf.shread leaf.malf leaf.mild stem lodging stem.cankers canker.lesion fruiting.bodies ext.decay mycelium int.discolor sclerotia fruit.pods fruit.spots seed mold.growth seed.discolor seed.size shriveling roots
diaporthe-stem-canker 6 0 2 1 0 1 1 1 0 0 1 1 0 2 2 0 0 0 1 1 3 1 1 1 0 0 0 0 4 0 0 0 0 0 0
diaporthe-stem-canker 4 0 2 1 0 2 0 2 1 1 1 1 0 2 2 0 0 0 1 0 3 1 1 1 0 0 0 0 4 0 0 0 0 0 0
diaporthe-stem-canker 3 0 2 1 0 1 0 2 1 2 1 1 0 2 2 0 0 0 1 0 3 0 1 1 0 0 0 0 4 0 0 0 0 0 0
diaporthe-stem-canker 3 0 2 1 0 1 0 2 0 1 1 1 0 2 2 0 0 0 1 0 3 0 1 1 0 0 0 0 4 0 0 0 0 0 0
diaporthe-stem-canker 6 0 2 1 0 2 0 1 0 2 1 1 0 2 2 0 0 0 1 0 3 1 1 1 0 0 0 0 4 0 0 0 0 0 0
diaporthe-stem-canker 5 0 2 1 0 3 0 1 0 1 1 1 0 2 2 0 0 0 1 0 3 0 1 1 0 0 0 0 4 0 0 0 0 0 0

a

Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?

library(nombre)

soy_pred = Soybean |>
  dplyr::select(-Class)

  
sapply(soy_pred, n_distinct)
##            date     plant.stand          precip            temp            hail 
##               8               3               4               4               3 
##       crop.hist        area.dam           sever        seed.tmt            germ 
##               5               5               4               4               4 
##    plant.growth          leaves       leaf.halo       leaf.marg       leaf.size 
##               3               2               4               4               4 
##     leaf.shread       leaf.malf       leaf.mild            stem         lodging 
##               3               3               4               3               3 
##    stem.cankers   canker.lesion fruiting.bodies       ext.decay        mycelium 
##               5               5               3               4               3 
##    int.discolor       sclerotia      fruit.pods     fruit.spots            seed 
##               4               3               5               5               3 
##     mold.growth   seed.discolor       seed.size      shriveling           roots 
##               3               3               3               3               4
soy_unique = as.data.frame(sapply(soy_pred, n_distinct))
colnames(soy_unique) = c('unique_cnt')
soy_unique = soy_unique |>
  mutate(unique_percent = round(unique_cnt/nrow(soy_pred)*100, 2))

freq_ratios = c()
for (col in colnames(soy_pred)){
  df_x = data.frame(soy_pred[, col])
  colnames(df_x) = c('column')
  col_ratio = df_x |>
    group_by(column) |>
    summarise(cnt = n()) |>
    arrange(desc(cnt)) |>
    mutate(nrow = nom_ord(row_number())) |>
    dplyr::select(cnt, nrow) |>
    pivot_wider(names_from = nrow, values_from = cnt) |>
    mutate(ratio = first/second) |>
    pull(ratio)
  freq_ratios = c(freq_ratios, col_ratio)
}
freq_ratios
##  [1]  1.137405  1.208191  4.098214  1.879397  3.425197  1.004587  1.213904
##  [8]  1.651282  1.373874  1.103627  1.951327  7.870130  1.547511  1.615385
## [15]  1.479638  4.870000  6.595238  4.953704  1.253378  4.297521  1.984293
## [22]  1.807910  4.462264  3.681481 16.815789 13.204545 16.447368  3.130769
## [29]  3.254717  4.139130  5.695652  4.839623  5.782609  5.084906  6.406977
soy_freq_dist_df = cbind(soy_unique, freq_ratios)

knit_table(soy_freq_dist_df, 'Spybean Predictor Stats')
Spybean Predictor Stats
unique_cnt unique_percent freq_ratios
date 8 1.17 1.137405
plant.stand 3 0.44 1.208191
precip 4 0.59 4.098214
temp 4 0.59 1.879397
hail 3 0.44 3.425197
crop.hist 5 0.73 1.004587
area.dam 5 0.73 1.213904
sever 4 0.59 1.651282
seed.tmt 4 0.59 1.373874
germ 4 0.59 1.103627
plant.growth 3 0.44 1.951327
leaves 2 0.29 7.870130
leaf.halo 4 0.59 1.547511
leaf.marg 4 0.59 1.615385
leaf.size 4 0.59 1.479638
leaf.shread 3 0.44 4.870000
leaf.malf 3 0.44 6.595238
leaf.mild 4 0.59 4.953704
stem 3 0.44 1.253378
lodging 3 0.44 4.297521
stem.cankers 5 0.73 1.984293
canker.lesion 5 0.73 1.807910
fruiting.bodies 3 0.44 4.462264
ext.decay 4 0.59 3.681481
mycelium 3 0.44 16.815790
int.discolor 4 0.59 13.204546
sclerotia 3 0.44 16.447368
fruit.pods 5 0.73 3.130769
fruit.spots 5 0.73 3.254717
seed 3 0.44 4.139130
mold.growth 3 0.44 5.695652
seed.discolor 3 0.44 4.839623
seed.size 3 0.44 5.782609
shriveling 3 0.44 5.084906
roots 4 0.59 6.406977

None of the categorical predictors have a degenerate distribution. A couple predictors (mycelium and mycelium) are close to having near-zero variance predictors with the fraction of unique values over the sample size being less than 1% and having the ratio of the frequency of the most prevalent value to the frequency of the second most prevalent value around 16.

b

Roughly 18 % of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?

missing_counts <- as.data.frame(colSums(is.na(soy_pred)))
missing_counts$variable <- rownames(missing_counts)
colnames(missing_counts) <- c("count", "variable")

ggplot(missing_counts, aes(x =reorder(variable, count), y = count)) +
  geom_bar(stat = "identity", fill = "#516c94") +
  coord_flip() +
  labs(title = "Missing Value Counts by Predictor", y = "Missing Value Count", x = "Predictor") +
  theme_minimal()

library(reshape2)
## Warning: package 'reshape2' was built under R version 4.5.1
## 
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
## 
##     smiths
Soybean |>
  group_by(Class) |>
  summarise(across(everything(), ~ sum(is.na(.)) / n()), .groups = "drop") |>
  melt(id.vars = "Class") |>
  ggplot(aes(x=Class, y=variable, fill=value)) +
  geom_tile() +
  scale_fill_gradient2(
    low = "white",
    mid='lightblue',
    high = "darkblue",
    midpoint = 0.5,
    limits = c(0, 1)
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

The predictors sever, seed.tmt, lodging, and hail all have the highest missing value count of 112 which is equal to about 18% of the dataset.

The classes 2-4-d-injury, cyst-nematode, herbicide-injury, diaporthe-pod-&-stem-blight, and phytophthora-rot all have high null counts.

c

Develop a strategy for handling missing data, either by eliminating predictors or imputation.

The high null count predictors with the classes that have high null counts may prove to be beneficial predictors if the null value is treated as its own value. With that, one strategy would be to keep the predictors that have a high null percentage consistently across classes and plan to use a predictive model that can account for missing data. For the other predictors, with high null values, imputation using K-Nearesr Neighbors would be a good place to start given the missing data account for less than 20% of the dataset and the dataset is relativley small.