DATA624 Homework 4

Exercises

Applied Predictive Modeling
Chapter 3 Data Pre-processing

3.1

he UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe

The data can be accessed via:

library(mlbench)

## Warning: package 'mlbench' was built under R version 4.5.2

data(Glass)
str(Glass)

## 'data.frame':    214 obs. of  10 variables:
##  $ RI  : num  1.52 1.52 1.52 1.52 1.52 ...
##  $ Na  : num  13.6 13.9 13.5 13.2 13.3 ...
##  $ Mg  : num  4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
##  $ Al  : num  1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
##  $ Si  : num  71.8 72.7 73 72.6 73.1 ...
##  $ K   : num  0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
##  $ Ca  : num  8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
##  $ Ba  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Fe  : num  0 0 0 0 0 0.26 0 0 0 0.11 ...
##  $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...

knit_table(Glass, 'View Glass Data')

View Glass Data
RI	Na	Mg	Al	Si	K	Ca	Ba	Fe	Type
1.52101	13.64	4.49	1.10	71.78	0.06	8.75	0.00	0.00	1
1.51761	13.89	3.60	1.36	72.73	0.48	7.83	0.00	0.00	1
1.51618	13.53	3.55	1.54	72.99	0.39	7.78	0.00	0.00	1
1.51766	13.21	3.69	1.29	72.61	0.57	8.22	0.00	0.00	1
1.51742	13.27	3.62	1.24	73.08	0.55	8.07	0.00	0.00	1
1.51596	12.79	3.61	1.62	72.97	0.64	8.07	0.00	0.26	1
1.51743	13.30	3.60	1.14	73.09	0.58	8.17	0.00	0.00	1
1.51756	13.15	3.61	1.05	73.24	0.57	8.24	0.00	0.00	1
1.51918	14.04	3.58	1.37	72.08	0.56	8.30	0.00	0.00	1
1.51755	13.00	3.60	1.36	72.99	0.57	8.40	0.00	0.11	1
1.51571	12.72	3.46	1.56	73.20	0.67	8.09	0.00	0.24	1
1.51763	12.80	3.66	1.27	73.01	0.60	8.56	0.00	0.00	1
1.51589	12.88	3.43	1.40	73.28	0.69	8.05	0.00	0.24	1
1.51748	12.86	3.56	1.27	73.21	0.54	8.38	0.00	0.17	1
1.51763	12.61	3.59	1.31	73.29	0.58	8.50	0.00	0.00	1
1.51761	12.81	3.54	1.23	73.24	0.58	8.39	0.00	0.00	1
1.51784	12.68	3.67	1.16	73.11	0.61	8.70	0.00	0.00	1
1.52196	14.36	3.85	0.89	71.36	0.15	9.15	0.00	0.00	1
1.51911	13.90	3.73	1.18	72.12	0.06	8.89	0.00	0.00	1
1.51735	13.02	3.54	1.69	72.73	0.54	8.44	0.00	0.07	1
1.51750	12.82	3.55	1.49	72.75	0.54	8.52	0.00	0.19	1
1.51966	14.77	3.75	0.29	72.02	0.03	9.00	0.00	0.00	1
1.51736	12.78	3.62	1.29	72.79	0.59	8.70	0.00	0.00	1
1.51751	12.81	3.57	1.35	73.02	0.62	8.59	0.00	0.00	1
1.51720	13.38	3.50	1.15	72.85	0.50	8.43	0.00	0.00	1
1.51764	12.98	3.54	1.21	73.00	0.65	8.53	0.00	0.00	1
1.51793	13.21	3.48	1.41	72.64	0.59	8.43	0.00	0.00	1
1.51721	12.87	3.48	1.33	73.04	0.56	8.43	0.00	0.00	1
1.51768	12.56	3.52	1.43	73.15	0.57	8.54	0.00	0.00	1
1.51784	13.08	3.49	1.28	72.86	0.60	8.49	0.00	0.00	1
1.51768	12.65	3.56	1.30	73.08	0.61	8.69	0.00	0.14	1
1.51747	12.84	3.50	1.14	73.27	0.56	8.55	0.00	0.00	1
1.51775	12.85	3.48	1.23	72.97	0.61	8.56	0.09	0.22	1
1.51753	12.57	3.47	1.38	73.39	0.60	8.55	0.00	0.06	1
1.51783	12.69	3.54	1.34	72.95	0.57	8.75	0.00	0.00	1
1.51567	13.29	3.45	1.21	72.74	0.56	8.57	0.00	0.00	1
1.51909	13.89	3.53	1.32	71.81	0.51	8.78	0.11	0.00	1
1.51797	12.74	3.48	1.35	72.96	0.64	8.68	0.00	0.00	1
1.52213	14.21	3.82	0.47	71.77	0.11	9.57	0.00	0.00	1
1.52213	14.21	3.82	0.47	71.77	0.11	9.57	0.00	0.00	1
1.51793	12.79	3.50	1.12	73.03	0.64	8.77	0.00	0.00	1
1.51755	12.71	3.42	1.20	73.20	0.59	8.64	0.00	0.00	1
1.51779	13.21	3.39	1.33	72.76	0.59	8.59	0.00	0.00	1
1.52210	13.73	3.84	0.72	71.76	0.17	9.74	0.00	0.00	1
1.51786	12.73	3.43	1.19	72.95	0.62	8.76	0.00	0.30	1
1.51900	13.49	3.48	1.35	71.95	0.55	9.00	0.00	0.00	1
1.51869	13.19	3.37	1.18	72.72	0.57	8.83	0.00	0.16	1
1.52667	13.99	3.70	0.71	71.57	0.02	9.82	0.00	0.10	1
1.52223	13.21	3.77	0.79	71.99	0.13	10.02	0.00	0.00	1
1.51898	13.58	3.35	1.23	72.08	0.59	8.91	0.00	0.00	1
1.52320	13.72	3.72	0.51	71.75	0.09	10.06	0.00	0.16	1
1.51926	13.20	3.33	1.28	72.36	0.60	9.14	0.00	0.11	1
1.51808	13.43	2.87	1.19	72.84	0.55	9.03	0.00	0.00	1
1.51837	13.14	2.84	1.28	72.85	0.55	9.07	0.00	0.00	1
1.51778	13.21	2.81	1.29	72.98	0.51	9.02	0.00	0.09	1
1.51769	12.45	2.71	1.29	73.70	0.56	9.06	0.00	0.24	1
1.51215	12.99	3.47	1.12	72.98	0.62	8.35	0.00	0.31	1
1.51824	12.87	3.48	1.29	72.95	0.60	8.43	0.00	0.00	1
1.51754	13.48	3.74	1.17	72.99	0.59	8.03	0.00	0.00	1
1.51754	13.39	3.66	1.19	72.79	0.57	8.27	0.00	0.11	1
1.51905	13.60	3.62	1.11	72.64	0.14	8.76	0.00	0.00	1
1.51977	13.81	3.58	1.32	71.72	0.12	8.67	0.69	0.00	1
1.52172	13.51	3.86	0.88	71.79	0.23	9.54	0.00	0.11	1
1.52227	14.17	3.81	0.78	71.35	0.00	9.69	0.00	0.00	1
1.52172	13.48	3.74	0.90	72.01	0.18	9.61	0.00	0.07	1
1.52099	13.69	3.59	1.12	71.96	0.09	9.40	0.00	0.00	1
1.52152	13.05	3.65	0.87	72.22	0.19	9.85	0.00	0.17	1
1.52152	13.05	3.65	0.87	72.32	0.19	9.85	0.00	0.17	1
1.52152	13.12	3.58	0.90	72.20	0.23	9.82	0.00	0.16	1
1.52300	13.31	3.58	0.82	71.99	0.12	10.17	0.00	0.03	1
1.51574	14.86	3.67	1.74	71.87	0.16	7.36	0.00	0.12	2
1.51848	13.64	3.87	1.27	71.96	0.54	8.32	0.00	0.32	2
1.51593	13.09	3.59	1.52	73.10	0.67	7.83	0.00	0.00	2
1.51631	13.34	3.57	1.57	72.87	0.61	7.89	0.00	0.00	2
1.51596	13.02	3.56	1.54	73.11	0.72	7.90	0.00	0.00	2
1.51590	13.02	3.58	1.51	73.12	0.69	7.96	0.00	0.00	2
1.51645	13.44	3.61	1.54	72.39	0.66	8.03	0.00	0.00	2
1.51627	13.00	3.58	1.54	72.83	0.61	8.04	0.00	0.00	2
1.51613	13.92	3.52	1.25	72.88	0.37	7.94	0.00	0.14	2
1.51590	12.82	3.52	1.90	72.86	0.69	7.97	0.00	0.00	2
1.51592	12.86	3.52	2.12	72.66	0.69	7.97	0.00	0.00	2
1.51593	13.25	3.45	1.43	73.17	0.61	7.86	0.00	0.00	2
1.51646	13.41	3.55	1.25	72.81	0.68	8.10	0.00	0.00	2
1.51594	13.09	3.52	1.55	72.87	0.68	8.05	0.00	0.09	2
1.51409	14.25	3.09	2.08	72.28	1.10	7.08	0.00	0.00	2
1.51625	13.36	3.58	1.49	72.72	0.45	8.21	0.00	0.00	2
1.51569	13.24	3.49	1.47	73.25	0.38	8.03	0.00	0.00	2
1.51645	13.40	3.49	1.52	72.65	0.67	8.08	0.00	0.10	2
1.51618	13.01	3.50	1.48	72.89	0.60	8.12	0.00	0.00	2
1.51640	12.55	3.48	1.87	73.23	0.63	8.08	0.00	0.09	2
1.51841	12.93	3.74	1.11	72.28	0.64	8.96	0.00	0.22	2
1.51605	12.90	3.44	1.45	73.06	0.44	8.27	0.00	0.00	2
1.51588	13.12	3.41	1.58	73.26	0.07	8.39	0.00	0.19	2
1.51590	13.24	3.34	1.47	73.10	0.39	8.22	0.00	0.00	2
1.51629	12.71	3.33	1.49	73.28	0.67	8.24	0.00	0.00	2
1.51860	13.36	3.43	1.43	72.26	0.51	8.60	0.00	0.00	2
1.51841	13.02	3.62	1.06	72.34	0.64	9.13	0.00	0.15	2
1.51743	12.20	3.25	1.16	73.55	0.62	8.90	0.00	0.24	2
1.51689	12.67	2.88	1.71	73.21	0.73	8.54	0.00	0.00	2
1.51811	12.96	2.96	1.43	72.92	0.60	8.79	0.14	0.00	2
1.51655	12.75	2.85	1.44	73.27	0.57	8.79	0.11	0.22	2
1.51730	12.35	2.72	1.63	72.87	0.70	9.23	0.00	0.00	2
1.51820	12.62	2.76	0.83	73.81	0.35	9.42	0.00	0.20	2
1.52725	13.80	3.15	0.66	70.57	0.08	11.64	0.00	0.00	2
1.52410	13.83	2.90	1.17	71.15	0.08	10.79	0.00	0.00	2
1.52475	11.45	0.00	1.88	72.19	0.81	13.24	0.00	0.34	2
1.53125	10.73	0.00	2.10	69.81	0.58	13.30	3.15	0.28	2
1.53393	12.30	0.00	1.00	70.16	0.12	16.19	0.00	0.24	2
1.52222	14.43	0.00	1.00	72.67	0.10	11.52	0.00	0.08	2
1.51818	13.72	0.00	0.56	74.45	0.00	10.99	0.00	0.00	2
1.52664	11.23	0.00	0.77	73.21	0.00	14.68	0.00	0.00	2
1.52739	11.02	0.00	0.75	73.08	0.00	14.96	0.00	0.00	2
1.52777	12.64	0.00	0.67	72.02	0.06	14.40	0.00	0.00	2
1.51892	13.46	3.83	1.26	72.55	0.57	8.21	0.00	0.14	2
1.51847	13.10	3.97	1.19	72.44	0.60	8.43	0.00	0.00	2
1.51846	13.41	3.89	1.33	72.38	0.51	8.28	0.00	0.00	2
1.51829	13.24	3.90	1.41	72.33	0.55	8.31	0.00	0.10	2
1.51708	13.72	3.68	1.81	72.06	0.64	7.88	0.00	0.00	2
1.51673	13.30	3.64	1.53	72.53	0.65	8.03	0.00	0.29	2
1.51652	13.56	3.57	1.47	72.45	0.64	7.96	0.00	0.00	2
1.51844	13.25	3.76	1.32	72.40	0.58	8.42	0.00	0.00	2
1.51663	12.93	3.54	1.62	72.96	0.64	8.03	0.00	0.21	2
1.51687	13.23	3.54	1.48	72.84	0.56	8.10	0.00	0.00	2
1.51707	13.48	3.48	1.71	72.52	0.62	7.99	0.00	0.00	2
1.52177	13.20	3.68	1.15	72.75	0.54	8.52	0.00	0.00	2
1.51872	12.93	3.66	1.56	72.51	0.58	8.55	0.00	0.12	2
1.51667	12.94	3.61	1.26	72.75	0.56	8.60	0.00	0.00	2
1.52081	13.78	2.28	1.43	71.99	0.49	9.85	0.00	0.17	2
1.52068	13.55	2.09	1.67	72.18	0.53	9.57	0.27	0.17	2
1.52020	13.98	1.35	1.63	71.76	0.39	10.56	0.00	0.18	2
1.52177	13.75	1.01	1.36	72.19	0.33	11.14	0.00	0.00	2
1.52614	13.70	0.00	1.36	71.24	0.19	13.44	0.00	0.10	2
1.51813	13.43	3.98	1.18	72.49	0.58	8.15	0.00	0.00	2
1.51800	13.71	3.93	1.54	71.81	0.54	8.21	0.00	0.15	2
1.51811	13.33	3.85	1.25	72.78	0.52	8.12	0.00	0.00	2
1.51789	13.19	3.90	1.30	72.33	0.55	8.44	0.00	0.28	2
1.51806	13.00	3.80	1.08	73.07	0.56	8.38	0.00	0.12	2
1.51711	12.89	3.62	1.57	72.96	0.61	8.11	0.00	0.00	2
1.51674	12.79	3.52	1.54	73.36	0.66	7.90	0.00	0.00	2
1.51674	12.87	3.56	1.64	73.14	0.65	7.99	0.00	0.00	2
1.51690	13.33	3.54	1.61	72.54	0.68	8.11	0.00	0.00	2
1.51851	13.20	3.63	1.07	72.83	0.57	8.41	0.09	0.17	2
1.51662	12.85	3.51	1.44	73.01	0.68	8.23	0.06	0.25	2
1.51709	13.00	3.47	1.79	72.72	0.66	8.18	0.00	0.00	2
1.51660	12.99	3.18	1.23	72.97	0.58	8.81	0.00	0.24	2
1.51839	12.85	3.67	1.24	72.57	0.62	8.68	0.00	0.35	2
1.51769	13.65	3.66	1.11	72.77	0.11	8.60	0.00	0.00	3
1.51610	13.33	3.53	1.34	72.67	0.56	8.33	0.00	0.00	3
1.51670	13.24	3.57	1.38	72.70	0.56	8.44	0.00	0.10	3
1.51643	12.16	3.52	1.35	72.89	0.57	8.53	0.00	0.00	3
1.51665	13.14	3.45	1.76	72.48	0.60	8.38	0.00	0.17	3
1.52127	14.32	3.90	0.83	71.50	0.00	9.49	0.00	0.00	3
1.51779	13.64	3.65	0.65	73.00	0.06	8.93	0.00	0.00	3
1.51610	13.42	3.40	1.22	72.69	0.59	8.32	0.00	0.00	3
1.51694	12.86	3.58	1.31	72.61	0.61	8.79	0.00	0.00	3
1.51646	13.04	3.40	1.26	73.01	0.52	8.58	0.00	0.00	3
1.51655	13.41	3.39	1.28	72.64	0.52	8.65	0.00	0.00	3
1.52121	14.03	3.76	0.58	71.79	0.11	9.65	0.00	0.00	3
1.51776	13.53	3.41	1.52	72.04	0.58	8.79	0.00	0.00	3
1.51796	13.50	3.36	1.63	71.94	0.57	8.81	0.00	0.09	3
1.51832	13.33	3.34	1.54	72.14	0.56	8.99	0.00	0.00	3
1.51934	13.64	3.54	0.75	72.65	0.16	8.89	0.15	0.24	3
1.52211	14.19	3.78	0.91	71.36	0.23	9.14	0.00	0.37	3
1.51514	14.01	2.68	3.50	69.89	1.68	5.87	2.20	0.00	5
1.51915	12.73	1.85	1.86	72.69	0.60	10.09	0.00	0.00	5
1.52171	11.56	1.88	1.56	72.86	0.47	11.41	0.00	0.00	5
1.52151	11.03	1.71	1.56	73.44	0.58	11.62	0.00	0.00	5
1.51969	12.64	0.00	1.65	73.75	0.38	11.53	0.00	0.00	5
1.51666	12.86	0.00	1.83	73.88	0.97	10.17	0.00	0.00	5
1.51994	13.27	0.00	1.76	73.03	0.47	11.32	0.00	0.00	5
1.52369	13.44	0.00	1.58	72.22	0.32	12.24	0.00	0.00	5
1.51316	13.02	0.00	3.04	70.48	6.21	6.96	0.00	0.00	5
1.51321	13.00	0.00	3.02	70.70	6.21	6.93	0.00	0.00	5
1.52043	13.38	0.00	1.40	72.25	0.33	12.50	0.00	0.00	5
1.52058	12.85	1.61	2.17	72.18	0.76	9.70	0.24	0.51	5
1.52119	12.97	0.33	1.51	73.39	0.13	11.27	0.00	0.28	5
1.51905	14.00	2.39	1.56	72.37	0.00	9.57	0.00	0.00	6
1.51937	13.79	2.41	1.19	72.76	0.00	9.77	0.00	0.00	6
1.51829	14.46	2.24	1.62	72.38	0.00	9.26	0.00	0.00	6
1.51852	14.09	2.19	1.66	72.67	0.00	9.32	0.00	0.00	6
1.51299	14.40	1.74	1.54	74.55	0.00	7.59	0.00	0.00	6
1.51888	14.99	0.78	1.74	72.50	0.00	9.95	0.00	0.00	6
1.51916	14.15	0.00	2.09	72.74	0.00	10.88	0.00	0.00	6
1.51969	14.56	0.00	0.56	73.48	0.00	11.22	0.00	0.00	6
1.51115	17.38	0.00	0.34	75.41	0.00	6.65	0.00	0.00	6
1.51131	13.69	3.20	1.81	72.81	1.76	5.43	1.19	0.00	7
1.51838	14.32	3.26	2.22	71.25	1.46	5.79	1.63	0.00	7
1.52315	13.44	3.34	1.23	72.38	0.60	8.83	0.00	0.00	7
1.52247	14.86	2.20	2.06	70.26	0.76	9.76	0.00	0.00	7
1.52365	15.79	1.83	1.31	70.43	0.31	8.61	1.68	0.00	7
1.51613	13.88	1.78	1.79	73.10	0.00	8.67	0.76	0.00	7
1.51602	14.85	0.00	2.38	73.28	0.00	8.76	0.64	0.09	7
1.51623	14.20	0.00	2.79	73.46	0.04	9.04	0.40	0.09	7
1.51719	14.75	0.00	2.00	73.02	0.00	8.53	1.59	0.08	7
1.51683	14.56	0.00	1.98	73.29	0.00	8.52	1.57	0.07	7
1.51545	14.14	0.00	2.68	73.39	0.08	9.07	0.61	0.05	7
1.51556	13.87	0.00	2.54	73.23	0.14	9.41	0.81	0.01	7
1.51727	14.70	0.00	2.34	73.28	0.00	8.95	0.66	0.00	7
1.51531	14.38	0.00	2.66	73.10	0.04	9.08	0.64	0.00	7
1.51609	15.01	0.00	2.51	73.05	0.05	8.83	0.53	0.00	7
1.51508	15.15	0.00	2.25	73.50	0.00	8.34	0.63	0.00	7
1.51653	11.95	0.00	1.19	75.18	2.70	8.93	0.00	0.00	7
1.51514	14.85	0.00	2.42	73.72	0.00	8.39	0.56	0.00	7
1.51658	14.80	0.00	1.99	73.11	0.00	8.28	1.71	0.00	7
1.51617	14.95	0.00	2.27	73.30	0.00	8.71	0.67	0.00	7
1.51732	14.95	0.00	1.80	72.99	0.00	8.61	1.55	0.00	7
1.51645	14.94	0.00	1.87	73.11	0.00	8.67	1.38	0.00	7
1.51831	14.39	0.00	1.82	72.86	1.41	6.47	2.88	0.00	7
1.51640	14.37	0.00	2.74	72.85	0.00	9.45	0.54	0.00	7
1.51623	14.14	0.00	2.88	72.61	0.08	9.18	1.06	0.00	7
1.51685	14.92	0.00	1.99	73.06	0.00	8.40	1.59	0.00	7
1.52065	14.36	0.00	2.02	73.42	0.00	8.44	1.64	0.00	7
1.51651	14.38	0.00	1.94	73.61	0.00	8.48	1.57	0.00	7
1.51711	14.23	0.00	2.08	73.36	0.00	8.62	1.67	0.00	7

a

Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors

glass_pred = Glass |>
  dplyr::select(-Type)

pairs(glass_pred, lower.panel=panel.smooth)

glass_long <- glass_pred |>
  pivot_longer(cols = everything(), names_to = "predictors", values_to = "value")

ggplot(glass_long, aes(x = value)) +
  geom_histogram(bins=30, fill = "#516c94") +
  facet_wrap(~predictors, scales = "free") +
  labs(title='Glass Predictor Histograms') +
  theme_minimal()

ggplot(glass_long, aes(x = value)) +
  geom_boxplot(fill='#516c94') +
  facet_wrap(~predictors, scales = "free") +
  labs(title='Glass Predictor Boxplots') +
  theme_minimal()

ggpairs(
  data=Glass,
  columns=1:9,
  upper = "blank",
  title="Pairs Plot of Glass Predictors", 
  )

glass_pred |>
  cor() |>
  corrplot(
    type = 'lower',
    diag = FALSE,
    col = COL2('PiYG', 10), tl.col = 'black',
    title='Glass Predictor Correlation Heatmap'
    )

b

Do there appear to be any outliers in the data? Are any predictors skewed?

count_outliers_iqr <- function(x) {
  if (!is.numeric(x)) {
    return(NA) # Skip non-numeric columns
  }
  Q1 <- quantile(x, 0.25, na.rm = TRUE)
  Q3 <- quantile(x, 0.75, na.rm = TRUE)
  IQR_val <- Q3 - Q1
  lower_bound <- Q1 - (1.5 * IQR_val)
  upper_bound <- Q3 + (1.5 * IQR_val)
  
  outlier_count <- sum(x < lower_bound | x > upper_bound, na.rm = TRUE)
  return(outlier_count)
}

outlier_counts <- apply(glass_pred, 2, count_outliers_iqr)

# Convert the results to a data frame
outlier_counts_df <- as.data.frame(as.list(outlier_counts))

outlier_counts_df = as.data.frame(t(rbind(outlier_counts_df, (outlier_counts_df / nrow(glass_pred)) * 100)))
outlier_counts_df = outlier_counts_df |> arrange(desc(V1))
colnames(outlier_counts_df) = c('Outlier Count', '% Outlier')
knit_table(
  outlier_counts_df,
  'IQR Outlier Stats per Predictor')

IQR Outlier Stats per Predictor
	Outlier Count	% Outlier
Ba	38	17.757009
Ca	26	12.149533
Al	18	8.411215
RI	17	7.943925
Si	12	5.607477
Fe	12	5.607477
Na	7	3.271028
K	7	3.271028
Mg	0	0.000000

zero_cnts = colSums(glass_pred==0)|>sort(decreasing = TRUE)
zero_cnts = as.data.frame(t(rbind(zero_cnts, (zero_cnts / nrow(glass_pred)) * 100)))
colnames(zero_cnts) = c('Zero Value Count', '% Zero Values')

knit_table(
  zero_cnts,
  'Count of 0 Values per Predictor'
)

Count of 0 Values per Predictor
	Zero Value Count	% Zero Values
Ba	176	82.24299
Fe	144	67.28972
Mg	42	19.62617
K	30	14.01869
RI	0	0.00000
Na	0	0.00000
Al	0	0.00000
Si	0	0.00000
Ca	0	0.00000

skew_df = as.data.frame(
 sapply(glass_pred, skewness) |>
  sort(decreasing = TRUE)
)
skew_df = cbind(skew_df, rownames(skew_df))
colnames(skew_df) = c('skew_value', 'predictor')
skew_df = skew_df |>
  mutate(skew_direction=ifelse(skew_value>0.5, 'Right', ifelse(-0.5>skew_value, 'Left', 'Symmetric')))

skew_df |>
  ggplot(aes(x = reorder(predictor, -skew_value), y = skew_value)) +
  # annotate("rect", xmin = -Inf, xmax = Inf, ymin=-0.5, ymax=0.5, alpha=.2, fill="purple") +
  # annotate("text", x='Mg', y=0.3, color="purple", label='Roughly symmetric', size=3) +
  annotate("rect", xmin = -Inf, xmax = Inf, ymin=-0.5, ymax=-1, alpha=.15, fill="darkorange") +
  annotate("rect", xmin = -Inf, xmax = Inf, ymin=0.5, ymax=1, alpha=.15, fill="darkorange") +
  annotate("text", x='Si', y=0.7, color="darkorange", label='Moderately skewed', size=4) +
  annotate("rect", xmin = -Inf, xmax = Inf, ymin=1, ymax=max(skew_df$skew_value), alpha=.15, fill="red") +
  annotate("rect", xmin = -Inf, xmax = Inf, ymin=-1, ymax=min(skew_df$skew_value), alpha=.15, fill="red") +
  annotate("text", x='Si', y=1.3, color="red", label='Highly skewed', size=4) +
  geom_col(aes(fill=skew_direction)) +
  scale_fill_manual(values = c("Left" = "#22a298", "Right" = "#516c94", "Symmetric"='grey')) +
  labs(
    title= "Glass Predictor Skewness (Fisher-Pearson)",
    x = "Predictor",
    y='Coefficient of Skewness',
    fill='Skew Direction') +
  theme_minimal()

All predictors other than Mg appear to have outliers. According to the IQR outlier evaluation, Ba has the most outliers with a count of 38, followed by Ca, Al, RI, Si & Fe, and Na & K. Ba likely has the highest outlier count due to having the most instances of 0% and thus is heavily right-skewed. Fe experiences a similar phenomenon and is right-skewed with a high count of 0 values. Ca also appears to be right skewed with a median around 9, although not as extreme as Ba and Fe.

According to the Fisher-Pearson coefficient of skewness, K is the highest skewed predictor and has a right skew followed by Ba, Ca, Fe, and RI as predictors that are highly skewed. Mg also is considered to be highlight left-skewed. Al (right skewed) and Si (left skewed) are the only predictors that are considered to be moderately skewed. Na is the only predictor that has a roughly symmetric or normal distribution.

Fisher-Pearson coefficient of skewness (g~1) Interpretation - Skewness Strength - ± 0.5: Generally considered roughly symmetric - 0.5 to 1 or -0.5 to -1: Moderately skewed - >1 or <-1: Highly skewed - Skew Direction - 0: Symmetric distribution (e.g., normal distribution). - Positive: Skewed right (long tail on the right). - Negative: Skewed left (long tail on the left).

c

Are there any relevant transformations of one or more predictors that might improve the classification model?

bc_lambda_list = c()
for (i in colnames(glass_pred)){
  bc = BoxCoxTrans(glass_pred[, i])
  bc_lambda_list = c(bc_lambda_list, bc$lambda)

  og_plot = glass_pred |> gather() |>
    ggplot(aes(x = value)) +
    geom_histogram(bins=30, fill = "#516c94") +
    labs(title=paste0(i, ' Histogram')) +
    theme_minimal()
  # print(og_plot)
  
  trans = predict(bc, glass_pred[, i])
  fp_skew = skewness(trans)
  # print(fp_skew)
  trans = as.data.frame(trans)
  sapply(glass_pred, skewness)
  trans_plot = trans |> gather() |>
    ggplot(aes(x = value)) +
    geom_histogram(bins=30, fill = "#516c94") +
    labs(
      title=paste0(i, ' Transformed Histogram'),
      subtitle = paste0('Lambda: ', round(bc$lambda, 2), ', Skewness: ', round(fp_skew, 2))) +
    theme_minimal()
  
  if(!is.na(bc$lambda)){
    print(og_plot + trans_plot)
  }
}

bc_lambda_df = data.frame('Predictor'=colnames(glass_pred), 'Lambda_Value'=bc_lambda_list) |>
  arrange(Lambda_Value)

knit_table(bc_lambda_df, 'Box Cox Predictor Lambda Values')

Box Cox Predictor Lambda Values
Predictor	Lambda_Value
RI	-2.0
Ca	-1.1
Na	-0.1
Al	0.5
Si	2.0
Mg	NA
K	NA
Ba	NA
Fe	NA

The Box-Cox transformations of Na, Al, Si, and Ca appear to have assisted in normalizing their respective values and reducing skewness; however, their distribution remain not entirely symmetric. Box-Cox transformation for Ba, Fe, Mg, and K where not able to be calculated due to these predictors having values of zero.

3.2

The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., left spots, mold growth). The outcome labels consist of 19 distinct classes.

The data can be loaded via:

data(Soybean)
## See ?Soybean for details

knit_table(head(Soybean), 'View Soybean Data Head')

View Soybean Data Head
Class	date	precip	temp	crop.hist	area.dam	sever	seed.tmt	germ	plant.growth	leaves	leaf.marg	leaf.size	stem	lodging	stem.cankers	canker.lesion	fruiting.bodies	ext.decay	fruit.spots
diaporthe-stem-canker	6	2	1	1	1	1	0	0	1	1	2	2	1	1	3	1	1	1	4
diaporthe-stem-canker	4	2	1	2	0	2	1	1	1	1	2	2	1	0	3	1	1	1	4
diaporthe-stem-canker	3	2	1	1	0	2	1	2	1	1	2	2	1	0	3	0	1	1	4
diaporthe-stem-canker	3	2	1	1	0	2	0	1	1	1	2	2	1	0	3	0	1	1	4
diaporthe-stem-canker	6	2	1	2	0	1	0	2	1	1	2	2	1	0	3	1	1	1	4
diaporthe-stem-canker	5	2	1	3	0	1	0	1	1	1	2	2	1	0	3	0	1	1	4

a

Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?

library(nombre)

soy_pred = Soybean |>
  dplyr::select(-Class)

  
sapply(soy_pred, n_distinct)

##            date     plant.stand          precip            temp            hail 
##               8               3               4               4               3 
##       crop.hist        area.dam           sever        seed.tmt            germ 
##               5               5               4               4               4 
##    plant.growth          leaves       leaf.halo       leaf.marg       leaf.size 
##               3               2               4               4               4 
##     leaf.shread       leaf.malf       leaf.mild            stem         lodging 
##               3               3               4               3               3 
##    stem.cankers   canker.lesion fruiting.bodies       ext.decay        mycelium 
##               5               5               3               4               3 
##    int.discolor       sclerotia      fruit.pods     fruit.spots            seed 
##               4               3               5               5               3 
##     mold.growth   seed.discolor       seed.size      shriveling           roots 
##               3               3               3               3               4

soy_unique = as.data.frame(sapply(soy_pred, n_distinct))
colnames(soy_unique) = c('unique_cnt')
soy_unique = soy_unique |>
  mutate(unique_percent = round(unique_cnt/nrow(soy_pred)*100, 2))

freq_ratios = c()
for (col in colnames(soy_pred)){
  df_x = data.frame(soy_pred[, col])
  colnames(df_x) = c('column')
  col_ratio = df_x |>
    group_by(column) |>
    summarise(cnt = n()) |>
    arrange(desc(cnt)) |>
    mutate(nrow = nom_ord(row_number())) |>
    dplyr::select(cnt, nrow) |>
    pivot_wider(names_from = nrow, values_from = cnt) |>
    mutate(ratio = first/second) |>
    pull(ratio)
  freq_ratios = c(freq_ratios, col_ratio)
}
freq_ratios

##  [1]  1.137405  1.208191  4.098214  1.879397  3.425197  1.004587  1.213904
##  [8]  1.651282  1.373874  1.103627  1.951327  7.870130  1.547511  1.615385
## [15]  1.479638  4.870000  6.595238  4.953704  1.253378  4.297521  1.984293
## [22]  1.807910  4.462264  3.681481 16.815789 13.204545 16.447368  3.130769
## [29]  3.254717  4.139130  5.695652  4.839623  5.782609  5.084906  6.406977

soy_freq_dist_df = cbind(soy_unique, freq_ratios)

knit_table(soy_freq_dist_df, 'Spybean Predictor Stats')

Spybean Predictor Stats
	unique_cnt	unique_percent	freq_ratios
date	8	1.17	1.137405
plant.stand	3	0.44	1.208191
precip	4	0.59	4.098214
temp	4	0.59	1.879397
hail	3	0.44	3.425197
crop.hist	5	0.73	1.004587
area.dam	5	0.73	1.213904
sever	4	0.59	1.651282
seed.tmt	4	0.59	1.373874
germ	4	0.59	1.103627
plant.growth	3	0.44	1.951327
leaves	2	0.29	7.870130
leaf.halo	4	0.59	1.547511
leaf.marg	4	0.59	1.615385
leaf.size	4	0.59	1.479638
leaf.shread	3	0.44	4.870000
leaf.malf	3	0.44	6.595238
leaf.mild	4	0.59	4.953704
stem	3	0.44	1.253378
lodging	3	0.44	4.297521
stem.cankers	5	0.73	1.984293
canker.lesion	5	0.73	1.807910
fruiting.bodies	3	0.44	4.462264
ext.decay	4	0.59	3.681481
mycelium	3	0.44	16.815790
int.discolor	4	0.59	13.204546
sclerotia	3	0.44	16.447368
fruit.pods	5	0.73	3.130769
fruit.spots	5	0.73	3.254717
seed	3	0.44	4.139130
mold.growth	3	0.44	5.695652
seed.discolor	3	0.44	4.839623
seed.size	3	0.44	5.782609
shriveling	3	0.44	5.084906
roots	4	0.59	6.406977

None of the categorical predictors have a degenerate distribution. A couple predictors (mycelium and mycelium) are close to having near-zero variance predictors with the fraction of unique values over the sample size being less than 1% and having the ratio of the frequency of the most prevalent value to the frequency of the second most prevalent value around 16.

b

Roughly 18 % of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?

missing_counts <- as.data.frame(colSums(is.na(soy_pred)))
missing_counts$variable <- rownames(missing_counts)
colnames(missing_counts) <- c("count", "variable")

ggplot(missing_counts, aes(x =reorder(variable, count), y = count)) +
  geom_bar(stat = "identity", fill = "#516c94") +
  coord_flip() +
  labs(title = "Missing Value Counts by Predictor", y = "Missing Value Count", x = "Predictor") +
  theme_minimal()

library(reshape2)

## Warning: package 'reshape2' was built under R version 4.5.1

## 
## Attaching package: 'reshape2'

## The following object is masked from 'package:tidyr':
## 
##     smiths

Soybean |>
  group_by(Class) |>
  summarise(across(everything(), ~ sum(is.na(.)) / n()), .groups = "drop") |>
  melt(id.vars = "Class") |>
  ggplot(aes(x=Class, y=variable, fill=value)) +
  geom_tile() +
  scale_fill_gradient2(
    low = "white",
    mid='lightblue',
    high = "darkblue",
    midpoint = 0.5,
    limits = c(0, 1)
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

The predictors sever, seed.tmt, lodging, and hail all have the highest missing value count of 112 which is equal to about 18% of the dataset.

The classes 2-4-d-injury, cyst-nematode, herbicide-injury, diaporthe-pod-&-stem-blight, and phytophthora-rot all have high null counts.

c

Develop a strategy for handling missing data, either by eliminating predictors or imputation.

The high null count predictors with the classes that have high null counts may prove to be beneficial predictors if the null value is treated as its own value. With that, one strategy would be to keep the predictors that have a high null percentage consistently across classes and plan to use a predictive model that can account for missing data. For the other predictors, with high null values, imputation using K-Nearesr Neighbors would be a good place to start given the missing data account for less than 20% of the dataset and the dataset is relativley small.

DATA624 Homework 4

Olivia Azevedo

2026-03-01

Exercises

3.1

a

b

c

3.2

a

b

c