Data Pre-processing exercises

Chapter 3 KJ 1, 2

3.1. Description of Glass dataset

A data frame with 214 observations containing examples of the chemical analysis of 7 different types of glass. The problem is to forecast the type of glass on the basis of the chemical analysis. The study of the classification of glass types was motivated by criminological investigation: glass left at the scene of a crime can be used as evidence (if it is correctly identified!).

Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.

There are 214 glass samples in total, with no missing values for any of the predictor variables. Based on their histograms and skewness, the predictors RI, Na, Al, Si and Ca display either an approximately normal distribution or a distribution that could be transformed toward normality (i.e., division by sqrt(s)). The remaining predictors, Mg, K, Ba and Fe, each show a concentration of observations at exactly 0.
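
The two quantities used in the paragraph above, skewness and the share of zero values, can be computed per predictor. The following is a minimal sketch, not the code used for this report: the helper names `skewness_g1` and `prop_zero` and the toy vector are made up for illustration, and the moment-based skewness used here may differ slightly from the `skew` column of the summary table because of small-sample corrections.

```r
# Moment-based sample skewness g1 = m3 / m2^(3/2). Other definitions
# differ from this one by a small-sample correction factor.
skewness_g1 <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Share of non-missing observations that are exactly zero.
prop_zero <- function(x) mean(x == 0, na.rm = TRUE)

# Toy vector (made up for illustration): mostly zeros with a right tail.
toy <- c(0, 0, 0, 0, 0, 0, 0.2, 0.5, 1.1, 3.0)
toy_skew <- skewness_g1(toy)
toy_zero <- prop_zero(toy)

# Applied to the Glass predictors, if the mlbench package is installed.
if (requireNamespace("mlbench", quietly = TRUE)) {
  data(Glass, package = "mlbench")
  predictors <- Glass[, names(Glass) != "Type"]
  print(round(sapply(predictors, skewness_g1), 3))
  print(round(sapply(predictors, prop_zero), 3))
}
```

A large positive skewness together with a high share of zeros is the pattern described for Ba and Fe.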

Do there appear to be any outliers in the data? Are any predictors skewed?

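The existence of a concentration of values at 0 does not, without additional information, indicate invalid measurements. Discarding these data or imputing replacement values would therefore reduce the predictive accuracy of any model built on them. As for skewness, the summary table below shows K (6.46), Ba (3.369) and Ca (2.018) as the most right-skewed predictors, while Mg (-1.136) is left-skewed.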

Are there any relevant transformations of one or more predictors that might improve the classification model?

A better way to handle the predictors with a concentration of values at 0 is a zero-inflated (two-part) model for continuous data: a binary component describes whether a value is 0, and a continuous component describes the nonzero values.
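
One possible reading of the zero-inflated idea above is a two-part encoding of each affected predictor: a 0/1 indicator for "nonzero" plus the value itself where it is positive. The sketch below assumes that reading. The function name `two_part_encode` and the toy vector are hypothetical, and this is not a fitted zero-inflated model, only the data preparation such a model would start from.

```r
# Split a non-negative measurement into a binary part (is it nonzero?)
# and a continuous part (the value, defined only where it is positive).
two_part_encode <- function(x) {
  data.frame(
    is_nonzero     = as.integer(x > 0),
    positive_value = ifelse(x > 0, x, NA_real_)
  )
}

# Toy vector (made up for illustration).
toy_encoded <- two_part_encode(c(0, 0, 0.5, 1.2))

# Applied to Ba from the Glass data, if the mlbench package is installed.
if (requireNamespace("mlbench", quietly = TRUE)) {
  data(Glass, package = "mlbench")
  print(head(two_part_encode(Glass$Ba)))
}
```

The indicator column can enter a classifier directly, while the positive part can be transformed or modelled separately.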

The two most strongly correlated predictors are RI and Ca (r = 0.8104). In a multivariable regression model, one of these explanatory variables could therefore be removed with little to no loss of predictive ability, because it is strongly collinear with the other.
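
The search for the most strongly correlated pair can be sketched in base R as follows. The helper name `top_correlated_pair` and the toy data frame are made up for illustration. This is one way to do it, not necessarily how the correlation matrix in this report was inspected.

```r
# Return the pair of columns with the largest absolute pairwise
# correlation, together with that correlation.
top_correlated_pair <- function(df) {
  cm <- cor(df, use = "pairwise.complete.obs")
  diag(cm) <- 0                               # ignore self-correlations
  idx <- arrayInd(which.max(abs(cm)), dim(cm))
  list(
    pair = c(rownames(cm)[idx[1, 1]], colnames(cm)[idx[1, 2]]),
    r    = unname(cm[idx[1, 1], idx[1, 2]])
  )
}

# Toy data (made up): column c tracks column a almost exactly.
toy <- data.frame(
  a = 1:6,
  b = c(2, 1, 4, 3, 6, 5),
  c = c(1.1, 2.0, 2.9, 4.2, 5.0, 6.1)
)
toy_top <- top_correlated_pair(toy)

# Applied to the Glass predictors, if the mlbench package is installed.
if (requireNamespace("mlbench", quietly = TRUE)) {
  data(Glass, package = "mlbench")
  print(top_correlated_pair(Glass[, names(Glass) != "Type"]))
}
```

The order of the two names in `pair` depends on how the matrix is scanned, so it is best treated as an unordered pair.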

library(mlbench)  # provides the Glass data set
library(psych)    # provides describe()
data(Glass)
#pandoc.table(describe(Glass), split.tables=Inf, style='rmarkdown')

	vars	n	mean	sd	median	trimmed	mad	min	max	range	skew	kurtosis	se
RI	1	214	1.518	0.003037	1.518	1.518	0.001875	1.511	1.534	0.02278	1.603	4.717	0.0002076
Na	2	214	13.41	0.8166	13.3	13.38	0.6449	10.73	17.38	6.65	0.4478	2.898	0.05582
Mg	3	214	2.685	1.442	3.48	2.866	0.3039	0	4.49	4.49	-1.136	-0.4527	0.0986
Al	4	214	1.445	0.4993	1.36	1.412	0.3113	0.29	3.5	3.21	0.8946	1.938	0.03413
Si	5	214	72.65	0.7745	72.79	72.71	0.5708	69.81	75.41	5.6	-0.7202	2.816	0.05295
K	6	214	0.4971	0.6522	0.555	0.4318	0.1705	0	6.21	6.21	6.46	52.87	0.04458
Ca	7	214	8.957	1.423	8.6	8.742	0.6598	5.43	16.19	10.76	2.018	6.41	0.09728
Ba	8	214	0.175	0.4972	0	0.03378	0	0	3.15	3.15	3.369	12.08	0.03399
Fe	9	214	0.05701	0.09744	0	0.03581	0	0	0.51	0.51	1.73	2.52	0.006661
Type*	10	214	2.542	1.708	2	2.308	1.483	1	6	5	1.038	-0.2871	0.1167

Correlation Matrix

	RI	Na	Mg	Al	Si	K	Ca	Ba	Fe
RI	1	-0.1919	-0.1223	-0.4073	-0.5421	-0.2898	0.8104	-0.000386	0.143
Na	-0.1919	1	-0.2737	0.1568	-0.06981	-0.2661	-0.2754	0.3266	-0.2413
Mg	-0.1223	-0.2737	1	-0.4818	-0.1659	0.005396	-0.4438	-0.4923	0.08306
Al	-0.4073	0.1568	-0.4818	1	-0.005524	0.326	-0.2596	0.4794	-0.0744
Si	-0.5421	-0.06981	-0.1659	-0.005524	1	-0.1933	-0.2087	-0.1022	-0.0942
K	-0.2898	-0.2661	0.005396	0.326	-0.1933	1	-0.3178	-0.04262	-0.007719
Ca	0.8104	-0.2754	-0.4438	-0.2596	-0.2087	-0.3178	1	-0.1128	0.125
Ba	-0.000386	0.3266	-0.4923	0.4794	-0.1022	-0.04262	-0.1128	1	-0.05869
Fe	0.143	-0.2413	0.08306	-0.0744	-0.0942	-0.007719	0.125	-0.05869	1

## integer(0)

3.2. Description of Soybean dataset

There are 19 classes, only the first 15 of which have been used in prior work. The folklore seems to be that the last four classes are unjustified by the data since they have so few examples. There are 35 categorical attributes, some nominal and some ordered. The value “dna” means does not apply. The values for attributes are encoded numerically, with the first value encoded as “0,” the second as “1,” and so forth.

Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?

Based on the histograms of the categorical data, the following predictors are candidates for removal as degenerate, because each has a heavy concentration of observations at one value and only sparse occurrences of the others:

int.discolor, leaf.malf, leaf.mild, leaf.shread, leaves, lodging, mold.growth, mycelium, roots, sclerotia, seed, seed.discolor, seed.size and shriveling.
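
The "over-concentration" judgment can be made numerically with two simple metrics: the ratio of the most common value's count to the second most common, and the number of distinct values as a percentage of the sample. The sketch below computes both. The function name `degeneracy_metrics` and the toy factor are made up for illustration, and any cutoff applied to these metrics (for example a frequency ratio above roughly 20 with fewer than 10% unique values) is a rule of thumb rather than a fixed standard.

```r
# Frequency ratio: count of the most common value divided by the count
# of the second most common. Percent unique: distinct observed values
# as a percentage of the non-missing observations.
degeneracy_metrics <- function(x) {
  counts <- sort(as.vector(table(x)), decreasing = TRUE)  # NA excluded
  counts <- counts[counts > 0]                            # drop unused levels
  freq_ratio <- if (length(counts) < 2) Inf else counts[1] / counts[2]
  c(freq_ratio = freq_ratio,
    pct_unique = 100 * length(counts) / sum(counts))
}

# Toy factor (made up): 97 "absent" against 3 "present".
toy <- factor(c(rep("absent", 97), rep("present", 3)))
toy_metrics <- degeneracy_metrics(toy)

# Applied to the Soybean predictors, if the mlbench package is installed.
if (requireNamespace("mlbench", quietly = TRUE)) {
  data(Soybean, package = "mlbench")
  m <- t(sapply(Soybean[, names(Soybean) != "Class"], degeneracy_metrics))
  print(round(m[order(-m[, "freq_ratio"]), ], 2))
}
```

Predictors at the top of the sorted output are the strongest candidates for the degenerate list.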

Roughly 18% of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?

There are 683 total observations. The attributes below contain the most missing data; each is listed with its count of non-missing observations, from fewest to most (that is, from most to least missing). A possible cause is that a missing value records a non-observation. This would make sense for hail, mold, seed discoloration, shriveling, fruit attributes and so on: if something does not exist, it cannot be observed.

hail* 562, sever* 562, seed.tmt* 562, lodging* 562, germ* 571, leaf.mild* 575, fruiting.bodies* 577, fruit.spots* 577, seed.discolor* 577, shriveling* 577, leaf.shread* 583, seed* 591, mold.growth* 591, seed.size* 591, leaf.halo* 599, leaf.marg* 599, leaf.size* 599, leaf.malf* 599, fruit.pods* 599
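
Whether the missing values cluster in particular predictors or particular classes can be checked with a pair of tabulations. The sketch below is a minimal illustration: the helper names `na_per_column` and `na_share_by_class` and the toy data frame are made up, and no results for Soybean are asserted here.

```r
# Number of missing values in each column.
na_per_column <- function(df) colSums(is.na(df))

# For each class, the share of rows with at least one missing predictor.
na_share_by_class <- function(df, class_col) {
  predictors <- df[, names(df) != class_col, drop = FALSE]
  incomplete <- !complete.cases(predictors)
  tapply(incomplete, df[[class_col]], mean)
}

# Toy data (made up): all missing values fall in class "a".
toy <- data.frame(
  Class = factor(c("a", "a", "b", "b")),
  x     = c(1, NA, 3, 4),
  y     = c(NA, NA, "u", "v")
)
toy_na_cols  <- na_per_column(toy)
toy_na_class <- na_share_by_class(toy, "Class")

# Applied to Soybean, if the mlbench package is installed.
if (requireNamespace("mlbench", quietly = TRUE)) {
  data(Soybean, package = "mlbench")
  print(sort(na_per_column(Soybean), decreasing = TRUE))
  print(round(na_share_by_class(Soybean, "Class"), 2))
}
```

Classes whose share is close to 1 would indicate that missingness is tied to the class rather than spread evenly across the data.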

Develop a strategy for handling missing data, either by eliminating predictors or imputation.

All of the predictor variables in this data set contain enough observations to be useful in a predictive model. A data-cleanup strategy suitable for a regression model would be as follows:

Remove the degenerate variables identified in part (a).
Remaining variables whose missing values can be considered a “non-observation”, such as hail, can be coded to zero.
Variables with data missing for unknown reasons should be imputed from the other variables in the observation using the predict function.
Outlier records should be explained and perhaps removed.
Once all non-degenerate variables have been assigned, a regression model can be developed and collinear variables can be systematically eliminated until the simplest explanatory model is left.
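
The first three steps of the strategy above can be sketched as small helpers. This is an illustration under stated assumptions, not the cleanup actually performed for this report: the function names `drop_columns`, `na_to_zero` and `impute_with_model` are hypothetical, the toy data are made up, and the imputation example uses `lm` on a numeric column only to show the fit-then-`predict` pattern. For the categorical Soybean predictors a classification model would have to be substituted.

```r
# Step 1: drop columns judged degenerate (names supplied by the analyst).
drop_columns <- function(df, cols) {
  df[, !(names(df) %in% cols), drop = FALSE]
}

# Step 2: recode NA as zero, for variables where a missing value is read
# as a non-observation. Factors get "0" as a level if they lack it.
na_to_zero <- function(x) {
  if (is.factor(x)) {
    if (!("0" %in% levels(x))) levels(x) <- c(levels(x), "0")
    x[is.na(x)] <- "0"
  } else {
    x[is.na(x)] <- 0
  }
  x
}

# Step 3: fit a model on rows where the target is known, then fill the
# gaps with predict(). Rows whose own predictors are NA stay NA.
impute_with_model <- function(df, target, formula, fit = lm) {
  known <- !is.na(df[[target]])
  model <- fit(formula, data = df[known, , drop = FALSE])
  df[[target]][!known] <- predict(model, newdata = df[!known, , drop = FALSE])
  df
}

# Toy examples (made up for illustration).
toy_dropped <- drop_columns(data.frame(a = 1:2, b = 3:4, c = 5:6), "b")
toy_hail    <- na_to_zero(factor(c("0", "1", NA, "1")))
toy_imputed <- impute_with_model(
  data.frame(x = 1:5, y = c(2, 4, NA, 8, 10)), "y", y ~ x
)
```

Keeping each step as a separate function makes it easy to apply steps 1 and 2 to named columns first and to reserve model-based imputation for the variables that remain incomplete.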

library(mlbench)  # provides the Soybean data set
library(psych)    # provides describe()
data(Soybean)
#pandoc.table(describe(Soybean), split.tables=Inf, style='rmarkdown')

	vars	n	mean	sd	median	trimmed	mad	min	max	range	skew	kurtosis	se
Class*	1	683	9.296	5.511	8	9.179	7.413	1	19	18	0.113	-1.379	0.2109
date*	2	682	4.554	1.694	5	4.615	1.483	1	7	6	-0.304	-0.9045	0.06487
plant.stand*	3	647	1.453	0.4982	1	1.441	0	1	2	1	0.189	-1.967	0.01958
precip*	4	645	2.597	0.6861	3	2.745	0	1	3	2	-1.416	0.5502	0.02702
temp*	5	653	2.182	0.6282	2	2.228	0	1	3	2	-0.1583	-0.5843	0.02458
hail*	6	562	1.226	0.4186	1	1.158	0	1	2	1	1.307	-0.2925	0.01766
crop.hist*	7	667	2.885	0.9758	3	2.978	1.483	1	4	3	-0.3976	-0.9188	0.03778
area.dam*	8	682	2.581	1.074	2	2.601	1.483	1	4	3	0.01799	-1.286	0.04114
sever*	9	562	1.733	0.597	2	1.691	0	1	3	2	0.1739	-0.5648	0.02518
seed.tmt*	10	562	1.52	0.6122	1	1.447	0	1	3	2	0.7397	-0.4397	0.02583
germ*	11	571	2.049	0.791	2	2.061	1.483	1	3	2	-0.08681	-1.4	0.0331
plant.growth*	12	667	1.339	0.4737	1	1.299	0	1	2	1	0.6795	-1.541	0.01834
leaves*	13	683	1.887	0.3165	2	1.984	0	1	2	1	-2.444	3.977	0.01211
leaf.halo*	14	599	2.202	0.949	3	2.252	0	1	3	2	-0.4108	-1.765	0.03878
leaf.marg*	15	599	1.773	0.9565	1	1.717	0	1	3	2	0.4648	-1.747	0.03908
leaf.size*	16	599	2.284	0.6117	2	2.337	0	1	3	2	-0.2495	-0.6294	0.02499
leaf.shread*	17	583	1.165	0.3712	1	1.081	0	1	2	1	1.804	1.255	0.01537
leaf.malf*	18	599	1.075	0.2638	1	1	0	1	2	1	3.216	8.354	0.01078
leaf.mild*	19	575	1.104	0.4041	1	1	0	1	3	2	3.953	14.68	0.01685
stem*	20	667	1.556	0.4972	2	1.57	0	1	2	1	-0.2258	-1.952	0.01925
lodging*	21	562	1.075	0.2632	1	1	0	1	2	1	3.226	8.421	0.0111
stem.cankers*	22	645	2.06	1.352	1	1.952	0	1	4	3	0.6098	-1.509	0.05322
canker.lesion*	23	645	1.98	1.084	2	1.851	1.483	1	4	3	0.5146	-1.238	0.04268
fruiting.bodies*	24	577	1.18	0.3847	1	1.102	0	1	2	1	1.659	0.7549	0.01602
ext.decay*	25	645	1.25	0.4775	1	1.162	0	1	3	2	1.695	1.975	0.0188
mycelium*	26	645	1.009	0.09607	1	1	0	1	2	1	10.2	102.2	0.003783
int.discolor*	27	645	1.13	0.419	1	1	0	1	3	2	3.339	10.57	0.0165
sclerotia*	28	645	1.031	0.1735	1	1	0	1	2	1	5.399	27.19	0.00683
fruit.pods*	29	599	1.504	0.8825	1	1.283	0	1	4	3	1.838	2.413	0.03606
fruit.spots*	30	577	1.847	1.17	1	1.687	0	1	4	3	0.9465	-0.7574	0.04871
seed*	31	591	1.195	0.3962	1	1.118	0	1	2	1	1.539	0.3693	0.0163
mold.growth*	32	591	1.113	0.3173	1	1.017	0	1	2	1	2.433	3.925	0.01305
seed.discolor*	33	577	1.111	0.3143	1	1.015	0	1	2	1	2.472	4.116	0.01308
seed.size*	34	591	1.1	0.3	1	1	0	1	2	1	2.663	5.1	0.01234
shriveling*	35	577	1.066	0.2482	1	1	0	1	2	1	3.492	10.21	0.01033
roots*	36	652	1.178	0.4388	1	1.069	0	1	3	2	2.458	5.486	0.01719

## Warning in FUN(X[[i]], ...): NAs introduced by coercion

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 3020 rows containing non-finite values (stat_bin).

Data Pre-processing exercises - IS624 HW2