Homework 4 Predictive analytics

Salma Elshahawy

2021-03-06

Github repo | portfolio | Blog

Problem_3.1

The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consists of 214 glass samples labeled as one of several class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe. The data can be accessed via:

#> 'data.frame':    214 obs. of  10 variables:
#>  $ RI  : num  1.52 1.52 1.52 1.52 1.52 ...
#>  $ Na  : num  13.6 13.9 13.5 13.2 13.3 ...
#>  $ Mg  : num  4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
#>  $ Al  : num  1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
#>  $ Si  : num  71.8 72.7 73 72.6 73.1 ...
#>  $ K   : num  0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
#>  $ Ca  : num  8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
#>  $ Ba  : num  0 0 0 0 0 0 0 0 0 0 ...
#>  $ Fe  : num  0 0 0 0 0 0.26 0 0 0 0.11 ...
#>  $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...

a. Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.

I will start by conducting a univariate analysis of each predictor to study its distribution.

RI, Si, Na, Al, and Ca are roughly normally distributed. The remaining variables are either severely skewed with a long tail to the right or, like Mg, bimodal. We can consider normalizing/standardizing the data or transforming the skewed predictors so that they are better behaved when building the model.
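
The visual impression of skew can be backed by a number. The following is a minimal sketch in Python (standard library only) of the sample skewness, the third standardized moment. The function name `sample_skewness` and the toy values are illustrative assumptions; they are not the Glass measurements and not the code used for this report.

```python
def sample_skewness(values):
    """Return the (biased) sample skewness: m3 / m2 ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    if m2 == 0:
        return 0.0  # constant column: no spread, treat as unskewed
    return m3 / m2 ** 1.5

# Toy columns (made-up numbers, not the Glass data).
symmetric = [1.0, 2.0, 3.0, 4.0, 5.0]
right_tailed = [0.0, 0.0, 0.0, 0.1, 0.2, 3.0]

print(sample_skewness(symmetric))     # 0.0: symmetric around the mean
print(sample_skewness(right_tailed))  # positive: long tail to the right
```

A value near zero suggests a symmetric distribution, while a clearly positive value matches the long right tails seen in the histograms.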

The next type of visualization is multivariate, which reveals the relationships between the predictors. We will use a correlation heatmap based on the Pearson correlation method.

Most of the predictors are negatively correlated with each other.
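
For reference, the quantity behind each cell of such a heatmap can be sketched as follows. This is a minimal Python illustration of the Pearson coefficient using only the standard library; the helper names `pearson` and `correlation_matrix` and the toy columns are assumptions made for the example, not the Glass data.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def correlation_matrix(columns):
    """Map every ordered pair of column names to its Pearson correlation."""
    names = list(columns)
    return {(a, b): pearson(columns[a], columns[b]) for a in names for b in names}

# Toy predictors (made-up values).
toy = {
    "x1": [1.0, 2.0, 3.0, 4.0],
    "x2": [2.0, 4.0, 6.0, 8.0],  # rises with x1
    "x3": [8.0, 6.0, 4.0, 2.0],  # falls as x1 rises
}
corr = correlation_matrix(toy)
print(round(corr[("x1", "x2")], 3), round(corr[("x1", "x3")], 3))
```

Values close to -1 or +1 indicate strongly related predictors; a heatmap simply colors this matrix.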

b. Do there appear to be any outliers in the data? Are any predictors skewed?

For skewness: looking back at the univariate plots (histograms), we can see that the majority of the variables are skewed with a long tail to the right. For outliers: these can be identified from the boxplots.

Yes, there are outliers in most of the predictors. As the scatter plot shows, the outliers form a cluster in the 15-20 range for Na and Ca.
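
The points a boxplot draws beyond its whiskers usually follow the 1.5 x IQR rule. The sketch below shows that rule in Python with the standard library. The function name `iqr_outliers` and the toy values are assumptions for illustration, and the quartile convention used here (`method="inclusive"`) may differ slightly from the one a given plotting function uses.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

# Toy column (made-up values) with one point far from the rest.
toy = [8.1, 8.3, 8.4, 8.6, 8.7, 8.9, 9.0, 16.2]
print(iqr_outliers(toy))  # [16.2]
```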

c. Are there any relevant transformations of one or more predictors that might improve the classification model?

Yes. I would consider a Box-Cox transformation or a log transformation of the skewed predictors.
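
One practical detail: both Box-Cox and the plain log need strictly positive inputs, while the `str` output above shows zeros in Ba and Fe. The following minimal Python sketch (standard library only) shows the Box-Cox formula for a fixed lambda next to `log(1 + x)`, which tolerates zeros. The function names and toy values are assumptions for illustration; in practice lambda is usually estimated from the data rather than fixed by hand.

```python
import math

def box_cox(values, lam):
    """Box-Cox transform for strictly positive values and a fixed lambda."""
    if any(v <= 0 for v in values):
        raise ValueError("Box-Cox needs strictly positive values")
    if lam == 0:
        return [math.log(v) for v in values]        # limiting case: plain log
    return [(v ** lam - 1.0) / lam for v in values]

def log1p_transform(values):
    """log(1 + x): a log-type transform that is defined at zero."""
    return [math.log1p(v) for v in values]

positive = [0.5, 1.0, 2.0, 8.0]    # made-up, strictly positive
with_zeros = [0.0, 0.0, 0.1, 3.0]  # made-up, contains zeros

print(box_cox(positive, 0))        # lambda = 0 is the log transform
print(box_cox(positive, 0.5))      # lambda = 0.5 is a shifted, scaled square root
print(log1p_transform(with_zeros))
```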


Problem_3.2

The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes. The data can be loaded via:

#> 'data.frame':    683 obs. of  36 variables:
#>  $ Class          : Factor w/ 19 levels "2-4-d-injury",..: 11 11 11 11 11 11 11 11 11 11 ...
#>  $ date           : Factor w/ 7 levels "0","1","2","3",..: 7 5 4 4 7 6 6 5 7 5 ...
#>  $ plant.stand    : Ord.factor w/ 2 levels "0"<"1": 1 1 1 1 1 1 1 1 1 1 ...
#>  $ precip         : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
#>  $ temp           : Ord.factor w/ 3 levels "0"<"1"<"2": 2 2 2 2 2 2 2 2 2 2 ...
#>  $ hail           : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 1 ...
#>  $ crop.hist      : Factor w/ 4 levels "0","1","2","3": 2 3 2 2 3 4 3 2 4 3 ...
#>  $ area.dam       : Factor w/ 4 levels "0","1","2","3": 2 1 1 1 1 1 1 1 1 1 ...
#>  $ sever          : Factor w/ 3 levels "0","1","2": 2 3 3 3 2 2 2 2 2 3 ...
#>  $ seed.tmt       : Factor w/ 3 levels "0","1","2": 1 2 2 1 1 1 2 1 2 1 ...
#>  $ germ           : Ord.factor w/ 3 levels "0"<"1"<"2": 1 2 3 2 3 2 1 3 2 3 ...
#>  $ plant.growth   : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
#>  $ leaves         : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
#>  $ leaf.halo      : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
#>  $ leaf.marg      : Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
#>  $ leaf.size      : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
#>  $ leaf.shread    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
#>  $ leaf.malf      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
#>  $ leaf.mild      : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
#>  $ stem           : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
#>  $ lodging        : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 2 1 1 1 ...
#>  $ stem.cankers   : Factor w/ 4 levels "0","1","2","3": 4 4 4 4 4 4 4 4 4 4 ...
#>  $ canker.lesion  : Factor w/ 4 levels "0","1","2","3": 2 2 1 1 2 1 2 2 2 2 ...
#>  $ fruiting.bodies: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
#>  $ ext.decay      : Factor w/ 3 levels "0","1","2": 2 2 2 2 2 2 2 2 2 2 ...
#>  $ mycelium       : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
#>  $ int.discolor   : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
#>  $ sclerotia      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
#>  $ fruit.pods     : Factor w/ 4 levels "0","1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
#>  $ fruit.spots    : Factor w/ 4 levels "0","1","2","4": 4 4 4 4 4 4 4 4 4 4 ...
#>  $ seed           : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
#>  $ mold.growth    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
#>  $ seed.discolor  : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
#>  $ seed.size      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
#>  $ shriveling     : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
#>  $ roots          : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...

a. Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?

#>   date plant.stand precip temp hail crop.hist area.dam sever seed.tmt germ
#> 1    6           0      2    1    0         1        1     1        0    0
#> 2    4           0      2    1    0         2        0     2        1    1
#> 3    3           0      2    1    0         1        0     2        1    2
#> 4    3           0      2    1    0         1        0     2        0    1
#> 5    6           0      2    1    0         2        0     1        0    2
#> 6    5           0      2    1    0         3        0     1        0    1
#>   plant.growth leaves leaf.halo leaf.marg leaf.size leaf.shread leaf.malf
#> 1            1      1         0         2         2           0         0
#> 2            1      1         0         2         2           0         0
#> 3            1      1         0         2         2           0         0
#> 4            1      1         0         2         2           0         0
#> 5            1      1         0         2         2           0         0
#> 6            1      1         0         2         2           0         0
#>   leaf.mild stem lodging stem.cankers canker.lesion fruiting.bodies ext.decay
#> 1         0    1       1            3             1               1         1
#> 2         0    1       0            3             1               1         1
#> 3         0    1       0            3             0               1         1
#> 4         0    1       0            3             0               1         1
#> 5         0    1       0            3             1               1         1
#> 6         0    1       0            3             0               1         1
#>   mycelium int.discolor sclerotia fruit.pods fruit.spots seed mold.growth
#> 1        0            0         0          0           4    0           0
#> 2        0            0         0          0           4    0           0
#> 3        0            0         0          0           4    0           0
#> 4        0            0         0          0           4    0           0
#> 5        0            0         0          0           4    0           0
#> 6        0            0         0          0           4    0           0
#>   seed.discolor seed.size shriveling roots
#> 1             0         0          0     0
#> 2             0         0          0     0
#> 3             0         0          0     0
#> 4             0         0          0     0
#> 5             0         0          0     0
#> 6             0         0          0     0
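
The printed rows show the raw factor levels only. A common rule of thumb flags a predictor as degenerate (near-zero variance) when the ratio of its most frequent to second most frequent level is large and the share of unique values is small. The sketch below illustrates that check in Python with the standard library. The cutoffs of 20 and 10% are assumed rule-of-thumb defaults, and the function names and toy columns are hypothetical.

```python
from collections import Counter

def frequency_summary(values):
    """Return (frequency ratio, percent unique) for one column, ignoring None."""
    observed = [v for v in values if v is not None]
    counts = Counter(observed).most_common()
    top = counts[0][1]
    second = counts[1][1] if len(counts) > 1 else 0
    freq_ratio = top / second if second else float("inf")
    pct_unique = 100.0 * len(counts) / len(observed)
    return freq_ratio, pct_unique

def is_degenerate(values, ratio_cut=20.0, unique_cut=10.0):
    """Flag a column whose top level dominates and that has few distinct levels."""
    freq_ratio, pct_unique = frequency_summary(values)
    return freq_ratio > ratio_cut and pct_unique < unique_cut

# Toy columns (made-up): one nearly constant, one balanced.
nearly_constant = ["0"] * 97 + ["1"] * 3
balanced = ["0"] * 55 + ["1"] * 45

print(is_degenerate(nearly_constant))  # True: ratio 97/3 exceeds 20, 2% unique
print(is_degenerate(balanced))         # False: ratio 55/45 is close to 1
```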

b. Roughly 18% of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?

I will start by counting the missing values for each predictor in the Soybean data.

#>            hail           sever        seed.tmt         lodging            germ 
#>             121             121             121             121             112 
#>       leaf.mild fruiting.bodies     fruit.spots   seed.discolor      shriveling 
#>             108             106             106             106             106 
#>     leaf.shread            seed     mold.growth       seed.size       leaf.halo 
#>             100              92              92              92              84 
#>       leaf.marg       leaf.size       leaf.malf      fruit.pods          precip 
#>              84              84              84              84              38 
#>    stem.cankers   canker.lesion       ext.decay        mycelium    int.discolor 
#>              38              38              38              38              38 
#>       sclerotia     plant.stand           roots            temp       crop.hist 
#>              38              36              31              30              16 
#>    plant.growth            stem            date        area.dam          leaves 
#>              16              16               1               1               0

Predictor         2-4-d-injury  cyst-nematode  diaporthe-pod-&-stem-blight  herbicide-injury  phytophthora-rot
hail                        16             14                           15                 8                68
sever                       16             14                           15                 8                68
seed.tmt                    16             14                           15                 8                68
lodging                     16             14                           15                 8                68
germ                        16             14                            6                 8                68
leaf.mild                   16             14                           15                 8                55
fruiting.bodies             16             14                            0                 8                68
fruit.spots                 16             14                            0                 8                68
seed.discolor               16             14                            0                 8                68
shriveling                  16             14                            0                 8                68
leaf.shread                 16             14                           15                 0                55
seed                        16              0                            0                 8                68
mold.growth                 16              0                            0                 8                68
seed.size                   16              0                            0                 8                68
leaf.halo                    0             14                           15                 0                55
leaf.marg                    0             14                           15                 0                55
leaf.size                    0             14                           15                 0                55
leaf.malf                    0             14                           15                 0                55
fruit.pods                  16              0                            0                 0                68
precip                      16             14                            0                 8                 0
stem.cankers                16             14                            0                 8                 0
canker.lesion               16             14                            0                 8                 0
ext.decay                   16             14                            0                 8                 0
mycelium                    16             14                            0                 8                 0
int.discolor                16             14                            0                 8                 0
sclerotia                   16             14                            0                 8                 0
plant.stand                 16             14                            6                 0                 0
roots                       16              0                           15                 0                 0
temp                        16             14                            0                 0                 0
crop.hist                   16              0                            0                 0                 0
plant.growth                16              0                            0                 0                 0
stem                        16              0                            0                 0                 0
date                         1              0                            0                 0                 0
area.dam                     1              0                            0                 0                 0
leaves                       0              0                            0                 0                 0

The other 14 classes (alternarialeaf-spot, anthracnose, bacterial-blight, bacterial-pustule, brown-spot, brown-stem-rot, charcoal-rot, diaporthe-stem-canker, downy-mildew, frog-eye-leaf-spot, phyllosticta-leaf-spot, powdery-mildew, purple-seed-stain, rhizoctonia-root-rot) have zero missing values for every predictor, so their columns are not shown.

Each number is the count of missing values for the given predictor within the given class.

From this table, it appears that several predictors are missing in the same rows and therefore share the same distribution of missing values across classes. Furthermore, the missing values are concentrated in a few classes, above all phytophthora-rot. For example, of the 121 missing values for the predictor hail, 68 (56%) belong to phytophthora-rot. This indicates “informative missingness”, which can induce significant bias in the model.
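
A per-class breakdown of this kind can be sketched as follows. This is a minimal Python illustration with the standard library, assuming each observation is a dictionary and `None` marks a missing value. The helper name `missing_by_class` and the toy rows are hypothetical; only the final line uses counts taken from the table.

```python
from collections import Counter

def missing_by_class(rows, predictor, class_key="Class"):
    """Count rows where `predictor` is None, grouped by class label."""
    return Counter(r[class_key] for r in rows if r[predictor] is None)

# Toy rows (made-up); None marks a missing value.
rows = [
    {"Class": "phytophthora-rot", "hail": None, "leaves": "1"},
    {"Class": "phytophthora-rot", "hail": None, "leaves": "1"},
    {"Class": "phytophthora-rot", "hail": "0", "leaves": "1"},
    {"Class": "brown-spot", "hail": "0", "leaves": "1"},
    {"Class": "2-4-d-injury", "hail": None, "leaves": "0"},
]

hail_missing = missing_by_class(rows, "hail")
total = sum(hail_missing.values())
for label, count in hail_missing.most_common():
    print(f"{label}: {count} of {total} missing ({100 * count / total:.0f}%)")

# The share quoted in the text, from the reported counts for hail.
print(round(100 * 68 / 121))  # 56
```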

c. Develop a strategy for handling missing data, either by eliminating predictors or imputation.

#> [1] "Eliminated 68 rows."
#> [1] "615 rows remaining."
#> [1] "53 rows still contain missing values."
#> [1] "Filling 1 missing values for feature:  date ."
#> [1] "The most frequent factor of this feature is: 5 , which is 24.27 % of the class."
#> [1] "------------------------------------------------"
#> [1] "Filling 36 missing values for feature:  plant.stand ."
#> [1] "The most frequent factor of this feature is: 0 , which is 61.14 % of the class."
#> [1] "------------------------------------------------"
#> [1] "Filling 38 missing values for feature:  precip ."
#> [1] "The most frequent factor of this feature is: 2 , which is 72.27 % of the class."
#> [1] "------------------------------------------------"
#> [1] "Filling 30 missing values for feature:  temp ."
#> [1] "The most frequent factor of this feature is: 1 , which is 57.09 % of the class."
#> [1] "------------------------------------------------"
#> [1] "Filling 53 missing values for feature:  hail ."
#> [1] "The most frequent factor of this feature is: 0 , which is 77.4 % of the class."
#> [1] "------------------------------------------------"
#> [1] "Filling 16 missing values for feature:  crop.hist ."
#> [1] "The most frequent factor of this feature is: 3 , which is 32.39 % of the class."
#> [1] "------------------------------------------------"
#> [1] "Filling 1 missing values for feature:  area.dam ."
#> [1] "The most frequent factor of this feature is: 3 , which is 30.46 % of the class."
#> [1] "------------------------------------------------"
#> [1] "Filling 53 missing values for feature:  sever ."
#> [1] "The most frequent factor of this feature is: 1 , which is 57.3 % of the class."
#> [1] "------------------------------------------------"
#> [1] "Filling 53 missing values for feature:  seed.tmt ."
#> [1] "The most frequent factor of this feature is: 0 , which is 54.27 % of the class."
#> [1] "------------------------------------------------"
#> [1] "Filling 44 missing values for feature:  germ ."
#> [1] "The most frequent factor of this feature is: 1 , which is 37.3 % of the class."
#> [1] "------------------------------------------------"
#> [1] "Filling 16 missing values for feature:  plant.growth ."
#> [1] "The most frequent factor of this feature is: 0 , which is 73.62 % of the class."
#> [1] "------------------------------------------------"
#> [1] "Filling 0 missing values for feature:  leaves ."
#> [1] "The most frequent factor of this feature is: 1 , which is 87.48 % of the class."
#> [1] "------------------------------------------------"
#> [1] "Filling 29 missing values for feature:  leaf.halo ."
#> [1] "The most frequent factor of this feature is: 2 , which is 58.36 % of the class."
#> [1] "------------------------------------------------"
#> [1] "Filling 29 missing values for feature:  leaf.marg ."
#> [1] "The most frequent factor of this feature is: 0 , which is 60.92 % of the class."
#> [1] "------------------------------------------------"
#> [1] "Filling 29 missing values for feature:  leaf.size ."
#> [1] "The most frequent factor of this feature is: 1 , which is 55.8 % of the class."
#> [1] "------------------------------------------------"
#> [1] "Filling 45 missing values for feature:  leaf.shread ."
#> [1] "The most frequent factor of this feature is: 0 , which is 83.16 % of the class."
#> [1] "------------------------------------------------"
#> [1] "Filling 29 missing values for feature:  leaf.malf ."
#> [1] "The most frequent factor of this feature is: 0 , which is 92.32 % of the class."
#> [1] "------------------------------------------------"
#> [1] "Filling 53 missing values for feature:  leaf.mild ."
#> [1] "The most frequent factor of this feature is: 0 , which is 92.88 % of the class."
#> [1] "------------------------------------------------"
#> [1] "Filling 16 missing values for feature:  stem ."
#> [1] "The most frequent factor of this feature is: 1 , which is 50.58 % of the class."
#> [1] "------------------------------------------------"
#> [1] "Filling 53 missing values for feature:  lodging ."
#> [1] "The most frequent factor of this feature is: 0 , which is 92.53 % of the class."
#> [1] "------------------------------------------------"
#> [1] "Filling 38 missing values for feature:  stem.cankers ."
#> [1] "The most frequent factor of this feature is: 0 , which is 64.64 % of the class."
#> [1] "------------------------------------------------"
#> [1] "Filling 38 missing values for feature:  canker.lesion ."
#> [1] "The most frequent factor of this feature is: 0 , which is 55.46 % of the class."
#> [1] "------------------------------------------------"
#> [1] "Filling 38 missing values for feature:  fruiting.bodies ."
#> [1] "The most frequent factor of this feature is: 0 , which is 81.98 % of the class."
#> [1] "------------------------------------------------"
#> [1] "Filling 38 missing values for feature:  ext.decay ."
#> [1] "The most frequent factor of this feature is: 0 , which is 76.6 % of the class."
#> [1] "------------------------------------------------"
#> [1] "Filling 38 missing values for feature:  mycelium ."
#> [1] "The most frequent factor of this feature is: 0 , which is 98.96 % of the class."
#> [1] "------------------------------------------------"
#> [1] "Filling 38 missing values for feature:  int.discolor ."
#> [1] "The most frequent factor of this feature is: 0 , which is 88.91 % of the class."
#> [1] "------------------------------------------------"
#> [1] "Filling 38 missing values for feature:  sclerotia ."
#> [1] "The most frequent factor of this feature is: 0 , which is 96.53 % of the class."
#> [1] "------------------------------------------------"
#> [1] "Filling 16 missing values for feature:  fruit.pods ."
#> [1] "The most frequent factor of this feature is: 0 , which is 67.95 % of the class."
#> [1] "------------------------------------------------"
#> [1] "Filling 38 missing values for feature:  fruit.spots ."
#> [1] "The most frequent factor of this feature is: 0 , which is 59.79 % of the class."
#> [1] "------------------------------------------------"
#> [1] "Filling 24 missing values for feature:  seed ."
#> [1] "The most frequent factor of this feature is: 0 , which is 80.54 % of the class."
#> [1] "------------------------------------------------"
#> [1] "Filling 24 missing values for feature:  mold.growth ."
#> [1] "The most frequent factor of this feature is: 0 , which is 88.66 % of the class."
#> [1] "------------------------------------------------"
#> [1] "Filling 38 missing values for feature:  seed.discolor ."
#> [1] "The most frequent factor of this feature is: 0 , which is 88.91 % of the class."
#> [1] "------------------------------------------------"
#> [1] "Filling 24 missing values for feature:  seed.size ."
#> [1] "The most frequent factor of this feature is: 0 , which is 90.02 % of the class."
#> [1] "------------------------------------------------"
#> [1] "Filling 38 missing values for feature:  shriveling ."
#> [1] "The most frequent factor of this feature is: 0 , which is 93.41 % of the class."
#> [1] "------------------------------------------------"
#> [1] "Filling 31 missing values for feature:  roots ."
#> [1] "The most frequent factor of this feature is: 0 , which is 94.35 % of the class."
#> [1] "------------------------------------------------"
#> [1] "There are now 615 rows. 0 rows have missing values."
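
Reading the log above, the strategy appears to have two steps: first drop 68 rows, then fill each remaining gap with the most frequent level of that predictor. The following is a minimal Python sketch of that idea using only the standard library. Which rows to drop is an assumption here (incomplete rows of one named class), and the helper names and toy rows are hypothetical rather than the code that produced the log.

```python
from collections import Counter

def drop_incomplete_rows_of_class(rows, label, class_key="Class"):
    """Drop rows of class `label` that contain at least one missing (None) value."""
    return [r for r in rows
            if not (r[class_key] == label and any(v is None for v in r.values()))]

def impute_mode(rows, predictors):
    """Fill None with the most frequent observed level of each predictor."""
    filled = [dict(r) for r in rows]  # copy so the input rows stay untouched
    for p in predictors:
        observed = [r[p] for r in filled if r[p] is not None]
        if not observed:
            continue  # no observed level to learn a mode from
        mode = Counter(observed).most_common(1)[0][0]
        for r in filled:
            if r[p] is None:
                r[p] = mode
    return filled

# Toy rows (made-up); None marks a missing value.
rows = [
    {"Class": "phytophthora-rot", "hail": None, "temp": None},
    {"Class": "phytophthora-rot", "hail": "0", "temp": "1"},
    {"Class": "2-4-d-injury", "hail": None, "temp": "1"},
    {"Class": "brown-spot", "hail": "0", "temp": "2"},
    {"Class": "brown-spot", "hail": "1", "temp": "1"},
]

kept = drop_incomplete_rows_of_class(rows, "phytophthora-rot")
complete = impute_mode(kept, ["hail", "temp"])
print(len(rows) - len(kept), "rows dropped;", len(complete), "rows remain")
print(sum(v is None for r in complete for v in r.values()), "missing values left")
```

Filling with the overall mode ignores the class-related pattern of missingness noted earlier, so it is best read as a simple baseline.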
