3.1

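library(mlbench)    # provides the Glass and Soybean data sets
library(tidyverse)  # dplyr, tidyr and ggplot2 are used below
data(Glass)
head(Glass)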
##        RI    Na   Mg   Al    Si    K   Ca Ba   Fe Type
## 1 1.52101 13.64 4.49 1.10 71.78 0.06 8.75  0 0.00    1
## 2 1.51761 13.89 3.60 1.36 72.73 0.48 7.83  0 0.00    1
## 3 1.51618 13.53 3.55 1.54 72.99 0.39 7.78  0 0.00    1
## 4 1.51766 13.21 3.69 1.29 72.61 0.57 8.22  0 0.00    1
## 5 1.51742 13.27 3.62 1.24 73.08 0.55 8.07  0 0.00    1
## 6 1.51596 12.79 3.61 1.62 72.97 0.64 8.07  0 0.26    1

a)

Roughly normally distributed: Al, Ca
Right skewed: Fe, Ba, RI, Na, K
Left skewed: Si, Mg

variables = Glass%>%
  select(-Type)%>%
  names()

# one histogram per predictor, each panel with its own axis scales
Glass%>%
  pivot_longer(all_of(variables))%>%
  ggplot(aes(x=value))+
  geom_histogram(bins=30)+
  facet_wrap(~name, scales = 'free')
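The visual classification of skew above can be cross-checked numerically. The following is a minimal sketch, assuming a moment-based skewness estimate is acceptable; `sample_skewness` is a hypothetical helper introduced here for illustration and is not part of the original analysis.

```r
library(mlbench)  # provides the Glass data set
data(Glass)

# Hypothetical helper: moment-based sample skewness
# (third central moment divided by the 1.5 power of the second)
sample_skewness <- function(x) {
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

predictors <- Glass[, names(Glass) != "Type"]
skew <- sapply(predictors, sample_skewness)

# Positive values suggest right skew, negative values suggest left skew
round(sort(skew), 2)
```

Values near zero point to a roughly symmetric distribution, so the sorted output gives a quick way to compare against the groups listed above.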

# scatter plots show correlation between predictor variables
Glass%>%
  select(-Type)%>%
  pairs()
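Since a scatter plot matrix can be hard to read with nine predictors, the pairwise relationships could also be summarised as a ranked table of correlations. This is a sketch only, assuming Pearson correlation is a reasonable summary here; `pairs_df` is a name introduced for illustration.

```r
library(mlbench)  # provides the Glass data set
data(Glass)

predictors <- Glass[, names(Glass) != "Type"]
cor_mat <- cor(predictors)  # Pearson correlation by default

# Collect each unordered pair of predictors once (upper triangle only)
upper <- which(upper.tri(cor_mat), arr.ind = TRUE)
pairs_df <- data.frame(
  var1 = rownames(cor_mat)[upper[, "row"]],
  var2 = colnames(cor_mat)[upper[, "col"]],
  correlation = cor_mat[upper]
)

# Strongest relationships first, regardless of sign
head(pairs_df[order(-abs(pairs_df$correlation)), ])
```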

b)

Apart from Al and Ca, the predictors are noticeably skewed.

Based on the boxplots below, outliers exist for every variable except Mg.

# one boxplot per predictor, each panel with its own axis scale
Glass%>%
  pivot_longer(all_of(variables))%>%
  ggplot(aes(x=value))+
  geom_boxplot()+
  facet_wrap(~name, scales = 'free')
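The outliers read off the boxplots can also be counted per predictor. The sketch below uses `boxplot.stats()` from base R, which flags points more than 1.5 times the interquartile range beyond the hinges; this is very close to, but not necessarily identical to, the whisker rule that `geom_boxplot()` applies, so the counts should be treated as approximate.

```r
library(mlbench)  # provides the Glass data set
data(Glass)

predictors <- Glass[, names(Glass) != "Type"]

# Number of points that boxplot.stats() places beyond the whiskers
outlier_counts <- sapply(predictors, function(x) length(boxplot.stats(x)$out))
outlier_counts
```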

c)

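Since most of the variables are skewed, we could apply transformations to bring their distributions closer to normal. A log transformation is the usual choice for right-skewed predictors, although it requires strictly positive values, and Ba and Fe contain zeros (visible in the head() output above). As an example, the plot below applies a log transformation to Si, one of the variables identified as left skewed previously. Note that a log compresses the right tail, so it generally does not correct left skew, and over a range as narrow as Si's it leaves the shape of the histogram largely unchanged. A Box-Cox or Yeo-Johnson transformation would be a more flexible option.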

Glass%>%
  ggplot(aes(x = log(Si)))+geom_histogram(bins=30)
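A numeric check can complement the histogram when judging the effect of a transformation. The sketch below compares the sample skewness of Si before and after taking logs, with values closer to zero indicating a more symmetric distribution; `sample_skewness` is a hypothetical helper defined here for illustration.

```r
library(mlbench)  # provides the Glass data set
data(Glass)

# Hypothetical helper: moment-based sample skewness
sample_skewness <- function(x) {
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

# Si is strictly positive, so log() is defined for every observation
si_skew <- c(
  raw    = sample_skewness(Glass$Si),
  logged = sample_skewness(log(Glass$Si))
)
round(si_skew, 3)
```

The same comparison could be repeated for any other strictly positive predictor by swapping the column name.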

3.2

data(Soybean)
head(Soybean)
##                   Class date plant.stand precip temp hail crop.hist area.dam
## 1 diaporthe-stem-canker    6           0      2    1    0         1        1
## 2 diaporthe-stem-canker    4           0      2    1    0         2        0
## 3 diaporthe-stem-canker    3           0      2    1    0         1        0
## 4 diaporthe-stem-canker    3           0      2    1    0         1        0
## 5 diaporthe-stem-canker    6           0      2    1    0         2        0
## 6 diaporthe-stem-canker    5           0      2    1    0         3        0
##   sever seed.tmt germ plant.growth leaves leaf.halo leaf.marg leaf.size
## 1     1        0    0            1      1         0         2         2
## 2     2        1    1            1      1         0         2         2
## 3     2        1    2            1      1         0         2         2
## 4     2        0    1            1      1         0         2         2
## 5     1        0    2            1      1         0         2         2
## 6     1        0    1            1      1         0         2         2
##   leaf.shread leaf.malf leaf.mild stem lodging stem.cankers canker.lesion
## 1           0         0         0    1       1            3             1
## 2           0         0         0    1       0            3             1
## 3           0         0         0    1       0            3             0
## 4           0         0         0    1       0            3             0
## 5           0         0         0    1       0            3             1
## 6           0         0         0    1       0            3             0
##   fruiting.bodies ext.decay mycelium int.discolor sclerotia fruit.pods
## 1               1         1        0            0         0          0
## 2               1         1        0            0         0          0
## 3               1         1        0            0         0          0
## 4               1         1        0            0         0          0
## 5               1         1        0            0         0          0
## 6               1         1        0            0         0          0
##   fruit.spots seed mold.growth seed.discolor seed.size shriveling roots
## 1           4    0           0             0         0          0     0
## 2           4    0           0             0         0          0     0
## 3           4    0           0             0         0          0     0
## 4           4    0           0             0         0          0     0
## 5           4    0           0             0         0          0     0
## 6           4    0           0             0         0          0     0

a)

From the summary, we can see that many variables contain missing (NA) values. mycelium stands out as a ‘degenerate distribution’ because it takes almost a single value (639 0’s vs 6 1’s).

summary(Soybean)
##                  Class          date     plant.stand  precip      temp    
##  brown-spot         : 92   5      :149   0   :354    0   : 74   0   : 80  
##  alternarialeaf-spot: 91   4      :131   1   :293    1   :112   1   :374  
##  frog-eye-leaf-spot : 91   3      :118   NA's: 36    2   :459   2   :199  
##  phytophthora-rot   : 88   2      : 93               NA's: 38   NA's: 30  
##  anthracnose        : 44   6      : 90                                    
##  brown-stem-rot     : 44   (Other):101                                    
##  (Other)            :233   NA's   :  1                                    
##    hail     crop.hist  area.dam    sever     seed.tmt     germ     plant.growth
##  0   :435   0   : 65   0   :123   0   :195   0   :305   0   :165   0   :441    
##  1   :127   1   :165   1   :227   1   :322   1   :222   1   :213   1   :226    
##  NA's:121   2   :219   2   :145   2   : 45   2   : 35   2   :193   NA's: 16    
##             3   :218   3   :187   NA's:121   NA's:121   NA's:112               
##             NA's: 16   NA's:  1                                                
##                                                                                
##                                                                                
##  leaves  leaf.halo  leaf.marg  leaf.size  leaf.shread leaf.malf  leaf.mild 
##  0: 77   0   :221   0   :357   0   : 51   0   :487    0   :554   0   :535  
##  1:606   1   : 36   1   : 21   1   :327   1   : 96    1   : 45   1   : 20  
##          2   :342   2   :221   2   :221   NA's:100    NA's: 84   2   : 20  
##          NA's: 84   NA's: 84   NA's: 84                          NA's:108  
##                                                                            
##                                                                            
##                                                                            
##    stem     lodging    stem.cankers canker.lesion fruiting.bodies ext.decay 
##  0   :296   0   :520   0   :379     0   :320      0   :473        0   :497  
##  1   :371   1   : 42   1   : 39     1   : 83      1   :104        1   :135  
##  NA's: 16   NA's:121   2   : 36     2   :177      NA's:106        2   : 13  
##                        3   :191     3   : 65                      NA's: 38  
##                        NA's: 38     NA's: 38                                
##                                                                             
##                                                                             
##  mycelium   int.discolor sclerotia  fruit.pods fruit.spots   seed    
##  0   :639   0   :581     0   :625   0   :407   0   :345    0   :476  
##  1   :  6   1   : 44     1   : 20   1   :130   1   : 75    1   :115  
##  NA's: 38   2   : 20     NA's: 38   2   : 14   2   : 57    NA's: 92  
##             NA's: 38                3   : 48   4   :100              
##                                     NA's: 84   NA's:106              
##                                                                      
##                                                                      
##  mold.growth seed.discolor seed.size  shriveling  roots    
##  0   :524    0   :513      0   :532   0   :539   0   :551  
##  1   : 67    1   : 64      1   : 59   1   : 38   1   : 86  
##  NA's: 92    NA's:106      NA's: 92   NA's:106   2   : 15  
##                                                  NA's: 31  
##                                                            
##                                                            
## 
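Rather than scanning the summary by eye, near-degenerate predictors could be ranked by the ratio of the most common level to the second most common. This is a sketch only; the `freq_ratio` name is introduced here, and the idea mirrors the frequency-ratio check used by `caret::nearZeroVar()`, where ratios above roughly 20 are often treated as a warning sign (a rule of thumb, not a hard threshold).

```r
library(mlbench)  # provides the Soybean data set
data(Soybean)

predictors <- Soybean[, names(Soybean) != "Class"]

# Count of the most common level divided by the count of the runner-up
# (table() drops NAs by default)
freq_ratio <- sapply(predictors, function(x) {
  counts <- sort(as.vector(table(x)), decreasing = TRUE)
  counts[1] / counts[2]
})

round(sort(freq_ratio, decreasing = TRUE), 1)
```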

b)

From the summary in part a), we see that many predictors have a large number of NA values. The predictors with the most NA values are sever, hail, lodging and seed.tmt, with 121 each.
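The NA counts can also be tabulated directly instead of being read off the summary. A minimal sketch using base R only:

```r
library(mlbench)  # provides the Soybean data set
data(Soybean)

# Number of NA values per column, largest first
na_counts <- sort(colSums(is.na(Soybean)), decreasing = TRUE)
head(na_counts, 10)

# The same counts as a percentage of all rows
round(100 * head(na_counts, 10) / nrow(Soybean), 1)
```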

Looking at Class, we see that all of the NA values come from 5 classes, as shown below. In the first 4 classes listed, every row has at least one NA value, while the 5th class (phytophthora-rot) has NA values in 77% of its rows.

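# rows per class with no NA in any column
non_null = Soybean%>%
  na.omit()%>%
  group_by(Class)%>%
  summarize(non_null_count = n())

# total rows per class (named class_counts to avoid masking base::all)
class_counts = Soybean%>%
  group_by(Class)%>%
  summarize(count = n())

# share of rows per class that contain at least one NA
class_counts%>%
  left_join(non_null, by = 'Class')%>%
  mutate(ratio_na = (count - replace_na(non_null_count,0))/count)%>%
  arrange(-ratio_na)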
## # A tibble: 19 x 4
##    Class                       count non_null_count ratio_na
##    <fct>                       <int>          <int>    <dbl>
##  1 2-4-d-injury                   16             NA    1    
##  2 cyst-nematode                  14             NA    1    
##  3 diaporthe-pod-&-stem-blight    15             NA    1    
##  4 herbicide-injury                8             NA    1    
##  5 phytophthora-rot               88             20    0.773
##  6 alternarialeaf-spot            91             91    0    
##  7 anthracnose                    44             44    0    
##  8 bacterial-blight               20             20    0    
##  9 bacterial-pustule              20             20    0    
## 10 brown-spot                     92             92    0    
## 11 brown-stem-rot                 44             44    0    
## 12 charcoal-rot                   20             20    0    
## 13 diaporthe-stem-canker          20             20    0    
## 14 downy-mildew                   20             20    0    
## 15 frog-eye-leaf-spot             91             91    0    
## 16 phyllosticta-leaf-spot         20             20    0    
## 17 powdery-mildew                 20             20    0    
## 18 purple-seed-stain              20             20    0    
## 19 rhizoctonia-root-rot           20             20    0

c)

Since all of the NA values come from 5 classes, we cannot simply drop incomplete rows: doing so would remove 4 classes entirely and most of the 5th. For that reason, I think imputing the data would work better.

Something interesting I noticed is that, for the classes where 100% of rows include at least one NA, predictors such as hail, sever, seed.tmt, leaf.mild and lodging are entirely NA. Perhaps for predictors like those we could replace NA with 0 instead of imputing, since no values were recorded originally.

Soybean%>%
  filter(Class %in% c('2-4-d-injury','cyst-nematode',
                      'diaporthe-pod-&-stem-blight','herbicide-injury'))%>%
  summary()
##                          Class         date    plant.stand  precip     temp   
##  2-4-d-injury               :16   5      : 9   0   : 7     0   : 0   0   : 8  
##  diaporthe-pod-&-stem-blight:15   3      : 8   1   :10     1   : 2   1   : 0  
##  cyst-nematode              :14   6      : 8   NA's:36     2   :13   2   :15  
##  herbicide-injury           : 8   1      : 7               NA's:38   NA's:30  
##  alternarialeaf-spot        : 0   2      : 7                                  
##  anthracnose                : 0   (Other):13                                  
##  (Other)                    : 0   NA's   : 1                                  
##    hail    crop.hist area.dam   sever    seed.tmt    germ    plant.growth
##  0   : 0   0   : 6   0   :10   0   : 0   0   : 0   0   : 5   0   :15     
##  1   : 0   1   : 9   1   :12   1   : 0   1   : 0   1   : 2   1   :22     
##  NA's:53   2   :11   2   :10   2   : 0   2   : 0   2   : 2   NA's:16     
##            3   :11   3   :20   NA's:53   NA's:53   NA's:44               
##            NA's:16   NA's: 1                                             
##                                                                          
##                                                                          
##  leaves leaf.halo leaf.marg leaf.size leaf.shread leaf.malf leaf.mild   stem   
##  0:15   0   :20   0   : 0   0   : 0   0   : 8     0   : 0   0   : 0   0   :14  
##  1:38   1   : 0   1   : 4   1   : 4   1   : 0     1   :24   1   : 0   1   :23  
##         2   : 4   2   :20   2   :20   NA's:45     NA's:29   2   : 0   NA's:16  
##         NA's:29   NA's:29   NA's:29                         NA's:53            
##                                                                                
##                                                                                
##                                                                                
##  lodging   stem.cankers canker.lesion fruiting.bodies ext.decay mycelium 
##  0   : 0   0   :15      0   :15       0   : 0         0   :15   0   :15  
##  1   : 0   1   : 0      1   : 0       1   :15         1   : 0   1   : 0  
##  NA's:53   2   : 0      2   : 0       NA's:38         2   : 0   NA's:38  
##            3   : 0      3   : 0                       NA's:38            
##            NA's:38      NA's:38                                          
##                                                                          
##                                                                          
##  int.discolor sclerotia fruit.pods fruit.spots   seed    mold.growth
##  0   :15      0   :15   0   : 0    0   : 0     0   : 3   0   :14    
##  1   : 0      1   : 0   1   :15    1   : 0     1   :26   1   :15    
##  2   : 0      NA's:38   2   :14    2   :15     NA's:24   NA's:24    
##  NA's:38                3   : 8    4   : 0                          
##                         NA's:16    NA's:38                          
##                                                                     
##                                                                     
##  seed.discolor seed.size shriveling  roots   
##  0   : 0       0   : 0   0   : 0    0   : 0  
##  1   :15       1   :29   1   :15    1   : 8  
##  NA's:38       NA's:24   NA's:38    2   :14  
##                                     NA's:31  
##                                              
##                                              
##
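As an illustration of the imputation route, the sketch below fills each NA with the most frequent level of its column. Mode imputation is used here only as a simple stand-in, not as the method the analysis commits to, and `impute_mode` is a hypothetical helper. Because the missingness is concentrated in five classes, a class-aware or model-based approach (for example k-nearest neighbours) may well be preferable.

```r
library(mlbench)  # provides the Soybean data set
data(Soybean)

# Hypothetical helper: replace NAs in a factor with its most frequent level
impute_mode <- function(x) {
  mode_level <- names(which.max(table(x)))  # table() ignores NAs by default
  x[is.na(x)] <- mode_level
  x
}

soy_imputed <- Soybean
soy_imputed[-1] <- lapply(Soybean[-1], impute_mode)  # column 1 is Class

sum(is.na(Soybean))      # NA cells before imputation
sum(is.na(soy_imputed))  # NA cells after imputation
```

All rows and all classes are retained, which is the main advantage over `na.omit()` for this data set.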