DATA 624 HW 4

Kuhn and Johnson 3.1)

library(mlbench)
library(e1071)
library(caret)
library(fpp2)
library(tidyr)
library(dplyr)
data(Glass)
str(Glass)

## 'data.frame':    214 obs. of  10 variables:
##  $ RI  : num  1.52 1.52 1.52 1.52 1.52 ...
##  $ Na  : num  13.6 13.9 13.5 13.2 13.3 ...
##  $ Mg  : num  4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
##  $ Al  : num  1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
##  $ Si  : num  71.8 72.7 73 72.6 73.1 ...
##  $ K   : num  0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
##  $ Ca  : num  8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
##  $ Ba  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Fe  : num  0 0 0 0 0 0.26 0 0 0 0.11 ...
##  $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...

head(Glass)

##        RI    Na   Mg   Al    Si    K   Ca Ba   Fe Type
## 1 1.52101 13.64 4.49 1.10 71.78 0.06 8.75  0 0.00    1
## 2 1.51761 13.89 3.60 1.36 72.73 0.48 7.83  0 0.00    1
## 3 1.51618 13.53 3.55 1.54 72.99 0.39 7.78  0 0.00    1
## 4 1.51766 13.21 3.69 1.29 72.61 0.57 8.22  0 0.00    1
## 5 1.51742 13.27 3.62 1.24 73.08 0.55 8.07  0 0.00    1
## 6 1.51596 12.79 3.61 1.62 72.97 0.64 8.07  0 0.26    1

The Glass data set consists of 214 observations covering 6 types of glass (the Type factor has 6 levels). Each observation gives the refractive index of the glass sample and the percentage of sodium, magnesium, aluminum, silicon, potassium, calcium, barium and iron in the sample.

a and b) Using visualizations, explore the predictor variables to understand their distributions, as well as the relationships between predictors. Do there appear to be any outliers in the data? Are any predictors skewed? (I answered questions a and b together below.)

Refractive Index

summary(Glass$RI)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.511   1.517   1.518   1.518   1.519   1.534

plot(Glass$RI, ylab="Refractive Index")

hist(Glass$RI, xlab="Refractive Index",main="Histogram of Refractive Index")

boxplot(Glass$RI, main="Refractive Index")

skewness(Glass$RI)

## [1] 1.602715

The refractive index of the samples ranges from 1.511 to 1.534. The mean and median are both 1.518 to three decimal places. The distribution is skewed to the right (skewness 1.60). There are a number of outliers, with many more above 1.523 than at the low end. There are no missing values.
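To make the skewness figures quoted in this section easier to interpret, here is a minimal base R sketch of one common sample skewness definition: the third central moment divided by the cubed standard deviation. To my understanding this matches the default (type 3) calculation in e1071, but treat that as an assumption; the function name and the toy vectors are hypothetical.

```r
# Sample skewness: third central moment over the cubed standard deviation.
# sd() uses the n - 1 denominator.
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  m3 <- mean((x - mean(x))^3)
  m3 / sd(x)^3
}

right_tail <- c(1, 1, 2, 2, 3, 10)  # hypothetical values with one large observation
symmetric  <- c(1, 2, 3, 4, 5)      # hypothetical symmetric values

sample_skewness(right_tail)  # positive, because of the long right tail
sample_skewness(symmetric)   # zero, because the values are symmetric
```

A positive value indicates a longer right tail, a negative value a longer left tail, and values near zero a roughly symmetric distribution.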

Percentage of Sodium in the Sample

summary(Glass$Na)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   10.73   12.91   13.30   13.41   13.82   17.38

plot(Glass$Na, ylab="Percentage of Sodium in the Sample")

hist(Glass$Na, xlab="Percentage of Sodium in the Sample",main="Histogram of the Percentage of Sodium in the Sample")

boxplot(Glass$Na, main="Percentage of Sodium in the Sample")

skewness(Glass$Na)

## [1] 0.4478343

The percentage of sodium in the sample ranges from 10.73% to 17.38%. There are no missing values. The mean and median are very close to each other; the values are 13.41 and 13.30, respectively. The distribution of the percentage of sodium in glass samples is nearly normal. There are some outliers, both at the high end and the low end.

Percentage of Magnesium in the Sample

summary(Glass$Mg)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   2.115   3.480   2.685   3.600   4.490

plot(Glass$Mg, ylab="Percentage of Magnesium in the Sample")

hist(Glass$Mg, xlab="Percentage of Magnesium in the Sample",main="Histogram of the Percentage of Magnesium in the Sample")

boxplot(Glass$Mg, main="Percentage of Magnesium in the Sample")

skewness(Glass$Mg)

## [1] -1.136452

The percentage of magnesium in the glass samples ranges from 0 to 4.490%. The distribution is bimodal: there is a large number of samples without any magnesium, a large number of samples near the maximum percentage, and few samples in between. With a skewness of -1.14, the distribution is skewed to the left, and the mean (2.685) is well below the median (3.480).

Percentage of Aluminum in the Sample

summary(Glass$Al)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.290   1.190   1.360   1.445   1.630   3.500

plot(Glass$Al, ylab="Percentage of Aluminum in the Sample")

hist(Glass$Al, xlab="Percentage of Aluminum in the Sample",main="Histogram of the Percentage of Aluminum in the Sample")

boxplot(Glass$Al, main="Percentage of Aluminum in the Sample")

skewness(Glass$Al)

## [1] 0.8946104

The percentage of aluminum in the glass sample ranges from 0.290% to 3.500%. The median and mean are similar. The distribution of the percentage of aluminum in the glass samples is skewed to the right. There are outliers above 2.4 and below 0.5. There are no missing values.

Percentage of Silicon in the Sample

summary(Glass$Si)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   69.81   72.28   72.79   72.65   73.09   75.41

plot(Glass$Si, ylab="Percentage of Silicon in the Sample")

hist(Glass$Si, xlab="Percentage of Silicon in the Sample",main="Histogram of the Percentage of Silicon in the Sample")

boxplot(Glass$Si, main="Percentage of Silicon in the Sample")

skewness(Glass$Si)

## [1] -0.7202392

The percentage of silicon in the glass samples ranges from 69.81% to 75.41%. The mean and median are very similar. The distribution is skewed to the left, and there are more outliers below 71 than above 74. There are no missing values.

Percentage of Potassium in the Sample

summary(Glass$K)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.1225  0.5550  0.4971  0.6100  6.2100

plot(Glass$K, ylab="Percentage of Potassium in the Sample")

hist(Glass$K, xlab="Percentage of Potassium in the Sample",main="Histogram of the Percentage of Potassium in the Sample")

boxplot(Glass$K, main="Percentage of Potassium in the Sample")

skewness(Glass$K)

## [1] 6.460089

The percentage of potassium in the glass samples ranges from 0 to 6.21%. The mean is slightly lower than the median. Nearly all of the samples contain between 0 and 1% potassium. The distribution is skewed to the right. There are a few outliers between 1.2% and 3%, and two extreme outliers with more than 6% potassium. The skewness is very large, with a value of 6.46.

Percentage of Calcium in the Sample

summary(Glass$Ca)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   5.430   8.240   8.600   8.957   9.172  16.190

plot(Glass$Ca, ylab="Percentage of Calcium in the Sample")

hist(Glass$Ca, xlab="Percentage of Calcium in the Sample",main="Histogram of the Percentage of Calcium in the Sample")

boxplot(Glass$Ca, main="Percentage of Calcium in the Sample")

skewness(Glass$Ca)

## [1] 2.018446

The percentage of calcium in the glass samples ranges from 5.430% to 16.190%. The median is slightly lower than the mean. The distribution is skewed to the right and there are more outliers above 11 than below 7. There are no missing values.

Percentage of Barium in the Sample

summary(Glass$Ba)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   0.000   0.175   0.000   3.150

plot(Glass$Ba, ylab="Percentage of Barium in the Sample")

hist(Glass$Ba, xlab="Percentage of Barium in the Sample",main="Histogram of the Percentage of Barium in the Sample")

boxplot(Glass$Ba, main="Percentage of Barium in the Sample")

skewness(Glass$Ba)

## [1] 3.36868

The percentage of barium in the sample ranges from 0 to 3.150%. The median value is 0, as most of the samples contain no barium. The data is skewed to the right. All of the samples that have barium in them are outliers. There are no missing values. The skewness is large, with a value of 3.37.

Percentage of Iron in the Sample

summary(Glass$Fe)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.05701 0.10000 0.51000

plot(Glass$Fe, ylab="Percentage of Iron in the Sample")

hist(Glass$Fe, xlab="Percentage of Iron in the Sample",main="Histogram of the Percentage of Iron in the Sample")

boxplot(Glass$Fe, main="Percentage of Iron in the Sample")

skewness(Glass$Fe)

## [1] 1.729811

The percentage of iron in the sample ranges from 0 to 0.510%. The median value is 0, as most of the samples contain no iron. The data is skewed to the right. Samples with a percentage of iron that is greater than 0.28% are outliers. There are no missing values.
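The per-predictor summaries above repeat the same steps, so they can also be collected in one table. The sketch below is a base R illustration, not the code used for this report: it counts outliers with the boxplot rule (points more than 1.5 times the interquartile range beyond the hinges) through boxplot.stats(). The function name and the toy data frame are hypothetical.

```r
# One row per numeric column: median, mean, and the number of points
# that a boxplot would draw as outliers.
summarise_numeric <- function(df) {
  t(sapply(df, function(x) {
    c(median     = median(x),
      mean       = mean(x),
      n_outliers = length(boxplot.stats(x)$out))
  }))
}

# Hypothetical data: column a has one extreme value, column b has none
toy <- data.frame(a = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 50), b = 1:10)
result <- summarise_numeric(toy)
result
```

With the Glass data loaded, something like summarise_numeric(Glass[, 1:9]) would cover the nine original predictors in one call.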

Correlation of Variables

The following are the correlation values between each of the variables. The closer the correlation is to 1 or -1, the more highly correlated the variables.

library(corrplot)

## corrplot 0.84 loaded

glass_pred <- subset(Glass, select=-c(Type))
correlation <- cor(glass_pred, method = "pearson")
corrplot(correlation, type="upper", method="color")

The refractive index is positively correlated to the percentage of calcium in the sample and negatively correlated to the percentage of silicon in the sample. The percentage of magnesium is negatively correlated with the percentage of aluminum in the sample and the percentage of barium in the sample. The percentage of aluminum is positively correlated with the percentage of barium in the sample.
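Reading pairs off a correlation plot can be checked numerically. The following base R sketch lists every pair of columns whose absolute Pearson correlation meets a cutoff; the function name, the cutoff of 0.5, and the simulated data are assumptions made for illustration.

```r
# List variable pairs whose absolute Pearson correlation is at least `cutoff`.
high_cor_pairs <- function(df, cutoff = 0.5) {
  cm  <- cor(df, method = "pearson")
  # Look only above the diagonal so each pair is reported once
  idx <- which(abs(cm) >= cutoff & upper.tri(cm), arr.ind = TRUE)
  out <- data.frame(
    var1 = rownames(cm)[idx[, "row"]],
    var2 = colnames(cm)[idx[, "col"]],
    r    = cm[idx],
    stringsAsFactors = FALSE
  )
  out[order(-abs(out$r)), ]
}

# Hypothetical data: y is built from x, z is independent noise
set.seed(1)
x   <- rnorm(100)
toy <- data.frame(x = x, y = x + rnorm(100, sd = 0.3), z = rnorm(100))
pairs_found <- high_cor_pairs(toy, cutoff = 0.5)
pairs_found
```

Applied to the predictor columns (for example high_cor_pairs(glass_pred)), this would give the correlated pairs as a sorted table rather than as colors.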

plot(glass_pred)

The plot above displays the relationship between each pair of variables. The strongest relationships were highlighted by the correlation plot. There is a positive correlation between the refractive index and the percentage of calcium in the sample and a negative correlation between the refractive index and the percentage of silicon in the sample.

c) Are there any relevant transformations of one or more predictors that might improve the classification model?

The percentage of magnesium is not normally distributed. I will apply a Box-Cox transformation.

lambda_mg <- BoxCox.lambda(Glass$Mg)
hist(BoxCox(Glass$Mg,lambda_mg), main="Histogram of the Box Cox Transformation of the Percentage of Magnesium in the Sample")

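Because the data are bimodal, and because many samples contain no magnesium while the Box-Cox transformation is defined only for strictly positive values, the Box-Cox transformation did not create a more normal distribution.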

The percentage of barium in the sample is not normally distributed. I will apply a Box-Cox transformation.

#ba_trans <- BoxCoxTrans(Glass$Ba)
#ba_trans
#plot(ba_trans)

lambda_ba <- BoxCox.lambda(Glass$Ba)
hist(BoxCox(Glass$Ba,lambda_ba), main="Histogram of the Box Cox Transformation of the Percentage of Barium in the Sample")

Because nearly all of the samples have no barium, performing a Box Cox transformation is not effective.
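The reason a power transformation cannot help a predictor that is mostly zeros can be seen from the Box-Cox formula itself. The sketch below writes the formula out in base R; the function name and the toy vector are hypothetical, and this is an illustration of the formula rather than the forecast package's implementation.

```r
# Box-Cox transformation: (x^lambda - 1) / lambda, or log(x) when lambda is 0.
# The formula assumes strictly positive x.
box_cox <- function(x, lambda) {
  if (abs(lambda) < 1e-8) log(x) else (x^lambda - 1) / lambda
}

# Hypothetical predictor where 8 of 10 samples are exactly zero
mostly_zero <- c(rep(0, 8), 0.5, 1.2)

transformed <- box_cox(mostly_zero, lambda = 0.5)
table(transformed == min(transformed))  # the 8 zeros still share one value

box_cox(0, lambda = 0)  # -Inf: the log case breaks down at zero
```

Every zero maps to the same transformed value, so the spike at zero survives any choice of lambda, and the log case is undefined at zero.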

The percentage of iron in the sample is not normally distributed. I will apply a Box-Cox transformation.

#fe_trans <- BoxCoxTrans(Glass$Fe)
#fe_trans
#plot(fe_trans)

lambda_fe <- BoxCox.lambda(Glass$Fe)
hist(BoxCox(Glass$Fe,lambda_fe), main="Histogram of the Box Cox Transformation of the Percentage of Iron in the Sample")

Because of the large number of samples that have no iron, performing a Box Cox transformation is not effective.

The percentage of calcium in the sample is not normally distributed. I will apply a Box-Cox transformation.

lambda_ca <- BoxCox.lambda(Glass$Ca)
hist(BoxCox(Glass$Ca,lambda_ca), main="Histogram of the Box Cox Transformation of the Percentage of Calcium in the Sample")

skewness(BoxCox(Glass$Ca,lambda_ca))

## [1] 0.6410682

The Box Cox transformation lowered the skewness of the calcium variable. The skewness decreased from 2.02 to 0.64. The histogram shows a nearly normal distribution.

lambda_al <- BoxCox.lambda(Glass$Al)
hist(BoxCox(Glass$Al,lambda_al), main="Histogram of the Box Cox Transformation of the Percentage of Aluminum in the Sample")

skewness(BoxCox(Glass$Al,lambda_al))

## [1] -0.1506025

The Box Cox transformation lowered the skewness of the aluminum variable. The skewness decreased from 0.89 to -0.15. The histogram shows a nearly normal distribution.
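The value of lambda depends on how it is estimated. As far as I know, BoxCox.lambda() in the forecast package defaults to Guerrero's method, which was designed for time series, so a maximum-likelihood estimate is a reasonable cross-check for cross-sectional predictors. The sketch below uses boxcox() from the MASS package on simulated data; the simulated variable and the lambda grid are assumptions chosen for illustration.

```r
# Hypothetical right-skewed, strictly positive data (log-normal),
# for which the ideal Box-Cox lambda is 0 (a log transformation)
set.seed(624)
y <- rlnorm(500, meanlog = 0, sdlog = 1)

# Profile log-likelihood over a grid of lambda values, without plotting
profile <- MASS::boxcox(y ~ 1, lambda = seq(-2, 2, by = 0.05), plotit = FALSE)

# The maximum-likelihood estimate is the grid value with the highest likelihood
lambda_hat <- profile$x[which.max(profile$y)]
lambda_hat
```

For a strictly positive predictor such as calcium or aluminum, replacing y with that column would give a likelihood-based lambda to compare against the one used above.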

Because the refractive index and the percentage of calcium are strongly correlated, I would consider combining them into a single predictor. Here I combine them by adding them together.

Glass <- Glass %>%
  mutate(RICa=RI+Ca)
head(Glass)

##        RI    Na   Mg   Al    Si    K   Ca Ba   Fe Type     RICa
## 1 1.52101 13.64 4.49 1.10 71.78 0.06 8.75  0 0.00    1 10.27101
## 2 1.51761 13.89 3.60 1.36 72.73 0.48 7.83  0 0.00    1  9.34761
## 3 1.51618 13.53 3.55 1.54 72.99 0.39 7.78  0 0.00    1  9.29618
## 4 1.51766 13.21 3.69 1.29 72.61 0.57 8.22  0 0.00    1  9.73766
## 5 1.51742 13.27 3.62 1.24 73.08 0.55 8.07  0 0.00    1  9.58742
## 6 1.51596 12.79 3.61 1.62 72.97 0.64 8.07  0 0.26    1  9.58596
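One caveat with a raw sum is scale: the refractive index varies by only about 0.02 across the samples, while calcium spans roughly 5 to 16, so the sum tracks calcium almost exactly. A hedged alternative, not the approach taken in this report, is to standardize both variables and keep their first principal component. The sketch below uses simulated stand-ins for the two columns; all names and numbers in it are hypothetical.

```r
# Hypothetical stand-ins for Ca and RI with very different scales
set.seed(624)
ca   <- rnorm(200, mean = 9, sd = 1.4)
ri   <- 1.518 + 0.002 * (ca - 9) + rnorm(200, sd = 0.001)
pair <- data.frame(RI = ri, Ca = ca)

# The raw sum is almost perfectly correlated with the larger-scale variable
cor(pair$RI + pair$Ca, pair$Ca)

# Standardize both columns, then take the first principal component
pca <- prcomp(pair, center = TRUE, scale. = TRUE)
combined <- pca$x[, 1]

# Share of the total variance captured by each component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
var_explained
```

With the real data, prcomp(Glass[, c("RI", "Ca")], center = TRUE, scale. = TRUE) would be the corresponding call.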

3.2) The Soybean data set consists of 683 observations of 36 variables: the disease class (Class) and 35 predictors that can be used to predict disease in soybeans.

data("Soybean")
head(Soybean)

##                   Class date plant.stand precip temp hail crop.hist
## 1 diaporthe-stem-canker    6           0      2    1    0         1
## 2 diaporthe-stem-canker    4           0      2    1    0         2
## 3 diaporthe-stem-canker    3           0      2    1    0         1
## 4 diaporthe-stem-canker    3           0      2    1    0         1
## 5 diaporthe-stem-canker    6           0      2    1    0         2
## 6 diaporthe-stem-canker    5           0      2    1    0         3
##   area.dam sever seed.tmt germ plant.growth leaves leaf.halo leaf.marg
## 1        1     1        0    0            1      1         0         2
## 2        0     2        1    1            1      1         0         2
## 3        0     2        1    2            1      1         0         2
## 4        0     2        0    1            1      1         0         2
## 5        0     1        0    2            1      1         0         2
## 6        0     1        0    1            1      1         0         2
##   leaf.size leaf.shread leaf.malf leaf.mild stem lodging stem.cankers
## 1         2           0         0         0    1       1            3
## 2         2           0         0         0    1       0            3
## 3         2           0         0         0    1       0            3
## 4         2           0         0         0    1       0            3
## 5         2           0         0         0    1       0            3
## 6         2           0         0         0    1       0            3
##   canker.lesion fruiting.bodies ext.decay mycelium int.discolor sclerotia
## 1             1               1         1        0            0         0
## 2             1               1         1        0            0         0
## 3             0               1         1        0            0         0
## 4             0               1         1        0            0         0
## 5             1               1         1        0            0         0
## 6             0               1         1        0            0         0
##   fruit.pods fruit.spots seed mold.growth seed.discolor seed.size
## 1          0           4    0           0             0         0
## 2          0           4    0           0             0         0
## 3          0           4    0           0             0         0
## 4          0           4    0           0             0         0
## 5          0           4    0           0             0         0
## 6          0           4    0           0             0         0
##   shriveling roots
## 1          0     0
## 2          0     0
## 3          0     0
## 4          0     0
## 5          0     0
## 6          0     0

Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in the chapter?

# Plot the frequency distribution of every column, using its integer level codes
for (x in seq_along(Soybean)) {
  hist(as.numeric(Soybean[, x]), main = colnames(Soybean)[x])
}

Variables with a degenerate distribution have nearly all of their observations concentrated in a single value. The following variables have a degenerate distribution: leaves, leaf.shread, leaf.malf, leaf.mild, lodging, mycelium, int.discolor, sclerotia, mold.growth, seed.discolor, seed.size, shriveling and roots.
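A numeric rule can complement the visual read of the histograms. The sketch below flags a predictor when the ratio of its most common value to its second most common value is large and the share of unique values is small. The cutoffs of 95/5 and 10% are, to my knowledge, the defaults of nearZeroVar() in caret; the function names are hypothetical, and the two example vectors are rebuilt from the counts shown in the summary output of the Soybean data.

```r
# Ratio of the most frequent value's count to the second most frequent value's count
freq_ratio <- function(x) {
  counts <- sort(as.vector(table(x)), decreasing = TRUE)  # table() drops NA
  if (length(counts) < 2) return(Inf)
  counts[1] / counts[2]
}

# Flag a predictor as degenerate (near-zero variance) under the assumed cutoffs
is_degenerate <- function(x, freq_cut = 95 / 5, unique_cut = 10) {
  pct_unique <- 100 * length(unique(x[!is.na(x)])) / length(x)
  freq_ratio(x) > freq_cut && pct_unique < unique_cut
}

# Rebuilt from the summary counts: mycelium (639 / 6 / 38 NA), leaves (77 / 606)
mycelium <- factor(c(rep(0, 639), rep(1, 6), rep(NA, 38)))
leaves   <- factor(c(rep(0, 77), rep(1, 606)))

freq_ratio(mycelium)     # 639 / 6
is_degenerate(mycelium)  # flagged
freq_ratio(leaves)       # 606 / 77
is_degenerate(leaves)    # not flagged under these cutoffs
```

A fixed cutoff like this is stricter than judging by eye, so it may flag fewer predictors than the list above; which threshold is appropriate is a judgment call.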

Roughly 18% of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?

summary(Soybean)

##                  Class          date     plant.stand  precip      temp    
##  brown-spot         : 92   5      :149   0   :354    0   : 74   0   : 80  
##  alternarialeaf-spot: 91   4      :131   1   :293    1   :112   1   :374  
##  frog-eye-leaf-spot : 91   3      :118   NA's: 36    2   :459   2   :199  
##  phytophthora-rot   : 88   2      : 93               NA's: 38   NA's: 30  
##  anthracnose        : 44   6      : 90                                    
##  brown-stem-rot     : 44   (Other):101                                    
##  (Other)            :233   NA's   :  1                                    
##    hail     crop.hist  area.dam    sever     seed.tmt     germ    
##  0   :435   0   : 65   0   :123   0   :195   0   :305   0   :165  
##  1   :127   1   :165   1   :227   1   :322   1   :222   1   :213  
##  NA's:121   2   :219   2   :145   2   : 45   2   : 35   2   :193  
##             3   :218   3   :187   NA's:121   NA's:121   NA's:112  
##             NA's: 16   NA's:  1                                   
##                                                                   
##                                                                   
##  plant.growth leaves  leaf.halo  leaf.marg  leaf.size  leaf.shread
##  0   :441     0: 77   0   :221   0   :357   0   : 51   0   :487   
##  1   :226     1:606   1   : 36   1   : 21   1   :327   1   : 96   
##  NA's: 16             2   :342   2   :221   2   :221   NA's:100   
##                       NA's: 84   NA's: 84   NA's: 84              
##                                                                   
##                                                                   
##                                                                   
##  leaf.malf  leaf.mild    stem     lodging    stem.cankers canker.lesion
##  0   :554   0   :535   0   :296   0   :520   0   :379     0   :320     
##  1   : 45   1   : 20   1   :371   1   : 42   1   : 39     1   : 83     
##  NA's: 84   2   : 20   NA's: 16   NA's:121   2   : 36     2   :177     
##             NA's:108                         3   :191     3   : 65     
##                                              NA's: 38     NA's: 38     
##                                                                        
##                                                                        
##  fruiting.bodies ext.decay  mycelium   int.discolor sclerotia  fruit.pods
##  0   :473        0   :497   0   :639   0   :581     0   :625   0   :407  
##  1   :104        1   :135   1   :  6   1   : 44     1   : 20   1   :130  
##  NA's:106        2   : 13   NA's: 38   2   : 20     NA's: 38   2   : 14  
##                  NA's: 38              NA's: 38                3   : 48  
##                                                                NA's: 84  
##                                                                          
##                                                                          
##  fruit.spots   seed     mold.growth seed.discolor seed.size  shriveling
##  0   :345    0   :476   0   :524    0   :513      0   :532   0   :539  
##  1   : 75    1   :115   1   : 67    1   : 64      1   : 59   1   : 38  
##  2   : 57    NA's: 92   NA's: 92    NA's:106      NA's: 92   NA's:106  
##  4   :100                                                              
##  NA's:106                                                              
##                                                                        
##                                                                        
##   roots    
##  0   :551  
##  1   : 86  
##  2   : 15  
##  NA's: 31  
##            
##            
##

The predictors with the most missing values (84 or more of the 683 observations each) are: hail, sever, seed.tmt, germ, leaf.halo, leaf.marg, leaf.size, leaf.shread, leaf.malf, leaf.mild, lodging, fruiting.bodies, fruit.pods, fruit.spots, seed, mold.growth, seed.discolor, seed.size and shriveling.
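The missing-value counts can also be ranked directly instead of being read from the summary output. This base R sketch returns the share of missing values per column, sorted from most to least; the function name and the toy data frame are hypothetical.

```r
# Share of missing values per column, keeping only columns with any NA
na_share <- function(df) {
  share <- colMeans(is.na(df))
  sort(share[share > 0], decreasing = TRUE)
}

# Hypothetical data: a is half missing, b is a quarter missing, c is complete
toy <- data.frame(a = c(1, NA, 3, NA),
                  b = c("x", "y", NA, "z"),
                  c = 1:4)
shares <- na_share(toy)
shares
```

Calling na_share(Soybean) would rank the predictors by how often they are missing.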

# Count the missing values of every predictor within each class
soybean_na <- Soybean %>% 
  group_by(Class) %>%
  summarise_all(~ sum(is.na(.))) 

soybean_na$na_sum <- rowSums(soybean_na[,2:36])
soybean_na

## # A tibble: 19 x 37
##    Class      date plant.stand precip  temp  hail crop.hist area.dam sever
##    <fct>     <int>       <int>  <int> <int> <int>     <int>    <int> <int>
##  1 2-4-d-in~     1          16     16    16    16        16        1    16
##  2 alternar~     0           0      0     0     0         0        0     0
##  3 anthracn~     0           0      0     0     0         0        0     0
##  4 bacteria~     0           0      0     0     0         0        0     0
##  5 bacteria~     0           0      0     0     0         0        0     0
##  6 brown-sp~     0           0      0     0     0         0        0     0
##  7 brown-st~     0           0      0     0     0         0        0     0
##  8 charcoal~     0           0      0     0     0         0        0     0
##  9 cyst-nem~     0          14     14    14    14         0        0    14
## 10 diaporth~     0           6      0     0    15         0        0    15
## 11 diaporth~     0           0      0     0     0         0        0     0
## 12 downy-mi~     0           0      0     0     0         0        0     0
## 13 frog-eye~     0           0      0     0     0         0        0     0
## 14 herbicid~     0           0      8     0     8         0        0     8
## 15 phyllost~     0           0      0     0     0         0        0     0
## 16 phytopht~     0           0      0     0    68         0        0    68
## 17 powdery-~     0           0      0     0     0         0        0     0
## 18 purple-s~     0           0      0     0     0         0        0     0
## 19 rhizocto~     0           0      0     0     0         0        0     0
## # ... with 28 more variables: seed.tmt <int>, germ <int>,
## #   plant.growth <int>, leaves <int>, leaf.halo <int>, leaf.marg <int>,
## #   leaf.size <int>, leaf.shread <int>, leaf.malf <int>, leaf.mild <int>,
## #   stem <int>, lodging <int>, stem.cankers <int>, canker.lesion <int>,
## #   fruiting.bodies <int>, ext.decay <int>, mycelium <int>,
## #   int.discolor <int>, sclerotia <int>, fruit.pods <int>,
## #   fruit.spots <int>, seed <int>, mold.growth <int>, seed.discolor <int>,
## #   seed.size <int>, shriveling <int>, roots <int>, na_sum <dbl>

There is a relationship between the Class and the absence of data. There are many missing values for the following classes: 2-4-d-injury, cyst-nematode, diaporthe-pod-&-stem-blight, herbicide-injury and phytophthora-rot. Within these classes, the same predictors tend to be missing together; for example, hail and sever are each missing for 68 phytophthora-rot observations.
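Another way to see the link between class and missingness is the share of incomplete rows in each class. The sketch below is a base R illustration with a hypothetical function name and toy data; it treats a row as incomplete when any of its columns is NA.

```r
# Share of rows with at least one missing value, by class
incomplete_by_class <- function(df, class_col) {
  incomplete <- !complete.cases(df)
  tapply(incomplete, df[[class_col]], mean)
}

# Hypothetical data: every class A row has a missing value, no class B row does
toy <- data.frame(Class = c("A", "A", "B", "B"),
                  x = c(1, NA, 3, 4),
                  y = c(NA, NA, 1, 2))
by_class <- incomplete_by_class(toy, "Class")
by_class
```

For the Soybean data, incomplete_by_class(Soybean, "Class") would show which classes account for the incomplete observations.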

Develop a strategy for handling missing data, either by eliminating predictors or imputation.

For the variables that are degenerate, it is best to eliminate those as predictors. I will therefore eliminate the following variables: leaves, leaf.shread, leaf.malf, leaf.mild, lodging, mycelium, int.discolor, sclerotia, mold.growth, seed.discolor, seed.size, shriveling and roots. I will then look at the number of missing values remaining for each class and each variable.

soybean_remove_degen <- subset(Soybean, select=-c(leaves,leaf.shread,leaf.malf, leaf.mild, lodging, mycelium, int.discolor, sclerotia, mold.growth, seed.discolor, seed.size, shriveling, roots))

# Recount the missing values per class after dropping the degenerate predictors
soybean_rem_deg_na <- soybean_remove_degen %>% 
  group_by(Class) %>%
  summarise_all(~ sum(is.na(.))) 

soybean_rem_deg_na$na_sum <- rowSums(soybean_rem_deg_na[,2:23])
soybean_rem_deg_na

## # A tibble: 19 x 24
##    Class      date plant.stand precip  temp  hail crop.hist area.dam sever
##    <fct>     <int>       <int>  <int> <int> <int>     <int>    <int> <int>
##  1 2-4-d-in~     1          16     16    16    16        16        1    16
##  2 alternar~     0           0      0     0     0         0        0     0
##  3 anthracn~     0           0      0     0     0         0        0     0
##  4 bacteria~     0           0      0     0     0         0        0     0
##  5 bacteria~     0           0      0     0     0         0        0     0
##  6 brown-sp~     0           0      0     0     0         0        0     0
##  7 brown-st~     0           0      0     0     0         0        0     0
##  8 charcoal~     0           0      0     0     0         0        0     0
##  9 cyst-nem~     0          14     14    14    14         0        0    14
## 10 diaporth~     0           6      0     0    15         0        0    15
## 11 diaporth~     0           0      0     0     0         0        0     0
## 12 downy-mi~     0           0      0     0     0         0        0     0
## 13 frog-eye~     0           0      0     0     0         0        0     0
## 14 herbicid~     0           0      8     0     8         0        0     8
## 15 phyllost~     0           0      0     0     0         0        0     0
## 16 phytopht~     0           0      0     0    68         0        0    68
## 17 powdery-~     0           0      0     0     0         0        0     0
## 18 purple-s~     0           0      0     0     0         0        0     0
## 19 rhizocto~     0           0      0     0     0         0        0     0
## # ... with 15 more variables: seed.tmt <int>, germ <int>,
## #   plant.growth <int>, leaf.halo <int>, leaf.marg <int>, leaf.size <int>,
## #   stem <int>, stem.cankers <int>, canker.lesion <int>,
## #   fruiting.bodies <int>, ext.decay <int>, fruit.pods <int>,
## #   fruit.spots <int>, seed <int>, na_sum <dbl>

There are still a significant number of missing values for the following classes: 2-4-d-injury, cyst-nematode, diaporthe-pod-&-stem-blight, herbicide-injury and phytophthora-rot.

I will impute the median for the missing values. Because the predictors are stored as factors, I take the median of each predictor's integer level codes and map it back to the corresponding level. For predictors whose levels have no natural order, the median level depends on the order of the levels, so this is only a rough fill-in value.

# Replace the NAs in a factor with its median level. The median is taken over
# the integer level codes; quantile(type = 1) always returns an observed code.
impute_median_level <- function(x) {
  codes <- as.integer(x)
  med_code <- as.integer(quantile(codes, probs = 0.5, type = 1, na.rm = TRUE))
  x[is.na(x)] <- levels(x)[med_code]
  x
}

# Apply the imputation to every predictor (all columns except Class)
predictors <- setdiff(names(soybean_remove_degen), "Class")
soybean_remove_degen[predictors] <- lapply(soybean_remove_degen[predictors], impute_median_level)

head(soybean_remove_degen)

##                   Class date plant.stand precip temp hail crop.hist
## 1 diaporthe-stem-canker    6           0      2    1    0         1
## 2 diaporthe-stem-canker    4           0      2    1    0         2
## 3 diaporthe-stem-canker    3           0      2    1    0         1
## 4 diaporthe-stem-canker    3           0      2    1    0         1
## 5 diaporthe-stem-canker    6           0      2    1    0         2
## 6 diaporthe-stem-canker    5           0      2    1    0         3
##   area.dam sever seed.tmt germ plant.growth leaf.halo leaf.marg leaf.size
## 1        1     1        0    0            1         0         2         2
## 2        0     2        1    1            1         0         2         2
## 3        0     2        1    2            1         0         2         2
## 4        0     2        0    1            1         0         2         2
## 5        0     1        0    2            1         0         2         2
## 6        0     1        0    1            1         0         2         2
##   stem stem.cankers canker.lesion fruiting.bodies ext.decay fruit.pods
## 1    1            3             1               1         1          0
## 2    1            3             1               1         1          0
## 3    1            3             0               1         1          0
## 4    1            3             0               1         1          0
## 5    1            3             1               1         1          0
## 6    1            3             0               1         1          0
##   fruit.spots seed
## 1           4    0
## 2           4    0
## 3           4    0
## 4           4    0
## 5           4    0
## 6           4    0
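Since the Soybean predictors are categorical, the most frequent level (the mode) is another common fill-in value and does not depend on the order of the levels. The sketch below is an alternative to median imputation, offered as an illustration only; the function name and the toy factor are hypothetical.

```r
# Replace the NAs in a factor with its most frequent level
impute_mode <- function(x) {
  counts <- table(x)  # NA values are excluded by default
  x[is.na(x)] <- names(counts)[which.max(counts)]
  x
}

# Hypothetical factor: level "1" is the most frequent
toy     <- factor(c("0", "1", "1", NA, "1", NA))
imputed <- impute_mode(toy)
table(imputed, useNA = "ifany")
```

It could be applied column by column with lapply(), and anyNA() on the result is a quick check that no missing values remain.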

Sarah Wigodsky

February 28, 2019
