Data 624 Week3

#Applied Predictive Modeling KJ

3.1 & 3.2

#3.1. The UC Irvine Machine Learning Repository6 contains a data set related to glass identiﬁcation. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe. The data can be accessed via:

Nine chemery predictors are numerical data, and column “type” is facots with six levels 1,2,3,5,6,and 7. There are no missing data in the table.

library(mlbench) 
data(Glass) 
str(Glass)

## 'data.frame':    214 obs. of  10 variables:
##  $ RI  : num  1.52 1.52 1.52 1.52 1.52 ...
##  $ Na  : num  13.6 13.9 13.5 13.2 13.3 ...
##  $ Mg  : num  4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
##  $ Al  : num  1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
##  $ Si  : num  71.8 72.7 73 72.6 73.1 ...
##  $ K   : num  0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
##  $ Ca  : num  8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
##  $ Ba  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Fe  : num  0 0 0 0 0 0.26 0 0 0 0.11 ...
##  $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...

Si has the highest percentage usage of nince element, 69.81% to 75.41% in all six types glasses. Fe has the lowest percentage usage of nince element, 0% to 0.51% in all six types glasses. The differences of mininum and maximum percentage usage in all nine predictors are very small, 0.4% to 11%.

Type “1” and “2” glasses have 70 and 76 samples which are 65% of total sample size. Type “6” glass has 9 sample size which is the samllest among sixt types glasses.

summary(Glass)

##        RI              Na              Mg              Al       
##  Min.   :1.511   Min.   :10.73   Min.   :0.000   Min.   :0.290  
##  1st Qu.:1.517   1st Qu.:12.91   1st Qu.:2.115   1st Qu.:1.190  
##  Median :1.518   Median :13.30   Median :3.480   Median :1.360  
##  Mean   :1.518   Mean   :13.41   Mean   :2.685   Mean   :1.445  
##  3rd Qu.:1.519   3rd Qu.:13.82   3rd Qu.:3.600   3rd Qu.:1.630  
##  Max.   :1.534   Max.   :17.38   Max.   :4.490   Max.   :3.500  
##        Si              K                Ca               Ba       
##  Min.   :69.81   Min.   :0.0000   Min.   : 5.430   Min.   :0.000  
##  1st Qu.:72.28   1st Qu.:0.1225   1st Qu.: 8.240   1st Qu.:0.000  
##  Median :72.79   Median :0.5550   Median : 8.600   Median :0.000  
##  Mean   :72.65   Mean   :0.4971   Mean   : 8.957   Mean   :0.175  
##  3rd Qu.:73.09   3rd Qu.:0.6100   3rd Qu.: 9.172   3rd Qu.:0.000  
##  Max.   :75.41   Max.   :6.2100   Max.   :16.190   Max.   :3.150  
##        Fe          Type  
##  Min.   :0.00000   1:70  
##  1st Qu.:0.00000   2:76  
##  Median :0.00000   3:17  
##  Mean   :0.05701   5:13  
##  3rd Qu.:0.10000   6: 9  
##  Max.   :0.51000   7:29

(a) Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.

Use the pairs function to explore the predicor variables. Each scatter plot showed the correlation of the percentage usage of two elements for six tyes of glasses.

Ba is tipically useage for type “7” glass. Fe looks more likly used in type “2” glass.

RI and Ca had positive correlation, and RI and Si has negative correlation.

my_cols <- c("blue", "red", "darkgrey","purple", "black", "yellowgreen")  
pairs(Glass,pch = 19,  cex = 0.5,
      col = my_cols[Glass$Type],
      lower.panel=NULL)

The following I filtered data by glass type, and showed type “1”, “2” and “7”.

From the following graph histgram,there are specific percentage usages for Mg and K in three types of glasses. Ba is in type “7”, but not in typy “1” and “2”.

The paris showed the correlation factors also changed. In type “1”, Mg and AI’s correlation is -0.39; Type “2” is 0.28; and type “7” is -0.42.

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(psych)
g1<-Glass %>% filter(Type == 1)
pairs.panels(g1[,1:8],show.points=FALSE,gap=0)

g2<-Glass %>% filter(Type == 2)
pairs.panels(g2[,1:8],show.points=FALSE,gap=0)

library(dplyr)
library(psych)
g7<-Glass %>% filter(Type == 7)
pairs.panels(g7[,1:8],show.points=FALSE,gap=0)

(b) Do there appear to be any outliers in the data? Are any predictors skewed?

Select element K as an example, and use plot function to show all samples.

plot(Glass$K, main="K persentage udage of glass", xlab = "index")

Skew=

Use skewness function to calculate the sknewness. K element has been obervered major samples are loacted at lower bound of the range, and there are a few sameple type “5” are at upper bound. unsprisly, K has long tail on the righ. The skew_K is 6.5, positive sknewness.

suppressMessages(library(PerformanceAnalytics))
skew_K <-  skewness(Glass$K)

skew_K

## [1] 6.505636

hist(Glass$K,main="Histogram of K persentage usage for glass", xlab = "percentage")

(c) Are there any relevant transformations of one or more predictors that might improve the classiﬁcation model?

Use log to ransform element K and showed the mean of log(K) is at the center of the belt curve.

logK<-log(Glass$K)
T1 <- data.frame(Glass[,c("K")], logK)
pairs.panels(T1,show.points=FALSE,gap=0)

One more step to remove the outliner which is greater than Q3+1.5*IQR. K percentage usage distribution has two models distribution.

K_cutoff<-quantile(Glass$K, 0.75)+1.5*IQR(Glass$K)
K_outliner<-which(Glass$K > K_cutoff)
K_adjusted<-Glass[-K_outliner,]
hist(K_adjusted$K,main="Histogram of K persentage usage for glass", xlab = "percentage")

35 out of 70 samples in glass type “1” has 0.5%-0.6% K.

K_adj_g1<-K_adjusted%>% filter(Type == 1)
hist(K_adj_g1$K,main="Histogram of K persentage usage for glass", xlab = "percentage")

28 out of 76 samples in glass type “1” has 0.5%-0.7% K.

K_adj_g2<-K_adjusted%>% filter(Type == 2)
hist(K_adj_g2$K,main="Histogram of K persentage usage for glass", xlab = "percentage")

21 out of 29 samples in glass type “7” has 0.1% K.

K_adj_g7<-K_adjusted%>% filter(Type == 7)
hist(K_adj_g7$K,main="Histogram of K persentage usage for glass", xlab = "percentage")

3.2 3.2. The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., left spots, mold growth). The outcome labels consist of 19 distinct classes.

library(mlbench) 
data(Soybean) 
summary(Soybean)

##                  Class          date     plant.stand  precip      temp    
##  brown-spot         : 92   5      :149   0   :354    0   : 74   0   : 80  
##  alternarialeaf-spot: 91   4      :131   1   :293    1   :112   1   :374  
##  frog-eye-leaf-spot : 91   3      :118   NA's: 36    2   :459   2   :199  
##  phytophthora-rot   : 88   2      : 93               NA's: 38   NA's: 30  
##  anthracnose        : 44   6      : 90                                    
##  brown-stem-rot     : 44   (Other):101                                    
##  (Other)            :233   NA's   :  1                                    
##    hail     crop.hist  area.dam    sever     seed.tmt     germ    
##  0   :435   0   : 65   0   :123   0   :195   0   :305   0   :165  
##  1   :127   1   :165   1   :227   1   :322   1   :222   1   :213  
##  NA's:121   2   :219   2   :145   2   : 45   2   : 35   2   :193  
##             3   :218   3   :187   NA's:121   NA's:121   NA's:112  
##             NA's: 16   NA's:  1                                   
##                                                                   
##                                                                   
##  plant.growth leaves  leaf.halo  leaf.marg  leaf.size  leaf.shread
##  0   :441     0: 77   0   :221   0   :357   0   : 51   0   :487   
##  1   :226     1:606   1   : 36   1   : 21   1   :327   1   : 96   
##  NA's: 16             2   :342   2   :221   2   :221   NA's:100   
##                       NA's: 84   NA's: 84   NA's: 84              
##                                                                   
##                                                                   
##                                                                   
##  leaf.malf  leaf.mild    stem     lodging    stem.cankers canker.lesion
##  0   :554   0   :535   0   :296   0   :520   0   :379     0   :320     
##  1   : 45   1   : 20   1   :371   1   : 42   1   : 39     1   : 83     
##  NA's: 84   2   : 20   NA's: 16   NA's:121   2   : 36     2   :177     
##             NA's:108                         3   :191     3   : 65     
##                                              NA's: 38     NA's: 38     
##                                                                        
##                                                                        
##  fruiting.bodies ext.decay  mycelium   int.discolor sclerotia  fruit.pods
##  0   :473        0   :497   0   :639   0   :581     0   :625   0   :407  
##  1   :104        1   :135   1   :  6   1   : 44     1   : 20   1   :130  
##  NA's:106        2   : 13   NA's: 38   2   : 20     NA's: 38   2   : 14  
##                  NA's: 38              NA's: 38                3   : 48  
##                                                                NA's: 84  
##                                                                          
##                                                                          
##  fruit.spots   seed     mold.growth seed.discolor seed.size  shriveling
##  0   :345    0   :476   0   :524    0   :513      0   :532   0   :539  
##  1   : 75    1   :115   1   : 67    1   : 64      1   : 59   1   : 38  
##  2   : 57    NA's: 92   NA's: 92    NA's:106      NA's: 92   NA's:106  
##  4   :100                                                              
##  NA's:106                                                              
##                                                                        
##                                                                        
##   roots    
##  0   :551  
##  1   : 86  
##  2   : 15  
##  NA's: 31  
##            
##            
##

(a) Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?

Use gourpby function to find the disease and relative to sever symtems. Sever column has 121 missing value.

#install.packages("dplyr")
suppressMessages(library(dplyr))
Soybean %>%group_by(Class) %>% count(sever)

## Warning: Factor `sever` contains implicit NA, consider using
## `forcats::fct_explicit_na`

(c) Develop a strategy for handling missing data, either by eliminating predictors or imputation.

missPredit_classname <- Soybean %>% filter(is.na(sever))%>% count(Class) %>% select(Class) 
missPredit_classname

In this case, we don’t have reference value. Ignor the missing data in Sever.

type_2_4_d_injury <- Soybean %>%  filter(Class == '2-4-d-injury')%>%  select(Class,sever) %>% count(sever)

## Warning: Factor `sever` contains implicit NA, consider using
## `forcats::fct_explicit_na`

type_2_4_d_injury

In this case of phytophthora_rot disease type, we have two reference values “1” and “2”. In 20 (=7+13) known data, “1” has 35% (=7/20) and “2” has 65% (=13/20). We will simulate 35% of “1” and 65% of “2” for 68 missing values.

type_phytophthora_rot <- Soybean %>%  filter(Class == 'phytophthora-rot')%>%  select(Class,sever) %>% count(sever)

## Warning: Factor `sever` contains implicit NA, consider using
## `forcats::fct_explicit_na`

type_phytophthora_rot

There are a few mssing categories in class missing sever. such as “2-4-d-injury”, “cyst-nematode”, “diaporthe-pod-&-stem-blight”, “herbicide-injury” and “phytophthora-rot”. However, “phytophthora-rot” is not missing 100% value, 7 samples has sever equals 1, 13 samples has sever euqal 2, and 68 are missing.

Break down 3 tables: notNA_sever:excluds all NA in sever other_class_sever: in all NA sever, exclude Class ‘phytophthora-rot’ t1:conbine notNA_sever & other_class_sever

notNA_sever<-Soybean%>% filter(sever != '')
notNA_sever<-as.data.frame(notNA_sever)
  
other_class_sever<- Soybean%>%filter((is.na(sever))&(Class!='phytophthora-rot'))
other_class_sever <- as.data.frame(other_class_sever)

t1 <- rbind(notNA_sever,other_class_sever)

number of sever 1: number of sever fare is 7:13. The following will similar random value 1 and 2 at the ration of 7:13 to replace 68 values.

set.seed(1)
randnum <-sample(c(1,2), size=68, replace=TRUE, prob=c(0.35,0.65))
randnum <- as.data.frame(randnum)
randnum

typeof(randnum)

## [1] "list"

na_sever<-as.data.frame(Soybean%>% filter((is.na(sever))& (Class=='phytophthora-rot')))
na_sever

na_sever['sever']<-randnum #fit random data to the miss column
na_sever$sever # check fill in data

##  [1] 2 2 2 1 2 1 1 1 2 2 2 2 1 2 1 2 1 1 2 1 1 2 1 2 2 2 2 2 1 2 2 2 2 2 1
## [36] 1 1 2 1 2 1 2 1 2 2 1 2 2 1 1 2 1 2 2 2 2 2 2 1 2 1 2 2 2 1 2 2 1

total <- rbind(t1,na_sever)
total[0]

Data 624 Week3

Chunhui Zhu & Chunmei Zhu

9/14/2019

3.1 & 3.2

(a) Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.

(b) Do there appear to be any outliers in the data? Are any predictors skewed?

(c) Are there any relevant transformations of one or more predictors that might improve the classiﬁcation model?

(a) Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?

(c) Develop a strategy for handling missing data, either by eliminating predictors or imputation.

End.

Data 624 Week3

Chunhui Zhu & Chunmei Zhu

9/14/2019

3.1 & 3.2

(a) Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.

(b) Do there appear to be any outliers in the data? Are any predictors skewed?

(c) Are there any relevant transformations of one or more predictors that might improve the classiﬁcation model?

(a) Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?

(b) Roughly 18% of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?

(c) Develop a strategy for handling missing data, either by eliminating predictors or imputation.

End.