Predictive Analytics

Question 3:1

The UC Irvine Machine Learning Repository contains the Glass dataset, which has 214 observations of 10 variables. The target variable, Type, takes one of 6 glass categories. We examine the distributions to determine whether any transformations are necessary.

## 'data.frame':    214 obs. of  10 variables:
##  $ RI  : num  1.52 1.52 1.52 1.52 1.52 ...
##  $ Na  : num  13.6 13.9 13.5 13.2 13.3 ...
##  $ Mg  : num  4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
##  $ Al  : num  1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
##  $ Si  : num  71.8 72.7 73 72.6 73.1 ...
##  $ K   : num  0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
##  $ Ca  : num  8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
##  $ Ba  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Fe  : num  0 0 0 0 0 0.26 0 0 0 0.11 ...
##  $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...

All 9 of our explanatory variables are skewed. Mg content is bimodal. Ba content and Fe content are both nearly degenerate: nearly all cases have the same value. We take a look at Mg to see whether it can be separated into values near each of its modes. We code each observation as 0 if its Mg level is below 2 and as 1 if it is above 2, then list the codes within each glass category.

## [1] "Glass category 1: building_windows_float_processed "

##  [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [36] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

## [1] "Glass category 2: building_windows_non_float_processed "

##  [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [36] 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 1 1 1 1 1 1 1 1
## [71] 1 1 1 1 1 1

## [1] "Glass category 3: vehicle_windows_float_processed "

##  [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

## [1] "Glass category 5: containers"

##  [1] 1 0 0 0 0 0 0 0 0 0 0 0 0

## [1] "Glass category 6: tableware"

## [1] 1 1 1 1 0 0 0 0 0

## [1] "Glass category 7: headlamps"

##  [1] 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

Whether Mg is high or low clearly carries meaning. Every observation in categories 1 and 3 is high in Mg, while categories 5 and 7 are nearly all low in Mg. If we mean to use this data for inference, it would make sense to change this variable into a categorical variable. If we mean to make a prediction, it is better to retain the information that may be in the original variable.
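The high/low split can be summarized more compactly as a cross-tabulation of class against the indicator. The sketch below is only an illustration: the data frame `toy` holds made-up values rather than the Glass data, and the names `mg_high`, `tab` and `prop_high` are hypothetical. It reuses the cut point of 2.

```r
# Illustrative data only: six observations with an Mg level and a class label
toy <- data.frame(
  Mg   = c(3.6, 3.5, 0.0, 1.2, 3.7, 0.0),
  Type = factor(c("1", "1", "5", "5", "3", "7"))
)

# Binary indicator: TRUE when Mg is above the cut point of 2
toy$mg_high <- toy$Mg > 2

# Rows are classes, columns are low (FALSE) and high (TRUE) Mg
tab <- table(Type = toy$Type, mg_high = toy$mg_high)
print(tab)

# Share of high-Mg observations within each class
prop_high <- prop.table(tab, margin = 1)[, "TRUE"]
print(prop_high)
```

A class whose share is exactly 0 or 1 is fully separated by the indicator, which is the pattern described for categories 1 and 3.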


We take a look at the correlations between our variables, treating the category label as a number. Refractive index is correlated with Ca, and including both might lead to collinearity, so we might consider combining these variables. Levels of Na, Al and Ba are positively correlated with category. Refractive index and levels of Mg and Fe are negatively correlated with category.

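We apply a Box-Cox transformation to RI, Al, Si, Ca, K and Ba. The skewness of RI and Si changes little (from 1.6 to 1.57 and from -0.72 to -0.65), while that of Ca moves from 2.02 to -0.19. In the histograms, RI, Al, Si and Ca are visibly changed and look much more normal. Mg, K, Ba and Fe do not allow a lambda to be estimated, so K and Ba are left unchanged. Each of these four variables has a minimum of 0, and the Box-Cox transformation is defined only for strictly positive values. Data with 2 peaks or with a high peak and little else also give a power transformation little to spread out.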

## [1] "skewness for transformed RI"

## [1] 1.56566

## [1] "skewness for transformed Si"

## [1] -0.6509057

## [1] "skewness for transformed Ca"

## [1] -0.1939557

## [1] "skewness for transformed K"

## [1] 6.460089

## [1] "skewness for transformed Ba"

## [1] 3.36868

## $RI
## Box-Cox Transformation
## 
## 214 data points used to estimate Lambda
## 
## Input data summary:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.511   1.517   1.518   1.518   1.519   1.534 
## 
## Largest/Smallest: 1.02 
## Sample Skewness: 1.6 
## 
## Estimated Lambda: -2 
## 
## 
## $Na
## Box-Cox Transformation
## 
## 214 data points used to estimate Lambda
## 
## Input data summary:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   10.73   12.91   13.30   13.41   13.82   17.38 
## 
## Largest/Smallest: 1.62 
## Sample Skewness: 0.448 
## 
## Estimated Lambda: -0.1 
## With fudge factor, Lambda = 0 will be used for transformations
## 
## 
## $Mg
## Box-Cox Transformation
## 
## 214 data points used to estimate Lambda
## 
## Input data summary:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   2.115   3.480   2.685   3.600   4.490 
## 
## Lambda could not be estimated; no transformation is applied
## 
## 
## $Al
## Box-Cox Transformation
## 
## 214 data points used to estimate Lambda
## 
## Input data summary:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.290   1.190   1.360   1.445   1.630   3.500 
## 
## Largest/Smallest: 12.1 
## Sample Skewness: 0.895 
## 
## Estimated Lambda: 0.5 
## 
## 
## $Si
## Box-Cox Transformation
## 
## 214 data points used to estimate Lambda
## 
## Input data summary:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   69.81   72.28   72.79   72.65   73.09   75.41 
## 
## Largest/Smallest: 1.08 
## Sample Skewness: -0.72 
## 
## Estimated Lambda: 2 
## 
## 
## $K
## Box-Cox Transformation
## 
## 214 data points used to estimate Lambda
## 
## Input data summary:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.1225  0.5550  0.4971  0.6100  6.2100 
## 
## Lambda could not be estimated; no transformation is applied
## 
## 
## $Ca
## Box-Cox Transformation
## 
## 214 data points used to estimate Lambda
## 
## Input data summary:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   5.430   8.240   8.600   8.957   9.172  16.190 
## 
## Largest/Smallest: 2.98 
## Sample Skewness: 2.02 
## 
## Estimated Lambda: -1.1 
## 
## 
## $Ba
## Box-Cox Transformation
## 
## 214 data points used to estimate Lambda
## 
## Input data summary:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   0.000   0.175   0.000   3.150 
## 
## Lambda could not be estimated; no transformation is applied
## 
## 
## $Fe
## Box-Cox Transformation
## 
## 214 data points used to estimate Lambda
## 
## Input data summary:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.05701 0.10000 0.51000 
## 
## Lambda could not be estimated; no transformation is applied
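To make the role of lambda and of the positivity requirement concrete, here is a minimal base-R sketch of the Box-Cox formula. It is not the caret implementation: `box_cox` and `skew` are hypothetical helper names, the data are simulated, and `skew` uses one common moment-based definition of sample skewness.

```r
# Box-Cox transform for strictly positive x:
#   (x^lambda - 1) / lambda  when lambda != 0,  log(x) when lambda == 0
box_cox <- function(x, lambda) {
  stopifnot(all(x > 0))            # undefined for zeros or negative values
  if (lambda == 0) log(x) else (x^lambda - 1) / lambda
}

# Moment-based sample skewness (one common definition)
skew <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

set.seed(1)
x <- rlnorm(200)                   # simulated right-skewed, strictly positive data
skew_before <- skew(x)
skew_after  <- skew(box_cox(x, 0)) # lambda = 0 is the log transform

# A variable that contains a zero fails the positivity check
zero_ok <- tryCatch({ box_cox(c(0, 1, 2), 0.5); TRUE },
                    error = function(e) FALSE)
```

With simulated log-normal data the log transform removes most of the skew, whereas a vector containing a zero is rejected, which mirrors the "Lambda could not be estimated" messages for variables whose minimum is 0.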

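One transformation that we could perform is the spatial sign. It reduces the influence of outliers by finding a center for the data and projecting each observation onto a unit sphere around that center, so that each point keeps only its direction from the center. All far-away points are collapsed to the same distance. This helps with outliers, but it makes the result more difficult to interpret. The predictors should normally be centered and scaled first; in the output below the transformation was applied without that step, so Si, the variable with the largest values, dominates every row. The nearZeroVar function returns integer(0), which tells us that none of the variables has near-zero variance.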

##             RI          Na          Mg           Al        Si            K
## 1 0.0001102115 0.005295597 0.001743199 0.0004071898 0.9999842 2.329441e-05
## 2 0.0001069841 0.005252677 0.001361385 0.0004410094 0.9999850 1.815180e-04
## 3 0.0001060694 0.005080150 0.001332929 0.0004659497 0.9999860 1.464345e-04
## 4 0.0001073436 0.005012061 0.001400038 0.0004309316 0.9999862 2.162661e-04
## 5 0.0001059414 0.004970262 0.001355867 0.0004170799 0.9999865 2.060018e-04
## 6 0.0001061040 0.004804939 0.001356203 0.0004781618 0.9999873 2.404348e-04
##             Ca Ba           Fe         Type
## 1 0.0003204744  0 0.000000e+00 0.0003882402
## 2 0.0003080446  0 0.000000e+00 0.0003781625
## 3 0.0003056029  0 0.000000e+00 0.0003754730
## 4 0.0003109311  0 0.000000e+00 0.0003794141
## 5 0.0003062572  0 0.000000e+00 0.0003745488
## 6 0.0003071817  0 9.767664e-05 0.0003756794

## integer(0)
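The spatial sign can be written in a few lines of base R. The sketch below is an illustration on simulated data, not the caret `spatialSign` function: `spatial_sign` is a hypothetical helper that centers and scales each column first and then divides every row by its Euclidean norm.

```r
# Spatial sign: center and scale the columns, then project each row
# onto the unit sphere by dividing it by its Euclidean norm
spatial_sign <- function(m) {
  z <- scale(m)                 # column-wise centering and scaling
  z / sqrt(rowSums(z^2))        # the vector recycles down rows: row i / norm i
}

set.seed(2)
m <- cbind(a = rnorm(20), b = rnorm(20, sd = 10))
m[1, ] <- c(50, 500)            # plant one extreme outlier (simulated)

ss <- spatial_sign(m)
row_norms <- sqrt(rowSums(ss^2))
print(head(round(ss, 3)))
```

After the transformation every row has length 1, so the planted outlier sits at the same distance from the center as every other point and only its direction is retained.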

Another possible transformation, principal component analysis (PCA), lets us set aside directions that add little to the variation in our data. It creates orthogonal components that capture the variability in the predictors. Another method, partial least squares, is supervised and also captures the relationship to the target variable. PCA components can also be hard to interpret, but less so than the spatial sign. If we're looking for a prediction, interpretability is less important than accuracy. In preparation for PCA, we center and rescale the data so that variables with large values aren't overrepresented without adding meaning. The first rows of our PCA components are:

## Created from 214 samples and 10 variables
## 
## Pre-processing:
##   - Box-Cox transformation (6)
##   - centered (10)
##   - ignored (0)
##   - principal component signal extraction (10)
##   - scaled (10)
## 
## Lambda estimates for Box-Cox transformation:
## -2, -0.1, 0.5, 2, -1.1, -0.4
## PCA needed 7 components to capture 95 percent of the variance

##        PC1     PC2     PC3     PC4     PC5     PC6     PC7
## 1  -1.7516 -0.0866  0.1803  1.7129 -0.2080  0.3577  0.4852
## 2  -0.3364 -1.2469  0.5573  0.9179 -0.1413  0.2050  0.0370
## 3  -0.0856 -1.5730  0.6492  0.3741 -0.1057  0.4656  0.3514
## 4  -0.7794 -1.1542  0.1637  0.4750 -0.4126  0.4812  0.0815
## 5  -0.6442 -1.3439  0.5764  0.1911 -0.3405  0.5404 -0.1597
## 6  -0.7322 -1.5567 -0.7533 -1.0335  1.8054  0.0734  0.1932
## 7  -0.7503 -1.2917  0.6387  0.2039 -0.3716  0.4825 -0.3342
## 8  -0.9090 -1.2835  0.7455  0.0081 -0.4119  0.5832 -0.4864
## 9  -0.6679 -0.5951  0.0101  1.4368 -0.3187 -0.0298  0.1866
## 10 -0.9118 -1.1252 -0.0409 -0.3801  0.4754  0.3617 -0.0025
## 11 -0.6731 -1.6282 -0.5379 -1.2339  1.5988  0.1706  0.0314
## 12 -0.9454 -1.1486  0.2631 -0.2082 -0.5652  0.7434  0.0123
## 13 -0.7425 -1.6099 -0.3464 -1.1218  1.6246  0.0673 -0.2721
## 14 -1.0921 -1.1582 -0.0503 -0.7993  0.9880  0.3190 -0.2284
## 15 -0.8995 -1.2408  0.3641 -0.5699 -0.5837  0.9290  0.0094
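The rule of keeping enough components to capture 95 percent of the variance can be sketched with base R's `prcomp`. This example uses the built-in `USArrests` data purely as a stand-in for a numeric predictor matrix; `var_explained`, `cum_var`, `n_comp` and `scores` are hypothetical names, and the 0.95 threshold matches the one reported in the output above.

```r
# PCA on centered and scaled data (USArrests is only a stand-in dataset)
pca <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# Share of total variance carried by each component, and the running total
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
cum_var <- cumsum(var_explained)

# Smallest number of components whose cumulative share reaches 95 percent
n_comp <- which(cum_var >= 0.95)[1]

# Scores on the retained components
scores <- pca$x[, seq_len(n_comp), drop = FALSE]
print(round(cum_var, 3))
print(head(round(scores, 4)))
```

Because the components are orthogonal, the retained score columns are uncorrelated with one another, which is what removes the collinearity noted earlier between correlated predictors.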

Question 3:2

Our next dataset, Soybean, relates to 19 classes of soybean plant, 4 of which are not well represented in the data. There are 35 possible predictor variables. In investigating our data, we note that some values are missing, and we wish to find out whether the missingness itself carries meaning. Recommender systems are a notable case where the absence of a value can be used as evidence: a movie watcher or book reader may leave an item unrated because they have no interest in it, so even a low rating may signal more interest in the subject than no rating at all.

## 'data.frame':    683 obs. of  36 variables:
##  $ Class          : Factor w/ 19 levels "2-4-d-injury",..: 11 11 11 11 11 11 11 11 11 11 ...
##  $ date           : Factor w/ 7 levels "0","1","2","3",..: 7 5 4 4 7 6 6 5 7 5 ...
##  $ plant.stand    : Ord.factor w/ 2 levels "0"<"1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ precip         : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ temp           : Ord.factor w/ 3 levels "0"<"1"<"2": 2 2 2 2 2 2 2 2 2 2 ...
##  $ hail           : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 1 ...
##  $ crop.hist      : Factor w/ 4 levels "0","1","2","3": 2 3 2 2 3 4 3 2 4 3 ...
##  $ area.dam       : Factor w/ 4 levels "0","1","2","3": 2 1 1 1 1 1 1 1 1 1 ...
##  $ sever          : Factor w/ 3 levels "0","1","2": 2 3 3 3 2 2 2 2 2 3 ...
##  $ seed.tmt       : Factor w/ 3 levels "0","1","2": 1 2 2 1 1 1 2 1 2 1 ...
##  $ germ           : Ord.factor w/ 3 levels "0"<"1"<"2": 1 2 3 2 3 2 1 3 2 3 ...
##  $ plant.growth   : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ leaves         : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ leaf.halo      : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.marg      : Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ leaf.size      : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ leaf.shread    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.malf      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.mild      : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ stem           : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ lodging        : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 2 1 1 1 ...
##  $ stem.cankers   : Factor w/ 4 levels "0","1","2","3": 4 4 4 4 4 4 4 4 4 4 ...
##  $ canker.lesion  : Factor w/ 4 levels "0","1","2","3": 2 2 1 1 2 1 2 2 2 2 ...
##  $ fruiting.bodies: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ ext.decay      : Factor w/ 3 levels "0","1","2": 2 2 2 2 2 2 2 2 2 2 ...
##  $ mycelium       : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ int.discolor   : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ sclerotia      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ fruit.pods     : Factor w/ 4 levels "0","1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
##  $ fruit.spots    : Factor w/ 4 levels "0","1","2","4": 4 4 4 4 4 4 4 4 4 4 ...
##  $ seed           : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ mold.growth    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ seed.discolor  : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ seed.size      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ shriveling     : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ roots          : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...

## 'data.frame':    683 obs. of  19 variables:
##  $ hail           : num  0 0 0 0 0 0 0 1 0 0 ...
##  $ sever          : num  1 2 2 2 1 1 1 1 1 2 ...
##  $ seed.tmt       : num  0 1 1 0 0 0 1 0 1 0 ...
##  $ germ           : num  0 1 2 1 2 1 0 2 1 2 ...
##  $ leaf.halo      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ leaf.marg      : num  2 2 2 2 2 2 2 2 2 2 ...
##  $ leaf.size      : num  2 2 2 2 2 2 2 2 2 2 ...
##  $ leaf.shread    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ leaf.malf      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ leaf.mild      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ lodging        : num  1 0 0 0 0 0 1 0 0 0 ...
##  $ fruiting.bodies: num  1 1 1 1 1 1 1 1 1 1 ...
##  $ fruit.pods     : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ fruit.spots    : num  4 4 4 4 4 4 4 4 4 4 ...
##  $ seed           : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ mold.growth    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ seed.discolor  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ seed.size      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ shriveling     : num  0 0 0 0 0 0 0 0 0 0 ...

Here, we note that 19 variables are missing more than 10 percent of their values. The average values of our 19 variables are:

##            hail           sever        seed.tmt            germ 
##      0.22597865      0.73309609      0.51957295      1.04903678 
##       leaf.halo       leaf.marg       leaf.size     leaf.shread 
##      1.20200334      0.77295492      1.28380634      0.16466552 
##       leaf.malf       leaf.mild         lodging fruiting.bodies 
##      0.07512521      0.10434783      0.07473310      0.18024263 
##      fruit.pods     fruit.spots            seed     mold.growth 
##      0.50417362      1.02079723      0.19458545      0.11336717 
##   seed.discolor       seed.size      shriveling 
##      0.11091854      0.09983080      0.06585789
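Selecting the variables that are missing more than 10 percent of their values amounts to comparing a per-column share of NAs with a threshold. The sketch below uses a small made-up data frame, not the Soybean data; `toy`, `miss_share` and `lacking` are hypothetical names.

```r
# Illustrative data: 20 rows, with column a missing 4 values and b missing 1
toy <- data.frame(
  a = c(NA, 2, NA, 4, 5, NA, 7, 8, 9, 10, 11, 12, NA, 14, 15, 16, 17, 18, 19, 20),
  b = c(1:19, NA),
  c = 1:20
)

# Share of missing values in each column
miss_share <- colMeans(is.na(toy))
print(miss_share)

# Columns missing more than 10 percent of their values
lacking <- names(miss_share)[miss_share > 0.10]
print(lacking)
```

The same comparison, applied to a count of 68 out of 683 rows, yields the 19 variables discussed above.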

Amongst these 19 variables, there are some that are highly correlated with one another. Leaf.marg and leaf.size are correlated. Leaf.halo and leaf.marg are strongly negatively correlated. Fruit.pods and fruit.spots are correlated with our target variable.

The correlation plots for the remaining variables, which are mostly complete, do not appear strikingly different from those of the previous group. Precip and int.discolor are negatively correlated; enough rain may keep soybeans from being discolored. Stem is correlated with stem.cankers and canker.lesion. Plant.growth, stem, stem.cankers, canker.lesion, ext.decay and roots are correlated with our target variable. Date, temp, area.dam and int.discolor are negatively correlated with our target variable.

All of our variables are categorical. 15 variables have 3 values including NA. Further exploration might look at whether NA and missing mean the same thing or something different. Perhaps the variables don't apply to some varieties. Perhaps the values were recorded as being missing. The variable leaves has only 2 categories. Leaves, leaf.shread, leaf.malf, leaf.mild, mycelium, int.discolor, sclerotia, seed.size, shriveling and roots all have rather uneven distributions. In the investigation stage, it would be useful to look at contingency tables, chi-square tests or other categorical tests. Sometimes a lopsided variable turns out to mark a specific feature shared by only one or a few categories. That would be worth finding before a model is fit.
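A contingency table with a chi-square test is one way to check whether a lopsided categorical variable is tied to particular classes. The sketch below uses made-up data in which a binary feature is almost entirely confined to one class; `toy`, `tab` and `test` are hypothetical names, and the counts are chosen only for illustration.

```r
# Illustrative data: three classes of 20 rows each and one binary feature
toy <- data.frame(
  class   = factor(rep(c("A", "B", "C"), each = 20)),
  feature = factor(c(rep("0", 20),                      # class A: never present
                     rep(c("0", "1"), times = c(18, 2)), # class B: rarely present
                     rep("1", 20)))                      # class C: always present
)

# Cross-tabulate class against the feature
tab <- table(class = toy$class, feature = toy$feature)
print(tab)

# Chi-square test of independence between class and feature
test <- chisq.test(tab)
print(test$p.value)
```

A very small p-value suggests that the feature is not independent of class. With sparse tables, such as those involving the 4 poorly represented Soybean classes, the chi-square approximation may be unreliable, so an exact test could be a better choice there.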

—————————————————————————

_Appendix______________________

```r
# Packages assumed by the report (its setup chunk is not shown in the source)
library(mlbench)    # Glass and Soybean data
library(caret)      # BoxCoxTrans, spatialSign, nearZeroVar, preProcess
library(e1071)      # skewness (assumed source of skewness())
library(ggplot2)
library(corrplot)
library(showtext)   # loads sysfonts, which provides font_add_google

font_add_google(name = "Corben", family = "corben", regular.wt = 400, bold.wt = 700)

# ---- Question 3:1: Glass ----
data(Glass)
str(Glass)
# Create another copy of Glass to perform transformations on
Glass_ready <- Glass

# Prepare labels for ggplot
label_set <- paste0(c("RI", "Na", "Mg", "Al", "Si", "K", "Ca", "Ba", "Fe"), " level")
skew_num <- round(apply(Glass[, 1:9], 2, skewness), digits = 3)
skew_num <- paste0("original skewness is ", skew_num)

# Histogram of one column, selected by name
plotter <- function(df, x.axis, lab_num, colorbar = "#dee253") {
  ggplot(df, aes(x = .data[[x.axis]])) +
    geom_histogram(alpha = .6, color = "gray", fill = colorbar) +
    ylab(label_set[lab_num]) + xlab(skew_num[lab_num]) +
    theme(axis.text.x = element_text(angle = 30, hjust = .9),
          text = element_text(family = "corben", color = "#249382", size = 16))
}

# Inside a loop, ggplot objects must be printed explicitly
for (i in 1:9) print(plotter(Glass, names(Glass)[i], i))

# Split Mg at 2: 0 = low, 1 = high (no observation equals 2 exactly)
Mg_Var <- as.numeric(Glass$Mg > 2)
type_names <- c("1" = "building_windows_float_processed",
                "2" = "building_windows_non_float_processed",
                "3" = "vehicle_windows_float_processed",
                "5" = "containers",
                "6" = "tableware",
                "7" = "headlamps")
for (k in names(type_names)) {
  print(paste0("Glass category ", k, ": ", type_names[[k]]))
  print(Mg_Var[Glass$Type == k])
}

# Convert the Type factor to its numeric label so it can enter cor()
Glass$Type <- as.numeric(levels(Glass$Type))[Glass$Type]
Glass_ready$Type <- as.numeric(levels(Glass_ready$Type))[Glass_ready$Type]
correl.matrix <- cor(Glass, use = "complete.obs")
corrplot(correl.matrix, method = "color", type = "upper")

# Box-Cox transformations
Glass_ready$Al <- Glass_ready$Al^.5   # square root, matching the estimated lambda of 0.5
for (v in c("RI", "Si", "Ca", "K", "Ba")) {
  transformed <- BoxCoxTrans(Glass[[v]])
  Glass_ready[[v]] <- predict(transformed, Glass[[v]])
  print(paste("skewness for transformed", v))
  print(skewness(Glass_ready[[v]]))
}

# Original (default color) and transformed (red) histograms side by side
for (i in c(1, 4, 5, 7, 6, 8)) {
  print(plotter(Glass, names(Glass)[i], i))
  print(plotter(Glass_ready, names(Glass_ready)[i], i, "red"))
}

transformations <- apply(Glass[, -10], 2, BoxCoxTrans)
transformations
sp_sign_transform <- spatialSign(Glass_ready)
head(sp_sign_transform)

# Any near-zero-variance variables?
nearZeroVar(Glass)   # integer(0): none

pca_data <- preProcess(x = Glass, method = c("BoxCox", "center", "scale", "pca"))
pca_data
round(predict(pca_data, Glass)[1:15, ], 4)

# ---- Question 3:2: Soybean ----
data(Soybean)
str(Soybean)

# Count the NAs in each column
variable_index <- data.frame(variable = colnames(Soybean),
                             number_missing = colSums(is.na(Soybean)))
ggplot(data = variable_index, aes(x = variable, y = number_missing)) +
  geom_bar(alpha = .6, color = "gray", fill = "#249382", stat = "identity") +
  theme(axis.text.x = element_text(angle = 30, hjust = .9),
        text = element_text(family = "corben", color = "#249382", size = 16)) +
  ylim(0, 683)

# Variables missing more than 10 percent (68 of 683) of their values
variable_lacking_index <- variable_index[variable_index[, 2] > 68, ]
Soybean_lacking_set <- Soybean[, colnames(Soybean) %in% variable_lacking_index[, 1]]

# Convert a factor with numeric level labels to numbers
defactorer <- function(x) as.numeric(levels(x))[x]
for (i in seq_along(Soybean_lacking_set)) {
  Soybean_lacking_set[, i] <- defactorer(Soybean_lacking_set[, i])
}
str(Soybean_lacking_set)
apply(Soybean_lacking_set, 2, mean, na.rm = TRUE)

# Correlations among the values of the frequently missing variables
Soybean_lacking_setb <- cbind(Class = as.numeric(Soybean$Class), Soybean_lacking_set)
correl.matrix <- cor(Soybean_lacking_setb, use = "complete.obs")
corrplot(correl.matrix, method = "color", type = "upper")

# Correlations among missingness indicators: 1 = missing, 0 = present
Soybean_lacking_set <- as.data.frame(1 * is.na(Soybean_lacking_set))
Soybean_lacking_set <- cbind(Class = as.numeric(Soybean$Class), Soybean_lacking_set)
Soybean$Class <- as.numeric(Soybean$Class)
correl.matrix <- cor(Soybean_lacking_set, use = "complete.obs")
corrplot(correl.matrix, method = "color", type = "upper")

# Correlations among the variables that are mostly complete
Soybean_full_cases_vars_set <- Soybean[, !(colnames(Soybean) %in% variable_lacking_index[, 1])]
for (i in 2:17) {
  Soybean_full_cases_vars_set[, i] <- defactorer(Soybean_full_cases_vars_set[, i])
}
correl.matrix <- cor(Soybean_full_cases_vars_set, use = "complete.obs")
corrplot(correl.matrix, method = "color", type = "upper")

# Bar chart of the category counts for one column, selected by name
col_label <- colnames(Soybean)
plotter <- function(df, x.axis, lab_num, colorbar = "#249382") {
  ggplot(df, aes(x = .data[[x.axis]])) +
    geom_bar(alpha = .6, color = "gray", fill = colorbar) +
    ylab(col_label[lab_num]) + xlab("") +
    theme(axis.text.x = element_text(angle = 30, hjust = .9),
          text = element_text(family = "corben", color = "#dee253", size = 16))
}

# ggplot objects are drawn from a for loop only when printed explicitly
for (i in 1:36) print(plotter(Soybean, col_label[i], i))
```

Dan Wigodsky

Data 624 Homework 4

February 28, 2019

We also want to look at correlations specifically related to missingness in the Soybean data. We find that many variables are missing at the same time, so their missingness indicators are correlated with one another. They do not, however, appear to be very correlated with the target variable.
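The idea of correlating missingness itself can be sketched by replacing every value with an indicator (1 = missing, 0 = present) and correlating the indicators. The data frame below is made up for illustration and is not the Soybean data; `toy`, `miss` and `miss_cor` are hypothetical names.

```r
# Illustrative data: x and y are missing in the same rows, z in different rows
toy <- data.frame(
  x = c(1, NA, 3, NA, 5, 6, NA, 8),
  y = c(2, NA, 1, NA, 2, 1, NA, 2),
  z = c(NA, 1, 0, 1, 1, NA, 0, 1)
)

# Missingness indicators: 1 where the value is NA, 0 otherwise
miss <- 1 * is.na(toy)

# Correlation between the indicators
miss_cor <- cor(miss)
print(round(miss_cor, 3))
```

A correlation of 1 between two indicators means the two variables always drop out together, which is the pattern reported for many of the Soybean variables.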
