Exercise 2 Data Pre-Processing

Author

Sara Nishiyama

library(mlbench)
library(ggplot2)
Warning: package 'ggplot2' was built under R version 4.5.2
library(tidyr)
library(dplyr)

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
library(caret)
Loading required package: lattice
library(AppliedPredictiveModeling)
data(Glass)
str(Glass)
'data.frame':   214 obs. of  10 variables:
 $ RI  : num  1.52 1.52 1.52 1.52 1.52 ...
 $ Na  : num  13.6 13.9 13.5 13.2 13.3 ...
 $ Mg  : num  4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
 $ Al  : num  1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
 $ Si  : num  71.8 72.7 73 72.6 73.1 ...
 $ K   : num  0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
 $ Ca  : num  8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
 $ Ba  : num  0 0 0 0 0 0 0 0 0 0 ...
 $ Fe  : num  0 0 0 0 0 0.26 0 0 0 0.11 ...
 $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
head(Glass)
       RI    Na   Mg   Al    Si    K   Ca Ba   Fe Type
1 1.52101 13.64 4.49 1.10 71.78 0.06 8.75  0 0.00    1
2 1.51761 13.89 3.60 1.36 72.73 0.48 7.83  0 0.00    1
3 1.51618 13.53 3.55 1.54 72.99 0.39 7.78  0 0.00    1
4 1.51766 13.21 3.69 1.29 72.61 0.57 8.22  0 0.00    1
5 1.51742 13.27 3.62 1.24 73.08 0.55 8.07  0 0.00    1
6 1.51596 12.79 3.61 1.62 72.97 0.64 8.07  0 0.26    1

A. Visualizing Distributions with Histograms

glass_long<-Glass|>
  pivot_longer(cols=-Type, names_to="Predictor", values_to="Value")

Histograms: Si and Na are the primary components by concentration. Ba, Fe, and Ca are heavily right-skewed, and K and RI also appear right-skewed. Mg and Si are left-skewed, while Na and Al appear approximately normally distributed.

ggplot(glass_long, aes(x=Value))+
  geom_histogram(bins=15, fill="steelblue", color="white")+
  facet_wrap(~Predictor, scales="free")+
  theme_minimal()+
  labs(title="Distribution of Glass Predictors")

A. Visualizing Outliers with Box Plots

Almost all of the box plots show outliers, so scaling will be needed. K and Ba show extreme outliers, while Na, Al, Si, and Ca show moderate ones.

ggplot(glass_long, aes(x=Predictor, y=Value, fill=Predictor))+
  geom_boxplot(outlier.color="red", outlier.shape=16)+
  facet_wrap(~Predictor, scales="free")+
  theme_minimal()+
  theme(legend.position="none")+
  labs(title="Boxplots for Outlier Detection",
       y="Concentration/Index Value")
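
To put numbers on what the box plots show, the points beyond the whiskers can be counted per predictor. This is a minimal sketch using base R's `boxplot.stats()`, which applies the usual 1.5 * IQR rule; the name `outlier_counts` is introduced here for illustration, and the counts may differ slightly from the ggplot2 panels because the two functions compute the quartiles a little differently.

```r
library(mlbench)
data(Glass)

# Count the points that fall beyond the 1.5 * IQR whiskers for each predictor
outlier_counts <- sapply(Glass[, 1:9], function(x) length(boxplot.stats(x)$out))

# Show the predictors with the most flagged points first
sort(outlier_counts, decreasing = TRUE)
```

For Ba the first and third quartiles are both 0, so the IQR is 0 and every nonzero value is flagged. That is a property of the rule rather than evidence that each of those samples is anomalous.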

A. Visualizing Relationships with Scatter Plots

The scatter plot shows a strong positive linear relationship between Ca and RI. Type 2 glass seems to have the most outliers.

ggplot(Glass, aes(x=Ca, y=RI, color=Type))+
  geom_point(alpha=0.7)+
  theme_light()+
  labs(title="Relationship: RI vs Calcium",
       subtitle="Color-coded by Glass Type")
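
The strength of the Ca and RI relationship can also be checked numerically. The sketch below computes the Pearson correlation and, as a check that is less sensitive to the outlying points, the Spearman rank correlation; the variable names are chosen here for illustration.

```r
library(mlbench)
data(Glass)

# Linear association between calcium content and refractive index
r_pearson <- cor(Glass$Ca, Glass$RI)

# Rank-based version, less influenced by the extreme Ca values
r_spearman <- cor(Glass$Ca, Glass$RI, method = "spearman")

round(c(pearson = r_pearson, spearman = r_spearman), 3)
```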

B. Outliers and Skewness

The output below shows that K (6.46) and Ba (3.37) are heavily right-skewed, while Mg (-1.14) is left-skewed.

library(e1071)
Warning: package 'e1071' was built under R version 4.5.2

Attaching package: 'e1071'
The following object is masked from 'package:ggplot2':

    element
original_skew<-apply(Glass[, 1:9], 2, skewness)
print("Original Skewness")
[1] "Original Skewness"
print(original_skew)
        RI         Na         Mg         Al         Si          K         Ca 
 1.6027151  0.4478343 -1.1364523  0.8946104 -0.7202392  6.4600889  2.0184463 
        Ba         Fe 
 3.3686800  1.7298107 
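
For reference, `e1071::skewness()` with its default `type = 3` is the third central moment divided by the cube of the sample standard deviation. The sketch below recomputes it by hand to make that definition concrete; `manual_skew` is a helper written for this illustration and is not part of any package.

```r
library(mlbench)
library(e1071)
data(Glass)

# Skewness as the mean cubed deviation over the cubed sample standard deviation
manual_skew <- function(x) {
  d <- x - mean(x)
  mean(d^3) / sd(x)^3
}

manual <- sapply(Glass[, 1:9], manual_skew)
packaged <- apply(Glass[, 1:9], 2, skewness)

# The two versions should agree up to floating-point error
all.equal(manual, packaged)
```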

K and Ba show large gaps between the 3rd quartile and the maximum value. This confirms the extreme outliers seen in the boxplots.

summary(Glass[, 1:9])
       RI              Na              Mg              Al       
 Min.   :1.511   Min.   :10.73   Min.   :0.000   Min.   :0.290  
 1st Qu.:1.517   1st Qu.:12.91   1st Qu.:2.115   1st Qu.:1.190  
 Median :1.518   Median :13.30   Median :3.480   Median :1.360  
 Mean   :1.518   Mean   :13.41   Mean   :2.685   Mean   :1.445  
 3rd Qu.:1.519   3rd Qu.:13.82   3rd Qu.:3.600   3rd Qu.:1.630  
 Max.   :1.534   Max.   :17.38   Max.   :4.490   Max.   :3.500  
       Si              K                Ca               Ba       
 Min.   :69.81   Min.   :0.0000   Min.   : 5.430   Min.   :0.000  
 1st Qu.:72.28   1st Qu.:0.1225   1st Qu.: 8.240   1st Qu.:0.000  
 Median :72.79   Median :0.5550   Median : 8.600   Median :0.000  
 Mean   :72.65   Mean   :0.4971   Mean   : 8.957   Mean   :0.175  
 3rd Qu.:73.09   3rd Qu.:0.6100   3rd Qu.: 9.172   3rd Qu.:0.000  
 Max.   :75.41   Max.   :6.2100   Max.   :16.190   Max.   :3.150  
       Fe         
 Min.   :0.00000  
 1st Qu.:0.00000  
 Median :0.00000  
 Mean   :0.05701  
 3rd Qu.:0.10000  
 Max.   :0.51000  
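
The gap between the 3rd quartile and the maximum can be tabulated directly instead of read off the summary. This is a rough sketch; `gap_table` and its column names are made up for this example, and dividing the gap by the standard deviation is just one simple way to make predictors on different scales comparable.

```r
library(mlbench)
data(Glass)

predictors <- Glass[, 1:9]

# Quartiles per predictor (unname() drops the "25%" / "75%" labels)
q1 <- sapply(predictors, function(x) unname(quantile(x, 0.25)))
q3 <- sapply(predictors, function(x) unname(quantile(x, 0.75)))

# Distance from the 3rd quartile to the maximum, raw and in SD units
upper_gap <- sapply(predictors, max) - q3
gap_table <- data.frame(
  IQR = q3 - q1,
  upper_gap = upper_gap,
  gap_in_sd = upper_gap / sapply(predictors, sd)
)

gap_table[order(-gap_table$gap_in_sd), ]
```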

C. Predictor Transformations

Centering and scaling were included so that predictors with larger raw values (such as Si, around 72) do not dominate predictors with smaller values (such as RI, around 1.5) in the SVM calculations. The Yeo-Johnson transformation was used to reduce skewness because, unlike Box-Cox, it can handle predictors that contain zeros, such as Fe and Ba. Lastly, the spatial sign transformation was used to pull the extreme outliers toward the center so that they do not distort the SVM decision boundary.

transform_plan<-preProcess(Glass[, 1:9],
                           method=c("YeoJohnson", "center", "scale", "spatialSign"))
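
To show why Yeo-Johnson copes with zeros, here is a minimal sketch of the transformation for a fixed power parameter. The function `yeo_johnson` and the `lambda` values are illustrative assumptions; `preProcess()` estimates a separate lambda for each predictor from the data, and this is not its internal code.

```r
# Yeo-Johnson transform for a given lambda (illustrative helper)
yeo_johnson <- function(x, lambda) {
  out <- numeric(length(x))
  pos <- x >= 0

  # Non-negative values: a Box-Cox style transform applied to x + 1
  if (abs(lambda) > 1e-8) {
    out[pos] <- ((x[pos] + 1)^lambda - 1) / lambda
  } else {
    out[pos] <- log1p(x[pos])
  }

  # Negative values: mirrored transform with power 2 - lambda
  if (abs(lambda - 2) > 1e-8) {
    out[!pos] <- -((-x[!pos] + 1)^(2 - lambda) - 1) / (2 - lambda)
  } else {
    out[!pos] <- -log1p(-x[!pos])
  }
  out
}

# Zeros (as in Fe and Ba) stay finite, because the transform works on x + 1
yeo_johnson(c(0, 0.26, 0.11), lambda = 0)

# lambda = 1 leaves the data unchanged
yeo_johnson(c(-1, 0, 1), lambda = 1)
```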

transform data

glass_transformed<-predict(transform_plan, Glass[,1:9])

check new skewness

new_skew<-apply(glass_transformed, 2, skewness)
print("Skewness after Yeo-Johnson:")
[1] "Skewness after Yeo-Johnson:"
print(new_skew)
         RI          Na          Mg          Al          Si           K 
 0.55680250 -0.04058783 -0.57412423 -0.04637538 -0.40715085 -0.63327286 
         Ca          Ba          Fe 
 0.26099372  2.07809167  0.92188227 
par(mfrow=c(1,2))
plot(Glass$Ca, Glass$RI, main="Original (with outliers)", col="red")
plot(glass_transformed$Ca,glass_transformed$RI, main="After Spatial Sign", col="green")
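
The spatial sign step divides each centered and scaled sample by its Euclidean norm, so every sample ends up on the unit sphere. The sketch below does this by hand on the standardized predictors; it skips the Yeo-Johnson step for brevity, so the values will not match `glass_transformed` exactly.

```r
library(mlbench)
data(Glass)

# Center and scale each predictor
z <- scale(Glass[, 1:9])

# Euclidean length of each sample (row)
row_norm <- sqrt(rowSums(z^2))

# Dividing a matrix by a vector of length nrow rescales each row by its own norm
ss <- z / row_norm

# Every row now has squared length 1, up to floating-point error
range(rowSums(ss^2))
```

Because every sample is projected to the same distance from the center, an extreme K or Ba value can change the direction of a point but not how far out it sits.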

D. SVM Model Tuning

library(kernlab)

Attaching package: 'kernlab'
The following object is masked from 'package:ggplot2':

    alpha
set.seed(321)
sigDist<-sigest(Type~., data=Glass, frac=1)
sigDist
       90%        50%        10% 
0.02181313 0.08309744 0.68601491 
svmTuneGrid<-data.frame(sigma=as.vector(sigDist)[1],C=2^(-2:10))
svmTuneGrid
        sigma       C
1  0.02181313    0.25
2  0.02181313    0.50
3  0.02181313    1.00
4  0.02181313    2.00
5  0.02181313    4.00
6  0.02181313    8.00
7  0.02181313   16.00
8  0.02181313   32.00
9  0.02181313   64.00
10 0.02181313  128.00
11 0.02181313  256.00
12 0.02181313  512.00
13 0.02181313 1024.00
set.seed(1056)
svmFit<-train(Type~., data=Glass, method="svmRadial",
              preProc=c("YeoJohnson","center", "scale", "spatialSign"),tuneGrid=svmTuneGrid,
              trControl=trainControl(method="repeatedcv", repeats=5))
plot(svmFit, scales=list(x=list(log=2)))
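
For context on the two tuning parameters: `sigma` sets the width of the radial basis function kernel, which kernlab defines as exp(-sigma * ||x - y||^2), while `C` is the cost of margin violations. The sketch below evaluates the kernel by hand and with `rbfdot()`, using the sigma value from the grid above; the two points `x` and `y` are made-up examples.

```r
library(kernlab)

sigma <- 0.02181313          # value held fixed in the tuning grid above
rbf <- rbfdot(sigma = sigma)

# Two hypothetical samples, for illustration only
x <- c(1, 0)
y <- c(0, 1)

# Kernel value from kernlab and from the formula
k_kernlab <- as.numeric(rbf(x, y))
k_manual <- exp(-sigma * sum((x - y)^2))
all.equal(k_kernlab, k_manual)

# Cost values explored by the grid: 2^-2 up to 2^10
2^(-2:10)
```

A small sigma makes the kernel decay slowly with distance, which gives smoother decision boundaries; larger C values penalize training errors more heavily.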

The model performs best when the Cost parameter is set to 2^6 (C = 64). At this setting, the SVM correctly identifies the glass type about 74% of the time under repeated cross-validation. Pushing the Cost higher risks overfitting the model.