Histograms: Si and Na are the primary components by concentration. Ba, Fe, and Ca are heavily right-skewed, and K and RI also appear right-skewed. Mg and Si are left-skewed, while Na and Al appear approximately normally distributed.
ggplot(glass_long, aes(x = Value)) +
  geom_histogram(bins = 15, fill = "steelblue", color = "white") +
  facet_wrap(~Predictor, scales = "free") +
  theme_minimal() +
  labs(title = "Distribution of Glass Predictors")
A. Visualizing Outliers with Box Plots
Almost all of the box plots show outliers, so scaling will be needed. K and Ba show extreme outliers; Na, Al, Si, and Ca show moderate outliers.
ggplot(glass_long, aes(x = Predictor, y = Value, fill = Predictor)) +
  geom_boxplot(outlier.color = "red", outlier.shape = 16) +
  facet_wrap(~Predictor, scales = "free") +
  theme_minimal() +
  theme(legend.position = "none") +
  labs(title = "Boxplots for Outlier Detection", y = "Concentration/Index Value")
A. Visualizing Relationships with Scatter Plots
The scatter plot shows a strong positive linear relationship between Ca and RI. Type 2 glass seems to have the most outliers.
ggplot(Glass, aes(x = Ca, y = RI, color = Type)) +
  geom_point(alpha = 0.7) +
  theme_light() +
  labs(title = "Relationship: RI vs Calcium", subtitle = "Color-coded by Glass Type")
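To put a number on the relationship seen in the scatter plot, the Pearson correlation between Ca and RI can be computed. The sketch below assumes the Glass data come from the mlbench package, which the text does not state explicitly; the variable name `ca_ri_cor` is illustrative, and the guard keeps the block from failing when mlbench is not installed.

```r
# Assumption: Glass is the data set shipped with the mlbench package.
if (requireNamespace("mlbench", quietly = TRUE)) {
  data("Glass", package = "mlbench")

  # Pearson correlation between calcium content and refractive index
  ca_ri_cor <- cor(Glass$Ca, Glass$RI)
  print(round(ca_ri_cor, 3))

  # Number of samples per glass type, for context when reading the colors
  print(table(Glass$Type))
} else {
  message("mlbench is not installed; skipping the correlation check.")
}
```

A value close to 1 would support the visual impression of a strong positive linear relationship.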
B. Outliers and Skewness
The skewness values show that K (6.46) and Ba (3.37) are heavily right-skewed, while Mg (-1.14) is left-skewed.
library(e1071)
        RI         Na         Mg         Al         Si          K         Ca
 1.6027151  0.4478343 -1.1364523  0.8946104 -0.7202392  6.4600889  2.0184463
        Ba         Fe
 3.3686800  1.7298107
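These values can be cross-checked by hand. The sketch below assumes the figures come from e1071's `skewness()` with its default `type = 3`, which divides the third central moment by the cube of the sample standard deviation. The helper name `skew_type3` and the toy vector are illustrative only and are not part of the analysis.

```r
# Hand-rolled sample skewness, intended to match e1071::skewness(x, type = 3):
# third central moment divided by the cubed sample standard deviation.
skew_type3 <- function(x) {
  x <- x[!is.na(x)]
  centered <- x - mean(x)
  m3 <- mean(centered^3)   # third central moment (divides by n)
  m3 / sd(x)^3             # sd() uses the n - 1 denominator
}

# Toy data (not taken from Glass): mostly zeros with a few large values
toy <- c(0, 0, 0, 0, 0, 0, 0.1, 0.2, 1.5, 3.0)
print(skew_type3(toy))     # positive, i.e. right-skewed
print(skew_type3(-toy))    # same magnitude with the sign flipped

# With the real data one could run, for example:
# sapply(Glass[, 1:9], skew_type3)
```

A long right tail pushes the statistic above zero, which is the pattern reported for K and Ba; a long left tail pushes it below zero, as for Mg.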
K and Ba show large gaps between the third quartile and the maximum (K: 0.61 vs. 6.21; Ba: 0.00 vs. 3.15). This confirms the extreme outliers seen in the box plots.
summary(Glass[, 1:9])
       RI              Na              Mg              Al
 Min.   :1.511   Min.   :10.73   Min.   :0.000   Min.   :0.290
 1st Qu.:1.517   1st Qu.:12.91   1st Qu.:2.115   1st Qu.:1.190
 Median :1.518   Median :13.30   Median :3.480   Median :1.360
 Mean   :1.518   Mean   :13.41   Mean   :2.685   Mean   :1.445
 3rd Qu.:1.519   3rd Qu.:13.82   3rd Qu.:3.600   3rd Qu.:1.630
 Max.   :1.534   Max.   :17.38   Max.   :4.490   Max.   :3.500
       Si              K                Ca               Ba
 Min.   :69.81   Min.   :0.0000   Min.   : 5.430   Min.   :0.000
 1st Qu.:72.28   1st Qu.:0.1225   1st Qu.: 8.240   1st Qu.:0.000
 Median :72.79   Median :0.5550   Median : 8.600   Median :0.000
 Mean   :72.65   Mean   :0.4971   Mean   : 8.957   Mean   :0.175
 3rd Qu.:73.09   3rd Qu.:0.6100   3rd Qu.: 9.172   3rd Qu.:0.000
 Max.   :75.41   Max.   :6.2100   Max.   :16.190   Max.   :3.150
       Fe
 Min.   :0.00000
 1st Qu.:0.00000
 Median :0.00000
 Mean   :0.05701
 3rd Qu.:0.10000
 Max.   :0.51000
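The gap between the third quartile and the maximum can be made concrete with the usual 1.5 x IQR box plot rule. The sketch below plugs in the quartiles printed in the summary for K and Ba; the data frame name `q` and its column names are illustrative.

```r
# Quartiles and maxima copied from the summary() output for K and Ba
q <- data.frame(
  predictor = c("K", "Ba"),
  q1  = c(0.1225, 0.000),
  q3  = c(0.6100, 0.000),
  max = c(6.2100, 3.150)
)

# Upper box plot fence: Q3 + 1.5 * IQR
q$iqr         <- q$q3 - q$q1
q$upper_fence <- q$q3 + 1.5 * q$iqr

# Does the maximum fall beyond the fence?
q$max_beyond  <- q$max > q$upper_fence

print(q)
```

For K the fence works out to about 1.34 against a maximum of 6.21. For Ba the interquartile range is zero, so every non-zero reading falls outside the fence.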
C. Predictor Transformations.
Centering and scaling were included so that predictors with large raw values (such as Si, around 72) do not dominate predictors with small raw values (such as RI, around 1.5) in the SVM calculations. The Yeo-Johnson transformation was used to reduce skewness; unlike Box-Cox, it accepts zero values, which occur in predictors such as Fe and Ba. Lastly, the spatial sign transformation was applied to pull extreme outliers toward the center so that they do not distort the SVM decision boundary.
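The sequence described above can be sketched in base R. This is not the code used for the analysis, which is not shown here. The Yeo-Johnson parameter `lambda` is fixed by hand rather than estimated, the function names `yeo_johnson` and `spatial_sign` are made up for illustration, and the small matrix is toy data.

```r
# Yeo-Johnson transform for a fixed lambda. In practice lambda is estimated
# per predictor; fixing it here is a simplification for illustration.
yeo_johnson <- function(x, lambda) {
  out <- numeric(length(x))
  pos <- x >= 0
  if (abs(lambda) > 1e-8) {
    out[pos] <- ((x[pos] + 1)^lambda - 1) / lambda
  } else {
    out[pos] <- log(x[pos] + 1)
  }
  if (abs(lambda - 2) > 1e-8) {
    out[!pos] <- -((1 - x[!pos])^(2 - lambda) - 1) / (2 - lambda)
  } else {
    out[!pos] <- -log(1 - x[!pos])
  }
  out
}

# Spatial sign: divide each row by its Euclidean norm so that every
# sample lies on the unit sphere and far-out points are pulled inward.
spatial_sign <- function(m) {
  norms <- sqrt(rowSums(m^2))
  norms[norms == 0] <- 1      # leave all-zero rows untouched
  m / norms                   # recycles norms down the rows
}

# Toy data: one large-valued column, one zero-heavy column with an outlier
toy <- cbind(
  big   = c(72.1, 72.8, 73.0, 72.5, 75.4),
  small = c(0.0, 0.0, 0.1, 0.0, 3.1)
)

step1 <- apply(toy, 2, yeo_johnson, lambda = 0)  # zeros are handled
step2 <- scale(step1)                            # center, then scale
step3 <- spatial_sign(step2)                     # project onto unit circle

print(round(step3, 3))
print(rowSums(step3^2))   # squared row norms after the spatial sign
```

Centering and scaling come before the spatial sign because the projection depends on the distance from the origin, so the predictors need to be on a common scale first.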
The model performs best when the cost parameter is set to 2^6 (64). At this setting, the SVM identifies the glass type correctly about 74% of the time. Pushing the cost higher tends to overfit the model.
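A tuning run of this kind could look like the following. The sketch assumes caret with a radial-basis SVM from kernlab, 10-fold cross-validation, and the Glass data from mlbench; none of these choices is confirmed by the text above, and `tune_glass_svm` is a hypothetical helper name. The fit is wrapped in a function so that the block loads cleanly even when the packages are missing.

```r
# Candidate cost values on a log2 scale; 2^6 = 64 is the value reported above
cost_grid <- 2^(-2:7)
print(cost_grid)

# Hypothetical helper: tunes the SVM cost with resampling.
# Assumes caret, kernlab, and mlbench are installed.
tune_glass_svm <- function(seed = 100) {
  stopifnot(
    requireNamespace("caret", quietly = TRUE),
    requireNamespace("kernlab", quietly = TRUE),
    requireNamespace("mlbench", quietly = TRUE)
  )
  suppressPackageStartupMessages(library(caret))

  # Load Glass into a private environment to avoid touching the workspace
  glass_env <- new.env()
  data("Glass", package = "mlbench", envir = glass_env)
  glass <- glass_env$Glass

  set.seed(seed)
  train(
    x = glass[, 1:9],
    y = glass$Type,
    method = "svmRadial",
    # Transformations named in the text, applied inside each resample
    preProcess = c("YeoJohnson", "center", "scale", "spatialSign"),
    # caret builds its own cost grid of this length; whether it matches
    # cost_grid exactly is an assumption
    tuneLength = length(cost_grid),
    trControl = trainControl(method = "cv", number = 10)
  )
}
```

Calling `fit <- tune_glass_svm()` and inspecting `fit$bestTune` and `fit$results` would show the selected cost and the resampled accuracy at each candidate value. The numbers will vary with the seed and the resampling scheme, so they need not reproduce the 74% reported here.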