dataset = read.csv("glass.csv")
head(dataset)
## RI Na Mg Al Si K Ca Ba Fe Type
## 1 1.52101 13.64 4.49 1.10 71.78 0.06 8.75 0 0.00 1
## 2 1.51761 13.89 3.60 1.36 72.73 0.48 7.83 0 0.00 1
## 3 1.51618 13.53 3.55 1.54 72.99 0.39 7.78 0 0.00 1
## 4 1.51766 13.21 3.69 1.29 72.61 0.57 8.22 0 0.00 1
## 5 1.51742 13.27 3.62 1.24 73.08 0.55 8.07 0 0.00 1
## 6 1.51596 12.79 3.61 1.62 72.97 0.64 8.07 0 0.26 1
Source:
Creator: B. German, Central Research Establishment, Home Office Forensic Science Service, Aldermaston, Reading, Berkshire RG7 4PN
Donor: Vina Spiehler, Ph.D., DABFT, Diagnostic Products Corporation
These data have been taken from the UCI Repository Of Machine Learning Databases (Blake & Merz 1998) and were converted to R format by Friedrich Leisch in the late 1990s.
A data frame with 214 observation containing examples of the chemical analysis of 7 different types of glass. The problem is to predict the type of glass on basis of the chemical analysis.
It contains 214 observations on the following 10 variables (columns):
- RI: Refractive index. Floating Point
- Na: Sodium. Floating Point
- Mg: Magnesium. Floating Point
- Al: Aluminum. Floating Point
- Si: Silicon. Floating Point
- K: Potassium. Floating Point
- Ca: Calcium. Floating Point
- Ba: Barium. Floating Point
- Fe: Iron. Floating Point
- Type: Type of glass (class attribute). Integer
nrow(dataset)
## [1] 214
names(dataset)
## [1] "RI" "Na" "Mg" "Al" "Si" "K" "Ca" "Ba" "Fe" "Type"
str(dataset)
## 'data.frame': 214 obs. of 10 variables:
## $ RI : num 1.52 1.52 1.52 1.52 1.52 ...
## $ Na : num 13.6 13.9 13.5 13.2 13.3 ...
## $ Mg : num 4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
## $ Al : num 1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
## $ Si : num 71.8 72.7 73 72.6 73.1 ...
## $ K : num 0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
## $ Ca : num 8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
## $ Ba : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Fe : num 0 0 0 0 0 0.26 0 0 0 0.11 ...
## $ Type: int 1 1 1 1 1 1 1 1 1 1 ...
There are no missing values (NA or NULL values).
colSums(is.na(dataset))
## RI Na Mg Al Si K Ca Ba Fe Type
## 0 0 0 0 0 0 0 0 0 0
any(is.na(dataset))
## [1] FALSE
library(psych)
describe(dataset[, !(names(dataset) == "Type")])
## vars n mean sd median trimmed mad min max range skew kurtosis
## RI 1 214 1.52 0.00 1.52 1.52 0.00 1.51 1.53 0.02 1.60 4.72
## Na 2 214 13.41 0.82 13.30 13.38 0.64 10.73 17.38 6.65 0.45 2.90
## Mg 3 214 2.68 1.44 3.48 2.87 0.30 0.00 4.49 4.49 -1.14 -0.45
## Al 4 214 1.44 0.50 1.36 1.41 0.31 0.29 3.50 3.21 0.89 1.94
## Si 5 214 72.65 0.77 72.79 72.71 0.57 69.81 75.41 5.60 -0.72 2.82
## K 6 214 0.50 0.65 0.56 0.43 0.17 0.00 6.21 6.21 6.46 52.87
## Ca 7 214 8.96 1.42 8.60 8.74 0.66 5.43 16.19 10.76 2.02 6.41
## Ba 8 214 0.18 0.50 0.00 0.03 0.00 0.00 3.15 3.15 3.37 12.08
## Fe 9 214 0.06 0.10 0.00 0.04 0.00 0.00 0.51 0.51 1.73 2.52
## se
## RI 0.00
## Na 0.06
## Mg 0.10
## Al 0.03
## Si 0.05
## K 0.04
## Ca 0.10
## Ba 0.03
## Fe 0.01
Apart from Mg and Si, the skew values of the other elements are all positive and much greater than 0. This means that that the distributions are positively skewed (right-skewed). So, the right tail (larger values) is longer than the left tail, and the majority of data points are on the lower end.
The data is significantly skewed. To normalize the distribution transformations (log) are needed.
The K (in primis), Ca, Ba and RI variables have kurtosis values > 3. This means that their distribution has heavier tails than the normal distribution, meaning there are more outliers.
The Mg and Al variables have kurtosis values < 3. So the distibution has lighter tails than the normal distribution, meaning fewer outliers.
Distributions
hist(dataset$RI,
main="Histogram of RI (refractive index)",
xlab="RI"
)
hist(dataset$Na,
main="Histogram of Na (Sodium)",
xlab="Na"
)
hist(dataset$Mg,
main="Histogram of Mg (Magnesium)",
xlab="Mg"
)
hist(dataset$Al,
main="Histogram of Al (Aluminum)",
xlab="Al"
)
hist(dataset$Si,
main="Histogram of Si (Silicon)",
xlab="Si"
)
hist(dataset$K,
main="Histogram of K (Potassium)",
xlab="K"
)
hist(dataset$Ca,
main="Histogram of Ca (Calcium)",
xlab="Ca"
)
hist(dataset$Ba,
main="Histogram of Ba (Barium)",
xlab="Ba"
)
hist(dataset$Fe,
main="Histogram of Fe (Iron)",
xlab="Fe"
)
Check for outliers
boxplot(dataset$RI, main = "refractive index", horizontal = FALSE)
boxplot(dataset$Na, main = "sodium", horizontal = FALSE)
boxplot(dataset$Mg, main = "magnesium", horizontal = FALSE)
boxplot(dataset$Al, main = "aluminum", horizontal = FALSE)
boxplot(dataset$Si, main = "silicon", horizontal = FALSE)
boxplot(dataset$K, main = "potassium", horizontal = FALSE)
boxplot(dataset$Ca, main = "calcium", horizontal = FALSE)
boxplot(dataset$Ba, main = "barium", horizontal = FALSE)
boxplot(dataset$Fe, main = "iron", horizontal = FALSE)
There are a lot of outliers but this is probably due to the natural variation (concentration) of the chemical elements in glass composition. So I am treating them as valid data points.
Distributions:
The distribution of glass types is uneven, with types 2, 5, and 6 being significantly underrepresented. The fourth type is not represented at all.
barplot(table(dataset$Type), main = "Type of Glass Distribution")
+1 to avoid issues with zero or negative values
dataset$K_log = log(dataset$K + 1)
dataset$Ca_log = log(dataset$Ca + 1)
dataset$Ba_log = log(dataset$Ba + 1)
describe(dataset$K_log)
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 214 0.36 0.27 0.44 0.35 0.11 0 1.98 1.98 1.95 10.34 0.02
describe(dataset$Ca_log)
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 214 2.29 0.13 2.26 2.27 0.07 1.86 2.84 0.98 1.15 4 0.01
describe(dataset$Ba_log)
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 214 0.11 0.28 0 0.03 0 0 1.42 1.42 2.69 6.39 0.02
After the log, the skew values have been reduced to resemble more a normal distribution.
After transforming the data, it is often a good idea to standardize the variables (mean = 0, standard deviation = 1). This ensures that PCA does not prioritize variables with larger scales.
Remove the columns not log-transformed and the Type column that should not be standardized.
columns_to_remove = c("K", "Ca", "Ba", "Type")
subset = dataset[, !names(dataset) %in% columns_to_remove]
names(subset)
## [1] "RI" "Na" "Mg" "Al" "Si" "Fe" "K_log" "Ca_log"
## [9] "Ba_log"
dataset_standardized = scale(subset)
dataset_standardized = as.data.frame(dataset_standardized)
dataset_standardized$Type = dataset$Type
head(dataset_standardized)
## RI Na Mg Al Si Fe K_log
## 1 0.8708258 0.2842867 1.2517037 -0.6908222 -1.12444556 -0.5850791 -1.0941592
## 2 -0.2487502 0.5904328 0.6346799 -0.1700615 0.10207972 -0.5850791 0.1295429
## 3 -0.7196308 0.1495824 0.6000157 0.1904651 0.43776033 -0.5850791 -0.1004727
## 4 -0.2322859 -0.2422846 0.6970756 -0.3102663 -0.05284979 -0.5850791 0.3459757
## 5 -0.3113148 -0.1688095 0.6485456 -0.4104126 0.55395746 -0.5850791 0.2989716
## 6 -0.7920739 -0.7566101 0.6416128 0.3506992 0.41193874 2.0832652 0.5059006
## Ca_log Ba_log Type
## 1 -0.09282075 -0.3859754 1
## 2 -0.85379817 -0.3859754 1
## 3 -0.89739813 -0.3859754 1
## 4 -0.52195771 -0.3859754 1
## 5 -0.64789711 -0.3859754 1
## 6 -0.64789711 -0.3859754 1
describe(dataset_standardized)
## vars n mean sd median trimmed mad min max range skew kurtosis
## RI 1 214 0.00 1.0 -0.23 -0.12 0.62 -2.38 5.13 7.50 1.60 4.72
## Na 2 214 0.00 1.0 -0.13 -0.04 0.79 -3.28 4.86 8.14 0.45 2.90
## Mg 3 214 0.00 1.0 0.55 0.13 0.21 -1.86 1.25 3.11 -1.14 -0.45
## Al 4 214 0.00 1.0 -0.17 -0.07 0.62 -2.31 4.12 6.43 0.89 1.94
## Si 5 214 0.00 1.0 0.18 0.07 0.74 -3.67 3.56 7.23 -0.72 2.82
## Fe 6 214 0.00 1.0 -0.59 -0.22 0.00 -0.59 4.65 5.23 1.73 2.52
## K_log 7 214 0.00 1.0 0.31 -0.04 0.39 -1.31 5.93 7.24 1.95 10.34
## Ca_log 8 214 0.00 1.0 -0.21 -0.11 0.53 -3.29 4.26 7.55 1.15 4.00
## Ba_log 9 214 0.00 1.0 -0.39 -0.29 0.00 -0.39 4.64 5.02 2.69 6.39
## Type 10 214 2.78 2.1 2.00 2.48 1.48 1.00 7.00 6.00 1.10 -0.33
## se
## RI 0.07
## Na 0.07
## Mg 0.07
## Al 0.07
## Si 0.07
## Fe 0.07
## K_log 0.07
## Ca_log 0.07
## Ba_log 0.07
## Type 0.14
RQ: What are the primary patterns of variation in the chemical composition of different types of glass and which chemical components contribute most significantly to the differentiation between glass types?
In this case the correlation matrix makes more sense since the variables have different scales and units. The analysis will focus on the relationships between variables after accounting for differences in scale, so no variable disproportionately influences the results.
dataset_standardized_pca= dataset_standardized[, c(1, 2, 3 , 4, 5, 7, 8, 9, 6)]
R_standardized = cor(dataset_standardized_pca)
round(R_standardized, 3)
## RI Na Mg Al Si K_log Ca_log Ba_log Fe
## RI 1.000 -0.192 -0.122 -0.407 -0.542 -0.327 0.798 -0.050 0.143
## Na -0.192 1.000 -0.274 0.157 -0.070 -0.434 -0.261 0.389 -0.241
## Mg -0.122 -0.274 1.000 -0.482 -0.166 0.226 -0.416 -0.530 0.083
## Al -0.407 0.157 -0.482 1.000 -0.006 0.249 -0.282 0.526 -0.074
## Si -0.542 -0.070 -0.166 -0.006 1.000 -0.127 -0.183 -0.056 -0.094
## K_log -0.327 -0.434 0.226 0.249 -0.127 1.000 -0.439 -0.160 0.063
## Ca_log 0.798 -0.261 -0.416 -0.282 -0.183 -0.439 1.000 -0.162 0.129
## Ba_log -0.050 0.389 -0.530 0.526 -0.056 -0.160 -0.162 1.000 -0.074
## Fe 0.143 -0.241 0.083 -0.074 -0.094 0.063 0.129 -0.074 1.000
corPlot(R_standardized)
There is an insufficient correlation between the variables (each one needs to have a MSA greater than 0.5). This suggests that the components may not capture meaningful patterns.
PCA can still perform dimensionality reduction but the results might lack significance.
KMO(R_standardized)
## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = R_standardized)
## Overall MSA = 0.22
## MSA for each item =
## RI Na Mg Al Si K_log Ca_log Ba_log Fe
## 0.69 0.14 0.18 0.29 0.11 0.15 0.25 0.22 0.91
det(R_standardized)
## [1] 0.000867296
Bartlett’s test evaluates whether the correlation matrix is significantly different from the identity matrix. This help to determine if the dataset is suitable for PCA or FA.
Null Hypothesis: the variables are uncorrelated, and the correlation matrix is equal to the identity matrix
Alternative Hypothesis: the variables are correlated, and the correlation matrix is not the identity matrix.
Since the p-value is < 0.05, the null hypthesis is rejected. This means the correlation matrix is significantly different from the identity matrix, and there are correlations among the variables.
So, according to Bartlett’s test, the dataset is suitable for PCA or FA. This test does not account for sampling adequacy (KMO).
On the other hand, according the KMO test, the dataset is not suitable for PCA.
cortest.bartlett(R_standardized, n = nrow(dataset_standardized_pca))
## $chisq
## [1] 1474.652
##
## $p.value
## [1] 9.831003e-287
##
## $df
## [1] 36
I decided to continue the analysis and see the results anyway.
Analysis:
PC1 to PC4 have eigenvalues > 1, so they are significant under the Kaiser criterion. The first 4 components together capture a substantial amount of variance in the dataset (81.2%).
The remaining components explain progressively less variance.
PC5 adds about 9.7% to the total variance, bringing the cumulative variance to about 90.9%.
PC6 to PC9 contribute only a small amount of variance and are less meaningful in representing the data.
library(FactoMineR)
components_all = PCA(dataset_standardized_pca,
scale.unit = FALSE,
graph = FALSE
)
library(factoextra)
## Loading required package: ggplot2
##
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
##
## %+%, alpha
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
get_eigenvalue(components_all)
## eigenvalue variance.percent cumulative.variance.percent
## Dim.1 2.53440299 28.2922400 28.29224
## Dim.2 2.23020703 24.8964165 53.18866
## Dim.3 1.35882014 15.1688842 68.35754
## Dim.4 1.15091117 12.8479389 81.20548
## Dim.5 0.87059317 9.7186718 90.92415
## Dim.6 0.43448182 4.8502404 95.77439
## Dim.7 0.30074476 3.3572967 99.13169
## Dim.8 0.06507632 0.7264649 99.85815
## Dim.9 0.01270654 0.1418466 100.00000
This is a good results since to obtain 80% of variance you need only 4 components out of 9, and to obtain 90% you need to keep only 5 components.
This reduction should reduce noise and improve interpretability without losing too much information.
Scree plot
This plot shows the eigenvalues for each principal components in a visual way.
As said earlier, PC1 to PC4 have eigenvalues > 1 and, based on the Kaiser criterion, these principal components are those really significant and explain most of the dataset’variance.
The “elbow” in the plot (where the curve starts to flatten) appears at PC4, confirming that the first 4 PCs are sufficient to capture the majority of variance.
fviz_eig(components_all,
choice = "eigenvalue",
main = "Scree plot",
ylab = "Eigenvalue",
xlab = "Principal Component",
addlabels = TRUE
)
Parallel analysis
Parallel analysis compares the eigenvalues of the actual data (blue line) with those of re-sampled (random) data (red dashed line).
PCs with eigenvalues from the actual data that are greater than the corresponding eigenvalues of the resampled data are considered significant.
Parallel analysis suggests that the number of components should be 4.
This result aligns with the earlier scree plot and eigenvalue analysis.
fa.parallel(dataset_standardized_pca,
sim = FALSE,
fa = "pc",
)
## Warning in fa.stats(r = r, f = f, phi = phi, n.obs = n.obs, np.obs = np.obs, :
## The estimated weights for the factor scores are probably incorrect. Try a
## different factor score estimation method.
## Warning in fac(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate, : An
## ultra-Heywood case was detected. Examine the results carefully
## Parallel analysis suggests that the number of factors = NA and the number of components = 4
PCA using only the first 4 PC
library(FactoMineR)
components_4 = PCA(dataset_standardized_pca,
ncp = 4,
scale.unit = FALSE,
graph = FALSE
)
components_4
## **Results for the Principal Component Analysis (PCA)**
## The analysis was performed on 214 individuals, described by 9 variables
## *The results are available in the following objects:
##
## name description
## 1 "$eig" "eigenvalues"
## 2 "$var" "results for the variables"
## 3 "$var$coord" "coord. for the variables"
## 4 "$var$cor" "correlations variables - dimensions"
## 5 "$var$cos2" "cos2 for the variables"
## 6 "$var$contrib" "contributions of the variables"
## 7 "$ind" "results for the individuals"
## 8 "$ind$coord" "coord. for the individuals"
## 9 "$ind$cos2" "cos2 for the individuals"
## 10 "$ind$contrib" "contributions of the individuals"
## 11 "$call" "summary statistics"
## 12 "$call$centre" "mean of the variables"
## 13 "$call$ecart.type" "standard error of the variables"
## 14 "$call$row.w" "weights for the individuals"
## 15 "$call$col.w" "weights for the variables"
Correlation of each variable with the PCs.
This values show how strongly each original variable is associated with each PC (higher absolute values indicate stronger relationships).
Dim. 1 is strongly correlated (influenced) with RI and Ca_log.
Dim. 2 is strongly correlated with Mg, Ba_log, and Na. This dimension captures variability related to these variables
Dim. 3 is strongly correlated with K_log and Si. It represents features related to this variables.
Dim. 4 is strongly correlated with Si and Na.
components_4$var$cor
## Dim.1 Dim.2 Dim.3 Dim.4
## RI -0.8497531 0.3922069 0.1871308 -0.15822541
## Na 0.4146429 0.5077198 -0.4057552 -0.49102846
## Mg -0.1933380 -0.8138143 -0.1850017 -0.41002133
## Al 0.6910020 0.2994248 0.5158736 0.10637838
## Si 0.3669094 -0.1199463 -0.5130974 0.74555308
## K_log 0.2754157 -0.6207597 0.6075550 0.01114375
## Ca_log -0.7795654 0.4888027 0.0672862 0.30757666
## Ba_log 0.4869420 0.6480424 0.2550265 -0.12228892
## Fe -0.2865628 -0.1360378 0.4040237 0.21261434
Contribution of each original variable to each dimension.
It shows the percentage that each variable contributed to the variance explained by each PC.
Higher percentage means a variable is more important for that PC.
It is easier to interpret with respect to the previous table.
components_4$var$contrib
## Dim.1 Dim.2 Dim.3 Dim.4
## RI 28.358004 6.8651670 2.5650406 2.16509263
## Na 6.752095 11.5045283 12.0595752 20.85150262
## Mg 1.467995 29.5577411 2.5070064 14.53908002
## Al 18.752049 4.0012534 19.4935296 0.97865751
## Si 5.286983 0.6420872 19.2842849 48.07078002
## K_log 2.978981 17.1975929 27.0380330 0.01073956
## Ca_log 23.866858 10.6632089 0.3316315 8.18145960
## Ba_log 9.312034 18.7424975 4.7640286 1.29329699
## Fe 3.225000 0.8259238 11.9568700 3.90939105
It seems that the variables that contribute the most to the variance captured by the first two dimensions (which are the most important ones and have similar eigenvalues) are:
RI
Ca
Mg
Ba
Na
fviz_pca_var(components_4, repel = TRUE)
Together, the first 2 components explain 53.2% of the total variance in the data.
Variables that point in similar directions are positively correlated.
Variables pointing in opposite directions are negatively correlated.
The right-angle variables are probably uncorrelated.
The length of the arrows indicates how well that variable is represented in these two dimensions.
Na and Ba are strongly positively correlated
RI and Ca are positively correlated
Si points nearly horizontally to the right, suggesting it is strongly associated with the first principal component rather than the second component. The same holds for Fe (nearly horizontally to the left)
Mg points nearly vertical to the left, suggesting it is strongly associated with the second principal component rather than the first component.
Fe appears to have relatively weak contributions to both components (better association with the first component)
table(dataset$Type)
##
## 1 2 3 5 6 7
## 70 76 17 13 9 29
dataset$Type = as.factor(dataset$Type)
fviz_pca_biplot(components_4,
geom.ind = "point",
col.ind = dataset$Type,
palette = "RdBu",
addEllipses = TRUE,
label = "var",
col.var = "black",
repel = FALSE,
legend.title = "Glass Type")
The PCA biplot shows both the variables (arrows) and glass samples colored by their type.
The position of a sample along the principal component axes indicates its contribution to the respective component.
Samples close together are more similar in terms of their chemical composition.
The length and direction of an arrow indicate the variable’s importance and correlation with the principal components.
There is a central region that presents a lot of overlap between glass types. The first two PC are not enough to separate them.
In particular the types 5 and 6 which are under-represented in the dataset do not form clusters.
The ellipses show the 95% confidence regions for each glass type
head(components_4$ind$coord)
## Dim.1 Dim.2 Dim.3 Dim.4
## 1 -1.2554354 -0.09449284 -0.8295099 -1.69285829
## 2 0.5855765 -0.69861737 -0.7336488 -0.73572334
## 3 0.9408826 -0.82630964 -0.7608286 -0.19836287
## 4 0.1317710 -1.00769430 -0.3139305 -0.40581744
## 5 0.3484001 -1.06738367 -0.6868016 -0.03493444
## 6 0.3059769 -1.55363978 0.8701100 0.81342206
RQ: what are the primary patterns of variation in the chemical composition of different types of glass, and can we identify which chemical components contribute most significantly to the differentiation between glass types?
Conclusion:
Based on the PCA analysis:
- the first 4 principal components explain 81.2% of the total variance in the glass composition data, indicating moderate patterns in chemical composition.
- PC1 is primarily driven by RI (refractive index) and Ca, suggesting these elements are the main differentiators in glass composition
- PC2 captures variability linked to Mg, Ba and Na, representing a secondary pattern of chemical composition
- PC3 reflects variations tied to K and Si
- PC4 highlights variability linked to Si and Na, possibly capturing subtle compositional differences.
Different variables dominate different PCs, indicating that the dataset is multidimensional, with distinct patterns captured by the PCs.
The analysis did not reveal clear patterns in how chemical compositions vary across different glass types. The plots based on the first 2 PCs did not show distinct clustering based on their chemical composition.
PCA does not seem to perform good but this is expected since the KMO test highlighted that there is no sufficient correlation between variables to effectively apply PCA.
Let’s see if we can effectively reduce the dimensionality of the glass dataset without compromising performance.
Add the coord. of the first 4 PC for the individuals samples to the dataset
dataset_standardized$PC1 = components_all$ind$coord[, 1]
dataset_standardized$PC2 = components_all$ind$coord[, 2]
dataset_standardized$PC3 = components_all$ind$coord[, 3]
dataset_standardized$PC4 = components_all$ind$coord[, 4]
head(dataset_standardized)
## RI Na Mg Al Si Fe K_log
## 1 0.8708258 0.2842867 1.2517037 -0.6908222 -1.12444556 -0.5850791 -1.0941592
## 2 -0.2487502 0.5904328 0.6346799 -0.1700615 0.10207972 -0.5850791 0.1295429
## 3 -0.7196308 0.1495824 0.6000157 0.1904651 0.43776033 -0.5850791 -0.1004727
## 4 -0.2322859 -0.2422846 0.6970756 -0.3102663 -0.05284979 -0.5850791 0.3459757
## 5 -0.3113148 -0.1688095 0.6485456 -0.4104126 0.55395746 -0.5850791 0.2989716
## 6 -0.7920739 -0.7566101 0.6416128 0.3506992 0.41193874 2.0832652 0.5059006
## Ca_log Ba_log Type PC1 PC2 PC3 PC4
## 1 -0.09282075 -0.3859754 1 -1.2554354 -0.09449284 -0.8295099 -1.69285829
## 2 -0.85379817 -0.3859754 1 0.5855765 -0.69861737 -0.7336488 -0.73572334
## 3 -0.89739813 -0.3859754 1 0.9408826 -0.82630964 -0.7608286 -0.19836287
## 4 -0.52195771 -0.3859754 1 0.1317710 -1.00769430 -0.3139305 -0.40581744
## 5 -0.64789711 -0.3859754 1 0.3484001 -1.06738367 -0.6868016 -0.03493444
## 6 -0.64789711 -0.3859754 1 0.3059769 -1.55363978 0.8701100 0.81342206
Split standardized data:
X = dataset_standardized[, !names(dataset_standardized) %in% c("Type", "PC1", "PC2", "PC3", "PC4")]
X_pca_4 = dataset_standardized[, names(dataset_standardized) %in% c("PC1", "PC2", "PC3", "PC4")]
y = as.factor(dataset_standardized$Type)
table(dataset_standardized$Type)
##
## 1 2 3 5 6 7
## 70 76 17 13 9 29
Cross-validation with 5 folds should help understand the performance of the model and if it is generalizing well.
I used Multinomial Logistic Regression to predict the Type of glass from the other variables.
There are 7 types of glass, but the fourth type has no observations. Additionally, types 3, 5, and 6 are very under-represented in the dataset.
library(caret)
## Loading required package: lattice
ctrl = trainControl(method = "cv", number = 5)
cv_original = train(
x = X,
y = y,
method = "multinom",
trControl = ctrl
)
## # weights: 66 (50 variable)
## initial value 306.390869
## iter 10 value 119.607397
## iter 20 value 99.578370
## iter 30 value 89.252229
## iter 40 value 87.370787
## iter 50 value 86.344433
## iter 60 value 86.047476
## iter 70 value 85.910460
## iter 80 value 85.877732
## iter 90 value 85.876975
## final value 85.876845
## converged
## # weights: 66 (50 variable)
## initial value 306.390869
## iter 10 value 129.018273
## iter 20 value 122.172840
## iter 30 value 121.782146
## iter 40 value 121.772014
## final value 121.772009
## converged
## # weights: 66 (50 variable)
## initial value 306.390869
## iter 10 value 119.617222
## iter 20 value 99.618283
## iter 30 value 89.430085
## iter 40 value 87.584220
## iter 50 value 86.743208
## iter 60 value 86.597925
## iter 70 value 86.548121
## iter 80 value 86.535529
## iter 90 value 86.525710
## iter 100 value 86.513701
## final value 86.513701
## stopped after 100 iterations
## # weights: 66 (50 variable)
## initial value 306.390869
## iter 10 value 128.578568
## iter 20 value 104.825633
## iter 30 value 97.830761
## iter 40 value 92.912930
## iter 50 value 91.412432
## iter 60 value 89.075408
## iter 70 value 87.538180
## iter 80 value 87.484809
## iter 90 value 87.483353
## final value 87.483345
## converged
## # weights: 66 (50 variable)
## initial value 306.390869
## iter 10 value 135.508133
## iter 20 value 123.406852
## iter 30 value 123.214901
## iter 40 value 123.209873
## iter 40 value 123.209872
## iter 40 value 123.209872
## final value 123.209872
## converged
## # weights: 66 (50 variable)
## initial value 306.390869
## iter 10 value 128.591223
## iter 20 value 104.854370
## iter 30 value 97.919497
## iter 40 value 93.145561
## iter 50 value 91.753587
## iter 60 value 90.327256
## iter 70 value 90.053350
## iter 80 value 89.983305
## iter 90 value 89.950427
## iter 100 value 89.923114
## final value 89.923114
## stopped after 100 iterations
## # weights: 66 (50 variable)
## initial value 306.390869
## iter 10 value 111.758722
## iter 20 value 84.708463
## iter 30 value 77.632457
## iter 40 value 74.549375
## iter 50 value 73.743312
## iter 60 value 73.394303
## iter 70 value 73.206633
## iter 80 value 72.946136
## iter 90 value 71.153435
## iter 100 value 67.740336
## final value 67.740336
## stopped after 100 iterations
## # weights: 66 (50 variable)
## initial value 306.390869
## iter 10 value 127.690406
## iter 20 value 115.575617
## iter 30 value 115.346198
## iter 40 value 115.344799
## final value 115.344797
## converged
## # weights: 66 (50 variable)
## initial value 306.390869
## iter 10 value 111.774313
## iter 20 value 84.778390
## iter 30 value 77.231982
## iter 40 value 74.650410
## iter 50 value 74.020283
## iter 60 value 73.806701
## iter 70 value 73.729759
## iter 80 value 73.670981
## iter 90 value 73.658580
## iter 100 value 73.646885
## final value 73.646885
## stopped after 100 iterations
## # weights: 66 (50 variable)
## initial value 309.974388
## iter 10 value 125.590212
## iter 20 value 108.764871
## iter 30 value 97.382442
## iter 40 value 92.752814
## iter 50 value 91.398766
## iter 60 value 90.431569
## iter 70 value 90.233981
## iter 80 value 90.219069
## iter 90 value 90.197145
## iter 100 value 90.176196
## final value 90.176196
## stopped after 100 iterations
## # weights: 66 (50 variable)
## initial value 309.974388
## iter 10 value 134.916029
## iter 20 value 127.897652
## iter 30 value 127.549852
## iter 40 value 127.525607
## final value 127.525604
## converged
## # weights: 66 (50 variable)
## initial value 309.974388
## iter 10 value 125.600265
## iter 20 value 108.795280
## iter 30 value 97.514664
## iter 40 value 93.021845
## iter 50 value 91.819696
## iter 60 value 91.267320
## iter 70 value 91.177621
## iter 80 value 91.168024
## iter 90 value 91.152091
## iter 100 value 91.145213
## final value 91.145213
## stopped after 100 iterations
## # weights: 66 (50 variable)
## initial value 304.599110
## iter 10 value 120.268905
## iter 20 value 102.738940
## iter 30 value 93.104347
## iter 40 value 87.745375
## iter 50 value 86.346784
## iter 60 value 84.719225
## iter 70 value 84.613418
## iter 80 value 84.612147
## final value 84.612140
## converged
## # weights: 66 (50 variable)
## initial value 304.599110
## iter 10 value 131.040222
## iter 20 value 124.342847
## iter 30 value 124.207613
## iter 40 value 124.204604
## iter 40 value 124.204604
## iter 40 value 124.204604
## final value 124.204604
## converged
## # weights: 66 (50 variable)
## initial value 304.599110
## iter 10 value 120.280500
## iter 20 value 102.781170
## iter 30 value 93.255224
## iter 40 value 88.115981
## iter 50 value 86.940158
## iter 60 value 86.194152
## iter 70 value 86.135971
## iter 80 value 86.029160
## iter 90 value 85.942162
## iter 100 value 85.911808
## final value 85.911808
## stopped after 100 iterations
## # weights: 66 (50 variable)
## initial value 383.436526
## iter 10 value 167.512543
## iter 20 value 155.011106
## iter 30 value 154.493623
## iter 40 value 154.461659
## final value 154.461629
## converged
cv_original
## Penalized Multinomial Regression
##
## 214 samples
## 9 predictor
## 6 classes: '1', '2', '3', '5', '6', '7'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 171, 171, 171, 173, 170
## Resampling results across tuning parameters:
##
## decay Accuracy Kappa
## 0e+00 0.6454803 0.5115487
## 1e-04 0.6454803 0.5148665
## 1e-01 0.6680864 0.5408954
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was decay = 0.1.
cv_pca_4 = train(
x = X_pca_4,
y = y,
method = "multinom",
trControl = ctrl
)
## # weights: 36 (25 variable)
## initial value 306.390869
## iter 10 value 141.279455
## iter 20 value 137.242487
## iter 30 value 137.065838
## final value 137.065804
## converged
## # weights: 36 (25 variable)
## initial value 306.390869
## iter 10 value 148.605775
## iter 20 value 146.133289
## iter 30 value 146.032366
## final value 146.032364
## converged
## # weights: 36 (25 variable)
## initial value 306.390869
## iter 10 value 141.287300
## iter 20 value 137.253655
## iter 30 value 137.078825
## final value 137.078793
## converged
## # weights: 36 (25 variable)
## initial value 308.182629
## iter 10 value 148.322099
## iter 20 value 141.595414
## iter 30 value 141.156695
## final value 141.150896
## converged
## # weights: 36 (25 variable)
## initial value 308.182629
## iter 10 value 157.971931
## iter 20 value 151.003039
## iter 30 value 150.928052
## final value 150.928043
## converged
## # weights: 36 (25 variable)
## initial value 308.182629
## iter 10 value 148.326566
## iter 20 value 141.607696
## iter 30 value 141.171362
## final value 141.165622
## converged
## # weights: 36 (25 variable)
## initial value 306.390869
## iter 10 value 152.323808
## iter 20 value 146.339408
## iter 30 value 146.187832
## final value 146.187813
## converged
## # weights: 36 (25 variable)
## initial value 306.390869
## iter 10 value 158.817751
## iter 20 value 154.194659
## iter 30 value 154.107519
## final value 154.107479
## converged
## # weights: 36 (25 variable)
## initial value 306.390869
## iter 10 value 152.330622
## iter 20 value 146.349409
## iter 30 value 146.198563
## final value 146.198543
## converged
## # weights: 36 (25 variable)
## initial value 304.599110
## iter 10 value 137.985319
## iter 20 value 131.723461
## iter 30 value 130.914419
## final value 130.911076
## converged
## # weights: 36 (25 variable)
## initial value 304.599110
## iter 10 value 145.586322
## iter 20 value 141.085128
## iter 30 value 141.052525
## iter 30 value 141.052523
## iter 30 value 141.052523
## final value 141.052523
## converged
## # weights: 36 (25 variable)
## initial value 304.599110
## iter 10 value 137.993345
## iter 20 value 131.735728
## iter 30 value 130.932052
## final value 130.927221
## converged
## # weights: 36 (25 variable)
## initial value 308.182629
## iter 10 value 152.375443
## iter 20 value 145.875669
## iter 30 value 145.767692
## final value 145.766714
## converged
## # weights: 36 (25 variable)
## initial value 308.182629
## iter 10 value 159.771323
## iter 20 value 153.919474
## iter 30 value 153.621443
## final value 153.620756
## converged
## # weights: 36 (25 variable)
## initial value 308.182629
## iter 10 value 152.383329
## iter 20 value 145.885776
## iter 30 value 145.778264
## final value 145.777306
## converged
## # weights: 36 (25 variable)
## initial value 383.436526
## iter 10 value 187.672768
## iter 20 value 179.437599
## iter 30 value 178.997175
## final value 178.988655
## converged
cv_pca_4
## Penalized Multinomial Regression
##
## 214 samples
## 4 predictor
## 6 classes: '1', '2', '3', '5', '6', '7'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 171, 172, 171, 170, 172
## Resampling results across tuning parameters:
##
## decay Accuracy Kappa
## 0e+00 0.6315816 0.4825275
## 1e-04 0.6315816 0.4825275
## 1e-01 0.6268247 0.4749894
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was decay = 1e-04.
original_accuracy = max(cv_original$results$Accuracy)
pca_accuracy_4 = max(cv_pca_4$results$Accuracy)
cat("Accuracy (Original):", original_accuracy, "\n")
## Accuracy (Original): 0.6680864
cat("Accuracy (PCA, 4 PCs):", pca_accuracy_4, "\n")
## Accuracy (PCA, 4 PCs): 0.6315816
I found that using the first 4 principal components achieved comparable accuracy to using all 9 original variables. This indicates that these components effectively capture some of the underlying patterns in the data.
However, the low accuracy suggests that accurately classifying glass types based on their chemical composition may be inherently challenging due to subtle differences between them.