One multivariate statistical method of your choice

Import dataset

dataset = read.csv("glass.csv")
head(dataset)

##        RI    Na   Mg   Al    Si    K   Ca Ba   Fe Type
## 1 1.52101 13.64 4.49 1.10 71.78 0.06 8.75  0 0.00    1
## 2 1.51761 13.89 3.60 1.36 72.73 0.48 7.83  0 0.00    1
## 3 1.51618 13.53 3.55 1.54 72.99 0.39 7.78  0 0.00    1
## 4 1.51766 13.21 3.69 1.29 72.61 0.57 8.22  0 0.00    1
## 5 1.51742 13.27 3.62 1.24 73.08 0.55 8.07  0 0.00    1
## 6 1.51596 12.79 3.61 1.62 72.97 0.64 8.07  0 0.26    1

Description

Source:

Creator: B. German, Central Research Establishment, Home Office Forensic Science Service, Aldermaston, Reading, Berkshire RG7 4PN

Donor: Vina Spiehler, Ph.D., DABFT, Diagnostic Products Corporation

These data have been taken from the UCI Repository Of Machine Learning Databases (Blake & Merz 1998) and were converted to R format by Friedrich Leisch in the late 1990s.

A data frame with 214 observation containing examples of the chemical analysis of 7 different types of glass. The problem is to predict the type of glass on basis of the chemical analysis.

It contains 214 observations on the following 10 variables (columns):

- RI: Refractive index. Floating Point

- Na: Sodium. Floating Point

- Mg: Magnesium. Floating Point

- Al: Aluminum. Floating Point

- Si: Silicon. Floating Point

- K: Potassium. Floating Point

- Ca: Calcium. Floating Point

- Ba: Barium. Floating Point

- Fe: Iron. Floating Point

- Type: Type of glass (class attribute). Integer

nrow(dataset)

## [1] 214

names(dataset)

##  [1] "RI"   "Na"   "Mg"   "Al"   "Si"   "K"    "Ca"   "Ba"   "Fe"   "Type"

str(dataset)

## 'data.frame':    214 obs. of  10 variables:
##  $ RI  : num  1.52 1.52 1.52 1.52 1.52 ...
##  $ Na  : num  13.6 13.9 13.5 13.2 13.3 ...
##  $ Mg  : num  4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
##  $ Al  : num  1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
##  $ Si  : num  71.8 72.7 73 72.6 73.1 ...
##  $ K   : num  0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
##  $ Ca  : num  8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
##  $ Ba  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Fe  : num  0 0 0 0 0 0.26 0 0 0 0.11 ...
##  $ Type: int  1 1 1 1 1 1 1 1 1 1 ...

Missing values

There are no missing values (NA or NULL values).

colSums(is.na(dataset))

##   RI   Na   Mg   Al   Si    K   Ca   Ba   Fe Type 
##    0    0    0    0    0    0    0    0    0    0

any(is.na(dataset))

## [1] FALSE

Analyze numerical variables

library(psych)

describe(dataset[, !(names(dataset) == "Type")])

##    vars   n  mean   sd median trimmed  mad   min   max range  skew kurtosis
## RI    1 214  1.52 0.00   1.52    1.52 0.00  1.51  1.53  0.02  1.60     4.72
## Na    2 214 13.41 0.82  13.30   13.38 0.64 10.73 17.38  6.65  0.45     2.90
## Mg    3 214  2.68 1.44   3.48    2.87 0.30  0.00  4.49  4.49 -1.14    -0.45
## Al    4 214  1.44 0.50   1.36    1.41 0.31  0.29  3.50  3.21  0.89     1.94
## Si    5 214 72.65 0.77  72.79   72.71 0.57 69.81 75.41  5.60 -0.72     2.82
## K     6 214  0.50 0.65   0.56    0.43 0.17  0.00  6.21  6.21  6.46    52.87
## Ca    7 214  8.96 1.42   8.60    8.74 0.66  5.43 16.19 10.76  2.02     6.41
## Ba    8 214  0.18 0.50   0.00    0.03 0.00  0.00  3.15  3.15  3.37    12.08
## Fe    9 214  0.06 0.10   0.00    0.04 0.00  0.00  0.51  0.51  1.73     2.52
##      se
## RI 0.00
## Na 0.06
## Mg 0.10
## Al 0.03
## Si 0.05
## K  0.04
## Ca 0.10
## Ba 0.03
## Fe 0.01

Apart from Mg and Si, the skew values of the other elements are all positive and much greater than 0. This means that that the distributions are positively skewed (right-skewed). So, the right tail (larger values) is longer than the left tail, and the majority of data points are on the lower end.

The data is significantly skewed. To normalize the distribution transformations (log) are needed.

The K (in primis), Ca, Ba and RI variables have kurtosis values > 3. This means that their distribution has heavier tails than the normal distribution, meaning there are more outliers.

The Mg and Al variables have kurtosis values < 3. So the distibution has lighter tails than the normal distribution, meaning fewer outliers.

Distributions

hist(dataset$RI, 
     main="Histogram of RI (refractive index)",
     xlab="RI"
     )

hist(dataset$Na, 
     main="Histogram of Na (Sodium)",
     xlab="Na"
     )

hist(dataset$Mg, 
     main="Histogram of Mg (Magnesium)",
     xlab="Mg"
     )

hist(dataset$Al, 
     main="Histogram of Al (Aluminum)",
     xlab="Al"
     )

hist(dataset$Si, 
     main="Histogram of Si (Silicon)",
     xlab="Si"
     )

hist(dataset$K, 
     main="Histogram of K (Potassium)",
     xlab="K"
     )

hist(dataset$Ca, 
     main="Histogram of Ca (Calcium)",
     xlab="Ca"
     )

hist(dataset$Ba, 
     main="Histogram of Ba (Barium)",
     xlab="Ba"
     )

hist(dataset$Fe, 
     main="Histogram of Fe (Iron)",
     xlab="Fe"
     )

Check for outliers

boxplot(dataset$RI, main = "refractive index", horizontal = FALSE)

boxplot(dataset$Na, main = "sodium", horizontal = FALSE)

boxplot(dataset$Mg, main = "magnesium", horizontal = FALSE)

boxplot(dataset$Al, main = "aluminum", horizontal = FALSE)

boxplot(dataset$Si, main = "silicon", horizontal = FALSE)

boxplot(dataset$K, main = "potassium", horizontal = FALSE)

boxplot(dataset$Ca, main = "calcium", horizontal = FALSE)

boxplot(dataset$Ba, main = "barium", horizontal = FALSE)

boxplot(dataset$Fe, main = "iron", horizontal = FALSE)

There are a lot of outliers but this is probably due to the natural variation (concentration) of the chemical elements in glass composition. So I am treating them as valid data points.

Analyze categorical variables

Distributions:

The distribution of glass types is uneven, with types 2, 5, and 6 being significantly underrepresented. The fourth type is not represented at all.

barplot(table(dataset$Type), main = "Type of Glass Distribution")

Log Transformation

+1 to avoid issues with zero or negative values

dataset$K_log = log(dataset$K + 1)
dataset$Ca_log = log(dataset$Ca + 1)
dataset$Ba_log = log(dataset$Ba + 1)

describe(dataset$K_log)

##    vars   n mean   sd median trimmed  mad min  max range skew kurtosis   se
## X1    1 214 0.36 0.27   0.44    0.35 0.11   0 1.98  1.98 1.95    10.34 0.02

describe(dataset$Ca_log)

##    vars   n mean   sd median trimmed  mad  min  max range skew kurtosis   se
## X1    1 214 2.29 0.13   2.26    2.27 0.07 1.86 2.84  0.98 1.15        4 0.01

describe(dataset$Ba_log)

##    vars   n mean   sd median trimmed mad min  max range skew kurtosis   se
## X1    1 214 0.11 0.28      0    0.03   0   0 1.42  1.42 2.69     6.39 0.02

After the log, the skew values have been reduced to resemble more a normal distribution.

Standardization

After transforming the data, it is often a good idea to standardize the variables (mean = 0, standard deviation = 1). This ensures that PCA does not prioritize variables with larger scales.

Remove the columns not log-transformed and the Type column that should not be standardized.

columns_to_remove = c("K", "Ca", "Ba", "Type")
subset = dataset[, !names(dataset) %in% columns_to_remove]

names(subset)

## [1] "RI"     "Na"     "Mg"     "Al"     "Si"     "Fe"     "K_log"  "Ca_log"
## [9] "Ba_log"

dataset_standardized = scale(subset)

dataset_standardized = as.data.frame(dataset_standardized)
dataset_standardized$Type = dataset$Type

head(dataset_standardized)

##           RI         Na        Mg         Al          Si         Fe      K_log
## 1  0.8708258  0.2842867 1.2517037 -0.6908222 -1.12444556 -0.5850791 -1.0941592
## 2 -0.2487502  0.5904328 0.6346799 -0.1700615  0.10207972 -0.5850791  0.1295429
## 3 -0.7196308  0.1495824 0.6000157  0.1904651  0.43776033 -0.5850791 -0.1004727
## 4 -0.2322859 -0.2422846 0.6970756 -0.3102663 -0.05284979 -0.5850791  0.3459757
## 5 -0.3113148 -0.1688095 0.6485456 -0.4104126  0.55395746 -0.5850791  0.2989716
## 6 -0.7920739 -0.7566101 0.6416128  0.3506992  0.41193874  2.0832652  0.5059006
##        Ca_log     Ba_log Type
## 1 -0.09282075 -0.3859754    1
## 2 -0.85379817 -0.3859754    1
## 3 -0.89739813 -0.3859754    1
## 4 -0.52195771 -0.3859754    1
## 5 -0.64789711 -0.3859754    1
## 6 -0.64789711 -0.3859754    1

describe(dataset_standardized)

##        vars   n mean  sd median trimmed  mad   min  max range  skew kurtosis
## RI        1 214 0.00 1.0  -0.23   -0.12 0.62 -2.38 5.13  7.50  1.60     4.72
## Na        2 214 0.00 1.0  -0.13   -0.04 0.79 -3.28 4.86  8.14  0.45     2.90
## Mg        3 214 0.00 1.0   0.55    0.13 0.21 -1.86 1.25  3.11 -1.14    -0.45
## Al        4 214 0.00 1.0  -0.17   -0.07 0.62 -2.31 4.12  6.43  0.89     1.94
## Si        5 214 0.00 1.0   0.18    0.07 0.74 -3.67 3.56  7.23 -0.72     2.82
## Fe        6 214 0.00 1.0  -0.59   -0.22 0.00 -0.59 4.65  5.23  1.73     2.52
## K_log     7 214 0.00 1.0   0.31   -0.04 0.39 -1.31 5.93  7.24  1.95    10.34
## Ca_log    8 214 0.00 1.0  -0.21   -0.11 0.53 -3.29 4.26  7.55  1.15     4.00
## Ba_log    9 214 0.00 1.0  -0.39   -0.29 0.00 -0.39 4.64  5.02  2.69     6.39
## Type     10 214 2.78 2.1   2.00    2.48 1.48  1.00 7.00  6.00  1.10    -0.33
##          se
## RI     0.07
## Na     0.07
## Mg     0.07
## Al     0.07
## Si     0.07
## Fe     0.07
## K_log  0.07
## Ca_log 0.07
## Ba_log 0.07
## Type   0.14

RQ

RQ: What are the primary patterns of variation in the chemical composition of different types of glass and which chemical components contribute most significantly to the differentiation between glass types?

In this case the correlation matrix makes more sense since the variables have different scales and units. The analysis will focus on the relationships between variables after accounting for differences in scale, so no variable disproportionately influences the results.

dataset_standardized_pca= dataset_standardized[, c(1, 2, 3 , 4, 5, 7, 8, 9, 6)]

R_standardized = cor(dataset_standardized_pca)

round(R_standardized, 3)

##            RI     Na     Mg     Al     Si  K_log Ca_log Ba_log     Fe
## RI      1.000 -0.192 -0.122 -0.407 -0.542 -0.327  0.798 -0.050  0.143
## Na     -0.192  1.000 -0.274  0.157 -0.070 -0.434 -0.261  0.389 -0.241
## Mg     -0.122 -0.274  1.000 -0.482 -0.166  0.226 -0.416 -0.530  0.083
## Al     -0.407  0.157 -0.482  1.000 -0.006  0.249 -0.282  0.526 -0.074
## Si     -0.542 -0.070 -0.166 -0.006  1.000 -0.127 -0.183 -0.056 -0.094
## K_log  -0.327 -0.434  0.226  0.249 -0.127  1.000 -0.439 -0.160  0.063
## Ca_log  0.798 -0.261 -0.416 -0.282 -0.183 -0.439  1.000 -0.162  0.129
## Ba_log -0.050  0.389 -0.530  0.526 -0.056 -0.160 -0.162  1.000 -0.074
## Fe      0.143 -0.241  0.083 -0.074 -0.094  0.063  0.129 -0.074  1.000

corPlot(R_standardized)

There is an insufficient correlation between the variables (each one needs to have a MSA greater than 0.5). This suggests that the components may not capture meaningful patterns.

PCA can still perform dimensionality reduction but the results might lack significance.

KMO(R_standardized)

## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = R_standardized)
## Overall MSA =  0.22
## MSA for each item = 
##     RI     Na     Mg     Al     Si  K_log Ca_log Ba_log     Fe 
##   0.69   0.14   0.18   0.29   0.11   0.15   0.25   0.22   0.91

det(R_standardized)

## [1] 0.000867296

Bartlett’s test evaluates whether the correlation matrix is significantly different from the identity matrix. This help to determine if the dataset is suitable for PCA or FA.

Null Hypothesis: the variables are uncorrelated, and the correlation matrix is equal to the identity matrix

Alternative Hypothesis: the variables are correlated, and the correlation matrix is not the identity matrix.

Since the p-value is < 0.05, the null hypthesis is rejected. This means the correlation matrix is significantly different from the identity matrix, and there are correlations among the variables.

So, according to Bartlett’s test, the dataset is suitable for PCA or FA. This test does not account for sampling adequacy (KMO).

On the other hand, according the KMO test, the dataset is not suitable for PCA.

cortest.bartlett(R_standardized, n = nrow(dataset_standardized_pca))

## $chisq
## [1] 1474.652
## 
## $p.value
## [1] 9.831003e-287
## 
## $df
## [1] 36

I decided to continue the analysis and see the results anyway.

Analysis:

PC1 to PC4 have eigenvalues > 1, so they are significant under the Kaiser criterion. The first 4 components together capture a substantial amount of variance in the dataset (81.2%).

The remaining components explain progressively less variance.

PC5 adds about 9.7% to the total variance, bringing the cumulative variance to about 90.9%.

PC6 to PC9 contribute only a small amount of variance and are less meaningful in representing the data.

library(FactoMineR)

components_all = PCA(dataset_standardized_pca, 
                 scale.unit = FALSE,
                 graph = FALSE
                 )

library(factoextra)

## Loading required package: ggplot2

## 
## Attaching package: 'ggplot2'

## The following objects are masked from 'package:psych':
## 
##     %+%, alpha

## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa

get_eigenvalue(components_all)

##       eigenvalue variance.percent cumulative.variance.percent
## Dim.1 2.53440299       28.2922400                    28.29224
## Dim.2 2.23020703       24.8964165                    53.18866
## Dim.3 1.35882014       15.1688842                    68.35754
## Dim.4 1.15091117       12.8479389                    81.20548
## Dim.5 0.87059317        9.7186718                    90.92415
## Dim.6 0.43448182        4.8502404                    95.77439
## Dim.7 0.30074476        3.3572967                    99.13169
## Dim.8 0.06507632        0.7264649                    99.85815
## Dim.9 0.01270654        0.1418466                   100.00000

This is a good results since to obtain 80% of variance you need only 4 components out of 9, and to obtain 90% you need to keep only 5 components.

This reduction should reduce noise and improve interpretability without losing too much information.

Scree plot

This plot shows the eigenvalues for each principal components in a visual way.

As said earlier, PC1 to PC4 have eigenvalues > 1 and, based on the Kaiser criterion, these principal components are those really significant and explain most of the dataset’variance.

The “elbow” in the plot (where the curve starts to flatten) appears at PC4, confirming that the first 4 PCs are sufficient to capture the majority of variance.

fviz_eig(components_all,
         choice = "eigenvalue",
         main = "Scree plot",
         ylab = "Eigenvalue",
         xlab = "Principal Component",
         addlabels = TRUE
         )

Parallel analysis

Parallel analysis compares the eigenvalues of the actual data (blue line) with those of re-sampled (random) data (red dashed line).

PCs with eigenvalues from the actual data that are greater than the corresponding eigenvalues of the resampled data are considered significant.

Parallel analysis suggests that the number of components should be 4.

This result aligns with the earlier scree plot and eigenvalue analysis.

fa.parallel(dataset_standardized_pca, 
            sim = FALSE, 
            fa = "pc",
            )

## Warning in fa.stats(r = r, f = f, phi = phi, n.obs = n.obs, np.obs = np.obs, :
## The estimated weights for the factor scores are probably incorrect.  Try a
## different factor score estimation method.

## Warning in fac(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate, : An
## ultra-Heywood case was detected.  Examine the results carefully

## Parallel analysis suggests that the number of factors =  NA  and the number of components =  4

PCA using only the first 4 PC

library(FactoMineR)

components_4 = PCA(dataset_standardized_pca,
                 ncp = 4,
                 scale.unit = FALSE,
                 graph = FALSE
                 )

components_4

## **Results for the Principal Component Analysis (PCA)**
## The analysis was performed on 214 individuals, described by 9 variables
## *The results are available in the following objects:
## 
##    name               description                          
## 1  "$eig"             "eigenvalues"                        
## 2  "$var"             "results for the variables"          
## 3  "$var$coord"       "coord. for the variables"           
## 4  "$var$cor"         "correlations variables - dimensions"
## 5  "$var$cos2"        "cos2 for the variables"             
## 6  "$var$contrib"     "contributions of the variables"     
## 7  "$ind"             "results for the individuals"        
## 8  "$ind$coord"       "coord. for the individuals"         
## 9  "$ind$cos2"        "cos2 for the individuals"           
## 10 "$ind$contrib"     "contributions of the individuals"   
## 11 "$call"            "summary statistics"                 
## 12 "$call$centre"     "mean of the variables"              
## 13 "$call$ecart.type" "standard error of the variables"    
## 14 "$call$row.w"      "weights for the individuals"        
## 15 "$call$col.w"      "weights for the variables"

Correlation of each variable with the PCs.

This values show how strongly each original variable is associated with each PC (higher absolute values indicate stronger relationships).

Dim. 1 is strongly correlated (influenced) with RI and Ca_log.

Dim. 2 is strongly correlated with Mg, Ba_log, and Na. This dimension captures variability related to these variables

Dim. 3 is strongly correlated with K_log and Si. It represents features related to this variables.

Dim. 4 is strongly correlated with Si and Na.

components_4$var$cor

##             Dim.1      Dim.2      Dim.3       Dim.4
## RI     -0.8497531  0.3922069  0.1871308 -0.15822541
## Na      0.4146429  0.5077198 -0.4057552 -0.49102846
## Mg     -0.1933380 -0.8138143 -0.1850017 -0.41002133
## Al      0.6910020  0.2994248  0.5158736  0.10637838
## Si      0.3669094 -0.1199463 -0.5130974  0.74555308
## K_log   0.2754157 -0.6207597  0.6075550  0.01114375
## Ca_log -0.7795654  0.4888027  0.0672862  0.30757666
## Ba_log  0.4869420  0.6480424  0.2550265 -0.12228892
## Fe     -0.2865628 -0.1360378  0.4040237  0.21261434

Contribution of each original variable to each dimension.

It shows the percentage that each variable contributed to the variance explained by each PC.

Higher percentage means a variable is more important for that PC.

It is easier to interpret with respect to the previous table.

components_4$var$contrib

##            Dim.1      Dim.2      Dim.3       Dim.4
## RI     28.358004  6.8651670  2.5650406  2.16509263
## Na      6.752095 11.5045283 12.0595752 20.85150262
## Mg      1.467995 29.5577411  2.5070064 14.53908002
## Al     18.752049  4.0012534 19.4935296  0.97865751
## Si      5.286983  0.6420872 19.2842849 48.07078002
## K_log   2.978981 17.1975929 27.0380330  0.01073956
## Ca_log 23.866858 10.6632089  0.3316315  8.18145960
## Ba_log  9.312034 18.7424975  4.7640286  1.29329699
## Fe      3.225000  0.8259238 11.9568700  3.90939105

It seems that the variables that contribute the most to the variance captured by the first two dimensions (which are the most important ones and have similar eigenvalues) are:

fviz_pca_var(components_4, repel = TRUE)

Together, the first 2 components explain 53.2% of the total variance in the data.

Variables that point in similar directions are positively correlated.

Variables pointing in opposite directions are negatively correlated.

The right-angle variables are probably uncorrelated.

The length of the arrows indicates how well that variable is represented in these two dimensions.

Na and Ba are strongly positively correlated
RI and Ca are positively correlated
Si points nearly horizontally to the right, suggesting it is strongly associated with the first principal component rather than the second component. The same holds for Fe (nearly horizontally to the left)
Mg points nearly vertical to the left, suggesting it is strongly associated with the second principal component rather than the first component.
Fe appears to have relatively weak contributions to both components (better association with the first component)

table(dataset$Type)

## 
##  1  2  3  5  6  7 
## 70 76 17 13  9 29

dataset$Type = as.factor(dataset$Type)

fviz_pca_biplot(components_4,
                geom.ind = "point",
                col.ind = dataset$Type,
                palette = "RdBu",
                addEllipses = TRUE,  
                label = "var",
                col.var = "black",     
                repel = FALSE,          
                legend.title = "Glass Type")

The PCA biplot shows both the variables (arrows) and glass samples colored by their type.

The position of a sample along the principal component axes indicates its contribution to the respective component.

Samples close together are more similar in terms of their chemical composition.

The length and direction of an arrow indicate the variable’s importance and correlation with the principal components.

There is a central region that presents a lot of overlap between glass types. The first two PC are not enough to separate them.

In particular the types 5 and 6 which are under-represented in the dataset do not form clusters.

The ellipses show the 95% confidence regions for each glass type

Type 7 (blue asterisks) clusters in the upper right, strongly associated with Ba, Na, and Al
Type 2 (orange triangles) spreads across the second-fourth quadrant bisector. However, the majority appears to be clustered slightly below the origin.
Type 1 (red dots) spreads near the origin.

head(components_4$ind$coord)

##        Dim.1       Dim.2      Dim.3       Dim.4
## 1 -1.2554354 -0.09449284 -0.8295099 -1.69285829
## 2  0.5855765 -0.69861737 -0.7336488 -0.73572334
## 3  0.9408826 -0.82630964 -0.7608286 -0.19836287
## 4  0.1317710 -1.00769430 -0.3139305 -0.40581744
## 5  0.3484001 -1.06738367 -0.6868016 -0.03493444
## 6  0.3059769 -1.55363978  0.8701100  0.81342206

RQ: what are the primary patterns of variation in the chemical composition of different types of glass, and can we identify which chemical components contribute most significantly to the differentiation between glass types?

Conclusion:

Based on the PCA analysis:

- the first 4 principal components explain 81.2% of the total variance in the glass composition data, indicating moderate patterns in chemical composition.

- PC1 is primarily driven by RI (refractive index) and Ca, suggesting these elements are the main differentiators in glass composition

- PC2 captures variability linked to Mg, Ba and Na, representing a secondary pattern of chemical composition

- PC3 reflects variations tied to K and Si

- PC4 highlights variability linked to Si and Na, possibly capturing subtle compositional differences.

Different variables dominate different PCs, indicating that the dataset is multidimensional, with distinct patterns captured by the PCs.

The analysis did not reveal clear patterns in how chemical compositions vary across different glass types. The plots based on the first 2 PCs did not show distinct clustering based on their chemical composition.

PCA does not seem to perform good but this is expected since the KMO test highlighted that there is no sufficient correlation between variables to effectively apply PCA.

Let’s see if we can effectively reduce the dimensionality of the glass dataset without compromising performance.

(Optional) Test the model with PCA components

Add the coord. of the first 4 PC for the individuals samples to the dataset

dataset_standardized$PC1 = components_all$ind$coord[, 1]
dataset_standardized$PC2 = components_all$ind$coord[, 2] 
dataset_standardized$PC3 = components_all$ind$coord[, 3] 
dataset_standardized$PC4 = components_all$ind$coord[, 4] 

head(dataset_standardized)

##           RI         Na        Mg         Al          Si         Fe      K_log
## 1  0.8708258  0.2842867 1.2517037 -0.6908222 -1.12444556 -0.5850791 -1.0941592
## 2 -0.2487502  0.5904328 0.6346799 -0.1700615  0.10207972 -0.5850791  0.1295429
## 3 -0.7196308  0.1495824 0.6000157  0.1904651  0.43776033 -0.5850791 -0.1004727
## 4 -0.2322859 -0.2422846 0.6970756 -0.3102663 -0.05284979 -0.5850791  0.3459757
## 5 -0.3113148 -0.1688095 0.6485456 -0.4104126  0.55395746 -0.5850791  0.2989716
## 6 -0.7920739 -0.7566101 0.6416128  0.3506992  0.41193874  2.0832652  0.5059006
##        Ca_log     Ba_log Type        PC1         PC2        PC3         PC4
## 1 -0.09282075 -0.3859754    1 -1.2554354 -0.09449284 -0.8295099 -1.69285829
## 2 -0.85379817 -0.3859754    1  0.5855765 -0.69861737 -0.7336488 -0.73572334
## 3 -0.89739813 -0.3859754    1  0.9408826 -0.82630964 -0.7608286 -0.19836287
## 4 -0.52195771 -0.3859754    1  0.1317710 -1.00769430 -0.3139305 -0.40581744
## 5 -0.64789711 -0.3859754    1  0.3484001 -1.06738367 -0.6868016 -0.03493444
## 6 -0.64789711 -0.3859754    1  0.3059769 -1.55363978  0.8701100  0.81342206

Split standardized data:

X = dataset_standardized[, !names(dataset_standardized) %in% c("Type", "PC1", "PC2", "PC3", "PC4")]
X_pca_4 = dataset_standardized[, names(dataset_standardized) %in% c("PC1", "PC2", "PC3", "PC4")]
y = as.factor(dataset_standardized$Type)

table(dataset_standardized$Type)

## 
##  1  2  3  5  6  7 
## 70 76 17 13  9 29

Cross-validation with 5 folds should help understand the performance of the model and if it is generalizing well.

I used Multinomial Logistic Regression to predict the Type of glass from the other variables.

There are 7 types of glass, but the fourth type has no observations. Additionally, types 3, 5, and 6 are very under-represented in the dataset.

library(caret)

## Loading required package: lattice

ctrl = trainControl(method = "cv", number = 5)

cv_original = train(
  x = X,
  y = y,
  method = "multinom",
  trControl = ctrl
)

## # weights:  66 (50 variable)
## initial  value 306.390869 
## iter  10 value 119.607397
## iter  20 value 99.578370
## iter  30 value 89.252229
## iter  40 value 87.370787
## iter  50 value 86.344433
## iter  60 value 86.047476
## iter  70 value 85.910460
## iter  80 value 85.877732
## iter  90 value 85.876975
## final  value 85.876845 
## converged
## # weights:  66 (50 variable)
## initial  value 306.390869 
## iter  10 value 129.018273
## iter  20 value 122.172840
## iter  30 value 121.782146
## iter  40 value 121.772014
## final  value 121.772009 
## converged
## # weights:  66 (50 variable)
## initial  value 306.390869 
## iter  10 value 119.617222
## iter  20 value 99.618283
## iter  30 value 89.430085
## iter  40 value 87.584220
## iter  50 value 86.743208
## iter  60 value 86.597925
## iter  70 value 86.548121
## iter  80 value 86.535529
## iter  90 value 86.525710
## iter 100 value 86.513701
## final  value 86.513701 
## stopped after 100 iterations
## # weights:  66 (50 variable)
## initial  value 306.390869 
## iter  10 value 128.578568
## iter  20 value 104.825633
## iter  30 value 97.830761
## iter  40 value 92.912930
## iter  50 value 91.412432
## iter  60 value 89.075408
## iter  70 value 87.538180
## iter  80 value 87.484809
## iter  90 value 87.483353
## final  value 87.483345 
## converged
## # weights:  66 (50 variable)
## initial  value 306.390869 
## iter  10 value 135.508133
## iter  20 value 123.406852
## iter  30 value 123.214901
## iter  40 value 123.209873
## iter  40 value 123.209872
## iter  40 value 123.209872
## final  value 123.209872 
## converged
## # weights:  66 (50 variable)
## initial  value 306.390869 
## iter  10 value 128.591223
## iter  20 value 104.854370
## iter  30 value 97.919497
## iter  40 value 93.145561
## iter  50 value 91.753587
## iter  60 value 90.327256
## iter  70 value 90.053350
## iter  80 value 89.983305
## iter  90 value 89.950427
## iter 100 value 89.923114
## final  value 89.923114 
## stopped after 100 iterations
## # weights:  66 (50 variable)
## initial  value 306.390869 
## iter  10 value 111.758722
## iter  20 value 84.708463
## iter  30 value 77.632457
## iter  40 value 74.549375
## iter  50 value 73.743312
## iter  60 value 73.394303
## iter  70 value 73.206633
## iter  80 value 72.946136
## iter  90 value 71.153435
## iter 100 value 67.740336
## final  value 67.740336 
## stopped after 100 iterations
## # weights:  66 (50 variable)
## initial  value 306.390869 
## iter  10 value 127.690406
## iter  20 value 115.575617
## iter  30 value 115.346198
## iter  40 value 115.344799
## final  value 115.344797 
## converged
## # weights:  66 (50 variable)
## initial  value 306.390869 
## iter  10 value 111.774313
## iter  20 value 84.778390
## iter  30 value 77.231982
## iter  40 value 74.650410
## iter  50 value 74.020283
## iter  60 value 73.806701
## iter  70 value 73.729759
## iter  80 value 73.670981
## iter  90 value 73.658580
## iter 100 value 73.646885
## final  value 73.646885 
## stopped after 100 iterations
## # weights:  66 (50 variable)
## initial  value 309.974388 
## iter  10 value 125.590212
## iter  20 value 108.764871
## iter  30 value 97.382442
## iter  40 value 92.752814
## iter  50 value 91.398766
## iter  60 value 90.431569
## iter  70 value 90.233981
## iter  80 value 90.219069
## iter  90 value 90.197145
## iter 100 value 90.176196
## final  value 90.176196 
## stopped after 100 iterations
## # weights:  66 (50 variable)
## initial  value 309.974388 
## iter  10 value 134.916029
## iter  20 value 127.897652
## iter  30 value 127.549852
## iter  40 value 127.525607
## final  value 127.525604 
## converged
## # weights:  66 (50 variable)
## initial  value 309.974388 
## iter  10 value 125.600265
## iter  20 value 108.795280
## iter  30 value 97.514664
## iter  40 value 93.021845
## iter  50 value 91.819696
## iter  60 value 91.267320
## iter  70 value 91.177621
## iter  80 value 91.168024
## iter  90 value 91.152091
## iter 100 value 91.145213
## final  value 91.145213 
## stopped after 100 iterations
## # weights:  66 (50 variable)
## initial  value 304.599110 
## iter  10 value 120.268905
## iter  20 value 102.738940
## iter  30 value 93.104347
## iter  40 value 87.745375
## iter  50 value 86.346784
## iter  60 value 84.719225
## iter  70 value 84.613418
## iter  80 value 84.612147
## final  value 84.612140 
## converged
## # weights:  66 (50 variable)
## initial  value 304.599110 
## iter  10 value 131.040222
## iter  20 value 124.342847
## iter  30 value 124.207613
## iter  40 value 124.204604
## iter  40 value 124.204604
## iter  40 value 124.204604
## final  value 124.204604 
## converged
## # weights:  66 (50 variable)
## initial  value 304.599110 
## iter  10 value 120.280500
## iter  20 value 102.781170
## iter  30 value 93.255224
## iter  40 value 88.115981
## iter  50 value 86.940158
## iter  60 value 86.194152
## iter  70 value 86.135971
## iter  80 value 86.029160
## iter  90 value 85.942162
## iter 100 value 85.911808
## final  value 85.911808 
## stopped after 100 iterations
## # weights:  66 (50 variable)
## initial  value 383.436526 
## iter  10 value 167.512543
## iter  20 value 155.011106
## iter  30 value 154.493623
## iter  40 value 154.461659
## final  value 154.461629 
## converged

cv_original

## Penalized Multinomial Regression 
## 
## 214 samples
##   9 predictor
##   6 classes: '1', '2', '3', '5', '6', '7' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 171, 171, 171, 173, 170 
## Resampling results across tuning parameters:
## 
##   decay  Accuracy   Kappa    
##   0e+00  0.6454803  0.5115487
##   1e-04  0.6454803  0.5148665
##   1e-01  0.6680864  0.5408954
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was decay = 0.1.

cv_pca_4 = train(
  x = X_pca_4,
  y = y,
  method = "multinom",
  trControl = ctrl
)

## # weights:  36 (25 variable)
## initial  value 306.390869 
## iter  10 value 141.279455
## iter  20 value 137.242487
## iter  30 value 137.065838
## final  value 137.065804 
## converged
## # weights:  36 (25 variable)
## initial  value 306.390869 
## iter  10 value 148.605775
## iter  20 value 146.133289
## iter  30 value 146.032366
## final  value 146.032364 
## converged
## # weights:  36 (25 variable)
## initial  value 306.390869 
## iter  10 value 141.287300
## iter  20 value 137.253655
## iter  30 value 137.078825
## final  value 137.078793 
## converged
## # weights:  36 (25 variable)
## initial  value 308.182629 
## iter  10 value 148.322099
## iter  20 value 141.595414
## iter  30 value 141.156695
## final  value 141.150896 
## converged
## # weights:  36 (25 variable)
## initial  value 308.182629 
## iter  10 value 157.971931
## iter  20 value 151.003039
## iter  30 value 150.928052
## final  value 150.928043 
## converged
## # weights:  36 (25 variable)
## initial  value 308.182629 
## iter  10 value 148.326566
## iter  20 value 141.607696
## iter  30 value 141.171362
## final  value 141.165622 
## converged
## # weights:  36 (25 variable)
## initial  value 306.390869 
## iter  10 value 152.323808
## iter  20 value 146.339408
## iter  30 value 146.187832
## final  value 146.187813 
## converged
## # weights:  36 (25 variable)
## initial  value 306.390869 
## iter  10 value 158.817751
## iter  20 value 154.194659
## iter  30 value 154.107519
## final  value 154.107479 
## converged
## # weights:  36 (25 variable)
## initial  value 306.390869 
## iter  10 value 152.330622
## iter  20 value 146.349409
## iter  30 value 146.198563
## final  value 146.198543 
## converged
## # weights:  36 (25 variable)
## initial  value 304.599110 
## iter  10 value 137.985319
## iter  20 value 131.723461
## iter  30 value 130.914419
## final  value 130.911076 
## converged
## # weights:  36 (25 variable)
## initial  value 304.599110 
## iter  10 value 145.586322
## iter  20 value 141.085128
## iter  30 value 141.052525
## iter  30 value 141.052523
## iter  30 value 141.052523
## final  value 141.052523 
## converged
## # weights:  36 (25 variable)
## initial  value 304.599110 
## iter  10 value 137.993345
## iter  20 value 131.735728
## iter  30 value 130.932052
## final  value 130.927221 
## converged
## # weights:  36 (25 variable)
## initial  value 308.182629 
## iter  10 value 152.375443
## iter  20 value 145.875669
## iter  30 value 145.767692
## final  value 145.766714 
## converged
## # weights:  36 (25 variable)
## initial  value 308.182629 
## iter  10 value 159.771323
## iter  20 value 153.919474
## iter  30 value 153.621443
## final  value 153.620756 
## converged
## # weights:  36 (25 variable)
## initial  value 308.182629 
## iter  10 value 152.383329
## iter  20 value 145.885776
## iter  30 value 145.778264
## final  value 145.777306 
## converged
## # weights:  36 (25 variable)
## initial  value 383.436526 
## iter  10 value 187.672768
## iter  20 value 179.437599
## iter  30 value 178.997175
## final  value 178.988655 
## converged

cv_pca_4

## Penalized Multinomial Regression 
## 
## 214 samples
##   4 predictor
##   6 classes: '1', '2', '3', '5', '6', '7' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 171, 172, 171, 170, 172 
## Resampling results across tuning parameters:
## 
##   decay  Accuracy   Kappa    
##   0e+00  0.6315816  0.4825275
##   1e-04  0.6315816  0.4825275
##   1e-01  0.6268247  0.4749894
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was decay = 1e-04.

original_accuracy = max(cv_original$results$Accuracy)
pca_accuracy_4 = max(cv_pca_4$results$Accuracy)

cat("Accuracy (Original):", original_accuracy, "\n")

## Accuracy (Original): 0.6680864

cat("Accuracy (PCA, 4 PCs):", pca_accuracy_4, "\n")

## Accuracy (PCA, 4 PCs): 0.6315816

I found that using the first 4 principal components achieved comparable accuracy to using all 9 original variables. This indicates that these components effectively capture some of the underlying patterns in the data.

However, the low accuracy suggests that accurately classifying glass types based on their chemical composition may be inherently challenging due to subtle differences between them.

Johana Radojević - HOMEWORK ASSIGNMENT 2

Johana Radojević

2025-02-01