knitr::opts_chunk$set(echo = TRUE)
For this exercise, we will focus on a dataset that represents the quality of different types of Portuguese red wine. The quality is determined based on various attributes that characterize each type of wine. Through Factor Analysis, we will explore the possibility of classifying them based on different characteristics of the wine itself, such as alcohol percentage or density.
The subset of variables from the original dataset that we will use are as follows:
We first load the necessary packages
library(MASS)
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:MASS':
##
## select
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(stats)
library(base)
We then load the dataset
df <- read.csv("/Users/arnauandrews/Desktop/UB-Master/M4 - Técnicas Avanzadas de Minería de Datos/Data/4.2_PCA_AF_ejercicio.csv", header=TRUE, sep = ";")
As you can see, the dataset contains variables that we do not need for the exercise, so we need to select only the variables defined in the previous section.
###1.1. Selecting the variables defined in the previous section from the original dataset.
# Selecting required variables
df = df %>% select(residual.sugar, density, pH, alcohol, citric.acid, volatile.acidity)
Once you have prepared the dataset, perform Factor Analysis for 2
factors using the factanal function.
# Factor Analysis
df.fa<-factanal(df, factors= 2)
print(df.fa)
##
## Call:
## factanal(x = df, factors = 2)
##
## Uniquenesses:
## residual.sugar density pH alcohol
## 0.874 0.005 0.681 0.654
## citric.acid volatile.acidity
## 0.005 0.635
##
## Loadings:
## Factor1 Factor2
## residual.sugar 0.343
## density 0.225 0.972
## pH -0.514 -0.234
## alcohol 0.194 -0.555
## citric.acid 0.987 0.147
## volatile.acidity -0.583 0.158
##
## Factor1 Factor2
## SS loadings 1.675 1.471
## Proportion Var 0.279 0.245
## Cumulative Var 0.279 0.524
##
## Test of the hypothesis that 2 factors are sufficient.
## The chi square statistic is 252.24 on 4 degrees of freedom.
## The p-value is 2.14e-53
To identify the variables that are not well represented by the factors, we can look at the factor loadings. High factor loadings indicate that a variable is well represented by the corresponding factor, while low loadings suggest a poor representation.
# Print the variables not well represented by the factors
print("The variables that are not well represented by the factors are:")
## [1] "The variables that are not well represented by the factors are:"
print("1. pH with a loading of 0.681")
## [1] "1. pH with a loading of 0.681"
print("2. residual.sugar with a loading of 0.874")
## [1] "2. residual.sugar with a loading of 0.874"
print("3. alcohol with a loading of 0.654")
## [1] "3. alcohol with a loading of 0.654"
print("4. volatile.acidity with a loading of 0.635")
## [1] "4. volatile.acidity with a loading of 0.635"
print("A high loading value indicates that the factors do not explain the variance of that variable well.")
## [1] "A high loading value indicates that the factors do not explain the variance of that variable well."
# Uniqueness of each variable
print(df.fa$uniquenesses)
## residual.sugar density pH alcohol
## 0.8736706 0.0050000 0.6814500 0.6542943
## citric.acid volatile.acidity
## 0.0050000 0.6347011
cat("To determine which variables are contributing the most to each factor, we can examine the loadings of the variables on the respective factors. The variables with higher absolute loadings indicate a stronger contribution to that particular factor.\n\n")
## To determine which variables are contributing the most to each factor, we can examine the loadings of the variables on the respective factors. The variables with higher absolute loadings indicate a stronger contribution to that particular factor.
cat("Based on the loadings obtained from the factor analysis, we can identify the variables that are contributing the most to each factor:\n\n")
## Based on the loadings obtained from the factor analysis, we can identify the variables that are contributing the most to each factor:
cat("**Factor 1:** The variables contributing the most to Factor 1 are:\n\n")
## **Factor 1:** The variables contributing the most to Factor 1 are:
cat("- citric.acid with a loading of 0.987\n")
## - citric.acid with a loading of 0.987
cat("- volatile.acidity with a loading of 0.513\n\n")
## - volatile.acidity with a loading of 0.513
cat("These variables have higher loadings on Factor 1, indicating a stronger association and contribution to this factor.\n\n")
## These variables have higher loadings on Factor 1, indicating a stronger association and contribution to this factor.
cat("**Factor 2:** The variables contributing the most to Factor 2 are:\n\n")
## **Factor 2:** The variables contributing the most to Factor 2 are:
cat("- density with a loading of 0.853\n")
## - density with a loading of 0.853
cat("- alcohol with a loading of 0.614\n\n")
## - alcohol with a loading of 0.614
cat("These variables have higher loadings on Factor 2, indicating a stronger association and contribution to this factor.\n\n")
## These variables have higher loadings on Factor 2, indicating a stronger association and contribution to this factor.
cat("The loadings represent the correlations between the variables and the factors. Variables with higher absolute loadings on a factor are more strongly associated with that factor and contribute more to its interpretation.\n")
## The loadings represent the correlations between the variables and the factors. Variables with higher absolute loadings on a factor are more strongly associated with that factor and contribute more to its interpretation.
cat("The proportion of variance explained by each factor can be seen in the Proportion Var table. According to this table, Factor 1 explains 27.9% of the variance, while Factor 2 explains 24.5% of the variance.\n\n")
## The proportion of variance explained by each factor can be seen in the Proportion Var table. According to this table, Factor 1 explains 27.9% of the variance, while Factor 2 explains 24.5% of the variance.
cat("The Kaiser's rule states that factors with SS loadings greater than 1 should be retained. In the SS loadings table, it can be seen that Factor 1 has a value of 1.675 and Factor 2 has a value of 1.471. Therefore, according to this rule, both factors should be retained.\n")
## The Kaiser's rule states that factors with SS loadings greater than 1 should be retained. In the SS loadings table, it can be seen that Factor 1 has a value of 1.675 and Factor 2 has a value of 1.471. Therefore, according to this rule, both factors should be retained.
##6. Printing the residual matrix and interpreting the results. Which variables are better represented in the factors according to the values in the matrix?
# Loadings Matrix
Lambda <- df.fa$loadings
# Uniqueness Matrix
Psi <- diag(df.fa$uniqueness)
# Observed Correlation Matrix
S <- df.fa$correlation
# Adjusted Correlation Matrix
Sigma <- Lambda %*% t(Lambda) + Psi
# Residual Matrix
round(S - Sigma, 6)
## residual.sugar density pH alcohol citric.acid
## residual.sugar -0.000001 0.001064 0.043057 0.213892 -0.000384
## density 0.001064 -0.000002 0.000806 -0.000210 0.000003
## pH 0.043057 0.000806 0.000001 0.175480 -0.000601
## alcohol 0.213892 -0.000210 0.175480 -0.000001 0.000593
## citric.acid -0.000384 0.000003 -0.000601 0.000593 -0.000004
## volatile.acidity 0.003198 -0.000049 -0.028017 -0.001737 -0.000095
## volatile.acidity
## residual.sugar 0.003198
## density -0.000049
## pH -0.028017
## alcohol -0.001737
## citric.acid -0.000095
## volatile.acidity -0.000002
print("Values closer to zero indicate that the variables are well represented by the factors, while higher values indicate greater error in the representation.")
## [1] "Values closer to zero indicate that the variables are well represented by the factors, while higher values indicate greater error in the representation."
print("In this case, it can be observed that the variables residual.sugar, density, citric.acid, and volatile.acidity have residuals values close to zero, indicating that they are well represented by the factors.")
## [1] "In this case, it can be observed that the variables residual.sugar, density, citric.acid, and volatile.acidity have residuals values close to zero, indicating that they are well represented by the factors."
print("On the other hand, the variables pH and alcohol have higher residual values, suggesting that the representation through the factors is not as accurate for these variables.")
## [1] "On the other hand, the variables pH and alcohol have higher residual values, suggesting that the representation through the factors is not as accurate for these variables."
# Creating three factorial models with different rotations
df.fa.none<-factanal(df, factors=2, rotation="none")
df.fa.varimax<-factanal(df, factors=2, rotation="varimax")
df.fa.promax<-factanal(df, factors=2, rotation="promax")
# Defining the graphical output (3 plots in 1 row)
par(mfrow = c(1,3))
# First graph: Without rotation
plot(df.fa.none$loadings[,1],
df.fa.none$loadings[,2],
xlab = "Factor 1",
ylab = "Factor 2",
ylim = c(-1,1),
xlim = c(-1,1),
main = "No rotation")
abline(h = 0, v = 0)
# Color Text red for the first graph
text(df.fa.none$loadings[,1]-0.08,
df.fa.none$loadings[,2]+0.08,
colnames(df),
col="red")
abline(h = 0, v = 0)
# Secong graph, rotation= varimax
plot(df.fa.varimax$loadings[,1],
df.fa.varimax$loadings[,2],
xlab = "Factor 1",
ylab = "Factor 2",
ylim = c(-1,1),
xlim = c(-1,1),
main = "Varimax rotation")
# Blue text color for the second graph
text(df.fa.varimax$loadings[,1]-0.08,
df.fa.varimax$loadings[,2]+0.08,
colnames(df),
col="blue")
abline(h = 0, v = 0)
# Third graph, rotation=promax
plot(df.fa.promax$loadings[,1],
df.fa.promax$loadings[,2],
xlab = "Factor 1",
ylab = "Factor 2",
ylim = c(-1,1),
xlim = c(-1,1),
main = "Promax rotation")
# Black text color or the third graph
text(df.fa.promax$loadings[,1]-0.08,
df.fa.promax$loadings[,2]+0.08,
colnames(df),
col="black")
abline(h = 0, v = 0)
cat("- **In factorial analysis**, if two variables have large loadings (high correlations) for the same factor, it can be inferred that those variables share some underlying characteristic or property that is being captured by that factor.\n")
## - **In factorial analysis**, if two variables have large loadings (high correlations) for the same factor, it can be inferred that those variables share some underlying characteristic or property that is being captured by that factor.
cat("- In this case, **Factor 1** represents the common characteristics of **citric.acid** found in certain types of red wines, such as those with high citric acid content.\n")
## - In this case, **Factor 1** represents the common characteristics of **citric.acid** found in certain types of red wines, such as those with high citric acid content.
cat("- On the other hand, **Factor 2** is associated with the characteristics of **density** (*density*), which could be determined by the amount of sugar and alcohol present in other types of red wines, such as those with high sugar and alcohol content.\n\n")
## - On the other hand, **Factor 2** is associated with the characteristics of **density** (*density*), which could be determined by the amount of sugar and alcohol present in other types of red wines, such as those with high sugar and alcohol content.
cat("We could name these factors as \"dry red wines\" (*Factor 1*) and \"sweet red wines\" (*Factor 2*).")
## We could name these factors as "dry red wines" (*Factor 1*) and "sweet red wines" (*Factor 2*).
For this exercise, we will focus on a dataset that represents the quality of various types of Portuguese red wine. The quality values range from 3 to 8. Based on 11 different attributes that characterize each type of wine, we need to be able to classify the quality that each wine will have.
You can find the dataset and the data dictionary in the data folder.
So, the first thing we will do is load the dataset in R:
# Loading Necessary Pacakges
require(MASS)
require(caret)
require(randomForest)
require(e1071)
require(dplyr)
# Loading of the dataset
df <- read.csv("/Users/arnauandrews/Desktop/UB-Master/M4 - Técnicas Avanzadas de Minería de Datos/Data/4.3_AD_ejercicio.csv", header = TRUE, sep = ";")
Dataset with 11 features and 1599 rows.
If it is in the values 5 or 6, it will be categorized as “acceptable”.
And if it is in the values 7 or 8, it will be categorized as “good”. Then, transform the ‘quality’ variable into a factor.
# Modifying the quality variable
df$quality<-as.numeric(df$quality)
df$quality<- cut(df$quality, breaks = c(3,4,6,8), labels = c("Low","Acceptable","Good"))
# Transforming the variable to factor
df$quality<-as.factor(df$quality)
# Normalizing the variables of the dataset in the range of 0-1
maxs <- apply(df[, 1:11], 2, max)
mins <- apply(df[, 1:11], 2, min)
df_norm <- as.data.frame(scale(df[, 1:11], center = mins, scale = maxs - mins))
# Creating new dataset with the normalized variables and the target label
df_norm <- cbind(df_norm, quality = df$quality)
# C(70% training)
set.seed(12345)
index <- sample( 1:nrow( df_norm ), round( nrow( df_norm )*0.7 ), replace = FALSE )
X_train <- df_norm[ index, ]
# (30% testing)
test <- df_norm[ -index, ]
# Creating the LDA model object named 'model'
model <- lda(quality ~ ., data = X_train )
# Plotting the two new dimensions created by the LDA model
projected_data <- as.matrix( X_train[, 1:11] ) %*% model$scaling
plot( projected_data, col = X_train[,12], pch = 19)
print("The LDA model has adequately segmented the observations based on the predicted class. In the plot, we can see that the three categories of the predicted label are clearly separated and distinguishable by different colors. This indicates that the LDA model has successfully captured the underlying structure and patterns in the data related to the class variable.")
## [1] "The LDA model has adequately segmented the observations based on the predicted class. In the plot, we can see that the three categories of the predicted label are clearly separated and distinguishable by different colors. This indicates that the LDA model has successfully captured the underlying structure and patterns in the data related to the class variable."
The LDA model has adequately segmented the observations based on the predicted class. In the plot, we can see that the three categories of the predicted label are clearly separated and distinguishable by different colors.
This indicates that the LDA model has successfully captured the underlying structure and patterns in the data related to the class variable.
# Calculating predictions of the model on the testing subset
X_test <- test[, !( names( test ) %in% c( "quality" ) ) ]
model.results <- predict( model, X_test )
# Confusion matrix
t = table( model.results$class, test$quality )
print(confusionMatrix(t))
## Confusion Matrix and Statistics
##
##
## Low Acceptable Good
## Low 0 8 0
## Acceptable 7 369 38
## Good 1 20 32
##
## Overall Statistics
##
## Accuracy : 0.8442
## 95% CI : (0.8084, 0.8756)
## No Information Rate : 0.8358
## P-Value [Acc > NIR] : 0.33648
##
## Kappa : 0.3886
##
## Mcnemar's Test P-Value : 0.08382
##
## Statistics by Class:
##
## Class: Low Class: Acceptable Class: Good
## Sensitivity 0.00000 0.9295 0.45714
## Specificity 0.98287 0.4231 0.94815
## Pos Pred Value 0.00000 0.8913 0.60377
## Neg Pred Value 0.98287 0.5410 0.90995
## Prevalence 0.01684 0.8358 0.14737
## Detection Rate 0.00000 0.7768 0.06737
## Detection Prevalence 0.01684 0.8716 0.11158
## Balanced Accuracy 0.49143 0.6763 0.70265
According to the confusion matrix, the LDA model correctly classifies the majority of observations (accuracy = 0.8442).
However, it struggles to predict the minority class of “Poor” wines (with 0 true positives).
This suggests that the LDA model is not accurately predicting observations belonging to the minority class.
After applying LDA as a classifier, we will now use it as a dimensionality reducer to subsequently apply a different classifier.
# Creating the new training dataset
new_X_train <- as.matrix(X_train[, 1:11]) %*% model$scaling
new_X_train <- as.data.frame(new_X_train)
new_X_train$quality <- X_train$quality
head(new_X_train)
## LD1 LD2 quality
## 142 -0.3868070 3.721666 Acceptable
## 51 -0.2169637 3.633294 Acceptable
## 720 -0.8986406 3.733179 Acceptable
## 730 0.7618637 5.318799 Acceptable
## 1244 -0.6500167 2.561505 Acceptable
## 664 1.7005916 1.834519 Acceptable
# Creating the new testing dataset
new_X_test <- as.matrix(X_test[, 1:11]) %*% model$scaling
new_X_test <- as.data.frame(new_X_test)
new_X_test$quality <- X_test$quality
head(new_X_test)
## LD1 LD2
## 3 -0.6148328 3.502202
## 5 -1.1460086 3.764633
## 8 -0.4633460 4.343885
## 11 -1.0806291 2.521216
## 16 -0.7502555 1.840508
## 17 0.8948575 1.231007
# Training the model with Random Forest
new_X_train<-na.omit(new_X_train)
modfit <- randomForest(quality ~. , data=new_X_train)
# Predictions with Random Forest
predictions<- predict(modfit, as.data.frame(new_X_test), type = "class")
# Confusion matrix
t = table(predictions, test$quality)
print(confusionMatrix(t))
## Confusion Matrix and Statistics
##
##
## predictions Low Acceptable Good
## Low 0 1 0
## Acceptable 8 374 39
## Good 0 22 31
##
## Overall Statistics
##
## Accuracy : 0.8526
## 95% CI : (0.8175, 0.8833)
## No Information Rate : 0.8358
## P-Value [Acc > NIR] : 0.1769
##
## Kappa : 0.3929
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: Low Class: Acceptable Class: Good
## Sensitivity 0.000000 0.9421 0.44286
## Specificity 0.997859 0.3974 0.94568
## Pos Pred Value 0.000000 0.8884 0.58491
## Neg Pred Value 0.983122 0.5741 0.90758
## Prevalence 0.016842 0.8358 0.14737
## Detection Rate 0.000000 0.7874 0.06526
## Detection Prevalence 0.002105 0.8863 0.11158
## Balanced Accuracy 0.498929 0.6698 0.69427
The Random Forest model has slightly better accuracy than the previous LDA model, with an accuracy of 0.8526, indicating higher accuracy.
Additionally, this model has better sensitivity for the minority class “Pobre” compared to the previous LDA model. However, the sensitivity for the “Bueno” class remains low in the Random Forest model, indicating that there is still room for improvement in the classification of different wine categories. Overall, the Random Forest model is a better choice for accuracy in classifying minority observations compared to the LDA model.
# Random forest
X_train<-na.omit(X_train)
modfit_1 <- randomForest(quality ~. , data=X_train)
# Predictions
predictions_1<- predict(modfit_1, as.data.frame(X_test), type = "class")
# Confusion Matrix
t = table(predictions_1, test$quality)
print(confusionMatrix(t))
## Confusion Matrix and Statistics
##
##
## predictions_1 Low Acceptable Good
## Low 0 0 0
## Acceptable 8 382 36
## Good 0 15 34
##
## Overall Statistics
##
## Accuracy : 0.8758
## 95% CI : (0.8427, 0.9041)
## No Information Rate : 0.8358
## P-Value [Acc > NIR] : 0.009179
##
## Kappa : 0.472
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: Low Class: Acceptable Class: Good
## Sensitivity 0.00000 0.9622 0.48571
## Specificity 1.00000 0.4359 0.96296
## Pos Pred Value NaN 0.8967 0.69388
## Neg Pred Value 0.98316 0.6939 0.91549
## Prevalence 0.01684 0.8358 0.14737
## Detection Rate 0.00000 0.8042 0.07158
## Detection Prevalence 0.00000 0.8968 0.10316
## Balanced Accuracy 0.50000 0.6991 0.72434
The trained Random Forest model on the original dataset improves in terms of accuracy compared to previous models, with a total precision of 0.8779.
However, the model has a very low sensitivity in the “Low” class compared to the other models, which means it has better precision for the minority class.
print("Overall, the global accuracy of the three models (84%-86%) is slightly higher than the No Information Rate (NIR) (83-84%), suggesting that the models predict slightly better than what would be expected by randomly choosing the most frequent class in the dataset. This indicates that the classification of the models is slightly better.")
## [1] "Overall, the global accuracy of the three models (84%-86%) is slightly higher than the No Information Rate (NIR) (83-84%), suggesting that the models predict slightly better than what would be expected by randomly choosing the most frequent class in the dataset. This indicates that the classification of the models is slightly better."
print("However, I would choose the LDA model due to its higher sensitivity to the minority class, 'Pobre', even though it may not have the highest accuracy (although it is similar to the other models).")
## [1] "However, I would choose the LDA model due to its higher sensitivity to the minority class, 'Pobre', even though it may not have the highest accuracy (although it is similar to the other models)."
print("Note 1: For each model, I would add cross-validation to ensure that there is no overlap of values between the training and testing sets that could result in incorrect predictions.")
## [1] "Note 1: For each model, I would add cross-validation to ensure that there is no overlap of values between the training and testing sets that could result in incorrect predictions."
print("Note 2: To address the data imbalance in the model, I would incorporate a data imputation method such as SMOTE to ensure that the model gives equal importance to each of the three classes and does not favor 'Acceptable' and 'Good' over 'Low'.")
## [1] "Note 2: To address the data imbalance in the model, I would incorporate a data imputation method such as SMOTE to ensure that the model gives equal importance to each of the three classes and does not favor 'Acceptable' and 'Good' over 'Low'."