Dimensional reduction

INTRODUCTION

In this code, both dimensional reduction and clustering are used as unsupervised learning techniques to analyze a dataset containing information about the sales and profits of various products in a supermarket. The dataset used in this task is the same as in task_1.

WHY DIMENSIONAL REDUCTION?

Dimensionality reduction techniques are commonly used in machine learning to reduce the number of features in a dataset. Two of the main reasons are improved performance and a lower risk of overfitting. Improved performance: high-dimensional datasets can be expensive to work with in terms of both memory and computation time, and reducing the number of dimensions lowers that cost and makes the data easier to analyze. Overfitting: dimensionality reduction can help to reduce the risk of overfitting by removing irrelevant features.

library(dplyr)

## Warning: package 'dplyr' was built under R version 4.2.2

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(FactoMineR)

## Warning: package 'FactoMineR' was built under R version 4.2.2

library(cluster)

## Warning: package 'cluster' was built under R version 4.2.2

library(fastDummies)

## Warning: package 'fastDummies' was built under R version 4.2.2

This step imports the libraries needed for the analysis. dplyr is used for data manipulation, FactoMineR for PCA, and fastDummies for encoding categorical variables. The cluster package is also loaded, although the k-means clustering below uses the kmeans() function from the stats package.

# Load the data
grocery_data <- read.csv("C:\\Users\\mugil\\Desktop\\unsupervised  learning\\SampleSuperstore.csv")

Load the data from the CSV file “SampleSuperstore.csv” into a data frame called grocery_data.

# Select the relevant columns for analysis
cols <- c("Ship.Mode", "Segment", "Region", "Category", "Sub.Category", "Sales", "Quantity", "Discount", "Profit")
grocery_data_subset <- select(grocery_data, all_of(cols))

This line creates a vector called cols containing the names of the columns to be selected. Then it selects only those columns from the grocery_data data frame and creates a new data frame called grocery_data_subset.

# Encode categorical variables as dummy variables
grocery_data_subset_encoded <- dummy_cols(grocery_data_subset, select_columns = c("Ship.Mode", "Segment", "Region", "Category", "Sub.Category"))

The categorical variables in the data are encoded as dummy variables using the dummy_cols() function from the fastDummies library. The columns to encode are specified in the select_columns argument.
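To make the encoding concrete, here is a minimal sketch on a made-up three-row data frame. The name toy_data and its values are hypothetical and are not taken from SampleSuperstore.csv. The indicator columns are built by hand in base R so the result is easy to inspect; it is meant only to illustrate the kind of 0/1 columns that dummy_cols() adds, not to reproduce its implementation or exact column naming.

```r
# Hypothetical toy data; values are illustrative only
toy_data <- data.frame(
  Region = c("East", "West", "East"),
  Sales  = c(10.5, 20.0, 7.25),
  stringsAsFactors = FALSE
)

# One 0/1 indicator column per distinct Region value,
# appended after the original columns
region_levels <- sort(unique(toy_data$Region))
for (lev in region_levels) {
  toy_data[[paste0("Region_", lev)]] <- as.integer(toy_data$Region == lev)
}

print(toy_data)
```

Each row has a 1 in exactly one of the new Region columns, and the original Region column is still present alongside them.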

# Scale the numeric variables
grocery_data_subset_scaled <- scale(grocery_data_subset_encoded[,6:ncol(grocery_data_subset_encoded)])

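The scale() function standardizes each selected column to mean 0 and standard deviation 1. Columns 6 to the end of the encoded data are selected: these are the four numeric variables (Sales, Quantity, Discount, Profit) followed by the dummy columns that dummy_cols() appends. The first five columns, which hold the original categorical values, are left out.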
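The effect of scale() can be checked on a small example. The sketch below uses a hypothetical matrix, toy_matrix, with two columns on very different scales; the values are made up for illustration.

```r
# Hypothetical toy matrix with two columns on very different scales
toy_matrix <- cbind(
  Sales    = c(100, 200, 300, 400),
  Discount = c(0.0, 0.1, 0.2, 0.3)
)

# scale() centers and scales each column by default
toy_scaled <- scale(toy_matrix)

# After scaling, each column has mean 0 and standard deviation 1
print(round(colMeans(toy_scaled), 10))
print(apply(toy_scaled, 2, sd))
```

Without this step, a variable measured in large units such as Sales would dominate the principal components simply because of its scale.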

# Perform PCA
pca_result <- PCA(grocery_data_subset_scaled, graph = FALSE)

PCA is performed on the scaled data using the PCA() function from the FactoMineR library. The graph argument is set to FALSE to prevent the function from plotting the result.
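A useful check after PCA is how much of the total variance the first few components retain. The sketch below is a stand-in only: it uses base R's prcomp() on the built-in USArrests data rather than FactoMineR's PCA() on the supermarket data, so the names toy_pca, eigenvalues and variance_share are assumptions made for this illustration. FactoMineR reports the corresponding table in pca_result$eig.

```r
# Stand-in example: base R prcomp() on the built-in USArrests data,
# not FactoMineR::PCA() and not the supermarket data
toy_pca <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# Eigenvalues are the squared standard deviations of the components
eigenvalues <- toy_pca$sdev^2
variance_share <- eigenvalues / sum(eigenvalues)

# Cumulative share of variance retained by the first k components
print(round(cumsum(variance_share), 3))
```

If the cumulative share for the first two components is low, a two-dimensional plot of the scores shows only a small part of the structure in the data.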

# Extract the loadings and contributions of the variables to each component
loadings <- pca_result$var$coord[, 1:2]
contributions <- pca_result$var$contrib[, 1:2]

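The loadings and contributions of each variable to each principal component are extracted from the PCA result. In FactoMineR, var$coord holds the coordinates of the variables on the components (referred to here as loadings) and var$contrib holds each variable's contribution to a component as a percentage. Only the first two principal components are selected.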
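The relationship between variable coordinates and contributions can be worked through by hand. This is a sketch under stated assumptions: it uses base R's prcomp() on the built-in USArrests data as a stand-in, and it assumes the usual definition in which a variable's contribution to a component is its squared coordinate divided by the sum of squared coordinates on that component, expressed in percent. Whether FactoMineR matches this exactly depends on its weighting options, so treat the correspondence as an assumption.

```r
# Stand-in example on built-in data, not the supermarket dataset
toy_pca <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# Variable coordinates: eigenvectors scaled by the component standard deviations
var_coord <- sweep(toy_pca$rotation, 2, toy_pca$sdev, `*`)

# Contribution (in percent) of each variable to each component
var_contrib <- 100 * sweep(var_coord^2, 2, colSums(var_coord^2), `/`)

print(round(var_contrib[, 1:2], 1))
```

The contributions in each column sum to 100, so a variable with a contribution well above the average of 100 divided by the number of variables is one that shapes that component strongly.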

# Plot the loadings and contributions
plot(loadings, type = "n", xlab = "PC1", ylab = "PC2")
text(loadings, labels = colnames(grocery_data_subset_encoded)[6:ncol(grocery_data_subset_encoded)], cex = 0.8)

plot(contributions, type = "n", xlab = "PC1", ylab = "PC2")
text(contributions, labels = colnames(grocery_data_subset_encoded)[6:ncol(grocery_data_subset_encoded)], cex = 0.8)

The loadings and contributions are plotted on separate graphs using the plot() function. The text() function is used to add labels to the plot.

# Extract the scores for each observation
scores <- as.data.frame(pca_result$ind$coord[, 1:2])

The scores for each observation (row) in the PCA result are extracted. Only the first two principal components are selected.

scores$Category <- grocery_data_subset$Category

This line adds a new column to the scores data frame, which corresponds to the category of each observation in the original data set. The grocery_data_subset$Category variable is used to populate this new column.

# Plot the scores for each observation by Category

This code creates a scatter plot of the first two principal components (Dim.1 and Dim.2) from the scores data frame. The color of each point is determined by its category. Because Category is stored as text, it is first converted to a factor with as.factor() and then to integer codes with as.numeric(); calling as.numeric() directly on the text values returns NA for every point. The pch argument sets the shape of the points to a filled circle, and the xlab and ylab arguments set the labels for the x and y axes. The legend() call lists the categories with the same color codes.

category_factor <- as.factor(scores$Category)
plot(scores$Dim.1, scores$Dim.2, col = as.numeric(category_factor), pch = 19, xlab = "PC1", ylab = "PC2")

legend("topleft", legend = levels(category_factor), col = seq_along(levels(category_factor)), pch = 19)
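Since R 4.0.0, read.csv() keeps text columns as character vectors by default, which matters when a category is used as a plot color. The sketch below uses a hypothetical vector, toy_category, to show the difference between converting text directly to numbers and converting it through a factor.

```r
# Hypothetical category labels stored as character strings
toy_category <- c("Furniture", "Technology", "Furniture", "Office Supplies")

# Direct conversion fails: none of the strings is a number, so every value is NA
direct <- suppressWarnings(as.numeric(toy_category))

# Converting through a factor gives one integer code per level
category_factor <- as.factor(toy_category)
codes <- as.numeric(category_factor)

print(direct)
print(levels(category_factor))
print(codes)
```

The integer codes follow the order of levels(category_factor), so using the levels for the legend labels and their positions for the legend colors keeps the plot and the legend consistent.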

CLUSTERING



# Select the relevant columns for analysis
cols <- c("Ship.Mode", "Segment", "Region", "Category", "Sub.Category", "Sales", "Quantity", "Discount", "Profit")
grocery_data_subset <- select(grocery_data, all_of(cols))

This code selects a subset of columns from the grocery_data data frame and stores them in a new data frame called grocery_data_subset.

# Encode categorical variables as dummy variables
grocery_data_subset_encoded <- dummy_cols(grocery_data_subset, select_columns = c("Ship.Mode", "Segment", "Region", "Category", "Sub.Category"))

This line encodes categorical variables as dummy variables in the grocery_data_subset data frame using the dummy_cols() function from the fastDummies package. The select_columns parameter specifies the columns to be encoded.

# Scale the numeric variables
grocery_data_subset_scaled <- scale(grocery_data_subset_encoded[,6:ncol(grocery_data_subset_encoded)])

This line scales the numeric variables in the grocery_data_subset_encoded data frame using the scale() function. The [,6:ncol(grocery_data_subset_encoded)] specifies the columns to be scaled, which are columns 6 to the last column.

# Perform PCA
pca_result <- PCA(grocery_data_subset_scaled, graph = FALSE)

This line performs principal component analysis (PCA) on the scaled data in grocery_data_subset_scaled using the PCA() function from the FactoMineR package.

# Extract the scores for each observation
scores <- as.data.frame(pca_result$ind$coord[, 1:2])

This line extracts the scores of each observation on the first two principal components of the PCA.

# Perform k-means clustering on the scores
k <- 3 # number of clusters
kmeans_result <- kmeans(scores, centers = k)

This code performs k-means clustering on scores using the kmeans() function from the stats package. The number of clusters is set to k (which is 3 in this case), and the resulting k-means model is stored in kmeans_result.
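Because kmeans() starts from randomly chosen centers, cluster labels and sizes can change from one run to the next. The sketch below shows one way to make a run repeatable, using set.seed() and the nstart argument. The seed value, the nstart value and the simulated toy_scores data are all assumptions made for this illustration and are not part of the original analysis.

```r
set.seed(42)  # arbitrary seed, chosen only to make this sketch repeatable

# Hypothetical 2-D points drawn around three well-separated centers
toy_scores <- data.frame(
  Dim.1 = c(rnorm(30, mean = -5), rnorm(30, mean = 0), rnorm(30, mean = 5)),
  Dim.2 = c(rnorm(30, mean = -5), rnorm(30, mean = 0), rnorm(30, mean = 5))
)

# nstart = 10 reruns the algorithm from 10 random starts and keeps the best fit
toy_kmeans <- kmeans(toy_scores, centers = 3, nstart = 10)

print(table(toy_kmeans$cluster))
print(toy_kmeans$centers)
```

The centers matrix keeps the column names of the input data (here Dim.1 and Dim.2), which is worth remembering when the centers are added to a plot.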

# Add the cluster assignments to the scores dataframe
scores$cluster <- kmeans_result$cluster

This code adds a new column to scores that contains the cluster assignment for each observation.

# Plot the scores with different colors for different clusters

This code plots the first two principal components of pca_result (which are stored in scores), with different colors representing different clusters.

plot(scores$Dim.1, scores$Dim.2, col = scores$cluster, pch = 19, xlab = "PC1", ylab = "PC2")


# Get the cluster centers (in PC space) and plot them as red crosses
cluster_centers <- as.data.frame(kmeans_result$centers[, 1:2])
# The centers keep the column names of scores (Dim.1 and Dim.2)
points(cluster_centers$Dim.1, cluster_centers$Dim.2, col = "red", pch = 4, cex = 2)

The cluster centers (in PC space) are obtained from the $centers element of the kmeans_result object. They are then plotted as red crosses using the points() function.

table(scores$cluster)

## 
##    1    2    3 
## 6015 1857 2122

The number of observations in each cluster is printed using the table() function applied to the cluster column of the scores data frame.

SUMMARY

The code first loads the required libraries: dplyr for data manipulation, FactoMineR for PCA, fastDummies for encoding categorical variables, and cluster (the k-means step itself uses kmeans() from the stats package). It then loads the data from a CSV file and selects the relevant columns for analysis. The categorical variables are encoded as dummy variables, and the numeric and dummy columns are scaled. PCA is then performed on the scaled data, and the loadings and contributions of each variable to the first two principal components are extracted and plotted. The scores for each observation are then extracted, and a scatter plot of the scores on the first two principal components is created, with each point colored by category. Finally, k-means clustering with three clusters is performed on these scores, and the resulting clusters are plotted on a scatter plot, colored by cluster.

CONCLUSION

The code performs principal component analysis (PCA) and k-means clustering on a dataset of supermarket sales data. PCA is used to reduce the dimensionality of the data and to identify patterns in it, and k-means clustering is used to group similar observations together based on their scores on the first two principal components.

mugil

2023-03-17