Introduction

In this analysis, we explore dimensionality reduction, a fundamental technique in exploratory data analysis and machine learning. Methods such as Principal Component Analysis (PCA) extract meaningful low-dimensional structure from high-dimensional datasets, capturing the most important sources of variation while minimizing information loss. Through this process, we aim to uncover underlying patterns, reduce computational complexity, and facilitate visualization and interpretation of complex datasets.
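
As a quick illustration of the idea (a minimal sketch on synthetic data, separate from the automotive analysis below), PCA amounts to an eigendecomposition of the covariance matrix of the data; R's prcomp() recovers the same directions:

set.seed(42)
X <- matrix(rnorm(200), ncol = 2)  # 100 synthetic observations, 2 variables
X[, 2] <- X[, 1] + 0.3 * X[, 2]    # induce correlation between the columns
eig <- eigen(cov(X))               # eigenvectors of the covariance = principal axes
pca <- prcomp(X)                   # prcomp centres the data internally
round(abs(eig$vectors) - abs(pca$rotation), 10)  # loadings agree up to sign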

Data Overview

To illustrate dimensionality reduction techniques, we use an automotive dataset containing attributes such as brand, price, year of registration, engine power (powerPS), and odometer reading. Our goal is a comprehensive analysis, covering data preprocessing, outlier detection, assumption testing, and PCA, to understand the dataset's structure and extract meaningful insights.

Loading Data and Initial Exploration:

We begin by loading the necessary libraries and reading the CSV file containing the automotive data into a dataframe. We then inspect the first rows, the object class, the column names, and the structure of the dataset to get an initial feel for the data.

library(readr)
auto_list <- read.csv("A CORDONNER/autos.csv", header = TRUE)

head(auto_list)
##           dateCrawled
## 1 2016-03-26 17:47:46
## 2 2016-04-04 13:38:56
## 3 2016-03-26 18:57:24
## 4 2016-03-12 16:58:10
## 5 2016-04-01 14:38:50
## 6 2016-03-21 13:47:45
##                                                                   name seller
## 1                                     Peugeot_807_160_NAVTECH_ON_BOARD privat
## 2                           BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik privat
## 3                                           Volkswagen_Golf_1.6_United privat
## 4                  Smart_smart_fortwo_coupe_softouch/F1/Klima/Panorama privat
## 5   Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepflegt.mit_Klimaanlage privat
## 6 Chrysler_Grand_Voyager_2.8_CRD_Aut.Limited_Stow´n_Go_Sitze_7Sitze privat
##   offerType  price  abtest vehicleType yearOfRegistration   gearbox powerPS
## 1   Angebot $5,000 control         bus               2004   manuell     158
## 2   Angebot $8,500 control   limousine               1997 automatik     286
## 3   Angebot $8,990    test   limousine               2009   manuell     102
## 4   Angebot $4,350 control  kleinwagen               2007 automatik      71
## 5   Angebot $1,350    test       kombi               2003   manuell       0
## 6   Angebot $7,900    test         bus               2006 automatik     150
##     model  odometer monthOfRegistration fuelType      brand notRepairedDamage
## 1  andere 150,000km                   3      lpg    peugeot              nein
## 2     7er 150,000km                   6   benzin        bmw              nein
## 3    golf  70,000km                   7   benzin volkswagen              nein
## 4  fortwo  70,000km                   6   benzin      smart              nein
## 5   focus 150,000km                   7   benzin       ford              nein
## 6 voyager 150,000km                   4   diesel   chrysler                  
##           dateCreated nrOfPictures postalCode            lastSeen
## 1 2016-03-26 00:00:00            0      79588 2016-04-06 06:45:54
## 2 2016-04-04 00:00:00            0      71034 2016-04-06 14:45:08
## 3 2016-03-26 00:00:00            0      35394 2016-04-06 20:15:37
## 4 2016-03-12 00:00:00            0      33729 2016-03-15 03:16:28
## 5 2016-04-01 00:00:00            0      39218 2016-04-01 14:38:50
## 6 2016-03-21 00:00:00            0      22962 2016-04-06 09:45:21
class(auto_list)
## [1] "data.frame"
colnames(auto_list)
##  [1] "dateCrawled"         "name"                "seller"             
##  [4] "offerType"           "price"               "abtest"             
##  [7] "vehicleType"         "yearOfRegistration"  "gearbox"            
## [10] "powerPS"             "model"               "odometer"           
## [13] "monthOfRegistration" "fuelType"            "brand"              
## [16] "notRepairedDamage"   "dateCreated"         "nrOfPictures"       
## [19] "postalCode"          "lastSeen"
str(auto_list)
## 'data.frame':    50000 obs. of  20 variables:
##  $ dateCrawled        : chr  "2016-03-26 17:47:46" "2016-04-04 13:38:56" "2016-03-26 18:57:24" "2016-03-12 16:58:10" ...
##  $ name               : chr  "Peugeot_807_160_NAVTECH_ON_BOARD" "BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik" "Volkswagen_Golf_1.6_United" "Smart_smart_fortwo_coupe_softouch/F1/Klima/Panorama" ...
##  $ seller             : chr  "privat" "privat" "privat" "privat" ...
##  $ offerType          : chr  "Angebot" "Angebot" "Angebot" "Angebot" ...
##  $ price              : chr  "$5,000" "$8,500" "$8,990" "$4,350" ...
##  $ abtest             : chr  "control" "control" "test" "control" ...
##  $ vehicleType        : chr  "bus" "limousine" "limousine" "kleinwagen" ...
##  $ yearOfRegistration : int  2004 1997 2009 2007 2003 2006 1995 1998 2000 1997 ...
##  $ gearbox            : chr  "manuell" "automatik" "manuell" "automatik" ...
##  $ powerPS            : int  158 286 102 71 0 150 90 90 0 90 ...
##  $ model              : chr  "andere" "7er" "golf" "fortwo" ...
##  $ odometer           : chr  "150,000km" "150,000km" "70,000km" "70,000km" ...
##  $ monthOfRegistration: int  3 6 7 6 7 4 8 12 10 7 ...
##  $ fuelType           : chr  "lpg" "benzin" "benzin" "benzin" ...
##  $ brand              : chr  "peugeot" "bmw" "volkswagen" "smart" ...
##  $ notRepairedDamage  : chr  "nein" "nein" "nein" "nein" ...
##  $ dateCreated        : chr  "2016-03-26 00:00:00" "2016-04-04 00:00:00" "2016-03-26 00:00:00" "2016-03-12 00:00:00" ...
##  $ nrOfPictures       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ postalCode         : int  79588 71034 35394 33729 39218 22962 31535 53474 7426 15749 ...
##  $ lastSeen           : chr  "2016-04-06 06:45:54" "2016-04-06 14:45:08" "2016-04-06 20:15:37" "2016-03-15 03:16:28" ...

Subsetting Data:

In this step, we select the relevant columns from the dataset: brand, price, yearOfRegistration, powerPS, and odometer. Subsetting the dataframe gives us a smaller working dataset for the rest of the analysis.

cols_to_keep <- c("brand", "price", "yearOfRegistration",  "powerPS", "odometer")


auto_list <- auto_list[, cols_to_keep]


head(auto_list)
##        brand  price yearOfRegistration powerPS  odometer
## 1    peugeot $5,000               2004     158 150,000km
## 2        bmw $8,500               1997     286 150,000km
## 3 volkswagen $8,990               2009     102  70,000km
## 4      smart $4,350               2007      71  70,000km
## 5       ford $1,350               2003       0 150,000km
## 6   chrysler $7,900               2006     150 150,000km
colnames(auto_list)
## [1] "brand"              "price"              "yearOfRegistration"
## [4] "powerPS"            "odometer"

Data Cleaning:

Here, we clean the data by removing non-numeric characters from the price and odometer columns so that they can be treated as numeric variables. The cleaned dataset is displayed to confirm the conversion.

auto_list$price <- as.numeric(gsub("[^0-9.]", "", auto_list$price))        # strip "$" and ","
auto_list$odometer <- as.numeric(gsub("[^0-9.]", "", auto_list$odometer))  # strip "," and the "km" suffix


head(auto_list)
##        brand price yearOfRegistration powerPS odometer
## 1    peugeot  5000               2004     158   150000
## 2        bmw  8500               1997     286   150000
## 3 volkswagen  8990               2009     102    70000
## 4      smart  4350               2007      71    70000
## 5       ford  1350               2003       0   150000
## 6   chrysler  7900               2006     150   150000
str(auto_list)
## 'data.frame':    50000 obs. of  5 variables:
##  $ brand             : chr  "peugeot" "bmw" "volkswagen" "smart" ...
##  $ price             : num  5000 8500 8990 4350 1350 7900 300 1990 250 590 ...
##  $ yearOfRegistration: int  2004 1997 2009 2007 2003 2006 1995 1998 2000 1997 ...
##  $ powerPS           : int  158 286 102 71 0 150 90 90 0 90 ...
##  $ odometer          : num  150000 150000 70000 70000 150000 150000 150000 150000 150000 150000 ...
Missing Values Detection:

We next count the missing values in each column. Quantifying data completeness is essential before the subsequent analyses.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
missing_values <- colSums(is.na(auto_list))

print(missing_values)
##              brand              price yearOfRegistration            powerPS 
##                  0                  0                  0                  0 
##           odometer 
##                  0
Outlier Detection:

Using the custom identify_outliers() function defined below, we detect outliers in the numeric columns by computing z-scores and flagging values whose absolute z-score exceeds 3. The number of outliers per column is then summarized, informing the data quality assessment and the outlier removal that follows.

identify_outliers <- function(x) {
  z_scores <- scale(x)           # standardize: (x - mean(x)) / sd(x)
  outliers <- abs(z_scores) > 3  # flag values more than 3 SDs from the mean
  return(outliers)
}

numeric_cols <- sapply(auto_list, is.numeric)
outliers <- auto_list[, numeric_cols] %>%
  lapply(identify_outliers)
outlier_counts <- sapply(outliers, sum)
print(outlier_counts)
##              price yearOfRegistration            powerPS           odometer 
##                  9                 22                 65                967
Handling Outliers:

Following detection, we remove every row containing at least one outlier to keep the remaining data reliable, and inspect the result to confirm the removal.

rows_with_outliers <- apply(do.call(cbind, outliers), 1, any)
auto_list <- auto_list[!rows_with_outliers, ]

str(auto_list)
## 'data.frame':    48959 obs. of  5 variables:
##  $ brand             : chr  "peugeot" "bmw" "volkswagen" "smart" ...
##  $ price             : num  5000 8500 8990 4350 1350 7900 300 1990 250 590 ...
##  $ yearOfRegistration: int  2004 1997 2009 2007 2003 2006 1995 1998 2000 1997 ...
##  $ powerPS           : int  158 286 102 71 0 150 90 90 0 90 ...
##  $ odometer          : num  150000 150000 70000 70000 150000 150000 150000 150000 150000 150000 ...

Data Standardization:

To prepare the data for principal component analysis (PCA), we standardize the numeric variables by scaling them to have a mean of 0 and a standard deviation of 1. Summary statistics of the standardized data provide insights into the distribution of variables.

numeric_data <- auto_list[sapply(auto_list, is.numeric)]
standardized_data <- scale(numeric_data)
summary(standardized_data)
##      price          yearOfRegistration    powerPS           odometer      
##  Min.   :-0.41402   Min.   :-12.5680   Min.   :-1.6088   Min.   :-3.2279  
##  1st Qu.:-0.33196   1st Qu.: -0.5957   1st Qu.:-0.5964   1st Qu.:-0.0857  
##  Median :-0.20137   Median : -0.0576   Median :-0.1115   Median : 0.5974  
##  Mean   : 0.00000   Mean   :  0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.09977   3rd Qu.:  0.6150   3rd Qu.: 0.5301   3rd Qu.: 0.5974  
##  Max.   :92.35464   Max.   :  2.0947   Max.   : 8.9432   Max.   : 0.5974
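
As a quick sanity check (a small sketch, not part of the original pipeline), the standardization can be reproduced by hand for a single column, here price:

z_price <- (numeric_data$price - mean(numeric_data$price)) / sd(numeric_data$price)
all.equal(z_price, as.numeric(standardized_data[, "price"]))  # TRUE: scale() computes (x - mean) / sd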

Assumption Testing:

Assumption testing assesses whether the data are suitable for PCA. We run the Kaiser-Meyer-Olkin (KMO) test and Bartlett's test of sphericity to evaluate the adequacy of the data for factor analysis; by convention, a KMO statistic above about 0.5 together with a significant Bartlett's test supports proceeding with PCA.

library(psych)

kmo_result <- KMO(numeric_data)
print(paste("KMO Test Statistic:", kmo_result$MSA, "\n"))
## [1] "KMO Test Statistic: 0.571468005265458 \n"
bartlett_result <- cortest.bartlett(cor(numeric_data), n = nrow(numeric_data))
print(paste("Bartlett's Test Statistic:", bartlett_result$chisq, "\n"))
## [1] "Bartlett's Test Statistic: 13430.5465175227 \n"
print(paste("Bartlett's Test p-value:", bartlett_result$p.value, "\n"))
## [1] "Bartlett's Test p-value: 0 \n"

Handling Infinite and Zero Values:

Here, we check the standardized data for infinite and zero values. Counting them shows how prevalent they are and whether they could distort subsequent analyses; any infinite values found would be imputed in the next step.

num_infinite <- sum(!is.finite(standardized_data))  # count non-finite entries (Inf, NaN, NA)
num_zeros <- sum(standardized_data == 0)            # count exact zeros


cat("Number of infinite values:", num_infinite, "\n")
## Number of infinite values: 0
cat("Number of zero values:", num_zeros, "\n")
## Number of zero values: 0

Imputing Infinite Values:

As a safeguard, we identify any columns containing infinite values and impute them with the mean of the finite values in that column. No infinite values were found above, so the loop below leaves the data unchanged, but it keeps the pipeline robust to such values.

cols_with_infinite <- apply(standardized_data, 2, function(x) any(!is.finite(x)))

for (col_index in which(cols_with_infinite)) {
  col_vals <- standardized_data[, col_index]
  col_mean <- mean(col_vals[is.finite(col_vals)])  # mean of the finite entries only
  standardized_data[!is.finite(col_vals), col_index] <- col_mean
}


Principal Component Analysis (PCA):

PCA is performed on the standardized data to reduce dimensionality and uncover underlying patterns. The component summary reports the variance explained by each principal component and the cumulative variance explained.

pca_result <- prcomp(standardized_data, center = TRUE, scale. = TRUE)


summary(pca_result)
## Importance of components:
##                           PC1    PC2    PC3    PC4
## Standard deviation     1.2723 0.9974 0.8862 0.7752
## Proportion of Variance 0.4047 0.2487 0.1963 0.1502
## Cumulative Proportion  0.4047 0.6534 0.8498 1.0000

Visualizing Correlation:

The correlation structure of the standardized data is visualized with a correlation matrix plot. This helps identify patterns of association between variables and informs the subsequent analyses and interpretations.

library(corrplot)
## corrplot 0.92 loaded
## Warning: package 'corrplot' was built under R version 4.3.2
cor_matrix <- cor(standardized_data)
corrplot(cor_matrix, method = "color", type = "lower", order = "hclust", tl.col = "black", tl.cex = 0.5)

Scree Plot:

A scree plot is generated to visualize the variance explained by each principal component. This plot aids in determining the appropriate number of components to retain, guiding dimensionality reduction efforts.

plot(pca_result, type = "l", main = "Scree Plot")

The curve levels off after the first few components: the first two principal components together explain about 65% of the variance and the first three about 85%, suggesting that two or three components capture most of the structure in the data.
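
The same numbers can be computed directly from the PCA object (a brief sketch using the pca_result created above):

var_explained <- pca_result$sdev^2 / sum(pca_result$sdev^2)  # proportion of variance per PC
cumsum(var_explained)                                        # cumulative proportion, matching summary(pca_result)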

Next, the contributions of the variables to each principal component are visualized with a variable contribution plot. This highlights the relative importance of each variable in explaining the variation captured by the components, facilitating interpretation.

library(factoextra)
## Loading required package: ggplot2
## 
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
## 
##     %+%, alpha
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
fviz_contrib(pca_result, "var", axes = 1:4, fill = "tomato3", color = "tomato4")

Beyond the PCA-specific displays, we also examine the variables themselves. We install and load GGally to draw a pairwise scatterplot matrix of the numeric attributes.

install.packages("GGally", repos = "https://cran.r-project.org")
## Installing package into 'C:/Users/Alain/AppData/Local/R/win-library/4.3'
## (as 'lib' is unspecified)
## package 'GGally' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\Alain\AppData\Local\Temp\RtmpqGZw53\downloaded_packages
library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
ggpairs(auto_list[, -1])

install.packages("reshape2", repos = "https://cran.r-project.org")
## Installing package into 'C:/Users/Alain/AppData/Local/R/win-library/4.3'
## (as 'lib' is unspecified)
## package 'reshape2' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\Alain\AppData\Local\Temp\RtmpqGZw53\downloaded_packages
library(reshape2)

This plot provides a pairwise scatterplot matrix of the variables in the dataset. Each cell in the matrix shows the relationship between two variables, allowing us to observe patterns and correlations between different attributes. In the context of dimensionality reduction, this plot can help visualize potential clusters or patterns in lower-dimensional space.

library(ggplot2)
cor_matrix <- cor(standardized_data)
ggplot(data = melt(cor_matrix), aes(Var1, Var2, fill = value)) +
  geom_tile() +
  scale_fill_gradient(low = "blue", high = "red") +
  labs(title = "Correlation Heatmap")

The heatmap visualizes the correlation matrix between the standardized variables. Darker colors indicate stronger correlations, while lighter colors indicate weaker correlations or no correlation. Understanding the interplay between variables is crucial in dimensionality reduction, as it helps identify redundant or highly correlated variables that may be candidates for dimension reduction.
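
To act on that last point (a hedged sketch; the 0.7 cutoff is an assumed convention, not part of the original analysis), the strongly correlated pairs can be listed directly from cor_matrix:

high_cor <- which(abs(cor_matrix) > 0.7 & upper.tri(cor_matrix), arr.ind = TRUE)
data.frame(var1 = rownames(cor_matrix)[high_cor[, 1]],  # first variable in each pair
           var2 = colnames(cor_matrix)[high_cor[, 2]],  # second variable
           r    = cor_matrix[high_cor])                 # their correlation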

library(scatterplot3d)
scatterplot3d(pca_result$x[,1:3], main="3D Scatterplot", pch=16, color="blue")

This plot represents the data in a three-dimensional space using the first three principal components from PCA. It allows visualization of the data points in a reduced-dimensional space, helping to identify clusters or patterns that may not be apparent in higher dimensions. While it’s challenging to visualize high-dimensional data directly, projecting it onto lower-dimensional space using PCA enables easier interpretation and exploration.
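
To make the projection explicit (a short sketch): the PC scores are simply the standardized data multiplied by the loading matrix, which reproduces pca_result$x:

scores_manual <- scale(standardized_data) %*% pca_result$rotation     # project the data onto the loadings
all.equal(unname(scores_manual[, 1:3]), unname(pca_result$x[, 1:3]))  # TRUE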

fviz_pca_biplot(pca_result, geom = "point", col.ind = "blue", col.var = "red")

The PCA biplot combines a scatterplot of the data points with vectors indicating the direction and strength of each variable’s contribution to the principal components. This plot allows us to visualize both the data points and the variables simultaneously, providing insights into how variables influence the principal components and how data points are distributed in the reduced-dimensional space.
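
The arrow coordinates in the biplot come from the loading matrix, which can also be inspected directly:

pca_result$rotation[, 1:2]  # loadings of each variable on PC1 and PC2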

library(GGally)
ggparcoord(auto_list, columns = 2:5, groupColumn = "brand", scale = "globalminmax")

A parallel coordinates plot visualizes multivariate data by drawing each observation as a line across parallel axes, where the position on each axis gives the value of one variable. It is useful for examining the distribution of observations along several dimensions at once, making it well suited to exploring high-dimensional datasets and spotting patterns or clusters.

Conclusion

In conclusion, our journey through dimensionality reduction has provided valuable insights into the structure and relationships within the automotive dataset. By applying PCA, we reduced the dimensionality of the data while preserving its essential characteristics. Outlier detection and assumption testing helped ensure the robustness of the analysis and the reliability of the findings, and visualizations such as correlation plots and biplots aided in interpreting the results and identifying meaningful patterns. Overall, this analysis underscores the importance of dimensionality reduction techniques for extracting meaningful insight from complex, high-dimensional data.