In this analysis, we explore dimensionality reduction, a fundamental technique in exploratory data analysis and machine learning. Methods such as Principal Component Analysis (PCA) extract the most informative structure from high-dimensional datasets while minimizing information loss. Through this process, we aim to uncover underlying patterns, reduce computational complexity, and make complex datasets easier to visualize and interpret.
To illustrate these techniques, we use an automotive dataset containing attributes such as brand, price, year of registration, engine power (powerPS), and odometer reading. Our goal is a comprehensive analysis, covering data preprocessing, outlier detection, assumption testing, and PCA, to understand the dataset's structure and extract meaningful insights.
We begin by loading the necessary libraries and setting the working directory so the dataset can be accessed conveniently. We then read the CSV file containing the automotive data into a dataframe and inspect its first rows, class, column names, and structure.
In this step, we focus on selecting relevant columns from the dataset. We specify the columns to keep, including brand, price, year of registration, powerPS, and odometer. By subsetting the dataframe, we prepare a modified dataset for further analysis.
library(readr)
auto_list <- read.csv("A CORDONNER/autos.csv", header = TRUE)
head(auto_list)
## dateCrawled
## 1 2016-03-26 17:47:46
## 2 2016-04-04 13:38:56
## 3 2016-03-26 18:57:24
## 4 2016-03-12 16:58:10
## 5 2016-04-01 14:38:50
## 6 2016-03-21 13:47:45
## name seller
## 1 Peugeot_807_160_NAVTECH_ON_BOARD privat
## 2 BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik privat
## 3 Volkswagen_Golf_1.6_United privat
## 4 Smart_smart_fortwo_coupe_softouch/F1/Klima/Panorama privat
## 5 Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepflegt.mit_Klimaanlage privat
## 6 Chrysler_Grand_Voyager_2.8_CRD_Aut.Limited_Stow´n_Go_Sitze_7Sitze privat
## offerType price abtest vehicleType yearOfRegistration gearbox powerPS
## 1 Angebot $5,000 control bus 2004 manuell 158
## 2 Angebot $8,500 control limousine 1997 automatik 286
## 3 Angebot $8,990 test limousine 2009 manuell 102
## 4 Angebot $4,350 control kleinwagen 2007 automatik 71
## 5 Angebot $1,350 test kombi 2003 manuell 0
## 6 Angebot $7,900 test bus 2006 automatik 150
## model odometer monthOfRegistration fuelType brand notRepairedDamage
## 1 andere 150,000km 3 lpg peugeot nein
## 2 7er 150,000km 6 benzin bmw nein
## 3 golf 70,000km 7 benzin volkswagen nein
## 4 fortwo 70,000km 6 benzin smart nein
## 5 focus 150,000km 7 benzin ford nein
## 6 voyager 150,000km 4 diesel chrysler
## dateCreated nrOfPictures postalCode lastSeen
## 1 2016-03-26 00:00:00 0 79588 2016-04-06 06:45:54
## 2 2016-04-04 00:00:00 0 71034 2016-04-06 14:45:08
## 3 2016-03-26 00:00:00 0 35394 2016-04-06 20:15:37
## 4 2016-03-12 00:00:00 0 33729 2016-03-15 03:16:28
## 5 2016-04-01 00:00:00 0 39218 2016-04-01 14:38:50
## 6 2016-03-21 00:00:00 0 22962 2016-04-06 09:45:21
class(auto_list)
## [1] "data.frame"
colnames(auto_list)
## [1] "dateCrawled" "name" "seller"
## [4] "offerType" "price" "abtest"
## [7] "vehicleType" "yearOfRegistration" "gearbox"
## [10] "powerPS" "model" "odometer"
## [13] "monthOfRegistration" "fuelType" "brand"
## [16] "notRepairedDamage" "dateCreated" "nrOfPictures"
## [19] "postalCode" "lastSeen"
str(auto_list)
## 'data.frame': 50000 obs. of 20 variables:
## $ dateCrawled : chr "2016-03-26 17:47:46" "2016-04-04 13:38:56" "2016-03-26 18:57:24" "2016-03-12 16:58:10" ...
## $ name : chr "Peugeot_807_160_NAVTECH_ON_BOARD" "BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik" "Volkswagen_Golf_1.6_United" "Smart_smart_fortwo_coupe_softouch/F1/Klima/Panorama" ...
## $ seller : chr "privat" "privat" "privat" "privat" ...
## $ offerType : chr "Angebot" "Angebot" "Angebot" "Angebot" ...
## $ price : chr "$5,000" "$8,500" "$8,990" "$4,350" ...
## $ abtest : chr "control" "control" "test" "control" ...
## $ vehicleType : chr "bus" "limousine" "limousine" "kleinwagen" ...
## $ yearOfRegistration : int 2004 1997 2009 2007 2003 2006 1995 1998 2000 1997 ...
## $ gearbox : chr "manuell" "automatik" "manuell" "automatik" ...
## $ powerPS : int 158 286 102 71 0 150 90 90 0 90 ...
## $ model : chr "andere" "7er" "golf" "fortwo" ...
## $ odometer : chr "150,000km" "150,000km" "70,000km" "70,000km" ...
## $ monthOfRegistration: int 3 6 7 6 7 4 8 12 10 7 ...
## $ fuelType : chr "lpg" "benzin" "benzin" "benzin" ...
## $ brand : chr "peugeot" "bmw" "volkswagen" "smart" ...
## $ notRepairedDamage : chr "nein" "nein" "nein" "nein" ...
## $ dateCreated : chr "2016-03-26 00:00:00" "2016-04-04 00:00:00" "2016-03-26 00:00:00" "2016-03-12 00:00:00" ...
## $ nrOfPictures : int 0 0 0 0 0 0 0 0 0 0 ...
## $ postalCode : int 79588 71034 35394 33729 39218 22962 31535 53474 7426 15749 ...
## $ lastSeen : chr "2016-04-06 06:45:54" "2016-04-06 14:45:08" "2016-04-06 20:15:37" "2016-03-15 03:16:28" ...
We now apply this column selection and display the head of the reduced dataframe to confirm that only the intended attributes remain.
cols_to_keep <- c("brand", "price", "yearOfRegistration", "powerPS", "odometer")
auto_list <- auto_list[, cols_to_keep]
head(auto_list)
## brand price yearOfRegistration powerPS odometer
## 1 peugeot $5,000 2004 158 150,000km
## 2 bmw $8,500 1997 286 150,000km
## 3 volkswagen $8,990 2009 102 70,000km
## 4 smart $4,350 2007 71 70,000km
## 5 ford $1,350 2003 0 150,000km
## 6 chrysler $7,900 2006 150 150,000km
colnames(auto_list)
## [1] "brand" "price" "yearOfRegistration"
## [4] "powerPS" "odometer"
Next, we clean the price and odometer columns by stripping the currency symbol, thousands separators, and the "km" suffix so they can be treated as numeric values. We then count the missing values in each column; knowing the data's completeness is essential for the analyses that follow.
Outlier Detection:
Using a small identify_outliers() helper, we flag values in the numeric columns whose z-scores exceed 3 in absolute value. The per-column outlier counts summarize data quality and motivate the removal step that follows.
Handling Outliers:
Following detection, we remove every row containing at least one outlier to protect downstream estimates, then inspect the resulting dataset to confirm the removal.
# Strip "$" and thousands separators from price, and the "km" suffix from
# odometer, then coerce both columns to numeric
auto_list$price <- as.numeric(gsub("[^0-9.]", "", auto_list$price))
auto_list$odometer <- as.numeric(gsub("[^0-9.]", "", auto_list$odometer))
head(auto_list)
## brand price yearOfRegistration powerPS odometer
## 1 peugeot 5000 2004 158 150000
## 2 bmw 8500 1997 286 150000
## 3 volkswagen 8990 2009 102 70000
## 4 smart 4350 2007 71 70000
## 5 ford 1350 2003 0 150000
## 6 chrysler 7900 2006 150 150000
str(auto_list)
## 'data.frame': 50000 obs. of 5 variables:
## $ brand : chr "peugeot" "bmw" "volkswagen" "smart" ...
## $ price : num 5000 8500 8990 4350 1350 7900 300 1990 250 590 ...
## $ yearOfRegistration: int 2004 1997 2009 2007 2003 2006 1995 1998 2000 1997 ...
## $ powerPS : int 158 286 102 71 0 150 90 90 0 90 ...
## $ odometer : num 150000 150000 70000 70000 150000 150000 150000 150000 150000 150000 ...
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
missing_values <- colSums(is.na(auto_list))
print(missing_values)
## brand price yearOfRegistration powerPS
## 0 0 0 0
## odometer
## 0
# Flag values whose z-score exceeds 3 in absolute value
identify_outliers <- function(x) {
  z_scores <- scale(x)
  abs(z_scores) > 3
}
numeric_cols <- sapply(auto_list, is.numeric)
outliers <- auto_list[, numeric_cols] %>%
  lapply(identify_outliers)
outlier_counts <- sapply(outliers, sum)
print(outlier_counts)
## price yearOfRegistration powerPS odometer
## 9 22 65 967
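As an aside, z-scores assume roughly symmetric data; on heavily skewed columns such as price, an IQR-based rule is a more robust alternative. The helper below is a sketch of that alternative, not part of the original analysis, and is not used in the remaining steps.
# Sketch of an alternative: IQR-based flagging, more robust to skew than z-scores
identify_outliers_iqr <- function(x) {
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE)
  fence <- 1.5 * (q[2] - q[1])
  x < (q[1] - fence) | x > (q[2] + fence)
}
sapply(auto_list[, numeric_cols], function(x) sum(identify_outliers_iqr(x)))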
rows_with_outliers <- apply(do.call(cbind, outliers), 1, any)
auto_list <- auto_list[!rows_with_outliers, ]
str(auto_list)
## 'data.frame': 48959 obs. of 5 variables:
## $ brand : chr "peugeot" "bmw" "volkswagen" "smart" ...
## $ price : num 5000 8500 8990 4350 1350 7900 300 1990 250 590 ...
## $ yearOfRegistration: int 2004 1997 2009 2007 2003 2006 1995 1998 2000 1997 ...
## $ powerPS : int 158 286 102 71 0 150 90 90 0 90 ...
## $ odometer : num 150000 150000 70000 70000 150000 150000 150000 150000 150000 150000 ...
To prepare the data for principal component analysis (PCA), we standardize the numeric variables by scaling them to have a mean of 0 and a standard deviation of 1. Summary statistics of the standardized data provide insights into the distribution of variables.
numeric_data <- auto_list[sapply(auto_list, is.numeric)]
standardized_data <- scale(numeric_data)
summary(standardized_data)
## price yearOfRegistration powerPS odometer
## Min. :-0.41402 Min. :-12.5680 Min. :-1.6088 Min. :-3.2279
## 1st Qu.:-0.33196 1st Qu.: -0.5957 1st Qu.:-0.5964 1st Qu.:-0.0857
## Median :-0.20137 Median : -0.0576 Median :-0.1115 Median : 0.5974
## Mean : 0.00000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.09977 3rd Qu.: 0.6150 3rd Qu.: 0.5301 3rd Qu.: 0.5974
## Max. :92.35464 Max. : 2.0947 Max. : 8.9432 Max. : 0.5974
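As a quick sanity check, scale() centers each column and divides by its sample standard deviation, so it should agree with a manual z-score computation. A minimal sketch for the price column:
# scale() should match the manual z-score formula (x - mean) / sd
manual_z <- (numeric_data$price - mean(numeric_data$price)) / sd(numeric_data$price)
all.equal(as.numeric(standardized_data[, "price"]), manual_z)  # expect TRUE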
Assumption Testing:
Assumption testing assesses whether the data are suitable for PCA. We run the Kaiser-Meyer-Olkin (KMO) test and Bartlett's test of sphericity to evaluate sampling adequacy and whether the correlation matrix departs sufficiently from the identity for factor-analytic methods; the results inform the decision to proceed with PCA.
library(psych)
kmo_result <- KMO(numeric_data)
print(paste("KMO Test Statistic:", kmo_result$MSA, "\n"))
## [1] "KMO Test Statistic: 0.571468005265458 \n"
bartlett_result <- cortest.bartlett(cor(numeric_data), n = nrow(numeric_data))
print(paste("Bartlett's Test Statistic:", bartlett_result$chisq, "\n"))
## [1] "Bartlett's Test Statistic: 13430.5465175227 \n"
print(paste("Bartlett's Test p-value:", bartlett_result$p.value, "\n"))
## [1] "Bartlett's Test p-value: 0 \n"
Here, we check the standardized data for infinite and zero values. Counting them shows whether such values are prevalent enough to distort subsequent analyses; both counts turn out to be zero, so no corrective action is needed.
num_infinite <- sum(!is.finite(standardized_data))
num_zeros <- sum(standardized_data == 0)
cat("Number of infinite values:", num_infinite, "\n")
## Number of infinite values: 0
cat("Number of zero values:", num_zeros, "\n")
## Number of zero values: 0
As a safeguard, we identify any columns containing non-finite values and impute them with the respective column means. Since none were found above, this step leaves the data unchanged while guarding against the biases infinite values would otherwise introduce.
cols_with_infinite <- apply(standardized_data, 2, function(x) any(!is.finite(x)))
for (col_index in which(cols_with_infinite)) {
  col_mean <- mean(standardized_data[, col_index], na.rm = TRUE)
  standardized_data[!is.finite(standardized_data[, col_index]), col_index] <- col_mean
}
# The data are already standardized, so center/scale are effectively no-ops here
pca_result <- prcomp(standardized_data, center = TRUE, scale. = TRUE)
summary(pca_result)
## Importance of components:
## PC1 PC2 PC3 PC4
## Standard deviation 1.2723 0.9974 0.8862 0.7752
## Proportion of Variance 0.4047 0.2487 0.1963 0.1502
## Cumulative Proportion 0.4047 0.6534 0.8498 1.0000
PCA is performed on the standardized data to reduce dimensionality and uncover underlying patterns. The resulting principal components are summarized, providing insights into the variance explained by each component and the cumulative variance explained.
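To see which variables drive each component, the loadings returned by prcomp() can be inspected directly; a minimal sketch:
# Loadings: rows are variables, columns are principal components
round(pca_result$rotation, 2)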
Visualizing Correlation:
The correlation structure of the standardized data is visualized with a correlation matrix plot. This helps identify patterns of association between variables, informing subsequent analyses and interpretations.
library(corrplot)
## corrplot 0.92 loaded
## Warning: package 'corrplot' was built under R version 4.3.2
cor_matrix <- cor(standardized_data)
corrplot(cor_matrix, method = "color", type = "lower", order = "hclust", tl.col = "black", tl.cex = 0.5)
Scree Plot:
A scree plot is generated to visualize the variance explained by each principal component. This plot aids in determining the appropriate number of components to retain, guiding dimensionality reduction efforts.
plot(pca_result, type = "l", main = "Scree Plot")
Here, the curve flattens after the first two or three components, indicating that they capture the bulk of the variance in the data.
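The visual judgment can be backed up numerically by extracting the proportion of variance explained and finding the smallest number of components that reaches a target, say 80% (an illustrative threshold, not a rule):
# Cumulative proportion of variance explained by the components
prop_var <- pca_result$sdev^2 / sum(pca_result$sdev^2)
cum_var <- cumsum(prop_var)
# Smallest number of components explaining at least 80% (illustrative cutoff)
which(cum_var >= 0.80)[1]
Given the cumulative proportions in the PCA summary above, this yields three components.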
Finally, the contributions of variables to each principal component are visualized using a variable contribution plot. This plot highlights the relative importance of variables in explaining the variation captured by each component, facilitating interpretation and insight generation.
library(factoextra)
## Loading required package: ggplot2
##
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
##
## %+%, alpha
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
fviz_contrib(pca_result, "var", axes = 1:4, fill = "tomato3", color = "tomato4")
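For a single axis, the contribution reported by fviz_contrib() is simply a variable's squared loading expressed as a percentage of the component total; a minimal sketch for PC1:
# Contribution (%) of each variable to PC1: squared loading, normalized to 100
loadings_pc1 <- pca_result$rotation[, 1]
round(100 * loadings_pc1^2 / sum(loadings_pc1^2), 1)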
A biplot, shown further below, overlays sample scores and variable loadings on a single scatter plot, letting us read relationships between observations and variables together. Before building it, we explore pairwise relationships among the raw variables.
install.packages("GGally", repos = "https://cran.r-project.org")
## Installing package into 'C:/Users/Alain/AppData/Local/R/win-library/4.3'
## (as 'lib' is unspecified)
## package 'GGally' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\Alain\AppData\Local\Temp\RtmpqGZw53\downloaded_packages
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
ggpairs(auto_list[, -1])
install.packages("reshape2", repos = "https://cran.r-project.org")
## Installing package into 'C:/Users/Alain/AppData/Local/R/win-library/4.3'
## (as 'lib' is unspecified)
## package 'reshape2' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\Alain\AppData\Local\Temp\RtmpqGZw53\downloaded_packages
library(reshape2)
This plot provides a pairwise scatterplot matrix of the variables in the dataset. Each cell in the matrix shows the relationship between two variables, allowing us to observe patterns and correlations between different attributes. In the context of dimensionality reduction, this plot can help visualize potential clusters or patterns in lower-dimensional space.
library(ggplot2)
cor_matrix <- cor(standardized_data)
ggplot(data = melt(cor_matrix), aes(Var1, Var2, fill = value)) +
  geom_tile() +
  scale_fill_gradient(low = "blue", high = "red") +
  labs(title = "Correlation Heatmap")
The heatmap visualizes the correlation matrix of the standardized variables: cells toward red indicate stronger positive correlations, while cells toward blue indicate weak or negative ones. Understanding this interplay is crucial in dimensionality reduction, since redundant or highly correlated variables are natural candidates for being collapsed into fewer components.
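The redundant pairs the heatmap hints at can also be listed programmatically by scanning the upper triangle of the correlation matrix; the 0.5 cutoff below is an arbitrary illustrative choice:
# List variable pairs whose absolute correlation exceeds the chosen cutoff
high_cor <- which(abs(cor_matrix) > 0.5 & upper.tri(cor_matrix), arr.ind = TRUE)
data.frame(var1 = rownames(cor_matrix)[high_cor[, 1]],
           var2 = colnames(cor_matrix)[high_cor[, 2]],
           r = round(cor_matrix[high_cor], 2))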
library(scatterplot3d)
scatterplot3d(pca_result$x[, 1:3], main = "3D Scatterplot", pch = 16, color = "blue")
This plot represents the data in a three-dimensional space using the first three principal components from PCA. It allows visualization of the data points in a reduced-dimensional space, helping to identify clusters or patterns that may not be apparent in higher dimensions. While it’s challenging to visualize high-dimensional data directly, projecting it onto lower-dimensional space using PCA enables easier interpretation and exploration.
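A simpler two-dimensional view plots just the first two component scores; a minimal ggplot2 sketch (point size and transparency are arbitrary choices):
# Project the observations onto the first two principal components
scores <- as.data.frame(pca_result$x[, 1:2])
ggplot(scores, aes(PC1, PC2)) +
  geom_point(alpha = 0.3, size = 0.8) +
  labs(title = "Observations in PC1-PC2 space")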
fviz_pca_biplot(pca_result, geom = "point", col.ind = "blue", col.var = "red")
The PCA biplot combines a scatterplot of the data points with vectors indicating the direction and strength of each variable’s contribution to the principal components. This plot allows us to visualize both the data points and the variables simultaneously, providing insights into how variables influence the principal components and how data points are distributed in the reduced-dimensional space.
library(GGally)
ggparcoord(auto_list, columns = 2:5, groupColumn = "brand", scale = "globalminmax")
A parallel coordinates plot visualizes multivariate data by drawing each observation as a line across parallel axes, where the position on each axis encodes that variable's value. It is useful for inspecting the distribution of data points along several dimensions at once, which makes it well suited to exploring high-dimensional datasets and spotting patterns or clusters.
In conclusion, our journey through dimensionality reduction has provided valuable insights into the structure and relationships within the automotive dataset. By applying PCA, we reduced the dimensionality of the data while preserving its essential characteristics. Outlier detection and assumption testing helped ensure the robustness of the analysis and the reliability of the findings, while visualizations such as correlation plots and biplots aided interpretation and the identification of meaningful patterns. Overall, this analysis underscores the importance of dimensionality reduction techniques in distilling complex, high-dimensional data into interpretable, actionable insights.