The purpose of this project is to employ Principal Component Analysis (PCA) to reduce the dimensionality of a dataset of car characteristics.
Libraries
library(dplyr)
library(factoextra)
library(caret)
library(ggplot2)
library(corrplot)
The dataset used in the project was taken from Kaggle. It contains 428 observations of 19 variables, making it a good candidate for dimension reduction. Each row contains basic information about a different car model, with variables such as engine size, weight, or horsepower. The set also includes binary variables, stored as TRUE or FALSE values, in the first few columns following the name of the car. These have to be converted to numeric for the purposes of this task.
cars <- read.csv("cars.csv")
# Convert the TRUE/FALSE indicator columns to numeric 0/1
cars <- cars %>%
  mutate(across(sports_car:rear_wheel, as.numeric))
str(cars)
## 'data.frame': 428 obs. of 19 variables:
## $ name : chr "Chevrolet Aveo 4dr" "Chevrolet Aveo LS 4dr hatch" "Chevrolet Cavalier 2dr" "Chevrolet Cavalier 4dr" ...
## $ sports_car : num 0 0 0 0 0 0 0 0 0 0 ...
## $ suv : num 0 0 0 0 0 0 0 0 0 0 ...
## $ wagon : num 0 0 0 0 0 0 0 0 0 0 ...
## $ minivan : num 0 0 0 0 0 0 0 0 0 0 ...
## $ pickup : num 0 0 0 0 0 0 0 0 0 0 ...
## $ all_wheel : num 0 0 0 0 0 0 0 0 0 0 ...
## $ rear_wheel : num 0 0 0 0 0 0 0 0 0 0 ...
## $ msrp : int 11690 12585 14610 14810 16385 13670 15040 13270 13730 15460 ...
## $ dealer_cost: int 10965 11802 13697 13884 15357 12849 14086 12482 12906 14496 ...
## $ eng_size : num 1.6 1.6 2.2 2.2 2.2 2 2 2 2 2 ...
## $ ncyl : int 4 4 4 4 4 4 4 4 4 4 ...
## $ horsepwr : int 103 103 140 140 140 132 132 130 110 130 ...
## $ city_mpg : int 28 28 26 26 26 29 29 26 27 26 ...
## $ hwy_mpg : int 34 34 37 37 37 36 36 33 36 33 ...
## $ weight : int 2370 2348 2617 2676 2617 2581 2626 2612 2606 2606 ...
## $ wheel_base : int 98 98 104 104 104 105 105 103 103 103 ...
## $ length : int 167 153 183 183 183 174 174 168 168 168 ...
## $ width : int 66 66 69 68 69 67 67 67 67 67 ...
summary(cars)
## name sports_car suv wagon
## Length:428 Min. :0.0000 Min. :0.0000 Min. :0.00000
## Class :character 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.00000
## Mode :character Median :0.0000 Median :0.0000 Median :0.00000
## Mean :0.1145 Mean :0.1402 Mean :0.07009
## 3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:0.00000
## Max. :1.0000 Max. :1.0000 Max. :1.00000
##
## minivan pickup all_wheel rear_wheel
## Min. :0.00000 Min. :0.00000 Min. :0.000 Min. :0.000
## 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.000 1st Qu.:0.000
## Median :0.00000 Median :0.00000 Median :0.000 Median :0.000
## Mean :0.04673 Mean :0.05607 Mean :0.215 Mean :0.257
## 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.000 3rd Qu.:1.000
## Max. :1.00000 Max. :1.00000 Max. :1.000 Max. :1.000
##
## msrp dealer_cost eng_size ncyl
## Min. : 10280 Min. : 9875 Min. :1.300 Min. :-1.000
## 1st Qu.: 20334 1st Qu.: 18866 1st Qu.:2.375 1st Qu.: 4.000
## Median : 27635 Median : 25295 Median :3.000 Median : 6.000
## Mean : 32775 Mean : 30015 Mean :3.197 Mean : 5.776
## 3rd Qu.: 39205 3rd Qu.: 35710 3rd Qu.:3.900 3rd Qu.: 6.000
## Max. :192465 Max. :173560 Max. :8.300 Max. :12.000
##
## horsepwr city_mpg hwy_mpg weight wheel_base
## Min. : 73.0 Min. :10.00 Min. :12.00 Min. :1850 Min. : 89.0
## 1st Qu.:165.0 1st Qu.:17.00 1st Qu.:24.00 1st Qu.:3102 1st Qu.:103.0
## Median :210.0 Median :19.00 Median :26.00 Median :3474 Median :107.0
## Mean :215.9 Mean :20.09 Mean :26.91 Mean :3577 Mean :108.2
## 3rd Qu.:255.0 3rd Qu.:21.00 3rd Qu.:29.00 3rd Qu.:3974 3rd Qu.:112.0
## Max. :500.0 Max. :60.00 Max. :66.00 Max. :7190 Max. :144.0
## NA's :14 NA's :14 NA's :2 NA's :2
## length width
## Min. :143.0 Min. :64.00
## 1st Qu.:177.0 1st Qu.:69.00
## Median :186.0 Median :71.00
## Mean :185.1 Mean :71.29
## 3rd Qu.:193.0 3rd Qu.:73.00
## Max. :227.0 Max. :81.00
## NA's :26 NA's :28
Additionally, there are a number of missing values, though fortunately only in the originally numeric variables, not in the converted ones. They are imputed using the k-Nearest-Neighbors (kNN) imputation technique. As part of the same preprocessing step, the data is also centered and scaled, standardizing it for the PCA algorithm.
cars_knn <- preProcess(cars, method = c('center', 'scale', 'knnImpute'))
cars_imp <- predict(cars_knn, cars)
The “name” column is dropped from the set and a correlation matrix is computed.
cars_nonames <- cars_imp %>% select(-name)
corr_matrix <- cor(cars_nonames, method = "pearson")
#corr_matrix
corrplot(corr_matrix)
There are many strong correlations, both positive and negative, which is promising for PCA, as the method exploits shared variance to reduce dimensionality. However, despite encoding all the categorical variables at the front of the dataset, they show almost no correlation with the remaining variables. Since they would contribute little to the principal components, they are removed from the dataset before proceeding with PCA.
cars_2 <- cars_nonames %>%
  select(-(sports_car:rear_wheel))
corr_matrix2 <- cor(cars_2, method = "pearson")
#corr_matrix2
corrplot(corr_matrix2)
With the weakly correlated binary variables removed and the strong correlations among the remaining variables preserved, the set is ready for PCA.
# The data is already centered and scaled by preProcess; doing it again here is harmless
pca <- prcomp(cars_2, scale. = TRUE, center = TRUE)
summary(pca)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.6565 1.3892 0.90027 0.61595 0.50058 0.46390 0.3911
## Proportion of Variance 0.6415 0.1754 0.07368 0.03449 0.02278 0.01956 0.0139
## Cumulative Proportion 0.6415 0.8170 0.89067 0.92516 0.94794 0.96750 0.9814
## PC8 PC9 PC10 PC11
## Standard deviation 0.30136 0.2755 0.19250 0.02825
## Proportion of Variance 0.00826 0.0069 0.00337 0.00007
## Cumulative Proportion 0.98966 0.9966 0.99993 1.00000
To choose the right number of principal components, the eigenvalues are extracted, and scree plots of the eigenvalues and of the percentage of variance explained are produced.
eigenvalues <- pca$sdev^2
eigenvalues
## [1] 7.0568579123 1.9299766387 0.8104876153 0.3793926247 0.2505816194
## [6] 0.2152052481 0.1529253933 0.0908174247 0.0758996944 0.0370575823
## [11] 0.0007982468
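The scree plots can be produced with factoextra; a minimal sketch, assuming the pca object created above (the exact styling is an assumption):
# Scree plot of the percentage of variance explained per component
fviz_eig(pca, addlabels = TRUE)
# Scree plot on the eigenvalue scale, with the Kaiser cutoff marked at 1
fviz_eig(pca, choice = "eigenvalue", geom = "line") +
  geom_hline(yintercept = 1, linetype = "dashed")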
Based on the Kaiser criterion and the cumulative explained variance, only the first two principal components are needed: together they explain over 81% of the variance, and both have eigenvalues above 1.
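This choice can be verified directly from the eigenvalues computed above; a quick sketch:
# Number of components passing the Kaiser criterion (eigenvalue > 1)
sum(eigenvalues > 1)
# Cumulative proportion of variance explained by the first two components
cumsum(eigenvalues / sum(eigenvalues))[2]
Both checks agree with the summary: two components pass the cutoff, and together they account for roughly 81.7% of the variance.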
The following plots show the combined and individual contributions of each variable to the principal components.
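One way to draw such plots with factoextra, sketched here using the pca object directly:
# Variable map on PC1/PC2, colored by each variable's contribution
fviz_pca_var(pca, col.var = "contrib")
# Individual contributions of the variables to PC1 and PC2
fviz_contrib(pca, choice = "var", axes = 1)
fviz_contrib(pca, choice = "var", axes = 2)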
While the first component receives varied contributions from many variables, the second has two clearly prominent contributors: dealer_cost and msrp. The two components seem to be “focusing” on two aspects of any given car: its performance and its pricing.
In the case of this dataset, PCA was very successful, reducing the full scope of the data to just two dimensions, each summarizing the contributions of several of the original variables.
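As a closing illustration, the reduced two-dimensional representation can be extracted from the scores matrix of the prcomp object; a minimal sketch:
# Project each car onto the first two principal components
cars_reduced <- as.data.frame(pca$x[, 1:2])
head(cars_reduced)
Each row of cars_reduced describes a car with just two numbers: its coordinates along the performance and pricing axes identified above.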