Introduction

The purpose of this project is to employ Principal Component Analysis (PCA) to reduce the dimensionality of a dataset describing various car characteristics.


Libraries

library(dplyr)      # data manipulation
library(factoextra) # PCA extraction and visualization helpers
library(caret)      # preprocessing and kNN imputation
library(ggplot2)    # general plotting
library(corrplot)   # correlation matrix plots


The Dataset

The dataset used in the project was taken from Kaggle. It contains 428 observations of 19 variables, making it an ideal candidate for dimensionality reduction. Each row contains basic information about a different car model, with variables such as engine size, weight or horsepower. The set also includes binary variables, stored as TRUE/FALSE values, in the first few columns following the name of the car. Those have to be converted to numeric form for the purposes of this task.

cars <- read.csv("cars.csv")

# Convert the logical TRUE/FALSE indicator columns to numeric 0/1
cars <- cars %>%
  mutate(across(c(sports_car, suv, wagon, minivan, pickup,
                  all_wheel, rear_wheel),
                as.numeric))
str(cars)
## 'data.frame':    428 obs. of  19 variables:
##  $ name       : chr  "Chevrolet Aveo 4dr" "Chevrolet Aveo LS 4dr hatch" "Chevrolet Cavalier 2dr" "Chevrolet Cavalier 4dr" ...
##  $ sports_car : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ suv        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ wagon      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ minivan    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ pickup     : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ all_wheel  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ rear_wheel : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ msrp       : int  11690 12585 14610 14810 16385 13670 15040 13270 13730 15460 ...
##  $ dealer_cost: int  10965 11802 13697 13884 15357 12849 14086 12482 12906 14496 ...
##  $ eng_size   : num  1.6 1.6 2.2 2.2 2.2 2 2 2 2 2 ...
##  $ ncyl       : int  4 4 4 4 4 4 4 4 4 4 ...
##  $ horsepwr   : int  103 103 140 140 140 132 132 130 110 130 ...
##  $ city_mpg   : int  28 28 26 26 26 29 29 26 27 26 ...
##  $ hwy_mpg    : int  34 34 37 37 37 36 36 33 36 33 ...
##  $ weight     : int  2370 2348 2617 2676 2617 2581 2626 2612 2606 2606 ...
##  $ wheel_base : int  98 98 104 104 104 105 105 103 103 103 ...
##  $ length     : int  167 153 183 183 183 174 174 168 168 168 ...
##  $ width      : int  66 66 69 68 69 67 67 67 67 67 ...
summary(cars)
##      name             sports_car          suv             wagon        
##  Length:428         Min.   :0.0000   Min.   :0.0000   Min.   :0.00000  
##  Class :character   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.00000  
##  Mode  :character   Median :0.0000   Median :0.0000   Median :0.00000  
##                     Mean   :0.1145   Mean   :0.1402   Mean   :0.07009  
##                     3rd Qu.:0.0000   3rd Qu.:0.0000   3rd Qu.:0.00000  
##                     Max.   :1.0000   Max.   :1.0000   Max.   :1.00000  
##                                                                        
##     minivan            pickup          all_wheel       rear_wheel   
##  Min.   :0.00000   Min.   :0.00000   Min.   :0.000   Min.   :0.000  
##  1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:0.000   1st Qu.:0.000  
##  Median :0.00000   Median :0.00000   Median :0.000   Median :0.000  
##  Mean   :0.04673   Mean   :0.05607   Mean   :0.215   Mean   :0.257  
##  3rd Qu.:0.00000   3rd Qu.:0.00000   3rd Qu.:0.000   3rd Qu.:1.000  
##  Max.   :1.00000   Max.   :1.00000   Max.   :1.000   Max.   :1.000  
##                                                                     
##       msrp         dealer_cost        eng_size          ncyl       
##  Min.   : 10280   Min.   :  9875   Min.   :1.300   Min.   :-1.000  
##  1st Qu.: 20334   1st Qu.: 18866   1st Qu.:2.375   1st Qu.: 4.000  
##  Median : 27635   Median : 25295   Median :3.000   Median : 6.000  
##  Mean   : 32775   Mean   : 30015   Mean   :3.197   Mean   : 5.776  
##  3rd Qu.: 39205   3rd Qu.: 35710   3rd Qu.:3.900   3rd Qu.: 6.000  
##  Max.   :192465   Max.   :173560   Max.   :8.300   Max.   :12.000  
##                                                                    
##     horsepwr        city_mpg        hwy_mpg          weight       wheel_base   
##  Min.   : 73.0   Min.   :10.00   Min.   :12.00   Min.   :1850   Min.   : 89.0  
##  1st Qu.:165.0   1st Qu.:17.00   1st Qu.:24.00   1st Qu.:3102   1st Qu.:103.0  
##  Median :210.0   Median :19.00   Median :26.00   Median :3474   Median :107.0  
##  Mean   :215.9   Mean   :20.09   Mean   :26.91   Mean   :3577   Mean   :108.2  
##  3rd Qu.:255.0   3rd Qu.:21.00   3rd Qu.:29.00   3rd Qu.:3974   3rd Qu.:112.0  
##  Max.   :500.0   Max.   :60.00   Max.   :66.00   Max.   :7190   Max.   :144.0  
##                  NA's   :14      NA's   :14      NA's   :2      NA's   :2      
##      length          width      
##  Min.   :143.0   Min.   :64.00  
##  1st Qu.:177.0   1st Qu.:69.00  
##  Median :186.0   Median :71.00  
##  Mean   :185.1   Mean   :71.29  
##  3rd Qu.:193.0   3rd Qu.:73.00  
##  Max.   :227.0   Max.   :81.00  
##  NA's   :26      NA's   :28

Additionally, there are a number of missing values, though fortunately only in the originally numeric variables, not in the converted binary ones. They are imputed using the k-Nearest-Neighbors imputation technique. In the process, the data is also centered and scaled, which the PCA algorithm requires.

# Center, scale and kNN-impute the numeric columns in one preprocessing step
cars_knn <- preProcess(cars, method = c('center', 'scale', 'knnImpute'))
cars_imp <- predict(cars_knn, cars)
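
A quick sanity check (not part of the original code) confirms that no missing values remain after imputation:

# Count the remaining NAs per column; all should be zero after kNN imputation
colSums(is.na(cars_imp))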


The “name” column is dropped from the set and a correlation matrix is computed.

cars_nonames <- cars_imp %>% select(-name)

corr_matrix <- cor(cars_nonames, method = "pearson")
#corr_matrix
corrplot(corr_matrix)

There are many strong correlations, both positive and negative, which bodes well for PCA, since the method exploits exactly that kind of redundancy. However, despite being encoded as numbers, the categorical variables at the front of the dataset show almost no correlation with the rest. Since they would contribute little to the principal components, they are removed from the dataset before proceeding with PCA.
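
One way to quantify how weak these relationships are (an optional check, not part of the original analysis) is to look at the largest absolute correlation each binary indicator has with any of the continuous variables:

# For each binary indicator, find its largest absolute correlation
# with the continuous variables
binary_vars <- c("sports_car", "suv", "wagon", "minivan",
                 "pickup", "all_wheel", "rear_wheel")
apply(corr_matrix[binary_vars, setdiff(colnames(corr_matrix), binary_vars)],
      1, function(x) max(abs(x)))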

cars_2 <- cars_nonames %>% select(-sports_car,
                                  -minivan,
                                  -suv,
                                  -wagon,
                                  -all_wheel,
                                  -rear_wheel,
                                  -pickup)

corr_matrix2 <- cor(cars_2, method = "pearson")
#corr_matrix2
corrplot(corr_matrix2)

With the weakly correlated variables removed and the strong correlations preserved, the set is ready for PCA.

Principal Component Analysis

pca <- prcomp(cars_2, scale. = TRUE, center = TRUE)
summary(pca)
## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6    PC7
## Standard deviation     2.6565 1.3892 0.90027 0.61595 0.50058 0.46390 0.3911
## Proportion of Variance 0.6415 0.1754 0.07368 0.03449 0.02278 0.01956 0.0139
## Cumulative Proportion  0.6415 0.8170 0.89067 0.92516 0.94794 0.96750 0.9814
##                            PC8    PC9    PC10    PC11
## Standard deviation     0.30136 0.2755 0.19250 0.02825
## Proportion of Variance 0.00826 0.0069 0.00337 0.00007
## Cumulative Proportion  0.98966 0.9966 0.99993 1.00000


To choose the right number of principal components, the eigenvalues are extracted, and scree plots of the eigenvalues and of the percentage of variance explained are produced.

eigenvalues <- pca$sdev^2
eigenvalues
##  [1] 7.0568579123 1.9299766387 0.8104876153 0.3793926247 0.2505816194
##  [6] 0.2152052481 0.1529253933 0.0908174247 0.0758996944 0.0370575823
## [11] 0.0007982468
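
The scree plots themselves can be produced with factoextra; the exact calls are not shown in the original, so the following is a plausible sketch:

# Scree plot of the percentage of variance explained by each component
fviz_eig(pca, choice = "variance", addlabels = TRUE)

# Scree plot of the raw eigenvalues, with the Kaiser threshold marked at 1
fviz_eig(pca, choice = "eigenvalue") +
  geom_hline(yintercept = 1, linetype = "dashed")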


Based on the Kaiser criterion and the cumulative explained variance, only the first two principal components are needed: together they explain over 81% of the variance, and both of their eigenvalues exceed 1.
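
The Kaiser count can be verified directly from the extracted eigenvalues:

# Kaiser criterion: number of components with an eigenvalue above 1
sum(eigenvalues > 1)
## [1] 2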

The following plots show the combined and individual contributions of each variable to the principal components.
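
The plotting code is not included in the original; with factoextra, such plots could be produced along these lines:

# Variable map: combined view of how the variables load on PC1 and PC2
fviz_pca_var(pca, col.var = "contrib", repel = TRUE)

# Individual contributions of each variable to PC1 and to PC2
fviz_contrib(pca, choice = "var", axes = 1)
fviz_contrib(pca, choice = "var", axes = 2)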

While the first component receives varied contributions from many variables, the second has two clearly prominent ones, dealer_cost and msrp. The two components seem to be “focusing” on two aspects of any given car: its performance and its pricing.
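
The loadings behind this reading can also be inspected numerically; a minimal sketch:

# Loadings of the first two components; per the interpretation above,
# the largest absolute PC2 loadings should belong to dealer_cost and msrp
round(pca$rotation[, 1:2], 2)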


Conclusions

In the case of this dataset, PCA was very successful, reducing the full scope of the data to just two dimensions that together explain over 81% of the variance: one driven largely by the performance-related variables and the other by pricing.
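
Should the reduced representation be needed downstream, the two component scores can be extracted directly from the prcomp object; a minimal sketch:

# Keep only the first two principal component scores as the reduced dataset
cars_reduced <- as.data.frame(pca$x[, 1:2])
head(cars_reduced)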