## Robert Sidorowski
Principal component analysis (PCA) is a statistical procedure that summarizes the information content of a large data set with a smaller set of summary indicators, which makes visualization and analysis easier.
In my PCA analysis I used an existing dataset with data about 32 car models. The data were extracted from the 1974 issue of the American automotive magazine Motor Trend. Each car is described by 11 variables, expressed in US units.
MPG - Fuel consumption in miles per gallon. Cars that have more power or are heavier tend to consume more fuel.
CYL - Number of cylinders. More powerful cars tend to have more cylinders.
DISP - Displacement: the combined volume of the engine’s cylinders.
HP - Gross horsepower: the power the engine generates.
DRAT - Rear axle ratio: how many turns of the drive shaft correspond to one turn of the wheels. Higher values tend to decrease fuel efficiency.
WT - Car weight, in thousands of pounds.
QSEC - Quarter-mile time in seconds, a measure of acceleration.
VS - Engine shape: 0 for a V-shaped block, 1 for a straight (inline) one.
AM - Transmission type: 0 for automatic, 1 for manual.
GEAR - Number of forward gears.
CARB - Number of carburetors; more carburetors are associated with more powerful engines.
Loading the dataset:
data(mtcars)
summary(mtcars)
## mpg cyl disp hp
## Min. :10.40 Min. :4.000 Min. : 71.1 Min. : 52.0
## 1st Qu.:15.43 1st Qu.:4.000 1st Qu.:120.8 1st Qu.: 96.5
## Median :19.20 Median :6.000 Median :196.3 Median :123.0
## Mean :20.09 Mean :6.188 Mean :230.7 Mean :146.7
## 3rd Qu.:22.80 3rd Qu.:8.000 3rd Qu.:326.0 3rd Qu.:180.0
## Max. :33.90 Max. :8.000 Max. :472.0 Max. :335.0
## drat wt qsec vs
## Min. :2.760 Min. :1.513 Min. :14.50 Min. :0.0000
## 1st Qu.:3.080 1st Qu.:2.581 1st Qu.:16.89 1st Qu.:0.0000
## Median :3.695 Median :3.325 Median :17.71 Median :0.0000
## Mean :3.597 Mean :3.217 Mean :17.85 Mean :0.4375
## 3rd Qu.:3.920 3rd Qu.:3.610 3rd Qu.:18.90 3rd Qu.:1.0000
## Max. :4.930 Max. :5.424 Max. :22.90 Max. :1.0000
## am gear carb
## Min. :0.0000 Min. :3.000 Min. :1.000
## 1st Qu.:0.0000 1st Qu.:3.000 1st Qu.:2.000
## Median :0.0000 Median :4.000 Median :2.000
## Mean :0.4062 Mean :3.688 Mean :2.812
## 3rd Qu.:1.0000 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :1.0000 Max. :5.000 Max. :8.000
PCA works best with numeric data, so we have to exclude the two categorical variables, vs and am. After excluding them, we are left with a matrix of 32 rows and 9 columns.
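As a quick check (added here, not part of the original write-up), we can list the columns that the index vector below keeps:
# Columns kept for the PCA; vs and am (columns 8 and 9) are dropped:
# "mpg" "cyl" "disp" "hp" "drat" "wt" "qsec" "gear" "carb"
names(mtcars)[c(1:7, 10, 11)]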
cars.pca <- prcomp(mtcars[, c(1:7, 10, 11)], center = TRUE, scale. = TRUE)
summary(cars.pca)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.3782 1.4429 0.71008 0.51481 0.42797 0.35184 0.32413
## Proportion of Variance 0.6284 0.2313 0.05602 0.02945 0.02035 0.01375 0.01167
## Cumulative Proportion 0.6284 0.8598 0.91581 0.94525 0.96560 0.97936 0.99103
## PC8 PC9
## Standard deviation 0.2419 0.14896
## Proportion of Variance 0.0065 0.00247
## Cumulative Proportion 0.9975 1.00000
After this operation we obtain 9 principal components. From the results we can see that PC1 explains about 63% of the total variance and PC2 explains a further 23%.
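These proportions can be reproduced directly from the standard deviations stored in the prcomp object; a minimal verification sketch:
# Proportion of variance per component, computed from the stored sdev values
round(cars.pca$sdev^2 / sum(cars.pca$sdev^2), 4)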
Next, let’s look at the correlation matrix.
library(corrplot)
## corrplot 0.84 loaded
cars.cor<-cor(mtcars, method="pearson")
print(cars.cor, digits=2)
## mpg cyl disp hp drat wt qsec vs am gear carb
## mpg 1.00 -0.85 -0.85 -0.78 0.681 -0.87 0.419 0.66 0.600 0.48 -0.551
## cyl -0.85 1.00 0.90 0.83 -0.700 0.78 -0.591 -0.81 -0.523 -0.49 0.527
## disp -0.85 0.90 1.00 0.79 -0.710 0.89 -0.434 -0.71 -0.591 -0.56 0.395
## hp -0.78 0.83 0.79 1.00 -0.449 0.66 -0.708 -0.72 -0.243 -0.13 0.750
## drat 0.68 -0.70 -0.71 -0.45 1.000 -0.71 0.091 0.44 0.713 0.70 -0.091
## wt -0.87 0.78 0.89 0.66 -0.712 1.00 -0.175 -0.55 -0.692 -0.58 0.428
## qsec 0.42 -0.59 -0.43 -0.71 0.091 -0.17 1.000 0.74 -0.230 -0.21 -0.656
## vs 0.66 -0.81 -0.71 -0.72 0.440 -0.55 0.745 1.00 0.168 0.21 -0.570
## am 0.60 -0.52 -0.59 -0.24 0.713 -0.69 -0.230 0.17 1.000 0.79 0.058
## gear 0.48 -0.49 -0.56 -0.13 0.700 -0.58 -0.213 0.21 0.794 1.00 0.274
## carb -0.55 0.53 0.39 0.75 -0.091 0.43 -0.656 -0.57 0.058 0.27 1.000
corrplot(cars.cor, order = "alphabet")
We can see that several variables are strongly correlated: cyl, disp, hp and wt are all positively correlated with one another, and each of them is negatively correlated with mpg.
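As a numeric complement to the plot, we can list the most strongly correlated pairs; a small sketch in base R, with the 0.8 cutoff chosen arbitrarily:
# Variable pairs with |r| > 0.8, keeping each pair once and skipping the diagonal
strong <- which(abs(cars.cor) > 0.8 & abs(cars.cor) < 1, arr.ind = TRUE)
strong <- strong[strong[, 1] < strong[, 2], , drop = FALSE]
data.frame(var1 = rownames(cars.cor)[strong[, 1]],
           var2 = colnames(cars.cor)[strong[, 2]],
           r    = round(cars.cor[strong], 2))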
Now I use the ggbiplot package, which allows us to create very informative biplots.
library(devtools)
## Loading required package: usethis
library(ggbiplot)
## Loading required package: ggplot2
## Loading required package: plyr
## Loading required package: scales
## Loading required package: grid
ggbiplot(cars.pca, labels=rownames(mtcars))
Here we can see that the variables hp, cyl and disp all contribute to PC1, with higher values in these variables shifting the samples to the right in the graph. We can also see which cars are similar: cars that lie close together have similar characteristics.
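This visual reading can be verified against the loadings stored in the rotation matrix of the prcomp result:
# Loadings of each variable on the first two components;
# large absolute PC1 values drive the left-right separation in the biplot
round(cars.pca$rotation[, 1:2], 2)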
Now we create a vector with the country of origin of each car. It will be useful for the next chart.
cars.country <- c(rep("Japan", 3), rep("US", 4), rep("Europe", 7), rep("US", 3), "Europe", rep("Japan", 3), rep("US", 4), rep("Europe", 3), "US", rep("Europe", 3))
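A quick sanity check (added) that the vector lines up with the 32 rows of mtcars:
# One label per car; the counts below follow from the rep() calls above
length(cars.country)        # 32
table(cars.country)         # Europe 14, Japan 6, US 12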
ggbiplot(cars.pca, ellipse = TRUE, labels = rownames(mtcars), groups = cars.country)
From this chart, we can see that the American cars form a distinct cluster on the right, characterized by high values of hp, cyl, disp and wt. The Japanese cars form a second cluster, located on the left and characterized by high values of mpg. The European cars form a third cluster in the middle; interestingly, they seem to strike the best balance between power and fuel economy.
It makes little sense to visualize PC3, PC4 and so on, because they explain only a small percentage of the total variance.
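If one did want to inspect them, the choices argument of ggbiplot selects which components to plot; for example:
# Biplot of PC3 vs PC4 (shown only for illustration; these components carry little variance)
ggbiplot(cars.pca, choices = c(3, 4), labels = rownames(mtcars))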
ggbiplot(cars.pca, ellipse = TRUE, circle = TRUE, labels = rownames(mtcars), groups = cars.country)
Adding circle=TRUE draws the unit correlation circle around the origin, which makes it easier to judge how well each variable is represented in the plane of the first two components.
Finally, the scree plot shows the variance of each principal component; the line version makes the sharp drop after PC2 easy to see.
plot(cars.pca)
plot(cars.pca, type = "l")
As a complementary view, we cluster the cars hierarchically on their pairwise distances.
# Euclidean distance matrix between the 32 cars (on the raw, unscaled variables)
dm <- dist(mtcars)
# Complete-linkage hierarchical clustering and its dendrogram
hc <- hclust(dm, method = "complete")
plot(hc)
# Distribution of the pairwise distances
plot(density(dm))
# Ward clustering, with the dendrogram cut into four clusters
carstree <- hclust(dist(mtcars), method = "ward.D2")
plot(carstree)
rect.hclust(carstree, k = 4, border = "red")
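To work with the four clusters programmatically, cutree extracts the assignments; cross-tabulating them with the origin vector ties the clustering back to the earlier grouping (an added step, not in the original analysis):
# Cluster membership for each car at k = 4
groups <- cutree(carstree, k = 4)
# How the clusters line up with country of origin
table(groups, cars.country)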
In this project, PCA was applied to a rather small dataset. The analysis helps to build a better understanding of the data and of the dependencies between the variables. It is always worth checking whether such an analysis can improve a model. The same workflow, using PCA to explore group structure before modelling, can be applied to problems whose datasets contain many more variables.