## Robert Sidorowski
Principal component analysis (PCA) is a statistical procedure that summarizes the information content of a large data set with a smaller set of summary indicators, which makes visualization and analysis easier.
In my PCA analysis I used an existing dataset with data about 32 car models. The data were extracted from the 1974 issue of the American automotive magazine Motor Trend. Each car is described by 11 variables, expressed in US units.
MPG - Fuel consumption in miles per gallon. Cars that have more power or are heavier tend to consume more fuel.
CYL - Number of cylinders. More powerful cars tend to have more cylinders.
DISP - Displacement: the combined volume of the engine’s cylinders.
HP - Gross horsepower: the power the engine generates.
DRAT - Rear axle ratio: how many turns of the drive shaft correspond to one turn of the wheels. Higher values tend to decrease fuel efficiency.
WT - Car weight, in thousands of pounds.
QSEC - Quarter-mile time in seconds, a measure of acceleration.
VS - Engine shape: 0 for a V-shaped block, 1 for a straight (inline) one.
AM - Transmission type: 0 for automatic, 1 for manual.
GEAR - Number of forward gears.
CARB - Number of carburetors; more carburetors are associated with more powerful engines.
Loading the dataset:
data(mtcars)
summary(mtcars)
## mpg cyl disp hp
## Min. :10.40 Min. :4.000 Min. : 71.1 Min. : 52.0
## 1st Qu.:15.43 1st Qu.:4.000 1st Qu.:120.8 1st Qu.: 96.5
## Median :19.20 Median :6.000 Median :196.3 Median :123.0
## Mean :20.09 Mean :6.188 Mean :230.7 Mean :146.7
## 3rd Qu.:22.80 3rd Qu.:8.000 3rd Qu.:326.0 3rd Qu.:180.0
## Max. :33.90 Max. :8.000 Max. :472.0 Max. :335.0
## drat wt qsec vs
## Min. :2.760 Min. :1.513 Min. :14.50 Min. :0.0000
## 1st Qu.:3.080 1st Qu.:2.581 1st Qu.:16.89 1st Qu.:0.0000
## Median :3.695 Median :3.325 Median :17.71 Median :0.0000
## Mean :3.597 Mean :3.217 Mean :17.85 Mean :0.4375
## 3rd Qu.:3.920 3rd Qu.:3.610 3rd Qu.:18.90 3rd Qu.:1.0000
## Max. :4.930 Max. :5.424 Max. :22.90 Max. :1.0000
## am gear carb
## Min. :0.0000 Min. :3.000 Min. :1.000
## 1st Qu.:0.0000 1st Qu.:3.000 1st Qu.:2.000
## Median :0.0000 Median :4.000 Median :2.000
## Mean :0.4062 Mean :3.688 Mean :2.812
## 3rd Qu.:1.0000 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :1.0000 Max. :5.000 Max. :8.000
PCA works best with numeric data, so we have to exclude the two categorical variables, vs and am. After excluding them, we are left with a matrix of 32 rows and 9 columns.
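As a quick check (added here, not part of the original write-up), we can list the columns that the index vector below keeps:
# Columns kept for the PCA; vs and am (columns 8 and 9) are dropped:
# "mpg" "cyl" "disp" "hp" "drat" "wt" "qsec" "gear" "carb"
names(mtcars)[c(1:7, 10, 11)]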
cars.pca <- prcomp(mtcars[, c(1:7, 10, 11)], center = TRUE, scale. = TRUE)
summary(cars.pca)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.3782 1.4429 0.71008 0.51481 0.42797 0.35184 0.32413
## Proportion of Variance 0.6284 0.2313 0.05602 0.02945 0.02035 0.01375 0.01167
## Cumulative Proportion 0.6284 0.8598 0.91581 0.94525 0.96560 0.97936 0.99103
## PC8 PC9
## Standard deviation 0.2419 0.14896
## Proportion of Variance 0.0065 0.00247
## Cumulative Proportion 0.9975 1.00000
After this operation we obtain 9 principal components. From the results we can see that PC1 explains about 63% of the total variance and PC2 explains a further 23%.
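These proportions can be reproduced directly from the standard deviations stored in the prcomp object; a minimal verification sketch:
# Proportion of variance per component, computed from the stored sdev values
round(cars.pca$sdev^2 / sum(cars.pca$sdev^2), 4)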
Next, let’s look at the correlation matrix.
library(corrplot)
## corrplot 0.84 loaded
cars.cor<-cor(mtcars, method="pearson")
print(cars.cor, digits=2)
## mpg cyl disp hp drat wt qsec vs am gear carb
## mpg 1.00 -0.85 -0.85 -0.78 0.681 -0.87 0.419 0.66 0.600 0.48 -0.551
## cyl -0.85 1.00 0.90 0.83 -0.700 0.78 -0.591 -0.81 -0.523 -0.49 0.527
## disp -0.85 0.90 1.00 0.79 -0.710 0.89 -0.434 -0.71 -0.591 -0.56 0.395
## hp -0.78 0.83 0.79 1.00 -0.449 0.66 -0.708 -0.72 -0.243 -0.13 0.750
## drat 0.68 -0.70 -0.71 -0.45 1.000 -0.71 0.091 0.44 0.713 0.70 -0.091
## wt -0.87 0.78 0.89 0.66 -0.712 1.00 -0.175 -0.55 -0.692 -0.58 0.428
## qsec 0.42 -0.59 -0.43 -0.71 0.091 -0.17 1.000 0.74 -0.230 -0.21 -0.656
## vs 0.66 -0.81 -0.71 -0.72 0.440 -0.55 0.745 1.00 0.168 0.21 -0.570
## am 0.60 -0.52 -0.59 -0.24 0.713 -0.69 -0.230 0.17 1.000 0.79 0.058
## gear 0.48 -0.49 -0.56 -0.13 0.700 -0.58 -0.213 0.21 0.794 1.00 0.274
## carb -0.55 0.53 0.39 0.75 -0.091 0.43 -0.656 -0.57 0.058 0.27 1.000
corrplot(cars.cor, order = "alphabet")
We can see that several variables are strongly correlated: cyl, disp, hp and wt are all positively correlated with one another, and each of them is negatively correlated with mpg.
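As a numeric complement to the plot, we can list the most strongly correlated pairs; a small sketch in base R, with the 0.8 cutoff chosen arbitrarily:
# Variable pairs with |r| > 0.8, keeping each pair once and skipping the diagonal
strong <- which(abs(cars.cor) > 0.8 & abs(cars.cor) < 1, arr.ind = TRUE)
strong <- strong[strong[, 1] < strong[, 2], , drop = FALSE]
data.frame(var1 = rownames(cars.cor)[strong[, 1]],
           var2 = colnames(cars.cor)[strong[, 2]],
           r    = round(cars.cor[strong], 2))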
Now I use the ggbiplot package, which allows us to create very informative biplots.
library(devtools)
## Loading required package: usethis
library(ggbiplot)
## Loading required package: ggplot2
## Loading required package: plyr
## Loading required package: scales
## Loading required package: grid
ggbiplot(cars.pca, labels=rownames(mtcars))
Here we can see that the variables hp, cyl and disp all contribute to PC1, with higher values in these variables shifting the samples to the right in the graph. We can also see which cars are similar: cars that lie close together have similar characteristics.
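This visual reading can be verified against the loadings stored in the rotation matrix of the prcomp result:
# Loadings of each variable on the first two components;
# large absolute PC1 values drive the left-right separation in the biplot
round(cars.pca$rotation[, 1:2], 2)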
Now we create a vector with the country of origin of each car. It will be useful for the next chart.
cars.country <- c(rep("Japan", 3), rep("US", 4), rep("Europe", 7), rep("US", 3), "Europe", rep("Japan", 3), rep("US", 4), rep("Europe", 3), "US", rep("Europe", 3))
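A quick sanity check (added) that the vector lines up with the 32 rows of mtcars:
# One label per car; the counts below follow from the rep() calls above
length(cars.country)        # 32
table(cars.country)         # Europe 14, Japan 6, US 12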
ggbiplot(cars.pca, ellipse = TRUE, labels = rownames(mtcars), groups = cars.country)
From this chart, we can see that the American cars form a distinct cluster on the right, characterized by high values of hp, cyl, disp and wt. The Japanese cars form a second cluster, located on the left and characterized by high values of mpg. The European cars form a third cluster in the middle; interestingly, they seem to strike the best balance between power and fuel economy.
It makes little sense to visualize PC3, PC4 and so on, because they explain only a small percentage of the total variance.
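If one did want to inspect them, the choices argument of ggbiplot selects which components to plot; for example:
# Biplot of PC3 vs PC4 (shown only for illustration; these components carry little variance)
ggbiplot(cars.pca, choices = c(3, 4), labels = rownames(mtcars))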
ggbiplot(cars.pca, ellipse = TRUE, circle = TRUE, labels = rownames(mtcars), groups = cars.country)
Adding circle=TRUE draws the unit correlation circle around the origin, which makes it easier to judge how well each variable is represented in the plane of the first two components.
Finally, the scree plot shows the variance of each principal component; the line version makes the sharp drop after PC2 easy to see.
plot(cars.pca)
plot(cars.pca, type = "l")
As a complementary view, we cluster the cars hierarchically on their pairwise distances.
# Euclidean distance matrix between the 32 cars (on the raw, unscaled variables)
dm <- dist(mtcars)
# Complete-linkage hierarchical clustering and its dendrogram
hc <- hclust(dm, method = "complete")
plot(hc)
# Distribution of the pairwise distances
plot(density(dm))
# Ward clustering, with the dendrogram cut into four clusters
carstree <- hclust(dist(mtcars), method = "ward.D2")
plot(carstree)
rect.hclust(carstree, k = 4, border = "red")
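To work with the four clusters programmatically, cutree extracts the assignments; cross-tabulating them with the origin vector ties the clustering back to the earlier grouping (an added step, not in the original analysis):
# Cluster membership for each car at k = 4
groups <- cutree(carstree, k = 4)
# How the clusters line up with country of origin
table(groups, cars.country)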
In this project, PCA was applied to a rather small dataset. The analysis helps to build a better understanding of the data and of the dependencies between the variables. It is always worth checking whether such an analysis can improve a model. The same workflow, using PCA to explore group structure before modelling, can be applied to problems whose datasets contain many more variables.