Link: https://www.kaggle.com/datasets/uciml/autompg-dataset?resource=download
mydata <- read.csv("./auto-mpg.csv", header=TRUE, sep=",", dec=".")
head(mydata)
## mpg cylinders displacement horsepower weight acceleration model.year origin
## 1 18 8 307 130 3504 12.0 70 1
## 2 15 8 350 165 3693 11.5 70 1
## 3 18 8 318 150 3436 11.0 70 1
## 4 16 8 304 150 3433 12.0 70 1
## 5 17 8 302 140 3449 10.5 70 1
## 6 15 8 429 198 4341 10.0 70 1
## car.name
## 1 chevrolet chevelle malibu
## 2 buick skylark 320
## 3 plymouth satellite
## 4 amc rebel sst
## 5 ford torino
## 6 ford galaxie 500
Rewriting all variables as numeric, as they were treated as character.
mydata$mpg <- as.numeric(mydata$mpg)
mydata$cylinders <- as.numeric(mydata$cylinders)
mydata$displacement <- as.numeric(mydata$displacement)
mydata$horsepower <- as.numeric(mydata$horsepower)
## Warning: NAs introduced by coercion
mydata$weight <- as.numeric(mydata$weight)
mydata$acceleration <- as.numeric(mydata$acceleration)
With the correction some values were dropped (NAs appeared), now I will remove observations containing NA.
library(tidyr)
mydata <- mydata %>% drop_na()
Removing variables model year, origin and car name as they are not important for PCA.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
mydata <- mydata %>% select(-c(7,8,9))
head(mydata)
## mpg cylinders displacement horsepower weight acceleration
## 1 18 8 307 130 3504 12.0
## 2 15 8 350 165 3693 11.5
## 3 18 8 318 150 3436 11.0
## 4 16 8 304 150 3433 12.0
## 5 17 8 302 140 3449 10.5
## 6 15 8 429 198 4341 10.0
summary(mydata)
## mpg cylinders displacement horsepower weight
## Min. : 9.00 Min. :3.000 Min. : 68.0 Min. : 46.0 Min. :1613
## 1st Qu.:17.00 1st Qu.:4.000 1st Qu.:105.0 1st Qu.: 75.0 1st Qu.:2225
## Median :22.75 Median :4.000 Median :151.0 Median : 93.5 Median :2804
## Mean :23.45 Mean :5.472 Mean :194.4 Mean :104.5 Mean :2978
## 3rd Qu.:29.00 3rd Qu.:8.000 3rd Qu.:275.8 3rd Qu.:126.0 3rd Qu.:3615
## Max. :46.60 Max. :8.000 Max. :455.0 Max. :230.0 Max. :5140
## acceleration
## Min. : 8.00
## 1st Qu.:13.78
## Median :15.50
## Mean :15.54
## 3rd Qu.:17.02
## Max. :24.80
This data set explains car specifications: fuel consumption (mpg), number of cylinders in the engine, displacement of the engine, horsepower of the engine, weight of the car and acceleration in seconds. It has 392 observations and 6 numeric variables. Unit of observations is one car.
Data needs to be standardized as variances of variables are different.
library(pastecs)
##
## Attaching package: 'pastecs'
## The following objects are masked from 'package:dplyr':
##
## first, last
## The following object is masked from 'package:tidyr':
##
## extract
round(stat.desc(mydata,basic=FALSE),2)
## mpg cylinders displacement horsepower weight acceleration
## median 22.75 4.00 151.00 93.50 2803.50 15.50
## mean 23.45 5.47 194.41 104.47 2977.58 15.54
## SE.mean 0.39 0.09 5.29 1.94 42.90 0.14
## CI.mean.0.95 0.78 0.17 10.39 3.82 84.35 0.27
## var 60.92 2.91 10950.37 1481.57 721484.71 7.61
## std.dev 7.81 1.71 104.64 38.49 849.40 2.76
## coef.var 0.33 0.31 0.54 0.37 0.29 0.18
Correlation matrix shows high correlation between variables, which means PCA can be done (high pearson correlation coefficients).
R <- cor(mydata)
round(R,3)
## mpg cylinders displacement horsepower weight acceleration
## mpg 1.000 -0.778 -0.805 -0.778 -0.832 0.423
## cylinders -0.778 1.000 0.951 0.843 0.898 -0.505
## displacement -0.805 0.951 1.000 0.897 0.933 -0.544
## horsepower -0.778 0.843 0.897 1.000 0.865 -0.689
## weight -0.832 0.898 0.933 0.865 1.000 -0.417
## acceleration 0.423 -0.505 -0.544 -0.689 -0.417 1.000
We conduct Bartletts test of sphericity.
We reject H0 at p<0,001, population correlation matrix is different from identity matrix. Degrees of freedom show number of pearson correlation coefficients under the main diagonal in correlation matrix (15).
library(psych)
cortest.bartlett(R,n=nrow(mydata))
## $chisq
## [1] 3206.136
##
## $p.value
## [1] 0
##
## $df
## [1] 15
Next we check KMO and MSA statistics, overall MSA (KMO) is bigger than 0,5 and all MSA for each variable is higher than 0,5. Data is appropriate for PCA.
KMO(R)
## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = R)
## Overall MSA = 0.85
## MSA for each item =
## mpg cylinders displacement horsepower weight acceleration
## 0.96 0.87 0.83 0.84 0.84 0.69
Getting eigenvalues. If we merge all 6 variables into 2 PC we catch 91,95% of information. Variance of first PC is 4,79. To choose appropriate number of components we should look at the criterias. Kaiser rule is not met, but explained below also scree plot. Number of components (I will use 2) is over 70% for tehnical data, last retained component captures more than 5% of total variance (capturees 12,14%).
library(FactoMineR)
components <- PCA(mydata,
scale.unit = TRUE,
graph = FALSE)
library(factoextra)
## Loading required package: ggplot2
##
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
##
## %+%, alpha
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
get_eigenvalue(components)
## eigenvalue variance.percent cumulative.variance.percent
## Dim.1 4.7882662 79.804436 79.80444
## Dim.2 0.7286311 12.143852 91.94829
## Dim.3 0.2584673 4.307789 96.25608
## Dim.4 0.1251770 2.086284 98.34236
## Dim.5 0.0631764 1.052940 99.39530
## Dim.6 0.0362820 0.604700 100.00000
Based on the scree plot the biggest break is between 1 and 2 so we should take 1 PC, for the sake of the HW I will take 2 PC so I can make the analysis. Also kaiser rule suggests that eigenvalues of PC should be more than 1 for standardized variables which our secnd PC does not meet (eigenvalue PC2 = 0,73)
library(factoextra)
fviz_eig(components,
choice = "eigenvalue",
main = "Scree plot",
ylab = "Eigenvalue",
xlab = "Principal component",
addlabels = TRUE)
We did the parallel anlysis, which says that we should take number of PCs where empirical eigenvalues are higher than theoretical eigenvalues. Parallel anlysis again suggest to take just one PC, which doesn’t make sense so I will take 2.
library(psych)
fa.parallel(mydata,
sim = FALSE,
fa = "pc")
## Parallel analysis suggests that the number of factors = NA and the number of components = 1
Making principal component analysis.
library(FactoMineR)
components <- PCA(mydata,
scale.unit = TRUE,
graph = FALSE,
ncp = 2)
components
## **Results for the Principal Component Analysis (PCA)**
## The analysis was performed on 392 individuals, described by 6 variables
## *The results are available in the following objects:
##
## name description
## 1 "$eig" "eigenvalues"
## 2 "$var" "results for the variables"
## 3 "$var$coord" "coord. for the variables"
## 4 "$var$cor" "correlations variables - dimensions"
## 5 "$var$cos2" "cos2 for the variables"
## 6 "$var$contrib" "contributions of the variables"
## 7 "$ind" "results for the individuals"
## 8 "$ind$coord" "coord. for the individuals"
## 9 "$ind$cos2" "cos2 for the individuals"
## 10 "$ind$contrib" "contributions of the individuals"
## 11 "$call" "summary statistics"
## 12 "$call$centre" "mean of the variables"
## 13 "$call$ecart.type" "standard error of the variables"
## 14 "$call$row.w" "weights for the individuals"
## 15 "$call$col.w" "weights for the variables"
Code var$cor show loadings, which in my case shows 2 contrasts as in each PC 2 variables have negative sign. First PC show contrast between variables mpg and acceleration which we can name Driving efficiency and variables displacement, cylinders, horsepower and weight which we can name Engine power & Size. Second principal component shows contrast between variables mpg and horsepower which we can name Powertrain efficiency and variables cylinders, displacement, weight and acceleration which we can name Engine and performance.
print(components$var$cor)
## Dim.1 Dim.2
## mpg -0.8730372 -0.20899050
## cylinders 0.9422769 0.12660072
## displacement 0.9705401 0.09261304
## horsepower 0.9499498 -0.14183263
## weight 0.9411562 0.24421077
## acceleration -0.6387949 0.76196704
print(components$var$contrib)
## Dim.1 Dim.2
## mpg 15.91795 5.994395
## cylinders 18.54295 2.199706
## displacement 19.67201 1.177163
## horsepower 18.84617 2.760861
## weight 18.49887 8.185061
## acceleration 8.52206 79.682814
library(factoextra)
fviz_pca_var(components,
repel = TRUE)
library(factoextra)
fviz_pca_biplot(components)
head(components$ind$coord)
## Dim.1 Dim.2
## 1 2.325970 -0.5720821
## 2 3.206057 -0.6827412
## 3 2.669984 -0.9940131
## 4 2.605465 -0.6227695
## 5 2.599901 -1.0935928
## 6 4.401453 -1.0107829
RQ answer: Audi 5000s (diesel) car ID328 is better in driving efficiency compared to engine power & size (PC1) and better in powertrain efficiency compared to engine & performance (PC2).
components$ind$coord[328,]
## Dim.1 Dim.2
## -2.703993 -1.682394
Principal component analysis was performed on 6 standardized variables (n = 392). The KMO measure confirms the appropriateness of the variables, KMO = 0.85, data falls into the category “Meritorious”. The MSA statistics for the individual variables are above 0.50 for all 6 variables. Based on the parallel analysis (and other rules as well), it makes most sense to retain only 1 PC, but for the sake of the analysis I took 2 PC (reason: all 6 variables all to correlated between each other in the “same” direction). Two principal components together summarize almost 92% of the information in the original 6 variables.
Based on the component’s loadings, we conclude that PC1 (𝜆1 = 4.79) represents contrast between Driving efficiency and Engine power & Size, while PC2 (𝜆2 = 0.73) represents the contrast between Powertrain efficiency and Engine and performance.