1. Importing raw. dataset

Link: https://www.kaggle.com/datasets/uciml/autompg-dataset?resource=download

mydata <- read.csv("./auto-mpg.csv", header=TRUE, sep=",", dec=".")

head(mydata)
##   mpg cylinders displacement horsepower weight acceleration model.year origin
## 1  18         8          307        130   3504         12.0         70      1
## 2  15         8          350        165   3693         11.5         70      1
## 3  18         8          318        150   3436         11.0         70      1
## 4  16         8          304        150   3433         12.0         70      1
## 5  17         8          302        140   3449         10.5         70      1
## 6  15         8          429        198   4341         10.0         70      1
##                    car.name
## 1 chevrolet chevelle malibu
## 2         buick skylark 320
## 3        plymouth satellite
## 4             amc rebel sst
## 5               ford torino
## 6          ford galaxie 500

2. Data manipulation

Rewriting all variables as numeric, as they were treated as character.

mydata$mpg <- as.numeric(mydata$mpg)
mydata$cylinders <- as.numeric(mydata$cylinders)
mydata$displacement <- as.numeric(mydata$displacement)
mydata$horsepower <- as.numeric(mydata$horsepower)
## Warning: NAs introduced by coercion
mydata$weight <- as.numeric(mydata$weight)
mydata$acceleration <- as.numeric(mydata$acceleration)

With the correction some values were dropped (NAs appeared), now I will remove observations containing NA.

library(tidyr)
mydata <- mydata %>% drop_na()

Removing variables model year, origin and car name as they are not important for PCA.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
mydata <- mydata %>% select(-c(7,8,9))

3. Explaining dataset

head(mydata)
##   mpg cylinders displacement horsepower weight acceleration
## 1  18         8          307        130   3504         12.0
## 2  15         8          350        165   3693         11.5
## 3  18         8          318        150   3436         11.0
## 4  16         8          304        150   3433         12.0
## 5  17         8          302        140   3449         10.5
## 6  15         8          429        198   4341         10.0
summary(mydata)
##       mpg          cylinders      displacement     horsepower        weight    
##  Min.   : 9.00   Min.   :3.000   Min.   : 68.0   Min.   : 46.0   Min.   :1613  
##  1st Qu.:17.00   1st Qu.:4.000   1st Qu.:105.0   1st Qu.: 75.0   1st Qu.:2225  
##  Median :22.75   Median :4.000   Median :151.0   Median : 93.5   Median :2804  
##  Mean   :23.45   Mean   :5.472   Mean   :194.4   Mean   :104.5   Mean   :2978  
##  3rd Qu.:29.00   3rd Qu.:8.000   3rd Qu.:275.8   3rd Qu.:126.0   3rd Qu.:3615  
##  Max.   :46.60   Max.   :8.000   Max.   :455.0   Max.   :230.0   Max.   :5140  
##   acceleration  
##  Min.   : 8.00  
##  1st Qu.:13.78  
##  Median :15.50  
##  Mean   :15.54  
##  3rd Qu.:17.02  
##  Max.   :24.80

This data set explains car specifications: fuel consumption (mpg), number of cylinders in the engine, displacement of the engine, horsepower of the engine, weight of the car and acceleration in seconds. It has 392 observations and 6 numeric variables. Unit of observations is one car.

4. Principal component analysis RQ: How does Audi 5000s (diesel) (ID328) compare in different characteristics compared to other cars?

Data needs to be standardized as variances of variables are different.

library(pastecs)
## 
## Attaching package: 'pastecs'
## The following objects are masked from 'package:dplyr':
## 
##     first, last
## The following object is masked from 'package:tidyr':
## 
##     extract
round(stat.desc(mydata,basic=FALSE),2)
##                mpg cylinders displacement horsepower    weight acceleration
## median       22.75      4.00       151.00      93.50   2803.50        15.50
## mean         23.45      5.47       194.41     104.47   2977.58        15.54
## SE.mean       0.39      0.09         5.29       1.94     42.90         0.14
## CI.mean.0.95  0.78      0.17        10.39       3.82     84.35         0.27
## var          60.92      2.91     10950.37    1481.57 721484.71         7.61
## std.dev       7.81      1.71       104.64      38.49    849.40         2.76
## coef.var      0.33      0.31         0.54       0.37      0.29         0.18

Correlation matrix shows high correlation between variables, which means PCA can be done (high pearson correlation coefficients).

R <- cor(mydata)
round(R,3)
##                 mpg cylinders displacement horsepower weight acceleration
## mpg           1.000    -0.778       -0.805     -0.778 -0.832        0.423
## cylinders    -0.778     1.000        0.951      0.843  0.898       -0.505
## displacement -0.805     0.951        1.000      0.897  0.933       -0.544
## horsepower   -0.778     0.843        0.897      1.000  0.865       -0.689
## weight       -0.832     0.898        0.933      0.865  1.000       -0.417
## acceleration  0.423    -0.505       -0.544     -0.689 -0.417        1.000

We conduct Bartletts test of sphericity.

We reject H0 at p<0,001, population correlation matrix is different from identity matrix. Degrees of freedom show number of pearson correlation coefficients under the main diagonal in correlation matrix (15).

library(psych)
cortest.bartlett(R,n=nrow(mydata))
## $chisq
## [1] 3206.136
## 
## $p.value
## [1] 0
## 
## $df
## [1] 15

Next we check KMO and MSA statistics, overall MSA (KMO) is bigger than 0,5 and all MSA for each variable is higher than 0,5. Data is appropriate for PCA.

KMO(R)
## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = R)
## Overall MSA =  0.85
## MSA for each item = 
##          mpg    cylinders displacement   horsepower       weight acceleration 
##         0.96         0.87         0.83         0.84         0.84         0.69

Getting eigenvalues. If we merge all 6 variables into 2 PC we catch 91,95% of information. Variance of first PC is 4,79. To choose appropriate number of components we should look at the criterias. Kaiser rule is not met, but explained below also scree plot. Number of components (I will use 2) is over 70% for tehnical data, last retained component captures more than 5% of total variance (capturees 12,14%).

library(FactoMineR)
components <- PCA(mydata,
                  scale.unit = TRUE,
                  graph = FALSE)

library(factoextra)
## Loading required package: ggplot2
## 
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
## 
##     %+%, alpha
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
get_eigenvalue(components)
##       eigenvalue variance.percent cumulative.variance.percent
## Dim.1  4.7882662        79.804436                    79.80444
## Dim.2  0.7286311        12.143852                    91.94829
## Dim.3  0.2584673         4.307789                    96.25608
## Dim.4  0.1251770         2.086284                    98.34236
## Dim.5  0.0631764         1.052940                    99.39530
## Dim.6  0.0362820         0.604700                   100.00000

Based on the scree plot the biggest break is between 1 and 2 so we should take 1 PC, for the sake of the HW I will take 2 PC so I can make the analysis. Also kaiser rule suggests that eigenvalues of PC should be more than 1 for standardized variables which our secnd PC does not meet (eigenvalue PC2 = 0,73)

library(factoextra)
fviz_eig(components,
         choice = "eigenvalue",
         main = "Scree plot",
         ylab = "Eigenvalue",
         xlab = "Principal component",
         addlabels = TRUE)

We did the parallel anlysis, which says that we should take number of PCs where empirical eigenvalues are higher than theoretical eigenvalues. Parallel anlysis again suggest to take just one PC, which doesn’t make sense so I will take 2.

library(psych)
fa.parallel(mydata,
            sim = FALSE,
            fa = "pc")

## Parallel analysis suggests that the number of factors =  NA  and the number of components =  1

Making principal component analysis.

library(FactoMineR)
components <- PCA(mydata,
                  scale.unit = TRUE,
                  graph = FALSE,
                  ncp = 2)

components
## **Results for the Principal Component Analysis (PCA)**
## The analysis was performed on 392 individuals, described by 6 variables
## *The results are available in the following objects:
## 
##    name               description                          
## 1  "$eig"             "eigenvalues"                        
## 2  "$var"             "results for the variables"          
## 3  "$var$coord"       "coord. for the variables"           
## 4  "$var$cor"         "correlations variables - dimensions"
## 5  "$var$cos2"        "cos2 for the variables"             
## 6  "$var$contrib"     "contributions of the variables"     
## 7  "$ind"             "results for the individuals"        
## 8  "$ind$coord"       "coord. for the individuals"         
## 9  "$ind$cos2"        "cos2 for the individuals"           
## 10 "$ind$contrib"     "contributions of the individuals"   
## 11 "$call"            "summary statistics"                 
## 12 "$call$centre"     "mean of the variables"              
## 13 "$call$ecart.type" "standard error of the variables"    
## 14 "$call$row.w"      "weights for the individuals"        
## 15 "$call$col.w"      "weights for the variables"

Code var$cor show loadings, which in my case shows 2 contrasts as in each PC 2 variables have negative sign. First PC show contrast between variables mpg and acceleration which we can name Driving efficiency and variables displacement, cylinders, horsepower and weight which we can name Engine power & Size. Second principal component shows contrast between variables mpg and horsepower which we can name Powertrain efficiency and variables cylinders, displacement, weight and acceleration which we can name Engine and performance.

print(components$var$cor)
##                   Dim.1       Dim.2
## mpg          -0.8730372 -0.20899050
## cylinders     0.9422769  0.12660072
## displacement  0.9705401  0.09261304
## horsepower    0.9499498 -0.14183263
## weight        0.9411562  0.24421077
## acceleration -0.6387949  0.76196704
print(components$var$contrib)
##                 Dim.1     Dim.2
## mpg          15.91795  5.994395
## cylinders    18.54295  2.199706
## displacement 19.67201  1.177163
## horsepower   18.84617  2.760861
## weight       18.49887  8.185061
## acceleration  8.52206 79.682814
library(factoextra)
fviz_pca_var(components,
             repel = TRUE)

library(factoextra)
fviz_pca_biplot(components)

head(components$ind$coord)
##      Dim.1      Dim.2
## 1 2.325970 -0.5720821
## 2 3.206057 -0.6827412
## 3 2.669984 -0.9940131
## 4 2.605465 -0.6227695
## 5 2.599901 -1.0935928
## 6 4.401453 -1.0107829

RQ answer: Audi 5000s (diesel) car ID328 is better in driving efficiency compared to engine power & size (PC1) and better in powertrain efficiency compared to engine & performance (PC2).

components$ind$coord[328,]
##     Dim.1     Dim.2 
## -2.703993 -1.682394

Conclusion

Principal component analysis was performed on 6 standardized variables (n = 392). The KMO measure confirms the appropriateness of the variables, KMO = 0.85, data falls into the category “Meritorious”. The MSA statistics for the individual variables are above 0.50 for all 6 variables. Based on the parallel analysis (and other rules as well), it makes most sense to retain only 1 PC, but for the sake of the analysis I took 2 PC (reason: all 6 variables all to correlated between each other in the “same” direction). Two principal components together summarize almost 92% of the information in the original 6 variables.

Based on the component’s loadings, we conclude that PC1 (𝜆1 = 4.79) represents contrast between Driving efficiency and Engine power & Size, while PC2 (𝜆2 = 0.73) represents the contrast between Powertrain efficiency and Engine and performance.