Homework 4 PCA

importdata <- read.table("~/Multivirate Analyses/MVA_2022_2023/Homework/Homework 3 R/CarPrice_Assignment.csv", header=TRUE, sep=",", dec=".")

customdata <- importdata %>% dplyr::select(1, 26, 4, 10, 11, 12, 13, 14, 17, 19, 20, 21, 22)
head(customdata)

##   car_ID price fueltype wheelbase carlength carwidth carheight curbweight enginesize boreratio stroke compressionratio
## 1      1 13495      gas      88.6     168.8     64.1      48.8       2548        130      3.47   2.68              9.0
## 2      2 16500      gas      88.6     168.8     64.1      48.8       2548        130      3.47   2.68              9.0
## 3      3 16500      gas      94.5     171.2     65.5      52.4       2823        152      2.68   3.47              9.0
## 4      4 13950      gas      99.8     176.6     66.2      54.3       2337        109      3.19   3.40             10.0
## 5      5 17450      gas      99.4     176.6     66.4      54.3       2824        136      3.19   3.40              8.0
## 6      6 15250      gas      99.8     177.3     66.3      53.1       2507        136      3.19   3.40              8.5
##   horsepower
## 1        111
## 2        111
## 3        154
## 4        102
## 5        115
## 6        110

Description of data:

Source of data: Car Price Prediction Multiple Linear Regression https://www.kaggle.com/datasets/hellbuoy/car-price-prediction?select=CarPrice_Assignment.csv

Unit of Observation: Cars Sample Size: 205 observations

Variables: 12

Car_ID: ID

Dependent variable - price: Price of car in Dollars

Independent variables - fueltype: Car Fuel Type (gas or diesel)

wheelbase: Length of the wheelbase in inches (distance between front and rear wheels)
carlength: Length of car in inches
carwidth: Width of car in inches
carheight: Height of car inches
curbweight: Weight of car in pounds
enginesize: Size of engine in cubic inches
boreratio: Car Bore-Stroke Ratio is the ratio between the dimensions of the engine cylinder bore diameter to its piston stroke-length
stroke: Car strokes (numerical value in strokes)
compressionratio: Car compression ratio (ratio between the volume of the cylinder with the piston in the bottom position, Vbottom (largest volume), and in the top position, Vtop (smallest volume)
horsepower: Car horsepower

Purpose: Looking at the data we have several variables that are likely describing different dimensions of the same thing. For example, Car Length, Car Height, Car Width, Curb Weight may all be describing the size of a car.

Therefore, it would be best to reduce the number of variables whilst keeping a relatively close approximation of each of the original variables contribution.

We will achieve this dimesion reduction through principle component analysis.

#Changed any missing data to NA and removed any NA's from the data.
mydata1 <- customdata %>% 
              na_if ('') %>% 
              drop_na()

#Changed fueltype to a factor.
mydata1$fuelTypeF <- factor(mydata1$fueltype, 
                            levels = c("gas","diesel"),  
                            labels = c("Gas", "Diesel"))

#Initial model 
stat1 <- lm(price ~ fuelTypeF + wheelbase + curbweight + carlength + carwidth + carheight + enginesize + boreratio + stroke + compressionratio + horsepower,
           data = mydata1)

Estimate of the potential regression model

library(car)

## Loading required package: carData

## 
## Attaching package: 'car'

## The following object is masked from 'package:purrr':
## 
##     some

## The following object is masked from 'package:dplyr':
## 
##     recode

vif(stat1)

##        fuelTypeF        wheelbase       curbweight        carlength         carwidth        carheight       enginesize 
##        48.708926         6.980363        13.876185         8.662738         5.575933         2.170805         5.372246 
##        boreratio           stroke compressionratio       horsepower 
##         1.982688         1.367405        45.161054         5.578574

mean(vif(stat1))

## [1] 13.22154

We can see that we have have an issue with Multicollinearity as our average Variance inflation factor is over 13, PCA would be appropriate to help us peform regression in the future on this dataset.

mydata1_PCA <- mydata1[, c("wheelbase", "curbweight", "carlength", "carwidth", "carheight", "enginesize", "boreratio", "stroke", "compressionratio", "horsepower")] 

round(stat.desc(mydata1_PCA, basic = FALSE), 2)

##              wheelbase curbweight carlength carwidth carheight enginesize boreratio stroke compressionratio horsepower
## median           97.00    2414.00    173.20    65.50     54.10     120.00      3.31   3.29             9.00      95.00
## mean             98.76    2555.57    174.05    65.91     53.72     126.91      3.33   3.26            10.14     104.12
## SE.mean           0.42      36.37      0.86     0.15      0.17       2.91      0.02   0.02             0.28       2.76
## CI.mean.0.95      0.83      71.70      1.70     0.30      0.34       5.73      0.04   0.04             0.55       5.45
## var              36.26  271107.87    152.21     4.60      5.97    1734.11      0.07   0.10            15.78    1563.74
## std.dev           6.02     520.68     12.34     2.15      2.44      41.64      0.27   0.31             3.97      39.54
## coef.var          0.06       0.20      0.07     0.03      0.05       0.33      0.08   0.10             0.39       0.38

We can see that the Variances of the variables are not equal, therefore we need to standardize variables

SD <- cor(mydata1_PCA) 
round(SD, 3)

##                  wheelbase curbweight carlength carwidth carheight enginesize boreratio stroke compressionratio
## wheelbase            1.000      0.776     0.875    0.795     0.589      0.569     0.489  0.161            0.250
## curbweight           0.776      1.000     0.878    0.867     0.296      0.851     0.648  0.169            0.151
## carlength            0.875      0.878     1.000    0.841     0.491      0.683     0.606  0.130            0.158
## carwidth             0.795      0.867     0.841    1.000     0.279      0.735     0.559  0.183            0.181
## carheight            0.589      0.296     0.491    0.279     1.000      0.067     0.171 -0.055            0.261
## enginesize           0.569      0.851     0.683    0.735     0.067      1.000     0.584  0.203            0.029
## boreratio            0.489      0.648     0.606    0.559     0.171      0.584     1.000 -0.056            0.005
## stroke               0.161      0.169     0.130    0.183    -0.055      0.203    -0.056  1.000            0.186
## compressionratio     0.250      0.151     0.158    0.181     0.261      0.029     0.005  0.186            1.000
## horsepower           0.353      0.751     0.553    0.641    -0.109      0.810     0.574  0.081           -0.204
##                  horsepower
## wheelbase             0.353
## curbweight            0.751
## carlength             0.553
## carwidth              0.641
## carheight            -0.109
## enginesize            0.810
## boreratio             0.574
## stroke                0.081
## compressionratio     -0.204
## horsepower            1.000

Variables Standardised

library(psych)

## 
## Attaching package: 'psych'

## The following object is masked from 'package:car':
## 
##     logit

## The following object is masked from 'package:Hmisc':
## 
##     describe

## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha

cortest.bartlett(SD, n = nrow(mydata1_PCA))

## $chisq
## [1] 1835.363
## 
## $p.value
## [1] 0
## 
## $df
## [1] 45

We check to see if the data is suitable for PCA using Bartlets test of sphericity

H0: Correlations between variables are equal to 0 H1: Correlations between variables are not equal to 0

We reject the null hypothesis with p-value < 0.0001, the correlations are high enough, the data is suitable for PCA.

KMO(SD)

## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = SD)
## Overall MSA =  0.86
## MSA for each item = 
##        wheelbase       curbweight        carlength         carwidth        carheight       enginesize        boreratio 
##             0.84             0.87             0.90             0.91             0.75             0.91             0.95 
##           stroke compressionratio       horsepower 
##             0.62             0.46             0.78

We check if the absolute average correlation is high enough for PCA

Overall MSA should be above .5, in this case it is .86 which is Meritorious.

PComponents <- PCA(mydata1_PCA,
                  scale.unit = TRUE,
                  graph = FALSE) 


get_eigenvalue(PComponents)

##        eigenvalue variance.percent cumulative.variance.percent
## Dim.1  5.32371235       53.2371235                    53.23712
## Dim.2  1.65121648       16.5121648                    69.74929
## Dim.3  1.14385156       11.4385156                    81.18780
## Dim.4  0.69111794        6.9111794                    88.09898
## Dim.5  0.45084788        4.5084788                    92.60746
## Dim.6  0.30353739        3.0353739                    95.64284
## Dim.7  0.16466212        1.6466212                    97.28946
## Dim.8  0.12253599        1.2253599                    98.51482
## Dim.9  0.08715785        0.8715785                    99.38640
## Dim.10 0.06136045        0.6136045                   100.00000

The Kaiser-Guttman criterion is that any principle component with variance less than 1 contains less information than one of the original variables and so is not worth retaining, thus we only take PCA with eigenvalue > 1.

In this case we have 3

Using 3 components we should be able to explain 81% of the variance.

fviz_eig(PComponents,
         choice = "eigenvalue",
         main = "Scree plot",
         ylab = "Eigenvalue",
         xlab = "Principal component",
         addlabels = TRUE)

We can see a slight elbow break at PC 4, so we should retain the 3 PC’s before the break.

fa.parallel(mydata1_PCA, 
            sim = FALSE, 
            fa = "pc")

## Warning in fa.stats(r = r, f = f, phi = phi, n.obs = n.obs, np.obs = np.obs, : The estimated weights for the factor
## scores are probably incorrect.  Try a different factor score estimation method.

## Parallel analysis suggests that the number of factors =  NA  and the number of components =  2

Parallel analysis indicates that we should use 2 components, however, both the Scree Plot and the Kaiser-Guttman criterion suggest we can use 3 components and so we will use 3.

PComponents <- PCA(mydata1_PCA, 
                  scale.unit = TRUE, 
                  graph = FALSE,
                  ncp = 3)

PComponents

## **Results for the Principal Component Analysis (PCA)**
## The analysis was performed on 205 individuals, described by 10 variables
## *The results are available in the following objects:
## 
##    name               description                          
## 1  "$eig"             "eigenvalues"                        
## 2  "$var"             "results for the variables"          
## 3  "$var$coord"       "coord. for the variables"           
## 4  "$var$cor"         "correlations variables - dimensions"
## 5  "$var$cos2"        "cos2 for the variables"             
## 6  "$var$contrib"     "contributions of the variables"     
## 7  "$ind"             "results for the individuals"        
## 8  "$ind$coord"       "coord. for the individuals"         
## 9  "$ind$cos2"        "cos2 for the individuals"           
## 10 "$ind$contrib"     "contributions of the individuals"   
## 11 "$call"            "summary statistics"                 
## 12 "$call$centre"     "mean of the variables"              
## 13 "$call$ecart.type" "standard error of the variables"    
## 14 "$call$row.w"      "weights for the individuals"        
## 15 "$call$col.w"      "weights for the variables"

print(PComponents$var$cor)

##                      Dim.1       Dim.2       Dim.3
## wheelbase        0.8492962  0.38097916 -0.07084381
## curbweight       0.9647616 -0.06123861  0.03463443
## carlength        0.9281465  0.17276036 -0.09464495
## carwidth         0.9147703  0.02382500  0.06713919
## carheight        0.3764917  0.72928178 -0.36011982
## enginesize       0.8529240 -0.31566997  0.14673350
## boreratio        0.7142067 -0.20400951 -0.25030963
## stroke           0.1823040  0.09463025  0.88446105
## compressionratio 0.1577357  0.66827311  0.35617704
## horsepower       0.7320866 -0.58587052  0.03409522

Weight of each PC is how much information was retained of the variable in first three components.

For example Car Width = .91^2+.0.022+0.06^2 = 0.8321x100 = 83%

With PC1, PC2 and PC3 we retained 83% of information of the variable Curbweight, and lost 17%.

print(PComponents$var$contrib)

##                       Dim.1       Dim.2      Dim.3
## wheelbase        13.5488920  8.79019347  0.4387672
## curbweight       17.4833808  0.22711540  0.1048689
## carlength        16.1814897  1.80752443  0.7831144
## carwidth         15.7184437  0.03437651  0.3940783
## carheight         2.6625411 32.20970270 11.3376846
## enginesize       13.6648891  6.03479498  1.8823002
## boreratio         9.5814951  2.52055862  5.4775387
## stroke            0.6242777  0.54232038 68.3892367
## compressionratio  0.4673535 27.04605664 11.0907821
## horsepower       10.0672373 20.78735688  0.1016289

Percentage information contributions of each variable to each Component

For example: the Variable Curbweight contributed 17.4% of information to dimension 1

fviz_pca_var(PComponents, 
             repel = TRUE)

We can see the coordinates of each variable with their weighting contribution

fviz_pca_biplot(PComponents)

Combination of coordinates of each unit in the sample and their weight

For example we can see that unit 130 is likely to be a powerful car.

head(PComponents$ind$coord)

##        Dim.1      Dim.2      Dim.3
## 1 -1.3861096 -2.1859208 -0.9463190
## 2 -1.3861096 -2.1859208 -0.9463190
## 3 -0.3883140 -1.0203233  1.4124103
## 4 -0.2384437  0.4636708  0.3160968
## 5  0.4764623 -0.1714430  0.2883191
## 6  0.1382168 -0.2718164  0.4599246

Coordinates of each unit in the sample.

#instering components into dataframe
mydata1$PC1 <- PComponents$ind$coord[ , 1] 
mydata1$PC2 <- PComponents$ind$coord[ , 2] 
mydata1$PC3 <- PComponents$ind$coord[ , 3] 

head(mydata1, 3)

##   car_ID price fueltype wheelbase carlength carwidth carheight curbweight enginesize boreratio stroke compressionratio
## 1      1 13495      gas      88.6     168.8     64.1      48.8       2548        130      3.47   2.68                9
## 2      2 16500      gas      88.6     168.8     64.1      48.8       2548        130      3.47   2.68                9
## 3      3 16500      gas      94.5     171.2     65.5      52.4       2823        152      2.68   3.47                9
##   horsepower fuelTypeF       PC1       PC2       PC3
## 1        111       Gas -1.386110 -2.185921 -0.946319
## 2        111       Gas -1.386110 -2.185921 -0.946319
## 3        154       Gas -0.388314 -1.020323  1.412410

#checking the correlation between components, must be 0
cor(x=mydata1$PC1, y=mydata1$PC2)

## [1] 6.837024e-16

The correlation is 0

the Components are orthogonal.

Conclusion

Principal components method was performed on 10 standardized variables (n= 205). The KMO measure confirms the appropriateness of the variables, KMO = 0.86

The MSA statistics for the individual variables are above 0.50 for three variables, and for the Compression Ratio intelligence variable it is only 0.46, but we left it in the analysis as it is very close to .5. Based on the Scree Plot and Kaiser-Guttman criterion, it makes the most sense to retain the first three principal components, which together summarize 81% of the information in the original 10 variables.

Based on the weighting of the principal components, we conclude that PC1 (𝜆1 = 5.32371235) represents the overall size of a car, while PC2 (𝜆2 = 1.65121648) represents power output and PC3 (𝜆2 = 1.14385156) represents the qualities of a passenger vehicle.

Homework 4 PCA

Paul Shenstone

2023-02-19