Introduction

Principal Component Analysis (PCA) is an unsupervised dimensionality reduction technique that transforms a set of possibly correlated variables into a smaller set of new variables, called principal components, that are uncorrelated with one another. Each component is a linear combination of the original variables, and the components are ordered so that the first captures the most variance in the data, the second captures the most of what remains while being orthogonal to the first, and so on. (https://en.wikipedia.org/wiki/Principal_component_analysis)
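The properties listed above can be checked on a toy example. This is only a sketch on simulated data (the variables a, b and c are made up for illustration and have nothing to do with the car dataset used later):

```r
# Simulated data (an assumption for illustration): a and b share a common
# factor z, c is independent noise.
set.seed(1)
n <- 200
z <- rnorm(n)
X <- cbind(a = z + rnorm(n, sd = 0.3),
           b = z + rnorm(n, sd = 0.3),
           c = rnorm(n))

p <- prcomp(X, center = TRUE, scale. = TRUE)

# Component variances are ordered from largest to smallest
p$sdev^2

# Component scores are mutually uncorrelated (off-diagonal entries ~ 0)
round(cor(p$x), 10)

# Each score is a linear combination of the standardised original variables
manual_pc1 <- scale(X) %*% p$rotation[, 1]
all.equal(as.numeric(manual_pc1), as.numeric(p$x[, 1]))
```

The last line should return TRUE, because prcomp() computes the scores as the standardised data multiplied by the rotation matrix.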

“Principal Component Regression (PCR) is a regression analysis technique that is based on principal component analysis (PCA). In PCR, instead of regressing the dependent variable on the explanatory variables directly, the principal components of the explanatory variables are used as regressors. […] One major use of PCR lies in overcoming the multicollinearity problem which arises when two or more of the explanatory variables are close to being collinear. PCR can lead to efficient prediction of the outcome based on the assumed model.” (https://en.wikipedia.org/wiki/Principal_component_regression)
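As a rough illustration of the idea, here is a minimal PCR sketch on simulated data. The predictors x1, x2, x3, the response y and the choice of k = 2 components are all assumptions made up for this example:

```r
# Simulated predictors: x1 and x2 are nearly collinear, x3 is independent.
set.seed(42)
n <- 150
z <- rnorm(n)
X <- data.frame(x1 = z + rnorm(n, sd = 0.1),
                x2 = z + rnorm(n, sd = 0.1),
                x3 = rnorm(n))
y <- 2 * z + X$x3 + rnorm(n)

cor(X$x1, X$x2)   # close to 1: multicollinearity

# Step 1: PCA on the standardised predictors
p <- prcomp(X, center = TRUE, scale. = TRUE)

# Step 2: regress y on the first k component scores instead of x1, x2, x3
k <- 2
pcr_data <- data.frame(y = y, p$x[, 1:k])
pcr_fit <- lm(y ~ ., data = pcr_data)
summary(pcr_fit)$r.squared

# The regressors are now uncorrelated by construction
cor(pcr_data$PC1, pcr_data$PC2)
```

The number of components k is a tuning choice; here it is simply fixed by hand.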

PCR has several potential drawbacks:

PCA does not consider the target variable when creating components, meaning it might discard components with low variance that are actually important for prediction.
The resulting regression coefficients are for the principal components, not the original variables, making it harder to explain the direct impact of each original predictor. If the standard PCs are not interpretable, rotated PCA can mitigate this issue: it rotates the component axes so that each component is tied more clearly to a smaller group of variables.
PCA doesn’t work on categorical variables and requires numerical input. It is not advised to one-hot encode categorical variables in order to use them in PCA.
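The first drawback is easy to reproduce. In the simulated sketch below (all names and numbers are made up for illustration), the response depends only on the difference between two highly correlated predictors, which is exactly the direction PCA treats as least important:

```r
set.seed(7)
n  <- 300
z  <- rnorm(n)
x1 <- z + rnorm(n, sd = 0.2)
x2 <- z + rnorm(n, sd = 0.2)
y  <- (x1 - x2) + rnorm(n, sd = 0.1)   # y depends on the difference only

p <- prcomp(cbind(x1, x2), scale. = TRUE)

# PC1 (roughly x1 + x2) carries almost all of the variance
summary(p)$importance["Proportion of Variance", ]

# ...but it is PC2 (roughly x1 - x2) that predicts y
r2_pc1 <- summary(lm(y ~ p$x[, 1]))$r.squared
r2_pc2 <- summary(lm(y ~ p$x[, 2]))$r.squared
c(PC1 = r2_pc1, PC2 = r2_pc2)
```

A PCR that kept only the first component would throw away nearly all of the predictive signal in this constructed case.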

Installing and loading necessary packages

library(readr)
library(dplyr)
library(corrplot)
library(e1071)
library(forecast)
library(clusterSim)
library(factoextra)
library(psych)
library(stats)
library(gridExtra)
library(wesanderson)
library(caret)
library(tidymodels)
library(plotly)

Dataset

Source: https://www.kaggle.com/datasets/hellbuoy/car-price-prediction?select=CarPrice_Assignment.csv

data <- read.csv("CarPrice_Assignment.csv")

missing.data <- data[!complete.cases(data),]
missing.data

##  [1] car_ID           symboling        CarName          fueltype        
##  [5] aspiration       doornumber       carbody          drivewheel      
##  [9] enginelocation   wheelbase        carlength        carwidth        
## [13] carheight        curbweight       enginetype       cylindernumber  
## [17] enginesize       fuelsystem       boreratio        stroke          
## [21] compressionratio horsepower       peakrpm          citympg         
## [25] highwaympg       price           
## <0 rows> (or 0-length row.names)

data <- data[complete.cases(data),]

head(data)

##   car_ID symboling                  CarName fueltype aspiration doornumber
## 1      1         3       alfa-romero giulia      gas        std        two
## 2      2         3      alfa-romero stelvio      gas        std        two
## 3      3         1 alfa-romero Quadrifoglio      gas        std        two
## 4      4         2              audi 100 ls      gas        std       four
## 5      5         2               audi 100ls      gas        std       four
## 6      6         2                 audi fox      gas        std        two
##       carbody drivewheel enginelocation wheelbase carlength carwidth carheight
## 1 convertible        rwd          front      88.6     168.8     64.1      48.8
## 2 convertible        rwd          front      88.6     168.8     64.1      48.8
## 3   hatchback        rwd          front      94.5     171.2     65.5      52.4
## 4       sedan        fwd          front      99.8     176.6     66.2      54.3
## 5       sedan        4wd          front      99.4     176.6     66.4      54.3
## 6       sedan        fwd          front      99.8     177.3     66.3      53.1
##   curbweight enginetype cylindernumber enginesize fuelsystem boreratio stroke
## 1       2548       dohc           four        130       mpfi      3.47   2.68
## 2       2548       dohc           four        130       mpfi      3.47   2.68
## 3       2823       ohcv            six        152       mpfi      2.68   3.47
## 4       2337        ohc           four        109       mpfi      3.19   3.40
## 5       2824        ohc           five        136       mpfi      3.19   3.40
## 6       2507        ohc           five        136       mpfi      3.19   3.40
##   compressionratio horsepower peakrpm citympg highwaympg price
## 1              9.0        111    5000      21         27 13495
## 2              9.0        111    5000      21         27 16500
## 3              9.0        154    5000      19         26 16500
## 4             10.0        102    5500      24         30 13950
## 5              8.0        115    5500      18         22 17450
## 6              8.5        110    5500      19         25 15250

car_ID: A unique identifier for each row
symboling: The assigned insurance risk rating
CarName: The make and model of the car
wheelbase: The distance between the centers of the front and rear wheels
carlength, carwidth, carheight: The external dimensions of the vehicle
curbweight: The total weight of the vehicle with standard equipment and all necessary operating consumables
carbody: The shape of the car
doornumber: Whether it’s a two-door or four-door vehicle
enginetype: The internal architecture
cylindernumber: Number of cylinders
enginesize: The total volume of the cylinders in the engine
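boreratio: Nominally the ratio between the cylinder bore diameter and the piston stroke; note, however, that the values (2.54 to 3.94) do not equal bore divided by the stroke column, so the variable appears to record the bore diameter itself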
stroke: The distance the piston travels inside the cylinder
compressionratio: The ratio of the volume of the combustion chamber from its largest capacity to its smallest capacity
enginelocation: Front or Rear
horsepower: The power the engine produces
peakrpm: The engine speed at which maximum power is reached
aspiration: Standard (natural) vs. Turbocharged
fueltype: Gas vs. Diesel
fuelsystem: How fuel is delivered
citympg, highwaympg: Fuel efficiency (miles per gallon) in city and highway driving conditions
drivewheel: Which wheels receive power (fwd: Front Wheel Drive; rwd: Rear Wheel Drive; 4wd: Four Wheel Drive)

Many of these variables may be correlated, like carwidth and wheelbase.

This dataset has mixed data types, which require careful handling. PCA assumes continuous, linearly related and scaled variables. In order to perform correct PCA and PCR, we need to make sure variables are numerical and categorical ones are not included in the process of principal component extraction. This will lead to a partial PCR instead of a full one, as we will use categorical variables in the linear regression model together with PCs.

data.numeric <- data[, sapply(data, is.numeric) & names(data) != "car_ID"]
data.categorical <- data[, sapply(data, is.character) & names(data) != "CarName"]

summary(data.numeric)

##    symboling         wheelbase        carlength        carwidth    
##  Min.   :-2.0000   Min.   : 86.60   Min.   :141.1   Min.   :60.30  
##  1st Qu.: 0.0000   1st Qu.: 94.50   1st Qu.:166.3   1st Qu.:64.10  
##  Median : 1.0000   Median : 97.00   Median :173.2   Median :65.50  
##  Mean   : 0.8341   Mean   : 98.76   Mean   :174.0   Mean   :65.91  
##  3rd Qu.: 2.0000   3rd Qu.:102.40   3rd Qu.:183.1   3rd Qu.:66.90  
##  Max.   : 3.0000   Max.   :120.90   Max.   :208.1   Max.   :72.30  
##    carheight       curbweight     enginesize      boreratio        stroke     
##  Min.   :47.80   Min.   :1488   Min.   : 61.0   Min.   :2.54   Min.   :2.070  
##  1st Qu.:52.00   1st Qu.:2145   1st Qu.: 97.0   1st Qu.:3.15   1st Qu.:3.110  
##  Median :54.10   Median :2414   Median :120.0   Median :3.31   Median :3.290  
##  Mean   :53.72   Mean   :2556   Mean   :126.9   Mean   :3.33   Mean   :3.255  
##  3rd Qu.:55.50   3rd Qu.:2935   3rd Qu.:141.0   3rd Qu.:3.58   3rd Qu.:3.410  
##  Max.   :59.80   Max.   :4066   Max.   :326.0   Max.   :3.94   Max.   :4.170  
##  compressionratio   horsepower       peakrpm        citympg     
##  Min.   : 7.00    Min.   : 48.0   Min.   :4150   Min.   :13.00  
##  1st Qu.: 8.60    1st Qu.: 70.0   1st Qu.:4800   1st Qu.:19.00  
##  Median : 9.00    Median : 95.0   Median :5200   Median :24.00  
##  Mean   :10.14    Mean   :104.1   Mean   :5125   Mean   :25.22  
##  3rd Qu.: 9.40    3rd Qu.:116.0   3rd Qu.:5500   3rd Qu.:30.00  
##  Max.   :23.00    Max.   :288.0   Max.   :6600   Max.   :49.00  
##    highwaympg        price      
##  Min.   :16.00   Min.   : 5118  
##  1st Qu.:25.00   1st Qu.: 7788  
##  Median :30.00   Median :10295  
##  Mean   :30.75   Mean   :13277  
##  3rd Qu.:34.00   3rd Qu.:16503  
##  Max.   :54.00   Max.   :45400

summary(data.categorical)

##    fueltype          aspiration         doornumber          carbody         
##  Length:205         Length:205         Length:205         Length:205        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##   drivewheel        enginelocation      enginetype        cylindernumber    
##  Length:205         Length:205         Length:205         Length:205        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##   fuelsystem       
##  Length:205        
##  Class :character  
##  Mode  :character

colnames(data.numeric)

##  [1] "symboling"        "wheelbase"        "carlength"        "carwidth"        
##  [5] "carheight"        "curbweight"       "enginesize"       "boreratio"       
##  [9] "stroke"           "compressionratio" "horsepower"       "peakrpm"         
## [13] "citympg"          "highwaympg"       "price"

head(data.numeric)

##   symboling wheelbase carlength carwidth carheight curbweight enginesize
## 1         3      88.6     168.8     64.1      48.8       2548        130
## 2         3      88.6     168.8     64.1      48.8       2548        130
## 3         1      94.5     171.2     65.5      52.4       2823        152
## 4         2      99.8     176.6     66.2      54.3       2337        109
## 5         2      99.4     176.6     66.4      54.3       2824        136
## 6         2      99.8     177.3     66.3      53.1       2507        136
##   boreratio stroke compressionratio horsepower peakrpm citympg highwaympg price
## 1      3.47   2.68              9.0        111    5000      21         27 13495
## 2      3.47   2.68              9.0        111    5000      21         27 16500
## 3      2.68   3.47              9.0        154    5000      19         26 16500
## 4      3.19   3.40             10.0        102    5500      24         30 13950
## 5      3.19   3.40              8.0        115    5500      18         22 17450
## 6      3.19   3.40              8.5        110    5500      19         25 15250

# Data is still not ready for dimension reduction, as symboling isn't a continuous variable, but an ordinal one.
data.numeric <- data[, sapply(data, is.numeric) & names(data) != "symboling" & names(data) != "car_ID" & names(data) != "price"]
# The other numerical variables have values with physical meaning and are truly continuous. Price, as a regressand, is not undergoing PCA.

Linear correlation

PCA relies on linear combinations. If two variables are strongly linearly correlated, PCA can efficiently compress that relationship into a single component.

cor_matrix <- cor(data.numeric)
corrplot(cor_matrix, order = "hclust")

As we can observe, the variables in this dataset are linearly correlated, some very strongly (positively, like carlength and carwidth, or negatively, like horsepower and highwaympg). This bodes well for our PCA and PCR, as there is a lot of redundancy in the information carried by this data.

Data preparation for PCA

PCA is sensitive to the distribution and scale of the variables used. Because it operates on variance, heavily skewed variables can disproportionately dominate the components, so it is advisable to correct for skewness beforehand, typically through a logarithmic or Box-Cox transformation. Additionally, PCA will treat larger-magnitude variables as more important because of their bigger ranges. Standardising the data ensures every variable enters the analysis as equally important and the components don’t reflect arbitrary differences in measurement scale.
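The effect of scale is easy to see on R's built-in mtcars data, used here only as a stand-in for the car dataset (the choice of the four columns is arbitrary):

```r
vars <- mtcars[, c("mpg", "disp", "hp", "wt")]

# The variances differ by several orders of magnitude
sapply(vars, var)

p_raw    <- prcomp(vars, center = TRUE, scale. = FALSE)
p_scaled <- prcomp(vars, center = TRUE, scale. = TRUE)

# Without scaling, PC1 is driven almost entirely by the variable
# with the largest variance (disp)
round(p_raw$rotation[, 1], 3)

# With scaling, the four variables enter PC1 far more evenly
round(p_scaled$rotation[, 1], 3)
```

In the unscaled version the component mostly reflects the units of measurement rather than the structure of the data.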

Distribution adjustments

par(mfrow=c(1, 2))

hist(data$enginesize, col = "lightblue", breaks = 25, freq = FALSE, main = "enginesize")
hist(log(data$enginesize), col = "lightblue", breaks = 25, freq = FALSE, main = "Log enginesize")

hist(data$horsepower, col = "lightblue", breaks = 25, freq = FALSE, main = "horsepower")
hist(log(data$horsepower), col = "lightblue", breaks = 25, freq = FALSE, main = "Log horsepower")

par(mfrow=c(1, 1))

# Loop over all variables: check skewness and apply a Box-Cox transformation
# only where the skewness exceeds 1 (strong right skew):

columns_to_process <- names(data.numeric)
data.transformed <- data.numeric

for (col_name in columns_to_process) {
  val <- data.numeric[[col_name]]
  skew_val <- skewness(val, na.rm = TRUE)
  if (!is.na(skew_val) && skew_val > 1) {
    lambda <- BoxCox.lambda(val, method = "guerrero")
    data.transformed[[col_name]] <- BoxCox(val, lambda)
    print(col_name)
  } else {
    data.transformed[[col_name]] <- val
  }
}

## [1] "wheelbase"
## [1] "enginesize"
## [1] "compressionratio"
## [1] "horsepower"
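To see why such a transformation helps, here is a small self-contained sketch on simulated right-skewed data. The skew() helper is a plain moment-based estimator written for this example (its value may differ slightly from e1071::skewness), and the log transform is used because it is the Box-Cox transformation with lambda = 0:

```r
# Hypothetical helper: moment-based sample skewness
skew <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

set.seed(3)
x <- rlnorm(1000, meanlog = 0, sdlog = 0.8)   # right-skewed by construction

skew(x)        # well above the threshold of 1 used in the loop
skew(log(x))   # close to 0 after the transformation
```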

Normalization

data.transformed.matrix <- as.matrix(data.transformed) # convert to a matrix for clusterSim
data.transformed.normalized.matrix <- data.Normalization(data.transformed.matrix, type="n1", normalization="column")  # z-score normalization: Z = (X - mu) / sigma, where mu is the mean and sigma is the standard deviation

Non-skewed, normalized continuous data is ready for PCA.
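For reference, the z-score formula can be written out by hand and compared with base R's scale(). This sketch uses three mtcars columns as stand-in data and assumes that clusterSim's "n1" option implements the same formula, as the comment in the code above states:

```r
m <- as.matrix(mtcars[, c("mpg", "hp", "wt")])

# Z = (X - mu) / sigma, applied column by column
z_manual <- sweep(sweep(m, 2, colMeans(m), "-"), 2, apply(m, 2, sd), "/")

# Base R equivalent
z_scale <- scale(m)
all.equal(z_manual, z_scale, check.attributes = FALSE)

# After standardisation every column has mean 0 and standard deviation 1
round(colMeans(z_manual), 10)
apply(z_manual, 2, sd)
```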

Principal Component Analysis

pca <- prcomp(data.transformed.normalized.matrix)

fviz_eig(pca)

summary(pca)

## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6    PC7
## Standard deviation     2.6135 1.5176 1.05504 0.92399 0.78459 0.63103 0.5348
## Proportion of Variance 0.5254 0.1772 0.08562 0.06567 0.04735 0.03063 0.0220
## Cumulative Proportion  0.5254 0.7026 0.78821 0.85388 0.90123 0.93186 0.9539
##                            PC8     PC9    PC10    PC11    PC12    PC13
## Standard deviation     0.48393 0.35554 0.28930 0.27119 0.24546 0.14724
## Proportion of Variance 0.01801 0.00972 0.00644 0.00566 0.00463 0.00167
## Cumulative Proportion  0.97188 0.98160 0.98804 0.99370 0.99833 1.00000

The first two principal components explain 70.26% of the variance, and the first three explain 78.82%.
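These proportions come directly from the component standard deviations. A short sketch on stand-in data (six mtcars columns, chosen arbitrarily) shows the arithmetic behind the summary table:

```r
p <- prcomp(mtcars[, c("mpg", "disp", "hp", "drat", "wt", "qsec")], scale. = TRUE)

eig  <- p$sdev^2          # variance of each component (eigenvalues)
prop <- eig / sum(eig)    # proportion of variance
cum  <- cumsum(prop)      # cumulative proportion

round(rbind(prop, cum), 4)

# For standardised data the total variance equals the number of variables
sum(eig)
```

Applied to the output above, the same arithmetic gives 0.5254 + 0.1772 = 0.7026 for the first two components.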

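Loadings and contributions of individual variables

A loading describes the relationship between an original variable and a component; its sign and magnitude indicate what the component represents. Note that pca$rotation contains the unit-length eigenvectors: multiplying each column by the corresponding component’s standard deviation gives the correlation between each standardised variable and that component. A contribution is a variable’s proportional share in the component’s variance, i.e. how much each variable drives the construction of a given PC, regardless of direction.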

pca$rotation

##                          PC1          PC2         PC3         PC4         PC5
## wheelbase         0.30504048 -0.280887732  0.06510167 -0.28512190  0.02365686
## carlength         0.35019252 -0.157555656  0.06258819 -0.15560531 -0.04302767
## carwidth          0.33719981 -0.101463577 -0.09037191 -0.05774731 -0.24578269
## carheight         0.12281060 -0.407135389  0.38198107 -0.50554530  0.24952836
## curbweight        0.36495689 -0.059665847 -0.05985525  0.04127834 -0.12000685
## enginesize        0.33414215 -0.007049031 -0.16829742  0.24522748  0.07383858
## boreratio         0.27980520  0.008026666  0.16839760  0.45420961  0.05182884
## stroke            0.05903472 -0.073163294 -0.85729018 -0.26951163  0.27538350
## compressionratio -0.02458476 -0.508817538 -0.16684749  0.16930830 -0.69092082
## horsepower        0.32353894  0.298497640 -0.07382732  0.05534073 -0.07889889
## peakrpm          -0.08394159  0.458991228  0.01338497 -0.50505623 -0.54066308
## citympg          -0.32398347 -0.300224281 -0.06152902  0.05412686 -0.01882277
## highwaympg       -0.33408899 -0.249786957 -0.07282489  0.07112647 -0.01648886
##                           PC6         PC7         PC8          PC9         PC10
## wheelbase         0.090683769 -0.42880237 -0.05710075  0.431306917  0.544947011
## carlength         0.056543486 -0.12378014  0.01607265  0.441749559 -0.715452449
## carwidth          0.245568902 -0.43311685  0.20593248 -0.690049682 -0.070853035
## carheight        -0.317476453  0.38307226  0.09792399 -0.299618476  0.032838241
## curbweight        0.138846721  0.11238580  0.02098189 -0.009802569 -0.005050576
## enginesize       -0.000313841  0.36222141  0.64521123  0.131065350  0.015290398
## boreratio        -0.750154577 -0.30678697 -0.12529514 -0.065708630 -0.028872640
## stroke           -0.277384276 -0.01081291 -0.15714525 -0.059827483 -0.046090306
## compressionratio -0.098177623  0.31568195 -0.25857023  0.037641993  0.067838543
## horsepower       -0.060645611  0.19306744  0.12497610  0.082210426  0.384020555
## peakrpm          -0.361490176 -0.06353547  0.22413078  0.076935912 -0.039647385
## citympg          -0.083103448 -0.21243604  0.38622049  0.083903808  0.139783467
## highwaympg       -0.132790819 -0.20568519  0.45329256  0.081155722 -0.093057020
##                          PC11        PC12         PC13
## wheelbase         0.026202793 -0.21828841  0.123592793
## carlength        -0.232825070  0.15098988 -0.140578743
## carwidth         -0.169901370 -0.05415989 -0.028275625
## carheight        -0.043459606  0.07430743  0.004864219
## curbweight        0.770514530  0.45402011  0.106115684
## enginesize        0.041723068 -0.47674564 -0.021107550
## boreratio         0.070776965 -0.03600333 -0.016982288
## stroke            0.006543094  0.01151734 -0.010502492
## compressionratio -0.132751562 -0.08065059  0.018065463
## horsepower       -0.501779873  0.57278869 -0.028485092
## peakrpm           0.163384128 -0.12365728 -0.016741812
## citympg           0.104706383  0.25322165 -0.703537377
## highwaympg       -0.082469534  0.26878255  0.675019710

var <- get_pca_var(pca)
a <- fviz_contrib(pca, "var", axes=1, xtickslab.rt=90)
b <- fviz_contrib(pca, "var", axes=2, xtickslab.rt=90)
grid.arrange(a,b,top='Contribution to the first two Principal Components')
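The link between eigenvectors, correlations and contributions can be verified numerically. The sketch below uses mtcars as stand-in data; the contribution formula is my understanding of what factoextra plots (squared coordinates rescaled to sum to 100% per component), so treat it as an assumption rather than the package's exact code:

```r
vars <- mtcars[, c("mpg", "disp", "hp", "drat", "wt", "qsec")]
p <- prcomp(vars, scale. = TRUE)

# Correlation between each standardised variable and each component
cor_var_pc <- cor(scale(vars), p$x)

# The same numbers from the eigenvectors scaled by the component std. deviations
scaled_loadings <- sweep(p$rotation, 2, p$sdev, "*")
all.equal(cor_var_pc, scaled_loadings, check.attributes = FALSE)

# Contribution (%) of each variable to each component:
# squared eigenvector entries, so every column sums to 100
contrib <- 100 * p$rotation^2
colSums(contrib)
round(contrib[, 1:2], 1)
```

With k variables, a variable contributes "more than average" to a component when its share exceeds 100 / k percent.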

Principal Components interpretation

(By the way, I recommend the https://en.wikipedia.org/wiki/Glossary_of_automotive_terms for this part)

The first principal component is dominated by variables related to a car’s size and power, like curbweight, carlength, carwidth, enginesize, wheelbase, boreratio and horsepower. These all load positively and contribute more than average. Citympg and highwaympg are also highly contributing variables, but they load negatively, meaning this component is negatively associated with fuel efficiency. This reflects a real-world relationship: larger, more powerful cars tend to be less fuel efficient. PC1 can therefore be interpreted as a size-and-power axis, where a high score indicates a large, powerful, and inefficient vehicle, and a low score a small, economical one.

The second principal component has a very high contribution from compressionratio, peakrpm and carheight. Citympg and horsepower contribute only a little above average. Compressionratio has a strongly negative loading and peakrpm a strongly positive one. Carheight and citympg load negatively and horsepower positively. The positive loadings of peakrpm and horsepower against the negative loadings of compressionratio and carheight suggest that this component separates high-revving, powerful cars with low rooflines from tall, high-compression vehicles. High-compression, low-revving engines are typically found in diesel or economy-oriented cars, which also tend to have taller, more upright body styles and better fuel efficiency; this is consistent with the signs of the loadings for citympg and carheight. PC2 can therefore be interpreted as an axis contrasting sporty vehicles against economy-oriented ones.

What is notable is that these two components are interpretable without resorting to rotation techniques such as Varimax, which are often needed to achieve meaningful loadings. The data has produced components that align well with intuitive automotive concepts. This reflects car design trade-offs vividly.

plot(pca$x[,1], pca$x[,2],
     xlab = "PC1", ylab = "PC2",
     main = "PCA",
     pch = 19, 
     col = rgb(0, 0, 1, 0.2))

According to our interpretation of the PCs, vehicles on the far right of this chart are likely the largest, heaviest, and most powerful cars in the dataset (e.g., luxury sedans or large SUVs). Vehicles on the far left are small, lightweight, and highly fuel-efficient cars. Vehicles at the top are performance-oriented, with high peak RPM and lower profiles, while those at the bottom are taller cars with high compression ratios (typical of diesel and economy-oriented engines).

fviz_pca_var(pca, col.var="blue")

On this loading plot, positively correlated variables are grouped together and negatively correlated variables are positioned on opposite sides.

fviz_pca_ind(pca, col.ind="cos2", geom="point", gradient.cols=c("white", "#2E9FDF", "#FC4E07"))

The cos2 (squared cosine) measures the quality of representation. A high cos2 means that a car is well represented by the two PCs on the chart, while a low cos2 indicates that its original features are reflected less accurately. This plot makes it clear that the first two principal components may not be enough to represent the original dataset well. The presence of several low cos2 values (the blue and white points) indicates that certain vehicle characteristics are better explained by later components.
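For individuals, cos2 can be computed directly from the score matrix: it is the share of an observation's squared distance from the origin that falls on the chosen components. A minimal sketch on stand-in data (mtcars columns chosen arbitrarily):

```r
p <- prcomp(mtcars[, c("mpg", "disp", "hp", "drat", "wt", "qsec")], scale. = TRUE)

# cos2 of every observation on every component; each row sums to 1
cos2_all <- p$x^2 / rowSums(p$x^2)
range(rowSums(cos2_all))

# Quality of representation on the PC1-PC2 plane
cos2_2d <- rowSums(cos2_all[, 1:2])

# The three cars that the first two components represent worst
head(sort(cos2_2d), 3)
```

Observations with a low value here sit close to the origin of the 2D plot even though they may be far from it along later components.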

plot(pca, type = "l", main = "")

My hypothesis is that three PCs will be the optimal choice for regression. The scree plot gets visibly flatter after the third component, so a three-component model should retain most of the information while still mitigating the high multicollinearity of this dataset.

# This Claude generated code provides a wonderful visualisation of the first three PCs
pca_scores <- as.data.frame(pca$x[, 1:3])

cos2_3d <- rowSums(pca$x[, 1:3]^2) / rowSums(pca$x^2)

plot_ly(
  data = pca_scores,
  x = ~PC1, y = ~PC2, z = ~PC3,
  type = "scatter3d",
  mode = "markers",
  marker = list(
    size = 3,
    color = cos2_3d,
    colorscale = list(
      c(0, "white"),
      c(0.5, "#2E9FDF"),
      c(1, "#FC4E07")
    ),
    colorbar = list(title = "cos2"),
    showscale = TRUE
  )
) %>%
  layout(
    scene = list(
      xaxis = list(title = "PC1"),
      yaxis = list(title = "PC2"),
      zaxis = list(title = "PC3")
    ),
    title = "Individuals - PCA"
  )

Ordinary Least Squares vs Partial Principal Component Regression

Ordinary Least Squares

Basic estimation of the simplest possible regression model, for comparison with partial PCR.

ols <- lm(price ~ symboling + fueltype + aspiration + 
                          doornumber + carbody + drivewheel + enginelocation + wheelbase + 
                          carlength + carwidth + carheight + curbweight + enginetype + 
                          cylindernumber + enginesize + fuelsystem + boreratio + stroke + 
                          compressionratio + horsepower + peakrpm + citympg + highwaympg, data)

summary(ols)

## 
## Call:
## lm(formula = price ~ symboling + fueltype + aspiration + doornumber + 
##     carbody + drivewheel + enginelocation + wheelbase + carlength + 
##     carwidth + carheight + curbweight + enginetype + cylindernumber + 
##     enginesize + fuelsystem + boreratio + stroke + compressionratio + 
##     horsepower + peakrpm + citympg + highwaympg, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5416.2 -1152.0   -35.8   830.8  9835.6 
## 
## Coefficients: (2 not defined because of singularities)
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          -2.226e+04  1.652e+04  -1.347 0.179705    
## symboling             7.388e+01  2.386e+02   0.310 0.757238    
## fueltypegas          -1.178e+04  7.017e+03  -1.678 0.095232 .  
## aspirationturbo       1.626e+03  8.856e+02   1.836 0.068172 .  
## doornumbertwo         1.876e+02  5.854e+02   0.320 0.749028    
## carbodyhardtop       -3.207e+03  1.376e+03  -2.331 0.020992 *  
## carbodyhatchback     -3.281e+03  1.223e+03  -2.683 0.008055 ** 
## carbodysedan         -2.152e+03  1.332e+03  -1.615 0.108182    
## carbodywagon         -3.266e+03  1.455e+03  -2.244 0.026191 *  
## drivewheelfwd         7.405e+01  1.040e+03   0.071 0.943351    
## drivewheelrwd         1.033e+03  1.205e+03   0.857 0.392688    
## enginelocationrear    7.695e+03  2.536e+03   3.035 0.002802 ** 
## wheelbase             4.882e+01  9.675e+01   0.505 0.614563    
## carlength            -6.130e+01  4.875e+01  -1.257 0.210410    
## carwidth              6.936e+02  2.394e+02   2.897 0.004283 ** 
## carheight             8.943e+01  1.278e+02   0.700 0.485209    
## curbweight            3.942e+00  1.715e+00   2.299 0.022781 *  
## enginetypedohcv      -7.189e+03  4.674e+03  -1.538 0.125912    
## enginetypel          -1.051e+03  1.608e+03  -0.654 0.514246    
## enginetypeohc         3.126e+03  9.088e+02   3.439 0.000741 ***
## enginetypeohcf        1.234e+03  1.572e+03   0.785 0.433661    
## enginetypeohcv       -5.605e+03  1.247e+03  -4.495 1.31e-05 ***
## enginetyperotor      -6.925e+01  4.505e+03  -0.015 0.987754    
## cylindernumberfive   -9.280e+03  2.716e+03  -3.417 0.000800 ***
## cylindernumberfour   -9.879e+03  3.054e+03  -3.234 0.001476 ** 
## cylindernumbersix    -6.570e+03  2.192e+03  -2.997 0.003154 ** 
## cylindernumberthree  -4.629e+02  4.499e+03  -0.103 0.918173    
## cylindernumbertwelve -1.024e+04  4.384e+03  -2.336 0.020707 *  
## cylindernumbertwo            NA         NA      NA       NA    
## enginesize            1.174e+02  2.600e+01   4.515 1.21e-05 ***
## fuelsystem2bbl       -3.907e+01  8.920e+02  -0.044 0.965118    
## fuelsystem4bbl       -1.624e+03  2.775e+03  -0.585 0.559295    
## fuelsystemidi                NA         NA      NA       NA    
## fuelsystemmfi        -3.480e+03  2.590e+03  -1.344 0.180967    
## fuelsystemmpfi       -2.444e+02  1.001e+03  -0.244 0.807415    
## fuelsystemspdi       -3.027e+03  1.382e+03  -2.191 0.029883 *  
## fuelsystemspfi       -6.187e+02  2.508e+03  -0.247 0.805484    
## boreratio            -1.882e+03  1.598e+03  -1.178 0.240443    
## stroke               -4.454e+03  9.009e+02  -4.944 1.89e-06 ***
## compressionratio     -8.003e+02  5.259e+02  -1.522 0.129981    
## horsepower            9.791e+00  2.227e+01   0.440 0.660789    
## peakrpm               2.202e+00  6.194e-01   3.555 0.000495 ***
## citympg              -1.477e+02  1.474e+02  -1.003 0.317569    
## highwaympg            1.916e+02  1.347e+02   1.422 0.156916    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2197 on 163 degrees of freedom
## Multiple R-squared:  0.9395, Adjusted R-squared:  0.9243 
## F-statistic: 61.79 on 41 and 163 DF,  p-value: < 2.2e-16

NAs appear! This is not surprising, as the dataset is quite small. It is also a good reason to use PCR, since this is a lot of regressors for ~200 observations.

# Checking for perfect multicollinearity
alias_info <- alias(ols)
print(alias_info)

## Model :
## price ~ symboling + fueltype + aspiration + doornumber + carbody + 
##     drivewheel + enginelocation + wheelbase + carlength + carwidth + 
##     carheight + curbweight + enginetype + cylindernumber + enginesize + 
##     fuelsystem + boreratio + stroke + compressionratio + horsepower + 
##     peakrpm + citympg + highwaympg
## 
## Complete :
##                   (Intercept) symboling fueltypegas aspirationturbo
## cylindernumbertwo  0           0         0           0             
## fuelsystemidi      1           0        -1           0             
##                   doornumbertwo carbodyhardtop carbodyhatchback carbodysedan
## cylindernumbertwo  0             0              0                0          
## fuelsystemidi      0             0              0                0          
##                   carbodywagon drivewheelfwd drivewheelrwd enginelocationrear
## cylindernumbertwo  0            0             0             0                
## fuelsystemidi      0            0             0             0                
##                   wheelbase carlength carwidth carheight curbweight
## cylindernumbertwo  0         0         0        0         0        
## fuelsystemidi      0         0         0        0         0        
##                   enginetypedohcv enginetypel enginetypeohc enginetypeohcf
## cylindernumbertwo  0               0           0             0            
## fuelsystemidi      0               0           0             0            
##                   enginetypeohcv enginetyperotor cylindernumberfive
## cylindernumbertwo  0              1               0                
## fuelsystemidi      0              0               0                
##                   cylindernumberfour cylindernumbersix cylindernumberthree
## cylindernumbertwo  0                  0                 0                 
## fuelsystemidi      0                  0                 0                 
##                   cylindernumbertwelve enginesize fuelsystem2bbl fuelsystem4bbl
## cylindernumbertwo  0                    0          0              0            
## fuelsystemidi      0                    0          0              0            
##                   fuelsystemmfi fuelsystemmpfi fuelsystemspdi fuelsystemspfi
## cylindernumbertwo  0             0              0              0            
## fuelsystemidi      0             0              0              0            
##                   boreratio stroke compressionratio horsepower peakrpm citympg
## cylindernumbertwo  0         0      0                0          0       0     
## fuelsystemidi      0         0      0                0          0       0     
##                   highwaympg
## cylindernumbertwo  0        
## fuelsystemidi      0

# fueltype gas and fuelsystem idi + cylindernumber two and enginetype rotor are problematic:
# - every single car that has a 2-cylinder engine is also a rotor engine
# - fuelsystem idi is the opposite of fueltype gas - fuelsystem idi is the fuel system used only for Diesel engines

Typically, one would have to deal with the redundant variables and test whether the model meets its assumptions. Leaving redundant variables in the model inflates the variance of the estimates but does not bias them. Removing them may cause omitted variable bias, and since we care more about predictive power here than about a correct assessment of significance, we will leave them in the model. For comparability, both models are evaluated in their unmodified forms. No variable selection or assumption corrections are applied to either model, as the focus is on predictive performance rather than econometric validity.
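The mechanism behind the NA coefficients can be reproduced on a tiny simulated example. Everything below is made up for illustration (the variables fuel and system only mimic fueltype and fuelsystem): a fuel system level that occurs only in diesel cars makes its dummy an exact linear function of the intercept and the fuel dummy, so lm() cannot estimate it.

```r
set.seed(11)
n <- 60
fuel <- factor(rep(c("gas", "diesel"), times = c(45, 15)))

# Hypothetical fuel systems: "idi" is used by every diesel car and by no gas car
system <- factor(ifelse(fuel == "diesel", "idi",
                        sample(c("2bbl", "mpfi"), n, replace = TRUE)))

y <- 10 + 3 * (fuel == "diesel") + rnorm(n)

fit <- lm(y ~ fuel + system)

# systemidi = 1 - fuelgas, so its coefficient is reported as NA
coef(fit)

# alias() spells out the exact linear dependency
alias(fit)$Complete
```

The remaining coefficients are still estimated, which is why predictions from such a model remain usable.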

Partial Principal Component Regression

# Data preparation:
# Extracting PCs, binding with leftover categorical variables:
pca_scores <- as.data.frame(pca$x)

PCR_data <- bind_cols(
  price = data$price,
  pca_scores,
  data.categorical
)

# Model estimation:
  
pcr_partial <- lm(price ~ PC1 + PC2 + PC3 + fueltype + aspiration + doornumber + carbody + drivewheel + enginelocation + enginetype + cylindernumber + fuelsystem,
                  PCR_data)

summary(pcr_partial)

## 
## Call:
## lm(formula = price ~ PC1 + PC2 + PC3 + fueltype + aspiration + 
##     doornumber + carbody + drivewheel + enginelocation + enginetype + 
##     cylindernumber + fuelsystem, data = PCR_data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -7203  -1267      0   1090  12530 
## 
## Coefficients: (2 not defined because of singularities)
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           38672.62    3066.14  12.613  < 2e-16 ***
## PC1                    1690.30     174.30   9.698  < 2e-16 ***
## PC2                     631.44     352.85   1.790 0.075268 .  
## PC3                     551.80     280.00   1.971 0.050341 .  
## fueltypegas           -2911.87    1895.95  -1.536 0.126395    
## aspirationturbo        1257.79     740.63   1.698 0.091246 .  
## doornumbertwo           416.63     631.58   0.660 0.510342    
## carbodyhardtop        -3102.52    1525.84  -2.033 0.043539 *  
## carbodyhatchback      -2894.48    1317.16  -2.198 0.029304 *  
## carbodysedan          -2238.37    1370.59  -1.633 0.104247    
## carbodywagon          -3668.91    1520.65  -2.413 0.016873 *  
## drivewheelfwd           450.75    1088.43   0.414 0.679291    
## drivewheelrwd          1910.12    1226.20   1.558 0.121108    
## enginelocationrear    11260.12    2370.50   4.750 4.23e-06 ***
## enginetypedohcv       -9341.18    3438.78  -2.716 0.007266 ** 
## enginetypel            -915.98    1408.91  -0.650 0.516462    
## enginetypeohc          2559.76     985.06   2.599 0.010164 *  
## enginetypeohcf           20.89    1411.76   0.015 0.988212    
## enginetypeohcv        -5295.82    1353.59  -3.912 0.000131 ***
## enginetyperotor      -19237.66    3616.69  -5.319 3.17e-07 ***
## cylindernumberfive   -20001.49    2280.05  -8.772 1.57e-15 ***
## cylindernumberfour   -23815.22    2196.70 -10.841  < 2e-16 ***
## cylindernumbersix    -15311.75    1842.36  -8.311 2.61e-14 ***
## cylindernumberthree  -12714.06    4112.85  -3.091 0.002321 ** 
## cylindernumbertwelve  -2215.77    3112.17  -0.712 0.477438    
## cylindernumbertwo           NA         NA      NA       NA    
## fuelsystem2bbl        -1136.13     923.68  -1.230 0.220355    
## fuelsystem4bbl        -3946.73    3254.01  -1.213 0.226819    
## fuelsystemidi               NA         NA      NA       NA    
## fuelsystemmfi         -3801.52    2986.71  -1.273 0.204783    
## fuelsystemmpfi        -1566.60    1042.99  -1.502 0.134903    
## fuelsystemspdi        -3722.17    1443.05  -2.579 0.010724 *  
## fuelsystemspfi        -3094.26    2874.26  -1.077 0.283175    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2675 on 174 degrees of freedom
## Multiple R-squared:  0.9043, Adjusted R-squared:  0.8879 
## F-statistic: 54.84 on 30 and 174 DF,  p-value: < 2.2e-16

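The same caveats as for the OLS model apply here: the coefficients for cylindernumbertwo and fuelsystemidi cannot be estimated because of singularities. Let’s compare the prediction quality.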

Predictions comparison

To assess the prediction quality of these two models correctly, I applied 10-fold cross-validation, so that in each fold 90% of the data is used for training and 10% for testing.

It is important to note that when making predictions on the testing subset, one must apply exactly the same PCA transformation that was learned from the training subset. (The results will therefore differ slightly from the PCA on the whole dataset analyzed before, because for every fold the PCA is estimated again, separately, on the training subset.)
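The idea of learning the PCA on the training rows and reusing it on the test rows can be sketched in base R with `prcomp()` and its `predict()` method. This is only an illustration under stated assumptions: it uses the built-in `mtcars` data as a stand-in, the column selection `num_cols` is arbitrary, and a single random split replaces the 10 folds.

```r
# Stand-in data: built-in mtcars; the numeric columns are chosen only for illustration
num_cols <- c("disp", "hp", "wt", "drat", "qsec")

set.seed(2026)
test_idx <- sample(nrow(mtcars), size = 6)
train_df <- mtcars[-test_idx, ]
test_df  <- mtcars[test_idx, ]

# Centering, scaling and loadings are all estimated on the training rows only
pca_train <- prcomp(train_df[, num_cols], center = TRUE, scale. = TRUE)

# predict() reuses the stored training center/scale and rotation for the new rows
train_scores <- as.data.frame(pca_train$x[, 1:3])
test_scores  <- as.data.frame(predict(pca_train, newdata = test_df[, num_cols])[, 1:3])

# Regress the outcome on the first three training-set PCs, then predict the test rows
fit_pcs <- lm(mpg ~ PC1 + PC2 + PC3, data = data.frame(mpg = train_df$mpg, train_scores))
pred    <- predict(fit_pcs, newdata = test_scores)

sqrt(mean((test_df$mpg - pred)^2))  # hold-out RMSE
```

Running `prcomp()` on the test rows instead would produce components with different loadings, so the coefficients estimated on the training scores would no longer apply to them.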

# caret package can be used in case of the simple linear regression
# 10-fold cross-validation settings
set.seed(2026)
cv_settings <- trainControl(method = "cv", number = 10)

ols.cv <- train(price ~ symboling + fueltype + aspiration + 
                          doornumber + carbody + drivewheel + enginelocation + wheelbase + 
                          carlength + carwidth + carheight + curbweight + enginetype + 
                          cylindernumber + enginesize + fuelsystem + boreratio + stroke + 
                          compressionratio + horsepower + peakrpm + citympg + highwaympg, 
                        data = data, 
                        method = "lm",
                        trControl = cv_settings)

ols.results <- ols.cv$resample %>%
  summarise(
    MAE = mean(MAE),
    RMSE = mean(RMSE),
    Rsquared = mean(Rsquared)
  )

# trainControl(method = "cv", number = 10) sets up 10-fold cross-validation, it splits the data into 10 chunks, trains on 9, tests on 1, and rotates through all combinations
# train(..., method = "lm") fits OLS model repeatedly across those 10 folds and averages the prediction error
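For readers who want to see what the 10-fold procedure does step by step, below is a minimal base R sketch. It is an assumption-laden stand-in rather than what caret runs internally: it uses `mtcars` with an arbitrary formula, and it assigns folds purely at random, whereas caret builds its folds with its own routine.

```r
set.seed(2026)
k <- 10

# Randomly assign each row of the stand-in data to one of k folds
fold_id <- sample(rep(seq_len(k), length.out = nrow(mtcars)))

fold_rmse <- sapply(seq_len(k), function(i) {
  train_df <- mtcars[fold_id != i, ]   # nine folds for fitting
  test_df  <- mtcars[fold_id == i, ]   # one held-out fold for testing
  fit_i    <- lm(mpg ~ wt + hp, data = train_df)
  sqrt(mean((test_df$mpg - predict(fit_i, newdata = test_df))^2))
})

mean(fold_rmse)  # average of the per-fold RMSEs, analogous to mean(RMSE) over ols.cv$resample
```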

As per my earlier analysis, I chose the first three principal components as regressors for the partial PCR. For comparison, variants with 2 and 4 PCs were also included.

# caret package doesn't support partial PCR, so it has to be done manually in tidymodels

data.numeric <- data[, sapply(data, is.numeric) & names(data) != "symboling" & names(data) != "car_ID" & names(data) != "price"]
data.categorical <- data[, sapply(data, is.character) & names(data) != "CarName"]

set.seed(2026)
folds <- vfold_cv(data, v = 10)

# PCA applied only on numeric regressors
rec <- recipe(price ~ ., data = data) %>%
  step_select(price, all_of(names(data.numeric)), all_of(names(data.categorical))) %>%
  step_normalize(all_of(names(data.numeric))) %>% # skewness transformation omitted for code simplicity
  step_pca(all_of(names(data.numeric)), num_comp = 3) # 3 PCs

# Model
lin_mod <- linear_reg() %>% set_engine("lm")

# Workflow
pcr_partial_wf <- workflow() %>%
  add_recipe(rec) %>%
  add_model(lin_mod)

# CV - PCA is refit on each training fold internally
pcr_partial_cv <- fit_resamples(
  pcr_partial_wf,
  resamples = folds,
  metrics = metric_set(mae, rmse, rsq)
)
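The code for the 2-PC and 4-PC variants (`pcr_partial_cv_2PC`, `pcr_partial_cv_4PC`) is not shown in this report. One way such variants could be produced is to wrap the recipe and workflow in a function of the number of components. The sketch below is hypothetical: the helper `cv_partial_pcr()` and the `demo_*` objects are names introduced here, and `mtcars` serves as stand-in data so the block runs on its own.

```r
library(tidymodels)

# Hypothetical helper: cross-validate a partial PCR that keeps n_pc components
cv_partial_pcr <- function(df, outcome, pca_cols, n_pc, folds) {
  rec <- recipe(reformulate(".", response = outcome), data = df) %>%
    step_normalize(all_of(pca_cols)) %>%
    step_pca(all_of(pca_cols), num_comp = n_pc)   # PCA only on the selected numeric columns

  wf <- workflow() %>%
    add_recipe(rec) %>%
    add_model(linear_reg() %>% set_engine("lm"))

  # The recipe (and therefore the PCA) is re-estimated on each training fold
  fit_resamples(wf, resamples = folds, metrics = metric_set(mae, rmse, rsq))
}

# Stand-in data and column choice, for illustration only
set.seed(2026)
demo_folds <- vfold_cv(mtcars, v = 5)
demo_cols  <- c("disp", "hp", "wt", "drat", "qsec")

# Same folds for every variant, so the comparison is like for like
cv_by_k <- lapply(
  setNames(2:4, paste0(2:4, "PC")),
  function(n_pc) cv_partial_pcr(mtcars, "mpg", demo_cols, n_pc, demo_folds)
)

lapply(cv_by_k, collect_metrics)
```

Reusing one set of folds across all variants keeps the comparison fair, because every model is then scored on exactly the same held-out rows.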

Results and conclusions

ols.results

##        MAE     RMSE  Rsquared
## 1 1921.803 2651.101 0.8824575

collect_metrics(pcr_partial_cv) # 3 PCs

## # A tibble: 3 × 6
##   .metric .estimator     mean     n  std_err .config        
##   <chr>   <chr>         <dbl> <int>    <dbl> <chr>          
## 1 mae     standard   1956.       10  85.3    pre0_mod0_post0
## 2 rmse    standard   2666.       10 170.     pre0_mod0_post0
## 3 rsq     standard      0.884    10   0.0183 pre0_mod0_post0

collect_metrics(pcr_partial_cv_4PC)

## # A tibble: 3 × 6
##   .metric .estimator     mean     n  std_err .config        
##   <chr>   <chr>         <dbl> <int>    <dbl> <chr>          
## 1 mae     standard   1980.       10  82.3    pre0_mod0_post0
## 2 rmse    standard   2694.       10 162.     pre0_mod0_post0
## 3 rsq     standard      0.882    10   0.0186 pre0_mod0_post0

collect_metrics(pcr_partial_cv_2PC)

## # A tibble: 3 × 6
##   .metric .estimator     mean     n  std_err .config        
##   <chr>   <chr>         <dbl> <int>    <dbl> <chr>          
## 1 mae     standard   1967.       10  92.7    pre0_mod0_post0
## 2 rmse    standard   2667.       10 166.     pre0_mod0_post0
## 3 rsq     standard      0.884    10   0.0170 pre0_mod0_post0

Following cross-validation, the classical OLS regression model shows marginally stronger predictive performance than the PCR approach, with slightly lower Mean Absolute Error (1922 vs. 1956) and Root Mean Square Error (2651 vs. 2666). R-squared is practically identical (0.882 for OLS vs. 0.884 for the 3-PC PCR). All of these differences are smaller than one standard error of the PCR estimates, so the gap between the best-performing PCR model and OLS is modest, suggesting that PCR remains a competitive alternative in contexts where dimensionality reduction or multicollinearity is the primary concern.

As anticipated, three principal components give the best results among the PCR models, although two perform similarly. Selecting an appropriate number of components is an important step in this process. Retaining too few risks losing meaningful variance in the data, while including too many undermines the core benefit of the method. In cases where interpretability is less critical and dimensionality reduction is prioritized, PCR continues to offer a valuable and principled analytical framework.

Bibliography

https://en.wikipedia.org/wiki/Principal_component_analysis
https://en.wikipedia.org/wiki/Principal_component_regression
https://en.wikipedia.org/wiki/Glossary_of_automotive_terms
Svante Wold; Kim Esbensen; Paul Geladi. (1987). Principal component analysis. Chemometrics and Intelligent Laboratory Systems, 2(1-3), 37–52. doi:10.1016/0169-7439(87)80084-9
Materials from Unsupervised Learning course at University of Warsaw, Faculty of Economic Sciences

Car Price Prediction - Principal Component Regression Approach

Marta Zawada

2026-02-26