PCA is an unsupervised dimensionality reduction technique that takes a set of possibly correlated variables and transforms them into a smaller set of new variables called principal components that are completely uncorrelated with one another. Each component is a linear combination of the originals, and they’re ordered so that the first captures the most variance in the data, the second captures the most of what’s left while being orthogonal to the first and so on. (https://en.wikipedia.org/wiki/Principal_component_analysis)
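The properties described above can be checked in a few lines of base R. This is a minimal sketch on the built-in `mtcars` data, used here only as a stand-in for the car-price data; the object names (`X`, `pca_demo`, `manual_scores`) are illustrative.

```r
# Illustrative only: mtcars ships with base R and stands in for the car-price data.
X <- mtcars[, c("disp", "hp", "wt", "qsec", "drat")]

# Centre and scale the variables, then extract the principal components.
pca_demo <- prcomp(X, center = TRUE, scale. = TRUE)

# 1. Each component is a linear combination of the (standardised) originals.
manual_scores <- scale(X) %*% pca_demo$rotation
max(abs(manual_scores - pca_demo$x))   # numerically zero

# 2. Component scores are mutually uncorrelated.
round(cor(pca_demo$x), 10)             # off-diagonal entries are zero

# 3. Components are ordered by the variance they capture.
pca_demo$sdev^2                        # decreasing sequence
```

The weights of each linear combination are the columns of `pca_demo$rotation`.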
“Principal Component Regression (PCR) is a regression analysis technique that is based on principal component analysis (PCA). In PCR, instead of regressing the dependent variable on the explanatory variables directly, the principal components of the explanatory variables are used as regressors. […] One major use of PCR lies in overcoming the multicollinearity problem which arises when two or more of the explanatory variables are close to being collinear. PCR can lead to efficient prediction of the outcome based on the assumed model.” (https://en.wikipedia.org/wiki/Principal_component_regression)
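As a rough sketch of the idea in the quote, the snippet below regresses a response on the first `k` principal components instead of the original regressors. It uses the built-in `mtcars` data as a stand-in, and `k = 2` is an arbitrary illustrative choice, not the choice made later in this analysis.

```r
# Stand-in data: predict mpg from correlated size/engine variables in mtcars.
X <- scale(mtcars[, c("disp", "hp", "wt", "drat", "qsec")])
y <- mtcars$mpg

# The regressors are strongly correlated with one another.
round(cor(X), 2)

# Regress on the leading principal components instead of on X directly.
pca_demo <- prcomp(X, center = FALSE, scale. = FALSE)  # X is already standardised
k <- 2                                                  # hypothetical choice
pcr_data <- data.frame(y = y, pca_demo$x[, 1:k])
pcr_fit <- lm(y ~ ., data = pcr_data)

# Coefficients on the components map back to the standardised originals:
# scores = X %*% V_k, so the implied slopes on X are V_k %*% gamma_hat.
gamma_hat <- coef(pcr_fit)[-1]
beta_implied <- pca_demo$rotation[, 1:k] %*% gamma_hat
fitted_via_beta <- as.numeric(coef(pcr_fit)[1] + X %*% beta_implied)
```

Because the component scores are uncorrelated, the regression on them does not suffer from the collinearity present among the original columns of `X`.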
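PCR also has potential drawbacks. The components are chosen to explain variance in the regressors without reference to the response, so a discarded low-variance component may still carry predictive information, and coefficients on components are harder to interpret than coefficients on the original variables.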
Source: https://www.kaggle.com/datasets/hellbuoy/car-price-prediction?select=CarPrice_Assignment.csv
data <- read.csv("CarPrice_Assignment.csv")
missing.data <- data[!complete.cases(data),]
missing.data
## [1] car_ID symboling CarName fueltype
## [5] aspiration doornumber carbody drivewheel
## [9] enginelocation wheelbase carlength carwidth
## [13] carheight curbweight enginetype cylindernumber
## [17] enginesize fuelsystem boreratio stroke
## [21] compressionratio horsepower peakrpm citympg
## [25] highwaympg price
## <0 rows> (or 0-length row.names)
## car_ID symboling CarName fueltype aspiration doornumber
## 1 1 3 alfa-romero giulia gas std two
## 2 2 3 alfa-romero stelvio gas std two
## 3 3 1 alfa-romero Quadrifoglio gas std two
## 4 4 2 audi 100 ls gas std four
## 5 5 2 audi 100ls gas std four
## 6 6 2 audi fox gas std two
## carbody drivewheel enginelocation wheelbase carlength carwidth carheight
## 1 convertible rwd front 88.6 168.8 64.1 48.8
## 2 convertible rwd front 88.6 168.8 64.1 48.8
## 3 hatchback rwd front 94.5 171.2 65.5 52.4
## 4 sedan fwd front 99.8 176.6 66.2 54.3
## 5 sedan 4wd front 99.4 176.6 66.4 54.3
## 6 sedan fwd front 99.8 177.3 66.3 53.1
## curbweight enginetype cylindernumber enginesize fuelsystem boreratio stroke
## 1 2548 dohc four 130 mpfi 3.47 2.68
## 2 2548 dohc four 130 mpfi 3.47 2.68
## 3 2823 ohcv six 152 mpfi 2.68 3.47
## 4 2337 ohc four 109 mpfi 3.19 3.40
## 5 2824 ohc five 136 mpfi 3.19 3.40
## 6 2507 ohc five 136 mpfi 3.19 3.40
## compressionratio horsepower peakrpm citympg highwaympg price
## 1 9.0 111 5000 21 27 13495
## 2 9.0 111 5000 21 27 16500
## 3 9.0 154 5000 19 26 16500
## 4 10.0 102 5500 24 30 13950
## 5 8.0 115 5500 18 22 17450
## 6 8.5 110 5500 19 25 15250
Many of these variables may be correlated, like carwidth and wheelbase.
This dataset has mixed data types, which require careful handling. PCA assumes continuous, linearly related and scaled variables. In order to perform correct PCA and PCR, we need to make sure variables are numerical and categorical ones are not included in the process of principal component extraction. This will lead to a partial PCR instead of a full one, as we will use categorical variables in the linear regression model together with PCs.
data.numeric <- data[, sapply(data, is.numeric) & names(data) != "car_ID"]
data.categorical <- data[, sapply(data, is.character) & names(data) != "CarName"]
summary(data.numeric)
## symboling wheelbase carlength carwidth
## Min. :-2.0000 Min. : 86.60 Min. :141.1 Min. :60.30
## 1st Qu.: 0.0000 1st Qu.: 94.50 1st Qu.:166.3 1st Qu.:64.10
## Median : 1.0000 Median : 97.00 Median :173.2 Median :65.50
## Mean : 0.8341 Mean : 98.76 Mean :174.0 Mean :65.91
## 3rd Qu.: 2.0000 3rd Qu.:102.40 3rd Qu.:183.1 3rd Qu.:66.90
## Max. : 3.0000 Max. :120.90 Max. :208.1 Max. :72.30
## carheight curbweight enginesize boreratio stroke
## Min. :47.80 Min. :1488 Min. : 61.0 Min. :2.54 Min. :2.070
## 1st Qu.:52.00 1st Qu.:2145 1st Qu.: 97.0 1st Qu.:3.15 1st Qu.:3.110
## Median :54.10 Median :2414 Median :120.0 Median :3.31 Median :3.290
## Mean :53.72 Mean :2556 Mean :126.9 Mean :3.33 Mean :3.255
## 3rd Qu.:55.50 3rd Qu.:2935 3rd Qu.:141.0 3rd Qu.:3.58 3rd Qu.:3.410
## Max. :59.80 Max. :4066 Max. :326.0 Max. :3.94 Max. :4.170
## compressionratio horsepower peakrpm citympg
## Min. : 7.00 Min. : 48.0 Min. :4150 Min. :13.00
## 1st Qu.: 8.60 1st Qu.: 70.0 1st Qu.:4800 1st Qu.:19.00
## Median : 9.00 Median : 95.0 Median :5200 Median :24.00
## Mean :10.14 Mean :104.1 Mean :5125 Mean :25.22
## 3rd Qu.: 9.40 3rd Qu.:116.0 3rd Qu.:5500 3rd Qu.:30.00
## Max. :23.00 Max. :288.0 Max. :6600 Max. :49.00
## highwaympg price
## Min. :16.00 Min. : 5118
## 1st Qu.:25.00 1st Qu.: 7788
## Median :30.00 Median :10295
## Mean :30.75 Mean :13277
## 3rd Qu.:34.00 3rd Qu.:16503
## Max. :54.00 Max. :45400
## fueltype aspiration doornumber carbody
## Length:205 Length:205 Length:205 Length:205
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
## drivewheel enginelocation enginetype cylindernumber
## Length:205 Length:205 Length:205 Length:205
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
## fuelsystem
## Length:205
## Class :character
## Mode :character
## [1] "symboling" "wheelbase" "carlength" "carwidth"
## [5] "carheight" "curbweight" "enginesize" "boreratio"
## [9] "stroke" "compressionratio" "horsepower" "peakrpm"
## [13] "citympg" "highwaympg" "price"
## symboling wheelbase carlength carwidth carheight curbweight enginesize
## 1 3 88.6 168.8 64.1 48.8 2548 130
## 2 3 88.6 168.8 64.1 48.8 2548 130
## 3 1 94.5 171.2 65.5 52.4 2823 152
## 4 2 99.8 176.6 66.2 54.3 2337 109
## 5 2 99.4 176.6 66.4 54.3 2824 136
## 6 2 99.8 177.3 66.3 53.1 2507 136
## boreratio stroke compressionratio horsepower peakrpm citympg highwaympg price
## 1 3.47 2.68 9.0 111 5000 21 27 13495
## 2 3.47 2.68 9.0 111 5000 21 27 16500
## 3 2.68 3.47 9.0 154 5000 19 26 16500
## 4 3.19 3.40 10.0 102 5500 24 30 13950
## 5 3.19 3.40 8.0 115 5500 18 22 17450
## 6 3.19 3.40 8.5 110 5500 19 25 15250
# Data is still not ready for dimension reduction, as symboling isn't a continuous variable, but an ordinal one.
data.numeric <- data[, sapply(data, is.numeric) & names(data) != "symboling" & names(data) != "car_ID" & names(data) != "price"]
# The other numerical variables have values with physical meaning and are truly continuous. Price, as the regressand, does not undergo PCA.
PCA relies on linear combinations. If two variables are strongly linearly correlated, PCA can efficiently compress that relationship into a single component.
As we can observe, variables in this dataset are linearly correlated, some very strongly (positively, like carlength and carwidth, or negatively, like horsepower and highwaympg). This bodes well for our PCA and PCR, as there is a lot of redundancy in the information carried by this data.
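One way to list the strongest linear relationships is to rank the entries of the correlation matrix. The helper below, `strongest_pairs()`, is a hypothetical name introduced for this sketch; it is demonstrated on the built-in `mtcars` data, and in this analysis it could be called on `data.numeric` instead.

```r
# Hypothetical helper: list the most strongly correlated variable pairs.
strongest_pairs <- function(df, n = 5) {
  cm  <- cor(df)                                # Pearson correlation matrix
  idx <- which(upper.tri(cm), arr.ind = TRUE)   # visit each pair only once
  out <- data.frame(
    var1 = rownames(cm)[idx[, "row"]],
    var2 = colnames(cm)[idx[, "col"]],
    r    = cm[idx]
  )
  out <- out[order(-abs(out$r)), ]              # strongest first, either sign
  rownames(out) <- NULL
  head(out, n)
}

# Stand-in call on mtcars.
top_pairs <- strongest_pairs(mtcars[, c("mpg", "disp", "hp", "wt", "qsec")], n = 3)
top_pairs
```

Sorting by absolute value keeps strong negative relationships, such as the one between power and fuel economy, next to the strong positive ones.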
PCA is sensitive to the distribution and scale of the variables used. Because it operates on variance, heavily skewed variables can disproportionately dominate the components, so it is advisable to correct for skewness beforehand, typically through a logarithmic or Box-Cox transformation. Additionally, PCA will treat larger-magnitude variables as more important because of their bigger ranges. Standardising the data ensures every variable enters the analysis as equally important and the components don’t reflect arbitrary differences in measurement scale.
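Both effects can be illustrated with base R alone. The sketch below makes two assumptions: the skewed variable is synthetic, and `mtcars` stands in for the car-price data. The `skew()` helper is a plain moment-based estimate written out so that no package is needed; it is not necessarily the estimator used elsewhere in this analysis.

```r
# Moment-based sample skewness, written out so no extra package is needed.
skew <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Synthetic right-skewed variable (an assumption for illustration).
x <- exp(seq(0, 3, length.out = 50))
skew_raw <- skew(x)        # clearly positive
skew_log <- skew(log(x))   # the log of this grid is symmetric, so close to zero

# Scale effect: without standardisation, PC1 is driven by the variable
# with the largest raw variance.
vars <- c("disp", "hp", "wt", "drat", "qsec")
raw_variances <- sapply(mtcars[, vars], var)
pca_raw    <- prcomp(mtcars[, vars], scale. = FALSE)
pca_scaled <- prcomp(mtcars[, vars], scale. = TRUE)

largest_var_name <- names(which.max(raw_variances))
dominant_raw_pc1 <- names(which.max(abs(pca_raw$rotation[, 1])))
round(abs(pca_scaled$rotation[, 1]), 2)   # loadings after z-scoring
```

Comparing `largest_var_name` with `dominant_raw_pc1` shows that the unscaled first component simply follows the variable measured on the largest scale.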
par(mfrow=c(1, 2))
hist(data$'enginesize', col = "lightblue", breaks = 25, freq = FALSE, main = "enginesize")
hist(log(data$'enginesize'), col = "lightblue", breaks = 25, freq = FALSE, main = "Log enginesize")
hist(data$'horsepower', col = "lightblue", breaks = 25, freq = FALSE, main = "horsepower")
hist(log(data$'horsepower'), col = "lightblue", breaks = 25, freq = FALSE, main = "Log horsepower")
# Loop over all variables: skewness is checked, then a Box-Cox transformation is applied if necessary.
# Assumption: skewness() comes from a package such as e1071 or moments; BoxCox.lambda() and BoxCox() come from forecast.
columns_to_process <- names(data.numeric)
data.transformed <- data.numeric
for (col_name in columns_to_process) {
  val <- data.numeric[[col_name]]
  skew_val <- skewness(val, na.rm = TRUE)
  # Only strongly right-skewed variables (skewness > 1) are transformed
  if (!is.na(skew_val) && skew_val > 1) {
    lambda <- BoxCox.lambda(val, method = "guerrero")
    data.transformed[[col_name]] <- BoxCox(val, lambda)
    print(col_name)
  } else {
    data.transformed[[col_name]] <- val
  }
}
## [1] "wheelbase"
## [1] "enginesize"
## [1] "compressionratio"
## [1] "horsepower"
data.transformed.matrix <- as.matrix(data.transformed) # clusterSim accepts only a matrix
data.transformed.normalized.matrix <- data.Normalization(data.transformed.matrix, type="n1", normalization="column") # z-score normalization: Z = (X - mu) / sigma, where mu is the mean and sigma is the standard deviation
The de-skewed, normalized continuous data is now ready for PCA.
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.6135 1.5176 1.05504 0.92399 0.78459 0.63103 0.5348
## Proportion of Variance 0.5254 0.1772 0.08562 0.06567 0.04735 0.03063 0.0220
## Cumulative Proportion 0.5254 0.7026 0.78821 0.85388 0.90123 0.93186 0.9539
## PC8 PC9 PC10 PC11 PC12 PC13
## Standard deviation 0.48393 0.35554 0.28930 0.27119 0.24546 0.14724
## Proportion of Variance 0.01801 0.00972 0.00644 0.00566 0.00463 0.00167
## Cumulative Proportion 0.97188 0.98160 0.98804 0.99370 0.99833 1.00000
The first two principal components explain 70.26% of the variance, and the first three explain 78.82%.
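These percentages follow directly from the component standard deviations printed in the summary. As a small worked check, with the values copied from that output (variable names are illustrative):

```r
# Standard deviations of the 13 components, copied from the summary output above.
sdev <- c(2.6135, 1.5176, 1.05504, 0.92399, 0.78459, 0.63103, 0.5348,
          0.48393, 0.35554, 0.28930, 0.27119, 0.24546, 0.14724)

eigenvalues <- sdev^2                          # variance captured by each PC
prop_var    <- eigenvalues / sum(eigenvalues)  # proportion of variance
cum_var     <- cumsum(prop_var)                # cumulative proportion

round(cum_var[1:3], 4)
```

Since the 13 variables were standardised, the eigenvalues sum to approximately 13, one unit of variance per variable.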
A loading is the weight of an original variable in a component; once scaled by the component’s standard deviation, it equals the correlation between that (standardised) variable and the component. The sign and magnitude of the loadings indicate what the component represents. A variable’s contribution is its proportional share of the component’s variance, that is, how much the variable drives the construction of a given PC regardless of direction.
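The link between loadings and correlations can be verified numerically. The sketch below assumes standardised inputs and uses the built-in `mtcars` data as a stand-in; `prcomp()` stores unit-length weight vectors in `$rotation`, and rescaling them by `$sdev` yields the variable-component correlations.

```r
X <- scale(mtcars[, c("disp", "hp", "wt", "drat", "qsec")])  # stand-in data
pca_demo <- prcomp(X)

# prcomp() stores unit-length eigenvectors in $rotation.
col_norms <- colSums(pca_demo$rotation^2)     # each column sums to 1

# Multiplying each column by the component's standard deviation gives the
# correlation between each standardised variable and each component.
corr_loadings <- sweep(pca_demo$rotation, 2, pca_demo$sdev, `*`)
max(abs(corr_loadings - cor(X, pca_demo$x)))  # numerically zero
```

Either version gives the same signs and the same ranking of variables within a component; they differ only by a per-component scaling factor.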
## PC1 PC2 PC3 PC4 PC5
## wheelbase 0.30504048 -0.280887732 0.06510167 -0.28512190 0.02365686
## carlength 0.35019252 -0.157555656 0.06258819 -0.15560531 -0.04302767
## carwidth 0.33719981 -0.101463577 -0.09037191 -0.05774731 -0.24578269
## carheight 0.12281060 -0.407135389 0.38198107 -0.50554530 0.24952836
## curbweight 0.36495689 -0.059665847 -0.05985525 0.04127834 -0.12000685
## enginesize 0.33414215 -0.007049031 -0.16829742 0.24522748 0.07383858
## boreratio 0.27980520 0.008026666 0.16839760 0.45420961 0.05182884
## stroke 0.05903472 -0.073163294 -0.85729018 -0.26951163 0.27538350
## compressionratio -0.02458476 -0.508817538 -0.16684749 0.16930830 -0.69092082
## horsepower 0.32353894 0.298497640 -0.07382732 0.05534073 -0.07889889
## peakrpm -0.08394159 0.458991228 0.01338497 -0.50505623 -0.54066308
## citympg -0.32398347 -0.300224281 -0.06152902 0.05412686 -0.01882277
## highwaympg -0.33408899 -0.249786957 -0.07282489 0.07112647 -0.01648886
## PC6 PC7 PC8 PC9 PC10
## wheelbase 0.090683769 -0.42880237 -0.05710075 0.431306917 0.544947011
## carlength 0.056543486 -0.12378014 0.01607265 0.441749559 -0.715452449
## carwidth 0.245568902 -0.43311685 0.20593248 -0.690049682 -0.070853035
## carheight -0.317476453 0.38307226 0.09792399 -0.299618476 0.032838241
## curbweight 0.138846721 0.11238580 0.02098189 -0.009802569 -0.005050576
## enginesize -0.000313841 0.36222141 0.64521123 0.131065350 0.015290398
## boreratio -0.750154577 -0.30678697 -0.12529514 -0.065708630 -0.028872640
## stroke -0.277384276 -0.01081291 -0.15714525 -0.059827483 -0.046090306
## compressionratio -0.098177623 0.31568195 -0.25857023 0.037641993 0.067838543
## horsepower -0.060645611 0.19306744 0.12497610 0.082210426 0.384020555
## peakrpm -0.361490176 -0.06353547 0.22413078 0.076935912 -0.039647385
## citympg -0.083103448 -0.21243604 0.38622049 0.083903808 0.139783467
## highwaympg -0.132790819 -0.20568519 0.45329256 0.081155722 -0.093057020
## PC11 PC12 PC13
## wheelbase 0.026202793 -0.21828841 0.123592793
## carlength -0.232825070 0.15098988 -0.140578743
## carwidth -0.169901370 -0.05415989 -0.028275625
## carheight -0.043459606 0.07430743 0.004864219
## curbweight 0.770514530 0.45402011 0.106115684
## enginesize 0.041723068 -0.47674564 -0.021107550
## boreratio 0.070776965 -0.03600333 -0.016982288
## stroke 0.006543094 0.01151734 -0.010502492
## compressionratio -0.132751562 -0.08065059 0.018065463
## horsepower -0.501779873 0.57278869 -0.028485092
## peakrpm 0.163384128 -0.12365728 -0.016741812
## citympg 0.104706383 0.25322165 -0.703537377
## highwaympg -0.082469534 0.26878255 0.675019710
var <- get_pca_var(pca)
a <- fviz_contrib(pca, "var", axes=1, xtickslab.rt=90)
b <- fviz_contrib(pca, "var", axes=2, xtickslab.rt=90)
grid.arrange(a,b,top='Contribution to the first two Principal Components')
(By the way, I recommend the glossary at https://en.wikipedia.org/wiki/Glossary_of_automotive_terms for this part.)
The first principal component is dominated by variables related to a car’s size and power, like curbweight, carlength, carwidth, enginesize, wheelbase, boreratio and horsepower. These all load positively and contribute more than average. Citympg and highwaympg are also highly contributing variables, but they load negatively, meaning this component is negatively associated with fuel efficiency. This reflects a real-world relationship: larger, more powerful cars tend to be less fuel efficient. PC1 can therefore be interpreted as a size-and-power axis, where a high score indicates a large, powerful, and inefficient vehicle, and a low score a small, economical one.
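The "more than average" criterion can be reproduced from the PC1 column of the loadings table. In this sketch the contribution of a variable is taken as its squared loading expressed as a percentage, and the average is 100 divided by the number of variables; the loadings are copied from the table, and the object names are illustrative.

```r
# PC1 loadings copied from the loadings table above.
pc1 <- c(wheelbase = 0.30504048, carlength = 0.35019252, carwidth = 0.33719981,
         carheight = 0.12281060, curbweight = 0.36495689, enginesize = 0.33414215,
         boreratio = 0.27980520, stroke = 0.05903472, compressionratio = -0.02458476,
         horsepower = 0.32353894, peakrpm = -0.08394159, citympg = -0.32398347,
         highwaympg = -0.33408899)

contrib  <- 100 * pc1^2 / sum(pc1^2)  # percentage contribution of each variable
expected <- 100 / length(pc1)         # contribution if all variables were equal

above_avg <- sort(contrib[contrib > expected], decreasing = TRUE)
round(above_avg, 2)
sign(pc1[names(above_avg)])           # direction of each high contributor
```

Squaring removes the sign, which is why the fuel-economy variables appear among the top contributors even though they load negatively.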
The second principal component has a very high contribution from compressionratio, peakrpm and carheight. Citympg and horsepower contribute only a little above average. Compressionratio has a strongly negative loading and peakrpm a strongly positive one. Carheight and citympg load negatively and horsepower positively. The positive loading of peakrpm and horsepower against the negative loading of compressionratio and carheight suggests this component separates high-revving, powerful cars with low rooflines from tall, high-compression vehicles. High-compression, low-revving engines are typically found in diesel or economy-oriented cars, which also tend to have taller, more upright body styles and better fuel efficiency, which is consistent with the signs of the loadings for citympg and carheight. PC2 can therefore be interpreted as an axis contrasting sporty vehicles with economy-oriented ones.
What is notable is that these two components are interpretable without resorting to rotation techniques such as Varimax, which are often needed to achieve meaningful loadings. The data has produced components that align well with intuitive automotive concepts. This reflects car design trade-offs vividly.
plot(pca$x[,1], pca$x[,2],
     xlab = "PC1", ylab = "PC2",
     main = "PCA",
     pch = 19,
     col = rgb(0, 0, 1, 0.2))
According to our PC interpretation, vehicles on the far right of this chart are likely the largest, heaviest, and most powerful cars in the dataset (e.g., luxury sedans or large SUVs). Vehicles on the far left are small, lightweight, and highly fuel-efficient cars. Vehicles at the top are performance-oriented, with high peak RPM and lower profiles, while those at the bottom are taller cars with high compression ratios.
On this loading plot, positively correlated variables are grouped together and negatively correlated variables are positioned on opposite sides.
The cos2 (squared cosine) measures the quality of representation. A high cos2 means that a car is well represented by the two PCs on the chart. A low cos2 indicates a less accurate reflection of its original features. This plot makes it clear that the first two principal components may not be enough to express the original dataset well. The presence of several low cos2 values (the blue and white points) signifies that the characteristics of certain vehicles are better explained by subsequent components.
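For individual observations, cos2 on a plane can be computed directly from the component scores. A minimal sketch, again on the built-in `mtcars` data as a stand-in, with illustrative object names:

```r
X <- scale(mtcars[, c("disp", "hp", "wt", "drat", "qsec")])  # stand-in data
pca_demo <- prcomp(X)
scores <- pca_demo$x

# cos2 on the PC1-PC2 plane: squared distance from the origin within the plane
# divided by the squared distance in the full component space.
cos2_2d <- rowSums(scores[, 1:2]^2) / rowSums(scores^2)

# Per-component cos2; for each observation these sum to 1 over all components.
cos2_all <- scores^2 / rowSums(scores^2)

head(sort(cos2_2d), 3)   # observations the two-dimensional map represents worst
```

An observation with a low value here sits mostly along later components, so its position on the PC1-PC2 chart says little about it.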
My hypothesis is that three PCs will be the optimal choice for regression. The scree plot gets visibly flatter after the third component. It looks like a three-component model will provide the best information retention while still mitigating the high multicollinearity of this dataset.
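The visual impression from the scree plot can be complemented with two simple numeric checks on the eigenvalues. The Kaiser rule used below (keep components with eigenvalue above 1) is one common heuristic and is brought in here as an additional check, not as the method used in this analysis; the standard deviations are copied from the PCA summary.

```r
# Standard deviations of the 13 components, copied from the PCA summary above.
sdev <- c(2.6135, 1.5176, 1.05504, 0.92399, 0.78459, 0.63103, 0.5348,
          0.48393, 0.35554, 0.28930, 0.27119, 0.24546, 0.14724)
eigenvalues <- sdev^2

# Kaiser rule: keep components that carry more variance than a single
# standardised variable (eigenvalue > 1).
n_kaiser <- sum(eigenvalues > 1)

# Drop in eigenvalue from one component to the next; the elbow of the scree
# plot is where these drops become small.
drops <- -diff(eigenvalues)
round(head(drops, 5), 3)
n_kaiser
```

Heuristics like these only suggest a candidate; the cross-validated comparison later in the analysis is the more direct test for a regression setting.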
# This Claude-generated code provides a wonderful visualisation of the first three PCs
pca_scores <- as.data.frame(pca$x[, 1:3])
cos2_3d <- rowSums(pca$x[, 1:3]^2) / rowSums(pca$x^2)
plot_ly(
  data = pca_scores,
  x = ~PC1, y = ~PC2, z = ~PC3,
  type = "scatter3d",
  mode = "markers",
  marker = list(
    size = 3,
    color = cos2_3d,
    colorscale = list(
      c(0, "white"),
      c(0.5, "#2E9FDF"),
      c(1, "#FC4E07")
    ),
    colorbar = list(title = "cos2"),
    showscale = TRUE
  )
) %>%
  layout(
    scene = list(
      xaxis = list(title = "PC1"),
      yaxis = list(title = "PC2"),
      zaxis = list(title = "PC3")
    ),
    title = "Individuals - PCA"
  )
Basic estimation of the simplest possible regression model, for comparison with partial PCR.
ols <- lm(price ~ symboling + fueltype + aspiration +
doornumber + carbody + drivewheel + enginelocation + wheelbase +
carlength + carwidth + carheight + curbweight + enginetype +
cylindernumber + enginesize + fuelsystem + boreratio + stroke +
compressionratio + horsepower + peakrpm + citympg + highwaympg, data)
summary(ols)
##
## Call:
## lm(formula = price ~ symboling + fueltype + aspiration + doornumber +
## carbody + drivewheel + enginelocation + wheelbase + carlength +
## carwidth + carheight + curbweight + enginetype + cylindernumber +
## enginesize + fuelsystem + boreratio + stroke + compressionratio +
## horsepower + peakrpm + citympg + highwaympg, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5416.2 -1152.0 -35.8 830.8 9835.6
##
## Coefficients: (2 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.226e+04 1.652e+04 -1.347 0.179705
## symboling 7.388e+01 2.386e+02 0.310 0.757238
## fueltypegas -1.178e+04 7.017e+03 -1.678 0.095232 .
## aspirationturbo 1.626e+03 8.856e+02 1.836 0.068172 .
## doornumbertwo 1.876e+02 5.854e+02 0.320 0.749028
## carbodyhardtop -3.207e+03 1.376e+03 -2.331 0.020992 *
## carbodyhatchback -3.281e+03 1.223e+03 -2.683 0.008055 **
## carbodysedan -2.152e+03 1.332e+03 -1.615 0.108182
## carbodywagon -3.266e+03 1.455e+03 -2.244 0.026191 *
## drivewheelfwd 7.405e+01 1.040e+03 0.071 0.943351
## drivewheelrwd 1.033e+03 1.205e+03 0.857 0.392688
## enginelocationrear 7.695e+03 2.536e+03 3.035 0.002802 **
## wheelbase 4.882e+01 9.675e+01 0.505 0.614563
## carlength -6.130e+01 4.875e+01 -1.257 0.210410
## carwidth 6.936e+02 2.394e+02 2.897 0.004283 **
## carheight 8.943e+01 1.278e+02 0.700 0.485209
## curbweight 3.942e+00 1.715e+00 2.299 0.022781 *
## enginetypedohcv -7.189e+03 4.674e+03 -1.538 0.125912
## enginetypel -1.051e+03 1.608e+03 -0.654 0.514246
## enginetypeohc 3.126e+03 9.088e+02 3.439 0.000741 ***
## enginetypeohcf 1.234e+03 1.572e+03 0.785 0.433661
## enginetypeohcv -5.605e+03 1.247e+03 -4.495 1.31e-05 ***
## enginetyperotor -6.925e+01 4.505e+03 -0.015 0.987754
## cylindernumberfive -9.280e+03 2.716e+03 -3.417 0.000800 ***
## cylindernumberfour -9.879e+03 3.054e+03 -3.234 0.001476 **
## cylindernumbersix -6.570e+03 2.192e+03 -2.997 0.003154 **
## cylindernumberthree -4.629e+02 4.499e+03 -0.103 0.918173
## cylindernumbertwelve -1.024e+04 4.384e+03 -2.336 0.020707 *
## cylindernumbertwo NA NA NA NA
## enginesize 1.174e+02 2.600e+01 4.515 1.21e-05 ***
## fuelsystem2bbl -3.907e+01 8.920e+02 -0.044 0.965118
## fuelsystem4bbl -1.624e+03 2.775e+03 -0.585 0.559295
## fuelsystemidi NA NA NA NA
## fuelsystemmfi -3.480e+03 2.590e+03 -1.344 0.180967
## fuelsystemmpfi -2.444e+02 1.001e+03 -0.244 0.807415
## fuelsystemspdi -3.027e+03 1.382e+03 -2.191 0.029883 *
## fuelsystemspfi -6.187e+02 2.508e+03 -0.247 0.805484
## boreratio -1.882e+03 1.598e+03 -1.178 0.240443
## stroke -4.454e+03 9.009e+02 -4.944 1.89e-06 ***
## compressionratio -8.003e+02 5.259e+02 -1.522 0.129981
## horsepower 9.791e+00 2.227e+01 0.440 0.660789
## peakrpm 2.202e+00 6.194e-01 3.555 0.000495 ***
## citympg -1.477e+02 1.474e+02 -1.003 0.317569
## highwaympg 1.916e+02 1.347e+02 1.422 0.156916
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2197 on 163 degrees of freedom
## Multiple R-squared: 0.9395, Adjusted R-squared: 0.9243
## F-statistic: 61.79 on 41 and 163 DF, p-value: < 2.2e-16
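NA coefficients appear! This was likely, as the dataset is quite small and some dummy variables turn out to be exact linear combinations of others (the "singularities" reported above). It is also a good reason to use PCR, as this is a lot of regressors for ~200 observations.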
## Model :
## price ~ symboling + fueltype + aspiration + doornumber + carbody +
## drivewheel + enginelocation + wheelbase + carlength + carwidth +
## carheight + curbweight + enginetype + cylindernumber + enginesize +
## fuelsystem + boreratio + stroke + compressionratio + horsepower +
## peakrpm + citympg + highwaympg
##
## Complete :
## (Intercept) symboling fueltypegas aspirationturbo
## cylindernumbertwo 0 0 0 0
## fuelsystemidi 1 0 -1 0
## doornumbertwo carbodyhardtop carbodyhatchback carbodysedan
## cylindernumbertwo 0 0 0 0
## fuelsystemidi 0 0 0 0
## carbodywagon drivewheelfwd drivewheelrwd enginelocationrear
## cylindernumbertwo 0 0 0 0
## fuelsystemidi 0 0 0 0
## wheelbase carlength carwidth carheight curbweight
## cylindernumbertwo 0 0 0 0 0
## fuelsystemidi 0 0 0 0 0
## enginetypedohcv enginetypel enginetypeohc enginetypeohcf
## cylindernumbertwo 0 0 0 0
## fuelsystemidi 0 0 0 0
## enginetypeohcv enginetyperotor cylindernumberfive
## cylindernumbertwo 0 1 0
## fuelsystemidi 0 0 0
## cylindernumberfour cylindernumbersix cylindernumberthree
## cylindernumbertwo 0 0 0
## fuelsystemidi 0 0 0
## cylindernumbertwelve enginesize fuelsystem2bbl fuelsystem4bbl
## cylindernumbertwo 0 0 0 0
## fuelsystemidi 0 0 0 0
## fuelsystemmfi fuelsystemmpfi fuelsystemspdi fuelsystemspfi
## cylindernumbertwo 0 0 0 0
## fuelsystemidi 0 0 0 0
## boreratio stroke compressionratio horsepower peakrpm citympg
## cylindernumbertwo 0 0 0 0 0 0
## fuelsystemidi 0 0 0 0 0 0
## highwaympg
## cylindernumbertwo 0
## fuelsystemidi 0
# fueltype gas and fuelsystem idi, as well as cylindernumber two and enginetype rotor, are problematic:
# - every single car that has a 2-cylinder engine also has a rotor engine
# - fuelsystem idi is the exact opposite of fueltype gas: idi is the fuel system used only for diesel engines
Typically one would have to deal with the redundant variables and test whether the model meets its assumptions. Leaving redundant variables in the model inflates the variance of the estimates but does not bias them. Removing them may cause omitted-variable bias, and in this case we care more about predictive power than about correct assessment of significance, so we will leave them in the model. For comparability, both models are evaluated in their unmodified forms. No variable selection or assumption corrections are applied to either model, as the focus is on predictive performance rather than econometric validity.
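The mechanism behind the NA coefficients can be reproduced on a toy example. The data below are entirely made up and only mirror the structure described above: every diesel car uses the "idi" fuel system, so the `idi` dummy equals one minus the `gas` dummy.

```r
# Toy data (made up): the "idi" system occurs for diesel cars and only for them.
toy <- data.frame(
  fuel   = c("gas", "gas", "gas", "gas", "gas", "gas",
             "diesel", "diesel", "diesel", "diesel"),
  system = c("2bbl", "mpfi", "2bbl", "mpfi", "2bbl", "mpfi",
             "idi", "idi", "idi", "idi"),
  price  = c(7, 12, 8, 13, 6, 11, 10, 12, 9, 11)
)

fit <- lm(price ~ fuel + system, data = toy)
coefs <- coef(fit)
coefs                # the redundant dummy is reported as NA
alias(fit)$Complete  # expresses it through the other coefficients
```

`lm()` drops the dependent column rather than failing, which is why fitted values and predictions remain usable even though one coefficient cannot be estimated.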
# Data preparation:
# Extracting PCs, binding with leftover categorical variables:
pca_scores <- as.data.frame(pca$x)
PCR_data <- bind_cols(
  price = data$price,
  pca_scores,
  data.categorical
)
# Model estimation:
pcr_partial <- lm(price ~ PC1 + PC2 + PC3 + fueltype + aspiration + doornumber + carbody + drivewheel + enginelocation + enginetype + cylindernumber + fuelsystem,
PCR_data)
summary(pcr_partial)
##
## Call:
## lm(formula = price ~ PC1 + PC2 + PC3 + fueltype + aspiration +
## doornumber + carbody + drivewheel + enginelocation + enginetype +
## cylindernumber + fuelsystem, data = PCR_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7203 -1267 0 1090 12530
##
## Coefficients: (2 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 38672.62 3066.14 12.613 < 2e-16 ***
## PC1 1690.30 174.30 9.698 < 2e-16 ***
## PC2 631.44 352.85 1.790 0.075268 .
## PC3 551.80 280.00 1.971 0.050341 .
## fueltypegas -2911.87 1895.95 -1.536 0.126395
## aspirationturbo 1257.79 740.63 1.698 0.091246 .
## doornumbertwo 416.63 631.58 0.660 0.510342
## carbodyhardtop -3102.52 1525.84 -2.033 0.043539 *
## carbodyhatchback -2894.48 1317.16 -2.198 0.029304 *
## carbodysedan -2238.37 1370.59 -1.633 0.104247
## carbodywagon -3668.91 1520.65 -2.413 0.016873 *
## drivewheelfwd 450.75 1088.43 0.414 0.679291
## drivewheelrwd 1910.12 1226.20 1.558 0.121108
## enginelocationrear 11260.12 2370.50 4.750 4.23e-06 ***
## enginetypedohcv -9341.18 3438.78 -2.716 0.007266 **
## enginetypel -915.98 1408.91 -0.650 0.516462
## enginetypeohc 2559.76 985.06 2.599 0.010164 *
## enginetypeohcf 20.89 1411.76 0.015 0.988212
## enginetypeohcv -5295.82 1353.59 -3.912 0.000131 ***
## enginetyperotor -19237.66 3616.69 -5.319 3.17e-07 ***
## cylindernumberfive -20001.49 2280.05 -8.772 1.57e-15 ***
## cylindernumberfour -23815.22 2196.70 -10.841 < 2e-16 ***
## cylindernumbersix -15311.75 1842.36 -8.311 2.61e-14 ***
## cylindernumberthree -12714.06 4112.85 -3.091 0.002321 **
## cylindernumbertwelve -2215.77 3112.17 -0.712 0.477438
## cylindernumbertwo NA NA NA NA
## fuelsystem2bbl -1136.13 923.68 -1.230 0.220355
## fuelsystem4bbl -3946.73 3254.01 -1.213 0.226819
## fuelsystemidi NA NA NA NA
## fuelsystemmfi -3801.52 2986.71 -1.273 0.204783
## fuelsystemmpfi -1566.60 1042.99 -1.502 0.134903
## fuelsystemspdi -3722.17 1443.05 -2.579 0.010724 *
## fuelsystemspfi -3094.26 2874.26 -1.077 0.283175
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2675 on 174 degrees of freedom
## Multiple R-squared: 0.9043, Adjusted R-squared: 0.8879
## F-statistic: 54.84 on 30 and 174 DF, p-value: < 2.2e-16
The same caveats as for OLS apply here. Let’s compare the prediction quality.
To assess the prediction quality of these two models correctly, I applied 10-fold cross-validation, so that each fold uses a 90%/10% split between the training and testing subsets.
It is important to note that when making predictions on the testing subset, one must apply exactly the same PCA transformation that was learned from the training subset. (The results will differ slightly from the PCA on the whole dataset analyzed before, because for every fold the PCA is performed again on that fold’s training subset only.)
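In base R, this train-then-apply step can be sketched with `prcomp()` and its `predict()` method, which reuses the centring, scaling and rotation stored from the training data. The split, the number of components `k` and the use of `mtcars` are all assumptions made for illustration; this shows a single fold, not the full procedure used below.

```r
# Stand-in data and a deterministic split (both assumptions for illustration).
vars <- c("disp", "hp", "wt", "drat", "qsec")
test_rows <- seq(1, nrow(mtcars), by = 4)   # every 4th car is held out
train <- mtcars[-test_rows, ]
test  <- mtcars[test_rows, ]

# 1. Learn centring, scaling and rotation on the training rows only.
pca_train <- prcomp(train[, vars], center = TRUE, scale. = TRUE)

# 2. Apply that same transformation to both subsets.
k <- 2                                      # hypothetical number of components
train_pc <- data.frame(mpg = train$mpg, predict(pca_train, train[, vars])[, 1:k])
test_pc  <- data.frame(mpg = test$mpg,  predict(pca_train, test[, vars])[, 1:k])

# 3. Fit on the training scores, evaluate on the held-out scores.
fit  <- lm(mpg ~ ., data = train_pc)
pred <- predict(fit, newdata = test_pc)
rmse <- sqrt(mean((test_pc$mpg - pred)^2))
mae  <- mean(abs(test_pc$mpg - pred))

# Equivalent manual projection of the test rows with the training parameters.
manual <- scale(test[, vars], center = pca_train$center, scale = pca_train$scale) %*%
  pca_train$rotation
```

Running a separate PCA on the test rows would instead produce components with different means, scales and directions, so the regression coefficients learned on the training scores would no longer apply.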
# caret package can be used in case of the simple linear regression
# 10-fold cross-validation settings
set.seed(2026)
cv_settings <- trainControl(method = "cv", number = 10)
ols.cv <- train(price ~ symboling + fueltype + aspiration +
doornumber + carbody + drivewheel + enginelocation + wheelbase +
carlength + carwidth + carheight + curbweight + enginetype +
cylindernumber + enginesize + fuelsystem + boreratio + stroke +
compressionratio + horsepower + peakrpm + citympg + highwaympg,
data = data,
method = "lm",
trControl = cv_settings)
ols.results <- ols.cv$resample %>%
summarise(
MAE = mean(MAE),
RMSE = mean(RMSE),
Rsquared = mean(Rsquared)
)
# trainControl(method = "cv", number = 10) sets up 10-fold cross-validation: it splits the data into 10 chunks, trains on 9, tests on 1, and rotates until every chunk has been held out once
# train(..., method = "lm") fits the OLS model repeatedly across those 10 folds and averages the prediction error
Following my earlier analysis, I chose the first three principal components as regressors for the partial PCR. For comparison, models with 2 and 4 PCs were also fitted.
# caret package doesn't support partial PCR, so it has to be done manually in tidymodels
data.numeric <- data[, sapply(data, is.numeric) & names(data) != "symboling" & names(data) != "car_ID" & names(data) != "price"]
data.categorical <- data[, sapply(data, is.character) & names(data) != "CarName"]
set.seed(2026)
folds <- vfold_cv(data, v = 10)
# PCA applied only on numeric regressors
rec <- recipe(price ~ ., data = data) %>%
step_select(price, all_of(names(data.numeric)), all_of(names(data.categorical))) %>%
step_normalize(all_of(names(data.numeric))) %>% # skewness transformation omitted for code simplicity
step_pca(all_of(names(data.numeric)), num_comp = 3)# 3 PCs
# Model
lin_mod <- linear_reg() %>% set_engine("lm")
# Workflow
pcr_partial_wf <- workflow() %>%
add_recipe(rec) %>%
add_model(lin_mod)
# CV - PCA is refit on each training fold internally
pcr_partial_cv <- fit_resamples(
pcr_partial_wf,
resamples = folds,
metrics = metric_set(mae, rmse, rsq)
)
## MAE RMSE Rsquared
## 1 1921.803 2651.101 0.8824575
## # A tibble: 3 × 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 mae standard 1956. 10 85.3 pre0_mod0_post0
## 2 rmse standard 2666. 10 170. pre0_mod0_post0
## 3 rsq standard 0.884 10 0.0183 pre0_mod0_post0
## # A tibble: 3 × 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 mae standard 1980. 10 82.3 pre0_mod0_post0
## 2 rmse standard 2694. 10 162. pre0_mod0_post0
## 3 rsq standard 0.882 10 0.0186 pre0_mod0_post0
## # A tibble: 3 × 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 mae standard 1967. 10 92.7 pre0_mod0_post0
## 2 rmse standard 2667. 10 166. pre0_mod0_post0
## 3 rsq standard 0.884 10 0.0170 pre0_mod0_post0
Following cross-validation, the classical OLS regression model shows marginally lower prediction error than the PCR approach, with a slightly lower Mean Absolute Error (about 1922 vs 1956) and Root Mean Square Error (about 2651 vs 2666), while its R-squared is marginally lower (0.882 vs 0.884). These gaps are smaller than the fold-to-fold standard errors reported for PCR, so the difference between the best-performing PCR model and OLS is quite modest, suggesting that PCR remains a competitive alternative in contexts where dimensionality reduction or multicollinearity is a primary concern.
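To put the size of these gaps in context, the cross-validated metrics can be set side by side and the differences expressed in units of the fold standard error. The numbers are copied from the output above; the PCR values come from the first `fit_resamples()` result block, which is assumed here to be the three-component model.

```r
# Cross-validated metrics copied from the output above. The PCR values are from
# the first result block, assumed to be the 3-component model.
ols    <- c(mae = 1921.803, rmse = 2651.101, rsq = 0.8824575)
pcr    <- c(mae = 1956,     rmse = 2666,     rsq = 0.884)
pcr_se <- c(mae = 85.3,     rmse = 170,      rsq = 0.0183)  # std_err across folds

comparison <- data.frame(
  ols        = ols,
  pcr        = pcr,
  difference = pcr - ols,
  diff_in_se = (pcr - ols) / pcr_se  # gap expressed in PCR fold standard errors
)
round(comparison, 3)
```

Each gap is well under one standard error, so this comparison alone does not separate the two models.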
As anticipated, three principal components are the optimal configuration for the PCR models, but two perform similarly. Selecting an appropriate number of components is an important step in this process. Retaining too few risks losing meaningful variance in the data, while including too many undermines the core benefit of the method. In cases where interpretability is less critical and dimensionality reduction is prioritized, PCR continues to offer a valuable and principled analytical framework.