This report presents a Principal Component Analysis (PCA) of the cars2004 dataset, which contains 428 cars from the 2004 model year (425 unique model names) with 19 variables per car. The goal is to reduce dimensionality while retaining most of the variance in the data, and to interpret the underlying structure of the dataset.
# Load the dataset
data <- read.csv("D:/Desktop 25-07-02/Master/semester 2/ADVANCED ENGINEERING DATA ANALYSIS/Excerises/cars2004.csv",
header = TRUE)
# Dimensions
cat("Dimensions:", dim(data), "\n")
## Dimensions: 428 19
# First rows
head(data)
## Name Sports SUV Wagon Minivan Pickup AWD RWD Retail
## 1 Chevrolet Aveo 4dr 0 0 0 0 0 0 0 11690
## 2 Chevrolet Aveo LS 4dr hatch 0 0 0 0 0 0 0 12585
## 3 Chevrolet Cavalier 2dr 0 0 0 0 0 0 0 14610
## 4 Chevrolet Cavalier 4dr 0 0 0 0 0 0 0 14810
## 5 Chevrolet Cavalier LS 2dr 0 0 0 0 0 0 0 16385
## 6 Dodge Neon SE 4dr 0 0 0 0 0 0 0 13670
## Dealer Engine Cylinders Horsepower CityMPG HighwayMPG Weight WheelBase Length
## 1 10965 1.6 4 103 28 34 2370 98 167
## 2 11802 1.6 4 103 28 34 2348 98 153
## 3 13697 2.2 4 140 26 37 2617 104 183
## 4 13884 2.2 4 140 26 37 2676 104 183
## 5 15357 2.2 4 140 26 37 2617 104 183
## 6 12849 2.0 4 132 29 36 2581 105 174
## Width
## 1 66
## 2 66
## 3 69
## 4 68
## 5 69
## 6 67
# Structure of the data
str(data)
## 'data.frame': 428 obs. of 19 variables:
## $ Name : chr "Chevrolet Aveo 4dr" "Chevrolet Aveo LS 4dr hatch" "Chevrolet Cavalier 2dr" "Chevrolet Cavalier 4dr" ...
## $ Sports : int 0 0 0 0 0 0 0 0 0 0 ...
## $ SUV : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Wagon : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Minivan : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Pickup : int 0 0 0 0 0 0 0 0 0 0 ...
## $ AWD : int 0 0 0 0 0 0 0 0 0 0 ...
## $ RWD : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Retail : int 11690 12585 14610 14810 16385 13670 15040 13270 13730 15460 ...
## $ Dealer : int 10965 11802 13697 13884 15357 12849 14086 12482 12906 14496 ...
## $ Engine : num 1.6 1.6 2.2 2.2 2.2 2 2 2 2 2 ...
## $ Cylinders : int 4 4 4 4 4 4 4 4 4 4 ...
## $ Horsepower: int 103 103 140 140 140 132 132 130 110 130 ...
## $ CityMPG : int 28 28 26 26 26 29 29 26 27 26 ...
## $ HighwayMPG: int 34 34 37 37 37 36 36 33 36 33 ...
## $ Weight : int 2370 2348 2617 2676 2617 2581 2626 2612 2606 2606 ...
## $ WheelBase : int 98 98 104 104 104 105 105 103 103 103 ...
## $ Length : int 167 153 183 183 183 174 174 168 168 168 ...
## $ Width : int 66 66 69 68 69 67 67 67 67 67 ...
# Statistical summary
summary(data)
## Name Sports SUV Wagon
## Length :428 Min. :0.00000 Min. :0.0000 Min. :0.00000
## N.unique :425 1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.:0.00000
## N.blank : 0 Median :0.00000 Median :0.0000 Median :0.00000
## Min.nchar: 8 Mean :0.08081 Mean :0.1402 Mean :0.07009
## Max.nchar: 32 3rd Qu.:0.00000 3rd Qu.:0.0000 3rd Qu.:0.00000
## Max. :1.00000 Max. :1.0000 Max. :1.00000
## NAs :32
## Minivan Pickup AWD RWD
## Min. :0.00000 Min. :0.00000 Min. :0.000 Min. :0.000
## 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.000 1st Qu.:0.000
## Median :0.00000 Median :0.00000 Median :0.000 Median :0.000
## Mean :0.04673 Mean :0.05607 Mean :0.215 Mean :0.257
## 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.000 3rd Qu.:1.000
## Max. :1.00000 Max. :1.00000 Max. :1.000 Max. :1.000
##
## Retail Dealer Engine Cylinders
## Min. : 10280 Min. : 9875 Min. :1.300 Min. :-1.000
## 1st Qu.: 20334 1st Qu.: 18866 1st Qu.:2.375 1st Qu.: 4.000
## Median : 27635 Median : 25295 Median :3.000 Median : 6.000
## Mean : 32775 Mean : 30015 Mean :3.197 Mean : 5.776
## 3rd Qu.: 39205 3rd Qu.: 35710 3rd Qu.:3.900 3rd Qu.: 6.000
## Max. :192465 Max. :173560 Max. :8.300 Max. :12.000
##
## Horsepower CityMPG HighwayMPG Weight WheelBase
## Min. : 73.0 Min. :10.00 Min. :12.00 Min. :1850 Min. : 89.0
## 1st Qu.:165.0 1st Qu.:17.00 1st Qu.:24.00 1st Qu.:3102 1st Qu.:103.0
## Median :210.0 Median :19.00 Median :26.00 Median :3474 Median :107.0
## Mean :215.9 Mean :20.09 Mean :26.91 Mean :3577 Mean :108.2
## 3rd Qu.:255.0 3rd Qu.:21.00 3rd Qu.:29.00 3rd Qu.:3974 3rd Qu.:112.0
## Max. :500.0 Max. :60.00 Max. :66.00 Max. :7190 Max. :144.0
## NAs :14 NAs :14 NAs :2 NAs :2
## Length Width
## Min. :143.0 Min. :64.00
## 1st Qu.:177.0 1st Qu.:69.00
## Median :186.0 Median :71.00
## Mean :185.1 Mean :71.29
## 3rd Qu.:193.0 3rd Qu.:73.00
## Max. :227.0 Max. :81.00
## NAs :26 NAs :28
The dataset contains 428 observations and 19 variables, including one character variable (car name), seven binary indicators, and eleven continuous numerical variables.
# Count missing values per column
missing_vals <- colSums(is.na(data))
missing_vals[missing_vals > 0]
## Sports CityMPG HighwayMPG Weight WheelBase Length Width
## 32 14 14 2 2 26 28
# Remove rows with missing values
data_clean <- na.omit(data)
cat("Rows after removing missing values:", nrow(data_clean), "\n")
## Rows after removing missing values: 358
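Calling na.omit() on the full data frame also drops rows whose only missing value sits in a column that is later excluded from the PCA (Sports alone has 32 missing values). The following is a minimal sketch of restricting the completeness check to the analysis columns. It uses toy data, and the names `toy` and `pca_cols` are hypothetical; applying it to the real data would change the row count and every downstream result.

```r
# Toy data: row 1 is missing only Sports, row 2 is missing Weight
toy <- data.frame(
  Sports = c(NA, 0, 1, 0),
  Weight = c(2370, NA, 3474, 3974),
  Length = c(167, 183, 186, 193)
)

# Hypothetical list of the columns that will enter the PCA
pca_cols <- c("Weight", "Length")

# Keep rows that are complete in the analysis columns only
keep <- complete.cases(toy[, pca_cols])
toy_clean <- toy[keep, ]

rows_full_omit <- nrow(na.omit(toy))  # rows left when every column must be complete
rows_targeted  <- nrow(toy_clean)     # rows left when only pca_cols must be complete
cat("na.omit on all columns:", rows_full_omit, "\n")
cat("complete in PCA columns:", rows_targeted, "\n")
```

In this toy case the targeted check keeps one more row than na.omit(), because the row that is missing only Sports is retained.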
PCA requires numerical input. We exclude the car name (character) and all binary indicator variables (Sports, SUV, Wagon, Minivan, Pickup, AWD, RWD), keeping only the 11 continuous numerical features.
# Select numeric columns automatically
data_num <- data_clean[, sapply(data_clean, is.numeric)]
# Remove binary columns (only 2 unique values)
binary_cols <- sapply(data_num, function(x) length(unique(x)) == 2)
data_num <- data_num[, !binary_cols]
cat("Numerical variables selected:\n")
## Numerical variables selected:
colnames(data_num)
## [1] "Pickup" "Retail" "Dealer" "Engine" "Cylinders"
## [6] "Horsepower" "CityMPG" "HighwayMPG" "Weight" "WheelBase"
## [11] "Length" "Width"
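# Remove columns with zero variance (Pickup is constant after na.omit, so it has
# one unique value and passes the two-value binary filter above)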
zero_var_cols <- apply(data_num, 2, var) == 0
if (any(zero_var_cols)) {
cat("Removed zero-variance columns:", names(which(zero_var_cols)), "\n")
data_num <- data_num[, !zero_var_cols]
} else {
cat("No zero-variance columns found.\n")
}
## Removed zero-variance columns: Pickup
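The binary filter and the zero-variance filter above can be combined into one step by keeping only numeric columns with more than two distinct values. This is a sketch on toy data; `select_continuous` is a hypothetical helper, not part of the original analysis.

```r
# Hypothetical helper: keep numeric columns with more than two distinct
# non-missing values (drops character, binary, and constant columns)
select_continuous <- function(df) {
  is_num <- vapply(df, is.numeric, logical(1))
  n_distinct <- vapply(df, function(x) length(unique(x[!is.na(x)])), integer(1))
  df[, is_num & n_distinct > 2, drop = FALSE]
}

toy <- data.frame(
  Name   = c("a", "b", "c", "d"),      # character column
  SUV    = c(0, 1, 0, 1),              # binary indicator
  Pickup = c(0, 0, 0, 0),              # constant column
  Weight = c(2370, 2617, 3474, 3974)   # continuous variable
)

selected <- colnames(select_continuous(toy))
print(selected)
```

Only Weight survives in this toy example. One caveat of this rule is that it would also drop a genuinely continuous variable that happens to take only two values in a small sample.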
# Boxplots to visualize outliers
par(mfrow = c(3, 4), mar = c(3, 3, 2, 1))
for (col in colnames(data_num)) {
boxplot(data_num[[col]], main = col, col = "steelblue", outline = TRUE)
}
par(mfrow = c(1, 1))
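Several variables (e.g., Retail, Dealer, Horsepower) show outliers. These are consistent with the natural spread of the car market, where high-end vehicles take extreme values, and they are kept because they represent real observations rather than data errors. One value to check separately is the Cylinders minimum of -1 in the summary above: it is not a valid cylinder count and looks like a placeholder code, so any such row that survives na.omit should be recoded or removed before PCA.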
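The summary above reports a Cylinders minimum of -1, which cannot be a real count, alongside genuine extremes such as the maximum Retail price. A minimal sketch of telling the two apart, on toy data: impossible values are recoded to NA, while extreme but plausible values are only flagged. The threshold `<= 0` and the data frame `toy` are assumptions made for illustration.

```r
toy <- data.frame(
  Cylinders = c(4, 6, -1, 8, 12),
  Retail    = c(11690, 27635, 32775, 39205, 192465)
)

# Impossible values: a cylinder count must be positive
invalid_cyl <- which(toy$Cylinders <= 0)
toy$Cylinders[invalid_cyl] <- NA

# Extreme but plausible values: the points a boxplot would draw as outliers
flagged <- boxplot.stats(toy$Retail)$out

print(invalid_cyl)
print(flagged)
```

boxplot.stats() applies the same 1.5 x IQR rule that boxplot() uses to draw individual points, so the flagged values match what the boxplots show.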
PCA is sensitive to scale. Since the variables have different units (US dollars, liters, pounds, inches, etc.), we standardize the data to zero mean and unit variance before applying PCA.
# Standardize the data
data_scaled <- scale(data_num)
# Confirm scaling
round(apply(data_scaled, 2, mean), 5) # Should be ~0
## Retail Dealer Engine Cylinders Horsepower CityMPG HighwayMPG
## 0 0 0 0 0 0 0
## Weight WheelBase Length Width
## 0 0 0 0
round(apply(data_scaled, 2, var), 5) # Should be ~1
## Retail Dealer Engine Cylinders Horsepower CityMPG HighwayMPG
## 1 1 1 1 1 1 1
## Weight WheelBase Length Width
## 1 1 1 1
# Run PCA
pca_result <- prcomp(data_scaled, center = FALSE, scale. = FALSE)
# Summary of variance explained
summary(pca_result)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.7011 1.3208 0.91103 0.58903 0.51474 0.42904 0.37854
## Proportion of Variance 0.6633 0.1586 0.07545 0.03154 0.02409 0.01673 0.01303
## Cumulative Proportion 0.6633 0.8219 0.89731 0.92886 0.95294 0.96968 0.98270
## PC8 PC9 PC10 PC11
## Standard deviation 0.29898 0.24921 0.19512 0.02642
## Proportion of Variance 0.00813 0.00565 0.00346 0.00006
## Cumulative Proportion 0.99083 0.99648 0.99994 1.00000
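The call above passes center = FALSE and scale. = FALSE because the data were already standardized. PCA on standardized data is equivalent to an eigendecomposition of the correlation matrix. The sketch below checks this on R's built-in mtcars data, used here only as a stand-in because it ships with R; the variable subset is arbitrary.

```r
X <- mtcars[, c("mpg", "disp", "hp", "wt")]

# Route 1: standardize first, then run PCA without further centering or scaling
pca_prescaled <- prcomp(scale(X), center = FALSE, scale. = FALSE)

# Route 2: let prcomp standardize internally
pca_internal <- prcomp(X, center = TRUE, scale. = TRUE)

# Route 3: eigenvalues of the correlation matrix
eig <- eigen(cor(X))

same_as_internal <- isTRUE(all.equal(pca_prescaled$sdev, pca_internal$sdev))
same_as_eigen    <- isTRUE(all.equal(pca_prescaled$sdev^2, eig$values))
print(c(same_as_internal, same_as_eigen))
```

Because each standardized variable has unit variance, the component variances sum to the number of variables, which is why the proportions in the summary are the eigenvalues divided by 11.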
library(factoextra)
fviz_eig(pca_result,
addlabels = TRUE,
ylim = c(0, 70), # PC1 explains 66.3%, so the axis must extend past 65
barfill = "steelblue",
barcolor = "steelblue",
linecolor = "darkred",
main = "Scree Plot — Variance Explained by Each PC")
var_explained <- pca_result$sdev^2 / sum(pca_result$sdev^2)
cumvar <- cumsum(var_explained)
# Table of variance explained
variance_table <- data.frame(
PC = paste0("PC", 1:length(var_explained)),
Variance_Explained = round(var_explained * 100, 2),
Cumulative = round(cumvar * 100, 2)
)
print(variance_table)
## PC Variance_Explained Cumulative
## 1 PC1 66.33 66.33
## 2 PC2 15.86 82.19
## 3 PC3 7.55 89.73
## 4 PC4 3.15 92.89
## 5 PC5 2.41 95.29
## 6 PC6 1.67 96.97
## 7 PC7 1.30 98.27
## 8 PC8 0.81 99.08
## 9 PC9 0.56 99.65
## 10 PC10 0.35 99.99
## 11 PC11 0.01 100.00
Based on the scree plot and cumulative variance table, we retain the first 2 principal components, which together explain approximately 82.2% of the total variance. This satisfies the common threshold of 80%.
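The choice of two components can be reproduced from the standard deviations printed in the PCA summary. The sketch below applies the 80% cumulative-variance threshold and, as a cross-check that is not part of the original report, the Kaiser criterion (eigenvalue greater than 1). The values of `sdev` are copied from the output above.

```r
# Standard deviations of PC1..PC11 as reported by summary(pca_result)
sdev <- c(2.7011, 1.3208, 0.91103, 0.58903, 0.51474, 0.42904,
          0.37854, 0.29898, 0.24921, 0.19512, 0.02642)

eigenvalues <- sdev^2
cum_prop <- cumsum(eigenvalues) / sum(eigenvalues)

# Smallest number of components reaching 80% cumulative variance
k_80 <- which(cum_prop >= 0.80)[1]

# Kaiser criterion: components whose eigenvalue exceeds 1
k_kaiser <- sum(eigenvalues > 1)

cat("Components for 80% variance:", k_80, "\n")
cat("Components with eigenvalue > 1:", k_kaiser, "\n")
```

Both rules select two components for these values.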
# Show loadings for first 4 PCs
loadings_table <- round(pca_result$rotation[, 1:4], 3)
print(loadings_table)
## PC1 PC2 PC3 PC4
## Retail 0.264 0.487 0.236 -0.258
## Dealer 0.262 0.490 0.238 -0.269
## Engine 0.344 -0.030 0.051 0.511
## Cylinders 0.330 0.073 0.062 0.654
## Horsepower 0.315 0.302 0.055 0.059
## CityMPG -0.308 0.014 0.529 0.214
## HighwayMPG -0.304 0.008 0.600 0.134
## Weight 0.335 -0.161 -0.122 -0.072
## WheelBase 0.280 -0.382 0.281 -0.218
## Length 0.264 -0.388 0.365 -0.212
## Width 0.298 -0.321 0.105 -0.081
sort(abs(pca_result$rotation[, 1]), decreasing = TRUE)
## Engine Weight Cylinders Horsepower CityMPG HighwayMPG Width
## 0.3439500 0.3346780 0.3301586 0.3146367 0.3078987 0.3038309 0.2977170
## WheelBase Length Retail Dealer
## 0.2795422 0.2641983 0.2636420 0.2616986
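PC1 loads almost uniformly on all eleven variables, with absolute loadings between about 0.26 and 0.34. Engine size, Weight, Cylinders, and Horsepower have the largest loadings, the body dimensions and the two price variables load positively as well, and CityMPG and HighwayMPG load negatively. PC1 is therefore an overall size-and-power axis opposed to fuel economy: cars with high PC1 scores are large, heavy, powerful, and less fuel-efficient (e.g., SUVs), while low PC1 scores correspond to compact, lighter, more economical cars. The sign of a principal component is arbitrary, so only the contrast between the two ends is meaningful.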
sort(abs(pca_result$rotation[, 2]), decreasing = TRUE)
## Dealer Retail Length WheelBase Width Horsepower Weight
## 0.48981064 0.48677615 0.38825833 0.38159834 0.32138359 0.30204950 0.16063928
## Cylinders Engine CityMPG HighwayMPG
## 0.07290257 0.02960226 0.01353944 0.00848625
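PC2 captures a contrast between price and body size. Retail and Dealer price (about 0.49 each) and Horsepower (0.30) load positively, while Length (-0.39), WheelBase (-0.38), and Width (-0.32) load negatively. CityMPG and HighwayMPG have loadings near zero (0.014 and 0.008), so fuel efficiency plays almost no role in this component. High PC2 scores indicate cars that are expensive and powerful for their size; low scores indicate large but comparatively inexpensive vehicles.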
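Sorting by absolute value, as done above, hides the sign of each loading, and the sign is what defines a contrast. A minimal sketch of splitting a component into its positive and negative poles, using the PC2 loadings copied from the loadings table; the cutoff of 0.3 is an arbitrary illustrative threshold, not a standard.

```r
# PC2 loadings as printed in the loadings table above
pc2 <- c(Retail = 0.487, Dealer = 0.490, Engine = -0.030, Cylinders = 0.073,
         Horsepower = 0.302, CityMPG = 0.014, HighwayMPG = 0.008,
         Weight = -0.161, WheelBase = -0.382, Length = -0.388, Width = -0.321)

cutoff <- 0.3  # assumption: illustrative threshold for a "large" loading

positive_pole <- names(pc2[pc2 >= cutoff])
negative_pole <- names(pc2[pc2 <= -cutoff])

cat("Positive pole:", positive_pole, "\n")
cat("Negative pole:", negative_pole, "\n")
```

With this cutoff the positive pole holds the price variables and Horsepower, the negative pole holds the body dimensions, and neither MPG variable appears on either side.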
# Biplot colored by SUV indicator
fviz_pca_biplot(pca_result,
label = "var",
habillage = as.factor(data_clean$SUV),
addEllipses = TRUE,
ellipse.level = 0.95,
repel = TRUE,
col.var = "darkred",
legend.title = "SUV",
title = "PCA Biplot — Cars 2004 (colored by SUV)")
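The visual impression from the ellipses can be backed up with a numeric summary of the scores per group. The sketch below computes group centroids on the first two components. It uses the built-in mtcars data with the transmission indicator `am` as a stand-in for the SUV indicator, so the numbers say nothing about the cars2004 data.

```r
X <- mtcars[, c("mpg", "disp", "hp", "wt", "qsec")]
pca <- prcomp(X, center = TRUE, scale. = TRUE)

# Scores on the first two components, with a grouping factor attached
scores <- as.data.frame(pca$x[, 1:2])
scores$group <- factor(mtcars$am, labels = c("automatic", "manual"))

# Mean PC1 and PC2 score per group
centroids <- aggregate(cbind(PC1, PC2) ~ group, data = scores, FUN = mean)
print(centroids)
```

For the report's data, the same pattern would apply with pca_result$x and as.factor(data_clean$SUV). Centroids that lie far apart relative to the spread of the scores support a claim of separate groups.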
Biplot Interpretation:
Two principal components explain about 82% of the variance in the dataset, meaning the 11 numerical variables can be summarized reasonably well in 2 dimensions.
PC1 represents car size and power against fuel economy: Weight, Engine, Cylinders, Horsepower, Length, Wheelbase, and Width all load positively on this component, while CityMPG and HighwayMPG load negatively.
PC2 represents price relative to size: the price variables and Horsepower oppose Length, Wheelbase, and Width, separating expensive, powerful cars from large but inexpensive ones. The MPG variables contribute almost nothing to PC2; the trade-off between fuel efficiency and size or power is carried by PC1.
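The biplot, colored by the SUV indicator, reveals groupings: SUVs form a cluster separated from compact and fuel-efficient cars. Pickups cannot be assessed because none remain after removing rows with missing values (the Pickup column has zero variance in the cleaned data).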
Practical insight: Manufacturers and buyers can use these two dimensions to position or compare vehicles — a car’s position in PCA space summarizes its size, power, efficiency, and price simultaneously.
Report prepared for Exercise 1 — Advanced Engineering Data Analysis, Prof. Daniel Fernández, UPC Barcelona Tech.