1 Introduction

This report presents a Principal Component Analysis (PCA) of the cars2004 dataset, which describes 428 cars from the 2004 model year using 19 variables per car. The goal is to reduce dimensionality while retaining the most meaningful variance in the data, and to interpret the underlying structure of the dataset.


2 Data Loading and Exploration

# Load the dataset
data <- read.csv("D:/Desktop 25-07-02/Master/semester 2/ADVANCED ENGINEERING DATA ANALYSIS/Excerises/cars2004.csv",
                 header = TRUE)

# Dimensions
cat("Dimensions:", dim(data), "\n")
## Dimensions: 428 19
# First rows
head(data)
##                          Name Sports SUV Wagon Minivan Pickup AWD RWD Retail
## 1          Chevrolet Aveo 4dr      0   0     0       0      0   0   0  11690
## 2 Chevrolet Aveo LS 4dr hatch      0   0     0       0      0   0   0  12585
## 3      Chevrolet Cavalier 2dr      0   0     0       0      0   0   0  14610
## 4      Chevrolet Cavalier 4dr      0   0     0       0      0   0   0  14810
## 5   Chevrolet Cavalier LS 2dr      0   0     0       0      0   0   0  16385
## 6           Dodge Neon SE 4dr      0   0     0       0      0   0   0  13670
##   Dealer Engine Cylinders Horsepower CityMPG HighwayMPG Weight WheelBase Length
## 1  10965    1.6         4        103      28         34   2370        98    167
## 2  11802    1.6         4        103      28         34   2348        98    153
## 3  13697    2.2         4        140      26         37   2617       104    183
## 4  13884    2.2         4        140      26         37   2676       104    183
## 5  15357    2.2         4        140      26         37   2617       104    183
## 6  12849    2.0         4        132      29         36   2581       105    174
##   Width
## 1    66
## 2    66
## 3    69
## 4    68
## 5    69
## 6    67
# Structure of the data
str(data)
## 'data.frame':    428 obs. of  19 variables:
##  $ Name      : chr  "Chevrolet Aveo 4dr" "Chevrolet Aveo LS 4dr hatch" "Chevrolet Cavalier 2dr" "Chevrolet Cavalier 4dr" ...
##  $ Sports    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ SUV       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Wagon     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Minivan   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Pickup    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ AWD       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ RWD       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Retail    : int  11690 12585 14610 14810 16385 13670 15040 13270 13730 15460 ...
##  $ Dealer    : int  10965 11802 13697 13884 15357 12849 14086 12482 12906 14496 ...
##  $ Engine    : num  1.6 1.6 2.2 2.2 2.2 2 2 2 2 2 ...
##  $ Cylinders : int  4 4 4 4 4 4 4 4 4 4 ...
##  $ Horsepower: int  103 103 140 140 140 132 132 130 110 130 ...
##  $ CityMPG   : int  28 28 26 26 26 29 29 26 27 26 ...
##  $ HighwayMPG: int  34 34 37 37 37 36 36 33 36 33 ...
##  $ Weight    : int  2370 2348 2617 2676 2617 2581 2626 2612 2606 2606 ...
##  $ WheelBase : int  98 98 104 104 104 105 105 103 103 103 ...
##  $ Length    : int  167 153 183 183 183 174 174 168 168 168 ...
##  $ Width     : int  66 66 69 68 69 67 67 67 67 67 ...
# Statistical summary
summary(data)
##         Name         Sports             SUV             Wagon        
##  Length   :428   Min.   :0.00000   Min.   :0.0000   Min.   :0.00000  
##  N.unique :425   1st Qu.:0.00000   1st Qu.:0.0000   1st Qu.:0.00000  
##  N.blank  :  0   Median :0.00000   Median :0.0000   Median :0.00000  
##  Min.nchar:  8   Mean   :0.08081   Mean   :0.1402   Mean   :0.07009  
##  Max.nchar: 32   3rd Qu.:0.00000   3rd Qu.:0.0000   3rd Qu.:0.00000  
##                  Max.   :1.00000   Max.   :1.0000   Max.   :1.00000  
##                  NAs    :32                                          
##     Minivan            Pickup             AWD             RWD       
##  Min.   :0.00000   Min.   :0.00000   Min.   :0.000   Min.   :0.000  
##  1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:0.000   1st Qu.:0.000  
##  Median :0.00000   Median :0.00000   Median :0.000   Median :0.000  
##  Mean   :0.04673   Mean   :0.05607   Mean   :0.215   Mean   :0.257  
##  3rd Qu.:0.00000   3rd Qu.:0.00000   3rd Qu.:0.000   3rd Qu.:1.000  
##  Max.   :1.00000   Max.   :1.00000   Max.   :1.000   Max.   :1.000  
##                                                                     
##      Retail           Dealer           Engine        Cylinders     
##  Min.   : 10280   Min.   :  9875   Min.   :1.300   Min.   :-1.000  
##  1st Qu.: 20334   1st Qu.: 18866   1st Qu.:2.375   1st Qu.: 4.000  
##  Median : 27635   Median : 25295   Median :3.000   Median : 6.000  
##  Mean   : 32775   Mean   : 30015   Mean   :3.197   Mean   : 5.776  
##  3rd Qu.: 39205   3rd Qu.: 35710   3rd Qu.:3.900   3rd Qu.: 6.000  
##  Max.   :192465   Max.   :173560   Max.   :8.300   Max.   :12.000  
##                                                                    
##    Horsepower       CityMPG        HighwayMPG        Weight       WheelBase    
##  Min.   : 73.0   Min.   :10.00   Min.   :12.00   Min.   :1850   Min.   : 89.0  
##  1st Qu.:165.0   1st Qu.:17.00   1st Qu.:24.00   1st Qu.:3102   1st Qu.:103.0  
##  Median :210.0   Median :19.00   Median :26.00   Median :3474   Median :107.0  
##  Mean   :215.9   Mean   :20.09   Mean   :26.91   Mean   :3577   Mean   :108.2  
##  3rd Qu.:255.0   3rd Qu.:21.00   3rd Qu.:29.00   3rd Qu.:3974   3rd Qu.:112.0  
##  Max.   :500.0   Max.   :60.00   Max.   :66.00   Max.   :7190   Max.   :144.0  
##                  NAs    :14      NAs    :14      NAs    :2      NAs    :2      
##      Length          Width      
##  Min.   :143.0   Min.   :64.00  
##  1st Qu.:177.0   1st Qu.:69.00  
##  Median :186.0   Median :71.00  
##  Mean   :185.1   Mean   :71.29  
##  3rd Qu.:193.0   3rd Qu.:73.00  
##  Max.   :227.0   Max.   :81.00  
##  NAs    :26      NAs    :28

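The dataset contains 428 observations and 19 variables: one character variable (car name), seven binary indicators, and eleven numerical variables. Seven columns contain missing values, and Cylinders has a minimum of -1, which is not a valid cylinder count and is presumably a placeholder code.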


3 Data Preprocessing

3.1 Missing Values

# Count missing values per column
missing_vals <- colSums(is.na(data))
missing_vals[missing_vals > 0]
##     Sports    CityMPG HighwayMPG     Weight  WheelBase     Length      Width 
##         32         14         14          2          2         26         28
# Remove rows with missing values
data_clean <- na.omit(data)
cat("Rows after removing missing values:", nrow(data_clean), "\n")
## Rows after removing missing values: 358
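
The per-column counts above sum to 118 missing values, yet only 70 rows are dropped (428 - 358), because a single row can be missing several values at once. The following is a minimal sketch of that relationship. It uses R's built-in airquality data as a stand-in, not the cars data, so the printed numbers will differ from the report.

```r
# Stand-in data: airquality ships with base R and contains NAs
df <- airquality

na_per_col  <- colSums(is.na(df))        # missing values per column
rows_any_na <- sum(!complete.cases(df))  # rows with at least one NA
df_clean    <- na.omit(df)               # drops exactly those rows

# Rows dropped by na.omit() equal the rows with any NA,
# which is at most the sum of the per-column counts
cat("NAs per column:", na_per_col[na_per_col > 0], "\n")
cat("Rows with any NA:", rows_any_na, "\n")
cat("Rows kept:", nrow(df_clean), "\n")
```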

3.2 Selecting Numerical Variables

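PCA requires numerical input. We exclude the car name (character) and the binary indicator variables (Sports, SUV, Wagon, Minivan, Pickup, AWD, RWD), with the aim of keeping only the 11 numerical features. The filter below drops columns with exactly two distinct values, so it misses Pickup: every pickup row contained a missing value and was dropped in Section 3.1, which leaves Pickup constant at 0. It is removed as a zero-variance column in Section 3.3.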

# Select numeric columns automatically
data_num <- data_clean[, sapply(data_clean, is.numeric)]

# Remove binary columns (only 2 unique values)
binary_cols <- sapply(data_num, function(x) length(unique(x)) == 2)
data_num <- data_num[, !binary_cols]

cat("Numerical variables selected:\n")
## Numerical variables selected:
colnames(data_num)
##  [1] "Pickup"     "Retail"     "Dealer"     "Engine"     "Cylinders" 
##  [6] "Horsepower" "CityMPG"    "HighwayMPG" "Weight"     "WheelBase" 
## [11] "Length"     "Width"
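
Pickup appears in the list above because it has only one distinct value after row removal, so the "exactly two unique values" test does not flag it. One possible alternative, sketched here on hypothetical toy data (the data frame `toy` and its column names are illustrative only), is to test whether every value is 0 or 1. This assumes missing values have already been removed.

```r
# Hypothetical toy data: 'flag_const' is a 0/1 indicator whose 1s were all dropped
toy <- data.frame(
  flag       = c(0, 1, 0, 1),
  flag_const = c(0, 0, 0, 0),
  price      = c(11690, 12585, 14610, 14810)
)

# Treat a column as an indicator if all of its values are 0 or 1;
# this also catches indicators that have become constant
is_indicator <- sapply(toy, function(x) all(x %in% c(0, 1)))
toy_num <- toy[, !is_indicator, drop = FALSE]

colnames(toy_num)
```

With this test the constant indicator is dropped together with the ordinary one, and only the genuinely numerical column remains.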

3.3 Removing Zero-Variance Columns

# Remove columns with zero variance
zero_var_cols <- apply(data_num, 2, var) == 0
if (any(zero_var_cols)) {
  cat("Removed zero-variance columns:", names(which(zero_var_cols)), "\n")
  data_num <- data_num[, !zero_var_cols]
} else {
  cat("No zero-variance columns found.\n")
}
## Removed zero-variance columns: Pickup

3.4 Outlier Detection

# Boxplots to visualize outliers
par(mfrow = c(3, 4), mar = c(3, 3, 2, 1))
for (col in colnames(data_num)) {
  boxplot(data_num[[col]], main = col, col = "steelblue", outline = TRUE)
}
par(mfrow = c(1, 1))

Several variables (e.g., Retail, Dealer, Horsepower) show outliers, which are consistent with the natural spread of the car market — high-end vehicles with extreme values. These are kept as they represent real observations, not data errors.
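
To put numbers on what the boxplots show, the points drawn beyond the whiskers can be counted per variable. This is a minimal sketch on the built-in mtcars data, used as a stand-in for data_num; the variable selection is an assumption made for illustration.

```r
# Stand-in data: a few numeric columns of the built-in mtcars data set
num <- mtcars[, c("mpg", "hp", "wt", "qsec")]

# boxplot.stats()$out returns the values lying beyond the whiskers,
# i.e. more than 1.5 box lengths away from the box (the default coef)
n_out <- sapply(num, function(x) length(boxplot.stats(x)$out))
print(n_out)
```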


4 Principal Component Analysis

4.1 Scaling and Centering

PCA is sensitive to scale. Since the variables have different units (US dollars, liters, pounds, inches, etc.), we standardize the data to zero mean and unit variance before applying PCA.

# Standardize the data
data_scaled <- scale(data_num)

# Confirm scaling
round(apply(data_scaled, 2, mean), 5)  # Should be ~0
##     Retail     Dealer     Engine  Cylinders Horsepower    CityMPG HighwayMPG 
##          0          0          0          0          0          0          0 
##     Weight  WheelBase     Length      Width 
##          0          0          0          0
round(apply(data_scaled, 2, var), 5)   # Should be ~1
##     Retail     Dealer     Engine  Cylinders Horsepower    CityMPG HighwayMPG 
##          1          1          1          1          1          1          1 
##     Weight  WheelBase     Length      Width 
##          1          1          1          1
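
Standardization replaces each value x by (x - mean) / sd, computed column by column. The sketch below, on mtcars as stand-in data, checks that a manual computation matches scale(), and that running prcomp() on pre-scaled data gives the same component standard deviations as letting prcomp() centre and scale internally.

```r
x <- as.matrix(mtcars[, c("mpg", "disp", "hp", "wt")])  # stand-in data

# Standardize by hand: subtract column means, divide by column standard deviations
z_manual <- sweep(sweep(x, 2, colMeans(x)), 2, apply(x, 2, sd), "/")
z_scale  <- scale(x)

# Same values as scale(), ignoring the attributes scale() attaches
all.equal(z_manual, z_scale, check.attributes = FALSE)

# PCA on pre-scaled data matches PCA that centres and scales internally
p1 <- prcomp(z_scale, center = FALSE, scale. = FALSE)
p2 <- prcomp(x, center = TRUE, scale. = TRUE)
all.equal(p1$sdev, p2$sdev)
```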

4.2 Applying PCA

# Run PCA
pca_result <- prcomp(data_scaled, center = FALSE, scale. = FALSE)

# Summary of variance explained
summary(pca_result)
## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6     PC7
## Standard deviation     2.7011 1.3208 0.91103 0.58903 0.51474 0.42904 0.37854
## Proportion of Variance 0.6633 0.1586 0.07545 0.03154 0.02409 0.01673 0.01303
## Cumulative Proportion  0.6633 0.8219 0.89731 0.92886 0.95294 0.96968 0.98270
##                            PC8     PC9    PC10    PC11
## Standard deviation     0.29898 0.24921 0.19512 0.02642
## Proportion of Variance 0.00813 0.00565 0.00346 0.00006
## Cumulative Proportion  0.99083 0.99648 0.99994 1.00000
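
The standard deviations in this summary are the square roots of the eigenvalues of the correlation matrix, and with 11 standardized variables the eigenvalues sum to 11. For PC1 this gives 2.7011^2 / 11 ≈ 0.663, which matches the reported proportion. A minimal sketch of the same relationship on mtcars (stand-in data with 4 variables, so the eigenvalues sum to 4):

```r
x <- scale(mtcars[, c("mpg", "disp", "hp", "wt")])  # stand-in data, standardized

pca <- prcomp(x, center = FALSE, scale. = FALSE)
eig <- eigen(cor(x))  # eigendecomposition of the correlation matrix

# Squared standard deviations of the PCs are the eigenvalues
all.equal(pca$sdev^2, eig$values)

# With p standardized variables the eigenvalues sum to p,
# so each proportion of variance is eigenvalue / p
sum(eig$values)
pca$sdev^2 / ncol(x)
```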

5 Dimensionality Reduction — How Many Components?

5.1 Scree Plot

library(factoextra)

fviz_eig(pca_result,
         addlabels = TRUE,
         ylim = c(0, 70),  # PC1 explains 66.3%, so the axis must extend past 65
         barfill = "steelblue",
         barcolor = "steelblue",
         linecolor = "darkred",
         main = "Scree Plot — Variance Explained by Each PC")

5.2 Cumulative Variance

var_explained <- pca_result$sdev^2 / sum(pca_result$sdev^2)
cumvar <- cumsum(var_explained)

# Table of variance explained
variance_table <- data.frame(
  PC = paste0("PC", 1:length(var_explained)),
  Variance_Explained = round(var_explained * 100, 2),
  Cumulative = round(cumvar * 100, 2)
)
print(variance_table)
##      PC Variance_Explained Cumulative
## 1   PC1              66.33      66.33
## 2   PC2              15.86      82.19
## 3   PC3               7.55      89.73
## 4   PC4               3.15      92.89
## 5   PC5               2.41      95.29
## 6   PC6               1.67      96.97
## 7   PC7               1.30      98.27
## 8   PC8               0.81      99.08
## 9   PC9               0.56      99.65
## 10 PC10               0.35      99.99
## 11 PC11               0.01     100.00

Based on the scree plot and cumulative variance table, we retain the first 2 principal components, which together explain approximately 82.2% of the total variance. This satisfies the common threshold of 80%.
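
The choice of two components can be reproduced from the standard deviations printed in Section 4.2. The sketch below applies the 80% cumulative-variance rule used in this report and, as an additional cross-check that the report itself does not use, the Kaiser rule (keep components whose eigenvalue exceeds 1).

```r
# Standard deviations as printed by summary(pca_result) in Section 4.2
sdev <- c(2.7011, 1.3208, 0.91103, 0.58903, 0.51474, 0.42904,
          0.37854, 0.29898, 0.24921, 0.19512, 0.02642)

eigenvalues <- sdev^2
cumvar <- cumsum(eigenvalues) / sum(eigenvalues)

k_80     <- which(cumvar >= 0.80)[1]  # smallest k reaching 80% cumulative variance
k_kaiser <- sum(eigenvalues > 1)      # Kaiser rule: eigenvalue above 1

cat("Components for 80% variance:", k_80, "\n")
cat("Components with eigenvalue > 1:", k_kaiser, "\n")
```

Both rules select 2 components for these values.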


6 Interpretation of Principal Components

6.1 Loadings Table

# Show loadings for the first 4 PCs
loadings_table <- round(pca_result$rotation[, 1:4], 3)
print(loadings_table)
##               PC1    PC2    PC3    PC4
## Retail      0.264  0.487  0.236 -0.258
## Dealer      0.262  0.490  0.238 -0.269
## Engine      0.344 -0.030  0.051  0.511
## Cylinders   0.330  0.073  0.062  0.654
## Horsepower  0.315  0.302  0.055  0.059
## CityMPG    -0.308  0.014  0.529  0.214
## HighwayMPG -0.304  0.008  0.600  0.134
## Weight      0.335 -0.161 -0.122 -0.072
## WheelBase   0.280 -0.382  0.281 -0.218
## Length      0.264 -0.388  0.365 -0.212
## Width       0.298 -0.321  0.105 -0.081
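
One way to read a loadings table is to list, for each component, the variables with large positive and large negative loadings. The sketch below does this for PC1 and PC2 using the values printed above. The cutoff 1 / sqrt(p), the loading every variable would have if all p variables contributed equally, is a rule of thumb assumed here for illustration and is not part of the report's method.

```r
# Loadings copied from the table above (PC1 and PC2 columns)
loadings <- data.frame(
  PC1 = c(0.264, 0.262, 0.344, 0.330, 0.315, -0.308, -0.304, 0.335, 0.280, 0.264, 0.298),
  PC2 = c(0.487, 0.490, -0.030, 0.073, 0.302, 0.014, 0.008, -0.161, -0.382, -0.388, -0.321),
  row.names = c("Retail", "Dealer", "Engine", "Cylinders", "Horsepower",
                "CityMPG", "HighwayMPG", "Weight", "WheelBase", "Length", "Width")
)

# Assumed cutoff: equal-contribution loading for p variables
cutoff <- 1 / sqrt(nrow(loadings))

# For each PC, collect the variables beyond the cutoff, split by sign
strong <- lapply(loadings, function(v) {
  names(v) <- rownames(loadings)
  list(positive = names(v)[v >= cutoff],
       negative = names(v)[v <= -cutoff])
})
str(strong)
```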

6.2 PC1 — Car Size

sort(abs(pca_result$rotation[, 1]), decreasing = TRUE)
##     Engine     Weight  Cylinders Horsepower    CityMPG HighwayMPG      Width 
##  0.3439500  0.3346780  0.3301586  0.3146367  0.3078987  0.3038309  0.2977170 
##  WheelBase     Length     Retail     Dealer 
##  0.2795422  0.2641983  0.2636420  0.2616986

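PC1 loads on all eleven variables with similar magnitude (0.26 to 0.34). Engine size, Weight, Cylinders, and Horsepower have the largest loadings; the body dimensions (Width, WheelBase, Length) and the two price variables also load positively, while CityMPG and HighwayMPG load negatively. PC1 is therefore an overall size-and-power axis: cars with high PC1 scores are large, heavy, powerful, and less fuel-efficient (e.g., SUVs), while low PC1 scores correspond to compact, lighter, more economical cars.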

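6.3 PC2 — Price vs. Body Dimensions

sort(abs(pca_result$rotation[, 2]), decreasing = TRUE)
##     Dealer     Retail     Length  WheelBase      Width Horsepower     Weight 
## 0.48981064 0.48677615 0.38825833 0.38159834 0.32138359 0.30204950 0.16063928 
##  Cylinders     Engine    CityMPG HighwayMPG 
## 0.07290257 0.02960226 0.01353944 0.00848625

PC2 contrasts price and power with body dimensions. In the loadings table of Section 6.1, Retail, Dealer, and Horsepower load positively (0.49, 0.49, 0.30), while Length, WheelBase, Width, and to a lesser extent Weight load negatively (-0.39, -0.38, -0.32, -0.16). The MPG variables contribute almost nothing (loadings near 0.01). High PC2 scores therefore indicate cars that are expensive and powerful for their size, plausibly sports and luxury models; low scores indicate large but comparatively inexpensive vehicles.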


7 Biplot Analysis

# Biplot colored by SUV indicator
fviz_pca_biplot(pca_result,
                label = "var",
                habillage = as.factor(data_clean$SUV),
                addEllipses = TRUE,
                ellipse.level = 0.95,
                repel = TRUE,
                col.var = "darkred",
                legend.title = "SUV",
                title = "PCA Biplot — Cars 2004 (colored by SUV)")

Biplot Interpretation:

  • Variable arrows pointing in the same direction indicate positive correlation (e.g., Weight and Length).
  • Opposite arrows indicate negative correlation (e.g., MPG variables vs. Weight).
  • SUV cluster (group 1) concentrates on the right side of PC1, confirming that SUVs are larger and heavier.
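  • Non-SUVs (group 0) extend toward the left side of PC1, the direction in which the MPG arrows point, i.e. toward higher fuel efficiency. PC2 carries almost no fuel-efficiency information.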
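
The first two bullets can be checked numerically from the loadings in Section 6.1. The sketch below assumes the arrows are drawn at loading times component standard deviation, a common convention for correlation biplots that has not been verified against the plotting function used here. The cosine of the angle between two arrows only approximates the correlation, since the PC1-PC2 plane holds about 82% of the variance.

```r
# Loadings (PC1, PC2) from Section 6.1 and PC standard deviations from Section 4.2
load12 <- rbind(
  Weight  = c(0.335, -0.161),
  Length  = c(0.264, -0.388),
  CityMPG = c(-0.308, 0.014)
)
sdev12 <- c(2.7011, 1.3208)

# Assumed arrow coordinates: loading times component standard deviation
arrow_xy <- sweep(load12, 2, sdev12, "*")

# Cosine of the angle between two arrows in the PC1-PC2 plane
cos_angle <- function(a, b) sum(a * b) / sqrt(sum(a^2) * sum(b^2))

cos_angle(arrow_xy["Weight", ], arrow_xy["Length", ])   # positive: similar direction
cos_angle(arrow_xy["Weight", ], arrow_xy["CityMPG", ])  # negative: opposite direction
```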

8 Conclusions and Insights

  1. The first two principal components explain about 82% of the variance in the dataset, so the 11 numerical variables can be summarized in 2 dimensions with limited loss of information.

  2. PC1 represents car size and power: Weight, Engine, Cylinders, Horsepower, Length, Wheelbase, and Width all load positively on this component, while CityMPG and HighwayMPG load negatively.

  3. PC2 contrasts price with body size: Retail, Dealer, and Horsepower load positively, while Length, WheelBase, and Width load negatively. The MPG variables have near-zero loadings on PC2, so the pattern of larger, more powerful cars being less fuel-efficient appears on PC1 instead.

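  4. The biplot reveals clear groupings: SUVs cluster toward high PC1 values, separated from compact and fuel-efficient cars. Pickups do not appear in the analysis because every pickup row had a missing value and was removed in Section 3.1.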

  5. Practical insight: Manufacturers and buyers can use these two dimensions to position or compare vehicles — a car’s position in PCA space summarizes its size, power, efficiency, and price simultaneously.
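
To position a new vehicle in the same PCA space, its values must be centred and scaled with the means and standard deviations of the fitting data before rotation. The sketch below uses mtcars as stand-in data and lets prcomp() do the scaling so that predict() can reuse it. The report instead scales with scale() before calling prcomp(), so applying this there would require saving those means and standard deviations separately.

```r
vars    <- c("mpg", "disp", "hp", "wt")    # stand-in variables from mtcars
train   <- mtcars[-1, vars]                # fit on all rows but the first
new_car <- mtcars[1, vars, drop = FALSE]   # treat the first row as a "new" car

# Let prcomp() centre and scale so the transformation is stored in the result
pca <- prcomp(train, center = TRUE, scale. = TRUE)

# predict() applies the stored centring, scaling, and rotation to new rows
new_scores <- predict(pca, newdata = new_car)[, 1:2]
print(round(new_scores, 3))
```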


Report prepared for Exercise 1 — Advanced Engineering Data Analysis, Prof. Daniel Fernández, UPC Barcelona Tech.