Introduction

In this paper, I use an electric vehicle specification dataset that includes battery capacity, range, efficiency, torque, and acceleration values for different models. Because many of these variables are correlated, looking at them one at a time does not show clearly what drives overall performance. For this reason, I use Principal Component Analysis (PCA) to reduce the number of variables and identify the main patterns behind EV performance. PCA summarizes the data in a smaller number of uncorrelated components that still capture most of the variance.

Packages

library(tidyverse)
library(janitor)
library(FactoMineR)
library(factoextra)

Data Import

ev <- read_csv("electric_vehicles_spec_2025.csv.csv") %>%
  clean_names()
## Rows: 478 Columns: 22
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (9): brand, model, battery_type, fast_charge_port, cargo_volume_l, driv...
## dbl (13): top_speed_kmh, battery_capacity_kWh, number_of_cells, torque_nm, e...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(ev)
## Rows: 478
## Columns: 22
## $ brand                     <chr> "Abarth", "Abarth", "Abarth", "Abarth", "Aiw…
## $ model                     <chr> "500e Convertible", "500e Hatchback", "600e …
## $ top_speed_kmh             <dbl> 155, 155, 200, 200, 150, 160, 150, 200, 160,…
## $ battery_capacity_k_wh     <dbl> 37.8, 37.8, 50.8, 50.8, 60.0, 60.0, 50.8, 50…
## $ battery_type              <chr> "Lithium-ion", "Lithium-ion", "Lithium-ion",…
## $ number_of_cells           <dbl> 192, 192, 102, 102, NA, NA, 102, 102, 184, 1…
## $ torque_nm                 <dbl> 235, 235, 345, 345, 310, 315, 260, 345, 285,…
## $ efficiency_wh_per_km      <dbl> 156, 149, 158, 158, 156, 150, 128, 164, 138,…
## $ range_km                  <dbl> 225, 225, 280, 280, 315, 350, 320, 310, 310,…
## $ acceleration_0_100_s      <dbl> 7.0, 7.0, 5.9, 6.2, 7.5, 7.0, 9.0, 6.0, 7.4,…
## $ fast_charging_power_kw_dc <dbl> 67, 67, 79, 79, 78, 78, 85, 85, 70, 70, 150,…
## $ fast_charge_port          <chr> "CCS", "CCS", "CCS", "CCS", "CCS", "CCS", "C…
## $ towing_capacity_kg        <dbl> 0, 0, 0, 0, NA, NA, 0, 0, 500, 500, 2100, 21…
## $ cargo_volume_l            <chr> "185", "185", "360", "360", "496", "472", "4…
## $ seats                     <dbl> 4, 4, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,…
## $ drivetrain                <chr> "FWD", "FWD", "FWD", "FWD", "FWD", "FWD", "F…
## $ segment                   <chr> "B - Compact", "B - Compact", "JB - Compact"…
## $ length_mm                 <dbl> 3673, 3673, 4187, 4187, 4680, 4805, 4173, 41…
## $ width_mm                  <dbl> 1683, 1683, 1779, 1779, 1865, 1880, 1781, 17…
## $ height_mm                 <dbl> 1518, 1518, 1557, 1557, 1700, 1641, 1532, 15…
## $ car_body_type             <chr> "Hatchback", "Hatchback", "SUV", "SUV", "SUV…
## $ source_url                <chr> "https://ev-database.org/car/1904/Abarth-500…

Variable Selection for PCA

ev_pca_data <- ev %>%
  select(
    top_speed_kmh,
    battery_capacity_k_wh,
    torque_nm,
    efficiency_wh_per_km,
    range_km,
    acceleration_0_100_s,
    length_mm,
    width_mm,
    height_mm
  )

summary(ev_pca_data)
##  top_speed_kmh   battery_capacity_k_wh   torque_nm    efficiency_wh_per_km
##  Min.   :125.0   Min.   : 21.30        Min.   : 113   Min.   :109.0       
##  1st Qu.:160.0   1st Qu.: 60.00        1st Qu.: 305   1st Qu.:143.0       
##  Median :180.0   Median : 76.15        Median : 430   Median :155.0       
##  Mean   :185.5   Mean   : 74.04        Mean   : 498   Mean   :162.9       
##  3rd Qu.:201.0   3rd Qu.: 90.60        3rd Qu.: 679   3rd Qu.:177.8       
##  Max.   :325.0   Max.   :118.00        Max.   :1350   Max.   :370.0       
##                                        NA's   :7                          
##     range_km     acceleration_0_100_s   length_mm       width_mm   
##  Min.   :135.0   Min.   : 2.200       Min.   :3620   Min.   :1610  
##  1st Qu.:320.0   1st Qu.: 4.800       1st Qu.:4440   1st Qu.:1849  
##  Median :397.5   Median : 6.600       Median :4720   Median :1890  
##  Mean   :393.2   Mean   : 6.883       Mean   :4679   Mean   :1887  
##  3rd Qu.:470.0   3rd Qu.: 8.200       3rd Qu.:4961   3rd Qu.:1939  
##  Max.   :685.0   Max.   :19.100       Max.   :5908   Max.   :2080  
##                                                                    
##    height_mm   
##  Min.   :1329  
##  1st Qu.:1514  
##  Median :1596  
##  Mean   :1601  
##  3rd Qu.:1665  
##  Max.   :1986  
## 

NA Cleaning

ev_pca_data <- na.omit(ev_pca_data)

dim(ev_pca_data)
## [1] 471   9
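
The summary above reports 7 missing values, all in torque_nm, which matches the drop from 478 to 471 rows. As a minimal sketch of how this could be checked before calling na.omit(), the following uses a made-up toy data frame (`toy`, with invented values) in place of ev_pca_data:

```r
# Toy stand-in for ev_pca_data (values are invented for illustration only)
toy <- data.frame(
  torque_nm = c(235, NA, 345, 310, NA),
  range_km  = c(225, 280, 280, 315, 350)
)

# Number of missing values in each column
na_per_column <- colSums(is.na(toy))

# Number of rows that na.omit() would drop (any NA in the row)
rows_dropped <- sum(!complete.cases(toy))

# Same cleaning step as in the analysis
toy_clean <- na.omit(toy)

na_per_column
rows_dropped
dim(toy_clean)
```

If rows_dropped is larger than the largest per-column NA count, the missing values are spread across different rows and more data is lost than any single column suggests.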

Scaling Variables

ev_scaled <- scale(ev_pca_data)

summary(ev_scaled)
##  top_speed_kmh     battery_capacity_k_wh   torque_nm       efficiency_wh_per_km
##  Min.   :-1.7773   Min.   :-2.58728      Min.   :-1.5945   Min.   :-1.5668     
##  1st Qu.:-0.7359   1st Qu.:-0.69918      1st Qu.:-0.7994   1st Qu.:-0.5812     
##  Median :-0.1408   Median : 0.09594      Median :-0.2817   Median :-0.2334     
##  Mean   : 0.0000   Mean   : 0.00000      Mean   : 0.0000   Mean   : 0.0000     
##  3rd Qu.: 0.4840   3rd Qu.: 0.82459      3rd Qu.: 0.7496   3rd Qu.: 0.4332     
##  Max.   : 4.1737   Max.   : 2.17358      Max.   : 3.5285   Max.   : 5.9984     
##     range_km        acceleration_0_100_s   length_mm          width_mm       
##  Min.   :-2.50531   Min.   :-1.7378      Min.   :-2.8552   Min.   :-3.77328  
##  1st Qu.:-0.69916   1st Qu.:-0.7816      1st Qu.:-0.6360   1st Qu.:-0.51166  
##  Median : 0.03306   Median :-0.1197      Median : 0.1082   Median :-0.03402  
##  Mean   : 0.00000   Mean   : 0.0000      Mean   : 0.0000   Mean   : 0.00000  
##  3rd Qu.: 0.76529   3rd Qu.: 0.5055      3rd Qu.: 0.7726   3rd Qu.: 0.70292  
##  Max.   : 2.86433   Max.   : 4.4771      Max.   : 3.3369   Max.   : 2.64079  
##    height_mm       
##  Min.   :-2.09418  
##  1st Qu.:-0.66942  
##  Median :-0.04129  
##  Mean   : 0.00000  
##  3rd Qu.: 0.47959  
##  Max.   : 2.93846

Principal Component Analysis (PCA)

pca_result <- prcomp(ev_scaled, center = FALSE, scale. = FALSE)

summary(pca_result)
## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6     PC7
## Standard deviation     2.2501 1.5161 0.77611 0.60172 0.50364 0.40591 0.36962
## Proportion of Variance 0.5625 0.2554 0.06693 0.04023 0.02818 0.01831 0.01518
## Cumulative Proportion  0.5625 0.8179 0.88488 0.92511 0.95330 0.97160 0.98678
##                           PC8     PC9
## Standard deviation     0.3074 0.15638
## Proportion of Variance 0.0105 0.00272
## Cumulative Proportion  0.9973 1.00000
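
Because the data were already standardized with scale(), prcomp() is called with center = FALSE and scale. = FALSE. The sketch below illustrates, on a stand-in dataset (a few columns of the built-in mtcars data, not the EV data), that this should be equivalent to letting prcomp() standardize internally, and that the scores are the standardized data multiplied by the rotation matrix. The object names x, pca_a, and pca_b are hypothetical.

```r
# Stand-in data: numeric columns from the built-in mtcars dataset
x <- mtcars[, c("mpg", "disp", "hp", "wt", "qsec")]

# Route 1 (as in the analysis): scale first, then PCA without further scaling
x_scaled <- scale(x)
pca_a <- prcomp(x_scaled, center = FALSE, scale. = FALSE)

# Route 2: let prcomp() center and scale internally
pca_b <- prcomp(x, center = TRUE, scale. = TRUE)

# Both routes give the same component standard deviations
same_sdev <- isTRUE(all.equal(pca_a$sdev, pca_b$sdev))

# With standardized inputs, total variance equals the number of variables
total_var <- sum(pca_a$sdev^2)

# Scores are the standardized data projected onto the loadings
manual_scores <- x_scaled %*% pca_a$rotation
same_scores <- isTRUE(all.equal(manual_scores, pca_a$x, check.attributes = FALSE))

same_sdev
total_var
same_scores
```

In the EV analysis the same logic implies a total variance of 9, one unit per standardized variable.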

Scree Plot

fviz_eig(pca_result)
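
The values behind the scree plot can be recomputed from the standard deviations printed by summary(pca_result). The sketch below copies those numbers by hand and also applies the Kaiser rule (keep components with eigenvalue above 1). The Kaiser rule is a common heuristic that I add here only as a cross-check; it is not part of the original analysis.

```r
# Component standard deviations copied from the summary(pca_result) output
sdev <- c(2.2501, 1.5161, 0.77611, 0.60172, 0.50364,
          0.40591, 0.36962, 0.3074, 0.15638)

# Eigenvalues of the correlation matrix are the squared standard deviations
eigenvalues <- sdev^2

# Share of total variance per component, and the running total
prop_var <- eigenvalues / sum(eigenvalues)
cum_var  <- cumsum(prop_var)

# Kaiser rule: count components whose eigenvalue exceeds 1
n_keep <- sum(eigenvalues > 1)

round(data.frame(eigenvalue = eigenvalues, prop_var, cum_var), 3)
n_keep
```

Only the first two eigenvalues are above 1, which agrees with the visual drop after the second component.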

The scree plot shows a clear drop in explained variance after the second component. PC1 explains about 56% of the total variance, and PC2 adds about 26%, so the first two components together account for roughly 82%. Each component after PC2 adds less than 7%. Therefore, the first two principal components seem sufficient to capture most of the structure in the data.

Interpretation of PC

round(pca_result$rotation, 3)
##                          PC1    PC2    PC3    PC4    PC5    PC6    PC7    PC8
## top_speed_kmh          0.384  0.217  0.221 -0.251  0.163  0.170 -0.734 -0.308
## battery_capacity_k_wh  0.412 -0.046 -0.286  0.376 -0.265 -0.024 -0.201  0.157
## torque_nm              0.391  0.064  0.413  0.255  0.100  0.627  0.308  0.308
## efficiency_wh_per_km   0.192 -0.495  0.485 -0.160 -0.570 -0.289 -0.086  0.142
## range_km               0.374  0.203 -0.492  0.207 -0.318 -0.085  0.004 -0.007
## acceleration_0_100_s  -0.349 -0.302 -0.335 -0.201 -0.195  0.560 -0.367  0.374
## length_mm              0.323 -0.360 -0.271 -0.440 -0.045  0.280  0.386 -0.502
## width_mm               0.348 -0.307 -0.191 -0.238  0.590 -0.303 -0.025  0.494
## height_mm             -0.082 -0.588  0.008  0.609  0.281  0.045 -0.183 -0.358
##                          PC9
## top_speed_kmh         -0.061
## battery_capacity_k_wh  0.685
## torque_nm             -0.118
## efficiency_wh_per_km  -0.147
## range_km              -0.653
## acceleration_0_100_s  -0.089
## length_mm              0.131
## width_mm              -0.081
## height_mm             -0.183

The first principal component is mainly associated with battery capacity, torque, top speed, and driving range, while acceleration has a negative loading. Because acceleration is measured as the 0-100 km/h time in seconds, a negative loading means that vehicles scoring high on PC1 accelerate faster. This suggests that PC1 captures an overall performance and power dimension, where vehicles with higher battery capacity and stronger motors tend to score higher.
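
One way to check this reading of PC1 is to convert the loadings into correlations between each variable and the component. For standardized inputs, the correlation equals the loading multiplied by the component's standard deviation. The sketch below copies the rounded PC1 loadings and the PC1 standard deviation from the output above, so the results are approximate.

```r
# PC1 loadings (rounded to 3 decimals) copied from the rotation matrix above
pc1_loadings <- c(
  top_speed_kmh         =  0.384,
  battery_capacity_k_wh =  0.412,
  torque_nm             =  0.391,
  efficiency_wh_per_km  =  0.192,
  range_km              =  0.374,
  acceleration_0_100_s  = -0.349,
  length_mm             =  0.323,
  width_mm              =  0.348,
  height_mm             = -0.082
)

# PC1 standard deviation copied from summary(pca_result)
pc1_sdev <- 2.2501

# Correlation between each standardized variable and PC1
pc1_cor <- pc1_loadings * pc1_sdev

# Relative contribution of each variable to PC1 (squared loadings sum to ~1)
pc1_contrib <- pc1_loadings^2 / sum(pc1_loadings^2)

round(sort(pc1_cor, decreasing = TRUE), 2)
round(sort(pc1_contrib, decreasing = TRUE), 3)
```

Under these rounded inputs, battery capacity has the largest contribution to PC1, while height contributes almost nothing.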

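The second principal component appears to be related to energy consumption and vehicle dimensions. Vehicle height (-0.588) and efficiency in Wh per km (-0.495) have the strongest loadings, followed by length and width, indicating that this component may reflect structural and efficiency-related differences between electric vehicles. Since these loadings are negative, taller vehicles with higher consumption tend to have lower PC2 scores.

PCA Visualization (PC1 vs PC2)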

fviz_pca_ind(
  pca_result,
  geom = "point",
  col.ind = "steelblue",
  repel = TRUE
)

The PCA scatter plot shows that most of the variation in electric vehicle performance can be summarized along two main dimensions. PC1, which explains 56% of the variance, appears to represent overall performance and power characteristics. Vehicles located on the right side tend to have higher battery capacity, greater torque, longer range, and quicker acceleration (lower 0-100 km/h times).

PC2, which explains about 26% of the variance, seems to capture structural differences related to vehicle size and efficiency. This suggests that EV performance differences are largely driven by a combination of power-related features and physical vehicle characteristics.

PCA Biplot

fviz_pca_biplot(
  pca_result,
  label = "var",        
  col.var = "red",
  col.ind = "grey70",
  alpha.ind = 0.5,
  pointsize = 1.5
)

The PCA biplot shows that battery capacity, range, torque, and top speed load strongly in the same direction on the first principal component. This is consistent with PC1 mainly representing the overall performance and power characteristics of electric vehicles.

The second component appears to capture structural differences related to vehicle size and efficiency. Variables such as height and efficiency contribute more strongly to this dimension, suggesting that PC2 differentiates vehicle type and body structure rather than raw performance.

K-means Clustering in PCA Space

ev_scores <- as.data.frame(pca_result$x[,1:2])

set.seed(123)
km_result <- kmeans(ev_scores, centers = 4, nstart = 25)

ev_scores$cluster <- as.factor(km_result$cluster)

ggplot(ev_scores, aes(PC1, PC2, color = cluster)) +
  geom_point(alpha = 0.6) +
  theme_minimal() +
  labs(title = "EV Segments based on PCA + K-means")
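
The choice of four clusters is not justified by a diagnostic in the analysis above. One common option is the elbow method, which compares the total within-cluster sum of squares across several values of k. The sketch below shows the idea on stand-in data (PC1 and PC2 scores from a PCA of the built-in iris dataset, not the EV scores); the names scores, ks, and wss are hypothetical.

```r
set.seed(123)

# Stand-in for ev_scores: first two PC scores from the built-in iris data
scores <- as.data.frame(prcomp(iris[, 1:4], scale. = TRUE)$x[, 1:2])

# Candidate numbers of clusters
ks <- 1:8

# Total within-cluster sum of squares for each k
wss <- sapply(ks, function(k) {
  kmeans(scores, centers = k, nstart = 25)$tot.withinss
})

# Look for the k after which the decrease in wss flattens out
elbow_table <- data.frame(k = ks, wss = round(wss, 1))
elbow_table
```

Applied to the EV data, the same loop would run on ev_scores[, c("PC1", "PC2")], and plotting wss against k would show whether four clusters is a reasonable elbow.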

Cluster Summary in PCA Space

ev_scores %>%
  group_by(cluster) %>%
  summarise(
    n = n(),
    mean_PC1 = mean(PC1),
    mean_PC2 = mean(PC2)
  ) %>%
  arrange(desc(mean_PC1))

Adding cluster back to original data

ev_clustered <- ev_pca_data
ev_clustered$cluster <- ev_scores$cluster
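
Since ev_pca_data contains only the numeric PCA variables, the cluster labels cannot be traced back to brand or model from ev_clustered alone. A possible alternative, sketched below on a made-up toy data frame, is to filter the full table on complete cases of the PCA variables so that identifiers stay attached to each row. The data values and the cluster labels here are invented; in the real analysis the labels would come from km_result$cluster.

```r
# Toy stand-in for ev (invented values; column names follow the glimpse output)
toy_ev <- data.frame(
  brand     = c("A", "A", "B", "B", "C"),
  model     = c("m1", "m2", "m3", "m4", "m5"),
  torque_nm = c(235, NA, 345, 310, 679),
  range_km  = c(225, 280, 280, 315, 470)
)
pca_vars <- c("torque_nm", "range_km")

# Keep rows that are complete on the PCA variables, but retain all columns
keep <- complete.cases(toy_ev[, pca_vars])
toy_complete <- toy_ev[keep, ]

# Hypothetical cluster labels, one per retained row and in the same row order
toy_complete$cluster <- factor(c(1, 2, 2, 1))

# Identifiers and cluster membership side by side
toy_complete[, c("brand", "model", "cluster")]
```

This only works if the rows passed to the PCA are in the same order as the retained rows, which holds when both are derived from the same complete-case filter.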

Cluster means in original variables

ev_clustered %>%
  group_by(cluster) %>%
  summarise(across(everything(), mean)) %>%
  arrange(desc(battery_capacity_k_wh))

Cluster Profiling

In this part, I examine the average values of the original performance variables within each cluster to better understand the segments identified by K-means clustering.

Cluster 3 represents high-performance electric vehicles, with the largest battery capacity, highest torque levels, and top speeds above 200 km/h. These models appear to correspond to premium or performance-oriented EVs.

Cluster 1 includes more balanced vehicles with medium to high battery capacity and solid performance levels, which suggests mainstream long-range electric models.

Cluster 4 consists of lower-performance vehicles with smaller batteries and reduced power output, while Cluster 2 shows the lowest battery and torque values overall, indicating entry-level or city-focused electric cars.

Conclusion

In this paper, I used PCA to reduce the dimensionality of electric vehicle performance data and identify the main sources of variation across models. The first principal component mainly reflects overall performance and power, while the second captures structural and efficiency-related differences.

Using K-means clustering on the PCA results, four distinct EV segments were identified, ranging from high-performance premium vehicles to smaller city-oriented models. Overall, the analysis shows that combining PCA with clustering provides a clear way to understand patterns and segmentation within the electric vehicle market.