In this paper, I use an electric vehicle specification dataset that includes battery capacity, range, efficiency, torque, and acceleration values for different models. Because many of these variables are correlated, examining them one at a time does not make clear what drives overall performance. I therefore use Principal Component Analysis (PCA) to reduce the number of variables and reveal the main patterns behind EV performance. PCA summarizes the data in a smaller number of components that still capture most of the variance.
library(tidyverse)
library(janitor)
library(FactoMineR)
library(factoextra)
ev <- read_csv("electric_vehicles_spec_2025.csv.csv") %>%
  clean_names()
## Rows: 478 Columns: 22
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (9): brand, model, battery_type, fast_charge_port, cargo_volume_l, driv...
## dbl (13): top_speed_kmh, battery_capacity_kWh, number_of_cells, torque_nm, e...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(ev)
## Rows: 478
## Columns: 22
## $ brand <chr> "Abarth", "Abarth", "Abarth", "Abarth", "Aiw…
## $ model <chr> "500e Convertible", "500e Hatchback", "600e …
## $ top_speed_kmh <dbl> 155, 155, 200, 200, 150, 160, 150, 200, 160,…
## $ battery_capacity_k_wh <dbl> 37.8, 37.8, 50.8, 50.8, 60.0, 60.0, 50.8, 50…
## $ battery_type <chr> "Lithium-ion", "Lithium-ion", "Lithium-ion",…
## $ number_of_cells <dbl> 192, 192, 102, 102, NA, NA, 102, 102, 184, 1…
## $ torque_nm <dbl> 235, 235, 345, 345, 310, 315, 260, 345, 285,…
## $ efficiency_wh_per_km <dbl> 156, 149, 158, 158, 156, 150, 128, 164, 138,…
## $ range_km <dbl> 225, 225, 280, 280, 315, 350, 320, 310, 310,…
## $ acceleration_0_100_s <dbl> 7.0, 7.0, 5.9, 6.2, 7.5, 7.0, 9.0, 6.0, 7.4,…
## $ fast_charging_power_kw_dc <dbl> 67, 67, 79, 79, 78, 78, 85, 85, 70, 70, 150,…
## $ fast_charge_port <chr> "CCS", "CCS", "CCS", "CCS", "CCS", "CCS", "C…
## $ towing_capacity_kg <dbl> 0, 0, 0, 0, NA, NA, 0, 0, 500, 500, 2100, 21…
## $ cargo_volume_l <chr> "185", "185", "360", "360", "496", "472", "4…
## $ seats <dbl> 4, 4, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,…
## $ drivetrain <chr> "FWD", "FWD", "FWD", "FWD", "FWD", "FWD", "F…
## $ segment <chr> "B - Compact", "B - Compact", "JB - Compact"…
## $ length_mm <dbl> 3673, 3673, 4187, 4187, 4680, 4805, 4173, 41…
## $ width_mm <dbl> 1683, 1683, 1779, 1779, 1865, 1880, 1781, 17…
## $ height_mm <dbl> 1518, 1518, 1557, 1557, 1700, 1641, 1532, 15…
## $ car_body_type <chr> "Hatchback", "Hatchback", "SUV", "SUV", "SUV…
## $ source_url <chr> "https://ev-database.org/car/1904/Abarth-500…
# Keep only the numeric performance and size variables used in the PCA
ev_pca_data <- ev %>%
  select(
    top_speed_kmh,
    battery_capacity_k_wh,
    torque_nm,
    efficiency_wh_per_km,
    range_km,
    acceleration_0_100_s,
    length_mm,
    width_mm,
    height_mm
  )
summary(ev_pca_data)
## top_speed_kmh battery_capacity_k_wh torque_nm efficiency_wh_per_km
## Min. :125.0 Min. : 21.30 Min. : 113 Min. :109.0
## 1st Qu.:160.0 1st Qu.: 60.00 1st Qu.: 305 1st Qu.:143.0
## Median :180.0 Median : 76.15 Median : 430 Median :155.0
## Mean :185.5 Mean : 74.04 Mean : 498 Mean :162.9
## 3rd Qu.:201.0 3rd Qu.: 90.60 3rd Qu.: 679 3rd Qu.:177.8
## Max. :325.0 Max. :118.00 Max. :1350 Max. :370.0
## NA's :7
## range_km acceleration_0_100_s length_mm width_mm
## Min. :135.0 Min. : 2.200 Min. :3620 Min. :1610
## 1st Qu.:320.0 1st Qu.: 4.800 1st Qu.:4440 1st Qu.:1849
## Median :397.5 Median : 6.600 Median :4720 Median :1890
## Mean :393.2 Mean : 6.883 Mean :4679 Mean :1887
## 3rd Qu.:470.0 3rd Qu.: 8.200 3rd Qu.:4961 3rd Qu.:1939
## Max. :685.0 Max. :19.100 Max. :5908 Max. :2080
##
## height_mm
## Min. :1329
## 1st Qu.:1514
## Median :1596
## Mean :1601
## 3rd Qu.:1665
## Max. :1986
##
ev_pca_data <- na.omit(ev_pca_data)
dim(ev_pca_data)
## [1] 471 9
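The introduction states that many of the variables move together. A correlation matrix is one simple way to check this before running PCA. The sketch below is only illustrative: it uses the built-in `mtcars` data as a stand-in, and `strong_pairs()` is a hypothetical helper, not part of the original analysis.

```r
# Stand-in data: numeric columns from the built-in mtcars table.
# Assumption: with the paper's objects, ev_pca_data would be used instead.
toy <- mtcars[, c("mpg", "disp", "hp", "wt", "qsec")]

# Pairwise Pearson correlations, rounded for readability.
cor_mat <- round(cor(toy), 2)
print(cor_mat)

# Hypothetical helper: list variable pairs with |r| above a threshold.
strong_pairs <- function(cm, threshold = 0.7) {
  idx <- which(abs(cm) > threshold & upper.tri(cm), arr.ind = TRUE)
  data.frame(
    var1 = rownames(cm)[idx[, "row"]],
    var2 = colnames(cm)[idx[, "col"]],
    r    = cm[idx],
    row.names = NULL
  )
}

pairs_found <- strong_pairs(cor_mat)
print(pairs_found)
```

With the data in this paper, the same call would presumably be `strong_pairs(round(cor(ev_pca_data), 2))`. Many strongly correlated pairs would support using PCA to summarize the variables.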
ev_scaled <- scale(ev_pca_data)
summary(ev_scaled)
## top_speed_kmh battery_capacity_k_wh torque_nm efficiency_wh_per_km
## Min. :-1.7773 Min. :-2.58728 Min. :-1.5945 Min. :-1.5668
## 1st Qu.:-0.7359 1st Qu.:-0.69918 1st Qu.:-0.7994 1st Qu.:-0.5812
## Median :-0.1408 Median : 0.09594 Median :-0.2817 Median :-0.2334
## Mean : 0.0000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.4840 3rd Qu.: 0.82459 3rd Qu.: 0.7496 3rd Qu.: 0.4332
## Max. : 4.1737 Max. : 2.17358 Max. : 3.5285 Max. : 5.9984
## range_km acceleration_0_100_s length_mm width_mm
## Min. :-2.50531 Min. :-1.7378 Min. :-2.8552 Min. :-3.77328
## 1st Qu.:-0.69916 1st Qu.:-0.7816 1st Qu.:-0.6360 1st Qu.:-0.51166
## Median : 0.03306 Median :-0.1197 Median : 0.1082 Median :-0.03402
## Mean : 0.00000 Mean : 0.0000 Mean : 0.0000 Mean : 0.00000
## 3rd Qu.: 0.76529 3rd Qu.: 0.5055 3rd Qu.: 0.7726 3rd Qu.: 0.70292
## Max. : 2.86433 Max. : 4.4771 Max. : 3.3369 Max. : 2.64079
## height_mm
## Min. :-2.09418
## 1st Qu.:-0.66942
## Median :-0.04129
## Mean : 0.00000
## 3rd Qu.: 0.47959
## Max. : 2.93846
# ev_scaled is already standardized, so prcomp() is told not to center or scale again
pca_result <- prcomp(ev_scaled, center = FALSE, scale. = FALSE)
summary(pca_result)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.2501 1.5161 0.77611 0.60172 0.50364 0.40591 0.36962
## Proportion of Variance 0.5625 0.2554 0.06693 0.04023 0.02818 0.01831 0.01518
## Cumulative Proportion 0.5625 0.8179 0.88488 0.92511 0.95330 0.97160 0.98678
## PC8 PC9
## Standard deviation 0.3074 0.15638
## Proportion of Variance 0.0105 0.00272
## Cumulative Proportion 0.9973 1.00000
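The proportions in the summary above follow directly from the component standard deviations: each squared standard deviation is an eigenvalue, and its share of the eigenvalue total is the proportion of variance. The sketch below recomputes these quantities and applies two common retention heuristics. It runs on built-in `mtcars` data as a stand-in, and `variance_table()` is a hypothetical helper name.

```r
# Hypothetical helper: variance explained, recomputed from a prcomp fit.
variance_table <- function(pca) {
  eig  <- pca$sdev^2          # eigenvalues = variances of the components
  prop <- eig / sum(eig)      # share of total variance per component
  data.frame(
    component  = paste0("PC", seq_along(eig)),
    eigenvalue = eig,
    proportion = prop,
    cumulative = cumsum(prop)
  )
}

# Stand-in for ev_scaled: standardized numeric columns of mtcars.
demo_scaled <- scale(mtcars[, c("mpg", "disp", "hp", "wt", "qsec")])
demo_pca    <- prcomp(demo_scaled, center = FALSE, scale. = FALSE)

vt <- variance_table(demo_pca)
print(vt)

# Two common heuristics for how many components to keep.
n_kaiser <- sum(vt$eigenvalue > 1)            # eigenvalue greater than 1
n_80     <- which(vt$cumulative >= 0.80)[1]   # first PC reaching 80% cumulative
```

Applied to the standard deviations printed above, the eigenvalues are about 5.06 for PC1 (2.2501 squared), 2.30 for PC2, and 0.60 for PC3. Only the first two exceed 1, and their cumulative share (0.8179) is the first to pass 80%, so both heuristics point to two components.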
fviz_eig(pca_result)
The scree plot shows a clear drop in explained variance after the second component. PC1 explains more than half of the total variance (56%), and PC2 adds a further 26%, so the first two components together account for about 82%. Each later component adds less than 7%. The first two principal components are therefore sufficient to capture most of the structure in the data.
# Interpretation of PC
round(pca_result$rotation, 3)
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## top_speed_kmh 0.384 0.217 0.221 -0.251 0.163 0.170 -0.734 -0.308
## battery_capacity_k_wh 0.412 -0.046 -0.286 0.376 -0.265 -0.024 -0.201 0.157
## torque_nm 0.391 0.064 0.413 0.255 0.100 0.627 0.308 0.308
## efficiency_wh_per_km 0.192 -0.495 0.485 -0.160 -0.570 -0.289 -0.086 0.142
## range_km 0.374 0.203 -0.492 0.207 -0.318 -0.085 0.004 -0.007
## acceleration_0_100_s -0.349 -0.302 -0.335 -0.201 -0.195 0.560 -0.367 0.374
## length_mm 0.323 -0.360 -0.271 -0.440 -0.045 0.280 0.386 -0.502
## width_mm 0.348 -0.307 -0.191 -0.238 0.590 -0.303 -0.025 0.494
## height_mm -0.082 -0.588 0.008 0.609 0.281 0.045 -0.183 -0.358
## PC9
## top_speed_kmh -0.061
## battery_capacity_k_wh 0.685
## torque_nm -0.118
## efficiency_wh_per_km -0.147
## range_km -0.653
## acceleration_0_100_s -0.089
## length_mm 0.131
## width_mm -0.081
## height_mm -0.183
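A loadings table like the one above is easier to read when the variables are ranked by absolute loading within each component. The sketch below shows one way to do that and also illustrates that the sign of a component is arbitrary. It uses built-in `mtcars` data as a stand-in, and `top_loadings()` is a hypothetical helper.

```r
# Hypothetical helper: variables with the largest absolute loadings on one PC.
top_loadings <- function(pca, pc = 1, n = 5) {
  v <- pca$rotation[, pc]
  v <- v[order(abs(v), decreasing = TRUE)]
  round(head(v, n), 3)
}

# Stand-in fit (assumption: pca_result would be used with the paper's data).
demo_pca <- prcomp(mtcars[, c("mpg", "disp", "hp", "wt", "qsec")], scale. = TRUE)

pc1_top <- top_loadings(demo_pca, pc = 1)
pc2_top <- top_loadings(demo_pca, pc = 2, n = 3)
print(pc1_top)
print(pc2_top)

# The sign of a component is arbitrary: flipping the loadings and the scores
# of PC1 together leaves the reconstructed data unchanged.
flipped <- demo_pca
flipped$rotation[, 1] <- -flipped$rotation[, 1]
flipped$x[, 1]        <- -flipped$x[, 1]
same_fit <- isTRUE(all.equal(
  flipped$x %*% t(flipped$rotation),
  demo_pca$x %*% t(demo_pca$rotation)
))
```

Because of this sign ambiguity, statements such as "vehicles on the right score higher" depend on the orientation the software happens to return. The relative signs of the variables within a component are what carry meaning.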
The first principal component loads most heavily on battery capacity (0.412), torque (0.391), top speed (0.384), and driving range (0.374), while acceleration has a negative loading (-0.349). Because acceleration is measured as the 0-100 km/h time in seconds, a negative loading means that quicker cars score higher on PC1. PC1 therefore captures an overall performance and power dimension: vehicles with larger batteries and stronger motors tend to score higher.
The second principal component is dominated by vehicle height (-0.588) and energy consumption (efficiency in Wh/km, -0.495), followed by length (-0.360) and width (-0.307). All of these loadings are negative, so taller, larger vehicles that use more energy per kilometre score lower on PC2. This component appears to reflect structural and efficiency-related differences between electric vehicles rather than power.
# PCA Visualization (PC1 vs PC2)
fviz_pca_ind(
  pca_result,
  geom = "point",
  col.ind = "steelblue",
  repel = TRUE
)
The PCA scatter plot shows that most of the variation in the electric vehicle specifications can be summarized along two dimensions. PC1, which explains 56% of the variance, appears to represent overall performance and power. Vehicles on the right side of the plot tend to have higher battery capacity, greater torque, longer range, and shorter 0-100 km/h times.
PC2, which explains about 26% of the variance, seems to capture structural differences related to vehicle size and efficiency. This suggests that differences between EVs are largely driven by a combination of power-related features and physical vehicle characteristics.
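The position of a vehicle in the scatter plot is its score on each component, and a score is a weighted sum of the standardized variables with the loadings as weights. The sketch below verifies this relationship on built-in `mtcars` data, which serves only as a stand-in for `ev_scaled`.

```r
# Stand-in for ev_scaled: standardized numeric columns of mtcars.
X   <- scale(mtcars[, c("mpg", "disp", "hp", "wt", "qsec")])
pca <- prcomp(X, center = FALSE, scale. = FALSE)

# Scores are the standardized data multiplied by the loadings matrix.
scores_manual <- X %*% pca$rotation
max_diff <- max(abs(scores_manual - pca$x))
print(max_diff)   # numerically zero

# PC1 score of the first row, written out as an explicit weighted sum.
pc1_first <- sum(X[1, ] * pca$rotation[, "PC1"])
print(c(manual = pc1_first, prcomp = unname(pca$x[1, "PC1"])))
```

This is why a vehicle with above-average values on the variables that load positively on PC1, and a below-average 0-100 km/h time, ends up on the right side of the plot.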
fviz_pca_biplot(
  pca_result,
  label = "var",
  col.var = "red",
  col.ind = "grey70",
  alpha.ind = 0.5,
  pointsize = 1.5
)
The PCA biplot shows that battery capacity, range, torque, and top speed all point in the same direction along the first principal component. This confirms that PC1 mainly represents the overall performance and power of electric vehicles.
PC2 appears to capture structural differences related to vehicle size and efficiency. Height and efficiency contribute most strongly to this dimension, which suggests that PC2 separates vehicle types and body structures rather than raw performance.
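The strength of each variable's association with a component can also be expressed as a correlation. For standardized data, the correlation between a variable and a component equals the loading multiplied by that component's standard deviation. The sketch below checks this identity on built-in `mtcars` data as a stand-in. The object names are illustrative.

```r
# Stand-in for ev_scaled: standardized numeric columns of mtcars.
X   <- scale(mtcars[, c("mpg", "disp", "hp", "wt", "qsec")])
pca <- prcomp(X, center = FALSE, scale. = FALSE)

# Variable-component correlations: each loading column times its PC's sdev.
var_cor <- sweep(pca$rotation, 2, pca$sdev, `*`)

# The same quantity computed directly from the data and the scores.
direct   <- cor(X, pca$x)
cor_diff <- max(abs(var_cor - direct))

print(round(var_cor[, 1:2], 2))

# Across all components, the squared correlations of a variable sum to 1.
communality <- rowSums(var_cor^2)
```

Using the loadings and standard deviations printed earlier, battery capacity correlates with PC1 at roughly 0.412 × 2.2501 ≈ 0.93, while height and efficiency correlate with PC2 at roughly -0.89 and -0.75. These values are consistent with the reading of the biplot given above.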
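# Cluster the vehicles on their PC1 and PC2 scores
ev_scores <- as.data.frame(pca_result$x[, 1:2])
set.seed(123)  # fixed seed so the random starts are reproducible
km_result <- kmeans(ev_scores, centers = 4, nstart = 25)
ev_scores$cluster <- as.factor(km_result$cluster)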
ggplot(ev_scores, aes(PC1, PC2, color = cluster)) +
  geom_point(alpha = 0.6) +
  theme_minimal() +
  labs(title = "EV Segments based on PCA + K-means")
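The clustering above fixes the number of clusters at four. One common way to examine that choice is the elbow heuristic, which compares the total within-cluster sum of squares across several values of k. The sketch below is a generic illustration on built-in `mtcars` data, not the procedure used to choose k in this paper.

```r
set.seed(123)

# Stand-in for the PC1/PC2 scores (assumption: ev_scores[, 1:2] in the paper).
X      <- scale(mtcars[, c("mpg", "disp", "hp", "wt", "qsec")])
scores <- prcomp(X, center = FALSE, scale. = FALSE)$x[, 1:2]

# Total within-cluster sum of squares for k = 1, ..., 8.
ks  <- 1:8
wss <- sapply(ks, function(k) {
  kmeans(scores, centers = k, nstart = 25)$tot.withinss
})

elbow <- data.frame(k = ks, tot_withinss = round(wss, 2))
print(elbow)
```

Plotting `tot_withinss` against `k` and looking for the point where the curve flattens gives a rough guide. The heuristic is subjective, so it is best read alongside the interpretability of the resulting segments.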
ev_scores %>%
  group_by(cluster) %>%
  summarise(n = n(), mean_PC1 = mean(PC1), mean_PC2 = mean(PC2)) %>%
  arrange(desc(mean_PC1))
ev_clustered <- ev_pca_data
ev_clustered$cluster <- ev_scores$cluster
ev_clustered %>%
  group_by(cluster) %>%
  summarise(across(everything(), mean)) %>%
  arrange(desc(battery_capacity_k_wh))
To interpret the segments identified by K-means clustering, I examine the average values of the original performance variables within each cluster.
Cluster 3 represents high-performance electric vehicles, with the largest battery capacity, the highest torque, and top speeds above 200 km/h. These models appear to correspond to premium or performance-oriented EVs.
Cluster 1 includes more balanced vehicles with medium to high battery capacity and solid performance, which suggests mainstream long-range electric models.
Cluster 4 consists of lower-performance vehicles with smaller batteries and reduced power output, while Cluster 2 shows the lowest battery and torque values overall, indicating entry-level or city-focused electric cars.
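To name concrete models in each segment, the cluster labels have to be attached back to the full table. Since rows with missing values were dropped before the PCA, the same rows must be dropped from the full table first. The sketch below shows one way to keep the rows aligned. It uses the built-in `airquality` data, which contains missing values, as a stand-in, and the names `keep` and `labelled` are illustrative.

```r
# Stand-in data with missing values (assumption: `ev` in the paper).
vars <- c("Ozone", "Solar.R", "Wind", "Temp")

# Flag rows that are complete on the analysis variables, as na.omit() would keep.
keep <- complete.cases(airquality[, vars])

# PCA and k-means on the complete rows only.
X      <- scale(airquality[keep, vars])
scores <- prcomp(X, center = FALSE, scale. = FALSE)$x[, 1:2]
set.seed(123)
km <- kmeans(scores, centers = 3, nstart = 25)

# Attach labels to the full rows (all columns) that entered the analysis.
labelled <- airquality[keep, ]
labelled$cluster <- factor(km$cluster)

print(table(labelled$cluster))
print(head(labelled[order(labelled$cluster), c("Month", "Day", "cluster")]))
```

With the data in this paper, the analogous steps would presumably be `keep <- complete.cases(ev[, names(ev_pca_data)])` followed by `ev[keep, ]`, after which `brand` and `model` can be listed by cluster. Cluster numbers are arbitrary labels and can change with a different seed.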
In this paper, I used PCA to reduce the dimensionality of electric vehicle performance data and identify the main sources of variation across models. The first principal component mainly reflects overall performance and power, while the second captures structural and efficiency-related differences.
Applying K-means clustering to the first two component scores, I identified four EV segments, ranging from high-performance premium vehicles to smaller city-oriented models. Overall, the analysis shows that combining PCA with clustering provides a clear way to understand patterns and segmentation within the electric vehicle market.