EV Metrics

DATA 110 Project 1: Exploring Electric Vehicle Data

Matvei Shaposhnikov

As consumers switch from traditional gas cars, they are faced with a new set of variables, metrics, and trade-offs. The purpose of this project is to explore a dataset of modern electric vehicles to discover relationships between their price, performance, and brand reputation. By visualizing this data, I can provide insightful information to a potential EV buyer.

The dataset for this analysis contains information on over 100 EV models (103 rows). The data was compiled from manufacturer specifications and comes from the EV Database (ev-database.org). For this project, I will focus on the following key variables:

Brand: The manufacturer of the vehicle (e.g., Tesla, Ford, Kia). This is a categorical variable.
PriceEuro: The retail price of the vehicle in Euros. This is a quantitative variable and an important consideration for any buyer.
Range_Km: The maximum distance the vehicle can travel on a single charge, measured in kilometers. This quantitative variable is closely tied to “range anxiety” (the fear of running out of charge while not being near an EV charger).
PowerTrain: The type of drive system the car uses (All-Wheel Drive, Rear-Wheel Drive, or Front-Wheel Drive). This categorical variable can affect both performance and efficiency.

My plan is to tell a two-part story. First, I will create a scatter plot to visualize the relationship between an EV’s price and its range. I will use the drivetrain type to see if it reveals any patterns in this relationship. This visualization aims to answer the question: “Does spending more money get you more range?” Second, building on this, I will create a bar chart that compares the average range offered by different brands. This will help identify which manufacturers are leading the market in terms of battery performance and provide a clear overview for consumers.

Code

Load tidyverse

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.2     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.4     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

# This library contains everything needed to read, manipulate, and visualise data

Set Working Directory and Load the Dataset

setwd('C:/Users/gitar/Documents/EV dataset')
# set the working directory via setwd so that the dataset can be read and used by R
ev_data <- readr::read_csv("ElectricCarData_Clean.csv")

Rows: 103 Columns: 14
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (8): Brand, Model, FastCharge_KmH, RapidCharge, PowerTrain, PlugType, Bo...
dbl (6): AccelSec, TopSpeed_KmH, Range_Km, Efficiency_WhKm, Seats, PriceEuro

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# Load the EV dataset into the document using readr::read_csv

Prepare Data using dplyr commands

# Here, I will create a new data frame that will be used for the second visualization to see the average range and the number of models per brand.

brand_summary <- ev_data %>%
  # Group the data by the car brand.
  group_by(Brand) %>%
  
  # Calculate new summary columns:
  # 1. Average_Range: The mean of the 'Range_Km' for each brand.
  # 2. Model_Count: The total number of models (rows in dataset) for each brand.
  summarise(
    Average_Range = mean(Range_Km),
    Model_Count = n()
  ) %>%
  
  # To keep the chart cleaner, keep only brands that have more than one car model.
  filter(Model_Count > 1) %>%
  
  # Arrange the brands from highest average range to lowest.
  arrange(desc(Average_Range))

# Check new summary table.
print(brand_summary)

# A tibble: 18 × 3
   Brand      Average_Range Model_Count
   <chr>              <dbl>       <int>
 1 Tesla              501.           13
 2 Ford               395             4
 3 Porsche            388             5
 4 Byton              372.            3
 5 Audi               357.            9
 6 Mercedes           350             3
 7 Skoda              338.            6
 8 Nissan             328.            8
 9 BMW                319.            4
10 Volkswagen         318.            8
11 Kia                313             5
12 Hyundai            302.            3
13 Opel               288.            3
14 Peugeot            262.            2
15 Fiat               250             2
16 Renault            234             5
17 Honda              170             2
18 Smart               96.7           3

Scatter Plot of Price v. Range

# This visualization shows the fundamental trade-off between an EV's cost and its driving range. Coloring by drivetrain helps see if AWD, RWD, or FWD cars have different price/range characteristics.

ggplot(data = ev_data, aes(x = PriceEuro, y = Range_Km)) +
  
  # Create points for the scatter plot.
  # map the 'PowerTrain' column to the color aesthetic.
  geom_point(aes(color = PowerTrain), size = 3, alpha = 0.7) +
  
  # Change plot colors from default
  scale_color_manual(values = c("AWD" = "#0072B2", "RWD" = "#009E73", "FWD" = "#D55E00")) +
  
  # Change the theme to be better than the default grey background.
  theme_minimal() +
  
  # Add labels, title, and a caption for the data source.
  labs(
    title = "Electric Vehicle Price vs. Range",
    subtitle = "Higher-priced EVs generally offer a longer range",
    x = "Price (in Euros)",
    y = "Range (in Kilometers)",
    color = "Drivetrain", # This renames the legend title
    caption = "Data Source: Electric Vehicle Database 'ev-database.org'"
  )

Bar Chart of Average Range by Brand

# This visualization builds on the first. After seeing the trend, I now want to know which brands perform best based on it. This chart shows average range for each major brand, making it easy to compare.

ggplot(data = brand_summary, 
       # reorder(Brand, Average_Range) makes it so lowest average range is at the bottom and the highest at the top.
       aes(x = Average_Range, y = reorder(Brand, Average_Range))) +
  
  # Create the horizontal bars.
  geom_col(fill = "#4c72b0") +
  
  # Add the text labels for the model count.
  # A negative hjust (-0.1 here) positions the text just to the right of each bar.
  geom_text(aes(label = paste(Model_Count, "models")), hjust = -0.1, color = "black", size = 3) +
  
  # expand the x-axis to make sure the text labels don't get cut off.
  scale_x_continuous(expand = expansion(mult = c(0, 0.1))) +
  
  # Use a minimal theme for a better look.
  theme_minimal() +
  
  # Add labels and title.
  labs(
    title = "Which Brands Offer the Longest Range?",
    subtitle = "Average range of brands with more than one model in the dataset",
    x = "Average Range (in Kilometers)",
    y = "Car Brand", # The y-axis is now the brand
    caption = "Data Source: Electric Vehicle Database 'ev-database.org'"
  )

Conclusion

a. Preparation

The dataset used, ElectricCarData_Clean.csv, was already in a relatively tidy format, so extensive cleaning like removing NA values was not required. However, significant data preparation was performed using the dplyr package to create a summary table for the second visualization. The process was as follows:

The group_by(Brand) function was used to group the rows by car manufacturer, so that the calculations that follow are done once per brand. (It groups the data; it does not sort it.)

The summarise() function was then used to calculate two new metrics for each brand: Average_Range, which is the mean of the Range_Km column, and Model_Count, which is the number of rows for each brand, calculated using n(). Because the dataset has one row per model, this row count serves as the number of models per brand.

Next, filter(Model_Count > 1) was used to remove brands that only had a single model listed. This was done to reduce noise in the final chart and focus the analysis on manufacturers with a more established lineup of vehicles.

Finally, arrange(desc(Average_Range)) was used to sort the summary table from the highest average range to the lowest, which makes the printed table easy to read. The order of the bars in the chart itself is set by reorder(Brand, Average_Range) inside the plot's aes() call.
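The difference between counting rows and counting unique model names can be seen in a small sketch. The toy data frame and its values below are hypothetical and are not taken from ElectricCarData_Clean.csv; the column names Row_Count and Distinct_Models are also illustrative. It only assumes dplyr is installed.

```r
library(dplyr)

# Hypothetical toy data (not from the EV dataset): brand "A" has a
# repeated model name, so rows and unique models differ.
toy <- tibble(
  Brand    = c("A", "A", "A", "B"),
  Model    = c("X", "X", "Y", "Z"),
  Range_Km = c(300, 320, 400, 250)
)

toy_summary <- toy %>%
  group_by(Brand) %>%
  summarise(
    Row_Count       = n(),               # number of rows per brand
    Distinct_Models = n_distinct(Model), # number of unique model names per brand
    Average_Range   = mean(Range_Km)     # mean range per brand
  )

print(toy_summary)
```

If a dataset lists each model exactly once, the two counts agree, and n() is enough.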

b. Interpretation

The two visualizations showed several interesting patterns. The first plot, a scatter plot of price vs. range, confirmed the expected positive correlation: in general, more expensive EVs offer a longer driving range. The drivetrain data, represented by color, showed that All-Wheel Drive (AWD) vehicles span much of the price and range spectrum, staying above 250 km of range and including the highest-performing models. Front-Wheel Drive (FWD) and Rear-Wheel Drive (RWD) vehicles were less common at the higher performance points: FWD vehicles clustered around 250 km of range, while RWD vehicles mostly landed around 300 km (with a few notable RWD outliers that performed very poorly). The pattern may be partly skewed by the fact that most car manufacturers charge more for AWD than for FWD or RWD; however, the difference in MSRP between the three is not as great as the differences shown in the first visualization.
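The visual impression of a positive relationship could also be checked numerically with a correlation coefficient. The sketch below is only an illustration: price_range_cor is a hypothetical helper, not part of the original analysis, and the toy rows are made up to show the call. It assumes a data frame with PriceEuro and Range_Km columns, as ev_data has.

```r
# Hypothetical helper: correlation between price and range for any data
# frame that has PriceEuro and Range_Km columns.
price_range_cor <- function(df, method = "pearson") {
  cor(df$PriceEuro, df$Range_Km, use = "complete.obs", method = method)
}

# Made-up rows, only to show how the helper is called; these are not
# values from the EV dataset.
toy <- data.frame(
  PriceEuro = c(30000, 45000, 60000, 90000),
  Range_Km  = c(200, 310, 350, 480)
)

toy_r   <- price_range_cor(toy)                      # Pearson (linear) correlation
toy_rho <- price_range_cor(toy, method = "spearman") # rank-based correlation
```

On the real data the call would be price_range_cor(ev_data). The Spearman version is less sensitive to a few very expensive cars, which may matter for a price variable with a long upper tail.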

The second plot, a bar chart of average range by brand, provided a clear comparison of manufacturer performance. The most striking result was the dominance of Tesla, the well-established leader in EV technology, with an average range of about 501 km. Ford (395 km) and Porsche (388 km) are the closest challengers, even while having only 4 or 5 models in the dataset, and may be beginning to catch up, although both still trail Tesla by more than 100 km and a single snapshot cannot show a trend. The chart also highlighted that while a trusted legacy brand like Audi has several models on the market (9 in this dataset), its average range of about 357 km is considerably lower than Tesla's. This distinction between the number of offerings and their average performance is a big takeaway for consumers looking beyond just the brand name.

c. Limitation

One major factor I wish I could have included is charging speed. The dataset contains a FastCharge_KmH column, but the data is inconsistent, with non-numeric placeholder values (e.g., “-”) standing in for missing entries, which is why the column was read as text rather than as numbers. The column would need additional cleaning before it could be used in a visualization. A chart comparing a car’s range to its fast-charging speed would be valuable, as it addresses both “range anxiety” and the time spent waiting at a charger. The dataset also lacks regional information; knowing which cars are available in specific markets would make the analysis more practical for a potential buyer. A future project could attempt to clean the charging data or merge this dataset with one containing sales and availability by country.
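As a starting point for that future cleaning step, here is a minimal sketch of one possible approach. It assumes that “-” is the only non-numeric placeholder in the column, which would need to be checked against the real file. The toy values below are hypothetical and only stand in for the real column.

```r
library(dplyr)

# Hypothetical sample standing in for the real FastCharge_KmH column,
# which read_csv imports as text because of the "-" entries.
toy <- tibble(
  Model          = c("M1", "M2", "M3"),
  FastCharge_KmH = c("940", "-", "250")
)

toy_clean <- toy %>%
  mutate(
    # Replace the "-" placeholder with NA, then convert the text to numbers.
    FastCharge_KmH = as.numeric(na_if(FastCharge_KmH, "-"))
  )

print(toy_clean)
```

Another option would be to declare the placeholder when the file is loaded, using the na argument of readr::read_csv (for example, na = c("", "NA", "-")), so the column is parsed as numeric from the start.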
