2025-10-23

The Dataset

  • This project uses global car sales data from Kaggle.
  • It contains 50,000 entries, each recording a car sale and the characteristics of the car sold.
  • Six characteristics are recorded for each car, which lets us analyze relationships between car types, conditions, and prices.

Brief Overview

To draw conclusions from this data, we are going to visualize the data through a few different lenses.

  • Pie charts will help us understand the makeup of the cars sold this year.
  • Side-by-side box plots will illustrate the differences in price distribution among models and manufacturers.
  • A scatter plot will show the number of cars sold by each manufacturer per year.
  • A 3D scatter plot will further demonstrate the effects of mileage, year of manufacture, and fuel type on car price.
  • Lastly, we will examine the five-number summary of prices for each manufacturer to compare the overall distributions.

Pie Charts

Pie charts show the proportion of cars sold by manufacturer and by model. The left chart illustrates that VW, Ford, and Toyota make up over 75% of the cars sold. On the right, each of their models also makes up a larger percentage of the cars sold than the models made by BMW or Porsche. This analysis demonstrates that mainstream car brands sell more frequently than luxury brands.
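As a rough illustration of how such a chart can be built, the sketch below computes each manufacturer's share of sales and draws it as a pie chart. It is not necessarily how the original charts were made: `toy_cars` is a small made-up stand-in for the real data, and only the `Manufacturer` column name is taken from the project.

```r
library(dplyr)
library(ggplot2)

# Made-up stand-in for the real data set (which has 50,000 rows).
toy_cars <- data.frame(
  Manufacturer = c("VW", "VW", "Ford", "Ford", "Toyota", "BMW", "Porsche", "VW")
)

# Count sales per manufacturer and convert the counts to percentages.
share <- toy_cars %>%
  count(Manufacturer, name = "n") %>%
  mutate(pct = 100 * n / sum(n))

# In ggplot2, a pie chart is a single stacked bar drawn in polar coordinates.
pie <- ggplot(share, aes(x = "", y = pct, fill = Manufacturer)) +
  geom_col(width = 1) +
  coord_polar(theta = "y") +
  labs(title = "Share of cars sold by manufacturer", x = NULL, y = NULL)

print(pie)
```

The same pattern should work for models by counting on the model column instead.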

Ggplot Boxplot

The following plot compares the price distributions of cars sold, grouped by make and model. BMW has the highest variance, whereas Ford, Toyota, and VW have tighter distributions and narrower ranges.
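A side-by-side box plot of this kind could be sketched as follows. The data here is randomly generated for illustration only, the model labels are placeholders, and the column names `Model` and `Price` are assumptions about how the real data is laid out.

```r
library(ggplot2)

set.seed(1)

# Randomly generated stand-in data; the model labels A-D are placeholders.
toy_cars <- data.frame(
  Manufacturer = rep(c("BMW", "Ford"), each = 40),
  Model = rep(c("A", "B", "C", "D"), each = 20),
  Price = c(rlnorm(40, meanlog = 9.6, sdlog = 1.0),
            rlnorm(40, meanlog = 8.8, sdlog = 0.8))
)

# One box per model, filled by manufacturer so the makes can be compared.
box <- ggplot(toy_cars, aes(x = Model, y = Price, fill = Manufacturer)) +
  geom_boxplot() +
  labs(title = "Price distribution by model and manufacturer",
       x = "Model",
       y = "Price")

print(box)
```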

Ggplot Scatter Plot

This plot shows the number of cars sold by each manufacturer, grouped by year of manufacture. A fitted line for each manufacturer, drawn in that manufacturer's color, shows the trend in the number sold per year.

Ggplot Scatter Plot Code

This is the code used to visualize the scatter plot.

library(dplyr)
library(ggplot2)

# Count the cars sold for each manufacturer and year of manufacture
year_data <- car_data %>%
  group_by(Year.of.manufacture, Manufacturer) %>%
  summarise(n = n(), .groups = "drop")

# Plot the yearly counts with one linear trend line per manufacturer
ggplot(data = year_data, aes(x = Year.of.manufacture, y = n)) +
  geom_smooth(aes(color = Manufacturer), method = "lm", se = FALSE) +
  geom_point(aes(color = Manufacturer), size = 0.8) +
  labs(title = "Number of cars sold by Manufacturer per Year",
       x = "Year Manufactured",
       y = "Count Sold")

geom_smooth fits a linear regression line to the number of cars sold each year for each manufacturer. The graph clearly indicates that BMW and Porsche sold far fewer cars per year than the more mainstream brands. Those two manufacturers have also experienced less growth since 1990 than the other companies.
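To put numbers on the growth comparison, the slope of each fitted line can be extracted directly. The sketch below uses base R and a tiny made-up `year_data` with the same column names as above; the counts are invented for illustration and are not the project's results.

```r
# Made-up yearly counts with the same columns as year_data above.
year_data <- data.frame(
  Year.of.manufacture = rep(c(1990, 2000, 2010, 2020), times = 2),
  Manufacturer = rep(c("Ford", "Porsche"), each = 4),
  n = c(100, 200, 300, 400, 20, 30, 40, 50)
)

# Fit n ~ year separately for each manufacturer and keep the slope,
# i.e. the estimated change in cars sold per year.
slopes <- sapply(split(year_data, year_data$Manufacturer), function(d) {
  coef(lm(n ~ Year.of.manufacture, data = d))[["Year.of.manufacture"]]
})

print(slopes)
```

A larger slope corresponds to a steeper line in the plot, so comparing slopes is a compact way to compare growth between manufacturers.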

Plotly 3D plot

The following is a 3D scatter plot of the effects of mileage, year of manufacture, and fuel type on car price.
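A plot like this can be sketched with plotly's `plot_ly` using the `scatter3d` trace type. The data below is randomly generated, and the column names `Mileage`, `Fuel.type`, and `Price` are assumptions; only `Year.of.manufacture` appears in the project code.

```r
library(plotly)

set.seed(1)

# Randomly generated stand-in data; column names other than
# Year.of.manufacture are assumed.
toy_cars <- data.frame(
  Mileage = runif(60, 0, 200000),
  Year.of.manufacture = sample(1990:2022, 60, replace = TRUE),
  Fuel.type = sample(c("Petrol", "Diesel", "Hybrid"), 60, replace = TRUE)
)
toy_cars$Price <- 40000 * exp(-toy_cars$Mileage / 80000)

# Three numeric variables go on the axes; fuel type sets the marker color.
fig <- plot_ly(
  toy_cars,
  x = ~Mileage, y = ~Year.of.manufacture, z = ~Price,
  color = ~Fuel.type,
  type = "scatter3d", mode = "markers",
  marker = list(size = 3)
) %>%
  layout(scene = list(
    xaxis = list(title = "Mileage"),
    yaxis = list(title = "Year manufactured"),
    zaxis = list(title = "Price")
  ))
```

Printing `fig` in an interactive session or an R Markdown chunk renders the interactive plot.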

3D plot Analysis

To see what affects the price of a car, the 3D plot shows four variables: price, mileage, and year of manufacture on the axes, and fuel type as the point color.

  • The graph shows that the more recently the car was manufactured, the higher the price.
  • Price appears to fall roughly exponentially as mileage increases. Once a vehicle reaches a mileage of at least 100k, the price usually drops below $50,000.
  • Petrol cars are typically more expensive than hybrid or diesel cars; however, hybrid cars have the narrowest price distribution.
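One way to check the exponential pattern described above is to regress the logarithm of price on mileage: if price decays exponentially, the relationship is a straight line on the log scale. The sketch below uses simulated data and assumed column names (`Mileage`, `Price`), so its numbers say nothing about the real data set.

```r
set.seed(42)

# Simulated data with a known exponential decay plus multiplicative noise.
toy_cars <- data.frame(Mileage = seq(0, 200000, length.out = 50))
toy_cars$Price <- 40000 * exp(-toy_cars$Mileage / 80000) *
  exp(rnorm(50, sd = 0.1))

# log(Price) = a + b * Mileage is equivalent to Price = exp(a) * exp(b * Mileage).
fit <- lm(log(Price) ~ Mileage, data = toy_cars)
rate <- coef(fit)[["Mileage"]]

# Estimated percent change in price for each additional 10,000 units of mileage.
pct_per_10k <- 100 * (exp(rate * 10000) - 1)
print(pct_per_10k)
```

A negative `rate` with a good linear fit on the log scale would support the exponential-decay reading; a poor fit would suggest a different functional form.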

Statistical Analysis

The five-number summary for each manufacturer displays the minimum and maximum values, as well as the 25th, 50th, and 75th percentiles. The box plot shows the same information visually, broken down further by model. In this table, it is clear that Porsche and BMW prices are distributed much higher than those of Toyota, Ford, or VW. The minimum values for the manufacturers are similar; however, the values of Q1, median, and Q3 clearly demonstrate the differences.

## # A tibble: 5 × 6
##   Manufacturer   Min    Q1 Median     Q3    Max
##   <chr>        <int> <dbl>  <dbl>  <dbl>  <int>
## 1 BMW            167 5318   14384 33592  168081
## 2 Ford            85 2626    6646 15706.  62748
## 3 Porsche        266 7039   18441 43712  167774
## 4 Toyota         176 3373.   8802 20918.  86353
## 5 VW              76 2584    6428 15214   58588
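A table of this shape can be produced with a grouped `summarise`. The sketch below is one possible way to do it, run on a small made-up data frame; the `Price` column name is an assumption about the real data.

```r
library(dplyr)

# Made-up prices for two manufacturers, for illustration only.
toy_cars <- data.frame(
  Manufacturer = rep(c("Ford", "Porsche"), each = 5),
  Price = c(1000, 2000, 3000, 4000, 5000,
            10000, 20000, 30000, 40000, 50000)
)

# Five-number summary of price for each manufacturer.
price_summary <- toy_cars %>%
  group_by(Manufacturer) %>%
  summarise(
    Min = min(Price),
    Q1 = quantile(Price, 0.25, names = FALSE),
    Median = median(Price),
    Q3 = quantile(Price, 0.75, names = FALSE),
    Max = max(Price),
    .groups = "drop"
  )

print(price_summary)
```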

Conclusion

With our various visualizations of this data, we can better understand the factors behind global car sales. The pie charts and scatter plot show the distribution of cars sold by manufacturer and model. The box plot is a useful visual tool for examining the spread of prices across each make and model. The 3D scatter plot examines the relationships among four variables: fuel type, year of manufacture, price, and mileage. Fuel type seemed to have a fairly small impact on price, whereas vehicle age and mileage seemed to have an exponential effect on the price of cars in this population. The statistical analysis helped identify large differences in distribution between manufacturers: non-luxury brands had lower prices and smaller spreads than luxury brands. These analyses could help businesses that sell cars either set prices that align with the current market based on mileage, year of manufacture, and model, or decide which cars to stock and sell.