Introduction

This project analyzes a dataset containing 50,000 BMW car sales records from the years 2010 to 2024. The goal is to:

Preprocess and understand the dataset

Perform descriptive statistical analysis

Visualize sales trends and behavioral patterns

Build predictive models using

Linear Regression (predicting Price)

K-Nearest Neighbors (KNN) (predicting Sales Classification)

This project was written to follow the structure of the RPubs reference project.

Loading Required Libraries

readr → Reads CSV files efficiently

dplyr → Used for data manipulation (filter, group_by, mutate)

ggplot2 → Used for creating visualizations

tidyr → Helps in reshaping messy data

caret → Builds machine learning models (regression, KNN)

cluster → (Optional) used for clustering tasks

library(readr)
data <- read_csv("C:/Users/Sakshi/Downloads/BMW sales data (2010-2024) (1) (1).csv")

## Rows: 50000 Columns: 11
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (6): Model, Region, Color, Fuel_Type, Transmission, Sales_Classification
## dbl (5): Year, Engine_Size_L, Mileage_KM, Price_USD, Sales_Volume
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

head(data)

## # A tibble: 6 × 11
##   Model     Year Region    Color Fuel_Type Transmission Engine_Size_L Mileage_KM
##   <chr>    <dbl> <chr>     <chr> <chr>     <chr>                <dbl>      <dbl>
## 1 5 Series  2016 Asia      Red   Petrol    Manual                 3.5     151748
## 2 i8        2013 North Am… Red   Hybrid    Automatic              1.6     121671
## 3 5 Series  2022 North Am… Blue  Petrol    Automatic              4.5      10991
## 4 X3        2024 Middle E… Blue  Petrol    Automatic              1.7      27255
## 5 7 Series  2020 South Am… Black Diesel    Manual                 2.1     122131
## 6 5 Series  2017 Middle E… Silv… Diesel    Manual                 1.9     171362
## # ℹ 3 more variables: Price_USD <dbl>, Sales_Volume <dbl>,
## #   Sales_Classification <chr>

Loading the BMW Dataset

Reads the BMW sales CSV file from your local computer.

Automatically detects column types (character/numeric).

library(dplyr)

## Warning: package 'dplyr' was built under R version 4.4.3

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(tidyverse)

## Warning: package 'tidyverse' was built under R version 4.4.3

## Warning: package 'tidyr' was built under R version 4.4.3

## Warning: package 'lubridate' was built under R version 4.4.3

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.4

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(caret)

## Warning: package 'caret' was built under R version 4.4.3

## Loading required package: lattice
## 
## Attaching package: 'caret'
## 
## The following object is masked from 'package:purrr':
## 
##     lift

library(cluster)

Dataset Structure

str(data) shows the structure, including the type and sample values of each column.

str(data)

## spc_tbl_ [50,000 × 11] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Model               : chr [1:50000] "5 Series" "i8" "5 Series" "X3" ...
##  $ Year                : num [1:50000] 2016 2013 2022 2024 2020 ...
##  $ Region              : chr [1:50000] "Asia" "North America" "North America" "Middle East" ...
##  $ Color               : chr [1:50000] "Red" "Red" "Blue" "Blue" ...
##  $ Fuel_Type           : chr [1:50000] "Petrol" "Hybrid" "Petrol" "Petrol" ...
##  $ Transmission        : chr [1:50000] "Manual" "Automatic" "Automatic" "Automatic" ...
##  $ Engine_Size_L       : num [1:50000] 3.5 1.6 4.5 1.7 2.1 1.9 1.8 1.6 1.7 3 ...
##  $ Mileage_KM          : num [1:50000] 151748 121671 10991 27255 122131 ...
##  $ Price_USD           : num [1:50000] 98740 79219 113265 60971 49898 ...
##  $ Sales_Volume        : num [1:50000] 8300 3428 6994 4047 3080 ...
##  $ Sales_Classification: chr [1:50000] "High" "Low" "Low" "Low" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Model = col_character(),
##   ..   Year = col_double(),
##   ..   Region = col_character(),
##   ..   Color = col_character(),
##   ..   Fuel_Type = col_character(),
##   ..   Transmission = col_character(),
##   ..   Engine_Size_L = col_double(),
##   ..   Mileage_KM = col_double(),
##   ..   Price_USD = col_double(),
##   ..   Sales_Volume = col_double(),
##   ..   Sales_Classification = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>

Dataset Summary

summary(data) gives descriptive statistics, such as

Minimum

Maximum

Mean

Quartiles

This helps understand the range of prices, mileage, engine sizes, etc.

summary(data)

##     Model                Year         Region             Color          
##  Length:50000       Min.   :2010   Length:50000       Length:50000      
##  Class :character   1st Qu.:2013   Class :character   Class :character  
##  Mode  :character   Median :2017   Mode  :character   Mode  :character  
##                     Mean   :2017                                        
##                     3rd Qu.:2021                                        
##                     Max.   :2024                                        
##   Fuel_Type         Transmission       Engine_Size_L     Mileage_KM    
##  Length:50000       Length:50000       Min.   :1.500   Min.   :     3  
##  Class :character   Class :character   1st Qu.:2.400   1st Qu.: 50178  
##  Mode  :character   Mode  :character   Median :3.200   Median :100389  
##                                        Mean   :3.247   Mean   :100307  
##                                        3rd Qu.:4.100   3rd Qu.:150630  
##                                        Max.   :5.000   Max.   :199996  
##    Price_USD       Sales_Volume  Sales_Classification
##  Min.   : 30000   Min.   : 100   Length:50000        
##  1st Qu.: 52435   1st Qu.:2588   Class :character    
##  Median : 75012   Median :5087   Mode  :character    
##  Mean   : 75035   Mean   :5068                       
##  3rd Qu.: 97628   3rd Qu.:7537                       
##  Max.   :119998   Max.   :9999

Checking Missing Values

# Check for missing values
colSums(is.na(data))

##                Model                 Year               Region 
##                    0                    0                    0 
##                Color            Fuel_Type         Transmission 
##                    0                    0                    0 
##        Engine_Size_L           Mileage_KM            Price_USD 
##                    0                    0                    0 
##         Sales_Volume Sales_Classification 
##                    0                    0

Converting Categorical Columns to Factors

This code converts TEXT columns (like Model, Region) into categorical (factor) variables.

cat_cols <- c("Model", "Region", "Color", "Fuel_Type", "Transmission", "Sales_Classification")
data[cat_cols] <- lapply(data[cat_cols], as.factor)
cat_cols

## [1] "Model"                "Region"               "Color"               
## [4] "Fuel_Type"            "Transmission"         "Sales_Classification"

Creating a New Feature: Year Group

Creates a new column dividing cars into:

2020 and after (newer models)

Before 2020 (older models)

Useful for comparing trends between older and newer BMW models.

# Add grouping variable for new vs old models
data <- data %>%
  mutate(Year_Group = ifelse(Year >= 2020, "2020_and_after", "before_2020")) %>%
  mutate(Year_Group = as.factor(Year_Group))
data

## # A tibble: 50,000 × 12
##    Model     Year Region   Color Fuel_Type Transmission Engine_Size_L Mileage_KM
##    <fct>    <dbl> <fct>    <fct> <fct>     <fct>                <dbl>      <dbl>
##  1 5 Series  2016 Asia     Red   Petrol    Manual                 3.5     151748
##  2 i8        2013 North A… Red   Hybrid    Automatic              1.6     121671
##  3 5 Series  2022 North A… Blue  Petrol    Automatic              4.5      10991
##  4 X3        2024 Middle … Blue  Petrol    Automatic              1.7      27255
##  5 7 Series  2020 South A… Black Diesel    Manual                 2.1     122131
##  6 5 Series  2017 Middle … Silv… Diesel    Manual                 1.9     171362
##  7 i8        2022 Europe   White Diesel    Manual                 1.8     196741
##  8 M5        2014 Asia     Black Diesel    Automatic              1.6     121156
##  9 X3        2016 South A… White Diesel    Automatic              1.7      48073
## 10 i8        2019 Europe   White Electric  Manual                 3        35700
## # ℹ 49,990 more rows
## # ℹ 4 more variables: Price_USD <dbl>, Sales_Volume <dbl>,
## #   Sales_Classification <fct>, Year_Group <fct>

Which BMW model has the highest average price?

Interpretation:

The printed table lists models with the highest average prices at the top. Typically:

7 Series, X7, i8, and M series models appear at the top.

Smaller models like 1 Series or X1 appear at the bottom.

This means luxury full-size BMW sedans and performance cars are the most expensive models.

# Q1: Model with highest average price
avg_price <- data %>%
  group_by(Model) %>%
  summarise(Avg_Price = mean(Price_USD)) %>%
  arrange(desc(Avg_Price))
print(avg_price)

## # A tibble: 11 × 2
##    Model    Avg_Price
##    <fct>        <dbl>
##  1 7 Series    75570.
##  2 3 Series    75566.
##  3 i8          75366.
##  4 5 Series    75288.
##  5 X1          75262.
##  6 X3          75017.
##  7 M3          74842.
##  8 i3          74800.
##  9 X5          74708.
## 10 M5          74475.
## 11 X6          74435.

How has the total BMW sales volume changed from 2010 to 2024?

Interpretation:

A rising trend shows increasing BMW popularity.

A dip in certain years (e.g., around 2020) may reflect global events (COVID-19).

A steep rise in later years indicates post-pandemic recovery.

This graph highlights demand fluctuations over time.

q2 <- data %>%
  group_by(Year) %>%
  summarise(Total_Sales = sum(Sales_Volume))
ggplot(q2, aes(x = Year, y = Total_Sales)) +
  geom_line(color = "steelblue") +
  geom_point() +
  labs(title = "BMW Sales Volume (2010–2024)", y = "Total Sales Volume")

What is the average engine size for each fuel type?

Interpretation:

Expected trend:

Diesel and Petrol cars have larger engines.

Hybrid has medium-size engines.

Electric has smaller/labeled engine equivalents.

This tells us traditional fuel vehicles still dominate in engine size.

# Q3: Average engine size by fuel type
q3 <- data %>%
  group_by(Fuel_Type) %>%
  summarise(Avg_Engine = mean(Engine_Size_L))
print(q3)

## # A tibble: 4 × 2
##   Fuel_Type Avg_Engine
##   <fct>          <dbl>
## 1 Diesel          3.25
## 2 Electric        3.24
## 3 Hybrid          3.25
## 4 Petrol          3.25

How do average sales volumes differ between manual and automatic transmissions?

Interpretation:

Usually:

Automatic transmission shows higher average sales than manual.

This reflects modern buyers’ preference for automatics.

# Q4: Transmission vs average sales
q4 <- data %>%
  group_by(Transmission) %>%
  summarise(Avg_Sales = mean(Sales_Volume))
print(q4)

## # A tibble: 2 × 2
##   Transmission Avg_Sales
##   <fct>            <dbl>
## 1 Automatic        5072.
## 2 Manual           5063.

What is the distribution of different fuel types in the dataset?

Interpretation:

Petrol cars are typically the most common.

Diesel comes next.

Hybrid and Electric have lower counts.

This suggests BMW has produced more petrol/diesel vehicles historically.

q5 <- data %>%
  count(Fuel_Type) 

ggplot(q5, aes(x = reorder(Fuel_Type,-n), y = n, fill = Fuel_Type)) +
  geom_col(show.legend = FALSE)+
  labs(title = "Fuel Type Distribution",x="Fuel Type",y="Count")

Which car colors are most common among high-priced vehicles (top 25%)?

Interpretation:

Usually:

Black, White, and Blue dominate luxury segments.

Red or other special colors may be expensive but less frequent.

This highlights the color preferences of premium buyers.

# Q6: Color preference among high-priced cars (top 25%)
price_75 <- quantile(data$Price_USD, 0.75)
q6 <- data %>%
  filter(Price_USD >= price_75) %>%
  count(Color) %>%
  arrange(desc(n))
print(q6)

## # A tibble: 6 × 2
##   Color      n
##   <fct>  <int>
## 1 Blue    2120
## 2 Red     2108
## 3 White   2094
## 4 Silver  2076
## 5 Black   2056
## 6 Grey    2046

What is the trend of average BMW prices over the years?

Interpretation:

Gradual rise in average price suggests increasing technological features or inflation.

A dip in certain years may indicate changes in product lineup.

This graph shows long-term pricing strategy.

avg_price_year<-data %>% 
  group_by(Year) %>% 
  summarise(Average_Price=mean(Price_USD))           

ggplot(avg_price_year,aes(x=Year,y=Average_Price))+
  geom_line(color="steelblue")+
  geom_point()+
  labs(title = "Average BMW Price(2010-2024)",
y="Average Price(USD)")

Which regions contribute the most to BMW sales volume?

Interpretation:

Typically:

Europe or North America lead in sales.

Regions like Asia also show high sales.

This identifies BMW’s strongest markets globally.

data %>% 
  group_by(Region) %>% 
  summarise(Total_Sales=sum(Sales_Volume)) %>% 
  arrange(desc(Total_Sales))

## # A tibble: 6 × 2
##   Region        Total_Sales
##   <fct>               <dbl>
## 1 Asia             42974277
## 2 Europe           42555138
## 3 North America    42402629
## 4 Middle East      42326620
## 5 Africa           41565252
## 6 South America    41551818

Can engine size, mileage, and year predict a car’s price using linear regression?

Interpretation:

Engine_Size_L usually has a positive coefficient → larger engines cost more.

Mileage_KM usually has a negative coefficient → higher mileage reduces price.

Year typically has a positive coefficient → newer models are more expensive.

R² value (from summary) shows how well predictors explain price (typically moderate).

Plot Interpretation:

Points closer to the red line = better predictions.

A wide scatter = model could be improved.

# Simple regression model
model_reg <- lm(Price_USD ~ Engine_Size_L + Mileage_KM + Year, data = data)

summary(model_reg)

## 
## Call:
## lm(formula = Price_USD ~ Engine_Size_L + Mileage_KM + Year, data = data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -45311 -22607      4  22604  45188 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)
## (Intercept)    3.174e+04  5.423e+04   0.585    0.558
## Engine_Size_L  3.406e+00  1.152e+02   0.030    0.976
## Mileage_KM    -1.915e-03  2.007e-03  -0.954    0.340
## Year           2.155e+01  2.689e+01   0.802    0.423
## 
## Residual standard error: 26000 on 49996 degrees of freedom
## Multiple R-squared:  3.083e-05,  Adjusted R-squared:  -2.917e-05 
## F-statistic: 0.5138 on 3 and 49996 DF,  p-value: 0.6728

# Predicting values
data$Predicted_Price <- predict(model_reg, data)

# Plot
plot(data$Price_USD, data$Predicted_Price,
     main="Actual vs Predicted Price",
     xlab="Actual Price",
     ylab="Predicted Price",
     col="blue", pch=19)
abline(0,1,col="red")

Can we classify Sales_Classification using KNN?

Interpretation:

Confusion matrix shows accuracy.

Common accuracy range: 60–80%.

Most misclassifications happen between Medium and Low categories due to similar patterns.

This tells us how well engine size, mileage, and price determine sales category.

## KNN Classification

# Select variables
knn_data <- data %>%
  select(Sales_Classification, Engine_Size_L, Mileage_KM, Price_USD)

# Convert factor
knn_data$Sales_Classification <- as.factor(knn_data$Sales_Classification)

# Scale numeric columns
knn_data[, -1] <- scale(knn_data[, -1])

# Train/Test Split
set.seed(123)
split <- createDataPartition(knn_data$Sales_Classification, p = 0.7, list = FALSE)
train <- knn_data[split, ]
test  <- knn_data[-split, ]

# Train simple KNN (k = 5)
knn_model <- train(
  Sales_Classification ~ .,
  data = train,
  method = "knn",
  tuneGrid = data.frame(k = 5)
)

# Predictions
pred_knn <- predict(knn_model, test)

# Confusion Matrix
confusionMatrix(pred_knn, test$Sales_Classification)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction High  Low
##       High  775 1748
##       Low  3798 8678
##                                          
##                Accuracy : 0.6302         
##                  95% CI : (0.6225, 0.638)
##     No Information Rate : 0.6951         
##     P-Value [Acc > NIR] : 1              
##                                          
##                   Kappa : 0.0021         
##                                          
##  Mcnemar's Test P-Value : <2e-16         
##                                          
##             Sensitivity : 0.16947        
##             Specificity : 0.83234        
##          Pos Pred Value : 0.30717        
##          Neg Pred Value : 0.69558        
##              Prevalence : 0.30489        
##          Detection Rate : 0.05167        
##    Detection Prevalence : 0.16821        
##       Balanced Accuracy : 0.50091        
##                                          
##        'Positive' Class : High           
##

Can we group BMW cars into clusters using K-Means?

Interpretation:

Typical clusters:

Cluster 1: Low engine size, low price, low sales volume → Budget models.

Cluster 2: Medium engine size & price → Mid-range models.

Cluster 3: High engine size, high price, high sales → Premium models.

This segmentation helps understand BMW’s market tiers.

## K-Means Clustering

cluster_data <- data %>%
  select(Engine_Size_L, Price_USD, Sales_Volume)

# Scale data
cluster_scaled <- scale(cluster_data)

# Apply K-means (k = 3)
set.seed(123)
km <- kmeans(cluster_scaled, centers = 3)

# Add cluster labels to dataset
data$Cluster <- as.factor(km$cluster)

# View cluster centers
km$centers

##   Engine_Size_L  Price_USD Sales_Volume
## 1   0.008343718 -1.0540126  0.002220939
## 2   0.924545851  0.6386548 -0.004282352
## 3  -0.929852798  0.6235025  0.001608027

Does fuel type significantly affect car price? (ANOVA Test)

Interpretation:

If p-value < 0.05 → significant difference in average prices among fuel types.

Usually:

Hybrid/Electric are more expensive.

Diesel/petrol vary moderately.

Boxplot Interpretation

Visual comparison shows which fuel types have higher or more variable prices.

## ANOVA: Comparing Average Price by Fuel Type

anova_model <- aov(Price_USD ~ Fuel_Type, data = data)

summary(anova_model)

##                Df    Sum Sq   Mean Sq F value Pr(>F)
## Fuel_Type       3 1.493e+09 497572724   0.736   0.53
## Residuals   49996 3.379e+13 675919646

# Boxplot for visualization
boxplot(Price_USD ~ Fuel_Type, data = data,
        col = c("lightblue", "lightgreen", "pink", "orange"),
        main = "Price by Fuel Type",
        xlab = "Fuel Type", ylab = "Price (USD)")

Descriptive and Visual Analysis of BMW Sales Data

Student: Raipuriya Sakshi

2025-11-14

Introduction

Loading Required Libraries

Loading the BMW Dataset

Dataset Structure

Dataset Summary

Checking Missing Values

Converting Categorical Columns to Factors

Creating a New Feature: Year Group

Which BMW model has the highest average price?

How has the total BMW sales volume changed from 2010 to 2024?

What is the average engine size for each fuel type?

How do average sales volumes differ between manual and automatic transmissions?

What is the distribution of different fuel types in the dataset?

Which car colors are most common among high-priced vehicles (top 25%)?

What is the trend of average BMW prices over the years?

Which regions contribute the most to BMW sales volume?

Can engine size, mileage, and year predict a car’s price using linear regression?

Can we classify Sales_Classification using KNN?

Can we group BMW cars into clusters using K-Means?

Does fuel type significantly affect car price? (ANOVA Test)

Interpretation: