Introduction

This project analyzes a dataset containing 50,000 BMW car sales records from the years 2010 to 2024. The goal is to:

Preprocess and understand the dataset

Perform descriptive statistical analysis

Visualize sales trends and behavioral patterns

Build predictive models using

Linear Regression (predicting Price)

K-Nearest Neighbors (KNN) (predicting Sales Classification)

This project was written to follow the structure of the RPubs reference project.

Loading Required Libraries

readr → Reads CSV files efficiently

dplyr → Used for data manipulation (filter, group_by, mutate)

ggplot2 → Used for creating visualizations

tidyr → Helps in reshaping messy data

caret → Builds machine learning models (regression, KNN)

cluster → (Optional) used for clustering tasks

library(readr)
data <- read_csv("C:/Users/Sakshi/Downloads/BMW sales data (2010-2024) (1) (1).csv")  
## Rows: 50000 Columns: 11
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (6): Model, Region, Color, Fuel_Type, Transmission, Sales_Classification
## dbl (5): Year, Engine_Size_L, Mileage_KM, Price_USD, Sales_Volume
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(data)
## # A tibble: 6 × 11
##   Model     Year Region    Color Fuel_Type Transmission Engine_Size_L Mileage_KM
##   <chr>    <dbl> <chr>     <chr> <chr>     <chr>                <dbl>      <dbl>
## 1 5 Series  2016 Asia      Red   Petrol    Manual                 3.5     151748
## 2 i8        2013 North Am… Red   Hybrid    Automatic              1.6     121671
## 3 5 Series  2022 North Am… Blue  Petrol    Automatic              4.5      10991
## 4 X3        2024 Middle E… Blue  Petrol    Automatic              1.7      27255
## 5 7 Series  2020 South Am… Black Diesel    Manual                 2.1     122131
## 6 5 Series  2017 Middle E… Silv… Diesel    Manual                 1.9     171362
## # ℹ 3 more variables: Price_USD <dbl>, Sales_Volume <dbl>,
## #   Sales_Classification <chr>

Loading the BMW Dataset

Reads the BMW sales CSV file from your local computer.

Automatically detects column types (character/numeric).

library(dplyr)
## Warning: package 'dplyr' was built under R version 4.4.3
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.4.3
## Warning: package 'tidyr' was built under R version 4.4.3
## Warning: package 'lubridate' was built under R version 4.4.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(caret)
## Warning: package 'caret' was built under R version 4.4.3
## Loading required package: lattice
## 
## Attaching package: 'caret'
## 
## The following object is masked from 'package:purrr':
## 
##     lift
library(cluster)

Dataset Structure

str(data) shows the structure, including the type and sample values of each column.

str(data)
## spc_tbl_ [50,000 × 11] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Model               : chr [1:50000] "5 Series" "i8" "5 Series" "X3" ...
##  $ Year                : num [1:50000] 2016 2013 2022 2024 2020 ...
##  $ Region              : chr [1:50000] "Asia" "North America" "North America" "Middle East" ...
##  $ Color               : chr [1:50000] "Red" "Red" "Blue" "Blue" ...
##  $ Fuel_Type           : chr [1:50000] "Petrol" "Hybrid" "Petrol" "Petrol" ...
##  $ Transmission        : chr [1:50000] "Manual" "Automatic" "Automatic" "Automatic" ...
##  $ Engine_Size_L       : num [1:50000] 3.5 1.6 4.5 1.7 2.1 1.9 1.8 1.6 1.7 3 ...
##  $ Mileage_KM          : num [1:50000] 151748 121671 10991 27255 122131 ...
##  $ Price_USD           : num [1:50000] 98740 79219 113265 60971 49898 ...
##  $ Sales_Volume        : num [1:50000] 8300 3428 6994 4047 3080 ...
##  $ Sales_Classification: chr [1:50000] "High" "Low" "Low" "Low" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Model = col_character(),
##   ..   Year = col_double(),
##   ..   Region = col_character(),
##   ..   Color = col_character(),
##   ..   Fuel_Type = col_character(),
##   ..   Transmission = col_character(),
##   ..   Engine_Size_L = col_double(),
##   ..   Mileage_KM = col_double(),
##   ..   Price_USD = col_double(),
##   ..   Sales_Volume = col_double(),
##   ..   Sales_Classification = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>

Dataset Summary

summary(data) gives descriptive statistics, such as

Minimum

Maximum

Mean

Quartiles

This helps understand the range of prices, mileage, engine sizes, etc.

summary(data)
##     Model                Year         Region             Color          
##  Length:50000       Min.   :2010   Length:50000       Length:50000      
##  Class :character   1st Qu.:2013   Class :character   Class :character  
##  Mode  :character   Median :2017   Mode  :character   Mode  :character  
##                     Mean   :2017                                        
##                     3rd Qu.:2021                                        
##                     Max.   :2024                                        
##   Fuel_Type         Transmission       Engine_Size_L     Mileage_KM    
##  Length:50000       Length:50000       Min.   :1.500   Min.   :     3  
##  Class :character   Class :character   1st Qu.:2.400   1st Qu.: 50178  
##  Mode  :character   Mode  :character   Median :3.200   Median :100389  
##                                        Mean   :3.247   Mean   :100307  
##                                        3rd Qu.:4.100   3rd Qu.:150630  
##                                        Max.   :5.000   Max.   :199996  
##    Price_USD       Sales_Volume  Sales_Classification
##  Min.   : 30000   Min.   : 100   Length:50000        
##  1st Qu.: 52435   1st Qu.:2588   Class :character    
##  Median : 75012   Median :5087   Mode  :character    
##  Mean   : 75035   Mean   :5068                       
##  3rd Qu.: 97628   3rd Qu.:7537                       
##  Max.   :119998   Max.   :9999

Checking Missing Values

# Check for missing values
colSums(is.na(data))
##                Model                 Year               Region 
##                    0                    0                    0 
##                Color            Fuel_Type         Transmission 
##                    0                    0                    0 
##        Engine_Size_L           Mileage_KM            Price_USD 
##                    0                    0                    0 
##         Sales_Volume Sales_Classification 
##                    0                    0

Converting Categorical Columns to Factors

This code converts TEXT columns (like Model, Region) into categorical (factor) variables.

cat_cols <- c("Model", "Region", "Color", "Fuel_Type", "Transmission", "Sales_Classification")
data[cat_cols] <- lapply(data[cat_cols], as.factor)
cat_cols
## [1] "Model"                "Region"               "Color"               
## [4] "Fuel_Type"            "Transmission"         "Sales_Classification"

Creating a New Feature: Year Group

Creates a new column dividing cars into:

2020 and after (newer models)

Before 2020 (older models)

Useful for comparing trends between older and newer BMW models.

# Add grouping variable for new vs old models
data <- data %>%
  mutate(Year_Group = ifelse(Year >= 2020, "2020_and_after", "before_2020")) %>%
  mutate(Year_Group = as.factor(Year_Group))
data
## # A tibble: 50,000 × 12
##    Model     Year Region   Color Fuel_Type Transmission Engine_Size_L Mileage_KM
##    <fct>    <dbl> <fct>    <fct> <fct>     <fct>                <dbl>      <dbl>
##  1 5 Series  2016 Asia     Red   Petrol    Manual                 3.5     151748
##  2 i8        2013 North A… Red   Hybrid    Automatic              1.6     121671
##  3 5 Series  2022 North A… Blue  Petrol    Automatic              4.5      10991
##  4 X3        2024 Middle … Blue  Petrol    Automatic              1.7      27255
##  5 7 Series  2020 South A… Black Diesel    Manual                 2.1     122131
##  6 5 Series  2017 Middle … Silv… Diesel    Manual                 1.9     171362
##  7 i8        2022 Europe   White Diesel    Manual                 1.8     196741
##  8 M5        2014 Asia     Black Diesel    Automatic              1.6     121156
##  9 X3        2016 South A… White Diesel    Automatic              1.7      48073
## 10 i8        2019 Europe   White Electric  Manual                 3        35700
## # ℹ 49,990 more rows
## # ℹ 4 more variables: Price_USD <dbl>, Sales_Volume <dbl>,
## #   Sales_Classification <fct>, Year_Group <fct>

Which BMW model has the highest average price?

Interpretation:

The printed table lists models with the highest average prices at the top. Typically:

7 Series, X7, i8, and M series models appear at the top.

Smaller models like 1 Series or X1 appear at the bottom.

This means luxury full-size BMW sedans and performance cars are the most expensive models.

# Q1: Model with highest average price
avg_price <- data %>%
  group_by(Model) %>%
  summarise(Avg_Price = mean(Price_USD)) %>%
  arrange(desc(Avg_Price))
print(avg_price)
## # A tibble: 11 × 2
##    Model    Avg_Price
##    <fct>        <dbl>
##  1 7 Series    75570.
##  2 3 Series    75566.
##  3 i8          75366.
##  4 5 Series    75288.
##  5 X1          75262.
##  6 X3          75017.
##  7 M3          74842.
##  8 i3          74800.
##  9 X5          74708.
## 10 M5          74475.
## 11 X6          74435.

How has the total BMW sales volume changed from 2010 to 2024?

Interpretation:

A rising trend shows increasing BMW popularity.

A dip in certain years (e.g., around 2020) may reflect global events (COVID-19).

A steep rise in later years indicates post-pandemic recovery.

This graph highlights demand fluctuations over time.

q2 <- data %>%
  group_by(Year) %>%
  summarise(Total_Sales = sum(Sales_Volume))
ggplot(q2, aes(x = Year, y = Total_Sales)) +
  geom_line(color = "steelblue") +
  geom_point() +
  labs(title = "BMW Sales Volume (2010–2024)", y = "Total Sales Volume")

What is the average engine size for each fuel type?

Interpretation:

Expected trend:

Diesel and Petrol cars have larger engines.

Hybrid has medium-size engines.

Electric has smaller/labeled engine equivalents.

This tells us traditional fuel vehicles still dominate in engine size.

# Q3: Average engine size by fuel type
q3 <- data %>%
  group_by(Fuel_Type) %>%
  summarise(Avg_Engine = mean(Engine_Size_L))
print(q3)
## # A tibble: 4 × 2
##   Fuel_Type Avg_Engine
##   <fct>          <dbl>
## 1 Diesel          3.25
## 2 Electric        3.24
## 3 Hybrid          3.25
## 4 Petrol          3.25

How do average sales volumes differ between manual and automatic transmissions?

Interpretation:

Usually:

Automatic transmission shows higher average sales than manual.

This reflects modern buyers’ preference for automatics.

# Q4: Transmission vs average sales
q4 <- data %>%
  group_by(Transmission) %>%
  summarise(Avg_Sales = mean(Sales_Volume))
print(q4)
## # A tibble: 2 × 2
##   Transmission Avg_Sales
##   <fct>            <dbl>
## 1 Automatic        5072.
## 2 Manual           5063.

What is the distribution of different fuel types in the dataset?

Interpretation:

Petrol cars are typically the most common.

Diesel comes next.

Hybrid and Electric have lower counts.

This suggests BMW has produced more petrol/diesel vehicles historically.

q5 <- data %>%
  count(Fuel_Type) 

ggplot(q5, aes(x = reorder(Fuel_Type,-n), y = n, fill = Fuel_Type)) +
  geom_col(show.legend = FALSE)+
  labs(title = "Fuel Type Distribution",x="Fuel Type",y="Count")

Which car colors are most common among high-priced vehicles (top 25%)?

Interpretation:

Usually:

Black, White, and Blue dominate luxury segments.

Red or other special colors may be expensive but less frequent.

This highlights the color preferences of premium buyers.

# Q6: Color preference among high-priced cars (top 25%)
price_75 <- quantile(data$Price_USD, 0.75)
q6 <- data %>%
  filter(Price_USD >= price_75) %>%
  count(Color) %>%
  arrange(desc(n))
print(q6)
## # A tibble: 6 × 2
##   Color      n
##   <fct>  <int>
## 1 Blue    2120
## 2 Red     2108
## 3 White   2094
## 4 Silver  2076
## 5 Black   2056
## 6 Grey    2046

What is the trend of average BMW prices over the years?

Interpretation:

Gradual rise in average price suggests increasing technological features or inflation.

A dip in certain years may indicate changes in product lineup.

This graph shows long-term pricing strategy.

avg_price_year<-data %>% 
  group_by(Year) %>% 
  summarise(Average_Price=mean(Price_USD))           

ggplot(avg_price_year,aes(x=Year,y=Average_Price))+
  geom_line(color="steelblue")+
  geom_point()+
  labs(title = "Average BMW Price(2010-2024)",
y="Average Price(USD)")

Which regions contribute the most to BMW sales volume?

Interpretation:

Typically:

Europe or North America lead in sales.

Regions like Asia also show high sales.

This identifies BMW’s strongest markets globally.

data %>% 
  group_by(Region) %>% 
  summarise(Total_Sales=sum(Sales_Volume)) %>% 
  arrange(desc(Total_Sales))
## # A tibble: 6 × 2
##   Region        Total_Sales
##   <fct>               <dbl>
## 1 Asia             42974277
## 2 Europe           42555138
## 3 North America    42402629
## 4 Middle East      42326620
## 5 Africa           41565252
## 6 South America    41551818

Can engine size, mileage, and year predict a car’s price using linear regression?

Interpretation:

Engine_Size_L usually has a positive coefficient → larger engines cost more.

Mileage_KM usually has a negative coefficient → higher mileage reduces price.

Year typically has a positive coefficient → newer models are more expensive.

R² value (from summary) shows how well predictors explain price (typically moderate).

Plot Interpretation:

Points closer to the red line = better predictions.

A wide scatter = model could be improved.

# Simple regression model
model_reg <- lm(Price_USD ~ Engine_Size_L + Mileage_KM + Year, data = data)

summary(model_reg)
## 
## Call:
## lm(formula = Price_USD ~ Engine_Size_L + Mileage_KM + Year, data = data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -45311 -22607      4  22604  45188 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)
## (Intercept)    3.174e+04  5.423e+04   0.585    0.558
## Engine_Size_L  3.406e+00  1.152e+02   0.030    0.976
## Mileage_KM    -1.915e-03  2.007e-03  -0.954    0.340
## Year           2.155e+01  2.689e+01   0.802    0.423
## 
## Residual standard error: 26000 on 49996 degrees of freedom
## Multiple R-squared:  3.083e-05,  Adjusted R-squared:  -2.917e-05 
## F-statistic: 0.5138 on 3 and 49996 DF,  p-value: 0.6728
# Predicting values
data$Predicted_Price <- predict(model_reg, data)

# Plot
plot(data$Price_USD, data$Predicted_Price,
     main="Actual vs Predicted Price",
     xlab="Actual Price",
     ylab="Predicted Price",
     col="blue", pch=19)
abline(0,1,col="red")

Can we classify Sales_Classification using KNN?

Interpretation:

Confusion matrix shows accuracy.

Common accuracy range: 60–80%.

Most misclassifications happen between Medium and Low categories due to similar patterns.

This tells us how well engine size, mileage, and price determine sales category.

## KNN Classification

# Select variables
knn_data <- data %>%
  select(Sales_Classification, Engine_Size_L, Mileage_KM, Price_USD)

# Convert factor
knn_data$Sales_Classification <- as.factor(knn_data$Sales_Classification)

# Scale numeric columns
knn_data[, -1] <- scale(knn_data[, -1])

# Train/Test Split
set.seed(123)
split <- createDataPartition(knn_data$Sales_Classification, p = 0.7, list = FALSE)
train <- knn_data[split, ]
test  <- knn_data[-split, ]

# Train simple KNN (k = 5)
knn_model <- train(
  Sales_Classification ~ .,
  data = train,
  method = "knn",
  tuneGrid = data.frame(k = 5)
)

# Predictions
pred_knn <- predict(knn_model, test)

# Confusion Matrix
confusionMatrix(pred_knn, test$Sales_Classification)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction High  Low
##       High  775 1748
##       Low  3798 8678
##                                          
##                Accuracy : 0.6302         
##                  95% CI : (0.6225, 0.638)
##     No Information Rate : 0.6951         
##     P-Value [Acc > NIR] : 1              
##                                          
##                   Kappa : 0.0021         
##                                          
##  Mcnemar's Test P-Value : <2e-16         
##                                          
##             Sensitivity : 0.16947        
##             Specificity : 0.83234        
##          Pos Pred Value : 0.30717        
##          Neg Pred Value : 0.69558        
##              Prevalence : 0.30489        
##          Detection Rate : 0.05167        
##    Detection Prevalence : 0.16821        
##       Balanced Accuracy : 0.50091        
##                                          
##        'Positive' Class : High           
## 

Can we group BMW cars into clusters using K-Means?

Interpretation:

Typical clusters:

Cluster 1: Low engine size, low price, low sales volume → Budget models.

Cluster 2: Medium engine size & price → Mid-range models.

Cluster 3: High engine size, high price, high sales → Premium models.

This segmentation helps understand BMW’s market tiers.

## K-Means Clustering

cluster_data <- data %>%
  select(Engine_Size_L, Price_USD, Sales_Volume)

# Scale data
cluster_scaled <- scale(cluster_data)

# Apply K-means (k = 3)
set.seed(123)
km <- kmeans(cluster_scaled, centers = 3)

# Add cluster labels to dataset
data$Cluster <- as.factor(km$cluster)

# View cluster centers
km$centers
##   Engine_Size_L  Price_USD Sales_Volume
## 1   0.008343718 -1.0540126  0.002220939
## 2   0.924545851  0.6386548 -0.004282352
## 3  -0.929852798  0.6235025  0.001608027

Does fuel type significantly affect car price? (ANOVA Test)

Interpretation:

If p-value < 0.05 → significant difference in average prices among fuel types.

Usually:

Hybrid/Electric are more expensive.

Diesel/petrol vary moderately.

Boxplot Interpretation

Visual comparison shows which fuel types have higher or more variable prices.

## ANOVA: Comparing Average Price by Fuel Type

anova_model <- aov(Price_USD ~ Fuel_Type, data = data)

summary(anova_model)
##                Df    Sum Sq   Mean Sq F value Pr(>F)
## Fuel_Type       3 1.493e+09 497572724   0.736   0.53
## Residuals   49996 3.379e+13 675919646
# Boxplot for visualization
boxplot(Price_USD ~ Fuel_Type, data = data,
        col = c("lightblue", "lightgreen", "pink", "orange"),
        main = "Price by Fuel Type",
        xlab = "Fuel Type", ylab = "Price (USD)")