This project analyzes a dataset containing 50,000 BMW car sales records from the years 2010 to 2024. The goal is to:
Preprocess and understand the dataset
Perform descriptive statistical analysis
Visualize sales trends and behavioral patterns
Build predictive models using
Linear Regression (predicting Price)
K-Nearest Neighbors (KNN) (predicting Sales Classification)
This project was written to follow the structure of the RPubs reference project.
readr → Reads CSV files efficiently
dplyr → Used for data manipulation (filter, group_by, mutate)
ggplot2 → Used for creating visualizations
tidyr → Helps in reshaping messy data
caret → Builds machine learning models (regression, KNN)
cluster → (Optional) used for clustering tasks
library(readr)
data <- read_csv("C:/Users/Sakshi/Downloads/BMW sales data (2010-2024) (1) (1).csv")
## Rows: 50000 Columns: 11
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (6): Model, Region, Color, Fuel_Type, Transmission, Sales_Classification
## dbl (5): Year, Engine_Size_L, Mileage_KM, Price_USD, Sales_Volume
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(data)
## # A tibble: 6 × 11
## Model Year Region Color Fuel_Type Transmission Engine_Size_L Mileage_KM
## <chr> <dbl> <chr> <chr> <chr> <chr> <dbl> <dbl>
## 1 5 Series 2016 Asia Red Petrol Manual 3.5 151748
## 2 i8 2013 North Am… Red Hybrid Automatic 1.6 121671
## 3 5 Series 2022 North Am… Blue Petrol Automatic 4.5 10991
## 4 X3 2024 Middle E… Blue Petrol Automatic 1.7 27255
## 5 7 Series 2020 South Am… Black Diesel Manual 2.1 122131
## 6 5 Series 2017 Middle E… Silv… Diesel Manual 1.9 171362
## # ℹ 3 more variables: Price_USD <dbl>, Sales_Volume <dbl>,
## # Sales_Classification <chr>
Reads the BMW sales CSV file from your local computer.
Automatically detects column types (character/numeric).
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.4.3
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.4.3
## Warning: package 'tidyr' was built under R version 4.4.3
## Warning: package 'lubridate' was built under R version 4.4.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(caret)
## Warning: package 'caret' was built under R version 4.4.3
## Loading required package: lattice
##
## Attaching package: 'caret'
##
## The following object is masked from 'package:purrr':
##
## lift
library(cluster)
str(data) shows the structure, including the type and sample values of each column.
str(data)
## spc_tbl_ [50,000 × 11] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Model : chr [1:50000] "5 Series" "i8" "5 Series" "X3" ...
## $ Year : num [1:50000] 2016 2013 2022 2024 2020 ...
## $ Region : chr [1:50000] "Asia" "North America" "North America" "Middle East" ...
## $ Color : chr [1:50000] "Red" "Red" "Blue" "Blue" ...
## $ Fuel_Type : chr [1:50000] "Petrol" "Hybrid" "Petrol" "Petrol" ...
## $ Transmission : chr [1:50000] "Manual" "Automatic" "Automatic" "Automatic" ...
## $ Engine_Size_L : num [1:50000] 3.5 1.6 4.5 1.7 2.1 1.9 1.8 1.6 1.7 3 ...
## $ Mileage_KM : num [1:50000] 151748 121671 10991 27255 122131 ...
## $ Price_USD : num [1:50000] 98740 79219 113265 60971 49898 ...
## $ Sales_Volume : num [1:50000] 8300 3428 6994 4047 3080 ...
## $ Sales_Classification: chr [1:50000] "High" "Low" "Low" "Low" ...
## - attr(*, "spec")=
## .. cols(
## .. Model = col_character(),
## .. Year = col_double(),
## .. Region = col_character(),
## .. Color = col_character(),
## .. Fuel_Type = col_character(),
## .. Transmission = col_character(),
## .. Engine_Size_L = col_double(),
## .. Mileage_KM = col_double(),
## .. Price_USD = col_double(),
## .. Sales_Volume = col_double(),
## .. Sales_Classification = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
summary(data) gives descriptive statistics, such as
Minimum
Maximum
Mean
Quartiles
This helps understand the range of prices, mileage, engine sizes, etc.
summary(data)
## Model Year Region Color
## Length:50000 Min. :2010 Length:50000 Length:50000
## Class :character 1st Qu.:2013 Class :character Class :character
## Mode :character Median :2017 Mode :character Mode :character
## Mean :2017
## 3rd Qu.:2021
## Max. :2024
## Fuel_Type Transmission Engine_Size_L Mileage_KM
## Length:50000 Length:50000 Min. :1.500 Min. : 3
## Class :character Class :character 1st Qu.:2.400 1st Qu.: 50178
## Mode :character Mode :character Median :3.200 Median :100389
## Mean :3.247 Mean :100307
## 3rd Qu.:4.100 3rd Qu.:150630
## Max. :5.000 Max. :199996
## Price_USD Sales_Volume Sales_Classification
## Min. : 30000 Min. : 100 Length:50000
## 1st Qu.: 52435 1st Qu.:2588 Class :character
## Median : 75012 Median :5087 Mode :character
## Mean : 75035 Mean :5068
## 3rd Qu.: 97628 3rd Qu.:7537
## Max. :119998 Max. :9999
# Check for missing values
colSums(is.na(data))
## Model Year Region
## 0 0 0
## Color Fuel_Type Transmission
## 0 0 0
## Engine_Size_L Mileage_KM Price_USD
## 0 0 0
## Sales_Volume Sales_Classification
## 0 0
This code converts TEXT columns (like Model, Region) into categorical (factor) variables.
cat_cols <- c("Model", "Region", "Color", "Fuel_Type", "Transmission", "Sales_Classification")
data[cat_cols] <- lapply(data[cat_cols], as.factor)
cat_cols
## [1] "Model" "Region" "Color"
## [4] "Fuel_Type" "Transmission" "Sales_Classification"
Creates a new column dividing cars into:
2020 and after (newer models)
Before 2020 (older models)
Useful for comparing trends between older and newer BMW models.
# Add grouping variable for new vs old models
data <- data %>%
mutate(Year_Group = ifelse(Year >= 2020, "2020_and_after", "before_2020")) %>%
mutate(Year_Group = as.factor(Year_Group))
data
## # A tibble: 50,000 × 12
## Model Year Region Color Fuel_Type Transmission Engine_Size_L Mileage_KM
## <fct> <dbl> <fct> <fct> <fct> <fct> <dbl> <dbl>
## 1 5 Series 2016 Asia Red Petrol Manual 3.5 151748
## 2 i8 2013 North A… Red Hybrid Automatic 1.6 121671
## 3 5 Series 2022 North A… Blue Petrol Automatic 4.5 10991
## 4 X3 2024 Middle … Blue Petrol Automatic 1.7 27255
## 5 7 Series 2020 South A… Black Diesel Manual 2.1 122131
## 6 5 Series 2017 Middle … Silv… Diesel Manual 1.9 171362
## 7 i8 2022 Europe White Diesel Manual 1.8 196741
## 8 M5 2014 Asia Black Diesel Automatic 1.6 121156
## 9 X3 2016 South A… White Diesel Automatic 1.7 48073
## 10 i8 2019 Europe White Electric Manual 3 35700
## # ℹ 49,990 more rows
## # ℹ 4 more variables: Price_USD <dbl>, Sales_Volume <dbl>,
## # Sales_Classification <fct>, Year_Group <fct>
Interpretation:
The printed table lists models with the highest average prices at the top. Typically:
7 Series, X7, i8, and M series models appear at the top.
Smaller models like 1 Series or X1 appear at the bottom.
This means luxury full-size BMW sedans and performance cars are the most expensive models.
# Q1: Model with highest average price
avg_price <- data %>%
group_by(Model) %>%
summarise(Avg_Price = mean(Price_USD)) %>%
arrange(desc(Avg_Price))
print(avg_price)
## # A tibble: 11 × 2
## Model Avg_Price
## <fct> <dbl>
## 1 7 Series 75570.
## 2 3 Series 75566.
## 3 i8 75366.
## 4 5 Series 75288.
## 5 X1 75262.
## 6 X3 75017.
## 7 M3 74842.
## 8 i3 74800.
## 9 X5 74708.
## 10 M5 74475.
## 11 X6 74435.
Interpretation:
A rising trend shows increasing BMW popularity.
A dip in certain years (e.g., around 2020) may reflect global events (COVID-19).
A steep rise in later years indicates post-pandemic recovery.
This graph highlights demand fluctuations over time.
q2 <- data %>%
group_by(Year) %>%
summarise(Total_Sales = sum(Sales_Volume))
ggplot(q2, aes(x = Year, y = Total_Sales)) +
geom_line(color = "steelblue") +
geom_point() +
labs(title = "BMW Sales Volume (2010–2024)", y = "Total Sales Volume")
Interpretation:
Expected trend:
Diesel and Petrol cars have larger engines.
Hybrid has medium-size engines.
Electric has smaller/labeled engine equivalents.
This tells us traditional fuel vehicles still dominate in engine size.
# Q3: Average engine size by fuel type
q3 <- data %>%
group_by(Fuel_Type) %>%
summarise(Avg_Engine = mean(Engine_Size_L))
print(q3)
## # A tibble: 4 × 2
## Fuel_Type Avg_Engine
## <fct> <dbl>
## 1 Diesel 3.25
## 2 Electric 3.24
## 3 Hybrid 3.25
## 4 Petrol 3.25
Interpretation:
Usually:
Automatic transmission shows higher average sales than manual.
This reflects modern buyers’ preference for automatics.
# Q4: Transmission vs average sales
q4 <- data %>%
group_by(Transmission) %>%
summarise(Avg_Sales = mean(Sales_Volume))
print(q4)
## # A tibble: 2 × 2
## Transmission Avg_Sales
## <fct> <dbl>
## 1 Automatic 5072.
## 2 Manual 5063.
Interpretation:
Petrol cars are typically the most common.
Diesel comes next.
Hybrid and Electric have lower counts.
This suggests BMW has produced more petrol/diesel vehicles historically.
q5 <- data %>%
count(Fuel_Type)
ggplot(q5, aes(x = reorder(Fuel_Type,-n), y = n, fill = Fuel_Type)) +
geom_col(show.legend = FALSE)+
labs(title = "Fuel Type Distribution",x="Fuel Type",y="Count")
Interpretation:
Usually:
Black, White, and Blue dominate luxury segments.
Red or other special colors may be expensive but less frequent.
This highlights the color preferences of premium buyers.
# Q6: Color preference among high-priced cars (top 25%)
price_75 <- quantile(data$Price_USD, 0.75)
q6 <- data %>%
filter(Price_USD >= price_75) %>%
count(Color) %>%
arrange(desc(n))
print(q6)
## # A tibble: 6 × 2
## Color n
## <fct> <int>
## 1 Blue 2120
## 2 Red 2108
## 3 White 2094
## 4 Silver 2076
## 5 Black 2056
## 6 Grey 2046
Interpretation:
Gradual rise in average price suggests increasing technological features or inflation.
A dip in certain years may indicate changes in product lineup.
This graph shows long-term pricing strategy.
avg_price_year<-data %>%
group_by(Year) %>%
summarise(Average_Price=mean(Price_USD))
ggplot(avg_price_year,aes(x=Year,y=Average_Price))+
geom_line(color="steelblue")+
geom_point()+
labs(title = "Average BMW Price(2010-2024)",
y="Average Price(USD)")
Interpretation:
Typically:
Europe or North America lead in sales.
Regions like Asia also show high sales.
This identifies BMW’s strongest markets globally.
data %>%
group_by(Region) %>%
summarise(Total_Sales=sum(Sales_Volume)) %>%
arrange(desc(Total_Sales))
## # A tibble: 6 × 2
## Region Total_Sales
## <fct> <dbl>
## 1 Asia 42974277
## 2 Europe 42555138
## 3 North America 42402629
## 4 Middle East 42326620
## 5 Africa 41565252
## 6 South America 41551818
Interpretation:
Engine_Size_L usually has a positive coefficient → larger engines cost more.
Mileage_KM usually has a negative coefficient → higher mileage reduces price.
Year typically has a positive coefficient → newer models are more expensive.
R² value (from summary) shows how well predictors explain price (typically moderate).
Plot Interpretation:
Points closer to the red line = better predictions.
A wide scatter = model could be improved.
# Simple regression model
model_reg <- lm(Price_USD ~ Engine_Size_L + Mileage_KM + Year, data = data)
summary(model_reg)
##
## Call:
## lm(formula = Price_USD ~ Engine_Size_L + Mileage_KM + Year, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -45311 -22607 4 22604 45188
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.174e+04 5.423e+04 0.585 0.558
## Engine_Size_L 3.406e+00 1.152e+02 0.030 0.976
## Mileage_KM -1.915e-03 2.007e-03 -0.954 0.340
## Year 2.155e+01 2.689e+01 0.802 0.423
##
## Residual standard error: 26000 on 49996 degrees of freedom
## Multiple R-squared: 3.083e-05, Adjusted R-squared: -2.917e-05
## F-statistic: 0.5138 on 3 and 49996 DF, p-value: 0.6728
# Predicting values
data$Predicted_Price <- predict(model_reg, data)
# Plot
plot(data$Price_USD, data$Predicted_Price,
main="Actual vs Predicted Price",
xlab="Actual Price",
ylab="Predicted Price",
col="blue", pch=19)
abline(0,1,col="red")
Interpretation:
Confusion matrix shows accuracy.
Common accuracy range: 60–80%.
Most misclassifications happen between Medium and Low categories due to similar patterns.
This tells us how well engine size, mileage, and price determine sales category.
## KNN Classification
# Select variables
knn_data <- data %>%
select(Sales_Classification, Engine_Size_L, Mileage_KM, Price_USD)
# Convert factor
knn_data$Sales_Classification <- as.factor(knn_data$Sales_Classification)
# Scale numeric columns
knn_data[, -1] <- scale(knn_data[, -1])
# Train/Test Split
set.seed(123)
split <- createDataPartition(knn_data$Sales_Classification, p = 0.7, list = FALSE)
train <- knn_data[split, ]
test <- knn_data[-split, ]
# Train simple KNN (k = 5)
knn_model <- train(
Sales_Classification ~ .,
data = train,
method = "knn",
tuneGrid = data.frame(k = 5)
)
# Predictions
pred_knn <- predict(knn_model, test)
# Confusion Matrix
confusionMatrix(pred_knn, test$Sales_Classification)
## Confusion Matrix and Statistics
##
## Reference
## Prediction High Low
## High 775 1748
## Low 3798 8678
##
## Accuracy : 0.6302
## 95% CI : (0.6225, 0.638)
## No Information Rate : 0.6951
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.0021
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.16947
## Specificity : 0.83234
## Pos Pred Value : 0.30717
## Neg Pred Value : 0.69558
## Prevalence : 0.30489
## Detection Rate : 0.05167
## Detection Prevalence : 0.16821
## Balanced Accuracy : 0.50091
##
## 'Positive' Class : High
##
Interpretation:
Typical clusters:
Cluster 1: Low engine size, low price, low sales volume → Budget models.
Cluster 2: Medium engine size & price → Mid-range models.
Cluster 3: High engine size, high price, high sales → Premium models.
This segmentation helps understand BMW’s market tiers.
## K-Means Clustering
cluster_data <- data %>%
select(Engine_Size_L, Price_USD, Sales_Volume)
# Scale data
cluster_scaled <- scale(cluster_data)
# Apply K-means (k = 3)
set.seed(123)
km <- kmeans(cluster_scaled, centers = 3)
# Add cluster labels to dataset
data$Cluster <- as.factor(km$cluster)
# View cluster centers
km$centers
## Engine_Size_L Price_USD Sales_Volume
## 1 0.008343718 -1.0540126 0.002220939
## 2 0.924545851 0.6386548 -0.004282352
## 3 -0.929852798 0.6235025 0.001608027
If p-value < 0.05 → significant difference in average prices among fuel types.
Usually:
Hybrid/Electric are more expensive.
Diesel/petrol vary moderately.
Boxplot Interpretation
Visual comparison shows which fuel types have higher or more variable prices.
## ANOVA: Comparing Average Price by Fuel Type
anova_model <- aov(Price_USD ~ Fuel_Type, data = data)
summary(anova_model)
## Df Sum Sq Mean Sq F value Pr(>F)
## Fuel_Type 3 1.493e+09 497572724 0.736 0.53
## Residuals 49996 3.379e+13 675919646
# Boxplot for visualization
boxplot(Price_USD ~ Fuel_Type, data = data,
col = c("lightblue", "lightgreen", "pink", "orange"),
main = "Price by Fuel Type",
xlab = "Fuel Type", ylab = "Price (USD)")