Preliminary

K-means clustering is a popular unsupervised machine learning algorithm used for partitioning a dataset into a set of k clusters. It is widely used in various fields such as data mining, image processing, and pattern recognition.

Library Used

# Load required libraries
library(factoextra)

## Loading required package: ggplot2

## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa

library(data.table)
library(flexclust)

## Loading required package: grid

## Loading required package: lattice

## Loading required package: modeltools

## Loading required package: stats4

library(FeatureImpCluster)
library(cluster)

Data Used

vehicles <- read.csv("A CORDONNER/autos (1).csv", header = TRUE)


head(vehicles)

##           dateCrawled
## 1 2016-03-26 17:47:46
## 2 2016-04-04 13:38:56
## 3 2016-03-26 18:57:24
## 4 2016-03-12 16:58:10
## 5 2016-04-01 14:38:50
## 6 2016-03-21 13:47:45
##                                                                   name seller
## 1                                     Peugeot_807_160_NAVTECH_ON_BOARD privat
## 2                           BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik privat
## 3                                           Volkswagen_Golf_1.6_United privat
## 4                  Smart_smart_fortwo_coupe_softouch/F1/Klima/Panorama privat
## 5   Ford_Focus_1_6_Benzin_T\xdcV_neu_ist_sehr_gepflegt.mit_Klimaanlage privat
## 6 Chrysler_Grand_Voyager_2.8_CRD_Aut.Limited_Stow\xb4n_Go_Sitze_7Sitze privat
##   offerType  price  abtest vehicleType yearOfRegistration   gearbox powerPS
## 1   Angebot $5,000 control         bus               2004   manuell     158
## 2   Angebot $8,500 control   limousine               1997 automatik     286
## 3   Angebot $8,990    test   limousine               2009   manuell     102
## 4   Angebot $4,350 control  kleinwagen               2007 automatik      71
## 5   Angebot $1,350    test       kombi               2003   manuell       0
## 6   Angebot $7,900    test         bus               2006 automatik     150
##     model  odometer monthOfRegistration fuelType      brand notRepairedDamage
## 1  andere 150,000km                   3      lpg    peugeot              nein
## 2     7er 150,000km                   6   benzin        bmw              nein
## 3    golf  70,000km                   7   benzin volkswagen              nein
## 4  fortwo  70,000km                   6   benzin      smart              nein
## 5   focus 150,000km                   7   benzin       ford              nein
## 6 voyager 150,000km                   4   diesel   chrysler                  
##           dateCreated nrOfPictures postalCode            lastSeen
## 1 2016-03-26 00:00:00            0      79588 2016-04-06 06:45:54
## 2 2016-04-04 00:00:00            0      71034 2016-04-06 14:45:08
## 3 2016-03-26 00:00:00            0      35394 2016-04-06 20:15:37
## 4 2016-03-12 00:00:00            0      33729 2016-03-15 03:16:28
## 5 2016-04-01 00:00:00            0      39218 2016-04-01 14:38:50
## 6 2016-03-21 00:00:00            0      22962 2016-04-06 09:45:21

Checking variable type

str(vehicles)

## 'data.frame':    50000 obs. of  20 variables:
##  $ dateCrawled        : chr  "2016-03-26 17:47:46" "2016-04-04 13:38:56" "2016-03-26 18:57:24" "2016-03-12 16:58:10" ...
##  $ name               : chr  "Peugeot_807_160_NAVTECH_ON_BOARD" "BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik" "Volkswagen_Golf_1.6_United" "Smart_smart_fortwo_coupe_softouch/F1/Klima/Panorama" ...
##  $ seller             : chr  "privat" "privat" "privat" "privat" ...
##  $ offerType          : chr  "Angebot" "Angebot" "Angebot" "Angebot" ...
##  $ price              : chr  "$5,000" "$8,500" "$8,990" "$4,350" ...
##  $ abtest             : chr  "control" "control" "test" "control" ...
##  $ vehicleType        : chr  "bus" "limousine" "limousine" "kleinwagen" ...
##  $ yearOfRegistration : int  2004 1997 2009 2007 2003 2006 1995 1998 2000 1997 ...
##  $ gearbox            : chr  "manuell" "automatik" "manuell" "automatik" ...
##  $ powerPS            : int  158 286 102 71 0 150 90 90 0 90 ...
##  $ model              : chr  "andere" "7er" "golf" "fortwo" ...
##  $ odometer           : chr  "150,000km" "150,000km" "70,000km" "70,000km" ...
##  $ monthOfRegistration: int  3 6 7 6 7 4 8 12 10 7 ...
##  $ fuelType           : chr  "lpg" "benzin" "benzin" "benzin" ...
##  $ brand              : chr  "peugeot" "bmw" "volkswagen" "smart" ...
##  $ notRepairedDamage  : chr  "nein" "nein" "nein" "nein" ...
##  $ dateCreated        : chr  "2016-03-26 00:00:00" "2016-04-04 00:00:00" "2016-03-26 00:00:00" "2016-03-12 00:00:00" ...
##  $ nrOfPictures       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ postalCode         : int  79588 71034 35394 33729 39218 22962 31535 53474 7426 15749 ...
##  $ lastSeen           : chr  "2016-04-06 06:45:54" "2016-04-06 14:45:08" "2016-04-06 20:15:37" "2016-03-15 03:16:28" ...

Converting price and odometer to numeric

vehicles$price <- as.numeric(gsub("[^0-9.]", "", vehicles$price))
vehicles$odometer <- as.numeric(gsub("[^0-9.]", "", vehicles$odometer))

Converting date columns to Date format and checking for missing values

date_columns <- c("dateCrawled", "dateCreated", "lastSeen")
vehicles[date_columns] <- lapply(vehicles[date_columns], as.Date)


missing_values <- colSums(is.na(vehicles))
print(missing_values[missing_values > 0])

## named numeric(0)

Removing Duplicate Rows

In preprocessing, eliminating duplicate rows ensures data cleanliness and prevents skewing of results. This step enhances the accuracy and reliability of clustering outcomes by avoiding redundant information.

Defining Thresholds for Outliers: Setting thresholds for outlier detection aids in identifying and potentially removing data points that deviate significantly from the norm. This process enhances the robustness of the clustering model by mitigating the influence of outliers on cluster formation and interpretation.

vehicles <- vehicles[!duplicated(vehicles), ]

price_threshold <- quantile(vehicles$price, 0.99, na.rm = TRUE)
odometer_threshold <- quantile(vehicles$odometer, 0.99, na.rm = TRUE)
powerPS_threshold <- quantile(vehicles$powerPS, 0.99, na.rm = TRUE)


vehicles <- vehicles[vehicles$price <= price_threshold &
                       vehicles$odometer <= odometer_threshold &
                       vehicles$powerPS <= powerPS_threshold, ]

Scale Numeric Variables

Standardizing or normalizing numeric variables before clustering ensures that each feature contributes equally to the distance calculations, preventing biases towards variables with larger scales. Scaling enhances the effectiveness of the clustering algorithm by ensuring that all features have comparable importance in the clustering process, leading to more accurate and meaningful cluster assignments.

vehicles_scaled <- as.data.frame(scale(vehicles[, c("price", "odometer", "powerPS")]))


summary(vehicles)

##   dateCrawled             name              seller           offerType        
##  Min.   :2016-03-05   Length:49155       Length:49155       Length:49155      
##  1st Qu.:2016-03-13   Class :character   Class :character   Class :character  
##  Median :2016-03-21   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :2016-03-20                                                           
##  3rd Qu.:2016-03-29                                                           
##  Max.   :2016-04-07                                                           
##      price          abtest          vehicleType        yearOfRegistration
##  Min.   :    0   Length:49155       Length:49155       Min.   :1000      
##  1st Qu.: 1100   Class :character   Class :character   1st Qu.:1999      
##  Median : 2850   Mode  :character   Mode  :character   Median :2003      
##  Mean   : 5124                                         Mean   :2005      
##  3rd Qu.: 6950                                         3rd Qu.:2008      
##  Max.   :35900                                         Max.   :9999      
##    gearbox             powerPS         model              odometer     
##  Length:49155       Min.   :  0.0   Length:49155       Min.   :  5000  
##  Class :character   1st Qu.: 69.0   Class :character   1st Qu.:125000  
##  Mode  :character   Median :105.0   Mode  :character   Median :150000  
##                     Mean   :108.7                      Mean   :126451  
##                     3rd Qu.:144.0                      3rd Qu.:150000  
##                     Max.   :344.0                      Max.   :150000  
##  monthOfRegistration   fuelType            brand           notRepairedDamage 
##  Min.   : 0.000      Length:49155       Length:49155       Length:49155      
##  1st Qu.: 3.000      Class :character   Class :character   Class :character  
##  Median : 6.000      Mode  :character   Mode  :character   Mode  :character  
##  Mean   : 5.719                                                              
##  3rd Qu.: 9.000                                                              
##  Max.   :12.000                                                              
##   dateCreated          nrOfPictures   postalCode       lastSeen         
##  Min.   :2015-08-10   Min.   :0     Min.   : 1067   Min.   :2016-03-05  
##  1st Qu.:2016-03-13   1st Qu.:0     1st Qu.:30419   1st Qu.:2016-03-23  
##  Median :2016-03-21   Median :0     Median :49504   Median :2016-04-03  
##  Mean   :2016-03-20   Mean   :0     Mean   :50754   Mean   :2016-03-29  
##  3rd Qu.:2016-03-29   3rd Qu.:0     3rd Qu.:71384   3rd Qu.:2016-04-06  
##  Max.   :2016-04-07   Max.   :0     Max.   :99998   Max.   :2016-04-07

Visualizing the relationship between price and odometer readings

ggplot(vehicles, aes(x = price, y = odometer)) +
  geom_point(color = "skyblue3") +  
  labs(x = "Price", y = "Odometer")

The scatter plot shows the relationship between vehicle price and odometer readings. There seems to be a general negative correlation between price and odometer readings, indicating that vehicles with higher odometer readings tend to have lower prices. This is an intuitive finding in the automotive market where mileage is often considered when determining vehicle value.

histogram

hist(vehicles$price, col = "orange")

The histogram displays the distribution of vehicle prices. The distribution appears to be right-skewed, with most vehicles priced at lower values and fewer vehicles at higher price ranges. This is typical in markets where there’s a wide range of vehicle prices, with the majority falling within a certain range.

plot(vehicles$odometer, vehicles$price, col = "green3")

This scatter plot depicts the relationship between odometer readings and vehicle prices. It reaffirms the negative correlation observed in the first plot, showing that as odometer readings increase, vehicle prices generally decrease. This relationship is essential for understanding the factors influencing vehicle pricing in the dataset.

Performing k-means clustering

km <- kcca(vehicles_scaled, k = 3)

plot(km)

The plot illustrates the clustering results obtained from applying k-means clustering to the scaled numeric variables. It visually represents how the algorithm has partitioned the data into distinct clusters based on similarities in price, odometer readings, and powerPS. The clustering seems to delineate different segments of vehicles based on their numerical attributes, providing a basis for further analysis.

cluster_assignments <- predict(km)


plot(vehicles_scaled[, c("price", "odometer")], col = cluster_assignments)

FeatureImp_km <- FeatureImpCluster(km, as.data.table(vehicles_scaled))

plot(FeatureImp_km)

This plot visualizes the clustered data points based on their assignments to different clusters. It helps in understanding how well the clustering algorithm has separated the data points into distinct groups. Each color represents a different cluster, showcasing the boundaries between clusters and the distribution of data points within each cluster.

Sampling a subset of the data and perfoming k means clustering on sampled data

sampled_indices <- sample(nrow(vehicles_scaled), size = 1000)  
sampled_data <- vehicles_scaled[sampled_indices, ]


km_sampled <- kcca(sampled_data, k = 3)
cluster_assignments_sampled <- predict(km_sampled)

Calculating silhouette scores on the sampled data

silhouette <- silhouette(cluster_assignments_sampled, dist(sampled_data))
mean_silhouette <- mean(silhouette[, "sil_width"])

Printing the mean silhouette score

print(paste("Mean Silhouette Score on Sampled Data:", mean_silhouette))

## [1] "Mean Silhouette Score on Sampled Data: 0.435023154757361"

plot(mean_silhouette)

The plot shows the mean silhouette score calculated on sampled data. The silhouette score reflects the cohesion and separation of clusters, with higher scores indicating better-defined clusters. A mean silhouette score of 0.37 suggests that the clusters formed are reasonably well-separated and internally cohesive, indicating the effectiveness of the clustering approach.

Conclusion

The analysis of the vehicles dataset employing k-means clustering revealed valuable insights into the underlying patterns and structure of the data. By preprocessing, scaling numeric variables, and employing appropriate visualization techniques, we successfully identified clusters and their distinguishing features. The clustering results, supported by silhouette scores, demonstrated the effectiveness of the approach in partitioning the data into meaningful groups. This analysis not only enhances our understanding of the dataset but also provides a foundation for further exploration and targeted decision-making in relevant domains such as automotive market analysis and customer segmentation.

Analysis of Vehicles Dataset

Alain Mulumba Kalonji

2024-02-20