K-means clustering is a popular unsupervised machine learning algorithm used for partitioning a dataset into a set of k clusters. It is widely used in various fields such as data mining, image processing, and pattern recognition.
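As a minimal illustration of the idea, the sketch below runs base R's kmeans() on made-up two-dimensional data. The toy matrix and the choice of two clusters are assumptions for illustration only and are not part of the vehicles analysis.

```r
# Toy data (invented for illustration): two well-separated 2-D groups of 50 points.
set.seed(42)
toy <- rbind(
  matrix(rnorm(100, mean = 0, sd = 0.5), ncol = 2),
  matrix(rnorm(100, mean = 5, sd = 0.5), ncol = 2)
)

# kmeans() searches for k centroids that minimise the total
# within-cluster sum of squared distances; nstart = 10 tries
# ten random initialisations and keeps the best one.
fit <- kmeans(toy, centers = 2, nstart = 10)

table(fit$cluster)  # cluster sizes
fit$centers         # one row per centroid
```

Each observation ends up in the cluster whose centroid is closest to it, which is the partitioning described above.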
# Load required libraries
library(factoextra)
## Loading required package: ggplot2
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library(data.table)
library(flexclust)
## Loading required package: grid
## Loading required package: lattice
## Loading required package: modeltools
## Loading required package: stats4
library(FeatureImpCluster)
library(cluster)
vehicles <- read.csv("A CORDONNER/autos (1).csv", header = TRUE)
head(vehicles)
## dateCrawled
## 1 2016-03-26 17:47:46
## 2 2016-04-04 13:38:56
## 3 2016-03-26 18:57:24
## 4 2016-03-12 16:58:10
## 5 2016-04-01 14:38:50
## 6 2016-03-21 13:47:45
## name seller
## 1 Peugeot_807_160_NAVTECH_ON_BOARD privat
## 2 BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik privat
## 3 Volkswagen_Golf_1.6_United privat
## 4 Smart_smart_fortwo_coupe_softouch/F1/Klima/Panorama privat
## 5 Ford_Focus_1_6_Benzin_T\xdcV_neu_ist_sehr_gepflegt.mit_Klimaanlage privat
## 6 Chrysler_Grand_Voyager_2.8_CRD_Aut.Limited_Stow\xb4n_Go_Sitze_7Sitze privat
## offerType price abtest vehicleType yearOfRegistration gearbox powerPS
## 1 Angebot $5,000 control bus 2004 manuell 158
## 2 Angebot $8,500 control limousine 1997 automatik 286
## 3 Angebot $8,990 test limousine 2009 manuell 102
## 4 Angebot $4,350 control kleinwagen 2007 automatik 71
## 5 Angebot $1,350 test kombi 2003 manuell 0
## 6 Angebot $7,900 test bus 2006 automatik 150
## model odometer monthOfRegistration fuelType brand notRepairedDamage
## 1 andere 150,000km 3 lpg peugeot nein
## 2 7er 150,000km 6 benzin bmw nein
## 3 golf 70,000km 7 benzin volkswagen nein
## 4 fortwo 70,000km 6 benzin smart nein
## 5 focus 150,000km 7 benzin ford nein
## 6 voyager 150,000km 4 diesel chrysler
## dateCreated nrOfPictures postalCode lastSeen
## 1 2016-03-26 00:00:00 0 79588 2016-04-06 06:45:54
## 2 2016-04-04 00:00:00 0 71034 2016-04-06 14:45:08
## 3 2016-03-26 00:00:00 0 35394 2016-04-06 20:15:37
## 4 2016-03-12 00:00:00 0 33729 2016-03-15 03:16:28
## 5 2016-04-01 00:00:00 0 39218 2016-04-01 14:38:50
## 6 2016-03-21 00:00:00 0 22962 2016-04-06 09:45:21
str(vehicles)
## 'data.frame': 50000 obs. of 20 variables:
## $ dateCrawled : chr "2016-03-26 17:47:46" "2016-04-04 13:38:56" "2016-03-26 18:57:24" "2016-03-12 16:58:10" ...
## $ name : chr "Peugeot_807_160_NAVTECH_ON_BOARD" "BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik" "Volkswagen_Golf_1.6_United" "Smart_smart_fortwo_coupe_softouch/F1/Klima/Panorama" ...
## $ seller : chr "privat" "privat" "privat" "privat" ...
## $ offerType : chr "Angebot" "Angebot" "Angebot" "Angebot" ...
## $ price : chr "$5,000" "$8,500" "$8,990" "$4,350" ...
## $ abtest : chr "control" "control" "test" "control" ...
## $ vehicleType : chr "bus" "limousine" "limousine" "kleinwagen" ...
## $ yearOfRegistration : int 2004 1997 2009 2007 2003 2006 1995 1998 2000 1997 ...
## $ gearbox : chr "manuell" "automatik" "manuell" "automatik" ...
## $ powerPS : int 158 286 102 71 0 150 90 90 0 90 ...
## $ model : chr "andere" "7er" "golf" "fortwo" ...
## $ odometer : chr "150,000km" "150,000km" "70,000km" "70,000km" ...
## $ monthOfRegistration: int 3 6 7 6 7 4 8 12 10 7 ...
## $ fuelType : chr "lpg" "benzin" "benzin" "benzin" ...
## $ brand : chr "peugeot" "bmw" "volkswagen" "smart" ...
## $ notRepairedDamage : chr "nein" "nein" "nein" "nein" ...
## $ dateCreated : chr "2016-03-26 00:00:00" "2016-04-04 00:00:00" "2016-03-26 00:00:00" "2016-03-12 00:00:00" ...
## $ nrOfPictures : int 0 0 0 0 0 0 0 0 0 0 ...
## $ postalCode : int 79588 71034 35394 33729 39218 22962 31535 53474 7426 15749 ...
## $ lastSeen : chr "2016-04-06 06:45:54" "2016-04-06 14:45:08" "2016-04-06 20:15:37" "2016-03-15 03:16:28" ...
vehicles$price <- as.numeric(gsub("[^0-9.]", "", vehicles$price))
vehicles$odometer <- as.numeric(gsub("[^0-9.]", "", vehicles$odometer))
date_columns <- c("dateCrawled", "dateCreated", "lastSeen")
vehicles[date_columns] <- lapply(vehicles[date_columns], as.Date)
missing_values <- colSums(is.na(vehicles))
print(missing_values[missing_values > 0])
## named numeric(0)
Removing duplicate rows keeps repeated listings from being counted more than once, which would otherwise pull cluster centroids toward the duplicated records.
Defining Thresholds for Outliers: k-means centroids are means, so a handful of extreme values can drag a centroid away from the bulk of the data. Here the 99th percentile of price, odometer, and powerPS serves as an upper cut-off, and rows above any of the three thresholds are dropped. Only the upper tail is trimmed; implausibly low values, such as a price or powerPS of 0, are kept.
vehicles <- vehicles[!duplicated(vehicles), ]
price_threshold <- quantile(vehicles$price, 0.99, na.rm = TRUE)
odometer_threshold <- quantile(vehicles$odometer, 0.99, na.rm = TRUE)
powerPS_threshold <- quantile(vehicles$powerPS, 0.99, na.rm = TRUE)
vehicles <- vehicles[vehicles$price <= price_threshold &
vehicles$odometer <= odometer_threshold &
vehicles$powerPS <= powerPS_threshold, ]
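To see what a 99th-percentile filter of this kind does, one could count the rows each rule removes on its own. The sketch below uses a hypothetical `toy` data frame with invented values. It suggests that a variable whose largest value is shared by many rows may lose nothing to such a filter.

```r
# Hypothetical toy data standing in for the listings table.
set.seed(1)
toy <- data.frame(
  price    = c(rexp(990, rate = 1 / 5000), rep(1e6, 10)),  # ten extreme prices
  odometer = sample(c(50000, 100000, 150000), 1000, replace = TRUE)
)

# 99th-percentile cut-off for each column.
cutoffs <- sapply(toy, quantile, probs = 0.99, na.rm = TRUE, names = FALSE)

# Number of rows each rule would drop if applied alone.
removed <- c(
  price    = sum(toy$price > cutoffs[["price"]]),
  odometer = sum(toy$odometer > cutoffs[["odometer"]])
)
removed
```

In this toy case the price rule drops the ten extreme rows, while the odometer rule drops none because the cut-off equals the largest odometer value.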
Standardizing numeric variables before clustering puts them on a common scale, so that no variable dominates the Euclidean distance merely because of its units. Without scaling, odometer (tens of thousands of km) would outweigh powerPS (tens to hundreds of PS) in every distance calculation. scale() centers each column to mean 0 and rescales it to standard deviation 1, so price, odometer, and powerPS carry comparable weight when observations are assigned to clusters.
vehicles_scaled <- as.data.frame(scale(vehicles[, c("price", "odometer", "powerPS")]))
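The following sketch, on a small invented data frame, checks that scale() matches the usual z-score formula (value minus column mean, divided by column standard deviation). The data frame `x` and its values are assumptions for illustration.

```r
# Small invented example with two columns on very different scales.
x <- data.frame(price = c(1000, 5000, 9000), powerPS = c(60, 100, 140))

# Manual z-scores, column by column.
z_manual <- sapply(x, function(col) (col - mean(col)) / sd(col))

# scale() with its defaults (center = TRUE, scale = TRUE).
z_scale <- scale(x)

all.equal(z_manual, z_scale, check.attributes = FALSE)  # same values
colMeans(z_scale)      # column means are 0 up to rounding
apply(z_scale, 2, sd)  # column standard deviations are 1
```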
summary(vehicles)
## dateCrawled name seller offerType
## Min. :2016-03-05 Length:49155 Length:49155 Length:49155
## 1st Qu.:2016-03-13 Class :character Class :character Class :character
## Median :2016-03-21 Mode :character Mode :character Mode :character
## Mean :2016-03-20
## 3rd Qu.:2016-03-29
## Max. :2016-04-07
## price abtest vehicleType yearOfRegistration
## Min. : 0 Length:49155 Length:49155 Min. :1000
## 1st Qu.: 1100 Class :character Class :character 1st Qu.:1999
## Median : 2850 Mode :character Mode :character Median :2003
## Mean : 5124 Mean :2005
## 3rd Qu.: 6950 3rd Qu.:2008
## Max. :35900 Max. :9999
## gearbox powerPS model odometer
## Length:49155 Min. : 0.0 Length:49155 Min. : 5000
## Class :character 1st Qu.: 69.0 Class :character 1st Qu.:125000
## Mode :character Median :105.0 Mode :character Median :150000
## Mean :108.7 Mean :126451
## 3rd Qu.:144.0 3rd Qu.:150000
## Max. :344.0 Max. :150000
## monthOfRegistration fuelType brand notRepairedDamage
## Min. : 0.000 Length:49155 Length:49155 Length:49155
## 1st Qu.: 3.000 Class :character Class :character Class :character
## Median : 6.000 Mode :character Mode :character Mode :character
## Mean : 5.719
## 3rd Qu.: 9.000
## Max. :12.000
## dateCreated nrOfPictures postalCode lastSeen
## Min. :2015-08-10 Min. :0 Min. : 1067 Min. :2016-03-05
## 1st Qu.:2016-03-13 1st Qu.:0 1st Qu.:30419 1st Qu.:2016-03-23
## Median :2016-03-21 Median :0 Median :49504 Median :2016-04-03
## Mean :2016-03-20 Mean :0 Mean :50754 Mean :2016-03-29
## 3rd Qu.:2016-03-29 3rd Qu.:0 3rd Qu.:71384 3rd Qu.:2016-04-06
## Max. :2016-04-07 Max. :0 Max. :99998 Max. :2016-04-07
ggplot(vehicles, aes(x = price, y = odometer)) +
geom_point(color = "skyblue3") +
labs(x = "Price", y = "Odometer")
The scatter plot shows the relationship between vehicle price and odometer readings. There seems to be a general negative correlation between price and odometer readings, indicating that vehicles with higher odometer readings tend to have lower prices. This is an intuitive finding in the automotive market where mileage is often considered when determining vehicle value.
hist(vehicles$price, col = "orange")
The histogram displays the distribution of vehicle prices. The distribution appears to be right-skewed, with most vehicles priced at lower values and fewer vehicles at higher price ranges. This is typical in markets where there’s a wide range of vehicle prices, with the majority falling within a certain range.
plot(vehicles$odometer, vehicles$price, col = "green3")
This base-graphics scatter plot shows the same two variables as the first plot with the axes swapped: odometer on the x-axis and price on the y-axis. It is another view of the same data rather than independent evidence, but it presents the relationship in the more conventional orientation: as odometer readings increase, vehicle prices generally decrease. Mileage is therefore a relevant variable to include when clustering on price.
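The visual impression of a negative relationship could be backed by a rank correlation. The sketch below uses simulated data (the variables and the decay curve are assumptions), with Spearman's method chosen because price is right-skewed. The commented line shows how the same call might be applied to the vehicles data.

```r
# Simulated listings: price decays with mileage, with multiplicative noise.
set.seed(7)
odometer <- runif(500, min = 5000, max = 150000)
price    <- 20000 * exp(-odometer / 60000) * exp(rnorm(500, sd = 0.3))

# Spearman's rho measures monotone association and is not
# distorted by the long right tail of the price distribution.
rho <- cor(price, odometer, method = "spearman")
rho

# On the real data this might look like:
# cor(vehicles$price, vehicles$odometer, method = "spearman")
```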
km <- kcca(vehicles_scaled, k = 3)
plot(km)
kcca() from flexclust fits k-means by default, here with k = 3 clusters on the scaled price, odometer, and powerPS columns. The plot shows the resulting partition projected onto two dimensions, with the cluster centroids and the data points assigned to them. The value k = 3 is fixed in advance rather than selected from the data, and because no random seed is set, cluster labels and exact boundaries can change between runs. The clusters provide a basis for further analysis of vehicle segments.
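One common way to sanity-check a choice such as k = 3 is an elbow plot of the total within-cluster sum of squares. The sketch below uses base R's kmeans() on simulated data with three groups. It is not the procedure used in this analysis, and the toy matrix and range of k are assumptions.

```r
# Simulated data with three groups of 100 points each.
set.seed(123)
toy <- rbind(
  matrix(rnorm(200, mean = 0),  ncol = 2),
  matrix(rnorm(200, mean = 6),  ncol = 2),
  matrix(rnorm(200, mean = 12), ncol = 2)
)
toy_scaled <- scale(toy)

# Total within-cluster sum of squares for k = 1..6.
ks  <- 1:6
wss <- sapply(ks, function(k) {
  kmeans(toy_scaled, centers = k, nstart = 10)$tot.withinss
})

# The "elbow" is the k after which further clusters reduce wss only slightly.
plot(ks, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```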
cluster_assignments <- predict(km)
plot(vehicles_scaled[, c("price", "odometer")], col = cluster_assignments)
FeatureImp_km <- FeatureImpCluster(km, as.data.table(vehicles_scaled))
plot(FeatureImp_km)
The scatter plot of scaled price against scaled odometer colors each point by its assigned cluster, which shows how the three groups divide these two dimensions. Because the clustering also uses powerPS, clusters that overlap in this two-dimensional view may still be separated along the third variable. The second plot, produced by FeatureImpCluster(), ranks price, odometer, and powerPS by permutation-based feature importance: a feature counts as more important when shuffling its values changes more cluster assignments.
sampled_indices <- sample(nrow(vehicles_scaled), size = 1000)
sampled_data <- vehicles_scaled[sampled_indices, ]
km_sampled <- kcca(sampled_data, k = 3)
cluster_assignments_sampled <- predict(km_sampled)
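# Store the result as `sil` so the silhouette() function is not shadowed
sil <- silhouette(cluster_assignments_sampled, dist(sampled_data))
mean_silhouette <- mean(sil[, "sil_width"])
print(paste("Mean Silhouette Score on Sampled Data:", mean_silhouette))
## [1] "Mean Silhouette Score on Sampled Data: 0.435023154757361"
plot(sil)
The silhouette plot shows the silhouette width of each sampled observation, grouped by cluster, along with the average width. Silhouette widths range from -1 to 1: values near 1 indicate an observation that sits well inside its cluster, and values near 0 indicate one that lies between clusters. The mean of about 0.44 points to moderate separation, meaning the three clusters are distinguishable but overlap. Sampling keeps the pairwise distance matrix small, since dist() on all 49,155 rows would need over a billion entries. The score therefore comes from a separate k-means fit on a random sample of 1,000 rows, not from the full-data model km, and because no seed is set it will vary from run to run.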
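Average silhouette width can also be compared across candidate values of k. A rough sketch on simulated data follows, using silhouette() from the cluster package; the toy matrix and the range of k are assumptions, and no claim is made about what the vehicles data would show.

```r
library(cluster)

# Simulated data with three groups of 100 points each.
set.seed(123)
toy <- scale(rbind(
  matrix(rnorm(200, mean = 0),  ncol = 2),
  matrix(rnorm(200, mean = 6),  ncol = 2),
  matrix(rnorm(200, mean = 12), ncol = 2)
))
d <- dist(toy)  # pairwise Euclidean distances, computed once

# Mean silhouette width for k = 2..5.
candidate_k <- 2:5
avg_sil <- sapply(candidate_k, function(k) {
  cl <- kmeans(toy, centers = k, nstart = 10)$cluster
  mean(silhouette(cl, d)[, "sil_width"])
})
names(avg_sil) <- candidate_k
avg_sil
```

A higher average width suggests a better-separated partition, so the k with the largest value is a reasonable candidate.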
The k-means analysis of the vehicles dataset partitioned the listings into three clusters based on scaled price, odometer reading, and powerPS. Preprocessing (removing duplicates and dropping rows above the 99th percentile of any of the three variables) and scaling kept duplicated records, extreme values, and differences in units from dominating the result, while the plots showed how the clusters divide the data and which features separate them. A mean silhouette score of about 0.44 on a 1,000-row sample indicates moderate rather than sharp separation between the groups. The clusters provide a starting point for further exploration and targeted decision-making in domains such as automotive market analysis and customer segmentation.