class: title-slide

.row[
.col-7[
.title[
# K-Means Clustering
]
.subtitle[
## K-Means Clustering
]
.author[
### Laxmikant Soni <br> [blog](https://laxmikants.github.io) <br> [<i class="fab fa-github"></i>](https://github.com/laxmiaknts) [<i class="fab fa-twitter"></i>](https://twitter.com/laxmikantsoni09)
]
.affiliation[
]
]
.col-5[
.logo[
<img src="figures/rmarkdown.png" width="480" />
]
]
]

---

# K-Means Clustering

.pull-top[

## Introduction to K-Means Clustering

* **Definition**: K-means clustering is an unsupervised machine learning algorithm that partitions a dataset into clusters by assigning each data point to the cluster with the nearest centroid.
* **Objective**: To group similar data points into k clusters, where k is chosen in advance.

]

# K-Means Clustering

.pull-top[

## Key Concepts

* **Centroid**: The center of a cluster, calculated as the mean position of all data points within the cluster.
* **Clusters**: Groups of points in which members of the same cluster are more similar to each other than to points in other clusters.

]

---

# K-Means Clustering

.pull-top[

## Additional Concepts

* **Inertia**: A measure of cluster compactness, computed as the sum of squared distances from each point to its assigned centroid; lower inertia indicates tighter, more cohesive clusters.
* **K Value**: The number of clusters, typically chosen with a heuristic such as the elbow method to balance cohesion and separation.

]

# K-Means Clustering

.pull-top[

## Example: Mall Customer Segmentation

* **Dataset**: Each data point represents a customer, with features such as age, income, and spending score.
* **Goal**: Segment customers based on shopping behavior for targeted marketing. Illustrative segments:
  - **Cluster 1**: High-income, high-spending frequent shoppers.
  - **Cluster 2**: Younger customers with low to moderate spending scores, potentially price-sensitive.
  - **Cluster 3**: Middle-aged customers with steady income and spending patterns.
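]

---

# K-Means Clustering

.pull-top[

## Sketch: Inertia and the Elbow Method

The inertia and K-value concepts introduced earlier can be illustrated with a small sketch. It assumes scikit-learn and NumPy are installed and uses synthetic blobs rather than the mall data; the helper name `inertia_by_k` and the choice `n_init=10` are assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs


def inertia_by_k(X, k_values, random_state=42):
    """Fit K-means once per k and return a dict mapping k to inertia."""
    results = {}
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=random_state)
        model.fit(X)
        # inertia_ is the sum of squared distances to the nearest centroid
        results[k] = float(model.inertia_)
    return results


# Synthetic data with three well-separated groups (illustrative only)
X_demo, _ = make_blobs(
    n_samples=150,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=1.0,
    random_state=42,
)

inertias = inertia_by_k(X_demo, range(1, 7))
for k, value in inertias.items():
    print(f"k={k}: inertia={value:.1f}")
```

Inertia generally keeps shrinking as k grows, so the elbow method looks for the k after which the decrease levels off rather than for the minimum value.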
]

---

# K-Means Clustering

.pull-top[

## Mall Customer Segmentation Dataset

| CustomerID | Age | Annual_Income (k$) | Spending_Score (1-100) |
|------------|-----|--------------------|------------------------|
| 1          | 19  | 15                 | 39                     |
| 2          | 21  | 15                 | 81                     |
| 3          | 20  | 16                 | 6                      |
| 4          | 23  | 16                 | 77                     |
| 5          | 31  | 17                 | 40                     |
| 6          | 22  | 17                 | 76                     |
| 7          | 35  | 18                 | 6                      |
| 8          | 23  | 18                 | 94                     |
| 9          | 64  | 19                 | 3                      |
| 10         | 30  | 19                 | 72                     |

]

---

# Features and Target Explanation

.pull-top[

## Features

1. **CustomerID**: A unique identifier for each customer. It is not used for clustering but helps identify individual records.
2. **Age**: The age of the customer. Useful for understanding the age demographics of each customer segment.
3. **Annual_Income (k$)**: The customer’s annual income in thousands of dollars. This feature helps differentiate high-income from low-income customer groups.
4. **Spending_Score (1-100)**: A score assigned based on customer behavior and purchasing data, ranging from 1 (low) to 100 (high). Higher scores indicate a tendency to spend more, which helps distinguish high spenders from low spenders.

]

---

# Features and Target Explanation

.pull-top[

## Target (Cluster Label)

* **Cluster Label**: The cluster number assigned to each customer after running the K-means algorithm. Because K-means is unsupervised, this label is produced by the algorithm rather than supplied with the data.
  - This label represents the group or segment to which the customer belongs, based on similarities in age, income, and spending score.
  - For instance, clusters may represent high-income frequent spenders, budget-conscious shoppers, or younger, moderate spenders.
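]

---

# Features and Target Explanation

.pull-top[

## Sketch: Profiling Cluster Labels

To make the cluster-label idea concrete, one option is to average each feature per cluster and read the segments off the result. This is a minimal sketch, assuming pandas and scikit-learn are installed; it reuses the ten sample rows from the dataset table, and the names `df_demo`, `profile`, and the setting `n_init=10` are illustrative assumptions.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# The ten sample rows from the dataset table (CustomerID omitted)
df_demo = pd.DataFrame({
    'Age': [19, 21, 20, 23, 31, 22, 35, 23, 64, 30],
    'Annual_Income': [15, 15, 16, 16, 17, 17, 18, 18, 19, 19],
    'Spending_Score': [39, 81, 6, 77, 40, 76, 6, 94, 3, 72],
})
features = ['Age', 'Annual_Income', 'Spending_Score']

# Standardize the features, then assign a cluster label to every row
X_demo = StandardScaler().fit_transform(df_demo[features])
model = KMeans(n_clusters=3, n_init=10, random_state=42)
df_demo['Cluster'] = model.fit_predict(X_demo)

# Per-cluster feature means in original units, plus the cluster size
profile = df_demo.groupby('Cluster')[features].mean().round(1)
profile['Size'] = df_demo.groupby('Cluster').size()
print(profile)
```

Each row of `profile` describes one segment in the original units, which is usually easier to interpret than the standardized centroids. The label numbers themselves are arbitrary and can change between runs or library versions.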
]

---

# Python Implementation

.pull-top[

## Importing Libraries

```python
# Import necessary libraries
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
```

]

---

# Python Implementation

.pull-top[

## Setting up Dataset

```python
# Sample dataset: replace this with your actual dataset
data = {
    'CustomerID': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'Age': [19, 21, 20, 23, 31, 22, 35, 23, 64, 30],
    'Annual_Income': [15, 15, 16, 16, 17, 17, 18, 18, 19, 19],
    'Spending_Score': [39, 81, 6, 77, 40, 76, 6, 94, 3, 72]
}
df = pd.DataFrame(data)
```

]

---

# Python Implementation

.pull-top[

## Features for clustering

```python
# CustomerID is an identifier, so it is left out of the feature matrix
X = df[['Age', 'Annual_Income', 'Spending_Score']]
```

]

---

# Python Implementation

.pull-top[

## Data Standardization

```python
# Rescale each feature to zero mean and unit variance so that
# no single feature dominates the distance computation
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
```

]

---

# Python Implementation

.pull-top[

## Apply K-means clustering

```python
# Apply K-means clustering
# n_init is set explicitly because its default differs across scikit-learn versions
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
df['Cluster'] = kmeans.fit_predict(X_scaled)

# Display the resulting clusters
print("Clustered Data:")
```

```
## Clustered Data:
```

```python
print(df)
```

```
##    CustomerID  Age  Annual_Income  Spending_Score  Cluster
## 0           1   19             15              39        2
## 1           2   21             15              81        0
## 2           3   20             16               6        2
## 3           4   23             16              77        0
## 4           5   31             17              40        2
## 5           6   22             17              76        0
## 6           7   35             18               6        2
## 7           8   23             18              94        0
## 8           9   64             19               3        1
## 9          10   30             19              72        0
```

]

---

# Python Implementation

.pull-top[

## Classify New Customer

```python
# Classify a new customer (column names must match the training features)
new_customer = pd.DataFrame({
    'Age': [25],
    'Annual_Income': [20],
    'Spending_Score': [60]
})

# Standardize the new customer's data using the previously fitted scaler
new_customer_scaled = scaler.transform(new_customer)

# Predict the cluster for the new customer
new_customer_cluster = kmeans.predict(new_customer_scaled)
print(f"New customer belongs to Cluster: {new_customer_cluster[0]}")
```

```
## New customer belongs to Cluster: 0
```

]

---

class: inverse, center, middle

# Thanks
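
---

# Python Implementation

.pull-top[

## Appendix Sketch: What Prediction Does

As a follow-up to the new-customer step, the sketch below reproduces the prediction by hand: scale the point with the fitted scaler, then pick the nearest centroid by Euclidean distance. It assumes NumPy and scikit-learn are installed; the names `X_demo`, `scaler_demo`, `new_point`, and the setting `n_init=10` are illustrative and not part of the slides above.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# The ten sample rows used earlier: Age, Annual_Income, Spending_Score
X_demo = np.array([
    [19, 15, 39], [21, 15, 81], [20, 16, 6], [23, 16, 77], [31, 17, 40],
    [22, 17, 76], [35, 18, 6], [23, 18, 94], [64, 19, 3], [30, 19, 72],
], dtype=float)

scaler_demo = StandardScaler().fit(X_demo)
model = KMeans(n_clusters=3, n_init=10, random_state=42)
model.fit(scaler_demo.transform(X_demo))

# Scale the new customer with the SAME fitted scaler
new_point = scaler_demo.transform(np.array([[25, 20, 60]], dtype=float))

# Euclidean distance from the new point to every centroid
distances = np.linalg.norm(model.cluster_centers_ - new_point, axis=1)
nearest = int(np.argmin(distances))

print("Distances to centroids:", np.round(distances, 2))
print("Nearest centroid:", nearest)
print("Matches model.predict:", nearest == int(model.predict(new_point)[0]))
```

Refitting the scaler on the new point instead of reusing the fitted one would place the customer on a different scale than the centroids, so the distances would no longer be comparable.

]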