Assignment Summary

The assignment involves applying one of three analytical methods (association rule mining, clustering, or collaborative filtering) to a chosen dataset to derive actionable insights. The report must follow the CRISP-DM framework and include a thorough analysis, reproducible results, and clearly defined steps.

Options for Analysis

  1. Option A: Association Rule Mining
    • Generate frequent itemsets and rules to understand relationships.
    • Evaluate the rules using measures such as support, confidence, and lift.
    • Tools: R is recommended.
  2. Option B: Clustering
    • Apply clustering to identify patterns in an unlabeled dataset.
    • Use internal/external validation for cluster evaluation.
    • Tools: Python with scikit-learn or similar libraries.
  3. Option C: Collaborative Filtering
    • Build a recommendation system using user-item or item-item matrices.
    • Tools: Dato’s GraphLab-Create.

Suggested Plan with Paris Airbnb Dataset (Option B: Clustering)

The Paris Airbnb dataset is well-suited for clustering as it can help identify patterns in property types, pricing, and amenities. Here’s a step-by-step plan:


1. Business Understanding

  • Goal: Cluster Airbnb listings to identify distinct property groups based on attributes like price, amenities, and location.
  • Effectiveness Measure: Evaluate clusters using Silhouette Score and Davies-Bouldin Index. Use visualization to check cluster separability.
  • Validation Method: Internal validation measures to ensure clustering quality.

2. Data Understanding

  • Tasks:
    1. Explore the dataset structure and types of attributes.
    2. Identify key features: property type, price, bedrooms, amenities, etc.
    3. Check for missing values, duplicates, and outliers.
    4. Visualize key attributes, such as price distribution, property types, and geographical spread.
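The checks in task 3 can be sketched with pandas. This is a minimal illustration on a tiny synthetic table, not the actual Paris Airbnb data: the column names (`property_type`, `price`, `bedrooms`) and the 1.5 × IQR outlier rule are assumptions chosen for the example.

```python
import numpy as np
import pandas as pd

# Tiny synthetic stand-in for the listings table; column names are assumptions.
listings = pd.DataFrame({
    "property_type": ["Apartment", "Apartment", "Loft", "House", "Apartment", "Apartment"],
    "price": [80.0, 95.0, 120.0, 2500.0, np.nan, 80.0],
    "bedrooms": [1, 1, 2, 4, 1, 1],
})

# Missing values per column.
missing_counts = listings.isna().sum()

# Rows that exactly repeat an earlier row in every column.
n_duplicates = int(listings.duplicated().sum())

# Flag price outliers with the 1.5 * IQR rule (one common convention, assumed here).
q1, q3 = listings["price"].quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (listings["price"] < q1 - 1.5 * iqr) | (listings["price"] > q3 + 1.5 * iqr)

print(missing_counts)
print("duplicate rows:", n_duplicates)
print(listings[is_outlier])
```

Whether a flagged price is an error or a genuine luxury listing is a judgment call that the report should document rather than automate.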

3. Data Preparation

  • Steps:
    1. Handle missing data (impute or drop based on relevance).
    2. Remove duplicates.
    3. Normalize numerical features (e.g., price) so that no single large-scale feature dominates the distance calculations used in clustering.
    4. Encode categorical variables (e.g., property type, neighborhood) using one-hot encoding.
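The four preparation steps could be wired together roughly as follows. This is a sketch under stated assumptions: the data is synthetic, the column names are hypothetical, median imputation is only one way to "impute", and `StandardScaler` (zero mean, unit variance) stands in for "normalize"; `MinMaxScaler` would be an equally valid choice.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the listings table; column names are assumptions.
listings = pd.DataFrame({
    "price": [80.0, 95.0, np.nan, 300.0, 80.0],
    "bedrooms": [1, 1, 2, 3, 1],
    "property_type": ["Apartment", "Loft", "Apartment", "House", "Apartment"],
})

# Step 2: remove exact duplicate rows.
listings = listings.drop_duplicates().reset_index(drop=True)

numeric_cols = ["price", "bedrooms"]
categorical_cols = ["property_type"]

# Steps 1 and 3: impute missing numeric values with the median, then standardize.
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# Step 4: one-hot encode the categorical columns.
preprocess = ColumnTransformer(
    [
        ("num", numeric_steps, numeric_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ],
    sparse_threshold=0.0,  # always return a dense NumPy array
)

X = preprocess.fit_transform(listings)
print(X.shape)
```

Keeping the steps inside one `ColumnTransformer` means the same fitted transformation can be reapplied to new listings later, which matters for the deployment discussion.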

4. Modeling

  • Clustering Algorithms:
    1. K-Means: Determine the optimal number of clusters using the Elbow Method and Silhouette Score.
    2. DBSCAN: Cluster based on density; useful for irregularly shaped clusters and for flagging isolated listings as noise.
    3. Hierarchical Clustering: Visualize dendrograms to analyze data hierarchy.
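As a rough illustration of the three algorithms, the sketch below runs them on synthetic, well-separated blobs rather than on real listings. The parameter values (`eps=0.5`, `min_samples=5`, the range of `k`, Ward linkage) are assumptions picked for this toy data and would need tuning on the prepared Airbnb features.

```python
from sklearn.cluster import DBSCAN, AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the prepared feature matrix: three well-separated groups.
X_raw, _ = make_blobs(
    n_samples=300,
    centers=[[-6, -6], [0, 6], [6, -6]],
    cluster_std=0.6,
    random_state=0,
)
X = StandardScaler().fit_transform(X_raw)

# K-Means: record inertia (for an elbow plot) and silhouette for each candidate k.
inertia, silhouette = {}, {}
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertia[k] = km.inertia_
    silhouette[k] = silhouette_score(X, km.labels_)
best_k = max(silhouette, key=silhouette.get)

# DBSCAN: density-based; the label -1 marks points treated as noise.
db_labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
n_db_clusters = len(set(db_labels) - {-1})

# Hierarchical (agglomerative) clustering, cut at the k chosen above.
agg_labels = AgglomerativeClustering(n_clusters=best_k, linkage="ward").fit_predict(X)

print("best k by silhouette:", best_k)
print("DBSCAN clusters (excluding noise):", n_db_clusters)
```

For the dendrogram itself, `scipy.cluster.hierarchy.linkage` followed by `scipy.cluster.hierarchy.dendrogram` is one option; it is left out here to keep the sketch free of plotting dependencies.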

5. Evaluation

  • Metrics:
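    1. Silhouette Score: compares each listing's cohesion within its own cluster against its separation from the nearest other cluster (range -1 to 1; higher is better).
    2. Davies-Bouldin Index: averages each cluster's similarity to its most similar neighbor, based on within-cluster scatter relative to the distance between centroids (lower is better).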
    3. Visualize clusters using PCA or t-SNE for dimensionality reduction.
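The two metrics and the PCA projection might be computed as below. This is a sketch on synthetic five-dimensional blobs with explicitly chosen centers, so the scores say nothing about what to expect on the real listings.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.metrics import davies_bouldin_score, silhouette_score

# Synthetic 5-feature data with three far-apart centers (illustrative only).
centers = np.array([
    [0, 0, 0, 0, 0],
    [10, 10, 10, 10, 10],
    [-10, 10, -10, 10, -10],
])
X, _ = make_blobs(n_samples=300, centers=centers, cluster_std=1.0, random_state=42)
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

sil = silhouette_score(X, labels)      # in [-1, 1]; higher is better
dbi = davies_bouldin_score(X, labels)  # >= 0; lower is better

# Project to two dimensions for a scatter plot colored by cluster label.
pca = PCA(n_components=2, random_state=42)
coords = pca.fit_transform(X)
explained = pca.explained_variance_ratio_.sum()

print(f"silhouette={sil:.3f}  davies_bouldin={dbi:.3f}  variance kept={explained:.2%}")
```

Reporting the variance retained by the two components helps readers judge how much of the cluster structure the 2-D picture can actually show.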

6. Deployment

  • Considerations:
    1. Suggest how businesses (e.g., Airbnb hosts) can use clusters for pricing strategies or feature optimization.
    2. Discuss updating the clustering model with new data (e.g., seasonal trends).

7. Exceptional Work

  • Perform additional analyses:
    1. Time-series clustering to study seasonal price trends.
    2. Analyze clusters’ correlation with external factors (e.g., proximity to landmarks).

Let’s dive into Step 1: Business Understanding for Lab 3 with the Paris Airbnb dataset.

Step 1: Business Understanding

Objective:

We aim to cluster Airbnb listings in Paris based on their attributes (e.g., price, property type, amenities) to identify patterns and insights that can help stakeholders like:

  • Hosts: Optimize pricing or improve features to make their properties more competitive.
  • Travelers: Understand which clusters align with their preferences, such as luxury properties or budget accommodations.

Questions to Answer:

  1. Why was this data collected?
    • This dataset is likely intended to provide insights into the Paris Airbnb market, focusing on listing characteristics, pricing, and geographical trends.
  2. What is the goal of clustering?
    • To group similar listings based on their features and identify distinct patterns in the dataset.
  3. How will you measure the effectiveness of clustering?
    • Internal Validation:
      • Silhouette Score: Measures how close each listing is to its own cluster compared with the nearest other cluster; higher is better.
      • Davies-Bouldin Index: Assesses the compactness and separation of clusters; lower is better.
    • Visualizations:
      • Use t-SNE or PCA to visualize cluster separability.
    • Descriptive Analysis:
      • Profile each cluster based on key attributes (e.g., average price, property type distribution).
  4. Why does your chosen validation method make sense?
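    • The listings have no ground-truth group labels, so internal metrics are the practical way to quantify how compact and well-separated the clusters are. Visualization and cluster profiling then check that the groups are meaningful enough to yield actionable insights.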
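The descriptive analysis listed under question 3 could look like the following once each listing carries a cluster label. The table, its column names, and the label values are made up for illustration.

```python
import pandas as pd

# Hypothetical listings with cluster labels already assigned by a clustering model.
profiled = pd.DataFrame({
    "cluster": [0, 0, 0, 1, 1, 2],
    "price": [60.0, 70.0, 80.0, 200.0, 240.0, 900.0],
    "property_type": ["Apartment", "Apartment", "Loft", "House", "House", "Villa"],
})

# Size and price level of each cluster.
price_profile = profiled.groupby("cluster")["price"].agg(["count", "mean", "median"])

# Share of each property type within each cluster (rows sum to 1).
type_share = pd.crosstab(profiled["cluster"], profiled["property_type"], normalize="index")

print(price_profile)
print(type_share.round(2))
```

Tables like these turn abstract cluster IDs into descriptions such as "budget apartments" or "high-end houses" that hosts and travelers can act on.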

Next Step: Data Understanding

We’ll now load the dataset, inspect its structure, and understand its features.

Action Plan for Data Understanding:

  1. Load the dataset into a DataFrame.
  2. Inspect the dataset (columns, data types, missing values).
  3. Generate basic statistics and visualizations to understand key attributes.
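A minimal sketch of the first two action items, plus the summary statistics from the third, is shown below. It reads an inline CSV so that it runs without any file; with the real data, the `io.StringIO(...)` argument would be replaced by the path to the listings file, which is not specified here. The column names are assumptions.

```python
import io

import pandas as pd

# Inline CSV standing in for the real listings file (columns are assumptions).
csv_text = """id,price,bedrooms,property_type
1,80,1,Apartment
2,,2,Loft
3,300,3,House
"""

# 1. Load the dataset into a DataFrame.
df = pd.read_csv(io.StringIO(csv_text))

# 2. Inspect columns, data types, and missing values in one table.
summary = pd.DataFrame({
    "dtype": df.dtypes.astype(str),
    "n_missing": df.isna().sum(),
    "n_unique": df.nunique(),
})

# 3. Basic statistics for the numeric columns.
stats = df.describe()

print(df.shape)
print(summary)
print(stats)
```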