Assignment Summary
The assignment involves applying one of three analytical
methods (association rule mining, clustering, or collaborative
filtering) to a chosen dataset to derive actionable insights. The report
must follow the CRISP-DM framework and include a thorough analysis,
reproducible results, and clearly defined steps.
Options for Analysis
- Option A: Association Rule Mining
- Generate frequent itemsets and rules to understand
relationships.
- Evaluate rules using measures such as support, confidence, and lift.
- Tools: R is recommended.
- Option B: Clustering
- Apply clustering to identify patterns in an unlabeled dataset.
- Use internal/external validation for cluster evaluation.
- Tools: Python with scikit-learn or similar libraries.
- Option C: Collaborative Filtering
- Build a recommendation system using user-item or item-item
matrices.
- Tools: Dato’s GraphLab-Create.
Suggested Plan with Paris Airbnb Dataset (Option B: Clustering)
The Paris Airbnb dataset is well-suited for clustering as it can help
identify patterns in property types, pricing, and amenities. Here’s a
step-by-step plan:
1. Business Understanding
- Goal: Cluster Airbnb listings to identify distinct
property groups based on attributes like price, amenities, and
location.
- Effectiveness Measure: Evaluate clusters using
Silhouette Score and Davies-Bouldin Index. Use visualization to check
cluster separability.
- Validation Method: Internal validation measures to
ensure clustering quality.
2. Data Understanding
- Tasks:
- Explore the dataset structure and types of attributes.
- Identify key features: property type, price, bedrooms, amenities,
etc.
- Check for missing values, duplicates, and outliers.
- Visualize key attributes, such as price distribution, property
types, and geographical spread.
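The checks listed above can be sketched with pandas. This is a minimal illustration on a hypothetical toy table: the column names (`price`, `bedrooms`, `property_type`) and the values are assumptions, not the real Paris dataset, and the 1.5 × IQR rule is just one common way to flag outliers.

```python
import pandas as pd

# Hypothetical toy listings; real column names in the Paris dataset may differ.
listings = pd.DataFrame({
    "price": [80, 95, 110, 120, 95, 2500, None, 70],
    "bedrooms": [1, 1, 2, 2, 1, 3, 1, 1],
    "property_type": ["Apartment", "Apartment", "Loft", "Apartment",
                      "Apartment", "Villa", "Apartment", "Studio"],
})


def iqr_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return a boolean mask of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)


# Missing values per column.
missing_per_column = listings.isna().sum()

# Fully identical rows (the first occurrence is not counted).
duplicate_rows = int(listings.duplicated().sum())

# Prices flagged as outliers; missing prices compare as False and are skipped.
price_outliers = listings.loc[iqr_outliers(listings["price"]), "price"]

print(missing_per_column)
print("duplicate rows:", duplicate_rows)
print(price_outliers)
```

Whether a flagged price is an error or a genuine luxury listing is a judgment call that should be made before clustering, since distance-based methods are sensitive to extreme values.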
3. Data Preparation
- Steps:
- Handle missing data (impute or drop based on relevance).
- Remove duplicates.
- Normalize numerical features (e.g., price) to ensure balanced
clustering.
- Encode categorical variables (e.g., property type, neighborhood)
using one-hot encoding.
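The preparation steps above could be combined into one scikit-learn preprocessing object, as in the sketch below. The column names and values are hypothetical, median imputation is an assumed choice (the plan only says "impute or drop"), and `StandardScaler` is used as one possible form of normalization.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy listings; real column names may differ.
listings = pd.DataFrame({
    "price": [80.0, 95.0, np.nan, 120.0, 95.0, 300.0],
    "bedrooms": [1, 1, 2, 2, 1, 3],
    "property_type": ["Apartment", "Apartment", "Loft",
                      "Apartment", "Apartment", "Loft"],
})

# Remove exact duplicate rows before fitting anything.
listings = listings.drop_duplicates().reset_index(drop=True)

numeric_cols = ["price", "bedrooms"]
categorical_cols = ["property_type"]

preprocess = ColumnTransformer([
    # Numeric columns: fill gaps with the median, then scale to zero mean, unit variance.
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]), numeric_cols),
    # Categorical columns: one-hot encode; unseen categories become all zeros.
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])

X = preprocess.fit_transform(listings)
# ColumnTransformer may return a sparse matrix; convert to a dense array either way.
X = X.toarray() if hasattr(X, "toarray") else np.asarray(X)

print(X.shape)
```

Keeping the steps in one transformer means the same imputation values, scaling parameters, and category list are reused when new listings arrive later.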
4. Modeling
- Clustering Algorithms:
- K-Means: Determine the optimal number of clusters using the Elbow
Method and Silhouette Score.
- DBSCAN: Cluster based on density; useful for irregular data
distributions.
- Hierarchical Clustering: Visualize dendrograms to analyze data
hierarchy.
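As a rough sketch of the three algorithms above, the example below runs them on synthetic, well-separated blobs rather than Airbnb data. The blob centers, the range of k values, and the DBSCAN parameters (`eps`, `min_samples`) are illustrative assumptions that would need tuning on real listings.

```python
from sklearn.cluster import DBSCAN, AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the prepared feature matrix (three clear groups).
X, _ = make_blobs(
    n_samples=300,
    centers=[[-6, -6], [0, 6], [6, -6]],
    cluster_std=0.8,
    random_state=42,
)
X = StandardScaler().fit_transform(X)

# K-Means: record inertia (for an elbow plot) and silhouette for each k.
inertias, silhouettes = {}, {}
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias[k] = km.inertia_
    silhouettes[k] = silhouette_score(X, km.labels_)

# Pick the k with the highest silhouette; the elbow in `inertias` is a second opinion.
best_k = max(silhouettes, key=silhouettes.get)

# DBSCAN: density-based; label -1 marks noise points, not a cluster.
db_labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)
n_db_clusters = len(set(db_labels) - {-1})

# Hierarchical (agglomerative) clustering with Ward linkage, cut at best_k clusters.
agg_labels = AgglomerativeClustering(n_clusters=best_k, linkage="ward").fit_predict(X)

print("best k by silhouette:", best_k)
print("DBSCAN clusters (excluding noise):", n_db_clusters)
```

Plotting `inertias` against k gives the elbow curve; a dendrogram for the hierarchical model can be drawn with `scipy.cluster.hierarchy` if SciPy is available.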
5. Evaluation
- Metrics:
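- Silhouette Score for cluster cohesion and separation (ranges from -1 to 1; higher is better).
- Davies-Bouldin Index for the ratio of within-cluster scatter to between-cluster separation (lower is better).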
- Visualize clusters using PCA or t-SNE for dimensionality
reduction.
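The evaluation metrics above are available in scikit-learn. The sketch below computes them on synthetic five-dimensional blobs (an assumption standing in for the prepared listings) and projects the points to two dimensions with PCA; t-SNE could be swapped in via `sklearn.manifold.TSNE`.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.metrics import davies_bouldin_score, silhouette_score

# Synthetic stand-in for the prepared feature matrix: three groups in 5 dimensions.
centers = [[0, 0, 0, 0, 0], [8, 8, 0, 0, 0], [0, 8, 8, 0, 0]]
X, _ = make_blobs(n_samples=300, centers=centers, cluster_std=1.0, random_state=0)

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Silhouette: -1 to 1, higher means tighter and better separated clusters.
sil = silhouette_score(X, labels)
# Davies-Bouldin: non-negative, lower means better separation relative to scatter.
dbi = davies_bouldin_score(X, labels)

# Project to 2-D for plotting; the explained variance indicates how much the plot hides.
pca = PCA(n_components=2, random_state=0)
coords = pca.fit_transform(X)
explained = pca.explained_variance_ratio_.sum()

print(f"silhouette={sil:.3f}  davies_bouldin={dbi:.3f}  explained_variance={explained:.2f}")
```

A scatter plot of `coords` colored by `labels` then gives the visual check of separability; overlap in the 2-D plot does not necessarily mean overlap in the full feature space.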
6. Deployment
- Considerations:
- Suggest how businesses (e.g., Airbnb hosts) can use clusters for
pricing strategies or feature optimization.
- Discuss updating the clustering model with new data (e.g., seasonal
trends).
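One way the model update could work is sketched below, assuming K-Means is the chosen algorithm: new listings are first assigned to existing clusters with `predict`, and the model is periodically refit on the combined data. The two-column table and its values are hypothetical.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical training listings: three budget and three high-end properties.
train = pd.DataFrame({
    "price": [60, 70, 65, 300, 320, 310],
    "bedrooms": [1, 1, 1, 3, 4, 3],
})

# Scaling and clustering in one object, so new data is scaled consistently.
model = make_pipeline(
    StandardScaler(),
    KMeans(n_clusters=2, n_init=10, random_state=0),
)
model.fit(train)
train_labels = model.predict(train)

# Assign newly published listings to the nearest existing cluster.
new_listings = pd.DataFrame({"price": [68, 305], "bedrooms": [1, 3]})
assigned = model.predict(new_listings)
print("assigned clusters:", assigned)

# Periodic refresh: refit on old plus new data (cluster ids may be renumbered).
refreshed = pd.concat([train, new_listings], ignore_index=True)
model.fit(refreshed)
```

Because cluster numbers can change after a refit, any business labels attached to clusters (for example "budget" or "luxury") should be re-derived from the cluster profiles rather than from the numeric ids.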
7. Exceptional Work
- Perform additional analyses:
- Time-series clustering to study seasonal price trends.
- Analyze clusters’ correlation with external factors (e.g., proximity
to landmarks).
Let’s dive into Step 1: Business Understanding for Lab 3 with the
Paris Airbnb dataset.
Step 1: Business Understanding
Objective:
We aim to cluster Airbnb listings in Paris based on their attributes
(e.g., price, property type, amenities) to identify patterns and
insights that can help stakeholders like:
- Hosts: Optimize pricing or improve features to make their properties
more competitive.
- Travelers: Understand which clusters align with their preferences,
such as luxury properties or budget accommodations.
Questions to Answer:
- Why was this data collected?
- This dataset is likely intended to provide insights into the Paris
Airbnb market, focusing on listing characteristics, pricing, and
geographical trends.
- What is the goal of clustering?
- To group similar listings based on their features and identify
distinct patterns in the dataset.
- How will you measure the effectiveness of
clustering?
- Internal Validation:
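- Silhouette Score: Compares each listing’s distance to its own cluster
with its distance to the nearest other cluster; values range from -1 to
1, and higher is better.
- Davies-Bouldin Index: Averages, over all clusters, the worst-case ratio
of within-cluster scatter to between-cluster separation; lower is better.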
- Visualizations:
- Use t-SNE or PCA to visualize cluster separability.
- Descriptive Analysis:
- Profile each cluster based on key attributes (e.g., average price,
property type distribution).
- Why does your chosen validation method make sense?
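- The listings carry no ground-truth labels, so external validation is
not available. Internal metrics quantify how compact and well separated
the clusters are, and profiling each cluster checks that the groups are
interpretable enough to support actionable insights.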
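The descriptive analysis mentioned above (profiling each cluster) might look like the following sketch. The `cluster` column is assumed to hold labels from a fitted model, and the other columns and values are hypothetical.

```python
import pandas as pd

# Hypothetical listings with cluster labels already attached.
clustered = pd.DataFrame({
    "cluster": [0, 0, 0, 1, 1, 1],
    "price": [60.0, 70.0, 65.0, 300.0, 320.0, 310.0],
    "property_type": ["Apartment", "Apartment", "Studio",
                      "Loft", "Loft", "Apartment"],
})

# Size and typical price of each cluster.
price_profile = clustered.groupby("cluster")["price"].agg(["count", "mean", "median"])

# Share of each property type within each cluster (rows sum to 1).
type_share = pd.crosstab(
    clustered["cluster"], clustered["property_type"], normalize="index"
)

print(price_profile)
print(type_share.round(2))
```

Profiles like these are what turn numeric cluster ids into descriptions a host or traveler can use, such as "small, low-priced apartments" versus "large, high-priced lofts".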
Next Step: Data Understanding
We’ll now load the dataset, inspect its structure, and understand its
features.
Action Plan for Data Understanding:
- Load the dataset into a DataFrame.
- Inspect the dataset (columns, data types, missing values).
- Generate basic statistics and visualizations to understand key
attributes.
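The action plan above can be sketched with pandas as follows. The inline CSV stands in for the real file so the example runs on its own; in practice `pd.read_csv` would be pointed at the dataset's actual path, and the column names here are assumptions.

```python
import io

import pandas as pd

# Inline stand-in for the dataset file; replace with the real CSV path.
csv_text = """id,price,bedrooms,property_type,neighbourhood
1,80,1,Apartment,Louvre
2,120,2,Loft,Temple
3,,1,Apartment,Louvre
4,1250,3,Apartment,Passy
"""

# 1. Load the dataset into a DataFrame.
df = pd.read_csv(io.StringIO(csv_text))

# 2. Inspect columns, data types, and missing values in one table.
overview = pd.DataFrame({
    "dtype": df.dtypes.astype(str),
    "missing": df.isna().sum(),
})

# 3. Basic statistics for numeric attributes and counts for a categorical one.
stats = df[["price", "bedrooms"]].describe()
type_counts = df["property_type"].value_counts()

print(df.shape)
print(overview)
print(stats)
print(type_counts)
```

Histograms of `price` and bar charts of `type_counts` (for example with `DataFrame.plot`) would cover the visualization part of the plan.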