---
title: "Group Paris - Lab 1 - ML7331"  
authors: "Troy McSimov, Jessica McPhaul, Trevor Kunz, Christian Castro"  
output: html_notebook  
---

### Step 1: Business Understanding

#### Purpose of the Dataset
We are using the Paris Airbnb dataset, which was likely collected to provide insights into short-term rental properties in Paris, including availability, pricing, and other property-related features. This dataset helps us understand the rental market by analyzing factors such as pricing and location. This data can be useful for Airbnb hosts to optimize their listings, renters to find suitable properties, and Airbnb itself to gain valuable market insights.

#### Defining Outcomes
Our primary outcome is the nightly price of a listing, which we will forecast from attributes such as location, room type, and availability. A secondary question is which features are associated with higher booking rates. We will judge success in mining this dataset by how accurately a machine learning model predicts these outcomes.

#### Measuring Effectiveness
The effectiveness of our prediction algorithms will be measured using metrics like **Mean Absolute Error (MAE)** or **Root Mean Squared Error (RMSE)** for regression models (price prediction). For classification tasks, such as predicting availability or popularity, we will measure effectiveness using **Accuracy** or **AUC-ROC**.
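
To make the two regression metrics concrete, the following minimal sketch computes MAE and RMSE by hand with NumPy. The price and prediction arrays are hypothetical values chosen for illustration, not results from our model.

```python
import numpy as np

# Hypothetical nightly prices and model predictions, for illustration only
y_true = np.array([100.0, 150.0, 80.0, 200.0])
y_pred = np.array([110.0, 140.0, 90.0, 170.0])

errors = y_pred - y_true
mae = np.mean(np.abs(errors))          # average size of the error
rmse = np.sqrt(np.mean(errors ** 2))   # squares errors first, so large misses weigh more

print(f"MAE: {mae:.2f}, RMSE: {rmse:.2f}")
```

RMSE is never smaller than MAE. The gap between them grows when a few predictions miss badly, which is why we may report both.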

---

### Step 2: Data Understanding

1. **Describe Data Attributes**  
   Our dataset contains several attributes, which we describe as follows:
   - `listing_id`: Unique identifier for each listing.
   - `name`: Name of the Airbnb listing.
   - `host_id`: Identifier for the host.
   - `price`: The cost per night of the listing (numerical).
   - `room_type`: The type of room (categorical: e.g., Entire home/apt, Private room, etc.).
   - `neighbourhood`: The area in Paris (categorical).
   - `number_of_reviews`: The number of reviews (numerical).
   - `availability_365`: Number of days the listing is available in a year (numerical).

2. **Verify Data Quality**  
   We need to check the dataset for missing values, duplicates, and outliers, and decide how to address these issues.
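
Before any quality checks, it may help to confirm that the columns we expect are actually present. The sketch below assumes a small helper, `missing_columns`, which is our own hypothetical function and not part of pandas. It runs on a tiny stand-in frame, so it does not need the CSV.

```python
import pandas as pd

def missing_columns(df: pd.DataFrame, expected: list[str]) -> list[str]:
    """Return the expected column names that are absent from df."""
    return [col for col in expected if col not in df.columns]

# Tiny stand-in frame (hypothetical values) so the sketch runs without the CSV
sample = pd.DataFrame({
    "listing_id": [1, 2],
    "price": [95.0, 120.0],
    "room_type": ["Entire home/apt", "Private room"],
})

expected = ["listing_id", "price", "room_type", "neighbourhood"]
absent = missing_columns(sample, expected)

print("Missing columns:", absent)
print(sample.dtypes)  # confirm numerical vs. categorical types
```

On the real data we would pass the loaded `df` and the full attribute list instead of `sample`.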

#### Data Quality
- **Missing Values**  
  We will identify and handle any missing values using the following approach:
  
  ```python
  # Count missing values per column and display the result
  missing_data = df.isnull().sum()
  print(missing_data)
  
  # Visualize missing data (optional)
  import seaborn as sns
  import matplotlib.pyplot as plt
  
  sns.heatmap(df.isnull(), cbar=False)
  plt.show()
  ```

  Depending on the results, we will decide whether to drop rows with missing values, impute missing values using the mean or median, or ignore them if insignificant.

- **Duplicates**  
  We will remove any duplicate entries using this code:
  
  ```python
  # Check for duplicates
  duplicates = df[df.duplicated()]
  df_cleaned = df.drop_duplicates()
  ```

- **Outliers**  
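  We will inspect numerical features such as `price` with box plots, whose whiskers by default extend 1.5 times the interquartile range (IQR) beyond the quartiles. Points outside the whiskers are candidate outliers, and we might use the same IQR rule to flag and handle them: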
  
  ```python
  sns.boxplot(x=df['price'])
  plt.show()
  ```
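
The IQR rule mentioned above can be sketched as follows. The helper `iqr_bounds` and the multiplier `k = 1.5` are our own illustrative choices, and the prices are hypothetical.

```python
import pandas as pd

def iqr_bounds(values: pd.Series, k: float = 1.5) -> tuple[float, float]:
    """Return the (lower, upper) fences Q1 - k*IQR and Q3 + k*IQR."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical prices; the last value is an obvious outlier
prices = pd.Series([60, 75, 80, 90, 100, 110, 120, 2500], name="price")

lower, upper = iqr_bounds(prices)
outliers = prices[(prices < lower) | (prices > upper)]

print(f"Fences: [{lower:.3f}, {upper:.3f}]")
print(outliers)
```

Whether flagged rows should be dropped, capped, or kept is a judgment call. Very expensive listings may be genuine rather than data errors.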

#### Simple Statistics
We will calculate basic statistics for the most important attributes:
  
```python
# Basic statistics
df.describe()
```

---

### Step 3: Data Preprocessing and Visualization

#### Importing Libraries and Loading Data

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Load the dataset
df = pd.read_csv('paris_airbnb.csv')

# Initial overview
df.head()
```

#### Handling Missing Data

```python
# Check for missing values
missing_values = df.isnull().sum()
print(missing_values)

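# Fill missing values in 'price' with the column mean, if applicable.
# Assigning the result back avoids the chained in-place call, which newer pandas versions warn about.
df['price'] = df['price'].fillna(df['price'].mean())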
```
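
As one possible way to choose between dropping and imputing, the sketch below reports the share of missing values per column and then fills the gaps. It uses the median for the numeric column and the most frequent value for the categorical one. The frame is a hypothetical stand-in, and using the median rather than the mean is our assumption, since price distributions are often skewed.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with gaps, standing in for the real data
sample = pd.DataFrame({
    "price": [80.0, np.nan, 120.0, 95.0],
    "room_type": ["Private room", "Entire home/apt", None, "Entire home/apt"],
})

# Share of missing values per column, as a fraction between 0 and 1
missing_share = sample.isnull().mean()
print(missing_share)

# Median for the numeric column, most frequent value for the categorical one
sample["price"] = sample["price"].fillna(sample["price"].median())
sample["room_type"] = sample["room_type"].fillna(sample["room_type"].mode().iloc[0])
print(sample)
```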

#### Removing Duplicates

```python
# Remove duplicate rows
df.drop_duplicates(inplace=True)
```
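
Exact row duplicates are only one kind of repetition. A listing can also appear twice with slightly different values. The sketch below checks for repeats on `listing_id` alone, assuming that column is meant to be unique. The rows are hypothetical.

```python
import pandas as pd

# Hypothetical rows: listing 1 appears twice with different prices
sample = pd.DataFrame({
    "listing_id": [1, 2, 1, 3],
    "price": [100.0, 80.0, 105.0, 60.0],
})

# Every row whose listing_id occurs more than once, shown for inspection
repeated = sample[sample.duplicated(subset="listing_id", keep=False)]
print(repeated)

# Keep the first row per listing_id; keep="last" would keep the final one instead
deduped = sample.drop_duplicates(subset="listing_id", keep="first")
print(deduped)
```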

#### Descriptive Statistics

```python
# Descriptive statistics for numerical columns
print(df.describe())

# Range, mode, mean, median, variance for price
price_range = df['price'].max() - df['price'].min()
price_mode = df['price'].mode().iloc[0]  # mode() returns a Series; take the first mode
price_mean = df['price'].mean()
price_median = df['price'].median()
price_variance = df['price'].var()

print(f"Price Range: {price_range}, Mode: {price_mode}, Mean: {price_mean}, Median: {price_median}, Variance: {price_variance}")
```

---

### Step 4: Visualization of Important Attributes

We visualize the most important attributes like `price`, `room_type`, and `availability_365`.

1. **Price Distribution**

```python
plt.figure(figsize=(10,6))
sns.histplot(df['price'], bins=50, kde=True)
plt.title('Price Distribution of Airbnb Listings in Paris')
plt.xlabel('Price')
plt.ylabel('Frequency')
plt.show()
```

2. **Room Type Breakdown**

```python
plt.figure(figsize=(8,5))
sns.countplot(x='room_type', data=df)
plt.title('Distribution of Room Types')
plt.xlabel('Room Type')
plt.ylabel('Count')
plt.show()
```

3. **Availability of Listings**

```python
plt.figure(figsize=(10,6))
sns.histplot(df['availability_365'], bins=30, kde=True)
plt.title('Availability of Airbnb Listings in Paris')
plt.xlabel('Days Available')
plt.ylabel('Frequency')
plt.show()
```

#### Exploring Relationships Between Attributes

1. **Price vs. Room Type**

```python
plt.figure(figsize=(8,5))
sns.boxplot(x='room_type', y='price', data=df)
plt.title('Room Type vs. Price')
plt.xlabel('Room Type')
plt.ylabel('Price')
plt.show()
```
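
A numeric summary can accompany the box plot. The sketch below groups a hypothetical stand-in frame by `room_type` and reports the count, median, and mean price per group.

```python
import pandas as pd

# Hypothetical listings, for illustration only
sample = pd.DataFrame({
    "room_type": ["Entire home/apt", "Private room", "Entire home/apt",
                  "Private room", "Shared room"],
    "price": [150.0, 60.0, 110.0, 80.0, 35.0],
})

# Count, median, and mean price per room type, highest median first
summary = (
    sample.groupby("room_type")["price"]
    .agg(["count", "median", "mean"])
    .sort_values("median", ascending=False)
)
print(summary)
```

The median is listed alongside the mean because a handful of very expensive listings can pull the mean upward.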

2. **Correlation Matrix**

```python
plt.figure(figsize=(10,6))
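# Restrict to numeric columns; corr() raises an error on text columns in recent pandas versions
corr_matrix = df.select_dtypes(include='number').corr()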
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm')
plt.title('Correlation Matrix')
plt.show()
```
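
Since price is our prediction target, it can be useful to read off only the `price` column of the correlation matrix, ordered by strength. The sketch below does this on a hypothetical stand-in frame.

```python
import pandas as pd

# Hypothetical listings; room_type is included to show that text columns are skipped
sample = pd.DataFrame({
    "price": [50.0, 80.0, 120.0, 200.0, 90.0],
    "number_of_reviews": [120, 60, 30, 5, 45],
    "availability_365": [30, 200, 90, 365, 150],
    "room_type": ["Private room", "Entire home/apt", "Entire home/apt",
                  "Entire home/apt", "Private room"],
})

numeric = sample.select_dtypes(include="number")
price_corr = numeric.corr()["price"].drop("price")

# Order by absolute correlation, strongest first, keeping the sign
ranked = price_corr.reindex(price_corr.abs().sort_values(ascending=False).index)
print(ranked)
```

Pearson correlation only captures linear association, so a weak value here does not rule out a nonlinear relationship with price.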

---

### Step 5: Dimensionality Reduction

We use PCA (Principal Component Analysis) to reduce the dimensionality of the dataset and visualize the results:

```python
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Standardize the features (PCA cannot handle NaN, so drop incomplete rows first)
features = ['price', 'number_of_reviews', 'availability_365']
x = df[features].dropna().values
x = StandardScaler().fit_transform(x)

# PCA
pca = PCA(n_components=2)
pca_result = pca.fit_transform(x)

# Visualize PCA result
plt.scatter(pca_result[:, 0], pca_result[:, 1])
plt.title('PCA of Airbnb Features')
plt.show()
```
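
To judge how much information two components retain, we can look at the share of variance each component explains. The sketch below computes this directly with NumPy on synthetic data, so that the arithmetic is visible. The three columns are random stand-ins, not our features.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for three numeric features; the second is built to track the first
a = rng.normal(size=200)
b = 0.9 * a + 0.1 * rng.normal(size=200)
c = rng.normal(size=200)
x = np.column_stack([a, b, c])

# Standardize each column to zero mean and unit variance
x_std = (x - x.mean(axis=0)) / x.std(axis=0)

# Eigenvalues of the covariance matrix are the variances along each principal component
eigenvalues = np.linalg.eigvalsh(np.cov(x_std, rowvar=False))[::-1]  # largest first
explained_ratio = eigenvalues / eigenvalues.sum()

print(explained_ratio)
print(f"First two components: {explained_ratio[:2].sum():.1%} of variance")
```

With scikit-learn, the same quantity is available as `pca.explained_variance_ratio_` after fitting.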

---

### Step 6: Exceptional Work - Creating Additional Features

We suggest adding new features to enhance our analysis:
  
- **Price per person:** Derived by dividing the price by the number of guests a property accommodates.
- **Luxury flag:** Classify listings as luxury if their price is in the top 10%.

```python
# Example: Adding a feature for luxury listings
df['is_luxury'] = np.where(df['price'] > df['price'].quantile(0.9), 1, 0)
```
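
The price-per-person feature could be sketched as follows. We assume the guest capacity is stored in a column named `accommodates`. That name is a guess and should be checked against the actual CSV header. The values are hypothetical.

```python
import numpy as np
import pandas as pd

# `accommodates` is an assumed column name for guest capacity
sample = pd.DataFrame({
    "price": [120.0, 90.0, 200.0, 75.0],
    "accommodates": [4, 2, 0, 3],
})

# Treat a capacity of 0 as missing so the division yields NaN rather than infinity
capacity = sample["accommodates"].replace(0, np.nan)
sample["price_per_person"] = sample["price"] / capacity
print(sample)
```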

---

### Final Step: Report
We will compile all of our findings and code into a well-documented Jupyter notebook or a PDF/HTML report. The final report will include explanations of each step, the Python code used, and conclusions drawn from our analysis.
