ECON 465 Stage 1

PACKAGES

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.2.1     ✔ readr     2.2.0
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.3     ✔ tibble    3.3.1
✔ lubridate 1.9.5     ✔ tidyr     1.3.2
✔ purrr     1.2.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(janitor)

Attaching package: 'janitor'

The following objects are masked from 'package:stats':

    chisq.test, fisher.test
library(tidymodels)
── Attaching packages ────────────────────────────────────── tidymodels 1.5.0 ──
✔ broom        1.0.12     ✔ rsample      1.3.2 
✔ dials        1.4.3      ✔ tailor       0.1.0 
✔ infer        1.1.0      ✔ tune         2.1.0 
✔ modeldata    1.5.1      ✔ workflows    1.3.0 
✔ parsnip      1.6.0      ✔ workflowsets 1.1.1 
✔ recipes      1.3.2      ✔ yardstick    1.4.0 
── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
✖ scales::discard() masks purrr::discard()
✖ dplyr::filter()   masks stats::filter()
✖ recipes::fixed()  masks stringr::fixed()
✖ dplyr::lag()      masks stats::lag()
✖ yardstick::spec() masks readr::spec()
✖ recipes::step()   masks stats::step()

Stage 1: Data Acquisition & Probability Foundations

This project analyzes two economic datasets using predictive modeling techniques.

  • Dataset 1 (Airbnb): regression analysis is used to predict listing prices.
  • Dataset 2 (Bank): classification analysis is used to predict whether a customer will subscribe to a term deposit.

1.1 Data Import

Airbnb Dataset

The Airbnb dataset was obtained from Kaggle and contains information about Airbnb listings in New York City.

Economic Question:

What factors determine Airbnb listing prices in NYC?

Bank Dataset

The Bank dataset was obtained from the UCI Machine Learning Repository and contains customer information related to bank marketing campaigns.

Economic Question:

Can we predict whether a customer will subscribe to a term deposit?

# IMPORT DATASETS

airbnb <- read_csv("Airbnb_data.csv")
Rows: 48895 Columns: 16
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr   (5): name, host_name, neighbourhood_group, neighbourhood, room_type
dbl  (10): id, host_id, latitude, longitude, price, minimum_nights, number_o...
date  (1): last_review

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
bank <- read_csv2("Bank_data.csv")
ℹ Using "','" as decimal and "'.'" as grouping mark. Use `read_delim()` for more control.
Rows: 41188 Columns: 21
── Column specification ────────────────────────────────────────────────────────
Delimiter: ";"
chr (12): job, marital, education, default, housing, loan, contact, month, d...
dbl  (5): age, duration, campaign, pdays, previous
num  (4): emp.var.rate, cons.price.idx, cons.conf.idx, nr.employed
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
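
A note on the import above: the message says read_csv2() treats "," as the decimal mark and "." as the grouping mark. The Bank file appears to use "." for decimals, which would explain why the structure output in section 1.3 shows emp_var_rate as 11 and cons_price_idx as 93994, and why euribor3m is read as text. The sketch below is a minimal illustration on a hypothetical three-line file (the rows are made up for the example, not taken from Bank_data.csv); it assumes that read_delim() with delim = ";" is an acceptable replacement, since it keeps the default "." decimal mark.

```r
library(readr)

# Hypothetical file in the same layout: ";" delimiter, "." decimals.
tmp <- tempfile(fileext = ".csv")
writeLines(c("job;emp.var.rate;cons.conf.idx",
             "admin.;1.1;-36.4",
             "services;-1.8;-46.2"), tmp)

# read_csv2() assumes "," decimals and "." grouping marks,
# so the dots in this file are not read as decimal points.
as_csv2 <- read_csv2(tmp, show_col_types = FALSE)

# read_delim() with an explicit delimiter keeps the default "." decimal mark.
as_delim <- read_delim(tmp, delim = ";", show_col_types = FALSE)

print(as_csv2)
print(as_delim)
```

If the same change were applied to the real file, the column specification and every downstream result for the Bank data would need to be regenerated.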

1.2 Data Cleaning

The datasets were cleaned by:

  • standardizing variable names
  • removing missing values
  • ensuring correct variable formats

# CLEAN AIRBNB DATA

airbnb <- airbnb %>%
  clean_names() %>%
  drop_na()

# CLEAN BANK DATA

bank <- bank %>%
  clean_names() %>%
  drop_na()

# CONVERT TARGET VARIABLE TO FACTOR

bank$y <- as.factor(bank$y)
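
Calling drop_na() with no arguments removes every row that has a missing value in any column. In this report the Airbnb data go from 48,895 imported rows to 38,821 rows, while the Bank data keep all 41,188 rows (its gaps appear to be coded as the text "unknown" rather than NA). The sketch below shows one way to see which columns drive the loss before dropping anything. The toy data frame is hypothetical, and whether the review columns are the cause in the real data is an assumption that has not been checked here.

```r
library(tidyr)

# Hypothetical toy data: two listings have no review information.
toy <- data.frame(
  price = c(149, 225, 89, 80),
  last_review = as.Date(c("2018-10-19", NA, "2019-07-05", NA)),
  reviews_per_month = c(0.21, NA, 4.64, NA)
)

# Count missing values per column before removing rows.
na_counts <- colSums(is.na(toy))
print(na_counts)

# Dropping on every column removes any row with at least one NA.
all_cols <- drop_na(toy)

# Dropping only on the columns the model needs keeps more rows.
price_only <- drop_na(toy, price)

print(nrow(all_cols))
print(nrow(price_only))
```

Restricting drop_na() to the modeling columns is a design choice: it avoids discarding listings only because an unused column is empty.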

1.3 Data Overview

Airbnb Dataset Structure

glimpse(airbnb)
Rows: 38,821
Columns: 16
$ id                             <dbl> 2539, 2595, 3831, 5022, 5099, 5121, 517…
$ name                           <chr> "Clean & quiet apt home by the park", "…
$ host_id                        <dbl> 2787, 2845, 4869, 7192, 7322, 7356, 896…
$ host_name                      <chr> "John", "Jennifer", "LisaRoxanne", "Lau…
$ neighbourhood_group            <chr> "Brooklyn", "Manhattan", "Brooklyn", "M…
$ neighbourhood                  <chr> "Kensington", "Midtown", "Clinton Hill"…
$ latitude                       <dbl> 40.64749, 40.75362, 40.68514, 40.79851,…
$ longitude                      <dbl> -73.97237, -73.98377, -73.95976, -73.94…
$ room_type                      <chr> "Private room", "Entire home/apt", "Ent…
$ price                          <dbl> 149, 225, 89, 80, 200, 60, 79, 79, 150,…
$ minimum_nights                 <dbl> 1, 1, 1, 10, 3, 45, 2, 2, 1, 5, 2, 4, 2…
$ number_of_reviews              <dbl> 9, 45, 270, 9, 74, 49, 430, 118, 160, 5…
$ last_review                    <date> 2018-10-19, 2019-05-21, 2019-07-05, 20…
$ reviews_per_month              <dbl> 0.21, 0.38, 4.64, 0.10, 0.59, 0.40, 3.4…
$ calculated_host_listings_count <dbl> 6, 2, 1, 1, 1, 1, 1, 1, 4, 1, 1, 3, 1, …
$ availability_365               <dbl> 365, 355, 194, 0, 129, 0, 220, 0, 188, …

Bank Dataset Structure

glimpse(bank)
Rows: 41,188
Columns: 21
$ age            <dbl> 56, 57, 37, 40, 56, 45, 59, 41, 24, 25, 41, 25, 29, 57,…
$ job            <chr> "housemaid", "services", "services", "admin.", "service…
$ marital        <chr> "married", "married", "married", "married", "married", …
$ education      <chr> "basic.4y", "high.school", "high.school", "basic.6y", "…
$ default        <chr> "no", "unknown", "no", "no", "no", "unknown", "no", "un…
$ housing        <chr> "no", "no", "yes", "no", "no", "no", "no", "no", "yes",…
$ loan           <chr> "no", "no", "no", "no", "yes", "no", "no", "no", "no", …
$ contact        <chr> "telephone", "telephone", "telephone", "telephone", "te…
$ month          <chr> "may", "may", "may", "may", "may", "may", "may", "may",…
$ day_of_week    <chr> "mon", "mon", "mon", "mon", "mon", "mon", "mon", "mon",…
$ duration       <dbl> 261, 149, 226, 151, 307, 198, 139, 217, 380, 50, 55, 22…
$ campaign       <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ pdays          <dbl> 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, …
$ previous       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ poutcome       <chr> "nonexistent", "nonexistent", "nonexistent", "nonexiste…
$ emp_var_rate   <dbl> 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11,…
$ cons_price_idx <dbl> 93994, 93994, 93994, 93994, 93994, 93994, 93994, 93994,…
$ cons_conf_idx  <dbl> -364, -364, -364, -364, -364, -364, -364, -364, -364, -…
$ euribor3m      <chr> "4.857", "4.857", "4.857", "4.857", "4.857", "4.857", "…
$ nr_employed    <dbl> 5191, 5191, 5191, 5191, 5191, 5191, 5191, 5191, 5191, 5…
$ y              <fct> no, no, no, no, no, no, no, no, no, no, no, no, no, no,…

1.4 Summary Statistics

Airbnb Target Variable: Price

summary(airbnb$price)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    0.0    69.0   101.0   142.3   170.0 10000.0 
sd(airbnb$price)
[1] 196.9948

Bank Target Variable: y

table(bank$y)

   no   yes 
36548  4640 
prop.table(table(bank$y))

       no       yes 
0.8873458 0.1126542 
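
The class shares above imply a simple benchmark for the later classification work: a rule that always predicts the majority class. The sketch below computes that baseline from the counts in the table. The variable names are illustrative and not part of the original analysis.

```r
# Class counts copied from table(bank$y) above.
counts <- c(no = 36548, yes = 4640)

# Share of each class in the sample.
shares <- counts / sum(counts)

# A classifier that always predicts the majority class
# is correct exactly as often as that class occurs.
baseline_class <- names(which.max(counts))
baseline_accuracy <- max(shares)

print(baseline_class)
print(round(baseline_accuracy, 4))
```

Because the classes are imbalanced, accuracy alone can look high for a model that never predicts "yes", so any fitted model is worth comparing against this figure.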

1.5 Probability Distribution Analysis

Airbnb Price Distribution

ggplot(airbnb, aes(x = price)) +
  geom_histogram(bins = 50) +
  theme_minimal() +
  labs(
    title = "Airbnb Price Distribution",
    x = "Price",
    y = "Frequency"
  )

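The Airbnb price distribution is strongly right-skewed: the mean (142.3) is well above the median (101.0), and a small number of listings are priced far above the third quartile of 170, up to a maximum of 10,000.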

Log Transformation

airbnb <- airbnb %>%
  filter(price > 0) %>%
  mutate(log_price = log(price))

Log-Transformed Price Distribution

ggplot(airbnb, aes(x = log_price)) +
  geom_histogram(bins = 50) +
  theme_minimal() +
  labs(
    title = "Log Airbnb Price Distribution",
    x = "Log Price",
    y = "Frequency"
  )

After the log transformation, the distribution is more symmetric and closer to a normal shape.

This suggests that Airbnb prices approximately follow a log-normal distribution.
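
The visual comparison can be backed by a number. The sketch below computes sample skewness before and after taking logs. It uses simulated log-normal values rather than the Airbnb file, so the parameters (meanlog = 4.7, sdlog = 0.7) and the helper skewness() are assumptions made for illustration only. It also shows why rows with price = 0 are filtered out first: log(0) is -Inf.

```r
# log(0) is -Inf, which is why zero prices must be removed before log().
stopifnot(is.infinite(log(0)))

# Simulated prices from a log-normal distribution (illustrative parameters).
set.seed(465)
sim_price <- rlnorm(10000, meanlog = 4.7, sdlog = 0.7)

# Sample skewness: mean of cubed standardized values.
skewness <- function(x) {
  z <- (x - mean(x)) / sd(x)
  mean(z^3)
}

skew_raw <- skewness(sim_price)       # clearly positive for log-normal data
skew_log <- skewness(log(sim_price))  # close to zero after the transform

print(c(raw = skew_raw, log = skew_log))
```

Applying the same helper to airbnb$price and airbnb$log_price would give a comparable check on the real data.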

Bank Target Variable Distribution

ggplot(bank, aes(x = y)) +
  geom_bar() +
  theme_minimal() +
  labs(
    title = "Bank Term Deposit Subscription",
    x = "Subscription",
    y = "Count"
  )

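Each customer's outcome is a single yes/no trial, so the target variable can be modeled as a Bernoulli random variable. In this sample the estimated probability of "yes" is about 0.113 (4,640 of 41,188 customers).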
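
For a Bernoulli variable with success probability p, the mean is p and the variance is p(1 - p). The sketch below estimates these from the counts reported in section 1.4 and adds an approximate standard error for the estimated probability. The variable names are illustrative, and the standard error assumes independent observations.

```r
# Counts copied from table(bank$y) in section 1.4.
n_yes <- 4640
n_no <- 36548
n <- n_yes + n_no

# Estimated Bernoulli parameter: the sample share of "yes".
p_hat <- n_yes / n

# Bernoulli variance and the standard error of p_hat.
var_hat <- p_hat * (1 - p_hat)
se_hat <- sqrt(var_hat / n)

print(round(p_hat, 7))
print(signif(se_hat, 3))
```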

Stage 1 Conclusion

The Airbnb dataset is suitable for regression analysis because the target variable is continuous.

The Bank dataset is suitable for classification analysis because the target variable is binary.

The probability distribution analysis showed that:

  • Airbnb prices are skewed and improve after log transformation
  • Bank subscription outcomes follow a Bernoulli distribution