ECON 465 Stage 1

PACKAGES

Airbnb: Kaggle NYC Airbnb dataset (URL)
Bank: UCI Bank Marketing dataset (URL)

install.packages("tidyverse")
install.packages("janitor")
library(tidyverse)
library(janitor)

1.1 DATA IMPORT

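airbnb <- read_csv("data/Airbnb_data.csv")
# Semicolon-delimited file: read_delim() keeps "." as the decimal mark, whereas read_csv2() assumes ",".
bank <- read_delim("data/Bank_data.csv", delim = ";")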

1.1 DATA CLEANING

airbnb <- airbnb %>% clean_names() %>% drop_na()

bank <- bank %>% clean_names() %>% drop_na()

1.2 ECONOMIC QUESTIONS

Airbnb: What factors determine Airbnb listing prices in NYC?
Bank: Can we predict whether a customer will subscribe to a term deposit?

1.3 SUMMARY STATISTICS

glimpse(airbnb)
glimpse(bank)

AIRBNB SUMMARY (TARGET = price)

summary(airbnb$price)
sd(airbnb$price)

BANK SUMMARY (TARGET = y)

table(bank$y)
prop.table(table(bank$y))

1.3 HISTOGRAMS

Airbnb price distribution

ggplot(airbnb, aes(x = price)) +
  geom_histogram(bins = 50) +
  theme_minimal() +
  labs(title = "Airbnb Price Distribution")

Bank target distribution (y)

ggplot(bank, aes(x = y)) +
  geom_bar() +
  theme_minimal() +
  labs(title = "Bank Term Deposit Subscription (y)")

1.3 LOG TRANSFORMATION

airbnb <- airbnb %>% filter(price > 0) %>% mutate(log_price = log(price))

1.3 LOG HISTOGRAMS

ggplot(airbnb, aes(x = log_price)) +
  geom_histogram(bins = 50) +
  labs(title = "Log Airbnb Price Distribution")

1.4 PROBABILITY INTERPRETATION

AIRBNB INTERPRETATION

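Price is right-skewed: a small number of very expensive listings stretch the upper tail far beyond the bulk of the data. After the log transformation, the distribution of log_price is more symmetric. This suggests that price is approximately log-normally distributed, that is, log(price) is approximately normal.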
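
The skewness argument can be checked numerically. The sketch below is illustrative only: it uses simulated log-normal draws (`sim_price`, a hypothetical stand-in, not the Kaggle data) and a hand-written `skewness()` helper, since base R does not ship one. The parameters `meanlog = 5` and `sdlog = 0.7` are arbitrary assumptions.

```r
# Hypothetical stand-in for airbnb$price: simulated log-normal draws,
# NOT the real Kaggle data. The parameters are arbitrary assumptions.
set.seed(465)
sim_price <- rlnorm(5000, meanlog = 5, sdlog = 0.7)

# Sample skewness: third central moment divided by the
# second central moment raised to the power 3/2.
skewness <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

skew_raw <- skewness(sim_price)       # clearly positive for right-skewed data
skew_log <- skewness(log(sim_price))  # near 0 if the log is roughly normal

print(c(raw = skew_raw, log = skew_log))
```

To apply the same check to the real data, `sim_price` could be replaced with `airbnb$price` after the `filter(price > 0)` step. A log skewness near zero is consistent with a log-normal shape but does not prove it.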

BANK INTERPRETATION

The target variable y is binary (yes/no), so each customer's outcome can be modeled as a Bernoulli trial. The sample proportion of "yes" responses is an estimate of the success probability p.
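
As a minimal sketch of that interpretation, the example below estimates p from a small made-up response vector (`y`, hypothetical, not the UCI data) and plugs the estimate into the Bernoulli variance formula p(1 - p). The names `y01`, `p_hat`, and `var_hat` are introduced here for illustration.

```r
# Hypothetical stand-in for bank$y: a small made-up vector of responses,
# NOT the real UCI data.
y <- c("no", "no", "yes", "no", "no", "no", "yes", "no", "no", "no")

# Code the outcome as 0/1 so that the sample mean is the share of "yes".
y01 <- as.integer(y == "yes")

p_hat   <- mean(y01)            # estimate of the Bernoulli parameter p
var_hat <- p_hat * (1 - p_hat)  # Bernoulli variance p(1 - p), evaluated at p_hat

print(c(p_hat = p_hat, var_hat = var_hat))
```

With the real data, `mean(bank$y == "yes")` should give the same number as the "yes" entry of `prop.table(table(bank$y))`.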