ECON 465 Stage 1

PACKAGES

Airbnb: Kaggle NYC Airbnb dataset (URL)

Bank: UCI Bank Marketing dataset (URL)

install.packages("tidyverse")
Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.6'
(as 'lib' is unspecified)
install.packages("janitor")
Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.6'
(as 'lib' is unspecified)
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.2.1     ✔ readr     2.2.0
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.3     ✔ tibble    3.3.1
✔ lubridate 1.9.5     ✔ tidyr     1.3.2
✔ purrr     1.2.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(janitor)

Attaching package: 'janitor'

The following objects are masked from 'package:stats':

    chisq.test, fisher.test

1.1 DATA IMPORT

airbnb <- read_csv("data/Airbnb_data.csv")
bank <- read_csv2("data/Bank_data.csv")

1.1 DATA CLEANING

airbnb <- airbnb %>% clean_names() %>% drop_na()

bank <- bank %>% clean_names() %>% drop_na()
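Because drop_na() with no arguments removes every row that has a missing value in any column, it can be worth counting missing values per column before dropping anything. A minimal sketch in base R, using a small made-up data frame (`toy` and its column names are hypothetical, not the Airbnb or Bank data):

```r
# Hypothetical toy data; names and values are illustrative only.
toy <- data.frame(
  price = c(100, 250, NA, 80),
  reviews_per_month = c(1.2, NA, NA, 0.4),
  room_type = c("Private room", "Entire home/apt", "Shared room", "Private room")
)

# Number of missing values in each column.
na_counts <- colSums(is.na(toy))
print(na_counts)

# Rows with no missing value in any column: these are the rows
# that survive a drop_na() call with no column arguments.
n_complete <- sum(complete.cases(toy))
cat("Rows before:", nrow(toy), " rows with no NA:", n_complete, "\n")
```

If most of the missing values sit in a column the analysis does not need, naming only the relevant columns in drop_na() would keep more rows.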

1.2 ECONOMIC QUESTIONS

Airbnb: What factors determine Airbnb listing prices in NYC?

Bank: Can we predict whether a customer will subscribe to a term deposit?

1.3 SUMMARY STATISTICS

glimpse(airbnb)
glimpse(bank)

AIRBNB SUMMARY (TARGET = price)

summary(airbnb$price)
sd(airbnb$price)

BANK SUMMARY (TARGET = y)

table(bank$y)
prop.table(table(bank$y))

1.3 HISTOGRAMS

Airbnb price distribution

ggplot(airbnb, aes(x = price)) + geom_histogram(bins = 50) + theme_minimal() + labs(title = "Airbnb Price Distribution")

Bank target distribution (y)

ggplot(bank, aes(x = y)) + geom_bar() + theme_minimal() + labs(title = "Bank Term Deposit Subscription (y)")

1.3 LOG TRANSFORMATION

airbnb <- airbnb %>% filter(price > 0) %>% mutate(log_price = log(price))
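The price > 0 filter matters because the natural log of zero is -Inf in R, which would otherwise end up in log_price and distort summaries and plots. A small base R sketch with made-up values (the `prices` vector is hypothetical):

```r
# Hypothetical prices, including one zero-priced listing.
prices <- c(0, 50, 120, 900)

# Taking logs without filtering: the zero becomes -Inf.
log_prices <- log(prices)
print(log_prices)
print(is.finite(log_prices))

# Filtering first, as in the pipeline above, leaves only finite values.
kept <- prices[prices > 0]
log_kept <- log(kept)
print(log_kept)
```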

1.3 LOG HISTOGRAMS

ggplot(airbnb, aes(x = log_price)) + geom_histogram(bins = 50) + labs(title = "Log Airbnb Price Distribution")

1.4 PROBABILITY INTERPRETATION

AIRBNB INTERPRETATION

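Price is right-skewed: a small number of very expensive listings produce a long upper tail. After the log transformation the distribution is much closer to symmetric, which is consistent with price being approximately log-normal (that is, log(price) being approximately normal).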
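One way to put a number on "right-skewed" versus "more symmetric" is to compare sample skewness before and after taking logs. The sketch below uses simulated log-normal data rather than the Airbnb file; the seed, the sample size, the meanlog and sdlog values, and the `skewness` helper are all assumptions made for illustration.

```r
# Simulated stand-in for prices (not the real data); parameters are arbitrary.
set.seed(465)
sim_price <- rlnorm(5000, meanlog = 5, sdlog = 0.7)

# Simple moment-based sample skewness (hypothetical helper, base R only).
skewness <- function(x) {
  z <- (x - mean(x)) / sd(x)
  mean(z^3)
}

skew_raw <- skewness(sim_price)        # clearly positive for log-normal data
skew_log <- skewness(log(sim_price))   # close to zero if the logs are roughly normal

cat("Skewness of price:     ", round(skew_raw, 2), "\n")
cat("Skewness of log(price):", round(skew_log, 2), "\n")
```

Applied to the real data, the same comparison would use airbnb$price and airbnb$log_price. A normal Q-Q plot of the logged values, via qqnorm() and qqline(), gives a complementary visual check.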

BANK INTERPRETATION

The target variable y is binary (yes/no), so each customer's outcome can be modeled as a Bernoulli trial. The sample proportion of yes responses is an estimate of the success probability p.
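As a sketch of how the estimate of p and its uncertainty could be computed, the base R code below uses a made-up response vector (12 yes and 88 no, purely hypothetical and not the Bank data) and a normal-approximation (Wald) interval, which is one common choice among several.

```r
# Hypothetical responses: 12 "yes" and 88 "no".
y <- rep(c("yes", "no"), times = c(12, 88))

n <- length(y)
p_hat <- mean(y == "yes")             # sample proportion of successes

# Standard error of a Bernoulli proportion: sqrt(p(1 - p) / n).
se <- sqrt(p_hat * (1 - p_hat) / n)

# Approximate 95% interval based on the normal approximation.
ci <- p_hat + c(-1, 1) * qnorm(0.975) * se

cat("n =", n, " p_hat =", p_hat, "\n")
cat("Approx. 95% CI:", round(ci[1], 3), "to", round(ci[2], 3), "\n")
```

With the real data, y would be replaced by bank$y, and p_hat matches the "yes" entry of prop.table(table(bank$y)).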