ECON 465 Data Science Project Stage 2

Author

Cemre Nur Hascan (20230201019)

Published

26.05.2026

Stage 1 Introduction

In this stage, I acquire two distinct datasets, formulate relevant economic questions, clean the data, and analyze the probability distributions of my target variables.

# Load necessary libraries
library(tidyverse)

Dataset 1: Regression (New York City Airbnb Prices)

1.1 Dataset Description and Economic Relevance

Source: Kaggle - New York City Airbnb Open Data, https://www.kaggle.com/datasets/dgomonov/new-york-city-airbnb-open-data?resource=download

Housing and short-term rental markets are critical components of urban economics. Understanding how location, room type, and availability affect rental prices helps explain microeconomic pricing behavior and demand patterns in the tourism sector. By analyzing this data, I can observe how scarcity and property features translate into market value in one of the most competitive real estate markets in the world.

Target Variable: price (Continuous)

1.2 The Economic Question

“What structural and locational factors best predict the short-term rental price of an Airbnb property in New York City?”

1.3 Data Import and Cleaning

In this step, I import the dataset and remove columns that I do not use as predictors (id, name, host_name, and last_review). I also filter out invalid prices (price = 0), replace missing reviews_per_month values with 0, and drop any remaining incomplete rows.

# Import data
airbnb_data <- read_csv("AB_NYC_2019.csv")

# Clean data
airbnb_clean <- airbnb_data %>%
  select(-id, -name, -host_name, -last_review) %>%
  filter(price > 0) %>%
  mutate(reviews_per_month = replace_na(reviews_per_month, 0)) %>%
  drop_na()

# Preview the clean data
head(airbnb_clean)
# A tibble: 6 × 12
  host_id neighbourhood_group neighbourhood latitude longitude room_type   price
    <dbl> <chr>               <chr>            <dbl>     <dbl> <chr>       <dbl>
1    2787 Brooklyn            Kensington        40.6     -74.0 Private ro…   149
2    2845 Manhattan           Midtown           40.8     -74.0 Entire hom…   225
3    4632 Manhattan           Harlem            40.8     -73.9 Private ro…   150
4    4869 Brooklyn            Clinton Hill      40.7     -74.0 Entire hom…    89
5    7192 Manhattan           East Harlem       40.8     -73.9 Entire hom…    80
6    7322 Manhattan           Murray Hill       40.7     -74.0 Entire hom…   200
# ℹ 5 more variables: minimum_nights <dbl>, number_of_reviews <dbl>,
#   reviews_per_month <dbl>, calculated_host_listings_count <dbl>,
#   availability_365 <dbl>

1.4 Probability Distribution Analysis

To build a reliable regression model later, I must understand the distribution of my target variable, price.

# Summary statistics
summary(airbnb_clean$price)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   10.0    69.0   106.0   152.8   175.0 10000.0 
# Original Histogram
ggplot(airbnb_clean, aes(x = price)) +
  geom_histogram(fill = "steelblue", color = "white", bins = 50) +
  labs(title = "Original Distribution of Airbnb Prices",
       x = "Price (USD)", y = "Frequency") +
  theme_minimal() +
  coord_cartesian(xlim = c(0, 1000))

# Log-transformed Histogram
airbnb_clean <- airbnb_clean %>%
  mutate(log_price = log(price))

ggplot(airbnb_clean, aes(x = log_price)) +
  geom_histogram(fill = "darkgreen", color = "white", bins = 50) +
  labs(title = "Log-Transformed Distribution of Airbnb Prices",
       x = "Log(Price)", y = "Frequency") +
  theme_minimal()

Interpretation: The original price data is heavily right-skewed: most listings are relatively inexpensive (median 106 USD against a mean of 152.8 USD), while a small number of listings reach prices as high as 10,000 USD. In linear regression, a highly skewed target variable often produces residuals that violate the normality assumption. After applying a log transformation, the distribution looks much closer to a normal distribution, so the original price data can be reasonably approximated by a Log-Normal distribution. I will likely use log_price as my target variable in future modeling stages.
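As a rough numerical companion to the histograms, the sketch below compares sample skewness before and after a log transform. It uses simulated log-normal values rather than the Airbnb file; the parameters (`meanlog = 4.7`, `sdlog = 0.7`) and the helper `sample_skewness` are assumptions chosen only for illustration.

```r
# Illustrative only: simulated prices, not the Airbnb data.
set.seed(465)
sim_price <- rlnorm(5000, meanlog = 4.7, sdlog = 0.7)  # assumed parameters

# Moment-based sample skewness (hypothetical helper, base R only)
sample_skewness <- function(x) {
  z <- x - mean(x)
  mean(z^3) / mean(z^2)^1.5
}

skew_raw <- sample_skewness(sim_price)       # clearly positive: long right tail
skew_log <- sample_skewness(log(sim_price))  # close to zero: roughly symmetric

cat("Skewness of price:     ", round(skew_raw, 2), "\n")
cat("Skewness of log(price):", round(skew_log, 2), "\n")
```

Applying the same helper to airbnb_clean$price and airbnb_clean$log_price would put a number on the skew reduction seen in the two histograms.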


Dataset 2: Classification (Adult Census Income)

2.1 Dataset Description and Economic Relevance

Source: Kaggle - Adult Income Dataset, https://www.kaggle.com/datasets/wenruliu/adult-income-dataset

Understanding income inequality and wage determination is crucial in economics. Predicting whether an individual falls into the higher or the lower income group helps economists understand which demographic factors (age, gender) and human capital factors (education, hours worked) drive labor market success.

Target Variable: income (Binary: <=50K or >50K)

2.2 The Economic Question

“Which demographic and educational characteristics are the strongest predictors of an individual earning more than $50,000 annually?”

2.3 Data Import and Cleaning

The Adult dataset uses "?" to represent missing values. I explicitly tell R to treat these as NA during import, and then I remove incomplete records. I also convert the target variable into a factor (categorical), which many classification functions in R expect.

# Import data and handle '?' as missing values
adult_data <- read_csv("adult.csv", na = c("", "NA", "?"))

# Clean data
adult_clean <- adult_data %>%
  drop_na() %>%
  mutate(income = as.factor(income))

# Preview the clean data
head(adult_clean)
# A tibble: 6 × 15
    age workclass fnlwgt education `educational-num` `marital-status` occupation
  <dbl> <chr>      <dbl> <chr>                 <dbl> <chr>            <chr>     
1    25 Private   226802 11th                      7 Never-married    Machine-o…
2    38 Private    89814 HS-grad                   9 Married-civ-spo… Farming-f…
3    28 Local-gov 336951 Assoc-ac…                12 Married-civ-spo… Protectiv…
4    44 Private   160323 Some-col…                10 Married-civ-spo… Machine-o…
5    34 Private   198693 10th                      6 Never-married    Other-ser…
6    63 Self-emp… 104626 Prof-sch…                15 Married-civ-spo… Prof-spec…
# ℹ 8 more variables: relationship <chr>, race <chr>, gender <chr>,
#   `capital-gain` <dbl>, `capital-loss` <dbl>, `hours-per-week` <dbl>,
#   `native-country` <chr>, income <fct>
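The effect of declaring "?" as a missing-value marker can be seen on a small inline sample. This is a minimal sketch, not the project code: it uses base R's `read.csv` with `na.strings` as a stand-in for the `na` argument of `read_csv`, and the four rows in `csv_text` are invented for illustration.

```r
# Invented sample rows; the real file is adult.csv
csv_text <- "age,workclass,income
25,Private,<=50K
38,?,>50K
28,Local-gov,>50K
44,Private,?"

# Treat empty strings, "NA", and "?" as missing while reading
toy <- read.csv(text = csv_text,
                na.strings = c("", "NA", "?"),
                stringsAsFactors = FALSE)

n_missing <- sum(is.na(toy))            # cells that were "?" in the text

# Keep complete rows only, then make the target a factor
toy_clean <- toy[complete.cases(toy), ]
toy_clean$income <- factor(toy_clean$income)

cat("Missing cells:", n_missing, "\n")
cat("Rows kept:", nrow(toy_clean), "of", nrow(toy), "\n")
```

Without the `na.strings` entry for "?", both affected rows would survive the complete-case filter and "?" would become an extra category.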

2.4 Probability Distribution

My main target variable (`income`) is categorical (binary), so I visualize it using a bar plot rather than a histogram.

# Summary table for binary target
table(adult_clean$income)

<=50K  >50K 
34014 11208 
# Bar plot for target variable
ggplot(adult_clean, aes(x = income, fill = income)) +
  geom_bar(color = "black", alpha = 0.8) +
  scale_fill_manual(values = c("coral", "lightblue")) +
  labs(title = "Distribution of Income Classes",
       x = "Income Level", y = "Count") +
  theme_minimal() +
  theme(legend.position = "none")
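The counts in the summary table can be turned into class shares, which also give the accuracy of a naive rule that always predicts the larger class. The sketch below copies the two counts from the output above; the object names are hypothetical.

```r
# Counts copied from the table(adult_clean$income) output above
income_counts <- c("<=50K" = 34014, ">50K" = 11208)

# Share of each class in the cleaned Stage 1 data
income_share <- income_counts / sum(income_counts)
print(round(income_share, 3))

# Accuracy of a rule that always predicts the larger class
majority_baseline <- max(income_share)
cat("Majority-class baseline accuracy:", round(majority_baseline, 3), "\n")
```

Roughly three out of four individuals fall in the <=50K class, so any classifier needs to beat about 75% accuracy to add information beyond that naive rule.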

Target Variable Note: Since the target variable is binary, each observation follows a Bernoulli distribution (a Binomial distribution with a single trial), and a log transformation does not apply to it. To fulfill the lab requirements for continuous probability analysis, I will examine one of the primary continuous predictors: age.

# Original Histogram for Age
ggplot(adult_clean, aes(x = age)) +
  geom_histogram(fill = "coral", color = "white", bins = 30) +
  labs(title = "Original Distribution of Age",
       x = "Age", y = "Frequency") +
  theme_minimal()

# Log-transformed Histogram for Age
adult_clean <- adult_clean %>%
  mutate(log_age = log(age))

ggplot(adult_clean, aes(x = log_age)) +
  geom_histogram(fill = "purple", color = "white", bins = 30) +
  labs(title = "Log-Transformed Distribution of Age",
       x = "Log(Age)", y = "Frequency") +
  theme_minimal()

Interpretation: The continuous predictor age is right-skewed, reflecting a workforce made up primarily of younger to middle-aged individuals. Applying a log transformation reduces this skew, which suggests that a Log-Normal distribution is a reasonable approximation for the age variable.

2.5 Conclusion

My exploratory analysis clarifies the statistical nature of my two targets. The rental price (Dataset 1) is a continuous, approximately Log-Normal variable suited to linear regression on the log scale, while income (Dataset 2) is a binary variable (Bernoulli, that is, Binomial with a single trial) requiring classification techniques like logistic regression. With these foundations in place, the cleaned datasets are ready for Stage 2 modeling.

Stage 2 Introduction

In this second stage of my project, I am building predictive models for both of my datasets. I will predict Airbnb rental prices using linear regression and predict if an individual makes more than $50K using logistic regression. Finally, I will evaluate these models to see how well they perform on unseen data.

# Loading the necessary packages
library(tidyverse)
library(tidymodels) # For data splitting
library(caret)      # For cross-validation

# 1. Loading and preparing Airbnb Data (Regression)
airbnb <- read_csv("AB_NYC_2019.csv") %>%
  select(price, minimum_nights, number_of_reviews, availability_365, room_type) %>%
  filter(price > 0 & price < 1000) %>%
  drop_na() %>%
  mutate(log_price = log(price)) 

# 2. Loading and preparing Adult Data (Classification)
adult <- read_csv("adult.csv", na = c("", "NA", "?", " ?")) %>%
  select(income, age, gender) %>% 
  drop_na() %>%
  mutate(income_binary = ifelse(income == ">50K" | income == " >50K", 1, 0))

Task 2.1: Data Splitting

First, I need to split my data into training and test sets. I am allocating 80% of the data to the training set so the models can learn, and keeping 20% aside to test them. I am setting the seed to 465 as requested in the lab instructions for reproducibility.

# Setting the seed
set.seed(465)

# Splitting Airbnb data
airbnb_split <- initial_split(airbnb, prop = 0.80)
airbnb_train <- training(airbnb_split)
airbnb_test  <- testing(airbnb_split)

# Splitting Adult data
adult_split <- initial_split(adult, prop = 0.80)
adult_train <- training(adult_split)
adult_test  <- testing(adult_split)

# Reporting sample sizes
cat("Airbnb Train Size:", nrow(airbnb_train), "\n")
Airbnb Train Size: 38868 
cat("Airbnb Test Size:", nrow(airbnb_test), "\n")
Airbnb Test Size: 9718 
cat("Adult Train Size:", nrow(adult_train), "\n")
Adult Train Size: 39073 
cat("Adult Test Size:", nrow(adult_test), "\n")
Adult Test Size: 9769 
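The logic of an 80/20 split can be sketched in base R with row indices. This is an illustration of the idea only: it is not meant to reproduce the exact rows chosen by `initial_split`, and the total row count is simply the sum of the two Airbnb sizes reported above.

```r
set.seed(465)

# Total rows implied by the reported Airbnb train and test sizes
n_airbnb <- 38868 + 9718

# Draw 80% of the row numbers at random for training
train_idx <- sample(seq_len(n_airbnb), size = floor(0.80 * n_airbnb))

# Everything not drawn becomes the test set
test_idx <- setdiff(seq_len(n_airbnb), train_idx)

cat("Train rows:", length(train_idx), "\n")
cat("Test rows: ", length(test_idx), "\n")
```

The two index sets do not overlap, which is what keeps the test set genuinely unseen during model fitting.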

Task 2.2: Building Predictive Models

2.2.1 Regression Models (Airbnb Dataset)

I am building two linear regression models to predict log_price. My first model is a simple baseline. My second model includes more variables, specifically the categorical room_type.

# Model 1: Simple baseline model
mod1_reg <- lm(log_price ~ minimum_nights + availability_365, data = airbnb_train)

# Model 2: Adding more variables (including room_type)
mod2_reg <- lm(log_price ~ minimum_nights + availability_365 + number_of_reviews + room_type, data = airbnb_train)

# Making predictions on the test set
pred1_reg <- predict(mod1_reg, newdata = airbnb_test)
pred2_reg <- predict(mod2_reg, newdata = airbnb_test)

# Calculating test-set RMSE and R-squared for Model 1
# (R-squared is computed as the squared correlation between actual and predicted values)
rmse1_reg <- sqrt(mean((airbnb_test$log_price - pred1_reg)^2))
rsq1_reg <- cor(airbnb_test$log_price, pred1_reg)^2

# Calculating test-set RMSE and R-squared for Model 2 (same definitions)
rmse2_reg <- sqrt(mean((airbnb_test$log_price - pred2_reg)^2))
rsq2_reg <- cor(airbnb_test$log_price, pred2_reg)^2
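The two metrics can be checked on a toy example where the answers are easy to verify by hand. The sketch below also contrasts the squared-correlation R-squared used above with the alternative definition 1 - SSE/SST; the helper names and the toy vectors are assumptions for illustration.

```r
# Hypothetical helpers mirroring the formulas used above
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))
rsq_cor <- function(actual, predicted) cor(actual, predicted)^2

# Alternative definition: 1 - residual sum of squares / total sum of squares
rsq_sse <- function(actual, predicted) {
  1 - sum((actual - predicted)^2) / sum((actual - mean(actual))^2)
}

# Toy data: predictions are perfectly correlated but always 0.5 too high
actual <- c(1, 2, 3, 4)
predicted <- actual + 0.5

rmse(actual, predicted)     # 0.5
rsq_cor(actual, predicted)  # 1
rsq_sse(actual, predicted)  # 0.8
```

The squared correlation ignores a constant bias in the predictions, while RMSE and the SSE-based R-squared penalize it, which is one reason to report RMSE alongside R-squared.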

2.2.2 Classification Models (Adult Dataset)

Next, I am building two logistic regression models to predict my binary outcome income_binary.

# Model 1: Predicting income based only on age
mod1_class <- glm(income_binary ~ age, data = adult_train, family = "binomial")

# Model 2: Predicting income based on age and gender
mod2_class <- glm(income_binary ~ age + gender, data = adult_train, family = "binomial")

# Making probability predictions on the test set
prob1_class <- predict(mod1_class, newdata = adult_test, type = "response")
prob2_class <- predict(mod2_class, newdata = adult_test, type = "response")

# Converting probabilities to strict 1 or 0 binary predictions (using a 0.5 threshold)
pred1_class <- ifelse(prob1_class > 0.5, 1, 0)
pred2_class <- ifelse(prob2_class > 0.5, 1, 0)

# A custom function I wrote to calculate the classification metrics cleanly
calc_metrics <- function(actual, predicted) {
  tp <- sum(predicted == 1 & actual == 1)
  fp <- sum(predicted == 1 & actual == 0)
  fn <- sum(predicted == 0 & actual == 1)
  tn <- sum(predicted == 0 & actual == 0)
  
  accuracy <- (tp + tn) / (tp + tn + fp + fn)
  precision <- tp / (tp + fp)
  recall <- tp / (tp + fn)
  
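  # Guard against 0/0 when there are no predicted positives or no actual positives
  if(is.nan(precision)) precision <- 0 
  if(is.nan(recall)) recall <- 0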
  
  return(c(Accuracy = accuracy, Precision = precision, Recall = recall))
}

metrics1 <- calc_metrics(adult_test$income_binary, pred1_class)
metrics2 <- calc_metrics(adult_test$income_binary, pred2_class)
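To see what the function returns, it helps to run it on a handful of values where the confusion matrix can be counted by hand. The sketch below repeats the function so that it runs on its own; the two toy vectors are invented for illustration.

```r
# Same logic as the function above, repeated so this block runs on its own
calc_metrics <- function(actual, predicted) {
  tp <- sum(predicted == 1 & actual == 1)
  fp <- sum(predicted == 1 & actual == 0)
  fn <- sum(predicted == 0 & actual == 1)
  tn <- sum(predicted == 0 & actual == 0)

  accuracy <- (tp + tn) / (tp + tn + fp + fn)
  precision <- tp / (tp + fp)
  recall <- tp / (tp + fn)

  if (is.nan(precision)) precision <- 0

  c(Accuracy = accuracy, Precision = precision, Recall = recall)
}

# Invented example: 3 actual positives, 5 actual negatives
toy_actual    <- c(1, 1, 1, 0, 0, 0, 0, 0)
toy_predicted <- c(1, 0, 0, 1, 0, 0, 0, 0)

# By hand: tp = 1, fp = 1, fn = 2, tn = 4
toy_metrics <- calc_metrics(toy_actual, toy_predicted)
print(round(toy_metrics, 3))
```

Here accuracy is 5/8, precision is 1/2, and recall is 1/3: the toy model is right most of the time only because it rarely predicts the positive class.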

Task 2.3: Model Comparison & Selection

Here are the comparison tables for my models based on their test set performance.

# Regression Comparison Table
reg_results <- data.frame(
  Model = c("Model 1 (Simple)", "Model 2 (With Room Type)"),
  RMSE = c(rmse1_reg, rmse2_reg),
  R_Squared = c(rsq1_reg, rsq2_reg)
)
print("Airbnb Regression Models:")
[1] "Airbnb Regression Models:"
print(reg_results)
                     Model      RMSE   R_Squared
1         Model 1 (Simple) 0.6594084 0.007624965
2 Model 2 (With Room Type) 0.5070021 0.413783721
# Classification Comparison Table
class_results <- data.frame(
  Model = c("Model 1 (Age only)", "Model 2 (Age & Gender)"),
  Accuracy = c(metrics1["Accuracy"], metrics2["Accuracy"]),
  Precision = c(metrics1["Precision"], metrics2["Precision"]),
  Recall = c(metrics1["Recall"], metrics2["Recall"])
)
print("Adult Classification Models:")
[1] "Adult Classification Models:"
print(class_results)
                   Model  Accuracy Precision     Recall
1     Model 1 (Age only) 0.7494114 0.1509434 0.01391304
2 Model 2 (Age & Gender) 0.7476712 0.2800000 0.04565217

Selection and Explanation: For the Airbnb dataset, Model 2 is clearly better. Adding number_of_reviews and room_type lowered the test RMSE from 0.659 to 0.507 and raised the R-squared from 0.008 to 0.414, an improvement that is plausibly driven mostly by room_type. This makes economic sense: whether a listing is an entire apartment or just a shared room is a much stronger determinant of price than availability alone.
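Because both models predict log_price, their RMSE values are in log units. Exponentiating them gives a rough multiplicative reading of the typical error. This is an informal interpretation aid under that assumption, not an exact error in USD; the RMSE values are copied from the table above.

```r
# Test-set RMSE values on the log scale, copied from the comparison table
rmse_log <- c(model1 = 0.6594084, model2 = 0.5070021)

# exp(RMSE) is the factor by which a typical prediction is off (above or below)
error_factor <- exp(rmse_log)
print(round(error_factor, 2))
```

Read this way, a typical Model 2 prediction is off by a factor of about 1.66, compared with about 1.93 for Model 1.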

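For the Adult dataset, I select Model 2, although the evidence is mixed. Adding gender alongside age improved precision (0.151 to 0.280) and recall (0.014 to 0.046), but overall accuracy fell slightly (0.749 to 0.748). Both accuracies are close to the share of the majority class (roughly 75% of individuals earn <=50K), so accuracy alone says little here, and recall remains very low: the model identifies fewer than 5% of the actual high earners. In practice, this means that predicting high incomes accurately is difficult using only basic demographics. Future work should include human capital variables like education level or working hours.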
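The three reported metrics, together with the test set size, are enough to back out the underlying confusion matrices. The sketch below does this arithmetic, taking the printed values at face value; `recover_counts` is a hypothetical helper, and the recovered counts are rounded to whole numbers.

```r
# Metrics and test size copied from the outputs above
n_test <- 9769
reported <- data.frame(
  model     = c("Age only", "Age & Gender"),
  accuracy  = c(0.7494114, 0.7476712),
  precision = c(0.1509434, 0.2800000),
  recall    = c(0.01391304, 0.04565217)
)

# Uses fp + fn = n * (1 - accuracy), fp = tp * (1/precision - 1),
# and fn = tp * (1/recall - 1) to solve for tp first
recover_counts <- function(n, accuracy, precision, recall) {
  tp <- n * (1 - accuracy) / (1 / precision + 1 / recall - 2)
  fp <- tp * (1 / precision - 1)
  fn <- tp * (1 / recall - 1)
  tn <- n - tp - fp - fn
  round(c(tp = tp, fp = fp, fn = fn, tn = tn))
}

counts <- t(mapply(recover_counts, n_test,
                   reported$accuracy, reported$precision, reported$recall))
rownames(counts) <- reported$model
print(counts)

# Share of actual <=50K cases in the test set (tn + fp are the actual negatives)
majority_share <- (counts[, "tn"] + counts[, "fp"]) / n_test
print(round(majority_share, 3))
```

If the reconstruction is right, Model 2 finds 105 of 2300 high earners at the cost of 270 false positives, and a rule that always predicts <=50K would score about 76.5% on this test split, slightly above both models.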


Task 2.4: Cross-Validation

To verify that my selected models are stable and not just memorizing the training data (overfitting), I will run a 5-fold cross-validation on my best models from Task 2.2.

set.seed(465)
cv_control <- trainControl(method = "cv", number = 5)

# 5-Fold CV for my best regression model (Model 2)
cv_reg <- train(
  log_price ~ minimum_nights + availability_365 + number_of_reviews + room_type, 
  data = airbnb_train, 
  method = "lm", 
  trControl = cv_control
)
print("Regression 5-Fold CV Results:")
[1] "Regression 5-Fold CV Results:"
print(cv_reg$results)
  intercept      RMSE  Rsquared       MAE      RMSESD  RsquaredSD       MAESD
1      TRUE 0.5022449 0.4298258 0.3863299 0.002638531 0.009101985 0.002802093
# 5-Fold CV for my best classification model (Model 2)
# Re-formatting target as a factor specifically for the caret package requirements
adult_train_cv <- adult_train %>% mutate(income_binary = as.factor(make.names(income_binary)))

cv_class <- train(
  income_binary ~ age + gender, 
  data = adult_train_cv, 
  method = "glm", 
  family = "binomial",
  trControl = cv_control
)
print("Classification 5-Fold CV Results:")
[1] "Classification 5-Fold CV Results:"
print(cv_class$results)
  parameter  Accuracy      Kappa  AccuracySD     KappaSD
1      none 0.7429428 0.01619172 0.001667692 0.001735472

Interpretation regarding Overfitting: The average RMSE from the 5-fold cross-validation of the regression model (0.502) is very close to the test set RMSE (0.507). The same holds for the classification model, whose cross-validated accuracy (0.743) is close to its test set accuracy (0.748). Since the cross-validated performance closely matches the test set performance, I see no evidence of overfitting, and the models should perform about as well on out-of-sample data as they did here. Stability is not the same as predictive strength, however: the Kappa of about 0.016 indicates that the classification model agrees with the true labels only marginally more than chance would.
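What `train` does with `method = "cv"` can be approximated by hand, which makes the meaning of the averaged RMSE and its standard deviation more concrete. This is a minimal base R sketch on simulated data, not a reimplementation of caret; the data-generating formula and all object names are assumptions.

```r
set.seed(465)

# Simulated data with a known noise level (sd = 0.5)
n <- 1000
sim <- data.frame(x = runif(n, 0, 10))
sim$y <- 4.5 + 0.1 * sim$x + rnorm(n, sd = 0.5)

# Assign each row to one of 5 folds at random
fold_id <- sample(rep(1:5, length.out = n))

# For each fold: fit on the other four folds, score on the held-out fold
fold_rmse <- sapply(1:5, function(k) {
  fit <- lm(y ~ x, data = sim[fold_id != k, ])
  held_out <- sim[fold_id == k, ]
  sqrt(mean((held_out$y - predict(fit, newdata = held_out))^2))
})

cv_rmse <- mean(fold_rmse)   # analogous to the RMSE column in cv_reg$results
cv_rmse_sd <- sd(fold_rmse)  # analogous to RMSESD

cat("Per-fold RMSE:", round(fold_rmse, 3), "\n")
cat("Mean RMSE:", round(cv_rmse, 3), " SD:", round(cv_rmse_sd, 3), "\n")
```

A small spread across folds, as in the RMSESD and AccuracySD columns above, suggests that the performance estimate does not depend much on which rows happen to be held out.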

Task 2.5: AI Interaction Log

Prompt 1 (Troubleshooting Caret Error): “I am trying to run a 5-fold cross-validation using caret::train for my logistic regression model, but I keep getting an error saying the target variable must be a factor with valid level names. My data has 1s and 0s. How can I fix this?”

AI Response 1: The AI explained that the `caret` package handles classification evaluation strictly and requires the binary target variable to be an explicit R `factor`. Furthermore, numeric level names like “0” and “1” can cause structural evaluation errors, so it recommended using the `make.names()` function to safely transform them into text-based factor levels.

Prompt 2 (Custom Classification Function): “How can I calculate precision and recall in R without installing or loading an extra heavy package? I just have a vector of actual values (1s and 0s) and predicted values (1s and 0s).”

AI Response 2: The AI explained the core logic of a confusion matrix (True Positives, False Positives, False Negatives). It then provided a compact syntax using base R logical comparisons: `tp <- sum(predicted == 1 & actual == 1)` followed by the mathematical formulas for precision and recall.

Prompt 3 (Economic Interpretation of Metrics): “In my classification model results, the baseline accuracy is high (around 75-80%), but the recall metric is extremely low. Does this mean my model is fundamentally broken, or is there an economic intuition behind this?”

AI Response 3: The AI explained that this is a classic consequence of class imbalance in labor datasets (fewer people earn >50K). A high baseline accuracy can simply mean the model is safely guessing the majority class. The low recall indicates that basic demographic variables like age and gender are insufficient to isolate high earners, suggesting that subsequent project stages must incorporate human capital predictors like educational attainment.

How I Used It: I resolved the cross-validation bottleneck by implementing `make.names(income_binary)` within Task 2.4. I also took the structural mathematical logic provided in the second prompt to build my own standalone `calc_metrics` function inside Task 2.2, allowing me to evaluate both classification setups cleanly.

Reflection: This interactive dialogue was incredibly educational. Instead of looking at performance values through a “black-box” package function, calculating the metrics manually using basic logical vectors deepened my theoretical understanding of data constraints, mathematical properties of classification thresholds, and structural labor market variables.

Conclusion

This stage demonstrated out-of-sample predictive modeling on both datasets. I found that predicting rental prices relies heavily on structural features like room type. For binary income classification, adding gender to age improved precision and recall but not accuracy, which shows that labor outcomes are complex and require more specific variables. The 5-fold cross-validation results were consistent with the test set results, so I found no evidence that the models I built suffer from overfitting.