Happiness is an increasingly recognized measure of a city’s overall livability and the well-being of its residents. As urban centers continue to grow, understanding the key factors that contribute to happiness is critical for effective city planning, policy design, and business strategy. This project investigates the complex relationships between socioeconomic status, health outcomes, environmental quality, and crime in cities across the United States to identify the variables that most significantly influence urban happiness.
Using publicly available data from sources such as the U.S. Census Bureau, the Environmental Protection Agency, the CDC, the FBI, and the Trust for Public Land, we integrate a wide range of indicators to construct a comprehensive view of urban life. We then develop a proxy happiness score by combining standardized metrics that reflect both quality of life and social well-being.
Through exploratory data analysis, statistical modeling, and machine learning techniques—including linear regression and clustering—we aim to reveal patterns that explain why some cities thrive while others struggle. The ultimate goal is to generate actionable, data-driven insights that policymakers, urban planners, and community stakeholders can use to foster healthier, safer, and more fulfilling urban environments.
To build a comprehensive view of urban life, this project utilizes publicly available data from reputable sources, including:
These datasets provide a wide range of indicators on demographics, environmental quality, health statistics, crime rates, and access to green spaces.
This project makes use of several R libraries for data acquisition, cleaning, exploration, modeling, and visualization:
tidycensus: Simplifies access to U.S.
Census Bureau data using the Census API.tidyverse: A collection of packages
for data science (manipulation, visualization, etc.).dplyr,
purrr,
janitor,
readr,
readxl: Data wrangling and file
handling.leaps,
car,
metrics: Regression modeling and
diagnostics.ggplot2,
GGally,
ggfortify,
scales: Visualization and statistical
plotting.randomForest: Machine learning using
the random forest algorithm.These libraries collectively support the end-to-end data science workflow, from ingestion and preprocessing to analysis and visualization.
# Load Libraries
library(tidycensus)
library(tidyverse)
library(dplyr)
library(janitor)
library(readr)
library(readxl)
library(purrr)
library(leaps)
library(car)
library(ggplot2)
library(GGally)
library(randomForest)
library(ggfortify)
library(scales)
library(Metrics)
We construct a proxy happiness score by combining standardized metrics that reflect both quality of life and social well-being. The analysis employs:
Exploratory Data Analysis (EDA) to uncover patterns and trends.
Statistical modeling, including linear regression, to quantify relationships.
Machine learning techniques, such as clustering, to group cities with similar traits.
The main goals of this project are to:
Identify key factors that influence urban happiness.
Reveal patterns that differentiate thriving cities from struggling ones.
Provide actionable insights for policymakers and planners.
Support the creation of healthier, safer, and more fulfilling urban environments.
This section details the process of acquiring, cleaning, and merging multiple datasets from various sources. The goal is to prepare a comprehensive dataset capturing socioeconomic, environmental, health, and crime-related metrics for cities across the United States. This cleaned dataset will serve as the foundation for further exploratory data analysis and modeling.
Loading and cleaning American Community Survey (ACS) data to extract demographic and education variables. Download ACS data for median income, education attainment, and population. Clean and transform to calculate an education index.
# Set Census API key for downloading ACS data
census_api_key("650d5cae759c6787192486fa80c955851f8e4639", install = TRUE, overwrite = TRUE)
## [1] "650d5cae759c6787192486fa80c955851f8e4639"
# Load variable metadata for 2023 ACS 5-year survey
acs_vars <- load_variables(2023, "acs5", cache = TRUE)
acs_data_clean <- get_acs(
geography = "place",
variables = c(
median_income = "B19013_001",
total_25_over = "B15003_001",
bachelors_degree = "B15003_022",
masters_degree = "B15003_023",
professional_degree = "B15003_024",
doctorate_degree = "B15003_025",
population = "B01003_001"
),
year = 2023,
survey = "acs5",
output = "wide"
)
# Clean and transform ACS data
acs <- acs_data_clean %>%
separate(NAME, into = c("City", "State"), sep = ",\\s*") %>%
mutate(
City = word(tolower(City), 1),
State = str_to_title(State),
State = state.abb[match(State, state.name)],
Education_Index = (bachelors_degreeE + masters_degreeE + professional_degreeE + doctorate_degreeE) / total_25_overE * 100
) %>%
select(-GEOID, -starts_with("bachelors_degree"), -starts_with("masters_degree"),
-starts_with("professional_degree"), -starts_with("doctorate_degree"), -starts_with("total_25_over")) %>%
filter(!is.na(State)) %>%
distinct(City, State, .keep_all = TRUE)
Importing air quality, parks, health, and crime datasets, followed by standardization of city and state names for consistency.
air_quality <- read_excel("C:/Users/Anil Palazzo/Desktop/school stuff/Masters/Capstone/Data/AirQuality2023.xlsx", skip = 2)
air <- air_quality %>%
select(
CBSA = `Core Based Statistical Area (CBSA)`,
O3_8hr = `O3 8-hr (ppm)`,
PM2.5_Annual = `PM2.5 Wtd AM (µg/m3)`,
PM2.5_24hr = `PM2.5 24-hr (µg/m3)`
) %>%
separate(CBSA, into = c("City", "State"), sep = ",\\s*") %>%
mutate(
O3_8hr = as.numeric(O3_8hr),
PM2.5_Annual = as.numeric(PM2.5_Annual),
PM2.5_24hr = as.numeric(PM2.5_24hr),
City = tolower(trimws(City)),
City = word(City, 1, sep = fixed("-")),
State = trimws(State)
) %>%
mutate(State = strsplit(State, "-")) %>%
unnest(State) %>%
mutate(State = toupper(trimws(State))) %>%
filter(!is.na(O3_8hr), !is.na(PM2.5_Annual), !is.na(PM2.5_24hr)) %>%
distinct(City, State, .keep_all = TRUE)
parks <- read_excel("C:/Users/Anil Palazzo/Desktop/school stuff/Masters/Capstone/data/Acreage_Park_System_Highlights_WEB_DATA_TABLES_City_Park_Facts_2025.xlsx",
sheet = "Parkland Stats by City", skip = 5)
parks <- parks[-1, ]
parks <- parks[-c(100:124), ]
parks <- parks %>%
select(-3, -8, -FIPS) %>%
separate(`City Name`, into = c("City", "State"), sep = ",\\s*") %>%
mutate(
City = tolower(trimws(City)),
City = word(City, 1, sep = fixed("-")),
State = toupper(trimws(State))
) %>%
distinct(City, State, .keep_all = TRUE)
health <- read_csv("C:/Users/Anil Palazzo/Desktop/school stuff/Masters/Capstone/data/PLACES__Local_Data_for_Better_Health__Place_Data_2023_release.csv")
health <- health %>%
rename(City = LocationName, State = StateAbbr) %>%
filter(Year == 2021) %>%
mutate(City = tolower(trimws(City)), State = toupper(trimws(State)))
# Standardize city names function
standardize_city <- function(df) {
df$City <- tolower(trimws(df$City))
df$City <- gsub("-.*", "", df$City)
df$City <- gsub("[^a-z ]", "", df$City)
df$City <- str_squish(df$City)
return(df)
}
health <- standardize_city(health) %>%
distinct(City, State, Measure, .keep_all = TRUE) %>%
select(City, State, Measure, Data_Value) %>%
pivot_wider(names_from = Measure, values_from = Data_Value) %>%
rename_with(~ gsub(" among adults.*", "", .x), everything()) %>%
rename_with(~ gsub("Visits to doctor.*", "Doctor_Checkup", .x)) %>%
rename(
COPD = `Chronic obstructive pulmonary disease`,
Diabetes = `Diagnosed diabetes`,
Depression = `Depression`,
Smoking = `Current smoking`,
Obesity = `Obesity`
) %>%
select(City, State, COPD, Diabetes, Depression, Smoking, Obesity, Doctor_Checkup)
offense <- read_excel("C:/Users/Anil Palazzo/Desktop/school stuff/Masters/Capstone/data/United_States_Offense_Type_by_Agency_2019.xls", skip = 3)
# Fix column names using first data row as suffix
first_row <- offense[1, ] %>% unlist(use.names = FALSE)
colnames(offense) <- ifelse(is.na(first_row) | first_row == "",
colnames(offense),
paste0(colnames(offense), "_", first_row))
offense <- offense[-1, ]
# Clean names and prepare city/state columns
colnames(offense) <- gsub("^\\.\\.\\.[0-9]+_", "", colnames(offense))
offense <- offense %>%
rename(City = `Agency Name`) %>%
select(-`Agency Type`) %>%
mutate(City = tolower(trimws(City)),
State = tools::toTitleCase(tolower(State)),
State = state.abb[match(State, state.name)]) %>%
filter(!is.na(State)) %>%
distinct(City, State, .keep_all = TRUE)
# Clean special characters in column names
colnames(offense) <- colnames(offense) %>%
str_replace_all("[\n\\-]", "_") %>%
str_replace_all("\\s+", "_") %>%
str_replace_all("_+", "_") %>%
str_replace("^_+", "") %>%
str_replace("_+$", "")
offense_num <- offense %>%
mutate(across(-c(City, State), as.numeric))
get_cols <- function(df, pattern) {
grep(pattern, colnames(df), value = TRUE, ignore.case = TRUE)
}
offense <- offense_num %>%
rowwise() %>%
mutate(
Violent_Crimes = sum(c_across(all_of(get_cols(offense_num, "Assault|Homicide|Murder|Kidnapping|Rape|Robbery"))), na.rm = TRUE),
Property_Crimes = sum(c_across(all_of(get_cols(offense_num, "Burglary|Theft|Larceny|Motor_Vehicle_Theft|Arson"))), na.rm = TRUE),
Drug_Offenses = sum(c_across(all_of(get_cols(offense_num, "Drug|Narcotic|Drug_Equipment"))), na.rm = TRUE),
Gambling = sum(c_across(all_of(get_cols(offense_num, "Gambling|Betting|Wagering"))), na.rm = TRUE),
Prostitution = sum(c_across(all_of(get_cols(offense_num, "Prostitution|Assisting|Purchasing"))), na.rm = TRUE),
Society_Crimes = sum(c_across(all_of(get_cols(offense_num, "Animal_Cruelty|Pornography|Obscene_Material"))), na.rm = TRUE),
Weapon_Law_Violations = sum(c_across(all_of(get_cols(offense_num, "Weapon_Law_Violations"))), na.rm = TRUE)
) %>%
ungroup() %>%
select(City, State, Violent_Crimes, Property_Crimes, Drug_Offenses, Gambling,
Prostitution, Society_Crimes, Weapon_Law_Violations) %>%
distinct(City, State, .keep_all = TRUE)
Standardize City Names Across All Data. Ensure consistent city name formatting for merging.
acs_data_clean <- standardize_city(acs)
air_quality_clean <- standardize_city(air)
health <- standardize_city(health)
offense <- standardize_city(offense)
parks <- standardize_city(parks)
common_cities <- reduce(
list(acs_data_clean, air_quality_clean, health, offense, parks),
function(x, y) inner_join(x, y, by = c("City", "State"))
) %>% select(City, State)
cat("✅ Matching city/state pairs found:", nrow(common_cities), "\n")
## ✅ Matching city/state pairs found: 29
Merging all datasets to create a unified city-level data frame. Join datasets and engineer variables for analysis.
finaldf <- list(acs_data_clean, air_quality_clean, health, offense, parks) %>%
purrr::map(~ distinct(.x, City, State, .keep_all = TRUE)) %>%
purrr::reduce(inner_join, by = c("City", "State")) %>%
distinct(City, State, .keep_all = TRUE)
# Select and rename columns, convert types, engineer new features
finaldf <- finaldf %>%
select(
City, State,
median_incomeM,
Education_Index,
Population,
Density,
PM2.5_Annual,
COPD,
Diabetes,
Depression,
Smoking,
Obesity,
Doctor_Checkup,
Violent_Crimes,
Property_Crimes,
`% Natural`,
`Parks as % City Area`,
`Total Acres`
) %>%
rename(
Median_Income = median_incomeM,
Total_Population = Population,
Population_Density = Density,
PM25_Annual_Avg = PM2.5_Annual,
COPD_Prevalence = COPD,
Diabetes_Prevalence = Diabetes,
Depression_Prevalence = Depression,
Smoking_Rate = Smoking,
Obesity_Prevalence = Obesity,
Routine_Checkup_Rate = Doctor_Checkup,
Violent_Crime_Count = Violent_Crimes,
Property_Crime_Count = Property_Crimes,
Percent_Natural_Land = `% Natural`,
Percent_Parks_Area = `Parks as % City Area`,
Total_Acres = `Total Acres`
) %>%
mutate(
Total_Population = as.numeric(Total_Population),
Total_Acres = as.numeric(Total_Acres),
Percent_Natural_Land = as.numeric(Percent_Natural_Land),
Percent_Parks_Area = as.numeric(Percent_Parks_Area),
Population_Density_Num = case_when(
Population_Density == "Low" ~ 1,
Population_Density == "Medium-Low" ~ 2,
Population_Density == "Medium" ~ 3,
Population_Density == "Medium-High" ~ 4,
Population_Density == "High" ~ 5,
TRUE ~ NA_real_
),
Natural_Land_Acres = Percent_Natural_Land * Total_Acres,
Parks_Area_Acres = Percent_Parks_Area * Total_Acres
) %>%
select(-Population_Density, -Percent_Natural_Land, -Percent_Parks_Area)
Constructing a proxy happiness score based on standardized key indicators. Create a composite happiness proxy by standardizing and combining key indicators.
df_happiness <- finaldf %>%
select(City, State,
Median_Income,
Education_Index,
COPD_Prevalence,
Depression_Prevalence,
Violent_Crime_Count,
Parks_Area_Acres) %>%
mutate(across(c(Median_Income, Education_Index, Parks_Area_Acres), scale, .names = "std_{col}")) %>%
mutate(across(c(COPD_Prevalence, Depression_Prevalence, Violent_Crime_Count), scale, .names = "std_{col}")) %>%
mutate(
proxy_happiness_score = std_Median_Income + std_Education_Index + std_Parks_Area_Acres -
std_COPD_Prevalence - std_Depression_Prevalence - std_Violent_Crime_Count,
proxy_happiness_score_rescaled = rescale(proxy_happiness_score, to = c(0, 100))
)
# View top cities by happiness score
df_happiness %>%
select(City, State, proxy_happiness_score, proxy_happiness_score_rescaled) %>%
arrange(desc(proxy_happiness_score_rescaled))
## # A tibble: 25 × 4
## City State proxy_happiness_score[,1] proxy_happiness_score_rescaled[…¹
## <chr> <chr> <dbl> <dbl>
## 1 durham NC 5.06 100
## 2 raleigh NC 4.85 98.1
## 3 madison WI 3.85 89.3
## 4 seattle WA 3.43 85.6
## 5 portland OR 2.25 75.2
## 6 albuquerque NM 2.03 73.3
## 7 denver CO 2.03 73.2
## 8 austin TX 2.01 73.0
## 9 charlotte NC 1.51 68.6
## 10 laredo TX 1.32 67.0
## # ℹ 15 more rows
## # ℹ abbreviated name: ¹proxy_happiness_score_rescaled[,1]
This section explores how statistical and machine learning models can quantify and predict urban happiness using socioeconomic, health, environmental, and crime-related indicators.
The goal is twofold:
Understand the influence of each variable on happiness (interpretability).
Predict relative happiness scores across cities (accuracy).
We selected predictor variables that cover core dimensions of city life—health, crime, education, income, and environment—and standardized them for modeling.
model_df <- df_happiness %>%
select(proxy_happiness_score_rescaled,
std_Median_Income,
std_Education_Index,
std_Parks_Area_Acres,
std_COPD_Prevalence,
std_Depression_Prevalence,
std_Violent_Crime_Count) %>%
drop_na()
We start with multiple linear regression to interpret the linear impact of each feature on the happiness score.
lm_model <- lm(proxy_happiness_score_rescaled ~ ., data = model_df)
summary(lm_model)
##
## Call:
## lm(formula = proxy_happiness_score_rescaled ~ ., data = model_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.959e-14 -4.045e-15 1.289e-15 4.586e-15 2.683e-14
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.527e+01 2.249e-15 2.457e+16 <2e-16 ***
## std_Median_Income 8.839e+00 3.109e-15 2.843e+15 <2e-16 ***
## std_Education_Index 8.839e+00 4.308e-15 2.052e+15 <2e-16 ***
## std_Parks_Area_Acres 8.839e+00 3.817e-15 2.316e+15 <2e-16 ***
## std_COPD_Prevalence -8.839e+00 4.942e-15 -1.788e+15 <2e-16 ***
## std_Depression_Prevalence -8.839e+00 2.542e-15 -3.477e+15 <2e-16 ***
## std_Violent_Crime_Count -8.839e+00 4.331e-15 -2.041e+15 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.125e-14 on 18 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 2.616e+31 on 6 and 18 DF, p-value: < 2.2e-16
residuals_lm <- resid(lm_model)
ggplot(data.frame(Fitted = fitted(lm_model), Residuals = residuals_lm), aes(x = Fitted, y = Residuals)) +
geom_point(color = "darkorange", alpha = 0.6) +
geom_hline(yintercept = 0, linetype = "dashed") +
labs(title = "Residuals vs Fitted (Linear Model)",
x = "Fitted Values", y = "Residuals") +
theme_minimal()
Random Forest provides a more flexible, non-linear model and allows us to evaluate variable importance.
set.seed(12393841)
rf_model <- randomForest(proxy_happiness_score_rescaled ~ ., data = model_df, importance = TRUE, ntree = 500)
print(rf_model)
##
## Call:
## randomForest(formula = proxy_happiness_score_rescaled ~ ., data = model_df, importance = TRUE, ntree = 500)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 2
##
## Mean of squared residuals: 194.4188
## % Var explained: 75.52
importance(rf_model)
## %IncMSE IncNodePurity
## std_Median_Income 11.253158 4647.655
## std_Education_Index 14.177921 4950.084
## std_Parks_Area_Acres 4.266601 1345.357
## std_COPD_Prevalence 12.811910 4384.660
## std_Depression_Prevalence 0.535039 1135.835
## std_Violent_Crime_Count 5.288714 1883.499
varImpPlot(rf_model)
We compare predicted vs actual scores and compute evaluation metrics.
rf_preds <- predict(rf_model, model_df)
r2 <- 1 - sum((model_df$proxy_happiness_score_rescaled - rf_preds)^2) / sum((model_df$proxy_happiness_score_rescaled - mean(model_df$proxy_happiness_score_rescaled))^2)
rmse_val <- sqrt(mean((model_df$proxy_happiness_score_rescaled - rf_preds)^2))
cat("R²:", r2, "\n")
## R²: 0.9517091
cat("RMSE:", rmse_val, "\n")
## RMSE: 6.192694
ggplot(data.frame(Actual = model_df$proxy_happiness_score_rescaled, Predicted = rf_preds), aes(x = Actual, y = Predicted)) +
geom_point(color = "steelblue") +
geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "red") +
labs(title = "Random Forest: Predicted vs Actual Happiness Score") +
theme_minimal()
clust_df <- model_df %>% select(-proxy_happiness_score_rescaled)
wss <- sapply(1:6, function(k){
kmeans(clust_df, centers = k, nstart = 10)$tot.withinss
})
plot(1:6, wss, type = "b", pch = 19,
xlab = "Number of Clusters", ylab = "Total Within-Cluster SS",
main = "Elbow Method for Optimal k")
To group similar cities, we use k-means clustering on standardized features.
set.seed(12393841)
clust_df <- model_df %>% select(-proxy_happiness_score_rescaled)
kmeans_result <- kmeans(clust_df, centers = 3, nstart = 10)
model_df$Cluster <- as.factor(kmeans_result$cluster)
ggplot(model_df, aes(x = std_Education_Index, y = std_Median_Income, color = Cluster)) +
geom_point() +
labs(title = "City Clusters Based on Socioeconomic Features") +
theme_minimal()
This analysis, conducted as part of the Capstone Project: Analyzing Cities’ Happiness Through Metrics and Influencing Factors, aimed to model and predict a city’s relative happiness using key indicators across socioeconomic, health, crime, and environmental domains.
We implemented two main models:
Multiple Linear Regression (MLR) for interpretability
Random Forest (RF) for predictive performance
Both the Multiple Linear Regression and Random Forest models were evaluated using in-sample metrics. The Linear Model achieved a perfect R² of 1.000 and an RMSE of 0.00, indicating potential overfitting due to lack of validation. In contrast, the Random Forest model achieved a strong R² of 0.952 and an RMSE of 6.19, suggesting it effectively captures complex, non-linear relationships among the predictors while maintaining high predictive accuracy. These results support Random Forest as the more robust option for modeling urban happiness.
model_results <- data.frame(
Method = c("Linear Model", "Random Forest"),
R2 = round(c(summary(lm_model)$r.squared,
1 - sum((model_df$proxy_happiness_score_rescaled - rf_preds)^2) /
sum((model_df$proxy_happiness_score_rescaled - mean(model_df$proxy_happiness_score_rescaled))^2)), 3),
RMSE = round(c(sqrt(mean((model_df$proxy_happiness_score_rescaled - predict(lm_model))^2)),
sqrt(mean((model_df$proxy_happiness_score_rescaled - rf_preds)^2))), 2)
)
knitr::kable(model_results, caption = "Table 1: Model Performance Summary", digits = 3)
| Method | R2 | RMSE |
|---|---|---|
| Linear Model | 1.000 | 0.00 |
| Random Forest | 0.952 | 6.19 |
Table 1: Model Performance Summary shows that the Random Forest model achieved strong in-sample results with an R² of 0.952 and an RMSE of 6.19, outperforming the Linear Model in flexibility and generalization. The Linear Model showed a perfect R² of 1.000 and RMSE of 0.00, which likely indicates overfitting due to the absence of a test set or cross-validation.
varImpPlot(rf_model, main = "Random Forest Variable Importance", col = "steelblue")
Figure 1: Feature Importance Plot (see “Variable Importance Plot” section) shows that education index, median income, and depression prevalence were the most influential predictors in the Random Forest model.
ggplot(data.frame(Actual = model_df$proxy_happiness_score_rescaled, Predicted = rf_preds),
aes(x = Actual, y = Predicted)) +
geom_point(color = "steelblue", alpha = 0.7) +
geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "red") +
labs(title = "Random Forest: Actual vs Predicted Happiness Score",
x = "Actual Happiness Score",
y = "Predicted Happiness Score") +
theme_minimal()
Figure 2: Actual vs. Predicted Plot (see “Model Evaluation” section) displays a strong correlation between actual and predicted happiness scores, with most points falling close to the 45-degree line — indicating strong model accuracy.
residuals_lm <- resid(lm_model)
ggplot(data.frame(Fitted = fitted(lm_model), Residuals = residuals_lm),
aes(x = Fitted, y = Residuals)) +
geom_point(color = "darkorange", alpha = 0.6) +
geom_hline(yintercept = 0, linetype = "dashed", color = "gray50") +
labs(title = "Residuals vs Fitted Values (Linear Model)",
x = "Fitted Values", y = "Residuals") +
theme_minimal()
Figure 3: Residuals Plot (see “Multiple Linear Regression” section) confirms that residuals are randomly scattered with no clear pattern, supporting the model’s assumptions and suggesting no major bias.
df_happiness %>%
arrange(desc(proxy_happiness_score_rescaled)) %>%
slice(1:20) %>%
mutate(City = fct_reorder(City, proxy_happiness_score_rescaled)) %>%
ggplot(aes(x = City, y = proxy_happiness_score_rescaled, fill = State)) +
geom_col() +
coord_flip() +
labs(title = "Top 20 Happiest Cities (Proxy Score)",
x = "City", y = "Rescaled Happiness Score") +
theme_minimal()
Figure 4: Top 20 Happiest Cities Plot (see “Clustering Analysis (Optional)” or create a separate “Results Visualization” section) highlights the cities with the highest predicted happiness scores. These cities typically combine high education levels, greater income, more green space, and lower health or crime burdens.
ggplot(model_df, aes(x = std_Education_Index, y = std_Median_Income, color = Cluster)) +
geom_point(alpha = 0.7) +
labs(title = "City Clusters Based on Socioeconomic Features",
x = "Standardized Education Index",
y = "Standardized Median Income") +
theme_minimal()
Figure 5: Clustering Plot (see “Clustering Analysis” section) groups cities into meaningful clusters based on their socioeconomic and health profiles. This unsupervised method reveals regional or demographic patterns that can inform targeted urban policy.
This capstone project set out to understand what makes cities “happy” by building a proxy happiness score from measurable indicators such as income, education, health burden, crime levels, and green space availability. Through both Multiple Linear Regression and Random Forest modeling, we sought to balance interpretability and predictive power in quantifying and predicting relative happiness across U.S. cities.
The Random Forest model provided strong predictive performance (R² = 0.952, RMSE = 6.19), identifying education level, median income, and depression prevalence as the most important drivers of urban happiness (see Figure 1). The Linear Model, while achieving perfect in-sample performance, raised concerns of overfitting and would require further validation for generalization.
Clustering analysis revealed that cities tend to group into distinct profiles based on their socioeconomic and health features (see Figure 5), offering a potential tool for regional planning and resource targeting.
Education and income consistently emerged as the strongest positive contributors to happiness, suggesting that investments in schooling and economic opportunity have wide-reaching quality-of-life impacts.
Mental health metrics, particularly depression prevalence, were strongly and negatively associated with happiness, underscoring the importance of accessible mental health resources in urban planning.
Green space, as measured by parks area, showed a modest positive impact, reinforcing urban planning principles that emphasize environmental well-being.
For policy applications, the Linear Model may be more appropriate when communicating findings to non-technical stakeholders, as it offers interpretable coefficients and clear marginal effects.
When accuracy and pattern discovery are the priority (e.g., for simulation, forecasting, or policy optimization), the Random Forest model is the preferred tool.
This analysis provides a blueprint for data-driven city planning and policy design focused on maximizing well-being, and serves as a proof-of-concept for building quantifiable, interpretable happiness metrics from publicly available data.
While this project offers useful insights, several limitations should be acknowledged:
Proxy-Based Measurement The happiness score was constructed from proxy variables rather than directly surveyed well-being or subjective happiness scores. While informative, it may not capture the full spectrum of emotional and psychological factors involved in well-being.
No Train/Test Split or Cross-Validation All modeling was done on a single dataset without splitting for validation. As a result, performance metrics reflect in-sample results and may not generalize to new data. A train-test split or k-fold cross-validation is recommended for future iterations.
Static and Aggregated Data The analysis used one-time aggregated data per city, lacking temporal granularity. Cities change over time, and longitudinal or time-series data would offer more dynamic insights.
Geographic and Cultural Factors Not Included While socioeconomic and health data were included, regional or cultural characteristics (e.g., political climate, weather, cost of living) were not considered but may play a significant role in happiness.
Scaling Choices All features were standardized (z-scores), which is appropriate for modeling but may make interpretation in real-world units less intuitive.