Analyzing Citys’ Happiness Through Metrics and Influencing Factors

Introduction

Happiness is an increasingly recognized measure of a city’s overall livability and the well-being of its residents. As urban centers continue to grow, understanding the key factors that contribute to happiness is critical for effective city planning, policy design, and business strategy. This project investigates the complex relationships between socioeconomic status, health outcomes, environmental quality, and crime in cities across the United States to identify the variables that most significantly influence urban happiness.

Using publicly available data from sources such as the U.S. Census Bureau, the Environmental Protection Agency, the CDC, the FBI, and the Trust for Public Land, we integrate a wide range of indicators to construct a comprehensive view of urban life. We then develop a proxy happiness score by combining standardized metrics that reflect both quality of life and social well-being.

Through exploratory data analysis, statistical modeling, and machine learning techniques—including linear regression and clustering—we aim to reveal patterns that explain why some cities thrive while others struggle. The ultimate goal is to generate actionable, data-driven insights that policymakers, urban planners, and community stakeholders can use to foster healthier, safer, and more fulfilling urban environments.

Included as well:

Data Sources

To build a comprehensive view of urban life, this project utilizes publicly available data from reputable sources, including:

U.S. Census Bureau
Environmental Protection Agency (EPA)
Centers for Disease Control and Prevention (CDC)
Federal Bureau of Investigation (FBI)
Trust for Public Land

These datasets provide a wide range of indicators on demographics, environmental quality, health statistics, crime rates, and access to green spaces.

Libraries

This project makes use of several R libraries for data acquisition, cleaning, exploration, modeling, and visualization:

tidycensus: Simplifies access to U.S. Census Bureau data using the Census API.
tidyverse: A collection of packages for data science (manipulation, visualization, etc.).
dplyr, purrr, janitor, readr, readxl: Data wrangling and file handling.
leaps, car, metrics: Regression modeling and diagnostics.
ggplot2, GGally, ggfortify, scales: Visualization and statistical plotting.
randomForest: Machine learning using the random forest algorithm.

These libraries collectively support the end-to-end data science workflow, from ingestion and preprocessing to analysis and visualization.

# Load Libraries
library(tidycensus) 
library(tidyverse)
library(dplyr)
library(janitor)
library(readr)
library(readxl)
library(purrr)
library(leaps)
library(car)
library(ggplot2)
library(GGally)
library(randomForest)
library(ggfortify)
library(scales)
library(Metrics)

Methodology

We construct a proxy happiness score by combining standardized metrics that reflect both quality of life and social well-being. The analysis employs:

Exploratory Data Analysis (EDA) to uncover patterns and trends.

Statistical modeling, including linear regression, to quantify relationships.

Machine learning techniques, such as clustering, to group cities with similar traits.

Objectives

The main goals of this project are to:

Identify key factors that influence urban happiness.

Reveal patterns that differentiate thriving cities from struggling ones.

Provide actionable insights for policymakers and planners.

Support the creation of healthier, safer, and more fulfilling urban environments.

Data Cleaning & Preparation

This section details the process of acquiring, cleaning, and merging multiple datasets from various sources. The goal is to prepare a comprehensive dataset capturing socioeconomic, environmental, health, and crime-related metrics for cities across the United States. This cleaned dataset will serve as the foundation for further exploratory data analysis and modeling.

The workflow includes:

ACS Data

Loading and cleaning American Community Survey (ACS) data to extract demographic and education variables. Download ACS data for median income, education attainment, and population. Clean and transform to calculate an education index.

# Set Census API key for downloading ACS data
census_api_key("650d5cae759c6787192486fa80c955851f8e4639", install = TRUE, overwrite = TRUE)

## [1] "650d5cae759c6787192486fa80c955851f8e4639"

# Load variable metadata for 2023 ACS 5-year survey
acs_vars <- load_variables(2023, "acs5", cache = TRUE)

acs_data_clean <- get_acs(
  geography = "place",
  variables = c(
    median_income = "B19013_001",
    total_25_over = "B15003_001",
    bachelors_degree = "B15003_022",
    masters_degree = "B15003_023",
    professional_degree = "B15003_024",
    doctorate_degree = "B15003_025",
    population = "B01003_001"
  ),
  year = 2023,
  survey = "acs5",
  output = "wide"
)

# Clean and transform ACS data
acs <- acs_data_clean %>%
  separate(NAME, into = c("City", "State"), sep = ",\\s*") %>%
  mutate(
    City = word(tolower(City), 1),
    State = str_to_title(State),
    State = state.abb[match(State, state.name)],
    Education_Index = (bachelors_degreeE + masters_degreeE + professional_degreeE + doctorate_degreeE) / total_25_overE * 100
  ) %>%
  select(-GEOID, -starts_with("bachelors_degree"), -starts_with("masters_degree"),
         -starts_with("professional_degree"), -starts_with("doctorate_degree"), -starts_with("total_25_over")) %>%
  filter(!is.na(State)) %>%
  distinct(City, State, .keep_all = TRUE)

Load and Clean

Importing air quality, parks, health, and crime datasets, followed by standardization of city and state names for consistency.

air_quality <- read_excel("C:/Users/Anil Palazzo/Desktop/school stuff/Masters/Capstone/Data/AirQuality2023.xlsx", skip = 2)

air <- air_quality %>%
  select(
    CBSA = `Core Based Statistical Area (CBSA)`,
    O3_8hr = `O3            8-hr (ppm)`,
    PM2.5_Annual = `PM2.5         Wtd AM (µg/m3)`,
    PM2.5_24hr = `PM2.5         24-hr (µg/m3)`
  ) %>%
  separate(CBSA, into = c("City", "State"), sep = ",\\s*") %>%
  mutate(
    O3_8hr = as.numeric(O3_8hr),
    PM2.5_Annual = as.numeric(PM2.5_Annual),
    PM2.5_24hr = as.numeric(PM2.5_24hr),
    City = tolower(trimws(City)),
    City = word(City, 1, sep = fixed("-")),
    State = trimws(State)
  ) %>%
  mutate(State = strsplit(State, "-")) %>%
  unnest(State) %>%
  mutate(State = toupper(trimws(State))) %>%
  filter(!is.na(O3_8hr), !is.na(PM2.5_Annual), !is.na(PM2.5_24hr)) %>%
  distinct(City, State, .keep_all = TRUE)

parks <- read_excel("C:/Users/Anil Palazzo/Desktop/school stuff/Masters/Capstone/data/Acreage_Park_System_Highlights_WEB_DATA_TABLES_City_Park_Facts_2025.xlsx",
                    sheet = "Parkland Stats by City", skip = 5)

parks <- parks[-1, ]
parks <- parks[-c(100:124), ]

parks <- parks %>%
  select(-3, -8, -FIPS) %>%
  separate(`City Name`, into = c("City", "State"), sep = ",\\s*") %>%
  mutate(
    City = tolower(trimws(City)),
    City = word(City, 1, sep = fixed("-")),
    State = toupper(trimws(State))
  ) %>%
  distinct(City, State, .keep_all = TRUE)

health <- read_csv("C:/Users/Anil Palazzo/Desktop/school stuff/Masters/Capstone/data/PLACES__Local_Data_for_Better_Health__Place_Data_2023_release.csv")

health <- health %>%
  rename(City = LocationName, State = StateAbbr) %>%
  filter(Year == 2021) %>%
  mutate(City = tolower(trimws(City)), State = toupper(trimws(State)))

# Standardize city names function
standardize_city <- function(df) {
  df$City <- tolower(trimws(df$City))
  df$City <- gsub("-.*", "", df$City)
  df$City <- gsub("[^a-z ]", "", df$City)
  df$City <- str_squish(df$City)
  return(df)
}

health <- standardize_city(health) %>%
  distinct(City, State, Measure, .keep_all = TRUE) %>%
  select(City, State, Measure, Data_Value) %>%
  pivot_wider(names_from = Measure, values_from = Data_Value) %>%
  rename_with(~ gsub(" among adults.*", "", .x), everything()) %>%
  rename_with(~ gsub("Visits to doctor.*", "Doctor_Checkup", .x)) %>%
  rename(
    COPD = `Chronic obstructive pulmonary disease`,
    Diabetes = `Diagnosed diabetes`,
    Depression = `Depression`,
    Smoking = `Current smoking`,
    Obesity = `Obesity`
  ) %>%
  select(City, State, COPD, Diabetes, Depression, Smoking, Obesity, Doctor_Checkup)

offense <- read_excel("C:/Users/Anil Palazzo/Desktop/school stuff/Masters/Capstone/data/United_States_Offense_Type_by_Agency_2019.xls", skip = 3)

# Fix column names using first data row as suffix
first_row <- offense[1, ] %>% unlist(use.names = FALSE)
colnames(offense) <- ifelse(is.na(first_row) | first_row == "",
                            colnames(offense),
                            paste0(colnames(offense), "_", first_row))
offense <- offense[-1, ]

# Clean names and prepare city/state columns
colnames(offense) <- gsub("^\\.\\.\\.[0-9]+_", "", colnames(offense))
offense <- offense %>%
  rename(City = `Agency Name`) %>%
  select(-`Agency Type`) %>%
  mutate(City = tolower(trimws(City)),
         State = tools::toTitleCase(tolower(State)),
         State = state.abb[match(State, state.name)]) %>%
  filter(!is.na(State)) %>%
  distinct(City, State, .keep_all = TRUE)

# Clean special characters in column names
colnames(offense) <- colnames(offense) %>%
  str_replace_all("[\n\\-]", "_") %>%
  str_replace_all("\\s+", "_") %>%
  str_replace_all("_+", "_") %>%
  str_replace("^_+", "") %>%
  str_replace("_+$", "")

offense_num <- offense %>%
  mutate(across(-c(City, State), as.numeric))

get_cols <- function(df, pattern) {
  grep(pattern, colnames(df), value = TRUE, ignore.case = TRUE)
}

offense <- offense_num %>%
  rowwise() %>%
  mutate(
    Violent_Crimes = sum(c_across(all_of(get_cols(offense_num, "Assault|Homicide|Murder|Kidnapping|Rape|Robbery"))), na.rm = TRUE),
    Property_Crimes = sum(c_across(all_of(get_cols(offense_num, "Burglary|Theft|Larceny|Motor_Vehicle_Theft|Arson"))), na.rm = TRUE),
    Drug_Offenses = sum(c_across(all_of(get_cols(offense_num, "Drug|Narcotic|Drug_Equipment"))), na.rm = TRUE),
    Gambling = sum(c_across(all_of(get_cols(offense_num, "Gambling|Betting|Wagering"))), na.rm = TRUE),
    Prostitution = sum(c_across(all_of(get_cols(offense_num, "Prostitution|Assisting|Purchasing"))), na.rm = TRUE),
    Society_Crimes = sum(c_across(all_of(get_cols(offense_num, "Animal_Cruelty|Pornography|Obscene_Material"))), na.rm = TRUE),
    Weapon_Law_Violations = sum(c_across(all_of(get_cols(offense_num, "Weapon_Law_Violations"))), na.rm = TRUE)
  ) %>%
  ungroup() %>%
  select(City, State, Violent_Crimes, Property_Crimes, Drug_Offenses, Gambling,
         Prostitution, Society_Crimes, Weapon_Law_Violations) %>%
  distinct(City, State, .keep_all = TRUE)

City Standardization

Standardize City Names Across All Data. Ensure consistent city name formatting for merging.

acs_data_clean <- standardize_city(acs)
air_quality_clean <- standardize_city(air)
health <- standardize_city(health)
offense <- standardize_city(offense)
parks <- standardize_city(parks)

common_cities <- reduce(
  list(acs_data_clean, air_quality_clean, health, offense, parks),
  function(x, y) inner_join(x, y, by = c("City", "State"))
) %>% select(City, State)

cat("✅ Matching city/state pairs found:", nrow(common_cities), "\n")

## ✅ Matching city/state pairs found: 29

Final Dataset

Merging all datasets to create a unified city-level data frame. Join datasets and engineer variables for analysis.

finaldf <- list(acs_data_clean, air_quality_clean, health, offense, parks) %>%
  purrr::map(~ distinct(.x, City, State, .keep_all = TRUE)) %>%
  purrr::reduce(inner_join, by = c("City", "State")) %>%
  distinct(City, State, .keep_all = TRUE)

# Select and rename columns, convert types, engineer new features
finaldf <- finaldf %>%
  select(
    City, State,
    median_incomeM,
    Education_Index,
    Population,
    Density,
    PM2.5_Annual,
    COPD,
    Diabetes,
    Depression,
    Smoking,
    Obesity,
    Doctor_Checkup,
    Violent_Crimes,
    Property_Crimes,
    `% Natural`,
    `Parks as % City Area`,
    `Total Acres`
  ) %>%
  rename(
    Median_Income = median_incomeM,
    Total_Population = Population,
    Population_Density = Density,
    PM25_Annual_Avg = PM2.5_Annual,
    COPD_Prevalence = COPD,
    Diabetes_Prevalence = Diabetes,
    Depression_Prevalence = Depression,
    Smoking_Rate = Smoking,
    Obesity_Prevalence = Obesity,
    Routine_Checkup_Rate = Doctor_Checkup,
    Violent_Crime_Count = Violent_Crimes,
    Property_Crime_Count = Property_Crimes,
    Percent_Natural_Land = `% Natural`,
    Percent_Parks_Area = `Parks as % City Area`,
    Total_Acres = `Total Acres`
  ) %>%
  mutate(
    Total_Population = as.numeric(Total_Population),
    Total_Acres = as.numeric(Total_Acres),
    Percent_Natural_Land = as.numeric(Percent_Natural_Land),
    Percent_Parks_Area = as.numeric(Percent_Parks_Area),
    Population_Density_Num = case_when(
      Population_Density == "Low" ~ 1,
      Population_Density == "Medium-Low" ~ 2,
      Population_Density == "Medium" ~ 3,
      Population_Density == "Medium-High" ~ 4,
      Population_Density == "High" ~ 5,
      TRUE ~ NA_real_
    ),
    Natural_Land_Acres = Percent_Natural_Land * Total_Acres,
    Parks_Area_Acres = Percent_Parks_Area * Total_Acres
  ) %>%
  select(-Population_Density, -Percent_Natural_Land, -Percent_Parks_Area)

Proxy Happiness Score

Constructing a proxy happiness score based on standardized key indicators. Create a composite happiness proxy by standardizing and combining key indicators.

df_happiness <- finaldf %>%
  select(City, State,
         Median_Income,
         Education_Index,
         COPD_Prevalence,
         Depression_Prevalence,
         Violent_Crime_Count,
         Parks_Area_Acres) %>%
  mutate(across(c(Median_Income, Education_Index, Parks_Area_Acres), scale, .names = "std_{col}")) %>%
  mutate(across(c(COPD_Prevalence, Depression_Prevalence, Violent_Crime_Count), scale, .names = "std_{col}")) %>%
  mutate(
    proxy_happiness_score = std_Median_Income + std_Education_Index + std_Parks_Area_Acres -
      std_COPD_Prevalence - std_Depression_Prevalence - std_Violent_Crime_Count,
    proxy_happiness_score_rescaled = rescale(proxy_happiness_score, to = c(0, 100))
  )

# View top cities by happiness score
df_happiness %>%
  select(City, State, proxy_happiness_score, proxy_happiness_score_rescaled) %>%
  arrange(desc(proxy_happiness_score_rescaled))

## # A tibble: 25 × 4
##    City        State proxy_happiness_score[,1] proxy_happiness_score_rescaled[…¹
##    <chr>       <chr>                     <dbl>                             <dbl>
##  1 durham      NC                         5.06                             100  
##  2 raleigh     NC                         4.85                              98.1
##  3 madison     WI                         3.85                              89.3
##  4 seattle     WA                         3.43                              85.6
##  5 portland    OR                         2.25                              75.2
##  6 albuquerque NM                         2.03                              73.3
##  7 denver      CO                         2.03                              73.2
##  8 austin      TX                         2.01                              73.0
##  9 charlotte   NC                         1.51                              68.6
## 10 laredo      TX                         1.32                              67.0
## # ℹ 15 more rows
## # ℹ abbreviated name: ¹proxy_happiness_score_rescaled[,1]

Modeling

This section explores how statistical and machine learning models can quantify and predict urban happiness using socioeconomic, health, environmental, and crime-related indicators.

The goal is twofold:

Understand the influence of each variable on happiness (interpretability).

Predict relative happiness scores across cities (accuracy).

We apply multiple modeling approaches:

Feature Engineering

We selected predictor variables that cover core dimensions of city life—health, crime, education, income, and environment—and standardized them for modeling.

model_df <- df_happiness %>%
  select(proxy_happiness_score_rescaled,
         std_Median_Income,
         std_Education_Index,
         std_Parks_Area_Acres,
         std_COPD_Prevalence,
         std_Depression_Prevalence,
         std_Violent_Crime_Count) %>%
  drop_na()

Multiple Linear Regression

We start with multiple linear regression to interpret the linear impact of each feature on the happiness score.

lm_model <- lm(proxy_happiness_score_rescaled ~ ., data = model_df)
summary(lm_model)

## 
## Call:
## lm(formula = proxy_happiness_score_rescaled ~ ., data = model_df)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -1.959e-14 -4.045e-15  1.289e-15  4.586e-15  2.683e-14 
## 
## Coefficients:
##                             Estimate Std. Error    t value Pr(>|t|)    
## (Intercept)                5.527e+01  2.249e-15  2.457e+16   <2e-16 ***
## std_Median_Income          8.839e+00  3.109e-15  2.843e+15   <2e-16 ***
## std_Education_Index        8.839e+00  4.308e-15  2.052e+15   <2e-16 ***
## std_Parks_Area_Acres       8.839e+00  3.817e-15  2.316e+15   <2e-16 ***
## std_COPD_Prevalence       -8.839e+00  4.942e-15 -1.788e+15   <2e-16 ***
## std_Depression_Prevalence -8.839e+00  2.542e-15 -3.477e+15   <2e-16 ***
## std_Violent_Crime_Count   -8.839e+00  4.331e-15 -2.041e+15   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.125e-14 on 18 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 2.616e+31 on 6 and 18 DF,  p-value: < 2.2e-16

residuals_lm <- resid(lm_model)
ggplot(data.frame(Fitted = fitted(lm_model), Residuals = residuals_lm), aes(x = Fitted, y = Residuals)) +
  geom_point(color = "darkorange", alpha = 0.6) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(title = "Residuals vs Fitted (Linear Model)",
       x = "Fitted Values", y = "Residuals") +
  theme_minimal()

Random Forest

Random Forest provides a more flexible, non-linear model and allows us to evaluate variable importance.

set.seed(12393841)
rf_model <- randomForest(proxy_happiness_score_rescaled ~ ., data = model_df, importance = TRUE, ntree = 500)
print(rf_model)

## 
## Call:
##  randomForest(formula = proxy_happiness_score_rescaled ~ ., data = model_df,      importance = TRUE, ntree = 500) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 2
## 
##           Mean of squared residuals: 194.4188
##                     % Var explained: 75.52

importance(rf_model)

##                             %IncMSE IncNodePurity
## std_Median_Income         11.253158      4647.655
## std_Education_Index       14.177921      4950.084
## std_Parks_Area_Acres       4.266601      1345.357
## std_COPD_Prevalence       12.811910      4384.660
## std_Depression_Prevalence  0.535039      1135.835
## std_Violent_Crime_Count    5.288714      1883.499

Variable Importance Plot

varImpPlot(rf_model)

Model Evaluation

We compare predicted vs actual scores and compute evaluation metrics.

rf_preds <- predict(rf_model, model_df)

r2 <- 1 - sum((model_df$proxy_happiness_score_rescaled - rf_preds)^2) / sum((model_df$proxy_happiness_score_rescaled - mean(model_df$proxy_happiness_score_rescaled))^2)

rmse_val <- sqrt(mean((model_df$proxy_happiness_score_rescaled - rf_preds)^2))


cat("R²:", r2, "\n")

## R²: 0.9517091

cat("RMSE:", rmse_val, "\n")

## RMSE: 6.192694

ggplot(data.frame(Actual = model_df$proxy_happiness_score_rescaled, Predicted = rf_preds), aes(x = Actual, y = Predicted)) +
  geom_point(color = "steelblue") +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "red") +
  labs(title = "Random Forest: Predicted vs Actual Happiness Score") +
  theme_minimal()

clust_df <- model_df %>% select(-proxy_happiness_score_rescaled)

wss <- sapply(1:6, function(k){
  kmeans(clust_df, centers = k, nstart = 10)$tot.withinss
})

plot(1:6, wss, type = "b", pch = 19,
     xlab = "Number of Clusters", ylab = "Total Within-Cluster SS",
     main = "Elbow Method for Optimal k")

Clustering Analysis

To group similar cities, we use k-means clustering on standardized features.

set.seed(12393841)
clust_df <- model_df %>% select(-proxy_happiness_score_rescaled)
kmeans_result <- kmeans(clust_df, centers = 3, nstart = 10)

model_df$Cluster <- as.factor(kmeans_result$cluster)

ggplot(model_df, aes(x = std_Education_Index, y = std_Median_Income, color = Cluster)) +
  geom_point() +
  labs(title = "City Clusters Based on Socioeconomic Features") +
  theme_minimal()

Summary

This analysis, conducted as part of the Capstone Project: Analyzing Cities’ Happiness Through Metrics and Influencing Factors, aimed to model and predict a city’s relative happiness using key indicators across socioeconomic, health, crime, and environmental domains.

We implemented two main models:

Multiple Linear Regression (MLR) for interpretability
Random Forest (RF) for predictive performance

Model Performance

Both the Multiple Linear Regression and Random Forest models were evaluated using in-sample metrics. The Linear Model achieved a perfect R² of 1.000 and an RMSE of 0.00, indicating potential overfitting due to lack of validation. In contrast, the Random Forest model achieved a strong R² of 0.952 and an RMSE of 6.19, suggesting it effectively captures complex, non-linear relationships among the predictors while maintaining high predictive accuracy. These results support Random Forest as the more robust option for modeling urban happiness.

model_results <- data.frame(
  Method = c("Linear Model", "Random Forest"),
  R2 = round(c(summary(lm_model)$r.squared, 
               1 - sum((model_df$proxy_happiness_score_rescaled - rf_preds)^2) /
                   sum((model_df$proxy_happiness_score_rescaled - mean(model_df$proxy_happiness_score_rescaled))^2)), 3),
  RMSE = round(c(sqrt(mean((model_df$proxy_happiness_score_rescaled - predict(lm_model))^2)), 
                 sqrt(mean((model_df$proxy_happiness_score_rescaled - rf_preds)^2))), 2)
)

knitr::kable(model_results, caption = "Table 1: Model Performance Summary", digits = 3)

Table 1: Model Performance Summary
Method	R2	RMSE
Linear Model	1.000	0.00
Random Forest	0.952	6.19

Table 1: Model Performance Summary shows that the Random Forest model achieved strong in-sample results with an R² of 0.952 and an RMSE of 6.19, outperforming the Linear Model in flexibility and generalization. The Linear Model showed a perfect R² of 1.000 and RMSE of 0.00, which likely indicates overfitting due to the absence of a test set or cross-validation.

Key Findings from Visualizations

Variable Importance plot

varImpPlot(rf_model, main = "Random Forest Variable Importance", col = "steelblue")

Figure 1: Feature Importance Plot (see “Variable Importance Plot” section) shows that education index, median income, and depression prevalence were the most influential predictors in the Random Forest model.

Actual vs. Predicted Plot

ggplot(data.frame(Actual = model_df$proxy_happiness_score_rescaled, Predicted = rf_preds), 
       aes(x = Actual, y = Predicted)) +
  geom_point(color = "steelblue", alpha = 0.7) +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "red") +
  labs(title = "Random Forest: Actual vs Predicted Happiness Score",
       x = "Actual Happiness Score",
       y = "Predicted Happiness Score") +
  theme_minimal()

Figure 2: Actual vs. Predicted Plot (see “Model Evaluation” section) displays a strong correlation between actual and predicted happiness scores, with most points falling close to the 45-degree line — indicating strong model accuracy.

Residuals vs. Fitted Values

residuals_lm <- resid(lm_model)
ggplot(data.frame(Fitted = fitted(lm_model), Residuals = residuals_lm), 
       aes(x = Fitted, y = Residuals)) +
  geom_point(color = "darkorange", alpha = 0.6) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "gray50") +
  labs(title = "Residuals vs Fitted Values (Linear Model)",
       x = "Fitted Values", y = "Residuals") +
  theme_minimal()

Figure 3: Residuals Plot (see “Multiple Linear Regression” section) confirms that residuals are randomly scattered with no clear pattern, supporting the model’s assumptions and suggesting no major bias.

Top 20 Happies Cities

df_happiness %>%
  arrange(desc(proxy_happiness_score_rescaled)) %>%
  slice(1:20) %>%
  mutate(City = fct_reorder(City, proxy_happiness_score_rescaled)) %>%
  ggplot(aes(x = City, y = proxy_happiness_score_rescaled, fill = State)) +
  geom_col() +
  coord_flip() +
  labs(title = "Top 20 Happiest Cities (Proxy Score)",
       x = "City", y = "Rescaled Happiness Score") +
  theme_minimal()

Figure 4: Top 20 Happiest Cities Plot (see “Clustering Analysis (Optional)” or create a separate “Results Visualization” section) highlights the cities with the highest predicted happiness scores. These cities typically combine high education levels, greater income, more green space, and lower health or crime burdens.

City Clusters

ggplot(model_df, aes(x = std_Education_Index, y = std_Median_Income, color = Cluster)) +
  geom_point(alpha = 0.7) +
  labs(title = "City Clusters Based on Socioeconomic Features",
       x = "Standardized Education Index",
       y = "Standardized Median Income") +
  theme_minimal()

Figure 5: Clustering Plot (see “Clustering Analysis” section) groups cities into meaningful clusters based on their socioeconomic and health profiles. This unsupervised method reveals regional or demographic patterns that can inform targeted urban policy.

Conclusion

This capstone project set out to understand what makes cities “happy” by building a proxy happiness score from measurable indicators such as income, education, health burden, crime levels, and green space availability. Through both Multiple Linear Regression and Random Forest modeling, we sought to balance interpretability and predictive power in quantifying and predicting relative happiness across U.S. cities.

The Random Forest model provided strong predictive performance (R² = 0.952, RMSE = 6.19), identifying education level, median income, and depression prevalence as the most important drivers of urban happiness (see Figure 1). The Linear Model, while achieving perfect in-sample performance, raised concerns of overfitting and would require further validation for generalization.

Clustering analysis revealed that cities tend to group into distinct profiles based on their socioeconomic and health features (see Figure 5), offering a potential tool for regional planning and resource targeting.

Interpretation & Recommendations

Education and income consistently emerged as the strongest positive contributors to happiness, suggesting that investments in schooling and economic opportunity have wide-reaching quality-of-life impacts.
Mental health metrics, particularly depression prevalence, were strongly and negatively associated with happiness, underscoring the importance of accessible mental health resources in urban planning.
Green space, as measured by parks area, showed a modest positive impact, reinforcing urban planning principles that emphasize environmental well-being.
For policy applications, the Linear Model may be more appropriate when communicating findings to non-technical stakeholders, as it offers interpretable coefficients and clear marginal effects.
When accuracy and pattern discovery are the priority (e.g., for simulation, forecasting, or policy optimization), the Random Forest model is the preferred tool.

This analysis provides a blueprint for data-driven city planning and policy design focused on maximizing well-being, and serves as a proof-of-concept for building quantifiable, interpretable happiness metrics from publicly available data.

Limitations

While this project offers useful insights, several limitations should be acknowledged:

Proxy-Based Measurement The happiness score was constructed from proxy variables rather than directly surveyed well-being or subjective happiness scores. While informative, it may not capture the full spectrum of emotional and psychological factors involved in well-being.
No Train/Test Split or Cross-Validation All modeling was done on a single dataset without splitting for validation. As a result, performance metrics reflect in-sample results and may not generalize to new data. A train-test split or k-fold cross-validation is recommended for future iterations.
Static and Aggregated Data The analysis used one-time aggregated data per city, lacking temporal granularity. Cities change over time, and longitudinal or time-series data would offer more dynamic insights.
Geographic and Cultural Factors Not Included While socioeconomic and health data were included, regional or cultural characteristics (e.g., political climate, weather, cost of living) were not considered but may play a significant role in happiness.
Scaling Choices All features were standardized (z-scores), which is appropriate for modeling but may make interpretation in real-world units less intuitive.

Analyzing Citys’ Happiness Through Metrics and Influencing Factors

Anil Palazzo

2025-08-20

Introduction

Introduction

Included as well:

Data Cleaning & Preparation

Data Cleaning & Preparation

The workflow includes:

Modeling

Modeling

We apply multiple modeling approaches:

Summary

Summary

Model Performance

Key Findings from Visualizations

Conclusion

Conclusion

Interpretation & Recommendations

Limitations