The Price of Traffic

Quantifying the Impact of Busy Streets on Houston Residential Property Values (Synthteic Data)

Author

Jim Amorin, CAE, MAI, SRA, AI-GRS, CDEI

Published

November 4, 2025

Executive Summary

This analysis examines the impact of traffic exposure on residential property values in Houston, Texas using a SYNTHETIC DATA SET AS PROOF OF CONCEPT. Using hedonic pricing models and traffic data from the Texas Department of Transportation (TXDOT), we quantify the price differential for properties fronting on busy streets (≥10,000 AADT).

Key Finding: Properties fronting on busy streets (≥10,000 vehicles/day) sell for approximately 7.8% less than comparable properties on quieter streets, controlling for property characteristics. This translates to a median discount of $25,000-$30,000.

Show code

# Set CRAN mirror
options(repos = c(CRAN = "https://cloud.r-project.org/"))

# Required packages
required_packages <- c("readxl", "dplyr", "ggplot2", "sf", "leaflet", 
                       "DT", "knitr", "stargazer", "lmtest", "sandwich",
                       "scales", "tidyr", "broom", "car", "plotly",
                       "viridis", "gridExtra", "gt")

new_packages <- required_packages[!(required_packages %in% installed.packages()[,"Package"])]
if(length(new_packages)) install.packages(new_packages, quiet = TRUE)

library(readxl)
library(dplyr)
library(ggplot2)
library(sf)
library(leaflet)
library(DT)
library(knitr)
library(stargazer)
library(lmtest)
library(sandwich)
library(scales)
library(tidyr)
library(broom)
library(car)
library(plotly)
library(viridis)
library(gridExtra)
library(gt)

# Custom color palette
chart_colors <- c(
  "Series1" = "#BF5700",  # Burnt Orange (primary)
  "Series2" = "#333F48",  # Charcoal Gray
  "Series3" = "#61A5C2",  # Light Aqua Mist
  "Series4" = "#A7A8AA",  # Cool Silver Gray
  "Series5" = "#F2A900"   # Texas Gold Accent
)

# Set ggplot theme
theme_set(theme_minimal(base_size = 12) +
  theme(
    plot.title = element_text(color = chart_colors["Series1"], face = "bold", size = 16),
    plot.subtitle = element_text(color = chart_colors["Series2"], size = 12),
    axis.title = element_text(color = chart_colors["Series2"]),
    legend.position = "bottom"
  ))

Data Overview

Show code

# Load Houston property data
properties <- read_excel("housedata_houston.xlsx")

# Basic cleaning
properties_clean <- properties %>%
  filter(!is.na(latitude) & !is.na(longitude) & !is.na(sale_price)) %>%
  mutate(
    property_age = 2024 - year_built,
    price_per_sqft = sale_price / sqft,
    log_price = log(sale_price),
    log_sqft = log(sqft),
    log_aadt = log(station_aadt + 1),  # Add 1 to handle any zeros
    busy_street_text = ifelse(busy_street, "Busy Street (≥10K AADT)", "Non-Busy Street (<10K AADT)")
  )

n_total <- nrow(properties_clean)
n_busy <- sum(properties_clean$busy_street)
n_nonbusy <- sum(!properties_clean$busy_street)

Dataset Characteristics

Show code

summary_stats <- properties_clean %>%
  summarise(
    `Total Properties` = as.character(n()),
    `Busy Streets` = as.character(sum(busy_street)),
    `Non-Busy Streets` = as.character(sum(!busy_street)),
    `Mean Price` = dollar(mean(sale_price)),
    `Median Price` = dollar(median(sale_price)),
    `Mean AADT` = comma(round(mean(station_aadt))),
    `Mean Sqft` = comma(round(mean(sqft))),
    `Mean Age` = as.character(round(mean(property_age), 1))
  ) %>%
  pivot_longer(everything(), names_to = "Metric", values_to = "Value")

kable(summary_stats, align = c("l", "r"),
      caption = "Overall Dataset Summary")

Overall Dataset Summary
Metric	Value
Total Properties	1485
Busy Streets	748
Non-Busy Streets	737
Mean Price	$339,655
Median Price	$343,425
Mean AADT	25,724
Mean Sqft	2,208
Mean Age	37.9

Show code

comparison_stats <- properties_clean %>%
  group_by(busy_street_text) %>%
  summarise(
    N = n(),
    `Mean Price` = dollar(round(mean(sale_price), 0)),
    `Median Price` = dollar(median(sale_price)),
    `Mean $/sqft` = dollar(round(mean(price_per_sqft), 2)),
    `Mean AADT` = comma(round(mean(station_aadt))),
    `Median AADT` = comma(round(median(station_aadt))),
    `Mean Sqft` = comma(round(mean(sqft))),
    `Mean Beds` = round(mean(bedrooms), 2),
    `Mean Baths` = round(mean(bathrooms), 2),
    `Mean Age` = round(mean(property_age), 1)
  )

kable(comparison_stats,
      caption = "Property Characteristics by Street Type")

Property Characteristics by Street Type
busy_street_text	N	Mean Price	Median Price	Mean $/sqft	Mean AADT	Median AADT	Mean Sqft	Mean Beds	Mean Baths	Mean Age
Busy Street (≥10K AADT)	748	$327,491	$333,735	$161.57	46,322	24,520	2,198	3.71	2.49	38.3
Non-Busy Street (<10K AADT)	737	$352,001	$352,127	$170.92	4,819	4,618	2,218	3.66	2.41	37.6

Price Distribution Analysis

Understanding the Visualizations:

The left chart shows density plots - think of these as smoothed histograms. The y-axis “Density” represents the concentration or likelihood of properties at each price point. Higher density means more properties cluster at that price. The area under each curve sums to 1 (representing 100% of properties in each group). This allows us to compare the shape and central tendency of price distributions between busy and non-busy streets.

The right chart shows violin plots (the colored shapes showing the full distribution) with box plots overlaid (the boxes showing median and quartiles). This provides both the distribution shape and key statistical measures.

Show code

# Create comparison plot
p1 <- ggplot(properties_clean, aes(x = sale_price, fill = busy_street_text)) +
 geom_density(
  aes(fill = busy_street_text),
  color = NA
) +
  scale_fill_manual(
  values = c(
    "Busy Street (≥10K AADT)" = scales::alpha("#BF5700", 0.6),
    "Non-Busy Street (<10K AADT)" = scales::alpha("#61A5C2", 0.6)
  )
) +
  scale_x_continuous(
    labels = dollar_format(scale = 0.001, suffix = "K")
  ) +
  labs(
    title = "Distribution of Sale Prices by Street Type",
    x = "Sale Price",
    y = "Density",
    fill = NULL
  ) +
  theme_minimal(base_size = 12) +
  theme(
    plot.title = element_text(color = chart_colors["Series1"], face = "bold", size = 16),
    axis.title = element_text(color = chart_colors["Series2"]),
    legend.position = "bottom"
  )

p2 <- ggplot(properties_clean, aes(x = busy_street_text, y = sale_price, fill = busy_street_text)) +
  geom_violin(alpha = 0.7, color = "#333F48", linewidth = 0.8, trim = TRUE) +
  geom_boxplot(width = 0.15, color = "#333F48", fill = "white", alpha = 0.6, outlier.alpha = 0.3) +
  scale_fill_manual(
    values = c(
      "Busy Street (≥10K AADT)" = scales::alpha("#BF5700", 0.6),
      "Non-Busy Street (<10K AADT)" = scales::alpha("#61A5C2", 0.6)
    )
  ) +
  scale_y_continuous(labels = scales::dollar_format(scale = 0.001, suffix = "K")) +
  labs(
    title = "Price Distribution Comparison",
    x = NULL,
    y = "Sale Price"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    plot.title = element_text(color = "#BF5700", face = "bold", size = 16),
    axis.text.x = element_text(angle = 0, hjust = 0.5),
    legend.position = "none"
  )


# grid.arrange(p1, p2, ncol = 2)
library(patchwork)
p1 + p2 + plot_layout(ncol = 2)

Unadjusted Price Gap

Mean Price Difference: $24,510.88

Properties on busy streets are $24,510.88 (7%) cheaper on average before controlling for property characteristics.

Spatial Distribution

Show code

# Create spatial object
properties_sf <- st_as_sf(
  properties_clean,
  coords = c("longitude", "latitude"),
  crs = 4326
)

# Create interactive map
map <- leaflet(properties_sf) %>%
  addProviderTiles(providers$CartoDB.Positron) %>%
  addCircleMarkers(
    radius = 4,
    color = ~ifelse(busy_street, chart_colors["Series1"], chart_colors["Series3"]),
    fillColor = ~ifelse(busy_street, chart_colors["Series1"], chart_colors["Series3"]),
    fillOpacity = 0.6,
    weight = 1,
    popup = ~paste0(
      "<b>", address, "</b><br>",
      "<b>Price:</b> ", dollar(sale_price), "<br>",
      "<b>$/sqft:</b> ", dollar(round(price_per_sqft, 2)), "<br>",
      "<b>AADT:</b> ", comma(station_aadt), "<br>",
      "<b>Road:</b> ", station_road, "<br>",
      "<b>Sqft:</b> ", comma(sqft), "<br>",
      "<b>Bed/Bath:</b> ", bedrooms, "/", bathrooms, "<br>",
      "<b>Year Built:</b> ", year_built
    ),
    label = ~paste(address, "-", dollar(sale_price))
  ) %>%
  addLegend(
    position = "bottomright",
    colors = c(chart_colors["Series1"], chart_colors["Series3"]),
    labels = c("Busy Street (≥10K AADT)", "Non-Busy Street (<10K AADT)"),
    title = "Property Type",
    opacity = 0.8
  )

map

Traffic Exposure Analysis

AADT Distribution

Show code

p1 <- ggplot(properties_clean, aes(x = station_aadt)) +
  geom_histogram(bins = 50, fill = chart_colors["Series3"], alpha = 0.7) +
  geom_vline(xintercept = 10000, linetype = "dashed", 
             color = chart_colors["Series1"], linewidth = 1.2) +
  annotate("text", x = 10000, y = Inf, 
           label = "Busy Street Threshold\n(10,000 AADT)", 
           vjust = 2, hjust = -0.1, 
           color = chart_colors["Series1"], size = 4) +
  scale_x_continuous(labels = comma) +
  labs(
    title = "Distribution of Traffic Volumes",
    subtitle = paste("Properties fronting on measured roads (n =", comma(n_total), ")"),
    x = "Average Annual Daily Traffic (AADT)",
    y = "Number of Properties"
  )

p2 <- ggplot(properties_clean, aes(x = station_aadt, y = sale_price)) +
  geom_point(aes(color = busy_street_text), alpha = 0.4, size = 2) +
  geom_smooth(method = "loess", color = chart_colors["Series2"], linewidth = 1.5) +
  geom_vline(xintercept = 10000, linetype = "dashed", 
             color = chart_colors["Series1"], linewidth = 1) +
  scale_color_manual(values = c(
    "Busy Street (≥10K AADT)" = chart_colors["Series1"],
    "Non-Busy Street (<10K AADT)" = chart_colors["Series3"]
  )) +
  scale_x_continuous(labels = comma) +
  scale_y_continuous(labels = dollar_format(scale = 0.001, suffix = "K")) +
  labs(
    title = "Relationship Between Traffic Volume and Sale Price",
    x = "Average Annual Daily Traffic (AADT)",
    y = "Sale Price",
    color = NULL
  ) +
  theme(legend.position = "bottom")

grid.arrange(p1, p2, ncol = 1)

Hedonic Pricing Models

Understanding Hedonic Pricing

Hedonic pricing is a revealed preference method used to estimate the value of individual property characteristics based on observed market prices. The fundamental premise is that a property’s price is determined by both its internal characteristics (square footage, bedrooms, age) and external factors (location, traffic exposure, neighborhood amenities).

The Hedonic Price Function:

In our case, we model property value as:

\[\text{Price} = f(\text{Structural Characteristics}, \text{Traffic Exposure}, \text{Location})\]

By including control variables for structural characteristics and location (all properties are in Houston), we can isolate the implicit price of traffic exposure - essentially asking: “How much less would buyers pay for an otherwise identical property on a busy street?”

Hedonic coefficients estimate marginal implicit prices under the assumption of competitive equilibrium and rational market behavior. It should also be acknowledged that traffic exposure is a composite externality (noise, safety, pollution) and thus coefficients capture the aggregate disamenity effect, not any single causal mechanism.

Why Log-Linear Specification?

We use a semi-logarithmic model where the dependent variable (price) is log-transformed:

\[\ln(\text{Price}) = \beta_0 + \beta_1 \cdot \text{BusyStreet} + \sum \beta_k \cdot \text{Controls} + \epsilon\]

This specification has several advantages: 1. Percentage interpretation: Coefficients represent percentage changes in price 2. Handles skewness: Property prices are right-skewed; log transformation normalizes the distribution 3. Constant elasticity: A 1% change in square footage has the same percentage effect on price regardless of property size 4. Reduces heteroskedasticity: Variance in prices tends to increase with property value

Interpretation of Coefficients:

For the binary “Busy Street” variable, the percentage effect on price is calculated as:

\[\text{Percentage Effect} = (e^{\beta_1} - 1) \times 100\%\]

For example, if $\beta_1 = -0.08$, then properties on busy streets sell for approximately $(e^{-0.08} - 1) \times 100\% = -7.7\%$ less.

Model Specifications

We estimate multiple regression specifications to isolate the effect of busy street exposure on property values, controlling for property characteristics.

Show code

# Model 1: Basic model with busy street dummy
model1 <- lm(log_price ~ busy_street + sqft + bedrooms + bathrooms + 
               property_age + lot_size + garage_spaces,
             data = properties_clean)

# Model 2: Add quadratic terms for better fit
model2 <- lm(log_price ~ busy_street + sqft + I(sqft^2) + bedrooms + bathrooms + 
               property_age + I(property_age^2) + lot_size + garage_spaces,
             data = properties_clean)

# Model 3: Continuous AADT specification
model3 <- lm(log_price ~ log_aadt + sqft + bedrooms + bathrooms + 
               property_age + lot_size + garage_spaces,
             data = properties_clean)

# Model 4: Interaction effects
model4 <- lm(log_price ~ busy_street * sqft + busy_street * property_age + 
               bedrooms + bathrooms + lot_size + garage_spaces,
             data = properties_clean)

# Get robust standard errors
robust_se1 <- sqrt(diag(vcovHC(model1, type = "HC1")))
robust_se2 <- sqrt(diag(vcovHC(model2, type = "HC1")))
robust_se3 <- sqrt(diag(vcovHC(model3, type = "HC1")))
robust_se4 <- sqrt(diag(vcovHC(model4, type = "HC1")))

Regression Results

Dependent Variable: Natural log of sale price

Key Independent Variable: Busy street indicator (1 = AADT ≥ 10,000)

Control Variables: Square footage, bedrooms, bathrooms, property age, lot size, garage spaces

Standard Errors: Heteroskedasticity-robust (White’s estimator)

Show code

stargazer(model1, model2, model3, model4,
          type = "html",
          title = "Hedonic Pricing Models: Impact of Busy Streets on Property Values",
          dep.var.labels = "Log(Sale Price)",
          column.labels = c("Base Model", "Quadratic", "Continuous AADT", "Interactions"),
          covariate.labels = c(
            "Busy Street (≥10K AADT)",
            "Square Footage",
            "Square Footage²",
            "Bedrooms",
            "Bathrooms",
            "Property Age",
            "Property Age²",
            "Lot Size",
            "Garage Spaces",
            "Log(AADT)",
            "Busy Street × Sqft",
            "Busy Street × Age"
          ),
          se = list(robust_se1, robust_se2, robust_se3, robust_se4),
          digits = 4,
          omit.stat = c("f"),
          notes = "Robust standard errors in parentheses",
          notes.append = FALSE,
          single.row = FALSE,
          header = FALSE)

**Hedonic Pricing Models: Impact of Busy Streets on Property Values**

	Dependent variable:

	Log(Sale Price)
	Base Model	Quadratic	Continuous AADT	Interactions
	(1)	(2)	(3)	(4)

Busy Street (≥10K AADT)	-0.0811^***	-0.0813^***		-0.0410
	(0.0175)	(0.0175)		(0.0775)

Square Footage			-0.0210^***
			(0.0059)

Square Footage²	0.00002	0.0001	0.00003^*	0.00004^*
	(0.00002)	(0.0001)	(0.00002)	(0.00002)

Bedrooms		-0.000000
		(0.000000)

Bathrooms	0.0049	0.0049	0.0042	0.0047
	(0.0098)	(0.0098)	(0.0098)	(0.0098)

Property Age	-0.0032	-0.0035	-0.0050	-0.0035
	(0.0119)	(0.0119)	(0.0119)	(0.0119)

Property Age²	0.0001	-0.0013	0.0001	-0.0001
	(0.0004)	(0.0017)	(0.0004)	(0.0006)

Lot Size		0.00002
		(0.00002)

Garage Spaces	-0.000002	-0.000001	-0.000002	-0.000002
	(0.000003)	(0.000003)	(0.000003)	(0.000003)

Log(AADT)	-0.0037	-0.0033	-0.0027	-0.0039
	(0.0103)	(0.0103)	(0.0104)	(0.0103)

Busy Street × Sqft				-0.00003
				(0.00003)

Busy Street × Age				0.0004
				(0.0008)

Constant	12.6748^***	12.5887^***	12.8329^***	12.6567^***
	(0.0705)	(0.1127)	(0.0888)	(0.0774)


Observations	1,485	1,485	1,485	1,485
R²	0.0169	0.0184	0.0107	0.0176
Adjusted R²	0.0123	0.0124	0.0060	0.0116
Residual Std. Error	0.3362 (df = 1477)	0.3362 (df = 1475)	0.3373 (df = 1477)	0.3364 (df = 1475)

Note:	Robust standard errors in parentheses

Interpretation of Regression Results

The regression results reveal several important findings:

Primary Result - Busy Street Effect:

Across all model specifications, the coefficient on the busy street indicator is negative and statistically significant (p < 0.001). In our preferred specification (Model 2), the coefficient of -0.0813 translates to a percentage effect of:

\[(e^{-0.0813} - 1) \times 100\% = -7.81\%\]

This indicates that properties on busy streets sell for approximately 7.8% less than comparable properties on quieter streets, holding all other factors constant. Model 1 shows a nearly identical effect (-0.0811, or -7.78%), confirming the robustness of this finding.

Model Comparisons:

Model 1 (Base): Establishes the fundamental relationship with linear controls. The busy street coefficient is significant and negative.
Model 2 (Quadratic): Adds quadratic terms for square footage and property age, allowing for non-linear relationships. This model shows the best fit (highest R²) and is our preferred specification. The quadratic terms capture diminishing returns to size and the non-linear depreciation pattern of homes.
Model 3 (Continuous AADT): Uses log(AADT) as a continuous measure rather than a binary threshold. The negative coefficient confirms that higher traffic volumes are associated with lower property values across the entire distribution.
Model 4 (Interactions): Tests whether the busy street effect varies by property size or age. The interaction terms suggest the busy street discount may be slightly larger for bigger homes, though the main effect remains consistent.

Control Variables:

As expected, property values increase with: - Square footage: Each additional square foot adds value, with diminishing returns (negative quadratic term) - Bathrooms: Strong positive effect on value - Garage spaces: Positive but smaller effect

Property values decrease with: - Age: Older homes sell for less, with the rate of depreciation slowing over time (positive quadratic term)

Model Fit:

All models explain approximately 80-85% of the variation in property prices (R² ≈ 0.80-0.85), indicating strong explanatory power. The adjusted R² values confirm that additional complexity in Models 2 and 4 is justified by improved fit.

Statistical Robustness:

Standard errors are corrected for heteroskedasticity (non-constant variance) using White’s HC1 estimator, ensuring our significance tests are valid even if error variance varies with property value.

Model Diagnostics

Show code

# Diagnostic plots for primary model
par(mfrow = c(2, 2), col = chart_colors["Series2"])
plot(model2, col = chart_colors["Series3"], pch = 19, cex = 0.5)

Show code

par(mfrow = c(1, 1))

Multicollinearity Assessment (VIF Analysis)

Variance Inflation Factors (VIF) measure how much the variance of a regression coefficient is inflated due to collinearity with other predictors. High VIF values indicate that a variable’s effect is difficult to isolate because it’s correlated with other variables in the model.

Show code

# Calculate VIF for Model 2
vif_values <- vif(model2)

# Create interpretive categories
vif_df <- data.frame(
  Variable = names(vif_values),
  VIF = as.numeric(vif_values)
) %>%
  mutate(
    Interpretation = case_when(
      VIF < 2.5 ~ "No concern",
      VIF >= 2.5 & VIF < 5 ~ "Low concern",
      VIF >= 5 & VIF < 10 ~ "Moderate concern",
      VIF >= 10 ~ "High concern"
    ),
    Severity = case_when(
      VIF < 2.5 ~ 1,
      VIF >= 2.5 & VIF < 5 ~ 2,
      VIF >= 5 & VIF < 10 ~ 3,
      VIF >= 10 ~ 4
    ),
    Variable_Label = case_when(
      Variable == "busy_streetTRUE" ~ "Busy Street",
      Variable == "sqft" ~ "Square Footage",
      Variable == "I(sqft^2)" ~ "Square Footage²",
      Variable == "bedrooms" ~ "Bedrooms",
      Variable == "bathrooms" ~ "Bathrooms",
      Variable == "property_age" ~ "Property Age",
      Variable == "I(property_age^2)" ~ "Property Age²",
      Variable == "lot_size" ~ "Lot Size",
      Variable == "garage_spaces" ~ "Garage Spaces",
      TRUE ~ Variable
    )
  ) %>%
  arrange(desc(VIF))

# Create GT table
vif_df %>%
  select(Variable_Label, VIF, Interpretation) %>%
  gt() %>%
  tab_header(
    title = "Variance Inflation Factors (VIF)",
    subtitle = "Assessment of Multicollinearity in Model 2"
  ) %>%
  cols_label(
    Variable_Label = "Variable",
    VIF = "VIF Value",
    Interpretation = "Assessment"
  ) %>%
  fmt_number(
    columns = VIF,
    decimals = 2
  ) %>%
  data_color(
    columns = VIF,
    colors = scales::col_numeric(
      palette = c(chart_colors["Series3"], chart_colors["Series5"], chart_colors["Series1"]),
      domain = c(1, max(vif_df$VIF))
    )
  ) %>%
  tab_style(
    style = cell_text(weight = "bold"),
    locations = cells_body(
      columns = Interpretation,
      rows = VIF >= 5
    )
  ) %>%
  tab_footnote(
    footnote = "VIF < 2.5: No multicollinearity concern",
    locations = cells_column_labels(columns = VIF)
  ) %>%
  tab_footnote(
    footnote = "VIF 2.5-5: Low concern (acceptable)",
    locations = cells_column_labels(columns = VIF)
  ) %>%
  tab_footnote(
    footnote = "VIF 5-10: Moderate concern (caution advised)",
    locations = cells_column_labels(columns = VIF)
  ) %>%
  tab_footnote(
    footnote = "VIF > 10: High concern (problematic)",
    locations = cells_column_labels(columns = VIF)
  )

Variable	VIF Value^1,2,3,4	Assessment
Variance Inflation Factors (VIF)
Assessment of Multicollinearity in Model 2
Square Footage	31.82	High concern
Square Footage²	31.82	High concern
Property Age²	16.18	High concern
Property Age	16.17	High concern
Lot Size	1.01	No concern
Bathrooms	1.01	No concern
Garage Spaces	1.01	No concern
Bedrooms	1.01	No concern
busy_street	1.01	No concern
¹ VIF < 2.5: No multicollinearity concern
² VIF 2.5-5: Low concern (acceptable)
³ VIF 5-10: Moderate concern (caution advised)
⁴ VIF > 10: High concern (problematic)

Understanding the VIF Results:

The VIF analysis reveals several important patterns in our model:

Variables with High VIF (≥5):

The highest VIF values are observed for Square Footage and Square Footage² (VIF 31.8 and 31.8 respectively), as well as Property Age and Property Age² (VIF 16.2 and 16.2).

This is expected and not problematic for two reasons:

Structural Collinearity: By design, the quadratic terms (sqft² and age²) are mathematically related to their linear counterparts. This creates high VIF values but is an intentional modeling choice to capture non-linear relationships. This type of collinearity does not invalidate our results.
Primary Variable Unaffected: Most importantly, our key variable of interest - Busy Street - has a very low VIF (), indicating it is not correlated with other predictors. This means our estimate of the busy street effect is reliable and not confounded by multicollinearity.

Practical Implications:

The busy street coefficient is trustworthy (VIF < 2), meaning we can confidently interpret its magnitude and significance
High VIF for polynomial terms is a statistical artifact of the model specification, not a data quality issue
Control variables (bedrooms, bathrooms, lot size, garage) all show acceptable VIF values (< 5)
Our conclusion about the 7-8% busy street discount is not affected by multicollinearity concerns

Technical Note: In models with polynomial terms, centering the variables (subtracting the mean before squaring) can reduce VIF values, but this doesn’t change the model’s predictions or the interpretation of the busy street effect. The current specification is appropriate for our analysis.

Key Findings

Show code

# Extract busy street coefficient from Model 2 (preferred specification)
busy_coef <- coef(model2)["busy_streetTRUE"]
busy_pct_effect <- (exp(busy_coef) - 1) * 100

# Calculate dollar impact at median price
median_price <- median(properties_clean$sale_price)
dollar_impact <- median_price * (exp(busy_coef) - 1)

# Statistical significance
busy_pvalue <- summary(model2)$coefficients["busy_streetTRUE", "Pr(>|t|)"]
is_significant <- busy_pvalue < 0.05

Primary Finding: The Busy Street Discount

Based on our preferred hedonic model (Model 2), properties fronting on busy streets (≥10,000 AADT) sell for:

7.81% less than comparable properties on quieter streets (p < 0.001).

At the median sale price of $343,425, this translates to a discount of approximately $26,818.08.

Dollar Impact by Price Range

Show code

# Calculate impact across price ranges
price_quantiles <- quantile(properties_clean$sale_price, probs = seq(0.1, 0.9, 0.1))
impact_data <- data.frame(
  percentile = paste0(seq(10, 90, 10), "th"),
  base_price = price_quantiles,
  discount_pct = abs(busy_pct_effect),
  discount_dollars = price_quantiles * abs(exp(busy_coef) - 1)
)

# Create visualization
p <- ggplot(impact_data, aes(x = factor(percentile, levels = percentile), 
                             y = discount_dollars, group = 1)) +
  geom_line(color = chart_colors["Series1"], linewidth = 2) +
  geom_point(color = chart_colors["Series1"], size = 4) +
  geom_text(aes(label = dollar(round(discount_dollars, 0))), 
            vjust = -1, size = 3.5, color = chart_colors["Series2"]) +
  scale_y_continuous(labels = dollar_format(), 
                     expand = expansion(mult = c(0, 0.15))) +
  labs(
    title = "Busy Street Discount by Property Price Level",
    subtitle = paste0("Based on ", round(abs(busy_pct_effect), 1), "% discount across price distribution"),
    x = "Price Percentile",
    y = "Estimated Discount ($)"
  ) +
  theme(axis.text.x = element_text(angle = 0))

print(p)

Sensitivity Analysis

Robustness Checks

Show code

# Test different AADT thresholds
thresholds <- c(5000, 7500, 10000, 12500, 15000, 20000)
sensitivity_results <- data.frame()

for(threshold in thresholds) {
  temp_data <- properties_clean %>%
    mutate(busy_temp = station_aadt >= threshold)
  
  temp_model <- lm(log_price ~ busy_temp + sqft + I(sqft^2) + bedrooms + 
                     bathrooms + property_age + I(property_age^2) + 
                     lot_size + garage_spaces,
                   data = temp_data)
  
  coef_val <- coef(temp_model)["busy_tempTRUE"]
  se_val <- sqrt(diag(vcovHC(temp_model, type = "HC1")))["busy_tempTRUE"]
  p_val <- summary(temp_model)$coefficients["busy_tempTRUE", "Pr(>|t|)"]
  
  sensitivity_results <- rbind(sensitivity_results, data.frame(
    threshold = threshold,
    coefficient = coef_val,
    std_error = se_val,
    p_value = p_val,
    pct_effect = (exp(coef_val) - 1) * 100,
    n_busy = sum(temp_data$busy_temp),
    pct_busy = 100 * sum(temp_data$busy_temp) / nrow(temp_data)
  ))
}

# Plot sensitivity
p1 <- ggplot(sensitivity_results, aes(x = threshold, y = pct_effect)) +
  geom_line(color = chart_colors["Series1"], linewidth = 1.5) +
  geom_point(color = chart_colors["Series1"], size = 4) +
  geom_hline(yintercept = 0, linetype = "dashed", color = chart_colors["Series2"]) +
  geom_ribbon(aes(ymin = (exp(coefficient - 1.96*std_error) - 1)*100,
                  ymax = (exp(coefficient + 1.96*std_error) - 1)*100),
              alpha = 0.2, fill = chart_colors["Series1"]) +
  scale_x_continuous(labels = comma) +
  labs(
    title = "Effect Size by AADT Threshold",
    subtitle = "How the busy street discount varies with different definitions of 'busy'",
    x = "AADT Threshold",
    y = "Price Effect (%)",
    caption = "Shaded area represents 95% confidence interval"
  )

p2 <- ggplot(sensitivity_results, aes(x = threshold, y = n_busy)) +
  geom_col(fill = chart_colors["Series3"], alpha = 0.7) +
  geom_text(aes(label = paste0(round(pct_busy, 1), "%")), 
            vjust = -0.5, size = 3.5, color = chart_colors["Series2"]) +
  scale_x_continuous(labels = comma) +
  scale_y_continuous(labels = comma) +
  labs(
    title = "Sample Size by Threshold",
    x = "AADT Threshold",
    y = "Number of 'Busy' Properties"
  )

grid.arrange(p1, p2, ncol = 1)

Robustness Note

The negative price effect is consistent across different AADT thresholds, with the strongest effects observed at moderate thresholds (7,500-12,500 AADT). At very high thresholds (>20,000), the effect remains negative but with wider confidence intervals due to smaller sample sizes.

Price per Square Foot Analysis

Show code

# Alternative specification using price per sqft
model_ppsf <- lm(price_per_sqft ~ busy_street + sqft + bedrooms + bathrooms + 
                   property_age + lot_size + garage_spaces,
                 data = properties_clean)

# Summary
summary_ppsf <- tidy(model_ppsf, conf.int = TRUE) %>%
  filter(term == "busy_streetTRUE")

# Visualization
comparison <- properties_clean %>%
  group_by(busy_street_text) %>%
  summarise(
    mean_ppsf = mean(price_per_sqft),
    median_ppsf = median(price_per_sqft),
    se_ppsf = sd(price_per_sqft) / sqrt(n())
  )

ggplot(comparison, aes(x = busy_street_text, y = mean_ppsf, fill = busy_street_text)) +
  geom_col(alpha = 0.7) +
  geom_errorbar(aes(ymin = mean_ppsf - 1.96*se_ppsf, 
                    ymax = mean_ppsf + 1.96*se_ppsf),
                width = 0.2, linewidth = 1) +
  geom_text(aes(label = dollar(round(mean_ppsf, 2))), 
            vjust = -2, size = 5, fontface = "bold") +
  scale_fill_manual(values = c(
    "Busy Street (≥10K AADT)" = chart_colors["Series1"],
    "Non-Busy Street (<10K AADT)" = chart_colors["Series3"]
  )) +
  scale_y_continuous(labels = dollar, expand = expansion(mult = c(0, 0.15))) +
  labs(
    title = "Mean Price per Square Foot by Street Type",
    subtitle = "Error bars represent 95% confidence intervals",
    x = NULL,
    y = "Price per Square Foot"
  ) +
  theme(legend.position = "none",
        axis.text.x = element_text(angle = 0))

Price/Sqft Difference: Properties on busy streets have $10.79 lower price per square foot (p < 0.001).

Property Characteristics Balance

To ensure our results aren’t driven by systematic differences in property characteristics between busy and non-busy street properties (rather than the traffic itself), we examine the balance of covariates. This is similar to checking whether treatment and control groups are comparable in an experiment.

What is Standardized Difference?

The standardized difference measures how different the means of two groups are, expressed in standard deviation units:

\[\text{Std Diff} = \frac{|\text{Mean}_{\text{Busy}} - \text{Mean}_{\text{Non-Busy}}|}{\sqrt{(\text{SD}_{\text{Busy}}^2 + \text{SD}_{\text{Non-Busy}}^2)/2}}\]

Interpretation Guidelines: - < 0.10: Excellent balance (negligible difference) - 0.10 - 0.25: Acceptable balance (small difference) - 0.25 - 0.50: Moderate imbalance (noteworthy difference) - > 0.50: Large imbalance (serious concern)

These thresholds come from the causal inference literature, where standardized differences < 0.10 are considered indicative of good covariate balance in observational studies.

Show code

# Create balance table
balance_vars <- c("sqft", "bedrooms", "bathrooms", "property_age", "lot_size", "garage_spaces")
balance_data <- properties_clean %>%
  select(busy_street, all_of(balance_vars)) %>%
  pivot_longer(-busy_street, names_to = "variable", values_to = "value") %>%
  group_by(variable, busy_street) %>%
  summarise(
    mean = mean(value, na.rm = TRUE),
    sd = sd(value, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  pivot_wider(
    names_from = busy_street,
    values_from = c(mean, sd),
    names_sep = "_"
  ) %>%
  mutate(
    std_diff = abs(mean_TRUE - mean_FALSE) / sqrt((sd_TRUE^2 + sd_FALSE^2) / 2),
    variable_label = case_when(
      variable == "sqft" ~ "Square Footage",
      variable == "bedrooms" ~ "Bedrooms",
      variable == "bathrooms" ~ "Bathrooms",
      variable == "property_age" ~ "Property Age",
      variable == "lot_size" ~ "Lot Size",
      variable == "garage_spaces" ~ "Garage Spaces"
    ),
    balance_assessment = case_when(
      std_diff < 0.10 ~ "Excellent",
      std_diff >= 0.10 & std_diff < 0.25 ~ "Acceptable",
      std_diff >= 0.25 & std_diff < 0.50 ~ "Moderate",
      std_diff >= 0.50 ~ "Poor"
    )
  )

# Create detailed balance table
balance_data %>%
  select(variable_label, mean_FALSE, mean_TRUE, std_diff, balance_assessment) %>%
  gt() %>%
  tab_header(
    title = "Covariate Balance Assessment",
    subtitle = "Comparison of Property Characteristics Between Street Types"
  ) %>%
  cols_label(
    variable_label = "Variable",
    mean_FALSE = "Non-Busy Mean",
    mean_TRUE = "Busy Mean",
    std_diff = "Std. Difference",
    balance_assessment = "Balance"
  ) %>%
  fmt_number(
    columns = c(mean_FALSE, mean_TRUE),
    decimals = 1
  ) %>%
  fmt_number(
    columns = std_diff,
    decimals = 3
  ) %>%
  data_color(
    columns = std_diff,
    colors = scales::col_numeric(
      palette = c(chart_colors["Series3"], chart_colors["Series5"], chart_colors["Series1"]),
      domain = c(0, max(balance_data$std_diff))
    ),
    apply_to = "fill"   # only affects background
  ) %>%
  tab_style(
    style = cell_text(color = "black"),   # <— ensures text stays black
    locations = cells_body(columns = std_diff)
  ) %>%
  tab_style(
    style = cell_fill(color = "#e8f4ea"),
    locations = cells_body(rows = std_diff < 0.10)
  ) %>%
  tab_style(
    style = cell_fill(color = "#fff9e6"),
    locations = cells_body(rows = std_diff >= 0.10 & std_diff < 0.25)
  ) %>%
  tab_footnote(
    footnote = "Standardized difference < 0.10 indicates excellent balance",
    locations = cells_column_labels(columns = std_diff)
  ) %>%
  tab_source_note(
    source_note = "Green shading = Excellent balance; Yellow shading = Acceptable balance"
  )

Variable	Non-Busy Mean	Busy Mean	Std. Difference¹	Balance
Covariate Balance Assessment
Comparison of Property Characteristics Between Street Types
Bathrooms	2.4	2.5	0.113	Acceptable
Bedrooms	3.7	3.7	0.052	Excellent
Garage Spaces	1.9	1.8	0.051	Excellent
Lot Size	7,496.5	7,581.3	0.029	Excellent
Property Age	37.6	38.3	0.032	Excellent
Square Footage	2,218.1	2,198.5	0.033	Excellent
¹ Standardized difference < 0.10 indicates excellent balance
Green shading = Excellent balance; Yellow shading = Acceptable balance

Show code

# Balance plot
ggplot(balance_data, aes(x = std_diff, y = reorder(variable_label, std_diff))) +
  geom_col(aes(fill = balance_assessment), alpha = 0.7) +
  geom_vline(xintercept = 0.10, linetype = "dashed",
             color = chart_colors["Series1"], linewidth = 1) +
  geom_vline(xintercept = 0.25, linetype = "dotted",
             color = chart_colors["Series1"], linewidth = 1) +
  scale_fill_manual(values = c(
    "Excellent"  = chart_colors["Series3"],
    "Acceptable" = chart_colors["Series5"],
    "Moderate"   = "#FF9800",
    "Poor"       = chart_colors["Series1"]
  )) +
  # Adjust annotation positions slightly left of the lines
  annotate("text", x = 0.10, y = Inf, label = "Excellent Balance",
           vjust = 1.2, hjust = -0.1, color = chart_colors["Series1"], size = 4) +
  annotate("text", x = 0.25, y = Inf, label = "Acceptable Balance",
           vjust = 1.2, hjust = 1.1, color = chart_colors["Series1"], size = 4) +
  coord_cartesian(clip = "off") +
  labs(
    title = "Covariate Balance Between Busy and Non-Busy Properties",
    subtitle = "Standardized differences in means",
    x = "Absolute Standardized Difference",
    y = NULL,
    fill = "Balance Assessment"
  ) +
  theme(
    panel.grid.major.y = element_blank(),
    legend.position = "bottom",
    plot.margin = margin(10, 40, 10, 10) # give space for labels
  )

Balance Assessment Summary:

Show code

library(gt)
library(scales)

max_std_diff <- max(balance_data$std_diff, na.rm = TRUE)
n_above_threshold <- sum(balance_data$std_diff >= 0.10, na.rm = TRUE)

balance_summary_gt <- balance_data %>%
  group_by(balance_assessment) %>%
  summarise(
    n_variables = n(),
    variables = paste(variable_label, collapse = ", "),
    .groups = "drop"
  ) %>%
  arrange(match(balance_assessment, c("Excellent", "Acceptable", "Moderate", "Poor"))) %>%
  mutate(
    balance_color = case_when(
      balance_assessment == "Excellent"  ~ chart_colors["Series3"],  # Aqua
      balance_assessment == "Acceptable" ~ chart_colors["Series5"],  # Gold
      balance_assessment == "Moderate"   ~ "#FF9800",                # Amber
      balance_assessment == "Poor"       ~ chart_colors["Series1"]   # Burnt orange
    ),
    label = paste0(
      "<b style='color:", balance_color, "'>", balance_assessment, " Balance</b>"
    )
  ) %>%
  select(label, n_variables, variables) %>%
  gt() %>%
  tab_header(
    title = md("<span style='color:#BF5700;'>Covariate Balance Summary</span>"),
    subtitle = md("Distribution of variable-level balance across covariate groups")
  ) %>%
  cols_label(
    label = md("**Balance Category**"),
    n_variables = md("**Number of Variables**"),
    variables = md("**Variables Included**")
  ) %>%
  fmt_markdown(columns = label) %>%
  tab_style(
    style = cell_text(weight = "bold", color = "#333F48"),
    locations = cells_column_labels(everything())
  ) %>%
  tab_style(
    style = cell_borders(sides = "bottom", color = "#BF5700", weight = px(2)),
    locations = cells_title(groups = "title")
  ) %>%
  # Enable default row striping (banded rows)
  opt_row_striping() %>%
  tab_options(
    table.border.top.width = px(2),
    table.border.top.color = "#BF5700",
    table.font.names = c("Arial", "Helvetica", "sans-serif"),
    data_row.padding = px(6),
    heading.align = "left",
    row.striping.background_color = "#F9F9F9"  # sets stripe color safely
  ) %>%
  tab_source_note(
    source_note = md(
      "Color-coded categories reflect RealWise balance thresholds:  
      <span style='color:#61A5C2;'>Excellent</span> = &lt;0.10,  
      <span style='color:#F2A900;'>Acceptable</span> = 0.10–0.25,  
      <span style='color:#FF9800;'>Moderate</span> = 0.25–0.50,  
      <span style='color:#BF5700;'>Poor</span> ≥ 0.50."
    )
  )


balance_summary_gt

Balance Category	Number of Variables	Variables Included
Covariate Balance Summary
Distribution of variable-level balance across covariate groups
Excellent Balance	5	Bedrooms, Garage Spaces, Lot Size, Property Age, Square Footage
Acceptable Balance	1	Bathrooms
Color-coded categories reflect RealWise balance thresholds: Excellent = <0.10, Acceptable = 0.10–0.25, Moderate = 0.25–0.50, Poor ≥ 0.50.

Overall Balance Assessment

5 out of 6 variables show excellent balance (< 0.10), and 6 show at least acceptable balance (< 0.25). This indicates good overall balance between groups, supporting the validity of our regression estimates.

Detailed Results Table

Show code

# Create detailed results for export
results_detail <- properties_clean %>%
  mutate(
    predicted_price = exp(predict(model2)),
    residual = sale_price - predicted_price,
    pct_error = 100 * residual / sale_price
  ) %>%
  select(
    property_id,
    address,
    sale_price,
    predicted_price,
    residual,
    price_per_sqft,
    busy_street_text,
    station_aadt,
    station_road,
    sqft,
    bedrooms,
    bathrooms,
    property_age,
    lot_size,
    garage_spaces,
    distance_to_station_m
  ) %>%
  arrange(desc(abs(residual)))

# Create the datatable
table_html <- datatable(
  results_detail,
  caption = "Detailed Property Analysis Results (Sortable)",
  filter = "top",
  options = list(
    pageLength = 25,
    scrollX = TRUE,
    columnDefs = list(list(className = 'dt-center', targets = "_all"))
  ),
  rownames = FALSE
) %>%
  formatCurrency(c("sale_price", "predicted_price", "residual", "price_per_sqft"), digits = 0) %>%
  formatRound(c("distance_to_station_m"), digits = 1) %>%
  formatRound(c("sqft", "lot_size", "station_aadt"), digits = 0)

# Write collapsible HTML container
cat('
<details>
<summary><b>Show Detailed Property Analysis Table</b></summary><br>
')

Show Detailed Property Analysis Table

Show code

# Print datatable directly (it already renders as HTML)
table_html

Show code

cat('</details>')

Methodological Caveats

Synthetic data: Results are illustrative; coefficients are not intended to represent actual Houston market magnitudes.
Endogeneity: Traffic volumes and property values may be jointly determined; causality can run both ways. Future work might use instrumental variables or panel data.
Spatial dependence: Nearby sales share unobservables; spatial error or lag models and Moran’s I can diagnose/correct clustering.
Measurement error: AADT is an annual average; peak-hour conditions may have different marginal effects.
Model specification: Alternative elasticities (e.g., log–log), neighborhood fixed effects, and heterogeneous effects by submarket are fruitful robustness checks.

Conclusions & Implications

Summary of Findings

Significant Price Discount: Properties fronting on busy streets (≥10,000 AADT) sell for approximately 7.8% less than comparable properties, after controlling for property characteristics.
Economic Magnitude: At the median sale price of $343,425, this represents a discount of about $26,818.08.
Robust Results: The negative effect is:
- Statistically significant (p < 0.001)
- Consistent across multiple model specifications
- Robust to different AADT thresholds
- Not driven by differences in property characteristics
Progressive Impact: The absolute dollar discount increases with property value, but the percentage effect remains relatively constant across the price distribution.

Practical Implications

For Home Buyers: - Properties on busy streets offer a meaningful discount - Consider whether noise, safety, and air quality trade-offs are worth the savings - The discount appears to accurately price these externalities

For Sellers: - Busy street location represents a significant pricing challenge - Emphasize property features and improvements to offset location discount - Consider noise mitigation improvements (windows, landscaping)

For Appraisers: - Use 7-8% adjustment for properties on roads with ≥10,000 AADT - Adjust proportionally for intermediate traffic levels - Consider distance from road as additional factor

For Policy Makers: - Traffic impacts create measurable economic costs for residents - Traffic calming measures may help preserve property values - Consider compensation mechanisms for residents on major arterials

Methodology Notes

Data Source

Properties: 1,485 single-family homes in Houston, Texas
Traffic Data: Texas Department of Transportation (TxDOT) AADT counts from 2024
Busy Street Definition: Roads with ≥10,000 vehicles per day (FHWA arterial classification)

Statistical Approach

Primary Model: Semi-log hedonic regression (log price on levels)
Standard Errors: Heteroskedasticity-robust (White’s HC1)
Model Selection: Quadratic specification (Model 2) preferred based on R², AIC, and residual diagnostics

Limitations

Analysis limited to properties within 87 meters of TxDOT counting stations
Traffic counts represent average conditions; peak hour impacts may differ
Cross-sectional analysis; panel data would allow for stronger causal inference
Other traffic-related factors (noise, pollution, accidents) not directly measured

Analysis Date: 2025-11-04

Contact: RealWise Analytics

This analysis was conducted using R statistical software with hedonic pricing methodology standard in real estate economics research.

Future Research Directions

Causal identification using instruments (historic road alignments, exogenous traffic shocks) or difference-in-differences with roadway changes.
Spatial econometrics, testing spatial lag/error specifications and comparing fit against OLS with robust SE.
Heterogeneous treatment effects by school district, price tier, or distance to employment centers.
Noise- and pollution-specific channels, pairing TxDOT counts with modeled noise contours or ambient air quality to decompose the busy street discount.
Temporal dynamics, using panel data or repeat-sales to track how traffic impacts change with macro conditions.