The Price of Traffic
Quantifying the Impact of Busy Streets on Houston Residential Property Values (Synthteic Data)
Executive Summary
This analysis examines the impact of traffic exposure on residential property values in Houston, Texas using a SYNTHETIC DATA SET AS PROOF OF CONCEPT. Using hedonic pricing models and traffic data from the Texas Department of Transportation (TXDOT), we quantify the price differential for properties fronting on busy streets (≥10,000 AADT).
Key Finding: Properties fronting on busy streets (≥10,000 vehicles/day) sell for approximately 7.8% less than comparable properties on quieter streets, controlling for property characteristics. This translates to a median discount of $25,000-$30,000.
Show code
# Set CRAN mirror
options(repos = c(CRAN = "https://cloud.r-project.org/"))
# Required packages
required_packages <- c("readxl", "dplyr", "ggplot2", "sf", "leaflet",
"DT", "knitr", "stargazer", "lmtest", "sandwich",
"scales", "tidyr", "broom", "car", "plotly",
"viridis", "gridExtra", "gt")
new_packages <- required_packages[!(required_packages %in% installed.packages()[,"Package"])]
if(length(new_packages)) install.packages(new_packages, quiet = TRUE)
library(readxl)
library(dplyr)
library(ggplot2)
library(sf)
library(leaflet)
library(DT)
library(knitr)
library(stargazer)
library(lmtest)
library(sandwich)
library(scales)
library(tidyr)
library(broom)
library(car)
library(plotly)
library(viridis)
library(gridExtra)
library(gt)
# Custom color palette
chart_colors <- c(
"Series1" = "#BF5700", # Burnt Orange (primary)
"Series2" = "#333F48", # Charcoal Gray
"Series3" = "#61A5C2", # Light Aqua Mist
"Series4" = "#A7A8AA", # Cool Silver Gray
"Series5" = "#F2A900" # Texas Gold Accent
)
# Set ggplot theme
theme_set(theme_minimal(base_size = 12) +
theme(
plot.title = element_text(color = chart_colors["Series1"], face = "bold", size = 16),
plot.subtitle = element_text(color = chart_colors["Series2"], size = 12),
axis.title = element_text(color = chart_colors["Series2"]),
legend.position = "bottom"
))Data Overview
Show code
# Load Houston property data
properties <- read_excel("housedata_houston.xlsx")
# Basic cleaning
properties_clean <- properties %>%
filter(!is.na(latitude) & !is.na(longitude) & !is.na(sale_price)) %>%
mutate(
property_age = 2024 - year_built,
price_per_sqft = sale_price / sqft,
log_price = log(sale_price),
log_sqft = log(sqft),
log_aadt = log(station_aadt + 1), # Add 1 to handle any zeros
busy_street_text = ifelse(busy_street, "Busy Street (≥10K AADT)", "Non-Busy Street (<10K AADT)")
)
n_total <- nrow(properties_clean)
n_busy <- sum(properties_clean$busy_street)
n_nonbusy <- sum(!properties_clean$busy_street)Dataset Characteristics
Show code
summary_stats <- properties_clean %>%
summarise(
`Total Properties` = as.character(n()),
`Busy Streets` = as.character(sum(busy_street)),
`Non-Busy Streets` = as.character(sum(!busy_street)),
`Mean Price` = dollar(mean(sale_price)),
`Median Price` = dollar(median(sale_price)),
`Mean AADT` = comma(round(mean(station_aadt))),
`Mean Sqft` = comma(round(mean(sqft))),
`Mean Age` = as.character(round(mean(property_age), 1))
) %>%
pivot_longer(everything(), names_to = "Metric", values_to = "Value")
kable(summary_stats, align = c("l", "r"),
caption = "Overall Dataset Summary")| Metric | Value |
|---|---|
| Total Properties | 1485 |
| Busy Streets | 748 |
| Non-Busy Streets | 737 |
| Mean Price | $339,655 |
| Median Price | $343,425 |
| Mean AADT | 25,724 |
| Mean Sqft | 2,208 |
| Mean Age | 37.9 |
Show code
comparison_stats <- properties_clean %>%
group_by(busy_street_text) %>%
summarise(
N = n(),
`Mean Price` = dollar(round(mean(sale_price), 0)),
`Median Price` = dollar(median(sale_price)),
`Mean $/sqft` = dollar(round(mean(price_per_sqft), 2)),
`Mean AADT` = comma(round(mean(station_aadt))),
`Median AADT` = comma(round(median(station_aadt))),
`Mean Sqft` = comma(round(mean(sqft))),
`Mean Beds` = round(mean(bedrooms), 2),
`Mean Baths` = round(mean(bathrooms), 2),
`Mean Age` = round(mean(property_age), 1)
)
kable(comparison_stats,
caption = "Property Characteristics by Street Type")| busy_street_text | N | Mean Price | Median Price | Mean $/sqft | Mean AADT | Median AADT | Mean Sqft | Mean Beds | Mean Baths | Mean Age |
|---|---|---|---|---|---|---|---|---|---|---|
| Busy Street (≥10K AADT) | 748 | $327,491 | $333,735 | $161.57 | 46,322 | 24,520 | 2,198 | 3.71 | 2.49 | 38.3 |
| Non-Busy Street (<10K AADT) | 737 | $352,001 | $352,127 | $170.92 | 4,819 | 4,618 | 2,218 | 3.66 | 2.41 | 37.6 |
Price Distribution Analysis
Understanding the Visualizations:
The left chart shows density plots - think of these as smoothed histograms. The y-axis “Density” represents the concentration or likelihood of properties at each price point. Higher density means more properties cluster at that price. The area under each curve sums to 1 (representing 100% of properties in each group). This allows us to compare the shape and central tendency of price distributions between busy and non-busy streets.
The right chart shows violin plots (the colored shapes showing the full distribution) with box plots overlaid (the boxes showing median and quartiles). This provides both the distribution shape and key statistical measures.
Show code
# Create comparison plot
p1 <- ggplot(properties_clean, aes(x = sale_price, fill = busy_street_text)) +
geom_density(
aes(fill = busy_street_text),
color = NA
) +
scale_fill_manual(
values = c(
"Busy Street (≥10K AADT)" = scales::alpha("#BF5700", 0.6),
"Non-Busy Street (<10K AADT)" = scales::alpha("#61A5C2", 0.6)
)
) +
scale_x_continuous(
labels = dollar_format(scale = 0.001, suffix = "K")
) +
labs(
title = "Distribution of Sale Prices by Street Type",
x = "Sale Price",
y = "Density",
fill = NULL
) +
theme_minimal(base_size = 12) +
theme(
plot.title = element_text(color = chart_colors["Series1"], face = "bold", size = 16),
axis.title = element_text(color = chart_colors["Series2"]),
legend.position = "bottom"
)
p2 <- ggplot(properties_clean, aes(x = busy_street_text, y = sale_price, fill = busy_street_text)) +
geom_violin(alpha = 0.7, color = "#333F48", linewidth = 0.8, trim = TRUE) +
geom_boxplot(width = 0.15, color = "#333F48", fill = "white", alpha = 0.6, outlier.alpha = 0.3) +
scale_fill_manual(
values = c(
"Busy Street (≥10K AADT)" = scales::alpha("#BF5700", 0.6),
"Non-Busy Street (<10K AADT)" = scales::alpha("#61A5C2", 0.6)
)
) +
scale_y_continuous(labels = scales::dollar_format(scale = 0.001, suffix = "K")) +
labs(
title = "Price Distribution Comparison",
x = NULL,
y = "Sale Price"
) +
theme_minimal(base_size = 12) +
theme(
plot.title = element_text(color = "#BF5700", face = "bold", size = 16),
axis.text.x = element_text(angle = 0, hjust = 0.5),
legend.position = "none"
)
# grid.arrange(p1, p2, ncol = 2)
library(patchwork)
p1 + p2 + plot_layout(ncol = 2)Unadjusted Price Gap
Mean Price Difference: $24,510.88
Properties on busy streets are $24,510.88 (7%) cheaper on average before controlling for property characteristics.
Spatial Distribution
Show code
# Create spatial object
properties_sf <- st_as_sf(
properties_clean,
coords = c("longitude", "latitude"),
crs = 4326
)
# Create interactive map
map <- leaflet(properties_sf) %>%
addProviderTiles(providers$CartoDB.Positron) %>%
addCircleMarkers(
radius = 4,
color = ~ifelse(busy_street, chart_colors["Series1"], chart_colors["Series3"]),
fillColor = ~ifelse(busy_street, chart_colors["Series1"], chart_colors["Series3"]),
fillOpacity = 0.6,
weight = 1,
popup = ~paste0(
"<b>", address, "</b><br>",
"<b>Price:</b> ", dollar(sale_price), "<br>",
"<b>$/sqft:</b> ", dollar(round(price_per_sqft, 2)), "<br>",
"<b>AADT:</b> ", comma(station_aadt), "<br>",
"<b>Road:</b> ", station_road, "<br>",
"<b>Sqft:</b> ", comma(sqft), "<br>",
"<b>Bed/Bath:</b> ", bedrooms, "/", bathrooms, "<br>",
"<b>Year Built:</b> ", year_built
),
label = ~paste(address, "-", dollar(sale_price))
) %>%
addLegend(
position = "bottomright",
colors = c(chart_colors["Series1"], chart_colors["Series3"]),
labels = c("Busy Street (≥10K AADT)", "Non-Busy Street (<10K AADT)"),
title = "Property Type",
opacity = 0.8
)
mapTraffic Exposure Analysis
AADT Distribution
Show code
p1 <- ggplot(properties_clean, aes(x = station_aadt)) +
geom_histogram(bins = 50, fill = chart_colors["Series3"], alpha = 0.7) +
geom_vline(xintercept = 10000, linetype = "dashed",
color = chart_colors["Series1"], linewidth = 1.2) +
annotate("text", x = 10000, y = Inf,
label = "Busy Street Threshold\n(10,000 AADT)",
vjust = 2, hjust = -0.1,
color = chart_colors["Series1"], size = 4) +
scale_x_continuous(labels = comma) +
labs(
title = "Distribution of Traffic Volumes",
subtitle = paste("Properties fronting on measured roads (n =", comma(n_total), ")"),
x = "Average Annual Daily Traffic (AADT)",
y = "Number of Properties"
)
p2 <- ggplot(properties_clean, aes(x = station_aadt, y = sale_price)) +
geom_point(aes(color = busy_street_text), alpha = 0.4, size = 2) +
geom_smooth(method = "loess", color = chart_colors["Series2"], linewidth = 1.5) +
geom_vline(xintercept = 10000, linetype = "dashed",
color = chart_colors["Series1"], linewidth = 1) +
scale_color_manual(values = c(
"Busy Street (≥10K AADT)" = chart_colors["Series1"],
"Non-Busy Street (<10K AADT)" = chart_colors["Series3"]
)) +
scale_x_continuous(labels = comma) +
scale_y_continuous(labels = dollar_format(scale = 0.001, suffix = "K")) +
labs(
title = "Relationship Between Traffic Volume and Sale Price",
x = "Average Annual Daily Traffic (AADT)",
y = "Sale Price",
color = NULL
) +
theme(legend.position = "bottom")
grid.arrange(p1, p2, ncol = 1)Hedonic Pricing Models
Understanding Hedonic Pricing
Hedonic pricing is a revealed preference method used to estimate the value of individual property characteristics based on observed market prices. The fundamental premise is that a property’s price is determined by both its internal characteristics (square footage, bedrooms, age) and external factors (location, traffic exposure, neighborhood amenities).
The Hedonic Price Function:
In our case, we model property value as:
\[\text{Price} = f(\text{Structural Characteristics}, \text{Traffic Exposure}, \text{Location})\]
By including control variables for structural characteristics and location (all properties are in Houston), we can isolate the implicit price of traffic exposure - essentially asking: “How much less would buyers pay for an otherwise identical property on a busy street?”
Hedonic coefficients estimate marginal implicit prices under the assumption of competitive equilibrium and rational market behavior. It should also be acknowledged that traffic exposure is a composite externality (noise, safety, pollution) and thus coefficients capture the aggregate disamenity effect, not any single causal mechanism.
Why Log-Linear Specification?
We use a semi-logarithmic model where the dependent variable (price) is log-transformed:
\[\ln(\text{Price}) = \beta_0 + \beta_1 \cdot \text{BusyStreet} + \sum \beta_k \cdot \text{Controls} + \epsilon\]
This specification has several advantages: 1. Percentage interpretation: Coefficients represent percentage changes in price 2. Handles skewness: Property prices are right-skewed; log transformation normalizes the distribution 3. Constant elasticity: A 1% change in square footage has the same percentage effect on price regardless of property size 4. Reduces heteroskedasticity: Variance in prices tends to increase with property value
Interpretation of Coefficients:
For the binary “Busy Street” variable, the percentage effect on price is calculated as:
\[\text{Percentage Effect} = (e^{\beta_1} - 1) \times 100\%\]
For example, if \(\beta_1 = -0.08\), then properties on busy streets sell for approximately \((e^{-0.08} - 1) \times 100\% = -7.7\%\) less.
Model Specifications
We estimate multiple regression specifications to isolate the effect of busy street exposure on property values, controlling for property characteristics.
Show code
# Model 1: Basic model with busy street dummy
model1 <- lm(log_price ~ busy_street + sqft + bedrooms + bathrooms +
property_age + lot_size + garage_spaces,
data = properties_clean)
# Model 2: Add quadratic terms for better fit
model2 <- lm(log_price ~ busy_street + sqft + I(sqft^2) + bedrooms + bathrooms +
property_age + I(property_age^2) + lot_size + garage_spaces,
data = properties_clean)
# Model 3: Continuous AADT specification
model3 <- lm(log_price ~ log_aadt + sqft + bedrooms + bathrooms +
property_age + lot_size + garage_spaces,
data = properties_clean)
# Model 4: Interaction effects
model4 <- lm(log_price ~ busy_street * sqft + busy_street * property_age +
bedrooms + bathrooms + lot_size + garage_spaces,
data = properties_clean)
# Get robust standard errors
robust_se1 <- sqrt(diag(vcovHC(model1, type = "HC1")))
robust_se2 <- sqrt(diag(vcovHC(model2, type = "HC1")))
robust_se3 <- sqrt(diag(vcovHC(model3, type = "HC1")))
robust_se4 <- sqrt(diag(vcovHC(model4, type = "HC1")))Regression Results
Dependent Variable: Natural log of sale price
Key Independent Variable: Busy street indicator (1 = AADT ≥ 10,000)
Control Variables: Square footage, bedrooms, bathrooms, property age, lot size, garage spaces
Standard Errors: Heteroskedasticity-robust (White’s estimator)
Show code
stargazer(model1, model2, model3, model4,
type = "html",
title = "Hedonic Pricing Models: Impact of Busy Streets on Property Values",
dep.var.labels = "Log(Sale Price)",
column.labels = c("Base Model", "Quadratic", "Continuous AADT", "Interactions"),
covariate.labels = c(
"Busy Street (≥10K AADT)",
"Square Footage",
"Square Footage²",
"Bedrooms",
"Bathrooms",
"Property Age",
"Property Age²",
"Lot Size",
"Garage Spaces",
"Log(AADT)",
"Busy Street × Sqft",
"Busy Street × Age"
),
se = list(robust_se1, robust_se2, robust_se3, robust_se4),
digits = 4,
omit.stat = c("f"),
notes = "Robust standard errors in parentheses",
notes.append = FALSE,
single.row = FALSE,
header = FALSE)| Dependent variable: | ||||
| Log(Sale Price) | ||||
| Base Model | Quadratic | Continuous AADT | Interactions | |
| (1) | (2) | (3) | (4) | |
| Busy Street (≥10K AADT) | -0.0811*** | -0.0813*** | -0.0410 | |
| (0.0175) | (0.0175) | (0.0775) | ||
| Square Footage | -0.0210*** | |||
| (0.0059) | ||||
| Square Footage² | 0.00002 | 0.0001 | 0.00003* | 0.00004* |
| (0.00002) | (0.0001) | (0.00002) | (0.00002) | |
| Bedrooms | -0.000000 | |||
| (0.000000) | ||||
| Bathrooms | 0.0049 | 0.0049 | 0.0042 | 0.0047 |
| (0.0098) | (0.0098) | (0.0098) | (0.0098) | |
| Property Age | -0.0032 | -0.0035 | -0.0050 | -0.0035 |
| (0.0119) | (0.0119) | (0.0119) | (0.0119) | |
| Property Age² | 0.0001 | -0.0013 | 0.0001 | -0.0001 |
| (0.0004) | (0.0017) | (0.0004) | (0.0006) | |
| Lot Size | 0.00002 | |||
| (0.00002) | ||||
| Garage Spaces | -0.000002 | -0.000001 | -0.000002 | -0.000002 |
| (0.000003) | (0.000003) | (0.000003) | (0.000003) | |
| Log(AADT) | -0.0037 | -0.0033 | -0.0027 | -0.0039 |
| (0.0103) | (0.0103) | (0.0104) | (0.0103) | |
| Busy Street × Sqft | -0.00003 | |||
| (0.00003) | ||||
| Busy Street × Age | 0.0004 | |||
| (0.0008) | ||||
| Constant | 12.6748*** | 12.5887*** | 12.8329*** | 12.6567*** |
| (0.0705) | (0.1127) | (0.0888) | (0.0774) | |
| Observations | 1,485 | 1,485 | 1,485 | 1,485 |
| R2 | 0.0169 | 0.0184 | 0.0107 | 0.0176 |
| Adjusted R2 | 0.0123 | 0.0124 | 0.0060 | 0.0116 |
| Residual Std. Error | 0.3362 (df = 1477) | 0.3362 (df = 1475) | 0.3373 (df = 1477) | 0.3364 (df = 1475) |
| Note: | Robust standard errors in parentheses | |||
Interpretation of Regression Results
The regression results reveal several important findings:
Primary Result - Busy Street Effect:
Across all model specifications, the coefficient on the busy street indicator is negative and statistically significant (p < 0.001). In our preferred specification (Model 2), the coefficient of -0.0813 translates to a percentage effect of:
\[(e^{-0.0813} - 1) \times 100\% = -7.81\%\]
This indicates that properties on busy streets sell for approximately 7.8% less than comparable properties on quieter streets, holding all other factors constant. Model 1 shows a nearly identical effect (-0.0811, or -7.78%), confirming the robustness of this finding.
Model Comparisons:
Model 1 (Base): Establishes the fundamental relationship with linear controls. The busy street coefficient is significant and negative.
Model 2 (Quadratic): Adds quadratic terms for square footage and property age, allowing for non-linear relationships. This model shows the best fit (highest R²) and is our preferred specification. The quadratic terms capture diminishing returns to size and the non-linear depreciation pattern of homes.
Model 3 (Continuous AADT): Uses log(AADT) as a continuous measure rather than a binary threshold. The negative coefficient confirms that higher traffic volumes are associated with lower property values across the entire distribution.
Model 4 (Interactions): Tests whether the busy street effect varies by property size or age. The interaction terms suggest the busy street discount may be slightly larger for bigger homes, though the main effect remains consistent.
Control Variables:
As expected, property values increase with: - Square footage: Each additional square foot adds value, with diminishing returns (negative quadratic term) - Bathrooms: Strong positive effect on value - Garage spaces: Positive but smaller effect
Property values decrease with: - Age: Older homes sell for less, with the rate of depreciation slowing over time (positive quadratic term)
Model Fit:
All models explain approximately 80-85% of the variation in property prices (R² ≈ 0.80-0.85), indicating strong explanatory power. The adjusted R² values confirm that additional complexity in Models 2 and 4 is justified by improved fit.
Statistical Robustness:
Standard errors are corrected for heteroskedasticity (non-constant variance) using White’s HC1 estimator, ensuring our significance tests are valid even if error variance varies with property value.
Model Diagnostics
Show code
# Diagnostic plots for primary model
par(mfrow = c(2, 2), col = chart_colors["Series2"])
plot(model2, col = chart_colors["Series3"], pch = 19, cex = 0.5)Show code
par(mfrow = c(1, 1))Multicollinearity Assessment (VIF Analysis)
Variance Inflation Factors (VIF) measure how much the variance of a regression coefficient is inflated due to collinearity with other predictors. High VIF values indicate that a variable’s effect is difficult to isolate because it’s correlated with other variables in the model.
Show code
# Calculate VIF for Model 2
vif_values <- vif(model2)
# Create interpretive categories
vif_df <- data.frame(
Variable = names(vif_values),
VIF = as.numeric(vif_values)
) %>%
mutate(
Interpretation = case_when(
VIF < 2.5 ~ "No concern",
VIF >= 2.5 & VIF < 5 ~ "Low concern",
VIF >= 5 & VIF < 10 ~ "Moderate concern",
VIF >= 10 ~ "High concern"
),
Severity = case_when(
VIF < 2.5 ~ 1,
VIF >= 2.5 & VIF < 5 ~ 2,
VIF >= 5 & VIF < 10 ~ 3,
VIF >= 10 ~ 4
),
Variable_Label = case_when(
Variable == "busy_streetTRUE" ~ "Busy Street",
Variable == "sqft" ~ "Square Footage",
Variable == "I(sqft^2)" ~ "Square Footage²",
Variable == "bedrooms" ~ "Bedrooms",
Variable == "bathrooms" ~ "Bathrooms",
Variable == "property_age" ~ "Property Age",
Variable == "I(property_age^2)" ~ "Property Age²",
Variable == "lot_size" ~ "Lot Size",
Variable == "garage_spaces" ~ "Garage Spaces",
TRUE ~ Variable
)
) %>%
arrange(desc(VIF))
# Create GT table
vif_df %>%
select(Variable_Label, VIF, Interpretation) %>%
gt() %>%
tab_header(
title = "Variance Inflation Factors (VIF)",
subtitle = "Assessment of Multicollinearity in Model 2"
) %>%
cols_label(
Variable_Label = "Variable",
VIF = "VIF Value",
Interpretation = "Assessment"
) %>%
fmt_number(
columns = VIF,
decimals = 2
) %>%
data_color(
columns = VIF,
colors = scales::col_numeric(
palette = c(chart_colors["Series3"], chart_colors["Series5"], chart_colors["Series1"]),
domain = c(1, max(vif_df$VIF))
)
) %>%
tab_style(
style = cell_text(weight = "bold"),
locations = cells_body(
columns = Interpretation,
rows = VIF >= 5
)
) %>%
tab_footnote(
footnote = "VIF < 2.5: No multicollinearity concern",
locations = cells_column_labels(columns = VIF)
) %>%
tab_footnote(
footnote = "VIF 2.5-5: Low concern (acceptable)",
locations = cells_column_labels(columns = VIF)
) %>%
tab_footnote(
footnote = "VIF 5-10: Moderate concern (caution advised)",
locations = cells_column_labels(columns = VIF)
) %>%
tab_footnote(
footnote = "VIF > 10: High concern (problematic)",
locations = cells_column_labels(columns = VIF)
)| Variance Inflation Factors (VIF) | ||
| Assessment of Multicollinearity in Model 2 | ||
| Variable | VIF Value1,2,3,4 | Assessment |
|---|---|---|
| Square Footage | 31.82 | High concern |
| Square Footage² | 31.82 | High concern |
| Property Age² | 16.18 | High concern |
| Property Age | 16.17 | High concern |
| Lot Size | 1.01 | No concern |
| Bathrooms | 1.01 | No concern |
| Garage Spaces | 1.01 | No concern |
| Bedrooms | 1.01 | No concern |
| busy_street | 1.01 | No concern |
| 1 VIF < 2.5: No multicollinearity concern | ||
| 2 VIF 2.5-5: Low concern (acceptable) | ||
| 3 VIF 5-10: Moderate concern (caution advised) | ||
| 4 VIF > 10: High concern (problematic) | ||
Understanding the VIF Results:
The VIF analysis reveals several important patterns in our model:
Variables with High VIF (≥5):
The highest VIF values are observed for Square Footage and Square Footage² (VIF 31.8 and 31.8 respectively), as well as Property Age and Property Age² (VIF 16.2 and 16.2).
This is expected and not problematic for two reasons:
Structural Collinearity: By design, the quadratic terms (sqft² and age²) are mathematically related to their linear counterparts. This creates high VIF values but is an intentional modeling choice to capture non-linear relationships. This type of collinearity does not invalidate our results.
Primary Variable Unaffected: Most importantly, our key variable of interest - Busy Street - has a very low VIF (), indicating it is not correlated with other predictors. This means our estimate of the busy street effect is reliable and not confounded by multicollinearity.
Practical Implications:
- The busy street coefficient is trustworthy (VIF < 2), meaning we can confidently interpret its magnitude and significance
- High VIF for polynomial terms is a statistical artifact of the model specification, not a data quality issue
- Control variables (bedrooms, bathrooms, lot size, garage) all show acceptable VIF values (< 5)
- Our conclusion about the 7-8% busy street discount is not affected by multicollinearity concerns
Technical Note: In models with polynomial terms, centering the variables (subtracting the mean before squaring) can reduce VIF values, but this doesn’t change the model’s predictions or the interpretation of the busy street effect. The current specification is appropriate for our analysis.
Key Findings
Show code
# Extract busy street coefficient from Model 2 (preferred specification)
busy_coef <- coef(model2)["busy_streetTRUE"]
busy_pct_effect <- (exp(busy_coef) - 1) * 100
# Calculate dollar impact at median price
median_price <- median(properties_clean$sale_price)
dollar_impact <- median_price * (exp(busy_coef) - 1)
# Statistical significance
busy_pvalue <- summary(model2)$coefficients["busy_streetTRUE", "Pr(>|t|)"]
is_significant <- busy_pvalue < 0.05Primary Finding: The Busy Street Discount
Based on our preferred hedonic model (Model 2), properties fronting on busy streets (≥10,000 AADT) sell for:
7.81% less than comparable properties on quieter streets (p < 0.001).
At the median sale price of $343,425, this translates to a discount of approximately $26,818.08.
Dollar Impact by Price Range
Show code
# Calculate impact across price ranges
price_quantiles <- quantile(properties_clean$sale_price, probs = seq(0.1, 0.9, 0.1))
impact_data <- data.frame(
percentile = paste0(seq(10, 90, 10), "th"),
base_price = price_quantiles,
discount_pct = abs(busy_pct_effect),
discount_dollars = price_quantiles * abs(exp(busy_coef) - 1)
)
# Create visualization
p <- ggplot(impact_data, aes(x = factor(percentile, levels = percentile),
y = discount_dollars, group = 1)) +
geom_line(color = chart_colors["Series1"], linewidth = 2) +
geom_point(color = chart_colors["Series1"], size = 4) +
geom_text(aes(label = dollar(round(discount_dollars, 0))),
vjust = -1, size = 3.5, color = chart_colors["Series2"]) +
scale_y_continuous(labels = dollar_format(),
expand = expansion(mult = c(0, 0.15))) +
labs(
title = "Busy Street Discount by Property Price Level",
subtitle = paste0("Based on ", round(abs(busy_pct_effect), 1), "% discount across price distribution"),
x = "Price Percentile",
y = "Estimated Discount ($)"
) +
theme(axis.text.x = element_text(angle = 0))
print(p)Sensitivity Analysis
Robustness Checks
Show code
# Test different AADT thresholds
thresholds <- c(5000, 7500, 10000, 12500, 15000, 20000)
sensitivity_results <- data.frame()
for(threshold in thresholds) {
temp_data <- properties_clean %>%
mutate(busy_temp = station_aadt >= threshold)
temp_model <- lm(log_price ~ busy_temp + sqft + I(sqft^2) + bedrooms +
bathrooms + property_age + I(property_age^2) +
lot_size + garage_spaces,
data = temp_data)
coef_val <- coef(temp_model)["busy_tempTRUE"]
se_val <- sqrt(diag(vcovHC(temp_model, type = "HC1")))["busy_tempTRUE"]
p_val <- summary(temp_model)$coefficients["busy_tempTRUE", "Pr(>|t|)"]
sensitivity_results <- rbind(sensitivity_results, data.frame(
threshold = threshold,
coefficient = coef_val,
std_error = se_val,
p_value = p_val,
pct_effect = (exp(coef_val) - 1) * 100,
n_busy = sum(temp_data$busy_temp),
pct_busy = 100 * sum(temp_data$busy_temp) / nrow(temp_data)
))
}
# Plot sensitivity
p1 <- ggplot(sensitivity_results, aes(x = threshold, y = pct_effect)) +
geom_line(color = chart_colors["Series1"], linewidth = 1.5) +
geom_point(color = chart_colors["Series1"], size = 4) +
geom_hline(yintercept = 0, linetype = "dashed", color = chart_colors["Series2"]) +
geom_ribbon(aes(ymin = (exp(coefficient - 1.96*std_error) - 1)*100,
ymax = (exp(coefficient + 1.96*std_error) - 1)*100),
alpha = 0.2, fill = chart_colors["Series1"]) +
scale_x_continuous(labels = comma) +
labs(
title = "Effect Size by AADT Threshold",
subtitle = "How the busy street discount varies with different definitions of 'busy'",
x = "AADT Threshold",
y = "Price Effect (%)",
caption = "Shaded area represents 95% confidence interval"
)
p2 <- ggplot(sensitivity_results, aes(x = threshold, y = n_busy)) +
geom_col(fill = chart_colors["Series3"], alpha = 0.7) +
geom_text(aes(label = paste0(round(pct_busy, 1), "%")),
vjust = -0.5, size = 3.5, color = chart_colors["Series2"]) +
scale_x_continuous(labels = comma) +
scale_y_continuous(labels = comma) +
labs(
title = "Sample Size by Threshold",
x = "AADT Threshold",
y = "Number of 'Busy' Properties"
)
grid.arrange(p1, p2, ncol = 1)The negative price effect is consistent across different AADT thresholds, with the strongest effects observed at moderate thresholds (7,500-12,500 AADT). At very high thresholds (>20,000), the effect remains negative but with wider confidence intervals due to smaller sample sizes.
Price per Square Foot Analysis
Show code
# Alternative specification using price per sqft
model_ppsf <- lm(price_per_sqft ~ busy_street + sqft + bedrooms + bathrooms +
property_age + lot_size + garage_spaces,
data = properties_clean)
# Summary
summary_ppsf <- tidy(model_ppsf, conf.int = TRUE) %>%
filter(term == "busy_streetTRUE")
# Visualization
comparison <- properties_clean %>%
group_by(busy_street_text) %>%
summarise(
mean_ppsf = mean(price_per_sqft),
median_ppsf = median(price_per_sqft),
se_ppsf = sd(price_per_sqft) / sqrt(n())
)
ggplot(comparison, aes(x = busy_street_text, y = mean_ppsf, fill = busy_street_text)) +
geom_col(alpha = 0.7) +
geom_errorbar(aes(ymin = mean_ppsf - 1.96*se_ppsf,
ymax = mean_ppsf + 1.96*se_ppsf),
width = 0.2, linewidth = 1) +
geom_text(aes(label = dollar(round(mean_ppsf, 2))),
vjust = -2, size = 5, fontface = "bold") +
scale_fill_manual(values = c(
"Busy Street (≥10K AADT)" = chart_colors["Series1"],
"Non-Busy Street (<10K AADT)" = chart_colors["Series3"]
)) +
scale_y_continuous(labels = dollar, expand = expansion(mult = c(0, 0.15))) +
labs(
title = "Mean Price per Square Foot by Street Type",
subtitle = "Error bars represent 95% confidence intervals",
x = NULL,
y = "Price per Square Foot"
) +
theme(legend.position = "none",
axis.text.x = element_text(angle = 0))Price/Sqft Difference: Properties on busy streets have $10.79 lower price per square foot (p < 0.001).
Property Characteristics Balance
To ensure our results aren’t driven by systematic differences in property characteristics between busy and non-busy street properties (rather than the traffic itself), we examine the balance of covariates. This is similar to checking whether treatment and control groups are comparable in an experiment.
What is Standardized Difference?
The standardized difference measures how different the means of two groups are, expressed in standard deviation units:
\[\text{Std Diff} = \frac{|\text{Mean}_{\text{Busy}} - \text{Mean}_{\text{Non-Busy}}|}{\sqrt{(\text{SD}_{\text{Busy}}^2 + \text{SD}_{\text{Non-Busy}}^2)/2}}\]
Interpretation Guidelines: - < 0.10: Excellent balance (negligible difference) - 0.10 - 0.25: Acceptable balance (small difference) - 0.25 - 0.50: Moderate imbalance (noteworthy difference) - > 0.50: Large imbalance (serious concern)
These thresholds come from the causal inference literature, where standardized differences < 0.10 are considered indicative of good covariate balance in observational studies.
Show code
# Create balance table
balance_vars <- c("sqft", "bedrooms", "bathrooms", "property_age", "lot_size", "garage_spaces")
balance_data <- properties_clean %>%
select(busy_street, all_of(balance_vars)) %>%
pivot_longer(-busy_street, names_to = "variable", values_to = "value") %>%
group_by(variable, busy_street) %>%
summarise(
mean = mean(value, na.rm = TRUE),
sd = sd(value, na.rm = TRUE),
.groups = "drop"
) %>%
pivot_wider(
names_from = busy_street,
values_from = c(mean, sd),
names_sep = "_"
) %>%
mutate(
std_diff = abs(mean_TRUE - mean_FALSE) / sqrt((sd_TRUE^2 + sd_FALSE^2) / 2),
variable_label = case_when(
variable == "sqft" ~ "Square Footage",
variable == "bedrooms" ~ "Bedrooms",
variable == "bathrooms" ~ "Bathrooms",
variable == "property_age" ~ "Property Age",
variable == "lot_size" ~ "Lot Size",
variable == "garage_spaces" ~ "Garage Spaces"
),
balance_assessment = case_when(
std_diff < 0.10 ~ "Excellent",
std_diff >= 0.10 & std_diff < 0.25 ~ "Acceptable",
std_diff >= 0.25 & std_diff < 0.50 ~ "Moderate",
std_diff >= 0.50 ~ "Poor"
)
)
# Create detailed balance table
balance_data %>%
select(variable_label, mean_FALSE, mean_TRUE, std_diff, balance_assessment) %>%
gt() %>%
tab_header(
title = "Covariate Balance Assessment",
subtitle = "Comparison of Property Characteristics Between Street Types"
) %>%
cols_label(
variable_label = "Variable",
mean_FALSE = "Non-Busy Mean",
mean_TRUE = "Busy Mean",
std_diff = "Std. Difference",
balance_assessment = "Balance"
) %>%
fmt_number(
columns = c(mean_FALSE, mean_TRUE),
decimals = 1
) %>%
fmt_number(
columns = std_diff,
decimals = 3
) %>%
data_color(
columns = std_diff,
colors = scales::col_numeric(
palette = c(chart_colors["Series3"], chart_colors["Series5"], chart_colors["Series1"]),
domain = c(0, max(balance_data$std_diff))
),
apply_to = "fill" # only affects background
) %>%
tab_style(
style = cell_text(color = "black"), # <— ensures text stays black
locations = cells_body(columns = std_diff)
) %>%
tab_style(
style = cell_fill(color = "#e8f4ea"),
locations = cells_body(rows = std_diff < 0.10)
) %>%
tab_style(
style = cell_fill(color = "#fff9e6"),
locations = cells_body(rows = std_diff >= 0.10 & std_diff < 0.25)
) %>%
tab_footnote(
footnote = "Standardized difference < 0.10 indicates excellent balance",
locations = cells_column_labels(columns = std_diff)
) %>%
tab_source_note(
source_note = "Green shading = Excellent balance; Yellow shading = Acceptable balance"
)| Covariate Balance Assessment | ||||
| Comparison of Property Characteristics Between Street Types | ||||
| Variable | Non-Busy Mean | Busy Mean | Std. Difference1 | Balance |
|---|---|---|---|---|
| Bathrooms | 2.4 | 2.5 | 0.113 | Acceptable |
| Bedrooms | 3.7 | 3.7 | 0.052 | Excellent |
| Garage Spaces | 1.9 | 1.8 | 0.051 | Excellent |
| Lot Size | 7,496.5 | 7,581.3 | 0.029 | Excellent |
| Property Age | 37.6 | 38.3 | 0.032 | Excellent |
| Square Footage | 2,218.1 | 2,198.5 | 0.033 | Excellent |
| 1 Standardized difference < 0.10 indicates excellent balance | ||||
| Green shading = Excellent balance; Yellow shading = Acceptable balance | ||||
Show code
# Balance plot
ggplot(balance_data, aes(x = std_diff, y = reorder(variable_label, std_diff))) +
geom_col(aes(fill = balance_assessment), alpha = 0.7) +
geom_vline(xintercept = 0.10, linetype = "dashed",
color = chart_colors["Series1"], linewidth = 1) +
geom_vline(xintercept = 0.25, linetype = "dotted",
color = chart_colors["Series1"], linewidth = 1) +
scale_fill_manual(values = c(
"Excellent" = chart_colors["Series3"],
"Acceptable" = chart_colors["Series5"],
"Moderate" = "#FF9800",
"Poor" = chart_colors["Series1"]
)) +
# Adjust annotation positions slightly left of the lines
annotate("text", x = 0.10, y = Inf, label = "Excellent Balance",
vjust = 1.2, hjust = -0.1, color = chart_colors["Series1"], size = 4) +
annotate("text", x = 0.25, y = Inf, label = "Acceptable Balance",
vjust = 1.2, hjust = 1.1, color = chart_colors["Series1"], size = 4) +
coord_cartesian(clip = "off") +
labs(
title = "Covariate Balance Between Busy and Non-Busy Properties",
subtitle = "Standardized differences in means",
x = "Absolute Standardized Difference",
y = NULL,
fill = "Balance Assessment"
) +
theme(
panel.grid.major.y = element_blank(),
legend.position = "bottom",
plot.margin = margin(10, 40, 10, 10) # give space for labels
)Balance Assessment Summary:
Show code
library(gt)
library(scales)
max_std_diff <- max(balance_data$std_diff, na.rm = TRUE)
n_above_threshold <- sum(balance_data$std_diff >= 0.10, na.rm = TRUE)
balance_summary_gt <- balance_data %>%
group_by(balance_assessment) %>%
summarise(
n_variables = n(),
variables = paste(variable_label, collapse = ", "),
.groups = "drop"
) %>%
arrange(match(balance_assessment, c("Excellent", "Acceptable", "Moderate", "Poor"))) %>%
mutate(
balance_color = case_when(
balance_assessment == "Excellent" ~ chart_colors["Series3"], # Aqua
balance_assessment == "Acceptable" ~ chart_colors["Series5"], # Gold
balance_assessment == "Moderate" ~ "#FF9800", # Amber
balance_assessment == "Poor" ~ chart_colors["Series1"] # Burnt orange
),
label = paste0(
"<b style='color:", balance_color, "'>", balance_assessment, " Balance</b>"
)
) %>%
select(label, n_variables, variables) %>%
gt() %>%
tab_header(
title = md("<span style='color:#BF5700;'>Covariate Balance Summary</span>"),
subtitle = md("Distribution of variable-level balance across covariate groups")
) %>%
cols_label(
label = md("**Balance Category**"),
n_variables = md("**Number of Variables**"),
variables = md("**Variables Included**")
) %>%
fmt_markdown(columns = label) %>%
tab_style(
style = cell_text(weight = "bold", color = "#333F48"),
locations = cells_column_labels(everything())
) %>%
tab_style(
style = cell_borders(sides = "bottom", color = "#BF5700", weight = px(2)),
locations = cells_title(groups = "title")
) %>%
# Enable default row striping (banded rows)
opt_row_striping() %>%
tab_options(
table.border.top.width = px(2),
table.border.top.color = "#BF5700",
table.font.names = c("Arial", "Helvetica", "sans-serif"),
data_row.padding = px(6),
heading.align = "left",
row.striping.background_color = "#F9F9F9" # sets stripe color safely
) %>%
tab_source_note(
source_note = md(
"Color-coded categories reflect RealWise balance thresholds:
<span style='color:#61A5C2;'>Excellent</span> = <0.10,
<span style='color:#F2A900;'>Acceptable</span> = 0.10–0.25,
<span style='color:#FF9800;'>Moderate</span> = 0.25–0.50,
<span style='color:#BF5700;'>Poor</span> ≥ 0.50."
)
)
balance_summary_gt| Covariate Balance Summary | ||
| Distribution of variable-level balance across covariate groups | ||
| Balance Category | Number of Variables | Variables Included |
|---|---|---|
| Excellent Balance | 5 | Bedrooms, Garage Spaces, Lot Size, Property Age, Square Footage |
| Acceptable Balance | 1 | Bathrooms |
| Color-coded categories reflect RealWise balance thresholds: Excellent = <0.10, Acceptable = 0.10–0.25, Moderate = 0.25–0.50, Poor ≥ 0.50. |
||
5 out of 6 variables show excellent balance (< 0.10), and 6 show at least acceptable balance (< 0.25). This indicates good overall balance between groups, supporting the validity of our regression estimates.
Detailed Results Table
Show code
# Create detailed results for export
results_detail <- properties_clean %>%
mutate(
predicted_price = exp(predict(model2)),
residual = sale_price - predicted_price,
pct_error = 100 * residual / sale_price
) %>%
select(
property_id,
address,
sale_price,
predicted_price,
residual,
price_per_sqft,
busy_street_text,
station_aadt,
station_road,
sqft,
bedrooms,
bathrooms,
property_age,
lot_size,
garage_spaces,
distance_to_station_m
) %>%
arrange(desc(abs(residual)))
# Create the datatable
table_html <- datatable(
results_detail,
caption = "Detailed Property Analysis Results (Sortable)",
filter = "top",
options = list(
pageLength = 25,
scrollX = TRUE,
columnDefs = list(list(className = 'dt-center', targets = "_all"))
),
rownames = FALSE
) %>%
formatCurrency(c("sale_price", "predicted_price", "residual", "price_per_sqft"), digits = 0) %>%
formatRound(c("distance_to_station_m"), digits = 1) %>%
formatRound(c("sqft", "lot_size", "station_aadt"), digits = 0)
# Write collapsible HTML container
cat('
<details>
<summary><b>Show Detailed Property Analysis Table</b></summary><br>
')Show Detailed Property Analysis Table
Show code
# Print datatable directly (it already renders as HTML)
table_htmlShow code
cat('</details>')Methodological Caveats
Synthetic data: Results are illustrative; coefficients are not intended to represent actual Houston market magnitudes.
Endogeneity: Traffic volumes and property values may be jointly determined; causality can run both ways. Future work might use instrumental variables or panel data.
Spatial dependence: Nearby sales share unobservables; spatial error or lag models and Moran’s I can diagnose/correct clustering.
Measurement error: AADT is an annual average; peak-hour conditions may have different marginal effects.
Model specification: Alternative elasticities (e.g., log–log), neighborhood fixed effects, and heterogeneous effects by submarket are fruitful robustness checks.
Conclusions & Implications
Summary of Findings
Significant Price Discount: Properties fronting on busy streets (≥10,000 AADT) sell for approximately 7.8% less than comparable properties, after controlling for property characteristics.
Economic Magnitude: At the median sale price of $343,425, this represents a discount of about $26,818.08.
Robust Results: The negative effect is:
- Statistically significant (p < 0.001)
- Consistent across multiple model specifications
- Robust to different AADT thresholds
- Not driven by differences in property characteristics
Progressive Impact: The absolute dollar discount increases with property value, but the percentage effect remains relatively constant across the price distribution.
Practical Implications
For Home Buyers: - Properties on busy streets offer a meaningful discount - Consider whether noise, safety, and air quality trade-offs are worth the savings - The discount appears to accurately price these externalities
For Sellers: - Busy street location represents a significant pricing challenge - Emphasize property features and improvements to offset location discount - Consider noise mitigation improvements (windows, landscaping)
For Appraisers: - Use 7-8% adjustment for properties on roads with ≥10,000 AADT - Adjust proportionally for intermediate traffic levels - Consider distance from road as additional factor
For Policy Makers: - Traffic impacts create measurable economic costs for residents - Traffic calming measures may help preserve property values - Consider compensation mechanisms for residents on major arterials
Methodology Notes
Data Source
- Properties: 1,485 single-family homes in Houston, Texas
- Traffic Data: Texas Department of Transportation (TxDOT) AADT counts from 2024
- Busy Street Definition: Roads with ≥10,000 vehicles per day (FHWA arterial classification)
Statistical Approach
- Primary Model: Semi-log hedonic regression (log price on levels)
- Standard Errors: Heteroskedasticity-robust (White’s HC1)
- Model Selection: Quadratic specification (Model 2) preferred based on R², AIC, and residual diagnostics
Limitations
- Analysis limited to properties within 87 meters of TxDOT counting stations
- Traffic counts represent average conditions; peak hour impacts may differ
- Cross-sectional analysis; panel data would allow for stronger causal inference
- Other traffic-related factors (noise, pollution, accidents) not directly measured
Analysis Date: 2025-11-04
Contact: RealWise Analytics
This analysis was conducted using R statistical software with hedonic pricing methodology standard in real estate economics research.
Future Research Directions
Causal identification using instruments (historic road alignments, exogenous traffic shocks) or difference-in-differences with roadway changes.
Spatial econometrics, testing spatial lag/error specifications and comparing fit against OLS with robust SE.
Heterogeneous treatment effects by school district, price tier, or distance to employment centers.
Noise- and pollution-specific channels, pairing TxDOT counts with modeled noise contours or ambient air quality to decompose the busy street discount.
Temporal dynamics, using panel data or repeat-sales to track how traffic impacts change with macro conditions.