Econometrics
Prof. Youssef Ait Benaseer
Introduction
This project examines the determinants of housing prices in Boston,
focusing on how housing characteristics and neighborhood conditions
influence the median value of homes. I chose this topic because housing
markets are a central part of economic life and reflect broader patterns
of inequality, living standards, and access to resources. The Boston
dataset is widely used in econometrics, which made it a practical and
manageable choice while still allowing for meaningful analysis. I was
particularly interested in understanding how factors such as the number
of rooms, crime rates, pollution levels, and socioeconomic conditions
shape housing values across different areas. This topic is not only
relevant to me as a student of economics, but it should also be
interesting more broadly because housing prices affect individuals,
communities, and policy decisions. By analyzing these relationships, the
project provides insight into how different aspects of urban life
contribute to economic outcomes, making it both analytically useful and
socially meaningful.
Research Question
What factors significantly affect housing prices in Boston?
Data Description & Preparation
The dataset used in this project is the Boston Housing dataset,
obtained from the Kaggle website. It is a cross-sectional dataset in
which each observation represents a different neighborhood (census
tract) in the Boston area, with a total of 506 observations. The main
variables used are medv (median value of owner-occupied homes, in
thousands of dollars), rm (average number of rooms per dwelling), crim
(per capita crime rate), nox (nitric oxide concentration, a measure of
pollution), and lstat (percentage of the population classified as lower
status). The dataset was already tidy, but basic checks were performed
in R to confirm this: counting missing values (none were found),
dropping any incomplete or duplicated rows (all 506 observations
remain), and verifying the structure of the dataset. The column ...1
appears to be a row index created when the file was imported and is not
used in the analysis:
# Check missing values
colSums(is.na(boston))
...1 crim zn indus chas nox rm age dis rad tax ptratio
0 0 0 0 0 0 0 0 0 0 0 0
lstat medv
0 0
# Remove rows with missing values (none here, so nothing is dropped)
boston <- na.omit(boston)
# Remove duplicate rows (distinct() comes from dplyr)
boston <- distinct(boston)
# Check structure
str(boston)
tibble [506 × 14] (S3: tbl_df/tbl/data.frame)
$ ...1 : num [1:506] 1 2 3 4 5 6 7 8 9 10 ...
$ crim : num [1:506] 0.00632 0.02731 0.02729 0.03237 0.06905 ...
$ zn : num [1:506] 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
$ indus : num [1:506] 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
$ chas : num [1:506] 0 0 0 0 0 0 0 0 0 0 ...
$ nox : num [1:506] 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
$ rm : num [1:506] 6.58 6.42 7.18 7 7.15 ...
$ age : num [1:506] 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
$ dis : num [1:506] 4.09 4.97 4.97 6.06 6.06 ...
$ rad : num [1:506] 1 2 2 3 3 3 5 5 5 5 ...
$ tax : num [1:506] 296 242 242 222 222 222 311 311 311 311 ...
$ ptratio: num [1:506] 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
$ lstat : num [1:506] 4.98 9.14 4.03 2.94 5.33 ...
$ medv : num [1:506] 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
Main Variables
The main outcome variable is medv, which measures
the median value of owner-occupied homes. The key explanatory variable
in the simple regression is rm, which measures the
average number of rooms per dwelling. The multiple regression adds
crim, nox, and lstat
as control variables.
| Variable | Description | Expected relationship with medv |
| --- | --- | --- |
| medv | Median value of owner-occupied homes | Outcome variable |
| rm | Average number of rooms per dwelling | Positive |
| crim | Crime rate | Negative |
| nox | Nitric oxide concentration / pollution | Negative |
| lstat | Percentage of lower-status population | Negative |
main_vars <- boston %>%
  select(medv, rm, crim, nox, lstat)
summary(main_vars)
medv rm crim nox lstat
Min. : 5.00 Min. :3.561 Min. : 0.00632 Min. :0.3850 Min. : 1.73
1st Qu.:17.02 1st Qu.:5.886 1st Qu.: 0.08205 1st Qu.:0.4490 1st Qu.: 6.95
Median :21.20 Median :6.208 Median : 0.25651 Median :0.5380 Median :11.36
Mean :22.53 Mean :6.285 Mean : 3.61352 Mean :0.5547 Mean :12.65
3rd Qu.:25.00 3rd Qu.:6.623 3rd Qu.: 3.67708 3rd Qu.:0.6240 3rd Qu.:16.95
Max. :50.00 Max. :8.780 Max. :88.97620 Max. :0.8710 Max. :37.97
Distribution of the Outcome Variable
The outcome variable, medv, is important because the
project tries to explain why median housing values differ across Boston
neighborhoods. The histogram below shows the distribution of housing
values.
ggplot(boston, aes(x = medv)) +
  geom_histogram(bins = 30, color = "black", fill = "gray80") +
  labs(
    title = "Distribution of Median Housing Values",
    x = "Median Housing Value (medv)",
    y = "Number of Observations"
  ) +
  theme_minimal()

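The distribution shows substantial variation in housing values across
the dataset. Values of medv range from 5.00 to 50.00, with a median of
21.20 and a mean of 22.53. The mean lies above the median, which points
to a right-skewed distribution with a small number of high-value
neighborhoods. This variation makes the dataset useful for studying why
some neighborhoods have higher housing values than others.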
Scatter Plots & Regression Lines
# Reshape to long format so each explanatory variable gets its own panel
boston_long <- boston %>%
  select(medv, rm, crim, nox, lstat) %>%
  pivot_longer(cols = -medv, names_to = "variable", values_to = "value")
# One scatter plot per variable, each with a fitted linear regression line
ggplot(boston_long, aes(x = value, y = medv)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  facet_wrap(~variable, scales = "free_x") +
  labs(title = "Scatter Plots of Housing Price vs Key Variables",
       x = "Independent Variables",
       y = "Median Housing Value")

The scatter plots show the relationship between housing prices and
the main explanatory variables. The number of rooms appears to have a
positive relationship with housing prices, meaning that homes with more
rooms tend to have higher values. In contrast, crime rate, pollution,
and lower status population appear to have negative relationships with
housing prices. These patterns provide initial visual evidence before
running the regression models.
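The visual patterns can also be summarized with simple correlations. The
following is a minimal sketch, not part of the original analysis. It
assumes that the copy of the data shipped with the MASS package
(MASS::Boston) matches the Kaggle file used in this project, and the
object name cors is introduced here for illustration only.

```r
# Assumption: MASS::Boston stands in for the Kaggle file used in the report
boston <- MASS::Boston

# Keep the outcome and the four explanatory variables
main_vars <- boston[, c("medv", "rm", "crim", "nox", "lstat")]

# Correlation of each explanatory variable with medv
cors <- cor(main_vars)["medv", c("rm", "crim", "nox", "lstat")]
print(round(cors, 3))
```

A positive correlation for rm and negative correlations for crim, nox,
and lstat would match the slopes of the fitted lines in the scatter
plots. Correlations are pairwise, so they do not hold the other
variables constant.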
Regression Model 1: Simple Regression Model
model1 <- lm(medv ~ rm, data = boston)
summary(model1)
Call:
lm(formula = medv ~ rm, data = boston)
Residuals:
Min 1Q Median 3Q Max
-23.346 -2.547 0.090 2.986 39.433
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -34.671 2.650 -13.08 <2e-16 ***
rm 9.102 0.419 21.72 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 6.616 on 504 degrees of freedom
Multiple R-squared: 0.4835, Adjusted R-squared: 0.4825
F-statistic: 471.8 on 1 and 504 DF, p-value: < 2.2e-16
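Model 1 estimates the relationship between median housing value and
the average number of rooms. In this model, medv is the outcome variable
and rm is the explanatory variable. The estimated coefficient on rm is
9.102, so one additional room on average is associated with a predicted
median value that is higher by 9.102 units of medv (about $9,100, since
medv is recorded in thousands of dollars). The coefficient is
statistically significant (p < 0.001), and the model explains about 48%
of the variation in medv (R-squared = 0.4835). The intercept of -34.671
has no practical interpretation because no neighborhood has an average
of zero rooms. However, Model 1 may be incomplete because housing prices
are affected by many other neighborhood characteristics. If factors such
as crime, pollution, or socioeconomic conditions are related to both the
number of rooms and housing prices, then the simple regression
coefficient may suffer from omitted variable bias.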
\[
medv_i = \beta_0 + \beta_1 rm_i + u_i
\]
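The direction of this bias can be made concrete with the standard
omitted variable bias formula for a single omitted factor. The sketch
below is illustrative only: \(x_i\) stands for any one omitted
neighborhood characteristic (for example lstat), and \(\delta_1\) is
notation introduced here for the sample slope from regressing \(x_i\) on
\(rm_i\).

\[
medv_i = \beta_0 + \beta_1 rm_i + \beta_2 x_i + u_i, \qquad
x_i = \delta_0 + \delta_1 rm_i + v_i
\]
\[
E[\tilde{\beta}_1] = \beta_1 + \beta_2 \delta_1
\]

Here \(\tilde{\beta}_1\) is the slope from the simple regression that
leaves out \(x_i\). If \(x_i\) lowers housing values (\(\beta_2 < 0\))
and is less common where homes are larger (\(\delta_1 < 0\)), the
product \(\beta_2 \delta_1\) is positive and the simple regression
overstates the relationship between rooms and housing values.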
Regression Model 2: Multiple Regression Model
model2 <- lm(medv ~ rm + crim + nox + lstat, data = boston)
summary(model2)
Call:
lm(formula = medv ~ rm + crim + nox + lstat, data = boston)
Residuals:
Min 1Q Median 3Q Max
-17.901 -3.570 -1.132 1.919 29.046
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.51944 3.30533 -0.762 0.44628
rm 5.21855 0.44386 11.757 < 2e-16 ***
crim -0.10264 0.03275 -3.134 0.00183 **
nox -0.12239 2.68444 -0.046 0.96365
lstat -0.57738 0.05349 -10.794 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 5.495 on 501 degrees of freedom
Multiple R-squared: 0.6459, Adjusted R-squared: 0.643
F-statistic: 228.4 on 4 and 501 DF, p-value: < 2.2e-16
Model 2 adds crime rate, pollution, and lower-status population as
control variables. This model is stronger because housing prices are
affected by more than just the size of a home. Including these variables
improves on the simple regression model because it reduces the risk of
omitted variable bias. In the simple model, the coefficient on rm may
partly capture the effects of other neighborhood characteristics. For
example, neighborhoods with larger homes may also have lower crime
rates, cleaner environments, or stronger socioeconomic conditions. If
those factors are omitted, the simple regression may overstate or
understate the true relationship between rooms and housing prices. The
estimates are consistent with overstatement: the coefficient on rm falls
from 9.102 in Model 1 to 5.219 in Model 2. Model 2 addresses the
identification threat of omitted variable bias by holding crime,
pollution, and lower-status population constant. This does not prove
causality, because other relevant factors may still be missing, but it
gives a more credible estimate of the relationship between rooms and
housing prices after accounting for major neighborhood characteristics.
\[
medv_i = \beta_0 + \beta_1 rm_i + \beta_2 crim_i + \beta_3 nox_i +
\beta_4 lstat_i + u_i
\]
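The change in the rooms coefficient between the two models can be
decomposed exactly, because in any OLS sample the short-regression slope
equals the long-regression slope plus each added coefficient times the
slope from regressing that added variable on rm. The code below is a
minimal sketch of that check, not part of the original analysis. It
assumes MASS::Boston matches the Kaggle file, and the names short, long,
delta, and bias_terms are introduced here for illustration.

```r
# Assumption: MASS::Boston stands in for the Kaggle file used in the report
boston <- MASS::Boston

short <- lm(medv ~ rm, data = boston)                       # Model 1
long  <- lm(medv ~ rm + crim + nox + lstat, data = boston)  # Model 2

# Auxiliary regressions: slope of each added variable on rm
controls <- c("crim", "nox", "lstat")
delta <- sapply(controls, function(v) {
  coef(lm(reformulate("rm", response = v), data = boston))[["rm"]]
})

# Contribution of each added variable to the gap between the two slopes
bias_terms <- coef(long)[controls] * delta

# In-sample identity: short slope = long slope + sum of the contributions
reconstructed <- coef(long)[["rm"]] + sum(bias_terms)
print(round(bias_terms, 3))
print(c(short = coef(short)[["rm"]], reconstructed = reconstructed))
```

The largest entry of bias_terms indicates which added variable accounts
for most of the drop in the rooms coefficient.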
Regression Models Comparison
stargazer(model1, model2,
          type = "text",
          title = "Regression Results: Housing Prices in Boston",
          dep.var.labels = "Median Housing Value (medv)",
          column.labels = c("Simple Model", "Multiple Model"),
          covariate.labels = c("Number of Rooms",
                               "Crime Rate",
                               "Pollution (NOX)",
                               "Lower Status Population (%)"),
          omit.stat = c("ser"),
          digits = 3,
          align = TRUE,
          no.space = TRUE)
Regression Results: Housing Prices in Boston
=============================================================================
Dependent variable:
-------------------------------------------------
Median Housing Value (medv)
Simple Model Multiple Model
(1) (2)
-----------------------------------------------------------------------------
Number of Rooms 9.102*** 5.219***
(0.419) (0.444)
Crime Rate -0.103***
(0.033)
Pollution (NOX) -0.122
(2.684)
Lower Status Population (%) -0.577***
(0.053)
Constant -34.671*** -2.519
(2.650) (3.305)
-----------------------------------------------------------------------------
Observations 506 506
R2 0.484 0.646
Adjusted R2 0.483 0.643
F Statistic 471.847*** (df = 1; 504) 228.417*** (df = 4; 501)
=============================================================================
Note: *p<0.1; **p<0.05; ***p<0.01
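The comparison shows that the multiple regression explains more of the
variation in housing prices than the simple regression. R-squared rises
from 0.484 to 0.646 and adjusted R-squared rises from 0.483 to 0.643, so
the added variables improve explanatory power even after adjusting for
the extra regressors. The coefficient on rooms falls from 9.102 to
5.219, which suggests that the simple model attributed part of the
influence of neighborhood conditions to home size. The stars in this
table follow a different convention from summary(): crime is marked ***
here (p < 0.01) but ** in the summary() output, where *** requires
p < 0.001.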
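Whether the three added variables improve the fit jointly can be checked
with an F-test for nested models. The sketch below is not part of the
original analysis. It uses only the R-squared values and degrees of
freedom printed in the summary() output, so the result is subject to
rounding, and the variable names are introduced here for illustration.

```r
# Values copied from the summary() output of the two models
r2_restricted   <- 0.4835  # Model 1: medv ~ rm
r2_unrestricted <- 0.6459  # Model 2: medv ~ rm + crim + nox + lstat
q   <- 3                   # number of added regressors (crim, nox, lstat)
df2 <- 501                 # residual degrees of freedom of Model 2

# F statistic for H0: the coefficients on crim, nox, and lstat are all zero
f_stat <- ((r2_unrestricted - r2_restricted) / q) /
  ((1 - r2_unrestricted) / df2)

# Upper-tail probability under the F(q, df2) distribution
p_value <- pf(f_stat, df1 = q, df2 = df2, lower.tail = FALSE)
print(c(F = f_stat, p = p_value))
```

With the fitted model objects available, anova(model1, model2) carries
out the same test without rounding.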
Interpretation of Regression Results
Rooms: The coefficient on rm is 5.219 (p < 0.001). When the average
number of rooms increases by one, the predicted median housing value
rises by about 5.2 thousand dollars, holding crime, pollution, and
lower-status population constant. Larger homes are therefore associated
with higher housing values.
Crime rate: The coefficient on crim is -0.103 (p = 0.002). A one-unit
increase in the crime rate is associated with a median housing value
that is lower by about 0.1 thousand dollars (roughly $100), holding the
other variables constant.
Pollution: The coefficient on nox is -0.122 with a standard error of
2.684 (p = 0.964). The sign is negative, as expected, but the estimate
is not statistically distinguishable from zero. Because nox only ranges
from 0.385 to 0.871 in the data, a one-unit change is also larger than
anything observed. Once rooms, crime, and lower-status population are
controlled for, this model provides no evidence of a separate
relationship between pollution and housing value, even though the
scatter plot shows a negative raw relationship.
Lower-status population: The coefficient on lstat is -0.577
(p < 0.001). A one percentage point increase in the lower-status
population is associated with a median housing value that is lower by
about 0.58 thousand dollars, holding the other variables constant.
Model comparison: Compared with the simple regression, the multiple
regression is stronger because it controls for crime, pollution, and
socioeconomic conditions while estimating the relationship between rooms
and housing prices. Doing so reduces the rooms coefficient from 9.102 to
5.219 and raises R-squared from 0.484 to 0.646.
Hypotheses for Regression Results
The multiple regression allows this project to test several
hypotheses. The table below presents the expected relationship between
each explanatory variable and median housing value.
| Variable | Null hypothesis | Alternative hypothesis |
| --- | --- | --- |
| Rooms (rm) | \(H_0: \beta_1 = 0\) | \(H_A: \beta_1 > 0\) |
| Crime Rate (crim) | \(H_0: \beta_2 = 0\) | \(H_A: \beta_2 < 0\) |
| Pollution (nox) | \(H_0: \beta_3 = 0\) | \(H_A: \beta_3 < 0\) |
| Lower-Status Population (lstat) | \(H_0: \beta_4 = 0\) | \(H_A: \beta_4 < 0\) |
In this table, the null hypothesis states that each variable has
no relationship with median housing value after
controlling for the other variables. The alternative hypothesis shows
the expected direction of the relationship. For rooms,
the expected coefficient is positive, meaning more rooms should be
associated with higher housing values. For crime, pollution, and
lower-status population, the expected coefficients are
negative, meaning higher levels of these factors should be associated
with lower housing values.
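Because these alternatives are one-sided, while the p-values printed by
summary() are two-sided, the matching one-sided p-values can be
recovered from the reported t statistics. The sketch below is not part
of the original analysis. It uses the rounded t values and the 501
residual degrees of freedom from the Model 2 output, and the names
t_stats, expected_sign, and p_one_sided are introduced here for
illustration.

```r
# t statistics and residual degrees of freedom copied from summary(model2)
df_resid <- 501
t_stats <- c(rm = 11.757, crim = -3.134, nox = -0.046, lstat = -10.794)

# Direction of each alternative hypothesis: +1 for beta > 0, -1 for beta < 0
expected_sign <- c(rm = 1, crim = -1, nox = -1, lstat = -1)

# One-sided p-value: probability of a t statistic at least this far
# in the direction stated by the alternative hypothesis
p_one_sided <- setNames(
  pt(expected_sign * t_stats, df = df_resid, lower.tail = FALSE),
  names(t_stats)
)
print(signif(p_one_sided, 3))
```

When the estimate has the expected sign, the one-sided p-value is half
of the two-sided one, so the conclusions for these four variables do not
change at the 5% level.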
Inference from the Multiple Regression
confint(model2, level = 0.95) %>%
kable(digits = 4, caption = "95% Confidence Intervals for Multiple Regression")
95% Confidence Intervals for Multiple Regression
| Term | 2.5 % | 97.5 % |
| --- | --- | --- |
| (Intercept) | -9.0135 | 3.9746 |
| rm | 4.3465 | 6.0906 |
| crim | -0.1670 | -0.0383 |
| nox | -5.3965 | 5.1518 |
| lstat | -0.6825 | -0.4723 |
The p-values in the regression tables indicate whether each explanatory
variable has a statistically significant relationship with median
housing value after controlling for the other variables. A p-value below
0.05 means the variable is statistically significant at the 5% level. By
this rule rm, crim, and lstat are significant, while nox (p = 0.964) is
not. The confidence intervals above lead to the same conclusion, since a
95% confidence interval that excludes zero implies significance at the
5% level. The intervals for rm (4.35 to 6.09), crim (-0.167 to -0.038),
and lstat (-0.683 to -0.472) all exclude zero, whereas the interval for
nox (-5.40 to 5.15) contains zero.
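The intervals in the table can be reproduced by hand as the estimate
plus or minus a critical t value times the standard error. The sketch
below is not part of the original analysis. It uses the estimates and
standard errors printed in the Model 2 output, so the last digit may
differ from confint() because of rounding, and the object names are
introduced here for illustration.

```r
# Estimates and standard errors copied from summary(model2)
est <- c(rm = 5.21855, crim = -0.10264, nox = -0.12239, lstat = -0.57738)
se  <- c(rm = 0.44386, crim = 0.03275, nox = 2.68444, lstat = 0.05349)

# Two-sided 95% critical value with 501 residual degrees of freedom
t_crit <- qt(0.975, df = 501)

# Confidence interval: estimate +/- critical value x standard error
ci <- cbind(lower = est - t_crit * se, upper = est + t_crit * se)

# A coefficient is significant at the 5% level if its interval excludes zero
excludes_zero <- ci[, "lower"] > 0 | ci[, "upper"] < 0
print(round(ci, 4))
print(excludes_zero)
```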
Conclusion
This project examined the factors associated with housing prices in
Boston using the Boston Housing dataset. The simple regression showed
that the average number of rooms is positively related to median housing
value. This means that neighborhoods with larger homes tend to have
higher housing prices.
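The multiple regression added crime rate, pollution, and lower
socioeconomic status as additional explanatory variables. This improved
the analysis because housing prices are influenced by more than home
size alone. In this model, rooms, crime rate, and lower-status
population are statistically significant at the 5% level, while
pollution is not. After controlling for these variables, the coefficient
on rooms falls from 9.102 to 5.219 and R-squared rises from 0.484 to
0.646, so the model provides a clearer estimate of the relationship
between rooms and housing prices.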
The results should still be interpreted carefully. The data are
cross-sectional, so the regression results show association rather than
definite causation. There may still be omitted variables, such as school
quality, distance to employment centers, transportation access, or
neighborhood amenities. Measurement error may also exist in variables
such as crime and pollution. Still, the analysis provides useful
evidence that both housing characteristics and neighborhood conditions
are important in explaining differences in Boston housing values.