Project Phase IV

Econometrics

Prof. Youssef Ait Benaseer

Introduction

This project examines the determinants of housing prices in Boston, focusing on how housing characteristics and neighborhood conditions influence the median value of homes. I chose this topic because housing markets are a central part of economic life and reflect broader patterns of inequality, living standards, and access to resources. The Boston dataset is widely used in econometrics, which made it a practical and manageable choice while still allowing for meaningful analysis. I was particularly interested in understanding how factors such as the number of rooms, crime rates, pollution levels, and socioeconomic conditions shape housing values across different areas. This topic is not only relevant to me as a student of economics, but it should also be interesting more broadly because housing prices affect individuals, communities, and policy decisions. By analyzing these relationships, the project provides insight into how different aspects of urban life contribute to economic outcomes, making it both analytically useful and socially meaningful.

Research Question

What factors significantly affect housing prices in Boston?

Data Description & Preparation

The dataset used in this project is the Boston Housing dataset, obtained from the Kaggle website. It is a cross-sectional dataset with 506 observations, each representing a different neighborhood in the Boston area. The main variables used are medv (median housing value), rm (average number of rooms per dwelling), crim (crime rate), nox (pollution), and lstat (lower socioeconomic status). The dataset was already relatively tidy, but basic data cleaning was performed in R to ensure accuracy: checking for missing values, dropping any incomplete or duplicated observations, and verifying the structure of the dataset. No column contains missing values, and all 506 observations remain after cleaning:

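# Load packages (tidyverse provides %>%, select(), distinct(), pivot_longer(), and ggplot())
library(tidyverse)

# Check missing values
colSums(is.na(boston))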

   ...1    crim      zn   indus    chas     nox      rm     age     dis     rad     tax ptratio 
      0       0       0       0       0       0       0       0       0       0       0       0 
  lstat    medv 
      0       0

# Remove missing values
boston <- na.omit(boston)

# Remove duplicates
boston <- distinct(boston)

# Check structure
str(boston)

tibble [506 × 14] (S3: tbl_df/tbl/data.frame)
 $ ...1   : num [1:506] 1 2 3 4 5 6 7 8 9 10 ...
 $ crim   : num [1:506] 0.00632 0.02731 0.02729 0.03237 0.06905 ...
 $ zn     : num [1:506] 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
 $ indus  : num [1:506] 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
 $ chas   : num [1:506] 0 0 0 0 0 0 0 0 0 0 ...
 $ nox    : num [1:506] 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
 $ rm     : num [1:506] 6.58 6.42 7.18 7 7.15 ...
 $ age    : num [1:506] 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
 $ dis    : num [1:506] 4.09 4.97 4.97 6.06 6.06 ...
 $ rad    : num [1:506] 1 2 2 3 3 3 5 5 5 5 ...
 $ tax    : num [1:506] 296 242 242 222 222 222 311 311 311 311 ...
 $ ptratio: num [1:506] 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
 $ lstat  : num [1:506] 4.98 9.14 4.03 2.94 5.33 ...
 $ medv   : num [1:506] 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
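
One detail in the output above is worth checking: the column `...1` looks like a row index carried over from the CSV file. Because every row has a unique index, a duplicate check that includes this column cannot flag any row. The sketch below repeats the check with the index set aside. It uses base R to stay dependency-free, and it falls back on `MASS::Boston` as a stand-in when no `boston` object exists, on the assumption that it matches the Kaggle file; `boston_noidx` and `n_dupes` are illustrative names.

```r
# Stand-in data: MASS::Boston is assumed to match the Kaggle file used above.
if (!exists("boston")) boston <- MASS::Boston

# Set aside the CSV row-index column if it is present.
keep <- setdiff(names(boston), "...1")
boston_noidx <- boston[, keep]

# Count rows that are exact copies of an earlier row once the index is ignored.
n_dupes <- sum(duplicated(boston_noidx))
n_dupes
```

A result of zero would confirm that no neighborhood is recorded twice.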

Main Variables

The main outcome variable is medv, which measures the median value of owner-occupied homes. The key explanatory variable in the simple regression is rm, which measures the average number of rooms per dwelling. The multiple regression adds crim, nox, and lstat as control variables.

Variable	Meaning	Expected Relationship with medv
medv	Median value of owner-occupied homes (in $1000s)	Outcome variable
rm	Average number of rooms per dwelling	Positive
crim	Per capita crime rate	Negative
nox	Nitric oxides concentration (parts per 10 million), a proxy for pollution	Negative
lstat	Percentage of lower-status population	Negative

main_vars <- boston %>%
  select(medv, rm, crim, nox, lstat)

summary(main_vars)

      medv             rm             crim               nox             lstat      
 Min.   : 5.00   Min.   :3.561   Min.   : 0.00632   Min.   :0.3850   Min.   : 1.73  
 1st Qu.:17.02   1st Qu.:5.886   1st Qu.: 0.08205   1st Qu.:0.4490   1st Qu.: 6.95  
 Median :21.20   Median :6.208   Median : 0.25651   Median :0.5380   Median :11.36  
 Mean   :22.53   Mean   :6.285   Mean   : 3.61352   Mean   :0.5547   Mean   :12.65  
 3rd Qu.:25.00   3rd Qu.:6.623   3rd Qu.: 3.67708   3rd Qu.:0.6240   3rd Qu.:16.95  
 Max.   :50.00   Max.   :8.780   Max.   :88.97620   Max.   :0.8710   Max.   :37.97

Distribution of the Outcome Variable

The outcome variable, medv, is important because the project tries to explain why median housing values differ across Boston neighborhoods. The histogram below shows the distribution of housing values.

ggplot(boston, aes(x = medv)) +
  geom_histogram(bins = 30, color = "black", fill = "gray80") +
  labs(
    title = "Distribution of Median Housing Values",
    x = "Median Housing Value (medv)",
    y = "Number of Observations"
  ) +
  theme_minimal()

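The summary statistics above show that medv ranges from 5.00 to 50.00, with a median of 21.20 and a mean of 22.53. The mean lies above the median, which is consistent with a right-skewed distribution: most neighborhoods cluster at low-to-moderate values, with a smaller number of high-value neighborhoods. This variation makes the dataset useful for studying why some neighborhoods have higher housing values than others.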

Scatter Plots & Regression Lines

boston_long <- boston %>%
  select(medv, rm, crim, nox, lstat) %>%
  pivot_longer(cols = -medv, names_to = "variable", values_to = "value")

ggplot(boston_long, aes(x = value, y = medv)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  facet_wrap(~variable, scales = "free_x") +
  labs(title = "Scatter Plots of Housing Price vs Key Variables",
       x = "Independent Variables",
       y = "Median Housing Value")

The scatter plots show the relationship between housing prices and the main explanatory variables. The number of rooms appears to have a positive relationship with housing prices, meaning that neighborhoods whose dwellings have more rooms on average tend to have higher median values. In contrast, crime rate, pollution, and lower-status population appear to have negative relationships with housing prices. The crime rate is heavily skewed (its median is 0.26 but its maximum is 88.98), so the fitted line in that panel may be influenced by a small number of high-crime neighborhoods. These patterns provide initial visual evidence before running the regression models.
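
The visual impressions above can be summarized numerically with pairwise correlations. This is a minimal sketch, not part of the original analysis; it assumes `MASS::Boston` is an acceptable stand-in when `boston` is not already loaded, and `vars` and `cors` are illustrative names.

```r
# Stand-in data: MASS::Boston is assumed to match the Kaggle file used above.
if (!exists("boston")) boston <- MASS::Boston

vars <- c("medv", "rm", "crim", "nox", "lstat")

# Correlation of each variable with medv (the medv entry itself is 1).
cors <- cor(boston[, vars])["medv", ]
round(cors, 3)
```

A positive value for rm and negative values for crim, nox, and lstat would match the slopes of the fitted lines in the panels.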

Regression Model 1: Simple Regression Model

model1 <- lm(medv ~ rm, data = boston)
summary(model1)


Call:
lm(formula = medv ~ rm, data = boston)

Residuals:
    Min      1Q  Median      3Q     Max 
-23.346  -2.547   0.090   2.986  39.433 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -34.671      2.650  -13.08   <2e-16 ***
rm             9.102      0.419   21.72   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.616 on 504 degrees of freedom
Multiple R-squared:  0.4835,    Adjusted R-squared:  0.4825 
F-statistic: 471.8 on 1 and 504 DF,  p-value: < 2.2e-16

Model 1 estimates the relationship between median housing value and the average number of rooms, with medv as the outcome variable and rm as the explanatory variable. The estimated coefficient on rm is 9.102 (standard error 0.419), so one additional room on average is associated with a predicted median value about 9.10 units of medv higher, and the coefficient is statistically significant at the 1% level. The model explains about 48% of the variation in medv (R-squared = 0.4835). The intercept of -34.671 has no practical interpretation, because no neighborhood has an average of zero rooms. However, Model 1 may be incomplete because housing prices are affected by many other neighborhood characteristics. If factors such as crime, pollution, or socioeconomic conditions are related to both the number of rooms and housing prices, then the simple regression coefficient may suffer from omitted variable bias.

\[ medv_i = \beta_0 + \beta_1 rm_i + u_i \]
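
As a worked illustration, the estimates reported in the output above can be plugged into this equation. The room counts 6 and 7 are chosen only for illustration; both lie inside the observed range of rm (3.561 to 8.780).

\[ \widehat{medv}(rm = 6) = -34.671 + 9.102 \times 6 = 19.941, \qquad \widehat{medv}(rm = 7) = -34.671 + 9.102 \times 7 = 29.043 \]

The two predictions differ by exactly the slope, 29.043 - 19.941 = 9.102, which is what the coefficient on rm measures.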

Regression Model 2: Multiple Regression Model

model2 <- lm(medv ~ rm + crim + nox + lstat, data = boston)
summary(model2)


Call:
lm(formula = medv ~ rm + crim + nox + lstat, data = boston)

Residuals:
    Min      1Q  Median      3Q     Max 
-17.901  -3.570  -1.132   1.919  29.046 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -2.51944    3.30533  -0.762  0.44628    
rm           5.21855    0.44386  11.757  < 2e-16 ***
crim        -0.10264    0.03275  -3.134  0.00183 ** 
nox         -0.12239    2.68444  -0.046  0.96365    
lstat       -0.57738    0.05349 -10.794  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.495 on 501 degrees of freedom
Multiple R-squared:  0.6459,    Adjusted R-squared:  0.643 
F-statistic: 228.4 on 4 and 501 DF,  p-value: < 2.2e-16

Model 2 adds crime rate, pollution, and lower-status population as control variables. This model is stronger because housing prices are affected by more than just the size of a home. Including these variables improves on the simple regression model because it reduces the risk of omitted variable bias. In the simple model, the coefficient on rm may partly capture the effects of other neighborhood characteristics. For example, neighborhoods with larger homes may also have lower crime rates, cleaner environments, or stronger socioeconomic conditions. If those factors are omitted, the simple regression may overstate or understate the true relationship between rooms and housing prices. Here the coefficient on rm falls from 9.102 to 5.219 once the controls are added, which suggests that the simple regression overstated the relationship. Model 2 addresses the identification threat of omitted variable bias by holding crime, pollution, and lower-status population constant. This does not prove causality, but it gives a stronger estimate of the relationship between rooms and housing prices after accounting for major neighborhood characteristics.

\[ medv_i = \beta_0 + \beta_1 rm_i + \beta_2 crim_i + \beta_3 nox_i + \beta_4 lstat_i + u_i \]
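
The omitted-variable-bias argument can be checked numerically. In OLS, the slope on rm in the short regression equals the slope on rm in the long regression plus, for each omitted control, its long-regression coefficient times the slope from regressing that control on rm. The sketch below verifies this identity. It is not part of the original analysis; `m_short`, `m_long`, `delta`, and `implied` are illustrative names, and `MASS::Boston` is assumed to be an acceptable stand-in when `boston` is not loaded.

```r
# Stand-in data: MASS::Boston is assumed to match the Kaggle file used above.
if (!exists("boston")) boston <- MASS::Boston

m_short <- lm(medv ~ rm, data = boston)
m_long  <- lm(medv ~ rm + crim + nox + lstat, data = boston)

omitted <- c("crim", "nox", "lstat")

# Slope from regressing each omitted control on rm.
delta <- sapply(omitted, function(v) {
  coef(lm(reformulate("rm", response = v), data = boston))[["rm"]]
})

# Long-regression slope on rm plus the bias terms.
implied <- coef(m_long)[["rm"]] + sum(coef(m_long)[omitted] * delta)

c(short = coef(m_short)[["rm"]], implied = implied)
```

The two numbers should agree up to floating-point error, and each product in the sum shows how much of the gap between the two rm coefficients is attributable to a given control.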

Regression Models Comparison

library(stargazer)

stargazer(model1, model2,
          type = "text",
          title = "Regression Results: Housing Prices in Boston",
          dep.var.labels = "Median Housing Value (medv)",
          column.labels = c("Simple Model", "Multiple Model"),
          covariate.labels = c("Number of Rooms",
                               "Crime Rate",
                               "Pollution (NOX)",
                               "Lower Status Population (%)"),
          omit.stat = c("ser"),
          digits = 3,
          align = TRUE,
          no.space = TRUE)


Regression Results: Housing Prices in Boston
=============================================================================
                                           Dependent variable:               
                            -------------------------------------------------
                                       Median Housing Value (medv)           
                                  Simple Model            Multiple Model     
                                      (1)                      (2)           
-----------------------------------------------------------------------------
Number of Rooms                     9.102***                 5.219***        
                                    (0.419)                  (0.444)         
Crime Rate                                                  -0.103***        
                                                             (0.033)         
Pollution (NOX)                                               -0.122         
                                                             (2.684)         
Lower Status Population (%)                                 -0.577***        
                                                             (0.053)         
Constant                           -34.671***                 -2.519         
                                    (2.650)                  (3.305)         
-----------------------------------------------------------------------------
Observations                          506                      506           
R2                                   0.484                    0.646          
Adjusted R2                          0.483                    0.643          
F Statistic                 471.847*** (df = 1; 504) 228.417*** (df = 4; 501)
=============================================================================
Note:                                             *p<0.1; **p<0.05; ***p<0.01

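The comparison shows that the multiple regression explains more of the variation in housing prices than the simple regression. R-squared rises from 0.484 to 0.646, and adjusted R-squared rises from 0.483 to 0.643. Because adjusted R-squared penalizes additional regressors, its increase indicates that the added variables improve the model’s explanatory power rather than raising R-squared mechanically. The coefficient on rooms also falls from 9.102 to 5.219 once the controls are included.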
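
Since the simple model is nested in the multiple model, the improvement in fit can also be tested formally with an F-test on the three added coefficients. This is a minimal sketch and not part of the original analysis; it assumes `MASS::Boston` is an acceptable stand-in when `boston` is not loaded, and `cmp` is an illustrative name.

```r
# Stand-in data: MASS::Boston is assumed to match the Kaggle file used above.
if (!exists("boston")) boston <- MASS::Boston

model1 <- lm(medv ~ rm, data = boston)
model2 <- lm(medv ~ rm + crim + nox + lstat, data = boston)

# H0: the coefficients on crim, nox, and lstat are jointly zero.
cmp <- anova(model1, model2)
cmp
```

A small p-value in the second row would indicate that the controls jointly add explanatory power, even if an individual control is insignificant on its own.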

Interpretation of Regression Results

Rooms: The coefficient on rm is 5.219 (standard error 0.444), so one additional room on average is associated with a median housing value about 5.22 units of medv higher, holding crime, pollution, and lower-status population constant. The coefficient is positive and statistically significant at the 1% level, which means that larger homes are associated with higher housing values.
Crime rate: The coefficient on crim is -0.103 (p = 0.002), so a one-unit increase in the crime rate is associated with a median housing value about 0.10 units lower, holding the other variables constant. Higher crime is associated with lower housing values, and the relationship is statistically significant.
Pollution: The coefficient on nox is -0.122 with a standard error of 2.684 (p = 0.964). The sign is negative as expected, but the estimate is not statistically distinguishable from zero, so this model provides no evidence of a relationship between pollution and housing value once rooms, crime, and lower-status population are controlled for.
Lower-status population: The coefficient on lstat is -0.577 (standard error 0.053), so a one-percentage-point increase in the lower-status population is associated with a median housing value about 0.58 units lower. The coefficient is statistically significant at the 1% level.
Model comparison: Compared with the simple regression, the multiple regression is stronger because it controls for crime, pollution, and socioeconomic conditions while estimating the relationship between rooms and housing prices. The coefficient on rm falls from 9.102 to 5.219 when these controls are added.
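
One possible reason for an imprecise coefficient is overlap between a regressor and the other explanatory variables. The sketch below computes a variance inflation factor for nox by hand, as 1 / (1 - R-squared) from a regression of nox on the other three regressors. This diagnostic is an addition for illustration, not part of the original analysis; `aux`, `r2_aux`, and `vif_nox` are illustrative names, and `MASS::Boston` is assumed to be an acceptable stand-in when `boston` is not loaded.

```r
# Stand-in data: MASS::Boston is assumed to match the Kaggle file used above.
if (!exists("boston")) boston <- MASS::Boston

# Auxiliary regression: how well do the other regressors explain nox?
aux <- lm(nox ~ rm + crim + lstat, data = boston)
r2_aux <- summary(aux)$r.squared

# Variance inflation factor for the nox coefficient in Model 2.
vif_nox <- 1 / (1 - r2_aux)
c(aux_r_squared = r2_aux, vif = vif_nox)
```

A value close to 1 would suggest little overlap. Larger values would mean that part of the variation in nox is already captured by rooms, crime, and lower-status population.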

Hypotheses for Regression Results

The multiple regression allows this project to test several hypotheses. The table below states the null and alternative hypotheses for each explanatory variable, where each alternative reflects the expected direction of the relationship with median housing value.

Variable	Null Hypothesis	Alternative Hypothesis
Rooms (rm)	\(H_0: \beta_1 = 0\)	\(H_A: \beta_1 > 0\)
Crime Rate (crim)	\(H_0: \beta_2 = 0\)	\(H_A: \beta_2 < 0\)
Pollution (nox)	\(H_0: \beta_3 = 0\)	\(H_A: \beta_3 < 0\)
Lower-Status Population (lstat)	\(H_0: \beta_4 = 0\)	\(H_A: \beta_4 < 0\)

In this table, the null hypothesis states that each variable has no relationship with median housing value after controlling for the other variables. The alternative hypothesis shows the expected direction of the relationship. For rooms, the expected coefficient is positive, meaning more rooms should be associated with higher housing values. For crime, pollution, and lower-status population, the expected coefficients are negative, meaning higher levels of these factors should be associated with lower housing values.
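
Because the alternatives in the table are one-sided, while `summary()` reports two-sided p-values, one-sided p-values can be computed directly from the t statistics. This is a minimal sketch under the assumption that `MASS::Boston` is an acceptable stand-in when `boston` is not loaded; `tvals` and `p_one_sided` are illustrative names.

```r
# Stand-in data: MASS::Boston is assumed to match the Kaggle file used above.
if (!exists("boston")) boston <- MASS::Boston

model2 <- lm(medv ~ rm + crim + nox + lstat, data = boston)

tvals <- coef(summary(model2))[, "t value"]
df <- df.residual(model2)

# Upper tail for the positive alternative (rm), lower tail for the negative ones.
p_one_sided <- c(
  rm    = pt(tvals[["rm"]], df, lower.tail = FALSE),
  crim  = pt(tvals[["crim"]], df),
  nox   = pt(tvals[["nox"]], df),
  lstat = pt(tvals[["lstat"]], df)
)
signif(p_one_sided, 3)
```

When an estimate has the hypothesized sign, its one-sided p-value is half the two-sided value, so a coefficient with a two-sided p-value near 1 remains far from significant.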

Inference from the Multiple Regression

library(knitr)  # provides kable()

confint(model2, level = 0.95) %>%
  kable(digits = 4, caption = "95% Confidence Intervals for Multiple Regression")

95% Confidence Intervals for Multiple Regression
	2.5 %	97.5 %
(Intercept)	-9.0135	3.9746
rm	4.3465	6.0906
crim	-0.1670	-0.0383
nox	-5.3965	5.1518
lstat	-0.6825	-0.4723

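The p-values in the regression tables are used to determine whether each explanatory variable has a statistically significant relationship with median housing value after controlling for the other variables. If a p-value is less than 0.05, the variable is statistically significant at the 5% level. By this standard, rm, crim, and lstat are statistically significant, while nox (p = 0.964) is not. The confidence intervals above lead to the same conclusions: a coefficient is statistically significant at the 5% level when its 95% confidence interval does not include zero. The intervals for rm (4.3465 to 6.0906), crim (-0.1670 to -0.0383), and lstat (-0.6825 to -0.4723) all exclude zero, whereas the interval for nox (-5.3965 to 5.1518) is wide and includes zero. The null hypothesis is therefore rejected for rooms, crime, and lower-status population, but not for pollution. The reported p-values are two-sided; because the three significant estimates all have the hypothesized sign, the same conclusions hold under the one-sided alternatives.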
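
As a worked check, the test statistic and interval for rm can be reproduced from the Model 2 output. The critical value of about 1.965 for 501 degrees of freedom is supplied here and does not appear in the output above; the remaining numbers are the reported estimate and standard error.

\[ t = \frac{\hat{\beta}_1}{SE(\hat{\beta}_1)} = \frac{5.21855}{0.44386} \approx 11.76 \]

\[ \hat{\beta}_1 \pm t_{0.975,\,501} \times SE(\hat{\beta}_1) \approx 5.219 \pm 1.965 \times 0.444 \approx (4.35,\ 6.09) \]

Both results agree with the t value in the regression output and with the interval in the table above.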

Conclusion

This project examined the factors associated with housing prices in Boston using the Boston Housing dataset. The simple regression showed that the average number of rooms is positively related to median housing value, with an estimated coefficient of 9.102. This means that neighborhoods with larger homes tend to have higher housing prices.

The multiple regression added crime rate, pollution, and lower socioeconomic status as additional explanatory variables. This improved the analysis because housing prices are influenced by more than home size alone: R-squared rose from 0.484 to 0.646, and the coefficient on rooms fell to 5.219. Crime rate and lower-status population are both negatively and significantly related to housing value, while pollution is not statistically significant once the other variables are included. By controlling for these additional variables, the model provides a clearer estimate of the relationship between rooms and housing prices.

The results should still be interpreted carefully. The data are cross-sectional, so the regression results show association rather than definite causation. There may still be omitted variables, such as school quality, distance to employment centers, transportation access, or neighborhood amenities. Measurement error may also exist in variables such as crime and pollution. Still, the analysis provides useful evidence that both housing characteristics and neighborhood conditions are important in explaining differences in Boston housing values.