ECON 465 – Week 7 Lab: Predictive Linear Regression

Author

Gül Ertan Özgüzer

Lab Objectives

By the end of this lab, you will be able to:

  • Distinguish prediction from causal inference in economics
  • Implement linear regression for out‑of‑sample prediction
  • Split data into training and test sets
  • Evaluate predictive performance using MSE and RMSE
  • Interpret linear regression coefficients in a predictive context
  • Apply these concepts to a real economic problem: predicting house prices in İzmir

The Economic Question

Can we predict the selling price of a house based on its characteristics (size, number of rooms, age, location)? How accurate can our predictions be? In this lab, we use a realistic dataset of house sales in İzmir to build and evaluate a predictive linear regression model.


1 Dataset: İzmir House Sales

We will use a simulated but realistic dataset of 2,000 house sales in İzmir. The dataset (izmir_houses.csv) contains the following variables:

Variable   Description          Unit / Values
price      Selling price        Turkish lira (TL), rounded to the nearest thousand
sqm        Size of the house    Square meters
bedrooms   Number of bedrooms   1 to 6
age        Age of the house     Years (0 to 50)
district   District in İzmir    Konak, Karşıyaka, Bornova, Buca, Çiğli, Balçova, Narlıdere, Güzelbahçe, Bayraklı, Menemen
floor      Floor level          0 = ground floor, up to 10

The data reflects realistic price relationships: larger sqm, more bedrooms, newer houses, and certain districts (e.g., Narlıdere, Güzelbahçe) increase price.

# Load required packages
library(tidyverse)
library(tidymodels)  # for splitting and metrics
library(ggplot2)

# Read the data
houses <- read_csv("data/izmir_houses.csv")
glimpse(houses)
Rows: 2,000
Columns: 6
$ price    <dbl> 2372000, 3057000, 1660000, 1818000, 4108000, 2461000, 1316000…
$ sqm      <dbl> 63, 92, 60, 81, 102, 86, 88, 62, 47, 79, 84, 84, 86, 83, 47, …
$ bedrooms <dbl> 1, 3, 1, 2, 3, 3, 2, 2, 2, 1, 2, 2, 3, 2, 1, 3, 4, 2, 3, 3, 2…
$ age      <dbl> 3, 32, 36, 48, 5, 19, 36, 34, 39, 16, 9, 22, 5, 7, 39, 34, 38…
$ district <chr> "Buca", "Bayraklı", "Narlıdere", "Karşıyaka", "Güzelbahçe", "…
$ floor    <dbl> 10, 0, 0, 2, 9, 5, 9, 9, 0, 1, 7, 4, 4, 0, 2, 1, 0, 0, 0, 3, …

2 Part 1: Prediction vs. Causality – A Crucial Distinction

2.1 What Is Causal Inference?

In traditional econometrics, we ask: What is the causal effect of \(X\) on \(Y\)? For example, “Does an additional bedroom cause a higher house price?” To answer this, we need to control for confounders (size, location, etc.) and use methods like instrumental variables.

2.2 What Is Prediction?

In predictive modeling, we ask: Given the characteristics of a house, can we accurately predict its price? We don’t care if the relationship is causal – we only care about out‑of‑sample accuracy. A variable might be a good predictor even if it has no causal effect (e.g., the floor level might correlate with price in a neighborhood, but it doesn’t cause the price).
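The point that a variable can predict well without being causal can be illustrated with a small simulation. This is a minimal sketch on made-up data, not the İzmir dataset: the variable names (quality, floor_level) and all numeric values are assumptions chosen for illustration. Here an unobserved neighborhood quality drives both floor level and price, while floor level itself has no effect on price.

```r
# Simulated illustration (hypothetical data, not the İzmir dataset)
set.seed(1)
n <- 1000
quality <- rnorm(n)                                  # unobserved confounder
floor_level <- 5 + 2 * quality + rnorm(n)            # driven by quality
price <- 2000 + 500 * quality + rnorm(n, sd = 200)   # floor_level has NO causal effect

sim <- data.frame(price, floor_level)
train <- sim[1:800, ]
test <- sim[801:1000, ]

# Fit on the training rows, predict the held-out rows
fit <- lm(price ~ floor_level, data = train)
pred <- predict(fit, newdata = test)

# Compare against a baseline that always predicts the training mean
rmse_model <- sqrt(mean((test$price - pred)^2))
rmse_baseline <- sqrt(mean((test$price - mean(train$price))^2))

c(model = rmse_model, baseline = rmse_baseline)
```

In this setup the model beats the baseline out of sample even though changing a house's floor level would not change its price, which is exactly the situation where prediction works but a causal reading of the coefficient would be wrong.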

2.3 Why This Distinction Matters

  • Policy evaluation needs causality.
  • Business decisions (e.g., pricing, risk assessment) often only need prediction.

In this lab, we focus on prediction.

3 Part 2: The Predictive Modeling Workflow

3.1 Train/Test Split

We cannot evaluate a model on the same data used to train it – that would overestimate performance. Instead, we split the data into:

  • Training set (80%) : used to estimate the model coefficients.

  • Test set (20%) : used to evaluate how well the model predicts new, unseen data.

# Set seed for reproducibility
set.seed(465)

# Create a split object (80% training, 20% testing)
house_split <- initial_split(houses, prop = 0.8)

# Extract the two sets
house_train <- training(house_split)
house_test <- testing(house_split)

cat("Training set rows:", nrow(house_train), "\n")
Training set rows: 1600 
cat("Test set rows:", nrow(house_test), "\n") 
Test set rows: 400 

3.2 Why Random Split?

Random assignment makes both sets representative of the whole dataset on average, so neither set differs systematically from the other. The test set acts as a proxy for future data that the model has never seen.
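To see what a random split does, the following sketch performs one by hand in base R and compares the two subsets. It uses simulated data (the data frame sim_houses and its coefficients are hypothetical stand-ins, not the lab dataset), so it runs without any files or extra packages.

```r
# Simulated stand-in for the lab data (hypothetical values)
set.seed(465)
n <- 2000
sim_houses <- data.frame(
  sqm = round(runif(n, 40, 200)),
  age = sample(0:50, n, replace = TRUE)
)
sim_houses$price <- 800000 + 27000 * sim_houses$sqm -
  32000 * sim_houses$age + rnorm(n, sd = 500000)

# Draw 80% of the row indices at random for training
train_idx <- sample(seq_len(n), size = round(0.8 * n))
sim_train <- sim_houses[train_idx, ]
sim_test <- sim_houses[-train_idx, ]

# If the split is representative, the column means should be close
rbind(
  train = colMeans(sim_train),
  test = colMeans(sim_test)
)
```

With a skewed outcome, initial_split() also accepts a strata argument (for example strata = price) to keep the outcome distribution similar across the two sets; whether that is needed here is a judgment call.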

4 Part 3: Building a Linear Regression Model

4.1 Simple Linear Regression (One Predictor)

We start with the simplest model: predict price using only sqm (size).

# Train the model on the training data
model_simple <- lm(price ~ sqm, data = house_train)

# View coefficients
summary(model_simple)

Call:
lm(formula = price ~ sqm, data = house_train)

Residuals:
     Min       1Q   Median       3Q      Max 
-3207543  -502449   -73634   425568  7434075 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -21031.7    67712.8  -0.311    0.756    
sqm          26433.8      695.5  38.008   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 798300 on 1598 degrees of freedom
Multiple R-squared:  0.4748,    Adjusted R-squared:  0.4745 
F-statistic:  1445 on 1 and 1598 DF,  p-value: < 2.2e-16

Interpretation:

  • Intercept ((Intercept)) is the predicted price for a 0 sqm house (not economically meaningful, but needed for the line).

  • Slope (sqm) ≈ 26,400 TL per additional square meter.
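As a worked example, the fitted line can be used to compute a prediction by hand. The sketch below copies the two coefficients from the summary output above; the 100 sqm house is a hypothetical input chosen for illustration.

```r
# Coefficients copied from the summary(model_simple) output
b0 <- -21031.7   # intercept
b1 <- 26433.8    # slope for sqm

sqm_new <- 100   # hypothetical house size
predicted_price <- b0 + b1 * sqm_new
predicted_price
# about 2,622,348 TL
```

Calling predict(model_simple, newdata = tibble(sqm = 100)) should give nearly the same number; small differences would come from the rounding of the printed coefficients.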

4.2 Multiple Linear Regression (More Predictors)

Add bedrooms and age to improve prediction.

# Multiple linear regression
model_multiple <- lm(price ~ sqm + bedrooms + age, data = house_train)
summary(model_multiple)

Call:
lm(formula = price ~ sqm + bedrooms + age, data = house_train)

Residuals:
     Min       1Q   Median       3Q      Max 
-2436112  -407856   -45319   363120  6929022 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 798122.0    62604.9  12.749   <2e-16 ***
sqm          26796.7      923.5  29.016   <2e-16 ***
bedrooms    -17997.4    28954.3  -0.622    0.534    
age         -32238.6     1144.1 -28.177   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 652800 on 1596 degrees of freedom
Multiple R-squared:  0.6493,    Adjusted R-squared:  0.6486 
F-statistic: 984.9 on 3 and 1596 DF,  p-value: < 2.2e-16

Interpretation:

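  • Each additional square meter adds about 26,800 TL to the predicted price, holding bedrooms and age constant.

  • The bedrooms coefficient is about -18,000 TL but is not statistically distinguishable from zero (p = 0.534): once sqm is in the model, the number of bedrooms adds little predictive information.

  • Each additional year of age reduces the predicted price by about 32,200 TL, holding sqm and bedrooms constant.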
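The phrase "holding the other variables constant" can be made concrete by comparing predictions for two houses that differ in only one characteristic. This sketch uses the coefficients printed in the summary(model_multiple) output; the helper predict_price() and the example houses are hypothetical and exist only for this illustration.

```r
# Coefficients copied from the summary(model_multiple) output
coefs <- c(intercept = 798122.0, sqm = 26796.7,
           bedrooms = -17997.4, age = -32238.6)

# Hypothetical helper: linear prediction from the printed coefficients
predict_price <- function(sqm, bedrooms, age) {
  unname(coefs["intercept"] + coefs["sqm"] * sqm +
           coefs["bedrooms"] * bedrooms + coefs["age"] * age)
}

base_house   <- predict_price(sqm = 100, bedrooms = 3, age = 10)
bigger_house <- predict_price(sqm = 101, bedrooms = 3, age = 10)
older_house  <- predict_price(sqm = 100, bedrooms = 3, age = 11)

bigger_house - base_house   # equals the sqm coefficient
older_house - base_house    # equals the age coefficient
```

Because the model is linear, the difference between the two predictions is the coefficient itself, whatever values the other variables are fixed at.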

4.3 Including Categorical Variables (Districts)

Districts like Narlıdere or Konak affect prices. We add district as a factor.

# Add district (categorical)
model_categorical <- lm(price ~ sqm + bedrooms + age + district, data = house_train)
summary(model_categorical)

Call:
lm(formula = price ~ sqm + bedrooms + age + district, data = house_train)

Residuals:
     Min       1Q   Median       3Q      Max 
-1868665  -306170   -14062   238629  6447531 

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)    
(Intercept)         1310729.8    72325.9  18.123  < 2e-16 ***
sqm                   27214.5      716.3  37.994  < 2e-16 ***
bedrooms             -22497.9    22437.2  -1.003 0.316156    
age                  -32737.1      887.4 -36.891  < 2e-16 ***
districtBayraklı    -246258.4    70482.4  -3.494 0.000489 ***
districtBornova     -352350.2    65150.5  -5.408 7.34e-08 ***
districtBuca        -731744.0    69186.2 -10.576  < 2e-16 ***
districtÇiğli       -863047.7    68367.2 -12.624  < 2e-16 ***
districtGüzelbahçe   168172.6    96737.5   1.738 0.082327 .  
districtKarşıyaka    -80180.0    66747.9  -1.201 0.229839    
districtKonak       -530942.6    66250.1  -8.014 2.13e-15 ***
districtMenemen    -1124222.1    62326.6 -18.038  < 2e-16 ***
districtNarlıdere    155192.2    82288.5   1.886 0.059484 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 505400 on 1587 degrees of freedom
Multiple R-squared:  0.7909,    Adjusted R-squared:  0.7893 
F-statistic: 500.2 on 12 and 1587 DF,  p-value: < 2.2e-16

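R automatically creates a dummy variable for each district except one, which serves as the reference. Here the reference is Balçova, the only district without a coefficient in the output (by default R uses the first factor level, which is the first in alphabetical order). Each district coefficient is the predicted price difference relative to Balçova, holding sqm, bedrooms, and age constant. A positive coefficient means that district is more expensive than the reference; for example, Konak is predicted to be about 531,000 TL cheaper than Balçova.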
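The dummy coding that lm() applies behind the scenes can be inspected with model.matrix(). The following is a minimal sketch on a five-element toy vector (not the lab data); the factor levels are set explicitly so the result does not depend on the locale's sort order.

```r
# Toy district vector with explicit level order
district <- factor(
  c("Balçova", "Konak", "Narlıdere", "Konak", "Balçova"),
  levels = c("Balçova", "Konak", "Narlıdere")
)

# One intercept column plus one 0/1 column per non-reference level
dummies <- model.matrix(~ district)
dummies

# Change the reference level so coefficients are measured relative to Konak
district_konak_ref <- relevel(district, ref = "Konak")
levels(district_konak_ref)
```

Rows for the reference level have zeros in every dummy column, which is why the reference district has no coefficient of its own. Changing the reference with relevel() changes how the coefficients read, not the model's predictions.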

5 Part 4: Making Predictions and Evaluating Performance

5.1 Predict on the Test Set

We use the trained model to predict prices for the test set.

house_test_with_pred <- house_test |>
  mutate(predicted_price = predict(model_categorical, newdata = house_test))

5.2 Evaluation Metrics

Mean Squared Error (MSE) : Average squared difference between actual and predicted prices. Penalizes large errors more.

\[ \text{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2 \]

Root Mean Squared Error (RMSE) : Square root of MSE. Interpretable in the same units as the outcome (TL).

\[ \text{RMSE} = \sqrt{\text{MSE}} \]

R-squared (\(R^2\)) : Proportion of variance in the outcome explained by the model (on the test set, this is called “test R²”).
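The definitions above can be checked on a tiny example where the arithmetic is easy to follow by hand. The four actual and predicted values below are invented for illustration, and the variable names are hypothetical.

```r
# Toy values (invented for illustration)
actual <- c(100, 200, 300, 400)
predicted <- c(110, 190, 320, 380)

errors <- actual - predicted           # -10, 10, -20, 20
mse_toy <- mean(errors^2)              # (100 + 100 + 400 + 400) / 4 = 250
rmse_toy <- sqrt(mse_toy)              # sqrt(250), about 15.8
r2_toy <- 1 - sum(errors^2) /
  sum((actual - mean(actual))^2)       # 1 - 1000 / 50000 = 0.98

# Squared correlation, a different definition of R-squared
r2_cor_toy <- cor(actual, predicted)^2

c(mse = mse_toy, rmse = rmse_toy, r2 = r2_toy, r2_cor = r2_cor_toy)
```

One detail worth knowing: yardstick's rsq() reports the squared correlation between truth and estimate, while rsq_trad() uses the 1 - SSE/SST form. The two often agree closely for a well-calibrated linear model, but they are not the same quantity.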

# Calculate RMSE, R-squared, and MAE in one call using yardstick
metrics_summary <- house_test_with_pred |>
  metrics(truth = price, estimate = predicted_price)

# MSE (computed by hand)
mse_val <- house_test_with_pred |>
  summarise(mse = mean((price - predicted_price)^2))
metrics_summary
# A tibble: 3 × 3
  .metric .estimator  .estimate
  <chr>   <chr>           <dbl>
1 rmse    standard   482836.   
2 rsq     standard        0.796
3 mae     standard   359324.   
mse_val
# A tibble: 1 × 1
            mse
          <dbl>
1 233130374871.
# RMSE (computed by hand; matches the yardstick value above)
rmse_val <- house_test_with_pred |>
  summarise(rmse = sqrt(mean((price - predicted_price)^2)))
rmse_val
# A tibble: 1 × 1
     rmse
    <dbl>
1 482836.
# R-squared (test)
rsq_val <- house_test_with_pred  |>
  summarise(rsq = 1 - sum((price - predicted_price)^2) /
                       sum((price - mean(price))^2))
 rsq_val 
# A tibble: 1 × 1
    rsq
  <dbl>
1 0.796

Interpretation:

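  • \(RMSE ≈\) 483,000 TL means that our predictions typically miss the actual price by about 483,000 TL. RMSE weights large errors more heavily, so it exceeds the mean absolute error (MAE ≈ 359,000 TL).

  • Test \(R² ≈\) 0.80 means the model explains about 80% of the variance in test set prices, which is reasonably strong predictive performance.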
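Comparing training and test RMSE is a common way to check for overfitting. The sketch below does this on simulated data (hypothetical values, not the lab dataset) for a simple model and a deliberately over-flexible one. The helper rmse_of() and the degree-10 polynomial are assumptions made for this illustration.

```r
# Simulated data (hypothetical values)
set.seed(465)
n <- 500
sim <- data.frame(sqm = runif(n, 40, 200), age = runif(n, 0, 50))
sim$price <- 800000 + 27000 * sim$sqm - 32000 * sim$age +
  rnorm(n, sd = 500000)

idx <- sample(seq_len(n), size = round(0.8 * n))
sim_train <- sim[idx, ]
sim_test <- sim[-idx, ]

# Hypothetical helper: RMSE of a fitted model on a given data frame
rmse_of <- function(model, data) {
  sqrt(mean((data$price - predict(model, newdata = data))^2))
}

simple_fit <- lm(price ~ sqm + age, data = sim_train)
flexible_fit <- lm(price ~ poly(sqm, 10) + poly(age, 10), data = sim_train)

data.frame(
  model = c("simple", "flexible"),
  train_rmse = c(rmse_of(simple_fit, sim_train), rmse_of(flexible_fit, sim_train)),
  test_rmse = c(rmse_of(simple_fit, sim_test), rmse_of(flexible_fit, sim_test))
)
```

The flexible model can never have a higher training RMSE than the simple one, because it contains the simple model as a special case. A test RMSE that is clearly worse than the training RMSE is the warning sign of overfitting. The same comparison can be run on house_train and house_test for the lab models.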

6 Part 5: Your Turn – Predicting House Prices with a New Feature

Task

The dataset also contains a floor variable. Add floor to the multiple regression model and evaluate whether it improves prediction.

  1. Create a new model: price ~ sqm + bedrooms + age + district + floor.

  2. Predict on the test set.

  3. Compute RMSE and R².

  4. Compare the RMSE with the previous model that includes district but not floor (model_categorical).

  5. Write a short interpretation: does floor help predict price?

Glossary of Functions Used

Function                 What it does
initial_split()          Randomly splits data into training/testing
training() / testing()   Extracts the subsets
lm()                     Fits linear regression model
predict()                Generates predictions from a fitted model
metrics()                Computes RMSE, R-squared, and MAE in one call
rmse()                   Calculates root mean squared error
rsq()                    Calculates R-squared
yardstick package        Provides metrics for evaluating predictions

Summary: What We Learned Today

  • Prediction and causal inference are different goals; this lab focused on prediction.
  • Train/test split is essential for honest evaluation.
  • Linear regression can be used for prediction, not just causal inference.
  • RMSE measures the typical size of prediction errors, in the units of the outcome (TL).
  • Overfitting occurs when a model fits training noise; comparing train/test RMSE helps detect it.

These skills are directly applicable to real‑world economic problems like pricing, demand forecasting, and risk assessment.