ECON 465 – Week 7 Lab: Predictive Linear Regression

Author

Gül Ertan Özgüzer

Lab Objectives

By the end of this lab, you will be able to:

Distinguish prediction from causal inference in economics
Implement linear regression for out‑of‑sample prediction
Split data into training and test sets
Evaluate predictive performance using MSE and RMSE
Interpret linear regression coefficients in a predictive context
Apply these concepts to a real economic problem: predicting house prices in İzmir

The Economic Question

Can we predict the selling price of a house based on its characteristics (size, number of rooms, age, location)? How accurate can our predictions be? In this lab, we use a realistic dataset of house sales in İzmir to build and evaluate a predictive linear regression model.

1 Dataset: İzmir House Sales

We will use a simulated but realistic dataset of 2,000 house sales in İzmir. The dataset (izmir_houses.csv) contains the following variables:

Variable	Description	Unit / Values
`price`	Selling price	Turkish Lira (TL), rounded to thousand
`sqm`	Size of the house	Square meters
`bedrooms`	Number of bedrooms	1 to 6
`age`	Age of the house	Years (0 to 50)
`district`	District in İzmir	Konak, Karşıyaka, Bornova, Buca, Çiğli, Balçova, Narlıdere, Güzelbahçe, Bayraklı, Menemen
`floor`	Floor level	0 = ground floor, up to 10

The data reflects realistic price relationships: larger sqm, more bedrooms, newer houses, and certain districts (e.g., Narlıdere, Güzelbahçe) increase price.

# Load required packages
library(tidyverse)
library(tidymodels)  # for splitting and metrics
library(ggplot2)

# Read the data
houses <- read_csv("data/izmir_houses.csv")
glimpse(houses)

Rows: 2,000
Columns: 6
$ price    <dbl> 2372000, 3057000, 1660000, 1818000, 4108000, 2461000, 1316000…
$ sqm      <dbl> 63, 92, 60, 81, 102, 86, 88, 62, 47, 79, 84, 84, 86, 83, 47, …
$ bedrooms <dbl> 1, 3, 1, 2, 3, 3, 2, 2, 2, 1, 2, 2, 3, 2, 1, 3, 4, 2, 3, 3, 2…
$ age      <dbl> 3, 32, 36, 48, 5, 19, 36, 34, 39, 16, 9, 22, 5, 7, 39, 34, 38…
$ district <chr> "Buca", "Bayraklı", "Narlıdere", "Karşıyaka", "Güzelbahçe", "…
$ floor    <dbl> 10, 0, 0, 2, 9, 5, 9, 9, 0, 1, 7, 4, 4, 0, 2, 1, 0, 0, 0, 3, …

2 Part 1: Prediction vs. Causality – A Crucial Distinction

2.1 What Is Causal Inference?

In traditional econometrics, we ask: What is the causal effect of \(X\) on \(Y\)? For example, “Does an additional bedroom cause a higher house price?” To answer this, we need to control for confounders (size, location, etc.) and use methods like instrumental variables.

2.2 What Is Prediction?

In predictive modeling, we ask: Given the characteristics of a house, can we accurately predict its price? We don’t care whether the relationship is causal; we care only about out‑of‑sample accuracy. A variable can be a good predictor even if it has no causal effect (e.g., floor level might correlate with price within a neighborhood without itself causing the price difference).
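The idea that a variable can predict well without being causal can be illustrated with a small simulation. This is a minimal sketch on made-up data, not the İzmir dataset: the variables `x`, `y`, and `z` and all numeric values are assumptions chosen for illustration, with `z` playing the role of a confounder that drives both `x` and `y`.

```r
# Simulated toy data (NOT the İzmir dataset): z is a hypothetical confounder
set.seed(1)
n <- 5000
z <- rnorm(n)                 # e.g., neighborhood quality
x <- z + rnorm(n, sd = 0.5)   # x is driven by z but has no effect on y
y <- 2 * z + rnorm(n)         # y depends only on z

# x alone is a useful predictor of y ...
fit_pred <- lm(y ~ x)
r2_x_only <- summary(fit_pred)$r.squared

# ... but once z is held constant, the coefficient on x is close to zero
fit_control <- lm(y ~ x + z)
coef_x_given_z <- coef(fit_control)[["x"]]

cat("R-squared using x alone:", round(r2_x_only, 2), "\n")
cat("Coefficient on x, controlling for z:", round(coef_x_given_z, 2), "\n")
```

For prediction, the first model is perfectly usable. For a causal question ("what happens to y if we change x?"), it would be misleading.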

2.3 Why This Distinction Matters

Policy evaluation needs causality.
Business decisions (e.g., pricing, risk assessment) often only need prediction.

In this lab, we focus on prediction.

3 Part 2: The Predictive Modeling Workflow

3.1 Train/Test Split

We cannot evaluate a model on the same data used to train it – that would overestimate performance. Instead, we split the data into:

Training set (80%) : used to estimate the model coefficients.
Test set (20%) : used to evaluate how well the model predicts new, unseen data.

# Set seed for reproducibility
set.seed(465)

# Create a split object (80% training, 20% testing)
house_split <- initial_split(houses, prop = 0.8)

# Extract the two sets
house_train <- training(house_split)
house_test <- testing(house_split)

cat("Training set rows:", nrow(house_train), "\n")

Training set rows: 1600

cat("Test set rows:", nrow(house_test), "\n")

Test set rows: 400

3.2 Why Random Split?

Random assignment makes it likely that both sets are representative of the whole dataset, because no systematic rule decides which houses end up in which set. The test set acts as a proxy for future data that the model has never seen.
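To make the mechanics of a random split concrete, here is a minimal base R sketch on a made-up ten-row data frame. It is meant to convey the idea behind a random 80/20 split, not to reproduce how `initial_split()` is implemented internally; the data frame `toy` and its values are assumptions for illustration.

```r
# Toy data frame standing in for `houses` (values are made up)
set.seed(465)
toy <- data.frame(id = 1:10, price = 1e6 + 1e5 * (0:9))

# Draw 80% of the row indices at random, without replacement
train_idx <- sample(nrow(toy), size = floor(0.8 * nrow(toy)))

toy_train <- toy[train_idx, ]    # rows used to fit the model
toy_test  <- toy[-train_idx, ]   # held-out rows, used only for evaluation

nrow(toy_train)  # 8
nrow(toy_test)   # 2
```

Every row lands in exactly one of the two sets, so no test observation is ever seen during training.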

4 Part 3: Building a Linear Regression Model

4.1 Simple Linear Regression (One Predictor)

We start with the simplest model: predict price using only sqm (size).

# Train the model on the training data
model_simple <- lm(price ~ sqm, data = house_train)

# View coefficients
summary(model_simple)


Call:
lm(formula = price ~ sqm, data = house_train)

Residuals:
     Min       1Q   Median       3Q      Max 
-3207543  -502449   -73634   425568  7434075 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -21031.7    67712.8  -0.311    0.756    
sqm          26433.8      695.5  38.008   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 798300 on 1598 degrees of freedom
Multiple R-squared:  0.4748,    Adjusted R-squared:  0.4745 
F-statistic:  1445 on 1 and 1598 DF,  p-value: < 2.2e-16

Interpretation:

Intercept ((Intercept)) is the predicted price for a 0 sqm house (not economically meaningful, but needed for the line).
Slope (sqm) ≈ 26,400 TL per additional square meter.
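As a worked example of how the fitted line turns into a prediction, the sketch below plugs the two coefficients from the `summary(model_simple)` output into the regression equation by hand. The house sizes (80 and 100 sqm) are hypothetical values chosen for illustration.

```r
# Coefficients copied from the summary(model_simple) output above
intercept <- -21031.7
slope_sqm <- 26433.8

# Hypothetical house sizes in square meters (chosen only for illustration)
sqm_new <- c(80, 100)

# predicted price = intercept + slope * sqm
predicted_price <- intercept + slope_sqm * sqm_new
predicted_price
```

With the fitted model object available, `predict(model_simple, newdata = data.frame(sqm = c(80, 100)))` should give the same numbers up to rounding of the printed coefficients.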

4.2 Multiple Linear Regression (More Predictors)

Add bedrooms and age to improve prediction.

# Multiple linear regression
model_multiple <- lm(price ~ sqm + bedrooms + age, data = house_train)
summary(model_multiple)


Call:
lm(formula = price ~ sqm + bedrooms + age, data = house_train)

Residuals:
     Min       1Q   Median       3Q      Max 
-2436112  -407856   -45319   363120  6929022 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 798122.0    62604.9  12.749   <2e-16 ***
sqm          26796.7      923.5  29.016   <2e-16 ***
bedrooms    -17997.4    28954.3  -0.622    0.534    
age         -32238.6     1144.1 -28.177   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 652800 on 1596 degrees of freedom
Multiple R-squared:  0.6493,    Adjusted R-squared:  0.6486 
F-statistic: 984.9 on 3 and 1596 DF,  p-value: < 2.2e-16

Interpretation:

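Each additional square meter adds about 26,800 TL to the predicted price, holding bedrooms and age constant.
The bedrooms coefficient is about −18,000 TL, but it is not statistically distinguishable from zero (p = 0.534): once size is held constant, the number of bedrooms adds little information.
Each additional year of age reduces the predicted price by about 32,200 TL.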

4.3 Including Categorical Variables (Districts)

Districts like Narlıdere or Konak affect prices. We add district as a factor.

# Add district (categorical)
model_categorical <- lm(price ~ sqm + bedrooms + age + district, data = house_train)
summary(model_categorical)


Call:
lm(formula = price ~ sqm + bedrooms + age + district, data = house_train)

Residuals:
     Min       1Q   Median       3Q      Max 
-1868665  -306170   -14062   238629  6447531 

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)    
(Intercept)         1310729.8    72325.9  18.123  < 2e-16 ***
sqm                   27214.5      716.3  37.994  < 2e-16 ***
bedrooms             -22497.9    22437.2  -1.003 0.316156    
age                  -32737.1      887.4 -36.891  < 2e-16 ***
districtBayraklı    -246258.4    70482.4  -3.494 0.000489 ***
districtBornova     -352350.2    65150.5  -5.408 7.34e-08 ***
districtBuca        -731744.0    69186.2 -10.576  < 2e-16 ***
districtÇiğli       -863047.7    68367.2 -12.624  < 2e-16 ***
districtGüzelbahçe   168172.6    96737.5   1.738 0.082327 .  
districtKarşıyaka    -80180.0    66747.9  -1.201 0.229839    
districtKonak       -530942.6    66250.1  -8.014 2.13e-15 ***
districtMenemen    -1124222.1    62326.6 -18.038  < 2e-16 ***
districtNarlıdere    155192.2    82288.5   1.886 0.059484 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 505400 on 1587 degrees of freedom
Multiple R-squared:  0.7909,    Adjusted R-squared:  0.7893 
F-statistic: 500.2 on 12 and 1587 DF,  p-value: < 2.2e-16

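R automatically creates a dummy variable for each district except one, which serves as the reference. By default the reference is the first level in alphabetical order, here Balçova, which is why it does not appear in the output. Each district coefficient is the predicted price difference relative to Balçova, holding sqm, bedrooms, and age constant. A positive coefficient means that district is more expensive than the reference; for example, Menemen is about 1,124,000 TL cheaper than Balçova.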
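The dummy coding can be inspected directly with `model.matrix()`. The sketch below uses a made-up five-row data frame with three districts (the prices are invented, in million TL) to show which level becomes the reference and how `relevel()` changes it.

```r
# Toy data with three districts (values are made up for illustration)
toy <- data.frame(
  district = c("Konak", "Bornova", "Buca", "Konak", "Bornova"),
  price    = c(2.0, 2.4, 1.8, 2.1, 2.5)
)

# Factor levels are sorted alphabetically, so the first level
# ("Bornova" here) becomes the reference category
toy$district <- factor(toy$district)
levels(toy$district)

# The design matrix has an intercept plus one dummy per non-reference level
model.matrix(~ district, data = toy)

# relevel() changes the reference category, here to "Konak"
toy$district_konak_ref <- relevel(toy$district, ref = "Konak")
names(coef(lm(price ~ district_konak_ref, data = toy)))
```

Changing the reference level changes how the coefficients are expressed but not the model's predictions.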

5 Part 4: Making Predictions and Evaluating Performance

5.1 Predict on the Test Set

We use the trained model to predict prices for the test set.

house_test_with_pred <- house_test |>
  mutate(predicted_price = predict(model_categorical, newdata = house_test))

5.2 Evaluation Metrics

Mean Squared Error (MSE): Average squared difference between actual and predicted prices. Penalizes large errors more.

\[ \text{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2 \]

Root Mean Squared Error (RMSE): Square root of MSE. Interpretable in the same units as the outcome (TL).

\[ \text{RMSE} = \sqrt{\text{MSE}} \]

R-squared (\(R^2\)): Proportion of variance in the outcome explained by the model (on the test set, this is called “test R²”).
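A small by-hand calculation may help make these formulas concrete. The four actual and predicted values below are made up (in thousand TL) so that the arithmetic can be checked mentally; they do not come from the İzmir data.

```r
# Made-up actual and predicted values (thousand TL) for four houses
actual    <- c(2000, 3000, 1500, 2500)
predicted <- c(1990, 3010, 1480, 2520)

errors <- actual - predicted      # 10, -10, 20, -20

mse  <- mean(errors^2)            # (100 + 100 + 400 + 400) / 4 = 250
rmse <- sqrt(mse)                 # about 15.8, same units as the outcome
mae  <- mean(abs(errors))         # (10 + 10 + 20 + 20) / 4 = 15

# R-squared as 1 - (sum of squared errors) / (total sum of squares)
rsq <- 1 - sum(errors^2) / sum((actual - mean(actual))^2)

c(mse = mse, rmse = rmse, mae = mae, rsq = rsq)
```

Note that RMSE (about 15.8) is larger than MAE (15) here because squaring gives the two larger errors more weight.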

# Calculate metrics using yardstick


# Create a data frame with actual and predicted
metrics_summary <- house_test_with_pred |>
  metrics(truth = price, estimate = predicted_price)

# MSE
mse_val <- house_test_with_pred |>
  summarise(mse = mean((price - predicted_price)^2))
metrics_summary

# A tibble: 3 × 3
  .metric .estimator  .estimate
  <chr>   <chr>           <dbl>
1 rmse    standard   482836.   
2 rsq     standard        0.796
3 mae     standard   359324.

mse_val

# A tibble: 1 × 1
            mse
          <dbl>
1 233130374871.

# RMSE
rmse_val <- house_test_with_pred |>
  summarise(rmse = sqrt(mean((price - predicted_price)^2)))
rmse_val

# A tibble: 1 × 1
     rmse
    <dbl>
1 482836.

# R-squared (test)
rsq_val <- house_test_with_pred  |>
  summarise(rsq = 1 - sum((price - predicted_price)^2) /
                       sum((price - mean(price))^2))
rsq_val

# A tibble: 1 × 1
    rsq
  <dbl>
1 0.796

Interpretation:

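\(RMSE ≈\) 483,000 TL means that a typical prediction error is about 483,000 TL. Because RMSE weights large errors more heavily, it is larger than the mean absolute error (MAE) of about 359,000 TL.
Test \(R² ≈\) 0.80 means the model explains about 80% of the variance in test set prices, which indicates good predictive performance.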
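A useful companion check is to compare the RMSE on the training set with the RMSE on the test set: a test RMSE far above the training RMSE suggests the model has fit noise in the training data. The sketch below shows the pattern on simulated data. The data frame `sim`, its coefficients, and the helper `rmse_on()` are all hypothetical and are not part of the lab's dataset or of any package.

```r
# Simulated toy data (NOT the İzmir dataset); all numbers are made up
set.seed(465)
n <- 500
sim <- data.frame(sqm = runif(n, 40, 200), age = runif(n, 0, 50))
sim$price <- 800000 + 27000 * sim$sqm - 32000 * sim$age +
  rnorm(n, sd = 500000)

# 80/20 random split
train_idx <- sample(n, size = floor(0.8 * n))
sim_train <- sim[train_idx, ]
sim_test  <- sim[-train_idx, ]

# Hypothetical helper: RMSE of a fitted model on any data set
rmse_on <- function(model, data) {
  sqrt(mean((data$price - predict(model, newdata = data))^2))
}

fit <- lm(price ~ sqm + age, data = sim_train)

rmse_train <- rmse_on(fit, sim_train)
rmse_test  <- rmse_on(fit, sim_test)

cat("Train RMSE:", round(rmse_train), "\n")
cat("Test RMSE: ", round(rmse_test), "\n")
```

With a simple model like this one, the two values should be of similar size. The same helper could be applied to `model_categorical` with `house_train` and `house_test`.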

6 Part 5: Your Turn – Predicting House Prices with a New Feature

Task

The dataset also contains a floor variable. Add floor to the multiple regression model and evaluate whether it improves prediction.

Create a new model: price ~ sqm + bedrooms + age + district + floor.
Predict on the test set.
Compute RMSE and R².
Compare the RMSE with that of the previous model with district but without floor (model_categorical).
Write a short interpretation: does floor help predict price?

Glossary of Functions Used

Function	What it does
`initial_split()`	Randomly splits data into training/testing
`training()` / `testing()`	Extracts the subsets
`lm()`	Fits linear regression model
`predict()`	Generates predictions from a fitted model
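`metrics()`	Computes RMSE, R-squared, and MAE in one call
`rmse()`	Calculates root mean squared error
`rsq()`	Calculates R-squared (as the squared correlation between actual and predicted values)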
`yardstick` package	Provides metrics for evaluating predictions

Summary: What We Learned Today

Prediction and causal inference are different goals; this lab focused on prediction.
Train/test split is essential for honest evaluation.
Linear regression can be used for prediction, not just causal inference.
RMSE measures average prediction error (in TL).
Overfitting occurs when a model fits training noise; comparing train/test RMSE helps detect it.

These skills are directly applicable to real‑world economic problems like pricing, demand forecasting, and risk assessment.