HW3

2025-11-11

Simple Linear Regression

What is linear regression? Linear regression uses data to fit the “straight line” that best describes how one variable changes when another changes.

As an example, we will use housing data from the `txhousing` dataset, which ships with R’s ggplot2 package.

Think of predicting housing prices from a single ‘clue’. We work with two variables:

Variable 1: Number of listings (the predictor)
Variable 2: Median house price (the response)

Using these two variables, we will use linear regression to see whether there is a relationship between them.

DataSet `txhousing`

Focus Point: Dallas from 2000 - 2015

Key Variables:

• listings: Number of homes listed
• median_price: Median home price (in $1000s)

head(txHousing_DF)

## # A tibble: 6 × 6
##   city   month    listings median inventory median_price
##   <chr>  <fct>       <dbl>  <dbl>     <dbl>        <dbl>
## 1 Dallas January     13316 124400       3.7         124.
## 2 Dallas February    13495 127700       3.7         128.
## 3 Dallas March       13752 128500       3.7         128.
## 4 Dallas April       13752 132000       3.7         132 
## 5 Dallas May         14018 137100       3.7         137.
## 6 Dallas June        14392 138800       3.8         139.
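
The code that builds `txHousing_DF` is not shown here. The sketch below is one plausible reconstruction, assuming the data come from `ggplot2::txhousing`, that rows are filtered to Dallas for 2000 to 2015, and that `median_price` is `median` divided by 1000. These steps are inferred from the printed output and may differ from the original code.

```r
# Assumption: the source data is ggplot2's txhousing (needs ggplot2 installed).
tx <- as.data.frame(ggplot2::txhousing)

# Keep only Dallas rows for 2000-2015.
dallas <- tx[tx$city == "Dallas" & tx$year >= 2000 & tx$year <= 2015, ]

# Rebuild the six columns shown above; median_price is median in $1000s.
txHousing_DF <- data.frame(
  city         = dallas$city,
  month        = factor(month.name[dallas$month], levels = month.name),
  listings     = dallas$listings,
  median       = dallas$median,
  inventory    = dallas$inventory,
  median_price = dallas$median / 1000
)

head(txHousing_DF)
```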

Linear Regression in Action

Linear Regression helps us model the relationship between two variables by fitting a “best-fit” line.

This line shows how a variable changes when another variable changes.
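
A minimal sketch of drawing such a line in base R. To stay self-contained it uses only the six Dallas rows printed in the `head()` output, with `median_price` taken as `median / 1000`. Six consecutive months are far too few to represent the full sample, so this line will not match the full-data fit; the object names `six_rows` and `fit_six` are illustrative.

```r
# Only the six rows shown in the head() output; not the full data set.
six_rows <- data.frame(
  listings     = c(13316, 13495, 13752, 13752, 14018, 14392),
  median_price = c(124.4, 127.7, 128.5, 132.0, 137.1, 138.8)
)

# Fit a straight line to these points.
fit_six <- lm(median_price ~ listings, data = six_rows)

# Scatter plot of the data with the fitted line drawn on top.
plot(six_rows$listings, six_rows$median_price,
     xlab = "Number of listings",
     ylab = "Median price ($1000s)",
     main = "Fitted line on six sample rows")
abline(fit_six, col = "blue", lwd = 2)
```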

Another Example

Here is an example of a regression line using another variable.

The Logic

\[ y = \beta_0 + \beta_1 \cdot x + \varepsilon, \text{ where } \varepsilon \sim \mathcal{N}(\mu = 0, \sigma^2) \]

Explanation:

$y$ → Dependent variable (median home price in $1000s).
$x$ → Independent variable (number of home listings).
$\beta_0$ → Intercept — baseline value of $y$ when $x = 0$.
$\beta_1$ → Slope — how much $y$ changes for each 1-unit increase in $x$.
$\varepsilon$ → Error term — random variation not captured by the model.
The notation $\varepsilon \sim \mathcal{N}(\mu = 0, \sigma^2)$ means errors follow a normal distribution with mean 0 and variance $\sigma^2$.
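
To make the roles of $\beta_0$, $\beta_1$, and $\varepsilon$ concrete, the sketch below simulates data from this model and checks that `lm()` recovers the parameters. The parameter values, sample size, and range of `x` are hypothetical, chosen only for illustration.

```r
set.seed(1)  # reproducible random draws

# Hypothetical "true" parameters, chosen only for illustration.
beta0 <- 200      # intercept
beta1 <- -0.002   # slope
sigma <- 20       # standard deviation of the error term

n   <- 500
x   <- runif(n, min = 10000, max = 40000)  # simulated predictor values
eps <- rnorm(n, mean = 0, sd = sigma)      # errors ~ N(0, sigma^2)
y   <- beta0 + beta1 * x + eps             # the model equation

sim_fit <- lm(y ~ x)
coef(sim_fit)            # estimates should land near beta0 and beta1
summary(sim_fit)$sigma   # estimate of sigma
```

Because the errors are random, the estimates will not equal the true values exactly, but they should get closer as `n` grows.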

Now let’s see how to build a model that predicts an output variable from a given input.

The Model

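# Fit median price (in $1000s) as a linear function of the number of listings
mod <- lm(median_price ~ listings, data = txHousing_DF)
# Store the predictor and response as plain vectors
x <- txHousing_DF$listings; y <- txHousing_DF$median_price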
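
For simple linear regression, the least-squares estimates returned by `lm()` equal a closed form: the slope is the sample covariance of `x` and `y` divided by the sample variance of `x`, and the intercept is `mean(y) - slope * mean(x)`. A small sketch checking this, using only the six rows printed in the `head()` output (an assumption to keep the example self-contained, so the estimates will differ from the full-data model):

```r
# Only the six rows shown in the head() output; median_price = median / 1000.
x <- c(13316, 13495, 13752, 13752, 14018, 14392)   # listings
y <- c(124.4, 127.7, 128.5, 132.0, 137.1, 138.8)   # median price ($1000s)

# Closed-form least-squares estimates.
slope     <- cov(x, y) / var(x)
intercept <- mean(y) - slope * mean(x)

# The same estimates from lm().
fit <- lm(y ~ x)

# Both approaches should agree up to floating-point error.
all.equal(unname(coef(fit)), c(intercept, slope))
```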

The Model Continued

summary(mod)

After feeding our linear model some data, we can now review its summary results to evaluate how well it explains the relationship between the two variables.

The output provides key information such as the intercept, the slope coefficient, and several statistical measures (including p-values, R-squared, and the residual standard error), all of which help us assess how well the model fits and how strong the relationship is.

Ideally, a strong model will have:

A slope coefficient that aligns with our expectations (positive or negative direction),
A low p-value, indicating a statistically significant relationship, and
A high R-squared value, showing that a large portion of the variation in home prices can be explained by the number of listings.

In the following slides, we’ll interpret these values in detail to understand how listings affect home prices in Dallas.

summary(mod)

## 
## Call:
## lm(formula = median_price ~ listings, data = txHousing_DF)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -50.063 -11.006   1.199  10.713  64.124 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.959e+02  4.986e+00  39.295  < 2e-16 ***
## listings    -1.613e-03  2.187e-04  -7.376  5.4e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 20.07 on 184 degrees of freedom
## Multiple R-squared:  0.2282, Adjusted R-squared:  0.224 
## F-statistic:  54.4 on 1 and 184 DF,  p-value: 5.403e-12
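
Several numbers in this output follow from one another, which gives a quick way to check a reading of the table. The sketch below uses only values copied from the output above; small mismatches are expected because the printed values are rounded. The variable names are illustrative.

```r
# Values copied from the summary output (rounded as printed).
est_slope <- -1.613e-03
se_slope  <- 2.187e-04
r_squared <- 0.2282
df_resid  <- 184   # residual degrees of freedom
n_pred    <- 1     # one predictor (listings)

# t value = estimate / standard error
t_slope <- est_slope / se_slope

# Two-sided p-value from the t distribution
p_slope <- 2 * pt(-abs(t_slope), df = df_resid)

# F-statistic from R-squared; with one predictor it equals t^2
f_stat <- (r_squared / n_pred) / ((1 - r_squared) / df_resid)

# Adjusted R-squared penalizes for the number of predictors
n_obs  <- df_resid + n_pred + 1
adj_r2 <- 1 - (1 - r_squared) * (n_obs - 1) / df_resid

round(c(t = t_slope, F = f_stat, adj_r2 = adj_r2), 4)
p_slope
```

The results should come out close to the printed t value (-7.376), F-statistic (54.4), and adjusted R-squared (0.224).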

Interpreting the Model

Now that the model has been built, the summary output provides several key insights about how the number of listings relates to median home price in Dallas (2000–2015):

Intercept (β₀ ≈ 195.9)
The baseline value of the regression line, representing the predicted median home price (in $1000s) when the number of listings is zero.
While not meaningful in a real-world sense (since listings can’t truly be zero), it serves as the starting point of the fitted line.
Slope (β₁ ≈ -0.0016)
For each additional home listing, the model predicts a decrease of about $1.60 in median price.
The negative coefficient indicates a weak inverse relationship: when more homes are on the market (higher supply), prices tend to drop slightly.
p-values (< 0.001)
Both the intercept and slope are statistically significant (indicated by the '***').
This means we have strong evidence that the slope is not zero: there is a relationship, even if it’s weak.
R-squared ≈ 0.23
About 22.8% of the variation in median price can be explained by the number of listings.
This is a relatively low value, suggesting that listings alone do not strongly predict price. Other factors like demand, interest rates, and time of year likely play a larger role.
Residual Standard Error (≈ 20.07)
On average, predictions differ from actual prices by about $20,000. This reflects moderate variability in the data.
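
To see what these coefficients imply in practice, the sketch below plugs a listing count into the fitted line by hand. The coefficients are copied from the summary output; the listing counts and the helper name `predict_price` are hypothetical. With the fitted model object available, `predict(mod, newdata = data.frame(listings = ...))` would do the same thing using unrounded coefficients.

```r
# Coefficients copied from the summary output (price is in $1000s).
b0 <- 195.9       # intercept
b1 <- -0.001613   # slope

# Predicted median price (in $1000s) for a given number of listings.
predict_price <- function(listings) b0 + b1 * listings

# Hypothetical listing counts, chosen only for illustration.
predict_price(c(15000, 20000, 25000))

# Change in predicted price, in dollars, per additional listing.
b1 * 1000
```

For example, 20,000 listings gives a predicted median price of about 163.6, i.e. roughly $163,600.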

Final Thoughts

The bigger picture:
Our regression line revealed a statistically significant trend: as the number of listings increases, the median home price tends to dip slightly.
This aligns with the basic principles of supply and demand: when more homes are available, prices often soften.

However, because our R² value is relatively low, we can conclude that listings alone don’t fully explain housing prices.
Other factors, such as interest rates, time of year, or overall market conditions, likely play a significant role.

Further analysis using multiple regression or additional variables could refine the model and provide a clearer, more accurate understanding of what drives housing prices in Dallas.