Linear Regression (Part 1)

In this lab, we cover the following models: ANOVA and linear regression.

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggthemes)
## Warning: package 'ggthemes' was built under R version 4.3.3
library(ggrepel)
## Warning: package 'ggrepel' was built under R version 4.3.2
library(AmesHousing)
## Warning: package 'AmesHousing' was built under R version 4.3.3
library(boot)
library(broom)
library(lindia)
## Warning: package 'lindia' was built under R version 4.3.3
# remove scientific notation
options(scipen = 6)

# default theme, unless otherwise noted
theme_set(theme_minimal())

ANOVA

Hypothesis testing isn’t always between just two groups. Sometimes, we’re actually interested in comparing multiple groups. In this case, we need to construct a hypothesis test which maintains neutrality, but whose alternative implies that at least one group among many differs. In these cases, given some parameter \(\theta\) across \(k\) different groups, our null hypothesis is:

\[ H_0 : \theta_1 = \theta_2 = \cdots = \theta_k \]

In this way, the alternative is that there exists at least one group \(i\) where \(\theta_i\) is different from the rest. In other words, we are testing whether a group factor is independent of the response variable.

For example, referring to the Ames Housing Dataset (and its documentation), suppose we’re interested in the house price across different Building Types.

# make a simpler name for (and a copy of) the Ames data
ames <- make_ames()
ames <- ames |> rename_with(tolower)
ames |>
  ggplot() +
  geom_boxplot(mapping = aes(y = sale_price, x = bldg_type)) +
  scale_y_log10(labels = \(x) paste('$', x / 1000, 'K')) +
  annotation_logticks(sides = 'l') +
  labs(x = "Building Type",
       y = "Log Sales Price (in $1000s)")

We can see that some of these typically have a slightly higher sales price than others, but we want to know if the differences are large enough to significantly challenge our hypothesis that they’re actually all basically the same. So, in this case, our hypothesis test is

\[ H_0 : \text{average house price is equal across all Building Types} \]

Model Critiquing

I wanted to build on the analysis of the features already in use, building type and sale price, and see if I could find another feature that could further strengthen the predictability of house prices. To do this, I first wanted to see how long it takes houses of different types to start dropping off in value.

To do this, I made a new variable, years_since_built, which subtracts the year the home was built from the year it was sold, giving the age of the home at the time of sale.

First, I looked at the relationship between this new feature and every house type in the dataset.

# Load required library
library(ggplot2)

# Calculate the years since the house was built
ames$years_since_built <- ames$year_sold - ames$year_built

# Plot: Years Since Built vs. Sale Price
ggplot(ames, aes(x = years_since_built, y = sale_price)) +
  geom_point(alpha = 0.6, color = "blue") +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  labs(
    title = "Relationship Between Years Since Built and Sale Price",
    x = "Years Since Built",
    y = "Sale Price ($)"
  ) +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'

As I’d expect, houses get cheaper as they age. But with all the building types grouped together, it’s difficult to draw insights that will aid the linear regression being built. For the next scatter plot, I wanted to compare the OneFam and TwnhsE building types, because the confidence interval plot later in this lab shows that the 95% CIs for these two classes sit the highest.

# Filter the dataset for OneFam and TwnhsE building types
filtered_data <- ames %>%
  filter(bldg_type %in% c("OneFam", "TwnhsE"))

# Recalculate years since built if not already done
filtered_data$years_since_built <- filtered_data$year_sold - filtered_data$year_built

# Plot: Years Since Built vs. Sale Price for OneFam and TwnhsE
ggplot(filtered_data, aes(x = years_since_built, y = sale_price, color = bldg_type)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(
    title = "Years Since Built vs. Sale Price (Filtered by Building Type)",
    x = "Years Since Built",
    y = "Sale Price ($)",
    color = "Building Type"
  ) +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'

What we see is that TwnhsE homes drop off in value much more quickly than OneFam homes. OneFam buildings also last much longer in general, with no TwnhsE homes in the data even reaching 50 years of age. This information could be useful for someone in high-end real estate sales: these are the two priciest building types, but OneFam homes hold their value significantly better, based on the slopes of these two lines.

Next, I look at a similar plot that includes all of the building types, though it’s somewhat hard to interpret.

# Recalculate years since built if not already done
ames$years_since_built <- ames$year_sold - ames$year_built

# Plot: Years Since Built vs. Sale Price with trendlines for all building types
ggplot(ames, aes(x = years_since_built, y = sale_price, color = bldg_type)) +
  geom_point(alpha = 0.6) +  # Scatter plot of all data points
  geom_smooth(method = "lm", se = FALSE) +  # Add trendlines for each building type
  labs(
    title = "Years Since Built vs. Sale Price by Building Type",
    x = "Years Since Built",
    y = "Sale Price ($)",
    color = "Building Type"
  ) +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'

This one below allows us to compare more easily.

# Recalculate years since built if not already done
ames$years_since_built <- ames$year_sold - ames$year_built

# Create the plot
plot <- ggplot(ames, aes(x = years_since_built, y = sale_price)) +
  geom_point(color = "gray", alpha = 0.6) +  # All points in gray
  geom_smooth(aes(color = bldg_type), method = "loess", se = FALSE, linewidth = 1) +  # Trendlines with different colors and thickness
  labs(
    title = "Years Since Built vs. Sale Price by Building Type",
    x = "Years Since Built",
    y = "Sale Price ($)",
    color = "Building Type"
  ) +
  theme_minimal()
print(plot)
## `geom_smooth()` using formula = 'y ~ x'

# Save the plot as a larger image
ggsave("larger_plot.png", plot = plot, width = 15, height = 12, dpi = 300)
## `geom_smooth()` using formula = 'y ~ x'

This shows that both townhouse building types (Twnhs and TwnhsE) have significant drop-offs, while all the rest do a decent job of holding their value.

Next, we’ll run an ANOVA test to get a read on the variance explained by our two features.

n <- aov(sale_price ~ bldg_type + years_since_built, data = ames)
summary(n)
##                     Df         Sum Sq       Mean Sq F value Pr(>F)    
## bldg_type            4   645411087404  161352771851    39.2 <2e-16 ***
## years_since_built    1  6010638675504 6010638675504  1460.2 <2e-16 ***
## Residuals         2924 12036487347443    4116445741                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Based on the results, what stands out most to me is that both bldg_type and years_since_built are highly significant (p-values below 2e-16), with years_since_built accounting for by far the larger share of the explained variance.

Based on this, I’d suggest adding my feature, but also exploring more additions.

# Filter the dataset for OneFam building type with lot_area between 7000 and 8000 sqft
filtered_onefam_data <- ames[ames$bldg_type == "OneFam" & ames$lot_area >= 7000 & ames$lot_area <= 8000, ]

# Plot: Years Since Built vs. Sale Price for OneFam with specific lot_area range
ggplot(filtered_onefam_data, aes(x = years_since_built, y = sale_price)) +
  geom_point(color = "gray", alpha = 0.6) +  # Scatter points in gray
  geom_smooth(method = "loess", se = TRUE, color = "blue", size = 1) +  # Loess trendline with CI
  labs(
    title = "Years Since Built vs. Sale Price (OneFam, Lot Area: 9000-11500 sqft)",
    x = "Years Since Built",
    y = "Sale Price ($)"
  ) +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'

End of Critiquing

Analysis of Variance

To help us understand how each of these groups differ from one another, the Analysis of Variance method measures three different ways that the data points vary:

  • Group Variance: Group averages’ deviation from the global average
    • E.g., in our example, we have five different “group averages”
  • Error Variance: Within-group deviations from their group average
    • I.e., the average amount that data points vary from their group average
  • Total Variance: The amount to which all data points vary from the global average
    • This is just the variance of the numerical column of data.

\[ \begin{align*} \text{MSG} &= \frac{1}{\text{df}_g}\text{SSG} = \frac{1}{k - 1}\sum_{i = 1}^k n_i(\bar{x}_i - \bar{x})^2 \\ \text{MSE} &= \frac{1}{\text{df}_e}\text{SSE} = \frac{1}{n - k}\sum_{i = 1}^k (n_i - 1)s_i^2 \\ \text{MST} &= \frac{1}{\text{df}_t}\text{SST} = \frac{1}{n - 1}\sum_{i = 1}^n (x_i - \bar{x})^2 \end{align*} \]

You might notice that each of these are just different measures of variance:

  • \(MSG\) is the Mean Squared Group variance. It is the variance between the \(k\) groups (one for each group), weighted by the size of each group.
  • \(MSE\) is the Mean Squared Error variance, and it is the (typical) variance within each group, but with an adjustment to the degrees of freedom (replacing the \(n_i - 1\) for each group with \(n-k\)).
  • \(MST\) is the Mean Squared Total variance, and it is just the typical variance of all the values in the column of data.
  • Similarly, each \(\text{SS}\) term is the (weighted) sum of squared differences for the corresponding measure of variance, and it can be shown that \(\text{SST = SSE + SSG}\).
  • The \(\text{df}_j\) represents the Degrees of Freedom for item \(j\). This is the number of data points being used in the calculation (e.g., \(n\) if we’re averaging \(n\) things), minus the number of parameter values used in the calculation which depend on each of those data points (e.g., 1 for \(\text{MSG}\) because we’re using the global mean \(\bar{x}\) in the calculation).
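
As a quick sanity check, here is a minimal sketch that computes these quantities by hand for the Ames building types (using the ames data loaded above) and verifies that \(\text{SST = SSE + SSG}\):

# a minimal sketch: computing MSG, MSE, and MST by hand for bldg_type
n <- nrow(ames)
k <- n_distinct(ames$bldg_type)
x_bar <- mean(ames$sale_price)

by_group <- ames |>
  group_by(bldg_type) |>
  summarise(n_i = n(),
            x_bar_i = mean(sale_price),
            s2_i = var(sale_price))

ssg <- sum(by_group$n_i * (by_group$x_bar_i - x_bar)^2)
sse <- sum((by_group$n_i - 1) * by_group$s2_i)
sst <- sum((ames$sale_price - x_bar)^2)

msg <- ssg / (k - 1)
mse <- sse / (n - k)
mst <- sst / (n - 1)

# SST should equal SSG + SSE (up to floating point error)
all.equal(sst, ssg + sse)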

Assumptions

At this point, you might notice a few assumptions that must hold before we can continue:

  1. Independence. Each data point must be independent of every other data point.
  2. Normality. The distribution of the data within each group must be roughly normal. (E.g., imagine making a histogram for each group.)
  3. Homoscedasticity (Constant Variance). The variance of data within groups remains consistent for every group. (I.e., no group has a significantly higher standard deviation than any other group.)

Of course, these assumptions are very strong and constricting! But, in most cases, if they’re “close enough”, the test will typically hold just fine — you’ll just need to balance the risks as needed.

Independent and Identically Distributed

When we talk about “Independence” in the first assumption, above, we are really referencing observations (i.e., rows of data) that are both (a) independent, and (b) drawn from the same distribution. In other words, they are independent and identically distributed, or i.i.d. Consider the following scenarios:

population_1 contains individual samples of graduate student ages, each from the same Poisson distribution, each one independent of the last. These are i.i.d.

population_2 contains samples of student heights across many disparate age groups. Each age group has a different height distribution. These are independent, but not identically distributed.

population_3 contains measurements of the temperature in the same location from hour to hour. The weather (i.e., temperature) at any moment affects the weather in the following hour. These are identically distributed, but not independent.
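
These populations aren’t part of our dataset, but here is a small simulated sketch of the three scenarios (the distributions and parameter values are made up purely for illustration):

set.seed(1)

# i.i.d.: independent draws from a single Poisson distribution (ages)
population_1 <- rpois(100, lambda = 27)

# independent, but not identically distributed: heights from different age groups
population_2 <- c(rnorm(50, mean = 140, sd = 7),   # younger students
                  rnorm(50, mean = 170, sd = 9))   # older students

# identically distributed, but not independent: an autocorrelated hourly series
population_3 <- as.numeric(arima.sim(model = list(ar = 0.8), n = 100)) + 20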

EXERCISE

For the hypothesis test we’ve formed above, determine if our data set meets all of these assumptions. If not, explain why, or how far off they are.

print("your code here")
## [1] "your code here"

F-Test

Remember that every hypothesis test requires a sampling distribution, considering all possible ways a dataset could have been sampled from the population. This way, we can compare our data against other theoretical possibilities. In our case, we are evaluating the variances defined above. In particular, since \(\text{SST = SSE + SSG}\), we can just concern ourselves with \(\text{SSE}\) and \(\text{SSG}\).

The F Distribution models the ratio of two random variables that each follow a chi-squared distribution, each divided by its degrees of freedom. In our case, both \(SSG\) and \(SSE\) are modeled by a \(\chi^2\) distribution (recall the squared differences between “observed” and “expected”), and we’ve already divided out their degrees of freedom when calculating \(\text{MSG}\) and \(\text{MSE}\). So, the test statistic for an ANOVA is:

\[ F^* = \frac{\text{SSG}/df_g}{\text{SSE}/df_e} = \frac{MSG}{MSE} = \frac{\text{variance between groups}}{\text{variance within groups}} \]

n <- nrow(ames)
k <- n_distinct(ames$bldg_type)

ggplot() +
  geom_function(xlim = c(0, 10), 
                fun = \(x) df(x, k - 1, n - k)) +
  geom_vline(xintercept = 1, color = 'orange') +
  labs(title = 'F Distribution for Ames Building Types',
       x = "F Values",
       y = "Probability Density") +
  theme_hc()

When \(F^* \approx 1\), the variance between groups is roughly the same as the variance within groups, and there is likely no relationship between the groups and the numerical value. If \(F^* < 1\), the same is true, but the variance within groups may just be large. And, if \(F^* > 1\), especially if it’s large, then there is evidence that the variance between the groups overshadows the variance within the groups, indicating that the grouping captures differences that aren’t explained by the variation inherent within the groups themselves.
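
As a quick check, we can compute \(F^*\) and its p-value directly, reusing the by-hand msg and mse from the sketch in the previous section (an assumption; run that chunk first), and compare them against the aov() output below:

# F statistic: ratio of between-group to within-group variance
f_star <- msg / mse
f_star

# p-value: the upper tail of the F distribution at F*
pf(f_star, df1 = k - 1, df2 = n - k, lower.tail = FALSE)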

ANOVA Summary

An ANOVA test models a numerical value (i.e., the Sales Price) using the effects of some categorical factor (i.e., Building Type). We symbolically represent an ANOVA model in R using the formula response ~ factor. In this case, we only have one factor, so our formula is sale_price ~ bldg_type, and we have a one-way ANOVA. If we wanted to do the same thing but for two factors, we’d have a two-way ANOVA, and our formula would be response ~ factor1 * factor2. In our case, we have the following test:

m <- aov(sale_price ~ bldg_type, data = ames)
summary(m)
##               Df         Sum Sq      Mean Sq F value Pr(>F)    
## bldg_type      4   645411087404 161352771851   26.15 <2e-16 ***
## Residuals   2925 18047126022947   6169957615                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Given such a small p-value, and such a large dataset, we can conclude that it is very unlikely the means are all equal, and at least one is different from the rest.

To determine which one is most unlikely to be the same as the rest, we can use multiple pairwise t-tests to compare each group (rows) to each other group (columns).

pairwise.t.test(ames$sale_price, ames$bldg_type, p.adjust.method = "bonferroni")
## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  ames$sale_price and ames$bldg_type 
## 
##          OneFam      TwoFmCon    Duplex      Twnhs      
## TwoFmCon 0.000000051 -           -           -          
## Duplex   0.000000054 1           -           -          
## Twnhs    0.000000010 1           1           -          
## TwnhsE   1           0.000000031 0.000000093 0.000000019
## 
## P value adjustment method: bonferroni

It looks like it is very unlikely that OneFam is the same as the others, but it might have the same price as the TwnhsE group. Similarly, the TwnhsE group is probably not the same price as the rest of the groups.

Notice that we are running multiple t-tests here. With each additional test on the same sample of data, the chance of a spurious “significant” result grows (this is similar to p-hacking/peeking), so the significance threshold for each individual test must shrink in proportion to the number of tests being run. To correct for this, the Bonferroni Correction multiplies each p-value by the number of tests being run (equivalently, dividing the significance level by the number of tests).
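
To see what the adjustment does, here is a small sketch comparing the unadjusted pairwise p-values against a manual Bonferroni adjustment (multiplying by the number of comparisons and capping at 1):

# unadjusted pairwise p-values
p_raw <- pairwise.t.test(ames$sale_price, ames$bldg_type,
                         p.adjust.method = "none")$p.value

# Bonferroni by hand: multiply by the number of comparisons, cap at 1
n_tests <- sum(!is.na(p_raw))
pmin(p_raw * n_tests, 1)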

We can see these group differences more clearly in ggplot using the same boot_ci function (using boot.ci) from a previous week.

# we use a "_" so we don't overwrite the original function
boot_ci <- function (v, func = median, conf = 0.95, n_iter = 100) {
  # the function we want to run on each iteration i
  boot_func <- \(x, i) func(x[i])
  
  # the boot instance, running the function on each iteration
  b <- boot(v, boot_func, R = n_iter)
  b <- boot.ci(b, conf = conf, type = "perc")
  
  # return just the lower and upper values
  return(c("lower" = b$percent[4],
           "upper" = b$percent[5]))
}
df_ci <- ames |>
  group_by(bldg_type) |>
  summarise(ci_lower = boot_ci(sale_price, mean)['lower'],
            mean_price = mean(sale_price),
            ci_upper = boot_ci(sale_price, mean)['upper'])

df_ci
## # A tibble: 5 × 4
##   bldg_type ci_lower mean_price ci_upper
##   <fct>        <dbl>      <dbl>    <dbl>
## 1 OneFam     181121.    184812.  188853.
## 2 TwoFmCon   118126.    125582.  132801.
## 3 Duplex     130506.    139809.  146999.
## 4 Twnhs      126489.    135934.  144312.
## 5 TwnhsE     183907.    192312.  201400.
df_ci |>
  ggplot() +
  geom_errorbarh(mapping = aes(y = bldg_type, 
                               xmin=ci_lower, xmax=ci_upper,
                               color = '95% C.I.'), height = 0.5) +
  geom_point(mapping = aes(x = mean_price, y = bldg_type,
                           color = 'Group Mean'),
             shape = '|',
             size = 5) +
  scale_color_manual(values=c('black', 'orange')) +
  labs(title = "Sales Price By Building Type",
       x = "Sales Price (in $)",
       y = "Building Type",
       color = '')

A few caveats:

  • It is tempting to just look at confidence intervals, and skip the ANOVA test altogether. However, it’s a much better practice to test whether there is a difference at all before looking at supposed differences themselves.
  • You’ll notice that some of these intervals are asymmetric. This is a consequence of the skew of the individual distributions. Take a look at individual histograms or quantiles to prove this to yourself.
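
For example, a quick look at the group quantiles (a minimal sketch) shows how skew pulls the mean away from the median within some groups:

ames |>
  group_by(bldg_type) |>
  summarise(q25 = quantile(sale_price, 0.25),
            median = median(sale_price),
            mean = mean(sale_price),
            q75 = quantile(sale_price, 0.75))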

Linear Regression

Introduction

To help us understand Linear Regression, we’ll use one of the data sets from Anscombe’s Quartet.

anscombe |>
  ggplot(mapping = aes(x = x1, y = y1)) +
  geom_point(size = 2, color = 'darkblue')

Let’s assume that these data represent 11 empirical observations. Each observation has associated with it some attribute \(x\) (e.g., neighborhood average house size), and an outcome of particular interest \(y\) (e.g., neighborhood median household income). There seems to be a linear relationship between \(y\) and \(x\), i.e., as \(x\) increases, \(y\) tends to increase by a fixed amount. So, we ask ourselves:

Can we sufficiently model \(y\) using only a constant linear transformation of \(x\) (a line)?

Suppose we stick with this idea that our data can be modeled linearly, and let’s propose a line lm (or “linear model”) that “fits” this data.

ASSUMPTION 1: VARIABLE \(x\) IS LINEARLY CORRELATED WITH RESPONSE \(y\).

anscombe |>
  ggplot(mapping = aes(x = x1, y = y1)) +
  geom_point(size = 2) +
  geom_smooth(method = "lm", se = FALSE, color = 'darkblue')
## `geom_smooth()` using formula = 'y ~ x'

We see the line crosses \((6,6)\) and \((10,8)\). So, we can use these two points \((x_1, y_1)\), \((x_2, y_2)\) to get the slope \(m\), and derive the equation for the line using

\[ \begin{align} y - y_1 &= m\cdot(x - x_1) \newline y - y_1 &= \left(\frac{y_2 - y_1}{x_2 - x_1}\right) \cdot (x - x_1) \newline y - 6 &= \left(\frac{8 - 6}{10 - 6}\right) \cdot (x - 6) \newline y - 6 &= \frac{2}{4} x - \frac{2}{4} \cdot 6 \newline y &= \frac{1}{2} x + 3 \newline \end{align} \]

Interpretation: The slope \(m = 0.5\) means there is a change in \(y\) of 0.5 units for every one-unit change in \(x\). And, the y-intercept \(b = 3\) indicates that when \(x\) is zero, \(y\) is 3.

Now, before we move on to the next section, think about these questions:

  1. Is this line a good fit for our data?
  2. Is this line a good model for \(y\)?

Single Variable

Let’s investigate those questions as to whether this is a “good fit” by quantifying the differences between our model and the actual values in the data which inform the model.

# the line estimated above
lm_ <- \(x) (1 / 2) * x + 3

anscombe |>
  ggplot(mapping = aes(x = x1, y = y1)) +
  geom_point(size = 2) +
  geom_segment(mapping = aes(xend = x1, y = y1, yend = lm_(x1), 
                             color = 'error'), linewidth = 1) +
  geom_smooth(method = "lm", se = FALSE, color = 'red', linewidth = 0.5) + 
  scale_color_manual(values = c('darkblue')) +
  labs(color = '')
## `geom_smooth()` using formula = 'y ~ x'

On the figure, the dots are the points in our dataset and the red line is what we want to evaluate. Let’s define a few terms based on this image:

  • \(x_i\) : the \(i\)th value in the data for the explanatory variable \(x\)
  • \(y_i\) : the \(i\)th value in the data for the response variable \(y\)
  • \(\hat{y}_i\) : the predicted response value from the model for \(x_i\)
  • \(e_i\) : the error for the \(i\)th prediction, or the “residual” distance from each point \((x_i, y_i)\) to the line
  • \(\hat{\beta}\) : a vector of estimated parameters that define the model
    • \(\hat{\beta}_0 = b = \text{y-intercept}\)
    • \(\hat{\beta}_1 = m = \text{slope, for }x\)
    • We assume there exists a “true” linear relationship with parameters \(\beta\) (no hat)
  • So, \(\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1x_i\), and \(y_i = \hat{y}_i + e_i\).

“Ordinary Least Squares” (OLS) is an approach (used by ggplot, for instance) for fitting a model/line by finding values for \(\hat{\beta}\) that minimize the sum of squared errors. There are a few reasons it seeks to minimize this metric (e.g., it works nicely with derivatives and linear algebra, it creates a convex optimization problem, etc.), but it suffices to say that this is typically the metric of choice. That means OLS is taking each error, squaring it, and then trying to minimize the sum of those areas by changing \(\hat{\beta}\) values. What does that look like?

# try adjusting these
beta_0 <- 3
beta_1 <- 1/2

lm_ <- \(x) beta_1 * x + beta_0

anscombe |>
  ggplot() +
  geom_point(mapping = aes(x = x1, y = y1), size = 2) +
  geom_smooth(mapping = aes(x = x1, y = y1), method = "lm", se = FALSE, color = 'red', linewidth = 0.5) + 
  geom_rect(mapping = aes(xmin = x1, 
                          xmax = x1 + abs(y1 - lm_(x1)),
                          ymin = lm_(x1), 
                          ymax = y1), 
            fill = NA, color = 'darkblue') +
  labs(color = '')
## `geom_smooth()` using formula = 'y ~ x'

Think about how sensitive this scheme is to outliers, especially when they are far off to the left or right while also being far above or below the mean.

So, the goal of OLS is to minimize the Sum of Squared Errors (SSE) Cost Function:

\[ \text{SSE} = \sum _{i=1}^n \left(y_i - \hat{y}_i\right)^2 = \sum _{i=1}^n \left(y_i - (\hat{\beta}_0 + \hat{\beta}_1x_i)\right)^2 \]

(… where the \(x_i\) and \(y_i\) values are known, and \(\hat{\beta}\) values are unknown)

Notice, this “SSE” is exactly the same as the one we encountered in the ANOVA test, so it will make sense when we see the F-Test used in evaluating this model …

It is subtle, but notice here that we are treating all errors as equally detrimental to the model. That is, no predicted value should have a higher expected error than any other predicted value.

ASSUMPTION 2: ERRORS HAVE CONSTANT VARIANCE ACROSS ALL PREDICTIONS

Similarly, if the error of one observation were conditional or dependent on the error of another observation, then our cost function would be biased, since it doesn’t take this dependency into account. Note: we will see later that in the case of time series, these dependencies are captured in a slightly different cost function.

ASSUMPTION 3: OBSERVATIONS ARE INDEPENDENT AND UNCORRELATED

We can run linear regression models using the formula syntax (~), and lm:

model <- lm(y1 ~ x1, anscombe)
model$coefficients
## (Intercept)          x1 
##   3.0000909   0.5000909

In other words:

  • If \(x_1 = 0\), then the expected value for \(y_1\) is \(\hat{\beta}_0 = 3\).
  • For every unit increase of \(x_1\), on average, \(y_1\) will increase by about \(0.5\).
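
In the single-variable case, these OLS estimates also have a closed form; here is a quick sketch confirming that it matches the lm() output above:

# closed-form OLS estimates for a single predictor
x <- anscombe$x1
y <- anscombe$y1

beta_1_hat <- cov(x, y) / var(x)              # slope
beta_0_hat <- mean(y) - beta_1_hat * mean(x)  # intercept

c(intercept = beta_0_hat, slope = beta_1_hat)

# the sum of squared errors at these estimates
sum((y - (beta_0_hat + beta_1_hat * x))^2)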

Multiple Variables

It’s very rare that you’ll want to model a response variable using only a single explanatory variable. In the case where multiple variables are used to model the response, we want the relationship between all variables and the response to remain linear.

As an example, we’ll return to a subset of the Ames housing dataset (single family, one story homes built after 2000), and we’ll create a new column that reduces the overall_qual column into “great” or “not great”.

ames_basic <- ames |>
  filter(bldg_type == "OneFam",
         house_style == "One_Story",
         year_built >= 2000) |>
  mutate(great_qual = ifelse(overall_qual %in%
           c("Very_Excellent", "Excellent", "Very_Good"),
           1, 0))

ames_basic |>
  group_by(great_qual) |>
  summarize(num = n())
## # A tibble: 2 × 2
##   great_qual   num
##        <dbl> <int>
## 1          0   145
## 2          1   182

Suppose we’re interested in modeling the housing price using the great_qual column as well as first_flr_sf (first floor sq. ft.). We want to draw the same kinds of conclusions as above about how the price changes with either of these variables. So, just as both variables need to maintain a linear relationship with sale_price, we need that relationship to hold for each variable while the other is held constant, and vice versa.

# to plot facet averages
ames_grouped <-
  ames_basic |>
  group_by(great_qual) |>
  summarise(mean_price = mean(sale_price))

ames_basic |>
  ggplot() +
  facet_wrap(vars(great_qual), labeller = label_both) +
  geom_point(mapping = aes(x = first_flr_sf, y = sale_price)) +
  geom_hline(data = ames_grouped,
             mapping = aes(yintercept = mean_price),
             color = 'darkorange', linetype = 'dashed') +
  scale_y_continuous(labels = \(x) paste("$", x / 1000, "K")) +
  labs(title = "Sales Price by 1st Floor Sq Ft",
       subtitle = "Faceted by the overall quality of the house",
       x = "1st Floor Sq Ft", y = "Sales Price")

So here, we adjust our model by introducing a new variable which (in this case) is binary, with its own coefficient:

\[ \hat{y} = \hat{\beta}_0 + \hat{\beta}_1x_1 + \hat{\beta}_2x_2 \]

model <- lm(sale_price ~ first_flr_sf + great_qual, ames_basic)
model$coefficients
##  (Intercept) first_flr_sf   great_qual 
##    6940.7427     139.9341   59244.7551

We can interpret our coefficients thus:

  • (non-realistic) If there is no first floor sq. feet, and the quality is not good, the expected sales price is roughly $7K.
  • If the (binary) home quality remains as it is, and we add a single unit of first floor square footage, then we can expect the sale_price to increase by about $140.
  • If the first floor square footage remains as it is, and we go from not great to great quality, then we can expect the sale_price to increase by about $59K.
    • Note: The fact that this is not indicated by the average lines in the plot above tells us that there is a significant factor in the data that is not represented in the model …
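
To see these effects as predictions, here is a small sketch using predict() on two hypothetical homes (the 1,500 sq. ft. value is chosen only for illustration) that differ only in great_qual:

# two hypothetical homes, identical except for great_qual
new_homes <- data.frame(first_flr_sf = c(1500, 1500),
                        great_qual = c(0, 1))

predict(model, new_homes)

# the difference between the two predictions should match the great_qual coefficient (~ $59K)
diff(predict(model, new_homes))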

In general, we assume that there is a latent (and therefore unknown) linear relationship between random variables \(X_1, X_2, \cdots, X_k\) and a response random variable \(Y\) such that:\[ Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + \cdots + \beta_kX_k + \varepsilon \] where \(\varepsilon\) is the error that comes with nature itself, normally distributed about 0. In other words, \(Y\) changes linearly with the random variables \(X\), and the errors between the line formed and the actual values are consistently centered at zero for all values of \(Y\). That is, \(Y\) regresses toward a central mean value defined by a line.

Interaction Terms

Sometimes, we have a relationship between two variables such that the line formed between one and the response changes based on the other.

ames_basic |>
  ggplot(mapping = aes(x = year_remod_add, y = sale_price,
                       color = factor(great_qual))) +
  geom_jitter(height = 0, width = 0.1, shape = 'o', size = 3) +
  geom_smooth(method = 'lm', se = FALSE, linewidth = 0.5) +
  scale_y_continuous(labels = \(x) paste("$", x / 1000, "K")) +
  scale_color_brewer(palette = 'Dark2') +
  labs(title = "Sales Price by Year Remodeled",
       subtitle = "Colored by the overall quality of the house",
       x = "Year Remodeled (x-jittered)", y = "Sales Price",
       color = 'Great Quality?') +
  theme_hc()
## `geom_smooth()` using formula = 'y ~ x'

Here, we can see that when the quality of the house is not great, the Year Remodeled does not really affect the Sales Price much. But, when it is great, an increase in Year Remodeled has a slightly positive effect on Sales Price. We call this an interaction; it’s represented in the model by multiplying the two features together, and we use : to indicate this in the R model formula.

\[ \hat{y} = \hat{\beta}_0 + \hat{\beta}_1x_1 + \hat{\beta}_2x_2 + \hat{\beta}_3x_1x_2 \]

# include all variables and their interaction
model <- lm(sale_price ~ year_remod_add + great_qual 
            + year_remod_add:great_qual, ames_basic)

# to view more coefficients a bit easier
tidy(model) |>
  select(term, estimate) |>
  mutate(estimate = round(estimate, 1))
## # A tibble: 4 × 2
##   term                        estimate
##   <chr>                          <dbl>
## 1 (Intercept)                 -829942.
## 2 year_remod_add                  513.
## 3 great_qual                -10088892.
## 4 year_remod_add:great_qual      5088.

For example, if the quality is not great \(x_2 = 0\), then we have

\[ \hat{y} = \hat{\beta}_0 + \hat{\beta}_1x_1 + \hat{\beta}_2 \cdot 0 + \hat{\beta}_3x_1\cdot 0 \]

And we only need to worry about the fact that a house remodeled a year more recently will tend to have a sales price of about $500 more. On the other hand, if the quality is great \(x_2 = 1\), then we have

\[ \hat{y} = \hat{\beta}_0 + \hat{\beta}_1x_1 + \hat{\beta}_2 + \hat{\beta}_3x_1 = \hat{\beta}_0 + \hat{\beta}_2 + (\hat{\beta}_1 + \hat{\beta}_3)x_1 \]

And a house remodeled a year more recently (in this case) will tend to be about $513.3 + $5087.5 = $5600.8 more.
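
We can confirm this arithmetic with predict(); here is a small sketch using two hypothetical remodel years (2005 and 2006, chosen only for illustration):

# hypothetical homes differing by one remodel year, for each quality group
newdata <- expand.grid(year_remod_add = c(2005, 2006),
                       great_qual = c(0, 1))
preds <- predict(model, newdata)

cbind(newdata, predicted_price = preds)

# change per additional remodel year, by quality group
diff(preds[newdata$great_qual == 0])   # roughly beta_1            (~ $513)
diff(preds[newdata$great_qual == 1])   # roughly beta_1 + beta_3   (~ $5,601)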

Notice the coefficient for great_qual in this model. This goes against our intuition for what should be happening, so it is a good sign that we should try different models.

EXERCISE

Develop another model which uses a different combination of variables and include an interaction term (i.e., the variables used above can still be a part of this combination). Run the model, look at the coefficients, and interpret your results.

print("your code here")
## [1] "your code here"

(Multi)collinearity

Let’s consider the features first_flr_sf (\(x_1\)) and lot_area (\(x_2\)). Again, if we’re interested in modeling the sales price for a house, then these are both explanatory variables.

ames_basic |>
  filter(lot_area < 30000) |>
  ggplot(mapping = aes(x = first_flr_sf, y = lot_area)) +
  geom_point() +
  theme_classic()

There’s a linear relationship between these two, so we could write a sort of “sub-model” as:

\[ x_2 = m x_1 + b \]

Now, suppose we were to build a model that contained both of them:

\[ \begin{align*} \hat{y} &= \hat{\beta}_0 + \hat{\beta}_1x_1 + \hat{\beta}_2x_2 \\ &= \hat{\beta}_0 + \hat{\beta}_1x_1 + \hat{\beta}_2(mx_1 + b) \\ &= \hat{\beta}_0 + \hat{\beta}_1x_1 + \hat{\beta}_2mx_1 + \hat{\beta}_2b\\ \to &= (\hat{\beta}_0 + \hat{\beta}_2b) + (\hat{\beta}_1 + \hat{\beta}_2m)x_1 \end{align*} \]

So if we were to interpret \(\hat{\beta}_1\) (in the “original” model output) as the amount \(y\) increases with each unit change in \(x_1\), then we’d be mistaken. The truth of the matter (in this case) would be that on average, \(y\) increases by about \(\hat{\beta}_1 + \hat{\beta}_2m\) with each unit change in \(x_1\). This brings us to our next assumption for linear modeling:

ASSUMPTION 4: INDEPENDENT VARIABLES CANNOT BE LINEARLY CORRELATED
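
A quick way to check for this (a minimal sketch using the ames_basic subset from above) is to look at the correlation directly and watch how the first_flr_sf coefficient shifts once the correlated variable enters the model:

# correlation between the two explanatory variables
cor(ames_basic$first_flr_sf, ames_basic$lot_area)

# coefficient on first_flr_sf alone ...
coef(lm(sale_price ~ first_flr_sf, ames_basic))

# ... versus with the correlated lot_area included
coef(lm(sale_price ~ first_flr_sf + lot_area, ames_basic))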

Error

Let’s look at the relationship between Year Remodeled and Sales Price.

year_of_interest <- 2009  # change this value ...

model <- lm(sale_price ~ year_remod_add, ames_basic)

pred <- predict(model, 
        data.frame('year_remod_add' = year_of_interest))
  
# PLOT 1
ames_basic |>
  ggplot(mapping = aes(x = year_remod_add, y = sale_price)) +
  geom_jitter(height = 0, width = 0.1) +
  geom_smooth(method = 'lm', se = FALSE, color = 'lightblue') +
  geom_vline(mapping = aes(xintercept = year_of_interest,
                           color = paste('year = ', year_of_interest)),
             linewidth = 1) +
  scale_y_continuous(labels = \(x) paste("$", x / 1000, "K")) +
  scale_x_continuous(labels = \(x) round(x),
                     breaks = sort(unique(ames_basic$year_remod_add))) +
  scale_color_manual(values = 'darkorange') +
  labs(title = "Sales Price by Year Remodeled",
       x = "Year Remodeled", y = "Sales Price",
       color = '') +
  theme(legend.position="bottom")
## `geom_smooth()` using formula = 'y ~ x'

# PLOT 2
ames_basic |>
  filter(year_remod_add == year_of_interest) |>
  ggplot() +
  geom_histogram(mapping = aes(x = sale_price), color = 'white') +
  geom_vline(xintercept = pred, color = 'lightblue', linewidth = 1) +
  scale_x_continuous(labels = \(x) paste("$", x / 1000, "K")) + 
  labs(title = paste("Histogram of Actual Sales Prices at Year = ",
                     year_of_interest),
       subtitle = paste("Predicted Sales Price this year =", round(pred))) +
  theme_hc()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

For each value of year_remod_add, we see that the actual values are distributed roughly normally about a mean at the predicted value. Recall that an error is the difference between the predicted value and the actual value. Of course, we want this to be consistent across all predicted values. This leads us to another assumption of linear regression:

ASSUMPTION 5: ERRORS ARE NORMALLY DISTRIBUTED OVER THE PREDICTION LINE
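
One quick way to eyeball this assumption (a minimal sketch, using the year_remod_add model fit above) is to look at the distribution of the residuals; the lindia package loaded at the top also provides diagnostic plots like gg_diagnose() for a fuller picture.

# residuals for the sale_price ~ year_remod_add model
ames_basic |>
  mutate(residual = residuals(model)) |>
  ggplot() +
  geom_histogram(mapping = aes(x = residual), color = 'white', bins = 30) +
  labs(title = "Residuals for sale_price ~ year_remod_add",
       x = "Residual ($)", y = "Count")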