Data Dive 11

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.0     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(broom)
library(lindia)
library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## 
## The following object is masked from 'package:dplyr':
## 
##     recode
## 
## The following object is masked from 'package:purrr':
## 
##     some
txhousing
## # A tibble: 8,602 × 9
##    city     year month sales   volume median listings inventory  date
##    <chr>   <int> <int> <dbl>    <dbl>  <dbl>    <dbl>     <dbl> <dbl>
##  1 Abilene  2000     1    72  5380000  71400      701       6.3 2000 
##  2 Abilene  2000     2    98  6505000  58700      746       6.6 2000.
##  3 Abilene  2000     3   130  9285000  58100      784       6.8 2000.
##  4 Abilene  2000     4    98  9730000  68600      785       6.9 2000.
##  5 Abilene  2000     5   141 10590000  67300      794       6.8 2000.
##  6 Abilene  2000     6   156 13910000  66900      780       6.6 2000.
##  7 Abilene  2000     7   152 12635000  73500      742       6.2 2000.
##  8 Abilene  2000     8   131 10710000  75000      765       6.4 2001.
##  9 Abilene  2000     9   104  7615000  64500      771       6.5 2001.
## 10 Abilene  2000    10   101  7040000  59300      764       6.6 2001.
## # ℹ 8,592 more rows

Creating a Linear Model

We regress the sales column on the median and inventory columns. There could be an interaction between median and inventory, so we include that term as well.
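The call that fits the model is not shown in this report. A minimal sketch that would produce coefficients with the names printed here, assuming the object is called `model` (the name used in later chunks) and that the formula was `sales ~ median * inventory`:

```r
library(ggplot2)  # provides the txhousing data set

# Assumed formula: median * inventory expands to
# median + inventory + median:inventory
model <- lm(sales ~ median * inventory, data = txhousing)

# Named vector of estimated coefficients
model$coefficients
```

Rows with missing values in any of the three columns are dropped by `lm()` by default.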

##      (Intercept)           median        inventory median:inventory 
##    -6.545380e+02     1.144866e-02    -8.408883e-01    -2.560463e-04

Evaluating the Coefficients:

Our coefficient for the ‘median’ column is 0.0114, which indicates a positive but, per dollar, fairly small relationship between median price and sales. Because the model includes an interaction, this is the slope of median when inventory is 0. The sign seems rather odd, as one would expect sales to fall as price rises. Perhaps it is because the column records only the ‘median’ price and not the mean price.

One reason behind this unexpected relationship could be the ‘illusion of luxury’. A higher median price might reflect better home quality, and therefore customers are more likely to purchase such houses.

Plotting Sales and Median Price:

library(ggplot2)

# Overlay the fitted line for median price on the raw data.
# The intercept coefficient is named "(Intercept)", not "Intercept".
# This line ignores the interaction, so it corresponds to inventory = 0.
ggplot(txhousing, aes(x = median, y = sales)) + 
  geom_point() + 
  geom_abline(slope = coef(model)[["median"]], 
              intercept = coef(model)[["(Intercept)"]])
## Warning: Removed 617 rows containing missing values or values outside the scale range
## (`geom_point()`).

Looking at the above graph, we notice a ‘fanning’ effect between our 2 variables, indicating that the spread of sales grows as median price increases. We also see a dense, flat band in the scatter plot, indicating a very weak (almost 0) relationship between median price and sales for many observations. We can also explain the weakness of the relationship by the fact that ‘median’ price is the quantity collected in this data set. The median price is the ‘middle’ value of the home prices in a given city. This choice of data collection might also affect our relationship, as the median may not be representative of the true effect of cost on home sales.

The coefficient for inventory is -0.84, which indicates a negative but weak relationship between inventory and sales. It is important to note that this coefficient is the largest in magnitude of the 3 terms in our model, although the predictors are measured on different scales, so the magnitudes are not directly comparable. The negative sign is logically consistent, as inventory represents the time it would take to sell all remaining listings at the current pace of sales. Therefore higher sales would logically result in lower inventory and vice versa.

The coefficient of our interaction term median:inventory is -0.000256, which is both negative and extremely small. This indicates that the interaction between median sales price and inventory is rather weak, and we do not see the two columns affecting sales jointly to any large degree.
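To make the interaction concrete: in a model with a median:inventory term, the slope of sales with respect to median price is the median coefficient plus the interaction coefficient times inventory. The sketch below evaluates that slope at a few inventory levels using the coefficients printed earlier; the inventory values 2, 6 and 10 are arbitrary illustrative choices.

```r
# Coefficients copied from the printed model output
b_median      <- 1.144866e-02
b_interaction <- -2.560463e-04

# Illustrative inventory levels (in the units of the inventory column)
inventory_levels <- c(2, 6, 10)

# Slope of sales with respect to median price at each inventory level
slope_median <- b_median + b_interaction * inventory_levels

data.frame(inventory = inventory_levels, slope_median = slope_median)
```

The slope stays positive at all three levels and shrinks as inventory grows, which is what a negative interaction coefficient implies.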

Evaluating our Model:

Using the Variance Inflation Factor:

vif(model)
## there are higher-order terms (interactions) in this model
## consider setting type = 'predictor'; see ?vif
##           median        inventory median:inventory 
##         2.180234         8.443936         8.748548

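The VIF for median is 2.18, which is low, but the VIFs for inventory (8.44) and for the interaction term (8.75) are well above 3, though still under the commonly used cutoff of 10. Some inflation is expected here: the product median:inventory is correlated with its own components, and the message printed by vif() warns that the default VIFs are harder to interpret when interactions are present. Therefore, our model shows moderate collinearity, concentrated in the inventory and interaction terms, and the standard errors of those two coefficients should be read with caution.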
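Because a product term is usually correlated with its components, one common check is to center the predictors before forming the interaction and then recompute the VIFs. The sketch below does this under the assumption that the model formula is `sales ~ median * inventory`; the `_c` column names and the `centered` data frame are illustrative, not part of the original analysis.

```r
library(ggplot2)  # provides the txhousing data set
library(car)      # vif()

# Center each predictor on its mean (illustrative column names)
centered <- txhousing
centered$median_c    <- centered$median - mean(centered$median, na.rm = TRUE)
centered$inventory_c <- centered$inventory - mean(centered$inventory, na.rm = TRUE)

# Refit with centered predictors; the interaction is now a product of deviations
model_centered <- lm(sales ~ median_c * inventory_c, data = centered)

# VIFs for the centered model, to compare with the uncentered values
vif_centered <- vif(model_centered)
vif_centered
```

Centering changes the meaning of the main-effect coefficients (they become slopes at the mean of the other predictor) but not the fitted values.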

Residual Analysis:

gg_resfitted(model) + geom_smooth(se=FALSE) + ggtitle(" Sales vs Median Price: Residuals vs Fitted Values")
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

From the above graph, we observe that the variance of our residuals goes up very sharply once the fitted values cross 500. We also see that this variance decreases after the value 1000. However, if we look at the blue smoothed line, apart from a small bump around 700, it is relatively linear. In an ideal situation the residuals would show no trend against the fitted values; here we see a very slight negative relationship. Together with the changing spread, this indicates that at least one of the assumptions of our linear model is violated slightly.
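The visual impression of changing residual variance can be complemented with a formal check. The sketch below uses `ncvTest()` from the car package, a score test for non-constant error variance, and again assumes the model was fit as `sales ~ median * inventory`.

```r
library(ggplot2)  # provides the txhousing data set
library(car)      # ncvTest()

# Assumed model formula (the original fitting call is not shown)
model <- lm(sales ~ median * inventory, data = txhousing)

# Null hypothesis: the error variance is constant across fitted values
bp <- ncvTest(model)
bp
```

A small p-value in `bp$p` would be evidence against constant variance, consistent with the fanning seen in the plot.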