library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.3 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.0 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(broom)
library(lindia)
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
##
## The following object is masked from 'package:dplyr':
##
## recode
##
## The following object is masked from 'package:purrr':
##
## some
txhousing
## # A tibble: 8,602 × 9
## city year month sales volume median listings inventory date
## <chr> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Abilene 2000 1 72 5380000 71400 701 6.3 2000
## 2 Abilene 2000 2 98 6505000 58700 746 6.6 2000.
## 3 Abilene 2000 3 130 9285000 58100 784 6.8 2000.
## 4 Abilene 2000 4 98 9730000 68600 785 6.9 2000.
## 5 Abilene 2000 5 141 10590000 67300 794 6.8 2000.
## 6 Abilene 2000 6 156 13910000 66900 780 6.6 2000.
## 7 Abilene 2000 7 152 12635000 73500 742 6.2 2000.
## 8 Abilene 2000 8 131 10710000 75000 765 6.4 2001.
## 9 Abilene 2000 9 104 7615000 64500 771 6.5 2001.
## 10 Abilene 2000 10 101 7040000 59300 764 6.6 2001.
## # ℹ 8,592 more rows
We regress the sales column on the median and inventory columns. There could be an interaction between median and inventory, so we include that term as well.
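The fitting call itself is not shown in this write-up. A minimal sketch of a call that would produce coefficients with these names, assuming an ordinary least squares fit stored as `model` (the name the later code uses):

```r
library(ggplot2)  # provides the txhousing data set

# Assumed call: `median * inventory` expands to both main effects
# plus the median:inventory interaction term.
model <- lm(sales ~ median * inventory, data = txhousing)
coef(model)
```

Rows with missing values in any of the three columns are dropped by `lm()` by default.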
## (Intercept) median inventory median:inventory
## -6.545380e+02 1.144866e-02 -8.408883e-01 -2.560463e-04
Our coefficient for the ‘median’ column is 0.0114, which indicates a positive but fairly weak relationship between median price and sales. Because the model includes an interaction term, this is the slope when inventory is 0; it corresponds to roughly 11 additional sales per $1,000 increase in median price. This seems rather odd, as one would expect higher prices to be associated with fewer sales. Perhaps it is because the column records only the ‘median’ price and not the mean price.
One possible reason for this unexpected relationship is an ‘illusion of luxury’. A higher median price might reflect better home quality, so customers are more likely to purchase such houses.
library(ggplot2)
ggplot(txhousing, aes(x = median, y = sales)) +
  geom_point() +
  # The intercept is named "(Intercept)"; indexing with "Intercept" returns NA,
  # which made geom_abline() drop the line. Note that this line is the fitted
  # relationship at inventory = 0 only.
  geom_abline(slope = coef(model)["median"],
              intercept = coef(model)["(Intercept)"])
## Warning: Removed 617 rows containing missing values or values outside the scale range
## (`geom_point()`).
Looking at the above graph, we notice a ‘fanning’ effect between our two variables: the spread of sales grows as median price increases, so the variance is not constant. We also see a dense, flat band at the bottom of the scatter plot, indicating a very weak (almost zero) relationship between median price and sales for many observations. The weakness of the relationship may also stem from the fact that the ‘median’ price is the quantity recorded in this data set. The median price is the ‘middle’ value of the home sale prices in a given city and month. This choice might also affect the relationship, since a single median does not capture the full distribution of prices behind the sales.
The coefficient for inventory is -0.84, which indicates a negative but weak relationship between inventory and sales. This coefficient has the largest magnitude of the three regressors in our model, although the regressors are measured in different units, so the magnitudes are not directly comparable. The negative sign is logically consistent, as inventory represents the time it would take to sell all current listings at the current pace of sales. Therefore higher sales would logically go with lower inventory, and vice versa.
The coefficient of our interaction term median:inventory is -0.000256, which is both negative and extremely small in magnitude. This indicates that the interaction between median sale price and inventory is rather small: the effect of one variable on sales changes only slightly with the level of the other.
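With an interaction in the model, the slope of sales with respect to median price is not a single number: it equals the median coefficient plus the interaction coefficient times inventory. A small sketch using the coefficients from the printed model output; the inventory levels 0, 6 and 12 and the helper `median_slope()` are illustrative choices, not part of the original analysis:

```r
# Coefficients copied from the printed model output
b_median      <- 1.144866e-02
b_interaction <- -2.560463e-04

# Slope of sales with respect to median price at a given inventory level
median_slope <- function(inventory) b_median + b_interaction * inventory

inventory_levels <- c(0, 6, 12)  # illustrative months of inventory
slopes <- median_slope(inventory_levels)

# Express each slope as additional sales per $1,000 of median price
data.frame(inventory = inventory_levels, sales_per_1000 = slopes * 1000)
```

The slope shrinks as inventory rises, but only slightly, which is consistent with reading the interaction as small.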
vif(model)
## there are higher-order terms (interactions) in this model
## consider setting type = 'predictor'; see ?vif
## median inventory median:inventory
## 2.180234 8.443936 8.748548
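The VIF for median is about 2.2, but the values for inventory (8.4) and the interaction term (8.7) are well above 3. Much of this inflation is expected, because an interaction term is by construction correlated with its component variables, which is what the message from vif() points out. All three values remain below the commonly used threshold of 10, so we do not treat this as a severe case of collinearity. Still, the inventory and interaction coefficients are estimated less precisely than the median coefficient.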
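One common way to reduce the collinearity that a product term introduces is to mean-center the predictors before forming the interaction. The sketch below is an assumption about how one might check this, not something the analysis did; `tx_centered`, `model_centered` and `vif_centered` are hypothetical names.

```r
library(ggplot2)  # txhousing
library(car)      # vif()

# Keep complete rows so the means are computed on the rows used in the fit
tx_centered <- na.omit(txhousing[, c("sales", "median", "inventory")])
tx_centered$median_c    <- tx_centered$median - mean(tx_centered$median)
tx_centered$inventory_c <- tx_centered$inventory - mean(tx_centered$inventory)

# Same model structure, but built from the centered predictors
model_centered <- lm(sales ~ median_c * inventory_c, data = tx_centered)
vif_centered <- vif(model_centered)
vif_centered
```

Centering changes the interpretation of the main effects (each becomes the slope at the mean of the other predictor) but leaves the interaction coefficient and the fitted values unchanged.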
gg_resfitted(model) + geom_smooth(se = FALSE) + ggtitle("Sales vs Median Price: Residuals vs Fitted Values")
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
From the above graph we observe that the variance of our residuals goes up very sharply once the fitted values cross 500, and that it shrinks again after about 1000. However, if we follow the blue smoothed line, apart from a small bump around 700, it is relatively linear. Ideally the residuals would show no systematic trend against the fitted values and would have constant spread; here we see a very slight negative trend. This indicates that the linearity assumption of our linear model is broken very slightly, while the changing spread suggests the constant-variance assumption does not hold either.
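The visual impression of non-constant variance can be checked with a formal test. A minimal sketch using `ncvTest()` from the car package, a score test for non-constant error variance; the model formula is an assumption, since the fitting call is not shown in this write-up:

```r
library(ggplot2)  # txhousing
library(car)      # ncvTest()

# Assumed model: main effects of median and inventory plus their interaction
model <- lm(sales ~ median * inventory, data = txhousing)

# Null hypothesis: the error variance is constant across fitted values
ncv <- ncvTest(model)
ncv
```

A small p-value would be evidence against constant variance, in which case a transformation of sales (for example a log) or robust standard errors would be worth considering.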