We begin this exporation by presenting our original model - Regressing Sales on Number of Listings:
txhousing
## # A tibble: 8,602 × 9
## city year month sales volume median listings inventory date
## <chr> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Abilene 2000 1 72 5380000 71400 701 6.3 2000
## 2 Abilene 2000 2 98 6505000 58700 746 6.6 2000.
## 3 Abilene 2000 3 130 9285000 58100 784 6.8 2000.
## 4 Abilene 2000 4 98 9730000 68600 785 6.9 2000.
## 5 Abilene 2000 5 141 10590000 67300 794 6.8 2000.
## 6 Abilene 2000 6 156 13910000 66900 780 6.6 2000.
## 7 Abilene 2000 7 152 12635000 73500 742 6.2 2000.
## 8 Abilene 2000 8 131 10710000 75000 765 6.4 2001.
## 9 Abilene 2000 9 104 7615000 64500 771 6.5 2001.
## 10 Abilene 2000 10 101 7040000 59300 764 6.6 2001.
## # ℹ 8,592 more rows
single_coeff = lm(sales~ listings, txhousing)
single_coeff$coefficients
## (Intercept) listings
## 22.6613744 0.1792617
We add the month column as a Secondary variable in our regression model to explore how the month of a year affects the sales of Houses in Texas. The adddition of a categorical column is intentional as it ensures that there is no risk of multi-collinearity between the other regressors.
txhousing$month = factor(txhousing$month)
two_coeff = lm(sales~ listings + month, txhousing)
two_coeff$coefficients
## (Intercept) listings month2 month3 month4 month5
## -145.9785140 0.1789908 68.2797210 194.6444567 199.1662621 266.5441988
## month6 month7 month8 month9 month10 month11
## 298.0404793 263.2273963 246.1471041 127.3293735 121.1373399 82.4532771
## month12
## 164.3196192
Since month is a categorical variable- we have different coefficients for different months of a year and can interpret how the months affect the total number sales. It appears that the month of February has the lowest marginal effect on sales amount with a coefficient of 68.66 while the month of June has the highest marginal effect - with a coefficient of 299.73.
Adding the “month” variable to our regression seems useful from a business perspective as it can help us predict the time of year in which we can expect more house sales, however since there are so many categories, it can make our analysis cluttered.
three_coeff = lm(sales~ listings + month + inventory, txhousing)
three_coeff$coefficients
## (Intercept) listings month2 month3 month4 month5
## 40.8588219 0.1768961 68.3729467 201.6322500 208.6062510 283.9758915
## month6 month7 month8 month9 month10 month11
## 313.7502007 284.1101260 265.9234046 142.8997625 135.8248400 92.3836560
## month12 inventory
## 161.7865464 -26.5408197
We observe that the coefficient for ‘inventory’ is - 26.53, which indicates a strongly negative relationship between the inventory and sales. This makes logical sense as inventory refers to the time taken to sell all listings at the current rate of sales. If the time to sell all listing is lower, then that would logically indicate an increase in the number of sales. Therefore this value is consistent with our understanding of the data set.
The first assumption of Linear Regression is that there is No correlation between Residuals and \(\hat{y}\) values
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
From the above graph we observe that the Variance in Residuals varies as our Fitted Values increase. We observe this phenomenon more strongly between the 1000 - 4000 fitted-value range, where the the value of our residuals increases. However after the fitted values exceed 4000, the residuals begin to increase, but in a negative direction. This is very clearly indicated by the blue best-fit line.
From this graph, we can observe that our 3-coefficient model violates the assumption that the variance of the residuals is not uncorrelated with the independant variable x.
Let us see if this is the case with our 2-Coefficient model:
gg_resfitted(two_coeff) + geom_smooth(se=FALSE) + ggtitle("2-Coefficient Model: Residuals vs Fitted Values")
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
How about our One Coefficient model?
gg_resfitted(single_coeff) + geom_smooth(se=FALSE) + ggtitle("1-Coefficient Model: Residuals vs Fitted Values")
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
Generally speaking, it appears as our increase in regressors/variables to our initial model does not affect the variance in residuals. This likely implies that our model. **violates our 1st assumption*.
Given this issue of variance in our residual terms, it might be prudent to take the logarithm value of our numeric columns to see if that affects the residuals.
Our 2nd Linear Regression assumption is of Homoskedasticity- i.e The variance of residuals is uncorrelated with our x values.
We test this by plotting the residuals against the different x-variables in our 3-coefficient model
plots <- gg_resX(three_coeff, plot.all = FALSE)
# for each variable of interest ...
plots$listings
From the above chart we can observe that our assumption of Homosekdasticity is violated between the 1000 - 2000 listings range, where we observe a general positive. correlation between our residual values and houses listed. Outside of this range, we observe less correlation between the 2 terms and less violation of the assumption.
This generally means that our model is generally in alignment wiht our assumption.
plots$month
From the above chart, we observe that the there is little correlation between our residual values and the months of the year. No month category exhibits stronger varaince in residuals than the other. Therefore our Homoskedacticity assumption is NOT violated in this case.
plots$inventory
In the case of our inventory variable, we do observe non-constant variance in residuals, especially in the 0 -10 range of inventory. However this variance reduces as our inventory value increases. As a result we do observe a violation of our Homoskedasticity assumption in this case, at least in the range of 0-10.
This graph is usd to evaluate the assumption that our residuals are Normally Distributed and so we plot a density historgram of our residual values to observe their distribution.
gg_reshist(three_coeff)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
### Evaluation
When we look at this histogram of residuals, we observe a distribution that is roughly normal. However the tails are not equally distributed and the highest density of residuals falls in the immediate range of 0. There is a slightly shorter tail on the left, and a longer one on the right, while we have a higher density of negative residuals. This is not overly severe, but it indicates we may need more/different explanatory variables.
QQ Plots are another way of evaluating whether our Residual Distribution varies form the theoretical Normal Distribution. Ideally the Red and Black lines should overlap with each other indicating that our residual roughly matches a thoeretical normal distribution.
gg_qqplot(three_coeff)
From the above graph, we observe that our Normal Distribution Assumption is partially Violated as the 2 lines do not entirely overlap outside of the [-2, 2] quantiles. This is symptomatic of a Truncated Normal Distribution or a Thin tailed distribution. This is not a serious issue as Truncation is hardly noticeable. We would likely not notice if our normal distribution did not have any observations outside of [-2, 2], and this is unlikely to affect our model.
Some remedial methods could be to - 1) Implement a Heckman Model, or 2) Fit a Generalized Linear Model like logit/probit
The Cook’s Distance is a method of testing out No strong outliers assumption. Cook’s distance measures the effect of deleting a given observation. Data points with large residuals (outliers) and/or high leverage may distort the outcome and accuracy of a regression.
gg_cooksd(three_coeff, threshold = 'baseR')
## Cut-off for outliers too big to be shown in Cook's Distance plot.
From the aboce graph, we don’t view any Observation Numbers who’s
removal strongly affects our model, implying that under the
baseR threshold, there are No Strong Outliers in our
model. Implying that we meet the required assumption for our model.
In summary, Our 3-Coefficient model that includes (lisitngs, month and inventory) as our independant regressors has the following coefficients:
three_coeff$coefficients
## (Intercept) listings month2 month3 month4 month5
## 40.8588219 0.1768961 68.3729467 201.6322500 208.6062510 283.9758915
## month6 month7 month8 month9 month10 month11
## 313.7502007 284.1101260 265.9234046 142.8997625 135.8248400 92.3836560
## month12 inventory
## 161.7865464 -26.5408197
In terms of it alignment with the Linear Regression Assumptions: - 1) The assumption of no correlation between Residuals and \(\hat{y}\) values is violated. We can try to fix this by taking the Logarithm of our predicted values
Wee can improve our model by possibly, including interaction terms between listings and inventories or by modifying the listings column by taking its log.