Data Dive 9

Modifying our existing Linear Regression model

Original Model

We begin this exploration by presenting our original model, regressing sales on the number of listings:

txhousing
## # A tibble: 8,602 × 9
##    city     year month sales   volume median listings inventory  date
##    <chr>   <int> <int> <dbl>    <dbl>  <dbl>    <dbl>     <dbl> <dbl>
##  1 Abilene  2000     1    72  5380000  71400      701       6.3 2000 
##  2 Abilene  2000     2    98  6505000  58700      746       6.6 2000.
##  3 Abilene  2000     3   130  9285000  58100      784       6.8 2000.
##  4 Abilene  2000     4    98  9730000  68600      785       6.9 2000.
##  5 Abilene  2000     5   141 10590000  67300      794       6.8 2000.
##  6 Abilene  2000     6   156 13910000  66900      780       6.6 2000.
##  7 Abilene  2000     7   152 12635000  73500      742       6.2 2000.
##  8 Abilene  2000     8   131 10710000  75000      765       6.4 2001.
##  9 Abilene  2000     9   104  7615000  64500      771       6.5 2001.
## 10 Abilene  2000    10   101  7040000  59300      764       6.6 2001.
## # ℹ 8,592 more rows
single_coeff = lm(sales ~ listings, txhousing)
single_coeff$coefficients
## (Intercept)    listings 
##  22.6613744   0.1792617
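
If we also wanted fit statistics for this single-regressor baseline (standard errors, R-squared), summary() would report them; a minimal sketch, output omitted:

summary(single_coeff)            # full coefficient table and R-squared
summary(single_coeff)$r.squared  # just the R-squared value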

Adding Month as an additional categorical variable in our Regression model:

We add the month column as a secondary variable in our regression model to explore how the month of the year affects house sales in Texas. The addition of a categorical column is intentional, as it limits the risk of multicollinearity with the other regressors.

txhousing$month = factor(txhousing$month)

two_coeff = lm(sales ~ listings + month, txhousing)

two_coeff$coefficients
##  (Intercept)     listings       month2       month3       month4       month5 
## -145.9785140    0.1789908   68.2797210  194.6444567  199.1662621  266.5441988 
##       month6       month7       month8       month9      month10      month11 
##  298.0404793  263.2273963  246.1471041  127.3293735  121.1373399   82.4532771 
##      month12 
##  164.3196192

Evaluation of adding the ‘month’ variable:

Since month is a categorical variable, we have a separate coefficient for each month of the year (relative to January, which serves as the baseline category) and can interpret how the months affect the total number of sales. It appears that the month of February has the lowest marginal effect on sales, with a coefficient of about 68.28, while the month of June has the highest marginal effect, with a coefficient of about 298.04.
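
To make the baseline interpretation concrete, here is a small hedged example (the value of 1,000 listings is hypothetical) comparing predicted sales for January and February at the same number of listings:

new_obs = data.frame(listings = 1000, month = factor(c(1, 2), levels = 1:12))  # January vs. February
predict(two_coeff, newdata = new_obs)  # the difference between the two predictions equals the month2 coefficient (~68.3)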

Adding the “month” variable to our regression seems useful from a business perspective, as it can help us predict the time of year in which we can expect more house sales; however, since there are so many categories, it can make our analysis cluttered.

Adding Inventory as another regressor

three_coeff = lm(sales ~ listings + month + inventory, txhousing)

three_coeff$coefficients
## (Intercept)    listings      month2      month3      month4      month5 
##  40.8588219   0.1768961  68.3729467 201.6322500 208.6062510 283.9758915 
##      month6      month7      month8      month9     month10     month11 
## 313.7502007 284.1101260 265.9234046 142.8997625 135.8248400  92.3836560 
##     month12   inventory 
## 161.7865464 -26.5408197

We observe that the coefficient for ‘inventory’ is about -26.54, which indicates a strongly negative relationship between inventory and sales. This makes logical sense, as inventory refers to the time it would take to sell all listings at the current rate of sales. If the time to sell all listings is lower, that would logically indicate a higher number of sales. Therefore this value is consistent with our understanding of the data set.
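
As a quick numeric check of that interpretation (values taken from the output above), holding listings and month fixed:

coef(three_coeff)["inventory"]      # about -26.5 sales per additional month of inventory
2 * coef(three_coeff)["inventory"]  # about -53.1 sales for a two-month increase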

Evaluating our Models:

Test 1: Residuals vs. Fitted Values

The first assumption of linear regression is that there is no correlation between the residuals and the fitted \(\hat{y}\) values.
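
The diagnostic helpers used below (gg_resfitted(), gg_resX(), gg_reshist(), gg_qqplot(), gg_cooksd()) appear to come from the lindia package; a minimal setup sketch, assuming lindia and ggplot2 are installed:

library(ggplot2)  # provides geom_smooth() and the txhousing data
library(lindia)   # provides the gg_* regression diagnostic plots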

gg_resfitted(three_coeff) + geom_smooth(se=FALSE) + ggtitle("3-Coefficient Model: Residuals vs Fitted Values")
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

From the above graph we observe that the variance in the residuals changes as our fitted values increase. We observe this phenomenon most strongly in the 1000 to 4000 fitted-value range, where the residuals trend upward. However, after the fitted values exceed 4000, the residuals continue to grow in magnitude, but in the negative direction. This is very clearly indicated by the blue best-fit line.

From this graph, we can conclude that our 3-coefficient model violates the assumption that the residuals are uncorrelated with the fitted values.

Let us see if this is the case with our 2-Coefficient model:

gg_resfitted(two_coeff) + geom_smooth(se=FALSE) + ggtitle("2-Coefficient Model: Residuals vs Fitted Values")
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

How about our One Coefficient model?

gg_resfitted(single_coeff) + geom_smooth(se=FALSE) + ggtitle("1-Coefficient Model: Residuals vs Fitted Values")
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

Evaluation:

Generally speaking, it appears that adding regressors/variables to our initial model does not change the non-constant variance in the residuals. This likely implies that all three models **violate our 1st assumption**.

Possible Solution:

Given this issue with the variance of our residual terms, it might be prudent to take the logarithm of our numeric columns to see if that changes the residual pattern.
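
A minimal sketch of that idea, assuming we log-transform sales and listings and first drop rows where either is zero or missing (the names log_data and log_coeff are hypothetical):

log_data = subset(txhousing, sales > 0 & listings > 0)                   # drop zero/missing values before taking logs
log_coeff = lm(log(sales) ~ log(listings) + month + inventory, log_data)
gg_resfitted(log_coeff) + geom_smooth(se = FALSE)                        # re-check the residual pattern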

Test 2: Residuals vs. X Values

Our 2nd linear regression assumption is homoskedasticity, i.e., the variance of the residuals is constant and does not depend on our x values.

We test this by plotting the residuals against each of the x variables in our 3-coefficient model.

plots <- gg_resX(three_coeff, plot.all = FALSE)
# for each variable of interest ...
plots$listings

From the above chart we can observe that our assumption of homoskedasticity is violated in the 1000 to 2000 listings range, where we observe a general positive correlation between our residual values and the number of houses listed. Outside of this range, we observe less correlation between the two terms and less violation of the assumption.

Overall, this means that our model is largely in alignment with our assumption.
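
As a complementary, more formal check of this visual impression (not part of the original analysis), one could run a Breusch-Pagan test, assuming the lmtest package is available:

library(lmtest)
bptest(three_coeff)  # a small p-value would indicate heteroskedasticity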

plots$month

From the above chart, we observe that there is little correlation between our residual values and the months of the year. No month category exhibits noticeably stronger variance in residuals than the others. Therefore our homoskedasticity assumption is NOT violated in this case.

plots$inventory

In the case of our inventory variable, we do observe non-constant variance in the residuals, especially in the 0 to 10 range of inventory. However, this variance shrinks as the inventory value increases. As a result, we do observe a violation of our homoskedasticity assumption in this case, at least in the 0 to 10 range.
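
One common remedy for non-constant variance, not used in the analysis above, is to keep the coefficient estimates but report heteroskedasticity-robust standard errors, e.g. with the sandwich and lmtest packages:

library(sandwich)
library(lmtest)
coeftest(three_coeff, vcov = vcovHC(three_coeff, type = "HC1"))  # HC1 robust standard errors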

Test 3: Residual Histograms

This graph is used to evaluate the assumption that our residuals are normally distributed, so we plot a density histogram of our residual values to observe their distribution.

gg_reshist(three_coeff)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Evaluation:

When we look at this histogram of residuals, we observe a distribution that is roughly normal. However, the tails are not symmetric: there is a slightly shorter tail on the left and a longer one on the right, while the highest density of residuals falls immediately around 0, with somewhat more negative residuals overall. This is not overly severe, but it indicates we may need more or different explanatory variables.

Test 4: QQ Plots

QQ plots are another way of evaluating whether our residual distribution differs from the theoretical normal distribution. Ideally the red and black lines should overlap with each other, indicating that our residual distribution roughly matches a theoretical normal distribution.

gg_qqplot(three_coeff)

From the above graph, we observe that our normality assumption is partially violated, as the two lines do not entirely overlap outside of the [-2, 2] quantile range. This is symptomatic of a truncated normal distribution or a thin-tailed distribution. This is not a serious issue, as the truncation is hardly noticeable: we would likely not notice if our normal distribution had no observations outside of [-2, 2], and this is unlikely to affect our model.

Some remedial methods could be to 1) implement a Heckman selection model, or 2) fit a generalized linear model such as a logit/probit.
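
Since sales is a count-like variable rather than a binary outcome, a logit/probit model would not apply directly; as a hedged alternative sketch (our assumption, not part of the original analysis), a quasi-Poisson GLM could be fit instead:

glm_sales = glm(sales ~ listings + month + inventory, family = quasipoisson, data = txhousing)
summary(glm_sales)  # coefficients are on the log scale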

Test 5: Cook’s D by Observation

Cook’s distance is a method of testing our ‘no strong outliers’ assumption. Cook’s distance measures the effect of deleting a given observation. Data points with large residuals (outliers) and/or high leverage may distort the outcome and accuracy of a regression.

gg_cooksd(three_coeff, threshold = 'baseR')
## Cut-off for outliers too big to be shown in Cook's Distance plot.

From the above graph, we don’t see any observations whose removal strongly affects our model, implying that under the baseR threshold there are no strong outliers in our model, and that we meet the required assumption.
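
As a rough cross-check outside of lindia, we could also compute Cook’s distance directly in base R and count observations above the common 4/n rule of thumb (this cutoff is our assumption and need not match the ‘baseR’ threshold used above):

d = cooks.distance(three_coeff)
sum(d > 4 / length(d), na.rm = TRUE)  # number of observations flagged by the 4/n heuristic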

Conclusion:

In summary, our 3-coefficient model, which includes listings, month, and inventory as our independent regressors, has the following coefficients:

three_coeff$coefficients
## (Intercept)    listings      month2      month3      month4      month5 
##  40.8588219   0.1768961  68.3729467 201.6322500 208.6062510 283.9758915 
##      month6      month7      month8      month9     month10     month11 
## 313.7502007 284.1101260 265.9234046 142.8997625 135.8248400  92.3836560 
##     month12   inventory 
## 161.7865464 -26.5408197

In terms of its alignment with the linear regression assumptions: 1) the assumption of no correlation between the residuals and the \(\hat{y}\) values is violated. We can try to fix this by taking the logarithm of our predicted values.

We can possibly improve our model by including an interaction term between listings and inventory, or by modifying the listings column by taking its log.
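
A hedged sketch of those two possible refinements (model names are hypothetical, and neither model has been evaluated here):

interaction_coeff = lm(sales ~ listings * inventory + month, txhousing)  # interaction between listings and inventory
log_listings_coeff = lm(sales ~ log(listings) + month + inventory, subset(txhousing, listings > 0))  # log of listings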