According to Caffo (2015), ‘adjustment’ is about understanding how adding variables to a model can change the estimated effects of the variables already in it. A real-world example is an analysis of US avocado sales volume, using a linear model that predicts total sales volume from the sales volumes of Small Hass Avocados (4046), Large Hass Avocados (4225) and Extra Large Hass Avocados (4770).
Firstly, let’s take a brief look at the data. The data contains monthly US sales volumes of Hass avocados from 2015 to 2018. It includes the average price per avocado, total volume, volume by type (Small-4046, Large-4225, XLarge-4770), total bags sold and bags sold by type.
Year | Month | region | AvgPrice | TotalVolume | X4046 | X4225 | X4770 | TotalBags | SmallBags | LargeBags | XLBags |
---|---|---|---|---|---|---|---|---|---|---|---|
2015 | 1 | TotalUS | 1.0075 | 117901590 | 47927556 | 48195457 | 3309990 | 18468587 | 15220261 | 3197297 | 51028.62 |
2015 | 2 | TotalUS | 0.9725 | 134742045 | 57496409 | 53475599 | 3902248 | 19867790 | 16671400 | 3110311 | 86077.91 |
2015 | 3 | TotalUS | 1.0160 | 155157138 | 65668005 | 59541994 | 4653819 | 25293320 | 21763445 | 3386149 | 143725.23 |
2015 | 4 | TotalUS | 1.0450 | 127532485 | 56817403 | 45882786 | 3586287 | 21246008 | 18158197 | 3000196 | 87614.77 |
A good initial way to quickly understand the relationships between the variables is to call pairs() on the data. The resulting scatterplot matrix shows the relationship between each pair of variables. It also helps with understanding the relationships between the response variable (Total Volume) and the explanatory variables, as well as any collinear relationships among the explanatory variables.
Note that pairs() only works on numeric variables, so we will only take a subset of the data.
pairs(avocados)
You can see that there are some linear relationships between Total Volume and the different types of Avocados.
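As a minimal sketch of the numeric subset mentioned above, the snippet below rebuilds the four sample rows from the table and drops the non-numeric column before calling pairs(). The names `avocados_sample`, `numeric_cols` and `avocados_numeric` are illustrative assumptions, and four rows are only a stand-in for the full dataset.

```r
# Four sample rows copied from the table above (a stand-in for the full data).
avocados_sample <- data.frame(
  Year        = c(2015, 2015, 2015, 2015),
  Month       = c(1, 2, 3, 4),
  region      = c("TotalUS", "TotalUS", "TotalUS", "TotalUS"),
  AvgPrice    = c(1.0075, 0.9725, 1.0160, 1.0450),
  TotalVolume = c(117901590, 134742045, 155157138, 127532485),
  X4046       = c(47927556, 57496409, 65668005, 56817403),
  X4225       = c(48195457, 53475599, 59541994, 45882786),
  X4770       = c(3309990, 3902248, 4653819, 3586287),
  stringsAsFactors = FALSE
)

# Flag the numeric columns; `region` is character, so it is dropped.
numeric_cols <- vapply(avocados_sample, is.numeric, logical(1))
avocados_numeric <- avocados_sample[, numeric_cols]

# Scatterplot matrix of total volume against the three avocado types.
pairs(avocados_numeric[, c("TotalVolume", "X4046", "X4225", "X4770")])
```

Selecting columns with is.numeric() rather than by position keeps the subset correct if the column order changes.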
Let’s take a look at the specific relationship between Total Volume and Small Avocados (4046).
Correlation
## [1] 0.829823
You can see there is a strong linear relationship between total sales volume and the sales volume of small avocados.
Now let’s take a look at the specific relationship between Total Volume and Large Avocados (4225).
Correlation
## [1] 0.7534093
There is also a strong linear relationship between total sales volume and the sales volume of large avocados.
Finally, let’s look at the specific relationship between Total Volume and Extra Large Avocados (4770).
Correlation
## [1] 0.4166469
There is a moderate linear relationship between total sales volume and the sales volume of extra large avocados.
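The three correlations above are the kind of value cor() returns. Here is a hedged sketch that computes all three in one pass, using only the four sample rows from the table. Because it uses four rows rather than the full dataset, the numbers it prints will not match the correlations reported above. The names `sample_rows`, `avocado_types` and `cors` are illustrative.

```r
# Four sample rows from the table above; NOT the full dataset.
sample_rows <- data.frame(
  TotalVolume = c(117901590, 134742045, 155157138, 127532485),
  X4046       = c(47927556, 57496409, 65668005, 56817403),
  X4225       = c(48195457, 53475599, 59541994, 45882786),
  X4770       = c(3309990, 3902248, 4653819, 3586287)
)

avocado_types <- c("X4046", "X4225", "X4770")

# Pearson correlation of each type's volume with total volume.
cors <- sapply(avocado_types, function(v) {
  cor(sample_rows$TotalVolume, sample_rows[[v]])
})

print(round(cors, 3))
```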
Now that we can see there is a linear relationship between total volume and each of the small, large and extra large avocado volumes (and the distributions are relatively normal), we can fit a simple linear regression for each avocado type in turn, starting with small avocados.
##
## Call:
## lm(formula = mavo.form, data = mavo_us)
##
## Residuals:
## Min 1Q Median 3Q Max
## -28861783 -9013993 2544944 9525619 30942367
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 34784893.9318 12541116.6127 2.774 0.00863 **
## X4046 2.1404 0.2366 9.045 0.0000000000657 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14790000 on 37 degrees of freedom
## Multiple R-squared: 0.6886, Adjusted R-squared: 0.6802
## F-statistic: 81.82 on 1 and 37 DF, p-value: 0.00000000006565
The coefficient indicates that each additional small avocado sold is associated with an increase of approximately 2 (2.14) in total volume.
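To make the slope concrete, here is a small worked example that plugs the January 2015 row from the sample table into the fitted line, using the intercept and slope printed in the summary above. The variable names are illustrative, and the reported coefficients are rounded, so treat the result as approximate.

```r
# Coefficients as printed in the summary output above.
intercept  <- 34784893.9318
slope_4046 <- 2.1404

# January 2015 values from the sample table.
x4046_jan2015 <- 47927556
actual_total  <- 117901590

# Fitted value and residual for that month.
predicted_total <- intercept + slope_4046 * x4046_jan2015
residual        <- actual_total - predicted_total

# One extra small avocado moves the prediction by exactly the slope.
delta <- (intercept + slope_4046 * (x4046_jan2015 + 1)) - predicted_total

print(c(predicted = predicted_total, residual = residual, delta = delta))
```

The prediction is roughly 137 million against an observed 118 million. The residual of about -19 million sits inside the residual range reported in the summary.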
##
## Call:
## lm(formula = mavo.form, data = mavo_us)
##
## Residuals:
## Min 1Q Median 3Q Max
## -37562734 -10191010 1128088 12826702 31178305
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 47210122.6632 14472634.2193 3.262 0.00238 **
## X4225 1.9665 0.2822 6.970 0.0000000311 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 17420000 on 37 degrees of freedom
## Multiple R-squared: 0.5676, Adjusted R-squared: 0.5559
## F-statistic: 48.57 on 1 and 37 DF, p-value: 0.00000003113
The coefficient indicates that each additional large avocado sold is also associated with an increase of approximately 2 (1.97) in total volume.
##
## Call:
## lm(formula = mavo.form, data = mavo_us)
##
## Residuals:
## Min 1Q Median 3Q Max
## -40709077 -16096359 -4234371 17084411 54110932
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 121996707.849 9495085.419 12.848 0.00000000000000328 ***
## X4770 6.076 2.179 2.788 0.00833 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 24090000 on 37 degrees of freedom
## Multiple R-squared: 0.1736, Adjusted R-squared: 0.1513
## F-statistic: 7.772 on 1 and 37 DF, p-value: 0.008328
The coefficient indicates that each additional extra large avocado sold is associated with an increase of approximately 6 in total volume.
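The code behind the three single-predictor fits is not shown, but the Call lines suggest a formula object named `mavo.form` and a data frame named `mavo_us`. One plausible way to produce such fits is sketched below. The contents of `mavo_us` here are an assumption (only the four sample rows from the table), as are the names `predictors`, `fits` and `slopes`, so the slopes it prints will differ from those reported above.

```r
# Stand-in for mavo_us: only the four sample rows shown earlier.
mavo_us <- data.frame(
  TotalVolume = c(117901590, 134742045, 155157138, 127532485),
  X4046       = c(47927556, 57496409, 65668005, 56817403),
  X4225       = c(48195457, 53475599, 59541994, 45882786),
  X4770       = c(3309990, 3902248, 4653819, 3586287)
)

predictors <- c("X4046", "X4225", "X4770")

# Fit one simple regression per avocado type.
fits <- lapply(predictors, function(p) {
  mavo.form <- reformulate(p, response = "TotalVolume")  # TotalVolume ~ p
  lm(mavo.form, data = mavo_us)
})
names(fits) <- predictors

# Pull out the slope of each single-predictor model.
slopes <- sapply(predictors, function(p) coef(fits[[p]])[[p]])
print(slopes)
```

Reusing one formula variable for every model would also explain why each Call line in the output reads the same.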
However, what happens if you consider both small and large avocados sold together?
##
## Call:
## lm(formula = mavo.form, data = mavo_us)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24439894 -9297991 4253887 9761072 18164458
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 18499435.9207 11930695.5099 1.551 0.12975
## X4046 1.5191 0.2730 5.565 0.00000265 ***
## X4225 0.9660 0.2762 3.497 0.00127 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12950000 on 36 degrees of freedom
## Multiple R-squared: 0.7676, Adjusted R-squared: 0.7547
## F-statistic: 59.44 on 2 and 36 DF, p-value: 0.000000000003919
Notice how the coefficients have changed (been adjusted)?
The coefficients were about 2 individually. Now:
- For each additional small avocado sold, total volume increases by only about 1.5, assuming that sales of large avocados do not change.
- For each additional large avocado sold, total volume increases by only about 1, assuming that sales of small avocados do not change.
What has happened?
With both small and large avocados as features, the model now explains 77% of the variance in total volume (R-squared). This makes sense because these volumes are components of the total volume. The coefficients shrink because small and large avocado sales are themselves related: in the single-variable models, each coefficient also absorbed part of the other type’s effect. Small avocados appear to move total volume more than large avocados do. Perhaps more small avocados are sold than large ones.
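The size of the adjustment follows from a standard least-squares identity: the slope from the single-predictor model equals the adjusted slope plus the other predictor's adjusted slope times the slope from regressing that other predictor on the first. The sketch below checks this on synthetic data (not the avocado data). It then applies the identity to the coefficients reported above to back out the implied slope of large on small avocado sales. All variable names are illustrative, and the implied value is approximate because the reported coefficients are rounded.

```r
set.seed(42)

# Synthetic data: x2 is deliberately correlated with x1.
n  <- 39
x1 <- rnorm(n, mean = 60, sd = 10)
x2 <- 0.6 * x1 + rnorm(n, sd = 5)
y  <- 20 + 1.5 * x1 + 1.0 * x2 + rnorm(n, sd = 8)

b_simple   <- coef(lm(y ~ x1))[["x1"]]   # unadjusted slope
b_adjusted <- coef(lm(y ~ x1 + x2))      # adjusted slopes
delta      <- coef(lm(x2 ~ x1))[["x1"]]  # slope of x2 on x1

# Identity: unadjusted = adjusted x1 slope + adjusted x2 slope * delta.
reconstructed <- b_adjusted[["x1"]] + b_adjusted[["x2"]] * delta
print(all.equal(b_simple, reconstructed))

# Applied to the reported avocado coefficients:
# 2.1404 = 1.5191 + 0.9660 * implied_delta
implied_delta <- (2.1404 - 1.5191) / 0.9660
print(round(implied_delta, 3))
```

If both avocado models were fit on the same rows, the reported coefficients imply that large avocado sales rise by roughly 0.64 for each additional small avocado sold. That is the overlap the single-variable slope of 2.14 was absorbing.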
Is there a way to visualise this relationship?
Fortunately, with two features we can visualise this in a 3D graphic using plotly.
From the image, you can see the linear model has changed from a line to a plane in 3D space. The plane has two slopes (coefficients), one along each feature axis, exactly as the model describes. If you rotate the image, you can see the different slopes.
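The plotting code for the interactive figure is not shown. As a rough, non-interactive stand-in, the sketch below draws a fitted regression plane with base R's persp() instead of plotly, so it runs without extra packages. The data is synthetic and every name in it (`d`, `fit`, `g1`, `g2`, `plane`) is an assumption made for illustration.

```r
set.seed(1)

# Synthetic data with two correlated predictors (not the avocado data).
n <- 39
d <- data.frame(X4046 = runif(n, 40, 90))
d$X4225       <- 0.6 * d$X4046 + rnorm(n, sd = 8)
d$TotalVolume <- 20 + 1.5 * d$X4046 + 1.0 * d$X4225 + rnorm(n, sd = 10)

fit <- lm(TotalVolume ~ X4046 + X4225, data = d)

# Grid over the observed range of each predictor.
g1 <- seq(min(d$X4046), max(d$X4046), length.out = 20)
g2 <- seq(min(d$X4225), max(d$X4225), length.out = 20)

# Height of the fitted plane at every grid point.
plane <- outer(g1, g2, function(a, b) {
  predict(fit, newdata = data.frame(X4046 = a, X4225 = b))
})

# theta and phi set the viewing angle; change them to "rotate" the plot.
persp(g1, g2, plane, theta = 35, phi = 20,
      xlab = "X4046", ylab = "X4225", zlab = "TotalVolume")
```

Moving along one axis of the grid changes the height at a constant rate equal to that predictor's coefficient, which is the "two slopes" reading of the plane.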
Now what happens if you consider all 3 types of avocados sold?
##
## Call:
## lm(formula = mavo.form, data = mavo_us)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24435268 -9336150 4269007 9701571 18156012
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 18400385.77803 12354091.69514 1.489 0.1453
## X4046 1.51389 0.30602 4.947 0.0000188 ***
## X4225 0.97939 0.43760 2.238 0.0317 *
## X4770 -0.07619 1.91860 -0.040 0.9686
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13130000 on 35 degrees of freedom
## Multiple R-squared: 0.7676, Adjusted R-squared: 0.7477
## F-statistic: 38.53 on 3 and 35 DF, p-value: 0.00000000003459
Notice that the coefficient of extra large avocados has changed from 6 to -0.08. Unfortunately, we cannot visualise this easily, but the model is stating that:
- For every extra large avocado (4770) sold, total volume will decrease by 0.08 (or close to 0), assuming sales of small and large avocados don’t change.
However, the p-value for this coefficient is 0.97, so there is no evidence that it differs from zero once small and large avocados are in the model.
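The strength of this evidence can be checked directly from the numbers printed in the three-variable summary. The sketch below recomputes the t statistic, two-sided p-value and a 95% confidence interval for X4770 from its reported estimate, standard error and residual degrees of freedom. It then recomputes adjusted R-squared for the two- and three-variable models. The variable names and the helper `adj_r2` are illustrative, and the inputs are rounded, so results match the output only approximately.

```r
# Values for X4770 as printed in the three-variable summary.
estimate  <- -0.07619
std_error <- 1.91860
df_resid  <- 35

t_value <- estimate / std_error
p_value <- 2 * pt(-abs(t_value), df = df_resid)  # two-sided p-value
ci_95   <- estimate + c(-1, 1) * qt(0.975, df = df_resid) * std_error

print(round(c(t = t_value, p = p_value), 4))
print(round(ci_95, 2))

# Adjusted R-squared penalises extra predictors.
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

n_obs <- 39  # 35 residual df + 4 estimated coefficients
adj_two   <- adj_r2(0.7676, n_obs, 2)  # small + large
adj_three <- adj_r2(0.7676, n_obs, 3)  # small + large + extra large

print(round(c(two = adj_two, three = adj_three), 4))
```

The confidence interval spans zero comfortably. R-squared stays at 0.7676 while adjusted R-squared falls from 0.7547 to 0.7477, so the extra large term adds a parameter without adding explanatory power.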
As you can see, adding variables to a model can change the coefficients (slopes) in the model. Furthermore, omitting variables that are correlated with both the response variable and the included predictors, e.g. excluding either small, large or extra large avocados, can lead to bias. Therefore, one must be careful when adding variables to a model, and should not interpret the model results without first considering confounding or collinear variables.
Caffo, Brian 2015, Regression Models for Data Science in R, Leanpub, Victoria
Kiggins, Justin 2018, Avocado Prices, Kaggle, viewed 29 August 2018, <https://www.kaggle.com/neuromusic/avocado-prices/home>