Introduction

According to Caffo (2015), ‘adjustment’ is about understanding how adding further variables to a model changes the estimated effects of the variables already in it. A real-world example is an analysis of US avocado sales volume, using a linear model that predicts total sales volume from the sales volumes of Small Hass Avocados (4046), Large Hass Avocados (4225) and Extra Large Hass Avocados (4770).

The data

First, let’s take a brief look at the data. It contains monthly US sales volumes of Hass avocados from 2015 to 2018: the average price per avocado, total volume, volume by type (Small-4046, Large-4225, XLarge-4770), total bags sold and bags sold by type.

First 4 lines
Year Month region  AvgPrice TotalVolume    X4046    X4225   X4770 TotalBags SmallBags LargeBags    XLBags
2015     1 TotalUS   1.0075   117901590 47927556 48195457 3309990  18468587  15220261   3197297  51028.62
2015     2 TotalUS   0.9725   134742045 57496409 53475599 3902248  19867790  16671400   3110311  86077.91
2015     3 TotalUS   1.0160   155157138 65668005 59541994 4653819  25293320  21763445   3386149 143725.23
2015     4 TotalUS   1.0450   127532485 56817403 45882786 3586287  21246008  18158197   3000196  87614.77

Analysis of the relationships

A good way to get a quick first look at the relationships between the variables is to run pairs() on the data. The resulting scatterplot matrix plots each variable against every other one. This helps with understanding the relationships between the response variable (Total Volume) and the explanatory variables, and it reveals any collinear relationships among the explanatory variables themselves.

Note that pairs() expects numeric variables, so we only take a numeric subset of the data.

pairs(avocados)
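
As a rough sketch of that subsetting step, the snippet below rebuilds the four preview rows from the data section as a data frame, keeps the numeric columns and passes them to pairs(). The name sample_rows and the choice to also drop Year and Month are assumptions made for illustration; the original subsetting code is not shown.

```r
# The four preview rows from the data section (not the full dataset).
sample_rows <- data.frame(
  Year        = c(2015, 2015, 2015, 2015),
  Month       = 1:4,
  region      = "TotalUS",
  AvgPrice    = c(1.0075, 0.9725, 1.0160, 1.0450),
  TotalVolume = c(117901590, 134742045, 155157138, 127532485),
  X4046       = c(47927556, 57496409, 65668005, 56817403),
  X4225       = c(48195457, 53475599, 59541994, 45882786),
  X4770       = c(3309990, 3902248, 4653819, 3586287),
  TotalBags   = c(18468587, 19867790, 25293320, 21246008),
  SmallBags   = c(15220261, 16671400, 21763445, 18158197),
  LargeBags   = c(3197297, 3110311, 3386149, 3000196),
  XLBags      = c(51028.62, 86077.91, 143725.23, 87614.77)
)

# Find the numeric columns; `region` is text and is left out.
numeric_cols <- names(sample_rows)[vapply(sample_rows, is.numeric, logical(1))]

# Assumption: Year and Month are date parts rather than measurements,
# so they are dropped from the scatterplot matrix as well.
avocados <- sample_rows[, setdiff(numeric_cols, c("Year", "Month"))]

# Scatterplot matrix of every remaining variable against every other one.
pairs(avocados)
```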

You can see that there are some linear relationships between Total Volume and the different types of Avocados.

Total Sales Volume and Small Avocados

Let’s take a look at the specific relationship between Total Volume and Small Avocados (4046).

Correlation

## [1] 0.829823

You can see there is a strong linear relationship between total sales volume and the sales volume of small avocados.

Total Sales Volume and Large Avocados

Now let’s take a look at the specific relationship between Total Volume and Large Avocados (4225).

Correlation

## [1] 0.7534093

There is also a strong linear relationship between total sales volume and the sales volume of large avocados.

Total Sales Volume and Extra Large Avocados

Finally, let’s look at the specific relationship between Total Volume and Extra Large Avocados (4770).

Correlation

## [1] 0.4166469

There is a moderate linear relationship between total sales volume and the sales volume of extra large avocados.
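
The three correlations above are shown as output only, so here is a minimal sketch of how such values might be computed with cor(). It uses just the four preview rows from the data section, so its results will not match the figures quoted above, which come from the full dataset. The data frame name mavo_us is borrowed from the model output later on; type_cols is a hypothetical helper name.

```r
# Volume columns from the four preview rows only (not the full dataset).
mavo_us <- data.frame(
  TotalVolume = c(117901590, 134742045, 155157138, 127532485),
  X4046       = c(47927556, 57496409, 65668005, 56817403),
  X4225       = c(48195457, 53475599, 59541994, 45882786),
  X4770       = c(3309990, 3902248, 4653819, 3586287)
)

# The three avocado size codes to compare against total volume.
type_cols <- c("X4046", "X4225", "X4770")

# Pearson correlation of total volume with each size's volume.
cors <- vapply(
  type_cols,
  function(col) cor(mavo_us$TotalVolume, mavo_us[[col]]),
  numeric(1)
)

print(round(cors, 3))
```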

Modelling

Now that we can see there is a linear relationship between total volume and each of the small, large and extra large avocado volumes (and the distributions look roughly normal), we can fit a linear regression model to each relationship.
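
The model summaries that follow all show a call of the form lm(formula = mavo.form, data = mavo_us). A minimal sketch of such a call is given here, assuming mavo.form holds a one-predictor formula and using only the four preview rows from the data section, so the estimates will differ from the reported ones. The name fit_small is hypothetical.

```r
# Volume columns from the four preview rows only (not the full dataset).
mavo_us <- data.frame(
  TotalVolume = c(117901590, 134742045, 155157138, 127532485),
  X4046       = c(47927556, 57496409, 65668005, 56817403),
  X4225       = c(48195457, 53475599, 59541994, 45882786),
  X4770       = c(3309990, 3902248, 4653819, 3586287)
)

# Assumed formula: total volume explained by small avocado volume alone.
mavo.form <- TotalVolume ~ X4046

# Ordinary least squares fit; summary() prints the coefficient table.
fit_small <- lm(formula = mavo.form, data = mavo_us)
print(summary(fit_small))

# The slope: change in total volume per additional small avocado sold.
slope <- coef(fit_small)[["X4046"]]
print(slope)
```

Changing the right-hand side of the formula to X4225, X4770 or X4046 + X4225 would give the other models in this section.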

Total Volume and Small Avocados

## 
## Call:
## lm(formula = mavo.form, data = mavo_us)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -28861783  -9013993   2544944   9525619  30942367 
## 
## Coefficients:
##                  Estimate    Std. Error t value        Pr(>|t|)    
## (Intercept) 34784893.9318 12541116.6127   2.774         0.00863 ** 
## X4046              2.1404        0.2366   9.045 0.0000000000657 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14790000 on 37 degrees of freedom
## Multiple R-squared:  0.6886, Adjusted R-squared:  0.6802 
## F-statistic: 81.82 on 1 and 37 DF,  p-value: 0.00000000006565

The coefficient indicates that each additional small avocado sold is associated with an increase in total volume of approximately 2 (2.14).

Total Volume and Large Avocados

## 
## Call:
## lm(formula = mavo.form, data = mavo_us)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -37562734 -10191010   1128088  12826702  31178305 
## 
## Coefficients:
##                  Estimate    Std. Error t value     Pr(>|t|)    
## (Intercept) 47210122.6632 14472634.2193   3.262      0.00238 ** 
## X4225              1.9665        0.2822   6.970 0.0000000311 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 17420000 on 37 degrees of freedom
## Multiple R-squared:  0.5676, Adjusted R-squared:  0.5559 
## F-statistic: 48.57 on 1 and 37 DF,  p-value: 0.00000003113

The coefficient indicates that each additional large avocado sold is likewise associated with an increase in total volume of approximately 2 (1.97).

Total Volume and Extra Large Avocados

## 
## Call:
## lm(formula = mavo.form, data = mavo_us)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -40709077 -16096359  -4234371  17084411  54110932 
## 
## Coefficients:
##                  Estimate    Std. Error t value            Pr(>|t|)    
## (Intercept) 121996707.849   9495085.419  12.848 0.00000000000000328 ***
## X4770               6.076         2.179   2.788             0.00833 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 24090000 on 37 degrees of freedom
## Multiple R-squared:  0.1736, Adjusted R-squared:  0.1513 
## F-statistic: 7.772 on 1 and 37 DF,  p-value: 0.008328

The coefficient indicates that each additional extra large avocado sold is associated with an increase in total volume of approximately 6 (6.08).

Total Volume with Small & Large Avocados

However, what happens if you consider small and large avocados together in one model?

## 
## Call:
## lm(formula = mavo.form, data = mavo_us)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -24439894  -9297991   4253887   9761072  18164458 
## 
## Coefficients:
##                  Estimate    Std. Error t value   Pr(>|t|)    
## (Intercept) 18499435.9207 11930695.5099   1.551    0.12975    
## X4046              1.5191        0.2730   5.565 0.00000265 ***
## X4225              0.9660        0.2762   3.497    0.00127 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12950000 on 36 degrees of freedom
## Multiple R-squared:  0.7676, Adjusted R-squared:  0.7547 
## F-statistic: 59.44 on 2 and 36 DF,  p-value: 0.000000000003919

Notice how the coefficients have changed (been adjusted)?

Individually, both coefficients were about 2. Now:

For each additional small avocado sold, total volume increases by only about 1.5, assuming that sales of large avocados do not change.

And for each additional large avocado sold, total volume increases by only about 1, assuming that sales of small avocados do not change.

What has happened?

The model with both small and large avocados as features now explains 77% (R-squared) of the variance in total volume. This makes sense because these volumes are components of the total volume. But small and large avocado sales are also related to each other: when each was the only predictor, its coefficient absorbed part of the other’s effect, which is why both coefficients shrink once the two are modelled together. Holding the other type constant, small avocados are associated with a larger change in total volume than large avocados (about 1.5 versus about 1).
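
To make the mechanism concrete, here is a small simulation on synthetic data (not the avocado data). It assumes two correlated predictors, x1 and x2, with true slopes of 1.5 and 1.0 chosen for illustration. It then checks a standard least-squares identity: the slope from the one-predictor model equals the adjusted slope of x1, plus the adjusted slope of x2 times the slope of x2 regressed on x1.

```r
set.seed(42)

# Synthetic data: x2 is built to be correlated with x1.
n  <- 200
x1 <- rnorm(n, mean = 50, sd = 10)
x2 <- 0.6 * x1 + rnorm(n, mean = 0, sd = 5)
y  <- 10 + 1.5 * x1 + 1.0 * x2 + rnorm(n, mean = 0, sd = 5)

fit_x1   <- lm(y ~ x1)        # x1 alone: its slope also carries x2's effect
fit_both <- lm(y ~ x1 + x2)   # both predictors: slopes are adjusted
fit_aux  <- lm(x2 ~ x1)       # how x2 moves with x1

b_simple <- coef(fit_x1)[["x1"]]
b1       <- coef(fit_both)[["x1"]]
b2       <- coef(fit_both)[["x2"]]
delta    <- coef(fit_aux)[["x1"]]

# Unadjusted slope = adjusted slope + (slope of x2) * (x2 per unit of x1).
reconstructed <- b1 + b2 * delta

print(c(simple = b_simple, adjusted = b1, reconstructed = reconstructed))
```

Applied to the output above, and assuming both avocado models were fitted to the same rows, (2.1404 - 1.5191) / 0.9660 is roughly 0.64. Read this way, each additional small avocado sold goes along with about 0.64 additional large avocados, which is the part of the unadjusted coefficient that really belonged to large avocados.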

Is there a way to visualise this relationship?

Fortunately, with two features we can visualise this in a 3D graphic using plotly.

From the image, you can see the linear model has changed from a line to a plane in 3D space. The plane has two slopes (coefficients), one per feature, exactly as the model describes. If you rotate the image, you can see the different slopes.
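
The plotting code itself is not shown, so the following is only a sketch of one way such a figure could be built. It uses synthetic data with made-up scales (the column names mirror the avocado data purely for readability), computes the fitted plane on a grid, and hands it to plotly only if that package is installed. The names sim, xs, ys and plane are hypothetical.

```r
set.seed(1)

# Synthetic stand-in data; values and scales are invented for illustration.
n   <- 39
sim <- data.frame(X4046 = runif(n, 40, 80))
sim$X4225       <- 0.6 * sim$X4046 + runif(n, 10, 30)
sim$TotalVolume <- 20 + 1.5 * sim$X4046 + 1.0 * sim$X4225 + rnorm(n, sd = 5)

fit <- lm(TotalVolume ~ X4046 + X4225, data = sim)

# Grid of predictor values covering the observed ranges.
xs   <- seq(min(sim$X4046), max(sim$X4046), length.out = 25)
ys   <- seq(min(sim$X4225), max(sim$X4225), length.out = 25)
grid <- expand.grid(X4046 = xs, X4225 = ys)

# expand.grid varies X4046 fastest, so filling by row gives
# rows = X4225 values and columns = X4046 values.
plane <- matrix(predict(fit, newdata = grid),
                nrow = length(ys), ncol = length(xs), byrow = TRUE)

# Optional interactive figure: observed points plus the fitted plane.
if (requireNamespace("plotly", quietly = TRUE)) {
  fig <- plotly::plot_ly()
  fig <- plotly::add_markers(fig, data = sim,
                             x = ~X4046, y = ~X4225, z = ~TotalVolume)
  fig <- plotly::add_surface(fig, x = xs, y = ys, z = plane, opacity = 0.6)
  if (interactive()) print(fig)  # rotate the figure to see both slopes
}
```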

Total Volume with Small, Large and Extra Large Avocados

Now what happens if you consider all 3 types of avocados sold?

## 
## Call:
## lm(formula = mavo.form, data = mavo_us)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -24435268  -9336150   4269007   9701571  18156012 
## 
## Coefficients:
##                   Estimate     Std. Error t value  Pr(>|t|)    
## (Intercept) 18400385.77803 12354091.69514   1.489    0.1453    
## X4046              1.51389        0.30602   4.947 0.0000188 ***
## X4225              0.97939        0.43760   2.238    0.0317 *  
## X4770             -0.07619        1.91860  -0.040    0.9686    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13130000 on 35 degrees of freedom
## Multiple R-squared:  0.7676, Adjusted R-squared:  0.7477 
## F-statistic: 38.53 on 3 and 35 DF,  p-value: 0.00000000003459

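Notice that the coefficient of extra large avocados has changed from 6 to -0.08. Unfortunately, we cannot visualise this easily, but the model is stating that:

- For every extra large avocado (4770) sold, total volume will decrease by 0.08 (close to 0), assuming sales of small and large avocados don’t change.

However, the p-value for this coefficient is 0.97, so it is not statistically significant: once small and large avocados are in the model, there is no real evidence that extra large avocado sales have any additional effect on total volume. Consistent with this, R-squared stays at 0.7676 while adjusted R-squared falls slightly, from 0.7547 to 0.7477.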
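
The coefficient table discussed above can also be pulled out of a fitted model programmatically. The sketch below uses synthetic data (not the avocado data) in which a third predictor, x3, is deliberately built as a noisy mix of x1 and x2 with no effect of its own on y. All variable names and numbers here are assumptions for illustration.

```r
set.seed(7)

# Synthetic data: x3 is largely redundant given x1 and x2.
n  <- 39
x1 <- rnorm(n, mean = 60, sd = 10)
x2 <- 0.6 * x1 + rnorm(n, mean = 0, sd = 6)
x3 <- 0.05 * x1 + 0.05 * x2 + rnorm(n, mean = 0, sd = 0.5)
y  <- 20 + 1.5 * x1 + 1.0 * x2 + rnorm(n, mean = 0, sd = 10)

fit2 <- lm(y ~ x1 + x2)
fit3 <- lm(y ~ x1 + x2 + x3)

# Matrix with columns: Estimate, Std. Error, t value, Pr(>|t|).
tab3 <- coef(summary(fit3))
print(round(tab3, 4))

# p-value for the added predictor.
p_x3 <- tab3["x3", "Pr(>|t|)"]

# R-squared can never fall when a predictor is added;
# adjusted R-squared can, because it penalises the extra term.
r2 <- c(two = summary(fit2)$r.squared, three = summary(fit3)$r.squared)
adj_r2 <- c(two = summary(fit2)$adj.r.squared,
            three = summary(fit3)$adj.r.squared)
print(r2)
print(adj_r2)
```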

Conclusion

As you can see, adding variables to a model can change the coefficients (slopes) of the variables already in it. Conversely, omitting a variable that is correlated with both the response and the included variables, e.g. excluding small or large avocados, can bias the remaining coefficients. Therefore, one must be careful when adding variables to a model, and when interpreting its results, to first consider possible confounding or collinear variables.

References

Caffo, Brian 2015, Regression Models for Data Science in R, Leanpub, Victoria

Kiggins, Justin 2018, Avocado Prices, Kaggle, viewed 29 August 2018, <https://www.kaggle.com/neuromusic/avocado-prices/home>