According to Brian Caffo, adjustment consists of adding a new regressor into our model to investigate the change in the outcome. Including a new feature in our model could have a significant impact depending on the relationship between the variables. The adjustment process helps us to understand the best way to include the variable in the model.
To show how the adjustment works, I am going to use the Ames Housing Dataset compiled by Dean De Cock. The dataset consists of 1,460 observations of house sale prices and other characteristics. I am going to use only some of the features.
In this example, we are going to see how the number of bathrooms affects the sale price of a house, also considering the lot size in square feet. Figure 1 shows these relationships for houses under 20,000 square feet with 1 and 2 bathrooms.
Figure 1: Variable relationship using feature Bathrooms
In this case, it seems that there is no relationship between the number of bathrooms and the lot area. On the other hand, we can see that both predictors are related to Sale Price. However, the relationship between price and lot area differs a bit depending on the number of bathrooms (see the slope of the regression line).
A good approximation of the relationship between these three variables will be:
\[\begin{equation*} SalePrice={\beta}_{0}+{\beta}_{1}LotArea+{\beta}_{2}Bathroom \end{equation*}\]
Here, Bathroom is a dummy variable: 1 means two bathrooms and 0 means one bathroom.
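To see what the dummy variable does, it helps to write the model out once for each value of Bathroom. This is only a restatement of the equation above with the same coefficients, not an additional result:
\[\begin{equation*} \begin{aligned} Bathroom=0:&\quad SalePrice={\beta}_{0}+{\beta}_{1}LotArea \\ Bathroom=1:&\quad SalePrice=({\beta}_{0}+{\beta}_{2})+{\beta}_{1}LotArea \end{aligned} \end{equation*}\]
In other words, the model describes two parallel lines with a common slope \({\beta}_{1}\), and \({\beta}_{2}\) is the vertical gap between them. Capturing the slightly different slopes seen in Figure 1 would require an extra interaction term such as \({\beta}_{3}LotArea\times Bathroom\), which is not part of the model used here.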
The following example shows the same relationship but now grouping the observations into two groups, those with central air and those without it. This example uses only Single-family Detached houses under 20,000 square feet. Figure 2 shows the relationship.
Figure 2: Variable relationship using feature Central Air
The Figure shows some overlap between the groups under 12,000 square feet. Moreover, we can also see a difference between the means of the two groups. This means that both variables affect Sale Price; however, we can also see that above 12,000 square feet there are only houses with central air.
Because the slopes of the regression lines are not very different, we can still use the same relation as in Example 1.
\[\begin{equation*} SalePrice={\beta}_{0}+{\beta}_{1}LotArea+{\beta}_{2}CentralAir \end{equation*}\]
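One rough way to check that the slopes are "not so different" is to fit a separate line to each group and compare the two slopes. The sketch below is only an illustration: the rows are made-up toy values, not taken from the Ames data, and the names `fit_line` and `slope_by_group` are hypothetical helpers introduced here.

```python
def fit_line(x, y):
    """Least-squares fit of y = intercept + slope * x; returns (intercept, slope)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Made-up toy rows (lot_area, central_air, sale_price); NOT real Ames observations.
rows = [
    (6000, 0, 95000), (8000, 0, 108000), (10000, 0, 118000), (11000, 0, 127000),
    (7000, 1, 150000), (9000, 1, 161000), (12000, 1, 182000), (15000, 1, 199000),
]


def slope_by_group(rows):
    """Fit one line per value of the grouping variable; returns {group: (intercept, slope)}."""
    fits = {}
    for group in sorted({g for _, g, _ in rows}):
        x = [lot for lot, g, _ in rows if g == group]
        y = [price for _, g, price in rows if g == group]
        fits[group] = fit_line(x, y)
    return fits


fits = slope_by_group(rows)
for group, (intercept, slope) in fits.items():
    print(f"central_air={group}: intercept={intercept:.0f}, slope={slope:.2f}")
```

If the per-group slopes are close while the intercepts differ, a single common slope plus a dummy variable, as in the equation above, is a reasonable simplification.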
In this example, we are going to change the grouping variable to General Shape of Property, which can be irregular or regular. Figure 3 shows the relationship for Single-family Detached houses under 20,000 square feet.
Figure 3: Variable relationship using feature Property Shape
The Figure shows a relationship similar to the one in Example 1; however, the difference in mean sale price between the groups is now so small that it could be non-significant. This suggests that there may be little or no relationship between Sale Price and General Shape of Property.
Because the grouping variable could be non-significant, the relation can be written as follows:
\[\begin{equation*} SalePrice={\beta}_{0}+{\beta}_{1}LotArea \end{equation*}\]
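Whether a difference between group means is "too low" can be judged more concretely by comparing it with its standard error. The sketch below computes Welch's t statistic for two groups. It is a simplification that ignores Lot Area, the prices are made-up values in thousands of dollars rather than Ames data, and `welch_t` is a hypothetical helper name.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic for the difference in means between samples a and b."""
    standard_error = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / standard_error


# Made-up sale prices in thousands of dollars; NOT real Ames observations.
regular = [120, 135, 150, 160, 175, 190]
irregular = [125, 138, 149, 166, 172, 195]

t = welch_t(irregular, regular)
print(f"difference in means: {mean(irregular) - mean(regular):.1f}")
print(f"Welch t statistic:   {t:.2f}")
```

A t statistic close to zero means the gap between the group means is small relative to the spread within the groups, which is the situation where dropping the grouping variable from the model is defensible. A fuller check would test the coefficient of the grouping variable in the regression itself.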
In this example, we add another continuous variable to show a different type of relationship. Here, I include the variable Floor Area. The relation can be seen in the following Figure.
According to the plot, Floor Area explains more of the variation in Sale Price, but there is still some variation that is explained only by Lot Area.
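The idea that each predictor explains part of the variation can be illustrated by comparing R² for models with one predictor and with both. The following is a minimal sketch under strong assumptions: the data are made up and noise-free (built so that price = 20 + 3 × lot + 80 × floor, in thousands), they are not Ames data, and `solve`, `ols` and `r_squared` are hypothetical helpers that use the normal equations.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def ols(columns, y):
    """Least-squares coefficients for y = b0 + b1*x1 + ...; the intercept comes first."""
    design = [[1.0] + [float(col[i]) for col in columns] for i in range(len(y))]
    k = len(design[0])
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * yi for row, yi in zip(design, y)) for i in range(k)]
    return solve(xtx, xty)


def r_squared(columns, y):
    """Share of the variance of y explained by a linear model on the given columns."""
    coefs = ols(columns, y)
    fitted = [coefs[0] + sum(c * col[i] for c, col in zip(coefs[1:], columns))
              for i in range(len(y))]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


# Made-up, noise-free toy data in thousands (sq ft and dollars); NOT Ames data.
lot = [6, 8, 9, 11, 12, 15]
floor = [1.0, 1.4, 1.1, 1.8, 1.5, 2.2]
price = [118, 156, 135, 197, 176, 241]

print("R^2, Lot Area only:  ", round(r_squared([lot], price), 3))
print("R^2, Floor Area only:", round(r_squared([floor], price), 3))
print("R^2, both predictors:", round(r_squared([lot, floor], price), 3))
```

In this toy setup Floor Area alone explains most of the variation, and adding Lot Area accounts for the remainder, which mirrors the pattern described above. With real data the fit with both predictors would not be perfect.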