OLS regression makes several assumptions about the data used in the model, including linearity between the Xs and Y and normally distributed errors. When either of these is violated, it can introduce bias and/or inefficiency into your results. In this tutorial, we will review how data transformations can help address violations of both the linearity and normality assumptions. Specifically, we will review:

- Using polynomials to more appropriately estimate a curvilinear relationship
- Using the natural log to transform right-skewed continuous variables
Oftentimes, the relationship between some X and some Y is curvilinear, meaning that how the two variables relate differs depending on the specific value of x. We will simulate a polynomial relationship and then review how adding a polynomial term to your regression model can help mitigate violations of the OLS assumptions.
Here, we initiate our data simulation with a vector of values for x, then create a polynomial relationship between x and y before adding random noise to the variable so that the model is probabilistic instead of deterministic.
### Simulating Polynomial Data
set.seed(1942)
# Generate x values
x <- seq(-10, 10, by = 0.1)
y <- .758*x^2 + .321*x + 5
# Add random noise to y
y_noisy <- y + rnorm(length(x), mean = 0, sd = 50)

# Create a data frame
data <- data.frame(x = x, y = y_noisy)
head(data)
## x y
## 1 -10.0 46.85826
## 2 -9.9 160.52367
## 3 -9.8 28.98630
## 4 -9.7 -27.80271
## 5 -9.6 67.01748
## 6 -9.5 53.92413
data$x2 <- data$x^2 # Create a new squared version of x for modeling purposes
head(data)
## x y x2
## 1 -10.0 46.85826 100.00
## 2 -9.9 160.52367 98.01
## 3 -9.8 28.98630 96.04
## 4 -9.7 -27.80271 94.09
## 5 -9.6 67.01748 92.16
## 6 -9.5 53.92413 90.25
Now that we have created our polynomial relationship between x and y, let's evaluate the issues in OLS without incorporating the curvilinear relationship. We start by creating a scatterplot of x and y that includes the linear fit as well as a LOESS line. The LOESS line will better reveal the true underlying relationship between x and y.
b <- ggplot(data, aes(x = x, y = y))

b + geom_point(size=2) +
  theme_bw(base_size = 20) + theme_minimal() +
  theme(legend.position="bottom") +
  guides(fill=guide_legend(title=NULL)) +
  geom_smooth(method = "loess", se=FALSE, span=.6, color="blue", linetype = "dashed") +
  geom_smooth(method = "lm", se=FALSE, color="red", linetype = "dashed")
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
The red and blue lines on the scatterplot represent the estimated linear relationship between x and y (red line) and the LOESS curve (blue line) that better reflects the true relationship between the two variables. The red line is reasonably flat, whereas the blue line shows that at low values of x, y tends to decrease as x increases, before the relationship flips and y increases along with x. The linear fit in this model is a biased estimate of the relationship between x and y because it does not account for the actual curvilinear relationship between the two variables.
Now, let's estimate a linear regression without accounting for the curvilinear relationship and evaluate plot 1 from the assumption diagnostic plots. Remember, this plot can provide visual evidence of both heteroskedasticity and an unmodeled curvilinear relationship.
lm1 <- lm(y ~ x, data=data)
summary(lm1)
##
## Call:
## lm(formula = y ~ x, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -145.083 -37.858 -0.101 40.150 142.892
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 26.1752 3.9991 6.545 4.93e-10 ***
## x 0.8630 0.6892 1.252 0.212
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 56.7 on 199 degrees of freedom
## Multiple R-squared: 0.007816, Adjusted R-squared: 0.00283
## F-statistic: 1.568 on 1 and 199 DF, p-value: 0.212
plot(lm1, which=1)
Examining the linear regression model that treats the relationship between x and y as purely linear, we see a non-significant beta weight on x. Then, when we examine Plot 1 from the assumption graphs, we see a parabolic red line. Anytime you see a pattern like this, it means there is some level of curvilinear relationship between x and y. These residuals provide additional evidence for what we saw in the LOESS plot above: there seems to be a non-linear relationship in the model.
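If you want a formal check to accompany the visual one, the Ramsey RESET test from the lmtest package (a supplementary step added here, not part of the original tutorial) tests whether powers of the fitted values improve the model; a significant result is consistent with an unmodeled non-linear relationship.

library(lmtest) # assumes the lmtest package is installed
# RESET test on the purely linear model; a small p-value suggests missing non-linearity
resettest(lm1, power = 2:3, type = "fitted")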
Now, let's add a polynomial transformation to our model and reexamine Plot 1 to see if that produces better behaved residuals. There are three different approaches you can use to include a polynomial in your model. The first approach is to actually create the squared version of x in your dataset and include it in your model. The other two approaches do not require you to create a new variable in your data, instead relying on specific code in the model formula to do so.
lm_quad <- lm(y ~ x + x2, data=data)
lm_quad2 <- lm(y ~ x + I(x^2), data=data)
lm_quad3 <- lm(y ~ poly(x, 2, raw=TRUE), data=data)
stargazer(lm1, lm_quad, lm_quad2, lm_quad3, digits=3, type="text")
##
## ===================================================================================================================
## Dependent variable:
## -------------------------------------------------------------------------------------------
## y
## (1) (2) (3) (4)
## -------------------------------------------------------------------------------------------------------------------
## x 0.863 0.863 0.863
## (0.689) (0.637) (0.637)
##
## x2 0.727***
## (0.123)
##
## I(x2) 0.727***
## (0.123)
##
## poly(x, 2, raw = TRUE)1 0.863
## (0.637)
##
## poly(x, 2, raw = TRUE)2 0.727***
## (0.123)
##
## Constant 26.175*** 1.712 1.712 1.712
## (3.999) (5.543) (5.543) (5.543)
##
## -------------------------------------------------------------------------------------------------------------------
## Observations 201 201 201 201
## R2 0.008 0.157 0.157 0.157
## Adjusted R2 0.003 0.149 0.149 0.149
## Residual Std. Error 56.697 (df = 199) 52.391 (df = 198) 52.391 (df = 198) 52.391 (df = 198)
## F Statistic 1.568 (df = 1; 199) 18.446*** (df = 2; 198) 18.446*** (df = 2; 198) 18.446*** (df = 2; 198)
## ===================================================================================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
Using stargazer, we compare our 4 models side-by-side. The three models that include the polynomial are exactly the same, as they should be. This simply illustrates that each of the three approaches produces the exact same outcome, so it is up to you to decide which one to use. Examining the actual results, we see that x has a non-significant beta weight in both the purely linear and the polynomial regressions. However, the squared x added to the model is a significant predictor. Also, when we examine the model fit statistics, we clearly see that the polynomial model outperforms the purely linear model on each model fit statistic. This indicates that adding the polynomial term to the regression creates a better performing model overall.
Now, let's reexamine Plot 1 from the assumption plots to see if we removed the curvilinear pattern in the residuals.
plot(lm1, which=1)
plot(lm_quad, which=1)
Here, the residual plot from the polynomial model shows a much flatter red line that does not exhibit the clearly curvilinear pattern we see in the residual plot from the purely linear model. This indicates that including the polynomial term gives us better behaved residuals. Coupling this with the better model fit statistics, we can say that including the polynomial term in the regression better reflects the true relationship between x and y in this data.
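As a quick supplementary check (not part of the original tutorial), note that the purely linear model is nested within the quadratic model, so a nested-model F test in base R can confirm that the squared term significantly improves the fit.

# Compare the nested models; a small p-value favors the quadratic specification
anova(lm1, lm_quad)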
When you have a continuous variable with an extreme right-hand skew, it can be appropriate to use a log transformation to make the variable more normally distributed. Log transformations should only be done on continuous variables with a right-hand skew, and the variable must take strictly positive values since the log of zero or a negative number is undefined.
For this section, we will be working out of our California real-estate data. We start by looking at the histograms of our two continuous variables in this data: price and sqft.
# Specify the URL of the Stata file
url <- "https://github.com/drCES/winter_603/raw/main/California%20Real%20Estate%20Data%20-%202015%20Ahmed.dta"

# Import the data into R
data <- read_dta(url)
ggplot(data, aes(x = price)) +
geom_histogram(aes(y = after_stat(density)), position = "identity", bins = 30, alpha = 0.5, color="blue") +
labs(title = "Histogram Plot of Price",
x = "Price",
y = "Density") +
theme_minimal()
ggplot(data, aes(x = sqft)) +
geom_histogram(aes(y = after_stat(density)), position = "identity", bins = 30, alpha = 0.5, color="blue") +
labs(title = "Histogram Plot of Square Footage",
x = "Square Footage",
y = "Density") +
theme_minimal()
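Before interpreting the histograms, a rough numeric skewness check can back up the visual impression (a sketch using base R only; the skew() helper below is an assumption added for illustration, not part of the original workflow). Values well above zero indicate a right-hand skew.

# Hypothetical helper: sample skewness via the standardized third moment
skew <- function(v) mean((v - mean(v, na.rm = TRUE))^3, na.rm = TRUE) / sd(v, na.rm = TRUE)^3
skew(data$price)
skew(data$sqft)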
For both continuous variables in our real-estate data, we see evidence of an extreme right-hand skew in the untransformed variables. Because of this, it can be appropriate to use the natural logs of both variables in the analysis. We will next create log-transformed versions of each variable and rerun our histograms to see how the transformation has influenced the distributions.
data <- data %>%
  mutate(
    price_ln = log(price),
    sqft_ln = log(sqft)
  )
ggplot(data, aes(x = price_ln)) +
geom_histogram(aes(y = after_stat(density)), position = "identity", bins = 30, alpha = 0.5, color="blue") +
labs(title = "Histogram Plot of Natural Log of Price",
x = "Natural Log of Price",
y = "Density") +
theme_minimal()
ggplot(data, aes(x = sqft_ln)) +
geom_histogram(aes(y = after_stat(density)), position = "identity", bins = 30, alpha = 0.5, color="blue") +
labs(title = "Histogram Plot of ofNatural Log of Square Footage",
x = "Natural Log of Square Footage",
y = "Density") +
theme_minimal()
Once we take the log of both variables, we see much more normally distributed variables compared to the untransformed versions. Now, we proceed with our OLS regressions and examine a variety of models, including the untransformed linear model, the log-linear model, the linear-log model, and the log-log model.
We start with a basic linear model without any transformations and review the assumption plot for normally distributed errors (plot 2, the Normal Q-Q plot).
lm1 <- lm(price ~ sqft + bedrooms + bathrooms, data=data)

plot(lm1, which=2)
residuals <- residuals(lm1)
shapiro.test(residuals)
##
## Shapiro-Wilk normality test
##
## data: residuals
## W = 0.72053, p-value < 2.2e-16
With the Normal Q-Q plot from the basic OLS model, we see substantial deviations at the extreme quantiles, and the Shapiro-Wilk test strongly rejects normality of the residuals. This can cause our standard errors to be inappropriately sized.
log_lin <- lm(price_ln ~ sqft + bedrooms + bathrooms, data=data)

plot(log_lin, which=2)
residuals <- residuals(log_lin)
shapiro.test(residuals)
##
## Shapiro-Wilk normality test
##
## data: residuals
## W = 0.99335, p-value = 0.01838
Examining the log-linear Normal Q-Q plot, we see better performing residuals compared to the linear model, although there are still some deviations from the dotted gray line. Slight deviations like this are not a problem, though. Now, let's examine the linear-log model and review its Normal Q-Q plot.
lin_log <- lm(price ~ sqft_ln + bedrooms + bathrooms, data=data)

plot(lin_log, which=2)
residuals <- residuals(lin_log)
shapiro.test(residuals)
##
## Shapiro-Wilk normality test
##
## data: residuals
## W = 0.71013, p-value < 2.2e-16
In the linear-log model, where we transformed the IV but not the DV, we see similar extreme deviations in the Normal Q-Q plot as we did in the linear model. This indicates that taking the log of only the IV did not “fix” the issues that we had in the linear model.
Now, let’s look at the Normal Q-Q plot for the log-log model, where we are using the log transformations of both the DV and the IV.
log_log <- lm(price_ln ~ sqft_ln + bedrooms + bathrooms, data=data)

plot(log_log, which=2)
residuals <- residuals(log_log)
shapiro.test(residuals)
##
## Shapiro-Wilk normality test
##
## data: residuals
## W = 0.99653, p-value = 0.301
Once we take the natural log of both the IV and the DV, we see only slight deviations in the Normal Q-Q plot, and the Shapiro-Wilk test no longer rejects normality. There are still some minor deviations, but there always will be. This plot shows that the log-log model has the most normally distributed residuals of any of the four models.
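For a compact side-by-side view of these normality checks (a convenience sketch added here, not part of the original tutorial), we can collect the Shapiro-Wilk W statistic and p-value for each model's residuals in a single table.

# Run shapiro.test on each model's residuals and collect W and the p-value
sapply(list(linear = lm1, log_lin = log_lin, lin_log = lin_log, log_log = log_log),
       function(m) unlist(shapiro.test(residuals(m))[c("statistic", "p.value")]))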
Next, we compare each model using the stargazer package to examine model performance.
stargazer(lm1, log_lin, lin_log, log_log, type="text", digits=3)
##
## =====================================================================================
## Dependent variable:
## ------------------------------------------------------
## price price_ln price price_ln
## (1) (2) (3) (4)
## -------------------------------------------------------------------------------------
## sqft 114.927*** 0.0002***
## (23.310) (0.00004)
##
## sqft_ln 357,801.900*** 0.820***
## (66,414.580) (0.104)
##
## bedrooms -13,343.860 0.163*** -29,706.340 0.100**
## (24,914.510) (0.040) (25,710.530) (0.040)
##
## bathrooms 167,924.700*** 0.163*** 150,866.700*** 0.084*
## (30,392.880) (0.049) (31,239.440) (0.049)
##
## Constant -84,869.340 11.519*** -2,449,858.000*** 6.127***
## (59,998.040) (0.096) (434,401.800) (0.679)
##
## -------------------------------------------------------------------------------------
## Observations 535 535 535 535
## R2 0.290 0.372 0.296 0.409
## Adjusted R2 0.286 0.368 0.292 0.406
## Residual Std. Error (df = 531) 430,243.900 0.691 428,428.400 0.670
## F Statistic (df = 3; 531) 72.156*** 104.712*** 74.272*** 122.652***
## =====================================================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
From the stargazer table, we see variations in model performance as well as in the estimated relationships between the x's and y. The two models with the log-transformed DV have higher R2 and Adjusted R2 values than the models with the untransformed DV, as well as higher F-statistics. In this situation, since we have the same predictors, we can use the F-statistic to compare model performance, and it indicates that the transformed models perform better. Comparing models 2 and 4, the log-linear and log-log models, we see that the log-log model has a higher R2, Adjusted R2, and F-statistic and a lower residual standard error. All of these statistics indicate that the log-log model is the better performing model.
Look closely at the relationship between the number of bedrooms and home selling price. In the two models with an untransformed DV, we see a negative, non-significant beta weight. This is one of those outcomes that feels wrong based on what we know about the relationship between bedrooms and home selling price.
Once we transform the DV using the natural log, we see a positive and significant relationship between bedrooms and price. You should not transform a variable simply because you have a strange finding. However, coupling the skewed distributions with the better performing log-log model gives us clear evidence that the distribution of the DV might be at fault for the strange finding on bedrooms. This should lead us to conclude that the more appropriate model is the one where we transform both the DV and the continuous IV in our data.
One issue with using a model with a natural log is the interpretation of the relationship. It is not as simple as in the linear model, where a 1-unit increase in x corresponds to a beta-weight change in y.
Here, we review interpretation of the different models with log transformations starting with the log-linear model.
stargazer(log_lin, digits=3, type="text")
##
## ===============================================
## Dependent variable:
## ---------------------------
## price_ln
## -----------------------------------------------
## sqft 0.0002***
## (0.00004)
##
## bedrooms 0.163***
## (0.040)
##
## bathrooms 0.163***
## (0.049)
##
## Constant 11.519***
## (0.096)
##
## -----------------------------------------------
## Observations 535
## R2 0.372
## Adjusted R2 0.368
## Residual Std. Error 0.691 (df = 531)
## F Statistic 104.712*** (df = 3; 531)
## ===============================================
## Note: *p<0.1; **p<0.05; ***p<0.01
The regression table above shows the results for the log-linear model, which uses a log transformed DV but no transformation of the IVs. With this model, a 1-unit increase in x represents a relative change in y. To understand the relative change, you need to exponentiate the beta weight and subtract 1 from that value.
First, we create a tibble with the beta weights from the log-linear model then exponentiate each beta weight before subtracting 1 from the value. We also calculate the mean home price to then understand the average impact in dollars of a 1-unit increase in each IV, holding everything else in the model constant.
coefficients <- tibble(
  variable = c("sqft", "bedrooms", "bathrooms"),
  beta = c(0.0002, 0.163, 0.163)
)

mean_price <- mean(data$price)

# Calculate the percentage changes
coefficients <- coefficients %>%
  mutate(
    percentage_change = (exp(beta) - 1),
    mean_price = mean_price,
    impact_1unit = percentage_change * mean_price)
print(coefficients)
## # A tibble: 3 × 5
## variable beta percentage_change mean_price impact_1unit
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 sqft 0.0002 0.000200 589363. 118.
## 2 bedrooms 0.163 0.177 589363. 104339.
## 3 bathrooms 0.163 0.177 589363. 104339.
It is just a statistical quirk that the beta weights on bedrooms and bathrooms are the same size, but the interpretation works the same way for each. For square footage, a 1-unit increase - i.e., increasing the size of a home by 1 square foot - is related to an average increase in home selling price of ~$118. For bedrooms and bathrooms, a 1-unit increase - i.e., adding one bedroom or one bathroom - is related to an average increase in home selling price of ~$104,339.
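Rather than typing the beta weights by hand, you can also pull them directly from the fitted model with coef() (a sketch of the same calculation done programmatically; this is an alternative to, not a reproduction of, the code above).

# Extract the log-linear betas, convert to relative changes, then to dollar impacts
betas <- coef(log_lin)[c("sqft", "bedrooms", "bathrooms")]
pct_change <- exp(betas) - 1
round(pct_change * mean(data$price), 0)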
stargazer(lin_log, digits=3, type="text")
##
## ===============================================
## Dependent variable:
## ---------------------------
## price
## -----------------------------------------------
## sqft_ln 357,801.900***
## (66,414.580)
##
## bedrooms -29,706.340
## (25,710.530)
##
## bathrooms 150,866.700***
## (31,239.440)
##
## Constant -2,449,858.000***
## (434,401.800)
##
## -----------------------------------------------
## Observations 535
## R2 0.296
## Adjusted R2 0.292
## Residual Std. Error 428,428.400 (df = 531)
## F Statistic 74.272*** (df = 3; 531)
## ===============================================
## Note: *p<0.1; **p<0.05; ***p<0.01
Here, we review interpreting a model with a logged IV but an untransformed DV: the linear-log model.
For the untransformed IVs, the interpretation is the same as in a regular linear model. A 1-unit increase in x is related to a beta weight change in y. With bedrooms and bathrooms untransformed in this model, we interpret their beta weights in the same way as in a linear model.
However, since the square footage variable is transformed in this model, we need to do additional work to understand how a 1-unit increase influences price, on average. We start by converting the beta weight into the effect of a 1% increase in square footage on price by dividing the beta weight by 100. With this value calculated, we can then do simple math to understand the impact of a 1-unit increase in square footage on price.
sqft_1p_beta <- 357801.900/100
print(sqft_1p_beta)
## [1] 3578.019
sqft_mean <- mean(data$sqft)
print(sqft_mean)
## [1] 2364.905
sqft_1p <- sqft_mean*.01
print(sqft_1p)
## [1] 23.64905
impact_1unit <- sqft_1p_beta/sqft_1p
print(impact_1unit)
## [1] 151.2965
We first see that a 1% increase in square footage, on average and holding everything else in the model constant, is related to a $3,578 increase in price. However, a 1% increase is not the same as a 1-unit increase, so we need to find the average square footage in the data. The average square footage of homes in the data is roughly 2,365, and 1% of that average is 23.65 square feet. The final step is to divide the value we got from dividing the beta by 100 by the value for 1% of the average home square footage.
In this model, we get a value of roughly $151, indicating that a 1-unit increase in square footage, evaluated at the average home size, is related to an increase of that amount in home selling price.
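The same calculation can be done directly from the fitted linear-log model (a sketch assuming the lin_log object from above; this programmatic version is an addition for illustration, not the original code).

# Dollar change for a 1% increase in sqft, then scaled to one square foot at the mean
beta_sqft_ln <- coef(lin_log)["sqft_ln"]
one_pct_effect <- beta_sqft_ln / 100
one_unit_sqft <- mean(data$sqft) * 0.01
one_pct_effect / one_unit_sqft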
The third type of logged model includes a log transformation of the DV and at least one of the IVs. For the IV(s) with a log transformation, the coefficient is interpreted as follows: a 1% increase in the logged IV is related to a beta-weight percent change in the DV. For the untransformed variables - bedrooms and bathrooms in this model - the interpretation is the same as in the log-linear model reviewed previously.
stargazer(log_log, digits=3, type="text")
##
## ===============================================
## Dependent variable:
## ---------------------------
## price_ln
## -----------------------------------------------
## sqft_ln 0.820***
## (0.104)
##
## bedrooms 0.100**
## (0.040)
##
## bathrooms 0.084*
## (0.049)
##
## Constant 6.127***
## (0.679)
##
## -----------------------------------------------
## Observations 535
## R2 0.409
## Adjusted R2 0.406
## Residual Std. Error 0.670 (df = 531)
## F Statistic 122.652*** (df = 3; 531)
## ===============================================
## Note: *p<0.1; **p<0.05; ***p<0.01
Here we see a coefficient of .820 on the log-transformed square footage variable. This indicates that a 1% increase in square footage is related to a .82% increase in home selling price. To convert this into actual dollars, we once again use basic math. We have already calculated the average square footage and home price, so we just need to print them here. Remember that the coefficient is in percentage terms, so to calculate the dollar value correctly you need to divide the beta weight by 100, which means we multiply the mean home price by .0082 and not .82.
print(sqft_1p)
## [1] 23.64905
print(mean_price)
## [1] 589362.8
price_change <- .0082*mean_price
print(price_change)
## [1] 4832.775
impact_1unit <- price_change/sqft_1p
print(impact_1unit)
## [1] 204.3539
Here, we calculate what .82% of the mean selling price is (~$4,833) and then divide that value by the 1% change in square footage. This gives us an estimate of ~$204, indicating that a 1-unit increase in square footage is, on average, related to a $204 increase in home selling price.
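For completeness, here is the same log-log conversion done programmatically from the fitted model (a sketch added for illustration; it mirrors the hand calculation above under the same approximation that a 1% change in x maps to a beta/100 proportional change in y).

# Approximate dollar-per-square-foot effect at the means of price and square footage
elasticity <- coef(log_log)["sqft_ln"]
price_change_1pct <- (elasticity / 100) * mean(data$price)
sqft_change_1pct <- mean(data$sqft) * 0.01
price_change_1pct / sqft_change_1pct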