OLS regression makes several assumptions about the data used in the model, including linearity between the Xs and Y and normally distributed errors. When either of these is violated, it can introduce bias and/or inefficiency into your results. In this tutorial, we will review how data transformations can help address violations of both the linearity and normality assumptions. Specifically, we will review:

- Using polynomials to more appropriately estimate a curvilinear relationship
- Using the natural log to transform right-skewed continuous variables
Oftentimes, the relationship between some X and some Y is curvilinear, meaning that how the two variables relate differs depending on the specific value of x. We will simulate a polynomial relationship and then review how adding a polynomial term to your regression model can help mitigate violations of the OLS assumptions.
Here, we initiate our data simulation with a vector of values for x, then create a polynomial relationship between x and y before adding random noise to the variable so that the model is probabilistic instead of deterministic.
### Simulating Polynomial Data
set.seed(1942)
# Generate x values
x <- seq(-10, 10, by = 0.1)
y <- .758*x^2 + .321*x + 5
# Add random noise to y
y_noisy <- y + rnorm(length(x), mean = 0, sd = 50)

# Create a data frame
data <- data.frame(x = x, y = y_noisy)
head(data)
## x y
## 1 -10.0 46.85826
## 2 -9.9 160.52367
## 3 -9.8 28.98630
## 4 -9.7 -27.80271
## 5 -9.6 67.01748
## 6 -9.5 53.92413
data$x2 <- data$x^2 # Create a new squared version of x for modeling purposes
head(data)
## x y x2
## 1 -10.0 46.85826 100.00
## 2 -9.9 160.52367 98.01
## 3 -9.8 28.98630 96.04
## 4 -9.7 -27.80271 94.09
## 5 -9.6 67.01748 92.16
## 6 -9.5 53.92413 90.25
Now that we have created our polynomial relationship between x and y, let's evaluate the issues in OLS without incorporating the curvilinear relationship. We start by creating a scatterplot of x and y that includes the linear fit as well as a LOESS line. The LOESS line will better reveal the true underlying relationship between x and y.
b <- ggplot(data, aes(x = x, y = y))

b + geom_point(size=2) +
  theme_bw(base_size = 20) + theme_minimal() +
  theme(legend.position="bottom") +
  guides(fill=guide_legend(title=NULL)) +
  geom_smooth(method = "loess", se=FALSE, span=.6, color="blue", linetype = "dashed") +
  geom_smooth(method = "lm", se=FALSE, color="red", linetype = "dashed")
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
The red and blue lines on the scatterplot represent the estimated linear relationship between x and y (red line) and the LOESS curve (blue line) that better reflects the true relationship between the two variables. The red line is reasonably flat, whereas the blue line shows that at low values of x, y tends to decrease as x increases, before the relationship flips and y increases along with x. The linear fit in this model is a biased estimate of the relationship between x and y because it does not account for the actual curvilinear relationship between the two variables.
Now, let's estimate a linear regression without accounting for the curvilinear relationship and evaluate plot 1 from the assumption diagnostic plots. Remember, this plot can provide visual evidence of both heteroskedasticity and an unmodeled curvilinear relationship.
lm1 <- lm(y ~ x, data=data)
summary(lm1)
##
## Call:
## lm(formula = y ~ x, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -145.083 -37.858 -0.101 40.150 142.892
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 26.1752 3.9991 6.545 4.93e-10 ***
## x 0.8630 0.6892 1.252 0.212
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 56.7 on 199 degrees of freedom
## Multiple R-squared: 0.007816, Adjusted R-squared: 0.00283
## F-statistic: 1.568 on 1 and 199 DF, p-value: 0.212
plot(lm1, which=1)
Examining the linear regression model that treats the relationship between x and y as purely linear, we see a non-significant beta weight on x. Then, when we examine Plot 1 from the assumption graphs, we see a parabolic red line. Anytime you see a pattern like this, it means there is some level of curvilinear relationship between x and y. These residuals provide additional evidence for what we saw in the LOESS plot above: there seems to be a non-linear relationship in the model.
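If you want a formal check to accompany the visual one, the Ramsey RESET test from the lmtest package (a supplementary step added here, not part of the original tutorial) tests whether powers of the fitted values improve the model; a significant result is consistent with an unmodeled non-linear relationship.

library(lmtest) # assumes the lmtest package is installed
# RESET test on the purely linear model; a small p-value suggests missing non-linearity
resettest(lm1, power = 2:3, type = "fitted")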
Now, let's add a polynomial transformation to our model and reexamine Plot 1 to see if that produces better behaved residuals. There are three different approaches you can use to include a polynomial in your model. The first approach is to actually create the squared version of x in your dataset and include it in your model. The other two approaches do not require you to create a new variable in your data, instead relying on specific code in the model formula to do so.
lm_quad <- lm(y ~ x + x2, data=data)
lm_quad2 <- lm(y ~ x + I(x^2), data=data)
lm_quad3 <- lm(y ~ poly(x, 2, raw=TRUE), data=data)
stargazer(lm1, lm_quad, lm_quad2, lm_quad3, digits=3, type="text")
##
## ===================================================================================================================
## Dependent variable:
## -------------------------------------------------------------------------------------------
## y
## (1) (2) (3) (4)
## -------------------------------------------------------------------------------------------------------------------
## x 0.863 0.863 0.863
## (0.689) (0.637) (0.637)
##
## x2 0.727***
## (0.123)
##
## I(x2) 0.727***
## (0.123)
##
## poly(x, 2, raw = TRUE)1 0.863
## (0.637)
##
## poly(x, 2, raw = TRUE)2 0.727***
## (0.123)
##
## Constant 26.175*** 1.712 1.712 1.712
## (3.999) (5.543) (5.543) (5.543)
##
## -------------------------------------------------------------------------------------------------------------------
## Observations 201 201 201 201
## R2 0.008 0.157 0.157 0.157
## Adjusted R2 0.003 0.149 0.149 0.149
## Residual Std. Error 56.697 (df = 199) 52.391 (df = 198) 52.391 (df = 198) 52.391 (df = 198)
## F Statistic 1.568 (df = 1; 199) 18.446*** (df = 2; 198) 18.446*** (df = 2; 198) 18.446*** (df = 2; 198)
## ===================================================================================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
Using stargazer, we compare our 4 models side-by-side. The three models that include the polynomial are exactly the same, as they should be. This simply illustrates that each of the three approaches produces the exact same outcome, so it is up to you to decide which one to use. Examining the actual results, we see that x has a non-significant beta weight in both the purely linear and the polynomial regressions. However, the squared x added to the model is a significant predictor. Also, when we examine the model fit statistics, we clearly see that the polynomial model outperforms the purely linear model on each model fit statistic. This indicates that adding the polynomial term to the regression creates a better performing model overall.
Now, let's reexamine Plot 1 from the assumption plots to see if we removed the curvilinear pattern in the residuals.
plot(lm1, which=1)
plot(lm_quad, which=1)
Here, the residual plot from the polynomial model shows a much flatter red line that does not exhibit the clearly curvilinear pattern we see in the residual plot from the purely linear model. This indicates that including the polynomial term gives us better behaved residuals. Coupling this with the better model fit statistics, we can say that including the polynomial term in the regression better reflects the true relationship between x and y in this data.
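As a quick supplementary check (not part of the original tutorial), note that the purely linear model is nested within the quadratic model, so a nested-model F test in base R can confirm that the squared term significantly improves the fit.

# Compare the nested models; a small p-value favors the quadratic specification
anova(lm1, lm_quad)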
When you have a continuous variable with an extreme right-hand skew, it can be appropriate to use a log transformation to make the variable more normally distributed. Log transformations should only be done on continuous variables with a right-hand skew, and the variable must take strictly positive values since the log of zero or a negative number is undefined.
For this section, we will be working out of our California real-estate data. We start by looking at the histograms of our two continuous variables in this data: price and sqft.
# Specify the URL of the Stata file
url <- "https://github.com/drCES/winter_603/raw/main/California%20Real%20Estate%20Data%20-%202015%20Ahmed.dta"

# Import the data into R
data <- read_dta(url)
ggplot(data, aes(x = price)) +
geom_histogram(aes(y = after_stat(density)), position = "identity", bins = 30, alpha = 0.5, color="blue") +
labs(title = "Histogram Plot of Price",
x = "Price",
y = "Density") +
theme_minimal()
ggplot(data, aes(x = sqft)) +
geom_histogram(aes(y = after_stat(density)), position = "identity", bins = 30, alpha = 0.5, color="blue") +
labs(title = "Histogram Plot of Square Footage",
x = "Square Footage",
y = "Density") +
theme_minimal()
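Before interpreting the histograms, a rough numeric skewness check can back up the visual impression (a sketch using base R only; the skew() helper below is an assumption added for illustration, not part of the original workflow). Values well above zero indicate a right-hand skew.

# Hypothetical helper: sample skewness via the standardized third moment
skew <- function(v) mean((v - mean(v, na.rm = TRUE))^3, na.rm = TRUE) / sd(v, na.rm = TRUE)^3
skew(data$price)
skew(data$sqft)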
For both continuous variables in our real-estate data, we see evidence of an extreme right-hand skew in the untransformed variables. Because of this, it can be appropriate to use the natural logs of both variables in the analysis. We will next create log-transformed versions of each variable and rerun our histograms to see how the transformation has influenced the distributions.
data <- data %>%
  mutate(
    price_ln = log(price),
    sqft_ln = log(sqft)
  )
ggplot(data, aes(x = price_ln)) +
geom_histogram(aes(y = after_stat(density)), position = "identity", bins = 30, alpha = 0.5, color="blue") +
labs(title = "Histogram Plot of Natural Log of Price",
x = "Natural Log of Price",
y = "Density") +
theme_minimal()
ggplot(data, aes(x = sqft_ln)) +
geom_histogram(aes(y = after_stat(density)), position = "identity", bins = 30, alpha = 0.5, color="blue") +
labs(title = "Histogram Plot of ofNatural Log of Square Footage",
x = "Natural Log of Square Footage",
y = "Density") +
theme_minimal()
Once we take the log of both variables, we see much more normally distributed variables compared to the untransformed versions. Now, we proceed with our OLS regressions and examine a variety of models, including the untransformed linear model, the log-linear model, the linear-log model, and the log-log model.
We start with a basic linear model without any transformations and review the assumption plot for normally distributed errors (plot 2, the Normal Q-Q plot).
lm1 <- lm(price ~ sqft + bedrooms + bathrooms, data=data)

plot(lm1, which=2)
residuals <- residuals(lm1)
shapiro.test(residuals)
##
## Shapiro-Wilk normality test
##
## data: residuals
## W = 0.72053, p-value < 2.2e-16
With the Normal Q-Q plot from the basic OLS model, we see substantial deviations at the extreme quantiles, and the Shapiro-Wilk test strongly rejects normality of the residuals. This can cause our standard errors to be inappropriately sized.
log_lin <- lm(price_ln ~ sqft + bedrooms + bathrooms, data=data)

plot(log_lin, which=2)
residuals <- residuals(log_lin)
shapiro.test(residuals)
##
## Shapiro-Wilk normality test
##
## data: residuals
## W = 0.99335, p-value = 0.01838
Examining the log-linear Normal Q-Q plot, we see better performing residuals compared to the linear model, although there are still some deviations from the dotted gray line. Slight deviations like this are not a problem, though. Now, let's examine the linear-log model and review its Normal Q-Q plot.
lin_log <- lm(price ~ sqft_ln + bedrooms + bathrooms, data=data)

plot(lin_log, which=2)
residuals <- residuals(lin_log)
shapiro.test(residuals)
##
## Shapiro-Wilk normality test
##
## data: residuals
## W = 0.71013, p-value < 2.2e-16
In the linear-log model, where we transformed the IV but not the DV, we see similar extreme deviations in the Normal Q-Q plot as we did in the linear model. This indicates that taking the log of only the IV did not “fix” the issues that we had in the linear model.
Now, let’s look at the Normal Q-Q plot for the log-log model, where we are using the log transformations of both the DV and the IV.
log_log <- lm(price_ln ~ sqft_ln + bedrooms + bathrooms, data=data)

plot(log_log, which=2)
residuals <- residuals(log_log)
shapiro.test(residuals)
##
## Shapiro-Wilk normality test
##
## data: residuals
## W = 0.99653, p-value = 0.301
Once we take the natural log of both the IV and the DV, we see only slight deviations in the Normal Q-Q plot, and the Shapiro-Wilk test no longer rejects normality. There are still some minor deviations, but there always will be. This plot shows that the log-log model has the most normally distributed residuals of any of the four models.
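For a compact side-by-side view of these normality checks (a convenience sketch added here, not part of the original tutorial), we can collect the Shapiro-Wilk W statistic and p-value for each model's residuals in a single table.

# Run shapiro.test on each model's residuals and collect W and the p-value
sapply(list(linear = lm1, log_lin = log_lin, lin_log = lin_log, log_log = log_log),
       function(m) unlist(shapiro.test(residuals(m))[c("statistic", "p.value")]))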
Next, we compare each model using the stargazer package to examine model performance.
stargazer(lm1, log_lin, lin_log, log_log, type="text", digits=3)
##
## =====================================================================================
## Dependent variable:
## ------------------------------------------------------
## price price_ln price price_ln
## (1) (2) (3) (4)
## -------------------------------------------------------------------------------------
## sqft 114.927*** 0.0002***
## (23.310) (0.00004)
##
## sqft_ln 357,801.900*** 0.820***
## (66,414.580) (0.104)
##
## bedrooms -13,343.860 0.163*** -29,706.340 0.100**
## (24,914.510) (0.040) (25,710.530) (0.040)
##
## bathrooms 167,924.700*** 0.163*** 150,866.700*** 0.084*
## (30,392.880) (0.049) (31,239.440) (0.049)
##
## Constant -84,869.340 11.519*** -2,449,858.000*** 6.127***
## (59,998.040) (0.096) (434,401.800) (0.679)
##
## -------------------------------------------------------------------------------------
## Observations 535 535 535 535
## R2 0.290 0.372 0.296 0.409
## Adjusted R2 0.286 0.368 0.292 0.406
## Residual Std. Error (df = 531) 430,243.900 0.691 428,428.400 0.670
## F Statistic (df = 3; 531) 72.156*** 104.712*** 74.272*** 122.652***
## =====================================================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
From the stargazer table, we see variations in model performance as well as in the estimated relationships between the x's and y. The two models with the log-transformed DV have higher R2 and Adjusted R2 values than the models with the untransformed DV, as well as higher F-statistics. In this situation, since we have the same predictors, we can use the F-statistic to compare model performance, and it indicates that the transformed models perform better. Comparing models 2 and 4, the log-linear and log-log models, we see that the log-log model has a higher R2, Adjusted R2, and F-statistic and a lower residual standard error. All of these statistics indicate that the log-log model is the better performing model.
Look closely at the relationship between the number of bedrooms and home selling price. In the two models with an untransformed DV, we see a negative, non-significant beta weight. This is one of those outcomes that feels wrong based on what we know about the relationship between bedrooms and home selling price.
Once we transform the DV using the natural log, we see a positive and significant relationship between bedrooms and price. You should not transform a variable simply because you have a strange finding. However, coupling the skewed distributions with the better performing log-log model gives us clear evidence that the distribution of the DV might be at fault for the strange finding on bedrooms. This should lead us to conclude that the more appropriate model is the one where we transform both the DV and the continuous IV in our data.
One issue with using a model with a natural log is the interpretation of the relationship. It is not as simple as in the linear model, where a 1-unit increase in x corresponds to a beta-weight change in y.
Here, we review interpretation of the different models with log transformations starting with the log-linear model.
stargazer(log_lin, digits=3, type="text")
##
## ===============================================
## Dependent variable:
## ---------------------------
## price_ln
## -----------------------------------------------
## sqft 0.0002***
## (0.00004)
##
## bedrooms 0.163***
## (0.040)
##
## bathrooms 0.163***
## (0.049)
##
## Constant 11.519***
## (0.096)
##
## -----------------------------------------------
## Observations 535
## R2 0.372
## Adjusted R2 0.368
## Residual Std. Error 0.691 (df = 531)
## F Statistic 104.712*** (df = 3; 531)
## ===============================================
## Note: *p<0.1; **p<0.05; ***p<0.01
The regression table above shows the results for the log-linear model, which uses a log transformed DV but no transformation of the IVs. With this model, a 1-unit increase in x represents a relative change in y. To understand the relative change, you need to exponentiate the beta weight and subtract 1 from that value.
First, we create a tibble with the beta weights from the log-linear model then exponentiate each beta weight before subtracting 1 from the value. We also calculate the mean home price to then understand the average impact in dollars of a 1-unit increase in each IV, holding everything else in the model constant.
coefficients <- tibble(
  variable = c("sqft", "bedrooms", "bathrooms"),
  beta = c(0.0002, 0.163, 0.163)
)

mean_price <- mean(data$price)

# Calculate the percentage changes
coefficients <- coefficients %>%
  mutate(
    percentage_change = (exp(beta) - 1),
    mean_price = mean_price,
    impact_1unit = percentage_change * mean_price)
print(coefficients)
## # A tibble: 3 × 5
## variable beta percentage_change mean_price impact_1unit
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 sqft 0.0002 0.000200 589363. 118.
## 2 bedrooms 0.163 0.177 589363. 104339.
## 3 bathrooms 0.163 0.177 589363. 104339.
It is just a statistical quirk that the beta weights on bedrooms and bathrooms are the same size, but the interpretation works the same way for each. For square footage, a 1-unit increase - i.e., increasing the size of a home by 1 square foot - is related to an average increase in home selling price of ~$118. For bedrooms and bathrooms, a 1-unit increase - i.e., adding one bedroom or one bathroom - is related to an average increase in home selling price of ~$104,339.
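Rather than typing the beta weights by hand, you can also pull them directly from the fitted model with coef() (a sketch of the same calculation done programmatically; this is an alternative to, not a reproduction of, the code above).

# Extract the log-linear betas, convert to relative changes, then to dollar impacts
betas <- coef(log_lin)[c("sqft", "bedrooms", "bathrooms")]
pct_change <- exp(betas) - 1
round(pct_change * mean(data$price), 0)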
stargazer(lin_log, digits=3, type="text")
##
## ===============================================
## Dependent variable:
## ---------------------------
## price
## -----------------------------------------------
## sqft_ln 357,801.900***
## (66,414.580)
##
## bedrooms -29,706.340
## (25,710.530)
##
## bathrooms 150,866.700***
## (31,239.440)
##
## Constant -2,449,858.000***
## (434,401.800)
##
## -----------------------------------------------
## Observations 535
## R2 0.296
## Adjusted R2 0.292
## Residual Std. Error 428,428.400 (df = 531)
## F Statistic 74.272*** (df = 3; 531)
## ===============================================
## Note: *p<0.1; **p<0.05; ***p<0.01
Here, we review interpreting a model with a logged IV but an untransformed DV: the linear-log model.
For the untransformed IVs, the interpretation is the same as in a regular linear model. A 1-unit increase in x is related to a beta weight change in y. With bedrooms and bathrooms untransformed in this model, we interpret their beta weights in the same way as in a linear model.
However, since the square footage variable is transformed in this model, we need to do additional work to understand how a 1-unit increase influences price, on average. We start by converting the beta weight into the effect of a 1% increase in square footage on price by dividing the beta weight by 100. With this value calculated, we can then do simple math to understand the impact of a 1-unit increase in square footage on price.
sqft_1p_beta <- 357801.900/100
print(sqft_1p_beta)
## [1] 3578.019
sqft_mean <- mean(data$sqft)
print(sqft_mean)
## [1] 2364.905
sqft_1p <- sqft_mean*.01
print(sqft_1p)
## [1] 23.64905
impact_1unit <- sqft_1p_beta/sqft_1p
print(impact_1unit)
## [1] 151.2965
We first see that a 1% increase in square footage, on average and holding everything else in the model constant, is related to a $3,578 increase in price. However, a 1% increase is not the same as a 1-unit increase, so we need to find the average square footage in the data. The average square footage of homes in the data is roughly 2,365, and 1% of that average is 23.65 square feet. The final step is to divide the value we got from dividing the beta by 100 by the value for 1% of the average home square footage.
In this model, we get a value of roughly $151, indicating that a 1-unit increase in square footage, evaluated at the average home size, is related to an increase of that amount in home selling price.
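The same calculation can be done directly from the fitted linear-log model (a sketch assuming the lin_log object from above; this programmatic version is an addition for illustration, not the original code).

# Dollar change for a 1% increase in sqft, then scaled to one square foot at the mean
beta_sqft_ln <- coef(lin_log)["sqft_ln"]
one_pct_effect <- beta_sqft_ln / 100
one_unit_sqft <- mean(data$sqft) * 0.01
one_pct_effect / one_unit_sqft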
The third type of logged model includes a log transformation of the DV and at least one of the IVs. For the IV(s) with a log transformation, the coefficient is interpreted as follows: a 1% increase in the logged IV is related to a beta-weight percent change in the DV. For the untransformed variables - bedrooms and bathrooms in this model - the interpretation is the same as in the log-linear model reviewed previously.
stargazer(log_log, digits=3, type="text")
##
## ===============================================
## Dependent variable:
## ---------------------------
## price_ln
## -----------------------------------------------
## sqft_ln 0.820***
## (0.104)
##
## bedrooms 0.100**
## (0.040)
##
## bathrooms 0.084*
## (0.049)
##
## Constant 6.127***
## (0.679)
##
## -----------------------------------------------
## Observations 535
## R2 0.409
## Adjusted R2 0.406
## Residual Std. Error 0.670 (df = 531)
## F Statistic 122.652*** (df = 3; 531)
## ===============================================
## Note: *p<0.1; **p<0.05; ***p<0.01
Here we see a coefficient of .820 on the log-transformed square footage variable. This indicates that a 1% increase in square footage is related to a .82% increase in home selling price. To convert this into actual dollars, we once again use basic math. We have already calculated the average square footage and home price, so we just need to print them here. Remember that the coefficient is in percentage terms, so to calculate the dollar value correctly you need to divide the beta weight by 100, which means we multiply the mean home price by .0082 and not .82.
print(sqft_1p)
## [1] 23.64905
print(mean_price)
## [1] 589362.8
price_change <- .0082*mean_price
print(price_change)
## [1] 4832.775
impact_1unit <- price_change/sqft_1p
print(impact_1unit)
## [1] 204.3539
Here, we calculate what .82% of the mean selling price is (~$4,833) and then divide that value by the 1% change in square footage. This gives us an estimate of ~$204, indicating that a 1-unit increase in square footage is, on average, related to a $204 increase in home selling price.
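For completeness, here is the same log-log conversion done programmatically from the fitted model (a sketch added for illustration; it mirrors the hand calculation above under the same approximation that a 1% change in x maps to a beta/100 proportional change in y).

# Approximate dollar-per-square-foot effect at the means of price and square footage
elasticity <- coef(log_log)["sqft_ln"]
price_change_1pct <- (elasticity / 100) * mean(data$price)
sqft_change_1pct <- mean(data$sqft) * 0.01
price_change_1pct / sqft_change_1pct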