1 Data Simulation

This section generates synthetic data representing firm size and R&D expenditure. The data follow a log-linear relationship with multiplicative errors, so on the original scale the relationship is non-linear and heteroscedastic: larger firms tend to have more variability in R&D spending.

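set.seed(42)  # fix the seed so the simulated data are reproducible

n <- 200
firm_size <- runif(n, 10, 500)  # firm size, uniform on [10, 500]

# Log-linear data-generating process with normal errors on the log scale
error <- rnorm(n, mean = 0, sd = 0.5)
rd_expenditure <- exp(1.5 + 0.6 * log(firm_size) + error)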

df_firms <- data.frame(
  Firm_ID = 1:n,
  Total_Assets = firm_size,
  RD_Expenditure = rd_expenditure
)

head(df_firms)
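
Written out, the simulation code above corresponds to the following model. The symbols are introduced here for illustration: $x_i$ is firm size, $y_i$ is R&D expenditure, and $\varepsilon_i$ is the simulated error. The last two lines use the standard mean and variance of a log-normal variable.

```latex
\begin{aligned}
\log y_i &= 1.5 + 0.6 \log x_i + \varepsilon_i, \qquad \varepsilon_i \sim N(0,\ 0.5^2), \\
y_i &= e^{1.5}\, x_i^{0.6}\, e^{\varepsilon_i}, \\
\mathrm{E}[y_i \mid x_i] &= e^{1.5 + 0.5^2/2}\, x_i^{0.6}, \\
\mathrm{SD}[y_i \mid x_i] &= \mathrm{E}[y_i \mid x_i]\,\sqrt{e^{0.5^2} - 1} \approx 0.53\, \mathrm{E}[y_i \mid x_i].
\end{aligned}
```

Because the error enters multiplicatively, the standard deviation of R&D expenditure is proportional to its conditional mean. This is why larger firms show more spread on the original scale even though the errors are homoscedastic on the log scale.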

2 Visualization

The scatter plot below illustrates the relationship between Total Assets and R&D Expenditure.

plot(df_firms$Total_Assets, df_firms$RD_Expenditure,
     xlab="Total Assets",
     ylab="R&D Expenditure",
     main="Total Assets vs R&D Expenditure",
     pch=19)

The plot shows a non-linear (curved) relationship between firm size and R&D expenditure. As Total Assets increase, R&D expenditure rises, but at a decreasing rate, consistent with the simulated elasticity of 0.6 (below 1). Additionally, the spread of the data widens for larger firms, indicating heteroscedasticity: the error variance is not constant. This suggests that a simple linear model may not be appropriate and that a transformation could improve the fit.
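
The widening spread described above can also be checked numerically. The sketch below is a supplementary check, not part of the original analysis. It re-creates the simulated data, splits firms into four equal-sized groups by Total Assets, and reports the mean and standard deviation of R&D expenditure in each group. The names `size_group` and `spread_by_size` are introduced here for illustration.

```r
# Re-create the simulated data so the snippet runs on its own
set.seed(42)
n <- 200
firm_size <- runif(n, 10, 500)
error <- rnorm(n, mean = 0, sd = 0.5)
df_firms <- data.frame(
  Total_Assets = firm_size,
  RD_Expenditure = exp(1.5 + 0.6 * log(firm_size) + error)
)

# Split firms into four equal-sized groups by Total_Assets (Q1 = smallest)
breaks <- quantile(df_firms$Total_Assets, probs = seq(0, 1, by = 0.25))
size_group <- cut(df_firms$Total_Assets, breaks = breaks,
                  include.lowest = TRUE, labels = paste0("Q", 1:4))

# Mean and standard deviation of R&D expenditure within each group
spread_by_size <- data.frame(
  group   = levels(size_group),
  mean_rd = as.numeric(tapply(df_firms$RD_Expenditure, size_group, mean)),
  sd_rd   = as.numeric(tapply(df_firms$RD_Expenditure, size_group, sd))
)
print(spread_by_size)
```

If the standard deviation rises from Q1 to Q4 along with the mean, the table supports the visual impression of heteroscedasticity.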

3 OLS Regression

We begin by estimating a simple linear regression model.

model_ols <- lm(RD_Expenditure ~ Total_Assets, data=df_firms)
summary(model_ols)
## 
## Call:
## lm(formula = RD_Expenditure ~ Total_Assets, data = df_firms)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -135.79  -42.06  -12.37   25.08  404.97 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  40.50788   11.25914   3.598 0.000405 ***
## Total_Assets  0.35091    0.03731   9.405  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 75.31 on 198 degrees of freedom
## Multiple R-squared:  0.3088, Adjusted R-squared:  0.3053 
## F-statistic: 88.46 on 1 and 198 DF,  p-value: < 2.2e-16

The OLS results provide a baseline model. However, due to the apparent non-linearity and heteroscedasticity observed earlier, we need to evaluate whether the assumptions of the linear regression model are satisfied.

4 Diagnostic Analysis

4.1 Normality Check

hist(residuals(model_ols),
     main="Histogram of Residuals",
     xlab="Residuals")

qqnorm(residuals(model_ols))
qqline(residuals(model_ols))

4.2 Homoscedasticity Check

plot(fitted(model_ols), residuals(model_ols),
     xlab="Fitted Values",
     ylab="Residuals",
     main="Residuals vs Fitted",
     pch=19)
abline(h=0, col="red")

The diagnostic plots reveal several issues:

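  • The histogram and QQ plot suggest that the residuals are not normally distributed: they are right-skewed, as the residual summary also indicates (maximum 404.97 against a minimum of -135.79).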
  • The residuals vs fitted plot shows a fan-shaped pattern, indicating heteroscedasticity.

These violations mean that the standard errors and p-values from the OLS model may be unreliable, so a transformation is warranted.
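
The visual diagnostics above can be supplemented with formal tests. The sketch below is an addition to the original analysis and uses only base R. It runs `shapiro.test()` on the residuals and computes a studentized Breusch-Pagan statistic by hand, as n times the R-squared of a regression of the squared residuals on the predictor. The object names (`sw`, `aux`, `bp_stat`, `bp_p`) are illustrative.

```r
# Re-create the data and the OLS fit so the snippet runs on its own
set.seed(42)
n <- 200
firm_size <- runif(n, 10, 500)
error <- rnorm(n, mean = 0, sd = 0.5)
df_firms <- data.frame(
  Total_Assets = firm_size,
  RD_Expenditure = exp(1.5 + 0.6 * log(firm_size) + error)
)
model_ols <- lm(RD_Expenditure ~ Total_Assets, data = df_firms)

# Normality: Shapiro-Wilk test on the residuals (null hypothesis: normal)
sw <- shapiro.test(residuals(model_ols))

# Heteroscedasticity: regress the squared residuals on the predictor.
# Under constant variance, n * R^2 of this auxiliary regression is
# approximately chi-squared with 1 degree of freedom.
sq_resid <- residuals(model_ols)^2
aux <- lm(sq_resid ~ Total_Assets, data = df_firms)
bp_stat <- n * summary(aux)$r.squared
bp_p <- pchisq(bp_stat, df = 1, lower.tail = FALSE)

print(c(shapiro_p = sw$p.value, bp_statistic = bp_stat, bp_p = bp_p))
```

Small p-values reject normality and constant variance, respectively. The hand-computed statistic is meant as a transparent approximation; a packaged implementation could be used instead if one is available.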

5 Box-Cox Transformation

To identify the appropriate transformation, we use the Box-Cox method.

library(MASS)
boxcox(model_ols)

The Box-Cox plot shows the profile log-likelihood of the power parameter λ (lambda) applied to the response; the optimal λ is the value that maximizes this curve.

If λ is approximately 0, a log transformation of the response is appropriate. This is consistent with the data generation process, which follows a log-linear relationship.
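
Reading λ off the plot can be imprecise, so it may help to extract it numerically. The sketch below is an illustrative addition. It calls `boxcox()` with `plotit = FALSE` on a finer grid, takes the λ with the highest log-likelihood, and forms an approximate 95% interval from the usual chi-squared cutoff. The names `bc`, `lambda_hat`, and `lambda_ci` are introduced here.

```r
library(MASS)  # provides boxcox()

# Re-create the data and the OLS fit so the snippet runs on its own
set.seed(42)
n <- 200
firm_size <- runif(n, 10, 500)
error <- rnorm(n, mean = 0, sd = 0.5)
df_firms <- data.frame(
  Total_Assets = firm_size,
  RD_Expenditure = exp(1.5 + 0.6 * log(firm_size) + error)
)
model_ols <- lm(RD_Expenditure ~ Total_Assets, data = df_firms)

# Evaluate the profile log-likelihood on a fine grid without drawing the plot
bc <- boxcox(model_ols, lambda = seq(-2, 2, by = 0.01), plotit = FALSE)

# Point estimate: the lambda with the highest log-likelihood
lambda_hat <- bc$x[which.max(bc$y)]

# Approximate 95% interval: lambdas whose log-likelihood lies within
# qchisq(0.95, 1) / 2 of the maximum
lambda_ci <- range(bc$x[bc$y > max(bc$y) - qchisq(0.95, df = 1) / 2])

print(lambda_hat)
print(lambda_ci)
```

An interval that lies near 0 and excludes 1 supports a log transformation of the response over leaving it untransformed.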

6 Transformed Model

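Based on the Box-Cox results, we log-transform the response. Because the Box-Cox method addresses only the response, we also log-transform Total Assets, which matches the log-linear data-generating process.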

model_log <- lm(log(RD_Expenditure) ~ log(Total_Assets), data=df_firms)
summary(model_log)
## 
## Call:
## lm(formula = log(RD_Expenditure) ~ log(Total_Assets), data = df_firms)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.31707 -0.31284 -0.00069  0.30462  1.38376 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         1.4356     0.2100   6.837 9.82e-11 ***
## log(Total_Assets)   0.6075     0.0389  15.616  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4815 on 198 degrees of freedom
## Multiple R-squared:  0.5519, Adjusted R-squared:  0.5496 
## F-statistic: 243.9 on 1 and 198 DF,  p-value: < 2.2e-16
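
The slope of a log-log regression can be read as an elasticity. A short derivation follows, using notation assumed here rather than taken from the output: $y$ is R&D expenditure, $x$ is Total Assets, and $\beta_0$, $\beta_1$ are the intercept and slope.

```latex
\begin{aligned}
\log y &= \beta_0 + \beta_1 \log x + \varepsilon
  \quad\Longleftrightarrow\quad y = e^{\beta_0}\, x^{\beta_1}\, e^{\varepsilon}, \\
\beta_1 &= \frac{\partial \log y}{\partial \log x}, \\
\hat{\beta}_1 &= 0.6075: \quad 1.01^{0.6075} \approx 1.0061 .
\end{aligned}
```

With the estimate above, a 1% increase in Total Assets is associated with roughly a 0.61% increase in R&D expenditure.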

6.1 Diagnostics After Transformation

hist(residuals(model_log),
     main="Histogram of Residuals (Log Model)",
     xlab="Residuals")

qqnorm(residuals(model_log))
qqline(residuals(model_log))

plot(fitted(model_log), residuals(model_log),
     xlab="Fitted Values",
     ylab="Residuals",
     main="Residuals vs Fitted (Log Model)",
     pch=19)
abline(h=0, col="red")

After applying the log transformation:

  • The residuals appear more normally distributed.
  • The heteroscedasticity problem is substantially reduced.
  • The residual plot shows a more random pattern.

This indicates that the transformed model better satisfies the assumptions of linear regression.
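
The improvement described above can be put side by side numerically. The sketch below is a supplementary check under stated assumptions, not the original author's procedure. The helper `residual_check()` is hypothetical: for each model it reports the Shapiro-Wilk p-value and the ratio of residual standard deviations between the upper and lower halves of the fitted values.

```r
# Re-create the data and both fits so the snippet runs on its own
set.seed(42)
n <- 200
firm_size <- runif(n, 10, 500)
error <- rnorm(n, mean = 0, sd = 0.5)
df_firms <- data.frame(
  Total_Assets = firm_size,
  RD_Expenditure = exp(1.5 + 0.6 * log(firm_size) + error)
)
model_ols <- lm(RD_Expenditure ~ Total_Assets, data = df_firms)
model_log <- lm(log(RD_Expenditure) ~ log(Total_Assets), data = df_firms)

# Hypothetical helper: summarize residual behavior for a fitted lm object
residual_check <- function(model) {
  res <- residuals(model)
  upper_half <- fitted(model) > median(fitted(model))
  c(
    # small p-values reject normality of the residuals
    shapiro_p = shapiro.test(res)$p.value,
    # a ratio near 1 is consistent with constant variance
    sd_ratio = sd(res[upper_half]) / sd(res[!upper_half])
  )
}

diagnostics <- rbind(ols = residual_check(model_ols),
                     log_log = residual_check(model_log))
print(diagnostics)
```

A larger Shapiro-Wilk p-value and a ratio closer to 1 for the log-log model would agree with the plots. The split at the median is an arbitrary choice made for simplicity.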

7 Model Comparison

The two fits reported in Sections 3 and 6 are summarized below.

                         Linear (model_ols)   Log-log (model_log)
Intercept                40.50788             1.4356
Slope                    0.35091              0.6075
Slope std. error         0.03731              0.0389
Residual standard error  75.31                0.4815
Multiple R-squared       0.3088               0.5519
F-statistic (1, 198 DF)  88.46                243.9

Because the responses are on different scales (levels versus logs), the residual standard errors and R-squared values are not directly comparable; the diagnostics in Sections 4 and 6.1 carry most of the weight.

The comparison shows that the transformed model provides:

  • Better residual behavior
  • More reliable statistical inference
  • A model that better reflects the underlying data structure

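In particular, the log-log model captures the multiplicative relationship between firm size and R&D expenditure: the estimated slope of 0.6075 and intercept of 1.4356 are close to the true values of 0.6 and 1.5 used in the simulation.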
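
Since the data are simulated, the true coefficients are known, and one way to judge the log-log model is to check whether its confidence intervals cover them. The sketch below is an illustrative addition; the names `true_values`, `ci`, and `recovery` are assumptions. It compares the 95% intervals from `confint()` with the intercept of 1.5 and slope of 0.6 used in the simulation.

```r
# Re-create the data and the log-log fit so the snippet runs on its own
set.seed(42)
n <- 200
firm_size <- runif(n, 10, 500)
error <- rnorm(n, mean = 0, sd = 0.5)
df_firms <- data.frame(
  Total_Assets = firm_size,
  RD_Expenditure = exp(1.5 + 0.6 * log(firm_size) + error)
)
model_log <- lm(log(RD_Expenditure) ~ log(Total_Assets), data = df_firms)

# True values used in the simulation: intercept 1.5, slope (elasticity) 0.6
true_values <- c(1.5, 0.6)

# 95% confidence intervals for the log-log coefficients
ci <- confint(model_log, level = 0.95)

# One row per coefficient: truth, estimate, interval, and coverage flag
recovery <- data.frame(
  term     = rownames(ci),
  true     = true_values,
  estimate = unname(coef(model_log)),
  lower    = ci[, 1],
  upper    = ci[, 2],
  covered  = true_values >= ci[, 1] & true_values <= ci[, 2]
)
print(recovery, row.names = FALSE)
```

Given the estimates and standard errors reported in Section 6, both intervals should contain the true values. This kind of check is only possible with simulated data.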

8 Conclusion

This analysis demonstrates the importance of checking model assumptions in regression analysis. The initial OLS model violated key assumptions, including normality and homoscedasticity of the errors. The Box-Cox method pointed to a log transformation of the response, and a log-log specification proved more appropriate.

The transformed model improves the reliability of the inference and better represents the economic relationship between firm size and R&D expenditure. This highlights the value of transformation techniques in empirical modeling.