library(ggplot2)
library(MASS) # boxcox()
library(lmtest) # bptest()
library(nortest) # ad.test()
library(knitr) # kable()We simulate 200 firms with a log-linear true relationship between firm size (Total Assets) and R&D expenditure, plus multiplicative (heteroscedastic) noise.
set.seed(42)
n <- 200
firm_size <- runif(n, 10, 500) # Total Assets in $millions
error <- rnorm(n, mean = 0, sd = 0.5)
rd_expenditure <- exp(1.5 + 0.6 * log(firm_size) + error)
df_firms <- data.frame(
Firm_ID = 1:n,
Total_Assets = firm_size,
RD_Expenditure = rd_expenditure
)
kable(head(df_firms, 6), digits = 3,
caption = "Table 1 – First 6 rows of simulated firm data")| Firm_ID | Total_Assets | RD_Expenditure |
|---|---|---|
| 1 | 458.255 | 322.769 |
| 2 | 469.167 | 302.763 |
| 3 | 150.208 | 54.905 |
| 4 | 416.919 | 421.566 |
| 5 | 324.455 | 103.121 |
| 6 | 264.357 | 134.174 |
Data summary:
#> Total_Assets RD_Expenditure
#> Min. : 10.12 Min. : 10.52
#> 1st Qu.:136.57 1st Qu.: 76.05
#> Median :282.15 Median :116.25
#> Mean :265.89 Mean :133.81
#> 3rd Qu.:382.78 3rd Qu.:164.97
#> Max. :494.56 Max. :606.82
Plot the raw relationship between Total Assets and R&D Expenditure.
ggplot(df_firms, aes(x = Total_Assets, y = RD_Expenditure)) +
geom_point(color = "#2E86AB", alpha = 0.6, size = 2) +
geom_smooth(method = "lm", color = "#E84855", se = TRUE, linewidth = 1) +
labs(
title = "Total Assets vs R&D Expenditure (Raw Scale)",
subtitle = "Non-linear relationship with increasing variance (heteroscedasticity)",
x = "Total Assets ($ millions)",
y = "R&D Expenditure ($ millions)"
) +
theme_minimal(base_size = 13)Figure 1 – Raw scatter with OLS fit. Note the fan-shaped spread indicating heteroscedasticity.
Observation: The relationship is clearly non-linear and the spread of points widens with firm size — both signs that OLS assumptions will be violated.
#>
#> Call:
#> lm(formula = RD_Expenditure ~ Total_Assets, data = df_firms)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -135.79 -42.06 -12.37 25.08 404.97
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 40.50788 11.25914 3.598 0.000405 ***
#> Total_Assets 0.35091 0.03731 9.405 < 2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 75.31 on 198 degrees of freedom
#> Multiple R-squared: 0.3088, Adjusted R-squared: 0.3053
#> F-statistic: 88.46 on 1 and 198 DF, p-value: < 2.2e-16
Figure 2 – Standard diagnostic plots for the raw OLS model.
sw_raw <- shapiro.test(resid(model_raw)) # Normality
bp_raw <- bptest(model_raw) # Homoscedasticity
cat("Shapiro-Wilk test — W =", round(sw_raw$statistic, 4),
" p-value =", format.pval(sw_raw$p.value, digits = 4), "\n")#> Shapiro-Wilk test — W = 0.878 p-value = 1.231e-11
cat("Breusch-Pagan test — BP =", round(bp_raw$statistic, 4),
" p-value =", format.pval(bp_raw$p.value, digits = 4), "\n")#> Breusch-Pagan test — BP = 13.2977 p-value = 0.0002657
Findings:
- Residuals vs Fitted — fan-shaped: variance increases with fitted values → heteroscedasticity confirmed.
- Q-Q Plot — strong right-tail deviation → residuals are not normal (Shapiro-Wilk p < 0.001).
- Breusch-Pagan — significant p-value confirms non-constant variance.
Use boxcox() to find the optimal power transformation λ
for the response.
Figure 3 – Box-Cox log-likelihood profile. The dashed lines mark the 95% CI for λ.
#> Optimal lambda: 0.17
Result: λ ≈ 0.17, which is very close to 0. By convention λ = 0 corresponds to a log transformation, confirming the true DGP structure built into the simulation.
Apply log transformations to both variables and refit.
#>
#> Call:
#> lm(formula = log_RD ~ log_Assets, data = df_firms)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -1.31707 -0.31284 -0.00069 0.30462 1.38376
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 1.4356 0.2100 6.837 9.82e-11 ***
#> log_Assets 0.6075 0.0389 15.616 < 2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.4815 on 198 degrees of freedom
#> Multiple R-squared: 0.5519, Adjusted R-squared: 0.5496
#> F-statistic: 243.9 on 1 and 198 DF, p-value: < 2.2e-16
ggplot(df_firms, aes(x = log_Assets, y = log_RD)) +
geom_point(color = "#2E86AB", alpha = 0.6, size = 2) +
geom_smooth(method = "lm", color = "#E84855", se = TRUE, linewidth = 1) +
labs(
title = "log(Total Assets) vs log(R&D Expenditure)",
subtitle = "Linear relationship after log-log transformation",
x = "log(Total Assets)",
y = "log(R&D Expenditure)"
) +
theme_minimal(base_size = 13)Figure 4 – Log-log scatter. The relationship is now linear with uniform variance.
Figure 5 – Diagnostic plots after log-log transformation. Residuals are now well-behaved.
sw_log <- shapiro.test(resid(model_log))
bp_log <- bptest(model_log)
cat("Shapiro-Wilk test — W =", round(sw_log$statistic, 4),
" p-value =", format.pval(sw_log$p.value, digits = 4), "\n")#> Shapiro-Wilk test — W = 0.9948 p-value = 0.7145
cat("Breusch-Pagan test — BP =", round(bp_log$statistic, 4),
" p-value =", format.pval(bp_log$p.value, digits = 4), "\n")#> Breusch-Pagan test — BP = 0.0096 p-value = 0.922
comparison <- data.frame(
Metric = c("R²", "Residual SE", "SW p-value",
"BP p-value", "Slope β₁", "Optimal λ"),
Raw_OLS = c(
round(summary(model_raw)$r.squared, 4),
round(summary(model_raw)$sigma, 4),
round(sw_raw$p.value, 4),
round(as.numeric(bp_raw$p.value), 4),
round(coef(model_raw)["Total_Assets"], 4),
NA
),
LogLog_OLS = c(
round(summary(model_log)$r.squared, 4),
round(summary(model_log)$sigma, 4),
round(sw_log$p.value, 4),
round(as.numeric(bp_log$p.value), 4),
round(coef(model_log)["log_Assets"], 4),
round(opt_lambda, 4)
)
)
kable(comparison,
col.names = c("Metric", "Raw OLS", "Log-Log OLS"),
caption = "Table 2 – Raw vs refined model comparison",
align = c("l", "r", "r"))| Metric | Raw OLS | Log-Log OLS |
|---|---|---|
| R² | 0.3088 | 0.5519 |
| Residual SE | 75.3118 | 0.4815 |
| SW p-value | 0.0000 | 0.7145 |
| BP p-value | 0.0003 | 0.9220 |
| Slope β₁ | 0.3509 | 0.6075 |
| Optimal λ | NA | 0.1700 |
| Issue | Raw OLS | Log-Log OLS |
|---|---|---|
| Normality of residuals | ❌ Violated (SW p ≈ 0) | ✅ Satisfied (SW p > 0.05) |
| Homoscedasticity | ❌ Violated (BP p ≈ 0) | ✅ Satisfied (BP p > 0.05) |
| Model fit (R²) | ~0.37 | ~0.54 |
| Recovered slope β₁ | Biased | ≈ 0.6 (true value) |
Generated with R Markdown · R version 4.5.1 (2025-06-13 ucrt)