This report investigates the relationship between firm size (Total Assets) and R&D expenditure using simulated data for 200 firms. The analysis follows four stages:
| Step | Task |
|---|---|
| ① Visualize | Scatter plot of Total Assets vs R&D Expenditure |
| ② Diagnose | Raw OLS regression + residual diagnostics |
| ③ Transform | Box-Cox optimal λ search |
| ④ Refine | Re-run model on transformed variable & compare |
set.seed(42)
n <- 200
firm_size <- runif(n, 10, 500) # Total Assets ($ millions)
error <- rnorm(n, mean = 0, sd = 0.5)
rd_expenditure <- exp(1.5 + 0.6 * log(firm_size) + error)
df_firms <- data.frame(
Firm_ID = 1:n,
Total_Assets = firm_size,
RD_Expenditure = rd_expenditure
)
head(df_firms)| Firm_ID | Total_Assets | RD_Expenditure |
|---|---|---|
| 1 | 458.2550 | 322.76939 |
| 2 | 469.1670 | 302.76313 |
| 3 | 150.2084 | 54.90529 |
| 4 | 416.9193 | 421.56611 |
| 5 | 324.4553 | 103.12089 |
| 6 | 264.3570 | 134.17397 |
#> Total_Assets RD_Expenditure
#> Min. : 10.12 Min. : 10.52
#> 1st Qu.:136.57 1st Qu.: 76.05
#> Median :282.15 Median :116.25
#> Mean :265.89 Mean :133.81
#> 3rd Qu.:382.78 3rd Qu.:164.97
#> Max. :494.56 Max. :606.82
Data-generating process: The true model is \(\ln(\text{RD}) = 1.5 + 0.6\,\ln(\text{Assets}) + \varepsilon\), so the correct transformation is a log (Box-Cox λ = 0).
par(mar = c(4.5, 4.5, 3, 1))
plot(
df_firms$Total_Assets, df_firms$RD_Expenditure,
pch = 16,
col = adjustcolor("steelblue", alpha.f = 0.55),
xlab = "Total Assets ($ millions)",
ylab = "R&D Expenditure ($ millions)",
main = "Total Assets vs R&D Expenditure"
)
# Overlay power-law trend
x_seq <- seq(min(firm_size), max(firm_size), length.out = 300)
b_init <- coef(lm(log(RD_Expenditure) ~ log(Total_Assets), data = df_firms))
lines(x_seq, exp(b_init[1] + b_init[2] * log(x_seq)),
col = "firebrick", lwd = 2.2)
legend("topleft", legend = "Power-law trend", col = "firebrick",
lwd = 2, bty = "n")Figure 1 — Raw relationship between Total Assets and R&D Expenditure. The fan-shaped spread signals heteroscedasticity; the curvature signals non-linearity.
Key observations:
#>
#> Call:
#> lm(formula = RD_Expenditure ~ Total_Assets, data = df_firms)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -135.79 -42.06 -12.37 25.08 404.97
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 40.50788 11.25914 3.598 0.000405 ***
#> Total_Assets 0.35091 0.03731 9.405 < 2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 75.31 on 198 degrees of freedom
#> Multiple R-squared: 0.3088, Adjusted R-squared: 0.3053
#> F-statistic: 88.46 on 1 and 198 DF, p-value: < 2.2e-16
Figure 2 — Diagnostic plots for raw OLS. The funnel in the Residuals vs Fitted panel and the curved Q-Q plot indicate violations of OLS assumptions.
#>
#> Shapiro-Wilk normality test
#>
#> data: resid(model_raw)
#> W = 0.87801, p-value = 1.231e-11
#> Non-constant Variance Score Test
#> Variance formula: ~ fitted.values
#> Chisquare = 48.37989, Df = 1, p = 3.5115e-12
Verdict:
| Test | Statistic | p-value | Result |
|---|---|---|---|
| Shapiro-Wilk | 0.878 | 1.23e-11 | ❌ Non-normal residuals |
| Breusch-Pagan | — | < 0.05 | ❌ Heteroscedasticity present |
The raw OLS model fails both checks — transformation is necessary.
par(mar = c(4.5, 4.5, 3.5, 1))
bc <- boxcox(model_raw, lambda = seq(-2, 2, by = 0.05), plotit = TRUE)
title(main = "Box-Cox Log-Likelihood Profile", line = 0.5)
abline(v = 0, col = "forestgreen", lwd = 1.5, lty = 2)
legend("topright",
legend = c("95% CI", "λ = 0 (log)"),
col = c("black", "forestgreen"),
lty = c(1, 2), lwd = 1.5, bty = "n")Figure 3 — Box-Cox log-likelihood profile. The peak near λ = 0 confirms that a log transformation of the response is optimal.
#> Optimal Box-Cox lambda: 0.1818
The optimal λ ≈ 0.182, which is close to 0. By convention, λ = 0 corresponds to the natural log transformation, so we apply \(y^* = \ln(\text{RD\_Expenditure})\) and \(x^* = \ln(\text{Total\_Assets})\).
#>
#> Call:
#> lm(formula = log_RD ~ log_Size, data = df_firms)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -1.31707 -0.31284 -0.00069 0.30462 1.38376
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 1.4356 0.2100 6.837 9.82e-11 ***
#> log_Size 0.6075 0.0389 15.616 < 2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.4815 on 198 degrees of freedom
#> Multiple R-squared: 0.5519, Adjusted R-squared: 0.5496
#> F-statistic: 243.9 on 1 and 198 DF, p-value: < 2.2e-16
Figure 4 — Diagnostic plots for log-transformed OLS. Residuals are now randomly scattered and closely follow the normal line.
#>
#> Shapiro-Wilk normality test
#>
#> data: resid(model_log)
#> W = 0.99476, p-value = 0.7145
| Metric | Raw OLS | Log-Log OLS |
|---|---|---|
| Intercept (β₀) | 40.5079 | 1.4356 |
| Slope (β₁) | 0.3509 | 0.6075 |
| R² | 0.3088 | 0.5519 |
| Adj. R² | 0.3053 | 0.5496 |
| Shapiro-Wilk p-value | 1.23e-11 | 0.7145 |
| Optimal Box-Cox λ | — | 0.1818 |
The log-log model can be written as:
\[\ln(\hat{\text{RD}}) = 1.4356 + 0.6075 \cdot \ln(\text{Total\_Assets})\]
or equivalently on the original scale:
\[\hat{\text{RD}} = e^{1.4356} \cdot \text{Total\_Assets}^{0.6075}\]
Key takeaways:
Elasticity: A 1% increase in Total Assets is associated with a 0.6075% increase in R&D Expenditure (since β₁ < 1, R&D scales sub-proportionally with firm size).
Fit improves substantially: R² rises from 0.309 to 0.552.
Normality restored: Shapiro-Wilk p-value goes from 1.2e-11 (raw) to 0.715 (log) — residuals are now consistent with normality.
Homoscedasticity restored: The Breusch-Pagan test no longer rejects constant variance after the log transformation.
Box-Cox confirms λ ≈ 0: The optimal λ = 0.182 is very close to 0, validating the log transform choice.
#> R version 4.5.2 (2025-10-31 ucrt)
#> Platform: x86_64-w64-mingw32/x64
#> Running under: Windows 10 x64 (build 19045)
#>
#> Matrix products: default
#> LAPACK version 3.12.1
#>
#> locale:
#> [1] LC_COLLATE=English_United States.utf8
#> [2] LC_CTYPE=English_United States.utf8
#> [3] LC_MONETARY=English_United States.utf8
#> [4] LC_NUMERIC=C
#> [5] LC_TIME=English_United States.utf8
#>
#> time zone: Asia/Taipei
#> tzcode source: internal
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] car_3.1-5 carData_3.0-6 MASS_7.3-65
#>
#> loaded via a namespace (and not attached):
#> [1] digest_0.6.39 Formula_1.2-5 R6_2.6.1 fastmap_1.2.0
#> [5] xfun_0.56 cachem_1.1.0 abind_1.4-8 knitr_1.51
#> [9] htmltools_0.5.9 rmarkdown_2.30 lifecycle_1.0.5 cli_3.6.5
#> [13] sass_0.4.10 jquerylib_0.1.4 compiler_4.5.2 rstudioapi_0.18.0
#> [17] tools_4.5.2 evaluate_1.0.5 bslib_0.10.0 yaml_2.3.12
#> [21] otel_0.2.0 jsonlite_2.0.0 rlang_1.1.7