1 Overview

This report investigates the relationship between firm size (Total Assets) and R&D expenditure using simulated data for 200 firms. The analysis follows four stages:

Step Task
① Visualize Scatter plot of Total Assets vs R&D Expenditure
② Diagnose Raw OLS regression + residual diagnostics
③ Transform Box-Cox optimal λ search
④ Refine Re-run model on transformed variable & compare

2 Setup

2.1 Libraries

library(MASS)   # boxcox()
library(car)    # ncvTest() — Breusch-Pagan test

2.2 Data Generation

set.seed(42)
n <- 200

firm_size      <- runif(n, 10, 500)              # Total Assets ($ millions)
error          <- rnorm(n, mean = 0, sd = 0.5)
rd_expenditure <- exp(1.5 + 0.6 * log(firm_size) + error)

df_firms <- data.frame(
  Firm_ID        = 1:n,
  Total_Assets   = firm_size,
  RD_Expenditure = rd_expenditure
)

head(df_firms)
Firm_ID Total_Assets RD_Expenditure
1 458.2550 322.76939
2 469.1670 302.76313
3 150.2084 54.90529
4 416.9193 421.56611
5 324.4553 103.12089
6 264.3570 134.17397
summary(df_firms[, c("Total_Assets", "RD_Expenditure")])
#>   Total_Assets    RD_Expenditure  
#>  Min.   : 10.12   Min.   : 10.52  
#>  1st Qu.:136.57   1st Qu.: 76.05  
#>  Median :282.15   Median :116.25  
#>  Mean   :265.89   Mean   :133.81  
#>  3rd Qu.:382.78   3rd Qu.:164.97  
#>  Max.   :494.56   Max.   :606.82

Data-generating process: The true model is \(\ln(\text{RD}) = 1.5 + 0.6\,\ln(\text{Assets}) + \varepsilon\), so the correct transformation is a log (Box-Cox λ = 0).


3 ① Visualize

par(mar = c(4.5, 4.5, 3, 1))

plot(
  df_firms$Total_Assets, df_firms$RD_Expenditure,
  pch  = 16,
  col  = adjustcolor("steelblue", alpha.f = 0.55),
  xlab = "Total Assets ($ millions)",
  ylab = "R&D Expenditure ($ millions)",
  main = "Total Assets vs R&D Expenditure"
)

# Overlay power-law trend
x_seq  <- seq(min(firm_size), max(firm_size), length.out = 300)
b_init <- coef(lm(log(RD_Expenditure) ~ log(Total_Assets), data = df_firms))
lines(x_seq, exp(b_init[1] + b_init[2] * log(x_seq)),
      col = "firebrick", lwd = 2.2)
legend("topleft", legend = "Power-law trend", col = "firebrick",
       lwd = 2, bty = "n")
Figure 1 — Raw relationship between Total Assets and R&D Expenditure. The fan-shaped spread signals heteroscedasticity; the curvature signals non-linearity.

Figure 1 — Raw relationship between Total Assets and R&D Expenditure. The fan-shaped spread signals heteroscedasticity; the curvature signals non-linearity.

Key observations:

  • The relationship is clearly non-linear (concave power-law curve).
  • The variance increases with firm size — a signature of heteroscedasticity.
  • A linear OLS model on the raw scale will violate both normality and constant-variance assumptions.

4 ② Diagnose — Raw OLS

4.1 Fit the model

model_raw <- lm(RD_Expenditure ~ Total_Assets, data = df_firms)
summary(model_raw)
#> 
#> Call:
#> lm(formula = RD_Expenditure ~ Total_Assets, data = df_firms)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -135.79  -42.06  -12.37   25.08  404.97 
#> 
#> Coefficients:
#>              Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)  40.50788   11.25914   3.598 0.000405 ***
#> Total_Assets  0.35091    0.03731   9.405  < 2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 75.31 on 198 degrees of freedom
#> Multiple R-squared:  0.3088, Adjusted R-squared:  0.3053 
#> F-statistic: 88.46 on 1 and 198 DF,  p-value: < 2.2e-16

4.2 Residual diagnostic plots

par(mfrow = c(2, 2), mar = c(4, 4, 3, 1))
plot(model_raw, which = 1:4)
Figure 2 — Diagnostic plots for raw OLS. The funnel in the Residuals vs Fitted panel and the curved Q-Q plot indicate violations of OLS assumptions.

Figure 2 — Diagnostic plots for raw OLS. The funnel in the Residuals vs Fitted panel and the curved Q-Q plot indicate violations of OLS assumptions.

par(mfrow = c(1, 1))

4.3 Formal tests

4.3.1 Normality — Shapiro-Wilk

sw_raw <- shapiro.test(resid(model_raw))
sw_raw
#> 
#>  Shapiro-Wilk normality test
#> 
#> data:  resid(model_raw)
#> W = 0.87801, p-value = 1.231e-11

4.3.2 Homoscedasticity — Breusch-Pagan

ncvTest(model_raw)
#> Non-constant Variance Score Test 
#> Variance formula: ~ fitted.values 
#> Chisquare = 48.37989, Df = 1, p = 3.5115e-12

Verdict:

Test Statistic p-value Result
Shapiro-Wilk 0.878 1.23e-11 ❌ Non-normal residuals
Breusch-Pagan < 0.05 ❌ Heteroscedasticity present

The raw OLS model fails both checks — transformation is necessary.


5 ③ Transform — Box-Cox

par(mar = c(4.5, 4.5, 3.5, 1))
bc <- boxcox(model_raw, lambda = seq(-2, 2, by = 0.05), plotit = TRUE)
title(main = "Box-Cox Log-Likelihood Profile", line = 0.5)
abline(v = 0, col = "forestgreen", lwd = 1.5, lty = 2)
legend("topright",
       legend = c("95% CI", "λ = 0 (log)"),
       col    = c("black", "forestgreen"),
       lty    = c(1, 2), lwd = 1.5, bty = "n")
Figure 3 — Box-Cox log-likelihood profile. The peak near λ = 0 confirms that a log transformation of the response is optimal.

Figure 3 — Box-Cox log-likelihood profile. The peak near λ = 0 confirms that a log transformation of the response is optimal.

lambda_opt <- bc$x[which.max(bc$y)]
cat(sprintf("Optimal Box-Cox lambda: %.4f\n", lambda_opt))
#> Optimal Box-Cox lambda: 0.1818

The optimal λ ≈ 0.182, which is close to 0. By convention, λ = 0 corresponds to the natural log transformation, so we apply \(y^* = \ln(\text{RD\_Expenditure})\) and \(x^* = \ln(\text{Total\_Assets})\).

df_firms$log_RD   <- log(df_firms$RD_Expenditure)
df_firms$log_Size <- log(df_firms$Total_Assets)

6 ④ Refine — Log-Log OLS

6.1 Fit the transformed model

model_log <- lm(log_RD ~ log_Size, data = df_firms)
summary(model_log)
#> 
#> Call:
#> lm(formula = log_RD ~ log_Size, data = df_firms)
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -1.31707 -0.31284 -0.00069  0.30462  1.38376 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)   1.4356     0.2100   6.837 9.82e-11 ***
#> log_Size      0.6075     0.0389  15.616  < 2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.4815 on 198 degrees of freedom
#> Multiple R-squared:  0.5519, Adjusted R-squared:  0.5496 
#> F-statistic: 243.9 on 1 and 198 DF,  p-value: < 2.2e-16

6.2 Residual diagnostic plots

par(mfrow = c(2, 2), mar = c(4, 4, 3, 1))
plot(model_log, which = 1:4)
Figure 4 — Diagnostic plots for log-transformed OLS. Residuals are now randomly scattered and closely follow the normal line.

Figure 4 — Diagnostic plots for log-transformed OLS. Residuals are now randomly scattered and closely follow the normal line.

par(mfrow = c(1, 1))

6.3 Formal tests

6.3.1 Normality — Shapiro-Wilk

sw_log <- shapiro.test(resid(model_log))
sw_log
#> 
#>  Shapiro-Wilk normality test
#> 
#> data:  resid(model_log)
#> W = 0.99476, p-value = 0.7145

6.3.2 Homoscedasticity — Breusch-Pagan

ncvTest(model_log)
#> Non-constant Variance Score Test 
#> Variance formula: ~ fitted.values 
#> Chisquare = 0.009246288, Df = 1, p = 0.9234

7 Model Comparison

Table 1 — Raw OLS vs Log-Log OLS model comparison
Metric Raw OLS Log-Log OLS
Intercept (β₀) 40.5079 1.4356
Slope (β₁) 0.3509 0.6075
0.3088 0.5519
Adj. R² 0.3053 0.5496
Shapiro-Wilk p-value 1.23e-11 0.7145
Optimal Box-Cox λ 0.1818

8 Interpretation

The log-log model can be written as:

\[\ln(\hat{\text{RD}}) = 1.4356 + 0.6075 \cdot \ln(\text{Total\_Assets})\]

or equivalently on the original scale:

\[\hat{\text{RD}} = e^{1.4356} \cdot \text{Total\_Assets}^{0.6075}\]

Key takeaways:

  1. Elasticity: A 1% increase in Total Assets is associated with a 0.6075% increase in R&D Expenditure (since β₁ < 1, R&D scales sub-proportionally with firm size).

  2. Fit improves substantially: R² rises from 0.309 to 0.552.

  3. Normality restored: Shapiro-Wilk p-value goes from 1.2e-11 (raw) to 0.715 (log) — residuals are now consistent with normality.

  4. Homoscedasticity restored: The Breusch-Pagan test no longer rejects constant variance after the log transformation.

  5. Box-Cox confirms λ ≈ 0: The optimal λ = 0.182 is very close to 0, validating the log transform choice.


9 Session Info

sessionInfo()
#> R version 4.5.2 (2025-10-31 ucrt)
#> Platform: x86_64-w64-mingw32/x64
#> Running under: Windows 10 x64 (build 19045)
#> 
#> Matrix products: default
#>   LAPACK version 3.12.1
#> 
#> locale:
#> [1] LC_COLLATE=English_United States.utf8 
#> [2] LC_CTYPE=English_United States.utf8   
#> [3] LC_MONETARY=English_United States.utf8
#> [4] LC_NUMERIC=C                          
#> [5] LC_TIME=English_United States.utf8    
#> 
#> time zone: Asia/Taipei
#> tzcode source: internal
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] car_3.1-5     carData_3.0-6 MASS_7.3-65  
#> 
#> loaded via a namespace (and not attached):
#>  [1] digest_0.6.39     Formula_1.2-5     R6_2.6.1          fastmap_1.2.0    
#>  [5] xfun_0.56         cachem_1.1.0      abind_1.4-8       knitr_1.51       
#>  [9] htmltools_0.5.9   rmarkdown_2.30    lifecycle_1.0.5   cli_3.6.5        
#> [13] sass_0.4.10       jquerylib_0.1.4   compiler_4.5.2    rstudioapi_0.18.0
#> [17] tools_4.5.2       evaluate_1.0.5    bslib_0.10.0      yaml_2.3.12      
#> [21] otel_0.2.0        jsonlite_2.0.0    rlang_1.1.7