# packages
library(stargazer)
## 
## Please cite as:
##  Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables.
##  R package version 5.2.3. https://CRAN.R-project.org/package=stargazer
# input data
data <- mtcars
str(data)
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

Q1.1.

Estimating equation

\[ \text{mpg}_i = \beta_0 + \beta_1 \times \text{hp}_i + \beta_2 \times \text{wt}_i + \epsilon_i \]

where:

  • \(\text{mpg}_i\) is the miles per gallon for car \(i\),
  • \(\text{hp}_i\) is the horsepower of car \(i\),
  • \(\text{wt}_i\) is the weight of car \(i\),
  • \(\beta_0, \beta_1, \beta_2\) are parameters to be estimated,
  • \(\epsilon_i\) is the error term.

Summary statistics

summary(mtcars[, c("mpg", "hp", "wt")])
##       mpg              hp              wt       
##  Min.   :10.40   Min.   : 52.0   Min.   :1.513  
##  1st Qu.:15.43   1st Qu.: 96.5   1st Qu.:2.581  
##  Median :19.20   Median :123.0   Median :3.325  
##  Mean   :20.09   Mean   :146.7   Mean   :3.217  
##  3rd Qu.:22.80   3rd Qu.:180.0   3rd Qu.:3.610  
##  Max.   :33.90   Max.   :335.0   Max.   :5.424
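
Since stargazer is already loaded, the same descriptive statistics could also be rendered as a formatted table; a minimal sketch:

# Formatted descriptive statistics for the three variables used in the model
stargazer(mtcars[, c("mpg", "hp", "wt")], type = "text")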

Multivariate regression

# Fit the model
model <- lm(mpg ~ hp + wt, data = mtcars)
stargazer(model, type = "text")
## 
## ===============================================
##                         Dependent variable:    
##                     ---------------------------
##                                 mpg            
## -----------------------------------------------
## hp                           -0.032***         
##                               (0.009)          
##                                                
## wt                           -3.878***         
##                               (0.633)          
##                                                
## Constant                     37.227***         
##                               (1.599)          
##                                                
## -----------------------------------------------
## Observations                    32             
## R2                             0.827           
## Adjusted R2                    0.815           
## Residual Std. Error       2.593 (df = 29)      
## F Statistic           69.211*** (df = 2; 29)   
## ===============================================
## Note:               *p<0.1; **p<0.05; ***p<0.01

Q1.2. Interpretation of regression

Holding weight constant, each additional unit of horsepower is associated with a decrease of about 0.032 miles per gallon. Holding horsepower constant, each additional 1,000 lbs of weight is associated with a decrease of about 3.878 miles per gallon. Both coefficients are statistically significant at the 1% level, and the model explains roughly 83% of the variation in mpg (R2 = 0.827). The intercept of 37.227 is the predicted mpg when both predictors are zero, which lies outside the range of the data and is not substantively meaningful.
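
To gauge the precision of these estimates, 95% confidence intervals can be read directly off the fitted model; a minimal sketch:

# 95% confidence intervals for the estimated coefficients
confint(model, level = 0.95)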

Q1.3.

Residuals vs Fitted

plot(model$fitted.values, resid(model), xlab = "Fitted Values", ylab = "Residuals",
     main = "Residuals vs Fitted")
abline(h = 0, col = "red")

The residuals appear to be randomly scattered around zero with no obvious pattern, indicating that the model’s assumptions of linearity and equal variance are reasonably satisfied.

Normal Q-Q Plot

qqnorm(resid(model))
qqline(resid(model), col = "red")

The points in the Q-Q plot largely follow the 45-degree line, indicating that the residuals are approximately normally distributed. There are some slight deviations at the tails, but these are not extreme.
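
As a formal complement to the visual check, a Shapiro-Wilk test can be applied to the residuals (base R); a small sketch:

# Shapiro-Wilk test: the null hypothesis is that the residuals are normally distributed
shapiro.test(resid(model))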

Scale-Location

plot(model$fitted.values, sqrt(abs(resid(model))), xlab = "Fitted Values", ylab = "Sqrt|Residuals|",
     main = "Scale-Location")
# Add a smoothed trend line; a roughly flat trend supports constant error variance
lines(lowess(model$fitted.values, sqrt(abs(resid(model)))), col = "red")

The spread of residuals is fairly uniform across the range of fitted values, without any obvious pattern like a fan shape. This suggests that the variance of the residuals is constant across different values of the predictors, supporting the assumption of homoscedasticity.
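
This visual impression can be checked formally with a Breusch-Pagan test, assuming the lmtest package is installed; a minimal sketch:

library(lmtest)
# Breusch-Pagan test: the null hypothesis is constant error variance (homoscedasticity)
bptest(model)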

Residuals vs Leverage

plot(model, which = 5)

Most data points fall within the Cook’s distance contours, indicating that they are not unduly influential. However, a few points lie outside these bounds, notably the Maserati Bora, which has high leverage and a relatively large Cook’s distance.
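
This can be verified numerically by computing Cook’s distances and flagging observations above a common rule-of-thumb cutoff (4/n is one such convention); a small sketch:

# Cook's distance for each observation
cd <- cooks.distance(model)
# Flag observations exceeding the 4/n rule of thumb
cd[cd > 4 / nrow(mtcars)]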

Q1.4. Gauss-Markov Assumptions

Assumptions

  • Linearity: the model is correctly specified as linear in the parameters, so each predictor has a linear relationship with the outcome variable.

  • Independence of errors: the error terms are uncorrelated across observations (no serial correlation).

  • Homoscedasticity: the error variance is constant across observations, so the spread of residuals should not change with the fitted values.

  • Normality of residuals: not strictly a Gauss-Markov assumption, but in small samples the residuals should be approximately normally distributed for the t- and F-tests to be exact.

From my analysis:

  • The linearity assumption needs residual plots against each individual predictor to be fully evaluated (see the sketch after this list).
  • The independence assumption is difficult to assess from these plots alone; unless the data collection method is known not to induce any ordering (e.g., time-related patterns), a formal check such as the Durbin-Watson test is warranted (also sketched below).
  • The homoscedasticity assumption seems to be met based on the scale-location plot.
  • The normality assumption appears reasonably satisfied, with only mild deviations in the tails of the Q-Q plot.
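
A short sketch of these additional checks, assuming the lmtest package is installed for the Durbin-Watson test; the residual-vs-predictor plots address linearity and the test addresses independence:

library(lmtest)

# Residuals against each predictor; systematic curvature would suggest a non-linear relationship
plot(mtcars$hp, resid(model), xlab = "hp", ylab = "Residuals", main = "Residuals vs hp")
abline(h = 0, col = "red")
plot(mtcars$wt, resid(model), xlab = "wt", ylab = "Residuals", main = "Residuals vs wt")
abline(h = 0, col = "red")

# Durbin-Watson test: the null hypothesis is no first-order autocorrelation in the errors
dwtest(model)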

Q1.5. BLUE

For OLS to be considered BLUE, it must satisfy the Gauss-Markov assumptions. BLUE stands for Best Linear Unbiased Estimator, and each term implies:

  • Best: among all linear unbiased estimators, OLS has the smallest variance (it is the most efficient).
  • Linear: the estimator is a linear function of the observed values of the dependent variable.
  • Unbiased: the expected value of the estimator equals the true parameter value.
  • Estimator: it is a rule for computing estimates of the unknown population parameters from the sample data.
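
As a brief worked illustration of the “Unbiased” part: writing the model in matrix form and assuming the errors have zero conditional mean,

\[ \hat{\beta} = (X'X)^{-1}X'y = \beta + (X'X)^{-1}X'\epsilon, \qquad E[\hat{\beta} \mid X] = \beta \text{ because } E[\epsilon \mid X] = 0. \]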

Q2. Log

Reasons for Log Transformation

  • Non-linear Relationships: Many relationships between variables are not linear but can often be approximated by linear models after transformations. Taking the logarithm of one or more variables can help linearize relationships, making it possible to fit a linear model more effectively. For example, exponential growth processes (common in economic and biological contexts) can be modeled linearly if a logarithmic transformation is applied.

  • Variance Stabilization: Heteroscedasticity (non-constant variance of the error terms) is a common violation of OLS assumptions that can affect the efficiency and accuracy of the estimates. Log transformation often stabilizes variance, especially when data include large range values or when larger values tend to have larger variances. This stabilization enhances the model’s validity by aligning with the assumption of homoscedasticity.

  • Improved Interpretation: Transformations can also simplify the interpretation of the regression coefficients. When only the dependent variable is log-transformed, a one-unit change in a predictor is associated with an approximate 100 × β percent change in the outcome. When both the dependent variable and a predictor are log-transformed, the coefficient is an elasticity: the percentage change in the outcome associated with a one percent change in that predictor. This is particularly useful in financial and economic modeling where elasticities are of primary interest, as illustrated in the sketch below.
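
A minimal sketch using the same dataset: a log-log specification of the mpg model, in which the coefficients on log(hp) and log(wt) can be read directly as elasticities (all three variables are strictly positive, so the logs are well defined).

# Log-log specification: each coefficient is the elasticity of mpg with respect to that predictor
log_model <- lm(log(mpg) ~ log(hp) + log(wt), data = mtcars)
stargazer(log_model, type = "text")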