# packages
library(stargazer)
## 
## Please cite as:
##  Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables.
##  R package version 5.2.3. https://CRAN.R-project.org/package=stargazer
# input data
data <- mtcars
str(data)
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

Q1.1.

Estimating equation

\[ \text{mpg}_i = \beta_0 + \beta_1 \times \text{hp}_i + \beta_2 \times \text{wt}_i + \epsilon_i \]

where:

  • \(\text{mpg}_i\) is the miles per gallon for car \(i\),
  • \(\text{hp}_i\) is the horsepower of car \(i\),
  • \(\text{wt}_i\) is the weight of car \(i\),
  • \(\beta_0, \beta_1, \beta_2\) are parameters to be estimated,
  • \(\epsilon_i\) is the error term.

Summary statistics

summary(mtcars[, c("mpg", "hp", "wt")])
##       mpg              hp              wt       
##  Min.   :10.40   Min.   : 52.0   Min.   :1.513  
##  1st Qu.:15.43   1st Qu.: 96.5   1st Qu.:2.581  
##  Median :19.20   Median :123.0   Median :3.325  
##  Mean   :20.09   Mean   :146.7   Mean   :3.217  
##  3rd Qu.:22.80   3rd Qu.:180.0   3rd Qu.:3.610  
##  Max.   :33.90   Max.   :335.0   Max.   :5.424
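
Since stargazer is already loaded, the same descriptive statistics could also be rendered as a formatted table; a minimal sketch:

# Formatted descriptive statistics for the three variables used in the model
stargazer(mtcars[, c("mpg", "hp", "wt")], type = "text")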

Multivariate regression

# Fit the model
model <- lm(mpg ~ hp + wt, data = mtcars)
stargazer(model, type = "text")
## 
## ===============================================
##                         Dependent variable:    
##                     ---------------------------
##                                 mpg            
## -----------------------------------------------
## hp                           -0.032***         
##                               (0.009)          
##                                                
## wt                           -3.878***         
##                               (0.633)          
##                                                
## Constant                     37.227***         
##                               (1.599)          
##                                                
## -----------------------------------------------
## Observations                    32             
## R2                             0.827           
## Adjusted R2                    0.815           
## Residual Std. Error       2.593 (df = 29)      
## F Statistic           69.211*** (df = 2; 29)   
## ===============================================
## Note:               *p<0.1; **p<0.05; ***p<0.01

Q1.2. Interpretation of regression

Holding weight constant, each additional unit of horsepower is associated with a decrease of about 0.032 miles per gallon. Holding horsepower constant, each additional 1,000 lbs of weight is associated with a decrease of about 3.878 miles per gallon. Both coefficients are statistically significant at the 1% level, and the model explains roughly 83% of the variation in mpg (R2 = 0.827). The intercept of 37.227 is the predicted mpg when both predictors are zero, which lies outside the range of the data and is not substantively meaningful.
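
To gauge the precision of these estimates, 95% confidence intervals can be read directly off the fitted model; a minimal sketch:

# 95% confidence intervals for the estimated coefficients
confint(model, level = 0.95)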

Q1.3.

Residuals vs Fitted

plot(model$fitted.values, resid(model), xlab = "Fitted Values", ylab = "Residuals",
     main = "Residuals vs Fitted")
abline(h = 0, col = "red")

The residuals appear to be randomly scattered around zero with no obvious pattern, indicating that the model’s assumptions of linearity and equal variance are reasonably satisfied.

Normal Q-Q Plot

qqnorm(resid(model))
qqline(resid(model), col = "red")

The points in the Q-Q plot largely follow the 45-degree line, indicating that the residuals are approximately normally distributed. There are some slight deviations at the tails, but these are not extreme.
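
As a formal complement to the visual check, a Shapiro-Wilk test can be applied to the residuals (base R); a small sketch:

# Shapiro-Wilk test: the null hypothesis is that the residuals are normally distributed
shapiro.test(resid(model))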

Scale-Location

plot(model$fitted.values, sqrt(abs(resid(model))), xlab = "Fitted Values", ylab = "Sqrt|Residuals|",
     main = "Scale-Location")
# Add a smoothed trend line; a roughly flat trend supports constant error variance
lines(lowess(model$fitted.values, sqrt(abs(resid(model)))), col = "red")

The spread of residuals is fairly uniform across the range of fitted values, without any obvious pattern like a fan shape. This suggests that the variance of the residuals is constant across different values of the predictors, supporting the assumption of homoscedasticity.
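
This visual impression can be checked formally with a Breusch-Pagan test, assuming the lmtest package is installed; a minimal sketch:

library(lmtest)
# Breusch-Pagan test: the null hypothesis is constant error variance (homoscedasticity)
bptest(model)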

Residuals vs Leverage

plot(model, which = 5)

Most data points fall within the Cook’s distance contours, indicating that they are not unduly influential. However, a few points lie outside these bounds, notably the Maserati Bora, which has high leverage and a relatively large Cook’s distance.
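
This can be verified numerically by computing Cook’s distances and flagging observations above a common rule-of-thumb cutoff (4/n is one such convention); a small sketch:

# Cook's distance for each observation
cd <- cooks.distance(model)
# Flag observations exceeding the 4/n rule of thumb
cd[cd > 4 / nrow(mtcars)]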

Q1.4. Gauss-Markov Assumptions

Assumptions

  • Linearity: the model is correctly specified as linear in the parameters, so each predictor has a linear relationship with the outcome variable.

  • Independence of errors: the error terms are uncorrelated across observations (no serial correlation).

  • Homoscedasticity: the error variance is constant across observations, so the spread of residuals should not change with the fitted values.

  • Normality of residuals: not strictly a Gauss-Markov assumption, but in small samples the residuals should be approximately normally distributed for the t- and F-tests to be exact.

From my analysis:

  • The linearity assumption needs residual plots against each individual predictor to be fully evaluated (see the sketch after this list).
  • The independence assumption is difficult to assess from these plots alone; unless the data collection method is known not to induce any ordering (e.g., time-related patterns), a formal check such as the Durbin-Watson test is warranted (also sketched below).
  • The homoscedasticity assumption seems to be met based on the scale-location plot.
  • The normality assumption appears reasonably satisfied, with only mild deviations in the tails of the Q-Q plot.
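
A short sketch of these additional checks, assuming the lmtest package is installed for the Durbin-Watson test; the residual-vs-predictor plots address linearity and the test addresses independence:

library(lmtest)

# Residuals against each predictor; systematic curvature would suggest a non-linear relationship
plot(mtcars$hp, resid(model), xlab = "hp", ylab = "Residuals", main = "Residuals vs hp")
abline(h = 0, col = "red")
plot(mtcars$wt, resid(model), xlab = "wt", ylab = "Residuals", main = "Residuals vs wt")
abline(h = 0, col = "red")

# Durbin-Watson test: the null hypothesis is no first-order autocorrelation in the errors
dwtest(model)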

Q1.5. BLUE

For OLS to be considered BLUE, it must satisfy the Gauss-Markov assumptions. BLUE stands for Best Linear Unbiased Estimator, and each term implies:

  • Best: among all linear unbiased estimators, OLS has the smallest variance (it is the most efficient).
  • Linear: the estimator is a linear function of the observed values of the dependent variable.
  • Unbiased: the expected value of the estimator equals the true parameter value.
  • Estimator: it is a rule for computing estimates of the unknown population parameters from the sample data.
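
As a brief worked illustration of the “Unbiased” part: writing the model in matrix form and assuming the errors have zero conditional mean,

\[ \hat{\beta} = (X'X)^{-1}X'y = \beta + (X'X)^{-1}X'\epsilon, \qquad E[\hat{\beta} \mid X] = \beta \text{ because } E[\epsilon \mid X] = 0. \]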

Q2. Log

Reasons for Log Transformation

  • Non-linear Relationships: Many relationships between variables are not linear but can often be approximated by linear models after transformations. Taking the logarithm of one or more variables can help linearize relationships, making it possible to fit a linear model more effectively. For example, exponential growth processes (common in economic and biological contexts) can be modeled linearly if a logarithmic transformation is applied.

  • Variance Stabilization: Heteroscedasticity (non-constant variance of the error terms) is a common violation of OLS assumptions that can affect the efficiency and accuracy of the estimates. Log transformation often stabilizes variance, especially when data include large range values or when larger values tend to have larger variances. This stabilization enhances the model’s validity by aligning with the assumption of homoscedasticity.

  • Improved Interpretation: Transformations can also simplify the interpretation of the regression coefficients. When only the dependent variable is log-transformed, a one-unit change in a predictor is associated with an approximate 100 × β percent change in the outcome. When both the dependent variable and a predictor are log-transformed, the coefficient is an elasticity: the percentage change in the outcome associated with a one percent change in that predictor. This is particularly useful in financial and economic modeling where elasticities are of primary interest, as illustrated in the sketch below.
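
A minimal sketch using the same dataset: a log-log specification of the mpg model, in which the coefficients on log(hp) and log(wt) can be read directly as elasticities (all three variables are strictly positive, so the logs are well defined).

# Log-log specification: each coefficient is the elasticity of mpg with respect to that predictor
log_model <- lm(log(mpg) ~ log(hp) + log(wt), data = mtcars)
stargazer(log_model, type = "text")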