ECNM Discussion 3

Author

Bryan Calderon

0.1 Clear Data

rm(list = ls())      # Clear all objects from the environment
gc()                 # Clear unused memory
          used (Mb) gc trigger (Mb) limit (Mb) max used (Mb)
Ncells  581808 31.1    1327592   71         NA   669431 35.8
Vcells 1070079  8.2    8388608   64      16384  1852064 14.2
cat("\f")            # Clear the console
graphics.off()       # Clear all graphs

1 Part A

1.0.1 Explaining Covariance and Variance

Covariance

  • Measures how two variables vary together, indicating the direction of the linear relationship between the two variables.

  • The two variables can move opposite to each other (negative covariance) or together (positive covariance).

  • Can be difficult to compare across different data sets because the magnitude of the covariance depends on the units and scale of the variables.

Variance

  • Measures how much a single variable deviates from its mean, giving a sense of how dispersed that single variable's data are; a small numeric example of both follows below.
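
As a quick illustration of both ideas (toy numbers, not taken from the data used below), R's cov() and var() return these quantities directly:

a <- c(1, 2, 3, 4, 5)
b <- c(2, 4, 5, 4, 6)   # tends to rise with a
d <- c(9, 7, 6, 5, 3)   # tends to fall as a rises

cov(a, b)   # positive (2.0): a and b move together
cov(a, d)   # negative (-3.0): a and d move in opposite directions
var(a)      # 2.5: spread of a around its own mean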

1.0.2 Why would dividing the covariance of y and x by the variance of x give you the slope coefficient from a simple linear regression (one x variable only)?

  • A simple linear regression finds the line that minimizes the SSE (sum of squared errors). Solving that minimization gives the best estimate of the slope, which describes how much Y changes for a one-unit change in X.

  • Similarly, covariance measures the degree to which X and Y vary together, while the variance of X measures how X varies around its mean. Dividing the covariance by the variance scales the joint variation by the variation in X alone, which is exactly the slope: how much Y changes for each one-unit increase in X (written out below).
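
Writing this out, minimizing the sum of squared errors with a single regressor gives the standard OLS slope, which is the same ratio:

\[\hat{\beta}_1 = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2} = \frac{Cov(x, y)}{Var(x)}\]

Dividing the numerator and the denominator by \(n - 1\) turns the sums into the sample covariance and the sample variance, so the two expressions are identical.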

1.0.3 Implementing this with data

library("AER")
Warning: package 'AER' was built under R version 4.3.3
Loading required package: car
Loading required package: carData
Loading required package: lmtest
Loading required package: zoo

Attaching package: 'zoo'
The following objects are masked from 'package:base':

    as.Date, as.Date.numeric
Loading required package: sandwich
Loading required package: survival
data("CigarettesB")

cig <- CigarettesB

head(cig)
     packs   price  income
AL 4.96213 0.20487 4.64039
AZ 4.66312 0.16640 4.68389
AR 5.10709 0.23406 4.59435
CA 4.50449 0.36399 4.88147
CT 4.66983 0.32149 5.09472
DE 5.04705 0.21929 4.87087
# X and Y terms
x <- cig$price    # Price
y <- cig$packs   # Packs

# Calc of Slope by using formula
cov_xy <- cov(x, y)
var_x <- var(x)
slope_formula <- cov_xy / var_x

# Calc slope through linear model
model <- lm(y ~ x)
slope_lm <- coef(model)[2]

# Print results, we get the same results
cat("Manual calculation:", slope_formula)
Manual calculation: -1.198316
cat("Linear Model:", slope_lm) 
Linear Model: -1.198316

As you can see above, we get the same result from the manual calculation and the linear model.

library(ggplot2)

ggplot(cig, aes(x = price,     # price on the x-axis so the fitted line
                y = packs)) +  # matches the slope of packs on price above
  geom_point(size = 2, 
             shape = 18, 
             col = "red") +
  stat_smooth(method = lm, 
              linetype = "dashed") + 
  xlab("Price") + 
  ylab("# of Packs")
`geom_smooth()` using formula = 'y ~ x'

2 Part B

I picked the “USAirlines” data from the AER library, which contains cost data for six US airlines from 1970 to 1984 as a data frame.

firm -> factor indicating airline firm.

year -> factor indicating year.

output -> Total output in terms of RPM (Revenue Passenger Miles = # of paying passengers * miles flown). Used to track the growth or contraction of airlines over time.

cost -> total cost (in USD 1000).

price -> fuel price.

load -> average capacity utilization of the fleet.

data("USAirlines")

air <- USAirlines

#show sample data
head(air)
  firm year   output    cost  price     load
1    1 1970 0.952757 1140640 106650 0.534487
2    1 1971 0.986757 1215690 110307 0.532328
3    1 1972 1.091980 1309570 110574 0.547736
4    1 1973 1.175780 1511530 121974 0.540846
5    1 1974 1.160170 1676730 196606 0.591167
6    1 1975 1.173760 1823740 265609 0.575417

2.0.1 Getting a sense of the variables

# Plot to show quick graph
pairs(~output + price + cost + load, data = air,
      main = "Scatterplot Matrix of Air")

cor(air$output, air$price)
[1] 0.2276125
cor(air$output, air$cost)
[1] 0.9263269
cor(air$output, air$load)
[1] 0.42481

2.0.2 Constructing the Linear Model

\(Y_i = \beta_0 + \beta_1 * X_1 + \beta_2 * X_2 + \beta_3 * X_3 + \epsilon_i\)

\(Y_i = \beta_0 + \beta_1 * Avg \ capacity \ utilization + \beta_2 * Total \ cost + \beta_3 * Fuel \ price + \epsilon_i\)

_______

\(Y_i\) = Total output in terms of RPM

\(\beta_0\) = Intercept (constant) term

\(\beta_1\) = Slope coefficient representing the change in RPM per unit change in avg capacity utilization

\(X_1\) = Average capacity utilization of the fleet.

\(\beta_2\) = Slope coefficient representing the change in RPM per unit change in total cost

\(X_2\) = Total cost (in USD 1000).

\(\beta_3\) = Slope coefficient representing the change in RPM per unit change in fuel price

\(X_3\) = Fuel price

\(\epsilon_i\) = Error term representing unexplained variation

lm_model <- lm(output ~ load + cost + price, 
               data = air)

# Summary
summary(lm_model)

Call:
lm(formula = output ~ load + cost + price, data = air)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.31709 -0.10654 -0.02613  0.10674  0.29186 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -6.633e-01  1.665e-01  -3.983 0.000142 ***
load         1.709e+00  3.163e-01   5.404 5.74e-07 ***
cost         4.570e-07  1.394e-08  32.781  < 2e-16 ***
price       -5.567e-07  5.254e-08 -10.595  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1337 on 86 degrees of freedom
Multiple R-squared:  0.9393,    Adjusted R-squared:  0.9372 
F-statistic: 443.7 on 3 and 86 DF,  p-value: < 2.2e-16

2.0.3 Graphing the linear model (actual vs predicted values)

# dataframe ->  actual vs predicted values
air$predicted_output <- predict(lm_model)

ggplot(air, aes(x = predicted_output, 
                y = output)) +
  geom_point(size = 2, 
             shape = 18, 
             col = "red") +
  geom_abline(slope = 1, intercept = 0, 
              color = "blue", 
              linetype = "dashed") +
  labs(title = "Actual vs Predicted Output",
       x = "Predicted Output",
       y = "Actual Output") +
  theme_minimal()

2.0.4 Linear model vs Manual calculations

# Taking the betas from lm_model
lm_betas <- round(lm_model$coefficients, 2)

# X matrix ---> independent variables with intercept only
# Y vector ---> dependent variable only
X_manual <- model.matrix(lm_model)
Y_manual <- air$output

# Manually calculating the betas
manual_betas <- solve(crossprod(X_manual), crossprod(X_manual, Y_manual))

# Rounding
manual_betas <- round(manual_betas, 2)

# Combining both sets of betas into a data frame
beta_results <- data.frame(manual_betas = manual_betas,lm_betas = lm_betas)

beta_results
            manual_betas lm_betas
(Intercept)        -0.66    -0.66
load                1.71     1.71
cost                0.00     0.00
price               0.00     0.00
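
The cost and price rows show 0.00 only because the betas were rounded to two decimals; their actual values are on the order of 1e-07, as in the lm summary above. A quick unrounded comparison (a sketch reusing the objects defined above) confirms the match to full precision:

# Recompute the betas without rounding and compare with lm()
unrounded_manual <- drop(solve(crossprod(X_manual), crossprod(X_manual, Y_manual)))
max(abs(unrounded_manual - coef(lm_model)))   # should be ~0 (machine precision)
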
# X and Y terms
x2 <- air$price
y2 <- air$output

# Calc of Slope by using formula
cov_xy2 <- cov(x2, y2)
var_x2 <- var(x2)
slope_formula2 <- cov_xy2 / var_x2

# Calc slope through linear model
model2 <- lm(y2 ~ x2)
slope_lm2 <- coef(model2)[2]

# Print results, we get the same results
cat("Manual calculation:", slope_formula2)
Manual calculation: 3.685884e-07
cat("Linear Model:", slope_lm2)
Linear Model: 3.685884e-07

As seen above, the manual calculation and the linear model give the same result. (Note that this is the slope from a simple regression of output on price alone, so it differs from the price coefficient in the multiple regression above.)

2.0.5 Standard Error calculation

Standard error: the standard deviation of an estimator's sampling distribution. For the sample mean it is

  • Sample standard deviation / \(\sqrt{N}\)

  • N -> sample size
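
As a quick numeric illustration of that sample-mean formula (the standard errors of the regression coefficients are computed in matrix form below):

# Standard error of the sample mean of output: sd / sqrt(N)
sd(air$output) / sqrt(length(air$output))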

# X matrix ---> independent variables with intercept only
# Y vector ---> dependent variable only
X3 <- model.matrix(lm_model)
Y3 <- air$output

# N x k matrix
N   <- nrow(X3)
k   <- ncol(X3) - 1 # number of predictor variables (excluding Intercept column)

# Estimated Regression Coefficients
beta_hat    <- solve(t(X3) %*% X3) %*% t(X3) %*% Y3

# Residuals
residuals <- Y3 - X3 %*% beta_hat

# Residual Sum of Squares
RSS <- sum((residuals)^2)

# Residual Variance (sigma^2)
residual_variance <- RSS / (N - k - 1)

# Standard Error of residuals
residual_std_error <- sqrt(residual_variance)

# Variance-Covariance matrix of beta_hat
var_betaHat <- residual_variance * solve(t(X3) %*% X3)

var_betaHat
              (Intercept)          load          cost         price
(Intercept)  2.773108e-02 -5.207217e-02  4.863108e-10  2.344950e-09
load        -5.207217e-02  1.000423e-01 -1.041849e-09 -5.995663e-09
cost         4.863108e-10 -1.041849e-09  1.943338e-16 -2.555534e-16
price        2.344950e-09 -5.995663e-09 -2.555534e-16  2.760848e-15
# Standard Errors of beta_hat (coefficients)
standard_errors <- sqrt(diag(var_betaHat))

standard_errors
 (Intercept)         load         cost        price 
1.665265e-01 3.162946e-01 1.394036e-08 5.254378e-08 

Beta hat: the estimated coefficients describing the relationship between the dependent variable and the independent variables, chosen to minimize the sum of squared errors between the actual and predicted values of Y.

RSS (residual sum of squares): the residuals are the differences between the actual values \(Y\) and the predicted values \(X\hat{\beta}\). Summing their squares captures the total unexplained variation in Y after fitting the model.

Residual variance: an estimate of the variance of the error term in the regression model, obtained by dividing the RSS by the degrees of freedom (N - k - 1). It gives the average squared error per degree of freedom and tells us how much variation the model leaves unexplained.

Variance-covariance matrix: combines the residual variance with the variability of the independent variables to quantify the uncertainty associated with each estimated coefficient (checked against R's built-in vcov() below).

  • The larger the variance -> the less precise the estimate of beta
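
As a sanity check (a minimal sketch reusing lm_model from above), the manual matrix should agree with R's built-in vcov():

# Compare the manual variance-covariance matrix with R's built-in version
max(abs(var_betaHat - vcov(lm_model)))   # should be ~0 (machine precision)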

Standard errors: we take the square root of the diagonal elements of the variance-covariance matrix to measure the precision of each coefficient (compared with summary(lm_model) below).

  • Larger standard error -> the coefficient estimate could vary widely if the sample were different
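
Finally, a quick check (again a sketch reusing the objects above) that the manually computed standard errors match those reported by summary():

# Standard errors from summary() vs the manual calculation
summary_se <- summary(lm_model)$coefficients[, "Std. Error"]
max(abs(standard_errors - summary_se))   # should be ~0 (machine precision)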