Weekly Discussion: Types of Data & Slope Parameter Interpretation

Weekly Discussion 1:

Part I:

Loading Data

#install.packages("AER")
library("AER")

## Loading required package: car

## Loading required package: carData

## Loading required package: lmtest

## Loading required package: zoo

## 
## Attaching package: 'zoo'

## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric

## Loading required package: sandwich

## Loading required package: survival

library(ggplot2)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following object is masked from 'package:car':
## 
##     recode

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

data(package = "AER")

Data set 1: Travel Mode

Description: The TravelMode dataset contains 840 observations in long format, representing 210 individuals each faced with four travel alternatives (car, air, train, and bus) for trips between Sydney and Melbourne. For each individual–mode pair, it records whether the mode was chosen, along with characteristics such as terminal waiting time, in-vehicle travel time, vehicle cost, generalized cost, income, and party size. This structure makes the data well-suited for discrete choice modeling (e.g., conditional or multinomial logit) to study how time, cost, and income affect travel mode decisions.

data("TravelMode")

str(TravelMode)

## 'data.frame':    840 obs. of  9 variables:
##  $ individual: Factor w/ 210 levels "1","2","3","4",..: 1 1 1 1 2 2 2 2 3 3 ...
##  $ mode      : Factor w/ 4 levels "air","train",..: 1 2 3 4 1 2 3 4 1 2 ...
##  $ choice    : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 2 1 1 ...
##  $ wait      : int  69 34 35 0 64 44 53 0 69 34 ...
##  $ vcost     : int  59 31 25 10 58 31 25 11 115 98 ...
##  $ travel    : int  100 372 417 180 68 354 399 255 125 892 ...
##  $ gcost     : int  70 71 70 30 68 84 85 50 129 195 ...
##  $ income    : int  35 35 35 35 30 30 30 30 40 40 ...
##  $ size      : int  1 1 1 1 2 2 2 2 1 1 ...

head(TravelMode)

##   individual  mode choice wait vcost travel gcost income size
## 1          1   air     no   69    59    100    70     35    1
## 2          1 train     no   34    31    372    71     35    1
## 3          1   bus     no   35    25    417    70     35    1
## 4          1   car    yes    0    10    180    30     35    1
## 5          2   air     no   64    58     68    68     30    2
## 6          2 train     no   44    31    354    84     30    2

tail(TravelMode)

##     individual  mode choice wait vcost travel gcost income size
## 835        209   bus     no   35    31    599    96     20    1
## 836        209   car    yes    0    27    510    82     20    1
## 837        210   air     no   64    66    140    87     70    4
## 838        210 train     no   44    54    670   156     70    4
## 839        210   bus     no   53    33    664   134     70    4
## 840        210   car    yes    0    12    540    94     70    4

summary(TravelMode)

##    individual     mode     choice         wait           vcost       
##  1      :  4   air  :210   no :630   Min.   : 0.00   Min.   :  2.00  
##  2      :  4   train:210   yes:210   1st Qu.: 0.75   1st Qu.: 23.00  
##  3      :  4   bus  :210             Median :35.00   Median : 39.00  
##  4      :  4   car  :210             Mean   :34.59   Mean   : 47.76  
##  5      :  4                         3rd Qu.:53.00   3rd Qu.: 66.25  
##  6      :  4                         Max.   :99.00   Max.   :180.00  
##  (Other):816                                                         
##      travel           gcost           income           size      
##  Min.   :  63.0   Min.   : 30.0   Min.   : 2.00   Min.   :1.000  
##  1st Qu.: 234.0   1st Qu.: 71.0   1st Qu.:20.00   1st Qu.:1.000  
##  Median : 397.0   Median :101.5   Median :34.50   Median :1.000  
##  Mean   : 486.2   Mean   :110.9   Mean   :34.55   Mean   :1.743  
##  3rd Qu.: 795.5   3rd Qu.:144.0   3rd Qu.:50.00   3rd Qu.:2.000  
##  Max.   :1440.0   Max.   :269.0   Max.   :72.00   Max.   :6.000  
##

Scatterplot of TravelMode Data

This is cross-sectional data because it captures mode choices and related attributes at a single point in time for different individuals. Each person appears four times (once per mode), making it “long” choice data.
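The long layout described above can be checked directly. The snippet below is a minimal sketch, assuming the AER and dplyr packages are installed; the object names `per_person` and `chosen` are introduced here for illustration only.

```r
# Sketch: confirm the long-format structure of TravelMode.
library("AER")   # provides the TravelMode data
library(dplyr)

data("TravelMode")

# Count rows and chosen alternatives for each individual.
per_person <- TravelMode %>%
  group_by(individual) %>%
  summarise(
    n_rows   = n(),
    n_chosen = sum(choice == "yes"),
    .groups  = "drop"
  )

# Tabulate how many individuals share each (rows, chosen) combination.
count(per_person, n_rows, n_chosen)

# Keep only the chosen alternative: one row per individual.
chosen <- dplyr::filter(TravelMode, choice == "yes")
nrow(chosen)
```

Working from `chosen` rather than the full data is one way to compare traveler characteristics such as income across the modes people actually selected.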

Below is a scatter plot of generalized cost against income, faceted by travel mode. Within each mode, generalized cost shows only a weak relationship with income. Because every individual appears once under each mode, the income distribution is identical in all four panels, so these plots cannot show which modes higher-income travelers prefer; that comparison requires restricting the data to rows where choice is “yes”. The second plot maps party size to point size and mode to color. Larger travel groups (bigger bubbles) are more common at mid-range incomes and costs. Note that color identifies the alternative, not the mode actually chosen.

ggplot(TravelMode, aes(x = gcost, y = income)) +
  geom_point(alpha = 0.6, color = "steelblue") +
  geom_smooth(method = "lm", se = FALSE, color = "darkred") +
  facet_wrap(~ mode) +
  labs(
    title = "Generalized Cost vs. Income by Mode",
    x = "Generalized Cost",
    y = "Household Income"
  ) +
  theme_light(base_size = 14)

## `geom_smooth()` using formula = 'y ~ x'

ggplot(TravelMode, aes(x = gcost, y = income, size = size, color = mode)) +
  geom_jitter(alpha = 0.5) +
  labs(
    title = "Generalized Cost vs. Income (Point Size = Party Size)",
    x = "Generalized Cost",
    y = "Household Income"
  ) +
  theme_bw()

Data set 2: USInvest

Description: The USInvest dataset from AER: Applied Econometrics with R is an annual time series covering U.S. macroeconomic data from 1968 to 1982. It contains four key variables: gross national product (gnp), investment (invest), the consumer price index (price), and the interest rate (interest), which together capture how output, spending, inflation, and borrowing costs evolved over time. This makes it useful for studying relationships such as how investment responds to changes in output, inflation, or interest rates.

data("USInvest")
usinvest <- as.data.frame(USInvest)
usinvest$year <- 1968:1982
head(usinvest)

##      gnp invest  price interest year
## 1  873.4  133.3  82.54     5.16 1968
## 2  944.0  149.3  86.79     5.87 1969
## 3  992.7  144.2  91.45     5.95 1970
## 4 1077.6  166.4  96.01     4.88 1971
## 5 1185.9  195.0 100.00     4.50 1972
## 6 1326.4  229.8 105.75     6.44 1973

Time Series Plot of USInvest Data

This is time series data because it tracks the same economic variables (GNP, investment, prices, and interest rates) for the United States annually from 1968 to 1982. Each year appears once, making it a single-entity dataset observed repeatedly over time.
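The time index can also be read from the object itself rather than typed by hand. This is a minimal sketch, assuming USInvest is stored as a `ts` object, as the AER documentation describes; if it were a plain data frame, `time()` would not return calendar years.

```r
# Sketch: inspect the time-series attributes of USInvest.
library("AER")
data("USInvest")

class(USInvest)       # a "ts" class is expected here (assumption)
start(USInvest)       # first period
end(USInvest)         # last period
frequency(USInvest)   # observations per year

# Build the data frame with a year column derived from the series.
usinvest <- as.data.frame(USInvest)
usinvest$year <- as.numeric(time(USInvest))
head(usinvest)
```

Deriving `year` from `time()` keeps the index in sync with the data if the sample is later subset with `window()`.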

Below is a line plot showing U.S. investment over the sample period. Investment trends upward overall, with a dip in the mid-1970s followed by sharp growth into the early 1980s, reflecting sensitivity to business cycles. The scatter plot displays the relationship between GNP and investment. The two variables move closely together, with higher GNP strongly associated with higher investment, and the linear fit confirming a clear positive relationship.

ggplot(usinvest, aes(x = year, y = invest)) +
  geom_line(color = "navy", linewidth = 1) +
  scale_x_continuous(breaks = usinvest$year) +
  labs(title = "U.S. Investment (1968–1982)",
       x = "Year", y = "Investment")

ggplot(usinvest, aes(x = gnp, y = invest)) +
  geom_point(color = "purple", size = 2) +
  geom_smooth(method = "lm", se = FALSE, color = "red", linewidth = 1) +
  labs(title = "Relationship between GNP and Investment",
       x = "Gross National Product", y = "Investment")

## `geom_smooth()` using formula = 'y ~ x'

Part II:

What is covariance? What is variance?
- Variance measures how much a single variable fluctuates around its mean, represented as the average squared deviation from the mean. A large variance means observations are widely dispersed and a small variance means they cluster tightly around the mean.
- Covariance measures how two variables move together relative to their means, computed as the average product of their deviations. If x tends to be above its mean when y is above its mean (and below when y is below), the covariance is positive. It is negative if they tend to move in opposite directions, and near zero if there is no linear co-movement. Its units are the product of the units of x and y.
Why would dividing the covariance of y and x by the variance of x give you the slope coefficient from a simple linear regression (one x variable only)?
- In a simple regression with one explanatory variable, the slope captures how much y changes, on average, when x increases by one unit. The covariance between x and y measures how strongly and in what direction they vary together, while the variance of x measures how much x itself varies. Dividing the covariance by the variance scales the joint movement of x and y by the amount of variation in x alone. In other words, the slope is “the part of y’s variation that lines up with x, per unit of x’s own variation.” The units confirm this: covariance is in (units of x) times (units of y) and variance is in (units of x) squared, so their ratio is in units of y per unit of x, which is a rate of change.
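The definitions above can be reproduced by hand. The sketch below uses only the six travel and gcost values printed by head(TravelMode), so it runs without any packages; the variable names are illustrative. Note that R's `var()` and `cov()` divide by n - 1 rather than n, and that this choice of divisor cancels in the covariance-to-variance ratio.

```r
# Sketch: variance and covariance from their definitions,
# using the six rows shown by head(TravelMode).
x <- c(100, 372, 417, 180, 68, 354)   # travel
y <- c(70, 71, 70, 30, 68, 84)        # gcost
n <- length(x)

dx <- x - mean(x)   # deviations of x from its mean
dy <- y - mean(y)   # deviations of y from its mean

# Sample versions (divide by n - 1), matching var() and cov().
var_manual <- sum(dx^2) / (n - 1)
cov_manual <- sum(dx * dy) / (n - 1)
all.equal(var_manual, var(x))
all.equal(cov_manual, cov(x, y))

# The divisor cancels in the ratio, so n or n - 1 gives the same slope.
ratio_sample     <- cov_manual / var_manual
ratio_population <- (sum(dx * dy) / n) / (sum(dx^2) / n)
all.equal(ratio_sample, ratio_population)
```

All three `all.equal()` calls should return TRUE.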

In a simple linear regression of y on x, the slope coefficient $\hat{\beta}_1$ can be expressed as:

\[ \hat{\beta}_1 = \frac{\text{Cov}(x, y)}{\text{Var}(x)} \]
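A short derivation, sketched under the usual least-squares setup, shows where this ratio comes from. The intercept $\hat{\beta}_0$, the candidate values $b_0, b_1$, and the sample means $\bar{x}, \bar{y}$ are notation introduced here for illustration. OLS picks the coefficients that minimize the sum of squared residuals:

\[ (\hat{\beta}_0, \hat{\beta}_1) = \arg\min_{b_0, b_1} \sum_{i=1}^{n} (y_i - b_0 - b_1 x_i)^2 \]

Setting the derivative with respect to $b_0$ to zero gives the intercept in terms of the slope:

\[ \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} \]

Setting the derivative with respect to $b_1$ to zero and substituting this intercept gives:

\[ \sum_{i=1}^{n} x_i \left[ (y_i - \bar{y}) - \hat{\beta}_1 (x_i - \bar{x}) \right] = 0 \]

Since $\sum_i \bar{x}(y_i - \bar{y}) = 0$ and $\sum_i \bar{x}(x_i - \bar{x}) = 0$, each leading $x_i$ can be replaced by $(x_i - \bar{x})$ without changing the sums. Solving for the slope:

\[ \hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = \frac{\text{Cov}(x, y)}{\text{Var}(x)} \]

The last equality follows from dividing the numerator and denominator by $n - 1$.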

Run a simple regression of generalized cost on travel time using all observations in the TravelMode dataset.

model <- lm(gcost ~ travel, data = TravelMode)
summary(model)

## 
## Call:
## lm(formula = gcost ~ travel, data = TravelMode)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -49.621 -26.748  -7.663  24.469 123.359 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 55.374131   2.191052   25.27   <2e-16 ***
## travel       0.114170   0.003831   29.80   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 33.45 on 838 degrees of freedom
## Multiple R-squared:  0.5145, Adjusted R-squared:  0.514 
## F-statistic: 888.2 on 1 and 838 DF,  p-value: < 2.2e-16
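To make the slope's interpretation concrete, the fitted model can be used to compare two predictions. This is a minimal sketch, assuming the AER package is installed; the travel values 300 and 360 are hypothetical inputs chosen for illustration, not values taken from the analysis.

```r
# Sketch: the slope as a difference in predicted values.
library("AER")
data("TravelMode")

model <- lm(gcost ~ travel, data = TravelMode)

# Two hypothetical trips whose travel times differ by 60 units.
new_trips <- data.frame(travel = c(300, 360))
pred <- predict(model, newdata = new_trips)

diff(pred)                      # change in predicted generalized cost
60 * coef(model)["travel"]      # 60 times the estimated slope
```

The two printed values should agree: in a linear model, the predicted difference depends only on the gap in x and the slope, not on the starting point.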

Here, $\hat{\beta}_1$ is 0.11417. Now, let’s calculate the covariance of x and y and the variance of x, then divide, to verify that the formula gives the same answer.

cov_xy <- cov(TravelMode$travel, TravelMode$gcost)
var_x <- var(TravelMode$travel)
beta1_formula <- cov_xy / var_x
beta1_formula

## [1] 0.1141702

Comparing to the regression output:

coef(model)[2]   # regression slope

##    travel 
## 0.1141702

beta1_formula    # slope via formula

## [1] 0.1141702

The two results match: the slope from the regression, $\hat{\beta}_1$, equals the value obtained from the covariance–variance formula. This confirms that $\hat{\beta}_1$ captures the average change in generalized cost for a one-unit change in travel time, consistent with the theoretical expression $\hat{\beta}_1 = \text{Cov}(x, y) / \text{Var}(x)$.
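Two closely related identities can be checked the same way. This is a sketch, assuming the AER package is installed, with illustrative variable names: the slope also equals the correlation times the ratio of standard deviations, the intercept follows from the sample means, and in a one-regressor model the squared correlation equals R-squared.

```r
# Sketch: equivalent ways to recover the simple-regression coefficients.
library("AER")
data("TravelMode")

x <- TravelMode$travel
y <- TravelMode$gcost
model <- lm(gcost ~ travel, data = TravelMode)

# Slope as correlation times sd(y) / sd(x).
slope_from_cor <- cor(x, y) * sd(y) / sd(x)

# Intercept from the sample means: the fitted line passes through (mean(x), mean(y)).
intercept_from_means <- mean(y) - slope_from_cor * mean(x)

# Compare with lm(); unname() drops the coefficient labels.
all.equal(unname(coef(model)), c(intercept_from_means, slope_from_cor))

# With a single regressor, R-squared is the squared correlation.
all.equal(summary(model)$r.squared, cor(x, y)^2)
```

Both `all.equal()` calls should return TRUE.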