#install.packages("AER")
library("AER")
## Loading required package: car
## Loading required package: carData
## Loading required package: lmtest
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
## Loading required package: sandwich
## Loading required package: survival
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:car':
##
## recode
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
data(package = "AER")
Description: The TravelMode dataset contains 840 observations in long format, representing 210 individuals each faced with four travel alternatives (car, air, train, and bus) for trips between Sydney and Melbourne. For each individual–mode pair, it records whether the mode was chosen, along with characteristics such as terminal waiting time, in-vehicle travel time, vehicle cost, generalized cost, income, and party size. This structure makes the data well-suited for discrete choice modeling (e.g., conditional or multinomial logit) to study how time, cost, and income affect travel mode decisions.
data("TravelMode")
str(TravelMode)
## 'data.frame': 840 obs. of 9 variables:
## $ individual: Factor w/ 210 levels "1","2","3","4",..: 1 1 1 1 2 2 2 2 3 3 ...
## $ mode : Factor w/ 4 levels "air","train",..: 1 2 3 4 1 2 3 4 1 2 ...
## $ choice : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 2 1 1 ...
## $ wait : int 69 34 35 0 64 44 53 0 69 34 ...
## $ vcost : int 59 31 25 10 58 31 25 11 115 98 ...
## $ travel : int 100 372 417 180 68 354 399 255 125 892 ...
## $ gcost : int 70 71 70 30 68 84 85 50 129 195 ...
## $ income : int 35 35 35 35 30 30 30 30 40 40 ...
## $ size : int 1 1 1 1 2 2 2 2 1 1 ...
head(TravelMode)
## individual mode choice wait vcost travel gcost income size
## 1 1 air no 69 59 100 70 35 1
## 2 1 train no 34 31 372 71 35 1
## 3 1 bus no 35 25 417 70 35 1
## 4 1 car yes 0 10 180 30 35 1
## 5 2 air no 64 58 68 68 30 2
## 6 2 train no 44 31 354 84 30 2
tail(TravelMode)
## individual mode choice wait vcost travel gcost income size
## 835 209 bus no 35 31 599 96 20 1
## 836 209 car yes 0 27 510 82 20 1
## 837 210 air no 64 66 140 87 70 4
## 838 210 train no 44 54 670 156 70 4
## 839 210 bus no 53 33 664 134 70 4
## 840 210 car yes 0 12 540 94 70 4
summary(TravelMode)
## individual mode choice wait vcost
## 1 : 4 air :210 no :630 Min. : 0.00 Min. : 2.00
## 2 : 4 train:210 yes:210 1st Qu.: 0.75 1st Qu.: 23.00
## 3 : 4 bus :210 Median :35.00 Median : 39.00
## 4 : 4 car :210 Mean :34.59 Mean : 47.76
## 5 : 4 3rd Qu.:53.00 3rd Qu.: 66.25
## 6 : 4 Max. :99.00 Max. :180.00
## (Other):816
## travel gcost income size
## Min. : 63.0 Min. : 30.0 Min. : 2.00 Min. :1.000
## 1st Qu.: 234.0 1st Qu.: 71.0 1st Qu.:20.00 1st Qu.:1.000
## Median : 397.0 Median :101.5 Median :34.50 Median :1.000
## Mean : 486.2 Mean :110.9 Mean :34.55 Mean :1.743
## 3rd Qu.: 795.5 3rd Qu.:144.0 3rd Qu.:50.00 3rd Qu.:2.000
## Max. :1440.0 Max. :269.0 Max. :72.00 Max. :6.000
##
This is cross-sectional data because it captures mode choices and related attributes at a single point in time for different individuals. Each person appears four times (once per mode), making it “long” choice data.
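The long layout described above can be checked directly. The following is a minimal sketch, assuming the `AER` package is installed; the names `rows_per_person` and `chosen_per_person` are introduced here for illustration only.

```r
library(AER)          # provides the TravelMode data
data("TravelMode")

# Number of rows contributed by each individual (one per travel mode)
rows_per_person <- table(TravelMode$individual)
table(rows_per_person)

# Number of alternatives marked as chosen for each individual
chosen_per_person <- tapply(TravelMode$choice == "yes",
                            TravelMode$individual, sum)
table(chosen_per_person)
```

If every individual shows four rows, the data are in long format with one row per individual-mode pair.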
The first scatter plot below shows the relationship between generalized cost and household income, with one panel per travel mode. Within each mode, generalized cost shows only a weak relationship with income. Because every individual appears once under each mode, all four panels contain the same income values, so differences between panels reflect generalized cost rather than who chose which mode. The second plot maps party size to point size and mode to color: larger travel groups (bigger points) are more common at mid-range incomes and costs.
ggplot(TravelMode, aes(x = gcost, y = income)) +
geom_point(alpha = 0.6, color = "steelblue") +
geom_smooth(method = "lm", se = FALSE, color = "darkred") +
facet_wrap(~ mode) +
labs(
title = "Generalized Cost vs. Income by Mode",
x = "Generalized Cost",
y = "Household Income"
) +
theme_light(base_size = 14)
## `geom_smooth()` using formula = 'y ~ x'
ggplot(TravelMode, aes(x = gcost, y = income, size = size, color = mode)) +
geom_jitter(alpha = 0.5) +
labs(
title = "Generalized Cost vs. Income (Point Size = Party Size)",
x = "Generalized Cost",
y = "Household Income"
) +
theme_bw()
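The plots above use all 840 rows, so each traveler contributes a point for every mode, chosen or not. To look at income by the mode actually chosen, one option is to keep only the rows with `choice == "yes"`. This is a sketch under that assumption, not part of the original analysis; the object name `chosen` is hypothetical.

```r
library(AER)
library(dplyr)
data("TravelMode")

# Keep only the alternative each traveler actually chose
chosen <- TravelMode %>% dplyr::filter(choice == "yes")

# Average income and generalized cost by chosen mode
chosen %>%
  group_by(mode) %>%
  summarise(n = n(),
            mean_income = mean(income),
            mean_gcost = mean(gcost))
```

The same `chosen` data frame could be passed to the `ggplot()` calls above to plot only the selected alternatives.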
Description: The USInvest dataset from AER: Applied Econometrics with R is an annual time series covering U.S. macroeconomic data from 1968 to 1982. It contains four key variables: gross national product (gnp), investment (invest), the consumer price index (price), and the interest rate (interest), which together capture how output, spending, inflation, and borrowing costs evolved over time. This makes it useful for studying relationships such as how investment responds to changes in output, inflation, or interest rates.
data("USInvest")
usinvest <- as.data.frame(USInvest)
usinvest$year <- 1968:1982
head(usinvest)
## gnp invest price interest year
## 1 873.4 133.3 82.54 5.16 1968
## 2 944.0 149.3 86.79 5.87 1969
## 3 992.7 144.2 91.45 5.95 1970
## 4 1077.6 166.4 96.01 4.88 1971
## 5 1185.9 195.0 100.00 4.50 1972
## 6 1326.4 229.8 105.75 6.44 1973
This is time series data because it tracks the same economic variables (GNP, investment, prices, and interest rates) for the United States annually from 1968 to 1982. Each year appears once, making it a single-entity dataset observed repeatedly over time.
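The time dimension can also be read from the object itself rather than typed in by hand. This is a small sketch, assuming `USInvest` is stored as an annual `ts` object as described in the AER documentation.

```r
library(AER)
data("USInvest")

class(USInvest)       # should include "ts" if stored as a time series
start(USInvest)       # first observation period
end(USInvest)         # last observation period
frequency(USInvest)   # observations per year

# Build the year column from the series' own time index
usinvest <- as.data.frame(USInvest)
usinvest$year <- as.numeric(time(USInvest))
```

Deriving `year` from `time()` avoids a mismatch if the sample period ever differs from the hard-coded range.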
Below is a line plot showing U.S. investment over the sample period. Investment trends upward overall, with a dip in the mid-1970s followed by sharp growth into the early 1980s, reflecting sensitivity to business cycles. The scatter plot displays the relationship between GNP and investment. The two variables move closely together, with higher GNP strongly associated with higher investment, and the linear fit confirming a clear positive relationship.
ggplot(usinvest, aes(x = year, y = invest)) +
geom_line(color = "navy", linewidth = 1) +
scale_x_continuous(breaks = usinvest$year) +
labs(title = "U.S. Investment (1968–1982)",
x = "Year", y = "Investment")
ggplot(usinvest, aes(x = gnp, y = invest)) +
geom_point(color = "purple", size = 2) +
geom_smooth(method = "lm", se = FALSE, color = "red", linewidth = 1) +
labs(title = "Relationship between GNP and Investment",
x = "Gross National Product", y = "Investment")
## `geom_smooth()` using formula = 'y ~ x'
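The strength of the GNP-investment relationship shown in the scatter plot can be put into numbers. A minimal sketch follows; the object name `fit` is introduced here for illustration.

```r
library(AER)
data("USInvest")
usinvest <- as.data.frame(USInvest)

# Pearson correlation between GNP and investment
r <- cor(usinvest$gnp, usinvest$invest)
r

# The line drawn by geom_smooth(method = "lm") is this regression
fit <- lm(invest ~ gnp, data = usinvest)
coef(fit)
summary(fit)$r.squared   # equals r^2 in a one-regressor model
```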
What is covariance? What is variance?
Variance measures how much a single variable fluctuates around its mean, represented as the average squared deviation from the mean. A large variance means observations are widely dispersed and a small variance means they cluster tightly around the mean.
Covariance measures how two variables move together relative to their means, computed as the average product of their deviations from those means. If x tends to be above its mean when y is above its mean (and below when y is below), the covariance is positive. It is negative if they tend to move in opposite directions, and near zero if there is no linear co-movement. Its units are the product of the units of x and y.
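Both definitions can be computed by hand and compared with R's built-in functions. The vectors below are toy values chosen only for illustration. Note that `var()` and `cov()` divide by n - 1 (the sample versions) rather than n.

```r
# Toy vectors, chosen only for illustration
x <- c(2, 4, 6, 8)
y <- c(1, 3, 2, 6)
n <- length(x)

# Sample variance: squared deviations from the mean, divided by n - 1
var_manual <- sum((x - mean(x))^2) / (n - 1)

# Sample covariance: products of paired deviations, divided by n - 1
cov_manual <- sum((x - mean(x)) * (y - mean(y))) / (n - 1)

all.equal(var_manual, var(x))      # TRUE
all.equal(cov_manual, cov(x, y))   # TRUE
```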
Why would dividing the covariance of y and x by the variance of x give you the slope coefficient from a simple linear regression (one x variable only)?
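In a simple linear regression of y on x, ordinary least squares chooses the slope that minimizes the sum of squared residuals. Solving that minimization gives the slope coefficient \(\hat{\beta}_1\) as the co-movement of x and y scaled by the variation in x:
\[ \hat{\beta}_1 = \frac{\text{Cov}(x, y)}{\text{Var}(x)} \]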
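The step from least squares to this ratio can be sketched with standard OLS algebra. The symbols \(S\), \(\beta_0\), \(\bar{x}\), \(\bar{y}\), and \(n\) are introduced here for illustration and do not appear in the original text. OLS minimizes the sum of squared residuals:
\[ S(\beta_0, \beta_1) = \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2 \]
Setting the derivative with respect to \(\beta_0\) to zero gives \(\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}\). Substituting this into the derivative with respect to \(\beta_1\) and solving gives:
\[ \hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \]
Dividing the numerator and the denominator by \(n - 1\) turns them into the sample covariance and the sample variance, and leaves the ratio unchanged:
\[ \hat{\beta}_1 = \frac{\text{Cov}(x, y)}{\text{Var}(x)} \]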
We now run a simple regression of generalized cost on travel time, using all observations in the TravelMode dataset.
model <- lm(gcost ~ travel, data = TravelMode)
summary(model)
##
## Call:
## lm(formula = gcost ~ travel, data = TravelMode)
##
## Residuals:
## Min 1Q Median 3Q Max
## -49.621 -26.748 -7.663 24.469 123.359
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 55.374131 2.191052 25.27 <2e-16 ***
## travel 0.114170 0.003831 29.80 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 33.45 on 838 degrees of freedom
## Multiple R-squared: 0.5145, Adjusted R-squared: 0.514
## F-statistic: 888.2 on 1 and 838 DF, p-value: < 2.2e-16
Here, \(\hat{\beta}_1\) is 0.11417. Now, let’s compute the covariance of x and y and the variance of x, then divide, to check that we get the same value.
cov_xy <- cov(TravelMode$travel, TravelMode$gcost)
var_x <- var(TravelMode$travel)
beta1_formula <- cov_xy / var_x
beta1_formula
## [1] 0.1141702
Comparing to the regression output:
coef(model)[2] # regression slope
## travel
## 0.1141702
beta1_formula # slope via formula
## [1] 0.1141702
The two results match: the regression slope \(\hat{\beta}_1\) equals the covariance of x and y divided by the variance of x, consistent with the theoretical expression \[ \hat{\beta}_1 = \frac{\text{Cov}(x, y)}{\text{Var}(x)} \] In this model, \(\hat{\beta}_1\) is the average change in generalized cost associated with a one-unit increase in travel time.
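Comparing printed values only shows agreement to the displayed digits. A slightly stricter check uses `all.equal()`, which compares within numerical tolerance. The sketch below also recovers the intercept from the standard OLS identity (mean of y minus slope times mean of x), which is not stated in the original text; the names `slope_formula` and `intercept_formula` are hypothetical.

```r
library(AER)
data("TravelMode")

x <- TravelMode$travel
y <- TravelMode$gcost
model <- lm(gcost ~ travel, data = TravelMode)

# Slope from the covariance/variance ratio
slope_formula <- cov(x, y) / var(x)

# Intercept implied by the fitted line passing through the means
intercept_formula <- mean(y) - slope_formula * mean(x)

# Compare both coefficients with lm() within numerical tolerance
all.equal(unname(coef(model)), c(intercept_formula, slope_formula))
```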