In this homework exercise, the goal is to carry out some regression analysis in R, using a dataset originally distributed in Stata format.

First, we load and take a look at the AIRFARE dataset:

library(foreign)  # provides read.dta() for Stata .dta files
library(dplyr)    # pipes and the data-manipulation verbs used throughout

data <- read.dta("airfare.dta")
head(data) %>%
    knitr::kable()
| year | origin | destin | id | dist | passen | fare | bmktshr | ldist | y98 | y99 | y00 | lfare | ldistsq | concen | lpassen |
|------|--------|--------|----|------|--------|------|---------|-------|-----|-----|-----|-------|---------|--------|---------|
| 1997 | AKRON, OH | ATLANTA, GA | 1 | 528 | 152 | 106 | 0.8386 | 6.269096 | 0 | 0 | 0 | 4.663439 | 39.30157 | 0.8386 | 5.023880 |
| 1998 | AKRON, OH | ATLANTA, GA | 1 | 528 | 265 | 106 | 0.8133 | 6.269096 | 1 | 0 | 0 | 4.663439 | 39.30157 | 0.8133 | 5.579730 |
| 1999 | AKRON, OH | ATLANTA, GA | 1 | 528 | 336 | 113 | 0.8262 | 6.269096 | 0 | 1 | 0 | 4.727388 | 39.30157 | 0.8262 | 5.817111 |
| 2000 | AKRON, OH | ATLANTA, GA | 1 | 528 | 298 | 123 | 0.8612 | 6.269096 | 0 | 0 | 1 | 4.812184 | 39.30157 | 0.8612 | 5.697094 |
| 1997 | AKRON, OH | ORLANDO, FL | 2 | 861 | 282 | 104 | 0.5798 | 6.758094 | 0 | 0 | 0 | 4.644391 | 45.67184 | 0.5798 | 5.641907 |
| 1998 | AKRON, OH | ORLANDO, FL | 2 | 861 | 178 | 105 | 0.5817 | 6.758094 | 1 | 0 | 0 | 4.653960 | 45.67184 | 0.5817 | 5.181784 |
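
As an aside, the same file can also be read with the newer haven package, if it happens to be installed; a minimal sketch, not used in the rest of this exercise:

library(haven)                    # assumption: haven is available
data <- read_dta("airfare.dta")   # returns a tibble with labelled columns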

Calculating summary statistics

group_by(data, year) %>%
    summarise(mean = mean(fare), median = median(fare),
              q1 = quantile(fare, .25), q3 = quantile(fare, .75)) %>%
    knitr::kable(caption = "Summary statistics of fare by year")
Summary statistics of fare by year
| year | mean     | median | q1  | q3  |
|------|----------|--------|-----|-----|
| 1997 | 173.7520 | 160    | 116 | 225 |
| 1998 | 175.4413 | 167    | 121 | 219 |
| 1999 | 177.9704 | 168    | 124 | 224 |
| 2000 | 188.0235 | 178    | 132 | 233 |

Flight distance by destination

data_table <- group_by(data, destin) %>%
    summarise(mean = mean(dist), sd = sd(dist))
knitr::kable(data_table, caption = "Mean and standard deviation of flight distance by destination city")
Mean and standard deviation of flight distance by destination city
| destin | mean | sd |
|--------|------|----|
| ATLANTA, GA | 883.3333 | 317.23502 |
| AUSTIN, TX | 612.6667 | 170.56342 |
| BALTIMORE, MD | 1197.0000 | 479.86437 |
| BATON ROUGE, LA | 449.0000 | 0.00000 |
| BIRMINGHAM, AL | 682.0000 | 0.00000 |
| BOSTON, MA | 1052.2000 | 703.20601 |
| BUFFALO, NY | 554.0000 | 168.90910 |
| CHARLESTON, SC | 259.0000 | 0.00000 |
| CHARLOTTE, NC | 438.0000 | 220.84631 |
| CHICAGO, IL | 723.1818 | 186.16912 |
| CINCINNATI, OH | 454.7500 | 187.64417 |
| CLEVELAND, OH | 429.5000 | 103.41600 |
| CO SPRINGS, CO | 1293.6667 | 379.81271 |
| COLUMBIA, SC | 429.0000 | 254.43270 |
| COLUMBUS, OH | 426.7500 | 140.73971 |
| DALLAS, TX | 734.6000 | 358.30236 |
| DAYTON, OH | 511.0000 | 271.14706 |
| DENVER, CO | 963.0714 | 430.12395 |
| DES MOINES, IA | 504.3333 | 152.34013 |
| DETROIT, MI | 570.5000 | 297.47527 |
| EL PASO, TX | 636.5000 | 382.40119 |
| FLINT, MI | 644.0000 | 0.00000 |
| FT. LAUDERDALE, FL | 1047.1538 | 271.57575 |
| FT. MYERS, FL | 965.8750 | 208.09826 |
| GREENSBORO, NC | 514.0000 | 155.46294 |
| HARLINGEN, TX | 369.0000 | 95.14500 |
| HARRISBURG, PA | 595.0000 | 0.00000 |
| HARTFORD, CT | 909.0000 | 434.82198 |
| HOUSTON, TX | 789.7826 | 390.18248 |
| INDIANAPOLIS, IN | 644.3846 | 306.37784 |
| ISLIP/LONG ISLAND, NY | 1093.0000 | 0.00000 |
| JACKSON, MS | 443.5000 | 135.06838 |
| JACKSONVILLE, FL | 667.7778 | 274.76901 |
| KANSAS CITY, MO | 740.0625 | 269.92520 |
| KNOXVILLE, TN | 475.0000 | 0.00000 |
| LAS VEGAS, NV | 1275.1739 | 573.62036 |
| LEXINGTON/FRANKFORT, KY | 313.0000 | 10.69045 |
| LINCOLN, NE | 467.0000 | 0.00000 |
| LITTLE ROCK, AR | 421.2500 | 93.26486 |
| LONG BEACH, CA | 1220.0000 | 0.00000 |
| LOS ANGELES, CA | 1593.2000 | 637.19816 |
| LOUISVILLE, KY | 777.8000 | 528.97672 |
| LUBBOCK, TX | 469.0000 | 195.40727 |
| MANCHESTER, NH | 843.0000 | 0.00000 |
| MEMPHIS, TN | 700.8750 | 423.08649 |
| MIAMI, FL | 1190.3158 | 487.64425 |
| MIDLAND/ODESSA, TX | 461.5000 | 208.32091 |
| MILWAUKEE, WI | 937.8333 | 459.14534 |
| MINNEAPOLIS, MN | 878.3333 | 368.21695 |
| MISSION/MCALLEN, TX | 392.0000 | 81.24742 |
| MOBILE, AL | 302.0000 | 0.00000 |
| MYRTLE BEACH, SC | 648.7500 | 118.63979 |
| NASHVILLE, TN | 742.5263 | 395.51949 |
| NEW ORLEANS, LA | 786.1000 | 382.82678 |
| NEW YORK, NY | 880.8837 | 513.87218 |
| NEWBURGH, NY | 1122.0000 | 0.00000 |
| NEWPORT NEWS/WMSBURG, VA | 508.0000 | 0.00000 |
| NORFOLK, VA | 875.8571 | 681.45149 |
| OAKLAND, CA | 1191.1000 | 710.40176 |
| OKLAHOMA CITY, OK | 743.5000 | 366.93845 |
| OMAHA, NE | 796.0000 | 336.38927 |
| ONTARIO, CA | 1202.6154 | 690.79704 |
| ORLANDO, FL | 941.1458 | 386.73362 |
| PHILADELPHIA, PA | 854.8929 | 525.56457 |
| PHOENIX, AZ | 1221.1316 | 620.30645 |
| PITTSBURGH, PA | 888.4737 | 563.86167 |
| PORTLAND, ME | 516.6667 | 285.94257 |
| PORTLAND, OR | 1515.2174 | 688.61568 |
| PROVIDENCE, RI | 1222.0000 | 665.04443 |
| RALEIGH/DURHAM, NC | 838.3684 | 528.42288 |
| RENO, NV | 910.0000 | 699.63939 |
| RICHMOND, VA | 521.1250 | 294.64530 |
| ROCHESTER, NY | 527.6667 | 291.34212 |
| SACRAMENTO, CA | 1276.4286 | 854.67286 |
| SALT LAKE CITY, UT | 1056.2143 | 579.06724 |
| SAN ANTONIO, TX | 825.5652 | 366.16478 |
| SAN DIEGO, CA | 1363.2286 | 700.47109 |
| SAN FRANCISCO, CA | 1596.3182 | 805.08455 |
| SAN JOSE, CA | 1267.1250 | 813.50979 |
| SANTA ANA, CA | 1219.7143 | 795.59335 |
| SANTA BARBARA, CA | 262.0000 | 0.00000 |
| SARASOTA, FL | 1025.1667 | 106.65249 |
| SAVANNAH, GA | 567.0000 | 261.22509 |
| SEATTLE, WA | 1512.4889 | 662.66938 |
| SPOKANE, WA | 671.0000 | 257.09623 |
| ST. LOUIS, MO | 752.1667 | 396.62996 |
| SYRACUSE, NY | 663.2500 | 321.87399 |
| TALLAHASSEE, FL | 406.0000 | 0.00000 |
| TAMPA, FL | 1009.5476 | 564.92421 |
| TOLEDO, OH | 921.0000 | 0.00000 |
| TUCSON, AZ | 887.5385 | 490.18710 |
| TULSA, OK | 690.4615 | 396.18150 |
| VALPARAISO, FL | 264.0000 | 0.00000 |
| WASHINGTON, DC | 1012.5273 | 724.95973 |
| WEST PALM BEACH, FL | 1095.8667 | 368.87366 |
| WHITE PLAINS, NY | 621.5000 | 124.54374 |
| WICHITA, KS | 581.0000 | 261.26155 |

Therefore, the average distance of flights to Detroit is:

idx <- grep("detroit", data_table[["destin"]], ignore.case = TRUE)
data_table[["mean"]][idx]

[1] 570.5
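
The same lookup can be written with dplyr verbs; a minimal sketch, assuming dplyr is attached as above:

filter(data_table, grepl("detroit", destin, ignore.case = TRUE)) %>%
    pull(mean)  # extract the mean column as a plain vector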

Tabulation of filtered results

filter(data, year == 1997, fare < 55) %>%
    with(table(origin, destin)) %>%
    knitr::kable()
| origin | MYRTLE BEACH, SC | RENO, NV | SAN JOSE, CA | ST. LOUIS, MO | TUCSON, AZ |
|--------|------------------|----------|--------------|---------------|------------|
| ATLANTIC CITY, NJ | 1 | 0 | 0 | 0 | 0 |
| LOS ANGELES, CA | 0 | 0 | 0 | 0 | 1 |
| LOUISVILLE, KY | 0 | 0 | 0 | 1 | 0 |
| OAKLAND, CA | 0 | 1 | 0 | 0 | 0 |
| PORTLAND, OR | 0 | 1 | 0 | 0 | 0 |
| RENO, NV | 0 | 0 | 1 | 0 | 0 |
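
Base R's xtabs produces the same cross-tabulation; a sketch under the same filter (drop.unused.levels guards against factor columns):

filter(data, year == 1997, fare < 55) %>%
    xtabs(~ origin + destin, data = ., drop.unused.levels = TRUE)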

Correlation Table

filter(data, year == "2000") %>%
    select(dist, fare, passen) %>%
    cor() %>%
    knitr::kable()
|        | dist       | fare       | passen     |
|--------|------------|------------|------------|
| dist   | 1.0000000  | 0.5993319  | -0.0680743 |
| fare   | 0.5993319  | 1.0000000  | -0.1230791 |
| passen | -0.0680743 | -0.1230791 | 1.0000000  |

The results match expectations. Each variable is perfectly correlated with itself, so the diagonal entries are 1. dist is positively correlated with fare, since longer flights tend to be more expensive, and dist is negatively correlated with passen, since fewer passengers take long-distance trips. The same reasoning explains the negative correlation between fare and passen.
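
To check that the dist-fare correlation is statistically distinguishable from zero, cor.test can be run on the same year-2000 subset; a minimal sketch:

# Pearson correlation test for dist vs. fare in 2000
with(filter(data, year == 2000), cor.test(dist, fare))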

Fitting the linear model

model <- lm( fare ~ dist + bmktshr + passen, data=data)
summary(model)

Call:
lm(formula = fare ~ dist + bmktshr + passen, data = data)

Residuals:
    Min      1Q  Median      3Q     Max 
-146.50  -43.81  -11.87   39.15  216.28 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 52.995247   4.488055  11.808  < 2e-16 ***
dist         0.087485   0.001645  53.177  < 2e-16 ***
bmktshr     69.430756   5.122536  13.554  < 2e-16 ***
passen      -0.004942   0.001057  -4.677 2.99e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 57.13 on 4592 degrees of freedom
Multiple R-squared:  0.4183,    Adjusted R-squared:  0.4179 
F-statistic:  1101 on 3 and 4592 DF,  p-value: < 2.2e-16
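
As a quick sanity check, the fitted model can be used to predict the fare on a hypothetical route; the predictor values below are illustrative and not taken from the data:

# predicted fare for an illustrative route (made-up values)
predict(model, newdata = data.frame(dist = 1000, bmktshr = 0.6, passen = 600))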

To test whether the mean of each variable differs significantly from zero, we run one-sample t-tests; alongside the test statistic, each reports the sample mean and a confidence interval for it.

t.test(data$fare)

    One Sample t-test

data:  data$fare
t = 161.87, df = 4595, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 176.6313 180.9622
sample estimates:
mean of x 
 178.7968 

One-sample t-test for fare
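
t.test also accepts a nonzero null value through its mu argument; as a sketch, the benchmark of 170 below is an arbitrary value chosen for illustration:

# testing H0: mean fare = 170 (illustrative benchmark)
t.test(data$fare, mu = 170)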

t.test(data$bmktshr)

    One Sample t-test

data:  data$bmktshr
t = 210.56, df = 4595, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 0.6044344 0.6157955
sample estimates:
mean of x 
0.6101149 

One-sample t-test for bmktshr

t.test(data$passen)

    One Sample t-test

data:  data$passen
t = 53.168, df = 4595, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 613.3426 660.3058
sample estimates:
mean of x 
 636.8242 

One-sample t-test for passen

t.test(data$dist)

    One Sample t-test

data:  data$dist
t = 109.67, df = 4595, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
  972.0519 1007.4381
sample estimates:
mean of x 
  989.745 

One-sample t-test for dist

To test dist at the .01 significance level, note that a two-sided test at level .01 corresponds to a 99% confidence interval. We therefore set conf.level = 0.99:

t.test(data$dist, conf.level = 0.99)

    One Sample t-test

data:  data$dist
t = 109.67, df = 4595, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
99 percent confidence interval:
  966.4888 1013.0012
sample estimates:
mean of x 
  989.745 

One-sample t-test for dist with a 99% confidence interval
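
The same idea carries over to the regression coefficients: confint reports intervals at any requested level. A minimal sketch at the 99% level, matching two-sided tests at .01:

# 99% confidence intervals for the model coefficients
confint(model, level = 0.99)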

These are the coefficients of the model:

summary(model)$coef
                Estimate  Std. Error   t value     Pr(>|t|)
(Intercept) 52.995247275 4.488054584 11.808067 1.015786e-31
dist         0.087485328 0.001645179 53.176794 0.000000e+00
bmktshr     69.430755685 5.122535725 13.553982 4.582865e-41
passen      -0.004942295 0.001056728 -4.676979 2.994895e-06

As expected from the correlation matrix, the dist and passen coefficients carry the same signs as their correlations with fare: positive for dist, negative for passen. The bmktshr coefficient is positive as well.

To see how much of the variation is unexplained, we refer to \(1 - R^2\), which is 0.5816954.
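
This number comes straight from the fit:

# share of the variance in fare left unexplained by the model
1 - summary(model)$r.squared

[1] 0.5816954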

Table of values

These values appeared in the previous sections; for reference, they are collected here:

table1 <- summary(model)$coefficients %>% as.data.frame %>%
    select("Estimate", "t value", "Pr(>|t|)")
table2 <- c("totalobserv" = summary(model)$residuals %>% length(),
            "R-squared"   = summary(model)$r.squared,
            "F-statistic" = summary(model)$fstatistic[["value"]])
knitr::kable(list(table1, table2))
|             | Estimate   | t value   | Pr(>\|t\|) |
|-------------|------------|-----------|------------|
| (Intercept) | 52.9952473 | 11.808067 | 0e+00      |
| dist        | 0.0874853  | 53.176794 | 0e+00      |
| bmktshr     | 69.4307557 | 13.553982 | 0e+00      |
| passen      | -0.0049423 | -4.676979 | 3e-06      |

|             | x            |
|-------------|--------------|
| totalobserv | 4596.0000000 |
| R-squared   | 0.4183046    |
| F-statistic | 1100.7221469 |

estout package

There is an R package of the same name that plays the equivalent role. The LaTeX code it produces, however, does not display in HTML output.

library(estout)
eststo(model)  # store the fitted model
esttab(t.value = TRUE, round.dec = 2, caption = "Reg Table", label = "Reg")

Using outreg2

It seems that the R equivalent of esttab is actually closer to Stata's outreg. What this function does is wrap the estimates in a comprehensive format that can be used to compare different models. This is in fact the very description given by the estout R package, which is modeled on Stata's estout.
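
To illustrate the comparison use case, a second specification can be stored before calling esttab. A sketch under the calls shown above; model2 is a hypothetical alternative that adds concen:

# hypothetical second specification, stored alongside the first
model2 <- lm(fare ~ dist + bmktshr + passen + concen, data = data)
eststo(model)
eststo(model2)
esttab(t.value = TRUE, round.dec = 2, caption = "Model comparison", label = "cmp")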