In this homework exercise, the goal is mostly to do some regression analysis in stata.
Firstly, it is essential to go over the AIRFARE dataset:
library(foreign)
data <- read.dta("airfare.dta")
head(data) %>%
knitr::kable()
| year | origin | destin | id | dist | passen | fare | bmktshr | ldist | y98 | y99 | y00 | lfare | ldistsq | concen | lpassen |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1997 | AKRON, OH | ATLANTA, GA | 1 | 528 | 152 | 106 | 0.8386 | 6.269096 | 0 | 0 | 0 | 4.663439 | 39.30157 | 0.8386 | 5.023880 |
| 1998 | AKRON, OH | ATLANTA, GA | 1 | 528 | 265 | 106 | 0.8133 | 6.269096 | 1 | 0 | 0 | 4.663439 | 39.30157 | 0.8133 | 5.579730 |
| 1999 | AKRON, OH | ATLANTA, GA | 1 | 528 | 336 | 113 | 0.8262 | 6.269096 | 0 | 1 | 0 | 4.727388 | 39.30157 | 0.8262 | 5.817111 |
| 2000 | AKRON, OH | ATLANTA, GA | 1 | 528 | 298 | 123 | 0.8612 | 6.269096 | 0 | 0 | 1 | 4.812184 | 39.30157 | 0.8612 | 5.697094 |
| 1997 | AKRON, OH | ORLANDO, FL | 2 | 861 | 282 | 104 | 0.5798 | 6.758094 | 0 | 0 | 0 | 4.644391 | 45.67184 | 0.5798 | 5.641907 |
| 1998 | AKRON, OH | ORLANDO, FL | 2 | 861 | 178 | 105 | 0.5817 | 6.758094 | 1 | 0 | 0 | 4.653960 | 45.67184 | 0.5817 | 5.181784 |
group_by(data, year) %>%
summarise(mean=mean(fare), median=median(fare),
q1=quantile(fare,.25), q3=quantile(fare,.75)) %>%
knitr::kable(caption="Stat. values of fair for various years")
| year | mean | median | q1 | q3 |
|---|---|---|---|---|
| 1997 | 173.7520 | 160 | 116 | 225 |
| 1998 | 175.4413 | 167 | 121 | 219 |
| 1999 | 177.9704 | 168 | 124 | 224 |
| 2000 | 188.0235 | 178 | 132 | 233 |
data_table <- group_by(data, destin) %>%
summarise(mean=mean(dist), sd=sd(dist))
knitr::kable(data_table, caption="mean and standard deviation for flight distances based on cities" )
| destin | mean | sd |
|---|---|---|
| ATLANTA, GA | 883.3333 | 317.23502 |
| AUSTIN, TX | 612.6667 | 170.56342 |
| BALTIMORE, MD | 1197.0000 | 479.86437 |
| BATON ROUGE, LA | 449.0000 | 0.00000 |
| BIRMINGHAM, AL | 682.0000 | 0.00000 |
| BOSTON, MA | 1052.2000 | 703.20601 |
| BUFFALO, NY | 554.0000 | 168.90910 |
| CHARLESTON, SC | 259.0000 | 0.00000 |
| CHARLOTTE, NC | 438.0000 | 220.84631 |
| CHICAGO, IL | 723.1818 | 186.16912 |
| CINCINNATI, OH | 454.7500 | 187.64417 |
| CLEVELAND, OH | 429.5000 | 103.41600 |
| CO SPRINGS, CO | 1293.6667 | 379.81271 |
| COLUMBIA, SC | 429.0000 | 254.43270 |
| COLUMBUS, OH | 426.7500 | 140.73971 |
| DALLAS, TX | 734.6000 | 358.30236 |
| DAYTON, OH | 511.0000 | 271.14706 |
| DENVER, CO | 963.0714 | 430.12395 |
| DES MOINES, IA | 504.3333 | 152.34013 |
| DETROIT, MI | 570.5000 | 297.47527 |
| EL PASO, TX | 636.5000 | 382.40119 |
| FLINT, MI | 644.0000 | 0.00000 |
| FT. LAUDERDALE, FL | 1047.1538 | 271.57575 |
| FT. MYERS, FL | 965.8750 | 208.09826 |
| GREENSBORO, NC | 514.0000 | 155.46294 |
| HARLINGEN, TX | 369.0000 | 95.14500 |
| HARRISBURG, PA | 595.0000 | 0.00000 |
| HARTFORD, CT | 909.0000 | 434.82198 |
| HOUSTON, TX | 789.7826 | 390.18248 |
| INDIANAPOLIS, IN | 644.3846 | 306.37784 |
| ISLIP/LONG ISLAND, NY | 1093.0000 | 0.00000 |
| JACKSON, MS | 443.5000 | 135.06838 |
| JACKSONVILLE, FL | 667.7778 | 274.76901 |
| KANSAS CITY, MO | 740.0625 | 269.92520 |
| KNOXVILLE, TN | 475.0000 | 0.00000 |
| LAS VEGAS, NV | 1275.1739 | 573.62036 |
| LEXINGTON/FRANKFORT, KY | 313.0000 | 10.69045 |
| LINCOLN, NE | 467.0000 | 0.00000 |
| LITTLE ROCK, AR | 421.2500 | 93.26486 |
| LONG BEACH, CA | 1220.0000 | 0.00000 |
| LOS ANGELES, CA | 1593.2000 | 637.19816 |
| LOUISVILLE, KY | 777.8000 | 528.97672 |
| LUBBOCK, TX | 469.0000 | 195.40727 |
| MANCHESTER, NH | 843.0000 | 0.00000 |
| MEMPHIS, TN | 700.8750 | 423.08649 |
| MIAMI, FL | 1190.3158 | 487.64425 |
| MIDLAND/ODESSA, TX | 461.5000 | 208.32091 |
| MILWAUKEE, WI | 937.8333 | 459.14534 |
| MINNEAPOLIS, MN | 878.3333 | 368.21695 |
| MISSION/MCALLEN, TX | 392.0000 | 81.24742 |
| MOBILE, AL | 302.0000 | 0.00000 |
| MYRTLE BEACH, SC | 648.7500 | 118.63979 |
| NASHVILLE, TN | 742.5263 | 395.51949 |
| NEW ORLEANS, LA | 786.1000 | 382.82678 |
| NEW YORK, NY | 880.8837 | 513.87218 |
| NEWBURGH, NY | 1122.0000 | 0.00000 |
| NEWPORT NEWS/WMSBURG, VA | 508.0000 | 0.00000 |
| NORFOLK, VA | 875.8571 | 681.45149 |
| OAKLAND, CA | 1191.1000 | 710.40176 |
| OKLAHOMA CITY, OK | 743.5000 | 366.93845 |
| OMAHA, NE | 796.0000 | 336.38927 |
| ONTARIO, CA | 1202.6154 | 690.79704 |
| ORLANDO, FL | 941.1458 | 386.73362 |
| PHILADELPHIA, PA | 854.8929 | 525.56457 |
| PHOENIX, AZ | 1221.1316 | 620.30645 |
| PITTSBURGH, PA | 888.4737 | 563.86167 |
| PORTLAND, ME | 516.6667 | 285.94257 |
| PORTLAND, OR | 1515.2174 | 688.61568 |
| PROVIDENCE, RI | 1222.0000 | 665.04443 |
| RALEIGH/DURHAM, NC | 838.3684 | 528.42288 |
| RENO, NV | 910.0000 | 699.63939 |
| RICHMOND, VA | 521.1250 | 294.64530 |
| ROCHESTER, NY | 527.6667 | 291.34212 |
| SACRAMENTO, CA | 1276.4286 | 854.67286 |
| SALT LAKE CITY, UT | 1056.2143 | 579.06724 |
| SAN ANTONIO, TX | 825.5652 | 366.16478 |
| SAN DIEGO, CA | 1363.2286 | 700.47109 |
| SAN FRANCISCO, CA | 1596.3182 | 805.08455 |
| SAN JOSE, CA | 1267.1250 | 813.50979 |
| SANTA ANA, CA | 1219.7143 | 795.59335 |
| SANTA BARBARA, CA | 262.0000 | 0.00000 |
| SARASOTA, FL | 1025.1667 | 106.65249 |
| SAVANNAH, GA | 567.0000 | 261.22509 |
| SEATTLE, WA | 1512.4889 | 662.66938 |
| SPOKANE, WA | 671.0000 | 257.09623 |
| ST. LOUIS, MO | 752.1667 | 396.62996 |
| SYRACUSE, NY | 663.2500 | 321.87399 |
| TALLAHASSEE, FL | 406.0000 | 0.00000 |
| TAMPA, FL | 1009.5476 | 564.92421 |
| TOLEDO, OH | 921.0000 | 0.00000 |
| TUCSON, AZ | 887.5385 | 490.18710 |
| TULSA, OK | 690.4615 | 396.18150 |
| VALPARAISO, FL | 264.0000 | 0.00000 |
| WASHINGTON, DC | 1012.5273 | 724.95973 |
| WEST PALM BEACH, FL | 1095.8667 | 368.87366 |
| WHITE PLAINS, NY | 621.5000 | 124.54374 |
| WICHITA, KS | 581.0000 | 261.26155 |
Therefore, the average distance of flights from Detroit is:
result = grep("detroit", ignore.case = TRUE, data_table[["destin"]]) %>%
data_table[["mean"]][.]
570.5
filter(data, year == "1997", fare < 55) %>% with(table(origin, destin)) %>%
knitr::kable()
| MYRTLE BEACH, SC | RENO, NV | SAN JOSE, CA | ST. LOUIS, MO | TUCSON, AZ | |
|---|---|---|---|---|---|
| ATLANTIC CITY, NJ | 1 | 0 | 0 | 0 | 0 |
| LOS ANGELES, CA | 0 | 0 | 0 | 0 | 1 |
| LOUISVILLE, KY | 0 | 0 | 0 | 1 | 0 |
| OAKLAND, CA | 0 | 1 | 0 | 0 | 0 |
| PORTLAND, OR | 0 | 1 | 0 | 0 | 0 |
| RENO, NV | 0 | 0 | 1 | 0 | 0 |
filter(data, year == "2000") %>%
select(dist, fare, passen) %>%
cor() %>%
knitr::kable()
| dist | fare | passen | |
|---|---|---|---|
| dist | 1.0000000 | 0.5993319 | -0.0680743 |
| fare | 0.5993319 | 1.0000000 | -0.1230791 |
| passen | -0.0680743 | -0.1230791 | 1.0000000 |
The results match expectations. Firstly, there’s unit covariance between same elements. Then, dist has positive cov with fare as greater distance is more expensive. Also, dist has negative cor with passen since fewer passengers go on distant trips. Similarly for fare and pass.
model <- lm( fare ~ dist + bmktshr + passen, data=data)
summary(model)
Call:
lm(formula = fare ~ dist + bmktshr + passen, data = data)
Residuals:
Min 1Q Median 3Q Max
-146.50 -43.81 -11.87 39.15 216.28
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 52.995247 4.488055 11.808 < 2e-16 ***
dist 0.087485 0.001645 53.177 < 2e-16 ***
bmktshr 69.430756 5.122536 13.554 < 2e-16 ***
passen -0.004942 0.001057 -4.677 2.99e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 57.13 on 4592 degrees of freedom
Multiple R-squared: 0.4183, Adjusted R-squared: 0.4179
F-statistic: 1101 on 3 and 4592 DF, p-value: < 2.2e-16
To see significance for different values, we perform a t-test which also gives variances and answers other questions.
t.test(data$fare)
One Sample t-test
data: data$fare
t = 161.87, df = 4595, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
176.6313 180.9622
sample estimates:
mean of x
178.7968
fare
t.test(data$bmktshr)
One Sample t-test
data: data$bmktshr
t = 210.56, df = 4595, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
0.6044344 0.6157955
sample estimates:
mean of x
0.6101149
bmktshr
t.test(data$passen)
One Sample t-test
data: data$passen
t = 53.168, df = 4595, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
613.3426 660.3058
sample estimates:
mean of x
636.8242
passen
t.test(data$dist)
One Sample t-test
data: data$dist
t = 109.67, df = 4595, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
972.0519 1007.4381
sample estimates:
mean of x
989.745
dist
For the case of testing dist for .01, one trick is to offset the null hypothesis by .01. This is done by setting the significance level to .99 as below:
t.test(data$dist, conf.level = 0.99)
One Sample t-test
data: data$dist
t = 109.67, df = 4595, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
99 percent confidence interval:
966.4888 1013.0012
sample estimates:
mean of x
989.745
dist with .99
These are the coefficients of the model:
summary(model)$coef
Estimate Std. Error t value Pr(>|t|)
(Intercept) 52.995247275 4.488054584 11.808067 1.015786e-31
dist 0.087485328 0.001645179 53.176794 0.000000e+00
bmktshr 69.430755685 5.122535725 13.553982 4.582865e-41
passen -0.004942295 0.001056728 -4.676979 2.994895e-06
As we expected from the cor matrix, dist and passen have the respective correlation with fair. Same is for bmktshr.
To see how much of the variation is unxplained, we refer to \(1- R^2\) which is 0.5816954.
This table was given in previous sections. For reference, it is again given here:
table1 <- summary(model)$coeff %>% as.data.frame %>%
select("Estimate", "t value", "Pr(>|t|)")
table2 <- c("totalobserv"= summary(model)$res %>% length(),
"R-squared"= summary(model)$r.squared,
"F-statistic"= summary(model)$fstatistic[1])
knitr::kable(list(table1, table2))
|
|
There is an equivalent package for estout. The latex code however, does not display in html format.
library(estout)
eststo(model)
esttab(t.value=TRUE,round.dec=2,caption="Reg Table",label="Reg")
It seems that the r equivalent of esttab is actually an equivalent of outreg. What this function does is to wrap the data in a compherensive format that can be used for comparison between different models. This is in fact the very description given by the estout R package which is based on the stata estout.