In this homework exercise, the goal is mostly to do some regression analysis in stata.

Firstly, it is essential to go over the AIRFARE dataset:

library(foreign)

data <- read.dta("airfare.dta")
head(data) %>%
    knitr::kable()

year	origin	destin	id	dist	passen	fare	bmktshr	ldist	y98	y99	y00	lfare	ldistsq	concen	lpassen
1997	AKRON, OH	ATLANTA, GA	1	528	152	106	0.8386	6.269096	0	0	0	4.663439	39.30157	0.8386	5.023880
1998	AKRON, OH	ATLANTA, GA	1	528	265	106	0.8133	6.269096	1	0	0	4.663439	39.30157	0.8133	5.579730
1999	AKRON, OH	ATLANTA, GA	1	528	336	113	0.8262	6.269096	0	1	0	4.727388	39.30157	0.8262	5.817111
2000	AKRON, OH	ATLANTA, GA	1	528	298	123	0.8612	6.269096	0	0	1	4.812184	39.30157	0.8612	5.697094
1997	AKRON, OH	ORLANDO, FL	2	861	282	104	0.5798	6.758094	0	0	0	4.644391	45.67184	0.5798	5.641907
1998	AKRON, OH	ORLANDO, FL	2	861	178	105	0.5817	6.758094	1	0	0	4.653960	45.67184	0.5817	5.181784

Calculating parameters

group_by(data, year) %>%
    summarise(mean=mean(fare), median=median(fare),
              q1=quantile(fare,.25), q3=quantile(fare,.75)) %>%
    knitr::kable(caption="Stat. values of fair for various years")

Stat. values of fair for various years
year	mean	median	q1	q3
1997	173.7520	160	116	225
1998	175.4413	167	121	219
1999	177.9704	168	124	224
2000	188.0235	178	132	233

Flight’s distance for flight destination

data_table <- group_by(data, destin) %>%
    summarise(mean=mean(dist), sd=sd(dist))
knitr::kable(data_table, caption="mean and standard deviation for flight distances based on cities" )

mean and standard deviation for flight distances based on cities
destin	mean	sd
ATLANTA, GA	883.3333	317.23502
AUSTIN, TX	612.6667	170.56342
BALTIMORE, MD	1197.0000	479.86437
BATON ROUGE, LA	449.0000	0.00000
BIRMINGHAM, AL	682.0000	0.00000
BOSTON, MA	1052.2000	703.20601
BUFFALO, NY	554.0000	168.90910
CHARLESTON, SC	259.0000	0.00000
CHARLOTTE, NC	438.0000	220.84631
CHICAGO, IL	723.1818	186.16912
CINCINNATI, OH	454.7500	187.64417
CLEVELAND, OH	429.5000	103.41600
CO SPRINGS, CO	1293.6667	379.81271
COLUMBIA, SC	429.0000	254.43270
COLUMBUS, OH	426.7500	140.73971
DALLAS, TX	734.6000	358.30236
DAYTON, OH	511.0000	271.14706
DENVER, CO	963.0714	430.12395
DES MOINES, IA	504.3333	152.34013
DETROIT, MI	570.5000	297.47527
EL PASO, TX	636.5000	382.40119
FLINT, MI	644.0000	0.00000
FT. LAUDERDALE, FL	1047.1538	271.57575
FT. MYERS, FL	965.8750	208.09826
GREENSBORO, NC	514.0000	155.46294
HARLINGEN, TX	369.0000	95.14500
HARRISBURG, PA	595.0000	0.00000
HARTFORD, CT	909.0000	434.82198
HOUSTON, TX	789.7826	390.18248
INDIANAPOLIS, IN	644.3846	306.37784
ISLIP/LONG ISLAND, NY	1093.0000	0.00000
JACKSON, MS	443.5000	135.06838
JACKSONVILLE, FL	667.7778	274.76901
KANSAS CITY, MO	740.0625	269.92520
KNOXVILLE, TN	475.0000	0.00000
LAS VEGAS, NV	1275.1739	573.62036
LEXINGTON/FRANKFORT, KY	313.0000	10.69045
LINCOLN, NE	467.0000	0.00000
LITTLE ROCK, AR	421.2500	93.26486
LONG BEACH, CA	1220.0000	0.00000
LOS ANGELES, CA	1593.2000	637.19816
LOUISVILLE, KY	777.8000	528.97672
LUBBOCK, TX	469.0000	195.40727
MANCHESTER, NH	843.0000	0.00000
MEMPHIS, TN	700.8750	423.08649
MIAMI, FL	1190.3158	487.64425
MIDLAND/ODESSA, TX	461.5000	208.32091
MILWAUKEE, WI	937.8333	459.14534
MINNEAPOLIS, MN	878.3333	368.21695
MISSION/MCALLEN, TX	392.0000	81.24742
MOBILE, AL	302.0000	0.00000
MYRTLE BEACH, SC	648.7500	118.63979
NASHVILLE, TN	742.5263	395.51949
NEW ORLEANS, LA	786.1000	382.82678
NEW YORK, NY	880.8837	513.87218
NEWBURGH, NY	1122.0000	0.00000
NEWPORT NEWS/WMSBURG, VA	508.0000	0.00000
NORFOLK, VA	875.8571	681.45149
OAKLAND, CA	1191.1000	710.40176
OKLAHOMA CITY, OK	743.5000	366.93845
OMAHA, NE	796.0000	336.38927
ONTARIO, CA	1202.6154	690.79704
ORLANDO, FL	941.1458	386.73362
PHILADELPHIA, PA	854.8929	525.56457
PHOENIX, AZ	1221.1316	620.30645
PITTSBURGH, PA	888.4737	563.86167
PORTLAND, ME	516.6667	285.94257
PORTLAND, OR	1515.2174	688.61568
PROVIDENCE, RI	1222.0000	665.04443
RALEIGH/DURHAM, NC	838.3684	528.42288
RENO, NV	910.0000	699.63939
RICHMOND, VA	521.1250	294.64530
ROCHESTER, NY	527.6667	291.34212
SACRAMENTO, CA	1276.4286	854.67286
SALT LAKE CITY, UT	1056.2143	579.06724
SAN ANTONIO, TX	825.5652	366.16478
SAN DIEGO, CA	1363.2286	700.47109
SAN FRANCISCO, CA	1596.3182	805.08455
SAN JOSE, CA	1267.1250	813.50979
SANTA ANA, CA	1219.7143	795.59335
SANTA BARBARA, CA	262.0000	0.00000
SARASOTA, FL	1025.1667	106.65249
SAVANNAH, GA	567.0000	261.22509
SEATTLE, WA	1512.4889	662.66938
SPOKANE, WA	671.0000	257.09623
ST. LOUIS, MO	752.1667	396.62996
SYRACUSE, NY	663.2500	321.87399
TALLAHASSEE, FL	406.0000	0.00000
TAMPA, FL	1009.5476	564.92421
TOLEDO, OH	921.0000	0.00000
TUCSON, AZ	887.5385	490.18710
TULSA, OK	690.4615	396.18150
VALPARAISO, FL	264.0000	0.00000
WASHINGTON, DC	1012.5273	724.95973
WEST PALM BEACH, FL	1095.8667	368.87366
WHITE PLAINS, NY	621.5000	124.54374
WICHITA, KS	581.0000	261.26155

Therefore, the average distance of flights from Detroit is:

result = grep("detroit", ignore.case = TRUE, data_table[["destin"]]) %>% 
    data_table[["mean"]][.]

570.5

Tabulation of filtered results

filter(data, year == "1997", fare < 55) %>% with(table(origin, destin)) %>%
    knitr::kable()

	MYRTLE BEACH, SC	RENO, NV	SAN JOSE, CA	ST. LOUIS, MO	TUCSON, AZ
ATLANTIC CITY, NJ	1	0	0	0	0
LOS ANGELES, CA	0	0	0	0	1
LOUISVILLE, KY	0	0	0	1	0
OAKLAND, CA	0	1	0	0	0
PORTLAND, OR	0	1	0	0	0
RENO, NV	0	0	1	0	0

Correlation Table

filter(data, year == "2000") %>%
    select(dist, fare, passen) %>%
    cor() %>%
    knitr::kable()

	dist	fare	passen
dist	1.0000000	0.5993319	-0.0680743
fare	0.5993319	1.0000000	-0.1230791
passen	-0.0680743	-0.1230791	1.0000000

The results match expectations. Firstly, there’s unit covariance between same elements. Then, dist has positive cov with fare as greater distance is more expensive. Also, dist has negative cor with passen since fewer passengers go on distant trips. Similarly for fare and pass.

Regressing the linear model

model <- lm( fare ~ dist + bmktshr + passen, data=data)
summary(model)


Call:
lm(formula = fare ~ dist + bmktshr + passen, data = data)

Residuals:
    Min      1Q  Median      3Q     Max 
-146.50  -43.81  -11.87   39.15  216.28 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 52.995247   4.488055  11.808  < 2e-16 ***
dist         0.087485   0.001645  53.177  < 2e-16 ***
bmktshr     69.430756   5.122536  13.554  < 2e-16 ***
passen      -0.004942   0.001057  -4.677 2.99e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 57.13 on 4592 degrees of freedom
Multiple R-squared:  0.4183,    Adjusted R-squared:  0.4179 
F-statistic:  1101 on 3 and 4592 DF,  p-value: < 2.2e-16

To see significance for different values, we perform a t-test which also gives variances and answers other questions.

t.test(data$fare)


    One Sample t-test

data:  data$fare
t = 161.87, df = 4595, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 176.6313 180.9622
sample estimates:
mean of x 
 178.7968

fare

t.test(data$bmktshr)


    One Sample t-test

data:  data$bmktshr
t = 210.56, df = 4595, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 0.6044344 0.6157955
sample estimates:
mean of x 
0.6101149

bmktshr

t.test(data$passen)


    One Sample t-test

data:  data$passen
t = 53.168, df = 4595, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 613.3426 660.3058
sample estimates:
mean of x 
 636.8242

passen

t.test(data$dist)


    One Sample t-test

data:  data$dist
t = 109.67, df = 4595, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
  972.0519 1007.4381
sample estimates:
mean of x 
  989.745

dist

For the case of testing dist for .01, one trick is to offset the null hypothesis by .01. This is done by setting the significance level to .99 as below:

t.test(data$dist, conf.level = 0.99)


    One Sample t-test

data:  data$dist
t = 109.67, df = 4595, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
99 percent confidence interval:
  966.4888 1013.0012
sample estimates:
mean of x 
  989.745

dist with .99

These are the coefficients of the model:

summary(model)$coef

                Estimate  Std. Error   t value     Pr(>|t|)
(Intercept) 52.995247275 4.488054584 11.808067 1.015786e-31
dist         0.087485328 0.001645179 53.176794 0.000000e+00
bmktshr     69.430755685 5.122535725 13.553982 4.582865e-41
passen      -0.004942295 0.001056728 -4.676979 2.994895e-06

As we expected from the cor matrix, dist and passen have the respective correlation with fair. Same is for bmktshr.

To see how much of the variation is unxplained, we refer to \(1- R^2\) which is 0.5816954.

Table of values

This table was given in previous sections. For reference, it is again given here:

table1 <- summary(model)$coeff %>% as.data.frame %>%
    select("Estimate", "t value", "Pr(>|t|)")
table2 <- c("totalobserv"= summary(model)$res %>% length(), 
            "R-squared"= summary(model)$r.squared, 
            "F-statistic"= summary(model)$fstatistic[1])
knitr::kable(list(table1, table2))

	Estimate	t value	Pr(>\|t\|)
(Intercept)	52.9952473	11.808067	0e+00
dist	0.0874853	53.176794	0e+00
bmktshr	69.4307557	13.553982	0e+00
passen	-0.0049423	-4.676979	3e-06

	x
totalobserv	4596.0000000
R-squared	0.4183046
F-statistic.value	1100.7221469

estout package

There is an equivalent package for estout. The latex code however, does not display in html format.

library(estout)
eststo(model)
esttab(t.value=TRUE,round.dec=2,caption="Reg Table",label="Reg")

Using the outreg2

It seems that the r equivalent of esttab is actually an equivalent of outreg. What this function does is to wrap the data in a compherensive format that can be used for comparison between different models. This is in fact the very description given by the estout R package which is based on the stata estout.

STATA HW 2

Mohammad Ali Mohammadi

October 26, 2018