##The policy problem

Several cities aim to support public transportation and increase ridership. There are several ways to encourage residents to utilize public transportation more often; one of those is to modify the transit schedule so as to increase the frequency of buses in peak times and reduce waiting time.

City A has opted for this solution and starting from May 1st has implemented a new bus schedule which increases bus frequency. At the end of the year, the mayor wants to estimate whether the decision was effective and ask you to analyze one year of ridership data to test the following hypothesis:

An increase in bus frequency has a positive effect on bus ridership in City A (i.e., the new schedule is effective).

#Load Data

## [1] 1328 1407 1425 1252 1287

#Create time series model variables

time <- c(1:365)
treatment <- c(rep(0, 120), rep(1, 245))
timesince <- c(rep(0, 120), 1:245)
#checking vector length of variables
#length(time)
#length(treatment)
#length(timesince)
#adding all vectors to a data frame (time series data fram)
tsdf <- data.frame(passengers, time, treatment, timesince)
prev <- rbind(head(tsdf), 
                 c("...", "...", "...", "..."),  
                 tsdf[ 118:120, ], 
                 c("**START**","**TREATMENT**","---","---"), 
                 tsdf[ 121:123, ],  
                 c("...","...","...","..."),
                 tail(tsdf))
prev

#Summary Statitics of Passenger traffic

#summary statistics
summary(tsdf$passengers)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1100    1334    1401    1404    1469    1700
#library(pastecs)
#stat.desc(tsdf$passengers)

#Visualize passenger traffic

plot(tsdf$time, tsdf$passengers, 
     bty="n", pch=19, col="gray",
     ylim = c(1000, 1800), xlim = c(0, 400),
     xlab = "Time (days)",
     ylab = "Number of Passengers")

abline(v=120, col="tomato", lty=2)

text(120, 1780, "Start of New Schedule", col="tomato", cex=1.3, pos=4)

ts <- lm(passengers ~ time + treatment + timesince, data=tsdf)
lines(tsdf$time, ts$fitted.values, col="steelblue", lwd=2)

#Run the time series model

library(stargazer)
## 
## Please cite as:
##  Hlavac, Marek (2018). stargazer: Well-Formatted Regression and Summary Statistics Tables.
##  R package version 5.2.2. https://CRAN.R-project.org/package=stargazer
regTS <- lm(passengers ~ time + treatment + timesince, data=tsdf)

stargazer(regTS, 
          type = "text",
          dep.var.labels = ("Passengers"),
          column.labels = ("Model Results"),
          covariate.labels = c("time", "treatment", "timesince"),
          omit.stat = "all",
          digist = 2)
## 
## =====================================
##               Dependent variable:    
##           ---------------------------
##                   Passengers         
##                  Model Results       
## -------------------------------------
## time                 0.107           
##                     (0.178)          
##                                      
## treatment           21.682           
##                    (15.018)          
##                                      
## timesince          0.503***          
##                     (0.188)          
##                                      
## Constant         1,327.890***        
##                    (12.421)          
##                                      
## =====================================
## =====================================
## Note:     *p<0.1; **p<0.05; ***p<0.01
## 
## =
## 2
## -

#Questions:

2b. What does the intercept represent?

ANSWER: The intercept is the expected mean value. 1328 passengers is what we would expect if all other variables were zero.

2c. What does the coefficient of Time represent?

ANSWER: The time coefficient represents the trend of passengers before treatent. Each day, the number of passengers increases by 0.107 on average according to the model.

2d. Which coefficient represents the immediate effect of the policy?

ANSWER: The Treatment coefficient - 21.7.

2e. Which coefficient represents the sustained effect of the policy?

ANSWER: The Time Since coefficent - 0.50.

#Looking at the results more closely

3a. Has the new schedule increased or decreased the use of public transportation in the short term?

ANSWER: The new schedule increased the use of public transportation short term. This increase is represented by the numerical value of 21.7 and is NOT significant.

3b. Has the new schedule increased or decreased the use of public transportation in the long term?

ANSWER: The use of public trasoprtation has increased long term since the implementation of the new schedule. This increase is represented by the numerical value of 0.50 and is significant.

3c. Provide a brief (1-2 statements) possible explanation for these results.

ANSWER: The jump in short term increase could be a result of persons utilizing the new schedule for the first time. The smaller long term increase could represent the idea that these new persons, and possibly others, continued using the public trasportation system after the policy implimentation.

#An important aspect of a time series model is the counterfactual.

4a. What is the number of passengers 100 days after the intervention?

#index 120+100 = 220 or time 100
tsdf$passengers[timesince==100]
## [1] 1314
tsdf$passengers[time==220]
## [1] 1314
1327.890+0.107*220+21.682+0.503*100
## [1] 1423.412

ANSWER: 1314 (from data) From the model, 1327.890 + 0.107(220) + 21.682(1) + 0.503(100) = 1423.412.

4b. What is the counterfactual? Provide both its formula and estimation.

ANSWER: Formula is Y = b0 + b1T + e = 1327.890 + 0.107T. The counterfactual at time (120+100) is: 1327.890 + 0.107(220) = 1351.43.

#looking at day one of treatment
data1<-as.data.frame(cbind(time=220, treatment=1, timesince=100)) 
data2<-as.data.frame(cbind(time=220, treatment=0, timesince=0))
#predicting passengers
y1 <- predict(regTS, data1)
y2 <- predict(regTS, data2)
#plotting visualization
plot(tsdf$passengers,
     bty="n", col="gray", pch=19,
     xlim=c(1,400), ylim=c(1000, 1800),
     xlab="Time (days)", ylab="Passengers")
points(220, y1, col="dodgerblue4", pch=19, bg="dodgerblue4", cex=2)
text(220, y1, labels="Y at t=220", pos=4, cex=1)
points(220, y2, col="darkorange", pch=19, cex=2)
text(220, y2, labels="C at t=220", pos=4, cex=1)
abline(v=120, col="tomato", lty=2)

4c. What would the counterfactual be after 150 days?

t150 <- 1327.890+0.107*150
t150
## [1] 1343.94

ANSWER: Time = 150 -> 1343.94 passengers

4d. Are the two counterfactuals the same? Why?

Answer: The numercial representaion of the counterfacutals are not the same. This is because the regression model of the counterfactual still shows and increase in passengers over time.

#A final visualization of the time series and counterfacutal

#estimate all predicted valsues of Y
pred1 <- predict(regTS, tsdf)
#create new data frame with no treatment i.e. the counterfactual 
datanew <- as.data.frame(cbind(time=rep(1:365), treatment=rep(0), timesince=(0)))
#predict the counterfactual 
pred2 <- predict(regTS, datanew)
#finally, the plot
plot(tsdf$passengers,
     bty="n", col=gray(0.5,0.5), pch=19,
     xlim=c(1,400), ylim=c(1000, 1800),
     xlab="Time (days)", ylab = "Passengers")
lines(rep(1:120), pred1[1:120], col="dodgerblue4", lwd=3)
lines(rep(121:365), pred1[121:365], col="dodgerblue4", lwd=3)
lines(rep(121:365), pred2[121:365], col="darkorange", lwd=3, lty=6)

text(0, 1260, labels="Predicted Values", pos=4, cex=1, col="dodgerblue4")
text(260, 1300, labels="Counterfactual", pos=4, cex=1, col="darkorange")