Load the dataset in R and create a new column called ‘rDate’ that converts the ‘date’ column into the Date datatype. 5%
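library(dplyr)  # provides mutate(), filter(), and %>%
eb_df <- read.csv("eBayData.csv")
# Coerce to character so as.Date() parses the text rather than factor codes
eb_df$date = as.character(eb_df$date)
eb_df = mutate(eb_df, rDate = as.Date(date, "%m/%d/%Y"))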
eb_df$isTreatmentGroup = factor(eb_df$isTreatmentGroup)
eb_df$isTreatmentPeriod = factor(eb_df$isTreatmentPeriod)
eb_df$dma = factor(eb_df$dma)
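As a quick sanity check on the date conversion, the sketch below parses a few made-up strings with the same format string. The sample strings are assumptions for illustration, not rows from eBayData.csv. It relies on as.Date() returning NA, rather than an error, for strings that do not match the supplied format.

```r
# Hypothetical date strings in month/day/year layout (not taken from eBayData.csv)
sample_dates <- c("5/22/2012", "04/01/2012", "not a date")
parsed <- as.Date(sample_dates, format = "%m/%d/%Y")

class(parsed)    # the result is a Date vector
format(parsed)   # ISO year-month-day strings; NA where parsing failed

# Unparseable strings become NA, so counting NAs is a cheap check
n_failed <- sum(is.na(parsed))
n_failed
```

On the real data, sum(is.na(eb_df$rDate)) should be 0 if every row follows this format.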
Determine the date that started the treatment period. That is, write code to determine the earliest date in the treatment period. How were the treatment and control groups treated differently during this period? 5%
treatment = eb_df %>% filter(isTreatmentGroup == 1)
control = eb_df %>% filter(isTreatmentGroup == 0)
# Earliest date in the treatment period
tr_date = min(filter(treatment, isTreatmentPeriod == 1)$rDate)
tr_date
## [1] "2012-05-22"
The treatment period starts on May 22, 2012. During this period, DMAs in the treatment group were no longer shown eBay search ads, while DMAs in the control group continued to be shown them. Comparing the two groups lets researchers estimate the effect of search ads on revenue.
# treatment_treatpeoriod = treatment %>% filter(isTreatmentPeriod == 1)
# treatment_Nontreatpeoriod = treatment %>% filter(isTreatmentPeriod == 0)
# daily_rev_line = eb_df %>%
# group_by(rDate, isTreatmentGroup) %>%
# summarize(total_daily_revenue = sum(revenue)) %>%
# ggplot(aes(x = rDate, y = total_daily_revenue, col = isTreatmentGroup)) +
# geom_line()+
# geom_vline(aes(xintercept = tr_date), linetype = 2) +
# # geom_hline(aes(yintercept = mean(treatment_treatpeoriod$revenue)), linetype = 2) +
# geom_hline(aes(yintercept = mean(treatment_Nontreatpeoriod$revenue)), linetype = 2) +
# labs(
# title = 'Revenue Between Two Groups', tag = '1)', x = 'Date', y = 'Price ($)',
# caption = "*Vertical line: Treatment start date;
# *Source from 'eBay.csv'"
# ) +
# theme(
# plot.title = element_text(size=18, face="bold"),
# axis.title.x = element_text(size=12, face="bold"),
# axis.title.y = element_text(size=12, face="bold"))
# daily_rev_line
# Mean revenue in the treatment group: treatment period vs. pre-treatment period
mean(treatment_treatpeoriod$revenue)
## [1] 128070.9
mean(treatment_Nontreatpeoriod$revenue)
## [1] 131556.9
The data contains a control group, which is shown search ads throughout the sample period, and a treatment group, which is shown search ads only before the treatment period.
Run a regression that compares log(revenue) of the treatment group in the pre-treatment period and in the treatment period. 5%
reg_1 = lm(log(treatment$revenue) ~ treatment$isTreatmentPeriod)
summary(reg_1)
##
## Call:
## lm(formula = log(treatment$revenue) ~ treatment$isTreatmentPeriod)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.0038 -0.7490 -0.0274 0.6929 3.8268
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.94865 0.01472 743.988 <2e-16 ***
## treatment$isTreatmentPeriod1 -0.03940 0.01987 -1.983 0.0474 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.252 on 16044 degrees of freedom
## Multiple R-squared: 0.0002451, Adjusted R-squared: 0.0001828
## F-statistic: 3.933 on 1 and 16044 DF, p-value: 0.04737
What do the resulting coefficient estimates say about the effectiveness of advertising? Be as specific as you can. 10%
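The coefficient on isTreatmentPeriod is -0.0394 (p = 0.0474), so within the treatment group, revenue was roughly 3.9% lower during the treatment period than before it. The difference is significant at the 5% level but not at the 1% level. The R-squared is tiny (0.0002), so the period indicator explains almost none of the variation in log(revenue). Because this is a before/after comparison within a single group, the estimate also picks up anything else that changed over time, not only the removal of search ads.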
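To make the size of the estimate concrete, the sketch below converts the log-scale coefficient from the regression output above into a percent change in revenue. The estimate and standard error are copied from that output. The 95% interval uses a normal approximation, which is an assumption made here and not part of the original analysis.

```r
# Estimate and standard error copied from the regression output above
b  <- -0.03940
se <-  0.01987

# Exact percent change implied by a dummy coefficient in a log-outcome model
pct_change <- 100 * (exp(b) - 1)

# Approximate 95% interval via the normal approximation
# (an assumption; reasonable with roughly 16,000 observations)
ci <- 100 * (exp(b + c(-1, 1) * qnorm(0.975) * se) - 1)

round(pct_change, 2)
round(ci, 2)
```

For coefficients this close to zero, the exact conversion and the shortcut of reading the coefficient directly as a percentage give nearly the same number.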
Now we will use the control group for a true experimental approach. First, we will check that the randomization was done properly.
Run a regression that compares log(revenue) of the treatment group and the control group in the pre-treatment period. 10%
pre_treatment = eb_df %>% filter(isTreatmentPeriod == 0)
reg_2 = lm(log(pre_treatment$revenue) ~ pre_treatment$isTreatmentGroup)
summary(reg_2)
##
## Call:
## lm(formula = log(pre_treatment$revenue) ~ pre_treatment$isTreatmentGroup)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.9962 -0.7502 -0.0285 0.7331 3.8229
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.96273 0.02037 538.128 <2e-16 ***
## pre_treatment$isTreatmentGroup1 -0.01408 0.02477 -0.568 0.57
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.2 on 10708 degrees of freedom
## Multiple R-squared: 3.017e-05, Adjusted R-squared: -6.322e-05
## F-statistic: 0.323 on 1 and 10708 DF, p-value: 0.5698
What is the purpose of this randomization check? What do the results of this regression show? 5%
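One way to read the regression above: with a single binary regressor, the slope is simply the difference in group means of the outcome, and its t-test matches a pooled-variance two-sample t-test. The sketch below illustrates this on simulated data. The data frame `sim` and its columns are hypothetical stand-ins, not the eBay data.

```r
set.seed(1)

# Simulated stand-in data (hypothetical values; not the eBay file)
sim <- data.frame(
  group   = factor(rep(c(0, 1), times = c(40, 60))),
  log_rev = rnorm(100, mean = 11, sd = 1.2)
)

fit <- lm(log_rev ~ group, data = sim)

# The slope on the group dummy equals the difference in group means
diff_means <- mean(sim$log_rev[sim$group == 1]) - mean(sim$log_rev[sim$group == 0])
slope_matches <- isTRUE(all.equal(unname(coef(fit)["group1"]), diff_means))

# The p-value matches a two-sample t-test that assumes equal variances
tt <- t.test(log_rev ~ group, data = sim, var.equal = TRUE)
p_matches <- isTRUE(all.equal(tt$p.value,
                              summary(fit)$coefficients["group1", "Pr(>|t|)"]))

c(slope_matches = slope_matches, p_matches = p_matches)
```

Under proper randomization, this pre-treatment difference in means should be statistically indistinguishable from zero.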
Now, using the post treatment data, determine the effectiveness of eBay ads.
Run a regression with log(revenue) as the dependent variable, and whether the DMA is in the treatment group as the independent variable. 10%
af_treatment = eb_df %>% filter(isTreatmentPeriod == 1)
reg_3 = lm(log(af_treatment$revenue) ~ af_treatment$isTreatmentGroup)
summary(reg_3)
##
## Call:
## lm(formula = log(af_treatment$revenue) ~ af_treatment$isTreatmentGroup)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.0038 -0.7546 -0.0288 0.7419 3.8268
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.916740 0.018610 586.595 <2e-16 ***
## af_treatment$isTreatmentGroup1 -0.007494 0.022632 -0.331 0.741
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.208 on 13018 degrees of freedom
## Multiple R-squared: 8.422e-06, Adjusted R-squared: -6.839e-05
## F-statistic: 0.1096 on 1 and 13018 DF, p-value: 0.7406
library(lmtest)    # coeftest()
library(sandwich)  # vcovHC()
coeftest(reg_3, vcov = vcovHC(reg_3, type = "HC1"))
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.9167401 0.0168810 646.6888 <2e-16 ***
## af_treatment$isTreatmentGroup1 -0.0074937 0.0215615 -0.3476 0.7282
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
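For intuition about what the HC1 option computes, the sketch below builds heteroskedasticity-robust standard errors by hand in base R, using the standard sandwich formula with the n / (n - k) small-sample scaling. It runs on simulated data with unequal error variance across groups. The variables `x` and `y` are hypothetical and are not the eBay data.

```r
set.seed(42)

# Simulated data with different error variance in each group (hypothetical)
n <- 200
x <- rbinom(n, 1, 0.5)
y <- 11 + 0.1 * x + rnorm(n, sd = ifelse(x == 1, 1.5, 0.8))

fit <- lm(y ~ x)
X <- model.matrix(fit)   # n x k design matrix, including the intercept
e <- residuals(fit)
k <- ncol(X)

# Sandwich estimator: (X'X)^-1 [X' diag(e^2) X] (X'X)^-1, scaled by n / (n - k)
bread  <- solve(crossprod(X))
meat   <- crossprod(X * e)   # each row of X is multiplied by its own residual
vc_hc1 <- (n / (n - k)) * bread %*% meat %*% bread

robust_se    <- sqrt(diag(vc_hc1))
classical_se <- sqrt(diag(vcov(fit)))
cbind(classical_se, robust_se)
```

The coefficient estimates are unchanged by this correction; only the standard errors, and hence the t-values and p-values, differ.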
What do the resulting coefficient estimates say about the effectiveness of advertising? Be as specific as you can. 10%
Does the R-squared of this regression affect the interpretation of, or confidence in, the estimate of the effectiveness of advertising? 10%
Throughout the analysis, regressions were run on log(revenue) rather than revenue. Was this the right choice, or would simply using revenue be more appropriate? Justify your answer. 5%
Points to consider: omitted variable bias (OVB), weak statistical significance, low R-squared, an interaction term, or other issues.
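The interaction-term idea noted above can be sketched as a difference-in-differences regression, in which the coefficient on group × period measures how much more the treatment group changed than the control group. This is an illustration on simulated data, not a result from the eBay file. The column names mirror those in eb_df, and the effect size is an assumed value.

```r
set.seed(7)

# Simulated stand-in panel (hypothetical values; column names mirror eb_df)
sim <- expand.grid(dma = 1:50, day = 1:20)
sim$isTreatmentGroup  <- as.integer(sim$dma <= 30)
sim$isTreatmentPeriod <- as.integer(sim$day > 10)

assumed_effect <- -0.05   # assumed effect for the simulation only
sim$log_revenue <- 11 +
  0.20 * sim$isTreatmentGroup -
  0.03 * sim$isTreatmentPeriod +
  assumed_effect * sim$isTreatmentGroup * sim$isTreatmentPeriod +
  rnorm(nrow(sim), sd = 0.5)

# Main effects plus interaction; the interaction coefficient is the DiD estimate
did <- lm(log_revenue ~ isTreatmentGroup * isTreatmentPeriod, data = sim)
coef(summary(did))

# The same estimate from the four cell means
m <- with(sim, tapply(log_revenue, list(isTreatmentGroup, isTreatmentPeriod), mean))
did_by_hand <- (m["1", "1"] - m["1", "0"]) - (m["0", "1"] - m["0", "0"])
did_by_hand
```

On the real data, the analogous call would be lm(log(revenue) ~ isTreatmentGroup * isTreatmentPeriod, data = eb_df), which uses both groups and both periods in a single regression.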