Let \(Y_i\) denote the sales for individual \(i\) and \(D_i\) denote a treatment indicator for whether individual \(i\) was exposed to an online advertisement. Refer to chapter 1 (pages 4-12) of Mastering Metrics to answer the following questions.
Answer: The sales for an individual and the individual exposed to an online advertisement.
Answer: Someone without the sale versus someone with the sale.
Answer: The number sold versus not sold.
Answer: The average not sold and exposed to an online advertisement versus the average not sold and not exposed to an online advertisement.
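As a reference point for the answers above, the chapter's decomposition of a difference in group means can be written as below. The potential-outcome symbols \(Y_{1i}\) (sales if exposed) and \(Y_{0i}\) (sales if not exposed) and the constant effect \(\kappa\) are notation assumed here, following the textbook's constant-effects setup; they are not defined in the text above, and expectations are used in place of sample averages.

\[
\underbrace{E[Y_i \mid D_i = 1] - E[Y_i \mid D_i = 0]}_{\text{difference in group means}}
= \underbrace{\kappa}_{\text{causal effect}}
+ \underbrace{E[Y_{0i} \mid D_i = 1] - E[Y_{0i} \mid D_i = 0]}_{\text{selection bias}}
\]

The identity follows from \(Y_{1i} = Y_{0i} + \kappa\): the exposed group's average is \(\kappa + E[Y_{0i} \mid D_i = 1]\), and the unexposed group's average is \(E[Y_{0i} \mid D_i = 0]\).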
The following code generates data where selection bias is present.
seed.nb = 3887
set.seed(seed.nb, kind = "Mersenne-Twister")
n = 1000 # sample size
error.term = rnorm(n) # unobserved component, drawn from the standard normal distribution
x1 = rnorm(n) # A random sample of size 1000 from the standard normal distribution
D = ifelse(error.term + x1 < 0, 1, 0) # How people get assigned to the treatment group (health insurance vs. no health insurance): assignment depends on an observed variable (x1) and an unobserved variable (error.term), so there is some bias.
y = 2*D + x1 + error.term # if people have health insurance, their health index increases by 2
# beta0 = 0, alpha = 2, beta1 = 1
Here \(n\) is the sample size, \(x_1\) is a predictor variable, \(D\) is a binary variable that indicates the presence of a treatment, and \(y\) is a continuous dependent variable. By answering the following questions, we will understand how the model behaves when selection bias is present in the data.
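One way to see that the bias comes from the data-generating process and not from one particular seed is to repeat the simulation many times. The sketch below is a minimal illustration under the same assumptions as the code above; `simulate.once` is a hypothetical helper name, and the seed and the number of replications are arbitrary choices.

```r
# Hypothetical helper: draw one data set from the process above and
# return the OLS coefficient on D from lm(y ~ D + x1).
simulate.once = function(n = 1000) {
  error.term = rnorm(n)
  x1 = rnorm(n)
  D = ifelse(error.term + x1 < 0, 1, 0)  # selection on x1 and the error term
  y = 2*D + x1 + error.term              # true treatment effect is 2
  unname(coef(lm(y ~ D + x1))["D"])
}

set.seed(1)  # arbitrary seed, not the one used in the assignment
estimates = replicate(200, simulate.once())
mean(estimates)   # average estimate across replications
range(estimates)  # spread of the estimates
```

If selection bias is at work, the average should sit well below the true value of 2 instead of scattering around it.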
Answer: 2
lm.fit1 = lm(y ~ D + x1)
summary(lm.fit1)
##
## Call:
## lm(formula = y ~ D + x1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.51342 -0.49338 0.01122 0.49195 3.06488
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.86745 0.03630 23.894 < 2e-16 ***
## D 0.31199 0.05710 5.464 5.87e-08 ***
## x1 0.55148 0.02796 19.723 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7409 on 997 degrees of freedom
## Multiple R-squared: 0.3056, Adjusted R-squared: 0.3043
## F-statistic: 219.4 on 2 and 997 DF, p-value: < 2.2e-16
t.test(y[D==0], y[D==1])
##
## Welch Two Sample t-test
##
## data: y[D == 0] and y[D == 1]
## t = 5.9955, df = 997.39, p-value = 2.832e-09
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.2226595 0.4393311
## sample estimates:
## mean of x mean of y
## 1.1766737 0.8456784
x1 and the error term error.term. (Hint: plot \(x_1\) against the error term.) Why do we need to examine the relationship between and the error term to answer the question?
Answer: The error term is normally distributed with mean 0 and constant variance. Observations with D = 0 tend to have higher error terms and observations with D = 1 tend to have lower error terms, so there is bias.
x1 = independent variable = family income
jittered.D = jitter(D, amount = .1)
plot( error.term ~ jittered.D, pch =16)
cor(D, error.term)
## [1] -0.5744655
t.test( x1[D ==1], x1[D==0])
##
## Welch Two Sample t-test
##
## data: x1[D == 1] and x1[D == 0]
## t = -21.979, df = 996.88, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1.270011 -1.061821
## sample estimates:
## mean of x mean of y
## -0.6052043 0.5607117
There is bias because nothing was controlled.
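To make "nothing was controlled" concrete, the sketch below adds the error term as a regressor. This is possible only in a simulation, where the normally unobserved component is known; it is an illustration of the idea, not a method available with real data. The name `oracle.fit` is introduced here for the example.

```r
seed.nb = 3887
set.seed(seed.nb, kind = "Mersenne-Twister")
n = 1000
error.term = rnorm(n)
x1 = rnorm(n)
D = ifelse(error.term + x1 < 0, 1, 0)
y = 2*D + x1 + error.term

# Infeasible "oracle" regression: error.term is never observed in real data.
# Because y is an exact linear function of D, x1 and error.term, the fit is
# exact and the coefficient on D equals the true effect.
oracle.fit = lm(y ~ D + x1 + error.term)
coef(oracle.fit)
```

Once the component that drives selection is held fixed, the coefficient on D returns to the value of 2 used to generate the data.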
t.test(y[D==1] , y[D==0])
##
## Welch Two Sample t-test
##
## data: y[D == 1] and y[D == 0]
## t = -5.9955, df = 997.39, p-value = 2.832e-09
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.4393311 -0.2226595
## sample estimates:
## mean of x mean of y
## 0.8456784 1.1766737
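x1 between the treatment and control groups? Conduct a formal test and comment on the results. Why do we see the difference in the values of x1?
Answer: Yes, there is a difference in the average value of x1 between the treatment and control groups: the Welch t-test gives a mean of -0.605 for D = 1 versus 0.561 for D = 0, with p-value < 2.2e-16. Their centers are at opposite ends of the graph because assignment to treatment depends on x1 (D = 1 when error.term + x1 < 0) and nothing is controlled, so the two groups are not comparable.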
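The size of this gap can be worked out from the data-generating code, assuming x1 and error.term are independent standard normal draws as in that code. The symbols \(s\) and \(\varepsilon\) (for error.term) are introduced here only for the derivation.

\[
s = x_1 + \varepsilon \sim N(0, 2), \qquad E[x_1 \mid s] = \frac{s}{2}
\]

\[
E[x_1 \mid D = 1] = E[x_1 \mid s < 0] = \frac{1}{2}\, E[s \mid s < 0] = \frac{1}{2}\left(-\sqrt{2}\,\sqrt{\frac{2}{\pi}}\right) = -\frac{1}{\sqrt{\pi}} \approx -0.564
\]

By symmetry, \(E[x_1 \mid D = 0] \approx 0.564\). These values are close to the sample means of -0.605 and 0.561 reported by the t-test.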
Modify D.2 in the following code so that
D.2 is independently generated from a Bernoulli(\(p\)) distribution with \(p = 0.5\). That will generate data where
selection bias is NOT present. By answering the following questions, we
will understand how the model behaves when selection bias is NOT present
in the data.
seed.nb = 3887
set.seed(seed.nb, kind = "Mersenne-Twister")
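n = 1000 # sample size
error.term.2 = rnorm(n) # unobserved component
x1.2 = rnorm(n) # observed predictor
D = rbinom(n, size = 1, prob = 0.5) # treatment assigned by a fair coin flip, independent of x1.2 and error.term.2
y = 2*D + x1.2 + error.term.2 # treatment effect parameter = 2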
The treatment variable was modified to be drawn from a binomial distribution with size 1 and probability 0.5. This simulates a fair coin flip: assignment is random, with no selection bias.
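A quick check of the coin-flip assignment is to look at the share treated and at the correlation between D and the other components. The sketch below regenerates the data with the same seed so that it runs on its own; the thresholds one would call "near zero" are a judgment call.

```r
seed.nb = 3887
set.seed(seed.nb, kind = "Mersenne-Twister")
n = 1000
error.term.2 = rnorm(n)
x1.2 = rnorm(n)
D = rbinom(n, size = 1, prob = 0.5)  # coin-flip assignment

mean(D)                 # share treated, expected to be near 0.5
cor(D, error.term.2)    # expected to be near 0 under random assignment
cor(D, x1.2)            # expected to be near 0 under random assignment
```

For comparison, the correlation between D and the error term in the selection-bias data was -0.574.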
The p-value is low, which means the results are statistically significant and there is a difference in the outcome between the treatment and control groups.
lm.fit2 = lm( y ~ D + x1.2)
summary(lm.fit2)
##
## Call:
## lm(formula = y ~ D + x1.2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4116 -0.6291 -0.0073 0.7077 3.1989
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.09988 0.04542 2.199 0.0281 *
## D 1.89584 0.06411 29.570 <2e-16 ***
## x1.2 1.02296 0.03141 32.571 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.014 on 997 degrees of freedom
## Multiple R-squared: 0.6577, Adjusted R-squared: 0.657
## F-statistic: 957.7 on 2 and 997 DF, p-value: < 2.2e-16
1.89584 + 2* 0.06411
## [1] 2.02406
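The calculation above adds two standard errors to the estimate, which approximates the upper end of a 95% interval. A sketch of getting both ends from the fitted model with `confint()` follows; it regenerates the same data so the block runs on its own, and `ci.D` is a name introduced here.

```r
seed.nb = 3887
set.seed(seed.nb, kind = "Mersenne-Twister")
n = 1000
error.term.2 = rnorm(n)
x1.2 = rnorm(n)
D = rbinom(n, size = 1, prob = 0.5)
y = 2*D + x1.2 + error.term.2

lm.fit2 = lm(y ~ D + x1.2)
ci.D = confint(lm.fit2, "D", level = 0.95)  # t-based 95% interval for the D coefficient
ci.D
```

The true value of 2 used to generate the data can then be compared with the interval.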
t.test(y[D==0], y[D==1])
##
## Welch Two Sample t-test
##
## data: y[D == 0] and y[D == 1]
## t = -20.36, df = 996.72, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -2.054653 -1.693410
## sample estimates:
## mean of x mean of y
## 0.1023842 1.9764157
x1.2’s between the treatment and control groups? Why do we check the difference in the values of x1.2? Conduct a formal test and comment on the results.
Answer: No, there is no evidence of a difference in the average x1.2 between the treatment and control groups.
t.test( x1.2[D==0], x1.2[D==1])
##
## Welch Two Sample t-test
##
## data: x1.2[D == 0] and x1.2[D == 1]
## t = 0.32983, df = 995.57, p-value = 0.7416
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.1055029 0.1481345
## sample estimates:
## mean of x mean of y
## 0.002445246 -0.018870545
Main question of the article: Airbnb redesigned its website and wants to see if the new design has a significant effect on the booking rate.
(Online Platform Design at Airbnb) Online businesses like Airbnb invest time and money into platform design. How quickly and easily someone is able to browse Airbnb’s inventory to find an appropriate place to stay can greatly affect booking decisions in the short-term, as well as future bookings with the company.
Airbnb recently used an online experiment to test the effectiveness of a redesigned search page on its website. The proposed design (shown below) specifically emphasized pictures of the listings and the map that displays where listings are located. Before rolling out the new design permanently, Airbnb wanted to make sure that it would have a positive impact on its users. They randomly selected a subset of users and assigned them into two groups: the treatment group would have access to the redesigned search page while the control group would continue to see the old design. The booking rate was then used to test for differences.
The airbnb data set contains n = 10,000 observations from the online experiment. Each observation is a website session and includes the following variables: book indicates whether a purchase was made, and platform indicates the treatment condition where 0 is the old search page and 1 is the redesigned page.
After downloading the data from Carmen, set the path and load the data into R using
load("airbnb.RData")
airbnb$book[1:10]
## [1] 0 0 1 0 0 0 1 1 1 0
airbnb$platform[1:10]
## [1] 0 0 1 1 0 1 1 1 1 0
Booking rates for the two conditions
mean (airbnb$book[airbnb$platform == 0])
## [1] 0.3211809
mean (airbnb$book[airbnb$platform ==1])
## [1] 0.343832
t.test( airbnb$book[airbnb$platform==0], airbnb$book[airbnb$platform==1])
##
## Welch Two Sample t-test
##
## data: airbnb$book[airbnb$platform == 0] and airbnb$book[airbnb$platform == 1]
## t = -2.4042, df = 9985.1, p-value = 0.01623
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.041119396 -0.004182847
## sample estimates:
## mean of x mean of y
## 0.3211809 0.3438320
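The new search page has a statistically significant effect: the p-value is 0.01623, which is smaller than 0.05, so we reject the null hypothesis of equal booking rates. The booking rate is higher with the new page (0.344 versus 0.321), so we can conclude that the new page is more effective than the old one.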
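The same comparison can be phrased as a regression: with a single 0/1 regressor, the OLS slope equals the difference in group means. The sketch below uses simulated stand-in data, not the real airbnb data set; the name `airbnb.sim`, the seed, and the booking rates of 0.32 and 0.34 are assumptions chosen only to resemble the rates reported above.

```r
# Simulated stand-in data, NOT the real airbnb data set.
set.seed(1)  # arbitrary seed
n.sim = 10000
platform = rbinom(n.sim, size = 1, prob = 0.5)               # hypothetical random assignment
book = rbinom(n.sim, size = 1, prob = 0.32 + 0.02*platform)  # assumed booking rates
airbnb.sim = data.frame(book = book, platform = platform)

diff.means = mean(airbnb.sim$book[airbnb.sim$platform == 1]) -
  mean(airbnb.sim$book[airbnb.sim$platform == 0])
lm.sim = lm(book ~ platform, data = airbnb.sim)

diff.means
coef(lm.sim)["platform"]  # same number as diff.means
```

On the real data, `lm(book ~ platform, data = airbnb)` would be the analogous call, assuming the data frame is loaded as shown earlier.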
Another variable I would include is the device type, that is, whether the old or new page was viewed on a phone or tablet. I am sure a lot of people research Airbnbs from phones and tablets, so you are missing a large group of people by only looking at the new web page from a computer.