Load Packages

pacman::p_load(sjPlot, predictshine, ggeffects, rio, dplyr, ggplot2, DT, sjlabelled , sjmisc , sjstats , ggeffects, tidyverse, httpuv, reshape2, corrplot,FactoMineR,  highcharter, lubridate, plotly,
              broom, stargazer, plm, GGally, CausalImpact )

## Installing package into 'C:/Users/engha/Documents/R/win-library/3.3'
## (as 'lib' is unspecified)

## Bioconductor version 3.4 (BiocInstaller 1.24.0), ?biocLite for help

## A new version of Bioconductor is available after installing the most
##   recent version of R; see http://bioconductor.org/install

Case #1: Milk Field Experiment

Retail milk prices in the US follow a peculiar pattern. Prices across milk types (Skim, 1%, 2% or Whole milk) vary depending on where you live and which store you happen to patronize. At some stores, prices are equal, or flat, across all fat content. At other stores, prices are not flat - they decrease with fat content, so that whole milk is most expensive and skim is the cheapest option.

Let’s import the data and create some new features.

data <- import("Identification.xlsx" , sheet = "Milk")

data$Income <- data$`Per Capita Income in the zipcode`
data$`Flat Pricing?` <- as.factor(data$`Flat Pricing?`)
data$Condition <- data$`Flat Pricing?`
data$Whole.Milk.Share <- data$`Share of whole milk`

datatable(data, class = "cell-border stripe", options = list(pageLength = 5))

Impact of Pricing Structure of Whole Milk Share

In markets where milk prices are equal (flat) across fat alternatives, people are much more likely to choose whole milk over lower-calorie alternatives.

fit <- lm(Whole.Milk.Share ~ Condition, data = data)

mydf <- ggpredict(fit, terms = c("Condition"))
plot(mydf)

Experiment Validity

Share of whole milk (vs.low fat options) could depend on local demographics (income in this case) and not just the pricing structure.

a <- ggplot(data, aes(x = Income, y = Whole.Milk.Share)) + geom_point() + geom_smooth()

b <- ggplot(data, aes(x = log(Income), y = Whole.Milk.Share)) + geom_point() + 
    geom_smooth()

gridExtra::grid.arrange(a, b, ncol = 2)

## `geom_smooth()` using method = 'gam'
## `geom_smooth()` using method = 'gam'

There are no significant differences in income across Flat vs. Non-flat stores.

fit <- lm(Income ~ Condition, data = data)

mydf <- ggpredict(fit, terms = c("Condition"))
plot(mydf)

Differential Impact of Pricing by Income

Shift to the lower-calorie options is particularly pronounced in lower-income neighborhoods, which is important because obesity has a disproportionate impact on lower-income groups. Here we use Quintiles of Income to capture non-linearities.

# Cut into quintiles

data$Income.Quintile <- with(data, cut(data$Income, quantile(data$Income, (0:5)/5), 
    include.lowest = TRUE))

data$Income.Quintile <- factor(data$Income.Quintile, labels = c("Low Income", 
    "Quint2", "Quint3", "Quint4", "High Income"))

fit <- lm(Whole.Milk.Share ~ Income.Quintile + Condition + Income.Quintile * 
    Condition, data = data)

stargazer(fit, type = "text")

## 
## ================================================================================
##                                                          Dependent variable:    
##                                                      ---------------------------
##                                                           Whole.Milk.Share      
## --------------------------------------------------------------------------------
## Income.QuintileQuint2                                         -0.112***         
##                                                                (0.017)          
##                                                                                 
## Income.QuintileQuint3                                         -0.168***         
##                                                                (0.017)          
##                                                                                 
## Income.QuintileQuint4                                         -0.216***         
##                                                                (0.017)          
##                                                                                 
## Income.QuintileHigh Income                                    -0.270***         
##                                                                (0.018)          
##                                                                                 
## ConditionNon Flat Pricing                                     -0.128***         
##                                                                (0.016)          
##                                                                                 
## Income.QuintileQuint2:ConditionNon Flat Pricing                 0.022           
##                                                                (0.022)          
##                                                                                 
## Income.QuintileQuint3:ConditionNon Flat Pricing                 0.025           
##                                                                (0.022)          
##                                                                                 
## Income.QuintileQuint4:ConditionNon Flat Pricing               0.079***          
##                                                                (0.022)          
##                                                                                 
## Income.QuintileHigh Income:ConditionNon Flat Pricing          0.085***          
##                                                                (0.022)          
##                                                                                 
## Constant                                                      0.527***          
##                                                                (0.013)          
##                                                                                 
## --------------------------------------------------------------------------------
## Observations                                                    1,708           
## R2                                                              0.283           
## Adjusted R2                                                     0.279           
## Residual Std. Error                                       0.135 (df = 1698)     
## F Statistic                                           74.554*** (df = 9; 1698)  
## ================================================================================
## Note:                                                *p<0.1; **p<0.05; ***p<0.01

mydf <- ggpredict(fit, terms = c("Income.Quintile", "Condition"))
plot(mydf)

data1 <- data %>% 
  group_by(Income.Quintile, Condition) %>% 
  summarise(share = round(mean(Whole.Milk.Share), digits = 2))
datatable(data1)

Unobserved Taste Differences

But do the people in the markets where whole milk is more expensive just happen to be prone to making healthier or lower-calorie choices anyway? Another check is to look at the market shares of other products (diet soda in this case), because regular and diet soda are priced the same. If it is the case that people in markets with flat milk prices tend to prefer higher-calorie options, we should also observe lower shares of diet soda in these markets.

Exercise: Compare the market share of Diet Soda share in flat and non-flat markets.

Questions:
1) What is the market share of Diet Soda in Flat vs. Non Flat Markets?

# Create dataset of Flat markets
data_flat <- data %>%
   filter(`Flat Pricing?` == "Flat Pricing")
# Create dataset of Non Flat markets
data_nonflat <- data %>%
   filter(`Flat Pricing?` == "Non Flat Pricing")
(mean(data_flat$`Share of Diet Soda`))

## [1] 0.3534162

(mean(data_nonflat$`Share of Diet Soda`))

## [1] 0.3481864

The market share of Diet Soda in Flat vs. Non Flat Markets is 35.3% and 34.8% respectively.

data2 <- data %>% 
  group_by(Condition) %>% 
  summarise(share = round(mean(`Share of Diet Soda`), digits = 3))
datatable(data2)

Show correlation of share of diet soda and Log Income

cor(data$`Share of Diet Soda`, log(data$Income))

## [1] 0.6847082

Run the following regression and show predicted values of Share of Diet Soda by Income Quintiles and Flat/Non Flat.

fit1 <- lm(data$`Share of Diet Soda` ~ Income.Quintile + Condition, data = data)

stargazer(fit1, type = "text")

## 
## ======================================================
##                                Dependent variable:    
##                            ---------------------------
##                               `Share of Diet Soda`    
## ------------------------------------------------------
## Income.QuintileQuint2               0.062***          
##                                      (0.006)          
##                                                       
## Income.QuintileQuint3               0.099***          
##                                      (0.006)          
##                                                       
## Income.QuintileQuint4               0.128***          
##                                      (0.006)          
##                                                       
## Income.QuintileHigh Income          0.194***          
##                                      (0.006)          
##                                                       
## ConditionNon Flat Pricing            -0.003           
##                                      (0.004)          
##                                                       
## Constant                            0.255***          
##                                      (0.005)          
##                                                       
## ------------------------------------------------------
## Observations                          1,708           
## R2                                    0.419           
## Adjusted R2                           0.418           
## Residual Std. Error             0.077 (df = 1702)     
## F Statistic                 245.904*** (df = 5; 1702) 
## ======================================================
## Note:                      *p<0.1; **p<0.05; ***p<0.01

mydf1 <- ggpredict(fit1, terms = c("Income.Quintile", "Condition"))
plot(mydf1)

Case #2: Bing it ON

Microsoft’s “Bing It On” campaign purports to show that users prefer the company’s search engine to Google’s in a majority of blind tests. Recently, Ian Ayres (faculty at Yale Law) ran a blind test at BingItOn.com with 1,000 people recruited through Amazon’s Mechanical Turk. The paper concludes that Bing’s claims are misleading and are based on search words provided by the company. This in turn warrants legal scrutiny under the Lanham Act on false advertising.

In the file “BingitOn”" sheet in the file “Identification.xlsx” you are provided the data used in this study. There are approximately 900 participants in the experiment that were randomly assigned to one of the 3 groups based on what search words to use (variable: “Search Type”):

1: Popular searches (based on 2012 most popular google search words)
2: Bing suggested search words
3: User-generated search words

The key variable of interest is “Preference” coded as 1-Bing Wins, 2-Tie, and 3-Google wins. Data also contains an additional variable “Gender” (1=Male, 2=Female).

Objective: Analyze the relationship between “Search Type” and “Preference”.

#pacman::p_install_gh(c("strengejacke/strengejacke"))

bing<-import("Identification.xlsx", sheet="BingitOn")

str(bing)

## 'data.frame':    985 obs. of  3 variables:
##  $ Search_Type: num  2 1 2 1 3 2 1 2 3 3 ...
##  $ Preference : num  1 3 3 1 2 3 1 3 3 3 ...
##  $ Gender     : num  1 1 1 2 1 1 1 2 1 1 ...

## Recoding bing$Search_Type
bing$Search_Type <- as.character(bing$Search_Type)
bing$Search_Type[bing$Search_Type == "2"] <- "Bing Suggested"
bing$Search_Type[bing$Search_Type == "1"] <- "Popular Searches"
bing$Search_Type[bing$Search_Type == "3"] <- "User Generated"

## Recoding bing$Preference
bing$Preference <- as.character(bing$Preference)
bing$Preference[bing$Preference == "1"] <- "Bing WIns"
bing$Preference[bing$Preference == "3"] <- "Google WIns"
bing$Preference[bing$Preference == "2"] <- "Tie"
bing$Preference <- factor(bing$Preference)

## Recoding bing$Gender
bing$Gender <- as.character(bing$Gender)
bing$Gender[bing$Gender == "1"] <- "Male"
bing$Gender[bing$Gender == "2"] <- "Female"
bing$Gender <- factor(bing$Gender)

bing <- to_label(bing)
str(bing)

## Classes 'tbl_df', 'tbl' and 'data.frame':    985 obs. of  3 variables:
##  $ Search_Type: chr  "Bing Suggested" "Popular Searches" "Bing Suggested" "Popular Searches" ...
##  $ Preference : Factor w/ 3 levels "Bing WIns","Google WIns",..: 1 2 2 1 3 2 1 2 2 2 ...
##  $ Gender     : Factor w/ 2 levels "Female","Male": 2 2 2 1 2 2 2 1 2 2 ...

# Frequency
set_label(bing$Search_Type) <- "Type of Search"
set_label(bing$Preference) <- "Preference"
set_label(bing$Gender) <- "Gender"

sjplot(bing, Search_Type, Preference, Gender, fun = "frq")

Relationship between Search Type & Preference

# Grouped frequencies
bing %>% sjplot(Search_Type, Preference, fun = "grpfrq", title = "Is there a relationship b/w Search Type & Preference?", show.summary = TRUE, summary.pos = "l")

## Warning in (function (var.cnt, var.grp, type = c("bar", "dot", "line",
## "boxplot", : NAs introduced by coercion

Relationship between Preference & Gender

# Grouped frequencies
bing %>% sjplot(Preference, Gender, fun = "grpfrq", title = "Is there a relationship b/w Preference & Gender?", 
    show.summary = TRUE, summary.pos = "l")

Is this a Valid Experiment?

# Grouped frequencies
bing %>% sjplot(Search_Type, Gender, fun = "grpfrq", title = "Is there a relationship b/w Search Type & Gender?", 
    show.summary = TRUE, summary.pos = "l")

## Warning in (function (var.cnt, var.grp, type = c("bar", "dot", "line",
## "boxplot", : NAs introduced by coercion

Exercise #2

Exercise: Test the relationship between search type and preference by gender.

Case 3: Impact of Marijuana Legalization and Home Prices

# pacman::p_load_gh('tomliptrot/predictshine')

sqft <- import("Identification.xlsx", sheet = "Zillow")

sq1 = gather(sqft, time, value, -c(RegionID, RegionName, State, SizeRank))
sq1$time = as.Date(paste(sq1$time, "-01", sep = ""))

datatable(head(sq1))

The data we are using is from Zillow. The main variable of interest is column labeled “value” which is the selling pice per square foot.

Data is for all major metro areas in the US. Our focus is on state of Colorado which legalized marijuana for recreational use in Jan 2014. Lets start with a quick plot of prices in Denver and compare it to overall US prices.

Denver vs.Overall US Price

US = subset(sq1, RegionName == "United States")
Denver = subset(sq1, RegionName == "Denver")

df = cbind.data.frame(Denver$time, Denver$value, US$value)
names(df) = c("date", "Denver_Median_House", "US_Median_House")
data_long = melt(df, id = "date")

plot1 = ggplot(data = data_long, aes(x = date, y = value, colour = variable)) + 
    geom_line() + geom_vline(xintercept = as.numeric(as.Date("2014-01-01")), 
    linetype = 4) + geom_text(aes(x = as.Date("2014-01-01"), label = "Legalization", 
    y = 200), colour = "darkgreen", angle = 90, vjust = -1, size = 4)

plot1

Causal Impact

How do we estimate causal impact of a phenomenon? Note that we are doing a controlled experiment here. We will use a recent R package by Google called “Causal Impact”. There is a short Youtube video that is worth watching. The general idea in this package is to develop a Bayesian time series model based on (multiple) comparable control groups to ask “what-if” questions.

# Dates
time.points = as.Date(df$date)
data = zoo(cbind(df[2:3]), time.points)

# Specify pre- and post-period
pre.period <- as.Date(c("1996-04-01", "2014-01-01"))
post.period <- as.Date(c("2014-01-02", "2017-02-01"))

# Causal impact analysis
impact = CausalImpact(data, pre.period, post.period)

plot(impact, c("original", "pointwise"))

## Warning: Removed 251 rows containing missing values (geom_path).

## Warning: Removed 251 rows containing missing values (geom_path).

print(impact)

## Posterior inference {CausalImpact}
## 
##                          Average      Cumulative  
## Actual                   198          7311        
## Prediction (s.d.)        167 (4.5)    6174 (167.0)
## 95% CI                   [158, 176]   [5851, 6513]
##                                                   
## Absolute effect (s.d.)   31 (4.5)     1137 (167.0)
## 95% CI                   [22, 39]     [798, 1460] 
##                                                   
## Relative effect (s.d.)   18% (2.7%)   18% (2.7%)  
## 95% CI                   [13%, 24%]   [13%, 24%]  
## 
## Posterior tail-area probability p:   0.00102
## Posterior prob. of a causal effect:  99.89775%
## 
## For more details, type: summary(impact, "report")

Print the Report

summary(impact, "report")

## Analysis report {CausalImpact}
## 
## 
## During the post-intervention period, the response variable had an average value of approx. 197.59. By contrast, in the absence of an intervention, we would have expected an average response of 166.87. The 95% interval of this counterfactual prediction is [158.13, 176.01]. Subtracting this prediction from the observed response yields an estimate of the causal effect the intervention had on the response variable. This effect is 30.72 with a 95% interval of [21.58, 39.46]. For a discussion of the significance of this effect, see below.
## 
## Summing up the individual data points during the post-intervention period (which can only sometimes be meaningfully interpreted), the response variable had an overall value of 7.31K. By contrast, had the intervention not taken place, we would have expected a sum of 6.17K. The 95% interval of this prediction is [5.85K, 6.51K].
## 
## The above results are given in terms of absolute numbers. In relative terms, the response variable showed an increase of +18%. The 95% interval of this percentage is [+13%, +24%].
## 
## This means that the positive effect observed during the intervention period is statistically significant and unlikely to be due to random fluctuations. It should be noted, however, that the question of whether this increase also bears substantive significance can only be answered by comparing the absolute effect (30.72) to the original goal of the underlying intervention.
## 
## The probability of obtaining this effect by chance is very small (Bayesian one-sided tail-area probability p = 0.001). This means the causal effect can be considered statistically significant.

Baltimore as a Control

Why Baltimore? Its the closest city in size according to Zillow. Of course what we want is a price that is closely related to Denver real estate but did not go through the intervention.

Baltimore = subset(sq1, RegionName == "Baltimore")
df2 = cbind.data.frame(Denver$time, Denver$value, Baltimore$value)
names(df2) = c("date", "Denver_Median_House", "Baltimore_Median_House")

# Dates
data2 = zoo(cbind(df2[2:3]), time.points)

# Analysis
impact2 = CausalImpact(data2, pre.period, post.period)

plot(impact2, c("original", "pointwise"))

## Warning: Removed 251 rows containing missing values (geom_path).

## Warning: Removed 251 rows containing missing values (geom_path).

print(impact2)

## Posterior inference {CausalImpact}
## 
##                          Average      Cumulative  
## Actual                   198          7311        
## Prediction (s.d.)        162 (4.8)    5993 (178.5)
## 95% CI                   [153, 172]   [5651, 6357]
##                                                   
## Absolute effect (s.d.)   36 (4.8)     1318 (178.5)
## 95% CI                   [26, 45]     [954, 1660] 
##                                                   
## Relative effect (s.d.)   22% (3%)     22% (3%)    
## 95% CI                   [16%, 28%]   [16%, 28%]  
## 
## Posterior tail-area probability p:   0.00102
## Posterior prob. of a causal effect:  99.89775%
## 
## For more details, type: summary(impact, "report")

summary(impact2, "report")

## Analysis report {CausalImpact}
## 
## 
## During the post-intervention period, the response variable had an average value of approx. 197.59. By contrast, in the absence of an intervention, we would have expected an average response of 161.97. The 95% interval of this counterfactual prediction is [152.74, 171.80]. Subtracting this prediction from the observed response yields an estimate of the causal effect the intervention had on the response variable. This effect is 35.62 with a 95% interval of [25.79, 44.85]. For a discussion of the significance of this effect, see below.
## 
## Summing up the individual data points during the post-intervention period (which can only sometimes be meaningfully interpreted), the response variable had an overall value of 7.31K. By contrast, had the intervention not taken place, we would have expected a sum of 5.99K. The 95% interval of this prediction is [5.65K, 6.36K].
## 
## The above results are given in terms of absolute numbers. In relative terms, the response variable showed an increase of +22%. The 95% interval of this percentage is [+16%, +28%].
## 
## This means that the positive effect observed during the intervention period is statistically significant and unlikely to be due to random fluctuations. It should be noted, however, that the question of whether this increase also bears substantive significance can only be answered by comparing the absolute effect (35.62) to the original goal of the underlying intervention.
## 
## The probability of obtaining this effect by chance is very small (Bayesian one-sided tail-area probability p = 0.001). This means the causal effect can be considered statistically significant.

Exercise #3

Here is a subset of university Towns: Use Ann Arbor as a control

University Towns

Boulder = subset(sq1, RegionName == "Boulder")
Indianapolis = subset(sq1, RegionName == "Indianapolis")
Ann_Arbor = subset(sq1, RegionName == "Ann Arbor")

University = cbind.data.frame(Boulder$time, Boulder$value, Indianapolis$value, 
    Ann_Arbor$value)
names(University) = c("date", "Boulder_Median_House", "Indianapolis_Median_House", 
    "Ann_Arbor_Median_House")
University_data_long = melt(University, id = "date")

ggplot(data = University_data_long, aes(x = date, y = value, colour = variable)) + 
    geom_line() + geom_vline(xintercept = as.numeric(as.Date("2014-01-01")), 
    linetype = 4) + geom_text(aes(x = as.Date("2014-01-01"), label = "Legalization", 
    y = 250), colour = "darkgreen", angle = 90, vjust = -1, size = 4)

D3M Exercise 1

Alvin Eng

August 15, 2017