To run this notebook, first download the data file A1.RData from Canvas. Also, make sure you have loaded the MSR package into your workspace.

# load the package 
library(MSR)

# load the data for the assignment
load("A1.RData")

Question 1

Question 1a

For Question 1a, you can report either the plots or the association between Sales and the independent variables (IVs). Here, I will show both.

Price vs. Sales

# the Pearson correlation
cor(cadbury$Sales, cadbury$Price)

## [1] -0.8336852

# the scatter plot
plot(Sales ~ Price, data = cadbury, type = "p")

Feature vs. Sales

As Feature is a categorical variable, we do not use a simple correlation here. Instead, you can run an ANOVA to test whether Sales differs significantly between featured and non-featured days.

# running the anova 
summary(aov(Sales ~ Feature, cadbury))

##             Df Sum Sq Mean Sq F value  Pr(>F)    
## Feature      1 256515  256515   54.47 3.4e-10 ***
## Residuals   66 310795    4709                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# the box plot
plot(Sales ~ Feature, data = cadbury)
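
With only two groups, the one-way ANOVA above is equivalent to a two-sample t-test with pooled variance: the F statistic is the squared t statistic and the p-values coincide. The sketch below illustrates this on synthetic data; the data frame `toy` and all of its values are made up for illustration and are not the cadbury data.

```r
# Synthetic data (assumption): 30 non-featured and 30 featured days
set.seed(1)
toy <- data.frame(
  Feature = factor(rep(c("No", "Yes"), each = 30)),
  Sales   = c(rnorm(30, mean = 200, sd = 50), rnorm(30, mean = 280, sd = 50))
)

# One-way ANOVA, same call pattern as in the text
anova_tab <- summary(aov(Sales ~ Feature, data = toy))[[1]]
f_value <- anova_tab[["F value"]][1]
p_anova <- anova_tab[["Pr(>F)"]][1]

# Two-sample t-test with pooled (equal) variance
tt <- t.test(Sales ~ Feature, data = toy, var.equal = TRUE)

# Both comparisons should return TRUE
all.equal(f_value, unname(tt$statistic)^2)
all.equal(p_anova, tt$p.value)
```

The ANOVA form is convenient because it extends unchanged to factors with more than two levels.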

We can do the same analysis for Display.

# running the anova 
summary(aov(Sales ~ Display, cadbury))

##             Df Sum Sq Mean Sq F value   Pr(>F)    
## Display      1 174016  174016    29.2 9.62e-07 ***
## Residuals   66 393293    5959                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# the box plot
plot(Sales ~ Display, data = cadbury)

Question 1c

To obtain the VIF values of the variables in the data, we use the vif function. The function takes in the names of the variables and the data frame.

vif(c("Price","Feature","Display"), cadbury)

## FeatureYes DisplayYes      Price 
##   1.733758   1.138592   1.679551
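
For intuition, the textbook definition of the VIF for predictor j is 1 / (1 - R²_j), where R²_j comes from regressing predictor j on all the other predictors. The sketch below computes this by hand in base R on synthetic data. It is not the MSR implementation, and the data frame `toy` with its 0/1 dummies is an assumption made for illustration.

```r
# Synthetic predictors (assumption): one numeric and two 0/1 dummies
set.seed(42)
n <- 200
price   <- rnorm(n, mean = 12, sd = 1.5)
# Featuring is made more likely at low prices so the predictors correlate
feature <- rbinom(n, size = 1, prob = plogis(-(price - 12)))
display <- rbinom(n, size = 1, prob = 0.4)
toy <- data.frame(price, feature, display)

# VIF_j = 1 / (1 - R^2_j), with R^2_j from the auxiliary regression
# of predictor j on all remaining predictors
vif_by_hand <- sapply(names(toy), function(v) {
  aux <- lm(reformulate(setdiff(names(toy), v), response = v), data = toy)
  1 / (1 - summary(aux)$r.squared)
})
vif_by_hand

# Cross-check: the same values sit on the diagonal of the inverse
# correlation matrix of the predictors
all.equal(unname(vif_by_hand), unname(diag(solve(cor(toy)))))
```

A VIF of 1 means the predictor is uncorrelated with the others; larger values indicate that more of its variance is explained by the other predictors.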

Question 1d-1h

In Question 1, we run two regressions. The first regression is for Question 1b to Question 1i. The regression equation is:

\[Sales = \beta_0 + \beta_1*Price + \beta_2*Feature + \beta_3*Display + e\]

The first regression:

mdl_1 <- lm(Sales ~ Price + Feature + Display, data = cadbury)
summary(mdl_1)

## 
## Call:
## lm(formula = Sales ~ Price + Feature + Display, data = cadbury)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -110.681  -18.362    1.126   18.994  114.695 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  714.113     68.773  10.384 2.34e-15 ***
## Price        -41.542      4.492  -9.248 2.09e-13 ***
## FeatureYes    33.897     14.550   2.330    0.023 *  
## DisplayYes    75.399     13.645   5.526 6.44e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 40.2 on 64 degrees of freedom
## Multiple R-squared:  0.8177, Adjusted R-squared:  0.8091 
## F-statistic: 95.68 on 3 and 64 DF,  p-value: < 2.2e-16
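
To see how the coefficients combine, the following sketch plugs the estimates reported above into the regression equation. The scenario (a price of 12, featured, not displayed) is hypothetical and chosen only for illustration; it may not correspond to a price level observed in the data.

```r
# Coefficients copied from the summary(mdl_1) output above
b <- c(intercept = 714.113, price = -41.542, feature = 33.897, display = 75.399)

# Hypothetical scenario (assumption): price = 12, Feature = Yes, Display = No
x <- c(intercept = 1, price = 12, feature = 1, display = 0)

# Predicted Sales = sum of coefficient * value over all terms
predicted_sales <- sum(b * x)
predicted_sales   # 249.506

# Switching Display to Yes adds the DisplayYes coefficient (75.399)
x_display <- replace(x, "display", 1)
sum(b * x_display) - predicted_sales
```

With the fitted model object available, `predict(mdl_1, newdata = ...)` should give the same kind of prediction without copying coefficients by hand.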

Question 1i and 1j

The second regression is for Question 1j and 1k. The regression equation is:

\[Sales = \beta_0 + \beta_1*Price + \beta_2*Feature + \beta_3*Display + \beta_4*Sunny + \beta_5*Cloudy + e\]

To run the second regression, we first need to set the baseline of Weather to Rainy. For this, we use the relevel function.

cadbury$Weather <- relevel(cadbury$Weather, ref = "Rainy")
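
As a small illustration of what relevel does, the sketch below uses a made-up factor (the vector `weather` is an assumption, not the assignment data). The reference level moves to the first position, and lm then creates dummy variables for the remaining levels only.

```r
# Toy factor (assumption); levels are sorted alphabetically by default
weather <- factor(c("Sunny", "Rainy", "Cloudy", "Rainy", "Sunny"))
levels(weather)   # "Cloudy" "Rainy" "Sunny"

# Make "Rainy" the baseline; the other levels keep their order
weather <- relevel(weather, ref = "Rainy")
levels(weather)   # "Rainy" "Cloudy" "Sunny"

# The design matrix has dummies for Cloudy and Sunny only;
# Rainy is absorbed into the intercept
colnames(model.matrix(~ weather))
```

Note that relevel works on unordered factors only, so a character column would first need to be converted with `factor()`.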

After setting the baseline, we run the second regression.

mdl_2 <- lm(Sales ~ Price + Feature + Display + Weather, data = cadbury)
summary(mdl_2)

## 
## Call:
## lm(formula = Sales ~ Price + Feature + Display + Weather, data = cadbury)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -98.209 -23.309  -2.101  20.927  98.209 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    698.273     67.641  10.323 4.30e-15 ***
## Price          -41.257      4.409  -9.357 1.82e-13 ***
## FeatureYes      33.858     14.437   2.345   0.0222 *  
## DisplayYes      75.844     13.302   5.702 3.53e-07 ***
## WeatherCloudy   27.539     12.069   2.282   0.0259 *  
## WeatherSunny    16.552     11.229   1.474   0.1455    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 39.08 on 62 degrees of freedom
## Multiple R-squared:  0.8331, Adjusted R-squared:  0.8196 
## F-statistic: 61.89 on 5 and 62 DF,  p-value: < 2.2e-16

Question 2

First, following the instructions in Question 2a, we will run a regression of ratings on the following variables: form, noapply, disinfect, bio, and price.

summary(lm(ratings ~ form + noapply + disinfect + bio + price, 
           data = cleanser))

## 
## Call:
## lm(formula = ratings ~ form + noapply + disinfect + bio + price, 
##     data = cleanser)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -4.193 -1.354  0.178  1.098  4.328 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        5.0204     0.3187  15.751  < 2e-16 ***
## formConcentrate    0.3665     0.2398   1.528   0.1274    
## formPremix        -0.2978     0.2569  -1.159   0.2472    
## noapply100 times  -0.1180     0.2471  -0.478   0.6333    
## noapply50 times   -0.4726     0.2565  -1.843   0.0663 .  
## disinfectYes       0.9433     0.2154   4.379 1.62e-05 ***
## bioYes             0.1497     0.2103   0.712   0.4771    
## price49 cents     -1.3947     0.2512  -5.553 5.90e-08 ***
## price79 cents     -2.8187     0.2502 -11.264  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.817 on 321 degrees of freedom
## Multiple R-squared:  0.3139, Adjusted R-squared:  0.2968 
## F-statistic: 18.36 on 8 and 321 DF,  p-value: < 2.2e-16

From the results, you can apply the three rules to transform the coefficients into partworths. Please see the suggested solutions for more details.
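
As a rough illustration only, the sketch below applies one common convention for turning dummy-coded coefficients into partworths: the baseline level gets a coefficient of 0, and the values are then zero-centred within the attribute. This is an assumption and may differ from the rules used in the suggested solutions. The name of the baseline level of form does not appear in the output, so it is labelled `baseline` here.

```r
# Coefficients for the form attribute, copied from the output above.
# The omitted baseline level is assigned 0 (its name is not shown in the output).
form_coef <- c(baseline = 0, Concentrate = 0.3665, Premix = -0.2978)

# Zero-centre within the attribute so the partworths sum to 0 (assumed convention)
form_partworth <- form_coef - mean(form_coef)
round(form_partworth, 4)

# Range of the partworths, often used as the attribute's raw importance
form_range <- diff(range(form_partworth))
form_range   # 0.6643
```

The same steps can be repeated for noapply, disinfect, bio, and price using their coefficients from the output.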