A Fruit Drink store chain is selling a new type of grape juice in some of its stores for pilot selling. In this paper, we will analyze the following questions put up by the store :-

  1. Which type of in-store advertisement is more effective? They have placed two types of ads in stores for testing, one theme is natural production of the juice, the other theme is family health caring.
  2. The Price Elasticity - the reactions of sales quantity of the grape juice to its price change;
  3. The Cross-price Elasticity - the reactions of sales quantity of the grape juice to the price changes of other products such as apple juice and cookies in the same store;
  4. How to find the best unit price of the grape juice which can maximize the profit and the forecast of sales with that price.

Let’s look at some of the observations in our dataset and try to generate a summary statistic. We start our EXPLORATORY ANALYSIS here.

##   sales price ad_type price_apple price_cookies
## 1   222  9.83       0        7.36          8.80
## 2   201  9.72       1        7.43          9.62
## 3   247 10.15       1        7.66          8.90
## 4   169 10.04       0        7.57         10.26
## 5   317  8.38       1        7.33          9.54
## 6   227  9.74       0        7.51          9.49
## 
## Attaching package: 'car'
## 
## The following object is masked from 'package:s20x':
## 
##     levene.test
##      sales           price           ad_type     price_apple   
##  Min.   :131.0   Min.   : 8.200   Min.   :0.0   Min.   :7.300  
##  1st Qu.:182.5   1st Qu.: 9.585   1st Qu.:0.0   1st Qu.:7.438  
##  Median :204.5   Median : 9.855   Median :0.5   Median :7.580  
##  Mean   :216.7   Mean   : 9.738   Mean   :0.5   Mean   :7.659  
##  3rd Qu.:244.2   3rd Qu.:10.268   3rd Qu.:1.0   3rd Qu.:7.805  
##  Max.   :335.0   Max.   :10.490   Max.   :1.0   Max.   :8.290  
##  price_cookies   
##  Min.   : 8.790  
##  1st Qu.: 9.190  
##  Median : 9.515  
##  Mean   : 9.622  
##  3rd Qu.:10.140  
##  Max.   :10.580

From the above summary table, we can roughly know the basic statistics of each numeric variable. For example, the mean value of sales is 216.7 units, the min value is 131, and the max value is 335. Please ignore the statistics of the “ad_type” there since it is a categorical variable.

We can further explore the distribution of the data of sales by visualizing the data in graphical form as follows.

With the help of Boxplots, we can find out if there are any OUTLIERS in our data. Using Histograms, we can explore the data distribution shape(id it’s normal or not)

We don’t find outliers in the above box plot graph and the sales data distribution is roughly normal. It is not necessary to apply further data cleaning and treatment to the data set.

To find out the answer to the firt question,the ad with better effectiveness for sales between the two types of ads, one is with natural production theme; the other is with family health caring theme. We will be making use of subsetting to divide our dataset into two sub-sets as sales with ad nature and sales with ad family and comparing their means.

## [1] 186.6667
## [1] 246.6667

The mean of sales with nature product theme is about 187; the mean of sales with family health caring theme is about 247. It looks like that the latter one is better. However, this is only the conclusion based on the sample with only 30 observations randomly selected. To find out how likely the conclusion is correct for the whole population, it is necessary to do statistical testing - two-sample t-test.(for comparison between the two)

However, it is important to check the assumptions of t-tests, which assume the observations are normally distributed and independent, before conducting the t-tests. Otherwise the results of t-tests are not valid. The observations are independent since they were randomly sampled. Let’s check the normality by plotting the distribution shapes of the two groups of sales data.

We can see that the shapes are roughly normally distributed. We can also check the normality by Shapiro-Wilk test as follows.

## 
##  Shapiro-Wilk normality test
## 
## data:  sales_ad_nature$sales
## W = 0.94255, p-value = 0.4155
## 
##  Shapiro-Wilk normality test
## 
## data:  sales_ad_family$sales
## W = 0.89743, p-value = 0.08695

The p-values of the Shapiro-Wilk tests are larger than 0.05, so there is no strong evidence to reject the null hypothesis that the two groups of sales data are normally distributed.

Now we can conduct the t-test since the t-test assumptions are met.

## 
##  Welch Two Sample t-test
## 
## data:  sales_ad_nature$sales and sales_ad_family$sales
## t = -3.7515, df = 25.257, p-value = 0.0009233
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -92.92234 -27.07766
## sample estimates:
## mean of x mean of y 
##  186.6667  246.6667

From the above t-test output, we can say that:-

We have strong evidence to say that the population means of the sales with the two different ad types are different because the p-value of the t-test is very small;

With 95% confidence, we can estimate that the mean of the sales with natural production theme ad is somewhere in 27 to 93 units less than that of the sales with family health caring theme ad.

So the conclusion is that the ad with the theme of family health caring is BETTER.

With the information given in the data set, we can explore how grape juice price, ad type, apple juice price, cookies price influence the sales of grape juice in a store by multiple linear regression analysis. Here, “sales” is the dependent variable and the others are independent variables.

Let’s investigate the correlation between the sales and other variables by displaying the correlation coefficients in pairs.

The correlation coefficients between sales and price, ad_type, price_apple, and price_cookies are 0.85, 0.58, 0.37, and 0.37 respectively, that means they all might have some influences to the sales, so we can try to add all of the independent variables into the regression model as follows.

## 
## Call:
## lm(formula = sales ~ price + ad_type + price_apple + price_cookies, 
##     data = grapeJuice)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -36.290 -10.488   0.884  10.483  29.471 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    774.813    145.349   5.331 1.59e-05 ***
## price          -51.239      5.321  -9.630 6.83e-10 ***
## ad_type         29.742      7.249   4.103 0.000380 ***
## price_apple     22.089     12.512   1.765 0.089710 .  
## price_cookies  -25.277      6.296  -4.015 0.000477 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 18.2 on 25 degrees of freedom
## Multiple R-squared:  0.8974, Adjusted R-squared:  0.881 
## F-statistic: 54.67 on 4 and 25 DF,  p-value: 5.318e-12

The p-value for price, ad_type, and price_cookies in the last column of the above output is much less than 0.05. They are significant in explaining the sales. We are confident to include these three variables into the model.

The p-value of price_apple is a bit larger than 0.05, seems there are no strong evidence for apple juice price to explain the sales. However, according to our real-life experience, we know when apple juice price is lower, consumers likely to buy more apple juice, and then the sales of other fruit juice will decrease. So we can also add it into the model to explain the grape juice sales.

The Adjusted R-squared is 0.881, which indicates a reasonable goodness of fit and 88% of the variation in sales can be explained by the four variables. The remaining 12% can be attributed to other factors or inherent variability.

The assumptions for the regression to be true are that data are random and independent, residuals are normally distributed and have constant variance. Let’s check the residuals assumptions visually.

The Residuals vs Fitted graph above shows that the residuals scatter around the fitted line with no obvious pattern, and the Normal Q-Q graph shows that basically the residuals are normally distributed. The assumptions are met.

For multiple regression, it is also important to check the multicollinearity among the variables because high multicollinearity will make the coefficients for independent variables less precise and introduce large errors in the predictions for dependant variable. We can investigate the multicollinearity by displaying the correlation coefficients of the independent variables in pairs.

Let’s investigate the multicollinearity.

##         price       ad_type   price_apple price_cookies 
##      1.246084      1.189685      1.149248      1.099255

The VIF test value for each variable is close to 1, which means the multicollinearity is very low among these variables.

Based on the above analysis, we can accept the regression result and construct the multi-linear model of sales as follows.

Sales = 774.81 - 51.24 * price + 29.74 * ad_type + 22.1 * price_apple - 25.28 * price_cookies

*Note : ?? = Delta.

With model established, we can analyse the Price Elasticity(PE) and Cross-price Elasticity(CPE) to predict the reactions of sales quantity to price. “Price elasticity is defined as %??Q/%??P, which indicates the percent change in quantity divided by the percent change in price; Cross-price Elasticity is the percent change in quantity divided by the change in the price of some other product.”1

PE = (??Q/Q) / (??P/P) = (??Q/??P) * (P/Q) = -51.24 * 0.045 = -2.3

P is price, Q is sales quantity

??Q/??P = -51.24 , the parameter before the variable “price” in the above model

P/Q = 9.738 / 216.7 = 0.045, P is the mean of prices in the dataset, so does Q

The PE indicates that 10% decrease in price will increase the sales by 23%, and vice verse.

Let’s further calculate the CPE on apple juice and cookies to analyze the how the change of apple juice price and cookies price influence the sales of grape juice.

CPEapple = (??Q/??Papple) * (Papple/Q) = 22.1 * ( 7.659 / 216.7) = 0.78

CPEcookies = (??Q/??Pcookies) * (Pcookies/Q) = -25.28 * ( 9.622 / 216.7) = - 1.12

The CPEapple indicates that 10% decrease in apple juice price will DECREASE the sales by 7.8%, and vice verse. So the grape juice and apple juice are substitutes.

The CPEcookies indicates that 10% decrease in cookies price will INCREASE the sales by 11.2%, and vice verse. So the grape juice and cookies are compliments. Place the two products together will likely increase the sales for both.

We can also know that the sales increase 29.74 units when using the ad with the family health caring theme (ad_type = 1).

Now, coming to the final question of our paper, what should be the Optimal Pricing and Sales for the store to make maximum profit?

Usually companies want to get higher profit rather than just higher sales quantity. So, how to set the optimal price for the new grape juice to get the maximum profit based on the dataset collected in the pilot period and the regression model above?

To simplify the question, we can let the ad_type = 1, the price_apple = 7.659 (mean value), and the price_cookies = 9.738 (mean value).

The model is simplified as follows:-

Sales = 774.81 - 51.24 * price + 29.74 * 1 + 22.1 * 7.659 - 25.28 * 9.738

Sales = 772.64 - 51.24*price

Assume the marginal cost(C) per unit of grape juice is 5. We can calculate the profit (Y) by the following formula.

Y = (price - C) * Sales Quantity = (price - 5) * (772.64 - 51.24*price)

Y = - 51.24 * price2 + 1028.84 * price - 3863.2

Let’s get the optimal price to maximize Y.

## $maximum
## [1] 10.03942
## 
## $objective
## [1] 1301.28

The optimal price is 10.04; the maximum profit will be 1301 according to the above output. We can round the price to be 10 or 9.99.

We can further use the model to predict the sales while the price is 10.

##        fit      lwr      upr
## 1 215.1978 176.0138 254.3817

The sales forecast will be 215 units with a variable range of 176 ~ 254 with 95% confidence in a store in one work on average. Based on the forecast and other factors, The Fruit Drink company can prepare the inventory for all of its stores after the pilot period.