For this quiz, you are going to use orange juice data. This data set was originally used in a machine learning (ML) class, with the goal of predicting which of two brands of orange juice customers bought. Of course, you are not building an ML algorithm in this quiz. I just wanted to provide you with the context of the data.
The response variable (which the ML algorithm is built to predict) is Purchase, which takes either CH (Citrus Hill) or MM (Minute Maid). The predictor variables (which the ML algorithm uses to make predictions) are characteristics of the customer and the product itself. Together, the data set has 18 variables. WeekofPurchase is the week of purchase. LoyalCH is customer brand loyalty for CH (how loyal the customer is to CH on a scale of 0-1), and is the only variable that characterizes customers. All other variables are characteristics of the product or of the store where the sale occurred. For more information on the data set, follow the link below and scroll down to page 11. https://cran.r-project.org/web/packages/ISLR/ISLR.pdf
# Load the package
library(tidyverse)
# Import data
Orange <- read.csv('https://raw.githubusercontent.com/selva86/datasets/master/orange_juice_withmissing.csv', stringsAsFactors = TRUE) %>%
  mutate(STORE = as.factor(STORE),
         StoreID = as.factor(StoreID))
# Print the first 6 rows
head(Orange)
##   Purchase WeekofPurchase StoreID PriceCH PriceMM DiscCH DiscMM SpecialCH
## 1       CH            237       1    1.75    1.99   0.00    0.0         0
## 2       CH            239       1    1.75    1.99   0.00    0.3         0
## 3       CH            245       1    1.86    2.09   0.17    0.0         0
## 4       MM            227       1    1.69    1.69   0.00    0.0         0
## 5       CH            228       7    1.69    1.69   0.00    0.0         0
## 6       CH            230       7    1.69    1.99   0.00    0.0         0
##   SpecialMM  LoyalCH SalePriceMM SalePriceCH PriceDiff Store7 PctDiscMM
## 1         0 0.500000        1.99        1.75      0.24     No  0.000000
## 2         1 0.600000        1.69        1.75     -0.06     No  0.150754
## 3         0 0.680000        2.09        1.69      0.40     No  0.000000
## 4         0 0.400000        1.69        1.69      0.00     No  0.000000
## 5         0 0.956535        1.69        1.69      0.00    Yes  0.000000
## 6         1 0.965228        1.99        1.69      0.30    Yes  0.000000
##   PctDiscCH ListPriceDiff STORE
## 1  0.000000          0.24     1
## 2  0.000000          0.24     1
## 3  0.091398          0.23     1
## 4  0.000000          0.00     1
## 5  0.000000          0.00     0
## 6  0.000000          0.30     0
# Get a sense of the dataset
glimpse(Orange)
## Rows: 1,070
## Columns: 18
## $ Purchase <fct> CH, CH, CH, MM, CH, CH, CH, CH, CH, CH, CH, CH, CH, ...
## $ WeekofPurchase <int> 237, 239, 245, 227, 228, 230, 232, 234, 235, 238, 24...
## $ StoreID <fct> 1, 1, 1, 1, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 1, 2...
## $ PriceCH <dbl> 1.75, 1.75, 1.86, 1.69, 1.69, 1.69, 1.69, 1.75, 1.75...
## $ PriceMM <dbl> 1.99, 1.99, 2.09, 1.69, 1.69, 1.99, 1.99, 1.99, 1.99...
## $ DiscCH <dbl> 0.00, 0.00, 0.17, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00...
## $ DiscMM <dbl> 0.00, 0.30, 0.00, 0.00, 0.00, 0.00, 0.40, 0.40, 0.40...
## $ SpecialCH <int> 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ SpecialMM <int> 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1...
## $ LoyalCH <dbl> 0.500000, 0.600000, 0.680000, 0.400000, 0.956535, 0....
## $ SalePriceMM <dbl> 1.99, 1.69, 2.09, 1.69, 1.69, 1.99, 1.59, 1.59, 1.59...
## $ SalePriceCH <dbl> 1.75, 1.75, 1.69, 1.69, 1.69, 1.69, 1.69, 1.75, 1.75...
## $ PriceDiff <dbl> 0.24, -0.06, 0.40, 0.00, 0.00, 0.30, -0.10, -0.16, -...
## $ Store7 <fct> No, No, No, No, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Y...
## $ PctDiscMM <dbl> 0.000000, 0.150754, 0.000000, 0.000000, 0.000000, 0....
## $ PctDiscCH <dbl> 0.000000, 0.000000, 0.091398, 0.000000, 0.000000, 0....
## $ ListPriceDiff <dbl> 0.24, 0.24, 0.23, 0.00, 0.00, 0.30, 0.30, 0.24, 0.24...
## $ STORE <fct> 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2...
summary(Orange)
##  Purchase WeekofPurchase  StoreID       PriceCH         PriceMM
##  CH:653   Min.   :227.0   1   :157   Min.   :1.690   Min.   :1.690
##  MM:417   1st Qu.:240.0   2   :222   1st Qu.:1.790   1st Qu.:1.990
##           Median :257.0   3   :196   Median :1.860   Median :2.090
##           Mean   :254.4   4   :139   Mean   :1.867   Mean   :2.085
##           3rd Qu.:268.0   7   :355   3rd Qu.:1.990   3rd Qu.:2.180
##           Max.   :278.0   NA's:  1   Max.   :2.090   Max.   :2.290
##                                      NA's   :1       NA's   :4
##      DiscCH            DiscMM         SpecialCH       SpecialMM
##  Min.   :0.00000   Min.   :0.0000   Min.   :0.000   Min.   :0.0000
##  1st Qu.:0.00000   1st Qu.:0.0000   1st Qu.:0.000   1st Qu.:0.0000
##  Median :0.00000   Median :0.0000   Median :0.000   Median :0.0000
##  Mean   :0.05196   Mean   :0.1234   Mean   :0.147   Mean   :0.1624
##  3rd Qu.:0.00000   3rd Qu.:0.2300   3rd Qu.:0.000   3rd Qu.:0.0000
##  Max.   :0.50000   Max.   :0.8000   Max.   :1.000   Max.   :1.0000
##  NA's   :2         NA's   :4        NA's   :2       NA's   :5
##     LoyalCH           SalePriceMM     SalePriceCH      PriceDiff      Store7
##  Min.   :0.000011   Min.   :1.190   Min.   :1.390   Min.   :-0.6700   No :714
##  1st Qu.:0.320000   1st Qu.:1.690   1st Qu.:1.750   1st Qu.: 0.0000   Yes:356
##  Median :0.600000   Median :2.090   Median :1.860   Median : 0.2300
##  Mean   :0.565203   Mean   :1.962   Mean   :1.816   Mean   : 0.1463
##  3rd Qu.:0.850578   3rd Qu.:2.130   3rd Qu.:1.890   3rd Qu.: 0.3200
##  Max.   :0.999947   Max.   :2.290   Max.   :2.090   Max.   : 0.6400
##  NA's   :5          NA's   :5       NA's   :1       NA's   :1
##    PctDiscMM         PctDiscCH       ListPriceDiff  STORE
##  Min.   :0.00000   Min.   :0.00000   Min.   :0.000   0   :356
##  1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:0.140   1   :157
##  Median :0.00000   Median :0.00000   Median :0.240   2   :222
##  Mean   :0.05939   Mean   :0.02732   Mean   :0.218   3   :194
##  3rd Qu.:0.11268   3rd Qu.:0.00000   3rd Qu.:0.300   4   :139
##  Max.   :0.40201   Max.   :0.25269   Max.   :0.440   NA's:  2
## NA's :5 NA's :2
empirical rule: Your favorite orange juice brand is Citrus Hill. You went to a local market and found that it is sold at $2.20. You are surprised at the price tag, which seems too pricey. You wonder how rare it would be to encounter Citrus Hill orange juice at this price. Fortunately, you have a dataset of Citrus Hill juice prices. How would you use the dataset to answer your question? Describe.
Hint: Discuss all of the following topics in your answer: the normal distribution, the mean, and the standard deviation.
In a normal distribution, almost all of the data (about 99.7%) falls within plus or minus 3 standard deviations of the mean.
The mean is the average of the data set and sits at the center of the normal distribution.
Using the empirical rule together with the mean and standard deviation of the data set, we can estimate how rarely the orange juice is sold at $2.20.
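As a rough illustration of the empirical rule, the share of a normal distribution lying within 1, 2, and 3 standard deviations of the mean can be computed with base R's pnorm(). The variable names here are illustrative and not part of the quiz.

```r
# Share of a normal distribution within k standard deviations of the mean
k <- 1:3
within_k <- pnorm(k) - pnorm(-k)
names(within_k) <- paste0("within_", k, "_sd")

# Roughly 68%, 95%, and 99.7%
round(within_k, 4)
```

Whatever is left over is split evenly between the two tails, which is why prices far above the mean are rare under a normal model.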
PriceCH: Is the data normally distributed? Plot Citrus Hill orange juice price in a histogram.
ggplot(Orange, aes(x = PriceCH)) +
  geom_histogram()
## Warning: Removed 1 rows containing non-finite values (stat_bin).
The data is not normally distributed: the histogram is not bell-shaped, and it is not symmetrical around the center.
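One way to back up a visual judgment like this with numbers is to compare the observed share of values within 1, 2, and 3 standard deviations of the mean against the 68%, 95%, and 99.7% the empirical rule predicts. The helper below is a minimal sketch; coverage_within_sd is a hypothetical name, and the toy vector is for demonstration only, not the quiz data.

```r
# Proportion of non-missing values within k standard deviations of the mean
coverage_within_sd <- function(x, k = 1:3) {
  x <- x[!is.na(x)]          # drop missing values first
  m <- mean(x)
  s <- sd(x)
  sapply(k, function(kk) mean(abs(x - m) <= kk * s))
}

# Toy vector for demonstration only (not the quiz data)
toy <- c(1, 2, 3, 4, 5, NA)
coverage_within_sd(toy)
```

With the quiz data loaded, the same helper could be called as coverage_within_sd(Orange$PriceCH). Shares that sit far from 0.68, 0.95, and 0.997 would suggest that the empirical rule is a poor fit.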
PriceCH: Calculate the mean of price.
Hint: You may add the na.rm = TRUE argument in the mean() function if the function returns NA. A result of NA means that the variable has at least one row with NA.
mean(Orange$PriceCH, na.rm = TRUE)
## [1] 1.867428
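To see why na.rm = TRUE matters, here is a small sketch on a hypothetical three-value vector (not taken from the data set) with one missing entry.

```r
# Hypothetical vector with one missing entry
prices <- c(1.75, NA, 1.86)

mean(prices)                # NA, because one value is missing
mean(prices, na.rm = TRUE)  # average of the two non-missing values
```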
PriceCH: Calculate the standard deviation of price.
Hint: You may add the na.rm = TRUE argument in the sd() function if the function returns NA. A result of NA means that the variable has at least one row with NA.
sd(Orange$PriceCH, na.rm = TRUE)
## [1] 0.1020172
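With the mean and standard deviation in hand, the price ranges that the empirical rule refers to can be tabulated. This is a sketch that hard-codes the two values printed above; price_mean, price_sd, and bands are illustrative names.

```r
# Values copied from the mean() and sd() output above
price_mean <- 1.867428
price_sd   <- 0.1020172

# Lower and upper bounds of mean +/- k standard deviations
k <- 1:3
bands <- data.frame(
  k     = k,
  lower = price_mean - k * price_sd,
  upper = price_mean + k * price_sd
)
bands
```

Under a normal model, roughly 68%, 95%, and 99.7% of prices would fall inside the bands for k = 1, 2, and 3, respectively.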
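empirical rule: Based on your analysis in Q2, would it be appropriate to use the empirical rule for Q1? Why or why not?
Hint: Discuss characteristics of the normal distribution.
No, we cannot apply the empirical rule, because the data is not normally distributed. The rule assumes a bell-shaped distribution that is symmetrical around the mean, and the histogram does not show that shape.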
empirical rule: Let's just assume that the prices are normally distributed for the sake of discussion. Would you pay $2.20 for the Citrus Hill orange juice? Or walk away?
Hint: Discuss in terms of the mean, the standard deviation, and the probability.
I would walk away. The mean is $1.87 and the standard deviation is about $0.10, so $2.20 is more than three standard deviations above the mean ((2.20 - 1.87) / 0.10 ≈ 3.3). Under the empirical rule, only about 0.15% of prices lie more than three standard deviations above the mean. Because a price this high is so rare, you can easily find a store that sells it for less; the chance that the next store you go to also charges $2.20 is very slim.
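The arithmetic behind this decision can be checked with a z-score and the upper-tail probability from pnorm(). This is a sketch under the stated normality assumption, using the mean and standard deviation computed earlier; the variable names are illustrative.

```r
# Mean and standard deviation of PriceCH, as computed earlier
price_mean <- 1.867428
price_sd   <- 0.1020172
price      <- 2.20

# Number of standard deviations that $2.20 sits above the mean
z <- (price - price_mean) / price_sd

# Probability of a price at least this high under a normal model
upper_tail <- pnorm(z, lower.tail = FALSE)

round(z, 2)
upper_tail
```

The z-score comes out a little above 3, so the tail probability is smaller than the 0.15% that the empirical rule assigns to everything beyond three standard deviations above the mean.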
Hint: Discuss in terms of the unit of the variance and the standard deviation.
The empirical rule uses the standard deviation rather than the variance because the variance is expressed in squared units (here, dollars squared), which makes it hard to interpret and to compare with the data. The standard deviation is in the same unit as the data (dollars), which is easier to understand.
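The relationship between the two measures can be seen on a small sample. The sketch below uses the first nine PriceCH values shown in the glimpse() output, purely for illustration.

```r
# First nine PriceCH values shown in the glimpse() output (dollars)
x <- c(1.75, 1.75, 1.86, 1.69, 1.69, 1.69, 1.69, 1.75, 1.75)

var(x)  # variance, in dollars squared
sd(x)   # standard deviation, in dollars

# The standard deviation is the square root of the variance
all.equal(sqrt(var(x)), sd(x))
```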
Hint: Use message, echo and results in the chunk options. Refer to the RMarkdown Reference Guide.
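As a sketch of how these three options can be set, knitr lets you define document-wide defaults with knitr::opts_chunk$set(), typically in a setup chunk near the top of the file. The specific values below are one possible choice, not the settings this quiz requires.

```r
# Runs only when knitr is installed; the values are one possible choice
if (requireNamespace("knitr", quietly = TRUE)) {
  knitr::opts_chunk$set(
    echo    = TRUE,      # show the source code of each chunk
    message = FALSE,     # hide messages such as package start-up notes
    results = "markup"   # print text output as usual ("hide" suppresses it)
  )
}
```

The same options can also be set on an individual chunk in its header, which overrides the defaults for that chunk only.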