Quiz4

Q1 empirical rule Your favorite orange juice brand is Citrus Hill. You went to a local market and found that it is sold at $2.20. You are surprised at the price tag, which seems too pricey. You wonder how rare it would be to encounter a Citrus Hill orange juice at this price. Fortunately, you have a dataset of the Citrus Hill juice prices. How would you use the dataset to answer your question? Describe.
Q2 PriceCH Is the data normally distributed? Plot Citrus Hill orange juice price in a histogram.
Q3 PriceCH Calculate the mean of price.
Q4 PriceCH Calculate the standard deviation of price.
Q5 empirical rule Based on your analysis in Q2, would it be appropriate to use the empirical rule for Q1? Why? Why not?
Q6 empirical rule Let’s just assume that the prices are normally distributed for the sake of discussion. Would you pay $2.20 for the Citrus Hill orange juice? Or walk away?
Q7 Why do you think the empirical rule uses the standard deviation as a measure of spread instead of the variance?
Q8 Hide the messages, but display the code and its results on the webpage.
Q9 Display the title and your name correctly at the top of the webpage.
Q10 Use the correct slug.

For this quiz, you are going to use orange juice data. This data set was originally used in a machine learning (ML) class, with the goal of predicting which of the two brands of orange juice customers bought. Of course, you are not building an ML algorithm in this quiz. I just wanted to provide you with the context of the data.

The response variable (which the ML algorithm is built to predict) is Purchase, which takes either CH (Citrus Hill) or MM (Minute Maid). The predictor variables (which the ML algorithm uses to make predictions) are characteristics of the customer and the product itself. Together, the data set has 18 variables. WeekofPurchase is the week of purchase. LoyalCH is customer brand loyalty for CH (how loyal the customer is to CH on a scale of 0-1), and is the only variable that characterizes customers. All other variables are characteristics of the product or of the stores where the sale occurred. For more information on the data set, click the link below and scroll down to page 11. https://cran.r-project.org/web/packages/ISLR/ISLR.pdf

# Load the package
library(tidyverse)

# Import data
Orange <- read.csv('https://raw.githubusercontent.com/selva86/datasets/master/orange_juice_withmissing.csv', stringsAsFactors = TRUE) %>%
  mutate(STORE = as.factor(STORE),
         StoreID = as.factor(StoreID))

# Print the first 6 rows
head(Orange)

##   Purchase WeekofPurchase StoreID PriceCH PriceMM DiscCH DiscMM SpecialCH
## 1       CH            237       1    1.75    1.99   0.00    0.0         0
## 2       CH            239       1    1.75    1.99   0.00    0.3         0
## 3       CH            245       1    1.86    2.09   0.17    0.0         0
## 4       MM            227       1    1.69    1.69   0.00    0.0         0
## 5       CH            228       7    1.69    1.69   0.00    0.0         0
## 6       CH            230       7    1.69    1.99   0.00    0.0         0
##   SpecialMM  LoyalCH SalePriceMM SalePriceCH PriceDiff Store7 PctDiscMM
## 1         0 0.500000        1.99        1.75      0.24     No  0.000000
## 2         1 0.600000        1.69        1.75     -0.06     No  0.150754
## 3         0 0.680000        2.09        1.69      0.40     No  0.000000
## 4         0 0.400000        1.69        1.69      0.00     No  0.000000
## 5         0 0.956535        1.69        1.69      0.00    Yes  0.000000
## 6         1 0.965228        1.99        1.69      0.30    Yes  0.000000
##   PctDiscCH ListPriceDiff STORE
## 1  0.000000          0.24     1
## 2  0.000000          0.24     1
## 3  0.091398          0.23     1
## 4  0.000000          0.00     1
## 5  0.000000          0.00     0
## 6  0.000000          0.30     0

# Get a sense of the dataset
glimpse(Orange)

## Rows: 1,070
## Columns: 18
## $ Purchase       <fct> CH, CH, CH, MM, CH, CH, CH, CH, CH, CH, CH, CH, CH, ...
## $ WeekofPurchase <int> 237, 239, 245, 227, 228, 230, 232, 234, 235, 238, 24...
## $ StoreID        <fct> 1, 1, 1, 1, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 1, 2...
## $ PriceCH        <dbl> 1.75, 1.75, 1.86, 1.69, 1.69, 1.69, 1.69, 1.75, 1.75...
## $ PriceMM        <dbl> 1.99, 1.99, 2.09, 1.69, 1.69, 1.99, 1.99, 1.99, 1.99...
## $ DiscCH         <dbl> 0.00, 0.00, 0.17, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00...
## $ DiscMM         <dbl> 0.00, 0.30, 0.00, 0.00, 0.00, 0.00, 0.40, 0.40, 0.40...
## $ SpecialCH      <int> 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ SpecialMM      <int> 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1...
## $ LoyalCH        <dbl> 0.500000, 0.600000, 0.680000, 0.400000, 0.956535, 0....
## $ SalePriceMM    <dbl> 1.99, 1.69, 2.09, 1.69, 1.69, 1.99, 1.59, 1.59, 1.59...
## $ SalePriceCH    <dbl> 1.75, 1.75, 1.69, 1.69, 1.69, 1.69, 1.69, 1.75, 1.75...
## $ PriceDiff      <dbl> 0.24, -0.06, 0.40, 0.00, 0.00, 0.30, -0.10, -0.16, -...
## $ Store7         <fct> No, No, No, No, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Y...
## $ PctDiscMM      <dbl> 0.000000, 0.150754, 0.000000, 0.000000, 0.000000, 0....
## $ PctDiscCH      <dbl> 0.000000, 0.000000, 0.091398, 0.000000, 0.000000, 0....
## $ ListPriceDiff  <dbl> 0.24, 0.24, 0.23, 0.00, 0.00, 0.30, 0.30, 0.24, 0.24...
## $ STORE          <fct> 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2...

summary(Orange)

##  Purchase WeekofPurchase  StoreID       PriceCH         PriceMM     
##  CH:653   Min.   :227.0   1   :157   Min.   :1.690   Min.   :1.690  
##  MM:417   1st Qu.:240.0   2   :222   1st Qu.:1.790   1st Qu.:1.990  
##           Median :257.0   3   :196   Median :1.860   Median :2.090  
##           Mean   :254.4   4   :139   Mean   :1.867   Mean   :2.085  
##           3rd Qu.:268.0   7   :355   3rd Qu.:1.990   3rd Qu.:2.180  
##           Max.   :278.0   NA's:  1   Max.   :2.090   Max.   :2.290  
##                                      NA's   :1       NA's   :4      
##      DiscCH            DiscMM         SpecialCH       SpecialMM     
##  Min.   :0.00000   Min.   :0.0000   Min.   :0.000   Min.   :0.0000  
##  1st Qu.:0.00000   1st Qu.:0.0000   1st Qu.:0.000   1st Qu.:0.0000  
##  Median :0.00000   Median :0.0000   Median :0.000   Median :0.0000  
##  Mean   :0.05196   Mean   :0.1234   Mean   :0.147   Mean   :0.1624  
##  3rd Qu.:0.00000   3rd Qu.:0.2300   3rd Qu.:0.000   3rd Qu.:0.0000  
##  Max.   :0.50000   Max.   :0.8000   Max.   :1.000   Max.   :1.0000  
##  NA's   :2         NA's   :4        NA's   :2       NA's   :5       
##     LoyalCH          SalePriceMM     SalePriceCH      PriceDiff       Store7   
##  Min.   :0.000011   Min.   :1.190   Min.   :1.390   Min.   :-0.6700   No :714  
##  1st Qu.:0.320000   1st Qu.:1.690   1st Qu.:1.750   1st Qu.: 0.0000   Yes:356  
##  Median :0.600000   Median :2.090   Median :1.860   Median : 0.2300            
##  Mean   :0.565203   Mean   :1.962   Mean   :1.816   Mean   : 0.1463            
##  3rd Qu.:0.850578   3rd Qu.:2.130   3rd Qu.:1.890   3rd Qu.: 0.3200            
##  Max.   :0.999947   Max.   :2.290   Max.   :2.090   Max.   : 0.6400            
##  NA's   :5          NA's   :5       NA's   :1       NA's   :1                  
##    PctDiscMM         PctDiscCH       ListPriceDiff    STORE    
##  Min.   :0.00000   Min.   :0.00000   Min.   :0.000   0   :356  
##  1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:0.140   1   :157  
##  Median :0.00000   Median :0.00000   Median :0.240   2   :222  
##  Mean   :0.05939   Mean   :0.02732   Mean   :0.218   3   :194  
##  3rd Qu.:0.11268   3rd Qu.:0.00000   3rd Qu.:0.300   4   :139  
##  Max.   :0.40201   Max.   :0.25269   Max.   :0.440   NA's:  2  
##  NA's   :5         NA's   :2

Q1 `empirical rule` Your favorite orange juice brand is Citrus Hill. You went to a local market and found that it is sold at $2.20. You are surprised at the price tag, which seems too pricey. You wonder how rare it would be to encounter a Citrus Hill orange juice at this price. Fortunately, you have a dataset of the Citrus Hill juice prices. How would you use the dataset to answer your question? Describe.

Hint: Discuss all of the following topics in your answer: the normal distribution, the mean, and the standard deviation.

I would first check whether the prices are approximately normally distributed, meaning the histogram is roughly bell shaped. If they are, I would calculate the mean (the average of all data points in the dataset) and the standard deviation. The empirical rule then applies: about 68% of the data fall within 1 standard deviation of the mean, about 95% within 2, and about 99.7% within 3. Counting how many standard deviations $2.20 lies above the mean tells me how rare that price is.
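The percentages in the empirical rule can be checked against the normal distribution itself. The following is a minimal sketch using base R's `pnorm()`; the helper name `within_k_sd` is made up for illustration and is not part of the quiz.

```r
# Share of a normal distribution that lies within k standard deviations
# of the mean. pnorm() is the standard normal cumulative distribution
# function in base R. within_k_sd is a hypothetical helper name.
within_k_sd <- function(k) pnorm(k) - pnorm(-k)

coverage <- sapply(1:3, within_k_sd)
names(coverage) <- c("1 sd", "2 sd", "3 sd")

# Roughly 68%, 95%, and 99.7%
round(coverage, 3)
```

These shares only describe the data if the data are approximately normal, which is why the histogram check comes first.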

Q2 `PriceCH` Is the data normally distributed? Plot Citrus Hill orange juice price in a histogram.

ggplot(Orange, aes(x = PriceCH)) + geom_histogram() + labs(title = "Citrus Hill Orange Juice Prices", x = "Price")

Q3 `PriceCH` Calculate the mean of price.

Hint: You may add the na.rm = TRUE argument in the mean() function, if the function returns NA. It means that the variable has at least one row with NA.

mean(Orange$PriceCH, na.rm = TRUE)

## [1] 1.867428
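To see why the hint suggests `na.rm = TRUE`, here is a small sketch. The vector uses two PriceCH values from the `head(Orange)` output plus one `NA` inserted purely for illustration; it is not a real slice of the data.

```r
# Two PriceCH values from head(Orange) and one hypothetical missing value
prices <- c(1.75, NA, 1.86)

# A single NA makes the result NA
mean_with_na <- mean(prices)

# na.rm = TRUE drops the NA and averages the remaining values
mean_without_na <- mean(prices, na.rm = TRUE)

mean_with_na
mean_without_na
```

The `sd()` function takes the same `na.rm` argument.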

Q4 `PriceCH` Calculate the standard deviation of price.

Hint: You may add the na.rm = TRUE argument in the sd() function, if the function returns NA. It means that the variable has at least one row with NA.

sd(Orange$PriceCH, na.rm = TRUE)

## [1] 0.1020172

Q5 `empirical rule` Based on your analysis in Q2, would it be appropriate to use the empirical rule for Q1? Why? Why not?

Hint: Discuss characteristics of the normal distribution.

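No, it would not be appropriate to use the empirical rule for Q1, because the data is not normally distributed. The rule's percentages (about 68%, 95%, and 99.7% of the data within 1, 2, and 3 standard deviations of the mean) describe a symmetric, bell-shaped distribution, so they would not necessarily hold for these prices.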

Q6 `empirical rule` Let’s just assume that the prices are normally distributed for the sake of discussion. Would you pay $2.20 for the Citrus Hill orange juice? Or walk away?

Hint: Discuss in terms of the mean, the standard deviation, and the probability.

The mean price in this dataset is 1.867428 and one standard deviation is 0.1020172. Multiplying the standard deviation by 3 gives 0.3060516, and adding that to the mean puts three standard deviations above the mean at approximately 2.17. The $2.20 price tag is more than three standard deviations above the mean. Under the empirical rule, about 99.7% of prices fall within three standard deviations of the mean, so fewer than about 0.15% of prices would be this high. Going strictly by the empirical rule, I should walk away. (Personally, though, I would still buy the orange juice.)
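The arithmetic above can be reproduced in a few lines. This is a sketch that assumes, as Q6 does, that prices are normally distributed; the mean and standard deviation are copied from the Q3 and Q4 output, and the variable names are illustrative.

```r
mean_price <- 1.867428   # mean of PriceCH, from Q3
sd_price   <- 0.1020172  # standard deviation of PriceCH, from Q4
price_tag  <- 2.20       # price seen at the market

# Three standard deviations above the mean (about 2.17)
upper_3sd <- mean_price + 3 * sd_price

# How many standard deviations the price tag lies above the mean
z_score <- (price_tag - mean_price) / sd_price

# Probability of a price at least this high under the normality assumption
tail_prob <- pnorm(z_score, lower.tail = FALSE)

upper_3sd
z_score
tail_prob
```

The z-score comes out slightly above 3, so the tail probability is smaller than the roughly 0.15% that the empirical rule leaves above three standard deviations.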

Q7 Why do you think the empirical rule uses the standard deviation as a measure of spread instead of the variance?

Hint: Discuss in terms of the unit of the variance and the standard deviation.

I think the empirical rule uses the standard deviation rather than the variance because the standard deviation is easier to interpret. The variance is the average of the squared deviations from the mean, so its unit is the square of the data's unit (dollars squared here), and squaring also weighs points far from the mean more heavily. The standard deviation is the square root of the variance, so it is back in the same unit as the data (dollars). The empirical rule is about how far data are from the mean, so it needs a measure of spread that can be added to and subtracted from the mean.
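The unit argument can be illustrated by changing the unit of measurement. The sketch below uses only the first six PriceCH values shown in the `head(Orange)` output, so the numbers are for illustration and do not match the full-data results.

```r
# First six PriceCH values from head(Orange), in dollars
prices_dollars <- c(1.75, 1.75, 1.86, 1.69, 1.69, 1.69)

# The same prices expressed in cents
prices_cents <- prices_dollars * 100

# The standard deviation scales with the unit (factor of 100)
sd_ratio <- sd(prices_cents) / sd(prices_dollars)

# The variance scales with the squared unit (factor of 10,000)
var_ratio <- var(prices_cents) / var(prices_dollars)

sd_ratio
var_ratio
```

Because the standard deviation changes by the same factor as the prices themselves, it stays on the same scale as the mean, which the variance does not.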

Q8 Hide the messages, but display the code and its results on the webpage.

Hint: Use message, echo and results in the chunk options. Refer to the RMarkdown Reference Guide.
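One possible way to apply these options to every chunk at once is a setup call like the one below. This is a sketch and not necessarily the approach the quiz expects; the same options can instead be set in each chunk header.

```r
# Typically placed in a setup chunk near the top of the .Rmd file.
# Requires the knitr package, which R Markdown uses to render documents.
knitr::opts_chunk$set(
  message = FALSE,    # hide messages such as package start-up notes
  echo = TRUE,        # display the code
  results = "markup"  # display the printed results
)
```

Warnings are controlled separately by the `warning` option, so some package notices may still appear.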

Quiz4

JakJallah
