title: "Quiz 2"
author: "Logan Pruneau"
date: "5/13/2021"
output:
  html_document:
    toc: TRUE
For this quiz, you are going to use orange juice data. This data set was originally used in a machine learning (ML) class, with the goal of predicting which of two orange juice brands customers bought. Of course, you are not building an ML algorithm in this quiz; I just want to give you the context of the data.
The response variable (which the ML algorithm is built to predict) is Purchase, which takes either CH (Citrus Hill) or MM (Minute Maid). The predictor variables (which the ML algorithm uses to make predictions) are characteristics of the customer and the product itself. Together, the data set has 18 variables. WeekofPurchase is the week of purchase. LoyalCH is customer brand loyalty for CH (how loyal the customer is to CH on a scale of 0-1), and is the only variable that characterizes customers. All other variables are characteristics of the product or of the store where the sale occurred. For more information on the data set, click the link below and scroll down to page 11.
https://cran.r-project.org/web/packages/ISLR/ISLR.pdf
# Load the package
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.0.3
## Warning: package 'ggplot2' was built under R version 4.0.3
## Warning: package 'tibble' was built under R version 4.0.3
## Warning: package 'tidyr' was built under R version 4.0.3
## Warning: package 'readr' was built under R version 4.0.3
## Warning: package 'purrr' was built under R version 4.0.3
## Warning: package 'dplyr' was built under R version 4.0.3
## Warning: package 'stringr' was built under R version 4.0.3
## Warning: package 'forcats' was built under R version 4.0.3
# Import data
Orange <- read.csv('https://raw.githubusercontent.com/selva86/datasets/master/orange_juice_withmissing.csv', stringsAsFactors = TRUE) %>%
  mutate(STORE = as.factor(STORE),
         StoreID = as.factor(StoreID))
# Print the first 6 rows
head(Orange)
## Purchase WeekofPurchase StoreID PriceCH PriceMM DiscCH DiscMM SpecialCH
## 1 CH 237 1 1.75 1.99 0.00 0.0 0
## 2 CH 239 1 1.75 1.99 0.00 0.3 0
## 3 CH 245 1 1.86 2.09 0.17 0.0 0
## 4 MM 227 1 1.69 1.69 0.00 0.0 0
## 5 CH 228 7 1.69 1.69 0.00 0.0 0
## 6 CH 230 7 1.69 1.99 0.00 0.0 0
## SpecialMM LoyalCH SalePriceMM SalePriceCH PriceDiff Store7 PctDiscMM
## 1 0 0.500000 1.99 1.75 0.24 No 0.000000
## 2 1 0.600000 1.69 1.75 -0.06 No 0.150754
## 3 0 0.680000 2.09 1.69 0.40 No 0.000000
## 4 0 0.400000 1.69 1.69 0.00 No 0.000000
## 5 0 0.956535 1.69 1.69 0.00 Yes 0.000000
## 6 1 0.965228 1.99 1.69 0.30 Yes 0.000000
## PctDiscCH ListPriceDiff STORE
## 1 0.000000 0.24 1
## 2 0.000000 0.24 1
## 3 0.091398 0.23 1
## 4 0.000000 0.00 1
## 5 0.000000 0.00 0
## 6 0.000000 0.30 0
# Get a sense of the dataset
glimpse(Orange)
## Rows: 1,070
## Columns: 18
## $ Purchase <fct> CH, CH, CH, MM, CH, CH, CH, CH, CH, CH, CH, CH, CH, ...
## $ WeekofPurchase <int> 237, 239, 245, 227, 228, 230, 232, 234, 235, 238, 24...
## $ StoreID <fct> 1, 1, 1, 1, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 1, 2...
## $ PriceCH <dbl> 1.75, 1.75, 1.86, 1.69, 1.69, 1.69, 1.69, 1.75, 1.75...
## $ PriceMM <dbl> 1.99, 1.99, 2.09, 1.69, 1.69, 1.99, 1.99, 1.99, 1.99...
## $ DiscCH <dbl> 0.00, 0.00, 0.17, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00...
## $ DiscMM <dbl> 0.00, 0.30, 0.00, 0.00, 0.00, 0.00, 0.40, 0.40, 0.40...
## $ SpecialCH <int> 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ SpecialMM <int> 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1...
## $ LoyalCH <dbl> 0.500000, 0.600000, 0.680000, 0.400000, 0.956535, 0....
## $ SalePriceMM <dbl> 1.99, 1.69, 2.09, 1.69, 1.69, 1.99, 1.59, 1.59, 1.59...
## $ SalePriceCH <dbl> 1.75, 1.75, 1.69, 1.69, 1.69, 1.69, 1.69, 1.75, 1.75...
## $ PriceDiff <dbl> 0.24, -0.06, 0.40, 0.00, 0.00, 0.30, -0.10, -0.16, -...
## $ Store7 <fct> No, No, No, No, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Y...
## $ PctDiscMM <dbl> 0.000000, 0.150754, 0.000000, 0.000000, 0.000000, 0....
## $ PctDiscCH <dbl> 0.000000, 0.000000, 0.091398, 0.000000, 0.000000, 0....
## $ ListPriceDiff <dbl> 0.24, 0.24, 0.23, 0.00, 0.00, 0.30, 0.30, 0.24, 0.24...
## $ STORE <fct> 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2...
summary(Orange)
## Purchase WeekofPurchase StoreID PriceCH PriceMM
## CH:653 Min. :227.0 1 :157 Min. :1.690 Min. :1.690
## MM:417 1st Qu.:240.0 2 :222 1st Qu.:1.790 1st Qu.:1.990
## Median :257.0 3 :196 Median :1.860 Median :2.090
## Mean :254.4 4 :139 Mean :1.867 Mean :2.085
## 3rd Qu.:268.0 7 :355 3rd Qu.:1.990 3rd Qu.:2.180
## Max. :278.0 NA's: 1 Max. :2.090 Max. :2.290
## NA's :1 NA's :4
## DiscCH DiscMM SpecialCH SpecialMM
## Min. :0.00000 Min. :0.0000 Min. :0.000 Min. :0.0000
## 1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.:0.000 1st Qu.:0.0000
## Median :0.00000 Median :0.0000 Median :0.000 Median :0.0000
## Mean :0.05196 Mean :0.1234 Mean :0.147 Mean :0.1624
## 3rd Qu.:0.00000 3rd Qu.:0.2300 3rd Qu.:0.000 3rd Qu.:0.0000
## Max. :0.50000 Max. :0.8000 Max. :1.000 Max. :1.0000
## NA's :2 NA's :4 NA's :2 NA's :5
## LoyalCH SalePriceMM SalePriceCH PriceDiff Store7
## Min. :0.000011 Min. :1.190 Min. :1.390 Min. :-0.6700 No :714
## 1st Qu.:0.320000 1st Qu.:1.690 1st Qu.:1.750 1st Qu.: 0.0000 Yes:356
## Median :0.600000 Median :2.090 Median :1.860 Median : 0.2300
## Mean :0.565203 Mean :1.962 Mean :1.816 Mean : 0.1463
## 3rd Qu.:0.850578 3rd Qu.:2.130 3rd Qu.:1.890 3rd Qu.: 0.3200
## Max. :0.999947 Max. :2.290 Max. :2.090 Max. : 0.6400
## NA's :5 NA's :5 NA's :1 NA's :1
## PctDiscMM PctDiscCH ListPriceDiff STORE
## Min. :0.00000 Min. :0.00000 Min. :0.000 0 :356
## 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.140 1 :157
## Median :0.00000 Median :0.00000 Median :0.240 2 :222
## Mean :0.05939 Mean :0.02732 Mean :0.218 3 :194
## 3rd Qu.:0.11268 3rd Qu.:0.00000 3rd Qu.:0.300 4 :139
## Max. :0.40201 Max. :0.25269 Max. :0.440 NA's: 2
## NA's :5 NA's :2
Hint: See the result of glimpse(Orange).
glimpse(Orange)
## Rows: 1,070
## Columns: 18
## $ Purchase <fct> CH, CH, CH, MM, CH, CH, CH, CH, CH, CH, CH, CH, CH, ...
## $ WeekofPurchase <int> 237, 239, 245, 227, 228, 230, 232, 234, 235, 238, 24...
## $ StoreID <fct> 1, 1, 1, 1, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 1, 2...
## $ PriceCH <dbl> 1.75, 1.75, 1.86, 1.69, 1.69, 1.69, 1.69, 1.75, 1.75...
## $ PriceMM <dbl> 1.99, 1.99, 2.09, 1.69, 1.69, 1.99, 1.99, 1.99, 1.99...
## $ DiscCH <dbl> 0.00, 0.00, 0.17, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00...
## $ DiscMM <dbl> 0.00, 0.30, 0.00, 0.00, 0.00, 0.00, 0.40, 0.40, 0.40...
## $ SpecialCH <int> 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ SpecialMM <int> 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1...
## $ LoyalCH <dbl> 0.500000, 0.600000, 0.680000, 0.400000, 0.956535, 0....
## $ SalePriceMM <dbl> 1.99, 1.69, 2.09, 1.69, 1.69, 1.99, 1.59, 1.59, 1.59...
## $ SalePriceCH <dbl> 1.75, 1.75, 1.69, 1.69, 1.69, 1.69, 1.69, 1.75, 1.75...
## $ PriceDiff <dbl> 0.24, -0.06, 0.40, 0.00, 0.00, 0.30, -0.10, -0.16, -...
## $ Store7 <fct> No, No, No, No, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Y...
## $ PctDiscMM <dbl> 0.000000, 0.150754, 0.000000, 0.000000, 0.000000, 0....
## $ PctDiscCH <dbl> 0.000000, 0.000000, 0.091398, 0.000000, 0.000000, 0....
## $ ListPriceDiff <dbl> 0.24, 0.24, 0.23, 0.00, 0.00, 0.30, 0.30, 0.24, 0.24...
## $ STORE <fct> 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2...
There are 18 columns in the Orange dataset.
Hint: Your interpretation must discuss the following variables: Purchase, WeekofPurchase, StoreID, LoyalCH, SalePriceCH, and PriceDiff.
Row 2 of the Orange dataset records a purchase of Citrus Hill (Purchase = CH) at store 1 (StoreID = 1) in week 239 (WeekofPurchase). The customer's loyalty to CH (LoyalCH) was 0.6 on the 0-1 scale. The CH sale price (SalePriceCH) was $1.75. PriceDiff is -0.06, which means the MM sale price ($1.69) was 6 cents lower than the CH sale price that week, yet this customer still bought CH.
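The direction of PriceDiff can be checked against the two sale-price columns. The sketch below is a minimal check that rebuilds only the first three rows printed by head(Orange) above; the data frame name first_rows is hypothetical, and the check assumes PriceDiff is defined as the MM sale price minus the CH sale price.

```r
# First three rows, copied from the head(Orange) output above.
# `first_rows` is a hypothetical name used only for this check.
first_rows <- data.frame(
  Purchase    = c("CH", "CH", "CH"),
  SalePriceMM = c(1.99, 1.69, 2.09),
  SalePriceCH = c(1.75, 1.75, 1.69),
  PriceDiff   = c(0.24, -0.06, 0.40)
)

# Recompute the difference as MM sale price minus CH sale price
recomputed <- first_rows$SalePriceMM - first_rows$SalePriceCH

# all.equal() compares with a small numeric tolerance
matches <- isTRUE(all.equal(recomputed, first_rows$PriceDiff))
matches
```

If the check holds, a negative PriceDiff means Minute Maid was the cheaper brand on that purchase occasion.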
SalePriceMM
What is the median price of Minute Maid orange juice?
Hint: See the result of summary(Orange).
median_pr <- median(Orange$SalePriceMM, na.rm = TRUE)
median_pr
## [1] 2.09
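The na.rm = TRUE argument matters here because SalePriceMM contains missing values (summary(Orange) reports 5 NA's). A minimal sketch of the difference, using a short made-up vector rather than the real column:

```r
# A short illustrative vector with one missing value (not the real column)
prices <- c(1.99, 1.69, 2.09, NA)

# Without na.rm, a single NA makes the median NA
with_na <- median(prices)

# With na.rm = TRUE, the NA is dropped before the median is computed
without_na <- median(prices, na.rm = TRUE)

with_na
without_na
```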
Purchase
Which of the two brands was sold more? Minute Maid or Citrus Hill?
Hint: See the result of summary(Orange).
Citrus Hill sold 236 more units than Minute Maid, with Citrus Hill at 653 total sales, and Minute Maid at 417 total sales.
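The same counts can be produced with table() instead of being read off the summary. The sketch below rebuilds a factor with the counts reported by summary(Orange) so that it runs on its own; the vector name purchase is hypothetical, and with the real data the call would be table(Orange$Purchase).

```r
# Rebuild a factor with the counts reported by summary(Orange):
# 653 CH purchases and 417 MM purchases
purchase <- factor(rep(c("CH", "MM"), times = c(653, 417)))

# Count purchases per brand
counts <- table(purchase)
counts

# Difference between the two brands
gap <- counts[["CH"]] - counts[["MM"]]
gap
```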
SalePriceMM
Graph the distribution of Minute Maid orange juice prices.
Hint: Insert a code chunk below and add the code to create a histogram.
p <- ggplot(data = Orange, aes(x = SalePriceMM)) +
  geom_histogram()
p
## Warning: Removed 5 rows containing non-finite values (stat_bin).
Hint: Discuss in terms of the characteristics of the normal distribution you learned in Quiz2-a.
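No, the prices do not appear to be normally distributed, for two reasons. First, the histogram does not have a symmetric bell shape. Second, in a normal distribution the mean and median are roughly equal, but here the mean (1.962) is below the median (2.09), and the minimum (1.19) lies much farther from the median than the maximum (2.29) does, which points to a left skew. A normal distribution can have any standard deviation, so the size of the spread alone does not decide normality.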
StoreID
In which store does the typical Minute Maid orange juice price appear to be lowest? Create a boxplot.
Hint: Insert a code chunk below and add the code to create a boxplot. See Data Visualization with R: Ch4.3.3 Box plots. Map SalePriceMM to the y-axis and StoreID to the x-axis. The typical value is represented by the median, the thick horizontal line inside the box.
p <- ggplot(data = Orange, aes(y = SalePriceMM, x = StoreID)) +
  geom_boxplot()
p
## Warning: Removed 5 rows containing non-finite values (stat_boxplot).
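Minute Maid prices appear to be lowest at store 1, where the minimum price falls below $1.20. Because the question asks about the typical price, the comparison that matters is the median line of each box rather than the minimum.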
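The median comparison can also be done numerically instead of being read off the plot. The sketch below is a minimal illustration on a toy data frame: the name toy and all of its values are made up, so the result says nothing about which real store has the lowest median. With the real data, the same call would use Orange$SalePriceMM and Orange$StoreID.

```r
# Toy data, invented for illustration only (not the Orange data)
toy <- data.frame(
  StoreID     = factor(c("1", "1", "1", "2", "2", "2", "7", "7", "7")),
  SalePriceMM = c(1.69, 1.99, 2.09, 1.99, 2.09, 2.13, 1.59, 1.69, NA)
)

# Median sale price per store, dropping missing values
medians <- tapply(toy$SalePriceMM, toy$StoreID, median, na.rm = TRUE)
medians

# Store with the lowest median in the toy data
lowest_store <- names(which.min(medians))
lowest_store
```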
Hint: Use message, echo and results in the chunk options. Refer to the RMarkdown Reference Guide.
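One way to apply these options is to set them as defaults for every chunk from a setup chunk, using knitr's opts_chunk$set(). This is a sketch under assumptions: the hint names only message, echo and results, so warning = FALSE is an addition here, included because the "built under R version" lines above are warnings rather than messages.

```r
library(knitr)

# Set defaults for all later chunks in the document.
# The same options can be written in a single chunk header instead,
# e.g. {r, message = FALSE, echo = FALSE, results = "hide"}.
opts_chunk$set(
  message = FALSE,   # hide messages, such as package start-up messages
  warning = FALSE,   # hide warnings (assumption: not named in the hint)
  echo    = FALSE,   # hide the code itself
  results = "hide"   # hide printed text output; plots are still shown
)

# Inspect one of the defaults that was just set
opts_chunk$get("echo")
```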