Quiz3

Q1 Describe a situation when the median is more appropriate than the mean as a measure of centrality.
Q2 SalePriceMM Calculate the mean price of Minute Maid orange juice.
Q3 SalePriceMM Calculate the median price of Minute Maid orange juice.
Q4 SalePriceMM Plot Minute Maid orange juice prices in a histogram.
Q5 Add the vertical lines of mean_pr and median_pr in the histogram.
Q6 Which of the two measures would be more appropriate to represent the typical price? Why?
Q7 Law of Large Numbers We learned that the sample mean is not likely to be representative of the population mean when a sample is too small. Explain why.
Q8 Hide the messages and warnings, but display the code and its results on the webpage.
Q9 Display the title and your name correctly at the top of the webpage.
Q10 Use the correct slug.

For this quiz, you are going to use orange juice data. This data set was originally used in a machine learning (ML) class, with the goal of predicting which of the two brands of orange juice the customers bought. Of course, you are not building an ML algorithm in this quiz. I just wanted to provide you with the context of the data.

The response variable (which the ML algorithm is built to predict) is Purchase, which takes either CH (Citrus Hill) or MM (Minute Maid). The predictor variables (which the ML algorithm uses to make predictions) are characteristics of the customer and the product itself. Together, the data set has 18 variables. WeekofPurchase is the week of purchase. LoyalCH is customer brand loyalty for CH (how loyal the customer is to CH on a scale of 0-1), and is the only variable that characterizes customers. All other variables are characteristics of the product or of the store where the sale occurred. For more information on the data set, click the link below and scroll down to page 11. https://cran.r-project.org/web/packages/ISLR/ISLR.pdf

# Load the package
library(tidyverse)

## -- Attaching packages -------------------

## v ggplot2 3.3.2     v purrr   0.3.4
## v tibble  3.0.3     v dplyr   1.0.2
## v tidyr   1.1.2     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.5.0

## -- Conflicts --- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

# Import data
Orange <- read.csv('https://raw.githubusercontent.com/selva86/datasets/master/orange_juice_withmissing.csv', stringsAsFactors = TRUE) %>%
  mutate(STORE = as.factor(STORE),
         StoreID = as.factor(StoreID))

# Print the first 6 rows
head(Orange)

##   Purchase WeekofPurchase StoreID PriceCH PriceMM DiscCH DiscMM SpecialCH
## 1       CH            237       1    1.75    1.99   0.00    0.0         0
## 2       CH            239       1    1.75    1.99   0.00    0.3         0
## 3       CH            245       1    1.86    2.09   0.17    0.0         0
## 4       MM            227       1    1.69    1.69   0.00    0.0         0
## 5       CH            228       7    1.69    1.69   0.00    0.0         0
## 6       CH            230       7    1.69    1.99   0.00    0.0         0
##   SpecialMM  LoyalCH SalePriceMM SalePriceCH PriceDiff Store7 PctDiscMM
## 1         0 0.500000        1.99        1.75      0.24     No  0.000000
## 2         1 0.600000        1.69        1.75     -0.06     No  0.150754
## 3         0 0.680000        2.09        1.69      0.40     No  0.000000
## 4         0 0.400000        1.69        1.69      0.00     No  0.000000
## 5         0 0.956535        1.69        1.69      0.00    Yes  0.000000
## 6         1 0.965228        1.99        1.69      0.30    Yes  0.000000
##   PctDiscCH ListPriceDiff STORE
## 1  0.000000          0.24     1
## 2  0.000000          0.24     1
## 3  0.091398          0.23     1
## 4  0.000000          0.00     1
## 5  0.000000          0.00     0
## 6  0.000000          0.30     0

# Get a sense of the dataset
glimpse(Orange)

## Rows: 1,070
## Columns: 18
## $ Purchase       <fct> CH, CH, CH, MM, CH, CH, CH, CH, CH, CH, CH, CH, CH, ...
## $ WeekofPurchase <int> 237, 239, 245, 227, 228, 230, 232, 234, 235, 238, 24...
## $ StoreID        <fct> 1, 1, 1, 1, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 1, 2...
## $ PriceCH        <dbl> 1.75, 1.75, 1.86, 1.69, 1.69, 1.69, 1.69, 1.75, 1.75...
## $ PriceMM        <dbl> 1.99, 1.99, 2.09, 1.69, 1.69, 1.99, 1.99, 1.99, 1.99...
## $ DiscCH         <dbl> 0.00, 0.00, 0.17, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00...
## $ DiscMM         <dbl> 0.00, 0.30, 0.00, 0.00, 0.00, 0.00, 0.40, 0.40, 0.40...
## $ SpecialCH      <int> 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ SpecialMM      <int> 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1...
## $ LoyalCH        <dbl> 0.500000, 0.600000, 0.680000, 0.400000, 0.956535, 0....
## $ SalePriceMM    <dbl> 1.99, 1.69, 2.09, 1.69, 1.69, 1.99, 1.59, 1.59, 1.59...
## $ SalePriceCH    <dbl> 1.75, 1.75, 1.69, 1.69, 1.69, 1.69, 1.69, 1.75, 1.75...
## $ PriceDiff      <dbl> 0.24, -0.06, 0.40, 0.00, 0.00, 0.30, -0.10, -0.16, -...
## $ Store7         <fct> No, No, No, No, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Y...
## $ PctDiscMM      <dbl> 0.000000, 0.150754, 0.000000, 0.000000, 0.000000, 0....
## $ PctDiscCH      <dbl> 0.000000, 0.000000, 0.091398, 0.000000, 0.000000, 0....
## $ ListPriceDiff  <dbl> 0.24, 0.24, 0.23, 0.00, 0.00, 0.30, 0.30, 0.24, 0.24...
## $ STORE          <fct> 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2...

summary(Orange)

##  Purchase WeekofPurchase  StoreID       PriceCH         PriceMM     
##  CH:653   Min.   :227.0   1   :157   Min.   :1.690   Min.   :1.690  
##  MM:417   1st Qu.:240.0   2   :222   1st Qu.:1.790   1st Qu.:1.990  
##           Median :257.0   3   :196   Median :1.860   Median :2.090  
##           Mean   :254.4   4   :139   Mean   :1.867   Mean   :2.085  
##           3rd Qu.:268.0   7   :355   3rd Qu.:1.990   3rd Qu.:2.180  
##           Max.   :278.0   NA's:  1   Max.   :2.090   Max.   :2.290  
##                                      NA's   :1       NA's   :4      
##      DiscCH            DiscMM         SpecialCH       SpecialMM     
##  Min.   :0.00000   Min.   :0.0000   Min.   :0.000   Min.   :0.0000  
##  1st Qu.:0.00000   1st Qu.:0.0000   1st Qu.:0.000   1st Qu.:0.0000  
##  Median :0.00000   Median :0.0000   Median :0.000   Median :0.0000  
##  Mean   :0.05196   Mean   :0.1234   Mean   :0.147   Mean   :0.1624  
##  3rd Qu.:0.00000   3rd Qu.:0.2300   3rd Qu.:0.000   3rd Qu.:0.0000  
##  Max.   :0.50000   Max.   :0.8000   Max.   :1.000   Max.   :1.0000  
##  NA's   :2         NA's   :4        NA's   :2       NA's   :5       
##     LoyalCH          SalePriceMM     SalePriceCH      PriceDiff       Store7   
##  Min.   :0.000011   Min.   :1.190   Min.   :1.390   Min.   :-0.6700   No :714  
##  1st Qu.:0.320000   1st Qu.:1.690   1st Qu.:1.750   1st Qu.: 0.0000   Yes:356  
##  Median :0.600000   Median :2.090   Median :1.860   Median : 0.2300            
##  Mean   :0.565203   Mean   :1.962   Mean   :1.816   Mean   : 0.1463            
##  3rd Qu.:0.850578   3rd Qu.:2.130   3rd Qu.:1.890   3rd Qu.: 0.3200            
##  Max.   :0.999947   Max.   :2.290   Max.   :2.090   Max.   : 0.6400            
##  NA's   :5          NA's   :5       NA's   :1       NA's   :1                  
##    PctDiscMM         PctDiscCH       ListPriceDiff    STORE    
##  Min.   :0.00000   Min.   :0.00000   Min.   :0.000   0   :356  
##  1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:0.140   1   :157  
##  Median :0.00000   Median :0.00000   Median :0.240   2   :222  
##  Mean   :0.05939   Mean   :0.02732   Mean   :0.218   3   :194  
##  3rd Qu.:0.11268   3rd Qu.:0.00000   3rd Qu.:0.300   4   :139  
##  Max.   :0.40201   Max.   :0.25269   Max.   :0.440   NA's:  2  
##  NA's   :5         NA's   :2

Q1 Describe a situation when the median is more appropriate than the mean as a measure of centrality.

I want to know how far the typical PSU student travels to come to campus. But there are too many students to collect data from. So I collected a sample of 10 students to get the mean. Would the mean of the 10-student sample be a good representation of all PSU students? What if there is a student from Alaska in the 10-student sample? Would that skew the mean? Would the student from Alaska have the same degree of influence had I collected a sample of 100 students?

The mean can be distorted when the data set is very small and one of the data points is far from the rest. That single outlier pulls the mean away from the typical value, and you would need to collect more data before the outlier stops having a noticeable impact on the mean.

This is when the median is the better choice. It depends only on the middle of the sorted values, so an outlier in a small data set has little effect on it, and it gives a better measure of centrality.
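As a rough illustration of the travel-distance scenario, the sketch below compares the mean and the median with and without one extreme value. The distances are made-up numbers chosen only for illustration; they are not real PSU data.

```r
# Hypothetical one-way travel distances in miles for 10 students (made-up values).
# The last value stands in for the student whose home is in Alaska.
distance <- c(2, 3, 5, 5, 8, 10, 12, 15, 20, 3000)

# With the outlier: the mean is pulled far upward, the median barely moves.
mean(distance)
median(distance)

# Without the outlier (drop the 10th value): the mean changes a lot,
# the median only slightly.
mean(distance[-10])
median(distance[-10])
```

In this made-up sample the mean drops from 308 to about 8.9 once the extreme value is removed, while the median only moves from 9 to 8.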

Q2 `SalePriceMM` Calculate the mean price of Minute Maid orange juice.

Hint: Code it so that the outcome is a scalar, not a data frame. It’s the same code you learned in Quiz3-b. Save the result under mean_pr.

mean_pr <- mean(Orange$SalePriceMM, na.rm = TRUE)
mean_pr

## [1] 1.961934
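The sketch below illustrates the two points behind this code: `na.rm = TRUE` is needed because the summary above shows missing values in SalePriceMM, and indexing the column with `$` yields a scalar whereas `summarise()` yields a data frame. The small `prices` data frame is a hypothetical stand-in for Orange, with made-up values.

```r
library(dplyr)

# Hypothetical stand-in for the Orange data (made-up values, one missing price).
prices <- data.frame(SalePriceMM = c(1.69, 1.99, 2.09, NA))

# Without na.rm = TRUE a single missing value makes the result NA.
mean(prices$SalePriceMM)

# Scalar result: a numeric vector of length 1.
scalar_mean <- mean(prices$SalePriceMM, na.rm = TRUE)
scalar_mean

# Data frame result: one row, one column.
df_mean <- prices %>%
  summarise(mean_pr = mean(SalePriceMM, na.rm = TRUE))
df_mean
```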

Q3 `SalePriceMM` Calculate the median price of Minute Maid orange juice.

Hint: Replace mean with median in the code in Q2. Save the result under median_pr.

median_pr <- median(Orange$SalePriceMM, na.rm = TRUE)
median_pr

## [1] 2.09

Q4 `SalePriceMM` Plot Minute Maid orange juice prices in a histogram.

Hint: Refer to the code in Data Visualization with R: Ch3.2.1 Histogram.

ggplot(Orange, aes(x = SalePriceMM)) +
  geom_histogram() + 
  labs(title = "Minute Maid Prices",
       x = "price")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
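The message above appears because no bin width was given, so ggplot2 falls back to 30 bins. A minimal sketch of setting the width explicitly follows. The `prices` data frame is a hypothetical stand-in for Orange with made-up values, and the 10-cent `binwidth` is an assumed choice rather than a value prescribed by the quiz.

```r
library(ggplot2)

# Hypothetical stand-in for Orange$SalePriceMM (made-up values, including one NA).
prices <- data.frame(
  SalePriceMM = c(1.19, 1.59, 1.69, 1.69, 1.99, 2.09, 2.09, 2.13, 2.29, NA)
)

# binwidth = 0.1 (10 cents per bar) is an assumed choice;
# na.rm = TRUE drops missing prices without a warning.
p <- ggplot(prices, aes(x = SalePriceMM)) +
  geom_histogram(binwidth = 0.1, na.rm = TRUE) +
  labs(title = "Minute Maid Prices",
       x = "price")
p
```

Trying a few widths is usually worthwhile, since a histogram's apparent shape can change with the bin width.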

Q5 Add the vertical lines of mean_pr and median_pr in the histogram.

Hint: Copy the code from Q4 and add two lines of the geom_vline() function in the code for vertical lines of the mean and the median prices. Google geom_vline() for its documentation.

ggplot(Orange, aes(x = SalePriceMM)) +
  geom_histogram() + 
  labs(title = "Minute Maid Prices",
       x = "price") +
  geom_vline(xintercept = mean_pr, color = "red") +
  geom_vline(xintercept = median_pr, color = "blue")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

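Q6 Which of the two measures would be more appropriate to represent the typical price? Why?

The median, because the price distribution is skewed. A tail of unusually low sale prices pulls the mean (about 1.96) below the median (2.09), so the mean is thrown off while the median stays near the price most purchases were made at.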

Q7 `Law of Large Numbers` We learned that the sample mean is not likely to be representative of the population mean when a sample is too small. Explain why.

With a small sample, a single outlier can throw off the mean, because each observation carries a large share of the total. As the sample grows, any one outlier has less influence and the sample mean settles closer to the population mean. You need more data to get a reliable mean, especially when there are outliers in the data.
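A small simulation can make this concrete. The sketch below draws repeated samples of different sizes from a skewed population and measures how far the sample mean typically lands from the population mean. The population (an exponential distribution with mean 10), the seed, and the helper name `typical_error` are all assumptions made for illustration.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible

# Hypothetical skewed population: mostly small values with a long right tail.
population <- rexp(100000, rate = 1 / 10)
pop_mean <- mean(population)

# Average absolute gap between a sample mean and the population mean
# for samples of size n, estimated from `reps` repeated samples.
typical_error <- function(n, reps = 1000) {
  sample_means <- replicate(reps, mean(sample(population, n)))
  mean(abs(sample_means - pop_mean))
}

errors <- sapply(c(10, 100, 1000), typical_error)
names(errors) <- c("n = 10", "n = 100", "n = 1000")
errors
```

The gap shrinks as n grows, which is the pattern the Law of Large Numbers describes.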

Q8 Hide the messages and warnings, but display the code and its results on the webpage.

Hint: Use message, echo and results in the chunk options. Refer to the RMarkdown Reference Guide.
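One common way to apply these options to every chunk at once is a call to `knitr::opts_chunk$set()` in a setup chunk near the top of the .Rmd file. The sketch below assumes that layout; the same options can instead be written in an individual chunk header.

```r
# Assumed to sit in a setup chunk at the top of the .Rmd file.
knitr::opts_chunk$set(
  echo    = TRUE,      # display the code
  results = "markup",  # display the printed results
  message = FALSE,     # hide messages, such as the tidyverse attach notes
  warning = FALSE      # hide warnings
)
```

Setting the options globally keeps the individual chunk headers short, at the cost of hiding messages everywhere, including chunks where they might be informative.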

Quiz3

Colton Petrosino
