For this quiz, you are going to use the orange juice data. This data set was originally used in a machine learning (ML) class, with the goal of predicting which of two brands of orange juice customers bought. Of course, you are not building an ML algorithm in this quiz. I just wanted to provide you with the context of the data.
The response variable (that the ML algorithm is built to predict) is Purchase, which takes either CH (Citrus Hill) or MM (Minute Maid). The predictor variables (that the ML algorithm uses to make predictions) are characteristics of the customer and the product itself. Together, the data set has 18 variables. WeekofPurchase is the week of purchase. LoyalCH is customer brand loyalty for CH (how loyal the customer is to CH on a scale of 0-1), and is the only variable that characterizes customers. All other variables are characteristics of the product or of the stores where the sale occurred. For more information on the data set, click the link below and scroll down to page 11. https://cran.r-project.org/web/packages/ISLR/ISLR.pdf
Use the median when the mean is skewed by outliers or by other data that affects the measure of center. For example, with wages or salaries, a few very large values pull the mean upward, so in this case you would use the median.
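The salary example can be sketched in a few lines of R. The salary figures below are hypothetical, made up only to illustrate how one outlier moves the mean while leaving the median almost untouched.

```r
# Hypothetical salaries (in dollars): five typical values and one outlier.
salaries <- c(42000, 45000, 47000, 50000, 52000, 400000)

mean_salary <- mean(salaries)      # pulled upward by the single large value
median_salary <- median(salaries)  # middle of the sorted values

print(c(mean = mean_salary, median = median_salary))
```

In this made-up sample the mean lands well above every typical salary, which is why the median is the safer summary here.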
SalePriceMM: Calculate the mean price of Minute Maid orange juice. The mean is 1.962.
SalePriceMM: Calculate the median price of Minute Maid orange juice. The median is 2.090.
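One way the two values could be computed is sketched below. It assumes the ISLR package is installed and that the data are the OJ data frame it provides; the names mean_mm and median_mm are chosen here for illustration.

```r
library(ISLR)  # assumed installed; provides the OJ data frame

# Mean and median sale price of Minute Maid.
mean_mm <- mean(OJ$SalePriceMM)
median_mm <- median(OJ$SalePriceMM)

print(round(c(mean = mean_mm, median = median_mm), 3))
```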
SalePriceMM: Plot Minute Maid orange juice prices in a histogram.
library(ISLR)
library(ggplot2)
# Histogram of Minute Maid sale prices, with a vertical line at the mean (1.96)
ggplot(OJ, aes(x = SalePriceMM)) + geom_histogram() +
  geom_vline(xintercept = 1.96)
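The median, because the mean is easily skewed. Here the mean (1.962) falls below the median (2.090), which suggests that a group of lower prices pulls the mean down.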
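Law of Large Numbers: We learned that the sample mean is not likely to be representative of the population mean when a sample is too small. Explain why. In a small sample, a single outlier carries a large share of the total, so it throws off the mean. As the sample grows, each individual value has less influence and the sample mean settles closer to the population mean.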
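A small simulation can illustrate this idea. This is only a sketch under stated assumptions: the population is hypothetical (simulated right-skewed values, not the orange juice data), and the seed, sample sizes, and the helper spread_of_sample_mean are arbitrary choices made for illustration.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible

# Hypothetical population: 100,000 right-skewed values.
population <- rexp(100000, rate = 1)
pop_mean <- mean(population)

# Standard deviation of the sample mean across many samples of size n.
spread_of_sample_mean <- function(n, reps = 1000) {
  sd(replicate(reps, mean(sample(population, n))))
}

spread_small <- spread_of_sample_mean(5)    # tiny samples
spread_large <- spread_of_sample_mean(500)  # larger samples

print(c(n5 = spread_small, n500 = spread_large))
```

The sample means from samples of size 5 vary much more around pop_mean than those from samples of size 500, which is the behavior the Law of Large Numbers describes.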