Week 6 Data Dive

Importing libraries

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
library(ggplot2)
library(boot)

Importing data

df <- read.csv('Auto Sales data.csv')

Finding a Correlative Relationship

Last week, I looked at the correlation using the QUANTITYORDERED, MSRP, and PRICEEACH columns, alongside a column I created, PRICEDIFF: the difference between the values in PRICEEACH and MSRP. This week I want to look at a different combination:

MSRP and SALES (QUANTITYORDERED * PRICEEACH).
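The relationship in parentheses can be checked directly. The sketch below uses a small made-up data frame named `toy` (the rows are illustrative, not values from the data set) and assumes the real columns are named QUANTITYORDERED, PRICEEACH, and SALES.

```r
# Made-up rows for illustration only; NOT taken from 'Auto Sales data.csv'
toy <- data.frame(
  QUANTITYORDERED = c(10, 20, 5),
  PRICEEACH       = c(50.0, 75.5, 100.0),
  SALES           = c(500.0, 1510.0, 500.0)
)

# Recompute SALES from its two components
recomputed <- toy$QUANTITYORDERED * toy$PRICEEACH

# TRUE when stored and recomputed values agree up to floating-point tolerance
sales_matches <- isTRUE(all.equal(recomputed, toy$SALES))
sales_matches
```

Swapping `toy` for `df` would run the same check on the actual data; `all.equal()` is used instead of `==` so that small rounding differences do not count as mismatches.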

df |>
  ggplot() +
  geom_point(
    aes(x=MSRP, y=SALES)
  ) + 
  labs(title = paste("Correlation:", 
                        round(cor(df$MSRP, 
                                  df$SALES), 2)))

The relationship is positive, and fairly strong given how scattered the plot is: products with a higher MSRP tend to have higher SALES values. The point with the highest SALES value looks like an outlier.
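One way to judge how much a single extreme point matters is to compare the correlation with and without it, and against a rank-based measure. The sketch below runs on simulated stand-ins for MSRP and SALES (the numbers, the seed, and the added extreme point are all assumptions for illustration, not the real data).

```r
set.seed(42)  # fixed seed so the simulated example is reproducible

# Simulated stand-ins for MSRP and SALES; these are NOT the real data
msrp  <- runif(200, min = 30, max = 220)
sales <- 20 * msrp + rnorm(200, mean = 0, sd = 1500)

# Add one extreme point to mimic the suspected outlier
msrp_out  <- c(msrp, 210)
sales_out <- c(sales, 14000)

# Pearson uses the raw values, so extreme points can pull it around;
# Spearman works on ranks and is less sensitive to them
r_pearson_all  <- cor(msrp_out, sales_out)
r_pearson_trim <- cor(msrp, sales)
r_spearman_all <- cor(msrp_out, sales_out, method = "spearman")

round(c(pearson_all             = r_pearson_all,
        pearson_without_outlier = r_pearson_trim,
        spearman_all            = r_spearman_all), 2)
```

On the actual data, the same comparison would use `df$MSRP` and `df$SALES`, dropping the row where SALES is at its maximum. If the three values stay close to each other, the outlier is not driving the relationship.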

Confidence Interval

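av_SALES <- mean(df$SALES)
sd_SALES <- sd(df$SALES)
## Making a sample
x <- sample(df$SALES, 200)
## Proposed sampling distribution of the sample mean:
## normal, centered at 3553 with a standard error of 101
f_sales <- \(x) dnorm(x, mean = 3553, sd = 101)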

## Creating the confidence interval for SALES
P <- 0.95

z_score <- qnorm(p=(1 - P)/2, lower.tail=FALSE)
  
ggplot() +
  geom_function(xlim = c(3355, 3751), 
                fun = f_sales) +
  geom_segment(mapping = aes(x = av_SALES - z_score * 101,
                             y = f_sales(av_SALES),
                             xend = av_SALES + z_score * 101,
                             yend = f_sales(av_SALES),
                             linetype = "proposed interval"),
               color = "gray") +
  geom_point(mapping = aes(x = av_SALES,
             y = f_sales(av_SALES),
             color = "our sample"), size = 2) +
  labs(title = "*Possible* Sampling Distribution for df_SALES",
       x = "Sample Mean",
       y = "Probability Density",
       color = "",
       linetype = "") +
  theme_minimal()

If we drew 100 random samples from the automobile data set, then under the proposed sampling distribution about 95 of their average SALES values would fall between 3355 and 3751 monetary units…
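The two bounds can be reproduced from the numbers used above. This is a minimal sketch that treats 3553 as the center and 101 as the standard error of the proposed sampling distribution (both taken from the code above; the names `center`, `se`, `ci`, and `coverage` are mine).

```r
# Values taken from the analysis above; 101 is treated as the standard error
center <- 3553
se     <- 101
P      <- 0.95

# Two-sided critical value, same call as in the analysis (about 1.96)
z_score <- qnorm(p = (1 - P) / 2, lower.tail = FALSE)

# Interval bounds: center plus or minus z * standard error
ci <- c(lower = center - z_score * se,
        upper = center + z_score * se)
round(ci)

# Share of the proposed normal distribution that lies inside the interval
coverage <- pnorm(ci[["upper"]], mean = center, sd = se) -
  pnorm(ci[["lower"]], mean = center, sd = se)
coverage
```

Rounded, the bounds come out to 3355 and 3751, and `coverage` equals the chosen confidence level `P`. Changing `P` widens or narrows the interval through `z_score` alone.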