Week 7 Data Dive

Importing libraries

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
library(ggplot2)
library(boot)
## Others from the lab
library(ggthemes)
library(ggrepel)
library(effsize)
library(pwrss)
## 
## Attaching package: 'pwrss'
## 
## The following object is masked from 'package:stats':
## 
##     power.t.test
library(pwr)
## Warning: package 'pwr' was built under R version 4.4.3

Importing data

df <- read.csv('Auto Sales data.csv')

Creating a Null Hypothesis

The SALES column appears to be the most important piece of information. Among the products, there are two different types of cars, Classic and Vintage… so which one brings in more sales?

H0: Classic and Vintage Cars bring in the same amount of sales on average

Ha: Classic and Vintage Cars bring in a different amount of sales on average

df_classic <- subset(df, PRODUCTLINE == "Classic Cars", select = SALES)
df_vintage <- subset(df, PRODUCTLINE == "Vintage Cars", select = SALES)

box_df <- data.frame(Product = df$PRODUCTLINE, SALES = df$SALES) |>
  filter(Product == "Classic Cars" | Product == "Vintage Cars")
box_df |>
  ggplot() +
  geom_boxplot(
    aes(x=Product, y=SALES)
  )

I did take a peek at the boxplot before starting the test, and it showed that Classic Cars sit a bit higher than Vintage Cars (strictly, the line in each box is the median rather than the average)… so I’ll change Ha to reflect that!

Ha: On average, Classic Cars bring in a larger amount of sales than Vintage Cars.
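Since a boxplot draws medians rather than means, the direction behind this Ha can also be checked numerically. The sketch below uses a small made-up data frame (`toy_df` is a hypothetical stand-in for `box_df`, not the real sales data) and base R only, so it runs without the CSV:

```r
# Hypothetical stand-in for box_df; the real values come from 'Auto Sales data.csv'.
toy_df <- data.frame(
  Product = rep(c("Classic Cars", "Vintage Cars"), each = 5),
  SALES   = c(2000, 3000, 4000, 5000, 11000,  # Classic Cars (one large order)
              1500, 2500, 3000, 3500, 4500)   # Vintage Cars
)

# Mean and median of SALES for each product line.
# Rows are the statistics, columns are the product lines.
group_summary <- sapply(
  split(toy_df$SALES, toy_df$Product),
  function(x) c(mean = mean(x), median = median(x))
)
print(group_summary)
```

On the real data, swapping `toy_df` for `box_df` should give the same kind of table. In the toy example the single large order pulls the Classic Cars mean above its median, which is why the two numbers are worth looking at separately.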

Testing the Null Hypothesis

A Type I error (rejecting the null hypothesis when Classic and Vintage Cars actually sell the same on average) would cause the company to lose profit by reducing the number of Vintage Cars they attempt to sell in favor of Classic Cars. That sounds pretty important, so the \(\alpha\) value should be .05 to reflect that.

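A Type II error (failing to reject the null hypothesis when Classic Cars really do sell more) would mean a lost opportunity to acquire more profit, because the company would not increase the number of Classic Cars sold at the expense of Vintage Cars. It sounds about as bad as a Type I error, so the \(\beta\) value should be the same as the \(\alpha\) value: .05, which corresponds to a power of \(1 - \beta\) = .95.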

# Required sample size per group for a two-sample t-test
pwr.t.test(
  d = 0.5,           # Cohen's d (standardized effect size)
  sig.level = 0.05,  # Significance level
  power = 0.95,      # Desired power
  type = "two.sample",
  alternative = "two.sided"
)
## 
##      Two-sample t test power calculation 
## 
##               n = 104.9279
##               d = 0.5
##       sig.level = 0.05
##           power = 0.95
##     alternative = two.sided
## 
## NOTE: n is number in *each* group

Rounding up, that is 105 observations per group, which isn’t much compared to the number that we have… less than 5% of our sample! This means we can run the test!
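As a cross-check, the same sample size can be sketched with base R's `power.t.test()`. Two things here are my assumptions rather than part of the original analysis: `sd = 1` (so that `delta` plays the role of Cohen's d) and the extra one-sided run, which matches the directional Ha. The `stats::` prefix matters in this session because `pwrss` masks `power.t.test`, as the attach message at the top shows.

```r
# Two-sided requirement, mirroring the pwr.t.test() call.
two_sided <- stats::power.t.test(
  delta = 0.5, sd = 1,            # delta / sd = Cohen's d of 0.5
  sig.level = 0.05, power = 0.95,
  type = "two.sample", alternative = "two.sided",
  strict = TRUE                   # count both rejection tails
)

# One-sided requirement, matching a directional alternative.
one_sided <- stats::power.t.test(
  delta = 0.5, sd = 1,
  sig.level = 0.05, power = 0.95,
  type = "two.sample", alternative = "one.sided"
)

# Round up: a fractional observation cannot be collected.
n_two <- ceiling(two_sided$n)
n_one <- ceiling(one_sided$n)
print(c(two_sided = n_two, one_sided = n_one))
```

The one-sided requirement comes out smaller than the two-sided one, so planning for the two-sided test is the more conservative choice.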

t.test(
  x = df_classic$SALES,
  y = df_vintage$SALES,
  alternative = "two.sided",
  paired = FALSE)
## 
##  Welch Two Sample t-test
## 
## data:  df_classic$SALES and df_vintage$SALES
## t = 9.3593, df = 1357.3, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##   734.3199 1123.7788
## sample estimates:
## mean of x mean of y 
##  4049.387  3120.338
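Because Ha is directional, the same test could also be run one-sided with `alternative = "greater"`. The sketch below uses simulated data (`classic_sim` and `vintage_sim` are hypothetical stand-ins, and their means and standard deviations are made up) to show how the two versions relate:

```r
set.seed(42)  # reproducible simulated data

# Hypothetical stand-ins for df_classic$SALES and df_vintage$SALES.
classic_sim <- rnorm(200, mean = 4000, sd = 1800)
vintage_sim <- rnorm(200, mean = 3100, sd = 1600)

# Welch two-sample t-tests (t.test() defaults to var.equal = FALSE).
two_sided <- t.test(classic_sim, vintage_sim, alternative = "two.sided")
one_sided <- t.test(classic_sim, vintage_sim, alternative = "greater")

# When t is positive, the one-sided p-value is half of the two-sided one.
print(c(t           = unname(two_sided$statistic),
        p_two_sided = two_sided$p.value,
        p_one_sided = one_sided$p.value))
```

On the real data, the call would be `t.test(df_classic$SALES, df_vintage$SALES, alternative = "greater")`. Since the observed t statistic is positive, the one-sided p-value can only be smaller than the two-sided one reported above.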

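Even with a two-sided t-test, that p-value is very small! It is far lower than \(\alpha\) = .05, and the 95% confidence interval for the difference in means (about 734 to 1124) sits entirely above zero, so I have enough evidence to reject the null hypothesis in favor of Ha: Classic Cars bring in more sales on average than Vintage Cars.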
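With samples this large, a tiny p-value says little about how big the difference is, and the power analysis only assumed d = 0.5. One way to follow up is to compute the observed Cohen's d. The helper below is a minimal sketch using the pooled-standard-deviation definition; `cohens_d` is a name I made up for illustration, and the input vectors are toy numbers rather than the sales data:

```r
# Cohen's d with a pooled standard deviation (classic two-sample definition).
cohens_d <- function(x, y) {
  nx <- length(x)
  ny <- length(y)
  pooled_var <- ((nx - 1) * var(x) + (ny - 1) * var(y)) / (nx + ny - 2)
  (mean(x) - mean(y)) / sqrt(pooled_var)
}

# Tiny hypothetical vectors, chosen so the result is easy to check by hand:
# means 4 and 3, both variances 4, so d = (4 - 3) / 2 = 0.5.
d_toy <- cohens_d(c(2, 4, 6), c(1, 3, 5))
print(d_toy)
```

On the real data this would be `cohens_d(df_classic$SALES, df_vintage$SALES)`. The `effsize` package loaded at the top also provides `cohen.d()` for the same purpose.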