Week 8 Data Dive

Importing libraries

library(tidyverse) 
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(boot)
library(ggthemes)
library(ggrepel)
library(AmesHousing)
library(broom)
library(lindia)

Importing data

df <- read.csv('Auto Sales data.csv')

ANOVA Test

SALES is the most informative column in the data set, even though I don’t know its monetary units.

I wanted to know whether mean SALES differ significantly across product lines.

  • H0: Average sales are equal across all product lines.

  • Ha: Average sales differ for at least one product line.

m <- aov(SALES ~ PRODUCTLINE, data = df)
summary(m)
##               Df    Sum Sq  Mean Sq F value Pr(>F)    
## PRODUCTLINE    6 4.852e+08 80873300   25.18 <2e-16 ***
## Residuals   2740 8.801e+09  3212062                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
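
As a sanity check on the table, the F value and its p-value can be recomputed from the printed mean squares and degrees of freedom. This is only a sketch: the variable names are mine, and the inputs are the rounded values from the summary output, so the result agrees to the printed precision only.

```r
# Values copied from the ANOVA summary output
ms_between <- 80873300   # Mean Sq for the product-line term
ms_within  <- 3212062    # Mean Sq for Residuals
df_between <- 6
df_within  <- 2740

# F is the ratio of between-group to within-group mean squares
f_value <- ms_between / ms_within

# The upper-tail probability of the F distribution gives the p-value
p_value <- pf(f_value, df_between, df_within, lower.tail = FALSE)

print(round(f_value, 2))
print(p_value < 2e-16)
```

The ratio rounds to the 25.18 shown in the table, and the tail probability falls below the 2e-16 floor that R prints.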

(Wow)
With a p-value below 2e-16, we can reject the null hypothesis: it’s very unlikely that average SALES are equal across all product lines. ANOVA alone does not show which lines differ, only that at least one does.
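
One possible follow-up, which is not part of the analysis above, is a Tukey HSD test to see which pairs of groups differ. The sketch below runs on simulated data, not on the auto sales file: the group labels "A", "B", "C" and all of the numbers are made up for illustration.

```r
# Simulated stand-in data (hypothetical groups and values, not the real data set)
set.seed(1)
toy <- data.frame(
  PRODUCTLINE = factor(rep(c("A", "B", "C"), each = 30)),
  SALES = c(rnorm(30, mean = 3000, sd = 500),
            rnorm(30, mean = 3500, sd = 500),
            rnorm(30, mean = 5000, sd = 500))
)

# Same one-way ANOVA form as in the report
fit <- aov(SALES ~ PRODUCTLINE, data = toy)

# Pairwise comparisons with family-wise adjusted p-values
tukey <- TukeyHSD(fit)
print(tukey)
```

Each row of the result is one pair of groups, with the estimated difference in means, a confidence interval, and an adjusted p-value. On the real data the same call would be applied to the fitted `aov` object.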

Linear Regression Model

I fit a simple linear regression of SALES on one of its factors: PRICEEACH.

## model <- lm(SALES ~ PRICEEACH, data = df)
## coef(model)
## SALES = 35.35 * PRICEEACH - 21.27

## Scatterplot + Regression Line
df |>
  ggplot(aes(x=PRICEEACH, y=SALES)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = 'darkblue')
## `geom_smooth()` using formula = 'y ~ x'

The best-fit line is \[ \text{SALES} = 35.35 \times \text{PRICEEACH} - 21.27, \] so the expected total sales increase by 35.35 monetary units for each one-unit increase in the individual price.
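
To make the slope interpretation concrete, here is a small sketch that plugs prices into the reported line. The function name `predict_sales` and the example price of 100 are my own choices for illustration. The two coefficients are the rounded values quoted above.

```r
# Rounded coefficients as reported for the fitted line
intercept <- -21.27
slope     <- 35.35

# Expected SALES for a given unit price (hypothetical helper)
predict_sales <- function(price_each) intercept + slope * price_each

# Expected SALES at a unit price of 100
at_100 <- predict_sales(100)
print(at_100)

# Raising the unit price by 1 changes the prediction by exactly the slope
step <- predict_sales(101) - predict_sales(100)
print(step)
```

Under these coefficients a unit price of 100 gives an expected SALES value of 3513.73, and each extra unit of price adds 35.35.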

Few outliers appear where the individual price of a product is very high; more of them fall in the 75-125 range. That may explain why PRICEEACH was so much higher than MSRP.
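
One common way to make "outlier" concrete is to flag points whose standardized residual from the regression line is large. The sketch below uses simulated data with one planted outlier, not the auto sales file. The cutoff of 3 is a convention I am assuming here, not something taken from the analysis above.

```r
# Simulated stand-in data (hypothetical values, not the real data set)
set.seed(42)
n <- 50
toy <- data.frame(PRICEEACH = runif(n, min = 30, max = 250))
toy$SALES <- 35 * toy$PRICEEACH + rnorm(n, sd = 50)

# Plant one obvious outlier in row 10
toy$SALES[10] <- toy$SALES[10] + 2000

# Fit the same kind of model as in the report
fit <- lm(SALES ~ PRICEEACH, data = toy)

# Standardized residuals: residuals scaled by their estimated standard error
std_resid <- rstandard(fit)

# Flag rows that sit more than 3 standard errors from the line (assumed cutoff)
flagged <- which(abs(std_resid) > 3)
print(toy[flagged, ])
```

On the real data, the PRICEEACH values of the flagged rows would show directly where along the price axis the outliers concentrate.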