---
title: "Data_9_1"
author: "Eric"
date: "2025-04-20"
output: html_document
---

Week 9 Data Dive

Importing libraries

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

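library(dplyr)   # already attached by tidyverse; harmless to repeat
library(ggplot2) # already attached by tidyverse; harmless to repeat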
## From notes
library(ggthemes)
library(ggrepel)
library(AmesHousing)
library(boot)
library(broom)
library(lindia)

Importing data

df <- read.csv('Auto Sales data.csv')

Adding to the Regression Model

This was the linear regression model that I created for Week 8.

df |>
  ggplot(aes(x=PRICEEACH, y=SALES)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = 'darkblue')

## `geom_smooth()` using formula = 'y ~ x'

SALES has another predictor that I want to look at: QUANTITYORDERED!

model <- lm(SALES ~ PRICEEACH + QUANTITYORDERED, data = df)

model$coefficients

##     (Intercept)       PRICEEACH QUANTITYORDERED 
##     -3601.96481        35.11239       102.70302

summary(model)

## 
## Call:
## lm(formula = SALES ~ PRICEEACH + QUANTITYORDERED, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2157.7  -187.3     0.0   186.8  3373.0 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -3601.9648    34.5098  -104.4   <2e-16 ***
## PRICEEACH          35.1124     0.1857   189.1   <2e-16 ***
## QUANTITYORDERED   102.7030     0.7998   128.4   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 409.1 on 2744 degrees of freedom
## Multiple R-squared:  0.9505, Adjusted R-squared:  0.9505 
## F-statistic: 2.637e+04 on 2 and 2744 DF,  p-value: < 2.2e-16

An R² of 0.9505 means the model explains about 95% of the variance in SALES, which is a very strong fit. SALES depends on both PRICEEACH and QUANTITYORDERED after all, so this was expected.

I’m surprised that it isn’t a perfect 1.
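A possible explanation, assuming SALES is computed as roughly PRICEEACH × QUANTITYORDERED: that relationship is multiplicative, and an additive model can only approximate it. The sketch below illustrates the effect on simulated data (not the Auto Sales file; the value ranges and the `r_squared` helper are assumptions made for illustration).

```r
set.seed(42)
n <- 500

# Simulated stand-ins for the real columns (ranges are arbitrary)
sim <- data.frame(
  PRICEEACH = runif(n, 30, 250),
  QUANTITYORDERED = sample(10:60, n, replace = TRUE)
)
sim$SALES <- sim$PRICEEACH * sim$QUANTITYORDERED  # exact product, no noise

# R-squared computed by hand, which avoids summary() warnings on a perfect fit
r_squared <- function(fit) {
  y <- model.response(model.frame(fit))
  1 - sum(residuals(fit)^2) / sum((y - mean(y))^2)
}

# Additive model (same form as the one above) vs. a model on the product term
additive <- lm(SALES ~ PRICEEACH + QUANTITYORDERED, data = sim)
product  <- lm(SALES ~ PRICEEACH:QUANTITYORDERED, data = sim)

r2_additive <- r_squared(additive)  # high, but below 1
r2_product  <- r_squared(product)   # essentially 1
c(additive = r2_additive, product = r2_product)
```

If the real data behaves the same way, the gap between 0.9505 and 1 would come from the model's form rather than from noise in the records.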

So SALES = 35.11 (PRICEEACH) + 102.70 (QUANTITYORDERED) - 3601.96. On average, holding the other predictor constant, a recorded SALES goes up by

35.11 monetary units for each one-unit increase in PRICEEACH
102.70 monetary units for each additional unit ordered.
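As a quick check on this interpretation, the reported coefficients can be plugged in by hand. This is a minimal sketch: the order with PRICEEACH = 100 and QUANTITYORDERED = 35 is a made-up example rather than a row from the data, and `predict_sales` is a hypothetical helper.

```r
# Coefficients copied from the model output above
b0      <- -3601.96481  # intercept
b_price <- 35.11239     # PRICEEACH
b_qty   <- 102.70302    # QUANTITYORDERED

# Hypothetical helper: predicted SALES for a given price and quantity
predict_sales <- function(price, qty) {
  b0 + b_price * price + b_qty * qty
}

# Made-up order: 35 units at 100 each
base <- predict_sales(100, 35)
base

# One more unit at the same price raises the prediction by exactly b_qty
extra_unit <- predict_sales(100, 36) - base
extra_unit
```

The prediction for this made-up order is about 3503.88, close to the plain product 100 × 35 = 3500.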

Diagnostic Plots

Residual vs Fitted Graph

gg_resfitted(model) +
  geom_smooth(se=FALSE)

## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

The spread of the residuals grows as the fitted values increase (and also at the lowest fitted values, if anything under 0 should be considered). This might go against the 2nd assumption of linear regression - ERRORS HAVE CONSTANT VARIANCE ACROSS ALL PREDICTIONS.
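A visual impression of non-constant variance can be backed up with a formal test. The sketch below computes a studentized Breusch-Pagan statistic by hand in base R on simulated data whose error spread grows with the predictor. The data and variable names are assumptions for illustration, and none of this was run on the Auto Sales model. The lmtest package's `bptest()` is a packaged alternative.

```r
set.seed(1)
n <- 500
x <- runif(n, 1, 10)

# Simulated data where the error spread grows with x
y <- 2 + 3 * x + rnorm(n, sd = x)
fit <- lm(y ~ x)

# Studentized Breusch-Pagan test by hand:
# regress the squared residuals on the predictor; under constant variance,
# n * R^2 of that auxiliary regression is approximately chi-squared
e2  <- residuals(fit)^2
aux <- lm(e2 ~ x)

bp_stat <- n * summary(aux)$r.squared
bp_p    <- pchisq(bp_stat, df = 1, lower.tail = FALSE)  # df = number of predictors
c(statistic = bp_stat, p_value = bp_p)
```

A small p-value is evidence against constant variance. For the model in this report, the auxiliary regression would use both PRICEEACH and QUANTITYORDERED, with `df = 2`.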

At least the dots are closely packed together.

Residual Histogram

gg_reshist(model)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

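The distribution is about as normal as I could’ve expected. That supports the assumption that the errors are normally distributed, but it says nothing about the 2nd assumption (constant variance); the residual vs fitted graph is the check for that, so the concern above still stands.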
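A Q-Q plot and a Shapiro-Wilk test are common complements to the histogram when judging normality. The sketch below uses simulated data with normal errors (an assumption for illustration, not the Auto Sales data) and base R only.

```r
set.seed(7)
n <- 300
x <- runif(n, 0, 10)
y <- 1 + 2 * x + rnorm(n)  # simulated data with normal errors

fit <- lm(y ~ x)
res <- residuals(fit)

# Q-Q plot: points close to the reference line suggest roughly normal residuals
qqnorm(res)
qqline(res)

# Shapiro-Wilk test (accepts between 3 and 5000 observations)
sw <- shapiro.test(res)
sw$statistic  # W close to 1 is consistent with normality
sw$p.value
```

For the model in this report, `residuals(model)` would take the place of `res`; its 2747 observations (2744 residual degrees of freedom plus 3 coefficients) fall within the range `shapiro.test()` accepts.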