I am importing the libraries needed to run these notes.

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(readr)

my_data_3 <- read_delim("C:/Users/Surya CST/Documents/CSV_files/Bundy_Shoe_Shop.csv",delim=",",show_col_types = FALSE)

## New names:
## • `` -> `...2`
## • `` -> `...3`
## • `` -> `...4`
## • `` -> `...5`
## • `` -> `...6`
## • `` -> `...7`
## • `` -> `...8`
## • `` -> `...9`
## • `` -> `...10`
## • `` -> `...11`
## • `` -> `...12`
## • `` -> `...13`
## • `` -> `...14`

## Warning: One or more parsing issues, call `problems()` on your data frame for details,
## e.g.:
##   dat <- vroom(...)
##   problems(dat)

# Preview the first rows of the raw data frame
head(my_data_3)

## # A tibble: 6 × 14
##   Inferential statistics…¹ ...2  ...3   ...4 ...5  ...6   ...7 ...8   ...9 ...10
##   <chr>                    <chr> <chr> <dbl> <chr> <chr> <dbl> <chr> <dbl> <chr>
## 1 Al Bundy's shoe shop     <NA>  <NA>     NA <NA>  <NA>   NA   <NA>   NA    <NA>
## 2 <NA>                     <NA>  <NA>     NA <NA>  <NA>   NA   <NA>   NA    <NA>
## 3 InvoiceNo                Date  Coun…    NA Shop  Gend…  NA   Size…  NA   "Uni…
## 4 52389                    1/1/… Unit…  2152 UK2   Male   11   44     10.5 "$15…
## 5 52390                    1/1/… Unit…  2230 US15  Male   11.5 44-45  11   "$19…
## 6 52391                    1/1/… Cana…  2160 CAN7  Male    9.5 42-43   9   "$14…
## # ℹ abbreviated name: ¹`Inferential statistics. Confidence intervals`
## # ℹ 4 more variables: ...11 <chr>, ...12 <dbl>, ...13 <dbl>, ...14 <chr>
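The preview above shows that the file starts with title lines before the real header row, which is why readr invents names such as ...2 and reports parsing issues. One possible alternative is to skip those lines at read time. The sketch below uses a small made-up inline sample (the three columns and the prices are hypothetical, not taken from the real file) and assumes the real header sits on the fourth line.

```r
library(readr)

# Hypothetical miniature of the file layout: title lines first, real header on line 4.
# The values are invented for illustration only.
raw_text <- paste(
  "Inferential statistics. Confidence intervals,,",
  "Al Bundy's shoe shop,,",
  ",,",
  "InvoiceNo,Date,SalePrice",
  "52389,1/1/2014,$100.00",
  "52390,1/1/2014,$120.00",
  sep = "\n"
)

# skip = 3 drops the three leading lines so line 4 becomes the header.
# I() tells readr to treat the string as literal data rather than a file path.
demo <- read_delim(I(raw_text), delim = ",", skip = 3, show_col_types = FALSE)

print(demo)
```

If the real file follows the same layout, passing skip = 3 to the original read_delim() call might make the later renaming and row-dropping steps unnecessary, but that is an assumption to check against the file.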

Cleaning my dataset

I am renaming the columns from the placeholder names readr assigned (...2, ...3, and so on) to descriptive names (InvoiceNo, Date, etc.) to make the data simpler and clearer.

new_names <- c("InvoiceNo", "Date", "Country", "ProductID", "Shop", "Gender",
               "Size(US)", "Size (Europe)", "Size (UK)", "UnitPrice",
               "Discount", "Year", "Month", "SalePrice")

# Assign the new column names to the data frame
colnames(my_data_3) <- new_names

# Verify that the column names have been changed
colnames(my_data_3)

##  [1] "InvoiceNo"     "Date"          "Country"       "ProductID"    
##  [5] "Shop"          "Gender"        "Size(US)"      "Size (Europe)"
##  [9] "Size (UK)"     "UnitPrice"     "Discount"      "Year"         
## [13] "Month"         "SalePrice"

I am removing the first 3 rows, which hold only missing values and leftover title and header text rather than data.

my_data_3 <- my_data_3[-c(1:3), ]


# Print the modified data frame
print(my_data_3)

## # A tibble: 14,967 × 14
##    InvoiceNo Date     Country  ProductID Shop  Gender `Size(US)` `Size (Europe)`
##    <chr>     <chr>    <chr>        <dbl> <chr> <chr>       <dbl> <chr>          
##  1 52389     1/1/2014 United …      2152 UK2   Male         11   44             
##  2 52390     1/1/2014 United …      2230 US15  Male         11.5 44-45          
##  3 52391     1/1/2014 Canada        2160 CAN7  Male          9.5 42-43          
##  4 52392     1/1/2014 United …      2234 US6   Female        9.5 40             
##  5 52393     1/1/2014 United …      2222 UK4   Female        9   39-40          
##  6 52394     1/1/2014 United …      2173 US15  Male         10.5 43-44          
##  7 52395     1/2/2014 Germany       2200 GER2  Female        9   39-40          
##  8 52396     1/2/2014 Canada        2238 CAN5  Male         10   43             
##  9 52397     1/2/2014 United …      2191 US13  Male         10.5 43-44          
## 10 52398     1/2/2014 United …      2237 UK1   Female        9   39-40          
## # ℹ 14,957 more rows
## # ℹ 6 more variables: `Size (UK)` <dbl>, UnitPrice <chr>, Discount <chr>,
## #   Year <dbl>, Month <dbl>, SalePrice <chr>

# Remove '$' from the SalePrice column and convert it to numeric
my_data_3$SalePrice <- gsub("\\$", "", my_data_3$SalePrice)
my_data_3$SalePrice <- as.numeric(my_data_3$SalePrice)

class(my_data_3$SalePrice)

## [1] "numeric"

# Remove '$' from the UnitPrice column and convert it to numeric
my_data_3$UnitPrice <- gsub("\\$", "", my_data_3$UnitPrice)
my_data_3$UnitPrice <- as.numeric(my_data_3$UnitPrice)

# Remove '%' from the Discount column
my_data_3$Discount <- gsub("%", "", my_data_3$Discount)
my_data_3$Discount <- as.numeric(my_data_3$Discount)

head(my_data_3$Discount)

## [1]  0 20 20  0  0  0

class(my_data_3$UnitPrice)

## [1] "numeric"

class(my_data_3$Discount)

## [1] "numeric"
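The three gsub() and as.numeric() pairs above follow the same pattern, so they could be collapsed into one step. A minimal sketch, assuming the columns contain only a currency or percent symbol around the number; it uses readr::parse_number() on a small made-up tibble (the values are hypothetical) rather than on my_data_3.

```r
library(dplyr)
library(readr)

# Hypothetical sample with the same formatting as the raw columns.
prices <- tibble(
  UnitPrice = c("$150.00", "$200.00"),
  Discount  = c("0%", "20%"),
  SalePrice = c("$150.00", "$160.00")
)

# parse_number() drops non-numeric characters around the number,
# so one call handles both the '$' prefix and the '%' suffix.
cleaned <- prices %>%
  mutate(across(c(UnitPrice, Discount, SalePrice), parse_number))

print(cleaned)
```

This is a swapped-in technique, not what the notes above do; the gsub() approach gives the same result when the symbols are the only non-numeric characters.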

1) Creating a linear regression model:

I have chosen SalePrice as the response variable, with Country, Gender, Size (US), UnitPrice, and Discount as explanatory variables. I picked SalePrice because it is the most important figure in the shoe sales industry.

model <- lm(SalePrice ~ Country + Gender + `Size(US)` + UnitPrice + Discount, data = my_data_3)

# Summarize the model
summary(model)

## 
## Call:
## lm(formula = SalePrice ~ Country + Gender + `Size(US)` + UnitPrice + 
##     Discount, data = my_data_3)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.6639  -2.0270   0.1277   1.8821  12.6896 
## 
## Coefficients:
##                        Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)           20.513334   0.319420   64.221   <2e-16 ***
## CountryGermany         0.013253   0.092783    0.143    0.886    
## CountryUnited Kingdom  0.005568   0.117888    0.047    0.962    
## CountryUnited States  -0.001138   0.087966   -0.013    0.990    
## GenderMale             0.058607   0.073754    0.795    0.427    
## `Size(US)`             0.014393   0.023946    0.601    0.548    
## UnitPrice              0.874289   0.001391  628.440   <2e-16 ***
## Discount              -1.631286   0.001875 -869.936   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.898 on 14959 degrees of freedom
## Multiple R-squared:  0.9877, Adjusted R-squared:  0.9877 
## F-statistic: 1.72e+05 on 7 and 14959 DF,  p-value: < 2.2e-16

# Make predictions
predictions <- predict(model, newdata = my_data_3)

# Evaluate the model (e.g., calculate R-squared)
rsquared <- cor(predictions, my_data_3$SalePrice)^2
cat("R-squared (R^2): ", rsquared, "\n")

## R-squared (R^2):  0.9877313
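The squared correlation between predictions and observed values matches the Multiple R-squared in the summary because, for a least-squares fit with an intercept evaluated on its own training data, the two quantities are equal. The sketch below checks this on the built-in mtcars data set, which is a stand-in and not the shoe shop data.

```r
# Stand-in model on a built-in data set (not the shoe shop data).
fit <- lm(mpg ~ wt + hp, data = mtcars)
pred <- predict(fit, newdata = mtcars)

# 1. Squared correlation between predictions and observed values.
r2_cor <- cor(pred, mtcars$mpg)^2

# 2. Definition: 1 - residual sum of squares / total sum of squares.
ss_res <- sum((mtcars$mpg - pred)^2)
ss_tot <- sum((mtcars$mpg - mean(mtcars$mpg))^2)
r2_ss <- 1 - ss_res / ss_tot

# 3. Value reported by summary().
r2_summary <- summary(fit)$r.squared

print(c(cor = r2_cor, sums_of_squares = r2_ss, summary = r2_summary))
```

Because the predictions are made on the same rows used for fitting, this is an in-sample measure; it does not say how the model would perform on new invoices.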

2) Tools to diagnose the linear model:

Residual analysis of the regression model:

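plot(model$fitted.values, model$residuals, xlab = "Fitted values", ylab = "Residuals")

The above plot suggests that the model is doing a good job of estimating sale prices for shoes. The residuals are scattered around zero without a strong pattern, which is a good sign in regression analysis: it is consistent with the linearity and constant-variance assumptions of the model.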
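Base R also ships ready-made diagnostic plots for lm objects, which add a smoothed trend line and label the most extreme points. A minimal sketch on the built-in mtcars data (a stand-in, not the shoe shop model):

```r
# Stand-in model on a built-in data set (not the shoe shop data).
fit <- lm(mpg ~ wt + hp, data = mtcars)

# which = 1: residuals against fitted values, with a smoothed trend line.
plot(fit, which = 1)

# which = 2: normal Q-Q plot of the standardized residuals.
plot(fit, which = 2)
```

Calling plot(model, which = 1) on the model from these notes should produce the same kind of residual plot with fitted values on the horizontal axis.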

Diagnosing the regression model with Cook's distance:

plot(cooks.distance(model))

The Cook's distance plot suggests that our model is relatively robust to individual data points: no observation stands out as an extreme outlier or as influential enough to disproportionately affect the model's results.
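Reading a Cook's distance plot by eye can be backed up with a numeric cutoff. The sketch below uses 4/n, a common rule of thumb rather than a formal test, and runs on the built-in mtcars data as a stand-in for the shoe shop model.

```r
# Stand-in model on a built-in data set (not the shoe shop data).
fit <- lm(mpg ~ wt + hp, data = mtcars)

cd <- cooks.distance(fit)

# Rule-of-thumb cutoff (an assumption, other cutoffs such as 1 are also used).
threshold <- 4 / nrow(mtcars)

# Observations whose Cook's distance exceeds the cutoff.
influential <- which(cd > threshold)
print(influential)

# Spike plot with the cutoff drawn as a dashed line.
plot(cd, type = "h", ylab = "Cook's distance")
abline(h = threshold, lty = 2)
```

Flagged points are candidates for a closer look, not automatic removals.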

3) Interpretation of Discount column:

The coefficient for “Discount” is about -1.6313. Because Discount is measured in percentage points, this means that raising the discount by one percentage point, while holding the other variables fixed, lowers the expected SalePrice by approximately 1.63 units of the price's currency. In simple terms, higher discounts lead to lower prices for the shoes, which makes intuitive sense.
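The "holding the other variables fixed" reading can be checked directly: predict for two rows that differ only in Discount by one point, and the difference equals the coefficient. The sketch below uses simulated data with a made-up pricing rule (an assumption for illustration, not the real data set), so its coefficient will not match the -1.6313 above.

```r
set.seed(1)

# Simulated data; the pricing rule below is hypothetical.
n <- 200
sim <- data.frame(
  UnitPrice = runif(n, 120, 200),
  Discount  = sample(c(0, 10, 20, 30), n, replace = TRUE)
)
sim$SalePrice <- sim$UnitPrice * (1 - sim$Discount / 100)

fit <- lm(SalePrice ~ UnitPrice + Discount, data = sim)

# Two hypothetical invoices that differ only by one discount point.
base     <- data.frame(UnitPrice = 150, Discount = 10)
plus_one <- data.frame(UnitPrice = 150, Discount = 11)

effect <- unname(predict(fit, plus_one) - predict(fit, base))

# The two printed numbers agree: the coefficient is the per-point change.
print(c(difference = effect, coefficient = unname(coef(fit)["Discount"])))
```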

R_stat_GLM_Part_2

Surya

2023-11-05