I am importing the libraries needed to run these notes.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.3 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(readr)
my_data_3 <- read_delim("C:/Users/Surya CST/Documents/CSV_files/Bundy_Shoe_Shop.csv",delim=",",show_col_types = FALSE)
## New names:
## • `` -> `...2`
## • `` -> `...3`
## • `` -> `...4`
## • `` -> `...5`
## • `` -> `...6`
## • `` -> `...7`
## • `` -> `...8`
## • `` -> `...9`
## • `` -> `...10`
## • `` -> `...11`
## • `` -> `...12`
## • `` -> `...13`
## • `` -> `...14`
## Warning: One or more parsing issues, call `problems()` on your data frame for details,
## e.g.:
## dat <- vroom(...)
## problems(dat)
# Preview the first rows of the raw import
head(my_data_3)
## # A tibble: 6 × 14
## Inferential statistics…¹ ...2 ...3 ...4 ...5 ...6 ...7 ...8 ...9 ...10
## <chr> <chr> <chr> <dbl> <chr> <chr> <dbl> <chr> <dbl> <chr>
## 1 Al Bundy's shoe shop <NA> <NA> NA <NA> <NA> NA <NA> NA <NA>
## 2 <NA> <NA> <NA> NA <NA> <NA> NA <NA> NA <NA>
## 3 InvoiceNo Date Coun… NA Shop Gend… NA Size… NA "Uni…
## 4 52389 1/1/… Unit… 2152 UK2 Male 11 44 10.5 "$15…
## 5 52390 1/1/… Unit… 2230 US15 Male 11.5 44-45 11 "$19…
## 6 52391 1/1/… Cana… 2160 CAN7 Male 9.5 42-43 9 "$14…
## # ℹ abbreviated name: ¹`Inferential statistics. Confidence intervals`
## # ℹ 4 more variables: ...11 <chr>, ...12 <dbl>, ...13 <dbl>, ...14 <chr>
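The preview above shows title text and an empty row sitting ahead of the real header, which is why readr invents names such as ...2. A minimal sketch of one alternative, assuming the file always starts with exactly three such lines, is to skip them at read time with the skip argument. The raw_text value below is a hypothetical miniature of the file, not the real data.

```r
library(readr)

# Hypothetical miniature of the file layout: two title lines, one
# comma-only line, then the real header and two data rows.
raw_text <- I(paste(
  "Inferential statistics. Confidence intervals,,",
  "Al Bundy's shoe shop,,",
  ",,",
  "InvoiceNo,Date,Country",
  "52389,1/1/2014,United Kingdom",
  "52390,1/1/2014,United States",
  sep = "\n"
))

# skip = 3 drops the three leading lines, so the fourth line becomes the header
shoes <- read_delim(raw_text, delim = ",", skip = 3, show_col_types = FALSE)
names(shoes)
```

If the number of leading lines differs in the real file, the skip value would need to change accordingly.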
I am renaming the columns from the auto-generated names (...2, ...3, and so on) to descriptive ones (InvoiceNo, Date, and so on) to make the data simpler and clearer.
new_names <- c("InvoiceNo", "Date", "Country", "ProductID", "Shop", "Gender",
               "Size(US)", "Size (Europe)", "Size (UK)", "UnitPrice",
               "Discount", "Year", "Month", "SalePrice")
# Assign the new column names to the data frame
colnames(my_data_3) <- new_names
# Verify that the column names have been changed
colnames(my_data_3)
## [1] "InvoiceNo" "Date" "Country" "ProductID"
## [5] "Shop" "Gender" "Size(US)" "Size (Europe)"
## [9] "Size (UK)" "UnitPrice" "Discount" "Year"
## [13] "Month" "SalePrice"
I am removing the first 3 rows, which hold only missing values and the leftover title and header text, so that the data set starts at the first invoice.
my_data_3 <- my_data_3[-c(1:3), ]
# Print the modified data frame
print(my_data_3)
## # A tibble: 14,967 × 14
## InvoiceNo Date Country ProductID Shop Gender `Size(US)` `Size (Europe)`
## <chr> <chr> <chr> <dbl> <chr> <chr> <dbl> <chr>
## 1 52389 1/1/2014 United … 2152 UK2 Male 11 44
## 2 52390 1/1/2014 United … 2230 US15 Male 11.5 44-45
## 3 52391 1/1/2014 Canada 2160 CAN7 Male 9.5 42-43
## 4 52392 1/1/2014 United … 2234 US6 Female 9.5 40
## 5 52393 1/1/2014 United … 2222 UK4 Female 9 39-40
## 6 52394 1/1/2014 United … 2173 US15 Male 10.5 43-44
## 7 52395 1/2/2014 Germany 2200 GER2 Female 9 39-40
## 8 52396 1/2/2014 Canada 2238 CAN5 Male 10 43
## 9 52397 1/2/2014 United … 2191 US13 Male 10.5 43-44
## 10 52398 1/2/2014 United … 2237 UK1 Female 9 39-40
## # ℹ 14,957 more rows
## # ℹ 6 more variables: `Size (UK)` <dbl>, UnitPrice <chr>, Discount <chr>,
## # Year <dbl>, Month <dbl>, SalePrice <chr>
# Remove '$' from the SalePrice column
my_data_3$SalePrice <- gsub("\\$", "", my_data_3$SalePrice)
my_data_3$SalePrice <- as.numeric(my_data_3$SalePrice)
class(my_data_3$SalePrice)
## [1] "numeric"
# Remove '$' from the UnitPrice column
my_data_3$UnitPrice <- gsub("\\$", "", my_data_3$UnitPrice)
my_data_3$UnitPrice <- as.numeric(my_data_3$UnitPrice)
# Remove '%' from the Discount column
my_data_3$Discount <- gsub("%", "", my_data_3$Discount)
my_data_3$Discount <- as.numeric(my_data_3$Discount)
head(my_data_3$Discount)
## [1] 0 20 20 0 0 0
class(my_data_3$UnitPrice)
## [1] "numeric"
class(my_data_3$Discount)
## [1] "numeric"
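The gsub() and as.numeric() pairs above can also be written in one step. As a minimal sketch, readr's parse_number() drops non-numeric characters such as '$' and '%' around a number. The sample vectors below are hypothetical values shaped like the raw columns, not rows taken from the file.

```r
library(readr)

# Hypothetical sample values shaped like the raw UnitPrice and Discount columns
prices    <- c("$159.00", "$199.00", "$149.00")
discounts <- c("0%", "20%", "20%")

# parse_number() strips the currency and percent symbols and returns doubles
unit_price <- parse_number(prices)
discount   <- parse_number(discounts)

class(unit_price)
class(discount)
```

On the real data frame, the same call could replace each gsub()/as.numeric() pair, for example my_data_3$UnitPrice <- parse_number(my_data_3$UnitPrice).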
I treat SalePrice as the response variable, with Country, Gender, Size (US), UnitPrice, and Discount as explanatory variables. I chose SalePrice because it is the figure that matters most in shoe sales.
model <- lm(SalePrice ~ Country + Gender + `Size(US)` + UnitPrice + Discount, data = my_data_3)
# Summarize the model
summary(model)
##
## Call:
## lm(formula = SalePrice ~ Country + Gender + `Size(US)` + UnitPrice +
## Discount, data = my_data_3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.6639 -2.0270 0.1277 1.8821 12.6896
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 20.513334 0.319420 64.221 <2e-16 ***
## CountryGermany 0.013253 0.092783 0.143 0.886
## CountryUnited Kingdom 0.005568 0.117888 0.047 0.962
## CountryUnited States -0.001138 0.087966 -0.013 0.990
## GenderMale 0.058607 0.073754 0.795 0.427
## `Size(US)` 0.014393 0.023946 0.601 0.548
## UnitPrice 0.874289 0.001391 628.440 <2e-16 ***
## Discount -1.631286 0.001875 -869.936 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.898 on 14959 degrees of freedom
## Multiple R-squared: 0.9877, Adjusted R-squared: 0.9877
## F-statistic: 1.72e+05 on 7 and 14959 DF, p-value: < 2.2e-16
# Make predictions
predictions <- predict(model, newdata = my_data_3)
# Evaluate the model (e.g., calculate R-squared)
rsquared <- cor(predictions, my_data_3$SalePrice)^2
cat("R-squared (R^2): ", rsquared, "\n")
## R-squared (R^2): 0.9877313
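The value computed above matches the Multiple R-squared in the model summary. For a least-squares fit with an intercept, the squared correlation between fitted and observed values equals R-squared. A minimal sketch of that check, using R's built-in mtcars data as a stand-in for the shoe data:

```r
# Stand-in model on a built-in data set (not the shoe data)
fit <- lm(mpg ~ wt + hp, data = mtcars)

# R-squared computed two ways
r2_from_cor     <- cor(fitted(fit), mtcars$mpg)^2
r2_from_summary <- summary(fit)$r.squared

# TRUE when the two agree up to floating-point tolerance
all.equal(r2_from_cor, r2_from_summary)
```

Because the predictions here are made on the same rows used for fitting, this is an in-sample measure; it says nothing about performance on new data.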
# Residuals vs. fitted values: fitted values on the x-axis, residuals on the y-axis
plot(model$fitted.values, model$residuals)
The plot above suggests that the model estimates shoe sale prices well. The residuals cluster around zero without a strong pattern, which is a good sign in regression analysis: it is consistent with the linearity and constant-variance assumptions.
plot(cooks.distance(model))
The Cook's distance plot suggests that the model is relatively robust to individual data points: no extreme outliers or influential observations disproportionately affect its results.
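A visual reading of the Cook's distance plot can be backed up with a numeric check. The sketch below uses one commonly cited rule of thumb, flagging observations whose Cook's distance exceeds 4/n; this cutoff is an assumption of the example, not something the analysis above relies on, and mtcars again stands in for the shoe data.

```r
# Stand-in model on a built-in data set (not the shoe data)
fit <- lm(mpg ~ wt + hp, data = mtcars)

cooks_d   <- cooks.distance(fit)
threshold <- 4 / nrow(mtcars)   # rule-of-thumb cutoff, assumed for illustration

# Indices (and row names) of observations above the cutoff
influential <- which(cooks_d > threshold)
influential
```

An empty result would support the reading that no single observation dominates the fit; flagged rows are candidates for a closer look rather than automatic removal.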
The coefficient for Discount is about -1.6313. Holding the other variables constant, each one-point increase in the discount percentage is associated with a decrease of about 1.63 in the average SalePrice (in dollars). In simple terms, higher discounts go with lower sale prices, which makes intuitive sense.
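As a small worked example of this interpretation, the coefficient can be scaled to a larger change in discount. The sketch below copies the Discount estimate from the summary output and assumes the other predictors stay fixed.

```r
# Discount coefficient copied from the model summary above
discount_coef <- -1.631286

# Predicted change in SalePrice for a 10-point increase in Discount,
# holding the other predictors fixed
change_for_10_points <- discount_coef * 10
change_for_10_points
```

Under the fitted model, a 10-point increase in the discount corresponds to a predicted drop of about 16.31 in SalePrice.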