I am importing the libraries needed to run these notes.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.3 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(readr)  # already attached by library(tidyverse); harmless but redundant
my_data_3 <- read_delim("C:/Users/Surya CST/Documents/CSV_files/Bundy_Shoe_Shop.csv", delim = ",", show_col_types = FALSE)
## New names:
## • `` -> `...2`
## • `` -> `...3`
## • `` -> `...4`
## • `` -> `...5`
## • `` -> `...6`
## • `` -> `...7`
## • `` -> `...8`
## • `` -> `...9`
## • `` -> `...10`
## • `` -> `...11`
## • `` -> `...12`
## • `` -> `...13`
## • `` -> `...14`
## Warning: One or more parsing issues, call `problems()` on your data frame for details,
## e.g.:
## dat <- vroom(...)
## problems(dat)
# Preview the first rows of the raw import
head(my_data_3)
## # A tibble: 6 × 14
## Inferential statistics…¹ ...2 ...3 ...4 ...5 ...6 ...7 ...8 ...9 ...10
## <chr> <chr> <chr> <dbl> <chr> <chr> <dbl> <chr> <dbl> <chr>
## 1 Al Bundy's shoe shop <NA> <NA> NA <NA> <NA> NA <NA> NA <NA>
## 2 <NA> <NA> <NA> NA <NA> <NA> NA <NA> NA <NA>
## 3 InvoiceNo Date Coun… NA Shop Gend… NA Size… NA "Uni…
## 4 52389 1/1/… Unit… 2152 UK2 Male 11 44 10.5 "$15…
## 5 52390 1/1/… Unit… 2230 US15 Male 11.5 44-45 11 "$19…
## 6 52391 1/1/… Cana… 2160 CAN7 Male 9.5 42-43 9 "$14…
## # ℹ abbreviated name: ¹`Inferential statistics. Confidence intervals`
## # ℹ 4 more variables: ...11 <chr>, ...12 <dbl>, ...13 <dbl>, ...14 <chr>
I am renaming the columns from the auto-generated names (...2, ...3, and so on) to descriptive names (InvoiceNo, Date, etc.) to make the data simpler and clearer.
new_names <- c("InvoiceNo", "Date", "Country", "ProductID", "Shop", "Gender",
               "Size(US)", "Size (Europe)", "Size (UK)", "UnitPrice", "Discount",
               "Year", "Month", "SalePrice")
# Assign the new column names to the data frame
colnames(my_data_3) <- new_names
# Verify that the column names have been changed
colnames(my_data_3)
## [1] "InvoiceNo" "Date" "Country" "ProductID"
## [5] "Shop" "Gender" "Size(US)" "Size (Europe)"
## [9] "Size (UK)" "UnitPrice" "Discount" "Year"
## [13] "Month" "SalePrice"
I am removing the first three rows, which hold the title text, an empty row, and the original header row, so that only data rows remain in my data set.
my_data_3 <- my_data_3[-c(1:3), ]
# Print the modified data frame
print(my_data_3)
## # A tibble: 14,967 × 14
## InvoiceNo Date Country ProductID Shop Gender `Size(US)` `Size (Europe)`
## <chr> <chr> <chr> <dbl> <chr> <chr> <dbl> <chr>
## 1 52389 1/1/2014 United … 2152 UK2 Male 11 44
## 2 52390 1/1/2014 United … 2230 US15 Male 11.5 44-45
## 3 52391 1/1/2014 Canada 2160 CAN7 Male 9.5 42-43
## 4 52392 1/1/2014 United … 2234 US6 Female 9.5 40
## 5 52393 1/1/2014 United … 2222 UK4 Female 9 39-40
## 6 52394 1/1/2014 United … 2173 US15 Male 10.5 43-44
## 7 52395 1/2/2014 Germany 2200 GER2 Female 9 39-40
## 8 52396 1/2/2014 Canada 2238 CAN5 Male 10 43
## 9 52397 1/2/2014 United … 2191 US13 Male 10.5 43-44
## 10 52398 1/2/2014 United … 2237 UK1 Female 9 39-40
## # ℹ 14,957 more rows
## # ℹ 6 more variables: `Size (UK)` <dbl>, UnitPrice <chr>, Discount <chr>,
## # Year <dbl>, Month <dbl>, SalePrice <chr>
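As an alternative to dropping the leading rows after import, the junk lines can be skipped while reading, so the real header row supplies the column names directly. The sketch below is only an illustration: it uses base R's read.csv on a small in-memory stand-in for the file (the two data rows and their prices are made up), and it assumes the real file has exactly three lines before the header. read_delim() also accepts a skip argument, so the same idea should carry over to the import used in these notes.

```r
# Toy stand-in for the top of the CSV file (values are illustrative only)
raw_text <- paste(c(
  "Inferential statistics. Confidence intervals,,,",
  "Al Bundy's shoe shop,,,",
  ",,,",
  "InvoiceNo,Date,Country,SalePrice",
  "52389,1/1/2014,United Kingdom,$100.00",
  "52390,1/1/2014,United States,$120.00"
), collapse = "\n")

# Skip the three non-data lines so the fourth line is used as the header
toy <- read.csv(text = raw_text, skip = 3, check.names = FALSE,
                stringsAsFactors = FALSE)

names(toy)
nrow(toy)
```

Reading this way also lets the parser guess column types from data rows only, which may avoid the parsing warning shown earlier.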
# Remove '$' from the SalePrice column and convert it to numeric
my_data_3$SalePrice <- gsub("\\$", "", my_data_3$SalePrice)
my_data_3$SalePrice <- as.numeric(my_data_3$SalePrice)
class(my_data_3$SalePrice)
## [1] "numeric"
# Remove '$' from the UnitPrice column
my_data_3$UnitPrice <- gsub("\\$", "", my_data_3$UnitPrice)
my_data_3$UnitPrice <- as.numeric(my_data_3$UnitPrice)
# Remove '%' from the Discount column
my_data_3$Discount <- gsub("%", "", my_data_3$Discount)
my_data_3$Discount <- as.numeric(my_data_3$Discount)
head(my_data_3$Discount)
## [1] 0 20 20 0 0 0
class(my_data_3$UnitPrice)
## [1] "numeric"
class(my_data_3$Discount)
## [1] "numeric"
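The three conversions above follow the same pattern: strip a symbol, then call as.numeric. A minimal sketch of that pattern as one reusable function is shown below. The helper name to_number and the sample values are hypothetical, and the comma handling is an added assumption for prices with thousands separators (which would otherwise become NA).

```r
# Hypothetical helper: drop '$', '%' and ',' characters, then convert to numeric
to_number <- function(x) {
  cleaned <- gsub("[$,%]", "", x)
  as.numeric(trimws(cleaned))
}

prices <- c("$159.00", "$1,200.50", " $99 ")   # made-up sample values
discounts <- c("0%", "20%", "50%")

to_number(prices)
to_number(discounts)

# Count entries that could not be converted (as.numeric returns NA for these)
n_failed <- sum(is.na(suppressWarnings(to_number(c("$10", "n/a")))))
n_failed
```

Checking the NA count after conversion is a quick way to confirm that no price was silently lost.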
For my Bundy’s Shoe sales dataset, I am using SalePrice as my response variable, because it directly drives the shop’s revenue.
I have selected Country as my categorical explanatory variable, because the country of sale could influence the price at which a shoe is sold.
Null Hypothesis (H0): The mean sale price is the same in every country.
Alternative Hypothesis (H1): The mean sale price differs for at least one country.
model <- aov(SalePrice ~ Country, data = my_data_3)
anova_result <- anova(model)
print(anova_result)
## Analysis of Variance Table
##
## Response: SalePrice
## Df Sum Sq Mean Sq F value Pr(>F)
## Country 3 5323 1774.5 1.4338 0.2307
## Residuals 14963 18517924 1237.6
Based on the ANOVA test, the Pr(>F) value is 0.23, which is well above the 0.05 significance level, so I do not have enough evidence to reject the null hypothesis. The data do not show a statistically significant difference in mean SalePrice across the four countries in the dataset. This is not proof that the means are equal, only that this test did not detect a difference.
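To see where the F value and p-value come from, they can be recomputed from the sums of squares and degrees of freedom printed in the ANOVA table. The sketch below copies those four numbers by hand (so small rounding differences are expected) and also computes eta squared, a simple effect-size measure that is not part of the original output.

```r
# Values copied from the printed ANOVA table
ss_country <- 5323
ss_resid   <- 18517924
df_country <- 3
df_resid   <- 14963

# F is the ratio of the two mean squares
f_value <- (ss_country / df_country) / (ss_resid / df_resid)

# Upper-tail probability of the F distribution gives Pr(>F)
p_value <- pf(f_value, df_country, df_resid, lower.tail = FALSE)

# Eta squared: share of total variation attributed to Country
eta_sq <- ss_country / (ss_country + ss_resid)

round(c(F = f_value, p = p_value, eta_squared = eta_sq), 4)
```

Eta squared comes out well under 0.1%, which agrees with the non-significant test: Country accounts for almost none of the variation in SalePrice.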
I have selected the US shoe size, Size(US), as my continuous explanatory variable and built a linear regression model with SalePrice as the response variable.
model <- lm(SalePrice ~ `Size(US)`, data = my_data_3)
Null Hypothesis (H0): There is no linear relationship between Size(US) and SalePrice (the slope is zero).
Alternative Hypothesis (H1): There is a linear relationship between Size(US) and SalePrice (the slope is not zero).
summary(model)
##
## Call:
## lm(formula = SalePrice ~ `Size(US)`, data = my_data_3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -79.625 -18.895 4.949 25.079 55.228
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 144.33007 1.77283 81.412 <2e-16 ***
## `Size(US)` -0.03721 0.19024 -0.196 0.845
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 35.18 on 14965 degrees of freedom
## Multiple R-squared: 2.556e-06, Adjusted R-squared: -6.427e-05
## F-statistic: 0.03826 on 1 and 14965 DF, p-value: 0.8449
From the regression summary, the p-value for the Size(US) coefficient is 0.845, which is much greater than 0.05. I therefore fail to reject the null hypothesis: this model gives no evidence of a linear relationship between Size(US) and SalePrice. The R-squared of about 2.6e-06 confirms that shoe size explains essentially none of the variation in sale price.
# Diagnostic plots
par(mfrow = c(2, 2))
plot(model)
In this model, the coefficient for Size(US) is -0.03721. This means that for every one-unit increase in US shoe size, the model predicts a decrease of about 0.037 in the sale price (roughly 4 cents, since prices are in dollars). However, since this coefficient is not statistically significant, the interpretation has no practical value. In the future, I may consider other continuous variables that could have an influence on SalePrice.
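A 95% confidence interval for the slope makes the "not significant" conclusion more concrete. The sketch below rebuilds the interval by hand from the estimate, standard error, and residual degrees of freedom printed in the regression summary; calling confint() on the fitted model object should give the same interval up to rounding.

```r
# Values copied from the printed regression summary
estimate  <- -0.03721
std_error <- 0.19024
df_resid  <- 14965

# Two-sided 95% interval: estimate +/- t critical value * standard error
t_crit <- qt(0.975, df_resid)
ci <- estimate + c(-1, 1) * t_crit * std_error

round(ci, 3)
```

The interval runs from a negative to a positive value, so it contains zero, which is consistent with the p-value of 0.845.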
To improve the regression model, I’ll include “Country” as a categorical variable. I’ll also include an interaction term between “Size(US)” and “Country” to allow the relationship between size and sale price to differ from country to country. Let’s build this regression model and evaluate its performance.
# Linear regression model with Size(US), Country, and their interaction
model <- lm(SalePrice ~ `Size(US)` * Country, data = my_data_3)
summary(model)
##
## Call:
## lm(formula = SalePrice ~ `Size(US)` * Country, data = my_data_3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -82.457 -19.386 3.205 25.718 60.574
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 153.4811 4.0524 37.874 <2e-16 ***
## `Size(US)` -1.0037 0.4339 -2.313 0.0207 *
## CountryGermany -8.3495 5.1929 -1.608 0.1079
## CountryUnited Kingdom -15.2049 6.6244 -2.295 0.0217 *
## CountryUnited States -12.6258 4.9366 -2.558 0.0105 *
## `Size(US)`:CountryGermany 0.8355 0.5550 1.506 0.1322
## `Size(US)`:CountryUnited Kingdom 1.7951 0.7134 2.516 0.0119 *
## `Size(US)`:CountryUnited States 1.3174 0.5298 2.487 0.0129 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 35.17 on 14959 degrees of freedom
## Multiple R-squared: 0.0008622, Adjusted R-squared: 0.0003946
## F-statistic: 1.844 on 7 and 14959 DF, p-value: 0.07446
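Adding “Country” and its interaction with “Size(US)” does not meaningfully improve the prediction of sale prices. The overall F-test is not significant at the 0.05 level (p = 0.074), and R-squared rises only to about 0.0009, so the model explains less than 0.1% of the variation in SalePrice. Several individual terms, including the Size(US) main effect and the United Kingdom and United States interaction terms, have p-values below 0.05, but with such low explanatory power and a non-significant overall test they should be interpreted with caution.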
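A more direct way to judge whether the extra terms help is a nested-model F-test, which compares the size-only model against the model with Country and the interaction. The sketch below shows the mechanics on simulated data, not the shoe shop data: the data frame toy, its columns, and the model names fit_size and fit_full are all hypothetical, and price is generated to be unrelated to size and country.

```r
set.seed(1)
n <- 400

# Simulated stand-in data (not the shoe shop data)
toy <- data.frame(
  size = round(runif(n, 6, 12) * 2) / 2,
  country = factor(sample(
    c("Canada", "Germany", "United Kingdom", "United States"),
    n, replace = TRUE
  ))
)
toy$price <- 150 + rnorm(n, sd = 35)  # unrelated to size and country by construction

# Smaller model and the larger model that nests it
fit_size <- lm(price ~ size, data = toy)
fit_full <- lm(price ~ size * country, data = toy)

# F-test for the six added terms (3 country effects + 3 interactions)
comparison <- anova(fit_size, fit_full)
print(comparison)
```

On the real data, the same call with the two fitted models from these notes would test all the Country and interaction terms jointly, instead of reading each coefficient's p-value separately.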