Regression

I am importing the libraries needed to run these notes.

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(readr)

my_data_3 <- read_delim("C:/Users/Surya CST/Documents/CSV_files/Bundy_Shoe_Shop.csv",delim=",",show_col_types = FALSE)

## New names:
## • `` -> `...2`
## • `` -> `...3`
## • `` -> `...4`
## • `` -> `...5`
## • `` -> `...6`
## • `` -> `...7`
## • `` -> `...8`
## • `` -> `...9`
## • `` -> `...10`
## • `` -> `...11`
## • `` -> `...12`
## • `` -> `...13`
## • `` -> `...14`

## Warning: One or more parsing issues, call `problems()` on your data frame for details,
## e.g.:
##   dat <- vroom(...)
##   problems(dat)

# Preview the first rows of the raw data frame
head(my_data_3)

## # A tibble: 6 × 14
##   Inferential statistics…¹ ...2  ...3   ...4 ...5  ...6   ...7 ...8   ...9 ...10
##   <chr>                    <chr> <chr> <dbl> <chr> <chr> <dbl> <chr> <dbl> <chr>
## 1 Al Bundy's shoe shop     <NA>  <NA>     NA <NA>  <NA>   NA   <NA>   NA    <NA>
## 2 <NA>                     <NA>  <NA>     NA <NA>  <NA>   NA   <NA>   NA    <NA>
## 3 InvoiceNo                Date  Coun…    NA Shop  Gend…  NA   Size…  NA   "Uni…
## 4 52389                    1/1/… Unit…  2152 UK2   Male   11   44     10.5 "$15…
## 5 52390                    1/1/… Unit…  2230 US15  Male   11.5 44-45  11   "$19…
## 6 52391                    1/1/… Cana…  2160 CAN7  Male    9.5 42-43   9   "$14…
## # ℹ abbreviated name: ¹`Inferential statistics. Confidence intervals`
## # ℹ 4 more variables: ...11 <chr>, ...12 <dbl>, ...13 <dbl>, ...14 <chr>

Cleaning my dataset

I am renaming the columns from the auto-generated names (`...2`, `...3`, etc.) to descriptive names (InvoiceNo, Date, etc.) to make the data simpler and clearer.

new_names <- c(
  "InvoiceNo", "Date", "Country", "ProductID", "Shop", "Gender",
  "Size(US)", "Size (Europe)", "Size (UK)", "UnitPrice", "Discount",
  "Year", "Month", "SalePrice"
)

# Assign the new column names to the data frame
colnames(my_data_3) <- new_names

# Verify that the column names have been changed
colnames(my_data_3)

##  [1] "InvoiceNo"     "Date"          "Country"       "ProductID"    
##  [5] "Shop"          "Gender"        "Size(US)"      "Size (Europe)"
##  [9] "Size (UK)"     "UnitPrice"     "Discount"      "Year"         
## [13] "Month"         "SalePrice"

I am removing the first 3 rows, which contain empty values and the original header row, so that only sales records remain in my dataset.

my_data_3 <- my_data_3[-c(1:3), ]


# Print the modified data frame
print(my_data_3)

## # A tibble: 14,967 × 14
##    InvoiceNo Date     Country  ProductID Shop  Gender `Size(US)` `Size (Europe)`
##    <chr>     <chr>    <chr>        <dbl> <chr> <chr>       <dbl> <chr>          
##  1 52389     1/1/2014 United …      2152 UK2   Male         11   44             
##  2 52390     1/1/2014 United …      2230 US15  Male         11.5 44-45          
##  3 52391     1/1/2014 Canada        2160 CAN7  Male          9.5 42-43          
##  4 52392     1/1/2014 United …      2234 US6   Female        9.5 40             
##  5 52393     1/1/2014 United …      2222 UK4   Female        9   39-40          
##  6 52394     1/1/2014 United …      2173 US15  Male         10.5 43-44          
##  7 52395     1/2/2014 Germany       2200 GER2  Female        9   39-40          
##  8 52396     1/2/2014 Canada        2238 CAN5  Male         10   43             
##  9 52397     1/2/2014 United …      2191 US13  Male         10.5 43-44          
## 10 52398     1/2/2014 United …      2237 UK1   Female        9   39-40          
## # ℹ 14,957 more rows
## # ℹ 6 more variables: `Size (UK)` <dbl>, UnitPrice <chr>, Discount <chr>,
## #   Year <dbl>, Month <dbl>, SalePrice <chr>
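An alternative to dropping rows after import is to skip the junk lines at read time, so that the real header row supplies the column names directly. The sketch below is only an illustration: it uses base R's read.csv on a small made-up text sample (the raw_text lines are hypothetical, not the actual file) and assumes the true header sits on the fourth line, as the preview above suggests.

```r
# Hypothetical sample mimicking the layout seen above: a title line,
# a subtitle line, an empty row, then the real header on line 4.
raw_text <- c(
  "Inferential statistics. Confidence intervals,,",
  "Al Bundy's shoe shop,,",
  ",,",
  "InvoiceNo,Date,Country",
  "52389,1/1/2014,United Kingdom",
  "52390,1/1/2014,United States"
)

# skip = 3 discards the three junk lines, so line 4 becomes the header
demo <- read.csv(text = raw_text, skip = 3, stringsAsFactors = FALSE)

print(colnames(demo))
print(nrow(demo))
```

With this approach the manual renaming and row-dropping steps would not be needed, although the column names would then come from the file rather than from new_names.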

# Remove '$' from the SalePrice column, then convert it to numeric
my_data_3$SalePrice <- gsub("\\$", "", my_data_3$SalePrice)
my_data_3$SalePrice <- as.numeric(my_data_3$SalePrice)

class(my_data_3$SalePrice)

## [1] "numeric"

# Remove '$' from the UnitPrice column

my_data_3$UnitPrice <- gsub("\\$", "", my_data_3$UnitPrice)

my_data_3$UnitPrice <- as.numeric(my_data_3$UnitPrice)

# Remove '%' from the Discount column
my_data_3$Discount <- gsub("%", "", my_data_3$Discount)
my_data_3$Discount <- as.numeric(my_data_3$Discount)

head(my_data_3$Discount)

## [1]  0 20 20  0  0  0

class(my_data_3$UnitPrice)

## [1] "numeric"

class(my_data_3$Discount)

## [1] "numeric"
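The three conversions above follow the same pattern: strip a symbol, then convert to numeric. A minimal sketch of a reusable helper is shown below. The name strip_to_numeric and the sample values are hypothetical and used only for illustration.

```r
# Hypothetical helper: remove the given symbols, then convert to numeric.
# Inside a character class, "$" is matched literally.
strip_to_numeric <- function(x, symbols = "[$%]") {
  as.numeric(gsub(symbols, "", x))
}

# Made-up sample values in the same style as the dataset columns
prices    <- strip_to_numeric(c("$159.00", "$199.00"))
discounts <- strip_to_numeric(c("0%", "20%"))

print(prices)
print(discounts)
```

Any value that still contains non-numeric characters after stripping becomes NA with a warning, so it is worth checking for NA values after the conversion.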

1) Selection of Response Variable

For my Bundy’s Shoe sales dataset, I am considering SalePrice as my response variable, as it plays a crucial part in the revenue of the shop.

2) Selection of Categorical Column.

I have selected Country as my categorical column, as it may influence the prices and sales of a shoe product.

ANOVA Test and Results:

Null Hypothesis (H0): The mean sale price is the same in every country.

Alternative Hypothesis (H1): The mean sale price differs for at least one country.

model <- aov(SalePrice ~ Country, data = my_data_3)
anova_result <- anova(model)
print(anova_result)

## Analysis of Variance Table
## 
## Response: SalePrice
##              Df   Sum Sq Mean Sq F value Pr(>F)
## Country       3     5323  1774.5  1.4338 0.2307
## Residuals 14963 18517924  1237.6

Based on the ANOVA test, the Pr(>F) value is 0.23, which is well above the 0.05 significance level, so I fail to reject the null hypothesis. The data provide no evidence that the mean SalePrice differs across the four countries in the dataset.
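As a check on how the table is read, the F value and its p-value can be recomputed by hand from the mean squares and degrees of freedom. The sketch below copies the numbers from the ANOVA output above; the variable names are my own.

```r
# Values copied from the ANOVA table above
ms_country  <- 1774.5   # Mean Sq for Country
ms_residual <- 1237.6   # Mean Sq for Residuals
df_country  <- 3
df_residual <- 14963

# F is the ratio of between-group to within-group mean squares
f_stat <- ms_country / ms_residual

# Upper-tail probability of the F distribution gives Pr(>F)
p_value <- pf(f_stat, df_country, df_residual, lower.tail = FALSE)

print(round(f_stat, 4))
print(round(p_value, 4))
```

The results should agree with the table (F of about 1.43, p of about 0.23) up to rounding of the inputs.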

3.a) Selection of Continuous Variable.

I have selected US shoe size, Size(US), as my continuous variable and built a regression model with SalePrice as the response variable.

model <- lm(SalePrice ~ `Size(US)`, data = my_data_3)

3.b) Hypothesis Testing.

H0 (Null Hypothesis): There is no linear relationship between Size(US) and SalePrice (the slope is zero).

H1 (Alternative Hypothesis): There is a linear relationship between Size(US) and SalePrice (the slope is not zero).

summary(model)

## 
## Call:
## lm(formula = SalePrice ~ `Size(US)`, data = my_data_3)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -79.625 -18.895   4.949  25.079  55.228 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 144.33007    1.77283  81.412   <2e-16 ***
## `Size(US)`   -0.03721    0.19024  -0.196    0.845    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 35.18 on 14965 degrees of freedom
## Multiple R-squared:  2.556e-06,  Adjusted R-squared:  -6.427e-05 
## F-statistic: 0.03826 on 1 and 14965 DF,  p-value: 0.8449

The summary of the linear regression shows that the p-value for the Size(US) coefficient is 0.845, which is much greater than 0.05. This means that I fail to reject the null hypothesis: this model provides no evidence of a linear relationship between Size(US) and SalePrice. The R-squared of about 2.6e-06 confirms that shoe size explains essentially none of the variation in sale price.
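The t value and p-value in the coefficient table can be reproduced from the estimate and its standard error. This is a small sketch using the numbers printed in the summary above; the variable names are my own.

```r
# Values copied from the regression summary above
estimate  <- -0.03721   # slope for Size(US)
std_error <- 0.19024    # its standard error
df_resid  <- 14965      # residual degrees of freedom

# t statistic: how many standard errors the estimate is from zero
t_stat <- estimate / std_error

# Two-sided p-value from the t distribution
p_value <- 2 * pt(abs(t_stat), df_resid, lower.tail = FALSE)

# With a single predictor, the model F statistic equals t squared
f_stat <- t_stat^2

print(round(t_stat, 3))
print(round(p_value, 3))
print(round(f_stat, 5))
```

The values should agree with the summary (t of about -0.196, p of about 0.845, F of about 0.038) up to rounding.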

# Diagnostic plots
par(mfrow = c(2, 2))
plot(model)

3.c) Interpreting the coefficients of my model.

In this model, the coefficient for Size(US) is -0.03721. This means that for every one-unit increase in US shoe size, the model predicts a decrease of about 0.037 dollars in the sale price. The intercept, 144.33, is the predicted sale price at a shoe size of zero, which has no practical meaning on its own. However, since the slope is not statistically significant, this interpretation is not useful in real life. In future, I may consider another continuous variable that could have an influence on SalePrice.
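To make the interpretation concrete, the sketch below plugs two shoe sizes into the fitted equation and builds an approximate 95% confidence interval for the slope. The coefficients are copied from the summary above; predict_price is a hypothetical helper written for this illustration.

```r
# Coefficients copied from the regression summary above
intercept <- 144.33007
slope     <- -0.03721
std_error <- 0.19024
df_resid  <- 14965

# Hypothetical helper: predicted SalePrice for a given US shoe size
predict_price <- function(size_us) intercept + slope * size_us

# Going up one size changes the prediction by exactly the slope
diff_one_size <- predict_price(10) - predict_price(9)

# 95% confidence interval for the slope
t_crit <- qt(0.975, df_resid)
ci <- c(lower = slope - t_crit * std_error,
        upper = slope + t_crit * std_error)

print(diff_one_size)
print(ci)
```

The interval spans zero, which is another way of seeing that the data are consistent with no effect of shoe size on sale price.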

4) Considering other variables and their impact on SalePrice.

To improve the regression model, I’ll include “Country” as a categorical variable. I’ll also include an interaction term between “Size(US)” and “Country” to allow the relationship between size and sale price to differ from country to country. Let’s build this regression model and evaluate its performance.

# Linear regression model with Size(US), Country, and their interaction
model <- lm(SalePrice ~ `Size(US)` * Country, data = my_data_3)

summary(model)

## 
## Call:
## lm(formula = SalePrice ~ `Size(US)` * Country, data = my_data_3)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -82.457 -19.386   3.205  25.718  60.574 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                      153.4811     4.0524  37.874   <2e-16 ***
## `Size(US)`                        -1.0037     0.4339  -2.313   0.0207 *  
## CountryGermany                    -8.3495     5.1929  -1.608   0.1079    
## CountryUnited Kingdom            -15.2049     6.6244  -2.295   0.0217 *  
## CountryUnited States             -12.6258     4.9366  -2.558   0.0105 *  
## `Size(US)`:CountryGermany          0.8355     0.5550   1.506   0.1322    
## `Size(US)`:CountryUnited Kingdom   1.7951     0.7134   2.516   0.0119 *  
## `Size(US)`:CountryUnited States    1.3174     0.5298   2.487   0.0129 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 35.17 on 14959 degrees of freedom
## Multiple R-squared:  0.0008622,  Adjusted R-squared:  0.0003946 
## F-statistic: 1.844 on 7 and 14959 DF,  p-value: 0.07446

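The model with Size(US), Country, and their interaction does not meaningfully improve the prediction of sale prices. Several individual coefficients have p-values below 0.05, including the interaction terms for the United Kingdom and the United States, but the overall F-test is not significant at the 0.05 level (p = 0.074), and the R-squared of about 0.0009 means the model explains less than 0.1% of the variation in SalePrice. The overall model therefore has very limited explanatory power.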
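In an interaction model, the Size(US) coefficient is the slope for the reference country only, and each interaction term is the difference from that slope. The sketch below works out the implied slope per country from the printed coefficients. It assumes Canada is the reference level, since it is the only country without its own coefficient in the output; the variable names are my own.

```r
# Coefficients copied from the summary above
base_slope <- -1.0037   # Size(US) slope for the reference country
interaction <- c(Germany          = 0.8355,
                 `United Kingdom` = 1.7951,
                 `United States`  = 1.3174)

# Slope per country = reference slope + that country's interaction term
# (Canada is assumed to be the reference level)
slopes <- c(Canada = base_slope, base_slope + interaction)
print(slopes)

# Overall F-test p-value from the reported F statistic and degrees of freedom
p_overall <- pf(1.844, 7, 14959, lower.tail = FALSE)
print(round(p_overall, 4))
```

The implied slopes are all small in magnitude (at most about one dollar per shoe size), which fits the very low R-squared of the model.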