I am importing the libraries needed to run these notes.

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(readr)

my_data_3 <- read_delim("C:/Users/Surya CST/Documents/CSV_files/Bundy_Shoe_Shop.csv",delim=",",show_col_types = FALSE)

## New names:
## • `` -> `...2`
## • `` -> `...3`
## • `` -> `...4`
## • `` -> `...5`
## • `` -> `...6`
## • `` -> `...7`
## • `` -> `...8`
## • `` -> `...9`
## • `` -> `...10`
## • `` -> `...11`
## • `` -> `...12`
## • `` -> `...13`
## • `` -> `...14`

## Warning: One or more parsing issues, call `problems()` on your data frame for details,
## e.g.:
##   dat <- vroom(...)
##   problems(dat)

# Preview the first rows of the imported data
head(my_data_3)

## # A tibble: 6 × 14
##   Inferential statistics…¹ ...2  ...3   ...4 ...5  ...6   ...7 ...8   ...9 ...10
##   <chr>                    <chr> <chr> <dbl> <chr> <chr> <dbl> <chr> <dbl> <chr>
## 1 Al Bundy's shoe shop     <NA>  <NA>     NA <NA>  <NA>   NA   <NA>   NA    <NA>
## 2 <NA>                     <NA>  <NA>     NA <NA>  <NA>   NA   <NA>   NA    <NA>
## 3 InvoiceNo                Date  Coun…    NA Shop  Gend…  NA   Size…  NA   "Uni…
## 4 52389                    1/1/… Unit…  2152 UK2   Male   11   44     10.5 "$15…
## 5 52390                    1/1/… Unit…  2230 US15  Male   11.5 44-45  11   "$19…
## 6 52391                    1/1/… Cana…  2160 CAN7  Male    9.5 42-43   9   "$14…
## # ℹ abbreviated name: ¹`Inferential statistics. Confidence intervals`
## # ℹ 4 more variables: ...11 <chr>, ...12 <dbl>, ...13 <dbl>, ...14 <chr>
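The "New names" messages and the parsing warning appear because the first line of the file is a title, not the header row. Below is a minimal sketch of skipping the leading lines at import time. It assumes the real header sits on the fourth line of the file, as the preview suggests. The inline text is a made-up stand-in for the CSV, and base R's read.csv() is used only to keep the sketch self-contained (read_delim() also accepts a skip argument).

```r
# Toy stand-in for the first lines of the CSV (all values are illustrative only).
csv_text <- paste(
  "Inferential statistics. Confidence intervals,,",
  "Shoe shop subtitle,,",
  ",,",
  "InvoiceNo,Date,SalePrice",
  "52389,1/1/2014,$10.00",
  "52390,1/1/2014,$20.00",
  sep = "\n"
)

# skip = 3 drops the two title lines and the empty line,
# so the fourth line is read as the header row.
toy <- read.csv(text = csv_text, skip = 3, stringsAsFactors = FALSE)

names(toy)
str(toy)
```

If this assumption holds for the real file, the later renaming and row-removal steps would no longer be needed, but that should be checked against the actual CSV.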

Cleaning my dataset

I am renaming the columns from the auto-generated names (...2, ...3, etc.) to descriptive ones (InvoiceNo, Date, etc.) to make the data simpler and clearer.

new_names <- c("InvoiceNo", "Date", "Country", "ProductID", "Shop", "Gender", "Size(US)",
               "Size (Europe)", "Size (UK)", "UnitPrice", "Discount", "Year", "Month", "SalePrice")

# Assign the new column names to the data frame
colnames(my_data_3) <- new_names

# Verify that the column names have been changed
colnames(my_data_3)

##  [1] "InvoiceNo"     "Date"          "Country"       "ProductID"    
##  [5] "Shop"          "Gender"        "Size(US)"      "Size (Europe)"
##  [9] "Size (UK)"     "UnitPrice"     "Discount"      "Year"         
## [13] "Month"         "SalePrice"

I am removing the first 3 rows, which contain only missing values and leftover title and header text, from my data set.

my_data_3 <- my_data_3[-c(1:3), ]


# Print the modified data frame
print(my_data_3)

## # A tibble: 14,967 × 14
##    InvoiceNo Date     Country  ProductID Shop  Gender `Size(US)` `Size (Europe)`
##    <chr>     <chr>    <chr>        <dbl> <chr> <chr>       <dbl> <chr>          
##  1 52389     1/1/2014 United …      2152 UK2   Male         11   44             
##  2 52390     1/1/2014 United …      2230 US15  Male         11.5 44-45          
##  3 52391     1/1/2014 Canada        2160 CAN7  Male          9.5 42-43          
##  4 52392     1/1/2014 United …      2234 US6   Female        9.5 40             
##  5 52393     1/1/2014 United …      2222 UK4   Female        9   39-40          
##  6 52394     1/1/2014 United …      2173 US15  Male         10.5 43-44          
##  7 52395     1/2/2014 Germany       2200 GER2  Female        9   39-40          
##  8 52396     1/2/2014 Canada        2238 CAN5  Male         10   43             
##  9 52397     1/2/2014 United …      2191 US13  Male         10.5 43-44          
## 10 52398     1/2/2014 United …      2237 UK1   Female        9   39-40          
## # ℹ 14,957 more rows
## # ℹ 6 more variables: `Size (UK)` <dbl>, UnitPrice <chr>, Discount <chr>,
## #   Year <dbl>, Month <dbl>, SalePrice <chr>

# Remove '$' from the SalePrice column and convert it to numeric
my_data_3$SalePrice <- gsub("\\$", "", my_data_3$SalePrice)
my_data_3$SalePrice <- as.numeric(my_data_3$SalePrice)

class(my_data_3$SalePrice)

## [1] "numeric"

# Remove '$' from the UnitPrice column and convert it to numeric

my_data_3$UnitPrice <- gsub("\\$", "", my_data_3$UnitPrice)

my_data_3$UnitPrice <- as.numeric(my_data_3$UnitPrice)

# Remove '%' from the Discount column
my_data_3$Discount <- gsub("%", "", my_data_3$Discount)
my_data_3$Discount <- as.numeric(my_data_3$Discount)

head(my_data_3$Discount)

## [1]  0 20 20  0  0  0

class(my_data_3$UnitPrice)

## [1] "numeric"

class(my_data_3$Discount)

## [1] "numeric"
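The three conversions above follow the same pattern, so they can be folded into one helper. This is a sketch only: to_number() is a hypothetical helper name, the toy values are made up, and stripping commas is a precaution for thousands separators that the source does not show.

```r
# Hypothetical helper: strip "$", "%" and "," and convert the result to numeric.
to_number <- function(x) {
  as.numeric(gsub("[$%,]", "", x))
}

# Toy data frame standing in for the three text columns (illustrative values).
toy <- data.frame(
  SalePrice = c("$10.00", "$1,250.50"),
  UnitPrice = c("$12.50", "$1,250.50"),
  Discount  = c("20%", "0%"),
  stringsAsFactors = FALSE
)

# Apply the helper to every column that needs cleaning.
cols <- c("SalePrice", "UnitPrice", "Discount")
toy[cols] <- lapply(toy[cols], to_number)

sapply(toy, class)
```

Values that still fail to parse become NA with a warning, so checking sum(is.na(...)) on each column afterwards is a cheap safeguard.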

1) Selection of Binary Column

I have selected Gender as my binary column because it plays an important role in shoe sales and can serve as the response variable when modelling the dataset. I have converted Male to 1 and Female to 0.

my_data_3$Gender <- ifelse(my_data_3$Gender == "Male", 1, 0)
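Note that ifelse() sends every value other than "Male" to 0, including misspellings, and keeps missing values as NA. A quick cross-tabulation makes the recoding visible. The vector below is made-up toy data, not the shop data.

```r
# Toy gender vector with one missing value (illustrative only).
gender <- c("Male", "Female", "Female", "Male", NA)

# Same recoding rule as above: "Male" becomes 1, anything else becomes 0, NA stays NA.
gender_bin <- ifelse(gender == "Male", 1, 0)

# Cross-tabulate original labels against the recoded values.
table(original = gender, recoded = gender_bin, useNA = "ifany")
```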

2) Building Logistic Regression model

I selected Country and Size(US) as explanatory variables for the binary column (Gender).

model <- glm(Gender ~ Country + `Size(US)`, data = my_data_3, family = binomial)

summary(model)

## 
## Call:
## glm(formula = Gender ~ Country + `Size(US)`, family = binomial, 
##     data = my_data_3)
## 
## Coefficients:
##                       Estimate Std. Error z value Pr(>|z|)    
## (Intercept)           -8.13061    0.17049 -47.691  < 2e-16 ***
## CountryGermany        -0.10419    0.05620  -1.854  0.06374 .  
## CountryUnited Kingdom  0.01528    0.07109   0.215  0.82979    
## CountryUnited States  -0.17002    0.05305  -3.205  0.00135 ** 
## `Size(US)`             0.95400    0.01850  51.575  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 20195  on 14966  degrees of freedom
## Residual deviance: 16052  on 14962  degrees of freedom
## AIC: 16062
## 
## Number of Fisher Scoring iterations: 5

2.a) Interpretation:

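Intercept: The intercept, -8.13061, is the log-odds of Gender = 1 (Male) when Size(US) is zero and Country is at its reference level, Canada (the country with no coefficient of its own in the output). It is highly statistically significant (p-value < 0.001), although a shoe size of zero lies outside the data, so the value has little practical meaning on its own.
Country Effects: Each country coefficient compares that country with Canada at the same shoe size. Germany has slightly lower odds of Gender = 1 (estimate -0.10419), but the effect is not significant at the 5% level (p = 0.064). The United Kingdom shows no significant difference (p = 0.830). The United States has lower odds of Gender = 1 (estimate -0.17002), and this effect is statistically significant (p = 0.001).
Size(US): For each one-unit increase in ‘Size(US),’ the log-odds of Gender = 1 rise by 0.95400, which multiplies the odds by about exp(0.954) ≈ 2.6 (p-value < 0.001).
The model predicts ‘Gender’ better than an intercept-only baseline: the residual deviance (16052) is well below the null deviance (20195), a drop of 4143 on 4 degrees of freedom.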
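Exponentiating the coefficients turns them into odds ratios, and the drop in deviance can be compared with a chi-squared distribution. The sketch below uses only numbers copied from the summary(model) output above. With the fitted object available, exp(coef(model)) would give the same odds ratios.

```r
# Estimates copied from the summary(model) output.
est <- c(
  "Germany"        = -0.10419,
  "United Kingdom" =  0.01528,
  "United States"  = -0.17002,
  "Size(US)"       =  0.95400
)

# Odds ratios: multiplicative change in the odds of Gender = 1.
odds_ratio <- exp(est)
round(odds_ratio, 3)

# Likelihood-ratio comparison with the intercept-only (null) model.
dev_drop <- 20195 - 16052   # null deviance minus residual deviance
df_drop  <- 14966 - 14962   # difference in degrees of freedom
p_value  <- pchisq(dev_drop, df = df_drop, lower.tail = FALSE)
p_value
```

A p-value this close to zero indicates that the predictors, taken together, improve the fit over the baseline.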

2.b) Confidence Interval

Let’s calculate and interpret a confidence interval (C.I.) for the coefficient of the ‘Size(US)’ variable, which is 0.95400 with a standard error of 0.01850. We’ll use a 95% confidence level.

A 95% confidence interval for the ‘Size(US)’ coefficient is given by:

\[0.95400 \pm 1.96 \times 0.01850\]

Evaluating the two endpoints:

\[0.95400 - 1.96 \times 0.01850 \approx 0.9177\]

\[0.95400 + 1.96 \times 0.01850 \approx 0.9903\]

So the 95% confidence interval for the ‘Size(US)’ coefficient is approximately [0.9177, 0.9903].

Interpretation:

We are 95% confident that the true effect of ‘Size(US)’ on the log-odds of ‘Gender’ falls within this interval.
The coefficient of ‘Size(US)’ represents the change in the log-odds of Gender = 1 (Male) for each one-unit increase in ‘Size(US)’.
In simpler terms, as ‘Size(US)’ increases, the odds of Gender = 1 increase: each additional size multiplies the odds by roughly 2.5 to 2.7 (the exponentiated interval). Because the entire interval is above 0, the effect is statistically significant at the 5% level.
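The same Wald interval can be reproduced in R from the reported estimate and standard error. This is a sketch using only those two numbers. With the fitted object available, confint.default(model) computes Wald intervals of this kind for all coefficients.

```r
# Estimate and standard error of Size(US), copied from the model summary.
est <- 0.95400
se  <- 0.01850

# Critical value for a two-sided 95% interval (about 1.96).
z <- qnorm(0.975)

# Wald interval on the log-odds scale.
ci <- est + c(-1, 1) * z * se
round(ci, 4)

# The same interval on the odds-ratio scale.
round(exp(ci), 2)
```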

3) Transformation:

I am creating two scatterplots against the binary column (Gender): 1) Country vs. Gender and 2) Size(US) vs. Gender.

ggplot(data = my_data_3, aes(x = Country, y = Gender)) +
  geom_point() +
  labs(x = "Country", y = "Gender")

ggplot(data = my_data_3, aes(x = `Size(US)`, y = Gender)) +
  geom_point() +
  labs(x = "Size(US)", y = "Gender")

Scatter Plot for Country vs. Gender:

The scatter plot for “Country” vs. “Gender” shows which combinations of country and gender occur in the data. Because both variables are categorical, many points land on the same position, so the plot reveals whether a combination is present but not how common it is.

Scatter Plot for Size(US) vs. Gender:

The scatter plot for “Size(US)” vs. “Gender” shows how shoe sizes (in the US sizing system) relate to the “Gender” variable. It can help assess whether certain shoe sizes are more common for one gender, for example whether larger sizes appear mostly at Gender = 1 (Male) and smaller sizes mostly at Gender = 0 (Female).
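Since Gender takes only the values 0 and 1, points in these plots stack on top of each other. One way to make the pattern visible is to summarise the share of Gender = 1 within each group. The sketch below uses a small made-up data frame, not the shop data, and the column name PropMale is introduced here for illustration only.

```r
# Toy data: shoe size and binary gender (illustrative values only).
toy <- data.frame(
  Size   = c(8, 8, 9, 9, 10, 10, 11, 11),
  Gender = c(0, 0, 0, 1, 1, 0, 1, 1)
)

# The mean of a 0/1 column is the proportion of 1s, here computed per size.
prop_male <- aggregate(Gender ~ Size, data = toy, FUN = mean)
names(prop_male)[2] <- "PropMale"

prop_male
```

The same call with Country in place of Size would give the proportion per country. In ggplot2, replacing geom_point() with geom_jitter() is another option for reducing the overlap.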

R_stat_week_10

Surya

2023-10-28