I am importing the libraries needed to run these notes.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.3 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(readr)
my_data_3 <- read_delim("C:/Users/Surya CST/Documents/CSV_files/Bundy_Shoe_Shop.csv",delim=",",show_col_types = FALSE)
## New names:
## • `` -> `...2`
## • `` -> `...3`
## • `` -> `...4`
## • `` -> `...5`
## • `` -> `...6`
## • `` -> `...7`
## • `` -> `...8`
## • `` -> `...9`
## • `` -> `...10`
## • `` -> `...11`
## • `` -> `...12`
## • `` -> `...13`
## • `` -> `...14`
## Warning: One or more parsing issues, call `problems()` on your data frame for details,
## e.g.:
## dat <- vroom(...)
## problems(dat)
# Preview the first rows of the raw import
head(my_data_3)
## # A tibble: 6 × 14
## Inferential statistics…¹ ...2 ...3 ...4 ...5 ...6 ...7 ...8 ...9 ...10
## <chr> <chr> <chr> <dbl> <chr> <chr> <dbl> <chr> <dbl> <chr>
## 1 Al Bundy's shoe shop <NA> <NA> NA <NA> <NA> NA <NA> NA <NA>
## 2 <NA> <NA> <NA> NA <NA> <NA> NA <NA> NA <NA>
## 3 InvoiceNo Date Coun… NA Shop Gend… NA Size… NA "Uni…
## 4 52389 1/1/… Unit… 2152 UK2 Male 11 44 10.5 "$15…
## 5 52390 1/1/… Unit… 2230 US15 Male 11.5 44-45 11 "$19…
## 6 52391 1/1/… Cana… 2160 CAN7 Male 9.5 42-43 9 "$14…
## # ℹ abbreviated name: ¹`Inferential statistics. Confidence intervals`
## # ℹ 4 more variables: ...11 <chr>, ...12 <dbl>, ...13 <dbl>, ...14 <chr>
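The preview shows why the import needs cleaning: the file starts with two title rows and an empty row, so the real header (InvoiceNo, Date, ...) arrives as a data row. As an alternative, the `skip` argument of readr's readers can drop those leading lines at read time. The sketch below uses a small hypothetical stand-in for the file (`raw_text`, three columns only, with two rows copied from the preview), since the real CSV is not reproduced here.

```r
library(readr)

# Hypothetical miniature of the file layout: two title lines, one empty
# (comma-only) line, then the real header and the data rows.
raw_text <- paste(
  "Inferential statistics. Confidence intervals,,",
  "Al Bundy's shoe shop,,",
  ",,",
  "InvoiceNo,Date,Country",
  "52391,1/1/2014,Canada",
  "52395,1/2/2014,Germany",
  sep = "\n"
)

# I() marks the string as literal data rather than a file path.
# skip = 3 discards the three leading lines, so line 4 becomes the header.
demo <- read_csv(I(raw_text), skip = 3, show_col_types = FALSE)
print(demo)
```

With the real file, the same `skip = 3` argument would be passed to `read_delim()`, assuming the layout matches the preview above.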
I am renaming the columns from the auto-generated names (..2, ...3, etc.) to descriptive ones (InvoiceNo, Date, etc.) to make the data simpler and clearer.
new_names <- c("InvoiceNo", "Date", "Country", "ProductID", "Shop", "Gender",
               "Size(US)", "Size (Europe)", "Size (UK)", "UnitPrice", "Discount",
               "Year", "Month", "SalePrice")
# Assign the new column names to the data frame
colnames(my_data_3) <- new_names
# Verify that the column names have been changed
colnames(my_data_3)
## [1] "InvoiceNo" "Date" "Country" "ProductID"
## [5] "Shop" "Gender" "Size(US)" "Size (Europe)"
## [9] "Size (UK)" "UnitPrice" "Discount" "Year"
## [13] "Month" "SalePrice"
I am removing the first 3 rows because they contain only the leftover title text, an empty row, and the original header row, none of which are data.
my_data_3 <- my_data_3[-c(1:3), ]
# Print the modified data frame
print(my_data_3)
## # A tibble: 14,967 × 14
## InvoiceNo Date Country ProductID Shop Gender `Size(US)` `Size (Europe)`
## <chr> <chr> <chr> <dbl> <chr> <chr> <dbl> <chr>
## 1 52389 1/1/2014 United … 2152 UK2 Male 11 44
## 2 52390 1/1/2014 United … 2230 US15 Male 11.5 44-45
## 3 52391 1/1/2014 Canada 2160 CAN7 Male 9.5 42-43
## 4 52392 1/1/2014 United … 2234 US6 Female 9.5 40
## 5 52393 1/1/2014 United … 2222 UK4 Female 9 39-40
## 6 52394 1/1/2014 United … 2173 US15 Male 10.5 43-44
## 7 52395 1/2/2014 Germany 2200 GER2 Female 9 39-40
## 8 52396 1/2/2014 Canada 2238 CAN5 Male 10 43
## 9 52397 1/2/2014 United … 2191 US13 Male 10.5 43-44
## 10 52398 1/2/2014 United … 2237 UK1 Female 9 39-40
## # ℹ 14,957 more rows
## # ℹ 6 more variables: `Size (UK)` <dbl>, UnitPrice <chr>, Discount <chr>,
## # Year <dbl>, Month <dbl>, SalePrice <chr>
# Remove '$' from SalePrice and convert it to numeric
my_data_3$SalePrice <- gsub("\\$", "", my_data_3$SalePrice)
my_data_3$SalePrice <- as.numeric(my_data_3$SalePrice)
class(my_data_3$SalePrice)
## [1] "numeric"
# Remove '$' from UnitPrice and convert it to numeric
my_data_3$UnitPrice <- gsub("\\$", "", my_data_3$UnitPrice)
my_data_3$UnitPrice <- as.numeric(my_data_3$UnitPrice)
# Remove '%' from the Discount column
my_data_3$Discount <- gsub("%", "", my_data_3$Discount)
my_data_3$Discount <- as.numeric(my_data_3$Discount)
head(my_data_3$Discount)
## [1] 0 20 20 0 0 0
class(my_data_3$UnitPrice)
## [1] "numeric"
class(my_data_3$Discount)
## [1] "numeric"
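The three conversions above follow the same pattern: strip a symbol, then convert to numeric. A more compact option is `readr::parse_number()`, which drops non-numeric characters around a number and also handles a thousands separator, something `as.numeric()` would turn into `NA`. The sketch below uses made-up values in the same formats as the three columns; they are illustrative, not rows from the real file.

```r
library(dplyr)
library(readr)
library(tibble)

# Toy values in the same formats as UnitPrice, Discount and SalePrice
# (illustrative numbers only, not taken from the data set)
toy <- tibble(
  UnitPrice = c("$150.00", "$1,200.50"),
  Discount  = c("0%", "20%"),
  SalePrice = c("$150.00", "$960.40")
)

# parse_number() keeps the numeric part and drops '$', '%' and the
# grouping mark ',' in a single step for every listed column
cleaned <- toy %>%
  mutate(across(c(UnitPrice, Discount, SalePrice), parse_number))

print(cleaned)
```

Applied to the real data, the same `mutate(across(...))` call on `my_data_3` would replace the six `gsub()` / `as.numeric()` lines.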
I have selected Gender as my binary column because it plays an important role in shoe sales and can serve as the response variable for modelling the dataset. I have converted the values so that Male = 1 and Female = 0.
my_data_3$Gender <- ifelse(my_data_3$Gender == "Male", 1, 0)
I selected Country and Size(US) as explanatory variables for the binary column (Gender).
model <- glm(Gender ~ Country + `Size(US)`, data = my_data_3, family = binomial)
summary(model)
##
## Call:
## glm(formula = Gender ~ Country + `Size(US)`, family = binomial,
## data = my_data_3)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -8.13061 0.17049 -47.691 < 2e-16 ***
## CountryGermany -0.10419 0.05620 -1.854 0.06374 .
## CountryUnited Kingdom 0.01528 0.07109 0.215 0.82979
## CountryUnited States -0.17002 0.05305 -3.205 0.00135 **
## `Size(US)` 0.95400 0.01850 51.575 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 20195 on 14966 degrees of freedom
## Residual deviance: 16052 on 14962 degrees of freedom
## AIC: 16062
##
## Number of Fisher Scoring iterations: 5
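Intercept: The intercept, -8.13061, is the log-odds that the buyer is male (Gender = 1) in the reference country (Canada, the level without a coefficient of its own) at Size(US) = 0. Because no shoe has size 0, it has no practical meaning by itself. It is highly statistically significant (p-value < 0.001).
Country Effects: Each country coefficient compares that country with Canada, holding Size(US) fixed. Germany has slightly lower odds of a male buyer, but the difference is not significant at the 5% level (p = 0.064). The United Kingdom shows no significant difference (p = 0.830). The United States has lower odds of a male buyer, and this effect is statistically significant (p = 0.001).
Size(US): For each one-unit increase in ‘Size(US),’ the log-odds that the buyer is male increase by 0.954, which multiplies the odds by exp(0.954) ≈ 2.60. The effect is highly significant (p-value < 0.001).
The model predicts ‘Gender’ better than the intercept-only baseline: the residual deviance (16052 on 14962 degrees of freedom) is far below the null deviance (20195 on 14966 degrees of freedom), a drop of 4143 for 4 added parameters.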
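Logistic regression coefficients are on the log-odds scale, so they are easier to read after exponentiating. The sketch below retypes the estimates from the summary table (the names `coefs` and `eta` are only illustrative), converts them to odds ratios, and works out one predicted probability for a hypothetical buyer in the reference country with US size 10.

```r
# Estimates copied from the summary(model) output
coefs <- c(
  "(Intercept)"           = -8.13061,
  "CountryGermany"        = -0.10419,
  "CountryUnited Kingdom" =  0.01528,
  "CountryUnited States"  = -0.17002,
  "Size(US)"              =  0.95400
)

# Odds ratios: multiplicative change in the odds of Gender = 1 (male)
odds_ratios <- exp(coefs[-1])
print(round(odds_ratios, 3))

# Predicted probability of a male buyer for the reference country at US size 10:
# linear predictor = intercept + slope * size, then the inverse logit (plogis)
eta <- coefs[["(Intercept)"]] + coefs[["Size(US)"]] * 10
prob_male <- plogis(eta)
print(round(prob_male, 3))

# Likelihood-ratio test against the intercept-only model,
# using the deviances reported in the summary
lr_stat <- 20195 - 16052
lr_p <- pchisq(lr_stat, df = 14966 - 14962, lower.tail = FALSE)
print(lr_p)
```

With the fitted object available, `exp(coef(model))` and `predict(model, newdata = ..., type = "response")` should give the same quantities without retyping the numbers.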
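Next, I calculate and interpret a confidence interval (C.I.) for the coefficient of the ‘Size(US)’ variable, which is 0.95400 with a standard error of 0.01850, using a 95% confidence level.
A 95% confidence interval for the ‘Size(US)’ coefficient is given by:
\[0.95400 \pm 1.96 \times 0.01850\]
Calculating the two endpoints:
\[0.95400 - 1.96 \times 0.01850 \approx 0.9177\]
\[0.95400 + 1.96 \times 0.01850 \approx 0.9903\]
So the 95% confidence interval for the ‘Size(US)’ coefficient is approximately [0.9177, 0.9903].
Interpretation: With 95% confidence, each one-unit increase in ‘Size(US)’ raises the log-odds that the buyer is male by between 0.9177 and 0.9903, holding Country fixed. The interval excludes zero, which agrees with the very small p-value. On the odds scale, this corresponds to multiplying the odds by roughly 2.50 to 2.69 per size unit.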
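The same Wald interval can be reproduced in R from the reported estimate and standard error. This is a minimal sketch: the variable names are illustrative, and `qnorm(0.975)` supplies the normal critical value that is rounded to 1.96 in the hand calculation.

```r
# Estimate and standard error of the Size(US) coefficient, from summary(model)
estimate  <- 0.95400
std_error <- 0.01850

# Two-sided 95% critical value of the standard normal distribution
z_crit <- qnorm(0.975)

# Wald interval on the log-odds scale
ci_log_odds <- estimate + c(-1, 1) * z_crit * std_error
print(round(ci_log_odds, 4))

# The same interval on the odds-ratio scale
ci_odds_ratio <- exp(ci_log_odds)
print(round(ci_odds_ratio, 2))
```

With the fitted model in memory, `confint.default(model)` should return Wald intervals of this kind for every coefficient at once.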
I am creating two scatter plots against the binary column (Gender): 1) Country vs. Gender and 2) Size(US) vs. Gender.
ggplot(data = my_data_3, aes(x = Country, y = Gender)) +
  geom_point() +
  labs(x = "Country", y = "Gender")
ggplot(data = my_data_3, aes(x = `Size(US)`, y = Gender)) +
  geom_point() +
  labs(x = "Size(US)", y = "Gender")
The scatter plot for “Country” vs. “Gender” shows which combinations of country and gender occur in the data. Because “Gender” takes only the values 0 and 1 and “Country” is categorical, the points pile up on top of each other at two heights, so the plot cannot show how the share of male buyers differs between countries.
The scatter plot for “Size(US)” vs. “Gender” shows the range of US shoe sizes sold to each gender. It can help assess whether certain sizes are bought mostly by one gender, for example whether larger sizes appear mainly at Gender = 1 (male) and smaller sizes at Gender = 0 (female), which is the pattern the positive Size(US) coefficient suggests. Overlapping points hide how many sales lie at each size.
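Because many sales share the same coordinates, two small changes can make these plots easier to read: plotting the share of male buyers per country as bars, and jittering the points in the size plot so that dense regions become visible. The sketch below uses a tiny made-up data frame (`toy`, with a hypothetical `size_us` column standing in for Size(US)), not the real data set.

```r
library(dplyr)
library(ggplot2)

# Made-up stand-in for my_data_3 (illustrative values only)
toy <- data.frame(
  Country = c("Canada", "Canada", "Germany", "Germany", "Germany"),
  Gender  = c(1, 0, 0, 0, 1),
  size_us = c(10.5, 8, 7.5, 8.5, 11)
)

# Share of male buyers (Gender = 1) per country
share_male <- toy %>%
  group_by(Country) %>%
  summarise(prop_male = mean(Gender), .groups = "drop")

# Bar chart of the proportion instead of overplotted 0/1 points
p_country <- ggplot(share_male, aes(x = Country, y = prop_male)) +
  geom_col() +
  labs(x = "Country", y = "Share of male buyers")

# Vertical jitter and transparency reveal how many sales sit at each size
p_size <- ggplot(toy, aes(x = size_us, y = Gender)) +
  geom_jitter(width = 0, height = 0.05, alpha = 0.4) +
  labs(x = "Size(US)", y = "Gender (1 = male, 0 = female)")

print(p_country)
print(p_size)
```

On the real data, `toy` would be replaced by `my_data_3` and `size_us` by the backtick-quoted `Size(US)` column.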