R_stat_week

title: "R_Stat_week_4"
author: "Surya"
date: "2023-09-17"
output: html_document

### Loading Data

I am importing the libraries needed to run these notes.

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

I will use my data set on shoe sales, which you have already approved.

library(readr)
# Load the data
my_data_2 <- read_delim("C:/Users/Surya CST/Documents/CSV_files/Bundy_Shoe_Shop.csv",delim=",")

## New names:
## • `` -> `...2`
## • `` -> `...3`
## • `` -> `...4`
## • `` -> `...5`
## • `` -> `...6`
## • `` -> `...7`
## • `` -> `...8`
## • `` -> `...9`
## • `` -> `...10`
## • `` -> `...11`
## • `` -> `...12`
## • `` -> `...13`
## • `` -> `...14`

## Warning: One or more parsing issues, call `problems()` on your data frame for details,
## e.g.:
##   dat <- vroom(...)
##   problems(dat)

## Rows: 14970 Columns: 14
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (9): Inferential statistics. Confidence intervals, ...2, ...3, ...5, ......
## dbl (5): ...4, ...7, ...9, ...12, ...13
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# Preview the first rows of the raw data
head(my_data_2)

## # A tibble: 6 × 14
##   Inferential statistics…¹ ...2  ...3   ...4 ...5  ...6   ...7 ...8   ...9 ...10
##   <chr>                    <chr> <chr> <dbl> <chr> <chr> <dbl> <chr> <dbl> <chr>
## 1 Al Bundy's shoe shop     <NA>  <NA>     NA <NA>  <NA>   NA   <NA>   NA    <NA>
## 2 <NA>                     <NA>  <NA>     NA <NA>  <NA>   NA   <NA>   NA    <NA>
## 3 InvoiceNo                Date  Coun…    NA Shop  Gend…  NA   Size…  NA   "Uni…
## 4 52389                    1/1/… Unit…  2152 UK2   Male   11   44     10.5 "$15…
## 5 52390                    1/1/… Unit…  2230 US15  Male   11.5 44-45  11   "$19…
## 6 52391                    1/1/… Cana…  2160 CAN7  Male    9.5 42-43   9   "$14…
## # ℹ abbreviated name: ¹`Inferential statistics. Confidence intervals`
## # ℹ 4 more variables: ...11 <chr>, ...12 <dbl>, ...13 <dbl>, ...14 <chr>

Cleaning my dataset

I am renaming the columns from the auto-generated names (...2, ...3, and so on) to descriptive names (InvoiceNo, Date, and so on) to make the data simpler and clearer.

new_names <- c("InvoiceNo", "Date", "Country", "ProductID", "Shop", "Gender",
               "Size(US)", "Size (Europe)", "Size (UK)", "UnitPrice",
               "Discount", "Year", "Month", "SalePrice")

# Assign the new column names to the data frame
colnames(my_data_2) <- new_names

# Verify that the column names have been changed
colnames(my_data_2)

##  [1] "InvoiceNo"     "Date"          "Country"       "ProductID"    
##  [5] "Shop"          "Gender"        "Size(US)"      "Size (Europe)"
##  [9] "Size (UK)"     "UnitPrice"     "Discount"      "Year"         
## [13] "Month"         "SalePrice"

I am removing the first three rows, which hold missing values and the embedded header row rather than actual sales records.

my_data_2 <- my_data_2[-c(1:3), ]


# Print the modified data frame
print(my_data_2)

## # A tibble: 14,967 × 14
##    InvoiceNo Date     Country  ProductID Shop  Gender `Size(US)` `Size (Europe)`
##    <chr>     <chr>    <chr>        <dbl> <chr> <chr>       <dbl> <chr>          
##  1 52389     1/1/2014 United …      2152 UK2   Male         11   44             
##  2 52390     1/1/2014 United …      2230 US15  Male         11.5 44-45          
##  3 52391     1/1/2014 Canada        2160 CAN7  Male          9.5 42-43          
##  4 52392     1/1/2014 United …      2234 US6   Female        9.5 40             
##  5 52393     1/1/2014 United …      2222 UK4   Female        9   39-40          
##  6 52394     1/1/2014 United …      2173 US15  Male         10.5 43-44          
##  7 52395     1/2/2014 Germany       2200 GER2  Female        9   39-40          
##  8 52396     1/2/2014 Canada        2238 CAN5  Male         10   43             
##  9 52397     1/2/2014 United …      2191 US13  Male         10.5 43-44          
## 10 52398     1/2/2014 United …      2237 UK1   Female        9   39-40          
## # ℹ 14,957 more rows
## # ℹ 6 more variables: `Size (UK)` <dbl>, UnitPrice <chr>, Discount <chr>,
## #   Year <dbl>, Month <dbl>, SalePrice <chr>
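
The renaming and row removal above could probably be avoided by skipping the title lines when reading the file, so that the embedded header row becomes the column names. The sketch below uses a small made-up file written to a temporary path; the number of lines to skip (3 here) and the toy values are assumptions and would need to be checked against the real CSV.

```r
library(readr)

# Made-up miniature of the file layout: two title lines, one empty
# record, then the real header row followed by the data.
lines <- c(
  "Inferential statistics. Confidence intervals,,,",
  "Al Bundy's shoe shop,,,",
  ",,,",
  "InvoiceNo,Date,Country,SalePrice",
  "1,1/1/2014,Canada,$100.00",
  "2,1/2/2014,Germany,$80.00"
)
path <- tempfile(fileext = ".csv")
writeLines(lines, path)

# skip = 3 drops the lines before the header (assumed count)
toy <- read_csv(path, skip = 3, show_col_types = FALSE)

print(names(toy))
print(toy)
```

Reading the file this way also lets readr guess the column types from the data rows only, which may avoid the parsing warning shown earlier.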

# Remove the $ sign from SalePrice and convert it to numeric
my_data_2$SalePrice <- gsub("\\$", "", my_data_2$SalePrice)
my_data_2$SalePrice <- as.numeric(my_data_2$SalePrice)

class(my_data_2$SalePrice)

## [1] "numeric"

# Remove the $ sign from UnitPrice and convert it to numeric
my_data_2$UnitPrice <- gsub("\\$", "", my_data_2$UnitPrice)
my_data_2$UnitPrice <- as.numeric(my_data_2$UnitPrice)

class(my_data_2$UnitPrice)

## [1] "numeric"
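
One caveat with the gsub() plus as.numeric() approach is that any price containing a thousands separator would silently become NA. I have not checked whether this data set contains such values, so the prices below are made up. As a hedged alternative, readr's parse_number() strips currency symbols and grouping marks in one step.

```r
library(readr)

# Made-up prices, one with a thousands separator
prices <- c("$159.00", "$1,250.50", "$89.99")

# Approach used above: the comma makes as.numeric() return NA
via_gsub <- suppressWarnings(as.numeric(gsub("\\$", "", prices)))

# Alternative: parse_number() drops the $ and the grouping mark
via_parse <- parse_number(prices)

print(via_gsub)
print(via_parse)

# Quick check for values lost during conversion
cat("NAs after gsub/as.numeric:", sum(is.na(via_gsub)), "\n")
```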

Drawing 5 subsamples from my data, each about 50% of the size of the full data set, sampled with replacement

library(dplyr)

num_subsamples <- 5
subsample_size <- floor(0.5 * nrow(my_data_2))

set.seed(123)
for (i in 1:num_subsamples) {
  # Sample rows with replacement
  subsample_indices <- sample(1:nrow(my_data_2), size = subsample_size, replace = TRUE)
  
  # Create a subsample data frame with the desired name format
  subsample_name <- paste0("subsample_", i)
  assign(subsample_name, my_data_2[subsample_indices, ])
}
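
The loop above creates five separate variables with assign(). A common alternative is to keep the subsamples in one named list, which makes the later per-subsample calculations a single sapply() call instead of a loop with get(). This is only a sketch on a made-up toy data frame; the column names id and size are hypothetical.

```r
set.seed(123)

# Made-up toy data standing in for my_data_2
toy <- data.frame(id = 1:20, size = seq(5, 14.5, by = 0.5))

num_subsamples <- 5
subsample_size <- floor(0.5 * nrow(toy))

# One list element per subsample, sampled with replacement
subsamples <- lapply(seq_len(num_subsamples), function(i) {
  toy[sample(nrow(toy), size = subsample_size, replace = TRUE), ]
})
names(subsamples) <- paste0("subsample_", seq_len(num_subsamples))

# Any statistic can now be computed for all subsamples at once
mean_sizes <- sapply(subsamples, function(df) mean(df$size, na.rm = TRUE))
print(mean_sizes)
```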

print(head(subsample_1))

## # A tibble: 6 × 14
##   InvoiceNo Date       Country ProductID Shop  Gender `Size(US)` `Size (Europe)`
##   <chr>     <chr>      <chr>       <dbl> <chr> <chr>       <dbl> <chr>          
## 1 54467     11/18/2014 United…      2230 US12  Male         10.5 43-44          
## 2 54511     11/26/2014 Canada       2172 CAN6  Female        8.5 39             
## 3 61708     5/29/2016  United…      2147 US12  Female        8.5 39             
## 4 60198     3/3/2016   Germany      2180 GER2  Female        7.5 38             
## 5 63535     8/26/2016  Germany      2155 GER2  Male         11.5 44-45          
## 6 54946     1/24/2015  United…      2175 US15  Male         10   43             
## # ℹ 6 more variables: `Size (UK)` <dbl>, UnitPrice <dbl>, Discount <chr>,
## #   Year <dbl>, Month <dbl>, SalePrice <dbl>

print(head(subsample_2))

## # A tibble: 6 × 14
##   InvoiceNo Date      Country  ProductID Shop  Gender `Size(US)` `Size (Europe)`
##   <chr>     <chr>     <chr>        <dbl> <chr> <chr>       <dbl> <chr>          
## 1 63801     9/9/2016  United …      2209 US6   Male         11.5 44-45          
## 2 54538     12/3/2014 United …      2181 US8   Male          6.5 39             
## 3 62400     7/2/2016  United …      2210 US12  Female        5.5 36             
## 4 59279     1/7/2016  Germany       2159 GER1  Female        8   38-39          
## 5 57863     9/27/2015 Germany       2168 GER2  Male          8   41             
## 6 62646     7/13/2016 United …      2182 US10  Female        9   39-40          
## # ℹ 6 more variables: `Size (UK)` <dbl>, UnitPrice <dbl>, Discount <chr>,
## #   Year <dbl>, Month <dbl>, SalePrice <dbl>

print(head(subsample_3))

## # A tibble: 6 × 14
##   InvoiceNo Date      Country  ProductID Shop  Gender `Size(US)` `Size (Europe)`
##   <chr>     <chr>     <chr>        <dbl> <chr> <chr>       <dbl> <chr>          
## 1 52789     3/7/2014  United …      2230 UK5   Male         10   43             
## 2 61012     4/20/2016 United …      2157 US15  Female        7.5 38             
## 3 57064     7/29/2015 Canada        2175 CAN3  Male         10.5 43-44          
## 4 56448     6/14/2015 United …      2160 UK1   Male          9.5 42-43          
## 5 58371     11/2/2015 United …      2227 UK5   Female        8   38-39          
## 6 57394     8/23/2015 Germany       2241 GER2  Female        8   38-39          
## # ℹ 6 more variables: `Size (UK)` <dbl>, UnitPrice <dbl>, Discount <chr>,
## #   Year <dbl>, Month <dbl>, SalePrice <dbl>

print(head(subsample_4))

## # A tibble: 6 × 14
##   InvoiceNo Date      Country  ProductID Shop  Gender `Size(US)` `Size (Europe)`
##   <chr>     <chr>     <chr>        <dbl> <chr> <chr>       <dbl> <chr>          
## 1 63509     8/25/2016 United …      2238 US12  Male          9   42             
## 2 62869     7/25/2016 United …      2219 US15  Female        8   38-39          
## 3 56703     7/2/2015  Germany       2227 GER2  Male         10.5 43-44          
## 4 63902     9/14/2016 Canada        2188 CAN2  Male         15   48             
## 5 57324     8/17/2015 United …      2155 US15  Female       10.5 41             
## 6 56613     6/25/2015 United …      2218 US12  Male         10.5 43-44          
## # ℹ 6 more variables: `Size (UK)` <dbl>, UnitPrice <dbl>, Discount <chr>,
## #   Year <dbl>, Month <dbl>, SalePrice <dbl>

print(head(subsample_5))

## # A tibble: 6 × 14
##   InvoiceNo Date       Country ProductID Shop  Gender `Size(US)` `Size (Europe)`
##   <chr>     <chr>      <chr>       <dbl> <chr> <chr>       <dbl> <chr>          
## 1 64017     9/19/2016  United…      2205 UK5   Female        5   35-36          
## 2 63301     8/15/2016  United…      2193 UK1   Female        9.5 40             
## 3 64872     11/1/2016  Germany      2174 GER1  Female       12   42-43          
## 4 65649     12/22/2016 United…      2208 US13  Male          9.5 42-43          
## 5 53098     4/24/2014  United…      2170 UK2   Female        9.5 40             
## 6 57419     8/25/2015  United…      2180 US12  Male          8.5 41-42          
## # ℹ 6 more variables: `Size (UK)` <dbl>, UnitPrice <dbl>, Discount <chr>,
## #   Year <dbl>, Month <dbl>, SalePrice <dbl>

Average Shoe Size of sub-samples

calculate_average_shoe_size <- function(subsample_df) {
  mean(subsample_df$`Size(US)`, na.rm = TRUE) 
}

average_shoe_sizes <- numeric(num_subsamples)

# Calculate the average shoe size for each subsample, store it, and print it
for (i in 1:num_subsamples) {
  subsample_name <- paste0("subsample_", i)
  average_shoe_sizes[i] <- calculate_average_shoe_size(get(subsample_name))

  cat("Average shoe size for", subsample_name, ":", average_shoe_sizes[i], "\n")
}

## Average shoe size for subsample_1 : 9.213618 
## Average shoe size for subsample_2 : 9.198918 
## Average shoe size for subsample_3 : 9.198784 
## Average shoe size for subsample_4 : 9.21903 
## Average shoe size for subsample_5 : 9.198383

Observation 1:

There is not much difference in average shoe size between the subsamples (all are between about 9.20 and 9.22). This may be because each subsample is a large random draw from the same company data, so anomalies are unlikely to differ much from one sample to the next.

Average Unit price of subsamples

calculate_average_unit_price <- function(subsample_df) {
  mean(subsample_df$UnitPrice, na.rm = TRUE)  
}

average_unit_prices <- numeric(num_subsamples)

# Calculate the average unit price for each subsample, store it, and print it
for (i in 1:num_subsamples) {
  subsample_name <- paste0("subsample_", i)
  average_unit_prices[i] <- calculate_average_unit_price(get(subsample_name))

  cat("Average unit price for", subsample_name, ":", average_unit_prices[i], "\n")
}

## Average unit price for subsample_1 : 164.4537 
## Average unit price for subsample_2 : 164.2693 
## Average unit price for subsample_3 : 163.9913 
## Average unit price for subsample_4 : 164.2425 
## Average unit price for subsample_5 : 164.3

Observation 2:

There is not much difference in average unit price between the subsamples (all are between about 163.99 and 164.45). This may be because the subsamples are large random draws from the same company data.

Average Sale Price of each subsample

calculate_average_sale_price <- function(subsample_df) {
  mean(subsample_df$SalePrice, na.rm = TRUE)
}

average_sale_prices <- numeric(num_subsamples)

# Calculate the average sale price for each subsample, store it, and print it
for (i in 1:num_subsamples) {
  subsample_name <- paste0("subsample_", i)
  average_sale_prices[i] <- calculate_average_sale_price(get(subsample_name))

  cat("Average sale price for", subsample_name, ":", average_sale_prices[i], "\n")
}

## Average sale price for subsample_1 : 144.28 
## Average sale price for subsample_2 : 144.0443 
## Average sale price for subsample_3 : 143.7014 
## Average sale price for subsample_4 : 143.8762 
## Average sale price for subsample_5 : 143.868

Observation 3:

There is not much difference in average sale price between the subsamples (all are between about 143.70 and 144.28). This may be because the subsamples are large random draws from the same company data.
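
To put a number on "not much difference", the averages printed above can be collected in one table and their spread compared with their size. The values below are copied from the output in this report; the object names reported_means and spread are only illustrative.

```r
# Averages copied from the printed output above
reported_means <- data.frame(
  subsample  = paste0("subsample_", 1:5),
  shoe_size  = c(9.213618, 9.198918, 9.198784, 9.21903, 9.198383),
  unit_price = c(164.4537, 164.2693, 163.9913, 164.2425, 164.3),
  sale_price = c(144.28, 144.0443, 143.7014, 143.8762, 143.868)
)

# For each measure: width of the range, and that width as a
# percentage of the average across the five subsamples
spread <- sapply(reported_means[-1], function(x) {
  c(range_width = diff(range(x)),
    relative_range_pct = 100 * diff(range(x)) / mean(x))
})

print(round(spread, 3))
```

For all three measures the range across subsamples is well under 1% of the average, which supports the observations above.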

Gender counts in each subsample

# Pre-allocate one row per subsample
gender_counts <- data.frame(
  subsample = character(num_subsamples),
  male_count = integer(num_subsamples),
  female_count = integer(num_subsamples)
)

# Calculate the counts for each subsample
for (i in 1:num_subsamples) {
  subsample_name <- paste0("subsample_", i)
  subsample_df <- get(subsample_name)

  # Assign column by column so the counts are not coerced to character
  gender_counts$subsample[i] <- subsample_name
  gender_counts$male_count[i] <- sum(subsample_df$Gender == "Male", na.rm = TRUE)
  gender_counts$female_count[i] <- sum(subsample_df$Gender == "Female", na.rm = TRUE)
}

print(gender_counts)

##     subsample male_count female_count
## 1 subsample_1       4410         3073
## 2 subsample_2       4390         3093
## 3 subsample_3       4521         2962
## 4 subsample_4       4452         3031
## 5 subsample_5       4475         3008

Observation 4:

The largest male count is 4,521 (subsample 3) and the largest female count is 3,093 (subsample 2). Males outnumber females in every subsample.
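
Because every subsample has the same number of rows, the raw counts are easier to compare as proportions. The sketch below reuses the counts printed above; the column name male_share is hypothetical.

```r
# Counts copied from the printed gender_counts table
gender_counts <- data.frame(
  subsample    = paste0("subsample_", 1:5),
  male_count   = c(4410L, 4390L, 4521L, 4452L, 4475L),
  female_count = c(3073L, 3093L, 2962L, 3031L, 3008L)
)

# Share of male records in each subsample
totals <- gender_counts$male_count + gender_counts$female_count
gender_counts$male_share <- gender_counts$male_count / totals

print(gender_counts)
```

The male share comes out at roughly 59% to 60% in every subsample, so the gender split is about as stable as the averages.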

Calculating the number of shoes purchased in each country for every sub-sample

I created a table of purchase counts by country for each subsample.

print_country_shoe_counts <- function(subsample_df, subsample_name) {
  country_shoe_counts <- subsample_df %>%
    group_by(Country) %>%
    summarize(ShoeCount = n())
  
  cat("Subsample:", subsample_name, "\n")
  print(country_shoe_counts)
}

for (i in 1:num_subsamples) {
  subsample_name <- paste0("subsample_", i)
  subsample_df <- get(subsample_name)
  
  print_country_shoe_counts(subsample_df, subsample_name)
}

## Subsample: subsample_1 
## # A tibble: 4 × 2
##   Country        ShoeCount
##   <chr>              <int>
## 1 Canada              1484
## 2 Germany             2192
## 3 United Kingdom       903
## 4 United States       2904
## Subsample: subsample_2 
## # A tibble: 4 × 2
##   Country        ShoeCount
##   <chr>              <int>
## 1 Canada              1454
## 2 Germany             2191
## 3 United Kingdom       898
## 4 United States       2940
## Subsample: subsample_3 
## # A tibble: 4 × 2
##   Country        ShoeCount
##   <chr>              <int>
## 1 Canada              1438
## 2 Germany             2211
## 3 United Kingdom       888
## 4 United States       2946
## Subsample: subsample_4 
## # A tibble: 4 × 2
##   Country        ShoeCount
##   <chr>              <int>
## 1 Canada              1457
## 2 Germany             2219
## 3 United Kingdom       858
## 4 United States       2949
## Subsample: subsample_5 
## # A tibble: 4 × 2
##   Country        ShoeCount
##   <chr>              <int>
## 1 Canada              1420
## 2 Germany             2215
## 3 United Kingdom       892
## 4 United States       2956

Observation 5:

Overall, the number of shoes purchased ranks by country as follows: United States > Germany > Canada > United Kingdom. This ordering is observed in all the subsamples.
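
The five separate tables above can be hard to compare by eye. As a sketch, assuming the subsamples are held in a named list (a hypothetical layout, filled here with made-up rows), dplyr's bind_rows() and count() give one table, and tidyr's pivot_wider() places the subsamples side by side.

```r
library(dplyr)
library(tidyr)

# Made-up miniature subsamples in a named list
subsamples <- list(
  subsample_1 = data.frame(Country = c("Canada", "Germany", "Germany")),
  subsample_2 = data.frame(Country = c("Canada", "Canada", "United States"))
)

# Stack the subsamples, keeping the list name as an id column,
# then count rows per subsample and country
country_counts <- bind_rows(subsamples, .id = "subsample") %>%
  count(subsample, Country, name = "ShoeCount")

# One row per country, one column per subsample
country_wide <- country_counts %>%
  pivot_wider(names_from = subsample, values_from = ShoeCount,
              values_fill = 0L)

print(country_wide)
```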

Conclusion:

Overall, the summary statistics of the subsamples, both the numeric averages and the categorical counts, are close to each other. No peculiar differences were observed among them. I will investigate the samples further and draw my final conclusion in the future.