---
title: "R_Stat_week_4"
author: "Surya"
date: "2023-09-17"
output: html_document
---

### Loading Data
I am importing the libraries needed to run these notes.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.3 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
I will use my data set on shoe sales, which you have approved.
library(readr)
#loading the data
my_data_2 <- read_delim("C:/Users/Surya CST/Documents/CSV_files/Bundy_Shoe_Shop.csv",delim=",")
## New names:
## • `` -> `...2`
## • `` -> `...3`
## • `` -> `...4`
## • `` -> `...5`
## • `` -> `...6`
## • `` -> `...7`
## • `` -> `...8`
## • `` -> `...9`
## • `` -> `...10`
## • `` -> `...11`
## • `` -> `...12`
## • `` -> `...13`
## • `` -> `...14`
## Warning: One or more parsing issues, call `problems()` on your data frame for details,
## e.g.:
## dat <- vroom(...)
## problems(dat)
## Rows: 14970 Columns: 14
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (9): Inferential statistics. Confidence intervals, ...2, ...3, ...5, ......
## dbl (5): ...4, ...7, ...9, ...12, ...13
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Preview the first rows of the raw data frame
head(my_data_2)
## # A tibble: 6 × 14
## Inferential statistics…¹ ...2 ...3 ...4 ...5 ...6 ...7 ...8 ...9 ...10
## <chr> <chr> <chr> <dbl> <chr> <chr> <dbl> <chr> <dbl> <chr>
## 1 Al Bundy's shoe shop <NA> <NA> NA <NA> <NA> NA <NA> NA <NA>
## 2 <NA> <NA> <NA> NA <NA> <NA> NA <NA> NA <NA>
## 3 InvoiceNo Date Coun… NA Shop Gend… NA Size… NA "Uni…
## 4 52389 1/1/… Unit… 2152 UK2 Male 11 44 10.5 "$15…
## 5 52390 1/1/… Unit… 2230 US15 Male 11.5 44-45 11 "$19…
## 6 52391 1/1/… Cana… 2160 CAN7 Male 9.5 42-43 9 "$14…
## # ℹ abbreviated name: ¹`Inferential statistics. Confidence intervals`
## # ℹ 4 more variables: ...11 <chr>, ...12 <dbl>, ...13 <dbl>, ...14 <chr>
I am renaming the columns from the auto-generated names (`...2`, `...3`, and so on) to descriptive names (`InvoiceNo`, `Date`, etc.) to make the data simpler and clearer.
new_names <- c("InvoiceNo", "Date", "Country", "ProductID", "Shop", "Gender",
"Size(US)", "Size (Europe)", "Size (UK)", "UnitPrice", "Discount",
"Year", "Month", "SalePrice")
# Assign the new column names to the data frame
colnames(my_data_2) <- new_names
# Verify that the column names have been changed
colnames(my_data_2)
## [1] "InvoiceNo" "Date" "Country" "ProductID"
## [5] "Shop" "Gender" "Size(US)" "Size (Europe)"
## [9] "Size (UK)" "UnitPrice" "Discount" "Year"
## [13] "Month" "SalePrice"
I am removing the first three rows, which contain only missing values and leftover title text, from my data set.
my_data_2 <- my_data_2[-c(1:3), ]
# Print the modified data frame
print(my_data_2)
## # A tibble: 14,967 × 14
## InvoiceNo Date Country ProductID Shop Gender `Size(US)` `Size (Europe)`
## <chr> <chr> <chr> <dbl> <chr> <chr> <dbl> <chr>
## 1 52389 1/1/2014 United … 2152 UK2 Male 11 44
## 2 52390 1/1/2014 United … 2230 US15 Male 11.5 44-45
## 3 52391 1/1/2014 Canada 2160 CAN7 Male 9.5 42-43
## 4 52392 1/1/2014 United … 2234 US6 Female 9.5 40
## 5 52393 1/1/2014 United … 2222 UK4 Female 9 39-40
## 6 52394 1/1/2014 United … 2173 US15 Male 10.5 43-44
## 7 52395 1/2/2014 Germany 2200 GER2 Female 9 39-40
## 8 52396 1/2/2014 Canada 2238 CAN5 Male 10 43
## 9 52397 1/2/2014 United … 2191 US13 Male 10.5 43-44
## 10 52398 1/2/2014 United … 2237 UK1 Female 9 39-40
## # ℹ 14,957 more rows
## # ℹ 6 more variables: `Size (UK)` <dbl>, UnitPrice <chr>, Discount <chr>,
## # Year <dbl>, Month <dbl>, SalePrice <chr>
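As an alternative to dropping the three leading rows after import, the junk lines could be skipped while reading. The sketch below is a minimal illustration using base R's `read.csv()` on a tiny inline stand-in for the file. The inline text and the `toy` name are assumptions made for the example; the real `Bundy_Shoe_Shop.csv` layout may differ.

```r
# Hypothetical miniature of the raw file: a title line, a shop-name line,
# a row of empty fields, then the real header followed by data rows.
raw_text <- paste(
  "Inferential statistics. Confidence intervals,,",
  "Al Bundy's shoe shop,,",
  ",,",
  "InvoiceNo,Date,Country",
  "52391,1/1/2014,Canada",
  "52396,1/2/2014,Canada",
  sep = "\n"
)

# skip = 3 discards the three lines above the real header,
# so the fourth line supplies the column names.
toy <- read.csv(text = raw_text, skip = 3, header = TRUE,
                stringsAsFactors = FALSE)

print(toy)
```

Reading this way would also avoid the auto-generated `...2`, `...3` names, since the header comes from the file itself.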
# removing $ for Sale Price
my_data_2$SalePrice <- gsub("\\$", "", my_data_2$SalePrice)
my_data_2$SalePrice <- as.numeric(my_data_2$SalePrice)
class(my_data_2$SalePrice)
## [1] "numeric"
# removing $ for Unit Price
my_data_2$UnitPrice <- gsub("\\$", "", my_data_2$UnitPrice)
my_data_2$UnitPrice <- as.numeric(my_data_2$UnitPrice)
class(my_data_2$UnitPrice)
## [1] "numeric"
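The `gsub()` and `as.numeric()` steps above repeat the same pattern for both price columns, so they could be folded into one small helper. This is only a sketch: `parse_price` is a hypothetical name, the sample strings are invented, and stripping thousands separators is an assumption that may not be needed for this data.

```r
# Hypothetical helper: strip dollar signs and thousands separators,
# then convert the remaining text to numeric.
parse_price <- function(x) {
  as.numeric(gsub("[$,]", "", x))
}

# Invented example values, not taken from the data set
example_prices <- c("$159.00", "$1,199.50", NA)
print(parse_price(example_prices))
```

With such a helper, each column could be cleaned in one line, for example `my_data_2$SalePrice <- parse_price(my_data_2$SalePrice)`.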
library(dplyr)
num_subsamples <- 5
subsample_size <- floor(0.5 * nrow(my_data_2))
set.seed(123)
for (i in 1:num_subsamples) {
# Sample rows with replacement
subsample_indices <- sample(1:nrow(my_data_2), size = subsample_size, replace = TRUE)
# Create a subsample data frame with the desired name format
subsample_name <- paste0("subsample_", i)
assign(subsample_name, my_data_2[subsample_indices, ])
}
print(head(subsample_1))
## # A tibble: 6 × 14
## InvoiceNo Date Country ProductID Shop Gender `Size(US)` `Size (Europe)`
## <chr> <chr> <chr> <dbl> <chr> <chr> <dbl> <chr>
## 1 54467 11/18/2014 United… 2230 US12 Male 10.5 43-44
## 2 54511 11/26/2014 Canada 2172 CAN6 Female 8.5 39
## 3 61708 5/29/2016 United… 2147 US12 Female 8.5 39
## 4 60198 3/3/2016 Germany 2180 GER2 Female 7.5 38
## 5 63535 8/26/2016 Germany 2155 GER2 Male 11.5 44-45
## 6 54946 1/24/2015 United… 2175 US15 Male 10 43
## # ℹ 6 more variables: `Size (UK)` <dbl>, UnitPrice <dbl>, Discount <chr>,
## # Year <dbl>, Month <dbl>, SalePrice <dbl>
print(head(subsample_2))
## # A tibble: 6 × 14
## InvoiceNo Date Country ProductID Shop Gender `Size(US)` `Size (Europe)`
## <chr> <chr> <chr> <dbl> <chr> <chr> <dbl> <chr>
## 1 63801 9/9/2016 United … 2209 US6 Male 11.5 44-45
## 2 54538 12/3/2014 United … 2181 US8 Male 6.5 39
## 3 62400 7/2/2016 United … 2210 US12 Female 5.5 36
## 4 59279 1/7/2016 Germany 2159 GER1 Female 8 38-39
## 5 57863 9/27/2015 Germany 2168 GER2 Male 8 41
## 6 62646 7/13/2016 United … 2182 US10 Female 9 39-40
## # ℹ 6 more variables: `Size (UK)` <dbl>, UnitPrice <dbl>, Discount <chr>,
## # Year <dbl>, Month <dbl>, SalePrice <dbl>
print(head(subsample_3))
## # A tibble: 6 × 14
## InvoiceNo Date Country ProductID Shop Gender `Size(US)` `Size (Europe)`
## <chr> <chr> <chr> <dbl> <chr> <chr> <dbl> <chr>
## 1 52789 3/7/2014 United … 2230 UK5 Male 10 43
## 2 61012 4/20/2016 United … 2157 US15 Female 7.5 38
## 3 57064 7/29/2015 Canada 2175 CAN3 Male 10.5 43-44
## 4 56448 6/14/2015 United … 2160 UK1 Male 9.5 42-43
## 5 58371 11/2/2015 United … 2227 UK5 Female 8 38-39
## 6 57394 8/23/2015 Germany 2241 GER2 Female 8 38-39
## # ℹ 6 more variables: `Size (UK)` <dbl>, UnitPrice <dbl>, Discount <chr>,
## # Year <dbl>, Month <dbl>, SalePrice <dbl>
print(head(subsample_4))
## # A tibble: 6 × 14
## InvoiceNo Date Country ProductID Shop Gender `Size(US)` `Size (Europe)`
## <chr> <chr> <chr> <dbl> <chr> <chr> <dbl> <chr>
## 1 63509 8/25/2016 United … 2238 US12 Male 9 42
## 2 62869 7/25/2016 United … 2219 US15 Female 8 38-39
## 3 56703 7/2/2015 Germany 2227 GER2 Male 10.5 43-44
## 4 63902 9/14/2016 Canada 2188 CAN2 Male 15 48
## 5 57324 8/17/2015 United … 2155 US15 Female 10.5 41
## 6 56613 6/25/2015 United … 2218 US12 Male 10.5 43-44
## # ℹ 6 more variables: `Size (UK)` <dbl>, UnitPrice <dbl>, Discount <chr>,
## # Year <dbl>, Month <dbl>, SalePrice <dbl>
print(head(subsample_5))
## # A tibble: 6 × 14
## InvoiceNo Date Country ProductID Shop Gender `Size(US)` `Size (Europe)`
## <chr> <chr> <chr> <dbl> <chr> <chr> <dbl> <chr>
## 1 64017 9/19/2016 United… 2205 UK5 Female 5 35-36
## 2 63301 8/15/2016 United… 2193 UK1 Female 9.5 40
## 3 64872 11/1/2016 Germany 2174 GER1 Female 12 42-43
## 4 65649 12/22/2016 United… 2208 US13 Male 9.5 42-43
## 5 53098 4/24/2014 United… 2170 UK2 Female 9.5 40
## 6 57419 8/25/2015 United… 2180 US12 Male 8.5 41-42
## # ℹ 6 more variables: `Size (UK)` <dbl>, UnitPrice <dbl>, Discount <chr>,
## # Year <dbl>, Month <dbl>, SalePrice <dbl>
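The loop above creates five separate objects with `assign()` and later retrieves them with `get()`. A common alternative is to keep the subsamples in one named list. The sketch below shows that pattern on a small invented data frame; `toy_data`, `SizeUS`, and `subsamples` are hypothetical names, not part of the original analysis.

```r
set.seed(123)

# Toy stand-in for my_data_2 (invented values)
toy_data <- data.frame(
  InvoiceNo = 1:20,
  SizeUS = rep(c(8, 9, 10, 11), times = 5)
)

num_subsamples <- 5
subsample_size <- floor(0.5 * nrow(toy_data))

# One list element per subsample, each drawn with replacement
subsamples <- lapply(seq_len(num_subsamples), function(i) {
  idx <- sample(seq_len(nrow(toy_data)), size = subsample_size, replace = TRUE)
  toy_data[idx, ]
})
names(subsamples) <- paste0("subsample_", seq_len(num_subsamples))

# Summaries then need no get(): apply a function over the list
print(sapply(subsamples, function(df) mean(df$SizeUS)))
```

Keeping the subsamples in a list means every later summary can be written once and applied to all of them.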
calculate_average_shoe_size <- function(subsample_df) {
mean(subsample_df$`Size(US)`, na.rm = TRUE)
}
average_shoe_sizes <- numeric(num_subsamples)
# Calculate the average shoe size for each subsample and print the results
for (i in 1:num_subsamples) {
subsample_name <- paste0("subsample_", i)
# Store each result so the pre-allocated vector is actually filled
average_shoe_sizes[i] <- calculate_average_shoe_size(get(subsample_name))
cat("Average shoe size for", subsample_name, ":", average_shoe_sizes[i], "\n")
}
## Average shoe size for subsample_1 : 9.213618
## Average shoe size for subsample_2 : 9.198918
## Average shoe size for subsample_3 : 9.198784
## Average shoe size for subsample_4 : 9.21903
## Average shoe size for subsample_5 : 9.198383
There is little difference in average shoe size across the subsamples (about 9.20 to 9.22). This is probably because each subsample is a large random draw (7,483 rows) from the same company data, so individual anomalies have little effect on the mean.
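One way to make the small spread between subsample means concrete is the standard error of the mean, sd / sqrt(n), which shrinks as the subsample grows. The sketch below uses simulated sizes rather than the shop data; `standard_error` and the simulation parameters are assumptions chosen for illustration.

```r
# Standard error of the mean: sd / sqrt(n), ignoring missing values
standard_error <- function(x) {
  x <- x[!is.na(x)]
  sd(x) / sqrt(length(x))
}

set.seed(123)
# Simulated shoe sizes (invented distribution, not the real data)
simulated_sizes <- rnorm(7483, mean = 9.2, sd = 1.5)

# sqrt(7483) is about 86.5, so the standard error is a small
# fraction of the spread of individual sizes.
print(standard_error(simulated_sizes))
```

Applied to a real subsample, for example `standard_error(subsample_1$"Size(US)")`, this would give a rough scale for how much the subsample means are expected to differ by chance.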
calculate_average_unit_price <- function(subsample_df) {
mean(subsample_df$UnitPrice, na.rm = TRUE)
}
average_unit_prices <- numeric(num_subsamples)
for (i in 1:num_subsamples) {
subsample_name <- paste0("subsample_", i)
# Store each result so the pre-allocated vector is actually filled
average_unit_prices[i] <- calculate_average_unit_price(get(subsample_name))
cat("Average unit price for", subsample_name, ":", average_unit_prices[i], "\n")
}
## Average unit price for subsample_1 : 164.4537
## Average unit price for subsample_2 : 164.2693
## Average unit price for subsample_3 : 163.9913
## Average unit price for subsample_4 : 164.2425
## Average unit price for subsample_5 : 164.3
There is little difference in average unit price across the subsamples (about 163.99 to 164.45). This is probably because each subsample is a large random draw from the same company data.
calculate_average_sale_price <- function(subsample_df) {
mean(subsample_df$SalePrice, na.rm = TRUE)
}
average_sale_prices <- numeric(num_subsamples)
for (i in 1:num_subsamples) {
subsample_name <- paste0("subsample_", i)
# Store each result so the pre-allocated vector is actually filled
average_sale_prices[i] <- calculate_average_sale_price(get(subsample_name))
# Print the average sale price for the current subsample
cat("Average sale price for", subsample_name, ":", average_sale_prices[i], "\n")
}
## Average sale price for subsample_1 : 144.28
## Average sale price for subsample_2 : 144.0443
## Average sale price for subsample_3 : 143.7014
## Average sale price for subsample_4 : 143.8762
## Average sale price for subsample_5 : 143.868
There is little difference in average sale price across the subsamples (about 143.70 to 144.28). This is probably because each subsample is a large random draw from the same company data.
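The three mean calculations above follow the same shape, so they could be collected into a single table. The sketch below shows one way to do that with `colMeans()` over a list of data frames; the two toy subsamples and their values are invented, and `mean_table` is a hypothetical name.

```r
# Toy stand-ins for two subsamples (invented values)
subsamples <- list(
  subsample_1 = data.frame(UnitPrice = c(150, 170, NA),
                           SalePrice = c(135, 153, 120)),
  subsample_2 = data.frame(UnitPrice = c(160, 180, 140),
                           SalePrice = c(144, NA, 126))
)

# Mean of every listed column for each subsample, ignoring missing values
columns <- c("UnitPrice", "SalePrice")
mean_table <- t(sapply(subsamples, function(df) {
  colMeans(df[columns], na.rm = TRUE)
}))

# One row per subsample, one column per variable
print(mean_table)
```

A table like this puts the subsample means side by side, which makes the comparison across subsamples easier to read than separate printed lines.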
gender_counts <- data.frame(subsample = character(num_subsamples), male_count = integer(num_subsamples), female_count = integer(num_subsamples))
# Calculate the counts for each subsample
for (i in 1:num_subsamples) {
subsample_name <- paste0("subsample_", i)
subsample_df <- get(subsample_name)
male_count <- sum(subsample_df$Gender == "Male")
female_count <- sum(subsample_df$Gender == "Female")
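# Assign column by column so the counts stay numeric instead of being coerced to text
gender_counts$subsample[i] <- subsample_name
gender_counts$male_count[i] <- male_count
gender_counts$female_count[i] <- female_count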
}
print(gender_counts)
## subsample male_count female_count
## 1 subsample_1 4410 3073
## 2 subsample_2 4390 3093
## 3 subsample_3 4521 2962
## 4 subsample_4 4452 3031
## 5 subsample_5 4475 3008
The highest male count is 4,521, in subsample 3, and the highest female count is 3,093, in subsample 2.
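Because every subsample has the same number of rows, the counts can also be read as shares. The sketch below recomputes the male share from the counts printed above; the `male_share` column is an addition for illustration and is not part of the original analysis.

```r
# Counts copied from the gender_counts table printed above
gender_counts <- data.frame(
  subsample = paste0("subsample_", 1:5),
  male_count = c(4410, 4390, 4521, 4452, 4475),
  female_count = c(3073, 3093, 2962, 3031, 3008)
)

# Share of male purchases in each subsample
total <- gender_counts$male_count + gender_counts$female_count
gender_counts$male_share <- round(gender_counts$male_count / total, 3)

print(gender_counts)
```

The male share stays between roughly 59% and 60% in all five subsamples, which matches the small differences seen in the means.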
I created a data frame of shoe counts by country for each subsample.
print_country_shoe_counts <- function(subsample_df, subsample_name) {
country_shoe_counts <- subsample_df %>%
group_by(Country) %>%
summarize(ShoeCount = n())
cat("Subsample:", subsample_name, "\n")
print(country_shoe_counts)
}
for (i in 1:num_subsamples) {
subsample_name <- paste0("subsample_", i)
subsample_df <- get(subsample_name)
print_country_shoe_counts(subsample_df, subsample_name)
}
## Subsample: subsample_1
## # A tibble: 4 × 2
## Country ShoeCount
## <chr> <int>
## 1 Canada 1484
## 2 Germany 2192
## 3 United Kingdom 903
## 4 United States 2904
## Subsample: subsample_2
## # A tibble: 4 × 2
## Country ShoeCount
## <chr> <int>
## 1 Canada 1454
## 2 Germany 2191
## 3 United Kingdom 898
## 4 United States 2940
## Subsample: subsample_3
## # A tibble: 4 × 2
## Country ShoeCount
## <chr> <int>
## 1 Canada 1438
## 2 Germany 2211
## 3 United Kingdom 888
## 4 United States 2946
## Subsample: subsample_4
## # A tibble: 4 × 2
## Country ShoeCount
## <chr> <int>
## 1 Canada 1457
## 2 Germany 2219
## 3 United Kingdom 858
## 4 United States 2949
## Subsample: subsample_5
## # A tibble: 4 × 2
## Country ShoeCount
## <chr> <int>
## 1 Canada 1420
## 2 Germany 2215
## 3 United Kingdom 892
## 4 United States 2956
Overall, the number of shoes purchased by country ranks as follows: United States > Germany > Canada > United Kingdom. This ordering holds in all of the subsamples.
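The ordering claim can be checked mechanically rather than by eye. The sketch below copies the counts from the tables printed above into a matrix and compares the country ranking of every subsample; `country_counts`, `rankings`, and `same_order` are hypothetical names introduced for this example.

```r
# Counts copied from the per-country tables printed above
country_counts <- rbind(
  subsample_1 = c(1484, 2192, 903, 2904),
  subsample_2 = c(1454, 2191, 898, 2940),
  subsample_3 = c(1438, 2211, 888, 2946),
  subsample_4 = c(1457, 2219, 858, 2949),
  subsample_5 = c(1420, 2215, 892, 2956)
)
colnames(country_counts) <- c("Canada", "Germany",
                              "United Kingdom", "United States")

# Countries ordered from most to fewest purchases, one column per subsample
rankings <- apply(country_counts, 1, function(counts) {
  names(sort(counts, decreasing = TRUE))
})

# TRUE when every subsample shows the same ordering as subsample_1
same_order <- all(rankings == rankings[, 1])

print(rankings)
print(same_order)
```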
Overall, the categorical variables are distributed similarly across the subsamples, and no unusual differences stand out. I will investigate the samples further and draw my final conclusions in future work.