# Load tidyverse
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.3 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
Superstore_data <- read.csv("SampleSuperstore_final.csv")
head(Superstore_data)
## Ship.Mode Segment Country City State Postal.Code
## 1 Second Class Consumer United States Henderson Kentucky 42420
## 2 Second Class Consumer United States Henderson Kentucky 42420
## 3 Second Class Corporate United States Los Angeles California 90036
## 4 Standard Class Consumer United States Fort Lauderdale Florida 33311
## 5 Standard Class Consumer United States Fort Lauderdale Florida 33311
## 6 Standard Class Consumer United States Los Angeles California 90032
## Region Category Sub.Category Sales Quantity Discount Profit
## 1 South Furniture Bookcases 261.9600 2 0.00 41.9136
## 2 South Furniture Chairs 731.9400 3 0.00 219.5820
## 3 West Office Supplies Labels 14.6200 2 0.00 6.8714
## 4 South Furniture Tables 957.5775 5 0.45 -383.0310
## 5 South Office Supplies Storage 22.3680 2 0.20 2.5164
## 6 West Furniture Furnishings 48.8600 7 0.00 14.1694
QUESTION
* Use the group_by function to group your data into (at least) 3 different sets of groups, each summarizing different variables. For example, this could be as simple as three data frames which group your data based on three different categorical columns, but summarize the same continuous column. Or, it could be as complex as three different combinations of categorical columns, each illustrating summarizations of different continuous (or categorical) columns.
* Within each group_by data frame, calculate the expected probability for each group. Maybe assign the lowest probability group an "anomaly" tag, and then translate that back into your original data frame.
* Draw some conclusions about the numbers you've calculated. Try to draw a testable hypothesis for why some groups are rarer than others (How might you test this hypothesis?)
* Think of different ways to visualize these groups.
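The sections below repeat the same pattern for each grouping variable. As a hedged, minimal sketch of that pattern (the helper name tag_rare_groups and the column names n_group, expected_probability and tag are placeholders, not part of the original assignment):
# Hypothetical helper: count each level of a categorical column, turn the
# counts into expected probabilities, and tag the rarest level as an anomaly.
tag_rare_groups <- function(df, grp) {
  df |>
    count({{ grp }}, name = "n_group") |>
    mutate(
      expected_probability = n_group / sum(n_group),
      tag = if_else(expected_probability == min(expected_probability),
                    "anomaly", "normal")
    )
}
# Example usage (placeholder): tag_rare_groups(Superstore_data, Segment)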
total_records <- nrow(Superstore_data)
#total_records
# 1 group_by Segment and expected probability for each Segment
Segment_group <- Superstore_data |>
  group_by(Segment) |>
  summarise(total_count_segment = n(), .groups = "drop") |>
  mutate(Expected_probability_Segment = total_count_segment / total_records)
Segment_group
## # A tibble: 3 × 3
## Segment total_count_segment Expected_probability_Segment
## <chr> <int> <dbl>
## 1 Consumer 5191 0.519
## 2 Corporate 3020 0.302
## 3 Home Office 1783 0.178
merged_data <- merge(Superstore_data, Segment_group, by = "Segment", all.x = TRUE)

# Define a function to compute the 'Segment_Special_tag' value for one row
Segment_Special_tag <- function(total_count_segment) {
  if (total_count_segment >= 3020) {
    return("normal")
  } else {
    return("low")
  }
}

# Apply the function to create the 'Segment_Special_tag' column
merged_data$Segment_Special_tag <- sapply(merged_data$total_count_segment, Segment_Special_tag)

# Print a random sample of 10 rows from the resulting table
print(merged_data |> select("Segment", "Expected_probability_Segment", "Segment_Special_tag") |> sample_n(10))
## Segment Expected_probability_Segment Segment_Special_tag
## 1 Corporate 0.3021813 normal
## 2 Corporate 0.3021813 normal
## 3 Home Office 0.1784070 low
## 4 Consumer 0.5194116 normal
## 5 Consumer 0.5194116 normal
## 6 Consumer 0.5194116 normal
## 7 Corporate 0.3021813 normal
## 8 Consumer 0.5194116 normal
## 9 Home Office 0.1784070 low
## 10 Consumer 0.5194116 normal
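As an aside, the same tag column could be built without a separate helper function; a minimal vectorized sketch using the same 3020 cut-off (equivalent in effect to the sapply() step above):
# Vectorized alternative to the sapply() call above (same 3020 cut-off).
merged_data <- merged_data |>
  mutate(Segment_Special_tag = if_else(total_count_segment >= 3020, "normal", "low"))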
Observations -
0.519 is the probability that a randomly chosen record belongs to the Consumer segment, while the probabilities for Corporate and Home Office records are 0.302 and 0.178 respectively.
The Consumer segment is clearly the dominant one: most records in the dataset come from Consumer purchases.
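One way to turn this into a testable hypothesis: under the null that all three segments are equally likely, the counts should be roughly equal, which a chi-squared goodness-of-fit test can check (an added sketch; other null proportions could be supplied via the p argument of chisq.test()).
# Goodness-of-fit test: are Consumer, Corporate and Home Office equally likely?
chisq.test(table(Superstore_data$Segment))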
#1.
sample1 <- sample_n(merged_data,1)
sample1 |> select("Segment","Expected_probability_Segment","Segment_Special_tag")
## Segment Expected_probability_Segment Segment_Special_tag
## 1 Corporate 0.3021813 normal
The row above was pulled at random; it carries its segment's expected probability and tag.
#2.
sample2 <- sample_n(merged_data,1)
sample2 |> select("Segment","Expected_probability_Segment","Segment_Special_tag")
## Segment Expected_probability_Segment Segment_Special_tag
## 1 Consumer 0.5194116 normal
Again, a randomly pulled row shown with its probability and tag.
#3.
sample3 <- sample_n(merged_data,1)
sample3 |> select("Segment","Expected_probability_Segment","Segment_Special_tag")
## Segment Expected_probability_Segment Segment_Special_tag
## 1 Consumer 0.5194116 normal
#4.
sample4 <- sample_n(merged_data,1)
sample4 |> select("Segment","Expected_probability_Segment","Segment_Special_tag")
## Segment Expected_probability_Segment Segment_Special_tag
## 1 Consumer 0.5194116 normal
Across these random pulls, Home Office rarely, if ever, appears; it is the rarest segment. Consumer rows, which have the highest probability, are drawn most often.
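Instead of eyeballing a handful of single-row pulls, the draw could be repeated many times and the empirical frequencies compared with the expected probabilities; a small added sketch (the choice of 1,000 draws and the seed are arbitrary):
# Draw 1,000 rows at random (with replacement) and compare how often each
# segment shows up with its expected probability.
set.seed(42)
merged_data |>
  slice_sample(n = 1000, replace = TRUE) |>
  count(Segment, name = "n_drawn") |>
  mutate(empirical_probability = n_drawn / sum(n_drawn))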
Segment_group |>
  ggplot(mapping = aes(x = Segment, y = Expected_probability_Segment,
                       color = Expected_probability_Segment,
                       fill = Expected_probability_Segment)) +
  geom_bar(stat = "identity") +
  labs(
    title = "Expected Probability of Segments",
    x = "Segment",
    y = "Expected Probability"
  ) +
  theme_minimal()

last_plot() +
  coord_polar()
Superstore_data |>
  ggplot(mapping = aes(x = Segment, color = Segment, fill = Segment)) +
  geom_bar() +
  labs(
    title = "Total Count by Segment",
    x = "Segment",
    y = "Total Count"
  ) +
  theme_minimal()
The plots confirm that most records belong to the Consumer segment, which therefore accounts for the most purchases, while the Home Office segment accounts for the fewest. One possible reason Home Office comes last is simply that fewer home-office purchases are made across the country.
# 2 group_by Region
Region_group <- Superstore_data |>
  group_by(Region) |>
  summarise(total_count_region = n(), .groups = "drop") |>
  mutate(Expected_probability_Region = total_count_region / total_records)
Region_group
## # A tibble: 4 × 3
## Region total_count_region Expected_probability_Region
## <chr> <int> <dbl>
## 1 Central 2323 0.232
## 2 East 2848 0.285
## 3 South 1620 0.162
## 4 West 3203 0.320
merged_data <- merge(Superstore_data, Region_group, by = "Region", all.x = TRUE)

# Define a function to compute the 'Region_Special_tag' value for one row
Region_Special_tag <- function(total_count_region) {
  if (total_count_region >= 2323) {
    return("normal")
  } else {
    return("low")
  }
}

# Apply the function to create the 'Region_Special_tag' column
merged_data$Region_Special_tag <- sapply(merged_data$total_count_region, Region_Special_tag)

# Print a random sample of 10 rows from the resulting table
print(merged_data |> select("Region", "Expected_probability_Region", "Region_Special_tag") |> sample_n(10))
## Region Expected_probability_Region Region_Special_tag
## 1 East 0.2849710 normal
## 2 West 0.3204923 normal
## 3 West 0.3204923 normal
## 4 East 0.2849710 normal
## 5 Central 0.2324395 normal
## 6 Central 0.2324395 normal
## 7 Central 0.2324395 normal
## 8 Central 0.2324395 normal
## 9 East 0.2849710 normal
## 10 East 0.2849710 normal
Observations -
0.320 is the probability that a record comes from the West region of the US, followed by East and Central with probabilities of about 0.285 and 0.232 respectively, while the South has comparatively few records (0.162).
In other words, the West and East regions account for more purchases than the Central and South regions.
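One hedged way to probe why the South is rarer would be to check whether it simply covers fewer states and cities than the other regions (an added sketch, not part of the original analysis):
# Does the South cover fewer states/cities, which could explain fewer orders?
Superstore_data |>
  group_by(Region) |>
  summarise(
    n_states = n_distinct(State),
    n_cities = n_distinct(City),
    n_orders = n(),
    .groups = "drop"
  ) |>
  arrange(n_orders)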
#1.
sample1 <- sample_n(merged_data,1)
sample1 |> select("Region","Expected_probability_Region","Region_Special_tag")
## Region Expected_probability_Region Region_Special_tag
## 1 East 0.284971 normal
#2.
sample2 <- sample_n(merged_data,1)
sample2 |> select("Region","Expected_probability_Region","Region_Special_tag")
## Region Expected_probability_Region Region_Special_tag
## 1 Central 0.2324395 normal
#3.
sample3 <- sample_n(merged_data,1)
sample3 |> select("Region","Expected_probability_Region","Region_Special_tag")
## Region Expected_probability_Region Region_Special_tag
## 1 Central 0.2324395 normal
From the Region_group table above, the South has the lowest probability: a randomly drawn record comes from the South only about 0.162 of the time.
Visualize Region Group -
Region_group |>
  ggplot(mapping = aes(x = Region, y = Expected_probability_Region,
                       color = Expected_probability_Region,
                       fill = Expected_probability_Region)) +
  geom_bar(stat = "identity") +
  labs(
    title = "Expected Probability of Region",
    x = "Region",
    y = "Expected Probability"
  ) +
  theme_minimal()

last_plot() +
  coord_polar()
ggplot(Region_group, aes(x = Region, y = total_count_region, fill = Region)) +
  geom_bar(stat = "identity") +
  labs(
    title = "Total Count by Region",
    x = "Region",
    y = "Total Count"
  ) +
  theme_minimal()
As the plots show, most purchases come from the West and East regions, followed by Central; the South has the fewest records and customers.
### 3. Category
# 3 group_by Category
Category_group <- Superstore_data |>
  group_by(Category) |>
  summarise(total_count_category = n(), .groups = "drop") |>
  mutate(Expected_probability_Category = total_count_category / total_records)
Category_group
## # A tibble: 3 × 3
## Category total_count_category Expected_probability_Category
## <chr> <int> <dbl>
## 1 Furniture 2121 0.212
## 2 Office Supplies 6026 0.603
## 3 Technology 1847 0.185
merged_data <- merge(Superstore_data, Category_group, by = "Category", all.x = TRUE)

# Define a function to compute the 'Category_Special_tag' value for one row
Category_Special_tag <- function(total_count_category) {
  if (total_count_category >= 1848) {
    return("normal")
  } else {
    return("low")
  }
}

# Apply the function to create the 'Category_Special_tag' column
merged_data$Category_Special_tag <- sapply(merged_data$total_count_category, Category_Special_tag)

# Print a random sample of 10 rows from the resulting table
print(merged_data |> select("Category", "Expected_probability_Category", "Category_Special_tag") |> sample_n(10))
## Category Expected_probability_Category Category_Special_tag
## 1 Furniture 0.2122273 normal
## 2 Technology 0.1848109 low
## 3 Furniture 0.2122273 normal
## 4 Furniture 0.2122273 normal
## 5 Furniture 0.2122273 normal
## 6 Office Supplies 0.6029618 normal
## 7 Furniture 0.2122273 normal
## 8 Office Supplies 0.6029618 normal
## 9 Technology 0.1848109 low
## 10 Office Supplies 0.6029618 normal
Observations -
0.603 is the probability that a record is an Office Supplies purchase, followed by Furniture at 0.212 and Technology at 0.185.
#1.
sample1 <- sample_n(merged_data,1)
sample1 |> select("Category","Expected_probability_Category","Category_Special_tag")
## Category Expected_probability_Category Category_Special_tag
## 1 Office Supplies 0.6029618 normal
#2.
sample2 <- sample_n(merged_data,1)
sample2 |> select("Category","Expected_probability_Category","Category_Special_tag")
## Category Expected_probability_Category Category_Special_tag
## 1 Office Supplies 0.6029618 normal
#3.
sample3 <- sample_n(merged_data,1)
sample3 |> select("Category","Expected_probability_Category","Category_Special_tag")
## Category Expected_probability_Category Category_Special_tag
## 1 Office Supplies 0.6029618 normal
#4.
sample4 <- sample_n(merged_data,1)
sample4 |> select("Category","Expected_probability_Category","Category_Special_tag")
## Category Expected_probability_Category Category_Special_tag
## 1 Office Supplies 0.6029618 normal
Technology records are comparatively hard to hit in a random draw, since their probability is only about 0.18; Office Supplies dominates the random pulls above.
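Rarity in record counts is not the same as low commercial importance; a short added sketch comparing record counts with average sales and total profit per category:
# Fewer Technology rows does not necessarily mean less revenue or profit.
Superstore_data |>
  group_by(Category) |>
  summarise(
    n_records = n(),
    mean_sales = mean(Sales),
    total_profit = sum(Profit),
    .groups = "drop"
  )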
Visualize Category Group -
Category_group |>
  ggplot(mapping = aes(x = Category, y = Expected_probability_Category,
                       color = Expected_probability_Category,
                       fill = Expected_probability_Category)) +
  geom_bar(stat = "identity") +
  labs(
    title = "Expected Probability of Category",
    x = "Category",
    y = "Expected Probability"
  ) +
  theme_minimal()

last_plot() +
  coord_polar()
ggplot(Category_group, aes(x = Category, y = total_count_category, fill = Category)) +
  geom_bar(stat = "identity") +
  labs(
    title = "Total Count by Category",
    x = "Category",
    y = "Total Count"
  ) +
  theme_minimal()
From the above plots we can state a few things: (i) the Office Supplies category has the most purchases; (ii) Furniture and, especially, Technology have the fewest records.
### 4. Sub-Category
# 4 group_by Sub.Category
Sub_category_group <- Superstore_data |>
  group_by(Sub.Category) |>
  summarise(total_count_sub_category = n(), .groups = "drop") |>
  mutate(Expected_probability_sub_Category = total_count_sub_category / total_records)
Sub_category_group
## # A tibble: 17 × 3
## Sub.Category total_count_sub_category Expected_probability_sub_Category
## <chr> <int> <dbl>
## 1 Accessories 775 0.0775
## 2 Appliances 466 0.0466
## 3 Art 796 0.0796
## 4 Binders 1523 0.152
## 5 Bookcases 228 0.0228
## 6 Chairs 617 0.0617
## 7 Copiers 68 0.00680
## 8 Envelopes 254 0.0254
## 9 Fasteners 217 0.0217
## 10 Furnishings 957 0.0958
## 11 Labels 364 0.0364
## 12 Machines 115 0.0115
## 13 Paper 1370 0.137
## 14 Phones 889 0.0890
## 15 Storage 846 0.0847
## 16 Supplies 190 0.0190
## 17 Tables 319 0.0319
Observations -
* There are 17 sub-categories, each with its own probability of appearing in the data. Binders (0.152) and Paper (0.137) have the highest probabilities, while Copiers (about 0.007) is the rarest.
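Following the earlier sections, the rarest sub-categories could also be tagged; a sketch using an arbitrary 2% probability cut-off (the column name Sub_Category_Special_tag is a placeholder):
# Tag sub-categories whose expected probability is below an arbitrary 2% cut-off.
Sub_category_group |>
  mutate(Sub_Category_Special_tag = if_else(
    Expected_probability_sub_Category < 0.02, "low", "normal")) |>
  arrange(Expected_probability_sub_Category)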
Visualize Sub-category Group -
Sub_category_group |>
  ggplot(mapping = aes(x = Sub.Category, y = Expected_probability_sub_Category,
                       color = Expected_probability_sub_Category,
                       fill = Expected_probability_sub_Category)) +
  geom_bar(stat = "identity") +
  labs(
    title = "Expected Probability of Sub-Category",
    x = "Sub-Category",
    y = "Expected Probability"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90))

last_plot() +
  coord_polar()
QUESTION
## EXTRA assumptions and thoughts
Pick 2-3 categorical variables for which you know all possible combinations.
* Which combinations never show up? Why might that be?
* Which combinations are the most/least common, and why might that be?
* Try (i.e., no need if you can't figure this one out) to find a way to visualize these combinations.
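Before looking at individual variables, here is an added sketch of how never-occurring combinations could be found: build every possible pairing of two categorical columns with expand_grid() and anti-join it against the data. State and Sub.Category are used purely as an example pairing.
# All possible State x Sub.Category pairs, minus the ones actually observed.
# expand_grid() comes from tidyr, already attached via library(tidyverse).
all_combos <- expand_grid(
  State = unique(Superstore_data$State),
  Sub.Category = unique(Superstore_data$Sub.Category)
)
missing_combos <- anti_join(all_combos, Superstore_data,
                            by = c("State", "Sub.Category"))
nrow(missing_combos)   # how many combinations never show up
head(missing_combos)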
1. State
-> The US has 50 states, yet the State column contains only 49 distinct values (and one of those is District of Columbia), so at least one state never appears in the data at all. A likely reason:
* Not all records were documented in this dataset, so the missing state(s) simply have no rows here.
-> There are 49 distinct values in the State column:
n_distinct(Superstore_data$State)
## [1] 49
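To check which state is actually absent, the states present in the data can be compared against R's built-in state.name vector (an added check, not in the original write-up; note that District of Columbia appears in the data but is not a state, so 49 distinct values does not mean 49 of the 50 states are covered).
# Which of the 50 US states never appear in the dataset?
setdiff(state.name, unique(Superstore_data$State))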
state_count <- aggregate(Superstore_data$State, by = list(Superstore_data$State), FUN = length)

top_10_states_count <- state_count %>%
  arrange(desc(x)) %>%
  slice(1:10) %>%
  rename(
    state = Group.1,
    count = x
  )
top_10_states_count
## state count
## 1 California 2001
## 2 New York 1128
## 3 Texas 985
## 4 Pennsylvania 587
## 5 Washington 506
## 6 Illinois 492
## 7 Ohio 469
## 8 Florida 383
## 9 Michigan 255
## 10 North Carolina 249
last_10_states_count <- state_count %>%
  arrange(desc(x)) %>%
  tail(10) %>%
  rename(
    state = Group.1,
    count = x
  )
last_10_states_count
## state count
## 40 Kansas 24
## 41 Idaho 21
## 42 Montana 15
## 43 South Dakota 12
## 44 Vermont 11
## 45 District of Columbia 10
## 46 Maine 8
## 47 North Dakota 7
## 48 West Virginia 4
## 49 Wyoming 1
Possible reasons why Wyoming and West Virginia show only 1 and 4 purchases respectively:
-> fewer people live in those states, so there are simply fewer customers there, or
-> not all records are documented in this dataset, which would explain the unusually low counts.
2. Region
-> There are 4 regions.
n_distinct(Superstore_data$Region)
## [1] 4
-> These cover most of the USA, i.e. East, West, Central and South.
-> The Region column itself has no minimum or maximum, but with respect to Sales we can work out which region has the lowest and highest mean sales.
aggregate(Superstore_data$Sales, list(Superstore_data$Region), FUN = mean) |>
  rename(
    Region = Group.1,
    Mean_Sales = x
  ) |>
  arrange(desc(Mean_Sales))
## Region Mean_Sales
## 1 South 241.8036
## 2 East 238.3361
## 3 West 226.4932
## 4 Central 215.7727
-> Overall, sales are fairly evenly spread across the regions; there is no dramatic maximum or minimum in the mean values.
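Mean sales per record and total sales can rank the regions differently, because the regions have very different record counts; a short added sketch of total sales per region (the bar chart below draws one bar per row via stat = "identity", so summarising first makes the comparison easier to read):
# Total (summed) sales per region, to complement the per-record means above.
Superstore_data |>
  group_by(Region) |>
  summarise(total_sales = sum(Sales), .groups = "drop") |>
  arrange(desc(total_sales))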
-> Region vs Sales: from the mean-sales table above, the South has the highest mean sales per record and Central the lowest; the bar chart below shows the same Region-vs-Sales comparison.
Superstore_data %>%
  ggplot() +
  geom_bar(mapping = aes(x = Region, y = Sales, fill = Region),
           stat = "identity", position = "dodge")
-> Region vs Category vs Sales:
Superstore_data %>%
  ggplot() +
  geom_bar(mapping = aes(x = Region, y = Sales, fill = Category),
           stat = "identity", position = "dodge")
Reading from the plot, Technology bars are the tallest in every region, i.e. technology items are the costliest purchases wherever they are bought. The South shows the tallest Technology bar (roughly $2,000 on this scale), followed by Central, West and East at approximately $1,750, $1,450 and $1,200 respectively. Office Supplies is generally the second-costliest category across regions, although in the East the Office Supplies and Furniture bars are of similar height.
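Bar heights in the dodged chart above are hard to read precisely; tabulating the underlying numbers directly is one way to check these readings (an added sketch, summing Sales for each Region and Category):
# Summed sales for each Region x Category combination, largest first.
Superstore_data |>
  group_by(Region, Category) |>
  summarise(total_sales = sum(Sales), .groups = "drop") |>
  arrange(desc(total_sales))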