The “Superstore Sales” dataset is a comprehensive and versatile collection of data that provides valuable insights into sales, customer behavior, and product performance. This dataset offers a rich resource for in-depth analysis.
Containing information from diverse regions and segments, the dataset enables exploration of trends, patterns, and correlations in sales and customer preferences. The dataset encompasses sales transactions, enabling researchers and analysts to understand buying patterns, identify high-demand products, and assess the effectiveness of different shipping modes.
Whether used for educational purposes, business strategy formulation, or data analysis practice, the “Superstore Sales” dataset offers a comprehensive platform to delve into the dynamics of sales operations, customer interactions, and the factors that drive business success.
Data source: https://www.kaggle.com/datasets/ishanshrivastava28/superstore-sales
The dataset provides an opportunity to examine the impact of various factors such as discounts, geographical locations, and product categories on profitability. By analyzing this dataset, businesses and data enthusiasts can uncover actionable insights for optimizing pricing strategies, supply chain management, and customer engagement.
Supervised learning tasks like regression can be done with the dataset; Predicting sales of a superstore by taking all the variables can be done. Furthermore, even clustering can be done identify segment a sale would belong to.
df <-read.csv('/Users/fahadmehfooz/Desktop/IUPUI/First Semester/Intro to Statistics/Intro to Stats Dataset/Dataset 1/Superstore.csv')
summary(df)
## Row.ID Order.ID Order.Date Ship.Date
## Min. : 1 Length:9994 Length:9994 Length:9994
## 1st Qu.:2499 Class :character Class :character Class :character
## Median :4998 Mode :character Mode :character Mode :character
## Mean :4998
## 3rd Qu.:7496
## Max. :9994
## Ship.Mode Customer.ID Customer.Name Segment
## Length:9994 Length:9994 Length:9994 Length:9994
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## Country City State Postal.Code
## Length:9994 Length:9994 Length:9994 Min. : 1040
## Class :character Class :character Class :character 1st Qu.:23223
## Mode :character Mode :character Mode :character Median :56430
## Mean :55190
## 3rd Qu.:90008
## Max. :99301
## Region Product.ID Category Sub.Category
## Length:9994 Length:9994 Length:9994 Length:9994
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## Product.Name Sales Quantity Discount
## Length:9994 Min. : 0.444 Min. : 1.00 Min. :0.0000
## Class :character 1st Qu.: 17.280 1st Qu.: 2.00 1st Qu.:0.0000
## Mode :character Median : 54.490 Median : 3.00 Median :0.2000
## Mean : 229.858 Mean : 3.79 Mean :0.1562
## 3rd Qu.: 209.940 3rd Qu.: 5.00 3rd Qu.:0.2000
## Max. :22638.480 Max. :14.00 Max. :0.8000
## Profit
## Min. :-6599.978
## 1st Qu.: 1.729
## Median : 8.666
## Mean : 28.657
## 3rd Qu.: 29.364
## Max. : 8399.976
numeric_data <- unlist(lapply(df, is.numeric))
data_num <- df[ , numeric_data] # Subset numeric columns of data
summary(data_num)
## Row.ID Postal.Code Sales Quantity
## Min. : 1 Min. : 1040 Min. : 0.444 Min. : 1.00
## 1st Qu.:2499 1st Qu.:23223 1st Qu.: 17.280 1st Qu.: 2.00
## Median :4998 Median :56430 Median : 54.490 Median : 3.00
## Mean :4998 Mean :55190 Mean : 229.858 Mean : 3.79
## 3rd Qu.:7496 3rd Qu.:90008 3rd Qu.: 209.940 3rd Qu.: 5.00
## Max. :9994 Max. :99301 Max. :22638.480 Max. :14.00
## Discount Profit
## Min. :0.0000 Min. :-6599.978
## 1st Qu.:0.0000 1st Qu.: 1.729
## Median :0.2000 Median : 8.666
## Mean :0.1562 Mean : 28.657
## 3rd Qu.:0.2000 3rd Qu.: 29.364
## Max. :0.8000 Max. : 8399.976
library(plyr)
count(df, 'Category')
## Category freq
## 1 Furniture 2121
## 2 Office Supplies 6026
## 3 Technology 1847
count(df, 'Region')
## Region freq
## 1 Central 2323
## 2 East 2848
## 3 South 1620
## 4 West 3203
count(df, 'Country')
## Country freq
## 1 United States 9994
count(df, 'Segment')
## Segment freq
## 1 Consumer 5191
## 2 Corporate 3020
## 3 Home Office 1783
count(df, 'State')
## State freq
## 1 Alabama 61
## 2 Arizona 224
## 3 Arkansas 60
## 4 California 2001
## 5 Colorado 182
## 6 Connecticut 82
## 7 Delaware 96
## 8 District of Columbia 10
## 9 Florida 383
## 10 Georgia 184
## 11 Idaho 21
## 12 Illinois 492
## 13 Indiana 149
## 14 Iowa 30
## 15 Kansas 24
## 16 Kentucky 139
## 17 Louisiana 42
## 18 Maine 8
## 19 Maryland 105
## 20 Massachusetts 135
## 21 Michigan 255
## 22 Minnesota 89
## 23 Mississippi 53
## 24 Missouri 66
## 25 Montana 15
## 26 Nebraska 38
## 27 Nevada 39
## 28 New Hampshire 27
## 29 New Jersey 130
## 30 New Mexico 37
## 31 New York 1128
## 32 North Carolina 249
## 33 North Dakota 7
## 34 Ohio 469
## 35 Oklahoma 66
## 36 Oregon 124
## 37 Pennsylvania 587
## 38 Rhode Island 56
## 39 South Carolina 42
## 40 South Dakota 12
## 41 Tennessee 183
## 42 Texas 985
## 43 Utah 53
## 44 Vermont 11
## 45 Virginia 224
## 46 Washington 506
## 47 West Virginia 4
## 48 Wisconsin 110
## 49 Wyoming 1
Sales_city_wise <- aggregate(df$Sales, list(df$State), FUN=mean)
Sales_city_wise <- Sales_city_wise[order(Sales_city_wise$x, decreasing = TRUE), ]
print(Sales_city_wise)
## Group.1 x
## 49 Wyoming 1603.1360
## 44 Vermont 811.7609
## 27 Nevada 428.9513
## 38 Rhode Island 404.0706
## 25 Montana 372.6235
## 13 Indiana 359.4319
## 24 Missouri 336.4417
## 22 Minnesota 335.5410
## 1 Alabama 319.8466
## 45 Virginia 315.3425
## 47 West Virginia 302.4560
## 21 Michigan 299.0965
## 35 Oklahoma 298.2332
## 48 Wisconsin 291.9510
## 8 District of Columbia 286.5020
## 7 Delaware 285.9486
## 31 New York 275.5995
## 29 New Jersey 275.1101
## 46 Washington 273.9946
## 28 New Hampshire 270.0935
## 10 Georgia 266.8252
## 16 Kentucky 263.2500
## 9 Florida 233.6128
## 4 California 228.7295
## 19 Maryland 225.7669
## 32 North Carolina 223.3059
## 17 Louisiana 219.4531
## 20 Massachusetts 212.1069
## 43 Utah 211.6992
## 11 Idaho 208.6898
## 23 Mississippi 203.2328
## 39 South Carolina 201.9455
## 37 Pennsylvania 198.4871
## 26 Nebraska 196.4455
## 3 Arkansas 194.6355
## 5 Colorado 176.4182
## 42 Texas 172.7797
## 41 Tennessee 167.5512
## 34 Ohio 166.8617
## 6 Connecticut 163.2239
## 12 Illinois 162.9392
## 18 Maine 158.8162
## 2 Arizona 157.5089
## 14 Iowa 152.6587
## 36 Oregon 140.5738
## 33 North Dakota 131.4157
## 30 New Mexico 129.2844
## 15 Kansas 121.4296
## 40 South Dakota 109.6300
aggregate(df$Segment, by = list(df$Segment, df$Country), FUN = length)
## Group.1 Group.2 x
## 1 Consumer United States 5191
## 2 Corporate United States 3020
## 3 Home Office United States 1783
# Creating a histogram
hist(df$Sales,
main = "Distribution of Sales",
xlab = "Sales", # X-axis label
ylab = "Frequency", # Y-axis label
col = "red", # Bar color
border = "black", # Border color
breaks = 20,
freq = FALSE) # Number of bins or breaks
# Displaying density also
lines(density(df$Sales), col = "black", lwd = 2)
> This is a right skewed distribution.
# Creating a histogram
hist(df$Sales,
main = "Distribution of Profit",
xlab = "Profit", # X-axis label
ylab = "Frequency", # Y-axis label
col = "green", # Bar color
border = "black", # Border color
breaks = 20,
freq = FALSE) # Number of bins or breaks
# Displaying density also
lines(density(df$Profit), col = "black", lwd = 2)
> Right skewed distrbution, most of the data points are concentrated
on the left side, closer to the median.
# Creating a histogram
hist(df$Sales,
main = "Distribution of Quantity",
xlab = "Quantity", # X-axis label
ylab = "Frequency", # Y-axis label
col = "green", # Bar color
border = "black", # Border color
breaks = 20,
freq = FALSE) # Number of bins or breaks
# Displaying density also
lines(density(df$Quantity), col = "black", lwd = 2)
pie(table(df$Category),
main = "Categories- Variable Distribution",
xlab = "Categories",
ylab = "Frequency")
barplot(table(df$Region),
main = "Region Variable Distribution",
xlab = "Region",
ylab = "Frequency")
library(ggplot2)
library(reshape2)
df_corr <- cor(data_num)
# Create a correlation heatmap using ggplot2 to find feature correlation between all numeric columns
heatmap_plot <- ggplot(data = melt(df_corr), aes(x = Var1, y = Var2, fill = value)) +
geom_tile() +
geom_text(aes(label = round(value, 2)), vjust = 1) +
labs(title = "Correlation Heatmap for continuous variables", x = "Features", y = "Features", fill = "Correlation")
# Print the heatmap
print(heatmap_plot)
IMP: Sales and Profit have the highest correlation as the heatmap suggests
# Create a scatter plot
plot(df$Sales, df$Profit,
main = "Scatter Plot: Sales vs. Profit",
xlab = "Sales",
ylab = "Profit",
col = "green"
)
The linear relationship is not super strong between the two features.
df$`Profit` <- as.numeric(as.character(df$`Profit`))
plot <- ggplot(df, aes(x=Region, y=Profit, main = "Boxplot : Region vs. Profit",
col = "red")) + geom_boxplot()+
labs(title = "Box Plot of Region Vs Profit",
x = "Region",
y = "Profit")
plot
Some outliers can be seen very clearly according to their respective regions.
plot <- ggplot(df, aes(x = Segment, y = Profit)) +
geom_boxplot() +
labs(title = "Box Plot of Profit Vs Segment",
x = "Segment",
y = "Profit")
plot