Data Dive Week 2 by Fahad Mehfooz

Data Documentation

The “Superstore Sales” dataset is a comprehensive and versatile collection of data that provides valuable insights into sales, customer behavior, and product performance. This dataset offers a rich resource for in-depth analysis.

Containing information from diverse regions and segments, the dataset enables exploration of trends, patterns, and correlations in sales and customer preferences. The dataset encompasses sales transactions, enabling researchers and analysts to understand buying patterns, identify high-demand products, and assess the effectiveness of different shipping modes.

Whether used for educational purposes, business strategy formulation, or data analysis practice, the “Superstore Sales” dataset offers a comprehensive platform to delve into the dynamics of sales operations, customer interactions, and the factors that drive business success.

Data source: https://www.kaggle.com/datasets/ishanshrivastava28/superstore-sales

Project’s goals/purpose ?

The dataset provides an opportunity to examine the impact of various factors such as discounts, geographical locations, and product categories on profitability. By analyzing this dataset, businesses and data enthusiasts can uncover actionable insights for optimizing pricing strategies, supply chain management, and customer engagement.

Supervised learning tasks like regression can be done with the dataset; Predicting sales of a superstore by taking all the variables can be done. Furthermore, even clustering can be done identify segment a sale would belong to.

Loading data

df <-read.csv('/Users/fahadmehfooz/Desktop/IUPUI/First Semester/Intro to Statistics/Intro to Stats Dataset/Dataset 1/Superstore.csv')

summary(df)

##      Row.ID       Order.ID          Order.Date         Ship.Date        
##  Min.   :   1   Length:9994        Length:9994        Length:9994       
##  1st Qu.:2499   Class :character   Class :character   Class :character  
##  Median :4998   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :4998                                                           
##  3rd Qu.:7496                                                           
##  Max.   :9994                                                           
##   Ship.Mode         Customer.ID        Customer.Name        Segment         
##  Length:9994        Length:9994        Length:9994        Length:9994       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##    Country              City              State            Postal.Code   
##  Length:9994        Length:9994        Length:9994        Min.   : 1040  
##  Class :character   Class :character   Class :character   1st Qu.:23223  
##  Mode  :character   Mode  :character   Mode  :character   Median :56430  
##                                                           Mean   :55190  
##                                                           3rd Qu.:90008  
##                                                           Max.   :99301  
##     Region           Product.ID          Category         Sub.Category      
##  Length:9994        Length:9994        Length:9994        Length:9994       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##  Product.Name           Sales              Quantity        Discount     
##  Length:9994        Min.   :    0.444   Min.   : 1.00   Min.   :0.0000  
##  Class :character   1st Qu.:   17.280   1st Qu.: 2.00   1st Qu.:0.0000  
##  Mode  :character   Median :   54.490   Median : 3.00   Median :0.2000  
##                     Mean   :  229.858   Mean   : 3.79   Mean   :0.1562  
##                     3rd Qu.:  209.940   3rd Qu.: 5.00   3rd Qu.:0.2000  
##                     Max.   :22638.480   Max.   :14.00   Max.   :0.8000  
##      Profit         
##  Min.   :-6599.978  
##  1st Qu.:    1.729  
##  Median :    8.666  
##  Mean   :   28.657  
##  3rd Qu.:   29.364  
##  Max.   : 8399.976

Summary of Numeric Columns

numeric_data <- unlist(lapply(df, is.numeric))  
data_num <- df[ , numeric_data]                        # Subset numeric columns of data
   
summary(data_num)

##      Row.ID      Postal.Code        Sales              Quantity    
##  Min.   :   1   Min.   : 1040   Min.   :    0.444   Min.   : 1.00  
##  1st Qu.:2499   1st Qu.:23223   1st Qu.:   17.280   1st Qu.: 2.00  
##  Median :4998   Median :56430   Median :   54.490   Median : 3.00  
##  Mean   :4998   Mean   :55190   Mean   :  229.858   Mean   : 3.79  
##  3rd Qu.:7496   3rd Qu.:90008   3rd Qu.:  209.940   3rd Qu.: 5.00  
##  Max.   :9994   Max.   :99301   Max.   :22638.480   Max.   :14.00  
##     Discount          Profit         
##  Min.   :0.0000   Min.   :-6599.978  
##  1st Qu.:0.0000   1st Qu.:    1.729  
##  Median :0.2000   Median :    8.666  
##  Mean   :0.1562   Mean   :   28.657  
##  3rd Qu.:0.2000   3rd Qu.:   29.364  
##  Max.   :0.8000   Max.   : 8399.976

Summary of Categorical Columns with unique values and their counts:

library(plyr)
count(df, 'Category')

##          Category freq
## 1       Furniture 2121
## 2 Office Supplies 6026
## 3      Technology 1847

count(df, 'Region')

##    Region freq
## 1 Central 2323
## 2    East 2848
## 3   South 1620
## 4    West 3203

count(df, 'Country')

##         Country freq
## 1 United States 9994

count(df, 'Segment')

##       Segment freq
## 1    Consumer 5191
## 2   Corporate 3020
## 3 Home Office 1783

count(df, 'State')

##                   State freq
## 1               Alabama   61
## 2               Arizona  224
## 3              Arkansas   60
## 4            California 2001
## 5              Colorado  182
## 6           Connecticut   82
## 7              Delaware   96
## 8  District of Columbia   10
## 9               Florida  383
## 10              Georgia  184
## 11                Idaho   21
## 12             Illinois  492
## 13              Indiana  149
## 14                 Iowa   30
## 15               Kansas   24
## 16             Kentucky  139
## 17            Louisiana   42
## 18                Maine    8
## 19             Maryland  105
## 20        Massachusetts  135
## 21             Michigan  255
## 22            Minnesota   89
## 23          Mississippi   53
## 24             Missouri   66
## 25              Montana   15
## 26             Nebraska   38
## 27               Nevada   39
## 28        New Hampshire   27
## 29           New Jersey  130
## 30           New Mexico   37
## 31             New York 1128
## 32       North Carolina  249
## 33         North Dakota    7
## 34                 Ohio  469
## 35             Oklahoma   66
## 36               Oregon  124
## 37         Pennsylvania  587
## 38         Rhode Island   56
## 39       South Carolina   42
## 40         South Dakota   12
## 41            Tennessee  183
## 42                Texas  985
## 43                 Utah   53
## 44              Vermont   11
## 45             Virginia  224
## 46           Washington  506
## 47        West Virginia    4
## 48            Wisconsin  110
## 49              Wyoming    1

Aggregations:

Average sales city-wise?

Sales_city_wise <- aggregate(df$Sales, list(df$State), FUN=mean)
Sales_city_wise <- Sales_city_wise[order(Sales_city_wise$x, decreasing = TRUE), ]
print(Sales_city_wise)

##                 Group.1         x
## 49              Wyoming 1603.1360
## 44              Vermont  811.7609
## 27               Nevada  428.9513
## 38         Rhode Island  404.0706
## 25              Montana  372.6235
## 13              Indiana  359.4319
## 24             Missouri  336.4417
## 22            Minnesota  335.5410
## 1               Alabama  319.8466
## 45             Virginia  315.3425
## 47        West Virginia  302.4560
## 21             Michigan  299.0965
## 35             Oklahoma  298.2332
## 48            Wisconsin  291.9510
## 8  District of Columbia  286.5020
## 7              Delaware  285.9486
## 31             New York  275.5995
## 29           New Jersey  275.1101
## 46           Washington  273.9946
## 28        New Hampshire  270.0935
## 10              Georgia  266.8252
## 16             Kentucky  263.2500
## 9               Florida  233.6128
## 4            California  228.7295
## 19             Maryland  225.7669
## 32       North Carolina  223.3059
## 17            Louisiana  219.4531
## 20        Massachusetts  212.1069
## 43                 Utah  211.6992
## 11                Idaho  208.6898
## 23          Mississippi  203.2328
## 39       South Carolina  201.9455
## 37         Pennsylvania  198.4871
## 26             Nebraska  196.4455
## 3              Arkansas  194.6355
## 5              Colorado  176.4182
## 42                Texas  172.7797
## 41            Tennessee  167.5512
## 34                 Ohio  166.8617
## 6           Connecticut  163.2239
## 12             Illinois  162.9392
## 18                Maine  158.8162
## 2               Arizona  157.5089
## 14                 Iowa  152.6587
## 36               Oregon  140.5738
## 33         North Dakota  131.4157
## 30           New Mexico  129.2844
## 15               Kansas  121.4296
## 40         South Dakota  109.6300

Group wise distribution of Country and segment?

aggregate(df$Segment, by = list(df$Segment, df$Country), FUN = length)

##       Group.1       Group.2    x
## 1    Consumer United States 5191
## 2   Corporate United States 3020
## 3 Home Office United States 1783

Visual Summary:

Distribution for Sales

# Creating a histogram
hist(df$Sales, 
     main = "Distribution of Sales",  
     xlab = "Sales",              # X-axis label
     ylab = "Frequency",          # Y-axis label
     col = "red",                # Bar color
     border = "black",            # Border color
     breaks = 20,
     freq = FALSE)                 # Number of bins or breaks

# Displaying density also
lines(density(df$Sales), col = "black", lwd = 2)

> This is a right skewed distribution.

Distribution for Profit

# Creating a histogram
hist(df$Sales, 
     main = "Distribution of Profit",  
     xlab = "Profit",              # X-axis label
     ylab = "Frequency",          # Y-axis label
     col = "green",                # Bar color
     border = "black",            # Border color
     breaks = 20,
     freq = FALSE)                 # Number of bins or breaks

# Displaying density also
lines(density(df$Profit), col = "black", lwd = 2)

> Right skewed distrbution, most of the data points are concentrated on the left side, closer to the median.

Distribution for Quantity

# Creating a histogram
hist(df$Sales, 
     main = "Distribution of Quantity",  
     xlab = "Quantity",              # X-axis label
     ylab = "Frequency",          # Y-axis label
     col = "green",                # Bar color
     border = "black",            # Border color
     breaks = 20,
     freq = FALSE)                 # Number of bins or breaks

# Displaying density also
lines(density(df$Quantity), col = "black", lwd = 2)

Creating a pie chart for checking the distribution of Categories variable

pie(table(df$Category),
    main = "Categories- Variable Distribution",
        xlab = "Categories",
        ylab = "Frequency")

Creating a bar plot for checking the distribution of Regions variable

barplot(table(df$Region),
        main = "Region Variable Distribution",
        xlab = "Region",
        ylab = "Frequency")

Create a correlation heatmap using ggplot2 to find feature correlation between all numeric columns

library(ggplot2)
library(reshape2)


df_corr <- cor(data_num)


# Create a correlation heatmap using ggplot2 to find feature correlation between all numeric columns
heatmap_plot <- ggplot(data = melt(df_corr), aes(x = Var1, y = Var2, fill = value)) +
  geom_tile() +
  geom_text(aes(label = round(value, 2)), vjust = 1) +  
  labs(title = "Correlation Heatmap for continuous variables", x = "Features", y = "Features", fill = "Correlation")

# Print the heatmap
print(heatmap_plot)

IMP: Sales and Profit have the highest correlation as the heatmap suggests

Checking relationship between Sales and Profit?

# Create a scatter plot
plot(df$Sales, df$Profit,
     main = "Scatter Plot: Sales vs. Profit",
     xlab = "Sales",
     ylab = "Profit",
     col = "green"
)

The linear relationship is not super strong between the two features.

Region wise-Profit

df$`Profit` <- as.numeric(as.character(df$`Profit`))
plot <- ggplot(df, aes(x=Region, y=Profit, main = "Boxplot : Region vs. Profit",

                       col = "red")) + geom_boxplot()+
  labs(title = "Box Plot of Region Vs Profit",
       x = "Region",
       y = "Profit") 
plot

Some outliers can be seen very clearly according to their respective regions.

Segment wise-Sales

plot <- ggplot(df, aes(x = Segment, y = Profit)) +
  geom_boxplot() +
  labs(title = "Box Plot of Profit Vs Segment",
       x = "Segment",
       y = "Profit") 
plot