1 Introduction

Welcome to exploratory data analysis. Exploratory data analysis (EDA) is used by data scientists to analyze and investigate data sets and summarize their main characteristics, often employing data visualization methods. It helps determine how best to manipulate data sources to get the answers you need, making it easier for data scientists to discover patterns, spot anomalies, test a hypothesis, or check assumptions.

EDA is primarily used to see what data can reveal beyond the formal modeling or hypothesis testing task and provides a provides a better understanding of data set variables and the relationships between them. It can also help determine if the statistical techniques you are considering for data analysis are appropriate. Originally developed by American mathematician John Tukey in the 1970s, EDA techniques continue to be a widely used method in the data discovery process today.

1.1 Introduction of data

This exploratory data analysis was carried out using retail data in csv format. This retail data contains data of the sales activity of retail companies, let’s say PT Angsa Biru. PT Angsa Biru is a company engaged in supplying office equipment, furniture and other technologies.

1.2 EDA Steps

in exploratory data analysis (EDA), there are some steps, namely:

1.2.1 Data collection

Data collection is an essential part of exploratory data analysis. It refers to the process of finding and loading data into our system.

1.2.2 Data cleaning

Data cleaning refers to the process of removing unwanted variables and values from your dataset and getting rid of any irregularities in it. Such anomalies can disproportionately skew the data and hence adversely affect the results.

1.2.3 Univariate analysis

In Univariate Analysis, you analyze data of just one variable. A variable in your dataset refers to a single feature/ column. You can do this either with graphical or non-graphical means by finding specific mathematical values in the data. Some visual methods include:

Histograms: Bar plots in which the frequency of data is represented with rectangle bars.

Box-plots: Here the information is represented in the form of boxes.

1.2.4 Bivariate analysis

Here, you use two variables and compare them. This way, you can find how one feature affects the other. It is done with scatter plots, which plot individual data points or correlation matrices that plot the correlation in hues. You can also use boxplots.

2 Input Data

Calling the data (for Csv format). Call the data using read.csv and put the document linked.

retail <- read.csv("D:/Data Farabi/Algoritma/1_programming_for_data_science-main/data_input/retail.csv")
retail

2.1 Data inspection

head(retail)

tail(retail)

dim(retail)

#> [1] 9994   15

names(retail)

#>  [1] "Row.ID"       "Order.ID"     "Order.Date"   "Ship.Date"    "Ship.Mode"   
#>  [6] "Customer.ID"  "Segment"      "Product.ID"   "Category"     "Sub.Category"
#> [11] "Product.Name" "Sales"        "Quantity"     "Discount"     "Profit"

From data inspection we got that, data sales record for PT Angsa Biru is consist of: a. Containt 9994 data observations with 15 coloumns. b. There are 15 types of data (based on coloumn) stand of ““Row.ID”, “Order.ID”, “Order.Date”, “Ship.Date”, “Ship.Mode”, “Customer.ID”, “Segment”,“Product.ID”, “Category”, “Sub.Category”, “Product.Name”, “Sales”, “Quantity”, “Discount”, “Profit”

2.2 Data cleansing & coertions

This section, we will check the data and data type.

str(retail)

#> 'data.frame':    9994 obs. of  15 variables:
#>  $ Row.ID      : int  1 2 3 4 5 6 7 8 9 10 ...
#>  $ Order.ID    : chr  "CA-2016-152156" "CA-2016-152156" "CA-2016-138688" "US-2015-108966" ...
#>  $ Order.Date  : chr  "11/8/16" "11/8/16" "6/12/16" "10/11/15" ...
#>  $ Ship.Date   : chr  "11/11/16" "11/11/16" "6/16/16" "10/18/15" ...
#>  $ Ship.Mode   : chr  "Second Class" "Second Class" "Second Class" "Standard Class" ...
#>  $ Customer.ID : chr  "CG-12520" "CG-12520" "DV-13045" "SO-20335" ...
#>  $ Segment     : chr  "Consumer" "Consumer" "Corporate" "Consumer" ...
#>  $ Product.ID  : chr  "FUR-BO-10001798" "FUR-CH-10000454" "OFF-LA-10000240" "FUR-TA-10000577" ...
#>  $ Category    : chr  "Furniture" "Furniture" "Office Supplies" "Furniture" ...
#>  $ Sub.Category: chr  "Bookcases" "Chairs" "Labels" "Tables" ...
#>  $ Product.Name: chr  "Bush Somerset Collection Bookcase" "Hon Deluxe Fabric Upholstered Stacking Chairs, Rounded Back" "Self-Adhesive Address Labels for Typewriters by Universal" "Bretford CR4500 Series Slim Rectangular Table" ...
#>  $ Sales       : num  262 731.9 14.6 957.6 22.4 ...
#>  $ Quantity    : int  2 3 2 5 2 7 4 6 3 5 ...
#>  $ Discount    : num  0 0 0 0.45 0.2 0 0 0.2 0.2 0 ...
#>  $ Profit      : num  41.91 219.58 6.87 -383.03 2.52 ...

From the data above, we can see that some of data type are incorrect type. So, we shall to convert it into correct data.

retail$Row.ID <- as.character(retail$Row.ID)
retail$Order.Date <- as.Date(retail$Order.Date, format = "%m/%d/%y")
retail$Ship.Date <- as.Date(retail$Ship.Date, format = "%m/%d/%y")
retail$Ship.Mode <- as.factor(retail$Ship.Mode)
retail$Segment <- as.factor(retail$Segment)
retail$Category <- as.factor(retail$Category)
retail$Sub.Category <- as.factor(retail$Sub.Category)
retail

str(retail)

#> 'data.frame':    9994 obs. of  15 variables:
#>  $ Row.ID      : chr  "1" "2" "3" "4" ...
#>  $ Order.ID    : chr  "CA-2016-152156" "CA-2016-152156" "CA-2016-138688" "US-2015-108966" ...
#>  $ Order.Date  : Date, format: "2016-11-08" "2016-11-08" ...
#>  $ Ship.Date   : Date, format: "2016-11-11" "2016-11-11" ...
#>  $ Ship.Mode   : Factor w/ 4 levels "First Class",..: 3 3 3 4 4 4 4 4 4 4 ...
#>  $ Customer.ID : chr  "CG-12520" "CG-12520" "DV-13045" "SO-20335" ...
#>  $ Segment     : Factor w/ 3 levels "Consumer","Corporate",..: 1 1 2 1 1 1 1 1 1 1 ...
#>  $ Product.ID  : chr  "FUR-BO-10001798" "FUR-CH-10000454" "OFF-LA-10000240" "FUR-TA-10000577" ...
#>  $ Category    : Factor w/ 3 levels "Furniture","Office Supplies",..: 1 1 2 1 2 1 2 3 2 2 ...
#>  $ Sub.Category: Factor w/ 17 levels "Accessories",..: 5 6 11 17 15 10 3 14 4 2 ...
#>  $ Product.Name: chr  "Bush Somerset Collection Bookcase" "Hon Deluxe Fabric Upholstered Stacking Chairs, Rounded Back" "Self-Adhesive Address Labels for Typewriters by Universal" "Bretford CR4500 Series Slim Rectangular Table" ...
#>  $ Sales       : num  262 731.9 14.6 957.6 22.4 ...
#>  $ Quantity    : int  2 3 2 5 2 7 4 6 3 5 ...
#>  $ Discount    : num  0 0 0 0.45 0.2 0 0 0.2 0.2 0 ...
#>  $ Profit      : num  41.91 219.58 6.87 -383.03 2.52 ...

the data type has been changed into the correct type of data

Then, checkin missing value of data

colSums(is.na(retail))

#>       Row.ID     Order.ID   Order.Date    Ship.Date    Ship.Mode  Customer.ID 
#>            0            0            0            0            0            0 
#>      Segment   Product.ID     Category Sub.Category Product.Name        Sales 
#>            0            0            0            0            0            0 
#>     Quantity     Discount       Profit 
#>            0            0            0

anyNA(retail)

#> [1] FALSE

There is no missing data value.

To delete coulmns which we do not need for the analysis because we do not need it, we can do subset (information from coloumn number 1 and 2), then we save it into the retail variable.

retail <- retail[,3:15]
retail

Now data from retail of PT Angsa Biru is ready to be processed.

3 Data explanation

Brief explanation

summary(retail)

#>    Order.Date           Ship.Date                   Ship.Mode   
#>  Min.   :2014-01-03   Min.   :2014-01-07   First Class   :1538  
#>  1st Qu.:2015-05-23   1st Qu.:2015-05-27   Same Day      : 543  
#>  Median :2016-06-26   Median :2016-06-29   Second Class  :1945  
#>  Mean   :2016-04-30   Mean   :2016-05-03   Standard Class:5968  
#>  3rd Qu.:2017-05-14   3rd Qu.:2017-05-18                        
#>  Max.   :2017-12-30   Max.   :2018-01-05                        
#>                                                                 
#>  Customer.ID               Segment      Product.ID       
#>  Length:9994        Consumer   :5191   Length:9994       
#>  Class :character   Corporate  :3020   Class :character  
#>  Mode  :character   Home Office:1783   Mode  :character  
#>                                                          
#>                                                          
#>                                                          
#>                                                          
#>             Category         Sub.Category  Product.Name      
#>  Furniture      :2121   Binders    :1523   Length:9994       
#>  Office Supplies:6026   Paper      :1370   Class :character  
#>  Technology     :1847   Furnishings: 957   Mode  :character  
#>                         Phones     : 889                     
#>                         Storage    : 846                     
#>                         Art        : 796                     
#>                         (Other)    :3613                     
#>      Sales              Quantity        Discount          Profit         
#>  Min.   :    0.444   Min.   : 1.00   Min.   :0.0000   Min.   :-6599.978  
#>  1st Qu.:   17.280   1st Qu.: 2.00   1st Qu.:0.0000   1st Qu.:    1.729  
#>  Median :   54.490   Median : 3.00   Median :0.2000   Median :    8.666  
#>  Mean   :  229.858   Mean   : 3.79   Mean   :0.1562   Mean   :   28.657  
#>  3rd Qu.:  209.940   3rd Qu.: 5.00   3rd Qu.:0.2000   3rd Qu.:   29.364  
#>  Max.   :22638.480   Max.   :14.00   Max.   :0.8000   Max.   : 8399.976  
#>

from the information above, we got the information that: 1. First order data occured in january 2014 2. Standard class is the popular ship mode with 5968 data. 3. The buyer of PT Angsa Biru mostly comes from consumer segment with 5191 data. 4. Category of product which most popular comes from office supplies with 6026 data. 5. PT Angsa Biru gained the averaged sales at 229.858 6. Maksimum order quantity at 14 unit, minimum at 1 unit and the average at 3.79 unit 7. The Average proft of PT Angsa Biru is 28.657

Checking the outlier within profit

aggregate(Profit~Segment,retail,mean)

aggregate(Profit~Segment,retail,var)

aggregate(Profit~Segment, retail, sd)

boxplot(retail$Profit)

from result above, we find possibilities outliers

4 Data Manipulatin and Transformation

4.1 Which category gives the lowest sales number? how much profit?

retail[retail$Sales == 0.444,]

Answer : Sales 0.444 comes from segment ‘consumer’, office supplies, and got loss 1.11

4.2 Which segment buying the most office supplies?

office_supplies <- retail [retail$Category=="Office Supplies",]
head(office_supplies)

round(prop.table( table(office_supplies$Segment))*100,2)

#> 
#>    Consumer   Corporate Home Office 
#>       51.89       30.20       17.91

Answer : consumer segment with 51.89%

4.3 How many times buyers get the highest discount 80% and when it happened?

Discount <-retail[retail$Discount==0.8,]
(sort(table(Discount$Order.Date),decreasing = T)[1])

#> 2014-07-26 
#>          4

Answer : 80% discount was on July 2014, happened 4 times

4.4 Related to number 3, who was the buyers?

Discount.2<-Discount[Discount$Order.Date=="2014-07-26",]
(table(Discount.2$Segment))

#> 
#>    Consumer   Corporate Home Office 
#>           4           0           0

Answer : all buyer was coming from consumer segment

4.5 How much total of quantity order from each segment and category, and which is the highest?

xtabs(Quantity~Segment+Category, retail)

#>              Category
#> Segment       Furniture Office Supplies Technology
#>   Consumer         4166           11758       3597
#>   Corporate        2495            7018       2095
#>   Home Office      1367            4130       1247

plot(xtabs(Quantity~Segment+Category,retail))

heatmap(xtabs(Quantity~Segment+Category, retail), cexRow = 0.8, cexCol = 0.8, scale = "column", Colv = NA, Rowv = NA)

Answer : Based on result above: segmen consumer with category office supply make the highest quantity order with sum 11758

4.6 what is median of quantity for every segment and category?

aggregate(Quantity~Segment+Category, retail, median)

Answer : Median of quantity = 3

4.7 How much Total quantity order since PT. Elips start selling the product?

sum(retail$Quantity)

#> [1] 37873

Answer : 37873 items

4.8 How much profit in 4 years?

sum(retail$Profit)

#> [1] 286397

Answer : 286397

4.9 Which category generate the highest profit?

xtabs(Profit~ Category, retail)

#> Category
#>       Furniture Office Supplies      Technology 
#>        18451.27       122490.80       145454.95

Answer : Technology

4.10 Which the Highest Loss in transaction?

retail[retail$Profit==-6599.978,]

Answer : happened in 2016, was selling 5 items of machines, discount 70%

4.11 Which Highest Profit in Segment ?

xtabs(Profit~ Segment, retail)

#> Segment
#>    Consumer   Corporate Home Office 
#>   134119.21    91979.13    60298.68

Answer : Consumer was given the higest profit compare to other segmen

4.12 Which the highest Sales number based on category ?

xtabs(Sales~ Category, retail)

#> Category
#>       Furniture Office Supplies      Technology 
#>        741999.8        719047.0        836154.0

Answer : Technology have the highest selling price compare to other

4.13 what is Profit based on Ship Mode + segment?

xtabs(Profit~Segment+Ship.Mode,retail)

#>              Ship.Mode
#> Segment       First Class  Same Day Second Class Standard Class
#>   Consumer      21374.044  9874.205    24946.911      77924.049
#>   Corporate     14464.472  1818.142    18225.713      57470.807
#>   Home Office   13131.324  4199.412    14274.011      28693.932

plot(xtabs(Profit~Segment+Ship.Mode,retail))

Answer : Most of buyer was choosing Standar Class for shipping options. Among them consumer was the higest one

4.14 what is The highest quantity seller based on Category ?

xtabs(Quantity~Category,retail)

#> Category
#>       Furniture Office Supplies      Technology 
#>            8028           22906            6939

Answer : Office Supply

5 Explanatory Text & Business Recomendation

5.1 Case explanatory

Since 2014, PT Angsa Biru has sold 37873 items with profit 286397. PT Angsa Biru has main product category, there are furniture category, office supplies category and Technology category From this main product, category which generate the highest profit is Technology with $14545.95. The lowest sales comes from office supply category with sales 0.444 and loss 1.11. Even though the lowest sales come from office supply category, the highest quantity also come from office supply with consumer segment.

PT Angsa Biru has some kind of buyer segment, such as consumer, corporate and home office. Consumer segment is giving the highest amount of profit with $134119.21 with ship mode standard class.

5.2 Recommendation:

Technology category has given the highest profit but the highest loss has happened in this category too. So, shall be considered about discount rate to make it happen in another time.
Need some strategies to increase sales target for corporate and home office segment.
Office supply is the highest quantity, so make it sure the availability of product items to meet consumer satisfaction.

Exploratory Data Analysis

Moh. Suhud Farabi

2022-07-17