Welcome to exploratory data analysis. Exploratory data analysis (EDA) is used by data scientists to analyze and investigate data sets and summarize their main characteristics, often employing data visualization methods. It helps determine how best to manipulate data sources to get the answers you need, making it easier for data scientists to discover patterns, spot anomalies, test a hypothesis, or check assumptions.
EDA is primarily used to see what data can reveal beyond the formal modeling or hypothesis testing task and provides a provides a better understanding of data set variables and the relationships between them. It can also help determine if the statistical techniques you are considering for data analysis are appropriate. Originally developed by American mathematician John Tukey in the 1970s, EDA techniques continue to be a widely used method in the data discovery process today.
This exploratory data analysis was carried out using retail data in csv format. This retail data contains data of the sales activity of retail companies, let’s say PT Angsa Biru. PT Angsa Biru is a company engaged in supplying office equipment, furniture and other technologies.
in exploratory data analysis (EDA), there are some steps, namely:
Data collection is an essential part of exploratory data analysis. It refers to the process of finding and loading data into our system.
Data cleaning refers to the process of removing unwanted variables and values from your dataset and getting rid of any irregularities in it. Such anomalies can disproportionately skew the data and hence adversely affect the results.
In Univariate Analysis, you analyze data of just one variable. A variable in your dataset refers to a single feature/ column. You can do this either with graphical or non-graphical means by finding specific mathematical values in the data. Some visual methods include:
Histograms: Bar plots in which the frequency of data is represented with rectangle bars.
Box-plots: Here the information is represented in the form of boxes.
Here, you use two variables and compare them. This way, you can find how one feature affects the other. It is done with scatter plots, which plot individual data points or correlation matrices that plot the correlation in hues. You can also use boxplots.
Calling the data (for Csv format). Call the data using read.csv and put the document linked.
retail <- read.csv("D:/Data Farabi/Algoritma/1_programming_for_data_science-main/data_input/retail.csv")
retailhead(retail)tail(retail)dim(retail)#> [1] 9994 15
names(retail)#> [1] "Row.ID" "Order.ID" "Order.Date" "Ship.Date" "Ship.Mode"
#> [6] "Customer.ID" "Segment" "Product.ID" "Category" "Sub.Category"
#> [11] "Product.Name" "Sales" "Quantity" "Discount" "Profit"
From data inspection we got that, data sales record for PT Angsa Biru is consist of: a. Containt 9994 data observations with 15 coloumns. b. There are 15 types of data (based on coloumn) stand of ““Row.ID”, “Order.ID”, “Order.Date”, “Ship.Date”, “Ship.Mode”, “Customer.ID”, “Segment”,“Product.ID”, “Category”, “Sub.Category”, “Product.Name”, “Sales”, “Quantity”, “Discount”, “Profit”
This section, we will check the data and data type.
str(retail)#> 'data.frame': 9994 obs. of 15 variables:
#> $ Row.ID : int 1 2 3 4 5 6 7 8 9 10 ...
#> $ Order.ID : chr "CA-2016-152156" "CA-2016-152156" "CA-2016-138688" "US-2015-108966" ...
#> $ Order.Date : chr "11/8/16" "11/8/16" "6/12/16" "10/11/15" ...
#> $ Ship.Date : chr "11/11/16" "11/11/16" "6/16/16" "10/18/15" ...
#> $ Ship.Mode : chr "Second Class" "Second Class" "Second Class" "Standard Class" ...
#> $ Customer.ID : chr "CG-12520" "CG-12520" "DV-13045" "SO-20335" ...
#> $ Segment : chr "Consumer" "Consumer" "Corporate" "Consumer" ...
#> $ Product.ID : chr "FUR-BO-10001798" "FUR-CH-10000454" "OFF-LA-10000240" "FUR-TA-10000577" ...
#> $ Category : chr "Furniture" "Furniture" "Office Supplies" "Furniture" ...
#> $ Sub.Category: chr "Bookcases" "Chairs" "Labels" "Tables" ...
#> $ Product.Name: chr "Bush Somerset Collection Bookcase" "Hon Deluxe Fabric Upholstered Stacking Chairs, Rounded Back" "Self-Adhesive Address Labels for Typewriters by Universal" "Bretford CR4500 Series Slim Rectangular Table" ...
#> $ Sales : num 262 731.9 14.6 957.6 22.4 ...
#> $ Quantity : int 2 3 2 5 2 7 4 6 3 5 ...
#> $ Discount : num 0 0 0 0.45 0.2 0 0 0.2 0.2 0 ...
#> $ Profit : num 41.91 219.58 6.87 -383.03 2.52 ...
From the data above, we can see that some of data type are incorrect type. So, we shall to convert it into correct data.
retail$Row.ID <- as.character(retail$Row.ID)
retail$Order.Date <- as.Date(retail$Order.Date, format = "%m/%d/%y")
retail$Ship.Date <- as.Date(retail$Ship.Date, format = "%m/%d/%y")
retail$Ship.Mode <- as.factor(retail$Ship.Mode)
retail$Segment <- as.factor(retail$Segment)
retail$Category <- as.factor(retail$Category)
retail$Sub.Category <- as.factor(retail$Sub.Category)
retailstr(retail)#> 'data.frame': 9994 obs. of 15 variables:
#> $ Row.ID : chr "1" "2" "3" "4" ...
#> $ Order.ID : chr "CA-2016-152156" "CA-2016-152156" "CA-2016-138688" "US-2015-108966" ...
#> $ Order.Date : Date, format: "2016-11-08" "2016-11-08" ...
#> $ Ship.Date : Date, format: "2016-11-11" "2016-11-11" ...
#> $ Ship.Mode : Factor w/ 4 levels "First Class",..: 3 3 3 4 4 4 4 4 4 4 ...
#> $ Customer.ID : chr "CG-12520" "CG-12520" "DV-13045" "SO-20335" ...
#> $ Segment : Factor w/ 3 levels "Consumer","Corporate",..: 1 1 2 1 1 1 1 1 1 1 ...
#> $ Product.ID : chr "FUR-BO-10001798" "FUR-CH-10000454" "OFF-LA-10000240" "FUR-TA-10000577" ...
#> $ Category : Factor w/ 3 levels "Furniture","Office Supplies",..: 1 1 2 1 2 1 2 3 2 2 ...
#> $ Sub.Category: Factor w/ 17 levels "Accessories",..: 5 6 11 17 15 10 3 14 4 2 ...
#> $ Product.Name: chr "Bush Somerset Collection Bookcase" "Hon Deluxe Fabric Upholstered Stacking Chairs, Rounded Back" "Self-Adhesive Address Labels for Typewriters by Universal" "Bretford CR4500 Series Slim Rectangular Table" ...
#> $ Sales : num 262 731.9 14.6 957.6 22.4 ...
#> $ Quantity : int 2 3 2 5 2 7 4 6 3 5 ...
#> $ Discount : num 0 0 0 0.45 0.2 0 0 0.2 0.2 0 ...
#> $ Profit : num 41.91 219.58 6.87 -383.03 2.52 ...
the data type has been changed into the correct type of data
Then, checkin missing value of data
colSums(is.na(retail))#> Row.ID Order.ID Order.Date Ship.Date Ship.Mode Customer.ID
#> 0 0 0 0 0 0
#> Segment Product.ID Category Sub.Category Product.Name Sales
#> 0 0 0 0 0 0
#> Quantity Discount Profit
#> 0 0 0
anyNA(retail)#> [1] FALSE
There is no missing data value.
To delete coulmns which we do not need for the analysis because we do not need it, we can do subset (information from coloumn number 1 and 2), then we save it into the retail variable.
retail <- retail[,3:15]
retailNow data from retail of PT Angsa Biru is ready to be processed.
Brief explanation
summary(retail)#> Order.Date Ship.Date Ship.Mode
#> Min. :2014-01-03 Min. :2014-01-07 First Class :1538
#> 1st Qu.:2015-05-23 1st Qu.:2015-05-27 Same Day : 543
#> Median :2016-06-26 Median :2016-06-29 Second Class :1945
#> Mean :2016-04-30 Mean :2016-05-03 Standard Class:5968
#> 3rd Qu.:2017-05-14 3rd Qu.:2017-05-18
#> Max. :2017-12-30 Max. :2018-01-05
#>
#> Customer.ID Segment Product.ID
#> Length:9994 Consumer :5191 Length:9994
#> Class :character Corporate :3020 Class :character
#> Mode :character Home Office:1783 Mode :character
#>
#>
#>
#>
#> Category Sub.Category Product.Name
#> Furniture :2121 Binders :1523 Length:9994
#> Office Supplies:6026 Paper :1370 Class :character
#> Technology :1847 Furnishings: 957 Mode :character
#> Phones : 889
#> Storage : 846
#> Art : 796
#> (Other) :3613
#> Sales Quantity Discount Profit
#> Min. : 0.444 Min. : 1.00 Min. :0.0000 Min. :-6599.978
#> 1st Qu.: 17.280 1st Qu.: 2.00 1st Qu.:0.0000 1st Qu.: 1.729
#> Median : 54.490 Median : 3.00 Median :0.2000 Median : 8.666
#> Mean : 229.858 Mean : 3.79 Mean :0.1562 Mean : 28.657
#> 3rd Qu.: 209.940 3rd Qu.: 5.00 3rd Qu.:0.2000 3rd Qu.: 29.364
#> Max. :22638.480 Max. :14.00 Max. :0.8000 Max. : 8399.976
#>
from the information above, we got the information that: 1. First order data occured in january 2014 2. Standard class is the popular ship mode with 5968 data. 3. The buyer of PT Angsa Biru mostly comes from consumer segment with 5191 data. 4. Category of product which most popular comes from office supplies with 6026 data. 5. PT Angsa Biru gained the averaged sales at 229.858 6. Maksimum order quantity at 14 unit, minimum at 1 unit and the average at 3.79 unit 7. The Average proft of PT Angsa Biru is 28.657
Checking the outlier within profit
`
aggregate(Profit~Segment,retail,mean)aggregate(Profit~Segment,retail,var)aggregate(Profit~Segment, retail, sd)boxplot(retail$Profit)from result above, we find possibilities outliers
retail[retail$Sales == 0.444,]Answer : Sales 0.444 comes from segment ‘consumer’, office supplies, and got loss 1.11
office_supplies <- retail [retail$Category=="Office Supplies",]
head(office_supplies)round(prop.table( table(office_supplies$Segment))*100,2)#>
#> Consumer Corporate Home Office
#> 51.89 30.20 17.91
Answer : consumer segment with 51.89%
Discount <-retail[retail$Discount==0.8,]
(sort(table(Discount$Order.Date),decreasing = T)[1])#> 2014-07-26
#> 4
Answer : 80% discount was on July 2014, happened 4 times
xtabs(Quantity~Segment+Category, retail)#> Category
#> Segment Furniture Office Supplies Technology
#> Consumer 4166 11758 3597
#> Corporate 2495 7018 2095
#> Home Office 1367 4130 1247
plot(xtabs(Quantity~Segment+Category,retail))heatmap(xtabs(Quantity~Segment+Category, retail), cexRow = 0.8, cexCol = 0.8, scale = "column", Colv = NA, Rowv = NA)Answer : Based on result above: segmen consumer with category office supply make the highest quantity order with sum 11758
aggregate(Quantity~Segment+Category, retail, median)Answer : Median of quantity = 3
sum(retail$Quantity)#> [1] 37873
Answer : 37873 items
sum(retail$Profit)#> [1] 286397
Answer : 286397
xtabs(Profit~ Category, retail)#> Category
#> Furniture Office Supplies Technology
#> 18451.27 122490.80 145454.95
Answer : Technology
retail[retail$Profit==-6599.978,]Answer : happened in 2016, was selling 5 items of machines, discount 70%
xtabs(Profit~ Segment, retail)#> Segment
#> Consumer Corporate Home Office
#> 134119.21 91979.13 60298.68
Answer : Consumer was given the higest profit compare to other segmen
xtabs(Sales~ Category, retail)#> Category
#> Furniture Office Supplies Technology
#> 741999.8 719047.0 836154.0
Answer : Technology have the highest selling price compare to other
xtabs(Profit~Segment+Ship.Mode,retail)#> Ship.Mode
#> Segment First Class Same Day Second Class Standard Class
#> Consumer 21374.044 9874.205 24946.911 77924.049
#> Corporate 14464.472 1818.142 18225.713 57470.807
#> Home Office 13131.324 4199.412 14274.011 28693.932
plot(xtabs(Profit~Segment+Ship.Mode,retail))Answer : Most of buyer was choosing Standar Class for shipping options. Among them consumer was the higest one
xtabs(Quantity~Category,retail)#> Category
#> Furniture Office Supplies Technology
#> 8028 22906 6939
Answer : Office Supply
Since 2014, PT Angsa Biru has sold 37873 items with profit 286397. PT Angsa Biru has main product category, there are furniture category, office supplies category and Technology category From this main product, category which generate the highest profit is Technology with $14545.95. The lowest sales comes from office supply category with sales 0.444 and loss 1.11. Even though the lowest sales come from office supply category, the highest quantity also come from office supply with consumer segment.
PT Angsa Biru has some kind of buyer segment, such as consumer, corporate and home office. Consumer segment is giving the highest amount of profit with $134119.21 with ship mode standard class.
Technology category has given the highest profit but the highest loss has happened in this category too. So, shall be considered about discount rate to make it happen in another time.
Need some strategies to increase sales target for corporate and home office segment.
Office supply is the highest quantity, so make it sure the availability of product items to meet consumer satisfaction.