BigMart has collected 2013 sales data for 1559 products across 10 stores in different cities. Our training dataset has 8523 observations and 12 original variables, plus one derived variable, Log2_MRP, which is the base-2 logarithm of Item_MRP.
The structure of the data is shown below:
## Classes 'tbl_df', 'tbl' and 'data.frame': 8523 obs. of 13 variables:
## $ Item_Identifier : Factor w/ 1559 levels "DRA12","DRA24",..: 157 9 663 1122 1298 759 697 739 441 991 ...
## $ Item_Weight : num 9.3 5.92 17.5 19.2 8.93 ...
## $ Item_Fat_Content : Factor w/ 2 levels "Low Fat","Regular": 1 2 1 2 1 2 2 1 2 2 ...
## $ Item_Visibility : num 0.016 0.0193 0.0168 0 0 ...
## $ Item_Type : Factor w/ 16 levels "Baking Goods",..: 5 15 11 7 10 1 14 14 6 6 ...
## $ Item_MRP : num 249.8 48.3 141.6 182.1 53.9 ...
## $ Outlet_Identifier : Factor w/ 10 levels "OUT010","OUT013",..: 10 4 10 1 2 4 2 6 8 3 ...
## $ Outlet_Establishment_Year: int 1999 2009 1999 1998 1987 2009 1987 1985 2002 2007 ...
## $ Outlet_Size : Factor w/ 3 levels "High","Medium",..: 2 2 2 NA 1 2 1 2 NA NA ...
## $ Outlet_Location_Type : Factor w/ 3 levels "Tier 1","Tier 2",..: 1 3 1 3 3 3 3 3 2 2 ...
## $ Outlet_Type : Factor w/ 4 levels "Grocery Store",..: 2 3 2 1 2 3 2 4 2 2 ...
## $ Log2_MRP : num 7.96 5.59 7.15 7.51 5.75 ...
## $ Item_Outlet_Sales : num 3735 443 2097 732 995 ...
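The Log2_MRP column in the listing above is consistent with a base-2 logarithm of Item_MRP. A minimal sketch of that transformation, assuming a data frame called `bigmart` (a hypothetical name) that holds only the first five Item_MRP values printed above:

```r
# Toy data frame; `bigmart` is a hypothetical name, and the values are the
# first five Item_MRP entries shown in the str() output.
bigmart <- data.frame(Item_MRP = c(249.8, 48.3, 141.6, 182.1, 53.9))

# Derive the log-transformed price column.
bigmart$Log2_MRP <- log2(bigmart$Item_MRP)

print(round(bigmart$Log2_MRP, 2))
```

Rounded to two decimals, these values agree with the first five Log2_MRP entries in the listing (7.96, 5.59, 7.15, 7.51, 5.75).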
A summary of each variable is shown below:
## Item_Identifier Item_Weight Item_Fat_Content Item_Visibility
## FDG33 : 10 Min. : 4.555 Low Fat:5517 Min. :0.00000
## FDW13 : 10 1st Qu.: 9.310 Regular:3006 1st Qu.:0.02699
## DRE49 : 9 Median :12.793 Median :0.05393
## DRN47 : 9 Mean :12.847 Mean :0.06613
## FDD38 : 9 3rd Qu.:16.000 3rd Qu.:0.09459
## FDF52 : 9 Max. :21.350 Max. :0.32839
## (Other):8467
## Item_Type Item_MRP Outlet_Identifier
## Fruits and Vegetables:1232 Min. : 31.29 OUT027 : 935
## Snack Foods :1200 1st Qu.: 93.83 OUT013 : 932
## Household : 910 Median :143.01 OUT035 : 930
## Frozen Foods : 856 Mean :140.99 OUT046 : 930
## Dairy : 682 3rd Qu.:185.64 OUT049 : 930
## Canned : 649 Max. :266.89 OUT045 : 929
## (Other) :2994 (Other):2937
## Outlet_Establishment_Year Outlet_Size Outlet_Location_Type
## Min. :1985 High : 932 Tier 1:2388
## 1st Qu.:1987 Medium:2793 Tier 2:2785
## Median :1999 Small :2388 Tier 3:3350
## Mean :1998 NA's :2410
## 3rd Qu.:2004
## Max. :2009
##
## Outlet_Type Log2_MRP Item_Outlet_Sales
## Grocery Store :1083 Min. :4.968 Min. : 33.29
## Supermarket Type1:5577 1st Qu.:6.552 1st Qu.: 834.25
## Supermarket Type2: 928 Median :7.160 Median : 1794.33
## Supermarket Type3: 935 Mean :6.966 Mean : 2181.29
## 3rd Qu.:7.536 3rd Qu.: 3101.30
## Max. :8.060 Max. :13086.97
##
Let’s begin our data analysis by asking a question about item sales.
The above points will be addressed one by one. So…
First, I will create a histogram for each continuous variable.
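One way to draw these histograms in base R is sketched below. It assumes a data frame called `bigmart` (a hypothetical name) and uses only the first five rows of four continuous columns as toy data, so the shapes it draws say nothing about the real distributions.

```r
# Toy rows built from the first values printed by str(); `bigmart` is a
# hypothetical name for the training data frame.
bigmart <- data.frame(
  Item_Weight       = c(9.3, 5.92, 17.5, 19.2, 8.93),
  Item_Visibility   = c(0.016, 0.0193, 0.0168, 0, 0),
  Item_MRP          = c(249.8, 48.3, 141.6, 182.1, 53.9),
  Item_Outlet_Sales = c(3735, 443, 2097, 732, 995)
)

# Pick every numeric column and draw one histogram per column on a 2 x 2 grid.
num_cols <- names(bigmart)[sapply(bigmart, is.numeric)]
old_par <- par(mfrow = c(2, 2))
for (col in num_cols) {
  hist(bigmart[[col]], main = col, xlab = col)
}
par(old_par)  # restore the previous plotting layout
```

On the full dataset, `is.numeric` would also pick up Outlet_Establishment_Year and Log2_MRP, so the grid size would need adjusting.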
How about the relationship between item sales and retail price? Does a higher retail price go with higher item sales?
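A scatter plot together with a correlation coefficient is one simple way to look at this question. The sketch below assumes a data frame called `bigmart` (a hypothetical name) and uses only the first five rows as toy data, so the number it prints is illustrative and not a result for the full dataset.

```r
# Toy rows taken from the first values printed by str(); `bigmart` is a
# hypothetical name for the training data frame.
bigmart <- data.frame(
  Item_MRP          = c(249.8, 48.3, 141.6, 182.1, 53.9),
  Item_Outlet_Sales = c(3735, 443, 2097, 732, 995)
)

# Scatter plot of sales against retail price.
plot(bigmart$Item_MRP, bigmart$Item_Outlet_Sales,
     xlab = "Item_MRP", ylab = "Item_Outlet_Sales")

# Pearson correlation between retail price and sales.
r <- cor(bigmart$Item_MRP, bigmart$Item_Outlet_Sales)
print(r)
```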
So far, we have not found the reason for the right-skewed distribution of item sales. Let’s check the relationship between item weight and item sales.
Back to our question: the right-skewed distribution of item sales is not explained by item weight, item visibility, or item retail price. How about breaking down the distribution of item sales by the categorical variables, one facet per level?
## # A tibble: 2 x 3
## # Groups: Outlet_Type [?]
## Outlet_Type Outlet_Identifier count
## <fct> <fct> <int>
## 1 Grocery Store OUT010 555
## 2 Grocery Store OUT019 528
## # A tibble: 10 x 5
## Outlet_Identifier count sum avg IQR
## <fct> <int> <dbl> <dbl> <dbl>
## 1 OUT027 935 3453926 3694 2931
## 2 OUT035 930 2268123 2439 2071
## 3 OUT049 930 2183970 2348 2099
## 4 OUT017 926 2167465 2341 1901
## 5 OUT013 932 2142664 2299 2094
## 6 OUT046 930 2118395 2278 1958
## 7 OUT045 929 2036725 2192 1861
## 8 OUT018 928 1851823 1995 1721
## 9 OUT010 555 188340 339 301
## 10 OUT019 528 179694 340 307
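A per-outlet summary with the same columns as the table above (count, sum, avg, IQR) could be produced along these lines with dplyr. This is a sketch under assumptions: the data frame name `sales` is hypothetical, and the five toy rows are the first observations shown in the str() output, so the numbers will not match the table.

```r
library(dplyr)

# Toy rows; `sales` is a hypothetical name. Outlet identifiers are decoded
# from the factor codes in the str() output (alphabetical level order assumed).
sales <- data.frame(
  Outlet_Identifier = c("OUT049", "OUT018", "OUT049", "OUT010", "OUT013"),
  Item_Outlet_Sales = c(3735, 443, 2097, 732, 995)
)

# One row per outlet, ordered by total sales in descending order.
outlet_summary <- sales %>%
  group_by(Outlet_Identifier) %>%
  summarise(
    count = n(),
    sum   = sum(Item_Outlet_Sales),
    avg   = mean(Item_Outlet_Sales),
    IQR   = stats::IQR(Item_Outlet_Sales)
  ) %>%
  arrange(desc(sum))

print(outlet_summary)
```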
Do the outliers of each outlet share any common characteristics? We filtered the outliers (i.e. item sales greater than 1.5 * inter-quartile range of item sales) from each outlet, then compared their distribution with the overall one.
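A per-outlet outlier filter might look like the sketch below. It uses the conventional Tukey upper fence (third quartile plus 1.5 times the IQR) computed within each outlet. That is an assumption about how the rule described above was applied, not a confirmed reproduction of it. The data frame `sales`, its outlet labels, and its values are all hypothetical.

```r
library(dplyr)

# Hypothetical toy data: outlet "A" has one unusually large sale, "B" has none.
sales <- data.frame(
  Outlet_Identifier = rep(c("A", "B"), each = 5),
  Item_Outlet_Sales = c(100, 200, 300, 400, 5000, 10, 20, 30, 40, 50)
)

# Keep rows above the upper fence of their own outlet:
# Q3 + 1.5 * IQR, computed separately for each group.
outliers <- sales %>%
  group_by(Outlet_Identifier) %>%
  filter(Item_Outlet_Sales >
           quantile(Item_Outlet_Sales, 0.75) + 1.5 * IQR(Item_Outlet_Sales)) %>%
  ungroup()

print(outliers)
```

In this toy example only the sale of 5000 in outlet "A" exceeds its fence (400 + 1.5 * 200 = 700). The filtered rows can then be grouped by any categorical variable for comparison with the overall data.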
## # A tibble: 16 x 5
## Item_Type n sales avg IQR
## <fct> <int> <dbl> <dbl> <dbl>
## 1 Fruits and Vegetables 59 363753 6165 2203
## 2 Snack Foods 45 277147 6159 2218
## 3 Household 35 218462 6242 2135
## 4 Frozen Foods 30 181329 6044 1412
## 5 Canned 26 152902 5881 1469
## 6 Dairy 22 150422 6837 2463
## 7 Meat 22 135645 6166 1700
## 8 Baking Goods 17 97164 5716 1302
## 9 Health and Hygiene 15 89796 5986 918
## 10 Soft Drinks 10 70464 7046 2346
## 11 Breads 7 44202 6315 1445
## 12 Hard Drinks 7 40375 5768 1518
## 13 Breakfast 4 30427 7607 745
## 14 Starchy Foods 4 23740 5935 1634
## 15 Others 3 15189 5063 483
## 16 Seafood 1 5705 5705 0
## # A tibble: 2 x 2
## Outlet_Identifier Visibility_Greater_Than_0.2
## <fct> <int>
## 1 OUT010 62
## 2 OUT019 72
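A count like the one in the last table can be obtained by filtering on Item_Visibility and counting rows per outlet. The sketch below assumes a data frame called `items` (a hypothetical name) with made-up visibility values, so its counts are illustrative only.

```r
library(dplyr)

# Hypothetical toy data; the visibility values are invented for illustration.
items <- data.frame(
  Outlet_Identifier = c("OUT010", "OUT010", "OUT019", "OUT019", "OUT019", "OUT027"),
  Item_Visibility   = c(0.25, 0.05, 0.21, 0.30, 0.10, 0.08)
)

# Keep items with visibility above 0.2, then count them per outlet.
high_vis <- items %>%
  filter(Item_Visibility > 0.2) %>%
  count(Outlet_Identifier, name = "Visibility_Greater_Than_0.2")

print(high_vis)
```

Outlets with no item above the threshold, such as OUT027 in this toy data, drop out of the result.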