1. Introduction

유럽에 3,000개의 매장을 갖고 있는 드럭 스토어(한국의 올리브영같은 매장) 데이터를 분석하는 미션으로 데이터 분석 툴(SQL, R, Python)을 활용해서 데이터가 필요한-마케팅, 영업, 투자 전략, 현장-부서에서 사용하기에 알맞은 형태의 데이터로 가공해서 제공한다. — # Rossman Store Sales

Rossmann operates over 3,000 drug stores in 7 European countries. Currently, Rossmann store managers are tasked with predicting their daily sales for up to six weeks in advance. Store sales are influenced by many factors, including promotions, competition, school and state holidays, seasonality, and locality. With thousands of individual managers predicting sales based on their unique circumstances, the accuracy of results can be quite varied.

In their first Kaggle competition, Rossmann is challenging you to predict 6 weeks of daily sales for 1,115 stores located across Germany. Reliable sales forecasts enable store managers to create effective staff schedules that increase productivity and motivation. By helping Rossmann create a robust prediction model, you will help store managers stay focused on what’s most important to them: their customers and their teams!

Data Description

You are provided with historical sales data for 1,115 Rossmann stores. The task is to forecast the “Sales” column for the test set. Note that some stores in the dataset were temporarily closed for refurbishment.

Files

  • train.csv - historical data including Sales
  • test.csv - historical data excluding Sales
  • sample_submission.csv - a sample submission file in the correct format
  • store.csv - supplemental information about the stores

Data fields

Most of the fields are self-explanatory. The following are descriptions for those that aren’t.

  • Id - an Id that represents a (Store, Date) duple within the test set
  • Store - a unique Id for each store
  • Sales - the turnover for any given day (this is what you are predicting)
  • Customers - the number of customers on a given day
  • Open - an indicator for whether the store was open: 0 = closed, 1 = open
  • StateHoliday - indicates a state holiday. Normally all stores, with few exceptions, are closed on state holidays. Note that all schools are closed on public holidays and weekends. a = public holiday, b = Easter holiday, c = Christmas, 0 = None
  • SchoolHoliday - indicates if the (Store, Date) was affected by the closure of public schools
  • StoreType - differentiates between 4 different store models: a, b, c, d
  • Assortment - describes an assortment level: a = basic, b = extra, c = extended
  • CompetitionDistance - distance in meters to the nearest competitor store
  • CompetitionOpenSince[Month/Year] - gives the approximate year and month of the time the nearest competitor was opened
  • Promo - indicates whether a store is running a promo on that day
  • Promo2 - Promo2 is a continuing and consecutive promotion for some stores: 0 = store is not participating, 1 = store is participating
  • Promo2Since[Year/Week] - describes the year and calendar week when the store started participating in Promo2
  • PromoInterval - describes the consecutive intervals Promo2 is started, naming the months the promotion is started anew. E.g. “Feb,May,Aug,Nov” means each round starts in February, May, August, November of any given year for that store

2. Setting

2.1. Import library

library(data.table)
library(forecast)
library(lubridate) # 날짜 라이브러리
library(zoo) # 날짜 라이브러리
library(ggplot2) # 시각화 라이브러리
library(DescTools) # 통계 요약 라이브러리
library(sqldf) # sql 라이브러리
library(dplyr) # 전처리 라이브러리
library(knitr) # 출판 라이브러리
library(VIM) # 결측치 라이브러리

2.2. Load data

test <- fread("../../data/sql/rossman/test.csv")
train <- fread("../../data/sql/rossman/train.csv")
store <- fread("../../data/sql/rossman/store.csv")

2.3. Data overview

str(train)
## Classes 'data.table' and 'data.frame':   1017209 obs. of  9 variables:
##  $ Store        : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ DayOfWeek    : int  5 5 5 5 5 5 5 5 5 5 ...
##  $ Date         : chr  "2015-07-31" "2015-07-31" "2015-07-31" "2015-07-31" ...
##  $ Sales        : int  5263 6064 8314 13995 4822 5651 15344 8492 8565 7185 ...
##  $ Customers    : int  555 625 821 1498 559 589 1414 833 687 681 ...
##  $ Open         : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Promo        : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ StateHoliday : chr  "0" "0" "0" "0" ...
##  $ SchoolHoliday: int  1 1 1 1 1 1 1 1 1 1 ...
##  - attr(*, ".internal.selfref")=<externalptr>
glimpse(train)
## Observations: 1,017,209
## Variables: 9
## $ Store         <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 1...
## $ DayOfWeek     <int> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, ...
## $ Date          <chr> "2015-07-31", "2015-07-31", "2015-07-31", "2015-...
## $ Sales         <int> 5263, 6064, 8314, 13995, 4822, 5651, 15344, 8492...
## $ Customers     <int> 555, 625, 821, 1498, 559, 589, 1414, 833, 687, 6...
## $ Open          <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ Promo         <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ StateHoliday  <chr> "0", "0", "0", "0", "0", "0", "0", "0", "0", "0"...
## $ SchoolHoliday <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, ...
summary(train)
##      Store          DayOfWeek         Date               Sales      
##  Min.   :   1.0   Min.   :1.000   Length:1017209     Min.   :    0  
##  1st Qu.: 280.0   1st Qu.:2.000   Class :character   1st Qu.: 3727  
##  Median : 558.0   Median :4.000   Mode  :character   Median : 5744  
##  Mean   : 558.4   Mean   :3.998                      Mean   : 5774  
##  3rd Qu.: 838.0   3rd Qu.:6.000                      3rd Qu.: 7856  
##  Max.   :1115.0   Max.   :7.000                      Max.   :41551  
##    Customers           Open            Promo        StateHoliday      
##  Min.   :   0.0   Min.   :0.0000   Min.   :0.0000   Length:1017209    
##  1st Qu.: 405.0   1st Qu.:1.0000   1st Qu.:0.0000   Class :character  
##  Median : 609.0   Median :1.0000   Median :0.0000   Mode  :character  
##  Mean   : 633.1   Mean   :0.8301   Mean   :0.3815                     
##  3rd Qu.: 837.0   3rd Qu.:1.0000   3rd Qu.:1.0000                     
##  Max.   :7388.0   Max.   :1.0000   Max.   :1.0000                     
##  SchoolHoliday   
##  Min.   :0.0000  
##  1st Qu.:0.0000  
##  Median :0.0000  
##  Mean   :0.1786  
##  3rd Qu.:0.0000  
##  Max.   :1.0000
head(train)
##    Store DayOfWeek       Date Sales Customers Open Promo StateHoliday
## 1:     1         5 2015-07-31  5263       555    1     1            0
## 2:     2         5 2015-07-31  6064       625    1     1            0
## 3:     3         5 2015-07-31  8314       821    1     1            0
## 4:     4         5 2015-07-31 13995      1498    1     1            0
## 5:     5         5 2015-07-31  4822       559    1     1            0
## 6:     6         5 2015-07-31  5651       589    1     1            0
##    SchoolHoliday
## 1:             1
## 2:             1
## 3:             1
## 4:             1
## 5:             1
## 6:             1

2.4. Reshape data type

train$Date <- as.Date(train$Date)
#train[, Date := as.Date(Date)]
#train$Date <- as.Date(train$Date)
train <- train[order(Date)]

3. Report in SQL

3.1. Sales by total

total_sales <- sqldf("
select
    count(distinct store) as tot_store
    , sum(customers) as tot_cust
    , round(avg(customers), 2) as avg_cust
   
    , sum(sales) as tot_sales
    , round(avg(sales), 2) as avg_sales
   
    , min(date) as first_purchase_date
    , max(date) as last_purchase_date
from train;")

summary(total_sales)
##    tot_store       tot_cust           avg_cust       tot_sales         
##  Min.   :1115   Min.   :6.44e+08   Min.   :633.1   Min.   :5873180623  
##  1st Qu.:1115   1st Qu.:6.44e+08   1st Qu.:633.1   1st Qu.:5873180623  
##  Median :1115   Median :6.44e+08   Median :633.1   Median :5873180623  
##  Mean   :1115   Mean   :6.44e+08   Mean   :633.1   Mean   :5873180623  
##  3rd Qu.:1115   3rd Qu.:6.44e+08   3rd Qu.:633.1   3rd Qu.:5873180623  
##  Max.   :1115   Max.   :6.44e+08   Max.   :633.1   Max.   :5873180623  
##    avg_sales    first_purchase_date last_purchase_date
##  Min.   :5774   Min.   :15706       Min.   :16647     
##  1st Qu.:5774   1st Qu.:15706       1st Qu.:16647     
##  Median :5774   Median :15706       Median :16647     
##  Mean   :5774   Mean   :15706       Mean   :16647     
##  3rd Qu.:5774   3rd Qu.:15706       3rd Qu.:16647     
##  Max.   :5774   Max.   :15706       Max.   :16647
Total Sales.
tot_store tot_cust avg_cust tot_sales avg_sales first_purchase_date last_purchase_date
1115 644041755 633.15 5873180623 5773.82 15706 16647

3.2. Sales by annual

annual_sales <- sqldf("
select
    substr(date, 1, 4) as year
    , count(distinct store) as tot_store
   
    , sum(customers) as tot_cust
    , round(avg(customers), 2) as avg_cust

    , count(*) as cnt_trade
   
    , sum(sales) as tot_sales
    , round(avg(sales), 2) as avg_sales
   
    , min(date) as first_purchase_date
    , max(date) as last_purchase_date
from train
group by 1;
")

summary(annual_sales)
##      year             tot_store       tot_cust          avg_cust    
##  Length:95          Min.   : 935   Min.   :2396743   Min.   :432.9  
##  Class :character   1st Qu.:1115   1st Qu.:6420442   1st Qu.:599.6  
##  Mode  :character   Median :1115   Median :6832654   Median :630.6  
##                     Mean   :1083   Mean   :6779387   Mean   :633.6  
##                     3rd Qu.:1115   3rd Qu.:7277780   3rd Qu.:669.2  
##                     Max.   :1115   Max.   :9525313   Max.   :854.3  
##    cnt_trade       tot_sales          avg_sales    first_purchase_date
##  Min.   : 4459   Min.   :20033838   Min.   :3875   Min.   :15706      
##  1st Qu.:11150   1st Qu.:57395018   1st Qu.:5300   1st Qu.:15935      
##  Median :11150   Median :61340641   Median :5730   Median :16170      
##  Mean   :10707   Mean   :61822954   Mean   :5776   Mean   :16170      
##  3rd Qu.:11150   3rd Qu.:66413716   3rd Qu.:6177   3rd Qu.:16405      
##  Max.   :11150   Max.   :95991386   Max.   :8609   Max.   :16640      
##  last_purchase_date
##  Min.   :15709     
##  1st Qu.:15944     
##  Median :16179     
##  Mean   :16179     
##  3rd Qu.:16414     
##  Max.   :16647
annual_sales$tot_sales <- as.numeric(annual_sales$tot_sales)
Sales by annual sales.
year tot_store tot_cust avg_cust cnt_trade tot_sales avg_sales first_purchase_date last_purchase_date
1570 1115 2396743 537.51 4459 20033838 4492.90 15706 15709
1571 1115 6674810 598.64 11150 61257507 5493.95 15710 15719
1572 1115 6857427 615.02 11150 60758739 5449.21 15720 15729
1573 1115 6050198 542.62 11150 52238436 4685.06 15730 15739
1574 1115 7350066 659.20 11150 66257993 5942.42 15740 15749
1575 1115 7302965 654.97 11150 66556834 5969.22 15750 15759
1576 1115 6672469 598.43 11150 58798163 5273.38 15760 15769
1577 1115 7028338 630.34 11150 60634339 5438.06 15770 15779
1578 1115 7031567 630.63 11150 66142071 5932.02 15780 15789
1579 1115 6733738 603.92 11150 60674610 5441.67 15790 15799
1580 1115 6773018 607.45 11150 61058795 5476.13 15800 15809
1581 1115 7097850 636.58 11150 59974267 5378.86 15810 15819
1582 1115 7316861 656.22 11150 67433091 6047.81 15820 15829
1583 1115 6349850 569.49 11150 55553797 4982.40 15830 15839
1584 1115 6586412 590.71 11150 56752031 5089.87 15840 15849
1585 1115 6815031 611.21 11150 64855718 5816.66 15850 15859
1586 1115 7298046 654.53 11150 62918504 5642.92 15860 15869
1587 1115 6722795 602.94 11150 60966372 5467.84 15870 15879
1588 1115 7502901 672.91 11150 67631129 6065.57 15880 15889
1589 1115 7123260 638.86 11150 61232883 5491.74 15890 15899
1590 1115 6933002 621.79 11150 63864886 5727.79 15900 15909
1591 1115 7734513 693.68 11150 72510687 6503.20 15910 15919
1592 1115 6519142 584.68 11150 56806267 5094.73 15920 15929
1593 1115 7227530 648.21 11150 62520179 5607.19 15930 15939
1594 1115 6803763 610.20 11150 62098160 5569.34 15940 15949
1595 1115 7400148 663.69 11150 65419656 5867.23 15950 15959
1596 1115 6878898 616.94 11150 57525187 5159.21 15960 15969
1597 1115 6832654 612.79 11150 61535330 5518.86 15970 15979
1598 1115 7179652 643.91 11150 66062947 5924.93 15980 15989
1599 1115 6210688 557.01 11150 52892260 4743.70 15990 15999
1600 1115 7176881 643.67 11150 62029923 5563.22 16000 16009
1601 1115 6696881 600.62 11150 62776823 5630.21 16010 16019
1602 1115 7299302 654.65 11150 65035485 5832.78 16020 16029
1603 1115 7547685 676.92 11150 68358616 6130.82 16030 16039
1604 1115 7724983 692.82 11150 75527096 6773.73 16040 16049
1605 1115 9525313 854.29 11150 95991386 8609.09 16050 16059
1606 1115 6097653 546.87 11150 55657215 4991.68 16060 16069
1607 1115 6566606 588.93 11150 60978485 5468.92 16070 16079
1608 1115 6052144 542.79 11150 51052492 4578.70 16080 16089
1609 1115 7185804 644.47 11150 66007753 5919.98 16090 16099
1610 1115 7813908 700.80 11150 73953517 6632.60 16100 16109
1611 1115 6646894 596.13 11150 57879440 5190.98 16110 16119
1612 1115 7131513 639.60 11150 61340641 5501.40 16120 16129
1613 1115 7016945 629.32 11150 64982872 5828.06 16130 16139
1614 1115 7282227 653.11 11150 64338184 5770.24 16140 16149
1615 1115 6062408 543.71 11150 51635865 4631.02 16150 16159
1616 1115 7847055 703.77 11150 72348899 6488.69 16160 16169
1617 1115 7615121 682.97 11150 71448480 6407.94 16170 16179
1618 1115 5910800 530.12 11150 51969491 4660.94 16180 16189
1619 1115 7785269 698.23 11150 74869562 6714.76 16190 16199
1620 1115 6387943 572.91 11150 54883915 4922.32 16200 16209
1621 1115 6624713 594.14 11150 59053322 5296.26 16210 16219
1622 1115 7514523 673.95 11150 71737401 6433.85 16220 16229
1623 1115 7021595 629.74 11150 64259697 5763.20 16230 16239
1624 1115 6617774 593.52 11150 57211336 5131.06 16240 16249
1625 1115 6429133 662.11 9710 60035717 6182.88 16250 16259
1626 935 6503701 695.58 9350 59033219 6313.71 16260 16269
1627 935 5253821 561.91 9350 45312901 4846.30 16270 16279
1628 935 6944748 742.75 9350 66270597 7087.76 16280 16289
1629 935 5275826 564.26 9350 44310537 4739.09 16290 16299
1630 935 6558951 701.49 9350 58408023 6246.85 16300 16309
1631 935 6724028 719.15 9350 61317731 6558.05 16310 16319
1632 935 5656856 605.01 9350 48425837 5179.23 16320 16329
1633 935 6005169 642.26 9350 49970801 5344.47 16330 16339
1634 935 5787380 618.97 9350 54484729 5827.24 16340 16349
1635 935 6279592 671.61 9350 54446950 5823.20 16350 16359
1636 935 5628443 601.97 9350 49587405 5303.47 16360 16369
1637 935 6411750 685.75 9350 57699633 6171.08 16370 16379
1638 935 6685251 715.00 9350 61545837 6582.44 16380 16389
1639 935 5703682 610.02 9350 50311781 5380.94 16390 16399
1640 935 7742634 828.09 9350 77840828 8325.22 16400 16409
1641 935 6143050 657.01 9350 57264848 6124.58 16410 16419
1642 935 7408640 792.37 9350 74134673 7928.84 16420 16429
1643 1115 4358953 432.87 10070 39025458 3875.42 16430 16439
1644 1115 7638629 685.08 11150 75031207 6729.26 16440 16449
1645 1115 6750899 605.46 11150 59036599 5294.76 16450 16459
1646 1115 7110375 637.70 11150 70158235 6292.22 16460 16469
1647 1115 7193755 645.18 11150 63931869 5733.80 16470 16479
1648 1115 6510723 583.92 11150 61199423 5488.74 16480 16489
1649 1115 7559842 678.01 11150 70804801 6350.21 16490 16499
1650 1115 6221630 557.99 11150 54707795 4906.53 16500 16509
1651 1115 7524718 674.86 11150 69751401 6255.73 16510 16519
1652 1115 7503111 672.92 11150 76201952 6834.26 16520 16529
1653 1115 6058278 543.34 11150 55338162 4963.06 16530 16539
1654 1115 7123996 638.92 11150 63175642 5665.98 16540 16549
1655 1115 6761216 606.39 11150 67567902 6059.90 16550 16559
1656 1115 6942273 622.63 11150 63893494 5730.36 16560 16569
1657 1115 6995066 627.36 11150 66850674 5995.58 16570 16579
1658 1115 7193896 645.19 11150 69526000 6235.52 16580 16589
1659 1115 6745150 604.95 11150 60427424 5419.50 16590 16599
1660 1115 6851474 614.48 11150 66229888 5939.90 16600 16609
1661 1115 7678390 688.64 11150 75271815 6750.84 16610 16619
1662 1115 6400197 574.01 11150 58556073 5251.67 16620 16629
1663 1115 7273333 652.32 11150 67660411 6068.20 16630 16639
1664 1115 5946921 666.70 8920 60139779 6742.13 16640 16647
<img sr c="rossman-e da-in-sql-r _files/figu re-html/pres sure-1.png" width=“672” />

3.3. Sales by quarter

train$Quarter <- substr(quarters(as.Date(train$Date)), 2, 2)
table(train$Quarter)
## 
##      1      2      3      4 
## 301049 304395 223165 188600
quartely_sales <- sqldf("
select
    substr(date, 1, 4) as year
    , quarter
    , count(distinct store) as tot_store
    , sum(customers) as tot_cust
    , round(avg(customers), 2) as avg_cust
    
    , sum(sales) as tot_sales
    , round(avg(sales), 2) as avg_sales
from train
group by 1, 2;
 ")
summary(quartely_sales)
##      year             Quarter            tot_store       tot_cust      
##  Length:105         Length:105         Min.   : 935   Min.   : 531392  
##  Class :character   Class :character   1st Qu.:1115   1st Qu.:6028617  
##  Mode  :character   Mode  :character   Median :1115   Median :6696881  
##                                        Mean   :1081   Mean   :6133731  
##                                        3rd Qu.:1115   3rd Qu.:7176881  
##                                        Max.   :1115   Max.   :9525313  
##     avg_cust        tot_sales          avg_sales    
##  Min.   : 350.8   Min.   : 4534864   Min.   : 3148  
##  1st Qu.: 598.4   1st Qu.:51969491   1st Qu.: 5295  
##  Median : 631.9   Median :60285966   Median : 5734  
##  Mean   : 636.1   Mean   :55935054   Mean   : 5827  
##  3rd Qu.: 669.9   3rd Qu.:65419656   3rd Qu.: 6213  
##  Max.   :1009.1   Max.   :95991386   Max.   :10819
Sales by Quarter.
year Quarter tot_store tot_cust avg_cust tot_sales avg_sales
1570 1 1115 2396743 537.51 20033838 4492.90
1571 1 1115 6674810 598.64 61257507 5493.95
1572 1 1115 6857427 615.02 60758739 5449.21
1573 1 1115 6050198 542.62 52238436 4685.06
1574 1 1115 7350066 659.20 66257993 5942.42
1575 1 1115 7302965 654.97 66556834 5969.22
1576 1 1115 6672469 598.43 58798163 5273.38
1577 1 1115 7028338 630.34 60634339 5438.06
1578 1 1115 7031567 630.63 66142071 5932.02
1579 1 1115 4229770 632.25 40168931 6004.32
1579 2 1115 2503968 561.43 20505679 4597.69
1580 2 1115 6773018 607.45 61058795 5476.13
1581 2 1115 7097850 636.58 59974267 5378.86
1582 2 1115 7316861 656.22 67433091 6047.81
1583 2 1115 6349850 569.49 55553797 4982.40
1584 2 1115 6586412 590.71 56752031 5089.87
1585 2 1115 6815031 611.21 64855718 5816.66
1586 2 1115 7298046 654.53 62918504 5642.92
1587 2 1115 6722795 602.94 60966372 5467.84
1588 2 1115 4615466 591.35 39526592 5064.27
1588 3 1115 2887435 863.21 28104537 8401.95
1589 3 1115 7123260 638.86 61232883 5491.74
1590 3 1115 6933002 621.79 63864886 5727.79
1591 3 1115 7734513 693.68 72510687 6503.20
1592 3 1115 6519142 584.68 56806267 5094.73
1593 3 1115 7227530 648.21 62520179 5607.19
1594 3 1115 6803763 610.20 62098160 5569.34
1595 3 1115 7400148 663.69 65419656 5867.23
1596 3 1115 6878898 616.94 57525187 5159.21
1597 3 1115 6028617 600.76 54858130 5466.68
1597 4 1115 804037 721.11 6677200 5988.52
1598 4 1115 7179652 643.91 66062947 5924.93
1599 4 1115 6210688 557.01 52892260 4743.70
1600 4 1115 7176881 643.67 62029923 5563.22
1601 4 1115 6696881 600.62 62776823 5630.21
1602 4 1115 7299302 654.65 65035485 5832.78
1603 4 1115 7547685 676.92 68358616 6130.82
1604 4 1115 7724983 692.82 75527096 6773.73
1605 4 1115 9525313 854.29 95991386 8609.09
1606 4 1115 6097653 546.87 55657215 4991.68
1607 1 1115 6035214 601.42 56443621 5624.68
1607 4 1115 531392 476.58 4534864 4067.14
1608 1 1115 6052144 542.79 51052492 4578.70
1609 1 1115 7185804 644.47 66007753 5919.98
1610 1 1115 7813908 700.80 73953517 6632.60
1611 1 1115 6646894 596.13 57879440 5190.98
1612 1 1115 7131513 639.60 61340641 5501.40
1613 1 1115 7016945 629.32 64982872 5828.06
1614 1 1115 7282227 653.11 64338184 5770.24
1615 1 1115 6062408 543.71 51635865 4631.02
1616 1 1115 1125153 1009.11 12062933 10818.77
1616 2 1115 6721902 669.85 60285966 6007.57
1617 2 1115 7615121 682.97 71448480 6407.94
1618 2 1115 5910800 530.12 51969491 4660.94
1619 2 1115 7785269 698.23 74869562 6714.76
1620 2 1115 6387943 572.91 54883915 4922.32
1621 2 1115 6624713 594.14 59053322 5296.26
1622 2 1115 7514523 673.95 71737401 6433.85
1623 2 1115 7021595 629.74 64259697 5763.20
1624 2 1115 6617774 593.52 57211336 5131.06
1625 2 1115 1231675 552.32 13566115 6083.46
1625 3 935 5197458 694.85 46469602 6212.51
1626 3 935 6503701 695.58 59033219 6313.71
1627 3 935 5253821 561.91 45312901 4846.30
1628 3 935 6944748 742.75 66270597 7087.76
1629 3 935 5275826 564.26 44310537 4739.09
1630 3 935 6558951 701.49 58408023 6246.85
1631 3 935 6724028 719.15 61317731 6558.05
1632 3 935 5656856 605.01 48425837 5179.23
1633 3 935 6005169 642.26 49970801 5344.47
1634 3 935 2363140 631.86 22638102 6052.97
1634 4 935 3424240 610.38 31846627 5676.76
1635 4 935 6279592 671.61 54446950 5823.20
1636 4 935 5628443 601.97 49587405 5303.47
1637 4 935 6411750 685.75 57699633 6171.08
1638 4 935 6685251 715.00 61545837 6582.44
1639 4 935 5703682 610.02 50311781 5380.94
1640 4 935 7742634 828.09 77840828 8325.22
1641 4 935 6143050 657.01 57264848 6124.58
1642 4 935 7408640 792.37 74134673 7928.84
1643 1 1115 1564487 350.78 14039097 3147.78
1643 4 935 2794466 498.12 24986361 4453.90
1644 1 1115 7638629 685.08 75031207 6729.26
1645 1 1115 6750899 605.46 59036599 5294.76
1646 1 1115 7110375 637.70 70158235 6292.22
1647 1 1115 7193755 645.18 63931869 5733.80
1648 1 1115 6510723 583.92 61199423 5488.74
1649 1 1115 7559842 678.01 70804801 6350.21
1650 1 1115 6221630 557.99 54707795 4906.53
1651 1 1115 7524718 674.86 69751401 6255.73
1652 1 1115 4326445 646.70 44090444 6590.50
1652 2 1115 3176666 712.26 32111508 7199.89
1653 2 1115 6058278 543.34 55338162 4963.06
1654 2 1115 7123996 638.92 63175642 5665.98
1655 2 1115 6761216 606.39 67567902 6059.90
1656 2 1115 6942273 622.63 63893494 5730.36
1657 2 1115 6995066 627.36 66850674 5995.58
1658 2 1115 7193896 645.19 69526000 6235.52
1659 2 1115 6745150 604.95 60427424 5419.50
1660 2 1115 6851474 614.48 66229888 5939.90
1661 2 1115 5044953 646.37 49305462 6317.16
1661 3 1115 2633437 787.28 25966353 7762.74
1662 3 1115 6400197 574.01 58556073 5251.67
1663 3 1115 7273333 652.32 67660411 6068.20
1664 3 1115 5946921 666.70 60139779 6742.13

3.4. Sales by store type

sales_by_store_type <- sqldf("
select
    storetype
    , count(distinct a.store) as tot_store
   
    , sum(customers) as tot_cust
    , round(avg(customers), 2) as avg_cust
   
    , sum(sales) as tot_sales
    , round(avg(sales), 2) as avg_sales
   
    , substr(min(date), 1, 10) as first_purchase_date
    , substr(max(date), 1, 10) as last_purchase_date
from train a
left join store b
on a.store = b.store
group by 1;
")

summary(sales_by_store_type)
##   StoreType           tot_store        tot_cust            avg_cust     
##  Length:4           Min.   : 17.0   Min.   : 31465621   Min.   : 501.4  
##  Class :character   1st Qu.:115.2   1st Qu.: 76963684   1st Qu.: 619.6  
##  Mode  :character   Median :248.0   Median :124517350   Median : 666.1  
##                     Mean   :278.8   Mean   :161010439   Mean   : 955.4  
##                     3rd Qu.:411.5   3rd Qu.:208564105   3rd Qu.:1001.9  
##                     Max.   :602.0   Max.   :363541434   Max.   :1987.7  
##    tot_sales            avg_sales     first_purchase_date
##  Min.   : 159231395   Min.   : 5642   Length:4           
##  1st Qu.: 783221426   1st Qu.: 5703   Class :character   
##  Median : 783221426   Median : 5731   Mode  :character   
##  Mean   :1468295155   Mean   : 6791                      
##  3rd Qu.:1765392943   3rd Qu.: 6818                      
##  Max.   :3165334859   Max.   :10059                      
##  last_purchase_date
##  Length:4          
##  Class :character  
##  Mode  :character  
##                    
##                    
## 
Sales by storetype.
StoreType tot_store tot_cust avg_cust tot_sales avg_sales first_purchase_date last_purchase_date
a 602 363541434 659.03 3165334859 5738.18 15706.0 16647.0
b 17 31465621 1987.72 159231395 10058.84 15706.0 16647.0
c 148 92129705 673.27 783221426 5723.63 15706.0 16647.0
d 348 156904995 501.43 1765392943 5641.82 15706.0 16647.0

3.5. Create base table

base <- sqldf("select * from train a left join store b on a.store = b.store")
head(base)
##   Store DayOfWeek       Date Sales Customers Open Promo StateHoliday
## 1     1         2 2013-01-01     0         0    0     0            a
## 2     2         2 2013-01-01     0         0    0     0            a
## 3     3         2 2013-01-01     0         0    0     0            a
## 4     4         2 2013-01-01     0         0    0     0            a
## 5     5         2 2013-01-01     0         0    0     0            a
## 6     6         2 2013-01-01     0         0    0     0            a
##   SchoolHoliday Quarter Store..11 StoreType Assortment CompetitionDistance
## 1             1       1         1         c          a                1270
## 2             1       1         2         a          a                 570
## 3             1       1         3         a          a               14130
## 4             1       1         4         c          c                 620
## 5             1       1         5         a          a               29910
## 6             1       1         6         a          a                 310
##   CompetitionOpenSinceMonth CompetitionOpenSinceYear Promo2
## 1                         9                     2008      0
## 2                        11                     2007      1
## 3                        12                     2006      1
## 4                         9                     2009      0
## 5                         4                     2015      0
## 6                        12                     2013      0
##   Promo2SinceWeek Promo2SinceYear   PromoInterval
## 1              NA              NA                
## 2              13            2010 Jan,Apr,Jul,Oct
## 3              14            2011 Jan,Apr,Jul,Oct
## 4              NA              NA                
## 5              NA              NA                
## 6              NA              NA

3.6. Sales by store type & year

sales_by_store_type_year <- sqldf("
select
    storetype
    , substr(date, 1, 4) as year
    , count(distinct a.store) as tot_store
   
    , sum(customers) as tot_cust
    , round(avg(customers), 2) as avg_cust
   
    , sum(sales) as tot_sales
    , round(avg(sales), 2) as avg_sales
   
    , substr(min(date), 1, 10) as first_purchase_date
    , substr(max(date), 1, 10) as last_purchase_date
from train a
left join store b
on a.store = b.store
group by 1, 2;
")

summary(sales_by_store_type_year)
##   StoreType             year             tot_store        tot_cust      
##  Length:380         Length:380         Min.   : 16.0   Min.   :  95689  
##  Class :character   Class :character   1st Qu.:104.8   1st Qu.: 382608  
##  Mode  :character   Mode  :character   Median :207.5   Median :1188746  
##                                        Mean   :270.7   Mean   :1694847  
##                                        3rd Qu.:390.5   3rd Qu.:2057489  
##                                        Max.   :602.0   Max.   :5421709  
##     avg_cust        tot_sales          avg_sales     first_purchase_date
##  Min.   : 346.2   Min.   :  452444   Min.   : 3666   Length:380         
##  1st Qu.: 564.0   1st Qu.: 2531426   1st Qu.: 5362   Class :character   
##  Median : 666.5   Median :11202269   Median : 5935   Mode  :character   
##  Mean   : 954.9   Mean   :15455738   Mean   : 6789                      
##  3rd Qu.:1061.7   3rd Qu.:23607748   3rd Qu.: 8265                      
##  Max.   :2333.6   Max.   :51645134   Max.   :13309                      
##  last_purchase_date
##  Length:380        
##  Class :character  
##  Mode  :character  
##                    
##                    
## 
Sales by storetype.
StoreType year tot_store tot_cust avg_cust tot_sales avg_sales first_purchase_date last_purchase_date
a 1570 602 1381681 574.03 11061535 4595.57 15706.0 15709.0
a 1571 602 3801168 631.42 33372558 5543.61 15710.0 15719.0
a 1572 602 3960231 657.85 33559157 5574.61 15720.0 15729.0
a 1573 602 3409873 566.42 28179401 4680.96 15730.0 15739.0
a 1574 602 4230332 702.71 36618070 6082.74 15740.0 15749.0
a 1575 602 4129331 685.94 35926320 5967.83 15750.0 15759.0
a 1576 602 3858562 640.96 32599743 5415.24 15760.0 15769.0
a 1577 602 4041632 671.37 33163239 5508.84 15770.0 15779.0
a 1578 602 3948851 655.96 35611874 5915.59 15780.0 15789.0
a 1579 602 3781699 628.19 32688011 5429.90 15790.0 15799.0
a 1580 602 3806589 632.32 32827650 5453.10 15800.0 15809.0
a 1581 602 4139782 687.67 33341812 5538.51 15810.0 15819.0
a 1582 602 4121731 684.67 36307578 6031.16 15820.0 15829.0
a 1583 602 3592786 596.81 30104569 5000.76 15830.0 15839.0
a 1584 602 3722802 618.41 30644657 5090.47 15840.0 15849.0
a 1585 602 3798311 630.95 34569096 5742.37 15850.0 15859.0
a 1586 602 4241696 704.60 34851646 5789.31 15860.0 15869.0
a 1587 602 3768813 626.05 32598276 5415.00 15870.0 15879.0
a 1588 602 4286730 712.08 36758114 6106.00 15880.0 15889.0
a 1589 602 4044443 671.83 33036478 5487.79 15890.0 15899.0
a 1590 602 3960302 657.86 34869393 5792.26 15900.0 15909.0
a 1591 602 4399036 730.74 39227659 6516.22 15910.0 15919.0
a 1592 602 3675399 610.53 30565692 5077.36 15920.0 15929.0
a 1593 602 4135894 687.03 33985317 5645.40 15930.0 15939.0
a 1594 602 3811301 633.11 33057368 5491.26 15940.0 15949.0
a 1595 602 4238893 704.14 35570832 5908.78 15950.0 15959.0
a 1596 602 3866472 642.27 30733566 5105.24 15960.0 15969.0
a 1597 602 3897768 647.47 33491365 5563.35 15970.0 15979.0
a 1598 602 4047020 672.26 35499462 5896.92 15980.0 15989.0
a 1599 602 3484668 578.85 28350973 4709.46 15990.0 15999.0
a 1600 602 4114440 683.46 33848825 5622.73 16000.0 16009.0
a 1601 602 3700907 614.77 33204850 5515.76 16010.0 16019.0
a 1602 602 4178903 694.17 35585370 5911.19 16020.0 16029.0
a 1603 602 4247882 705.63 36543615 6070.37 16030.0 16039.0
a 1604 602 4387935 728.89 41005733 6811.58 16040.0 16049.0
a 1605 602 5421709 900.62 51645134 8578.93 16050.0 16059.0
a 1606 602 3276255 544.23 28609040 4752.33 16060.0 16069.0
a 1607 602 3705394 615.51 32885012 5462.63 16070.0 16079.0
a 1608 602 3362299 558.52 27000278 4485.10 16080.0 16089.0
a 1609 602 4078818 677.54 35603342 5914.18 16090.0 16099.0
a 1610 602 4377430 727.15 39453138 6553.68 16100.0 16109.0
a 1611 602 3766229 625.62 31314036 5201.67 16110.0 16119.0
a 1612 602 4043967 671.76 33091993 5497.01 16120.0 16129.0
a 1613 602 3931903 653.14 34758584 5773.85 16130.0 16139.0
a 1614 602 4165343 691.92 34922933 5801.15 16140.0 16149.0
a 1615 602 3369547 559.73 27294077 4533.90 16150.0 16159.0
a 1616 602 4493914 746.50 39449044 6553.00 16160.0 16169.0
a 1617 602 4218665 700.77 37698932 6262.28 16170.0 16179.0
a 1618 602 3323533 552.08 27976421 4647.25 16180.0 16189.0
a 1619 602 4356117 723.61 39925213 6632.10 16190.0 16199.0
a 1620 602 3556459 590.77 29198742 4850.29 16200.0 16209.0
a 1621 602 3717388 617.51 31559985 5242.52 16210.0 16219.0
a 1622 602 4205684 698.62 38266113 6356.50 16220.0 16229.0
a 1623 602 3952518 656.56 34425556 5718.53 16230.0 16239.0
a 1624 602 3703239 615.16 30445315 5057.36 16240.0 16249.0
a 1625 602 3684965 689.04 33130813 6194.99 16250.0 16259.0
a 1626 518 3738232 721.67 32660630 6305.14 16260.0 16269.0
a 1627 518 2955941 570.64 24575234 4744.25 16270.0 16279.0
a 1628 518 4011050 774.33 36682098 7081.49 16280.0 16289.0
a 1629 518 2977777 574.86 24049856 4642.83 16290.0 16299.0
a 1630 518 3803164 734.20 32636743 6300.53 16300.0 16309.0
a 1631 518 3858260 744.84 33803223 6525.72 16310.0 16319.0
a 1632 518 3237196 624.94 26712729 5156.90 16320.0 16329.0
a 1633 518 3451616 666.34 27622993 5332.62 16330.0 16339.0
a 1634 518 3263718 630.06 29707671 5735.07 16340.0 16349.0
a 1635 518 3622097 699.25 30162885 5822.95 16350.0 16359.0
a 1636 518 3156001 609.27 26825544 5178.68 16360.0 16369.0
a 1637 518 3685417 711.47 31905223 6159.31 16370.0 16379.0
a 1638 518 3817077 736.89 33672524 6500.49 16380.0 16389.0
a 1639 518 3264692 630.25 27850019 5376.45 16390.0 16399.0
a 1640 518 4430513 855.31 42777239 8258.15 16400.0 16409.0
a 1641 518 3465356 668.99 31084612 6000.89 16410.0 16419.0
a 1642 518 4144230 800.04 39778185 7679.19 16420.0 16429.0
a 1643 602 2326288 421.73 20218892 3665.50 16430.0 16439.0
a 1644 602 4281848 711.27 40011342 6646.40 16440.0 16449.0
a 1645 602 3741684 621.54 31155384 5175.31 16450.0 16459.0
a 1646 602 3967209 659.00 37403276 6213.17 16460.0 16469.0
a 1647 602 4033112 669.95 34160318 5674.47 16470.0 16479.0
a 1648 602 3564756 592.15 32029931 5320.59 16480.0 16489.0
a 1649 602 4269942 709.29 38099127 6328.76 16490.0 16499.0
a 1650 602 3439258 571.31 28736542 4773.51 16500.0 16509.0
a 1651 602 4273290 709.85 37646065 6253.50 16510.0 16519.0
a 1652 602 4062614 674.85 39418472 6547.92 16520.0 16529.0
a 1653 602 3398092 564.47 29785338 4947.73 16530.0 16539.0
a 1654 602 4045966 672.09 34048808 5655.95 16540.0 16549.0
a 1655 602 3706823 615.75 35380379 5877.14 16550.0 16559.0
a 1656 602 3907752 649.13 34202095 5681.41 16560.0 16569.0
a 1657 602 3876297 643.90 35263471 5857.72 16570.0 16579.0
a 1658 602 4023189 668.30 37031211 6151.36 16580.0 16589.0
a 1659 602 3766710 625.70 32102854 5332.70 16590.0 16599.0
a 1660 602 3813574 633.48 35116092 5833.24 16600.0 16609.0
a 1661 602 4336170 720.29 40511265 6729.45 16610.0 16619.0
a 1662 602 3527516 585.97 30790107 5114.64 16620.0 16629.0
a 1663 602 4089356 679.30 36196416 6012.69 16630.0 16639.0
a 1664 602 3282412 681.56 31506561 6542.06 16640.0 16647.0
b 1570 17 95689 1407.19 452444 6653.59 15706.0 15709.0
b 1571 17 263628 1550.75 1341506 7891.21 15710.0 15719.0
b 1572 17 288121 1694.83 1412593 8309.37 15720.0 15729.0
b 1573 17 279135 1641.97 1347179 7924.58 15730.0 15739.0
b 1574 17 294783 1734.02 1472923 8664.25 15740.0 15749.0
b 1575 17 296388 1743.46 1468459 8637.99 15750.0 15759.0
b 1576 17 299284 1760.49 1478822 8698.95 15760.0 15769.0
b 1577 17 286244 1683.79 1395383 8208.14 15770.0 15779.0
b 1578 17 301171 1771.59 1540938 9064.34 15780.0 15789.0
b 1579 17 319739 1880.82 1687423 9926.02 15790.0 15799.0
b 1580 17 309201 1818.83 1513123 8900.72 15800.0 15809.0
b 1581 17 301033 1770.78 1402424 8249.55 15810.0 15819.0
b 1582 17 334942 1970.25 1684582 9909.31 15820.0 15829.0
b 1583 17 331940 1952.59 1630043 9588.49 15830.0 15839.0
b 1584 17 332452 1955.60 1607734 9457.26 15840.0 15849.0
b 1585 17 326845 1922.62 1692096 9953.51 15850.0 15859.0
b 1586 17 330021 1941.30 1550025 9117.79 15860.0 15869.0
b 1587 17 341042 2006.13 1615111 9500.65 15870.0 15879.0
b 1588 17 327926 1928.98 1606825 9451.91 15880.0 15889.0
b 1589 17 329279 1936.94 1512013 8894.19 15890.0 15899.0
b 1590 17 339893 1999.37 1612376 9484.56 15900.0 15909.0
b 1591 17 336043 1976.72 1603569 9432.76 15910.0 15919.0
b 1592 17 316927 1864.28 1514549 8909.11 15920.0 15929.0
b 1593 17 315409 1855.35 1506347 8860.86 15930.0 15939.0
b 1594 17 322913 1899.49 1609226 9466.04 15940.0 15949.0
b 1595 17 323744 1904.38 1592932 9370.19 15950.0 15959.0
b 1596 17 312378 1837.52 1482225 8718.97 15960.0 15969.0
b 1597 17 324773 1910.43 1631922 9599.54 15970.0 15979.0
b 1598 17 342066 2012.15 1773038 10429.64 15980.0 15989.0
b 1599 17 322238 1895.52 1564748 9204.40 15990.0 15999.0
b 1600 17 333769 1963.35 1614167 9495.10 16000.0 16009.0
b 1601 17 342141 2012.59 1776775 10451.62 16010.0 16019.0
b 1602 17 334274 1966.32 1665331 9796.06 16020.0 16029.0
b 1603 17 339241 1995.54 1711531 10067.83 16030.0 16039.0
b 1604 17 370624 2180.14 2025107 11912.39 16040.0 16049.0
b 1605 17 373889 2199.35 2048429 12049.58 16050.0 16059.0
b 1606 17 299368 1760.99 1693822 9963.66 16060.0 16069.0
b 1607 17 289590 1703.47 1556287 9154.63 16070.0 16079.0
b 1608 17 310913 1828.90 1528287 8989.92 16080.0 16089.0
b 1609 17 327293 1925.25 1660635 9768.44 16090.0 16099.0
b 1610 17 344002 2023.54 1790440 10532.00 16100.0 16109.0
b 1611 17 338860 1993.29 1683921 9905.42 16110.0 16119.0
b 1612 17 332683 1956.96 1607754 9457.38 16120.0 16129.0
b 1613 17 346052 2035.60 1784170 10495.12 16130.0 16139.0
b 1614 17 331995 1952.91 1619996 9529.39 16140.0 16149.0
b 1615 17 326279 1919.29 1558135 9165.50 16150.0 16159.0
b 1616 17 356075 2094.56 1787256 10513.27 16160.0 16169.0
b 1617 17 360436 2120.21 1862310 10954.76 16170.0 16179.0
b 1618 17 338301 1990.01 1655261 9736.83 16180.0 16189.0
b 1619 17 373081 2194.59 1955990 11505.82 16190.0 16199.0
b 1620 17 349863 2058.02 1698165 9989.21 16200.0 16209.0
b 1621 17 366944 2158.49 1797745 10574.97 16210.0 16219.0
b 1622 17 378428 2226.05 1929616 11350.68 16220.0 16229.0
b 1623 17 355288 2089.93 1762566 10368.04 16230.0 16239.0
b 1624 17 342976 2017.51 1642237 9660.22 16240.0 16249.0
b 1625 17 361400 2230.86 1836094 11333.91 16250.0 16259.0
b 1626 16 343642 2147.76 1674910 10468.19 16260.0 16269.0
b 1627 16 331194 2069.96 1535405 9596.28 16270.0 16279.0
b 1628 16 337117 2106.98 1728042 10800.26 16280.0 16289.0
b 1629 16 314637 1966.48 1519499 9496.87 16290.0 16299.0
b 1630 16 312581 1953.63 1577857 9861.61 16300.0 16309.0
b 1631 16 326615 2041.34 1684765 10529.78 16310.0 16319.0
b 1632 16 324682 2029.26 1603491 10021.82 16320.0 16329.0
b 1633 16 324840 2030.25 1561672 9760.45 16330.0 16339.0
b 1634 16 346241 2164.01 1835617 11472.61 16340.0 16349.0
b 1635 16 323765 2023.53 1623288 10145.55 16350.0 16359.0
b 1636 16 321085 2006.78 1614259 10089.12 16360.0 16369.0
b 1637 16 345495 2159.34 1785990 11162.44 16370.0 16379.0
b 1638 16 312800 1955.00 1645113 10281.96 16380.0 16389.0
b 1639 16 333970 2087.31 1720390 10752.44 16390.0 16399.0
b 1640 16 373380 2333.62 2129488 13309.30 16400.0 16409.0
b 1641 16 347661 2172.88 1835406 11471.29 16410.0 16419.0
b 1642 16 335248 2095.30 1941576 12134.85 16420.0 16429.0
b 1643 17 281328 1715.41 1599963 9755.87 16430.0 16439.0
b 1644 17 332201 1954.12 1775637 10444.92 16440.0 16449.0
b 1645 17 323231 1901.36 1598432 9402.54 16450.0 16459.0
b 1646 17 353161 2077.42 1920799 11298.82 16460.0 16469.0
b 1647 17 346216 2036.56 1739266 10230.98 16470.0 16479.0
b 1648 17 339689 1998.17 1782685 10486.38 16480.0 16489.0
b 1649 17 348344 2049.08 1828037 10753.16 16490.0 16499.0
b 1650 17 338798 1992.93 1702805 10016.50 16500.0 16509.0
b 1651 17 347514 2044.20 1753033 10311.96 16510.0 16519.0
b 1652 17 357067 2100.39 1997706 11751.21 16520.0 16529.0
b 1653 17 350996 2064.68 1808989 10641.11 16530.0 16539.0
b 1654 17 345186 2030.51 1692611 9956.54 16540.0 16549.0
b 1655 17 373500 2197.06 2047805 12045.91 16550.0 16559.0
b 1656 17 356708 2098.28 1852855 10899.15 16560.0 16569.0
b 1657 17 349804 2057.67 1832335 10778.44 16570.0 16579.0
b 1658 17 374942 2205.54 2006269 11801.58 16580.0 16589.0
b 1659 17 365930 2152.53 1779947 10470.28 16590.0 16599.0
b 1660 17 363034 2135.49 1903003 11194.14 16600.0 16609.0
b 1661 17 384001 2258.83 1939562 11409.19 16610.0 16619.0
b 1662 17 364083 2141.66 1792156 10542.09 16620.0 16629.0
b 1663 17 365557 2150.34 1810468 10649.81 16630.0 16639.0
b 1664 17 280366 2061.51 1495659 10997.49 16640.0 16647.0
c 1570 148 352083 594.73 2665406 4502.37 15706.0 15709.0
c 1571 148 960910 649.26 8074615 5455.82 15710.0 15719.0
c 1572 148 973353 657.67 7928935 5357.39 15720.0 15729.0
c 1573 148 882210 596.09 7007874 4735.05 15730.0 15739.0
c 1574 148 1050163 709.57 8654653 5847.74 15740.0 15749.0
c 1575 148 1069063 722.34 8975827 6064.75 15750.0 15759.0
c 1576 148 939634 634.89 7662938 5177.66 15760.0 15769.0
c 1577 148 999485 675.33 7941404 5365.81 15770.0 15779.0
c 1578 148 1041551 703.75 8972195 6062.29 15780.0 15789.0
c 1579 148 995793 672.83 8253974 5577.01 15790.0 15799.0
c 1580 148 978896 661.42 8132295 5494.79 15800.0 15809.0
c 1581 148 978278 661.00 7758436 5242.19 15810.0 15819.0
c 1582 148 1049454 709.09 9042509 6109.80 15820.0 15829.0
c 1583 148 884863 597.88 7227502 4883.45 15830.0 15839.0
c 1584 148 925372 625.25 7404165 5002.81 15840.0 15849.0
c 1585 148 1006729 680.22 8911765 6021.46 15850.0 15859.0
c 1586 148 992906 670.88 8027994 5424.32 15860.0 15869.0
c 1587 148 943956 637.81 8071549 5453.75 15870.0 15879.0
c 1588 148 1061105 716.96 8974249 6063.68 15880.0 15889.0
c 1589 148 982686 663.98 7937653 5363.28 15890.0 15899.0
c 1590 148 941245 635.98 8111579 5480.80 15900.0 15909.0
c 1591 148 1087793 735.00 9507036 6423.67 15910.0 15919.0
c 1592 148 918304 620.48 7439361 5026.60 15920.0 15929.0
c 1593 148 1012297 683.98 8167706 5518.72 15930.0 15939.0
c 1594 148 958412 647.58 8205785 5544.45 15940.0 15949.0
c 1595 148 1023015 691.23 8459669 5715.99 15950.0 15959.0
c 1596 148 984022 664.88 7716372 5213.76 15960.0 15969.0
c 1597 148 948754 641.05 7999015 5404.74 15970.0 15979.0
c 1598 148 1019709 688.99 8702411 5880.01 15980.0 15989.0
c 1599 148 884635 597.73 6996844 4727.60 15990.0 15999.0
c 1600 148 988406 667.84 7952830 5373.53 16000.0 16009.0
c 1601 148 986589 666.61 8552979 5779.04 16010.0 16019.0
c 1602 148 1020387 689.45 8408217 5681.23 16020.0 16029.0
c 1603 148 1108520 749.00 9298760 6282.95 16030.0 16039.0
c 1604 148 1117649 755.17 10124286 6840.73 16040.0 16049.0
c 1605 148 1400775 946.47 13116710 8862.64 16050.0 16059.0
c 1606 148 928356 627.27 7795266 5267.07 16060.0 16069.0
c 1607 148 943734 637.66 8031585 5426.75 16070.0 16079.0
c 1608 148 858487 580.06 6689421 4519.88 16080.0 16089.0
c 1609 148 995011 672.30 8468256 5721.79 16090.0 16099.0
c 1610 148 1124587 759.86 9829570 6641.60 16100.0 16109.0
c 1611 148 923464 623.96 7475494 5051.01 16110.0 16119.0
c 1612 148 995397 672.57 7960347 5378.61 16120.0 16129.0
c 1613 148 985517 665.89 8470969 5723.63 16130.0 16139.0
c 1614 148 996419 673.26 8194544 5536.85 16140.0 16149.0
c 1615 148 854168 577.14 6746678 4558.57 16150.0 16159.0
c 1616 148 1086382 734.04 9324503 6300.34 16160.0 16169.0
c 1617 148 1092762 738.35 9525055 6435.85 16170.0 16179.0
c 1618 148 805063 543.96 6610063 4466.26 16180.0 16189.0
c 1619 148 1093921 739.14 9823296 6637.36 16190.0 16199.0
c 1620 148 896749 605.91 7185648 4855.17 16200.0 16209.0
c 1621 148 903539 610.50 7566160 5112.27 16210.0 16219.0
c 1622 148 1045726 706.57 9361989 6325.67 16220.0 16229.0
c 1623 148 958788 647.83 8192847 5535.71 16230.0 16239.0
c 1624 148 932524 630.08 7588703 5127.50 16240.0 16249.0
c 1625 148 920119 672.60 8110632 5928.82 16250.0 16259.0
c 1626 134 958154 715.04 8251185 6157.60 16260.0 16269.0
c 1627 134 776857 579.74 6363424 4748.82 16270.0 16279.0
c 1628 134 1040447 776.45 9389870 7007.37 16280.0 16289.0
c 1629 134 795750 593.84 6307018 4706.73 16290.0 16299.0
c 1630 134 979426 730.91 8207681 6125.14 16300.0 16309.0
c 1631 134 1008328 752.48 8661892 6464.10 16310.0 16319.0
c 1632 134 822743 613.99 6670206 4977.77 16320.0 16329.0
c 1633 134 880235 656.89 6949432 5186.14 16330.0 16339.0
c 1634 134 854457 637.65 7632229 5695.69 16340.0 16349.0
c 1635 134 938879 700.66 7737758 5774.45 16350.0 16359.0
c 1636 134 862795 643.88 7151563 5336.99 16360.0 16369.0
c 1637 134 957228 714.35 8118020 6058.22 16370.0 16379.0
c 1638 134 1020127 761.29 8805137 6571.00 16380.0 16389.0
c 1639 134 851687 635.59 7089290 5290.51 16390.0 16399.0
c 1640 134 1203715 898.29 11360533 8478.01 16400.0 16409.0
c 1641 134 972237 725.55 8526046 6362.72 16410.0 16419.0
c 1642 134 1191320 889.04 11103543 8286.23 16420.0 16429.0
c 1643 148 714729 511.98 5905608 4230.38 16430.0 16439.0
c 1644 148 1089669 736.26 9783850 6610.71 16440.0 16449.0
c 1645 148 970395 655.67 7829933 5290.50 16450.0 16459.0
c 1646 148 1014520 685.49 9248319 6248.86 16460.0 16469.0
c 1647 148 1020365 689.44 8396847 5673.55 16470.0 16479.0
c 1648 148 937961 633.76 8130553 5493.62 16480.0 16489.0
c 1649 148 1059089 715.60 9194165 6212.27 16490.0 16499.0
c 1650 148 874512 590.89 7129597 4817.30 16500.0 16509.0
c 1651 148 1038358 701.59 8983591 6069.99 16510.0 16519.0
c 1652 148 1120748 757.26 10466536 7071.98 16520.0 16529.0
c 1653 148 833261 563.01 7080963 4784.43 16530.0 16539.0
c 1654 148 973858 658.01 8110929 5480.36 16540.0 16549.0
c 1655 148 945995 639.19 8847624 5978.12 16550.0 16559.0
c 1656 148 946362 639.43 8153203 5508.92 16560.0 16569.0
c 1657 148 980123 662.25 8736657 5903.15 16570.0 16579.0
c 1658 148 999354 675.24 9013448 6090.17 16580.0 16589.0
c 1659 148 942154 636.59 7910369 5344.84 16590.0 16599.0
c 1660 148 944520 638.19 8532388 5765.13 16600.0 16609.0
c 1661 148 1048094 708.17 9630741 6507.26 16610.0 16619.0
c 1662 148 896768 605.92 7677641 5187.60 16620.0 16629.0
c 1663 148 1007333 680.63 8794387 5942.15 16630.0 16639.0
c 1664 148 847437 715.74 8002726 6759.06 16640.0 16647.0
d 1570 348 567290 407.54 5854453 4205.79 15706.0 15709.0
d 1571 348 1649104 473.88 18468828 5307.13 15710.0 15719.0
d 1572 348 1635722 470.04 17858054 5131.62 15720.0 15729.0
d 1573 348 1478980 424.99 15703982 4512.64 15730.0 15739.0
d 1574 348 1774788 510.00 19512347 5607.00 15740.0 15749.0
d 1575 348 1808183 519.59 20186228 5800.64 15750.0 15759.0
d 1576 348 1574989 452.58 17056660 4901.34 15760.0 15769.0
d 1577 348 1700977 488.79 18134313 5211.01 15770.0 15779.0
d 1578 348 1739994 500.00 20017064 5752.03 15780.0 15789.0
d 1579 348 1636507 470.26 18045202 5185.40 15790.0 15799.0
d 1580 348 1678332 482.28 18585727 5340.73 15800.0 15809.0
d 1581 348 1678757 482.40 17471595 5020.57 15810.0 15819.0
d 1582 348 1810734 520.33 20398422 5861.62 15820.0 15829.0
d 1583 348 1540261 442.60 16591683 4767.73 15830.0 15839.0
d 1584 348 1605786 461.43 17095475 4912.49 15840.0 15849.0
d 1585 348 1683146 483.66 19682761 5655.97 15850.0 15859.0
d 1586 348 1733423 498.11 18488839 5312.88 15860.0 15869.0
d 1587 348 1668984 479.59 18681436 5368.23 15870.0 15879.0
d 1588 348 1827140 525.04 20291941 5831.02 15880.0 15889.0
d 1589 348 1766852 507.72 18746739 5386.99 15890.0 15899.0
d 1590 348 1691562 486.08 19271538 5537.80 15900.0 15909.0
d 1591 348 1911641 549.32 22172423 6371.39 15910.0 15919.0
d 1592 348 1608512 462.22 17286665 4967.43 15920.0 15929.0
d 1593 348 1763930 506.88 18860809 5419.77 15930.0 15939.0
d 1594 348 1711137 491.71 19225781 5524.65 15940.0 15949.0
d 1595 348 1814496 521.41 19796223 5688.57 15950.0 15959.0
d 1596 348 1716026 493.11 17593024 5055.47 15960.0 15969.0
d 1597 348 1661359 477.40 18413028 5291.10 15970.0 15979.0
d 1598 348 1770857 508.87 20088036 5772.42 15980.0 15989.0
d 1599 348 1519147 436.54 15979695 4591.87 15990.0 15999.0
d 1600 348 1740266 500.08 18614101 5348.88 16000.0 16009.0
d 1601 348 1667244 479.09 19242219 5529.37 16010.0 16019.0
d 1602 348 1765738 507.40 19376567 5567.98 16020.0 16029.0
d 1603 348 1852042 532.20 20804710 5978.36 16030.0 16039.0
d 1604 348 1848775 531.26 22371970 6428.73 16040.0 16049.0
d 1605 348 2328940 669.24 29181113 8385.38 16050.0 16059.0
d 1606 348 1593674 457.95 17559087 5045.71 16060.0 16069.0
d 1607 348 1627888 467.78 18505601 5317.70 16070.0 16079.0
d 1608 348 1520445 436.91 15834506 4550.15 16080.0 16089.0
d 1609 348 1784682 512.84 20275520 5826.30 16090.0 16099.0
d 1610 348 1967889 565.49 22880369 6574.82 16100.0 16109.0
d 1611 348 1618341 465.04 17405989 5001.72 16110.0 16119.0
d 1612 348 1759466 505.59 18680547 5367.97 16120.0 16129.0
d 1613 348 1753473 503.87 19969149 5738.26 16130.0 16139.0
d 1614 348 1788470 513.93 19600711 5632.39 16140.0 16149.0
d 1615 348 1512414 434.60 16036975 4608.33 16150.0 16159.0
d 1616 348 1910684 549.05 21788096 6260.95 16160.0 16169.0
d 1617 348 1943258 558.41 22362183 6425.91 16170.0 16179.0
d 1618 348 1443903 414.91 15727746 4519.47 16180.0 16189.0
d 1619 348 1962150 563.84 23165063 6656.63 16190.0 16199.0
d 1620 348 1584872 455.42 16801360 4827.98 16200.0 16209.0
d 1621 348 1636842 470.36 18129432 5209.61 16210.0 16219.0
d 1622 348 1884685 541.58 22179683 6373.47 16220.0 16229.0
d 1623 348 1755001 504.31 19878728 5712.28 16230.0 16239.0
d 1624 348 1639035 470.99 17535081 5038.82 16240.0 16249.0
d 1625 348 1462649 516.47 16958178 5988.06 16250.0 16259.0
d 1626 267 1463673 548.19 16446494 6159.74 16260.0 16269.0
d 1627 267 1189829 445.63 12838838 4808.55 16270.0 16279.0
d 1628 267 1556134 582.82 18470587 6917.82 16280.0 16289.0
d 1629 267 1187662 444.82 12434164 4656.99 16290.0 16299.0
d 1630 267 1463780 548.23 15985742 5987.17 16300.0 16309.0
d 1631 267 1530825 573.34 17167851 6429.91 16310.0 16319.0
d 1632 267 1272235 476.49 13439411 5033.49 16320.0 16329.0
d 1633 267 1348478 505.05 13836704 5182.29 16330.0 16339.0
d 1634 267 1322964 495.49 15309212 5733.79 16340.0 16349.0
d 1635 267 1394851 522.42 14923019 5589.15 16350.0 16359.0
d 1636 267 1288562 482.61 13996039 5241.96 16360.0 16369.0
d 1637 267 1423610 533.19 15890400 5951.46 16370.0 16379.0
d 1638 267 1535247 575.00 17423063 6525.49 16380.0 16389.0
d 1639 267 1253333 469.41 13652082 5113.14 16390.0 16399.0
d 1640 267 1735026 649.82 21573568 8079.99 16400.0 16409.0
d 1641 267 1357796 508.54 15818784 5924.64 16410.0 16419.0
d 1642 267 1737842 650.88 21311369 7981.79 16420.0 16429.0
d 1643 348 1036608 346.23 11300995 3774.55 16430.0 16439.0
d 1644 348 1934911 556.01 23460378 6741.49 16440.0 16449.0
d 1645 348 1715589 492.99 18452850 5302.54 16450.0 16459.0
d 1646 348 1775485 510.20 21585841 6202.83 16460.0 16469.0
d 1647 348 1794062 515.54 19635438 5642.37 16470.0 16479.0
d 1648 348 1668317 479.40 19256254 5533.41 16480.0 16489.0
d 1649 348 1882467 540.94 21683472 6230.88 16490.0 16499.0
d 1650 348 1569062 450.88 17138851 4924.96 16500.0 16509.0
d 1651 348 1865556 536.08 21368712 6140.43 16510.0 16519.0
d 1652 348 1962682 563.99 24319238 6988.29 16520.0 16529.0
d 1653 348 1475929 424.12 16662872 4788.18 16530.0 16539.0
d 1654 348 1758986 505.46 19323294 5552.67 16540.0 16549.0
d 1655 348 1734898 498.53 21292094 6118.42 16550.0 16559.0
d 1656 348 1731451 497.54 19685341 5656.71 16560.0 16569.0
d 1657 348 1788842 514.04 21018211 6039.72 16570.0 16579.0
d 1658 348 1796411 516.21 21475072 6171.00 16580.0 16589.0
d 1659 348 1670356 479.99 18634254 5354.67 16590.0 16599.0
d 1660 348 1730346 497.23 20678405 5942.07 16600.0 16609.0
d 1661 348 1910125 548.89 23190247 6663.86 16610.0 16619.0
d 1662 348 1611830 463.17 18296169 5257.52 16620.0 16629.0
d 1663 348 1811087 520.43 20859140 5994.01 16630.0 16639.0
d 1664 348 1536706 551.98 19134833 6873.14 16640.0 16647.0

3.7. Sales by distance (Unit : 5,000)

경쟁사와 가까이 있으면 매출이 잘 안 나오는지?

base['CompetitionDistanceGroup'] <- trunc(base$CompetitionDistance / 5000) * 5000
# base['CompetitionDistanceGroup'] <- round(base$CompetitionDistance / 5000, 1) * 5000
# base['CompetitionDistanceGroup'] <- ceiling(base$CompetitionDistance / 5000) * 5000
#base %>% select(CompetitionDistance, CompetitionDistanceGroup) %>% head(10)

sales_by_distance <- sqldf("
select
    CompetitionDistanceGroup as competition_distance_group
    , count(distinct store) as tot_store
   
    , sum(sales) as tot_sales
    , round(avg(sales), 2) as avg_sales
   
    , sum(customers) as tot_cust
    , round(avg(customers), 2) as avg_cust
   
    , substr(min(date), 1, 10) as first_purchase_date
    , substr(max(date), 1, 10) as last_purchase_date
from base
group by 1;
")

summary(sales_by_distance)
##  competition_distance_group   tot_store        tot_sales         
##  Min.   :    0              Min.   :  1.00   Min.   :   5718883  
##  1st Qu.:13750              1st Qu.:  3.00   1st Qu.:  13614827  
##  Median :27500              Median :  7.00   Median :  36389933  
##  Mean   :29583              Mean   : 85.77   Mean   : 451783124  
##  3rd Qu.:41250              3rd Qu.: 65.00   3rd Qu.: 324222806  
##  Max.   :75000              Max.   :762.00   Max.   :4059521480  
##  NA's   :1                                                       
##    avg_sales       tot_cust            avg_cust     first_purchase_date
##  Min.   :4455   Min.   :   702496   Min.   :457.5   Length:13          
##  1st Qu.:5158   1st Qu.:  1379518   1st Qu.:504.9   Class :character   
##  Median :5575   Median :  3826739   Median :542.4   Mode  :character   
##  Mean   :5686   Mean   : 49541673   Mean   :597.8                      
##  3rd Qu.:6071   3rd Qu.: 29891814   3rd Qu.:678.2                      
##  Max.   :7321   Max.   :471232154   Max.   :861.2                      
##                                                                        
##  last_purchase_date
##  Length:13         
##  Class :character  
##  Mode  :character  
##                    
##                    
##                    
## 
Sales by distance(group).
competition_distance_group tot_store tot_sales avg_sales tot_cust avg_cust first_purchase_date last_purchase_date
NA 3 11983829 4535.89 1208630 457.47 15706.0 16647.0
0 762 4059521480 5842.69 471232154 678.22 15706.0 16647.0
5000 163 828389324 5575.45 77488707 521.54 15706.0 16647.0
10000 65 324222806 5476.18 29891814 504.88 15706.0 16647.0
15000 65 335727649 5670.50 32114869 542.43 15706.0 16647.0
20000 25 152528996 6685.76 15082188 661.09 15706.0 16647.0
25000 13 62219790 5158.33 6481443 537.34 15706.0 16647.0
30000 7 36389933 5518.64 3826739 580.34 15706.0 16647.0
35000 4 16785669 4454.80 1797685 477.09 15706.0 16647.0
40000 3 19180616 6787.20 2024284 716.31 15706.0 16647.0
45000 3 13614827 4817.70 1379518 488.15 15706.0 16647.0
55000 1 5718883 6071.00 702496 745.75 15706.0 16647.0
75000 1 6896821 7321.47 811228 861.18 15706.0 16647.0

3.8. Does distance affect sales?

  • 거리가 매출에 영향을 미치는지
  • EDA를 하다가 궁금해서 linear regression
x <- sqldf("
select 
    CompetitionDistance
    , Sales
from base
")

lm <- lm(x$Sales~x$CompetitionDistance)
summary(lm)
## 
## Call:
## lm(formula = x$Sales ~ x$CompetitionDistance)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -5829  -2051    -27   2084  35738 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            5.829e+03  4.674e+00 1247.03   <2e-16 ***
## x$CompetitionDistance -9.598e-03  4.955e-04  -19.37   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3850 on 1014565 degrees of freedom
##   (2642 observations deleted due to missingness)
## Multiple R-squared:  0.0003698,  Adjusted R-squared:  0.0003688 
## F-statistic: 375.3 on 1 and 1014565 DF,  p-value: < 2.2e-16
  • 거리가 멀수록 매출이 덜 나온다.
  • 대략 -10 10의 -3승 0.01
  • sales = - 0.01 * distance * 5829
  • 번화가는 매장들이 붙어 있지만 매출이 많으니까…
ggplot(x, aes(x = x$CompetitionDistance, y = x$Sales)) +
    geom_point(alpha = 0.02) +
    geom_smooth(method = 'lm', col = "red")

  • 기울기가 0
    • 예측하는 문제라면 boost, 뉴럴넷
    • sales
  • 이상치가 조금 있긴 하지만
    • 거리랑 매출의 관계를 찾은 것
    • 거리가 커질수록 세일즈가 줄어든다.
    • 이런 식으로 인사이트를 찾는 것

3.9. Sales by distance & store type

sales_by_store_type_distance <- sqldf("
select
    storetype
    , CompetitionDistanceGroup as competition_distance_group
    , count(distinct store) as tot_store
   
    , sum(sales) as tot_sales
    , round(avg(sales), 2) as avg_sales
   
    , sum(customers) as tot_cust
    , round(avg(customers), 2) as avg_cust
   
    , substr(min(date), 1, 10) as first_purchase_date
    , substr(max(date), 1, 10) as last_purchase_date
from base
group by 1, 2;
")

summary(sales_by_store_type_distance)
##   StoreType         competition_distance_group   tot_store     
##  Length:32          Min.   :    0              Min.   :  1.00  
##  Class :character   1st Qu.:10000              1st Qu.:  1.00  
##  Mode  :character   Median :20000              Median :  3.50  
##                     Mean   :23000              Mean   : 34.84  
##                     3rd Qu.:33750              3rd Qu.: 26.50  
##                     Max.   :75000              Max.   :450.00  
##                     NA's   :2                                  
##    tot_sales            avg_sales        tot_cust            avg_cust     
##  Min.   :   3385282   Min.   : 3594   Min.   :   337120   Min.   : 357.9  
##  1st Qu.:   6599117   1st Qu.: 5272   1st Qu.:   632957   1st Qu.: 514.1  
##  Median :  19117000   Median : 5652   Median :  2117185   Median : 555.8  
##  Mean   : 183536894   Mean   : 5789   Mean   : 20126305   Mean   : 618.1  
##  3rd Qu.: 159231395   3rd Qu.: 6563   3rd Qu.: 14296809   3rd Qu.: 682.9  
##  Max.   :2383432599   Max.   :10059   Max.   :282162255   Max.   :1987.7  
##                                                                           
##  first_purchase_date last_purchase_date
##  Length:32           Length:32         
##  Class :character    Class :character  
##  Mode  :character    Mode  :character  
##                                        
##                                        
##                                        
## 
Sales by storetype and distance.
StoreType competition_distance_group tot_store tot_sales avg_sales tot_cust avg_cust first_purchase_date last_purchase_date
a NA 1 3385282 3593.72 337120 357.88 15706.0 16647.0
a 0 450 2383432599 5788.47 282162255 685.27 15706.0 16647.0
a 5000 56 282703073 5532.78 29054203 568.62 15706.0 16647.0
a 10000 25 123032160 5392.84 12439092 545.24 15706.0 16647.0
a 15000 33 171557353 5618.57 17564543 575.25 15706.0 16647.0
a 20000 13 86263837 7151.70 9140240 757.77 15706.0 16647.0
a 25000 11 49206113 4834.56 5439905 534.48 15706.0 16647.0
a 30000 4 19117000 5073.51 2210086 586.54 15706.0 16647.0
a 35000 3 11286129 3993.68 1195430 423.01 15706.0 16647.0
a 40000 3 19180616 6787.20 2024284 716.31 15706.0 16647.0
a 45000 1 3554993 3773.88 460552 488.91 15706.0 16647.0
a 55000 1 5718883 6071.00 702496 745.75 15706.0 16647.0
a 75000 1 6896821 7321.47 811228 861.18 15706.0 16647.0
b 0 17 159231395 10058.84 31465621 1987.72 15706.0 16647.0
c 0 123 653916153 5753.37 78575720 691.33 15706.0 16647.0
c 5000 14 69411008 5337.67 7748849 595.88 15706.0 16647.0
c 10000 5 25437380 5400.72 2459591 522.21 15706.0 16647.0
c 15000 1 4239004 4500.00 485624 515.52 15706.0 16647.0
c 20000 2 11128629 6546.25 1161711 683.36 15706.0 16647.0
c 25000 1 6228961 6612.49 507968 539.24 15706.0 16647.0
c 30000 1 6599117 7005.43 643191 682.79 15706.0 16647.0
c 45000 1 6261174 6646.68 547051 580.73 15706.0 16647.0
d NA 2 8598547 5057.97 871510 512.65 15706.0 16647.0
d 0 172 862941333 5619.57 79028558 514.64 15706.0 16647.0
d 5000 93 476275243 5637.86 40685655 481.61 15706.0 16647.0
d 10000 35 175753266 5547.42 14993131 473.24 15706.0 16647.0
d 15000 31 159931292 5767.45 14064702 507.20 15706.0 16647.0
d 20000 10 55136530 6091.09 4780237 528.09 15706.0 16647.0
d 25000 1 6784716 7202.46 533570 566.42 15706.0 16647.0
d 30000 2 10673816 5665.51 973462 516.70 15706.0 16647.0
d 35000 1 5499540 5838.15 602255 639.34 15706.0 16647.0
d 45000 1 3798660 4032.55 371915 394.81 15706.0 16647.0

3.10. Sales by competition open since year

competition_open_since_year_sales <- sqldf("
select
    CompetitionOpenSinceYear as competition_open_since_year
    , count(distinct store) as tot_store
   
    , sum(sales) as tot_sales
    , round(avg(sales), 2) as avg_sales
   
    , sum(customers) as tot_cust
    , round(avg(customers), 2) as avg_cust
   
    , min(date) as first_purchase_date
    , max(date) as last_purchase_date
from base
group by 1;
")

summary(competition_open_since_year_sales)
##  competition_open_since_year   tot_store        tot_sales        
##  Min.   :1900                Min.   :  1.00   Min.   :5.222e+06  
##  1st Qu.:1998                1st Qu.:  7.25   1st Qu.:3.838e+07  
##  Median :2004                Median : 32.50   Median :1.648e+08  
##  Mean   :1998                Mean   : 46.46   Mean   :2.447e+08  
##  3rd Qu.:2010                3rd Qu.: 54.25   3rd Qu.:2.898e+08  
##  Max.   :2015                Max.   :354.00   Max.   :1.862e+09  
##  NA's   :1                                                       
##    avg_sales       tot_cust            avg_cust     first_purchase_date
##  Min.   :5225   Min.   :   507202   Min.   :550.4   Min.   :15706      
##  1st Qu.:5623   1st Qu.:  5136704   1st Qu.:601.2   1st Qu.:15706      
##  Median :5773   Median : 18711883   Median :629.3   Median :15706      
##  Mean   :5863   Mean   : 26835073   Mean   :640.8   Mean   :15706      
##  3rd Qu.:6013   3rd Qu.: 31904140   3rd Qu.:669.1   3rd Qu.:15706      
##  Max.   :6889   Max.   :204627267   Max.   :818.7   Max.   :15706      
##                                                                        
##  last_purchase_date
##  Min.   :16647     
##  1st Qu.:16647     
##  Median :16647     
##  Mean   :16647     
##  3rd Qu.:16647     
##  Max.   :16647     
## 
Sales by competition open since year.
competition_open_since_year tot_store tot_sales avg_sales tot_cust avg_cust first_purchase_date last_purchase_date
NA 354 1862040952 5758.63 204627267 632.84 15706 16647
1900 1 5222211 6889.46 507202 669.13 15706 16647
1961 1 6086195 6460.93 547187 580.88 15706 16647
1990 5 27862226 5915.55 3209038 681.32 15706 16647
1994 2 11280525 5987.54 1282646 680.81 15706 16647
1995 2 10472100 6160.06 935723 550.43 15706 16647
1998 1 5737921 6091.21 770860 818.32 15706 16647
1999 8 41892450 5698.10 6019094 818.70 15706 16647
2000 10 60175681 6515.34 5779259 625.73 15706 16647
2001 16 78278080 5323.59 8358755 568.47 15706 16647
2002 27 136190565 5473.46 16122104 647.94 15706 16647
2003 19 97218299 5788.87 9835892 585.68 15706 16647
2004 22 103469923 5224.70 11071731 559.07 15706 16647
2005 62 320132499 5659.65 34012056 601.30 15706 16647
2006 47 247721540 5787.62 28638014 669.08 15706 16647
2007 48 238392636 5449.72 26285748 600.90 15706 16647
2008 54 271344522 5597.50 30100113 620.93 15706 16647
2009 54 293110384 5933.89 32247157 652.83 15706 16647
2010 55 288653207 5631.38 31120758 607.14 15706 16647
2011 54 279859156 5665.62 31789801 643.57 15706 16647
2012 82 424676043 5715.77 46030586 619.53 15706 16647
2013 83 489038371 6483.68 52757749 699.46 15706 16647
2014 70 380853414 5975.86 40691353 638.48 15706 16647
2015 38 193471723 5518.30 21301662 607.58 15706 16647

3.11. Sales by competition open since year & distance

competition_open_since_year_distance_sales <- sqldf("
select
    CompetitionOpenSinceYear as competition_open_since_year
    , CompetitionDistanceGroup as competition_distance_group

    , count(distinct store) as tot_store

    , sum(sales) as tot_sales
    , round(avg(sales), 2) as avg_sales

    , sum(customers) as tot_cust
    , round(avg(customers), 2) as avg_cust

    , substr(min(date), 1, 10) as first_purchase_date
    , substr(max(date), 1, 10) as last_purchase_date
from base
group by 1, 2;
")

summary(competition_open_since_year_distance_sales)
##  competition_open_since_year competition_distance_group   tot_store     
##  Min.   :1900                Min.   :    0              Min.   :  1.00  
##  1st Qu.:2003                1st Qu.: 5000              1st Qu.:  1.00  
##  Median :2008                Median :10000              Median :  3.00  
##  Mean   :2006                Mean   :14227              Mean   : 10.05  
##  3rd Qu.:2012                3rd Qu.:20000              3rd Qu.:  6.00  
##  Max.   :2015                Max.   :75000              Max.   :233.00  
##  NA's   :9                   NA's   :1                                  
##    tot_sales           avg_sales       tot_cust            avg_cust     
##  Min.   :2.319e+06   Min.   :2461   Min.   :   246036   Min.   : 274.8  
##  1st Qu.:5.979e+06   1st Qu.:5126   1st Qu.:   568147   1st Qu.: 494.4  
##  Median :1.323e+07   Median :5705   Median :  1299613   Median : 566.7  
##  Mean   :5.291e+07   Mean   :5648   Mean   :  5802178   Mean   : 571.4  
##  3rd Qu.:3.345e+07   3rd Qu.:6232   3rd Qu.:  3620793   3rd Qu.: 656.1  
##  Max.   :1.236e+09   Max.   :9303   Max.   :145176448   Max.   :1013.9  
##                                                                         
##  first_purchase_date last_purchase_date
##  Length:111          Length:111        
##  Class :character    Class :character  
##  Mode  :character    Mode  :character  
##                                        
##                                        
##                                        
## 
Sales by competition open since year & distance.
competition_open_since_year competition_distance_group tot_store tot_sales avg_sales tot_cust avg_cust first_purchase_date last_purchase_date
NA NA 3 11983829 4535.89 1208630 457.47 15706.0 16647.0
NA 0 233 1236273942 5817.92 145176448 683.20 15706.0 16647.0
NA 5000 56 283087701 5540.31 26156555 511.91 15706.0 16647.0
NA 10000 23 114937050 5396.61 10713115 503.01 15706.0 16647.0
NA 15000 24 129843392 5986.88 12774831 589.03 15706.0 16647.0
NA 20000 9 57741496 6810.74 5192859 612.51 15706.0 16647.0
NA 25000 4 16140805 4283.65 2009532 533.32 15706.0 16647.0
NA 35000 1 5135916 5452.14 584069 620.03 15706.0 16647.0
NA 75000 1 6896821 7321.47 811228 861.18 15706.0 16647.0
1900 0 1 5222211 6889.46 507202 669.13 15706.0 16647.0
1961 5000 1 6086195 6460.93 547187 580.88 15706.0 16647.0
1990 0 5 27862226 5915.55 3209038 681.32 15706.0 16647.0
1994 5000 2 11280525 5987.54 1282646 680.81 15706.0 16647.0
1995 0 2 10472100 6160.06 935723 550.43 15706.0 16647.0
1998 0 1 5737921 6091.21 770860 818.32 15706.0 16647.0
1999 0 6 30804788 5450.25 5031896 890.29 15706.0 16647.0
1999 20000 1 4477826 5907.42 434973 573.84 15706.0 16647.0
1999 25000 1 6609836 7016.81 552225 586.23 15706.0 16647.0
2000 0 6 33835243 6187.86 3596984 657.82 15706.0 16647.0
2000 5000 2 13517765 7175.03 1063559 564.52 15706.0 16647.0
2000 10000 1 7317626 7768.18 624924 663.40 15706.0 16647.0
2000 15000 1 5505047 5844.00 493792 524.20 15706.0 16647.0
2001 0 9 46456969 5479.71 5340571 629.93 15706.0 16647.0
2001 5000 5 24180618 5342.60 2275985 502.87 15706.0 16647.0
2001 10000 1 3713936 4899.65 313767 413.94 15706.0 16647.0
2001 15000 1 3926557 4168.32 428432 454.81 15706.0 16647.0
2002 0 16 84230349 5588.53 11316111 750.80 15706.0 16647.0
2002 5000 3 10123612 4451.90 881688 387.73 15706.0 16647.0
2002 10000 1 2873358 3050.27 301609 320.18 15706.0 16647.0
2002 15000 4 19241000 5106.42 1696492 450.24 15706.0 16647.0
2002 20000 1 4875320 5175.50 587900 624.10 15706.0 16647.0
2002 25000 1 6228961 6612.49 507968 539.24 15706.0 16647.0
2002 40000 1 8617965 9148.58 830336 881.46 15706.0 16647.0
2003 0 10 54414615 6266.08 5968487 687.30 15706.0 16647.0
2003 5000 5 18117349 4002.95 1819303 401.97 15706.0 16647.0
2003 10000 3 19814992 7500.00 1556415 589.10 15706.0 16647.0
2003 15000 1 4871343 5171.28 491687 521.96 15706.0 16647.0
2004 0 12 55223519 5225.54 6445182 609.88 15706.0 16647.0
2004 5000 5 24237320 5355.13 2180389 481.75 15706.0 16647.0
2004 10000 1 5018882 5327.90 474412 503.62 15706.0 16647.0
2004 15000 3 16671567 5899.35 1650253 583.95 15706.0 16647.0
2004 25000 1 2318635 2461.40 321495 341.29 15706.0 16647.0
2005 0 42 221739076 5793.16 24850902 649.26 15706.0 16647.0
2005 5000 9 46698555 5630.40 4560262 549.83 15706.0 16647.0
2005 10000 6 26822750 4905.40 2231186 408.04 15706.0 16647.0
2005 15000 3 11018057 4170.35 1057130 400.12 15706.0 16647.0
2005 20000 1 7069345 7504.61 779006 826.97 15706.0 16647.0
2005 25000 1 6784716 7202.46 533570 566.42 15706.0 16647.0
2006 0 36 188474732 5777.18 22687743 695.43 15706.0 16647.0
2006 5000 6 34863708 6168.38 3644602 644.83 15706.0 16647.0
2006 10000 2 10632856 5643.77 951371 504.97 15706.0 16647.0
2006 15000 3 13750244 5204.48 1354298 512.60 15706.0 16647.0
2007 0 36 182124772 5459.38 21314895 638.94 15706.0 16647.0
2007 5000 7 32038862 5145.98 2992574 480.66 15706.0 16647.0
2007 10000 1 3330211 4393.42 246036 324.59 15706.0 16647.0
2007 15000 3 16717700 6327.67 1303671 493.44 15706.0 16647.0
2007 20000 1 4181091 5515.95 428572 565.40 15706.0 16647.0
2008 0 41 200292655 5472.78 23226626 634.64 15706.0 16647.0
2008 5000 3 17823296 6746.14 1543775 584.32 15706.0 16647.0
2008 10000 1 5369397 7083.64 435346 574.34 15706.0 16647.0
2008 15000 3 14570826 5155.99 1435980 508.13 15706.0 16647.0
2008 20000 2 17527345 9303.26 1910272 1013.94 15706.0 16647.0
2008 30000 1 2914087 3093.51 269319 285.90 15706.0 16647.0
2008 35000 1 2816323 2989.73 258896 274.84 15706.0 16647.0
2008 40000 1 3769419 4001.51 472848 501.96 15706.0 16647.0
2008 45000 1 6261174 6646.68 547051 580.73 15706.0 16647.0
2009 0 44 242344696 5979.69 27573111 680.35 15706.0 16647.0
2009 5000 5 26555264 5867.27 2396637 529.53 15706.0 16647.0
2009 10000 3 13230162 5007.63 1308575 495.30 15706.0 16647.0
2009 15000 2 10980262 6458.98 968834 569.90 15706.0 16647.0
2010 0 33 177440595 5776.44 19949399 649.44 15706.0 16647.0
2010 5000 9 46398548 5472.82 4382401 516.91 15706.0 16647.0
2010 10000 3 11701416 4140.63 1146125 405.56 15706.0 16647.0
2010 15000 3 14859632 5258.19 1450715 513.35 15706.0 16647.0
2010 20000 2 11267014 5980.37 1285623 682.39 15706.0 16647.0
2010 25000 1 3509384 4629.79 307197 405.27 15706.0 16647.0
2010 30000 4 23476618 6230.52 2599298 689.83 15706.0 16647.0
2011 0 38 194779996 5644.49 23079739 668.82 15706.0 16647.0
2011 5000 10 54816894 5819.20 5549047 589.07 15706.0 16647.0
2011 10000 3 14409042 5098.74 1505351 532.68 15706.0 16647.0
2011 15000 2 12068834 6405.96 1299613 689.82 15706.0 16647.0
2011 20000 1 3784390 4992.60 356051 469.72 15706.0 16647.0
2012 0 62 325908310 5799.60 36775305 654.42 15706.0 16647.0
2012 5000 12 60460643 5623.20 5336631 496.34 15706.0 16647.0
2012 10000 2 10482577 5564.00 1017463 540.05 15706.0 16647.0
2012 15000 3 12053961 4562.44 1228064 464.82 15706.0 16647.0
2012 20000 2 10271012 5451.70 1070868 568.40 15706.0 16647.0
2012 35000 1 5499540 5838.15 602255 639.34 15706.0 16647.0
2013 0 50 308321589 6812.23 35034323 774.07 15706.0 16647.0
2013 5000 11 58191800 5822.67 5663153 566.66 15706.0 16647.0
2013 10000 6 33066227 6257.80 3010005 569.65 15706.0 16647.0
2013 15000 5 28810475 6365.55 2604942 575.55 15706.0 16647.0
2013 20000 4 26666825 7077.18 2584335 685.86 15706.0 16647.0
2013 25000 1 6654188 7063.89 808163 857.92 15706.0 16647.0
2013 30000 1 4127609 4381.75 444300 471.66 15706.0 16647.0
2013 35000 1 3333890 3539.16 352465 374.17 15706.0 16647.0
2013 40000 1 6793232 7211.50 721100 765.50 15706.0 16647.0
2013 45000 2 7353653 3903.21 832467 441.86 15706.0 16647.0
2013 55000 1 5718883 6071.00 702496 745.75 15706.0 16647.0
2014 0 52 283015311 5979.87 31906002 674.15 15706.0 16647.0
2014 5000 10 51637327 5704.52 4446098 491.17 15706.0 16647.0
2014 10000 4 25854624 7213.90 2381142 664.38 15706.0 16647.0
2014 15000 1 4144086 4399.24 421584 447.54 15706.0 16647.0
2014 25000 2 10330447 5483.25 1022705 542.84 15706.0 16647.0
2014 30000 1 5871619 6233.14 513822 545.46 15706.0 16647.0
2015 0 27 144545865 5809.25 16535607 664.56 15706.0 16647.0
2015 5000 2 8273342 4391.37 766215 406.70 15706.0 16647.0
2015 10000 4 15647700 4152.79 1674972 444.53 15706.0 16647.0
2015 15000 3 16694666 5907.53 1454551 514.70 15706.0 16647.0
2015 20000 1 4667332 6157.43 451729 595.95 15706.0 16647.0
2015 25000 1 3642818 3867.11 418588 444.36 15706.0 16647.0

3.12. Which stores have big differences in sales due to school holidays?

School Holiday 여부에 따른 매출액 증감이 큰 매장은? (Top 10)

table(base$SchoolHoliday)
## 
##      0      1 
## 835488 181721
Desc(base$SchoolHoliday)
## ------------------------------------------------------------------------- 
## base$SchoolHoliday (integer - dichotomous)
## 
##      length         n       NAs    unique
##   1'017'209 1'017'209         0         2
##                100.0%      0.0%          
## 
##       freq   perc  lci.95  uci.95'
## 0  835'488  82.1%   82.1%   82.2%
## 1  181'721  17.9%   17.8%   17.9%
## 
## ' 95%-CI Wilson

school_holiday <- sqldf("
select
    store
    , schoolholiday
    , sum(sales) as tot_sales
from base
group by 1, 2;
")

school_holiday_diff_desc <- sqldf("
select
    store
    , schoolholiday
    , sum(case when schoolholiday = 0 then tot_sales end) as not_holiday_sales
    , sum(case when schoolholiday = 1 then tot_sales end) as holiday_sales
    , sum(case when schoolholiday = 0 then tot_sales end) - sum(case when schoolholiday = 1 then tot_sales end) as diff
from school_holiday
group by 1
order by diff desc ;") %>% head(10)
summary(school_holiday_diff_desc)
##      Store        SchoolHoliday not_holiday_sales  holiday_sales    
##  Min.   : 251.0   Min.   :1     Min.   :10175787   Min.   :2365944  
##  1st Qu.: 347.0   1st Qu.:1     1st Qu.:11211306   1st Qu.:2772411  
##  Median : 537.5   Median :1     Median :11766496   Median :2958632  
##  Mean   : 575.8   Mean   :1     Mean   :12265578   Mean   :3037902  
##  3rd Qu.: 774.2   3rd Qu.:1     3rd Qu.:13172706   3rd Qu.:3372016  
##  Max.   :1114.0   Max.   :1     Max.   :16131631   Max.   :3653765  
##       diff         
##  Min.   : 7809843  
##  1st Qu.: 8297905  
##  Median : 9050979  
##  Mean   : 9227676  
##  3rd Qu.: 9599275  
##  Max.   :12746420
What is the bigger sales increase due to school holidays?
Store SchoolHoliday not_holiday_sales holiday_sales diff
262 1 16131631 3385211 12746420
817 1 13441062 3616805 9824257
562 1 13273557 3653765 9619792
1114 1 12870155 3332430 9537725
733 1 11586482 2480676 9105806
251 1 11946511 2950359 8996152
513 1 11285500 2966906 8318594
788 1 11186575 2895566 8291009
383 1 10758520 2731359 8027161
335 1 10175787 2365944 7809843

3.13. Which stores have small differences in sales due to school holidays?

School Holiday 여부에 따른 매출액 증감이 작은 매장은? (Top 10)

school_holiday_diff_asc <- sqldf("
select
    store
    , schoolholiday
    , sum(case when schoolholiday = 0 then tot_sales end) as not_holiday_sales
    , sum(case when schoolholiday = 1 then tot_sales end) as holiday_sales
    , sum(case when schoolholiday = 0 then tot_sales end) - sum(case when schoolholiday = 1 then tot_sales end) as diff
from school_holiday
group by 1
order by diff;") %>% head(10)

summary(school_holiday_diff_asc)
##      Store       SchoolHoliday not_holiday_sales holiday_sales   
##  Min.   :198.0   Min.   :1     Min.   :1651351   Min.   :462971  
##  1st Qu.:222.2   1st Qu.:1     1st Qu.:1781068   1st Qu.:481274  
##  Median :366.0   Median :1     Median :1825997   Median :494913  
##  Mean   :473.9   Mean   :1     Mean   :1841610   Mean   :511127  
##  3rd Qu.:731.2   3rd Qu.:1     3rd Qu.:1924701   3rd Qu.:526140  
##  Max.   :972.0   Max.   :1     Max.   :2002451   Max.   :588780  
##       diff        
##  Min.   :1188380  
##  1st Qu.:1236802  
##  Median :1341650  
##  Mean   :1330483  
##  3rd Qu.:1419803  
##  Max.   :1451714
What is the smaller sales increase due to school holidays?
Store SchoolHoliday not_holiday_sales holiday_sales diff
307 1 1651351 462971 1188380
543 1 1700778 478509 1222269
972 1 1813847 588780 1225067
198 1 1770141 498132 1272009
841 1 1826941 491694 1335247
208 1 1825053 476999 1348054
219 1 1927666 513496 1414170
232 1 2002451 580770 1421681
794 1 1915807 489567 1426240
425 1 1982069 530355 1451714

3.14. Which stores have big differences in sales due to state holidays?

State Holiday 여부에 따른 매출액 증감이 큰 매장은? (Top 10)

State Holiday ‘a’ = public / ‘b’ = easter / ‘c’ = christmas

Desc(base$StateHoliday)
## ------------------------------------------------------------------------- 
## base$StateHoliday (character)
## 
##      length         n       NAs    unique    levels     dupes
##   1'017'209 1'017'209         0         4         4         y
##                100.0%      0.0%                              
## 
##    level     freq   perc    cumfreq  cumperc
## 1      0  986'159  96.9%    986'159    96.9%
## 2      a   20'260   2.0%  1'006'419    98.9%
## 3      b    6'690   0.7%  1'013'109    99.6%
## 4      c    4'100   0.4%  1'017'209   100.0%

state_holiday <- sqldf("
select
    store
    , StateHoliday
    , sum(sales) as tot_sales
from base
group by 1, 2
")

state_holiday_desc <- sqldf("
select
    Store
    , not_holiday_sales
    , public_holiday_sales
    , easter_holiday_sales
    , christmas_holiday_sales

    , case when not_holiday_sales = public_diff then 0 else public_diff end +
      case when not_holiday_sales = easter_diff then 0 else easter_diff end +
      case when not_holiday_sales = christmas_diff then 0 else christmas_diff end as diff

    , case
        when public_diff != 0 then 'a'
        when easter_diff != 0 then 'b'
        when christmas_diff != 0 then 'c'
        else 'error' end as category

from (
        select
            Store
            , sum(case when StateHoliday = 0 then tot_sales end) as not_holiday_sales
            , sum(case when StateHoliday = 'a' then tot_sales end) as public_holiday_sales
            , sum(case when StateHoliday = 'b' then tot_sales end) as easter_holiday_sales
            , sum(case when StateHoliday = 'c' then tot_sales end) as christmas_holiday_sales

            , sum(case when StateHoliday = 0 then tot_sales end) - sum(case when StateHoliday = 'a' then tot_sales end) as public_diff
            , sum(case when StateHoliday = 0 then tot_sales end) - sum(case when StateHoliday = 'b' then tot_sales end) as easter_diff
            , sum(case when StateHoliday = 0 then tot_sales end) - sum(case when StateHoliday = 'c' then tot_sales end) as christmas_diff
        from 
            state_holiday
        group by 1
) 
order by diff desc; ") %>% head(10)

summary(state_holiday_desc)
##      Store        not_holiday_sales  public_holiday_sales
##  Min.   : 259.0   Min.   : 7039181   Min.   :102286      
##  1st Qu.: 357.0   1st Qu.: 9851432   1st Qu.:178832      
##  Median : 528.0   Median :10400706   Median :239213      
##  Mean   : 561.6   Mean   :11748420   Mean   :235231      
##  3rd Qu.: 720.2   3rd Qu.:13271430   3rd Qu.:270714      
##  Max.   :1097.0   Max.   :18751269   Max.   :444916      
##  easter_holiday_sales christmas_holiday_sales      diff         
##  Min.   : 41700       Min.   : 19060          Min.   :20951087  
##  1st Qu.: 56390       1st Qu.: 27449          1st Qu.:29205960  
##  Median : 90579       Median : 44518          Median :30898199  
##  Mean   : 92815       Mean   : 48105          Mean   :34869110  
##  3rd Qu.:102157       3rd Qu.: 57583          3rd Qu.:39399918  
##  Max.   :208687       Max.   :111970          Max.   :55488234  
##    category        
##  Length:10         
##  Class :character  
##  Mode  :character  
##                    
##                    
## 
state_holiday_diff_desc.
Store not_holiday_sales public_holiday_sales easter_holiday_sales christmas_holiday_sales diff category
262 18751269 444916 208687 111970 55488234 a
562 16486022 257742 122245 61313 49016766 a
733 13593424 315136 102448 56150 40306538 a
335 12305447 169698 47526 19060 36680057 a
682 10382719 102831 51227 20602 30973497 a
259 10418693 275038 100079 58061 30822901 a
769 9857593 222954 71878 45575 29232372 a
423 9849378 206235 101283 43460 29197156 a
1097 8800475 255472 81079 42387 26022487 a
494 7039181 102286 41700 22470 20951087 a

3.15. Which stores have small differences in sales due to state holidays?

State Holiday 여부에 따른 매출액 증감이 작은 매장은? (Top 10)

state_holiday_asc <- sqldf("
select
    Store
    , not_holiday_sales
    , public_holiday_sales
    , easter_holiday_sales
    , christmas_holiday_sales

    , case when not_holiday_sales = public_diff then 0 else public_diff end +
      case when not_holiday_sales = easter_diff then 0 else easter_diff end +
      case when not_holiday_sales = christmas_diff then 0 else christmas_diff end as diff

    , case
        when public_diff != 0 then 'a'
        when easter_diff != 0 then 'b'
        when christmas_diff != 0 then 'c'
        else 'error' end as category

from (
        select
            Store
            , sum(case when StateHoliday = 0 then tot_sales end) as not_holiday_sales
            , sum(case when StateHoliday = 'a' then tot_sales end) as public_holiday_sales
            , sum(case when StateHoliday = 'b' then tot_sales end) as easter_holiday_sales
            , sum(case when StateHoliday = 'c' then tot_sales end) as christmas_holiday_sales

            , sum(case when StateHoliday = 0 then tot_sales end) - sum(case when StateHoliday = 'a' then tot_sales end) as public_diff
            , sum(case when StateHoliday = 0 then tot_sales end) - sum(case when StateHoliday = 'b' then tot_sales end) as easter_diff
            , sum(case when StateHoliday = 0 then tot_sales end) - sum(case when StateHoliday = 'c' then tot_sales end) as christmas_diff
        from 
            state_holiday
        group by 1
) 
order by diff; ") %>% head(10)

summary(state_holiday_asc)
##      Store       not_holiday_sales public_holiday_sales
##  Min.   : 1.00   Min.   :3126578   Min.   :0           
##  1st Qu.: 5.00   1st Qu.:4304095   1st Qu.:0           
##  Median : 9.50   Median :4738852   Median :0           
##  Mean   : 8.50   Mean   :5016646   Mean   :0           
##  3rd Qu.:11.75   3rd Qu.:5814749   3rd Qu.:0           
##  Max.   :14.00   Max.   :7556507   Max.   :0           
##  easter_holiday_sales christmas_holiday_sales      diff  
##  Min.   :0            Min.   :0               Min.   :0  
##  1st Qu.:0            1st Qu.:0               1st Qu.:0  
##  Median :0            Median :0               Median :0  
##  Mean   :0            Mean   :0               Mean   :0  
##  3rd Qu.:0            3rd Qu.:0               3rd Qu.:0  
##  Max.   :0            Max.   :0               Max.   :0  
##    category        
##  Length:10         
##  Class :character  
##  Mode  :character  
##                    
##                    
## 
state_holiday_diff_asc.
Store not_holiday_sales public_holiday_sales easter_holiday_sales christmas_holiday_sales diff category
1 3716854 0 0 0 0 a
3 5408261 0 0 0 0 a
4 7556507 0 0 0 0 a
8 4342857 0 0 0 0 a
9 5112061 0 0 0 0 a
10 4365642 0 0 0 0 a
11 6296286 0 0 0 0 a
12 5950245 0 0 0 0 a
13 3126578 0 0 0 0 a
14 4291174 0 0 0 0 a

4. EDA (Visualization) in R & SQL

4.1. EDA 1 : Sales

hist(train$Sales)

sales_summary <- sqldf("
select
    round(avg(Sales), 2) as open_sales
    , sum(case when StateHoliday = 0 and SchoolHoliday = 0 then Sales else 0 end) / sum(case when StateHoliday = 0 and SchoolHoliday = 0 then 1 else 0 end) as open_not_holiday_sales
    , sum(case when StateHoliday != 0 and SchoolHoliday != 0 then Sales else 0 end) / sum(case when StateHoliday != 0 and SchoolHoliday != 0 then 1 else 0 end) as open_holiday_sales
    , sum(case when StateHoliday = 0 and SchoolHoliday = 1 then sales else 0 end) / sum(case when StateHoliday = 0 and SchoolHoliday = 1 then 1 else 0 end) as open_school_holiday_sales
    , sum(case when StateHoliday != 0 and SchoolHoliday = 0 then sales else 0 end) / sum(case when StateHoliday != 0 and SchoolHoliday = 0 then 1 else 0 end) as open_state_holiday_sales
from train
where Open != 0
")

sales <- sqldf("
select
    Store
    , round(avg(Sales), 2) as open_sales
    , sum(case when StateHoliday = 0 and SchoolHoliday = 0 then Sales else 0 end) / sum(case when StateHoliday = 0 and SchoolHoliday = 0 then 1 else 0 end) as open_not_holiday_sales
    , sum(case when StateHoliday != 0 and SchoolHoliday != 0 then Sales else 0 end) / sum(case when StateHoliday != 0 and SchoolHoliday != 0 then 1 else 0 end) as open_holiday_sales
    , sum(case when StateHoliday = 0 and SchoolHoliday = 1 then sales else 0 end) / sum(case when StateHoliday = 0 and SchoolHoliday = 1 then 1 else 0 end) as open_school_holiday_sales
    , sum(case when StateHoliday != 0 and SchoolHoliday = 0 then sales else 0 end) / sum(case when StateHoliday != 0 and SchoolHoliday = 0 then 1 else 0 end) as open_state_holiday_sales
from train
where Open != 0
group by 1
")

summary(sales)
##      Store          open_sales    open_not_holiday_sales
##  Min.   :   1.0   Min.   : 2704   Min.   : 2625         
##  1st Qu.: 279.5   1st Qu.: 5322   1st Qu.: 5256         
##  Median : 558.0   Median : 6590   Median : 6531         
##  Mean   : 558.0   Mean   : 6934   Mean   : 6879         
##  3rd Qu.: 836.5   3rd Qu.: 7964   3rd Qu.: 7886         
##  Max.   :1115.0   Max.   :21757   Max.   :21505         
##                                                         
##  open_holiday_sales open_school_holiday_sales open_state_holiday_sales
##  Min.   : 1121      Min.   : 2990             Min.   : 1278           
##  1st Qu.: 4515      1st Qu.: 5517             1st Qu.: 4295           
##  Median : 6154      Median : 6837             Median : 7029           
##  Mean   : 7049      Mean   : 7162             Mean   : 7146           
##  3rd Qu.: 8934      3rd Qu.: 8211             3rd Qu.: 8592           
##  Max.   :28285      Max.   :22747             Max.   :34956           
##  NA's   :1054                                 NA's   :984
boxplot(sales)

hist(sales$open_sales, 100, 
     main = "Mean sales per store when store was not closed")

hist(sales$open_holiday_sales, 100, 
     main = "Mean sales per store when store was not closed and holiday")

4.2. EDA 2 : Customers

hist(train$Customers)

customers_summary <- sqldf("
select
    round(avg(customers), 2) as customers
    , sum(case when StateHoliday = 0 and SchoolHoliday = 0 then customers else 0 end) / sum(case when StateHoliday = 0 and SchoolHoliday = 0 then 1 else 0 end) as open_not_holiday_customers
    , sum(case when StateHoliday != 0 and SchoolHoliday != 0 then customers else 0 end) / sum(case when StateHoliday != 0 and SchoolHoliday != 0 then 1 else 0 end) as open_holiday_customers
    , sum(case when StateHoliday = 0 and SchoolHoliday = 1 then customers else 0 end) / sum(case when StateHoliday = 0 and SchoolHoliday = 1 then 1 else 0 end) as open_school_holiday_customers
    , sum(case when StateHoliday != 0 and SchoolHoliday = 0 then customers else 0 end) / sum(case when StateHoliday != 0 and SchoolHoliday = 0 then 1 else 0 end) as open_state_holiday_customers
from 
    train
where Open != 0
")

customers <- sqldf("
select
    Store
    , round(avg(customers), 2) as customers
    , sum(case when StateHoliday = 0 and SchoolHoliday = 0 then customers else 0 end) / sum(case when StateHoliday = 0 and SchoolHoliday = 0 then 1 else 0 end) as open_not_holiday_customers
    , sum(case when StateHoliday != 0 and SchoolHoliday != 0 then customers else 0 end) / sum(case when StateHoliday != 0 and SchoolHoliday != 0 then 1 else 0 end) as open_holiday_customers
    , sum(case when StateHoliday = 0 and SchoolHoliday = 1 then customers else 0 end) / sum(case when StateHoliday = 0 and SchoolHoliday = 1 then 1 else 0 end) as open_school_holiday_customers
    , sum(case when StateHoliday != 0 and SchoolHoliday = 0 then customers else 0 end) / sum(case when StateHoliday != 0 and SchoolHoliday = 0 then 1 else 0 end) as open_state_holiday_customers
from 
    train
where Open != 0
group by 1
")

summary(customers)
##      Store          customers      open_not_holiday_customers
##  Min.   :   1.0   Min.   : 240.2   Min.   : 235.0            
##  1st Qu.: 279.5   1st Qu.: 541.5   1st Qu.: 537.5            
##  Median : 558.0   Median : 678.7   Median : 673.0            
##  Mean   : 558.0   Mean   : 754.5   Mean   : 749.8            
##  3rd Qu.: 836.5   3rd Qu.: 866.2   3rd Qu.: 859.0            
##  Max.   :1115.0   Max.   :3403.5   Max.   :3408.0            
##                                                              
##  open_holiday_customers open_school_holiday_customers
##  Min.   : 135           Min.   : 258.0               
##  1st Qu.: 481           1st Qu.: 551.5               
##  Median : 689           Median : 692.0               
##  Mean   : 953           Mean   : 771.9               
##  3rd Qu.:1225           3rd Qu.: 886.5               
##  Max.   :4073           Max.   :3406.0               
##  NA's   :1054                                        
##  open_state_holiday_customers
##  Min.   : 163.0              
##  1st Qu.: 538.5              
##  Median : 734.0              
##  Mean   : 906.0              
##  3rd Qu.: 982.0              
##  Max.   :5154.0              
##  NA's   :984
boxplot(customers)

hist(customers$customers, 100,
    main = "Mean customers per store when store was not closed")

4.3. EDA 3 : Customers vs. Sales

ggplot(train[train$Sales != 0 & train$Customers != 0],
       aes(x = log(Customers), y = log(Sales))) + 
    geom_point(alpha = 0.2) + geom_smooth()
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

4.4. EDA 4: Promotion vs. Sales

ggplot(train[train$Sales != 0 & train$Customers != 0],
       aes(x = factor(Promo), y = Sales)) + 
    geom_jitter(alpha = 0.1) +
    geom_boxplot(color = "yellow", outlier.colour = NA, fill = NA)

ggplot(train[train$Sales != 0 & train$Customers != 0],
       aes(x = factor(Promo), y = Customers)) + 
    geom_jitter(alpha = 0.1) +
    geom_boxplot(color = "yellow", outlier.colour = NA, fill = NA)

promo_type_sales <- sqldf("
with base as (
select
    a.*
    , abs(promo_per_sales - not_promo_per_sales) as diff
from (
    select
        sum(case when promo == 0 then sales else 0 end) as not_promo_sales
        , sum(case when promo == 0 then customers else 0 end) as not_promo_customers
        , round(100. * (sum(case when promo == 0 then sales else 0 end)) / sum(case when promo == 0 then customers else 0 end)) / 100 as not_promo_per_sales
        , sum(case when promo != 0 then sales else 0 end) as promo_sales
        , sum(case when promo != 0 then customers else 0 end) as promo_customers
        , round(100. * (sum(case when promo != 0 then sales else 0 end)) / sum(case when promo != 0 then customers else 0 end)) / 100 as promo_per_sales
    from train
    where Sales != 0
    ) as a
)
select
    *
from base
")

promo_type_sales
##   not_promo_sales not_promo_customers not_promo_per_sales promo_sales
## 1      2771974337           325777807                8.51  3101206286
##   promo_customers promo_per_sales diff
## 1       318263940            9.74 1.23

Conclusion

예상대로 판매량은 고객 수와 밀접한 관련이 있음. customer의 박스 플롯이 sales의 박스 플롯보다 조금 더 겹쳐보이는 것 같은데, 이것은 광고가 주로 더 많은 고객을 끌어 들이는 것이 아니라 고객이 더 많은 것을 소비하도록 한다는 것을 의미함. 고객 당 평균 지출 금액은 약 1.23유로 더 높음.

table(ifelse(train$Open == 1, "Opened", "Closed"),
      ifelse(train$Sales > 0, "Sales > 0", "Sales = 0"))
##         
##          Sales = 0 Sales > 0
##   Closed    172817         0
##   Opened        54    844338

매장이 열지 않았을 경우 매출이 없지만, 문을 열어도 매출이 없는 매장이 있다. 이러한 관측치는 데이터의 오류일 수 있다.

4.5. EDA 5 : Day of week vs. Sales

  • 일요일 매출의 변동성은 꽤 높지만 중간 값은 그렇지 않음
ggplot(train[Sales != 0],
       aes(x = factor(DayOfWeek), y = Sales)) + 
    geom_jitter(alpha = 0.1) + 
    geom_boxplot(color = "yellow", outlier.colour = NA, fill = NA)

4.6. EDA 6 : Assortment vs. Store type

# There is a connection between store type and type of assortment
table(data.frame(Assortment = store$Assortment, StoreType = store$StoreType))
##           StoreType
## Assortment   a   b   c   d
##          a 381   7  77 128
##          b   0   9   0   0
##          c 221   1  71 220
task_1 <- sqldf("
select
    assortment
    , sum(case when storetype = 'a' then 1 else 0 end) as a 
    , sum(case when storetype = 'b' then 1 else 0 end) as b 
    , sum(case when storetype = 'c' then 1 else 0 end) as c
    , sum(case when storetype = 'd' then 1 else 0 end) as d
from store
group by 1
")


task_2 <- sqldf("
with base as (
select
    a.*
from (
    select
        assortment
        , sum(case when storetype = 'a' then 1 else 0 end) as a 
        , sum(case when storetype = 'b' then 1 else 0 end) as b 
        , sum(case when storetype = 'c' then 1 else 0 end) as c
        , sum(case when storetype = 'd' then 1 else 0 end) as d
        , count(*) as total
    from store
    group by 1 ) as a
)

select *
from base

union all

select
    'total'
    , sum(a)
    , sum(b)
    , sum(c)
    , sum(d)
    , sum(total)
from base;
")

task_3 <- sqldf("
with base as (
select
    aa.*
    , a + b + c + d as total

from (
    select
        assortment
        , round( 100. * a / total, 2) as a
        , round( 100. * b / total, 2) as b
        , round( 100. * c / total, 2) as c
        , round( 100. * d / total, 2) as d
    from (
        select
            assortment
            , sum(case when storetype = 'a' then 1 else 0 end) as a 
            , sum(case when storetype = 'b' then 1 else 0 end) as b 
            , sum(case when storetype = 'c' then 1 else 0 end) as c
            , sum(case when storetype = 'd' then 1 else 0 end) as d
            , count(*) as total
        from store
        group by 1 )
        ) as aa
)
select *
from base

union all

select
    'total'
    , round(avg(a), 2)
    , round(avg(b), 2)
    , round(avg(c), 2)
    , round(avg(d), 2)
    , round(avg(total), 2)
from base;
")


task_4 <- sqldf("
select
    assortment
    , storetype
    , count(*) as as_total
from store
group by 1, 2
")

task_5 <- sqldf("
select
    a.assortment
    , a.storetype
    , round(100. * as_total / a_total, 2) as prob
from (
    select
        assortment
        , storetype
        , count(*) as as_total
    from store
    group by 1, 2
    ) as a, (

    select
        assortment
        , count(*) as a_total
    from store
    group by 1
    ) as b
where a.assortment = b.assortment
")
Between store type and type of assortment.
assortment a b c d total
a 381 7 77 128 593
b 0 9 0 0 9
c 221 1 71 220 513
total 602 17 148 348 1115
Between store type and type of assortment
assortment a b c d total
a 64.25 1.18 12.98 21.59 100.00
b 0.00 100.00 0.00 0.00 100.00
c 43.08 0.19 13.84 42.88 99.99
total 35.78 33.79 8.94 21.49 100.00
ggplot(data = task_4, aes(x = Assortment, y = as_total, fill =  StoreType)) +
  geom_bar(stat = "identity", alpha = 1)

ggplot(data = task_5, aes(x = assortment, y = prob, fill =  storetype)) +
  geom_bar(stat = "identity", alpha = 1)

4.7. EDA 7 :

# Merge store and sales 
#sales_store <- merge(sales, store, by = "Store")

#temp <- sales
#temp$year <- format(temp$Date, "%Y")
#temp$month <- format(temp$Date, "%m")
#temp[, StoreMean := mean(Sales), by = Store]
#temp <- temp[, .(MonthlySalesMean = mean(Sales / (StoreMean)) * 100), 
#             by = .(year, month)]
#temp <- as.data.frame(temp)

#SalesTS <- ts(temp$MonthlySalesMean, start=2013, frequency=12)
#col = rainbow(3)
#seasonplot(SalesTS, col=col, year.labels.left = TRUE, pch=19, las=1)