LBB in-class

Load libraries

Read the copiers data

## 'data.frame':    62 obs. of  15 variables:
##  $ Row.ID      : int  336 393 407 516 596 754 1151 1234 1550 1645 ...
##  $ Order.ID    : Factor w/ 62 levels "CA-2014-116666",..: 17 49 37 39 4 26 15 32 5 14 ...
##  $ Order.Date  : Factor w/ 61 levels "1/22/17","1/4/16",..: 53 58 21 1 52 34 33 37 47 25 ...
##  $ Ship.Date   : Factor w/ 59 levels "1/27/17","1/9/16",..: 57 56 15 1 48 32 31 37 45 25 ...
##  $ Ship.Mode   : Factor w/ 4 levels "First Class",..: 3 3 4 4 4 1 2 1 1 1 ...
##  $ Customer.ID : Factor w/ 59 levels "AB-10255","AD-10180",..: 14 36 41 2 43 59 46 6 25 27 ...
##  $ Segment     : Factor w/ 3 levels "Consumer","Corporate",..: 1 1 1 3 1 2 1 2 1 2 ...
##  $ Product.ID  : Factor w/ 12 levels "TEC-CO-10000971",..: 3 8 11 9 11 7 12 2 2 5 ...
##  $ Category    : Factor w/ 1 level "Technology": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Sub.Category: Factor w/ 1 level "Copiers": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Product.Name: Factor w/ 12 levels "Brother DCP1000 Digital 3 in 1 Multifunction Machine",..: 10 6 12 2 12 9 1 3 3 7 ...
##  $ Sales       : num  960 1800 1200 3000 1200 ...
##  $ Quantity    : int  2 3 3 5 3 3 2 2 1 7 ...
##  $ Discount    : num  0.2 0 0.2 0 0.2 0.2 0 0.4 0.2 0 ...
##  $ Profit      : num  336 702 435 1380 435 ...
## # A tibble: 4 x 2
##   Ship.Mode      avg_Profit
##   <fct>               <dbl>
## 1 First Class          416.
## 2 Same Day             516.
## 3 Second Class         459.
## 4 Standard Class       444.
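
The structure listing and the per-ship-mode averages above could be produced along these lines. This is only a minimal sketch: the file is a temporary toy CSV written on the spot (the real file name is not given in these notes), `stringsAsFactors = TRUE` is assumed because the listing shows factors, and base `aggregate()` stands in for `group_by()` + `summarise()`.

```r
# Toy stand-in for the copiers file (hypothetical values, not the real data)
toy <- data.frame(
  Ship.Mode = rep_len(c("First Class", "Same Day",
                        "Second Class", "Standard Class"), 12),
  Sales     = c(960, 1800, 1200, 3000, 1200, 480,
                1400, 2100, 600, 4200, 900, 2400),
  Profit    = c(336, 702, 435, 1380, 435, 120,
                510, 640, 180, 1900, 260, 870)
)
csv_path <- tempfile(fileext = ".csv")
write.csv(toy, csv_path, row.names = FALSE)

# Read the file; character columns become factors, as in the str() listing
copiers <- read.csv(csv_path, stringsAsFactors = TRUE)
str(copiers)

# Average profit per shipping mode (base-R equivalent of group_by + summarise)
avg_profit <- aggregate(Profit ~ Ship.Mode, data = copiers, FUN = mean)
names(avg_profit)[2] <- "avg_Profit"
print(avg_profit)
```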

Select the variables that could plausibly influence the model.
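
Judging by the full-model formula further down, the variables kept were Ship.Mode, Segment, Sales, Quantity and Discount, with Profit as the target. Category and Sub.Category have a single level each, so they cannot explain any variation, and the ID, date and name columns were left out of the formula. A minimal sketch of that selection, using a toy data frame (hypothetical values) with two extra columns standing in for the dropped fields:

```r
# Toy data with some columns that carry no modeling signal (hypothetical values)
copiers <- data.frame(
  Row.ID    = 1:6,
  Order.ID  = paste0("CA-", 1:6),
  Ship.Mode = factor(c("First Class", "Same Day", "Second Class",
                       "Standard Class", "First Class", "Same Day")),
  Segment   = factor(c("Consumer", "Corporate", "Home Office",
                       "Consumer", "Corporate", "Consumer")),
  Sales     = c(960, 1800, 1200, 3000, 1200, 480),
  Quantity  = c(2L, 3L, 3L, 5L, 3L, 1L),
  Discount  = c(0.2, 0, 0.2, 0, 0.2, 0.4),
  Profit    = c(336, 702, 435, 1380, 435, 60)
)

# Keep the candidate predictors plus the target
keep <- c("Ship.Mode", "Segment", "Sales", "Quantity", "Discount", "Profit")
copiers_df <- copiers[, keep]
str(copiers_df)
```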

Check the correlation between variables.
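
A correlation check on the numeric columns might look like the sketch below. The data are simulated (the coefficients are loosely borrowed from the fitted model later in these notes, purely for illustration), so the numbers it prints are not the class results.

```r
# Simulated stand-in for the numeric columns (not the real copiers data)
set.seed(1)
n <- 62
toy <- data.frame(
  Sales    = runif(n, 300, 5000),
  Quantity = sample(1:7, n, replace = TRUE),
  Discount = sample(c(0, 0.2, 0.4), n, replace = TRUE)
)
toy$Profit <- 66 + 0.4 * toy$Sales - 895 * toy$Discount + rnorm(n, sd = 100)

# Pairwise Pearson correlations between all numeric variables
cor_mat <- cor(toy)
print(round(cor_mat, 2))

# Absolute correlation of each variable with the target, strongest first
print(sort(abs(cor_mat[, "Profit"]), decreasing = TRUE))
```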

Check each predictor for outliers. For now we do not remove the outliers, because we want to analyze them further.
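
The notes do not say how outliers were detected (most likely boxplots). One way to list them without plotting, assuming the usual 1.5 × IQR boxplot rule, is `boxplot.stats()`. The data below are simulated, with two extreme Sales values planted on purpose.

```r
# Simulated predictors with two planted extreme Sales values (hypothetical)
set.seed(1)
toy <- data.frame(
  Sales    = c(runif(60, 300, 3000), 9000, 11000),
  Quantity = sample(1:7, 62, replace = TRUE),
  Discount = sample(c(0, 0.2, 0.4), 62, replace = TRUE)
)

# Values beyond the boxplot whiskers, per predictor
outliers <- lapply(toy, function(x) boxplot.stats(x)$out)

# Number of flagged values in each column; nothing is removed here
print(sapply(outliers, length))
print(outliers$Sales)
```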

Build the full model.

## 
## Call:
## lm(formula = Profit ~ ., data = copiers_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -285.64  -56.21    6.49   66.51  246.10 
## 
## Coefficients:
##                           Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)               63.45200   55.06367   1.152                0.254    
## Ship.ModeSame Day          9.13795   57.48532   0.159                0.874    
## Ship.ModeSecond Class     42.97158   43.92567   0.978                0.332    
## Ship.ModeStandard Class   13.92121   38.33509   0.363                0.718    
## SegmentCorporate           9.30122   31.29427   0.297                0.767    
## SegmentHome Office       -30.70992   39.86033  -0.770                0.444    
## Sales                      0.42124    0.02522  16.701 < 0.0000000000000002 ***
## Quantity                 -13.40975   13.54512  -0.990                0.327    
## Discount                -874.83099  117.02350  -7.476       0.000000000774 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 102.7 on 53 degrees of freedom
## Multiple R-squared:  0.9515, Adjusted R-squared:  0.9442 
## F-statistic: 130.1 on 8 and 53 DF,  p-value: < 0.00000000000000022
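
The call `lm(Profit ~ ., data = copiers_df)` uses `.` to mean "every other column". Each factor is expanded into dummy columns against its first level, which is why the summary shows 9 coefficients and 53 residual degrees of freedom (62 observations minus 9 estimates). A sketch on simulated data with the same column layout (values are hypothetical):

```r
# Simulated data with the same columns as copiers_df (not the real data)
set.seed(1)
n <- 62
toy <- data.frame(
  Ship.Mode = factor(rep_len(c("First Class", "Same Day",
                               "Second Class", "Standard Class"), n)),
  Segment   = factor(rep_len(c("Consumer", "Corporate", "Home Office"), n)),
  Sales     = runif(n, 300, 5000),
  Quantity  = sample(1:7, n, replace = TRUE),
  Discount  = sample(c(0, 0.2, 0.4), n, replace = TRUE)
)
toy$Profit <- 66 + 0.4 * toy$Sales - 895 * toy$Discount + rnorm(n, sd = 100)

# Full model: Profit regressed on all remaining columns
full <- lm(Profit ~ ., data = toy)
print(summary(full))

# 1 intercept + 3 Ship.Mode dummies + 2 Segment dummies + 3 numeric slopes
print(length(coef(full)))
print(df.residual(full))
```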

Try stepwise selection (backward elimination).

## Start:  AIC=582.65
## Profit ~ Ship.Mode + Segment + Sales + Quantity + Discount
## 
##             Df Sum of Sq     RSS    AIC
## - Ship.Mode  3     12392  571679 578.01
## - Segment    2     10395  569682 579.79
## - Quantity   1     10343  569630 581.79
## <none>                    559287 582.65
## - Discount   1    589741 1149027 625.29
## - Sales      1   2943279 3502566 694.40
## 
## Step:  AIC=578.01
## Profit ~ Segment + Sales + Quantity + Discount
## 
##            Df Sum of Sq     RSS    AIC
## - Segment   2     15938  587617 575.72
## - Quantity  1     15195  586874 577.64
## <none>                   571679 578.01
## - Discount  1    640224 1211903 622.60
## - Sales     1   3074393 3646072 690.89
## 
## Step:  AIC=575.72
## Profit ~ Sales + Quantity + Discount
## 
##            Df Sum of Sq     RSS    AIC
## - Quantity  1     13134  600751 575.09
## <none>                   587617 575.72
## - Discount  1    635820 1223437 619.18
## - Sales     1   3256039 3843656 690.16
## 
## Step:  AIC=575.09
## Profit ~ Sales + Discount
## 
##            Df Sum of Sq     RSS    AIC
## <none>                   600751 575.09
## - Discount  1    751791 1352543 623.40
## - Sales     1   8783232 9383984 743.50
## 
## Call:
## lm(formula = Profit ~ Sales + Discount, data = copiers_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -277.69  -62.73    6.89   66.79  275.86 
## 
## Coefficients:
##               Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)   66.21587   30.34849   2.182               0.0331 *  
## Sales          0.40019    0.01363  29.370 < 0.0000000000000002 ***
## Discount    -894.82460  104.13831  -8.593     0.00000000000549 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 100.9 on 59 degrees of freedom
## Multiple R-squared:  0.9479, Adjusted R-squared:  0.9462 
## F-statistic: 537.2 on 2 and 59 DF,  p-value: < 0.00000000000000022
## [1] 98.43544
## [1] 75.08537
## [1] 0.3009575
## [1]   59.998 2302.967
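
The trace above drops one term at a time for as long as the AIC decreases, ending at `Profit ~ Sales + Discount`. The four unlabeled numbers at the end are not explained in the notes. The first equals sqrt(600751 / 62) ≈ 98.44, the RMSE of the final model; reading the other three as MAE, MAPE and the range of Profit is an assumption. A sketch of both steps on simulated data (hypothetical values):

```r
# Simulated data with the same columns as copiers_df (not the real data)
set.seed(1)
n <- 62
toy <- data.frame(
  Ship.Mode = factor(rep_len(c("First Class", "Same Day",
                               "Second Class", "Standard Class"), n)),
  Segment   = factor(rep_len(c("Consumer", "Corporate", "Home Office"), n)),
  Sales     = runif(n, 300, 5000),
  Quantity  = sample(1:7, n, replace = TRUE),
  Discount  = sample(c(0, 0.2, 0.4), n, replace = TRUE)
)
toy$Profit <- 66 + 0.4 * toy$Sales - 895 * toy$Discount + rnorm(n, sd = 100)

# Backward elimination by AIC, starting from the full model
full     <- lm(Profit ~ ., data = toy)
backward <- step(full, direction = "backward", trace = 0)
print(formula(backward))

# In-sample error metrics of the selected model
res  <- residuals(backward)
rmse <- sqrt(mean(res^2))            # root mean squared error
mae  <- mean(abs(res))               # mean absolute error
mape <- mean(abs(res / toy$Profit))  # mean absolute percentage error
print(c(RMSE = rmse, MAE = mae, MAPE = mape))

# Range of the target, to put RMSE and MAE in context
print(range(toy$Profit))
```

An RMSE of about 98 against a Profit range of roughly 60 to 2303 is small relative to the spread of the target.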

Influence of outliers
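
No output survives under this heading, so the method used in class is unknown. One common way to gauge how much individual observations pull on the fit is Cook's distance with the rule-of-thumb cutoff 4/n. The sketch below is an assumption, not the class method, and runs on simulated data with one aberrant row planted.

```r
# Simulated data with one planted aberrant observation (hypothetical values)
set.seed(1)
n <- 62
toy <- data.frame(
  Sales    = runif(n, 300, 5000),
  Discount = sample(c(0, 0.2, 0.4), n, replace = TRUE)
)
toy$Profit <- 66 + 0.4 * toy$Sales - 895 * toy$Discount + rnorm(n, sd = 100)
toy$Profit[1] <- toy$Profit[1] + 1500

fit <- lm(Profit ~ Sales + Discount, data = toy)

# Cook's distance per observation; flag rows above the 4/n rule of thumb
cd          <- cooks.distance(fit)
influential <- which(cd > 4 / n)
print(influential)

# Refit without the flagged rows and compare the coefficients
keep_rows <- setdiff(seq_len(n), influential)
fit_trim  <- lm(Profit ~ Sales + Discount, data = toy[keep_rows, ])
print(rbind(all_rows = coef(fit), trimmed = coef(fit_trim)))
```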

Residuals are normally distributed (normality)

## 
##  Shapiro-Wilk normality test
## 
## data:  copiers_backward$residuals
## W = 0.9834, p-value = 0.5656
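
With p-value = 0.5656 > 0.05, the Shapiro-Wilk test does not reject normality of the residuals, so this assumption holds. The call itself is base R; a sketch on simulated data (hypothetical values, so the statistic will differ):

```r
# Simulated data standing in for copiers_df (not the real data)
set.seed(1)
n <- 62
toy <- data.frame(
  Sales    = runif(n, 300, 5000),
  Discount = sample(c(0, 0.2, 0.4), n, replace = TRUE)
)
toy$Profit <- 66 + 0.4 * toy$Sales - 895 * toy$Discount + rnorm(n, sd = 100)

fit <- lm(Profit ~ Sales + Discount, data = toy)

# H0: residuals come from a normal distribution
sw <- shapiro.test(residuals(fit))
print(sw)

# TRUE means H0 is not rejected at the 5% level
print(sw$p.value > 0.05)
```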

homoscedasticity

## 
##  studentized Breusch-Pagan test
## 
## data:  copiers_backward
## BP = 16.869, df = 2, p-value = 0.0002172
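
Here p-value = 0.0002172 < 0.05, so the null hypothesis of constant error variance is rejected: the residuals are heteroscedastic and this assumption is not met. The output format suggests `bptest()` from the lmtest package. To show what the statistic measures, the sketch below computes the studentized (Koenker) version by hand in base R on simulated data: regress the squared residuals on the predictors and take n × R².

```r
# Simulated data standing in for copiers_df (not the real data)
set.seed(1)
n <- 62
toy <- data.frame(
  Sales    = runif(n, 300, 5000),
  Discount = sample(c(0, 0.2, 0.4), n, replace = TRUE)
)
toy$Profit <- 66 + 0.4 * toy$Sales - 895 * toy$Discount + rnorm(n, sd = 100)

fit <- lm(Profit ~ Sales + Discount, data = toy)

# Auxiliary regression: squared residuals on the same predictors
toy$res2 <- residuals(fit)^2
aux      <- lm(res2 ~ Sales + Discount, data = toy)

# Studentized Breusch-Pagan statistic and its chi-squared p-value
bp_stat <- nrow(toy) * summary(aux)$r.squared
bp_df   <- length(coef(aux)) - 1   # number of predictors
bp_p    <- pchisq(bp_stat, df = bp_df, lower.tail = FALSE)
print(c(BP = bp_stat, df = bp_df, p.value = bp_p))

# The p-value in the notes follows from BP = 16.869 with df = 2
p_from_notes <- pchisq(16.869, df = 2, lower.tail = FALSE)
print(p_from_notes)
```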

multicollinearity

##    Sales Discount 
## 1.038959 1.038959
## 
##  Pearson's product-moment correlation
## 
## data:  copiers_df$Sales and copiers_df$Profit
## t = 21.26, df = 60, p-value < 0.00000000000000022
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.9013320 0.9632858
## sample estimates:
##       cor 
## 0.9395785
## 
##  Pearson's product-moment correlation
## 
## data:  copiers_df$Discount and copiers_df$Profit
## t = -3.7139, df = 60, p-value = 0.0004496
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.6156283 -0.2046714
## sample estimates:
##        cor 
## -0.4323383
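
Both VIF values are about 1.04, far below the usual warning level of 10, so Sales and Discount are not collinear. The two `cor.test()` results show that each predictor is significantly correlated with Profit. The VIF output looks like `vif()` from the car package. The sketch below recomputes it by hand on simulated data: VIF for a predictor is 1 / (1 − R²) from regressing it on the other predictors, and with only two predictors both values are equal.

```r
# Simulated data standing in for copiers_df (not the real data)
set.seed(1)
n <- 62
toy <- data.frame(
  Sales    = runif(n, 300, 5000),
  Discount = sample(c(0, 0.2, 0.4), n, replace = TRUE)
)
toy$Profit <- 66 + 0.4 * toy$Sales - 895 * toy$Discount + rnorm(n, sd = 100)

# VIF by hand: regress each predictor on the other one
r2_sales     <- summary(lm(Sales ~ Discount, data = toy))$r.squared
r2_discount  <- summary(lm(Discount ~ Sales, data = toy))$r.squared
vif_sales    <- 1 / (1 - r2_sales)
vif_discount <- 1 / (1 - r2_discount)
print(c(Sales = vif_sales, Discount = vif_discount))

# With two predictors this reduces to 1 / (1 - r^2)
vif_direct <- 1 / (1 - cor(toy$Sales, toy$Discount)^2)
print(vif_direct)

# Correlation test between a predictor and the target
ct <- cor.test(toy$Sales, toy$Profit)
print(ct)
```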

LBB RM:

  1. Explore the data
     - str()
     - cor()
     - group_by() + summarise() for each factor against y
  2. Feature selection based on business knowledge
  3. Modeling
  4. Feature selection
  5. Error
  6. Assumptions
  7. Conclusion

david

8/8/2019