LBB in-class
Load libraries
Read the copiers data
## 'data.frame': 62 obs. of 15 variables:
## $ Row.ID : int 336 393 407 516 596 754 1151 1234 1550 1645 ...
## $ Order.ID : Factor w/ 62 levels "CA-2014-116666",..: 17 49 37 39 4 26 15 32 5 14 ...
## $ Order.Date : Factor w/ 61 levels "1/22/17","1/4/16",..: 53 58 21 1 52 34 33 37 47 25 ...
## $ Ship.Date : Factor w/ 59 levels "1/27/17","1/9/16",..: 57 56 15 1 48 32 31 37 45 25 ...
## $ Ship.Mode : Factor w/ 4 levels "First Class",..: 3 3 4 4 4 1 2 1 1 1 ...
## $ Customer.ID : Factor w/ 59 levels "AB-10255","AD-10180",..: 14 36 41 2 43 59 46 6 25 27 ...
## $ Segment : Factor w/ 3 levels "Consumer","Corporate",..: 1 1 1 3 1 2 1 2 1 2 ...
## $ Product.ID : Factor w/ 12 levels "TEC-CO-10000971",..: 3 8 11 9 11 7 12 2 2 5 ...
## $ Category : Factor w/ 1 level "Technology": 1 1 1 1 1 1 1 1 1 1 ...
## $ Sub.Category: Factor w/ 1 level "Copiers": 1 1 1 1 1 1 1 1 1 1 ...
## $ Product.Name: Factor w/ 12 levels "Brother DCP1000 Digital 3 in 1 Multifunction Machine",..: 10 6 12 2 12 9 1 3 3 7 ...
## $ Sales : num 960 1800 1200 3000 1200 ...
## $ Quantity : int 2 3 3 5 3 3 2 2 1 7 ...
## $ Discount : num 0.2 0 0.2 0 0.2 0.2 0 0.4 0.2 0 ...
## $ Profit : num 336 702 435 1380 435 ...
## # A tibble: 4 x 2
## Ship.Mode avg_Profit
## <fct> <dbl>
## 1 First Class 416.
## 2 Same Day 516.
## 3 Second Class 459.
## 4 Standard Class 444.
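The table above averages Profit within each Ship.Mode level. A minimal sketch of the same summary follows, assuming simulated stand-in data (`toy_df` is hypothetical, not the copiers file). It uses base R `aggregate()` so it runs without extra packages; the tibble output above suggests the notes used `group_by()` + `summarise()` from dplyr instead.

```r
# Simulated stand-in for the copiers data (NOT the real dataset).
set.seed(1)
toy_df <- data.frame(
  Ship.Mode = factor(sample(
    c("First Class", "Same Day", "Second Class", "Standard Class"),
    62, replace = TRUE
  )),
  Profit = runif(62, 60, 2300)
)

# Mean Profit per Ship.Mode level, one row per level.
avg_profit <- aggregate(Profit ~ Ship.Mode, data = toy_df, FUN = mean)
names(avg_profit)[2] <- "avg_Profit"
print(avg_profit)
```

The same pattern can be repeated for Segment or any other factor to see whether the target differs across its levels.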
Select a few variables that could plausibly influence the model.
Check the correlation between variables.
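A minimal sketch of the correlation check, assuming simulated stand-in data (`toy_df` and its coefficients are illustrative assumptions, not the copiers values). It keeps only the numeric columns and prints the pairwise Pearson correlation matrix with `cor()`.

```r
# Simulated stand-in for the copiers data (NOT the real dataset).
set.seed(42)
n <- 62
toy_df <- data.frame(
  Sales    = runif(n, 300, 5000),
  Quantity = sample(1:7, n, replace = TRUE),
  Discount = sample(c(0, 0.2, 0.4), n, replace = TRUE)
)
toy_df$Profit <- 60 + 0.4 * toy_df$Sales - 900 * toy_df$Discount +
  rnorm(n, sd = 100)

# cor() only accepts numeric input, so select the numeric columns first.
num_cols <- sapply(toy_df, is.numeric)
cor_mat  <- cor(toy_df[, num_cols])
print(round(cor_mat, 2))
```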
Check each predictor for outliers. For now we do not remove the outliers, because we want to analyse them in more depth.
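One way to list candidate outliers per predictor is the boxplot rule (points beyond 1.5 times the IQR from the hinges). The sketch below assumes simulated data with two deliberately extreme Sales values; `toy_df` is hypothetical, and the notes may have used boxplots or another method.

```r
# Simulated stand-in (NOT the real dataset); the last two Sales values
# are planted extremes so that the rule has something to flag.
set.seed(42)
toy_df <- data.frame(
  Sales    = c(runif(60, 300, 3000), 15000, 18000),
  Discount = sample(c(0, 0.2, 0.4), 62, replace = TRUE)
)

# boxplot.stats()$out returns the values outside the boxplot whiskers.
outliers <- lapply(toy_df, function(x) boxplot.stats(x)$out)
print(outliers)
```

Flagged values are kept in the data here, in line with the decision above to study them before removing anything.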
Build the full model.
##
## Call:
## lm(formula = Profit ~ ., data = copiers_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -285.64 -56.21 6.49 66.51 246.10
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 63.45200 55.06367 1.152 0.254
## Ship.ModeSame Day 9.13795 57.48532 0.159 0.874
## Ship.ModeSecond Class 42.97158 43.92567 0.978 0.332
## Ship.ModeStandard Class 13.92121 38.33509 0.363 0.718
## SegmentCorporate 9.30122 31.29427 0.297 0.767
## SegmentHome Office -30.70992 39.86033 -0.770 0.444
## Sales 0.42124 0.02522 16.701 < 0.0000000000000002
## Quantity -13.40975 13.54512 -0.990 0.327
## Discount -874.83099 117.02350 -7.476 0.000000000774
##
## (Intercept)
## Ship.ModeSame Day
## Ship.ModeSecond Class
## Ship.ModeStandard Class
## SegmentCorporate
## SegmentHome Office
## Sales ***
## Quantity
## Discount ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 102.7 on 53 degrees of freedom
## Multiple R-squared: 0.9515, Adjusted R-squared: 0.9442
## F-statistic: 130.1 on 8 and 53 DF, p-value: < 0.00000000000000022
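Reading the summary above: only Sales and Discount are significant at the 0.05 level. Holding the other predictors fixed, one extra unit of Sales is associated with about 0.42 more Profit, and a 0.1 increase in Discount with about 87.5 less Profit. The adjusted R-squared of 0.9442 means the model explains roughly 94% of the variance in Profit after penalising for the number of predictors. A minimal sketch of fitting such a full model and extracting these quantities, assuming simulated stand-in data (`toy_df` is hypothetical):

```r
# Simulated stand-in for the copiers data (NOT the real dataset).
set.seed(42)
n <- 62
toy_df <- data.frame(
  Segment  = factor(sample(c("Consumer", "Corporate", "Home Office"),
                           n, replace = TRUE)),
  Sales    = runif(n, 300, 5000),
  Quantity = sample(1:7, n, replace = TRUE),
  Discount = sample(c(0, 0.2, 0.4), n, replace = TRUE)
)
toy_df$Profit <- 60 + 0.4 * toy_df$Sales - 900 * toy_df$Discount +
  rnorm(n, sd = 100)

# "Profit ~ ." regresses Profit on every other column; factors are
# expanded into dummy variables against their first level.
toy_full   <- lm(Profit ~ ., data = toy_df)
coef_table <- summary(toy_full)$coefficients  # estimate, SE, t, p-value
adj_r2     <- summary(toy_full)$adj.r.squared

print(round(coef_table, 4))
print(adj_r2)
```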
Try stepwise (backward) selection.
## Start: AIC=582.65
## Profit ~ Ship.Mode + Segment + Sales + Quantity + Discount
##
## Df Sum of Sq RSS AIC
## - Ship.Mode 3 12392 571679 578.01
## - Segment 2 10395 569682 579.79
## - Quantity 1 10343 569630 581.79
## <none> 559287 582.65
## - Discount 1 589741 1149027 625.29
## - Sales 1 2943279 3502566 694.40
##
## Step: AIC=578.01
## Profit ~ Segment + Sales + Quantity + Discount
##
## Df Sum of Sq RSS AIC
## - Segment 2 15938 587617 575.72
## - Quantity 1 15195 586874 577.64
## <none> 571679 578.01
## - Discount 1 640224 1211903 622.60
## - Sales 1 3074393 3646072 690.89
##
## Step: AIC=575.72
## Profit ~ Sales + Quantity + Discount
##
## Df Sum of Sq RSS AIC
## - Quantity 1 13134 600751 575.09
## <none> 587617 575.72
## - Discount 1 635820 1223437 619.18
## - Sales 1 3256039 3843656 690.16
##
## Step: AIC=575.09
## Profit ~ Sales + Discount
##
## Df Sum of Sq RSS AIC
## <none> 600751 575.09
## - Discount 1 751791 1352543 623.40
## - Sales 1 8783232 9383984 743.50
##
## Call:
## lm(formula = Profit ~ Sales + Discount, data = copiers_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -277.69 -62.73 6.89 66.79 275.86
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 66.21587 30.34849 2.182 0.0331 *
## Sales 0.40019 0.01363 29.370 < 0.0000000000000002 ***
## Discount -894.82460 104.13831 -8.593 0.00000000000549 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 100.9 on 59 degrees of freedom
## Multiple R-squared: 0.9479, Adjusted R-squared: 0.9462
## F-statistic: 537.2 on 2 and 59 DF, p-value: < 0.00000000000000022
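In the trace above, each step drops the term whose removal gives the lowest AIC (582.65, then 578.01, 575.72, 575.09) and stops once `<none>` is the best option, leaving Profit ~ Sales + Discount. The AIC that `step()` reports for a linear model is n * log(RSS / n) + 2k, where k is the number of coefficients; for the final model, 62 * log(600751 / 62) + 2 * 3 gives about 575.09. A minimal sketch of backward elimination, assuming simulated stand-in data (`toy_df` is hypothetical):

```r
# Simulated stand-in for the copiers data (NOT the real dataset).
set.seed(42)
n <- 62
toy_df <- data.frame(
  Segment  = factor(sample(c("Consumer", "Corporate", "Home Office"),
                           n, replace = TRUE)),
  Sales    = runif(n, 300, 5000),
  Quantity = sample(1:7, n, replace = TRUE),
  Discount = sample(c(0, 0.2, 0.4), n, replace = TRUE)
)
toy_df$Profit <- 60 + 0.4 * toy_df$Sales - 900 * toy_df$Discount +
  rnorm(n, sd = 100)

toy_full <- lm(Profit ~ ., data = toy_df)

# Backward elimination by AIC; trace = 0 silences the per-step table.
toy_backward <- step(toy_full, direction = "backward", trace = 0)
print(formula(toy_backward))

# Reproduce the AIC value that step() uses for the selected model.
rss        <- sum(residuals(toy_backward)^2)
k          <- length(coef(toy_backward))
manual_aic <- n * log(rss / n) + 2 * k
print(c(manual = manual_aic, extractAIC = extractAIC(toy_backward)[2]))
```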
## [1] 98.43544
## [1] 75.08537
## [1] 0.3009575
## [1] 59.998 2302.967
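The four unlabelled values above are error metrics for the selected model. The first one matches the RMSE implied by the final RSS: sqrt(600751 / 62) is about 98.44. The next two are presumably MAE and MAPE, and the last pair looks like the range of Profit for scale, but the source does not label them. A minimal sketch of the three metrics in base R, assuming simulated stand-in data (`toy_df` is hypothetical):

```r
# Simulated stand-in for the copiers data (NOT the real dataset).
set.seed(42)
n <- 62
toy_df <- data.frame(
  Sales    = runif(n, 300, 5000),
  Discount = sample(c(0, 0.2, 0.4), n, replace = TRUE)
)
toy_df$Profit <- 60 + 0.4 * toy_df$Sales - 900 * toy_df$Discount +
  rnorm(n, sd = 100)

toy_fit   <- lm(Profit ~ Sales + Discount, data = toy_df)
actual    <- toy_df$Profit
predicted <- fitted(toy_fit)

rmse <- sqrt(mean((actual - predicted)^2))         # same unit as Profit
mae  <- mean(abs(actual - predicted))              # same unit as Profit
mape <- mean(abs((actual - predicted) / actual))   # fraction, not percent

print(c(RMSE = rmse, MAE = mae, MAPE = mape))
print(range(actual))  # target range, to judge whether the errors are large
```

RMSE penalises large errors more heavily than MAE, so RMSE is never smaller than MAE. MAPE is unstable when the target is close to zero.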
Influence of outliers
Residuals are normally distributed (normality)
##
## Shapiro-Wilk normality test
##
## data: copiers_backward$residuals
## W = 0.9834, p-value = 0.5656
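The null hypothesis of the Shapiro-Wilk test is that the residuals are normally distributed. With p-value = 0.5656 > 0.05 we fail to reject it, so the normality assumption holds for this model. A minimal sketch of the test, assuming simulated stand-in data (`toy_df` is hypothetical):

```r
# Simulated stand-in for the copiers data (NOT the real dataset).
set.seed(42)
n <- 62
toy_df <- data.frame(
  Sales    = runif(n, 300, 5000),
  Discount = sample(c(0, 0.2, 0.4), n, replace = TRUE)
)
toy_df$Profit <- 60 + 0.4 * toy_df$Sales - 900 * toy_df$Discount +
  rnorm(n, sd = 100)

toy_fit <- lm(Profit ~ Sales + Discount, data = toy_df)

# H0: residuals come from a normal distribution.
sw <- shapiro.test(residuals(toy_fit))
print(sw)
```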
homoscedasticity
##
## studentized Breusch-Pagan test
##
## data: copiers_backward
## BP = 16.869, df = 2, p-value = 0.0002172
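The null hypothesis of the Breusch-Pagan test is constant residual variance. With p-value = 0.0002172 < 0.05 we reject it, so this model shows heteroscedasticity and the assumption is not met. The output above has the format printed by `bptest()` from the lmtest package. The sketch below instead computes the studentized statistic by hand in base R (n times the R-squared of a regression of the squared residuals on the predictors), assuming simulated stand-in data (`toy_df` is hypothetical):

```r
# Simulated stand-in for the copiers data (NOT the real dataset).
set.seed(42)
n <- 62
toy_df <- data.frame(
  Sales    = runif(n, 300, 5000),
  Discount = sample(c(0, 0.2, 0.4), n, replace = TRUE)
)
toy_df$Profit <- 60 + 0.4 * toy_df$Sales - 900 * toy_df$Discount +
  rnorm(n, sd = 100)

toy_fit <- lm(Profit ~ Sales + Discount, data = toy_df)

# Auxiliary regression: squared residuals on the same predictors.
toy_df$res_sq <- residuals(toy_fit)^2
aux <- lm(res_sq ~ Sales + Discount, data = toy_df)

bp_stat <- n * summary(aux)$r.squared   # studentized (Koenker) statistic
bp_df   <- 2                            # number of predictors
bp_p    <- pchisq(bp_stat, df = bp_df, lower.tail = FALSE)

print(c(BP = bp_stat, df = bp_df, p.value = bp_p))
```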
multicollinearity
## Sales Discount
## 1.038959 1.038959
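Both VIF values are about 1.04, far below the common rule-of-thumb threshold of 10, so there is no sign of multicollinearity between Sales and Discount. The VIF of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on the remaining ones; with only two predictors the two values are identical, as in the output above. A minimal base R sketch of that definition, assuming simulated stand-in data (`toy_df` is hypothetical; the notes likely used a packaged function such as `vif()` from car):

```r
# Simulated stand-in for the copiers data (NOT the real dataset).
set.seed(42)
n <- 62
toy_df <- data.frame(
  Sales    = runif(n, 300, 5000),
  Discount = sample(c(0, 0.2, 0.4), n, replace = TRUE)
)

predictors <- c("Sales", "Discount")

# For each predictor, regress it on the others and apply 1 / (1 - R^2).
vif_values <- sapply(predictors, function(p) {
  others <- setdiff(predictors, p)
  r2 <- summary(lm(reformulate(others, response = p), data = toy_df))$r.squared
  1 / (1 - r2)
})
print(vif_values)
```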
linearity
##
## Pearson's product-moment correlation
##
## data: copiers_df$Sales and copiers_df$Profit
## t = 21.26, df = 60, p-value < 0.00000000000000022
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.9013320 0.9632858
## sample estimates:
## cor
## 0.9395785
##
## Pearson's product-moment correlation
##
## data: copiers_df$Discount and copiers_df$Profit
## t = -3.7139, df = 60, p-value = 0.0004496
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.6156283 -0.2046714
## sample estimates:
## cor
## -0.4323383
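Both tests above reject the null hypothesis of zero correlation (p-value < 0.05): Sales has a strong positive linear association with Profit (r = 0.94) and Discount a moderate negative one (r = -0.43). A minimal sketch of the same check, assuming simulated stand-in data (`toy_df` is hypothetical):

```r
# Simulated stand-in for the copiers data (NOT the real dataset).
set.seed(42)
n <- 62
toy_df <- data.frame(
  Sales    = runif(n, 300, 5000),
  Discount = sample(c(0, 0.2, 0.4), n, replace = TRUE)
)
toy_df$Profit <- 60 + 0.4 * toy_df$Sales - 900 * toy_df$Discount +
  rnorm(n, sd = 100)

# H0: the true Pearson correlation with the target is 0.
ct_sales    <- cor.test(toy_df$Sales, toy_df$Profit)
ct_discount <- cor.test(toy_df$Discount, toy_df$Profit)

print(ct_sales)
print(ct_discount)
```

A residuals-versus-fitted plot is a useful complement, since a significant correlation alone does not rule out curvature.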
LBB RM:
- Explore the data
  - str()
  - cor()
  - group_by() + summarise() for each factor against y
- Feature selection based on business knowledge
- Modeling
  - Interpret the coefficients
  - Discuss adjusted R-squared
  - Discuss predictor significance
- Feature selection
  - Stepwise
  - regsubsets
- Error
  - RMSE / MSE
  - MAE
  - MAPE
- Assumptions
  - Normality
  - Homoscedasticity
  - Multicollinearity
  - Linearity
- Conclusion