Data Summary
Summary in general
The small data set has dimensions (1000, 14).
The large data set has dimensions (100000, 14).
Of the 14 features, 7 are categorical (character) and the other 7 are numeric (2 integer, 5 double).
## Rows: 1,000
## Columns: 14
## $ Region <chr> "Middle East and North Africa", "North America", "Middl…
## $ Country <chr> "Libya", "Canada", "Libya", "Japan", "Chad", "Armenia",…
## $ Item.Type <chr> "Cosmetics", "Vegetables", "Baby Food", "Cereal", "Frui…
## $ Sales.Channel <chr> "Offline", "Online", "Offline", "Offline", "Offline", "…
## $ Order.Priority <chr> "M", "M", "C", "C", "H", "H", "H", "M", "H", "H", "M", …
## $ Order.Date <chr> "10/18/2014", "11/7/2011", "10/31/2016", "4/10/2010", "…
## $ Order.ID <int> 686800706, 185941302, 246222341, 161442649, 645713555, …
## $ Ship.Date <chr> "10/31/2014", "12/8/2011", "12/9/2016", "5/12/2010", "8…
## $ Units.Sold <int> 8446, 3018, 1517, 3322, 9845, 9528, 2844, 7299, 2428, 4…
## $ Unit.Price <dbl> 437.20, 154.06, 255.28, 205.70, 9.33, 205.70, 205.70, 1…
## $ Unit.Cost <dbl> 263.33, 90.93, 159.42, 117.11, 6.92, 117.11, 117.11, 35…
## $ Total.Revenue <dbl> 3692591.20, 464953.08, 387259.76, 683335.40, 91853.85, …
## $ Total.Cost <dbl> 2224085.18, 274426.74, 241840.14, 389039.42, 68127.40, …
## $ Total.Profit <dbl> 1468506.02, 190526.34, 145419.62, 294295.98, 23726.45, …
## Rows: 100,000
## Columns: 14
## $ Region <chr> "Middle East and North Africa", "Central America and th…
## $ Country <chr> "Azerbaijan", "Panama", "Sao Tome and Principe", "Sao T…
## $ Item.Type <chr> "Snacks", "Cosmetics", "Fruits", "Personal Care", "Hous…
## $ Sales.Channel <chr> "Online", "Offline", "Offline", "Online", "Offline", "O…
## $ Order.Priority <chr> "C", "L", "M", "M", "H", "C", "M", "C", "H", "H", "C", …
## $ Order.Date <chr> "10/8/2014", "2/22/2015", "12/9/2015", "9/17/2014", "2/…
## $ Order.ID <int> 535113847, 874708545, 854349935, 892836844, 129280602, …
## $ Ship.Date <chr> "10/23/2014", "2/27/2015", "1/18/2016", "10/12/2014", "…
## $ Units.Sold <int> 934, 4551, 9986, 9118, 5858, 1149, 7964, 6307, 8217, 27…
## $ Unit.Price <dbl> 152.58, 437.20, 9.33, 81.73, 668.27, 109.28, 437.20, 9.…
## $ Unit.Cost <dbl> 97.44, 263.33, 6.92, 56.67, 502.54, 35.84, 263.33, 6.92…
## $ Total.Revenue <dbl> 142509.72, 1989697.20, 93169.38, 745214.14, 3914725.66,…
## $ Total.Cost <dbl> 91008.96, 1198414.83, 69103.12, 516717.06, 2943879.32, …
## $ Total.Profit <dbl> 51500.76, 791282.37, 24066.26, 228497.08, 970846.34, 84…
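The categorical/numeric counts above can be reproduced by inspecting column classes. The following is a minimal sketch on a toy data frame; the object name `sales` and its values are illustrative assumptions, and only a few of the 14 columns are mimicked.

```r
# Toy stand-in for the sales data (hypothetical object; a subset of the real columns).
sales <- data.frame(
  Region     = c("North America", "Asia", "Europe"),
  Item.Type  = c("Cereal", "Fruits", "Snacks"),
  Order.Date = c("10/18/2014", "11/7/2011", "4/10/2010"),
  Units.Sold = c(8446L, 3018L, 1517L),
  Unit.Price = c(205.70, 9.33, 152.58),
  stringsAsFactors = FALSE
)

dim(sales)  # number of rows, number of columns

# Character columns are treated as categorical, integer/double columns as numeric.
n_categorical <- sum(sapply(sales, is.character))
n_numeric     <- sum(sapply(sales, is.numeric))
```

Note that dates read as character strings are counted as categorical here, which matches how Order.Date and Ship.Date appear in the output above.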
Statistical Summary
- Neither data set shows obvious missing data.
## Region Country Item.Type Sales.Channel
## Length:1000 Length:1000 Length:1000 Length:1000
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## Order.Priority Order.Date Order.ID Ship.Date
## Length:1000 Length:1000 Min. :102928006 Length:1000
## Class :character Class :character 1st Qu.:328074026 Class :character
## Mode :character Mode :character Median :556609714 Mode :character
## Mean :549681325
## 3rd Qu.:769694483
## Max. :995529830
## Units.Sold Unit.Price Unit.Cost Total.Revenue
## Min. : 13 Min. : 9.33 Min. : 6.92 Min. : 2043
## 1st Qu.:2420 1st Qu.: 81.73 1st Qu.: 56.67 1st Qu.: 281192
## Median :5184 Median :154.06 Median : 97.44 Median : 754939
## Mean :5054 Mean :262.11 Mean :184.97 Mean :1327322
## 3rd Qu.:7537 3rd Qu.:421.89 3rd Qu.:263.33 3rd Qu.:1733503
## Max. :9998 Max. :668.27 Max. :524.96 Max. :6617210
## Total.Cost Total.Profit
## Min. : 1417 Min. : 532.6
## 1st Qu.: 164932 1st Qu.: 98376.1
## Median : 464726 Median : 277226.0
## Mean : 936119 Mean : 391202.6
## 3rd Qu.:1141750 3rd Qu.: 548456.8
## Max. :5204978 Max. :1726181.4
## Region Country Item.Type Sales.Channel
## Length:100000 Length:100000 Length:100000 Length:100000
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## Order.Priority Order.Date Order.ID Ship.Date
## Length:100000 Length:100000 Min. :100008904 Length:100000
## Class :character Class :character 1st Qu.:326046383 Class :character
## Mode :character Mode :character Median :547718512 Mode :character
## Mean :550395554
## 3rd Qu.:775078534
## Max. :999996459
## Units.Sold Unit.Price Unit.Cost Total.Revenue
## Min. : 1 Min. : 9.33 Min. : 6.92 Min. : 19
## 1st Qu.: 2505 1st Qu.:109.28 1st Qu.: 56.67 1st Qu.: 279753
## Median : 5007 Median :205.70 Median :117.11 Median : 789892
## Mean : 5001 Mean :266.70 Mean :188.02 Mean :1336067
## 3rd Qu.: 7495 3rd Qu.:437.20 3rd Qu.:364.69 3rd Qu.:1836490
## Max. :10000 Max. :668.27 Max. :524.96 Max. :6682700
## Total.Cost Total.Profit
## Min. : 14 Min. : 4.8
## 1st Qu.: 162928 1st Qu.: 95900.0
## Median : 467937 Median : 283657.5
## Mean : 941975 Mean : 394091.2
## 3rd Qu.:1209475 3rd Qu.: 568384.1
## Max. :5249075 Max. :1738700.0
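`summary()` reports NA counts only for numeric columns, so an explicit count per column is a useful complement to the output above. This is a small sketch on made-up data; the object `toy` and the check for empty strings are my own assumptions, not part of the original analysis.

```r
# Hypothetical toy data with one NA and one empty string.
toy <- data.frame(
  Units.Sold = c(10L, NA, 30L),
  Unit.Price = c(9.33, 9.33, 9.33),
  Country    = c("Chad", "", "Japan"),
  stringsAsFactors = FALSE
)

# Number of NA values in each column.
na_counts <- colSums(is.na(toy))

# Number of empty strings in each character column (0 for other types).
blank_counts <- sapply(toy, function(x) {
  if (is.character(x)) sum(x == "", na.rm = TRUE) else 0L
})
```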
Visualization
Numeric: Distribution
- Total.Cost, Total.Profit and Total.Revenue follow a similar trend in both data sets, which indicates that these three variables are correlated.
- Unit.Cost and Unit.Price have more peaks in the large data set than in the small one.
- Units.Sold looks much more stable in the large data set than in the small one.
Numeric: Outliers
- Except for Units.Sold, the numeric variables in the large data set either shift to the right or extend the 3rd quartile.
- Units.Sold is slightly skewed in the small data set but symmetric in the large one.
- Neither data set shows extreme outliers.
Numeric: Correlation
- Total.Revenue, Total.Cost and Total.Profit are highly correlated, as noted above.
- The large data set (second graph) shows slightly higher correlation between Unit.Cost and each of Total.Revenue, Total.Cost and Total.Profit, and between Unit.Price and each of Total.Revenue, Total.Profit and Total.Cost.
- Unit.Price and Unit.Cost are also highly correlated.
Numeric: Correlation Cont.
After dropping the variables behind those highly correlated pairs, the correlation among the remaining variables is shown below.
For some pairs, the correlation in the large data set is still slightly higher than in the small data set, but none exceeds the cutoff, which I set to 0.8.
## Note: Using an external vector in selections is ambiguous.
## ℹ Use `all_of(highCorrelated)` instead of `highCorrelated` to silence this message.
## ℹ See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
## This message is displayed once per session.
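The 0.8 cutoff can be applied with a simple filter on the absolute correlation matrix. The sketch below uses base R on made-up columns `A`, `B` and `C`; the name `highCorrelated` is borrowed from the message above, but how that vector was actually computed is not shown in this report, so the rule used here (drop the later column of each offending pair) is an assumption.

```r
# Hypothetical toy data: B is almost a multiple of A, C is unrelated to both.
x    <- 1:100
flip <- rep(c(1, -1), 50)
toy  <- data.frame(A = x, B = 2 * x + flip, C = flip)

cutoff <- 0.8
cm <- abs(cor(toy))

# Zero out the upper triangle and diagonal so each pair is inspected once.
cm[upper.tri(cm, diag = TRUE)] <- 0

# Row/column positions of the pairs whose correlation exceeds the cutoff.
pairs <- which(cm > cutoff, arr.ind = TRUE)

# Simplified rule: drop the later column of each pair.
highCorrelated <- unique(rownames(cm)[pairs[, "row"]])
reduced <- toy[, setdiff(names(toy), highCorrelated), drop = FALSE]
```

With dplyr, the same selection could be written as `select(toy, -all_of(highCorrelated))`, which is the form the message recommends.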
Categorical: Distribution in General
- The general distribution shows that Country, Order.Date and Ship.Date have a large number of distinct values. I do not expect these variables to be good predictors for later modeling, so they are dropped as well.
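Counting distinct values per column makes this decision explicit. The following is a minimal sketch; the toy rows come from the glimpse output, while the threshold `max_levels` is a hypothetical value chosen only so the example works on four rows.

```r
# Toy rows taken from the head of the small data set.
toy <- data.frame(
  Country       = c("Libya", "Canada", "Japan", "Chad"),
  Sales.Channel = c("Offline", "Online", "Offline", "Offline"),
  Order.Date    = c("10/18/2014", "11/7/2011", "10/31/2016", "4/10/2010"),
  stringsAsFactors = FALSE
)

# Number of distinct values in each column.
n_distinct <- sapply(toy, function(x) length(unique(x)))

# Hypothetical threshold: columns with more levels than this are dropped.
max_levels <- 3
too_many <- names(n_distinct)[n_distinct > max_levels]
kept <- toy[, setdiff(names(toy), too_many), drop = FALSE]
```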
Categorical: Refined Distribution
- The names of the remaining categorical variables and their distributions are shown below.
## [1] "Region" "Item.Type" "Sales.Channel" "Order.Priority"
Categorical: Relational Distribution
- Use the selected categorical variables to examine their relationship with the target variable Total.Profit.
Time Series: Target Total.Profit
The first plot is the time series for the small data set. It shows seasonality, and the trend looks downward, although this is not clear over the long run.
The second plot is the time series for the large data set. It also shows seasonality, but with a different pattern from the small data set, and the trend looks upward, again not clear over the long run.
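To plot Total.Profit over time, the Order.Date strings first have to be parsed as dates. The sketch below assumes a monthly aggregation, which may differ from the granularity used for the plots; the toy values are made up.

```r
# Hypothetical toy data using the same m/d/Y date format as Order.Date.
toy <- data.frame(
  Order.Date   = c("10/18/2014", "10/31/2014", "11/7/2014", "1/5/2015"),
  Total.Profit = c(100, 50, 25, 10),
  stringsAsFactors = FALSE
)

# Parse the character dates, then label each record with its year-month.
toy$Order.Date <- as.Date(toy$Order.Date, format = "%m/%d/%Y")
toy$Month      <- format(toy$Order.Date, "%Y-%m")

# Total profit per month, sorted by month.
monthly <- aggregate(Total.Profit ~ Month, data = toy, FUN = sum)
```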
Pre-processing
1. Drop columns
Based on the Categorical: Relational Distribution plots, I
see that the levels of Sales.Channel, Order.Priority and
Region differ very little with respect to the target variable:
their trend lines follow similar patterns. Item.Type is the
exception. Therefore, I expect Sales.Channel, Order.Priority
and Region to provide little information in later modeling,
and I drop them.
## Item.Type Units.Sold Unit.Cost Total.Cost Total.Profit
## 1 Cosmetics 8446 263.33 2224085.2 1468506.02
## 2 Vegetables 3018 90.93 274426.7 190526.34
## 3 Baby Food 1517 159.42 241840.1 145419.62
## 4 Cereal 3322 117.11 389039.4 294295.98
## 5 Fruits 9845 6.92 68127.4 23726.45
## 6 Cereal 9528 117.11 1115824.1 844085.52
## Item.Type Units.Sold Unit.Cost Total.Cost Total.Profit
## 1 Snacks 934 97.44 91008.96 51500.76
## 2 Cosmetics 4551 263.33 1198414.83 791282.37
## 3 Fruits 9986 6.92 69103.12 24066.26
## 4 Personal Care 9118 56.67 516717.06 228497.08
## 5 Household 5858 502.54 2943879.32 970846.34
## 6 Clothes 1149 35.84 41180.16 84382.56
2. Encoding
All categorical variables need to be converted to dummy variables so that each level carries the same weight.
## Item.Type.Baby Food Item.Type.Beverages Item.Type.Cereal Item.Type.Clothes
## 1 0 0 0 0
## 2 0 0 0 0
## 3 1 0 0 0
## 4 0 0 1 0
## 5 0 0 0 0
## 6 0 0 1 0
## Item.Type.Cosmetics Item.Type.Fruits Item.Type.Household Item.Type.Meat
## 1 1 0 0 0
## 2 0 0 0 0
## 3 0 0 0 0
## 4 0 0 0 0
## 5 0 1 0 0
## 6 0 0 0 0
## Item.Type.Office Supplies Item.Type.Personal Care Item.Type.Snacks
## 1 0 0 0
## 2 0 0 0
## 3 0 0 0
## 4 0 0 0
## 5 0 0 0
## 6 0 0 0
## Item.Type.Vegetables Units.Sold Unit.Cost Total.Cost Total.Profit
## 1 0 8446 263.33 2224085.2 1468506.02
## 2 1 3018 90.93 274426.7 190526.34
## 3 0 1517 159.42 241840.1 145419.62
## 4 0 3322 117.11 389039.4 294295.98
## 5 0 9845 6.92 68127.4 23726.45
## 6 0 9528 117.11 1115824.1 844085.52
## Item.Type.Baby Food Item.Type.Beverages Item.Type.Cereal Item.Type.Clothes
## 1 0 0 0 0
## 2 0 0 0 0
## 3 0 0 0 0
## 4 0 0 0 0
## 5 0 0 0 0
## 6 0 0 0 1
## Item.Type.Cosmetics Item.Type.Fruits Item.Type.Household Item.Type.Meat
## 1 0 0 0 0
## 2 1 0 0 0
## 3 0 1 0 0
## 4 0 0 0 0
## 5 0 0 1 0
## 6 0 0 0 0
## Item.Type.Office Supplies Item.Type.Personal Care Item.Type.Snacks
## 1 0 0 1
## 2 0 0 0
## 3 0 0 0
## 4 0 1 0
## 5 0 0 0
## 6 0 0 0
## Item.Type.Vegetables Units.Sold Unit.Cost Total.Cost Total.Profit
## 1 0 934 97.44 91008.96 51500.76
## 2 0 4551 263.33 1198414.83 791282.37
## 3 0 9986 6.92 69103.12 24066.26
## 4 0 9118 56.67 516717.06 228497.08
## 5 0 5858 502.54 2943879.32 970846.34
## 6 0 1149 35.84 41180.16 84382.56
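One way to obtain this kind of 0/1 encoding in base R is `model.matrix()`. The sketch below is not necessarily the function used for the output above; the column renaming is my own step to mimic the `Item.Type.<level>` names.

```r
# Toy rows taken from the head of the small data set.
toy <- data.frame(
  Item.Type  = c("Cosmetics", "Vegetables", "Baby Food", "Cereal"),
  Units.Sold = c(8446L, 3018L, 1517L, 3322L),
  stringsAsFactors = FALSE
)
toy$Item.Type <- factor(toy$Item.Type)

# "- 1" removes the intercept so that every level gets its own 0/1 column.
dummies <- model.matrix(~ Item.Type - 1, data = toy)

# Rename "Item.TypeCereal" to "Item.Type.Cereal", and so on.
colnames(dummies) <- sub("^Item\\.Type", "Item.Type.", colnames(dummies))

# Combine the dummy columns with the remaining numeric column.
encoded <- cbind(as.data.frame(dummies), Units.Sold = toy$Units.Sold)
```

Keeping a column for every level means the dummies always sum to 1 in each row, which matters later when a model with an intercept is fitted.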
3. Standardization
Values in the data set need to be centered and scaled to improve model performance.
## Item.Type.Baby Food Item.Type.Beverages Item.Type.Cereal Item.Type.Clothes
## 1 -0.3085368 -0.3350145 -0.2927295 -0.2907131
## 2 -0.3085368 -0.3350145 -0.2927295 -0.2907131
## 3 3.2378633 -0.3350145 -0.2927295 -0.2907131
## 4 -0.3085368 -0.3350145 3.4127071 -0.2907131
## 5 -0.3085368 -0.3350145 -0.2927295 -0.2907131
## 6 -0.3085368 -0.3350145 3.4127071 -0.2907131
## Item.Type.Cosmetics Item.Type.Fruits Item.Type.Household Item.Type.Meat
## 1 3.510128 -0.2742144 -0.288687 -0.2907131
## 2 -0.284605 -0.2742144 -0.288687 -0.2907131
## 3 -0.284605 -0.2742144 -0.288687 -0.2907131
## 4 -0.284605 -0.2742144 -0.288687 -0.2907131
## 5 -0.284605 3.6431344 -0.288687 -0.2907131
## 6 -0.284605 -0.2742144 -0.288687 -0.2907131
## Item.Type.Office Supplies Item.Type.Personal Care Item.Type.Snacks
## 1 -0.3124054 -0.3085368 -0.2987228
## 2 -0.3124054 -0.3085368 -0.2987228
## 3 -0.3124054 -0.3085368 -0.2987228
## 4 -0.3124054 -0.3085368 -0.2987228
## 5 -0.3124054 -0.3085368 -0.2987228
## 6 -0.3124054 -0.3085368 -0.2987228
## Item.Type.Vegetables Units.Sold Unit.Cost Total.Cost Total.Profit
## 1 -0.3275855 1.1691049 0.4470603 1.1078603 2.8081089
## 2 3.0495851 -0.7017320 -0.5364566 -0.5691632 -0.5230846
## 3 -0.3275855 -1.2190729 -0.1457311 -0.5971930 -0.6406602
## 4 -0.3275855 -0.5969541 -0.3871035 -0.4705776 -0.2525977
## 5 -0.3275855 1.6512900 -1.0157214 -0.7466142 -0.9578667
## 6 -0.3275855 1.5420315 -0.3871035 0.1545754 1.1804887
## Item.Type.Baby Food Item.Type.Beverages Item.Type.Cereal Item.Type.Clothes
## 1 -0.3029613 -0.3000207 -0.3032367 -0.3009306
## 2 -0.3029613 -0.3000207 -0.3032367 -0.3009306
## 3 -0.3029613 -0.3000207 -0.3032367 -0.3009306
## 4 -0.3029613 -0.3000207 -0.3032367 -0.3009306
## 5 -0.3029613 -0.3000207 -0.3032367 -0.3009306
## 6 -0.3029613 -0.3000207 -0.3032367 3.3229924
## Item.Type.Cosmetics Item.Type.Fruits Item.Type.Household Item.Type.Meat
## 1 -0.3022329 -0.3000999 -0.3004165 -0.3012466
## 2 3.3086737 -0.3000999 -0.3004165 -0.3012466
## 3 -0.3022329 3.3321908 -0.3004165 -0.3012466
## 4 -0.3022329 -0.3000999 -0.3004165 -0.3012466
## 5 -0.3022329 -0.3000999 3.3286787 -0.3012466
## 6 -0.3022329 -0.3000999 -0.3004165 -0.3012466
## Item.Type.Office Supplies Item.Type.Personal Care Item.Type.Snacks
## 1 -0.303335 -0.3021146 3.3221199
## 2 -0.303335 -0.3021146 -0.3010096
## 3 -0.303335 -0.3021146 -0.3010096
## 4 -0.303335 3.3099686 -0.3010096
## 5 -0.303335 -0.3021146 -0.3010096
## 6 -0.303335 -0.3021146 -0.3010096
## Item.Type.Vegetables Units.Sold Unit.Cost Total.Cost Total.Profit
## 1 -0.3004956 -1.4100675 -0.5155185 -0.7387963 -0.9025072
## 2 -0.3004956 -0.1561568 0.4286153 0.2226367 1.0463451
## 3 -0.3004956 1.7280026 -1.0306972 -0.7578146 -0.9747796
## 4 -0.3004956 1.4270918 -0.7475538 -0.3692029 -0.4362349
## 5 -0.3004956 0.2969428 1.7900370 1.7380226 1.5193815
## 6 -0.3004956 -1.3355332 -0.8661041 -0.7820569 -0.8158847
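Centering and scaling means subtracting each column's mean and dividing by its standard deviation. A minimal sketch with base R's `scale()` follows; the toy values are taken from the small data set, and the printed output above suggests the dummy columns were standardized in the same way as the numeric ones.

```r
# Toy rows taken from the head of the small data set.
toy <- data.frame(
  Units.Sold = c(8446, 3018, 1517, 3322),
  Total.Cost = c(2224085.2, 274426.7, 241840.1, 389039.4)
)

# scale() returns a matrix with mean 0 and standard deviation 1 per column.
scaled_matrix <- scale(toy)
scaled <- as.data.frame(scaled_matrix)

# The means and standard deviations used are stored as attributes,
# so the same transformation can be reapplied to new data.
centers <- attr(scaled_matrix, "scaled:center")
scales  <- attr(scaled_matrix, "scaled:scale")
```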
Data Splitting
I use a common split in which the training set contains 75% of the original records and the test set contains the remaining 25%. I fit each model on the training set and then evaluate its performance on the test set at the end.
## [1] "number of records in small training set: 750"
## [1] "number of records in small test set: 250"
## [1] "number of records in large training set: 75000"
## [1] "number of records in large test set: 25000"
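A 75/25 split like the one reported above can be sketched with random row sampling. The seed and the toy data frame below are assumptions for illustration; the actual splitting function used is not shown in this report.

```r
set.seed(123)  # hypothetical seed, only for reproducibility of the sketch

# Toy data with the same number of rows as the small data set.
toy <- data.frame(id = 1:1000, y = rnorm(1000))

# Sample 75% of the row indices for training; the rest form the test set.
train_idx <- sample(seq_len(nrow(toy)), size = floor(0.75 * nrow(toy)))
train <- toy[train_idx, ]
test  <- toy[-train_idx, ]
```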
Modeling
Linear
A linear model is the first model that comes to mind when the target variable is numeric; it is the simplest model to start with.
The linear model on the small data set returns NA coefficients for Unit.Cost and Item.Type.Vegetables (lm reports them as not defined because of singularities), so I update the model by dropping these two features. All remaining features are significant at the 5% level, and the adjusted \(R^2\) of 0.932 is good too.
Since the small and large sets share the same structure, the linear model is fitted on the large set without the two features mentioned above. The adjusted \(R^2\) is roughly the same (0.9244), but the residual standard error increases slightly, from 0.2588 to 0.2748.
##
## Call:
## lm(formula = Total.Profit ~ ., data = sm.train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.12197 -0.14905 -0.00085 0.14715 0.86628
##
## Coefficients: (2 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.001581 0.009464 0.167 0.86738
## `Item.Type.Baby Food` 0.066479 0.012843 5.176 2.92e-07 ***
## Item.Type.Beverages -0.135916 0.012743 -10.666 < 2e-16 ***
## Item.Type.Cereal 0.072586 0.012094 6.002 3.06e-09 ***
## Item.Type.Clothes 0.071912 0.012237 5.877 6.34e-09 ***
## Item.Type.Cosmetics 0.296022 0.012979 22.808 < 2e-16 ***
## Item.Type.Fruits -0.140838 0.012036 -11.701 < 2e-16 ***
## Item.Type.Household 0.035419 0.015255 2.322 0.02051 *
## Item.Type.Meat -0.240118 0.014137 -16.985 < 2e-16 ***
## `Item.Type.Office Supplies` -0.119047 0.016456 -7.234 1.18e-12 ***
## `Item.Type.Personal Care` -0.117251 0.012452 -9.416 < 2e-16 ***
## Item.Type.Snacks -0.031880 0.012181 -2.617 0.00905 **
## Item.Type.Vegetables NA NA NA NA
## Units.Sold 0.256651 0.014023 18.302 < 2e-16 ***
## Unit.Cost NA NA NA NA
## Total.Cost 0.676720 0.021248 31.849 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2588 on 736 degrees of freedom
## Multiple R-squared: 0.9332, Adjusted R-squared: 0.932
## F-statistic: 790.9 on 13 and 736 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = Total.Profit ~ `Item.Type.Baby Food` + Item.Type.Beverages +
## Item.Type.Cereal + Item.Type.Clothes + Item.Type.Cosmetics +
## Item.Type.Fruits + Item.Type.Household + Item.Type.Meat +
## `Item.Type.Office Supplies` + `Item.Type.Personal Care` +
## Item.Type.Snacks + Units.Sold + Total.Cost, data = sm.train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.12197 -0.14905 -0.00085 0.14715 0.86628
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.001581 0.009464 0.167 0.86738
## `Item.Type.Baby Food` 0.066479 0.012843 5.176 2.92e-07 ***
## Item.Type.Beverages -0.135916 0.012743 -10.666 < 2e-16 ***
## Item.Type.Cereal 0.072586 0.012094 6.002 3.06e-09 ***
## Item.Type.Clothes 0.071912 0.012237 5.877 6.34e-09 ***
## Item.Type.Cosmetics 0.296022 0.012979 22.808 < 2e-16 ***
## Item.Type.Fruits -0.140838 0.012036 -11.701 < 2e-16 ***
## Item.Type.Household 0.035419 0.015255 2.322 0.02051 *
## Item.Type.Meat -0.240118 0.014137 -16.985 < 2e-16 ***
## `Item.Type.Office Supplies` -0.119047 0.016456 -7.234 1.18e-12 ***
## `Item.Type.Personal Care` -0.117251 0.012452 -9.416 < 2e-16 ***
## Item.Type.Snacks -0.031880 0.012181 -2.617 0.00905 **
## Units.Sold 0.256651 0.014023 18.302 < 2e-16 ***
## Total.Cost 0.676720 0.021248 31.849 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2588 on 736 degrees of freedom
## Multiple R-squared: 0.9332, Adjusted R-squared: 0.932
## F-statistic: 790.9 on 13 and 736 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = Total.Profit ~ . - Item.Type.Vegetables - Unit.Cost,
## data = lg.train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.03826 -0.15550 0.00044 0.15482 1.03688
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.0002897 0.0010033 0.289 0.773
## `Item.Type.Baby Food` 0.0652432 0.0013765 47.397 <2e-16 ***
## Item.Type.Beverages -0.1256861 0.0013733 -91.518 <2e-16 ***
## Item.Type.Cereal 0.0730669 0.0013673 53.437 <2e-16 ***
## Item.Type.Clothes 0.0810663 0.0013705 59.153 <2e-16 ***
## Item.Type.Cosmetics 0.2678688 0.0014423 185.727 <2e-16 ***
## Item.Type.Fruits -0.1519964 0.0013778 -110.317 <2e-16 ***
## Item.Type.Household 0.0485853 0.0017670 27.496 <2e-16 ***
## Item.Type.Meat -0.2374692 0.0015506 -153.150 <2e-16 ***
## `Item.Type.Office Supplies` -0.1138407 0.0018176 -62.631 <2e-16 ***
## `Item.Type.Personal Care` -0.1111023 0.0013650 -81.394 <2e-16 ***
## Item.Type.Snacks -0.0341179 0.0013587 -25.111 <2e-16 ***
## Units.Sold 0.2886331 0.0014699 196.360 <2e-16 ***
## Total.Cost 0.6580395 0.0022857 287.888 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2748 on 74986 degrees of freedom
## Multiple R-squared: 0.9244, Adjusted R-squared: 0.9244
## F-statistic: 7.052e+04 on 13 and 74986 DF, p-value: < 2.2e-16
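The NA coefficients in the first fit come from exact linear dependence among the predictors. The sketch below reproduces the effect on made-up data under two assumptions: every item type has its own dummy column, and Unit.Cost is constant within an item type (which the sample rows suggest, but which I have not verified for the full data).

```r
set.seed(42)  # hypothetical seed

item <- rep(c("Cereal", "Fruits", "Snacks"), each = 20)

# Unit cost per item type, using values that appear in the data above.
unit_cost <- c(Cereal = 117.11, Fruits = 6.92, Snacks = 97.44)[item]

toy <- data.frame(
  Cereal     = as.numeric(item == "Cereal"),
  Fruits     = as.numeric(item == "Fruits"),
  Snacks     = as.numeric(item == "Snacks"),
  Unit.Cost  = unname(unit_cost),
  Units.Sold = runif(60, 1, 10000)
)
toy$Total.Profit <- 0.5 * toy$Units.Sold + 100 * toy$Fruits + rnorm(60)

fit <- lm(Total.Profit ~ ., data = toy)

# Snacks = 1 - Cereal - Fruits duplicates the intercept, and Unit.Cost is a
# weighted sum of the three dummies, so lm cannot estimate either of them.
aliased <- names(coef(fit))[is.na(coef(fit))]
```

Dropping the aliased terms does not change the fitted values, which is consistent with the first two summaries above reporting identical residuals and \(R^2\).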
Linear Diagnostic Plot
Simple Decision Tree
The fitted trees differ slightly between the small and large data sets.
The top three variables by importance remain Total.Cost, Unit.Cost and Units.Sold, although Units.Sold ranks above Unit.Cost in the second table.
## Overall
## Total.Cost 3.63887284
## Unit.Cost 2.14846488
## Units.Sold 1.83768943
## Item.Type.Cosmetics 1.04225956
## Item.Type.Meat 0.99629313
## Item.Type.Household 0.98349119
## Item.Type.Clothes 0.39831007
## Item.Type.Beverages 0.25233408
## Item.Type.Fruits 0.14390896
## Item.Type.Snacks 0.12582920
## Item.Type.Vegetables 0.10169810
## Item.Type.Office Supplies 0.06916356
## `Item.Type.Baby Food` 0.00000000
## Item.Type.Cereal 0.00000000
## `Item.Type.Office Supplies` 0.00000000
## `Item.Type.Personal Care` 0.00000000
## Overall
## Total.Cost 3.35390021
## Units.Sold 2.70577846
## Unit.Cost 2.60196314
## Item.Type.Cosmetics 1.47844212
## Item.Type.Office Supplies 1.18682510
## Item.Type.Clothes 0.50247599
## Item.Type.Snacks 0.28358551
## Item.Type.Meat 0.26590618
## Item.Type.Household 0.19211021
## Item.Type.Fruits 0.15584585
## Item.Type.Cereal 0.12103425
## Item.Type.Beverages 0.10591413
## Item.Type.Personal Care 0.10183406
## Item.Type.Vegetables 0.04940815
## `Item.Type.Baby Food` 0.00000000
## `Item.Type.Office Supplies` 0.00000000
## `Item.Type.Personal Care` 0.00000000
Bagging
##
## Bagging regression trees with 25 bootstrap replications
##
## Call: bagging.data.frame(formula = Total.Profit ~ ., data = sm.train,
## coob = TRUE)
##
## Out-of-bag estimate of root mean squared error: 0.2455
##
## Bagging regression trees with 25 bootstrap replications
##
## Call: bagging.data.frame(formula = Total.Profit ~ ., data = lg.train,
## coob = TRUE)
##
## Out-of-bag estimate of root mean squared error: 0.2655
Performance
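Use RMSE to compare the performance of the three models. Bagging has the lowest RMSE on both data sets (0.256 and 0.266), and the decision tree has the highest (0.363 and 0.284).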
## linear decision_tree bagging
## small data set 0.289 0.363 0.256
## large data set 0.273 0.284 0.266
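For reference, RMSE is the square root of the mean squared difference between actual and predicted values. A minimal sketch follows; the helper name `rmse` and the toy vectors are my own and are not taken from the analysis code.

```r
# Root mean squared error between two numeric vectors of equal length.
rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

# Toy example: one prediction is off by 2, the others are exact.
actual    <- c(1.0, 2.0, 3.0, 4.0)
predicted <- c(1.0, 2.0, 3.0, 6.0)
toy_rmse  <- rmse(actual, predicted)  # sqrt((0 + 0 + 0 + 4) / 4) = 1
```

Because the target was standardized before modeling, the RMSE values in the table are in units of standard deviations of Total.Profit rather than in currency.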