622: hw1

Jie Zou

2022-10-10

Data Summary

Summary in general

  • The small data set has dimensions (1000, 14)

  • The large data set has dimensions (100000, 14)

  • Of the 14 features, 7 are categorical and the remaining 7 are numeric (integer or double)

## Rows: 1,000
## Columns: 14
## $ Region         <chr> "Middle East and North Africa", "North America", "Middl…
## $ Country        <chr> "Libya", "Canada", "Libya", "Japan", "Chad", "Armenia",…
## $ Item.Type      <chr> "Cosmetics", "Vegetables", "Baby Food", "Cereal", "Frui…
## $ Sales.Channel  <chr> "Offline", "Online", "Offline", "Offline", "Offline", "…
## $ Order.Priority <chr> "M", "M", "C", "C", "H", "H", "H", "M", "H", "H", "M", …
## $ Order.Date     <chr> "10/18/2014", "11/7/2011", "10/31/2016", "4/10/2010", "…
## $ Order.ID       <int> 686800706, 185941302, 246222341, 161442649, 645713555, …
## $ Ship.Date      <chr> "10/31/2014", "12/8/2011", "12/9/2016", "5/12/2010", "8…
## $ Units.Sold     <int> 8446, 3018, 1517, 3322, 9845, 9528, 2844, 7299, 2428, 4…
## $ Unit.Price     <dbl> 437.20, 154.06, 255.28, 205.70, 9.33, 205.70, 205.70, 1…
## $ Unit.Cost      <dbl> 263.33, 90.93, 159.42, 117.11, 6.92, 117.11, 117.11, 35…
## $ Total.Revenue  <dbl> 3692591.20, 464953.08, 387259.76, 683335.40, 91853.85, …
## $ Total.Cost     <dbl> 2224085.18, 274426.74, 241840.14, 389039.42, 68127.40, …
## $ Total.Profit   <dbl> 1468506.02, 190526.34, 145419.62, 294295.98, 23726.45, …
## Rows: 100,000
## Columns: 14
## $ Region         <chr> "Middle East and North Africa", "Central America and th…
## $ Country        <chr> "Azerbaijan", "Panama", "Sao Tome and Principe", "Sao T…
## $ Item.Type      <chr> "Snacks", "Cosmetics", "Fruits", "Personal Care", "Hous…
## $ Sales.Channel  <chr> "Online", "Offline", "Offline", "Online", "Offline", "O…
## $ Order.Priority <chr> "C", "L", "M", "M", "H", "C", "M", "C", "H", "H", "C", …
## $ Order.Date     <chr> "10/8/2014", "2/22/2015", "12/9/2015", "9/17/2014", "2/…
## $ Order.ID       <int> 535113847, 874708545, 854349935, 892836844, 129280602, …
## $ Ship.Date      <chr> "10/23/2014", "2/27/2015", "1/18/2016", "10/12/2014", "…
## $ Units.Sold     <int> 934, 4551, 9986, 9118, 5858, 1149, 7964, 6307, 8217, 27…
## $ Unit.Price     <dbl> 152.58, 437.20, 9.33, 81.73, 668.27, 109.28, 437.20, 9.…
## $ Unit.Cost      <dbl> 97.44, 263.33, 6.92, 56.67, 502.54, 35.84, 263.33, 6.92…
## $ Total.Revenue  <dbl> 142509.72, 1989697.20, 93169.38, 745214.14, 3914725.66,…
## $ Total.Cost     <dbl> 91008.96, 1198414.83, 69103.12, 516717.06, 2943879.32, …
## $ Total.Profit   <dbl> 51500.76, 791282.37, 24066.26, 228497.08, 970846.34, 84…

Statistical Summary

  • Neither data set shows obvious missing data
##     Region            Country           Item.Type         Sales.Channel     
##  Length:1000        Length:1000        Length:1000        Length:1000       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##  Order.Priority      Order.Date           Order.ID          Ship.Date        
##  Length:1000        Length:1000        Min.   :102928006   Length:1000       
##  Class :character   Class :character   1st Qu.:328074026   Class :character  
##  Mode  :character   Mode  :character   Median :556609714   Mode  :character  
##                                        Mean   :549681325                     
##                                        3rd Qu.:769694483                     
##                                        Max.   :995529830                     
##    Units.Sold     Unit.Price       Unit.Cost      Total.Revenue    
##  Min.   :  13   Min.   :  9.33   Min.   :  6.92   Min.   :   2043  
##  1st Qu.:2420   1st Qu.: 81.73   1st Qu.: 56.67   1st Qu.: 281192  
##  Median :5184   Median :154.06   Median : 97.44   Median : 754939  
##  Mean   :5054   Mean   :262.11   Mean   :184.97   Mean   :1327322  
##  3rd Qu.:7537   3rd Qu.:421.89   3rd Qu.:263.33   3rd Qu.:1733503  
##  Max.   :9998   Max.   :668.27   Max.   :524.96   Max.   :6617210  
##    Total.Cost       Total.Profit      
##  Min.   :   1417   Min.   :    532.6  
##  1st Qu.: 164932   1st Qu.:  98376.1  
##  Median : 464726   Median : 277226.0  
##  Mean   : 936119   Mean   : 391202.6  
##  3rd Qu.:1141750   3rd Qu.: 548456.8  
##  Max.   :5204978   Max.   :1726181.4
##     Region            Country           Item.Type         Sales.Channel     
##  Length:100000      Length:100000      Length:100000      Length:100000     
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##  Order.Priority      Order.Date           Order.ID          Ship.Date        
##  Length:100000      Length:100000      Min.   :100008904   Length:100000     
##  Class :character   Class :character   1st Qu.:326046383   Class :character  
##  Mode  :character   Mode  :character   Median :547718512   Mode  :character  
##                                        Mean   :550395554                     
##                                        3rd Qu.:775078534                     
##                                        Max.   :999996459                     
##    Units.Sold      Unit.Price       Unit.Cost      Total.Revenue    
##  Min.   :    1   Min.   :  9.33   Min.   :  6.92   Min.   :     19  
##  1st Qu.: 2505   1st Qu.:109.28   1st Qu.: 56.67   1st Qu.: 279753  
##  Median : 5007   Median :205.70   Median :117.11   Median : 789892  
##  Mean   : 5001   Mean   :266.70   Mean   :188.02   Mean   :1336067  
##  3rd Qu.: 7495   3rd Qu.:437.20   3rd Qu.:364.69   3rd Qu.:1836490  
##  Max.   :10000   Max.   :668.27   Max.   :524.96   Max.   :6682700  
##    Total.Cost       Total.Profit      
##  Min.   :     14   Min.   :      4.8  
##  1st Qu.: 162928   1st Qu.:  95900.0  
##  Median : 467937   Median : 283657.5  
##  Mean   : 941975   Mean   : 394091.2  
##  3rd Qu.:1209475   3rd Qu.: 568384.1  
##  Max.   :5249075   Max.   :1738700.0
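
A minimal sketch of how the missing-data check could be made explicit, assuming a data frame with the same kinds of columns. The `sales` data frame and its values are made up for illustration; they are not the report's data. Because `summary()` does not flag empty strings in character columns, the sketch counts those separately.

```r
# Toy stand-in for the sales data; values are made up for illustration.
sales <- data.frame(
  Region     = c("Asia", "Europe", "", "Asia"),
  Units.Sold = c(10L, NA, 7L, 3L),
  Unit.Cost  = c(6.92, 35.84, 97.44, 6.92),
  stringsAsFactors = FALSE
)

# Count NA values per column.
na_counts <- colSums(is.na(sales))

# Count empty or whitespace-only strings in character columns,
# which summary() would not report as missing.
blank_counts <- vapply(
  sales,
  function(col) {
    if (is.character(col)) sum(!is.na(col) & trimws(col) == "") else 0L
  },
  integer(1)
)

print(na_counts)
print(blank_counts)
```

If both vectors are all zeros on the real data, the "no obvious missing data" observation holds for NA values and blank strings alike.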

Visualization

Numeric: Distribution

  • Total.Cost, Total.Profit and Total.Revenue have similarly shaped distributions in both data sets, which suggests that these three variables are correlated

  • Unit.Cost and Unit.Price show more peaks in the large data set than in the small data set

  • Units.Sold looks much more stable in the large data set than in the small data set

Numeric: Outliers

  • Except for Units.Sold, the numeric variables in the large data set either shift to the right or extend the 3rd quartile

  • Units.Sold is slightly skewed in the small data set, but symmetric in the large data set

  • Neither data set shows extreme outliers

Numeric: Correlation

  • Total.Revenue, Total.Cost and Total.Profit are highly correlated, as noted above. The large data set (second graph) shows slightly higher correlations between the unit-level variables (Unit.Cost, Unit.Price) and the totals (Total.Revenue, Total.Cost, Total.Profit)

  • Unit.Price and Unit.Cost are also highly correlated

Numeric: Correlation Cont.

  • After dropping the highly correlated variables, the correlations among the remaining variables are shown below.

  • For some pairs, the correlation in the large data set is still slightly higher than in the small data set, but none exceeds the cutoff, which I set to 0.8

## Note: Using an external vector in selections is ambiguous.
## ℹ Use `all_of(highCorrelated)` instead of `highCorrelated` to silence this message.
## ℹ See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
## This message is displayed once per session.
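
The report does not show how the `highCorrelated` vector named in the note above was built. As a hedged sketch, the pairs exceeding the 0.8 cutoff could be listed in base R as follows. The `toy` data frame is synthetic and only mimics the relationships among the unit-level and total columns.

```r
set.seed(622)  # arbitrary seed for the toy data

# Synthetic numeric data: price is a fixed multiple of cost, and the totals
# are units times the unit-level values.
n <- 500
units <- sample(1:10000, n, replace = TRUE)
unit_cost <- sample(c(6.92, 35.84, 97.44, 263.33, 502.54), n, replace = TRUE)
toy <- data.frame(
  Units.Sold = units,
  Unit.Cost  = unit_cost,
  Unit.Price = unit_cost * 1.5,
  Total.Cost = units * unit_cost
)
toy$Total.Revenue <- toy$Units.Sold * toy$Unit.Price

cutoff <- 0.8
cor_mat <- cor(toy)

# List each pair once (upper triangle) whose absolute correlation
# exceeds the cutoff.
idx <- which(abs(cor_mat) > cutoff & upper.tri(cor_mat), arr.ind = TRUE)
high_pairs <- data.frame(
  var1 = rownames(cor_mat)[idx[, "row"]],
  var2 = colnames(cor_mat)[idx[, "col"]],
  r    = cor_mat[idx]
)
print(high_pairs)
```

One variable from each listed pair can then be dropped; which one to keep is a modeling choice that the sketch does not make.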

Categorical: Distribution in General

  • Based on the general distribution, Country, Order.Date and Ship.Date have many distinct values. I do not expect these variables to be good predictors in later modeling, so they are dropped as well.
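
The number of distinct values per categorical column can also be counted directly instead of being read off a plot. The sketch below uses a made-up `sales` data frame, and the `max_share` threshold is a hypothetical rule chosen for illustration; the report selects the columns by inspection.

```r
# Toy stand-in; values are illustrative only.
sales <- data.frame(
  Country       = c("Libya", "Canada", "Japan", "Chad", "Armenia", "Panama"),
  Sales.Channel = c("Offline", "Online", "Offline", "Offline", "Online", "Online"),
  Order.Date    = c("10/18/2014", "11/7/2011", "10/31/2016",
                    "4/10/2010", "8/16/2011", "2/22/2015"),
  stringsAsFactors = FALSE
)

# Number of distinct values in each character column.
is_chr <- vapply(sales, is.character, logical(1))
n_distinct <- vapply(sales[is_chr], function(col) length(unique(col)), integer(1))
print(n_distinct)

# Hypothetical rule: drop columns in which more than half of the rows
# hold distinct values. The threshold would need tuning on real data.
max_share <- 0.5
to_drop <- names(n_distinct)[n_distinct / nrow(sales) > max_share]
reduced <- sales[, setdiff(names(sales), to_drop), drop = FALSE]
print(names(reduced))
```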

Categorical: Refined Distribution

  • The names of the remaining categorical variables and their distributions are shown below.
## [1] "Region"         "Item.Type"      "Sales.Channel"  "Order.Priority"

Categorical: Relational Distribution

  • Use the selected categorical variables to examine their relationship with the target variable, Total.Profit

Time Series: Target Total.Profit

  • The first plot is the time series for the small data set. Seasonality is visible, and the trend looks downward, although it is not clear over the long run

  • The second plot is the time series for the large data set. Seasonality is visible as well, but it differs from the small data set. The trend looks upward, although it is not clear over the long run
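
The report does not state how the series was aggregated. One plausible approach, sketched here on made-up rows, is to parse Order.Date (month/day/year) and sum Total.Profit by month. The monthly granularity is an assumption.

```r
# Toy stand-in using the same date format as Order.Date (month/day/year).
sales <- data.frame(
  Order.Date   = c("10/18/2014", "10/31/2014", "11/7/2014",
                   "11/20/2014", "12/9/2014"),
  Total.Profit = c(100, 50, 30, 20, 80),
  stringsAsFactors = FALSE
)

# Parse the character dates, then label each order with its year-month.
sales$Order.Date <- as.Date(sales$Order.Date, format = "%m/%d/%Y")
sales$Month <- format(sales$Order.Date, "%Y-%m")

# Sum profit per month; the result is sorted by the Month label.
monthly <- aggregate(Total.Profit ~ Month, data = sales, FUN = sum)
print(monthly)

# A monthly ts object; tools such as decompose() need at least
# two full periods (24 months) of data to separate trend and seasonality.
profit_ts <- ts(monthly$Total.Profit, start = c(2014, 10), frequency = 12)
print(profit_ts)
```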

Pre-processing

1. Drop columns

Based on the Categorical: Relational Distribution plots, the levels within Sales.Channel, Order.Priority and Region differ little with respect to the target variable: the lines for each level follow a similar pattern. Item.Type is the exception. Therefore, I do not expect Sales.Channel, Order.Priority or Region to provide significant information in future modeling, and they are dropped.

##    Item.Type Units.Sold Unit.Cost Total.Cost Total.Profit
## 1  Cosmetics       8446    263.33  2224085.2   1468506.02
## 2 Vegetables       3018     90.93   274426.7    190526.34
## 3  Baby Food       1517    159.42   241840.1    145419.62
## 4     Cereal       3322    117.11   389039.4    294295.98
## 5     Fruits       9845      6.92    68127.4     23726.45
## 6     Cereal       9528    117.11  1115824.1    844085.52
##       Item.Type Units.Sold Unit.Cost Total.Cost Total.Profit
## 1        Snacks        934     97.44   91008.96     51500.76
## 2     Cosmetics       4551    263.33 1198414.83    791282.37
## 3        Fruits       9986      6.92   69103.12     24066.26
## 4 Personal Care       9118     56.67  516717.06    228497.08
## 5     Household       5858    502.54 2943879.32    970846.34
## 6       Clothes       1149     35.84   41180.16     84382.56

2. Encoding

All categorical variables need to be converted to dummy variables so that each level carries the same weight.

##   Item.Type.Baby Food Item.Type.Beverages Item.Type.Cereal Item.Type.Clothes
## 1                   0                   0                0                 0
## 2                   0                   0                0                 0
## 3                   1                   0                0                 0
## 4                   0                   0                1                 0
## 5                   0                   0                0                 0
## 6                   0                   0                1                 0
##   Item.Type.Cosmetics Item.Type.Fruits Item.Type.Household Item.Type.Meat
## 1                   1                0                   0              0
## 2                   0                0                   0              0
## 3                   0                0                   0              0
## 4                   0                0                   0              0
## 5                   0                1                   0              0
## 6                   0                0                   0              0
##   Item.Type.Office Supplies Item.Type.Personal Care Item.Type.Snacks
## 1                         0                       0                0
## 2                         0                       0                0
## 3                         0                       0                0
## 4                         0                       0                0
## 5                         0                       0                0
## 6                         0                       0                0
##   Item.Type.Vegetables Units.Sold Unit.Cost Total.Cost Total.Profit
## 1                    0       8446    263.33  2224085.2   1468506.02
## 2                    1       3018     90.93   274426.7    190526.34
## 3                    0       1517    159.42   241840.1    145419.62
## 4                    0       3322    117.11   389039.4    294295.98
## 5                    0       9845      6.92    68127.4     23726.45
## 6                    0       9528    117.11  1115824.1    844085.52
##   Item.Type.Baby Food Item.Type.Beverages Item.Type.Cereal Item.Type.Clothes
## 1                   0                   0                0                 0
## 2                   0                   0                0                 0
## 3                   0                   0                0                 0
## 4                   0                   0                0                 0
## 5                   0                   0                0                 0
## 6                   0                   0                0                 1
##   Item.Type.Cosmetics Item.Type.Fruits Item.Type.Household Item.Type.Meat
## 1                   0                0                   0              0
## 2                   1                0                   0              0
## 3                   0                1                   0              0
## 4                   0                0                   0              0
## 5                   0                0                   1              0
## 6                   0                0                   0              0
##   Item.Type.Office Supplies Item.Type.Personal Care Item.Type.Snacks
## 1                         0                       0                1
## 2                         0                       0                0
## 3                         0                       0                0
## 4                         0                       1                0
## 5                         0                       0                0
## 6                         0                       0                0
##   Item.Type.Vegetables Units.Sold Unit.Cost Total.Cost Total.Profit
## 1                    0        934     97.44   91008.96     51500.76
## 2                    0       4551    263.33 1198414.83    791282.37
## 3                    0       9986      6.92   69103.12     24066.26
## 4                    0       9118     56.67  516717.06    228497.08
## 5                    0       5858    502.54 2943879.32    970846.34
## 6                    0       1149     35.84   41180.16     84382.56
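
The encoding function used in the report is not shown. A minimal base-R sketch that yields the same kind of one-column-per-level layout uses `model.matrix()`; the four rows below are taken from the sample output, and the renaming step is only there to mirror the column names above.

```r
sales <- data.frame(
  Item.Type  = c("Cosmetics", "Vegetables", "Baby Food", "Cereal"),
  Units.Sold = c(8446L, 3018L, 1517L, 3322L),
  stringsAsFactors = FALSE
)

# "- 1" removes the intercept so that every level gets its own 0/1 column.
dummies <- model.matrix(~ Item.Type - 1, data = sales)

# model.matrix() names columns like "Item.TypeCosmetics"; insert a dot to
# mirror the "Item.Type.Cosmetics" naming seen in the output above.
colnames(dummies) <- sub("^Item\\.Type", "Item.Type.", colnames(dummies))

encoded <- cbind(as.data.frame(dummies), Units.Sold = sales$Units.Sold)
print(encoded)
```

Keeping a column for every level means the dummy columns always sum to 1 in each row, which matters later when a model with an intercept is fitted.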

3. Standardization

Values in the data set need to be centered and scaled to improve model performance.

##   Item.Type.Baby Food Item.Type.Beverages Item.Type.Cereal Item.Type.Clothes
## 1          -0.3085368          -0.3350145       -0.2927295        -0.2907131
## 2          -0.3085368          -0.3350145       -0.2927295        -0.2907131
## 3           3.2378633          -0.3350145       -0.2927295        -0.2907131
## 4          -0.3085368          -0.3350145        3.4127071        -0.2907131
## 5          -0.3085368          -0.3350145       -0.2927295        -0.2907131
## 6          -0.3085368          -0.3350145        3.4127071        -0.2907131
##   Item.Type.Cosmetics Item.Type.Fruits Item.Type.Household Item.Type.Meat
## 1            3.510128       -0.2742144           -0.288687     -0.2907131
## 2           -0.284605       -0.2742144           -0.288687     -0.2907131
## 3           -0.284605       -0.2742144           -0.288687     -0.2907131
## 4           -0.284605       -0.2742144           -0.288687     -0.2907131
## 5           -0.284605        3.6431344           -0.288687     -0.2907131
## 6           -0.284605       -0.2742144           -0.288687     -0.2907131
##   Item.Type.Office Supplies Item.Type.Personal Care Item.Type.Snacks
## 1                -0.3124054              -0.3085368       -0.2987228
## 2                -0.3124054              -0.3085368       -0.2987228
## 3                -0.3124054              -0.3085368       -0.2987228
## 4                -0.3124054              -0.3085368       -0.2987228
## 5                -0.3124054              -0.3085368       -0.2987228
## 6                -0.3124054              -0.3085368       -0.2987228
##   Item.Type.Vegetables Units.Sold  Unit.Cost Total.Cost Total.Profit
## 1           -0.3275855  1.1691049  0.4470603  1.1078603    2.8081089
## 2            3.0495851 -0.7017320 -0.5364566 -0.5691632   -0.5230846
## 3           -0.3275855 -1.2190729 -0.1457311 -0.5971930   -0.6406602
## 4           -0.3275855 -0.5969541 -0.3871035 -0.4705776   -0.2525977
## 5           -0.3275855  1.6512900 -1.0157214 -0.7466142   -0.9578667
## 6           -0.3275855  1.5420315 -0.3871035  0.1545754    1.1804887
##   Item.Type.Baby Food Item.Type.Beverages Item.Type.Cereal Item.Type.Clothes
## 1          -0.3029613          -0.3000207       -0.3032367        -0.3009306
## 2          -0.3029613          -0.3000207       -0.3032367        -0.3009306
## 3          -0.3029613          -0.3000207       -0.3032367        -0.3009306
## 4          -0.3029613          -0.3000207       -0.3032367        -0.3009306
## 5          -0.3029613          -0.3000207       -0.3032367        -0.3009306
## 6          -0.3029613          -0.3000207       -0.3032367         3.3229924
##   Item.Type.Cosmetics Item.Type.Fruits Item.Type.Household Item.Type.Meat
## 1          -0.3022329       -0.3000999          -0.3004165     -0.3012466
## 2           3.3086737       -0.3000999          -0.3004165     -0.3012466
## 3          -0.3022329        3.3321908          -0.3004165     -0.3012466
## 4          -0.3022329       -0.3000999          -0.3004165     -0.3012466
## 5          -0.3022329       -0.3000999           3.3286787     -0.3012466
## 6          -0.3022329       -0.3000999          -0.3004165     -0.3012466
##   Item.Type.Office Supplies Item.Type.Personal Care Item.Type.Snacks
## 1                 -0.303335              -0.3021146        3.3221199
## 2                 -0.303335              -0.3021146       -0.3010096
## 3                 -0.303335              -0.3021146       -0.3010096
## 4                 -0.303335               3.3099686       -0.3010096
## 5                 -0.303335              -0.3021146       -0.3010096
## 6                 -0.303335              -0.3021146       -0.3010096
##   Item.Type.Vegetables Units.Sold  Unit.Cost Total.Cost Total.Profit
## 1           -0.3004956 -1.4100675 -0.5155185 -0.7387963   -0.9025072
## 2           -0.3004956 -0.1561568  0.4286153  0.2226367    1.0463451
## 3           -0.3004956  1.7280026 -1.0306972 -0.7578146   -0.9747796
## 4           -0.3004956  1.4270918 -0.7475538 -0.3692029   -0.4362349
## 5           -0.3004956  0.2969428  1.7900370  1.7380226    1.5193815
## 6           -0.3004956 -1.3355332 -0.8661041 -0.7820569   -0.8158847
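
Centering and scaling can be sketched with base R's `scale()`, which subtracts each column's mean and divides by its standard deviation. The five rows below are copied from the sample output; whether the report used `scale()` or another helper is not shown.

```r
toy <- data.frame(
  Units.Sold = c(8446, 3018, 1517, 3322, 9845),
  Total.Cost = c(2224085.2, 274426.7, 241840.1, 389039.4, 68127.4)
)

# scale() centers each column on its mean and divides by its standard deviation.
scaled <- as.data.frame(scale(toy))
print(scaled)

# Every column now has mean 0 and standard deviation 1 (up to rounding).
print(round(colMeans(scaled), 10))
print(apply(scaled, 2, sd))
```

The output above shows that the 0/1 dummy columns were standardized as well, which is why they take values such as -0.3085 and 3.2379 rather than 0 and 1.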

Data Splitting

I use a common split in which the training set contains 75% of the original records and the test set contains the remaining 25%. I fit each model on the training set and then use the test set to evaluate model performance at the end.

## [1] "number of records in small training set: 750"
## [1] "number of records in small test set: 250"
## [1] "number of records in large training set: 75000"
## [1] "number of records in large test set: 25000"
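
A minimal sketch of a 75/25 random split in base R. The seed and the `toy` data frame are arbitrary, and the report may have used a different helper to draw the split.

```r
set.seed(622)  # arbitrary seed so the split is reproducible

toy <- data.frame(id = 1:1000, y = rnorm(1000))

# Draw 75% of the row indices without replacement for training.
train_idx <- sample(nrow(toy), size = floor(0.75 * nrow(toy)))
train <- toy[train_idx, ]
test  <- toy[-train_idx, ]

print(paste("number of records in training set:", nrow(train)))
print(paste("number of records in test set:", nrow(test)))
```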

Modeling

Linear

  • A linear model is the first model that comes to mind when the target variable is numeric; it is the simplest model to start with.

  • The linear model on the small data set reports NA coefficients for Unit.Cost and Item.Type.Vegetables (the summary notes 2 coefficients not defined because of singularities), so I update the model by dropping these two features.

  • After dropping them, all remaining features are significant at the 5% level, and the adjusted \(R^2\) of 0.932 is good too.

  • Since the small and large data sets share the same structure, the linear model is fitted to the large data set without the two features mentioned above. The adjusted \(R^2\) is roughly the same (0.9244), but the residual standard error increases slightly, from 0.2588 to 0.2748.

## 
## Call:
## lm(formula = Total.Profit ~ ., data = sm.train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.12197 -0.14905 -0.00085  0.14715  0.86628 
## 
## Coefficients: (2 not defined because of singularities)
##                              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  0.001581   0.009464   0.167  0.86738    
## `Item.Type.Baby Food`        0.066479   0.012843   5.176 2.92e-07 ***
## Item.Type.Beverages         -0.135916   0.012743 -10.666  < 2e-16 ***
## Item.Type.Cereal             0.072586   0.012094   6.002 3.06e-09 ***
## Item.Type.Clothes            0.071912   0.012237   5.877 6.34e-09 ***
## Item.Type.Cosmetics          0.296022   0.012979  22.808  < 2e-16 ***
## Item.Type.Fruits            -0.140838   0.012036 -11.701  < 2e-16 ***
## Item.Type.Household          0.035419   0.015255   2.322  0.02051 *  
## Item.Type.Meat              -0.240118   0.014137 -16.985  < 2e-16 ***
## `Item.Type.Office Supplies` -0.119047   0.016456  -7.234 1.18e-12 ***
## `Item.Type.Personal Care`   -0.117251   0.012452  -9.416  < 2e-16 ***
## Item.Type.Snacks            -0.031880   0.012181  -2.617  0.00905 ** 
## Item.Type.Vegetables               NA         NA      NA       NA    
## Units.Sold                   0.256651   0.014023  18.302  < 2e-16 ***
## Unit.Cost                          NA         NA      NA       NA    
## Total.Cost                   0.676720   0.021248  31.849  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2588 on 736 degrees of freedom
## Multiple R-squared:  0.9332, Adjusted R-squared:  0.932 
## F-statistic: 790.9 on 13 and 736 DF,  p-value: < 2.2e-16
## 
## Call:
## lm(formula = Total.Profit ~ `Item.Type.Baby Food` + Item.Type.Beverages + 
##     Item.Type.Cereal + Item.Type.Clothes + Item.Type.Cosmetics + 
##     Item.Type.Fruits + Item.Type.Household + Item.Type.Meat + 
##     `Item.Type.Office Supplies` + `Item.Type.Personal Care` + 
##     Item.Type.Snacks + Units.Sold + Total.Cost, data = sm.train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.12197 -0.14905 -0.00085  0.14715  0.86628 
## 
## Coefficients:
##                              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  0.001581   0.009464   0.167  0.86738    
## `Item.Type.Baby Food`        0.066479   0.012843   5.176 2.92e-07 ***
## Item.Type.Beverages         -0.135916   0.012743 -10.666  < 2e-16 ***
## Item.Type.Cereal             0.072586   0.012094   6.002 3.06e-09 ***
## Item.Type.Clothes            0.071912   0.012237   5.877 6.34e-09 ***
## Item.Type.Cosmetics          0.296022   0.012979  22.808  < 2e-16 ***
## Item.Type.Fruits            -0.140838   0.012036 -11.701  < 2e-16 ***
## Item.Type.Household          0.035419   0.015255   2.322  0.02051 *  
## Item.Type.Meat              -0.240118   0.014137 -16.985  < 2e-16 ***
## `Item.Type.Office Supplies` -0.119047   0.016456  -7.234 1.18e-12 ***
## `Item.Type.Personal Care`   -0.117251   0.012452  -9.416  < 2e-16 ***
## Item.Type.Snacks            -0.031880   0.012181  -2.617  0.00905 ** 
## Units.Sold                   0.256651   0.014023  18.302  < 2e-16 ***
## Total.Cost                   0.676720   0.021248  31.849  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2588 on 736 degrees of freedom
## Multiple R-squared:  0.9332, Adjusted R-squared:  0.932 
## F-statistic: 790.9 on 13 and 736 DF,  p-value: < 2.2e-16
## 
## Call:
## lm(formula = Total.Profit ~ . - Item.Type.Vegetables - Unit.Cost, 
##     data = lg.train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.03826 -0.15550  0.00044  0.15482  1.03688 
## 
## Coefficients:
##                               Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)                  0.0002897  0.0010033    0.289    0.773    
## `Item.Type.Baby Food`        0.0652432  0.0013765   47.397   <2e-16 ***
## Item.Type.Beverages         -0.1256861  0.0013733  -91.518   <2e-16 ***
## Item.Type.Cereal             0.0730669  0.0013673   53.437   <2e-16 ***
## Item.Type.Clothes            0.0810663  0.0013705   59.153   <2e-16 ***
## Item.Type.Cosmetics          0.2678688  0.0014423  185.727   <2e-16 ***
## Item.Type.Fruits            -0.1519964  0.0013778 -110.317   <2e-16 ***
## Item.Type.Household          0.0485853  0.0017670   27.496   <2e-16 ***
## Item.Type.Meat              -0.2374692  0.0015506 -153.150   <2e-16 ***
## `Item.Type.Office Supplies` -0.1138407  0.0018176  -62.631   <2e-16 ***
## `Item.Type.Personal Care`   -0.1111023  0.0013650  -81.394   <2e-16 ***
## Item.Type.Snacks            -0.0341179  0.0013587  -25.111   <2e-16 ***
## Units.Sold                   0.2886331  0.0014699  196.360   <2e-16 ***
## Total.Cost                   0.6580395  0.0022857  287.888   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2748 on 74986 degrees of freedom
## Multiple R-squared:  0.9244, Adjusted R-squared:  0.9244 
## F-statistic: 7.052e+04 on 13 and 74986 DF,  p-value: < 2.2e-16
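
A likely explanation for the two NA coefficients is exact collinearity: the dummy columns for all item types sum to the intercept column, and Unit.Cost takes a single value per item type in the sample rows, so it is a weighted sum of the dummies. The sketch below reproduces the effect on synthetic data with three item types. The column names `is_cereal`, `is_fruits` and `is_cosmetics` and the profit formula are made up for illustration.

```r
set.seed(622)  # arbitrary seed for the toy data

# Each item type has one fixed unit cost, as in the sample rows above.
cost_by_type <- c(Cereal = 117.11, Fruits = 6.92, Cosmetics = 263.33)
item <- sample(names(cost_by_type), 200, replace = TRUE)
toy <- data.frame(
  is_cereal    = as.numeric(item == "Cereal"),
  is_fruits    = as.numeric(item == "Fruits"),
  is_cosmetics = as.numeric(item == "Cosmetics"),
  Unit.Cost    = unname(cost_by_type[item]),
  Units.Sold   = sample(1:10000, 200, replace = TRUE)
)
# Made-up target: a share of total cost plus noise.
toy$Total.Profit <- 0.4 * toy$Unit.Cost * toy$Units.Sold + rnorm(200, sd = 1000)

fit <- lm(Total.Profit ~ ., data = toy)

# The three dummies sum to 1 (the intercept column), and Unit.Cost is a
# weighted sum of the dummies, so two coefficients cannot be estimated.
print(coef(fit))
na_terms <- names(coef(fit))[is.na(coef(fit))]
print(na_terms)
```

Dropping the aliased terms, as done above, leaves the fitted values unchanged, which is consistent with the first two summaries reporting identical residuals and \(R^2\).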

Linear Diagnostic Plot

Simple Decision Tree

  • The fitted tree differs slightly between the small and large data sets

  • The same three variables rank highest by importance in both: Total.Cost, Unit.Cost and Units.Sold, although Unit.Cost and Units.Sold swap places in the second table

##                                Overall
## Total.Cost                  3.63887284
## Unit.Cost                   2.14846488
## Units.Sold                  1.83768943
## Item.Type.Cosmetics         1.04225956
## Item.Type.Meat              0.99629313
## Item.Type.Household         0.98349119
## Item.Type.Clothes           0.39831007
## Item.Type.Beverages         0.25233408
## Item.Type.Fruits            0.14390896
## Item.Type.Snacks            0.12582920
## Item.Type.Vegetables        0.10169810
## Item.Type.Office Supplies   0.06916356
## `Item.Type.Baby Food`       0.00000000
## Item.Type.Cereal            0.00000000
## `Item.Type.Office Supplies` 0.00000000
## `Item.Type.Personal Care`   0.00000000

##                                Overall
## Total.Cost                  3.35390021
## Units.Sold                  2.70577846
## Unit.Cost                   2.60196314
## Item.Type.Cosmetics         1.47844212
## Item.Type.Office Supplies   1.18682510
## Item.Type.Clothes           0.50247599
## Item.Type.Snacks            0.28358551
## Item.Type.Meat              0.26590618
## Item.Type.Household         0.19211021
## Item.Type.Fruits            0.15584585
## Item.Type.Cereal            0.12103425
## Item.Type.Beverages         0.10591413
## Item.Type.Personal Care     0.10183406
## Item.Type.Vegetables        0.04940815
## `Item.Type.Baby Food`       0.00000000
## `Item.Type.Office Supplies` 0.00000000
## `Item.Type.Personal Care`   0.00000000
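
As a hedged sketch of how a single regression tree and its variable importance can be obtained, the example below fits `rpart()` on synthetic data and reads the `variable.importance` component of the fitted object. The tables above have an `Overall` column and were likely produced by a different helper, so the scale of the numbers is not comparable. The data and the `noise` column are made up.

```r
library(rpart)  # recommended package bundled with standard R installations

set.seed(622)  # arbitrary seed for the toy data

# Synthetic data in which Total.Cost drives the target much more strongly
# than Units.Sold, and "noise" is unrelated to it.
n <- 400
toy <- data.frame(
  Total.Cost = runif(n),
  Units.Sold = runif(n),
  noise      = runif(n)
)
toy$Total.Profit <- 2 * toy$Total.Cost + 0.5 * toy$Units.Sold +
  rnorm(n, sd = 0.1)

# Fit a regression tree with rpart's default settings.
tree <- rpart(Total.Profit ~ ., data = toy)

# Named numeric vector of importance scores, largest first.
imp <- tree$variable.importance
print(imp)
```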

Bagging

## 
## Bagging regression trees with 25 bootstrap replications 
## 
## Call: bagging.data.frame(formula = Total.Profit ~ ., data = sm.train, 
##     coob = TRUE)
## 
## Out-of-bag estimate of root mean squared error:  0.2455
## 
## Bagging regression trees with 25 bootstrap replications 
## 
## Call: bagging.data.frame(formula = Total.Profit ~ ., data = lg.train, 
##     coob = TRUE)
## 
## Out-of-bag estimate of root mean squared error:  0.2655
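
To illustrate what the out-of-bag estimate above measures, here is a hand-rolled sketch of bagged regression trees using `rpart()`. It is not the function called in the report; it only mirrors the idea of 25 bootstrap replications, with each row predicted by the trees that did not see it. The data are synthetic.

```r
library(rpart)  # recommended package bundled with standard R installations

set.seed(622)  # arbitrary seed for the toy data

# Synthetic regression data; the report uses the standardized sales data.
n <- 300
toy <- data.frame(x1 = runif(n), x2 = runif(n))
toy$y <- 3 * toy$x1 - 2 * toy$x2 + rnorm(n, sd = 0.3)

n_trees   <- 25          # matches the 25 bootstrap replications reported above
oob_sum   <- numeric(n)  # running sum of out-of-bag predictions per row
oob_count <- integer(n)  # number of trees for which each row was out of bag

for (b in seq_len(n_trees)) {
  boot_idx <- sample(n, n, replace = TRUE)   # bootstrap sample of rows
  tree <- rpart(y ~ x1 + x2, data = toy[boot_idx, ])
  oob <- setdiff(seq_len(n), boot_idx)       # rows not drawn this round
  oob_sum[oob]   <- oob_sum[oob] + predict(tree, newdata = toy[oob, ])
  oob_count[oob] <- oob_count[oob] + 1L
}

# Average the out-of-bag predictions per row and compute their RMSE.
seen <- oob_count > 0
oob_pred <- oob_sum[seen] / oob_count[seen]
oob_rmse <- sqrt(mean((toy$y[seen] - oob_pred)^2))
print(oob_rmse)
```

Because every row is scored only by trees that were trained without it, the out-of-bag RMSE behaves like a built-in validation error and needs no separate hold-out set.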

Performance

Use RMSE to compare the performance of the three models.

##                linear decision_tree bagging
## small data set  0.289         0.363   0.256
## large data set  0.273         0.284   0.266
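
For reference, RMSE is the square root of the mean squared difference between observed and predicted values. A small helper, with a hypothetical name `rmse` and made-up inputs, might look like this:

```r
# Root mean squared error between observed and predicted values.
rmse <- function(actual, predicted) {
  stopifnot(length(actual) == length(predicted))
  sqrt(mean((actual - predicted)^2))
}

# Made-up values for illustration.
actual    <- c(1.0, 2.0, 3.0, 4.0)
predicted <- c(1.5, 2.0, 2.5, 4.0)
print(rmse(actual, predicted))
```

Since the target was standardized before modeling, the RMSE values in the table are in units of standard deviations of Total.Profit rather than in currency.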