This project’s main objective is to make a regression model on copier retail sale dataset. Then we are going to make predictions towards sales profit.

Data Preparation

Read Data

# read data copiers
copiers <- read.csv("copiers.csv")
head(copiers)

EDA

str(copiers)
#> 'data.frame':    62 obs. of  15 variables:
#>  $ Row.ID      : int  336 393 407 516 596 754 1151 1234 1550 1645 ...
#>  $ Order.ID    : chr  "CA-2015-137946" "US-2014-135972" "CA-2017-117457" "CA-2017-127432" ...
#>  $ Order.Date  : chr  "9/1/15" "9/21/14" "12/8/17" "1/22/17" ...
#>  $ Ship.Date   : chr  "9/4/15" "9/23/14" "12/12/17" "1/27/17" ...
#>  $ Ship.Mode   : chr  "Second Class" "Second Class" "Standard Class" "Standard Class" ...
#>  $ Customer.ID : chr  "DB-13615" "JG-15115" "KH-16510" "AD-10180" ...
#>  $ Segment     : chr  "Consumer" "Consumer" "Consumer" "Home Office" ...
#>  $ Product.ID  : chr  "TEC-CO-10001449" "TEC-CO-10002313" "TEC-CO-10004115" "TEC-CO-10003236" ...
#>  $ Category    : chr  "Technology" "Technology" "Technology" "Technology" ...
#>  $ Sub.Category: chr  "Copiers" "Copiers" "Copiers" "Copiers" ...
#>  $ Product.Name: chr  "Hewlett Packard LaserJet 3310 Copier" "Canon PC1080F Personal Copier" "Sharp AL-1530CS Digital Copier" "Canon Image Class D660 Copier" ...
#>  $ Sales       : num  960 1800 1200 3000 1200 ...
#>  $ Quantity    : int  2 3 3 5 3 3 2 2 1 7 ...
#>  $ Discount    : num  0.2 0 0.2 0 0.2 0.2 0 0.4 0.2 0 ...
#>  $ Profit      : num  336 702 435 1380 435 ...

Intuitively, sales is the most correlated column towards the profit.

Let’s check profit’s distribution

boxplot(copiers$Profit)

Insight: - There are outliers on the profit data - Q1 and Q3 located in 100 and 600

Let’s check sales distribution

boxplot(copiers$Sales)

  • There is outlier in the data
  • Q1 and Q3 located in around 700 and 1800
  • Median on the sales is located in around 1000

Let’s check the correlation between ‘Profit’ and ‘Sales’:

# correlation value
cor(x=copiers$Sales , y=copiers$Profit)
#> [1] 0.9395785

correlatin has the range of -1 to 1. -1 means has a strong negative correlation 1 means has a strong positive relation 0 means no relation

We can conclude that ‘Profit’ and ‘Sales’ has a strong positive correlation.

# visualisasi scatter plot
plot(copiers$Sales , copiers$Profit)

Simple Linear Regression

This is a model of regression with only one predictor.

model_simple <- lm(formula = Profit ~ Sales, data = copiers)
summary(model_simple)
#> 
#> Call:
#> lm(formula = Profit ~ Sales, data = copiers)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -407.07  -70.08   22.95   76.56  345.05 
#> 
#> Coefficients:
#>               Estimate Std. Error t value Pr(>|t|)    
#> (Intercept) -114.06251   32.62743  -3.496 0.000895 ***
#> Sales          0.42286    0.01989  21.260  < 2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 150.1 on 60 degrees of freedom
#> Multiple R-squared:  0.8828, Adjusted R-squared:  0.8809 
#> F-statistic:   452 on 1 and 60 DF,  p-value: < 2.2e-16

lm() function will produce the “intercept” of each predictors, so that we can make it as a formula: \[\hat{y} = \beta_0 + \beta_1 x_1 + ... + \beta_n x_n\]

Thus, we can create the formula on “model_simple” above as:

\[Profit = -114.06251 + 0.42286*{Sales}\]

We can visualize the regression line below:

plot(copiers$Sales, copiers$Profit)
abline(model_simple, col = "red")

From the plot above and the graphs beforehand, we can see that there is an outlier around 5000 value sale. Outlier is a data point that has an extreme value. In some cases, outlier can be harmful for the model. We can handle this by removing them.

Let’s try to remove it and then create a new model from it. Then we can compare between before outlier removal and after outlier removal.

# filter row with the sales value under 4000
copiers_no_outlier <- copiers[copiers$Sales<4000 , ]
# check the range of the data to validate that we have removed the outlier.
range(copiers_no_outlier$Sales)
#> [1]  299.990 3359.952

Let’s create e new model from it

model_no_outlier <- lm(formula=Profit~Sales , data=copiers_no_outlier)

summary(model_no_outlier)
#> 
#> Call:
#> lm(formula = Profit ~ Sales, data = copiers_no_outlier)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -386.43  -66.78   18.00   66.44  340.24 
#> 
#> Coefficients:
#>              Estimate Std. Error t value Pr(>|t|)    
#> (Intercept) -83.54810   32.82978  -2.545   0.0136 *  
#> Sales         0.39444    0.02146  18.384   <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 142.4 on 59 degrees of freedom
#> Multiple R-squared:  0.8514, Adjusted R-squared:  0.8489 
#> F-statistic:   338 on 1 and 59 DF,  p-value: < 2.2e-16
# summary model dengan keseluruhan data (masih ada outlier)
summary(model_simple)
#> 
#> Call:
#> lm(formula = Profit ~ Sales, data = copiers)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -407.07  -70.08   22.95   76.56  345.05 
#> 
#> Coefficients:
#>               Estimate Std. Error t value Pr(>|t|)    
#> (Intercept) -114.06251   32.62743  -3.496 0.000895 ***
#> Sales          0.42286    0.01989  21.260  < 2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 150.1 on 60 degrees of freedom
#> Multiple R-squared:  0.8828, Adjusted R-squared:  0.8809 
#> F-statistic:   452 on 1 and 60 DF,  p-value: < 2.2e-16
plot(copiers$Sales, copiers$Profit)
abline(model_simple, col = "red") # model dengan keseluruhan data
abline(model_no_outlier, col = "green") # model tanpa outlier

To make it more objective, let’s compare their goodness of fit (R-squared):

summary(model_simple)$r.squared # with outlier
#> [1] 0.8828077
summary(model_no_outlier)$r.squared # without outlier
#> [1] 0.8513778

Model with the leverage from outlier has a higher r-square, so we should keep the outlier.

Multiple Linear Regression

Linear regression with more than one predictor can improve the performance of the model because it provides more information to explain the target variable.

Ways to perform feature selection (choosing predictors): - Based on business considerations. - Based on statistics: - Correlation analysis. - Step-wise regression.

1. Dataset and Feature Selection

names(copiers)
#>  [1] "Row.ID"       "Order.ID"     "Order.Date"   "Ship.Date"    "Ship.Mode"   
#>  [6] "Customer.ID"  "Segment"      "Product.ID"   "Category"     "Sub.Category"
#> [11] "Product.Name" "Sales"        "Quantity"     "Discount"     "Profit"

Terget: Profit Predictor: Sales, Discount, Quantity, Segment, Ship.Mode, Category, Sub.Category, Product.Name

Feature selection:

  • Target variable
  • Predictor variable (based on business considerations):
    • Remove columns with a large number of unique values (too specific and with many levels).
    • Remove columns with only one unique value (non-informative).

In conclusion, potential predictors are: Sales, Discount, Quantity, Segment, and Ship.Mode.

2. Data Cleansing

We need to select the column that we want to use

# install.packages("dplyr")
library(dplyr)

# select column and change the data type
copiers <- copiers %>% 
  select(Profit, Sales, Discount, Quantity, Segment, Ship.Mode) %>% #to take the wanted columns
  mutate(Ship.Mode = as.factor(Ship.Mode),
         Segment = as.factor(Segment))

head(copiers)
anyNA(copiers)
#> [1] FALSE

3. EDA

Let’s see the correlation between the predictor and target:

library(GGally)
ggcorr(data = copiers, label = T)

Insight:

  • Sales and Quantity have a correlation > 0.5.
  • Discount has a correlation > -0.5.
  • Conclusion: Sales and Quantity have a strong correlation with profit.

4. Modelling & Interpretation

# make a model with all predictor
model_ols_multi <- lm(Profit ~ ., copiers)
summary(model_ols_multi)
#> 
#> Call:
#> lm(formula = Profit ~ ., data = copiers)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -285.64  -56.21    6.49   66.51  246.10 
#> 
#> Coefficients:
#>                           Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)               63.45200   55.06367   1.152    0.254    
#> Sales                      0.42124    0.02522  16.701  < 2e-16 ***
#> Discount                -874.83099  117.02350  -7.476 7.74e-10 ***
#> Quantity                 -13.40975   13.54512  -0.990    0.327    
#> SegmentCorporate           9.30122   31.29427   0.297    0.767    
#> SegmentHome Office       -30.70992   39.86033  -0.770    0.444    
#> Ship.ModeSame Day          9.13795   57.48532   0.159    0.874    
#> Ship.ModeSecond Class     42.97158   43.92567   0.978    0.332    
#> Ship.ModeStandard Class   13.92121   38.33509   0.363    0.718    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 102.7 on 53 degrees of freedom
#> Multiple R-squared:  0.9515, Adjusted R-squared:  0.9442 
#> F-statistic: 130.1 on 8 and 53 DF,  p-value: < 2.2e-16

Conclusion: - This model has the better R-squared, which is 0.9515 - This model has Adjusted R-squared: 0.9442 (adjusted r-squared is more representative for multiple predictors) - Sales and Discount are significant predictors for predicting targets

5. Re-Modelling

model_ols_multi2 <- lm(Profit ~ Sales + Discount, data = copiers)
summary(model_ols_multi2)$adj.r.squared # model with only two significant predistors
#> [1] 0.9461829

6. Prediction

Because we don’t have any new data, we can only use the training data to test it to know how good each models is.

# assign the predictions into new columns
copiers$pred_simple <- predict(object = model_simple, newdata = copiers)
copiers$pred_ols_multi <- predict(object = model_ols_multi, newdata = copiers)
copiers$pred_ols_multi2 <- predict(object = model_ols_multi2, newdata = copiers)
head(copiers)

7. Model Evaluation

R-Squared

First, let’s see their goodness of fit

summary(model_simple)$r.squared # 1 predictor
#> [1] 0.8828077
summary(model_ols_multi)$adj.r.squared # all predictor
#> [1] 0.9442254
summary(model_ols_multi2)$adj.r.squared # 2 predictor
#> [1] 0.9461829

The best model based on the highest R-squared value is model_ols_multi2, which is built using 2 significant predictors (Sales and Discount) to predict the target variable. This model achieved the highest R-squared value compared to the others.

Error

Let’s remember that our main goal of making the model is to minimalize the error on predictions. Error is the difference between the actual value and predicted value.

# install.packages("MLmetrics")
library(MLmetrics)

RMSE(y_pred = copiers$pred_simple, y_true = copiers$Profit)
#> [1] 147.6997
RMSE(y_pred = copiers$pred_ols_multi, y_true = copiers$Profit)
#> [1] 94.97767
RMSE(y_pred = copiers$pred_ols_multi2, y_true = copiers$Profit)
#> [1] 98.43544

The best model based on the lowest RMSE (Root Mean Squared Error) is the prediction output from model_ols_multi, which utilizes all predictors.

Interpretation of RMSE: There will be an average prediction error of 94.98 (in the unit of profit) for each data point.

Additional information: To determine if the RMSE is sufficiently small, it is common to compare it with the range of the target variable.

range(copiers$Profit)
#> [1]   59.998 2302.967

Conclusion: The best model based on R-squared is model_ols_multi2, which uses Sales and Discount as predictors. The best model based on RMSE is model_ols_multi, which includes all available predictors. The chosen model to predict new data values is ols_multi because it has the least RMSE.