This project’s main objective is to make a regression model on copier retail sale dataset. Then we are going to make predictions towards sales profit.
# read data copiers
copiers <- read.csv("copiers.csv")
head(copiers)str(copiers)#> 'data.frame': 62 obs. of 15 variables:
#> $ Row.ID : int 336 393 407 516 596 754 1151 1234 1550 1645 ...
#> $ Order.ID : chr "CA-2015-137946" "US-2014-135972" "CA-2017-117457" "CA-2017-127432" ...
#> $ Order.Date : chr "9/1/15" "9/21/14" "12/8/17" "1/22/17" ...
#> $ Ship.Date : chr "9/4/15" "9/23/14" "12/12/17" "1/27/17" ...
#> $ Ship.Mode : chr "Second Class" "Second Class" "Standard Class" "Standard Class" ...
#> $ Customer.ID : chr "DB-13615" "JG-15115" "KH-16510" "AD-10180" ...
#> $ Segment : chr "Consumer" "Consumer" "Consumer" "Home Office" ...
#> $ Product.ID : chr "TEC-CO-10001449" "TEC-CO-10002313" "TEC-CO-10004115" "TEC-CO-10003236" ...
#> $ Category : chr "Technology" "Technology" "Technology" "Technology" ...
#> $ Sub.Category: chr "Copiers" "Copiers" "Copiers" "Copiers" ...
#> $ Product.Name: chr "Hewlett Packard LaserJet 3310 Copier" "Canon PC1080F Personal Copier" "Sharp AL-1530CS Digital Copier" "Canon Image Class D660 Copier" ...
#> $ Sales : num 960 1800 1200 3000 1200 ...
#> $ Quantity : int 2 3 3 5 3 3 2 2 1 7 ...
#> $ Discount : num 0.2 0 0.2 0 0.2 0.2 0 0.4 0.2 0 ...
#> $ Profit : num 336 702 435 1380 435 ...
Intuitively, sales is the most correlated column towards the profit.
Let’s check profit’s distribution
boxplot(copiers$Profit)Insight: - There are outliers on the profit data - Q1 and Q3 located in 100 and 600
Let’s check sales distribution
boxplot(copiers$Sales)Let’s check the correlation between ‘Profit’ and ‘Sales’:
# correlation value
cor(x=copiers$Sales , y=copiers$Profit)#> [1] 0.9395785
correlatin has the range of -1 to 1. -1 means has a strong negative correlation 1 means has a strong positive relation 0 means no relation
We can conclude that ‘Profit’ and ‘Sales’ has a strong positive correlation.
# visualisasi scatter plot
plot(copiers$Sales , copiers$Profit)This is a model of regression with only one predictor.
model_simple <- lm(formula = Profit ~ Sales, data = copiers)
summary(model_simple)#>
#> Call:
#> lm(formula = Profit ~ Sales, data = copiers)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -407.07 -70.08 22.95 76.56 345.05
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -114.06251 32.62743 -3.496 0.000895 ***
#> Sales 0.42286 0.01989 21.260 < 2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 150.1 on 60 degrees of freedom
#> Multiple R-squared: 0.8828, Adjusted R-squared: 0.8809
#> F-statistic: 452 on 1 and 60 DF, p-value: < 2.2e-16
lm() function will produce the “intercept” of each predictors, so that we can make it as a formula: \[\hat{y} = \beta_0 + \beta_1 x_1 + ... + \beta_n x_n\]
Thus, we can create the formula on “model_simple” above as:
\[Profit = -114.06251 + 0.42286*{Sales}\]
We can visualize the regression line below:
plot(copiers$Sales, copiers$Profit)
abline(model_simple, col = "red")From the plot above and the graphs beforehand, we can see that there is an outlier around 5000 value sale. Outlier is a data point that has an extreme value. In some cases, outlier can be harmful for the model. We can handle this by removing them.
Let’s try to remove it and then create a new model from it. Then we can compare between before outlier removal and after outlier removal.
# filter row with the sales value under 4000
copiers_no_outlier <- copiers[copiers$Sales<4000 , ]
# check the range of the data to validate that we have removed the outlier.
range(copiers_no_outlier$Sales)#> [1] 299.990 3359.952
Let’s create e new model from it
model_no_outlier <- lm(formula=Profit~Sales , data=copiers_no_outlier)
summary(model_no_outlier)#>
#> Call:
#> lm(formula = Profit ~ Sales, data = copiers_no_outlier)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -386.43 -66.78 18.00 66.44 340.24
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -83.54810 32.82978 -2.545 0.0136 *
#> Sales 0.39444 0.02146 18.384 <2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 142.4 on 59 degrees of freedom
#> Multiple R-squared: 0.8514, Adjusted R-squared: 0.8489
#> F-statistic: 338 on 1 and 59 DF, p-value: < 2.2e-16
# summary model dengan keseluruhan data (masih ada outlier)
summary(model_simple)#>
#> Call:
#> lm(formula = Profit ~ Sales, data = copiers)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -407.07 -70.08 22.95 76.56 345.05
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -114.06251 32.62743 -3.496 0.000895 ***
#> Sales 0.42286 0.01989 21.260 < 2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 150.1 on 60 degrees of freedom
#> Multiple R-squared: 0.8828, Adjusted R-squared: 0.8809
#> F-statistic: 452 on 1 and 60 DF, p-value: < 2.2e-16
plot(copiers$Sales, copiers$Profit)
abline(model_simple, col = "red") # model dengan keseluruhan data
abline(model_no_outlier, col = "green") # model tanpa outlierTo make it more objective, let’s compare their goodness of fit (R-squared):
summary(model_simple)$r.squared # with outlier#> [1] 0.8828077
summary(model_no_outlier)$r.squared # without outlier#> [1] 0.8513778
Model with the leverage from outlier has a higher r-square, so we should keep the outlier.
Linear regression with more than one predictor can improve the performance of the model because it provides more information to explain the target variable.
Ways to perform feature selection (choosing predictors): - Based on business considerations. - Based on statistics: - Correlation analysis. - Step-wise regression.
names(copiers)#> [1] "Row.ID" "Order.ID" "Order.Date" "Ship.Date" "Ship.Mode"
#> [6] "Customer.ID" "Segment" "Product.ID" "Category" "Sub.Category"
#> [11] "Product.Name" "Sales" "Quantity" "Discount" "Profit"
Terget: Profit Predictor: Sales, Discount, Quantity, Segment, Ship.Mode, Category, Sub.Category, Product.Name
Feature selection:
In conclusion, potential predictors are: Sales, Discount, Quantity, Segment, and Ship.Mode.
We need to select the column that we want to use
# install.packages("dplyr")
library(dplyr)
# select column and change the data type
copiers <- copiers %>%
select(Profit, Sales, Discount, Quantity, Segment, Ship.Mode) %>% #to take the wanted columns
mutate(Ship.Mode = as.factor(Ship.Mode),
Segment = as.factor(Segment))
head(copiers)anyNA(copiers)#> [1] FALSE
Let’s see the correlation between the predictor and target:
library(GGally)
ggcorr(data = copiers, label = T)Insight:
# make a model with all predictor
model_ols_multi <- lm(Profit ~ ., copiers)
summary(model_ols_multi)#>
#> Call:
#> lm(formula = Profit ~ ., data = copiers)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -285.64 -56.21 6.49 66.51 246.10
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 63.45200 55.06367 1.152 0.254
#> Sales 0.42124 0.02522 16.701 < 2e-16 ***
#> Discount -874.83099 117.02350 -7.476 7.74e-10 ***
#> Quantity -13.40975 13.54512 -0.990 0.327
#> SegmentCorporate 9.30122 31.29427 0.297 0.767
#> SegmentHome Office -30.70992 39.86033 -0.770 0.444
#> Ship.ModeSame Day 9.13795 57.48532 0.159 0.874
#> Ship.ModeSecond Class 42.97158 43.92567 0.978 0.332
#> Ship.ModeStandard Class 13.92121 38.33509 0.363 0.718
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 102.7 on 53 degrees of freedom
#> Multiple R-squared: 0.9515, Adjusted R-squared: 0.9442
#> F-statistic: 130.1 on 8 and 53 DF, p-value: < 2.2e-16
Conclusion: - This model has the better R-squared, which is 0.9515 - This model has Adjusted R-squared: 0.9442 (adjusted r-squared is more representative for multiple predictors) - Sales and Discount are significant predictors for predicting targets
model_ols_multi2 <- lm(Profit ~ Sales + Discount, data = copiers)
summary(model_ols_multi2)$adj.r.squared # model with only two significant predistors#> [1] 0.9461829
Because we don’t have any new data, we can only use the training data to test it to know how good each models is.
# assign the predictions into new columns
copiers$pred_simple <- predict(object = model_simple, newdata = copiers)
copiers$pred_ols_multi <- predict(object = model_ols_multi, newdata = copiers)
copiers$pred_ols_multi2 <- predict(object = model_ols_multi2, newdata = copiers)
head(copiers)First, let’s see their goodness of fit
summary(model_simple)$r.squared # 1 predictor#> [1] 0.8828077
summary(model_ols_multi)$adj.r.squared # all predictor#> [1] 0.9442254
summary(model_ols_multi2)$adj.r.squared # 2 predictor#> [1] 0.9461829
The best model based on the highest R-squared value is model_ols_multi2, which is built using 2 significant predictors (Sales and Discount) to predict the target variable. This model achieved the highest R-squared value compared to the others.
Let’s remember that our main goal of making the model is to minimalize the error on predictions. Error is the difference between the actual value and predicted value.
# install.packages("MLmetrics")
library(MLmetrics)
RMSE(y_pred = copiers$pred_simple, y_true = copiers$Profit)#> [1] 147.6997
RMSE(y_pred = copiers$pred_ols_multi, y_true = copiers$Profit)#> [1] 94.97767
RMSE(y_pred = copiers$pred_ols_multi2, y_true = copiers$Profit)#> [1] 98.43544
The best model based on the lowest RMSE (Root Mean Squared Error) is the prediction output from model_ols_multi, which utilizes all predictors.
Interpretation of RMSE: There will be an average prediction error of 94.98 (in the unit of profit) for each data point.
Additional information: To determine if the RMSE is sufficiently small, it is common to compare it with the range of the target variable.
range(copiers$Profit)#> [1] 59.998 2302.967
Conclusion: The best model based on R-squared is model_ols_multi2, which uses Sales and Discount as predictors. The best model based on RMSE is model_ols_multi, which includes all available predictors. The chosen model to predict new data values is ols_multi because it has the least RMSE.