My LBB Regression Model project this time uses a dataset about the profits of 50 startups, obtained from https://www.kaggle.com. The dataset has 5 columns: “R&D Spend”, “Administration”, “Marketing Spend”, “State”, and “Profit”. The first 3 columns indicate how much each startup spends on research and development, on administration, and on marketing. The State column indicates which state the startup is based in, and the last column states the profit made by the startup.
Business Question: A startup company spends on all three departments (R&D, marketing and administration) to increase its profit. However, the startup leader wants to see which of the three departments is the most influential in increasing the profit of their startup. For the startup's profit to increase significantly, the cost plan for the coming period needs to be targeted at the right department.
library(dplyr)
Here are the details of the 50 startups dataframe:
R.D.Spend : expenses for R&D
Administration : cost for administration
Marketing.Spend : cost for marketing development
State: indicates in which state the startup is located.
Profit: the profit earned by the startup
The first step is to import the dataset using the
read.csv() function.
startup <- read.csv("data_input/50_Startups.csv")
startup
The next step is to inspect the imported dataset. To look at the first and last rows of the startup dataset, we use the head() and tail() functions.
head(startup)
tail(startup)
To check whether each column has a suitable data type, we first inspect the data with the
glimpse() function.
startup %>%
glimpse()
## Rows: 50
## Columns: 5
## $ R.D.Spend <dbl> 165349.20, 162597.70, 153441.51, 144372.41, 142107.34,…
## $ Administration <dbl> 136897.80, 151377.59, 101145.55, 118671.85, 91391.77, …
## $ Marketing.Spend <dbl> 471784.1, 443898.5, 407934.5, 383199.6, 366168.4, 3628…
## $ State <chr> "New York", "California", "Florida", "New York", "Flor…
## $ Profit <dbl> 192261.8, 191792.1, 191050.4, 182902.0, 166187.9, 1569…
Before the next step, note that the State column is stored as character and
would normally be converted to factor type. lm() can handle factor predictors
through dummy coding, but we drop this column here to keep the analysis on the
three numeric spending columns.
startup_clean <-
startup %>%
mutate(State = as.factor(State)) %>%
select(-c(State))
Then we check the other columns again whether the data type is correct or not.
startup_clean %>%
glimpse()
## Rows: 50
## Columns: 4
## $ R.D.Spend <dbl> 165349.20, 162597.70, 153441.51, 144372.41, 142107.34,…
## $ Administration <dbl> 136897.80, 151377.59, 101145.55, 118671.85, 91391.77, …
## $ Marketing.Spend <dbl> 471784.1, 443898.5, 407934.5, 383199.6, 366168.4, 3628…
## $ Profit <dbl> 192261.8, 191792.1, 191050.4, 182902.0, 166187.9, 1569…
The data type of each column is correct, so the next step is to process this data.
We also need to check for missing values in this dataset.
startup_clean %>%
is.na() %>%
colSums()
## R.D.Spend Administration Marketing.Spend Profit
## 0 0 0 0
This dataset has no missing values, so we can continue to the next steps.
Before we create simple linear regression or multiple linear
regression modeling, we want to check the correlation between variables
using the ggcorr() function and the GGally
library.
The correlation value between variables has a range of -1 to 1, which means:
- the closer to 1, the stronger the positive correlation
- the closer to -1, the stronger the negative correlation
- the closer to 0, the weaker the correlation
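As a minimal sketch of how these values can be read numerically, base R's cor() returns Pearson coefficients, which is also what ggcorr() plots by default. The data below are simulated stand-ins, and the column names rd, admin, and profit are hypothetical, not columns of the startup dataset.

```r
# Simulated stand-in data; column names are hypothetical
set.seed(42)
rd     <- runif(50, 0, 150000)
admin  <- runif(50, 50000, 180000)           # unrelated to profit by construction
profit <- 50000 + 0.8 * rd + rnorm(50, sd = 9000)
toy <- data.frame(rd, admin, profit)

# Pearson correlation matrix, rounded for readability
corr_matrix <- round(cor(toy), 2)
corr_matrix

# Correlation of each predictor with the target, strongest first
sort(corr_matrix[c("rd", "admin"), "profit"], decreasing = TRUE)
```

Reading the last column of the matrix is usually enough to shortlist the predictor for a simple linear regression.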
library(GGally)
## Warning: package 'GGally' was built under R version 4.4.1
## Loading required package: ggplot2
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
# Correlation check between target and predictor
ggcorr(startup_clean, hjust = 1, layout.exp = 3, label = TRUE)
Insight:
To create a Simple Linear Regression model, we will predict
Profit from the single most promising predictor variable,
which is R.D.Spend.
R.D.Spend against Profit:
plot(startup_clean$R.D.Spend, startup_clean$Profit)
The Insight: The scatter plot shows a linear trend between “R.D.Spend” and “Profit”, so there is a positive relationship between spending on Research and Development (R&D) and the profit generated by the startup. In other words, the higher the R&D expenditure, the higher the profit earned by the startup tends to be.
Simple linear regression model involves one independent variable (predictor) to predict the dependent variable (target = Profit). In this case, we will use 1 column as a predictor which is R.D.Spend.
Create this simple linear regression model using the
lm() function, with Profit as target,
R.D.Spend as predictor, startup_clean as
dataframe object.
We can view the result of the modeling with the
summary() function.
model_startup <- lm(Profit ~ R.D.Spend,
startup_clean)
summary(model_startup)
##
## Call:
## lm(formula = Profit ~ R.D.Spend, data = startup_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -34351 -4626 -375 6249 17188
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.903e+04 2.538e+03 19.32 <2e-16 ***
## R.D.Spend 8.543e-01 2.931e-02 29.15 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9416 on 48 degrees of freedom
## Multiple R-squared: 0.9465, Adjusted R-squared: 0.9454
## F-statistic: 849.8 on 1 and 48 DF, p-value: < 2.2e-16
The insights:
1. The regression coefficient for the variable R.D.Spend is about 0.8543. This means that every 1 unit increase in R.D.Spend is associated with an increase of about 0.8543 units in profit.
2. The intercept (β0 value) is about 49030. This is the estimated profit when R.D.Spend is zero.
3. The p-value is very low (< 2.2e-16). This means that R.D.Spend significantly affects profit.
4. The Multiple R-squared (R2) value is about 0.9465 (94.65%). This indicates that about 94.65% of the variation in profit can be explained by the R.D.Spend variable.
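To make the interpretation of the coefficients concrete, here is a small worked sketch that plugs the rounded estimates from the summary output above into the fitted line. The R&D budgets used are hypothetical illustration values, not rows from the dataset.

```r
# Coefficients copied (rounded) from the summary(model_startup) output above
intercept <- 4.903e+04
slope     <- 8.543e-01

# Hypothetical R&D budgets (illustrative values, not rows from the dataset)
rd_spend <- c(0, 50000, 100000)

# Fitted line: Profit = intercept + slope * R.D.Spend
predicted_profit <- intercept + slope * rd_spend
predicted_profit
```

With the fitted model object available, predict(model_startup, data.frame(R.D.Spend = rd_spend)) should give the same values up to the rounding of the coefficients.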
Conclusion:
The R.D.Spend variable has a significant positive influence
on startup Profits.
Profits based on R.D.Spend:
plot(startup_clean$R.D.Spend, startup_clean$Profit)
abline(model_startup, col = "red")
Insight :
Trend Line: The regression line has a positive slope, which indicates an upward trend in Profit as R.D.Spend increases. That is, the greater the R.D.Spend, the higher the expected Profit.
Data Pattern: Most data points lie close to the trend line, which is consistent with the strong positive correlation between R.D.Spend and Profit.
Outliers: There are a few points that lie far from the trend line. These may be outliers that can affect the regression results.
Multiple linear regression models involve more than one independent variable (predictor) to predict the dependent variable (target). In this case, we will use all columns or some columns as predictors.
plot(startup_clean)
The insights:
R.D.Spend and Profit Plot : There appears to be a positive correlation between R.D.Spend (Research and Development spending) and Profit. As R.D. Spend increases, Profit tends to increase as well. The data points cluster along an upward trend, suggesting that higher R.D. spending is associated with higher profits.
Marketing Spend and Profit Plot: Marketing Spend also shows a positive correlation with Profit. As Marketing Spend increases, Profit tends to rise. Again, the spread of the data points forms denser clusters along the upward trend.
Administration and Profit Plot: It’s less obvious how administration costs and profit are related. There is no clear trend among the data points, which are more dispersed.
Spending on R.D.Spend and marketing may have a greater effect on profit than spending on administration.
We select several columns (“R.D.Spend”, “Administration”, and
“Marketing.Spend”) as predictor variables (X), the column “Profit” as
the dependent variable (Y) and startup_clean as dataframe
object. Create the multiple linear regression model using the
lm() function. We can view the result of the modeling with
the summary() function.
model_startup_all <- lm(Profit ~ ., startup_clean)
summary(model_startup_all)
##
## Call:
## lm(formula = Profit ~ ., data = startup_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -33534 -4795 63 6606 17275
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.012e+04 6.572e+03 7.626 1.06e-09 ***
## R.D.Spend 8.057e-01 4.515e-02 17.846 < 2e-16 ***
## Administration -2.682e-02 5.103e-02 -0.526 0.602
## Marketing.Spend 2.723e-02 1.645e-02 1.655 0.105
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9232 on 46 degrees of freedom
## Multiple R-squared: 0.9507, Adjusted R-squared: 0.9475
## F-statistic: 296 on 3 and 46 DF, p-value: < 2.2e-16
The insights:
1. The intercept value is 50,120. This is the expected value of Profit when all independent variables (R.D.Spend, Administration, and Marketing.Spend) are zero.
2. The coefficient value of R.D.Spend is 0.8057. This means that every one unit increase in R.D.Spend increases Profit by about 0.8057, holding the other variables constant.
3. The coefficient value of Administration is -0.02682. Although this coefficient is negative, the p-value is high (0.602), which indicates that Administration is not a statistically significant predictor of Profit.
4. The coefficient value of Marketing.Spend is 0.02723. Its p-value (0.105) is above the significance threshold (0.05), so Marketing.Spend is not statistically significant in this model and its effect on Profit should be interpreted with care.
5. The Adjusted R-squared value is 0.9475, which takes into account the number of independent variables. The higher the value, the better.
6. The p-value of the F-statistic is very low (< 2.2e-16), which indicates that the overall model is significant.
Create the multiple linear regression model using the
lm() function, with Profit as target,
R.D.Spend & Marketing.Spend as predictor,
and startup_clean as dataframe object. We can view the
result of the modeling with the summary() function.
model_startup2 <- lm(Profit ~ R.D.Spend + Marketing.Spend , startup_clean)
summary(model_startup2)
##
## Call:
## lm(formula = Profit ~ R.D.Spend + Marketing.Spend, data = startup_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -33645 -4632 -414 6484 17097
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.698e+04 2.690e+03 17.464 <2e-16 ***
## R.D.Spend 7.966e-01 4.135e-02 19.266 <2e-16 ***
## Marketing.Spend 2.991e-02 1.552e-02 1.927 0.06 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9161 on 47 degrees of freedom
## Multiple R-squared: 0.9505, Adjusted R-squared: 0.9483
## F-statistic: 450.8 on 2 and 47 DF, p-value: < 2.2e-16
The insights:
1. The intercept value is 46,980. This is the expected value of Profit when both independent variables (R.D.Spend and Marketing.Spend) are zero.
2. The coefficient value of R.D.Spend is 0.797. This means that every one unit increase in R.D.Spend increases Profit by about 0.797, holding Marketing.Spend constant.
3. The coefficient value of Marketing.Spend is 0.030. Its p-value (0.06) is slightly above the significance threshold (0.05), so we need to be careful in interpreting the effect of Marketing.Spend on Profit.
4. The Adjusted R-squared value is 0.9483, which takes into account the number of independent variables. The higher the value, the better.
5. The p-value of the F-statistic is very low (< 2.2e-16), which indicates that the overall model is significant.
Conclusion: R.D.Spend has a significant influence on
Profit, while Marketing.Spend may have a
weaker influence. We may consider focusing more on
R.D.Spend to increase Profit.
We will compare the performance of the 3 models that have been created:
1. model_startup: 1 predictor (R.D.Spend)
2. model_startup_all: all predictors (R.D.Spend, Administration, and Marketing.Spend)
3. model_startup2: 2 significant predictors (R.D.Spend and Marketing.Spend)
# save prediction results to a new dataset
startup_pred <- startup_clean
# predictions from one predictor model
startup_pred$pred1 <- predict(model_startup, startup_clean)
# predictions from an all predictor model
startup_pred$pred_all <- predict(model_startup_all, startup_clean)
# predictions from model 2 predictors
startup_pred$pred2 <- predict(model_startup2, startup_clean)
startup_pred
Insight :
Comparing the actual Profit value (e.g. 192261.83) with the prediction results of the 3 models: the prediction of the 1 predictor model differs more from the actual Profit value than the predictions of the all predictor and 2 predictor models.
Stepwise regression is a method that iteratively adds or removes independent variables in a linear regression model and checks whether each change improves the model.
Stepwise regression can select the most relevant variables to include in the model by testing variables one at a time and removing those that do not make a meaningful contribution, which makes variable selection more efficient in time and resources.
In Step-Wise Regression, there are 3 methods used to select relevant variables in the regression model, which are Backward Elimination, Forward Selection and Both.
Backward Elimination is a method in Step-Wise regression that starts with all predictors (independent variables) and gradually removes insignificant predictors one by one.
In the process of eliminating insignificant predictors one by one, the AIC value is also generated. This AIC value represents the amount of information lost in the model (information loss). Therefore, a good regression model is one with a small AIC value.
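As a hedged sketch of where the AIC column in the step() trace comes from: for lm models, step() relies on extractAIC(), which computes n * log(RSS / n) + 2 * edf, where edf is the number of estimated coefficients. The helper name step_aic is hypothetical. The RSS values are the ones printed in the step() traces of this document.

```r
# AIC as reported by step() for lm models: n * log(RSS / n) + 2 * edf,
# where edf counts the estimated coefficients (intercept included)
step_aic <- function(rss, n, edf) {
  n * log(rss / n) + 2 * edf
}

n <- 50  # number of startups

# RSS values copied from the step() traces in this document
aic_all <- step_aic(rss = 3920856301, n = n, edf = 4)  # R.D.Spend + Administration + Marketing.Spend
aic_two <- step_aic(rss = 3944394850, n = n, edf = 3)  # R.D.Spend + Marketing.Spend

round(c(all = aic_all, two = aic_two), 2)
```

Dropping Administration raises the RSS only slightly while saving one coefficient, so the penalised value goes down. Note that this scale differs from AIC() by an additive constant, so the two should not be mixed when comparing models.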
We use the model_startup_all model which includes all
variables as predictors. The stepwise regression process uses the
step() function, by filling in some parameters:
model_startup_all as the object, and
“backward” as the direction.
model_backward <- step(object = model_startup_all,
direction = "backward")
## Start: AIC=916.88
## Profit ~ R.D.Spend + Administration + Marketing.Spend
##
## Df Sum of Sq RSS AIC
## - Administration 1 2.3539e+07 3.9444e+09 915.18
## <none> 3.9209e+09 916.88
## - Marketing.Spend 1 2.3349e+08 4.1543e+09 917.77
## - R.D.Spend 1 2.7147e+10 3.1068e+10 1018.37
##
## Step: AIC=915.18
## Profit ~ R.D.Spend + Marketing.Spend
##
## Df Sum of Sq RSS AIC
## <none> 3.9444e+09 915.18
## - Marketing.Spend 1 3.1165e+08 4.2560e+09 916.98
## - R.D.Spend 1 3.1149e+10 3.5094e+10 1022.46
We can see the result of model_backward by using the
summary() function.
summary(model_backward)
##
## Call:
## lm(formula = Profit ~ R.D.Spend + Marketing.Spend, data = startup_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -33645 -4632 -414 6484 17097
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.698e+04 2.690e+03 17.464 <2e-16 ***
## R.D.Spend 7.966e-01 4.135e-02 19.266 <2e-16 ***
## Marketing.Spend 2.991e-02 1.552e-02 1.927 0.06 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9161 on 47 degrees of freedom
## Multiple R-squared: 0.9505, Adjusted R-squared: 0.9483
## F-statistic: 450.8 on 2 and 47 DF, p-value: < 2.2e-16
Insight :
1. The intercept value (4.698e+04 or about 46,980) is the estimated Profit when R.D.Spend and Marketing.Spend are zero.
2. R.D.Spend has a coefficient estimate of approximately 0.797. This means that every 1 unit increase in R.D.Spend will contribute about 0.797 units to Profit (assuming other variables remain constant).
3. Marketing.Spend has a coefficient estimate of approximately 0.030. However, the p-value (0.06) indicates that this relationship is not statistically significant at the 0.05 level of significance.
4. The Adjusted R-squared value (0.9483) is close to 1, meaning that there is a strong relationship between the predictors and the target.
Conclusion: R.D.Spend has a significant influence on
Profit, while Marketing.Spend has a weaker
influence.
The opposite of Backward Elimination, the Forward Selection method is a method in stepwise regression that starts with an empty model (no predictors) and gradually adds predictors one by one. In each step forward, we add one variable that gives the best improvement to the model.
First create a model with no predictors, model_startup_none, using the
lm() function with Profit as the target,
1 as the predictor term (intercept only), and startup_clean as the
dataframe.
model_startup_none <- lm(Profit ~ 1, startup_clean)
model_startup_none
##
## Call:
## lm(formula = Profit ~ 1, data = startup_clean)
##
## Coefficients:
## (Intercept)
## 112013
Insight: The model contains only one coefficient, the Intercept, with a value
of 112013. This is the estimated Profit when there
are no predictors, and it equals the mean of Profit in the data.
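A quick sketch of why the intercept-only model is a natural baseline: its single coefficient is simply the sample mean of the target. The vector y below is simulated stand-in data, not the startup Profit column.

```r
# Simulated stand-in for a numeric target (not the startup data)
set.seed(1)
y <- rnorm(50, mean = 100000, sd = 40000)

# Intercept-only model: every observation gets the same prediction
fit_none <- lm(y ~ 1)

# The single coefficient coincides with the sample mean of the target
c(intercept = unname(coef(fit_none)), mean = mean(y))
```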
The stepwise regression process uses the step()
function, by filling in some parameters: model_startup_none
as the object, and “forward” as the direction. For the
Forward Selection method, we need to define the scope
parameter to indicate the maximum upper limit of predictor combinations
with model_startup_all.
In the process of adding the predictors one by one, the AIC value is calculated. In the Forward Selection method, a good AIC value is a small AIC value.
model_forward <- step(object = model_startup_none,
direction = "forward",
scope = list(upper= model_startup_all))
## Start: AIC=1061.42
## Profit ~ 1
##
## Df Sum of Sq RSS AIC
## + R.D.Spend 1 7.5349e+10 4.2560e+09 916.98
## + Marketing.Spend 1 4.4511e+10 3.5094e+10 1022.46
## + Administration 1 3.2071e+09 7.6398e+10 1061.36
## <none> 7.9605e+10 1061.42
##
## Step: AIC=916.98
## Profit ~ R.D.Spend
##
## Df Sum of Sq RSS AIC
## + Marketing.Spend 1 311651716 3944394850 915.18
## <none> 4256046566 916.98
## + Administration 1 101704903 4154341663 917.77
##
## Step: AIC=915.18
## Profit ~ R.D.Spend + Marketing.Spend
##
## Df Sum of Sq RSS AIC
## <none> 3944394850 915.18
## + Administration 1 23538549 3920856301 916.88
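The same forward-selection pattern can be sketched on R's built-in mtcars data, here with the scope written as formulas instead of a fitted model. The choice of mpg as target and wt, hp, and qsec as candidate predictors is an assumption made purely for illustration.

```r
# Built-in mtcars data used as a stand-in; mpg plays the role of the target
null_fit <- lm(mpg ~ 1, data = mtcars)

# scope as formulas: lower = smallest model allowed, upper = largest
forward_fit <- step(null_fit,
                    direction = "forward",
                    scope = list(lower = ~ 1, upper = ~ wt + hp + qsec),
                    trace = 0)  # suppress the step-by-step table

formula(forward_fit)

# step() only accepts a predictor when it lowers the AIC
extractAIC(null_fit)[2]
extractAIC(forward_fit)[2]
```

Setting trace = 0 is convenient once the selection path has been inspected and only the final model is needed.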
We can see the result of model_forward by using the
summary() function.
# summary model
summary(model_forward)
##
## Call:
## lm(formula = Profit ~ R.D.Spend + Marketing.Spend, data = startup_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -33645 -4632 -414 6484 17097
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.698e+04 2.690e+03 17.464 <2e-16 ***
## R.D.Spend 7.966e-01 4.135e-02 19.266 <2e-16 ***
## Marketing.Spend 2.991e-02 1.552e-02 1.927 0.06 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9161 on 47 degrees of freedom
## Multiple R-squared: 0.9505, Adjusted R-squared: 0.9483
## F-statistic: 450.8 on 2 and 47 DF, p-value: < 2.2e-16
Insight :
1. The estimated coefficient for the variable R.D.Spend is about 0.797. This means that every 1 unit increase in R.D.Spend will contribute about 0.797 units to Profit (assuming other variables remain constant).
2. The estimated coefficient for the variable Marketing.Spend is about 0.030. However, the p-value (0.06) indicates that this relationship is not statistically significant at the 0.05 level of significance.
3. The value of intercept (4.698e+04 or about 46,980) is the estimated Profit when R.D.Spend and Marketing.Spend are zero.
4. Model Quality: The Adjusted R-squared value (0.9483) is close to 1, meaning that there is a strong relationship between the predictors and the target.
In conclusion, R.D.Spend has a significant influence on
Profit, while Marketing.Spend may have a
weaker influence. Consider focusing more on R.D.Spend to
increase Profit.
The Both method is a stepwise regression method that combines forward selection and backward elimination. At each step, we consider adding or removing a predictor, depending on whether the change improves the model.
In the first step, we start the Both method from the model without predictors. Then at each step, a predictor that improves the model is added, and a predictor that no longer contributes is removed.
The stepwise regression process uses the step()
function, by filling in some parameters: model_startup_none
as the object, and “both” as the direction. For the Both
Selection method, we need to define the scope parameter to
indicate the maximum upper limit of predictor combinations with
model_startup_all.
model_both <- step(object = model_startup_none,
direction = "both",
scope = list(upper= model_startup_all))
## Start: AIC=1061.42
## Profit ~ 1
##
## Df Sum of Sq RSS AIC
## + R.D.Spend 1 7.5349e+10 4.2560e+09 916.98
## + Marketing.Spend 1 4.4511e+10 3.5094e+10 1022.46
## + Administration 1 3.2071e+09 7.6398e+10 1061.36
## <none> 7.9605e+10 1061.42
##
## Step: AIC=916.98
## Profit ~ R.D.Spend
##
## Df Sum of Sq RSS AIC
## + Marketing.Spend 1 3.1165e+08 3.9444e+09 915.18
## <none> 4.2560e+09 916.98
## + Administration 1 1.0170e+08 4.1543e+09 917.77
## - R.D.Spend 1 7.5349e+10 7.9605e+10 1061.42
##
## Step: AIC=915.18
## Profit ~ R.D.Spend + Marketing.Spend
##
## Df Sum of Sq RSS AIC
## <none> 3.9444e+09 915.18
## + Administration 1 2.3539e+07 3.9209e+09 916.88
## - Marketing.Spend 1 3.1165e+08 4.2560e+09 916.98
## - R.D.Spend 1 3.1149e+10 3.5094e+10 1022.46
We can see the result of model_both by using the
summary() function.
# summary model
summary(model_both)
##
## Call:
## lm(formula = Profit ~ R.D.Spend + Marketing.Spend, data = startup_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -33645 -4632 -414 6484 17097
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.698e+04 2.690e+03 17.464 <2e-16 ***
## R.D.Spend 7.966e-01 4.135e-02 19.266 <2e-16 ***
## Marketing.Spend 2.991e-02 1.552e-02 1.927 0.06 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9161 on 47 degrees of freedom
## Multiple R-squared: 0.9505, Adjusted R-squared: 0.9483
## F-statistic: 450.8 on 2 and 47 DF, p-value: < 2.2e-16
Insight :
1. The estimated coefficient for the variable R.D.Spend is about 0.797. This means that every 1 unit increase in R.D.Spend will contribute about 0.797 units to Profit (assuming other variables remain constant).
2. The estimated coefficient for the variable Marketing.Spend is about 0.030. However, the p-value (0.06) indicates that this relationship is not statistically significant at the 0.05 level of significance.
3. The value of intercept (4.698e+04 or about 46,980) is the estimated Profit when R.D.Spend and Marketing.Spend are zero.
4. Model Quality: The Adjusted R-squared value (0.9483) is close to 1, meaning that there is a strong relationship between the predictors and the target.
In conclusion, R.D.Spend has a significant influence on
Profit, while Marketing.Spend may have a
weaker influence. Consider focusing more on R.D.Spend to
increase Profit.
The next step is to compute prediction intervals with the Step-Wise Regression Backward Elimination model.
pred_model_backward <- predict(model_backward,
startup_clean)
head(pred_model_backward)
## 1 2 3 4 5 6
## 192800.5 189774.7 181405.4 173441.3 171127.6 162879.3
The result of the Step-Wise Regression model prediction will be
compared with the Profit value. For that we must create a
prediction range first, to make it easier to compare.
We use the predict() function with the parameters
model_backward as object, startup_clean as
newdata, interval = “prediction” (to get the prediction
interval), and level = 0.95 to set the confidence level of the interval.
# add the lower and upper bounds
pred_model_backward_interval <- predict(object = model_backward,
newdata = startup_clean,
interval = "prediction",
level = 0.95)
head(pred_model_backward_interval)
## fit lwr upr
## 1 192800.5 173283.0 212317.9
## 2 189774.7 170381.4 209167.9
## 3 181405.4 162191.9 200618.8
## 4 173441.3 154359.6 192523.1
## 5 171127.6 152092.2 190163.1
## 6 162879.3 143929.5 181829.1
head(startup_clean$Profit)
## [1] 192261.8 191792.1 191050.4 182902.0 166187.9 156991.1
Insight: We compare the Profit value with the prediction result for the first row:
- Predicted Profit = 192800.5
- Lower limit of the prediction interval = 173283.0
- Upper limit of the prediction interval = 212317.9
- Actual Profit value = 192261.8
The actual Profit value (192261.8) lies between the lower and upper limits (173283.0 - 212317.9), so for this observation the Step-Wise Regression model predicts well.
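Checking a single row is only a spot check. A minimal sketch of extending it to every row is to compute the share of actual values that fall inside their own interval. The data below are simulated and the names toy, x, and y are hypothetical. Because the same rows are used for fitting and checking, the resulting share is likely to be optimistic.

```r
# Simulated stand-in data; x and y are hypothetical names
set.seed(7)
toy <- data.frame(x = runif(50, 0, 100))
toy$y <- 10 + 2 * toy$x + rnorm(50, sd = 8)

fit  <- lm(y ~ x, data = toy)
pi95 <- predict(fit, newdata = toy, interval = "prediction", level = 0.95)

# TRUE where the actual value lies between the lower and upper limits
inside <- toy$y >= pi95[, "lwr"] & toy$y <= pi95[, "upr"]

# Share of observations covered by their own 95% prediction interval
mean(inside)
```

Applied to this document's objects, the same comparison would use startup_clean$Profit and the lwr and upr columns of pred_model_backward_interval.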
Model evaluation is the process of measuring and examining the performance of a statistical model using existing data. The goal is to identify the strengths and weaknesses of the model that has been formed. There are several commonly used evaluation metrics in regression models.
startup_pred %>%
select(Profit, pred1, pred_all, pred2) %>%
head()
We want to see the difference between the predicted and actual values. There are 4 commonly used model evaluation metrics:
1. Mean Absolute Error (MAE)
2. Mean Squared Error (MSE)
3. Root Mean Squared Error (RMSE)
4. Mean Absolute Percentage Error (MAPE)
This time we will calculate the Mean Absolute Percentage Error (MAPE) value, because it is easier to interpret.
MAPE measures the prediction error as a percentage. It is calculated by taking the absolute difference between each actual value and its prediction, dividing it by the actual value, and averaging these ratios over all observations.
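The calculation can be sketched in a few lines of base R. The helper name mape and the example values are hypothetical. The argument order mirrors MLmetrics::MAPE(y_pred, y_true).

```r
# MAPE = mean(|actual - predicted| / |actual|); multiply by 100 for percent
mape <- function(y_pred, y_true) {
  mean(abs((y_true - y_pred) / y_true))
}

# Hypothetical actual and predicted values for illustration
actual    <- c(100, 200, 400)
predicted <- c(110, 180, 400)

# Per-row absolute percentage errors: 10%, 10%, 0%
abs((actual - predicted) / actual) * 100

# Average of the three errors, in percent
mape(predicted, actual) * 100
```

Because each error is divided by the actual value, MAPE is undefined when an actual value is zero and becomes unstable when actual values are close to zero.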
Syntax:
MAPE(y_pred = predicted value, y_true = actual value),
from the MLmetrics library.
library(MLmetrics)
## Warning: package 'MLmetrics' was built under R version 4.4.1
##
## Attaching package: 'MLmetrics'
## The following object is masked from 'package:base':
##
## Recall
MAPE(startup_pred$pred1, startup_pred$Profit)*100
## [1] 11.07014
MAPE(startup_pred$pred_all, startup_pred$Profit)*100
## [1] 10.60121
MAPE(startup_pred$pred2, startup_pred$Profit)*100
## [1] 10.60871
Insight:
MAPE is always positive and the smaller the value, the more accurate the model is in forecasting. The MAPE value of 10.60% means that the average difference between the prediction and the actual value is 10.60%. The following is the interpretation of the MAPE value:
The best model based on the MAPE metric is the all predictor Model with a value of 10.60121%, which means: the average error deviates by 10.60121% from the actual data.
Compare the Adjusted R-squared values for the Step-Wise Regression models:
# Adjusted R-squared value for Step-Wise Regression
summary(model_backward)$adj.r.squared
## [1] 0.9483418
summary(model_forward)$adj.r.squared
## [1] 0.9483418
summary(model_both)$adj.r.squared
## [1] 0.9483418
Insight :
The Adjusted R-squared value from the 3 Step-Wise Regression methods is identical (0.9483418) because all three methods select the same model. A value close to 1 means that the model explains most of the variation in Profit.
# R-squared values for the Linear Regression models
# 1 predictor (multiple R-squared; the adjusted value in its summary above is 0.9454)
summary(model_startup)$r.squared
## [1] 0.9465353
# all predictor
summary(model_startup_all)$adj.r.squared
## [1] 0.9475338
# 2 significant predictor
summary(model_startup2)$adj.r.squared
## [1] 0.9483418
Insight :
Among the three models, the model using 2 significant predictors is said to have the best Adjusted R-squared value of 0.9483418.
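One caveat on the comparison above: the value shown for the 1 predictor model is its multiple R-squared, while the other two are adjusted values. As a sketch, the adjusted value can be derived from R-squared, the number of observations n, and the number of predictors p. The helper name adj_r2 is hypothetical. The inputs are the (rounded) R-squared values from the summary() outputs in this document.

```r
# Adjusted R-squared: 1 - (1 - R2) * (n - 1) / (n - p - 1),
# with n observations and p predictors (intercept not counted)
adj_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

# R-squared values taken from the summary() outputs in this document
adj_one <- adj_r2(r2 = 0.9465353, n = 50, p = 1)  # 1 predictor
adj_all <- adj_r2(r2 = 0.9507,    n = 50, p = 3)  # all predictors
adj_two <- adj_r2(r2 = 0.9505,    n = 50, p = 2)  # 2 predictors

round(c(one = adj_one, all = adj_all, two = adj_two), 4)
```

On the adjusted scale the 2 predictor model still comes out highest, so the conclusion does not change, but the three numbers are then directly comparable.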
The purpose is to check whether the model we created satisfies the assumptions under which the least squares estimates are the Best Linear Unbiased Estimator (BLUE), so that the model can be expected to predict new data consistently. The assumptions of the linear regression model checked here are:
Linearity
Normality of Residuals
Homoscedasticity of Residuals
No Multicollinearity
We will check the assumptions by using the Step-Wise Regression model of the Backward Elimination method.
Linearity means that the target variable and its predictors have a linear relationship, that is, the relationship follows a straight line.
To check this assumption, we plot the residuals against the fitted values. If the points are scattered randomly around the horizontal zero line without a clear curve, a linear relationship is plausible. If not, we can try adding another independent variable to the model.
plot(model_backward, which = 1)
abline(h = 10000, col = "green")
abline(h = -10000, col = "green")
Insight:
It can be seen that the points tend to be randomly scattered around the horizontal line, although there are some points that are a bit far away. This indicates that in general the regression model performs quite well, but there are some outliers or data that may not fit the general pattern.
A linear regression model is expected to produce normally distributed errors, so that most errors gather around zero.
We check the normality of residuals by looking at a histogram of the residuals and by using a statistical normality test (the Shapiro-Wilk test).
Visualization with the hist() function
# histogram of residuals
hist(model_backward$residuals)
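Statistical test with shapiro.test()
Shapiro-Wilk hypothesis test:
- H0: the residuals are normally distributed
- H1: the residuals are not normally distributed
Expected condition: H0
- p_value > alpha -> fail to reject H0
- p_value < alpha -> reject H0 (accept H1)
Syntax: shapiro.test(model_name$residuals)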
# Shapiro test of residuals for the backward model
shapiro.test(model_backward$residuals)
##
## Shapiro-Wilk normality test
##
## data: model_backward$residuals
## W = 0.93717, p-value = 0.01042
Insight:
The p-value is 0.01042. If the p-value is smaller than the chosen significance level (usually 0.05), we reject the null hypothesis. In this case, the p-value < alpha (0.05), so we have evidence that the residuals are not normally distributed. The model therefore does not pass the Normality of Residuals assumption.
Since the model does not fulfill the normality of residuals assumption, we can consider another, more complex model that does not rely on this assumption.
It is expected that the error generated by the model spreads randomly or with constant variation. When visualized, the errors are not patterned. This condition is also referred to as homoscedasticity.
Visualization with a scatter plot of fitted.values vs residuals
# scatter plot
plot(x = model_backward$fitted.values, y = model_backward$residuals)
abline(h = 0, col = "red")
Insight : Shows the error of the model is randomly scattered (does not form a pattern)
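Statistical test with bptest() from package lmtest
Breusch-Pagan hypothesis test:
- H0: the error variance is constant (homoscedasticity)
- H1: the error variance is not constant (heteroscedasticity)
Expected condition: H0
- p_value > alpha -> fail to reject H0
- p_value < alpha -> reject H0 (accept H1)
Syntax: bptest(model)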
library(lmtest)
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
bptest(model_backward)
##
## studentized Breusch-Pagan test
##
## data: model_backward
## BP = 2.8431, df = 2, p-value = 0.2413
Insight:
The test results show that the p-value is 0.2413. Because the p-value > alpha (0.05), we fail to reject H0, which means the error variance is constant (homoscedasticity).
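For readers curious about what the test measures, here is a rough base R sketch of the studentized statistic: regress the squared residuals on the same predictors and multiply the R-squared of that auxiliary regression by n. This should correspond to bptest()'s default studentized version, but it is a sketch under that assumption rather than a replacement for the package function. The data are simulated and all names are hypothetical.

```r
# Simulated stand-in data with constant error variance (hypothetical names)
set.seed(123)
toy <- data.frame(x1 = runif(50, 0, 100), x2 = runif(50, 0, 100))
toy$y <- 5 + 1.5 * toy$x1 + 0.5 * toy$x2 + rnorm(50, sd = 10)

fit <- lm(y ~ x1 + x2, data = toy)

# Auxiliary regression: squared residuals on the same predictors
aux <- lm(resid(fit)^2 ~ x1 + x2, data = toy)

# Studentized (Koenker) statistic: n * R-squared of the auxiliary regression
bp_stat <- nrow(toy) * summary(aux)$r.squared
bp_df   <- length(coef(aux)) - 1             # number of predictors
bp_p    <- pchisq(bp_stat, df = bp_df, lower.tail = FALSE)

c(BP = bp_stat, df = bp_df, p.value = bp_p)
```

A small statistic means the predictors explain almost none of the variation in the squared residuals, which is what constant variance looks like.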
Multicollinearity is a condition of strong correlation between predictors. This is undesirable because it indicates redundant predictors in the model. When two predictors have a very strong relationship, only one of them should be kept. The hope is that multicollinearity does not occur.
Perform VIF (Variance Inflation Factor) Test with vif()
function from package car:
Expected condition: VIF < 10
Syntax: vif(model_name)
# VIF of the backward model
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
vif(model_backward)
## R.D.Spend Marketing.Spend
## 2.103206 2.103206
Insight : The VIF value is < 10 so the model passes the no multicollinearity test.
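As a sketch of what the VIF measures, each value is 1 / (1 - R2), where R2 comes from regressing one predictor on all the others. The helper name manual_vif is hypothetical, and the built-in mtcars columns stand in for the startup predictors.

```r
# VIF_j = 1 / (1 - R2_j), where R2_j comes from regressing
# predictor j on all the other predictors
manual_vif <- function(predictors) {
  sapply(names(predictors), function(v) {
    others <- setdiff(names(predictors), v)
    aux <- lm(reformulate(others, response = v), data = predictors)
    1 / (1 - summary(aux)$r.squared)
  })
}

# Built-in mtcars columns used as stand-in predictors
vifs <- manual_vif(mtcars[, c("wt", "hp")])
vifs

# With exactly two predictors both VIFs equal 1 / (1 - r^2)
1 / (1 - cor(mtcars$wt, mtcars$hp)^2)
```

This also explains why both predictors above share the same value 2.103206: with two predictors the VIF depends only on their mutual correlation, which works out to about 0.72 here.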
Based on the models created, the 2 best predictors of Profit are R.D.Spend and Marketing.Spend. Among the three Linear Regression models, the model that uses these 2 predictors has the best Adjusted R-squared value of 0.9483418.
Based on the MAPE metric, the model using all predictors is the best with a value of 10.60121%, which means that the average error deviates by 10.60121% from the actual data. The Adjusted R-squared value from the 3 Step-Wise Regression methods is the same (0.9483418).
The conclusion that can be drawn is that linear regression models
with two predictors and Step-Wise Regression models have good
performance in predicting Profit.