1 Introduction

This project (LBB Regression Model) uses a dataset on the profits of 50 startups, obtained from https://www.kaggle.com. The dataset has 5 columns: “R&D Spend”, “Administration”, “Marketing Spend”, “State”, and “Profit”. The first three columns record how much each startup spends on research and development, on administration, and on marketing. The State column indicates the state in which the startup is based, and the last column gives the profit the startup made.

Business Question: A startup spends money on three departments (R&D, marketing, and administration) to increase its profit. The startup’s leader wants to know which of the three has the most influence on profit, so that the spending plan for the coming period can be targeted at the department most strongly associated with higher profit.

library(dplyr)

2 Data Preparation

Here are the details of the 50 startups dataframe:

R.D.Spend : expenses for R&D
Administration : cost for administration
Marketing.Spend : cost for marketing development
State: indicates in which state the startup is located.
Profit: the profit earned by the startup

2.1 Import & Read Data

The first step is to import the dataset using the read.csv() function.

startup <- read.csv("data_input/50_Startups.csv")
startup

2.2 Inspect Data

The next step is to inspect the imported dataset. To see its first and last rows, we use the head() and tail() functions.

head(startup)

tail(startup)

2.3 Structure Data

To check the data type of each column, we use the glimpse() function.

startup %>% 
  glimpse()

## Rows: 50
## Columns: 5
## $ R.D.Spend       <dbl> 165349.20, 162597.70, 153441.51, 144372.41, 142107.34,…
## $ Administration  <dbl> 136897.80, 151377.59, 101145.55, 118671.85, 91391.77, …
## $ Marketing.Spend <dbl> 471784.1, 443898.5, 407934.5, 383199.6, 366168.4, 3628…
## $ State           <chr> "New York", "California", "Florida", "New York", "Flor…
## $ Profit          <dbl> 192261.8, 191792.1, 191050.4, 182902.0, 166187.9, 1569…

2.4 Data Cleansing

The State column is stored as character data. lm() can use categorical predictors once they are converted to factors, but the business question concerns only the three spending columns, so we drop State from the analysis.

# drop the categorical State column; only the numeric columns are kept
startup_clean <- 
  startup %>% 
    select(-State)

Then we check again whether the remaining columns have the correct data types.

startup_clean %>% 
  glimpse()

## Rows: 50
## Columns: 4
## $ R.D.Spend       <dbl> 165349.20, 162597.70, 153441.51, 144372.41, 142107.34,…
## $ Administration  <dbl> 136897.80, 151377.59, 101145.55, 118671.85, 91391.77, …
## $ Marketing.Spend <dbl> 471784.1, 443898.5, 407934.5, 383199.6, 366168.4, 3628…
## $ Profit          <dbl> 192261.8, 191792.1, 191050.4, 182902.0, 166187.9, 1569…

The data type of each column is correct, so the next step is to process this data.
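
As a side note on the State column: the sketch below shows how a categorical column would be turned into dummy (indicator) columns if it were kept as a predictor. The `toy` data frame and its values are hypothetical and exist only for illustration; they are not taken from the startup dataset.

```r
# Hypothetical toy data; the values are made up for illustration only.
toy <- data.frame(
  State  = c("New York", "California", "Florida", "New York"),
  Profit = c(10, 12, 9, 11)
)

# Convert the character column to a factor so R treats it as categorical.
toy$State <- factor(toy$State)

# model.matrix() builds the design matrix lm() would use:
# one intercept column plus one indicator column per non-reference level.
dummies <- model.matrix(~ State, data = toy)
print(colnames(dummies))
```

The first level in alphabetical order (here California) becomes the reference level, so it gets no indicator column of its own.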

2.5 Check Missing Value

We also need to check for missing values in this dataset.

startup_clean %>% 
  is.na() %>% 
  colSums()

##       R.D.Spend  Administration Marketing.Spend          Profit 
##               0               0               0               0

This dataset has no missing values, so it can continue to the next steps.
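
For reference, the sketch below shows what the same check reports when missing values are present. The `toy` data frame is hypothetical; `anyNA()` and `complete.cases()` are base R helpers that complement `colSums(is.na())`.

```r
# Hypothetical data frame with two missing values planted on purpose.
toy <- data.frame(
  a = c(1, NA, 3),
  b = c("x", "y", NA),
  c = c(TRUE, FALSE, TRUE)
)

# Number of missing values per column (same idea as the pipeline above).
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# Quick yes/no answer for the whole data frame.
has_any_na <- anyNA(toy)

# Keep only the rows without any missing value.
complete_rows <- toy[complete.cases(toy), ]
print(nrow(complete_rows))
```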

3 Exploratory Data Analysis (EDA)

Before we create simple linear regression or multiple linear regression modeling, we want to check the correlation between variables using the ggcorr() function and the GGally library.

The correlation value between two variables ranges from -1 to 1:

the closer to 1, the stronger the positive correlation
the closer to -1, the stronger the negative correlation
the closer to 0, the weaker the linear relationship

library(GGally)

# Correlation check between target and predictor
ggcorr(startup_clean, hjust = 1, layout.exp = 3, label = TRUE)

Insight:

Based on the correlation plot, R.D.Spend and Marketing.Spend are strongly correlated with Profit. R.D.Spend has the highest correlation; the label shows 1 only because ggcorr() rounds labels to one decimal (the actual value is about 0.97, the square root of the R-squared reported in Section 4.2). R.D.Spend is therefore the strongest single predictor.
Administration has a weak correlation with Profit (0.2), so it could reasonably be left out. To compare the results of several models, we will still include Administration as a predictor in one of them.
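
Because plot labels are rounded, it can help to read the correlations numerically. The sketch below uses base R’s cor() on simulated data; the `toy` data frame, its coefficients, and the seed are assumptions made for illustration and do not reproduce the startup dataset.

```r
# Simulated data (assumption): Profit depends strongly on R.D.Spend
# and not at all on Administration.
set.seed(42)
n  <- 50
rd <- runif(n, 0, 150000)
toy <- data.frame(
  R.D.Spend      = rd,
  Administration = runif(n, 50000, 180000),
  Profit         = 50000 + 0.85 * rd + rnorm(n, sd = 9000)
)

# Full correlation matrix, then the row for the target variable.
cor_matrix <- cor(toy)
print(round(cor_matrix["Profit", ], 3))

# Rounding to one decimal turns a correlation of 0.97 into 1.
rounded_label <- round(0.97, 1)
print(rounded_label)
```

On the real data, the same call would be `cor(startup_clean)`.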

4 Simple Linear Regression

To create a Simple Linear Regression model, we will predict Profit based on the 1 most potential predictor variable which is R.D.Spend.

4.1 Visualize `R.D.Spend` against `Profit`

plot(startup_clean$R.D.Spend, startup_clean$Profit)

Insight: The scatter plot shows a linear trend between R.D.Spend and Profit, which suggests a positive relationship between spending on Research and Development (R&D) and the profit generated by the startup. In other words, the higher the R&D expenditure, the higher the profit tends to be.

4.2 Simple Linear Regression Model

Simple linear regression model involves one independent variable (predictor) to predict the dependent variable (target = Profit). In this case, we will use 1 column as a predictor which is R.D.Spend.

Create this simple linear regression model using the lm() function, with Profit as target, R.D.Spend as predictor, startup_clean as dataframe object.

We can view the result of the modeling with the summary() function.

model_startup <- lm(Profit ~ R.D.Spend, 
                    startup_clean)

summary(model_startup)

## 
## Call:
## lm(formula = Profit ~ R.D.Spend, data = startup_clean)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -34351  -4626   -375   6249  17188 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 4.903e+04  2.538e+03   19.32   <2e-16 ***
## R.D.Spend   8.543e-01  2.931e-02   29.15   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9416 on 48 degrees of freedom
## Multiple R-squared:  0.9465, Adjusted R-squared:  0.9454 
## F-statistic: 849.8 on 1 and 48 DF,  p-value: < 2.2e-16

The insights:

The regression coefficient for R.D.Spend is about 0.8543. This means that each 1-unit increase in R.D.Spend is associated with an increase of about 0.8543 units in Profit.
The intercept (β0) is about 49,030. This is the estimated Profit when R.D.Spend is zero.
The p-value is very low (< 2.2e-16). This means that R.D.Spend has a statistically significant relationship with Profit.
The Multiple R-squared (R2) value is about 0.9465 (94.65%). This indicates that about 94.65% of the variation in Profit can be explained by R.D.Spend.

Conclusion:

R.D.Spend variable has a significant positive influence on startup Profits.
This model can be used to predict Profits based on R.D.Spend.
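
To make the interpretation of the coefficients concrete, the sketch below plugs the rounded estimates from the summary above into the regression equation by hand. The helper name `predict_profit` is hypothetical, and because the coefficients are rounded the results differ slightly from what predict(model_startup, ...) would return.

```r
# Rounded coefficients copied from the summary output above.
b0 <- 4.903e4   # intercept
b1 <- 8.543e-1  # slope for R.D.Spend

# Hypothetical helper: Profit = b0 + b1 * R.D.Spend
predict_profit <- function(rd_spend) b0 + b1 * rd_spend

# Estimated Profit at zero spend, at 100,000, and at the first row's spend.
print(predict_profit(c(0, 100000, 165349.20)))

# With one predictor, the correlation is the square root of R-squared.
r_from_r2 <- sqrt(0.9465)
print(round(r_from_r2, 3))
```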

plot(startup_clean$R.D.Spend, startup_clean$Profit)
abline(model_startup, col = "red")

Insight :

Trend Line: The fitted line has a positive slope, which indicates an upward trend in Profit as R.D.Spend increases. That is, the greater the R.D.Spend, the higher the expected Profit.
Data Pattern: Most data points lie close to the trend line, which is consistent with the strong positive correlation between R.D.Spend and Profit.
Outliers: A few points lie far from the trend line. These may be outliers that affect the regression results.
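
Points far from the trend line can also be found numerically. A minimal sketch, assuming a cutoff of 2 on the standardized residuals (a common rule of thumb, not something the analysis above prescribes), on simulated data with one planted outlier:

```r
# Simulated data (assumption) with one outlier planted at position 10.
set.seed(1)
x <- 1:20
y <- 3 + 2 * x + rnorm(20, sd = 1)
y[10] <- y[10] + 50

fit <- lm(y ~ x)

# Standardized residuals put every residual on a comparable scale.
std_res <- rstandard(fit)

# Flag observations whose standardized residual exceeds the cutoff.
flagged <- which(abs(std_res) > 2)
print(flagged)
```

On the real model, the same idea would be `which(abs(rstandard(model_startup)) > 2)`.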

5 Multiple Linear Regression

Multiple linear regression models involve more than one independent variable (predictor) to predict the dependent variable (target). In this case, we will use all columns or some columns as predictors.

5.1 Visualization of All Predictors against Profit

plot(startup_clean)

The insights:

R.D.Spend and Profit Plot : There appears to be a positive correlation between R.D.Spend (Research and Development spending) and Profit. As R.D. Spend increases, Profit tends to increase as well. The data points cluster along an upward trend, suggesting that higher R.D. spending is associated with higher profits.
Marketing Spend and Profit Plot: Marketing Spend also shows a positive correlation with Profit. As Marketing Spend increases, Profit tends to rise. Again, the spread of the data points forms denser clusters along the upward trend.
Administration and Profit Plot: It’s less obvious how administration costs and profit are related. There is no clear trend among the data points, which are more dispersed.
Spending on R.D.Spend and marketing may have a greater effect on profit than spending on administration.

5.2 Multiple Linear Regression Model with All Predictors

We select several columns (“R.D.Spend”, “Administration”, and “Marketing.Spend”) as predictor variables (X), the column “Profit” as the dependent variable (Y) and startup_clean as dataframe object. Create the multiple linear regression model using the lm() function. We can view the result of the modeling with the summary() function.

model_startup_all <- lm(Profit ~ ., startup_clean)

summary(model_startup_all)

## 
## Call:
## lm(formula = Profit ~ ., data = startup_clean)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -33534  -4795     63   6606  17275 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      5.012e+04  6.572e+03   7.626 1.06e-09 ***
## R.D.Spend        8.057e-01  4.515e-02  17.846  < 2e-16 ***
## Administration  -2.682e-02  5.103e-02  -0.526    0.602    
## Marketing.Spend  2.723e-02  1.645e-02   1.655    0.105    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9232 on 46 degrees of freedom
## Multiple R-squared:  0.9507, Adjusted R-squared:  0.9475 
## F-statistic:   296 on 3 and 46 DF,  p-value: < 2.2e-16

The insights:

The intercept value is 50,120. This is the expected Profit when all independent variables (R.D.Spend, Administration, and Marketing.Spend) are zero.
The coefficient of R.D.Spend is 0.8057. This means that each one-unit increase in R.D.Spend is associated with an increase of 0.8057 in Profit, holding the other variables constant.
The coefficient of Administration is -0.02682. Although this coefficient is negative, its p-value is high (0.602), which indicates that Administration is not a statistically significant predictor of Profit.
The coefficient of Marketing.Spend is 0.02723. Its p-value (0.105) is above the 0.05 significance threshold, so Marketing.Spend is not statistically significant in this model and its effect on Profit should be interpreted with care.
The Adjusted R-squared value is 0.9475. Unlike the plain R-squared, it takes the number of independent variables into account. The higher the value, the better.
The p-value of the F-statistic is very low (< 2.2e-16), which indicates that the overall model is significant.
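
The t value and p-value in the coefficient table can be reproduced by hand, which shows where the significance statements come from. The sketch below uses the rounded Marketing.Spend estimate and standard error from the summary above, so the results match the table only approximately.

```r
# Values copied from the Marketing.Spend row of the summary above.
estimate  <- 2.723e-2
std_error <- 1.645e-2
df_resid  <- 46   # residual degrees of freedom reported by summary()

# t value = estimate divided by its standard error.
t_value <- estimate / std_error

# Two-sided p-value from the t distribution.
p_value <- 2 * pt(-abs(t_value), df = df_resid)

print(round(c(t_value = t_value, p_value = p_value), 3))
```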

5.3 Multiple Linear Regression Model with Two Predictors

Create the multiple linear regression model using the lm() function, with Profit as target, R.D.Spend & Marketing.Spend as predictor, and startup_clean as dataframe object. We can view the result of the modeling with the summary() function.

model_startup2 <- lm(Profit ~ R.D.Spend + Marketing.Spend , startup_clean)

summary(model_startup2)

## 
## Call:
## lm(formula = Profit ~ R.D.Spend + Marketing.Spend, data = startup_clean)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -33645  -4632   -414   6484  17097 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     4.698e+04  2.690e+03  17.464   <2e-16 ***
## R.D.Spend       7.966e-01  4.135e-02  19.266   <2e-16 ***
## Marketing.Spend 2.991e-02  1.552e-02   1.927     0.06 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9161 on 47 degrees of freedom
## Multiple R-squared:  0.9505, Adjusted R-squared:  0.9483 
## F-statistic: 450.8 on 2 and 47 DF,  p-value: < 2.2e-16

The insights:

The intercept value is 46,980. This is the expected Profit when both independent variables (R.D.Spend and Marketing.Spend) are zero.
The coefficient of R.D.Spend is 0.797. This means that each one-unit increase in R.D.Spend is associated with an increase of 0.797 in Profit, holding Marketing.Spend constant.
The coefficient of Marketing.Spend is 0.030. Its p-value (0.06) is slightly above the 0.05 significance threshold, so the effect of Marketing.Spend on Profit should be interpreted with care.
The Adjusted R-squared value is 0.9483. Unlike the plain R-squared, it takes the number of independent variables into account. The higher the value, the better.
The p-value of the F-statistic is very low (< 2.2e-16), which indicates that the overall model is significant.

Conclusion: R.D.Spend has a significant influence on Profit, while Marketing.Spend may have a weaker influence. We may consider focusing more on R.D.Spend to increase Profit.

5.4 Prediction

We will compare the predictions of the 3 models that have been created:

1. model_startup: 1 predictor (R.D.Spend)
2. model_startup_all: all predictors (R.D.Spend, Administration, and Marketing.Spend)
3. model_startup2: 2 predictors (R.D.Spend and Marketing.Spend)

# save prediction results to a new dataset
startup_pred <- startup_clean

# predictions from one predictor model
startup_pred$pred1 <- predict(model_startup, startup_clean)

# predictions from an all predictor model
startup_pred$pred_all <- predict(model_startup_all, startup_clean)

# predictions from model 2 predictors
startup_pred$pred2 <- predict(model_startup2, startup_clean)

startup_pred

Insight :

Comparing the Profit value (ex: 192261.83) with the prediction results using the 3 models:

Predictions from one predictor model = 190289.29
Predictions from an all predictor model = 192521.25
Predictions from model 2 predictors = 192800.46

For this first row, the one-predictor model’s prediction is much further from the actual Profit than the predictions of the all-predictor and two-predictor models. A single row is only an illustration; Section 7 compares the models over all rows.
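
The size of each gap can be computed directly from the numbers quoted above. This is only a worked calculation for the first row, not a model comparison.

```r
# Actual Profit and the three predictions for the first row (from above).
actual <- 192261.83
preds  <- c(pred1 = 190289.29, pred_all = 192521.25, pred2 = 192800.46)

# Absolute error of each prediction.
abs_error <- abs(actual - preds)
print(abs_error)

# Name of the prediction closest to the actual value for this row.
closest <- names(which.min(abs_error))
print(closest)
```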

6 Step-Wise Regression

Stepwise regression is a method that iteratively adds or removes independent variables in a linear regression model according to a selection criterion. R’s step() function, used below, uses the AIC as that criterion rather than the p-value of each variable.

Stepwise regression selects the most relevant variables automatically, testing candidates one at a time and dropping those that do not improve the model. This saves time compared with fitting and comparing every combination by hand.

In Step-Wise Regression, there are 3 methods used to select relevant variables in the regression model, which are Backward Elimination, Forward Selection and Both.

6.1 Backward Elimination

Backward Elimination is a stepwise method that starts with all predictors (independent variables) and removes, one at a time, the predictor whose removal improves the model most.

At each elimination step, the AIC value of every candidate model is reported. The AIC represents the amount of information lost by the model (information loss), so a good regression model is one with a small AIC value.

We start from model_startup_all, which includes all variables as predictors. The stepwise process uses the step() function with two arguments: model_startup_all as the object and “backward” as the direction.

model_backward <- step(object = model_startup_all,
                       direction = "backward")

## Start:  AIC=916.88
## Profit ~ R.D.Spend + Administration + Marketing.Spend
## 
##                   Df  Sum of Sq        RSS     AIC
## - Administration   1 2.3539e+07 3.9444e+09  915.18
## <none>                          3.9209e+09  916.88
## - Marketing.Spend  1 2.3349e+08 4.1543e+09  917.77
## - R.D.Spend        1 2.7147e+10 3.1068e+10 1018.37
## 
## Step:  AIC=915.18
## Profit ~ R.D.Spend + Marketing.Spend
## 
##                   Df  Sum of Sq        RSS     AIC
## <none>                          3.9444e+09  915.18
## - Marketing.Spend  1 3.1165e+08 4.2560e+09  916.98
## - R.D.Spend        1 3.1149e+10 3.5094e+10 1022.46
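
The AIC values in the trace above can be reproduced from the RSS column. The sketch below assumes the form step() uses for lm objects, n * log(RSS / n) + 2 * (number of estimated coefficients); the helper name `step_aic` is hypothetical, and the RSS inputs are the rounded values printed above.

```r
# Hypothetical helper reproducing the AIC reported by step() for lm models.
# rss:    residual sum of squares of the candidate model
# n:      number of observations
# n_coef: number of estimated coefficients, including the intercept
step_aic <- function(rss, n, n_coef) {
  n * log(rss / n) + 2 * n_coef
}

# Full model: intercept + 3 predictors.
aic_full <- step_aic(rss = 3.9209e9, n = 50, n_coef = 4)

# Model without Administration: intercept + 2 predictors.
aic_two <- step_aic(rss = 3.9444e9, n = 50, n_coef = 3)

print(round(c(full = aic_full, two_predictors = aic_two), 2))
```

Dropping Administration barely raises the RSS but saves one coefficient, which is why the AIC goes down.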

We can see the result of model_backward by using the summary() function.

summary(model_backward)

## 
## Call:
## lm(formula = Profit ~ R.D.Spend + Marketing.Spend, data = startup_clean)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -33645  -4632   -414   6484  17097 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     4.698e+04  2.690e+03  17.464   <2e-16 ***
## R.D.Spend       7.966e-01  4.135e-02  19.266   <2e-16 ***
## Marketing.Spend 2.991e-02  1.552e-02   1.927     0.06 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9161 on 47 degrees of freedom
## Multiple R-squared:  0.9505, Adjusted R-squared:  0.9483 
## F-statistic: 450.8 on 2 and 47 DF,  p-value: < 2.2e-16

Insight :

The intercept (4.698e+04, or about 46,980) is the estimated Profit when R.D.Spend and Marketing.Spend are zero.
R.D.Spend has a coefficient estimate of approximately 0.797. This means that every 1 unit increase in R.D.Spend will contribute about 0.797 units to Profit (assuming other variables remain constant).
Marketing.Spend has a coefficient estimate of approximately 0.030. However, the p-value (0.06) indicates that this relationship is not statistically significant at the 0.05 level of significance.
The Adjusted R-squared value (0.9483) is close to 1, meaning that the predictors explain most of the variation in the target.

Conclusion: R.D.Spend has a significant influence on Profit, while Marketing.Spend has a weaker influence.

6.2 Forward Selection

The opposite of Backward Elimination, the Forward Selection method is a method in stepwise regression that starts with an empty model (no predictors) and gradually adds predictors one by one. In each step forward, we add one variable that gives the best improvement to the model.

First create a model with no predictors, model_startup_none, using the lm() function with Profit as the target, 1 in place of the predictors (intercept only), and startup_clean as the dataframe.

model_startup_none <- lm(Profit ~ 1, startup_clean)

model_startup_none

## 
## Call:
## lm(formula = Profit ~ 1, data = startup_clean)
## 
## Coefficients:
## (Intercept)  
##      112013

Insight: The model contains only one term, the intercept, with a value of 112013. For an intercept-only model this is simply the mean of Profit, which is the best estimate when no predictors are available.
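
A small check of that statement on made-up numbers (the vector `y` is hypothetical):

```r
# Hypothetical target values.
y <- c(10, 20, 30, 60)

# Intercept-only model: no predictors on the right-hand side.
fit_none <- lm(y ~ 1)

# The fitted intercept equals the sample mean of the target.
intercept <- coef(fit_none)[["(Intercept)"]]
print(c(intercept = intercept, mean = mean(y)))
```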

The stepwise regression process uses the step() function, by filling in some parameters: model_startup_none as the object, and “forward” as the direction. For the Forward Selection method, we need to define the scope parameter to indicate the maximum upper limit of predictor combinations with model_startup_all.

In the process of adding the predictors one by one, the AIC value is calculated. In the Forward Selection method, a good AIC value is a small AIC value.

model_forward <- step(object = model_startup_none,
                      direction = "forward",
                      scope = list(upper= model_startup_all))

## Start:  AIC=1061.42
## Profit ~ 1
## 
##                   Df  Sum of Sq        RSS     AIC
## + R.D.Spend        1 7.5349e+10 4.2560e+09  916.98
## + Marketing.Spend  1 4.4511e+10 3.5094e+10 1022.46
## + Administration   1 3.2071e+09 7.6398e+10 1061.36
## <none>                          7.9605e+10 1061.42
## 
## Step:  AIC=916.98
## Profit ~ R.D.Spend
## 
##                   Df Sum of Sq        RSS    AIC
## + Marketing.Spend  1 311651716 3944394850 915.18
## <none>                         4256046566 916.98
## + Administration   1 101704903 4154341663 917.77
## 
## Step:  AIC=915.18
## Profit ~ R.D.Spend + Marketing.Spend
## 
##                  Df Sum of Sq        RSS    AIC
## <none>                        3944394850 915.18
## + Administration  1  23538549 3920856301 916.88

We can see the result of model_forward by using the summary() function.

# summary model
summary(model_forward)

## 
## Call:
## lm(formula = Profit ~ R.D.Spend + Marketing.Spend, data = startup_clean)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -33645  -4632   -414   6484  17097 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     4.698e+04  2.690e+03  17.464   <2e-16 ***
## R.D.Spend       7.966e-01  4.135e-02  19.266   <2e-16 ***
## Marketing.Spend 2.991e-02  1.552e-02   1.927     0.06 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9161 on 47 degrees of freedom
## Multiple R-squared:  0.9505, Adjusted R-squared:  0.9483 
## F-statistic: 450.8 on 2 and 47 DF,  p-value: < 2.2e-16

Insight :

The estimated coefficient for the variable R.D.Spend is about 0.797. This means that every 1 unit increase in R.D.Spend will contribute about 0.797 units to Profit (assuming other variables remain constant).
The estimated coefficient for the variable Marketing.Spend is about 0.030. However, the p-value (0.06) indicates that this relationship is not statistically significant at the 0.05 level of significance.
The value of intercept (4.698e+04 or about 46,980) is the estimated Profit when R.D.Spend and Marketing.Spend are zero.
Model Quality: The Adjusted R-squared value (0.9483) is close to 1, meaning that there is a strong relationship between the predictors and the target.

In conclusion, R.D.Spend has a significant influence on Profit, while Marketing.Spend may have a weaker influence. Consider focusing more on R.D.Spend to increase Profit.

6.3 Both (Combination of forward and backward selection)

The Both method combines forward selection and backward elimination. At each step, we consider both adding and removing predictors, and step() makes whichever change lowers the AIC the most.

We start from the model without predictors. At each step, a predictor is added if adding it lowers the AIC, and a predictor already in the model is removed if removing it lowers the AIC. The process stops when no single change improves the AIC.

The stepwise regression process uses the step() function, by filling in some parameters: model_startup_none as the object, and “both” as the direction. For the Both Selection method, we need to define the scope parameter to indicate the maximum upper limit of predictor combinations with model_startup_all.

model_both <- step(object = model_startup_none,
                      direction = "both",
                      scope = list(upper= model_startup_all))

## Start:  AIC=1061.42
## Profit ~ 1
## 
##                   Df  Sum of Sq        RSS     AIC
## + R.D.Spend        1 7.5349e+10 4.2560e+09  916.98
## + Marketing.Spend  1 4.4511e+10 3.5094e+10 1022.46
## + Administration   1 3.2071e+09 7.6398e+10 1061.36
## <none>                          7.9605e+10 1061.42
## 
## Step:  AIC=916.98
## Profit ~ R.D.Spend
## 
##                   Df  Sum of Sq        RSS     AIC
## + Marketing.Spend  1 3.1165e+08 3.9444e+09  915.18
## <none>                          4.2560e+09  916.98
## + Administration   1 1.0170e+08 4.1543e+09  917.77
## - R.D.Spend        1 7.5349e+10 7.9605e+10 1061.42
## 
## Step:  AIC=915.18
## Profit ~ R.D.Spend + Marketing.Spend
## 
##                   Df  Sum of Sq        RSS     AIC
## <none>                          3.9444e+09  915.18
## + Administration   1 2.3539e+07 3.9209e+09  916.88
## - Marketing.Spend  1 3.1165e+08 4.2560e+09  916.98
## - R.D.Spend        1 3.1149e+10 3.5094e+10 1022.46

We can see the result of model_both by using the summary() function.

# summary model
summary(model_both)

## 
## Call:
## lm(formula = Profit ~ R.D.Spend + Marketing.Spend, data = startup_clean)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -33645  -4632   -414   6484  17097 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     4.698e+04  2.690e+03  17.464   <2e-16 ***
## R.D.Spend       7.966e-01  4.135e-02  19.266   <2e-16 ***
## Marketing.Spend 2.991e-02  1.552e-02   1.927     0.06 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9161 on 47 degrees of freedom
## Multiple R-squared:  0.9505, Adjusted R-squared:  0.9483 
## F-statistic: 450.8 on 2 and 47 DF,  p-value: < 2.2e-16

Insight :

The estimated coefficient for the variable R.D.Spend is about 0.797. This means that every 1 unit increase in R.D.Spend will contribute about 0.797 units to Profit (assuming other variables remain constant).
The estimated coefficient for the variable Marketing.Spend is about 0.030. However, the p-value (0.06) indicates that this relationship is not statistically significant at the 0.05 level of significance.
The value of intercept (4.698e+04 or about 46,980) is the estimated Profit when R.D.Spend and Marketing.Spend are zero.
Model Quality: The Adjusted R-squared value (0.9483) is close to 1, meaning that there is a strong relationship between the predictors and the target.

In conclusion, R.D.Spend has a significant influence on Profit, while Marketing.Spend may have a weaker influence. Consider focusing more on R.D.Spend to increase Profit.

6.4 Prediction Interval

The next step is to make interval predictions on the Step-Wise Regression Backward Elimination model.

pred_model_backward <- predict(model_backward,
                               startup_clean)
head(pred_model_backward)

##        1        2        3        4        5        6 
## 192800.5 189774.7 181405.4 173441.3 171127.6 162879.3

The result of the Step-Wise Regression model prediction will be compared with the Profit value. For that we must create a prediction range first, to make it easier to compare.

Using the predict() function, filling the parameters model_backward as object, startup_clean as data, interval = “prediction” (to get the prediction interval), level = 0.95 to set the interval width.

# add the lower and upper bounds (prediction interval)
pred_model_backward_interval <- predict(object = model_backward,
                                    newdata = startup_clean,
                                    interval = "prediction",
                                    level = 0.95) 

head(pred_model_backward_interval)

##        fit      lwr      upr
## 1 192800.5 173283.0 212317.9
## 2 189774.7 170381.4 209167.9
## 3 181405.4 162191.9 200618.8
## 4 173441.3 154359.6 192523.1
## 5 171127.6 152092.2 190163.1
## 6 162879.3 143929.5 181829.1

head(startup_clean$Profit)

## [1] 192261.8 191792.1 191050.4 182902.0 166187.9 156991.1

Insight: Comparing the actual Profit with the prediction for the first row:

Predicted Profit (fit) = 192800.5
Lower limit of the prediction interval (lwr) = 173283.0
Upper limit of the prediction interval (upr) = 212317.9
Actual Profit = 192261.8

The actual Profit (192261.8) lies between the lower and upper limits (173283.0 - 212317.9). The same holds for the other five rows shown, so the stepwise model predicts these observations reasonably well.
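
Rather than checking rows one at a time, the share of actual values that fall inside their prediction interval can be computed for the whole dataset. A minimal sketch on simulated data; the helper `interval_coverage` and the `toy` data are hypothetical.

```r
# Hypothetical helper: fraction of actual values inside the prediction interval.
interval_coverage <- function(model, data, target, level = 0.95) {
  bounds <- predict(model, newdata = data,
                    interval = "prediction", level = level)
  actual <- data[[target]]
  mean(actual >= bounds[, "lwr"] & actual <= bounds[, "upr"])
}

# Simulated data (assumption) with a linear trend plus noise.
set.seed(123)
toy <- data.frame(x = runif(50, 0, 100))
toy$y <- 5 + 2 * toy$x + rnorm(50, sd = 10)

toy_fit  <- lm(y ~ x, data = toy)
coverage <- interval_coverage(toy_fit, toy, "y")
print(coverage)
```

For the startup model, the call would be `interval_coverage(model_backward, startup_clean, "Profit")`. A coverage near the chosen level (0.95) is what a well-behaved model should give.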

7 Model Evaluation

Model evaluation is the process of measuring and examining the performance of a statistical model using existing data. The goal is to identify the strengths and weaknesses of the model that has been formed. There are several commonly used evaluation metrics in regression models.

startup_pred %>% 
  select(Profit, pred1, pred_all, pred2) %>% 
  head()

We want to see the difference between the predicted and actual values. There are 4 commonly used evaluation metrics:

1. Mean Absolute Error (MAE)
2. Mean Squared Error (MSE)
3. Root Mean Squared Error (RMSE)
4. Mean Absolute Percentage Error (MAPE)

This time we will calculate the Mean Absolute Percentage Error (MAPE) value, because it is easier to interpret.
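
For completeness, the other three metrics are easy to compute by hand. The sketch below uses made-up actual and predicted values; unlike MAPE, these metrics are expressed in the units of the target (or its square, for MSE).

```r
# Hypothetical actual and predicted values.
y_true <- c(100, 200, 300)
y_pred <- c(110, 190, 330)

errors <- y_true - y_pred

mae  <- mean(abs(errors))   # Mean Absolute Error
mse  <- mean(errors^2)      # Mean Squared Error
rmse <- sqrt(mse)           # Root Mean Squared Error

print(round(c(MAE = mae, MSE = mse, RMSE = rmse), 3))
```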

7.1 Mean Absolute Percentage Error (MAPE)

MAPE measures the prediction error as a percentage. For each observation, take the absolute difference between the actual and predicted values and divide it by the actual value; MAPE is the average of these ratios.

Syntax: MAPE(y_pred = predicted values, y_true = actual values), from the MLmetrics library.
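
The same calculation written out in base R, as a sketch of the definition (the helper `mape_manual` and the example values are hypothetical):

```r
# Hypothetical helper: mean of |actual - predicted| / |actual|.
mape_manual <- function(y_pred, y_true) {
  mean(abs((y_true - y_pred) / y_true))
}

# Two made-up observations, each off by 10% of the actual value.
mape_percent <- mape_manual(y_pred = c(110, 180), y_true = c(100, 200)) * 100
print(mape_percent)
```

Note that MAPE is undefined when an actual value is zero, because of the division by the actual value.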

library(MLmetrics)

MAPE(startup_pred$pred1, startup_pred$Profit)*100

## [1] 11.07014

MAPE(startup_pred$pred_all, startup_pred$Profit)*100

## [1] 10.60121

MAPE(startup_pred$pred2, startup_pred$Profit)*100

## [1] 10.60871

Insight:

MAPE is always positive and the smaller the value, the more accurate the model is in forecasting. The MAPE value of 10.60% means that the average difference between the prediction and the actual value is 10.60%. The following is the interpretation of the MAPE value:

≤ 10: The forecasting results are very accurate.
10 - 20: Good forecasting results.
20 - 50: The forecasting results are decent (good enough).

The best model based on the MAPE metric is the all-predictor model with a value of 10.60121%, which means that its predictions deviate from the actual data by 10.60121% on average. The two-predictor model is practically tied at 10.60871%.

7.2 Comparison

Compare the Adjusted R-squared values for the Step-Wise Regression models:

# Adjusted R-squared value for Step-Wise Regression

summary(model_backward)$adj.r.squared

## [1] 0.9483418

summary(model_forward)$adj.r.squared

## [1] 0.9483418

summary(model_both)$adj.r.squared

## [1] 0.9483418

Insight :

All 3 Step-Wise Regression methods give the same Adjusted R-squared (0.9483418) because they all selected the same model (Profit ~ R.D.Spend + Marketing.Spend). A value close to 1 means that the model explains most of the variation in Profit.

# R-squared value for Linear Regression Model
# 1 predictor
summary(model_startup)$r.squared

## [1] 0.9465353

# all predictor
summary(model_startup_all)$adj.r.squared

## [1] 0.9475338

# 2 significant predictor
summary(model_startup2)$adj.r.squared

## [1] 0.9483418

Insight :

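Among the three models, the two-predictor model has the best Adjusted R-squared value, 0.9483418. Note that the value printed for the one-predictor model is its plain R-squared (0.9465353); its Adjusted R-squared, from the summary in Section 4.2, is 0.9454, so the ranking is unchanged.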
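
The link between R-squared and Adjusted R-squared can be checked with the standard formula 1 - (1 - R2) * (n - 1) / (n - p - 1), where n is the number of rows and p the number of predictors. The helper `adj_r2` is hypothetical, and the inputs are the rounded Multiple R-squared values from the summaries above, so the results agree with the reported values only to about three decimals.

```r
# Hypothetical helper implementing the standard adjustment.
adj_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

# Multiple R-squared and number of predictors for the three models.
r2 <- c(one_predictor = 0.9465, all_predictors = 0.9507, two_predictors = 0.9505)
p  <- c(1, 3, 2)

adjusted <- adj_r2(r2, n = 50, p = p)
print(round(adjusted, 4))
```

The adjustment penalizes the all-predictor model for its extra coefficient, which is why it ranks below the two-predictor model despite a slightly higher plain R-squared.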

8 Linear Regression Assumptions

The purpose is to check whether the model we created satisfies the assumptions under which ordinary least squares is the Best Linear Unbiased Estimator (BLUE), so that its estimates and predictions can be trusted on new data. The assumptions checked here are:

Linearity
Normality of Residuals
Homoscedasticity of Residuals
No Multicollinearity

We will check the assumptions by using the Step-Wise Regression model of the Backward Elimination method.

8.1 Linearity

Linearity means that the target variable and its predictors have a linear (straight-line) relationship.

To check this assumption, we plot the residuals against the fitted values. If the relationship is linear, the residuals scatter randomly around the horizontal zero line without forming a curve or other pattern.

If a clear pattern appears, we can try transforming a variable or adding another independent variable to the model.

plot(model_backward, which = 1)
abline(h = 10000, col = "green")
abline(h = -10000, col = "green")

Insight:

It can be seen that the points tend to be randomly scattered around the horizontal line, although there are some points that are a bit far away. This indicates that in general the regression model performs quite well, but there are some outliers or data that may not fit the general pattern.

8.2 Normality of Residuals

A linear regression model is expected to produce normally distributed errors, so that most errors gather around zero.

We check the normality of residuals in two ways: visually, with a histogram of the residuals, and statistically, with the Shapiro-Wilk normality test.

Visualization of the residual histogram using the hist() function

# histogram residual
hist(model_backward$residuals)

Statistical test with shapiro.test()

Shapiro-Wilk hypothesis test:

H0: errors are normally distributed
H1: errors are NOT normally distributed

Expected condition: H0

p-value > alpha: fail to reject H0
p-value < alpha: reject H0 (accept H1)

Syntax: shapiro.test(model_name$residuals)

# Shapiro test of residuals for the backward model
shapiro.test(model_backward$residuals)

## 
##  Shapiro-Wilk normality test
## 
## data:  model_backward$residuals
## W = 0.93717, p-value = 0.01042

Insight:

The p-value is 0.01042. If the p-value is smaller than the chosen significance level (usually 0.05), we reject the null hypothesis. Here the p-value < alpha (0.05), so there is evidence that the residuals are not normally distributed, and the model does not pass the Normality of Residuals assumption.

Since the normality of residuals assumption is not fulfilled, one option is to use a more complex model that does not rely on this assumption.
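The decision rule for the Shapiro-Wilk test can be written out explicitly. This is a minimal sketch on simulated data; sim_model, alpha and decision are illustrative names, and the simulated residuals are not the residuals of model_backward.

```r
# Simulated regression (hypothetical data, not the startup dataset).
set.seed(42)
x <- runif(50, 0, 100)
y <- 5 + 2 * x + rnorm(50, sd = 10)
sim_model <- lm(y ~ x)

alpha <- 0.05                                  # chosen significance level
sw    <- shapiro.test(residuals(sim_model))    # H0: residuals are normally distributed

# Compare the p-value with alpha and state the conclusion in words
decision <- if (sw$p.value > alpha) {
  "fail to reject H0: no evidence against normality"
} else {
  "reject H0: residuals are not normally distributed"
}

sw$p.value
decision
```

Note that failing to reject H0 does not prove normality; it only means the test found no evidence against it at the chosen alpha.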

8.3 Homoscedasticity of Residuals

The errors generated by the model are expected to spread randomly with constant variance. When visualized, the errors show no pattern. This condition is referred to as homoscedasticity.

Scatter plot visualization: fitted.values vs residuals

# scatter plot
plot(x = model_backward$fitted.values, y = model_backward$residuals)
abline(h = 0, col = "red")

Insight: The errors of the model are randomly scattered around zero and do not form a pattern.

Statistical test with bptest() from package lmtest

Breusch-Pagan hypothesis test:

H0: error spread is constant (homoscedasticity)
H1: error spread is NOT constant (heteroscedasticity)

Expected condition: H0

p-value > alpha: fail to reject H0
p-value < alpha: reject H0 (accept H1)

Syntax: bptest(model)

library(lmtest)

## Loading required package: zoo

## 
## Attaching package: 'zoo'

## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric

bptest(model_backward)

## 
##  studentized Breusch-Pagan test
## 
## data:  model_backward
## BP = 2.8431, df = 2, p-value = 0.2413

Insight:

The test results show that the p-value is 0.2413. Because the p-value > alpha (0.05), we fail to reject H0: there is no evidence that the error variance changes, so the homoscedasticity assumption is met.
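To show what the studentized Breusch-Pagan statistic measures, the following sketch computes it by hand in base R: the squared residuals are regressed on the same predictors, and the statistic is n times the R-squared of that auxiliary regression, compared against a chi-squared distribution with one degree of freedom per predictor. The data are simulated and the names sim_model, aux, bp_stat, bp_df and bp_pval are assumptions made for illustration; the result should correspond to the default studentized output of bptest(), but this is not verified against lmtest here.

```r
# Simulated regression with two predictors (hypothetical data).
set.seed(2024)
n  <- 50
x1 <- runif(n, 0, 100)
x2 <- runif(n, 0, 100)
y  <- 10 + 1.5 * x1 + 0.5 * x2 + rnorm(n, sd = 8)
sim_model <- lm(y ~ x1 + x2)

# Auxiliary regression: squared residuals on the same predictors
sq_res <- residuals(sim_model)^2
aux    <- lm(sq_res ~ x1 + x2)

# Studentized BP statistic: n times the R-squared of the auxiliary regression
bp_stat <- n * summary(aux)$r.squared
bp_df   <- length(coef(aux)) - 1                 # one degree of freedom per predictor
bp_pval <- pchisq(bp_stat, df = bp_df, lower.tail = FALSE)

c(BP = bp_stat, df = bp_df, p.value = bp_pval)
```

A large statistic means the predictors explain much of the variation in the squared residuals, which is the signature of heteroscedasticity.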

8.4 No Multicollinearity

Multicollinearity is a condition of strong correlation between predictors. This is undesirable because it indicates redundant predictors in the model; when two variables are very strongly related, only one of them should be kept. We expect multicollinearity not to occur.

Perform VIF (Variance Inflation Factor) Test with vif() function from package car:

VIF value > 10: multicollinearity occurs in the model
VIF value < 10: no multicollinearity in the model

Expected condition: VIF < 10

Syntax: vif(model_name)

# VIF of the backward model
library(car)

## Loading required package: carData

## 
## Attaching package: 'car'

## The following object is masked from 'package:dplyr':
## 
##     recode

vif(model_backward)

##       R.D.Spend Marketing.Spend 
##        2.103206        2.103206

Insight: Both VIF values are below 10, so the model passes the no multicollinearity check.
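The two VIF values above are identical, which is expected for a model with exactly two predictors. A minimal base R sketch of the definition, VIF_j = 1 / (1 - R²_j), where R²_j comes from regressing predictor j on the remaining predictors, makes this visible. The data are simulated and the names x1, x2, vif_x1 and vif_x2 are illustrative assumptions, not the startup columns.

```r
# Two deliberately correlated predictors (hypothetical data).
set.seed(7)
n  <- 50
x1 <- rnorm(n)
x2 <- 0.7 * x1 + rnorm(n, sd = 0.7)

# VIF of each predictor from the R-squared of regressing it on the other one
vif_x1 <- 1 / (1 - summary(lm(x1 ~ x2))$r.squared)
vif_x2 <- 1 / (1 - summary(lm(x2 ~ x1))$r.squared)

c(x1 = vif_x1, x2 = vif_x2)

# With two predictors both VIFs reduce to 1 / (1 - r^2), r being their correlation
1 / (1 - cor(x1, x2)^2)
```

Read in reverse, a VIF of 2.103206 implies a correlation of roughly 0.72 between R.D.Spend and Marketing.Spend, noticeable but well below the level at which VIF would exceed 10.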

9 Conclusion

Based on the models created, the two strongest predictors of Profit are R.D.Spend and Marketing.Spend. This is supported by the three Linear Regression models: the model that uses these 2 significant predictors has the best Adjusted R-squared value, 0.9483418.

Evaluated with the MAPE metric, the model using all predictors is the best, with a value of 10.60121%, which means that on average the predictions deviate by 10.60121% from the actual data. The Adjusted R-squared value of the three Step-Wise Regression methods is the same (0.9483418).

The conclusion is that the linear regression model with two predictors and the Step-Wise Regression models perform well in predicting Profit, with the caveat that the residuals of the backward model did not pass the Shapiro-Wilk normality test.

Analyzing 50 startups using Linear Regression and Step-Wise Regression

Intan M Sari

2024-07-25

