Background

The dataset contains information on 50 startups: their research and development (R&D) expenditure, administration expenditure, marketing expenditure, the state where each is established, and the profit they earn. The aim is to estimate the amount of profit, so this is a numerical estimation (regression) problem and our dependent variable is “Profit”.

We want to understand the relationships among the variables, especially between startup profit and the other variables. You can download the data here.

1. Data Preparation

Load the required packages.

library(dplyr)       # for data wrangling
library(GGally)      # for EDA: correlation plot (ggcorr)
library(performance) # for comparing model performance
library(MLmetrics)   # for evaluation metrics (MAPE)
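If any of these packages are not installed yet, they can be installed from CRAN first:

# one-time installation, if needed
install.packages(c("dplyr", "GGally", "performance", "MLmetrics"))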

Load the dataset.

startup <- read.csv("50_Startups.csv")
rmarkdown::paged_table(startup)

The data has 50 rows and 5 columns, with the following explanation:

  • R.D.Spend: research and development expenditure
  • Administration: administration (management) expenditure
  • Marketing.Spend: marketing expenditure
  • State: the state where the startup is established (California, Florida, New York)
  • Profit: the profit earned (our target variable)

2. Data Cleansing

2.1 Check data type

Next, we check the data structure; if any column has an incorrect data type, we change it to the appropriate one.

str(startup)
## 'data.frame':    50 obs. of  5 variables:
##  $ R.D.Spend      : num  165349 162598 153442 144372 142107 ...
##  $ Administration : num  136898 151378 101146 118672 91392 ...
##  $ Marketing.Spend: num  471784 443899 407935 383200 366168 ...
##  $ State          : chr  "New York" "California" "Florida" "New York" ...
##  $ Profit         : num  192262 191792 191050 182902 166188 ...

The State column does not have the correct data type: it is stored as character but is actually categorical, so we convert it to a factor.

startup$State <- as.factor(startup$State)
str(startup)
## 'data.frame':    50 obs. of  5 variables:
##  $ R.D.Spend      : num  165349 162598 153442 144372 142107 ...
##  $ Administration : num  136898 151378 101146 118672 91392 ...
##  $ Marketing.Spend: num  471784 443899 407935 383200 366168 ...
##  $ State          : Factor w/ 3 levels "California","Florida",..: 3 1 2 3 2 3 1 2 3 1 ...
##  $ Profit         : num  192262 191792 191050 182902 166188 ...
summary(startup)
##    R.D.Spend      Administration   Marketing.Spend         State   
##  Min.   :     0   Min.   : 51283   Min.   :     0   California:17  
##  1st Qu.: 39936   1st Qu.:103731   1st Qu.:129300   Florida   :16  
##  Median : 73051   Median :122700   Median :212716   New York  :17  
##  Mean   : 73722   Mean   :121345   Mean   :211025                  
##  3rd Qu.:101603   3rd Qu.:144842   3rd Qu.:299469                  
##  Max.   :165349   Max.   :182646   Max.   :471784                  
##      Profit      
##  Min.   : 14681  
##  1st Qu.: 90139  
##  Median :107978  
##  Mean   :112013  
##  3rd Qu.:139766  
##  Max.   :192262

2.2 Check missing values

startup %>% is.na() %>% colSums()
##       R.D.Spend  Administration Marketing.Spend           State          Profit 
##               0               0               0               0               0

There are no missing values in our data frame.

3. Exploratory Data Analysis (EDA)

Exploratory data analysis is the phase where we explore the data variables and look for any patterns that may indicate correlation between them.

Find the Pearson correlation between features.

ggcorr(data = startup, label = TRUE)
## Warning in ggcorr(data = startup, label = TRUE): data in column(s) 'State' are
## not numeric and were ignored

The plot shows that several variables have a positive correlation with ‘Profit’, in particular R.D.Spend and Marketing.Spend.

Multicollinearity (a very strong correlation between predictors) usually arises because one variable is derived from, or moves together with, other variables. Here, R.D.Spend and Marketing.Spend appear strongly correlated with each other, which we will account for when building the models.
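To see the exact coefficients behind the plot, we can also print the correlation matrix of the numeric columns; a quick sketch:

# correlation matrix of the numeric columns only
num_cols <- sapply(startup, is.numeric)
round(cor(startup[, num_cols]), 2)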

4. Modeling

4.1 Simple Linear Regression

Simple linear regression is a model with one predictor variable. In this case we will use R.D.Spend, because it has the strongest correlation with ‘Profit’.

For comparison, first a scatter plot of ‘Profit’ against Marketing.Spend, which is also strongly correlated with ‘Profit’:

# scatter plot visualization for 'profit' and 'Marketing.Spend'
plot(startup$Marketing.Spend, startup$Profit)

# fit a simple linear model: Profit as a function of R.D.Spend
model_startup_rnd <- lm(formula = Profit ~ R.D.Spend, data = startup)
# scatter plot of 'Profit' against 'R.D.Spend' with the fitted regression line
plot(startup$R.D.Spend, startup$Profit)
abline(model_startup_rnd, col = "red")

Based on the plot, the greater the R.D.Spend value, the higher the profit obtained by the company.
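To read off the fitted intercept and slope (not shown above), we can inspect the model's coefficients:

# intercept b0 and slope b1 of the fitted line Profit = b0 + b1 * R.D.Spend
coef(model_startup_rnd)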

4.2 Multiple Linear Regression

With ‘Profit’ as the target, we will create regression models by trial and error, as follows:

  • Model with all predictors
  • Model without ‘R.D.Spend’, because of multicollinearity
# Model with all predictors
model_startup_all <- lm(formula = Profit ~. , data = startup)
summary(model_startup_all)
## 
## Call:
## lm(formula = Profit ~ ., data = startup)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -33504  -4736     90   6672  17338 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      5.013e+04  6.885e+03   7.281 4.44e-09 ***
## R.D.Spend        8.060e-01  4.641e-02  17.369  < 2e-16 ***
## Administration  -2.700e-02  5.223e-02  -0.517    0.608    
## Marketing.Spend  2.698e-02  1.714e-02   1.574    0.123    
## StateFlorida     1.988e+02  3.371e+03   0.059    0.953    
## StateNew York   -4.189e+01  3.256e+03  -0.013    0.990    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9439 on 44 degrees of freedom
## Multiple R-squared:  0.9508, Adjusted R-squared:  0.9452 
## F-statistic: 169.9 on 5 and 44 DF,  p-value: < 2.2e-16
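Before deciding to drop R.D.Spend, we can quantify the collinearity. The performance package loaded earlier provides check_collinearity(), which reports variance inflation factors (VIF); a minimal sketch on the full model:

# VIF check on the full model (values above ~5-10 flag problematic collinearity)
check_collinearity(model_startup_all)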
# Model without 'R.D.Spend'
model_startup_no_rd <- lm(formula = Profit ~ . -R.D.Spend, data = startup)
summary(model_startup_no_rd)
## 
## Call:
## lm(formula = Profit ~ . - R.D.Spend, data = startup)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -79838 -12006   1497  12027  57394 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      1.903e+04  1.843e+04   1.033   0.3072    
## Administration   3.239e-01  1.335e-01   2.426   0.0193 *  
## Marketing.Spend  2.507e-01  3.135e-02   7.997 3.48e-10 ***
## StateFlorida    -1.704e+03  9.338e+03  -0.182   0.8561    
## StateNew York    3.876e+03  9.003e+03   0.431   0.6689    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 26160 on 45 degrees of freedom
## Multiple R-squared:  0.6131, Adjusted R-squared:  0.5787 
## F-statistic: 17.83 on 4 and 45 DF,  p-value: 7.778e-09
compare_performance(model_startup_all, model_startup_no_rd)
## # Comparison of Model Performance Indices
## 
## Name                | Model |  AIC (weights) | AICc (weights) |  BIC (weights) |    R2 | R2 (adj.) |      RMSE |     Sigma
## --------------------------------------------------------------------------------------------------------------------------
## model_startup_all   |    lm | 1064.8 (>.999) | 1067.4 (>.999) | 1078.1 (>.999) | 0.951 |     0.945 |  8854.761 |  9439.207
## model_startup_no_rd |    lm | 1165.8 (<.001) | 1167.8 (<.001) | 1177.3 (<.001) | 0.613 |     0.579 | 24818.746 | 26161.255

Based on the significant predictors:

  • model_startup_all: the only significant predictor is ‘R.D.Spend’
  • model_startup_no_rd: the significant predictors are ‘Marketing.Spend’ and ‘Administration’

Based on R-squared: because both models use more than one predictor, we look at the “Adjusted R-squared” value, not the “Multiple R-squared”:

  • model_startup_all: 0.9452
  • model_startup_no_rd: 0.5787

💡 Insight: The model with all predictors (model_startup_all) produces a far better adjusted R-squared, so we keep the model with all predictors.

4.3 Model Evaluation

To find out whether the model we have built is good enough, we look at how small its prediction error is. Here we use:

MAPE (Mean Absolute Percentage Error), which shows how large the deviation of the predictions is, in percentage terms.

pred_startup_all <- predict(model_startup_all, newdata = startup)
pred_startup_no_rd <- predict(model_startup_no_rd, newdata = startup)

mape_all <- MAPE(y_pred = pred_startup_all, y_true = startup$Profit) * 100
mape_no_rd <- MAPE(y_pred = pred_startup_no_rd, y_true = startup$Profit) * 100

# compare the MAPE values (avoid naming the vector MAPE, which would mask the function)
mape_values <- c(mape_all, mape_no_rd)
mape_values
## [1] 10.60236 24.53762

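As a sanity check, MAPE can also be computed by hand from its definition, the mean of |actual − predicted| / |actual|; this should reproduce mape_all:

# manual MAPE for the full model, equivalent to MLmetrics::MAPE
mean(abs((startup$Profit - pred_startup_all) / startup$Profit)) * 100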

💡 Insight: model_startup_all has the smallest MAPE (an error of 10.60%), which again favors the model with all predictors.
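Once the model is chosen, it can also be used to estimate the profit of an unseen startup. A minimal sketch with purely illustrative (made-up) input values:

# hypothetical new startup; all values are illustrative only
new_startup <- data.frame(R.D.Spend = 100000,
                          Administration = 120000,
                          Marketing.Spend = 250000,
                          State = "New York")
predict(model_startup_all, newdata = new_startup)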

5. Step-wise Regression for Feature Selection

Step-wise regression helps us choose good predictors by finding the combination of predictors that produces the best model according to the AIC (Akaike Information Criterion); the lower the AIC, the better.
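Since the search is driven by AIC, we can also extract the AIC of our two candidate models directly for reference:

# AIC of the candidate models (lower is better)
AIC(model_startup_all, model_startup_no_rd)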

5.1 Backward Model

First, create an intercept-only model (no predictors); backward elimination starts from the full model, while this empty model will serve as the starting point for forward selection later.

model_startup_none <- lm(formula = Profit ~1, data = startup)
# stepwise regression: backward elimination
model_backward_all <- step(object = model_startup_all,
                           direction = "backward")
## Start:  AIC=920.87
## Profit ~ R.D.Spend + Administration + Marketing.Spend + State
## 
##                   Df  Sum of Sq        RSS     AIC
## - State            2 5.1666e+05 3.9209e+09  916.88
## - Administration   1 2.3816e+07 3.9442e+09  919.17
## <none>                          3.9203e+09  920.87
## - Marketing.Spend  1 2.2071e+08 4.1410e+09  921.61
## - R.D.Spend        1 2.6878e+10 3.0799e+10 1021.94
## 
## Step:  AIC=916.88
## Profit ~ R.D.Spend + Administration + Marketing.Spend
## 
##                   Df  Sum of Sq        RSS     AIC
## - Administration   1 2.3539e+07 3.9444e+09  915.18
## <none>                          3.9209e+09  916.88
## - Marketing.Spend  1 2.3349e+08 4.1543e+09  917.77
## - R.D.Spend        1 2.7147e+10 3.1068e+10 1018.37
## 
## Step:  AIC=915.18
## Profit ~ R.D.Spend + Marketing.Spend
## 
##                   Df  Sum of Sq        RSS     AIC
## <none>                          3.9444e+09  915.18
## - Marketing.Spend  1 3.1165e+08 4.2560e+09  916.98
## - R.D.Spend        1 3.1149e+10 3.5094e+10 1022.46
summary(model_backward_all)$adj.r.squared
## [1] 0.9483418
model_backward_no_rd <- step(object = model_startup_no_rd,
                             direction = "backward")
## Start:  AIC=1021.94
## Profit ~ (R.D.Spend + Administration + Marketing.Spend + State) - 
##     R.D.Spend
## 
##                   Df  Sum of Sq        RSS    AIC
## - State            2 2.6942e+08 3.1068e+10 1018.4
## <none>                          3.0799e+10 1021.9
## - Administration   1 4.0290e+09 3.4827e+10 1026.1
## - Marketing.Spend  1 4.3775e+10 7.4573e+10 1064.2
## 
## Step:  AIC=1018.37
## Profit ~ Administration + Marketing.Spend
## 
##                   Df  Sum of Sq        RSS    AIC
## <none>                          3.1068e+10 1018.4
## - Administration   1 4.0256e+09 3.5094e+10 1022.5
## - Marketing.Spend  1 4.5330e+10 7.6398e+10 1061.4
summary(model_backward_no_rd)$adj.r.squared
## [1] 0.5931154

5.2 Forward Model

For forward selection, we start from the model with no predictors and use the ‘scope’ parameter to define the upper limit of the predictor combination.

# stepwise regression: forward selection
model_forward_all <- step(object = model_startup_none,
                          direction = "forward",
                          scope = list(upper = model_startup_all),
                          trace = FALSE
                          )

summary(model_forward_all)$adj.r.squared
## [1] 0.9483418
model_forward_no_rd <- step(object = model_startup_none,
                          direction = "forward",
                          scope = list(upper = model_startup_no_rd),
                          trace = FALSE
                          )
summary(model_forward_no_rd)$adj.r.squared
## [1] 0.5931154

We now compare the adjusted R-squared values of all the models created with backward and forward selection.

# Comparison
c(summary(model_backward_all)$adj.r.squared,
  summary(model_backward_no_rd)$adj.r.squared,
  summary(model_forward_all)$adj.r.squared,
  summary(model_forward_no_rd)$adj.r.squared)
## [1] 0.9483418 0.5931154 0.9483418 0.5931154
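step() also supports direction = "both", which mixes forward and backward moves at each step; a minimal sketch that, on this data, should converge to the same predictors:

# stepwise in both directions, starting from the intercept-only model
model_both <- step(object = model_startup_none,
                   direction = "both",
                   scope = list(lower = model_startup_none,
                                upper = model_startup_all),
                   trace = FALSE)
summary(model_both)$adj.r.squared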

💡 Insight: The models that are allowed to use ‘R.D.Spend’ (both directions keep R.D.Spend and Marketing.Spend) again achieve the better adjusted R-squared (0.948). This confirms that our earlier choice of predictors was correct.

6. Linear Regression Assumptions

6.1 Linearity

Linearity means that the target variable and its predictors have a linear (straight-line) relationship. We check it with a residuals-versus-fitted plot: if the assumption holds, the residuals scatter randomly around zero.

# residuals vs fitted plot for model_backward_all, with reference lines around zero
plot(model_backward_all, which = 1)
abline(h = 1000, col = "green")
abline(h = -1000, col = "green")

The residuals are scattered around 0.

Conclusion: because the residuals spread randomly around 0, the linearity assumption is FULFILLED for model_backward_all.

# residuals vs fitted plot for the model without R.D.Spend
plot(model_backward_no_rd, which = 1)
abline(h = 1000, col = "green")
abline(h = -1000, col = "green")

The residuals are likewise scattered around 0.

Conclusion: because the residuals spread randomly around 0, the linearity assumption is FULFILLED for model_backward_no_rd as well.

6.2 Normality of Residuals

A linear regression model is expected to produce normally distributed errors, so that the errors cluster around zero. We check this in two ways:

1. Visual check with a histogram of the residuals:

hist(model_backward_all$residuals)

2. Statistical test with shapiro.test()

Shapiro-Wilk hypothesis test:

  • H0: errors are normally distributed
  • H1: errors are NOT normally distributed
# Shapiro-Wilk test on the residuals
shapiro.test(model_backward_all$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  model_backward_all$residuals
## W = 0.93717, p-value = 0.01042

p-value = 0.01042

Because the p-value (0.01042) < alpha (0.05), H0 is rejected: the errors are not normally distributed, so the normality-of-residuals assumption is NOT FULFILLED for model_backward_all.
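A Q-Q plot is a useful visual complement to the Shapiro-Wilk test; points that fall away from the reference line mark the departures from normality:

# Q-Q plot of the residuals against a normal reference line
qqnorm(model_backward_all$residuals)
qqline(model_backward_all$residuals, col = "red")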

hist(model_backward_no_rd$residuals)

# Shapiro-Wilk test on the residuals
shapiro.test(model_backward_no_rd$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  model_backward_no_rd$residuals
## W = 0.9695, p-value = 0.2209

p-value = 0.2209

Because the p-value (0.2209) > alpha (0.05), H0 fails to be rejected: the errors can be considered normally distributed, so the normality-of-residuals assumption is FULFILLED for model_backward_no_rd.