There is information about 50 Startups’ research and development expenditure, management expenditure, marketing expenditure, the amount of profit they earn and the location where they are established. The aim is to estimate the amount of profit. This is a numerical estimation problem and our dependent variable is “Profit”.
We want to know the relationship among variables, especially between
the startup profit
with other variables.
You can download the data here
Load the required package.
library(dplyr)
library(GGally) # for EDA check correlation
library(performance) # for compare performance modeling
library(MLmetrics) # for metric evaluation
Load the dataset
startup <- read.csv("50_Startups.csv")
rmarkdown::paged_table(startup)
The data has 50 rows and 4 columns. With the following explanation:
R.D.Spend
: indicate how much each startup spends on
Research and DevelopmentAdministration
: how much they spend on MarketingMarketing.Spend
: how much they spend on administration
costState
: which state the startup is based inProfit
: the profit made by the startupNext, we need to check the data structure and if there is a data type that is not correct, need change the appropriate data type.
str(startup)
## 'data.frame': 50 obs. of 5 variables:
## $ R.D.Spend : num 165349 162598 153442 144372 142107 ...
## $ Administration : num 136898 151378 101146 118672 91392 ...
## $ Marketing.Spend: num 471784 443899 407935 383200 366168 ...
## $ State : chr "New York" "California" "Florida" "New York" ...
## $ Profit : num 192262 191792 191050 182902 166188 ...
Here there is a state
column that does not have the
correct data type. So it is necessary to change the data type according
to your needs.
startup$State <- as.factor(startup$State)
str(startup)
## 'data.frame': 50 obs. of 5 variables:
## $ R.D.Spend : num 165349 162598 153442 144372 142107 ...
## $ Administration : num 136898 151378 101146 118672 91392 ...
## $ Marketing.Spend: num 471784 443899 407935 383200 366168 ...
## $ State : Factor w/ 3 levels "California","Florida",..: 3 1 2 3 2 3 1 2 3 1 ...
## $ Profit : num 192262 191792 191050 182902 166188 ...
summary(startup)
## R.D.Spend Administration Marketing.Spend State
## Min. : 0 Min. : 51283 Min. : 0 California:17
## 1st Qu.: 39936 1st Qu.:103731 1st Qu.:129300 Florida :16
## Median : 73051 Median :122700 Median :212716 New York :17
## Mean : 73722 Mean :121345 Mean :211025
## 3rd Qu.:101603 3rd Qu.:144842 3rd Qu.:299469
## Max. :165349 Max. :182646 Max. :471784
## Profit
## Min. : 14681
## 1st Qu.: 90139
## Median :107978
## Mean :112013
## 3rd Qu.:139766
## Max. :192262
startup %>% is.na() %>% colSums()
## R.D.Spend Administration Marketing.Spend State Profit
## 0 0 0 0 0
There is no empty data in our data frame.
Exploratory data analysis is a phase where we explore the data variables, see if there are any pattern that can indicate any kind of correlation between variables.
Find the Pearson correlation between features.
ggcorr(data = startup, label = TRUE)
## Warning in ggcorr(data = startup, label = TRUE): data in column(s) 'State' are
## not numeric and were ignored
. The graphic shows that a lot of variable has Positive correlation with
the ‘Profit’ variables, such us:
Multicollinearity (very strong correlation) is usually caused because the formation of a variable is based on other variables.
Simple linear regression is a model with one predictor
variable. In this case we will use Marketing.Spend
because it has the strongest correlation to ‘Profit’.
Visualization of Marketing.Spend
because it has the
strongest correlation to ‘Profit’
# scatter plot visualization for 'profit' and 'Marketing.Spend'
plot(startup$Marketing.Spend, startup$Profit)
model_startup_rnd <- lm(formula = Profit ~ R.D.Spend, data = startup)
# scatter plot visualization for 'profit' and 'R.D.Spend'
plot(startup$R.D.Spend, startup$Profit)
abline(model_startup_rnd, col= "red")
. Based on the plot, the greater the
sales'R.D.Spend
value,
the higher the profit
obtained by the company.
With ‘Profit’ as the target, we will create a regression model by doing trial and error as follows:
# Model all predictor
model_startup_all <- lm(formula = Profit ~. , data = startup)
summary(model_startup_all)
##
## Call:
## lm(formula = Profit ~ ., data = startup)
##
## Residuals:
## Min 1Q Median 3Q Max
## -33504 -4736 90 6672 17338
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.013e+04 6.885e+03 7.281 4.44e-09 ***
## R.D.Spend 8.060e-01 4.641e-02 17.369 < 2e-16 ***
## Administration -2.700e-02 5.223e-02 -0.517 0.608
## Marketing.Spend 2.698e-02 1.714e-02 1.574 0.123
## StateFlorida 1.988e+02 3.371e+03 0.059 0.953
## StateNew York -4.189e+01 3.256e+03 -0.013 0.990
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9439 on 44 degrees of freedom
## Multiple R-squared: 0.9508, Adjusted R-squared: 0.9452
## F-statistic: 169.9 on 5 and 44 DF, p-value: < 2.2e-16
# Model without 'R.D.Spend'
model_startup_no_rd <- lm(formula = Profit ~ . -R.D.Spend, data = startup)
summary(model_startup_no_rd)
##
## Call:
## lm(formula = Profit ~ . - R.D.Spend, data = startup)
##
## Residuals:
## Min 1Q Median 3Q Max
## -79838 -12006 1497 12027 57394
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.903e+04 1.843e+04 1.033 0.3072
## Administration 3.239e-01 1.335e-01 2.426 0.0193 *
## Marketing.Spend 2.507e-01 3.135e-02 7.997 3.48e-10 ***
## StateFlorida -1.704e+03 9.338e+03 -0.182 0.8561
## StateNew York 3.876e+03 9.003e+03 0.431 0.6689
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 26160 on 45 degrees of freedom
## Multiple R-squared: 0.6131, Adjusted R-squared: 0.5787
## F-statistic: 17.83 on 4 and 45 DF, p-value: 7.778e-09
compare_performance(model_startup_all, model_startup_no_rd)
## # Comparison of Model Performance Indices
##
## Name | Model | AIC (weights) | AICc (weights) | BIC (weights) | R2 | R2 (adj.) | RMSE | Sigma
## --------------------------------------------------------------------------------------------------------------------------
## model_startup_all | lm | 1064.8 (>.999) | 1067.4 (>.999) | 1078.1 (>.999) | 0.951 | 0.945 | 8854.761 | 9439.207
## model_startup_no_rd | lm | 1165.8 (<.001) | 1167.8 (<.001) | 1177.3 (<.001) | 0.613 | 0.579 | 24818.746 | 26161.255
Based on significant predictors
based on the R-Squared value used, because both models use more than 1 predictor, so we look at the “Adjusted R-Squared” value, not “Multiple R-squared”
💡 Insight: Models with many predictors (model_startup_all) produce better R-Squared values. So, we keep model with all predictors.
Finding out whether the machine learning model that has been created is good enough by looking at the prediction results will produce the smallest error value, by:
MAPE (Mean Absolute Percentage Error) shows how big the deviation is in percentage form.
pred_startup_all <- predict(model_startup_all, newdata = startup)
pred_startup_no_rd <- predict(model_startup_no_rd, newdata = startup)
mape_all <- MAPE(y_pred = pred_startup_all, y_true = startup$Profit) *100
mape_no_rd <- MAPE(y_pred = pred_startup_no_rd, y_true = startup$Profit) *100
#Compare value
MAPE = c(mape_all, mape_no_rd)
MAPE
## [1] 10.60236 24.53762
MAPE (Mean Absolute Percentage Error) shows how big the deviation is in percentage form.
💡 Insight: It can be concluded that the smallest MAPE value is model_startup_all (error value 10.60%)
Step-wise regression helps us choose good predictors, by finding the combination of predictors that produces the best model based on the AIC (Akaike Information Criterion) value.
Creating a model without predictors for Backward Model
model_startup_none <- lm(formula = Profit ~1, data = startup)
# stepwise regression: backward elimination
model_backward_all <- step(object = model_startup_all,
direction = "backward")
## Start: AIC=920.87
## Profit ~ R.D.Spend + Administration + Marketing.Spend + State
##
## Df Sum of Sq RSS AIC
## - State 2 5.1666e+05 3.9209e+09 916.88
## - Administration 1 2.3816e+07 3.9442e+09 919.17
## <none> 3.9203e+09 920.87
## - Marketing.Spend 1 2.2071e+08 4.1410e+09 921.61
## - R.D.Spend 1 2.6878e+10 3.0799e+10 1021.94
##
## Step: AIC=916.88
## Profit ~ R.D.Spend + Administration + Marketing.Spend
##
## Df Sum of Sq RSS AIC
## - Administration 1 2.3539e+07 3.9444e+09 915.18
## <none> 3.9209e+09 916.88
## - Marketing.Spend 1 2.3349e+08 4.1543e+09 917.77
## - R.D.Spend 1 2.7147e+10 3.1068e+10 1018.37
##
## Step: AIC=915.18
## Profit ~ R.D.Spend + Marketing.Spend
##
## Df Sum of Sq RSS AIC
## <none> 3.9444e+09 915.18
## - Marketing.Spend 1 3.1165e+08 4.2560e+09 916.98
## - R.D.Spend 1 3.1149e+10 3.5094e+10 1022.46
summary(model_backward_all)$adj.r.squared
## [1] 0.9483418
model_backward_no_rd <- step(object = model_startup_no_rd,
direction = "backward")
## Start: AIC=1021.94
## Profit ~ (R.D.Spend + Administration + Marketing.Spend + State) -
## R.D.Spend
##
## Df Sum of Sq RSS AIC
## - State 2 2.6942e+08 3.1068e+10 1018.4
## <none> 3.0799e+10 1021.9
## - Administration 1 4.0290e+09 3.4827e+10 1026.1
## - Marketing.Spend 1 4.3775e+10 7.4573e+10 1064.2
##
## Step: AIC=1018.37
## Profit ~ Administration + Marketing.Spend
##
## Df Sum of Sq RSS AIC
## <none> 3.1068e+10 1018.4
## - Administration 1 4.0256e+09 3.5094e+10 1022.5
## - Marketing.Spend 1 4.5330e+10 7.6398e+10 1061.4
summary(model_backward_no_rd)$adj.r.squared
## [1] 0.5931154
Using the Forward model, by defining the ‘scope’ parameter as the maximum upper limit of the predictor combination.
# stepwise regression: forward selection
model_forward_all <- step(object = model_startup_none,
direction = "forward",
scope = list(upper = model_startup_all),
trace = FALSE
)
summary(model_forward_all)$adj.r.squared
## [1] 0.9483418
model_forward_no_rd <- step(object = model_startup_none,
direction = "forward",
scope = list(upper = model_startup_no_rd),
trace = FALSE
)
summary(model_forward_no_rd)$adj.r.squared
## [1] 0.5931154
We need to compare all the values from the models that have been created in backward and stepward
# Comparison
c(summary(model_backward_all)$adj.r.squared,
summary(model_backward_no_rd)$adj.r.squared,
summary(model_forward_all)$adj.r.squared,
summary(model_forward_no_rd)$adj.r.squared)
## [1] 0.9483418 0.5931154 0.9483418 0.5931154
💡 Insight: Models with many predictors (model_startup_all) produce better R-Squared values. So, we keep model with all predictors. The use of previously used predictors is correct
Linearity means that the target variable and its predictor have a linear relationship or the relationship is a straight line.
plot(model_backward_all, which = 1)
abline(h = 1000, col = "green")
abline(h = -1000, col = "green")
The residual value is around the value 0.
Conclusion: Because the residual value is around the number 0, the linearity assumption test is FULFILLED
plot(model_backward_no_rd, which = 1)
abline(h = 1000, col = "green")
abline(h = -1000, col = "green")
The residual value is around the value 0.
Conclusion: Because the residual value is around the number 0, the linearity assumption test is FULFILLED
The linear regression model is expected to produce normally distributed errors. That way, the errors are more clustered around the number zero.
hist(model_backward_all$residuals)
. 2. Statistical test with
shapiro.test()
Shapiro-Wilk hypothesis test:
# shapiro test dari residual
shapiro.test(model_backward_all$residuals)
##
## Shapiro-Wilk normality test
##
## data: model_backward_all$residuals
## W = 0.93717, p-value = 0.01042
p-value = 0.01042
Because the p-value (0.01042) < the alpha value (0.05), then the Normality of Residuals assumption test [
NOT FULFILLED
]., because H0 is rejected.
hist(model_backward_no_rd$residuals)
# shapiro test dari residual
shapiro.test(model_backward_no_rd$residuals)
##
## Shapiro-Wilk normality test
##
## data: model_backward_no_rd$residuals
## W = 0.9695, p-value = 0.2209
p-value = 0.2209
Because the p-value (0.01042) < the alpha value (0.05), then the Normality of Residuals assumption test [
FULFILLED
]., because H0 is rejected.