The dataset we examine here contains data about 50 startups. It has 5 columns: “R&D Spend”, “Administration”, “Marketing Spend”, “State”, and “Profit”. The first three columns indicate how much each startup spends on research and development, administration, and marketing; the State column indicates which state the startup is based in; and the last column states the profit made by the startup.
Source : https://www.kaggle.com/datasets/amineoumous/50-startups-data?select=50_Startups.csv
We will learn to create a model that predicts Profit using linear regression, with three spending variables available as predictors.
First things first, let’s import our data.
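The original setup chunk is not shown, so here is an assumed minimal setup based on the functions used throughout (dplyr for %>% and mutate(), GGally for ggcorr(), ggplot2 for plots, MLmetrics for MAE()/MSE()/RMSE()/MAPE(), lmtest for bptest(), and car for vif()):
library(dplyr)     # %>% and mutate()
library(GGally)    # ggcorr()
library(ggplot2)   # ggplot() visualizations
library(MLmetrics) # MAE(), MSE(), RMSE(), MAPE()
library(lmtest)    # bptest()
library(car)       # vif()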
data <- read.csv("data_input/50_Startups.csv")
head(data)
## R.D.Spend Administration Marketing.Spend State Profit
## 1 165349.2 136897.80 471784.1 New York 192261.8
## 2 162597.7 151377.59 443898.5 California 191792.1
## 3 153441.5 101145.55 407934.5 Florida 191050.4
## 4 144372.4 118671.85 383199.6 New York 182902.0
## 5 142107.3 91391.77 366168.4 Florida 166187.9
## 6 131876.9 99814.71 362861.4 New York 156991.1
Let’s inspect the structure of our data and check for missing values.
str(data)
## 'data.frame': 50 obs. of 5 variables:
## $ R.D.Spend : num 165349 162598 153442 144372 142107 ...
## $ Administration : num 136898 151378 101146 118672 91392 ...
## $ Marketing.Spend: num 471784 443899 407935 383200 366168 ...
## $ State : chr "New York" "California" "Florida" "New York" ...
## $ Profit : num 192262 191792 191050 182902 166188 ...
colSums(is.na(data))
## R.D.Spend Administration Marketing.Spend State Profit
## 0 0 0 0 0
To make processing easier, let’s convert the State column to a factor.
data1 <- data %>%
mutate(State = as.factor(State))
Once we have finished preparing our data, let’s start exploring our data.
ggcorr(data1, label = TRUE, label_size = 2.9, hjust = 1, layout.exp = 2)
## Warning in ggcorr(data1, label = TRUE, label_size = 2.9, hjust = 1, layout.exp
## = 2): data in column(s) 'State' are not numeric and were ignored
From the correlation plot above we can conclude that all numeric variables are positively correlated with Profit, especially Marketing.Spend and R.D.Spend.
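The underlying correlation values can also be checked numerically; a small sketch that keeps only the numeric columns:
# correlation matrix of the numeric columns (State is dropped)
cor(data1[sapply(data1, is.numeric)])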
Based on these correlations, let’s select the variables we need.
data2 <- data1 %>% select(R.D.Spend, Administration, Marketing.Spend, Profit)
The following is the distribution of values for each variable.
boxplot(data2)
Next, we can create a linear regression model with R.D.Spend as the predictor, since this variable has the highest positive correlation with the target variable Profit.
model1 <- lm(Profit ~ R.D.Spend, data = data2)
summary(model1)
##
## Call:
## lm(formula = Profit ~ R.D.Spend, data = data2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -34351 -4626 -375 6249 17188
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 49032.89914 2537.89695 19.32 <0.0000000000000002 ***
## R.D.Spend 0.85429 0.02931 29.15 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9416 on 48 degrees of freedom
## Multiple R-squared: 0.9465, Adjusted R-squared: 0.9454
## F-statistic: 849.8 on 1 and 48 DF, p-value: < 0.00000000000000022
Here’s the visualization:
plot(data2$R.D.Spend, data2$Profit)
abline(model1, col = "red")
Let’s build several models so we can compare how each performs.
model2 <- lm(formula = Profit ~ Administration + Marketing.Spend, data = data2)
summary(model2)
##
## Call:
## lm(formula = Profit ~ Administration + Marketing.Spend, data = data2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -82155 -12168 2836 13650 56472
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 20224.42906 17698.17704 1.143 0.2589
## Administration 0.32367 0.13116 2.468 0.0173 *
## Marketing.Spend 0.24884 0.03005 8.281 0.0000000000973 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 25710 on 47 degrees of freedom
## Multiple R-squared: 0.6097, Adjusted R-squared: 0.5931
## F-statistic: 36.71 on 2 and 47 DF, p-value: 0.0000000002496
model3 <- lm(formula = Profit ~ R.D.Spend + Administration + Marketing.Spend, data = data2)
summary(model3)
##
## Call:
## lm(formula = Profit ~ R.D.Spend + Administration + Marketing.Spend,
## data = data2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -33534 -4795 63 6606 17275
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 50122.19299 6572.35262 7.626 0.00000000106 ***
## R.D.Spend 0.80572 0.04515 17.846 < 0.0000000000000002 ***
## Administration -0.02682 0.05103 -0.526 0.602
## Marketing.Spend 0.02723 0.01645 1.655 0.105
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9232 on 46 degrees of freedom
## Multiple R-squared: 0.9507, Adjusted R-squared: 0.9475
## F-statistic: 296 on 3 and 46 DF, p-value: < 0.00000000000000022
From the output above we can write out the two leading candidate models:
Model 3:
\[Profit = 0.80572\,(R.D.Spend) - 0.02682\,(Administration) + 0.02723\,(Marketing.Spend) + 50122.19299\]
Model 1:
\[Profit = 0.8543\,(R.D.Spend) + 49032.8991\]
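These equations are just the fitted coefficients, which can be read straight off the model objects:
coef(model3) # intercept plus the three slopes
coef(model1) # intercept plus the R.D.Spend slope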
Comparing the 3 models:
- model1 (R.D.Spend): Multiple R-squared 0.9465 | Adjusted R-squared 0.9454
- model2 (Administration + Marketing.Spend): Multiple R-squared 0.6097 | Adjusted R-squared 0.5931
- model3 (all predictors): Multiple R-squared 0.9507 | Adjusted R-squared 0.9475
From this comparison we can conclude that model3 is the best model, since it has the highest Multiple and Adjusted R-squared.
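This comparison can also be pulled programmatically from each summary; a minimal sketch:
# adjusted R-squared for each candidate model
sapply(list(model1 = model1, model2 = model2, model3 = model3),
       function(m) summary(m)$adj.r.squared)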
Now let’s generate predictions from each of the three models.
data2$pred1 <- predict(model1, data2) # predictions from the single-predictor model
data2$pred2 <- predict(model2, data2) # predictions from the two-predictor model
data2$pred3 <- predict(model3, data2) # predictions from the model with all predictors
head(data2)
## R.D.Spend Administration Marketing.Spend Profit pred1 pred2 pred3
## 1 165349.2 136897.80 471784.1 192261.8 190289.3 181935.0 192521.3
## 2 162597.7 151377.59 443898.5 191792.1 187938.7 179682.6 189156.8
## 3 153441.5 101145.55 407934.5 191050.4 180116.7 154474.4 182147.3
## 4 144372.4 118671.85 383199.6 182902.0 172369.0 153992.1 173696.7
## 5 142107.3 91391.77 366168.4 166187.9 170434.0 140924.1 172139.5
## 6 131876.9 99814.71 362861.4 156991.1 161694.2 142827.5 163580.8
Let’s evaluate the model with several error metrics.
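For reference, writing y_i for the actual Profit, ŷ_i for the prediction, and n for the number of observations, the metrics used below (here assumed to come from the MLmetrics package) are defined as:
\[MAE = \frac{1}{n}\sum_{i=1}^{n}\left|\hat{y}_i - y_i\right| \qquad MSE = \frac{1}{n}\sum_{i=1}^{n}(\hat{y}_i - y_i)^2\]
\[RMSE = \sqrt{MSE} \qquad MAPE = \frac{1}{n}\sum_{i=1}^{n}\left|\frac{\hat{y}_i - y_i}{y_i}\right|\]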
MAE(data2$pred3, data2$Profit)
## [1] 6471.45
range(data2$Profit)
## [1] 14681.4 192261.8
MSE(data2$pred3, data2$Profit)
## [1] 78417126
RMSE(data2$pred3, data2$Profit)
## [1] 8855.344
MAPE(data2$pred3, data2$Profit) *100
## [1] 10.60121
From the calculations above, model3’s MAE of about 6,471 is small relative to the range of Profit (14,681 to 192,262), and its MAPE is about 10.6%, so we can conclude that model3 is the best model and pred3 gives the most accurate predictions.
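As a quick sanity check on that claim, the same metrics can be computed for all three prediction columns; a small sketch:
# error metrics for each model's predictions (one column per model)
sapply(list(pred1 = data2$pred1, pred2 = data2$pred2, pred3 = data2$pred3),
       function(p) c(MAE = MAE(p, data2$Profit),
                     RMSE = RMSE(p, data2$Profit),
                     MAPE = MAPE(p, data2$Profit)))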
After identifying the best model, let’s explore the data in more depth.
First, we create a model without predictors (an intercept-only model, which simply predicts the mean of Profit).
modelnone <- lm(Profit ~ 1, data2)
Then we apply stepwise selection in both directions, a combination of forward and backward selection:
model_both <- step(object = modelnone,
                   direction = "both",
                   scope = list(upper = model3),
                   trace = FALSE)
summary(model_both)
##
## Call:
## lm(formula = Profit ~ R.D.Spend + Marketing.Spend, data = data2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -33645 -4632 -414 6484 17097
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 46975.86422 2689.93292 17.464 <0.0000000000000002 ***
## R.D.Spend 0.79658 0.04135 19.266 <0.0000000000000002 ***
## Marketing.Spend 0.02991 0.01552 1.927 0.06 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9161 on 47 degrees of freedom
## Multiple R-squared: 0.9505, Adjusted R-squared: 0.9483
## F-statistic: 450.8 on 2 and 47 DF, p-value: < 0.00000000000000022
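Note that step() selects models by AIC (lower is better) rather than by R-squared. The path it took is recorded on the returned object:
model_both$anova # the sequence of steps taken, with their AIC values
AIC(model_both)  # AIC of the final selected model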
Next, let’s generate predictions from the stepwise model.
model_step <- predict(model_both, newdata = data2)
head(model_step)
## 1 2 3 4 5 6
## 192800.5 189774.7 181405.4 173441.3 171127.6 162879.3
We add an upper limit and a lower limit using a 95% prediction interval.
model_step_interval <- predict(object = model_both,
newdata = data2,
interval = "prediction",
level = 0.95)
head(model_step_interval)
## fit lwr upr
## 1 192800.5 173283.0 212317.9
## 2 189774.7 170381.4 209167.9
## 3 181405.4 162191.9 200618.8
## 4 173441.3 154359.6 192523.1
## 5 171127.6 152092.2 190163.1
## 6 162879.3 143929.5 181829.1
Let’s visualize the model fit together with its 95% prediction interval.
pred.int <- predict(model3, interval = "prediction",
level = 0.95)
## Warning in predict.lm(model3, interval = "prediction", level = 0.95): predictions on current data refer to _future_ responses
mydata <- cbind(data2, pred.int)
ggplot(data = mydata, aes(x = R.D.Spend, y = Profit)) +
geom_point() +
labs(title = "Linear Regression of Profit by RnD Spend") +
geom_line(aes(y = fit), color = "blue") +
geom_line(aes(y = lwr), color = "red", linetype = "dashed") +
geom_line(aes(y = upr), color = "red", linetype = "dashed") +
theme_minimal()
We can also compute a 95% confidence interval for the mean Profit at a given R.D.Spend value:
predict(model1, data.frame(R.D.Spend = 10), interval = "confidence", level = 0.95)
## fit lwr upr
## 1 49041.44 43939.16 54143.72
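Because interval = "confidence" bounds the mean Profit at that spending level, the corresponding prediction interval for a single new startup would be wider; for comparison:
predict(model1, data.frame(R.D.Spend = 10), interval = "prediction", level = 0.95)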
plot(model_both, which = 1)
abline(h = 10, col = "green")
abline(h = -10, col = "green")
From the residuals-vs-fitted plot above, we can conclude that model_both passes the linearity assumption test, because the residuals are scattered around 0.
hist(model_both$residuals)
shapiro.test(model_both$residuals)
##
## Shapiro-Wilk normality test
##
## data: model_both$residuals
## W = 0.93717, p-value = 0.01042
The Shapiro-Wilk normality test results indicate that the residuals of the model (model_both$residuals) may not be normally distributed. Here’s a breakdown of the results:
- Test statistic (W): 0.93717. This value is close to 1, which by itself suggests normality, but it is not the only factor to consider.
- p-value: 0.01042. This is less than the commonly used significance level of 0.05; a low p-value suggests that the observed residuals are unlikely to have come from a normal distribution by chance.
Hypotheses and decision rule:
- H0: the errors are normally distributed
- H1: the errors are not normally distributed
- p-value > alpha -> fail to reject H0 (accept H0)
- p-value < alpha -> reject H0 (accept H1)
Since p-value = 0.01042 < alpha (0.05), we reject H0 and accept H1: the errors are not normally distributed, so the model does not pass the normality assumption.
Two components of the fitted model are used below:
- fitted.values: the predicted values for the training data
- residuals: the error values (actual minus predicted)
# scatter plot of fitted values vs. residuals
plot(x = model_both$fitted.values, y = model_both$residuals)
abline(h = 0, col = "red")
Next, we run the Breusch-Pagan hypothesis test to check for heteroscedasticity.
bptest(model_both)
##
## studentized Breusch-Pagan test
##
## data: model_both
## BP = 2.8431, df = 2, p-value = 0.2413
Hypotheses:
- H0: the error spread is constant (homoscedasticity)
- H1: the error spread is NOT constant (heteroscedasticity)
Decision rule: p-value > alpha -> fail to reject H0 (accept H0); p-value < alpha -> reject H0 (accept H1).
Conclusion: p-value = 0.2413 > alpha (0.05) -> fail to reject H0 (the error spread is constant, i.e., homoscedasticity; the model passes this assumption test).
vif(model_both)
## R.D.Spend Marketing.Spend
## 2.103206 2.103206
Conclusion: there is no serious multicollinearity among the predictors, because both VIF values are below 10 (the model passes the no-multicollinearity test).
Model 3 has an R-squared of 0.9507 and a MAPE of 10.60121%. In addition, the linearity check showed residuals scattered around 0, so the model meets that criterion as well.
Based on this model, Profit rises with the value of R.D.Spend: every one-unit increase in R.D.Spend is associated with an average increase of 0.80572 units in Profit, assuming the other variables are held constant. R.D.Spend is the most influential variable in the model, so companies can increase profits by continuing to invest in their research and development budget.
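As a concrete illustration of that coefficient (the 10,000 figure below is arbitrary):
# expected change in Profit for a 10,000-unit increase in R.D.Spend under model 3
coef(model3)["R.D.Spend"] * 10000 # roughly 8,057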