The dataset we examine here contains data about 50 startups. It has 5 columns: “R&D Spend”, “Administration”, “Marketing Spend”, “State”, and “Profit”. The first three columns indicate how much each startup spends on research and development, administration, and marketing; the State column indicates which state the startup is based in; and the last column states the profit made by the startup.
Source : https://www.kaggle.com/datasets/amineoumous/50-startups-data?select=50_Startups.csv
We will learn to create a model that predicts Profit using linear regression, with three spending variables available as predictors.
First things first, let’s import our data.
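The original setup chunk is not shown, so here is an assumed minimal setup based on the functions used throughout (dplyr for %>% and mutate(), GGally for ggcorr(), ggplot2 for plots, MLmetrics for MAE()/MSE()/RMSE()/MAPE(), lmtest for bptest(), and car for vif()):
library(dplyr)     # %>% and mutate()
library(GGally)    # ggcorr()
library(ggplot2)   # ggplot() visualizations
library(MLmetrics) # MAE(), MSE(), RMSE(), MAPE()
library(lmtest)    # bptest()
library(car)       # vif()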
data <- read.csv("data_input/50_Startups.csv")
head(data)
## R.D.Spend Administration Marketing.Spend State Profit
## 1 165349.2 136897.80 471784.1 New York 192261.8
## 2 162597.7 151377.59 443898.5 California 191792.1
## 3 153441.5 101145.55 407934.5 Florida 191050.4
## 4 144372.4 118671.85 383199.6 New York 182902.0
## 5 142107.3 91391.77 366168.4 Florida 166187.9
## 6 131876.9 99814.71 362861.4 New York 156991.1
Let’s inspect the structure of our data and check for missing values.
str(data)
## 'data.frame': 50 obs. of 5 variables:
## $ R.D.Spend : num 165349 162598 153442 144372 142107 ...
## $ Administration : num 136898 151378 101146 118672 91392 ...
## $ Marketing.Spend: num 471784 443899 407935 383200 366168 ...
## $ State : chr "New York" "California" "Florida" "New York" ...
## $ Profit : num 192262 191792 191050 182902 166188 ...
colSums(is.na(data))
## R.D.Spend Administration Marketing.Spend State Profit
## 0 0 0 0 0
To make processing easier, let’s convert the State column to a factor.
data1 <- data %>%
mutate(State = as.factor(State))
Once we have finished preparing our data, let’s start exploring our data.
ggcorr(data1, label = TRUE, label_size = 2.9, hjust = 1, layout.exp = 2)
## Warning in ggcorr(data1, label = TRUE, label_size = 2.9, hjust = 1, layout.exp
## = 2): data in column(s) 'State' are not numeric and were ignored
From the correlation plot above we can conclude that all numeric variables are positively correlated with Profit, especially Marketing.Spend and R.D.Spend.
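The underlying correlation values can also be checked numerically; a small sketch that keeps only the numeric columns:
# correlation matrix of the numeric columns (State is dropped)
cor(data1[sapply(data1, is.numeric)])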
Based on these correlations, let’s select the variables we need.
data2 <- data1 %>% select(R.D.Spend, Administration, Marketing.Spend, Profit)
The following is the distribution of values for each variable.
boxplot(data2)
Next, we can create a linear regression model with R.D.Spend as the predictor, since this variable has the highest positive correlation with the target variable Profit.
model1 <- lm(Profit ~ R.D.Spend, data = data2)
summary(model1)
##
## Call:
## lm(formula = Profit ~ R.D.Spend, data = data2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -34351 -4626 -375 6249 17188
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 49032.89914 2537.89695 19.32 <0.0000000000000002 ***
## R.D.Spend 0.85429 0.02931 29.15 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9416 on 48 degrees of freedom
## Multiple R-squared: 0.9465, Adjusted R-squared: 0.9454
## F-statistic: 849.8 on 1 and 48 DF, p-value: < 0.00000000000000022
Here’s the visualization:
plot(data2$R.D.Spend, data2$Profit)
abline(model1, col = "red")
Let’s build several models so we can compare how each performs.
model2 <- lm(formula = Profit ~ Administration + Marketing.Spend, data = data2)
summary(model2)
##
## Call:
## lm(formula = Profit ~ Administration + Marketing.Spend, data = data2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -82155 -12168 2836 13650 56472
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 20224.42906 17698.17704 1.143 0.2589
## Administration 0.32367 0.13116 2.468 0.0173 *
## Marketing.Spend 0.24884 0.03005 8.281 0.0000000000973 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 25710 on 47 degrees of freedom
## Multiple R-squared: 0.6097, Adjusted R-squared: 0.5931
## F-statistic: 36.71 on 2 and 47 DF, p-value: 0.0000000002496
model3 <- lm(formula = Profit ~ R.D.Spend + Administration + Marketing.Spend, data = data2)
summary(model3)
##
## Call:
## lm(formula = Profit ~ R.D.Spend + Administration + Marketing.Spend,
## data = data2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -33534 -4795 63 6606 17275
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 50122.19299 6572.35262 7.626 0.00000000106 ***
## R.D.Spend 0.80572 0.04515 17.846 < 0.0000000000000002 ***
## Administration -0.02682 0.05103 -0.526 0.602
## Marketing.Spend 0.02723 0.01645 1.655 0.105
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9232 on 46 degrees of freedom
## Multiple R-squared: 0.9507, Adjusted R-squared: 0.9475
## F-statistic: 296 on 3 and 46 DF, p-value: < 0.00000000000000022
From the output above we can write out the two leading candidate models:
Model 3:
\[Profit = 0.80572\,(R.D.Spend) - 0.02682\,(Administration) + 0.02723\,(Marketing.Spend) + 50122.19299\]
Model 1:
\[Profit = 0.8543\,(R.D.Spend) + 49032.8991\]
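These equations are just the fitted coefficients, which can be read straight off the model objects:
coef(model3) # intercept plus the three slopes
coef(model1) # intercept plus the R.D.Spend slope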
Comparing the 3 models:
- model1 (R.D.Spend): Multiple R-squared 0.9465 | Adjusted R-squared 0.9454
- model2 (Administration + Marketing.Spend): Multiple R-squared 0.6097 | Adjusted R-squared 0.5931
- model3 (all predictors): Multiple R-squared 0.9507 | Adjusted R-squared 0.9475
From this comparison we can conclude that model3 is the best model, since it has the highest Multiple and Adjusted R-squared.
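This comparison can also be pulled programmatically from each summary; a minimal sketch:
# adjusted R-squared for each candidate model
sapply(list(model1 = model1, model2 = model2, model3 = model3),
       function(m) summary(m)$adj.r.squared)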
Now let’s generate predictions from each of the three models.
data2$pred1 <- predict(model1, data2) # predictions from the single-predictor model
data2$pred2 <- predict(model2, data2) # predictions from the two-predictor model
data2$pred3 <- predict(model3, data2) # predictions from the model with all predictors
head(data2)
## R.D.Spend Administration Marketing.Spend Profit pred1 pred2 pred3
## 1 165349.2 136897.80 471784.1 192261.8 190289.3 181935.0 192521.3
## 2 162597.7 151377.59 443898.5 191792.1 187938.7 179682.6 189156.8
## 3 153441.5 101145.55 407934.5 191050.4 180116.7 154474.4 182147.3
## 4 144372.4 118671.85 383199.6 182902.0 172369.0 153992.1 173696.7
## 5 142107.3 91391.77 366168.4 166187.9 170434.0 140924.1 172139.5
## 6 131876.9 99814.71 362861.4 156991.1 161694.2 142827.5 163580.8
Let’s evaluate the model with several error metrics.
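For reference, writing y_i for the actual Profit, ŷ_i for the prediction, and n for the number of observations, the metrics used below (here assumed to come from the MLmetrics package) are defined as:
\[MAE = \frac{1}{n}\sum_{i=1}^{n}\left|\hat{y}_i - y_i\right| \qquad MSE = \frac{1}{n}\sum_{i=1}^{n}(\hat{y}_i - y_i)^2\]
\[RMSE = \sqrt{MSE} \qquad MAPE = \frac{1}{n}\sum_{i=1}^{n}\left|\frac{\hat{y}_i - y_i}{y_i}\right|\]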
MAE(data2$pred3, data2$Profit)
## [1] 6471.45
range(data2$Profit)
## [1] 14681.4 192261.8
MSE(data2$pred3, data2$Profit)
## [1] 78417126
RMSE(data2$pred3, data2$Profit)
## [1] 8855.344
MAPE(data2$pred3, data2$Profit) *100
## [1] 10.60121
From the calculations above, model3’s MAE of about 6,471 is small relative to the range of Profit (14,681 to 192,262), and its MAPE is about 10.6%, so we can conclude that model3 is the best model and pred3 gives the most accurate predictions.
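As a quick sanity check on that claim, the same metrics can be computed for all three prediction columns; a small sketch:
# error metrics for each model's predictions (one column per model)
sapply(list(pred1 = data2$pred1, pred2 = data2$pred2, pred3 = data2$pred3),
       function(p) c(MAE = MAE(p, data2$Profit),
                     RMSE = RMSE(p, data2$Profit),
                     MAPE = MAPE(p, data2$Profit)))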
After identifying the best model, let’s explore the data in more depth.
First, we create a model without predictors (an intercept-only model, which simply predicts the mean of Profit).
modelnone <- lm(Profit ~ 1, data2)
Then we apply stepwise selection in both directions, a combination of forward and backward selection:
model_both <- step(object = modelnone,
                   direction = "both",
                   scope = list(upper = model3),
                   trace = FALSE)
summary(model_both)
##
## Call:
## lm(formula = Profit ~ R.D.Spend + Marketing.Spend, data = data2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -33645 -4632 -414 6484 17097
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 46975.86422 2689.93292 17.464 <0.0000000000000002 ***
## R.D.Spend 0.79658 0.04135 19.266 <0.0000000000000002 ***
## Marketing.Spend 0.02991 0.01552 1.927 0.06 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9161 on 47 degrees of freedom
## Multiple R-squared: 0.9505, Adjusted R-squared: 0.9483
## F-statistic: 450.8 on 2 and 47 DF, p-value: < 0.00000000000000022
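Note that step() selects models by AIC (lower is better) rather than by R-squared. The path it took is recorded on the returned object:
model_both$anova # the sequence of steps taken, with their AIC values
AIC(model_both)  # AIC of the final selected model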
Next, let’s generate predictions from the stepwise model.
model_step <- predict(model_both, newdata = data2)
head(model_step)
## 1 2 3 4 5 6
## 192800.5 189774.7 181405.4 173441.3 171127.6 162879.3
We add an upper limit and a lower limit using a 95% prediction interval.
model_step_interval <- predict(object = model_both,
newdata = data2,
interval = "prediction",
level = 0.95)
head(model_step_interval)
## fit lwr upr
## 1 192800.5 173283.0 212317.9
## 2 189774.7 170381.4 209167.9
## 3 181405.4 162191.9 200618.8
## 4 173441.3 154359.6 192523.1
## 5 171127.6 152092.2 190163.1
## 6 162879.3 143929.5 181829.1
Let’s visualize the model fit together with its 95% prediction interval.
pred.int <- predict(model3, interval = "prediction",
level = 0.95)
## Warning in predict.lm(model3, interval = "prediction", level = 0.95): predictions on current data refer to _future_ responses
mydata <- cbind(data2, pred.int)
ggplot(data = mydata, aes(x = R.D.Spend, y = Profit)) +
geom_point() +
labs(title = "Linear Regression of Profit by RnD Spend") +
geom_line(aes(y = fit), color = "blue") +
geom_line(aes(y = lwr), color = "red", linetype = "dashed") +
geom_line(aes(y = upr), color = "red", linetype = "dashed") +
theme_minimal()
We can also compute a 95% confidence interval for the mean Profit at a given R.D.Spend value:
predict(model1, data.frame(R.D.Spend = 10), interval = "confidence", level = 0.95)
## fit lwr upr
## 1 49041.44 43939.16 54143.72
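Because interval = "confidence" bounds the mean Profit at that spending level, the corresponding prediction interval for a single new startup would be wider; for comparison:
predict(model1, data.frame(R.D.Spend = 10), interval = "prediction", level = 0.95)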
plot(model_both, which = 1)
abline(h = 10, col = "green")
abline(h = -10, col = "green")
From the residuals-vs-fitted plot above, we can conclude that model_both passes the linearity assumption test, because the residuals are scattered around 0.
hist(model_both$residuals)
shapiro.test(model_both$residuals)
##
## Shapiro-Wilk normality test
##
## data: model_both$residuals
## W = 0.93717, p-value = 0.01042
The Shapiro-Wilk normality test results indicate that the residuals of the model (model_both$residuals) may not be normally distributed. Here’s a breakdown of the results:
- Test statistic (W): 0.93717. This value is close to 1, which by itself suggests normality, but it is not the only factor to consider.
- p-value: 0.01042. This is less than the commonly used significance level of 0.05; a low p-value suggests that the observed residuals are unlikely to have come from a normal distribution by chance.
Hypotheses and decision rule:
- H0: the errors are normally distributed
- H1: the errors are not normally distributed
- p-value > alpha -> fail to reject H0 (accept H0)
- p-value < alpha -> reject H0 (accept H1)
Since p-value = 0.01042 < alpha (0.05), we reject H0 and accept H1: the errors are not normally distributed, so the model does not pass the normality assumption.
Two components of the fitted model are used below:
- fitted.values: the predicted values for the training data
- residuals: the error values (actual minus predicted)
# scatter plot of fitted values vs. residuals
plot(x = model_both$fitted.values, y = model_both$residuals)
abline(h = 0, col = "red")
Next, we run the Breusch-Pagan hypothesis test to check for heteroscedasticity.
bptest(model_both)
##
## studentized Breusch-Pagan test
##
## data: model_both
## BP = 2.8431, df = 2, p-value = 0.2413
Hypotheses:
- H0: the error spread is constant (homoscedasticity)
- H1: the error spread is NOT constant (heteroscedasticity)
Decision rule: p-value > alpha -> fail to reject H0 (accept H0); p-value < alpha -> reject H0 (accept H1).
Conclusion: p-value = 0.2413 > alpha (0.05) -> fail to reject H0 (the error spread is constant, i.e., homoscedasticity; the model passes this assumption test).
vif(model_both)
## R.D.Spend Marketing.Spend
## 2.103206 2.103206
Conclusion: there is no serious multicollinearity among the predictors, because both VIF values are below 10 (the model passes the no-multicollinearity test).
Model 3 has an R-squared of 0.9507 and a MAPE of 10.60121%. In addition, the linearity check showed residuals scattered around 0, so the model meets that criterion as well.
Based on this model, Profit rises with the value of R.D.Spend: every one-unit increase in R.D.Spend is associated with an average increase of 0.80572 units in Profit, assuming the other variables are held constant. R.D.Spend is the most influential variable in the model, so companies can increase profits by continuing to invest in their research and development budget.
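As a concrete illustration of that coefficient (the 10,000 figure below is arbitrary):
# expected change in Profit for a 10,000-unit increase in R.D.Spend under model 3
coef(model3)["R.D.Spend"] * 10000 # roughly 8,057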