1. OVERVIEW

The ‘50_Startups’ dataset can be used to analyze how different types of expenditure and the location (state) of a startup relate to its profitability. By examining the relationships between these variables, we can see which factors are most strongly associated with a startup's profit.

Purpose: build regression models that predict Profit from R&D Spend, Administration, Marketing Spend, and State.

2. DATA PROCESSING

A. Import Libraries

Begin by importing the libraries needed for data manipulation, visualization, and analysis: lubridate for date and time manipulation, tidyr for data wrangling, ggplot2 for static data visualization, GGally for correlation and pair plots, tidyverse for general data science tools, scales for number formatting, glue for string interpolation, ggrepel for non-overlapping text labels, lmtest and car for regression diagnostics, caret for predictive modeling, and Metrics and MLmetrics for model evaluation metrics.

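# Load necessary libraries

library(lubridate)  # date and time manipulation
library(tidyr)      # data wrangling
library(ggplot2)    # static data visualization
library(GGally)     # ggcorr() correlation plots
library(tidyverse)  # collection of data science packages
library(scales)     # comma and number formatting
library(glue)       # string interpolation
library(ggrepel)    # non-overlapping text labels
library(lmtest)     # regression diagnostics, e.g. bptest()
library(car)        # regression diagnostics, e.g. vif()
library(caret)      # predictive modeling
library(Metrics)    # model performance metrics
library(MLmetrics)  # additional metrics, e.g. MAPE()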

B. Import Data & Inspection

Load the dataset into R, then check its structure to understand its columns and data types.

# Load the dataset
startups <- read.csv("data_input/50_Startups.csv")
head(startups)

# Check the structure of the data
str(startups)

#> 'data.frame':    50 obs. of  5 variables:
#>  $ R.D.Spend      : num  165349 162598 153442 144372 142107 ...
#>  $ Administration : num  136898 151378 101146 118672 91392 ...
#>  $ Marketing.Spend: num  471784 443899 407935 383200 366168 ...
#>  $ State          : chr  "New York" "California" "Florida" "New York" ...
#>  $ Profit         : num  192262 191792 191050 182902 166188 ...

rmarkdown::paged_table(startups)

The dataset 50_Startups.csv contains information about 50 startups, with each observation representing a unique startup. The dataset includes 5 variables:

1. R.D.Spend: the amount of money the startup spent on Research and Development (R&D).
2. Administration: the amount of money the startup spent on administrative expenses.
3. Marketing.Spend: the amount of money the startup spent on marketing activities.
4. State: the state where the startup is located.
5. Profit: the profit made by the startup.

C. Data Cleaning

We convert the ‘State’ column to a factor because it is a categorical variable. Then, check for missing values and remove any rows with missing values to ensure the dataset is clean.

# Convert 'State' to a factor
startups$State <- as.factor(startups$State)

# Check for missing values
anyNA(startups)

#> [1] FALSE

table(is.na(startups))

#> 
#> FALSE 
#>   250

There are no missing values: all 250 cells (50 rows x 5 columns) are non-missing.

# If there are any missing values, remove them
startups <- na.omit(startups)
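
A per-column count is often more informative than a single overall flag, because it shows where any gaps sit. The sketch below illustrates this on a small made-up data frame (`demo` and its columns are hypothetical and are not part of the startups data); the same two calls could be applied to `startups`.

```r
# Hypothetical toy data with one gap in each of two columns
demo <- data.frame(
  spend  = c(100, NA, 300),
  state  = c("New York", "Florida", NA),
  profit = c(10, 20, 30)
)

# Number of missing values in each column
na_per_column <- colSums(is.na(demo))
print(na_per_column)

# Keep only the rows that have no missing values
demo_complete <- na.omit(demo)
nrow(demo_complete)
```

On a dataset with no missing values, as here, `na.omit()` returns the data unchanged.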

3. EXPLORATORY DATA ANALYSIS

EDA helps us understand the relationships between variables. We visualize the correlation matrix to see how the variables are related and use boxplots to check for outliers.

# Check the correlation matrix


ggcorr(startups, hjust = 1, layout.exp = 3, label = TRUE)+
  ggtitle("Correlation Matrix of Startups Data") +
  theme(panel.background = element_rect(fill = "#C4D5C5", color = "#C4D5C5"),
        plot.background = element_rect(fill = "#C4D5C5", color = "#C4D5C5"),
        legend.background = element_rect(fill = "#C4D5C5", color = "#C4D5C5"),
        legend.key = element_rect(fill = "#C4D5C5", color = "#C4D5C5"),
        plot.title = element_text(hjust = 0.5, size = 14, face = "bold", vjust = 1))

In the correlation plot, all numeric variables are positively correlated with Profit, and R.D.Spend shows the strongest correlation of the three predictors.

Here are the distributions of each variable’s values.

bg_color <- "#C4D5C5"
par(bg = bg_color)

boxplot(startups)

No outliers are apparent in any of the columns.
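
A boxplot flags points that lie more than 1.5 times the interquartile range beyond the hinges, and the same rule can be checked numerically. The following is a minimal sketch on a made-up vector (`x` is hypothetical, not a column of the dataset), using `boxplot.stats()` from base R.

```r
# Hypothetical values; 250 sits far above the rest
x <- c(10, 12, 13, 14, 15, 16, 18, 250)

# boxplot.stats() returns, in $out, the points lying more than
# 1.5 * IQR beyond the hinges (the same rule boxplot() draws)
outliers <- boxplot.stats(x)$out
print(outliers)
```

Applying `boxplot.stats(column)$out` to each numeric column of `startups` would list any flagged values explicitly, which is easier to read than a plot where columns share one axis.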

# Visualize the distribution of the data
bg_color <- "#C4D5C5"
par(bg = bg_color)

pairs(startups)

# Correlation matrix
cor_matrix <- cor(startups[, sapply(startups, is.numeric)])
print(cor_matrix)

#>                 R.D.Spend Administration Marketing.Spend    Profit
#> R.D.Spend       1.0000000     0.24195525      0.72424813 0.9729005
#> Administration  0.2419552     1.00000000     -0.03215388 0.2007166
#> Marketing.Spend 0.7242481    -0.03215388      1.00000000 0.7477657
#> Profit          0.9729005     0.20071657      0.74776572 1.0000000

4. MODELLING

A. Simple Linear Regression

Simple linear regression is a model with one predictor variable. Here we use R.D.Spend because it has the strongest correlation with Profit (0.97).

# Plotting with ggplot2
ggplot(data = startups, aes(x = R.D.Spend, y = Profit)) +
  geom_point(aes(size = R.D.Spend, color = R.D.Spend)) +  # Map size and color to R.D.Spend
  scale_size_continuous(range = c(1, 6), guide = "none") +
  scale_color_viridis_c(option = "A", direction = -1, labels = scales::number_format()) +
  labs(title = "R.D.Spend vs. Profit",
       x = "R&D Spend",
       y = "Profit",
       color = "R.D.Spend") +  # Add color legend label
  theme_minimal(base_size = 12) +
  theme(
    plot.title = element_text(face = "bold", size = 12, hjust = 0.5, color = "#333333"),
    panel.background = element_rect(fill = "lightgrey"),
    plot.background = element_rect(fill = "#C4D5C5"),
    panel.grid.major = element_line(colour = "grey"),
    axis.line = element_line(color = "grey"),
    axis.text = element_text(size = 10, colour = "black"),
    legend.position = "right"  # Add the legend back to the plot
  )

The scatter plot shows a clear, roughly linear relationship, so we fit a linear regression model with R.D.Spend as the predictor and Profit as the target.

# Build the model
# lm(target ~ predictor, data)
model_RDspend <- lm(Profit ~ R.D.Spend, startups)

summary(model_RDspend)

#> 
#> Call:
#> lm(formula = Profit ~ R.D.Spend, data = startups)
#> 
#> Residuals:
#>    Min     1Q Median     3Q    Max 
#> -34351  -4626   -375   6249  17188 
#> 
#> Coefficients:
#>                Estimate  Std. Error t value            Pr(>|t|)    
#> (Intercept) 49032.89914  2537.89695   19.32 <0.0000000000000002 ***
#> R.D.Spend       0.85429     0.02931   29.15 <0.0000000000000002 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 9416 on 48 degrees of freedom
#> Multiple R-squared:  0.9465, Adjusted R-squared:  0.9454 
#> F-statistic: 849.8 on 1 and 48 DF,  p-value: < 0.00000000000000022

Model Interpretation:

From the coefficients:

-. Each 1-unit increase in R.D.Spend increases predicted Profit by 0.85429.

From the significance:

-. R.D.Spend is a significant predictor (p < 0.001).

From the R-squared:

-. 0.9465, so the model explains about 95% of the variance in Profit.
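
To make the coefficient interpretation concrete, the fitted line can be evaluated by hand. This sketch copies the intercept and slope from the summary output above; the R&D budgets in `rd_spend` are hypothetical values chosen only for illustration.

```r
# Coefficients copied from the summary(model_RDspend) output
intercept <- 49032.89914
slope     <- 0.85429

# Hypothetical R&D budgets (same currency unit as the dataset)
rd_spend <- c(0, 100000, 150000)

# Fitted line: Profit = intercept + slope * R.D.Spend
predicted_profit <- intercept + slope * rd_spend
print(predicted_profit)

# Raising R&D spend by 1,000 raises predicted profit by slope * 1000
extra_profit_per_1000 <- slope * 1000
print(extra_profit_per_1000)
```

With the fitted model object available, `predict(model_RDspend, data.frame(R.D.Spend = rd_spend))` should give the same numbers.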

B. Multiple Linear Regression

# lm(target ~ predictors, data); "." means all other columns
model_all <- lm(Profit ~ ., startups)

summary(model_all)

#> 
#> Call:
#> lm(formula = Profit ~ ., data = startups)
#> 
#> Residuals:
#>    Min     1Q Median     3Q    Max 
#> -33504  -4736     90   6672  17338 
#> 
#> Coefficients:
#>                    Estimate  Std. Error t value             Pr(>|t|)    
#> (Intercept)     50125.34383  6884.81973   7.281        0.00000000444 ***
#> R.D.Spend           0.80602     0.04641  17.369 < 0.0000000000000002 ***
#> Administration     -0.02700     0.05223  -0.517                0.608    
#> Marketing.Spend     0.02698     0.01714   1.574                0.123    
#> StateFlorida      198.78879  3371.00712   0.059                0.953    
#> StateNew York     -41.88702  3256.03913  -0.013                0.990    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 9439 on 44 degrees of freedom
#> Multiple R-squared:  0.9508, Adjusted R-squared:  0.9452 
#> F-statistic: 169.9 on 5 and 44 DF,  p-value: < 0.00000000000000022

Interpretation:

-. Significance: R.D.Spend is the only significant predictor.
-. Adjusted R-squared: 0.9452, which is close to 1.
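
The StateFlorida and StateNew York rows in the output come from dummy coding: `lm()` expands the State factor into indicator columns and treats the first level, California, as the baseline. The sketch below shows that expansion with `model.matrix()` on a hypothetical three-row data frame (`demo` is made up for illustration).

```r
# Hypothetical three-row example using the same state names as the dataset
demo <- data.frame(
  State = factor(c("California", "Florida", "New York"))
)

# model.matrix() builds the design matrix that lm() would use
design <- model.matrix(~ State, data = demo)
print(design)
```

Each State coefficient is therefore the estimated difference in Profit relative to California, holding the spending variables fixed.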

5. DATA EVALUATION

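Compare the R-squared of model_RDspend with that of model_all. Because model_all has more predictors, its adjusted R-squared is used; for a like-for-like comparison, the adjusted R-squared of model_RDspend is 0.9454. Detailed information about a model is available from summary(model_name).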

# Compare R-squared
# $r.squared -> multiple R-squared
summary(model_RDspend)$r.squared

#> [1] 0.9465353

# $adj.r.squared -> adjusted R-squared
summary(model_all)$adj.r.squared

#> [1] 0.9451562
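
Adjusted R-squared penalizes the multiple R-squared for the number of predictors, which is why it is the fairer measure when models differ in size. As a small worked check, the sketch below recomputes the adjusted value for model_RDspend from the numbers printed above, using the standard formula 1 - (1 - R²)(n - 1)/(n - p - 1).

```r
r2 <- 0.9465353  # multiple R-squared of model_RDspend, from the output above
n  <- 50         # number of observations
p  <- 1          # number of predictors (excluding the intercept)

# Adjusted R-squared shrinks R-squared according to the model size
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
round(adj_r2, 4)
```

The result agrees with the adjusted R-squared of 0.9454 reported by summary(model_RDspend).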

Compare the RMSE and MAPE of model_RDspend with those of model_all:

# Create the prediction objects
# Predictions from the single-predictor model
predict_RDspend <- predict(model_RDspend, startups)
# Predictions from the all-predictor model
predict_all <- predict(model_all, startups)

library(caret)
# Compute RMSE (use the predictions from model_RDspend and model_all as y_pred)
RMSE(predict_RDspend, startups$Profit)

#> [1] 9226.101

RMSE(predict_all,startups$Profit)

#> [1] 8854.761

# Compute MAPE (use the predictions from model_RDspend and model_all as y_pred)
MAPE(predict_RDspend, startups$Profit)

#> [1] 0.1107014

MAPE(predict_all, startups$Profit)

#> [1] 0.1060236

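model_all, which uses all predictors, has a lower RMSE (8854.76 vs. 9226.10) and a lower MAPE (10.6% vs. 11.1%) than model_RDspend, which uses only R.D.Spend. Its adjusted R-squared (0.9452), however, is marginally lower than that of model_RDspend (0.9454), so the extra predictors add very little once model size is taken into account.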
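
Both error metrics are simple to compute by hand, which helps when interpreting them. The sketch below uses made-up `actual` and `predicted` vectors (hypothetical, not taken from the models); RMSE is in the units of the target, while MAPE is a fraction, so 0.11 means an average error of about 11%.

```r
# Hypothetical observed and predicted values
actual    <- c(100, 200, 400)
predicted <- c(110, 190, 360)

# Root mean squared error, in the units of the target
rmse_manual <- sqrt(mean((actual - predicted)^2))

# Mean absolute percentage error, as a fraction (0.1 = 10%)
mape_manual <- mean(abs((actual - predicted) / actual))

print(c(RMSE = rmse_manual, MAPE = mape_manual))
```

Because these values are computed on the same data used to fit the models, they describe in-sample fit; a held-out test set would be needed to estimate error on new startups.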

# stepwise regression: backward elimination
model_backward <- step(object = model_all,
                       direction = "backward")

#> Start:  AIC=920.87
#> Profit ~ R.D.Spend + Administration + Marketing.Spend + State
#> 
#>                   Df   Sum of Sq         RSS     AIC
#> - State            2      516657  3920856301  916.88
#> - Administration   1    23816156  3944155801  919.17
#> <none>                            3920339644  920.87
#> - Marketing.Spend  1   220708706  4141048350  921.61
#> - R.D.Spend        1 26878168212 30798507857 1021.94
#> 
#> Step:  AIC=916.88
#> Profit ~ R.D.Spend + Administration + Marketing.Spend
#> 
#>                   Df   Sum of Sq         RSS     AIC
#> - Administration   1    23538549  3944394850  915.18
#> <none>                            3920856301  916.88
#> - Marketing.Spend  1   233485362  4154341663  917.77
#> - R.D.Spend        1 27147076244 31067932545 1018.37
#> 
#> Step:  AIC=915.18
#> Profit ~ R.D.Spend + Marketing.Spend
#> 
#>                   Df   Sum of Sq         RSS     AIC
#> <none>                            3944394850  915.18
#> - Marketing.Spend  1   311651716  4256046566  916.98
#> - R.D.Spend        1 31149105710 35093500560 1022.46

summary(model_backward)

#> 
#> Call:
#> lm(formula = Profit ~ R.D.Spend + Marketing.Spend, data = startups)
#> 
#> Residuals:
#>    Min     1Q Median     3Q    Max 
#> -33645  -4632   -414   6484  17097 
#> 
#> Coefficients:
#>                    Estimate  Std. Error t value            Pr(>|t|)    
#> (Intercept)     46975.86422  2689.93292  17.464 <0.0000000000000002 ***
#> R.D.Spend           0.79658     0.04135  19.266 <0.0000000000000002 ***
#> Marketing.Spend     0.02991     0.01552   1.927                0.06 .  
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 9161 on 47 degrees of freedom
#> Multiple R-squared:  0.9505, Adjusted R-squared:  0.9483 
#> F-statistic: 450.8 on 2 and 47 DF,  p-value: < 0.00000000000000022

Create a new model using the formula selected by backward elimination:

m_backwards <- lm(Profit ~ R.D.Spend + Marketing.Spend, startups)
m_backwards

#> 
#> Call:
#> lm(formula = Profit ~ R.D.Spend + Marketing.Spend, data = startups)
#> 
#> Coefficients:
#>     (Intercept)        R.D.Spend  Marketing.Spend  
#>     46975.86422          0.79658          0.02991

Candidate models

Profit (model_backward) = 46975.86422 + 0.79658 (R.D.Spend) + 0.02991 (Marketing.Spend)
Profit (model_RDspend)= 49032.89914 + 0.85429 (R.D.Spend)

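Stepwise regression searches for the formula with the lowest AIC. AIC rewards goodness of fit and penalizes each additional coefficient, so a lower AIC indicates a better trade-off between fit and model complexity. Here backward elimination drops State and then Administration, lowering the AIC from 920.87 to 915.18.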
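
The AIC values in the step() trace can be reproduced from the residual sum of squares. For linear models, step() relies on extractAIC(), which computes n * log(RSS / n) + 2 * (number of coefficients). The helper name `step_aic` below is hypothetical; the RSS values are copied from the trace.

```r
n <- 50  # number of observations

# AIC as used by step() for lm objects:
# n * log(RSS / n) + 2 * (number of estimated coefficients)
step_aic <- function(rss, n_coef, n) n * log(rss / n) + 2 * n_coef

# RSS values copied from the step() trace
aic_full  <- step_aic(rss = 3920339644, n_coef = 6, n = n)  # all predictors
aic_final <- step_aic(rss = 3944394850, n_coef = 3, n = n)  # R.D.Spend + Marketing.Spend

round(c(full = aic_full, final = aic_final), 2)
```

The final model has a slightly larger RSS than the full model, but it uses three fewer coefficients, and the smaller penalty is what gives it the lower AIC.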

Compared to the initial model that uses only R.D.Spend, the model using R.D.Spend and Marketing.Spend (model_backward) has an adjusted R-squared of 0.9483, slightly higher than the simple model's adjusted R-squared of 0.9454 (its multiple R-squared is 0.9465).

Model & Error Prediction

# Predictions with lower and upper bounds of the 95% confidence interval
pred_model_step_interval <- predict(object = model_backward,
                                    newdata = startups,
                                    interval = "confidence",
                                    level = 0.95) 

head(pred_model_step_interval)

#>        fit      lwr      upr
#> 1 192800.5 186375.1 199225.8
#> 2 189774.7 183737.1 195812.2
#> 3 181405.4 175973.1 186837.7
#> 4 173441.3 168494.9 178387.7
#> 5 171127.6 166362.9 175892.3
#> 6 162879.3 158469.1 167289.5

pred_model_RDspend <- predict(object = model_RDspend,
                                    newdata = startups,
                                    interval = "confidence",
                                    level = 0.95) 

head(pred_model_RDspend)

#>        fit      lwr      upr
#> 1 190289.3 184262.9 196315.7
#> 2 187938.7 182057.1 193820.3
#> 3 180116.7 174709.8 185523.5
#> 4 172369.0 167419.3 177318.7
#> 5 170434.0 165596.0 175271.9
#> 6 161694.2 157345.5 166042.9

Check Normality of the model:

-. Model_RDSpend

bg_color <- "#C4D5C5"
par(bg = bg_color)

hist(model_RDspend$residuals, breaks = 20, col = "blue", main = "Histogram of Residuals", xlab = "Residuals")

shapiro.test(model_RDspend$residuals)

#> 
#>  Shapiro-Wilk normality test
#> 
#> data:  model_RDspend$residuals
#> W = 0.93708, p-value = 0.01034

-. Model_Backward

bg_color <- "#C4D5C5"
par(bg = bg_color)


hist(model_backward$residuals, breaks = 20,col = "skyblue", main = "Histogram of Residuals", xlab = "Residuals")

shapiro.test(model_backward$residuals)

#> 
#>  Shapiro-Wilk normality test
#> 
#> data:  model_backward$residuals
#> W = 0.93717, p-value = 0.01042

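For both models the p-value is below 0.05 (0.0103 and 0.0104), so we reject the null hypothesis (H0) that the residuals are normally distributed. The residual summaries point the same way: the largest negative residual (about -34,000) is roughly twice the size of the largest positive one (about 17,000), which suggests a long left tail. The coefficient estimates are unaffected, but the p-values and confidence intervals should be read with some caution.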
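
The decision rule for the Shapiro-Wilk test is easy to get backwards: H0 states that the sample is normal, so a small p-value is evidence against normality. The sketch below illustrates this on two deterministic stand-in samples (`bell_shaped` and `skewed` are hypothetical and are not the model residuals).

```r
alpha <- 0.05

# Deterministic stand-ins for residuals (hypothetical, for illustration only)
bell_shaped <- qnorm(ppoints(100))  # evenly spaced normal quantiles
skewed      <- qexp(ppoints(100))   # evenly spaced exponential quantiles

# H0 of Shapiro-Wilk: the sample comes from a normal distribution
p_bell   <- shapiro.test(bell_shaped)$p.value
p_skewed <- shapiro.test(skewed)$p.value

# Reject normality when the p-value falls below alpha
print(c(bell_shaped = p_bell < alpha, skewed = p_skewed < alpha))
```

A normal Q-Q plot of the residuals, for example with `qqnorm()` followed by `qqline()`, is a useful visual companion to the test.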

Heteroscedasticity

-. Model_RDSpend

bg_color <- "#C4D5C5"
par(bg = bg_color)

# Base graphics calls are separate statements; they are not chained with `+`
plot(startups$Profit, model_RDspend$residuals,
     main = "Residuals vs Profit",
     xlab = "Profit",
     ylab = "Residuals",
     pch = 19, col = "blue",
     panel.first = grid())
abline(h = 0, col = "red")

library(lmtest)
bptest(model_RDspend)

#> 
#>  studentized Breusch-Pagan test
#> 
#> data:  model_RDspend
#> BP = 2.4925, df = 1, p-value = 0.1144

-. Model_Backward

bg_color <- "#C4D5C5"
par(bg = bg_color)

# Base graphics calls are separate statements; they are not chained with `+`
plot(startups$Profit, model_backward$residuals,
     main = "Residuals vs Profit",
     xlab = "Profit",
     ylab = "Residuals",
     pch = 19, col = "blue",
     panel.first = grid())
abline(h = 0, col = "red")

Breusch-Pagan test using bptest() from the lmtest package:

library(lmtest)
bptest(model_backward)

#> 
#>  studentized Breusch-Pagan test
#> 
#> data:  model_backward
#> BP = 2.8431, df = 2, p-value = 0.2413

In both models the p-value is above 0.05 (0.1144 and 0.2413), so we fail to reject H0 of constant error variance. The Breusch-Pagan test therefore finds no evidence of heteroscedasticity in either model.
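
The idea behind the studentized Breusch-Pagan test can be sketched in a few lines: regress the squared residuals on the predictors and compare n times the R-squared of that auxiliary regression with a chi-squared distribution. The example below uses the built-in mtcars data as a stand-in (an assumption for illustration; it is not the startups data) and needs only base R.

```r
# Built-in mtcars data stands in for the startups data (illustration only)
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Auxiliary regression: squared residuals on the same predictors
sq_res <- residuals(fit)^2
aux    <- lm(sq_res ~ wt + hp, data = mtcars)

n       <- nrow(mtcars)
bp_stat <- n * summary(aux)$r.squared   # studentized (Koenker) statistic
bp_df   <- length(coef(aux)) - 1        # number of predictors
bp_pval <- pchisq(bp_stat, df = bp_df, lower.tail = FALSE)

print(c(BP = bp_stat, df = bp_df, p.value = bp_pval))
```

A large p-value means the predictors explain little of the variation in the squared residuals, which is what constant error variance would look like.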

Variance Inflation Factor (Multicollinearity)

# VIF of the backward model
library(car)
vif(model_backward)

#>       R.D.Spend Marketing.Spend 
#>        2.103206        2.103206

Both VIF values (2.10) are well below the common threshold of 10, so multicollinearity is not a concern, even though R.D.Spend and Marketing.Spend are moderately correlated with each other (0.72).
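
The VIF of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on the other predictors. With exactly two predictors this R² is just their squared correlation, so the value above can be checked from the correlation matrix. A minimal worked example:

```r
# Correlation between R.D.Spend and Marketing.Spend, from the correlation matrix
r <- 0.7242481

# With exactly two predictors, the R-squared of one regressed on the other is r^2,
# so both predictors share the same VIF
vif_two_predictors <- 1 / (1 - r^2)
round(vif_two_predictors, 4)
```

This matches the 2.103206 reported by vif(model_backward) and explains why the two values are identical.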

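Based on these checks, both models pass the Breusch-Pagan test and model_backward shows no problematic multicollinearity, but neither model passes the Shapiro-Wilk normality test. They therefore meet some, but not all, of the usual criteria for a good linear regression model.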

6. CONCLUSION

The project involves analyzing the 50_Startups dataset to understand the impact of various expenditures and the location of a startup on its profitability.

A correlation matrix was visualized to understand relationships between variables. R&D Spend showed the strongest positive correlation with Profit.
Simple Linear Regression: A model using R&D Spend as the predictor was developed, showing a significant positive influence on Profit with an R-squared value of 0.9465.
Multiple Linear Regression: A model using all predictors was developed, with an adjusted R-squared of 0.9452. R&D Spend was the only significant predictor.
Model Evaluation: R-squared, RMSE, and MAPE values were compared for both models. The multiple linear regression model had slightly lower RMSE and MAPE, with an almost identical adjusted R-squared.
Stepwise Regression: Backward elimination was performed, resulting in a model with R&D Spend and Marketing Spend as predictors. This model had a higher adjusted R-squared of 0.9483.
Normality of residuals was checked using histograms and Shapiro-Wilk tests, which rejected normality for both models (p ≈ 0.01).
Heteroscedasticity was tested using plots and the Breusch-Pagan test, which found no evidence of non-constant variance.
Multicollinearity was checked using Variance Inflation Factor (VIF), with no multicollinearity found among predictors.

Both the simple and multiple linear regression models developed in this analysis are effective in predicting the profitability of startups based on their expenditures. The key findings include:

R&D Spend: This variable is the most significant predictor of profit, with higher R&D expenditure associated with higher profit.
Marketing Spend: When included alongside R&D Spend, this variable has a small positive coefficient that is only marginally significant (p = 0.06).
Model Performance: The stepwise regression model, which includes both R&D Spend and Marketing Spend, provides the best fit with an adjusted R-squared of 0.9483, indicating a high level of explained variance.

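The analysis indicates that R&D expenditure and, to a lesser extent, Marketing expenditure are strongly associated with startup profit. Because the data are observational, this does not by itself show that raising these budgets causes higher profit. The models pass the heteroscedasticity and multicollinearity checks, but the residuals fail the Shapiro-Wilk normality test, so the reported significance levels and confidence intervals should be treated as approximate.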

by Eva Marudur

eva.marudur@gmail.com

LBB-Regession-Model (Startups)

Eva Marudur

2024-07-16