The ‘50_Startups’ dataset can be used to analyze how different types of expenditure and the location of a startup (state) influence its profitability. By examining the relationships between these variables, we can gain insight into which factors most strongly affect the success of startups.
Purpose: Regression Analysis: Building models to predict profit based on R&D Spend, Administration, Marketing Spend, and State.
Begin the steps by importing the necessary libraries required for data manipulation, visualization, and analysis. Common libraries include lubridate for date and time data manipulation, tidyr for data wrangling, ggplot2 for static data visualization, GGally for simplifying complex plots, tidyverse for comprehensive data science tools, scales for formatting, glue for string interpolation, ggrepel for avoiding overlapping text labels, lmtest for regression diagnostics, car for regression diagnostics and statistical analysis, caret for predictive modeling, Metrics for evaluating model performance, and MLmetrics for additional machine learning metrics.
# Load necessary libraries
library(lubridate) # for Date and time data manipulation
library(tidyr) # for data wrangling
library(ggplot2) # Static data visualization
library(GGally)
library(tidyverse)
library(scales) # for comma formatting
library(glue) # String interpolation
library(ggrepel)
library(lmtest)
library(car)
library(caret)
library(Metrics)
library(MLmetrics)
Load the dataset into R, then check its structure to understand its components and data types.
#> 'data.frame': 50 obs. of 5 variables:
#> $ R.D.Spend : num 165349 162598 153442 144372 142107 ...
#> $ Administration : num 136898 151378 101146 118672 91392 ...
#> $ Marketing.Spend: num 471784 443899 407935 383200 366168 ...
#> $ State : chr "New York" "California" "Florida" "New York" ...
#> $ Profit : num 192262 191792 191050 182902 166188 ...
The dataset 50_Startups.csv contains information about 50 startups, with each observation representing a unique startup. The dataset includes 5 variables:
1. R.D.Spend: Represents the amount of money spent on Research and Development (R&D) by the startup.
2. Administration: Represents the amount of money spent on administrative expenses by the startup.
3. Marketing.Spend: Represents the amount of money spent on marketing activities by the startup.
4. State: Indicates the state where the startup is located.
5. Profit: Represents the profit made by the startup.
We convert the ‘State’ column to a factor because it is a categorical variable. Then, check for missing values and remove any rows with missing values to ensure the dataset is clean.
#> [1] FALSE
#>
#> FALSE
#> 250
There are no missing values.
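The code for this step is not shown above. A minimal sketch of how it might look follows, assuming base R only. The data frame `toy` is a stand-in built from a few rounded rows of the `str()` output, with one `NA` injected on purpose so the checks have something to find; it is not the real dataset.

```r
# Toy stand-in for the real data (a few rounded rows, one NA injected on purpose).
# In the actual analysis the data would come from 50_Startups.csv.
toy <- data.frame(
  R.D.Spend = c(165349, 162598, NA, 144372),
  State     = c("New York", "California", "Florida", "New York"),
  Profit    = c(192262, 191792, 191050, 182902),
  stringsAsFactors = FALSE
)

# Convert the categorical column to a factor
toy$State <- as.factor(toy$State)

# Check for missing values: one overall flag, then a count per column
any_missing <- anyNA(toy)
missing_per_column <- colSums(is.na(toy))

# Drop incomplete rows (a no-op when nothing is missing)
toy_clean <- na.omit(toy)

print(any_missing)
print(missing_per_column)
print(nrow(toy_clean))
```

On the real data the overall flag is `FALSE`, so `na.omit()` would leave all 50 rows in place.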
EDA helps us understand the relationships between variables. Visualize the correlation matrix to see how variables are related and use boxplots to check for outliers.
# Check the correlation matrix
ggcorr(startups, hjust = 1, layout.exp = 3, label = TRUE) +
  ggtitle("Correlation Matrix of Startups Data") +
  theme(panel.background = element_rect(fill = "#C4D5C5", color = "#C4D5C5"),
        plot.background = element_rect(fill = "#C4D5C5", color = "#C4D5C5"),
        legend.background = element_rect(fill = "#C4D5C5", color = "#C4D5C5"),
        legend.key = element_rect(fill = "#C4D5C5", color = "#C4D5C5"),
        plot.title = element_text(hjust = 0.5, size = 14, face = "bold", vjust = 1))
In the correlation plot, all numeric variables are positively correlated with Profit, with R.D.Spend showing the strongest positive correlation (about 0.97) compared to the other factors.
Here are the distributions of each variable’s values.
None of the variables contains outliers.
#> R.D.Spend Administration Marketing.Spend Profit
#> R.D.Spend 1.0000000 0.24195525 0.72424813 0.9729005
#> Administration 0.2419552 1.00000000 -0.03215388 0.2007166
#> Marketing.Spend 0.7242481 -0.03215388 1.00000000 0.7477657
#> Profit 0.9729005 0.20071657 0.74776572 1.0000000
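The boxplot code behind the outlier check is not shown. One way to run the same check numerically is to use the 1.5 × IQR whisker rule that boxplots apply, sketched here with base R's `boxplot.stats()`. The helper name `find_outliers` and the toy values are made up for illustration.

```r
# Hypothetical helper: values falling outside the boxplot whiskers (1.5 * IQR rule)
find_outliers <- function(x) boxplot.stats(x)$out

# Toy vector with one obvious outlier (made-up numbers)
toy_spend <- c(10, 12, 11, 13, 12, 100)
print(find_outliers(toy_spend))

# Applied to every numeric column of a toy data frame
toy <- data.frame(a = c(1, 2, 3, 4, 5), b = c(2, 2, 3, 3, 50))
outliers_by_column <- lapply(toy[sapply(toy, is.numeric)], find_outliers)
print(outliers_by_column)
```

A column with no outliers returns an empty vector, which is what the statement above implies for every numeric column of the startup data.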
Simple linear regression is a model with one predictor variable. In this case, use R&D Spend because it has the strongest correlation.
# Plotting with ggplot2
ggplot(data = startups, aes(x = R.D.Spend, y = Profit)) +
  geom_point(aes(size = R.D.Spend, color = R.D.Spend)) + # Map size and color to R.D.Spend
  scale_size_continuous(range = c(1, 6), guide = "none") +
  scale_color_viridis_c(option = "A", direction = -1, labels = scales::number_format()) +
  labs(title = "Correlation R.D.Spend and Profit",
       x = "R.D.Spend",
       y = "Profit",
       color = "R.D.Spend") + # Add color legend label
  theme_minimal(base_size = 12) +
  theme(
    plot.title = element_text(face = "bold", size = 12, hjust = 0.5, color = "#333333"),
    panel.background = element_rect(fill = "lightgrey"),
    plot.background = element_rect(fill = "#C4D5C5"),
    panel.grid.major = element_line(colour = "grey"),
    axis.line = element_line(color = "grey"),
    axis.text = element_text(size = 10, colour = "black"),
    legend.position = "right" # Keep the color legend on the right
  )
A linear regression model can be developed using R.D.Spend as the predictor variable, since it shows the strongest positive correlation with the target variable Profit.
# build the model
# lm(target ~ predictor, data)
model_RDspend <- lm(Profit ~ R.D.Spend, startups)
summary(model_RDspend)
#>
#> Call:
#> lm(formula = Profit ~ R.D.Spend, data = startups)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -34351 -4626 -375 6249 17188
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 49032.89914 2537.89695 19.32 <0.0000000000000002 ***
#> R.D.Spend 0.85429 0.02931 29.15 <0.0000000000000002 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 9416 on 48 degrees of freedom
#> Multiple R-squared: 0.9465, Adjusted R-squared: 0.9454
#> F-statistic: 849.8 on 1 and 48 DF, p-value: < 0.00000000000000022
Model Interpretation:
-. Each 1-unit increase in R.D.Spend increases the predicted Profit by 0.85429.
-. From the significance: R.D.Spend is an influential predictor (its p-value is far below 0.05).
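The fitted line can be written out by hand from the coefficients in the summary. The sketch below copies the two estimates from the output; the helper name `predict_profit` and the input of 100,000 are arbitrary choices for illustration.

```r
# Coefficients copied from the summary(model_RDspend) output
intercept <- 49032.89914
slope_rd  <- 0.85429

# Hypothetical helper: predicted Profit for a given R&D spend
predict_profit <- function(rd_spend) intercept + slope_rd * rd_spend

# Predicted Profit at an arbitrary R&D spend of 100,000
profit_at_100k <- predict_profit(100000)
print(profit_at_100k)

# A 1-unit increase in R.D.Spend changes the prediction by exactly the slope
slope_check <- predict_profit(100001) - predict_profit(100000)
print(slope_check)
```

With the real model object, `predict(model_RDspend, data.frame(R.D.Spend = 100000))` should give the same figure up to rounding of the printed coefficients.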
#>
#> Call:
#> lm(formula = Profit ~ ., data = startups)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -33504 -4736 90 6672 17338
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 50125.34383 6884.81973 7.281 0.00000000444 ***
#> R.D.Spend 0.80602 0.04641 17.369 < 0.0000000000000002 ***
#> Administration -0.02700 0.05223 -0.517 0.608
#> Marketing.Spend 0.02698 0.01714 1.574 0.123
#> StateFlorida 198.78879 3371.00712 0.059 0.953
#> StateNew York -41.88702 3256.03913 -0.013 0.990
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 9439 on 44 degrees of freedom
#> Multiple R-squared: 0.9508, Adjusted R-squared: 0.9452
#> F-statistic: 169.9 on 5 and 44 DF, p-value: < 0.00000000000000022
Interpretation:
-. Significance: R.D.Spend is the only significant predictor.
-. Adjusted R-squared: 0.9452, which is close to 1.
Compare the R-squared of model_RDspend with the adjusted R-squared of model_all (adjusted, because model_all has more than one predictor). To view detailed information about a model, use the function summary(model_name).
#> [1] 0.9465353
#> [1] 0.9451562
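The two numbers above are not the same statistic: the first is the multiple R-squared of model_RDspend, the second the adjusted R-squared of model_all. The adjustment can be reproduced by hand, as in this sketch; `adj_r2` is a hypothetical helper, and the inputs are copied from the summaries (the R-squared of model_all is only available rounded to four decimals).

```r
# Hypothetical helper: adjusted R-squared from R-squared,
# n observations and p estimated slopes (intercept excluded)
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

# model_RDspend: R-squared 0.9465353, 50 startups, 1 slope
adj_rd <- adj_r2(0.9465353, n = 50, p = 1)

# model_all: R-squared 0.9508 (rounded in the summary), 5 slopes
adj_all <- adj_r2(0.9508, n = 50, p = 5)

print(round(c(model_RDspend = adj_rd, model_all = adj_all), 4))
```

On a like-for-like basis the adjusted values are very close (about 0.9454 and 0.9452), so the extra predictors in model_all buy almost nothing once the penalty for model size is applied.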
Compare the RMSE and MAPE of model_RDspend with model_all:
# create the prediction objects
# predictions from the single-predictor model
predict_RDspend <- predict(model_RDspend, startups)
# predictions from the all-predictor model
predict_all <- predict(model_all, startups)
library(caret)
# compute RMSE (use predict_RDspend and predict_all as y_pred)
RMSE(predict_RDspend, startups$Profit)
#> [1] 9226.101
RMSE(predict_all, startups$Profit)
#> [1] 8854.761
# compute MAPE (use predict_RDspend and predict_all as y_pred)
MAPE(predict_RDspend, startups$Profit)
#> [1] 0.1107014
MAPE(predict_all, startups$Profit)
#> [1] 0.1060236
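On in-sample error, model_all is the better model: it has the lower RMSE (8854.761 vs. 9226.101) and the lower MAPE (0.106 vs. 0.111), even though R.D.Spend is its only significant predictor. Its adjusted R-squared (0.9452), however, is no higher than the R-squared of model_RDspend (0.9465), so the extra predictors add little.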
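For reference, both error metrics are simple to compute without a package. The sketch below uses hypothetical helper names (`rmse_manual`, `mape_manual`) and made-up toy values, then cross-checks the RMSE of model_all against the residual sum of squares printed in the stepwise output (RSS = 3920339644, n = 50).

```r
# Hypothetical base-R versions of the two metrics
rmse_manual <- function(pred, actual) sqrt(mean((actual - pred)^2))
mape_manual <- function(pred, actual) mean(abs((actual - pred) / actual))

# Toy values (made up) to show the calculation
actual <- c(100, 200, 300)
pred   <- c(110, 190, 330)
toy_rmse <- rmse_manual(pred, actual)
toy_mape <- mape_manual(pred, actual)
print(toy_rmse)
print(toy_mape)

# Cross-check: RMSE of model_all from its RSS and the 50 observations
rmse_all_from_rss <- sqrt(3920339644 / 50)
print(rmse_all_from_rss)
```

The cross-check should land on the 8854.761 reported above, since in-sample RMSE is just the square root of RSS divided by n.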
# stepwise regression: backward elimination
model_backward <- step(object = model_all,
                       direction = "backward")
#> Start: AIC=920.87
#> Profit ~ R.D.Spend + Administration + Marketing.Spend + State
#>
#> Df Sum of Sq RSS AIC
#> - State 2 516657 3920856301 916.88
#> - Administration 1 23816156 3944155801 919.17
#> <none> 3920339644 920.87
#> - Marketing.Spend 1 220708706 4141048350 921.61
#> - R.D.Spend 1 26878168212 30798507857 1021.94
#>
#> Step: AIC=916.88
#> Profit ~ R.D.Spend + Administration + Marketing.Spend
#>
#> Df Sum of Sq RSS AIC
#> - Administration 1 23538549 3944394850 915.18
#> <none> 3920856301 916.88
#> - Marketing.Spend 1 233485362 4154341663 917.77
#> - R.D.Spend 1 27147076244 31067932545 1018.37
#>
#> Step: AIC=915.18
#> Profit ~ R.D.Spend + Marketing.Spend
#>
#> Df Sum of Sq RSS AIC
#> <none> 3944394850 915.18
#> - Marketing.Spend 1 311651716 4256046566 916.98
#> - R.D.Spend 1 31149105710 35093500560 1022.46
#>
#> Call:
#> lm(formula = Profit ~ R.D.Spend + Marketing.Spend, data = startups)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -33645 -4632 -414 6484 17097
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 46975.86422 2689.93292 17.464 <0.0000000000000002 ***
#> R.D.Spend 0.79658 0.04135 19.266 <0.0000000000000002 ***
#> Marketing.Spend 0.02991 0.01552 1.927 0.06 .
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 9161 on 47 degrees of freedom
#> Multiple R-squared: 0.9505, Adjusted R-squared: 0.9483
#> F-statistic: 450.8 on 2 and 47 DF, p-value: < 0.00000000000000022
Create a new model based on the backward-elimination result:
#>
#> Call:
#> lm(formula = Profit ~ R.D.Spend + Marketing.Spend, data = startups)
#>
#> Coefficients:
#> (Intercept) R.D.Spend Marketing.Spend
#> 46975.86422 0.79658 0.02991
Candidate model
The stepwise regression method arrives at a formula by removing predictors until the AIC stops decreasing. A lower AIC indicates a better trade-off between goodness of fit and the number of parameters, that is, less information lost by the model.
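The AIC values in the stepwise trace can be reproduced from the RSS column. For linear models, `step()` reports n * log(RSS / n) + 2 * k, where k is the number of estimated coefficients. The sketch below plugs in the RSS values printed above; `step_aic` is a hypothetical helper name.

```r
# AIC as reported by step() for lm models (defined up to an additive constant):
# n * log(RSS / n) + 2 * k, with k = number of estimated coefficients
step_aic <- function(rss, n, k) n * log(rss / n) + 2 * k

n <- 50  # number of startups

# RSS values copied from the <none> rows of the stepwise output
aic_full     <- step_aic(3920339644, n, k = 6)  # all predictors, State as 2 dummies
aic_no_state <- step_aic(3920856301, n, k = 4)  # State removed
aic_final    <- step_aic(3944394850, n, k = 3)  # R.D.Spend + Marketing.Spend

print(round(c(full = aic_full, no_state = aic_no_state, final = aic_final), 2))
```

The RSS barely rises as State and Administration are dropped, while the penalty term 2 * k falls, which is why the AIC goes down at each step.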
Compared to the initial model that only uses R.D.Spend (adjusted R-squared of 0.9454), the regression model using the predictor variables R.D.Spend and Marketing.Spend (model_backward) has an adjusted R-squared of 0.9483, which is slightly higher.
Model & Error Prediction
# predict with lower and upper bounds
pred_model_step_interval <- predict(object = model_backward,
                                    newdata = startups,
                                    interval = "confidence",
                                    level = 0.95)
head(pred_model_step_interval)
#> fit lwr upr
#> 1 192800.5 186375.1 199225.8
#> 2 189774.7 183737.1 195812.2
#> 3 181405.4 175973.1 186837.7
#> 4 173441.3 168494.9 178387.7
#> 5 171127.6 166362.9 175892.3
#> 6 162879.3 158469.1 167289.5
pred_model_RDspend <- predict(object = model_RDspend,
                              newdata = startups,
                              interval = "confidence",
                              level = 0.95)
head(pred_model_RDspend)
#> fit lwr upr
#> 1 190289.3 184262.9 196315.7
#> 2 187938.7 182057.1 193820.3
#> 3 180116.7 174709.8 185523.5
#> 4 172369.0 167419.3 177318.7
#> 5 170434.0 165596.0 175271.9
#> 6 161694.2 157345.5 166042.9
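The intervals above use `interval = "confidence"`, which bounds the mean Profit at given predictor values. For the Profit of an individual startup, `interval = "prediction"` is the usual choice and is always wider. The sketch below shows the difference on simulated data; `toy`, `toy_model` and the chosen coefficients are made up and unrelated to the startup data.

```r
set.seed(42)  # simulated data, not the startup dataset
toy <- data.frame(x = runif(30, 0, 100))
toy$y <- 50 + 0.8 * toy$x + rnorm(30, sd = 10)

toy_model <- lm(y ~ x, data = toy)
new_points <- data.frame(x = c(20, 50, 80))

# Interval for the mean response vs. interval for a single new observation
conf_int <- predict(toy_model, newdata = new_points,
                    interval = "confidence", level = 0.95)
pred_int <- predict(toy_model, newdata = new_points,
                    interval = "prediction", level = 0.95)

conf_width <- conf_int[, "upr"] - conf_int[, "lwr"]
pred_width <- pred_int[, "upr"] - pred_int[, "lwr"]
print(cbind(conf_width, pred_width))
```

Both calls return the same `fit` column; only the `lwr` and `upr` bounds differ.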
Check Normality of the model:
-. Model_RDSpend
bg_color <- "#C4D5C5"
par(bg = bg_color)
hist(model_RDspend$residuals, breaks = 20, col = "blue", main = "Histogram of Residuals", xlab = "Residuals")
#>
#> Shapiro-Wilk normality test
#>
#> data: model_RDspend$residuals
#> W = 0.93708, p-value = 0.01034
-. Model_Backward
bg_color <- "#C4D5C5"
par(bg = bg_color)
hist(model_backward$residuals, breaks = 20, col = "skyblue", main = "Histogram of Residuals", xlab = "Residuals")
#>
#> Shapiro-Wilk normality test
#>
#> data: model_backward$residuals
#> W = 0.93717, p-value = 0.01042
For both models the p-value is about 0.010, which is below 0.05, so we reject the null hypothesis (H0) that the residuals are normally distributed. The Shapiro-Wilk test therefore indicates that the residuals of both models depart from normality, so the normality assumption is not met.
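The decision rule for the Shapiro-Wilk test can be sketched as follows. The test itself runs on simulated data here (the objects `toy` and `toy_model` are made up), and the last lines apply the same rule to the two p-values reported in the output above.

```r
set.seed(1)  # simulated data, not the startup dataset
toy <- data.frame(x = runif(50, 0, 100))
toy$y <- 10 + 2 * toy$x + rnorm(50, sd = 5)
toy_model <- lm(y ~ x, data = toy)

# H0: the residuals come from a normal distribution
sw <- shapiro.test(residuals(toy_model))
alpha <- 0.05
decision <- if (sw$p.value < alpha) {
  "reject H0: residuals are not normal"
} else {
  "fail to reject H0: no evidence against normality"
}
print(sw)
print(decision)

# The same rule applied to the p-values reported for the two startup models
reported_p <- c(model_RDspend = 0.01034, model_backward = 0.01042)
print(reported_p < alpha)
```

A small p-value is evidence against normality, so values below the threshold lead to rejecting H0 rather than accepting it.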
Heteroscedasticity
-. Model_RDspend
library(lmtest)
bg_color <- "#C4D5C5"
par(bg = bg_color)
plot(startups$Profit, model_RDspend$residuals,
     main = "Residuals vs Profit",
     xlab = "Profit",
     ylab = "Residuals",
     pch = 19, col = "blue",
     panel.first = grid())
# base graphics calls are separate statements, not chained with "+"
abline(h = 0, col = "red")
#>
#> studentized Breusch-Pagan test
#>
#> data: model_RDspend
#> BP = 2.4925, df = 1, p-value = 0.1144
-. Model_Backward
bg_color <- "#C4D5C5"
par(bg = bg_color)
library(lmtest)
plot(startups$Profit, model_backward$residuals,
     main = "Residuals vs Profit",
     xlab = "Profit",
     ylab = "Residuals",
     pch = 19, col = "blue",
     panel.first = grid())
# base graphics calls are separate statements, not chained with "+"
abline(h = 0, col = "red")
bptest() from package lmtest
#>
#> studentized Breusch-Pagan test
#>
#> data: model_backward
#> BP = 2.8431, df = 2, p-value = 0.2413
In both models the p-value is greater than 0.05, so we fail to reject H0. This means there is no evidence of heteroscedasticity: the residual variance shows no systematic pattern, so the existing patterns have been captured by the models.
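The studentized Breusch-Pagan statistic can also be computed without a package: regress the squared residuals on the model's predictors and multiply the R-squared of that auxiliary regression by n. The sketch below does this in base R on simulated data; `bp_studentized` is a hypothetical helper, not part of lmtest. The last lines recover the p-values reported above from the printed statistics.

```r
# Hypothetical helper: studentized Breusch-Pagan test using base R only
bp_studentized <- function(model) {
  e2  <- residuals(model)^2          # squared residuals
  X   <- model.matrix(model)         # intercept + predictors
  aux <- lm.fit(X, e2)               # auxiliary regression of e^2 on X
  r2  <- 1 - sum(aux$residuals^2) / sum((e2 - mean(e2))^2)
  stat <- length(e2) * r2            # BP = n * R-squared
  df   <- ncol(X) - 1                # number of predictors
  c(BP = stat, df = df, p.value = pchisq(stat, df, lower.tail = FALSE))
}

set.seed(7)  # simulated data, not the startup dataset
toy <- data.frame(x = runif(50, 0, 100))
toy$y <- 10 + 2 * toy$x + rnorm(50, sd = 5)
toy_bp <- bp_studentized(lm(y ~ x, data = toy))
print(toy_bp)

# p-values recovered from the statistics printed above
p_rdspend  <- pchisq(2.4925, df = 1, lower.tail = FALSE)
p_backward <- pchisq(2.8431, df = 2, lower.tail = FALSE)
print(round(c(p_rdspend, p_backward), 4))
```

H0 is constant residual variance, so a p-value above 0.05 means the test found no evidence of heteroscedasticity.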
Variance Inflation Factor (Multicollinearity)
#> R.D.Spend Marketing.Spend
#> 2.103206 2.103206
No VIF value is equal to or greater than 10, so no serious multicollinearity was found among the predictors of model_backward. The two predictors are still correlated with each other, but not strongly enough to distort the coefficient estimates.
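With only two predictors, the VIF follows directly from their pairwise correlation: VIF = 1 / (1 - r^2). The short check below uses the correlation between R.D.Spend and Marketing.Spend from the correlation matrix shown earlier; the variable names are illustrative.

```r
# Correlation between R.D.Spend and Marketing.Spend (from the correlation matrix)
r_rd_marketing <- 0.72424813

# For a two-predictor model both predictors share the same VIF
vif_two_predictors <- 1 / (1 - r_rd_marketing^2)
print(vif_two_predictors)
```

This reproduces the 2.103206 reported for both variables, which is why the two VIF values are identical.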
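Based on the analysis results, both models show no evidence of heteroscedasticity and model_backward shows no serious multicollinearity. Neither model passes the Shapiro-Wilk normality test, however, so the criteria for a good linear regression model are only partly met.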
The project involves analyzing the 50_Startups dataset to understand the impact of various expenditures and the location of a startup on its profitability.
Both the simple and multiple linear regression models developed in this analysis predict startup profit well from expenditures, with an adjusted R-squared of about 0.95. The key findings are that R&D expenditure is the dominant driver of profit, while Marketing expenditure adds a smaller, marginally significant contribution (p = 0.06), and Administration and State add no measurable improvement.
The heteroscedasticity and multicollinearity checks support the models, but the non-normal residuals mean that inference drawn from them should be treated with some caution.
by Eva Marudur