In this analysis, we use a dataset with information about 50 startup companies in the US. For each company, the dataset records R&D expenses, administration costs, marketing expenses, and location, along with the company's profit.
The purpose of this analysis is to build a model that predicts company profit from the other variables in the dataset. Such a model can help estimate the profit of a new startup from its characteristics, such as operational expenses and geographic location.
Variables in the Dataset:
- R&D Spend: The expenditure allocated for research and development of products or services.
- Administration: Administrative expenses associated with the general management of the company.
- Marketing Spend: Expenditure designated for marketing activities and promotion of products or services.
- State: Geographic location of the company, in this case a US state.
- Profit: The amount of profit earned by the company.
Target Variable: Profit
Predictors: R.D.Spend, Administration, Marketing.Spend, State
library(dplyr) # data wrangling
## Warning: package 'dplyr' was built under R version 4.3.2
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(GGally) # view correlations between variables
## Warning: package 'GGally' was built under R version 4.3.2
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 4.3.2
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(MLmetrics) # model evaluation
## Warning: package 'MLmetrics' was built under R version 4.3.2
##
## Attaching package: 'MLmetrics'
## The following object is masked from 'package:base':
##
## Recall
# assumption tests
library(car)
## Warning: package 'car' was built under R version 4.3.2
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
library(lmtest)
## Warning: package 'lmtest' was built under R version 4.3.2
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
# EDA
library(inspectdf)
## Warning: package 'inspectdf' was built under R version 4.3.2
library(ggplot2)
# Machine learning
library(performance)
## Warning: package 'performance' was built under R version 4.3.2
data <- read.csv("startup.csv")
head(data)
## R.D.Spend Administration Marketing.Spend State Profit
## 1 165349.2 136897.80 471784.1 New York 192261.8
## 2 162597.7 151377.59 443898.5 California 191792.1
## 3 153441.5 101145.55 407934.5 Florida 191050.4
## 4 144372.4 118671.85 383199.6 New York 182902.0
## 5 142107.3 91391.77 366168.4 Florida 166187.9
## 6 131876.9 99814.71 362861.4 New York 156991.1
library(dplyr)
glimpse(data)
## Rows: 50
## Columns: 5
## $ R.D.Spend <dbl> 165349.20, 162597.70, 153441.51, 144372.41, 142107.34,…
## $ Administration <dbl> 136897.80, 151377.59, 101145.55, 118671.85, 91391.77, …
## $ Marketing.Spend <dbl> 471784.1, 443898.5, 407934.5, 383199.6, 366168.4, 3628…
## $ State <chr> "New York", "California", "Florida", "New York", "Flor…
## $ Profit <dbl> 192261.8, 191792.1, 191050.4, 182902.0, 166187.9, 1569…
Check Missing Values
anyNA(data)
## [1] FALSE
The result shows that there are no missing values.
Check the Correlation Between Variables
library(GGally)
ggcorr(data, label = TRUE, label_size = 2.9, hjust = 1, layout.exp = 2)
## Warning in ggcorr(data, label = TRUE, label_size = 2.9, hjust = 1, layout.exp
## = 2): data in column(s) 'State' are not numeric and were ignored
The result shows that R.D.Spend and Marketing.Spend are both strongly correlated with the target variable, Profit.
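The correlation plot can be complemented with the underlying numbers. The following is a minimal sketch, assuming base R only; the `toy` data frame is a made-up stand-in with the same column names as `data`, not the real startup data.

```r
# Toy stand-in for the startup data (all values are simulated for illustration).
set.seed(1)
toy <- data.frame(
  R.D.Spend       = runif(20, 0, 160000),
  Administration  = runif(20, 50000, 180000),
  Marketing.Spend = runif(20, 0, 450000),
  State           = sample(c("New York", "California", "Florida"), 20, replace = TRUE)
)
toy$Profit <- 50000 + 0.8 * toy$R.D.Spend + 0.03 * toy$Marketing.Spend +
  rnorm(20, sd = 9000)

# Keep the numeric columns only (State is character), then compute
# the Pearson correlation matrix.
num_cols <- sapply(toy, is.numeric)
cor_mat  <- cor(toy[, num_cols])

# Correlation of every numeric variable with Profit, strongest first.
profit_cor <- sort(cor_mat[, "Profit"], decreasing = TRUE)
round(profit_cor, 2)
```

Replacing `toy` with `data` would give the exact coefficients that `ggcorr()` draws as labels.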
Outlier Checking
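boxplot(data$Profit, data$R.D.Spend, data$Marketing.Spend, data$Administration,
        names = c("Profit", "R&D", "Marketing", "Admin"), horizontal = TRUE)
Judging from the boxplots, the data contain no pronounced outliers, so no observations are removed.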
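Because the four variables share one axis, borderline points can be hard to see in a combined boxplot. A numeric check using the same 1.5 × IQR rule that `boxplot()` applies might look like the sketch below; `find_outliers` is a hypothetical helper name and the `toy` values are made up.

```r
# Return the values of x that lie beyond the boxplot whiskers
# (more than 1.5 * IQR outside the hinges).
find_outliers <- function(x) boxplot.stats(x)$out

# Made-up example: column a has one extreme value, column b has none.
toy <- data.frame(
  a = c(10, 12, 11, 13, 12, 11, 50),
  b = c(1, 2, 3, 4, 5, 6, 7)
)

# One vector of flagged values per column.
outliers <- lapply(toy, find_outliers)
outliers
```

For the startup data, the same helper could be applied to the numeric columns with `lapply(data[sapply(data, is.numeric)], find_outliers)`.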
data_nox <- lm(formula = Profit ~ 1, data = data)
summary(data_nox)
##
## Call:
## lm(formula = Profit ~ 1, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -97331 -21874 -4034 27753 80249
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 112013 5700 19.65 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 40310 on 49 degrees of freedom
data1 <- lm(formula = Profit ~ R.D.Spend, data = data)
summary(data1)
##
## Call:
## lm(formula = Profit ~ R.D.Spend, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -34351 -4626 -375 6249 17188
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.903e+04 2.538e+03 19.32 <2e-16 ***
## R.D.Spend 8.543e-01 2.931e-02 29.15 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9416 on 48 degrees of freedom
## Multiple R-squared: 0.9465, Adjusted R-squared: 0.9454
## F-statistic: 849.8 on 1 and 48 DF, p-value: < 2.2e-16
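Interpretation: R.D.Spend has a statistically significant effect on Profit (p-value < 0.05), and this single-predictor model already explains about 94.7% of the variance in Profit (R-squared = 0.9465).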
dataall <- lm(formula = Profit ~ ., data = data)
summary(dataall)
##
## Call:
## lm(formula = Profit ~ ., data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -33504 -4736 90 6672 17338
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.013e+04 6.885e+03 7.281 4.44e-09 ***
## R.D.Spend 8.060e-01 4.641e-02 17.369 < 2e-16 ***
## Administration -2.700e-02 5.223e-02 -0.517 0.608
## Marketing.Spend 2.698e-02 1.714e-02 1.574 0.123
## StateFlorida 1.988e+02 3.371e+03 0.059 0.953
## StateNew York -4.189e+01 3.256e+03 -0.013 0.990
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9439 on 44 degrees of freedom
## Multiple R-squared: 0.9508, Adjusted R-squared: 0.9452
## F-statistic: 169.9 on 5 and 44 DF, p-value: < 2.2e-16
plot(data$R.D.Spend, data$Profit)
abline(data1$coefficients[1], data1$coefficients[2])
### Feature Selection (Backward Elimination)
dataall <- lm(Profit ~ ., data = data)
step(dataall, direction = "backward")
## Start: AIC=920.87
## Profit ~ R.D.Spend + Administration + Marketing.Spend + State
##
## Df Sum of Sq RSS AIC
## - State 2 5.1666e+05 3.9209e+09 916.88
## - Administration 1 2.3816e+07 3.9442e+09 919.17
## <none> 3.9203e+09 920.87
## - Marketing.Spend 1 2.2071e+08 4.1410e+09 921.61
## - R.D.Spend 1 2.6878e+10 3.0799e+10 1021.94
##
## Step: AIC=916.88
## Profit ~ R.D.Spend + Administration + Marketing.Spend
##
## Df Sum of Sq RSS AIC
## - Administration 1 2.3539e+07 3.9444e+09 915.18
## <none> 3.9209e+09 916.88
## - Marketing.Spend 1 2.3349e+08 4.1543e+09 917.77
## - R.D.Spend 1 2.7147e+10 3.1068e+10 1018.37
##
## Step: AIC=915.18
## Profit ~ R.D.Spend + Marketing.Spend
##
## Df Sum of Sq RSS AIC
## <none> 3.9444e+09 915.18
## - Marketing.Spend 1 3.1165e+08 4.2560e+09 916.98
## - R.D.Spend 1 3.1149e+10 3.5094e+10 1022.46
##
## Call:
## lm(formula = Profit ~ R.D.Spend + Marketing.Spend, data = data)
##
## Coefficients:
## (Intercept) R.D.Spend Marketing.Spend
## 4.698e+04 7.966e-01 2.991e-02
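Insight
The final linear regression model, which uses R.D.Spend and Marketing.Spend as predictors, has an AIC of 915.18, lower than that of the initial model (AIC = 920.87). The final model indicates that:
- Profit is estimated as 46980 + 0.7966 × R.D.Spend + 0.02991 × Marketing.Spend, using the coefficients reported by step().
- Dropping State and then Administration lowers the AIC at each step (920.87, then 916.88, then 915.18).
Therefore, R&D spend is by far the strongest predictor of company profit, and Marketing.Spend is kept because removing it would raise the AIC from 915.18 to 916.98. State and Administration do not improve the model and can be left out of the prediction model. Note that being retained by AIC is not the same as being statistically significant: in the full model, the p-value of Marketing.Spend is 0.123.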
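`step()` returns the selected model, so its result can be stored and reused for prediction rather than only printed. A minimal sketch on simulated data follows; `toy`, `full_model`, and `step_model` are illustrative names, and the simulated coefficients are assumptions chosen only to resemble the setting.

```r
# Simulated stand-in for the startup data (values are not real).
set.seed(42)
n <- 50
toy <- data.frame(
  R.D.Spend       = runif(n, 0, 160000),
  Administration  = runif(n, 50000, 180000),
  Marketing.Spend = runif(n, 0, 450000)
)
toy$Profit <- 50000 + 0.8 * toy$R.D.Spend + 0.03 * toy$Marketing.Spend +
  rnorm(n, sd = 9000)

# Fit the full model, then run backward elimination.
# trace = 0 suppresses the step-by-step table; the return value is an lm object.
full_model <- lm(Profit ~ ., data = toy)
step_model <- step(full_model, direction = "backward", trace = 0)

# Inspect which predictors survived and compare AIC on the scale step() uses.
formula(step_model)
extractAIC(full_model)[2]
extractAIC(step_model)[2]

# Predictions from the selected model.
pred_step <- predict(step_model, newdata = toy)
```

With the real data, `step_model <- step(dataall, direction = "backward")` would keep the reduced model available for `predict()` and the error metrics.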
Predict <- predict(dataall, newdata = data) # predictions from the full model (all predictors)
Predict
## 1 2 3 4 5 6 7 8
## 192390.57 189071.32 182276.19 173584.98 172277.13 163473.81 158099.29 160155.64
## 9 10 11 12 13 14 15 16
## 151634.74 154829.66 135664.64 135528.60 129282.92 127431.25 149694.38 146143.64
## 17 18 19 20 21 22 23 24
## 116854.07 130085.41 129149.73 115594.19 116570.73 117201.51 114833.31 110123.80
## 25 26 27 28 29 30 31 32
## 113294.37 102200.27 110765.30 114279.80 101818.59 101721.04 99629.01 97617.30
## 33 34 35 36 37 38 39 40
## 98988.24 98061.36 88974.70 90420.01 75423.09 89577.70 69606.52 83684.98
## 41 42 43 44 45 46 47 48
## 74762.75 74956.31 70575.99 60100.27 64585.15 47588.36 56272.99 46468.23
## 49 50
## 49123.07 48185.04
predict(object = dataall, newdata = data,
interval = "prediction",
level = 0.95) %>% head()
## fit lwr upr
## 1 192390.6 171809.6 212971.6
## 2 189071.3 168333.2 209809.4
## 3 182276.2 161942.9 202609.5
## 4 173585.0 153506.8 193663.2
## 5 172277.1 151902.6 192651.6
## 6 163473.8 143374.4 183573.3
Insight: Using the full model (dataall), the predicted profit for the first company is approximately 192390.6. The 95% prediction interval runs from 171809.6 to 212971.6, meaning that the actual profit of a company with these characteristics is expected to fall within this range about 95% of the time. The observed profit of this company, 192261.8, lies well inside the interval.
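The same `predict()` call can score a startup that is not in the training data, which is the stated purpose of the model. The sketch below uses simulated data and a hypothetical `new_startup` row; it also contrasts the prediction interval (for one company's profit) with the narrower confidence interval (for the mean profit at those inputs).

```r
# Simulated stand-in for the startup data (values are not real).
set.seed(7)
n <- 50
toy <- data.frame(
  R.D.Spend       = runif(n, 0, 160000),
  Marketing.Spend = runif(n, 0, 450000)
)
toy$Profit <- 50000 + 0.8 * toy$R.D.Spend + 0.03 * toy$Marketing.Spend +
  rnorm(n, sd = 9000)

model <- lm(Profit ~ R.D.Spend + Marketing.Spend, data = toy)

# Hypothetical new company; column names must match the model's predictors.
new_startup <- data.frame(R.D.Spend = 100000, Marketing.Spend = 250000)

# Interval for the profit of this single company.
pred_int <- predict(model, newdata = new_startup,
                    interval = "prediction", level = 0.95)

# Interval for the average profit of companies with these inputs.
conf_int <- predict(model, newdata = new_startup,
                    interval = "confidence", level = 0.95)

pred_int
conf_int
```

The prediction interval is always the wider of the two, because it adds the residual variance of an individual observation to the uncertainty in the fitted mean.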
library(MLmetrics)
MAE(y_true = data$Profit,
y_pred = Predict)
## [1] 6475.501
range(data$Profit)
## [1] 14681.4 192261.8
192261.8 - 14681.4
## [1] 177580.4
# RMSE
RMSE(y_true = data$Profit, # actual values
y_pred = Predict)
## [1] 8854.761
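To put these errors in context, they can be compared with the range of Profit (192261.8 - 14681.4 = 177580.4): the MAE of 6475.501 is about 3.6% of that range, and the RMSE of 8854.761 is about 5.0%. The sketch below shows how both metrics are computed, using hand-written `mae` and `rmse` helpers (illustrative names, not the MLmetrics functions) on made-up values.

```r
# Mean absolute error: average size of the prediction errors.
mae <- function(y_true, y_pred) mean(abs(y_true - y_pred))

# Root mean squared error: penalises large errors more heavily than MAE.
rmse <- function(y_true, y_pred) sqrt(mean((y_true - y_pred)^2))

# Made-up actual and predicted values.
y_true <- c(100, 200, 300, 400)
y_pred <- c(110, 190, 320, 380)

mae_val  <- mae(y_true, y_pred)
rmse_val <- rmse(y_true, y_pred)

# Express the MAE as a share of the range of the actual values.
mae_share <- mae_val / diff(range(y_true))

c(MAE = mae_val, RMSE = rmse_val, MAE_share_of_range = mae_share)
```

RMSE is never smaller than MAE on the same predictions; a large gap between the two suggests that a few observations have much larger errors than the rest.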
Conclusion: From this analysis, we can conclude that Research and Development (R&D) spending is the main driver of profit for the startups in this dataset, with Marketing spending making a smaller contribution that backward elimination nonetheless retains. The positive coefficients for both variables indicate that higher R&D and Marketing spending are associated with higher company profit.
R&D Spend: Each one-unit increase in R&D spending is associated with an increase in company profit of approximately 0.7966 units, holding Marketing spending constant. This suggests that efforts in research and development of products or services can contribute positively to a company’s financial performance.
Marketing Spend: Each one-unit increase in Marketing spending is associated with an increase in company profit of approximately 0.02991 units, holding R&D spending constant. This indicates that investment in marketing and promotion can also have a positive, though much smaller, association with profit, for example by attracting more customers or increasing sales.
Thus, efficiently managing R&D and Marketing costs can be crucial strategies for startups in the United States to enhance profitability and business success.