Background

This analysis uses a dataset describing 50 startup companies in the US. It records R&D spending, administration costs, marketing spending, and company location, all of which may affect profit, together with the profit itself.

Objective

The goal is to build a model that predicts company profit from the other variables in the dataset. Such a model can then estimate the profit of a new startup from its characteristics, such as its operating expenses and geographic location.

Variables in the Dataset

  • R&D Spend: The expenditure allocated for research and development of products or services.
  • Administration: Administrative expenses associated with the general management of the company.
  • Marketing Spend: Expenditure designated for marketing activities and promotion of products or services.
  • State: Geographic location of the company, here a US state.
  • Profit: The amount of profit earned by the company.

Target variable: Profit

Predictors: R.D.Spend, Administration, Marketing.Spend, State

library(dplyr) # data wrangling
## Warning: package 'dplyr' was built under R version 4.3.2
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(GGally) # view correlations between variables
## Warning: package 'GGally' was built under R version 4.3.2
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 4.3.2
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
library(MLmetrics) # model evaluation
## Warning: package 'MLmetrics' was built under R version 4.3.2
## 
## Attaching package: 'MLmetrics'
## The following object is masked from 'package:base':
## 
##     Recall
# assumption tests
library(car)
## Warning: package 'car' was built under R version 4.3.2
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
library(lmtest)
## Warning: package 'lmtest' was built under R version 4.3.2
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
# EDA
library(inspectdf)
## Warning: package 'inspectdf' was built under R version 4.3.2
library(ggplot2)

# Machine learning
library(performance)
## Warning: package 'performance' was built under R version 4.3.2

Load Dataset

data <- read.csv("startup.csv")
head(data)
##   R.D.Spend Administration Marketing.Spend      State   Profit
## 1  165349.2      136897.80        471784.1   New York 192261.8
## 2  162597.7      151377.59        443898.5 California 191792.1
## 3  153441.5      101145.55        407934.5    Florida 191050.4
## 4  144372.4      118671.85        383199.6   New York 182902.0
## 5  142107.3       91391.77        366168.4    Florida 166187.9
## 6  131876.9       99814.71        362861.4   New York 156991.1

Exploratory Data Analysis

library(dplyr)
glimpse(data)
## Rows: 50
## Columns: 5
## $ R.D.Spend       <dbl> 165349.20, 162597.70, 153441.51, 144372.41, 142107.34,…
## $ Administration  <dbl> 136897.80, 151377.59, 101145.55, 118671.85, 91391.77, …
## $ Marketing.Spend <dbl> 471784.1, 443898.5, 407934.5, 383199.6, 366168.4, 3628…
## $ State           <chr> "New York", "California", "Florida", "New York", "Flor…
## $ Profit          <dbl> 192261.8, 191792.1, 191050.4, 182902.0, 166187.9, 1569…

Check Missing Value

anyNA(data)
## [1] FALSE

The result is FALSE, so the dataset contains no missing values.
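anyNA() only answers yes or no for the whole data frame. A per-column count is a common complement, because it shows where gaps would be if any existed. The sketch below is illustrative only: `startups_head` is a hypothetical name, and it holds just the six rows printed by head(data) above rather than the full dataset.

```r
# First six rows as printed by head(data); `startups_head` is a hypothetical name
startups_head <- data.frame(
  R.D.Spend       = c(165349.2, 162597.7, 153441.5, 144372.4, 142107.3, 131876.9),
  Administration  = c(136897.80, 151377.59, 101145.55, 118671.85, 91391.77, 99814.71),
  Marketing.Spend = c(471784.1, 443898.5, 407934.5, 383199.6, 366168.4, 362861.4),
  State           = c("New York", "California", "Florida",
                      "New York", "Florida", "New York"),
  Profit          = c(192261.8, 191792.1, 191050.4, 182902.0, 166187.9, 156991.1)
)

# Count missing values in each column (all zeros means nothing is missing)
na_per_column <- colSums(is.na(startups_head))
print(na_per_column)
```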

Check the Correlation between Variables

library(GGally)
ggcorr(data, label = TRUE, label_size = 2.9, hjust = 1, layout.exp = 2)
## Warning in ggcorr(data, label = TRUE, label_size = 2.9, hjust = 1,
## layout.exp = 2): data in column(s) 'State' are not numeric and were ignored

The correlation plot shows that R.D.Spend and Marketing.Spend are both strongly correlated with the target variable, Profit.
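The plot can be backed up with the numeric correlation matrix. The following is a minimal base R sketch, assuming the same column names as the dataset. It uses only the six rows shown by head(data), stored under the hypothetical name `startups_head`, so the values it prints are not the ones in the plot.

```r
# First six rows as printed by head(data); `startups_head` is a hypothetical name
startups_head <- data.frame(
  R.D.Spend       = c(165349.2, 162597.7, 153441.5, 144372.4, 142107.3, 131876.9),
  Administration  = c(136897.80, 151377.59, 101145.55, 118671.85, 91391.77, 99814.71),
  Marketing.Spend = c(471784.1, 443898.5, 407934.5, 383199.6, 366168.4, 362861.4),
  State           = c("New York", "California", "Florida",
                      "New York", "Florida", "New York"),
  Profit          = c(192261.8, 191792.1, 191050.4, 182902.0, 166187.9, 156991.1)
)

# Keep numeric columns only; State is a character column and is dropped,
# which mirrors the warning that ggcorr() prints
numeric_cols <- startups_head[sapply(startups_head, is.numeric)]

# Pearson correlation of every numeric variable with Profit
cor_with_profit <- cor(numeric_cols)[, "Profit"]
print(round(cor_with_profit, 2))
```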

Outlier Checking

boxplot(data$Profit, data$R.D.Spend, data$Marketing.Spend, data$Administration, horizontal = T)

The boxplots show no points beyond the whiskers, so no outliers are flagged in these variables.
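Reading outliers off a plot can be confirmed numerically. boxplot.stats() applies the same rule that boxplot() uses to draw individual points and returns the flagged values in `$out`. The sketch below runs on made-up numbers (the vector `x` and the data frame `toy` are hypothetical), not on the startup data.

```r
# Made-up values for illustration only; 40 is deliberately extreme
x <- c(10, 12, 13, 14, 15, 16, 40)

# Values lying more than 1.5 times the hinge spread beyond the hinges
# are returned in $out
flagged <- boxplot.stats(x)$out
print(flagged)

# Applied column by column, this lists the flagged values per variable;
# an empty result means the boxplot would show no outlier points
toy <- data.frame(a = x, b = c(1, 2, 3, 4, 5, 6, 7))
flagged_by_column <- lapply(toy, function(col) boxplot.stats(col)$out)
print(flagged_by_column)
```

With the real data, the same lapply() call on the numeric columns of `data` would return empty vectors if the boxplots indeed show no outliers.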

Model without Predictor Variables

data_nox <- lm(formula = Profit ~ 1, data = data)
summary(data_nox)
## 
## Call:
## lm(formula = Profit ~ 1, data = data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -97331 -21874  -4034  27753  80249 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   112013       5700   19.65   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 40310 on 49 degrees of freedom
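A model with no predictors simply estimates the mean of the target, and the standard error of its intercept is the residual standard error divided by the square root of the sample size. The sketch below checks both statements. The first part uses only the six Profit values shown by head(data) (stored under the hypothetical name `profit_head`); the second part uses the rounded figures from the summary above.

```r
# Six Profit values as printed by head(data); `profit_head` is a hypothetical name
profit_head <- c(192261.8, 191792.1, 191050.4, 182902.0, 166187.9, 156991.1)

# The intercept of an intercept-only model equals the sample mean
null_fit  <- lm(profit_head ~ 1)
intercept <- unname(coef(null_fit)[1])
print(c(intercept = intercept, mean = mean(profit_head)))

# Reproduce the reported standard error and t value from the summary output
reported_sigma     <- 40310    # residual standard error
reported_intercept <- 112013   # estimated intercept, i.e. mean Profit
n                  <- 50       # number of companies

se_intercept <- reported_sigma / sqrt(n)
t_value      <- reported_intercept / se_intercept
print(c(se = se_intercept, t = t_value))
```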

Model with 1 predictor Variable

data1 <- lm(formula = Profit ~ R.D.Spend, data = data)
summary(data1)
## 
## Call:
## lm(formula = Profit ~ R.D.Spend, data = data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -34351  -4626   -375   6249  17188 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 4.903e+04  2.538e+03   19.32   <2e-16 ***
## R.D.Spend   8.543e-01  2.931e-02   29.15   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9416 on 48 degrees of freedom
## Multiple R-squared:  0.9465, Adjusted R-squared:  0.9454 
## F-statistic: 849.8 on 1 and 48 DF,  p-value: < 2.2e-16

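Interpretation: R.D.Spend has a statistically significant effect on Profit (p-value < 0.05), and this single predictor already explains about 95% of the variance in Profit (R-squared = 0.9465).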
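The summary figures are linked by simple arithmetic, which the sketch below retraces from the rounded coefficients reported above. The t value is the estimate divided by its standard error, and with a single predictor the F-statistic is the square of that t value. The helper `predict_profit` is a hypothetical name introduced only for this illustration.

```r
# Rounded estimates copied from the summary of the one-predictor model
b0    <- 4.903e+04   # intercept
b1    <- 8.543e-01   # slope for R.D.Spend
se_b1 <- 2.931e-02   # standard error of the slope

# t value = estimate / standard error (reported: 29.15)
t_b1 <- b1 / se_b1

# With one predictor, F = t^2 (reported: 849.8; small gap due to rounding)
f_stat <- t_b1^2

# Hypothetical helper: predicted Profit for a given R&D spend
predict_profit <- function(rd_spend) b0 + b1 * rd_spend

# Prediction for the first company in head(data), R.D.Spend = 165349.2
pred_first <- predict_profit(165349.2)
print(c(t = t_b1, F = f_stat, prediction = pred_first))
```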

Model with All Variables

dataall <- lm(formula = Profit ~ ., data = data)
summary(dataall)
## 
## Call:
## lm(formula = Profit ~ ., data = data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -33504  -4736     90   6672  17338 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      5.013e+04  6.885e+03   7.281 4.44e-09 ***
## R.D.Spend        8.060e-01  4.641e-02  17.369  < 2e-16 ***
## Administration  -2.700e-02  5.223e-02  -0.517    0.608    
## Marketing.Spend  2.698e-02  1.714e-02   1.574    0.123    
## StateFlorida     1.988e+02  3.371e+03   0.059    0.953    
## StateNew York   -4.189e+01  3.256e+03  -0.013    0.990    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9439 on 44 degrees of freedom
## Multiple R-squared:  0.9508, Adjusted R-squared:  0.9452 
## F-statistic: 169.9 on 5 and 44 DF,  p-value: < 2.2e-16
plot(data$R.D.Spend, data$Profit)
abline(data1$coefficients[1], data1$coefficients[2])
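The rows StateFlorida and StateNew York in the coefficient table come from dummy coding: lm() turns the State column into indicator columns and treats the first level alphabetically, California, as the baseline. The sketch below shows that coding with model.matrix(), using the six State values printed by head(data).

```r
# State values from the six rows shown by head(data)
State <- factor(c("New York", "California", "Florida",
                  "New York", "Florida", "New York"))

# Levels are sorted alphabetically, so California becomes the baseline
print(levels(State))

# Design matrix that lm() would build for this column
design <- model.matrix(~ State)
print(design)

# The California row (row 2) has 0 in both dummy columns, so the State
# coefficients measure the difference from California at equal spending
print(design[2, ])
```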

Feature Selection (Backward Elimination)

dataall <- lm(Profit ~ ., data = data)
step(dataall, direction = "backward")
## Start:  AIC=920.87
## Profit ~ R.D.Spend + Administration + Marketing.Spend + State
## 
##                   Df  Sum of Sq        RSS     AIC
## - State            2 5.1666e+05 3.9209e+09  916.88
## - Administration   1 2.3816e+07 3.9442e+09  919.17
## <none>                          3.9203e+09  920.87
## - Marketing.Spend  1 2.2071e+08 4.1410e+09  921.61
## - R.D.Spend        1 2.6878e+10 3.0799e+10 1021.94
## 
## Step:  AIC=916.88
## Profit ~ R.D.Spend + Administration + Marketing.Spend
## 
##                   Df  Sum of Sq        RSS     AIC
## - Administration   1 2.3539e+07 3.9444e+09  915.18
## <none>                          3.9209e+09  916.88
## - Marketing.Spend  1 2.3349e+08 4.1543e+09  917.77
## - R.D.Spend        1 2.7147e+10 3.1068e+10 1018.37
## 
## Step:  AIC=915.18
## Profit ~ R.D.Spend + Marketing.Spend
## 
##                   Df  Sum of Sq        RSS     AIC
## <none>                          3.9444e+09  915.18
## - Marketing.Spend  1 3.1165e+08 4.2560e+09  916.98
## - R.D.Spend        1 3.1149e+10 3.5094e+10 1022.46
## 
## Call:
## lm(formula = Profit ~ R.D.Spend + Marketing.Spend, data = data)
## 
## Coefficients:
##     (Intercept)        R.D.Spend  Marketing.Spend  
##       4.698e+04        7.966e-01        2.991e-02
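The AIC values in the step() trace can be reproduced from the RSS column. For a linear model, step() uses n * log(RSS / n) + 2 * k, where k is the number of estimated coefficients including the intercept. The sketch below recomputes three of the reported values from the rounded RSS figures; `step_aic` is a hypothetical helper name.

```r
n <- 50  # number of companies

# Hypothetical helper: AIC as reported in the step() trace for lm models
step_aic <- function(rss, n_coef) n * log(rss / n) + 2 * n_coef

# Full model: intercept + 3 numeric predictors + 2 State dummies = 6 coefficients
aic_full <- step_aic(3.9203e+09, 6)

# After dropping State: 4 coefficients
aic_no_state <- step_aic(3.9209e+09, 4)

# Final model (R.D.Spend + Marketing.Spend): 3 coefficients
aic_final <- step_aic(3.9444e+09, 3)

print(round(c(full = aic_full, no_state = aic_no_state, final = aic_final), 2))
```

Dropping a variable raises the RSS slightly but removes a penalty of 2 per coefficient, which is why the smaller models score lower here.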

Insight

The final linear regression model built using the predictor variables R&D Spend and Marketing Spend has an AIC of 915.18, which is lower than the initial model (AIC=920.87). The final model indicates that:

  • The Intercept (4.698e+04) represents the estimated profit obtained when all predictor variables have a value of zero.
  • The coefficient for R&D Spend (7.966e-01) indicates that for every one-unit increase in R&D expenditure, profit is expected to increase by 0.7966 units, assuming that all other predictor variables remain constant.
  • The coefficient for Marketing Spend (2.991e-02) indicates that for every one-unit increase in marketing expenditure, profit is expected to increase by 0.02991 units, assuming that all other predictor variables remain constant.

Backward elimination therefore keeps R&D Spend and Marketing Spend and drops State and Administration, since removing those two lowers the AIC. R&D Spend is by far the strongest predictor. Marketing Spend is retained because dropping it would raise the AIC to 916.98, although its coefficient in the full model is not significant at the 5% level (p = 0.123).
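To make the coefficient interpretation concrete, the sketch below evaluates the final model by hand from the rounded coefficients printed by step(). The function name `predict_final` is hypothetical, and the results differ slightly from what predict() would return on the fitted object because of rounding.

```r
# Rounded coefficients of the final model, copied from the step() output
b0         <- 4.698e+04   # intercept
b_rd       <- 7.966e-01   # R.D.Spend
b_marketing <- 2.991e-02  # Marketing.Spend

# Hypothetical helper: predicted Profit from the two retained predictors
predict_final <- function(rd_spend, marketing_spend) {
  b0 + b_rd * rd_spend + b_marketing * marketing_spend
}

# First company in head(data)
base_pred <- predict_final(165349.2, 471784.1)

# Raising R&D spend by 1000 with marketing held constant
rd_effect <- predict_final(165349.2 + 1000, 471784.1) - base_pred

# Raising marketing spend by 1000 with R&D held constant
marketing_effect <- predict_final(165349.2, 471784.1 + 1000) - base_pred

print(c(prediction = base_pred, rd_effect = rd_effect,
        marketing_effect = marketing_effect))
```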

Model Prediction and Error

  1. Prediction of Profit with the full model “dataall”
Predict <- predict(dataall, data)
Predict
##         1         2         3         4         5         6         7         8 
## 192390.57 189071.32 182276.19 173584.98 172277.13 163473.81 158099.29 160155.64 
##         9        10        11        12        13        14        15        16 
## 151634.74 154829.66 135664.64 135528.60 129282.92 127431.25 149694.38 146143.64 
##        17        18        19        20        21        22        23        24 
## 116854.07 130085.41 129149.73 115594.19 116570.73 117201.51 114833.31 110123.80 
##        25        26        27        28        29        30        31        32 
## 113294.37 102200.27 110765.30 114279.80 101818.59 101721.04  99629.01  97617.30 
##        33        34        35        36        37        38        39        40 
##  98988.24  98061.36  88974.70  90420.01  75423.09  89577.70  69606.52  83684.98 
##        41        42        43        44        45        46        47        48 
##  74762.75  74956.31  70575.99  60100.27  64585.15  47588.36  56272.99  46468.23 
##        49        50 
##  49123.07  48185.04

Prediction Interval

predict(object = dataall, newdata = data,
        interval = "prediction", 
        level = 0.95) %>% head()
##        fit      lwr      upr
## 1 192390.6 171809.6 212971.6
## 2 189071.3 168333.2 209809.4
## 3 182276.2 161942.9 202609.5
## 4 173585.0 153506.8 193663.2
## 5 172277.1 151902.6 192651.6
## 6 163473.8 143374.4 183573.3

Insight: For the first company, the predicted profit is about 192390.6, with a 95% prediction interval from 171809.6 to 212971.6. Under the model assumptions, the profit of a new company with the same predictor values is expected to fall inside this range about 95% of the time.
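A prediction interval is wider than a confidence interval, because it covers the scatter of an individual observation and not only the uncertainty of the fitted mean. The sketch below illustrates the difference with predict(). It uses R's built-in `cars` dataset as a stand-in, which is an assumption made so the example runs without startup.csv.

```r
# Stand-in model on the built-in `cars` data (not the startup data)
fit <- lm(dist ~ speed, data = cars)
new_obs <- data.frame(speed = 15)

# Interval for the mean response at speed = 15
conf_int <- predict(fit, newdata = new_obs, interval = "confidence", level = 0.95)

# Interval for a single new observation at speed = 15
pred_int <- predict(fit, newdata = new_obs, interval = "prediction", level = 0.95)

# Same point estimate, different widths
conf_width <- conf_int[1, "upr"] - conf_int[1, "lwr"]
pred_width <- pred_int[1, "upr"] - pred_int[1, "lwr"]
print(rbind(confidence = conf_int[1, ], prediction = pred_int[1, ]))
print(c(conf_width = conf_width, pred_width = pred_width))
```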

Model Evaluation

library(MLmetrics)
MAE(y_true = data$Profit, 
    y_pred = Predict)
## [1] 6475.501
range(data$Profit)
## [1]  14681.4 192261.8
192261.8 - 14681.4
## [1] 177580.4
# RMSE

RMSE(y_true = data$Profit, # actual data
    y_pred = Predict)
## [1] 8854.761
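Both metrics are easier to judge relative to the spread of the target. The sketch below defines them in base R, applies them to the first three actual and predicted values printed above, and then expresses the reported MAE and RMSE as a share of the Profit range. The function names `mae` and `rmse` are hypothetical stand-ins for the MLmetrics functions.

```r
# Hypothetical base R equivalents of MAE() and RMSE()
mae  <- function(y_true, y_pred) mean(abs(y_true - y_pred))
rmse <- function(y_true, y_pred) sqrt(mean((y_true - y_pred)^2))

# First three actual and predicted Profit values from the output above
y_true <- c(192261.8, 191792.1, 191050.4)
y_pred <- c(192390.57, 189071.32, 182276.19)
print(c(mae = mae(y_true, y_pred), rmse = rmse(y_true, y_pred)))

# Reported errors as a share of the Profit range
profit_range <- 192261.8 - 14681.4
mae_share    <- 6475.501 / profit_range
rmse_share   <- 8854.761 / profit_range
print(round(c(mae_share = mae_share, rmse_share = rmse_share), 4))

# The RMSE on the training data also follows from the RSS of the full model
rmse_from_rss <- sqrt(3.9203e+09 / 50)
print(rmse_from_rss)
```

On this scale the typical error is under 5% of the Profit range. These figures are computed on the same data used to fit the model, so they may understate the error on new companies.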

Conclusion

From this analysis, we can conclude that R&D spending is the main predictor of profit for the startups in this dataset, with marketing spending as a weaker second predictor retained by backward elimination. Both coefficients are positive, so higher R&D and marketing spending are associated with higher company profit.

  1. R&D Spend: Each unit increase in R&D spending is associated with an approximate increase in company profit of 0.7966 units, assuming other variables remain constant. This suggests that efforts in research and development of products or services can positively contribute to a company’s financial performance.

  2. Marketing Spend: Each unit increase in marketing spending is associated with an approximate increase in company profit of 0.02991 units, assuming other variables remain constant. This suggests that investments in marketing and promotion activities for products or services may also have a positive impact on company profit by attracting more customers or increasing sales.

Thus, efficiently managing R&D and Marketing costs can be crucial strategies for startups in the United States to enhance profitability and business success.