Instructions:

When preparing the report, be sure to cover the following:

Library

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(GGally)
## Loading required package: ggplot2
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2

I. Read data

# read startups data
startups <- read.csv("data_input/50_Startups.csv")

startups
##    R.D.Spend Administration Marketing.Spend      State    Profit
## 1  165349.20      136897.80       471784.10   New York 192261.83
## 2  162597.70      151377.59       443898.53 California 191792.06
## 3  153441.51      101145.55       407934.54    Florida 191050.39
## 4  144372.41      118671.85       383199.62   New York 182901.99
## 5  142107.34       91391.77       366168.42    Florida 166187.94
## 6  131876.90       99814.71       362861.36   New York 156991.12
## 7  134615.46      147198.87       127716.82 California 156122.51
## 8  130298.13      145530.06       323876.68    Florida 155752.60
## 9  120542.52      148718.95       311613.29   New York 152211.77
## 10 123334.88      108679.17       304981.62 California 149759.96
## 11 101913.08      110594.11       229160.95    Florida 146121.95
## 12 100671.96       91790.61       249744.55 California 144259.40
## 13  93863.75      127320.38       249839.44    Florida 141585.52
## 14  91992.39      135495.07       252664.93 California 134307.35
## 15 119943.24      156547.42       256512.92    Florida 132602.65
## 16 114523.61      122616.84       261776.23   New York 129917.04
## 17  78013.11      121597.55       264346.06 California 126992.93
## 18  94657.16      145077.58       282574.31   New York 125370.37
## 19  91749.16      114175.79       294919.57    Florida 124266.90
## 20  86419.70      153514.11            0.00   New York 122776.86
## 21  76253.86      113867.30       298664.47 California 118474.03
## 22  78389.47      153773.43       299737.29   New York 111313.02
## 23  73994.56      122782.75       303319.26    Florida 110352.25
## 24  67532.53      105751.03       304768.73    Florida 108733.99
## 25  77044.01       99281.34       140574.81   New York 108552.04
## 26  64664.71      139553.16       137962.62 California 107404.34
## 27  75328.87      144135.98       134050.07    Florida 105733.54
## 28  72107.60      127864.55       353183.81   New York 105008.31
## 29  66051.52      182645.56       118148.20    Florida 103282.38
## 30  65605.48      153032.06       107138.38   New York 101004.64
## 31  61994.48      115641.28        91131.24    Florida  99937.59
## 32  61136.38      152701.92        88218.23   New York  97483.56
## 33  63408.86      129219.61        46085.25 California  97427.84
## 34  55493.95      103057.49       214634.81    Florida  96778.92
## 35  46426.07      157693.92       210797.67 California  96712.80
## 36  46014.02       85047.44       205517.64   New York  96479.51
## 37  28663.76      127056.21       201126.82    Florida  90708.19
## 38  44069.95       51283.14       197029.42 California  89949.14
## 39  20229.59       65947.93       185265.10   New York  81229.06
## 40  38558.51       82982.09       174999.30 California  81005.76
## 41  28754.33      118546.05       172795.67 California  78239.91
## 42  27892.92       84710.77       164470.71    Florida  77798.83
## 43  23640.93       96189.63       148001.11 California  71498.49
## 44  15505.73      127382.30        35534.17   New York  69758.98
## 45  22177.74      154806.14        28334.72 California  65200.33
## 46   1000.23      124153.04         1903.93   New York  64926.08
## 47   1315.46      115816.21       297114.46    Florida  49490.75
## 48      0.00      135426.92            0.00 California  42559.73
## 49    542.05       51743.15            0.00   New York  35673.41
## 50      0.00      116983.80        45173.06 California  14681.40

II. Column Information

III. Data Cleansing & Feature Selection

III.1 Feature Selection

  • We will not drop any column for having only a single value or only missing values (NA). We will consider dropping columns after we examine the relationships between them.
# Check Data Structure
glimpse(startups)
## Rows: 50
## Columns: 5
## $ R.D.Spend       <dbl> 165349.20, 162597.70, 153441.51, 144372.41, 142107.34,…
## $ Administration  <dbl> 136897.80, 151377.59, 101145.55, 118671.85, 91391.77, …
## $ Marketing.Spend <dbl> 471784.1, 443898.5, 407934.5, 383199.6, 366168.4, 3628…
## $ State           <chr> "New York", "California", "Florida", "New York", "Flor…
## $ Profit          <dbl> 192261.8, 191792.1, 191050.4, 182902.0, 166187.9, 1569…
  • We will drop the State column because it is non-numeric.
startups <- startups %>%
  select(-State)

# Show data structure
str(startups)
## 'data.frame':    50 obs. of  4 variables:
##  $ R.D.Spend      : num  165349 162598 153442 144372 142107 ...
##  $ Administration : num  136898 151378 101146 118672 91392 ...
##  $ Marketing.Spend: num  471784 443899 407935 383200 366168 ...
##  $ Profit         : num  192262 191792 191050 182902 166188 ...
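The screening rules described in this section (single-value columns, all-NA columns, non-numeric columns) can be checked programmatically. The following is a minimal base R sketch, not the report's own code. It assumes a small data frame named `sample_startups`, a hypothetical name, built from six rows copied out of the table printed above so that it runs without the CSV file.

```r
# Six rows copied from the startups table above (illustration only;
# sample_startups is a hypothetical name, not part of the original report).
sample_startups <- data.frame(
  R.D.Spend       = c(165349.20, 101913.08, 76253.86, 61994.48, 28754.33, 0.00),
  Administration  = c(136897.80, 110594.11, 113867.30, 115641.28, 118546.05, 116983.80),
  Marketing.Spend = c(471784.10, 229160.95, 298664.47, 91131.24, 172795.67, 45173.06),
  State           = c("New York", "Florida", "California",
                      "Florida", "California", "California"),
  Profit          = c(192261.83, 146121.95, 118474.03, 99937.59, 78239.91, 14681.40)
)

# Number of distinct values per column; a count of 1 would mark a constant column.
n_unique <- sapply(sample_startups, function(x) length(unique(x)))

# Number of missing values per column.
n_missing <- colSums(is.na(sample_startups))

# Keep only the numeric columns (a type-based alternative to select(-State)).
is_num <- sapply(sample_startups, is.numeric)
numeric_startups <- sample_startups[, is_num, drop = FALSE]

print(n_unique)
print(n_missing)
print(names(numeric_startups))
```

Selecting by type rather than by name has the advantage that it keeps working if further non-numeric columns are added later, at the cost of being less explicit about which column is removed.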

IV. Exploratory Data Analysis (EDA) - Correlation

# Check Correlations
ggcorr(startups, label = TRUE)

Insight:

Based on the correlation plot, we will use R.D.Spend and Marketing.Spend as predictors.
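The choice of predictors can also be supported with numbers instead of reading them off the plot. The sketch below ranks candidate predictors by the absolute value of their Pearson correlation with Profit using base R `cor()`. It is an illustration only: the vectors hold six rows copied from the table above, so the values it prints will differ from those of the full 50-row data set, and the names `sample_numeric` and `ranked` are hypothetical.

```r
# Six rows copied from the startups table above (illustration only).
rd_spend  <- c(165349.20, 101913.08, 76253.86, 61994.48, 28754.33, 0.00)
admin     <- c(136897.80, 110594.11, 113867.30, 115641.28, 118546.05, 116983.80)
marketing <- c(471784.10, 229160.95, 298664.47, 91131.24, 172795.67, 45173.06)
profit    <- c(192261.83, 146121.95, 118474.03, 99937.59, 78239.91, 14681.40)

sample_numeric <- data.frame(
  R.D.Spend       = rd_spend,
  Administration  = admin,
  Marketing.Spend = marketing,
  Profit          = profit
)

# Pearson correlation matrix of all numeric columns.
cor_matrix <- cor(sample_numeric)

# Correlation of each candidate predictor with Profit, strongest first.
cor_with_profit <- cor_matrix[, "Profit"]
cor_with_profit <- cor_with_profit[names(cor_with_profit) != "Profit"]
ranked <- cor_with_profit[order(abs(cor_with_profit), decreasing = TRUE)]

print(round(ranked, 2))
```

On the real data, replacing `sample_numeric` with `startups` (after State has been dropped) would give the ranking for the full sample.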

V. Simple Linear Regression

V.1 Concept: Simple Linear Regression

# visualization scatter plot for R&D
plot(startups$R.D.Spend, startups$Profit)

# visualization scatter plot for marketing
plot(startups$Marketing.Spend, startups$Profit)

Insight:

  • Based on the plots, Profit tends to increase as R.D.Spend and Marketing.Spend increase.

V.2 Modelling

model_ols <- lm(formula = Profit ~ R.D.Spend,
                data = startups)

# see model's result
model_ols
## 
## Call:
## lm(formula = Profit ~ R.D.Spend, data = startups)
## 
## Coefficients:
## (Intercept)    R.D.Spend  
##   4.903e+04    8.543e-01
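For a single predictor, the two coefficients that `lm()` reports have a closed form: the slope is cov(x, y) / var(x) and the intercept is mean(y) - slope × mean(x). The sketch below checks this against `lm()` on six rows copied from the table above. Because it uses only a subset, its coefficients will not match the 4.903e+04 and 8.543e-01 shown for the full data; the variable names are hypothetical.

```r
# Six (R.D.Spend, Profit) pairs copied from the startups table above.
rd_spend <- c(165349.20, 101913.08, 76253.86, 61994.48, 28754.33, 0.00)
profit   <- c(192261.83, 146121.95, 118474.03, 99937.59, 78239.91, 14681.40)

# Closed-form ordinary least squares estimates for one predictor.
slope_manual     <- cov(rd_spend, profit) / var(rd_spend)
intercept_manual <- mean(profit) - slope_manual * mean(rd_spend)

# The same model fitted with lm().
fit <- lm(profit ~ rd_spend)

print(c(intercept = intercept_manual, slope = slope_manual))
print(coef(fit))
```

Both printed pairs should agree up to floating-point precision, which shows that `lm()` here is doing nothing more than fitting the least-squares line through the scatter plot.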

Visualization of Linear Regression

# scatter plot
plot(startups$R.D.Spend, startups$Profit)

# Add the fitted regression line
abline(model_ols, col = "red")

V.3 Prediction

Now we create dummy R.D.Spend values to test our model.

# dummy data
new_rd <- data.frame(R.D.Spend = c(70000, 80000, 90000, 100000))
new_rd
##   R.D.Spend
## 1     7e+04
## 2     8e+04
## 3     9e+04
## 4     1e+05

Use predict() to predict new data

# predict
predict(model_ols, new_rd)
##        1        2        3        4 
## 108833.3 117376.2 125919.1 134462.0

Using dummy data spaced 10,000 apart, we get the predictions 108833.3, 117376.2, 125919.1, and 134462.0.

(117376.2-108833.3)/10000
## [1] 0.85429
(125919.1-117376.2)/10000
## [1] 0.85429

The dummy predictions confirm the slope reported by lm(): dividing the difference between consecutive predictions by the difference in R.D.Spend gives about 0.8543 each time.
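The same check can be done for all consecutive pairs at once with `diff()`. The sketch below is an illustration under stated assumptions: it fits the model on six rows copied from the table above (hypothetical name `sample_df`), so its slope differs from the full-data value, but the property being demonstrated, that predictions from a straight-line model change by exactly the slope per unit of R.D.Spend, is the same.

```r
# Six (R.D.Spend, Profit) pairs copied from the startups table above.
sample_df <- data.frame(
  R.D.Spend = c(165349.20, 101913.08, 76253.86, 61994.48, 28754.33, 0.00),
  Profit    = c(192261.83, 146121.95, 118474.03, 99937.59, 78239.91, 14681.40)
)

fit <- lm(Profit ~ R.D.Spend, data = sample_df)

# Same dummy inputs as in the report.
new_rd <- data.frame(R.D.Spend = c(70000, 80000, 90000, 100000))
pred   <- predict(fit, new_rd)

# Change in prediction divided by change in input, for each consecutive pair.
slopes_from_pred <- diff(pred) / diff(new_rd$R.D.Spend)

print(unname(slopes_from_pred))
print(unname(coef(fit)["R.D.Spend"]))
```

Every element of `slopes_from_pred` should equal the fitted R.D.Spend coefficient, regardless of which two inputs are compared.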

Call:
lm(formula = Profit ~ R.D.Spend, data = startups)

Coefficients:
(Intercept)    R.D.Spend
 49032.8991       0.8543

Intercept (49032.8991): This is the baseline value of Profit when R.D.Spend is zero. If no money is spent on R&D (R.D.Spend = 0), the predicted profit is 49,032.8991 units.

R.D.Spend coefficient (0.8543): This is the slope of the relationship between R.D.Spend and Profit. For every one-unit increase in R.D.Spend, the predicted Profit increases by 0.8543 units. For example, an additional $1,000 spent on R&D is expected to raise profit by approximately $854.30.

Putting it together, the linear equation for the model is: Profit = 49,032.8991 + 0.8543 × R.D.Spend

This equation can be used to predict profit based on different values of R&D spending.
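As a minimal sketch, the equation can be wrapped in a small function. The function name `predict_profit` is hypothetical, and the coefficients are the rounded values quoted above, so the results differ from `predict(model_ols, ...)` by less than one unit (for example, for R.D.Spend = 70000 the report's `predict()` gives 108833.3, while the rounded equation gives a value slightly above that).

```r
# Rounded coefficients as reported in the lm() output above.
intercept <- 49032.8991
slope     <- 0.8543

# Hypothetical helper: apply the fitted line to one or more R&D spend values.
predict_profit <- function(rd_spend) {
  intercept + slope * rd_spend
}

rd_values <- c(0, 70000, 100000)
print(predict_profit(rd_values))
```

Note that the training data only covers R.D.Spend values between 0 and roughly 165,000, so predictions far outside that range are extrapolations and should be treated with caution.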