When writing the report, remember to cover the following:
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(GGally)
## Loading required package: ggplot2
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
# read startups data
startups <- read.csv("data_input/50_Startups.csv")
startups
## R.D.Spend Administration Marketing.Spend State Profit
## 1 165349.20 136897.80 471784.10 New York 192261.83
## 2 162597.70 151377.59 443898.53 California 191792.06
## 3 153441.51 101145.55 407934.54 Florida 191050.39
## 4 144372.41 118671.85 383199.62 New York 182901.99
## 5 142107.34 91391.77 366168.42 Florida 166187.94
## 6 131876.90 99814.71 362861.36 New York 156991.12
## 7 134615.46 147198.87 127716.82 California 156122.51
## 8 130298.13 145530.06 323876.68 Florida 155752.60
## 9 120542.52 148718.95 311613.29 New York 152211.77
## 10 123334.88 108679.17 304981.62 California 149759.96
## 11 101913.08 110594.11 229160.95 Florida 146121.95
## 12 100671.96 91790.61 249744.55 California 144259.40
## 13 93863.75 127320.38 249839.44 Florida 141585.52
## 14 91992.39 135495.07 252664.93 California 134307.35
## 15 119943.24 156547.42 256512.92 Florida 132602.65
## 16 114523.61 122616.84 261776.23 New York 129917.04
## 17 78013.11 121597.55 264346.06 California 126992.93
## 18 94657.16 145077.58 282574.31 New York 125370.37
## 19 91749.16 114175.79 294919.57 Florida 124266.90
## 20 86419.70 153514.11 0.00 New York 122776.86
## 21 76253.86 113867.30 298664.47 California 118474.03
## 22 78389.47 153773.43 299737.29 New York 111313.02
## 23 73994.56 122782.75 303319.26 Florida 110352.25
## 24 67532.53 105751.03 304768.73 Florida 108733.99
## 25 77044.01 99281.34 140574.81 New York 108552.04
## 26 64664.71 139553.16 137962.62 California 107404.34
## 27 75328.87 144135.98 134050.07 Florida 105733.54
## 28 72107.60 127864.55 353183.81 New York 105008.31
## 29 66051.52 182645.56 118148.20 Florida 103282.38
## 30 65605.48 153032.06 107138.38 New York 101004.64
## 31 61994.48 115641.28 91131.24 Florida 99937.59
## 32 61136.38 152701.92 88218.23 New York 97483.56
## 33 63408.86 129219.61 46085.25 California 97427.84
## 34 55493.95 103057.49 214634.81 Florida 96778.92
## 35 46426.07 157693.92 210797.67 California 96712.80
## 36 46014.02 85047.44 205517.64 New York 96479.51
## 37 28663.76 127056.21 201126.82 Florida 90708.19
## 38 44069.95 51283.14 197029.42 California 89949.14
## 39 20229.59 65947.93 185265.10 New York 81229.06
## 40 38558.51 82982.09 174999.30 California 81005.76
## 41 28754.33 118546.05 172795.67 California 78239.91
## 42 27892.92 84710.77 164470.71 Florida 77798.83
## 43 23640.93 96189.63 148001.11 California 71498.49
## 44 15505.73 127382.30 35534.17 New York 69758.98
## 45 22177.74 154806.14 28334.72 California 65200.33
## 46 1000.23 124153.04 1903.93 New York 64926.08
## 47 1315.46 115816.21 297114.46 Florida 49490.75
## 48 0.00 135426.92 0.00 California 42559.73
## 49 542.05 51743.15 0.00 New York 35673.41
## 50 0.00 116983.80 45173.06 California 14681.40
# Check Data Structure
glimpse(startups)
## Rows: 50
## Columns: 5
## $ R.D.Spend <dbl> 165349.20, 162597.70, 153441.51, 144372.41, 142107.34,…
## $ Administration <dbl> 136897.80, 151377.59, 101145.55, 118671.85, 91391.77, …
## $ Marketing.Spend <dbl> 471784.1, 443898.5, 407934.5, 383199.6, 366168.4, 3628…
## $ State <chr> "New York", "California", "Florida", "New York", "Flor…
## $ Profit <dbl> 192261.8, 191792.1, 191050.4, 182902.0, 166187.9, 1569…
startups <- startups %>%
select(-State)
# Show data structure
str(startups)
## 'data.frame': 50 obs. of 4 variables:
## $ R.D.Spend : num 165349 162598 153442 144372 142107 ...
## $ Administration : num 136898 151378 101146 118672 91392 ...
## $ Marketing.Spend: num 471784 443899 407935 383200 366168 ...
## $ Profit : num 192262 191792 191050 182902 166188 ...
# Check Correlations
ggcorr(startups, label = TRUE)
Insight:
Based on the correlation plot, we will use R.D.Spend and Marketing.Spend as predictors, since they appear to be the variables most strongly correlated with Profit.
# visualization scatter plot for R&D
plot(startups$R.D.Spend, startups$Profit)
# visualization scatter plot for marketing
plot(startups$Marketing.Spend, startups$Profit)
Insight:
* Based on the plots, higher R.D.Spend and Marketing.Spend are associated with higher Profit.
model_ols <- lm(formula = Profit ~ R.D.Spend,
data = startups)
# see model's result
model_ols
##
## Call:
## lm(formula = Profit ~ R.D.Spend, data = startups)
##
## Coefficients:
## (Intercept) R.D.Spend
## 4.903e+04 8.543e-01
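As a sanity check on what lm() computes here, the sketch below rebuilds the two coefficients from the closed-form formulas for simple linear regression (slope = cov(x, y) / var(x), intercept = mean(y) - slope * mean(x)) and compares them with lm(). The vectors rd and profit are retyped from the R.D.Spend and Profit columns printed above so the block runs without the CSV file. The variable names are chosen here for illustration and are not part of the original analysis.

```r
# R.D.Spend and Profit, copied from the table printed above (50 rows)
rd <- c(165349.20, 162597.70, 153441.51, 144372.41, 142107.34,
        131876.90, 134615.46, 130298.13, 120542.52, 123334.88,
        101913.08, 100671.96, 93863.75, 91992.39, 119943.24,
        114523.61, 78013.11, 94657.16, 91749.16, 86419.70,
        76253.86, 78389.47, 73994.56, 67532.53, 77044.01,
        64664.71, 75328.87, 72107.60, 66051.52, 65605.48,
        61994.48, 61136.38, 63408.86, 55493.95, 46426.07,
        46014.02, 28663.76, 44069.95, 20229.59, 38558.51,
        28754.33, 27892.92, 23640.93, 15505.73, 22177.74,
        1000.23, 1315.46, 0.00, 542.05, 0.00)
profit <- c(192261.83, 191792.06, 191050.39, 182901.99, 166187.94,
            156991.12, 156122.51, 155752.60, 152211.77, 149759.96,
            146121.95, 144259.40, 141585.52, 134307.35, 132602.65,
            129917.04, 126992.93, 125370.37, 124266.90, 122776.86,
            118474.03, 111313.02, 110352.25, 108733.99, 108552.04,
            107404.34, 105733.54, 105008.31, 103282.38, 101004.64,
            99937.59, 97483.56, 97427.84, 96778.92, 96712.80,
            96479.51, 90708.19, 89949.14, 81229.06, 81005.76,
            78239.91, 77798.83, 71498.49, 69758.98, 65200.33,
            64926.08, 49490.75, 42559.73, 35673.41, 14681.40)

# closed-form estimates for simple linear regression
slope <- cov(rd, profit) / var(rd)
intercept <- mean(profit) - slope * mean(rd)

# the same fit with lm(), for comparison
fit <- lm(profit ~ rd)

print(c(intercept = intercept, slope = slope))
print(coef(fit))
```

If the values were copied correctly, both printed results should agree with each other and with the coefficients reported by lm() above.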
Visualization of Linear Regression
# scatter plot
plot(startups$R.D.Spend, startups$Profit)
# Create Line
abline(model_ols, col = "red")
### V.3 Prediction
Now we create new dummy values of R.D.Spend to test our model.
# dummy data
new_rd <- data.frame(R.D.Spend = c(70000, 80000, 90000, 100000))
new_rd
## R.D.Spend
## 1 7e+04
## 2 8e+04
## 3 9e+04
## 4 1e+05
Use predict() to predict new data
# predict
predict(model_ols, new_rd)
## 1 2 3 4
## 108833.3 117376.2 125919.1 134462.0
In the test with dummy data, where R.D.Spend increases in steps of 10,000, the predicted values are 108833.3, 117376.2, 125919.1 and 134462.0.
(117376.2-108833.3)/10000
## [1] 0.85429
(125919.1-117376.2)/10000
## [1] 0.85429
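The two manual checks above can also be done for every consecutive pair at once. A small sketch, with the predicted values and the 10,000 step copied from the output above (pred, step and slopes are names chosen here for illustration):

```r
# predicted values copied from the predict() output above
pred <- c(108833.3, 117376.2, 125919.1, 134462.0)
step <- 10000  # spacing between the dummy R.D.Spend values

# change in predicted Profit per unit of R.D.Spend, for each consecutive pair
slopes <- diff(pred) / step
print(slopes)
```

Every entry comes out at roughly 0.85429, because a straight line has the same slope between any two of its points.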
From the dummy data we can confirm the slope value reported by lm():
Call:
lm(formula = Profit ~ R.D.Spend, data = startups)

Coefficients:
(Intercept)    R.D.Spend
 49032.8991       0.8543
Intercept (49032.8991): the predicted Profit when R.D.Spend is zero. In other words, with no R&D spending (R.D.Spend = 0), the model predicts a profit of about 49,032.90.
R.D.Spend coefficient (0.8543): the slope of the relationship between R.D.Spend and Profit. Each additional unit of R.D.Spend is associated with an increase of 0.8543 units in predicted Profit. For example, spending an additional $1,000 on R&D is expected to raise profit by approximately $854.30.
Putting it together, the linear equation for the model is: Profit = 49,032.8991 + 0.8543 × R.D.Spend
This equation can be used to predict profit for different values of R&D spending.
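To make the equation concrete, here is a minimal sketch that wraps it in a function. predict_profit is a hypothetical helper, not part of the original analysis. It uses the rounded coefficients shown above, so its results can differ slightly from predict(model_ols, ...).

```r
# rounded coefficients taken from the model output above
intercept <- 49032.8991
slope <- 0.8543

# hypothetical helper: predicted Profit for a given R.D.Spend
predict_profit <- function(rd_spend) {
  intercept + slope * rd_spend
}

# predicted Profit for the same dummy values used earlier
print(predict_profit(c(70000, 80000, 90000, 100000)))
```

Because the function is vectorized, a whole column of R&D values can be passed in a single call.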