INTRO

HI…! WELCOME TO MY RMD

1. LOAD DATASET

cop <- read.csv("E:/Kerja/copiers.csv")

2. EXPLORATORY DATA ANALYSIS

str(cop)
## 'data.frame':    62 obs. of  15 variables:
##  $ Row.ID      : int  336 393 407 516 596 754 1151 1234 1550 1645 ...
##  $ Order.ID    : chr  "CA-2015-137946" "US-2014-135972" "CA-2017-117457" "CA-2017-127432" ...
##  $ Order.Date  : chr  "9/1/15" "9/21/14" "12/8/17" "1/22/17" ...
##  $ Ship.Date   : chr  "9/4/15" "9/23/14" "12/12/17" "1/27/17" ...
##  $ Ship.Mode   : chr  "Second Class" "Second Class" "Standard Class" "Standard Class" ...
##  $ Customer.ID : chr  "DB-13615" "JG-15115" "KH-16510" "AD-10180" ...
##  $ Segment     : chr  "Consumer" "Consumer" "Consumer" "Home Office" ...
##  $ Product.ID  : chr  "TEC-CO-10001449" "TEC-CO-10002313" "TEC-CO-10004115" "TEC-CO-10003236" ...
##  $ Category    : chr  "Technology" "Technology" "Technology" "Technology" ...
##  $ Sub.Category: chr  "Copiers" "Copiers" "Copiers" "Copiers" ...
##  $ Product.Name: chr  "Hewlett Packard LaserJet 3310 Copier" "Canon PC1080F Personal Copier" "Sharp AL-1530CS Digital Copier" "Canon Image Class D660 Copier" ...
##  $ Sales       : num  960 1800 1200 3000 1200 ...
##  $ Quantity    : int  2 3 3 5 3 3 2 2 1 7 ...
##  $ Discount    : num  0.2 0 0.2 0 0.2 0.2 0 0.4 0.2 0 ...
##  $ Profit      : num  336 702 435 1380 435 ...

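Some columns do not have the right data type yet: Order.Date and Ship.Date should be dates, and Ship.Mode, Segment, Category and Sub.Category should be factors.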

library(lubridate)
## Warning: package 'lubridate' was built under R version 4.1.2
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
cop$Order.Date <- mdy(cop$Order.Date)
cop$Ship.Date <- mdy(cop$Ship.Date)
cop$Ship.Mode <- as.factor(cop$Ship.Mode)
cop$Segment <- as.factor(cop$Segment)
cop$Category <- as.factor(cop$Category)
cop$Sub.Category <- as.factor(cop$Sub.Category)
str(cop)
## 'data.frame':    62 obs. of  15 variables:
##  $ Row.ID      : int  336 393 407 516 596 754 1151 1234 1550 1645 ...
##  $ Order.ID    : chr  "CA-2015-137946" "US-2014-135972" "CA-2017-117457" "CA-2017-127432" ...
##  $ Order.Date  : Date, format: "2015-09-01" "2014-09-21" ...
##  $ Ship.Date   : Date, format: "2015-09-04" "2014-09-23" ...
##  $ Ship.Mode   : Factor w/ 4 levels "First Class",..: 3 3 4 4 4 1 2 1 1 1 ...
##  $ Customer.ID : chr  "DB-13615" "JG-15115" "KH-16510" "AD-10180" ...
##  $ Segment     : Factor w/ 3 levels "Consumer","Corporate",..: 1 1 1 3 1 2 1 2 1 2 ...
##  $ Product.ID  : chr  "TEC-CO-10001449" "TEC-CO-10002313" "TEC-CO-10004115" "TEC-CO-10003236" ...
##  $ Category    : Factor w/ 1 level "Technology": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Sub.Category: Factor w/ 1 level "Copiers": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Product.Name: chr  "Hewlett Packard LaserJet 3310 Copier" "Canon PC1080F Personal Copier" "Sharp AL-1530CS Digital Copier" "Canon Image Class D660 Copier" ...
##  $ Sales       : num  960 1800 1200 3000 1200 ...
##  $ Quantity    : int  2 3 3 5 3 3 2 2 1 7 ...
##  $ Discount    : num  0.2 0 0.2 0 0.2 0.2 0 0.4 0.2 0 ...
##  $ Profit      : num  336 702 435 1380 435 ...
anyNA(cop)
## [1] FALSE
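
The date conversion can also be done without lubridate. The sketch below is a base-R alternative, assuming every date follows the month/day/two-digit-year pattern seen in the str() output; the vector names are made up for illustration.

```r
# Sample values copied from the str() output of Order.Date
order_date_raw <- c("9/1/15", "9/21/14", "12/8/17", "1/22/17")

# %m = month, %d = day, %y = two-digit year
order_date <- as.Date(order_date_raw, format = "%m/%d/%y")

class(order_date)   # "Date"
format(order_date)  # ISO dates such as "2015-09-01"

# Once both columns are Date objects, date arithmetic works directly:
# days between the first order ("9/1/15") and its shipment ("9/4/15")
ship_date <- as.Date("9/4/15", format = "%m/%d/%y")
delay_days <- as.numeric(ship_date - order_date[1])
delay_days          # 3
```

lubridate's mdy() is more forgiving about separators and year formats; the explicit format string is stricter, so an unexpected pattern shows up as NA instead of being guessed.
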
library(GGally)
## Warning: package 'GGally' was built under R version 4.1.2
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 4.1.2
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
# inspect the correlations between the numeric variables
ggcorr(cop, label = TRUE, label_size = 2.9, hjust = 1, layout.exp = 2)
## Warning in ggcorr(cop, label = TRUE, label_size = 2.9, hjust = 1, layout.exp
## = 2): data in column(s) 'Order.ID', 'Order.Date', 'Ship.Date', 'Ship.Mode',
## 'Customer.ID', 'Segment', 'Product.ID', 'Category', 'Sub.Category',
## 'Product.Name' are not numeric and were ignored
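
The warning above says that ggcorr() silently drops non-numeric columns. A minimal sketch of doing that selection explicitly with base R follows. The data frame `toy` is hypothetical: it holds only the first five values of each column as displayed by str(cop) (the display may be rounded), so its correlations are not those of the full dataset.

```r
# Hypothetical mini data frame built from the first values shown by str(cop)
toy <- data.frame(
  Segment  = c("Consumer", "Consumer", "Consumer", "Home Office", "Consumer"),
  Sales    = c(960, 1800, 1200, 3000, 1200),
  Quantity = c(2, 3, 3, 5, 3),
  Discount = c(0.2, 0, 0.2, 0, 0.2),
  Profit   = c(336, 702, 435, 1380, 435)
)

# Flag the numeric columns, then compute the Pearson correlation matrix
num_cols <- vapply(toy, is.numeric, logical(1))
cor_mat <- cor(toy[, num_cols])

round(cor_mat, 2)
```

Selecting the columns by hand makes it obvious which variables enter the correlation matrix and avoids the warning.
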

The plot shows that Sales has the strongest positive correlation with Profit among the numeric variables.

Check the distribution of the data.

# check the distribution of Profit
boxplot(cop$Profit)

The boxplot shows that Profit contains outliers.
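
To list outlying values instead of only seeing them as points on the plot, boxplot.stats() applies the same 1.5 * IQR rule that boxplot() uses. The vector below is hypothetical and only illustrates the call; it is not the Profit column.

```r
# Hypothetical profit values; the last one is deliberately extreme
profit <- c(336, 702, 435, 1380, 435, 520, 610, 5000)

# $out holds the points beyond 1.5 * IQR from the hinges,
# i.e. the points boxplot() draws individually
out <- boxplot.stats(profit)$out
out  # 5000
```

With the real data, the equivalent call would be boxplot.stats(cop$Profit)$out.
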

# check the distribution of Sales
hist(cop$Sales)

# check the correlation between the target (Profit) and the predictor (Sales)
cor(x=cop$Sales, y=cop$Profit)
## [1] 0.9395785

The two variables have a very strong positive linear relationship (r = 0.94): when Sales goes up, Profit tends to go up, and vice versa.

3. BUILD THE LINEAR REGRESSION MODEL

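Next, build a linear regression model with Sales as the predictor, because it correlates very strongly with Profit. As a baseline, first fit an intercept-only model (no predictor); its intercept is simply the mean of the response, here the mean of Sales (1331.1).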

m <- lm(formula = Sales~1, data = cop)
summary(m)
## 
## Call:
## lm(formula = Sales ~ 1, data = cop)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1031.1  -731.1  -351.1   418.9  3568.8 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   1331.1      122.7   10.84 7.32e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 966.5 on 61 degrees of freedom
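
The intercept of a model with no predictor is the mean of the response, which is why the estimate above equals the average of Sales. A small sketch with a hypothetical vector (the first five Sales values as displayed by str(), not the full column):

```r
# Hypothetical sample of sales values
sales <- c(960, 1800, 1200, 3000, 1200)

# Intercept-only model: the formula `~ 1` means "no predictors"
m0 <- lm(sales ~ 1)
b0 <- unname(coef(m0)["(Intercept)"])

# The fitted intercept coincides with the sample mean
all.equal(b0, mean(sales))  # TRUE
```
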

Next, predict Profit from Sales.

mp <- lm(formula = Profit~Sales, data = cop)
summary(mp)
## 
## Call:
## lm(formula = Profit ~ Sales, data = cop)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -407.07  -70.08   22.95   76.56  345.05 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -114.06251   32.62743  -3.496 0.000895 ***
## Sales          0.42286    0.01989  21.260  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 150.1 on 60 degrees of freedom
## Multiple R-squared:  0.8828, Adjusted R-squared:  0.8809 
## F-statistic:   452 on 1 and 60 DF,  p-value: < 2.2e-16

The fitted model is Profit = -114.06251 + 0.42286*Sales

For example, with Sales of 100, the predicted profit is:

y = -114.06251 + 0.42286*100
y
## [1] -71.77651
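
The same calculation can be wrapped in a small function so that several sales values are scored at once. This is a sketch only: `predict_profit` is a hypothetical helper, and the coefficients are typed in from the summary(mp) output rather than read from the model object.

```r
# Coefficients copied from the summary(mp) output
b0 <- -114.06251  # intercept
b1 <- 0.42286     # slope for Sales

# Hypothetical helper: predicted profit for one or more sales values
predict_profit <- function(sales) b0 + b1 * sales

predict_profit(100)                 # -71.77651, as computed by hand
predict_profit(c(100, 1000, 3000))  # vectorised over several values
```

In the session itself, predict(mp, newdata = data.frame(Sales = 100)) should give the same number up to rounding of the printed coefficients, and it avoids retyping them.
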

Regression model formula: y = b0 + b1*x

Interpretation of the regression model

intercept/b0 = -114.06251

slope/b1 = 0.42286

Profit = -114.06251 + 0.42286*Sales

A slope of 0.423 means that every 1-unit increase in Sales raises the predicted value of the target variable (Profit) by 0.423 units. The intercept of -114.06 is the predicted Profit when Sales is 0.

cor(cop$Sales, cop$Profit)
## [1] 0.9395785
summary(mp)
## 
## Call:
## lm(formula = Profit ~ Sales, data = cop)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -407.07  -70.08   22.95   76.56  345.05 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -114.06251   32.62743  -3.496 0.000895 ***
## Sales          0.42286    0.01989  21.260  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 150.1 on 60 degrees of freedom
## Multiple R-squared:  0.8828, Adjusted R-squared:  0.8809 
## F-statistic:   452 on 1 and 60 DF,  p-value: < 2.2e-16

The p-value is the yardstick for whether the predictor has a statistically significant effect on the target variable. The effect is considered significant when the p-value is below 0.05.

Hypotheses:

- H0: the predictor does not affect the target variable
- H1: the predictor has a significant effect on the target variable
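
The p-value does not have to be read off the printed summary; it can be pulled from the coefficient table and compared with the 0.05 threshold in code. The sketch below runs on simulated data (an assumption, since the CSV is not bundled here), so its numbers differ from the real model.

```r
# Simulated data that loosely mimics the sales/profit relationship
set.seed(1)
sales  <- seq(200, 4000, length.out = 30)
profit <- -114 + 0.42 * sales + rnorm(30, sd = 150)

fit <- lm(profit ~ sales)

# The coefficient table is a matrix; the p-values sit in column "Pr(>|t|)"
coef_table <- summary(fit)$coefficients
p_value <- coef_table["sales", "Pr(>|t|)"]

# TRUE means H0 is rejected at the 5% level
p_value < 0.05
```

For the model in this report, the matching lookup would be summary(mp)$coefficients["Sales", "Pr(>|t|)"].
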

plot(x = cop$Sales, y = cop$Profit)
abline(mp, col = "red")

The model's R-squared is 88.3%: with Sales as the only predictor, the model explains 88.3% of the variance in Profit, and the remaining 11.7% is explained by other factors.
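
In a simple linear regression with one predictor, R-squared equals the square of the Pearson correlation between predictor and target. This gives a quick cross-check of the 0.8828 reported by summary(mp), using the correlation printed earlier:

```r
# Correlation between Sales and Profit, copied from the cor() output
r <- 0.9395785

# For a one-predictor linear model, R-squared is r squared
r_squared <- r^2
round(r_squared, 4)  # 0.8828
```
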

Because the p-value for Sales (< 2e-16) is far below 0.05, H0 is rejected and H1 is accepted.