HI…! WELCOME TO MY RMD
1. LOAD DATASET
cop <- read.csv("E:/Kerja/copiers.csv")
2. EXPLORATORY DATA ANALYSIS
str(cop)
## 'data.frame': 62 obs. of 15 variables:
## $ Row.ID : int 336 393 407 516 596 754 1151 1234 1550 1645 ...
## $ Order.ID : chr "CA-2015-137946" "US-2014-135972" "CA-2017-117457" "CA-2017-127432" ...
## $ Order.Date : chr "9/1/15" "9/21/14" "12/8/17" "1/22/17" ...
## $ Ship.Date : chr "9/4/15" "9/23/14" "12/12/17" "1/27/17" ...
## $ Ship.Mode : chr "Second Class" "Second Class" "Standard Class" "Standard Class" ...
## $ Customer.ID : chr "DB-13615" "JG-15115" "KH-16510" "AD-10180" ...
## $ Segment : chr "Consumer" "Consumer" "Consumer" "Home Office" ...
## $ Product.ID : chr "TEC-CO-10001449" "TEC-CO-10002313" "TEC-CO-10004115" "TEC-CO-10003236" ...
## $ Category : chr "Technology" "Technology" "Technology" "Technology" ...
## $ Sub.Category: chr "Copiers" "Copiers" "Copiers" "Copiers" ...
## $ Product.Name: chr "Hewlett Packard LaserJet 3310 Copier" "Canon PC1080F Personal Copier" "Sharp AL-1530CS Digital Copier" "Canon Image Class D660 Copier" ...
## $ Sales : num 960 1800 1200 3000 1200 ...
## $ Quantity : int 2 3 3 5 3 3 2 2 1 7 ...
## $ Discount : num 0.2 0 0.2 0 0.2 0.2 0 0.4 0.2 0 ...
## $ Profit : num 336 702 435 1380 435 ...
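Some columns do not yet have an appropriate data type: Order.Date and Ship.Date are stored as character, and Ship.Mode, Segment, Category, and Sub.Category should be factors.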
library(lubridate)
## Warning: package 'lubridate' was built under R version 4.1.2
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
cop$Order.Date <- mdy(cop$Order.Date)
cop$Ship.Date <- mdy(cop$Ship.Date)
cop$Ship.Mode <- as.factor(cop$Ship.Mode)
cop$Segment <- as.factor(cop$Segment)
cop$Category <- as.factor(cop$Category)
cop$Sub.Category <- as.factor(cop$Sub.Category)
str(cop)
## 'data.frame': 62 obs. of 15 variables:
## $ Row.ID : int 336 393 407 516 596 754 1151 1234 1550 1645 ...
## $ Order.ID : chr "CA-2015-137946" "US-2014-135972" "CA-2017-117457" "CA-2017-127432" ...
## $ Order.Date : Date, format: "2015-09-01" "2014-09-21" ...
## $ Ship.Date : Date, format: "2015-09-04" "2014-09-23" ...
## $ Ship.Mode : Factor w/ 4 levels "First Class",..: 3 3 4 4 4 1 2 1 1 1 ...
## $ Customer.ID : chr "DB-13615" "JG-15115" "KH-16510" "AD-10180" ...
## $ Segment : Factor w/ 3 levels "Consumer","Corporate",..: 1 1 1 3 1 2 1 2 1 2 ...
## $ Product.ID : chr "TEC-CO-10001449" "TEC-CO-10002313" "TEC-CO-10004115" "TEC-CO-10003236" ...
## $ Category : Factor w/ 1 level "Technology": 1 1 1 1 1 1 1 1 1 1 ...
## $ Sub.Category: Factor w/ 1 level "Copiers": 1 1 1 1 1 1 1 1 1 1 ...
## $ Product.Name: chr "Hewlett Packard LaserJet 3310 Copier" "Canon PC1080F Personal Copier" "Sharp AL-1530CS Digital Copier" "Canon Image Class D660 Copier" ...
## $ Sales : num 960 1800 1200 3000 1200 ...
## $ Quantity : int 2 3 3 5 3 3 2 2 1 7 ...
## $ Discount : num 0.2 0 0.2 0 0.2 0.2 0 0.4 0.2 0 ...
## $ Profit : num 336 702 435 1380 435 ...
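The same conversions can also be done in base R, without lubridate. The following is a minimal sketch on a small made-up data frame (`toy` is hypothetical, not the copiers data) that only mimics the column formats shown above:

```r
# Hypothetical toy data mimicking the column formats in copiers.csv
toy <- data.frame(
  Order.Date = c("9/1/15", "9/21/14", "12/8/17"),
  Ship.Mode  = c("Second Class", "Second Class", "Standard Class"),
  Segment    = c("Consumer", "Consumer", "Home Office"),
  stringsAsFactors = FALSE
)

# Base-R alternative to lubridate::mdy(): month/day/two-digit year
toy$Order.Date <- as.Date(toy$Order.Date, format = "%m/%d/%y")

# Convert several categorical columns to factor in one step
factor_cols <- c("Ship.Mode", "Segment")
toy[factor_cols] <- lapply(toy[factor_cols], as.factor)

str(toy)
```

Converting a vector of column names with `lapply()` avoids repeating one `as.factor()` line per column.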
anyNA(cop)
## [1] FALSE
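`anyNA()` only answers whether any value is missing. If it had returned TRUE, a per-column count would show where the gaps are. A small sketch, assuming a made-up data frame with missing values inserted on purpose:

```r
# Hypothetical toy data with deliberately missing values
toy <- data.frame(
  Sales    = c(960, NA, 1200),
  Quantity = c(2L, 3L, NA),
  Profit   = c(336, 702, 435)
)

anyNA(toy)                       # is anything missing at all?

# Number of missing values in each column
na_per_col <- colSums(is.na(toy))
na_per_col
```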
library(GGally)
## Warning: package 'GGally' was built under R version 4.1.2
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 4.1.2
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
# inspect the correlation between variables
ggcorr(cop, label = TRUE, label_size = 2.9, hjust = 1, layout.exp = 2)
## Warning in ggcorr(cop, label = TRUE, label_size = 2.9, hjust = 1, layout.exp
## = 2): data in column(s) 'Order.ID', 'Order.Date', 'Ship.Date', 'Ship.Mode',
## 'Customer.ID', 'Segment', 'Product.ID', 'Category', 'Sub.Category',
## 'Product.Name' are not numeric and were ignored
The plot shows that Sales has the highest positive correlation with Profit compared with the other variables.
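The ranking read off the plot can also be checked numerically. This is a sketch on made-up values (`toy` is hypothetical); it keeps only the numeric columns, which is what `ggcorr()` does when it ignores the non-numeric ones:

```r
# Hypothetical toy data; Segment is non-numeric and gets dropped
toy <- data.frame(
  Sales    = c(100, 200, 300, 400, 500),
  Quantity = c(1, 3, 2, 5, 4),
  Profit   = c(20, 45, 60, 95, 110),
  Segment  = c("A", "B", "A", "C", "B")
)

# Keep numeric columns only, then compute the correlation matrix
num_cols <- sapply(toy, is.numeric)
cor_mat  <- cor(toy[, num_cols])

# Correlation of every numeric variable with Profit, highest first
sort(cor_mat[, "Profit"], decreasing = TRUE)
```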
# check the distribution of the Profit variable
boxplot(cop$Profit)
The boxplot shows that Profit contains outliers.
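To list the outlying values instead of only seeing them as points, `boxplot.stats()` returns the observations that fall beyond the whiskers. A sketch with a made-up vector (not the actual Profit column):

```r
# Hypothetical values with one obvious extreme point
profit <- c(10, 12, 11, 13, 12, 100)

# $out holds the points beyond 1.5 * IQR from the hinges
outliers <- boxplot.stats(profit)$out
outliers
```

On the real data this would be `boxplot.stats(cop$Profit)$out`.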
# check the distribution of the Sales variable
hist(cop$Sales)
# check the correlation between the target and the predictor
cor(x=cop$Sales, y=cop$Profit)
## [1] 0.9395785
The relationship between the two is strongly positive (r ≈ 0.94): when Sales increases, Profit tends to increase, and vice versa.
3. BUILDING THE LINEAR REGRESSION MODEL
Next, build a linear regression model with Sales as the predictor, because it has a very high correlation with Profit.
# baseline: intercept-only model (no predictor); the intercept equals the mean of Sales
m <- lm(formula = Sales~1, data = cop)
summary(m)
##
## Call:
## lm(formula = Sales ~ 1, data = cop)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1031.1 -731.1 -351.1 418.9 3568.8
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1331.1 122.7 10.84 7.32e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 966.5 on 61 degrees of freedom
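A model of the form `y ~ 1` has no predictor, so its single coefficient is simply the mean of the response. A quick sketch with a few illustrative values (the vector `sales` is an assumption, not the full column):

```r
# A handful of illustrative values, not the full Sales column
sales <- c(960, 1800, 1200, 3000, 1200)

# Intercept-only model: no predictor on the right-hand side
m0 <- lm(sales ~ 1)

coef(m0)[["(Intercept)"]]   # fitted intercept
mean(sales)                 # plain average of the response
```

This is why the intercept of 1331.1 above can be read as the average Sales value.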
Next, model Profit as a function of Sales.
mp <- lm(formula = Profit~Sales, data = cop)
summary(mp)
##
## Call:
## lm(formula = Profit ~ Sales, data = cop)
##
## Residuals:
## Min 1Q Median 3Q Max
## -407.07 -70.08 22.95 76.56 345.05
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -114.06251 32.62743 -3.496 0.000895 ***
## Sales 0.42286 0.01989 21.260 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 150.1 on 60 degrees of freedom
## Multiple R-squared: 0.8828, Adjusted R-squared: 0.8809
## F-statistic: 452 on 1 and 60 DF, p-value: < 2.2e-16
The fitted model is y = -114.06251 + 0.42286*Sales
For example, with Sales of 100, the predicted Profit is:
y = -114.06251 + 0.42286*100
y
## [1] -71.77651
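Rather than typing the coefficients by hand, the same prediction can be obtained with `predict()`, which uses the unrounded estimates stored in the model object. A sketch on made-up data (`toy` and `fit` are hypothetical stand-ins for `cop` and `mp`):

```r
# Hypothetical toy data standing in for the copiers data
toy <- data.frame(
  Sales  = c(100, 200, 300, 400, 500),
  Profit = c(18, 47, 58, 96, 108)
)
fit <- lm(Profit ~ Sales, data = toy)

# Prediction through predict(); newdata must use the predictor's name
pred <- predict(fit, newdata = data.frame(Sales = 100))

# The same prediction assembled manually from the coefficients
manual <- coef(fit)[["(Intercept)"]] + coef(fit)[["Sales"]] * 100

c(predict = unname(pred), manual = manual)
```

With the model from this document, the call would be `predict(mp, newdata = data.frame(Sales = 100))`.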
Regression model formula: Profit = b0 + b1*Sales
Interpretation of the regression model:
intercept = -114.063
slope/b1 = 0.42286
Profit = -114.06251 + 0.42286*Sales
With a slope of 0.423, every 1-unit increase in Sales raises the predicted value of the target variable (Profit) by 0.423 units.
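The slope interpretation can be verified directly from the fitted equation: two predictions whose Sales values differ by 1 differ by exactly the slope. The helper `predict_profit()` below is a hypothetical name introduced only for this illustration; the coefficients are the ones reported in the summary above.

```r
# Coefficients as reported by summary(mp)
b0 <- -114.06251
b1 <- 0.42286

# Hypothetical helper wrapping the fitted equation
predict_profit <- function(sales) b0 + b1 * sales

# Difference between predictions one Sales unit apart
predict_profit(101) - predict_profit(100)
```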
cor(cop$Sales, cop$Profit)
## [1] 0.9395785
summary(mp)
##
## Call:
## lm(formula = Profit ~ Sales, data = cop)
##
## Residuals:
## Min 1Q Median 3Q Max
## -407.07 -70.08 22.95 76.56 345.05
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -114.06251 32.62743 -3.496 0.000895 ***
## Sales 0.42286 0.01989 21.260 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 150.1 on 60 degrees of freedom
## Multiple R-squared: 0.8828, Adjusted R-squared: 0.8809
## F-statistic: 452 on 1 and 60 DF, p-value: < 2.2e-16
The p-value indicates whether a predictor has a significant effect on the target variable. The effect is considered significant when the p-value < 0.05.
Hypotheses:
- H0: the predictor has no effect on the target variable
- H1: the predictor has a significant effect on the target variable
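The p-value does not have to be read off the printed table; it is stored in the coefficient matrix of the summary object. A sketch on made-up data (`toy` and `fit` are hypothetical; the 0.05 threshold is the one stated above):

```r
# Hypothetical toy data with a clear linear trend
toy <- data.frame(
  Sales  = c(100, 200, 300, 400, 500, 600),
  Profit = c(18, 47, 58, 96, 108, 140)
)
fit <- lm(Profit ~ Sales, data = toy)

# Coefficient table: Estimate, Std. Error, t value, Pr(>|t|)
coefs   <- summary(fit)$coefficients
p_sales <- coefs["Sales", "Pr(>|t|)"]

# TRUE means H0 is rejected at the 5% level
p_sales < 0.05
```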
plot(x = cop$Sales, y = cop$Profit)
abline(mp, col = "red")
The model's R-squared is 88.3%: with Sales as the only predictor, the model explains 88.3% of the variance in Profit, and the remaining 11.7% is explained by other factors.
Because the p-value for Sales (< 2e-16) is below 0.05, H0 is rejected and H1 is accepted.
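In a simple linear regression with one predictor, R-squared equals the square of the Pearson correlation between predictor and target. The sketch below checks this with the correlation reported earlier (0.9395785) and then on a made-up data set (`toy` and `fit` are hypothetical):

```r
# Correlation between Sales and Profit as reported by cor()
r <- 0.9395785
round(r^2, 4)   # compare with Multiple R-squared in summary(mp)

# The same identity on hypothetical toy data
toy <- data.frame(
  Sales  = c(100, 200, 300, 400, 500),
  Profit = c(18, 47, 58, 96, 108)
)
fit <- lm(Profit ~ Sales, data = toy)

r_squared_from_cor <- cor(toy$Sales, toy$Profit)^2
r_squared_from_lm  <- summary(fit)$r.squared
c(from_cor = r_squared_from_cor, from_lm = r_squared_from_lm)
```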