1 Descriptive statistics
2 Visualization
3 Correlation analysis
4 Linear Regression Model
- 4.1 Simple
- 4.2 Multiple

Business Intelligence Assignment

Name: Lahfanda Dista Permata Putri

NRP: 5006201005

1 Descriptive statistics

1.1 Data description

The data used in this assignment is a bicycle sales dataset covering several countries in Europe, North America, and Australia, downloaded from Kaggle. It has 18 columns; the sales measures (Order_Quantity, Cost, Revenue, and Profit) are the main columns of interest.

#Loading R packages required
library(tidyverse)
library(lubridate)
library(dplyr)
library(ggplot2)
library(RColorBrewer)

#Loading the data files
sales <- read.csv("C:/Users/user/Videos/Sales.csv")
head(sales)

## # A tibble: 6 x 18
##   Date      Day Month  Year Customer_Age Age_Group Customer_Gender Country State
##   <chr>   <int> <chr> <int>        <int> <chr>     <chr>           <chr>   <chr>
## 1 2013-1~    26 Nove~  2013           19 Youth (<~ M               Canada  Brit~
## 2 2015-1~    26 Nove~  2015           19 Youth (<~ M               Canada  Brit~
## 3 2014-0~    23 March  2014           49 Adults (~ M               Austra~ New ~
## 4 2016-0~    23 March  2016           49 Adults (~ M               Austra~ New ~
## 5 2014-0~    15 May    2014           47 Adults (~ F               Austra~ New ~
## 6 2016-0~    15 May    2016           47 Adults (~ F               Austra~ New ~
## # i 9 more variables: Product_Category <chr>, Sub_Category <chr>,
## #   Product <chr>, Order_Quantity <int>, Unit_Cost <int>, Unit_Price <int>,
## #   Profit <int>, Cost <int>, Revenue <int>

1.2 Data Preprocessing

1.2.1 Identifying Data Types

str(sales)

## 'data.frame':    113036 obs. of  18 variables:
##  $ Date            : chr  "2013-11-26" "2015-11-26" "2014-03-23" "2016-03-23" ...
##  $ Day             : int  26 26 23 23 15 15 22 22 22 22 ...
##  $ Month           : chr  "November" "November" "March" "March" ...
##  $ Year            : int  2013 2015 2014 2016 2014 2016 2014 2016 2014 2016 ...
##  $ Customer_Age    : int  19 19 49 49 47 47 47 47 35 35 ...
##  $ Age_Group       : chr  "Youth (<25)" "Youth (<25)" "Adults (35-64)" "Adults (35-64)" ...
##  $ Customer_Gender : chr  "M" "M" "M" "M" ...
##  $ Country         : chr  "Canada" "Canada" "Australia" "Australia" ...
##  $ State           : chr  "British Columbia" "British Columbia" "New South Wales" "New South Wales" ...
##  $ Product_Category: chr  "Accessories" "Accessories" "Accessories" "Accessories" ...
##  $ Sub_Category    : chr  "Bike Racks" "Bike Racks" "Bike Racks" "Bike Racks" ...
##  $ Product         : chr  "Hitch Rack - 4-Bike" "Hitch Rack - 4-Bike" "Hitch Rack - 4-Bike" "Hitch Rack - 4-Bike" ...
##  $ Order_Quantity  : int  8 8 23 20 4 5 4 2 22 21 ...
##  $ Unit_Cost       : int  45 45 45 45 45 45 45 45 45 45 ...
##  $ Unit_Price      : int  120 120 120 120 120 120 120 120 120 120 ...
##  $ Profit          : int  590 590 1366 1188 238 297 199 100 1096 1046 ...
##  $ Cost            : int  360 360 1035 900 180 225 180 90 990 945 ...
##  $ Revenue         : int  950 950 2401 2088 418 522 379 190 2086 1991 ...

1.2.2 Converting the Date Format

# Parse the Date column from character to Date (year-month-day)
sales$Date <- ymd(sales$Date)
head(sales)

## # A tibble: 6 x 18
##   Date         Day Month     Year Customer_Age Age_Group Customer_Gender Country
##   <date>     <int> <chr>    <int>        <int> <chr>     <chr>           <chr>  
## 1 2013-11-26    26 November  2013           19 Youth (<~ M               Canada 
## 2 2015-11-26    26 November  2015           19 Youth (<~ M               Canada 
## 3 2014-03-23    23 March     2014           49 Adults (~ M               Austra~
## 4 2016-03-23    23 March     2016           49 Adults (~ M               Austra~
## 5 2014-05-15    15 May       2014           47 Adults (~ F               Austra~
## 6 2016-05-15    15 May       2016           47 Adults (~ F               Austra~
## # i 10 more variables: State <chr>, Product_Category <chr>, Sub_Category <chr>,
## #   Product <chr>, Order_Quantity <int>, Unit_Cost <int>, Unit_Price <int>,
## #   Profit <int>, Cost <int>, Revenue <int>
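
A quick way to confirm that a conversion like this succeeded is to check the class of the result and count the values that failed to parse. The sketch below is only illustrative: it uses a small hypothetical vector (`raw_dates`) instead of the real `sales$Date` column, and base R's `as.Date()` in place of `ymd()` so that it runs without extra packages.

```r
# Hypothetical sample of date strings; the last one is deliberately malformed
raw_dates <- c("2013-11-26", "2015-11-26", "2014-03-23", "not-a-date")

# Parse with an explicit format; strings that do not match become NA
parsed <- as.Date(raw_dates, format = "%Y-%m-%d")

class(parsed)             # class of the parsed vector
sum(is.na(parsed))        # how many strings failed to parse
format(parsed[1], "%Y")   # pull the year out of the first date
```

On the real data, the same two checks (`class(sales$Date)` and `sum(is.na(sales$Date))`) would show whether any rows need attention before time-based summaries are computed.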

1.3 Statistical Summary of Data

summary(sales)

##       Date                 Day           Month                Year     
##  Min.   :2011-01-01   Min.   : 1.00   Length:113036      Min.   :2011  
##  1st Qu.:2013-12-22   1st Qu.: 8.00   Class :character   1st Qu.:2013  
##  Median :2014-06-27   Median :16.00   Mode  :character   Median :2014  
##  Mean   :2014-11-23   Mean   :15.67                      Mean   :2014  
##  3rd Qu.:2016-01-09   3rd Qu.:23.00                      3rd Qu.:2016  
##  Max.   :2016-07-31   Max.   :31.00                      Max.   :2016  
##   Customer_Age    Age_Group         Customer_Gender      Country         
##  Min.   :17.00   Length:113036      Length:113036      Length:113036     
##  1st Qu.:28.00   Class :character   Class :character   Class :character  
##  Median :35.00   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :35.92                                                           
##  3rd Qu.:43.00                                                           
##  Max.   :87.00                                                           
##     State           Product_Category   Sub_Category         Product         
##  Length:113036      Length:113036      Length:113036      Length:113036     
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##  Order_Quantity   Unit_Cost        Unit_Price         Profit       
##  Min.   : 1.0   Min.   :   1.0   Min.   :   2.0   Min.   :  -30.0  
##  1st Qu.: 2.0   1st Qu.:   2.0   1st Qu.:   5.0   1st Qu.:   29.0  
##  Median :10.0   Median :   9.0   Median :  24.0   Median :  101.0  
##  Mean   :11.9   Mean   : 267.3   Mean   : 452.9   Mean   :  285.1  
##  3rd Qu.:20.0   3rd Qu.:  42.0   3rd Qu.:  70.0   3rd Qu.:  358.0  
##  Max.   :32.0   Max.   :2171.0   Max.   :3578.0   Max.   :15096.0  
##       Cost            Revenue       
##  Min.   :    1.0   Min.   :    2.0  
##  1st Qu.:   28.0   1st Qu.:   63.0  
##  Median :  108.0   Median :  223.0  
##  Mean   :  469.3   Mean   :  754.4  
##  3rd Qu.:  432.0   3rd Qu.:  800.0  
##  Max.   :42978.0   Max.   :58074.0

table(sales$Age_Group)

## 
##       Adults (35-64)        Seniors (64+) Young Adults (25-34) 
##                55824                  730                38654 
##          Youth (<25) 
##                17828

table(sales$Customer_Gender)

## 
##     F     M 
## 54724 58312

table(sales$Country)

## 
##      Australia         Canada         France        Germany United Kingdom 
##          23936          14178          10998          11098          13620 
##  United States 
##          39206

table(sales$Product_Category)

## 
## Accessories       Bikes    Clothing 
##       70120       25982       16934

The output above shows that:

The data contains 113,036 observations (from January 2011 to July 2016) across 6 countries (Australia, Canada, France, Germany, United Kingdom, and United States).
Customers are between 17 and 87 years old, with the most transactions coming from:
- The Adults age group (35-64 years), with 55,824 transactions.
- Male customers, with 58,312 transactions.
- The United States, with 39,206 transactions.
The product categories sold are accessories, bikes, and clothing, with unit prices ranging from $2 to $3578.
The unit cost ranges from $1 to $2171, with a mean of $267.3.
The order quantity per recorded transaction ranges from 1 to 32 pcs, with a mean of about 12 pcs.
Revenue ranges from $2 to $58074, with a mean of $754.4.
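
The "most transactions" statements can also be read off programmatically instead of by eye. As a minimal sketch, the counts below are copied from the `table(sales$Age_Group)` output; the names `age_counts`, `top_group`, and `top_share` are introduced here for illustration only.

```r
# Counts copied from the table(sales$Age_Group) output above
age_counts <- c(
  "Adults (35-64)"       = 55824,
  "Seniors (64+)"        = 730,
  "Young Adults (25-34)" = 38654,
  "Youth (<25)"          = 17828
)

# Group with the most transactions and its share of all transactions (%)
top_group <- names(which.max(age_counts))
top_share <- round(100 * max(age_counts) / sum(age_counts), 1)

top_group
top_share
sum(age_counts)  # should match the number of observations
```

With the real data frame, `names(which.max(table(sales$Age_Group)))` would give the same group directly.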

1.4 Other information

Revenue by country and product category

rev_country_product <- sales %>%
  group_by(Country, Product_Category) %>%
  summarize(Total_Revenue = sum(Revenue))
rev_country_product

## # A tibble: 18 x 3
##    Country        Product_Category Total_Revenue
##    <chr>          <chr>                    <int>
##  1 Australia      Accessories            2746405
##  2 Australia      Bikes                 16952818
##  3 Australia      Clothing               1602836
##  4 Canada         Accessories            2282940
##  5 Canada         Bikes                  4275003
##  6 Canada         Clothing               1377795
##  7 France         Accessories            1388053
##  8 France         Bikes                  6324125
##  9 France         Clothing                720694
## 10 Germany        Accessories            1548818
## 11 Germany        Bikes                  6792782
## 12 Germany        Clothing                636996
## 13 United Kingdom Accessories            1873023
## 14 United Kingdom Bikes                  7856994
## 15 United Kingdom Clothing                916179
## 16 United States  Accessories            5278753
## 17 United States  Bikes                 19580412
## 18 United States  Clothing               3116382

Profit by country and product category

prof_country_product <- sales %>%
  group_by(Country, Product_Category) %>%
  summarize(Total_Profit = sum(Profit))
prof_country_product

## # A tibble: 18 x 3
##    Country        Product_Category Total_Profit
##    <chr>          <chr>                   <int>
##  1 Australia      Accessories           1518253
##  2 Australia      Bikes                 4837357
##  3 Australia      Clothing               420420
##  4 Canada         Accessories           1418383
##  5 Canada         Bikes                 1690962
##  6 Canada         Clothing               607951
##  7 France         Accessories            779216
##  8 France         Bikes                 1910747
##  9 France         Clothing               190319
## 10 Germany        Accessories            906723
## 11 Germany        Bikes                 2287302
## 12 Germany        Clothing               165970
## 13 United Kingdom Accessories           1144701
## 14 United Kingdom Bikes                 2974139
## 15 United Kingdom Clothing               295013
## 16 United States  Accessories           3095101
## 17 United States  Bikes                 6818769
## 18 United States  Clothing              1159774

Global revenue performance by year

rev_year <- sales %>%
  group_by(Year) %>%
  summarize(Total_Revenue = sum(Revenue))
rev_year

## # A tibble: 6 x 2
##    Year Total_Revenue
##   <int>         <int>
## 1  2011       8964888
## 2  2012       9175983
## 3  2013      15240037
## 4  2014      14152724
## 5  2015      20023991
## 6  2016      17713385

Global profit performance by year

prof_year <- sales %>%
  group_by(Year) %>%
  summarize(Total_Profit = sum(Profit))
prof_year

## # A tibble: 6 x 2
##    Year Total_Profit
##   <int>        <int>
## 1  2011      2881301
## 2  2012      2951993
## 3  2013      5959208
## 4  2014      5864087
## 5  2015      7528563
## 6  2016      7035948

2 Visualization

2.1 Bar Chart

Total Revenue by Country

# Data
rev_country <- aggregate(Revenue ~ Country, data = sales, sum)

# Create the bar chart with ggplot
ggplot(rev_country, aes(x = Country, y = Revenue)) +
  geom_bar(stat = "identity", fill = "skyblue") +
  theme_minimal() +
  scale_y_continuous(labels = scales::comma_format()) +
  labs(title = 'Total Revenue by Country', 
       x = 'Country', 
       y = 'Total Revenue') +
  theme(plot.title = element_text(size = 16, hjust = 0.5))

The plot above shows that the United States has the highest total revenue, while Canada has the lowest.
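
The ranking read from the bar chart can be cross-checked against the numbers in Section 1.4. The sketch below rebuilds the "Revenue by country and product category" table by hand (values copied from that output) and sums it per country; `rev_tbl` and `by_country` are illustrative names, not objects from the original analysis.

```r
# Values copied from the "Revenue by country and product category" table
rev_tbl <- data.frame(
  Country = rep(c("Australia", "Canada", "France", "Germany",
                  "United Kingdom", "United States"), each = 3),
  Product_Category = rep(c("Accessories", "Bikes", "Clothing"), times = 6),
  Total_Revenue = c(2746405, 16952818, 1602836,
                    2282940,  4275003, 1377795,
                    1388053,  6324125,  720694,
                    1548818,  6792782,  636996,
                    1873023,  7856994,  916179,
                    5278753, 19580412, 3116382)
)

# Sum revenue per country and sort from highest to lowest
by_country <- aggregate(Total_Revenue ~ Country, data = rev_tbl, FUN = sum)
by_country <- by_country[order(-by_country$Total_Revenue), ]
by_country

highest <- by_country$Country[1]
lowest  <- by_country$Country[nrow(by_country)]
```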

Total Profit by Country

# Data
prof_country <- aggregate(Profit ~ Country, data = sales, sum)

# Create the bar chart with ggplot
ggplot(prof_country, aes(x = Country, y = Profit)) +
  geom_bar(stat = "identity", fill = "skyblue") +
  theme_minimal() +
  scale_y_continuous(labels = scales::comma_format()) +
  labs(title = 'Total Profit by Country', 
       x = 'Country', 
       y = 'Total Profit') +
  theme(plot.title = element_text(size = 16, hjust = 0.5))

As in the previous plot, the United States has the highest total profit. The difference is that here France, rather than Canada, has the lowest total profit.

2.2 Histogram

2.2.1 Single Histogram

# Filter the data for each gender
male_data <- sales$Customer_Age[sales$Customer_Gender == "M"]
female_data <- sales$Customer_Age[sales$Customer_Gender == "F"]

# Create the histograms with ggplot
ggplot(data.frame(Customer_Age_M = male_data), aes(x = Customer_Age_M)) +
  geom_histogram(binwidth = 2, fill = "darkblue", alpha = 0.5) +
  labs(title = "Customer Age Histogram (Male)",
       x = "Customer Age",
       y = "Frequency") +
  scale_x_continuous(breaks = seq(0, 90, by = 10), limits = c(0, 90)) +
  scale_y_continuous(breaks = seq(0, 4500, by = 500)) +
  theme_minimal()

ggplot(data.frame(Customer_Age_F = female_data), aes(x = Customer_Age_F)) +
  geom_histogram(binwidth = 2, fill = "pink", alpha = 0.5) +
  labs(title = "Customer Age Histogram (Female)",
       x = "Customer Age",
       y = "Frequency") +
  scale_x_continuous(breaks = seq(0, 90, by = 10), limits = c(0, 90)) +
  scale_y_continuous(breaks = seq(0, 4500, by = 500)) +
  theme_minimal()

2.2.2 Double Histogram

# Build the dataset to use in ggplot
sales_data <- data.frame(
  Customer_Age = c(sales$Customer_Age[sales$Customer_Gender == "M"],
                   sales$Customer_Age[sales$Customer_Gender == "F"]),
  Customer_Gender = rep(c("Male", "Female"),
                        times = c(sum(sales$Customer_Gender == "M"),
                                  sum(sales$Customer_Gender == "F")))
)

# Create the overlaid histograms with ggplot
ggplot(sales_data, aes(x = Customer_Age, fill = Customer_Gender)) +
  geom_histogram(binwidth = 2, position = "identity", alpha = 0.3) +
  labs(title = "Customer Age by Gender Histograms",
       x = "Customer Age",
       y = "Frequency") +
  scale_x_continuous(breaks = seq(0, 90, by = 10), limits = c(0, 90)) +
  scale_y_continuous(breaks = seq(0, 4500, by = 500)) +
  scale_fill_manual(values = c("Male" = "darkblue", "Female" = "pink")) +
  theme_minimal()

2.3 Boxplot

Order Quantity by Product Category

ggplot(data=sales, mapping = aes(x=Product_Category, y=Order_Quantity, fill=Product_Category)) +
  geom_boxplot()

The boxplot of order quantity by product category helps characterize the distribution of the data. Ranked by interquartile range (IQR), Accessories has the widest spread, followed by Clothing and then Bikes. Accessories and Clothing share the same median, which is higher than the median for Bikes. Accessories and Clothing have no outliers, and their upper whiskers are slightly longer than their lower whiskers. Bikes has a single outlier above its upper whisker. An outlier above the box, combined with a longer upper whisker, suggests that the three categories have right-skewed distributions (positive skewness).
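
To make the terms used above concrete, the following sketch computes the quartiles, the IQR, and the usual 1.5 x IQR fences that a boxplot uses to flag outliers. The vector `qty` is hypothetical and is not taken from the sales data; the other names are also introduced here only for illustration.

```r
# Hypothetical order quantities for one product category (not from the data)
qty <- c(1, 2, 3, 4, 5, 6, 7, 8, 30)

# Quartiles and interquartile range (Q3 - Q1)
q   <- quantile(qty, probs = c(0.25, 0.5, 0.75))
iqr <- IQR(qty)

# Points more than 1.5 * IQR beyond the box are drawn as outliers
lower_fence <- q[["25%"]] - 1.5 * iqr
upper_fence <- q[["75%"]] + 1.5 * iqr
outliers <- qty[qty < lower_fence | qty > upper_fence]

q
iqr
outliers
```

Applying the same steps per category, for example with `tapply(sales$Order_Quantity, sales$Product_Category, IQR)`, would give the spread ranking described above as numbers rather than as a visual impression.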

2.4 Pie Chart

Global Revenue Performance by Product Category

# Data
rev_pie <- sales %>% 
  group_by(Product_Category) %>% 
  summarise(Revenue = sum(Revenue))

# Create the pie chart with ggplot
ggplot(rev_pie, aes(x = "", y = Revenue, fill = Product_Category)) +
  geom_bar(stat = "identity", width = 1) +
  coord_polar("y") +
  theme_void() +
  theme(legend.position = "bottom") +
  scale_fill_brewer(palette = "Pastel1") +
  geom_text(aes(label = sprintf("%1.1f%%", (Revenue / sum(Revenue)) * 100)),
            position = position_stack(vjust = 0.5)) +
  labs(title = "Total Revenues of Each Product Category", 
       caption = "Note: Percentages are relative to the total revenues.")

The chart shows that the largest share of revenue, 72.5%, comes from bike sales. The earlier boxplot shows that order quantities for bikes are actually lower than those for accessories and clothing. However, because bikes are far more expensive than products in the other two categories, they are still the largest contributor to revenue.
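
The percentage labels on the pie chart can be reproduced from the table in Section 1.4. As a sketch, the matrix below holds the values copied from the "Revenue by country and product category" output, with categories in rows and countries in columns; `rev_mat`, `cat_totals`, and `cat_share` are illustrative names.

```r
# Values copied from the "Revenue by country and product category" table;
# filled column by column, so each column is one country
rev_mat <- matrix(
  c(2746405, 16952818, 1602836,   # Australia
    2282940,  4275003, 1377795,   # Canada
    1388053,  6324125,  720694,   # France
    1548818,  6792782,  636996,   # Germany
    1873023,  7856994,  916179,   # United Kingdom
    5278753, 19580412, 3116382),  # United States
  nrow = 3,
  dimnames = list(
    c("Accessories", "Bikes", "Clothing"),
    c("Australia", "Canada", "France", "Germany",
      "United Kingdom", "United States")
  )
)

# Total revenue per category and each category's share of the total (%)
cat_totals <- rowSums(rev_mat)
cat_share  <- round(100 * prop.table(cat_totals), 1)
cat_share
```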

Global Profit Performance by Country

# Data
df_pie <- sales %>% 
  group_by(Country) %>% 
  summarise(Profit = sum(Profit))

# Create the pie chart with ggplot
ggplot(df_pie, aes(x = "", y = Profit, fill = Country)) +
  geom_bar(stat = "identity", width = 1) +
  coord_polar("y") +
  theme_void() +
  theme(legend.position = "bottom") +
  scale_fill_brewer(palette = "Pastel2") +
  geom_text(aes(label = sprintf("%1.1f%%", (Profit / sum(Profit)) * 100)),
            position = position_stack(vjust = 0.5)) +
  labs(title = "Total Profits of Each Global Store Location", 
       caption = "Note: Percentages are relative to the total profits.")

The chart shows that sales from outlets in the United States contribute more than one third of the total profit across all countries.
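
The "more than one third" statement can be checked with a few lines of arithmetic. The sketch below copies the values from the "Profit by country and product category" output in Section 1.4 and sums them per country with `tapply()`; the vector names are introduced here for illustration.

```r
# Values copied from the "Profit by country and product category" table
country <- rep(c("Australia", "Canada", "France", "Germany",
                 "United Kingdom", "United States"), each = 3)
profit <- c(1518253, 4837357,  420420,
            1418383, 1690962,  607951,
             779216, 1910747,  190319,
             906723, 2287302,  165970,
            1144701, 2974139,  295013,
            3095101, 6818769, 1159774)

# Total profit per country and each country's share of the overall profit
profit_by_country <- tapply(profit, country, sum)
profit_share <- profit_by_country / sum(profit_by_country)

round(100 * profit_share, 1)
profit_share[["United States"]] > 1 / 3
```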

3 Correlation analysis

# Create a new data frame containing only the numeric variables
new_sales <- data.frame(
  Qty = sales$Order_Quantity,
  Unit_Cost = sales$Unit_Cost,
  Unit_Price = sales$Unit_Price,
  Profit = sales$Profit,
  Cost = sales$Cost,
  Revenue = sales$Revenue
)

# Compute the pairwise correlations between the variables
cor_matrix <- cor(new_sales)

# Display the correlation matrix
print(cor_matrix)

##                   Qty  Unit_Cost Unit_Price     Profit       Cost    Revenue
## Qty         1.0000000 -0.5158350 -0.5159246 -0.2388634 -0.3403816 -0.3128950
## Unit_Cost  -0.5158350  1.0000000  0.9978936  0.7410203  0.8298690  0.8178650
## Unit_Price -0.5159246  0.9978936  1.0000000  0.7498702  0.8263011  0.8185218
## Profit     -0.2388634  0.7410203  0.7498702  1.0000000  0.9022330  0.9565717
## Cost       -0.3403816  0.8298690  0.8263011  0.9022330  1.0000000  0.9887584
## Revenue    -0.3128950  0.8178650  0.8185218  0.9565717  0.9887584  1.0000000

4 Linear Regression Model

4.1 Simple

# Fit a simple linear regression model
model1 <- lm(Profit ~ Revenue, data = new_sales)

# Display the model summary
summary(model1)

## 
## Call:
## lm(formula = Profit ~ Revenue, data = new_sales)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4199.7   -32.8   -15.4    41.6  1215.3 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 3.486e+01  4.542e-01   76.74   <2e-16 ***
## Revenue     3.317e-01  3.006e-04 1103.29   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 132.3 on 113034 degrees of freedom
## Multiple R-squared:  0.915,  Adjusted R-squared:  0.915 
## F-statistic: 1.217e+06 on 1 and 113034 DF,  p-value: < 2.2e-16

The regression output supports several conclusions:

Regression coefficients:
- Intercept ($\beta_0$): 34.86. This is the estimated value of the dependent variable (Profit) when the independent variable (Revenue) equals 0.
- Revenue ($\beta_1$): 0.3317. For each one-unit increase in Revenue, Profit is estimated to increase by about 0.3317 units.
- Linear regression model: \[\text{Profit} = 34.86 + 0.3317 \times \text{Revenue}\]
Statistical significance:
- Both coefficients (Intercept and Revenue) have very small p-values (< 0.001), so both differ significantly from zero. In particular, Revenue has a statistically significant linear association with Profit.
R-squared:
- Multiple R-squared is 0.915, which means that about 91.5% of the variability in Profit is explained by Revenue. The model therefore accounts for most of the variation in the data.
F-statistic:
- The F-statistic is very large (1.217e+06) with a very small p-value (< 2.2e-16), which indicates that the model as a whole is highly significant.

In summary, this linear regression model is statistically significant and explains most of the variation in the dependent variable (Profit) using the independent variable (Revenue). The practical interpretation of the coefficients should still take the business context into account and be supported by further analysis.
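
To illustrate how the fitted equation is used, the sketch below plugs the rounded coefficients from `summary(model1)` into a small helper and also checks that, for a simple regression, R-squared equals the squared Pearson correlation reported in Section 3. The helper `predict_profit()` and the example revenue values are hypothetical and only serve as an illustration.

```r
# Rounded coefficients copied from the summary(model1) output above
b0 <- 34.86    # intercept
b1 <- 0.3317   # slope for Revenue

# Hypothetical helper: estimated Profit for a given Revenue
predict_profit <- function(revenue) b0 + b1 * revenue

predict_profit(c(100, 1000))

# In simple linear regression, R-squared is the squared correlation
# between the response and the predictor (value taken from cor_matrix)
r <- 0.9565717
r_squared <- r^2
round(r_squared, 3)
```

With the fitted object available, `predict(model1, newdata = data.frame(Revenue = c(100, 1000)))` would do the same using the unrounded coefficients.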

4.2 Multiple

# Fit a multiple linear regression model
model2 <- lm(Cost ~ Qty + Unit_Cost, data = new_sales)

# Display the model summary
summary(model2)

## 
## Call:
## lm(formula = Cost ~ Qty + Unit_Cost, data = new_sales)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##   -909   -148    -25     45  40577 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -45.781389   2.987759  -15.32   <2e-16 ***
## Qty          11.057563   0.176229   62.74   <2e-16 ***
## Unit_Cost     1.434725   0.003065  468.15   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 485.3 on 113033 degrees of freedom
## Multiple R-squared:  0.6992, Adjusted R-squared:  0.6992 
## F-statistic: 1.313e+05 on 2 and 113033 DF,  p-value: < 2.2e-16