1. Examining the basic data characteristics

a. dim() function

df <- read.csv("E:/Binus University/Semester 2/Data Mining and Visualization/bike_buyers.csv")
dim(df)
## [1] 1000   13

EXPLANATION The bike buyers data set consists of 1000 rows and 13 columns.

b. str() function

str(df)
## 'data.frame':    1000 obs. of  13 variables:
##  $ ï..ID           : int  12496 24107 14177 24381 25597 13507 27974 19364 22155 19280 ...
##  $ Marital.Status  : chr  "Married" "Married" "Married" "Single" ...
##  $ Gender          : chr  "Female" "Male" "Male" "" ...
##  $ Income          : int  40000 30000 80000 70000 30000 10000 160000 40000 20000 NA ...
##  $ Children        : int  1 3 5 0 0 2 2 1 2 2 ...
##  $ Education       : chr  "Bachelors" "Partial College" "Partial College" "Bachelors" ...
##  $ Occupation      : chr  "Skilled Manual" "Clerical" "Professional" "Professional" ...
##  $ Home.Owner      : chr  "Yes" "Yes" "No" "Yes" ...
##  $ Cars            : int  0 1 2 1 0 0 4 0 2 1 ...
##  $ Commute.Distance: chr  "0-1 Miles" "0-1 Miles" "2-5 Miles" "5-10 Miles" ...
##  $ Region          : chr  "Europe" "Europe" "Europe" "Pacific" ...
##  $ Age             : int  42 43 60 41 36 50 33 43 58 NA ...
##  $ Purchased.Bike  : chr  "No" "No" "No" "Yes" ...

EXPLANATION ID : Identifier of the bike buyer, stored as integer. The column name appears as ï..ID because the byte-order mark at the start of the file was not stripped on import.

Marital.Status : Marital status of the buyer, stored as character

Gender : Gender of the buyer, stored as character

Income : Income of the buyer, stored as integer

Children : Number of children the buyer has, stored as integer

Education : Educational background of the buyer, stored as character

Occupation : Occupation of the buyer, stored as character

Home.Owner : Whether the buyer owns a home, stored as character

Cars : Number of cars the buyer owns, stored as integer

Commute.Distance : Distance between the buyer's home and workplace, stored as character

Region : Region where the buyer lives, stored as character

Age : Age of the buyer, stored as integer

Purchased.Bike : Whether the buyer ended up purchasing a bike, stored as character
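The ï..ID name and the blank Gender entry seen in the str() output can both be handled at import time. The following is a minimal sketch, assuming the CSV is UTF-8 with a byte-order mark; the temporary file and its two columns are made up for illustration and are not the real data set.

```r
# Write a tiny CSV that starts with a UTF-8 byte-order mark (BOM)
tmp <- tempfile(fileext = ".csv")
con <- file(tmp, open = "wb")
writeBin(as.raw(c(0xEF, 0xBB, 0xBF)), con)
writeBin(charToRaw("ID,Gender\n12496,Female\n24107,\n"), con)
close(con)

# fileEncoding = "UTF-8-BOM" drops the BOM so the first name stays "ID";
# na.strings turns blank cells into NA instead of empty strings
demo <- read.csv(tmp, fileEncoding = "UTF-8-BOM", na.strings = c("", "NA"))
names(demo)
is.na(demo$Gender)
```

With blanks already converted to NA, later steps only need is.na() to find missing values.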

c. BasicSummary() function

BasicSummary <- function(df, dgts = 3){
  # Create a basic summary of variables in the data frame df,
  # a data frame with one row for each column of df giving the
  # variable name, type, number of unique levels, the most
  # frequent level, its frequency and corresponding fraction of
  # records, the number of missing values and its corresponding
  # fraction of records
  m <- ncol(df)
  varNames <- colnames(df)
  varType <- vector("character", m)
  topLevel <- vector("character", m)
  topCount <- vector("numeric", m)
  missCount <- vector("numeric", m)
  levels <- vector("numeric", m)

  for (i in 1:m){
    x <- df[, i]
    varType[i] <- class(x)
    # Tabulate the values; NA is counted as its own level when present
    xtab <- table(x, useNA = "ifany")
    levels[i] <- length(xtab)
    nums <- as.numeric(xtab)
    maxnum <- max(nums)
    topCount[i] <- maxnum
    maxIndex <- which.max(nums)
    lvls <- names(xtab)
    topLevel[i] <- lvls[maxIndex]
    # Treat NA, empty strings and single blanks as missing
    missIndex <- which((is.na(x)) | (x == "") | (x == " "))
    missCount[i] <- length(missIndex)
  }
  n <- nrow(df)
  topFrac <- round(topCount/n, digits = dgts)
  missFrac <- round(missCount/n, digits = dgts)

  summaryFrame <- data.frame(variable = varNames, type = varType,
                             levels = levels, topLevel = topLevel,
                             topCount = topCount, topFrac = topFrac,
                             missFreq = missCount, missFrac = missFrac)
  return(summaryFrame)
}

BasicSummary(df)
##            variable      type levels      topLevel topCount topFrac missFreq
## 1             ï..ID   integer   1000         11000        1   0.001        0
## 2    Marital.Status character      3       Married      535   0.535        7
## 3            Gender character      3          Male      500   0.500       11
## 4            Income   integer     17         60000      165   0.165        6
## 5          Children   integer      7             0      274   0.274        8
## 6         Education character      5     Bachelors      306   0.306        0
## 7        Occupation character      5  Professional      276   0.276        0
## 8        Home.Owner character      3           Yes      682   0.682        4
## 9              Cars   integer      6             2      342   0.342        9
## 10 Commute.Distance character      5     0-1 Miles      366   0.366        0
## 11           Region character      3 North America      508   0.508        0
## 12              Age   integer     54            40       40   0.040        8
## 13   Purchased.Bike character      2            No      519   0.519        0
##    missFrac
## 1     0.000
## 2     0.007
## 3     0.011
## 4     0.006
## 5     0.008
## 6     0.000
## 7     0.000
## 8     0.004
## 9     0.009
## 10    0.000
## 11    0.000
## 12    0.008
## 13    0.000

EXPLANATION The levels column counts NA and the empty string as levels of their own, so for variables with missing values it is one higher than the number of distinct observed values.

ID : has 1000 levels, which means every buyer ID is different. The value 11000 is listed as the top level only because it comes first; like every other ID it occurs once (fraction 0.001). The type is integer and there are no missing values.

Marital.Status : has 3 levels (2 observed values plus the blank). "Married" is the most frequent, occurring 535 times (fraction 0.535). The type is character and there are 7 missing values (fraction 0.007).

Gender : has 3 levels (2 observed values plus the blank). "Male" is the most frequent, occurring 500 times (fraction 0.500). The type is character and there are 11 missing values (fraction 0.011).

Income : has 17 levels (16 observed values plus NA). The value 60000 is the most frequent, occurring 165 times (fraction 0.165). The type is integer and there are 6 missing values (fraction 0.006).

Children : has 7 levels (6 observed values plus NA). The value 0 is the most frequent, occurring 274 times (fraction 0.274). The type is integer and there are 8 missing values (fraction 0.008).

Education : has 5 levels. "Bachelors" is the most frequent, occurring 306 times (fraction 0.306). The type is character and there are no missing values.

Occupation : has 5 levels. "Professional" is the most frequent, occurring 276 times (fraction 0.276). The type is character and there are no missing values.

Home.Owner : has 3 levels (2 observed values plus the blank). "Yes" is the most frequent, occurring 682 times (fraction 0.682). The type is character and there are 4 missing values (fraction 0.004).

Cars : has 6 levels (5 observed values plus NA). The value 2 is the most frequent, occurring 342 times (fraction 0.342). The type is integer and there are 9 missing values (fraction 0.009).

Commute.Distance : has 5 levels. "0-1 Miles" is the most frequent, occurring 366 times (fraction 0.366). The type is character and there are no missing values.

Region : has 3 levels. "North America" is the most frequent, occurring 508 times (fraction 0.508). The type is character and there are no missing values.

Age : has 54 levels (53 observed values plus NA). The value 40 is the most frequent, occurring 40 times (fraction 0.040). The type is integer and there are 8 missing values (fraction 0.008).

Purchased.Bike : has 2 levels, Yes and No. "No" is the most frequent, occurring 519 times (fraction 0.519). The type is character and there are no missing values.
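The missing-value counts reported by BasicSummary() can be cross-checked with a much shorter helper. This is a minimal sketch on a made-up data frame; count_missing is a hypothetical name, and it assumes, like BasicSummary(), that NA and blank strings both count as missing.

```r
# Toy data with one blank string and two NA values (illustrative only)
toy <- data.frame(
  Gender = c("Male", "", "Female", NA),
  Income = c(40000, NA, 30000, 80000),
  stringsAsFactors = FALSE
)

# Count NA values and blank (or whitespace-only) entries in one vector
count_missing <- function(x) {
  sum(is.na(x) | trimws(as.character(x)) == "")
}

miss <- sapply(toy, count_missing)
miss
miss / nrow(toy)  # fraction of records missing per column
```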

2. Examining Summary Statistics

a. sapply() function

# Compute the mean of each column
sapply(df[, c(1,4,5,9,12)], mean, na.rm=TRUE)
##        ï..ID       Income     Children         Cars          Age 
## 19965.992000 56267.605634     1.910282     1.455096    44.181452
# Compute quartiles
sapply(df[, c(1,4,5,9,12)], quantile, na.rm=TRUE)
##         ï..ID Income Children Cars Age
## 0%   11000.00  10000        0    0  25
## 25%  15290.75  30000        0    1  35
## 50%  19744.00  60000        2    1  43
## 75%  24470.75  70000        3    2  52
## 100% 29447.00 170000        5    4  89

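EXPLANATION:
1. The first quartile of Children is 0, so at least 25% of buyers have no children. The first quartile of Cars is 1, so at least 25% of buyers own at most one car.
2. The third quartile of Income is 70000, so 75% of buyers have an income of 70000 or less.
3. Quartile 1: Income = 30000, Children = 0, Cars = 1, Age = 35.
4. Quartile 2 (median): Income = 60000, Children = 2, Cars = 1, Age = 43.
5. Quartile 3: Income = 70000, Children = 3, Cars = 2, Age = 52.
The ID column is only an identifier, so its mean and quartiles carry no analytical meaning.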
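How a quartile translates into a statement about a share of records can be checked directly. The sketch below uses a made-up vector of child counts, not the real column, and relies only on base R's quantile().

```r
# Illustrative child counts for eight buyers (not the real data)
children <- c(0, 0, 0, 1, 2, 2, 3, 5)

# First quartile; names = FALSE drops the "25%" label
q1 <- quantile(children, probs = 0.25, names = FALSE)
q1

# Share of records at or below the first quartile: always at least 25%
share_at_or_below <- mean(children <= q1)
share_at_or_below
```

The share can exceed 25% when many records tie at the quartile value, which is why "at least 25%" is the safe reading.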

b. describe() function

library(Hmisc)
describe(df)
## df 
## 
##  13  Variables      1000  Observations
## --------------------------------------------------------------------------------
## ï..ID 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     1000        0     1000        1    19966     6176    11781    12627 
##      .25      .50      .75      .90      .95 
##    15291    19744    24471    27544    28413 
## 
## lowest : 11000 11047 11061 11090 11116, highest: 29337 29355 29380 29424 29447
## --------------------------------------------------------------------------------
## Marital.Status 
##        n  missing distinct 
##      993        7        2 
##                           
## Value      Married  Single
## Frequency      535     458
## Proportion   0.539   0.461
## --------------------------------------------------------------------------------
## Gender 
##        n  missing distinct 
##      989       11        2 
##                         
## Value      Female   Male
## Frequency     489    500
## Proportion  0.494  0.506
## --------------------------------------------------------------------------------
## Income 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      994        6       16    0.986    56268    34273    10000    20000 
##      .25      .50      .75      .90      .95 
##    30000    60000    70000   100000   120000 
## 
## lowest :  10000  20000  30000  40000  50000, highest: 120000 130000 150000 160000 170000
##                                                                          
## Value       10000  20000  30000  40000  50000  60000  70000  80000  90000
## Frequency      73     74    134    153     40    165    123     90     38
## Proportion  0.073  0.074  0.135  0.154  0.040  0.166  0.124  0.091  0.038
##                                                            
## Value      100000 110000 120000 130000 150000 160000 170000
## Frequency      29     16     17     32      4      3      3
## Proportion  0.029  0.016  0.017  0.032  0.004  0.003  0.003
## --------------------------------------------------------------------------------
## Children 
##        n  missing distinct     Info     Mean      Gmd 
##      992        8        6     0.96     1.91    1.827 
## 
## lowest : 0 1 2 3 4, highest: 1 2 3 4 5
##                                               
## Value          0     1     2     3     4     5
## Frequency    274   169   209   133   126    81
## Proportion 0.276 0.170 0.211 0.134 0.127 0.082
## --------------------------------------------------------------------------------
## Education 
##        n  missing distinct 
##     1000        0        5 
## 
## lowest : Bachelors           Graduate Degree     High School         Partial College     Partial High School
## highest: Bachelors           Graduate Degree     High School         Partial College     Partial High School
##                                                                       
## Value                Bachelors     Graduate Degree         High School
## Frequency                  306                 174                 179
## Proportion               0.306               0.174               0.179
##                                                   
## Value          Partial College Partial High School
## Frequency                  265                  76
## Proportion               0.265               0.076
## --------------------------------------------------------------------------------
## Occupation 
##        n  missing distinct 
##     1000        0        5 
## 
## lowest : Clerical       Management     Manual         Professional   Skilled Manual
## highest: Clerical       Management     Manual         Professional   Skilled Manual
##                                                                       
## Value            Clerical     Management         Manual   Professional
## Frequency             177            173            119            276
## Proportion          0.177          0.173          0.119          0.276
##                          
## Value      Skilled Manual
## Frequency             255
## Proportion          0.255
## --------------------------------------------------------------------------------
## Home.Owner 
##        n  missing distinct 
##      996        4        2 
##                       
## Value         No   Yes
## Frequency    314   682
## Proportion 0.315 0.685
## --------------------------------------------------------------------------------
## Cars 
##        n  missing distinct     Info     Mean      Gmd 
##      991        9        5    0.925    1.455    1.226 
## 
## lowest : 0 1 2 3 4, highest: 0 1 2 3 4
##                                         
## Value          0     1     2     3     4
## Frequency    238   267   342    85    59
## Proportion 0.240 0.269 0.345 0.086 0.060
## --------------------------------------------------------------------------------
## Commute.Distance 
##        n  missing distinct 
##     1000        0        5 
## 
## lowest : 0-1 Miles  1-2 Miles  10+ Miles  2-5 Miles  5-10 Miles
## highest: 0-1 Miles  1-2 Miles  10+ Miles  2-5 Miles  5-10 Miles
##                                                                  
## Value       0-1 Miles  1-2 Miles  10+ Miles  2-5 Miles 5-10 Miles
## Frequency         366        169        111        162        192
## Proportion      0.366      0.169      0.111      0.162      0.192
## --------------------------------------------------------------------------------
## Region 
##        n  missing distinct 
##     1000        0        3 
##                                                     
## Value             Europe North America       Pacific
## Frequency            300           508           192
## Proportion         0.300         0.508         0.192
## --------------------------------------------------------------------------------
## Age 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      992        8       53    0.999    44.18    12.85    28.00    30.00 
##      .25      .50      .75      .90      .95 
##    35.00    43.00    52.00    60.90    65.45 
## 
## lowest : 25 26 27 28 29, highest: 73 74 78 80 89
## --------------------------------------------------------------------------------
## Purchased.Bike 
##        n  missing distinct 
##     1000        0        2 
##                       
## Value         No   Yes
## Frequency    519   481
## Proportion 0.519 0.481
## --------------------------------------------------------------------------------

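EXPLANATION: The data set has 13 variables and 1000 observations. For categorical variables, describe() computes proportions over the non-missing records (for example, 535/993 = 0.539 for Married), and it lists "lowest" and "highest" text values in alphabetical order.

ID : mean 19966, no missing values, lowest 11000, highest 29447.

Marital.Status : 7 missing values, 2 distinct values (Married and Single). Frequency: Married 535, Single 458. Proportion (frequency / n): Married 0.539, Single 0.461.

Gender : 11 missing values, 2 distinct values (Female and Male). Frequency: Female 489, Male 500. Proportion: Female 0.494, Male 0.506.

Income : 6 missing values, 16 distinct values, lowest 10000, highest 170000, mean 56268.

Children : 8 missing values, 6 distinct values, mean 1.91, lowest 0, highest 5.

Education : no missing values, 5 distinct values. Bachelors comes first and Partial High School last only alphabetically, not by education level. The most frequent value is Bachelors (306, proportion 0.306).

Occupation : no missing values, 5 distinct values. Clerical comes first and Skilled Manual last alphabetically. The most frequent value is Professional (276, proportion 0.276).

Home.Owner : 4 missing values, 2 distinct values. Frequency: No 314, Yes 682. Proportion: No 0.315, Yes 0.685.

Region : no missing values, 3 distinct values. Frequency: Europe 300, North America 508, Pacific 192. Proportion: Europe 0.300, North America 0.508, Pacific 0.192.

Cars : 9 missing values, 5 distinct values, mean 1.455, lowest 0, highest 4.

Commute.Distance : no missing values, 5 distinct values. "0-1 Miles" comes first and "5-10 Miles" last alphabetically, so "10+ Miles" is not shown as the highest. The most frequent value is 0-1 Miles (366, proportion 0.366).

Age : 8 missing values, 53 distinct values, mean 44.18, median 43, lowest 25, highest 89.

Purchased.Bike : no missing values, 2 distinct values. Frequency: No 519, Yes 481. Proportion: No 0.519, Yes 0.481.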
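The proportions that describe() prints are taken over the non-missing records. A minimal base-R sketch of the same calculation on a made-up vector follows, assuming blanks should be treated as missing; the last line redoes the Married proportion from the frequencies printed above.

```r
# Illustrative marital status values with one blank and one NA
status <- c("Married", "Single", "Married", "", NA)

# Recode blanks as NA so table() ignores them by default
status[status %in% ""] <- NA

freq <- table(status)      # counts over non-missing records only
prop <- prop.table(freq)   # frequency / number of non-missing records
prop

# Same arithmetic with the counts reported by describe(): 535 of 993
married_prop <- round(535 / 993, 3)
married_prop
```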

3. Look for data anomalies

An anomaly (outlier) is an observation, or subset of observations, that appears to be inconsistent with the remainder of the data set.

a. qqPlot() function

library(car)
qqPlot(df$Income)

## [1] 13 44

EXPLANATION The Q-Q plot of df$Income shows that the data are not normally distributed. The two numbers printed by qqPlot() are the row indices of the most extreme points.

library(car)
qqPlot(df$Children)

## [1]  3 13

EXPLANATION The Q-Q plot of df$Children shows that the data are not normally distributed.

library(car)
qqPlot(df$Age)

## [1] 376 402

EXPLANATION The Q-Q plot of df$Age shows that the data are close to a normal distribution.
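Reading a Q-Q plot by eye can be backed up with a number: the correlation between the theoretical and sample quantiles, which is close to 1 when the points lie on a straight line. The sketch below uses simulated samples rather than the real columns, and qq_cor is a hypothetical helper built on base R's qqnorm().

```r
set.seed(1)
# Simulated stand-ins: one roughly normal sample, one strongly right-skewed
normal_sample <- rnorm(200, mean = 44, sd = 11)
skewed_sample <- rexp(200, rate = 1 / 56000)

# Correlation between theoretical normal quantiles and sample quantiles
qq_cor <- function(x) {
  x <- x[!is.na(x)]                  # drop missing values first
  qq <- qqnorm(x, plot.it = FALSE)   # compute Q-Q coordinates without plotting
  cor(qq$x, qq$y)
}

qq_cor(normal_sample)
qq_cor(skewed_sample)
```

A lower correlation for the skewed sample corresponds to the curved pattern seen in a Q-Q plot of non-normal data.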

b. boxplot() function for single-variable outliers

out <- boxplot.stats(df$Income)$out

boxplot(df$Income,
  ylab = "",
  main = "Income"
)

mtext(paste("Outliers: ", paste(out, collapse = ", ")))

EXPLANATION The boxplot above shows that the Income variable has several outliers.

out <- boxplot.stats(df$Children)$out

boxplot(df$Children,
  ylab = "",
  main = "Children"
)

mtext(paste("Outliers: ", paste(out, collapse = ", ")))

EXPLANATION The boxplot above shows that the Children variable has no outliers.

out <- boxplot.stats(df$Cars)$out

boxplot(df$Cars,
  ylab = "",
  main = "Cars"
)

mtext(paste("Outliers: ", paste(out, collapse = ", ")))

EXPLANATION The boxplot above shows that the Cars variable has outliers.

out <- boxplot.stats(df$Age)$out

boxplot(df$Age,
  ylab = "",
  main = "Age"
)

mtext(paste("Outliers: ", paste(out, collapse = ", ")))

EXPLANATION The boxplot above shows that the Age variable has outliers.
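The points drawn beyond the whiskers are the values returned in the out component of boxplot.stats(). A minimal sketch on a made-up vector shows where the limits come from; note that boxplot.stats() works from hinges, which can differ slightly from the quartiles returned by quantile().

```r
# Illustrative values with one obvious extreme (not the real data)
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 50)

bs <- boxplot.stats(x, coef = 1.5)
bs$stats  # lower whisker, lower hinge, median, upper hinge, upper whisker
bs$out    # values beyond 1.5 times the hinge spread from the box

# Rebuild the fences by hand from the hinges
hinge_spread <- bs$stats[4] - bs$stats[2]
fences <- c(bs$stats[2] - 1.5 * hinge_spread, bs$stats[4] + 1.5 * hinge_spread)
fences
```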

c. FindOutliers() function

ThreeSigma <- function(x, t = 3){
  # Limits at mean +/- t standard deviations
  mu <- mean(x, na.rm = TRUE)
  sig <- sd(x, na.rm = TRUE)
  if (sig == 0){
    message("All non-missing x-values are identical")
  }
  up <- mu + t * sig
  down <- mu - t * sig
  out <- list(up = up, down = down)
  return(out)
}

Hampel <- function(x, t = 3){
  # Limits at median +/- t MAD scale estimates
  mu <- median(x, na.rm = TRUE)
  sig <- mad(x, na.rm = TRUE)
  if (sig == 0){
    message("Hampel identifier implosion: MAD scale estimate is zero")
  }
  up <- mu + t * sig
  down <- mu - t * sig
  out <- list(up = up, down = down)
  return(out)
}

BoxplotRule <- function(x, t = 1.5){
  # Limits at Q1 - t * IQR and Q3 + t * IQR
  xL <- quantile(x, na.rm = TRUE, probs = 0.25, names = FALSE)
  xU <- quantile(x, na.rm = TRUE, probs = 0.75, names = FALSE)
  Q <- xU - xL
  if (Q == 0){
    message("Boxplot rule implosion: interquartile distance is zero")
  }
  up <- xU + t * Q
  # The lower limit is measured from the first quartile xL. The BoxplotRule
  # rows printed in this section came from a run that used xU here.
  down <- xL - t * Q
  out <- list(up = up, down = down)
  return(out)
}

ExtractDetails <- function(x, down, up){
  # Classify each value as nominal ("N"), low outlier ("L") or high outlier ("U")
  outClass <- rep("N", length(x))
  indexLo <- which(x < down)
  indexHi <- which(x > up)
  outClass[indexLo] <- "L"
  outClass[indexHi] <- "U"
  index <- union(indexLo, indexHi)
  values <- x[index]
  outClass <- outClass[index]
  nOut <- length(index)
  # Largest and smallest values that are still inside the limits
  maxNom <- max(x[which(x <= up)])
  minNom <- min(x[which(x >= down)])
  outList <- list(nOut = nOut, lowLim = down,
                  upLim = up, minNom = minNom,
                  maxNom = maxNom, index = index,
                  values = values,
                  outClass = outClass)
  return(outList)
}

FindOutliers <- function(x, t3 = 3, tH = 3, tb = 1.5){
  # Compute the limits for each of the three detection rules
  threeLims <- ThreeSigma(x, t = t3)
  HampLims <- Hampel(x, t = tH)
  boxLims <- BoxplotRule(x, t = tb)

  n <- length(x)
  nMiss <- length(which(is.na(x)))

  threeList <- ExtractDetails(x, threeLims$down, threeLims$up)
  HampList <- ExtractDetails(x, HampLims$down, HampLims$up)
  boxList <- ExtractDetails(x, boxLims$down, boxLims$up)

  # One summary row per method
  sumFrame <- data.frame(method = "ThreeSigma", n = n,
                         nMiss = nMiss, nOut = threeList$nOut,
                         lowLim = threeList$lowLim,
                         upLim = threeList$upLim,
                         minNom = threeList$minNom,
                         maxNom = threeList$maxNom)
  upFrame <- data.frame(method = "Hampel", n = n,
                        nMiss = nMiss, nOut = HampList$nOut,
                        lowLim = HampList$lowLim,
                        upLim = HampList$upLim,
                        minNom = HampList$minNom,
                        maxNom = HampList$maxNom)
  sumFrame <- rbind.data.frame(sumFrame, upFrame)
  upFrame <- data.frame(method = "BoxplotRule", n = n,
                        nMiss = nMiss, nOut = boxList$nOut,
                        lowLim = boxList$lowLim,
                        upLim = boxList$upLim,
                        minNom = boxList$minNom,
                        maxNom = boxList$maxNom)
  sumFrame <- rbind.data.frame(sumFrame, upFrame)

  # Per-method details: index, value and type of each flagged point
  threeFrame <- data.frame(index = threeList$index,
                           values = threeList$values,
                           type = threeList$outClass)
  HampFrame <- data.frame(index = HampList$index,
                          values = HampList$values,
                          type = HampList$outClass)
  boxFrame <- data.frame(index = boxList$index,
                         values = boxList$values,
                         type = boxList$outClass)
  outList <- list(summary = sumFrame, threeSigma = threeFrame,
                  Hampel = HampFrame, boxplotRule = boxFrame)
  return(outList)
}
fullSummary <- FindOutliers(df$Income)
fullSummary$summary
##        method    n nMiss nOut    lowLim    upLim minNom maxNom
## 1  ThreeSigma 1000     6   10 -36935.85 149471.1  10000 130000
## 2      Hampel 1000     6   10 -28956.00 148956.0  10000 130000
## 3 BoxplotRule 1000     6   10  10000.00 130000.0  10000 130000

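EXPLANATION All three outlier detection methods give the same result, 10 outliers, so df$Income is concluded to have 10 outliers. These are the incomes of 150000, 160000 and 170000, which exceed every upper limit. The printed BoxplotRule lower limit of 10000 equals Q3 - 1.5 * IQR; the standard rule uses Q1 - 1.5 * IQR = -30000, which does not change the count here.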

fullSummary <- FindOutliers(df$Children)
fullSummary$summary
##        method    n nMiss nOut    lowLim    upLim minNom maxNom
## 1  ThreeSigma 1000     8    0 -2.970448 6.791013      0      5
## 2      Hampel 1000     8    0 -2.447800 6.447800      0      5
## 3 BoxplotRule 1000     8    0 -1.500000 7.500000      0      5

EXPLANATION All three outlier detection methods give the same result, 0 outliers, so df$Children is concluded to have no outliers.

fullSummary <- FindOutliers(df$Cars)
fullSummary$summary
##        method    n nMiss nOut   lowLim    upLim minNom maxNom
## 1  ThreeSigma 1000     9    0 -1.91017 4.820362      0      4
## 2      Hampel 1000     9    0 -3.44780 5.447800      0      4
## 3 BoxplotRule 1000     9  297  0.50000 3.500000      1      3

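EXPLANATION ThreeSigma and Hampel both detect 0 outliers, while the BoxplotRule row reports 297. The printed BoxplotRule lower limit of 0.5 equals Q3 - 1.5 * IQR (2 - 1.5); the standard boxplot rule uses Q1 - 1.5 * IQR = -0.5. With the printed limit, all 238 records with 0 cars are flagged together with the 59 records with 4 cars (238 + 59 = 297); with the standard limit only the 59 four-car records exceed the upper limit of 3.5. Since 2 of the 3 methods detect no outliers, df$Cars is concluded to have no outliers.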
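How strongly the lower limit affects the boxplot rule can be seen on a small example. The sketch below uses a made-up vector of car counts, not the real column, and computes the limits of the three rules directly with base R; the variable names are illustrative.

```r
# Illustrative car counts for ten buyers
cars <- c(0, 0, 1, 1, 1, 2, 2, 2, 2, 4)

# Three-sigma and Hampel limits: centre +/- 3 scale estimates
three_sigma <- mean(cars) + c(-3, 3) * sd(cars)
hampel <- median(cars) + c(-3, 3) * mad(cars)

# Boxplot rule: Q1 - 1.5 * IQR and Q3 + 1.5 * IQR
q <- quantile(cars, probs = c(0.25, 0.75), names = FALSE)
iqr <- q[2] - q[1]
boxrule <- c(q[1] - 1.5 * iqr, q[2] + 1.5 * iqr)

# Number of flagged points under the standard rule
n_standard <- sum(cars < boxrule[1] | cars > boxrule[2])

# Measuring the lower limit from Q3 instead also flags the zeros
n_from_q3 <- sum(cars < q[2] - 1.5 * iqr | cars > boxrule[2])

c(standard = n_standard, lower_from_q3 = n_from_q3)
```

On discrete variables with a small interquartile range, the boxplot rule is the most sensitive of the three, so its count is worth checking against the raw frequencies.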

fullSummary <- FindOutliers(df$Age)
fullSummary$summary
##        method    n nMiss nOut   lowLim    upLim minNom maxNom
## 1  ThreeSigma 1000     8    2 10.09543 78.26747     25     78
## 2      Hampel 1000     8    2  7.41760 78.58240     25     78
## 3 BoxplotRule 1000     8   25 26.50000 77.50000     27     74

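EXPLANATION ThreeSigma and Hampel both detect 2 outliers (the ages 80 and 89, which lie above their upper limits of about 78.3 and 78.6), while the BoxplotRule row reports 25. The printed BoxplotRule lower limit of 26.5 equals Q3 - 1.5 * IQR (52 - 25.5), so ages 25 and 26 are flagged as low outliers; the standard rule uses Q1 - 1.5 * IQR = 9.5, under which only ages above 77.5 (78, 80 and 89) are flagged. Since 2 of the 3 methods detect 2 outliers, df$Age is concluded to have 2 outliers.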

4. Look at the relations between key variables

count <- table(df$Cars, df$Purchased.Bike)
count
##    
##      No Yes
##   0  91 147
##   1 115 152
##   2 218 124
##   3  52  33
##   4  38  21

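The fewer cars a buyer owns, the higher the share who bought a bike: about 62% of buyers with no car purchased one, compared with roughly 36% to 39% of buyers with two or more cars.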
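Raw counts are hard to compare across groups of different size, so row-wise proportions make the pattern easier to read. The sketch below re-enters the Cars by Purchased.Bike counts printed above as a matrix (so it runs without the CSV file) and uses base R's prop.table().

```r
# Counts copied from the table(df$Cars, df$Purchased.Bike) output
counts <- matrix(c(91, 115, 218, 52, 38,
                   147, 152, 124, 33, 21),
                 ncol = 2,
                 dimnames = list(Cars = as.character(0:4),
                                 Purchased.Bike = c("No", "Yes")))

# margin = 1 divides each row by its row total
share_yes <- prop.table(counts, margin = 1)[, "Yes"]
round(share_yes, 3)
```

With the data frame loaded, prop.table(table(df$Cars, df$Purchased.Bike), margin = 1) gives the same proportions without retyping the counts.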

count <- table(df$Income, df$Purchased.Bike)
count
##         
##          No Yes
##   10000  45  28
##   20000  43  31
##   30000  81  53
##   40000  64  89
##   50000  20  20
##   60000  84  81
##   70000  58  65
##   80000  56  34
##   90000  14  24
##   100000 18  11
##   110000  8   8
##   120000  8   9
##   130000 17  15
##   150000  1   3
##   160000  0   3
##   170000  2   1

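Buyers with higher incomes tend to purchase bikes somewhat more often, but the relationship is not monotone: the purchase share is about 38% to 42% for incomes of 10000 to 30000 and near or above 50% for most brackets from 40000 upward, yet it drops back to about 38% at 80000 and 100000. The brackets above 130000 contain too few buyers to support a conclusion.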

count <- table(df$Children, df$Purchased.Bike)
count
##    
##      No Yes
##   0 135 139
##   1  72  97
##   2 112  97
##   3  61  72
##   4  72  54
##   5  63  18

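Buyers with fewer children tend to purchase bikes more often, although the pattern is uneven: the purchase share is between about 46% and 57% for 0 to 3 children, about 43% for 4 children, and only about 22% for 5 children.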

# Create the layout
nf <- layout(matrix(c(1,1,2,3), nrow=2, byrow=TRUE))

# Fill with plots
mosaicplot(Income ~ Purchased.Bike, data = df, main = "", las = 1, shade = TRUE)

# Scatterplot
plot(df$Cars, df$Income)


# Boxplot (groups on the x-axis, number of children on the y-axis)
boxplot(Children ~ Purchased.Bike, data = df, xlab = "Purchased.Bike", ylab = "Children")

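matrix(c(1,1,2,3), nrow=2, byrow=TRUE) creates a matrix of 2 rows and 2 columns that is filled row by row. Both cells of the top row hold 1, so the first chart spans the full width; the bottom-left cell holds the second chart and the bottom-right cell holds the third.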
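Printing the layout matrix makes the panel arrangement visible before any plot is drawn. This small sketch only builds the matrices; it also shows what happens without byrow = TRUE, where the same numbers are filled column by column.

```r
# Filled row by row: chart 1 spans the top row, charts 2 and 3 share the bottom
by_row <- matrix(c(1, 1, 2, 3), nrow = 2, byrow = TRUE)
by_row

# Filled column by column (the default): chart 1 spans the left column instead
by_col <- matrix(c(1, 1, 2, 3), nrow = 2)
by_col
```

Passing by_row to layout() reproduces the arrangement used above.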

Mosaic plots describe the relationship between two categorical variables. Essentially, these plots are graphical representations of contingency tables that tell us how many times the values of two categorical variables occur together in a dataset.

EXPLANATION:
1. Higher income goes with a somewhat higher share of bike purchases, though the relationship is not monotone.
2. The fewer cars a buyer owns, the higher the share who bought a bike.
3. The fewer children a buyer has, the higher the share who bought a bike, most clearly for buyers with 5 children, who purchase least often.