Video Link: ‘https://binusianorg-my.sharepoint.com/personal/jasmine_alifa_binus_ac_id/_layouts/15/guestaccess.aspx?docid=020876ffd4d8941a0ae89c6e87dd4061c&authkey=AeYxPLNI_auXy5xjybSzeN0&e=YwOf6e

1. Apply the following exploratory data analysis techniques using R on CarPrice dataset: (25 pts.)
  1. Using mfrow parameter, construct a two-by-two plot array showing the concentrations of the following four attributes versus the record number in the dataset:
  1. carlength, top left;
  2. carwidth, top right;
  3. doornumber, lower left; and
  4. enginetype, lower right.

In all cases, the x-axis label should read Record number in dataset and the y-axis should read the attribute. Each plot should have a title spelling out the name of the element on which the attribute is based (e.g., “carlength” for the top-left plot).

library(readr)
## Warning: package 'readr' was built under R version 4.1.3
readData <- "C:/Users/user/OneDrive/Documents/R/CarPrice-OddSID.csv"
data <- read.csv(readData)

sapply(data, class)
##      car_ID   symboling     CarName  doornumber  drivewheel   carlength 
##   "integer"   "integer" "character" "character" "character"   "numeric" 
##    carwidth  enginetype  enginesize  fuelsystem  horsepower     peakrpm 
##   "numeric" "character"   "integer" "character"   "integer"   "integer" 
##       price 
##   "numeric"
PriceCar <- transform(data, enginetype = as.factor(enginetype),
                      doornumber = as.factor(doornumber))

par(mfrow=c(2,2))
plot(PriceCar$car_ID, PriceCar$carlength, main="Car Length", xlab = "Record Number in Dataset", ylab = "Car Length", col="#5584AC")
plot(PriceCar$car_ID, PriceCar$carwidth, main="Car Width", xlab = "Record Number in Dataset", ylab = "Car Width", col="#FF8080")

plot(PriceCar$car_ID, PriceCar$doornumber, main="Door Number", xlab = "Record Number in Dataset", ylab = "Door Number", col="#A2B38B", axes = FALSE)
axis(side = 2, at = 1:2, labels = levels(PriceCar$doornumber))
box()

plot(PriceCar$car_ID, PriceCar$enginetype, main="Engine Type", xlab = "Record Number in Dataset", ylab = "Engine Type", col="#C69B7B", axes = FALSE)
axis(side = 2, at = 1:7, labels = levels(PriceCar$enginetype))
box()

summary(PriceCar)
##      car_ID      symboling         CarName          doornumber
##  Min.   :  1   Min.   :-2.0000   Length:205         four:115  
##  1st Qu.: 52   1st Qu.: 0.0000   Class :character   two : 90  
##  Median :103   Median : 1.0000   Mode  :character             
##  Mean   :103   Mean   : 0.8341                                
##  3rd Qu.:154   3rd Qu.: 2.0000                                
##  Max.   :205   Max.   : 3.0000                                
##                                                               
##   drivewheel          carlength        carwidth     enginetype    enginesize   
##  Length:205         Min.   :141.1   Min.   :60.30   dohc : 12   Min.   : 61.0  
##  Class :character   1st Qu.:166.3   1st Qu.:64.10   dohcv:  1   1st Qu.: 97.0  
##  Mode  :character   Median :173.2   Median :65.50   l    : 12   Median :120.0  
##                     Mean   :174.0   Mean   :65.91   ohc  :148   Mean   :126.9  
##                     3rd Qu.:183.1   3rd Qu.:66.90   ohcf : 15   3rd Qu.:141.0  
##                     Max.   :208.1   Max.   :72.30   ohcv : 13   Max.   :326.0  
##                                                     rotor:  4                  
##   fuelsystem          horsepower       peakrpm         price      
##  Length:205         Min.   : 48.0   Min.   :4150   Min.   : 5118  
##  Class :character   1st Qu.: 70.0   1st Qu.:4800   1st Qu.: 7788  
##  Mode  :character   Median : 95.0   Median :5200   Median :10295  
##                     Mean   :104.1   Mean   :5125   Mean   :13277  
##                     3rd Qu.:116.0   3rd Qu.:5500   3rd Qu.:16503  
##                     Max.   :288.0   Max.   :6600   Max.   :45400  
## 
__Write your explanation here:__
Berdasarkan visualisasi plot diatas, variabel Car Length memiliki tipe data numeric dengan range data di 140-210. Dari plot Car Length, dapat dilhat banyak data yang berada di range 150-190. Dapat disimpulkan, panjang mobil paling banyak pada dataset `CarPrice` ada di range tersebut.

Dari plot Car Width, variabel tersebut memiliki tipe data numeric dengan range data di 60-73. Penyebaran data Car Width paling banyak tersebar di range 62-70. Dapat disimpulkan, lebar mobil paling banyak pada dataset `CarPrice` ada di range tersebut.

Dari plot Door Number, variabel tersebut memiliki tipe data factor dengan dua level yaitu `four` dan `two`. Dari visualisasi plot tersebut, dapat disimpulkan pada dataset `CarPrice' mobil yang memiliki 4 pintu lebih banyak dari mobil yang memiliki 2 pintu. Hal ini karena visualisasi plot menunjukkan data terbanyak berada di level `four`.

Dari plot Engine Type, variabel tersebut memiliki tipe data factor dengan tujuh level yaitu dohc, dohcv, l, ohc, ohcf, ohcv, dan rotor. Dari visualisasi plot tersebut, dapat disimpulkan tipe mesin paling banyak digunakan pada dataset `CarPrice` yaitu `ohc`. Hal ini karena visualisasi plot menunjukkan data terbanyak berada di level `ohc`.
  1. Construct a mosaic plot showing the relationship between the variables FuelSystem and DriveWheel in the CarPrice data frame. Does this plot suggest a relationship between these variables? Explain your answer.
PriceCar <- transform(data, drivewheel = as.factor(drivewheel),
          fuelsystem = as.factor(fuelsystem))
relation <- table(PriceCar$fuelsystem, PriceCar$drivewheel)
mosaicplot(relation, main="Relation of Fuel System vs Drive Wheel", xlab="Fuel System", ylab="Drive Wheel", border = "brown3", color = blues9)

__Write your explanation here:__
Dari hasil visualisasi mosaic plot diatas, dapat dilihat terdapat hubungan antara Fuel System (Sistem Bahan Bakar) dengan Drive Wheel (Roda Penggerak). Warna pada hasil visualisasi diatas juga menunjukkan relasi antara 'fuel system` dan `drive wheel`. 

Warna biru tua (paling gelap) menunjukkan `fuel system` berelasi dengan sistem `drive wheel` yaitu `rwd`. Warna biru menunjukkan relasi `fuel system`  dengan sistem `drive wheel` yaitu `fwd`. Sedangkan warna biru muda menunjukkan relasi `fuel system` dengan sistem `drive wheel` yaitu `4wd`.

Fuel system `1bbl` sepenuhnya menggunakan sistem drive wheel `fwd`. Fuel system `2bbl` dan `mpfi` menggunakan ketiga sistem drive wheel yaitu, `4wd`, `fwd`, dan `rwd`. Fuel sistem `4bbl` dan `spfi` sepenuhnya menggunakan sistem drive wheel `rwd`. 

Fuel system `idi` dan `spdi` keduanya menggunakan sistem drive wheel `fwd` dan `rwd`. Fuel system `mfi` sepenuhnya menggunakan sistem drive wheel `rwd`.
  1. Compute the correlation for all attributes. Interpret the statistical findings!
library(DataExplorer) 
## Warning: package 'DataExplorer' was built under R version 4.1.3
PriceCar_numeric <- (PriceCar[sapply(PriceCar, is.numeric)])
plot_correlation(PriceCar_numeric)

__Write your explanation here:__
Dari hasil plot tabel kolerasi di atas, dapat dilihat bahwa variabel yang memiliki korelasi terkuat adalah `enginesize` dengan `price`. Korelasi tersebut bernilai 0,87 atau dibulatkan menjadi 0,9. Sedangkan, variabel dengan nilai korelasi terendah yaitu `carlength` dengan `symboling`. Korelasi tersebut bernilai -0.36. 
2. You need to compare three ways (three-sigma edit rule, Hampel identifier, boxplot rule) of detecting univariate outliers for the peakrpm attribute from the data frame: (20 pts.)
  1. Generate a plot for each technique and give the appropriate features (labels, line type, etc.). Based on these plots, which outlier detector seems to be giving the more reasonable results?

three-sigma edit rule: [mean – t * SD, mean + t * SD]

Hampel Identifier: [median – t * MAD, median + t * MAD]

Boxplot Rule: [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR]

# Three-sigma edit rule
peakrpm <- PriceCar$peakrpm
mean_rpm <- mean(peakrpm)
sd_rpm <- sd(peakrpm)

# Calculate Three-sigma edit rule
outlier_three1 = mean_rpm - (3 * sd_rpm)
outlier_three2 = mean_rpm + (3 * sd_rpm)

three_low = which(peakrpm < outlier_three1)
three_up = which(peakrpm > outlier_three2)

outlierbythree = length(peakrpm[three_low]) + length(peakrpm[three_up])

# Hampel Identifier
median_rpm <- median(peakrpm)
MAD_rpm <- mad(peakrpm)

# Calculate Hampel Identifier
outlier_hampel1 = median_rpm - (3 * MAD_rpm)
outlier_hampel2 = median_rpm + (3 * MAD_rpm)

hampel_low = which(peakrpm < outlier_hampel1)
hampel_up = which(peakrpm > outlier_hampel2)

outlierbyhampel = length(peakrpm[hampel_up]) + length(peakrpm[hampel_low])

# Box plot
Q1_rpm <- quantile(peakrpm, 0.25)
Q3_rpm <- quantile(peakrpm, 0.75)
IQR_rpm <- IQR(peakrpm)

# Calculate Box plot
outlier_boxplot1 = Q1_rpm - (1.5 * IQR_rpm)
outlier_boxplot2 = Q3_rpm + (1.5 * IQR_rpm)

boxplot_low = which(peakrpm < outlier_boxplot1)
boxplot_up = which(peakrpm > outlier_boxplot2)

outlierbyboxplot = length(peakrpm[boxplot_low]) + length(peakrpm[boxplot_up])

sprintf("Outlier Range by Three Sigma Rule : [%f, %f]", outlier_three1, outlier_three2)
## [1] "Outlier Range by Three Sigma Rule : [3694.165022, 6556.078880]"
sprintf("Outlier Range by Hampel Identifier : [%f, %f]", outlier_hampel1, outlier_hampel2)
## [1] "Outlier Range by Hampel Identifier : [3865.660000, 6534.340000]"
sprintf("Outlier Range by Boxplot Rule : [%f, %f]", outlier_boxplot1, outlier_boxplot2)
## [1] "Outlier Range by Boxplot Rule : [3750.000000, 6550.000000]"
# Three sigma edit rule
plot(peakrpm, ylab="peakrpm", xlab="Record Numbers of Data", main="Outliers of Peakrpm - Three Sigma Rule", ylim = c(3500, 7000))
abline(h=median_rpm, lty=2, col="blue")
abline(h=outlier_three1, lty=3, lwd=2, col="red2") # bawah
abline(h=outlier_three2, lty=3, lwd=2, col="red2") # atas

# Hampel Identifier
plot(peakrpm, ylab="peakrpm", xlab="Record Numbers of Data", main="Outliers of Peakrpm - Hampel Identifier", ylim = c(3500, 7000))
abline(h=median_rpm, lty=2, col="blue")
abline(h=outlier_hampel1, lty=3, lwd=2, col="red2") # bawah
abline(h=outlier_hampel2, lty=3, lwd=2, col="red2") # atas

# Box plot
boxplot(peakrpm, ylab="peakrpm", xlab="Record Numbers of Data", main="Outliers of Peakrpm - Box Plot", ylim = c(3500, 7000))
abline(h=median_rpm, lty=2, col="blue")
abline(h=outlier_boxplot1, lty=3, lwd=2, col="red2") # bawah
abline(h=outlier_boxplot2, lty=3, lwd=2, col="red2") # atas

__Write your explanation here:__
Berdasarkan visualisasi plot diatas, metode three sigma edit rule mendeteksi outlier dengan cukup akurat menggunakan nilai mean dan standar deviasi untuk menentukan range atas dan range bawah. Pada metode hampel identifier, deteksi outlier mendapatkan hasil yang cukup akurat dengan menggunakan nilai MAD dan median. Pada metode box plot, deteksi outlier juga cukup akurat menggunakan nilai quartile dan interquartile. 

Dapat disimpulkan ketiga metode tersebut menghasilkan deteksi outlier yang tidak jauh berbeda yaitu tepat 2. Hal ini terjadi karena range outlier antar tiap metode memiliki nilai yang tidak jauh berbeda. Tetapi, menurut saya untuk deteksi outlier yang reasonable dapat menggunakan Hampel Identifier. Saya memilih Hample Identifier karena nilai median dan MAD lebih reasonable untuk dijadikan tolak ukur mendeteksi outlier. 
  1. How many data points are declared outliers by each of the technique? Based on this data points, which outlier detector seems to be giving the more reasonable results?
sprintf("Total outliers by Three Sigma Rule : %d", outlierbythree)
## [1] "Total outliers by Three Sigma Rule : 2"
sprintf("Total outliers by Hampel Identifier : %d", outlierbyhampel)
## [1] "Total outliers by Hampel Identifier : 2"
sprintf("Total outliers by Boxplot Rule : %d", outlierbyboxplot)
## [1] "Total outliers by Boxplot Rule : 2"
__Write your explanation here:__
Berdasarkan hasil di atas, outlier detector yang lebih reasonable adalah Hampel Identifier. Hampel Identifier menggunakan nilai median dan MAD sebagai tolak ukur mendeteksi posisi data outlier, sehingga lebih efektif untuk digunakan. Dibandingkan dengan metode three sigma edit rule yang menggunakan nilai mean dan standar deviasi. 
3. Do a comprehensive EDA on your dataset then find the best-fit linear regression model then answer the following questions: (40 pts.)
  1. Interpret the result of your model.
library(Hmisc)
## Warning: package 'Hmisc' was built under R version 4.1.3
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 4.1.3
## 
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
## 
##     format.pval, units
colSums(is.na(PriceCar))
##     car_ID  symboling    CarName doornumber drivewheel  carlength   carwidth 
##          0          0          0          0          0          0          0 
## enginetype enginesize fuelsystem horsepower    peakrpm      price 
##          0          0          0          0          0          0
rcorr(as.matrix(PriceCar_numeric), type = "spearman")
##            car_ID symboling carlength carwidth enginesize horsepower peakrpm
## car_ID       1.00     -0.16      0.16     0.15       0.09       0.01   -0.23
## symboling   -0.16      1.00     -0.40    -0.25      -0.18      -0.01    0.28
## carlength    0.16     -0.40      1.00     0.89       0.78       0.66   -0.27
## carwidth     0.15     -0.25      0.89     1.00       0.77       0.69   -0.20
## enginesize   0.09     -0.18      0.78     0.77       1.00       0.82   -0.27
## horsepower   0.01     -0.01      0.66     0.69       0.82       1.00    0.11
## peakrpm     -0.23      0.28     -0.27    -0.20      -0.27       0.11    1.00
## price        0.02     -0.14      0.80     0.81       0.83       0.85   -0.07
##            price
## car_ID      0.02
## symboling  -0.14
## carlength   0.80
## carwidth    0.81
## enginesize  0.83
## horsepower  0.85
## peakrpm    -0.07
## price       1.00
## 
## n= 205 
## 
## 
## P
##            car_ID symboling carlength carwidth enginesize horsepower peakrpm
## car_ID            0.0250    0.0261    0.0327   0.2038     0.9401     0.0009 
## symboling  0.0250           0.0000    0.0002   0.0113     0.8874     0.0000 
## carlength  0.0261 0.0000              0.0000   0.0000     0.0000     0.0000 
## carwidth   0.0327 0.0002    0.0000             0.0000     0.0000     0.0042 
## enginesize 0.2038 0.0113    0.0000    0.0000              0.0000     0.0000 
## horsepower 0.9401 0.8874    0.0000    0.0000   0.0000                0.1082 
## peakrpm    0.0009 0.0000    0.0000    0.0042   0.0000     0.1082            
## price      0.7706 0.0385    0.0000    0.0000   0.0000     0.0000     0.3450 
##            price 
## car_ID     0.7706
## symboling  0.0385
## carlength  0.0000
## carwidth   0.0000
## enginesize 0.0000
## horsepower 0.0000
## peakrpm    0.3450
## price
hist.data.frame(PriceCar_numeric)

pairs(~ car_ID + symboling + carlength + carwidth + enginesize + horsepower + peakrpm + price, data = PriceCar_numeric, main = "Car Price Data")

hist(PriceCar$enginesize, col="#E3CAA5")

hist(log(PriceCar$enginesize), col="#F1DDBF")

summary(log(PriceCar$enginesize))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   4.111   4.575   4.787   4.800   4.949   5.787
library(caret)
## Warning: package 'caret' was built under R version 4.1.3
## 
## Attaching package: 'caret'
## The following object is masked from 'package:survival':
## 
##     cluster
set.seed(1)
validation_Index = createDataPartition(PriceCar$enginesize, p=0.8, list = FALSE)
validation_set = PriceCar[-validation_Index,]
training_set = PriceCar[validation_Index,]

fit1 <- lm(log(enginesize)~ price, data= PriceCar)
summary(fit1)
## 
## Call:
## lm(formula = log(enginesize) ~ price, data = PriceCar)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.56272 -0.08047 -0.02420  0.07509  0.39165 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 4.403e+00  2.055e-02  214.28   <2e-16 ***
## price       2.995e-05  1.327e-06   22.57   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1514 on 203 degrees of freedom
## Multiple R-squared:  0.7151, Adjusted R-squared:  0.7137 
## F-statistic: 509.6 on 1 and 203 DF,  p-value: < 2.2e-16
fit2 <- lm(enginesize~ price, data= PriceCar)
summary(fit2)
## 
## Call:
## lm(formula = enginesize ~ price, data = PriceCar)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -58.585 -10.244  -2.612   8.102  95.552 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6.641e+01  2.751e+00   24.14   <2e-16 ***
## price       4.557e-03  1.777e-04   25.64   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 20.27 on 203 degrees of freedom
## Multiple R-squared:  0.7641, Adjusted R-squared:  0.763 
## F-statistic: 657.6 on 1 and 203 DF,  p-value: < 2.2e-16
plot(PriceCar$price, log(PriceCar$enginesize))

__Write your explanation here:__
Berdasarkan fitting linear regression model yang saya lakukan, menurut saya variabel yang memiliki korelasi cukup baik untuk digunakan dalam proses linear regression adalah `enginesize` dengan `price`. Hubungan linear antara kedua variabel tersebut positif. Untuk melakukan proses linear regresi lebih lanjut, pertama saya cek distribusi normal target variabel melalui histogram. Saya mengubah data `enginesize` menjadi log `enginesize` agar datanya lebih terdistribusi secara normal. 

Selanjutnya, saya membuat linear modelling menggunakan fit1 dengan target variabel log `enginesize` dan independent variable `price`. Saya memilih fit1 karena P-value nya mendekati 0 yaitu <2e-16 *** dan F-statistic nya cukup baik. Walaupun fit2 memiliki F-statistic yang lebih tinggi, tetapi data `enginesize` lebih berdistribusi secara normal saat saya melakukan log di fit1, hal ini tentunya akan berpengaruh pada proses linear modelling selanjutnya, sehingga saya tidak menggunakan fit2.
  1. Write down the equation of the best fitting line.
fit1
## 
## Call:
## lm(formula = log(enginesize) ~ price, data = PriceCar)
## 
## Coefficients:
## (Intercept)        price  
##   4.403e+00    2.995e-05
#log(enginesize) = 4.403e+00  + price∗(2.995e-05)
__Write your explanation here:__
Dari proses linear modelling yang saya lakukan diatas, saya mendapatkan final regression equation dari model diatas, yaitu: 
  log(enginesize) = 4.403e+00  + price∗(2.995e-05)
  1. Is your model good? Why or why not?
validation_set$predicted <- predict(fit1, validation_set)
actual_prediction <- data.frame(validation_set$enginesize, validation_set$predicted, validation_set$enginesize - validation_set$predicted)
names(actual_prediction) <- c ("Engine Size", "Predicted", "Residuals")
correlationAccuracy <- cor(actual_prediction)
correlationAccuracy
##             Engine Size Predicted Residuals
## Engine Size   1.0000000  0.703112 0.9999942
## Predicted     0.7031120  1.000000 0.7006770
## Residuals     0.9999942  0.700677 1.0000000
actual_prediction$log_enginesize <- log(actual_prediction$`Engine Size`)
head(actual_prediction)
##   Engine Size Predicted Residuals log_enginesize
## 1         136  4.859291 131.14071       4.912655
## 2         131  4.937444 126.06256       4.875197
## 3         108  4.894636 103.10536       4.682131
## 4         164  5.138307 158.86169       5.099866
## 5          90  4.593515  85.40649       4.499810
## 6          98  4.640841  93.35916       4.584967
plot(fit1, which = 1)

prediction <- predict(fit1, validation_set)
plot(exp(prediction), validation_set$enginesize, xlab="Predicted Price", ylab="Actual Price", main="Actual Price vs Predicted Price", col="#1572A1")
abline(a=0, b=1, col="#2D31FA")

__Write your explanation here:__
Berdasarkan presentase predicted value dengan engine size yang bernilai 0.7 atau 70%, menurut saya model yang saya buat sudah cukup baik. Hal ini dapat dilihat dari nilai residuals yang tidak terlalu besar.

Dari visualisasi diatas, dapat dilihat dari hasil plot residual vs fitted kebanyakan data berada di sekitar titik nol, berarti model prediksi yang saya buat mendekati benar. Selain itu, hasil plot `actual price` terhadap `predicted price` menunjukkan banyak data mendekati garis linear regresi yang saya buat. 
  1. Based on your answer in c, will you deploy the model? Why or why not?
hist(rstudent(fit1), col="#FFD59E")

prediction <- predict(fit1, validation_set)
plot(exp(prediction), validation_set$enginesize, xlab="Predicted Price", ylab="Actual Price", main="Actual Price vs Predicted Price", col="#1572A1")
abline(a=0, b=1, col="#2D31FA")

__Write your explanation here:__
Tujuan dari linear regresi yang saya lakukan ini adalah untuk melihat hubungan antara `enginesize` dengan `price`. Saya melakukan explorasi data lebih lanjut untuk mendapatkan hasil linear regresi yang terbaik. Dalam menguji model akhir, cukup masuk akal bahwa semakin tinggi harga mobil semakin besar juga ukuran mesin. Hal ini menunjukkan kapasitas mesin berpengaruh terhadap harga mobil. 

Saya akan melakukan deployment lebih lanjut karena model yang saya buat cukup akurat. Dapat dilihat dari visualisasi residual standar menggunakan fungsi rstudent() bahwa data hampir berdistribusi normal, walaupun belum sepenuhnya terdistribusi normal tetapi hasilnya tidak terlalu kontras. Selain itu, hasil plot `actual price` terhadap `predicted price` menunjukkan banyak data mendekati garis linear regresi yang saya buat.