PROJECT: DATA SCIENTIST ASSESSMENT USING R

DESCRIPTION

This project is one of the projects available on DQLab and is part of my learning process on the platform. DQLab is a data science course platform in Indonesia. The mentor for this project is Mrs. Trisna Yulia Junita, a Data Scientist at PT BUMA.

DQLab website: https://academy.dqlab.id/

R Fundamentals

Creating a Data Frame

df <- data.frame(a = c(1, 2, 3), b = c(4, 5, 6), c = c(7, 8, 9))
df[[2]]

## [1] 4 5 6

df[[1]][[1]]

## [1] 1

df[[2]][[3]]

## [1] 6
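
The double-bracket calls above pull a column out as a plain vector. As a small sketch of how that differs from single brackets and `$`, the following reuses the same `df` definition; the comparisons and the variable names `col_vec`, `col_dollar`, `col_df`, and `cell` are illustrative additions, not part of the original exercise.

```r
df <- data.frame(a = c(1, 2, 3), b = c(4, 5, 6), c = c(7, 8, 9))

# [[ ]] and $ both return the column itself as a numeric vector
col_vec <- df[[2]]
col_dollar <- df$b

# [ ] with a single index keeps the data frame structure (one column)
col_df <- df[2]

# [row, column] indexing reaches a single cell directly
cell <- df[3, 2]

class(col_vec)
class(col_df)
cell
```

Here `df[[2]][[3]]` and `df[3, 2]` address the same cell; the second form is usually easier to read when both a row and a column are needed.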

Creating a Vector and Indexing by Position

x <- c("red","blue","yellow","orange","green","purple")
y <- x[c(2,3,4)]
y

## [1] "blue"   "yellow" "orange"

Working with Factors

x <- factor(c("grape", "apples", "pear", "cherry", "mango", "panda")) # create a variable holding a factor
x

## [1] grape  apples pear   cherry mango  panda 
## Levels: apples cherry grape mango panda pear

x[6] <- "apples" # replace one element of the factor
x

## [1] grape  apples pear   cherry mango  apples
## Levels: apples cherry grape mango panda pear
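
Note that the output above still lists "panda" as a level even though no element uses it any more. The sketch below, which is an illustrative addition and not part of the original exercise, shows how unused levels can be dropped and what happens when a value outside the level set is assigned. The names `x_clean` and `y` and the value "banana" are hypothetical.

```r
x <- factor(c("grape", "apples", "pear", "cherry", "mango", "panda"))
x[6] <- "apples"

# The level "panda" is still registered although it no longer occurs
levels(x)

# droplevels() removes levels that are not used by any element
x_clean <- droplevels(x)
levels(x_clean)

# Assigning a value that is not an existing level yields NA (R also warns);
# the warning is suppressed here only to keep the output quiet
y <- x_clean
suppressWarnings(y[1] <- "banana")
y
```

To add a genuinely new category, the level would have to be registered first, for example with `levels(y) <- c(levels(y), "banana")`.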

Using a Function

add_numbers <- function(x, y) {
    x + y
}
add_numbers(3, 3)

## [1] 6

Replacing Missing Values

Write an “R Function” that replaces the missing values in a vector with the mean of that vector.

df <- c(1,2,3,4,5,6,NA,7,8,9,NA)
df

##  [1]  1  2  3  4  5  6 NA  7  8  9 NA

mean_replace <- function(i) {
  # replace every NA with the mean of the non-missing values
  i[is.na(i)] <- mean(i, na.rm = TRUE)
  i
}
df <- mean_replace(df)
df

##  [1] 1 2 3 4 5 6 5 7 8 9 5

Statistics: Visualization with R

Problem 1: First Steps

As a data scientist at an environmental organization, you are asked to analyze observation data from a number of cherry trees. The observation dataset is stored at “https://storage.googleapis.com/dqlab-dataset/trees.csv”, and the analysis is carried out in R.

library(readr) # loaded in the original exercise; read.csv() below is base R and does not need it
trees_df <- read.csv("https://storage.googleapis.com/dqlab-dataset/trees.csv")

Checking the Data Structure

Display the column names and the data type of each column.
The Girth column is misnamed; it should be called Diameter.
The Girth/Diameter column is measured in inches but should be in feet, so add a column (diameter_ft) holding the inch-to-feet conversion (1 inch = 0.08333 ft).
Show the first few rows of the dataset.
Display descriptive statistics (min, max, median, mean, and quartiles) for all columns.
Check for missing values.

names(trees_df)

## [1] "Girth"  "Height" "Volume"

str(trees_df)

## 'data.frame':    31 obs. of  3 variables:
##  $ Girth : num  8.3 8.6 8.8 10.5 10.7 10.8 11 11 11.1 11.2 ...
##  $ Height: int  70 65 63 72 81 83 66 75 80 75 ...
##  $ Volume: num  10.3 10.3 10.2 16.4 18.8 19.7 15.6 18.2 22.6 19.9 ...

names(trees_df)[1] <- "Diameter" 
trees_df$diameter_ft <- trees_df$Diameter*0.08333
head(trees_df)

##   Diameter Height Volume diameter_ft
## 1      8.3     70   10.3    0.691639
## 2      8.6     65   10.3    0.716638
## 3      8.8     63   10.2    0.733304
## 4     10.5     72   16.4    0.874965
## 5     10.7     81   18.8    0.891631
## 6     10.8     83   19.7    0.899964

summary(trees_df)

##     Diameter         Height       Volume       diameter_ft    
##  Min.   : 8.30   Min.   :63   Min.   :10.20   Min.   :0.6916  
##  1st Qu.:11.05   1st Qu.:72   1st Qu.:19.40   1st Qu.:0.9208  
##  Median :12.90   Median :76   Median :24.20   Median :1.0750  
##  Mean   :13.25   Mean   :76   Mean   :30.17   Mean   :1.1040  
##  3rd Qu.:15.25   3rd Qu.:80   3rd Qu.:37.30   3rd Qu.:1.2708  
##  Max.   :20.60   Max.   :87   Max.   :77.00   Max.   :1.7166

is.na(trees_df)

##       Diameter Height Volume diameter_ft
##  [1,]    FALSE  FALSE  FALSE       FALSE
##  [2,]    FALSE  FALSE  FALSE       FALSE
##  [3,]    FALSE  FALSE  FALSE       FALSE
##  [4,]    FALSE  FALSE  FALSE       FALSE
##  [5,]    FALSE  FALSE  FALSE       FALSE
##  [6,]    FALSE  FALSE  FALSE       FALSE
##  [7,]    FALSE  FALSE  FALSE       FALSE
##  [8,]    FALSE  FALSE  FALSE       FALSE
##  [9,]    FALSE  FALSE  FALSE       FALSE
## [10,]    FALSE  FALSE  FALSE       FALSE
## [11,]    FALSE  FALSE  FALSE       FALSE
## [12,]    FALSE  FALSE  FALSE       FALSE
## [13,]    FALSE  FALSE  FALSE       FALSE
## [14,]    FALSE  FALSE  FALSE       FALSE
## [15,]    FALSE  FALSE  FALSE       FALSE
## [16,]    FALSE  FALSE  FALSE       FALSE
## [17,]    FALSE  FALSE  FALSE       FALSE
## [18,]    FALSE  FALSE  FALSE       FALSE
## [19,]    FALSE  FALSE  FALSE       FALSE
## [20,]    FALSE  FALSE  FALSE       FALSE
## [21,]    FALSE  FALSE  FALSE       FALSE
## [22,]    FALSE  FALSE  FALSE       FALSE
## [23,]    FALSE  FALSE  FALSE       FALSE
## [24,]    FALSE  FALSE  FALSE       FALSE
## [25,]    FALSE  FALSE  FALSE       FALSE
## [26,]    FALSE  FALSE  FALSE       FALSE
## [27,]    FALSE  FALSE  FALSE       FALSE
## [28,]    FALSE  FALSE  FALSE       FALSE
## [29,]    FALSE  FALSE  FALSE       FALSE
## [30,]    FALSE  FALSE  FALSE       FALSE
## [31,]    FALSE  FALSE  FALSE       FALSE
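
Printing the full logical matrix works for 31 rows but becomes hard to read on larger data. A more compact check is sketched below. It assumes that R's built-in `trees` dataset (same column names and row count as shown above) can stand in for trees.csv; the names `any_missing` and `na_per_column` are illustrative.

```r
# Assumption: datasets::trees stands in for the downloaded trees.csv
trees_df <- datasets::trees
names(trees_df)[1] <- "Diameter"
trees_df$diameter_ft <- trees_df$Diameter * 0.08333

# TRUE if at least one cell in the whole data frame is missing
any_missing <- any(is.na(trees_df))

# Number of missing values in each column
na_per_column <- colSums(is.na(trees_df))

any_missing
na_per_column
```

A single `FALSE` and a row of zeros carry the same information as the 31-row matrix above.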

Shapiro Test

Are the observations for the three variables (diameter_ft, Height, Volume) normally distributed? Use the Shapiro-Wilk method to evaluate the normality of each variable.

shapiro.test(trees_df$diameter_ft)

## 
##  Shapiro-Wilk normality test
## 
## data:  trees_df$diameter_ft
## W = 0.94117, p-value = 0.08893

shapiro.test(trees_df$Height)

## 
##  Shapiro-Wilk normality test
## 
## data:  trees_df$Height
## W = 0.96545, p-value = 0.4034

shapiro.test(trees_df$Volume)

## 
##  Shapiro-Wilk normality test
## 
## data:  trees_df$Volume
## W = 0.88757, p-value = 0.003579
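
Reading the three results against a significance level of 0.05 (an assumed threshold, not stated in the original task): the p-values for diameter_ft (0.08893) and Height (0.4034) are above 0.05, so normality is not rejected, while the p-value for Volume (0.003579) is below 0.05, so normality is rejected. The sketch below collects that decision in one table. It assumes the built-in `trees` dataset matches trees.csv; `alpha`, `vars`, and `normality` are illustrative names.

```r
# Assumption: datasets::trees stands in for the downloaded trees.csv
trees_df <- datasets::trees
names(trees_df)[1] <- "Diameter"
trees_df$diameter_ft <- trees_df$Diameter * 0.08333

alpha <- 0.05  # assumed significance level
vars <- c("diameter_ft", "Height", "Volume")

# Run the Shapiro-Wilk test on each column and keep only the p-value
p_values <- sapply(trees_df[vars], function(v) shapiro.test(v)$p.value)

normality <- data.frame(
  variable = vars,
  p_value = round(p_values, 4),
  reject_normality = p_values < alpha
)
normality
```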

Simple Visualization

Create a simple visualization that shows the distribution of Volume in the dataset prepared above.

The task asks for a density plot drawn with base R graphics.

plot(density(trees_df$Volume))

Finding Relationships

How is the volume of a cherry tree trunk related to its diameter and height?

Also use a simple visualization to illustrate the relationship.

lm(formula = Volume ~ Height + diameter_ft, data = trees_df)

## 
## Call:
## lm(formula = Volume ~ Height + diameter_ft, data = trees_df)
## 
## Coefficients:
## (Intercept)       Height  diameter_ft  
##    -57.9877       0.3393      56.5002

plot(trees_df$diameter_ft, trees_df$Volume)

plot(trees_df$Height, trees_df$Volume)
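
The coefficients and scatter plots describe the direction of each relationship, but not its strength. One way to quantify that, sketched here as an addition to the original answer, is to compare the Pearson correlation of each predictor with Volume and to read R-squared from the fitted model. The built-in `trees` dataset is assumed to match trees.csv; `cor_diameter`, `cor_height`, `fit`, and `r_squared` are illustrative names.

```r
# Assumption: datasets::trees stands in for the downloaded trees.csv
trees_df <- datasets::trees
names(trees_df)[1] <- "Diameter"
trees_df$diameter_ft <- trees_df$Diameter * 0.08333

# Pairwise Pearson correlation of each predictor with Volume
cor_diameter <- cor(trees_df$diameter_ft, trees_df$Volume)
cor_height <- cor(trees_df$Height, trees_df$Volume)

# Same model as above, stored so that its summary can be inspected
fit <- lm(Volume ~ Height + diameter_ft, data = trees_df)
r_squared <- summary(fit)$r.squared

c(diameter = cor_diameter, height = cor_height, r_squared = r_squared)
```

A larger correlation for diameter than for height would agree with the tighter point cloud in the diameter scatter plot.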

Problem Statement #2

Analyzing the Effect of Sleeping Pills

Next, I was asked to analyze the effect of a sleeping drug on the increase in sleep duration of a group of students.

library(readr) #pre-defined
library(dplyr) #pre-defined

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

sleep_df <- read_csv('https://storage.googleapis.com/dqlab-dataset/sleep.csv') #pre-defined

## Rows: 20 Columns: 3

## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (3): extra, group, ID
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# Save the data in two different dataframe/vector
group1 <- filter(sleep_df, group == 1)
group2 <- filter(sleep_df, group == 2)

# Compute t-test
t_test <- t.test(group1$extra, group2$extra)
t_test

## 
##  Welch Two Sample t-test
## 
## data:  group1$extra and group2$extra
## t = -1.8608, df = 17.776, p-value = 0.07939
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -3.3654832  0.2054832
## sample estimates:
## mean of x mean of y 
##      0.75      2.33
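
With a p-value of 0.07939, the difference between the two group means is not significant at a 0.05 level (an assumed threshold), and the confidence interval above includes 0. The same Welch test can also be written with the formula interface, without splitting the data first. The sketch below assumes that R's built-in `sleep` dataset matches sleep.csv; `t_formula` and `alpha` are illustrative names.

```r
# Assumption: datasets::sleep stands in for the downloaded sleep.csv
sleep_df <- datasets::sleep

# Welch two-sample t-test of extra sleep by group, using a formula
t_formula <- t.test(extra ~ group, data = sleep_df)
t_formula

alpha <- 0.05  # assumed significance level
significant <- t_formula$p.value < alpha
significant
```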

Producing a Plot - Boxplot

library(ggplot2)
ggplot(sleep_df, aes(x = as.character(group), y = extra, fill = as.character(group))) + geom_boxplot()

Machine Learning with R

Building a Simple Model

The dataset https://storage.googleapis.com/dqlab-dataset/electric_bill.csv contains information on household electricity costs in a certain country.

From this data you want to find out which factors affect the total electricity cost of each household.

Build a simple model that explains how “num_people” and “housearea” affect “amount_paid”!

library(readr)
electric_bill <- read_csv("https://storage.googleapis.com/dqlab-dataset/electric_bill.csv")

## Rows: 1000 Columns: 10
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (10): num_rooms, num_people, housearea, is_ac, is_tv, is_flat, ave_month...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

model <- lm(amount_paid ~ num_people + housearea, data = electric_bill)

model

## 
## Call:
## lm(formula = amount_paid ~ num_people + housearea, data = electric_bill)
## 
## Coefficients:
## (Intercept)   num_people    housearea  
##     482.920        4.834        0.118
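
The printed coefficients can be read as a formula: amount_paid = 482.920 + 4.834 * num_people + 0.118 * housearea. As a worked example, the sketch below applies that formula to a hypothetical household of 4 people with a house area of 800 (in the dataset's units, which the source does not state). The function `predict_amount` and the household values are illustrative assumptions.

```r
# Coefficients copied from the model output above
intercept <- 482.920
b_people <- 4.834
b_area <- 0.118

# Linear prediction: intercept plus each coefficient times its predictor
predict_amount <- function(num_people, housearea) {
  intercept + b_people * num_people + b_area * housearea
}

# Hypothetical household: 4 people, house area of 800
predict_amount(4, 800)
```

With the fitted object available, `predict(model, newdata = data.frame(num_people = 4, housearea = 800))` would give the same kind of prediction using the unrounded coefficients.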

Training and Testing

When building a machine learning model, the dataset needs to be split into a training set and a testing set. One library used for this is caret. Using the Iris dataset that has already been loaded and the caret library, write the code to split the dataset into a training set (80%) and a testing set (20%)!

library(readr)
library(caret)

## Loading required package: lattice

set.seed(123)
iris <- read_csv("https://storage.googleapis.com/dqlab-dataset/iris.csv")

## Rows: 150 Columns: 5

## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Species
## dbl (4): Sepal.Length, Sepal.Width, Petal.Length, Petal.Width
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

trainIndex <- createDataPartition(iris$Species, p = 0.8, list = FALSE)
training_set <-  iris[trainIndex, ]
testing_set <- iris[-trainIndex, ]

dim(training_set)

## [1] 120   5

dim(testing_set)

## [1] 30  5
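
`createDataPartition()` samples within each class, so the species proportions are preserved in both sets. To make that idea concrete, here is a base-R sketch of the same stratified 80/20 split. It is an alternative written for illustration, not what caret does internally, and it uses the built-in `iris` dataset, which is assumed to match iris.csv. The names `iris_df`, `train_idx`, `train_base`, and `test_base` are illustrative.

```r
set.seed(123)
iris_df <- datasets::iris  # assumption: same content as iris.csv

# Group row indices by species, then sample 80% of the indices in each group
train_idx <- unlist(lapply(
  split(seq_len(nrow(iris_df)), iris_df$Species),
  function(idx) sample(idx, size = round(0.8 * length(idx)))
))

train_base <- iris_df[train_idx, ]
test_base <- iris_df[-train_idx, ]

# Each species contributes the same share to both sets
table(train_base$Species)
table(test_base$Species)
```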

Decision Tree Model

The dataset https://storage.googleapis.com/dqlab-dataset/suv_data.csv contains profile information on customers who have bought a newly released car.

Split this dataset into a training set and a testing set, build a machine learning model with a decision tree to predict which customers are interested in buying the newly released car, and evaluate the model's performance.

library(caret) #pre-defined 
library(rpart) #pre-defined
library(readr) #pre-defined
set.seed(123)  #pre-defined

suv_data <- read_csv("https://storage.googleapis.com/dqlab-dataset/suv_data.csv") #pre-defined

## Rows: 400 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Gender
## dbl (3): Age, EstimatedSalary, Purchased
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

#split data to training & testing set
trainIndex <- createDataPartition(suv_data$Purchased, p = 0.8, list = FALSE)
training_set <- suv_data[trainIndex, ]
testing_set <- suv_data[-trainIndex,]

#build model with decision tree
model_dt <- rpart(Purchased ~ ., data = training_set, method="class")
predictions_dt <- predict(model_dt, newdata = testing_set, type = "class")

#evaluate performance with new data/ testing_set
testing_purchased <- factor(testing_set$Purchased) #pre-defined 

#show the evaluation result 
evaluation_result <- confusionMatrix(predictions_dt, testing_purchased)
evaluation_result

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 53  1
##          1  5 21
##                                          
##                Accuracy : 0.925          
##                  95% CI : (0.8439, 0.972)
##     No Information Rate : 0.725          
##     P-Value [Acc > NIR] : 7.53e-06       
##                                          
##                   Kappa : 0.822          
##                                          
##  Mcnemar's Test P-Value : 0.2207         
##                                          
##             Sensitivity : 0.9138         
##             Specificity : 0.9545         
##          Pos Pred Value : 0.9815         
##          Neg Pred Value : 0.8077         
##              Prevalence : 0.7250         
##          Detection Rate : 0.6625         
##    Detection Prevalence : 0.6750         
##       Balanced Accuracy : 0.9342         
##                                          
##        'Positive' Class : 0              
##
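
The headline statistics follow directly from the four counts in the confusion matrix. As a worked check, the sketch below recomputes them by hand, treating class 0 as the positive class as the output states. The object names are illustrative.

```r
# Counts copied from the confusion matrix above
# (rows = prediction, columns = reference; matrix() fills column by column)
cm <- matrix(c(53, 5, 1, 21), nrow = 2,
             dimnames = list(Prediction = c("0", "1"), Reference = c("0", "1")))

accuracy <- sum(diag(cm)) / sum(cm)              # (53 + 21) / 80
sensitivity <- cm["0", "0"] / sum(cm[, "0"])     # 53 / (53 + 5)
specificity <- cm["1", "1"] / sum(cm[, "1"])     # 21 / (21 + 1)
pos_pred_value <- cm["0", "0"] / sum(cm["0", ])  # 53 / (53 + 1)
neg_pred_value <- cm["1", "1"] / sum(cm["1", ])  # 21 / (21 + 5)

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, ppv = pos_pred_value,
        npv = neg_pred_value), 4)
```

Because the positive class is 0 (no purchase), sensitivity here measures how well non-buyers are identified; passing `positive = "1"` to `confusionMatrix()` would report the metrics from the buyers' point of view instead.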

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Tentang saya, silakan kunjungi: