Project: Data Scientist Assessment Using R

Creating a Data Frame

Suppose you have the following data.frame:

df <- data.frame(a = c(1, 2, 3), b = c(4, 5, 6), c = c(7, 8, 9))

How would you produce the following output in the R console?

df <- data.frame(a = c(1, 2, 3), b = c(4, 5, 6), c = c(7, 8, 9))
df[[2]]

## [1] 4 5 6

df[[1]][[1]]

## [1] 1

df[[2]][[3]]

## [1] 6

Creating a Vector and Index Positions

x <- c("red", "blue", "yellow", "orange", "green", "purple")
y <- x…………….
y

Fill in the blank (……) with the appropriate syntax to obtain the expected output.

## [1] "blue"   "yellow" "orange"
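One possible way to fill the blank, sketched here as an assumption because the original leaves it open, is to subset by index position. R vectors are 1-indexed, so positions 2 to 4 select the second through fourth elements.

```r
x <- c("red", "blue", "yellow", "orange", "green", "purple")
# Positions 2, 3 and 4 (R indexing starts at 1)
y <- x[2:4]
y
```

Equivalent forms include x[c(2, 3, 4)] and the negative index x[-c(1, 5, 6)], which drops the unwanted positions instead.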

Completing the Factor Syntax

The variable (x) is a factor:

x <- ………………….
x

Fill in the blank (….) with the correct syntax.
What is the correct syntax to modify (x)?

x <- factor(c("grape","apples", "pear","cherry","mango","panda"))
x

## [1] grape  apples pear   cherry mango  panda 
## Levels: apples cherry grape mango panda pear

x[6] <- "apples"
x

## [1] grape  apples pear   cherry mango  apples
## Levels: apples cherry grape mango panda pear
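After the replacement, the level panda is still listed even though no element uses it. If that unused level should be removed, base R's droplevels() can be applied. The sketch below assumes this cleanup is wanted, which the original exercise does not state, and the name x_clean is only illustrative.

```r
x <- factor(c("grape", "apples", "pear", "cherry", "mango", "panda"))
x[6] <- "apples"
# Drop levels that no longer occur in the data (here: "panda")
x_clean <- droplevels(x)
levels(x_clean)
```

The assignment works because "apples" is already a level of x. Assigning a value that is not an existing level would produce NA with a warning.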

Using a Function

Given the following function:

add_numbers <- function(x, y) { x + y }

With x and y both equal to 3, how would you call this function?

Note: in the function call, write the arguments explicitly as (x = ….. , y = …..).

# define a function that adds two numbers
tambah_angka <- function(x, y) { x + y }
# call the function with 3 for both arguments
tambah_angka(x = 3, y = 3)

## [1] 6

Replacing Missing Values

Write an R function that replaces the missing values in a vector with the mean of that vector.

Vector with missing values (input):

df <- c(1,2,3,4,5,6,NA,7,8,9,NA)
df

mean_replace <- ………
df <- mean_replace(df)
df

Vector after the missing values are replaced with the mean (output):

df <- c(1,2,3,4,5,6,NA,7,8,9,NA)
df

##  [1]  1  2  3  4  5  6 NA  7  8  9 NA

mean_replace <- function(df) {
  df[is.na(df)] <- mean(df, na.rm = TRUE)
  return(df)
}
df <- mean_replace(df)
df

##  [1] 1 2 3 4 5 6 5 7 8 9 5

Problem 1: First Steps

As a data scientist at an environmental organization, you are asked to analyze observational data from several cherry trees. The dataset is stored at “https://storage.googleapis.com/dqlab-dataset/trees.csv”, and the analysis is carried out in R. The steps to perform are as follows:

Load the library and load the dataset under the name “trees_df”. The required library is readr.

library(readr)
trees_df <- read.csv("https://storage.googleapis.com/dqlab-dataset/trees.csv")

Checking the Data Structure

Display the column names and the data type of each column.

The Girth column is misnamed; it should be called Diameter.

The Girth/Diameter column is measured in inches but should be in feet, so add a column (diameter_ft) that holds the inch-to-foot conversion (1 inch = 0.08333 ft).

Show the first few rows of the dataset.

Display descriptive statistics (min, max, median, mean, and quartiles) for all columns.

Check for missing values.

names(trees_df)

## [1] "Girth"  "Height" "Volume"

str(trees_df)

## 'data.frame':    31 obs. of  3 variables:
##  $ Girth : num  8.3 8.6 8.8 10.5 10.7 10.8 11 11 11.1 11.2 ...
##  $ Height: int  70 65 63 72 81 83 66 75 80 75 ...
##  $ Volume: num  10.3 10.3 10.2 16.4 18.8 19.7 15.6 18.2 22.6 19.9 ...

names(trees_df)[1] <- "Diameter"
trees_df$diameter_ft <- trees_df$Diameter*0.08333
head(trees_df)

##   Diameter Height Volume diameter_ft
## 1      8.3     70   10.3    0.691639
## 2      8.6     65   10.3    0.716638
## 3      8.8     63   10.2    0.733304
## 4     10.5     72   16.4    0.874965
## 5     10.7     81   18.8    0.891631
## 6     10.8     83   19.7    0.899964

summary(trees_df)

##     Diameter         Height       Volume       diameter_ft    
##  Min.   : 8.30   Min.   :63   Min.   :10.20   Min.   :0.6916  
##  1st Qu.:11.05   1st Qu.:72   1st Qu.:19.40   1st Qu.:0.9208  
##  Median :12.90   Median :76   Median :24.20   Median :1.0750  
##  Mean   :13.25   Mean   :76   Mean   :30.17   Mean   :1.1040  
##  3rd Qu.:15.25   3rd Qu.:80   3rd Qu.:37.30   3rd Qu.:1.2708  
##  Max.   :20.60   Max.   :87   Max.   :77.00   Max.   :1.7166

is.na(trees_df)

##       Diameter Height Volume diameter_ft
##  [1,]    FALSE  FALSE  FALSE       FALSE
##  [2,]    FALSE  FALSE  FALSE       FALSE
##  [3,]    FALSE  FALSE  FALSE       FALSE
##  [4,]    FALSE  FALSE  FALSE       FALSE
##  [5,]    FALSE  FALSE  FALSE       FALSE
##  [6,]    FALSE  FALSE  FALSE       FALSE
##  [7,]    FALSE  FALSE  FALSE       FALSE
##  [8,]    FALSE  FALSE  FALSE       FALSE
##  [9,]    FALSE  FALSE  FALSE       FALSE
## [10,]    FALSE  FALSE  FALSE       FALSE
## [11,]    FALSE  FALSE  FALSE       FALSE
## [12,]    FALSE  FALSE  FALSE       FALSE
## [13,]    FALSE  FALSE  FALSE       FALSE
## [14,]    FALSE  FALSE  FALSE       FALSE
## [15,]    FALSE  FALSE  FALSE       FALSE
## [16,]    FALSE  FALSE  FALSE       FALSE
## [17,]    FALSE  FALSE  FALSE       FALSE
## [18,]    FALSE  FALSE  FALSE       FALSE
## [19,]    FALSE  FALSE  FALSE       FALSE
## [20,]    FALSE  FALSE  FALSE       FALSE
## [21,]    FALSE  FALSE  FALSE       FALSE
## [22,]    FALSE  FALSE  FALSE       FALSE
## [23,]    FALSE  FALSE  FALSE       FALSE
## [24,]    FALSE  FALSE  FALSE       FALSE
## [25,]    FALSE  FALSE  FALSE       FALSE
## [26,]    FALSE  FALSE  FALSE       FALSE
## [27,]    FALSE  FALSE  FALSE       FALSE
## [28,]    FALSE  FALSE  FALSE       FALSE
## [29,]    FALSE  FALSE  FALSE       FALSE
## [30,]    FALSE  FALSE  FALSE       FALSE
## [31,]    FALSE  FALSE  FALSE       FALSE
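Printing the full logical matrix becomes hard to read as the data grows. A more compact check counts the missing values per column. The sketch below uses the built-in datasets::trees as a stand-in for trees_df (an assumption; it has the same three original columns), and the name trees_check is illustrative.

```r
# Built-in dataset used as a stand-in for trees_df (assumption)
trees_check <- datasets::trees

# Number of missing values in each column
na_per_column <- colSums(is.na(trees_check))
na_per_column

# TRUE if there is at least one missing value anywhere in the data
anyNA(trees_check)
```

A count of zero in every column leads to the same conclusion as the all-FALSE matrix above: the dataset has no missing values.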

Shapiro Test

Are the observations for the three variables (diameter_ft, Height, Volume) normally distributed? Use the Shapiro-Wilk method to evaluate the normality of each variable.

shapiro.test(trees_df$diameter_ft)

## 
##  Shapiro-Wilk normality test
## 
## data:  trees_df$diameter_ft
## W = 0.94117, p-value = 0.08893

shapiro.test(trees_df$Height)

## 
##  Shapiro-Wilk normality test
## 
## data:  trees_df$Height
## W = 0.96545, p-value = 0.4034

shapiro.test(trees_df$Volume)

## 
##  Shapiro-Wilk normality test
## 
## data:  trees_df$Volume
## W = 0.88757, p-value = 0.003579
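To read these results, each p-value is compared against a significance level. The sketch below assumes alpha = 0.05, which the original does not state, and uses the built-in datasets::trees as a stand-in for trees_df. Since diameter_ft is just Girth rescaled by a constant, testing Girth is equivalent to testing diameter_ft.

```r
trees_check <- datasets::trees   # stand-in for trees_df (assumption)
alpha <- 0.05                    # assumed significance level

# Shapiro-Wilk p-value for every column
p_values <- sapply(trees_check, function(col) shapiro.test(col)$p.value)

# TRUE means normality is not rejected at the chosen alpha
normality_not_rejected <- p_values > alpha
normality_not_rejected
```

With the values printed above, normality is not rejected for diameter_ft (p = 0.08893) or Height (p = 0.4034), but it is rejected for Volume (p = 0.003579).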

Simple Visualization

Create a simple visualization that shows the distribution of Volume in the dataset prepared earlier.

To solve this task, use a density plot from base R graphics.

plot(density(trees_df$Volume))

Finding Relationships

What is the relationship between the trunk Volume of the cherry trees and their diameter and height?

Also use a simple visualization to illustrate the relationship.

lm(formula = Volume ~ Height + diameter_ft, data = trees_df)

## 
## Call:
## lm(formula = Volume ~ Height + diameter_ft, data = trees_df)
## 
## Coefficients:
## (Intercept)       Height  diameter_ft  
##    -57.9877       0.3393      56.5002

plot(trees_df$diameter_ft, trees_df$Volume)

plot(trees_df$Height, trees_df$Volume)
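The scatter plots can be complemented with a numeric measure of linear association. This sketch uses the Pearson correlation from base R's cor() and again relies on the built-in datasets::trees as a stand-in for trees_df (an assumption). Correlation is unaffected by the inch-to-foot rescaling, so Girth is used in place of diameter_ft.

```r
trees_check <- datasets::trees   # stand-in for trees_df (assumption)

# Pearson correlation of Volume with each predictor
cor_diameter <- cor(trees_check$Girth, trees_check$Volume)
cor_height   <- cor(trees_check$Height, trees_check$Volume)

c(diameter = cor_diameter, height = cor_height)
```

Values closer to 1 indicate a stronger positive linear relationship. Both coefficients in the fitted model above are positive, so Volume tends to increase with both diameter and height, with diameter expected to show the stronger association.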

Analyzing the Effect of Sleeping Pills

Next, you are asked to analyze the effect of a sleeping pill on the increase in sleep duration of a group of students.

Extra = increase in sleep time in hours (numeric)

Group = type of treatment/dose of the drug given (factor)

ID = ID of the observed student (factor)

Assuming the observations are normally distributed, run a hypothesis test to determine whether the effect differs between group 1 and group 2.

H0 = There is no difference in effect between group 1 and group 2
H1 = There is a difference in effect between group 1 and group 2

Method 1:

# Load library & dataset
install.packages("tidyverse") # install in backend
library(readr) # pre-defined
library(dplyr) # pre-defined
sleep_df <- read_csv('https://storage.googleapis.com/dqlab-dataset/sleep.csv') # pre-defined
# Save the data in two different dataframes/vectors
group1 <- filter(sleep_df, sleep_df$group == 1)
group2 <- filter(sleep_df, sleep_df$group == 2)
# Compute t-test
t_test <- t.test(group1$extra, group2$extra)
t_test

Method 2:

# Load library & dataset
install.packages("tidyverse")
library(readr) # pre-defined
sleep_df <- read_csv('sleep.csv') # pre-defined
# Compute the t-test directly, without saving the data in two different dataframes/vectors
t_test <- t.test(extra ~ group, data = sleep_df)
t_test

library(readr) #pre-defined
library(dplyr) #pre-defined

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

sleep_df <- read_csv('https://storage.googleapis.com/dqlab-dataset/sleep.csv') #pre-defined

## Rows: 20 Columns: 3

## -- Column specification --------------------------------------------------------
## Delimiter: ","
## dbl (3): extra, group, ID
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.

# Save the data in two different dataframe/vector
group1 <- filter(sleep_df, sleep_df$group == 1)
group2 <- filter(sleep_df, sleep_df$group == 2)

# Compute t-test
t_test <- t.test(group1$extra, group2$extra)
t_test

## 
##  Welch Two Sample t-test
## 
## data:  group1$extra and group2$extra
## t = -1.8608, df = 17.776, p-value = 0.07939
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -3.3654832  0.2054832
## sample estimates:
## mean of x mean of y 
##      0.75      2.33
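The decision on H0 follows from comparing the p-value with a significance level. The sketch below assumes alpha = 0.05, which the original does not state, and uses the built-in datasets::sleep, which has the same extra, group and ID columns, as a stand-in for sleep_df. The names sleep_check, t_check and reject_h0 are illustrative.

```r
sleep_check <- datasets::sleep   # stand-in for sleep_df (assumption)
alpha <- 0.05                    # assumed significance level

# Welch two-sample t-test via the formula interface
t_check <- t.test(extra ~ group, data = sleep_check)

# TRUE would mean H0 is rejected at the chosen alpha
reject_h0 <- t_check$p.value < alpha
reject_h0
```

With the p-value of 0.07939 reported above, H0 is not rejected at the 5% level: the data do not show a statistically significant difference in effect between group 1 and group 2.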

Producing a Plot - Boxplot

Using the data from Problem 2 and the ggplot2 library

library(ggplot2)
ggplot(sleep_df, aes(x=as.character(group),y=extra,fill=as.character(group))) + geom_boxplot()

Building a Simple Model

The dataset https://storage.googleapis.com/dqlab-dataset/electric_bill.csv contains information on household electricity costs in a country.

From this data, you want to find out which factors affect the total electricity cost of each household.

Build a simple model that explains how “num_people” and “housearea” affect “amount_paid”!

library(readr)
electric_bill <- read_csv("https://storage.googleapis.com/dqlab-dataset/electric_bill.csv")

## Rows: 1000 Columns: 10
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## dbl (10): num_rooms, num_people, housearea, is_ac, is_tv, is_flat, ave_month...
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.

model <- lm(amount_paid ~ num_people + housearea, data=electric_bill)
model

## 
## Call:
## lm(formula = amount_paid ~ num_people + housearea, data = electric_bill)
## 
## Coefficients:
## (Intercept)   num_people    housearea  
##     482.920        4.834        0.118
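The printed coefficients can be read as a prediction equation: amount_paid is approximately 482.920 + 4.834 * num_people + 0.118 * housearea. The sketch below plugs in a hypothetical household (4 people, house area of 800; both values are made up for illustration) using the rounded coefficients from the output above. The helper name predict_amount is likewise hypothetical.

```r
# Rounded coefficients copied from the model output above
intercept    <- 482.920
b_num_people <- 4.834
b_housearea  <- 0.118

# Hypothetical helper: linear prediction from the fitted coefficients
predict_amount <- function(num_people, housearea) {
  intercept + b_num_people * num_people + b_housearea * housearea
}

# Hypothetical household: 482.920 + 4.834 * 4 + 0.118 * 800 = 596.656
predicted <- predict_amount(num_people = 4, housearea = 800)
predicted
```

With the model object available, predict(model, newdata = data.frame(num_people = 4, housearea = 800)) should give nearly the same value, differing only through coefficient rounding.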

Training and Testing

When building a machine learning model, the dataset needs to be split into training and testing sets. One library used for this is caret. Using the Iris dataset that has already been loaded and the caret library, write the syntax to split the dataset into a training set (80%) and a testing set (20%)!

library(readr)
library(caret)

## Loading required package: lattice

set.seed(123)
iris <- read_csv("https://storage.googleapis.com/dqlab-dataset/iris.csv")

## Rows: 150 Columns: 5

## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (1): Species
## dbl (4): Sepal.Length, Sepal.Width, Petal.Length, Petal.Width
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.

trainIndex <- createDataPartition(iris$Species, p=0.8, list=FALSE)
training_set <- iris[trainIndex,]
testing_set <- iris[-trainIndex,]

dim(training_set)

## [1] 120   5

dim(testing_set)

## [1] 30  5
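For a categorical outcome, createDataPartition() draws its sample within each class, so class proportions are preserved in both sets. To illustrate that idea without caret, here is a base-R sketch of a stratified 80/20 split. It is not how caret is implemented, only an approximation of the same idea, and it uses the built-in datasets::iris rather than the CSV loaded above. The names iris_builtin, train_base and test_base are illustrative.

```r
set.seed(123)
iris_builtin <- datasets::iris   # built-in copy, used instead of the CSV (assumption)

# Row indices grouped by species
idx_by_class <- split(seq_len(nrow(iris_builtin)), iris_builtin$Species)

# Draw 80% of the rows from each species
train_idx <- unlist(lapply(idx_by_class, function(i) {
  sample(i, size = round(0.8 * length(i)))
}))

train_base <- iris_builtin[train_idx, ]
test_base  <- iris_builtin[-train_idx, ]

# Class counts in each set
table(train_base$Species)
table(test_base$Species)
```

Each species contributes 40 rows to the training set and 10 to the testing set, giving the same 120/30 dimensions as the caret split above.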

Decision Tree Model

The dataset https://storage.googleapis.com/dqlab-dataset/suv_data.csv contains profile information on customers who have purchased a newly released car.

Split this dataset into a training set and a testing set, build a machine learning model using a decision tree to predict which customers are interested in buying the newly released car, and evaluate the model's performance.

library(caret) #pre-defined 
library(rpart) #pre-defined
library(readr) #pre-defined
set.seed(123)  #pre-defined

suv_data <- read_csv("https://storage.googleapis.com/dqlab-dataset/suv_data.csv") #pre-defined

## Rows: 400 Columns: 4
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (1): Gender
## dbl (3): Age, EstimatedSalary, Purchased
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.

#split data to training & testing set
trainIndex <- createDataPartition(suv_data$Purchased, p=0.8, list=FALSE)
training_set <- suv_data[trainIndex,]
testing_set <- suv_data[-trainIndex,]

#build model with decision tree
model_dt <- rpart(Purchased ~ ., data = training_set, method="class")
predictions_dt <- predict(model_dt,newdata = testing_set, type = "class")

#evaluate performance with new data/ testing_set
testing_purchased <- factor(testing_set$Purchased) #pre-defined 

#show the evaluation result 
evaluation_result <- confusionMatrix(predictions_dt,testing_purchased)
evaluation_result

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 53  1
##          1  5 21
##                                          
##                Accuracy : 0.925          
##                  95% CI : (0.8439, 0.972)
##     No Information Rate : 0.725          
##     P-Value [Acc > NIR] : 7.53e-06       
##                                          
##                   Kappa : 0.822          
##                                          
##  Mcnemar's Test P-Value : 0.2207         
##                                          
##             Sensitivity : 0.9138         
##             Specificity : 0.9545         
##          Pos Pred Value : 0.9815         
##          Neg Pred Value : 0.8077         
##              Prevalence : 0.7250         
##          Detection Rate : 0.6625         
##    Detection Prevalence : 0.6750         
##       Balanced Accuracy : 0.9342         
##                                          
##        'Positive' Class : 0              
##
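The headline statistics can be reproduced by hand from the four cells of the confusion matrix, which helps when reading the report. The sketch below copies the counts from the output above. Note that caret treats 0 as the 'Positive' class here, so sensitivity refers to class 0. Variable names are illustrative.

```r
# Counts copied from the confusion matrix above
# (rows = prediction, columns = reference; matrix() fills column by column)
cm <- matrix(c(53, 5, 1, 21), nrow = 2,
             dimnames = list(Prediction = c("0", "1"), Reference = c("0", "1")))

accuracy    <- sum(diag(cm)) / sum(cm)         # (53 + 21) / 80
sensitivity <- cm["0", "0"] / sum(cm[, "0"])   # 53 / (53 + 5), positive class = 0
specificity <- cm["1", "1"] / sum(cm[, "1"])   # 21 / (21 + 1)

round(c(accuracy = accuracy,
        sensitivity = sensitivity,
        specificity = specificity), 4)
```

These reproduce the reported Accuracy (0.925), Sensitivity (0.9138) and Specificity (0.9545). Balanced Accuracy is the mean of the last two values.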