This project is one of the projects available on DQLab and is part of my learning process on the platform. DQLab is a data science course platform in Indonesia. The mentor for this project is Mrs. Trisna Yulia Junita, a Data Scientist at PT BUMA.
DQLab website: https://academy.dqlab.id/
df <- data.frame(a = c(1, 2, 3), b = c(4, 5, 6), c = c(7, 8, 9))
df[[2]]
## [1] 4 5 6
df[[1]][[1]]
## [1] 1
df[[2]][[3]]
## [1] 6
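The double-bracket form used above extracts a column as a plain vector. As a minimal sketch, the same values can also be reached with `$` or with `[row, column]` indexing. The data frame repeats the one above; the variable names `by_position`, `by_name`, `by_bracket`, `cell` and `still_df` are only illustrative.

```r
# Same data frame as in the example above
df <- data.frame(a = c(1, 2, 3), b = c(4, 5, 6), c = c(7, 8, 9))

# Three equivalent ways to extract column b as a vector
by_position <- df[[2]]
by_name     <- df$b
by_bracket  <- df[, "b"]

# A single cell: row 3 of column b
cell <- df[3, "b"]

# Single brackets with one index keep the data frame structure
still_df <- df[2]
```

Note that `df[2]` returns a one-column data frame, while `df[[2]]` returns the underlying numeric vector.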
x <- c("red","blue","yellow","orange","green","purple")
y <- x[c(2,3,4)]
y
## [1] "blue" "yellow" "orange"
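Besides positive positions, a vector can be subset with negative indices (to drop elements) or with a logical condition. The following is a short sketch on the same colour vector; `without_first` and `long_names` are illustrative names.

```r
x <- c("red", "blue", "yellow", "orange", "green", "purple")

# A negative index drops the element at that position
without_first <- x[-1]

# A logical vector keeps only the elements where the condition is TRUE
long_names <- x[nchar(x) > 4]
long_names
# "yellow" "orange" "green" "purple"
```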
x <- factor(c("grape", "apples", "pear", "cherry", "mango", "panda")) # create a variable that holds a factor
x
## [1] grape apples pear cherry mango panda
## Levels: apples cherry grape mango panda pear
x[6] <- "apples" # replace an element of the factor
x
## [1] grape apples pear cherry mango apples
## Levels: apples cherry grape mango panda pear
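The replacement above works because "apples" is already one of the factor's levels; the unused level "panda" also stays in the level set. As a hedged sketch of the other case, assigning a value that is not an existing level yields `NA` with a warning, so the level has to be added first. The fruit "kiwi" and the names `x_bad` and `x_ok` are illustrative only.

```r
x <- factor(c("grape", "apples", "pear"))

# "kiwi" is not an existing level: R stores NA and emits a warning
x_bad <- x
suppressWarnings(x_bad[3] <- "kiwi")
is.na(x_bad[3])
# TRUE

# Extending the levels first makes the assignment valid
x_ok <- x
levels(x_ok) <- c(levels(x_ok), "kiwi")
x_ok[3] <- "kiwi"

# droplevels() removes levels that no longer occur in the data
x_ok <- droplevels(x_ok)
levels(x_ok)
# "apples" "grape" "kiwi"
```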
add_numbers <- function(x, y) {
  x + y
}
add_numbers(3, 3)
## [1] 6
Write an R function that replaces the missing values in a vector with the mean of that vector.
df <- c(1,2,3,4,5,6,NA,7,8,9,NA)
df
## [1] 1 2 3 4 5 6 NA 7 8 9 NA
mean_replace <- function(i) {
  i[is.na(i)] <- mean(i, na.rm = TRUE)
  i
}
df <- mean_replace(df)
df
## [1] 1 2 3 4 5 6 5 7 8 9 5
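The same function can be applied to every column of a data frame with `lapply()`. This is a minimal sketch: the `scores` data frame and its values are hypothetical and exist only to illustrate the call.

```r
# Function from the exercise above
mean_replace <- function(i) {
  i[is.na(i)] <- mean(i, na.rm = TRUE)
  i
}

# Hypothetical data frame with one missing value per column
scores <- data.frame(math = c(80, NA, 90), physics = c(NA, 70, 60))

# scores[] keeps the data frame structure while each column is replaced
scores[] <- lapply(scores, mean_replace)
scores
```

Here the missing `math` value becomes 85 (the mean of 80 and 90) and the missing `physics` value becomes 65 (the mean of 70 and 60).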
As a data scientist at an environmental organization, you are asked to analyze observation data from several cherry trees. The observation dataset is stored at “https://storage.googleapis.com/dqlab-dataset/trees.csv”, and the analysis is carried out in R.
library(readr)
trees_df <- read.csv("https://storage.googleapis.com/dqlab-dataset/trees.csv")
names(trees_df)
## [1] "Girth" "Height" "Volume"
str(trees_df)
## 'data.frame': 31 obs. of 3 variables:
## $ Girth : num 8.3 8.6 8.8 10.5 10.7 10.8 11 11 11.1 11.2 ...
## $ Height: int 70 65 63 72 81 83 66 75 80 75 ...
## $ Volume: num 10.3 10.3 10.2 16.4 18.8 19.7 15.6 18.2 22.6 19.9 ...
names(trees_df)[1] <- "Diameter"
trees_df$diameter_ft <- trees_df$Diameter*0.08333
head(trees_df)
## Diameter Height Volume diameter_ft
## 1 8.3 70 10.3 0.691639
## 2 8.6 65 10.3 0.716638
## 3 8.8 63 10.2 0.733304
## 4 10.5 72 16.4 0.874965
## 5 10.7 81 18.8 0.891631
## 6 10.8 83 19.7 0.899964
summary(trees_df)
## Diameter Height Volume diameter_ft
## Min. : 8.30 Min. :63 Min. :10.20 Min. :0.6916
## 1st Qu.:11.05 1st Qu.:72 1st Qu.:19.40 1st Qu.:0.9208
## Median :12.90 Median :76 Median :24.20 Median :1.0750
## Mean :13.25 Mean :76 Mean :30.17 Mean :1.1040
## 3rd Qu.:15.25 3rd Qu.:80 3rd Qu.:37.30 3rd Qu.:1.2708
## Max. :20.60 Max. :87 Max. :77.00 Max. :1.7166
is.na(trees_df)
## Diameter Height Volume diameter_ft
## [1,] FALSE FALSE FALSE FALSE
## [2,] FALSE FALSE FALSE FALSE
## [3,] FALSE FALSE FALSE FALSE
## [4,] FALSE FALSE FALSE FALSE
## [5,] FALSE FALSE FALSE FALSE
## [6,] FALSE FALSE FALSE FALSE
## [7,] FALSE FALSE FALSE FALSE
## [8,] FALSE FALSE FALSE FALSE
## [9,] FALSE FALSE FALSE FALSE
## [10,] FALSE FALSE FALSE FALSE
## [11,] FALSE FALSE FALSE FALSE
## [12,] FALSE FALSE FALSE FALSE
## [13,] FALSE FALSE FALSE FALSE
## [14,] FALSE FALSE FALSE FALSE
## [15,] FALSE FALSE FALSE FALSE
## [16,] FALSE FALSE FALSE FALSE
## [17,] FALSE FALSE FALSE FALSE
## [18,] FALSE FALSE FALSE FALSE
## [19,] FALSE FALSE FALSE FALSE
## [20,] FALSE FALSE FALSE FALSE
## [21,] FALSE FALSE FALSE FALSE
## [22,] FALSE FALSE FALSE FALSE
## [23,] FALSE FALSE FALSE FALSE
## [24,] FALSE FALSE FALSE FALSE
## [25,] FALSE FALSE FALSE FALSE
## [26,] FALSE FALSE FALSE FALSE
## [27,] FALSE FALSE FALSE FALSE
## [28,] FALSE FALSE FALSE FALSE
## [29,] FALSE FALSE FALSE FALSE
## [30,] FALSE FALSE FALSE FALSE
## [31,] FALSE FALSE FALSE FALSE
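Printing the full logical matrix is hard to scan. A more compact check, sketched below, counts the missing values per column. To stay runnable offline it uses R's built-in `datasets::trees`, on the assumption that it holds the same 31 observations as the CSV above (the printed structure suggests so).

```r
# Assumption: the built-in dataset matches the CSV used in this project
trees_check <- datasets::trees

# Number of missing values in each column
na_per_column <- colSums(is.na(trees_check))
na_per_column

# Single TRUE/FALSE answer for the whole data frame
has_missing <- anyNA(trees_check)
has_missing
# FALSE
```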
Are the observations for the three variables (diameter_ft, Height, Volume) normally distributed? Use the Shapiro-Wilk test to evaluate the normality of each variable.
shapiro.test(trees_df$diameter_ft)
##
## Shapiro-Wilk normality test
##
## data: trees_df$diameter_ft
## W = 0.94117, p-value = 0.08893
shapiro.test(trees_df$Height)
##
## Shapiro-Wilk normality test
##
## data: trees_df$Height
## W = 0.96545, p-value = 0.4034
shapiro.test(trees_df$Volume)
##
## Shapiro-Wilk normality test
##
## data: trees_df$Volume
## W = 0.88757, p-value = 0.003579
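Reading the three results at the usual 5% significance level: the test does not reject normality for diameter_ft (p = 0.08893) or Height (p = 0.4034), but it does reject it for Volume (p = 0.003579). The sketch below collects the three p-values in one step. It rebuilds the data frame from the built-in `datasets::trees`, assuming it matches the CSV; `vars`, `p_values` and `not_rejected_at_5pct` are illustrative names.

```r
# Assumption: the built-in dataset matches the CSV used in this project
trees_df <- datasets::trees
names(trees_df)[1] <- "Diameter"
trees_df$diameter_ft <- trees_df$Diameter * 0.08333

# Run the Shapiro-Wilk test on each variable and keep only the p-value
vars <- c("diameter_ft", "Height", "Volume")
p_values <- sapply(trees_df[vars], function(v) shapiro.test(v)$p.value)

# TRUE means normality is not rejected at the 5% level
not_rejected_at_5pct <- p_values > 0.05
not_rejected_at_5pct
```

A p-value above 0.05 only means the test found no evidence against normality; it does not prove that the variable is normally distributed.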
Create a simple visualization that shows the distribution of Volume in the dataset used above.
For this task, use a density plot from base R.
plot(density(trees_df$Volume))
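As an optional variant, still in base R, the density curve can be drawn on top of a histogram so the individual bins remain visible. This sketch again assumes that the built-in `datasets::trees` matches the CSV; the title and axis label are my own choices.

```r
# Assumption: the built-in dataset matches the CSV used in this project
volume <- datasets::trees$Volume

# Kernel density estimate, the same object that plot(density(...)) draws
vol_density <- density(volume)

# freq = FALSE puts the histogram on the density scale so both layers align
hist(volume, freq = FALSE, main = "Distribution of Volume", xlab = "Volume")
lines(vol_density, lwd = 2)

# Small tick marks along the axis show the individual observations
rug(volume)
```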
What is the relationship between the trunk volume of the cherry trees and their diameter and height?
Also use a simple visualization to illustrate that relationship.
lm(formula = Volume ~ Height + diameter_ft, data = trees_df)
##
## Call:
## lm(formula = Volume ~ Height + diameter_ft, data = trees_df)
##
## Coefficients:
## (Intercept) Height diameter_ft
## -57.9877 0.3393 56.5002
plot(trees_df$diameter_ft, trees_df$Volume)
plot(trees_df$Height, trees_df$Volume)
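Printing the `lm` object shows only the coefficients. To judge how well diameter and height explain Volume, `summary()` adds standard errors, p-values and R-squared. The sketch below assumes the built-in `datasets::trees` matches the CSV; `fit`, `fit_summary`, `coef_table` and `r_squared` are illustrative names.

```r
# Assumption: the built-in dataset matches the CSV used in this project
trees_df <- datasets::trees
names(trees_df)[1] <- "Diameter"
trees_df$diameter_ft <- trees_df$Diameter * 0.08333

# Same model as above, stored so it can be inspected
fit <- lm(Volume ~ Height + diameter_ft, data = trees_df)
fit_summary <- summary(fit)

# Coefficient table: estimate, standard error, t value, p-value
coef_table <- coef(fit_summary)
coef_table

# Share of the variance in Volume explained by the two predictors
r_squared <- fit_summary$r.squared
r_squared
```

Both coefficients are positive, so in this model Volume increases with height and with diameter.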
Next, I was asked to analyze the effect of a sleeping drug on the increase in sleep duration in a group of students.
library(readr) #pre-defined
library(dplyr) #pre-defined
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
sleep_df <- read_csv('https://storage.googleapis.com/dqlab-dataset/sleep.csv') #pre-defined
## Rows: 20 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (3): extra, group, ID
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Save the data in two different dataframe/vector
group1 <- filter(sleep_df, group == 1)
group2 <- filter(sleep_df, group == 2)
# Compute t-test
t_test <- t.test(group1$extra, group2$extra)
t_test
##
## Welch Two Sample t-test
##
## data: group1$extra and group2$extra
## t = -1.8608, df = 17.776, p-value = 0.07939
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -3.3654832 0.2054832
## sample estimates:
## mean of x mean of y
## 0.75 2.33
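With p = 0.07939, the Welch test above does not reject equal means at the 5% level. The `ID` column suggests that both groups may contain the same ten subjects; if that is the case (an assumption, not something the task states), a paired t-test is an alternative worth comparing. The sketch uses R's built-in `datasets::sleep`, assuming it matches the CSV, and the names `sleep_data`, `extra_1`, `extra_2` and `paired_result` are illustrative.

```r
# Assumption: the built-in dataset matches the CSV used in this project
sleep_data <- datasets::sleep

# Sort so that the i-th value in each group belongs to the same subject ID
sleep_data <- sleep_data[order(sleep_data$group, sleep_data$ID), ]
extra_1 <- sleep_data$extra[sleep_data$group == 1]
extra_2 <- sleep_data$extra[sleep_data$group == 2]

# Paired test: works on the within-subject differences
paired_result <- t.test(extra_1, extra_2, paired = TRUE)
paired_result
```

A paired test removes between-subject variation, so it can reach a different conclusion than the two-sample test on the same numbers.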
library(ggplot2)
ggplot(sleep_df, aes(x = as.character(group), y = extra, fill = as.character(group))) + geom_boxplot()
The dataset at https://storage.googleapis.com/dqlab-dataset/electric_bill.csv contains information on household electricity costs in a country.
From this data, you want to find out which factors affect the total electricity bill of each household.
Build a simple model that explains how “num_people” and “housearea” affect “amount_paid”.
library(readr)
electric_bill <- read_csv("https://storage.googleapis.com/dqlab-dataset/electric_bill.csv")
## Rows: 1000 Columns: 10
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (10): num_rooms, num_people, housearea, is_ac, is_tv, is_flat, ave_month...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
model <- lm(amount_paid ~ num_people + housearea, data = electric_bill)
model
##
## Call:
## lm(formula = amount_paid ~ num_people + housearea, data = electric_bill)
##
## Coefficients:
## (Intercept) num_people housearea
## 482.920 4.834 0.118
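The coefficients read as follows: holding the other variable fixed, each additional person adds about 4.834 to `amount_paid`, and each additional unit of `housearea` adds about 0.118. As a worked example, the sketch below plugs a hypothetical household into that equation. The helper `predict_amount` and the input values are illustrative only; with the fitted object available, `predict(model, newdata = ...)` would serve the same purpose.

```r
# Coefficients copied from the printed model output above
intercept    <- 482.920
b_num_people <- 4.834
b_housearea  <- 0.118

# Hypothetical helper: evaluates the fitted linear equation by hand
predict_amount <- function(num_people, housearea) {
  intercept + b_num_people * num_people + b_housearea * housearea
}

# Hypothetical household: 4 people, house area of 750 (dataset units)
example_bill <- predict_amount(num_people = 4, housearea = 750)
example_bill
# 590.756
```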
When building a machine learning model, the dataset needs to be split into training and testing sets. One library used for this is caret. Using the Iris dataset that has already been loaded and the caret library, write the code to split the dataset into a training set (80%) and a testing set (20%).
library(readr)
library(caret)
## Loading required package: lattice
set.seed(123)
iris <- read_csv("https://storage.googleapis.com/dqlab-dataset/iris.csv")
## Rows: 150 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Species
## dbl (4): Sepal.Length, Sepal.Width, Petal.Length, Petal.Width
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
trainIndex <- createDataPartition(iris$Species, p = 0.8, list = FALSE)
training_set <- iris[trainIndex, ]
testing_set <- iris[-trainIndex, ]
dim(training_set)
## [1] 120 5
dim(testing_set)
## [1] 30 5
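For a categorical outcome such as `Species`, `createDataPartition()` samples within each class so that the class proportions are preserved. The base-R sketch below illustrates that idea of a stratified split; it is not caret's implementation, and it uses the built-in `datasets::iris` on the assumption that it matches the CSV. The names `iris_df`, `rows_by_species` and `train_rows` are illustrative.

```r
set.seed(123)

# Assumption: the built-in dataset matches the CSV used in this project
iris_df <- datasets::iris

# Row numbers grouped by species
rows_by_species <- split(seq_len(nrow(iris_df)), iris_df$Species)

# Draw 80% of the rows within each species
train_rows <- unlist(lapply(rows_by_species, function(rows) {
  sample(rows, size = round(0.8 * length(rows)))
}))

training_set <- iris_df[train_rows, ]
testing_set  <- iris_df[-train_rows, ]

# Each species contributes the same share to the training set
table(training_set$Species)
```

With 50 rows per species, this gives 40 training rows and 10 testing rows for each class.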
The dataset at https://storage.googleapis.com/dqlab-dataset/suv_data.csv contains profile information on customers who have bought a newly released car.
Split this dataset into a training set and a testing set, build a machine learning model using a decision tree to predict which customers are interested in buying the newly released car, and evaluate the performance of the model.
library(caret) #pre-defined
library(rpart) #pre-defined
library(readr) #pre-defined
set.seed(123) #pre-defined
suv_data <- read_csv("https://storage.googleapis.com/dqlab-dataset/suv_data.csv") #pre-defined
## Rows: 400 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Gender
## dbl (3): Age, EstimatedSalary, Purchased
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#split data to training & testing set
trainIndex <- createDataPartition(suv_data$Purchased, p = 0.8, list = FALSE)
training_set <- suv_data[trainIndex, ]
testing_set <- suv_data[-trainIndex,]
#build model with decision tree
model_dt <- rpart(Purchased ~ ., data = training_set, method="class")
predictions_dt <- predict(model_dt, newdata = testing_set, type = "class")
#evaluate performance with new data/ testing_set
testing_purchased <- factor(testing_set$Purchased) #pre-defined
#show the evaluation result
evaluation_result <- confusionMatrix(predictions_dt, testing_purchased)
evaluation_result
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 53 1
## 1 5 21
##
## Accuracy : 0.925
## 95% CI : (0.8439, 0.972)
## No Information Rate : 0.725
## P-Value [Acc > NIR] : 7.53e-06
##
## Kappa : 0.822
##
## Mcnemar's Test P-Value : 0.2207
##
## Sensitivity : 0.9138
## Specificity : 0.9545
## Pos Pred Value : 0.9815
## Neg Pred Value : 0.8077
## Prevalence : 0.7250
## Detection Rate : 0.6625
## Detection Prevalence : 0.6750
## Balanced Accuracy : 0.9342
##
## 'Positive' Class : 0
##
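The headline statistics can be re-derived from the four cell counts, which helps when reading the report. Note that the positive class is `0`, so sensitivity here describes how well non-buyers are identified. The sketch below rebuilds the matrix by hand from the printed counts; `cm` and the metric names are illustrative.

```r
# Counts copied from the confusion matrix above
# (rows = prediction, columns = reference; matrix() fills column by column)
cm <- matrix(c(53, 5, 1, 21), nrow = 2,
             dimnames = list(Prediction = c("0", "1"),
                             Reference  = c("0", "1")))

# Accuracy: correct predictions over all predictions
accuracy <- sum(diag(cm)) / sum(cm)

# Sensitivity: share of actual class 0 predicted as 0 (positive class is 0)
sensitivity <- cm["0", "0"] / sum(cm[, "0"])

# Specificity: share of actual class 1 predicted as 1
specificity <- cm["1", "1"] / sum(cm[, "1"])

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity), 4)
# accuracy 0.9250, sensitivity 0.9138, specificity 0.9545
```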