The data used in this exercise are taxi passenger pickups in a city. Pickup times within the day were recorded throughout January 2015. The data have three columns: (1) time of day (in minutes), (2) day of the week (1 = Monday through 7 = Sunday), and (3) number of pickups.
taxi<-read.csv("https://raw.githubusercontent.com/greenore/ac209b-coursework/master/hw1/data/dataset_1_train.txt")
head(taxi)
##   TimeMin DayOfWeek PickupCount
## 1      57         5         111
## 2      68         5          95
## 3     182         5          95
## 4     298         5          75
## 5     363         5          35
## 6     395         5          30
## Convert day numbers (1 = Monday, ..., 7 = Sunday) to an ordered factor of day names
weekdays <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday")
taxi$DayOfWeek <- factor(taxi$DayOfWeek, levels = 1:7, labels = weekdays, ordered = TRUE)
rm(weekdays)
## Convert time of day from minutes to hours, rounded to the nearest whole hour
taxi$TimeHours <- round(taxi$TimeMin / 60, 0)
head(taxi)
##   TimeMin DayOfWeek PickupCount TimeHours
## 1      57    Friday         111         1
## 2      68    Friday          95         1
## 3     182    Friday          95         3
## 4     298    Friday          75         5
## 5     363    Friday          35         6
## 6     395    Friday          30         7
Visualize the data. You may use the code below or your own. Discuss with your partner: how is PickupCount distributed from day to day over the week?
The median PickupCount does not differ much from day to day.
library(tidyverse)
ggplot(taxi, aes(DayOfWeek, PickupCount)) +
  labs(title = "Plot I: Boxplot",
       subtitle = "Pickup count vs. day of the week") +
  geom_boxplot(color = "black") +
  xlab("Weekday") +
  ylab("Pickup count") +
  theme_bw()
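The comparison of medians read off the boxplot can also be checked numerically. The following is a minimal sketch, assuming `taxi` has the `DayOfWeek` and `PickupCount` columns created above; the helper name `median_by_day` is hypothetical and introduced only for illustration.

```r
# Hypothetical helper (not part of the original exercise):
# median pickup count for each day of the week.
# Assumes a data frame with columns DayOfWeek and PickupCount.
median_by_day <- function(df) {
  aggregate(PickupCount ~ DayOfWeek, data = df, FUN = median)
}

# Usage with the exercise data (assumes `taxi` is already loaded):
# median_by_day(taxi)
```

The result is a data frame with one row per day, which makes it easy to see how close the medians actually are.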
Next, explore the data, optionally using the code below, and discuss with your partner: how is PickupCount distributed over the course of the day (morning to night)?
ggplot(taxi, aes(TimeMin, PickupCount)) +
  geom_point(stroke = 0, alpha = 0.8) +
  theme_bw() +
  labs(title = "Plot II: Scatterplot",
       subtitle = "Pickup count vs. time of the day") +
  scale_x_continuous(breaks = c(0, 360, 720, 1080, 1440),
                     labels = c("00:00", "06:00", "12:00", "18:00", "24:00")) +
  ylab(label = "Pickup Count") +
  xlab("Time of the day")
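As a numerical companion to the scatterplot, the average pickup count per hour gives a rough picture of the daily pattern. This is only a sketch, assuming the `TimeHours` column created earlier (time rounded to the nearest hour); the helper name `mean_by_hour` is hypothetical.

```r
# Hypothetical helper (not part of the original exercise):
# average pickup count for each rounded hour of the day.
# Assumes a data frame with columns TimeHours and PickupCount.
mean_by_hour <- function(df) {
  tapply(df$PickupCount, df$TimeHours, mean)
}

# Usage with the exercise data (assumes `taxi` is already loaded):
# round(mean_by_hour(taxi), 1)
```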
Fit a smoothing spline with your partner, and interpret the pattern you obtain.
## Fit a smoothing spline for several values of the smoothing parameter lambda
for (lam in c(0.5, 0.4, 0.3, 0.2, 0.1)) {
  fit <- smooth.spline(taxi$TimeHours, taxi$PickupCount, lambda = lam)
  plot(taxi$TimeHours, taxi$PickupCount, main = paste("lambda =", lam))
  lines(fit, col = "red")
}
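The fits above set `lambda` by hand. As an alternative, `smooth.spline` can choose the smoothing parameter itself: when no `lambda`, `spar`, or `df` is supplied, it selects one by generalized cross-validation (GCV), which is its default criterion. The sketch below assumes the same `taxi` columns; the helper name `fit_spline_gcv` is hypothetical.

```r
# Hypothetical helper (not part of the original exercise):
# smoothing spline with the smoothing parameter chosen automatically.
# With cv = FALSE (the default), smooth.spline() uses GCV.
fit_spline_gcv <- function(x, y) {
  smooth.spline(x, y)
}

# Usage with the exercise data (assumes `taxi` is already loaded):
# fit_gcv <- fit_spline_gcv(taxi$TimeHours, taxi$PickupCount)
# fit_gcv$lambda   # selected smoothing parameter
# fit_gcv$df       # equivalent degrees of freedom
# plot(taxi$TimeHours, taxi$PickupCount)
# lines(fit_gcv, col = "blue")
```

Comparing the selected `lambda` with the hand-picked values shows whether the manual fits are smoother or rougher than what GCV suggests.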
Apply a LOESS fit to these data with your partner, and interpret the pattern you obtain.
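One possible starting point for the LOESS exercise is sketched below, using base R's `loess()`. It assumes the `TimeMin` and `PickupCount` columns of `taxi`; the helper name `fit_loess` and the default `span = 0.5` are illustrative choices, not values given in the exercise.

```r
# Hypothetical helper (not part of the original exercise):
# LOESS fit of pickup count against time of day in minutes.
# `span` is the fraction of points used in each local fit
# (larger span = smoother curve); 0.5 is an arbitrary starting value.
fit_loess <- function(df, span = 0.5) {
  loess(PickupCount ~ TimeMin, data = df, span = span, degree = 2)
}

# Usage with the exercise data (assumes `taxi` is already loaded):
# fit_lo <- fit_loess(taxi, span = 0.5)
# grid <- data.frame(TimeMin = seq(min(taxi$TimeMin), max(taxi$TimeMin),
#                                  length.out = 200))
# plot(taxi$TimeMin, taxi$PickupCount)
# lines(grid$TimeMin, predict(fit_lo, newdata = grid), col = "red")
```

Trying several values of `span` plays the same role here as varying `lambda` does for the smoothing spline.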