Data Description and Preparation

As a YouTuber in the United States who wants to raise our channel's profile, we plan to make videos that trend. We have just obtained the YouTube’s US Trending Videos dataset and want to find out which characteristics make a video trend.

Read Data

vids <- read.csv('data_input/USvideos.csv')

Check the data:

head(vids)

#>   trending_date                                                          title
#> 1      17.14.11                             WE WANT TO TALK ABOUT OUR MARRIAGE
#> 2      17.14.11 The Trump Presidency: Last Week Tonight with John Oliver (HBO)
#> 3      17.14.11          Racist Superman | Rudy Mancuso, King Bach & Lele Pons
#> 4      17.14.11                               Nickelback Lyrics: Real or Fake?
#> 5      17.14.11                                       I Dare You: GOING BALD!?
#> 6      17.14.11                                          2 Weeks with iPhone X
#>           channel_title category_id             publish_time   views  likes
#> 1          CaseyNeistat          22 2017-11-13T17:13:01.000Z  748374  57527
#> 2       LastWeekTonight          24 2017-11-13T07:30:00.000Z 2418783  97185
#> 3          Rudy Mancuso          23 2017-11-12T19:05:24.000Z 3191434 146033
#> 4 Good Mythical Morning          24 2017-11-13T11:00:04.000Z  343168  10172
#> 5              nigahiga          24 2017-11-12T18:01:41.000Z 2095731 132235
#> 6              iJustine          28 2017-11-13T19:07:23.000Z  119180   9763
#>   dislikes comment_count comments_disabled ratings_disabled
#> 1     2966         15954             FALSE            FALSE
#> 2     6146         12703             FALSE            FALSE
#> 3     5339          8181             FALSE            FALSE
#> 4      666          2146             FALSE            FALSE
#> 5     1989         17518             FALSE            FALSE
#> 6      511          1434             FALSE            FALSE
#>   video_error_or_removed
#> 1                  FALSE
#> 2                  FALSE
#> 3                  FALSE
#> 4                  FALSE
#> 5                  FALSE
#> 6                  FALSE

str(vids)

#> 'data.frame':    13400 obs. of  12 variables:
#>  $ trending_date         : chr  "17.14.11" "17.14.11" "17.14.11" "17.14.11" ...
#>  $ title                 : chr  "WE WANT TO TALK ABOUT OUR MARRIAGE" "The Trump Presidency: Last Week Tonight with John Oliver (HBO)" "Racist Superman | Rudy Mancuso, King Bach & Lele Pons" "Nickelback Lyrics: Real or Fake?" ...
#>  $ channel_title         : chr  "CaseyNeistat" "LastWeekTonight" "Rudy Mancuso" "Good Mythical Morning" ...
#>  $ category_id           : int  22 24 23 24 24 28 24 28 1 25 ...
#>  $ publish_time          : chr  "2017-11-13T17:13:01.000Z" "2017-11-13T07:30:00.000Z" "2017-11-12T19:05:24.000Z" "2017-11-13T11:00:04.000Z" ...
#>  $ views                 : int  748374 2418783 3191434 343168 2095731 119180 2103417 817732 826059 256426 ...
#>  $ likes                 : int  57527 97185 146033 10172 132235 9763 15993 23663 3543 12654 ...
#>  $ dislikes              : int  2966 6146 5339 666 1989 511 2445 778 119 1363 ...
#>  $ comment_count         : int  15954 12703 8181 2146 17518 1434 1970 3432 340 2368 ...
#>  $ comments_disabled     : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
#>  $ ratings_disabled      : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
#>  $ video_error_or_removed: logi  FALSE FALSE FALSE FALSE FALSE FALSE ...

YouTube’s US Trending Videos is a collection of the 200 trending videos in the US per day, from 2017-11-14 to 2018-01-21. The columns are described below.

General information relating to video:

* trending_date: the date the video was trending
* title: video title
* channel_title: YouTube channel name
* category_id: video category
* publish_time: date and time the video was uploaded
* comments_disabled: whether comments are disabled
* ratings_disabled: whether ratings are disabled
* video_error_or_removed: whether the video was removed or had an error

Statistics on a particular date:

* views: number of views
* likes: number of likes
* dislikes: number of dislikes
* comment_count: number of comments

Explore your data! Does every column already have the right data type?

Columns that should be datetime: trending_date and publish_time
Columns that should be factor: category_id

Data Wrangling

Data wrangling is another term for data cleaning. Typical examples are converting data types and subsetting particular rows or columns.

`lubridate`

lubridate is a powerful package for working with dates and times.

Previously we converted data to the Date type with as.Date(), using these format codes:

YEAR
%Y = YYYY (2020)
%y = YY (20)

MONTH
%B = month name e.g. March
%b = abbreviated month name e.g. Mar
%m = 2-digit month e.g. 03

DAY
%A = weekday name e.g. Friday
%d = 2-digit day of the month e.g. 14

Note that %M stands for minutes, not for a month. Base R has no separate code for a 1-digit month, although %m also accepts "3" when parsing.

Convert trending_date to the Date type:

head(vids$trending_date)

#> [1] "17.14.11" "17.14.11" "17.14.11" "17.14.11" "17.14.11" "17.14.11"

base_date <- as.Date(x = vids$trending_date, format = "%y.%d.%m" )
head(base_date)

#> [1] "2017-11-14" "2017-11-14" "2017-11-14" "2017-11-14" "2017-11-14"
#> [6] "2017-11-14"
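
To see the format codes in action, the sketch below parses one of the trending_date strings and formats it back using numeric codes only (month and weekday names depend on the locale, so they are left out). The variable name x is illustrative.

```r
# Parse a "two-digit year . day . month" string such as "17.14.11"
x <- as.Date("17.14.11", format = "%y.%d.%m")

format(x, "%Y")        # four-digit year:        "2017"
format(x, "%m")        # two-digit month:        "11"
format(x, "%d")        # two-digit day of month: "14"
format(x, "%d/%m/%Y")  # codes can be combined:  "14/11/2017"
```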

Using lubridate:

library(lubridate)

a <- "19/04/22"
b <- "Tuesday, 19-04-2022"
c <- "April 19, 2022"
d <- "2022/04/19, 1:42PM"

# base R approach
as.Date(a, "%d/%m/%y")

#> [1] "2022-04-19"

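# returns NA: the format omits the comma after the date and uses %h and %m (month codes) where hour and minute codes are needed
as.Date(d, format="%Y/%m/%d %h:%m")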

#> [1] NA
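
The NA above comes from the format string, not from the data. A possible base R version is sketched below, assuming an English or C locale (so that %p recognises AM/PM) and UTC as the time zone; d_base is an illustrative name.

```r
d <- "2022/04/19, 1:42PM"

# %I = hour on a 12-hour clock, %M = minute, %p = AM/PM indicator
# The comma is written literally in the format so that it matches the input
d_base <- as.POSIXct(d, format = "%Y/%m/%d, %I:%M%p", tz = "UTC")
d_base

# The same time on a 24-hour clock
format(d_base, "%H:%M")
```

The lubridate call ymd_hm() used in this section reaches the same result without spelling out the format.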

# lubridate approach: just name the order of the components (d, m, y)
a <- dmy(a)
dmy(b)

#> [1] "2022-04-19"

d <- ymd_hm(d) 

class(a) # Date, because it carries no time information

#> [1] "Date"

class(d) # POSIXct, because it carries time information

#> [1] "POSIXct" "POSIXt"


### `sapply()` & `lapply()`

**sapply**: applies a function to every element of a vector (for example, every value in a column) in one call.

formula: `sapply(data, function)`

To translate a value into another value we can use `switch()`. However, `switch()` only handles a single value at a time (one row, not a whole column):




```r
switch("1", # data
       "1" = "Education", # dictionary
       "2" = "Travel", 
       "3" = "Music")

#> [1] "Education"

# This call raises an error, because switch() expects a single value:
# switch(c("1","2"),
#        "1" = "Education",
#        "2" = "Travel",
#        "3" = "Music")
```

This is solved with sapply():

data <- c("1","2")

sapply(X = data, # data/column to convert
       FUN = switch, # function
       "1" = "Education", # dictionary
       "2" = "Travel", 
       "3" = "Music")

#>           1           2 
#> "Education"    "Travel"

Note:

* switch() translates a value using the dictionary. If the value is not in the dictionary, it returns NULL.
* It is best to convert the data to character first.
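
The NULL case can be seen with a value that is missing from the dictionary. In this sketch the unnamed last argument acts as the fallback of switch(); the label "Other" and the variable names are assumptions made for the example.

```r
data <- c("1", "2", "9")  # "9" is not in the dictionary

# Without a fallback, the missing value yields NULL,
# so sapply() cannot simplify and returns a list
no_default <- sapply(data, switch,
                     "1" = "Education",
                     "2" = "Travel")
class(no_default)

# With an unnamed fallback, every element is a string
# and the result is a character vector again
with_default <- sapply(data, switch,
                       "1" = "Education",
                       "2" = "Travel",
                       "Other")
with_default
```

A list result is worth catching early, because a list column does not behave like a character column when it is later converted to factor.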

Convert category_id for every row with switch(), with the help of sapply():

# convert the `category_id` column to its original labels
vids$category_id <- sapply(X = as.character(vids$category_id), 
                           FUN = switch, 
                           "1" = "Film and Animation",
                           "2" = "Autos and Vehicles", 
                           "10" = "Music", 
                           "15" = "Pets and Animals", 
                           "17" = "Sports",
                           "19" = "Travel and Events", 
                           "20" = "Gaming", 
                           "22" = "People and Blogs", 
                           "23" = "Comedy",
                           "24" = "Entertainment", 
                           "25" = "News and Politics",
                           "26" = "Howto and Style", 
                           "27" = "Education",
                           "28" = "Science and Technology", 
                           "29" = "Nonprofit and Activism",
                           "43" = "Shows")

head(vids)

#>   trending_date                                                          title
#> 1      17.14.11                             WE WANT TO TALK ABOUT OUR MARRIAGE
#> 2      17.14.11 The Trump Presidency: Last Week Tonight with John Oliver (HBO)
#> 3      17.14.11          Racist Superman | Rudy Mancuso, King Bach & Lele Pons
#> 4      17.14.11                               Nickelback Lyrics: Real or Fake?
#> 5      17.14.11                                       I Dare You: GOING BALD!?
#> 6      17.14.11                                          2 Weeks with iPhone X
#>           channel_title            category_id             publish_time   views
#> 1          CaseyNeistat       People and Blogs 2017-11-13T17:13:01.000Z  748374
#> 2       LastWeekTonight          Entertainment 2017-11-13T07:30:00.000Z 2418783
#> 3          Rudy Mancuso                 Comedy 2017-11-12T19:05:24.000Z 3191434
#> 4 Good Mythical Morning          Entertainment 2017-11-13T11:00:04.000Z  343168
#> 5              nigahiga          Entertainment 2017-11-12T18:01:41.000Z 2095731
#> 6              iJustine Science and Technology 2017-11-13T19:07:23.000Z  119180
#>    likes dislikes comment_count comments_disabled ratings_disabled
#> 1  57527     2966         15954             FALSE            FALSE
#> 2  97185     6146         12703             FALSE            FALSE
#> 3 146033     5339          8181             FALSE            FALSE
#> 4  10172      666          2146             FALSE            FALSE
#> 5 132235     1989         17518             FALSE            FALSE
#> 6   9763      511          1434             FALSE            FALSE
#>   video_error_or_removed
#> 1                  FALSE
#> 2                  FALSE
#> 3                  FALSE
#> 4                  FALSE
#> 5                  FALSE
#> 6                  FALSE

# convert the `category_id` column to factor
vids$category_id <- as.factor(vids$category_id)
# check the data
str(vids)

#> 'data.frame':    13400 obs. of  12 variables:
#>  $ trending_date         : chr  "17.14.11" "17.14.11" "17.14.11" "17.14.11" ...
#>  $ title                 : chr  "WE WANT TO TALK ABOUT OUR MARRIAGE" "The Trump Presidency: Last Week Tonight with John Oliver (HBO)" "Racist Superman | Rudy Mancuso, King Bach & Lele Pons" "Nickelback Lyrics: Real or Fake?" ...
#>  $ channel_title         : chr  "CaseyNeistat" "LastWeekTonight" "Rudy Mancuso" "Good Mythical Morning" ...
#>  $ category_id           : Factor w/ 16 levels "Autos and Vehicles",..: 11 4 2 4 4 13 4 13 5 9 ...
#>  $ publish_time          : chr  "2017-11-13T17:13:01.000Z" "2017-11-13T07:30:00.000Z" "2017-11-12T19:05:24.000Z" "2017-11-13T11:00:04.000Z" ...
#>  $ views                 : int  748374 2418783 3191434 343168 2095731 119180 2103417 817732 826059 256426 ...
#>  $ likes                 : int  57527 97185 146033 10172 132235 9763 15993 23663 3543 12654 ...
#>  $ dislikes              : int  2966 6146 5339 666 1989 511 2445 778 119 1363 ...
#>  $ comment_count         : int  15954 12703 8181 2146 17518 1434 1970 3432 340 2368 ...
#>  $ comments_disabled     : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
#>  $ ratings_disabled      : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
#>  $ video_error_or_removed: logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
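
A named vector can serve as the same kind of dictionary without sapply(), because indexing by name is already vectorised. This is an alternative sketch, not the approach used above; lookup, ids and labels are illustrative names, and only three categories are listed.

```r
# Dictionary: names are the ids (as character), values are the labels
lookup <- c("22" = "People and Blogs",
            "23" = "Comedy",
            "24" = "Entertainment")

ids <- c(22, 24, 23, 24)

# Index by name; unname() drops the ids that would otherwise label the result
labels <- unname(lookup[as.character(ids)])
labels

# Convert to factor, as done for category_id
factor(labels)
```

An id that is missing from lookup comes back as NA here, rather than as NULL.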

lapply: applies a function (for example, a type conversion) to many columns at once.

formula: lapply(data, function)

Note: Below is an example of lapply() usage; it is not required for this case.

# base approach, one column at a time
vids$views <- as.numeric(vids$views)
vids$likes <- as.numeric(vids$likes)
vids$dislikes <- as.numeric(vids$dislikes)
vids$comment_count <- as.numeric(vids$comment_count)

# the same conversion for all four columns at once with lapply()
vids[,c("views","likes","dislikes","comment_count")] <- lapply(vids[,c("views","likes","dislikes","comment_count")], as.numeric)

str(vids)

#> 'data.frame':    13400 obs. of  12 variables:
#>  $ trending_date         : chr  "17.14.11" "17.14.11" "17.14.11" "17.14.11" ...
#>  $ title                 : chr  "WE WANT TO TALK ABOUT OUR MARRIAGE" "The Trump Presidency: Last Week Tonight with John Oliver (HBO)" "Racist Superman | Rudy Mancuso, King Bach & Lele Pons" "Nickelback Lyrics: Real or Fake?" ...
#>  $ channel_title         : chr  "CaseyNeistat" "LastWeekTonight" "Rudy Mancuso" "Good Mythical Morning" ...
#>  $ category_id           : Factor w/ 16 levels "Autos and Vehicles",..: 11 4 2 4 4 13 4 13 5 9 ...
#>  $ publish_time          : chr  "2017-11-13T17:13:01.000Z" "2017-11-13T07:30:00.000Z" "2017-11-12T19:05:24.000Z" "2017-11-13T11:00:04.000Z" ...
#>  $ views                 : num  748374 2418783 3191434 343168 2095731 ...
#>  $ likes                 : num  57527 97185 146033 10172 132235 ...
#>  $ dislikes              : num  2966 6146 5339 666 1989 ...
#>  $ comment_count         : num  15954 12703 8181 2146 17518 ...
#>  $ comments_disabled     : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
#>  $ ratings_disabled      : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
#>  $ video_error_or_removed: logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
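
The same lapply() pattern on a small made-up data frame, where the result is easy to check by eye. The names toy and num_cols are hypothetical.

```r
toy <- data.frame(views = c(10L, 20L),
                  likes = c(1L, 2L),
                  title = c("a", "b"),
                  stringsAsFactors = FALSE)

num_cols <- c("views", "likes")

# lapply() returns a list with one converted column per element;
# assigning it back replaces those columns in place
toy[num_cols] <- lapply(toy[num_cols], as.numeric)

# Column types after the conversion
sapply(toy, class)
```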

Feature Engineering

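Feature engineering means creating new columns/variables from the existing data. It is useful for extracting additional information that can be used for data exploration and modeling.

Extract the hour from publish_time into a new column, publish_hour. Note that publish_time is still a character column at this point, so hour() reads only the date part and returns 0 for every row, as the output below shows. Parsing the column as a datetime first (for example with ymd_hms()) gives the actual hours.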

vids$publish_hour <- hour(vids$publish_time)
head(vids,3)

#>   trending_date                                                          title
#> 1      17.14.11                             WE WANT TO TALK ABOUT OUR MARRIAGE
#> 2      17.14.11 The Trump Presidency: Last Week Tonight with John Oliver (HBO)
#> 3      17.14.11          Racist Superman | Rudy Mancuso, King Bach & Lele Pons
#>     channel_title      category_id             publish_time   views  likes
#> 1    CaseyNeistat People and Blogs 2017-11-13T17:13:01.000Z  748374  57527
#> 2 LastWeekTonight    Entertainment 2017-11-13T07:30:00.000Z 2418783  97185
#> 3    Rudy Mancuso           Comedy 2017-11-12T19:05:24.000Z 3191434 146033
#>   dislikes comment_count comments_disabled ratings_disabled
#> 1     2966         15954             FALSE            FALSE
#> 2     6146         12703             FALSE            FALSE
#> 3     5339          8181             FALSE            FALSE
#>   video_error_or_removed publish_hour
#> 1                  FALSE            0
#> 2                  FALSE            0
#> 3                  FALSE            0

unique(vids$publish_hour)

#> [1] 0
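
All zeros in the output above suggest that the time part of publish_time was not read, which is what happens when hour() receives a character column instead of a datetime. A possible fix is to parse the values with ymd_hms() first. The sketch below uses three timestamps copied from the data; converting to "America/New_York" is an assumption about the intended time zone, and pt, pt_utc and pt_est are illustrative names.

```r
library(lubridate)

pt <- c("2017-11-13T17:13:01.000Z",
        "2017-11-13T07:30:00.000Z",
        "2017-11-12T19:05:24.000Z")

# Parse the ISO 8601 strings (the trailing Z means UTC) into POSIXct
pt_utc <- ymd_hms(pt)
hour(pt_utc)

# The same instants expressed in US Eastern time
pt_est <- with_tz(pt_utc, tzone = "America/New_York")
hour(pt_est)
```

Applied to the real data, the parsed values would replace vids$publish_time before publish_hour is computed.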

Create a publish_when column by splitting publish_hour into periods (Day-Night) with ifelse():

vids$publish_when <- ifelse(test = vids$publish_hour > 12, yes = "Night", no="Day")

# check the result
head(vids)

#>   trending_date                                                          title
#> 1      17.14.11                             WE WANT TO TALK ABOUT OUR MARRIAGE
#> 2      17.14.11 The Trump Presidency: Last Week Tonight with John Oliver (HBO)
#> 3      17.14.11          Racist Superman | Rudy Mancuso, King Bach & Lele Pons
#> 4      17.14.11                               Nickelback Lyrics: Real or Fake?
#> 5      17.14.11                                       I Dare You: GOING BALD!?
#> 6      17.14.11                                          2 Weeks with iPhone X
#>           channel_title            category_id             publish_time   views
#> 1          CaseyNeistat       People and Blogs 2017-11-13T17:13:01.000Z  748374
#> 2       LastWeekTonight          Entertainment 2017-11-13T07:30:00.000Z 2418783
#> 3          Rudy Mancuso                 Comedy 2017-11-12T19:05:24.000Z 3191434
#> 4 Good Mythical Morning          Entertainment 2017-11-13T11:00:04.000Z  343168
#> 5              nigahiga          Entertainment 2017-11-12T18:01:41.000Z 2095731
#> 6              iJustine Science and Technology 2017-11-13T19:07:23.000Z  119180
#>    likes dislikes comment_count comments_disabled ratings_disabled
#> 1  57527     2966         15954             FALSE            FALSE
#> 2  97185     6146         12703             FALSE            FALSE
#> 3 146033     5339          8181             FALSE            FALSE
#> 4  10172      666          2146             FALSE            FALSE
#> 5 132235     1989         17518             FALSE            FALSE
#> 6   9763      511          1434             FALSE            FALSE
#>   video_error_or_removed publish_hour publish_when
#> 1                  FALSE            0          Day
#> 2                  FALSE            0          Day
#> 3                  FALSE            0          Day
#> 4                  FALSE            0          Day
#> 5                  FALSE            0          Day
#> 6                  FALSE            0          Day

This also works for more than 2 conditions [Optional]:

# x = a single publish hour

pw <- function(x){ 
    
    # return the label of the period that x falls into
    if(x < 8){
      "12am to 8am"
    }else if(x >= 8 && x < 16){
      "8am to 4pm"
    }else{
      "4pm to 12am"
    }
}

# use `sapply()` to apply it to every row
temp <- sapply(vids$publish_hour, pw)

# check the result
head(vids$publish_hour)

#> [1] 0 0 0 0 0 0

head(temp)

#> [1] "12am to 8am" "12am to 8am" "12am to 8am" "12am to 8am" "12am to 8am"
#> [6] "12am to 8am"
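
For more than two periods, cut() is a vectorised alternative to a custom function plus sapply(). The sketch uses made-up hours; the breaks mirror the three periods above, and hours and period are illustrative names.

```r
hours <- c(0, 7, 8, 15, 16, 23)

# right = FALSE makes each interval closed on the left: [0,8), [8,16), [16,24)
period <- cut(hours,
              breaks = c(0, 8, 16, 24),
              labels = c("12am to 8am", "8am to 4pm", "4pm to 12am"),
              right = FALSE)
period

# Number of values per period
table(period)
```

The result is a factor, which is convenient for the categorical plots later in this section.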

`match()`

The vids data contains redundancy: some videos appear several times because they trended for more than one day.

length(vids$title)

#> [1] 13400

length(unique(vids$title))

#> [1] 2986

For further analysis we will only use the row from the first day each video trended, to reduce redundancy. For that we can use unique() and match().

Example:

# dummy data
df <- data.frame(nama = c("Lita", "Lita", "Nurul", "Dwi"), 
                 umur = c(22,23,22,22))
df

#>    nama umur
#> 1  Lita   22
#> 2  Lita   23
#> 3 Nurul   22
#> 4   Dwi   22

# take the unique names
unique(df$nama)

#> [1] "Lita"  "Nurul" "Dwi"

# find the index where each unique name first appears

index <- match(unique(df$nama), df$nama) 
         # at which index of `df$nama` each value of `unique(df$nama)` first matches

index

#> [1] 1 3 4

# keep only the rows at those indices
df[index, ]

#>    nama umur
#> 1  Lita   22
#> 3 Nurul   22
#> 4   Dwi   22

Apply this to the vids data:

index.vids <- match(unique(vids$title), vids$title)
vids.u <- vids[index.vids,] # keep only the first row of each unique video title

head(vids.u)

#>   trending_date                                                          title
#> 1      17.14.11                             WE WANT TO TALK ABOUT OUR MARRIAGE
#> 2      17.14.11 The Trump Presidency: Last Week Tonight with John Oliver (HBO)
#> 3      17.14.11          Racist Superman | Rudy Mancuso, King Bach & Lele Pons
#> 4      17.14.11                               Nickelback Lyrics: Real or Fake?
#> 5      17.14.11                                       I Dare You: GOING BALD!?
#> 6      17.14.11                                          2 Weeks with iPhone X
#>           channel_title            category_id             publish_time   views
#> 1          CaseyNeistat       People and Blogs 2017-11-13T17:13:01.000Z  748374
#> 2       LastWeekTonight          Entertainment 2017-11-13T07:30:00.000Z 2418783
#> 3          Rudy Mancuso                 Comedy 2017-11-12T19:05:24.000Z 3191434
#> 4 Good Mythical Morning          Entertainment 2017-11-13T11:00:04.000Z  343168
#> 5              nigahiga          Entertainment 2017-11-12T18:01:41.000Z 2095731
#> 6              iJustine Science and Technology 2017-11-13T19:07:23.000Z  119180
#>    likes dislikes comment_count comments_disabled ratings_disabled
#> 1  57527     2966         15954             FALSE            FALSE
#> 2  97185     6146         12703             FALSE            FALSE
#> 3 146033     5339          8181             FALSE            FALSE
#> 4  10172      666          2146             FALSE            FALSE
#> 5 132235     1989         17518             FALSE            FALSE
#> 6   9763      511          1434             FALSE            FALSE
#>   video_error_or_removed publish_hour publish_when
#> 1                  FALSE            0          Day
#> 2                  FALSE            0          Day
#> 3                  FALSE            0          Day
#> 4                  FALSE            0          Day
#> 5                  FALSE            0          Day
#> 6                  FALSE            0          Day

dim(vids.u)

#> [1] 2986   14
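
The same first-occurrence subset can also be written with duplicated(), which flags every repeat of a value that has already been seen. It is shown here on the dummy data from the example; first_rows is an illustrative name.

```r
df <- data.frame(nama = c("Lita", "Lita", "Nurul", "Dwi"),
                 umur = c(22, 23, 22, 22),
                 stringsAsFactors = FALSE)

# duplicated() is TRUE for the second "Lita" only
duplicated(df$nama)

# Keep the rows whose name has not appeared earlier
first_rows <- df[!duplicated(df$nama), ]
first_rows
```

Both approaches keep the first row per value, so the choice between them is mostly a matter of readability.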

Missing Value

Missing values (NA) can complicate data processing, so they need to be detected and, if present, handled.

# check the whole data frame
anyNA(vids.u)

#> [1] FALSE

# count NAs per column
colSums(is.na(vids.u))

#>          trending_date                  title          channel_title 
#>                      0                      0                      0 
#>            category_id           publish_time                  views 
#>                      0                      0                      0 
#>                  likes               dislikes          comment_count 
#>                      0                      0                      0 
#>      comments_disabled       ratings_disabled video_error_or_removed 
#>                      0                      0                      0 
#>           publish_hour           publish_when 
#>                      0                      0
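
When missing values do exist, two common treatments are dropping the affected rows or imputing a value. A minimal sketch on made-up data follows; toy is hypothetical, and imputing the median is only one possible choice.

```r
toy <- data.frame(views = c(100, NA, 300),
                  likes = c(10, 20, NA))

# One NA in each column
colSums(is.na(toy))

# Option 1: drop every row that contains an NA
complete <- na.omit(toy)
nrow(complete)

# Option 2: replace NA in a column with that column's median
toy$views[is.na(toy$views)] <- median(toy$views, na.rm = TRUE)
toy$views
```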

Base Plot

Exploratory Data Analysis (EDA) aims to extract information from the data (exploration). EDA can be done with base plots.

Histogram

Purpose: check the distribution of the data.

Example: at what hours are trending videos most often published? What does the distribution of publish_hour in vids.u look like?

hist(vids.u$publish_hour,
     breaks = 20,
     xlim = c(0,25),
     xaxt = "n")
axis(side=1, at=seq(0,25,5))

Insight:

Many trending videos are published around 10-11 AM EST (this reading assumes publish_hour holds the actual publish hours in US Eastern time).
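
The counts behind a histogram can be inspected without drawing it by passing plot = FALSE. The hours below are made up for illustration, and the 4-hour bins are an arbitrary choice.

```r
hours <- c(9, 10, 10, 11, 11, 11, 15, 20)

# Bins are (0,4], (4,8], (8,12], (12,16], (16,20], (20,24]
h <- hist(hours, breaks = seq(0, 24, by = 4), plot = FALSE)

h$breaks   # bin edges
h$counts   # number of values per bin
```

Reading h$counts next to h$breaks makes it easy to state which interval is the busiest.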

Boxplot

Purpose: check the distribution and the outliers of the data.

Example, for the same question as above:

boxplot(vids.u$publish_hour)

Insight:

There are no outliers in the publish hour data.
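
Whether a boxplot would show outliers can also be checked numerically with boxplot.stats(), which applies the same 1.5 x IQR rule that the plot uses for its whiskers. The values below are made up.

```r
with_outlier    <- c(1, 2, 3, 4, 5, 100)
without_outlier <- c(1, 2, 3, 4, 5, 6)

# $out holds the points that lie beyond the whiskers
boxplot.stats(with_outlier)$out
boxplot.stats(without_outlier)$out

# Number of outliers in each vector
length(boxplot.stats(with_outlier)$out)
length(boxplot.stats(without_outlier)$out)
```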

`plot()`

Purpose: draws different plot types depending on the type of data passed in.

x categorical (factor): bar chart -> frequency of each category
x numeric: scatterplot -> spread of the data
x and y numeric: scatterplot -> relationship between the two variables
x categorical (factor), y numeric: boxplot -> comparison of distributions across categories
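
The four cases can be tried on a few made-up values. The vectors cat_x, num_x and num_y are hypothetical; note that the categorical input has to be a factor for plot() to pick the bar chart or boxplot.

```r
cat_x <- factor(c("a", "b", "a", "c"))
num_x <- c(1, 4, 2, 8)
num_y <- c(2, 3, 5, 9)

plot(cat_x)          # factor alone: bar chart of frequencies
plot(num_x)          # numeric alone: values plotted against their index
plot(num_x, num_y)   # numeric vs numeric: scatterplot
plot(cat_x, num_y)   # factor vs numeric: one boxplot per level
```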

# plot()
plot(vids.u$category_id, horiz = TRUE, las = 2)

Business Question:

We are interested in the category_id values “Autos and Vehicles”, “Gaming”, and “Travel and Events”. Across these three categories, is there a relationship between likes/view and dislikes/view?

Steps:

Subset vids.u to the categories above and store the result in vids.agt:

# vids.agt <- vids.u[vids.u$category_id == "Autos and Vehicles" | vids.u$category_id == "Gaming" | vids.u$category_id == "Travel and Events",]

vids.agt <- vids.u[vids.u$category_id %in% c("Autos and Vehicles", "Gaming", "Travel and Events"),]

Create a likesp column holding likes/view and a dislikesp column holding dislikes/view:

vids.agt$likesp <- vids.agt$likes/vids.agt$views
vids.agt$dislikesp <- vids.agt$dislikes/vids.agt$views
head(vids.agt)

#>     trending_date                                                        title
#> 31       17.14.11 I TOOK THE $3,000,000 LAMBO TO CARMAX! They offered me......
#> 35       17.14.11       New Emirates First Class Suite | Boeing 777 | Emirates
#> 59       17.14.11                                  Train Swipes Parked Vehicle
#> 132      17.14.11                         L.A. Noire - Nintendo Switch Trailer
#> 164      17.14.11                 Caterham Chris Hoy 60 Second Donut Challenge
#> 198      17.14.11          Inside Keanu Reeves' Custom Motorcycle Shop | WIRED
#>     channel_title        category_id             publish_time  views likes
#> 31    hp_overload Autos and Vehicles 2017-11-13T01:43:12.000Z  98378  4035
#> 35       Emirates  Travel and Events 2017-11-12T05:55:42.000Z 141148  1661
#> 59       ViralHog Autos and Vehicles 2017-11-13T00:46:11.000Z   7265    89
#> 132      Nintendo             Gaming 2017-11-09T19:59:48.000Z 154872  7683
#> 164 Caterham Cars Autos and Vehicles 2017-11-09T09:59:31.000Z   4850    22
#> 198         WIRED Autos and Vehicles 2017-11-08T15:00:27.000Z 704363 16352
#>     dislikes comment_count comments_disabled ratings_disabled
#> 31       495           486             FALSE            FALSE
#> 35        70           236             FALSE            FALSE
#> 59         8            22             FALSE            FALSE
#> 132      164          1734             FALSE            FALSE
#> 164        1             1             FALSE            FALSE
#> 198      224           841             FALSE            FALSE
#>     video_error_or_removed publish_hour publish_when      likesp    dislikesp
#> 31                   FALSE            0          Day 0.041015268 0.0050316128
#> 35                   FALSE            0          Day 0.011767790 0.0004959333
#> 59                   FALSE            0          Day 0.012250516 0.0011011700
#> 132                  FALSE            0          Day 0.049608709 0.0010589390
#> 164                  FALSE            0          Day 0.004536082 0.0002061856
#> 198                  FALSE            0          Day 0.023215302 0.0003180178

Visualize:

Which plot type fits best?

scatterplot -> checks the relationship between numeric variables
histogram -> distribution of a single variable
boxplot -> distribution of a single variable, like a histogram, plus its outliers

plot(vids.agt$likesp,vids.agt$dislikesp)

cor(vids.agt$likesp, vids.agt$dislikesp)

#> [1] 0.1712322

Insight: likes per view and dislikes per view in vids.agt are only weakly correlated (r = 0.17).
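
An overall correlation can hide differences between categories. One way to check, sketched here on made-up numbers, is to split the data by category and compute cor() per group. The data frame toy and its values are hypothetical and chosen so that the two groups point in opposite directions.

```r
toy <- data.frame(
  category  = c("Gaming", "Gaming", "Gaming", "Travel", "Travel", "Travel"),
  likesp    = c(0.01, 0.02, 0.03, 0.01, 0.02, 0.03),
  dislikesp = c(0.001, 0.002, 0.003, 0.003, 0.002, 0.001),
  stringsAsFactors = FALSE
)

# Correlation over all rows together
cor(toy$likesp, toy$dislikesp)

# split() gives one data frame per category; sapply() runs cor() on each
by_cat <- sapply(split(toy, toy$category),
                 function(g) cor(g$likesp, g$dislikesp))
by_cat
```

The same split-and-apply pattern could be used on vids.agt with category_id as the grouping column.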

THANKYOUUUU :)

Workflow Visualization