1 INTRODUCTION

For this Learning by Building assignment, I chose one of the two projects described in our .Rmd file and worked on it using the USvideos data (YouTube data):

Module 1: Creating a publication-grade plot. Applying what you’ve learned, create an economics- or social-related plot that is polished with the appropriate annotations, aesthetics and some simple commentary.

R as a statistical computing environment packs a generous number of tools that allow us to reshape, clean and visualize our data through its built-in capabilities. In the first part of this coursebook, we’ll take a look at many of these capabilities and learn how to incorporate them into our day-to-day data science work. In the second part of this coursebook, we’ll shift our focus onto ggplot, a plotting system by Hadley Wickham. As you’ll see in this 3-day workshop, this plotting system is among the most popular visualization tools today because of its power, extensibility and simplicity (an unlikely combination).

We will need to use install.packages() to install any packages that are not already installed on our machine. We then load each package into our workspace using the library() function:

library(ggplot2)
library(GGally)
library(ggthemes)
library(ggpubr)
## Loading required package: magrittr
library(leaflet)
library(lubridate)
## 
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
## 
##     date
library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
library(reshape2)
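
The install step mentioned above can be sketched as follows. This is a minimal sketch, not part of the original workflow: `missing_packages()` is a hypothetical helper introduced here for illustration, and the actual `install.packages()` call only runs in an interactive session.

```r
# Packages loaded in this report
pkgs <- c("ggplot2", "GGally", "ggthemes", "ggpubr", "leaflet",
          "lubridate", "plotly", "reshape2")

# Hypothetical helper: return the packages that are not installed yet
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(pkgs)

# Only install when running interactively and something is actually missing
if (interactive() && length(to_install) > 0) {
  install.packages(to_install)
}
```

After this, the library() calls above should load without "there is no package called ..." errors.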

2 Creating Plots using USvideos data (YouTube data)

To get started with plotting in R, let’s first read our data into the environment:

vids <- read.csv("USvideos.csv")
names(vids)
##  [1] "trending_date"          "title"                 
##  [3] "channel_title"          "category_id"           
##  [5] "publish_time"           "views"                 
##  [7] "likes"                  "dislikes"              
##  [9] "comment_count"          "comments_disabled"     
## [11] "ratings_disabled"       "video_error_or_removed"
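
The str() output further down shows channel_title and publish_time as factors, which is what read.csv() produces when strings are converted to factors (the default before R 4.0.0). A minimal sketch of the difference, using a small made-up CSV string rather than USvideos.csv:

```r
# Made-up two-row CSV, used only to illustrate the stringsAsFactors argument
csv_text <- "title,views\nA,10\nB,20"

as_factor <- read.csv(text = csv_text, stringsAsFactors = TRUE)
as_text   <- read.csv(text = csv_text, stringsAsFactors = FALSE)

class(as_factor$title)  # factor
class(as_text$title)    # character
```

Setting the argument explicitly keeps the later as.character() and as.factor() conversions predictable across R versions.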

2.1 Reviewing the Summary and Preparing the Data before Plotting

# Install and load the lubridate library, then convert the date format.
vids$trending_date <- ydm(vids$trending_date)
# The raw dataset does not have the proper names for each category but identifies them by an "id" instead. The following code chunk "switches" them by "id" and also converts the result to a factor. We will also convert our video titles to a character vector:

vids$title <- as.character(vids$title)
vids$category_id <- sapply(as.character(vids$category_id), switch, 
                           "1" = "Film and Animation",
                           "2" = "Autos and Vehicles", 
                           "10" = "Music", 
                           "15" = "Pets and Animals", 
                           "17" = "Sports",
                           "19" = "Travel and Events", 
                           "20" = "Gaming", 
                           "22" = "People and Blogs", 
                           "23" = "Comedy",
                           "24" = "Entertainment", 
                           "25" = "News and Politics",
                           "26" = "Howto and Style", 
                           "27" = "Education",
                           "28" = "Science and Technology", 
                           "29" = "Nonprofit and Activism",
                           "43" = "Shows")
vids$category_id <- as.factor(vids$category_id)
str(vids)
## 'data.frame':    13400 obs. of  12 variables:
##  $ trending_date         : Date, format: "2017-11-14" "2017-11-14" ...
##  $ title                 : chr  "WE WANT TO TALK ABOUT OUR MARRIAGE" "The Trump Presidency: Last Week Tonight with John Oliver (HBO)" "Racist Superman | Rudy Mancuso, King Bach & Lele Pons" "Nickelback Lyrics: Real or Fake?" ...
##  $ channel_title         : Factor w/ 1408 levels "_¢_Á_\235","“÷\201\220µ_‘⬓_\220 Korean Englishman",..: 195 686 1046 472 902 559 1063 283 6 1358 ...
##  $ category_id           : Factor w/ 16 levels "Autos and Vehicles",..: 11 4 2 4 4 13 4 13 5 9 ...
##  $ publish_time          : Factor w/ 2903 levels "2008-04-05T18:22:40.000Z",..: 302 271 255 275 253 307 240 258 281 279 ...
##  $ views                 : int  748374 2418783 3191434 343168 2095731 119180 2103417 817732 826059 256426 ...
##  $ likes                 : int  57527 97185 146033 10172 132235 9763 15993 23663 3543 12654 ...
##  $ dislikes              : int  2966 6146 5339 666 1989 511 2445 778 119 1363 ...
##  $ comment_count         : int  15954 12703 8181 2146 17518 1434 1970 3432 340 2368 ...
##  $ comments_disabled     : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ ratings_disabled      : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ video_error_or_removed: logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
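
The sapply() and switch() lookup used earlier works here because every category id in the data has an entry. If an id had no match, switch() would return NULL and sapply() would fall back to a list. A minimal alternative sketch, assuming a named character vector as the lookup table (only a subset of the mapping is shown, and the ids are made up):

```r
# Illustrative subset of the id-to-name mapping
category_names <- c("1"  = "Film and Animation",
                    "10" = "Music",
                    "24" = "Entertainment")

# Made-up ids; 99 has no entry in the lookup table
ids <- c(10, 24, 1, 99)

# Index the lookup by name; unmatched ids become NA instead of NULL
labels <- unname(category_names[as.character(ids)])
labels

category <- factor(labels)
levels(category)
```

An NA in the result makes an unmapped id easy to spot with is.na() before plotting.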
# Convert the publish time to the US time zone (America/New_York)
head(vids$publish_time)
## [1] 2017-11-13T17:13:01.000Z 2017-11-13T07:30:00.000Z
## [3] 2017-11-12T19:05:24.000Z 2017-11-13T11:00:04.000Z
## [5] 2017-11-12T18:01:41.000Z 2017-11-13T19:07:23.000Z
## 2903 Levels: 2008-04-05T18:22:40.000Z ... 2018-01-21T05:44:30.000Z
vids$publish_time <- ymd_hms(vids$publish_time,tz="America/New_York")
## Date in ISO8601 format; converting timezone from UTC to "America/New_York".
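
The message above says the ISO 8601 timestamps are read as UTC and then shifted to "America/New_York". A minimal base R sketch of the same conversion for the first timestamp in the head() output, without lubridate (an alternative route for illustration, not the code this report runs):

```r
# First publish_time value from the head() output above
raw <- "2017-11-13T17:13:01.000Z"

# Keep the first 19 characters ("YYYY-MM-DDTHH:MM:SS") and parse them as UTC
utc_time <- as.POSIXct(substr(raw, 1, 19),
                       format = "%Y-%m-%dT%H:%M:%S", tz = "UTC")

# Express the same instant as an hour of the day in New York local time
ny_hour <- as.integer(format(utc_time, "%H", tz = "America/New_York"))
ny_hour
```

The publish hour therefore reflects the US East Coast clock, which is what the period-of-day grouping later in this section is based on.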
most <- vids[vids$views == max(vids$views),]
year(most$trending_date)
## [1] 2017
month(most$trending_date)
## [1] 12
day(most$trending_date)
## [1] 14
hour(most$trending_date)
## [1] 0
# We will also go ahead and create three new variables for our data frame, storing the hour, the period of the day, and the day of the week at which each video was published:
vids$publish_hour <- hour(vids$publish_time)

# Map an hour of the day (0-23) to one of three publishing periods
pw <- function(x){
    if(x < 8){
      "12am to 8am"
    }else if(x >= 8 && x < 16){
      "8am to 3pm"   # covers hours 8 through 15
    }else{
      "3pm to 12am"
    }
}

vids$publish_when <- as.factor(sapply(vids$publish_hour, pw)) 
vids$publish_wday <- as.factor(weekdays(vids$publish_time))

# Reorder the factor levels so the weekdays run from Monday to Sunday
vids$publish_wday <- ordered(vids$publish_wday, levels=c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"))

vids[,c("views", "likes", "dislikes", "comment_count")] <- lapply(vids[,c("views", "likes", "dislikes", "comment_count")], as.numeric) # convert to numeric
# Create a new data frame that keeps only the first row for each unique title
vids.u <- vids[match(unique(vids$title), vids$title),]
vids.u$timetotrend <- vids.u$trending_date - as.Date(vids.u$publish_time)
vids.u$timetotrend <- as.factor(ifelse(vids.u$timetotrend <= 7, vids.u$timetotrend, "8+"))
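
The time-to-trend calculation can be checked on a few made-up dates. This is a minimal sketch assuming the difference is measured in whole days; the explicit levels argument is an addition here so that the categories sort from "0" to "8+":

```r
# Made-up trending and publish dates
trending  <- as.Date(c("2017-11-14", "2017-11-14", "2017-12-01"))
published <- as.Date(c("2017-11-13", "2017-11-01", "2017-11-30"))

# Number of days between publishing and trending
days <- as.numeric(trending - published)

# Keep 0 to 7 as their own categories and lump everything longer into "8+"
timetotrend <- factor(ifelse(days <= 7, as.character(days), "8+"),
                      levels = c(as.character(0:7), "8+"))
table(timetotrend)
```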

2.2 Plotting and Analyzing the Data

The steps are:

  1. Analyze the data (using names(), str(), and head() to get to know and examine it).

  2. Then do the plotting.

Plots produced:

2.2.1 COMPARING EACH CATEGORY BY VIEWERS

category_1<-ggplot(vids.u,aes(category_id,views/1000000))+ 
  geom_col(fill="blue",position="dodge")+
  facet_wrap(~publish_when)+
  labs(title="YOUTUBE CATEGORY VS VIEWS", x="Category",y="Viewers",
       subtitle="Viewers in millions",
       caption="YouTube top viewers")+
  coord_flip()+
  theme_economist()
category_1

Conclusions:

  1. Each category draws its largest number of viewers at particular hours of the day.

  2. Music has the most viewers across all time periods.

  3. Entertainment has the most viewers, especially from morning to afternoon.
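
geom_col() draws one bar segment per row, so with one row per video what a reader sees for a category depends on how those rows are combined. One way to make the plotted statistic explicit is to aggregate before plotting. A minimal sketch on made-up data, assuming total views per category and period is the quantity of interest:

```r
# Made-up rows standing in for vids.u
toy <- data.frame(
  category_id  = c("Music", "Music", "Comedy", "Comedy"),
  publish_when = c("12am to 8am", "12am to 8am", "8am to 3pm", "8am to 3pm"),
  views        = c(2e6, 3e6, 1e6, 5e5)
)

# Total views for each category within each publishing period
total_views <- aggregate(views ~ category_id + publish_when,
                         data = toy, FUN = sum)

# Express the totals in millions, matching the plot's subtitle
total_views$views_million <- total_views$views / 1e6
total_views
```

The resulting data frame has one row per bar and could be passed to ggplot() in place of the raw rows.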

2.2.2 COMPARING THE LIKES RATIO OF EACH CHANNEL

# Build a new data frame for the likes ratio: group by channel, take the mean, and sort from highest to lowest
ratiolikes <- aggregate.data.frame(list(likesratio = vids.u$likes/vids.u$views, comment = vids.u$comment_count), by=list(vids.u$channel_title), mean)
ratiolikes <- ratiolikes[order(ratiolikes$likesratio, decreasing = TRUE), ]
ratiolikes <- ratiolikes[ratiolikes$comment != 0, ]
head(ratiolikes)
##                 Group.1 likesratio comment
## 732       LuisFonsiVEVO  0.2706132   12094
## 727  LouisTomlinsonVEVO  0.2451110   26259
## 700       LiamPayneVEVO  0.2065753    3914
## 336           dodieVEVO  0.2022609    4780
## 1256       TheVampsVEVO  0.1932547    1176
## 717       littlemixVEVO  0.1926535    5629
Channel_likes<-ggplot(ratiolikes[1:20,], aes(x=Group.1, y=likesratio))+ # keep only the top 20 so the plot stays readable
  geom_point(aes(size=comment), color="dodgerblue",show.legend = FALSE)+ # the y aesthetic is the likes ratio; point size is the mean number of comments
  coord_flip()+
   labs(title="Channel Titles with the Highest Likes Ratio", x="Channel Title",y="Likes Ratio",
       subtitle="Likes as a proportion of views",
       caption="Channel's Top Likes")+
  theme_tufte()+
  scale_size(range=c(1,11))
Channel_likes

Conclusions:

  1. A channel with a high likes ratio does not necessarily have many comments.

  2. The most comments are on the ibighit channel.

  3. The highest likes ratio belongs to the LuisFonsiVEVO channel.
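
The aggregation behind these conclusions can be traced on a small made-up data frame. This sketch follows the same steps as the ratiolikes code; the only change is that the grouping list is named, which is an assumption added here so the result has a channel_title column instead of Group.1:

```r
# Made-up rows standing in for vids.u
toy <- data.frame(
  channel_title = c("A", "A", "B"),
  likes         = c(10, 30, 5),
  views         = c(100, 100, 100),
  comment_count = c(4, 6, 0)
)

# Mean per-video likes ratio and mean comment count for each channel
ratio <- aggregate(
  data.frame(likesratio = toy$likes / toy$views, comment = toy$comment_count),
  by = list(channel_title = toy$channel_title),
  FUN = mean
)

# Sort from highest to lowest ratio and drop channels with no comments
ratio <- ratio[order(ratio$likesratio, decreasing = TRUE), ]
ratio <- ratio[ratio$comment != 0, ]
ratio
```

Because the ratio is averaged per video, a channel with a single well-liked video can rank above a channel with many videos, which is worth keeping in mind when reading the top 20.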