library(lubridate)
library(dplyr)
library(ggplot2)
library(scales)
library(plotly)
library(glue)
library(tidyr)
We are a beginner YouTube channel that wants its videos to trend in our country. To learn what works, we analyze trending videos from YouTube channels in the United States, using the USvideos.csv data file, and look at what US YouTubers do to get their videos trending.
We will analyze the USvideos.csv data contained in the
data_input folder. Use the read.csv() function
to read the CSV file into R.
USvideo <- read.csv(file="data_input/USvideos.csv")
USvideo
After we have successfully imported the data, we inspect it to find out what it contains. We could use the View() function to see the full contents, but going through the whole data set takes time, so we use head() and tail() to look at only the first and last rows.
head(USvideo)
tail(USvideo)
dim(USvideo)
#> [1] 13400 12
The functions anyNA() and is.na() are used
to find out whether there are missing values in the data. They sound
simple, but skipping this check is risky, because missing values affect
later steps of the analysis, for example calculations.
anyNA(USvideo)
#> [1] FALSE
colSums(is.na(USvideo))
#>          trending_date                  title          channel_title 
#>                      0                      0                      0 
#>            category_id           publish_time                  views 
#>                      0                      0                      0 
#>                  likes               dislikes          comment_count 
#>                      0                      0                      0 
#>      comments_disabled       ratings_disabled video_error_or_removed 
#>                      0                      0                      0 
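Because USvideo has no missing values, the output above does not show what these checks look like when something is missing. The following is a minimal sketch on a toy data frame; the name `toy` and its values are hypothetical and are not part of USvideos.csv.

```r
# Toy data frame (hypothetical values) with two missing entries
toy <- data.frame(
  views = c(100L, NA, 300L),
  likes = c(10L, 20L, NA),
  title = c("a", "b", "c")
)

# TRUE if there is at least one NA anywhere in the data frame
has_missing <- anyNA(toy)

# Number of NA values in each column
missing_per_col <- colSums(is.na(toy))

# Number of rows that contain no NA at all
complete_rows <- sum(complete.cases(toy))

print(has_missing)
print(missing_per_col)
print(complete_rows)
```

If a check like this did report missing values, we would need to decide how to handle them (for example, dropping or imputing rows) before doing any calculations.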
str(USvideo)
#> 'data.frame': 13400 obs. of 12 variables:
#> $ trending_date : chr "17.14.11" "17.14.11" "17.14.11" "17.14.11" ...
#> $ title : chr "WE WANT TO TALK ABOUT OUR MARRIAGE" "The Trump Presidency: Last Week Tonight with John Oliver (HBO)" "Racist Superman | Rudy Mancuso, King Bach & Lele Pons" "Nickelback Lyrics: Real or Fake?" ...
#> $ channel_title : chr "CaseyNeistat" "LastWeekTonight" "Rudy Mancuso" "Good Mythical Morning" ...
#> $ category_id : int 22 24 23 24 24 28 24 28 1 25 ...
#> $ publish_time : chr "2017-11-13T17:13:01.000Z" "2017-11-13T07:30:00.000Z" "2017-11-12T19:05:24.000Z" "2017-11-13T11:00:04.000Z" ...
#> $ views : int 748374 2418783 3191434 343168 2095731 119180 2103417 817732 826059 256426 ...
#> $ likes : int 57527 97185 146033 10172 132235 9763 15993 23663 3543 12654 ...
#> $ dislikes : int 2966 6146 5339 666 1989 511 2445 778 119 1363 ...
#> $ comment_count : int 15954 12703 8181 2146 17518 1434 1970 3432 340 2368 ...
#> $ comments_disabled : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
#> $ ratings_disabled : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
#> $ video_error_or_removed: logi FALSE FALSE FALSE FALSE FALSE FALSE ...
From the results of our inspections, we get some information that can support our analysis, as follows:
YouTube’s US Trending Videos is a collection of the 200 trending videos per day in the US, from 2017-11-14 to 2018-01-21. The column descriptions are as follows:
Are there any unnecessary columns? Does each column already have the right data type?
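Discarded columns: comments_disabled, ratings_disabled, video_error_or_removed
Columns that need to be fixed: category_id (to factor), publish_time (to date-time), trending_date (to date)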
To save time, we use functions from the dplyr library to do all of this in one step.
USvideo <- USvideo %>%
  select(-c(comments_disabled, ratings_disabled, video_error_or_removed)) %>%
  mutate(category_id = as.factor(category_id),
         publish_time = ymd_hms(publish_time),
         trending_date = ydm(trending_date))
str(USvideo)
#> 'data.frame': 13400 obs. of 9 variables:
#> $ trending_date: Date, format: "2017-11-14" "2017-11-14" ...
#> $ title : chr "WE WANT TO TALK ABOUT OUR MARRIAGE" "The Trump Presidency: Last Week Tonight with John Oliver (HBO)" "Racist Superman | Rudy Mancuso, King Bach & Lele Pons" "Nickelback Lyrics: Real or Fake?" ...
#> $ channel_title: chr "CaseyNeistat" "LastWeekTonight" "Rudy Mancuso" "Good Mythical Morning" ...
#> $ category_id : Factor w/ 16 levels "1","2","10","15",..: 8 10 9 10 10 14 10 14 1 11 ...
#> $ publish_time : POSIXct, format: "2017-11-13 17:13:01" "2017-11-13 07:30:00" ...
#> $ views : int 748374 2418783 3191434 343168 2095731 119180 2103417 817732 826059 256426 ...
#> $ likes : int 57527 97185 146033 10172 132235 9763 15993 23663 3543 12654 ...
#> $ dislikes : int 2966 6146 5339 666 1989 511 2445 778 119 1363 ...
#> $ comment_count: int 15954 12703 8181 2146 17518 1434 1970 3432 340 2368 ...
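The choice of ydm() rather than ymd() matters here, because trending_date is stored in year.day.month order. As a small sketch, we can parse the first raw values shown in the earlier str() output on their own; the variable names below are only for illustration.

```r
library(lubridate)

# Raw values copied from the first row of the str() output
raw_trending <- "17.14.11"                  # year.day.month
raw_publish  <- "2017-11-13T17:13:01.000Z"  # ISO 8601 timestamp, UTC

# ydm() reads the fields as year, day, month
trending <- ydm(raw_trending)

# ymd_hms() returns a POSIXct value, in UTC by default
published <- ymd_hms(raw_publish)

print(trending)
print(published)

# With the wrong field order the parse fails, since there is no month 14
wrong <- suppressWarnings(ymd(raw_trending))
print(is.na(wrong))
```

Checking a few parsed values against the raw strings like this is a cheap way to confirm that the chosen parser matches the actual field order.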
Now each column has been converted to the desired data type, and the dataset is ready to be processed and analyzed.
In this section we visualize the data to see what American YouTubers are doing.
USvideo$publish_hour <- hour(USvideo$publish_time)
hist(USvideo$publish_hour, breaks = 10)
Answer: based on the histogram, videos are most often published
between hours 14 and 17 (the timestamps are in UTC).
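The publish_hour values come from timestamps that were parsed as UTC, so they are not local US clock times. If we wanted local hours, one possible approach is lubridate's with_tz(). The sketch below assumes US Eastern time ("America/New_York"), which is our own choice for illustration; the data does not say which time zone a channel is in.

```r
library(lubridate)

# One publish_time value from the data, parsed as UTC
published_utc <- ymd_hms("2017-11-13T17:13:01.000Z")

# Same instant shown on a US Eastern clock (assumed zone)
published_eastern <- with_tz(published_utc, tzone = "America/New_York")

hour_utc <- hour(published_utc)
hour_eastern <- hour(published_eastern)

print(hour_utc)
print(hour_eastern)
```

On the full data, the same idea would be `hour(with_tz(USvideo$publish_time, "America/New_York"))`, which shifts the whole histogram by the UTC offset.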
USvideo$publish_wday <- wday(USvideo$publish_time, label = TRUE, abbr = TRUE)
plot(USvideo$publish_wday)
Answer: most videos are published on Tuesday.
plot(USvideo$views, USvideo$likes)
cor(USvideo$views, USvideo$likes)
#> [1] 0.8831559
Answer: the scatter plot shows that videos with more views also tend to have more likes. To quantify this we use the cor() function and get a correlation of about 0.88, which indicates a strong positive linear relationship between likes and views. Correlation alone does not tell us whether likes drive views or the other way around.
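View and like counts are usually heavily skewed, so a few very popular videos can dominate the Pearson correlation. As a hedged cross-check, we can compare it with the rank-based Spearman correlation. The sketch below uses only the first ten views and likes values printed by str() above, so its numbers will differ from the full-data result.

```r
# First ten values of views and likes, copied from the str() output
views <- c(748374, 2418783, 3191434, 343168, 2095731,
           119180, 2103417, 817732, 826059, 256426)
likes <- c(57527, 97185, 146033, 10172, 132235,
           9763, 15993, 23663, 3543, 12654)

# Pearson: linear association, sensitive to extreme values
pearson <- cor(views, likes, method = "pearson")

# Spearman: association between ranks, less sensitive to outliers
spearman <- cor(views, likes, method = "spearman")

print(pearson)
print(spearman)
```

If both coefficients are high on the full data, the relationship between views and likes does not depend on just a handful of extreme videos.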
# replace the category IDs with readable category names
USvideo$category_id <- sapply(as.character(USvideo$category_id), switch,
                              "1" = "Film and Animation",
                              "2" = "Autos and Vehicles",
                              "10" = "Music",
                              "15" = "Pets and Animals",
                              "17" = "Sports",
                              "19" = "Travel and Events",
                              "20" = "Gaming",
                              "22" = "People and Blogs",
                              "23" = "Comedy",
                              "24" = "Entertainment",
                              "25" = "News and Politics",
                              "26" = "Howto and Style",
                              "27" = "Education",
                              "28" = "Science and Technology",
                              "29" = "Nonprofit and Activism",
                              "43" = "Shows")
# change the type back to factor
USvideo$category_id <- as.factor(USvideo$category_id)
# count the videos in each category
vids_count <- USvideo %>%
  group_by(category_id) %>%
  summarise(count_cat = n()) %>%
  arrange(desc(count_cat)) %>%
  ungroup()
# build the tooltip label for the chart
vids_count2 <- vids_count %>%
  mutate(label = glue("Category {category_id}\nVideo Count: {comma(count_cat)}"))
plot_count <- ggplot(vids_count2, aes(x = count_cat, y = reorder(category_id, count_cat), text = label)) +
  geom_col(aes(fill = count_cat)) +
  scale_fill_gradient(low = "blue", high = "navy") +
  labs(title = "Number of Videos by Category",
       x = "Video count",
       y = NULL) +
  theme_minimal() +
  theme(legend.position = "none")
ggplotly(plot_count, tooltip = "text")
This bar plot shows that Shows has the lowest video count and that the top four categories by video count are Entertainment, Music, Howto and Style, and Comedy.
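The relabelling step above uses sapply() with switch(). That works here because every ID has a matching name, but switch() returns NULL for an unmatched ID, and sapply() then gives back a list instead of a character vector. A named-vector lookup is one possible alternative that returns NA for unknown IDs. The sketch below uses a subset of the mapping and hypothetical input IDs, including a made-up "999".

```r
# Named vector: names are the raw IDs, values are the labels
# (a subset of the mapping used above, to keep the sketch short)
category_names <- c(
  "10" = "Music",
  "22" = "People and Blogs",
  "23" = "Comedy",
  "24" = "Entertainment"
)

# Hypothetical input; "999" stands for an ID missing from the table
ids <- c("22", "24", "23", "24", "10", "999")

# Vectorised lookup by name; unknown IDs become NA
labels <- unname(category_names[ids])
print(labels)

# table() counts videos per label (NA is dropped by default),
# sort() puts the largest count first
counts <- sort(table(labels), decreasing = TRUE)
print(counts)
```

On the real data the lookup would be `category_names[as.character(USvideo$category_id)]` with all 16 entries in the vector.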
Although Film and Animation is not among the four categories with the most published videos, it is in the top four when we rank categories by audience (views). As mentioned above, likes and views are strongly correlated, and the category ranking supports this: the top three categories by views are also the top three by likes. The fourth position is the exception: Film and Animation ranks fourth by views but has fewer likes than categories that rank below it by views.
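A comparison like this needs views and likes totalled per category. The sketch below shows one way to do that with dplyr on a toy data frame. The values are hypothetical and the column names `total_views`, `total_likes` and `likes_per_view` are our own. Wrapping the counts in as.numeric() is a precaution, since summing many large integer view counts can overflow R's integer type.

```r
library(dplyr)

# Toy data (hypothetical values), same column names as USvideo
toy <- data.frame(
  category_id = c("Music", "Music", "Comedy", "Film and Animation"),
  views = c(100L, 200L, 150L, 250L),
  likes = c(20L, 30L, 25L, 10L)
)

by_category <- toy %>%
  group_by(category_id) %>%
  summarise(
    n_videos = n(),
    total_views = sum(as.numeric(views)),  # numeric avoids overflow
    total_likes = sum(as.numeric(likes))
  ) %>%
  mutate(likes_per_view = total_likes / total_views) %>%
  arrange(desc(total_views))

print(by_category)
```

Sorting the same table by total_likes instead of total_views shows whether a category holds the same rank on both measures.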
From all the graphs above, we can draw some assumptions, such as: