Economics remains a popular field among students and working professionals, and it has a significant impact on both individuals and countries. Udemy has published courses in this field from 2011 until now (2020), with many free and paid courses on economics topics. Knowledge is updated continually, and that gives Udemy the opportunity to release new courses. But how can a new course reach many people and generate good income for Udemy? Let's check it out!
This data set is from Kaggle, https://www.kaggle.com/jilkothari/finance-accounting-courses-udemy-13k-course . It contains Udemy courses from India. Some important columns are:
title,
num_subscribers (number of subscribers),
is_paid (free or paid course category),
avg_rating (ratings),
num_published_lectures (total number of materials),
published_time (launch time),
discount_price__amount and price_detail__amount (used with the number of subscribers to compute total income).
First, load the libraries
library(dplyr)
library(lubridate)
library(ggplot2)
library(gsubfn)
library(ggthemes)
library(ggExtra)
library(scales)
Read the data and check it
course <- read.csv("udemy.csv")
str(course)
## 'data.frame': 13608 obs. of 20 variables:
## $ id : int 762616 937678 1361790 648826 637930 1208634 864146 321410 673654 1653432 ...
## $ title : chr "The Complete SQL Bootcamp 2020: Go from Zero to Hero" "Tableau 2020 A-Z: Hands-On Tableau Training for Data Science" "PMP Exam Prep Seminar - PMBOK Guide 6" "The Complete Financial Analyst Course 2020" ...
## $ url : chr "/course/the-complete-sql-bootcamp/" "/course/tableau10/" "/course/pmp-pmbok6-35-pdus/" "/course/the-complete-financial-analyst-course/" ...
## $ is_paid : chr "True" "True" "True" "True" ...
## $ num_subscribers : int 295509 209070 155282 245860 374836 124180 96207 127680 112572 115269 ...
## $ avg_rating : num 4.66 4.59 4.59 4.54 4.47 ...
## $ avg_rating_recent : num 4.68 4.6 4.59 4.54 4.47 ...
## $ rating : num 4.68 4.6 4.59 4.54 4.47 ...
## $ num_reviews : int 78006 54581 52653 46447 41630 38093 30470 28665 27408 23906 ...
## $ is_wishlisted : chr "False" "False" "False" "False" ...
## $ num_published_lectures : int 84 78 292 338 83 275 23 275 144 413 ...
## $ num_published_practice_tests: int 0 0 2 0 0 0 0 0 0 0 ...
## $ created : chr "2016-02-14T22:57:48Z" "2016-08-22T12:10:18Z" "2017-09-26T16:32:48Z" "2015-10-23T13:34:35Z" ...
## $ published_time : chr "2016-04-06T05:16:11Z" "2016-08-23T16:59:49Z" "2017-11-14T23:58:14Z" "2016-01-21T01:38:48Z" ...
## $ discount_price__amount : num 455 455 455 455 455 455 455 455 455 455 ...
## $ discount_price__currency : chr "INR" "INR" "INR" "INR" ...
## $ discount_price__price_string: chr "₹455" "₹455" "₹455" "₹455" ...
## $ price_detail__amount : num 8640 8640 8640 8640 8640 8640 8640 8640 8640 8640 ...
## $ price_detail__currency : chr "INR" "INR" "INR" "INR" ...
## $ price_detail__price_string : chr "₹8,640" "₹8,640" "₹8,640" "₹8,640" ...
Delete the unused columns and convert the data types
course[,c("is_paid", "is_wishlisted")] <- lapply(course[,c("is_paid", "is_wishlisted")], as.factor)
course[,c("created", "published_time")] <- lapply(course[,c("created", "published_time")], ymd_hms)
course <- course[,-c(3,16,17,19,20)]
course$is_paid <- sapply(as.character(course$is_paid), switch,
"True" = "Paid",
"False" = "Free")
anyNA(course)
## [1] TRUE
We found NA values. Where are they?
colSums(is.na(course))
##                           id                        title
##                            0                            0
##                      is_paid              num_subscribers
##                            0                            0
##                   avg_rating            avg_rating_recent
##                            0                            0
##                       rating                  num_reviews
##                            0                            0
##                is_wishlisted       num_published_lectures
##                            0                            0
## num_published_practice_tests                      created
##                            0                            0
##               published_time       discount_price__amount
##                            0                         1403
##         price_detail__amount
##                          497
The NA values are in the discount price and price columns. This is because free courses have no price and some courses have no discount, so those values are NA. We will keep them.
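Keeping the NA values has two side effects in R worth knowing about: arithmetic on a missing price gives NA, and a logical row filter on that column returns extra all-NA rows unless it is wrapped in which(). The following is a minimal sketch on a made-up three-row data frame; the `toy` values are illustrative assumptions, not rows from the data set.

```r
# Hypothetical toy data: one fully priced course, one free course (no price),
# and one course without a discount
toy <- data.frame(
  price    = c(8640, NA, 1280),
  discount = c(455, NA, NA),
  subs     = c(100, 50, 10)
)

# Arithmetic with NA yields NA, so any derived column is NA for these rows
toy$income <- toy$subs * (toy$price - toy$discount)
income_missing <- is.na(toy$income)

# A logical filter keeps an all-NA row wherever the condition is NA
rows_plain <- nrow(toy[toy$price >= 8500, ])

# which() drops the NA conditions and keeps only the TRUE ones
rows_which <- nrow(toy[which(toy$price >= 8500), ])
```

Functions such as mean() and median() need na.rm = TRUE on these columns for the same reason.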
Add columns for the publish hour, publish year, a categorical publish time, and total income.
course$publish_hour <- hour(course$published_time)
course$publish_year <- year(course$published_time)
# Map an hour of the day (0-23) to one of three 8-hour windows
pw <- function(x){
  if(x < 8){
    "12am to 8am"
  }else if(x < 16){
    "8am to 4pm"
  }else{
    "4pm to 12am"
  }
}
course$publish_time <- sapply(course$publish_hour, pw)
course$tot_income <- course$num_subscribers * (course$price_detail__amount-course$discount_price__amount)
str(course)
## 'data.frame': 13608 obs. of 19 variables:
## $ id : int 762616 937678 1361790 648826 637930 1208634 864146 321410 673654 1653432 ...
## $ title : chr "The Complete SQL Bootcamp 2020: Go from Zero to Hero" "Tableau 2020 A-Z: Hands-On Tableau Training for Data Science" "PMP Exam Prep Seminar - PMBOK Guide 6" "The Complete Financial Analyst Course 2020" ...
## $ is_paid : chr "Paid" "Paid" "Paid" "Paid" ...
## $ num_subscribers : int 295509 209070 155282 245860 374836 124180 96207 127680 112572 115269 ...
## $ avg_rating : num 4.66 4.59 4.59 4.54 4.47 ...
## $ avg_rating_recent : num 4.68 4.6 4.59 4.54 4.47 ...
## $ rating : num 4.68 4.6 4.59 4.54 4.47 ...
## $ num_reviews : int 78006 54581 52653 46447 41630 38093 30470 28665 27408 23906 ...
## $ is_wishlisted : Factor w/ 1 level "False": 1 1 1 1 1 1 1 1 1 1 ...
## $ num_published_lectures : int 84 78 292 338 83 275 23 275 144 413 ...
## $ num_published_practice_tests: int 0 0 2 0 0 0 0 0 0 0 ...
## $ created : POSIXct, format: "2016-02-14 22:57:48" "2016-08-22 12:10:18" ...
## $ published_time : POSIXct, format: "2016-04-06 05:16:11" "2016-08-23 16:59:49" ...
## $ discount_price__amount : num 455 455 455 455 455 455 455 455 455 455 ...
## $ price_detail__amount : num 8640 8640 8640 8640 8640 8640 8640 8640 8640 8640 ...
## $ publish_hour : int 5 16 23 1 21 18 17 23 17 18 ...
## $ publish_year : num 2016 2016 2017 2016 2016 ...
## $ publish_time : chr "12am to 8am" "4pm to 12am" "4pm to 12am" "12am to 8am" ...
## $ tot_income : num 2.42e+09 1.71e+09 1.27e+09 2.01e+09 3.07e+09 ...
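The hour binning done with pw and sapply above can also be written as a single vectorized call to base R's cut(). This is an alternative sketch, not the approach used in this analysis; the `publish_hour` vector below is a small made-up sample.

```r
# Hypothetical sample of publish hours (0-23)
publish_hour <- c(5, 16, 23, 1, 21, 8, 0)

# right = FALSE makes the intervals [0,8), [8,16), [16,24),
# which matches the boundaries used in pw
publish_time <- cut(
  publish_hour,
  breaks = c(0, 8, 16, 24),
  right  = FALSE,
  labels = c("12am to 8am", "8am to 4pm", "4pm to 12am")
)

# cut() returns a factor; convert to character to match the sapply version
publish_time_chr <- as.character(publish_time)
```

A factor produced this way also keeps the three windows in chronological order when plotted.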
Create a new data frame by aggregating the average rating of free and paid courses
rat_cou <- aggregate(avg_rating ~ is_paid, course, FUN = mean)
Create a new data frame by counting free and paid courses
num_cou <- data.frame(table(course$is_paid))
Create a new data frame by aggregating the number of subscribers at each course price
tren_byp <- aggregate(num_subscribers ~ price_detail__amount, course, FUN = sum)
Aggregate the number of subscribers by release year and plot the trend
year <- aggregate(num_subscribers ~ publish_year, course, sum)
ggplot(year, aes(x= publish_year, y=num_subscribers)) +
geom_line(col = "#173F5f") +
geom_point(col = "#3caea3") +
theme_clean()+
scale_x_continuous(breaks = seq(2011, 2020, 1)) +
scale_y_continuous(labels = scales::comma) +
labs(title = "Trend by Course Release Year",
subtitle = "by Number of Subscriptions",
x = "Release Year",
y = "Number of Subscribers")
Udemy users in India mostly subscribe to courses released in 2016-2018; the courses from each of those release years total more than 6 million subscribers. Courses released before 2014 have fewer subscribers than those from later years. A likely reason is that course material goes out of date: courses released after 2014, especially in 2016-2018, have more up-to-date material that is easier for users to follow.
ggplot(num_cou, aes(x = Var1, y = Freq)) +
geom_col(aes(fill =Freq)) +
scale_y_continuous(labels = scales::comma) +
theme_clean() +
labs(title = "Free vs Paid",
subtitle = "Number of Courses",
x = "Course Type",
y = "Total Number of Courses") +
scale_fill_gradient(high = "#173F5f",
low = "#3caea3") +
theme(legend.position = "none")
From this chart we can see that there are more paid courses than free courses. But how do their average ratings compare?
ggplot(course, aes(avg_rating, num_subscribers)) +
geom_jitter(col="#3caea3", alpha = 0.5)+
scale_y_continuous(labels = scales::comma) +
facet_wrap(~is_paid, scales = "free")+
labs(title = "Number of Subscribers vs Rating", x="Rating", y= "Number of Subscribers" )+
theme_pander() +
theme(plot.title = element_text(hjust = 0.5))
We can see that most courses, free and paid alike, are rated between 4 and 5, and that this cluster is more pronounced among paid courses. Udemy users are willing to spend money on a paid course even though free courses are available. So launching a paid course is reasonable, because users will subscribe to a course that has suitable materials and is easy to learn. If we want to launch a course, what will influence our number of subscribers?
Does price influence the number of subscribers?
g <- ggplot(course,aes(x=price_detail__amount, y = num_subscribers)) +
geom_jitter(aes(col = avg_rating)) +
labs(title = "Price vs Number of Subscribers",
x = "Course Price (rupee)",
y = "Number of Subscribers",
col = "Rating") +
theme_pander() +
scale_y_continuous(labels = scales::comma) +
scale_x_continuous(breaks = seq(from = 0, to = 15000, by = 2000))
g
To check what the graph suggests, we use a correlation test.
cor.test(course$num_subscribers, course$price_detail__amount)
##
## Pearson's product-moment correlation
##
## data: course$num_subscribers and course$price_detail__amount
## t = 17.077, df = 13109, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.1307354 0.1642253
## sample estimates:
## cor
## 0.1475226
Most courses are priced in the ₹ 1,000-2,000 and ₹ 8,000-9,000 ranges. The correlation between price and number of subscribers is statistically significant but weak (r ≈ 0.15), so next we look at the number of subscribers at each price.
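The numbers in the test output can be cross-checked by hand, which also shows how weak the relationship is. The sketch below reuses only figures printed earlier (13608 rows, 497 missing prices, r = 0.1475226); the variable names are illustrative.

```r
r       <- 0.1475226   # sample correlation from the cor.test output
n_total <- 13608       # rows in the data set
n_na    <- 497         # rows with a missing price_detail__amount

# cor.test uses complete pairs only, and df = n - 2
df <- n_total - n_na - 2

# t statistic for a Pearson correlation
t_stat <- r * sqrt(df) / sqrt(1 - r^2)

# Share of variance in subscribers that a linear relation with price explains
r_squared <- r^2
```

Here df comes out as 13109 and t_stat as about 17.08, matching the output, while r_squared is about 0.022. The tiny p-value reflects the large sample size rather than a strong effect: price alone accounts for roughly 2% of the variation in subscribers.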
Between the ₹ 1,000-2,000 and ₹ 8,000-9,000 price ranges, which has the higher number of subscribers?
ggplot(tren_byp, aes(x= price_detail__amount, y= num_subscribers)) +
geom_col() +
scale_y_continuous(labels = scales::comma) +
scale_x_continuous(breaks = seq(from = 0, to = 15000, by = 1000)) +
theme_pander() +
labs(title = "Number of Subscribers",
subtitle = "by Price",
x = "Price (rupee)",
y = "Number of Subscribers")
The most popular courses are in the ₹ 8,500-9,000 price range. Next, we will find out their characteristics.
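The chart above sums subscribers at each exact price, while the discussion is about price ranges. One way to total subscribers per range directly is to bin the price with cut() before aggregating. This is a sketch on made-up rows, with an assumed bin width of ₹ 1,000; `toy` and `price_range` are hypothetical names.

```r
# Hypothetical toy rows: price in rupees and subscribers per course
toy <- data.frame(
  price           = c(1280, 1600, 8640, 8640, 3200),
  num_subscribers = c(100, 50, 300, 200, 20)
)

# Bin prices into ranges (1000,2000], (2000,3000], ...
# dig.lab = 5 keeps the range labels out of scientific notation
toy$price_range <- cut(toy$price, breaks = seq(0, 10000, by = 1000), dig.lab = 5)

# Sum subscribers within each price range that actually occurs
by_range <- aggregate(num_subscribers ~ price_range, toy, sum)
```

In the real data, rows with a missing price would fall into no bin, and the formula interface of aggregate() drops them by default.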
Density plot
# which() drops rows where the price is NA instead of returning all-NA rows
course_pop <- course[which(course$price_detail__amount >= 8500 & course$price_detail__amount <= 9000),]
ggplot(course_pop,aes(num_published_lectures)) +
geom_density(fill = "#3caea3", alpha = 0.6) +
labs(title = "Density of Number of Materials",
x = "Number of Courses Materials",
# y = "Number of Subscribers",
col = "Rating") +
theme_pander() +
scale_x_continuous(breaks = seq(from = 0, to = 700, by = 30))
Lollipop plot
time <- data.frame(table(course_pop$publish_time))
time <- time[order(time$Freq, decreasing = T),]
ggplot(time, aes(x=reorder(Var1, Freq), y=Freq)) +
geom_segment(aes(x=reorder(Var1, Freq), xend = reorder(Var1, Freq), y = 0, yend=Freq), color="skyblue") +
geom_point(color= "#173F5f", size=4, alpha=0.8) +
theme_economist_white()+
labs(
title = "Publish Hour",
x = "",
y= "Number of Courses"
)
From the density plot and the lollipop plot, we can determine two criteria from the ₹ 8,500-9,000 group:
These popular courses mostly have 20-120 materials each.
Most of them were released between 4 pm and 12 am. That window matters because many users have finished their daily activities, such as work or school, by then, so a release in that window can reach both students and workers.
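One caveat on the release window: the timestamps in published_time end in "Z", so the hour extracted from them is in UTC. If the target audience is in India (an assumption here), the local hour is 5 hours 30 minutes later. The sketch below uses base R on the first timestamp shown in the str output; with lubridate, the equivalent would be hour(with_tz(course$published_time, "Asia/Kolkata")).

```r
# First published_time value from the str output, parsed as UTC
published_utc <- as.POSIXct("2016-04-06T05:16:11Z",
                            format = "%Y-%m-%dT%H:%M:%SZ", tz = "UTC")

# Hour of day in UTC (what the analysis above uses)
hour_utc <- as.integer(format(published_utc, "%H", tz = "UTC"))

# Hour of day in India Standard Time (UTC+05:30)
hour_ist <- as.integer(format(published_utc, "%H", tz = "Asia/Kolkata"))
```

Here 05:16 UTC becomes 10:46 in India, which moves this course from the "12am to 8am" window to "8am to 4pm". The window counts may therefore shift if local time is what matters.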
course_year <- course[course$publish_year >= 2016 & course$publish_year <= 2018,]
course_year <- course_year[order(course_year$num_subscribers, decreasing = T),]
ggplot(course_year[1:10,], aes(x=num_subscribers, y= reorder(title, num_subscribers))) +
geom_col(aes(fill = num_subscribers))+
scale_fill_gradient(high = "#173F5f",
low = "#3caea3")+
theme_pander()+
scale_x_continuous(labels = scales::comma) +
theme(legend.position = "none") +
labs(title = "Top 10 Courses",
subtitle = "Published in 2016-2018",
x= "Number of Subscribers",
y = "Course Name")
course_inc <- course
# Convert total income to billions of rupees
course_inc$tot_income <- course_inc$tot_income/1000000000
course_inc <- course_inc[order(course_inc$tot_income, decreasing = T),]
ggplot(course_inc[1:10,], aes(x=tot_income, y= reorder(title, tot_income))) +
geom_col(aes(fill = tot_income))+
scale_fill_gradient(high = "#173F5f",
low = "#3caea3")+
theme_pander()+
scale_x_continuous(labels = scales::comma) +
theme(legend.position = "none") +
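labs(title = "Top 10 Courses",
subtitle = "by Income",
x= "Income (in Billion rupee)",
y = "Course Name") +
# na.rm = TRUE is needed because tot_income is NA for courses without a price or discount
geom_vline(data = course_inc, aes(xintercept = median(tot_income, na.rm = TRUE)))
course_inc[1,]
Compare with our insights: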
An Entire MBA in 1 Course: Award Winning Business School Prof is the top course both among the 2016-2018 releases and by total income. This course was published at 9 pm in 2016, has 83 materials, and is priced at 8,640 rupees.
Instructors who want to create a course should consider:
The number of materials. Our reference is between 20 and 120 materials.
Free courses do not always have more subscribers or better ratings, so you can develop a paid course priced in the range of 8,500-9,000 rupees to earn income.
We recommend publishing between 4 pm and 12 am, because that is when you can best reach Udemy users.