Introduction

Udemy Data

Brief

Economics and finance remain popular fields among students and working professionals, and they have a significant impact on people and countries. Udemy has offered courses in this field since 2011, and this data set runs through 2020. The catalogue covers many economics topics and includes both free and paid courses. Knowledge in the field is updated continually, which gives Udemy an opportunity to publish new courses. But how can a new course reach many people and generate good income for Udemy? Let's check it out!

About Data

This data set is from Kaggle: https://www.kaggle.com/jilkothari/finance-accounting-courses-udemy-13k-course . It contains Udemy courses as listed for India, with prices in rupees. Some important columns are:

title,
num_subscribers (number of subscribers),
is_paid (free or paid course category),
avg_rating (rating),
num_published_lectures (total number of materials),
published_time (launch time),
discount_price__amount and price_detail__amount (used with the number of subscribers to estimate total income).

Prepare Tool and Data

First, load the libraries.

library(dplyr)
library(lubridate)
library(ggplot2)
library(gsubfn)
library(ggthemes)
library(ggExtra)
library(scales)

Read the data and check its structure.

course <- read.csv("udemy.csv")
str(course)

## 'data.frame':    13608 obs. of  20 variables:
##  $ id                          : int  762616 937678 1361790 648826 637930 1208634 864146 321410 673654 1653432 ...
##  $ title                       : chr  "The Complete SQL Bootcamp 2020: Go from Zero to Hero" "Tableau 2020 A-Z: Hands-On Tableau Training for Data Science" "PMP Exam Prep Seminar -  PMBOK Guide 6" "The Complete Financial Analyst Course 2020" ...
##  $ url                         : chr  "/course/the-complete-sql-bootcamp/" "/course/tableau10/" "/course/pmp-pmbok6-35-pdus/" "/course/the-complete-financial-analyst-course/" ...
##  $ is_paid                     : chr  "True" "True" "True" "True" ...
##  $ num_subscribers             : int  295509 209070 155282 245860 374836 124180 96207 127680 112572 115269 ...
##  $ avg_rating                  : num  4.66 4.59 4.59 4.54 4.47 ...
##  $ avg_rating_recent           : num  4.68 4.6 4.59 4.54 4.47 ...
##  $ rating                      : num  4.68 4.6 4.59 4.54 4.47 ...
##  $ num_reviews                 : int  78006 54581 52653 46447 41630 38093 30470 28665 27408 23906 ...
##  $ is_wishlisted               : chr  "False" "False" "False" "False" ...
##  $ num_published_lectures      : int  84 78 292 338 83 275 23 275 144 413 ...
##  $ num_published_practice_tests: int  0 0 2 0 0 0 0 0 0 0 ...
##  $ created                     : chr  "2016-02-14T22:57:48Z" "2016-08-22T12:10:18Z" "2017-09-26T16:32:48Z" "2015-10-23T13:34:35Z" ...
##  $ published_time              : chr  "2016-04-06T05:16:11Z" "2016-08-23T16:59:49Z" "2017-11-14T23:58:14Z" "2016-01-21T01:38:48Z" ...
##  $ discount_price__amount      : num  455 455 455 455 455 455 455 455 455 455 ...
##  $ discount_price__currency    : chr  "INR" "INR" "INR" "INR" ...
##  $ discount_price__price_string: chr  "₹455" "₹455" "₹455" "₹455" ...
##  $ price_detail__amount        : num  8640 8640 8640 8640 8640 8640 8640 8640 8640 8640 ...
##  $ price_detail__currency      : chr  "INR" "INR" "INR" "INR" ...
##  $ price_detail__price_string  : chr  "₹8,640" "₹8,640" "₹8,640" "₹8,640" ...

Preprocessing Data

Delete unused columns and change data types.

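# Convert the True/False flags to factors
course[,c("is_paid", "is_wishlisted")] <- lapply(course[,c("is_paid", "is_wishlisted")], as.factor)
# Parse the ISO 8601 timestamps into date-times
course[,c("created", "published_time")] <- lapply(course[,c("created", "published_time")], ymd_hms)
# Drop url (3) and the currency and price-string columns (16, 17, 19, 20)
course <- course[,-c(3,16,17,19,20)]
# Relabel is_paid as "Paid" / "Free"
course$is_paid <- sapply(as.character(course$is_paid), switch,
       "True" = "Paid",
       "False" = "Free")
# Check whether the data contain any NA values
anyNA(course)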

## [1] TRUE

We found NA values. Where are they?

colSums(is.na(course))

##                           id                        title 
##                            0                            0 
##                      is_paid              num_subscribers 
##                            0                            0 
##                   avg_rating            avg_rating_recent 
##                            0                            0 
##                       rating                  num_reviews 
##                            0                            0 
##                is_wishlisted       num_published_lectures 
##                            0                            0 
## num_published_practice_tests                      created 
##                            0                            0 
##               published_time       discount_price__amount 
##                            0                         1403 
##         price_detail__amount 
##                          497

The NA values are in the discount price and price columns. Free courses have no price, and some courses have no discount, so those values are NA. We will keep them.
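
One way to check this explanation is to cross-tabulate the course type against missing prices. The following is a minimal sketch on a made-up toy data frame (the `toy` values are illustrative, not taken from the data set); the same calls should work on `course` after preprocessing.

```r
# Toy stand-in for the `course` data frame (values are made up for illustration)
toy <- data.frame(
  is_paid = c("Paid", "Paid", "Free", "Free", "Paid"),
  price_detail__amount = c(8640, 1280, NA, NA, 3200),
  discount_price__amount = c(455, NA, NA, NA, 455)
)

# Cross-tabulate course type against a missing price
na_by_type <- table(type = toy$is_paid,
                    price_missing = is.na(toy$price_detail__amount))
print(na_by_type)

# Count paid courses that have a price but no discount
paid_no_discount <- sum(toy$is_paid == "Paid" &
                        !is.na(toy$price_detail__amount) &
                        is.na(toy$discount_price__amount))
print(paid_no_discount)
```

If the explanation holds on the real data, the missing prices should fall almost entirely in the "Free" row.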

Add columns for the publish hour, publish year, time-of-day category, and total income.

course$publish_hour <- hour(course$published_time)
course$publish_year <- year(course$published_time)
# Map an hour (0-23) to one of three eight-hour windows
pw <- function(x){
    if(x < 8){
      "12am to 8am"
    }else if(x < 16){
      "8am to 4pm"
    }else{
      "4pm to 12am"
    }
}
course$publish_time <- sapply(course$publish_hour, pw)
# Income estimate used in this analysis: subscribers x (list price - discount price)
course$tot_income <- course$num_subscribers * (course$price_detail__amount-course$discount_price__amount)
str(course)

## 'data.frame':    13608 obs. of  19 variables:
##  $ id                          : int  762616 937678 1361790 648826 637930 1208634 864146 321410 673654 1653432 ...
##  $ title                       : chr  "The Complete SQL Bootcamp 2020: Go from Zero to Hero" "Tableau 2020 A-Z: Hands-On Tableau Training for Data Science" "PMP Exam Prep Seminar -  PMBOK Guide 6" "The Complete Financial Analyst Course 2020" ...
##  $ is_paid                     : chr  "Paid" "Paid" "Paid" "Paid" ...
##  $ num_subscribers             : int  295509 209070 155282 245860 374836 124180 96207 127680 112572 115269 ...
##  $ avg_rating                  : num  4.66 4.59 4.59 4.54 4.47 ...
##  $ avg_rating_recent           : num  4.68 4.6 4.59 4.54 4.47 ...
##  $ rating                      : num  4.68 4.6 4.59 4.54 4.47 ...
##  $ num_reviews                 : int  78006 54581 52653 46447 41630 38093 30470 28665 27408 23906 ...
##  $ is_wishlisted               : Factor w/ 1 level "False": 1 1 1 1 1 1 1 1 1 1 ...
##  $ num_published_lectures      : int  84 78 292 338 83 275 23 275 144 413 ...
##  $ num_published_practice_tests: int  0 0 2 0 0 0 0 0 0 0 ...
##  $ created                     : POSIXct, format: "2016-02-14 22:57:48" "2016-08-22 12:10:18" ...
##  $ published_time              : POSIXct, format: "2016-04-06 05:16:11" "2016-08-23 16:59:49" ...
##  $ discount_price__amount      : num  455 455 455 455 455 455 455 455 455 455 ...
##  $ price_detail__amount        : num  8640 8640 8640 8640 8640 8640 8640 8640 8640 8640 ...
##  $ publish_hour                : int  5 16 23 1 21 18 17 23 17 18 ...
##  $ publish_year                : num  2016 2016 2017 2016 2016 ...
##  $ publish_time                : chr  "12am to 8am" "4pm to 12am" "4pm to 12am" "12am to 8am" ...
##  $ tot_income                  : num  2.42e+09 1.71e+09 1.27e+09 2.01e+09 3.07e+09 ...
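
The hour bucketing can also be written without a per-row function call. The sketch below uses base R `cut()` as an alternative, not the method used in this report; the `hours` vector holds the first five `publish_hour` values from the output above.

```r
# First five publish_hour values from the str() output
hours <- c(5, 16, 23, 1, 21)

# Left-closed intervals [0,8), [8,16), [16,24) reproduce the three windows
bucket <- cut(hours,
              breaks = c(0, 8, 16, 24),
              right = FALSE,
              labels = c("12am to 8am", "8am to 4pm", "4pm to 12am"))

# cut() returns a factor; convert to character to match the original column type
publish_time <- as.character(bucket)
print(publish_time)
```

Because `cut()` is vectorised, it handles the whole column in one call, and keeping the factor instead of converting it would preserve the order of the windows in plots.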

Create a new data frame by aggregating the average rating for free and paid courses.

rat_cou <- aggregate(avg_rating  ~ is_paid, course,FUN = mean)

Create a new data frame by counting free and paid courses.

num_cou <- data.frame(table(course$is_paid))

Create a new data frame by summing the number of subscribers for each course price.

tren_byp <- aggregate(num_subscribers ~ price_detail__amount, course, FUN= sum)
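
The formula interface of `aggregate()` drops rows with NA in any variable it uses (its default `na.action` is `na.omit`), so courses without a price do not appear in the result. A minimal sketch on made-up toy values:

```r
# Toy data: the last row has no price (values are made up for illustration)
toy <- data.frame(
  price_detail__amount = c(1280, 1280, 8640, NA),
  num_subscribers = c(100, 50, 300, 999)
)

# Sum subscribers per price; the NA-price row is omitted by default
by_price <- aggregate(num_subscribers ~ price_detail__amount, toy, FUN = sum)
print(by_price)
```

Here the 999 subscribers of the unpriced row are not counted under any price.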

Plotting

Course Trend

year <- aggregate(num_subscribers ~ publish_year, course, sum)

ggplot(year, aes(x= publish_year, y=num_subscribers)) +
  geom_line(col = "#173F5f") +
  geom_point(col = "#3caea3") +
  theme_clean()+
  scale_x_continuous(breaks = seq(2011, 2020, 1)) +
  scale_y_continuous(labels = scales::comma) +
  labs(title = "Course Trend by Release Year",
       subtitle = "by Number of Subscriptions",
       x = "Release Year",
       y = "Number of Subscribers")

Udemy users in India mostly subscribe to courses released in 2016-2018; each of those release years totals more than 6 million subscribers. Courses released before 2014 have fewer subscribers than those from other years. A likely reason is that course material is continually updated, so courses released after 2014, especially in 2016-2018, offer more up-to-date material that is easier for users to learn.

Number of Courses

ggplot(num_cou, aes(x = Var1, y = Freq)) +
  geom_col(aes(fill =Freq)) +
  scale_y_continuous(labels = scales::comma) +
  theme_clean() +
  labs(title = "Free vs Paid",
       subtitle = "Number of Courses",
       x = "Course Type",
       y = "Total Number of Courses") +
  scale_fill_gradient(high = "#173F5f",
                      low = "#3caea3") +
  theme(legend.position = "none")

This chart shows that there are more paid courses than free courses. But how do their average ratings compare?

ggplot(course, aes(avg_rating, num_subscribers)) +
  geom_jitter(col="#3caea3", alpha = 0.5) +
  scale_y_continuous(labels = scales::comma) +
  facet_wrap(~is_paid, scales = "free") +
  labs(title = "Number of Subscribers vs Rating", x = "Rating", y = "Number of Subscribers") +
  theme_pander() +
  theme(plot.title = element_text(hjust = 0.5))

Most courses have a rating between 4 and 5, and this is more pronounced among paid courses than among free courses. Udemy users are willing to spend money on a paid course even though free courses are available. So it is reasonable to launch paid courses, because users will subscribe to a course that has suitable material and is easy to learn. If we want to launch a course, what will influence our number of subscribers?

Does price influence the number of subscribers?

g <- ggplot(course,aes(x=price_detail__amount, y = num_subscribers)) +
  geom_jitter(aes(col = avg_rating)) +
  labs(title = "Price vs Number of Subscribers",
       x = "Course Price (rupee)",
       y = "Number of Subscribers",
       col = "Rating") +
  theme_pander() +
  scale_y_continuous(labels = scales::comma) +
  scale_x_continuous(breaks = seq(from = 0, to = 15000, by = 2000))
g

To confirm what the graph suggests, we use a correlation test.

cor.test(course$num_subscribers, course$price_detail__amount)

## 
##  Pearson's product-moment correlation
## 
## data:  course$num_subscribers and course$price_detail__amount
## t = 17.077, df = 13109, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1307354 0.1642253
## sample estimates:
##       cor 
## 0.1475226

The most popular course prices fall in the ranges ₹1,000-2,000 and ₹8,000-9,000. The test shows a statistically significant but weak positive correlation (about 0.15) between price and number of subscribers, so we next look at the number of subscribers at each price.
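
To gauge how strong this relationship is, the reported coefficient can be squared to give the share of variance explained, and the t statistic can be recomputed from the coefficient and the degrees of freedom. A short sketch using only the numbers printed in the test output (the variable names are illustrative):

```r
# Values copied from the cor.test() output
r  <- 0.1475226   # Pearson correlation
df <- 13109       # degrees of freedom (n - 2)

# Share of variance in one variable explained linearly by the other
r_squared <- r^2

# t statistic for testing that the true correlation is zero
t_stat <- r * sqrt(df) / sqrt(1 - r^2)

cat("r squared:", round(r_squared, 4), "\n")
cat("t statistic:", round(t_stat, 3), "\n")
```

The squared coefficient is about 0.022, so price accounts for roughly 2% of the variation in subscriber counts; the very small p-value mainly reflects the large sample.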

Criteria Based on Popular Price

Between the course price ranges ₹1,000-2,000 and ₹8,000-9,000, which has the higher number of subscribers?

ggplot(tren_byp, aes(x= price_detail__amount, y= num_subscribers)) +
  geom_col() +
  scale_y_continuous(labels = scales::comma) +
  scale_x_continuous(breaks = seq(from = 0, to = 15000, by = 1000)) +
  theme_pander() +
  labs(title = "Number of Subscribers",
       subtitle = "by Price",
       x = "Price (rupee)",
       y = "Number of Subscribers")

The most popular courses are in the price range ₹8,500-9,000. We will look at their characteristics.

Density plot

# which() drops rows where the price is NA instead of returning all-NA rows
course_pop <- course[which(course$price_detail__amount >= 8500 & course$price_detail__amount <= 9000),]
ggplot(course_pop,aes(num_published_lectures)) +
  geom_density(fill = "#3caea3", alpha = 0.6) +
  labs(title = "Density of Number of Materials",
       x = "Number of Course Materials") +
  theme_pander() +
  scale_x_continuous(breaks = seq(from = 0, to = 700, by = 30))
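
When filtering on a column that contains NA, as `price_detail__amount` does, plain logical subsetting and `which()` behave differently in base R. A minimal sketch on made-up toy values:

```r
# Toy data: the second row has no price (values are made up for illustration)
toy <- data.frame(price = c(8640, NA, 1280, 9000),
                  lectures = c(84, 10, 23, 120))

in_range <- toy$price >= 8500 & toy$price <= 9000   # TRUE, NA, FALSE, TRUE

# Logical subsetting keeps an all-NA row for every NA in the condition
raw <- toy[in_range, ]

# which() returns only the positions that are TRUE, so NA rows are dropped
safe <- toy[which(in_range), ]

print(nrow(raw))
print(nrow(safe))
```

Plots and `table()` usually discard the extra NA rows, but row counts and means computed on the subset would be affected.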

Lollipop plot

time <- data.frame(table(course_pop$publish_time))

time <- time[order(time$Freq, decreasing = TRUE),]
ggplot(time, aes(x=reorder(Var1, Freq), y=Freq)) +
  geom_segment(aes(x=reorder(Var1, Freq), xend = reorder(Var1, Freq), y = 0, yend=Freq), color="skyblue") +
  geom_point(color= "#173F5f", size=4, alpha=0.8) +
  theme_economist_white()+
  labs(
    title = "Publish Hour",
    x = "",
    y= "Number of Courses"
  )

From the density plot and the lollipop plot, we can determine two criteria for the ₹8,500-9,000 group:

These popular courses mostly have 20-120 materials each.
Most of them were released between 4 pm and 12 am. This window matters because many users have finished their daily activities, such as work or school, so a course published then can reach both students and workers.
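
The `published_time` strings end in `Z`, which denotes UTC, so `publish_hour` is a UTC hour. If one wanted to bucket by local time in India instead, the conversion could be sketched as below. This assumes the audience is in the `Asia/Kolkata` time zone, which the data set does not state, and it uses base R rather than the lubridate calls from this report.

```r
# First two published_time values from the str() output
utc_strings <- c("2016-04-06T05:16:11Z", "2016-08-23T16:59:49Z")

# Parse as UTC date-times
published_utc <- as.POSIXct(utc_strings,
                            format = "%Y-%m-%dT%H:%M:%SZ",
                            tz = "UTC")

# Hour of day in UTC and in India Standard Time (UTC+5:30)
hour_utc <- as.integer(format(published_utc, "%H", tz = "UTC"))
hour_ist <- as.integer(format(published_utc, "%H", tz = "Asia/Kolkata"))

print(data.frame(utc_strings, hour_utc, hour_ist))
```

Under that assumption, a course stamped 05:16 UTC went live at 10:46 local time, which moves it from the first window to the second.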

Popular Course by Publish Year

course_year <- course[course$publish_year >= 2016 & course$publish_year <= 2018,]
course_year <- course_year[order(course_year$num_subscribers, decreasing = TRUE),]
ggplot(course_year[1:10,], aes(x=num_subscribers, y= reorder(title, num_subscribers))) +
  geom_col(aes(fill = num_subscribers))+
  scale_fill_gradient(high = "#173F5f",
                      low = "#3caea3")+
  theme_pander()+
  scale_x_continuous(labels = scales::comma) +
  theme(legend.position = "none") +
  labs(title = "Top 10 Courses",
       subtitle = "Published in 2016-2018",
       x= "Number of Subscribers",
       y = "Course Name")

Courses With the Highest Income

course_inc <- course
# Express income in billions of rupees for readable axis labels
course_inc$tot_income <- course_inc$tot_income / 1e9

# Sort by income, highest first (order() places NA values last)
course_inc <- course_inc[order(course_inc$tot_income, decreasing = TRUE),]
ggplot(course_inc[1:10,], aes(x=tot_income, y= reorder(title, tot_income))) +
  geom_col(aes(fill = tot_income))+
  scale_fill_gradient(high = "#173F5f",
                      low = "#3caea3")+
  theme_pander()+
  scale_x_continuous(labels = scales::comma) +
  theme(legend.position = "none") +
  labs(title = "Top 10 Courses",
       subtitle = "by Income",
       x= "Income (in billion rupees)",
       y = "Course Name") +
  # Mark the median income of all courses; na.rm skips courses without a price or discount
  geom_vline(xintercept = median(course_inc$tot_income, na.rm = TRUE))

course_inc[1,]

Comparison with our insights:

An Entire MBA in 1 Course: Award Winning Business School Prof is the leading course both by publish year and by total income. This course was published at 9 pm in 2016, has 83 materials, and is priced at ₹8,640.
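
The income figure behind this ranking can be reproduced by hand from the formula defined during preprocessing. The sketch below assumes that the fifth row of the earlier `str()` output is this course, since its publish hour (21), year (2016), lecture count (83) and price (8640) match the description; the variable names are illustrative.

```r
# Values read from the fifth row of the str() output (assumed to be this course)
subscribers    <- 374836
list_price     <- 8640   # price_detail__amount
discount_price <- 455    # discount_price__amount

# Same formula as tot_income: subscribers x (list price - discount price)
income <- subscribers * (list_price - discount_price)
income_billion <- income / 1e9

cat(format(income, big.mark = ",", scientific = FALSE), "rupees, about",
    round(income_billion, 2), "billion\n")
```

The result, about 3.07 billion rupees, agrees with the fifth `tot_income` value (3.07e+09) shown in the `str()` output.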

Conclusion

Lecturers who want to create a course should consider:

The number of materials: our reference is between 20 and 120 materials.
Free courses do not always have more subscribers or better ratings, so you can develop a paid course priced in the range ₹8,500-9,000 for your income.
We recommend publishing between 4 pm and 12 am, because at that time you can best reach Udemy users.

Udemy Course

Dionisius Widjayanto

8/12/2021