Introduction

Udemy Data

Brief

Economics remains a popular field among students and working professionals, and it has a significant impact on both people and countries. Udemy has offered courses in this field since 2011 and still does today (2020), with many free and paid courses on economics-related topics. Knowledge is continually updated, which gives Udemy an ongoing opportunity to release new courses. But how can a new course reach many people and generate good income for Udemy? Let's check it out!

About Data

This data set is from Kaggle: https://www.kaggle.com/jilkothari/finance-accounting-courses-udemy-13k-course . It contains Udemy courses as listed for India (prices are in rupees). Some important columns are:

  1. title,

  2. num_subscribers (number of subscribers),

  3. is_paid (whether the course is free or paid),

  4. avg_rating (ratings),

  5. num_published_lectures (total number of materials),

  6. published_time (launch time),

  7. discount_price__amount and price_detail__amount (used with the number of subscribers to get total income).

Prepare Tools and Data

First, load the libraries.

library(dplyr)
library(lubridate)
library(ggplot2)
library(gsubfn)
library(ggthemes)
library(ggExtra)
library(scales)

Read the data and check its structure.

course <- read.csv("udemy.csv")
str(course)
## 'data.frame':    13608 obs. of  20 variables:
##  $ id                          : int  762616 937678 1361790 648826 637930 1208634 864146 321410 673654 1653432 ...
##  $ title                       : chr  "The Complete SQL Bootcamp 2020: Go from Zero to Hero" "Tableau 2020 A-Z: Hands-On Tableau Training for Data Science" "PMP Exam Prep Seminar -  PMBOK Guide 6" "The Complete Financial Analyst Course 2020" ...
##  $ url                         : chr  "/course/the-complete-sql-bootcamp/" "/course/tableau10/" "/course/pmp-pmbok6-35-pdus/" "/course/the-complete-financial-analyst-course/" ...
##  $ is_paid                     : chr  "True" "True" "True" "True" ...
##  $ num_subscribers             : int  295509 209070 155282 245860 374836 124180 96207 127680 112572 115269 ...
##  $ avg_rating                  : num  4.66 4.59 4.59 4.54 4.47 ...
##  $ avg_rating_recent           : num  4.68 4.6 4.59 4.54 4.47 ...
##  $ rating                      : num  4.68 4.6 4.59 4.54 4.47 ...
##  $ num_reviews                 : int  78006 54581 52653 46447 41630 38093 30470 28665 27408 23906 ...
##  $ is_wishlisted               : chr  "False" "False" "False" "False" ...
##  $ num_published_lectures      : int  84 78 292 338 83 275 23 275 144 413 ...
##  $ num_published_practice_tests: int  0 0 2 0 0 0 0 0 0 0 ...
##  $ created                     : chr  "2016-02-14T22:57:48Z" "2016-08-22T12:10:18Z" "2017-09-26T16:32:48Z" "2015-10-23T13:34:35Z" ...
##  $ published_time              : chr  "2016-04-06T05:16:11Z" "2016-08-23T16:59:49Z" "2017-11-14T23:58:14Z" "2016-01-21T01:38:48Z" ...
##  $ discount_price__amount      : num  455 455 455 455 455 455 455 455 455 455 ...
##  $ discount_price__currency    : chr  "INR" "INR" "INR" "INR" ...
##  $ discount_price__price_string: chr  "₹455" "₹455" "₹455" "₹455" ...
##  $ price_detail__amount        : num  8640 8640 8640 8640 8640 8640 8640 8640 8640 8640 ...
##  $ price_detail__currency      : chr  "INR" "INR" "INR" "INR" ...
##  $ price_detail__price_string  : chr  "₹8,640" "₹8,640" "₹8,640" "₹8,640" ...

Preprocessing Data

Delete unused columns and change data types.

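# Convert the True/False flags to factors and the timestamps to date-times
course[,c("is_paid", "is_wishlisted")] <- lapply(course[,c("is_paid", "is_wishlisted")], as.factor)
course[,c("created", "published_time")] <- lapply(course[,c("created", "published_time")], ymd_hms)
# Drop url (3) and the currency and price-string columns (16, 17, 19, 20)
course <- course[,-c(3,16,17,19,20)]
# Relabel is_paid as "Paid" / "Free" (the result is a character vector)
course$is_paid <- sapply(as.character(course$is_paid), switch,
       "True" = "Paid",
       "False" = "Free")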
anyNA(course)
## [1] TRUE

We found NA values. Where are they?

colSums(is.na(course))
##                           id                        title 
##                            0                            0 
##                      is_paid              num_subscribers 
##                            0                            0 
##                   avg_rating            avg_rating_recent 
##                            0                            0 
##                       rating                  num_reviews 
##                            0                            0 
##                is_wishlisted       num_published_lectures 
##                            0                            0 
## num_published_practice_tests                      created 
##                            0                            0 
##               published_time       discount_price__amount 
##                            0                         1403 
##         price_detail__amount 
##                          497

The NA values are in the discount price and price columns. This is because free courses have no price and some courses have no discount, so the value is NA. We will keep these rows.
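One way to check this explanation is to cross-tabulate course type against missing prices. The sketch below uses a small made-up data frame (`toy` and its values are illustrative, not rows from the data set); the same `table()` call could be run on `course`.

```r
# Made-up example rows; only the column names mirror the real data
toy <- data.frame(
  is_paid = c("Paid", "Paid", "Free", "Paid", "Free"),
  price_detail__amount   = c(8640, 1280, NA, 3200, NA),
  discount_price__amount = c(455, NA, NA, 455, NA)
)

# Rows = course type, columns = whether the list price is missing
price_missing <- table(toy$is_paid, is.na(toy$price_detail__amount))
print(price_missing)

# Paid courses that have no discounted price
no_discount <- sum(toy$is_paid == "Paid" & is.na(toy$discount_price__amount))
print(no_discount)
```

If the explanation holds for the real data, the missing list prices should fall almost entirely in the "Free" row.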

Add columns for the publish hour and year, a categorical publish time, and total income.

course$publish_hour <- hour(course$published_time)
course$publish_year <- year(course$published_time)
# Map an hour (0-23) to one of three 8-hour time-of-day buckets
pw <- function(x){
    if(x < 8){
      "12am to 8am"
    }else if(x < 16){
      "8am to 4pm"
    }else{
      "4pm to 12am"
    }
}
course$publish_time <- sapply(course$publish_hour, pw)
course$tot_income <- course$num_subscribers * (course$price_detail__amount-course$discount_price__amount)
str(course)
## 'data.frame':    13608 obs. of  19 variables:
##  $ id                          : int  762616 937678 1361790 648826 637930 1208634 864146 321410 673654 1653432 ...
##  $ title                       : chr  "The Complete SQL Bootcamp 2020: Go from Zero to Hero" "Tableau 2020 A-Z: Hands-On Tableau Training for Data Science" "PMP Exam Prep Seminar -  PMBOK Guide 6" "The Complete Financial Analyst Course 2020" ...
##  $ is_paid                     : chr  "Paid" "Paid" "Paid" "Paid" ...
##  $ num_subscribers             : int  295509 209070 155282 245860 374836 124180 96207 127680 112572 115269 ...
##  $ avg_rating                  : num  4.66 4.59 4.59 4.54 4.47 ...
##  $ avg_rating_recent           : num  4.68 4.6 4.59 4.54 4.47 ...
##  $ rating                      : num  4.68 4.6 4.59 4.54 4.47 ...
##  $ num_reviews                 : int  78006 54581 52653 46447 41630 38093 30470 28665 27408 23906 ...
##  $ is_wishlisted               : Factor w/ 1 level "False": 1 1 1 1 1 1 1 1 1 1 ...
##  $ num_published_lectures      : int  84 78 292 338 83 275 23 275 144 413 ...
##  $ num_published_practice_tests: int  0 0 2 0 0 0 0 0 0 0 ...
##  $ created                     : POSIXct, format: "2016-02-14 22:57:48" "2016-08-22 12:10:18" ...
##  $ published_time              : POSIXct, format: "2016-04-06 05:16:11" "2016-08-23 16:59:49" ...
##  $ discount_price__amount      : num  455 455 455 455 455 455 455 455 455 455 ...
##  $ price_detail__amount        : num  8640 8640 8640 8640 8640 8640 8640 8640 8640 8640 ...
##  $ publish_hour                : int  5 16 23 1 21 18 17 23 17 18 ...
##  $ publish_year                : num  2016 2016 2017 2016 2016 ...
##  $ publish_time                : chr  "12am to 8am" "4pm to 12am" "4pm to 12am" "12am to 8am" ...
##  $ tot_income                  : num  2.42e+09 1.71e+09 1.27e+09 2.01e+09 3.07e+09 ...
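
The derived columns above can be reproduced on a few values to see what they contain. The sketch below uses base R only and the first three rows shown in the `str()` output; `stamp`, `hours`, `bucket`, and `income` are illustrative names. `cut()` is used here as a vectorized alternative to calling `pw()` through `sapply()`. The timestamps end in `Z`, so the extracted hours are in UTC.

```r
# First three rows as printed by str(course)
published   <- c("2016-04-06T05:16:11Z", "2016-08-23T16:59:49Z", "2017-11-14T23:58:14Z")
subscribers <- c(295509, 209070, 155282)
price       <- c(8640, 8640, 8640)
discount    <- c(455, 455, 455)

# Parse the ISO 8601 timestamps as UTC and extract the hour (base R)
stamp <- as.POSIXct(published, format = "%Y-%m-%dT%H:%M:%SZ", tz = "UTC")
hours <- as.integer(format(stamp, "%H"))

# Vectorized bucketing into [0,8), [8,16), [16,24)
bucket <- as.character(cut(hours, breaks = c(0, 8, 16, 24), right = FALSE,
                           labels = c("12am to 8am", "8am to 4pm", "4pm to 12am")))

# Same formula as tot_income: subscribers times (list price - discounted price)
income <- subscribers * (price - discount)

print(hours)
print(bucket)
print(income)
```

As defined, `tot_income` multiplies subscribers by the gap between the list price and the discounted price, and it is NA whenever either price is missing.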

Create a new data frame by aggregating the average rating for free and paid courses

rat_cou <- aggregate(avg_rating  ~ is_paid, course,FUN = mean)

Create a new data frame by counting free and paid courses

num_cou <- data.frame(table(course$is_paid))

Create a new data frame by summing the number of subscribers at each course price

tren_byp <- aggregate(num_subscribers ~ price_detail__amount, course, FUN= sum)
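
A small sketch of what these three summaries return, using made-up rows (`toy` and its values are illustrative). One detail worth knowing: the formula interface of `aggregate()` omits rows with NA in the variables used by default, so courses without a price do not appear in the per-price summary.

```r
# Made-up example rows; only the column names mirror the real data
toy <- data.frame(
  is_paid = c("Paid", "Paid", "Free", "Paid", "Free"),
  avg_rating = c(4.5, 4.0, 3.5, 4.25, 4.5),
  num_subscribers = c(1000, 500, 2000, 300, 700),
  price_detail__amount = c(8640, 1280, NA, 8640, NA)
)

# Mean rating per course type
rat <- aggregate(avg_rating ~ is_paid, toy, FUN = mean)
# Number of courses per course type (columns are named Var1 and Freq)
num <- data.frame(table(toy$is_paid))
# Total subscribers per price; the two NA-price rows are omitted
by_price <- aggregate(num_subscribers ~ price_detail__amount, toy, FUN = sum)

print(rat)
print(num)
print(by_price)
```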

Plotting

Course Trend

year <- aggregate(num_subscribers ~ publish_year, course, sum)
ggplot(year, aes(x= publish_year, y=num_subscribers)) +
  geom_line(col = "#173F5f") +
  geom_point(col = "#3caea3") +
  theme_clean()+
  scale_x_continuous(breaks = seq(2011, 2020, 1)) +
  scale_y_continuous(labels = scales::comma) +
  labs(title = "Course Trend by Release Year",
       subtitle = "by Number of Subscriptions",
       x = "Release Year",
       y = "Number of Subscribers")

Udemy users in India mostly subscribe to courses released in 2016-2018; the courses released in each of those years total more than 6 million subscribers. Courses released before 2014 have fewer subscribers than those from other years. A likely explanation is that course material is continually updated, so courses released after 2014, especially in 2016-2018, offer more up-to-date material that is easier for users to learn.

Number of Courses

ggplot(num_cou, aes(x = Var1, y = Freq)) +
  geom_col(aes(fill =Freq)) +
  scale_y_continuous(labels = scales::comma) +
  theme_clean() +
  labs(title = "Free vs Paid",
       subtitle = "Number of Courses",
       x = "Course Type",
       y = "Total Number of Courses") +
  scale_fill_gradient(high = "#173F5f",
                      low = "#3caea3") +
  theme(legend.position = "none")

From this chart we can see that there are more paid courses than free courses. But how do the average ratings compare?

ggplot(course, aes(avg_rating, num_subscribers)) +
   geom_jitter(col="#3caea3", alpha = 0.5)+
  scale_y_continuous(labels = scales::comma) +
   facet_wrap(~is_paid, scales = "free")+
   labs(title = "Number of Subscribers vs Rating", x="Rating", y= "Number of Subscribers" )+
  theme_pander() +
   theme(plot.title = element_text(hjust = 0.5))

We can see that most courses have a rating between 4 and 5 points, and this is more pronounced for paid courses than for free ones. Udemy users are willing to spend money on a paid course even though they can find a free one. So it is fine to launch paid courses, because users will subscribe to courses that have suitable materials and are easy to learn. If we want to launch a course, what will influence the number of subscribers?

Does price influence the number of subscribers?

g <- ggplot(course,aes(x=price_detail__amount, y = num_subscribers)) +
  geom_jitter(aes(col = avg_rating)) +
  labs(title = "Price vs Number of Subscribers",
       x = "Course Price (rupee)",
       y = "Number of Subscribers",
       col = "Rating") +
  theme_pander() +
  scale_y_continuous(labels = scales::comma) +
  scale_x_continuous(breaks = seq(from = 0, to = 15000, by = 2000))
g

To check what this graph suggests, we will use a correlation test.

cor.test(course$num_subscribers, course$price_detail__amount)
## 
##  Pearson's product-moment correlation
## 
## data:  course$num_subscribers and course$price_detail__amount
## t = 17.077, df = 13109, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1307354 0.1642253
## sample estimates:
##       cor 
## 0.1475226
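
Two quick checks help in reading this output. The sketch below squares the reported coefficient to get the share of variance explained, recovers the number of complete pairs from the degrees of freedom, and then runs a rank-based (Spearman) test on simulated data as a possible robustness check for heavily skewed counts. The Spearman step is an added suggestion, not part of the original analysis, and `sim_price` and `sim_subs` are simulated values, not real data.

```r
# Values copied from the cor.test() output above
r  <- 0.1475226
df <- 13109

r_squared <- r^2              # share of variance linearly explained
n_pairs   <- df + 2           # the Pearson test has n - 2 degrees of freedom
n_missing <- 13608 - n_pairs  # rows left out because of NA values

print(round(r_squared, 4))
print(n_pairs)
print(n_missing)

# Simulated, skewed data to illustrate a rank-based alternative
set.seed(1)
sim_price <- sample(c(1280, 3200, 8640), 200, replace = TRUE)
sim_subs  <- round(rexp(200, rate = 1 / 500) * (1 + sim_price / 8640))
sp <- cor.test(sim_subs, sim_price, method = "spearman", exact = FALSE)
print(sp$estimate)
```

An r squared of about 0.02 means price linearly accounts for roughly 2% of the variation in subscribers, and the 497 missing pairs match the NA count reported earlier for price_detail__amount.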

The most popular course prices are in the ranges ₹ 1000-2000 and ₹ 8000-9000. The correlation between number of subscribers and price is statistically significant but weak (about 0.15), so we want to look at the number of subscribers at each price.

Courses With the Highest Income

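course_inc <- course
# Express income in billions of rupees for a readable axis
course_inc$tot_income <-  course_inc$tot_income/1000000000
# Sort by income, highest first (rows with NA income go last)
course_inc <- course_inc[order(course_inc$tot_income, decreasing = TRUE),]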
ggplot(course_inc[1:10,], aes(x=tot_income, y= reorder(title, tot_income))) +
  geom_col(aes(fill = tot_income))+
  scale_fill_gradient(high = "#173F5f",
                      low = "#3caea3")+
  theme_pander()+
  scale_x_continuous(labels = scales::comma) +
  theme(legend.position = "none") +
  labs(title = "Top 10 Courses",
       subtitle = "by Income",
       x= "Income (in Billion rupee)",
       y = "Course Name") +
  geom_vline(data = course_inc, aes(xintercept = median(tot_income, na.rm = TRUE)))

course_inc[1,]

Comparing with our insights:

An Entire MBA in 1 Course: Award Winning Business School Prof is a popular course by release year and has the highest total income. It was published at 9 pm in 2016, has 83 materials, and is priced at 8640 rupees.

Conclusion

Lecturers who want to create a course should consider:

  1. The number of materials: our reference range is 20 - 120 materials.

  2. Free courses do not always have more subscribers or better ratings, so you can develop a paid course priced in the range of 8,500 - 9,000 rupees to earn income.

  3. We recommend publishing between 4 pm and 12 am, because at that time you can best reach Udemy users.
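
Recommendations like these can be checked with grouped summaries. The sketch below shows the pattern on made-up rows (`toy`, its values, and `lecture_band` are illustrative and do not reproduce the figures behind the conclusion): median subscribers per publish-time bucket and per lecture-count band.

```r
# Made-up example rows; only the column names mirror the real data
toy <- data.frame(
  publish_time = c("12am to 8am", "4pm to 12am", "4pm to 12am",
                   "8am to 4pm", "8am to 4pm", "12am to 8am"),
  num_published_lectures = c(15, 84, 110, 40, 300, 25),
  num_subscribers = c(200, 5000, 3000, 800, 1200, 400)
)

# The median is less sensitive than the mean to a few very large courses
by_time <- aggregate(num_subscribers ~ publish_time, toy, FUN = median)

# Band the lecture counts: fewer than 20, 20 to 120, more than 120
toy$lecture_band <- cut(toy$num_published_lectures,
                        breaks = c(0, 19, 120, Inf),
                        labels = c("<20", "20-120", ">120"))
by_band <- aggregate(num_subscribers ~ lecture_band, toy, FUN = median)

print(by_time)
print(by_band)
```

Running the same two `aggregate()` calls on `course` would show whether the 4 pm - 12 am bucket and the 20 - 120 lecture band stand out in the real data.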