This is a Case Study carried on a dataset on Udemy Courses. I am
going to Analyse Data Collected on Courses uploaded on UDEMY. It will
use the six steps in Data Analysis. ASK,
PREPARE, PROCESS, ANALYZE,
SHARE, and ACT
The client is a fictional company called SubAdd Tech Solutions, which seeks to analyse existing trends and various variables affecting the performance of courses on UDEMY to help them strategize for the courses they are about to create. They have no Specific Questions, they want a thorough investigation of the complete dataset as much as is possible with a Junior Data Analyst.
Secondary Data will be used since SubAdd tech Solutions do not have any form of pre-existing data already.
Since the company does not have already existing data for the analysis, a search was carried out online and a dataset was obtained from Kaggle. Kaggle is a community of Data Scientist, which contains a lot of Datasets to work with, some of which are free.
Secondary data was collected from a publicly available dataset on
kaggle named Business
Courses Udemy. It Contains information on 10,000 Courses on Udemy.
All courses in the data set are paid. This Data Set was Downloaded
freely as a csv file with the name
udemy_output_All_Business_p1_p626 The Dataset contains 20
columns consisting of different Variables which describe the data. The
number of roles are 9447, excluding the first column which serves as the
header. It includes various metrics such as discounts, course published
dates, course prices, number of reviews, rating, and number of
subscribers among others.
R Programming Language will be used for this analysis,
this is because of R’s ability to
Python can also be used for this analysis, but
R is the official language of the company.
R Studio desktop version is used for this analysis.
#Installing Packages
# Remove the # and Run these lines if you have not used these packages before
#install.packages("tidyverse")
#install.packages("lubridate")
#install.packages("readr")
# Loading of Packages
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.2 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(lubridate)
library(readr)
library(dplyr)
# tidyverse contain most of the tools that will be used in this analysis
# The Dataset is in the same folder as this MarkDown Book
courses <- read_csv("udemy_output_All_Business_p1_p626.csv")
## Rows: 9447 Columns: 20
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (6): title, url, discount_price__currency, discount_price__price_strin...
## dbl (10): id, num_subscribers, avg_rating, avg_rating_recent, rating, num_r...
## lgl (2): is_paid, is_wishlisted
## dttm (2): created, published_time
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(courses)
glimpse(courses)
## Rows: 9,447
## Columns: 20
## $ id <dbl> 762616, 937678, 1361790, 648826, 637930, …
## $ title <chr> "The Complete SQL Bootcamp 2020: Go from …
## $ url <chr> "/course/the-complete-sql-bootcamp/", "/c…
## $ is_paid <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE,…
## $ num_subscribers <dbl> 295509, 209070, 155282, 245860, 374836, 1…
## $ avg_rating <dbl> 4.66019, 4.58956, 4.59491, 4.54407, 4.470…
## $ avg_rating_recent <dbl> 4.67874, 4.60015, 4.59326, 4.53772, 4.471…
## $ rating <dbl> 4.67874, 4.60015, 4.59326, 4.53772, 4.471…
## $ num_reviews <dbl> 78006, 54581, 52653, 46447, 41630, 38093,…
## $ is_wishlisted <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,…
## $ num_published_lectures <dbl> 84, 78, 292, 338, 83, 275, 23, 275, 144, …
## $ num_published_practice_tests <dbl> 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ created <dttm> 2016-02-14 22:57:48, 2016-08-22 12:10:18…
## $ published_time <dttm> 2016-04-06 05:16:11, 2016-08-23 16:59:49…
## $ discount_price__amount <dbl> 455, 455, 455, 455, 455, 455, 455, 455, 4…
## $ discount_price__currency <chr> "INR", "INR", "INR", "INR", "INR", "INR",…
## $ discount_price__price_string <chr> "₹455", "₹455", "₹455", "₹455", "₹455", "…
## $ price_detail__amount <dbl> 8640, 8640, 8640, 8640, 8640, 8640, 8640,…
## $ price_detail__currency <chr> "INR", "INR", "INR", "INR", "INR", "INR",…
## $ price_detail__price_string <chr> "₹8,640", "₹8,640", "₹8,640", "₹8,640", "…
This process is used to ensure data integrity and to avoid errors and biasis in the analysis
NA#This checks for Null Values in the dataset
nulls = is.na(courses)
#This summarises the characteristics of the null values
summary(nulls)
## id title url is_paid
## Mode :logical Mode :logical Mode :logical Mode :logical
## FALSE:9447 FALSE:9447 FALSE:9447 FALSE:9447
##
## num_subscribers avg_rating avg_rating_recent rating
## Mode :logical Mode :logical Mode :logical Mode :logical
## FALSE:9447 FALSE:9447 FALSE:9447 FALSE:9447
##
## num_reviews is_wishlisted num_published_lectures
## Mode :logical Mode :logical Mode :logical
## FALSE:9447 FALSE:9447 FALSE:9447
##
## num_published_practice_tests created published_time
## Mode :logical Mode :logical Mode :logical
## FALSE:9447 FALSE:9447 FALSE:9447
##
## discount_price__amount discount_price__currency discount_price__price_string
## Mode :logical Mode :logical Mode :logical
## FALSE:8937 FALSE:8937 FALSE:8937
## TRUE :510 TRUE :510 TRUE :510
## price_detail__amount price_detail__currency price_detail__price_string
## Mode :logical Mode :logical Mode :logical
## FALSE:9447 FALSE:9447 FALSE:9447
##
This is to check is there are repeated rows in the dataset
#This checks for duplicated rows in the dataset
nulls = duplicated(courses)
#This summarises the characteristics of the duplicated rows
summary(nulls)
## Mode FALSE
## logical 9447
glimpse(courses)
## Rows: 9,447
## Columns: 20
## $ id <dbl> 762616, 937678, 1361790, 648826, 637930, …
## $ title <chr> "The Complete SQL Bootcamp 2020: Go from …
## $ url <chr> "/course/the-complete-sql-bootcamp/", "/c…
## $ is_paid <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE,…
## $ num_subscribers <dbl> 295509, 209070, 155282, 245860, 374836, 1…
## $ avg_rating <dbl> 4.66019, 4.58956, 4.59491, 4.54407, 4.470…
## $ avg_rating_recent <dbl> 4.67874, 4.60015, 4.59326, 4.53772, 4.471…
## $ rating <dbl> 4.67874, 4.60015, 4.59326, 4.53772, 4.471…
## $ num_reviews <dbl> 78006, 54581, 52653, 46447, 41630, 38093,…
## $ is_wishlisted <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,…
## $ num_published_lectures <dbl> 84, 78, 292, 338, 83, 275, 23, 275, 144, …
## $ num_published_practice_tests <dbl> 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ created <dttm> 2016-02-14 22:57:48, 2016-08-22 12:10:18…
## $ published_time <dttm> 2016-04-06 05:16:11, 2016-08-23 16:59:49…
## $ discount_price__amount <dbl> 455, 455, 455, 455, 455, 455, 455, 455, 4…
## $ discount_price__currency <chr> "INR", "INR", "INR", "INR", "INR", "INR",…
## $ discount_price__price_string <chr> "₹455", "₹455", "₹455", "₹455", "₹455", "…
## $ price_detail__amount <dbl> 8640, 8640, 8640, 8640, 8640, 8640, 8640,…
## $ price_detail__currency <chr> "INR", "INR", "INR", "INR", "INR", "INR",…
## $ price_detail__price_string <chr> "₹8,640", "₹8,640", "₹8,640", "₹8,640", "…
All columns are in the right data class, including the
published time and date created, which are in
DateTime format
There is data integrity, hence I will proceed to the
ANALYSIS phase.
Below are the analysis carried out on the dataset to get insights.
Descriptive Statistics gives an overview of various columns in the dataset.
# The Summary Function gives an overview statistics of all the non-chracter coluumns
summary(courses)
## id title url is_paid
## Min. : 2762 Length:9447 Length:9447 Mode:logical
## 1st Qu.: 673031 Class :character Class :character TRUE:9447
## Median :1381806 Mode :character Mode :character
## Mean :1456603
## 3rd Qu.:2154793
## Max. :3469452
##
## num_subscribers avg_rating avg_rating_recent rating
## Min. : 3 Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.: 141 1st Qu.:3.900 1st Qu.:3.879 1st Qu.:3.879
## Median : 916 Median :4.200 Median :4.206 Median :4.206
## Mean : 3399 Mean :4.139 Mean :4.125 Mean :4.125
## 3rd Qu.: 2890 3rd Qu.:4.450 3rd Qu.:4.447 3rd Qu.:4.447
## Max. :374836 Max. :5.000 Max. :5.000 Max. :5.000
##
## num_reviews is_wishlisted num_published_lectures
## Min. : 3.0 Mode :logical Min. : 0.0
## 1st Qu.: 12.0 FALSE:9447 1st Qu.: 13.0
## Median : 33.0 Median : 23.0
## Mean : 299.1 Mean : 34.4
## 3rd Qu.: 110.0 3rd Qu.: 39.0
## Max. :78006.0 Max. :699.0
##
## num_published_practice_tests created
## Min. :0.0000 Min. :2010-04-14 21:32:46.00
## 1st Qu.:0.0000 1st Qu.:2015-11-18 07:28:59.50
## Median :0.0000 Median :2017-10-05 17:06:12.00
## Mean :0.1018 Mean :2017-06-24 23:21:45.66
## 3rd Qu.:0.0000 3rd Qu.:2019-01-16 11:34:46.00
## Max. :6.0000 Max. :2020-09-02 13:25:24.00
##
## published_time discount_price__amount
## Min. :2010-04-14 16:32:46.00 Min. : 455
## 1st Qu.:2016-01-25 07:43:43.00 1st Qu.: 455
## Median :2017-11-30 18:57:20.00 Median : 455
## Mean :2017-08-17 22:57:47.16 Mean : 489
## 3rd Qu.:2019-03-08 17:51:10.50 3rd Qu.: 455
## Max. :2020-09-08 21:04:10.00 Max. :3200
## NA's :510
## discount_price__currency discount_price__price_string price_detail__amount
## Length:9447 Length:9447 Min. : 1280
## Class :character Class :character 1st Qu.: 1600
## Mode :character Mode :character Median : 3840
## Mean : 4829
## 3rd Qu.: 8640
## Max. :12800
##
## price_detail__currency price_detail__price_string
## Length:9447 Length:9447
## Class :character Class :character
## Mode :character Mode :character
##
##
##
##
The Descriptive statistics indicates that
paid for and non
was in the wish listOverall, the course has a lot of variance, that is some courses were extremely more expensive than others and some performed extremely higher than most of the courses. Outlier courses were many
This will give an overview of how the data is distributed, how far or near they are from the center.
#ggplot will be used for visualization
ggplot(data= courses, mapping = aes(x= num_subscribers)) + geom_histogram(fill="blue") +
geom_freqpoly(color="red") +
labs(title= "Distribution of Number of Subscribers", x="Number of Subcribers", y= "Number of Courses")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#ggplot will be used for visualization
ggplot(data= courses, mapping = aes(x= avg_rating)) + geom_histogram( fill= "blue") +
geom_freqpoly(color="red") +
labs(title= "Distribution of Number of Average ratings", x="Average Rating", y= "Number of Courses")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#ggplot will be used for visualisation
ggplot(data= courses, mapping = aes(x= num_reviews)) + geom_histogram(fill="blue") +
geom_freqpoly(color="red") +
labs(title= "Distribution of Number of Reviews", x="Number of Reviews", y= "Number of Courses")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
To get the number of discounted courses, we will use the
mutate function to create a column called
is_duplicated, this contain a boolean
indicating where or not a course has a discount
# All discounted courses have `NA` at the column `discount_price__amount`
courses <- courses %>%
mutate(is_discounted = ifelse(is.na(discount_price__amount)==FALSE, TRUE, FALSE ))
# Creating a Bar Chart to show distribution of discounts
ggplot(data=courses, mapping = aes(x= is_discounted, fill= is_discounted)) + geom_bar() +
labs(title= "Distribution of Discounts", x="Course Discount Status", y= "Number of Courses")
I will first filter out the discounted courses
discounted_courses <- courses %>% filter(is_discounted==TRUE)
# Creating a Histogram
ggplot(data= discounted_courses, mapping= aes(x= discount_price__amount)) + geom_histogram(fill="blue") +
labs(title= "Distribution of Discounted amounts", x="Amount of Discount", y= "Number of Courses")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
discounted_courses <- discounted_courses %>%
mutate(discount_percentage= (discount_price__amount/ price_detail__amount)*100)
# Creating a Histogram
ggplot(data= discounted_courses, mapping= aes(x= discount_percentage)) + geom_histogram(bins= 10, fill="blue") +
labs(title= "Distribution of Discount Percentages", x="Discount Percentage", y= "Number of Courses")
We will first create a column for the year of course upload separately
courses <- courses %>% mutate(year_published=year(published_time))
# Creating a Bar Chart for each year
ggplot(data= courses, mapping= aes(x= year_published, fill=is_discounted)) + geom_bar()
Correlation Analysis check how various Variables relate to each
other. This analysis will be done by calculating cor and
also drawing out a scatter plot to indicate data relations while
differentiating between discounted courses and non-discounted
courses
Does More Subscribers mean More reviews?
#calculating correlation
correlation <- courses %>% select(num_subscribers, num_reviews) %>% cor()
correlation<- round(correlation[1,2],3)
ggplot(data= courses, mapping= aes(x= num_subscribers, y= num_reviews, color= is_discounted)) + geom_point() +
labs(title= "Number of Reviews vs Number of Subcribers", x="Number of Subcribers", y= "Number of Reviews") +
annotate("text", x= 250000, y= 30000, label= paste("Correlation= ",correlation))
Does more lectures correspond to more Subscribers?
#calculating correlation
correlation <- courses %>% select(num_subscribers, num_published_lectures) %>% cor()
correlation<- round(correlation[1,2],3)
ggplot(data= courses, mapping= aes(x= num_published_lectures, y= num_subscribers, color= is_discounted)) + geom_point() +
labs(title= "Number of Publised Lectures vs Number of Subcribers", x="Number of Published Lectures", y= "Number of Subscribers") +
annotate("text", x= 450, y= 200000, label= paste("Correlation= ",correlation))
Does Better Reviews Mean More Subscribers?
#calculating correlation
correlation <- courses %>% select(num_subscribers, avg_rating) %>% cor()
correlation<- round(correlation[1,2],3)
ggplot(data= courses, mapping= aes(x= avg_rating, y= num_subscribers, color= is_discounted)) + geom_point() +
labs(title= "Average Rating vs Number of Subcribers", x="Average Rating", y= "Number of Subscribers") +
annotate("text", x= 3.5, y= 100000, label= paste("Correlation= ",correlation))
#calculating correlation
correlation <- courses %>% select(num_subscribers, price_detail__amount) %>% cor()
correlation<- round(correlation[1,2],3)
ggplot(data= courses, mapping= aes(x= price_detail__amount, y= num_subscribers, color= is_discounted)) + geom_point() +
labs(title= "Course Price vs Number of Subcribers", x="Course Price", y= "Number of Subscribers") +
annotate("text", x= 5000, y= 200000, label= paste("Correlation= ",correlation))
# We will create a column to calculate the number of days since course was published with today as reference
courses <- courses %>% mutate(days_since_published = as.numeric(now() - published_time))
#calculating correlation
correlation <- courses %>% select(num_subscribers,days_since_published) %>% cor()
correlation<- round(correlation[1,2],3)
ggplot(data= courses, mapping= aes(x= days_since_published, y= num_subscribers, color= is_discounted)) + geom_point() +
labs(title= "Number of Days Since Published vs Number of Subcribers", x="Number of days since Published", y= "Number of Subscribers") + annotate("text", x= 3500, y= 200000, label= paste("Correlation= ",correlation))
#calculating correlation
#correlation <- courses %>% select(num_subscribers, is_discounted) %>% cor()
#correlation<- round(correlation[1,2],3)
ggplot(data= courses, mapping= aes(x= is_discounted, y= num_subscribers, color= is_discounted)) + geom_point() +
labs(title= "Discount Status vs Number of Subcribers", x="Discount", y= "Number of Subscribers")
I will first assign 0 to courses without discount
courses <- courses %>% mutate(course_discount = ifelse(is_discounted==FALSE, 0, discount_price__amount))
# Creating a Discount percentage column
courses <- courses %>% mutate(discount_percentage = (course_discount/ price_detail__amount)*100)
#calculating correlation
correlation <- courses %>% select(num_subscribers, discount_percentage) %>% cor()
correlation<- round(correlation[1,2],3)
ggplot(data= courses, mapping= aes(x= discount_percentage, y= num_subscribers, color= is_discounted)) + geom_point() +
labs(title= "Discount Percentage vs Number of Subcribers", x="Discount Percentage", y= "Number of Subscribers") +
annotate("text", x= 30, y= 200000, label= paste("Correlation= ",correlation))
After the insights comes the action. Knowledge without action is useless. From the Analysis, I suggest these actions to each stakeholder
Text Analysis should be carried out on the course
titles to know which keywords influence course performance.Thumbnail, Course Category, and other performance influencing variables
should be gathered for further analysis.Text Analysis