The dataset used for this project is Udemy Courses, which contains detailed information about online courses offered on the platform.
Its features include course title, content duration, number of subscribers, number of reviews, course price, and more. The dataset comes from Kaggle [https://www.kaggle.com/datasets/andrewmvd/udemy-courses].
It can be used to analyze trends in online learning, course popularity, and content effectiveness.
The main goal of this project is to analyze the relationships between key variables in the Udemy dataset and to understand which factors contribute to a course's popularity and success.
Visualization for Course Price Distribution
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# Load the dataset
df <- read.csv("~/Documents/STAT 2024/udemy_courses.csv")
head(df)
## course_id course_title
## 1 1070968 Ultimate Investment Banking Course
## 2 1113822 Complete GST Course & Certification - Grow Your CA Practice
## 3 1006314 Financial Modeling for Business Analysts and Consultants
## 4 1210588 Beginner to Pro - Financial Analysis in Excel 2017
## 5 1011058 How To Maximize Your Profits Trading Options
## 6 192870 Trading Penny Stocks: A Guide for All Levels In 2017
## url
## 1 https://www.udemy.com/ultimate-investment-banking-course/
## 2 https://www.udemy.com/goods-and-services-tax/
## 3 https://www.udemy.com/financial-modeling-for-business-analysts-and-consultants/
## 4 https://www.udemy.com/complete-excel-finance-course-from-beginner-to-pro/
## 5 https://www.udemy.com/how-to-maximize-your-profits-trading-options/
## 6 https://www.udemy.com/trading-penny-stocks-a-guide-for-all-levels/
## is_paid price num_subscribers num_reviews num_lectures level
## 1 True 200 2147 23 51 All Levels
## 2 True 75 2792 923 274 All Levels
## 3 True 45 2174 74 51 Intermediate Level
## 4 True 95 2451 11 36 All Levels
## 5 True 200 1276 45 26 Intermediate Level
## 6 True 150 9221 138 25 All Levels
## content_duration published_timestamp subject
## 1 1.5 2017-01-18T20:58:58Z Business Finance
## 2 39.0 2017-03-09T16:34:20Z Business Finance
## 3 2.5 2016-12-19T19:26:30Z Business Finance
## 4 3.0 2017-05-30T20:07:24Z Business Finance
## 5 2.0 2016-12-13T14:57:18Z Business Finance
## 6 3.0 2014-05-02T15:13:30Z Business Finance
ggplot(df, aes(x = price)) +
geom_histogram(binwidth = 5, fill = "lightblue", color = "black") +
labs(title = "Distribution of Course Prices",
x = "Price",
y = "Frequency") +
theme_minimal()
This histogram shows the distribution of course prices. Lower-priced
courses occur more frequently than higher-priced courses.
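To put numbers behind the histogram, the prices can be counted per bracket. The following is a minimal sketch: `price_brackets` is a hypothetical helper, the cut points are assumptions chosen for illustration, and the toy prices are made up rather than taken from the Udemy data.

```r
# Hypothetical helper: count how many courses fall into each price bracket.
# The default breaks are an assumption; prices outside them become NA
# and are dropped by table().
price_brackets <- function(price, breaks = c(0, 50, 100, 150, 200)) {
  # include.lowest = TRUE keeps free courses (price 0) in the first bracket
  bins <- cut(price, breaks = breaks, include.lowest = TRUE)
  table(bins)
}

# Toy prices for illustration only, not values from the dataset
toy_price <- c(0, 20, 20, 45, 75, 95, 150, 200)
price_brackets(toy_price)
```

On the real data this would be called as `price_brackets(df$price)`. A table that is heavy in the first bracket would support the reading of the histogram.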
Visualization for Number of Lectures vs. Number of Subscribers
ggplot(df, aes(x = num_lectures, y = num_subscribers)) +
geom_point(color = "purple", size = 3, alpha = 0.6) +
geom_smooth(method = "lm", color = "red", se = FALSE) +
labs(title = "Number of Lectures vs Number of Subscribers",
x = "Number of Lectures",
y = "Number of Subscribers") +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
In this scatter plot, the number of subscribers tends to increase as the
number of lectures increases, so the number of lectures may be an
influencing factor. More investigation is required to determine how
strongly the number of lectures affects the number of subscribers.
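One possible next step is to fit the same straight line that `geom_smooth(method = "lm")` draws and read off its slope and R-squared. This is only a sketch: `lecture_fit` is a hypothetical helper, and the toy data frame holds made-up values that merely reuse the dataset's column names.

```r
# Hypothetical helper: fit num_subscribers ~ num_lectures with lm()
# and return the slope and the share of variance explained.
lecture_fit <- function(data) {
  fit <- lm(num_subscribers ~ num_lectures, data = data)
  c(slope = unname(coef(fit)["num_lectures"]),
    r_squared = summary(fit)$r.squared)
}

# Toy data for illustration only, not values from the dataset
toy_lectures <- data.frame(
  num_lectures    = c(10, 20, 30, 40, 50),
  num_subscribers = c(120, 180, 310, 390, 500)
)
lecture_fit(toy_lectures)
```

On the real data this would be `lecture_fit(df)`. A positive slope together with a low R-squared would mean that the trend exists but explains little of the variation in subscribers.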
Hypothesis 1: Number of Subscribers vs. Content Duration Visualization
ggplot(df, aes(x = content_duration, y = num_subscribers)) +
geom_point(color = "blue", size = 3, alpha = 0.6) +
geom_smooth(method = "lm", color = "red", se = FALSE) +
labs(title = "Hypothesis 1: Content Duration vs Number of Subscribers",
x = "Content Duration (in hours)",
y = "Number of Subscribers") +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
cor_hypothesis1 <- cor(df$content_duration, df$num_subscribers, use = "complete.obs", method = "pearson")
cor_hypothesis1
## [1] 0.1618387
This visualization indicates a positive relationship between number of subscribers and content duration, but the correlation coefficient (about 0.16) shows that the relationship is weak. Content duration may not be a strong influence on the number of subscribers.
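A correlation coefficient on its own does not say whether the relationship could be due to chance. As a sketch, `cor.test()` from base R reports a p-value alongside the estimate, and a Spearman rank correlation can be added because subscriber counts are often skewed (that skew is an assumption here, not something checked above). `duration_cor` is a hypothetical helper and the toy values are made up.

```r
# Hypothetical helper: Pearson and Spearman correlation with p-values.
# cor.test() drops incomplete pairs before computing the statistic.
duration_cor <- function(data) {
  pearson <- cor.test(data$content_duration, data$num_subscribers,
                      method = "pearson")
  # exact = FALSE avoids a warning when the data contain ties
  spearman <- cor.test(data$content_duration, data$num_subscribers,
                       method = "spearman", exact = FALSE)
  data.frame(
    method   = c("pearson", "spearman"),
    estimate = unname(c(pearson$estimate, spearman$estimate)),
    p_value  = c(pearson$p.value, spearman$p.value)
  )
}

# Toy data for illustration only, not values from the dataset
toy_duration <- data.frame(
  content_duration = c(1, 2, 3, 4, 5, 6),
  num_subscribers  = c(100, 150, 130, 400, 380, 900)
)
duration_cor(toy_duration)
```

On the real data this would be `duration_cor(df)`. With a large sample, a weak correlation can still have a small p-value, so the estimate and the p-value should be read together.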
Hypothesis 2: Paid vs. Free Courses by Number of Reviews Visualization
ggplot(df, aes(x = is_paid, y = num_reviews, fill = is_paid)) +
geom_boxplot() +
labs(title = "Hypothesis 2: Reviews for Paid vs Free Courses",
x = "Course Type (Paid or Free)",
y = "Number of Reviews") +
theme_minimal()
This visualization indicates that free courses have more reviews than paid courses.
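Boxplots of review counts can be hard to read when a few courses have very many reviews, so a per-group summary may help confirm the comparison. The sketch below assumes `is_paid` holds the strings "True" and "False" as in the `head(df)` output. `review_by_type` is a hypothetical helper and the toy values are made up.

```r
# Hypothetical helper: group size, median, and mean of num_reviews
# for each value of is_paid.
review_by_type <- function(data) {
  med <- tapply(data$num_reviews, data$is_paid, median)
  avg <- tapply(data$num_reviews, data$is_paid, mean)
  n   <- tapply(data$num_reviews, data$is_paid, length)
  data.frame(
    is_paid        = names(med),
    n              = as.vector(n),
    median_reviews = as.vector(med),
    mean_reviews   = as.vector(avg)
  )
}

# Toy data for illustration only, not values from the dataset
toy_reviews <- data.frame(
  is_paid     = c("True", "True", "True", "False", "False", "False"),
  num_reviews = c(10, 40, 25, 30, 80, 55)
)
review_by_type(toy_reviews)

# Rank-based two-sample test; exact = FALSE avoids a warning on ties
wilcox.test(num_reviews ~ is_paid, data = toy_reviews, exact = FALSE)$p.value
```

On the real data this would be `review_by_type(df)`. Comparing medians rather than means reduces the influence of a handful of heavily reviewed courses.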