Analyze the factors influencing the popularity and pricing of Udemy courses. Determine the key drivers behind course success in terms of ratings and enrollment.
Understand the relationship between course attributes (e.g., price, duration, level) and popularity metrics (e.g., ratings, enrollments).
Develop actionable insights to guide course creation and marketing strategies.
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
data <- read.csv("~/Documents/STAT 2024/udemy_courses.csv")
head(data)
## course_id course_title
## 1 1070968 Ultimate Investment Banking Course
## 2 1113822 Complete GST Course & Certification - Grow Your CA Practice
## 3 1006314 Financial Modeling for Business Analysts and Consultants
## 4 1210588 Beginner to Pro - Financial Analysis in Excel 2017
## 5 1011058 How To Maximize Your Profits Trading Options
## 6 192870 Trading Penny Stocks: A Guide for All Levels In 2017
## url
## 1 https://www.udemy.com/ultimate-investment-banking-course/
## 2 https://www.udemy.com/goods-and-services-tax/
## 3 https://www.udemy.com/financial-modeling-for-business-analysts-and-consultants/
## 4 https://www.udemy.com/complete-excel-finance-course-from-beginner-to-pro/
## 5 https://www.udemy.com/how-to-maximize-your-profits-trading-options/
## 6 https://www.udemy.com/trading-penny-stocks-a-guide-for-all-levels/
## is_paid price num_subscribers num_reviews num_lectures level
## 1 True 200 2147 23 51 All Levels
## 2 True 75 2792 923 274 All Levels
## 3 True 45 2174 74 51 Intermediate Level
## 4 True 95 2451 11 36 All Levels
## 5 True 200 1276 45 26 Intermediate Level
## 6 True 150 9221 138 25 All Levels
## content_duration published_timestamp subject
## 1 1.5 2017-01-18T20:58:58Z Business Finance
## 2 39.0 2017-03-09T16:34:20Z Business Finance
## 3 2.5 2016-12-19T19:26:30Z Business Finance
## 4 3.0 2017-05-30T20:07:24Z Business Finance
## 5 2.0 2016-12-13T14:57:18Z Business Finance
## 6 3.0 2014-05-02T15:13:30Z Business Finance
str(data)
## 'data.frame': 3678 obs. of 12 variables:
## $ course_id : int 1070968 1113822 1006314 1210588 1011058 192870 739964 403100 476268 1167710 ...
## $ course_title : chr "Ultimate Investment Banking Course" "Complete GST Course & Certification - Grow Your CA Practice" "Financial Modeling for Business Analysts and Consultants" "Beginner to Pro - Financial Analysis in Excel 2017" ...
## $ url : chr "https://www.udemy.com/ultimate-investment-banking-course/" "https://www.udemy.com/goods-and-services-tax/" "https://www.udemy.com/financial-modeling-for-business-analysts-and-consultants/" "https://www.udemy.com/complete-excel-finance-course-from-beginner-to-pro/" ...
## $ is_paid : chr "True" "True" "True" "True" ...
## $ price : int 200 75 45 95 200 150 65 95 195 200 ...
## $ num_subscribers : int 2147 2792 2174 2451 1276 9221 1540 2917 5172 827 ...
## $ num_reviews : int 23 923 74 11 45 138 178 148 34 14 ...
## $ num_lectures : int 51 274 51 36 26 25 26 23 38 15 ...
## $ level : chr "All Levels" "All Levels" "Intermediate Level" "All Levels" ...
## $ content_duration : num 1.5 39 2.5 3 2 3 1 2.5 2.5 1 ...
## $ published_timestamp: chr "2017-01-18T20:58:58Z" "2017-03-09T16:34:20Z" "2016-12-19T19:26:30Z" "2017-05-30T20:07:24Z" ...
## $ subject : chr "Business Finance" "Business Finance" "Business Finance" "Business Finance" ...
summary(data)
## course_id course_title url is_paid
## Min. : 8324 Length:3678 Length:3678 Length:3678
## 1st Qu.: 407692 Class :character Class :character Class :character
## Median : 687917 Mode :character Mode :character Mode :character
## Mean : 675972
## 3rd Qu.: 961356
## Max. :1282064
## price num_subscribers num_reviews num_lectures
## Min. : 0.00 Min. : 0.0 Min. : 0.0 Min. : 0.00
## 1st Qu.: 20.00 1st Qu.: 111.0 1st Qu.: 4.0 1st Qu.: 15.00
## Median : 45.00 Median : 911.5 Median : 18.0 Median : 25.00
## Mean : 66.05 Mean : 3197.2 Mean : 156.3 Mean : 40.11
## 3rd Qu.: 95.00 3rd Qu.: 2546.0 3rd Qu.: 67.0 3rd Qu.: 45.75
## Max. :200.00 Max. :268923.0 Max. :27445.0 Max. :779.00
## level content_duration published_timestamp subject
## Length:3678 Min. : 0.000 Length:3678 Length:3678
## Class :character 1st Qu.: 1.000 Class :character Class :character
## Mode :character Median : 2.000 Mode :character Mode :character
## Mean : 4.095
## 3rd Qu.: 4.500
## Max. :78.500
missing_summary <- sapply(data, function(x) sum(is.na(x)))
missing_summary
## course_id course_title url is_paid
## 0 0 0 0
## price num_subscribers num_reviews num_lectures
## 0 0 0 0
## level content_duration published_timestamp subject
## 0 0 0 0
Distribution of Course Levels
# Bar plot for levels
ggplot(data, aes(x = level)) +
geom_bar(fill = "green", alpha = 0.7) +
labs(title = "Distribution of Course Levels", x = "Level", y = "Count") +
theme_minimal()
- Courses labeled as All Levels dominate, indicating that many courses are designed to appeal to a broad audience.
- Beginner-level courses are the second most common, showing a focus on attracting entry-level learners.
- Course creators should consider targeting broader or beginner-level audiences to maximize enrollment opportunities, as these levels dominate the marketplace.
Distribution of Content Duration
# Histogram for content duration
ggplot(data, aes(x = content_duration)) +
geom_histogram(binwidth = 1, fill = "purple", color = "white", alpha = 0.7) +
labs(title = "Distribution of Content Duration", x = "Content Duration (Hours)", y = "Count") +
theme_minimal()
- Most courses have content durations of less than 5 hours, with a sharp peak at very short durations (1–2 hours).
- This suggests that short courses are more common, likely due to learner preferences for concise content.
- Course creators may focus on shorter durations to align with learner preferences and market trends, while extended courses may target niche audiences.
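The "less than 5 hours" observation can also be checked numerically instead of being read off the histogram. The following is a minimal sketch, assuming a small made-up vector `toy_duration` (hypothetical values, not the Udemy data) standing in for `data$content_duration`:

```r
# Hypothetical durations in hours; a stand-in for data$content_duration
toy_duration <- c(0.5, 1, 1, 1.5, 2, 2, 3, 4.5, 6, 12)

# Share of courses shorter than 5 hours
# (the mean of a logical vector is a proportion)
share_under_5h <- mean(toy_duration < 5)

# Quartiles summarize the skew without depending on the bin width
duration_quartiles <- quantile(toy_duration, probs = c(0.25, 0.5, 0.75))

print(share_under_5h)
print(duration_quartiles)
```

Applied to the real column, the same two calls would give a concrete percentage to report alongside the plot.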
Subject vs. Subscribers
# Perform ANOVA
anova_result <- aov(num_subscribers ~ subject, data = data)
# Summarize the ANOVA results
summary(anova_result)
## Df Sum Sq Mean Sq F value Pr(>F)
## subject 3 2.133e+10 7.11e+09 84.05 <2e-16 ***
## Residuals 3674 3.108e+11 8.46e+07
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Create a boxplot to visualize subscribers by subject
ggplot(data, aes(x = subject, y = num_subscribers, fill = subject)) +
geom_boxplot() +
labs(title = "Number of Subscribers by Subject",
x = "Subject",
y = "Number of Subscribers") +
theme_minimal() +
theme(legend.position = "none")
- Web Development courses have a higher median subscriber count and a wider spread than courses in the other subjects.
- Business Finance, Graphic Design, and Musical Instruments have much lower median subscriber counts compared to Web Development. These subjects exhibit smaller spreads, indicating less variability in their subscriber counts.
The p-value (< 2e-16) is extremely small, much smaller than the conventional threshold of 0.05. This indicates that the differences in the mean number of subscribers across subjects are statistically significant.
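The summary statistics show that `num_subscribers` is heavily right-skewed (mean 3197.2, median 911.5, maximum 268923), which strains the normality assumption behind one-way ANOVA. A possible robustness check, not part of the original analysis, is a rank-based Kruskal-Wallis test or an ANOVA on log-transformed counts. The sketch below uses a hypothetical data frame `toy` with made-up subjects "A", "B", and "C" and simulated counts:

```r
set.seed(1)

# Hypothetical skewed counts for three made-up subjects (not the Udemy data)
toy <- data.frame(
  subject = rep(c("A", "B", "C"), each = 50),
  num_subscribers = c(
    round(rlnorm(50, meanlog = 6, sdlog = 1.5)),
    round(rlnorm(50, meanlog = 6, sdlog = 1.5)),
    round(rlnorm(50, meanlog = 8, sdlog = 1.5))
  )
)

# Rank-based alternative to one-way ANOVA; does not assume normal residuals
kw_result <- kruskal.test(num_subscribers ~ subject, data = toy)
print(kw_result)

# ANOVA on log1p-transformed counts as a second check (log1p handles zeros)
aov_log <- aov(log1p(num_subscribers) ~ subject, data = toy)
print(summary(aov_log))
```

If both checks agree with the original ANOVA on the real data, the conclusion about subject differences does not hinge on the skew.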
# Correlation between reviews and subscribers
cor_test <- cor.test(data$num_reviews, data$num_subscribers)
cor_test
##
## Pearson's product-moment correlation
##
## data: data$num_reviews and data$num_subscribers
## t = 51.852, df = 3676, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.6308781 0.6682284
## sample estimates:
## cor
## 0.6499455
ggplot(data, aes(x = num_reviews, y = num_subscribers)) +
geom_point(alpha = 0.5, color = "red") +
geom_smooth(method = "lm", color = "blue", se = TRUE) +
labs(title = "Number of Reviews vs. Number of Subscribers",
x = "Number of Reviews",
y = "Number of Subscribers") +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
- The upward slope of the regression line confirms a positive correlation: as the number of reviews increases, the number of subscribers also tends to increase. At low numbers of reviews, there is a wide spread in subscriber counts.
- Some courses with very few reviews still manage to attract many subscribers, suggesting other factors (e.g., price, content quality, or subject) may influence popularity.
- There are outliers with extremely high subscriber counts and reviews. These are likely exceptional courses that are very popular due to factors such as exceptional content, high ratings, or effective marketing.
- The correlation of 0.65 indicates a moderately strong positive relationship. The p-value (< 2.2e-16) shows that the correlation is statistically significant.
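Because Pearson's correlation is sensitive to the extreme outliers noted above, a rank-based Spearman correlation is one way to check that the relationship is not driven by a handful of very large courses. This is a sketch on simulated data; `toy_subscribers` and `toy_reviews` are hypothetical stand-ins for the real columns:

```r
set.seed(2)

# Hypothetical right-skewed subscriber counts (not the Udemy data)
toy_subscribers <- round(rlnorm(200, meanlog = 7, sdlog = 1.5))
# Reviews simulated as a noisy fraction of subscribers
toy_reviews <- round(toy_subscribers * runif(200, min = 0.01, max = 0.1))

# Rank-based correlation; exact = FALSE avoids warnings caused by ties
spearman_result <- cor.test(toy_reviews, toy_subscribers,
                            method = "spearman", exact = FALSE)
print(spearman_result$estimate)
```

On the real data, the equivalent call would be `cor.test(data$num_reviews, data$num_subscribers, method = "spearman", exact = FALSE)`.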
data <- data %>%
mutate(price_range = cut(price,
breaks = c(-Inf, 50, 100, Inf),
labels = c("Low", "Medium", "High")))
table(data$price_range)
##
## Low Medium High
## 2344 611 723
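With its default `right = TRUE`, `cut()` builds intervals that are closed on the right, so a price of exactly 50 falls in "Low" and a price of exactly 100 falls in "Medium". A short sketch with hypothetical prices placed on and around the bin edges shows this:

```r
# Hypothetical prices chosen to sit on and around the bin edges
toy_price <- c(0, 50, 51, 100, 101, 200)

# Same breaks and labels as the price_range column above
toy_range <- cut(toy_price,
                 breaks = c(-Inf, 50, 100, Inf),
                 labels = c("Low", "Medium", "High"))

# Pair each price with the bin it was assigned to
print(data.frame(price = toy_price, price_range = toy_range))
```

Free courses (price 0) therefore sit in the "Low" group together with inexpensive paid courses.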
pairwise_results <- pairwise.t.test(
data$num_subscribers,
data$price_range,
p.adjust.method = "bonferroni",
alternative = c("two.sided")
)
# Print pairwise comparison matrix
print(pairwise_results)
##
## Pairwise comparisons using t tests with pooled SD
##
## data: data$num_subscribers and data$price_range
##
## Low Medium
## Medium 1 -
## High 8.5e-08 2.9e-05
##
## P value adjustment method: bonferroni
# Create the boxplot
ggplot(data, aes(x = price_range, y = num_subscribers, fill = price_range)) +
geom_boxplot() +
labs(title = "Pairwise Comparisons of Subscribers by Price Range",
x = "Price Range",
y = "Number of Subscribers") +
theme_minimal() +
theme(legend.position = "none")
The adjusted p-value for Low vs. High is 8.5e-08, so subscriber counts
differ significantly between the Low and High price ranges. Medium vs.
High is also significant (adjusted p = 2.9e-05), though less strongly.
Low vs. Medium is not significant (adjusted p = 1). In the plot, the
spread of subscriber counts is larger in the Low and High price ranges.
Since price range separates the groups in this way, we use it next as
the factor in an ANOVA test.
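The Bonferroni adjustment used by `pairwise.t.test()` multiplies each raw p-value by the number of comparisons and caps the result at 1, which is consistent with the Low vs. Medium entry being reported as exactly 1. The raw p-values below are hypothetical and chosen only to illustrate the arithmetic:

```r
# Hypothetical unadjusted p-values for three pairwise comparisons,
# in the order: Low vs Medium, Low vs High, Medium vs High
raw_p <- c(0.60, 2e-08, 1e-05)

# Bonferroni via the built-in helper
adjusted_p <- p.adjust(raw_p, method = "bonferroni")

# The same adjustment by hand: multiply by the number of tests, cap at 1
manual_p <- pmin(raw_p * length(raw_p), 1)

print(adjusted_p)
print(all.equal(adjusted_p, manual_p))
```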
# Create the price_range column
data <- data %>%
mutate(price_range = cut(price,
breaks = c(-Inf, 50, 100, Inf),
labels = c("Low", "Medium", "High")))
# Check the distribution of price ranges
table(data$price_range)
##
## Low Medium High
## 2344 611 723
# Perform ANOVA test
anova_price_range <- aov(num_subscribers ~ price_range, data = data)
# Summarize the results
summary(anova_price_range)
## Df Sum Sq Mean Sq F value Pr(>F)
## price_range 2 2.951e+09 1.476e+09 16.48 7.53e-08 ***
## Residuals 3675 3.292e+11 8.957e+07
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Create a boxplot to visualize subscribers by price range
ggplot(data, aes(x = price_range, y = num_subscribers, fill = price_range)) +
geom_boxplot() +
labs(title = "Number of Subscribers by Price Range",
x = "Price Range",
y = "Number of Subscribers") +
theme_minimal() +
theme(legend.position = "none")
- The High price range has the highest median number of subscribers, so a typical higher-priced course has more subscribers than a typical course in the other ranges.
- The Low price range has the most variability: extreme outliers give some courses very high subscriber counts, but the median remains the lowest.
- The p-value is 7.53e-08, far below the significance threshold of 0.05, so there are statistically significant differences in the mean number of subscribers across price ranges.
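The bullets above compare medians and spread, while the ANOVA compares means, and the two can disagree when a group contains extreme outliers. A per-group summary makes the difference visible. This is a sketch on a hypothetical data frame `toy` with made-up counts, not the Udemy data:

```r
# Hypothetical subscriber counts by price range (not the Udemy data)
toy <- data.frame(
  price_range = factor(rep(c("Low", "Medium", "High"), each = 5),
                       levels = c("Low", "Medium", "High")),
  num_subscribers = c(10, 50, 200, 900, 50000,
                      100, 300, 800, 1500, 4000,
                      400, 900, 1800, 3000, 9000)
)

# The median is robust to the extreme values that inflate the mean
group_median <- tapply(toy$num_subscribers, toy$price_range, median)
group_mean <- tapply(toy$num_subscribers, toy$price_range, mean)
# Interquartile range as a robust measure of spread
group_iqr <- tapply(toy$num_subscribers, toy$price_range, IQR)

print(data.frame(median = as.vector(group_median),
                 mean = as.vector(group_mean),
                 iqr = as.vector(group_iqr),
                 row.names = levels(toy$price_range)))
```

In this toy example a single outlier gives the Low group the largest mean even though its median is the smallest, which is the pattern the boxplot bullets describe.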
# Regression model
model <- lm(num_subscribers ~ price + num_reviews + content_duration, data = data)
summary(model)
##
## Call:
## lm(formula = num_subscribers ~ price + num_reviews + content_duration,
## data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -62072 -2175 -1570 -209 209400
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2328.6773 181.1101 12.858 <2e-16 ***
## price -4.5828 2.0441 -2.242 0.025 *
## num_reviews 6.5860 0.1309 50.308 <2e-16 ***
## content_duration 34.6916 21.0236 1.650 0.099 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7220 on 3674 degrees of freedom
## Multiple R-squared: 0.4234, Adjusted R-squared: 0.4229
## F-statistic: 899.3 on 3 and 3674 DF, p-value: < 2.2e-16
# Visualizing regression: Reviews vs. Subscribers
ggplot(data, aes(x = num_reviews, y = num_subscribers)) +
geom_point(alpha = 0.5, color = "blue") +
geom_smooth(method = "lm", color = "red") +
labs(title = "Regression: Reviews vs. Subscribers", x = "Number of Reviews", y = "Number of Subscribers")
## `geom_smooth()` using formula = 'y ~ x'
- There is a positive correlation between reviews and subscribers, as evidenced by the upward slope of the regression line.
- Courses with more reviews tend to have higher subscriber counts, highlighting the importance of user engagement and feedback.
- This supports the hypothesis that higher engagement (more reviews) is associated with more subscribers. Marketing efforts could focus on encouraging reviews to increase course popularity.
Regression coefficients, each holding the other predictors constant:
- price: for every one-unit increase in price, the number of subscribers decreases by approximately 4.58 (p = 0.025, significant at the 5% level).
- num_reviews: for every additional review, the number of subscribers increases by approximately 6.59 (p < 2e-16).
- content_duration: for every additional hour of content, the number of subscribers increases by approximately 34.69, though the result is not statistically significant at the 5% level (p = 0.099).
In summary, price has a small negative relationship with subscribers, reviews are a strong positive predictor, and content duration has a positive but not statistically significant relationship. The model explains about 42% of the variance in subscriber counts (R-squared = 0.4234).
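To see how the coefficients combine, they can be plugged into the linear equation for a single course. The sketch below copies the estimates from the `summary(model)` output and applies them to a hypothetical course (price 50, 100 reviews, 4 hours of content). The course values are made up for illustration:

```r
# Coefficients copied from the summary(model) output
coefs <- c(intercept = 2328.6773, price = -4.5828,
           num_reviews = 6.5860, content_duration = 34.6916)

# Hypothetical course: price 50, 100 reviews, 4 hours of content
new_course <- c(price = 50, num_reviews = 100, content_duration = 4)

# Linear predictor: intercept plus the sum of coefficient * value
predicted_subscribers <- unname(
  coefs["intercept"] + sum(coefs[names(new_course)] * new_course)
)
print(predicted_subscribers)
```

With the fitted `model` object available, `predict(model, newdata = data.frame(price = 50, num_reviews = 100, content_duration = 4))` should give the same point estimate up to rounding of the printed coefficients.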
Web Development courses are the most popular, and All Levels, beginner-level, and short-duration courses make up most of the catalog. Courses priced above $100 have the highest median subscriber count, although the regression shows a small negative effect of price once reviews and content duration are accounted for. To succeed, course creators should focus on creating accessible, high-quality content, encourage reviews to boost visibility, and use data insights to align with learner preferences and market demand.