Problem Statement:

Analyze the factors influencing the popularity and pricing of Udemy courses. Determine the key drivers behind course success in terms of ratings and enrollment.

Initial EDA

library(ggplot2)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
data <- read.csv("~/Documents/STAT 2024/udemy_courses.csv")
head(data) 
##   course_id                                                course_title
## 1   1070968                          Ultimate Investment Banking Course
## 2   1113822 Complete GST Course & Certification - Grow Your CA Practice
## 3   1006314    Financial Modeling for Business Analysts and Consultants
## 4   1210588          Beginner to Pro - Financial Analysis in Excel 2017
## 5   1011058                How To Maximize Your Profits Trading Options
## 6    192870        Trading Penny Stocks: A Guide for All Levels In 2017
##                                                                               url
## 1                       https://www.udemy.com/ultimate-investment-banking-course/
## 2                                   https://www.udemy.com/goods-and-services-tax/
## 3 https://www.udemy.com/financial-modeling-for-business-analysts-and-consultants/
## 4       https://www.udemy.com/complete-excel-finance-course-from-beginner-to-pro/
## 5             https://www.udemy.com/how-to-maximize-your-profits-trading-options/
## 6              https://www.udemy.com/trading-penny-stocks-a-guide-for-all-levels/
##   is_paid price num_subscribers num_reviews num_lectures              level
## 1    True   200            2147          23           51         All Levels
## 2    True    75            2792         923          274         All Levels
## 3    True    45            2174          74           51 Intermediate Level
## 4    True    95            2451          11           36         All Levels
## 5    True   200            1276          45           26 Intermediate Level
## 6    True   150            9221         138           25         All Levels
##   content_duration  published_timestamp          subject
## 1              1.5 2017-01-18T20:58:58Z Business Finance
## 2             39.0 2017-03-09T16:34:20Z Business Finance
## 3              2.5 2016-12-19T19:26:30Z Business Finance
## 4              3.0 2017-05-30T20:07:24Z Business Finance
## 5              2.0 2016-12-13T14:57:18Z Business Finance
## 6              3.0 2014-05-02T15:13:30Z Business Finance
str(data)
## 'data.frame':    3678 obs. of  12 variables:
##  $ course_id          : int  1070968 1113822 1006314 1210588 1011058 192870 739964 403100 476268 1167710 ...
##  $ course_title       : chr  "Ultimate Investment Banking Course" "Complete GST Course & Certification - Grow Your CA Practice" "Financial Modeling for Business Analysts and Consultants" "Beginner to Pro - Financial Analysis in Excel 2017" ...
##  $ url                : chr  "https://www.udemy.com/ultimate-investment-banking-course/" "https://www.udemy.com/goods-and-services-tax/" "https://www.udemy.com/financial-modeling-for-business-analysts-and-consultants/" "https://www.udemy.com/complete-excel-finance-course-from-beginner-to-pro/" ...
##  $ is_paid            : chr  "True" "True" "True" "True" ...
##  $ price              : int  200 75 45 95 200 150 65 95 195 200 ...
##  $ num_subscribers    : int  2147 2792 2174 2451 1276 9221 1540 2917 5172 827 ...
##  $ num_reviews        : int  23 923 74 11 45 138 178 148 34 14 ...
##  $ num_lectures       : int  51 274 51 36 26 25 26 23 38 15 ...
##  $ level              : chr  "All Levels" "All Levels" "Intermediate Level" "All Levels" ...
##  $ content_duration   : num  1.5 39 2.5 3 2 3 1 2.5 2.5 1 ...
##  $ published_timestamp: chr  "2017-01-18T20:58:58Z" "2017-03-09T16:34:20Z" "2016-12-19T19:26:30Z" "2017-05-30T20:07:24Z" ...
##  $ subject            : chr  "Business Finance" "Business Finance" "Business Finance" "Business Finance" ...
summary(data)
##    course_id       course_title           url              is_paid         
##  Min.   :   8324   Length:3678        Length:3678        Length:3678       
##  1st Qu.: 407692   Class :character   Class :character   Class :character  
##  Median : 687917   Mode  :character   Mode  :character   Mode  :character  
##  Mean   : 675972                                                           
##  3rd Qu.: 961356                                                           
##  Max.   :1282064                                                           
##      price        num_subscribers     num_reviews       num_lectures   
##  Min.   :  0.00   Min.   :     0.0   Min.   :    0.0   Min.   :  0.00  
##  1st Qu.: 20.00   1st Qu.:   111.0   1st Qu.:    4.0   1st Qu.: 15.00  
##  Median : 45.00   Median :   911.5   Median :   18.0   Median : 25.00  
##  Mean   : 66.05   Mean   :  3197.2   Mean   :  156.3   Mean   : 40.11  
##  3rd Qu.: 95.00   3rd Qu.:  2546.0   3rd Qu.:   67.0   3rd Qu.: 45.75  
##  Max.   :200.00   Max.   :268923.0   Max.   :27445.0   Max.   :779.00  
##     level           content_duration published_timestamp   subject         
##  Length:3678        Min.   : 0.000   Length:3678         Length:3678       
##  Class :character   1st Qu.: 1.000   Class :character    Class :character  
##  Mode  :character   Median : 2.000   Mode  :character    Mode  :character  
##                     Mean   : 4.095                                         
##                     3rd Qu.: 4.500                                         
##                     Max.   :78.500
missing_summary <- sapply(data, function(x) sum(is.na(x)))
missing_summary
##           course_id        course_title                 url             is_paid 
##                   0                   0                   0                   0 
##               price     num_subscribers         num_reviews        num_lectures 
##                   0                   0                   0                   0 
##               level    content_duration published_timestamp             subject 
##                   0                   0                   0                   0
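
The str() output above shows that is_paid, level, subject, and published_timestamp are all read in as character columns. The following is a minimal sketch of one way to convert them before analysis. It runs on a small hand-made data frame (the toy rows and the helper name clean_types are illustrative assumptions, not part of the original analysis) and assumes is_paid only ever contains the strings "True" and "False", as in the head() output.

```r
# Toy rows that mimic the character columns shown in str(data).
# The values are illustrative only, not taken from the full data set.
toy <- data.frame(
  is_paid = c("True", "False", "True"),
  level = c("All Levels", "Beginner Level", "All Levels"),
  subject = c("Business Finance", "Web Development", "Business Finance"),
  published_timestamp = c("2017-01-18T20:58:58Z",
                          "2016-12-19T19:26:30Z",
                          "2014-05-02T15:13:30Z"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: convert the character columns to more useful types.
clean_types <- function(df) {
  df$is_paid <- df$is_paid == "True"          # logical flag
  df$level <- factor(df$level)                # categorical
  df$subject <- factor(df$subject)            # categorical
  df$published_timestamp <- as.POSIXct(       # ISO 8601 timestamp in UTC
    df$published_timestamp,
    format = "%Y-%m-%dT%H:%M:%SZ",
    tz = "UTC"
  )
  df
}

toy_clean <- clean_types(toy)
str(toy_clean)
```

If the real column holds other spellings (for example "TRUE"), the comparison would need to be adjusted accordingly.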

Univariate analysis

Distribution of Course Levels

# Bar plot for levels
ggplot(data, aes(x = level)) +
  geom_bar(fill = "green", alpha = 0.7) +
  labs(title = "Distribution of Course Levels", x = "Level", y = "Count") +
  theme_minimal()

- Courses labeled as All Levels dominate, indicating that many courses are designed to appeal to a broad audience.

- Beginner-level courses are the second most common, showing a focus on attracting entry-level learners.

- Course creators should consider targeting broader or beginner-level audiences to maximize enrollment opportunities, as these levels dominate the marketplace.
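
The bar plot shows counts only. To put numbers on how strongly one level dominates, the shares can be tabulated directly. This is a small sketch on a made-up vector of levels; the helper name level_shares and the toy values are assumptions for illustration.

```r
# Hypothetical helper: share of courses per level, largest first.
level_shares <- function(levels) {
  sort(prop.table(table(levels)), decreasing = TRUE)
}

# Toy input, not the real data: two "All Levels" courses out of four.
toy_levels <- c("All Levels", "All Levels",
                "Beginner Level", "Intermediate Level")

shares <- level_shares(toy_levels)
print(round(shares, 2))
```

On the real data the same call would be level_shares(data$level).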

Distribution of Content Duration

# Histogram for content duration
ggplot(data, aes(x = content_duration)) +
  geom_histogram(binwidth = 1, fill = "purple", color = "white", alpha = 0.7) +
  labs(title = "Distribution of Content Duration",
       x = "Content Duration (Hours)",
       y = "Count") +
  theme_minimal()

- Most courses have content durations of less than 5 hours, with a sharp peak at very short durations (1–2 hours).

- This suggests that short courses are more common, likely due to learner preferences for concise content.

- Course creators may focus on shorter durations to align with learner preferences and market trends, while extended courses may target niche audiences.

Bivariate analysis

subject vs. subscribers

# Perform ANOVA
anova_result <- aov(num_subscribers ~ subject, data = data)

# Summarize the ANOVA results
summary(anova_result)
##               Df    Sum Sq  Mean Sq F value Pr(>F)    
## subject        3 2.133e+10 7.11e+09   84.05 <2e-16 ***
## Residuals   3674 3.108e+11 8.46e+07                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Create a boxplot to visualize subscribers by subject
ggplot(data, aes(x = subject, y = num_subscribers, fill = subject)) +
  geom_boxplot() +
  labs(title = "Number of Subscribers by Subject", 
       x = "Subject", 
       y = "Number of Subscribers") +
  theme_minimal() +
  theme(legend.position = "none")

- Web Development courses have a higher median subscriber count and a wider spread than courses in the other three subjects.

- Business Finance, Graphic Design, and Musical Instruments have much lower median subscriber counts compared to Web Development. These subjects exhibit smaller spreads, indicating less variability in their subscriber counts.

The p-value (< 2e-16) is extremely small, much smaller than the conventional threshold of 0.05. This indicates that the differences in the mean number of subscribers across subjects are statistically significant.
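
A significant ANOVA says that at least one subject mean differs, but not which ones, and subscriber counts are strongly right-skewed (the mean of 3197.2 is far above the median of 911.5). The sketch below shows two follow-up checks that could be run: a rank-based Kruskal-Wallis test and Tukey's HSD post hoc comparisons. It uses simulated data, so the numbers it prints say nothing about the Udemy data; the data frame sim and its parameters are assumptions for illustration.

```r
set.seed(1)

# Simulated, right-skewed subscriber counts for four subjects.
# One subject is given a higher typical count on purpose.
subjects <- c("Business Finance", "Graphic Design",
              "Musical Instruments", "Web Development")
sim <- data.frame(
  subject = factor(rep(subjects, each = 200)),
  num_subscribers = round(c(rlnorm(200, meanlog = 6, sdlog = 1.5),
                            rlnorm(200, meanlog = 6, sdlog = 1.5),
                            rlnorm(200, meanlog = 6, sdlog = 1.5),
                            rlnorm(200, meanlog = 8, sdlog = 1.5)))
)

# Rank-based alternative that does not rely on normal residuals.
kw <- kruskal.test(num_subscribers ~ subject, data = sim)
print(kw)

# Post hoc pairwise comparisons of group means after the ANOVA.
fit <- aov(num_subscribers ~ subject, data = sim)
tk <- TukeyHSD(fit, "subject")
print(tk)
```

With four subjects, TukeyHSD() reports six pairwise differences, each with an adjusted p-value and confidence interval.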

Hypothesis 1

# Correlation between reviews and subscribers
cor_test <- cor.test(data$num_reviews, data$num_subscribers)
cor_test
## 
##  Pearson's product-moment correlation
## 
## data:  data$num_reviews and data$num_subscribers
## t = 51.852, df = 3676, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.6308781 0.6682284
## sample estimates:
##       cor 
## 0.6499455
ggplot(data, aes(x = num_reviews, y = num_subscribers)) +
  geom_point(alpha = 0.5, color = "red") +
  geom_smooth(method = "lm", color = "blue", se = TRUE) +
  labs(title = "Number of Reviews vs. Number of Subscribers",
       x = "Number of Reviews",
       y = "Number of Subscribers") +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'

- The upward slope of the regression line confirms a positive correlation: as the number of reviews increases, the number of subscribers also tends to increase. At low numbers of reviews, there is a wide spread in subscriber counts.

- Some courses with very few reviews still manage to attract many subscribers, suggesting other factors (e.g., price, content quality, or subject) may influence popularity.

- There are outliers with extremely high subscriber counts and reviews. These are likely exceptional courses that are very popular due to factors such as exceptional content, high ratings, or effective marketing.

- The correlation of 0.650 (95% confidence interval 0.631 to 0.668) indicates a moderately strong positive linear relationship. The p-value (< 2.2e-16) shows that the relationship is statistically significant.
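
As a quick check on the cor.test() output, the reported t statistic can be reproduced from the correlation and the degrees of freedom using t = r * sqrt(df) / sqrt(1 - r^2). The sketch below only uses the two numbers printed above; the variable names are illustrative.

```r
r  <- 0.6499455   # sample correlation reported by cor.test()
df <- 3676        # degrees of freedom reported by cor.test() (n - 2)

# t statistic for testing that the true correlation is zero.
t_stat <- r * sqrt(df) / sqrt(1 - r^2)

# Share of variance in subscribers explained by reviews alone.
r_squared <- r^2

print(round(t_stat, 2))
print(round(r_squared, 3))
```

The result agrees with the reported t = 51.852, and r squared is about 0.42, so reviews alone account for roughly 42% of the variance in subscriber counts.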

Pairwise Comparisons of Price Ranges

data <- data %>%
  mutate(price_range = cut(price, 
                           breaks = c(-Inf, 50, 100, Inf), 
                           labels = c("Low", "Medium", "High")))
table(data$price_range)
## 
##    Low Medium   High 
##   2344    611    723
pairwise_results <- pairwise.t.test(
  data$num_subscribers,
  data$price_range,
  p.adjust.method = "bonferroni",
  alternative = c("two.sided")
)

# Print pairwise comparison matrix
print(pairwise_results)
## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  data$num_subscribers and data$price_range 
## 
##        Low     Medium 
## Medium 1       -      
## High   8.5e-08 2.9e-05
## 
## P value adjustment method: bonferroni
# Create the boxplot
ggplot(data, aes(x = price_range, y = num_subscribers, fill = price_range)) +
  geom_boxplot() +
  labs(title = "Pairwise Comparisons of Subscribers by Price Range", 
       x = "Price Range", 
       y = "Number of Subscribers") +
  theme_minimal() +
  theme(legend.position = "none")

- Low vs. High: the adjusted p-value is 8.5e-08, so the mean subscriber counts of the Low and High price ranges differ significantly.

- Medium vs. High: the adjusted p-value is 2.9e-05, which is also significant, although larger than for Low vs. High.

- Low vs. Medium: the adjusted p-value is 1, so there is no evidence of a difference between these two ranges.

- In the plot, the spread of subscriber counts is larger for the Low and High price ranges than for the Medium range. Because at least two price ranges differ, price_range is carried forward into the ANOVA test under Hypothesis 2.
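
Two details of the code above are easy to misread: where cut() places prices that fall exactly on a break, and what the Bonferroni adjustment does to the p-values. The sketch below illustrates both. The example prices and the raw p-values are hypothetical; only the breaks, labels, and adjustment method are taken from the analysis.

```r
# cut() uses right-closed intervals by default, so 50 is "Low"
# and 100 is "Medium".
prices <- c(0, 50, 51, 100, 101, 200)
bins <- cut(prices,
            breaks = c(-Inf, 50, 100, Inf),
            labels = c("Low", "Medium", "High"))
print(data.frame(price = prices, price_range = bins))

# Bonferroni multiplies each raw p-value by the number of
# comparisons (three here) and caps the result at 1.
raw_p <- c(0.40, 2e-06, 1e-08)   # hypothetical raw p-values
adj_p <- p.adjust(raw_p, method = "bonferroni")
print(adj_p)
```

The cap at 1 explains why the Low vs. Medium entry in the comparison matrix is printed as exactly 1.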

Hypothesis 2

# Create the price_range column (same bins as in the pairwise comparison)
data <- data %>%
  mutate(price_range = cut(price, 
                           breaks = c(-Inf, 50, 100, Inf), 
                           labels = c("Low", "Medium", "High")))

# Check the distribution of price ranges
table(data$price_range)
## 
##    Low Medium   High 
##   2344    611    723
# Perform ANOVA test
anova_price_range <- aov(num_subscribers ~ price_range, data = data)

# Summarize the results
summary(anova_price_range)
##               Df    Sum Sq   Mean Sq F value   Pr(>F)    
## price_range    2 2.951e+09 1.476e+09   16.48 7.53e-08 ***
## Residuals   3675 3.292e+11 8.957e+07                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Create a boxplot to visualize subscribers by price range
ggplot(data, aes(x = price_range, y = num_subscribers, fill = price_range)) +
  geom_boxplot() +
  labs(title = "Number of Subscribers by Price Range", 
       x = "Price Range", 
       y = "Number of Subscribers") +
  theme_minimal() +
  theme(legend.position = "none")

- The High price range has the highest median number of subscribers, so a typical higher-priced course has more subscribers than a typical course in the other two ranges.

- The Low price range has the most variability, with extreme outliers driving high subscriber counts for some courses, but the median remains the lowest.

- The p-value is 7.53e-08, far below the 0.05 significance threshold, so the mean number of subscribers differs significantly across price ranges.
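
Statistical significance does not say how large an effect is. One common effect size for a one-way ANOVA is eta squared, the between-group sum of squares divided by the total sum of squares. The sketch below computes it from the sums of squares printed in the two ANOVA tables in this report; the helper name eta_squared is an assumption for illustration.

```r
# Hypothetical helper: eta squared from an ANOVA table's sums of squares.
eta_squared <- function(ss_between, ss_residual) {
  ss_between / (ss_between + ss_residual)
}

# Sums of squares copied from the summary(aov(...)) outputs.
eta_subject <- eta_squared(2.133e10, 3.108e11)   # num_subscribers ~ subject
eta_price   <- eta_squared(2.951e09, 3.292e11)   # num_subscribers ~ price_range

print(round(c(subject = eta_subject, price_range = eta_price), 4))

# The F value is the ratio of the two mean squares.
f_price <- 1.476e09 / 8.957e07
print(round(f_price, 2))
```

By this measure subject accounts for about 6% of the variance in subscriber counts and price range for under 1%, so both effects are significant but small.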

Regression model

# Regression model
model <- lm(num_subscribers ~ price + num_reviews + content_duration, data = data)
summary(model)
## 
## Call:
## lm(formula = num_subscribers ~ price + num_reviews + content_duration, 
##     data = data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -62072  -2175  -1570   -209 209400 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      2328.6773   181.1101  12.858   <2e-16 ***
## price              -4.5828     2.0441  -2.242    0.025 *  
## num_reviews         6.5860     0.1309  50.308   <2e-16 ***
## content_duration   34.6916    21.0236   1.650    0.099 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7220 on 3674 degrees of freedom
## Multiple R-squared:  0.4234, Adjusted R-squared:  0.4229 
## F-statistic: 899.3 on 3 and 3674 DF,  p-value: < 2.2e-16
# Visualizing regression: Reviews vs. Subscribers
ggplot(data, aes(x = num_reviews, y = num_subscribers)) +
  geom_point(alpha = 0.5, color = "blue") +
  geom_smooth(method = "lm", color = "red") +
  labs(title = "Regression: Reviews vs. Subscribers", x = "Number of Reviews", y = "Number of Subscribers")
## `geom_smooth()` using formula = 'y ~ x'
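
To read the coefficient table, it can help to compute a fitted value by hand. The sketch below plugs the estimates from summary(model) into the regression equation for one hypothetical course (price 50, 100 reviews, 4 hours of content); the course values are made up for illustration.

```r
# Coefficient estimates copied from summary(model).
coefs <- c(intercept = 2328.6773,
           price = -4.5828,
           num_reviews = 6.5860,
           content_duration = 34.6916)

# Hypothetical course used only to illustrate the calculation.
course <- c(price = 50, num_reviews = 100, content_duration = 4)

# Fitted value: intercept plus each coefficient times its predictor.
predicted <- coefs[["intercept"]] +
  sum(coefs[names(course)] * course)

print(round(predicted, 1))
```

Each additional review is associated with about 6.6 more subscribers, holding price and duration fixed, while price has a small negative coefficient. With the model fitted, predict(model, newdata = ...) would give the same value up to rounding of the printed coefficients.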

Conclusion:

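Web Development courses are the most popular by subscriber count. All Levels and beginner courses, along with short courses of under 5 hours, are the most common offerings. Courses priced above $100 have the highest median subscriber count, although price has a small negative coefficient in the regression (-4.58, p = 0.025) once reviews and content duration are included. The number of reviews is the strongest predictor of subscribers (correlation 0.65). To succeed, course creators should focus on creating accessible, high-quality content, encourage reviews to boost visibility, and use data insights to align with learner preferences and market demand.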