Short description

This project uses the Udemy Courses dataset, which contains detailed information about online courses offered on the Udemy platform.

The dataset includes features such as course title, content duration, number of subscribers, number of reviews, course price, and more. It was obtained from Kaggle [https://www.kaggle.com/datasets/andrewmvd/udemy-courses].

It can be used to analyze trends in online learning, course popularity, and content effectiveness.

Goal

The main goal of this project is to analyze the relationships between key variables in the Udemy dataset and understand which factors contribute to the popularity and success of a course.

Visualization 1

Visualization for Course Price Distribution

library(ggplot2)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
# Load the dataset
df <- read.csv("~/Documents/STAT 2024/udemy_courses.csv")
head(df)
##   course_id                                                course_title
## 1   1070968                          Ultimate Investment Banking Course
## 2   1113822 Complete GST Course & Certification - Grow Your CA Practice
## 3   1006314    Financial Modeling for Business Analysts and Consultants
## 4   1210588          Beginner to Pro - Financial Analysis in Excel 2017
## 5   1011058                How To Maximize Your Profits Trading Options
## 6    192870        Trading Penny Stocks: A Guide for All Levels In 2017
##                                                                               url
## 1                       https://www.udemy.com/ultimate-investment-banking-course/
## 2                                   https://www.udemy.com/goods-and-services-tax/
## 3 https://www.udemy.com/financial-modeling-for-business-analysts-and-consultants/
## 4       https://www.udemy.com/complete-excel-finance-course-from-beginner-to-pro/
## 5             https://www.udemy.com/how-to-maximize-your-profits-trading-options/
## 6              https://www.udemy.com/trading-penny-stocks-a-guide-for-all-levels/
##   is_paid price num_subscribers num_reviews num_lectures              level
## 1    True   200            2147          23           51         All Levels
## 2    True    75            2792         923          274         All Levels
## 3    True    45            2174          74           51 Intermediate Level
## 4    True    95            2451          11           36         All Levels
## 5    True   200            1276          45           26 Intermediate Level
## 6    True   150            9221         138           25         All Levels
##   content_duration  published_timestamp          subject
## 1              1.5 2017-01-18T20:58:58Z Business Finance
## 2             39.0 2017-03-09T16:34:20Z Business Finance
## 3              2.5 2016-12-19T19:26:30Z Business Finance
## 4              3.0 2017-05-30T20:07:24Z Business Finance
## 5              2.0 2016-12-13T14:57:18Z Business Finance
## 6              3.0 2014-05-02T15:13:30Z Business Finance
ggplot(df, aes(x = price)) +
  geom_histogram(binwidth = 5, fill = "lightblue", color = "black") +
  labs(title = "Distribution of Course Prices",
       x = "Price",
       y = "Frequency") +
  theme_minimal()

This histogram shows the distribution of course prices. Lower-priced courses are more frequent than higher-priced courses.

Visualization 2

Visualization for Number of Lectures vs. Number of Subscribers

ggplot(df, aes(x = num_lectures, y = num_subscribers)) +
  geom_point(color = "purple", size = 3, alpha = 0.6) +  
  geom_smooth(method = "lm", color = "red", se = FALSE) +  
  labs(title = "Number of Lectures vs Number of Subscribers",
       x = "Number of Lectures",
       y = "Number of Subscribers") +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'

In this scatter plot, the number of subscribers tends to increase as the number of lectures increases. The number of lectures may therefore be an influencing factor, but more investigation is required to understand how strongly the two variables are related.
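One way to follow up on the trend line is to fit the same linear model that `geom_smooth(method = "lm")` draws and look at its slope and R-squared. The sketch below is only an illustration: the `toy` data frame is simulated and stands in for `df`, so its numbers say nothing about the real Udemy data.

```r
# Simulated stand-in for df; these values are invented for illustration only.
set.seed(42)
toy <- data.frame(num_lectures = sample(5:300, 150, replace = TRUE))
toy$num_subscribers <- pmax(
  0,
  round(500 + 20 * toy$num_lectures + rnorm(150, sd = 3000))
)

# Fit the straight line that the red trend line in the plot represents.
fit <- lm(num_subscribers ~ num_lectures, data = toy)

# Slope: estimated change in subscribers per additional lecture.
coef(fit)

# R-squared: share of the variation in subscribers explained by lectures.
r2 <- summary(fit)$r.squared
r2
```

A small R-squared would suggest that the number of lectures explains little of the variation in subscribers, even when the slope is positive.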

Plan to move forward

  1. Perform summary statistics to understand the central tendencies and distributions of key variables (e.g., price, num_subscribers, num_reviews, content_duration).
  2. Create additional visualizations to explore relationships between other variables, such as price vs. num_subscribers or num_lectures vs. num_reviews.
  3. Explore the distribution of price in detail, including identifying outliers or trends, and carry out hypothesis tests.
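Steps 1 and 3 of the plan could be sketched as follows. This is a minimal sketch on a small made-up data frame (`toy`, not the real `df`), and the 1.5 * IQR rule for flagging outliers is an assumed choice rather than something decided in this report.

```r
# Made-up stand-in for df; values are invented for illustration only.
toy <- data.frame(
  price           = c(0, 20, 20, 45, 50, 75, 95, 200),
  num_subscribers = c(9000, 2100, 1500, 2174, 800, 2792, 2451, 1276)
)

# Step 1: summary statistics (min, quartiles, median, mean, max).
summary(toy[, c("price", "num_subscribers")])

# Step 3: flag price outliers with the 1.5 * IQR rule (assumed choice).
q <- unname(quantile(toy$price, c(0.25, 0.75)))
iqr <- q[2] - q[1]
is_outlier <- toy$price < q[1] - 1.5 * iqr | toy$price > q[2] + 1.5 * iqr
price_outliers <- toy$price[is_outlier]
price_outliers
```

On the real data, the same calls would be run on `df` with the columns named in step 1.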

Initial Findings

Hypothesis 1: Number of Subscribers vs. Content Duration Visualization

ggplot(df, aes(x = content_duration, y = num_subscribers)) +
  geom_point(color = "blue", size = 3, alpha = 0.6) +  
  geom_smooth(method = "lm", color = "red", se = FALSE) +  
  labs(title = "Hypothesis 1: Content Duration vs Number of Subscribers",
       x = "Content Duration (in hours)",
       y = "Number of Subscribers") +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'

cor_hypothesis1 <- cor(df$content_duration, df$num_subscribers, use = "complete.obs", method = "pearson")
cor_hypothesis1
## [1] 0.1618387

This visualization indicates a positive relationship between content duration and number of subscribers, but the Pearson correlation coefficient (about 0.16) shows that the relationship is weak. Content duration may not be a strong influence on the number of subscribers.
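A weak correlation can still be statistically distinguishable from zero when the dataset is large, so a natural next step is `cor.test()`, which reports a p-value and a confidence interval alongside the coefficient. The sketch below runs on simulated data (`toy` is hypothetical and not the Udemy data), so its output is not a result for this project.

```r
# Simulated stand-in for df; these values are invented for illustration only.
set.seed(1)
toy <- data.frame(content_duration = runif(200, min = 0.5, max = 40))
toy$num_subscribers <- 1000 + 30 * toy$content_duration +
  rnorm(200, sd = 2000)

# Pearson correlation test: the null hypothesis is zero correlation.
test <- cor.test(toy$content_duration, toy$num_subscribers,
                 method = "pearson")

test$estimate   # correlation coefficient, same value cor() would return
test$p.value    # p-value under the null hypothesis of zero correlation
test$conf.int   # 95% confidence interval for the correlation
```

On the real data, the two columns of `df` would replace the `toy` columns.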

Hypothesis 2: Paid vs. Free Courses by Number of Reviews Visualization

ggplot(df, aes(x = is_paid, y = num_reviews, fill = is_paid)) +
  geom_boxplot() +
  labs(title = "Hypothesis 2: Reviews for Paid vs Free Courses",
       x = "Course Type (Paid or Free)",
       y = "Number of Reviews") +
  theme_minimal()

This box plot indicates that free courses tend to have more reviews than paid courses.
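To back up the box plot with numbers, the two groups could be compared by their medians and with a Wilcoxon rank-sum test, which does not assume normally distributed review counts. The choice of test is an assumption here, and the `toy` data frame below is invented for illustration, so the output does not describe the real dataset.

```r
# Made-up stand-in for df; values are invented for illustration only.
toy <- data.frame(
  is_paid     = c("True", "True", "True", "True", "True",
                  "False", "False", "False", "False", "False"),
  num_reviews = c(12, 40, 75, 30, 500, 150, 310, 95, 620, 240)
)

# Median number of reviews per group (robust to a few very large values).
medians <- tapply(toy$num_reviews, toy$is_paid, median)
medians

# Wilcoxon rank-sum test: do the two groups differ in location?
wt <- wilcox.test(num_reviews ~ is_paid, data = toy)
wt$p.value
```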