Introduction

The purpose of this analysis is to explore the Social Media and Entertainment Dataset to gain insights into user behavior. This notebook includes numeric and categorical summaries, aggregation-based analysis, and visualizations. It aims to address questions informed by the data’s documentation and goals.

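# Load the packages used throughout this notebook
library(httr)     # GET(), authenticate(), write_disk()
library(readr)    # read_csv()
library(dplyr)    # %>%, summarize(), count(), group_by()
library(ggplot2)  # ggplot()

# Define the Kaggle dataset page URL (kept for reference) and the output file path
dataset_url <- "https://www.kaggle.com/datasets/ashaychoudhary/social-media-and-entertainment-dataset"
output_file <- "social_media_entertainment.csv"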

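# Read Kaggle API credentials from environment variables.
# Set KAGGLE_USERNAME and KAGGLE_KEY outside the notebook (for example in ~/.Renviron);
# never hard-code the username or key in the source.
kaggle_username <- Sys.getenv("KAGGLE_USERNAME")
kaggle_key <- Sys.getenv("KAGGLE_KEY")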

# Construct API call and stream the response body straight to disk
response <- GET(
  url = "https://www.kaggle.com/api/v1/datasets/download/ashaychoudhary/social-media-and-entertainment-dataset",
  authenticate(kaggle_username, kaggle_key),
  write_disk(output_file, overwrite = TRUE)
)

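# Check that the request succeeded; a file may be left on disk even when the request fails,
# so file.exists() alone is not a reliable test
if (!http_error(response) && file.exists(output_file)) {
  message("Dataset downloaded successfully!")
  # Load the dataset into R
  data <- read_csv(output_file)
  # Display the first few rows of the dataset
  head(data)
} else {
  stop(
    "Failed to download the dataset (HTTP status ", status_code(response), "). ",
    "Check your Kaggle API setup."
  )
}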
## # A tibble: 6 × 40
##   `User ID`   Age Gender Country Daily Social Media Tim…¹ Daily Entertainment …²
##       <dbl> <dbl> <chr>  <chr>                      <dbl>                  <dbl>
## 1         1    32 Other  Germany                     4.35                   4.08
## 2         2    62 Other  India                       4.96                   4.21
## 3         3    51 Female USA                         6.78                   1.77
## 4         4    44 Female India                       5.06                   9.21
## 5         5    21 Other  Germany                     2.57                   1.3 
## 6         6    21 Male   Canada                      4.69                   1.7 
## # ℹ abbreviated names: ¹​`Daily Social Media Time (hrs)`,
## #   ²​`Daily Entertainment Time (hrs)`
## # ℹ 34 more variables: `Social Media Platforms Used` <dbl>,
## #   `Primary Platform` <chr>, `Daily Messaging Time (hrs)` <dbl>,
## #   `Daily Video Content Time (hrs)` <dbl>, `Daily Gaming Time (hrs)` <dbl>,
## #   Occupation <chr>, `Marital Status` <chr>, `Monthly Income (USD)` <dbl>,
## #   `Device Type` <chr>, `Internet Speed (Mbps)` <dbl>, …

Numeric Summary of Selected Columns:

Explanation: This summary provides a statistical overview of age and daily social media usage. The minimum, maximum, mean, and median of each column describe the range and center of user demographics and usage.

# Summarize key numeric columns: Age and Daily Social Media Time (hrs)
numeric_summary <- data %>%
  summarize(
    Min_Age = min(Age, na.rm = TRUE),
    Max_Age = max(Age, na.rm = TRUE),
    Mean_Age = mean(Age, na.rm = TRUE),
    Median_Age = median(Age, na.rm = TRUE),
    Min_TimeSpent = min(`Daily Social Media Time (hrs)`, na.rm = TRUE),
    Max_TimeSpent = max(`Daily Social Media Time (hrs)`, na.rm = TRUE),
    Mean_TimeSpent = mean(`Daily Social Media Time (hrs)`, na.rm = TRUE),
    Median_TimeSpent = median(`Daily Social Media Time (hrs)`, na.rm = TRUE)
  )

# Display the summary
numeric_summary
## # A tibble: 1 × 8
##   Min_Age Max_Age Mean_Age Median_Age Min_TimeSpent Max_TimeSpent Mean_TimeSpent
##     <dbl>   <dbl>    <dbl>      <dbl>         <dbl>         <dbl>          <dbl>
## 1      13      65     38.5         39           0.5             8           4.25
## # ℹ 1 more variable: Median_TimeSpent <dbl>

Categorical Summary: Primary Platform:

Explanation: This summary highlights the most popular social media platforms used by the dataset’s participants. It provides a foundational understanding of platform preferences.

# Summarize the distribution of users across primary platforms
categorical_summary <- data %>%
  count(`Primary Platform`) %>%
  arrange(desc(n))

# Display the unique values and their counts
categorical_summary
## # A tibble: 5 × 2
##   `Primary Platform`     n
##   <chr>              <int>
## 1 TikTok             60301
## 2 Twitter            60285
## 3 Facebook           59936
## 4 YouTube            59757
## 5 Instagram          59721

Novel Questions to Explore:

  1. What is the distribution of daily social media time across age groups?
  2. How do preferred platforms vary by gender?
  3. Is there a correlation between monthly income and daily entertainment time?
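
The first question can be approached by binning `Age` and summarizing daily social media time within each bin. The following is a minimal sketch, not part of the original analysis: the helper name `time_by_age_group` and the age-band edges are assumptions chosen for illustration (the dataset documentation may suggest different bands).

```r
library(dplyr)

# Hypothetical helper: summarize daily social media time by age band.
# The default band edges are illustrative assumptions, not taken from the dataset docs.
time_by_age_group <- function(df,
                              breaks = c(12, 17, 24, 34, 44, 54, 65),
                              labels = c("13-17", "18-24", "25-34",
                                         "35-44", "45-54", "55-65")) {
  df %>%
    # cut() uses right-closed intervals by default, e.g. (12, 17] becomes "13-17"
    mutate(Age_Group = cut(Age, breaks = breaks, labels = labels)) %>%
    group_by(Age_Group) %>%
    summarize(
      Users = n(),
      Mean_TimeSpent = mean(`Daily Social Media Time (hrs)`, na.rm = TRUE),
      Median_TimeSpent = median(`Daily Social Media Time (hrs)`, na.rm = TRUE),
      .groups = "drop"
    )
}
```

Calling `time_by_age_group(data)` on the loaded dataset would return one row per age band that actually occurs in the data; ages outside the chosen edges fall into an `NA` group.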

Aggregation Analysis:

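Explanation: This aggregation ranks platforms by the average daily social media time of their users. In this dataset the averages are nearly identical (4.25 to 4.27 hours), so the ranking alone gives little basis for prioritizing one platform over another.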

# Addressing Question: Which platforms have the highest average daily social media time?
platform_time_spent <- data %>%
  group_by(`Primary Platform`) %>%
  summarize(Average_TimeSpent = mean(`Daily Social Media Time (hrs)`, na.rm = TRUE)) %>%
  arrange(desc(Average_TimeSpent))

# Display the result
platform_time_spent
## # A tibble: 5 × 2
##   `Primary Platform` Average_TimeSpent
##   <chr>                          <dbl>
## 1 YouTube                         4.27
## 2 Facebook                        4.26
## 3 Instagram                       4.25
## 4 TikTok                          4.25
## 5 Twitter                         4.25
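
Because the platform means differ only in the second decimal place, it may help to see the spread behind each mean. The sketch below is an illustrative extension rather than part of the original notebook; the helper name `platform_time_spread` and its output column names are assumptions.

```r
library(dplyr)

# Hypothetical helper: report sample size, mean, standard deviation, and
# standard error of daily social media time for each primary platform.
platform_time_spread <- function(df) {
  df %>%
    group_by(`Primary Platform`) %>%
    summarize(
      # Count only rows where the time value is present
      N_NonMissing = sum(!is.na(`Daily Social Media Time (hrs)`)),
      Average_TimeSpent = mean(`Daily Social Media Time (hrs)`, na.rm = TRUE),
      SD_TimeSpent = sd(`Daily Social Media Time (hrs)`, na.rm = TRUE),
      .groups = "drop"
    ) %>%
    mutate(SE_TimeSpent = SD_TimeSpent / sqrt(N_NonMissing)) %>%
    arrange(desc(Average_TimeSpent))
}
```

If the gaps between platform means in `platform_time_spread(data)` are small relative to the standard deviations, the ordering of platforms is unlikely to be practically meaningful.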

Visual Summaries:

Distribution of Daily Social Media Time

Insight: This histogram shows the time users typically spend on social media daily. Peaks and trends in the plot reveal common usage patterns.

# Visualizing the distribution of daily social media time
ggplot(data, aes(x = `Daily Social Media Time (hrs)`)) +
  geom_histogram(binwidth = 0.5, fill = "blue", color = "black") +
  labs(
    title = "Distribution of Daily Social Media Time",
    x = "Daily Social Media Time (hrs)",
    y = "Frequency"
  )

Preferred Platforms by Gender

Insight: This bar plot illustrates platform preferences across genders, providing insights into gender-specific trends and preferences.

# Visualizing preferred platforms by gender
ggplot(data, aes(x = `Primary Platform`, fill = Gender)) +
  geom_bar(position = "dodge") +
  labs(
    title = "Preferred Platforms by Gender",
    x = "Primary Platform",
    y = "Count",
    fill = "Gender"
  )
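
A dodged bar plot shows raw counts, which can be hard to compare when gender groups differ in size. As a complementary sketch (not in the original notebook), the hypothetical helper `platform_share_by_gender` below computes each platform's share within each gender; the helper and its column names `Users` and `Share` are assumptions.

```r
library(dplyr)

# Hypothetical helper: share of each primary platform within each gender.
platform_share_by_gender <- function(df) {
  df %>%
    count(Gender, `Primary Platform`, name = "Users") %>%
    group_by(Gender) %>%
    # Shares sum to 1 within each gender
    mutate(Share = Users / sum(Users)) %>%
    ungroup()
}
```

Running `platform_share_by_gender(data)` gives one row per gender and platform combination, so preferences can be compared across genders on the same scale.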

Insights and Conclusions

  1. Numeric Summary: Users are aged 13 to 65 (mean 38.5, median 39) and spend between 0.5 and 8 hours a day on social media (mean 4.25).
  2. Categorical Summary: TikTok has the most users (60,301), but the five platforms are almost evenly split, with Instagram lowest at 59,721.
  3. Aggregation: Average daily usage is nearly the same on every platform (4.25 to 4.27 hours), so primary platform alone says little about time spent.
  4. Visualizations: The histogram of daily social media time and the platform-by-gender bar plot give a visual check on these summaries and raise further questions.

Next Steps

  1. Investigate how demographic factors influence preferred content types.
  2. Explore trends in entertainment time by occupation or marital status.
  3. Examine correlations between internet speed and daily online time.
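
The correlation questions raised in this notebook can be checked with base R's `cor()`. The helper below is a minimal sketch under stated assumptions: the name `column_correlation` is hypothetical, and using `Daily Social Media Time (hrs)` as the measure of "daily online time" is an assumption, since the notebook does not name a single online-time column.

```r
# Hypothetical helper: correlation between two numeric columns, selected by name.
# Rows with a missing value in either column are dropped ("complete.obs").
column_correlation <- function(df, x, y, method = "pearson") {
  cor(df[[x]], df[[y]], use = "complete.obs", method = method)
}
```

For example, `column_correlation(data, "Internet Speed (Mbps)", "Daily Social Media Time (hrs)")` addresses the last item above, and `column_correlation(data, "Monthly Income (USD)", "Daily Entertainment Time (hrs)")` addresses the income question. Passing `method = "spearman"` gives a rank-based alternative that is less sensitive to skewed values.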