The purpose of this analysis is to explore the Social Media and Entertainment Dataset to gain insights into user behavior. This notebook includes numeric and categorical summaries, aggregation-based analysis, and visualizations. It aims to address questions informed by the data’s documentation and goals.
# Load the packages used in this notebook
library(httr)     # GET(), authenticate(), write_disk()
library(readr)    # read_csv()
library(dplyr)    # %>%, summarize(), count(), group_by(), arrange()
library(ggplot2)  # ggplot(), geom_bar(), labs()
# Define Kaggle dataset URL and output file path
dataset_url <- "https://www.kaggle.com/datasets/ashaychoudhary/social-media-and-entertainment-dataset"
output_file <- "social_media_entertainment.csv"
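# Read Kaggle API credentials from environment variables; never hard-code them.
# Sys.getenv() takes the NAME of a variable, so KAGGLE_USERNAME and KAGGLE_KEY
# must be set beforehand (for example in ~/.Renviron).
kaggle_username <- Sys.getenv("KAGGLE_USERNAME")
kaggle_key <- Sys.getenv("KAGGLE_KEY")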
# Construct API call
response <- GET(
  url = "https://www.kaggle.com/api/v1/datasets/download/ashaychoudhary/social-media-and-entertainment-dataset",
  authenticate(kaggle_username, kaggle_key),
  write_disk(output_file, overwrite = TRUE)
)
# Check that the request succeeded and the file was written.
# The file can exist even after a failed request, so test the HTTP status too.
if (!http_error(response) && file.exists(output_file)) {
  message("Dataset downloaded successfully!")
  # Load the dataset into R
  data <- read_csv(output_file)
  # Display the first few rows of the dataset
  head(data)
} else {
  stop("Failed to download the dataset. Check your Kaggle API setup.")
}
## # A tibble: 6 × 40
## `User ID` Age Gender Country Daily Social Media Tim…¹ Daily Entertainment …²
## <dbl> <dbl> <chr> <chr> <dbl> <dbl>
## 1 1 32 Other Germany 4.35 4.08
## 2 2 62 Other India 4.96 4.21
## 3 3 51 Female USA 6.78 1.77
## 4 4 44 Female India 5.06 9.21
## 5 5 21 Other Germany 2.57 1.3
## 6 6 21 Male Canada 4.69 1.7
## # ℹ abbreviated names: ¹`Daily Social Media Time (hrs)`,
## # ²`Daily Entertainment Time (hrs)`
## # ℹ 34 more variables: `Social Media Platforms Used` <dbl>,
## # `Primary Platform` <chr>, `Daily Messaging Time (hrs)` <dbl>,
## # `Daily Video Content Time (hrs)` <dbl>, `Daily Gaming Time (hrs)` <dbl>,
## # Occupation <chr>, `Marital Status` <chr>, `Monthly Income (USD)` <dbl>,
## # `Device Type` <chr>, `Internet Speed (Mbps)` <dbl>, …
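The download step depends on two environment variables holding the Kaggle credentials. The following is a minimal sketch of a pre-flight check, assuming the conventional variable names `KAGGLE_USERNAME` and `KAGGLE_KEY`. The helper `missing_env_vars()` is hypothetical and not part of any package.

```r
# Hypothetical helper: return the names of the requested environment
# variables that are unset or empty.
missing_env_vars <- function(vars = c("KAGGLE_USERNAME", "KAGGLE_KEY")) {
  values <- Sys.getenv(vars, unset = "")
  vars[values == ""]
}

# Report anything that still needs to be configured before downloading
missing <- missing_env_vars()
if (length(missing) > 0) {
  message(
    "Set these environment variables before downloading: ",
    paste(missing, collapse = ", ")
  )
}
```

Running this before the API call makes a missing credential visible immediately, rather than surfacing later as a failed download.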
Explanation: This summary gives a statistical overview of age and daily social media time. The minimum, maximum, mean, and median show the age range of the users and how many hours a typical user spends on social media per day.
# Summarize key numeric columns: Age and Daily Social Media Time (hrs)
numeric_summary <- data %>%
  summarize(
    Min_Age = min(Age, na.rm = TRUE),
    Max_Age = max(Age, na.rm = TRUE),
    Mean_Age = mean(Age, na.rm = TRUE),
    Median_Age = median(Age, na.rm = TRUE),
    Min_TimeSpent = min(`Daily Social Media Time (hrs)`, na.rm = TRUE),
    Max_TimeSpent = max(`Daily Social Media Time (hrs)`, na.rm = TRUE),
    Mean_TimeSpent = mean(`Daily Social Media Time (hrs)`, na.rm = TRUE),
    Median_TimeSpent = median(`Daily Social Media Time (hrs)`, na.rm = TRUE)
  )
# Display the summary
numeric_summary
## # A tibble: 1 × 8
## Min_Age Max_Age Mean_Age Median_Age Min_TimeSpent Max_TimeSpent Mean_TimeSpent
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 13 65 38.5 39 0.5 8 4.25
## # ℹ 1 more variable: Median_TimeSpent <dbl>
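The same four statistics can be computed for several columns at once with `across()`. The sketch below is one possible alternative, not the notebook's own method, and it runs on a small made-up table (`toy`) rather than the real dataset. `glimpse()` prints every column, which avoids the truncated last column seen in the output above.

```r
library(dplyr)

# Toy data standing in for the real dataset (values are made up for illustration)
toy <- tibble(
  Age = c(21, 34, 45, NA, 60),
  `Daily Social Media Time (hrs)` = c(2.5, 4.0, NA, 6.5, 3.0)
)

# Apply the same set of summary functions to each selected column.
# Result columns are named "<column>_<function>", e.g. Age_mean.
toy_summary <- toy %>%
  summarize(
    across(
      c(Age, `Daily Social Media Time (hrs)`),
      list(
        min = ~ min(.x, na.rm = TRUE),
        max = ~ max(.x, na.rm = TRUE),
        mean = ~ mean(.x, na.rm = TRUE),
        median = ~ median(.x, na.rm = TRUE)
      )
    )
  )

# Print one column per line so nothing is abbreviated
glimpse(toy_summary)
```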
Explanation: This summary counts users by primary platform, showing how the dataset’s participants are distributed across platforms. The counts are close to even, at roughly 60,000 users per platform, so no single platform dominates.
# Summarize the distribution of users across primary platforms
categorical_summary <- data %>%
  count(`Primary Platform`) %>%
  arrange(desc(n))
# Display the unique values and their counts
categorical_summary
## # A tibble: 5 × 2
## `Primary Platform` n
## <chr> <int>
## 1 TikTok 60301
## 2 Twitter 60285
## 3 Facebook 59936
## 4 YouTube 59757
## 5 Instagram 59721
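Raw counts are easier to compare as shares of the total. The sketch below adds a proportion column to the same kind of count table. It uses a small made-up table (`toy`), and the column name `share` is an illustrative choice.

```r
library(dplyr)

# Toy data standing in for the real dataset (values are made up for illustration)
toy <- tibble(
  `Primary Platform` = c("TikTok", "TikTok", "Twitter",
                         "YouTube", "YouTube", "YouTube")
)

# Count users per platform (largest first) and convert counts to proportions
platform_share <- toy %>%
  count(`Primary Platform`, sort = TRUE) %>%
  mutate(share = n / sum(n))

platform_share
```

Applied to the real data, shares near 0.20 for each of the five platforms would confirm the nearly even split suggested by the counts.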
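Explanation: This aggregation ranks platforms by the average daily social media time of their users. The averages are nearly identical, ranging from 4.25 to 4.27 hours, so the ranking should be read with caution when prioritizing platforms for engagement strategies.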
# Addressing Question: Which platforms have the highest average daily social media time?
platform_time_spent <- data %>%
  group_by(`Primary Platform`) %>%
  summarize(Average_TimeSpent = mean(`Daily Social Media Time (hrs)`, na.rm = TRUE)) %>%
  arrange(desc(Average_TimeSpent))
# Display the result
platform_time_spent
## # A tibble: 5 × 2
## `Primary Platform` Average_TimeSpent
## <chr> <dbl>
## 1 YouTube 4.27
## 2 Facebook 4.26
## 3 Instagram 4.25
## 4 TikTok 4.25
## 5 Twitter 4.25
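When group means are this close, it helps to see the spread and group size next to each average. The sketch below extends the same aggregation with a standard deviation and a count of non-missing values. It uses a small made-up table (`toy`), and the column names `Users` and `SD_TimeSpent` are illustrative.

```r
library(dplyr)

# Toy data standing in for the real dataset (values are made up for illustration)
toy <- tibble(
  `Primary Platform` = c("YouTube", "YouTube", "TikTok", "TikTok", "TikTok"),
  `Daily Social Media Time (hrs)` = c(4.0, 5.0, 3.0, NA, 5.0)
)

# Per-platform mean, standard deviation, and number of non-missing values
platform_spread <- toy %>%
  group_by(`Primary Platform`) %>%
  summarize(
    Users = sum(!is.na(`Daily Social Media Time (hrs)`)),
    Average_TimeSpent = mean(`Daily Social Media Time (hrs)`, na.rm = TRUE),
    SD_TimeSpent = sd(`Daily Social Media Time (hrs)`, na.rm = TRUE)
  ) %>%
  arrange(desc(Average_TimeSpent))

platform_spread
```

If the standard deviation within each platform is much larger than the gap between platform means, the ordering of the means carries little practical weight.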
Insight: This grouped bar plot shows the number of users on each primary platform, split by gender, so gender-specific platform preferences can be compared side by side.
# Visualizing preferred platforms by gender
ggplot(data, aes(x = `Primary Platform`, fill = Gender)) +
  geom_bar(position = "dodge") +
  labs(
    title = "Preferred Platforms by Gender",
    x = "Primary Platform",
    y = "Count",
    fill = "Gender"
  )
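If the platforms differ in size, a proportional view can make the gender mix easier to compare than raw counts. The sketch below is a possible variant of the plot above using `position = "fill"`. It runs on a small made-up table (`toy`) and is not the notebook's own figure.

```r
library(ggplot2)

# Toy data standing in for the real dataset (values are made up for illustration)
toy <- data.frame(
  `Primary Platform` = c("TikTok", "TikTok", "TikTok", "YouTube", "YouTube"),
  Gender = c("Female", "Male", "Other", "Female", "Male"),
  check.names = FALSE  # keep the space in the column name
)

# Stack each platform's bars to a height of 1 so gender shares are comparable
p <- ggplot(toy, aes(x = `Primary Platform`, fill = Gender)) +
  geom_bar(position = "fill") +
  scale_y_continuous(labels = scales::label_percent()) +
  labs(
    title = "Gender Share by Primary Platform",
    x = "Primary Platform",
    y = "Share of users",
    fill = "Gender"
  )

print(p)
```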