STAT210

Author

Kyra Vargas

STAT 210-01: Introduction to Data Science

Final Project: Dark Side of Social Media

Abstract:

This report analyzes the “Time-Wasters on Social Media” dataset, sourced from Kaggle, to identify the true drivers of productivity loss in the digital age. By examining a synthetic population of 1,000 users, this study tests common assumptions regarding geography, gender, and platform choice. Our findings reveal that while social media usage is a consistent global phenomenon, the intent behind usage, specifically boredom, is the primary predictor of high productivity loss. These results suggest that digital distraction is less about the time spent and more about the user’s psychological state at the moment of engagement.

Introduction:

Social media has become a major part of our daily lives, but it may reduce productivity and encourage addictive behaviors. This analysis explores how user characteristics, platform choice, and usage patterns relate to productivity loss and addiction.

Methodology:

The analysis was conducted using R and the tidyverse suite of packages.

Data Source: The dataset was obtained from Kaggle and contains user metrics such as location, gender, platform type, and self-reported productivity loss.

Cleaning: Data was cleaned for consistency, including recoding categorical typos (e.g., “Barzil” to “Brazil”).

Visual Analysis: A series of six targeted visualizations, including histograms, ranked bar charts, scatterplots, and boxplots, was used to test the variables.
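The cleaning step can be sketched as follows. This is a minimal illustration on a toy data frame, not the project's actual cleaning script: the `toy` and `toy_clean` objects and their values are made up for the example, and only the “Barzil” typo comes from the report.

```r
library(dplyr)

# Toy stand-in for the Location column; the values are illustrative only.
toy <- data.frame(
  Location = c("Brazil", "Barzil", "Japan", "Brazil", "Japan")
)

# Step 1: list every distinct value with its count so misspellings stand out.
toy %>% count(Location, sort = TRUE)

# Step 2: fix the typo once, on the data itself, so every later step sees it.
toy_clean <- toy %>%
  mutate(Location = if_else(Location == "Barzil", "Brazil", Location))

toy_clean %>% count(Location, sort = TRUE)
```

Fixing the typo on the data frame itself, rather than inside one summary, would keep every later chart consistent.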

1. How bad is the productivity loss for the average person in this study?

library(tidyverse)
library(readr)

data <- read_csv("data/Social_Media.csv")
Rows: 1000 Columns: 31
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (12): Gender, Location, Profession, Demographics, Platform, Video_Categ...
dbl  (16): UserID, Age, Income, Total_Time_Spent, Number_of_Sessions, Video_...
lgl   (2): Debt, Owns_Property
time  (1): Watch_Time

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
ggplot(data, aes(x = Productivity_Loss)) +
  geom_histogram(bins = 10, fill = "skyblue", color = "white") +
  labs(title = "Distribution of Productivity Loss",
       subtitle = "How social media impact is spread across all users",
       x = "Productivity Loss Score (0-10)",
       y = "Number of Users") 

The distribution of productivity loss across the dataset is bimodal, indicating that social media does not affect all users uniformly. The data clusters into two distinct groups: a “low-impact” group peaking around a score of 2.5, and a larger “high-impact” group peaking between 6.0 and 7.0. A pronounced dip at the 5.0 mark separates these two populations. This tells us that the ‘average’ score describes few actual users; most are either winning the battle for their focus or losing it.
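One way to double-check the two-peak reading is to count users at each score directly, since a histogram's bin edges may exaggerate or hide gaps when scores are whole numbers (an assumption here). The sketch below uses made-up scores shaped like the pattern described above; `toy`, `score_counts`, and `shares` are hypothetical names.

```r
library(dplyr)

# Toy scores shaped like the pattern described above (illustrative, not the real data).
toy <- data.frame(
  Productivity_Loss = c(2, 2, 3, 3, 3, 4, 5, 6, 6, 6, 7, 7, 7, 7, 8)
)

# Count users at each score; a bimodal shape shows up as two peaks with a dip between them.
score_counts <- toy %>%
  count(Productivity_Loss, name = "users")

score_counts

# Share of users below, at, and above the midpoint of the scale.
shares <- toy %>%
  summarize(
    below_5 = mean(Productivity_Loss < 5),
    at_5    = mean(Productivity_Loss == 5),
    above_5 = mean(Productivity_Loss > 5)
  )

shares
```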

2. Where is the problem happening? Does geography influence social media impact?

country_ranking <- data %>%
  mutate(Location = recode(Location, "Barzil" = "Brazil")) %>% # fix misspelled country name
  group_by(Location) %>%
  summarize(Avg_Loss = mean(Productivity_Loss, na.rm = TRUE)) %>%
  arrange(desc(Avg_Loss)) # highest to lowest


ggplot(country_ranking, aes(x = reorder(Location, Avg_Loss), y = Avg_Loss, fill = Avg_Loss)) +
  geom_col(width = 0.6) + 
  coord_flip() + #sideways so country names are easier to read
  scale_fill_viridis_c(option = "plasma") + #color variation
  labs(title = "Ranked Productivity Loss by Country",
       x = "Country",
       y = "Average Productivity Loss (0-10)") 

This ranked bar chart identifies the specific leaders in productivity loss, with Pakistan, Japan, and Vietnam reporting the highest average impact (approximately 5.4). Brazil shows the lowest impact in this dataset (near 4.5). While these rankings provide a clear hierarchy, the most significant insight is the narrow range between the top and bottom of the list: less than 1.0 point separates all ten countries on a 10-point scale. Geography therefore appears to have little bearing on productivity loss in this dataset.
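The gap between the top and bottom of the ranking can also be computed rather than read off the chart. The sketch below uses a hypothetical four-row table whose values only echo the approximate figures quoted above; on the real data, the same `summarize()` call could be applied to `country_ranking`.

```r
library(dplyr)

# Hypothetical country averages; the values echo the approximate figures in the text.
toy_ranking <- data.frame(
  Location = c("Pakistan", "Japan", "Vietnam", "Brazil"),
  Avg_Loss = c(5.4, 5.4, 5.4, 4.5)
)

gap <- toy_ranking %>%
  summarize(
    highest = max(Avg_Loss),
    lowest  = min(Avg_Loss),
    spread  = highest - lowest  # gap between top and bottom of the ranking
  )

gap
```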

3. The Demographics: Is it a specific gender? Does total time spent on social media differ significantly between genders?

gender_stats <- data %>%
  group_by(Gender) %>%
  summarize(
    Avg_Time = mean(Total_Time_Spent),
    Avg_Addiction = mean(Addiction_Level)
  )

#Plot for time
ggplot(gender_stats, aes(x = Gender, y = Avg_Time, fill = Gender)) +
  geom_col(width = 0.5) + 
  labs(title = "Average Time Spent by Gender", 
       y = "Minutes") 

Social media engagement appears gender-neutral in this dataset, with all groups spending nearly the same amount of time on platforms daily. Because these bars are nearly equal, gender is a weak predictor of time spent, and this chart gives no indication that it would explain differences in productivity loss.
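Whether the group means differ “significantly” is a question a bar chart cannot answer by itself. One possible check is a one-way ANOVA, sketched below on simulated data. The gender labels, group sizes, and value range are assumptions made for the example, and this is a swapped-in technique, not a step from the original analysis.

```r
set.seed(210)  # arbitrary seed so the simulated example is repeatable

# Simulated stand-in: three groups drawn from the same distribution of minutes.
# The group labels are assumed for illustration.
toy <- data.frame(
  Gender = rep(c("Female", "Male", "Other"), each = 50),
  Total_Time_Spent = round(runif(150, min = 10, max = 300))
)

# Group means, the same quantity the bar chart displays.
aggregate(Total_Time_Spent ~ Gender, data = toy, FUN = mean)

# One-way ANOVA: a large p-value means no evidence that the group means differ.
fit <- aov(Total_Time_Spent ~ Gender, data = toy)
summary(fit)

# Pull out the p-value for the Gender term.
p_value <- summary(fit)[[1]][["Pr(>F)"]][1]
p_value
```

On the real data, the same formula could be passed `data` in place of `toy`.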

4. The App: Is one platform more “evil” than others? Do different social media platforms lead to different levels of productivity loss?

ggplot(data, aes(x = Platform, y = Productivity_Loss)) +
  geom_boxplot() +
  labs(title = "Productivity Loss by Platform")

The boxplot shows that productivity loss is relatively similar across all four platforms, with medians around 5 to 6. This suggests that no single platform dominates in causing productivity loss, but small differences exist.

TikTok and YouTube show slightly higher median productivity loss than Facebook and Instagram, indicating that users on video-heavy platforms may experience slightly greater productivity loss.

Instagram stands out not because it has the highest productivity loss, but because it has the widest spread of values. This means user experiences on Instagram vary more widely: some users report low productivity loss, while others report much higher values.

Checking these values against a summary statistics table:

data %>%
  group_by(Platform) %>%
  summarise(
    median_loss = median(Productivity_Loss, na.rm = TRUE),
    mean_loss = mean(Productivity_Loss, na.rm = TRUE),
    count = n()
  )
# A tibble: 4 × 4
  Platform  median_loss mean_loss count
  <chr>           <dbl>     <dbl> <int>
1 Facebook            5      5.07   221
2 Instagram           5      5.08   256
3 TikTok              6      5.14   273
4 YouTube             6      5.26   250

TikTok and YouTube have a higher median (6 versus 5) and a slightly higher mean productivity loss than Facebook and Instagram. Although the differences are small, the consistent pattern suggests that more video-based platforms may contribute to higher productivity loss.
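The earlier observation about spread could be checked in the same way, by adding the interquartile range (the height of each box in the boxplot) and the standard deviation to the summary. The sketch below runs on made-up scores for two platforms; `toy` and `spread_by_platform` are hypothetical names.

```r
library(dplyr)

# Toy scores for two platforms (illustrative values, not the real data).
toy <- data.frame(
  Platform = rep(c("Instagram", "YouTube"), each = 6),
  Productivity_Loss = c(1, 3, 5, 5, 7, 9,  4, 5, 6, 6, 7, 8)
)

spread_by_platform <- toy %>%
  group_by(Platform) %>%
  summarise(
    q1       = quantile(Productivity_Loss, 0.25, names = FALSE),
    q3       = quantile(Productivity_Loss, 0.75, names = FALSE),
    iqr_loss = IQR(Productivity_Loss),  # height of the box in the boxplot
    sd_loss  = sd(Productivity_Loss)
  )

spread_by_platform
```

A platform with a larger `iqr_loss` has a taller box, which is what “widest spread” means in the boxplot.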

5. The time: Does spending more time make it worse? How does the total time spent on social media relate to productivity loss?

ggplot(data, aes(x = Total_Time_Spent, y = Productivity_Loss)) +
  geom_jitter(alpha = 0.3, color = "blue") + 
  geom_smooth(method = "loess", color = "red") +
  labs(title = "Relationship Between Time Spent and Productivity Loss",
       x = "Total Time Spent (Minutes)",
       y = "Productivity Loss")

The red trend line is almost flat. It shows that someone who spends only 20 minutes on their phone can lose just as much productivity as someone who has been scrolling for five hours. It’s as if the “damage” to your focus happens the second you open the app, and after that, staying on longer doesn’t make the productivity loss any worse. Instead of a steady climb, the plot is just a messy cloud of dots. In short, time spent is a poor predictor of productivity loss in this dataset; the loss seems to be more about the distraction itself than the actual minutes on the clock.
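A flat trend line can be backed up with a single number. The sketch below computes Spearman's rank correlation on simulated data in which loss is unrelated to minutes. The 1 to 9 score range and the 10 to 300 minute range are assumptions made for the example, and this check is an addition, not part of the original analysis.

```r
set.seed(210)  # arbitrary seed for a repeatable simulation

# Simulated stand-in where loss is unrelated to minutes, mimicking a flat trend line.
toy <- data.frame(
  Total_Time_Spent  = round(runif(200, min = 10, max = 300)),
  Productivity_Loss = sample(1:9, 200, replace = TRUE)
)

# Spearman's rank correlation suits an ordinal score;
# values near 0 mean little monotonic association.
rho <- cor(toy$Total_Time_Spent, toy$Productivity_Loss, method = "spearman")
rho

# cor.test() adds a p-value; exact = FALSE avoids warnings caused by tied scores.
cor.test(toy$Total_Time_Spent, toy$Productivity_Loss,
         method = "spearman", exact = FALSE)
```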

6. The Truth: Does the ‘Watch Reason’ serve as a better predictor of productivity loss than ‘Total Time Spent’?

ggplot(data, aes(x = Watch_Reason, y = Productivity_Loss, fill = Watch_Reason)) +
  geom_boxplot() +
  labs(title = "Productivity Loss by Watch Reason",
       x = "Reason for Watching",
       y = "Productivity Loss Score")

It’s not the time that kills productivity, it’s the reason.

While total time spent on social media did not show a clear relationship with productivity loss, the reason for usage does appear to matter. Specifically, users who cite Boredom as their primary reason for watching experience a higher median productivity loss (around 6.0) compared to those using platforms for habit or procrastination (5.0). This suggests that the mental state of the user, rather than just the time spent, is a key factor in how social media usage impacts their daily output.

In short, in this dataset, users who turn to social media out of boredom tend to report a larger drop in productivity than those who cite habit or intentional procrastination.
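One way to put “better predictor” on a common scale is to fit a simple linear model for each variable and compare the share of variance explained (R-squared). The sketch below does this on simulated data built so that boredom raises the score and time has no effect; the effect size, noise level, and sample size are assumptions, so the numbers say nothing about the real dataset.

```r
set.seed(210)  # arbitrary seed for a repeatable simulation

reasons <- c("Boredom", "Habit", "Procrastination")
toy <- data.frame(
  Watch_Reason     = sample(reasons, 300, replace = TRUE),
  Total_Time_Spent = round(runif(300, min = 10, max = 300))
)

# Simulated loss: boredom shifts the score up by one point, time has no effect.
toy$Productivity_Loss <- 5 + (toy$Watch_Reason == "Boredom") + rnorm(300, sd = 1)

fit_reason <- lm(Productivity_Loss ~ Watch_Reason, data = toy)
fit_time   <- lm(Productivity_Loss ~ Total_Time_Spent, data = toy)

# R-squared: share of variance in productivity loss each predictor explains.
r2 <- c(
  reason = summary(fit_reason)$r.squared,
  time   = summary(fit_time)$r.squared
)
round(r2, 3)
```

Applied to the real data, the same two `lm()` calls would show whether the gap seen in the boxplot holds up as a difference in explained variance.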

Conclusion:

The analysis challenges the assumption that total “Time Spent” is the primary driver of distraction and points instead to the psychological intent behind usage. Specifically, users who engage with content out of boredom report the highest median productivity loss, whereas those who cite habit or procrastination report lower scores. This leads to the final conclusion that reclaiming productivity is less about setting strict time limits and more about managing the internal triggers and “Watch Reasons” that lead us to scroll in the first place.

References: