project 2 data 110

#Find an image that relates to your topic and place it at the top of your document.
library(png)

#install.packages("png")
img <- readPNG("dataset-cover.png") # read the PNG cover image into an array

# View image 
plot(as.raster(img))

#Cite the Source for the image: Kumaresan, A. (2025, September 14). Screen Time Vs Mental Wellness Survey - 2025. Kaggle. <https://www.kaggle.com/datasets/adharshinikumar/screentime-vs-mentalwellness-survey-2025>

#A total of 400 individuals completed a survey regarding their daily technology use. Ages are described as 16 to 40, although the summary output later in this document shows a maximum age of 60. The resulting dataset enables analysis of the relationship between screen usage and mental health. The dataset includes 15 variables. One is an identifier (user_id). Seven are categorical or self-reported rating variables: gender (Female, Male, Non-binary/Other), occupation (Employed, Student, Self-employed, Unemployed, Retired), work mode (Remote, In-person, Hybrid), sleep quality (self-reported rating from 1, very poor, to 5, excellent), stress level (self-reported from 0, no stress, to 10, extremely stressed), self-rated productivity score (0–100), and a composite index of overall wellness (0–100). The four rating and index variables are stored as numeric columns. Seven variables are quantitative: age, total average daily screen usage (hours), daily screen time spent on work or study tasks (hours), daily leisure screen time (hours), average sleep duration per night (hours), total minutes spent exercising per week, and hours spent socializing offline per week. The primary research question is whether total average daily screen usage and average sleep duration per night predict the number of hours spent socializing offline per week.

#CSV: Please see attached. Hyperlink the source for where you get the dataset: <https://www.kaggle.com/datasets/adharshinikumar/screentime-vs-mentalwellness-survey-2025> The number of variables: 15

#How many are categorical: 7 categorical variables

#Identify the categorical variables: gender (Female, Male, Non-binary/Other), occupation (participant's role: Employed, Student, Self-employed, Unemployed, Retired), work mode (work/study mode: Remote, In-person, Hybrid), sleep quality (self-reported sleep quality rating, 1 = very poor, 5 = excellent), stress level (self-reported stress level, 0 = no stress, 10 = extremely stressed), self-rated productivity score (0-100), composite index reflecting overall wellness (0-100)

#How many are quantitative: 7 (4 of which are listed below as planned variables)

#Specify which variables you plan to use: age, total average daily screen usage (hours), daily screen time spent on work/study tasks (hours), and average sleep duration per night (hours), with hours spent socializing offline per week as the response

#Research Question: Are a person’s total average daily screen usage (hours) and average sleep duration per night (hours) predictive of his or her social hours per week?

#Load the necessary libraries

library(dplyr) #It provides a grammar of data manipulation

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.2.0     ✔ readr     2.2.0
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.2     ✔ tibble    3.3.1
✔ lubridate 1.9.5     ✔ tidyr     1.3.2
✔ purrr     1.2.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(ggplot2)

library(ggfortify)

library(plotly)


Attaching package: 'plotly'

The following object is masked from 'package:ggplot2':

    last_plot

The following object is masked from 'package:stats':

    filter

The following object is masked from 'package:graphics':

    layout

library(GGally)

library(DataExplorer)#install.packages("DataExplorer")

library(ggrepel)

#Load your dataset using the readr::read_csv() command

screen_time<-readr::read_csv("ScreenTime vs MentalWellness (3).csv")

Rows: 400 Columns: 15
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (4): user_id, gender, occupation, work_mode
dbl (11): age, screen_time_hours, work_screen_hours, leisure_screen_hours, s...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

#Count missing values

sum(is.na(screen_time))

[1] 0
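
#A total of 0 confirms there are no missing cells here. When the total is not 0, a per-column count shows where the gaps are. The sketch below uses a small made-up data frame (toy) as a stand-in for screen_time; its values are hypothetical.

```r
# Toy data frame standing in for screen_time (values are made up)
toy <- data.frame(
  age = c(25, NA, 31, 40),
  sleep_hours = c(7.1, 6.5, NA, NA),
  gender = c("Female", "Male", "Male", NA)
)

# Total number of missing cells, as in sum(is.na(screen_time))
total_missing <- sum(is.na(toy))

# Missing cells per column, useful when the total is not zero
missing_by_column <- colSums(is.na(toy))

print(total_missing)
print(missing_by_column)
```

#The same two calls can be applied to screen_time directly.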

#Show code in the output

knitr::opts_chunk$set(echo = TRUE)

#Check out the first few lines

head(screen_time)

# A tibble: 6 × 15
  user_id   age gender  occupation work_mode screen_time_hours work_screen_hours
  <chr>   <dbl> <chr>   <chr>      <chr>                 <dbl>             <dbl>
1 U0001      33 Female  Employed   Remote                10.8               5.44
2 U0002      28 Female  Employed   In-person              7.4               0.37
3 U0003      35 Female  Employed   Hybrid                 9.78              1.09
4 U0004      42 Male    Employed   Hybrid                11.1               0.56
5 U0005      28 Male    Student    Remote                13.2               4.09
6 U0006      28 Non-bi… Self-empl… Hybrid                 9.83              0.53
# ℹ 8 more variables: leisure_screen_hours <dbl>, sleep_hours <dbl>,
#   sleep_quality_1_5 <dbl>, stress_level_0_10 <dbl>, productivity_0_100 <dbl>,
#   exercise_minutes_per_week <dbl>, social_hours_per_week <dbl>,
#   mental_wellness_index_0_100 <dbl>

#Explore both quantitative and categorical variables with simple plots to determine what you want to focus on for your final visualization.

screen_time %>% arrange(desc(screen_time_hours)) %>% View() #order by descending total average daily screen usage
screen_time %>% arrange(desc(social_hours_per_week)) %>% View() #order by descending hours spent socializing offline per week

#Create a scatterplot with a linear regression.

plot1_z<-ggplot(screen_time,aes(screen_time_hours, social_hours_per_week))+ 
labs(
#Main title 
title = "Social Hours per Week versus Screen Time in Hours in 400 Subjects",
subtitle = "What is the relationship between screen time and social activity?",
#Add a caption
caption = "Source: ScreenTime vs MentalWellness", 
#Give x-axis a name
x = "Screen time in hours", 
#Give y-axis a name
y = "Social hours per week") + 
#Fix the axes to start at 0. 
coord_cartesian(xlim=c(0,25), ylim=c(0,33))+ 
#Add the linear regression line without the confidence interval band (se=FALSE)
  geom_smooth(method='lm',formula=y~x,se=FALSE, color="black")+ 
#Add the points, jittered so that overlapping points separate, and set their color, size, and transparency
geom_jitter(color="blue", size = 1.5, alpha = 0.4, width = 0.4, height = 0.3, stroke=0.5)+
#Change the default theme
theme_light(base_size = 12)
plot1_z

#Any tips on using geom_point?
?geom_point

names(screen_time)

 [1] "user_id"                     "age"                        
 [3] "gender"                      "occupation"                 
 [5] "work_mode"                   "screen_time_hours"          
 [7] "work_screen_hours"           "leisure_screen_hours"       
 [9] "sleep_hours"                 "sleep_quality_1_5"          
[11] "stress_level_0_10"           "productivity_0_100"         
[13] "exercise_minutes_per_week"   "social_hours_per_week"      
[15] "mental_wellness_index_0_100"

library(ggplot2)
library(plotly)

# Create ggplot
z <- ggplot(screen_time, aes(
  x = sleep_hours,
  y = social_hours_per_week,
  size = mental_wellness_index_0_100,
  color = mental_wellness_index_0_100,
  text = paste("gender:", gender)
)) +
  geom_point(alpha = 0.5) +
  scale_color_gradient(low = "blue", high = "red") +
  coord_cartesian(xlim = c(0, 30), ylim = c(0, 30)) +
  labs(
    title = "Social Hours per week versus Sleep Hours",
    caption = "Source: ScreenTime vs MentalWellness",
    subtitle = "Exploring relationship between sleep and social activity",
    x = "Sleep Hours",
    y = "Social Hours per week",
    size = "Mental Wellness Index",
    color = "Mental Wellness Index"
  ) +
  theme_light(base_size = 12)

# Convert to interactive plot
ggplotly(z, tooltip = "text")

# Correlation (separate line)
cor(screen_time$screen_time_hours,
    screen_time$social_hours_per_week,
    use = "complete.obs")

[1] -0.1981244

#Store the model under its own name so the scatterplot object plot1_z is not overwritten
model_1 <- lm(social_hours_per_week ~ screen_time_hours, data = screen_time) 
#lm(y ~ x) 
summary(model_1)


Call:
lm(formula = social_hours_per_week ~ screen_time_hours, data = screen_time)

Residuals:
     Min       1Q   Median       3Q      Max 
-10.0077  -3.3183  -0.0243   3.0159  16.5632 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)       11.42908    0.90652  12.608  < 2e-16 ***
screen_time_hours -0.39048    0.09683  -4.033 6.62e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.818 on 398 degrees of freedom
Multiple R-squared:  0.03925,   Adjusted R-squared:  0.03684 
F-statistic: 16.26 on 1 and 398 DF,  p-value: 6.616e-05

#The model has the equation: social_hours_per_week = 11.42908 - 0.39048 (screen_time_hours). The slope may be interpreted as follows: for each additional hour of total average daily screen usage, the model predicts 0.39048 fewer hours spent socializing offline per week. The Pr(>|t|) p-value for screen_time_hours (0.0000662) has 3 asterisks, which indicates that the slope is significantly different from zero (p < 0.001). The asterisks reflect how small the p-value is, not how large the effect is. The relationship between total average daily screen usage and hours spent socializing offline per week is weak and negative (r = -0.198). The multiple R-squared is 0.03925, so about 3.9% of the variation in hours spent socializing offline per week is explained by the model (3.684% using the adjusted R-squared). In other words, roughly 96% of the variation in the data is not explained by this model.
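
#As a worked example of the equation, the sketch below plugs in 5 and 10 hours of daily screen time and checks that, with a single predictor, R-squared is the square of the correlation. The coefficients and the correlation are copied from the output above; predict_social is a hypothetical helper name, not part of any package.

```r
# Coefficients copied from the summary() output above
intercept <- 11.42908
slope <- -0.39048

# Hypothetical helper: predicted social hours per week for a given daily screen time
predict_social <- function(screen_hours) {
  intercept + slope * screen_hours
}

pred_5 <- predict_social(5)
pred_10 <- predict_social(10)
diff_per_5h <- pred_10 - pred_5   # five times the slope

# With a single predictor, R-squared equals the squared correlation
r <- -0.1981244
r_squared <- r^2

print(c(pred_5 = pred_5, pred_10 = pred_10, difference = diff_per_5h))
print(round(r_squared, 5))
```

#With the fitted model object available, predict() with a newdata data frame should give the same predictions up to rounding of the printed coefficients.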

#name each column

names(screen_time)

 [1] "user_id"                     "age"                        
 [3] "gender"                      "occupation"                 
 [5] "work_mode"                   "screen_time_hours"          
 [7] "work_screen_hours"           "leisure_screen_hours"       
 [9] "sleep_hours"                 "sleep_quality_1_5"          
[11] "stress_level_0_10"           "productivity_0_100"         
[13] "exercise_minutes_per_week"   "social_hours_per_week"      
[15] "mental_wellness_index_0_100"

#Check out the pairwise comparisons with density curves and correlation output
## Is there an easier way to compare multiple variables using a scatterplot matrix?
screen_time_2 <- screen_time %>%
  rename(screen = screen_time_hours, work = work_screen_hours,
         leisure = leisure_screen_hours, exercise = exercise_minutes_per_week,
         social = social_hours_per_week, wellness = mental_wellness_index_0_100)
ggpairs(screen_time_2, columns = 6:14,
        upper = list(continuous = wrap("cor", size = 4)))

names(screen_time_2)

 [1] "user_id"            "age"                "gender"            
 [4] "occupation"         "work_mode"          "screen"            
 [7] "work"               "leisure"            "sleep_hours"       
[10] "sleep_quality_1_5"  "stress_level_0_10"  "productivity_0_100"
[13] "exercise"           "social"             "wellness"

#Backward elimination 
access_2<-lm(social_hours_per_week~age+gender+occupation+work_mode+screen_time_hours+work_screen_hours+leisure_screen_hours+sleep_hours+sleep_quality_1_5+stress_level_0_10+productivity_0_100+exercise_minutes_per_week,data=screen_time) 
summary(access_2)


Call:
lm(formula = social_hours_per_week ~ age + gender + occupation + 
    work_mode + screen_time_hours + work_screen_hours + leisure_screen_hours + 
    sleep_hours + sleep_quality_1_5 + stress_level_0_10 + productivity_0_100 + 
    exercise_minutes_per_week, data = screen_time)

Residuals:
     Min       1Q   Median       3Q      Max 
-10.0855  -3.4247  -0.1379   3.1562  15.4188 

Coefficients: (1 not defined because of singularities)
                           Estimate Std. Error t value Pr(>|t|)   
(Intercept)               12.805657   4.917319   2.604  0.00957 **
age                        0.030586   0.033823   0.904  0.36640   
genderMale                -0.198920   0.503580  -0.395  0.69305   
genderNon-binary/Other     0.644282   1.821087   0.354  0.72369   
occupationRetired         -1.684304   1.418851  -1.187  0.23593   
occupationSelf-employed    0.181829   0.814259   0.223  0.82342   
occupationStudent         -0.412903   0.619960  -0.666  0.50580   
occupationUnemployed       0.250534   1.047717   0.239  0.81114   
work_modeIn-person        -0.133975   0.637471  -0.210  0.83365   
work_modeRemote           -1.156856   0.881610  -1.312  0.19024   
screen_time_hours         -0.562841   0.173125  -3.251  0.00125 **
work_screen_hours          0.463483   0.236177   1.962  0.05044 . 
leisure_screen_hours             NA         NA      NA       NA   
sleep_hours               -0.212983   0.379307  -0.562  0.57478   
sleep_quality_1_5         -0.144302   0.577548  -0.250  0.80284   
stress_level_0_10          0.059929   0.255474   0.235  0.81466   
productivity_0_100         0.003491   0.040643   0.086  0.93159   
exercise_minutes_per_week -0.000492   0.003621  -0.136  0.89200   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.863 on 383 degrees of freedom
Multiple R-squared:  0.05837,   Adjusted R-squared:  0.01903 
F-statistic: 1.484 on 16 and 383 DF,  p-value: 0.1022
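
#The NA row for leisure_screen_hours and the note "1 not defined because of singularities" are what lm() reports when one predictor is an exact linear combination of others. A likely cause here, stated as an assumption, is that total screen time equals work plus leisure screen time. The sketch below reproduces the behavior on simulated data; all variable names and values in it are made up.

```r
set.seed(110)  # arbitrary seed for the simulated data
n <- 200
work <- runif(n, 0, 6)
leisure <- runif(n, 1, 10)
total <- work + leisure           # total is an exact sum of the two parts
social <- 12 - 0.4 * total + rnorm(n, sd = 4)
sim <- data.frame(social, total, work, leisure)

# All three screen variables together are redundant
fit <- lm(social ~ total + work + leisure, data = sim)
print(coef(fit))                  # the leisure coefficient is reported as NA

# Dropping one of the three redundant columns gives a fully estimable model
fit_ok <- lm(social ~ total + work, data = sim)
print(coef(fit_ok))
```

#Which column to drop is a modeling choice; the fitted values are the same either way.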

#plot access_2 
autoplot(access_2, 1:1, nrow=1, ncol=1)

Warning: `fortify(<lm>)` was deprecated in ggplot2 4.0.0.
ℹ Please use `broom::augment(<lm>)` instead.
ℹ The deprecated feature was likely used in the ggfortify package.
  Please report the issue at <https://github.com/sinhrks/ggfortify/issues>.

Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
ℹ Please use tidy evaluation idioms with `aes()`.
ℹ See also `vignette("ggplot2-in-packages")` for more information.
ℹ The deprecated feature was likely used in the ggfortify package.
  Please report the issue at <https://github.com/sinhrks/ggfortify/issues>.

Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.
ℹ The deprecated feature was likely used in the ggfortify package.
  Please report the issue at <https://github.com/sinhrks/ggfortify/issues>.

#If we are trying to predict hours spent socializing offline per week, we can check whether each predictor contributes to the model. The adjusted R-squared for this full model is 0.01903 (1.903%), and the overall F-test is not significant (p-value 0.1022). The terms with large p-values are productivity_0_100, exercise_minutes_per_week, stress_level_0_10, sleep_quality_1_5, sleep_hours, work_modeRemote, work_modeIn-person, occupationUnemployed, occupationStudent, occupationSelf-employed, occupationRetired, genderNon-binary/Other, genderMale, and age. leisure_screen_hours has no estimate at all (NA) because of a singularity, so it is dropped as well. Only screen_time_hours (p = 0.00125) and work_screen_hours (p = 0.05044) are kept, so drop the others and re-run the model.
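
#Backward elimination is usually done one term at a time, refitting after each removal. The sketch below shows an automated variant using step() from base R, which drops terms by AIC rather than by p-value, so it is a related technique and not the same rule used in this project. The data are simulated and the variable names are made up.

```r
set.seed(110)  # arbitrary seed for the simulated data
n <- 200
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
sim$y <- 5 + 2 * sim$x1 + rnorm(n)   # only x1 truly affects y

# Start from the full model
full <- lm(y ~ x1 + x2 + x3, data = sim)

# step() drops one term at a time, refitting after each removal (AIC-based)
reduced <- step(full, direction = "backward", trace = 0)

print(formula(reduced))
print(summary(reduced)$adj.r.squared)
```

#Setting trace = 1 prints each elimination step, which makes it easier to compare with a manual p-value approach.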

access_3 <- lm(social_hours_per_week ~ screen_time_hours + work_screen_hours,
               data = screen_time); summary(access_3)


Call:
lm(formula = social_hours_per_week ~ screen_time_hours + work_screen_hours, 
    data = screen_time)

Residuals:
     Min       1Q   Median       3Q      Max 
-10.0572  -3.3176  -0.0854   3.1150  16.0500 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)        11.6883     0.9306  12.560  < 2e-16 ***
screen_time_hours  -0.4623     0.1133  -4.081 5.43e-05 ***
work_screen_hours   0.1782     0.1461   1.219    0.223    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.815 on 397 degrees of freedom
Multiple R-squared:  0.04284,   Adjusted R-squared:  0.03802 
F-statistic: 8.884 on 2 and 397 DF,  p-value: 0.0001681

#plot access_3 
autoplot(access_3, 1:1, nrow=1, ncol=1)

#1. Look at the p-value for each variable.
#2. Check out the residual plots.
#3. Look at the output for the Adjusted R-Squared value at the bottom of the output.
#Try the last model, but drop the flagged observations.
#The residual plots show that observations 58, 111 and 117 affect the fit and have high scale-location values. Note that the code below removes rows 53, 177 and 111, so the row indices should be checked against the labels on the plot.
options(scipen = 0); access_4 <- screen_time[-c(53, 177, 111), ]

summary(access_4)

   user_id               age           gender           occupation       
 Length:397         Min.   :16.00   Length:397         Length:397        
 Class :character   1st Qu.:24.00   Class :character   Class :character  
 Mode  :character   Median :30.00   Mode  :character   Mode  :character  
                    Mean   :29.83                                        
                    3rd Qu.:35.00                                        
                    Max.   :60.00                                        
  work_mode         screen_time_hours work_screen_hours leisure_screen_hours
 Length:397         Min.   : 1.000    Min.   : 0.11     Min.   : 0.890      
 Class :character   1st Qu.: 7.380    1st Qu.: 0.70     1st Qu.: 5.460      
 Mode  :character   Median : 9.110    Median : 1.45     Median : 6.700      
                    Mean   : 9.029    Mean   : 2.18     Mean   : 6.848      
                    3rd Qu.:10.540    3rd Qu.: 3.01     3rd Qu.: 8.440      
                    Max.   :19.170    Max.   :12.04     Max.   :13.350      
  sleep_hours    sleep_quality_1_5 stress_level_0_10 productivity_0_100
 Min.   :4.640   Min.   :1.000     Min.   : 0.000    Min.   : 20.60    
 1st Qu.:6.400   1st Qu.:1.000     1st Qu.: 6.900    1st Qu.: 43.60    
 Median :7.030   Median :1.000     Median : 8.800    Median : 51.80    
 Mean   :7.014   Mean   :1.398     Mean   : 8.152    Mean   : 54.29    
 3rd Qu.:7.640   3rd Qu.:2.000     3rd Qu.:10.000    3rd Qu.: 63.00    
 Max.   :9.740   Max.   :4.000     Max.   :10.000    Max.   :100.00    
 exercise_minutes_per_week social_hours_per_week mental_wellness_index_0_100
 Min.   :  0.0             Min.   : 0.0          Min.   : 0.00              
 1st Qu.: 58.0             1st Qu.: 4.5          1st Qu.: 3.80              
 Median :103.0             Median : 7.7          Median :14.80              
 Mean   :110.1             Mean   : 7.8          Mean   :20.34              
 3rd Qu.:157.0             3rd Qu.:10.9          3rd Qu.:30.60              
 Max.   :372.0             Max.   :19.4          Max.   :97.00

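#Refit the model on the reduced data (access_4) and plot it
access_4_model <- lm(social_hours_per_week ~ screen_time_hours + work_screen_hours, data = access_4)
autoplot(access_4_model, 1:1, nrow=1, ncol=1)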
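
#Reading row labels off a residual plot is error-prone, so it can help to pull the flagged rows out programmatically and then refit. The sketch below is one way to do that with standardized residuals, on simulated data with three planted outliers; the cutoff of three rows and all names in it are assumptions for illustration.

```r
set.seed(110)  # arbitrary seed for the simulated data
n <- 100
sim <- data.frame(x = rnorm(n))
sim$y <- 1 + 2 * sim$x + rnorm(n)
planted <- c(10, 20, 30)
sim$y[planted] <- sim$y[planted] + 15   # plant three outliers

fit_all <- lm(y ~ x, data = sim)

# Row positions of the three largest absolute standardized residuals
flagged <- order(abs(rstandard(fit_all)), decreasing = TRUE)[1:3]

# Refit on the reduced data so the comparison uses a new model object
sim_trimmed <- sim[-flagged, ]
fit_trimmed <- lm(y ~ x, data = sim_trimmed)

print(sort(flagged))
print(c(all = summary(fit_all)$adj.r.squared,
        trimmed = summary(fit_trimmed)$adj.r.squared))
```

#Removing observations changes the question being answered, so the flagged rows are worth inspecting before they are dropped.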

#The two-predictor model (access_3) accounts for 3.802% of the variation in the observations (adjusted R-squared 0.03802), indicating that 96.198% of the variance remains unexplained. In that model only total average daily screen usage (hours) is significant; daily work screen time is not (p = 0.223), and in the full model average sleep duration per night was not a significant predictor either (p = 0.575). Further analysis should consider adding another variable with a p-value below the 0.05 significance level. A model with both that variable and total average daily screen usage (hours) may explain more of the variation in hours spent socializing offline per week. I encountered challenges in labeling my data points due to the large sample size of 400 participants, so I decided not to use geom_text and used geom_jitter to reduce overlap in the points.