Social Media Impact on Students
Loading the libraries
library(dplyr)
library(ggplot2)
Introduction:- This project analyzes a teen mental health dataset covering students' stress level, daily social media use, sleep hours, screen time before sleep, depression, anxiety, addiction, and physical activity.
Teen_health<-read.csv("C:/R studio/Project/ca2/Teen_Mental_Health_Dataset.csv")
View(Teen_health)
————————————————————————————————————————————————————————————————————————–———————
Level 1: Understanding the Data (Basic Exploration)
————————————————————————————————————————————————————————————————————————–———————
Question 1.1: What is the structure of the dataset (number of rows, columns, and data types)? Also list the column names and show the first six and last six rows.
# 1. Structure of the dataset
str(Teen_health)
## 'data.frame': 1200 obs. of 13 variables:
## $ age : int 14 19 17 15 15 19 18 16 19 15 ...
## $ gender : chr "male" "female" "female" "male" ...
## $ daily_social_media_hours: num 7.9 1.9 1.3 7.4 4.7 7.4 2.5 4 3.3 1.9 ...
## $ platform_usage : chr "Instagram" "TikTok" "Instagram" "TikTok" ...
## $ sleep_hours : num 7.4 8 7.6 6.9 4.9 4.4 6.4 4.2 5 4.9 ...
## $ screen_time_before_sleep: num 2.9 2.9 0.5 1.6 3 2.4 2.4 0.5 2.1 1.5 ...
## $ academic_performance : num 3.01 3.22 3.92 3.48 2.37 2.63 2.63 2.4 2.04 3.77 ...
## $ physical_activity : num 1.5 0.8 0 0.8 1.4 0.6 0.7 1.3 0.9 1.1 ...
## $ social_interaction_level: chr "low" "high" "high" "medium" ...
## $ stress_level : int 2 8 2 1 3 3 2 6 1 1 ...
## $ anxiety_level : int 2 1 4 7 5 5 2 10 10 1 ...
## $ addiction_level : int 1 10 2 9 2 7 5 5 9 4 ...
## $ depression_label : int 0 0 0 0 0 0 0 0 0 0 ...
#column names
names(Teen_health)
## [1] "age" "gender"
## [3] "daily_social_media_hours" "platform_usage"
## [5] "sleep_hours" "screen_time_before_sleep"
## [7] "academic_performance" "physical_activity"
## [9] "social_interaction_level" "stress_level"
## [11] "anxiety_level" "addiction_level"
## [13] "depression_label"
# First and last rows
head(Teen_health)
## age gender daily_social_media_hours platform_usage sleep_hours
## 1 14 male 7.9 Instagram 7.4
## 2 19 female 1.9 TikTok 8.0
## 3 17 female 1.3 Instagram 7.6
## 4 15 male 7.4 TikTok 6.9
## 5 15 female 4.7 Both 4.9
## 6 19 female 7.4 Both 4.4
## screen_time_before_sleep academic_performance physical_activity
## 1 2.9 3.01 1.5
## 2 2.9 3.22 0.8
## 3 0.5 3.92 0.0
## 4 1.6 3.48 0.8
## 5 3.0 2.37 1.4
## 6 2.4 2.63 0.6
## social_interaction_level stress_level anxiety_level addiction_level
## 1 low 2 2 1
## 2 high 8 1 10
## 3 high 2 4 2
## 4 medium 1 7 9
## 5 medium 3 5 2
## 6 high 3 5 7
## depression_label
## 1 0
## 2 0
## 3 0
## 4 0
## 5 0
## 6 0
tail(Teen_health)
## age gender daily_social_media_hours platform_usage sleep_hours
## 1195 17 male 2.0 Both 4.5
## 1196 18 female 6.8 Instagram 6.6
## 1197 16 male 2.3 Both 8.0
## 1198 14 female 1.7 Both 8.7
## 1199 15 male 3.9 Both 8.5
## 1200 16 female 4.7 TikTok 6.5
## screen_time_before_sleep academic_performance physical_activity
## 1195 1.7 2.65 0.0
## 1196 2.0 2.76 1.0
## 1197 1.9 2.12 0.4
## 1198 0.7 3.98 0.8
## 1199 2.1 3.19 0.6
## 1200 1.0 2.91 0.9
## social_interaction_level stress_level anxiety_level addiction_level
## 1195 medium 9 4 2
## 1196 low 3 4 4
## 1197 high 7 4 4
## 1198 high 1 1 1
## 1199 high 7 9 9
## 1200 medium 5 7 3
## depression_label
## 1195 0
## 1196 0
## 1197 0
## 1198 0
## 1199 0
## 1200 0
# 2. Missing values per column
colSums(is.na(Teen_health))
## age gender daily_social_media_hours
## 0 0 0
## platform_usage sleep_hours screen_time_before_sleep
## 0 0 0
## academic_performance physical_activity social_interaction_level
## 0 0 0
## stress_level anxiety_level addiction_level
## 0 0 0
## depression_label
## 0
Interpretation: The dataset has 1200 records and 13 columns. It records each student's age and gender, their social media and screen habits, sleep and physical activity, and their stress, anxiety, addiction, and depression measures, so it can be used to check whether heavy phone use and short sleep go together with higher stress, anxiety, and depression. names() lists the column names, and head() and tail() show the first and last six rows. The dataset has no missing values, so no imputation is needed and filtering, grouping, and feature engineering can be applied directly.
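To show what the missing-value check would report if the data were not complete, here is a minimal sketch. The `toy` data frame and its values are made up for illustration and are not part of Teen_health.

```r
# Hypothetical toy data (not the real dataset): two values are deliberately missing.
toy <- data.frame(
  age          = c(14, 15, NA, 17),
  sleep_hours  = c(7.4, NA, 6.9, 8.0),
  stress_level = c(2L, 8L, 3L, 5L)
)

# Number of missing values per column, the same check used on Teen_health.
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# Number of rows that have no missing value in any column.
n_complete <- sum(complete.cases(toy))
print(n_complete)
```

A non-zero entry in `na_per_column` would point to the column that needs cleaning before any grouping or modelling.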
————————————————————————————————————————————————————————————————————————–———————
Level 2: Understanding the Data and Converting to Categories
————————————————————————————————————————————————————————————————————————–———————
Question:- A data analyst wants to summarize the Teen_health dataset, check its size, and ensure that categorical variables like gender are properly formatted for analysis.
# Summary
summary(Teen_health)
## age gender daily_social_media_hours platform_usage
## Min. :13.00 Length:1200 Min. :1.000 Length:1200
## 1st Qu.:14.00 Class :character 1st Qu.:2.800 Class :character
## Median :16.00 Mode :character Median :4.500 Mode :character
## Mean :15.93 Mean :4.537
## 3rd Qu.:18.00 3rd Qu.:6.300
## Max. :19.00 Max. :8.000
## sleep_hours screen_time_before_sleep academic_performance
## Min. :4.000 Min. :0.50 Min. :2.00
## 1st Qu.:5.200 1st Qu.:1.10 1st Qu.:2.50
## Median :6.500 Median :1.80 Median :2.99
## Mean :6.449 Mean :1.74 Mean :2.99
## 3rd Qu.:7.600 3rd Qu.:2.40 3rd Qu.:3.48
## Max. :9.000 Max. :3.00 Max. :4.00
## physical_activity social_interaction_level stress_level anxiety_level
## Min. :0.000 Length:1200 Min. : 1.000 Min. : 1.000
## 1st Qu.:0.500 Class :character 1st Qu.: 3.000 1st Qu.: 3.000
## Median :1.000 Mode :character Median : 5.000 Median : 6.000
## Mean :1.014 Mean : 5.446 Mean : 5.637
## 3rd Qu.:1.500 3rd Qu.: 8.000 3rd Qu.: 8.000
## Max. :2.000 Max. :10.000 Max. :10.000
## addiction_level depression_label
## Min. : 1.000 Min. :0.00000
## 1st Qu.: 3.000 1st Qu.:0.00000
## Median : 6.000 Median :0.00000
## Mean : 5.565 Mean :0.02583
## 3rd Qu.: 8.000 3rd Qu.:0.00000
## Max. :10.000 Max. :1.00000
#dimensions
dim(Teen_health)
## [1] 1200 13
#Convert gender to factor
Teen_health$gender <- as.factor(Teen_health$gender)
Interpretation: summary() shows statistical details such as the minimum, maximum, mean, and quartiles for each numeric column. For character columns such as gender it shows only the length and class; category counts appear only after a column is converted to a factor. dim() gives the number of rows and columns in the dataset (1200 and 13). The gender column is converted into a factor so that it can be used properly in analysis and graphs. Converting to a factor is important because it helps in grouping, comparison, and visualization (such as bar charts).
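The effect of the conversion can be sketched on a small example. The `toy` data frame below is hypothetical (it only borrows two column names from the dataset), and converting every character column at once is one possible approach, not the only one.

```r
# Hypothetical toy data (not the real dataset).
toy <- data.frame(
  gender         = c("male", "female", "female", "male", "female"),
  platform_usage = c("Instagram", "TikTok", "Both", "TikTok", "Both"),
  stress_level   = c(2L, 8L, 3L, 5L, 9L),
  stringsAsFactors = FALSE
)

# Convert every character column to a factor in one step.
is_text <- sapply(toy, is.character)
toy[is_text] <- lapply(toy[is_text], as.factor)

# summary() now reports counts per category for the factor columns.
print(summary(toy))
gender_counts <- table(toy$gender)
print(gender_counts)
```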
————————————————————————————————————————————————————————————————————————–———————
Level 3: Filtering and Small Analysis
————————————————————————————————————————————————————————————————————————–———————
Question:- A student wants to identify students with high stress levels and extract their basic details (age, gender, and stress level) for further analysis.
# 3. High-stress students
high_stress <- Teen_health %>%
filter(stress_level > 7) %>%
select(age, gender, stress_level)
head(high_stress)
## age gender stress_level
## 1 19 female 8
## 2 16 male 10
## 3 16 female 10
## 4 14 female 8
## 5 18 male 8
## 6 18 female 9
Interpretation:- filter() keeps the students whose stress level is greater than 7, and select() keeps only the age, gender, and stress_level columns.
————————————————————————————————————————————————————————————————————————–———————
Question 3.1:- Identify the top 10 students with the highest stress levels and extract their basic details (age, gender, and stress level) for deeper analysis.
# 4. Top 10 highest-stress students
top_stress <- Teen_health %>%
arrange(desc(stress_level)) %>%
select(age, gender, stress_level) %>%
head(10)
top_stress
## age gender stress_level
## 1 16 male 10
## 2 16 female 10
## 3 13 female 10
## 4 14 male 10
## 5 19 male 10
## 6 15 male 10
## 7 14 male 10
## 8 17 female 10
## 9 17 female 10
## 10 13 male 10
Interpretation:- arrange(desc(stress_level)) sorts the dataset from the highest to the lowest stress level, select() keeps only the relevant columns (age, gender, and stress level), and head(10) retrieves the first 10 rows. All 10 rows have the maximum stress level of 10, so if more than 10 students share that score, the ones shown are simply the first in the sorted order.
————————————————————————————————————————————————————————————————————————–———————
Question 3.2:- Rank all students by stress level to identify which students are the most and least stressed.
# 5. Ranking students by stress
rank_stress <- Teen_health %>%
arrange(desc(stress_level)) %>%
mutate(rank = row_number())
head(rank_stress)
## age gender daily_social_media_hours platform_usage sleep_hours
## 1 16 male 3.1 Both 6.1
## 2 16 female 6.7 Both 6.8
## 3 13 female 6.6 TikTok 7.3
## 4 14 male 6.4 Instagram 5.7
## 5 19 male 1.6 Instagram 8.6
## 6 15 male 4.0 Both 8.8
## screen_time_before_sleep academic_performance physical_activity
## 1 0.8 2.11 1.9
## 2 1.9 3.08 1.4
## 3 1.6 3.27 0.1
## 4 0.9 2.06 1.8
## 5 2.4 3.81 0.4
## 6 1.3 3.59 1.9
## social_interaction_level stress_level anxiety_level addiction_level
## 1 high 10 7 10
## 2 high 10 9 1
## 3 medium 10 2 1
## 4 high 10 4 2
## 5 low 10 4 1
## 6 medium 10 6 10
## depression_label rank
## 1 0 1
## 2 0 2
## 3 0 3
## 4 0 4
## 5 0 5
## 6 0 6
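Interpretation:- arrange() sorts the students in descending order of stress, and mutate() adds a new rank column using row_number(), so rank 1 is at the top of the sorted data. Because row_number() gives every row a different number, students with the same stress level still receive different ranks.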
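Since many students share the same stress score, it may help to see how tied values can be ranked. The sketch below uses base R's rank() on made-up scores; the vector `stress` is hypothetical. With dplyr, min_rank() gives the shared-rank behaviour shown here.

```r
# Hypothetical stress scores with ties (not the real dataset).
stress <- c(10, 7, 10, 3, 10, 7)

# Position after sorting: tied students still get different numbers.
order_position <- rank(-stress, ties.method = "first")

# Competition ranking: tied students share the same rank.
shared_rank <- rank(-stress, ties.method = "min")

print(data.frame(stress, order_position, shared_rank))

# Number of students tied at the maximum score.
n_at_max <- sum(stress == max(stress))
print(n_at_max)
```

Counting the students at the maximum shows whether a "top 10" list is a real cut-off or just the first few of a larger tied group.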
————————————————————————————————————————————————————————————————————————–———————
Level 4: Data Visualization and Trend Analysis
————————————————————————————————————————————————————————————————————————–———————
Question:- Analyze the distribution of stress levels among students to understand how stress is spread across the dataset.
# 7. Distribution of stress levels
ggplot(Teen_health, aes(x = stress_level)) +
geom_histogram(binwidth = 2, fill = "red", color = "black") +
labs(title = "Distribution of Stress Levels")
Interpretation:- The ggplot histogram shows how many students fall in
the low, medium, and high ranges of stress. Visualization makes the
patterns and trends in the data easier to understand.
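Because stress_level takes only whole numbers from 1 to 10, a histogram with binwidth = 2 merges neighbouring scores. An exact tabulation is a useful cross-check; the sketch below uses a made-up `stress` vector, and the low/medium/high bands (1-3, 4-6, 7-10) are an assumption for illustration.

```r
# Hypothetical stress scores on the 1-10 scale (not the real dataset).
stress <- c(1, 2, 2, 3, 5, 5, 5, 8, 9, 10)

# Exact count for every possible score, including scores nobody reported.
score_counts <- table(factor(stress, levels = 1:10))
print(score_counts)

# Counts in three assumed bands: low 1-3, medium 4-6, high 7-10.
band_counts <- c(
  low    = sum(stress <= 3),
  medium = sum(stress >= 4 & stress <= 6),
  high   = sum(stress >= 7)
)
print(band_counts)
```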
————————————————————————————————————————————————————————————————————————–———————
Level 5: Feature Engineering and Creating a New Column
————————————————————————————————————————————————————————————————————————–———————
Question:- An analyst wants to classify students into stress categories
(High, Medium, Low) based on their stress level and count the number of
students in each category.
# 8. Stress categories
Teen_health$stress_group <- ifelse(Teen_health$stress_level >= 7, "High",
ifelse(Teen_health$stress_level >= 4,
"Medium", "Low"))
ggplot(Teen_health, aes(x = stress_group, fill = stress_group)) +
geom_bar() +
labs(title = "Stress Categories")
Interpretation:- A new column, stress_group, is created with nested ifelse() calls that convert the numeric stress level into three categories: High (7 and above), Medium (4 to 6), and Low (below 4). The bar chart then compares the number of students in each category.
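The ifelse() column is plain text, so a bar chart orders its bars alphabetically (High, Low, Medium). One possible alternative is cut() with an ordered result, sketched here on made-up scores; the break points are chosen to reproduce the same three groups for whole-number scores.

```r
# Hypothetical stress scores (not the real dataset).
stress <- c(1, 3, 4, 6, 7, 10)

# cut() assigns each score to an interval; right = TRUE means (a, b] intervals,
# so the breaks 0, 3, 6, 10 give Low = 1-3, Medium = 4-6, High = 7-10.
stress_group <- cut(
  stress,
  breaks = c(0, 3, 6, 10),
  labels = c("Low", "Medium", "High"),
  right = TRUE,
  ordered_result = TRUE
)
print(table(stress_group))
```

With the levels stored in the order Low, Medium, High, tables and bar charts follow that order instead of the alphabet.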
Question:- Check the relationship between daily social media usage and stress levels to see whether higher usage leads to higher stress.
# 9. Social media vs stress
ggplot(Teen_health, aes(x = daily_social_media_hours, y = stress_level)) +
geom_point(color = "blue") +
labs(title = "Social Media vs Stress")
Interpretation:-The scatter plot shows no clear relationship between
daily social media hours and stress level. The data points are scattered
without any visible upward or downward trend.
Question:- Compare the average stress levels between genders to understand whether stress varies by gender.
# 10. Average stress by gender
gender_avg <- Teen_health %>%
group_by(gender) %>%
summarise(avg_stress = mean(stress_level))
ggplot(gender_avg, aes(x = gender, y = avg_stress, fill = gender)) +
geom_bar(stat = "identity") +
labs(title = "Average Stress by Gender")
Interpretation:- group_by() groups the rows by gender and summarise()
computes the average stress level for each group. The bar chart then
compares the average stress of each gender.
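An average alone does not show how much the groups overlap, so it can help to report the spread and group size next to it. The following base R sketch does this on a hypothetical `toy` data frame (the values are made up).

```r
# Hypothetical toy data (not the real dataset).
toy <- data.frame(
  gender       = c("male", "female", "female", "male", "female", "male"),
  stress_level = c(2, 8, 4, 6, 6, 7)
)

# Mean, standard deviation, and group size for each gender.
avg_stress <- tapply(toy$stress_level, toy$gender, mean)
sd_stress  <- tapply(toy$stress_level, toy$gender, sd)
n_students <- table(toy$gender)

gender_summary <- data.frame(
  gender = names(avg_stress),
  avg    = as.vector(avg_stress),
  sd     = as.vector(sd_stress),
  n      = as.vector(n_students)
)
print(gender_summary)
```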
————————————————————————————————————————————————————————————————————————–———————
Data Visualization and Trend Analysis
————————————————————————————————————————————————————————————————————————–———————
Question:- Analyze the distribution of sleep hours among
students to understand their sleeping patterns.
# 11. Sleep distribution
ggplot(Teen_health, aes(x = sleep_hours)) +
geom_histogram(binwidth = 1, fill = "green", color = "black") +
labs(title = "Sleep Hours Distribution")
Interpretation:- The histogram shows how sleep hours are spread across
different ranges. It helps to see whether most students get adequate
sleep or not, and so to understand their sleeping behaviour.
Question:- Check the relationship between sleep hours and stress levels among students to determine whether less sleep leads to higher stress.
# 12. Sleep vs stress
ggplot(Teen_health, aes(x = sleep_hours, y = stress_level)) +
geom_jitter(color = "darkblue", width = 0.2, height = 0.2) +
geom_smooth(method = "lm", color = "red", se = FALSE)
## `geom_smooth()` using formula = 'y ~ x'
Interpretation:-The jittered scatter plot with a regression line shows a
slight negative relationship between sleep hours and stress level. As
sleep hours increase, stress level tends to decrease slightly, but the
trend is very weak.
Question:- Analyze how screen time before sleep affects students’ total sleep hours to understand its impact on sleep patterns.
# 13. Screen time before sleep vs sleep hours
ggplot(Teen_health, aes(x = screen_time_before_sleep, y = sleep_hours)) +
geom_point(color = "purple") +
geom_smooth(method = "lm", color = "blue", se = FALSE)+
labs(title = "Screen Time Before Sleep vs Sleep",
x = "Screen Time Before Sleep",
y = "Sleep Hours")
## `geom_smooth()` using formula = 'y ~ x'
Interpretation:-The scatter plot with regression line shows a slight
positive relationship between screen time before sleep and sleep hours.
As screen time increases, sleep hours also increase slightly, but the
relationship is very weak.
Question:- Check the relationship between physical activity and stress levels to see whether increased physical activity reduces stress among students.
# 14. Physical activity vs stress
ggplot(Teen_health, aes(x = physical_activity, y = stress_level)) +
geom_point(color = "orange") +
geom_smooth(method = "lm", color = "blue") +
labs(title = "Physical Activity vs Stress")
## `geom_smooth()` using formula = 'y ~ x'
Interpretation:-The scatter plot with regression line shows a very
slight positive relationship between physical activity and stress level.
However, the slope is almost flat, indicating that the relationship is
extremely weak and practically negligible.
Question:- Analyze which social media platforms are most used by students and compare the number of users for each platform.
# 15. Graph of social media platform use
ggplot(Teen_health, aes(x = platform_usage, fill = platform_usage)) +
geom_bar() +
labs(title = "Number of Students Using Each Social Media App",
x = "Social Media Platform",
y = "Number of Students")
Interpretation:-The bar chart visualizes categorical data by comparing
the number of students using each social media platform. It shows which
platform is the most popular among students.
CA 3 questions start here.
Question:- Among male students, what percentage uses each platform?
male_data <- Teen_health[Teen_health$gender == "male", ]
values <- table(male_data$platform_usage)
labels <- paste(names(values),
round(values/sum(values)*100), "%")
pie(values, labels = labels,
col = rainbow(length(values)),
main = "Male Platform Usage (%)")
Interpretation:-I used a pie chart to compare the share of male students
using each social media platform. From this chart I can see which
platform is the most popular among the male students.
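The percentage labels for the pie chart can also be computed with prop.table(), which turns counts into shares. This is a minimal sketch on a made-up `platform` vector, not the real male subset.

```r
# Hypothetical platform choices for a few male students (not the real data).
platform <- c("Instagram", "TikTok", "Both", "TikTok",
              "Both", "Both", "Instagram", "Both")

counts <- table(platform)

# prop.table() converts the counts into shares that sum to 1.
share_pct <- round(100 * prop.table(counts), 1)
print(share_pct)

# Labels in a "name value%" style, without a space before the percent sign.
labels <- paste0(names(share_pct), " ", share_pct, "%")
print(labels)
```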
Question:- How do sleep hours affect the stress level of students?
model <- lm(stress_level ~ sleep_hours, data = Teen_health)
summary(model)
##
## Call:
## lm(formula = stress_level ~ sleep_hours, data = Teen_health)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.5000 -2.4469 -0.4094 2.5487 4.6105
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.58833 0.38422 14.54 <2e-16 ***
## sleep_hours -0.02209 0.05814 -0.38 0.704
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.904 on 1198 degrees of freedom
## Multiple R-squared: 0.0001205, Adjusted R-squared: -0.0007141
## F-statistic: 0.1444 on 1 and 1198 DF, p-value: 0.704
predict(model, newdata = data.frame(sleep_hours = 6))
## 1
## 5.455763
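Interpretation:-The slope for sleep_hours is -0.022, so each extra hour of sleep is associated with a stress level that is lower by only about 0.02 points. This effect is not statistically significant: the p-value is 0.704, which is well above 0.05. The model explains only about 0.012% of the variation in stress (R-squared = 0.0001205). For a student who sleeps 6 hours, the predicted stress level is about 5.46. In this dataset, sleep hours therefore have only a very weak relation with stress, and the results do not show that more sleep reduces stress.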
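The prediction can be reproduced by hand from the fitted coefficients. The sketch below copies the intercept, slope, and R-squared from the summary(model) output and uses the sleep range of 4 to 9 hours reported by summary(Teen_health); nothing is re-estimated.

```r
# Coefficients copied from the summary(model) output.
intercept <- 5.58833
slope     <- -0.02209

# Predicted stress for a student sleeping 6 hours: intercept + slope * 6.
pred_6h <- intercept + slope * 6
print(pred_6h)

# Predicted change in stress across the observed sleep range (4 to 9 hours).
change_4_to_9 <- slope * (9 - 4)
print(change_4_to_9)

# Share of variance explained, as a percentage (Multiple R-squared * 100).
r_squared_pct <- 0.0001205 * 100
print(r_squared_pct)
```

Even across the full range of sleep hours in the data, the fitted line moves stress by only about a tenth of a point on a 1 to 10 scale.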
Question:-What relationship exists between daily social media hours and screen time before sleep?
pairs(Teen_health[, c("daily_social_media_hours", "screen_time_before_sleep")], main = "Pair Plot: Social Media vs Screen Time", pch = 19, col = "blue")
Interpretation:- The pair plot shows the relationship between daily
social media usage and screen time before sleep. The data points are
scattered randomly, indicating no clear or strong relationship between
the two variables.
Question:-What is the cumulative distribution of students’ physical activity levels?
x <- sort(Teen_health$physical_activity)
# Running count 1, 2, ..., n of the sorted values
y <- seq_along(x)
plot(x, y,
type = "l",
col = "blue",
main = "CFD of Physical Activity",
xlab = "Physical Activity",
ylab = "Cumulative Frequency")
Interpretation:-The cumulative frequency graph shows how the number of students increases as the physical activity level increases. This line graph helps in understanding how many students fall at or below a certain level of physical activity.
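Base R also provides ecdf(), which gives the same cumulative information as a proportion instead of a count. The sketch below uses a made-up `activity` vector; the cut-off of 1.0 is only an example value.

```r
# Hypothetical activity values (not the real dataset).
activity <- c(0.0, 0.4, 0.8, 0.8, 1.0, 1.5, 1.9, 2.0)

# ecdf() returns a function giving the proportion of values <= its argument.
activity_cdf <- ecdf(activity)

# Proportion of students with an activity value of at most 1.0.
share_up_to_1 <- activity_cdf(1.0)
print(share_up_to_1)

# The same quantity as a cumulative count, matching a cumulative frequency axis.
count_up_to_1 <- sum(activity <= 1.0)
print(count_up_to_1)
```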
Question:-How is stress level distributed among students and are there any outliers?
boxplot(Teen_health$stress_level,
col = "lightblue",
main = "Box Plot of Stress Level",
ylab = "Stress Level")
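Interpretation:-The box plot shows the distribution of stress levels
among students. It displays the median (5) and the spread of the data,
with the box running from the first quartile (3) to the third quartile
(8). Since all values lie between 1 and 10, no point falls more than
1.5 times the interquartile range beyond the box, so no outliers appear.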
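The outlier rule used by a box plot can be checked with simple arithmetic. This sketch takes the quartiles and the minimum and maximum of stress_level from the summary(Teen_health) output and assumes the box plot hinges equal those quartiles.

```r
# Quartiles of stress_level as reported by summary(Teen_health).
q1 <- 3
q3 <- 8
iqr <- q3 - q1

# boxplot() flags points lying more than 1.5 * IQR beyond the box.
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr
print(c(lower_fence, upper_fence))

# Observed range of stress_level from the same summary.
observed_min <- 1
observed_max <- 10
has_outliers <- observed_min < lower_fence || observed_max > upper_fence
print(has_outliers)
```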
Question:-What is the distribution of social media addiction levels among students?
# Create categories (Low, Medium, High) from daily social media hours.
# They go in a new column so the numeric addiction_level is not overwritten.
Teen_health$addiction_group <- factor(case_when(
Teen_health$daily_social_media_hours <= 2 ~ "Low",
Teen_health$daily_social_media_hours <= 5 ~ "Medium",
Teen_health$daily_social_media_hours <= 10 ~ "High"
), levels = c("Low", "Medium", "High"))
values <- table(Teen_health$addiction_group)
labels <- paste(names(values),
round(values/sum(values)*100), "%")
# Levels are ordered Low, Medium, High, so the colors match the categories
pie(values,
labels = labels,
col = c("green","yellow","red"),
main = "Social Media Addiction Level (%)")
Interpretation:-The pie chart shows the percentage distribution of
students based on their level of social media usage. Students are
categorized into low, medium, and high levels from their daily social
media hours, and table() counts the students in each category.
Question:-Is there a relationship between sleep hours and stress level among students?
cor(Teen_health$sleep_hours,
Teen_health$stress_level)
## [1] -0.01097922
Interpretation:-The correlation value (-0.011) is very close to 0, which indicates that there is no meaningful linear relationship between sleep hours (the independent variable) and stress level (the dependent variable).
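The correlation and the simple regression on the same two variables are linked, which gives a quick consistency check. The sketch below uses only numbers printed in this report (the correlation and the 1200 observations) and standard formulas.

```r
# Values copied from the outputs in this report.
r <- -0.01097922   # correlation between sleep_hours and stress_level
n <- 1200          # number of students

# For a regression with one predictor, R-squared is the squared correlation.
r_squared <- r^2
print(r_squared)

# t statistic for testing whether the correlation differs from zero.
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)
print(t_stat)

# Two-sided p-value from the t distribution with n - 2 degrees of freedom.
p_value <- 2 * pt(-abs(t_stat), df = n - 2)
print(p_value)
```

These values agree with the regression of stress_level on sleep_hours (R-squared 0.0001205, t value -0.38, p-value 0.704).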
Question:-Is there a relationship between social media usage and stress level among students?
cor(Teen_health[, c("sleep_hours",
"daily_social_media_hours",
"stress_level")])
## sleep_hours daily_social_media_hours stress_level
## sleep_hours 1.000000000 -0.009472174 -0.01097922
## daily_social_media_hours -0.009472174 1.000000000 0.03069774
## stress_level -0.010979224 0.030697742 1.00000000
Interpretation:-All three pairwise correlations are close to 0: sleep vs social media (-0.009), sleep vs stress (-0.011), and social media vs stress (0.031). This indicates that there is no strong linear relationship between social media usage and stress level, or between any other pair of these variables.
Question:-How do sleep hours affect stress level among students?
model1 <- lm(stress_level ~ sleep_hours, data = Teen_health)
summary(model1)
##
## Call:
## lm(formula = stress_level ~ sleep_hours, data = Teen_health)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.5000 -2.4469 -0.4094 2.5487 4.6105
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.58833 0.38422 14.54 <2e-16 ***
## sleep_hours -0.02209 0.05814 -0.38 0.704
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.904 on 1198 degrees of freedom
## Multiple R-squared: 0.0001205, Adjusted R-squared: -0.0007141
## F-statistic: 0.1444 on 1 and 1198 DF, p-value: 0.704
Interpretation:-The regression analysis shows that sleep hours do not have a significant effect on stress level, the dependent variable (p = 0.704). Although there is a slight negative relationship, it is very weak and not meaningful.
Question:-Does social media usage affect stress level among students?
model2 <- lm(stress_level ~ daily_social_media_hours, data = Teen_health)
summary(model2)
##
## Call:
## lm(formula = stress_level ~ daily_social_media_hours, data = Teen_health)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.5979 -2.4541 -0.3476 2.5426 4.7051
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.24662 0.20529 25.557 <2e-16 ***
## daily_social_media_hours 0.04391 0.04131 1.063 0.288
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.903 on 1198 degrees of freedom
## Multiple R-squared: 0.0009424, Adjusted R-squared: 0.0001084
## F-statistic: 1.13 on 1 and 1198 DF, p-value: 0.288
Interpretation:-The regression analysis shows that social media usage does not have a significant effect on stress level. The relationship between the two variables is very weak.
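Another way to read this result is through a 95% confidence interval for the slope. The sketch below rebuilds it from the estimate, standard error, and degrees of freedom printed in the summary(model2) output; calling confint(model2) on the fitted model would be the direct route.

```r
# Values copied from the summary(model2) output.
estimate  <- 0.04391
std_error <- 0.04131
df_resid  <- 1198

# Critical t value for a two-sided 95% interval.
t_crit <- qt(0.975, df = df_resid)

# Confidence interval: estimate +/- t_crit * standard error.
ci_lower <- estimate - t_crit * std_error
ci_upper <- estimate + t_crit * std_error
print(c(ci_lower, ci_upper))

# If the interval contains zero, the slope is not significant at the 5% level.
contains_zero <- ci_lower < 0 && ci_upper > 0
print(contains_zero)
```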
Question:-How do sleep hours, social media usage, and physical activity together affect stress level among students?
model_multi <- lm(stress_level ~ sleep_hours +daily_social_media_hours + physical_activity,data = Teen_health)
summary(model_multi)
##
## Call:
## lm(formula = stress_level ~ sleep_hours + daily_social_media_hours +
## physical_activity, data = Teen_health)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.6588 -2.4574 -0.3404 2.5474 4.7961
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.33157 0.45043 11.837 <2e-16 ***
## sleep_hours -0.02181 0.05816 -0.375 0.708
## daily_social_media_hours 0.04334 0.04135 1.048 0.295
## physical_activity 0.05746 0.14417 0.399 0.690
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.905 on 1196 degrees of freedom
## Multiple R-squared: 0.001189, Adjusted R-squared: -0.001316
## F-statistic: 0.4747 on 3 and 1196 DF, p-value: 0.7
Interpretation:-The multiple regression analysis shows that none of the variables significantly affect stress level. The relationships are weak, and the model does not explain much variation in stress.
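The negative adjusted R-squared and the overall F-test follow directly from the R-squared value. This sketch recomputes them with the standard formulas, using the numbers printed in the summary(model_multi) output; small differences come from rounding in that output.

```r
# Values copied from the summary(model_multi) output.
r_squared <- 0.001189
n <- 1200   # observations
p <- 3      # predictors

# Adjusted R-squared penalizes R-squared for the number of predictors.
adj_r_squared <- 1 - (1 - r_squared) * (n - 1) / (n - p - 1)
print(adj_r_squared)

# Overall F statistic and its p-value.
f_stat <- (r_squared / p) / ((1 - r_squared) / (n - p - 1))
p_value <- pf(f_stat, df1 = p, df2 = n - p - 1, lower.tail = FALSE)
print(c(f_stat, p_value))
```

An adjusted R-squared below zero means the three predictors together explain less than would be expected from chance alone.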
Conclusion:-