library(tidyr)
library(dplyr)
library(ggplot2)
My dataset was downloaded from Kaggle. **Data Set** (link): It contains teen phone-use habits and their effects on health. The columns include name, age, grade level, daily phone usage, sleep hours, time on social media, time on education, phone usage purpose, and addiction level.
Load your CSV dataset into R as instructed in previous assignment. Select columns to plot together to visualize your data. Plot a graph out of any of the ones we reviewed in class. Make sure axes, lines are annotated and it has a title. Briefly explain what your graph shows. Show the R code that resulted in the graph.
getwd()
## [1] "C:/Users/sona6/AdvancedAnalytics/itec4220project"
setwd('C:/Users/sona6/AdvancedAnalytics/itec4220project')
getwd()
## [1] "C:/Users/sona6/AdvancedAnalytics/itec4220project"
data <- read.csv("teen_phone_addiction_dataset.csv")
head(data)
## ID Name Age Gender Location School_Grade
## 1 1 Shannon Francis 13 Female Hansonfort 9th
## 2 2 Scott Rodriguez 17 Female Theodorefort 7th
## 3 3 Adrian Knox 13 Other Lindseystad 11th
## 4 4 Brittany Hamilton 18 Female West Anthony 12th
## 5 5 Steven Smith 14 Other Port Lindsaystad 9th
## 6 6 Mary Adams 13 Female East Angelachester 10th
## Daily_Usage_Hours Sleep_Hours Academic_Performance Social_Interactions
## 1 4.0 6.1 78 5
## 2 5.5 6.5 70 5
## 3 5.8 5.5 93 8
## 4 3.1 3.9 78 8
## 5 2.5 6.7 56 4
## 6 3.9 6.3 89 3
## Exercise_Hours Anxiety_Level Depression_Level Self_Esteem Parental_Control
## 1 0.1 10 3 8 0
## 2 0.0 3 7 3 0
## 3 0.8 2 3 10 0
## 4 1.6 9 10 3 0
## 5 1.1 1 5 1 0
## 6 0.7 7 1 3 0
## Screen_Time_Before_Bed Phone_Checks_Per_Day Apps_Used_Daily
## 1 1.4 86 19
## 2 0.9 96 9
## 3 0.5 137 8
## 4 1.4 128 7
## 5 1.0 96 20
## 6 1.1 135 8
## Time_on_Social_Media Time_on_Gaming Time_on_Education Phone_Usage_Purpose
## 1 3.6 1.7 1.2 Browsing
## 2 1.1 4.0 1.8 Browsing
## 3 0.3 1.5 0.4 Education
## 4 3.1 1.6 0.8 Social Media
## 5 2.6 0.9 1.1 Gaming
## 6 3.8 0.0 1.4 Social Media
## Family_Communication Weekend_Usage_Hours Addiction_Level
## 1 4 8.7 10.0
## 2 2 5.3 10.0
## 3 6 5.7 9.2
## 4 8 3.0 9.8
## 5 10 3.7 8.6
## 6 7 6.0 8.8
# Mean daily usage for each Age x Gender combination, drawn as one line per gender
data %>%
  group_by(Age, Gender) %>%
  summarise(mean_usage = mean(Daily_Usage_Hours, na.rm = TRUE), .groups = "drop") %>%
  ggplot(aes(x = Age, y = mean_usage, color = Gender, group = Gender)) +
  geom_line(linewidth = 1) +  # `linewidth` replaces the deprecated `size` for lines
  labs(title = "Average Daily Usage by Age and Gender",
       x = "Age", y = "Average Daily Usage (Hours)")
The line plot shows average daily phone usage in hours for ages 13 to
19, with one line per gender. For males, usage starts low at age 13,
rises gradually to a peak around age 18, and drops slightly at 19.
Females start higher at the younger ages, dip slightly around 17, and
then rise sharply at 19. Overall, male usage increases steadily through
the teen years, while female usage fluctuates but ends highest at age
19. In other words, younger females use their phones more than males of
the same age, males catch up in the middle teen years, and females pull
ahead again at the end.
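The reading above compares group means, and each mean is only as stable as the number of students behind it. The following is a minimal base R sketch of how the mean and the group size could be tabulated side by side for every age and gender combination. It runs on simulated values: `sim`, `cell_means`, `cell_sizes`, and `cell_summary` are hypothetical names, and the numbers are not from the Kaggle file.

```r
# Simulated stand-in for the real data (values are made up for illustration)
set.seed(2)
sim <- data.frame(
  Age = sample(13:19, 300, replace = TRUE),
  Gender = sample(c("Female", "Male", "Other"), 300, replace = TRUE),
  Daily_Usage_Hours = round(runif(300, 0, 10), 1)
)

# Mean usage and number of students for every Age x Gender cell
cell_means <- aggregate(Daily_Usage_Hours ~ Age + Gender, data = sim, FUN = mean)
cell_sizes <- aggregate(Daily_Usage_Hours ~ Age + Gender, data = sim, FUN = length)
names(cell_means)[3] <- "mean_usage"
names(cell_sizes)[3] <- "n"

# Combine both summaries so small cells are easy to spot
cell_summary <- merge(cell_means, cell_sizes, by = c("Age", "Gender"))
print(cell_summary[order(cell_summary$Gender, cell_summary$Age), ])
```

Cells with a small `n` would produce noisier points on the line plot, so sharp jumps at those ages deserve a more cautious reading.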
Do a simple statistical calculation (e.g. mean, standard deviation, mode, median, etc.) with R that aligns with your hypothesis and plot/report results. Explain what the result means in terms of your question.
# Convert Addiction_Level from a decimal to an integer (as.integer() truncates, e.g. 9.2 becomes 9)
data <- data %>% mutate(Addiction_Level = as.integer(Addiction_Level))
# Calculate the mean and SD of Academic_Performance by addiction level
academic_stats <- data %>%
  group_by(Addiction_Level) %>%
  summarise(
    mean_score = mean(Academic_Performance, na.rm = TRUE),
    sd_score = sd(Academic_Performance, na.rm = TRUE)
  )
print(academic_stats)
## # A tibble: 10 × 3
##    Addiction_Level mean_score sd_score
##              <int>      <dbl>    <dbl>
##  1               1       77.3     5.69
##  2               2       76      14.2
##  3               3       72.6    11.5
##  4               4       74.8    14.9
##  5               5       73.9    14.0
##  6               6       74.9    15.4
##  7               7       76.5    14.7
##  8               8       72.6    14.4
##  9               9       75.6    15.3
## 10              10       75.1    14.6
# Plot mean academic performance by addiction level with error bars (mean ± 1 SD)
ggplot(academic_stats, aes(x = Addiction_Level, y = mean_score, fill = Addiction_Level)) +
  geom_col() +  # same result as geom_bar(stat = "identity")
  geom_errorbar(aes(ymin = mean_score - sd_score, ymax = mean_score + sd_score), width = 0.2) +
  labs(title = "Academic Performance by Phone Addiction Level",
       x = "Phone Addiction Level",
       y = "Mean Academic Performance") +
  theme_minimal()
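The graph shows that mean academic performance changes very little
across addiction levels: the group means range from 72.6 to 77.3 and do
not fall steadily as addiction level rises. The highest mean is at
level 1 (77.3), but levels 7, 9, and 10 also have means above 75. The
error bars (one standard deviation, roughly 11 to 15 points for every
level except level 1) are much wider than the differences between group
means, so scores vary far more within each group than between groups.
On their own, these summary statistics give at most weak support for a
negative relationship between phone addiction and academic performance.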
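The error bars above show one standard deviation, which describes the spread of individual scores, not the uncertainty of each group mean. As a hedged sketch (not the method used in this report), the standard error and an approximate 95% confidence interval for each mean could be computed as follows. The data are simulated, and `sim`, `level`, `score`, and `se_stats` are hypothetical names.

```r
# Simulated scores for three groups of very different sizes (made-up values)
set.seed(1)
sim <- data.frame(
  level = rep(1:3, times = c(20, 200, 2000)),
  score = rnorm(2220, mean = 75, sd = 14)
)

# Per-group n, mean, SD, and standard error of the mean (SD / sqrt(n))
se_stats <- do.call(rbind, lapply(split(sim, sim$level), function(g) {
  n <- nrow(g)
  data.frame(
    level = g$level[1],
    n = n,
    mean_score = mean(g$score),
    sd_score = sd(g$score),
    se_score = sd(g$score) / sqrt(n)
  )
}))

# Approximate 95% confidence interval for each group mean (t distribution)
t_crit <- qt(0.975, df = se_stats$n - 1)
se_stats$ci_low  <- se_stats$mean_score - t_crit * se_stats$se_score
se_stats$ci_high <- se_stats$mean_score + t_crit * se_stats$se_score
print(se_stats)
```

Because the standard error shrinks with group size, intervals built this way make it easier to judge whether two group means plausibly differ, while SD bars stay about the same width regardless of how many students are in a group.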
We will apply statistical tests to your dataset to gain insight in answering your questions. Start by first applying a correlation or regression analysis to detect a relationship. Explain how the relationship aligns with your questions.
# Truncate Daily_Usage_Hours to whole hours (as.integer() drops the decimal part)
data <- data %>% mutate(Daily_Usage_Hours = as.integer(Daily_Usage_Hours))
cor_test <- cor.test(data$Daily_Usage_Hours, data$Addiction_Level)
cor_test
##
## Pearson's product-moment correlation
##
## data: data$Daily_Usage_Hours and data$Addiction_Level
## t = 41.484, df = 2998, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.5806507 0.6261436
## sample estimates:
## cor
## 0.6038887
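Teens who spend more hours on their phones tend to have higher addiction levels. The correlation of 0.604 (95% CI 0.581 to 0.626) indicates a moderate positive relationship, and the very small p-value (< 2.2e-16) shows that it is statistically significant, so a correlation this strong is very unlikely to arise by chance alone. Note that both variables were truncated to whole numbers before the test.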
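Since the prompt allows either a correlation or a regression, and both variables were truncated with `as.integer()`, a small simulation can illustrate two things: how much truncation changes a Pearson correlation, and what a simple linear regression adds. This is only a sketch under stated assumptions. The values are simulated, and `usage_hours`, `addiction`, `r_full`, `r_trunc`, and `fit` are hypothetical names rather than objects from this report.

```r
# Simulated data: addiction score loosely increasing with usage (made-up values)
set.seed(42)
n <- 500
usage_hours <- round(runif(n, 0, 12), 1)
addiction <- pmin(10, pmax(1, 1 + 0.6 * usage_hours + rnorm(n, sd = 2)))

# Pearson correlation on the original values and on the truncated values
r_full  <- cor(usage_hours, addiction)
r_trunc <- cor(as.integer(usage_hours), as.integer(addiction))

# Simple linear regression: estimated change in addiction per extra hour
fit <- lm(addiction ~ usage_hours)
slope <- unname(coef(fit)["usage_hours"])

cat("r (original values): ", round(r_full, 3), "\n")
cat("r (truncated values):", round(r_trunc, 3), "\n")
cat("slope per hour:      ", round(slope, 3), "\n")
```

The correlation summarizes the strength of the association, while the regression slope expresses it in the units of the question (addiction points per additional hour of use).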
For the second statistical test, select a numerical column that you want to check. First, plot a histogram of it and discuss its distribution. As you did with the above graph, make sure the histogram is properly annotated.
ggplot(data, aes(x = Time_on_Social_Media)) +
  geom_histogram(binwidth = 0.5, fill = "lightblue", color = "black") +
  labs(title = "Distribution of Social Media Usage",
       x = "Time on Social Media (Hours)", y = "Count of Students")
The histogram shows the distribution of daily time spent on social
media. Most students spend a few hours per day (the tallest bars are on
the left), while fewer students spend many hours (the shorter bars on
the right). The distribution is therefore somewhat right-skewed, with a
small group of heavy users.
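A visual impression of skew can be backed up with numbers. The sketch below, which is an illustration and not part of the original analysis, computes a moment-based skewness coefficient and runs a Shapiro-Wilk normality test in base R. It uses simulated right-skewed values: `hours` and the helper `skewness()` are hypothetical names defined here.

```r
# Simulated right-skewed "hours" values (made up, gamma-distributed)
set.seed(7)
hours <- rgamma(1000, shape = 2, rate = 0.8)

# Moment-based skewness: third central moment divided by the cubed SD
# (positive values indicate a longer right tail)
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / (mean((x - m)^2))^1.5
}

skew_value <- skewness(hours)

# Shapiro-Wilk test of normality (base R accepts 3 to 5000 observations)
sw <- shapiro.test(hours)

cat("skewness:", round(skew_value, 2), "\n")
cat("mean:", round(mean(hours), 2), " median:", round(median(hours), 2), "\n")
cat("Shapiro-Wilk p-value:", format.pval(sw$p.value), "\n")
```

A mean that sits above the median, together with a positive skewness coefficient, is consistent with a right tail. With thousands of rows, a normality test tends to reject even for mild departures, so it is best read alongside the histogram.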
Divide your dataset into two groups of rows based on another column that matters for your question and apply a test that we discussed in class (t-test, ANOVA, …) to test for significant differences between the two groups. Make sure the test you selected is consistent with the distribution that you observed earlier. Show the code and briefly explain the results.
# Split students into two groups: addiction level 5 or below is "Low", above 5 is "High"
data <- data %>%
  mutate(Addiction_Group = ifelse(Addiction_Level <= 5, "Low", "High"))
# Check number of students in each group
table(data$Addiction_Group)
##
## High Low
## 2790 210
t_test_result <- t.test(Academic_Performance ~ Addiction_Group, data = data)
print(t_test_result)
##
## Welch Two Sample t-test
##
## data: Academic_Performance by Addiction_Group
## t = 0.90047, df = 246.65, p-value = 0.3687
## alternative hypothesis: true difference in means between group High and group Low is not equal to 0
## 95 percent confidence interval:
## -1.057469 2.838728
## sample estimates:
## mean in group High mean in group Low
## 75.00968 74.11905
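Mean academic performance is slightly higher in the high-addiction group (75.01) than in the low-addiction group (74.12), but the difference is not statistically significant (Welch t-test, t = 0.90, p = 0.37). The 95% confidence interval for the difference (-1.06 to 2.84) includes 0, so we cannot conclude that academic performance differs between the two addiction groups in this dataset. The groups are also very unequal in size (2,790 High vs. 210 Low), which is worth keeping in mind when reading the result.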
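A non-significant p-value says little about how large the difference might be, and the t-test is not the only option for two groups. As a hedged illustration, the sketch below computes Cohen's d (a standardized effect size) and runs a Wilcoxon rank-sum test as a rank-based alternative. It uses simulated scores with unequal group sizes: `sim`, `high`, `low`, `cohens_d`, and `wt` are hypothetical names, and none of the numbers come from this dataset.

```r
# Simulated scores for two unequal groups drawn from the same distribution
set.seed(123)
sim <- data.frame(
  group = rep(c("High", "Low"), times = c(900, 100)),
  score = rnorm(1000, mean = 75, sd = 14)
)
high <- sim$score[sim$group == "High"]
low  <- sim$score[sim$group == "Low"]

# Cohen's d: difference in means divided by the pooled standard deviation
pooled_sd <- sqrt(((length(high) - 1) * var(high) + (length(low) - 1) * var(low)) /
                    (length(high) + length(low) - 2))
cohens_d <- (mean(high) - mean(low)) / pooled_sd

# Wilcoxon rank-sum test: compares the groups without assuming normality
wt <- wilcox.test(score ~ group, data = sim)

cat("Cohen's d:", round(cohens_d, 3), "\n")
cat("Wilcoxon p-value:", round(wt$p.value, 3), "\n")
```

By a common rule of thumb, an absolute d near 0.2 is considered small, which gives a scale for judging whether a difference of about one point in mean score matters in practice.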