library(tidyr)
library(dplyr)
library(ggplot2)

Data Set

My data set was downloaded from Kaggle. **Data Set** (link): It contains teen phone usage habits and their effects on health. The columns include name, age, grade level, daily phone usage, sleep hours, time on social media, time on education, phone usage purpose, and addiction level.

Question 1

Load your CSV dataset into R as instructed in the previous assignment. Select columns to plot together to visualize your data. Plot a graph out of any of the ones we reviewed in class. Make sure the axes and lines are annotated and the graph has a title. Briefly explain what your graph shows. Show the R code that resulted in the graph.

getwd()
## [1] "C:/Users/sona6/AdvancedAnalytics/itec4220project"
setwd('C:/Users/sona6/AdvancedAnalytics/itec4220project')
getwd()
## [1] "C:/Users/sona6/AdvancedAnalytics/itec4220project"
data <- read.csv("teen_phone_addiction_dataset.csv")
head(data)
##   ID              Name Age Gender           Location School_Grade
## 1  1   Shannon Francis  13 Female         Hansonfort          9th
## 2  2   Scott Rodriguez  17 Female       Theodorefort          7th
## 3  3       Adrian Knox  13  Other        Lindseystad         11th
## 4  4 Brittany Hamilton  18 Female       West Anthony         12th
## 5  5      Steven Smith  14  Other   Port Lindsaystad          9th
## 6  6        Mary Adams  13 Female East Angelachester         10th
##   Daily_Usage_Hours Sleep_Hours Academic_Performance Social_Interactions
## 1               4.0         6.1                   78                   5
## 2               5.5         6.5                   70                   5
## 3               5.8         5.5                   93                   8
## 4               3.1         3.9                   78                   8
## 5               2.5         6.7                   56                   4
## 6               3.9         6.3                   89                   3
##   Exercise_Hours Anxiety_Level Depression_Level Self_Esteem Parental_Control
## 1            0.1            10                3           8                0
## 2            0.0             3                7           3                0
## 3            0.8             2                3          10                0
## 4            1.6             9               10           3                0
## 5            1.1             1                5           1                0
## 6            0.7             7                1           3                0
##   Screen_Time_Before_Bed Phone_Checks_Per_Day Apps_Used_Daily
## 1                    1.4                   86              19
## 2                    0.9                   96               9
## 3                    0.5                  137               8
## 4                    1.4                  128               7
## 5                    1.0                   96              20
## 6                    1.1                  135               8
##   Time_on_Social_Media Time_on_Gaming Time_on_Education Phone_Usage_Purpose
## 1                  3.6            1.7               1.2            Browsing
## 2                  1.1            4.0               1.8            Browsing
## 3                  0.3            1.5               0.4           Education
## 4                  3.1            1.6               0.8        Social Media
## 5                  2.6            0.9               1.1              Gaming
## 6                  3.8            0.0               1.4        Social Media
##   Family_Communication Weekend_Usage_Hours Addiction_Level
## 1                    4                 8.7            10.0
## 2                    2                 5.3            10.0
## 3                    6                 5.7             9.2
## 4                    8                 3.0             9.8
## 5                   10                 3.7             8.6
## 6                    7                 6.0             8.8

Line plot with grouping

data %>% group_by(Age, Gender) %>%
  summarise(mean_usage = mean(Daily_Usage_Hours, na.rm = TRUE), .groups = "drop") %>%
  ggplot(aes(x = Age, y = mean_usage, color = Gender, group = Gender)) +
  geom_line(linewidth = 1) +
  labs(title = "Average Daily Usage by Age and Gender",
       x = "Age", y = "Average Daily Usage (Hours)")

The line plot shows average daily phone usage (in hours) for ages 13 to 19, separated by gender. For males, usage starts low at age 13 and gradually rises, peaking around age 18, then drops slightly at 19. Females start with higher usage at younger ages, dip a little around 17, and then increase sharply at 19. Overall, male usage rises steadily over the teen years, while female usage goes up and down but ends highest at age 19. In other words, younger females use their phones more than younger males, males catch up in the middle teen years, and females rise again at the end.
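
A table of group means with group sizes can help when reading a grouped line plot like this one, since a mean based on only a few rows is noisier than one based on many. The sketch below is a minimal example, assuming the column names shown in `head(data)`. The helper name `usage_by_age_gender` and the toy rows are hypothetical and are not taken from the Kaggle file.

```r
library(dplyr)

# Hypothetical helper: mean usage plus group size for each Age x Gender cell.
usage_by_age_gender <- function(df) {
  df %>%
    group_by(Age, Gender) %>%
    summarise(
      n = n(),
      mean_usage = mean(Daily_Usage_Hours, na.rm = TRUE),
      .groups = "drop"
    )
}

# Toy rows for illustration only; these are NOT values from the dataset.
toy <- data.frame(
  Age = c(13, 13, 13, 14, 14),
  Gender = c("Female", "Female", "Male", "Male", "Male"),
  Daily_Usage_Hours = c(4.0, 6.0, 3.0, 5.0, 7.0)
)

toy_summary <- usage_by_age_gender(toy)
print(toy_summary)
```

Calling `usage_by_age_gender(data)` on the real data frame should return the plotted means, with `n` showing how many students stand behind each point.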

Question 2

Do a simple statistical calculation (e.g. mean, standard deviation, mode, median, etc.) with R that aligns with your hypothesis and plot/report results. Explain what the result means in terms of your question.

Calculate mean and SD of academic performance by addiction level

# Convert Addiction_Level from float to integer (as.integer() truncates, so 9.8 becomes 9)
data <- data %>% mutate(Addiction_Level = as.integer(Addiction_Level))
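
The conversion above relies on `as.integer()`, which drops the fractional part rather than rounding. The short sketch below shows the difference on the six `Addiction_Level` values printed by `head(data)`. The variable names are illustrative only.

```r
# Addiction_Level values from the first six rows shown by head(data)
levels_raw <- c(10.0, 10.0, 9.2, 9.8, 8.6, 8.8)

# as.integer() truncates toward zero
truncated <- as.integer(levels_raw)

# round() first, then convert, to get the nearest whole number instead
rounded <- as.integer(round(levels_raw))

print(truncated)  # 10 10 9 9 8 8
print(rounded)    # 10 10 9 10 9 9
```

Either choice is defensible for grouping, but the two produce different group memberships, so it is worth stating which one was used.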

# Calculate the mean and SD of Academic_Performance for each addiction level
academic_stats <- data %>%
  group_by(Addiction_Level) %>%
  summarise(
    mean_score = mean(Academic_Performance, na.rm = TRUE),
    sd_score = sd(Academic_Performance, na.rm = TRUE)
  )

print(academic_stats)
## # A tibble: 10 × 3
##    Addiction_Level mean_score sd_score
##              <int>      <dbl>    <dbl>
##  1               1       77.3     5.69
##  2               2       76      14.2 
##  3               3       72.6    11.5 
##  4               4       74.8    14.9 
##  5               5       73.9    14.0 
##  6               6       74.9    15.4 
##  7               7       76.5    14.7 
##  8               8       72.6    14.4 
##  9               9       75.6    15.3 
## 10              10       75.1    14.6
# Plot mean academic performance by addiction level with error bars
ggplot(academic_stats, aes(x = Addiction_Level, y = mean_score, fill = Addiction_Level)) +
  geom_bar(stat = "identity") +
  geom_errorbar(aes(ymin = mean_score - sd_score, ymax = mean_score + sd_score), width = 0.2) +
  labs(title = "Academic Performance by Phone Addiction Level",
       x = "Phone Addiction Level",
       y = "Mean Academic Performance") +
  theme_minimal()

The table and bar chart show no clear decline in average academic performance as addiction level rises: the group means stay between 72.6 and 77.3 and move up and down rather than falling steadily. The standard deviations (mostly 11 to 15 points, shown as the error bars) are several times larger than the differences between group means, so scores vary far more within each addiction level than between levels. On this summary alone, the data do not support a clear negative relationship between phone addiction and academic performance.
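
One rough way to check the trend is to work directly from the printed `academic_stats` table. The sketch below copies the rounded means and SDs from that output and compares the spread of the group means with a typical within-group SD. It also fits an unweighted line through the ten means, which ignores group sizes and should be read as an indicator only.

```r
# Group means and SDs copied from the printed academic_stats tibble (rounded)
level      <- 1:10
mean_score <- c(77.3, 76, 72.6, 74.8, 73.9, 74.9, 76.5, 72.6, 75.6, 75.1)
sd_score   <- c(5.69, 14.2, 11.5, 14.9, 14.0, 15.4, 14.7, 14.4, 15.3, 14.6)

# Spread of the group means versus a typical within-group SD
mean_range <- diff(range(mean_score))
typical_sd <- median(sd_score)

# Unweighted least-squares slope of mean score on addiction level
slope <- unname(coef(lm(mean_score ~ level))[2])

cat("Range of group means:", mean_range, "\n")
cat("Median within-group SD:", typical_sd, "\n")
cat("Slope (points per level):", round(slope, 3), "\n")
```

A range of group means that is small relative to the within-group SD, together with a slope close to zero, would suggest that addiction level explains little of the variation in scores.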

Question 3

We will apply statistical tests to your dataset to gain insight in answering your questions. Start by first applying a correlation or regression analysis to detect a relationship. Explain how the relationship aligns with your questions.

Correlation

# Truncate Daily_Usage_Hours to whole hours (as.integer() drops the fractional part)
data <- data %>% mutate(Daily_Usage_Hours = as.integer(Daily_Usage_Hours))


cor_test <- cor.test(data$Daily_Usage_Hours, data$Addiction_Level)
cor_test
## 
##  Pearson's product-moment correlation
## 
## data:  data$Daily_Usage_Hours and data$Addiction_Level
## t = 41.484, df = 2998, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.5806507 0.6261436
## sample estimates:
##       cor 
## 0.6038887

Teens who spend more hours on their phones tend to have higher addiction levels. The correlation of 0.604 (95% CI 0.581 to 0.626) indicates a moderate positive relationship, and the very small p-value (< 2.2e-16) means a correlation this strong would be very unlikely if the true correlation were zero. Note that both variables were truncated to whole numbers before the test.
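
The numbers in the `cor.test()` output are linked by standard formulas, and reproducing them can make the result easier to interpret. The sketch below uses only the reported `r` and `df`. It computes the share of variance explained (`r^2`), the t statistic, and the Fisher z confidence interval. The variable names are illustrative.

```r
r  <- 0.6038887   # sample correlation reported by cor.test()
df <- 2998        # degrees of freedom reported by cor.test() (n - 2)
n  <- df + 2

# Share of variance in one variable linearly associated with the other
r_squared <- r^2

# t statistic for testing whether the true correlation is zero
t_stat <- r * sqrt(df) / sqrt(1 - r^2)

# 95% confidence interval via the Fisher z transformation
z  <- atanh(r)
se <- 1 / sqrt(n - 3)
ci <- tanh(z + c(-1, 1) * qnorm(0.975) * se)

cat("r squared:", round(r_squared, 3), "\n")
cat("t statistic:", round(t_stat, 3), "\n")
cat("95% CI:", round(ci, 4), "\n")
```

The `r^2` value is the more conservative way to describe the strength of the link: it states how much of the variation in addiction level is associated with usage hours.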

Question 4

For the second statistical test, select a numerical column that you want to check. First, plot a histogram of it and discuss its distribution. As you did with the above graph, make sure the histogram is properly annotated.

Histogram

ggplot(data, aes(x = Time_on_Social_Media)) +
  geom_histogram(binwidth = 0.5, fill = "lightblue", color = "black") +
  labs(title = "Distribution of Social Media Usage",
       x = "Time on Social Media (Hours)", y = "Count of Students")

The histogram shows the distribution of daily time spent on social media. Most students spend a few hours per day (the tallest bars are on the left), while fewer students spend many hours (the shorter bars on the right). The distribution is therefore right-skewed, with a small group of heavy users forming the long right tail.
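
Skew can also be checked numerically instead of by eye. The sketch below defines a moment-based sample skewness and applies it to two toy vectors. The function name `sample_skewness` and the vectors are hypothetical and are not taken from the dataset.

```r
# Moment-based sample skewness: third central moment over variance^(3/2)
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

# Toy vectors for illustration only (not from the dataset)
right_skewed <- c(1, 1, 1, 2, 2, 3, 8)
symmetric    <- c(1, 2, 3, 4, 5)

print(sample_skewness(right_skewed))               # positive for a right tail
print(sample_skewness(symmetric))                  # 0 for a symmetric vector
print(mean(right_skewed) > median(right_skewed))   # TRUE: the tail pulls the mean up
```

Applied to the real column, `sample_skewness(data$Time_on_Social_Media)` would give a single number to report next to the histogram, with positive values indicating a right tail.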

Question 5

Divide your dataset into two groups of rows based on another column that matters for your question and apply a test that we discussed in class (t-test, ANOVA, …) to test for significant differences between the two groups. Make sure the test you selected is consistent with the distribution that you observed earlier. Show the code and briefly explain the results.

Divide dataset into two groups

data <- data %>%
  mutate(Addiction_Group = ifelse(Addiction_Level <= 5, "Low", "High"))

# Check number of students in each group

table(data$Addiction_Group)
## 
## High  Low 
## 2790  210

Apply t-test to compare academic performance between the two groups

t_test_result <- t.test(Academic_Performance ~ Addiction_Group, data = data)
print(t_test_result)
## 
##  Welch Two Sample t-test
## 
## data:  Academic_Performance by Addiction_Group
## t = 0.90047, df = 246.65, p-value = 0.3687
## alternative hypothesis: true difference in means between group High and group Low is not equal to 0
## 95 percent confidence interval:
##  -1.057469  2.838728
## sample estimates:
## mean in group High  mean in group Low 
##           75.00968           74.11905

The mean academic performance of students in the high-addiction group (75.01, n = 2790) is slightly higher than in the low-addiction group (74.12, n = 210), but the difference is not statistically significant (Welch t = 0.90, p = 0.37). The 95% confidence interval for the difference (-1.06 to 2.84) includes 0, so we cannot conclude that academic performance differs between the two addiction groups in this dataset.
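
The pieces of the Welch t-test output fit together arithmetically, which gives a quick consistency check on the reported numbers. The sketch below uses only the values printed by `t.test()`. It recovers the standard error of the difference from the t statistic and rebuilds the 95% confidence interval. The variable names are illustrative.

```r
# Values copied from the printed t.test() output
mean_high <- 75.00968
mean_low  <- 74.11905
t_stat    <- 0.90047
df        <- 246.65   # Welch-adjusted degrees of freedom

# Difference in means (High minus Low)
diff_means <- mean_high - mean_low

# t = difference / standard error, so the standard error follows directly
se_diff <- diff_means / t_stat

# 95% confidence interval: difference +/- critical t value * standard error
half_width <- qt(0.975, df) * se_diff
ci <- diff_means + c(-1, 1) * half_width

cat("Difference in means:", round(diff_means, 3), "\n")
cat("Standard error:", round(se_diff, 3), "\n")
cat("95% CI:", round(ci, 3), "\n")
```

A standard error of roughly one point against a difference of under one point is why the interval comfortably spans zero.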