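library(dplyr)     # %>% and the data-manipulation verbs used below
library(ggplot2)   # plots in sections 6 and 7
library(corrplot)  # corrplot()
library(ggpubr)    # stat_compare_means()
library(GGally)    # ggpairs()
data <- read.csv("phone_usage_india.csv")  # load the CSV (base R reader)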

———————————————————

1. Understanding data

———————————————————

1.1 What are the column names and data types in the dataset?

str(data)  # Displays structure of the dataset

## 'data.frame':    17686 obs. of  16 variables:
##  $ User_ID          : chr  "U00001" "U00002" "U00003" "U00004" ...
##  $ Age              : int  53 60 37 32 16 21 57 56 46 44 ...
##  $ Gender           : chr  "Male" "Other" "Female" "Male" ...
##  $ Location         : chr  "Mumbai" "Delhi" "Ahmedabad" "Pune" ...
##  $ Phone_Brand      : chr  "Vivo" "Realme" "Nokia" "Samsung" ...
##  $ OS               : chr  "Android" "iOS" "Android" "Android" ...
##  $ screen_time      : num  3.7 9.2 4.5 11 2.2 5.4 6 3.1 5.3 9.9 ...
##  $ data_usage       : num  0.918 19.137 8.65 25.836 3.133 ...
##  $ call_duration    : num  37.9 13.7 66.8 156.2 236.2 ...
##  $ apps_installed   : int  104 169 96 146 86 25 123 188 194 84 ...
##  $ social_media_time: num  1.66 27.64 3.88 28.57 5.67 ...
##  $ ecommerce_spend  : num  -0.1 10.91 -13.58 17.71 6.48 ...
##  $ streaming_time   : num  5.2 5.1 1.7 3.2 3.4 0.6 2.9 5.2 6.1 7.6 ...
##  $ gaming_time      : num  8.97 22.07 -3.02 32.15 3.66 ...
##  $ recharge_cost    : num  0.136 -2.941 5.844 6.301 18.46 ...
##  $ primary_use      : chr  "Education" "Gaming" "Entertainment" "Entertainment" ...

Interpretation: Shows a data frame with 17,686 observations (rows) and 16 variables (columns). It details the name and data type of each column, such as User_ID as character, Age as integer, and screen_time as numeric. This provides a foundational understanding of the data set’s structure and the nature of the information it contains.

1.2 Are there any missing values in the dataset? If so, identify which columns have missing data?

colSums(is.na(data))  # Count missing values per column

##           User_ID               Age            Gender          Location 
##                 0                 0                 0                 0 
##       Phone_Brand                OS       screen_time        data_usage 
##                 0                 0                 0                 0 
##     call_duration    apps_installed social_media_time   ecommerce_spend 
##                 0                 0                 0                 0 
##    streaming_time       gaming_time     recharge_cost       primary_use 
##                 0                 0                 0                 0

Interpretation: Every column has a missing-value count of 0, so the dataset contains no NA values in any of its 16 columns.
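A zero NA count does not rule out other data-quality issues: the str() output in 1.1 already shows negative values in ecommerce_spend, gaming_time and recharge_cost. The following is a minimal sketch of how such values could be counted. It uses only the first four values of each column as printed by str() (rounded as shown there), not the full CSV.

```r
# First four values of each column, copied from the str() output in 1.1
first_rows <- data.frame(
  ecommerce_spend = c(-0.1, 10.91, -13.58, 17.71),
  gaming_time     = c(8.97, 22.07, -3.02, 32.15),
  recharge_cost   = c(0.136, -2.941, 5.844, 6.301)
)

# Count negative entries per numeric column
neg_counts <- colSums(first_rows < 0)
print(neg_counts)
```

On the full dataset the same idea would be colSums(data[sapply(data, is.numeric)] < 0); how to treat the negative values is a separate decision that this sketch does not make.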

1.3 What are the unique values in categorical columns like phone brand, OS, and primary usage?

unique(data$Phone_Brand)  # Unique phone brands

##  [1] "Vivo"         "Realme"       "Nokia"        "Samsung"      "Xiaomi"      
##  [6] "Oppo"         "Apple"        "Google Pixel" "Motorola"     "OnePlus"

unique(data$OS)           # Unique operating systems

## [1] "Android" "iOS"

unique(data$primary_use) # Unique primary usage categories

## [1] "Education"     "Gaming"        "Entertainment" "Social Media" 
## [5] "Work"

Interpretation: The unique values in the key categorical columns are as follows: ten phone brands (Vivo, Realme, Nokia, Samsung, Xiaomi, Oppo, Apple, Google Pixel, Motorola, and OnePlus); two operating systems (Android and iOS); and five primary use categories (Education, Gaming, Entertainment, Social Media, and Work).

———————————————————

2. Data Extraction & Filtering

———————————————————

2.1 Identify the top 10 most used phone brands.

top_10_phoneBrands<- data %>%
  count(Phone_Brand, name = "Count") %>%  # Count occurrences and rename column
  arrange(desc(Count)) %>%  # Sort in descending order
  head(10)  # Select top 10 brands
print(top_10_phoneBrands)

##     Phone_Brand Count
## 1         Nokia  1816
## 2       OnePlus  1807
## 3        Xiaomi  1803
## 4          Vivo  1797
## 5         Apple  1775
## 6       Samsung  1764
## 7        Realme  1762
## 8  Google Pixel  1729
## 9      Motorola  1717
## 10         Oppo  1716

Interpretation: The table ranks the phone brands by number of users. Because the dataset contains only ten brands, the top ten is the complete list. Nokia is the most frequent with 1816 users, closely followed by OnePlus (1807) and Xiaomi (1803); Vivo (1797) and Apple (1775) come next, and Oppo is last with 1716. First and last place differ by only 100 users, so no brand dominates the dataset.

2.2 Extract users who use mobile devices for more than or equal to 8 hours daily.

heavy_users <- data %>% filter(screen_time >= 8) %>%  # >= matches "more than or equal to 8 hours"
  arrange(desc(screen_time)) %>% head(10)             # keep the 10 highest for display
print(heavy_users)

##    User_ID Age Gender  Location  Phone_Brand      OS screen_time data_usage
## 1   U00172  39 Female   Chennai Google Pixel Android          12   25.15700
## 2   U00216  21   Male    Jaipur Google Pixel     iOS          12   21.13470
## 3   U00689  24   Male   Kolkata       Xiaomi     iOS          12   28.30320
## 4   U00735  35 Female    Jaipur      Samsung Android          12   23.05077
## 5   U00747  35  Other   Kolkata     Motorola Android          12   24.20725
## 6   U00767  46  Other    Jaipur      Samsung Android          12   29.12187
## 7   U00932  37 Female Bangalore        Nokia Android          12   24.77088
## 8   U00935  55  Other    Mumbai      Samsung     iOS          12   24.58675
## 9   U01239  40  Other    Jaipur Google Pixel Android          12   23.37282
## 10  U01375  42   Male    Jaipur        Nokia Android          12   20.47090
##    call_duration apps_installed social_media_time ecommerce_spend
## 1           35.9             16          18.73935       16.223577
## 2          211.5            113          27.73469        9.036460
## 3           55.5            183          19.15537        1.995935
## 4           48.3             14          25.74125       44.907198
## 5           33.0            140          28.65438       33.704400
## 6           44.2            135          31.74048       22.863113
## 7          260.5             97          17.93232       17.735024
## 8          209.7            190          30.15895       29.350123
## 9           59.9             38          28.45537       30.014362
## 10          99.6             80          32.92130       42.675533
##    streaming_time gaming_time recharge_cost   primary_use
## 1             0.8    19.76892    -0.5326361 Entertainment
## 2             1.3    18.79201    11.0809628          Work
## 3             1.7    27.98732     4.7164114 Entertainment
## 4             3.7    10.47650     2.7543381  Social Media
## 5             3.6    19.65111     6.6479657        Gaming
## 6             5.9    19.39338     3.4336422  Social Media
## 7             3.5    28.73611    28.3705046     Education
## 8             4.9    10.49492    29.5416612 Entertainment
## 9             3.5    18.06149    13.8064929        Gaming
## 10            4.3    23.22885    16.2915746  Social Media

Interpretation: The filter keeps only the heavy users by screen time, and the table shows the ten with the highest values. All ten are at 12 hours, the highest screen time in the table.
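The task wording is "more than or equal to 8 hours", so whether the comparison is strict or inclusive matters for users at exactly 8 hours. A small sketch with hypothetical screen-time values (not taken from the dataset) shows the difference:

```r
# Hypothetical screen times around the 8-hour cut-off
screen_time <- c(7.9, 8.0, 8.1, 12.0)

strict    <- sum(screen_time > 8)   # excludes a user at exactly 8 hours
inclusive <- sum(screen_time >= 8)  # includes a user at exactly 8 hours

print(c(strict = strict, inclusive = inclusive))
```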

2.3 Identify users who primarily engage in gaming and analyze their data usage.

gaming_users <- data %>%
  filter(primary_use == "Gaming") %>%
  select(User_ID, Phone_Brand, OS, data_usage, screen_time)%>%head(10)

print(gaming_users)

##    User_ID Phone_Brand      OS data_usage screen_time
## 1   U00002      Realme     iOS  19.136572         9.2
## 2   U00022      Xiaomi Android  25.068779        11.7
## 3   U00024        Vivo Android   5.573529         1.4
## 4   U00027        Vivo Android  16.109427         9.2
## 5   U00034    Motorola Android  20.936997        11.4
## 6   U00060      Xiaomi     iOS   7.181936         3.0
## 7   U00070        Oppo     iOS  12.965341         5.3
## 8   U00073    Motorola     iOS  11.426739         5.2
## 9   U00074      Realme Android  14.704642         6.9
## 10  U00076        Oppo Android  17.769887         6.1

Interpretation: The table shows the first ten users whose primary use is gaming, with their phone brand and operating system, along with their data usage in GB and screen time in hours per day.

———————————————————

3. Grouping and Summarization

———————————————————

3.1 Grouping users by city, summarize the average data usage and screen time in each location?

data %>%
  group_by(Location) %>%
  summarise(
    Avg_Data_Usage = mean(data_usage, na.rm = TRUE),
    Avg_Screen_Time = mean(screen_time, na.rm = TRUE)
  )

## # A tibble: 10 × 3
##    Location  Avg_Data_Usage Avg_Screen_Time
##    <chr>              <dbl>           <dbl>
##  1 Ahmedabad           13.0            6.56
##  2 Bangalore           13.2            6.57
##  3 Chennai             13.0            6.53
##  4 Delhi               13.0            6.47
##  5 Hyderabad           13.2            6.60
##  6 Jaipur              13.2            6.65
##  7 Kolkata             12.9            6.41
##  8 Lucknow             13.2            6.58
##  9 Mumbai              12.8            6.42
## 10 Pune                13.3            6.67

Interpretation: The table shows the average data usage and average screen time for each location. Pune has the highest average data usage (13.3) and the highest average screen time (6.67), whereas Mumbai has the lowest average data usage (12.8) and Kolkata the lowest average screen time (6.41).

3.2 Identify which city has the highest number of social media users.

data %>%
  group_by(Location) %>%
  summarise(Total_SocialMedia_Usage = sum(social_media_time, na.rm = TRUE)) %>%
  arrange(desc(Total_SocialMedia_Usage)) %>%
  head(1)

## # A tibble: 1 × 2
##   Location Total_SocialMedia_Usage
##   <chr>                      <dbl>
## 1 Jaipur                    26674.

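Interpretation: The output shows that Jaipur has the highest total social media time (about 26,674 when summed over its users). This measures total time rather than a count of users, so it is only a proxy for the number of social media users.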
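The code above sums social_media_time per city. If the question is read literally as a count of users, one possible alternative is to count the users whose primary_use is "Social Media" in each city. The sketch below uses base R on a small hypothetical data frame (the rows are invented for illustration); only the column names come from the dataset.

```r
# Hypothetical rows; only the column names match the real data
toy <- data.frame(
  Location    = c("Jaipur", "Pune", "Jaipur", "Delhi", "Pune", "Pune"),
  primary_use = c("Social Media", "Social Media", "Gaming",
                  "Work", "Social Media", "Education"),
  stringsAsFactors = FALSE
)

# Keep social media users, then count them per city
sm_users  <- toy[toy$primary_use == "Social Media", ]
sm_counts <- sort(table(sm_users$Location), decreasing = TRUE)
top_city  <- names(sm_counts)[1]

print(sm_counts)
print(top_city)
```

The two definitions can disagree: a city with fewer but heavier users can lead on total time without leading on user count.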

3.3 Determine which age group has the highest mobile usage.

data %>%
  group_by(Age) %>%
  summarise(Avg_Usage = mean(screen_time, na.rm = TRUE)) %>%
  arrange(desc(Avg_Usage))%>%head(2)

## # A tibble: 2 × 2
##     Age Avg_Usage
##   <int>     <dbl>
## 1    46      6.93
## 2    20      6.86

Interpretation: The output lists the two ages with the highest average screen time: users aged 46 average about 6.93 hours per day and users aged 20 about 6.86 hours. Because the grouping is by single year of age rather than by age band, this identifies individual ages, not age groups.
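To answer the question at the level of age groups rather than single ages, the ages can first be binned and the average taken per bin. The following base R sketch assumes ten-year bands starting at 15 and uses hypothetical rows; the band edges and the data are illustrative choices, not part of the original analysis.

```r
# Hypothetical users; only the column names match the real data
toy <- data.frame(
  Age         = c(16, 20, 24, 25, 34, 46, 54, 60),
  screen_time = c(2, 4, 6, 8, 10, 5, 7, 3)
)

# right = FALSE makes each band include its lower edge: [15,25), [25,35), ...
toy$Age_Band <- cut(
  toy$Age,
  breaks = c(15, 25, 35, 45, 55, Inf),
  labels = c("15-24", "25-34", "35-44", "45-54", "55+"),
  right  = FALSE
)

# Average screen time per band, highest first (empty bands are dropped)
band_means <- aggregate(screen_time ~ Age_Band, data = toy, FUN = mean)
band_means <- band_means[order(-band_means$screen_time), ]
print(band_means)
```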

———————————————————

4. Sorting and Ranking Data

———————————————————

4.1 Rank mobile brands based on average screen time.

data %>%
  group_by(Phone_Brand) %>%
  summarise(Avg_Screen_Time = mean(screen_time, na.rm = TRUE)) %>%
  arrange(desc(Avg_Screen_Time))

## # A tibble: 10 × 2
##    Phone_Brand  Avg_Screen_Time
##    <chr>                  <dbl>
##  1 Xiaomi                  6.64
##  2 Realme                  6.64
##  3 Nokia                   6.61
##  4 Samsung                 6.60
##  5 Motorola                6.56
##  6 OnePlus                 6.55
##  7 Apple                   6.52
##  8 Vivo                    6.51
##  9 Google Pixel            6.43
## 10 Oppo                    6.39

Interpretation: Xiaomi and Realme users have the highest average screen time, both about 6.64 hours, while Oppo users have the lowest at 6.39 hours. The full range across brands is only about 0.25 hours.

4.2 Identify the gender with the highest mobile dependency.

data %>%
  group_by(Gender) %>%
  summarise(Avg_Screen_Time = mean(screen_time, na.rm = TRUE)) %>%
  arrange(desc(Avg_Screen_Time))

## # A tibble: 3 × 2
##   Gender Avg_Screen_Time
##   <chr>            <dbl>
## 1 Male              6.60
## 2 Other             6.53
## 3 Female            6.50

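Interpretation: Male users have the highest average screen time at 6.60 hours per day, ahead of Other (6.53) and Female (6.50) users. The gap is only about 0.1 hours, so the difference in mobile dependency between genders is small.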

4.3 Find the top 5 users with the highest mobile data usage.

data %>%
  arrange(desc(data_usage)) %>%
  head(5)

##   User_ID Age Gender  Location Phone_Brand      OS screen_time data_usage
## 1  U17656  21  Other    Mumbai        Oppo Android        11.5   34.74822
## 2  U16375  25  Other   Lucknow      Realme     iOS        11.9   32.30989
## 3  U05993  26   Male Ahmedabad      Realme Android        11.7   31.87627
## 4  U14936  47  Other Bangalore     Samsung     iOS        11.4   31.84143
## 5  U09055  45 Female   Chennai       Nokia Android        12.0   31.32909
##   call_duration apps_installed social_media_time ecommerce_spend streaming_time
## 1         128.9            181          12.43514       35.103541            5.3
## 2         140.9            152          36.53769        5.270189            2.3
## 3         192.7            173          32.78599        6.784371            4.4
## 4         296.3             99          24.80781       35.344356            0.9
## 5         132.7            142          21.44678       22.466015            5.6
##   gaming_time recharge_cost   primary_use
## 1    18.62551      16.98026 Entertainment
## 2     9.45858      12.47200          Work
## 3    31.52751      20.72110          Work
## 4    30.57840      21.38560 Entertainment
## 5    20.22828      11.42800 Entertainment

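Interpretation: This data frame lists the five users with the highest data usage, together with all their other details. The top user (U17656) used about 34.7 GB, and all five have screen times of at least 11.4 hours.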

———————————————————

5. Feature Engineering:

———————————————————

5.1 Create a new feature “Usage Category” based on screen time: Light User (≤ 2 hours), Moderate User (2–6 hours) and Heavy User (>6 hours)

data <- data %>%
  mutate(Usage_Category = case_when(
    screen_time <= 2 ~ "Light User",
    screen_time > 2 & screen_time <= 6 ~ "Moderate User",
    screen_time > 6 ~ "Heavy User"
  ))
unique(data$Usage_Category)

## [1] "Moderate User" "Heavy User"    "Light User"

Interpretation: A new column, Usage_Category, is added to the dataset to categorise users by screen time. Users with at most 2 hours are Light Users, those with more than 2 and up to 6 hours are Moderate Users, and those with more than 6 hours are Heavy Users.
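The same three categories can also be produced with base R's cut(), which makes the boundary handling explicit and can return an ordered factor so that the categories sort as Light < Moderate < Heavy. This is an alternative sketch on hypothetical screen-time values, not the method used above.

```r
# Hypothetical screen times, including the boundary values 2 and 6
screen_time <- c(0.5, 2, 2.1, 6, 6.1, 12)

# right = TRUE gives (-Inf,2], (2,6], (6,Inf), matching the rules in 5.1
usage <- cut(
  screen_time,
  breaks = c(-Inf, 2, 6, Inf),
  labels = c("Light User", "Moderate User", "Heavy User"),
  right  = TRUE,
  ordered_result = TRUE
)

print(data.frame(screen_time, usage))
```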

5.2 Compute “Total Mobile Interaction” as the sum of screen time, call duration, and data usage.

data <- data %>%
  mutate(Total_Mobile_Interaction = screen_time + call_duration + data_usage)

summary(data$Total_Mobile_Interaction)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.579  97.653 170.450 171.027 243.048 339.541

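Interpretation: A new column, Total_Mobile_Interaction, is appended to represent the user's overall interaction with the phone in terms of screen time, call duration and data usage. The three components are in different units (hours, minutes and GB), so the sum is dominated by call duration and should be read as a rough composite rather than a physical quantity.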
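Because the three components are on very different scales, one possible variation (an assumption of this sketch, not something the task asks for) is to standardise each column before summing, so that each contributes comparably. The rows below are hypothetical.

```r
# Hypothetical users: hours, minutes and GB respectively
toy <- data.frame(
  screen_time   = c(2, 6, 10),
  call_duration = c(30, 150, 270),
  data_usage    = c(4, 12, 20)
)

# Share of the raw sum that comes from call duration alone
raw_share <- toy$call_duration / rowSums(toy)

# z-score each column, then add the standardised values row by row
interaction_z <- as.numeric(rowSums(scale(toy)))

print(round(raw_share, 3))
print(interaction_z)
```

In this toy example call duration makes up more than 80% of every raw total, while the standardised score weights the three measures equally.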

5.3 Determine the percentage of users dedicated to each primary use (Education, Gaming, Entertainment, Social Media, Work).

data %>%
  group_by(primary_use) %>%
  summarise(User_Count = n()) %>%
  mutate(Percentage = (User_Count / nrow(data)) * 100)

## # A tibble: 5 × 3
##   primary_use   User_Count Percentage
##   <chr>              <int>      <dbl>
## 1 Education           3601       20.4
## 2 Entertainment       3451       19.5
## 3 Gaming              3576       20.2
## 4 Social Media        3501       19.8
## 5 Work                3557       20.1

Interpretation: The table shows the number and percentage of users for each primary use of the phone. The five categories are close to evenly split (19.5% to 20.4%), with Education having the highest share at 20.4%.
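The percentages can be checked directly from the counts in the table above. The sketch below re-enters those counts by hand and recomputes the shares; on the full dataset, prop.table(table(data$primary_use)) would give the same proportions in base R.

```r
# Counts copied from the table above
counts <- c(Education = 3601, Entertainment = 3451, Gaming = 3576,
            `Social Media` = 3501, Work = 3557)

total <- sum(counts)                       # should equal the number of rows
pct   <- round(100 * counts / total, 1)    # percentage per primary use

print(total)
print(pct)
```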

———————————————————

6. Data Visualisation

———————————————————

Bar Plot:

6.1 How does screen time vary across different locations?

avg_screen_time <- data %>%
  group_by(Location) %>%
  summarise(avg_screen_time = mean(screen_time, na.rm = TRUE))

ggplot(avg_screen_time, aes(x = avg_screen_time, y = Location, fill = Location)) +
  geom_bar(stat = "identity") +
  labs(title = "Average Screen Time by Location", x = "Average Screen Time (Hours)", y = "Location") +
  theme_minimal()

Interpretation:

-> Pune has the highest average screen time (6.67 hours), followed closely by Jaipur (6.65) and Hyderabad (6.60).

-> Kolkata (6.41) and Mumbai (6.42) have the lowest averages, but the overall difference among all cities is very small.

-> All locations have average screen times between 6 and 7 hours, indicating similar behavior across cities.
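The ranking and the size of the spread can be read off the averages printed in section 3.1. A short sketch that re-enters those rounded averages by hand and sorts them:

```r
# Average screen time per city, copied from the table in 3.1 (rounded)
avg_screen <- c(Ahmedabad = 6.56, Bangalore = 6.57, Chennai = 6.53,
                Delhi = 6.47, Hyderabad = 6.60, Jaipur = 6.65,
                Kolkata = 6.41, Lucknow = 6.58, Mumbai = 6.42, Pune = 6.67)

ranked         <- sort(avg_screen, decreasing = TRUE)
top3           <- names(ranked)[1:3]
spread_minutes <- diff(range(avg_screen)) * 60  # highest minus lowest, in minutes

print(ranked)
print(spread_minutes)
```

The gap between the highest and lowest city is roughly a quarter of an hour per day.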

…………………………………………………………………….

Pie Chart:

6.2 What is the percentage of users in each ‘Age Group’?

# right = FALSE makes the bins [15,25), [25,35), ..., so the labels name 15-24, 25-34, ...
data$Age_Group <- cut(data$Age, breaks = c(15, 25, 35, 45, 55, Inf), labels = c("15-24", "25-34", "35-44", "45-54", "55+"), right = FALSE)
age_summary <- data %>%
  group_by(Age_Group) %>%
  summarise(count = n()) %>%
  mutate(percentage = count / sum(count) * 100, label = paste0(Age_Group, "\n", round(percentage, 1), "%"))

ggplot(age_summary, aes(x = "", y = count, fill = Age_Group)) +
  geom_col(width = 1, color = "white") +
  coord_polar(theta = "y") +
  geom_text(aes(label = label), position = position_stack(vjust = 0.5), size = 4, color = "white") +
  labs(title = "User Distribution by Age Group", fill = "Age Group") +
  theme_void()

Interpretation:

-> The 45-54 age group forms the largest segment at 22.2%.

-> Age groups 15-24, 25-34, and 35-44 each contribute around 21%.

-> The 55+ age group has the smallest share at 13.2%.
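The call to cut() above uses right = FALSE, which makes each bin include its lower break and exclude its upper one. The sketch below applies the same breaks to a few hypothetical ages on the bin edges and lets cut() print its default interval labels, so that the membership of each bin is visible.

```r
# Hypothetical ages sitting on or next to the bin edges
ages <- c(24, 25, 54, 55)

# Same breaks as above; no custom labels, so R shows the actual intervals
bins <- cut(ages, breaks = c(15, 25, 35, 45, 55, Inf), right = FALSE)

print(data.frame(ages, bin = as.character(bins)))
```

An age of exactly 25 falls in the second bin and an age of exactly 55 in the last one.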

…………………………………………………………………….

6.3 What is the distribution of Phone Brands among users?

phone_brand_data <- data %>%
  group_by(Phone_Brand) %>%
  summarise(count = n())

ggplot(phone_brand_data, aes(x = "", y = count, fill = Phone_Brand)) +
  geom_bar(stat = "identity", width = 1) +
  coord_polar(theta = "y") +
  geom_text(aes(label = count), position = position_stack(vjust = 0.5), size = 3) +
  labs(title = "Distribution of Phone brands Among Users", fill = "Phone_Brand") +
  theme_void() +
  theme(legend.position = "top")

Interpretation:

-> The number of users for each phone brand is fairly balanced, with all brands ranging from 1716 to 1816 users.

-> Nokia has the highest user count at 1816, closely followed by OnePlus (1807) and Xiaomi (1803).

-> Motorola (1717) and Oppo (1716) have the lowest user counts.

…………………………………………………………………….

Histogram:

6.4 What is the distribution of ‘Talk-time (minutes)’ by gender?

ggplot(data, aes(x = call_duration, fill = Gender)) +
  geom_histogram(binwidth = 50, position = "dodge") +
  labs(title = "Distribution of Talk-time by Gender", x = "Talk-time (minutes)", y = "Number of Users", fill = "Gender") +
  scale_fill_manual(values = c("Male" = "blue", "Female" = "pink", "Other" = "green")) +
  theme_minimal()

Interpretation:

-> Between roughly 50 and 250 minutes, the number of users is high for all genders, close to 900-1000 users per category.

-> Users in the Other gender category (green bars) slightly outnumber Male and Female users at many points, especially around 100 minutes and 200 minutes.

…………………………………………………………………….

Line Chart:

6.5 How does the average recharge amount vary with the age of the users?

average_recharge_cost_by_age <- data %>%
  group_by(Age) %>%
  summarise(AverageRechargeCost = mean(recharge_cost))

ggplot(average_recharge_cost_by_age, aes(x = Age, y = AverageRechargeCost)) +
  geom_line(color = 'darkblue') +
  geom_point(color = 'red') +
  labs(title = "Relationship Between Age and Average Recharge Cost", x = "Age", y = "Average Recharge Cost") +
  theme_minimal()

Interpretation:

-> Overall, the average recharge cost fluctuates.

-> There’s no strong increasing or decreasing trend, indicating high variability.

-> Younger users (15–20 years) and users around 40–45 years seem to have slightly higher average recharge costs compared to others.

-> Ages around 30–35 and 50–55 show lower recharge costs.

———————————————————

7. Advanced Calculations

———————————————————

7.1 Correlation Analysis

7.1.1 Correlation between screen time and other activities

time_cor <- data[,c('screen_time', 'social_media_time', 'streaming_time', 'gaming_time')] 
cor_matrix <- cor(time_cor)
print(cor_matrix)

##                   screen_time social_media_time streaming_time gaming_time
## screen_time        1.00000000        0.71050388    -0.02174779  0.58116792
## social_media_time  0.71050388        1.00000000    -0.02204071  0.41791978
## streaming_time    -0.02174779       -0.02204071     1.00000000 -0.01533965
## gaming_time        0.58116792        0.41791978    -0.01533965  1.00000000

corrplot(cor_matrix, method= "circle", type = "lower", tl.col = "black", tl.srt = 90, col = c("blue", "purple", "red"))

Interpretation:

-> The correlation matrix shows that screen time is most strongly associated with social media time (r ≈ 0.71) and gaming time (r ≈ 0.58).

-> Streaming time is essentially uncorrelated with the other activities (|r| < 0.03).

-> The strongest relationship is between screen time and social media time, and the correlation plot shows the same pattern.

…………………………………………………………………….

7.1.2 Correlation between Age and various spendings.

age_cor <- data[,c('Age','recharge_cost','ecommerce_spend')] 
cor_matrix1 <- cor(age_cor)
print(cor_matrix1)

##                           Age recharge_cost ecommerce_spend
## Age              1.0000000000 -0.0002567667    -0.003051382
## recharge_cost   -0.0002567667  1.0000000000    -0.008728541
## ecommerce_spend -0.0030513816 -0.0087285412     1.000000000

corrplot(cor_matrix1, method = "color", addCoef.col = "brown", number.cex = 0.9, col = colorRampPalette(c("black","orange","skyblue"))(100), tl.col = "black", tl.srt= 45)

Interpretation: All the correlations are extremely close to zero (|r| < 0.01). Age, recharge cost and e-commerce spend therefore show no meaningful linear relationship with each other in this dataset. A near-zero correlation rules out only a linear association; it does not by itself prove that the variables are independent.
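A correlation near zero describes the absence of a linear relationship only. A tiny constructed example (not from the dataset) makes the distinction concrete: here y is completely determined by x, yet the Pearson correlation is zero.

```r
# Constructed example: y depends entirely on x, but not linearly
x <- -3:3
y <- x^2

r_xy <- cor(x, y)   # Pearson correlation
print(r_xy)
```

This is why a scatter plot is a useful complement to a correlation matrix before concluding that variables are unrelated.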

…………………………………………………………………….

7.1.3 What is the correlation between Number of apps installed, Data usage and screen time?

time_cor <- data[,c('apps_installed','screen_time','data_usage')] 
cor_matrix <- cor(time_cor) #calculate the correlation matrix
print(cor_matrix)

##                apps_installed screen_time  data_usage
## apps_installed    1.000000000 0.004704233 0.005789781
## screen_time       0.004704233 1.000000000 0.904153536
## data_usage        0.005789781 0.904153536 1.000000000

corrplot(cor_matrix, method= "number", type = "lower", tl.col = "black", tl.srt = 45, col = c("blue", "purple", "turquoise"))

Interpretation: apps_installed is almost unrelated to both screen_time and data_usage (r ≈ 0.005 for each). screen_time and data_usage have a very strong positive correlation (r ≈ 0.90): users who spend more time on their devices tend to use more data.

…………………………………………………………………….

7.2 ANOVA

7.2.1 Is there a significant difference in recharge amounts across locations?

anova_model <- aov(recharge_cost ~ Location, data = data)
summary(anova_model)

##                Df  Sum Sq Mean Sq F value Pr(>F)
## Location        9    1108  123.16   1.273  0.246
## Residuals   17676 1709576   96.72

#visualize the anova
ggplot(data, aes(x = Location, y = recharge_cost, fill = Location)) +
  geom_boxplot() +
  labs(title = "Recharge Amount by Location", x = "Location", y = "Recharge Amount") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  stat_compare_means(method = "anova")

Interpretation:

-> The ANOVA result (F = 1.273, p = 0.246) indicates no statistically significant difference in mean recharge cost across locations at the 5% level.

-> Thus, we fail to reject the null hypothesis: the data provide no evidence that location affects recharge spending.
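The F value and p-value in the ANOVA table follow from its other columns: F is the ratio of the two mean squares, and the p-value is the upper tail of the F distribution with the listed degrees of freedom. The sketch below redoes that arithmetic from the rounded numbers printed above, so small rounding differences are expected.

```r
# Values copied from the ANOVA table above (rounded as printed)
ms_location <- 123.16   # Mean Sq for Location
ms_residual <- 96.72    # Mean Sq for Residuals

f_stat  <- ms_location / ms_residual
p_value <- pf(f_stat, df1 = 9, df2 = 17676, lower.tail = FALSE)

print(round(c(F = f_stat, p = p_value), 3))
```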

…………………………………………………………………….

7.2.2 Does operating system affect data usage?

anova_result <- aov(data_usage ~ OS, data = data)
summary(anova_result)

##                Df Sum Sq Mean Sq F value Pr(>F)
## OS              1     59   59.43   1.208  0.272
## Residuals   17684 869860   49.19

#visualize the anova
ggplot(data, aes(x = OS, y = data_usage, fill = OS)) +
  geom_boxplot() +
  labs(title = "Data Usage across Operating Systems", x = "Operating System", y = "Data Usage (GB)") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_fill_brewer(palette = "Set3") +
  stat_compare_means(method = "anova")

Interpretation: Since the p-value (0.272) is greater than 0.05, we fail to reject the null hypothesis, meaning there is no statistically significant difference in mean data usage between Android and iOS users in this analysis.
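With only two groups (Android and iOS), a one-way ANOVA is equivalent to a two-sample t-test with equal variances: the F statistic equals the squared t statistic. The sketch below illustrates this on simulated data; the means, standard deviation and group sizes are arbitrary assumptions, not estimates from the dataset.

```r
set.seed(1)

# Simulated data usage for two hypothetical groups of 20 users each
toy <- data.frame(
  OS         = rep(c("Android", "iOS"), each = 20),
  data_usage = c(rnorm(20, mean = 13, sd = 7), rnorm(20, mean = 13, sd = 7))
)

# F statistic from the one-way ANOVA
f_val <- summary(aov(data_usage ~ OS, data = toy))[[1]][["F value"]][1]

# t statistic from the pooled-variance two-sample t-test
t_val <- unname(t.test(data_usage ~ OS, data = toy, var.equal = TRUE)$statistic)

print(c(F = f_val, t_squared = t_val^2))
```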

…………………………………………………………………….

7.3 Regression

7.3.1 How does the call duration affect the recharge cost for work users?

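work_data <- subset(data, primary_use == 'Work')  # work users only (used for the plot below)
# NOTE: the model is fitted on the full dataset (data), which is what the output below reports;
# pass data = work_data instead to restrict the fit to work users.
simple_model <- lm(recharge_cost ~ call_duration, data = data)
summary(simple_model)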

## 
## Call:
## lm(formula = recharge_cost ~ call_duration, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -23.2018  -3.4521  -0.0311   3.4136  21.7071 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   0.0427369  0.0769644   0.555    0.579    
## call_duration 0.0996819  0.0004434 224.835   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.007 on 17684 degrees of freedom
## Multiple R-squared:  0.7408, Adjusted R-squared:  0.7408 
## F-statistic: 5.055e+04 on 1 and 17684 DF,  p-value: < 2.2e-16

# Plot
ggplot(work_data, aes(x = call_duration, y = recharge_cost)) +
  geom_point(color = 'orange') +
  geom_smooth(method = 'lm', se = TRUE, color = 'black') +   # simple linear fit
  ggtitle(' Recharge Cost vs Call Duration (Work Users)') +
  xlab('Call Duration (minutes)') +
  ylab('Recharge Cost') +
  theme_minimal()

## `geom_smooth()` using formula = 'y ~ x'

Interpretation:

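-> The regression plot shows a positive linear relationship between call duration and recharge cost for work users.

-> The model summary was fitted on all users (data = data, 17684 residual degrees of freedom), not on the work_data subset, so its coefficients describe the whole dataset: each additional minute of call duration is associated with about 0.10 more recharge cost, and call duration explains about 74% of the variance (R-squared = 0.7408).

-> The upward trend of the black regression line indicates that longer call usage goes with higher recharge expenses, and the spread of points suggests some variability but a clear overall pattern.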
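The question asks about work users, whereas the summary printed above reports data = data in its Call line. A minimal sketch of restricting the fit to one group is shown below on simulated data. The slope of 0.1 and noise level of 5 used to generate it are illustrative assumptions loosely modelled on the printed fit, and the resulting numbers are not results for the real dataset.

```r
set.seed(42)

# Simulated users: half "Work", half "Gaming"
toy <- data.frame(
  primary_use   = rep(c("Work", "Gaming"), each = 50),
  call_duration = runif(100, min = 0, max = 300)
)
toy$recharge_cost <- 0.1 * toy$call_duration + rnorm(100, sd = 5)

# Subset first, then pass the subset (not the full data) to lm()
work_toy   <- subset(toy, primary_use == "Work")
work_model <- lm(recharge_cost ~ call_duration, data = work_toy)

print(nobs(work_model))   # number of rows actually used in the fit
print(coef(work_model))
```

Checking nobs() or the residual degrees of freedom is a quick way to confirm which rows a model was fitted on.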

…………………………………………………………………….

7.3.2 Can we predict a user’s mobile data usage based on the amount of time they spend using their smartphone screen daily?

# Build the simple linear regression model
simple_model <- lm(data_usage ~ screen_time, data = data)

# Summarize the model
summary(simple_model)

## 
## Call:
## lm(formula = data_usage ~ screen_time, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.9763  -2.0266   0.0001   2.0212  12.7373 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.009307   0.051662   -0.18    0.857    
## screen_time  1.998726   0.007102  281.44   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.996 on 17684 degrees of freedom
## Multiple R-squared:  0.8175, Adjusted R-squared:  0.8175 
## F-statistic: 7.921e+04 on 1 and 17684 DF,  p-value: < 2.2e-16

# Make predictions using the model
data$predicted_data_usage <- predict(simple_model, newdata = data)

# Plot actual vs predicted
ggplot(data, aes(x = screen_time)) +
  geom_point(aes(y = data_usage), color = "black", alpha = 0.5) +  # actual points
  geom_line(aes(y = predicted_data_usage), color = "blue", linewidth = 1) + # predicted line (linewidth replaces size for lines since ggplot2 3.4.0)
  labs(
    title = "Prediction of Data Usage Based on Screen Time",
    x = "Screen Time (hours per day)",
    y = "Data Usage (GB)"
  ) +
  theme_minimal()

Interpretation:

-> The regression shows a strong positive relationship between screen time and data usage (R-squared = 0.8175).

-> Each additional hour of daily screen time is associated with about 2.0 GB more data usage (slope = 1.9987), as indicated by the upward slope of the blue regression line; the intercept is not significantly different from zero.

-> The points cluster closely around the line, with a residual standard error of about 3.0 GB.
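Two quick checks tie this model to the earlier results. In a simple linear regression the R-squared equals the squared Pearson correlation, so the value from 7.1.3 should reproduce it. The fitted coefficients can also be used by hand to predict data usage for a given screen time. The 5-hour input below is an arbitrary illustrative value.

```r
# Correlation between screen_time and data_usage, copied from 7.1.3
r <- 0.904153536
r_squared <- round(r^2, 4)        # compare with "Multiple R-squared" above

# Coefficients copied from the model summary above
intercept <- -0.009307
slope     <- 1.998726

# Predicted data usage (GB) for a hypothetical user with 5 hours of screen time
pred_5h <- intercept + slope * 5

print(r_squared)
print(pred_5h)
```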

…………………………………………………………………….

8. PairPlot

8.1 How are screen time, e-commerce spending, social media time and gaming time related among users under the age of 18?

age_subset<- subset(data, Age<18)
# Select the required columns

selected_data <- age_subset[, c("screen_time", "ecommerce_spend","gaming_time","social_media_time")]

# Create the pairplot
ggpairs(selected_data)

Interpretation:

-> This pairplot shows the relationship between screen time, e-commerce spend, gaming time, and social media time for users under the age of 18.

-> We can see strong positive correlations, especially between screen time and social media time, and between screen time and gaming time.

-> This suggests that younger users who spend more time on their phones also tend to spend more time on gaming and social media.

COMPARISON OF MOBILE PHONE USAGE PATTERNS IN INDIA

Aastha Verma

2025-04-21
