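library(dplyr)    # %>%, count(), filter(), group_by(), summarise(), mutate()
library(ggplot2)  # plots
library(corrplot) # correlation plots
library(ggpubr)   # stat_compare_means()
# Load the CSV (base R read.csv, which returns a data.frame)
data <- read.csv("phone_usage_india.csv")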
str(data) # Displays structure of the dataset
## 'data.frame': 17686 obs. of 16 variables:
## $ User_ID : chr "U00001" "U00002" "U00003" "U00004" ...
## $ Age : int 53 60 37 32 16 21 57 56 46 44 ...
## $ Gender : chr "Male" "Other" "Female" "Male" ...
## $ Location : chr "Mumbai" "Delhi" "Ahmedabad" "Pune" ...
## $ Phone_Brand : chr "Vivo" "Realme" "Nokia" "Samsung" ...
## $ OS : chr "Android" "iOS" "Android" "Android" ...
## $ screen_time : num 3.7 9.2 4.5 11 2.2 5.4 6 3.1 5.3 9.9 ...
## $ data_usage : num 0.918 19.137 8.65 25.836 3.133 ...
## $ call_duration : num 37.9 13.7 66.8 156.2 236.2 ...
## $ apps_installed : int 104 169 96 146 86 25 123 188 194 84 ...
## $ social_media_time: num 1.66 27.64 3.88 28.57 5.67 ...
## $ ecommerce_spend : num -0.1 10.91 -13.58 17.71 6.48 ...
## $ streaming_time : num 5.2 5.1 1.7 3.2 3.4 0.6 2.9 5.2 6.1 7.6 ...
## $ gaming_time : num 8.97 22.07 -3.02 32.15 3.66 ...
## $ recharge_cost : num 0.136 -2.941 5.844 6.301 18.46 ...
## $ primary_use : chr "Education" "Gaming" "Entertainment" "Entertainment" ...
Interpretation: str() shows a data frame with 17,686 observations (rows) and 16 variables (columns). It lists the name and data type of each column, such as User_ID as character, Age as integer, and screen_time as numeric. This gives a basic picture of the data set's structure and the kind of information it contains.
colSums(is.na(data)) # Count missing values per column
## User_ID Age Gender Location
## 0 0 0 0
## Phone_Brand OS screen_time data_usage
## 0 0 0 0
## call_duration apps_installed social_media_time ecommerce_spend
## 0 0 0 0
## streaming_time gaming_time recharge_cost primary_use
## 0 0 0 0
Interpretation: The count of missing values is 0 for every column, so the data set contains no missing values.
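Missing values are not the only data-quality issue worth checking: the str() output above already shows negative values in ecommerce_spend, gaming_time, and recharge_cost. The sketch below counts negative entries per column. It is only an illustration on a toy data frame (`toy` is a hypothetical name) built from the first five values printed by str(), not on the full data set.

```r
# Toy data frame: first five values of three columns as printed by str() above
toy <- data.frame(
  screen_time     = c(3.7, 9.2, 4.5, 11, 2.2),
  ecommerce_spend = c(-0.1, 10.91, -13.58, 17.71, 6.48),
  gaming_time     = c(8.97, 22.07, -3.02, 32.15, 3.66)
)

# Count values below zero in each column
negative_counts <- sapply(toy, function(col) sum(col < 0))
print(negative_counts)
```

On the full data set the same idea would apply to every numeric column, assuming negative values are not meaningful for these measures.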
unique(data$Phone_Brand) # Unique phone brands
## [1] "Vivo" "Realme" "Nokia" "Samsung" "Xiaomi"
## [6] "Oppo" "Apple" "Google Pixel" "Motorola" "OnePlus"
unique(data$OS) # Unique operating systems
## [1] "Android" "iOS"
unique(data$primary_use) # Unique primary usage categories
## [1] "Education" "Gaming" "Entertainment" "Social Media"
## [5] "Work"
Interpretation: The unique values in the key categorical columns show that the dataset includes ten phone brands (Vivo, Realme, Nokia, Samsung, Xiaomi, Oppo, Apple, Google Pixel, Motorola, and OnePlus), two operating systems (Android and iOS), and five primary use categories (Education, Gaming, Entertainment, Social Media, and Work).
top_10_phoneBrands<- data %>%
count(Phone_Brand, name = "Count") %>% # Count occurrences and rename column
arrange(desc(Count)) %>% # Sort in descending order
head(10) # Select top 10 brands
print(top_10_phoneBrands)
## Phone_Brand Count
## 1 Nokia 1816
## 2 OnePlus 1807
## 3 Xiaomi 1803
## 4 Vivo 1797
## 5 Apple 1775
## 6 Samsung 1764
## 7 Realme 1762
## 8 Google Pixel 1729
## 9 Motorola 1717
## 10 Oppo 1716
Interpretation: The table ranks phone brands by number of users. Because the dataset contains only ten brands, the top ten covers all of them. Nokia is the most frequent with 1816 users, closely followed by OnePlus (1807), Xiaomi (1803), and Vivo (1797). Apple is fifth with 1775, followed by Samsung, Realme, Google Pixel, Motorola, and Oppo. The counts are very close to each other, so users are spread almost evenly across brands.
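As a quick consistency check, the counts in the table can be summed and converted to percentage shares. The sketch below re-enters the printed counts by hand; `brand_counts` and `brand_share` are hypothetical names used only for this illustration.

```r
# Counts copied from the printed table above
brand_counts <- c(
  "Nokia" = 1816, "OnePlus" = 1807, "Xiaomi" = 1803, "Vivo" = 1797,
  "Apple" = 1775, "Samsung" = 1764, "Realme" = 1762,
  "Google Pixel" = 1729, "Motorola" = 1717, "Oppo" = 1716
)

# The counts should add up to the number of rows reported by str()
total_users <- sum(brand_counts)
print(total_users)

# Percentage share of each brand
brand_share <- round(100 * brand_counts / total_users, 2)
print(brand_share)
```

The total matches the 17,686 observations reported by str(), and every brand holds roughly a tenth of the users.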
heavy_users <- data %>% filter(screen_time > 8)%>%arrange(desc(screen_time))%>%head(10)
print(heavy_users)
## User_ID Age Gender Location Phone_Brand OS screen_time data_usage
## 1 U00172 39 Female Chennai Google Pixel Android 12 25.15700
## 2 U00216 21 Male Jaipur Google Pixel iOS 12 21.13470
## 3 U00689 24 Male Kolkata Xiaomi iOS 12 28.30320
## 4 U00735 35 Female Jaipur Samsung Android 12 23.05077
## 5 U00747 35 Other Kolkata Motorola Android 12 24.20725
## 6 U00767 46 Other Jaipur Samsung Android 12 29.12187
## 7 U00932 37 Female Bangalore Nokia Android 12 24.77088
## 8 U00935 55 Other Mumbai Samsung iOS 12 24.58675
## 9 U01239 40 Other Jaipur Google Pixel Android 12 23.37282
## 10 U01375 42 Male Jaipur Nokia Android 12 20.47090
## call_duration apps_installed social_media_time ecommerce_spend
## 1 35.9 16 18.73935 16.223577
## 2 211.5 113 27.73469 9.036460
## 3 55.5 183 19.15537 1.995935
## 4 48.3 14 25.74125 44.907198
## 5 33.0 140 28.65438 33.704400
## 6 44.2 135 31.74048 22.863113
## 7 260.5 97 17.93232 17.735024
## 8 209.7 190 30.15895 29.350123
## 9 59.9 38 28.45537 30.014362
## 10 99.6 80 32.92130 42.675533
## streaming_time gaming_time recharge_cost primary_use
## 1 0.8 19.76892 -0.5326361 Entertainment
## 2 1.3 18.79201 11.0809628 Work
## 3 1.7 27.98732 4.7164114 Entertainment
## 4 3.7 10.47650 2.7543381 Social Media
## 5 3.6 19.65111 6.6479657 Gaming
## 6 5.9 19.39338 3.4336422 Social Media
## 7 3.5 28.73611 28.3705046 Education
## 8 4.9 10.49492 29.5416612 Entertainment
## 9 3.5 18.06149 13.8064929 Gaming
## 10 4.3 23.22885 16.2915746 Social Media
Interpretation: The filter keeps only users whose screen time is greater than 8 hours (“heavy users”). The table shows the first 10 of them after sorting by screen time in descending order, and all ten have a screen time of 12 hours.
gaming_users <- data %>%
filter(primary_use == "Gaming") %>%
select(User_ID, Phone_Brand, OS, data_usage, screen_time)%>%head(10)
print(gaming_users)
## User_ID Phone_Brand OS data_usage screen_time
## 1 U00002 Realme iOS 19.136572 9.2
## 2 U00022 Xiaomi Android 25.068779 11.7
## 3 U00024 Vivo Android 5.573529 1.4
## 4 U00027 Vivo Android 16.109427 9.2
## 5 U00034 Motorola Android 20.936997 11.4
## 6 U00060 Xiaomi iOS 7.181936 3.0
## 7 U00070 Oppo iOS 12.965341 5.3
## 8 U00073 Motorola iOS 11.426739 5.2
## 9 U00074 Realme Android 14.704642 6.9
## 10 U00076 Oppo Android 17.769887 6.1
Interpretation: The table shows the first 10 users whose primary use of their mobile phone is gaming, along with their phone brand and operating system. It also shows their data usage in GB and their screen time in hours per day.
data %>%
group_by(Location) %>%
summarise(
Avg_Data_Usage = mean(data_usage, na.rm = TRUE),
Avg_Screen_Time = mean(screen_time, na.rm = TRUE)
)
## # A tibble: 10 × 3
## Location Avg_Data_Usage Avg_Screen_Time
## <chr> <dbl> <dbl>
## 1 Ahmedabad 13.0 6.56
## 2 Bangalore 13.2 6.57
## 3 Chennai 13.0 6.53
## 4 Delhi 13.0 6.47
## 5 Hyderabad 13.2 6.60
## 6 Jaipur 13.2 6.65
## 7 Kolkata 12.9 6.41
## 8 Lucknow 13.2 6.58
## 9 Mumbai 12.8 6.42
## 10 Pune 13.3 6.67
Interpretation: The data frame shows the average data usage and average screen time for each location. Pune has the highest average data usage (13.3) and the highest average screen time (6.67). Mumbai has the lowest average data usage (12.8), and Kolkata has the lowest average screen time (6.41). The differences between locations are small.
data %>%
group_by(Age) %>%
summarise(Avg_Usage = mean(screen_time, na.rm = TRUE)) %>%
arrange(desc(Avg_Usage))%>%head(2)
## # A tibble: 2 × 2
## Age Avg_Usage
## <int> <dbl>
## 1 46 6.93
## 2 20 6.86
Interpretation: This output shows the two ages with the highest average screen time (Avg_Usage). Users aged 46 average about 6.93 hours per day, and users aged 20 average about 6.86.
data %>%
group_by(Phone_Brand) %>%
summarise(Avg_Screen_Time = mean(screen_time, na.rm = TRUE)) %>%
arrange(desc(Avg_Screen_Time))
## # A tibble: 10 × 2
## Phone_Brand Avg_Screen_Time
## <chr> <dbl>
## 1 Xiaomi 6.64
## 2 Realme 6.64
## 3 Nokia 6.61
## 4 Samsung 6.60
## 5 Motorola 6.56
## 6 OnePlus 6.55
## 7 Apple 6.52
## 8 Vivo 6.51
## 9 Google Pixel 6.43
## 10 Oppo 6.39
Interpretation: Xiaomi and Realme users have the highest average screen time, both at approximately 6.64 hours. Oppo users have the lowest average screen time at 6.39 hours.
data %>%
group_by(Gender) %>%
summarise(Avg_Screen_Time = mean(screen_time, na.rm = TRUE)) %>%
arrange(desc(Avg_Screen_Time))
## # A tibble: 3 × 2
## Gender Avg_Screen_Time
## <chr> <dbl>
## 1 Male 6.60
## 2 Other 6.53
## 3 Female 6.50
Interpretation: Male users have the highest average screen time (6.60 hours per day), ahead of Other (6.53) and Female (6.50). The gap between the groups is only about 0.1 hours.
data %>%
arrange(desc(data_usage)) %>%
head(5)
## User_ID Age Gender Location Phone_Brand OS screen_time data_usage
## 1 U17656 21 Other Mumbai Oppo Android 11.5 34.74822
## 2 U16375 25 Other Lucknow Realme iOS 11.9 32.30989
## 3 U05993 26 Male Ahmedabad Realme Android 11.7 31.87627
## 4 U14936 47 Other Bangalore Samsung iOS 11.4 31.84143
## 5 U09055 45 Female Chennai Nokia Android 12.0 31.32909
## call_duration apps_installed social_media_time ecommerce_spend streaming_time
## 1 128.9 181 12.43514 35.103541 5.3
## 2 140.9 152 36.53769 5.270189 2.3
## 3 192.7 173 32.78599 6.784371 4.4
## 4 296.3 99 24.80781 35.344356 0.9
## 5 132.7 142 21.44678 22.466015 5.6
## gaming_time recharge_cost primary_use
## 1 18.62551 16.98026 Entertainment
## 2 9.45858 12.47200 Work
## 3 31.52751 20.72110 Work
## 4 30.57840 21.38560 Entertainment
## 5 20.22828 11.42800 Entertainment
Interpretation: This data frame shows the full details of the five users with the highest data usage.
data <- data %>%
mutate(Usage_Category = case_when(
screen_time <= 2 ~ "Light User",
screen_time > 2 & screen_time <= 6 ~ "Moderate User",
screen_time > 6 ~ "Heavy User"
))
unique(data$Usage_Category)
## [1] "Moderate User" "Heavy User" "Light User"
Interpretation: A new column, Usage_Category, is added to the data set to categorise users by screen time. Users with a screen time of 2 hours or less are light users, users with more than 2 and up to 6 hours are moderate users, and users with more than 6 hours are heavy users.
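The boundary behaviour of these thresholds is easy to check on a few hand-picked values. The sketch below uses base R's cut() as a stand-in for case_when() (a swapped-in technique, not the code used above), with made-up screen times chosen to sit on and just past each threshold.

```r
# Made-up screen times on and around the 2-hour and 6-hour thresholds
screen_time <- c(0.5, 2, 2.1, 6, 6.1)

# cut() uses right-closed intervals by default: (-Inf, 2], (2, 6], (6, Inf)
usage_category <- cut(
  screen_time,
  breaks = c(-Inf, 2, 6, Inf),
  labels = c("Light User", "Moderate User", "Heavy User")
)

print(data.frame(screen_time, usage_category))
```

A screen time of exactly 2 lands in "Light User" and exactly 6 in "Moderate User", which mirrors the `<=` comparisons in the case_when() call.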
data <- data %>%
mutate(Total_Mobile_Interaction = screen_time + call_duration + data_usage)
summary(data$Total_Mobile_Interaction)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.579 97.653 170.450 171.027 243.048 339.541
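Interpretation: A new column, Total_Mobile_Interaction, is added as the sum of screen time, call duration, and data usage. The three inputs are in different units (hours per day, minutes, and GB, according to the axis labels used in the plots), so the total is an unweighted index of overall phone interaction rather than a quantity in a single unit.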
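Because the three components are on different scales, it helps to see how much each contributes to the sum. The sketch below uses the first three rows as printed by str() earlier (rounded values) and computes the share of the total that comes from call duration. `call_share` is a hypothetical name for this illustration.

```r
# First three users, values as printed by str()
screen_time   <- c(3.7, 9.2, 4.5)        # hours per day
call_duration <- c(37.9, 13.7, 66.8)     # minutes
data_usage    <- c(0.918, 19.137, 8.65)  # GB

# Same formula as Total_Mobile_Interaction
total <- screen_time + call_duration + data_usage

# Fraction of each user's total that comes from call duration
call_share <- call_duration / total
print(round(call_share, 2))
```

If a more balanced index were wanted, one option would be to standardise each component (for example with scale()) before summing; that is a design choice and not something the analysis above does.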
avg_screen_time <- data %>%
group_by(Location) %>%
summarise(avg_screen_time = mean(screen_time, na.rm = TRUE))
ggplot(avg_screen_time, aes(x = avg_screen_time, y = Location, fill = Location)) +
geom_bar(stat = "identity") +
labs(title = "Average Screen Time by Location", x = "Average Screen Time (Hours)", y = "Location") +
theme_minimal()
Interpretation:
-> Pune has the highest average screen time (6.67 hours), followed closely by Jaipur (6.65) and Hyderabad (6.60).
-> Kolkata (6.41) and Mumbai (6.42) show the lowest screen times, but the overall difference among all cities is very small.
-> All locations have average screen times between 6 and 7 hours, indicating similar behavior across cities.
# right = FALSE makes each bin left-closed, e.g. [25, 35) holds ages 25 to 34, so the labels follow that convention
data$Age_Group <- cut(data$Age, breaks = c(15, 25, 35, 45, 55, Inf), labels = c("15-24", "25-34", "35-44", "45-54", "55+"), right = FALSE)
age_summary <- data %>%
group_by(Age_Group) %>%
summarise(count = n()) %>%
mutate(percentage = count / sum(count) * 100, label = paste0(Age_Group, "\n", round(percentage, 1), "%"))
ggplot(age_summary, aes(x = "", y = count, fill = Age_Group)) +
geom_col(width = 1, color = "white") +
coord_polar(theta = "y") +
geom_text(aes(label = label), position = position_stack(vjust = 0.5), size = 4, color = "white") +
labs(title = "User Distribution by Age Group", fill = "Age Group") +
theme_void()
Interpretation:
-> The 45-54 age group forms the largest segment at 22.2%.
-> Age groups 15-24, 25-34, and 35-44 each contribute around 21%.
-> The 55+ age group has the smallest share at 13.2%.
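The age bins depend on how cut() treats the break points. The sketch below applies the same breaks with right = FALSE to a few made-up ages on the boundaries and leaves the labels at their default, so the interval notation shows exactly which ages fall into which bin.

```r
# Made-up ages sitting on or next to the break points
ages <- c(15, 24, 25, 35, 54, 55, 60)

# Same breaks as above; default labels show the interval notation
age_bins <- cut(ages, breaks = c(15, 25, 35, 45, 55, Inf), right = FALSE)

print(data.frame(ages, age_bins))
```

With right = FALSE each interval includes its lower bound and excludes its upper bound, so age 25 falls in [25,35) and age 55 falls in [55,Inf).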
phone_brand_data <- data %>%
group_by(Phone_Brand) %>%
summarise(count = n())
ggplot(phone_brand_data, aes(x = "", y = count, fill = Phone_Brand)) +
geom_bar(stat = "identity", width = 1) +
coord_polar(theta = "y") +
geom_text(aes(label = count), position = position_stack(vjust = 0.5), size = 3) +
labs(title = "Distribution of Phone brands Among Users", fill = "Phone_Brand") +
theme_void() +
theme(legend.position = "top")
Interpretation:
-> The number of users for each phone brand is fairly balanced, with all brands ranging between 1716 and 1816 users.
-> Nokia has the highest user count at 1816, closely followed by OnePlus (1807) and Xiaomi (1803).
-> Motorola (1717) and Oppo (1716) have the lowest user counts.
ggplot(data, aes(x = call_duration, fill = Gender)) +
geom_histogram(binwidth = 50, position = "dodge") +
labs(title = "Distribution of Talk-time by Gender", x = "Talk-time (minutes)", y = "Number of Users", fill = "Gender") +
scale_fill_manual(values = c("Male" = "blue", "Female" = "pink", "Other" = "green")) +
theme_minimal()
Interpretation:
-> Between roughly 50 and 250 minutes, the number of users is high for all genders, at close to 900-1000 users per bar.
-> Users in the Other category (green bars) slightly outnumber Male and Female users in several bins, especially around 100 and 200 minutes.
average_recharge_cost_by_age <- data %>%
group_by(Age) %>%
summarise(AverageRechargeCost = mean(recharge_cost))
ggplot(average_recharge_cost_by_age, aes(x = Age, y = AverageRechargeCost)) +
geom_line(color = 'darkblue') +
geom_point(color = 'red') +
labs(title = "Relationship Between Age and Average Recharge Cost", x = "Age", y = "Average Recharge Cost") +
theme_minimal()
Interpretation:
-> Overall, the average recharge cost fluctuates.
-> There is no clear increasing or decreasing trend with age; the average mostly moves up and down from one age to the next.
-> Younger users (15–20 years) and users around 40–45 years seem to have slightly higher average recharge costs compared to others.
-> Ages around 30–35 and 50–55 show lower recharge costs.
time_cor <- data[,c('screen_time', 'social_media_time', 'streaming_time', 'gaming_time')]
cor_matrix <- cor(time_cor)
print(cor_matrix)
## screen_time social_media_time streaming_time gaming_time
## screen_time 1.00000000 0.71050388 -0.02174779 0.58116792
## social_media_time 0.71050388 1.00000000 -0.02204071 0.41791978
## streaming_time -0.02174779 -0.02204071 1.00000000 -0.01533965
## gaming_time 0.58116792 0.41791978 -0.01533965 1.00000000
corrplot(cor_matrix, method= "circle", type = "lower", tl.col = "black", tl.srt = 90, col = c("blue", "purple", "red"))
Interpretation:
-> The correlation matrix shows that screen time is most strongly associated with social media time (r ≈ 0.71) and gaming time (r ≈ 0.58).
-> Streaming time is essentially uncorrelated with the other activities (all correlations are within 0.03 of zero).
-> The strongest relationship is between screen time and social media time, and the plot shows the same pattern.
age_cor <- data[,c('Age','recharge_cost','ecommerce_spend')]
cor_matrix1 <- cor(age_cor)
print(cor_matrix1)
## Age recharge_cost ecommerce_spend
## Age 1.0000000000 -0.0002567667 -0.003051382
## recharge_cost -0.0002567667 1.0000000000 -0.008728541
## ecommerce_spend -0.0030513816 -0.0087285412 1.000000000
corrplot(cor_matrix1, method = "color", addCoef.col = "brown", number.cex = 0.9, col = colorRampPalette(c("black","orange","skyblue"))(100), tl.col = "black", tl.srt= 45)
Interpretation: All the correlations are extremely close to zero. This means there is no meaningful linear relationship among Age, Recharge Cost, and E-commerce Spend. A correlation of zero rules out a linear association only; it does not by itself prove that the variables are independent.
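A near-zero Pearson correlation only measures the absence of a linear relationship. The sketch below uses made-up numbers (not the phone usage data) in which y is completely determined by x, yet the correlation is zero.

```r
# Made-up example: y depends entirely on x, but not linearly
x <- -5:5
y <- x^2

# Pearson correlation is zero because the relationship is symmetric around x = 0
r <- cor(x, y)
print(r)
```

For this reason a scatter plot or a rank-based measure can be a useful complement to the correlation matrix when a non-linear pattern is suspected.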
time_cor <- data[,c('apps_installed','screen_time','data_usage')]
cor_matrix <- cor(time_cor) #calculate the correlation matrix
print(cor_matrix)
## apps_installed screen_time data_usage
## apps_installed 1.000000000 0.004704233 0.005789781
## screen_time 0.004704233 1.000000000 0.904153536
## data_usage 0.005789781 0.904153536 1.000000000
corrplot(cor_matrix, method= "number", type = "lower", tl.col = "black", tl.srt = 45, col = c("blue", "purple", "turquoise"))
Interpretation: apps_installed is almost unrelated to both screen_time and data_usage (correlation ≈ 0). screen_time and data_usage have a very strong positive correlation (r ≈ 0.90): users who spend more time on their devices tend to use more data.
anova_model <- aov(recharge_cost ~ Location, data = data)
summary(anova_model)
## Df Sum Sq Mean Sq F value Pr(>F)
## Location 9 1108 123.16 1.273 0.246
## Residuals 17676 1709576 96.72
#visualize the anova
ggplot(data, aes(x = Location, y = recharge_cost, fill = Location)) +
geom_boxplot() +
labs(title = "Recharge Amount by Location", x = "Location", y = "Recharge Amount") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
stat_compare_means(method = "anova")
Interpretation:
-> ANOVA results (F = 1.273, p = 0.246) indicate that there is no statistically significant difference in mean recharge cost across locations at the 0.05 level.
-> Thus, we fail to reject the null hypothesis: the data provide no evidence that location affects recharge spending.
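The F value and p-value in the ANOVA table can be reproduced from the mean squares and degrees of freedom it reports. The sketch below re-enters those printed (rounded) numbers by hand, so the result should agree with the table only up to rounding.

```r
# Values copied from the ANOVA table above
ms_location <- 123.16   # Mean Sq for Location
ms_residual <- 96.72    # Mean Sq for Residuals
df_location <- 9
df_residual <- 17676

# F statistic is the ratio of the two mean squares
f_value <- ms_location / ms_residual

# Upper-tail probability of the F distribution gives the p-value
p_value <- pf(f_value, df1 = df_location, df2 = df_residual, lower.tail = FALSE)

print(round(f_value, 3))
print(round(p_value, 3))
```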
anova_result <- aov(data_usage ~ OS, data = data)
summary(anova_result)
## Df Sum Sq Mean Sq F value Pr(>F)
## OS 1 59 59.43 1.208 0.272
## Residuals 17684 869860 49.19
#visualize the anova
ggplot(data, aes(x = OS, y = data_usage, fill = OS)) +
geom_boxplot() +
labs(title = "Data Usage across Operating Systems", x = "Operating System", y = "Data Usage (in units)") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
scale_fill_brewer(palette = "Set3") +
stat_compare_means(method = "anova")
Interpretation: Since the p-value (0.272) is greater than 0.05, we fail to reject the null hypothesis. There is no statistically significant difference in mean data usage between Android and iOS users in this analysis.
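With only two groups, a one-way ANOVA is equivalent to a two-sample t-test that assumes equal variances: the F statistic equals the square of the t statistic. The sketch below illustrates this on simulated data; `toy`, its column names, and the simulation parameters are assumptions made only for this example.

```r
set.seed(1)

# Simulated data: two groups of 50 users with the same mean and spread
toy <- data.frame(
  os    = rep(c("Android", "iOS"), each = 50),
  usage = c(rnorm(50, mean = 13, sd = 7), rnorm(50, mean = 13, sd = 7))
)

# F statistic from a one-way ANOVA
f_value <- summary(aov(usage ~ os, data = toy))[[1]][["F value"]][1]

# t statistic from an equal-variance two-sample t-test
t_value <- unname(t.test(usage ~ os, data = toy, var.equal = TRUE)$statistic)

# The two agree up to floating-point error
print(all.equal(f_value, t_value^2))
```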
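work_data <- subset(data, primary_use == 'Work') # Work users only; used for the plot below
# Note: the model is fitted on all users (data), not on work_data
simple_model <- lm(recharge_cost ~ call_duration, data = data)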
summary(simple_model)
##
## Call:
## lm(formula = recharge_cost ~ call_duration, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -23.2018 -3.4521 -0.0311 3.4136 21.7071
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.0427369 0.0769644 0.555 0.579
## call_duration 0.0996819 0.0004434 224.835 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.007 on 17684 degrees of freedom
## Multiple R-squared: 0.7408, Adjusted R-squared: 0.7408
## F-statistic: 5.055e+04 on 1 and 17684 DF, p-value: < 2.2e-16
# Plot
ggplot(work_data, aes(x = call_duration, y = recharge_cost)) +
geom_point(color = 'orange') +
geom_smooth(method = 'lm', se = TRUE, color = 'black') + # simple linear fit
ggtitle(' Recharge Cost vs Call Duration (Work Users)') +
xlab('Call Duration (minutes)') +
ylab('Recharge Cost') +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
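Interpretation:
-> The regression plot shows a positive linear relationship between call duration and recharge cost for Work users. The model summary above is fitted on all users (data = data), not only on the Work subset.
-> As call duration increases, recharge cost tends to rise: the fitted slope is about 0.10 per additional minute, and call duration explains about 74% of the variance in recharge cost (R-squared = 0.7408).
-> The upward slope of the black regression line shows that longer call duration is associated with higher recharge cost, and the spread of points indicates some variability around a clear overall pattern.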
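The fitted coefficients can be used directly to estimate recharge cost for a given call duration. The sketch below plugs the printed estimates into the regression equation for a few made-up call durations. It is a hand calculation, not a call to predict() on the fitted model.

```r
# Coefficient estimates copied from the model summary above
intercept <- 0.0427369
slope     <- 0.0996819   # change in recharge cost per additional minute

# Made-up call durations in minutes
call_minutes <- c(50, 100, 200)

# Predicted recharge cost: intercept + slope * call duration
predicted_cost <- intercept + slope * call_minutes
print(data.frame(call_minutes, predicted_cost))
```

With the fitted model object available, `predict(simple_model, newdata = data.frame(call_duration = c(50, 100, 200)))` would give the same values without rounding of the coefficients.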
# Build the simple linear regression model
simple_model <- lm(data_usage ~ screen_time, data = data)
# Summarize the model
summary(simple_model)
##
## Call:
## lm(formula = data_usage ~ screen_time, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.9763 -2.0266 0.0001 2.0212 12.7373
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.009307 0.051662 -0.18 0.857
## screen_time 1.998726 0.007102 281.44 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.996 on 17684 degrees of freedom
## Multiple R-squared: 0.8175, Adjusted R-squared: 0.8175
## F-statistic: 7.921e+04 on 1 and 17684 DF, p-value: < 2.2e-16
# Make predictions using the model
data$predicted_data_usage <- predict(simple_model, newdata = data)
# Plot actual vs predicted
ggplot(data, aes(x = screen_time)) +
geom_point(aes(y = data_usage), color = "black", alpha = 0.5) + # actual points
geom_line(aes(y = predicted_data_usage), color = "blue", linewidth = 1) + # predicted line (linewidth replaces size for lines since ggplot2 3.4.0)
labs(
title = "Prediction of Data Usage Based on Screen Time",
x = "Screen Time (hours per day)",
y = "Data Usage (GB)"
) +
theme_minimal()
Interpretation:
-> This regression plot shows a strong positive relationship between Screen Time and Data Usage.
-> As screen time (hours per day) increases, data usage (in GB) rises steadily: the fitted slope is about 2 GB per additional hour, as indicated by the upward slope of the blue regression line.
-> The points cluster closely around the line (R-squared = 0.8175, residual standard error of about 3 GB), which suggests a consistent pattern with moderate variability.
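In a simple linear regression with one predictor, R-squared equals the square of the Pearson correlation between the two variables. The sketch below checks this using the correlation between screen_time and data_usage printed in the earlier correlation matrix.

```r
# Correlation between screen_time and data_usage from the correlation matrix
r <- 0.904153536

# Squaring it should reproduce the Multiple R-squared of the regression
r_squared <- r^2
print(round(r_squared, 4))
```

The result matches the Multiple R-squared of 0.8175 reported in the model summary.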