import kagglehub
import pandas as pd
import glob
import matplotlib.pyplot as plt
import seaborn as sns
import squarify
# Download latest version
path = kagglehub.dataset_download("valakhorasani/mobile-device-usage-and-user-behavior-dataset")
path += '/*.csv'
print("Path to dataset files:", path)
## Path to dataset files: /Users/steven/.cache/kagglehub/datasets/valakhorasani/mobile-device-usage-and-user-behavior-dataset/versions/1/*.csv
files_to_read = glob.glob(path)
# Read every CSV in the download directory and combine them into one DataFrame
df = pd.concat((pd.read_csv(f) for f in files_to_read), ignore_index=True)
print(len(df.index))
## 700
Below, we will see a small snapshot of the data at hand. The data here pertains to different users’ phone habits as well as their choices in mobile device and a few identifying attributes such as age and gender. The data can be found at https://www.kaggle.com/datasets/valakhorasani/mobile-device-usage-and-user-behavior-dataset
df.head()
## User ID Device Model Operating System ... Age Gender User Behavior Class
## 0 1 Google Pixel 5 Android ... 40 Male 4
## 1 2 OnePlus 9 Android ... 47 Female 3
## 2 3 Xiaomi Mi 11 Android ... 42 Male 2
## 3 4 Google Pixel 5 Android ... 20 Male 3
## 4 5 iPhone 12 iOS ... 31 Female 3
##
## [5 rows x 11 columns]
Below we will see two charts. The first asks whether the screentime for men differs from that of women. To answer this, we will use two side-by-side histograms to compare how many users fall into each bucket of screentime.
male_users = df[df['Gender'] == 'Male']
female_users = df[df['Gender'] == 'Female']
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12,6))
ax1.hist(male_users['Screen On Time (hours/day)'], bins=10, alpha=0.5, label="Male Users", color="blue")
ax2.hist(female_users['Screen On Time (hours/day)'], bins=10, alpha=0.5, label="Female Users", color="pink")
ax1.set_xlabel('Average Hours of Screentime Per Day (Male Users)')
ax1.set_ylabel('Frequency')
ax2.set_xlabel('Average Hours of Screentime Per Day (Female Users)')
ax2.set_ylabel('Frequency')
#plt.legend(loc='upper right')
plt.show()
We see here that, largely, men and women show no meaningful
difference in their screentime usage. For both genders, the largest
share of users falls into the “low” group of approximately two hours of
screentime per day, and the counts generally decline as screentime rises.
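The visual comparison can be backed with a few summary statistics. The sketch below is a minimal, pandas-free illustration: `summarize` is a hypothetical helper, and the sample values are made up for demonstration rather than taken from the dataset.

```python
from statistics import mean, median, stdev

def summarize(values):
    """Return (mean, median, sample standard deviation) for a sequence of numbers."""
    values = list(values)
    return mean(values), median(values), stdev(values)

# Made-up screentime values (hours/day), for illustration only.
sample_male = [1.5, 2.0, 4.5, 6.0, 9.0]
sample_female = [1.0, 2.5, 4.0, 6.5, 9.5]

for name, values in (("Male", sample_male), ("Female", sample_female)):
    avg, mid, spread = summarize(values)
    print(f"{name}: mean={avg:.2f}, median={mid:.2f}, stdev={spread:.2f}")
```

With the real data, one would pass `male_users['Screen On Time (hours/day)']` and `female_users['Screen On Time (hours/day)']` instead of the sample lists.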
So, if gender makes no meaningful difference, what about age? To answer this question, let’s use a scatter plot to represent all of our users.
plt.figure(figsize=(10, 6))
plt.scatter(male_users['Age'], male_users['Screen On Time (hours/day)'], label='Male Users', color='blue', alpha=0.5)
plt.scatter(female_users['Age'], female_users['Screen On Time (hours/day)'], label='Female Users', color='pink', alpha=0.5)
plt.xlabel('Age')
plt.ylabel('Screen On Time (hours/day)')
plt.title('Scatter Plot of Screen On Time (hours/day) against Age for Male and Female Users')
plt.legend(loc='upper right')
plt.show()
Intuitively, we might expect a general negative trend, with younger
users spending more time on their screens and older users being more
screen/mobile phone averse. However, what we see here is, largely, an
even distribution of screentime across ages. As before, demographics
alone tell us little. So, let’s try to learn some more about the
different groups of users we have.
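One way to put a number on "no visible trend" is a correlation coefficient. The sketch below computes Pearson's r by hand; `pearson_r` is a hypothetical helper, and the ages and hours are made-up values for illustration, not dataset rows.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    xs, ys = list(xs), list(ys)
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)

# Made-up ages and screentimes, for illustration only.
ages = [18, 25, 33, 41, 58]
hours = [5.0, 2.0, 7.5, 3.0, 6.0]
print(f"r = {pearson_r(ages, hours):.3f}")
```

A value near zero would be consistent with the flat scatter plot. With pandas, `df['Age'].corr(df['Screen On Time (hours/day)'])` should give the same statistic directly.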
Here, we’ll identify “Super Users” as the users who fall into the top screentime bracket we observed in our histograms.
Before we do, we’ll want to get a general idea for what devices all users prefer. This will be relevant later.
device_counts = df['Device Model'].value_counts()
plt.figure(figsize=(14, 12))
device_counts.plot.pie(labels=[f'{label} ({count})' for label, count in zip(device_counts.index, device_counts)])
plt.title('Number of users by device Model')
plt.ylabel('')
plt.show()
This pie chart shows that, generally speaking, we have an evenly
distributed pool of devices with preferences slightly
skewing toward the Xiaomi Mi 11. I believe that this trend is more
indicative of the data scientists who gathered this data hoping for an
even spread than it is indicative of anything meaningful. Still, it will
be useful to keep in mind for our further evaluations.
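How even the spread is can also be expressed as percentage shares. The sketch below is a minimal illustration; `device_shares` is a hypothetical helper, and the device names are placeholders rather than the dataset's models.

```python
from collections import Counter

def device_shares(devices):
    """Return each device's share of the total as a percentage, most common first."""
    counts = Counter(devices)
    total = sum(counts.values())
    return {device: 100 * n / total for device, n in counts.most_common()}

# Placeholder device names, for illustration only.
sample_devices = ["Phone A", "Phone A", "Phone B", "Phone C"]
for device, share in device_shares(sample_devices).items():
    print(f"{device}: {share:.1f}%")
```

With the real data, one would pass `df['Device Model']` instead of the placeholder list.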
As mentioned before, let’s gather some analytics on the various groups using a tree map, keeping a more careful eye on the top usage group.
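def assign_usage_group(screen_on_time, max_time):
    # Bucket screen time into 2.5-hour groups, numbered from 1.
    # Note: max_time is accepted but not currently used.
    increment = 2.5
    group_number = (screen_on_time // increment) + 1
    return str(group_number)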
max_screen_on_time = df['Screen On Time (hours/day)'].max()
max_app_time = df['App Usage Time (min/day)'].max()
df['Screentime Usage Group'] = df['Screen On Time (hours/day)'].apply(assign_usage_group, args=(max_screen_on_time,))
unique_groups = sorted(df['Screentime Usage Group'].unique(), reverse=True)
usage_labels = ['High Usage', "Medium High Usage", "Medium Usage", "Medium Low Usage", "Low Usage"]
mapping_dict = {str(group): usage_labels[i] for i, group in enumerate(unique_groups)}
df['Usage Group'] = df['Screentime Usage Group'].map(mapping_dict)
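The grouping above relies on floor division by a 2.5-hour increment. A minimal sketch of that step in isolation follows; `usage_group` is a hypothetical stand-in that returns an integer rather than the string produced above.

```python
def usage_group(screen_on_time, increment=2.5):
    """Map hours of screen time to a 1-based group number using floor division."""
    return int(screen_on_time // increment) + 1

for hours in (1.0, 2.4, 2.5, 7.4, 12.0):
    print(hours, "->", usage_group(hours))
# 1.0 -> 1
# 2.4 -> 1
# 2.5 -> 2
# 7.4 -> 3
# 12.0 -> 5
```

Each group covers a half-open interval, so a value exactly on a boundary such as 2.5 falls into the next group up.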
def most_popular_device(devices):
    # Return the device model that appears most often in the group
    return devices.value_counts().idxmax()
grouped_df = df.groupby('Usage Group').agg({
    'Screen On Time (hours/day)': 'mean',
    'Device Model': most_popular_device,
    'Age': 'mean'
}).reset_index()
# Use double quotes for the outer f-string so the nested single quotes parse on Python versions before 3.12
labels = [
    f"{row['Usage Group']}\nScreen Time: {row['Screen On Time (hours/day)']:.1f}\nDevices: {row['Device Model']}\nAvg Age: {row['Age']:.1f}"
    for _, row in grouped_df.iterrows()
]
plt.figure(figsize=(12,8))
colors=plt.cm.tab20.colors
squarify.plot(sizes=grouped_df['Screen On Time (hours/day)'], color=colors[:len(grouped_df)],label=labels, alpha=0.8)
plt.axis('off')
## (np.float64(0.0), np.float64(100.0), np.float64(0.0), np.float64(100.0))
plt.title('Tree Map of Screen Time Usage Groups')
plt.show()
We’ll note here that although users appeared to be rather evenly
split across devices, the high usage users seem to prefer the Xiaomi Mi
11 phone. So now we can ask one key question to gain insight: why?
To answer this question, let’s consider one factor that all mobile phone users keep in the backs of their minds: battery life. More specifically, we can form the hypothesis that the super users will gravitate toward the phones that have better overall battery life. So, if we map out battery life against screen time, we should see the Xiaomi Mi 11 perform the best.
We will use a trellis of line charts to map out the performance of the devices, using colors to highlight the ranking. Green will be the best, followed by blue, yellow, and orange, with red for the worst performing battery.
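# Rank devices by their maximum observed battery drain, in ascending order
max_battery_drain = df.groupby('Device Model')['Battery Drain (mAh/day)'].max().sort_values()
# Assign colors by rank: the first device in the ranking gets red, the last gets green
color_mapping = {
    max_battery_drain.index[0]: 'red',
    max_battery_drain.index[1]: 'orange',
    max_battery_drain.index[2]: 'yellow',
    max_battery_drain.index[3]: 'blue',
    max_battery_drain.index[4]: 'green'
}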
g = sns.FacetGrid(df, col="Device Model", col_wrap=2, hue='Device Model',palette=color_mapping)
g.map(sns.lineplot, "Screen On Time (hours/day)", "Battery Drain (mAh/day)", errorbar=None)
# Set the axis labels and title
g.set_axis_labels("Screen On Time (hours/day)", "Battery Drain (mAh/day)")
g.figure.suptitle("Battery Consumption vs Screentime for Different Device Types", y=1.05)
# Show the plot
plt.show()
We successfully confirm our hypothesis that the Xiaomi Mi 11 has the
best overall battery. Interestingly, none of the devices shows a
perfectly linear relationship between battery drain and screen time,
with the iPhone 12 having the most apparent peaks and valleys in its
graph.
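The color assignment behind the trellis can be reproduced on a small scale. The sketch below sorts devices by a per-device value and pairs them with the palette in order; `rank_to_colors` is a hypothetical helper, and the device names and numbers are placeholders, not figures from the dataset.

```python
def rank_to_colors(max_by_device, palette=("red", "orange", "yellow", "blue", "green")):
    """Sort devices by value (ascending) and pair each with the palette color at its rank."""
    ranked = sorted(max_by_device, key=max_by_device.get)
    if len(ranked) > len(palette):
        raise ValueError("more devices than colors")
    return dict(zip(ranked, palette))

# Placeholder maximum-drain values (mAh/day), for illustration only.
placeholder_max = {"Device A": 2900, "Device B": 2993, "Device C": 2870}
print(rank_to_colors(placeholder_max))
# {'Device C': 'red', 'Device A': 'orange', 'Device B': 'yellow'}
```

Note that with this ordering the device with the lowest maximum gets red and the one with the highest gets the last color, mirroring the mapping used for the charts.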
While we cannot identify a demographic with a clear device preference or screentime habit, we can determine that there are super users who use their screens more than the others and who seem to prefer the Xiaomi Mi 11 phone. A reasonable hypothesis is that these users prefer this phone because it performs the best of the presented options when scored on battery drain.