The primary objective of this analysis is to explore the relationship
between Bellabeat users and their usage of Bellabeat products. We aim to
define user groups based on their habits and create usage scenarios for
each group. By addressing the following questions, we will gain insights
to inform Bellabeat’s marketing strategy and product development:
1. Are there distinct user groups based on product usage?
2. How do these user groups differ in their usage of the product?
3. How can this analysis benefit these user groups by profiling their
needs?
4. Are there patterns in product usage, and what are the usage habits of
each group?
5. Can we identify patterns in the data that can be leveraged to
maximize benefits for all users?
The task is to analyze the usage data from Bellabeat’s smart devices and
app to identify distinct user groups and their respective usage
patterns. Understanding these groups and their behaviors will enable
Bellabeat to tailor its marketing strategies to better meet user needs,
enhance user satisfaction, and drive growth. This analysis involves
segmenting users based on their habits, identifying usage scenarios, and
leveraging data patterns to provide actionable recommendations for
maximizing user benefits.
The data for this analysis was sourced from Kaggle (Fitbit Dataset).
It consists of a zip file containing two datasets:
1. Fitabase Data 3.12.16-4.11.16
2. Fitabase Data 4.12.16-5.12.16
For this exercise, we will use the Fitabase Data 4.12.16-5.12.16
dataset.
The dataset contains the following 18 CSV files:
• dailyActivity_merged.csv
• dailyCalories_merged.csv
• dailyIntensities_merged.csv
• dailySteps_merged.csv
• heartrate_seconds_merged.csv
• hourlyCalories_merged.csv
• hourlyIntensities_merged.csv
• hourlySteps_merged.csv
• minuteCaloriesNarrow_merged.csv
• minuteCaloriesWide_merged.csv
• minuteIntensitiesNarrow_merged.csv
• minuteIntensitiesWide_merged.csv
• minuteMETsNarrow_merged.csv
• minuteSleep_merged.csv
• minuteStepsNarrow_merged.csv
• minuteStepsWide_merged.csv
• sleepDay_merged.csv
• weightLogInfo_merged.csv
The data has been unzipped and placed in the ./data folder. Each of these CSV files was loaded into the global environment as data frames using a custom load_csv_files function.
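The original load_csv_files function is not shown, so the following is only a minimal sketch of what such a loader could look like, assuming each CSV becomes a data frame named after its file. The argument names (dir, envir) are illustrative choices, not the author's.

```r
# Hypothetical sketch; the author's actual load_csv_files is not shown.
load_csv_files <- function(dir, envir = globalenv()) {
  # Find every CSV file in the folder.
  paths <- list.files(dir, pattern = "\\.csv$", full.names = TRUE)
  # Name each data frame after its file, minus the extension.
  object_names <- tools::file_path_sans_ext(basename(paths))
  for (i in seq_along(paths)) {
    assign(object_names[i],
           read.csv(paths[i], stringsAsFactors = FALSE),
           envir = envir)
  }
  # Return the names of the objects that were created.
  invisible(object_names)
}

# Usage (assumes the unzipped files are in ./data):
# load_csv_files("./data")
```

Passing a dedicated environment instead of the global one keeps the loaded data frames together and makes them easier to iterate over later.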
Using the skim function, a preliminary overview of the data was obtained. The dataset consists of 18 CSV files that can be categorized as follows:
• Daily Data:
o dailyActivity_merged
o dailyCalories_merged
o dailyIntensities_merged
o dailySteps_merged
• Hourly Data:
o hourlyCalories_merged
o hourlyIntensities_merged
o hourlySteps_merged
• Minutely Data:
o minuteCaloriesNarrow_merged
o minuteCaloriesWide_merged
o minuteIntensitiesNarrow_merged
o minuteIntensitiesWide_merged
o minuteMETsNarrow_merged
o minuteSleep_merged
o minuteStepsNarrow_merged
o minuteStepsWide_merged
• Other Data:
o heartrate_seconds_merged
o sleepDay_merged
o weightLogInfo_merged
Several of these files overlap or can be combined:
• Daily Data:
o dailyActivity_merged is already a merge of dailyCalories_merged,
dailyIntensities_merged, and dailySteps_merged.
o sleepDay_merged can be joined with dailyActivity_merged as both
contain daily records.
• Hourly Data:
o Hourly data files can be merged to create hourlyActivity_merged.
• Minutely Data:
o Minutely data files can be merged to create
minuteActivity_merged.
The merged datasets will provide a comprehensive view of user activity
at different time intervals, enabling a detailed analysis of user habits
and usage patterns.
Before proceeding with the analysis, the following steps will be
taken to clean and manipulate the data:
1. Handling Missing Values: Missing values will be identified and
appropriately handled (e.g., imputation, removal).
2. Removing Duplicates: Duplicate records will be removed to ensure data
integrity.
3. Data Type Conversion: All columns will be given the correct data
types (e.g., datetime for date columns).
4. Merging Data: The datasets will be merged as described to create
comprehensive daily, hourly, and minutely datasets.
Drop Unnecessary DataFrames: Since dailyActivity_merged is already a combination of dailyCalories_merged, dailyIntensities_merged, and dailySteps_merged, these separate dataframes are not needed and were dropped from the environment.
Check for Missing Values:
All data frames were checked for missing values. No missing values were found in any of the data frames.
Check for Duplicates:
Duplicate rows were found in two data frames:
• minuteSleep_merged had 543 duplicates.
• sleepDay_merged had 3 duplicates.
Duplicates were removed, and the data was rechecked to confirm that none remained.
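As an illustration of the missing-value and duplicate checks, here is a small sketch in base R. The check_quality helper and the toy rows are hypothetical. Only the column names follow the sleepDay_merged file.

```r
# Hypothetical helper: count missing values and fully duplicated rows.
check_quality <- function(df) {
  c(missing = sum(is.na(df)), duplicates = sum(duplicated(df)))
}

# Toy stand-in for sleepDay_merged (illustrative values only).
sleep <- data.frame(
  Id = c(1, 1, 2),
  SleepDay = rep("4/12/2016 12:00:00 AM", 3),
  TotalMinutesAsleep = c(327, 327, 412)
)

before <- check_quality(sleep)
sleep <- sleep[!duplicated(sleep), ]  # keep the first copy of each row
after <- check_quality(sleep)         # recheck after removal
```

The same helper can be applied to every loaded data frame with lapply to produce one overview of missing values and duplicates.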
Each data frame was inspected for the number of observations and
variables. The observations range from 940 to approximately 2.5 million.
It was noted that the variables related to dates and times are stored as
text data.
Convert Date and Time Columns to Proper Data Types:
Date and time columns were converted to appropriate datetime formats
(POSIXct). This ensures proper handling and manipulation of date and
time data.
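A conversion along these lines could look as follows. The format string is an assumption based on timestamps such as "4/12/2016 2:30:00 PM" and should be adjusted if the files differ. The time zone is also an assumption.

```r
# Assumed raw format, e.g. "4/12/2016 2:30:00 PM".
raw <- c("4/12/2016 12:00:00 AM", "4/12/2016 2:30:00 PM")

# %I is the 12-hour clock and %p the AM/PM marker (locale dependent).
parsed <- as.POSIXct(raw, format = "%m/%d/%Y %I:%M:%S %p", tz = "UTC")

# Date-only columns can be parsed with as.Date instead.
activity_date <- as.Date("4/12/2016", format = "%m/%d/%Y")
```

Values that do not match the format become NA, so counting NA values after conversion is a quick way to confirm that the format string is right.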
Splitting Date and Time Columns:
The datetime values need to be split and properly formatted in the other
data frames as well. Given the number of data frames (14), this could be
a time-consuming process. To streamline this, a split_datetime_column
function was created to find and format these values efficiently. This
function will be used iteratively on the data frames that are not
dailyActivity_merged, splitting dates and times and then splitting the
time values into Hour, Minute, and Second variables.
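The split_datetime_column function itself is not reproduced in this report. The sketch below shows one possible shape for it, simplified to take the column name as an argument rather than finding it automatically. It assumes the column has already been converted to POSIXct.

```r
# Hypothetical sketch; the author's actual function is not shown.
split_datetime_column <- function(df, column, tz = "UTC") {
  dt <- df[[column]]
  stopifnot(inherits(dt, "POSIXct"))
  # Split the datetime into a date plus Hour, Minute, and Second parts.
  df$Date   <- as.Date(dt, tz = tz)
  df$Hour   <- as.integer(format(dt, "%H", tz = tz))
  df$Minute <- as.integer(format(dt, "%M", tz = tz))
  df$Second <- as.integer(format(dt, "%S", tz = tz))
  df
}

# Usage on a toy data frame (illustrative values only).
hourly <- data.frame(
  Id = 1,
  ActivityHour = as.POSIXct("2016-04-12 14:30:15", tz = "UTC")
)
hourly_split <- split_datetime_column(hourly, "ActivityHour")
```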
heartrate_seconds_merged
The heartrate_seconds_merged dataframe has a seconds variable, which
isn’t directly comparable with other data frames. To make it useful, the
mean heart rates for each minute will be calculated.
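One way to compute the per-minute means is to build a minute-level key and aggregate on it. This is a sketch with toy values. The column names Id, Time, and Value are assumed to match heartrate_seconds_merged, and the Minute key is an illustrative name.

```r
# Toy stand-in for heartrate_seconds_merged (illustrative values only).
heart <- data.frame(
  Id = c(1, 1, 1, 2),
  Time = as.POSIXct(c("2016-04-12 07:21:00", "2016-04-12 07:21:05",
                      "2016-04-12 07:22:00", "2016-04-12 07:21:10"),
                    tz = "UTC"),
  Value = c(90, 100, 80, 70)
)

# Drop the seconds so readings in the same minute share one key.
heart$Minute <- format(heart$Time, "%Y-%m-%d %H:%M", tz = "UTC")

# Mean heart rate per user and minute.
heart_minute <- aggregate(Value ~ Id + Minute, data = heart, FUN = mean)
```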
weightLogInfo_merged
In the weightLogInfo_merged dataframe, only the LogId variable
differentiates the information, and there is no way to relate it with
time. Therefore, the data will be grouped by Id to obtain a mean weight
for each person.
Hourly Data
The hourly_merged dataframe will be created by joining all the hourly*
dataframes.
Minutely Data
Similarly, the minute_merged dataframe will be created by joining all
the minute* dataframes.
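The joins could be sketched as below for the hourly files. It assumes the files share the key columns Id and ActivityHour and uses toy values. The minute* data frames would follow the same pattern with their own key columns.

```r
# Toy stand-ins for the three hourly files (illustrative values only).
hourlyCalories <- data.frame(
  Id = c(1, 1), ActivityHour = c("2016-04-12 00:00", "2016-04-12 01:00"),
  Calories = c(81, 61)
)
hourlyIntensities <- data.frame(
  Id = c(1, 1), ActivityHour = c("2016-04-12 00:00", "2016-04-12 01:00"),
  TotalIntensity = c(20, 8), AverageIntensity = c(0.33, 0.13)
)
hourlySteps <- data.frame(
  Id = c(1, 1), ActivityHour = c("2016-04-12 00:00", "2016-04-12 01:00"),
  StepTotal = c(373, 160)
)

# Join the data frames one after another on the shared keys.
hourly_merged <- Reduce(
  function(x, y) merge(x, y, by = c("Id", "ActivityHour")),
  list(hourlyCalories, hourlyIntensities, hourlySteps)
)
```

By default merge performs an inner join, so rows missing from any one file are dropped. Passing all = TRUE would keep them with NA values instead.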
After splitting the datetime columns, it was found that the Hour, Minute, and Second variables in the sleepDay_merged dataframe contained only zeros. Therefore, these columns were removed to clean up the dataframe.
I had five questions for the data, listed in the objective above. To address them, we will perform feature engineering and create new variables that can be used to group users. These variables include time-related variables, activity levels, distance, steps, calories, heart rate, intensity, sleep duration, and weight.
To analyze user behavior based on the day of the week, I created a weekdays data frame that maps dates to weekday names. This data frame was then joined with all other data frames containing date information to add the weekday to each record. I also created a function to group hours of the day into categorical times: Morning, Afternoon, Evening, and Night. This function was applied to all data frames that contain hour data.
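The weekday lookup and the time-of-day function could be sketched as follows. The hour boundaries are not stated in this report, so the cut points below are purely an assumption, and the names time_of_day and weekday_lookup are illustrative.

```r
# Hypothetical boundaries: Night 0-5, Morning 6-11,
# Afternoon 12-17, Evening 18-23.
time_of_day <- function(hour) {
  cut(hour,
      breaks = c(-1, 5, 11, 17, 23),
      labels = c("Night", "Morning", "Afternoon", "Evening"))
}

# Lookup table mapping dates to weekday names (names follow the locale).
dates <- as.Date(c("2016-04-12", "2016-04-13"))
weekday_lookup <- data.frame(Date = dates, Weekday = weekdays(dates))

# Join the weekday onto a toy activity table (illustrative values only).
activity <- data.frame(Id = c(1, 1), Date = dates, TotalSteps = c(13162, 10735))
activity <- merge(activity, weekday_lookup, by = "Date")

periods <- time_of_day(c(0, 6, 12, 18, 23))
```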
From the dailyActivity_merged dataframe, I created a TotalMinutes variable as the sum of the minutes of different types of activities (e.g., SedentaryMinutes, LightlyActiveMinutes, FairlyActiveMinutes, and VeryActiveMinutes). Additionally, I created per-day variables for all metrics by grouping the data by user and date, storing this aggregated data in a new dataframe named per_day.
Here is what the per_day dataframe looks like:
## ── Data Summary ────────────────────────
## Values
## Name per_day
## Number of rows 33
## Number of columns 15
## _______________________
## Column type frequency:
## numeric 15
## ________________________
## Group variables None
##
## ── Variable type: numeric ──────────────────────────────────────────────────────
## skim_variable n_missing complete_rate mean sd
## 1 Id 0 1 4.86e+9 2.43e+9
## 2 steps_per_day 0 1 7.52e+3 3.58e+3
## 3 distance_per_day 0 1 5.40e+0 2.77e+0
## 4 minutes_per_day 0 1 1.22e+3 1.98e+2
## 5 Calories_per_day 0 1 2.28e+3 5.63e+2
## 6 VeryActiveDistance_per_day 0 1 1.45e+0 1.87e+0
## 7 ModeratelyActiveDistance_per_day 0 1 5.57e-1 5.38e-1
## 8 LightActiveDistance_per_day 0 1 3.32e+0 1.38e+0
## 9 SedentaryActiveDistance_per_day 0 1 1.63e-3 3.02e-3
## 10 VeryActiveMinutes_per_day 0 1 2.03e+1 2.38e+1
## 11 FairlyActiveMinutes_per_day 0 1 1.33e+1 1.21e+1
## 12 LightlyActiveMinutes_per_day 0 1 1.92e+2 7.57e+1
## 13 SedentaryMinutes_per_day 0 1 9.99e+2 2.28e+2
## 14 mean_sleep_minutes 9 0.727 3.77e+2 1.37e+2
## 15 mean_time_in_bed_minutes 9 0.727 4.20e+2 1.74e+2
## p0 p25 p50 p75 p100 hist
## 1 1.50e+9 2.35e+9 4.45e+9 6.96e+9 8.88e+9 ▇▆▃▅▅
## 2 9.16e+2 5.57e+3 7.28e+3 9.52e+3 1.60e+4 ▃▅▇▃▁
## 3 6.35e-1 3.45e+0 5.30e+0 6.91e+0 1.32e+1 ▃▇▆▁▁
## 4 9.11e+2 1.04e+3 1.32e+3 1.42e+3 1.44e+3 ▃▃▁▂▇
## 5 1.48e+3 1.92e+3 2.13e+3 2.60e+3 3.44e+3 ▅▇▃▂▂
## 6 6.13e-3 1.42e-1 7.30e-1 2.21e+0 8.51e+0 ▇▃▁▁▁
## 7 1.13e-2 1.28e-1 5.02e-1 7.73e-1 2.75e+0 ▇▆▁▁▁
## 8 5.07e-1 2.61e+0 3.50e+0 4.14e+0 6.19e+0 ▃▆▇▆▂
## 9 0 0 0 7.69e-4 1.10e-2 ▇▁▁▁▁
## 10 9.68e-2 3.58e+0 1.04e+1 2.34e+1 8.73e+1 ▇▂▁▁▁
## 11 2.58e-1 4.03e+0 1.23e+1 1.94e+1 6.13e+1 ▇▆▂▁▁
## 12 3.86e+1 1.44e+2 2.06e+2 2.46e+2 3.28e+2 ▃▆▅▇▃
## 13 6.62e+2 7.66e+2 1.08e+3 1.21e+3 1.32e+3 ▇▃▁▆▇
## 14 6.1 e+1 3.36e+2 4.17e+2 4.49e+2 6.52e+2 ▂▂▃▇▁
## 15 6.9 e+1 3.77e+2 4.46e+2 4.87e+2 9.61e+2 ▂▅▇▁▁
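For reference, the TotalMinutes variable and the per_day aggregation described above could be built roughly as follows. The summary shows 33 rows, which suggests one row of daily averages per user, so this sketch averages over each user's days. That reading is an assumption, the values are toy data, and only two of the output columns are shown.

```r
# Toy stand-in for dailyActivity_merged (illustrative values only).
daily <- data.frame(
  Id = c(1, 1, 2),
  TotalSteps = c(10000, 12000, 4000),
  VeryActiveMinutes = c(25, 35, 0),
  FairlyActiveMinutes = c(13, 17, 5),
  LightlyActiveMinutes = c(300, 200, 100),
  SedentaryMinutes = c(700, 800, 1200)
)

# TotalMinutes is the sum of the four activity-minute columns.
minute_cols <- c("VeryActiveMinutes", "FairlyActiveMinutes",
                 "LightlyActiveMinutes", "SedentaryMinutes")
daily$TotalMinutes <- rowSums(daily[, minute_cols])

# Average each metric over a user's recorded days.
per_day <- aggregate(cbind(TotalSteps, TotalMinutes) ~ Id,
                     data = daily, FUN = mean)
names(per_day) <- c("Id", "steps_per_day", "minutes_per_day")
```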