Business Task

The primary objective of this analysis is to explore the relationship between Bellabeat users and their usage of Bellabeat products. We aim to define user groups based on their habits and create usage scenarios for each group. By addressing the following questions, we will gain insights to inform Bellabeat’s marketing strategy and product development:
1. Are there distinct user groups based on product usage?
2. How do these user groups differ in their usage of the product?
3. How can this analysis benefit these user groups by profiling their needs?
4. Are there patterns in product usage, and what are the usage habits of each group?
5. Can we identify patterns in the data that can be leveraged to maximize benefits for all users?
The task is to analyze the usage data from Bellabeat’s smart devices and app to identify distinct user groups and their respective usage patterns. Understanding these groups and their behaviors will enable Bellabeat to tailor its marketing strategies to better meet user needs, enhance user satisfaction, and drive growth. This analysis involves segmenting users based on their habits, identifying usage scenarios, and leveraging data patterns to provide actionable recommendations for maximizing user benefits.

Data Sources

The data for this analysis was sourced from Kaggle and is available at Kaggle - Fitbit Dataset. It consists of a zip file containing two datasets:
1. Fitabase Data 3.12.16-4.11.16
2. Fitabase Data 4.12.16-5.12.16
For this exercise, we will use the Fitabase Data 4.12.16-5.12.16 dataset.

Data Files

The dataset contains the following 18 CSV files:
• dailyActivity_merged.csv
• dailyCalories_merged.csv
• dailyIntensities_merged.csv
• dailySteps_merged.csv
• heartrate_seconds_merged.csv
• hourlyCalories_merged.csv
• hourlyIntensities_merged.csv
• hourlySteps_merged.csv
• minuteCaloriesNarrow_merged.csv
• minuteCaloriesWide_merged.csv
• minuteIntensitiesNarrow_merged.csv
• minuteIntensitiesWide_merged.csv
• minuteMETsNarrow_merged.csv
• minuteSleep_merged.csv
• minuteStepsNarrow_merged.csv
• minuteStepsWide_merged.csv
• sleepDay_merged.csv
• weightLogInfo_merged.csv

Data Loading

The data was unzipped and placed in the ./data folder. Each CSV file was then loaded into the global environment as a data frame using a custom load_csv_files function.
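The load_csv_files function itself is not shown in this report. A minimal sketch of what such a helper could look like in base R follows; the argument names, the envir parameter, and the use of the file name as the data frame name are assumptions made for illustration.

```r
# Hypothetical sketch of a load_csv_files helper (the original is not shown).
# Reads every CSV in `dir` and assigns it to `envir` under the file's base name.
load_csv_files <- function(dir, envir = globalenv()) {
  paths <- list.files(dir, pattern = "\\.csv$", full.names = TRUE)
  for (path in paths) {
    # "dailyActivity_merged.csv" becomes the object dailyActivity_merged
    name <- tools::file_path_sans_ext(basename(path))
    assign(name, read.csv(path, stringsAsFactors = FALSE), envir = envir)
  }
  # Return the names of the loaded data frames without printing them
  invisible(tools::file_path_sans_ext(basename(paths)))
}

# Demo: a temporary folder stands in for ./data
demo_dir <- file.path(tempdir(), "fitabase_demo")
dir.create(demo_dir, showWarnings = FALSE)
write.csv(data.frame(Id = c(1, 2), TotalSteps = c(100, 200)),
          file.path(demo_dir, "dailyActivity_merged.csv"), row.names = FALSE)

demo_env <- new.env()
loaded <- load_csv_files(demo_dir, envir = demo_env)
```

Calling load_csv_files("./data") with the default envir would place the data frames in the global environment, as described above.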

Data Overview

Using the skim function, a preliminary overview of the data was obtained. The dataset consists of 18 CSV files that can be categorized as follows:

• Daily Data:
    o dailyActivity_merged
    o dailyCalories_merged
    o dailyIntensities_merged
    o dailySteps_merged
• Hourly Data:
    o hourlyCalories_merged
    o hourlyIntensities_merged
    o hourlySteps_merged
• Minutely Data:
    o minuteCaloriesNarrow_merged
    o minuteCaloriesWide_merged
    o minuteIntensitiesNarrow_merged
    o minuteIntensitiesWide_merged
    o minuteMETsNarrow_merged
    o minuteSleep_merged
    o minuteStepsNarrow_merged
    o minuteStepsWide_merged
• Other Data:
    o heartrate_seconds_merged
    o sleepDay_merged
    o weightLogInfo_merged

Data Merging

• Daily Data:
    o dailyActivity_merged is already a merge of dailyCalories_merged, dailyIntensities_merged, and dailySteps_merged.
    o sleepDay_merged can be joined with dailyActivity_merged as both contain daily records.
• Hourly Data:
    o Hourly data files can be merged to create hourlyActivity_merged.
• Minutely Data:
    o Minutely data files can be merged to create minuteActivity_merged.
The merged datasets will provide a comprehensive view of user activity at different time intervals, enabling a detailed analysis of user habits and usage patterns.

Data Cleaning and Manipulation

Before proceeding with the analysis, the following steps will be taken to clean and manipulate the data:
1. Handling Missing Values: Missing values will be identified and appropriately handled (e.g., imputation, removal).
2. Removing Duplicates: Duplicate records will be removed to ensure data integrity.
3. Data Type Conversion: All columns will be converted to the correct data types (e.g., datetime for date columns).
4. Merging Data: The datasets will be merged as described above to create comprehensive daily, hourly, and minutely datasets.

Data Preparation Steps

  1. Drop Unnecessary DataFrames: Since dailyActivity_merged is already a combination of dailyCalories_merged, dailyIntensities_merged, and dailySteps_merged, these separate dataframes are not needed and were dropped from the environment.

  2. Check for Missing Values:

All data frames were checked for missing values. No missing values were found in any of the data frames.
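The check itself is not shown in the report. One way to run it across several data frames at once is sketched below; the count_missing helper and the demo data are hypothetical.

```r
# Hypothetical helper: count NA values per column for each data frame in a named list
count_missing <- function(frames) {
  lapply(frames, function(df) colSums(is.na(df)))
}

# Small demo frames standing in for the real data
frames <- list(
  daily = data.frame(Id = c(1, 2), TotalSteps = c(100, NA)),
  sleep = data.frame(Id = c(1, 2), TotalMinutesAsleep = c(400, 380))
)

missing_counts <- count_missing(frames)
# Total number of missing values per data frame
total_missing <- sapply(missing_counts, sum)
```

Note that is.na does not flag empty strings in text columns, so a count of zero here does not rule out blank text fields.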

  3. Remove Duplicates:

• minuteSleep_merged had 543 duplicates.
• sleepDay_merged had 3 duplicates.

Duplicates were removed and the data was rechecked to confirm there were no remaining duplicates.
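A minimal sketch of the duplicate check and removal in base R is shown below. The remove_duplicates helper and the demo rows are hypothetical; only exact full-row duplicates are treated as duplicates here.

```r
# Hypothetical helper: report and drop rows that repeat an earlier row exactly
remove_duplicates <- function(df) {
  is_dup <- duplicated(df)
  message(sum(is_dup), " duplicate row(s) removed")
  df[!is_dup, , drop = FALSE]
}

# Demo data: the second row repeats the first
sleep_dup_demo <- data.frame(
  Id = c(1, 1, 2),
  SleepDay = c("4/12/2016", "4/12/2016", "4/12/2016"),
  TotalMinutesAsleep = c(327, 327, 400)
)

sleep_clean <- remove_duplicates(sleep_dup_demo)
# Recheck: no duplicates should remain
remaining <- sum(duplicated(sleep_clean))
```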

  4. Data Shape and Types:

Each data frame was inspected for the number of observations and variables. The number of observations ranges from 940 to approximately 2.5 million. The variables related to dates and times are stored as text (character data type), which can cause issues in comparisons and calculations.

  5. Convert Date and Time Columns to Proper Data Types:
    Date and time columns were converted to appropriate datetime formats (POSIXct). This ensures proper handling and manipulation of date and time data.
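As an illustration, the conversion could look like the sketch below. It assumes the timestamps are stored as text in a month/day/year 12-hour format such as "4/12/2016 1:00:00 PM" and that the column is named ActivityHour; both should be checked against the actual files. Parsing AM/PM with %p also depends on the session locale.

```r
# Demo data with timestamps stored as text (format is an assumption)
hourly_demo <- data.frame(
  Id = c(1, 1),
  ActivityHour = c("4/12/2016 12:00:00 AM", "4/12/2016 1:00:00 PM"),
  stringsAsFactors = FALSE
)

# Convert the text column to POSIXct; a fixed time zone keeps results reproducible
hourly_demo$ActivityHour <- as.POSIXct(
  hourly_demo$ActivityHour,
  format = "%m/%d/%Y %I:%M:%S %p",
  tz = "UTC"
)

# Comparisons and arithmetic now work on real datetimes
hours_apart <- as.numeric(difftime(hourly_demo$ActivityHour[2],
                                   hourly_demo$ActivityHour[1],
                                   units = "hours"))
```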

  6. Splitting Date and Time Columns:
    The datetime values need to be split and properly formatted in the other data frames as well. Given the number of data frames (14), this could be a time-consuming process. To streamline this, a split_datetime_column function was created to find and format these values efficiently. This function will be used iteratively on the data frames that are not dailyActivity_merged, splitting dates and times and then splitting the time values into Hour, Minute, and Second variables.
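The split_datetime_column function is not reproduced in this report. The sketch below shows one possible shape for it; unlike the function described above, this version takes the column name as an argument rather than finding it, and the timestamp format and output column names are assumptions.

```r
# Hypothetical sketch of split_datetime_column (the original is not shown).
# Adds Date, Hour, Minute and Second columns derived from a text datetime column.
split_datetime_column <- function(df, column,
                                  format = "%m/%d/%Y %I:%M:%S %p",
                                  tz = "UTC") {
  parsed <- as.POSIXlt(df[[column]], format = format, tz = tz)
  df$Date   <- as.Date(parsed)
  df$Hour   <- parsed$hour
  df$Minute <- parsed$min
  df$Second <- parsed$sec
  df
}

# Demo on a single record
split_demo <- split_datetime_column(
  data.frame(Id = 1, ActivityHour = "4/12/2016 1:30:15 PM",
             stringsAsFactors = FALSE),
  "ActivityHour"
)
```

To apply it iteratively, the data frames can be kept in a named list and passed through lapply together with the relevant column name for each one.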

Handling Specific Data Frames

  1. heartrate_seconds_merged
    The heartrate_seconds_merged dataframe has a seconds variable, which isn’t directly comparable with other data frames. To make it useful, the mean heart rates for each minute will be calculated.
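A minimal sketch of the per-minute aggregation follows. It assumes the columns are named Id, Time and Value and that Time has already been converted to a datetime; the demo readings are made up.

```r
# Demo heart-rate readings at second resolution (values are illustrative)
hr_demo <- data.frame(
  Id = c(1, 1, 1, 2),
  Time = as.POSIXct(c("2016-04-12 07:21:00", "2016-04-12 07:21:05",
                      "2016-04-12 07:22:10", "2016-04-12 07:21:00"),
                    tz = "UTC"),
  Value = c(90, 100, 80, 70)
)

# Build a minute-level key by dropping the seconds from each timestamp
hr_demo$Minute <- format(hr_demo$Time, "%Y-%m-%d %H:%M")

# Mean heart rate per user and minute
hr_per_minute <- aggregate(Value ~ Id + Minute, data = hr_demo, FUN = mean)
```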

  2. weightLogInfo_merged
    In the weightLogInfo_merged dataframe, only the LogId variable differentiates the information, and there is no way to relate it with time. Therefore, the data will be grouped by Id to obtain a mean weight for each person.
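The grouping by Id can be sketched as follows, assuming a WeightKg column holds the logged weight; the output column name mean_weight_kg and the demo values are illustrative.

```r
# Demo weight log: user 1 has two entries, user 2 has one
weight_demo <- data.frame(
  Id = c(1, 1, 2),
  WeightKg = c(60, 62, 80),
  LogId = c(101, 102, 103)
)

# One row per user with the mean of all logged weights
mean_weight <- aggregate(WeightKg ~ Id, data = weight_demo, FUN = mean)
names(mean_weight)[names(mean_weight) == "WeightKg"] <- "mean_weight_kg"
```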

Creating Merged DataFrames

  1. Hourly Data
    The hourly_merged dataframe will be created by joining all the hourly* dataframes.

  2. Minutely Data
    Similarly, the minute_merged dataframe will be created by joining all the minute* dataframes.
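The joins can be sketched with base R's merge applied repeatedly through Reduce. The column names (ActivityHour, Calories, TotalIntensity, StepTotal) and the choice of Id plus ActivityHour as join keys are assumptions; the demo values are made up.

```r
# Demo hourly data frames sharing the same keys
keys <- data.frame(
  Id = c(1, 1),
  ActivityHour = c("4/12/2016 12:00:00 AM", "4/12/2016 1:00:00 AM"),
  stringsAsFactors = FALSE
)
hourly_calories    <- cbind(keys, Calories = c(81, 61))
hourly_intensities <- cbind(keys, TotalIntensity = c(20, 8))
hourly_steps       <- cbind(keys, StepTotal = c(373, 160))

# Join all frames pairwise on the shared keys (inner join by default)
hourly_merged <- Reduce(
  function(x, y) merge(x, y, by = c("Id", "ActivityHour")),
  list(hourly_calories, hourly_intensities, hourly_steps)
)
```

merge keeps only rows whose keys appear in both inputs; passing all = TRUE would keep unmatched rows and fill the gaps with NA. The same pattern applies to the minute-level data frames with their own key columns.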

Removing Unnecessary Columns

After splitting the datetime columns, the Hour, Minute, and Second variables in the sleepDay_merged dataframe were found to contain only zeros. These columns were therefore removed to clean up the dataframe.
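A small sketch of this cleanup follows. The drop_all_zero_columns helper is hypothetical; it removes only those candidate columns whose values are all zero, so it leaves a data frame unchanged when the time parts carry real information.

```r
# Hypothetical helper: drop the candidate columns that contain only zeros
drop_all_zero_columns <- function(df, cols) {
  is_all_zero <- vapply(cols, function(col) isTRUE(all(df[[col]] == 0)),
                        logical(1))
  df[, setdiff(names(df), cols[is_all_zero]), drop = FALSE]
}

# Demo: daily sleep records whose time parts are all zero
sleep_time_demo <- data.frame(
  Id = c(1, 2),
  TotalMinutesAsleep = c(327, 384),
  Hour = c(0, 0), Minute = c(0, 0), Second = c(0, 0)
)

sleep_trimmed <- drop_all_zero_columns(sleep_time_demo,
                                       c("Hour", "Minute", "Second"))
```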

Analyze

I had five questions for the data:

  1. Are there user groups that differ in their usage of the product?
  2. In what respects do those user groups differ?
  3. How can this grouping benefit the analysis by profiling the needs of the various user groups?
  4. Are there any patterns in the usage of the product? What are the usage habits of the user groups?
  5. Are there any patterns in the data which can be leveraged to maximize the benefit for all users?

The variables I can use to group users include:

  • Time variables, like the time of the day, or the day of the week
  • Activity per day
  • Distance per day
  • Steps per day
  • Calories per day
  • Mean Heart Rate
  • Intensity per day
  • Sleep duration per day
  • Weight

To address these questions, we will perform feature engineering and create new variables that can be used to group users. These variables include time-related variables, activity levels, distance, steps, calories, heart rate, intensity, sleep duration, and weight. To analyze user behavior based on the day of the week, I created a weekdays data frame that maps dates to weekday names. This data frame was then joined with all other data frames containing date information to add the weekday information to each record. I also created a function to group hours of the day into categorical times: Morning, Afternoon, Evening, and Night. This function was applied to all data frames that contain hour data.
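The two helpers described above are not shown in the report. The sketch below illustrates the idea; the function names and the hour boundaries (Night 0-5, Morning 6-11, Afternoon 12-17, Evening 18-23) are assumptions. Base R's weekdays() would also work for the weekday names, but its output depends on the session locale, so this sketch maps the ISO weekday number to fixed English names instead.

```r
# Map dates to English weekday names via the ISO weekday number (1 = Monday)
weekday_name <- function(date) {
  days <- c("Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday")
  days[as.integer(format(date, "%u"))]
}

# Group hours 0-23 into four categories (boundaries are an assumption)
time_of_day <- function(hour) {
  cut(hour, breaks = c(-1, 5, 11, 17, 23),
      labels = c("Night", "Morning", "Afternoon", "Evening"))
}

# Lookup table of dates and weekday names, joined onto a demo data frame
dates <- seq(as.Date("2016-04-12"), as.Date("2016-04-14"), by = "day")
weekdays_df <- data.frame(Date = dates, Weekday = weekday_name(dates))

activity_demo <- data.frame(
  Id = c(1, 1),
  Date = as.Date(c("2016-04-12", "2016-04-13")),
  Hour = c(7, 20)
)
activity_demo <- merge(activity_demo, weekdays_df, by = "Date")
activity_demo$TimeOfDay <- time_of_day(activity_demo$Hour)
```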

From the dailyActivity_merged dataframe, I created a TotalMinutes variable as the sum of the minutes of different types of activities (e.g., SedentaryMinutes, LightlyActiveMinutes, FairlyActiveMinutes, and VeryActiveMinutes). Additionally, I created per-day variables for all metrics by grouping the data by user and date, storing this aggregated data in a new dataframe named per_day.
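A minimal sketch of this step is given below. It assumes that per_day holds each user's average over their recorded days, which would be consistent with a summary of one row per Id; the demo values are made up and only two of the per-day variables are shown.

```r
# Demo daily activity: user 1 has two days, user 2 has one
daily_demo <- data.frame(
  Id = c(1, 1, 2),
  ActivityDate = c("4/12/2016", "4/13/2016", "4/12/2016"),
  TotalSteps = c(10000, 8000, 3000),
  VeryActiveMinutes = c(30, 20, 0),
  FairlyActiveMinutes = c(10, 20, 5),
  LightlyActiveMinutes = c(200, 180, 100),
  SedentaryMinutes = c(700, 720, 1200)
)

# TotalMinutes is the sum of the four activity-level minute columns
minute_cols <- c("VeryActiveMinutes", "FairlyActiveMinutes",
                 "LightlyActiveMinutes", "SedentaryMinutes")
daily_demo$TotalMinutes <- rowSums(daily_demo[, minute_cols])

# Average each metric over a user's days (assumed meaning of "per day")
per_day_demo <- aggregate(
  daily_demo[, c("TotalSteps", "TotalMinutes")],
  by = list(Id = daily_demo$Id),
  FUN = mean
)
names(per_day_demo) <- c("Id", "steps_per_day", "minutes_per_day")
```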

Here is what the per_day dataframe looks like:

## ── Data Summary ────────────────────────
##                            Values 
## Name                       per_day
## Number of rows             33     
## Number of columns          15     
## _______________________           
## Column type frequency:            
##   numeric                  15     
## ________________________          
## Group variables            None   
## 
## ── Variable type: numeric ──────────────────────────────────────────────────────
##    skim_variable                    n_missing complete_rate    mean      sd
##  1 Id                                       0         1     4.86e+9 2.43e+9
##  2 steps_per_day                            0         1     7.52e+3 3.58e+3
##  3 distance_per_day                         0         1     5.40e+0 2.77e+0
##  4 minutes_per_day                          0         1     1.22e+3 1.98e+2
##  5 Calories_per_day                         0         1     2.28e+3 5.63e+2
##  6 VeryActiveDistance_per_day               0         1     1.45e+0 1.87e+0
##  7 ModeratelyActiveDistance_per_day         0         1     5.57e-1 5.38e-1
##  8 LightActiveDistance_per_day              0         1     3.32e+0 1.38e+0
##  9 SedentaryActiveDistance_per_day          0         1     1.63e-3 3.02e-3
## 10 VeryActiveMinutes_per_day                0         1     2.03e+1 2.38e+1
## 11 FairlyActiveMinutes_per_day              0         1     1.33e+1 1.21e+1
## 12 LightlyActiveMinutes_per_day             0         1     1.92e+2 7.57e+1
## 13 SedentaryMinutes_per_day                 0         1     9.99e+2 2.28e+2
## 14 mean_sleep_minutes                       9         0.727 3.77e+2 1.37e+2
## 15 mean_time_in_bed_minutes                 9         0.727 4.20e+2 1.74e+2
##         p0     p25     p50     p75    p100 hist 
##  1 1.50e+9 2.35e+9 4.45e+9 6.96e+9 8.88e+9 ▇▆▃▅▅
##  2 9.16e+2 5.57e+3 7.28e+3 9.52e+3 1.60e+4 ▃▅▇▃▁
##  3 6.35e-1 3.45e+0 5.30e+0 6.91e+0 1.32e+1 ▃▇▆▁▁
##  4 9.11e+2 1.04e+3 1.32e+3 1.42e+3 1.44e+3 ▃▃▁▂▇
##  5 1.48e+3 1.92e+3 2.13e+3 2.60e+3 3.44e+3 ▅▇▃▂▂
##  6 6.13e-3 1.42e-1 7.30e-1 2.21e+0 8.51e+0 ▇▃▁▁▁
##  7 1.13e-2 1.28e-1 5.02e-1 7.73e-1 2.75e+0 ▇▆▁▁▁
##  8 5.07e-1 2.61e+0 3.50e+0 4.14e+0 6.19e+0 ▃▆▇▆▂
##  9 0       0       0       7.69e-4 1.10e-2 ▇▁▁▁▁
## 10 9.68e-2 3.58e+0 1.04e+1 2.34e+1 8.73e+1 ▇▂▁▁▁
## 11 2.58e-1 4.03e+0 1.23e+1 1.94e+1 6.13e+1 ▇▆▂▁▁
## 12 3.86e+1 1.44e+2 2.06e+2 2.46e+2 3.28e+2 ▃▆▅▇▃
## 13 6.62e+2 7.66e+2 1.08e+3 1.21e+3 1.32e+3 ▇▃▁▆▇
## 14 6.1 e+1 3.36e+2 4.17e+2 4.49e+2 6.52e+2 ▂▂▃▇▁
## 15 6.9 e+1 3.77e+2 4.46e+2 4.87e+2 9.61e+2 ▂▅▇▁▁

Share

Once you have completed your analysis, create your data visualizations. The visualizations should clearly communicate your high-level insights and recommendations. Use the following Case Study Roadmap as a guide:

Case Study Roadmap - Share

Guiding questions

  • Were you able to answer the business questions?
  • What story does your data tell?
  • How do your findings relate to your original question?
  • Who is your audience? What is the best way to communicate with them?
  • Can data visualization help you share your findings?
  • Is your presentation accessible to your audience?

Key tasks

  1. Determine the best way to share your findings.
  2. Create effective data visualizations.
  3. Present your findings.
  4. Ensure your work is accessible.

Deliverable

Supporting visualizations and key findings

Follow these steps:

  1. Take out a piece of paper and a pen and sketch some ideas for how you will visualize the data.
  2. Once you choose a visual form, open your tool of choice to create your visualization. Use presentation software, such as PowerPoint or Google Slides; your spreadsheet program; Tableau; or R.
  3. Create your data visualization, remembering that contrast should be used to draw your audience’s attention to the most important insights. Use artistic principles including size, color, and shape.
  4. Ensure clear meaning through the proper use of common elements, such as headlines, subtitles, and labels.
  5. Refine your data visualization by applying deep attention to detail.

Act

Now that you have finished creating your visualizations, act on your findings. Prepare the deliverables you have been asked to create, including the high-level recommendations based on your analysis. Use the following Case Study Roadmap as a guide:

Case Study Roadmap - Act

Guiding questions

  • What is your final conclusion based on your analysis?
  • How could your team and business apply your insights?
  • What next steps would you or your stakeholders take based on your findings?
  • Is there additional data you could use to expand on your findings?

Key tasks

  1. Create your portfolio.
  2. Add your case study.
  3. Practice presenting your case study to a friend or family member.

Deliverable

Your top high-level insights based on your analysis

Follow these steps:

  1. If you do not have one already, create an online portfolio. (Use Build a Portfolio with Google Sites.)
  2. Consider how you want to feature your case study in your portfolio.
  3. Upload or link your case study findings to your portfolio.
  4. Write a brief paragraph describing the case study, your process, and your discoveries.
  5. Add the paragraph to introduce your case study in your portfolio.