Bellabeat Usage Analysis

Data Files

The dataset contains the following 18 CSV files:
• dailyActivity_merged.csv
• dailyCalories_merged.csv
• dailyIntensities_merged.csv
• dailySteps_merged.csv
• heartrate_seconds_merged.csv
• hourlyCalories_merged.csv
• hourlyIntensities_merged.csv
• hourlySteps_merged.csv
• minuteCaloriesNarrow_merged.csv
• minuteCaloriesWide_merged.csv
• minuteIntensitiesNarrow_merged.csv
• minuteIntensitiesWide_merged.csv
• minuteMETsNarrow_merged.csv
• minuteSleep_merged.csv
• minuteStepsNarrow_merged.csv
• minuteStepsWide_merged.csv
• sleepDay_merged.csv
• weightLogInfo_merged.csv

Data Loading

The data has been unzipped and placed in the ./data folder. Each CSV file was loaded into the global environment as a data frame using a custom load_csv_files function.
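The report does not show the load_csv_files function itself, and the original analysis was carried out in R. As a rough illustration of the idea only, here is a minimal Python sketch using the standard library. The return type (a dictionary of tables keyed by file name) and the choice to keep every value as text are assumptions made for this example.

```python
import csv
from pathlib import Path


def load_csv_files(folder):
    """Read every *.csv file in `folder` into a list of row dictionaries.

    Returns a dict keyed by the file name without its extension,
    for example "dailyActivity_merged". All values stay as strings.
    """
    tables = {}
    for path in sorted(Path(folder).glob("*.csv")):
        with path.open(newline="", encoding="utf-8") as handle:
            tables[path.stem] = list(csv.DictReader(handle))
    return tables
```

Calling load_csv_files("./data") would then give one table per CSV file, which plays the role of the data frames in the global environment.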

Data Overview

Using the skim function, a preliminary overview of the data was obtained. The dataset consists of 18 CSV files that can be categorized as follows:

• Daily Data:
o dailyActivity_merged
o dailyCalories_merged
o dailyIntensities_merged
o dailySteps_merged
• Hourly Data:
o hourlyCalories_merged
o hourlyIntensities_merged
o hourlySteps_merged
• Minutely Data:
o minuteCaloriesNarrow_merged
o minuteCaloriesWide_merged
o minuteIntensitiesNarrow_merged
o minuteIntensitiesWide_merged
o minuteMETsNarrow_merged
o minuteSleep_merged
o minuteStepsNarrow_merged
o minuteStepsWide_merged
• Other Data:
o heartrate_seconds_merged
o sleepDay_merged
o weightLogInfo_merged

Data Merging

• Daily Data:
o dailyActivity_merged is already a merge of dailyCalories_merged, dailyIntensities_merged, and dailySteps_merged.
o sleepDay_merged can be joined with dailyActivity_merged as both contain daily records.
• Hourly Data:
o Hourly data files can be merged to create hourlyActivity_merged.
• Minutely Data:
o Minutely data files can be merged to create minuteActivity_merged.
The merged datasets will provide a comprehensive view of user activity at different time intervals, enabling a detailed analysis of user habits and usage patterns.

Data Cleaning and Manipulation

Before proceeding with the analysis, the following steps will be taken to clean and manipulate the data:
1. Handling Missing Values: Missing values will be identified and appropriately handled (e.g., imputation, removal).
2. Removing Duplicates: Duplicate records will be removed to ensure data integrity.
3. Data Type Conversion: Columns will be converted to the correct data types (e.g., datetime for date columns).
4. Merging Data: The datasets will be merged as described above to create comprehensive daily, hourly, and minutely datasets.

Data Preparation Steps

Drop Unnecessary Data Frames:

Since dailyActivity_merged is already a combination of dailyCalories_merged, dailyIntensities_merged, and dailySteps_merged, these separate data frames were not needed and were dropped from the environment.

Check for Missing Values:

All data frames were checked for missing values. No missing values were found in any of the data frames.
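A missing-value check of this kind can be sketched as follows. This is a Python illustration, not the R code used in the analysis, and the set of strings treated as missing (MISSING_MARKERS) is an assumption.

```python
MISSING_MARKERS = {"", "NA", "NaN"}  # assumed spellings of a missing value


def count_missing(rows):
    """Count missing cells per column in a list of row dictionaries."""
    counts = {}
    for row in rows:
        for column, value in row.items():
            is_missing = value is None or str(value).strip() in MISSING_MARKERS
            counts[column] = counts.get(column, 0) + int(is_missing)
    return counts
```

A table passes the check when every count in the returned dictionary is zero.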

Remove Duplicates:

• minuteSleep_merged had 543 duplicates.
• sleepDay_merged had 3 duplicates.

Duplicates were removed and the data was rechecked to confirm there were no remaining duplicates.
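The remove-and-recheck step can be illustrated with a short Python sketch (the analysis itself was done in R). It treats two rows as duplicates only when every column matches, which is an assumption about how duplicates were defined here.

```python
def remove_duplicates(rows):
    """Return (unique_rows, number_removed), keeping first occurrences in order."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(sorted(row.items()))  # hashable fingerprint of the whole row
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique, len(rows) - len(unique)
```

Running the function a second time on its own output should report zero removed rows, which corresponds to the recheck described above.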

Data Shape and Types:

Each data frame was inspected for its number of observations and variables. The number of observations ranges from 940 to approximately 2.5 million. It was noted that the variables related to dates and times are stored as text (character data type), which can cause issues in comparisons and calculations.

Convert Date and Time Columns to Proper Data Types:

Date and time columns were converted to appropriate datetime formats (POSIXct). This ensures proper handling and manipulation of date and time data.

Splitting Date and Time Columns:

The datetime values need to be split and properly formatted in the other data frames as well. Given the number of data frames (14), this could be a time-consuming process. To streamline this, a split_datetime_column function was created to find and format these values efficiently. This function will be used iteratively on the data frames that are not dailyActivity_merged, splitting dates and times and then splitting the time values into Hour, Minute, and Second variables.
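The body of split_datetime_column is not given in this report. The Python sketch below shows one way such a helper could work. It is an illustration under assumptions: the timestamp format string, the explicit column argument, and the names of the output columns (Date, Hour, Minute, Second) are chosen for the example, and the real function reportedly locates the datetime column on its own.

```python
from datetime import datetime

# Assumed text format of the timestamps, e.g. "4/12/2016 7:21:00 AM".
DATETIME_FORMAT = "%m/%d/%Y %I:%M:%S %p"


def split_datetime_column(rows, column, fmt=DATETIME_FORMAT):
    """Parse `column` in each row and add Date, Hour, Minute, Second fields."""
    result = []
    for row in rows:
        stamp = datetime.strptime(row[column], fmt)
        new_row = dict(row)  # copy so the input rows are left unchanged
        new_row["Date"] = stamp.date()
        new_row["Hour"] = stamp.hour
        new_row["Minute"] = stamp.minute
        new_row["Second"] = stamp.second
        result.append(new_row)
    return result
```

Parsing first and splitting second means comparisons and arithmetic operate on real date and time values rather than on text.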

Handling Specific Data Frames

heartrate_seconds_merged
The heartrate_seconds_merged dataframe is recorded at the level of seconds, which isn’t directly comparable with the other data frames. To make it useful, the mean heart rate for each minute will be calculated.
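The per-minute averaging can be sketched in Python as below. This is not the R code from the analysis. The column names Id, Time, and Value and the timestamp format are assumptions made for the example.

```python
from collections import defaultdict
from datetime import datetime

# Assumed text format of the timestamps, e.g. "4/12/2016 7:21:00 AM".
DATETIME_FORMAT = "%m/%d/%Y %I:%M:%S %p"


def mean_heart_rate_per_minute(rows, fmt=DATETIME_FORMAT):
    """Average heart rate readings per user and per minute.

    Returns a dict mapping (Id, minute_start) to the mean reading.
    """
    buckets = defaultdict(list)
    for row in rows:
        # Truncate the timestamp to the start of its minute.
        minute = datetime.strptime(row["Time"], fmt).replace(second=0)
        buckets[(row["Id"], minute)].append(float(row["Value"]))
    return {key: sum(values) / len(values) for key, values in buckets.items()}
```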
weightLogInfo_merged
In the weightLogInfo_merged dataframe, only the LogId variable differentiates the information, and there is no way to relate it with time. Therefore, the data will be grouped by Id to obtain a mean weight for each person.
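Grouping by Id to get one mean weight per person can be illustrated with the Python sketch below. The weight column name WeightKg is an assumption, and the analysis itself used R.

```python
from collections import defaultdict
from statistics import mean


def mean_weight_per_user(rows, weight_column="WeightKg"):
    """Return a dict mapping each Id to the mean of its logged weights."""
    weights = defaultdict(list)
    for row in rows:
        weights[row["Id"]].append(float(row[weight_column]))
    return {user: mean(values) for user, values in weights.items()}
```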

Creating Merged DataFrames

Hourly Data
The hourly_merged dataframe will be created by joining all the hourly* dataframes.
Minutely Data
Similarly, the minute_merged dataframe will be created by joining all the minute* dataframes.
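The report does not state which join type or key columns were used. As a hedged illustration in Python (the analysis was done in R), the sketch below performs an inner join on a shared user and timestamp key and folds it over several tables. The helper names inner_join and join_all are hypothetical, and the sketch assumes each key appears at most once per table.

```python
from functools import reduce


def inner_join(left, right, keys):
    """Join two lists of row dicts on `keys`, keeping only matching rows."""
    index = {tuple(row[k] for k in keys): row for row in right}
    joined = []
    for row in left:
        match = index.get(tuple(row[k] for k in keys))
        if match is not None:
            joined.append({**row, **match})
    return joined


def join_all(tables, keys):
    """Inner-join a sequence of tables from left to right on the same keys."""
    return reduce(lambda acc, table: inner_join(acc, table, keys), tables)
```

For the hourly files this would be called as join_all([calories, intensities, steps], keys=("Id", "ActivityHour")), where ActivityHour is an assumed column name.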

Removing Unnecessary Columns

After splitting the datetime columns, it was found that the Hour, Minute, and Second variables in the sleepDay_merged dataframe contained only zeros. Therefore, these columns were removed to clean up the dataframe.
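A check of this kind can be sketched in Python as follows. The function name drop_all_zero_columns is hypothetical and the code is illustrative rather than the R code used in the analysis.

```python
def drop_all_zero_columns(rows, columns):
    """Remove each listed column whose value is 0 in every row."""
    to_drop = {
        column for column in columns
        if all(float(row[column]) == 0 for row in rows)
    }
    return [
        {key: value for key, value in row.items() if key not in to_drop}
        for row in rows
    ]
```

Columns that hold any non-zero value are kept, so the same call is safe to run on tables where the time components carry information.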

Analyze

I had five questions for the data:

Are there user groups that differ in how they use the product?
In what respects do those user groups differ?
How can this grouping benefit the analysis by profiling the needs of the various user groups?
Are there any patterns in the usage of the product? What are the usage habits of the user groups?
Are there any patterns in the data that can be leveraged to maximize the benefit for all users?

The variables I can use to group users are:

Time variables, like the time of the day, or the day of the week
Activity per day
Distance per day
Steps per day
Calories per day
Mean Heart Rate
Intensity per day
Sleep duration per day
Weight

To address these questions, I performed feature engineering to create new variables that can be used to group users. These variables cover time, activity level, distance, steps, calories, heart rate, intensity, sleep duration, and weight. To analyze user behavior by day of the week, I created a weekdays data frame that maps dates to weekday names. This data frame was then joined with every other data frame containing date information, adding the weekday to each record. I also created a function that groups the hours of the day into four categories: Morning, Afternoon, Evening, and Night. This function was applied to all data frames that contain hour data.
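The two time features can be sketched in Python as below (the analysis itself was done in R). The hour boundaries for Morning, Afternoon, Evening, and Night are not stated in this report, so the cut-offs used here are assumptions chosen only for illustration.

```python
from datetime import date

WEEKDAY_NAMES = (
    "Monday", "Tuesday", "Wednesday", "Thursday",
    "Friday", "Saturday", "Sunday",
)


def weekday_name(day):
    """Map a datetime.date to its English weekday name."""
    return WEEKDAY_NAMES[day.weekday()]


def time_of_day(hour):
    """Group an hour (0-23) into a category; the cut-offs are assumed."""
    if not 0 <= hour <= 23:
        raise ValueError(f"hour must be between 0 and 23, got {hour}")
    if 6 <= hour < 12:
        return "Morning"
    if 12 <= hour < 18:
        return "Afternoon"
    if 18 <= hour < 22:
        return "Evening"
    return "Night"
```

Using a fixed tuple of names instead of locale-dependent formatting keeps the weekday labels identical on every machine.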

From the dailyActivity_merged dataframe, I created a TotalMinutes variable as the sum of the minutes of different types of activities (e.g., SedentaryMinutes, LightlyActiveMinutes, FairlyActiveMinutes, and VeryActiveMinutes). Additionally, I created per-day variables for all metrics by grouping the data by user and date, storing this aggregated data in a new dataframe named per_day.
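The TotalMinutes sum and a per-user average of a daily metric can be sketched in Python as follows. This is an illustration, not the R code from the analysis. The helper names are hypothetical, and averaging per user is an assumption based on the per_day summary, which has 33 rows and an Id column.

```python
from collections import defaultdict
from statistics import mean

ACTIVITY_MINUTE_COLUMNS = (
    "SedentaryMinutes",
    "LightlyActiveMinutes",
    "FairlyActiveMinutes",
    "VeryActiveMinutes",
)


def add_total_minutes(rows):
    """Add TotalMinutes as the sum of the four activity-minute columns."""
    return [
        {**row, "TotalMinutes": sum(int(row[c]) for c in ACTIVITY_MINUTE_COLUMNS)}
        for row in rows
    ]


def mean_per_user(rows, column):
    """Average `column` over all daily records of each user (Id)."""
    values = defaultdict(list)
    for row in rows:
        values[row["Id"]].append(float(row[column]))
    return {user: mean(user_values) for user, user_values in values.items()}
```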

Here is what the per_day data frame looks like:

## ── Data Summary ────────────────────────
##                            Values 
## Name                       per_day
## Number of rows             33     
## Number of columns          15     
## _______________________           
## Column type frequency:            
##   numeric                  15     
## ________________________          
## Group variables            None   
## 
## ── Variable type: numeric ──────────────────────────────────────────────────────
##    skim_variable                    n_missing complete_rate    mean      sd
##  1 Id                                       0         1     4.86e+9 2.43e+9
##  2 steps_per_day                            0         1     7.52e+3 3.58e+3
##  3 distance_per_day                         0         1     5.40e+0 2.77e+0
##  4 minutes_per_day                          0         1     1.22e+3 1.98e+2
##  5 Calories_per_day                         0         1     2.28e+3 5.63e+2
##  6 VeryActiveDistance_per_day               0         1     1.45e+0 1.87e+0
##  7 ModeratelyActiveDistance_per_day         0         1     5.57e-1 5.38e-1
##  8 LightActiveDistance_per_day              0         1     3.32e+0 1.38e+0
##  9 SedentaryActiveDistance_per_day          0         1     1.63e-3 3.02e-3
## 10 VeryActiveMinutes_per_day                0         1     2.03e+1 2.38e+1
## 11 FairlyActiveMinutes_per_day              0         1     1.33e+1 1.21e+1
## 12 LightlyActiveMinutes_per_day             0         1     1.92e+2 7.57e+1
## 13 SedentaryMinutes_per_day                 0         1     9.99e+2 2.28e+2
## 14 mean_sleep_minutes                       9         0.727 3.77e+2 1.37e+2
## 15 mean_time_in_bed_minutes                 9         0.727 4.20e+2 1.74e+2
##         p0     p25     p50     p75    p100 hist 
##  1 1.50e+9 2.35e+9 4.45e+9 6.96e+9 8.88e+9 ▇▆▃▅▅
##  2 9.16e+2 5.57e+3 7.28e+3 9.52e+3 1.60e+4 ▃▅▇▃▁
##  3 6.35e-1 3.45e+0 5.30e+0 6.91e+0 1.32e+1 ▃▇▆▁▁
##  4 9.11e+2 1.04e+3 1.32e+3 1.42e+3 1.44e+3 ▃▃▁▂▇
##  5 1.48e+3 1.92e+3 2.13e+3 2.60e+3 3.44e+3 ▅▇▃▂▂
##  6 6.13e-3 1.42e-1 7.30e-1 2.21e+0 8.51e+0 ▇▃▁▁▁
##  7 1.13e-2 1.28e-1 5.02e-1 7.73e-1 2.75e+0 ▇▆▁▁▁
##  8 5.07e-1 2.61e+0 3.50e+0 4.14e+0 6.19e+0 ▃▆▇▆▂
##  9 0       0       0       7.69e-4 1.10e-2 ▇▁▁▁▁
## 10 9.68e-2 3.58e+0 1.04e+1 2.34e+1 8.73e+1 ▇▂▁▁▁
## 11 2.58e-1 4.03e+0 1.23e+1 1.94e+1 6.13e+1 ▇▆▂▁▁
## 12 3.86e+1 1.44e+2 2.06e+2 2.46e+2 3.28e+2 ▃▆▅▇▃
## 13 6.62e+2 7.66e+2 1.08e+3 1.21e+3 1.32e+3 ▇▃▁▆▇
## 14 6.1 e+1 3.36e+2 4.17e+2 4.49e+2 6.52e+2 ▂▂▃▇▁
## 15 6.9 e+1 3.77e+2 4.46e+2 4.87e+2 9.61e+2 ▂▅▇▁▁

Share

Once you have completed your analysis, create your data visualizations. The visualizations should clearly communicate your high-level insights and recommendations. Use the following Case Study Roadmap as a guide:

Case Study Roadmap - Share

Guiding questions

Were you able to answer the business questions?
What story does your data tell?
How do your findings relate to your original question?
Who is your audience? What is the best way to communicate with them?
Can data visualization help you share your findings?
Is your presentation accessible to your audience?

Key tasks

Determine the best way to share your findings.
Create effective data visualizations.
Present your findings.
Ensure your work is accessible.

Deliverable

Supporting visualizations and key findings

Follow these steps:

Take out a piece of paper and a pen and sketch some ideas for how you will visualize the data.
Once you choose a visual form, open your tool of choice to create your visualization. Use a presentation software, such as PowerPoint or Google Slides; your spreadsheet program; Tableau; or R.
Create your data visualization, remembering that contrast should be used to draw your audience’s attention to the most important insights. Use artistic principles including size, color, and shape.
Ensure clear meaning through the proper use of common elements, such as headlines, subtitles, and labels.
Refine your data visualization by applying deep attention to detail.

Act

Now that you have finished creating your visualizations, act on your findings. Prepare the deliverables you have been asked to create, including the high-level recommendations based on your analysis. Use the following Case Study Roadmap as a guide:

Case Study Roadmap - Act

Guiding questions

What is your final conclusion based on your analysis?
How could your team and business apply your insights?
What next steps would you or your stakeholders take based on your findings?
Is there additional data you could use to expand on your findings?

Key tasks

Create your portfolio.
Add your case study.
Practice presenting your case study to a friend or family member.

Deliverable

Your top high-level insights based on your analysis