Bellabeat case study / mini project

Project Overview

Objective(s)

Analyze Bellabeat’s smart device data.
Derive insights into consumer usage patterns.

Expected Outcome

Provide actionable insights to guide Bellabeat’s marketing strategy.

Target Audience

The Bellabeat executive team.

Deliverables

An analysis report (this one!) showcasing process and results.
High-level recommendations to enhance Bellabeat’s marketing approach (given at the end).

Loading Required Libraries

# Load necessary libraries
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(tidyr)
library(janitor)

## 
## Attaching package: 'janitor'

## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test

library(skimr)
library(stringr)
library(lubridate)

## 
## Attaching package: 'lubridate'

## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union

library(here)

## here() starts at D:/Bellabeat-data-to-clean

library(ggplot2)
library(knitr)
library(pander)

We start by loading the required libraries: - dplyr and tidyr for data manipulation - janitor for cleaning column names - skimr for summary statistics - stringr for string manipulation - lubridate for working with dates - ggplot2 for data visualization - knitr and pander for reporting and displaying results

Setting the Working Directory and File Handling

# Set the working directory to the folder containing CSV files
setwd("D:\\Bellabeat-data-to-clean")

# List all CSV files in the directory
file_list <- list.files(pattern = "\\.csv$")

In this section, we set the working directory to where the data files are located and list all CSV files in that directory.

Creating a Folder to Save Cleaned Data

# Create a folder to save cleaned files
cleaned_folder <- "cleaned_files"
if (!dir.exists(cleaned_folder)) dir.create(cleaned_folder)

Here, we create a new directory named cleaned_files to store the cleaned dataset(s) if it doesn’t already exist.

Data Cleaning Function

# Function to clean and analyze the data
clean_data <- function(data, file_name) {
    # Remove empty rows
    data <- data %>% 
        filter(complete.cases(.))
    
    head(data)
    
    # Remove duplicate rows
    data <- data %>% distinct()
    
    # Trim whitespaces around entries in character cells
    data <- data %>% 
        mutate(across(where(is.character), ~str_trim(.)))
    
    # Format numeric columns to have 1-2 decimal places
    data <- data %>% 
        mutate(across(where(is.numeric), ~round(., 2)))
    
    # Analyze missing data and suggest filling methods
    missing_data_summary <- colSums(is.na(data))
    if (any(missing_data_summary > 0)) {
        print("Columns with missing data:", names(missing_data_summary[missing_data_summary > 0]), "\n")
        print("Suggested methods to fill missing data depend on column types, e.g.:\n")
        print("  - Numeric columns: Mean/Median/Mode\n")
        print("  - Categorical columns: Mode or most frequent value\n")
        print("  - Datetime columns: Impute based on trends\n")
    }
    
    # Data summary
    skim_without_charts(data)
    
    # Clean column names
    data <- clean_names(data)
    
    return(data)
}

This function performs several steps to clean the data: 1. Remove empty rows using complete.cases(). 2. Remove duplicate rows with distinct(). 3. Trim whitespace in character columns. 4. Format numeric columns to 1-2 decimal places. 5. Handle missing data by suggesting methods for imputation. 6. Generate a summary of the data using skimr. 7. Clean column names using clean_names() from the janitor package.

Looping Through Files and Cleaning Data

# Loop through each file
for (file in file_list) {
    # Read the CSV file
    data <- read.csv(file)
    
    # Clean the data
    cleaned_data <- clean_data(data, file)
    
    # Save the cleaned data
    cleaned_file_path <- file.path(cleaned_folder, paste("cleaned", file))
    write.csv(cleaned_data, cleaned_file_path, row.names = FALSE)
    
    # Summary statistics
    summary_stats <- cleaned_data %>%
        summarize(
            avg_steps = mean(total_steps),
            avg_calories = mean(calories),
            avg_very_active_minutes = mean(very_active_minutes)
        )
    #pander(summary_stats, caption = "Summary Statistics")
    print(summary_stats)
}

##   avg_steps avg_calories avg_very_active_minutes
## 1  7637.911      2303.61                21.16489

Visualizing total steps over time

    # Visualizing smoothed daily total steps over time
    suppressWarnings({
    ggplot(cleaned_data, aes(x = as.Date(activity_date), y = total_steps)) +
        geom_point(alpha = 0.5, color = "blue") +
        geom_smooth(method = "loess", color = "red", se = FALSE) +
        labs(
            title = "Smoothed Daily Total Steps Over Time",
            x = "Date",
            y = "Total Steps"
        ) +
        theme_minimal()
})

## `geom_smooth()` using formula = 'y ~ x'

This loop iterates over each CSV file, reads the data, cleans it using the clean_data() function, and saves the cleaned data in a new folder. It also generates summary statistics and visualizations.

Correlation Analysis

# Calculate correlation between Total Steps and Calories
correlation <- cor(cleaned_data$total_steps, cleaned_data$calories, use = "complete.obs")

# Print correlation
print(paste("Correlation between Total Steps and Calories:", round(correlation, 2)))

## [1] "Correlation between Total Steps and Calories: 0.59"

In this section, we calculate the correlation between total steps and calories burned using the cor() function, and display the result.

Visualizing the Relationship (correlation) Between Total Steps and Calories

# Scatter plot with trend line showing the relationship between Total Steps and Calories
ggplot(cleaned_data, aes(x = total_steps, y = calories)) +
    geom_point(color = "blue", alpha = 0.5) +
    geom_smooth(method = "lm", color = "red", se = TRUE) +
    labs(
        title = "Relationship between Total Steps and Calories",
        x = "Total Steps",
        y = "Calories Burned"
    ) +
    theme_minimal()

## `geom_smooth()` using formula = 'y ~ x'

This plot visualizes the relationship between total steps and calories burned with a scatter plot and a linear trend line.

Categorizing Users by Activity Level

# Create user groups based on activity levels
cleaned_data <- cleaned_data %>%
    mutate(activity_level = case_when(
        total_steps > 10000 ~ "Highly Active",
        total_steps >= 5000 ~ "Moderately Active",
        TRUE ~ "Sedentary"
    ))

Create a bar chart of activity levels

# Create a bar chart of activity levels
ggplot(cleaned_data, aes(x = activity_level, fill = activity_level)) +
    geom_bar() +
    labs(
        title = "Distribution of Users by Activity Level",
        x = "Activity Level",
        y = "Number of Users"
    ) +
    theme_minimal() +
    scale_fill_manual(values = c("Highly Active" = "green", 
                                 "Moderately Active" = "orange", 
                                 "Sedentary" = "red")) +
    theme(legend.position = "none")

In this section, we categorize users based on their total steps into three activity levels and visualize the distribution with a bar chart.

Analyzing Peak Activity Times

# Identify peak activity times and periods
cleaned_data %>%
    group_by(activity_date) %>%
    summarize(daily_steps = sum(total_steps)) %>%
    arrange(desc(daily_steps))

## # A tibble: 31 × 2
##    activity_date daily_steps
##    <chr>               <dbl>
##  1 4/16/2016          277733
##  2 4/12/2016          271816
##  3 4/23/2016          267124
##  4 4/21/2016          263795
##  5 4/20/2016          261215
##  6 4/30/2016          258726
##  7 4/27/2016          258516
##  8 4/19/2016          257557
##  9 4/14/2016          255538
## 10 4/25/2016          253849
## # ℹ 21 more rows

Here, we summarize the total steps for each day and identify the peak activity periods.

Visualizing Calories Burned by Activity Level

# Boxplot for calories burned by activity level
ggplot(cleaned_data, aes(x = activity_level, y = calories)) +
    geom_boxplot() +
    labs(title = "Calories Burned by Activity Level")

This box plot visualizes how calories burned vary by activity level.

Bellabeat’s marketing strategy

It is based on the analysis of user activity data above. The analysis includes trends, activity levels, and correlations that can guide product and marketing decisions to improve user engagement and satisfaction.

1. User Activity Levels

Insight:

The data categorizes users into three activity levels: Highly Active, Moderately Active, and Sedentary. The distribution of users across these levels provides an opportunity for targeted marketing and product development.

Marketing Strategy:

Targeting Sedentary Users: Since many users fall under the “Sedentary” category, Bellabeat can create campaigns aimed at encouraging them to become more active.
- Example: Personalized campaigns focusing on lifestyle change, like step-by-step activity guides or motivational reminders to boost daily steps.
Product Positioning: Develop features catering to users starting their fitness journey (e.g., reminders to take small steps to increase activity).

Improvement Areas for Devices & App:

Motivational Features: Include features like customized reminders or progress tracking, encouraging sedentary users to become “Highly Active.”
User Education: Create in-app content that explains the benefits of increasing activity, aimed at motivating users in the “Sedentary” group.

2. Correlation Between Total Steps and Calories

Insight:

A positive correlation between total steps and calories burned indicates that users who take more steps tend to burn more calories.

Marketing Strategy:

Emphasize Health Benefits: Highlight the connection between increased physical activity (steps) and improved health outcomes.
- Example: Messaging like “Take 10,000 steps a day for a healthier life” can resonate with users.
Targeted Campaigns: Promote the idea of walking more to burn calories, especially to users in the “Moderately Active” group.

Improvement Areas for Devices & App:

Step Count Goals: Implement features that show how specific step counts contribute to calories burned, making activity more rewarding.
Incorporate Health Metrics: Integrate more health data (e.g., heart rate, sleep quality) to provide a holistic fitness view, showing users how their activity impacts overall well-being.

3. User Activity Trends Over Time

Insight:

Visualizing total steps over time reveals peak activity periods, which can guide Bellabeat in sending targeted notifications and creating promotions.

Marketing Strategy:

Time-Based Engagement: Use peak activity times to send reminders or engagement prompts.
- Example: Notifications encouraging users to take a walk or move more at specific times of day, such as during lunch or evenings.
Seasonal Campaigns: Identify seasonal patterns to create marketing campaigns that align with these trends (e.g., promoting walking challenges in summer or fitness resolutions in January).

Improvement Areas for Devices & App:

Smart Reminders: The app should send time-sensitive reminders during peak activity periods to keep users engaged.
Seasonal Features: Implement in-app features that help users stay active during off-peak seasons (e.g., winter challenges).

4. Calories Burned by Activity Level

Insight:

The analysis suggests that the more active a user is, the more calories they tend to burn.

Marketing Strategy:

Feature Active Users: Bellabeat can use stories or testimonials from highly active users to inspire others.
- Example: Share success stories of users who achieved fitness milestones and how Bellabeat helped them.
Targeting Moderately Active Users: Encourage users in the “Moderately Active” group to push towards higher activity levels with new features in the app.

Improvement Areas for Devices & App:

Personalized Challenges: Introduce challenges designed to increase calorie burn, based on the user’s current activity level.
Incentivize Calorie Burn: Implement features like badges or rewards for reaching calorie-burning milestones.

5. Identifying Peak Activity Times

Insight:

By analyzing peak activity periods, Bellabeat can identify when users are most active and can tailor engagement efforts accordingly.

Marketing Strategy:

Engagement During Peak Hours: Send push notifications or app reminders during peak activity times to keep users motivated.
- Example: Send reminders to users during peak periods to encourage them to achieve specific step goals.
Social Media Strategy: Create social media challenges that align with peak activity times, motivating users to share their achievements.

Improvement Areas for Devices & App:

Adaptive Notifications: The app can tailor notifications to match peak activity periods for greater user engagement.
Group Challenges: Develop group challenges around peak activity times, creating a social, competitive element that encourages more steps.

Conclusion

The insights derived from the analysis suggest that Bellabeat can enhance its marketing strategy by targeting users based on their activity levels, motivating them with step and calorie-based goals, and engaging them during peak activity times. Additionally, improving the app’s features to provide personalized and actionable insights will encourage users to increase their activity levels and achieve their fitness goals.

Note(s)

This case study is part of the Google Data Analytics Certificate program on Coursera.
Only one file has been used for analysis. File name: ‘dailyActivity_merged.csv.’ Therefore, the analysis and insights are based on a very small subset of data.
Generative AI (ChatGPT) has been used to learn, code / re-code, correct mistakes/errors, update code, and also to generate and update this notebook / markdown.
This is just a quick version (I am still learning :). I am already working, so I don’t have much time to work on the detailed version. There may be some mistakes, so use it at your own risk.

Bellabeat case study / mini project

Muhammad Tahir

Project Overview

Objective(s)

Expected Outcome

Target Audience

Deliverables

Loading Required Libraries

Setting the Working Directory and File Handling

Creating a Folder to Save Cleaned Data

Data Cleaning Function

Looping Through Files and Cleaning Data

Visualizing total steps over time

Correlation Analysis

Visualizing the Relationship (correlation) Between Total Steps and Calories

Categorizing Users by Activity Level

Create a bar chart of activity levels

Analyzing Peak Activity Times

Visualizing Calories Burned by Activity Level

Bellabeat’s marketing strategy

1. User Activity Levels

Insight:

Marketing Strategy:

Improvement Areas for Devices & App:

2. Correlation Between Total Steps and Calories

Insight:

Marketing Strategy:

Improvement Areas for Devices & App:

3. User Activity Trends Over Time

Insight:

Marketing Strategy:

Improvement Areas for Devices & App:

4. Calories Burned by Activity Level

Insight:

Marketing Strategy:

Improvement Areas for Devices & App:

5. Identifying Peak Activity Times

Insight:

Marketing Strategy:

Improvement Areas for Devices & App:

Conclusion

Note(s)