R Case Study - Bellabeat

About the company

Bellabeat is a high-tech manufacturer of health-focused products for women. It is a successful small company with the potential to become a larger player in the global smart device market. Urška Sršen, cofounder and Chief Creative Officer of Bellabeat, believes that analyzing smart device fitness data could help unlock new growth opportunities for the company.

Questions for the analysis

What are some trends in smart device usage? How could these trends apply to Bellabeat customers? How could these trends help influence Bellabeat marketing strategy?

Business task

Identify opportunities for growth and recommend improvements to Bellabeat’s marketing strategy based on trends in smart device usage.

Loading Packages

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(dplyr)
library(ggplot2)
library(janitor)

## 
## Attaching package: 'janitor'
## 
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test

library(readr)
library(skimr)
library(lubridate)

Importing datasets

For this project, I will use the FitBit Fitness Tracker data.

activity <- read.csv("/Users/phing/Documents/Fita/dailyActivity_merged.csv")
heart_rate <- read.csv("/Users/phing/Documents/Fita/hear.csv")
hourly_calories <- read.csv("/Users/phing/Documents/Fita/hourlyCalories_merged.csv")
hourly_intensity <- read.csv("/Users/phing/Documents/Fita/hourlyIntensities_merged.csv")
hourly_step <- read.csv("/Users/phing/Documents/Fita/hourlySteps_merged.csv")
sleep_day <- read.csv("/Users/phing/Documents/Fita/sleepDay_merged.csv")
weight_log <- read.csv("/Users/phing/Documents/Fita/weightLogInfo_merged.csv")

I already checked the data in Google Sheets, so here I only need to confirm that everything was imported correctly using the View() and head() functions.
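A quick structural check can complement head(). The sketch below is only an illustration: it uses a small made-up data frame (toy_activity, not the real file) and looks at dimensions, column types, missing values, and fully duplicated rows.

```r
# Toy stand-in for an imported table; all values are illustrative only.
toy_activity <- data.frame(
  Id = c(1, 1, 2, 2, 2, 3),
  ActivityDate = c("3/25/2016", "3/26/2016", "3/25/2016",
                   "3/26/2016", "3/26/2016", "3/25/2016"),
  TotalSteps = c(11004, 17609, 5000, 7000, 7000, NA),
  stringsAsFactors = FALSE
)

dim(toy_activity)   # number of rows and columns
str(toy_activity)   # column types; note that the dates arrive as character

# Missing values per column and number of fully duplicated rows.
na_per_column <- colSums(is.na(toy_activity))
n_duplicates  <- sum(duplicated(toy_activity))
```

The same calls should work on the imported tables, for example colSums(is.na(activity)).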

head(activity)

##           Id ActivityDate TotalSteps TotalDistance TrackerDistance
## 1 1503960366    3/25/2016      11004          7.11            7.11
## 2 1503960366    3/26/2016      17609         11.55           11.55
## 3 1503960366    3/27/2016      12736          8.53            8.53
## 4 1503960366    3/28/2016      13231          8.93            8.93
## 5 1503960366    3/29/2016      12041          7.85            7.85
## 6 1503960366    3/30/2016      10970          7.16            7.16
##   LoggedActivitiesDistance VeryActiveDistance ModeratelyActiveDistance
## 1                        0               2.57                     0.46
## 2                        0               6.92                     0.73
## 3                        0               4.66                     0.16
## 4                        0               3.19                     0.79
## 5                        0               2.16                     1.09
## 6                        0               2.36                     0.51
##   LightActiveDistance SedentaryActiveDistance VeryActiveMinutes
## 1                4.07                       0                33
## 2                3.91                       0                89
## 3                3.71                       0                56
## 4                4.95                       0                39
## 5                4.61                       0                28
## 6                4.29                       0                30
##   FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes Calories
## 1                  12                  205              804     1819
## 2                  17                  274              588     2154
## 3                   5                  268              605     1944
## 4                  20                  224             1080     1932
## 5                  28                  243              763     1886
## 6                  13                  223             1174     1820

I spotted some problems with the timestamp data, so before the analysis I need to convert it to a date-time format and split it into date and time columns.

Fixing formatting

# intensities
hourly_intensity$ActivityHour <- as.POSIXct(hourly_intensity$ActivityHour, format = "%m/%d/%Y %I:%M:%S %p", tz = Sys.timezone())
hourly_intensity$time <- format(hourly_intensity$ActivityHour, format = "%H:%M:%S")
hourly_intensity$date <- format(hourly_intensity$ActivityHour, format = "%m/%d/%Y")
# calories
hourly_calories$ActivityHour <- as.POSIXct(hourly_calories$ActivityHour, format = "%m/%d/%Y %I:%M:%S %p", tz = Sys.timezone())
hourly_calories$time <- format(hourly_calories$ActivityHour, format = "%H:%M:%S")
hourly_calories$date <- format(hourly_calories$ActivityHour, format = "%m/%d/%Y")
# activity
activity$ActivityDate <- as.POSIXct(activity$ActivityDate, format = "%m/%d/%Y", tz = Sys.timezone())
activity$date <- format(activity$ActivityDate, format = "%m/%d/%Y")
# sleep
sleep_day$SleepDay <- as.POSIXct(sleep_day$SleepDay, format = "%m/%d/%Y %I:%M:%S %p", tz = Sys.timezone())
sleep_day$date <- format(sleep_day$SleepDay, format = "%m/%d/%Y")
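To see what this conversion does, here is a minimal sketch on three made-up timestamps in the same layout. The fixed "UTC" time zone is my assumption for reproducibility (the code above uses Sys.timezone()), and the variable names are hypothetical.

```r
# Illustrative timestamps in the "m/d/Y h:m:s AM/PM" layout; not taken from the data.
raw <- c("4/12/2016 12:00:00 AM", "4/12/2016 1:00:00 PM", "4/12/2016 11:00:00 PM")

# A fixed time zone (an assumption here) gives the same result on every machine.
# Note: %p (AM/PM) is interpreted according to the session locale.
parsed <- as.POSIXct(raw, format = "%m/%d/%Y %I:%M:%S %p", tz = "UTC")

# Rows that fail to parse come back as NA, so count them before going further.
n_failed <- sum(is.na(parsed))

# Split into a 24-hour time string and a Date.
time_part <- format(parsed, "%H:%M:%S")
date_part <- as.Date(parsed, tz = "UTC")
```

Keeping the date as a Date rather than a "%m/%d/%Y" string means it sorts chronologically, which can matter once the data spans more than one month.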

Now that everything is ready, I can start exploring data sets.

Exploring and summarizing data

n_distinct(activity$Id)

## [1] 35

n_distinct(hourly_calories$Id)

## [1] 34

n_distinct(hourly_intensity$Id)

## [1] 34

n_distinct(sleep_day$Id)

## [1] 24

n_distinct(weight_log$Id)

## [1] 11

These counts show the number of participants in each data set.

There are 35 participants in the activity data set, 34 in the calories and intensities data sets, 24 in the sleep data set, and only 11 in the weight data set. Eleven participants are too few to support any recommendations or conclusions from the weight data.
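The participant counts can also be compared in one step. This is a sketch on toy tables (the list called tables and its contents are made up), showing the distinct-Id count per table and which Ids are missing from one of them.

```r
# Toy stand-ins for the imported tables; only the Id column matters here.
tables <- list(
  activity = data.frame(Id = c(1, 1, 2, 3, 3)),
  sleep    = data.frame(Id = c(1, 1, 3)),
  weight   = data.frame(Id = c(3))
)

# Number of distinct participants per table.
participants <- sapply(tables, function(tbl) length(unique(tbl$Id)))

# Share of the activity participants who also appear in each table.
coverage <- participants / participants[["activity"]]

# Ids present in the activity table but missing from the sleep table.
missing_from_sleep <- setdiff(tables$activity$Id, tables$sleep$Id)
```

With the real data, the list would hold activity, sleep_day, weight_log, and so on.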

Let’s have a look at summary statistics of the data sets:

#activity 
activity %>% 
  select(TotalSteps, TotalDistance, SedentaryMinutes, Calories) %>% 
  summary()

##    TotalSteps    TotalDistance    SedentaryMinutes    Calories   
##  Min.   :    0   Min.   : 0.000   Min.   :  32.0   Min.   :   0  
##  1st Qu.: 1988   1st Qu.: 1.410   1st Qu.: 728.0   1st Qu.:1776  
##  Median : 5986   Median : 4.090   Median :1057.0   Median :2062  
##  Mean   : 6547   Mean   : 4.664   Mean   : 995.3   Mean   :2189  
##  3rd Qu.:10198   3rd Qu.: 7.160   3rd Qu.:1285.0   3rd Qu.:2667  
##  Max.   :28497   Max.   :27.530   Max.   :1440.0   Max.   :4562

# explore num of active minutes per category
activity %>% 
  select(VeryActiveMinutes, FairlyActiveMinutes, LightlyActiveMinutes) %>% 
  summary()

##  VeryActiveMinutes FairlyActiveMinutes LightlyActiveMinutes
##  Min.   :  0.00    Min.   :  0.00      Min.   :  0.0       
##  1st Qu.:  0.00    1st Qu.:  0.00      1st Qu.: 64.0       
##  Median :  0.00    Median :  1.00      Median :181.0       
##  Mean   : 16.62    Mean   : 13.07      Mean   :170.1       
##  3rd Qu.: 25.00    3rd Qu.: 16.00      3rd Qu.:257.0       
##  Max.   :202.00    Max.   :660.00      Max.   :720.0

# calories
hourly_calories %>% 
  select(Calories) %>% 
  summary()

##     Calories     
##  Min.   : 42.00  
##  1st Qu.: 61.00  
##  Median : 77.00  
##  Mean   : 94.27  
##  3rd Qu.:104.00  
##  Max.   :933.00

# sleep
sleep_day %>% 
  select(TotalSleepRecords, TotalMinutesAsleep, TotalTimeInBed) %>% 
  summary()

##  TotalSleepRecords TotalMinutesAsleep TotalTimeInBed 
##  Min.   :1.000     Min.   : 58.0      Min.   : 61.0  
##  1st Qu.:1.000     1st Qu.:361.0      1st Qu.:403.0  
##  Median :1.000     Median :433.0      Median :463.0  
##  Mean   :1.119     Mean   :419.5      Mean   :458.6  
##  3rd Qu.:1.000     3rd Qu.:490.0      3rd Qu.:526.0  
##  Max.   :3.000     Max.   :796.0      Max.   :961.0

#weight
weight_log %>% 
  select(WeightKg, BMI) %>% 
  summary()

##     WeightKg           BMI       
##  Min.   : 53.30   Min.   :21.45  
##  1st Qu.: 61.70   1st Qu.:24.10  
##  Median : 62.50   Median :24.39  
##  Mean   : 73.44   Mean   :25.73  
##  3rd Qu.: 85.80   3rd Qu.:25.76  
##  Max.   :129.60   Max.   :46.17

Here are some interesting findings from the data:

- Sedentary time: On average, participants are sedentary for 995 minutes (about 16.6 hours) each day. This is very high and should be reduced.
- Activity level: Most participants are only lightly active. The median is 0 very active minutes and 1 fairly active minute per day, compared with 181 lightly active minutes.
- Sleep: Participants sleep an average of about 7 hours (419.5 minutes) per night, usually in one stretch.
- Steps: Participants take an average of 6,547 steps daily, fewer than the 8,000 steps a day that the CDC links to health benefits. According to the CDC, taking 8,000 steps a day was associated with a 51% lower risk of death from all causes, and taking 12,000 steps with a 65% lower risk, compared with taking 4,000 steps.
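The conversions behind these figures are simple arithmetic on the means reported by summary(). Here is a minimal sketch with the values copied by hand from the output above. The efficiency figure is a ratio of two means, so it is only a rough stand-in for a per-night average.

```r
# Means copied from the summary() output above.
mean_sedentary_min <- 995.3
mean_asleep_min    <- 419.5
mean_in_bed_min    <- 458.6
mean_steps         <- 6547

sedentary_hours  <- mean_sedentary_min / 60             # hours per day
asleep_hours     <- mean_asleep_min / 60                # hours per night
sleep_efficiency <- mean_asleep_min / mean_in_bed_min   # share of time in bed spent asleep
steps_gap        <- 8000 - mean_steps                   # steps short of the 8,000 benchmark

round(c(sedentary_hours, asleep_hours, sleep_efficiency, steps_gap), 2)
```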

Merging data

I’m going to merge the activity and sleep data on the Id and date columns.

merged_data <- merge(sleep_day, activity, by=c("Id", "date"))
head(merged_data)

##           Id       date   SleepDay TotalSleepRecords TotalMinutesAsleep
## 1 1503960366 04/12/2016 2016-04-12                 1                327
## 2 1927972279 04/12/2016 2016-04-12                 3                750
## 3 2026352035 04/12/2016 2016-04-12                 1                503
## 4 3977333714 04/12/2016 2016-04-12                 1                274
## 5 4020332650 04/12/2016 2016-04-12                 1                501
## 6 4445114986 04/12/2016 2016-04-12                 2                429
##   TotalTimeInBed ActivityDate TotalSteps TotalDistance TrackerDistance
## 1            346   2016-04-12        224          0.14            0.14
## 2            775   2016-04-12         24          0.02            0.02
## 3            546   2016-04-12       1019          0.63            0.63
## 4            469   2016-04-12        759          0.57            0.57
## 5            541   2016-04-12          8          0.01            0.01
## 6            457   2016-04-12        278          0.19            0.19
##   LoggedActivitiesDistance VeryActiveDistance ModeratelyActiveDistance
## 1                        0                  0                        0
## 2                        0                  0                        0
## 3                        0                  0                        0
## 4                        0                  0                        0
## 5                        0                  0                        0
## 6                        0                  0                        0
##   LightActiveDistance SedentaryActiveDistance VeryActiveMinutes
## 1                0.13                       0                 0
## 2                0.02                       0                 0
## 3                0.63                       0                 0
## 4                0.57                       0                 0
## 5                0.01                       0                 0
## 6                0.19                       0                 0
##   FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes Calories
## 1                   0                    9               32       50
## 2                   0                    3              161      942
## 3                   0                   64              223      600
## 4                   0                   17              187      182
## 5                   0                    1              321      446
## 6                   0                   20              253      745
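One thing worth checking after a merge is how many rows it dropped, because merge() performs an inner join by default. The sketch below uses two made-up tables (sleep and act are hypothetical names) to show the difference between the default and all.x = TRUE.

```r
# Toy tables; Ids, dates, and values are illustrative only.
sleep <- data.frame(
  Id = c(1, 1, 2),
  date = c("04/12/2016", "04/13/2016", "04/12/2016"),
  TotalMinutesAsleep = c(327, 384, 750)
)
act <- data.frame(
  Id = c(1, 1, 2, 3),
  date = c("04/12/2016", "04/13/2016", "04/13/2016", "04/12/2016"),
  TotalSteps = c(13162, 10735, 5000, 8000)
)

# Default merge(): only Id/date pairs found in both tables survive.
inner <- merge(sleep, act, by = c("Id", "date"))

# all.x = TRUE keeps every sleep row and fills unmatched activity columns with NA.
left <- merge(sleep, act, by = c("Id", "date"), all.x = TRUE)

# Sleep rows that found no matching activity row.
rows_lost <- nrow(sleep) - nrow(inner)
```

For the real tables, comparing nrow(sleep_day) with nrow(merged_data) gives the same kind of check.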

Visualization

ggplot(data = activity, aes(x = TotalSteps, y = Calories)) +
  geom_point() + geom_smooth() + labs(title = "Total Steps Vs. Calories")

## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

There is a positive correlation between Total Steps and Calories, which is expected: the more active we are, the more calories we burn.
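The plot shows the direction of the relationship, and a correlation coefficient can put a number on it. This is a sketch on made-up values, not the real result. With the actual data, the inputs would be activity$TotalSteps and activity$Calories.

```r
# Illustrative values only; not taken from the data set.
steps    <- c(2000, 4000, 6000, 8000, 10000, 12000)
calories <- c(1700, 1850, 2000, 2100, 2350, 2500)

pearson  <- cor(steps, calories)                        # strength of the linear association
spearman <- cor(steps, calories, method = "spearman")   # rank-based, less sensitive to outliers

# Slope of a simple linear fit: extra calories per additional step in the toy data.
fit <- lm(calories ~ steps)
calories_per_step <- unname(coef(fit)["steps"])
```

A coefficient close to 1 would support the "positive correlation" reading of the scatter plot; it would not by itself show that steps cause the calorie burn.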

ggplot(data = sleep_day, aes(x= TotalMinutesAsleep, y = TotalTimeInBed)) + 
  geom_point() + labs(title = " Total Minutes Asleep Vs. Total Time in Bed")

The relationship between Total Minutes Asleep and Total Time in Bed looks linear: more time in bed generally means more time asleep. So if Bellabeat users want to improve their sleep, we should consider using notifications to remind them to go to bed.
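The gap between the two variables is also informative, because it is the time spent in bed but awake. Here is a sketch on toy rows shaped like sleep_day. The derived column names and the 60-minute cutoff are my own assumptions.

```r
# Toy rows in the shape of sleep_day; values are illustrative only.
sleep <- data.frame(
  TotalMinutesAsleep = c(327, 384, 412, 340),
  TotalTimeInBed     = c(346, 407, 442, 425)
)

# Minutes spent in bed but not asleep.
sleep$AwakeInBed <- sleep$TotalTimeInBed - sleep$TotalMinutesAsleep

# Share of time in bed spent asleep.
sleep$Efficiency <- sleep$TotalMinutesAsleep / sleep$TotalTimeInBed

# Flag nights with a large gap; the 60-minute cutoff is an arbitrary assumption.
sleep$LongGap <- sleep$AwakeInBed > 60
```

Nights flagged this way could be the ones where a bedtime reminder is most relevant.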

int_new <- hourly_intensity %>% 
  group_by(time) %>% 
  drop_na() %>% 
  summarize(mean_total_int = mean(TotalIntensity))

ggplot(data = int_new, aes(x = time, y = mean_total_int)) +
  geom_col(fill = 'darkblue') +
  theme(axis.text.x = element_text(angle = 90)) +
  labs(title = "Average Total Intensity Vs. Time")

After visualizing Total Intensity by hour, I found that people are more active between 5 AM and 10 PM.

Most activity happens between 5 PM and 7 PM. I suppose that people go to the gym or for a walk after finishing work. We can use this time window in the Bellabeat app to remind and motivate users to go for a run or a walk.
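The peak hours can also be read from the table rather than from the chart. This is a sketch on made-up hourly records (hourly is a hypothetical stand-in for hourly_intensity), using base R to average by hour and pick the top two.

```r
# Toy hourly records: two days of readings for four hours; values are illustrative only.
hourly <- data.frame(
  time = rep(c("06:00:00", "12:00:00", "18:00:00", "23:00:00"), times = 2),
  TotalIntensity = c(5, 12, 22, 3, 7, 14, 20, 1),
  stringsAsFactors = FALSE
)

# Mean intensity per hour of day.
by_hour <- aggregate(TotalIntensity ~ time, data = hourly, FUN = mean)

# Sort from most to least active and keep the top two hours.
by_hour <- by_hour[order(-by_hour$TotalIntensity), ]
peak_hours <- as.character(head(by_hour$time, 2))
```

Applied to int_new, the same ordering step would list the most active hours directly.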

ggplot(data = merged_data, aes(x = TotalMinutesAsleep, y = SedentaryMinutes)) +
  geom_point(color = 'darkblue') + geom_smooth() +
  labs(title = "Minutes Asleep vs. Sedentary Minutes")

## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

The relationship between Total Minutes Asleep and Sedentary Minutes is not linear. The blue line indicates a curved trend, suggesting that sedentary minutes don’t consistently increase or decrease with sleep duration.

Summarizing recommendations for the business:

Bellabeat collects data about activity, sleep, stress, and reproductive health to help women learn more about their health and habits. Since starting in 2013, Bellabeat has become a popular wellness company for women.

I analyzed FitBit Fitness Tracker data and found some ideas that could help Bellabeat with its marketing:

Who to Target: Women who work full-time and spend most of their day sitting at a computer or in meetings. These women do light activity to stay healthy but could benefit from being more active. They might need tips or motivation to build healthier habits.

Message for Bellabeat’s Campaign: The Bellabeat app isn’t just a fitness tracker. It’s a helpful friend that supports women in balancing work, life, and health. It offers daily tips to keep users educated and motivated.

Ideas for the App:

Encourage More Steps: The average user takes about 6,547 steps daily, fewer than the 8,000 steps a day that the CDC links to better health. Bellabeat could encourage users to reach this goal and explain its benefits.
Help with Weight Loss: The app could share ideas for low-calorie meals.
Improve Sleep: Notifications could remind users when it’s time to sleep.
Motivate During Peak Times: Most activity happens from 5 PM to 7 PM. Bellabeat could send reminders to encourage users to go for a walk or run during this time.
Reduce Sitting Time: The app could suggest ways to reduce long periods of sitting to improve health and sleep.
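To make the step goal measurable, one option is to track how often users actually reach it. The sketch below uses toy daily records (daily and its values are made up) and assumes the 8,000-step benchmark mentioned above.

```r
# Toy daily records; values are illustrative only.
daily <- data.frame(
  Id = c(1, 1, 1, 2, 2, 2),
  TotalSteps = c(9000, 7500, 8200, 3000, 8100, 4000)
)

goal <- 8000  # benchmark taken from the text above

# Share of all recorded days that reach the goal.
share_days_met <- mean(daily$TotalSteps >= goal)

# Share of days reaching the goal, per participant.
share_by_user <- tapply(daily$TotalSteps >= goal, daily$Id, mean)
```

Users with a low share would be natural candidates for step reminders.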

This is my first project using R, and I’d love to hear your feedback or suggestions. Thanks for reading my Bellabeat case study!

Phi

2024-10-07
