Scenario

Bellabeat is a high-tech manufacturer of health-focused products for women. While they are a successful small company, they have the potential to become a major player in the global smart device market. As a data analyst, I have been asked to focus on one of Bellabeat’s products and analyze smart device data to gain insights into how consumers are using their smart devices. The insights I uncover will help guide the company’s marketing strategy.

Questions

What are some trends in smart device usage?
How could these trends apply to Bellabeat customers?
How could these trends help influence Bellabeat’s marketing strategy?

Business Tasks

Conduct an analysis of consumer behavior and usage patterns on Fitbit smart devices to identify the key factors that shape usage trends and to gain insights that can enhance Bellabeat’s marketing strategy.

Prepare

Data used

For this case study, the stakeholder provided us with the datasets. We used the FitBit Fitness Tracker Data as our source of information. The dataset is available on Kaggle and was made accessible through Mobius.

Our Dataset

This dataset was generated by respondents to a survey distributed via Amazon Mechanical Turk between 03/12/2016 and 05/12/2016 (March 12 to May 12, 2016). Thirty eligible Fitbit users consented to submit personal tracker data, including minute-level records of physical activity, heart rate, and sleep monitoring.

Data Credibility and Integrity

There are potential issues with bias and credibility in this dataset. Since it consists of only 30 users and lacks demographic information, there is a risk of sampling bias, making it uncertain whether the sample accurately represents the entire population. Additionally, the dataset is not up-to-date and was collected over a limited period of two months, which may impact its reliability. The small sample size and short data collection period could lead to skewed insights that may not be generalizable to a larger audience. Without a more diverse and comprehensive dataset, drawing meaningful and reliable conclusions becomes challenging.

Process

Tools I use and why? For this project, I use the R programming language because it lets me practice what I learned in the program and perform both data cleaning and visualization efficiently. Before cleaning the data, I install the required packages and import the data files that I downloaded from Kaggle.

install.packages("tidyverse")

## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.4'
## (as 'lib' is unspecified)

install.packages("janitor")

## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.4'
## (as 'lib' is unspecified)

install.packages("dplyr")

## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.4'
## (as 'lib' is unspecified)
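
The install calls above run every time the report is knitted, and `ggpubr` is loaded later without being installed here. A minimal sketch of installing only what is absent follows; the helper names `missing_packages` and `install_if_missing` are hypothetical and not part of the original report.

```r
# Hypothetical helper: return the packages in `required` that are not installed.
# installed.packages() returns a matrix whose row names are package names.
missing_packages <- function(required,
                             installed = rownames(installed.packages())) {
  setdiff(required, installed)
}

# Hypothetical helper: install only the packages that are actually missing.
install_if_missing <- function(required) {
  to_install <- missing_packages(required)
  if (length(to_install) > 0) {
    install.packages(to_install)
  }
  invisible(to_install)
}

# Usage (not run here):
# install_if_missing(c("tidyverse", "janitor", "ggpubr"))
```

Note that `tidyverse` already brings in `dplyr`, `ggplot2`, `tidyr`, and `lubridate`, so a separate install of `dplyr` is optional.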

library("tidyverse")

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.4     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library("janitor")

## 
## Attaching package: 'janitor'
## 
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test

library("lubridate")
library("ggplot2")
library("ggpubr")
library("tidyr")
library("dplyr")

Importing the data (files from Kaggle)

daily_activity <- read.csv("fitabase4-5/dailyActivity_merged.csv")
daily_sleep <- read.csv("fitabase4-5/sleepDay_merged.csv")
hourly_intensities <- read.csv("fitabase4-5/hourlyIntensities_merged.csv")
hourly_calories <- read.csv("fitabase4-5/hourlyCalories_merged.csv")

# Remove fully empty rows and columns from the daily activity data
data_clean_remove <- remove_empty(daily_activity, which = c("rows","cols"), quiet = FALSE)

## No empty rows to remove.

## No empty columns to remove.

# Keep only days with a non-zero logged activity distance
data_activityfiltered <- data_clean_remove %>% filter(LoggedActivitiesDistance != 0)

# Drop exact duplicate rows and rows with missing values
data_activityfiltered <- data_activityfiltered %>%
  distinct() %>%
  drop_na()

# Count any remaining duplicate rows
data_duplicate <- sum(duplicated(data_activityfiltered))
data_duplicate

## [1] 0

head(data_activityfiltered)

##           Id ActivityDate TotalSteps TotalDistance TrackerDistance
## 1 6775888955    4/26/2016       7091          5.27            5.27
## 2 6962181067    4/21/2016      11835          9.71            7.88
## 3 6962181067    4/25/2016      13239          9.27            9.08
## 4 6962181067     5/9/2016      12342          8.72            8.68
## 5 7007744171    4/12/2016      14172         10.29            9.48
## 6 7007744171    4/13/2016      12862          9.65            8.60
##   LoggedActivitiesDistance VeryActiveDistance ModeratelyActiveDistance
## 1                 1.959596               3.48                     0.87
## 2                 4.081692               3.99                     2.10
## 3                 2.785175               3.02                     1.68
## 4                 3.167822               3.90                     1.18
## 5                 4.869783               4.50                     0.38
## 6                 4.851307               4.61                     0.56
##   LightActiveDistance SedentaryActiveDistance VeryActiveMinutes
## 1                0.73                    0.00                42
## 2                3.51                    0.11                53
## 3                4.46                    0.10                35
## 4                3.65                    0.00                43
## 5                5.41                    0.00                53
## 6                4.48                    0.00                56
##   FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes Calories
## 1                  30                   47             1321     2584
## 2                  27                  214              708     2179
## 3                  31                  282              637     2194
## 4                  21                  231              607     2105
## 5                   8                  355             1024     2937
## 6                  22                  261             1101     2742

# Remove fully empty rows and columns from the daily sleep data
data1_clean_remove <- remove_empty(daily_sleep, which = c("rows","cols"), quiet = FALSE)

## No empty rows to remove.

## No empty columns to remove.

# Keep only days with at least one sleep record
data1_sleepfiltered <- data1_clean_remove %>% filter(TotalSleepRecords != 0)

# Drop exact duplicate rows and rows with missing values
data1_sleepfiltered <- data1_sleepfiltered %>%
  distinct() %>%
  drop_na()

# Count any remaining duplicate rows
data1_duplicate <- sum(duplicated(data1_sleepfiltered))
data1_duplicate

## [1] 0

head(data1_sleepfiltered)

##           Id              SleepDay TotalSleepRecords TotalMinutesAsleep
## 1 1503960366 4/12/2016 12:00:00 AM                 1                327
## 2 1503960366 4/13/2016 12:00:00 AM                 2                384
## 3 1503960366 4/15/2016 12:00:00 AM                 1                412
## 4 1503960366 4/16/2016 12:00:00 AM                 2                340
## 5 1503960366 4/17/2016 12:00:00 AM                 1                700
## 6 1503960366 4/19/2016 12:00:00 AM                 1                304
##   TotalTimeInBed
## 1            346
## 2            407
## 3            442
## 4            367
## 5            712
## 6            320

# Remove fully empty rows and columns from the hourly intensity data.
# Column names are left as-is because later steps refer to ActivityHour and TotalIntensity.
data2_clean_remove <- remove_empty(hourly_intensities, which = c("rows","cols"), quiet = FALSE)

## No empty rows to remove.

## No empty columns to remove.

# Keep only hours with a non-zero total intensity
data2_intensitiesfiltered <- data2_clean_remove %>% filter(TotalIntensity != 0)

# Drop exact duplicate rows and rows with missing values
data2_intensitiesfiltered <- data2_intensitiesfiltered %>%
  distinct() %>%
  drop_na()

# Count any remaining duplicate rows
data2_duplicate <- sum(duplicated(data2_intensitiesfiltered))
data2_duplicate

## [1] 0

head(data2_intensitiesfiltered)

##           Id          ActivityHour TotalIntensity AverageIntensity
## 1 1503960366 4/12/2016 12:00:00 AM             20         0.333333
## 2 1503960366  4/12/2016 1:00:00 AM              8         0.133333
## 3 1503960366  4/12/2016 2:00:00 AM              7         0.116667
## 4 1503960366  4/12/2016 8:00:00 AM             13         0.216667
## 5 1503960366  4/12/2016 9:00:00 AM             30         0.500000
## 6 1503960366 4/12/2016 10:00:00 AM             29         0.483333

# Remove fully empty rows and columns from the hourly calories data.
# Column names are left as-is because later steps refer to ActivityHour and Calories.
data3_clean_remove <- remove_empty(hourly_calories, which = c("rows","cols"), quiet = FALSE)

## No empty rows to remove.

## No empty columns to remove.

# Keep only hours with a non-zero calorie value
data3_caloriesfiltered <- data3_clean_remove %>% filter(Calories != 0)

# Drop exact duplicate rows and rows with missing values
data3_caloriesfiltered <- data3_caloriesfiltered %>%
  distinct() %>%
  drop_na()

# Count any remaining duplicate rows
data3_duplicate <- sum(duplicated(data3_caloriesfiltered))
data3_duplicate

## [1] 0

head(data3_caloriesfiltered)

##           Id          ActivityHour Calories
## 1 1503960366 4/12/2016 12:00:00 AM       81
## 2 1503960366  4/12/2016 1:00:00 AM       61
## 3 1503960366  4/12/2016 2:00:00 AM       59
## 4 1503960366  4/12/2016 3:00:00 AM       47
## 5 1503960366  4/12/2016 4:00:00 AM       48
## 6 1503960366  4/12/2016 5:00:00 AM       48
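
The four cleaning blocks above repeat the same filter, de-duplicate, and drop-NA steps. A minimal sketch of wrapping those steps in one function is shown here; `clean_tracker_data` and the toy data frame are hypothetical, and the `remove_empty()` step from janitor is left out of the sketch.

```r
library(dplyr)
library(tidyr)

# Hypothetical helper: drop rows where `value_col` is zero, then remove
# exact duplicate rows and rows that contain any NA.
clean_tracker_data <- function(df, value_col) {
  df %>%
    filter(.data[[value_col]] != 0) %>%
    distinct() %>%
    drop_na()
}

# Toy data frame standing in for one of the Fitbit tables (values are made up).
toy <- data.frame(
  Id = c(1, 1, 2, 2, 3),
  Calories = c(81, 81, 0, 59, NA)
)

cleaned <- clean_tracker_data(toy, "Calories")
cleaned
```

With a helper like this, each table could be cleaned with one call, for example `clean_tracker_data(hourly_calories, "Calories")`.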

daily_activity <- data_activityfiltered %>%
  rename(date = ActivityDate) %>%
  mutate(date = as_date(date,format = "%m/%d/%Y"))
head(daily_activity)

##           Id       date TotalSteps TotalDistance TrackerDistance
## 1 6775888955 2016-04-26       7091          5.27            5.27
## 2 6962181067 2016-04-21      11835          9.71            7.88
## 3 6962181067 2016-04-25      13239          9.27            9.08
## 4 6962181067 2016-05-09      12342          8.72            8.68
## 5 7007744171 2016-04-12      14172         10.29            9.48
## 6 7007744171 2016-04-13      12862          9.65            8.60
##   LoggedActivitiesDistance VeryActiveDistance ModeratelyActiveDistance
## 1                 1.959596               3.48                     0.87
## 2                 4.081692               3.99                     2.10
## 3                 2.785175               3.02                     1.68
## 4                 3.167822               3.90                     1.18
## 5                 4.869783               4.50                     0.38
## 6                 4.851307               4.61                     0.56
##   LightActiveDistance SedentaryActiveDistance VeryActiveMinutes
## 1                0.73                    0.00                42
## 2                3.51                    0.11                53
## 3                4.46                    0.10                35
## 4                3.65                    0.00                43
## 5                5.41                    0.00                53
## 6                4.48                    0.00                56
##   FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes Calories
## 1                  30                   47             1321     2584
## 2                  27                  214              708     2179
## 3                  31                  282              637     2194
## 4                  21                  231              607     2105
## 5                   8                  355             1024     2937
## 6                  22                  261             1101     2742

daily_sleep <- data1_sleepfiltered %>%
  rename(date = SleepDay) %>%
  mutate(date = as_date(date, format = "%m/%d/%Y %I:%M:%S %p"))

head(daily_sleep)

##           Id       date TotalSleepRecords TotalMinutesAsleep TotalTimeInBed
## 1 1503960366 2016-04-12                 1                327            346
## 2 1503960366 2016-04-13                 2                384            407
## 3 1503960366 2016-04-15                 1                412            442
## 4 1503960366 2016-04-16                 2                340            367
## 5 1503960366 2016-04-17                 1                700            712
## 6 1503960366 2016-04-19                 1                304            320

# Format the date and time parts explicitly so that midnight keeps its "00:00:00" time
hourly_intensities <- data2_intensitiesfiltered %>% 
  rename(date_time = ActivityHour) %>% 
  mutate(date_time = as.POSIXct(date_time, format = "%m/%d/%Y %I:%M:%S %p", tz = Sys.timezone())) %>%
  mutate(date = format(date_time, "%Y-%m-%d"),
         time = format(date_time, "%H:%M:%S"),
         .after = date_time) %>%
  select(-date_time)

head(hourly_intensities)

##           Id       date     time TotalIntensity AverageIntensity
## 1 1503960366 2016-04-12 00:00:00             20         0.333333
## 2 1503960366 2016-04-12 01:00:00              8         0.133333
## 3 1503960366 2016-04-12 02:00:00              7         0.116667
## 4 1503960366 2016-04-12 08:00:00             13         0.216667
## 5 1503960366 2016-04-12 09:00:00             30         0.500000
## 6 1503960366 2016-04-12 10:00:00             29         0.483333

# Format the date and time parts explicitly so that midnight keeps its "00:00:00" time
hourly_calories <- data3_caloriesfiltered %>%
  rename(date_time = ActivityHour) %>%
  mutate(date_time = as.POSIXct(date_time, format = "%m/%d/%Y %I:%M:%S %p", tz = Sys.timezone())) %>%
  mutate(date = format(date_time, "%Y-%m-%d"),
         time = format(date_time, "%H:%M:%S"),
         .after = date_time) %>%
  select(-date_time)

head(hourly_calories)

##           Id       date     time Calories
## 1 1503960366 2016-04-12 00:00:00       81
## 2 1503960366 2016-04-12 01:00:00       61
## 3 1503960366 2016-04-12 02:00:00       59
## 4 1503960366 2016-04-12 03:00:00       47
## 5 1503960366 2016-04-12 04:00:00       48
## 6 1503960366 2016-04-12 05:00:00       48
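
Midnight timestamps need care when a date-time is split into a date and a time. R can render a value that falls exactly on midnight as the date alone, so a split on the space character finds no time piece. A minimal base-R sketch of the effect, and of formatting each part explicitly, is shown here; the UTC time zone is an assumption made only to keep the example reproducible.

```r
# One midnight timestamp in the same text format as the Fitbit files.
midnight <- as.POSIXct("4/12/2016 12:00:00 AM",
                       format = "%m/%d/%Y %I:%M:%S %p", tz = "UTC")

# Converting to text drops the time part for a midnight value.
as_text <- as.character(midnight)

# Asking for each part explicitly keeps the time.
date_part <- format(midnight, "%Y-%m-%d")
time_part <- format(midnight, "%H:%M:%S")

as_text     # "2016-04-12"
date_part   # "2016-04-12"
time_part   # "00:00:00"
```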

How many unique participants are there in each dataframe?

# The column is named `Id` (capital I); `$id` does not exist and would be counted as 0.
n_distinct(daily_activity$Id)

n_distinct(daily_sleep$Id)

n_distinct(hourly_intensities$Id)

n_distinct(hourly_calories$Id)

nrow(daily_activity)

## [1] 32

nrow(daily_sleep)

## [1] 410

nrow(hourly_intensities)

## [1] 13002

nrow(hourly_calories)

## [1] 22099
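
R treats `id` and `Id` as different names, which matters for the participant counts above: asking a data frame for a column that does not exist returns `NULL`, and a distinct count over `NULL` is 0. A small base-R sketch with a toy data frame (the three rows are only an illustration):

```r
# Toy data frame using the same column name as the Fitbit tables.
toy <- data.frame(Id = c(1503960366, 1503960366, 6962181067))

wrong <- toy$id   # no column called `id`, so this is NULL
right <- toy$Id   # the actual column

length(unique(wrong))   # 0
length(unique(right))   # 2
```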

daily_activity %>% 
  select(TotalSteps,
         TotalDistance,
         SedentaryMinutes)%>%
  summary()

##    TotalSteps    TotalDistance    SedentaryMinutes
##  Min.   : 6064   Min.   : 4.810   Min.   : 607.0  
##  1st Qu.: 9035   1st Qu.: 7.165   1st Qu.: 722.2  
##  Median :12634   Median : 9.690   Median : 812.0  
##  Mean   :12042   Mean   : 9.147   Mean   : 870.6  
##  3rd Qu.:14178   3rd Qu.:10.623   3rd Qu.:1028.2  
##  Max.   :20067   Max.   :14.300   Max.   :1321.0

daily_sleep %>%  
  select(TotalSleepRecords,
         TotalMinutesAsleep,
         TotalTimeInBed) %>%
  summary()

##  TotalSleepRecords TotalMinutesAsleep TotalTimeInBed 
##  Min.   :1.00      Min.   : 58.0      Min.   : 61.0  
##  1st Qu.:1.00      1st Qu.:361.0      1st Qu.:403.8  
##  Median :1.00      Median :432.5      Median :463.0  
##  Mean   :1.12      Mean   :419.2      Mean   :458.5  
##  3rd Qu.:1.00      3rd Qu.:490.0      3rd Qu.:526.0  
##  Max.   :3.00      Max.   :796.0      Max.   :961.0
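
The sleep summary is easier to read in hours. The short calculation below uses only the two means printed above. The share asleep is a ratio of the two means, not an average of per-night ratios, so treat it as a rough figure.

```r
mean_asleep_min <- 419.2   # mean TotalMinutesAsleep from the summary
mean_in_bed_min <- 458.5   # mean TotalTimeInBed from the summary

hours_asleep     <- mean_asleep_min / 60
awake_in_bed_min <- mean_in_bed_min - mean_asleep_min
share_asleep     <- mean_asleep_min / mean_in_bed_min

round(hours_asleep, 2)       # 6.99 hours asleep on average
round(awake_in_bed_min, 1)   # 39.3 minutes in bed but not asleep
round(share_asleep, 3)       # 0.914 of time in bed spent asleep
```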

hourly_intensities %>%  
  select(TotalIntensity,
         AverageIntensity) %>%
  summary()

##  TotalIntensity   AverageIntensity 
##  Min.   :  1.00   Min.   :0.01667  
##  1st Qu.:  6.00   1st Qu.:0.10000  
##  Median : 13.00   Median :0.21667  
##  Mean   : 20.46   Mean   :0.34093  
##  3rd Qu.: 25.00   3rd Qu.:0.41667  
##  Max.   :180.00   Max.   :3.00000

Analyze

ggplot(data=daily_activity, aes(x=TotalSteps, y=SedentaryMinutes)) + geom_point()

ggplot(data=daily_sleep, aes(x=TotalMinutesAsleep, y=TotalTimeInBed)) + geom_point()
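
A scatter plot shows a relationship visually; a correlation coefficient puts a number on it. The sketch below uses only the six sleep rows printed earlier in this report, so it illustrates the calculation rather than describing the full dataset. On the real data the same call would be `cor(daily_sleep$TotalMinutesAsleep, daily_sleep$TotalTimeInBed)`.

```r
# First six rows of the sleep table as printed in the report.
sleep_head <- data.frame(
  TotalMinutesAsleep = c(327, 384, 412, 340, 700, 304),
  TotalTimeInBed     = c(346, 407, 442, 367, 712, 320)
)

# Pearson correlation between minutes asleep and minutes in bed.
r <- cor(sleep_head$TotalMinutesAsleep, sleep_head$TotalTimeInBed)

# Minutes spent in bed but not asleep, per night.
awake_in_bed <- sleep_head$TotalTimeInBed - sleep_head$TotalMinutesAsleep

r
awake_in_bed
```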

combined_data <- merge(daily_sleep, daily_activity, by="Id", all=TRUE)

n_distinct(combined_data$Id)

## [1] 24

head(combined_data)

##           Id     date.x TotalSleepRecords TotalMinutesAsleep TotalTimeInBed
## 1 1503960366 2016-04-12                 1                327            346
## 2 1503960366 2016-04-13                 2                384            407
## 3 1503960366 2016-04-15                 1                412            442
## 4 1503960366 2016-04-16                 2                340            367
## 5 1503960366 2016-04-17                 1                700            712
## 6 1503960366 2016-04-19                 1                304            320
##   date.y TotalSteps TotalDistance TrackerDistance LoggedActivitiesDistance
## 1   <NA>         NA            NA              NA                       NA
## 2   <NA>         NA            NA              NA                       NA
## 3   <NA>         NA            NA              NA                       NA
## 4   <NA>         NA            NA              NA                       NA
## 5   <NA>         NA            NA              NA                       NA
## 6   <NA>         NA            NA              NA                       NA
##   VeryActiveDistance ModeratelyActiveDistance LightActiveDistance
## 1                 NA                       NA                  NA
## 2                 NA                       NA                  NA
## 3                 NA                       NA                  NA
## 4                 NA                       NA                  NA
## 5                 NA                       NA                  NA
## 6                 NA                       NA                  NA
##   SedentaryActiveDistance VeryActiveMinutes FairlyActiveMinutes
## 1                      NA                NA                  NA
## 2                      NA                NA                  NA
## 3                      NA                NA                  NA
## 4                      NA                NA                  NA
## 5                      NA                NA                  NA
## 6                      NA                NA                  NA
##   LightlyActiveMinutes SedentaryMinutes Calories
## 1                   NA               NA       NA
## 2                   NA               NA       NA
## 3                   NA               NA       NA
## 4                   NA               NA       NA
## 5                   NA               NA       NA
## 6                   NA               NA       NA
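
The `date.x` and `date.y` columns in the output above appear because the merge key is `Id` alone, so every sleep day of a participant is paired with every activity day of that participant. If the goal is to compare sleep and activity on the same day, one option is to merge on both `Id` and `date`. A sketch with toy data frames (all values here are made up for illustration):

```r
# Toy sleep table: two participants, two days each.
sleep <- data.frame(
  Id = c(1, 1, 2, 2),
  date = as.Date(c("2016-04-12", "2016-04-13", "2016-04-12", "2016-04-13")),
  TotalMinutesAsleep = c(327, 384, 400, 410)
)

# Toy activity table: participant 2 has only one day.
activity <- data.frame(
  Id = c(1, 1, 2),
  date = as.Date(c("2016-04-12", "2016-04-13", "2016-04-12")),
  TotalSteps = c(7091, 11835, 13239)
)

# Key on Id only: each sleep day is paired with each activity day.
by_id_only <- merge(sleep, activity, by = "Id", all = TRUE)

# Key on Id and date: one row per participant-day.
by_id_and_date <- merge(sleep, activity, by = c("Id", "date"), all = TRUE)

nrow(by_id_only)        # 6
nrow(by_id_and_date)    # 4
names(by_id_and_date)   # "Id" "date" "TotalMinutesAsleep" "TotalSteps"
```

Using `all = TRUE` keeps days that appear in only one table, with `NA` in the columns from the other table. Setting `all = FALSE` would keep only days present in both.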

ggplot(data=daily_sleep, aes(x=TotalMinutesAsleep, y=TotalTimeInBed)) + 
    geom_ribbon(aes(ymin=0, ymax=TotalTimeInBed), alpha=0.2, fill="blue",  color="blue") +
    geom_point(color="red") + 
    labs(title="Total Time Asleep vs. Total Time in Bed")

Act: Recommendations

Bellabeat designs smart devices that help women track their daily habits and health. This case study analyzes user data to provide insights that can help improve sales and user experience.

However, the dataset is small and lacks demographic details, which may lead to biased results. To improve accuracy, I recommend collecting more diverse data to better understand the target audience and refine marketing strategies.

From my analysis, I identified key trends that can enhance the Bellabeat app’s features and improve marketing efforts to reach the right customers effectively.

Bellabeat Case Study, Capstone Project

Zulikha Shafi

2025-03-05

Thank you for being a part of my R programming case study!