Scenario

Bellabeat is a high-tech manufacturer of health-focused products for women. While they are a successful small company, they have the potential to become a major player in the global smart device market. As a data analyst, I have been asked to focus on one of Bellabeat’s products and analyze smart device data to gain insights into how consumers are using their smart devices. The insights I uncover will help guide the company’s marketing strategy.

  1. Ask

Analyze consumer behavior and usage patterns on Fitbit smart devices to identify the key factors that shape usage trends, and use those insights to strengthen Bellabeat’s marketing strategy.

  2. Prepare

For this case study, the stakeholder provided the datasets. I used the FitBit Fitness Tracker Data as the source of information. The dataset is available on Kaggle, where it was shared by the user Mobius.

This dataset was generated by respondents to a survey distributed via Amazon Mechanical Turk between March 12, 2016 and May 12, 2016. Thirty eligible Fitbit users consented to submit personal tracker data, including minute-level records of physical activity, heart rate, and sleep monitoring.

There are potential issues with bias and credibility in this dataset. Since it consists of only 30 users and lacks demographic information, there is a risk of sampling bias, making it uncertain whether the sample accurately represents the entire population. Additionally, the dataset is not up-to-date and was collected over a limited period of two months, which may impact its reliability. The small sample size and short data collection period could lead to skewed insights that may not be generalizable to a larger audience. Without a more diverse and comprehensive dataset, drawing meaningful and reliable conclusions becomes challenging.

  3. Process
install.packages("tidyverse")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.4'
## (as 'lib' is unspecified)
install.packages("janitor")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.4'
## (as 'lib' is unspecified)
install.packages("dplyr")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.4'
## (as 'lib' is unspecified)
library("tidyverse")
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.4     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library("janitor")
## 
## Attaching package: 'janitor'
## 
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test
library("lubridate")
library("ggplot2")
library("ggpubr")
library("tidyr")
library("dplyr")

daily_activity <- read.csv("fitabase4-5/dailyActivity_merged.csv")
daily_sleep <- read.csv("fitabase4-5/sleepDay_merged.csv")
hourly_intensities <- read.csv("fitabase4-5/hourlyIntensities_merged.csv")
hourly_calories <- read.csv("fitabase4-5/hourlyCalories_merged.csv")
# Remove rows and columns that are entirely empty
data_clean_remove <- remove_empty(daily_activity, which = c("rows","cols"), quiet = FALSE)
## No empty rows to remove.
## No empty columns to remove.

# Keep only days with a non-zero logged activity distance
data_activityfiltered <- data_clean_remove %>% filter(LoggedActivitiesDistance != 0)

# Drop exact duplicate rows and rows with missing values
data_activityfiltered <- data_activityfiltered %>%
  distinct() %>%
  drop_na()
# Count any duplicate rows that remain
data_duplicate <- sum(duplicated(data_activityfiltered))
data_duplicate
## [1] 0
head(data_activityfiltered)
##           Id ActivityDate TotalSteps TotalDistance TrackerDistance
## 1 6775888955    4/26/2016       7091          5.27            5.27
## 2 6962181067    4/21/2016      11835          9.71            7.88
## 3 6962181067    4/25/2016      13239          9.27            9.08
## 4 6962181067     5/9/2016      12342          8.72            8.68
## 5 7007744171    4/12/2016      14172         10.29            9.48
## 6 7007744171    4/13/2016      12862          9.65            8.60
##   LoggedActivitiesDistance VeryActiveDistance ModeratelyActiveDistance
## 1                 1.959596               3.48                     0.87
## 2                 4.081692               3.99                     2.10
## 3                 2.785175               3.02                     1.68
## 4                 3.167822               3.90                     1.18
## 5                 4.869783               4.50                     0.38
## 6                 4.851307               4.61                     0.56
##   LightActiveDistance SedentaryActiveDistance VeryActiveMinutes
## 1                0.73                    0.00                42
## 2                3.51                    0.11                53
## 3                4.46                    0.10                35
## 4                3.65                    0.00                43
## 5                5.41                    0.00                53
## 6                4.48                    0.00                56
##   FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes Calories
## 1                  30                   47             1321     2584
## 2                  27                  214              708     2179
## 3                  31                  282              637     2194
## 4                  21                  231              607     2105
## 5                   8                  355             1024     2937
## 6                  22                  261             1101     2742
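The filter above keeps only rows where `LoggedActivitiesDistance` is non-zero, which leaves the 32 rows reported by `nrow(daily_activity)` later in this section. If the goal is instead to drop days on which the device recorded nothing, filtering on `TotalSteps` may be closer to that intent; this is an assumption about intent, not something the source states. A minimal base-R sketch on made-up data (`toy_activity` and all its values are hypothetical) compares how many rows each filter keeps:

```r
# Hypothetical toy data: three participants, few manually logged activities
toy_activity <- data.frame(
  Id = c(1, 1, 2, 2, 3, 3),
  TotalSteps = c(7000, 0, 12000, 9000, 0, 4000),
  LoggedActivitiesDistance = c(0, 0, 2.5, 0, 0, 0)
)

# Filter used in this case study: days with a logged activity distance
logged_only <- toy_activity[toy_activity$LoggedActivitiesDistance != 0, ]

# Alternative (assumption): days on which any steps were recorded
tracked_days <- toy_activity[toy_activity$TotalSteps != 0, ]

# Compare row counts before and after each filter
c(before = nrow(toy_activity),
  logged_only = nrow(logged_only),
  tracked_days = nrow(tracked_days))
```

Comparing row counts before and after each filter makes it easier to see how much of the sample a cleaning step removes.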
# Remove rows and columns that are entirely empty
data1_clean_remove <- remove_empty(daily_sleep, which = c("rows","cols"), quiet = FALSE)
## No empty rows to remove.
## No empty columns to remove.

# Keep only days with at least one sleep record
data1_sleepfiltered <- data1_clean_remove %>% filter(TotalSleepRecords != 0)

# Drop exact duplicate rows and rows with missing values
data1_sleepfiltered <- data1_sleepfiltered %>%
  distinct() %>%
  drop_na()
# Count any duplicate rows that remain
data1_duplicate <- sum(duplicated(data1_sleepfiltered))
data1_duplicate
## [1] 0
head(data1_sleepfiltered)
##           Id              SleepDay TotalSleepRecords TotalMinutesAsleep
## 1 1503960366 4/12/2016 12:00:00 AM                 1                327
## 2 1503960366 4/13/2016 12:00:00 AM                 2                384
## 3 1503960366 4/15/2016 12:00:00 AM                 1                412
## 4 1503960366 4/16/2016 12:00:00 AM                 2                340
## 5 1503960366 4/17/2016 12:00:00 AM                 1                700
## 6 1503960366 4/19/2016 12:00:00 AM                 1                304
##   TotalTimeInBed
## 1            346
## 2            407
## 3            442
## 4            367
## 5            712
## 6            320
# Remove rows and columns that are entirely empty
data2_clean_remove <- remove_empty(hourly_intensities, which = c("rows","cols"), quiet = FALSE)
## No empty rows to remove.
## No empty columns to remove.

# Keep only hours with a non-zero total intensity
data2_intensitiesfiltered <- data2_clean_remove %>% filter(TotalIntensity != 0)

# Drop exact duplicate rows and rows with missing values
data2_intensitiesfiltered <- data2_intensitiesfiltered %>%
  distinct() %>%
  drop_na()
# Count any duplicate rows that remain
data2_duplicate <- sum(duplicated(data2_intensitiesfiltered))
data2_duplicate
## [1] 0
head(data2_intensitiesfiltered)
##           Id          ActivityHour TotalIntensity AverageIntensity
## 1 1503960366 4/12/2016 12:00:00 AM             20         0.333333
## 2 1503960366  4/12/2016 1:00:00 AM              8         0.133333
## 3 1503960366  4/12/2016 2:00:00 AM              7         0.116667
## 4 1503960366  4/12/2016 8:00:00 AM             13         0.216667
## 5 1503960366  4/12/2016 9:00:00 AM             30         0.500000
## 6 1503960366 4/12/2016 10:00:00 AM             29         0.483333
# Remove rows and columns that are entirely empty
data3_clean_remove <- remove_empty(hourly_calories, which = c("rows","cols"), quiet = FALSE)
## No empty rows to remove.
## No empty columns to remove.

# Keep only hours with a non-zero calorie value
data3_caloriesfiltered <- data3_clean_remove %>% filter(Calories != 0)

# Drop exact duplicate rows and rows with missing values
data3_caloriesfiltered <- data3_caloriesfiltered %>%
  distinct() %>%
  drop_na()
# Count any duplicate rows that remain
data3_duplicate <- sum(duplicated(data3_caloriesfiltered))
data3_duplicate
## [1] 0
head(data3_caloriesfiltered)
##           Id          ActivityHour Calories
## 1 1503960366 4/12/2016 12:00:00 AM       81
## 2 1503960366  4/12/2016 1:00:00 AM       61
## 3 1503960366  4/12/2016 2:00:00 AM       59
## 4 1503960366  4/12/2016 3:00:00 AM       47
## 5 1503960366  4/12/2016 4:00:00 AM       48
## 6 1503960366  4/12/2016 5:00:00 AM       48
daily_activity <- data_activityfiltered %>%
  rename(date = ActivityDate) %>%
  mutate(date = as_date(date,format = "%m/%d/%Y"))
head(daily_activity)
##           Id       date TotalSteps TotalDistance TrackerDistance
## 1 6775888955 2016-04-26       7091          5.27            5.27
## 2 6962181067 2016-04-21      11835          9.71            7.88
## 3 6962181067 2016-04-25      13239          9.27            9.08
## 4 6962181067 2016-05-09      12342          8.72            8.68
## 5 7007744171 2016-04-12      14172         10.29            9.48
## 6 7007744171 2016-04-13      12862          9.65            8.60
##   LoggedActivitiesDistance VeryActiveDistance ModeratelyActiveDistance
## 1                 1.959596               3.48                     0.87
## 2                 4.081692               3.99                     2.10
## 3                 2.785175               3.02                     1.68
## 4                 3.167822               3.90                     1.18
## 5                 4.869783               4.50                     0.38
## 6                 4.851307               4.61                     0.56
##   LightActiveDistance SedentaryActiveDistance VeryActiveMinutes
## 1                0.73                    0.00                42
## 2                3.51                    0.11                53
## 3                4.46                    0.10                35
## 4                3.65                    0.00                43
## 5                5.41                    0.00                53
## 6                4.48                    0.00                56
##   FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes Calories
## 1                  30                   47             1321     2584
## 2                  27                  214              708     2179
## 3                  31                  282              637     2194
## 4                  21                  231              607     2105
## 5                   8                  355             1024     2937
## 6                  22                  261             1101     2742
daily_sleep <- data1_sleepfiltered %>%
  rename(date = SleepDay) %>%
  mutate(date = as_date(date, format = "%m/%d/%Y %I:%M:%S %p"))
head(daily_sleep)
##           Id       date TotalSleepRecords TotalMinutesAsleep TotalTimeInBed
## 1 1503960366 2016-04-12                 1                327            346
## 2 1503960366 2016-04-13                 2                384            407
## 3 1503960366 2016-04-15                 1                412            442
## 4 1503960366 2016-04-16                 2                340            367
## 5 1503960366 2016-04-17                 1                700            712
## 6 1503960366 2016-04-19                 1                304            320
hourly_intensities <- data2_intensitiesfiltered %>% 
  rename(date_time = ActivityHour) %>% 
  mutate(date_time = as.POSIXct(date_time, format = "%m/%d/%Y %I:%M:%S %p", tz = Sys.timezone()))
# Build date and time with format() so that midnight is kept as "00:00:00"
hourly_intensities <- hourly_intensities %>%
  mutate(date = format(date_time, "%Y-%m-%d"),
         time = format(date_time, "%H:%M:%S"),
         .after = Id) %>%
  select(-date_time)
head(hourly_intensities)
##           Id       date     time TotalIntensity AverageIntensity
## 1 1503960366 2016-04-12 00:00:00             20         0.333333
## 2 1503960366 2016-04-12 01:00:00              8         0.133333
## 3 1503960366 2016-04-12 02:00:00              7         0.116667
## 4 1503960366 2016-04-12 08:00:00             13         0.216667
## 5 1503960366 2016-04-12 09:00:00             30         0.500000
## 6 1503960366 2016-04-12 10:00:00             29         0.483333
hourly_calories <- data3_caloriesfiltered %>%
  rename(date_time = ActivityHour) %>%
  mutate(date_time = as.POSIXct(date_time, format = "%m/%d/%Y %I:%M:%S %p", tz = Sys.timezone()))
# Build date and time with format() so that midnight is kept as "00:00:00"
hourly_calories <- hourly_calories %>%
  mutate(date = format(date_time, "%Y-%m-%d"),
         time = format(date_time, "%H:%M:%S"),
         .after = Id) %>%
  select(-date_time)
head(hourly_calories)
##           Id       date     time Calories
## 1 1503960366 2016-04-12 00:00:00       81
## 2 1503960366 2016-04-12 01:00:00       61
## 3 1503960366 2016-04-12 02:00:00       59
## 4 1503960366 2016-04-12 03:00:00       47
## 5 1503960366 2016-04-12 04:00:00       48
## 6 1503960366 2016-04-12 05:00:00       48

How many unique participants are there in each data frame?

# The identifier column is named `Id` (capital I), as in the head() output above
n_distinct(daily_activity$Id)
n_distinct(daily_sleep$Id)
n_distinct(hourly_intensities$Id)
n_distinct(hourly_calories$Id)
nrow(daily_activity)
## [1] 32
nrow(daily_sleep)
## [1] 410
nrow(hourly_intensities)
## [1] 13002
nrow(hourly_calories)
## [1] 22099
daily_activity %>% 
  select(TotalSteps,
         TotalDistance,
         SedentaryMinutes)%>%
  summary()
##    TotalSteps    TotalDistance    SedentaryMinutes
##  Min.   : 6064   Min.   : 4.810   Min.   : 607.0  
##  1st Qu.: 9035   1st Qu.: 7.165   1st Qu.: 722.2  
##  Median :12634   Median : 9.690   Median : 812.0  
##  Mean   :12042   Mean   : 9.147   Mean   : 870.6  
##  3rd Qu.:14178   3rd Qu.:10.623   3rd Qu.:1028.2  
##  Max.   :20067   Max.   :14.300   Max.   :1321.0
daily_sleep %>%  
  select(TotalSleepRecords,
         TotalMinutesAsleep,
         TotalTimeInBed) %>%
  summary()
##  TotalSleepRecords TotalMinutesAsleep TotalTimeInBed 
##  Min.   :1.00      Min.   : 58.0      Min.   : 61.0  
##  1st Qu.:1.00      1st Qu.:361.0      1st Qu.:403.8  
##  Median :1.00      Median :432.5      Median :463.0  
##  Mean   :1.12      Mean   :419.2      Mean   :458.5  
##  3rd Qu.:1.00      3rd Qu.:490.0      3rd Qu.:526.0  
##  Max.   :3.00      Max.   :796.0      Max.   :961.0
hourly_intensities %>%  
  select(TotalIntensity,
         AverageIntensity) %>%
  summary()
##  TotalIntensity   AverageIntensity 
##  Min.   :  1.00   Min.   :0.01667  
##  1st Qu.:  6.00   1st Qu.:0.10000  
##  Median : 13.00   Median :0.21667  
##  Mean   : 20.46   Mean   :0.34093  
##  3rd Qu.: 25.00   3rd Qu.:0.41667  
##  Max.   :180.00   Max.   :3.00000
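In the six rows printed by `head(data2_intensitiesfiltered)` earlier, `AverageIntensity` appears to be `TotalIntensity` divided by 60, the number of minutes in an hour. This is inferred from the displayed values rather than stated in the source, so the sketch below only checks those six rows:

```r
# First six rows of the hourly intensity data, copied from the output above
total_intensity   <- c(20, 8, 7, 13, 30, 29)
average_intensity <- c(0.333333, 0.133333, 0.116667, 0.216667, 0.500000, 0.483333)

# Assumption being checked: average = total / 60 minutes
implied_average <- total_intensity / 60

# Allow for the six-decimal rounding in the printed values
matches <- abs(implied_average - average_intensity) < 1e-6
all(matches)
```

The summary above is consistent with the same relationship: the maximum `TotalIntensity` of 180 corresponds to the maximum `AverageIntensity` of 3.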
  4. Analyze
ggplot(data=daily_activity, aes(x=TotalSteps, y=SedentaryMinutes)) + geom_point()

ggplot(data=daily_sleep, aes(x=TotalMinutesAsleep, y=TotalTimeInBed)) + geom_point()
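The gap between time in bed and time asleep can also be expressed numerically. The following is a minimal sketch using the six rows shown by `head(daily_sleep)` above; the column names `minutes_awake_in_bed` and `sleep_efficiency` are hypothetical and not part of the dataset.

```r
# Six rows copied from the head(daily_sleep) output above
sleep_head <- data.frame(
  TotalMinutesAsleep = c(327, 384, 412, 340, 700, 304),
  TotalTimeInBed     = c(346, 407, 442, 367, 712, 320)
)

# Minutes spent in bed but not asleep (hypothetical column name)
sleep_head$minutes_awake_in_bed <-
  sleep_head$TotalTimeInBed - sleep_head$TotalMinutesAsleep

# Share of time in bed spent asleep (hypothetical column name)
sleep_head$sleep_efficiency <-
  round(sleep_head$TotalMinutesAsleep / sleep_head$TotalTimeInBed, 3)

sleep_head
```

For the first row this gives 19 minutes awake in bed and an efficiency of about 0.945. The same two columns could be added to the full `daily_sleep` data frame with `mutate()`.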

combined_data <- merge(daily_sleep, daily_activity, by="Id", all=TRUE)
n_distinct(combined_data$Id)
## [1] 24
head(combined_data)
##           Id     date.x TotalSleepRecords TotalMinutesAsleep TotalTimeInBed
## 1 1503960366 2016-04-12                 1                327            346
## 2 1503960366 2016-04-13                 2                384            407
## 3 1503960366 2016-04-15                 1                412            442
## 4 1503960366 2016-04-16                 2                340            367
## 5 1503960366 2016-04-17                 1                700            712
## 6 1503960366 2016-04-19                 1                304            320
##   date.y TotalSteps TotalDistance TrackerDistance LoggedActivitiesDistance
## 1   <NA>         NA            NA              NA                       NA
## 2   <NA>         NA            NA              NA                       NA
## 3   <NA>         NA            NA              NA                       NA
## 4   <NA>         NA            NA              NA                       NA
## 5   <NA>         NA            NA              NA                       NA
## 6   <NA>         NA            NA              NA                       NA
##   VeryActiveDistance ModeratelyActiveDistance LightActiveDistance
## 1                 NA                       NA                  NA
## 2                 NA                       NA                  NA
## 3                 NA                       NA                  NA
## 4                 NA                       NA                  NA
## 5                 NA                       NA                  NA
## 6                 NA                       NA                  NA
##   SedentaryActiveDistance VeryActiveMinutes FairlyActiveMinutes
## 1                      NA                NA                  NA
## 2                      NA                NA                  NA
## 3                      NA                NA                  NA
## 4                      NA                NA                  NA
## 5                      NA                NA                  NA
## 6                      NA                NA                  NA
##   LightlyActiveMinutes SedentaryMinutes Calories
## 1                   NA               NA       NA
## 2                   NA               NA       NA
## 3                   NA               NA       NA
## 4                   NA               NA       NA
## 5                   NA               NA       NA
## 6                   NA               NA       NA
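Because the merge above joins on `Id` only, each sleep row of a participant is paired with every activity row of the same participant, and the two date columns are kept apart as `date.x` and `date.y`. A minimal sketch of the difference when joining on both `Id` and `date`, using small made-up data frames (`toy_sleep`, `toy_activity`, and their values are hypothetical):

```r
# Hypothetical toy data: participant 1 appears in both tables on two days
toy_sleep <- data.frame(
  Id = c(1, 1, 2),
  date = as.Date(c("2016-04-12", "2016-04-13", "2016-04-12")),
  TotalMinutesAsleep = c(327, 384, 412)
)
toy_activity <- data.frame(
  Id = c(1, 1, 3),
  date = as.Date(c("2016-04-12", "2016-04-13", "2016-04-12")),
  TotalSteps = c(7091, 11835, 13239)
)

# Join on Id only: every sleep day is paired with every activity day per Id
by_id_only <- merge(toy_sleep, toy_activity, by = "Id", all = TRUE)

# Join on Id and date: one row per participant-day
by_id_and_date <- merge(toy_sleep, toy_activity, by = c("Id", "date"), all = TRUE)

nrow(by_id_only)
nrow(by_id_and_date)
names(by_id_and_date)
```

On this toy data the `Id`-only join returns 6 rows and the `Id`-and-`date` join returns 4, with a single `date` column. Whether a full join (`all = TRUE`) or an inner join suits the analysis depends on whether days present in only one table should be kept.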
ggplot(data=daily_sleep, aes(x=TotalMinutesAsleep, y=TotalTimeInBed)) + 
    geom_ribbon(aes(ymin=0, ymax=TotalTimeInBed), alpha=0.2, fill="blue",  color="blue") +
    geom_point(color="red") + 
    labs(title="Total Time Asleep vs. Total Time in Bed")

  6. Act: Recommendations

Bellabeat designs smart devices that help women track their daily habits and health. This case study analyzes user data to provide insights that can help improve sales and user experience.

However, the dataset is small and lacks demographic details, which may lead to biased results. To improve accuracy, I recommend collecting more diverse data to better understand the target audience and refine marketing strategies.

From my analysis, I identified key trends that can enhance the Bellabeat app’s features and improve marketing efforts to reach the right customers effectively.

Thank you for being a part of my R programming case study!