The following case study is the final capstone project to finish out the google Data Analytics Certificate.

The scenario

I am a junior data analyst working on the marketing analyst team at Bellabeat, a high-tech manufacturer of health-focused products for women. I have been asked to focus on one of Bellabeat’s products and analyze smart device data to gain insight into how consumers are using their smart devices. The insights I discover will then help guide marketing strategy for the company. I will present my analysis to the Bellabeat executive team along with my high-level recommendations for Bellabeat’s marketing strategy.

Characters and products

Characters:

Products:

ASK Phase

Business Task: Identify trends on how consumers use non-Bellabeat smart devices to provide insights for Bellabeat’s marketing strategy, focusing on a single Bellabeat product.

Key Stakeholders: Urška Sršen - Bellabeat’s cofounder and Chief Creative Officer Sando Mur - Bellabeat’s cofounder; key member of the Bellabeat executive team Bellabeat marketing analytics team

PREPARE Phase

Dataset used: The dataset used is FitBit Fitness Tracker Data provided by Mobius on Kaggle. Thirty eligible Fitbit users consented to the submission of personal tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring. It includes information about daily activity, steps, and heart rate that can be used to explore users’ habits

The dataset has 18 CSV documents available for analysis. They are all long form and each has unique quantitative data provided by FitBit. Each user has a unique ID.

This is a limited dataset (30 users) with no demographic information so there is possibility to encounter sampling bias as we can not be sure if the sample is representative of a population. The data also only covers a two month period in 2016 so it is not current.

PROCESS Phase

My analysis will be done in R for ease of sharing and viewing the data and data visuals.

Setting up my environment: I chose to install and load multiple R packages that will help in my analysis and build my presentation.

#Install and load packages 
install.packages("tidyverse")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.4'
## (as 'lib' is unspecified)
install.packages("dplyr")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.4'
## (as 'lib' is unspecified)
install.packages("skimr")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.4'
## (as 'lib' is unspecified)
install.packages("readr")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.4'
## (as 'lib' is unspecified)
install.packages("here")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.4'
## (as 'lib' is unspecified)
install.packages("janitor")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.4'
## (as 'lib' is unspecified)
install.packages("readr")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.4'
## (as 'lib' is unspecified)
install.packages("tidyr")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.4'
## (as 'lib' is unspecified)
#Load Libraries
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.4     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
library(ggplot2)
library(lubridate)
library(janitor)
## 
## Attaching package: 'janitor'
## 
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test
library(skimr)
library(here)
## here() starts at /cloud/project
library(readr)
library(scales)
## 
## Attaching package: 'scales'
## 
## The following object is masked from 'package:purrr':
## 
##     discard
## 
## The following object is masked from 'package:readr':
## 
##     col_factor

I loaded each CSV and checked for errors and duplicates in the data. Using skimr I can see brief overviews of what each dataset contains.

#load each file
Activity <- read_csv("mturkfitbit_export_4.12.16-5.12.16/Fitabase Data 4.12.16-5.12.16/dailyActivity_merged.csv")
## Rows: 940 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (1): ActivityDate
## dbl (14): Id, TotalSteps, TotalDistance, TrackerDistance, LoggedActivitiesDi...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Calories <- read_csv("mturkfitbit_export_4.12.16-5.12.16/Fitabase Data 4.12.16-5.12.16/dailyCalories_merged.csv")
## Rows: 940 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): ActivityDay
## dbl (2): Id, Calories
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
daily_steps <- read_csv("mturkfitbit_export_4.12.16-5.12.16/Fitabase Data 4.12.16-5.12.16/dailySteps_merged.csv")
## Rows: 940 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): ActivityDay
## dbl (2): Id, StepTotal
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Intensities <- read.csv("mturkfitbit_export_4.12.16-5.12.16/Fitabase Data 4.12.16-5.12.16/dailyIntensities_merged.csv")
hourly_steps <- read_csv("mturkfitbit_export_4.12.16-5.12.16/Fitabase Data 4.12.16-5.12.16/hourlySteps_merged.csv")
## Rows: 22099 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): ActivityHour
## dbl (2): Id, StepTotal
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Sleep <- read_csv("mturkfitbit_export_4.12.16-5.12.16/Fitabase Data 4.12.16-5.12.16/sleepDay_merged.csv")
## Rows: 413 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): SleepDay
## dbl (4): Id, TotalSleepRecords, TotalMinutesAsleep, TotalTimeInBed
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
weight <- read.csv("mturkfitbit_export_4.12.16-5.12.16/Fitabase Data 4.12.16-5.12.16/weightLogInfo_merged.csv")

#example of one skimr and view
skimr::skim_without_charts(weight)
Data summary
Name weight
Number of rows 67
Number of columns 8
_______________________
Column type frequency:
character 2
numeric 6
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
Date 0 1 19 21 0 56 0
IsManualReport 0 1 4 5 0 2 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100
Id 0 1.00 7.009282e+09 1.950322e+09 1.503960e+09 6.962181e+09 6.962181e+09 8.877689e+09 8.877689e+09
WeightKg 0 1.00 7.204000e+01 1.392000e+01 5.260000e+01 6.140000e+01 6.250000e+01 8.505000e+01 1.335000e+02
WeightPounds 0 1.00 1.588100e+02 3.070000e+01 1.159600e+02 1.353600e+02 1.377900e+02 1.875000e+02 2.943200e+02
Fat 65 0.03 2.350000e+01 2.120000e+00 2.200000e+01 2.275000e+01 2.350000e+01 2.425000e+01 2.500000e+01
BMI 0 1.00 2.519000e+01 3.070000e+00 2.145000e+01 2.396000e+01 2.439000e+01 2.556000e+01 4.754000e+01
LogId 0 1.00 1.461772e+12 7.829948e+08 1.460444e+12 1.461079e+12 1.461802e+12 1.462375e+12 1.463098e+12

In viewing each CSV, I was able to conclude that the user ID was the unique value shared amongst each dataset. I then checked if each dataset had data for all the same users.

#load distinct number of users for each dataset
n_distinct(Activity$Id)
## [1] 33
n_distinct(Calories$Id)
## [1] 33
n_distinct(daily_steps$Id)
## [1] 33
n_distinct(Intensities$Id)
## [1] 33
n_distinct(hourly_steps$Id)
## [1] 33
n_distinct(Sleep$Id)
## [1] 24
n_distinct(weight$Id)
## [1] 8

I found that weightLogInfo_merged and sleepDay_merged were missing inputs from some users, 8 and 24 respectively. I chose to not work with the weight dataset as the sample size was too small.

#remove weight from my loaded datasets
rm(weight)

Now to check if there are any duplicates within my data.

sum(duplicated(Activity))
## [1] 0
sum(duplicated(Calories))
## [1] 0
sum(duplicated(daily_steps))
## [1] 0
sum(duplicated(hourly_steps))
## [1] 0
sum(duplicated(Intensities))
## [1] 0
sum(duplicated(Sleep))
## [1] 3

Sleep returned 3 duplicate lines so I used distinct to remove them

Sleep <- Sleep %>% distinct()

For accuracy, I cleaned the remaining data by dropping missing values from the tables

Activity <- Activity %>% drop_na()
Sleep <- Sleep %>% drop_na()
daily_steps <- daily_steps %>% drop_na()
Calories <- Calories %>% drop_na()
Intensities <- Intensities %>% drop_na()
hourly_steps <- hourly_steps %>% drop_na()

And finally for consistency, I cleaned the columns names so they would match when analyzing.

Activity <- clean_names(Activity)
Calories <- clean_names(Calories)
Intensities <- clean_names(Intensities)
daily_steps <- clean_names(daily_steps)
hourly_steps <- clean_names(hourly_steps)
Sleep <- clean_names(Sleep)

ANALYZE and SHARE Phase

I decided to start by analyzing the Sleep data to see how much time in spent in bed after waking up

#first looked at a summary of the table
Sleep %>% select(total_minutes_asleep,total_time_in_bed) %>% summary()
##  total_minutes_asleep total_time_in_bed
##  Min.   : 58.0        Min.   : 61.0    
##  1st Qu.:361.0        1st Qu.:403.8    
##  Median :432.5        Median :463.0    
##  Mean   :419.2        Mean   :458.5    
##  3rd Qu.:490.0        3rd Qu.:526.0    
##  Max.   :796.0        Max.   :961.0
#created a new column of time awake to determine the time in bed after waking up
Sleep$time_awake <- (Sleep$total_time_in_bed - Sleep$total_minutes_asleep)
#checked the summary to see overall data
Sleep %>% select(total_minutes_asleep,total_time_in_bed,time_awake) %>% summary()
##  total_minutes_asleep total_time_in_bed   time_awake    
##  Min.   : 58.0        Min.   : 61.0     Min.   :  0.00  
##  1st Qu.:361.0        1st Qu.:403.8     1st Qu.: 17.00  
##  Median :432.5        Median :463.0     Median : 25.50  
##  Mean   :419.2        Mean   :458.5     Mean   : 39.31  
##  3rd Qu.:490.0        3rd Qu.:526.0     3rd Qu.: 40.00  
##  Max.   :796.0        Max.   :961.0     Max.   :371.00

Now looking at Activity data, there was still some cleaning to do.

#summary of Activity
summary(Activity)
##        id            activity_date       total_steps    total_distance  
##  Min.   :1.504e+09   Length:940         Min.   :    0   Min.   : 0.000  
##  1st Qu.:2.320e+09   Class :character   1st Qu.: 3790   1st Qu.: 2.620  
##  Median :4.445e+09   Mode  :character   Median : 7406   Median : 5.245  
##  Mean   :4.855e+09                      Mean   : 7638   Mean   : 5.490  
##  3rd Qu.:6.962e+09                      3rd Qu.:10727   3rd Qu.: 7.713  
##  Max.   :8.878e+09                      Max.   :36019   Max.   :28.030  
##  tracker_distance logged_activities_distance very_active_distance
##  Min.   : 0.000   Min.   :0.0000             Min.   : 0.000      
##  1st Qu.: 2.620   1st Qu.:0.0000             1st Qu.: 0.000      
##  Median : 5.245   Median :0.0000             Median : 0.210      
##  Mean   : 5.475   Mean   :0.1082             Mean   : 1.503      
##  3rd Qu.: 7.710   3rd Qu.:0.0000             3rd Qu.: 2.053      
##  Max.   :28.030   Max.   :4.9421             Max.   :21.920      
##  moderately_active_distance light_active_distance sedentary_active_distance
##  Min.   :0.0000             Min.   : 0.000        Min.   :0.000000         
##  1st Qu.:0.0000             1st Qu.: 1.945        1st Qu.:0.000000         
##  Median :0.2400             Median : 3.365        Median :0.000000         
##  Mean   :0.5675             Mean   : 3.341        Mean   :0.001606         
##  3rd Qu.:0.8000             3rd Qu.: 4.782        3rd Qu.:0.000000         
##  Max.   :6.4800             Max.   :10.710        Max.   :0.110000         
##  very_active_minutes fairly_active_minutes lightly_active_minutes
##  Min.   :  0.00      Min.   :  0.00        Min.   :  0.0         
##  1st Qu.:  0.00      1st Qu.:  0.00        1st Qu.:127.0         
##  Median :  4.00      Median :  6.00        Median :199.0         
##  Mean   : 21.16      Mean   : 13.56        Mean   :192.8         
##  3rd Qu.: 32.00      3rd Qu.: 19.00        3rd Qu.:264.0         
##  Max.   :210.00      Max.   :143.00        Max.   :518.0         
##  sedentary_minutes    calories   
##  Min.   :   0.0    Min.   :   0  
##  1st Qu.: 729.8    1st Qu.:1828  
##  Median :1057.5    Median :2134  
##  Mean   : 991.2    Mean   :2304  
##  3rd Qu.:1229.5    3rd Qu.:2793  
##  Max.   :1440.0    Max.   :4900
#Viewing each piece of data for the first 6 lines to determine which data to zero in on
head(as.data.frame(Activity))
##           id activity_date total_steps total_distance tracker_distance
## 1 1503960366     4/12/2016       13162           8.50             8.50
## 2 1503960366     4/13/2016       10735           6.97             6.97
## 3 1503960366     4/14/2016       10460           6.74             6.74
## 4 1503960366     4/15/2016        9762           6.28             6.28
## 5 1503960366     4/16/2016       12669           8.16             8.16
## 6 1503960366     4/17/2016        9705           6.48             6.48
##   logged_activities_distance very_active_distance moderately_active_distance
## 1                          0                 1.88                       0.55
## 2                          0                 1.57                       0.69
## 3                          0                 2.44                       0.40
## 4                          0                 2.14                       1.26
## 5                          0                 2.71                       0.41
## 6                          0                 3.19                       0.78
##   light_active_distance sedentary_active_distance very_active_minutes
## 1                  6.06                         0                  25
## 2                  4.71                         0                  21
## 3                  3.91                         0                  30
## 4                  2.83                         0                  29
## 5                  5.04                         0                  36
## 6                  2.51                         0                  38
##   fairly_active_minutes lightly_active_minutes sedentary_minutes calories
## 1                    13                    328               728     1985
## 2                    19                    217               776     1797
## 3                    11                    181              1218     1776
## 4                    34                    209               726     1745
## 5                    10                    221               773     1863
## 6                    20                    164               539     1728

The columns logged_activities_distance and sedentary_active_distance do not provide information that will impact the analysis so they will get removed. Also, the column activity_date is renamed as date.

#removing columns
Activity <- Activity[-c(6,10)]
#adding a day of the week column and total active minutes
Activity$activity_date <- mdy(Activity$activity_date)
Activity$dow <- weekdays.Date(Activity$activity_date)
Activity$total_active_minutes <- (Activity$lightly_active_minutes+Activity$fairly_active_minutes+Activity$very_active_minutes)

Now to load some summaries of Activity.

Activity %>% select(total_steps,total_distance,calories) %>% summary()
##   total_steps    total_distance      calories   
##  Min.   :    0   Min.   : 0.000   Min.   :   0  
##  1st Qu.: 3790   1st Qu.: 2.620   1st Qu.:1828  
##  Median : 7406   Median : 5.245   Median :2134  
##  Mean   : 7638   Mean   : 5.490   Mean   :2304  
##  3rd Qu.:10727   3rd Qu.: 7.713   3rd Qu.:2793  
##  Max.   :36019   Max.   :28.030   Max.   :4900
Activity %>% select(sedentary_minutes,lightly_active_minutes,fairly_active_minutes,very_active_minutes) %>% summary()
##  sedentary_minutes lightly_active_minutes fairly_active_minutes
##  Min.   :   0.0    Min.   :  0.0          Min.   :  0.00       
##  1st Qu.: 729.8    1st Qu.:127.0          1st Qu.:  0.00       
##  Median :1057.5    Median :199.0          Median :  6.00       
##  Mean   : 991.2    Mean   :192.8          Mean   : 13.56       
##  3rd Qu.:1229.5    3rd Qu.:264.0          3rd Qu.: 19.00       
##  Max.   :1440.0    Max.   :518.0          Max.   :143.00       
##  very_active_minutes
##  Min.   :  0.00     
##  1st Qu.:  0.00     
##  Median :  4.00     
##  Mean   : 21.16     
##  3rd Qu.: 32.00     
##  Max.   :210.00

Wanting to then look at overal activity per weekday, I created a new dataset with the average total active minutes by weekday.

#creating the dataset and plotting 
avg_activity <- Activity %>% group_by(dow) %>% summarise(active_minutes=mean(total_active_minutes,na.rm = TRUE))
ggplot(data = avg_activity,aes(x=factor(dow,levels=c("Sunday","Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday")),y=active_minutes))+geom_col()+labs(x="Day of the Week",y="Avg Total Active Minutes")

Users are less active on Sundays by active minutes.

Curious about total steps rather than minutes per weekday, I made a new dataset and plotted that.

#new dataset to see average steps taken per day
avg_step <- Activity %>% group_by(dow) %>% summarise(avg_steps = mean(total_steps,na.rm = TRUE))
#viz of the dataset
ggplot(data=avg_step,aes(x=factor(dow,levels=c("Sunday","Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday")),y=avg_steps))+geom_col()+labs(x="Day of the Week",y="Avg Total Steps")

Again we see that Sundays are least active as they have the least amount of steps total on average.

Having an idea of how active users are, I decided to plot active minutes and calories to see what the correlation is. I chose to continue to look at Fairly and Very active minutes together.

#create a data set focusing on Fairly and Very active minutes summed together
fairly_very <- Activity %>% group_by(id) %>% summarise(active_minutes=sum(fairly_active_minutes,very_active_minutes),calories=sum(calories))

#plot fairly/very active minutes and calories
ggplot(data=fairly_very,aes(x=active_minutes,y=calories))+geom_point()+geom_smooth()
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

#plot lightly active minutes and calories
ggplot(data=Activity,aes(x=lightly_active_minutes,y=calories))+geom_point()+geom_smooth()
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

#plot lightly active minutes and calories
ggplot(data=Activity,aes(x=sedentary_minutes,y=calories))+geom_point()+geom_smooth()
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

Now to look at the trends between Activity and Sleep. First start with merging the datasets together.

#create a new date column without time and match name"activity_date"
Sleep$activity_date <- str_split(Sleep$sleep_day," ",simplify = T)[,1]
Sleep$activity_date <- mdy(Sleep$activity_date)

#new dataset to compare sleep time and calories 
New <- Activity %>% merge(Sleep,by=c("id","activity_date"))

I want to see what the trend is for total_minutes_asleep and calories, similar to the previous plots.

#Plot time asleep vs calories
ggplot(data = New,aes(x=total_minutes_asleep,y=calories))+geom_point()+geom_smooth()
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

However the data does seem to be very scattered so to see if this assumption is accurate, I plotted total_minutes_asleep and total_steps

#Plot time asleep vs totalsteps
ggplot(data = New,aes(x=total_minutes_asleep,y=total_steps))+geom_point()+geom_smooth()
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

We see a similar plot and trend to the previous one. It shows that users getting the recommended amount of sleep (at least 7 hours according to the CDC)

#Plot total active minutes vs time asleep
ggplot(data = New, aes(x=total_minutes_asleep,y=lightly_active_minutes+fairly_active_minutes+very_active_minutes))+geom_point()+geom_smooth()+labs(x="Total Minutes Asleep",y="Active Minutes per Day")
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

Further solidifies the idea that users getting the recommend amount of sleep tend to be more active and burn more calories. These users are the highest target to improve their daily active minutes.

Discoveries and conclusions:

ACT Phase

With these insights, we hope to solve the stated business task of analyzing how customers use devices to provide insights for Bellabeat’s marketing strategy. The aim of Bellabeat is to empower women with knowledge about their own health and habits. The focus can be around healthy sleeping habits for women who want to be more active.

Recommendations focusing on utilizing the Bellabeat App: