Bellabeat is a high-tech manufacturer of health-focused products for women. My role in the company is as Junior Data Analyst and I am working with the marketing analyst team. Urška Sršen, co founder and Chief Creative Officer (CEO) of Bellabeat, believes that analyzing smart device fitness data could help unlock new growth opportunities for the company. I have been asked to focus on one of Bellabeat’s products and analyze smart device data to gain insight into how consumers are using their smart devices.
Collecting data on activity, sleep, stress, and reproductive health has allowed Bellabeat to empower women with knowledge about their own health and habits. Since it was founded in 2013, Bellabeat has grown rapidly and quickly positioned itself as a tech-driven wellness company for women.
Analyze smart device data to gain insight into how consumers are using their smart devices. I am also required to present the analysis to the Bellabeat executive team along with high-level recommendations for Bellabeat’s marketing strategy.
This is the capstone project for the Google Data Analytics Certification. Bellabeat data analysis case study includes all the data analysis process :
Ask Phase include brief description of the business task which is analyzing the smart devices data and obtaining the insights on how consumers use Bellabeat smart devices. Ask phase is comprised of these main goals :
Also, it is important to whom we are presenting the solutions to. Mainly our stakeholders are :
Prepare Phase includes the organization of the data from all the data sources. Sršen suggested me to use the public data set that explores smart devices users’ daily habits. Here, I’m using a specific data set “FitBit Fitness Tracker Data” (https://www.kaggle.com/datasets/arashnic/fitbit) made available through Mobius.
This Kaggle data set contains personal fitness tracker from thirty fitbit users. Thirty eligible Fitbit users consented to the submission of personal tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring. It includes information about daily activity, steps, and heart rate that can be used to explore users’ habits. Data is stored in 18 CSV files.
Data Credibility & Limitations:
Overall, the data source is evaluated as bad data, but it is not relevant at the moment since this is for the capstone project.
Process Phase includes processing the data by cleaning and ensuring that it is correct,relevant,complete and error free.
Tools Used
I am using RStudio for verifying data integrity, cleaning, transformation, analysis and visualization.
Firstly, need to install and read the packages we need for analysis.
I am using the “sqldf” package, which will allow us to emulate SQL syntax when looking at data.
install.packages("sqldf")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
Installing other packages:
install.packages("tidyverse")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
install.packages("lubridate")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
install.packages("ggplot2")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
install.packages("dplyr")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
install.packages("skimr")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
install.packages("janitor")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
install.packages("plotrix")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
Loading the packages:
library(sqldf)
## Loading required package: gsubfn
## Loading required package: proto
## Warning in fun(libname, pkgname): couldn't connect to display ":0"
## Loading required package: RSQLite
library(tidyverse)
## ── Attaching packages
## ───────────────────────────────────────
## tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.4
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.1 ✔ stringr 1.4.1
## ✔ readr 2.1.2 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(lubridate)
##
## Attaching package: 'lubridate'
##
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(ggplot2)
library(dplyr)
library(skimr)
library(janitor)
##
## Attaching package: 'janitor'
##
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
library(plotrix)
Loading the CSV Files:
daily_activity <- read.csv("/cloud/project/Fitabase Data/dailyActivity_merged.csv")
daily_calories <- read.csv("/cloud/project/Fitabase Data/dailyCalories_merged.csv")
daily_steps <- read.csv("/cloud/project/Fitabase Data/dailySteps_merged.csv")
weight_log <- read.csv("/cloud/project/Fitabase Data/weightLogInfo_merged.csv")
sleep_day <- read.csv("/cloud/project/Fitabase Data/sleepDay_merged.csv")
Exploring the Data:
colnames(daily_activity)
## [1] "Id" "ActivityDate"
## [3] "TotalSteps" "TotalDistance"
## [5] "TrackerDistance" "LoggedActivitiesDistance"
## [7] "VeryActiveDistance" "ModeratelyActiveDistance"
## [9] "LightActiveDistance" "SedentaryActiveDistance"
## [11] "VeryActiveMinutes" "FairlyActiveMinutes"
## [13] "LightlyActiveMinutes" "SedentaryMinutes"
## [15] "Calories"
head(daily_activity)
str(daily_activity)
## 'data.frame': 940 obs. of 15 variables:
## $ Id : num 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
## $ ActivityDate : chr "4/12/2016" "4/13/2016" "4/14/2016" "4/15/2016" ...
## $ TotalSteps : int 13162 10735 10460 9762 12669 9705 13019 15506 10544 9819 ...
## $ TotalDistance : num 8.5 6.97 6.74 6.28 8.16 ...
## $ TrackerDistance : num 8.5 6.97 6.74 6.28 8.16 ...
## $ LoggedActivitiesDistance: num 0 0 0 0 0 0 0 0 0 0 ...
## $ VeryActiveDistance : num 1.88 1.57 2.44 2.14 2.71 ...
## $ ModeratelyActiveDistance: num 0.55 0.69 0.4 1.26 0.41 ...
## $ LightActiveDistance : num 6.06 4.71 3.91 2.83 5.04 ...
## $ SedentaryActiveDistance : num 0 0 0 0 0 0 0 0 0 0 ...
## $ VeryActiveMinutes : int 25 21 30 29 36 38 42 50 28 19 ...
## $ FairlyActiveMinutes : int 13 19 11 34 10 20 16 31 12 8 ...
## $ LightlyActiveMinutes : int 328 217 181 209 221 164 233 264 205 211 ...
## $ SedentaryMinutes : int 728 776 1218 726 773 539 1149 775 818 838 ...
## $ Calories : int 1985 1797 1776 1745 1863 1728 1921 2035 1786 1775 ...
skim(daily_activity)
| Name | daily_activity |
| Number of rows | 940 |
| Number of columns | 15 |
| _______________________ | |
| Column type frequency: | |
| character | 1 |
| numeric | 14 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| ActivityDate | 0 | 1 | 8 | 9 | 0 | 31 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| Id | 0 | 1 | 4.855407e+09 | 2.424805e+09 | 1503960366 | 2.320127e+09 | 4.445115e+09 | 6.962181e+09 | 8.877689e+09 | ▇▅▃▅▅ |
| TotalSteps | 0 | 1 | 7.637910e+03 | 5.087150e+03 | 0 | 3.789750e+03 | 7.405500e+03 | 1.072700e+04 | 3.601900e+04 | ▇▇▁▁▁ |
| TotalDistance | 0 | 1 | 5.490000e+00 | 3.920000e+00 | 0 | 2.620000e+00 | 5.240000e+00 | 7.710000e+00 | 2.803000e+01 | ▇▆▁▁▁ |
| TrackerDistance | 0 | 1 | 5.480000e+00 | 3.910000e+00 | 0 | 2.620000e+00 | 5.240000e+00 | 7.710000e+00 | 2.803000e+01 | ▇▆▁▁▁ |
| LoggedActivitiesDistance | 0 | 1 | 1.100000e-01 | 6.200000e-01 | 0 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 4.940000e+00 | ▇▁▁▁▁ |
| VeryActiveDistance | 0 | 1 | 1.500000e+00 | 2.660000e+00 | 0 | 0.000000e+00 | 2.100000e-01 | 2.050000e+00 | 2.192000e+01 | ▇▁▁▁▁ |
| ModeratelyActiveDistance | 0 | 1 | 5.700000e-01 | 8.800000e-01 | 0 | 0.000000e+00 | 2.400000e-01 | 8.000000e-01 | 6.480000e+00 | ▇▁▁▁▁ |
| LightActiveDistance | 0 | 1 | 3.340000e+00 | 2.040000e+00 | 0 | 1.950000e+00 | 3.360000e+00 | 4.780000e+00 | 1.071000e+01 | ▆▇▆▁▁ |
| SedentaryActiveDistance | 0 | 1 | 0.000000e+00 | 1.000000e-02 | 0 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 1.100000e-01 | ▇▁▁▁▁ |
| VeryActiveMinutes | 0 | 1 | 2.116000e+01 | 3.284000e+01 | 0 | 0.000000e+00 | 4.000000e+00 | 3.200000e+01 | 2.100000e+02 | ▇▁▁▁▁ |
| FairlyActiveMinutes | 0 | 1 | 1.356000e+01 | 1.999000e+01 | 0 | 0.000000e+00 | 6.000000e+00 | 1.900000e+01 | 1.430000e+02 | ▇▁▁▁▁ |
| LightlyActiveMinutes | 0 | 1 | 1.928100e+02 | 1.091700e+02 | 0 | 1.270000e+02 | 1.990000e+02 | 2.640000e+02 | 5.180000e+02 | ▅▇▇▃▁ |
| SedentaryMinutes | 0 | 1 | 9.912100e+02 | 3.012700e+02 | 0 | 7.297500e+02 | 1.057500e+03 | 1.229500e+03 | 1.440000e+03 | ▁▁▇▅▇ |
| Calories | 0 | 1 | 2.303610e+03 | 7.181700e+02 | 0 | 1.828500e+03 | 2.134000e+03 | 2.793250e+03 | 4.900000e+03 | ▁▆▇▃▁ |
Also creating month and day of week column as we need them in analysis.
daily_activity$Rec_Date <- as.Date(daily_activity$ActivityDate,"%m/%d/%y")
daily_activity$month <- format(daily_activity$Rec_Date,"%B")
daily_activity$day_of_week <- format(daily_activity$Rec_Date,"%A")
colnames(daily_activity)
## [1] "Id" "ActivityDate"
## [3] "TotalSteps" "TotalDistance"
## [5] "TrackerDistance" "LoggedActivitiesDistance"
## [7] "VeryActiveDistance" "ModeratelyActiveDistance"
## [9] "LightActiveDistance" "SedentaryActiveDistance"
## [11] "VeryActiveMinutes" "FairlyActiveMinutes"
## [13] "LightlyActiveMinutes" "SedentaryMinutes"
## [15] "Calories" "Rec_Date"
## [17] "month" "day_of_week"
colnames(daily_calories)
## [1] "Id" "ActivityDay" "Calories"
head(daily_calories)
str(daily_calories)
## 'data.frame': 940 obs. of 3 variables:
## $ Id : num 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
## $ ActivityDay: chr "4/12/2016" "4/13/2016" "4/14/2016" "4/15/2016" ...
## $ Calories : int 1985 1797 1776 1745 1863 1728 1921 2035 1786 1775 ...
skim(daily_calories)
| Name | daily_calories |
| Number of rows | 940 |
| Number of columns | 3 |
| _______________________ | |
| Column type frequency: | |
| character | 1 |
| numeric | 2 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| ActivityDay | 0 | 1 | 8 | 9 | 0 | 31 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| Id | 0 | 1 | 4.855407e+09 | 2.424805e+09 | 1503960366 | 2320127002.0 | 4445114986 | 6.962181e+09 | 8877689391 | ▇▅▃▅▅ |
| Calories | 0 | 1 | 2.303610e+03 | 7.181700e+02 | 0 | 1828.5 | 2134 | 2.793250e+03 | 4900 | ▁▆▇▃▁ |
colnames(sleep_day)
## [1] "Id" "SleepDay" "TotalSleepRecords"
## [4] "TotalMinutesAsleep" "TotalTimeInBed"
head(sleep_day)
str(sleep_day)
## 'data.frame': 413 obs. of 5 variables:
## $ Id : num 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
## $ SleepDay : chr "4/12/2016 12:00:00 AM" "4/13/2016 12:00:00 AM" "4/15/2016 12:00:00 AM" "4/16/2016 12:00:00 AM" ...
## $ TotalSleepRecords : int 1 2 1 2 1 1 1 1 1 1 ...
## $ TotalMinutesAsleep: int 327 384 412 340 700 304 360 325 361 430 ...
## $ TotalTimeInBed : int 346 407 442 367 712 320 377 364 384 449 ...
skim(sleep_day)
| Name | sleep_day |
| Number of rows | 413 |
| Number of columns | 5 |
| _______________________ | |
| Column type frequency: | |
| character | 1 |
| numeric | 4 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| SleepDay | 0 | 1 | 20 | 21 | 0 | 31 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| Id | 0 | 1 | 5.000979e+09 | 2.06036e+09 | 1503960366 | 3977333714 | 4702921684 | 6962181067 | 8792009665 | ▆▆▇▅▃ |
| TotalSleepRecords | 0 | 1 | 1.120000e+00 | 3.50000e-01 | 1 | 1 | 1 | 1 | 3 | ▇▁▁▁▁ |
| TotalMinutesAsleep | 0 | 1 | 4.194700e+02 | 1.18340e+02 | 58 | 361 | 433 | 490 | 796 | ▁▂▇▃▁ |
| TotalTimeInBed | 0 | 1 | 4.586400e+02 | 1.27100e+02 | 61 | 403 | 463 | 526 | 961 | ▁▃▇▁▁ |
colnames(weight_log)
## [1] "Id" "Date" "WeightKg" "WeightPounds"
## [5] "Fat" "BMI" "IsManualReport" "LogId"
head(weight_log)
str(weight_log)
## 'data.frame': 67 obs. of 8 variables:
## $ Id : num 1.50e+09 1.50e+09 1.93e+09 2.87e+09 2.87e+09 ...
## $ Date : chr "5/2/2016 11:59:59 PM" "5/3/2016 11:59:59 PM" "4/13/2016 1:08:52 AM" "4/21/2016 11:59:59 PM" ...
## $ WeightKg : num 52.6 52.6 133.5 56.7 57.3 ...
## $ WeightPounds : num 116 116 294 125 126 ...
## $ Fat : int 22 NA NA NA NA 25 NA NA NA NA ...
## $ BMI : num 22.6 22.6 47.5 21.5 21.7 ...
## $ IsManualReport: chr "True" "True" "False" "True" ...
## $ LogId : num 1.46e+12 1.46e+12 1.46e+12 1.46e+12 1.46e+12 ...
skim(weight_log)
| Name | weight_log |
| Number of rows | 67 |
| Number of columns | 8 |
| _______________________ | |
| Column type frequency: | |
| character | 2 |
| numeric | 6 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| Date | 0 | 1 | 19 | 21 | 0 | 56 | 0 |
| IsManualReport | 0 | 1 | 4 | 5 | 0 | 2 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| Id | 0 | 1.00 | 7.009282e+09 | 1.950322e+09 | 1.503960e+09 | 6.962181e+09 | 6.962181e+09 | 8.877689e+09 | 8.877689e+09 | ▁▁▂▇▆ |
| WeightKg | 0 | 1.00 | 7.204000e+01 | 1.392000e+01 | 5.260000e+01 | 6.140000e+01 | 6.250000e+01 | 8.505000e+01 | 1.335000e+02 | ▇▃▃▁▁ |
| WeightPounds | 0 | 1.00 | 1.588100e+02 | 3.070000e+01 | 1.159600e+02 | 1.353600e+02 | 1.377900e+02 | 1.875000e+02 | 2.943200e+02 | ▇▃▃▁▁ |
| Fat | 65 | 0.03 | 2.350000e+01 | 2.120000e+00 | 2.200000e+01 | 2.275000e+01 | 2.350000e+01 | 2.425000e+01 | 2.500000e+01 | ▇▁▁▁▇ |
| BMI | 0 | 1.00 | 2.519000e+01 | 3.070000e+00 | 2.145000e+01 | 2.396000e+01 | 2.439000e+01 | 2.556000e+01 | 4.754000e+01 | ▇▁▁▁▁ |
| LogId | 0 | 1.00 | 1.461772e+12 | 7.829948e+08 | 1.460444e+12 | 1.461079e+12 | 1.461802e+12 | 1.462375e+12 | 1.463098e+12 | ▇▇▆▇▇ |
After Executing these commands we found out:
Also,
I am also going to count unique IDs to confirm whether data has 30 IDs as claimed by the survey.
n_distinct(daily_activity$Id)
## [1] 33
There are 33 unique IDs, instead of 30 unique IDs as expected. Some users may have created additional IDs during the survey period.
Now the data cleaning and manipulation is done.Now data is ready to be analyzed.
daily_activity %>% select(TotalSteps,TotalDistance,SedentaryMinutes,VeryActiveMinutes) %>% summary()
## TotalSteps TotalDistance SedentaryMinutes VeryActiveMinutes
## Min. : 0 Min. : 0.000 Min. : 0.0 Min. : 0.00
## 1st Qu.: 3790 1st Qu.: 2.620 1st Qu.: 729.8 1st Qu.: 0.00
## Median : 7406 Median : 5.245 Median :1057.5 Median : 4.00
## Mean : 7638 Mean : 5.490 Mean : 991.2 Mean : 21.16
## 3rd Qu.:10727 3rd Qu.: 7.713 3rd Qu.:1229.5 3rd Qu.: 32.00
## Max. :36019 Max. :28.030 Max. :1440.0 Max. :210.00
Based on the Summary of the daily_activity data:
summary(daily_calories)
## Id ActivityDay Calories
## Min. :1.504e+09 Length:940 Min. : 0
## 1st Qu.:2.320e+09 Class :character 1st Qu.:1828
## Median :4.445e+09 Mode :character Median :2134
## Mean :4.855e+09 Mean :2304
## 3rd Qu.:6.962e+09 3rd Qu.:2793
## Max. :8.878e+09 Max. :4900
summary(weight_log)
## Id Date WeightKg WeightPounds
## Min. :1.504e+09 Length:67 Min. : 52.60 Min. :116.0
## 1st Qu.:6.962e+09 Class :character 1st Qu.: 61.40 1st Qu.:135.4
## Median :6.962e+09 Mode :character Median : 62.50 Median :137.8
## Mean :7.009e+09 Mean : 72.04 Mean :158.8
## 3rd Qu.:8.878e+09 3rd Qu.: 85.05 3rd Qu.:187.5
## Max. :8.878e+09 Max. :133.50 Max. :294.3
##
## Fat BMI IsManualReport LogId
## Min. :22.00 Min. :21.45 Length:67 Min. :1.460e+12
## 1st Qu.:22.75 1st Qu.:23.96 Class :character 1st Qu.:1.461e+12
## Median :23.50 Median :24.39 Mode :character Median :1.462e+12
## Mean :23.50 Mean :25.19 Mean :1.462e+12
## 3rd Qu.:24.25 3rd Qu.:25.56 3rd Qu.:1.462e+12
## Max. :25.00 Max. :47.54 Max. :1.463e+12
## NA's :65
n_distinct(weight_log$Id)
## [1] 8
Based on our analysis I have following recommendations:
We have analysed that most of the people use application to track the steps and calories burned;less number of people use it to track sleep and very few use it to track weight records.So, I will suggest to focus on step,calories and sleep tracking more in application.
People prefer to track their activities on sunday, monday and tuesday than other week days.I think this behaviour is because people get busier in week end days due to work pressure and they don’t get enough time to track their activity.That’s why people are more active on sunday and starting 2 days of week.
The relation between steps taken vs calories burned and very active minutes vs calories burned shows positive correlation.So, this can be a good marketing strategy.
Majority of users 81.3% who are using the FitBit app are inactive for longer period of time and not using it for tracking their health habits.So, this can be a great chance to use this information for market strategy as Bellabeat can alert people about their sedentary behaviour time to time either on application or on tracker itself .
Majority of the users 62.5% who are using fitness tracker are overweight.So, there is an opportunity to influence the people so that they can become healthier.
Bellabeat marketing team can encourage users by educating and equipping them with knowledge about fitness benefits, suggest different types of exercises, calories intake and burn rate information on Bellabeat application.