Company Details and My Role

Bellabeat is a high-tech manufacturer of health-focused products for women. My role in the company is as Junior Data Analyst and I am working with the marketing analyst team. Urška Sršen, co founder and Chief Creative Officer (CEO) of Bellabeat, believes that analyzing smart device fitness data could help unlock new growth opportunities for the company. I have been asked to focus on one of Bellabeat’s products and analyze smart device data to gain insight into how consumers are using their smart devices.

Collecting data on activity, sleep, stress, and reproductive health has allowed Bellabeat to empower women with knowledge about their own health and habits. Since it was founded in 2013, Bellabeat has grown rapidly and quickly positioned itself as a tech-driven wellness company for women.

Business Task

Analyze smart device data to gain insight into how consumers are using their smart devices. I am also required to present the analysis to the Bellabeat executive team along with high-level recommendations for Bellabeat’s marketing strategy.

Introduction

This is the capstone project for the Google Data Analytics Certification. Bellabeat data analysis case study includes all the data analysis process :

Phase 1 : Ask Phase

Ask Phase include brief description of the business task which is analyzing the smart devices data and obtaining the insights on how consumers use Bellabeat smart devices. Ask phase is comprised of these main goals :

Also, it is important to whom we are presenting the solutions to. Mainly our stakeholders are :

Phase 2 : Prepare Phase

Prepare Phase includes the organization of the data from all the data sources. Sršen suggested me to use the public data set that explores smart devices users’ daily habits. Here, I’m using a specific data set “FitBit Fitness Tracker Data” (https://www.kaggle.com/datasets/arashnic/fitbit) made available through Mobius.

This Kaggle data set contains personal fitness tracker from thirty fitbit users. Thirty eligible Fitbit users consented to the submission of personal tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring. It includes information about daily activity, steps, and heart rate that can be used to explore users’ habits. Data is stored in 18 CSV files.

Data Credibility & Limitations:

Overall, the data source is evaluated as bad data, but it is not relevant at the moment since this is for the capstone project.

Phase 3 : Process Phase

Process Phase includes processing the data by cleaning and ensuring that it is correct,relevant,complete and error free.

Tools Used

I am using RStudio for verifying data integrity, cleaning, transformation, analysis and visualization.

Firstly, need to install and read the packages we need for analysis.

I am using the “sqldf” package, which will allow us to emulate SQL syntax when looking at data.

install.packages("sqldf")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)

Installing other packages:

install.packages("tidyverse")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
install.packages("lubridate")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
install.packages("ggplot2")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
install.packages("dplyr")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
install.packages("skimr")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
install.packages("janitor")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
install.packages("plotrix")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)

Loading the packages:

library(sqldf)
## Loading required package: gsubfn
## Loading required package: proto
## Warning in fun(libname, pkgname): couldn't connect to display ":0"
## Loading required package: RSQLite
library(tidyverse)
## ── Attaching packages
## ───────────────────────────────────────
## tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6      ✔ purrr   0.3.4 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.4.1 
## ✔ readr   2.1.2      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(lubridate)
## 
## Attaching package: 'lubridate'
## 
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
library(ggplot2)
library(dplyr)
library(skimr)
library(janitor)
## 
## Attaching package: 'janitor'
## 
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test
library(plotrix)

Loading the CSV Files:

daily_activity <- read.csv("/cloud/project/Fitabase Data/dailyActivity_merged.csv")
daily_calories <- read.csv("/cloud/project/Fitabase Data/dailyCalories_merged.csv")
daily_steps <- read.csv("/cloud/project/Fitabase Data/dailySteps_merged.csv")
weight_log  <- read.csv("/cloud/project/Fitabase Data/weightLogInfo_merged.csv")
sleep_day <- read.csv("/cloud/project/Fitabase Data/sleepDay_merged.csv")

Exploring the Data:

Daily Activity Data

colnames(daily_activity)
##  [1] "Id"                       "ActivityDate"            
##  [3] "TotalSteps"               "TotalDistance"           
##  [5] "TrackerDistance"          "LoggedActivitiesDistance"
##  [7] "VeryActiveDistance"       "ModeratelyActiveDistance"
##  [9] "LightActiveDistance"      "SedentaryActiveDistance" 
## [11] "VeryActiveMinutes"        "FairlyActiveMinutes"     
## [13] "LightlyActiveMinutes"     "SedentaryMinutes"        
## [15] "Calories"
head(daily_activity)
str(daily_activity)
## 'data.frame':    940 obs. of  15 variables:
##  $ Id                      : num  1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ ActivityDate            : chr  "4/12/2016" "4/13/2016" "4/14/2016" "4/15/2016" ...
##  $ TotalSteps              : int  13162 10735 10460 9762 12669 9705 13019 15506 10544 9819 ...
##  $ TotalDistance           : num  8.5 6.97 6.74 6.28 8.16 ...
##  $ TrackerDistance         : num  8.5 6.97 6.74 6.28 8.16 ...
##  $ LoggedActivitiesDistance: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ VeryActiveDistance      : num  1.88 1.57 2.44 2.14 2.71 ...
##  $ ModeratelyActiveDistance: num  0.55 0.69 0.4 1.26 0.41 ...
##  $ LightActiveDistance     : num  6.06 4.71 3.91 2.83 5.04 ...
##  $ SedentaryActiveDistance : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ VeryActiveMinutes       : int  25 21 30 29 36 38 42 50 28 19 ...
##  $ FairlyActiveMinutes     : int  13 19 11 34 10 20 16 31 12 8 ...
##  $ LightlyActiveMinutes    : int  328 217 181 209 221 164 233 264 205 211 ...
##  $ SedentaryMinutes        : int  728 776 1218 726 773 539 1149 775 818 838 ...
##  $ Calories                : int  1985 1797 1776 1745 1863 1728 1921 2035 1786 1775 ...
skim(daily_activity)
Data summary
Name daily_activity
Number of rows 940
Number of columns 15
_______________________
Column type frequency:
character 1
numeric 14
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
ActivityDate 0 1 8 9 0 31 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Id 0 1 4.855407e+09 2.424805e+09 1503960366 2.320127e+09 4.445115e+09 6.962181e+09 8.877689e+09 ▇▅▃▅▅
TotalSteps 0 1 7.637910e+03 5.087150e+03 0 3.789750e+03 7.405500e+03 1.072700e+04 3.601900e+04 ▇▇▁▁▁
TotalDistance 0 1 5.490000e+00 3.920000e+00 0 2.620000e+00 5.240000e+00 7.710000e+00 2.803000e+01 ▇▆▁▁▁
TrackerDistance 0 1 5.480000e+00 3.910000e+00 0 2.620000e+00 5.240000e+00 7.710000e+00 2.803000e+01 ▇▆▁▁▁
LoggedActivitiesDistance 0 1 1.100000e-01 6.200000e-01 0 0.000000e+00 0.000000e+00 0.000000e+00 4.940000e+00 ▇▁▁▁▁
VeryActiveDistance 0 1 1.500000e+00 2.660000e+00 0 0.000000e+00 2.100000e-01 2.050000e+00 2.192000e+01 ▇▁▁▁▁
ModeratelyActiveDistance 0 1 5.700000e-01 8.800000e-01 0 0.000000e+00 2.400000e-01 8.000000e-01 6.480000e+00 ▇▁▁▁▁
LightActiveDistance 0 1 3.340000e+00 2.040000e+00 0 1.950000e+00 3.360000e+00 4.780000e+00 1.071000e+01 ▆▇▆▁▁
SedentaryActiveDistance 0 1 0.000000e+00 1.000000e-02 0 0.000000e+00 0.000000e+00 0.000000e+00 1.100000e-01 ▇▁▁▁▁
VeryActiveMinutes 0 1 2.116000e+01 3.284000e+01 0 0.000000e+00 4.000000e+00 3.200000e+01 2.100000e+02 ▇▁▁▁▁
FairlyActiveMinutes 0 1 1.356000e+01 1.999000e+01 0 0.000000e+00 6.000000e+00 1.900000e+01 1.430000e+02 ▇▁▁▁▁
LightlyActiveMinutes 0 1 1.928100e+02 1.091700e+02 0 1.270000e+02 1.990000e+02 2.640000e+02 5.180000e+02 ▅▇▇▃▁
SedentaryMinutes 0 1 9.912100e+02 3.012700e+02 0 7.297500e+02 1.057500e+03 1.229500e+03 1.440000e+03 ▁▁▇▅▇
Calories 0 1 2.303610e+03 7.181700e+02 0 1.828500e+03 2.134000e+03 2.793250e+03 4.900000e+03 ▁▆▇▃▁

Also creating month and day of week column as we need them in analysis.

daily_activity$Rec_Date <- as.Date(daily_activity$ActivityDate,"%m/%d/%y")
daily_activity$month <- format(daily_activity$Rec_Date,"%B")
daily_activity$day_of_week <- format(daily_activity$Rec_Date,"%A")

colnames(daily_activity)
##  [1] "Id"                       "ActivityDate"            
##  [3] "TotalSteps"               "TotalDistance"           
##  [5] "TrackerDistance"          "LoggedActivitiesDistance"
##  [7] "VeryActiveDistance"       "ModeratelyActiveDistance"
##  [9] "LightActiveDistance"      "SedentaryActiveDistance" 
## [11] "VeryActiveMinutes"        "FairlyActiveMinutes"     
## [13] "LightlyActiveMinutes"     "SedentaryMinutes"        
## [15] "Calories"                 "Rec_Date"                
## [17] "month"                    "day_of_week"

Daily Calories Data

colnames(daily_calories)
## [1] "Id"          "ActivityDay" "Calories"
head(daily_calories)
str(daily_calories)
## 'data.frame':    940 obs. of  3 variables:
##  $ Id         : num  1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ ActivityDay: chr  "4/12/2016" "4/13/2016" "4/14/2016" "4/15/2016" ...
##  $ Calories   : int  1985 1797 1776 1745 1863 1728 1921 2035 1786 1775 ...
skim(daily_calories)
Data summary
Name daily_calories
Number of rows 940
Number of columns 3
_______________________
Column type frequency:
character 1
numeric 2
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
ActivityDay 0 1 8 9 0 31 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Id 0 1 4.855407e+09 2.424805e+09 1503960366 2320127002.0 4445114986 6.962181e+09 8877689391 ▇▅▃▅▅
Calories 0 1 2.303610e+03 7.181700e+02 0 1828.5 2134 2.793250e+03 4900 ▁▆▇▃▁

Sleeping Data

colnames(sleep_day)
## [1] "Id"                 "SleepDay"           "TotalSleepRecords" 
## [4] "TotalMinutesAsleep" "TotalTimeInBed"
head(sleep_day)
str(sleep_day)
## 'data.frame':    413 obs. of  5 variables:
##  $ Id                : num  1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ SleepDay          : chr  "4/12/2016 12:00:00 AM" "4/13/2016 12:00:00 AM" "4/15/2016 12:00:00 AM" "4/16/2016 12:00:00 AM" ...
##  $ TotalSleepRecords : int  1 2 1 2 1 1 1 1 1 1 ...
##  $ TotalMinutesAsleep: int  327 384 412 340 700 304 360 325 361 430 ...
##  $ TotalTimeInBed    : int  346 407 442 367 712 320 377 364 384 449 ...
skim(sleep_day)
Data summary
Name sleep_day
Number of rows 413
Number of columns 5
_______________________
Column type frequency:
character 1
numeric 4
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
SleepDay 0 1 20 21 0 31 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Id 0 1 5.000979e+09 2.06036e+09 1503960366 3977333714 4702921684 6962181067 8792009665 ▆▆▇▅▃
TotalSleepRecords 0 1 1.120000e+00 3.50000e-01 1 1 1 1 3 ▇▁▁▁▁
TotalMinutesAsleep 0 1 4.194700e+02 1.18340e+02 58 361 433 490 796 ▁▂▇▃▁
TotalTimeInBed 0 1 4.586400e+02 1.27100e+02 61 403 463 526 961 ▁▃▇▁▁

Weight Logs

colnames(weight_log)
## [1] "Id"             "Date"           "WeightKg"       "WeightPounds"  
## [5] "Fat"            "BMI"            "IsManualReport" "LogId"
head(weight_log)
str(weight_log)
## 'data.frame':    67 obs. of  8 variables:
##  $ Id            : num  1.50e+09 1.50e+09 1.93e+09 2.87e+09 2.87e+09 ...
##  $ Date          : chr  "5/2/2016 11:59:59 PM" "5/3/2016 11:59:59 PM" "4/13/2016 1:08:52 AM" "4/21/2016 11:59:59 PM" ...
##  $ WeightKg      : num  52.6 52.6 133.5 56.7 57.3 ...
##  $ WeightPounds  : num  116 116 294 125 126 ...
##  $ Fat           : int  22 NA NA NA NA 25 NA NA NA NA ...
##  $ BMI           : num  22.6 22.6 47.5 21.5 21.7 ...
##  $ IsManualReport: chr  "True" "True" "False" "True" ...
##  $ LogId         : num  1.46e+12 1.46e+12 1.46e+12 1.46e+12 1.46e+12 ...
skim(weight_log)
Data summary
Name weight_log
Number of rows 67
Number of columns 8
_______________________
Column type frequency:
character 2
numeric 6
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
Date 0 1 19 21 0 56 0
IsManualReport 0 1 4 5 0 2 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Id 0 1.00 7.009282e+09 1.950322e+09 1.503960e+09 6.962181e+09 6.962181e+09 8.877689e+09 8.877689e+09 ▁▁▂▇▆
WeightKg 0 1.00 7.204000e+01 1.392000e+01 5.260000e+01 6.140000e+01 6.250000e+01 8.505000e+01 1.335000e+02 ▇▃▃▁▁
WeightPounds 0 1.00 1.588100e+02 3.070000e+01 1.159600e+02 1.353600e+02 1.377900e+02 1.875000e+02 2.943200e+02 ▇▃▃▁▁
Fat 65 0.03 2.350000e+01 2.120000e+00 2.200000e+01 2.275000e+01 2.350000e+01 2.425000e+01 2.500000e+01 ▇▁▁▁▇
BMI 0 1.00 2.519000e+01 3.070000e+00 2.145000e+01 2.396000e+01 2.439000e+01 2.556000e+01 4.754000e+01 ▇▁▁▁▁
LogId 0 1.00 1.461772e+12 7.829948e+08 1.460444e+12 1.461079e+12 1.461802e+12 1.462375e+12 1.463098e+12 ▇▇▆▇▇

After Executing these commands we found out:

  • Number of records and columns
  • Number of null and non null values
  • Data type of every columns

Also,

  • There are 940 records in daily_activity data, 940 records in daily_calories, 413 records in daily_sleep and 67 records in weight_log.
  • There are no null values present in any of the data set, So there is no requirement to clean the data.
  • The date column is in character format, so we need to convert it into datetime64 type.

I am also going to count unique IDs to confirm whether data has 30 IDs as claimed by the survey.

n_distinct(daily_activity$Id)
## [1] 33

There are 33 unique IDs, instead of 30 unique IDs as expected. Some users may have created additional IDs during the survey period.

Now the data cleaning and manipulation is done.Now data is ready to be analyzed.

Phase 4 : Analyze Phase

daily_activity %>%  select(TotalSteps,TotalDistance,SedentaryMinutes,VeryActiveMinutes) %>% summary()
##    TotalSteps    TotalDistance    SedentaryMinutes VeryActiveMinutes
##  Min.   :    0   Min.   : 0.000   Min.   :   0.0   Min.   :  0.00   
##  1st Qu.: 3790   1st Qu.: 2.620   1st Qu.: 729.8   1st Qu.:  0.00   
##  Median : 7406   Median : 5.245   Median :1057.5   Median :  4.00   
##  Mean   : 7638   Mean   : 5.490   Mean   : 991.2   Mean   : 21.16   
##  3rd Qu.:10727   3rd Qu.: 7.713   3rd Qu.:1229.5   3rd Qu.: 32.00   
##  Max.   :36019   Max.   :28.030   Max.   :1440.0   Max.   :210.00

Based on the Summary of the daily_activity data:

summary(daily_calories)
##        Id            ActivityDay           Calories   
##  Min.   :1.504e+09   Length:940         Min.   :   0  
##  1st Qu.:2.320e+09   Class :character   1st Qu.:1828  
##  Median :4.445e+09   Mode  :character   Median :2134  
##  Mean   :4.855e+09                      Mean   :2304  
##  3rd Qu.:6.962e+09                      3rd Qu.:2793  
##  Max.   :8.878e+09                      Max.   :4900
summary(weight_log)
##        Id                Date              WeightKg       WeightPounds  
##  Min.   :1.504e+09   Length:67          Min.   : 52.60   Min.   :116.0  
##  1st Qu.:6.962e+09   Class :character   1st Qu.: 61.40   1st Qu.:135.4  
##  Median :6.962e+09   Mode  :character   Median : 62.50   Median :137.8  
##  Mean   :7.009e+09                      Mean   : 72.04   Mean   :158.8  
##  3rd Qu.:8.878e+09                      3rd Qu.: 85.05   3rd Qu.:187.5  
##  Max.   :8.878e+09                      Max.   :133.50   Max.   :294.3  
##                                                                         
##       Fat             BMI        IsManualReport         LogId          
##  Min.   :22.00   Min.   :21.45   Length:67          Min.   :1.460e+12  
##  1st Qu.:22.75   1st Qu.:23.96   Class :character   1st Qu.:1.461e+12  
##  Median :23.50   Median :24.39   Mode  :character   Median :1.462e+12  
##  Mean   :23.50   Mean   :25.19                      Mean   :1.462e+12  
##  3rd Qu.:24.25   3rd Qu.:25.56                      3rd Qu.:1.462e+12  
##  Max.   :25.00   Max.   :47.54                      Max.   :1.463e+12  
##  NA's   :65
n_distinct(weight_log$Id)
## [1] 8

Phase 5 : ** Share Phase**

In this step, we will create some visualizations based on our analysis and goal of project.

daily_activity$day_of_week <- ordered(daily_activity$day_of_week,levels=c("Monday","Tuesday","Wednesday","Thursday","Friday","Saturday","Sunday"))

ggplot(data=daily_activity) + geom_bar(mapping = aes(x=day_of_week),fill="black") +
  labs(x="Day of week",y="Count",title="No. of times users used tracker across week")

Observation :

As we can see, the frequency of usage of FitBit fitness tracker application is high on sunday, monday and tuesday than other week days.I think this behaviour is because people get busier in week end days due to work pressure and they don’t get enough time to track their activity.That’s why people are more active on sunday and starting 2 days of week.

The relationship between Steps taken in a day and Sedentary(people were inactive) minutes.

ggplot(data=daily_activity, aes(x=TotalSteps, y=SedentaryMinutes, color = Calories)) + geom_point() +
geom_smooth(method = "loess",color="green") + 
labs(x="Total Steps",y="Sedentary Minutes",title="Total Steps vs Sedentary Minutes")
## `geom_smooth()` using formula 'y ~ x'

Observations :

Total steps taken vs sedentary minutes

I was expecting a totally inverse relationship between steps taken and sedentary minutes.

  • At the start when the steps taken are less than 10000 the relation between them is inverse, but as number of steps increases beyond 10000 there is no drastic change in relation.
  • I got surprised watching the relation between steps and sedentary minutes after 15000 steps as it becomes slightly positive.

The relationship between Very Active Minutes and Calories Burned.

ggplot(data=daily_activity,aes(x = VeryActiveMinutes, y = Calories, color = Calories)) + geom_point() + 
geom_smooth(method = "loess",color="orange") +
labs(x="Very Active Minutes",y="Calories",title = "Very Active Minutes vs Calories Burned")
## `geom_smooth()` using formula 'y ~ x'

Observations :

Relation between very active minutes and calories burned

As we can see,very active minutes and calories burned are highly correlated with each other with some outliers at bottom left and top left of the plot.

The relationship between Sedentary Minutes and Calories Burned.

ggplot(data=daily_activity,aes(x=SedentaryMinutes,y=Calories,color=Calories)) + geom_point() + 
geom_smooth(method="loess",color="red") + 
labs(y="Calories", x="Sedentary Minutes", title="Sedentary Minutes vs. Calories Burned")
## `geom_smooth()` using formula 'y ~ x'

Observations :

I was expecting the relation between sedentary minutes and calories burned to be totally inverse in nature.

  • The data is showing positive correlation up to 1000 sedentary minutes.
  • After 1000 sedentary minutes the relation is inverse as I expected.

The relationship between Sleep and Time in bed.

ggplot(data=sleep_day, aes(x=TotalMinutesAsleep, y=TotalTimeInBed)) + geom_point() + stat_smooth(method = lm) +
  labs(x="Total Minutes a sleep", y="Total Time in Bed", title = "Sleep Time vs Time in Bed")
## `geom_smooth()` using formula 'y ~ x'

Observations :

  • As we can see, there is a strong positive correlation between TotalMinutesAsleep and TotalTimeInBed, but there are some outliers in data in the middle and top of plot.
  • The outliers are one who spend lot of time in bed but didn’t actually sleep. There can be different reasons for that.

Calculating the sum of individual minute column from daily activity data.

activity_minute <- sqldf("SELECT SUM(VeryActiveMinutes),SUM(FairlyActiveMinutes),
SUM(LightlyActiveMinutes),SUM(SedentaryMinutes) 
FROM daily_activity")

activity_minute

Now, we will use these values to plot a 3D pie chart to compare the percentage of activity by minutes.

x <- c(19895,12751,181244,931738)
x
## [1]  19895  12751 181244 931738
piepercent <- round(100*x / sum(x), 1)
colors = c("red","blue","green","yellow")
 
pie3D(x,labels = paste0(piepercent,"%"),col=colors,main = "Percentage of Activity in Minutes")
legend("bottom",c("VeryActiveMinutes","FairlyActiveMinutes","LightlyActiveMinutes","SedentaryMinutes"),cex=0.6,fill = colors)

Observations :

As we can see,

  • The percentage of sedentary minutes is very high than all other, which covers 81.3 % of pie this indicates that people are inactive for longer period of time.
  • The percentage of very active and fairly active minutes is very less ie. 1.7%,1.1% respectively, which is very less compared to other activities.

Calculating sum of different distance values from daily activity data.

activity_distance <- sqldf("SELECT SUM(ModeratelyActiveDistance),SUM(LightActiveDistance),
      SUM(VeryActiveDistance),SUM(SedentaryActiveDistance)
      FROM daily_activity")

activity_distance

NOTE: As we can see that the values of sedentaryActiveDistance is very less as compare to other distances,So I am excluding it in drawing a 3D pie chart to compare the percentage of activity in minutes.

y <- c(533.49,3140.37,1412.52)
y
## [1]  533.49 3140.37 1412.52
piepercent <- round(100*y / sum(y), 1)
colors = c("orange","green","blue")
pie3D(y,labels = paste0(piepercent,"%"),col=colors,main = "Percentage of Activity in Distance",labelcex=1)
legend("bottom",c("ModeratelyActiveDistance","LightlyActiveDistance","VeryActiveDistance"),cex=0.6, fill = colors,y.intersp=1.5,)

Observations :

  • The percentage of lightly active distance is highest with 61.7% and that of moderately active distance is 10.5%.
  • The percentage of very active distance is 27.8% which is good, but it can be increased further so that people can achieve there fitness goals.

Calculating the count of people with over weight.

The BMI for healthy person is between 18.5 and 24.9 and the persons who’s BMI is above 24.9 are considered to be overweight.(source:CDC)

count_overweight <- sqldf("SELECT COUNT(DISTINCT(Id))
                          FROM weight_log
                          WHERE BMI > 24.9")
count_overweight

Plotting 3D pie chart to compare the percentage of people with overweight vs healthy weight.

z <- c(5,3)
piepercent <- round(100*z / sum(z),1)
colors = c("red","green")
pie3D(z,labels=paste0(piepercent,"%"),explode=0.1,col=colors,radius=1,main="Percentage of people with Over Weight vs Healthy Weight")
legend("bottom",c("OverWeight","HealthyWeight"),cex=0.6,fill=colors)

Observations :

  • The percentage of people with over weight is 62.5% which is high as compared to percentage of people with healthy weigh which is 37.5%. So, there is a very good opportunity to increase the percentage of people with healthy weight.

Phase 5 : Act Phase

Based on our analysis I have following recommendations:

  1. We have analysed that most of the people use application to track the steps and calories burned;less number of people use it to track sleep and very few use it to track weight records.So, I will suggest to focus on step,calories and sleep tracking more in application.

  2. People prefer to track their activities on sunday, monday and tuesday than other week days.I think this behaviour is because people get busier in week end days due to work pressure and they don’t get enough time to track their activity.That’s why people are more active on sunday and starting 2 days of week.

  3. The relation between steps taken vs calories burned and very active minutes vs calories burned shows positive correlation.So, this can be a good marketing strategy.

  4. Majority of users 81.3% who are using the FitBit app are inactive for longer period of time and not using it for tracking their health habits.So, this can be a great chance to use this information for market strategy as Bellabeat can alert people about their sedentary behaviour time to time either on application or on tracker itself .

  5. Majority of the users 62.5% who are using fitness tracker are overweight.So, there is an opportunity to influence the people so that they can become healthier.

  6. Bellabeat marketing team can encourage users by educating and equipping them with knowledge about fitness benefits, suggest different types of exercises, calories intake and burn rate information on Bellabeat application.