Introduction

I used the Bellabeat case study as my capstone project for the Google Data Analytics certificate. In this case study, R was used to clean, analyse and visualise the data.

ASK

Summary of the business task

Bellabeat is a high-tech manufacturer of health products. Its products are aimed at women who want to lead a healthy lifestyle.

To date, Bellabeat offers the Bellabeat app and three innovative health products: Leaf, Time and Spring.

Each of these products has its own functions, but they share the aim of tracking the user’s lifestyle, such as daily activity, sleep, stress, menstrual cycle and hydration.

Bellabeat wants to use the data produced by smart devices to gain insight into how consumers use them. These insights could then guide the company’s marketing strategy. In particular, Bellabeat would like to know how smart-device usage trends among women can inform Bellabeat’s marketing strategy.

Identify business tasks

1. What are some trends in smart device usage?

2. How could these trends apply to Bellabeat customers?

3. How could these trends guide marketing strategy?

Based on the insights, we could suggest new features for the smart devices, as well as custom profiles based on age, promotions and consultations.

PREPARE & PROCESS

Description of data

The data was retrieved from https://www.kaggle.com/datasets/arashnic/fitbit. There are 18 files covering different types of data, such as activity, calories, steps and intensity.

library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.4     ✓ dplyr   1.0.7
## ✓ tidyr   1.1.3     ✓ stringr 1.4.0
## ✓ readr   2.0.2     ✓ forcats 0.5.1

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(tidyr)
library(sqldf)

## Loading required package: gsubfn

## Loading required package: proto

## Loading required package: RSQLite

library(stringi)
library(stringr)
library(ggplot2)

Read selected files from Bellabeat folder.

d_act<- read.csv("~/Desktop/Coursera/Capstone/Fitabase Data 4.12.16-5.12.16/dailyActivity_merged.csv")
d_cal<- read.csv("~/Desktop/Coursera/Capstone/Fitabase Data 4.12.16-5.12.16/dailyCalories_merged.csv")
d_step<- read.csv("~/Desktop/Coursera/Capstone/Fitabase Data 4.12.16-5.12.16/dailySteps_merged.csv")
weight<- read.csv("~/Desktop/Coursera/Capstone/Fitabase Data 4.12.16-5.12.16/weightLogInfo_merged.csv")
sleep<- read.csv("~/Desktop/Coursera/Capstone/Fitabase Data 4.12.16-5.12.16/sleepDay_merged.csv")
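The five read.csv() calls above repeat the same folder path. One possible way to write the path only once is sketched below. The helper read_fitbit() and the variable data_dir are hypothetical names, not part of the original analysis, and the demo writes a tiny made-up CSV to a temporary folder so that it runs without the Kaggle files.

```r
# Hypothetical helper (not part of the original analysis): build the full
# path from a single folder variable so the folder is written only once.
read_fitbit <- function(file_name, data_dir) {
  read.csv(file.path(data_dir, file_name))
}

# Stand-in folder and file with made-up values, for demonstration only.
data_dir <- tempfile("fitbit_demo_")
dir.create(data_dir)
write.csv(
  data.frame(Id = c(1, 2),
             ActivityDay = c("4/12/2016", "4/12/2016"),
             StepTotal = c(100, 200)),
  file.path(data_dir, "dailySteps_merged.csv"),
  row.names = FALSE
)

d_step_demo <- read_fitbit("dailySteps_merged.csv", data_dir)
colnames(d_step_demo)
```

With the real data, data_dir would simply point at the downloaded folder, and each file would need only its file name.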

Check the column names to identify the similarities and differences among the files

colnames(d_act)

##  [1] "Id"                       "ActivityDate"            
##  [3] "TotalSteps"               "TotalDistance"           
##  [5] "TrackerDistance"          "LoggedActivitiesDistance"
##  [7] "VeryActiveDistance"       "ModeratelyActiveDistance"
##  [9] "LightActiveDistance"      "SedentaryActiveDistance" 
## [11] "VeryActiveMinutes"        "FairlyActiveMinutes"     
## [13] "LightlyActiveMinutes"     "SedentaryMinutes"        
## [15] "Calories"

colnames(d_cal)

## [1] "Id"          "ActivityDay" "Calories"

colnames(d_step)

## [1] "Id"          "ActivityDay" "StepTotal"

Get a summary of the data

user_count<- unique(select(d_act, Id)) #33 users use the tracker

user_sleep<- unique(select(sleep, Id)) #24 users logged their sleep

m_user<- anti_join(user_count, user_sleep, by="Id") #list of 9 user IDs with no sleep records

b<- unique(d_act) #check for duplicate rows: none found in the daily activity file
b2<- unique(sleep) #sleep has duplicate rows; unique() removes them
sleep<- b2
rm(b2)
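The counts quoted in the comments above can also be computed directly rather than read off the environment pane. The following is a minimal sketch on made-up stand-in tables (act_demo and sleep_demo are hypothetical, not the real files), using dplyr functions that the tidyverse already provides.

```r
library(dplyr)

# Made-up stand-ins for d_act and sleep (illustration only).
act_demo <- data.frame(Id = c(1, 1, 2, 3), TotalSteps = c(10, 20, 30, 40))
sleep_demo <- data.frame(
  Id = c(1, 1, 2),
  SleepDay = c("4/12/2016", "4/12/2016", "4/12/2016"),
  TotalMinutesAsleep = c(400, 400, 350)
)

n_users_act <- n_distinct(act_demo$Id)      # users with activity records
n_users_sleep <- n_distinct(sleep_demo$Id)  # users with sleep records

# Ids present in the activity table but absent from the sleep table.
missing_sleep <- anti_join(distinct(act_demo, Id),
                           distinct(sleep_demo, Id), by = "Id")

# Count fully duplicated rows, then drop them.
n_dup_sleep <- sum(duplicated(sleep_demo))
sleep_demo <- distinct(sleep_demo)
```

Storing the counts in variables keeps the numbers in the write-up tied to the data if the files are ever refreshed.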

The dataset consists of 33 individuals, which is a small sample. More samples are needed to ensure the clarity and integrity of the results. In addition, the data was collected over only two months. However, it can still give early insights into how Bellabeat customers use the app. For instance:

* how many users consistently used the Bellabeat app;
* how many users track their sleep and weight patterns;
* which users actively used the Bellabeat app.

Weight and sleep data cleaning

#merge sleep & daily activity
sleep<- separate(data=sleep, col=SleepDay, into=c("SleepDay", "rmv", "rmv2"), sep="\\ ") #fix SleepDay col.
sleep<- select(sleep, Id, SleepDay, TotalSleepRecords, TotalMinutesAsleep, TotalTimeInBed)
sleep$SleepDay<- str_trim(sleep$SleepDay)
colnames(sleep)[2]="ActivityDate"
act_sleep<- merge(d_act, sleep, by=c("Id", "ActivityDate"))
glimpse(act_sleep)

## Rows: 410
## Columns: 18
## $ Id                       <dbl> 1503960366, 1503960366, 1503960366, 150396036…
## $ ActivityDate             <chr> "4/12/2016", "4/13/2016", "4/15/2016", "4/16/…
## $ TotalSteps               <int> 13162, 10735, 9762, 12669, 9705, 15506, 10544…
## $ TotalDistance            <dbl> 8.50, 6.97, 6.28, 8.16, 6.48, 9.88, 6.68, 6.3…
## $ TrackerDistance          <dbl> 8.50, 6.97, 6.28, 8.16, 6.48, 9.88, 6.68, 6.3…
## $ LoggedActivitiesDistance <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ VeryActiveDistance       <dbl> 1.88, 1.57, 2.14, 2.71, 3.19, 3.53, 1.96, 1.3…
## $ ModeratelyActiveDistance <dbl> 0.55, 0.69, 1.26, 0.41, 0.78, 1.32, 0.48, 0.3…
## $ LightActiveDistance      <dbl> 6.06, 4.71, 2.83, 5.04, 2.51, 5.03, 4.24, 4.6…
## $ SedentaryActiveDistance  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ VeryActiveMinutes        <int> 25, 21, 29, 36, 38, 50, 28, 19, 41, 39, 73, 3…
## $ FairlyActiveMinutes      <int> 13, 19, 34, 10, 20, 31, 12, 8, 21, 5, 14, 23,…
## $ LightlyActiveMinutes     <int> 328, 217, 209, 221, 164, 264, 205, 211, 262, …
## $ SedentaryMinutes         <int> 728, 776, 726, 773, 539, 775, 818, 838, 732, …
## $ Calories                 <int> 1985, 1797, 1745, 1863, 1728, 2035, 1786, 177…
## $ TotalSleepRecords        <int> 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ TotalMinutesAsleep       <int> 327, 384, 412, 340, 700, 304, 360, 325, 361, …
## $ TotalTimeInBed           <int> 346, 407, 442, 367, 712, 320, 377, 364, 384, …
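The merge above matches dates as text, which works only because both files spell the date the same way once the time part is removed. A hedged alternative is to convert both columns to the Date class. The sketch below assumes SleepDay values look like "4/12/2016 12:00:00 AM" (inferred from the split on spaces, not confirmed by the output shown) and uses made-up vectors rather than the real columns.

```r
# Made-up values in the assumed formats (illustration only).
sleep_day <- c("4/12/2016 12:00:00 AM", "4/13/2016 12:00:00 AM")
activity_date <- c("4/12/2016", "4/13/2016")

# as.Date() reads only as much of each string as the format needs,
# so the trailing time part is ignored.
sleep_date <- as.Date(sleep_day, format = "%m/%d/%Y")
act_date <- as.Date(activity_date, format = "%m/%d/%Y")

# The month can then be taken from the Date instead of splitting on "/".
sleep_month <- format(sleep_date, "%m")

identical(sleep_date, act_date)
```

Real Date values also sort chronologically, which character dates such as "4/9/2016" and "4/10/2016" do not.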

#merge user and weight
weight<- separate(data = weight, col= Date, into=c("ActivityDate", "rmv", "rmv2"), sep="\\ ")
weight<- weight[, c(1,2,5:10)]
weight$ActivityDate<- str_trim(weight$ActivityDate)
act_weight<- merge(weight, d_act, by=c("Id", "ActivityDate"))
m_weight<- anti_join(d_act, act_weight, by=c("Id", "ActivityDate"))

ANALYZE AND SHARE

Identify the BMI category

#Group the users into BMI categories.
#BMI less than 18.5 = underweight range.
#BMI 18.5 to <25 = healthy weight range.
#BMI 25.0 to <30 = overweight range.
#BMI 30.0 or higher = obesity range.

bmi_uw<- filter(act_weight, BMI < 18.5) #filter and classify the user into BMI category
bmi_hw<- act_weight %>%
  filter(BMI >= 18.5) %>%
  filter(BMI < 25)
bmi_ow<- act_weight %>%
  filter(BMI >= 25) %>%
  filter(BMI < 30)
bmi_obes<- act_weight %>%
  filter(BMI >= 30) 

bmi_hw$bmi_cat<- "Healthy Weight"
bmi_ow$bmi_cat<- "Overweight"
bmi_obes$bmi_cat<- "Obesity"

act_weight2<- rbind(bmi_hw, bmi_obes, bmi_ow) #combine the BMI category

act_weight2<- select(act_weight2, Id, BMI, bmi_cat)

act_weight3<- sqldf("select count(distinct Id) as count_Id, bmi_cat from act_weight2
                    group by bmi_cat") #count how many Id per BMI category
  
act_weight3 %>%
  mutate(bmi_cat = fct_reorder(bmi_cat, count_Id)) %>%
  ggplot( aes(x=bmi_cat, y=count_Id)) +
  geom_bar(stat="identity", fill="#f68060", alpha=.6, width=.4) +
  coord_flip() +
  xlab("") +
  theme_bw()
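The four BMI ranges listed in the comments can also be assigned in a single step. The following is a minimal sketch using base R's cut() on made-up BMI values (bmi_demo is a hypothetical table, not act_weight); it is an alternative to the separate filters above, not the method used in this analysis.

```r
# Made-up BMI values sitting on and around the boundaries (illustration only).
bmi_demo <- data.frame(Id = 1:5, BMI = c(17.9, 18.5, 24.9, 25.0, 30.0))

# right = FALSE makes each interval closed on the left:
# [-Inf, 18.5), [18.5, 25), [25, 30), [30, Inf)
bmi_demo$bmi_cat <- cut(
  bmi_demo$BMI,
  breaks = c(-Inf, 18.5, 25, 30, Inf),
  labels = c("Underweight", "Healthy Weight", "Overweight", "Obesity"),
  right = FALSE
)

bmi_demo
```

Because every row gets a label in one pass, a category with no users (such as underweight here) needs no special handling.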

Understand the sleep pattern

#how many users consistently track their sleep pattern
sleep$ActivityMonth<- sleep$ActivityDate
sleep<- separate(data=sleep, col=ActivityMonth, into=c("month", "day", "year"), sep="\\/")
sleep$month<- str_trim(sleep$month)
sleep<- sleep[, c(1:6)]

sleep$Id<- as.character(sleep$Id)

ggplot(data=sleep, aes(x=TotalMinutesAsleep, y=month)) +
  geom_point(aes(color=Id)) +
  facet_wrap(~Id) +
  theme(axis.title.y = element_text(angle = 90)) +
  labs(title="How individuals track their sleep per month", x="Total Minutes Asleep", y="Months")

#relationship between weight and sleep per month per individual
act_weight4<- select(act_weight, Id, ActivityDate, WeightKg)
act_weight4$ActivityMonth<- act_weight$ActivityDate
act_weight4<- separate(data=act_weight4, col=ActivityMonth, into=c("month", "day", "year"), sep="\\/")
act_weight4<- select(act_weight4, Id, month, ActivityDate, WeightKg)
act_weight4$month<- str_trim(act_weight4$month)

weight_sleep<- merge(act_weight4, sleep, by=c("Id", "ActivityDate", "month"))

weight_sleep2<- weight_sleep %>%
  group_by(Id) %>%
  mutate(mean_kg = mean(WeightKg, na.rm=T)) %>%
  mutate(mean_TotalTimeInBed = mean(TotalTimeInBed, na.rm=T)) %>%
  select(Id, month, mean_kg, mean_TotalTimeInBed) %>%
  unique()

weight_sleep2$Id<- as.character(weight_sleep2$Id)

ggplot(data=weight_sleep2, aes(x=mean_TotalTimeInBed, y=mean_kg)) +
  geom_point(aes(color=Id)) +
  labs(title="Is an individual's weight related to the amount of sleep?", x="Total Time in Bed", y="Weight (Kg)")
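The per-user means above are built with mutate() followed by select() and unique(), which can leave more than one row for a user whose records span two months. A hedged alternative is summarise(), sketched here on a made-up table (ws_demo is hypothetical, standing in for weight_sleep).

```r
library(dplyr)

# Made-up records: user "a" has two rows, user "b" has one (illustration only).
ws_demo <- data.frame(
  Id = c("a", "a", "b"),
  WeightKg = c(60, 62, 80),
  TotalTimeInBed = c(400, 420, 500)
)

# summarise() collapses each group to exactly one row.
ws_summary <- ws_demo %>%
  group_by(Id) %>%
  summarise(
    mean_kg = mean(WeightKg, na.rm = TRUE),
    mean_TotalTimeInBed = mean(TotalTimeInBed, na.rm = TRUE)
  )

ws_summary
```

With one row per user, each point in the scatter plot corresponds to exactly one individual.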

Active and non-active users of the Bellabeat app

#High User: Users who use their device between 21 and 31 days.
#Moderate User: Users who use their device between 11 and 20 days.
#Low User: Users who use their device between 1 and 10 days.

active_user<- sqldf("select count(distinct ActivityDate) as count_keyin, Id
                    from d_act group by Id order by count_keyin DESC")

#use >= so that the lower bounds (21, 11 and 1 days) are included, as defined above
user_hu<- active_user %>%
  filter(count_keyin >= 21) %>%
  filter(count_keyin <= 31)

user_moderate<- active_user %>%
  filter(count_keyin >= 11) %>%
  filter(count_keyin <= 20)

user_lowuse<- active_user %>%
  filter(count_keyin >= 1) %>%
  filter(count_keyin <= 10)

user_hu$type="high_user"
user_lowuse$type="low_user"
user_moderate$type="moderate_user"

active_user<- rbind(user_hu, user_lowuse, user_moderate)
active_user$Id<- as.character(active_user$Id)
  
ggplot(data=active_user, aes(x=count_keyin,y=Id)) +
  geom_point(aes(color=Id)) +
  facet_wrap(~type) +
  theme(axis.text.y = element_text(angle = 30)) +
  labs(title="Classification of individuals into user categories", x="Total Active Days", y="User Id")
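The three usage categories defined in the comments can also be assigned in one mutate() call, which makes the inclusive boundaries explicit. This is a minimal sketch on made-up day counts chosen to sit exactly on those boundaries (usage_demo is a hypothetical table, not active_user).

```r
library(dplyr)

# Made-up day counts on the category boundaries (illustration only).
usage_demo <- data.frame(Id = 1:6, count_keyin = c(1, 10, 11, 20, 21, 31))

# between() includes both end points, matching the definitions
# "between 21 and 31", "between 11 and 20" and "between 1 and 10".
usage_demo <- usage_demo %>%
  mutate(type = case_when(
    between(count_keyin, 21, 31) ~ "high_user",
    between(count_keyin, 11, 20) ~ "moderate_user",
    between(count_keyin, 1, 10)  ~ "low_user"
  ))

usage_demo
```

Any count outside all three ranges would get NA, which makes unexpected values easy to spot.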

Identify daily steps per individual

#Sedentary: < 5000 steps per day.
#Light active: 5000 - 9999 steps.
#Moderate active: 10000 - 12499 steps.
#High active: >= 12500 steps.

sum_steps<- d_act %>%
  group_by(Id) %>%
  summarize(mean_steps_daily = mean(TotalSteps))

sed<- filter(sum_steps, mean_steps_daily < 5000)

light_active<- sum_steps %>%
  filter(mean_steps_daily >= 5000) %>%
  filter(mean_steps_daily < 10000) #means are not whole numbers, so use < 10000 rather than <= 9999

mod_active<- sum_steps %>%
  filter(mean_steps_daily >= 10000) %>%
  filter(mean_steps_daily < 12500)

high_active<- sum_steps %>%
  filter(mean_steps_daily >= 12500)

sed$ActivityLevel="Sedentary"
light_active$ActivityLevel="Light Active"
mod_active$ActivityLevel="Moderate Active"
high_active$ActivityLevel="Highly Active"

user_act_level<- rbind(sed, light_active, mod_active, high_active)

gby_act_level<- sqldf("select count(Id) as count_Id, ActivityLevel from user_act_level
                      group by ActivityLevel")

gby_act_level %>%
  mutate(ActivityLevel = fct_reorder(ActivityLevel, count_Id)) %>%
  ggplot( aes(x=ActivityLevel, y=count_Id)) +
  geom_bar(stat="identity", fill="#f68060", alpha=.6, width=.4) +
  coord_flip() +
  xlab("Active Level") +
  ylab("Total User") +
  theme_bw()
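The per-level count produced with sqldf above can also be written with dplyr, which keeps the whole pipeline in one syntax. This is a sketch on a made-up table (levels_demo is hypothetical, standing in for user_act_level), offered as an alternative rather than a correction.

```r
library(dplyr)

# Made-up per-user activity levels (illustration only).
levels_demo <- data.frame(
  Id = 1:5,
  ActivityLevel = c("Sedentary", "Light Active", "Light Active",
                    "Highly Active", "Sedentary")
)

# count() groups by ActivityLevel and stores each group size in count_Id.
level_counts <- count(levels_demo, ActivityLevel, name = "count_Id")
level_counts
```

The result has the same two columns as gby_act_level, so it could be passed to the same bar chart code.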

Correlation between steps and calories

sum_cal<- d_act %>%
  group_by(Id) %>%
  summarize(mean_cal = mean(Calories))

user_act_level_cal<- merge(user_act_level, sum_cal, by="Id")
user_act_level_cal$Id<- as.character(user_act_level_cal$Id)

ggplot(data = user_act_level_cal) +
  geom_smooth(mapping = aes(x = mean_steps_daily, y = mean_cal)) +
  #size is a fixed value, so it goes outside aes() and adds no legend entry
  geom_point(mapping = aes(x = mean_steps_daily, y = mean_cal, color=Id), size=3) +
  labs(title="Daily Steps vs Calories Burned Per Individual", x="Daily Steps", y="Calories")

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
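The plot shows the relationship visually; a correlation coefficient would put a number on it. The sketch below uses base R's cor() and cor.test() on made-up per-user means (demo is a hypothetical table, and its values are not taken from the dataset), so it illustrates the calls rather than reporting a result.

```r
# Made-up per-user means (illustration only, not taken from the dataset).
demo <- data.frame(
  mean_steps_daily = c(3000, 6000, 8000, 11000, 14000),
  mean_cal = c(1800, 2100, 2000, 2600, 2900)
)

# Pearson correlation coefficient: ranges from -1 to 1.
r <- cor(demo$mean_steps_daily, demo$mean_cal)

# cor.test() reports the same coefficient together with a p-value
# and a confidence interval.
ct <- cor.test(demo$mean_steps_daily, demo$mean_cal)
ct$estimate
ct$p.value
```

With only 33 users, any coefficient computed from the real per-user means should be read as a rough indication rather than a firm estimate.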

Correlation between steps, calories and weight

sum_weight<- weight %>%
    group_by(Id) %>%
    summarize(mean_kg = mean(WeightKg))  

act_level_cal_weight<- merge(user_act_level_cal, sum_weight, by="Id")
act_level_cal_weight$Id<- as.character(act_level_cal_weight$Id)

ggplot(data = act_level_cal_weight) +
  geom_smooth(mapping = aes(y = mean_steps_daily, x = mean_kg)) +
  #size is a fixed value, so it goes outside aes() and adds no legend entry
  geom_point(mapping = aes(y = mean_steps_daily, x = mean_kg, color=Id), size=3) +
  labs(title="Daily Steps vs Weight Per Individual", y="Daily Steps", x="Weight (kg)")

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

ACT

The dataset covers 33 users over two months (April-May).

There are 29 high users, 3 moderate users and 1 low user.

24 users tracked their sleep patterns, but not consistently.

Only 8 active users consistently keep track of both their daily activities (sleep, walking) and their weight.

This finding shows that users need to be more consistent in tracking their activity.

Recommendation: Identify the low or non-active users and survey them to determine what holds them back.

8 users keyed in their weight. Their BMI categories are:

1 user = obesity (User ID: 1927972279)
3 users = healthy weight (User ID: 6962181067, 1503960366, 2873212765)
4 users = overweight (User ID: 4319703577, 4558609924, 5577150313, 8877689391)

Recommendation: Offer them new features (e.g. personalised activities) that suit their weight.

It is presumed that the more an individual weighs, the more sleep they need. However, in this study, an individual's weight does not reflect the amount of sleep.

Recommendation: More data is needed to improve the findings.

10,000 steps a day are recommended, which could burn 300-400 calories. However, the 18 users in the Light Active category average fewer than 10,000 steps, and the 8 users in the Sedentary category average fewer than 5,000 steps. The results also show that individuals with less weight took more daily steps than those with more weight.

Recommendation: Identify the user IDs in both categories and survey them to learn their daily schedules. Based on that, we can suggest suitable activities to increase their steps.

There is a relationship between daily steps, calories and weight per individual. For instance, the more steps a user takes, the more calories they burn.

Recommendation: Create new features, such as inspirational quotes, that could encourage users to increase their daily steps and burn more calories.

Other suggestions:

* Improve the data size and data types. For instance, collect a profile of each individual, such as age, marital status and occupation. This information could help Bellabeat customise activities or features for selected users.

* Survey non-active users to help Bellabeat learn why they did not keep tracking their activity. It could be due to a technical issue, such as the device needing frequent charging, or because the users find the features too complex to use.

Capstone: FitBit Fitness Tracker Data by Bellabeat

RURU

2023-01-07