This is a case study about Bellabeat. Bellabeat is a high-tech manufacturer of health-focused product for women. The products are Bellabeat app, Leaf, Time and Spring. Bellabeat also offers a subscription-based membership program for the users. The marketing strategies of Bellabeat are investing year-round in Google Search, maintaining active Facebook and Instagram pages, and consistently engaging consumers on Twitter. Bellabeat also runs video ads on Youtube and ads on Google Display Network.
To identify trends in smart device usage and determine on ways to apply insights gained to improve Bellabeat products to better serve the customers.
The dataset that will be used in this case study is FitBit Fitness Tracker Data made by Möbius which is avaliable on Kaggle. This dataset is under CC0: Public Domain license.
This dataset was generated by respondents to a distributed survey via Amazon Mechanical Turk between 03/12/2016 to 05/12/2016. This dataset contains personal tracker data from 30 eligible FitBit users.
There are 18 csv files in the original dataset. Before moving further, it is better to check whether the dataset ROCCC or not.
Since there are 18 csv files in the dataset, not all csv files are chosen for this case study. Only the csv files listed below are selected:
dailyActivity_merged.csvdailyCalories_merged.csvdailyIntensities_merged.csvdailySteps_merged.csvsleepDay_merged.csvweightLogInfo_merged.csvhourlyCalories_merged.csvhourlySteps_merged.csvThe pre-processing of the csv files is performed using Google Sheets and Google BigQuery before importing into RStudio.
First, data cleaning procedures which includes trimming white spaces
(if any) and removing duplicates (if any) are performed for all csv
files. There are no white spaces in all the csv files. However, we have
found and removed 3 duplicated rows from
sleepDay_merged.csv.
Upon further investigations using Google Sheets, it was confirmed
that all columns in dailyCalories_merged.csv,
dailySteps_merged.csv and
dailyIntensities_merged.csv are included in
dailyActivity_merged.csv. Therefore, those 3 csv files can
be ignored. The cleaned dailyActivity_merged.csv is then
saved as daily_activity_cleaned.csv.
Next, hourlyCalories_merged.csv and
hourlySteps_merged.csv are joined using Google BigQuery.
The query used to perform this join is:
SELECT
calories.Id AS Id,
calories.Date AS Date,
calories.Time AS Time,
calories.Calories AS Calories,
steps.StepTotal AS Total_Steps
FROM
my-data-project-1-401106.Bellabeat.hr_calories AS calories
INNER JOIN
my-data-project-1-401106.Bellabeat.hr_steps AS steps
ON
calories.Id = steps.Id AND calories.Date = steps.Date AND calories.Time = steps.Time
ORDER BY
Id, Date, Time
The results of the query is exported and saved as a csv file named
hour_calories_steps.csv .
First, we load the R packages needed for our analysis.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.3 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Next, we will load the 4 csv files using the read_csv()
function.
activity <- read_csv("daily_activity_cleaned.csv")
sleep <- read_csv("sleepDay_merged.csv")
weight <- read_csv("weightLogInfo_merged.csv")
calo_step <- read_csv("hour_calories_steps.csv")
With the csv files loaded, we can start to explore the datasets. The first details that will be checked is the number of distinct Id in each of the datasets.
n_distinct(activity$Id)
## [1] 33
n_distinct(sleep$Id)
## [1] 24
n_distinct(weight$Id)
## [1] 8
n_distinct(calo_step$Id)
## [1] 33
The output shows the number of participants present in each of the dataset. Notice that all 33 participants involved in contributing daily activities, calories and steps data. Only 24 participants contribute sleep data. Since there are just 8 participants contribute to weight data, it is reasonable to not include it in the analysis as the sample size is insignificant.
First, we will focus on activity dataset.
head(as.data.frame(activity))
## Id ActivityDate TotalSteps TotalDistance TrackerDistance
## 1 1503960366 4/12/2016 13162 8.50 8.50
## 2 1503960366 4/13/2016 10735 6.97 6.97
## 3 1503960366 4/14/2016 10460 6.74 6.74
## 4 1503960366 4/15/2016 9762 6.28 6.28
## 5 1503960366 4/16/2016 12669 8.16 8.16
## 6 1503960366 4/17/2016 9705 6.48 6.48
## LoggedActivitiesDistance VeryActiveDistance ModeratelyActiveDistance
## 1 0 1.88 0.55
## 2 0 1.57 0.69
## 3 0 2.44 0.40
## 4 0 2.14 1.26
## 5 0 2.71 0.41
## 6 0 3.19 0.78
## LightActiveDistance SedentaryActiveDistance VeryActiveMinutes
## 1 6.06 0 25
## 2 4.71 0 21
## 3 3.91 0 30
## 4 2.83 0 29
## 5 5.04 0 36
## 6 2.51 0 38
## FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes Calories
## 1 13 328 728 1985
## 2 19 217 776 1797
## 3 11 181 1218 1776
## 4 34 209 726 1745
## 5 10 221 773 1863
## 6 20 164 539 1728
Note that the columns LoggedActivitiesDistance and
SedentaryActiveDistance will be removed since it does not
provide information that will impact the analysis.
activity <- activity[-c(6,10)]
Also, the column ActivityDate is renamed as Date. Then, view the data frame again to check the changes made.
activity <- activity %>% rename(Date=ActivityDate)
head(as.data.frame(activity))
## Id Date TotalSteps TotalDistance TrackerDistance
## 1 1503960366 4/12/2016 13162 8.50 8.50
## 2 1503960366 4/13/2016 10735 6.97 6.97
## 3 1503960366 4/14/2016 10460 6.74 6.74
## 4 1503960366 4/15/2016 9762 6.28 6.28
## 5 1503960366 4/16/2016 12669 8.16 8.16
## 6 1503960366 4/17/2016 9705 6.48 6.48
## VeryActiveDistance ModeratelyActiveDistance LightActiveDistance
## 1 1.88 0.55 6.06
## 2 1.57 0.69 4.71
## 3 2.44 0.40 3.91
## 4 2.14 1.26 2.83
## 5 2.71 0.41 5.04
## 6 3.19 0.78 2.51
## VeryActiveMinutes FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes
## 1 25 13 328 728
## 2 21 19 217 776
## 3 30 11 181 1218
## 4 29 34 209 726
## 5 36 10 221 773
## 6 38 20 164 539
## Calories
## 1 1985
## 2 1797
## 3 1776
## 4 1745
## 5 1863
## 6 1728
After that, we will explore some descriptive statistics for
activity dataset.
activity %>%
select(TotalSteps, TotalDistance, Calories) %>%
summary()
## TotalSteps TotalDistance Calories
## Min. : 0 Min. : 0.000 Min. : 0
## 1st Qu.: 3790 1st Qu.: 2.620 1st Qu.:1828
## Median : 7406 Median : 5.245 Median :2134
## Mean : 7638 Mean : 5.490 Mean :2304
## 3rd Qu.:10727 3rd Qu.: 7.713 3rd Qu.:2793
## Max. :36019 Max. :28.030 Max. :4900
activity %>%
select(SedentaryMinutes, LightlyActiveMinutes, FairlyActiveMinutes, VeryActiveMinutes) %>%
summary()
## SedentaryMinutes LightlyActiveMinutes FairlyActiveMinutes VeryActiveMinutes
## Min. : 0.0 Min. : 0.0 Min. : 0.00 Min. : 0.00
## 1st Qu.: 729.8 1st Qu.:127.0 1st Qu.: 0.00 1st Qu.: 0.00
## Median :1057.5 Median :199.0 Median : 6.00 Median : 4.00
## Mean : 991.2 Mean :192.8 Mean : 13.56 Mean : 21.16
## 3rd Qu.:1229.5 3rd Qu.:264.0 3rd Qu.: 19.00 3rd Qu.: 32.00
## Max. :1440.0 Max. :518.0 Max. :143.00 Max. :210.00
Some discoveries from the outputs above:
Next, we focus on calo_step dataset, which is an hourly
dataset.
head(as.data.frame(calo_step))
## Id Date Time Calories Total_Steps
## 1 1503960366 2016-04-12 00:00:00 81 373
## 2 1503960366 2016-04-12 01:00:00 61 160
## 3 1503960366 2016-04-12 02:00:00 59 151
## 4 1503960366 2016-04-12 03:00:00 47 0
## 5 1503960366 2016-04-12 04:00:00 48 0
## 6 1503960366 2016-04-12 05:00:00 48 0
Similarly, descriptive statistics for the dataset
calo_step is presented below.
calo_step %>%
select(Calories, Total_Steps) %>%
summary()
## Calories Total_Steps
## Min. : 42.00 Min. : 0.0
## 1st Qu.: 63.00 1st Qu.: 0.0
## Median : 83.00 Median : 40.0
## Mean : 97.39 Mean : 320.2
## 3rd Qu.:108.00 3rd Qu.: 357.0
## Max. :948.00 Max. :10554.0
Some discoveries from the output above:
Moving on, we focus on sleep dataset.
head(as.data.frame(sleep))
## Id SleepDay TotalSleepRecords TotalMinutesAsleep TotalTimeInBed
## 1 1503960366 4/12/2016 1 327 346
## 2 1503960366 4/13/2016 2 384 407
## 3 1503960366 4/15/2016 1 412 442
## 4 1503960366 4/16/2016 2 340 367
## 5 1503960366 4/17/2016 1 700 712
## 6 1503960366 4/19/2016 1 304 320
Since, we have two variables, which are
TotalMinutesAsleep and TotalTimeInBed, we can
calculate the percentage of time where the participants are actually
asleep while laying in bed.
sleep <- sleep %>% mutate(PercentSleep = (TotalMinutesAsleep/TotalTimeInBed)*100)
head(sleep)
## # A tibble: 6 × 6
## Id SleepDay TotalSleepRecords TotalMinutesAsleep TotalTimeInBed
## <dbl> <chr> <dbl> <dbl> <dbl>
## 1 1503960366 4/12/2016 1 327 346
## 2 1503960366 4/13/2016 2 384 407
## 3 1503960366 4/15/2016 1 412 442
## 4 1503960366 4/16/2016 2 340 367
## 5 1503960366 4/17/2016 1 700 712
## 6 1503960366 4/19/2016 1 304 320
## # ℹ 1 more variable: PercentSleep <dbl>
Similarly, we will explore some descriptive statistics for the dataset.
sleep %>%
select(TotalSleepRecords, TotalMinutesAsleep, TotalTimeInBed, PercentSleep) %>%
summary()
## TotalSleepRecords TotalMinutesAsleep TotalTimeInBed PercentSleep
## Min. :1.00 Min. : 58.0 Min. : 61.0 Min. : 49.84
## 1st Qu.:1.00 1st Qu.:361.0 1st Qu.:403.8 1st Qu.: 91.18
## Median :1.00 Median :432.5 Median :463.0 Median : 94.26
## Mean :1.12 Mean :419.2 Mean :458.5 Mean : 91.65
## 3rd Qu.:1.00 3rd Qu.:490.0 3rd Qu.:526.0 3rd Qu.: 96.06
## Max. :3.00 Max. :796.0 Max. :961.0 Max. :100.00
Some discoveries from the output above:
With the insights gained, we hope to solve the stated business task which is to determine on ways to apply insights gained to improve Bellabeat products to better serve the customers. The aim of Bellabeat is to empower women with knowledge about their own health and habits. Therefore, the target audience of the company is women. Notice that most activities happen around 5pm to 7pm, this might indicates that the most of the participants are in the workforce and working until 5pm. Thus, the main target audience of the company can be women working a full-time job in office. The advertisement about helping women in the workforce to achieve a healthier lifestyle through using Bellabeat’s products can be created and advertised on different social media platforms.
Thank you for going through my first case study on Bellabeat. I will appreciate any comments and recommendations for improvement.