Bellabeat is a small, successful high-tech company founded by Urška Sršen and Sando Mur. It manufactures a health-focused smart product called the Bellabeat App, which is designed specifically for women. The product provides health-related data, and analyzing how consumers (women) use non-Bellabeat smart devices can provide further insights. An analysis of the available consumer data should therefore reveal more opportunities for the company’s growth.
The task is to look for trends in how people use smart devices and to determine how these insights can be applied to Bellabeat’s customers.
• Urška Sršen: Bellabeat’s cofounder and Chief Creative Officer
• Sando Mur: Mathematician and Bellabeat’s cofounder; key member of the Bellabeat executive team
• Bellabeat marketing analytics team
The Bellabeat case-study dataset is hosted on Kaggle and was made available through Mobius. The raw dataset can be accessed through this link.
For this case study, I opened each file in the dataset with Excel and used its functions to remove duplicates and to strip leading, trailing, and repeated spaces from the data. I chose to work on the core features of the Bellabeat App, such as users’ daily activities in relation to the calories each user burns, and their body mass index (BMI). No duplicates were found in the dataset, as indicated in the image below.
Note: I set up my R environment by loading the ‘tidyverse’ and other useful packages for data analysis and visualization.
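The Excel cleaning steps described above (trimming leading, trailing, and repeated spaces, then removing duplicate rows) could also be reproduced in R. The following is a minimal sketch on a made-up data frame; the `raw` table, its values, and the `clean_text` helper are hypothetical and only illustrate the technique.

```r
# Hypothetical toy data with stray spaces and one duplicated row
raw <- data.frame(
  Id = c("1001 ", " 1001", "1002", "1002"),
  ActivityDate = c("4/12/2016", "4/12/2016", "4/12/2016", "4/13/2016"),
  Calories = c(1985, 1985, 1797, 1776),
  stringsAsFactors = FALSE
)

# Trim leading/trailing spaces and collapse repeated spaces in text columns
clean_text <- function(x) gsub("\\s+", " ", trimws(x))
is_text <- vapply(raw, is.character, logical(1))
raw[is_text] <- lapply(raw[is_text], clean_text)

# Count exact duplicate rows, then drop them
n_duplicates <- sum(duplicated(raw))
cleaned <- raw[!duplicated(raw), ]
```

Trimming before de-duplicating matters here: rows that differ only by stray spaces are not detected as duplicates until the text has been normalized.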
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.5
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.1 ✔ stringr 1.4.1
## ✔ readr 2.1.3 ✔ forcats 0.5.2
## Warning: package 'purrr' was built under R version 4.2.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(dplyr)
library(tidyr)
library(here)
## here() starts at C:/Users/MOSES OLUFEMI/Desktop/DOWNLOADED DATA/FitBase_Analysis
library(ggplot2)
library(colorspace)
library(readr)
dailyActivity_merged <- read_csv("dailyActivity_merged.csv")
## Rows: 940 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): ActivityDate
## dbl (14): Id, TotalSteps, TotalDistance, TrackerDistance, LoggedActivitiesDi...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
sleepDay_merged <- read_csv("sleepDay_merged.csv")
## Rows: 413 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): SleepDay
## dbl (4): Id, TotalSleepRecords, TotalMinutesAsleep, TotalTimeInBed
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
weightLogInfo_merged <- read_csv("weightLogInfo_merged.csv")
## Rows: 67 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Date
## dbl (6): Id, WeightKg, WeightPounds, Fat, BMI, LogId
## lgl (1): IsManualReport
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
ggsave("dailyActivity_merged2.png")
## Saving 7 x 5 in image
ggsave("Immobility_Pattern.png")
## Saving 7 x 5 in image
ggsave("sleepDay_merged.png")
## Saving 7 x 5 in image
ggsave("DailyActivity.png")
## Saving 7 x 5 in image
At this stage of the analysis, my working assumption is that most users live a sedentary lifestyle. To confirm this, we need to run some analyses and preview users’ daily activity patterns in the respective dataset files.
str(dailyActivity_merged)
## spec_tbl_df [940 × 15] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Id : num [1:940] 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
## $ ActivityDate : chr [1:940] "4/12/2016" "4/13/2016" "4/14/2016" "4/15/2016" ...
## $ TotalSteps : num [1:940] 13162 10735 10460 9762 12669 ...
## $ TotalDistance : num [1:940] 8.5 6.97 6.74 6.28 8.16 ...
## $ TrackerDistance : num [1:940] 8.5 6.97 6.74 6.28 8.16 ...
## $ LoggedActivitiesDistance: num [1:940] 0 0 0 0 0 0 0 0 0 0 ...
## $ VeryActiveDistance : num [1:940] 1.88 1.57 2.44 2.14 2.71 ...
## $ ModeratelyActiveDistance: num [1:940] 0.55 0.69 0.4 1.26 0.41 ...
## $ LightActiveDistance : num [1:940] 6.06 4.71 3.91 2.83 5.04 ...
## $ SedentaryActiveDistance : num [1:940] 0 0 0 0 0 0 0 0 0 0 ...
## $ VeryActiveMinutes : num [1:940] 25 21 30 29 36 38 42 50 28 19 ...
## $ FairlyActiveMinutes : num [1:940] 13 19 11 34 10 20 16 31 12 8 ...
## $ LightlyActiveMinutes : num [1:940] 328 217 181 209 221 164 233 264 205 211 ...
## $ SedentaryMinutes : num [1:940] 728 776 1218 726 773 ...
## $ Calories : num [1:940] 1985 1797 1776 1745 1863 ...
## - attr(*, "spec")=
## .. cols(
## .. Id = col_double(),
## .. ActivityDate = col_character(),
## .. TotalSteps = col_double(),
## .. TotalDistance = col_double(),
## .. TrackerDistance = col_double(),
## .. LoggedActivitiesDistance = col_double(),
## .. VeryActiveDistance = col_double(),
## .. ModeratelyActiveDistance = col_double(),
## .. LightActiveDistance = col_double(),
## .. SedentaryActiveDistance = col_double(),
## .. VeryActiveMinutes = col_double(),
## .. FairlyActiveMinutes = col_double(),
## .. LightlyActiveMinutes = col_double(),
## .. SedentaryMinutes = col_double(),
## .. Calories = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
colnames(dailyActivity_merged)
## [1] "Id" "ActivityDate"
## [3] "TotalSteps" "TotalDistance"
## [5] "TrackerDistance" "LoggedActivitiesDistance"
## [7] "VeryActiveDistance" "ModeratelyActiveDistance"
## [9] "LightActiveDistance" "SedentaryActiveDistance"
## [11] "VeryActiveMinutes" "FairlyActiveMinutes"
## [13] "LightlyActiveMinutes" "SedentaryMinutes"
## [15] "Calories"
The ‘dailyActivity_merged’ file contains 940 rows and 15 columns, with column names Id, ActivityDate, TotalSteps, TotalDistance, etc.
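The `str()` output above shows that ActivityDate is stored as text (`chr`) in month/day/year order. If the dates are needed for sorting or for weekday comparisons, they could be converted to the `Date` type. This is a minimal sketch using a few sample strings in the same layout; the variable names are illustrative and not part of the original analysis.

```r
# Sample ActivityDate strings in the month/day/year layout shown above
activity_date_chr <- c("4/12/2016", "4/13/2016", "5/1/2016")

# Convert to Date so the values sort and subtract chronologically
activity_date <- as.Date(activity_date_chr, format = "%m/%d/%Y")

# Weekday as a number (0 = Sunday), which does not depend on the locale
weekday_number <- as.integer(format(activity_date, "%w"))

# Number of days spanned by the sample
days_spanned <- as.numeric(max(activity_date) - min(activity_date))
```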
str(sleepDay_merged)
## spec_tbl_df [413 × 5] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Id : num [1:413] 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
## $ SleepDay : chr [1:413] "4/12/2016 12:00:00 AM" "4/13/2016 12:00:00 AM" "4/15/2016 12:00:00 AM" "4/16/2016 12:00:00 AM" ...
## $ TotalSleepRecords : num [1:413] 1 2 1 2 1 1 1 1 1 1 ...
## $ TotalMinutesAsleep: num [1:413] 327 384 412 340 700 304 360 325 361 430 ...
## $ TotalTimeInBed : num [1:413] 346 407 442 367 712 320 377 364 384 449 ...
## - attr(*, "spec")=
## .. cols(
## .. Id = col_double(),
## .. SleepDay = col_character(),
## .. TotalSleepRecords = col_double(),
## .. TotalMinutesAsleep = col_double(),
## .. TotalTimeInBed = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
colnames(sleepDay_merged)
## [1] "Id" "SleepDay" "TotalSleepRecords"
## [4] "TotalMinutesAsleep" "TotalTimeInBed"
The sleepDay_merged file contains 413 rows and 5 columns, with column names Id, SleepDay, etc.
str(weightLogInfo_merged)
## spec_tbl_df [67 × 8] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Id : num [1:67] 1.50e+09 1.50e+09 1.93e+09 2.87e+09 2.87e+09 ...
## $ Date : chr [1:67] "5/2/2016 11:59:59 PM" "5/3/2016 11:59:59 PM" "4/13/2016 1:08:52 AM" "4/21/2016 11:59:59 PM" ...
## $ WeightKg : num [1:67] 52.6 52.6 133.5 56.7 57.3 ...
## $ WeightPounds : num [1:67] 116 116 294 125 126 ...
## $ Fat : num [1:67] 22 NA NA NA NA 25 NA NA NA NA ...
## $ BMI : num [1:67] 22.6 22.6 47.5 21.5 21.7 ...
## $ IsManualReport: logi [1:67] TRUE TRUE FALSE TRUE TRUE TRUE ...
## $ LogId : num [1:67] 1.46e+12 1.46e+12 1.46e+12 1.46e+12 1.46e+12 ...
## - attr(*, "spec")=
## .. cols(
## .. Id = col_double(),
## .. Date = col_character(),
## .. WeightKg = col_double(),
## .. WeightPounds = col_double(),
## .. Fat = col_double(),
## .. BMI = col_double(),
## .. IsManualReport = col_logical(),
## .. LogId = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
colnames(weightLogInfo_merged)
## [1] "Id" "Date" "WeightKg" "WeightPounds"
## [5] "Fat" "BMI" "IsManualReport" "LogId"
The weightLogInfo_merged file contains 67 rows and 8 columns, with column names Id, Date, WeightKg, etc.
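The `str()` output above shows `NA` values in the Fat column, so it may be worth counting missing values per column before dropping any rows. This is a minimal sketch that reuses the first four rows visible in that output; `weight_sample` is an illustrative name, not an object from the original analysis.

```r
# First four rows of WeightKg, Fat, and BMI as shown in the str() output
weight_sample <- data.frame(
  WeightKg = c(52.6, 52.6, 133.5, 56.7),
  Fat = c(22, NA, NA, NA),
  BMI = c(22.6, 22.6, 47.5, 21.5)
)

# Missing values per column
na_per_column <- colSums(is.na(weight_sample))

# Rows that would survive if every row containing any NA were dropped
complete_rows <- sum(complete.cases(weight_sample))
```

In this sample, dropping incomplete rows across all columns would discard most of the weight log, which is a reason to select the needed columns first.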
For this part of the analysis, I sort, arrange, and summarize each dataset, keeping only the columns I need from each file I have chosen to work on.
# Keep only the distance columns and calories
dailyActivity_merged2 <- dailyActivity_merged %>%
  select(TotalDistance, VeryActiveDistance, ModeratelyActiveDistance, LightActiveDistance, Calories)
# Drop rows with missing values and list the rows ordered by the grouping columns
dailyActivity_merged2 %>%
  group_by(TotalDistance, VeryActiveDistance, ModeratelyActiveDistance, LightActiveDistance, Calories) %>%
  drop_na() %>%
  summarise(TotalDistance, VeryActiveDistance, ModeratelyActiveDistance, LightActiveDistance, Calories)
## `summarise()` has grouped output by 'TotalDistance', 'VeryActiveDistance',
## 'ModeratelyActiveDistance', 'LightActiveDistance', 'Calories'. You can override
## using the `.groups` argument.
## # A tibble: 940 × 5
## # Groups: TotalDistance, VeryActiveDistance, ModeratelyActiveDistance,
## # LightActiveDistance, Calories [884]
## TotalDistance VeryActiveDistance ModeratelyActiveDistance LightActi…¹ Calor…²
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0 0 0 0 0
## 2 0 0 0 0 0
## 3 0 0 0 0 0
## 4 0 0 0 0 0
## 5 0 0 0 0 57
## 6 0 0 0 0 120
## 7 0 0 0 0 665
## 8 0 0 0 0 1347
## 9 0 0 0 0 1347
## 10 0 0 0 0 1347
## # … with 930 more rows, and abbreviated variable names ¹LightActiveDistance,
## # ²Calories
The pie chart below, generated in Excel, shows the percentage of tracked time that Fitbit users spend at each activity level. Based on the summary:
• About 81% of users’ tracked time is sedentary, which amounts to more than 12 hours a day
• About 16% of users’ tracked time is lightly active
• About 1% of users’ tracked time is fairly active
• About 2% of users’ tracked time is very active
• Within their active time, users are mostly lightly active
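The percentages above came from an Excel pie chart, but the same kind of breakdown could be computed in R from the four minutes columns. This is a minimal sketch using the first three rows visible in the `str()` output of dailyActivity_merged; `activity_minutes`, `total_minutes`, and `share_percent` are illustrative names, and three rows will not reproduce the full-dataset percentages.

```r
# First three rows of the minutes columns from dailyActivity_merged
activity_minutes <- data.frame(
  VeryActiveMinutes = c(25, 21, 30),
  FairlyActiveMinutes = c(13, 19, 11),
  LightlyActiveMinutes = c(328, 217, 181),
  SedentaryMinutes = c(728, 776, 1218)
)

# Total minutes logged at each activity level
total_minutes <- colSums(activity_minutes)

# Share of all tracked minutes spent at each level, in percent
share_percent <- round(100 * total_minutes / sum(total_minutes), 1)
```

Applied to the full table, the same two lines would give the time shares that the pie chart visualizes.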
In this part of the analysis phase, I want to find out whether there is a connection between users’ daily activity patterns and the number of calories they burn each day.
# Keep total distance and calories only
DailyActivity_Pattern <- dailyActivity_merged %>%
  select(TotalDistance, Calories)
DailyActivity_Pattern %>%
  group_by(TotalDistance, Calories) %>%
  drop_na() %>%
  summarise(TotalDistance, Calories)
## `summarise()` has grouped output by 'TotalDistance', 'Calories'. You can
## override using the `.groups` argument.
## # A tibble: 940 × 2
## # Groups: TotalDistance, Calories [884]
## TotalDistance Calories
## <dbl> <dbl>
## 1 0 0
## 2 0 0
## 3 0 0
## 4 0 0
## 5 0 57
## 6 0 120
## 7 0 665
## 8 0 1347
## 9 0 1347
## 10 0 1347
## # … with 930 more rows
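The listing above shows the paired values but does not quantify the relationship. One way to measure it would be a Pearson correlation and a simple linear fit. This is a minimal sketch on the first five rows visible in the `str()` output; `distance_calories`, `r`, `fit`, and `slope` are illustrative names, and five rows are far too few to draw conclusions from.

```r
# First five rows of TotalDistance and Calories from dailyActivity_merged
distance_calories <- data.frame(
  TotalDistance = c(8.5, 6.97, 6.74, 6.28, 8.16),
  Calories = c(1985, 1797, 1776, 1745, 1863)
)

# Pearson correlation between distance and calories burned
r <- cor(distance_calories$TotalDistance, distance_calories$Calories)

# Simple linear fit: the slope estimates extra calories per unit of distance
fit <- lm(Calories ~ TotalDistance, data = distance_calories)
slope <- unname(coef(fit)["TotalDistance"])
```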
For this part, I want to find out whether there is a correlation between users’ sedentary minutes and the corresponding sedentary active distance.
# Keep the two sedentary columns only
Immobility_Pattern <- dailyActivity_merged %>%
  select(SedentaryMinutes, SedentaryActiveDistance)
Immobility_Pattern %>%
  group_by(SedentaryMinutes, SedentaryActiveDistance) %>%
  drop_na() %>%
  summarise(SedentaryMinutes, SedentaryActiveDistance)
## `summarise()` has grouped output by 'SedentaryMinutes',
## 'SedentaryActiveDistance'. You can override using the `.groups` argument.
## # A tibble: 940 × 2
## # Groups: SedentaryMinutes, SedentaryActiveDistance [597]
## SedentaryMinutes SedentaryActiveDistance
## <dbl> <dbl>
## 1 0 0
## 2 2 0
## 3 13 0
## 4 48 0
## 5 111 0
## 6 125 0
## 7 127 0
## 8 218 0
## 9 222 0
## 10 241 0
## # … with 930 more rows
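Every SedentaryActiveDistance value shown above is 0. A column with no variation has no defined Pearson correlation, so it may help to guard against that case before correlating. This is a minimal sketch on the first five rows visible in the `str()` output; `safe_cor` is a hypothetical helper, not a function from any package.

```r
# Hypothetical helper: return NA instead of correlating a constant column
safe_cor <- function(x, y) {
  # A constant column has zero variance, so Pearson correlation is undefined
  if (sd(x) == 0 || sd(y) == 0) {
    return(NA_real_)
  }
  cor(x, y)
}

# First five rows of the two sedentary columns from dailyActivity_merged
sedentary_sample <- data.frame(
  SedentaryMinutes = c(728, 776, 1218, 726, 773),
  SedentaryActiveDistance = c(0, 0, 0, 0, 0)
)

# Share of rows with any sedentary distance recorded
share_nonzero <- mean(sedentary_sample$SedentaryActiveDistance > 0)

sedentary_cor <- safe_cor(
  sedentary_sample$SedentaryMinutes,
  sedentary_sample$SedentaryActiveDistance
)
```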
In the sleep analysis, I use three columns from SleepDay_Pattern. I want to use the sleep records to compare users’ total time in bed with the total minutes they are asleep. The total time users stay in bed does not equal the total minutes they sleep, since lying in bed with closed eyes does not necessarily mean sleeping. Hence, the difference between the two durations (TotalTimeInBed and TotalMinutesAsleep) can loosely be termed “insomnia”.
SleepDay_Pattern <- sleepDay_merged
print(SleepDay_Pattern)
## # A tibble: 413 × 5
## Id SleepDay TotalSleepRecords TotalMinutesAsleep Total…¹
## <dbl> <chr> <dbl> <dbl> <dbl>
## 1 1503960366 4/12/2016 12:00:00 AM 1 327 346
## 2 1503960366 4/13/2016 12:00:00 AM 2 384 407
## 3 1503960366 4/15/2016 12:00:00 AM 1 412 442
## 4 1503960366 4/16/2016 12:00:00 AM 2 340 367
## 5 1503960366 4/17/2016 12:00:00 AM 1 700 712
## 6 1503960366 4/19/2016 12:00:00 AM 1 304 320
## 7 1503960366 4/20/2016 12:00:00 AM 1 360 377
## 8 1503960366 4/21/2016 12:00:00 AM 1 325 364
## 9 1503960366 4/23/2016 12:00:00 AM 1 361 384
## 10 1503960366 4/24/2016 12:00:00 AM 1 430 449
## # … with 403 more rows, and abbreviated variable name ¹TotalTimeInBed
I chose to separate the date and time in the SleepDay column because I only need the date, and to keep the format consistent.
separate(SleepDay_Pattern, SleepDay, into = c('Date', 'Time'), sep = ' ')
## Warning: Expected 2 pieces. Additional pieces discarded in 413 rows [1, 2, 3, 4,
## 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].
## # A tibble: 413 × 6
## Id Date Time TotalSleepRecords TotalMinutesAsleep TotalTim…¹
## <dbl> <chr> <chr> <dbl> <dbl> <dbl>
## 1 1503960366 4/12/2016 12:00:00 1 327 346
## 2 1503960366 4/13/2016 12:00:00 2 384 407
## 3 1503960366 4/15/2016 12:00:00 1 412 442
## 4 1503960366 4/16/2016 12:00:00 2 340 367
## 5 1503960366 4/17/2016 12:00:00 1 700 712
## 6 1503960366 4/19/2016 12:00:00 1 304 320
## 7 1503960366 4/20/2016 12:00:00 1 360 377
## 8 1503960366 4/21/2016 12:00:00 1 325 364
## 9 1503960366 4/23/2016 12:00:00 1 361 384
## 10 1503960366 4/24/2016 12:00:00 1 430 449
## # … with 403 more rows, and abbreviated variable name ¹TotalTimeInBed
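The warning above appears because each SleepDay value splits into three pieces on spaces (date, time, and AM/PM) while only two column names are given. The result is also printed rather than stored, so SleepDay_Pattern itself is unchanged. This is a minimal sketch of one way to handle both points, using `tidyr::separate()` with `extra = "merge"` on two sample rows; `sleep_sample` and `sleep_split` are illustrative names.

```r
library(tidyr)

# Two sample rows in the same layout as sleepDay_merged
sleep_sample <- data.frame(
  Id = c(1503960366, 1503960366),
  SleepDay = c("4/12/2016 12:00:00 AM", "4/13/2016 12:00:00 AM"),
  TotalMinutesAsleep = c(327, 384),
  stringsAsFactors = FALSE
)

# extra = "merge" splits only once, so "12:00:00 AM" stays together in Time
sleep_split <- separate(
  sleep_sample, SleepDay,
  into = c("Date", "Time"), sep = " ", extra = "merge"
)

# Store the result, then turn the date text into a real Date column
sleep_split$Date <- as.Date(sleep_split$Date, format = "%m/%d/%Y")
```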
SleepDay_Pattern %>%
  group_by(TotalMinutesAsleep, TotalTimeInBed) %>%
  drop_na() %>%
  summarise(TotalMinutesAsleep, TotalTimeInBed)
## `summarise()` has grouped output by 'TotalMinutesAsleep', 'TotalTimeInBed'. You
## can override using the `.groups` argument.
## # A tibble: 413 × 2
## # Groups: TotalMinutesAsleep, TotalTimeInBed [407]
## TotalMinutesAsleep TotalTimeInBed
## <dbl> <dbl>
## 1 58 61
## 2 59 65
## 3 61 69
## 4 62 65
## 5 74 75
## 6 74 78
## 7 77 77
## 8 79 82
## 9 82 85
## 10 98 107
## # … with 403 more rows
The column chart below, generated in Excel, shows sleep duration: the relationship between the time Fitbit users spend in bed and the time they spend asleep. Based on the summary:
• Beyond the assumption that time in bed and time asleep differ, the chart below confirms that Fitbit users spend more time in bed than they spend asleep. It can therefore be inferred that users do not fall asleep at the exact moment they get into bed.
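The gap between time in bed and time asleep can be computed directly instead of being read off a chart. This is a minimal sketch on the first five rows visible in the sleepDay_merged output; the column names `MinutesAwakeInBed` and `SleepEfficiency` are hypothetical additions. Note that the gap covers all time in bed that is not spent asleep, not only the time taken to fall asleep.

```r
library(dplyr)

# First five rows of the two sleep columns from sleepDay_merged
sleep_sample <- tibble(
  TotalMinutesAsleep = c(327, 384, 412, 340, 700),
  TotalTimeInBed = c(346, 407, 442, 367, 712)
)

# Minutes in bed but not asleep, and the fraction of bed time spent asleep
sleep_gap <- sleep_sample %>%
  mutate(
    MinutesAwakeInBed = TotalTimeInBed - TotalMinutesAsleep,
    SleepEfficiency = TotalMinutesAsleep / TotalTimeInBed
  )

# One-row summary of the gap across all records
gap_summary <- sleep_gap %>%
  summarise(
    mean_gap = mean(MinutesAwakeInBed),
    median_gap = median(MinutesAwakeInBed)
  )
```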
In the weight analysis, I chose three columns from weightLogInfo_merged and stored them as DailyActivity. I want to use the manual-report flag as a check and to examine the relationship between users’ weight (kg) and their body mass index (BMI). We can also look at whether users’ daily activity patterns are connected to their body shape. Before I proceed, I’d like to distinguish briefly between mass and weight:
• Body mass is the quantity of matter a body or an object contains.
• The weight of a body or an object is the product of its mass and the acceleration due to gravity.
Hence, weight is a force that depends on mass, and the two are not the same quantity. BMI is neither of them: it is body mass in kilograms divided by the square of height in metres.
DailyActivity <- weightLogInfo_merged %>%
  select(WeightKg, BMI, IsManualReport)
print(DailyActivity)
## # A tibble: 67 × 3
## WeightKg BMI IsManualReport
## <dbl> <dbl> <lgl>
## 1 52.6 22.6 TRUE
## 2 52.6 22.6 TRUE
## 3 134. 47.5 FALSE
## 4 56.7 21.5 TRUE
## 5 57.3 21.7 TRUE
## 6 72.4 27.5 TRUE
## 7 72.3 27.4 TRUE
## 8 69.7 27.2 TRUE
## 9 70.3 27.5 TRUE
## 10 69.9 27.3 TRUE
## # … with 57 more rows
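BMI is defined as mass in kilograms divided by the square of height in metres, so WeightKg and BMI together imply a height for each log entry. This is a minimal sketch of that arithmetic on three rows from the weight log output (rounded as displayed); `implied_height_m` and `bmi_check` are illustrative names.

```r
# Three WeightKg/BMI pairs from the weight log, rounded as displayed
weight_kg <- c(52.6, 133.5, 56.7)
bmi <- c(22.6, 47.5, 21.5)

# BMI = kg / m^2, so height in metres = sqrt(kg / BMI)
implied_height_m <- sqrt(weight_kg / bmi)

# Recomputing BMI from the implied height recovers the logged values
bmi_check <- weight_kg / implied_height_m^2
```

Because each user’s height is fixed, BMI is proportional to weight within a user, so a strong relationship between the two columns is expected by construction.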
Here, I sort the rows by WeightKg, BMI, and IsManualReport to put them in a meaningful order.
DailyActivity %>%
  arrange(WeightKg, BMI, IsManualReport)
## # A tibble: 67 × 3
## WeightKg BMI IsManualReport
## <dbl> <dbl> <lgl>
## 1 52.6 22.6 TRUE
## 2 52.6 22.6 TRUE
## 3 56.7 21.5 TRUE
## 4 57.3 21.7 TRUE
## 5 61 23.8 TRUE
## 6 61 23.8 TRUE
## 7 61.1 23.9 TRUE
## 8 61.2 23.9 TRUE
## 9 61.2 23.9 TRUE
## 10 61.2 23.9 TRUE
## # … with 57 more rows
DailyActivity %>%
  group_by(WeightKg, BMI, IsManualReport) %>%
  drop_na() %>%
  summarise(WeightKg, BMI, IsManualReport)
## `summarise()` has grouped output by 'WeightKg', 'BMI', 'IsManualReport'. You
## can override using the `.groups` argument.
## # A tibble: 67 × 3
## # Groups: WeightKg, BMI, IsManualReport [36]
## WeightKg BMI IsManualReport
## <dbl> <dbl> <lgl>
## 1 52.6 22.6 TRUE
## 2 52.6 22.6 TRUE
## 3 56.7 21.5 TRUE
## 4 57.3 21.7 TRUE
## 5 61 23.8 TRUE
## 6 61 23.8 TRUE
## 7 61.1 23.9 TRUE
## 8 61.2 23.9 TRUE
## 9 61.2 23.9 TRUE
## 10 61.2 23.9 TRUE
## # … with 57 more rows
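Grouping by every selected column, as above, returns the rows essentially unchanged. If the aim is to compare manual and automatic weight reports, grouping by IsManualReport alone and computing one summary row per group may be more informative. This is a minimal sketch on the first six rows of the weight log output; `weight_sample`, `by_report`, and the summary column names are hypothetical.

```r
library(dplyr)

# First six rows of the selected weight log columns
weight_sample <- tibble(
  WeightKg = c(52.6, 52.6, 133.5, 56.7, 57.3, 72.4),
  BMI = c(22.6, 22.6, 47.5, 21.5, 21.7, 27.5),
  IsManualReport = c(TRUE, TRUE, FALSE, TRUE, TRUE, TRUE)
)

# One row per report type: number of logs and average weight and BMI
by_report <- weight_sample %>%
  group_by(IsManualReport) %>%
  summarise(
    n_logs = n(),
    mean_weight_kg = mean(WeightKg),
    mean_bmi = mean(BMI),
    .groups = "drop"
  )
```

Setting `.groups = "drop"` also avoids the grouped-output message that `summarise()` prints above.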
Bellabeat aims to help its users (women) manage their health and fitness through a smart device (Fitbit). We have observed and inferred that most Fitbit users in this dataset live a largely sedentary lifestyle. Evidence shows that prolonged sedentary behavior is associated with many chronic diseases, whereas regular exercise helps people stay fit. To help its users build better and healthier lifestyles, Bellabeat should encourage them to reduce their sedentary time and increase their activity level. From what we found in this analysis, reducing sedentary time can help: