Bellabeat is a high-tech company that manufactures health-focused smart products for women. The products were designed by Urška Sršen, Bellabeat’s cofounder and Chief Creative Officer, who drew on her background as an artist. They collect data on activity, sleep, stress, and reproductive health to empower women with knowledge about their own health and habits.
Urška Sršen, one of the stakeholders, asked me, as a junior data analyst at the company, to analyze smart device usage data in order to gain insights into how consumers use non-Bellabeat smart devices. She also wants me to select one Bellabeat product to apply these insights to in my presentation.
I used a public dataset, suggested by Urška Sršen, that explores smart device users’ daily habits. The dataset was uploaded by Mobius; it is not copyrighted, and its license allows anyone to use it for free.
This Kaggle dataset contains personal fitness tracker data from thirty Fitbit users. Thirty eligible Fitbit users consented to the submission of personal tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring. It includes information about daily activity, steps, and heart rate that can be used to explore users’ habits.
About the data: it contains a total of 18 wide datasets, generated by respondents to a survey distributed via Amazon Mechanical Turk between 03.12.2016 and 05.12.2016, with various records on participants’ activity and fitness data. I downloaded and saved them in a local folder on my laptop, used Import Dataset - From Text (readr) to import them into RStudio Desktop, and assigned appropriate new names to them.
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.2.2
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0 ✔ purrr 0.3.5
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.1 ✔ stringr 1.5.0
## ✔ readr 2.1.3 ✔ forcats 0.5.2
## Warning: package 'ggplot2' was built under R version 4.2.2
## Warning: package 'tidyr' was built under R version 4.2.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(dplyr)
library(janitor)
## Warning: package 'janitor' was built under R version 4.2.2
##
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(Tmisc)
## Warning: package 'Tmisc' was built under R version 4.2.2
library(readr)
Loaded the datasets into the working environment.
daily_activity <- read.csv("C:/Users/Dayo Alli/Downloads/Case Study2/dailyActivity_merged.csv")
hourly_calories <- read.csv("C:/Users/Dayo Alli/Downloads/Case Study2/hourlyCalories_merged.csv")
daily_sleep <- read.csv("C:/Users/Dayo Alli/Downloads/Case Study2/sleepDay_merged.csv")
hourly_steps <- read.csv("C:/Users/Dayo Alli/Downloads/Case Study2/hourlySteps_merged.csv")
Cleaned the datasets with the clean_names() function to ensure the column names are unique and consistent, containing only characters, numbers, and underscores, e.g. clean_names(daily_activity), clean_names(hourly_calories), etc. The output below still shows the original column names, which indicates that the cleaned result was not assigned back to the data frames.
Get a glimpse of the kind of data contained in each of the loaded datasets.
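As a minimal sketch of the cleaning step, the snippet below applies clean_names() to a small toy data frame and assigns the result back, since the function returns a new data frame rather than modifying its input. The toy_activity data frame and its values are assumptions made up for illustration; only the column names mirror the real data.

```r
library(janitor)

# Toy data frame standing in for daily_activity (values are illustrative only)
toy_activity <- data.frame(
  Id = c(1, 2),
  ActivityDate = c("4/12/2016", "4/13/2016"),
  TotalSteps = c(13162, 10735)
)

# clean_names() returns a new data frame, so the result must be assigned back
toy_activity <- clean_names(toy_activity)

# Column names are now lower-case snake_case
colnames(toy_activity)
```

If the cleaned names were kept, every later step that refers to columns such as Id or TotalSteps would have to use the new names (id, total_steps) instead.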
glimpse(daily_activity)
## Rows: 940
## Columns: 15
## $ Id <dbl> 1503960366, 1503960366, 1503960366, 150396036…
## $ ActivityDate <chr> "04/12/2016", "4/13/2016", "4/14/2016", "4/15…
## $ TotalSteps <int> 13162, 10735, 10460, 9762, 12669, 9705, 13019…
## $ TotalDistance <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59, 9.8…
## $ TrackerDistance <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59, 9.8…
## $ LoggedActivitiesDistance <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ VeryActiveDistance <dbl> 1.88, 1.57, 2.44, 2.14, 2.71, 3.19, 3.25, 3.5…
## $ ModeratelyActiveDistance <dbl> 0.55, 0.69, 0.40, 1.26, 0.41, 0.78, 0.64, 1.3…
## $ LightActiveDistance <dbl> 6.06, 4.71, 3.91, 2.83, 5.04, 2.51, 4.71, 5.0…
## $ SedentaryActiveDistance <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ VeryActiveMinutes <int> 25, 21, 30, 29, 36, 38, 42, 50, 28, 19, 66, 4…
## $ FairlyActiveMinutes <int> 13, 19, 11, 34, 10, 20, 16, 31, 12, 8, 27, 21…
## $ LightlyActiveMinutes <int> 328, 217, 181, 209, 221, 164, 233, 264, 205, …
## $ SedentaryMinutes <int> 728, 776, 1218, 726, 773, 539, 1149, 775, 818…
## $ Calories <int> 1985, 1797, 1776, 1745, 1863, 1728, 1921, 203…
The daily activity dataframe contains 15 columns and 940 observations.
glimpse(hourly_calories)
## Rows: 22,099
## Columns: 3
## $ Id <dbl> 1503960366, 1503960366, 1503960366, 1503960366, 150396036…
## $ ActivityHour <chr> "4/12/2016 12:00:00 AM", "4/12/2016 1:00:00 AM", "4/12/20…
## $ Calories <int> 81, 61, 59, 47, 48, 48, 48, 47, 68, 141, 99, 76, 73, 66, …
The hourly calories dataframe contains 3 columns and 22,099 observations.
glimpse(daily_sleep)
## Rows: 413
## Columns: 5
## $ Id <dbl> 1503960366, 1503960366, 1503960366, 1503960366, 150…
## $ SleepDay <chr> "4/12/2016 12:00:00 AM", "4/13/2016 12:00:00 AM", "…
## $ TotalSleepRecords <int> 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ TotalMinutesAsleep <int> 327, 384, 412, 340, 700, 304, 360, 325, 361, 430, 2…
## $ TotalTimeInBed <int> 346, 407, 442, 367, 712, 320, 377, 364, 384, 449, 3…
The daily sleep dataframe contains 5 columns and 413 observations.
glimpse(hourly_steps)
## Rows: 22,099
## Columns: 3
## $ Id <dbl> 1503960366, 1503960366, 1503960366, 1503960366, 150396036…
## $ ActivityHour <chr> "4/12/2016 12:00:00 AM", "4/12/2016 1:00:00 AM", "4/12/20…
## $ StepTotal <int> 373, 160, 151, 0, 0, 0, 0, 0, 250, 1864, 676, 360, 253, 2…
The hourly steps dataframe contains 3 columns and 22,099 observations.
Inspecting column names of each loaded dataframe.
colnames(daily_activity)
## [1] "Id" "ActivityDate"
## [3] "TotalSteps" "TotalDistance"
## [5] "TrackerDistance" "LoggedActivitiesDistance"
## [7] "VeryActiveDistance" "ModeratelyActiveDistance"
## [9] "LightActiveDistance" "SedentaryActiveDistance"
## [11] "VeryActiveMinutes" "FairlyActiveMinutes"
## [13] "LightlyActiveMinutes" "SedentaryMinutes"
## [15] "Calories"
colnames(hourly_calories)
## [1] "Id" "ActivityHour" "Calories"
colnames(daily_sleep)
## [1] "Id" "SleepDay" "TotalSleepRecords"
## [4] "TotalMinutesAsleep" "TotalTimeInBed"
colnames(hourly_steps)
## [1] "Id" "ActivityHour" "StepTotal"
After inspecting the column names, I realized that the datasets have one column name in common, the “Id” column. This means that the dataframes can be joined using the “Id” column to look for possible trends.
Using the n_distinct() function to detect how many unique participants recorded their activities.
n_distinct(daily_activity$Id)
## [1] 33
Running the above code revealed a discrepancy: there are 33 users in the daily activity dataset, not the thirty claimed by the data uploader, who stated that “the data set contains personal fitness tracker from thirty fitbit users”.
n_distinct(hourly_calories$Id)
## [1] 33
Running the above code revealed the same discrepancy: there are 33 users in the hourly calories dataset, not the thirty claimed by the data uploader.
n_distinct(daily_sleep$Id)
## [1] 24
Running the above code revealed that just 24 of the 33 unique users recorded their sleep information.
n_distinct(hourly_steps$Id)
## [1] 33
Running the above code revealed the same discrepancy: there are 33 users in the hourly steps dataset, not the thirty claimed by the data uploader.
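The four separate n_distinct() checks can also be done in one pass, and an anti-join shows which participants are missing from the sleep data. This is only a sketch on toy data: the data frames, IDs, and values below are invented stand-ins, not the real Fitbit records.

```r
library(dplyr)

# Toy stand-ins for the loaded data frames (IDs and values are made up)
toy_activity <- data.frame(Id = c(1, 1, 2, 3), TotalSteps = c(100, 200, 300, 400))
toy_sleep    <- data.frame(Id = c(1, 2, 2), TotalMinutesAsleep = c(400, 380, 420))

# Count distinct participants in every data frame in one pass
datasets <- list(activity = toy_activity, sleep = toy_sleep)
participant_counts <- sapply(datasets, function(df) n_distinct(df$Id))
participant_counts

# Participants who have activity records but no sleep records
missing_sleep <- toy_activity %>%
  distinct(Id) %>%
  anti_join(toy_sleep, by = "Id")
missing_sleep
```

With the real data, the same pattern would list the participants who appear in daily_activity but never recorded any sleep.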
Cleaned the data further by removing observations with any NA cells, using “daily_activity %>% filter_all(all_vars(!is.na(.)))”, “hourly_calories %>% filter_all(all_vars(!is.na(.)))”, “daily_sleep %>% filter_all(all_vars(!is.na(.)))” and “hourly_steps %>% filter_all(all_vars(!is.na(.)))”.
A good data source should be Reliable, Original, Comprehensive, Current, and Cited. For the data available in this case study, reliability is low because it covers just 33 users; a larger sample would have been better. It was supplied by a third party (Amazon Mechanical Turk), so it is safe to say it is not original. It is neither comprehensive nor current: it is a 2016 dataset, collected 7 years ago. It is, however, cited: the source and its license were stated.
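A shorter way to express the same NA removal is tidyr's drop_na(); the scoped filter_all() verb still works but is marked as superseded in current dplyr. The sketch below uses a toy data frame with invented values, counts the missing cells first, and assigns the cleaned result to a new name (clean_sleep is a hypothetical name chosen for illustration).

```r
library(dplyr)
library(tidyr)

# Toy data frame with one missing value (values are illustrative only)
toy_sleep <- data.frame(
  Id = c(1, 1, 2),
  TotalMinutesAsleep = c(327, NA, 412),
  TotalTimeInBed = c(346, 407, 442)
)

# How many NA cells does each column contain?
na_counts <- colSums(is.na(toy_sleep))
na_counts

# Drop every row that has at least one NA, and keep the result
clean_sleep <- toy_sleep %>% drop_na()
nrow(clean_sleep)
```

Checking the NA counts first makes it clear whether the removal step changes anything at all.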
daily_activity %>%
  select(TotalSteps,
         TotalDistance,
         VeryActiveMinutes,
         SedentaryMinutes,
         Calories) %>%
  summary()
## TotalSteps TotalDistance VeryActiveMinutes SedentaryMinutes
## Min. : 0 Min. : 0.000 Min. : 0.00 Min. : 0.0
## 1st Qu.: 3790 1st Qu.: 2.620 1st Qu.: 0.00 1st Qu.: 729.8
## Median : 7406 Median : 5.245 Median : 4.00 Median :1057.5
## Mean : 7638 Mean : 5.490 Mean : 21.16 Mean : 991.2
## 3rd Qu.:10727 3rd Qu.: 7.713 3rd Qu.: 32.00 3rd Qu.:1229.5
## Max. :36019 Max. :28.030 Max. :210.00 Max. :1440.0
## Calories
## Min. : 0
## 1st Qu.:1828
## Median :2134
## Mean :2304
## 3rd Qu.:2793
## Max. :4900
hourly_calories %>%
  select(ActivityHour,
         Calories) %>%
  summary()
## ActivityHour Calories
## Length:22099 Min. : 42.00
## Class :character 1st Qu.: 63.00
## Mode :character Median : 83.00
## Mean : 97.39
## 3rd Qu.:108.00
## Max. :948.00
daily_sleep %>%
  select(TotalSleepRecords,
         TotalMinutesAsleep,
         TotalTimeInBed) %>%
  summary()
## TotalSleepRecords TotalMinutesAsleep TotalTimeInBed
## Min. :1.000 Min. : 58.0 Min. : 61.0
## 1st Qu.:1.000 1st Qu.:361.0 1st Qu.:403.0
## Median :1.000 Median :433.0 Median :463.0
## Mean :1.119 Mean :419.5 Mean :458.6
## 3rd Qu.:1.000 3rd Qu.:490.0 3rd Qu.:526.0
## Max. :3.000 Max. :796.0 Max. :961.0
hourly_steps %>%
  select(ActivityHour,
         StepTotal) %>%
  summary()
## ActivityHour StepTotal
## Length:22099 Min. : 0.0
## Class :character 1st Qu.: 0.0
## Mode :character Median : 40.0
## Mean : 320.2
## 3rd Qu.: 357.0
## Max. :10554.0
Plotted graphs to view trends and relationships between important columns.
ggplot(data= daily_activity, aes(x=TotalSteps, y=SedentaryMinutes))+ geom_point()+ geom_smooth()
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
The graph of Total Steps against Sedentary Minutes revealed that
participants were not very active: they had more idle time than
activity or exercise time, which is not good for their health.
(I’d recommend that they get more exercise, be it walking, jogging, or
running.)
ggplot(data= daily_activity, aes(x=TotalSteps, y=Calories))+ geom_smooth()
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
The graph of Total Steps against Calories burnt revealed a positive
relationship between steps taken and calories burnt: the more steps
participants took, the more calories they burnt, which is good for
their health. (I’d recommend that the fitness app developer add a
feature that sends a congratulatory “well done for prioritizing your
well being” message commending the participant’s effort.)
daily_activity%>%
group_by(Id)%>%
summarize(mean(TotalSteps), sd(TotalSteps), mean(Calories), sd(Calories), cor(TotalSteps,Calories))
## # A tibble: 33 × 6
## Id `mean(TotalSteps)` `sd(TotalSteps)` mean(Calorie…¹ sd(Ca…² cor(T…³
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1503960366 12117. 3052. 1816. 353. 0.892
## 2 1624580081 5744. 6177. 1483. 257. 0.931
## 3 1644430081 7283. 4325. 2811. 507. 0.914
## 4 1844505072 2580. 2713. 1573. 308. 0.917
## 5 1927972279 916. 1205. 2173. 221. 0.822
## 6 2022484408 11371. 2807. 2510. 297. 0.760
## 7 2026352035 5567. 2978. 1541. 186. 0.914
## 8 2320127002 4717. 2255. 1724. 212. 0.910
## 9 2347167796 9520. 4682. 2043. 473. 0.800
## 10 2873212765 7556. 1514. 1917. 158. 0.455
## # … with 23 more rows, and abbreviated variable names ¹`mean(Calories)`,
## # ²`sd(Calories)`, ³`cor(TotalSteps, Calories)`
Summarized the daily_activity data by participant to confirm the positive trend seen in the earlier graph. The correlation between total steps and calories is greater than zero for every participant shown, ranging from about 0.46 to 0.93 in the first ten rows, which confirms a positive relationship.
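Naming the summary columns avoids the abbreviated headers in the printed tibble and makes it easy to check the correlation across all participants rather than only the first ten rows. The following is a sketch on toy data; the values and the column names (mean_steps, steps_calories_cor, and so on) are hypothetical choices for illustration.

```r
library(dplyr)

# Toy daily records for two participants (values are illustrative only)
toy_activity <- data.frame(
  Id = c(1, 1, 1, 2, 2, 2),
  TotalSteps = c(1000, 2000, 3000, 4000, 5000, 6000),
  Calories = c(1500, 1600, 1700, 2000, 2200, 2400)
)

# One row per participant, with explicitly named summary columns
per_user <- toy_activity %>%
  group_by(Id) %>%
  summarize(
    mean_steps = mean(TotalSteps),
    sd_steps = sd(TotalSteps),
    mean_calories = mean(Calories),
    sd_calories = sd(Calories),
    steps_calories_cor = cor(TotalSteps, Calories)
  )

# Smallest and largest per-participant correlation
range(per_user$steps_calories_cor)
```

Applied to the real daily_activity data, range() would report the full spread of correlations across all 33 participants.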
ggplot(data= daily_sleep, aes(x=TotalTimeInBed, y=TotalMinutesAsleep))+ geom_point()+ geom_smooth()
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
The above graph revealed that, for participants who recorded their sleep
information, minutes asleep closely tracked time in bed: they fell asleep
fairly soon after going to bed. On average they spent about 39 minutes
in bed awake (a mean of 458.6 minutes in bed versus 419.5 minutes asleep).
Merged the daily activity data with the daily sleep data to check for a trend: do people who take more steps in a day get more sleep at night?
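The relationship between time in bed and time asleep can also be expressed numerically. The sketch below derives two columns, minutes awake in bed and the share of time in bed spent asleep; both column names are hypothetical, and the three rows reuse the first values shown in the glimpse() output of daily_sleep.

```r
library(dplyr)

# First three sleep records as shown in the glimpse() output above
toy_sleep <- data.frame(
  Id = c(1503960366, 1503960366, 1503960366),
  TotalMinutesAsleep = c(327, 384, 412),
  TotalTimeInBed = c(346, 407, 442)
)

toy_sleep <- toy_sleep %>%
  mutate(
    # Minutes spent in bed but not asleep
    minutes_awake_in_bed = TotalTimeInBed - TotalMinutesAsleep,
    # Share of time in bed spent asleep (between 0 and 1)
    sleep_efficiency = TotalMinutesAsleep / TotalTimeInBed
  )

toy_sleep
```

A share close to 1 means a participant spends almost all of their time in bed asleep.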
sleep_daily <- merge(daily_activity,daily_sleep, by= "Id")
Take a look at how many unique participants are in the newly merged dataset.
n_distinct(sleep_daily$Id)
## [1] 24
ggplot(data= sleep_daily, aes(x=TotalSteps, y=TotalMinutesAsleep))+ geom_smooth()
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
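The graph did not support the hypothesis: there was no positive
relationship, and in fact people who took fewer steps a day got more sleep.
Note that because the merge above is on “Id” alone, each activity day is
paired with every sleep day of the same participant, so this result
should be treated with caution.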
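Merging on “Id” alone pairs every activity day of a participant with every sleep day of that participant, which inflates the row count and mixes steps from one day with sleep from another. A possible alternative, sketched here on toy data, is to parse both date columns and join on participant and date together. The toy values and the helper column name date are assumptions for illustration; the date formats follow the ActivityDate and SleepDay columns shown earlier.

```r
library(dplyr)
library(lubridate)

# Toy stand-ins for daily_activity and daily_sleep (values are made up)
toy_activity <- data.frame(
  Id = c(1, 1, 2),
  ActivityDate = c("4/12/2016", "4/13/2016", "4/12/2016"),
  TotalSteps = c(13162, 10735, 8000)
)
toy_sleep <- data.frame(
  Id = c(1, 1),
  SleepDay = c("4/12/2016 12:00:00 AM", "4/13/2016 12:00:00 AM"),
  TotalMinutesAsleep = c(327, 384)
)

# Merging on Id alone: 2 activity days x 2 sleep days = 4 rows for participant 1
by_id_only <- merge(toy_activity, toy_sleep, by = "Id")
nrow(by_id_only)

# Parse both date columns into a shared Date column
activity_dated <- toy_activity %>% mutate(date = mdy(ActivityDate))
sleep_dated <- toy_sleep %>% mutate(date = as_date(mdy_hms(SleepDay)))

# Joining on Id and date keeps one row per participant per day
by_id_and_date <- inner_join(activity_dated, sleep_dated, by = c("Id", "date"))
nrow(by_id_and_date)
```

Re-plotting steps against minutes asleep on a date-aligned table would test the hypothesis on matching days; the conclusion could differ from the one drawn from the Id-only merge.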
Separating the combined date and time in the ActivityHour column in the hourly_steps data.
hourlysteps_separated <- hourly_steps %>%
  mutate(ActivityHour = mdy_hms(ActivityHour),
         Date = as.Date(ActivityHour),
         Time = format(ActivityHour, format = "%H:%M:%S"))
Plotted a graph to check the times of day at which participants were most active and took the most steps.
ggplot(data= hourlysteps_separated, aes(x=Time, y=StepTotal))+ geom_point()+ theme(axis.text.x=element_text(angle=90))
The graph revealed a realistic trend. Steps were minimal from midnight
through the early hours of the day, when participants were asleep; the
few steps recorded during these hours might be due to waking up to use
the bathroom or drink water. Steps began to rise around 6 am, which I
guess is when most participants started getting ready for work, reached
their highest level around 2 pm, and then declined until 11 pm.
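The hourly pattern read off the scatter plot can be checked by averaging steps for each hour of the day. This is a minimal sketch on toy data; the values and the names toy_hourly, hour_of_day, and mean_steps are invented for illustration, while the timestamp format follows the ActivityHour column.

```r
library(dplyr)
library(lubridate)

# Toy hourly records for two participants (values are made up)
toy_hourly <- data.frame(
  Id = c(1, 1, 2, 2),
  ActivityHour = c("4/12/2016 12:00:00 AM", "4/12/2016 2:00:00 PM",
                   "4/12/2016 12:00:00 AM", "4/12/2016 2:00:00 PM"),
  StepTotal = c(0, 900, 100, 1100)
)

# Average steps for each hour of the day across all participants and dates
steps_by_hour <- toy_hourly %>%
  mutate(hour_of_day = hour(mdy_hms(ActivityHour))) %>%
  group_by(hour_of_day) %>%
  summarize(mean_steps = mean(StepTotal))

steps_by_hour
```

Plotting mean_steps against hour_of_day, for example as a bar chart, would show the busiest hours more directly than a point cloud of individual readings.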