BellaBeat Case study

INTRODUCTION

Bellabeat is a high-tech company that manufactures health-focused smart products for women. The products were designed by Urška Sršen, Bellabeat’s cofounder and Chief Creative Officer, who drew on her background as an artist. They collect data on activity, sleep, stress, and reproductive health to empower women with knowledge about their own health and habits.

ASK

One of the stakeholders, Urška Sršen, asked me, as a junior data analyst at the company, to analyze smart device usage data in order to gain insight into how consumers use non-Bellabeat smart devices. She also asked me to select one Bellabeat product to apply these insights to in my presentation.

PREPARE

I used a public dataset suggested by one of the stakeholders, Urška Sršen, that explores smart device users’ daily habits. The dataset was uploaded by Mobius; it is not copyrighted, and its license allows anyone to use it free of charge.

This Kaggle dataset contains personal fitness tracker data from thirty Fitbit users. Thirty eligible Fitbit users consented to the submission of personal tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring. It includes information about daily activity, steps, and heart rate that can be used to explore users’ habits.

About the data: it comprises 18 wide-format datasets, generated by respondents to a survey distributed via Amazon Mechanical Turk between 03.12.2016 and 05.12.2016, with various records of participants’ activity and fitness data. I downloaded and saved them in a local folder on my laptop, imported them into RStudio Desktop using Import Dataset - From Text (readr), and assigned appropriate new names to them.

Loaded necessary packages to make the analysis run smoothly

library(tidyverse)

## Warning: package 'tidyverse' was built under R version 4.2.2

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0      ✔ purrr   0.3.5 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.5.0 
## ✔ readr   2.1.3      ✔ forcats 0.5.2

## Warning: package 'ggplot2' was built under R version 4.2.2

## Warning: package 'tidyr' was built under R version 4.2.2

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

library(dplyr)

library(janitor)

## Warning: package 'janitor' was built under R version 4.2.2

## 
## Attaching package: 'janitor'

## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test

library(lubridate)

## 
## Attaching package: 'lubridate'

## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union

library(Tmisc)

## Warning: package 'Tmisc' was built under R version 4.2.2

library(readr)

Loaded datasets into working environment

daily_activity <- read.csv("C:/Users/Dayo Alli/Downloads/Case Study2/dailyActivity_merged.csv")

hourly_calories <- read.csv("C:/Users/Dayo Alli/Downloads/Case Study2/hourlyCalories_merged.csv")

daily_sleep <- read.csv("C:/Users/Dayo Alli/Downloads/Case Study2/sleepDay_merged.csv")

hourly_steps <- read.csv("C:/Users/Dayo Alli/Downloads/Case Study2/hourlySteps_merged.csv")

Cleaning the datasets

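Cleaning the datasets with the clean_names() function ensures that the column names are unique and consistent, containing only letters, numbers and underscores, e.g. clean_names(daily_activity), clean_names(hourly_calories), etc. Note that clean_names() returns a new data frame rather than modifying the original, so the result has to be assigned to take effect; the outputs below still show the original column names (such as "TotalSteps"), which indicates the cleaned versions were not stored.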
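As a minimal sketch of how the cleaned column names could be kept, the example below applies clean_names() to a small made-up data frame and assigns the result back. The object name toy_activity and its values are assumptions for illustration only; they are not part of the case study data.

```r
library(janitor)

# Toy data frame with Fitbit-style column names (values are made up).
toy_activity <- data.frame(
  Id = c(1, 2),
  ActivityDate = c("4/12/2016", "4/13/2016"),
  TotalSteps = c(13162, 10735)
)

# clean_names() returns a new data frame, so assign it to keep the result.
toy_activity <- clean_names(toy_activity)

# The column names are now lower-case snake_case.
colnames(toy_activity)
```

Keeping the cleaned names would also mean that later code has to refer to the new names (for example total_steps instead of TotalSteps).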

Getting a glimpse of the kind of data contained in each of the loaded datasets.

glimpse(daily_activity)

## Rows: 940
## Columns: 15
## $ Id                       <dbl> 1503960366, 1503960366, 1503960366, 150396036…
## $ ActivityDate             <chr> "04/12/2016", "4/13/2016", "4/14/2016", "4/15…
## $ TotalSteps               <int> 13162, 10735, 10460, 9762, 12669, 9705, 13019…
## $ TotalDistance            <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59, 9.8…
## $ TrackerDistance          <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59, 9.8…
## $ LoggedActivitiesDistance <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ VeryActiveDistance       <dbl> 1.88, 1.57, 2.44, 2.14, 2.71, 3.19, 3.25, 3.5…
## $ ModeratelyActiveDistance <dbl> 0.55, 0.69, 0.40, 1.26, 0.41, 0.78, 0.64, 1.3…
## $ LightActiveDistance      <dbl> 6.06, 4.71, 3.91, 2.83, 5.04, 2.51, 4.71, 5.0…
## $ SedentaryActiveDistance  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ VeryActiveMinutes        <int> 25, 21, 30, 29, 36, 38, 42, 50, 28, 19, 66, 4…
## $ FairlyActiveMinutes      <int> 13, 19, 11, 34, 10, 20, 16, 31, 12, 8, 27, 21…
## $ LightlyActiveMinutes     <int> 328, 217, 181, 209, 221, 164, 233, 264, 205, …
## $ SedentaryMinutes         <int> 728, 776, 1218, 726, 773, 539, 1149, 775, 818…
## $ Calories                 <int> 1985, 1797, 1776, 1745, 1863, 1728, 1921, 203…

The daily activity dataframe contains 15 columns and 940 observations.

glimpse(hourly_calories)

## Rows: 22,099
## Columns: 3
## $ Id           <dbl> 1503960366, 1503960366, 1503960366, 1503960366, 150396036…
## $ ActivityHour <chr> "4/12/2016 12:00:00 AM", "4/12/2016 1:00:00 AM", "4/12/20…
## $ Calories     <int> 81, 61, 59, 47, 48, 48, 48, 47, 68, 141, 99, 76, 73, 66, …

The hourly calories dataframe contains 3 columns and 22,099 observations.

glimpse(daily_sleep)

## Rows: 413
## Columns: 5
## $ Id                 <dbl> 1503960366, 1503960366, 1503960366, 1503960366, 150…
## $ SleepDay           <chr> "4/12/2016 12:00:00 AM", "4/13/2016 12:00:00 AM", "…
## $ TotalSleepRecords  <int> 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ TotalMinutesAsleep <int> 327, 384, 412, 340, 700, 304, 360, 325, 361, 430, 2…
## $ TotalTimeInBed     <int> 346, 407, 442, 367, 712, 320, 377, 364, 384, 449, 3…

The daily sleep dataframe contains 5 columns and 413 observations.

glimpse(hourly_steps)

## Rows: 22,099
## Columns: 3
## $ Id           <dbl> 1503960366, 1503960366, 1503960366, 1503960366, 150396036…
## $ ActivityHour <chr> "4/12/2016 12:00:00 AM", "4/12/2016 1:00:00 AM", "4/12/20…
## $ StepTotal    <int> 373, 160, 151, 0, 0, 0, 0, 0, 250, 1864, 676, 360, 253, 2…

The hourly steps dataframe contains 3 columns and 22,099 observations.

Inspecting column names of each loaded dataframe.

colnames(daily_activity)

##  [1] "Id"                       "ActivityDate"            
##  [3] "TotalSteps"               "TotalDistance"           
##  [5] "TrackerDistance"          "LoggedActivitiesDistance"
##  [7] "VeryActiveDistance"       "ModeratelyActiveDistance"
##  [9] "LightActiveDistance"      "SedentaryActiveDistance" 
## [11] "VeryActiveMinutes"        "FairlyActiveMinutes"     
## [13] "LightlyActiveMinutes"     "SedentaryMinutes"        
## [15] "Calories"

colnames(hourly_calories)

## [1] "Id"           "ActivityHour" "Calories"

colnames(daily_sleep)

## [1] "Id"                 "SleepDay"           "TotalSleepRecords" 
## [4] "TotalMinutesAsleep" "TotalTimeInBed"

colnames(hourly_steps)

## [1] "Id"           "ActivityHour" "StepTotal"

After inspecting the column names, I realized that the datasets have one column name in common, the “Id” column. This means that the dataframes can be joined on the “Id” column to look for possible trends.

Using the n_distinct() function to detect how many unique participants recorded their activities.

n_distinct(daily_activity$Id)

## [1] 33

Running the code above revealed a discrepancy: the daily activity dataset contains 33 users, whereas the data uploader's description states that “the data set contains personal fitness tracker from thirty fitbit users”.

n_distinct(hourly_calories$Id)

## [1] 33

Running the code above revealed the same discrepancy: the hourly calories dataset also contains 33 users rather than the thirty stated in the data uploader's description.

n_distinct(daily_sleep$Id)

## [1] 24

Running the above code revealed that just 24 of the 33 unique users recorded their sleep information.

n_distinct(hourly_steps$Id)

## [1] 33

Running the code above revealed the same discrepancy: the hourly steps dataset also contains 33 users rather than the thirty stated in the data uploader's description.
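To identify which participants account for the difference between the 33 users with activity records and the 24 with sleep records, the ID columns can be compared directly. The sketch below uses short hypothetical ID vectors in place of daily_activity$Id and daily_sleep$Id; the names activity_ids, sleep_ids and missing_sleep are illustrative assumptions.

```r
# Hypothetical ID vectors standing in for unique(daily_activity$Id)
# and unique(daily_sleep$Id); the values are made up.
activity_ids <- c(101, 102, 103, 104)
sleep_ids <- c(101, 103)

# Users present in the activity data but absent from the sleep data.
missing_sleep <- setdiff(activity_ids, sleep_ids)

missing_sleep
length(missing_sleep)
```

On the real data, the length of this vector would be expected to match the gap between the two n_distinct() results.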

Cleaned the data further by removing observations that contain NA cells, using “daily_activity %>% filter_all(all_vars(!is.na(.)))”, “hourly_calories %>% filter_all(all_vars(!is.na(.)))”, “daily_sleep %>% filter_all(all_vars(!is.na(.)))” and “hourly_steps %>% filter_all(all_vars(!is.na(.)))”. Each call returns a filtered copy, so the result must be assigned to an object for the removal to persist.
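A possible alternative to filter_all() is tidyr's drop_na(), which removes rows containing any missing value. The sketch below is an assumption about how that step could look, shown on a made-up data frame (toy_sleep) rather than the case study data; it first counts missing values per column so the effect of the removal is visible.

```r
library(dplyr)
library(tidyr)

# Toy sleep records with one missing value (values are made up).
toy_sleep <- data.frame(
  Id = c(1, 1, 2),
  TotalMinutesAsleep = c(327, NA, 412),
  TotalTimeInBed = c(346, 407, 442)
)

# Count missing values per column before removing anything.
colSums(is.na(toy_sleep))

# Keep only rows with no NA in any column, and store the result.
toy_sleep_complete <- toy_sleep %>% drop_na()

nrow(toy_sleep_complete)
```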

NOTE

A good data source should be Reliable, Original, Comprehensive, Current, and Cited. For the data available in this case study, reliability is low because it covers just 33 users; a larger sample would have been better. It was supplied by a third party (Amazon Mechanical Turk), so it is safe to say it is not original. It is neither comprehensive nor current: it was collected in 2016, more than six years before this analysis. It is, however, cited: the source and its license were stated.

PROCESS AND ANALYZE

Checking for trends by running quick summary statistics on the loaded dataframes.

daily_activity %>%
  select(TotalSteps,
         TotalDistance,
         VeryActiveMinutes,
         SedentaryMinutes,
         Calories) %>%
  summary()

##    TotalSteps    TotalDistance    VeryActiveMinutes SedentaryMinutes
##  Min.   :    0   Min.   : 0.000   Min.   :  0.00    Min.   :   0.0  
##  1st Qu.: 3790   1st Qu.: 2.620   1st Qu.:  0.00    1st Qu.: 729.8  
##  Median : 7406   Median : 5.245   Median :  4.00    Median :1057.5  
##  Mean   : 7638   Mean   : 5.490   Mean   : 21.16    Mean   : 991.2  
##  3rd Qu.:10727   3rd Qu.: 7.713   3rd Qu.: 32.00    3rd Qu.:1229.5  
##  Max.   :36019   Max.   :28.030   Max.   :210.00    Max.   :1440.0  
##     Calories   
##  Min.   :   0  
##  1st Qu.:1828  
##  Median :2134  
##  Mean   :2304  
##  3rd Qu.:2793  
##  Max.   :4900

hourly_calories %>%
  select(ActivityHour,
         Calories) %>%
  summary()

##  ActivityHour          Calories     
##  Length:22099       Min.   : 42.00  
##  Class :character   1st Qu.: 63.00  
##  Mode  :character   Median : 83.00  
##                     Mean   : 97.39  
##                     3rd Qu.:108.00  
##                     Max.   :948.00

daily_sleep %>%
  select(TotalSleepRecords,
         TotalMinutesAsleep,
         TotalTimeInBed) %>%
  summary()

##  TotalSleepRecords TotalMinutesAsleep TotalTimeInBed 
##  Min.   :1.000     Min.   : 58.0      Min.   : 61.0  
##  1st Qu.:1.000     1st Qu.:361.0      1st Qu.:403.0  
##  Median :1.000     Median :433.0      Median :463.0  
##  Mean   :1.119     Mean   :419.5      Mean   :458.6  
##  3rd Qu.:1.000     3rd Qu.:490.0      3rd Qu.:526.0  
##  Max.   :3.000     Max.   :796.0      Max.   :961.0

hourly_steps %>%
  select(ActivityHour,
         StepTotal) %>%
  summary()

##  ActivityHour         StepTotal      
##  Length:22099       Min.   :    0.0  
##  Class :character   1st Qu.:    0.0  
##  Mode  :character   Median :   40.0  
##                     Mean   :  320.2  
##                     3rd Qu.:  357.0  
##                     Max.   :10554.0

Plotting graphs to view trends and relationships between important columns.

ggplot(data= daily_activity, aes(x=TotalSteps, y=SedentaryMinutes))+ geom_point()+ geom_smooth()

## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

The graph of Sedentary Minutes against Total Steps revealed that participants were not very active: they had more idle time than activity or exercise time, which is not good for their health. (I’d recommend that they get more exercise, be it walking, jogging or running.)
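The idle-time observation can be put into numbers with a short calculation. The sketch below takes the mean of SedentaryMinutes reported in the summary output above (991.2) and expresses it as a share of the 1,440 minutes in a day; the variable names are illustrative assumptions.

```r
# Minutes in a full day.
minutes_per_day <- 24 * 60

# Mean of SedentaryMinutes, copied from the summary() output above.
mean_sedentary_minutes <- 991.2

# Share of the day recorded as sedentary, on average.
sedentary_share <- mean_sedentary_minutes / minutes_per_day

round(100 * sedentary_share, 1)  # about 68.8 percent
```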

ggplot(data= daily_activity, aes(x=TotalSteps, y=Calories))+ geom_smooth()

## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

The graph of Calories against Total Steps revealed a positive relationship between steps taken and calories burnt: the more steps participants took, the more calories they burnt, which is good for their health. (I’d recommend that the fitness app developer add a feature that sends a congratulatory “well done for prioritizing your well being” message to commend participants’ effort.)

daily_activity%>%
group_by(Id)%>%
summarize(mean(TotalSteps), sd(TotalSteps), mean(Calories), sd(Calories), cor(TotalSteps,Calories))

## # A tibble: 33 × 6
##            Id `mean(TotalSteps)` `sd(TotalSteps)` mean(Calorie…¹ sd(Ca…² cor(T…³
##         <dbl>              <dbl>            <dbl>          <dbl>   <dbl>   <dbl>
##  1 1503960366             12117.            3052.          1816.    353.   0.892
##  2 1624580081              5744.            6177.          1483.    257.   0.931
##  3 1644430081              7283.            4325.          2811.    507.   0.914
##  4 1844505072              2580.            2713.          1573.    308.   0.917
##  5 1927972279               916.            1205.          2173.    221.   0.822
##  6 2022484408             11371.            2807.          2510.    297.   0.760
##  7 2026352035              5567.            2978.          1541.    186.   0.914
##  8 2320127002              4717.            2255.          1724.    212.   0.910
##  9 2347167796              9520.            4682.          2043.    473.   0.800
## 10 2873212765              7556.            1514.          1917.    158.   0.455
## # … with 23 more rows, and abbreviated variable names ¹`mean(Calories)`,
## #   ²`sd(Calories)`, ³`cor(TotalSteps, Calories)`

Summarized the daily_activity data per user to confirm the positive trend seen in the earlier graph. The correlation between Total Steps and Calories held up: for the ten users shown above it ranges from about 0.46 to 0.93, and it is greater than zero throughout, which indicates a positive trend.
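Because the printed tibble shows only the first ten users, the range of the per-user correlation is easier to check programmatically. The sketch below is one way this could be done, using named summary columns and a made-up data frame (toy_activity) in which one user has a positive and the other a negative relationship; the column names mean_steps and steps_calories_cor are assumptions.

```r
library(dplyr)

# Toy daily records for two users (values are made up).
toy_activity <- data.frame(
  Id = c(1, 1, 1, 2, 2, 2),
  TotalSteps = c(1000, 2000, 3000, 500, 1500, 2500),
  Calories = c(1500, 1700, 1900, 2200, 2100, 2000)
)

# Naming the summary columns avoids headers such as `mean(TotalSteps)`.
per_user <- toy_activity %>%
  group_by(Id) %>%
  summarize(
    mean_steps = mean(TotalSteps),
    steps_calories_cor = cor(TotalSteps, Calories)
  )

# Smallest and largest per-user correlation across all users.
range(per_user$steps_calories_cor)
```

A minimum above zero on the real data would confirm that the relationship is positive for every user, not only for those visible in the printed rows.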

ggplot(data= daily_sleep, aes(x=TotalTimeInBed, y=TotalMinutesAsleep))+ geom_point()+ geom_smooth()

## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

The graph above shows that minutes asleep closely track time in bed, which suggests that participants who recorded their sleep information generally had no difficulty falling asleep soon after going to bed.
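The gap between time in bed and time asleep can also be computed per record. The sketch below uses the first three pairs of values visible in the glimpse(daily_sleep) output; the derived column names MinutesAwakeInBed and SleepEfficiency are hypothetical and not part of the original dataset.

```r
# First three sleep records as shown in the glimpse() output above.
toy_sleep <- data.frame(
  TotalMinutesAsleep = c(327, 384, 412),
  TotalTimeInBed = c(346, 407, 442)
)

# Minutes spent in bed but not asleep.
toy_sleep$MinutesAwakeInBed <-
  toy_sleep$TotalTimeInBed - toy_sleep$TotalMinutesAsleep

# Fraction of time in bed that was spent asleep.
toy_sleep$SleepEfficiency <-
  toy_sleep$TotalMinutesAsleep / toy_sleep$TotalTimeInBed

toy_sleep
```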

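Merged the daily activity data with the daily sleep data in order to check for a trend: do people who take more steps in a day get more sleep at night? Note that merging on “Id” alone pairs every activity day of a user with every sleep day of that same user, so the merged table has many more rows than either input; matching on both user and date would align each day's steps with that day's sleep record.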

sleep_daily <- merge(daily_activity,daily_sleep, by= "Id")
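For comparison, the sketch below contrasts a merge on Id alone with a join on both Id and date, using small made-up data frames. This is an assumed alternative, not the method used in this analysis; the helper column date and the object names are illustrative, and the date strings follow the formats seen in the glimpse() outputs.

```r
library(dplyr)
library(lubridate)

# Toy activity and sleep records (values are made up).
toy_activity <- data.frame(
  Id = c(1, 1, 2),
  ActivityDate = c("4/12/2016", "4/13/2016", "4/12/2016"),
  TotalSteps = c(13162, 10735, 8000)
)
toy_sleep <- data.frame(
  Id = c(1, 1),
  SleepDay = c("4/12/2016 12:00:00 AM", "4/13/2016 12:00:00 AM"),
  TotalMinutesAsleep = c(327, 384)
)

# Parse both date columns into Date values so they can be matched.
toy_activity <- toy_activity %>% mutate(date = mdy(ActivityDate))
toy_sleep <- toy_sleep %>% mutate(date = as_date(mdy_hms(SleepDay)))

# Merging on Id alone pairs every activity day with every sleep day.
by_id_only <- merge(toy_activity, toy_sleep, by = "Id")

# Joining on Id and date keeps one row per user and day.
by_id_and_date <- inner_join(toy_activity, toy_sleep, by = c("Id", "date"))

nrow(by_id_only)      # 4 rows: 2 activity days x 2 sleep days for user 1
nrow(by_id_and_date)  # 2 rows: one per matching user and day
```

With the date-aligned table, each point in a steps-versus-sleep plot represents a single day for a single user.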

Take a look at how many unique participants are in the newly merged dataset.

n_distinct(sleep_daily$Id)

## [1] 24

ggplot(data= sleep_daily, aes(x=TotalSteps, y=TotalMinutesAsleep))+ geom_smooth()

## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

The graph did not support the hypothesis of a positive correlation; in fact, participants who took fewer steps in a day recorded more time asleep.

Separating the combined date and time in the ActivityHour column in the hourly_steps data.

hourlysteps_separated <- hourly_steps %>%
  mutate(ActivityHour = mdy_hms(ActivityHour),
         Date = as.Date(ActivityHour),
         Time = format(ActivityHour, format = "%H:%M:%S"))

Plotted a graph to check the time of the day participants were more active and took more steps.

ggplot(data= hourlysteps_separated, aes(x=Time, y=StepTotal))+ geom_point()+ theme(axis.text.x=element_text(angle=90))

The graph revealed a realistic trend. Steps were minimal from midnight until the early hours of the day, when participants were asleep; the few steps recorded during these hours might be due to waking up to use the bathroom or drink water. Steps picked up at around 6am, which I assume is when most participants started getting ready for work, rose to their highest at around 2pm, and then declined until 11pm.
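The hourly pattern can also be summarized as an average per hour of the day, which is easier to read than a scatter of every record. The sketch below is an assumed follow-up on a made-up data frame (toy_hourly), with timestamps in the same format as the ActivityHour column; the names hour and mean_steps are illustrative.

```r
library(dplyr)
library(lubridate)

# Toy hourly step records for two users (values are made up).
toy_hourly <- data.frame(
  Id = c(1, 1, 2, 2),
  ActivityHour = c("4/12/2016 1:00:00 AM", "4/12/2016 2:00:00 PM",
                   "4/12/2016 1:00:00 AM", "4/12/2016 2:00:00 PM"),
  StepTotal = c(0, 600, 20, 400)
)

# Extract the hour of day (0-23) and average the steps within each hour.
mean_steps_by_hour <- toy_hourly %>%
  mutate(hour = hour(mdy_hms(ActivityHour))) %>%
  group_by(hour) %>%
  summarize(mean_steps = mean(StepTotal))

mean_steps_by_hour
```

On the real data, the resulting table has one row per hour and could be plotted as a bar chart to show when activity rises and falls.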

Dayo Alli

2022-12-07