1. Ask Questions to Make Data-driven Decisions

1.1. Introduction

Urška Sršen and Sando Mur founded Bellabeat, a high-tech company that manufactures health-focused smart products. Sršen used her background as an artist to develop beautifully designed technology that informs and inspires women around the world. Collecting data on activity, sleep, stress, and reproductive health has allowed Bellabeat to empower women with knowledge about their own health and habits. Since it was founded in 2013, Bellabeat has grown rapidly and quickly positioned itself as a tech-driven wellness company for women.

By 2016, Bellabeat had opened offices around the world and launched multiple products. Bellabeat products became available through a growing number of online retailers in addition to the company's own e-commerce channel on its website. The company has invested in traditional advertising media, such as radio, out-of-home billboards, print, and television, but focuses extensively on digital marketing. Bellabeat invests year-round in Google Search, maintains active Facebook and Instagram pages, and consistently engages consumers on Twitter. Additionally, Bellabeat runs video ads on YouTube and display ads on the Google Display Network to support campaigns around key marketing dates.

The Bellabeat app provides users with health data related to their activity, sleep, stress, menstrual cycle, and mindfulness habits. This data can help users better understand their current habits and make healthy decisions. The Bellabeat app connects to their line of smart wellness products.

1.2. Business Task

Bellabeat wants to identify trends in how consumers use smart devices and the business growth opportunities these trends open up. It also wants high-level recommendations to inform its marketing strategy.

1.3. Business Objectives

The main objectives of the case study are based on three business questions, which define the scope of the study:

  • What are the trends identified?
  • How could these trends apply to Bellabeat consumers?
  • How could these trends help influence Bellabeat marketing strategy?

1.4. Key Stakeholders

Bellabeat has key stakeholders who are interested in the outcome of this business task:

  • Urška Sršen: Bellabeat’s cofounder and Chief Creative Officer.
  • Sando Mur: Mathematician and Bellabeat’s cofounder; key member of the Bellabeat executive team.
  • Bellabeat marketing analytics team: A team of data analysts responsible for collecting, analyzing, and reporting data that helps guide Bellabeat’s marketing strategy.

1.5. Deliverables

  1. A clear summary of the business task
  2. A description of all data sources used
  3. Documentation of any cleaning or manipulation of data
  4. A summary of your analysis
  5. Supporting visualizations and key findings
  6. Your top high-level content recommendations based on your analysis

2. Prepare Data for Exploration

In this section we explore the data: where it is stored, how it was verified with the ROCCC method, and how its licensing, privacy, security, accessibility, and integrity were checked. We also highlight how the data helps us answer the business questions.

2.1. Source of Data

The data for this case study is publicly available on Kaggle. The dataset was generated by respondents to a survey distributed via Amazon Mechanical Turk between 03.12.2016 and 05.12.2016. Thirty eligible Fitbit users consented to the submission of personal tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring. Individual reports can be parsed by export session ID (column A) or timestamp (column B). Variation between outputs reflects the use of different types of Fitbit trackers and individual tracking behaviors and preferences.

2.2. ROCCC Method

High-quality data helps us make reliable decisions, so we first need to check the quality of our data. The ROCCC method shows how good our data is.

  • Reliability: The data comes from 30 Fitbit users who consented to the submission of personal tracker data, generated by a survey distributed via Amazon Mechanical Turk.
  • Original: The data was submitted by 30 Fitbit users who consented to sharing their personal tracker data via Amazon Mechanical Turk.
  • Comprehensive: The data contains minute-level output for physical activity, heart rate, and sleep monitoring. Although it tracks many aspects of user activity and sleep, the sample size is small and most data is recorded on certain days of the week.
  • Current: The data is from March 2016 to May 2016. It is not current, so users' habits may be different now.
  • Cited: Unknown.

2.3. Data Ethics and Privacy

Bellabeat has set standards for how this data is collected, shared, and used, and has preserved its privacy and validity. The Fitbit dataset meets the six elements of data ethics: ownership, transaction transparency, consent, currency, privacy, and openness.

3. Process Data from Dirty to Clean

The Fitbit data is not clean. To process it, we will go through the dataset files, check the variables and observations, sort and filter the data, remove missing data, rename columns, and prepare a clean dataset.

3.1. Data Cleaning Tool

This Bellabeat case study uses the R programming language to clean and analyze the data. R is one of the most widely used languages for data analysis and was originally created for statistical computing.

3.2. Set Working Environment

setwd("~/Desktop/RPROJECTS/Fitness")

3.3. Load the Essential Libraries

R uses packages to speed up the data analysis process. In this capstone, we will use the following packages.

library("tidyverse")
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.6     ✓ dplyr   1.0.8
## ✓ tidyr   1.2.0     ✓ stringr 1.4.0
## ✓ readr   2.1.2     ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library("skimr")
library("here")
## here() starts at /Users/mohamedabdilahi/Desktop/RPROJECTS/Fitness
library("lubridate")
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
library("janitor")
## 
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test
library("dplyr")
library("scales")
## 
## Attaching package: 'scales'
## The following object is masked from 'package:purrr':
## 
##     discard
## The following object is masked from 'package:readr':
## 
##     col_factor
library("ggpubr")

3.4. Import Fitbit Datasets

The Fitbit dataset contains a number of CSV files. We will analyze only the three most important: daily activity, daily sleep, and hourly steps. To explore the data, we first need to import these datasets into our environment.

activity <- read_csv("~/Desktop/RPROJECTS/Fitness/fitabase_data/dailyActivity_merged.csv")
## Rows: 940 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (1): ActivityDate
## dbl (14): Id, TotalSteps, TotalDistance, TrackerDistance, LoggedActivitiesDi...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
sleep <- read_csv("~/Desktop/RPROJECTS/Fitness/fitabase_data/sleepDay_merged.csv")
## Rows: 413 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): SleepDay
## dbl (4): Id, TotalSleepRecords, TotalMinutesAsleep, TotalTimeInBed
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
steps <- read_csv("~/Desktop/RPROJECTS/Fitness/fitabase_data/hourlySteps_merged.csv")
## Rows: 22099 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): ActivityHour
## dbl (2): Id, StepTotal
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Let us explore the data and check its completeness. We have imported three datasets: daily activity, daily sleep, and hourly steps. Before we analyze the data, we need to clean it: remove duplicates and missing values, and format any columns that need it.

3.5. Data Cleaning

In this stage, we take an overall look at our datasets and identify any missing values, duplicates, and incorrect data types.

3.5.1. Exploring Dataset

We have determined that this data needs to be cleaned and made tidy. First, we look at the number of users in each dataset. There should be approximately 30, but the actual number may be slightly higher or lower.

n_unique(activity$Id)
## [1] 33
n_unique(sleep$Id)
## [1] 24
n_unique(steps$Id)
## [1] 33
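
The counts above show that the three tables do not cover the same set of users. A minimal base R sketch of the same check follows; the ID vectors are made up for illustration and are not taken from the dataset. It also lists which users are missing from one table, which a plain count does not reveal.

```r
# Hypothetical user IDs, for illustration only
activity_ids <- c(101, 101, 102, 103, 103, 104)
sleep_ids    <- c(101, 103, 103)

# Number of distinct users in each table
n_activity <- length(unique(activity_ids))
n_sleep    <- length(unique(sleep_ids))

# Users present in the activity table but absent from the sleep table
missing_sleep <- setdiff(unique(activity_ids), unique(sleep_ids))

print(c(activity = n_activity, sleep = n_sleep))
print(missing_sleep)
```

Knowing which users never logged sleep can matter later, because any analysis that joins activity and sleep silently leaves those users out.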

3.5.2. Remove Duplicates

sum(duplicated(activity))
## [1] 0
sum(duplicated(sleep))
## [1] 3
sum(duplicated(steps))
## [1] 0

We found 3 duplicate observations in the sleep data. Let us remove them, along with any rows that contain missing values:

sleep <- sleep %>%
  distinct() %>%
  drop_na()

Now let us check whether the duplicates were removed.

sum(duplicated(sleep))
## [1] 0
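
Because duplicate removal and missing-value removal happen in one pipeline, it can help to count each problem separately before dropping rows. The sketch below does this in base R on a small made-up table (the column names and values are illustrative assumptions), so it is clear which step removed which rows.

```r
# Toy sleep table with one exact duplicate row and one missing value
toy_sleep <- data.frame(
  id             = c(1, 1, 2, 3),
  sleep_day      = c("4/12/2016", "4/12/2016", "4/12/2016", "4/13/2016"),
  minutes_asleep = c(327, 327, 384, NA)
)

n_dupes   <- sum(duplicated(toy_sleep))        # exact duplicate rows
n_missing <- sum(!complete.cases(toy_sleep))   # rows with at least one NA

# Base R counterpart of distinct() followed by drop_na()
toy_clean <- na.omit(unique(toy_sleep))

print(c(duplicates = n_dupes, incomplete = n_missing, kept = nrow(toy_clean)))
```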

3.5.3. Clean Column Names

Column names can mix upper- and lower-case letters, which can create confusion during analysis. It is best practice to convert all column names to lower case.

# Daily Activity datasets 
clean_names(activity)
## # A tibble: 940 × 15
##            id activity_date total_steps total_distance tracker_distance
##         <dbl> <chr>               <dbl>          <dbl>            <dbl>
##  1 1503960366 4/12/2016           13162           8.5              8.5 
##  2 1503960366 4/13/2016           10735           6.97             6.97
##  3 1503960366 4/14/2016           10460           6.74             6.74
##  4 1503960366 4/15/2016            9762           6.28             6.28
##  5 1503960366 4/16/2016           12669           8.16             8.16
##  6 1503960366 4/17/2016            9705           6.48             6.48
##  7 1503960366 4/18/2016           13019           8.59             8.59
##  8 1503960366 4/19/2016           15506           9.88             9.88
##  9 1503960366 4/20/2016           10544           6.68             6.68
## 10 1503960366 4/21/2016            9819           6.34             6.34
## # … with 930 more rows, and 10 more variables:
## #   logged_activities_distance <dbl>, very_active_distance <dbl>,
## #   moderately_active_distance <dbl>, light_active_distance <dbl>,
## #   sedentary_active_distance <dbl>, very_active_minutes <dbl>,
## #   fairly_active_minutes <dbl>, lightly_active_minutes <dbl>,
## #   sedentary_minutes <dbl>, calories <dbl>
activity<- rename_with(activity, tolower)

# Daily Sleep datasets 
clean_names(sleep)
## # A tibble: 410 × 5
##            id sleep_day       total_sleep_rec… total_minutes_a… total_time_in_b…
##         <dbl> <chr>                      <dbl>            <dbl>            <dbl>
##  1 1503960366 4/12/2016 12:0…                1              327              346
##  2 1503960366 4/13/2016 12:0…                2              384              407
##  3 1503960366 4/15/2016 12:0…                1              412              442
##  4 1503960366 4/16/2016 12:0…                2              340              367
##  5 1503960366 4/17/2016 12:0…                1              700              712
##  6 1503960366 4/19/2016 12:0…                1              304              320
##  7 1503960366 4/20/2016 12:0…                1              360              377
##  8 1503960366 4/21/2016 12:0…                1              325              364
##  9 1503960366 4/23/2016 12:0…                1              361              384
## 10 1503960366 4/24/2016 12:0…                1              430              449
## # … with 400 more rows
sleep <- rename_with(sleep, tolower)

# Hourly Steps datasets 
clean_names(steps)
## # A tibble: 22,099 × 3
##            id activity_hour         step_total
##         <dbl> <chr>                      <dbl>
##  1 1503960366 4/12/2016 12:00:00 AM        373
##  2 1503960366 4/12/2016 1:00:00 AM         160
##  3 1503960366 4/12/2016 2:00:00 AM         151
##  4 1503960366 4/12/2016 3:00:00 AM           0
##  5 1503960366 4/12/2016 4:00:00 AM           0
##  6 1503960366 4/12/2016 5:00:00 AM           0
##  7 1503960366 4/12/2016 6:00:00 AM           0
##  8 1503960366 4/12/2016 7:00:00 AM           0
##  9 1503960366 4/12/2016 8:00:00 AM         250
## 10 1503960366 4/12/2016 9:00:00 AM        1864
## # … with 22,089 more rows
steps <- rename_with(steps, tolower)
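
Note that the output of clean_names() is only printed above and never assigned, so the names that remain are the ones produced by tolower (for example activitydate rather than activity_date). The sketch below illustrates the difference between the two naming styles. The regular expression is only a rough base R approximation of snake_case and is not how janitor implements clean_names().

```r
cols <- c("Id", "ActivityDate", "TotalSteps")

# What rename_with(df, tolower) produces: lower case, no separators
lower_names <- tolower(cols)

# Rough snake_case approximation: insert "_" between a lower-case letter
# and the upper-case letter that follows it, then convert to lower case
snake_names <- tolower(gsub("([a-z])([A-Z])", "\\1_\\2", cols))

print(lower_names)
print(snake_names)
```

Either style works as long as it is used consistently; the rest of this report uses the names without separators.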

3.5.4. Format Date & Time

Date and time are very important in this analysis because we are going to analyze daily activity records. If the date and time columns are not converted to proper date types, the results will not be correct.

As we have seen, the activitydate column in the activity data and the sleepday column in the sleep data are stored as character strings. Let us convert them to date format, and the activityhour column in the hourly steps data to date-time format.

activity <- activity %>% 
  rename(date = activitydate) %>% 
  mutate(date = as_date(date, format = "%m/%d/%Y"))

sleep <- sleep %>% 
  rename(date = sleepday) %>% 
  mutate(date = as_date(date, format = "%m/%d/%Y %I:%M:%S %p"))
steps <- steps %>% 
  rename(date_time = activityhour) %>% 
  mutate(date_time = as.POSIXct(date_time, format="%m/%d/%Y %I:%M:%S %p", tz=Sys.timezone()))
str(activity)
## tibble [940 × 15] (S3: tbl_df/tbl/data.frame)
##  $ id                      : num [1:940] 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ date                    : Date[1:940], format: "2016-04-12" "2016-04-13" ...
##  $ totalsteps              : num [1:940] 13162 10735 10460 9762 12669 ...
##  $ totaldistance           : num [1:940] 8.5 6.97 6.74 6.28 8.16 ...
##  $ trackerdistance         : num [1:940] 8.5 6.97 6.74 6.28 8.16 ...
##  $ loggedactivitiesdistance: num [1:940] 0 0 0 0 0 0 0 0 0 0 ...
##  $ veryactivedistance      : num [1:940] 1.88 1.57 2.44 2.14 2.71 ...
##  $ moderatelyactivedistance: num [1:940] 0.55 0.69 0.4 1.26 0.41 ...
##  $ lightactivedistance     : num [1:940] 6.06 4.71 3.91 2.83 5.04 ...
##  $ sedentaryactivedistance : num [1:940] 0 0 0 0 0 0 0 0 0 0 ...
##  $ veryactiveminutes       : num [1:940] 25 21 30 29 36 38 42 50 28 19 ...
##  $ fairlyactiveminutes     : num [1:940] 13 19 11 34 10 20 16 31 12 8 ...
##  $ lightlyactiveminutes    : num [1:940] 328 217 181 209 221 164 233 264 205 211 ...
##  $ sedentaryminutes        : num [1:940] 728 776 1218 726 773 ...
##  $ calories                : num [1:940] 1985 1797 1776 1745 1863 ...
str(sleep)
## tibble [410 × 5] (S3: tbl_df/tbl/data.frame)
##  $ id                : num [1:410] 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ date              : Date[1:410], format: "2016-04-12" "2016-04-13" ...
##  $ totalsleeprecords : num [1:410] 1 2 1 2 1 1 1 1 1 1 ...
##  $ totalminutesasleep: num [1:410] 327 384 412 340 700 304 360 325 361 430 ...
##  $ totaltimeinbed    : num [1:410] 346 407 442 367 712 320 377 364 384 449 ...
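
The format strings used above can be checked on single values before they are applied to whole columns. This is a small base R sketch with sample strings in the two layouts found in the CSV files. The tz = "UTC" setting is an assumption made here so the result does not depend on the machine, and the %p marker assumes an English or C locale.

```r
# Sample strings in the two layouts used by the CSV files
day_string  <- "4/12/2016"
hour_string <- "4/12/2016 1:00:00 PM"

# Date only: month/day/four-digit year
day <- as.Date(day_string, format = "%m/%d/%Y")

# Date-time: %I is the 12-hour clock and %p the AM/PM marker
hour <- as.POSIXct(hour_string, format = "%m/%d/%Y %I:%M:%S %p", tz = "UTC")

# A format string that does not match returns NA instead of raising an error
bad <- as.Date(day_string, format = "%Y-%m-%d")

print(day)
print(format(hour, "%H:%M"))
print(is.na(bad))
```

Since a mismatched format fails silently with NA, counting missing values in the converted column is a cheap way to confirm that every row was parsed.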

3.5.5. Merging Data

We have arrived at the last stage of data processing. We now combine the two datasets to examine the relationship between daily activity and daily sleep.

activity_sleep <- merge(activity, sleep, by=c("id", "date"))
glimpse(activity_sleep)
## Rows: 410
## Columns: 18
## $ id                       <dbl> 1503960366, 1503960366, 1503960366, 150396036…
## $ date                     <date> 2016-04-12, 2016-04-13, 2016-04-15, 2016-04-…
## $ totalsteps               <dbl> 13162, 10735, 9762, 12669, 9705, 15506, 10544…
## $ totaldistance            <dbl> 8.50, 6.97, 6.28, 8.16, 6.48, 9.88, 6.68, 6.3…
## $ trackerdistance          <dbl> 8.50, 6.97, 6.28, 8.16, 6.48, 9.88, 6.68, 6.3…
## $ loggedactivitiesdistance <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ veryactivedistance       <dbl> 1.88, 1.57, 2.14, 2.71, 3.19, 3.53, 1.96, 1.3…
## $ moderatelyactivedistance <dbl> 0.55, 0.69, 1.26, 0.41, 0.78, 1.32, 0.48, 0.3…
## $ lightactivedistance      <dbl> 6.06, 4.71, 2.83, 5.04, 2.51, 5.03, 4.24, 4.6…
## $ sedentaryactivedistance  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ veryactiveminutes        <dbl> 25, 21, 29, 36, 38, 50, 28, 19, 41, 39, 73, 3…
## $ fairlyactiveminutes      <dbl> 13, 19, 34, 10, 20, 31, 12, 8, 21, 5, 14, 23,…
## $ lightlyactiveminutes     <dbl> 328, 217, 209, 221, 164, 264, 205, 211, 262, …
## $ sedentaryminutes         <dbl> 728, 776, 726, 773, 539, 775, 818, 838, 732, …
## $ calories                 <dbl> 1985, 1797, 1745, 1863, 1728, 2035, 1786, 177…
## $ totalsleeprecords        <dbl> 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ totalminutesasleep       <dbl> 327, 384, 412, 340, 700, 304, 360, 325, 361, …
## $ totaltimeinbed           <dbl> 346, 407, 442, 367, 712, 320, 377, 364, 384, …
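
By default merge() performs an inner join, which is why the merged table has 410 rows rather than the 940 rows of the activity table: days without a sleep record are dropped. The sketch below shows the difference between an inner and a left join on a made-up pair of tables (values are illustrative only).

```r
# Toy tables: three activity days, but sleep logged on only two of them
toy_activity <- data.frame(
  id         = c(1, 1, 2),
  date       = as.Date(c("2016-04-12", "2016-04-13", "2016-04-12")),
  totalsteps = c(13162, 10735, 4000)
)
toy_sleep <- data.frame(
  id                 = c(1, 2),
  date               = as.Date(c("2016-04-12", "2016-04-12")),
  totalminutesasleep = c(327, 384)
)

# Inner join (the default): only days present in both tables
inner <- merge(toy_activity, toy_sleep, by = c("id", "date"))

# Left join: keep every activity day, with NA where sleep is missing
left <- merge(toy_activity, toy_sleep, by = c("id", "date"), all.x = TRUE)

print(nrow(inner))
print(nrow(left))
```

The inner join is a reasonable choice for comparing steps with sleep, but any average computed from the merged table describes only the days on which sleep was also recorded.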

4. Data Analysis and Visualization

We will now extract insights about how Fitbit users use their devices, so that Bellabeat can identify trends in the market.

According to 10000Steps (2022), activity trackers provide data that enables you to become aware of your physical activity levels, work towards a goal, and monitor progress. Studies using the 10,000 steps per day goal have shown weight loss, improved glucose tolerance, and reduced blood pressure from increased physical activity toward achieving this goal. Pedometer indices have been developed to provide a guideline on steps and activity levels; the user classification in Section 4.1 is based on such thresholds.

Although the program promotes the goal of reaching 10,000 steps each day for healthy adults, this goal is not universally appropriate across all ages and levels of physical function. For some groups, such as the elderly and children, a goal of 10,000 steps may not be suitable. Your individual step goal should be based on current activity levels and overall health and fitness goals. For people who normally do fewer than 10,000 steps, increasing daily activity by 1,000 to 2,000 steps per day will provide health benefits.

4.1. Correlation Between Steps & Calories Burned

ggplot(data = activity, aes(x=totalsteps, y=calories, fill = totalsteps))+
  geom_point() + geom_smooth() + labs(title = "Total Steps vs Calories")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

The plot shows a positive correlation between total steps and calories: the more steps users take, the more calories they tend to burn.
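
A scatter plot shows the direction of a relationship but not its strength. One way to quantify it is the Pearson correlation coefficient, sketched below on made-up numbers. These values are illustrative assumptions and not results from the Fitbit data; with the real data the same calls would take the totalsteps and calories columns.

```r
# Made-up values, for illustration only
toy_steps    <- c(2000, 4500, 7000, 9500, 12000, 15000)
toy_calories <- c(1500, 1750, 1900, 2200, 2300, 2800)

# Pearson correlation: +1 is a perfect increasing line, 0 is no linear relation
r <- cor(toy_steps, toy_calories)

# cor.test() adds a p-value and a confidence interval for the coefficient
test <- cor.test(toy_steps, toy_calories)

print(round(r, 3))
print(test$p.value)
```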

daily_average <- activity_sleep %>%
  group_by(id) %>%
  summarise (mean_daily_steps = mean(totalsteps), mean_daily_calories = mean(calories), mean_daily_sleep = mean(totalminutesasleep))

head(daily_average)
## # A tibble: 6 × 4
##           id mean_daily_steps mean_daily_calories mean_daily_sleep
##        <dbl>            <dbl>               <dbl>            <dbl>
## 1 1503960366           12406.               1872.             360.
## 2 1644430081            7968.               2978.             294 
## 3 1844505072            3477                1676.             652 
## 4 1927972279            1490                2316.             417 
## 5 2026352035            5619.               1541.             506.
## 6 2320127002            5079                1804               61

Now let us classify our users by their average daily steps:

user_type <- daily_average %>%
  mutate(user_type = case_when(
    mean_daily_steps < 5000 ~ "sedentary",
    mean_daily_steps >= 5000 & mean_daily_steps < 7500 ~ "lightly active", 
    mean_daily_steps >= 7500 & mean_daily_steps < 10000 ~ "fairly active", 
    mean_daily_steps >= 10000 ~ "very active"
  ))

head(user_type)
## # A tibble: 6 × 5
##           id mean_daily_steps mean_daily_calories mean_daily_sleep user_type    
##        <dbl>            <dbl>               <dbl>            <dbl> <chr>        
## 1 1503960366           12406.               1872.             360. very active  
## 2 1644430081            7968.               2978.             294  fairly active
## 3 1844505072            3477                1676.             652  sedentary    
## 4 1927972279            1490                2316.             417  sedentary    
## 5 2026352035            5619.               1541.             506. lightly acti…
## 6 2320127002            5079                1804               61  lightly acti…
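
Thresholds written as pairs of conditions are easy to leave gaps in, especially when the averages are not whole numbers. One alternative, sketched here in base R with made-up step averages, is cut(): it takes a single vector of break points, so every value falls into exactly one class.

```r
# Made-up averages chosen to sit on and around the class boundaries
mean_daily_steps <- c(1490, 4999.9, 5000, 7499.5, 7500, 9999.9, 10000, 12406)

user_type <- cut(
  mean_daily_steps,
  breaks = c(-Inf, 5000, 7500, 10000, Inf),
  labels = c("sedentary", "lightly active", "fairly active", "very active"),
  right  = FALSE   # intervals are [lower, upper): 5000 counts as lightly active
)

print(data.frame(mean_daily_steps, user_type))
```

The result is a factor whose levels are already ordered from least to most active, which is convenient for plotting.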

4.2. Types of Users

Now that we have a new column with the user type we will create a data frame with the percentage of each user type to better visualize them on a graph.

user_type_percent <- user_type %>%
  group_by(user_type) %>%
  summarise(total = n()) %>%
  mutate(totals = sum(total)) %>%
  group_by(user_type) %>%
  summarise(total_percent = total / totals) %>%
  mutate(labels = scales::percent(total_percent))

user_type_percent$user_type <- factor(user_type_percent$user_type , levels = c("very active", "fairly active", "lightly active", "sedentary"))


head(user_type_percent)
## # A tibble: 4 × 3
##   user_type      total_percent labels
##   <fct>                  <dbl> <chr> 
## 1 fairly active          0.375 38%   
## 2 lightly active         0.208 21%   
## 3 sedentary              0.208 21%   
## 4 very active            0.208 21%
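
The same kind of share table can be produced more compactly. The sketch below uses base R's table() and prop.table() on a made-up vector of user types (the values are illustrative and not the study's results).

```r
# Made-up classification of eight users
toy_types <- c("very active", "fairly active", "fairly active", "sedentary",
               "lightly active", "fairly active", "sedentary", "very active")

# table() counts each class; prop.table() turns the counts into shares
shares <- prop.table(table(toy_types))

# Percentage labels with one decimal place
labels <- sprintf("%.1f%%", 100 * as.numeric(shares))

print(shares)
print(labels)
```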

Below we can see that users are fairly evenly distributed across activity levels based on their daily step counts. This suggests that users of all activity levels wear smart devices.

user_type_percent %>%
  ggplot(aes(x="",y=total_percent, fill=user_type)) +
  geom_bar(stat = "identity", width = 1)+
  coord_polar("y", start=0)+
  theme_minimal()+
  theme(axis.title.x= element_blank(),
        axis.title.y = element_blank(),
        panel.border = element_blank(), 
        panel.grid = element_blank(), 
        axis.ticks = element_blank(),
        axis.text.x = element_blank(),
        plot.title = element_text(hjust = 0.5, size=14, face = "bold")) +
  scale_fill_manual(values = c("#85e085","#e6e600", "#ffd480", "#ff8080")) +
  geom_text(aes(label = labels),
            position = position_stack(vjust = 0.5))+
  labs(title="User type distribution")

4.3. Steps and minutes asleep per weekday

We now want to know on which days of the week users are more active and on which days they sleep more. We will also check whether users walk the recommended number of steps and get the recommended amount of sleep.

Below we derive the weekday from our date column, then calculate the average steps walked and minutes slept per weekday.

weekday_steps_sleep <- activity_sleep %>%
  mutate(weekday = weekdays(date))

weekday_steps_sleep$weekday <-ordered(weekday_steps_sleep$weekday, levels=c("Monday", "Tuesday", "Wednesday", "Thursday",
"Friday", "Saturday", "Sunday"))

weekday_steps_sleep <- weekday_steps_sleep %>%
  group_by(weekday) %>%
  summarize (daily_steps = mean(totalsteps), daily_sleep = mean(totalminutesasleep))

head(weekday_steps_sleep)
## # A tibble: 6 × 3
##   weekday   daily_steps daily_sleep
##   <ord>           <dbl>       <dbl>
## 1 Monday          9273.        420.
## 2 Tuesday         9183.        405.
## 3 Wednesday       8023.        435.
## 4 Thursday        8184.        401.
## 5 Friday          7901.        405.
## 6 Saturday        9871.        419.
ggarrange(
    ggplot(weekday_steps_sleep) +
      geom_col(aes(weekday, daily_steps), fill = "#006699") +
      geom_hline(yintercept = 7500) +
      labs(title = "Daily steps per weekday", x= "", y = "") +
      theme(axis.text.x = element_text(angle = 45,vjust = 0.5, hjust = 1)),
    ggplot(weekday_steps_sleep, aes(weekday, daily_sleep)) +
      geom_col(fill = "#85e0e0") +
      geom_hline(yintercept = 480) +
      labs(title = "Minutes asleep per weekday", x= "", y = "") +
      theme(axis.text.x = element_text(angle = 45,vjust = 0.5, hjust = 1))
  )

From the graphs above we can determine the following:

  • On every day except Sunday, users walk at least the recommended 7,500 steps.

  • On average, users sleep less than the recommended 8 hours (480 minutes).
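
One caveat about the weekday calculation: weekdays() returns day names in the language of the current session, so the hard-coded English levels would produce NA on a non-English system. A locale-independent sketch, assuming only base R, derives the ISO weekday number first and maps it to labels afterwards.

```r
dates <- as.Date(c("2016-04-12", "2016-04-16", "2016-04-17"))

# %u is the ISO 8601 weekday number: 1 = Monday ... 7 = Sunday
day_number <- as.integer(format(dates, "%u"))

day_labels <- c("Monday", "Tuesday", "Wednesday", "Thursday",
                "Friday", "Saturday", "Sunday")

# Ordered factor, so plots run from Monday to Sunday
weekday <- factor(day_labels[day_number], levels = day_labels, ordered = TRUE)

print(data.frame(dates, day_number, weekday))
```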

4.4. Hourly Steps Throughout the Day

Going deeper into our analysis, we want to know exactly when users are most active during the day.

We will use the steps data frame and split its date_time column into separate date and time columns.

head(steps)
## # A tibble: 6 × 3
##           id date_time           steptotal
##        <dbl> <dttm>                  <dbl>
## 1 1503960366 2016-04-12 00:00:00       373
## 2 1503960366 2016-04-12 01:00:00       160
## 3 1503960366 2016-04-12 02:00:00       151
## 4 1503960366 2016-04-12 03:00:00         0
## 5 1503960366 2016-04-12 04:00:00         0
## 6 1503960366 2016-04-12 05:00:00         0
steps <- steps %>%
  separate(date_time, into = c("date", "time"), sep= " ") %>%
  mutate(date = ymd(date)) 
  
head(steps)
## # A tibble: 6 × 4
##           id date       time     steptotal
##        <dbl> <date>     <chr>        <dbl>
## 1 1503960366 2016-04-12 00:00:00       373
## 2 1503960366 2016-04-12 01:00:00       160
## 3 1503960366 2016-04-12 02:00:00       151
## 4 1503960366 2016-04-12 03:00:00         0
## 5 1503960366 2016-04-12 04:00:00         0
## 6 1503960366 2016-04-12 05:00:00         0
steps %>%
  group_by(time) %>%
  summarize(average_steps = mean(steptotal)) %>%
  ggplot() +
  geom_col(mapping = aes(x=time, y = average_steps, fill = average_steps)) + 
  labs(title = "Hourly steps throughout the day", x="", y="") + 
  scale_fill_gradient(low = "green", high = "red")+
  theme(axis.text.x = element_text(angle = 90))

As in Macarena Lacasa (2021), we can see that users are more active between 8 am and 7 pm, walking more steps during lunch time from 12 pm to 2 pm and in the evenings from 5 pm to 7 pm.
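
An alternative to splitting the timestamp into text columns is to extract the hour as a number and average over it. The sketch below does this in base R on three made-up hourly records; the values and the tz = "UTC" setting are assumptions for illustration.

```r
# Made-up hourly records, for illustration only
date_time <- as.POSIXct(
  c("2016-04-12 08:00:00", "2016-04-12 13:00:00", "2016-04-13 13:00:00"),
  tz = "UTC"
)
steptotal <- c(250, 900, 1100)

# Hour of day as an integer from 0 to 23
hour <- as.integer(format(date_time, "%H"))

# Mean steps for each hour of the day, across all dates
average_steps <- tapply(steptotal, hour, mean)

print(average_steps)
```

Keeping the hour numeric means it sorts naturally and gives shorter axis labels than the full "HH:MM:SS" strings.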

4.5. Correlations Between Daily Steps and Daily Sleep

We will now determine whether there is any correlation between the following pairs of variables:

  • Daily steps and daily sleep
  • Daily steps and calories
ggarrange(
ggplot(activity_sleep, aes(x=totalsteps, y=totalminutesasleep))+
  geom_jitter() +
  geom_smooth(color = "red") + 
  labs(title = "Daily steps vs Minutes asleep", x = "Daily steps", y= "Minutes asleep") +
   theme(panel.background = element_blank(),
        plot.title = element_text( size=14)), 
ggplot(activity_sleep, aes(x=totalsteps, y=calories))+
  geom_jitter() +
  geom_smooth(color = "red") + 
  labs(title = "Daily steps vs Calories", x = "Daily steps", y= "Calories") +
   theme(panel.background = element_blank(),
        plot.title = element_text( size=14))
)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

From our plots:

  • There is no visible correlation between daily activity level, measured in steps, and the number of minutes users sleep per day.

  • In contrast, there is a positive correlation between steps and calories burned. As expected, the more steps walked, the more calories tend to be burned.

4.6. Use of smart device

4.6.1. Days used smart device

Now that we have seen some trends in activity, sleep, and calories burned, we want to see how often the users in our sample use their device. That way we can plan our marketing strategy and see which features would encourage the use of smart devices.

We will calculate the number of days each user used their smart device, classifying our sample into three categories, given that the date interval is 31 days:

  • high use - users who use their device between 21 and 31 days.
  • moderate use - users who use their device between 11 and 20 days.
  • low use - users who use their device between 1 and 10 days.

First we will create a new data frame, grouping by id, calculating the number of days used, and creating a new column with the classification explained above.

daily_use <- activity_sleep %>%
  group_by(id) %>%
  summarize(days_used = n()) %>%
  mutate(usage = case_when(
    days_used >= 1 & days_used <= 10 ~ "low use",
    days_used >= 11 & days_used <= 20 ~ "moderate use", 
    days_used >= 21 & days_used <= 31 ~ "high use"
  ))
  
head(daily_use)
## # A tibble: 6 × 3
##           id days_used usage   
##        <dbl>     <int> <chr>   
## 1 1503960366        25 high use
## 2 1644430081         4 low use 
## 3 1844505072         3 low use 
## 4 1927972279         5 low use 
## 5 2026352035        28 high use
## 6 2320127002         1 low use

We will now create a percentage data frame to better visualize the results in the graph. We are also ordering our usage levels.

daily_use_percent <- daily_use %>%
  group_by(usage) %>%
  summarise(total = n()) %>%
  mutate(totals = sum(total)) %>%
  group_by(usage) %>%
  summarise(total_percent = total / totals) %>%
  mutate(labels = scales::percent(total_percent))

daily_use_percent$usage <- factor(daily_use_percent$usage, levels = c("high use", "moderate use", "low use"))

head(daily_use_percent)
## # A tibble: 3 × 3
##   usage        total_percent labels
##   <fct>                <dbl> <chr> 
## 1 high use             0.5   50%   
## 2 low use              0.375 38%   
## 3 moderate use         0.125 12%

Now that we have our percentage table, we can plot the results.

daily_use_percent %>%
  ggplot(aes(x="",y=total_percent, fill=usage)) +
  geom_bar(stat = "identity", width = 1)+
  coord_polar("y", start=0)+
  theme_minimal()+
  theme(axis.title.x= element_blank(),
        axis.title.y = element_blank(),
        panel.border = element_blank(), 
        panel.grid = element_blank(), 
        axis.ticks = element_blank(),
        axis.text.x = element_blank(),
        plot.title = element_text(hjust = 0.5, size=14, face = "bold")) +
  geom_text(aes(label = labels),
            position = position_stack(vjust = 0.5))+
  scale_fill_manual(values = c("#006633","#00e673","#80ffbf"),
                    labels = c("High use - 21 to 31 days",
                                 "Moderate use - 11 to 20 days",
                                 "Low use - 1 to 10 days"))+
  labs(title="Daily use of smart device")

Analyzing our results, we can see that:

  • 50% of the users in our sample use their device frequently, between 21 and 31 days.
  • 12% use their device between 11 and 20 days.
  • 38% of our sample rarely use their device, between 1 and 10 days.

5. Conclusion & Recommendation

The findings from the Fitbit dataset suggest a relationship between use of a fitness app and health. A fitness app acts as a motivator and has a close relationship with its users. These findings depict the different activities users engage in and how they track their health trends.

6. References

10000Steps. 2022. Counting Your Steps. https://www.10000steps.org.au/.
Macarena, Lacasa. 2021. Capstone - Case Study Bellabeat. https://www.kaggle.com.