About

This is a capstone project I have performed in RStudio being part of my Google Data Analytics Professional Certificate course.

Scenerio

You are a junior data analyst working on the marketing analyst team at Bellabeat, a high-tech manufacturer of health-focused products for women. Bellabeat is a successful small company, but they have the potential to become a larger player in the global smart device market. Urška Sršen, cofounder and Chief Creative Officer of Bellabeat, believes that analyzing smart device fitness data could help unlock new growth opportunities for the company. You have been asked to focus on one of Bellabeat’s products and analyze smart device data to gain insight into how consumers are using their smart devices. The insights you discover will then help guide marketing strategy for the company. You will present your analysis to the Bellabeat executive team along with your high-level recommendations for Bellabeat’s marketing strategy.

Business task

Analyze smart device usage data in order to gain insight into how consumers use non-Bellabeat smart devices then apply these insights to one of Bellabeat’s products.

Stakeholders

Data source

The data is located here.

The data is of Fitbit Fitness Tracker stored in public domain in Kaggle. The author is Mobius. About 30 Fitbit users allowed collection of personal tracker data on their active time, sleep and heart rate for a period of one month (04/12/2016-05/12/2016). A total of 18 files are available in csv format.

How does the data help in answering the business questions?

  • The reliability of the data is low for it does not represent the whole population of smart gadget users and doesn’t specify their gender. Therefore it’s likely that the users could be a mix of male and females. Whereas, the insights or trends derived from this data would be recommended for the marketing campaigns targeting women.
  • The originality of the data is low for it is a third party data collected using Amazon Mechanical Turk.
  • The data is averagely comprehensive because it contains multiple fields on daily activity intensity, calories burned, daily steps taken, daily sleep time and weight record.
  • The data is low in terms of how current. It was collected in April through May of 2016, meaning that users’ habits might have changed over the years.
  • Data citation is low for it was collected by a third party, therefore, key descriptive information about the data is unknown.

Analysis process

Loading needed library

library('tidyverse')
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library('here')
## here() starts at C:/Users/PC/OneDrive/Documents/My Coursera/Capstone Project
library('skimr')
library('janitor')
## 
## Attaching package: 'janitor'
## 
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test
library('DT')

Importing dataset

daily_activity <- read_csv("C:/Users/PC/OneDrive/Documents/My Coursera/Capstone Project/Fitbit Fitness Tracker/dailyActivity_merged.csv")
sleep_day <- read_csv("C:/Users/PC/OneDrive/Documents/My Coursera/Capstone Project/Fitbit Fitness Tracker/sleepDay_merged.csv")
heartrate_seconds <- read_csv("C:/Users/PC/OneDrive/Documents/My Coursera/Capstone Project/Fitbit Fitness Tracker/heartrate_seconds_merged.csv")
weight_log_info <- read_csv("C:/Users/PC/OneDrive/Documents/My Coursera/Capstone Project/Fitbit Fitness Tracker/weightLogInfo_merged.csv")

A brief display of the data table

datatable(head(daily_activity, 10), class = 'display',
          options = list(pageLength = 5, dom = 'tip', scrollX = TRUE),
          rownames = FALSE)

Convert data types as appropriate

daily_activity$ActivityDate <- mdy(daily_activity$ActivityDate)
sleep_day$SleepDay <- mdy_hms(sleep_day$SleepDay, tz=Sys.timezone())
heartrate_seconds$Time <- mdy_hms(heartrate_seconds$Time, tz=Sys.timezone())

Change column names to lower case, add underscore & ensure uniqueness

# Store the result in the existing table
daily_activity <- clean_names(daily_activity)
sleep_day <- clean_names(sleep_day)
heartrate_seconds <- clean_names(heartrate_seconds)

Exploring the dataset

Confirm the numbers of unique IDs in each table

n_distinct(daily_activity$id)
## [1] 33
n_distinct(sleep_day$id)
## [1] 24
n_distinct(weight_log_info$Id)
## [1] 8
n_distinct(heartrate_seconds$id)
## [1] 14

Out of the overall participants which is 33, the weight and heart rate table has just 8 and 14 participants respectively. I am afraid that is too low.

Statistical summary and Visualisation

# Distance per category
daily_activity %>%  
  select(daily_steps,
         very_active_distance,
         moderately_active_distance,
         lightly_active_distance,
         sedentary_distance) %>%
  summary()
##   daily_steps    very_active_distance moderately_active_distance
##  Min.   :    0   Min.   : 0.000       Min.   :0.0000            
##  1st Qu.: 3790   1st Qu.: 0.000       1st Qu.:0.0000            
##  Median : 7406   Median : 0.210       Median :0.2400            
##  Mean   : 7638   Mean   : 1.503       Mean   :0.5675            
##  3rd Qu.:10727   3rd Qu.: 2.053       3rd Qu.:0.8000            
##  Max.   :36019   Max.   :21.920       Max.   :6.4800            
##  lightly_active_distance sedentary_distance
##  Min.   : 0.000          Min.   :0.000000  
##  1st Qu.: 1.945          1st Qu.:0.000000  
##  Median : 3.365          Median :0.000000  
##  Mean   : 3.341          Mean   :0.001606  
##  3rd Qu.: 4.782          3rd Qu.:0.000000  
##  Max.   :10.710          Max.   :0.110000

Highest daily steps taken stands at 36,019 and and on average, it is 7,638

# Minutes per category
daily_activity %>%
  select(very_active_minutes,
         moderately_active_minutes,
         lightly_active_minutes,
         sedentary_minutes) %>%
  summary()
##  very_active_minutes moderately_active_minutes lightly_active_minutes
##  Min.   :  0.00      Min.   :  0.00            Min.   :  0.0         
##  1st Qu.:  0.00      1st Qu.:  0.00            1st Qu.:127.0         
##  Median :  4.00      Median :  6.00            Median :199.0         
##  Mean   : 21.16      Mean   : 13.56            Mean   :192.8         
##  3rd Qu.: 32.00      3rd Qu.: 19.00            3rd Qu.:264.0         
##  Max.   :210.00      Max.   :143.00            Max.   :518.0         
##  sedentary_minutes
##  Min.   :   0.0   
##  1st Qu.: 729.8   
##  Median :1057.5   
##  Mean   : 991.2   
##  3rd Qu.:1229.5   
##  Max.   :1440.0

Average of categories of daily intensity(Distance)

# daily intensity distance
avg_intensity_levels <- daily_activity %>%
  summarise(very_active_distance = mean(very_active_distance),
            moderately_active_distance = mean(moderately_active_distance),
            lightly_active_distance = mean(lightly_active_distance),
            sedentary_distance = mean(sedentary_distance))

Pivot(convert) from wide to long

avg_intensity_levels_long <- avg_intensity_levels %>% 
  pivot_longer(cols = c('very_active_distance', 'moderately_active_distance', 'lightly_active_distance', 'sedentary_distance'),
               names_to = 'avg_intensity_level',
               values_to = 'kilometer')
#Add column for distance in meter
avg_intensity_levels_long <- mutate(avg_intensity_levels_long,
                                    meters = kilometer*1000)
avg_intensity_levels_long
## # A tibble: 4 × 3
##   avg_intensity_level        kilometer  meters
##   <chr>                          <dbl>   <dbl>
## 1 very_active_distance         1.50    1503.  
## 2 moderately_active_distance   0.568    568.  
## 3 lightly_active_distance      3.34    3341.  
## 4 sedentary_distance           0.00161    1.61
  • On average, users cover 1.5km very active
  • 0.57km moderately active
  • 3.3km lightly active
  • Inactive(sedentary) distance had the rest of the time It is clear that users cover more distance while lightly active compared to while very active or moderately active
ggplot(data = avg_intensity_levels_long, aes(x = avg_intensity_level, y = kilometer))+
  geom_col(fill = '#56B4E9', color = 'black')+
  xlab('Intensity distance categories')+
  ylab('distance in km')+
  labs(title = 'Average daily-intensity distance',
     caption = 'FitBit data from Kaggle',
     tag = 'Fig. 1') +
       theme(plot.tag.position = 'bottomleft')

Average of categories of daily intensity(Minutes)

# daily intensity minutes
avg_intensity_mins <- daily_activity %>%
  summarise(very_active_minutes = mean(very_active_minutes),
            moderately_active_minutes = mean(moderately_active_minutes),
            lightly_active_minutes = mean(lightly_active_minutes),
            sedentary_minutes = mean(sedentary_minutes))

Pivot from wide to long

avg_active_mins_long <- avg_intensity_mins %>% 
  pivot_longer(cols = c('very_active_minutes', 'moderately_active_minutes', 'lightly_active_minutes', 'sedentary_minutes'),
               names_to = 'avg_active_times',
               values_to = 'minutes')
# Add column for time in hours
avg_active_mins_long <- mutate(avg_active_mins_long, hours = minutes/60)
avg_active_mins_long
## # A tibble: 4 × 3
##   avg_active_times          minutes  hours
##   <chr>                       <dbl>  <dbl>
## 1 very_active_minutes          21.2  0.353
## 2 moderately_active_minutes    13.6  0.226
## 3 lightly_active_minutes      193.   3.21 
## 4 sedentary_minutes           991.  16.5

Users spent, on average, just 21 minutes very active everyday and 9991 minutes which is equal to over 16 hours being Inactive(sedentary). That is a lot of hours spent being inactive.

Data Viz

ggplot(data = avg_active_mins_long, aes(x = avg_active_times, y = hours))+
  geom_col(fill = '#0072B2', color = 'black')+
  xlab('Time-spent categories')+
  ylab('minutes/hour')+
  labs(title = 'Average daily-intensity minutes',
       caption = 'FitBit data from Kaggle',
       tag = 'Fig. 2') +
  theme(plot.tag.position = 'bottomleft')

Average distance/minutes

distance_per_mins <- avg_intensity_levels_long$meters/avg_active_mins_long$minutes
distance_per_mins
## [1] 70.998743400 41.839071311 17.326752884  0.001620627

Users cover about 71m in 1 very active minute, about 42m in 1 moderately active minute and just 17m in 1 lightly active minute

Data Viz

barplot(distance_per_mins,
        names = c('very_active', 'moderate', 'light', 'sedentary'),
        col="dodgerblue3",
        main = 'Average distance per minutes',
        ylab = 'Meters',
        xlab = 'Minutes',
        sub = 'Fig. 3')

Calories versus daily steps

#Relationship between calories burnt and daily steps per week
daily_activity <- daily_activity %>%
  mutate(week = case_when(
    between(activity_date, as.Date('2016-04-12'), as.Date('2016-04-18'))~ 'week_1',
    between(activity_date, as.Date('2016-04-19'), as.Date('2016-04-25'))~ 'week_2',
    between(activity_date, as.Date('2016-04-26'), as.Date('2016-05-02'))~ 'week_3',
    between(activity_date, as.Date('2016-05-03'), as.Date('2016-05-09'))~ 'week_4',
    TRUE~ 'NA'))

calories_per_step <- daily_activity %>% 
  select(activity_date, daily_steps, calories, week) %>%
  filter(week == 'week_1'|week == 'week_2'|
           week == 'week_3'|week == 'week_4')

Data Viz

#Calories burned per steps taken daily
ggplot(data = calories_per_step, aes(x = daily_steps, y = calories))+
  geom_point(color = 'blue')+
  geom_smooth(color = 'red')+
  facet_wrap(~week)+
  xlab('Average daily steps')+
  ylab('Average kcals')+
  labs(title = 'Average calories burned/steps taken daily',
       caption = 'FitBit data from Kaggle',
       tag = 'Fig. 4') +
  theme(plot.tag.position = 'bottomleft')
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

There is a positive correlation between steps taken daily and calories burned for the data shows that as the daily steps increases, the calories burned increases. Even though the trend line seems to be moving downward towards the peak of the 3rd week, the variable that caused this could be seen as an outlier. There is also a shift during the middle of the 4th week but the trend still continued to move upward. Each week may not follow exact pattern but the overall trend is clearly from the lower left to the upper right of the plot.

Average heart rate versus average sleep time

#Join average heart rate and sleep minutes
heart_rate_sleep <- avg_heart_rate %>% 
  full_join(avg_sleep_mins, by = 'id')

Data Viz

ggplot(data = heart_rate_sleep, aes(x = average_sleep_mins, y = average_heart_rate))+
  geom_point(color = 'blue')+
  labs(title = 'Average heart rate vs sleep time',
       caption = 'FitBit data from Kaggle',
       tag = 'Fig. 6')+
  theme(plot.tag.position = 'bottomleft')+
  xlab('Average sleep minutes--->')+
  ylab('Average heart rate--->')

The data points are scattered and it shows there is no correlation between average sleep time and heart rate but this could be due to insufficient data to form a basis for conclusion.

Relationship between minutes asleep and time in bed

ggplot(data = sleep_day, aes(x = total_time_in_bed, y = total_minutes_asleep))+
  geom_point(color = 'blue')+
  geom_smooth(color = 'red')+
  labs(title = 'Time asleep vs time in bed',
       caption = 'FitBit data from Kaggle',
       tag = 'Fig. 5')+
  theme(plot.tag.position = 'bottomleft')+
  xlab('minutes in bed--->')+
  ylab('sleep minutes--->')
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

There is a positive correlation between time spent in bed and time spent asleep. As the minutes spent in bed increases, the higher the time spent sleeping.

Insight and findings

Recommendations to the stakeholders

Further analysis

Bellabeat should collect data from their app and other products and perform a detailed analysis of the users behavior