Case Study

This case study is part of the Google Data Analytics Capstone Project and is my very first data analytics project. The program has been an intensive learning process, and this project serves as a culmination of everything I have learnt throughout it.

Table of Contents

  1. Company Overview
    1.1. Products
    1.2. Current Marketing Efforts
  2. Business Task
  3. Data Overview
    3.1. Data Source
    3.2. License and Accessibility
    3.3. Data Organization
    3.4. Data Credibility
    3.5. Data Integrity Verification
  4. Data Cleaning
    4.1. Loading Packages
    4.2. Importing Datasets
    4.3. Renaming Columns
    4.4. Verifying Distinct Values
    4.5. Removing Duplicates and Nulls
    4.6. Changing the Data Format
  5. Analyzing Trends
    5.1. Users’ habit in using the smart device
    5.2. Metrics to determine users’ activity levels
    5.3. Users’ activity level
    5.4. Daily steps in a day
    5.5. Daily steps in days of the week
  6. Recommendations

1. Company Overview

Bellabeat is a high-tech manufacturer of health-focused smart products specialized for women. Collecting data on activity, sleep, stress, and reproductive health has allowed Bellabeat to empower women with knowledge about their own health and habits. Bellabeat is a successful small company, but they have the potential to become a larger player in the global smart device market.

1.1. Products

Bellabeat has several wellness-tracker products.

  • Bellabeat app: The Bellabeat app provides users with health data related to their activity, sleep, stress, menstrual cycle, and mindfulness habits. This data can help users better understand their current habits and make healthy decisions. The Bellabeat app connects to their line of smart wellness products.

  • Leaf: Bellabeat’s classic wellness tracker can be worn as a bracelet, necklace, or clip. The Leaf tracker connects to the Bellabeat app to track activity, sleep, and stress.

  • Time: This wellness watch combines the timeless look of a classic timepiece with smart technology to track user activity, sleep, and stress. The Time watch connects to the Bellabeat app to provide you with insights into your daily wellness.

  • Ivy: A wellness tracker designed as an elegant bracelet that monitors users’ heart rate, respiratory rate, and physical and mental activity. The data is combined and displayed as a wellness score and a readiness score to help users keep track of their daily habits.

  • Spring: This is a water bottle that tracks daily water intake using smart technology to ensure that you are appropriately hydrated throughout the day. The Spring bottle connects to the Bellabeat app to track your hydration levels.

1.2. Current Marketing Efforts

By 2016, Bellabeat had opened offices around the world and launched multiple products. Bellabeat products became available through a growing number of online retailers in addition to their own e-commerce channel on their website. The company has been making several marketing efforts, listed as follows.

  • Investing in traditional advertising media, such as radio, out-of-home billboards, print, and television
  • Investing year-round in Google Search
  • Maintaining active Facebook and Instagram pages
  • Engaging consumers consistently on Twitter
  • Running video ads on YouTube
  • Displaying ads on the Google Display Network

2. Business Task

The cofounder and Chief Creative Officer of Bellabeat believes that analyzing smart device usage data from non-Bellabeat products will provide insights applicable to one of Bellabeat’s products.

This report will attempt to answer the following business questions. The data analysis will focus on these tasks:

  1. Identifying users’ habits in tracking daily activities and recording sleep.
  2. Determining metrics to identify users’ activity levels: intensity value vs. total steps.
  3. Identifying trends in users’ activity levels.
  4. Identifying trends in hourly activity across different user types.
  5. Identifying trends in daily activity on days of the week across different user types.

3. Data Overview

3.1. Data Source

The dataset used in this case study is the FitBit Fitness Tracker Data, made available by Möbius on Kaggle. The dataset was generated by respondents to a distributed survey via Amazon Mechanical Turk between 2016-12-03 and 2016-12-05. Thirty eligible Fitbit users consented to the submission of personal tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring. Variation between outputs reflects the use of different types of Fitbit trackers and individual tracking behaviors and preferences.

3.2. License and Accessibility

This dataset is listed as CC0: Public Domain. It is available for use publicly without requiring any permission.

3.3. Data Organization

The dataset contains eighteen tables. Depending on the table, the data is stored in either long or wide format. The dataset provides daily, hourly, and minute-level data for calories, intensities, and sleep.
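
To illustrate the difference, here is a minimal sketch (using made-up columns, not the actual files) of how a wide table with one column per minute could be reshaped into a long table with one row per minute using tidyr’s pivot_longer():

library(tidyr)

# Hypothetical wide layout: one column per minute of the hour (illustrative only)
wide_example <- data.frame(
  Id           = c(1503960366, 1503960366),
  ActivityHour = c("4/12/2016 12:00:00 AM", "4/12/2016 1:00:00 AM"),
  Steps00      = c(10, 0),
  Steps01      = c(25, 4)
)

# Reshape to long format: one row per (Id, ActivityHour, minute) combination
long_example <- pivot_longer(
  wide_example,
  cols      = starts_with("Steps"),
  names_to  = "minute",
  values_to = "steps"
)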

3.4. Data Credibility

Data credibility determines whether a data source is good or bad. A good data source must be reliable, original, comprehensive, current, and cited. Metadata (data about the data) usually provides this level of information. Credibility issues typically require collecting additional data, completing missing variables, and gathering data from different sources. The following table lists some issues with this dataset and how an analyst could address them.

| Credibility Issue | Solution |
| --- | --- |
| Incomplete demographic data such as gender, weight, and age, which are important variables determining activity behavior | Complete the missing demographic data from the sample users |
| Sample bias: there is no evidence that the data came only from female users | Confirm the gender of the sample users and remove samples that do not match the target consumer population |
| Small sample size of 33 users over 31 days | Collect additional data from more users over a longer period |
| The data was collected by another party and then made public | Second-party data should be verified and confirmed with the data vendor to ensure integrity and completeness |
| The data is not current | Collect and use the most recent data available |

Based on the issues stated above, the data credibility is assessed as poor. However, obtaining data with good credibility takes effort and time, and it is the analyst’s judgment call whether the existing data is sufficiently credible to answer the business task. This limitation should be kept in mind when drawing conclusions from this analysis exercise.

3.5. Data Integrity Verification

Data integrity deals with the accuracy, completeness, consistency, and trustworthiness of data. It encompasses data quality, data security, and the ability of the data to answer the required business questions. To address data integrity, an analyst has an arsenal of tools to verify that the data is complete and free of errors such as null values or multiple data types in a column. Below is a comparison table describing the datasets used:

| Table Name | Description | Total Observations | Variables | Integrity Status |
| --- | --- | --- | --- | --- |
| dailyActivity_merged | Daily activity of 33 users over 31 days | 940 | Number of steps, distance, active minutes, calories | No duplicates, no nulls, barely sufficient data, aligned with objectives |
| sleepDay_merged | Daily sleep records of 24 users over 31 days | 410 | Sleep count, total minutes asleep, total minutes in bed | Duplicates present, no nulls, insufficient data, aligned with objectives |
| hourlySteps_merged | Hourly step counts of 33 users over 31 days | 22,099 | Total steps per hour per day | No duplicates, no nulls, barely sufficient data, aligned with objectives |

The data integrity of this dataset appears good: most of the tables have no duplicates, and duplicate rows can be removed easily in R. The datasets are also compatible with one another, allowing merging and comparison across the tables. This enables a more comprehensive analysis relating users’ activity, intensities, and calories burned.
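
As a quick cross-check, here is a minimal sketch, assuming the three CSV files are in the working directory, of how the observation counts, duplicates, and nulls in the table above could be reproduced (the same checks are performed step by step in Section 4):

library(tidyverse)

# Summarize row count, distinct users, duplicates, and nulls for each table
check_table <- function(path) {
  df <- read_csv(path, show_col_types = FALSE)
  tibble(
    table        = path,
    observations = nrow(df),
    users        = n_distinct(df$Id),
    duplicates   = sum(duplicated(df)),
    nulls        = sum(is.na(df))
  )
}

map_dfr(
  c("dailyActivity_merged.csv", "sleepDay_merged.csv", "hourlySteps_merged.csv"),
  check_table
)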

4. Data Cleaning

From this section onward, the data is processed using R in RStudio. R is an exceptional tool for data exploration, cleaning, analysis, and visualization because its packages provide a wide variety of functions for these tasks. It can handle large amounts of data, record every data manipulation step in a script, and conveniently compile the results into a document (like this one).

4.1. Loading Packages

The following packages contain useful functions to produce insightful trends out of the data.

  • tidyverse: an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures. It loads multiple packages at once, with ggplot2, dplyr, and tibble as some of the most popular ones.
  • lubridate: makes it easier to work with dates and times.
  • skimr: provides summary statistics about variables in data frames, tibbles, data tables and vectors.
  • janitor: has simple functions for examining and cleaning dirty data.
  • ggalluvial: a ggplot2 extension for producing alluvial plots.
  • ggpubr: provides some easy-to-use functions for creating and customizing ggplot2-based plots.
install.packages("ggpubr", repos = "http://cran.us.r-project.org")
## 
## The downloaded binary packages are in
##  /var/folders/83/h8ylmntn5xq3lf6j12yhgy440000gn/T//RtmpLKvYu5/downloaded_packages
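
The remaining packages were presumably already installed locally; a minimal sketch for installing any listed package that is still missing before loading:

# Install any required package that is not yet available locally
required_pkgs <- c("tidyverse", "lubridate", "skimr", "janitor", "ggalluvial", "ggpubr")
missing_pkgs  <- setdiff(required_pkgs, rownames(installed.packages()))
if (length(missing_pkgs) > 0) {
  install.packages(missing_pkgs, repos = "http://cran.us.r-project.org")
}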
library(tidyverse)
library(lubridate)
library(skimr)
library(janitor)
library(ggalluvial)
library(ggpubr)

4.2. Importing Datasets

Datasets are imported and assigned to their respective variable names for easier manipulation. The datasets are:

  • dailyActivity_merged.csv
  • sleepDay_merged.csv
  • hourlySteps_merged.csv

daily_activity_df <- read_csv("dailyActivity_merged.csv", show_col_types = FALSE)
sleep_day_df <- read_csv("sleepDay_merged.csv", show_col_types = FALSE)
hourly_steps_df <- read_csv("hourlySteps_merged.csv", show_col_types = FALSE)
head(daily_activity_df)
## # A tibble: 6 × 15
##       Id Activ…¹ Total…² Total…³ Track…⁴ Logge…⁵ VeryA…⁶ Moder…⁷ Light…⁸ Seden…⁹
##    <dbl> <chr>     <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
## 1 1.50e9 4/12/2…   13162    8.5     8.5        0    1.88   0.550    6.06       0
## 2 1.50e9 4/13/2…   10735    6.97    6.97       0    1.57   0.690    4.71       0
## 3 1.50e9 4/14/2…   10460    6.74    6.74       0    2.44   0.400    3.91       0
## 4 1.50e9 4/15/2…    9762    6.28    6.28       0    2.14   1.26     2.83       0
## 5 1.50e9 4/16/2…   12669    8.16    8.16       0    2.71   0.410    5.04       0
## 6 1.50e9 4/17/2…    9705    6.48    6.48       0    3.19   0.780    2.51       0
## # … with 5 more variables: VeryActiveMinutes <dbl>, FairlyActiveMinutes <dbl>,
## #   LightlyActiveMinutes <dbl>, SedentaryMinutes <dbl>, Calories <dbl>, and
## #   abbreviated variable names ¹​ActivityDate, ²​TotalSteps, ³​TotalDistance,
## #   ⁴​TrackerDistance, ⁵​LoggedActivitiesDistance, ⁶​VeryActiveDistance,
## #   ⁷​ModeratelyActiveDistance, ⁸​LightActiveDistance, ⁹​SedentaryActiveDistance
head(sleep_day_df)
## # A tibble: 6 × 5
##           Id SleepDay              TotalSleepRecords TotalMinutesAsleep TotalT…¹
##        <dbl> <chr>                             <dbl>              <dbl>    <dbl>
## 1 1503960366 4/12/2016 12:00:00 AM                 1                327      346
## 2 1503960366 4/13/2016 12:00:00 AM                 2                384      407
## 3 1503960366 4/15/2016 12:00:00 AM                 1                412      442
## 4 1503960366 4/16/2016 12:00:00 AM                 2                340      367
## 5 1503960366 4/17/2016 12:00:00 AM                 1                700      712
## 6 1503960366 4/19/2016 12:00:00 AM                 1                304      320
## # … with abbreviated variable name ¹​TotalTimeInBed
head(hourly_steps_df)
## # A tibble: 6 × 3
##           Id ActivityHour          StepTotal
##        <dbl> <chr>                     <dbl>
## 1 1503960366 4/12/2016 12:00:00 AM       373
## 2 1503960366 4/12/2016 1:00:00 AM        160
## 3 1503960366 4/12/2016 2:00:00 AM        151
## 4 1503960366 4/12/2016 3:00:00 AM          0
## 5 1503960366 4/12/2016 4:00:00 AM          0
## 6 1503960366 4/12/2016 5:00:00 AM          0

4.3. Renaming Columns

According to generic naming conventions in R, it is good practice to use only lowercase letters, numbers, and underscores in column names. The clean_names() function does this particular job efficiently. Compare the column names before and after clean_names() is executed.

Before clean_names() is applied.

colnames(daily_activity_df)
##  [1] "Id"                       "ActivityDate"            
##  [3] "TotalSteps"               "TotalDistance"           
##  [5] "TrackerDistance"          "LoggedActivitiesDistance"
##  [7] "VeryActiveDistance"       "ModeratelyActiveDistance"
##  [9] "LightActiveDistance"      "SedentaryActiveDistance" 
## [11] "VeryActiveMinutes"        "FairlyActiveMinutes"     
## [13] "LightlyActiveMinutes"     "SedentaryMinutes"        
## [15] "Calories"
colnames(sleep_day_df)
## [1] "Id"                 "SleepDay"           "TotalSleepRecords" 
## [4] "TotalMinutesAsleep" "TotalTimeInBed"
colnames(hourly_steps_df)
## [1] "Id"           "ActivityHour" "StepTotal"

After clean_names() is applied

daily_activity_df <- clean_names(daily_activity_df)
sleep_day_df <- clean_names(sleep_day_df)
hourly_steps_df <- clean_names(hourly_steps_df)

colnames(daily_activity_df)
##  [1] "id"                         "activity_date"             
##  [3] "total_steps"                "total_distance"            
##  [5] "tracker_distance"           "logged_activities_distance"
##  [7] "very_active_distance"       "moderately_active_distance"
##  [9] "light_active_distance"      "sedentary_active_distance" 
## [11] "very_active_minutes"        "fairly_active_minutes"     
## [13] "lightly_active_minutes"     "sedentary_minutes"         
## [15] "calories"
colnames(sleep_day_df)
## [1] "id"                   "sleep_day"            "total_sleep_records" 
## [4] "total_minutes_asleep" "total_time_in_bed"
colnames(hourly_steps_df)
## [1] "id"            "activity_hour" "step_total"

4.4. Verifying Distinct Values

As noted in the data overview, daily_activity_df has 33 distinct users and 31 days of activity data, sleep_day_df has 24 users and 31 days, and hourly_steps_df has 33 users and 736 distinct hours of data.

n_distinct(daily_activity_df$id)
## [1] 33
n_distinct(daily_activity_df$activity_date)
## [1] 31
n_distinct(sleep_day_df$id)
## [1] 24
n_distinct(sleep_day_df$sleep_day)
## [1] 31
n_distinct(hourly_steps_df$id)
## [1] 33
n_distinct(hourly_steps_df$activity_hour)
## [1] 736

4.5. Removing Duplicates and Nulls

The first step in any data analysis workflow is to check for duplicates and nulls in the data. The sum(duplicated()) call counts duplicate rows, while colSums(is.na()) counts the nulls in each column of the tables.

sum(duplicated(daily_activity_df))
## [1] 0
sum(duplicated(sleep_day_df))
## [1] 3
sum(duplicated(hourly_steps_df))
## [1] 0
colSums(is.na(daily_activity_df))
##                         id              activity_date 
##                          0                          0 
##                total_steps             total_distance 
##                          0                          0 
##           tracker_distance logged_activities_distance 
##                          0                          0 
##       very_active_distance moderately_active_distance 
##                          0                          0 
##      light_active_distance  sedentary_active_distance 
##                          0                          0 
##        very_active_minutes      fairly_active_minutes 
##                          0                          0 
##     lightly_active_minutes          sedentary_minutes 
##                          0                          0 
##                   calories 
##                          0
colSums(is.na(sleep_day_df))
##                   id            sleep_day  total_sleep_records 
##                    0                    0                    0 
## total_minutes_asleep    total_time_in_bed 
##                    0                    0
colSums(is.na(hourly_steps_df))
##            id activity_hour    step_total 
##             0             0             0

As shown above, only the sleep_day_df table contains duplicate entries, and none of the three tables contains null values. Duplicates can be removed easily using the distinct() function. We can then verify that the duplicates have been removed.

sleep_day_df <- sleep_day_df %>% distinct()
sum(duplicated(sleep_day_df))
## [1] 0

4.6. Changing the Data Format

Notice that in daily_activity_df, id is stored in a number format <dbl> and activity_date in a character format <chr>. Analysis is easier down the road if we convert these to more appropriate formats. Since id acts as an identifier, it can be converted to <chr>, while the activity_date column can be converted to a date format to allow date manipulation and calculation. We will also rename the activity_date column to date for conciseness.

head(daily_activity_df)
## # A tibble: 6 × 15
##       id activ…¹ total…² total…³ track…⁴ logge…⁵ very_…⁶ moder…⁷ light…⁸ seden…⁹
##    <dbl> <chr>     <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
## 1 1.50e9 4/12/2…   13162    8.5     8.5        0    1.88   0.550    6.06       0
## 2 1.50e9 4/13/2…   10735    6.97    6.97       0    1.57   0.690    4.71       0
## 3 1.50e9 4/14/2…   10460    6.74    6.74       0    2.44   0.400    3.91       0
## 4 1.50e9 4/15/2…    9762    6.28    6.28       0    2.14   1.26     2.83       0
## 5 1.50e9 4/16/2…   12669    8.16    8.16       0    2.71   0.410    5.04       0
## 6 1.50e9 4/17/2…    9705    6.48    6.48       0    3.19   0.780    2.51       0
## # … with 5 more variables: very_active_minutes <dbl>,
## #   fairly_active_minutes <dbl>, lightly_active_minutes <dbl>,
## #   sedentary_minutes <dbl>, calories <dbl>, and abbreviated variable names
## #   ¹​activity_date, ²​total_steps, ³​total_distance, ⁴​tracker_distance,
## #   ⁵​logged_activities_distance, ⁶​very_active_distance,
## #   ⁷​moderately_active_distance, ⁸​light_active_distance,
## #   ⁹​sedentary_active_distance
daily_activity_df <- daily_activity_df %>% 
  mutate(id = as.character(id)) %>% 
  rename(date = activity_date) %>% 
  mutate(date = as_date(date, format="%m/%d/%Y"))

head(daily_activity_df)
## # A tibble: 6 × 15
##   id         date       total_…¹ total…² track…³ logge…⁴ very_…⁵ moder…⁶ light…⁷
##   <chr>      <date>        <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
## 1 1503960366 2016-04-12    13162    8.5     8.5        0    1.88   0.550    6.06
## 2 1503960366 2016-04-13    10735    6.97    6.97       0    1.57   0.690    4.71
## 3 1503960366 2016-04-14    10460    6.74    6.74       0    2.44   0.400    3.91
## 4 1503960366 2016-04-15     9762    6.28    6.28       0    2.14   1.26     2.83
## 5 1503960366 2016-04-16    12669    8.16    8.16       0    2.71   0.410    5.04
## 6 1503960366 2016-04-17     9705    6.48    6.48       0    3.19   0.780    2.51
## # … with 6 more variables: sedentary_active_distance <dbl>,
## #   very_active_minutes <dbl>, fairly_active_minutes <dbl>,
## #   lightly_active_minutes <dbl>, sedentary_minutes <dbl>, calories <dbl>, and
## #   abbreviated variable names ¹​total_steps, ²​total_distance,
## #   ³​tracker_distance, ⁴​logged_activities_distance, ⁵​very_active_distance,
## #   ⁶​moderately_active_distance, ⁷​light_active_distance

We apply the same transformations to the other two data frames.

sleep_day_df <- sleep_day_df %>% 
  mutate(id = as.character(id)) %>% 
  rename(date = sleep_day) %>% 
  mutate(date = as_date(date, format="%m/%d/%Y"))


head(sleep_day_df)
## # A tibble: 6 × 5
##   id         date       total_sleep_records total_minutes_asleep total_time_in…¹
##   <chr>      <date>                   <dbl>                <dbl>           <dbl>
## 1 1503960366 2016-04-12                   1                  327             346
## 2 1503960366 2016-04-13                   2                  384             407
## 3 1503960366 2016-04-15                   1                  412             442
## 4 1503960366 2016-04-16                   2                  340             367
## 5 1503960366 2016-04-17                   1                  700             712
## 6 1503960366 2016-04-19                   1                  304             320
## # … with abbreviated variable name ¹​total_time_in_bed
hourly_steps_df <- hourly_steps_df %>% 
  mutate(id = as.character(id)) %>% 
  rename(date_time = activity_hour) %>% 
  mutate(date_time = as.POSIXct(date_time, format ="%m/%d/%Y %I:%M:%S %p", tz=Sys.timezone()))

head(hourly_steps_df)
## # A tibble: 6 × 3
##   id         date_time           step_total
##   <chr>      <dttm>                   <dbl>
## 1 1503960366 2016-04-12 00:00:00        373
## 2 1503960366 2016-04-12 01:00:00        160
## 3 1503960366 2016-04-12 02:00:00        151
## 4 1503960366 2016-04-12 03:00:00          0
## 5 1503960366 2016-04-12 04:00:00          0
## 6 1503960366 2016-04-12 05:00:00          0

6. Recommendations

| Takeaway | Recommendation |
| --- | --- |
| Almost 88% of users wear their smart devices almost every day | Emphasize advertising for Bellabeat’s Time product as an ideal smart device to track a user’s activity throughout the day. Since the Time device also connects to the Bellabeat app, targeted advertisements could be created to market the other products |
| The majority of users do not use their smart device for sleep tracking | This could be a potential market to exploit. The Bellabeat Leaf bracelet could be a game changer in shifting the consumer mindset about wearing a smart device to sleep, as its bracelet form factor is unobtrusive |
| Over 97% of users use their device for more than half the day | We can tap into this large pool of users to offer more value-added products, as they are connected to their smart devices for most of their waking time |
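
The usage figures above come from the trend analysis. As a minimal sketch, assuming a day counts as device use whenever at least one step was recorded and using illustrative cut-offs of 21 and 11 days, a usage-frequency breakdown behind a figure like the 88% could be computed from the cleaned daily_activity_df as follows:

# Sketch: classify users by how many days they recorded any steps (assumed usage definition)
usage_by_user <- daily_activity_df %>%
  filter(total_steps > 0) %>%
  group_by(id) %>%
  summarise(days_used = n_distinct(date)) %>%
  mutate(usage_level = case_when(
    days_used >= 21 ~ "high use",      # assumed cut-off: 21-31 days
    days_used >= 11 ~ "moderate use",  # assumed cut-off: 11-20 days
    TRUE            ~ "low use"        # 1-10 days
  ))

# Share of users in each usage level
usage_by_user %>%
  count(usage_level) %>%
  mutate(share = n / sum(n))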