In this case study, I will perform a real-world task of a junior data analyst on the marketing analyst team at Bellabeat, a high-tech manufacturer of smart products. Urska Srsen, cofounder and Chief Creative Officer of Bellabeat, believes that analyzing smart device fitness data could help unlock new growth opportunities for the company.
I have been tasked with analyzing smart device data to gain insight into how consumers are using their smart devices. The insights discovered will then help guide marketing strategy for the company.
Bellabeat is a high-tech manufacturer of health-focused smart products for women. Founded in 2013 by Urska Srsen and Sando Mur, the company combines technology with beautiful design to empower women with knowledge about their health. Bellabeat’s products collect data on activity, sleep, stress, and reproductive health to help users make informed decisions about their wellness. Over the years, Bellabeat has grown rapidly and positioned itself as a tech-driven wellness company for women. The company sells its products through its own e-commerce channel and various online retailers. Bellabeat utilizes traditional and digital marketing strategies to engage with consumers.
Analyze consumer data on non-Bellabeat smart device usage to spot trends, identify opportunities growth, and improve Bellabeat marketing strategy.
Srsen provided the following dataset: Fitbit Fitness Tracker Data (CCO: Public Domain, dataset made available via Kaggle user Möbius).
This Kaggle data set contains personal fitness tracker data from thirty Fitbit users. According to Möbius, the dataset was generated by respondents to a distributed survey by Amazon Mechanical Turk between 03.12.2016-05.12.2016. The thirty eligible Fitbit users consented to the submission of personal tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring. It includes information about daily activity, steps, and heart rate that can be used to explore users’ habits.
Reliability: 30 persons responded to a survey. Demographics of participants are unknown. The potential for sampling bias is high, while the sample size is too small.
Original: 30 persons responded to a survey distributed by Amazon Mechanical Turk between 03.12.2016 - 05.12.2016 to generate the data.
Comprehensive: The collection of information includes data on, daily activity, daily steps, daily calorie burn, weight log, and sleep day records.
Current: Since the data was gathered between March 2016 and May 2016, it is not current; as a result, current user habits and behavior may vary.
Cited: The data has a cited source.
I would have liked to have access to more raw data to enhance data integrity. However, most of the raw data that was available was secured behind pay walls in order to access it.
Data is organized in 18 CSV files. It has both long and wide formats.
The data folder (Fitbit Data 4.12.16-5.12.16 2) was downloaded and saved on my desktop.
All 18 CSV files were opened in RStudio and checked for unique participant IDs using ‘n_distinct()’ function:
First, I set the working directory using the ‘setwd()’ function and upload the data to global environment.
Results of unique participant IDs
33 ID: daily_activity, daily_calories, daily_intensities, daily_steps, heartrate_seconds, hourly_calories, hourly_intensities, hourly_steps, minute_calories_narrow, minute_calories_wide, minute_intensities_narrow, minute_intensities_wide, minute_mets_narrow
24 ID: minute_sleep, sleep_day
8 ID: weight_log
For our analysis we will use the following CSV files:
We want to narrow our focus of this analysis to daily activity, daily calories burned, daily intensities, daily steps, duration of activities and time of day. We hypothesis that our customer base have an interest in their health and try to get in moments of brisk exercise during the work week either before work, and/or on lunch breaks, and/or after work. We also expect to see that participants in this study likely have more intense work outs or distance covered on the weekend when they are not working.
rm(heartrate_seconds, hourly_intensities, minute_calories_narrow, minute_calories_wide, minute_intensities_narrow, minute_intensities_wide, minute_mets_narrow, minute_steps_narrow, minute_steps_wide, minute_sleep, weight_log)
The analysis will be done in R and shared with key stakeholders.
First, we will install and load the packages. Then we will explore, transform, and analyze.
Install & Load Packages
library(lubridate)
library(tidyverse)
library(janitor)
library(ggplot2)
library(readr)
library(sqldf)
library(skimr)
library(dplyr)
library(deeptime)
glimpse(daily_activity)
## Rows: 940
## Columns: 15
## $ Id <dbl> 1503960366, 1503960366, 1503960366, 150396036…
## $ ActivityDate <chr> "4/12/2016", "4/13/2016", "4/14/2016", "4/15/…
## $ TotalSteps <int> 13162, 10735, 10460, 9762, 12669, 9705, 13019…
## $ TotalDistance <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59, 9.8…
## $ TrackerDistance <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59, 9.8…
## $ LoggedActivitiesDistance <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ VeryActiveDistance <dbl> 1.88, 1.57, 2.44, 2.14, 2.71, 3.19, 3.25, 3.5…
## $ ModeratelyActiveDistance <dbl> 0.55, 0.69, 0.40, 1.26, 0.41, 0.78, 0.64, 1.3…
## $ LightActiveDistance <dbl> 6.06, 4.71, 3.91, 2.83, 5.04, 2.51, 4.71, 5.0…
## $ SedentaryActiveDistance <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ VeryActiveMinutes <int> 25, 21, 30, 29, 36, 38, 42, 50, 28, 19, 66, 4…
## $ FairlyActiveMinutes <int> 13, 19, 11, 34, 10, 20, 16, 31, 12, 8, 27, 21…
## $ LightlyActiveMinutes <int> 328, 217, 181, 209, 221, 164, 233, 264, 205, …
## $ SedentaryMinutes <int> 728, 776, 1218, 726, 773, 539, 1149, 775, 818…
## $ Calories <int> 1985, 1797, 1776, 1745, 1863, 1728, 1921, 203…
glimpse(daily_calories)
## Rows: 940
## Columns: 3
## $ Id <dbl> 1503960366, 1503960366, 1503960366, 1503960366, 1503960366…
## $ ActivityDay <chr> "4/12/2016", "4/13/2016", "4/14/2016", "4/15/2016", "4/16/…
## $ Calories <int> 1985, 1797, 1776, 1745, 1863, 1728, 1921, 2035, 1786, 1775…
glimpse(daily_intensities)
## Rows: 940
## Columns: 10
## $ Id <dbl> 1503960366, 1503960366, 1503960366, 150396036…
## $ ActivityDay <chr> "4/12/2016", "4/13/2016", "4/14/2016", "4/15/…
## $ SedentaryMinutes <int> 728, 776, 1218, 726, 773, 539, 1149, 775, 818…
## $ LightlyActiveMinutes <int> 328, 217, 181, 209, 221, 164, 233, 264, 205, …
## $ FairlyActiveMinutes <int> 13, 19, 11, 34, 10, 20, 16, 31, 12, 8, 27, 21…
## $ VeryActiveMinutes <int> 25, 21, 30, 29, 36, 38, 42, 50, 28, 19, 66, 4…
## $ SedentaryActiveDistance <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ LightActiveDistance <dbl> 6.06, 4.71, 3.91, 2.83, 5.04, 2.51, 4.71, 5.0…
## $ ModeratelyActiveDistance <dbl> 0.55, 0.69, 0.40, 1.26, 0.41, 0.78, 0.64, 1.3…
## $ VeryActiveDistance <dbl> 1.88, 1.57, 2.44, 2.14, 2.71, 3.19, 3.25, 3.5…
glimpse(daily_steps)
## Rows: 940
## Columns: 3
## $ Id <dbl> 1503960366, 1503960366, 1503960366, 1503960366, 1503960366…
## $ ActivityDay <chr> "4/12/2016", "4/13/2016", "4/14/2016", "4/15/2016", "4/16/…
## $ StepTotal <int> 13162, 10735, 10460, 9762, 12669, 9705, 13019, 15506, 1054…
Upon viewing the ‘daily_activity’ data frame, I noticed that the data type is incorrectly set for both ‘Id’ and ‘ActivityDay’. ‘Id’ should be set as character (chr), and ‘ActivityDay’ should be set as date.
# Cleaning: clean column names
daily_activity <- daily_activity %>%
clean_names()
daily_calories <- daily_calories %>%
clean_names()
daily_intensities <- daily_intensities %>%
clean_names()
daily_steps <- daily_steps %>%
clean_names()
# Changing data type for 'id' to character (chr)
daily_activity$id= as.character(daily_activity$id)
daily_calories$id= as.character(daily_calories$id)
daily_intensities$id= as.character(daily_intensities$id)
daily_steps$id= as.character(daily_steps$id)
# Cleaning: change column name from activity_date to date and edit date format
daily_activity <- daily_activity %>%
rename(date=activity_date)
daily_calories <- daily_calories %>%
rename(date=activity_day)
daily_intensities <- daily_intensities %>%
rename(date=activity_day)
daily_steps <- daily_steps %>%
rename(date=activity_day)
# Changing date format
daily_activity$date= as.Date(daily_activity$date, format= "%m/%d/%Y")
daily_calories$date= as.Date(daily_calories$date, format= "%m/%d/%Y")
daily_intensities$date= as.Date(daily_intensities$date, format= "%m/%d/%Y")
daily_steps$date= as.Date(daily_steps$date, format= "%m/%d/%Y")
# Check to see format changed
glimpse(daily_activity)
## Rows: 940
## Columns: 15
## $ id <chr> "1503960366", "1503960366", "1503960366", "…
## $ date <date> 2016-04-12, 2016-04-13, 2016-04-14, 2016-0…
## $ total_steps <int> 13162, 10735, 10460, 9762, 12669, 9705, 130…
## $ total_distance <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59, 9…
## $ tracker_distance <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59, 9…
## $ logged_activities_distance <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ very_active_distance <dbl> 1.88, 1.57, 2.44, 2.14, 2.71, 3.19, 3.25, 3…
## $ moderately_active_distance <dbl> 0.55, 0.69, 0.40, 1.26, 0.41, 0.78, 0.64, 1…
## $ light_active_distance <dbl> 6.06, 4.71, 3.91, 2.83, 5.04, 2.51, 4.71, 5…
## $ sedentary_active_distance <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ very_active_minutes <int> 25, 21, 30, 29, 36, 38, 42, 50, 28, 19, 66,…
## $ fairly_active_minutes <int> 13, 19, 11, 34, 10, 20, 16, 31, 12, 8, 27, …
## $ lightly_active_minutes <int> 328, 217, 181, 209, 221, 164, 233, 264, 205…
## $ sedentary_minutes <int> 728, 776, 1218, 726, 773, 539, 1149, 775, 8…
## $ calories <int> 1985, 1797, 1776, 1745, 1863, 1728, 1921, 2…
Using the ‘sqldf()’ function with the INTERSECT operator
Upon exploration of the ‘daily_activity’ dataset it appears that the dataset provides the same information as the ‘daily_steps’ df, ‘daily_intensities’ df, and ‘daily_calories’ df.
*Here is side-by-side look at ‘daily_activity’ df vs ‘daily_steps’ df
as_tibble(daily_activity)
## # A tibble: 940 × 15
## id date total_steps total_distance tracker_distance
## <chr> <date> <int> <dbl> <dbl>
## 1 1503960366 2016-04-12 13162 8.5 8.5
## 2 1503960366 2016-04-13 10735 6.97 6.97
## 3 1503960366 2016-04-14 10460 6.74 6.74
## 4 1503960366 2016-04-15 9762 6.28 6.28
## 5 1503960366 2016-04-16 12669 8.16 8.16
## 6 1503960366 2016-04-17 9705 6.48 6.48
## 7 1503960366 2016-04-18 13019 8.59 8.59
## 8 1503960366 2016-04-19 15506 9.88 9.88
## 9 1503960366 2016-04-20 10544 6.68 6.68
## 10 1503960366 2016-04-21 9819 6.34 6.34
## # ℹ 930 more rows
## # ℹ 10 more variables: logged_activities_distance <dbl>,
## # very_active_distance <dbl>, moderately_active_distance <dbl>,
## # light_active_distance <dbl>, sedentary_active_distance <dbl>,
## # very_active_minutes <int>, fairly_active_minutes <int>,
## # lightly_active_minutes <int>, sedentary_minutes <int>, calories <int>
as_tibble(daily_steps)
## # A tibble: 940 × 3
## id date step_total
## <chr> <date> <int>
## 1 1503960366 2016-04-12 13162
## 2 1503960366 2016-04-13 10735
## 3 1503960366 2016-04-14 10460
## 4 1503960366 2016-04-15 9762
## 5 1503960366 2016-04-16 12669
## 6 1503960366 2016-04-17 9705
## 7 1503960366 2016-04-18 13019
## 8 1503960366 2016-04-19 15506
## 9 1503960366 2016-04-20 10544
## 10 1503960366 2016-04-21 9819
## # ℹ 930 more rows
Next we will test to see if the three data frames- ‘daily_calories’, ‘daily_intensities’, and ‘daily_steps’ are identical to the ‘daily_activity’ df. If they are we can remove the redundant data frames and keep a clean global environment.
We can confirm if they are identical by using the ‘sqldf()’ function with an INTERSECT and then counting distinct values using ‘n_distinct()’ function, see below:
daily_activity_2 <- daily_activity %>%
select(id, date, calories)
head(daily_activity_2)
## id date calories
## 1 1503960366 2016-04-12 1985
## 2 1503960366 2016-04-13 1797
## 3 1503960366 2016-04-14 1776
## 4 1503960366 2016-04-15 1745
## 5 1503960366 2016-04-16 1863
## 6 1503960366 2016-04-17 1728
check_calories <- sqldf("SELECT *
FROM daily_activity_2
INTERSECT
SELECT *
FROM daily_calories")
head(check_calories)
## id date calories
## 1 1503960366 2016-04-12 1985
## 2 1503960366 2016-04-13 1797
## 3 1503960366 2016-04-14 1776
## 4 1503960366 2016-04-15 1745
## 5 1503960366 2016-04-16 1863
## 6 1503960366 2016-04-17 1728
n_distinct(daily_activity_2$id)
## [1] 33
n_distinct(check_calories$id)
## [1] 33
nrow(daily_activity_2)
## [1] 940
nrow(daily_calories)
## [1] 940
Confirmed redundancy
INTERSECT Operator
The Intersect operator retrieves the common records from both left and the right query of the Intersect operator - The data types must be same or at least compatible - The number and the order of the columns must be same in both the queries - INTERSECT filters duplicates and returns only distinct rows that are common
Thus we can conclude that ‘daily_activity’ df contains all the information from the ‘daily_calories’ df. I use the same process to confirm if the other two data frames are also redundant.
daily_activity_3 <- daily_activity %>%
select(id, date, sedentary_minutes, lightly_active_minutes, fairly_active_minutes,
very_active_minutes, sedentary_active_distance, light_active_distance,
moderately_active_distance, very_active_distance)
head(daily_activity_3)
## id date sedentary_minutes lightly_active_minutes
## 1 1503960366 2016-04-12 728 328
## 2 1503960366 2016-04-13 776 217
## 3 1503960366 2016-04-14 1218 181
## 4 1503960366 2016-04-15 726 209
## 5 1503960366 2016-04-16 773 221
## 6 1503960366 2016-04-17 539 164
## fairly_active_minutes very_active_minutes sedentary_active_distance
## 1 13 25 0
## 2 19 21 0
## 3 11 30 0
## 4 34 29 0
## 5 10 36 0
## 6 20 38 0
## light_active_distance moderately_active_distance very_active_distance
## 1 6.06 0.55 1.88
## 2 4.71 0.69 1.57
## 3 3.91 0.40 2.44
## 4 2.83 1.26 2.14
## 5 5.04 0.41 2.71
## 6 2.51 0.78 3.19
check_intensities <- sqldf("SELECT *
FROM daily_activity_3
INTERSECT
SELECT *
FROM daily_intensities")
head(check_intensities)
## id date sedentary_minutes lightly_active_minutes
## 1 1503960366 2016-04-12 728 328
## 2 1503960366 2016-04-13 776 217
## 3 1503960366 2016-04-14 1218 181
## 4 1503960366 2016-04-15 726 209
## 5 1503960366 2016-04-16 773 221
## 6 1503960366 2016-04-17 539 164
## fairly_active_minutes very_active_minutes sedentary_active_distance
## 1 13 25 0
## 2 19 21 0
## 3 11 30 0
## 4 34 29 0
## 5 10 36 0
## 6 20 38 0
## light_active_distance moderately_active_distance very_active_distance
## 1 6.06 0.55 1.88
## 2 4.71 0.69 1.57
## 3 3.91 0.40 2.44
## 4 2.83 1.26 2.14
## 5 5.04 0.41 2.71
## 6 2.51 0.78 3.19
n_distinct(daily_activity_3$id)
## [1] 33
n_distinct(check_intensities$id)
## [1] 33
nrow(daily_activity_3)
## [1] 940
nrow(check_intensities)
## [1] 940
Confirmed redundancy
daily_activity_4 <- daily_activity %>%
select(id, date, total_steps)
head(daily_activity_4)
## id date total_steps
## 1 1503960366 2016-04-12 13162
## 2 1503960366 2016-04-13 10735
## 3 1503960366 2016-04-14 10460
## 4 1503960366 2016-04-15 9762
## 5 1503960366 2016-04-16 12669
## 6 1503960366 2016-04-17 9705
check_daily_steps <- sqldf("SELECT *
FROM daily_activity_4
INTERSECT
SELECT *
FROM daily_steps")
head(check_daily_steps)
## id date total_steps
## 1 1503960366 2016-04-12 13162
## 2 1503960366 2016-04-13 10735
## 3 1503960366 2016-04-14 10460
## 4 1503960366 2016-04-15 9762
## 5 1503960366 2016-04-16 12669
## 6 1503960366 2016-04-17 9705
n_distinct(daily_activity_4$id)
## [1] 33
n_distinct(check_daily_steps$id)
## [1] 33
nrow(daily_activity_4)
## [1] 940
nrow(check_daily_steps)
## [1] 940
Confirmed redundancy
Each SQL data frame check output is identical with 940 observations and 33 distinct ID indicating that the daily_activity df contains all the information from the daily_calories, daily_intensities, and daily_steps data frames.
rm(daily_calories, daily_intensities, daily_steps, daily_activity_2, daily_activity_3, daily_activity_4, check_calories, check_daily_steps, check_intensities)
sum(duplicated(daily_activity))
## [1] 0
# Check unique ID numbers
n_unique(daily_activity$id)
## [1] 33
The average sedentary time of the participants is 16.5 hours.
The average daily steps is 7,638 far below the CDCs recommended 10,000 steps.
This means that 51.5% of the participants are taking less than 7,500 steps per day.
Participants were most active during 12pm to 2pm and from 5pm till 7pm.
The average sleep time of the participants is just below the CDCs recommended 7+ hours.
Our analysis shows that there is a significant positive correlation between calories burned and total daily steps and distance covered. Therefore, the Bellabeat app should send notifications to help its customers increase their daily activity.
Moreover, our analysis shows that sedentary time and sleep time have a negative correlation. Therefore, the Bellabeat app should send notifications to help customers reduce sedentary time. This will help achieve better sleep and promote a healthier lifestyle.
-Work Day Work Out Challenge: simple fast exercises that can be completed on a lunch break
-Notifications: send notifications every two hours that show percentage to goal of 10,000 steps with ideas to help you attain daily goal. Also, notifications to wind down before bedtime. Perhaps, the app give ideas to our customers that can help create new daily habits help them to wind down before bedtime like 15-20 minutes of meditation or reading.
-Friend Support: Customers can link up with friends and share their personal goals and achievements.
-Community Support: in app community support can help keep users motivated. Especially by sharing success stories
-Preventative Health Care: reminders about age appropriate check ups (eg. breast exam etc.)
-Healthy Diet: (additional payment) the app can provide an easy-to-use food log to help customers track caloric intake vs daily caloric budget that is set based on weight goals.
Thank You for reading!
Please note: this is my first project using R. I appreciate any constructive feedback. Special thanks to YouTube channel R Programming 101 (R Programming 101). His channel helped me understand linear models and regression analysis.