For this case study, I am taking on the fictional role of a junior data analyst on the marketing analyst team at Bellabeat, a high-tech manufacturer of health-focused products for women. Bellabeat is a successful small company with the potential to become a larger player in the global smart device market.
This case study was provided as a project under the Google Data Analyst certification program. The program approaches data analysis through the following six phases, which also organize this notebook:
Ask
Prepare
Process
Analyze
Share
Act
In this scenario, I’ve been asked to focus on one of Bellabeat’s products and analyze smart device data to gain insight into how consumers use their smart devices. These insights will then help guide the company’s marketing strategy. This notebook records that work, presents my analysis to the Bellabeat executive team, and provides my recommendations for Bellabeat’s marketing strategy.
The data I was asked to analyze is Kaggle’s FitBit data set.
This Kaggle data set contains personal fitness tracker data from thirty eligible Fitbit users who consented to the submission of personal tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring. It includes information about daily activity, steps, and heart rate that can be used to explore users’ habits. The archive contains 18 data sets, and some of the data files are included in both wide and long formats.
Following the ROCCC process to determine whether the data has any credibility or bias issues:
Reliable - yes, as the information was generated directly by the devices’ sensors rather than by responses from individuals.
Original - yes, the original public data can be located: https://zenodo.org/record/53894#.X9oeh3Uzaao
Comprehensive - yes, the long and wide formats have matching data, and there are no missing values.
Current - no, this is a historical data source (04/12/2016 - 05/12/2016).
Cited - yes
As for the licensing of the data, it is listed under Creative Commons Attribution 4.0 International.
There do not appear to be any personally identifying values. The data set description states that individual reports can be parsed by export session ID, so user privacy should be maintained at least at this level, as I have no way of further identifying the users. However, I have to assume that the data does not represent only the female population, so the insights gathered here would apply to all users rather than to the subset Bellabeat may be marketing towards.
Initially I attempted to use Google Sheets to view the data sets and found that the heartrate data set was too long for Sheets to display. Because of this, I’ve chosen R for its ability to handle the analysis of long data sets, visualization, and presentation.
-Loaded all 18 csv files into the project for review.
-Loaded the ‘tidyverse’ library.
-Loaded the ‘lubridate’ library:
library("tidyverse")
library("lubridate")
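Loading 18 files one at a time is repetitive, so the step can be scripted. Below is a minimal sketch in base R, assuming the CSVs sit together in one folder. A temporary folder with two tiny made-up files stands in for the real download, and `data_dir` and `fitbit` are hypothetical names, not part of the original notebook.

```r
# Hypothetical folder: a temp directory with toy files stands in for the Kaggle download.
data_dir <- file.path(tempdir(), "fitbit_demo")
dir.create(data_dir, showWarnings = FALSE)
write.csv(data.frame(Id = c(1, 2), TotalSteps = c(5000, 7200)),
          file.path(data_dir, "dailyActivity_merged.csv"), row.names = FALSE)
write.csv(data.frame(Id = c(1, 2), TotalMinutesAsleep = c(420, 390)),
          file.path(data_dir, "sleepDay_merged.csv"), row.names = FALSE)

# Find every CSV in the folder and read each one into a named list of data frames.
csv_paths <- list.files(data_dir, pattern = "\\.csv$", full.names = TRUE)
fitbit <- lapply(csv_paths, read.csv, stringsAsFactors = FALSE)
names(fitbit) <- tools::file_path_sans_ext(basename(csv_paths))

# Each file is now available by name, e.g. fitbit$dailyActivity_merged.
print(names(fitbit))
print(sapply(fitbit, nrow))
```

Keeping the frames in one named list also makes it easy to run the same check over all of them later.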
As most Bellabeat products use smart technology to track user activity, sleep, and stress, I will reduce the Fitbit files to the data that matches these areas for comparison. The Fitbit data does not account directly for stress, so we may not be able to include stress in the comparison. However, it does contain quite a bit of data on daily, hourly, and minute-level activity, which I will reduce further to a daily overview for review.
Files to Analyze:
-dailyActivity_merged
-dailyCalories_merged
-dailyIntensities_merged
-dailySteps_merged
-sleepDay_merged
-heartrate_seconds_merged
Check how many distinct users each file contains:
n_distinct(dailyActivity_merged$Id)
# [1] 33
n_distinct(dailyCalories_merged$Id)
# [1] 33
n_distinct(dailyIntensities_merged$Id)
# [1] 33
n_distinct(dailySteps_merged$Id)
# [1] 33
n_distinct(heartrate_seconds_merged$Id)
# [1] 14
n_distinct(sleepDay_merged$Id)
# [1] 24
Although the heartrate data is the longest, it comes from the smallest number of users (14), and fewer users contributed sleep data (24) than activity data (33). The activity files also contain 33 distinct Ids, slightly more than the thirty users named in the data set description. Assumptions about sleep and stress may need to be made, as the data might not give a full picture there, but the activity data can certainly be used as a proper sample.
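The per-file counts can be gathered in one pass, together with a quick look at missing values and duplicate rows. The sketch below uses three miniature, invented data frames; `frames` and `overview` are hypothetical names, and the numbers it prints say nothing about the real files.

```r
# Hypothetical miniature versions of three of the data frames.
frames <- list(
  activity  = data.frame(Id = c(1, 1, 2, 3), Calories = c(1800, 2000, 2200, NA)),
  sleep     = data.frame(Id = c(1, 1, 2, 2), TotalMinutesAsleep = c(400, 400, 380, 410)),
  heartrate = data.frame(Id = c(1, 1, 1), Value = c(60, 72, 85))
)

# One row per data frame: distinct users, total rows, missing values, duplicate rows.
overview <- data.frame(
  users      = sapply(frames, function(df) length(unique(df$Id))),
  rows       = sapply(frames, nrow),
  missing    = sapply(frames, function(df) sum(is.na(df))),
  duplicates = sapply(frames, function(df) sum(duplicated(df)))
)
print(overview)
```

A non-zero `missing` or `duplicates` count would be worth resolving before any averages are computed.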
Comparing the structures further, to check whether the data frames share any other columns that could be used for merging:
dailyActivity_merged lists the date the information was recorded as “ActivityDate”, while dailyCalories_merged, dailyIntensities_merged, and dailySteps_merged list it as “ActivityDay”. heartrate_seconds_merged lists it as “Time”, and sleepDay_merged lists it as “SleepDay”. All of these are character types rather than date/time types.
Calories are found in both dailyActivity_merged and dailyCalories_merged.
Comparing dailySteps_merged, dailyIntensities_merged, dailyCalories_merged, and dailyActivity_merged, I found that all data from the first three is also present in dailyActivity_merged, so I am removing those three data sets as well.
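The claim that dailyActivity_merged already contains the smaller daily files can be checked by joining on user and date and comparing the shared column. This is a sketch with made-up rows: the toy frames reuse the column names described above, but the values are invented.

```r
# Toy stand-ins; the key columns carry the names described in the text.
daily_activity <- data.frame(
  Id = c(1, 1, 2),
  ActivityDate = c("4/12/2016", "4/13/2016", "4/12/2016"),
  Calories = c(1985, 1797, 2200)
)
daily_calories <- data.frame(
  Id = c(1, 1, 2),
  ActivityDay = c("4/12/2016", "4/13/2016", "4/12/2016"),
  Calories = c(1985, 1797, 2200)
)

# Join on user and date; the date column is named differently in each file.
joined <- merge(daily_activity, daily_calories,
                by.x = c("Id", "ActivityDate"),
                by.y = c("Id", "ActivityDay"),
                suffixes = c("_activity", "_calories"))

# The smaller file is redundant if every row matched and the values agree.
all_rows_matched <- nrow(joined) == nrow(daily_calories)
values_agree <- all(joined$Calories_activity == joined$Calories_calories)
print(all_rows_matched && values_agree)
```

The same pattern would apply to the steps and intensities files, swapping in their respective columns.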
Files to keep:
-dailyActivity_merged
-heartrate_seconds_merged
-sleepDay_merged
No further common columns are found across the remaining three data sets. Formatting each of the data frames to use date/time types:
# Work on copies so the raw imports stay untouched.
activity <- dailyActivity_merged
# Parse the date strings, then keep a short character key for merging later.
activity$ActivityDate <- as.POSIXct(activity$ActivityDate, format = "%m/%d/%Y", tz = Sys.timezone())
activity$date <- format(activity$ActivityDate, format = "%m/%d/%y")
sleepDay <- sleepDay_merged
sleepDay$SleepDay <- as.POSIXct(sleepDay$SleepDay, format = "%m/%d/%Y", tz = Sys.timezone())
sleepDay$date <- format(sleepDay$SleepDay, format = "%m/%d/%y")
heartrate <- heartrate_seconds_merged
# Heart rate timestamps use a 12-hour clock, hence %I and %p.
heartrate$Time <- as.POSIXct(heartrate$Time, format = "%m/%d/%Y %I:%M:%S %p", tz = Sys.timezone())
heartrate$hour <- format(heartrate$Time, format = "%H:%M:%S")
heartrate$date <- format(heartrate$Time, format = "%m/%d/%y")
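A format string that does not match a value does not raise an error in `as.POSIXct()`; the result is quietly `NA`, so it is worth counting `NA`s after converting. Here is a small sketch with made-up timestamps in the two layouts used above. UTC is fixed only to make the example reproducible, and parsing `%p` assumes an English-style AM/PM locale.

```r
# Made-up strings in the layouts used by the daily and heart rate files.
daily_raw <- c("4/12/2016", "5/1/2016")
seconds_raw <- c("4/12/2016 7:21:00 AM", "4/12/2016 11:05:30 PM", "not a date")

daily_parsed <- as.POSIXct(daily_raw, format = "%m/%d/%Y", tz = "UTC")
seconds_parsed <- as.POSIXct(seconds_raw, format = "%m/%d/%Y %I:%M:%S %p", tz = "UTC")

# Any NA here means the format string did not match that row.
failed <- sum(is.na(seconds_parsed))

# %I together with %p converts the 12-hour clock to 24-hour time.
print(format(seconds_parsed, "%H:%M:%S"))
print(failed)
```

If `failed` were greater than zero on the real data, the offending rows could be inspected with `seconds_raw[is.na(seconds_parsed)]`.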
Summarizing the individual data sets to get a good idea of what the trends are:
sleepDay %>%
summary()
Id SleepDay TotalSleepRecords TotalMinutesAsleep TotalTimeInBed date
Min. :1.504e+09 Min. :2016-04-12 00:00:00 Min. :1.000 Min. : 58.0 Min. : 61.0 Length:413
1st Qu.:3.977e+09 1st Qu.:2016-04-19 00:00:00 1st Qu.:1.000 1st Qu.:361.0 1st Qu.:403.0 Class :character
Median :4.703e+09 Median :2016-04-27 00:00:00 Median :1.000 Median :433.0 Median :463.0 Mode :character
Mean :5.001e+09 Mean :2016-04-26 12:40:05 Mean :1.119 Mean :419.5 Mean :458.6
3rd Qu.:6.962e+09 3rd Qu.:2016-05-04 00:00:00 3rd Qu.:1.000 3rd Qu.:490.0 3rd Qu.:526.0
Max. :8.792e+09 Max. :2016-05-12 00:00:00 Max. :3.000 Max. :796.0 Max. :961.0
heartrate %>%
summary()
Id Time Value hour date
Min. :2.022e+09 Min. :2016-04-12 00:00:00 Min. : 36.00 Length:2483658 Length:2483658
1st Qu.:4.388e+09 1st Qu.:2016-04-19 06:18:10 1st Qu.: 63.00 Class :character Class :character
Median :5.554e+09 Median :2016-04-26 20:28:50 Median : 73.00 Mode :character Mode :character
Mean :5.514e+09 Mean :2016-04-26 19:43:52 Mean : 77.33
3rd Qu.:6.962e+09 3rd Qu.:2016-05-04 08:00:20 3rd Qu.: 88.00
Max. :8.878e+09 Max. :2016-05-12 16:20:00 Max. :203.00
Visualizing the Calories by ActivityDate for activity summary:
ggplot(data=activity)+
geom_smooth(mapping=aes(ActivityDate, Calories))
`geom_smooth()` using method = 'loess' and formula 'y ~ x'
It looks like quite a bit of the calories were being worked off before summertime arrived, after which the users either achieved their goals or stopped tracking as much. The data covers only one month (04/12/2016 - 05/12/2016), so this reading is tentative, but it suggests that marketing to activity-focused users may work best around spring, when they may be getting ready to work out more before summer.
The data set provides different categories of activity. What is the average time logged in each category?
Create a new data table that selects the time spent in each category:
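A smoothed line can dip simply because fewer people are reporting, so it may help to look at the daily mean next to the number of users behind it. This is a sketch with invented numbers; `toy` and `daily_overview` are hypothetical names.

```r
toy <- data.frame(
  Id = c(1, 2, 3, 1, 2, 1),
  ActivityDate = as.Date(c("2016-04-12", "2016-04-12", "2016-04-12",
                           "2016-04-13", "2016-04-13", "2016-04-14")),
  Calories = c(2000, 2400, 1900, 2100, 2300, 1500)
)

# Mean calories per day.
daily_mean <- aggregate(Calories ~ ActivityDate, data = toy, FUN = mean)

# Number of distinct users reporting per day.
daily_users <- aggregate(Id ~ ActivityDate, data = toy,
                         FUN = function(ids) length(unique(ids)))
names(daily_users)[2] <- "users_reporting"

# One row per day: the mean and how many users it is based on.
daily_overview <- merge(daily_mean, daily_users, by = "ActivityDate")
print(daily_overview)
```

If the user count falls over the period, a drop in the calorie trend may reflect who is still tracking rather than a change in behavior.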
activity_categories <- activity %>%
select(SedentaryMinutes, LightlyActiveMinutes, FairlyActiveMinutes, VeryActiveMinutes)
Averages of the selected columns:
activity_avgs <- c(
  sedentary = mean(activity_categories$SedentaryMinutes),
  lightly = mean(activity_categories$LightlyActiveMinutes),
  fairly = mean(activity_categories$FairlyActiveMinutes),
  very = mean(activity_categories$VeryActiveMinutes)
)
I would like to display the averages as columns to show the difference between the categories. geom_col requires both an x and a y aesthetic, so I’m creating labels for the averages and then a new data frame for the visual to use:
activity_labels <- c("Sedentary","Lightly","Fairly", "Very Active")
activity_vis_frame <- data.frame(activity_avgs, activity_labels)
Visualizing the different categories of activity vs time spent in those categories:
ggplot(activity_vis_frame)+
geom_col(mapping=aes(x=activity_avgs, y=activity_labels), fill="blue")
This plot shows that the Fitbit users are not primarily active while tracking, as most of the tracked time is sedentary. Perhaps they are tracking their current activities to see where they can improve, or they spend the majority of their day working at a desk. Does this also include sleep? More refined data would be needed to confirm whether the devices were tracking activity at the same time as sleep.
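One way to probe whether sedentary time includes sleep is to add up the four minute categories for each day. A total close to 1,440 (the minutes in a day) would suggest sleep is being counted inside one of the categories, most likely sedentary time, while a clearly lower total would suggest it is tracked separately or the device was not worn all day. This is a sketch on invented rows; `toy_minutes` is a hypothetical name.

```r
toy_minutes <- data.frame(
  SedentaryMinutes     = c(1200, 700),
  LightlyActiveMinutes = c(200, 220),
  FairlyActiveMinutes  = c(20, 15),
  VeryActiveMinutes    = c(20, 25)
)

# Total tracked minutes per day and the share of a 24-hour day they cover.
toy_minutes$tracked_minutes <- rowSums(toy_minutes)
toy_minutes$share_of_day <- toy_minutes$tracked_minutes / (24 * 60)
print(toy_minutes)
```

In this toy data the first day accounts for the full 24 hours and the second for only two thirds of it, which is the kind of difference to look for in the real file.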
Visualizing the TotalMinutesAsleep vs. the TotalTimeInBed:
ggplot(data=sleepDay)+
geom_count(mapping=aes(TotalMinutesAsleep, TotalTimeInBed))
Among those who report their sleep habits, most sleep for nearly as long as they spend in bed, with few outliers.
Visualizing heart rate value over time:
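The visual impression can be put into a number by dividing minutes asleep by minutes in bed. This ratio is not a column in the data set; the name `sleep_efficiency` and the 0.85 cut-off below are illustrative assumptions, not thresholds taken from the source.

```r
toy_sleep <- data.frame(
  TotalMinutesAsleep = c(420, 300, 450),
  TotalTimeInBed     = c(440, 400, 500)
)

# Share of time in bed that was actually spent asleep.
toy_sleep$sleep_efficiency <- toy_sleep$TotalMinutesAsleep / toy_sleep$TotalTimeInBed

# Flag nights where much of the time in bed was spent awake (assumed cut-off).
toy_sleep$restless <- toy_sleep$sleep_efficiency < 0.85
print(toy_sleep)
```

Counting the flagged nights per user would show whether the outliers come from a few individuals or are spread across everyone.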
ggplot(data=heartrate)+
geom_smooth(mapping=aes(Time, Value))
`geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
A large portion of the visualization shows that the values were high through spring up until the beginning of May, and then another large spike appeared in the first week of May. Perhaps this was the last push of activity before summer began?
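Because the heart rate file is recorded at the level of seconds, averaging by hour of day is another way to summarize it, and it may reveal a daily rhythm that a month-long smooth hides. This is a sketch with invented readings; `toy_hr` and `hour_of_day` are hypothetical names, and UTC is used only for reproducibility.

```r
toy_hr <- data.frame(
  Time = as.POSIXct(c("2016-04-12 07:00:05", "2016-04-12 07:30:10",
                      "2016-04-12 18:15:00", "2016-04-13 18:45:20"), tz = "UTC"),
  Value = c(60, 70, 110, 90)
)

# Extract the hour of day (00-23) and average heart rate within each hour.
toy_hr$hour_of_day <- format(toy_hr$Time, "%H")
hourly_mean <- aggregate(Value ~ hour_of_day, data = toy_hr, FUN = mean)
print(hourly_mean)
```

Grouping by `Id` as well would keep one heavy user from dominating the hourly averages.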
Merging data sets to see if there are any further connections to be made:
summary_merged <- merge(activity, sleepDay)
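Called without a `by` argument, `merge()` joins on every column name the two data frames share, which here should be `Id` and `date`. Spelling the keys out and comparing row counts makes it easier to see how many activity days had no matching sleep record. This is a sketch with invented rows; the `toy_` names are hypothetical.

```r
toy_activity <- data.frame(
  Id = c(1, 1, 2, 3),
  date = c("04/12/16", "04/13/16", "04/12/16", "04/12/16"),
  Calories = c(2000, 2100, 2400, 1900)
)
toy_sleep <- data.frame(
  Id = c(1, 2),
  date = c("04/12/16", "04/12/16"),
  TotalMinutesAsleep = c(420, 390)
)

# Inner join on explicit keys: only user-days present in both frames survive.
toy_merged <- merge(toy_activity, toy_sleep, by = c("Id", "date"))

# How many activity rows, and which users, were dropped by the join?
rows_dropped <- nrow(toy_activity) - nrow(toy_merged)
users_dropped <- setdiff(toy_activity$Id, toy_merged$Id)
print(rows_dropped)
print(users_dropped)
```

Passing `all.x = TRUE` would instead keep every activity row and fill the sleep columns with `NA` where no record exists.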
Summarize the merged data set:
summary_merged %>%
summary()
Id date ActivityDate TotalSteps TotalDistance TrackerDistance
Min. :1.504e+09 Length:413 Min. :2016-04-12 00:00:00 Min. : 17 Min. : 0.010 Min. : 0.010
1st Qu.:3.977e+09 Class :character 1st Qu.:2016-04-19 00:00:00 1st Qu.: 5206 1st Qu.: 3.600 1st Qu.: 3.600
Median :4.703e+09 Mode :character Median :2016-04-27 00:00:00 Median : 8925 Median : 6.290 Median : 6.290
Mean :5.001e+09 Mean :2016-04-26 12:40:05 Mean : 8541 Mean : 6.039 Mean : 6.034
3rd Qu.:6.962e+09 3rd Qu.:2016-05-04 00:00:00 3rd Qu.:11393 3rd Qu.: 8.030 3rd Qu.: 8.020
Max. :8.792e+09 Max. :2016-05-12 00:00:00 Max. :22770 Max. :17.540 Max. :17.540
LoggedActivitiesDistance VeryActiveDistance ModeratelyActiveDistance LightActiveDistance SedentaryActiveDistance
Min. :0.0000 Min. : 0.00 Min. :0.0000 Min. :0.010 Min. :0.0000000
1st Qu.:0.0000 1st Qu.: 0.00 1st Qu.:0.0000 1st Qu.:2.540 1st Qu.:0.0000000
Median :0.0000 Median : 0.57 Median :0.4200 Median :3.680 Median :0.0000000
Mean :0.1131 Mean : 1.45 Mean :0.7502 Mean :3.807 Mean :0.0009201
3rd Qu.:0.0000 3rd Qu.: 2.37 3rd Qu.:1.0400 3rd Qu.:4.930 3rd Qu.:0.0000000
Max. :4.0817 Max. :12.54 Max. :6.4800 Max. :9.480 Max. :0.1100000
VeryActiveMinutes FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes Calories SleepDay
Min. : 0.00 Min. : 0.00 Min. : 2.0 Min. : 0.0 Min. : 257 Min. :2016-04-12 00:00:00
1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.:158.0 1st Qu.: 631.0 1st Qu.:1850 1st Qu.:2016-04-19 00:00:00
Median : 9.00 Median : 11.00 Median :208.0 Median : 717.0 Median :2220 Median :2016-04-27 00:00:00
Mean : 25.19 Mean : 18.04 Mean :216.9 Mean : 712.2 Mean :2398 Mean :2016-04-26 12:40:05
3rd Qu.: 38.00 3rd Qu.: 27.00 3rd Qu.:263.0 3rd Qu.: 783.0 3rd Qu.:2926 3rd Qu.:2016-05-04 00:00:00
Max. :210.00 Max. :143.00 Max. :518.0 Max. :1265.0 Max. :4900 Max. :2016-05-12 00:00:00
TotalSleepRecords TotalMinutesAsleep TotalTimeInBed
Min. :1.000 Min. : 58.0 Min. : 61.0
1st Qu.:1.000 1st Qu.:361.0 1st Qu.:403.0
Median :1.000 Median :433.0 Median :463.0
Mean :1.119 Mean :419.5 Mean :458.6
3rd Qu.:1.000 3rd Qu.:490.0 3rd Qu.:526.0
Max. :3.000 Max. :796.0 Max. :961.0
Is there a connection between how long someone logs sleep and how many calories they logged as expended?
ggplot(data=summary_merged)+
geom_smooth(mapping=aes(Calories, TotalMinutesAsleep))
`geom_smooth()` using method = 'loess' and formula 'y ~ x'
It looks like higher calorie expenditure goes along with longer sleep. This may be a recommendation that can be made for stress and sleep tracking.
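A smoothed curve shows the shape of a relationship but not its strength. A correlation coefficient is one way to quantify it. The sketch below uses invented values, so the numbers it prints say nothing about the Fitbit data.

```r
toy_merged <- data.frame(
  Calories           = c(1800, 2000, 2200, 2500, 3000),
  TotalMinutesAsleep = c(380, 400, 410, 450, 470)
)

# Pearson correlation between calories burned and minutes asleep.
r <- cor(toy_merged$Calories, toy_merged$TotalMinutesAsleep)

# cor.test() adds a p-value and confidence interval for the same statistic.
result <- cor.test(toy_merged$Calories, toy_merged$TotalMinutesAsleep)
print(r)
print(result$p.value)
```

Even a strong correlation would not show that burning more calories causes longer sleep, only that the two tend to move together in the sample.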
What surprises did you discover in the data?
Most users log more sedentary time than active time, and those who do log activity do so mostly before May, according to the data.
What trends or relationships did you find in the data?
The more calories users expend, the longer the sleep they log, which may help with stress, as less sleep is correlated with more stress and health issues. Citation: https://www.apa.org/news/press/releases/stress/2013/sleep
How will these insights help answer your business questions?
These insights will help with making suggestions to the marketing team about when to schedule advertising periods for Bellabeat, based on the Fitbit data, and about app changes that could use more data to make recommendations to users.
What next steps would you or your stakeholders take based on your findings?
The Marketing team may need to research the benefits of less sedentary time, calorie expenditure, and longer sleep with accredited sources, and edit that information to provide digestible blurbs in the app, with links to the sources for the users to learn more. Work with the app development team to learn what data is currently being tracked, and what would need to be added in order to make changes to what is tracked. The app and database teams may also need to change the features of the app to allow the users to opt into notifications that remind the user about the newly tracked data, either through settings or through the educational blurbs that are offered in the app.
The database may need to be updated to accommodate the new data, and limits or aggregates may need to be stored rather than the finer details after a period of time, so that the data remains retrievable and reviewable for some time.
The Marketing team may also want to arrange a large advertisement campaign for the spring, to provide the smart devices as a product solution to those who may be working towards a healthier body for the summer season.
Is there additional data you could use to expand on your findings?
Any additional data that Bellabeat could provide about what they currently track and offer in the app would allow a proper comparison.