The goal of this case study is to apply the skills I recently gained from completing Google’s Data Analytics course by working through the six steps of the data analysis process. The case study is based on a scenario provided by the course.
“Urška Sršen, cofounder and Chief Creative Officer of Bellabeat, believes that analyzing smart device fitness data could help unlock new growth opportunities for the company. You have been asked to focus on one of Bellabeat’s products and analyze smart device data to gain insight into how consumers are using their smart devices. The insights you discover will then help guide marketing strategy for the company. You will present your analysis to the Bellabeat executive team along with your high-level recommendations for Bellabeat’s marketing strategy.”
Bellabeat is a high-tech company that creates smart devices with a focus on women’s health. The company was founded by Urška Sršen and Sando Mur. Using her training as an artist, Sršen designs elegant digital devices that benefit and empower women all around the world through smart technology. By collecting data on activity, sleep, stress, and reproductive health, Bellabeat has been able to inform women about their own habits and health.
The task is to analyze user data from a well-established company’s smart device in order to understand the users through their data and to identify trends that can help Bellabeat increase market growth and sales through targeted marketing campaigns.
Urška Sršen: Bellabeat’s founder and Chief Creative Officer
Sando Mur: Mathematician and Bellabeat’s cofounder; key member of the Bellabeat executive team
Bellabeat marketing analytics team: A team of data analysts responsible for collecting, analyzing, and reporting data that helps guide Bellabeat’s marketing strategy. You joined this team six months ago and have been busy learning about Bellabeat’s mission and business goals, as well as how you, as a junior data analyst, can help Bellabeat achieve them.
Urška Sršen, Bellabeat’s cofounder and Chief Creative Officer, suggested a specific dataset for the analysis: FitBit Fitness Tracker Data, which is available for public use on Kaggle.com.
The data for this project, titled FitBit Fitness Tracker Data, comes from a publicly available open source dataset. It features Fitbit user data that was anonymously provided by thirty-three Fitbit users who consented to the submission of personal tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring.
Data from this dataset was collected from Fitbit devices and published by the author of the dataset along with three contributors. The author of this dataset, Mobius, is a data scientist in the field of healthcare and has made 62 datasets available for public use. At the time of this project, the dataset had been downloaded 64,173 times since it was created, with an average of 100 downloads a month.
The dataset titled FitBit Fitness Tracker Data has a high rating from Kaggle on the criteria of:
After previewing the dataset I noted the following limitations with the data:
The sample size of the Fitness Tracker Dataset is limited, featuring only 33 participants.
The data is from 2016 and is outdated.
The timeframe of the data is limited, covering only April 12th, 2016 to May 9th, 2016.
Details of the users such as gender, age, and location are unknown, limiting the analysis.
The original dataset comprises 18 CSV files. I have chosen to use what I believe to be the core files, which include a list of participants identified by device Id, steps taken, sedentary time, activity time by intensity, calories burned, and sleep time.
In this phase the focus will be on cleaning the data to ensure that it is ready for analysis.
library(ggplot2)
library(ggpubr)
library(tidyverse)
library(here)
library(skimr)
library(janitor)
library(lubridate)
library(ggrepel)
library(dplyr)
daily_activity <- read_csv(file= "dailyActivity_merged.csv")
daily_sleep <- read_csv(file= "sleepDay_merged.csv")
I begin by previewing the dataset to get an idea of the various formats of the rows and columns.
head(daily_activity)
Id Activ…¹ Day Total…² Total…³ Track…⁴ Logge…⁵ VeryA…⁶ Moder…⁷ Light…⁸
<dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1.50e9 12/4/2… Tues… 13162 8.5 8.5 0 1.88 0.550 6.06
2 1.50e9 13/4/2… Wedn… 10735 6.97 6.97 0 1.57 0.690 4.71
3 1.50e9 14/4/2… Thur… 10460 6.74 6.74 0 2.44 0.400 3.91
4 1.50e9 15/4/2… Frid… 9762 6.28 6.28 0 2.14 1.26 2.83
5 1.50e9 16/4/2… Satu… 12669 8.16 8.16 0 2.71 0.410 5.04
6 1.50e9 17/4/2… Sund… 9705 6.48 6.48 0 3.19 0.780 2.51
While previewing the daily_activity dataset I notice that the activity date column holds dates stored as strings (<chr>), so I will convert it to date format during the cleaning process.
Next I will count the IDs present in the dataset to confirm the total number of participants.
n_unique(daily_activity$Id)  # the column is still named "Id" before the lowercase rename
#the returned result is 33
The cleaning will consist of changing the column titles to lowercase and renaming the date column, changing it to date format for consistency.
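# clean_names() is omitted: it would produce snake_case names such as activity_date,
# while the rest of the analysis uses plain lowercase names such as activitydate.
daily_activity <- rename_with(daily_activity, tolower)
# Parse the day/month/year strings into Date values
daily_activity <- daily_activity %>%
  mutate(activitydate = as_date(activitydate, format = "%d/%m/%Y"))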
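Before relying on the conversion, it can help to confirm that the format string matches the raw values, because a mismatched format either returns NA or silently misreads the date. A minimal sketch using base R's as.Date() on made-up sample strings (the values are illustrative, not taken from the dataset):

```r
# Illustrative sample strings in day/month/year order (hypothetical values)
raw_dates <- c("12/4/2016", "13/4/2016", "9/5/2016")

# Parse with the same format string used in the cleaning step
parsed <- as.Date(raw_dates, format = "%d/%m/%Y")

# A month-first format returns NA when the first number is above 12
# and silently swaps day and month for the remaining values
parsed_wrong <- as.Date(raw_dates, format = "%m/%d/%Y")

print(parsed)
print(sum(is.na(parsed)))        # values that failed to parse with the right format
print(sum(is.na(parsed_wrong)))  # values that failed to parse with the wrong format
```

Counting NA values after parsing is a cheap check, but it only catches outright failures, so spot-checking a few parsed dates against the raw strings is still worthwhile.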
Next, if there are any duplicate or missing values they will be removed.
# Remove duplicate rows, then rows with missing values
daily_activity <- daily_activity %>%
  distinct() %>%
  drop_na()
# Confirm that no duplicate rows remain
sum(duplicated(daily_activity))
# The returned result is 0
The same process is repeated for the daily_sleep dataframe. I then merge the two datasets by id and activitydate to create a new dataframe.
daily_activity_clean <- merge(daily_activity, daily_sleep, by = c("id", "activitydate"))
Rows: 410
Columns: 20
$ id <dbl> 1503960366, 1503960366, 15039603…
$ activitydate <date> 2016-04-12, 2016-04-13, 2016-04…
$ day <ord> "Tuesday,", "Wednesday,", "Frida…
$ totalsteps <dbl> 13162, 10735, 9762, 12669, 9705,…
$ totaldistance <dbl> 8.50, 6.97, 6.28, 8.16, 6.48,
etc...
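By default, merge() keeps only the id/date pairs that appear in both tables, so activity days without a sleep record drop out of the merged frame. A small sketch with made-up rows (the ids, dates, and values are hypothetical) shows the effect, and how all.x = TRUE would instead keep those days:

```r
# Hypothetical toy data: three activity days, but sleep logged on only two
activity <- data.frame(
  id = c(1, 1, 2),
  activitydate = as.Date(c("2016-04-12", "2016-04-13", "2016-04-12")),
  totalsteps = c(13162, 10735, 9000)
)
sleep <- data.frame(
  id = c(1, 2),
  activitydate = as.Date(c("2016-04-12", "2016-04-12")),
  totalminutesasleep = c(327, 400)
)

# Default merge(): inner join, unmatched activity days are dropped
inner <- merge(activity, sleep, by = c("id", "activitydate"))

# all.x = TRUE: left join, unmatched days are kept with NA sleep values
left <- merge(activity, sleep, by = c("id", "activitydate"), all.x = TRUE)

print(nrow(inner))
print(nrow(left))
print(sum(is.na(left$totalminutesasleep)))
```

The inner join suits an analysis that needs both activity and sleep on every row, at the cost of discarding days on which the device was not worn overnight.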
Because the merged data contains over 400 rows, I next create a new dataframe grouped by id, which gives me a picture of the overall level of activity for each participant.
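The step that builds the per-participant table is not shown in this write-up. A minimal sketch of how such a table could be produced, using base R's aggregate() on made-up rows. The input column names calories and totalminutesasleep are assumptions based on the lowercase naming used above, the values are hypothetical, and only three of the averages are illustrated:

```r
# Hypothetical merged data: two participants, two days each
merged <- data.frame(
  id = c(1, 1, 2, 2),
  totalsteps = c(10000, 12000, 4000, 6000),
  calories = c(2000, 2200, 1500, 1700),
  totalminutesasleep = c(400, 420, 350, 370)
)

# One row per participant: the mean of each measure across that person's days
per_participant <- aggregate(
  cbind(totalsteps, calories, totalminutesasleep) ~ id,
  data = merged,
  FUN = mean
)
names(per_participant) <- c("id", "mean_steps", "mean_calories",
                            "mean_minutes_asleep")

print(per_participant)
```

Averaging within each participant first gives every person equal weight in the overall summary, regardless of how many days they logged.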
activity_averages_summary <- daily_activity_averages %>%
summarise(mean_daily_steps = mean(mean_steps),
mean_light_Weekly_activity = mean(mean_lightly_active *7),
mean_moderate_weekly_activity = mean(mean_moderate_active * 7),
mean_intense_Weekly_activity = mean(mean_very_active_minutes *7),
mean_daily_sedentary_time = mean(mean_sum_sedentary_minutes),
mean_daily_calories_burned = mean(mean_calories),
mean_daily_sleep = mean(mean_minutes_asleep))
str(activity_averages_summary)
tibble [1 × 7] (S3: tbl_df/tbl/data.frame)
$ mean_daily_steps : num 7880
$ mean_light_Weekly_activity : num 1460
$ mean_moderate_weekly_activity: num 105
$ mean_intense_Weekly_activity : num 152
$ mean_daily_sedentary_time : num 779
$ mean_daily_calories_burned : num 2397
$ mean_daily_sleep : num 377
Focusing on the table that was returned, I notice that the average participant is quite active but is also sedentary for a large portion of the day (779 minutes, or roughly 13 hours).
Based on the above data I hypothesize that the average participant may have a demanding day job, given the sedentary time and lower-than-recommended level of daily sleep. However, given the moderate-to-high level of activity combined with the number of steps taken per day, it would appear that the average participant also engages in medium- to high-intensity activity.
Steps: According to the CDC, one should take at least 8,000 steps per day to remain healthy. At 7,880 steps per day, the average participant is quite close to the recommended amount.
Sleep: Also according to the CDC, adults aged 18 - 64 should sleep no less than 7 hours per day. At 377 minutes (about 6.3 hours), the average participant gets fewer hours of sleep than recommended.
Exercise: According to the WHO, the recommended level of high intensity exercise for ages 18-64 is 75 - 150 minutes per week. At 152 minutes per week, the average participant meets this recommendation and slightly exceeds the upper end of the range.
Also according to the WHO, the recommended level of moderate intensity exercise for ages 18-64 is 150 - 300 minutes per week. At 105 minutes per week, the average participant falls short of the recommended weekly amount.
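These comparisons reduce to simple arithmetic on the summary values. A short sketch, with the thresholds entered by hand from the figures quoted in this section (the variable names are chosen here for illustration):

```r
# Summary values reported in the table above
mean_daily_steps <- 7880
mean_daily_sleep_min <- 377
mean_moderate_weekly_min <- 105
mean_intense_weekly_min <- 152

# Lower bounds as quoted in this section (CDC for steps and sleep, WHO for exercise)
steps_target <- 8000
sleep_target_hours <- 7
moderate_min_weekly <- 150
intense_min_weekly <- 75

# Convert sleep from minutes to hours so it matches the unit of the guideline
sleep_hours <- mean_daily_sleep_min / 60

comparison <- data.frame(
  measure = c("steps", "sleep_hours", "moderate_weekly", "intense_weekly"),
  observed = c(mean_daily_steps, round(sleep_hours, 1),
               mean_moderate_weekly_min, mean_intense_weekly_min),
  target = c(steps_target, sleep_target_hours,
             moderate_min_weekly, intense_min_weekly)
)
comparison$meets_target <- comparison$observed >= comparison$target

print(comparison)
```

Only the high intensity exercise figure reaches its lower bound; steps miss the target by 120 per day.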
The first visualization shows the periods when participants were most active.
The first thing I noticed when looking at this chart is the consistent level of activity throughout the 31-day period, which indicates consistency in keeping fit.
Next is a comparison between levels of activity to determine whether there is a correlation between calories burned and active minutes, which might indicate the intensity level that most effectively burned calories.
The above charts seem to indicate that, among the three activity intensities, moderate and high intensity have the strongest positive correlation with calories burned.
My conclusion from this is that, on average, participants engage in moderate to high intensity activity in order to burn the most calories.
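The strength of each relationship can also be quantified with Pearson's correlation coefficient rather than read off the charts. A minimal sketch on made-up values: the column names follow the Fitbit daily activity file in lowercase and are assumptions here, and the numbers are illustrative only, so the output says nothing about the real data.

```r
# Hypothetical daily records (illustrative values only)
activity <- data.frame(
  calories = c(1800, 2000, 2200, 2500, 2700, 3000),
  lightlyactiveminutes = c(200, 180, 220, 190, 210, 205),
  fairlyactiveminutes = c(5, 10, 15, 22, 25, 35),
  veryactiveminutes = c(0, 8, 15, 30, 40, 55)
)

intensity_cols <- c("lightlyactiveminutes", "fairlyactiveminutes",
                    "veryactiveminutes")

# Pearson correlation of each intensity column with calories burned
correlations <- sapply(activity[intensity_cols], cor, y = activity$calories)

print(round(correlations, 2))
```

A correlation describes association only; it does not by itself show that a given intensity is what caused the calories to be burned.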
The next chart examines whether there is a correlation between calories burned and steps taken.
The chart shows a strong positive correlation between steps taken and calories burned.
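A linear fit gives a rough sense of how many additional calories accompany each extra step. A minimal sketch with lm() on made-up values (the numbers are illustrative, so the slope here is not a finding about the participants):

```r
# Hypothetical daily totals (illustrative values only)
daily <- data.frame(
  totalsteps = c(2000, 4000, 6000, 8000, 10000, 12000),
  calories = c(1700, 1850, 2000, 2100, 2300, 2400)
)

# Fit calories as a linear function of steps
fit <- lm(calories ~ totalsteps, data = daily)

slope <- coef(fit)[["totalsteps"]]     # calories per additional step
r_squared <- summary(fit)$r.squared    # share of variance explained by steps

print(round(slope * 1000, 1))          # calories per additional 1,000 steps
print(round(r_squared, 3))
```

Reporting the slope per 1,000 steps keeps the number readable, and the R-squared value indicates how much of the day-to-day variation in calories the step count accounts for.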
Based on my findings, I believe that the average participant is likely to be a professional employee, given the high daily sedentary time and lower-than-recommended sleep level.
Based on the consistency of physical activity, the average participant is likely to be a fitness enthusiast with a focus on burning calories through mid to high intensity, step-based activity such as jogging or running.
First, I would like to note that Bellabeat’s and Fitbit’s products are designed for different customer groups.
Bellabeat designs elegant women’s smart devices that are able to monitor stress levels, menstruation cycles, meditation, and activity levels.
Fitbit’s smart watches, on the other hand, track heart rate, sleep quality, distance, and steps taken, and cater more to the needs of fitness enthusiasts.
Given that in this case study I was tasked with analyzing Fitbit’s user data in order to inform decisions that could help Bellabeat with growth and sales, I assume that Bellabeat has an interest in expanding into the fitness market.
Based on my analysis, my recommendation to the Bellabeat stakeholders would be to create a survey for existing customers to find out whether there is an interest in a new smart device for women who are fitness enthusiasts.
The result of this survey could inform Bellabeat on whether there is demand for a new women’s fitness smart device.
Dataset: https://www.kaggle.com/arashnic/fitbit
CDC Sleep Recommendation: https://www.cdc.gov/sleep/about_sleep/how_much_sleep.html
CDC Daily Step Recommendation: https://www.cdc.gov/media/releases/2020/p0324-daily-step-count.html
WHO Exercise Recommendation: https://www.who.int/news-room/fact-sheets/detail/physical-activity
Google Data Analytics Course: https://www.coursera.org/professional-certificates/google-data-analytics