Urška Sršen and Sando Mur founded Bellabeat, a high-tech company that manufactures health-focused smart products. Sršen used her background as an artist to develop beautifully designed technology that informs and inspires women around the world. Collecting data on activity, sleep, stress, and reproductive health has allowed Bellabeat to empower women with knowledge about their own health and habits. Since it was founded in 2013, Bellabeat has grown rapidly and quickly positioned itself as a tech-driven wellness company for women.
By 2016, Bellabeat had opened offices around the world and launched multiple products. Bellabeat products became available through a growing number of online retailers in addition to the company's own e-commerce channel on its website. The company has invested in traditional advertising media, such as radio, out-of-home billboards, print, and television, but focuses extensively on digital marketing. Bellabeat invests year-round in Google Search, maintains active Facebook and Instagram pages, and consistently engages consumers on Twitter. Additionally, Bellabeat runs video ads on YouTube and display ads on the Google Display Network to support campaigns around key marketing dates.
Sršen knows that an analysis of Bellabeat’s available consumer data would reveal more opportunities for growth. She has asked the marketing analytics team to focus on a Bellabeat product and analyze smart device usage data in order to gain insight into how people are already using their smart devices. Then, using this information, she would like high-level recommendations for how these trends can inform Bellabeat marketing strategy.
This report contains the following:
1. A summary of the business task
2. A description of all data sources used
3. Documentation of any cleaning or manipulation of data
4. A summary of the analysis
5. Supporting visualizations and key findings
6. My content recommendations based on the analysis
Bellabeat would like us to analyse consumer trends in non-Bellabeat smart devices, then apply them to one of Bellabeat’s devices to provide insights. Sršen would like the analysis to answer the following questions:
1. What are some trends in smart device usage?
2. How could these trends apply to Bellabeat customers?
3. How could these trends help influence Bellabeat marketing strategy?
By looking at usage trends of the smart watch, we can devise a better marketing plan for Bellabeat that helps the company use its marketing budget effectively. Bellabeat focuses on digital rather than traditional marketing, so it is likely that engagement data is readily available for analysis.
I will be using publicly available data from a popular smart device similar to Bellabeat’s: the FitBit Fitness Tracker, a smart watch that records physical activity, steps, heart rate and sleep data. It is important to note that the users in this dataset consented to the submission of their personal tracker data, so it is ethical and permissible to use this dataset for the study. The data is hosted on Kaggle, an online community for data professionals, and can be found here. For the analysis, I have chosen 3 of the 18 available datasets.
These are:
• dailyActivity_merged.csv
• sleepDay_merged.csv
• weightLogInfo_merged.csv
daily_activity <- read.csv("dailyActivity_merged.csv")
sleep_day <- read.csv("sleepDay_merged.csv")
weight_log <- read.csv("weightLogInfo_merged.csv")
Is the Data ROCCC?
Using the “ROCCC” criteria, we can see that whilst the data can certainly be worked with, it has certain limitations.
• Reliable: Low reliability, as there is only a small sample of respondents (about 30; the daily activity file contains 33 distinct Ids)
• Original: Low originality, as the data being used is third-party data from FitBit
• Comprehensive: Medium comprehensiveness, as the criteria match Bellabeat products
• Current: Low currency, since the FitBit data is from 2016
• Cited: Low level of citation, as the data was collected by a third party
Below you can see the packages I required to work on this data.
library("ggplot2")
## Warning: package 'ggplot2' was built under R version 4.2.3
library("dplyr")
## Warning: package 'dplyr' was built under R version 4.2.3
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library("janitor")
## Warning: package 'janitor' was built under R version 4.2.3
##
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
library("tidyr")
## Warning: package 'tidyr' was built under R version 4.2.3
This section of the report provides a summary of the data and a short glimpse of the variables we are dealing with.
As mentioned previously, we’ll be using 3 .csv files: dailyActivity_merged.csv, sleepDay_merged.csv and weightLogInfo_merged.csv, loaded as daily_activity, sleep_day and weight_log. Each is summarised individually in this section.
summary(daily_activity)
## Id ActivityDate TotalSteps TotalDistance
## Min. :1.504e+09 Length:940 Min. : 0 Min. : 0.000
## 1st Qu.:2.320e+09 Class :character 1st Qu.: 3790 1st Qu.: 2.620
## Median :4.445e+09 Mode :character Median : 7406 Median : 5.245
## Mean :4.855e+09 Mean : 7638 Mean : 5.490
## 3rd Qu.:6.962e+09 3rd Qu.:10727 3rd Qu.: 7.713
## Max. :8.878e+09 Max. :36019 Max. :28.030
## TrackerDistance LoggedActivitiesDistance VeryActiveDistance
## Min. : 0.000 Min. :0.0000 Min. : 0.000
## 1st Qu.: 2.620 1st Qu.:0.0000 1st Qu.: 0.000
## Median : 5.245 Median :0.0000 Median : 0.210
## Mean : 5.475 Mean :0.1082 Mean : 1.503
## 3rd Qu.: 7.710 3rd Qu.:0.0000 3rd Qu.: 2.053
## Max. :28.030 Max. :4.9421 Max. :21.920
## ModeratelyActiveDistance LightActiveDistance SedentaryActiveDistance
## Min. :0.0000 Min. : 0.000 Min. :0.000000
## 1st Qu.:0.0000 1st Qu.: 1.945 1st Qu.:0.000000
## Median :0.2400 Median : 3.365 Median :0.000000
## Mean :0.5675 Mean : 3.341 Mean :0.001606
## 3rd Qu.:0.8000 3rd Qu.: 4.782 3rd Qu.:0.000000
## Max. :6.4800 Max. :10.710 Max. :0.110000
## VeryActiveMinutes FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes
## Min. : 0.00 Min. : 0.00 Min. : 0.0 Min. : 0.0
## 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.:127.0 1st Qu.: 729.8
## Median : 4.00 Median : 6.00 Median :199.0 Median :1057.5
## Mean : 21.16 Mean : 13.56 Mean :192.8 Mean : 991.2
## 3rd Qu.: 32.00 3rd Qu.: 19.00 3rd Qu.:264.0 3rd Qu.:1229.5
## Max. :210.00 Max. :143.00 Max. :518.0 Max. :1440.0
## Calories
## Min. : 0
## 1st Qu.:1828
## Median :2134
## Mean :2304
## 3rd Qu.:2793
## Max. :4900
str(daily_activity)
## 'data.frame': 940 obs. of 15 variables:
## $ Id : num 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
## $ ActivityDate : chr "4/12/2016" "4/13/2016" "4/14/2016" "4/15/2016" ...
## $ TotalSteps : int 13162 10735 10460 9762 12669 9705 13019 15506 10544 9819 ...
## $ TotalDistance : num 8.5 6.97 6.74 6.28 8.16 ...
## $ TrackerDistance : num 8.5 6.97 6.74 6.28 8.16 ...
## $ LoggedActivitiesDistance: num 0 0 0 0 0 0 0 0 0 0 ...
## $ VeryActiveDistance : num 1.88 1.57 2.44 2.14 2.71 ...
## $ ModeratelyActiveDistance: num 0.55 0.69 0.4 1.26 0.41 ...
## $ LightActiveDistance : num 6.06 4.71 3.91 2.83 5.04 ...
## $ SedentaryActiveDistance : num 0 0 0 0 0 0 0 0 0 0 ...
## $ VeryActiveMinutes : int 25 21 30 29 36 38 42 50 28 19 ...
## $ FairlyActiveMinutes : int 13 19 11 34 10 20 16 31 12 8 ...
## $ LightlyActiveMinutes : int 328 217 181 209 221 164 233 264 205 211 ...
## $ SedentaryMinutes : int 728 776 1218 726 773 539 1149 775 818 838 ...
## $ Calories : int 1985 1797 1776 1745 1863 1728 1921 2035 1786 1775 ...
n_distinct(daily_activity$Id)
## [1] 33
summary(sleep_day)
## Id SleepDay TotalSleepRecords TotalMinutesAsleep
## Min. :1.504e+09 Length:413 Min. :1.000 Min. : 58.0
## 1st Qu.:3.977e+09 Class :character 1st Qu.:1.000 1st Qu.:361.0
## Median :4.703e+09 Mode :character Median :1.000 Median :433.0
## Mean :5.001e+09 Mean :1.119 Mean :419.5
## 3rd Qu.:6.962e+09 3rd Qu.:1.000 3rd Qu.:490.0
## Max. :8.792e+09 Max. :3.000 Max. :796.0
## TotalTimeInBed
## Min. : 61.0
## 1st Qu.:403.0
## Median :463.0
## Mean :458.6
## 3rd Qu.:526.0
## Max. :961.0
str(sleep_day)
## 'data.frame': 413 obs. of 5 variables:
## $ Id : num 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
## $ SleepDay : chr "4/12/2016 12:00:00 AM" "4/13/2016 12:00:00 AM" "4/15/2016 12:00:00 AM" "4/16/2016 12:00:00 AM" ...
## $ TotalSleepRecords : int 1 2 1 2 1 1 1 1 1 1 ...
## $ TotalMinutesAsleep: int 327 384 412 340 700 304 360 325 361 430 ...
## $ TotalTimeInBed : int 346 407 442 367 712 320 377 364 384 449 ...
n_distinct(sleep_day$Id)
## [1] 24
summary(weight_log)
## Id Date WeightKg WeightPounds
## Min. :1.504e+09 Length:67 Min. : 52.60 Min. :116.0
## 1st Qu.:6.962e+09 Class :character 1st Qu.: 61.40 1st Qu.:135.4
## Median :6.962e+09 Mode :character Median : 62.50 Median :137.8
## Mean :7.009e+09 Mean : 72.04 Mean :158.8
## 3rd Qu.:8.878e+09 3rd Qu.: 85.05 3rd Qu.:187.5
## Max. :8.878e+09 Max. :133.50 Max. :294.3
##
## Fat BMI IsManualReport LogId
## Min. :22.00 Min. :21.45 Length:67 Min. :1.460e+12
## 1st Qu.:22.75 1st Qu.:23.96 Class :character 1st Qu.:1.461e+12
## Median :23.50 Median :24.39 Mode :character Median :1.462e+12
## Mean :23.50 Mean :25.19 Mean :1.462e+12
## 3rd Qu.:24.25 3rd Qu.:25.56 3rd Qu.:1.462e+12
## Max. :25.00 Max. :47.54 Max. :1.463e+12
## NA's :65
str(weight_log)
## 'data.frame': 67 obs. of 8 variables:
## $ Id : num 1.50e+09 1.50e+09 1.93e+09 2.87e+09 2.87e+09 ...
## $ Date : chr "5/2/2016 11:59:59 PM" "5/3/2016 11:59:59 PM" "4/13/2016 1:08:52 AM" "4/21/2016 11:59:59 PM" ...
## $ WeightKg : num 52.6 52.6 133.5 56.7 57.3 ...
## $ WeightPounds : num 116 116 294 125 126 ...
## $ Fat : int 22 NA NA NA NA 25 NA NA NA NA ...
## $ BMI : num 22.6 22.6 47.5 21.5 21.7 ...
## $ IsManualReport: chr "True" "True" "False" "True" ...
## $ LogId : num 1.46e+12 1.46e+12 1.46e+12 1.46e+12 1.46e+12 ...
n_distinct(weight_log$Id)
## [1] 8
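It is important that the datasets have clear and consistent names for their variables. The clean_names() function from janitor converts column names so that they contain only lowercase letters, numbers and underscores. Because clean_names() returns a new data frame rather than changing the original, the result has to be assigned to be kept; the cleaned copies are stored under new names here, since the rest of this report continues to use the original column names.
# Store the cleaned copies; the original data frames are left unchanged
daily_activity_clean <- clean_names(daily_activity)
sleep_day_clean <- clean_names(sleep_day)
weight_log_clean <- clean_names(weight_log)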
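When analysing data, it is important to check for missing values. A missing value indicates that no value was recorded for a particular entry, and it can cause errors or misleading results later in the analysis. The calls below use is.null(), which only confirms that each column exists in the data frame; it does not test the individual entries.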
is.null(daily_activity$Id)
## [1] FALSE
is.null(daily_activity$ActivityDate)
## [1] FALSE
is.null(daily_activity$TotalSteps)
## [1] FALSE
is.null(daily_activity$TotalDistance)
## [1] FALSE
is.null(daily_activity$TrackerDistance)
## [1] FALSE
is.null(daily_activity$LoggedActivitiesDistance)
## [1] FALSE
is.null(daily_activity$VeryActiveDistance)
## [1] FALSE
is.null(daily_activity$ModeratelyActiveDistance)
## [1] FALSE
is.null(daily_activity$LightActiveDistance)
## [1] FALSE
is.null(daily_activity$SedentaryActiveDistance)
## [1] FALSE
is.null(daily_activity$VeryActiveMinutes)
## [1] FALSE
is.null(daily_activity$FairlyActiveMinutes)
## [1] FALSE
is.null(daily_activity$LightlyActiveMinutes)
## [1] FALSE
is.null(daily_activity$SedentaryMinutes)
## [1] FALSE
is.null(daily_activity$Calories)
## [1] FALSE
is.null(sleep_day$Id)
## [1] FALSE
is.null(sleep_day$SleepDay)
## [1] FALSE
is.null(sleep_day$TotalSleepRecords)
## [1] FALSE
is.null(sleep_day$TotalMinutesAsleep)
## [1] FALSE
is.null(sleep_day$TotalTimeInBed)
## [1] FALSE
is.null(weight_log$Id)
## [1] FALSE
is.null(weight_log$Date)
## [1] FALSE
is.null(weight_log$WeightKg)
## [1] FALSE
is.null(weight_log$WeightPounds)
## [1] FALSE
is.null(weight_log$Fat)
## [1] FALSE
is.null(weight_log$BMI)
## [1] FALSE
is.null(weight_log$IsManualReport)
## [1] FALSE
is.null(weight_log$LogId)
## [1] FALSE
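Every column is present, so we can proceed with data cleaning. Note that these checks do not rule out missing entries: the summary of weight_log above shows 65 NAs in the Fat column (out of 67 rows), so Fat is too sparse to analyse.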
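Counting missing entries directly can be sketched as follows. This is a minimal illustration on a small made-up data frame (toy_weight and its values are assumptions, not rows taken from the dataset): is.na() flags each missing entry and colSums() totals the flags per column.

```r
# Toy stand-in for weight_log; the values are illustrative only.
toy_weight <- data.frame(
  Id       = c(1, 1, 2, 3),
  WeightKg = c(52.6, 52.6, 133.5, 56.7),
  Fat      = c(22, NA, NA, NA)
)

# is.null() asks whether the column exists at all, so it is FALSE
# even though the column contains missing values.
is.null(toy_weight$Fat)

# is.na() tests every entry; colSums() then counts the NAs per column.
na_counts <- colSums(is.na(toy_weight))
na_counts
```

Applied to the real data, the equivalent call would be colSums(is.na(weight_log)); judging by the summary output earlier, Fat would be expected to be the only column with a non-zero count.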
Sometimes, for the purpose of analysis, we have to create new variables during the preparation process. This might include categorising variables or separating values such as date and time, as was carried out below.
Upon further inspection of the data in Excel, specifically “sleep_day” and “weight_log”, the date column included both the date and the time, so I have separated them into their own columns.
sleep_new <- sleep_day %>%
separate(SleepDay, c("date","time"), " ")
## Warning: Expected 2 pieces. Additional pieces discarded in 413 rows [1, 2, 3, 4, 5, 6,
## 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].
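Below, we can inspect the new columns. The warning above appears because each timestamp has three space-separated pieces (date, time and AM/PM) but only two column names were supplied, so the AM/PM marker was discarded and the time column is ambiguous on its own.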
glimpse(sleep_new)
## Rows: 413
## Columns: 6
## $ Id <dbl> 1503960366, 1503960366, 1503960366, 1503960366, 150…
## $ date <chr> "4/12/2016", "4/13/2016", "4/15/2016", "4/16/2016",…
## $ time <chr> "12:00:00", "12:00:00", "12:00:00", "12:00:00", "12…
## $ TotalSleepRecords <int> 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ TotalMinutesAsleep <int> 327, 384, 412, 340, 700, 304, 360, 325, 361, 430, 2…
## $ TotalTimeInBed <int> 346, 407, 442, 367, 712, 320, 377, 364, 384, 449, 3…
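One way to keep the AM/PM marker when splitting is sketched below. It is an illustration on a made-up two-row data frame (toy_sleep is an assumption, not the report's data), using the extra = "merge" argument of tidyr's separate() so that only the first space is used as a split point, and then converting the text date into a Date with as.Date().

```r
library(tidyr)

# Toy stand-in for sleep_day; the rows are illustrative only.
toy_sleep <- data.frame(
  Id       = c(1, 1),
  SleepDay = c("4/12/2016 12:00:00 AM", "4/13/2016 12:00:00 AM")
)

# extra = "merge" splits at most once, so "12:00:00 AM" stays in one piece
# and no pieces are discarded.
toy_split <- separate(toy_sleep, SleepDay, into = c("date", "time"),
                      sep = " ", extra = "merge")

# A real Date column sorts chronologically and supports weekday lookups.
toy_split$date <- as.Date(toy_split$date, format = "%m/%d/%Y")
toy_split
```

The same pattern should apply to the Date column of weight_log, where the times vary and the AM/PM marker actually carries information.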
As mentioned, I did the same to the Date column of “weight_log”.
weightlog_new <- weight_log %>%
separate(Date, c("date","time"), " ")
## Warning: Expected 2 pieces. Additional pieces discarded in 67 rows [1, 2, 3, 4, 5, 6, 7,
## 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].
Let’s now preview the columns as we did for “sleep_day”.
glimpse(weightlog_new)
## Rows: 67
## Columns: 9
## $ Id <dbl> 1503960366, 1503960366, 1927972279, 2873212765, 2873212…
## $ date <chr> "5/2/2016", "5/3/2016", "4/13/2016", "4/21/2016", "5/12…
## $ time <chr> "11:59:59", "11:59:59", "1:08:52", "11:59:59", "11:59:5…
## $ WeightKg <dbl> 52.6, 52.6, 133.5, 56.7, 57.3, 72.4, 72.3, 69.7, 70.3, …
## $ WeightPounds <dbl> 115.9631, 115.9631, 294.3171, 125.0021, 126.3249, 159.6…
## $ Fat <int> 22, NA, NA, NA, NA, 25, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ BMI <dbl> 22.65, 22.65, 47.54, 21.45, 21.69, 27.45, 27.38, 27.25,…
## $ IsManualReport <chr> "True", "True", "False", "True", "True", "True", "True"…
## $ LogId <dbl> 1.462234e+12, 1.462320e+12, 1.460510e+12, 1.461283e+12,…
Next, I’ll create a new variable titled “total_active_minutes” in “daily_activity”, which is the sum of “VeryActiveMinutes”, “FairlyActiveMinutes” and “LightlyActiveMinutes”.
dailyactivity_new <- daily_activity %>%
mutate(total_active_minutes = VeryActiveMinutes+FairlyActiveMinutes+LightlyActiveMinutes)
glimpse(dailyactivity_new)
## Rows: 940
## Columns: 16
## $ Id <dbl> 1503960366, 1503960366, 1503960366, 150396036…
## $ ActivityDate <chr> "4/12/2016", "4/13/2016", "4/14/2016", "4/15/…
## $ TotalSteps <int> 13162, 10735, 10460, 9762, 12669, 9705, 13019…
## $ TotalDistance <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59, 9.8…
## $ TrackerDistance <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59, 9.8…
## $ LoggedActivitiesDistance <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ VeryActiveDistance <dbl> 1.88, 1.57, 2.44, 2.14, 2.71, 3.19, 3.25, 3.5…
## $ ModeratelyActiveDistance <dbl> 0.55, 0.69, 0.40, 1.26, 0.41, 0.78, 0.64, 1.3…
## $ LightActiveDistance <dbl> 6.06, 4.71, 3.91, 2.83, 5.04, 2.51, 4.71, 5.0…
## $ SedentaryActiveDistance <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ VeryActiveMinutes <int> 25, 21, 30, 29, 36, 38, 42, 50, 28, 19, 66, 4…
## $ FairlyActiveMinutes <int> 13, 19, 11, 34, 10, 20, 16, 31, 12, 8, 27, 21…
## $ LightlyActiveMinutes <int> 328, 217, 181, 209, 221, 164, 233, 264, 205, …
## $ SedentaryMinutes <int> 728, 776, 1218, 726, 773, 539, 1149, 775, 818…
## $ Calories <int> 1985, 1797, 1776, 1745, 1863, 1728, 1921, 203…
## $ total_active_minutes <int> 366, 257, 222, 272, 267, 222, 291, 345, 245, …
I then categorised each respondent’s average sleep as “poor sleep” (under 300 minutes), “good sleep” (300 to 420 minutes) or “excellent sleep” (over 420 minutes) for ease of analysis later on.
sleepcategories <- sleep_new %>%
  group_by(Id) %>%
  summarise(avg_time_asleep = mean(TotalMinutesAsleep)) %>%
  mutate(type = case_when(
    avg_time_asleep < 300 ~ "poor sleep",
    avg_time_asleep >= 300 & avg_time_asleep <= 420 ~ "good sleep",
    avg_time_asleep > 420 ~ "excellent sleep"))
sleepcategories
## # A tibble: 24 × 3
## Id avg_time_asleep type
## <dbl> <dbl> <chr>
## 1 1503960366 360. good sleep
## 2 1644430081 294 poor sleep
## 3 1844505072 652 excellent sleep
## 4 1927972279 417 good sleep
## 5 2026352035 506. excellent sleep
## 6 2320127002 61 poor sleep
## 7 2347167796 447. excellent sleep
## 8 3977333714 294. poor sleep
## 9 4020332650 349. good sleep
## 10 4319703577 477. excellent sleep
## # ℹ 14 more rows
Similarly, I categorised the average daily steps of respondents into the following:
• Sedentary - Less than 5000 steps
• Fairly Active - More than or equal to 5000 but less than 8000 steps
• Active - More than or equal to 8000 but less than 12000 steps
• Very Active - More than or equal to 12000 steps
stepcategories <- daily_activity %>%
  group_by(Id) %>%
  summarise(avg_step = mean(TotalSteps)) %>%
  mutate(active_type = case_when(
    avg_step < 5000 ~ "sedentary",
    avg_step >= 5000 & avg_step < 8000 ~ "fairly active",
    avg_step >= 8000 & avg_step < 12000 ~ "active",
    avg_step >= 12000 ~ "very active"))
stepcategories
## # A tibble: 33 × 3
## Id avg_step active_type
## <dbl> <dbl> <chr>
## 1 1503960366 12117. very active
## 2 1624580081 5744. fairly active
## 3 1644430081 7283. fairly active
## 4 1844505072 2580. sedentary
## 5 1927972279 916. sedentary
## 6 2022484408 11371. active
## 7 2026352035 5567. fairly active
## 8 2320127002 4717. sedentary
## 9 2347167796 9520. active
## 10 2873212765 7556. fairly active
## # ℹ 23 more rows
caloriesxsteps <- ggplot(data = daily_activity, aes(x = TotalSteps, y = Calories)) +
  geom_point(color = "red") +
  geom_smooth() +
  labs(title = "Calories Burned vs Total Daily Steps", x = "Total Steps", y = "Calories Burned")
caloriesxsteps
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
As suspected, there is a positive correlation between a respondent’s total steps and calories burned: days with more steps tend to be days with more calories burned.
Now I will compare the total active minutes of respondents with the total calories burned.
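The strength of a relationship like this can be quantified rather than read off the plot. The sketch below uses a small made-up data frame (toy_activity and its values are assumptions for illustration, not figures from the dataset) and base R's cor() and cor.test().

```r
# Toy stand-in for daily_activity; the values are illustrative only.
toy_activity <- data.frame(
  TotalSteps = c(0, 3790, 7406, 10727, 13162),
  Calories   = c(1500, 1828, 2134, 2793, 2900)
)

# Pearson correlation coefficient: values near 1 indicate a strong
# positive linear relationship, values near 0 indicate none.
r <- cor(toy_activity$TotalSteps, toy_activity$Calories)
r

# cor.test() adds a confidence interval and a p-value for the coefficient.
cor.test(toy_activity$TotalSteps, toy_activity$Calories)
```

On the real data the call would be cor(daily_activity$TotalSteps, daily_activity$Calories). A coefficient only describes association, so it would not by itself show that extra steps cause the extra calories burned.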
totalactivexcalories <- ggplot(dailyactivity_new, aes(x = total_active_minutes, y = Calories)) +
  geom_point(color = "turquoise") +
  geom_smooth() +
  labs(title = "Daily Active Minutes vs Calories Burned", x = "Active Minutes", y = "Calories Burned")
totalactivexcalories
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
From the above, a positive correlation between calories burned and active minutes can be observed. This means that more active respondents tend to burn more calories than less active respondents.
Now I will examine the activity of respondents further by analysing their daily steps in the categories that were created earlier.
steptype <- ggplot(data = stepcategories, aes(x = active_type, fill = active_type)) +
  geom_bar() +
  labs(title = "Average Steps")
steptype
From the above, we can see that most respondents are either active or fairly active, with fewer in the “very active” category. In addition, although fewer respondents are sedentary than active or fairly active, a sedentary lifestyle is the least desirable outcome, so encouraging these users towards more activity is something to consider later.
I will now investigate the sleeping patterns of the respondents.
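To back statements about group sizes with numbers, the share of respondents in each category can be tabulated. This is a minimal sketch on made-up data (toy_steps and its six rows are assumptions for illustration), using dplyr's count() and mutate().

```r
library(dplyr)

# Toy stand-in for stepcategories; the rows are illustrative only.
toy_steps <- data.frame(
  Id = 1:6,
  active_type = c("sedentary", "fairly active", "fairly active",
                  "active", "active", "very active")
)

# count() tallies respondents per category; pct expresses each tally
# as a percentage of all respondents.
share <- toy_steps %>%
  count(active_type) %>%
  mutate(pct = round(100 * n / sum(n), 1))
share
```

Running the same pipeline on stepcategories or sleepcategories would give the exact percentages behind each bar chart.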
sleeptype <- ggplot(data = sleepcategories, aes(x = type, fill = type)) +
  geom_bar() +
  labs(title = "Time Asleep", x = "Types")
sleeptype
The largest group of respondents is getting excellent sleep, while equal numbers of respondents are getting good and poor sleep. Bellabeat should develop a feature that helps move respondents from the poor sleep category to the good sleep category.
I will now outline my recommendations for each of the metrics that were analysed and provide a rationale for where Bellabeat should focus their marketing.
There is a positive correlation between calories burned, daily steps and active minutes; however, a significant number of users live a sedentary lifestyle. As mentioned before, Bellabeat should focus on helping users make a lifestyle change to be more active. My recommendation would be to create a new feature on the mobile app that allows users to set daily activity or step goals and sends them notifications at milestones. For instance, if the user sets a goal of 10,000 steps per day, the app can send a notification that says “Only x more steps til you reach 10,000” and then another when 10,000 steps are reached.
As shown earlier, 50% of users are getting excellent sleep, but the remaining users are getting only good or poor sleep, and they cannot be ignored.
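The milestone notification could work roughly as sketched below. The function name step_goal_message, its arguments and the wording of the "goal reached" message are hypothetical, introduced only to illustrate the idea; this is not an existing Bellabeat feature or API.

```r
# Hypothetical helper: builds the reminder text for a daily step goal.
step_goal_message <- function(steps_so_far, goal = 10000) {
  # formatC() with big.mark inserts the thousands separator, e.g. "10,000".
  goal_text <- formatC(goal, format = "d", big.mark = ",")
  remaining <- goal - steps_so_far

  # Once the goal is met or passed, switch to a congratulation message.
  if (remaining <= 0) {
    return(sprintf("Goal reached: %s steps today!", goal_text))
  }

  remaining_text <- formatC(remaining, format = "d", big.mark = ",")
  sprintf("Only %s more steps til you reach %s", remaining_text, goal_text)
}

step_goal_message(7406)   # a day still short of the goal
step_goal_message(10250)  # a day on which the goal has been passed
```

Keeping the goal as an argument means the same logic would serve users who choose a target other than 10,000 steps.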
A key contributor to excellent sleep is a consistent, early bedtime supported by a proper bedtime routine. My suggestion would be to include a bedtime routine notification, during which non-urgent apps on the phone are silenced and a blue light filter is switched on to support melatonin secretion and more restful sleep. Users should also be encouraged to put their devices aside 1 hour before bed and instead opt for other activities that help them wind down, such as reading, breathing exercises or mindfulness.
The app could offer guided breathing exercises or mindfulness sessions, and some users may also benefit from “white noise” such as rain sounds or fireplace sounds.