Urška Sršen and Sando Mur founded Bellabeat, a high-tech manufacturer of health-focused products for women. Bellabeat is a small company that has the potential to become a large player in the smart-device market. Bellabeat collects data on activity, sleep, stress, and reproductive health to empower women with knowledge about their own health and habits.
Bellabeat’s marketing team believes that analyzing smart fitness device data could help unlock new growth opportunities for the company. Their team would like advice and recommendations for growth based on trends of non-Bellabeat smart devices that could be applicable for their own products.
We will be analyzing FitBit Fitness Tracker Data. This dataset contains personal fitness data from thirty Fitbit users. These users consented to the submission of their personal data.
A quick look at the CSV files showed that some tables have more than 1 million rows, which makes analysis in spreadsheets impractical. Between R and SQL, I decided to use R for easy data formatting and presentation.
install.packages("tidyverse")
install.packages("lubridate")
install.packages("dplyr")
install.packages("ggplot2")
install.packages("tidyr")
install.packages("hms")
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.5
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.1 ✔ stringr 1.4.1
## ✔ readr 2.1.3 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(lubridate)
##
## Attaching package: 'lubridate'
##
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(dplyr)
library(ggplot2)
library(tidyr)
library(hms)
##
## Attaching package: 'hms'
##
## The following object is masked from 'package:lubridate':
##
## hms
All csv files were imported into RStudio Cloud and loaded below.
daily_activity <- read.csv("dailyActivity_merged.csv")
daily_calories <- read.csv("dailyCalories_merged.csv")
daily_intensity <- read.csv("dailyIntensities_merged.csv")
daily_steps <- read.csv("dailySteps_merged.csv")
heartrate_seconds <- read.csv("heartrate_seconds_merged.csv")
hourly_calories <- read.csv("hourlyCalories_merged.csv")
hourly_intensities <- read.csv("hourlyIntensities_merged.csv")
hourly_steps <- read.csv("hourlySteps_merged.csv")
sleep_day <- read.csv("sleepDay_merged.csv")
weight_log_info <- read.csv("weightLogInfo_merged.csv")
glimpse(daily_activity)
## Rows: 940
## Columns: 15
## $ Id <dbl> 1503960366, 1503960366, 1503960366, 150396036…
## $ ActivityDate <chr> "4/12/2016", "4/13/2016", "4/14/2016", "4/15/…
## $ TotalSteps <int> 13162, 10735, 10460, 9762, 12669, 9705, 13019…
## $ TotalDistance <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59, 9.8…
## $ TrackerDistance <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59, 9.8…
## $ LoggedActivitiesDistance <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ VeryActiveDistance <dbl> 1.88, 1.57, 2.44, 2.14, 2.71, 3.19, 3.25, 3.5…
## $ ModeratelyActiveDistance <dbl> 0.55, 0.69, 0.40, 1.26, 0.41, 0.78, 0.64, 1.3…
## $ LightActiveDistance <dbl> 6.06, 4.71, 3.91, 2.83, 5.04, 2.51, 4.71, 5.0…
## $ SedentaryActiveDistance <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ VeryActiveMinutes <int> 25, 21, 30, 29, 36, 38, 42, 50, 28, 19, 66, 4…
## $ FairlyActiveMinutes <int> 13, 19, 11, 34, 10, 20, 16, 31, 12, 8, 27, 21…
## $ LightlyActiveMinutes <int> 328, 217, 181, 209, 221, 164, 233, 264, 205, …
## $ SedentaryMinutes <int> 728, 776, 1218, 726, 773, 539, 1149, 775, 818…
## $ Calories <int> 1985, 1797, 1776, 1745, 1863, 1728, 1921, 203…
Here we can see a quick summary of the daily activity file. There are 15 columns, each listed with its data type. Note that “ActivityDate” is stored as a character type and must be converted to a date before we can analyze this table properly.
Let’s take a look at another file.
glimpse(daily_calories)
## Rows: 940
## Columns: 3
## $ Id <dbl> 1503960366, 1503960366, 1503960366, 1503960366, 1503960366…
## $ ActivityDay <chr> "4/12/2016", "4/13/2016", "4/14/2016", "4/15/2016", "4/16/…
## $ Calories <int> 1985, 1797, 1776, 1745, 1863, 1728, 1921, 2035, 1786, 1775…
Here we see the “Id” column again, and “ActivityDay” is again stored as a character type. We convert the date columns in every table next.
#Daily Activity
daily_activity$ActivityDate <- mdy(daily_activity$ActivityDate)
#Daily Calories
daily_calories$ActivityDay <- mdy(daily_calories$ActivityDay)
#Daily Intensities
daily_intensity$ActivityDay <- mdy(daily_intensity$ActivityDay)
#Daily Steps
daily_steps$ActivityDay <- mdy(daily_steps$ActivityDay)
#Heartrate Seconds
heartrate_seconds$Time <- parse_date_time(heartrate_seconds$Time, "%m/%d/%Y %I:%M:%S %p")
#Hourly Calories
hourly_calories$ActivityHour <- parse_date_time(hourly_calories$ActivityHour, "%m/%d/%Y %I:%M:%S %p")
#Hourly Intensities
hourly_intensities$ActivityHour <- parse_date_time(hourly_intensities$ActivityHour, "%m/%d/%Y %I:%M:%S %p")
#Hourly Steps
hourly_steps$ActivityHour <- parse_date_time(hourly_steps$ActivityHour, "%m/%d/%Y %I:%M:%S %p")
#Sleep Day
sleep_day$SleepDay <- parse_date_time(sleep_day$SleepDay, "%m/%d/%Y %I:%M:%S %p")
#Weight Log Info
weight_log_info$Date <- parse_date_time(weight_log_info$Date, "%m/%d/%Y %I:%M:%S %p")
Let’s check two different tables to confirm that the dates are now formatted correctly.
data.class(daily_activity$ActivityDate)
## [1] "Date"
daily_activity$ActivityDate[1:2]
## [1] "2016-04-12" "2016-04-13"
data.class(heartrate_seconds$Time)
## [1] "POSIXct"
heartrate_seconds$Time[1:2]
## [1] "2016-04-12 07:21:00 UTC" "2016-04-12 07:21:05 UTC"
With the column data types now corrected, we can review the data.
With a glimpse of each table, we see that they all have “Id” in common. This is the distinct identifier for each user.
n_distinct(daily_activity$Id)
## [1] 33
n_distinct(daily_calories$Id)
## [1] 33
n_distinct(daily_intensity$Id)
## [1] 33
n_distinct(daily_steps$Id)
## [1] 33
n_distinct(heartrate_seconds$Id)
## [1] 14
n_distinct(hourly_calories$Id)
## [1] 33
n_distinct(hourly_intensities$Id)
## [1] 33
n_distinct(hourly_steps$Id)
## [1] 33
n_distinct(sleep_day$Id)
## [1] 24
n_distinct(weight_log_info$Id)
## [1] 8
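The per-table user counts can also be gathered in one pass. The following is a minimal sketch in base R, assuming the tables are collected in a named list; the toy data frames and the names `tables` and `users_per_table` are illustrative stand-ins, not part of the original analysis.

```r
# Toy stand-ins for the imported tables (made-up Ids). In the analysis these
# would be the data frames loaded earlier, e.g. daily_activity or sleep_day.
tables <- list(
  daily_activity  = data.frame(Id = c(1, 1, 2, 3)),
  sleep_day       = data.frame(Id = c(1, 2, 2)),
  weight_log_info = data.frame(Id = c(3, 3))
)

# Count the distinct Ids in each table and return a named vector
users_per_table <- sapply(tables, function(df) length(unique(df$Id)))
users_per_table
```

Keeping the counts in one named vector makes it easier to compare tables side by side than reading ten separate outputs.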
We notice that most of these tables contain 33 distinct users. Given this, we will exclude “heartrate_seconds” (14 users) and “weight_log_info” (8 users), since too few users participated in those features to form a good sample for the analysis. We will keep sleep_day with its 24 users, keeping in mind that at a 95% confidence level this sample carries a margin of error of 10.61%.
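The 10.61% figure is consistent with the standard margin-of-error formula for a proportion with a finite population correction, treating the 33 users as the population and the 24 sleep users as the sample. That reading is an assumption, since the calculation itself is not shown. A minimal sketch, where `margin_of_error` is a hypothetical helper written for illustration:

```r
# Margin of error for a proportion with finite population correction.
# n: sample size, N: population size, conf: confidence level,
# p: assumed proportion (0.5 is the most conservative choice).
margin_of_error <- function(n, N, conf = 0.95, p = 0.5) {
  z   <- qnorm(1 - (1 - conf) / 2)   # two-sided critical value
  fpc <- sqrt((N - n) / (N - 1))     # finite population correction
  z * sqrt(p * (1 - p) / n) * fpc
}

# 24 sleep users out of 33 participants
moe_sleep <- margin_of_error(n = 24, N = 33)
round(100 * moe_sleep, 2)
```

The same helper can be called with `n = 14` or `n = 8` to see how much wider the margin becomes for the two excluded tables.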
daily_activity %>%
select(TotalSteps,
TotalDistance,
VeryActiveMinutes,
FairlyActiveMinutes,
LightlyActiveMinutes,
SedentaryMinutes,
Calories) %>%
summary()
## TotalSteps TotalDistance VeryActiveMinutes FairlyActiveMinutes
## Min. : 0 Min. : 0.000 Min. : 0.00 Min. : 0.00
## 1st Qu.: 3790 1st Qu.: 2.620 1st Qu.: 0.00 1st Qu.: 0.00
## Median : 7406 Median : 5.245 Median : 4.00 Median : 6.00
## Mean : 7638 Mean : 5.490 Mean : 21.16 Mean : 13.56
## 3rd Qu.:10727 3rd Qu.: 7.713 3rd Qu.: 32.00 3rd Qu.: 19.00
## Max. :36019 Max. :28.030 Max. :210.00 Max. :143.00
## LightlyActiveMinutes SedentaryMinutes Calories
## Min. : 0.0 Min. : 0.0 Min. : 0
## 1st Qu.:127.0 1st Qu.: 729.8 1st Qu.:1828
## Median :199.0 Median :1057.5 Median :2134
## Mean :192.8 Mean : 991.2 Mean :2304
## 3rd Qu.:264.0 3rd Qu.:1229.5 3rd Qu.:2793
## Max. :518.0 Max. :1440.0 Max. :4900
sleep_day %>%
select(TotalSleepRecords,
TotalMinutesAsleep,
TotalTimeInBed) %>%
summary()
## TotalSleepRecords TotalMinutesAsleep TotalTimeInBed
## Min. :1.000 Min. : 58.0 Min. : 61.0
## 1st Qu.:1.000 1st Qu.:361.0 1st Qu.:403.0
## Median :1.000 Median :433.0 Median :463.0
## Mean :1.119 Mean :419.5 Mean :458.6
## 3rd Qu.:1.000 3rd Qu.:490.0 3rd Qu.:526.0
## Max. :3.000 Max. :796.0 Max. :961.0
hourly_intensities %>%
select(ActivityHour,
TotalIntensity,
AverageIntensity) %>%
summary()
## ActivityHour TotalIntensity AverageIntensity
## Min. :2016-04-12 00:00:00.00 Min. : 0.00 Min. :0.0000
## 1st Qu.:2016-04-19 01:00:00.00 1st Qu.: 0.00 1st Qu.:0.0000
## Median :2016-04-26 06:00:00.00 Median : 3.00 Median :0.0500
## Mean :2016-04-26 11:46:42.58 Mean : 12.04 Mean :0.2006
## 3rd Qu.:2016-05-03 19:00:00.00 3rd Qu.: 16.00 3rd Qu.:0.2667
## Max. :2016-05-12 15:00:00.00 Max. :180.00 Max. :3.0000
By looking at the information above, we make the following observations:
combined_data <- merge(daily_activity, sleep_day, by = 'Id')
head(combined_data)
## Id ActivityDate TotalSteps TotalDistance TrackerDistance
## 1 1503960366 2016-05-07 11992 7.71 7.71
## 2 1503960366 2016-05-07 11992 7.71 7.71
## 3 1503960366 2016-05-07 11992 7.71 7.71
## 4 1503960366 2016-05-07 11992 7.71 7.71
## 5 1503960366 2016-05-07 11992 7.71 7.71
## 6 1503960366 2016-05-07 11992 7.71 7.71
## LoggedActivitiesDistance VeryActiveDistance ModeratelyActiveDistance
## 1 0 2.46 2.12
## 2 0 2.46 2.12
## 3 0 2.46 2.12
## 4 0 2.46 2.12
## 5 0 2.46 2.12
## 6 0 2.46 2.12
## LightActiveDistance SedentaryActiveDistance VeryActiveMinutes
## 1 3.13 0 37
## 2 3.13 0 37
## 3 3.13 0 37
## 4 3.13 0 37
## 5 3.13 0 37
## 6 3.13 0 37
## FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes Calories SleepDay
## 1 46 175 833 1821 2016-04-12
## 2 46 175 833 1821 2016-04-13
## 3 46 175 833 1821 2016-04-15
## 4 46 175 833 1821 2016-04-16
## 5 46 175 833 1821 2016-04-17
## 6 46 175 833 1821 2016-04-19
## TotalSleepRecords TotalMinutesAsleep TotalTimeInBed
## 1 1 327 346
## 2 2 384 407
## 3 1 412 442
## 4 2 340 367
## 5 1 700 712
## 6 1 304 320
n_distinct(combined_data$Id)
## [1] 24
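In the `head(combined_data)` output, the same activity row (2016-05-07) is repeated against several different SleepDay values. This suggests that merging on Id alone pairs every activity day with every sleep day for the same user. The sketch below uses toy data with made-up values to show the difference between that merge and one keyed on both Id and date. The names `activity`, `sleep`, `by_id`, and `by_id_date` are illustrative only.

```r
# Toy stand-ins for daily_activity and sleep_day (illustrative values)
activity <- data.frame(
  Id = c(1, 1, 2, 2),
  ActivityDate = as.Date(c("2016-04-12", "2016-04-13",
                           "2016-04-12", "2016-04-13")),
  TotalSteps = c(13162, 10735, 8000, 9000)
)
sleep <- data.frame(
  Id = c(1, 1, 2),
  SleepDay = as.POSIXct(c("2016-04-12", "2016-04-13", "2016-04-12"),
                        tz = "UTC"),
  TotalMinutesAsleep = c(327, 384, 412)
)

# Merging on Id alone: every activity row meets every sleep row of that user
by_id <- merge(activity, sleep, by = "Id")
nrow(by_id)       # 2*2 rows for Id 1 plus 2*1 rows for Id 2 = 6

# Merging on Id and date: one row per user per day present in both tables
sleep$ActivityDate <- as.Date(sleep$SleepDay)
by_id_date <- merge(activity, sleep, by = c("Id", "ActivityDate"))
nrow(by_id_date)  # 3
```

If the date-keyed merge were used in the analysis, the plots would show one point per user-day rather than repeated combinations. The figures and head() output in this report reflect the Id-only merge.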
ggplot(data = combined_data, aes(x = TotalSteps, y = Calories)) +
geom_point() + geom_smooth(method = lm) + labs(title = "Total Steps Vs Calories Expended", x = "Total Steps", y = "Calories")
## `geom_smooth()` using formula 'y ~ x'
Here we can make some quick observations based on the data we have. The graph shows a positive correlation between total steps and calories. In other words, the more steps you take, the more calories you burn.
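The strength of this relationship can be quantified with a correlation coefficient. The sketch below uses only the first seven TotalSteps and Calories values visible in the `glimpse(daily_activity)` output, so the result is illustrative and is not the value for the full dataset.

```r
# First seven rows of TotalSteps and Calories, as shown by glimpse()
steps    <- c(13162, 10735, 10460, 9762, 12669, 9705, 13019)
calories <- c(1985, 1797, 1776, 1745, 1863, 1728, 1921)

# Pearson correlation: values near 1 indicate a strong positive
# linear relationship, values near -1 a strong negative one
r <- cor(steps, calories)
r
```

On the full data, the equivalent call would be `cor(combined_data$TotalSteps, combined_data$Calories)`.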
ggplot(data = combined_data, aes(x = TotalSteps, y = SedentaryMinutes)) +
geom_point() + geom_smooth(method = lm) + labs(title = "Total Steps vs Time Sitting", x= "Total Steps", y= "Sedentary Minutes")
## `geom_smooth()` using formula 'y ~ x'
In this graph, we compare total sedentary time to total steps taken. There is a negative correlation between the variables. In other words, the more time you spend sitting, the fewer steps you take. We can also see that participants spend more time sitting than getting their steps in.
ggplot(data = combined_data, aes(x = TotalMinutesAsleep, y = TotalTimeInBed, color = TotalSleepRecords)) + facet_grid(~TotalSleepRecords) +
geom_point() + labs(title = "Total Minutes Asleep Vs Total Time in Bed", x= "Total Minutes Asleep", y="Total Minutes in Bed") + geom_vline(xintercept = 419.5, color = "red", linetype = "dashed") + annotate("text", label = "7 Hours", x = 200, y = 800, color = "black", size = 3)
Here I have separated Total Minutes Asleep vs. Total Time in Bed by the number of sleep cycles recorded in a day. The red dashed line marks the average time participants are asleep, 419.5 minutes or roughly 7 hours. Participants who recorded one sleep cycle have more points to the left of the average, participants with two sleep cycles have more points to the right of the red line, and participants who recorded three sleep cycles slept more than the average.
#Used the hms library to extract just the time from the datetime
hourly_intensities$ActivityHour <- as_hms(hourly_intensities$ActivityHour)
#Filtered the data to group by ActivityHour to easily analyze for plotting
filtered_data <- hourly_intensities %>%
group_by(ActivityHour) %>%
summarise(avg_total_intensity = mean(TotalIntensity))
#For plotting
ggplot(data = filtered_data, aes(x=ActivityHour, y= avg_total_intensity)) + geom_col() + labs(y = "Average Total Intensity", x = "Daily Hour", title = "Time Most Active In a Day")
In the graph above we can see the average total intensity of activity across participants by time of day. This graph reflects multiple days of observation. On average, participants are most active from 12pm to 2pm and from 5pm to 7pm.
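The peak hours can also be read off programmatically instead of from the chart. This is a minimal base R sketch on toy data with made-up values, shaped like the hourly intensity table. The names `toy`, `avg_by_hour`, and `ranked` are illustrative.

```r
# Toy hourly records (made-up values), shaped like hourly_intensities
toy <- data.frame(
  ActivityHour = as.POSIXct(
    c("2016-04-12 08:00:00", "2016-04-12 12:00:00", "2016-04-12 18:00:00",
      "2016-04-13 08:00:00", "2016-04-13 12:00:00", "2016-04-13 18:00:00"),
    tz = "UTC"
  ),
  TotalIntensity = c(10, 20, 30, 12, 24, 26)
)

# Hour of day as a two-digit string ("00" to "23")
toy$hour <- format(toy$ActivityHour, "%H")

# Mean intensity for each hour of day, across all days
avg_by_hour <- aggregate(TotalIntensity ~ hour, data = toy, FUN = mean)

# Hours ranked from most to least intense
ranked <- avg_by_hour[order(-avg_by_hour$TotalIntensity), ]
ranked
```

Taking the top few rows of `ranked` gives the candidate hours for the activity windows discussed in the text.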
Through analyzing the FitBit Fitness Tracker Data we have made the following observations with the limited information we have.
The data ranges from April 12, 2016 to May 12, 2016. We have 33 participants who consented to the use of their data. Participants are mostly lightly active, averaging around 192 minutes of light activity per day, compared with around 21 very active minutes per day. The average sedentary time is 16.5 hours per day, and participants sleep on average once a day for about 7 hours. The participants are most active from 12pm to 2pm and from 5pm to 7pm.
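The hour figures follow directly from the means reported by `summary()` earlier (991.2 sedentary minutes and 419.5 minutes asleep). A short worked conversion, with variable names chosen for illustration:

```r
# Means taken from the summary() output of daily_activity and sleep_day
mean_sedentary_min <- 991.2
mean_asleep_min    <- 419.5

# Convert minutes to hours
sedentary_hours <- mean_sedentary_min / 60
asleep_hours    <- mean_asleep_min / 60

round(c(sedentary = sedentary_hours, asleep = asleep_hours), 1)
```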
If we had more information about age, weight, and height, a more detailed analysis could be constructed. Without it, we will treat women in general as the target audience.
There are a lot of smart devices in today’s market that document health data for the betterment of their users. If Bellabeat wants to be a contender with the big players, they need to target areas in this data that can be applicable to their own users. For example, we know that the users are most active from 12pm to 2pm and from 5pm to 7pm. Bellabeat could push timed notifications or even create a personalized program that caters to the user’s schedule to help and remind them to be more active.
The CDC recommends at least 150 minutes of moderate-intensity physical activity per week, plus muscle-strengthening activities on two days a week. Since Bellabeat’s mission is to empower women, they could create workout programs that appeal to their audience. Having more options for exercise may entice more women to reduce the time they spend sitting and promote their health and well-being. Workouts, stretching, cardio, yoga, and meditation are examples of features that could be implemented in their app.
Articles on healthy food alternatives and recipes could be a great feature for Bellabeat’s target audience. Articles about women’s health can also help empower women. Topics such as breast cancer awareness, mental health, stress, and daily exercise can keep women informed.
Smart devices also need to look appealing and fashionable. I would not like wearing a huge device that seems out of place on my body just to document data. Designing the smart device to fit with people’s clothing is a great way to encourage consistent use of the program. The more time a person has the device on them, the higher the chance that they can be notified of their health and progress.