About

Urška Sršen and Sando Mur founded Bellabeat, a high-tech manufacturer of health-focused products for women. Bellabeat is a small company with the potential to become a large player in the smart-device market. Bellabeat collects data on activity, sleep, stress and reproductive health to empower women with knowledge about their own health and habits.

Business Task

Bellabeat’s marketing team believes that analyzing smart fitness device data could help unlock new growth opportunities for the company. The team would like recommendations for growth based on usage trends in non-Bellabeat smart devices that could apply to their own products.

Questions for Analysis

  1. What are some trends in smart device usage?
  2. How could these trends apply to Bellabeat customers?
  3. How could these trends help influence Bellabeat marketing strategy?

Spreadsheet, SQL or R for Analysis?

We will be analyzing the FitBit Fitness Tracker Data. This dataset is described as containing personal fitness data from thirty Fitbit users, all of whom consented to the submission of their personal data.

A quick look at the csv files showed that some tables have more than 1 million rows, which makes analysis in spreadsheets impractical. Between R and SQL, I decided to use R for easy data formatting and presentation.

Installing and Loading R Packages

install.packages("tidyverse")
install.packages("lubridate")
install.packages("dplyr")
install.packages("ggplot2")
install.packages("tidyr")
install.packages("hms")
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6      ✔ purrr   0.3.5 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.4.1 
## ✔ readr   2.1.3      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(lubridate)
## 
## Attaching package: 'lubridate'
## 
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
library(dplyr)
library(ggplot2)
library(tidyr)
library(hms)
## 
## Attaching package: 'hms'
## 
## The following object is masked from 'package:lubridate':
## 
##     hms

Loading the Data

All csv files were imported into RStudio Cloud and loaded below.

daily_activity <- read.csv("dailyActivity_merged.csv")
daily_calories <- read.csv("dailyCalories_merged.csv")
daily_intensity <- read.csv("dailyIntensities_merged.csv")
daily_steps <- read.csv("dailySteps_merged.csv")
heartrate_seconds <- read.csv("heartrate_seconds_merged.csv")
hourly_calories <- read.csv("hourlyCalories_merged.csv")
hourly_intensities <- read.csv("hourlyIntensities_merged.csv")
hourly_steps <- read.csv("hourlySteps_merged.csv")
sleep_day <- read.csv("sleepDay_merged.csv")
weight_log_info <- read.csv("weightLogInfo_merged.csv")

Summary of files

glimpse(daily_activity)
## Rows: 940
## Columns: 15
## $ Id                       <dbl> 1503960366, 1503960366, 1503960366, 150396036…
## $ ActivityDate             <chr> "4/12/2016", "4/13/2016", "4/14/2016", "4/15/…
## $ TotalSteps               <int> 13162, 10735, 10460, 9762, 12669, 9705, 13019…
## $ TotalDistance            <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59, 9.8…
## $ TrackerDistance          <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59, 9.8…
## $ LoggedActivitiesDistance <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ VeryActiveDistance       <dbl> 1.88, 1.57, 2.44, 2.14, 2.71, 3.19, 3.25, 3.5…
## $ ModeratelyActiveDistance <dbl> 0.55, 0.69, 0.40, 1.26, 0.41, 0.78, 0.64, 1.3…
## $ LightActiveDistance      <dbl> 6.06, 4.71, 3.91, 2.83, 5.04, 2.51, 4.71, 5.0…
## $ SedentaryActiveDistance  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ VeryActiveMinutes        <int> 25, 21, 30, 29, 36, 38, 42, 50, 28, 19, 66, 4…
## $ FairlyActiveMinutes      <int> 13, 19, 11, 34, 10, 20, 16, 31, 12, 8, 27, 21…
## $ LightlyActiveMinutes     <int> 328, 217, 181, 209, 221, 164, 233, 264, 205, …
## $ SedentaryMinutes         <int> 728, 776, 1218, 726, 773, 539, 1149, 775, 818…
## $ Calories                 <int> 1985, 1797, 1776, 1745, 1863, 1728, 1921, 203…

Here we can see a quick summary of the daily activity file. There are 15 columns, each listed with its data type. We can also see that “ActivityDate” is stored as a character type and must be converted to a date before we can analyze this table properly.

Let's take a look at another file.

glimpse(daily_calories)
## Rows: 940
## Columns: 3
## $ Id          <dbl> 1503960366, 1503960366, 1503960366, 1503960366, 1503960366…
## $ ActivityDay <chr> "4/12/2016", "4/13/2016", "4/14/2016", "4/15/2016", "4/16/…
## $ Calories    <int> 1985, 1797, 1776, 1745, 1863, 1728, 1921, 2035, 1786, 1775…

Here we see the “Id” column again, and “ActivityDay” is again stored as a character type.

Converting the character columns into a date format

#Daily Activity 
daily_activity$ActivityDate <- mdy(daily_activity$ActivityDate)

#Daily Calories
daily_calories$ActivityDay <- mdy(daily_calories$ActivityDay)

#Daily Intensities
daily_intensity$ActivityDay <- mdy(daily_intensity$ActivityDay)

#Daily Steps
daily_steps$ActivityDay <- mdy(daily_steps$ActivityDay)

#Heartrate Seconds
#The raw values look like "4/12/2016 7:21:00 AM": slashes between all date parts and a four-digit year
heartrate_seconds$Time <- parse_date_time(heartrate_seconds$Time, "%m/%d/%Y %I:%M:%S %p")

#Hourly Calories
hourly_calories$ActivityHour <- parse_date_time(hourly_calories$ActivityHour, "%m/%d/%Y %I:%M:%S %p")

#Hourly Intensities
hourly_intensities$ActivityHour <- parse_date_time(hourly_intensities$ActivityHour, "%m/%d/%Y %I:%M:%S %p")

#Hourly Steps
hourly_steps$ActivityHour <- parse_date_time(hourly_steps$ActivityHour, "%m/%d/%Y %I:%M:%S %p")

#Sleep Day
sleep_day$SleepDay <- parse_date_time(sleep_day$SleepDay, "%m/%d/%Y %I:%M:%S %p")

#Weight Log Info
weight_log_info$Date <- parse_date_time(weight_log_info$Date, "%m/%d/%Y %I:%M:%S %p")

Let's check two of the tables to confirm that the columns are now formatted correctly.

data.class(daily_activity$ActivityDate)
## [1] "Date"
daily_activity$ActivityDate[1:2]
## [1] "2016-04-12" "2016-04-13"
data.class(heartrate_seconds$Time)
## [1] "POSIXct"
heartrate_seconds$Time[1:2]
## [1] "2016-04-12 07:21:00 UTC" "2016-04-12 07:21:05 UTC"

With the column data types now corrected, we can review the data.
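A further sanity check is to count how many values failed to parse, since a wrong format string produces NA rather than an error. The following is a minimal sketch in base R using two toy values per column in the same layout as the raw files. The vectors `raw_dates` and `raw_times` are illustrative stand-ins, not the real columns, and `as.Date()` / `as.POSIXct()` are used here as base R counterparts of `mdy()` and `parse_date_time()`.

```r
# Toy values in the same layout as the raw CSV columns (illustrative only)
raw_dates <- c("4/12/2016", "4/13/2016")
raw_times <- c("4/12/2016 7:21:00 AM", "4/12/2016 7:21:05 AM")

# Base R counterparts of mdy() and parse_date_time()
parsed_dates <- as.Date(raw_dates, format = "%m/%d/%Y")
parsed_times <- as.POSIXct(raw_times, format = "%m/%d/%Y %I:%M:%S %p", tz = "UTC")

# Values that fail to parse become NA, so a non-zero count flags a bad format string
sum(is.na(parsed_dates))
sum(is.na(parsed_times))
```

The same `sum(is.na(...))` check can be run on each converted column of the real tables.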

Summary of Data

With a glimpse of each table, we see that they all have “Id” in common. This is the distinct identifier for each user.

n_distinct(daily_activity$Id)
## [1] 33
n_distinct(daily_calories$Id)
## [1] 33
n_distinct(daily_intensity$Id)
## [1] 33
n_distinct(daily_steps$Id)
## [1] 33
n_distinct(heartrate_seconds$Id)
## [1] 14
n_distinct(hourly_calories$Id)
## [1] 33
n_distinct(hourly_intensities$Id)
## [1] 33
n_distinct(hourly_steps$Id)
## [1] 33
n_distinct(sleep_day$Id)
## [1] 24
n_distinct(weight_log_info$Id)
## [1] 8
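The ten separate calls can also be collapsed into a single step. This is a minimal sketch in base R, assuming the tables are collected in a named list. The list `tables` and its toy contents are hypothetical stand-ins for the loaded data frames.

```r
# Toy stand-ins for the loaded tables (hypothetical values)
tables <- list(
  daily_activity  = data.frame(Id = c(1, 1, 2, 3)),
  sleep_day       = data.frame(Id = c(1, 2, 2)),
  weight_log_info = data.frame(Id = c(3))
)

# Count the distinct users in every table at once
users_per_table <- sapply(tables, function(df) length(unique(df$Id)))
users_per_table
```

With the real data, the list would hold `daily_activity`, `daily_calories`, and the other loaded tables.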

We notice that most of these tables contain 33 distinct users. Given this, we will exclude “heartrate_seconds” (14 users) and “weight_log_info” (8 users), because too few users participated in those features to form a good sample for the analysis. We will keep sleep_day with its 24 users, but keep in mind that at a 95% confidence level this sample carries a margin of error of 10.61%.
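The 10.61% figure can be reproduced with the standard margin-of-error formula for a proportion, with a finite population correction. This is a sketch under assumptions: the text does not say how the figure was derived, so the population of N = 33 users, the conservative proportion p = 0.5, and the use of the correction are assumed here because together they give the same number.

```r
N <- 33            # distinct users in the daily tables (assumed population)
n <- 24            # users with sleep records (the sample)
p <- 0.5           # most conservative proportion (assumption)
z <- qnorm(0.975)  # two-sided 95% critical value, about 1.96

# Finite population correction, relevant because n is a large share of N
fpc <- sqrt((N - n) / (N - 1))

margin_of_error <- z * sqrt(p * (1 - p) / n) * fpc
round(100 * margin_of_error, 2)  # 10.61
```

Without the correction the margin would be about 20%, so the result depends heavily on treating the 33 users as the whole population.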

Daily Activity:

daily_activity %>% 
  select(TotalSteps,
         TotalDistance,
         VeryActiveMinutes,
         FairlyActiveMinutes,
         LightlyActiveMinutes,
         SedentaryMinutes,
         Calories) %>% 
  summary()
##    TotalSteps    TotalDistance    VeryActiveMinutes FairlyActiveMinutes
##  Min.   :    0   Min.   : 0.000   Min.   :  0.00    Min.   :  0.00     
##  1st Qu.: 3790   1st Qu.: 2.620   1st Qu.:  0.00    1st Qu.:  0.00     
##  Median : 7406   Median : 5.245   Median :  4.00    Median :  6.00     
##  Mean   : 7638   Mean   : 5.490   Mean   : 21.16    Mean   : 13.56     
##  3rd Qu.:10727   3rd Qu.: 7.713   3rd Qu.: 32.00    3rd Qu.: 19.00     
##  Max.   :36019   Max.   :28.030   Max.   :210.00    Max.   :143.00     
##  LightlyActiveMinutes SedentaryMinutes    Calories   
##  Min.   :  0.0        Min.   :   0.0   Min.   :   0  
##  1st Qu.:127.0        1st Qu.: 729.8   1st Qu.:1828  
##  Median :199.0        Median :1057.5   Median :2134  
##  Mean   :192.8        Mean   : 991.2   Mean   :2304  
##  3rd Qu.:264.0        3rd Qu.:1229.5   3rd Qu.:2793  
##  Max.   :518.0        Max.   :1440.0   Max.   :4900

Sleep Day:

sleep_day %>% 
  select(TotalSleepRecords,
         TotalMinutesAsleep,
         TotalTimeInBed) %>% 
  summary()
##  TotalSleepRecords TotalMinutesAsleep TotalTimeInBed 
##  Min.   :1.000     Min.   : 58.0      Min.   : 61.0  
##  1st Qu.:1.000     1st Qu.:361.0      1st Qu.:403.0  
##  Median :1.000     Median :433.0      Median :463.0  
##  Mean   :1.119     Mean   :419.5      Mean   :458.6  
##  3rd Qu.:1.000     3rd Qu.:490.0      3rd Qu.:526.0  
##  Max.   :3.000     Max.   :796.0      Max.   :961.0

Hourly Intensities:

hourly_intensities %>% 
  select(ActivityHour,
         TotalIntensity,
         AverageIntensity) %>% 
  summary()
##   ActivityHour                    TotalIntensity   AverageIntensity
##  Min.   :2016-04-12 00:00:00.00   Min.   :  0.00   Min.   :0.0000  
##  1st Qu.:2016-04-19 01:00:00.00   1st Qu.:  0.00   1st Qu.:0.0000  
##  Median :2016-04-26 06:00:00.00   Median :  3.00   Median :0.0500  
##  Mean   :2016-04-26 11:46:42.58   Mean   : 12.04   Mean   :0.2006  
##  3rd Qu.:2016-05-03 19:00:00.00   3rd Qu.: 16.00   3rd Qu.:0.2667  
##  Max.   :2016-05-12 15:00:00.00   Max.   :180.00   Max.   :3.0000

By looking at the information above, we make the following observations:

  • Participants are mostly lightly active, averaging 192.8 lightly active minutes per day compared with about 21 very active minutes per day.
  • Average sedentary time is 991.2 minutes per day, or roughly 16.5 hours.
  • On average, participants log 1 sleep record per day and sleep 419.5 minutes, or roughly 7 hours.
  • Average calories expended per day is 2304 kcal.
  • The data ranges from April 12, 2016 to May 12, 2016, so everything we see covers a time span of one month.
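The hour figures in these observations come from dividing the mean minutes in the summaries by 60. A short sketch of that conversion, using the means reported in the summary output (the vector name `mean_minutes` is only illustrative):

```r
# Mean minutes per day, copied from the summary() output
mean_minutes <- c(lightly_active = 192.8,
                  very_active    = 21.16,
                  sedentary      = 991.2,
                  asleep         = 419.5)

# Convert to hours, rounded to one decimal place
mean_hours <- round(mean_minutes / 60, 1)
mean_hours
```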

Merging datasets

combined_data <- merge(daily_activity, sleep_day, by = 'Id')
head(combined_data)
##           Id ActivityDate TotalSteps TotalDistance TrackerDistance
## 1 1503960366   2016-05-07      11992          7.71            7.71
## 2 1503960366   2016-05-07      11992          7.71            7.71
## 3 1503960366   2016-05-07      11992          7.71            7.71
## 4 1503960366   2016-05-07      11992          7.71            7.71
## 5 1503960366   2016-05-07      11992          7.71            7.71
## 6 1503960366   2016-05-07      11992          7.71            7.71
##   LoggedActivitiesDistance VeryActiveDistance ModeratelyActiveDistance
## 1                        0               2.46                     2.12
## 2                        0               2.46                     2.12
## 3                        0               2.46                     2.12
## 4                        0               2.46                     2.12
## 5                        0               2.46                     2.12
## 6                        0               2.46                     2.12
##   LightActiveDistance SedentaryActiveDistance VeryActiveMinutes
## 1                3.13                       0                37
## 2                3.13                       0                37
## 3                3.13                       0                37
## 4                3.13                       0                37
## 5                3.13                       0                37
## 6                3.13                       0                37
##   FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes Calories   SleepDay
## 1                  46                  175              833     1821 2016-04-12
## 2                  46                  175              833     1821 2016-04-13
## 3                  46                  175              833     1821 2016-04-15
## 4                  46                  175              833     1821 2016-04-16
## 5                  46                  175              833     1821 2016-04-17
## 6                  46                  175              833     1821 2016-04-19
##   TotalSleepRecords TotalMinutesAsleep TotalTimeInBed
## 1                 1                327            346
## 2                 2                384            407
## 3                 1                412            442
## 4                 2                340            367
## 5                 1                700            712
## 6                 1                304            320
n_distinct(combined_data$Id)
## [1] 24
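Note that the head() output above repeats the same activity day (2016-05-07) against several different sleep days. Merging on “Id” alone pairs every activity day of a user with every sleep day of that user. The sketch below uses small hypothetical data frames (`activity` and `sleep`, not the real tables) to contrast that with a merge on both user and date, which yields one row per user per day.

```r
# Hypothetical toy tables: two days for user 1, one day for user 2
activity <- data.frame(
  Id = c(1, 1, 2),
  ActivityDate = as.Date(c("2016-04-12", "2016-04-13", "2016-04-12")),
  TotalSteps = c(13162, 10735, 9000)
)
sleep <- data.frame(
  Id = c(1, 1, 2),
  SleepDay = as.Date(c("2016-04-12", "2016-04-13", "2016-04-12")),
  TotalMinutesAsleep = c(327, 384, 410)
)

# Merging on Id only: every activity day is paired with every sleep day per user
by_id <- merge(activity, sleep, by = "Id")

# Merging on Id and date: one row per user per day
by_id_and_date <- merge(activity, sleep,
                        by.x = c("Id", "ActivityDate"),
                        by.y = c("Id", "SleepDay"))

nrow(by_id)           # 5
nrow(by_id_and_date)  # 3
```

To apply this to the real tables, `sleep_day$SleepDay` would first need converting with `as.Date()`, because it was parsed as a date-time while `ActivityDate` is a Date. The plots that follow use the Id-only merge, so each activity day appears in them multiple times.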

Visualization

ggplot(data = combined_data, aes(x = TotalSteps, y = Calories)) +
  geom_point() + geom_smooth(method = lm) + labs(title = "Total Steps Vs Calories Expended", x = "Total Steps", y = "Calories")
## `geom_smooth()` using formula 'y ~ x'

Here we can make some quick observations based on the data we have. The graph shows a positive correlation between total steps and calories. In other words, the more steps you take, the more calories you burn.
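The visual trend can be backed by a number. This is a minimal sketch using only the first five TotalSteps and Calories values shown in the glimpse() output earlier. Five rows from one user are far too few to draw conclusions from, so this only illustrates the calls, which would be run on the full columns in practice.

```r
# First five rows from the glimpse() output of daily_activity
steps    <- c(13162, 10735, 10460, 9762, 12669)
calories <- c(1985, 1797, 1776, 1745, 1863)

# Pearson correlation: values near +1 indicate a strong positive linear relationship
r <- cor(steps, calories)

# The slope of the fitted line is the extra calories associated with one extra step
fit <- lm(calories ~ steps)
r
coef(fit)[["steps"]]
```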

ggplot(data = combined_data, aes(x = TotalSteps, y = SedentaryMinutes)) +
  geom_point() + geom_smooth(method = lm) + labs(title = "Total Steps vs Time Sitting", x= "Total Steps", y= "Sedentary Minutes")
## `geom_smooth()` using formula 'y ~ x'

In this graph, we compare total sedentary time to total steps taken. There is a negative correlation between the variables. In other words, the more time you spend sitting, the fewer steps you take. We can see that the participants spend more time sitting down than getting their steps in.

ggplot(data = combined_data, aes(x = TotalMinutesAsleep, y = TotalTimeInBed, color = TotalSleepRecords)) + facet_grid(~TotalSleepRecords) +
  geom_point() + labs(title = "Total Minutes Asleep Vs Total Time in Bed", x= "Total Minutes Asleep", y="Total Minutes in Bed") + geom_vline(xintercept = 419.5, color = "red", linetype = "dashed") + annotate("text", label = "7 Hours", x = 200, y = 800, color = "black", size = 3)

Here I have separated Total Minutes Asleep vs. Total Time in Bed by the number of sleep records logged in a day. The red dashed line marks the average time participants are asleep, 419.5 minutes or roughly 7 hours. Participants who logged one sleep record have more points to the left of the average, participants with two sleep records have more points to the right of the red line, and participants who logged three sleep records slept more than the average.
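The comparison against the red line can also be expressed as the share of nights below the average within each group. The sketch below uses the six sleep rows visible in the head() output above. They all belong to one user, so the result only illustrates the calculation and is not a finding.

```r
# The six sleep rows shown in the head() output of the merged data
nights <- data.frame(
  TotalSleepRecords  = c(1, 2, 1, 2, 1, 1),
  TotalMinutesAsleep = c(327, 384, 412, 340, 700, 304)
)

avg_asleep <- 419.5  # mean of TotalMinutesAsleep from the summary() output

# Share of nights below the average, grouped by number of sleep records
share_below <- tapply(nights$TotalMinutesAsleep < avg_asleep,
                      nights$TotalSleepRecords,
                      mean)
share_below
```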

#Used the hms library to extract just the time from the datetime
hourly_intensities$ActivityHour <- as_hms(hourly_intensities$ActivityHour)

#Grouped the data by ActivityHour and averaged TotalIntensity for plotting
filtered_data <- hourly_intensities %>% 
  group_by(ActivityHour) %>% 
  summarise(avg_total_intensity = mean(TotalIntensity))

#For plotting (geom_col() draws bars from precomputed values)
ggplot(data = filtered_data, aes(x=ActivityHour, y= avg_total_intensity)) + geom_col() + labs(y = "Average Total Intensity", x = "Daily Hour", title = "Time Most Active In a Day")

The graph above shows the average total intensity of activity across participants for each hour of the day. It reflects multiple days of observation. On average, participants are most active from 12pm to 2pm and from 5pm to 7pm.
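The peak hours can also be read off a sorted table rather than from the chart. This is a minimal sketch in base R with hypothetical hourly readings. The data frame `hourly` and its values are made up for illustration, and `aggregate()` stands in for the `group_by()` / `summarise()` step.

```r
# Hypothetical readings: three hours of the day observed on two days
hourly <- data.frame(
  hour = rep(c(8, 12, 18), times = 2),
  TotalIntensity = c(10, 20, 24, 14, 22, 20)
)

# Average intensity per hour of day
avg_by_hour <- aggregate(TotalIntensity ~ hour, data = hourly, FUN = mean)

# Sort so the most active hours come first
ranked <- avg_by_hour[order(-avg_by_hour$TotalIntensity), ]
ranked
```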

Conclusion

Through analyzing the FitBit Fitness Tracker Data we have made the following observations with the limited information we have.

The data ranges from April 12, 2016 to May 12, 2016. We have 33 participants who consented to the use of their data. Participants are mostly lightly active, averaging 192.8 lightly active minutes per day compared with about 21 very active minutes per day. Average sedentary time is 16.5 hours per day, and participants sleep on average once a day for about 7 hours. Participants are most active from 12pm to 2pm and from 5pm to 7pm.

If we had more information about age, weight, and height, a more detailed analysis could be constructed. Without it, we will treat women in general as the target audience.

Suggestions for Bellabeat

There are a lot of smart devices in today’s market that document health data for the betterment of their users. If Bellabeat wants to be a contender with the big players, they need to target areas in this data that apply to their own users. For example, we know that the users are most active from 12pm to 2pm and from 5pm to 7pm. Bellabeat could push timed notifications or even create a personalized program that fits the user’s schedule and reminds them to be more active.

The CDC recommends at least 150 minutes of moderate-intensity aerobic activity per week, plus muscle-strengthening activities on two days a week. Since Bellabeat’s mission is to empower women, they could create workout programs that appeal to this audience. Having more exercise options may entice more women to reduce the time they spend sitting and to promote their health and well-being. Workouts, stretching, cardio, yoga, and meditation are examples of features that could be implemented in their app.

Articles on healthy food alternatives and recipes can be a great feature for Bellabeat’s target audience. Articles about women’s health can also help empower women. Topics such as breast cancer awareness, mental health, stress, and daily exercise can keep women informed.

Smart devices also need to look appealing and fit today’s fashion. I would not like wearing a huge device that seems out of place on my body to document data. Designing the smart device to match people’s clothing is a great way to encourage consistent use of the program. The more time a person has the device on them, the higher the chance that they will be notified about their health and progress.