I am a junior data analyst working on the marketing analyst team at Bellabeat, a high-tech manufacturer of health-focused products for women. Bellabeat is a successful small company, but they have the potential to become a larger player in the global smart device market. Urška Sršen, co-founder and Chief Creative Officer of Bellabeat, believes that analysing smart device fitness data could help unlock new growth opportunities for the company. I have been asked to focus on one of Bellabeat’s products and analyse smart device data to gain insight into how consumers are using their smart devices. The insights that I discover will then help guide marketing strategy for the company. I will present my analysis to the Bellabeat executive team along with my high-level recommendations for Bellabeat’s marketing strategy.
Bellabeat App: The Bellabeat app provides users with health data related to their activity, sleep, stress, menstrual cycle, and mindfulness habits. This data can help users better understand their current habits and make healthy decisions. The Bellabeat app connects to their line of smart wellness products.
Leaf: Bellabeat’s classic wellness tracker can be worn as a bracelet, necklace, or clip. The Leaf tracker connects to the Bellabeat app to track activity, sleep, and stress.
Time: This wellness watch combines the timeless look of a classic timepiece with smart technology to track user activity, sleep, and stress. The Time watch connects to the Bellabeat app to provide you with insights into your daily wellness.
Spring: This is a water bottle that tracks daily water intake using smart technology to ensure that you are appropriately hydrated throughout the day. The Spring bottle connects to the Bellabeat app to track your hydration levels.
Bellabeat Membership: Bellabeat also offers a subscription-based membership program for users. Membership gives users 24/7 access to fully personalized guidance on nutrition, activity, sleep, health and beauty, and mindfulness based on their lifestyle and goals.
Urška Sršen asked me to analyse smart device usage data in order to gain insight into how consumers use non-Bellabeat smart devices. She then wanted me to select one Bellabeat product to apply these insights in my presentation.
1.1 Business Task
The business task is to analyse usage data from non-Bellabeat smart devices to gain insight into relevant consumer trends (successful and unsuccessful) within the global smart device market, and to work out how these trends apply to Bellabeat customers and can shape future Bellabeat marketing strategies. These insights are then applied to the Bellabeat App and to future products in order to maximise profits and growth for the company and to capitalise on Bellabeat’s rapidly growing consumer base in the smart device/tech-wellness space. The key stakeholders to consider in my data analysis and decisions are Urška Sršen, Sando Mur (Bellabeat’s other co-founder and a key member of the Bellabeat executive team), the Bellabeat executive team, the Bellabeat Marketing Analytics Team, and the Bellabeat investors.
Sršen encouraged me to use the following public data that explores smart device users’ daily habits: FitBit Fitness Tracker Data (CC0: Public Domain, dataset made available through Mobius).
2.1 Notes about the Data
The FitBit Fitness Tracker Data link above provides 18 data sets in total, each stored as a separate .csv file. This analysis will focus on only three of them: the daily activity data set (‘activity_daily’), which contains merged data from other provided files such as daily calories, daily intensities, and daily steps; the weight data set (‘weight’); and the daily sleep data set (‘sleep_daily’). These files contain data that are also tracked by Bellabeat products, so they should provide the most relevant and useful insights for the business task at hand.
2.2 Issues with the Data Credibility
I will be using the ROCCC criteria (Reliable, Original, Comprehensive, Current, Cited) to determine if there are any issues with bias or credibility in this data.
Overall, the conclusions made from this analysis should be taken with a grain of salt, as the data integrity and credibility are lacking. However, the general insights could still prove to be useful by highlighting possible shortcomings in FitBit’s data tracking (by FitBit itself or by the individuals), which Bellabeat can then improve on through innovative features exclusive to the Bellabeat product line.
2.3 Loading Packages
Below are the packages that I will/might be using in this case study:
library(tidyverse)
library(lubridate)
library(ggplot2)
library(readr)
library(tidyr)
library(dplyr)
library(skimr)
library(janitor)
library(scales)
2.4 Importing Data Sets
activity_daily <- read_csv("dailyActivity_merged.csv")
weight <- read_csv("weightLogInfo_merged.csv")
sleep_daily <- read_csv("sleepDay_merged.csv")
3. Process
Now that I have finished preparing the data, I will begin the processing step. Here, I will verify the data, then clean and transform it for analysis.
3.1 Verifying Data
The next step is to verify that the data sets were imported properly and to check for any obvious errors.
head(activity_daily)
## # A tibble: 6 × 15
## Id ActivityDate TotalSteps TotalDistance TrackerDistance LoggedActivitie…
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 1.50e9 4/12/2016 13162 8.5 8.5 0
## 2 1.50e9 4/13/2016 10735 6.97 6.97 0
## 3 1.50e9 4/14/2016 10460 6.74 6.74 0
## 4 1.50e9 4/15/2016 9762 6.28 6.28 0
## 5 1.50e9 4/16/2016 12669 8.16 8.16 0
## 6 1.50e9 4/17/2016 9705 6.48 6.48 0
## # … with 9 more variables: VeryActiveDistance <dbl>,
## # ModeratelyActiveDistance <dbl>, LightActiveDistance <dbl>,
## # SedentaryActiveDistance <dbl>, VeryActiveMinutes <dbl>,
## # FairlyActiveMinutes <dbl>, LightlyActiveMinutes <dbl>,
## # SedentaryMinutes <dbl>, Calories <dbl>
head(weight)
## # A tibble: 6 × 8
## Id Date WeightKg WeightPounds Fat BMI IsManualReport LogId
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <lgl> <dbl>
## 1 1503960366 5/2/2016 … 52.6 116. 22 22.6 TRUE 1.46e12
## 2 1503960366 5/3/2016 … 52.6 116. NA 22.6 TRUE 1.46e12
## 3 1927972279 4/13/2016… 134. 294. NA 47.5 FALSE 1.46e12
## 4 2873212765 4/21/2016… 56.7 125. NA 21.5 TRUE 1.46e12
## 5 2873212765 5/12/2016… 57.3 126. NA 21.7 TRUE 1.46e12
## 6 4319703577 4/17/2016… 72.4 160. 25 27.5 TRUE 1.46e12
head(sleep_daily)
## # A tibble: 6 × 5
## Id SleepDay TotalSleepRecor… TotalMinutesAsl… TotalTimeInBed
## <dbl> <chr> <dbl> <dbl> <dbl>
## 1 1503960366 4/12/2016 12:00:0… 1 327 346
## 2 1503960366 4/13/2016 12:00:0… 2 384 407
## 3 1503960366 4/15/2016 12:00:0… 1 412 442
## 4 1503960366 4/16/2016 12:00:0… 2 340 367
## 5 1503960366 4/17/2016 12:00:0… 1 700 712
## 6 1503960366 4/19/2016 12:00:0… 1 304 320
colnames(activity_daily)
## [1] "Id" "ActivityDate"
## [3] "TotalSteps" "TotalDistance"
## [5] "TrackerDistance" "LoggedActivitiesDistance"
## [7] "VeryActiveDistance" "ModeratelyActiveDistance"
## [9] "LightActiveDistance" "SedentaryActiveDistance"
## [11] "VeryActiveMinutes" "FairlyActiveMinutes"
## [13] "LightlyActiveMinutes" "SedentaryMinutes"
## [15] "Calories"
colnames(weight)
## [1] "Id" "Date" "WeightKg" "WeightPounds"
## [5] "Fat" "BMI" "IsManualReport" "LogId"
colnames(sleep_daily)
## [1] "Id" "SleepDay" "TotalSleepRecords"
## [4] "TotalMinutesAsleep" "TotalTimeInBed"
view(activity_daily)
view(weight)
view(sleep_daily)
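The head() output above prints Id as 1.50e9 because the column was read as a double. A minimal sketch, assuming the IDs are only ever used as labels and never as numbers, converts them to character after import so they print in full. The tibble `activity_toy` and the second row's step count are hypothetical stand-ins, not part of the original data.

```r
library(dplyr)

# Hypothetical stand-in for activity_daily (only two columns, two rows).
activity_toy <- tibble(
  Id = c(1503960366, 1927972279),
  TotalSteps = c(13162, 678)
)

# Store the participant ID as text so it prints in full and is never
# summarised as if it were a measurement.
activity_chr <- activity_toy %>%
  mutate(Id = as.character(Id))

activity_chr
```

With character IDs, a later skim would list Id under the character variables instead of reporting a mean and quartiles for it.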
The first thing I noticed is that the data set only contains one month’s worth of data rather than two months, which raises further questions about the credibility and comprehensiveness of the data set.
The second thing I noticed is how inconsistently the data was logged/tracked. Not everyone logged/tracked their data each day. Some days show zero steps, presumably because the person forgot to wear their FitBit; these rows would skew any analysis, so I will remove them from the data set. Some people did not record their sleep or their weight at all, and some did not participate for the whole month. This will make a complete, in-depth analysis a lot more difficult to conduct than originally thought.
activity_daily_new <- activity_daily %>%
  filter(TotalSteps != 0)
view(activity_daily_new)
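Before dropping rows, it can help to see how many zero-step days there are and who they belong to. The following is only a sketch on made-up rows: `activity_toy`, its letter IDs, and the `zero_step_days` column name are illustrative assumptions.

```r
library(dplyr)

# Hypothetical rows shaped like activity_daily (only the columns needed here).
activity_toy <- tibble(
  Id = c("A", "A", "B", "B"),
  TotalSteps = c(13162, 0, 0, 9705)
)

# Number of zero-step days per participant.
zero_days <- activity_toy %>%
  filter(TotalSteps == 0) %>%
  count(Id, name = "zero_step_days")

# Share of all rows that the zero-step filter would remove.
share_dropped <- mean(activity_toy$TotalSteps == 0)

zero_days
share_dropped
```

If most zero-step days came from a handful of participants, that would point to a few people not wearing the device rather than a general pattern.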
Removing the zero-step rows will definitely help with the analysis! However, some very low step counts are still present, as are some very low values for calories burnt. I will keep these in the data set for analysis, because perhaps those individuals were bedridden all day or fasted all day. There is a lot of uncertainty here, and nothing can be said for certain without more detailed information. But, again, this calls the overall integrity and credibility of the data set into question.
The third thing I noticed is that the sleep data set and weight data set both store the date and time in one column. It is best to separate this into “Date” and “Time” columns in case I decide to use the date to analyse the data across the three files. However, whilst viewing the data sets, I noticed a large discrepancy in the number of unique IDs present, as well as inconsistencies in how often individuals logged/tracked their weight and sleep.
# Split "Date" at each space; the third piece (AM/PM) is discarded, hence the warning below
weight_new <- weight %>%
  separate(Date, c("Date", "Time"), " ")
## Warning: Expected 2 pieces. Additional pieces discarded in 67 rows [1, 2, 3, 4,
## 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].
sleep_daily_new <- sleep_daily %>%
  separate(SleepDay, c("Date", "Time"), " ")
## Warning: Expected 2 pieces. Additional pieces discarded in 413 rows [1, 2, 3, 4,
## 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].
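The warnings above appear because each timestamp has three space-separated pieces (date, clock time, AM/PM) while only two column names were given, so the AM/PM marker is dropped. A minimal sketch of one way to keep it, assuming that is wanted, uses the `extra = "merge"` argument of `separate()`. The rows in `weight_toy` are hypothetical; the clock times in particular are invented, since the real ones are truncated in the output above.

```r
library(tidyr)

# Hypothetical rows in the same "date time AM/PM" layout as the weight file.
weight_toy <- data.frame(
  Id = c(1503960366, 1927972279),
  Date = c("5/2/2016 11:59:59 PM", "4/13/2016 1:08:52 AM"),
  stringsAsFactors = FALSE
)

# extra = "merge" splits only at the first space, so everything after the
# date (clock time plus AM/PM) stays together in the Time column.
weight_split <- separate(
  weight_toy, Date,
  into = c("Date", "Time"),
  sep = " ",
  extra = "merge"
)

weight_split
```

For this analysis only the date part is used later, so dropping AM/PM does not change the results; the variant matters only if the time of day is analysed.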
n_distinct(activity_daily_new$Id)
## [1] 33
n_distinct(weight_new$Id)
## [1] 8
n_distinct(sleep_daily_new$Id)
## [1] 24
Interestingly, not everyone involved in this survey provided tracking data for each data set. Only eight people entered their weight (and only two of them logged/tracked most of that data), and only 24 people entered their sleep data. Also, there are 33 people recorded in the daily activity data, even though the data citation gives a sample size of 30 people. This calls the credibility of the data further into question, as it looks less comprehensive, less reliable, and dirtier the more I look into it. I might not be able to cross-analyse the data as much as I would like because so much of it is incomplete or inconsistently tracked. Given the incompleteness of those data sets, I may just focus on the ‘activity_daily’ data set for a more focused analysis and then include some general recommendations for improving the consistency of weight and sleep logging/tracking.
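The observation that a few people account for most of the weight logs can be checked by counting rows per participant, and the ID mismatch between files can be listed with a set difference. This is a sketch on invented data: `weight_toy`, `activity_ids`, the letter IDs, and the weights are all hypothetical.

```r
library(dplyr)

# Hypothetical weight logs: participant "A" logs far more often than the others.
weight_toy <- tibble(
  Id = c("A", "A", "A", "B", "C"),
  WeightKg = c(52.6, 52.6, 52.4, 72.4, 85.1)
)

# Number of weight entries per participant, most frequent first.
logs_per_id <- weight_toy %>%
  count(Id, name = "n_logs", sort = TRUE)

# Participants who appear in the activity data but never logged a weight.
activity_ids <- c("A", "B", "C", "D")
no_weight <- setdiff(activity_ids, weight_toy$Id)

logs_per_id
no_weight
```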
The fourth thing I noticed is that there could be some duplicated rows in some of the data sets. I will confirm this and delete the duplicated rows for cleaner data.
nrow(activity_daily_new)
## [1] 863
nrow(weight_new)
## [1] 67
nrow(sleep_daily_new)
## [1] 413
nrow(unique(activity_daily_new))
## [1] 863
nrow(unique(weight_new))
## [1] 67
nrow(unique(sleep_daily_new))
## [1] 410
Good thing I checked: the daily sleep data set has 413 rows but only 410 unique ones, so three rows are duplicates. I will clean it up by creating a new data set with just the unique rows.
sleep_daily_new_v2 <- unique(sleep_daily_new)
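To see which rows are duplicated before removing them, base R's `duplicated()` can be used alongside `unique()`. A small sketch on made-up rows (`sleep_toy` and its letter IDs are hypothetical):

```r
# Hypothetical sleep rows; the second row is an exact copy of the first.
sleep_toy <- data.frame(
  Id = c("A", "A", "B"),
  Date = c("4/12/2016", "4/12/2016", "4/12/2016"),
  TotalMinutesAsleep = c(327, 327, 384),
  stringsAsFactors = FALSE
)

# duplicated() flags every repeat of an earlier row (not the first occurrence).
dup_rows <- sleep_toy[duplicated(sleep_toy), ]

# unique() keeps the first occurrence of each row, as in the step above.
sleep_unique <- unique(sleep_toy)

dup_rows
nrow(sleep_toy) - nrow(sleep_unique)
```

Inspecting `dup_rows` shows whether the duplicates are exact re-exports of the same record, which is the only case where dropping them is clearly safe.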
Now I am ready to analyse the data.
At this step, I will identify trends and relationships that I find. Any surprises about the data will also be mentioned. Hopefully I can discover valuable insights from my analysis that can help answer my business question.
First, I am going to take a look at a detailed summary of each data set.
skim_without_charts(activity_daily_new)
| Name | activity_daily_new |
| Number of rows | 863 |
| Number of columns | 15 |
| _______________________ | |
| Column type frequency: | |
| character | 1 |
| numeric | 14 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| ActivityDate | 0 | 1 | 8 | 9 | 0 | 31 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 |
|---|---|---|---|---|---|---|---|---|---|
| Id | 0 | 1 | 4.857542e+09 | 2.418405e+09 | 1503960366 | 2.320127e+09 | 4.445115e+09 | 6.962181e+09 | 8.877689e+09 |
| TotalSteps | 0 | 1 | 8319.39 | 4744.97 | 4 | 4923 | 8053 | 11092.5 | 36019 |
| TotalDistance | 0 | 1 | 5.98 | 3.72 | 0 | 3.37 | 5.59 | 7.90 | 28.03 |
| TrackerDistance | 0 | 1 | 5.96 | 3.70 | 0 | 3.37 | 5.59 | 7.88 | 28.03 |
| LoggedActivitiesDistance | 0 | 1 | 0.12 | 0.65 | 0 | 0 | 0 | 0 | 4.94 |
| VeryActiveDistance | 0 | 1 | 1.64 | 2.74 | 0 | 0 | 0.41 | 2.27 | 21.92 |
| ModeratelyActiveDistance | 0 | 1 | 0.62 | 0.91 | 0 | 0 | 0.31 | 0.87 | 6.48 |
| LightActiveDistance | 0 | 1 | 3.64 | 1.86 | 0 | 2.34 | 3.58 | 4.89 | 10.71 |
| SedentaryActiveDistance | 0 | 1 | 0.00 | 0.01 | 0 | 0 | 0 | 0 | 0.11 |
| VeryActiveMinutes | 0 | 1 | 23.02 | 33.65 | 0 | 0 | 7 | 35 | 210 |
| FairlyActiveMinutes | 0 | 1 | 14.78 | 20.43 | 0 | 0 | 8 | 21 | 143 |
| LightlyActiveMinutes | 0 | 1 | 210.02 | 96.78 | 0 | 146.5 | 208 | 272 | 518 |
| SedentaryMinutes | 0 | 1 | 955.75 | 280.29 | 0 | 721.5 | 1021 | 1189 | 1440 |
| Calories | 0 | 1 | 2361.30 | 702.71 | 52 | 1855.5 | 2220 | 2832 | 4900 |
skim_without_charts(weight_new)
| Name | weight_new |
| Number of rows | 67 |
| Number of columns | 9 |
| _______________________ | |
| Column type frequency: | |
| character | 2 |
| logical | 1 |
| numeric | 6 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| Date | 0 | 1 | 8 | 9 | 0 | 31 | 0 |
| Time | 0 | 1 | 7 | 8 | 0 | 26 | 0 |
Variable type: logical
| skim_variable | n_missing | complete_rate | mean | count |
|---|---|---|---|---|
| IsManualReport | 0 | 1 | 0.61 | TRU: 41, FAL: 26 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 |
|---|---|---|---|---|---|---|---|---|---|
| Id | 0 | 1.00 | 7.009282e+09 | 1.950322e+09 | 1.503960e+09 | 6.962181e+09 | 6.962181e+09 | 8.877689e+09 | 8.877689e+09 |
| WeightKg | 0 | 1.00 | 72.04 | 13.92 | 52.60 | 61.40 | 62.50 | 85.05 | 133.50 |
| WeightPounds | 0 | 1.00 | 158.81 | 30.70 | 115.96 | 135.36 | 137.79 | 187.50 | 294.32 |
| Fat | 65 | 0.03 | 23.50 | 2.12 | 22.00 | 22.75 | 23.50 | 24.25 | 25.00 |
| BMI | 0 | 1.00 | 25.19 | 3.07 | 21.45 | 23.96 | 24.39 | 25.56 | 47.54 |
| LogId | 0 | 1.00 | 1.461772e+12 | 7.829948e+08 | 1.460444e+12 | 1.461079e+12 | 1.461802e+12 | 1.462375e+12 | 1.463098e+12 |
skim_without_charts(sleep_daily_new_v2)
| Name | sleep_daily_new_v2 |
| Number of rows | 410 |
| Number of columns | 6 |
| _______________________ | |
| Column type frequency: | |
| character | 2 |
| numeric | 4 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| Date | 0 | 1 | 8 | 9 | 0 | 31 | 0 |
| Time | 0 | 1 | 8 | 8 | 0 | 1 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 |
|---|---|---|---|---|---|---|---|---|---|
| Id | 0 | 1 | 4.994963e+09 | 2.060863e+09 | 1503960366 | 3.977334e+09 | 4702921684.0 | 6962181067 | 8792009665 |
| TotalSleepRecords | 0 | 1 | 1.12 | 0.35 | 1 | 1 | 1 | 1 | 3 |
| TotalMinutesAsleep | 0 | 1 | 419.17 | 118.64 | 58 | 361 | 432.5 | 490 | 796 |
| TotalTimeInBed | 0 | 1 | 458.48 | 127.46 | 61 | 403.75 | 463 | 526 | 961 |
This overview is a good way to confirm that all of the necessary cleaning was done and to spot any immediate issues whilst skimming. It looks good so far, but I would like to condense each file into only the columns that I want to use for my more focused analysis.
activity_daily_final <- activity_daily_new %>%
  select(Id, ActivityDate, TotalSteps, VeryActiveMinutes, FairlyActiveMinutes,
         LightlyActiveMinutes, SedentaryMinutes, Calories) %>%
  rename(Date = ActivityDate)
weight_final <- weight_new %>%
  select(Id, Date, BMI, WeightPounds, IsManualReport)
sleep_daily_final <- sleep_daily_new_v2 %>%
  select(Id, Date, TotalMinutesAsleep, TotalTimeInBed)
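Because the three condensed tables now share ‘Id’ and ‘Date’ columns, they could be combined for cross-analysis if that turns out to be worthwhile. The following is a sketch only, on invented rows; `activity_toy`, `sleep_toy`, the letter IDs, and participant "B"'s values are hypothetical. It assumes the ‘Date’ strings are written identically in both tables.

```r
library(dplyr)

# Hypothetical rows shaped like activity_daily_final and sleep_daily_final.
activity_toy <- tibble(
  Id = c("A", "A", "B"),
  Date = c("4/12/2016", "4/13/2016", "4/12/2016"),
  TotalSteps = c(13162, 10735, 678)
)
sleep_toy <- tibble(
  Id = c("A", "B"),
  Date = c("4/12/2016", "4/12/2016"),
  TotalMinutesAsleep = c(327, 750)
)

# Keep only the participant-days that appear in both tables.
activity_sleep <- inner_join(activity_toy, sleep_toy, by = c("Id", "Date"))

activity_sleep
```

An inner join silently drops days without a sleep record, so comparing `nrow(activity_sleep)` with `nrow(activity_toy)` shows how much data the combined view loses.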
A side note: I would like to convert the ‘Date’ column of ‘activity_daily_final’ into ‘WeekDays’ to see on which days of the week people are more consistent with logging/tracking their data. Unfortunately, I keep receiving ‘NA’ values and ‘parsing errors’ when trying ‘lubridate’ functions and other functions, so I will have to examine this issue further another time. From my own experience, though, people tend to become more lackadaisical with logging/tracking data during the weekend and close to it - be it the Friday before or the Monday after.
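One possible cause of the ‘NA’ values, offered as an assumption because the failing call is not shown, is the date layout: the ‘Date’ strings are written month/day/year (for example 4/12/2016), which parsers expecting a year-first format such as `ymd()` cannot read. A minimal sketch using `mdy()` and `wday()` from ‘lubridate’ on hypothetical rows (`activity_toy` and its letter ID are illustrative only):

```r
library(dplyr)
library(lubridate)

# Hypothetical rows shaped like activity_daily_final.
activity_toy <- tibble(
  Id = c("A", "A"),
  Date = c("4/12/2016", "4/16/2016"),
  TotalSteps = c(13162, 12669)
)

activity_weekday <- activity_toy %>%
  mutate(
    Date = mdy(Date),                                  # month/day/year text -> Date
    WeekDay = wday(Date, label = TRUE, abbr = FALSE),  # weekday name (locale-dependent)
    WeekDayNum = wday(Date)                            # 1 = Sunday ... 7 = Saturday
  )

activity_weekday
```

Grouping by `WeekDay` and counting rows would then show on which days logging drops off.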
Now, I want to take a look at a more specific summary of the values by using the ‘summary’ function.
summary(activity_daily_final)
summary(weight_final)
summary(sleep_daily_final)
4.1 Trends
4.2 Conclusions from Trends
4.3 Questions to Answer
5.1 Revisiting Business Task
The business task is to analyse usage data from non-Bellabeat smart devices to gain insight into relevant consumer trends (successful and unsuccessful) within the global smart device market, and to work out how these trends apply to Bellabeat customers and can shape future Bellabeat marketing strategies. These insights are then applied to the Bellabeat App and to future products in order to maximise profits and growth for the company and to capitalise on Bellabeat’s rapidly growing consumer base in the smart device/tech-wellness space.
5.2 Trends Identified
5.3 Recommendations
5.4 Future Works
If this data set were to be collected again, I would like to see the following parameters met in order to create a flawless and in-depth analysis of this type of data: