Urška Sršen and Sando Mur founded Bellabeat, a high-tech company that manufactures health-focused smart products that inform and inspire women around the world. Collecting data on activity, sleep, stress, and reproductive health has allowed Bellabeat to empower women with knowledge about their own health and habits.
Since its founding in 2013, Bellabeat has grown rapidly and quickly positioned itself as a tech-driven wellness company for women. By 2016, it had opened offices around the world and launched multiple products, which became available through a growing number of online retailers in addition to the company's own e-commerce channel on its website. Bellabeat has invested in traditional advertising media such as radio, out-of-home billboards, print, and television, but focuses heavily on digital marketing: it invests year-round in Google Search, maintains active Facebook and Instagram pages, and consistently engages consumers on Twitter. Additionally, Bellabeat runs video ads on YouTube and display ads on the Google Display Network to support campaigns around key marketing dates.
Sršen knows that an analysis of Bellabeat’s available consumer data would reveal more opportunities for growth. She has asked the marketing analytics team to focus on a Bellabeat product and analyze smart device usage data in order to gain insight into how people are already using their smart devices. Then, using this information, she would like high-level recommendations for how these trends can inform Bellabeat marketing strategy.
The Bellabeat app provides users with health data related to their activity, sleep, stress, menstrual cycle, and mindfulness habits. This data can help users better understand their current habits and make healthy decisions. The Bellabeat app connects to the company's line of smart wellness products.
Bellabeat’s classic wellness tracker can be worn as a bracelet, necklace, or clip. The Leaf tracker connects to the Bellabeat app to track activity, sleep, and stress.
This wellness watch combines the timeless look of a classic timepiece with smart technology to track user activity, sleep, and stress. The Time watch connects to the Bellabeat app to provide you with insights into your daily wellness.
This is a water bottle that tracks daily water intake using smart technology to ensure that you are appropriately hydrated throughout the day. The Spring bottle connects to the Bellabeat app to track your hydration levels.
Bellabeat also offers a subscription-based membership program for users. Membership gives users 24/7 access to fully personalized guidance on nutrition, activity, sleep, health and beauty, and mindfulness based on their lifestyle and goals.
Sršen asks you to analyze smart device usage data in order to gain insight into how consumers use non-Bellabeat smart devices. She then wants you to select one Bellabeat product to apply these insights to in your presentation. These questions will guide your analysis.
The key stakeholders at Bellabeat are its founders, Urška Sršen and Sando Mur. As the company's founders, they will not be able to look deeply into the analysis itself.
Analytics team - this team can assist with common questions I may have about a project's validity, as they can look deeper into my analysis than my stakeholders can.
‘FitBit Fitness Tracker Data (CC0: Public Domain, dataset made available through Mobius): This Kaggle data set contains personal fitness tracker from thirty fitbit users. Thirty eligible Fitbit users consented to the submission of personal tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring. It includes information about daily activity, steps, and heart rate that can be used to explore users’ habits.’
I’ll be using a Kaggle dataset for this case study. The data is stored in wide format: each row represents one user-day, and each metric has its own column.
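As a quick, hypothetical illustration of what wide format means for the dailyActivity_merged.csv file used later (each row is one user-day and each metric has its own column), tidyr's pivot_longer() could reshape it into long format; this sketch is for illustration only and is not part of the analysis.

library(tidyverse)

# Illustration only: reshape the wide daily activity table into long format,
# where each metric/value pair becomes its own row.
daily_activity <- read_csv('dailyActivity_merged.csv')

daily_activity_long <- pivot_longer(daily_activity,
  cols = c(TotalSteps, TotalDistance, SedentaryMinutes, Calories),
  names_to = "metric",
  values_to = "value")

head(daily_activity_long)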
I evaluated the data against the ROCCC criteria: Reliable, Original, Comprehensive, Current, and Cited.
This data is relatively reliable. Credibility and bias concerns are limited, since the sample meets the commonly cited minimum of 30 participants at which the CLT (central limit theorem) is usually invoked. Of course, as the sample size increases, confidence improves and the margin of error shrinks. I am encouraged to look for more data sources, but that is optional here since this dataset meets the minimum requirement, albeit with a larger margin of error than a bigger sample would give.
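As a rough, hedged sketch of that trade-off (it assumes the users behave like a simple random sample and uses mean daily steps as the example metric; it is not part of the required analysis):

library(tidyverse)

# Rough illustration: approximate 95% margin of error for mean daily steps,
# and how it would shrink if the sample of users were larger.
daily_activity <- read_csv('dailyActivity_merged.csv')

per_user <- summarise(group_by(daily_activity, Id),
                      mean_steps = mean(TotalSteps), .groups = "drop")

s <- sd(per_user$mean_steps)   # spread of per-user average daily steps
n <- nrow(per_user)            # number of distinct users (roughly thirty)

1.96 * s / sqrt(n)       # margin of error at the current sample size
1.96 * s / sqrt(4 * n)   # quadrupling the sample would roughly halve it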
This data isn't as original as it could be; it has already been studied by many other junior analysts, so it's unlikely that I'll discover something 'unheard of' in the data I'll be using.
This data is fairly comprehensive: it is well organized and includes many different CSV files available for use.
This data isn't current; it was collected in the spring of 2016, which makes it around seven years old.
This data is cited and available for public use; it is licensed CC0: Public Domain and was made available through Mobius. In the event I discover PII (personally identifiable information), I will remove it from the data frames or tables I use, document my changes, and mask the affected values, for example by blurring or hashing them; a hypothetical sketch of that masking step is shown below. I verified the data's integrity by working through the ROCCC criteria above.
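There is no obvious PII in this dataset beyond the numeric Id column, but here is a hypothetical sketch of that masking step (the replacement scheme is my own choice, not something prescribed by the dataset):

library(tidyverse)

# Hypothetical sketch: replace real user Ids with anonymous sequential codes
# before sharing any output; the Id-to-code mapping would not be published.
daily_activity <- read_csv('dailyActivity_merged.csv')

daily_activity_masked <- mutate(daily_activity,
  Id = paste0("user_", as.integer(factor(Id))))

head(daily_activity_masked$Id)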
In this case study, I'll mainly use R to sort, filter, and clean my data. I know the basics of SQL, but I do not have access to premium features with my free BigQuery account.
First, I'd like to check the structure of the data I'll be using. To do this, I need to load the relevant packages and run a few lines of code.
library(tidyverse)  # also attaches dplyr, readr, ggplot2, and friends

# import the daily activity data
daily_activity <- read_csv('dailyActivity_merged.csv')

# cleaned data: drop duplicate rows and rows containing missing values
daily_activity_cleaned <- na.omit(distinct(daily_activity))

# inspect the structure of the cleaned data
str(daily_activity_cleaned)
glimpse(daily_activity_cleaned)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.2 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
## Rows: 940 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): ActivityDate
## dbl (14): Id, TotalSteps, TotalDistance, TrackerDistance, LoggedActivitiesDi...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## tibble [940 × 15] (S3: tbl_df/tbl/data.frame)
## $ Id : num [1:940] 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
## $ ActivityDate : chr [1:940] "4/12/2016" "4/13/2016" "4/14/2016" "4/15/2016" ...
## $ TotalSteps : num [1:940] 13162 10735 10460 9762 12669 ...
## $ TotalDistance : num [1:940] 8.5 6.97 6.74 6.28 8.16 ...
## $ TrackerDistance : num [1:940] 8.5 6.97 6.74 6.28 8.16 ...
## $ LoggedActivitiesDistance: num [1:940] 0 0 0 0 0 0 0 0 0 0 ...
## $ VeryActiveDistance : num [1:940] 1.88 1.57 2.44 2.14 2.71 ...
## $ ModeratelyActiveDistance: num [1:940] 0.55 0.69 0.4 1.26 0.41 ...
## $ LightActiveDistance : num [1:940] 6.06 4.71 3.91 2.83 5.04 ...
## $ SedentaryActiveDistance : num [1:940] 0 0 0 0 0 0 0 0 0 0 ...
## $ VeryActiveMinutes : num [1:940] 25 21 30 29 36 38 42 50 28 19 ...
## $ FairlyActiveMinutes : num [1:940] 13 19 11 34 10 20 16 31 12 8 ...
## $ LightlyActiveMinutes : num [1:940] 328 217 181 209 221 164 233 264 205 211 ...
## $ SedentaryMinutes : num [1:940] 728 776 1218 726 773 ...
## $ Calories : num [1:940] 1985 1797 1776 1745 1863 ...
## Rows: 940
## Columns: 15
## $ Id <dbl> 1503960366, 1503960366, 1503960366, 150396036…
## $ ActivityDate <chr> "4/12/2016", "4/13/2016", "4/14/2016", "4/15/…
## $ TotalSteps <dbl> 13162, 10735, 10460, 9762, 12669, 9705, 13019…
## $ TotalDistance <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59, 9.8…
## $ TrackerDistance <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59, 9.8…
## $ LoggedActivitiesDistance <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ VeryActiveDistance <dbl> 1.88, 1.57, 2.44, 2.14, 2.71, 3.19, 3.25, 3.5…
## $ ModeratelyActiveDistance <dbl> 0.55, 0.69, 0.40, 1.26, 0.41, 0.78, 0.64, 1.3…
## $ LightActiveDistance <dbl> 6.06, 4.71, 3.91, 2.83, 5.04, 2.51, 4.71, 5.0…
## $ SedentaryActiveDistance <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ VeryActiveMinutes <dbl> 25, 21, 30, 29, 36, 38, 42, 50, 28, 19, 66, 4…
## $ FairlyActiveMinutes <dbl> 13, 19, 11, 34, 10, 20, 16, 31, 12, 8, 27, 21…
## $ LightlyActiveMinutes <dbl> 328, 217, 181, 209, 221, 164, 233, 264, 205, …
## $ SedentaryMinutes <dbl> 728, 776, 1218, 726, 773, 539, 1149, 775, 818…
## $ Calories <dbl> 1985, 1797, 1776, 1745, 1863, 1728, 1921, 203…
ActivityDate is stored as a character (chr) column, which is incorrect; it should be a Date. Now that I know what is wrong with the data, I'll develop the code to clean it.
library(tidyverse)

# import the daily activity data
daily_activity <- read_csv('dailyActivity_merged.csv')

# cleaned data: drop duplicate rows and rows containing missing values
daily_activity_cleaned <- na.omit(distinct(daily_activity))

# convert ActivityDate from character to Date
daily_activity_cleaned$ActivityDate <- as.Date(
  daily_activity_cleaned$ActivityDate, format = "%m/%d/%Y")

# check the structure, then summarize each column (min, max, quartiles, mean)
str(daily_activity_cleaned)
knitr::kable(summary(daily_activity_cleaned))
## Rows: 940 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): ActivityDate
## dbl (14): Id, TotalSteps, TotalDistance, TrackerDistance, LoggedActivitiesDi...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## tibble [940 × 15] (S3: tbl_df/tbl/data.frame)
## $ Id : num [1:940] 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
## $ ActivityDate : Date[1:940], format: "2016-04-12" "2016-04-13" ...
## $ TotalSteps : num [1:940] 13162 10735 10460 9762 12669 ...
## $ TotalDistance : num [1:940] 8.5 6.97 6.74 6.28 8.16 ...
## $ TrackerDistance : num [1:940] 8.5 6.97 6.74 6.28 8.16 ...
## $ LoggedActivitiesDistance: num [1:940] 0 0 0 0 0 0 0 0 0 0 ...
## $ VeryActiveDistance : num [1:940] 1.88 1.57 2.44 2.14 2.71 ...
## $ ModeratelyActiveDistance: num [1:940] 0.55 0.69 0.4 1.26 0.41 ...
## $ LightActiveDistance : num [1:940] 6.06 4.71 3.91 2.83 5.04 ...
## $ SedentaryActiveDistance : num [1:940] 0 0 0 0 0 0 0 0 0 0 ...
## $ VeryActiveMinutes : num [1:940] 25 21 30 29 36 38 42 50 28 19 ...
## $ FairlyActiveMinutes : num [1:940] 13 19 11 34 10 20 16 31 12 8 ...
## $ LightlyActiveMinutes : num [1:940] 328 217 181 209 221 164 233 264 205 211 ...
## $ SedentaryMinutes : num [1:940] 728 776 1218 726 773 ...
## $ Calories : num [1:940] 1985 1797 1776 1745 1863 ...
| Id | ActivityDate | TotalSteps | TotalDistance | TrackerDistance | LoggedActivitiesDistance | VeryActiveDistance | ModeratelyActiveDistance | LightActiveDistance | SedentaryActiveDistance | VeryActiveMinutes | FairlyActiveMinutes | LightlyActiveMinutes | SedentaryMinutes | Calories | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Min. :1.504e+09 | Min. :2016-04-12 | Min. : 0 | Min. : 0.000 | Min. : 0.000 | Min. :0.0000 | Min. : 0.000 | Min. :0.0000 | Min. : 0.000 | Min. :0.000000 | Min. : 0.00 | Min. : 0.00 | Min. : 0.0 | Min. : 0.0 | Min. : 0 | |
| 1st Qu.:2.320e+09 | 1st Qu.:2016-04-19 | 1st Qu.: 3790 | 1st Qu.: 2.620 | 1st Qu.: 2.620 | 1st Qu.:0.0000 | 1st Qu.: 0.000 | 1st Qu.:0.0000 | 1st Qu.: 1.945 | 1st Qu.:0.000000 | 1st Qu.: 0.00 | 1st Qu.: 0.00 | 1st Qu.:127.0 | 1st Qu.: 729.8 | 1st Qu.:1828 | |
| Median :4.445e+09 | Median :2016-04-26 | Median : 7406 | Median : 5.245 | Median : 5.245 | Median :0.0000 | Median : 0.210 | Median :0.2400 | Median : 3.365 | Median :0.000000 | Median : 4.00 | Median : 6.00 | Median :199.0 | Median :1057.5 | Median :2134 | |
| Mean :4.855e+09 | Mean :2016-04-26 | Mean : 7638 | Mean : 5.490 | Mean : 5.475 | Mean :0.1082 | Mean : 1.503 | Mean :0.5675 | Mean : 3.341 | Mean :0.001606 | Mean : 21.16 | Mean : 13.56 | Mean :192.8 | Mean : 991.2 | Mean :2304 | |
| 3rd Qu.:6.962e+09 | 3rd Qu.:2016-05-04 | 3rd Qu.:10727 | 3rd Qu.: 7.713 | 3rd Qu.: 7.710 | 3rd Qu.:0.0000 | 3rd Qu.: 2.053 | 3rd Qu.:0.8000 | 3rd Qu.: 4.782 | 3rd Qu.:0.000000 | 3rd Qu.: 32.00 | 3rd Qu.: 19.00 | 3rd Qu.:264.0 | 3rd Qu.:1229.5 | 3rd Qu.:2793 | |
| Max. :8.878e+09 | Max. :2016-05-12 | Max. :36019 | Max. :28.030 | Max. :28.030 | Max. :4.9421 | Max. :21.920 | Max. :6.4800 | Max. :10.710 | Max. :0.110000 | Max. :210.00 | Max. :143.00 | Max. :518.0 | Max. :1440.0 | Max. :4900 |
After reviewing the summary, the data appears to fall within reasonable range constraints and looks valid, though I had to convert the dates from character to Date type so they would be treated as dates rather than text in the summary. A quick programmatic version of this range check is sketched below.
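As a small sketch of that range check done in code rather than by eyeballing the summary table (this chunk is my own addition and simply reuses the cleaning steps above; the date window comes from the summary):

library(tidyverse)

# rebuild the cleaned table so this check runs on its own
daily_activity <- read_csv('dailyActivity_merged.csv')
daily_activity_cleaned <- na.omit(distinct(daily_activity))
daily_activity_cleaned$ActivityDate <- as.Date(
  daily_activity_cleaned$ActivityDate, format = "%m/%d/%Y")

# count rows that violate basic constraints: a day has at most 1440 minutes,
# steps and distances cannot be negative, and dates should fall in the window
summarise(daily_activity_cleaned,
  over_1440_minutes  = sum(SedentaryMinutes + LightlyActiveMinutes +
                           FairlyActiveMinutes + VeryActiveMinutes > 1440),
  negative_steps     = sum(TotalSteps < 0),
  negative_distance  = sum(TotalDistance < 0),
  dates_out_of_range = sum(ActivityDate < as.Date("2016-04-12") |
                           ActivityDate > as.Date("2016-05-12")))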
I've chosen R as the main language for this project, since my free BigQuery account does not include premium features. In the following code chunk, I reuse the basic cleaning steps from the earlier process, with na.omit() to remove NA values. This ensures my data is clean, and I verify it by summarizing the data. These changes are all documented in my RMD file.
Code I developed
My finalized processing code
library(tidyverse)

# import the daily activity data
daily_activity <- read_csv('dailyActivity_merged.csv')

# cleaned data: drop duplicate rows and rows containing missing values
daily_activity_cleaned <- na.omit(distinct(daily_activity))

# convert ActivityDate from character to Date
daily_activity_cleaned$ActivityDate <- as.Date(
  daily_activity_cleaned$ActivityDate, format = "%m/%d/%Y")

# keep only the columns needed for the analysis, then print the result
daily_activity_processed <- select(daily_activity_cleaned,
  SedentaryMinutes, TotalSteps, Calories, TrackerDistance, TotalDistance)
daily_activity_processed
## Rows: 940 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): ActivityDate
## dbl (14): Id, TotalSteps, TotalDistance, TrackerDistance, LoggedActivitiesDi...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## # A tibble: 940 × 5
## SedentaryMinutes TotalSteps Calories TrackerDistance TotalDistance
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 728 13162 1985 8.5 8.5
## 2 776 10735 1797 6.97 6.97
## 3 1218 10460 1776 6.74 6.74
## 4 726 9762 1745 6.28 6.28
## 5 773 12669 1863 8.16 8.16
## 6 539 9705 1728 6.48 6.48
## 7 1149 13019 1921 8.59 8.59
## 8 775 15506 2035 9.88 9.88
## 9 818 10544 1786 6.68 6.68
## 10 838 9819 1775 6.34 6.34
## # ℹ 930 more rows
In this phase, I build on the processing code from the previous step. The cleaned variables are already set up, so I just need to add my graphs and plots. The data was properly formatted during the prepare/process phases.
I'll now generate some plots to analyze the data.
library(tidyverse)

# import the daily activity data
daily_activity <- read_csv('dailyActivity_merged.csv')

# cleaned data: drop duplicate rows and rows containing missing values
daily_activity_cleaned <- na.omit(distinct(daily_activity))

# convert ActivityDate from character to Date
daily_activity_cleaned$ActivityDate <- as.Date(
  daily_activity_cleaned$ActivityDate, format = "%m/%d/%Y")

# keep only the columns needed for the analysis, then print the result
daily_activity_processed <- select(daily_activity_cleaned,
  SedentaryMinutes, TotalSteps, Calories, TrackerDistance, TotalDistance)
daily_activity_processed

# Total Steps vs Sedentary Minutes: jitter plot with a smoothed trend line
ggplot(data = daily_activity_processed, aes(x = SedentaryMinutes, y = TotalSteps)) +
  geom_jitter(aes(color = SedentaryMinutes), size = 1) +
  geom_smooth(color = 'red', linetype = 'dashed', linewidth = 1) +
  scale_y_continuous(limits = c(0, 20000))

# Calories vs Sedentary Minutes
ggplot(data = daily_activity_processed, aes(x = SedentaryMinutes, y = Calories)) +
  geom_jitter(aes(color = SedentaryMinutes), size = 1) +
  geom_smooth(color = 'red', linetype = 'dashed', linewidth = 1)

# Total Steps vs Calories
ggplot(data = daily_activity_processed, aes(x = Calories, y = TotalSteps)) +
  geom_jitter(aes(color = Calories), size = 1) +
  geom_smooth(color = 'red', linetype = 'dashed', linewidth = 1)

# Total Distance vs Tracker Distance
ggplot(data = daily_activity_processed, aes(x = TrackerDistance, y = TotalDistance)) +
  geom_jitter(aes(color = TrackerDistance), size = 1) +
  geom_smooth(color = 'red', linetype = 'dashed', linewidth = 1)
## # A tibble: 940 × 5
## SedentaryMinutes TotalSteps Calories TrackerDistance TotalDistance
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 728 13162 1985 8.5 8.5
## 2 776 10735 1797 6.97 6.97
## 3 1218 10460 1776 6.74 6.74
## 4 726 9762 1745 6.28 6.28
## 5 773 12669 1863 8.16 8.16
## 6 539 9705 1728 6.48 6.48
## 7 1149 13019 1921 8.59 8.59
## 8 775 15506 2035 9.88 9.88
## 9 818 10544 1786 6.68 6.68
## 10 838 9819 1775 6.34 6.34
## # ℹ 930 more rows
Analyzing these graphs, I am surprised by the first two: there appears to be some correlation between TotalSteps and SedentaryMinutes and between Calories and SedentaryMinutes, and the two plots look almost identical. The lower two graphs are not surprising: as total steps increase, so do calories burned, and the same close relationship holds between TrackerDistance and TotalDistance. Looking closely, though, some records show a higher TrackerDistance than TotalDistance. A quick numerical check of these relationships is sketched below.
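To put rough numbers behind these visual impressions, here is a quick check I added using base R's cor() on the cleaned columns (a sketch, not part of the original chunks):

library(tidyverse)

# rebuild the cleaned table so this check runs on its own
daily_activity <- read_csv('dailyActivity_merged.csv')
daily_activity_cleaned <- na.omit(distinct(daily_activity))

# Pearson correlations for the pairs examined in the plots above
cor(daily_activity_cleaned$TotalSteps, daily_activity_cleaned$SedentaryMinutes)
cor(daily_activity_cleaned$Calories, daily_activity_cleaned$SedentaryMinutes)
cor(daily_activity_cleaned$TotalSteps, daily_activity_cleaned$Calories)
cor(daily_activity_cleaned$TrackerDistance, daily_activity_cleaned$TotalDistance)

# how many records show a higher TrackerDistance than TotalDistance?
sum(daily_activity_cleaned$TrackerDistance > daily_activity_cleaned$TotalDistance)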