By: Reilly McCarthy
Hello! Welcome to the capstone project I completed to earn my Google Data Analytics certificate. I chose to complete this case study in RStudio Desktop because R was the main new concept I learned in the course, and I wanted to embrace my curiosity and learn more about it through this project. At the beginning of this report I describe the scenario of the case study I was given. I then walk through my data analysis process, following the six steps taught in the course: Ask, Prepare, Process, Analyze, Share, and Act.
I am a junior data analyst working on the marketing analyst team at Bellabeat, a high-tech manufacturer of health-focused products for women. Bellabeat is a successful small company, but it has the potential to become a larger player in the global smart device market. Urška Sršen, cofounder and Chief Creative Officer of Bellabeat, believes that analyzing smart device fitness data could help unlock new growth opportunities for the company. I have been asked to focus on one of Bellabeat’s products and analyze smart device data to gain insight into how consumers are using their smart devices. The insights I discover will then help guide marketing strategy for the company. I will present my analysis to the Bellabeat executive team along with my high-level recommendations for Bellabeat’s marketing strategy for a product.
Urška Sršen asked me to analyze smart device usage data in order to gain insight into how consumers use non-Bellabeat smart devices. She then wants me to select one of Bellabeat’s products to apply these insights to in my presentation.
I am to perform a deep-dive analysis on a FitBit data set to gauge marketing opportunities for Bellabeat, a high-tech manufacturer of health-focused products for women. I will analyze smart device usage to discover consumer trends. Then I will draw conclusions about how these trends can create insights that increase the efficiency of Bellabeat’s marketing strategy. The key stakeholders in this project are Urška Sršen, Sando Mur, the Bellabeat marketing analytics team, and Bellabeat investors.
Sršen encouraged me to use public data that explores smart device users’ daily habits. She pointed me to this specific data set: https://www.kaggle.com/datasets/arashnic/fitbit
This data set was provided as a downloadable folder containing 18 .csv files. The data is stored in long format. During the preparation phase I focused on 5 specific .csv files and created the data frames Daily_Activity, Daily_Intensities, Daily_Sleep, Daily_Steps, and Weight. I chose these files because they seemed most likely to lead to insightful findings for Bellabeat products: a lot of the data in them is also tracked by the Bellabeat products. These findings could then inform high-level marketing strategies, which is the task of the overall analysis.
Before diving into any of my code, I will use the ROCCC framework (Reliable, Original, Comprehensive, Current, Cited) to assess the credibility of this data.
Reliable: This data is NOT reliable. The data set contains a sample of 30 individuals who consented to the study. A sample of 30 just meets the common rule of thumb for applying the Central Limit Theorem (CLT), but it is still very small relative to the entire population of FitBit customers. The data was also collected over only a 2-month span and was recorded more than 6 years ago. I believe health data should span a longer period to show reliable trends, since changes in health are gradual. The age of the data also raises questions about its relevance today.
Original: This data is NOT original. The data set is generated by respondents to a distributed survey via Amazon Mechanical Turk. This data is not Bellabeat’s original data. First-party data would have been better to use.
Comprehensive: This data is NOT comprehensive. A couple of items would help make this data set more comprehensive. One is a larger sample size: the current sample is too small, and a larger one would raise the confidence level of the analysis and lower the margin of error. If the data had been collected over a longer time span, it would also be more comprehensive. Finally, there is a chance of sampling bias due to the way the survey was conducted. Without more information to rule this out, it should be noted.
Current: This data is NOT current. The data set is from 2016, as I also noted in the Reliable section.
Cited: The data is cited, but I am still unsure of its credibility. The data set is stated to have been generated via a survey distributed through Amazon Mechanical Turk. I would need to look further into this collection method to establish credibility.
The overall integrity of this data is lacking, so any findings should be prefaced with that caveat. Insightful conclusions can still be drawn by identifying overall trends to investigate further with more relevant data sets. This analysis can also highlight data that would be useful to track but is currently lacking, which can help Bellabeat decide what data to begin tracking for future analyses.
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.4
## ✔ tibble 3.1.8 ✔ dplyr 1.0.9
## ✔ tidyr 1.2.0 ✔ stringr 1.4.0
## ✔ readr 2.1.2 ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(ggplot2)
library(readr)
library(tidyr)
library(dplyr)
library(skimr)
library(janitor)
##
## Attaching package: 'janitor'
##
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
Daily_Activity <- read_csv("dailyActivity_merged.csv")
## Rows: 940 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): ActivityDate
## dbl (14): Id, TotalSteps, TotalDistance, TrackerDistance, LoggedActivitiesDi...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Daily_Steps <- read_csv("dailySteps_merged.csv")
## Rows: 940 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): ActivityDay
## dbl (2): Id, StepTotal
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Daily_Sleep <- read_csv("sleepDay_merged.csv")
## Rows: 413 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): SleepDay
## dbl (4): Id, TotalSleepRecords, TotalMinutesAsleep, TotalTimeInBed
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Weight <- read_csv("weightLogInfo_merged.csv")
## Rows: 67 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Date
## dbl (6): Id, WeightKg, WeightPounds, Fat, BMI, LogId
## lgl (1): IsManualReport
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Daily_Intensities <- read_csv("dailyIntensities_merged.csv")
## Rows: 940 Columns: 10
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): ActivityDay
## dbl (9): Id, SedentaryMinutes, LightlyActiveMinutes, FairlyActiveMinutes, Ve...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(Daily_Activity)
## # A tibble: 6 × 15
## Id Activ…¹ Total…² Total…³ Track…⁴ Logge…⁵ VeryA…⁶ Moder…⁷ Light…⁸ Seden…⁹
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1.50e9 4/12/2… 13162 8.5 8.5 0 1.88 0.550 6.06 0
## 2 1.50e9 4/13/2… 10735 6.97 6.97 0 1.57 0.690 4.71 0
## 3 1.50e9 4/14/2… 10460 6.74 6.74 0 2.44 0.400 3.91 0
## 4 1.50e9 4/15/2… 9762 6.28 6.28 0 2.14 1.26 2.83 0
## 5 1.50e9 4/16/2… 12669 8.16 8.16 0 2.71 0.410 5.04 0
## 6 1.50e9 4/17/2… 9705 6.48 6.48 0 3.19 0.780 2.51 0
## # … with 5 more variables: VeryActiveMinutes <dbl>, FairlyActiveMinutes <dbl>,
## # LightlyActiveMinutes <dbl>, SedentaryMinutes <dbl>, Calories <dbl>, and
## # abbreviated variable names ¹ActivityDate, ²TotalSteps, ³TotalDistance,
## # ⁴TrackerDistance, ⁵LoggedActivitiesDistance, ⁶VeryActiveDistance,
## # ⁷ModeratelyActiveDistance, ⁸LightActiveDistance, ⁹SedentaryActiveDistance
## # ℹ Use `colnames()` to see all variable names
head(Daily_Intensities)
## # A tibble: 6 × 10
## Id Activ…¹ Seden…² Light…³ Fairl…⁴ VeryA…⁵ Seden…⁶ Light…⁷ Moder…⁸ VeryA…⁹
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1.50e9 4/12/2… 728 328 13 25 0 6.06 0.550 1.88
## 2 1.50e9 4/13/2… 776 217 19 21 0 4.71 0.690 1.57
## 3 1.50e9 4/14/2… 1218 181 11 30 0 3.91 0.400 2.44
## 4 1.50e9 4/15/2… 726 209 34 29 0 2.83 1.26 2.14
## 5 1.50e9 4/16/2… 773 221 10 36 0 5.04 0.410 2.71
## 6 1.50e9 4/17/2… 539 164 20 38 0 2.51 0.780 3.19
## # … with abbreviated variable names ¹ActivityDay, ²SedentaryMinutes,
## # ³LightlyActiveMinutes, ⁴FairlyActiveMinutes, ⁵VeryActiveMinutes,
## # ⁶SedentaryActiveDistance, ⁷LightActiveDistance, ⁸ModeratelyActiveDistance,
## # ⁹VeryActiveDistance
head(Daily_Sleep)
## # A tibble: 6 × 5
## Id SleepDay TotalSleepRecords TotalMinutesAsleep TotalT…¹
## <dbl> <chr> <dbl> <dbl> <dbl>
## 1 1503960366 4/12/2016 12:00:00 AM 1 327 346
## 2 1503960366 4/13/2016 12:00:00 AM 2 384 407
## 3 1503960366 4/15/2016 12:00:00 AM 1 412 442
## 4 1503960366 4/16/2016 12:00:00 AM 2 340 367
## 5 1503960366 4/17/2016 12:00:00 AM 1 700 712
## 6 1503960366 4/19/2016 12:00:00 AM 1 304 320
## # … with abbreviated variable name ¹TotalTimeInBed
head(Daily_Steps)
## # A tibble: 6 × 3
## Id ActivityDay StepTotal
## <dbl> <chr> <dbl>
## 1 1503960366 4/12/2016 13162
## 2 1503960366 4/13/2016 10735
## 3 1503960366 4/14/2016 10460
## 4 1503960366 4/15/2016 9762
## 5 1503960366 4/16/2016 12669
## 6 1503960366 4/17/2016 9705
head(Weight)
## # A tibble: 6 × 8
## Id Date WeightKg Weight…¹ Fat BMI IsMan…² LogId
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <lgl> <dbl>
## 1 1503960366 5/2/2016 11:59:59 PM 52.6 116. 22 22.6 TRUE 1.46e12
## 2 1503960366 5/3/2016 11:59:59 PM 52.6 116. NA 22.6 TRUE 1.46e12
## 3 1927972279 4/13/2016 1:08:52 AM 134. 294. NA 47.5 FALSE 1.46e12
## 4 2873212765 4/21/2016 11:59:59 PM 56.7 125. NA 21.5 TRUE 1.46e12
## 5 2873212765 5/12/2016 11:59:59 PM 57.3 126. NA 21.7 TRUE 1.46e12
## 6 4319703577 4/17/2016 11:59:59 PM 72.4 160. 25 27.5 TRUE 1.46e12
## # … with abbreviated variable names ¹WeightPounds, ²IsManualReport
colnames(Daily_Activity)
## [1] "Id" "ActivityDate"
## [3] "TotalSteps" "TotalDistance"
## [5] "TrackerDistance" "LoggedActivitiesDistance"
## [7] "VeryActiveDistance" "ModeratelyActiveDistance"
## [9] "LightActiveDistance" "SedentaryActiveDistance"
## [11] "VeryActiveMinutes" "FairlyActiveMinutes"
## [13] "LightlyActiveMinutes" "SedentaryMinutes"
## [15] "Calories"
colnames(Daily_Intensities)
## [1] "Id" "ActivityDay"
## [3] "SedentaryMinutes" "LightlyActiveMinutes"
## [5] "FairlyActiveMinutes" "VeryActiveMinutes"
## [7] "SedentaryActiveDistance" "LightActiveDistance"
## [9] "ModeratelyActiveDistance" "VeryActiveDistance"
colnames(Daily_Sleep)
## [1] "Id" "SleepDay" "TotalSleepRecords"
## [4] "TotalMinutesAsleep" "TotalTimeInBed"
colnames(Daily_Steps)
## [1] "Id" "ActivityDay" "StepTotal"
colnames(Weight)
## [1] "Id" "Date" "WeightKg" "WeightPounds"
## [5] "Fat" "BMI" "IsManualReport" "LogId"
The data frames titled Daily_Intensities and Daily_Steps are unnecessary. After investigation in this stage I determined the data in those two .csv files is in the Daily_Activity data frame. I will avoid using them further in analysis at this time.
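One way to confirm this kind of overlap is to align the column names that differ between the files and check that every row of the smaller table also appears in the larger one. The sketch below uses base R and three hand-typed rows copied from the `head()` output above; the helper name `is_contained` and the toy data frames are illustrative assumptions, not part of the original analysis, and the check assumes neither table contains duplicate rows.

```r
# Toy copies of the first three rows shown by head() above.
activity <- data.frame(
  Id = c(1503960366, 1503960366, 1503960366),
  ActivityDate = c("4/12/2016", "4/13/2016", "4/14/2016"),
  TotalSteps = c(13162, 10735, 10460),
  SedentaryMinutes = c(728, 776, 1218)
)
steps <- data.frame(
  Id = c(1503960366, 1503960366, 1503960366),
  ActivityDay = c("4/12/2016", "4/13/2016", "4/14/2016"),
  StepTotal = c(13162, 10735, 10460)
)

# Hypothetical helper: TRUE when every row of `small` (after renaming its
# columns via `rename_map`) also appears in `big`.
is_contained <- function(small, big, rename_map) {
  idx <- match(names(rename_map), names(small))
  names(small)[idx] <- unname(rename_map)
  # merge() joins on all shared column names, i.e. every column of `small`.
  merged <- merge(small, big[, names(small), drop = FALSE])
  nrow(merged) == nrow(small)
}

steps_in_activity <- is_contained(
  steps, activity,
  rename_map = c(ActivityDay = "ActivityDate", StepTotal = "TotalSteps")
)
steps_in_activity
```

The same call with the intensity columns would cover Daily_Intensities, whose date column is also named ActivityDay.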
glimpse(Daily_Activity)
## Rows: 940
## Columns: 15
## $ Id <dbl> 1503960366, 1503960366, 1503960366, 150396036…
## $ ActivityDate <chr> "4/12/2016", "4/13/2016", "4/14/2016", "4/15/…
## $ TotalSteps <dbl> 13162, 10735, 10460, 9762, 12669, 9705, 13019…
## $ TotalDistance <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59, 9.8…
## $ TrackerDistance <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59, 9.8…
## $ LoggedActivitiesDistance <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ VeryActiveDistance <dbl> 1.88, 1.57, 2.44, 2.14, 2.71, 3.19, 3.25, 3.5…
## $ ModeratelyActiveDistance <dbl> 0.55, 0.69, 0.40, 1.26, 0.41, 0.78, 0.64, 1.3…
## $ LightActiveDistance <dbl> 6.06, 4.71, 3.91, 2.83, 5.04, 2.51, 4.71, 5.0…
## $ SedentaryActiveDistance <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ VeryActiveMinutes <dbl> 25, 21, 30, 29, 36, 38, 42, 50, 28, 19, 66, 4…
## $ FairlyActiveMinutes <dbl> 13, 19, 11, 34, 10, 20, 16, 31, 12, 8, 27, 21…
## $ LightlyActiveMinutes <dbl> 328, 217, 181, 209, 221, 164, 233, 264, 205, …
## $ SedentaryMinutes <dbl> 728, 776, 1218, 726, 773, 539, 1149, 775, 818…
## $ Calories <dbl> 1985, 1797, 1776, 1745, 1863, 1728, 1921, 203…
While viewing the Daily_Activity table I noticed that some observations were filled almost entirely with 0’s, indicating that the user did not wear the smart device that day, so no tracking occurred. If these zero values are left in the data, the analysis will be skewed. I concluded that the best way to fix this problem is to remove the rows where TotalSteps = 0.
Daily_Activity_v2 <- Daily_Activity %>%
  filter(TotalSteps != 0)
glimpse(Daily_Activity_v2)
## Rows: 863
## Columns: 15
## $ Id <dbl> 1503960366, 1503960366, 1503960366, 150396036…
## $ ActivityDate <chr> "4/12/2016", "4/13/2016", "4/14/2016", "4/15/…
## $ TotalSteps <dbl> 13162, 10735, 10460, 9762, 12669, 9705, 13019…
## $ TotalDistance <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59, 9.8…
## $ TrackerDistance <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59, 9.8…
## $ LoggedActivitiesDistance <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ VeryActiveDistance <dbl> 1.88, 1.57, 2.44, 2.14, 2.71, 3.19, 3.25, 3.5…
## $ ModeratelyActiveDistance <dbl> 0.55, 0.69, 0.40, 1.26, 0.41, 0.78, 0.64, 1.3…
## $ LightActiveDistance <dbl> 6.06, 4.71, 3.91, 2.83, 5.04, 2.51, 4.71, 5.0…
## $ SedentaryActiveDistance <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ VeryActiveMinutes <dbl> 25, 21, 30, 29, 36, 38, 42, 50, 28, 19, 66, 4…
## $ FairlyActiveMinutes <dbl> 13, 19, 11, 34, 10, 20, 16, 31, 12, 8, 27, 21…
## $ LightlyActiveMinutes <dbl> 328, 217, 181, 209, 221, 164, 233, 264, 205, …
## $ SedentaryMinutes <dbl> 728, 776, 1218, 726, 773, 539, 1149, 775, 818…
## $ Calories <dbl> 1985, 1797, 1776, 1745, 1863, 1728, 1921, 203…
After running this code chunk I could see right away how much the filter helped the integrity of the data. The number of rows dropped from 940 to 863, so 77 zero-step observations were removed. Now all the observations in this data frame are relevant to the business task at hand.
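Before dropping rows it can help to count how many would be removed and to confirm that the filter did what was intended. Here is a minimal base R sketch of that bookkeeping; the values in `toy` are invented for illustration.

```r
toy <- data.frame(
  Id = c(1, 1, 2, 2, 3),
  TotalSteps = c(13162, 0, 9762, 0, 4),
  SedentaryMinutes = c(728, 1440, 726, 1440, 1438)
)

# Number of days with no recorded steps.
n_zero <- sum(toy$TotalSteps == 0)

# Base R equivalent of filter(TotalSteps != 0).
toy_v2 <- toy[toy$TotalSteps != 0, ]

# Rows before, rows removed, rows after.
c(before = nrow(toy), removed = n_zero, after = nrow(toy_v2))
```

Note that a day with only a handful of steps, like the last toy row, survives this filter, so near-empty days can still remain in the filtered data.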
glimpse(Weight)
## Rows: 67
## Columns: 8
## $ Id <dbl> 1503960366, 1503960366, 1927972279, 2873212765, 2873212…
## $ Date <chr> "5/2/2016 11:59:59 PM", "5/3/2016 11:59:59 PM", "4/13/2…
## $ WeightKg <dbl> 52.6, 52.6, 133.5, 56.7, 57.3, 72.4, 72.3, 69.7, 70.3, …
## $ WeightPounds <dbl> 115.9631, 115.9631, 294.3171, 125.0021, 126.3249, 159.6…
## $ Fat <dbl> 22, NA, NA, NA, NA, 25, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ BMI <dbl> 22.65, 22.65, 47.54, 21.45, 21.69, 27.45, 27.38, 27.25,…
## $ IsManualReport <lgl> TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, …
## $ LogId <dbl> 1.462234e+12, 1.462320e+12, 1.460510e+12, 1.461283e+12,…
glimpse(Daily_Sleep)
## Rows: 413
## Columns: 5
## $ Id <dbl> 1503960366, 1503960366, 1503960366, 1503960366, 150…
## $ SleepDay <chr> "4/12/2016 12:00:00 AM", "4/13/2016 12:00:00 AM", "…
## $ TotalSleepRecords <dbl> 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ TotalMinutesAsleep <dbl> 327, 384, 412, 340, 700, 304, 360, 325, 361, 430, 2…
## $ TotalTimeInBed <dbl> 346, 407, 442, 367, 712, 320, 377, 364, 384, 449, 3…
While looking further into the structure of the Weight and Daily_Sleep data frames I noticed that the dates and times were combined in a single column. I decided to create separate “Date” and “Time” columns in new data frames in case I need to use one of these variables in my analysis.
Weight_v2 <- Weight %>%
  separate(Date, into = c("Date", "Time"), sep = " ")
glimpse(Weight_v2)
## Rows: 67
## Columns: 9
## $ Id <dbl> 1503960366, 1503960366, 1927972279, 2873212765, 2873212…
## $ Date <chr> "5/2/2016", "5/3/2016", "4/13/2016", "4/21/2016", "5/12…
## $ Time <chr> "11:59:59", "11:59:59", "1:08:52", "11:59:59", "11:59:5…
## $ WeightKg <dbl> 52.6, 52.6, 133.5, 56.7, 57.3, 72.4, 72.3, 69.7, 70.3, …
## $ WeightPounds <dbl> 115.9631, 115.9631, 294.3171, 125.0021, 126.3249, 159.6…
## $ Fat <dbl> 22, NA, NA, NA, NA, 25, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ BMI <dbl> 22.65, 22.65, 47.54, 21.45, 21.69, 27.45, 27.38, 27.25,…
## $ IsManualReport <lgl> TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, …
## $ LogId <dbl> 1.462234e+12, 1.462320e+12, 1.460510e+12, 1.461283e+12,…
Daily_Sleep_v2 <- Daily_Sleep %>%
  separate(SleepDay, into = c("Date", "Time"), sep = " ")
glimpse(Daily_Sleep_v2)
## Rows: 413
## Columns: 6
## $ Id <dbl> 1503960366, 1503960366, 1503960366, 1503960366, 150…
## $ Date <chr> "4/12/2016", "4/13/2016", "4/15/2016", "4/16/2016",…
## $ Time <chr> "12:00:00", "12:00:00", "12:00:00", "12:00:00", "12…
## $ TotalSleepRecords <dbl> 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ TotalMinutesAsleep <dbl> 327, 384, 412, 340, 700, 304, 360, 325, 361, 430, 2…
## $ TotalTimeInBed <dbl> 346, 407, 442, 367, 712, 320, 377, 364, 384, 449, 3…
After performing these two chunks of code we have the dates and times formatted separately in new data frames! This will come in handy if I need to reference a time or date specifically later on.
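One caveat with splitting on every space: the original strings have three space-separated parts (date, time, and AM/PM), so the AM/PM marker is dropped, as the Time values above show. If the marker matters, `separate()` accepts `extra = "merge"` to keep everything after the first split together. The base R sketch below shows the same idea on two timestamps copied from the Weight data; the variable names are illustrative.

```r
stamps <- c("5/2/2016 11:59:59 PM", "4/13/2016 1:08:52 AM")

# Everything before the first space is the date ...
date_part <- sub(" .*$", "", stamps)
# ... and everything after it is the time, AM/PM included.
time_part <- sub("^[^ ]+ ", "", stamps)

# Convert the month/day/year text to a proper Date so it sorts correctly.
date_parsed <- as.Date(date_part, format = "%m/%d/%Y")

data.frame(Date = date_parsed, Time = time_part)
```

Storing the date as a `Date` rather than as text also makes it possible to sort chronologically or group by weekday later on.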
Before beginning the analysis I want to check that each data frame contains the 30 unique users described for the Fitbit Kaggle data set:
n_distinct(Daily_Activity_v2$Id)
## [1] 33
n_distinct(Daily_Sleep_v2$Id)
## [1] 24
n_distinct(Weight_v2$Id)
## [1] 8
This creates some major issues for the credibility and capability of this analysis. The code above shows that the number of unique users with recorded data in each of the three data frames is: Daily_Activity_v2 = 33, Daily_Sleep_v2 = 24, Weight_v2 = 8. This calls the integrity of the data into further question. Some data frames have more unique users than the 30 reported and some have fewer, which makes it difficult to link the data frames later in the analysis. For this reason I believe the best approach is to look at each data frame individually.
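To see exactly which users would survive a join, the ID sets can be compared directly. The sketch below uses base R set operations on a few IDs taken from the output above; which IDs appear in which toy vector is an assumption made purely for illustration.

```r
activity_ids <- c(1503960366, 1927972279, 2873212765, 4319703577)
sleep_ids    <- c(1503960366, 1927972279, 4319703577)
weight_ids   <- c(1503960366, 2873212765)

# Users present in all three sources (what an inner join would keep).
in_all <- Reduce(intersect, list(activity_ids, sleep_ids, weight_ids))

# Users with activity data but no sleep data.
no_sleep <- setdiff(activity_ids, sleep_ids)

length(in_all)
no_sleep
```

With the real data frames, the same calls on `unique(Daily_Activity_v2$Id)` and the other Id columns would show how few users have activity, sleep, and weight records at once.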
The final thing I want to ensure is that there are not duplicate observations in the data I will be analyzing.
nrow(Daily_Activity_v2)
## [1] 863
nrow(unique(Daily_Activity_v2))
## [1] 863
nrow(Daily_Sleep_v2)
## [1] 413
nrow(unique(Daily_Sleep_v2))
## [1] 410
nrow(Weight_v2)
## [1] 67
nrow(unique(Weight_v2))
## [1] 67
It appears the Daily_Sleep_v2 data frame has 3 duplicate rows. To fix this we will make a new data frame with only unique rows.
Daily_Sleep_v3 <- unique(Daily_Sleep_v2)
nrow(Daily_Sleep_v3)
## [1] 410
nrow(unique(Daily_Sleep_v3))
## [1] 410
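Comparing `nrow()` before and after `unique()` shows that duplicates exist, but not which rows they are. A small sketch with base R's `duplicated()`, using invented sleep rows, lists them before they are dropped:

```r
toy_sleep <- data.frame(
  Id = c(1, 1, 2, 2, 2),
  Date = c("4/12/2016", "4/13/2016", "4/12/2016", "4/12/2016", "4/13/2016"),
  TotalMinutesAsleep = c(327, 384, 412, 412, 340)
)

# TRUE for every row that exactly repeats an earlier row.
is_dup <- duplicated(toy_sleep)

# Inspect the repeated rows before removing them.
toy_sleep[is_dup, ]

# Keep only the first occurrence of each row, as unique() does.
toy_sleep_v2 <- toy_sleep[!is_dup, ]
nrow(toy_sleep_v2)
```

Inspecting the flagged rows first makes it easier to judge whether they are true duplicates or legitimate repeated measurements.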
Great, now our data is ready for analysis!
Now that the data is stored appropriately and has been prepared for analysis, it’s time to start putting it to work. Throughout this phase I will identify trends and relationships to draw insights for completing the business task.
The main trends I will focus on are those related to activity, sleep, and stress. Many of the Bellabeat health products track these metrics, so meaningful insights here will help shape new marketing strategies for these products later in this project.
I performed a summary function on each of the data frames I am working with to draw insights on the summary statistics.
summary(Daily_Activity_v2)
## Id ActivityDate TotalSteps TotalDistance
## Min. :1.504e+09 Length:863 Min. : 4 Min. : 0.00
## 1st Qu.:2.320e+09 Class :character 1st Qu.: 4923 1st Qu.: 3.37
## Median :4.445e+09 Mode :character Median : 8053 Median : 5.59
## Mean :4.858e+09 Mean : 8319 Mean : 5.98
## 3rd Qu.:6.962e+09 3rd Qu.:11092 3rd Qu.: 7.90
## Max. :8.878e+09 Max. :36019 Max. :28.03
## TrackerDistance LoggedActivitiesDistance VeryActiveDistance
## Min. : 0.000 Min. :0.0000 Min. : 0.000
## 1st Qu.: 3.370 1st Qu.:0.0000 1st Qu.: 0.000
## Median : 5.590 Median :0.0000 Median : 0.410
## Mean : 5.964 Mean :0.1178 Mean : 1.637
## 3rd Qu.: 7.880 3rd Qu.:0.0000 3rd Qu.: 2.275
## Max. :28.030 Max. :4.9421 Max. :21.920
## ModeratelyActiveDistance LightActiveDistance SedentaryActiveDistance
## Min. :0.0000 Min. : 0.000 Min. :0.00000
## 1st Qu.:0.0000 1st Qu.: 2.345 1st Qu.:0.00000
## Median :0.3100 Median : 3.580 Median :0.00000
## Mean :0.6182 Mean : 3.639 Mean :0.00175
## 3rd Qu.:0.8650 3rd Qu.: 4.895 3rd Qu.:0.00000
## Max. :6.4800 Max. :10.710 Max. :0.11000
## VeryActiveMinutes FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes
## Min. : 0.00 Min. : 0.00 Min. : 0.0 Min. : 0.0
## 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.:146.5 1st Qu.: 721.5
## Median : 7.00 Median : 8.00 Median :208.0 Median :1021.0
## Mean : 23.02 Mean : 14.78 Mean :210.0 Mean : 955.8
## 3rd Qu.: 35.00 3rd Qu.: 21.00 3rd Qu.:272.0 3rd Qu.:1189.0
## Max. :210.00 Max. :143.00 Max. :518.0 Max. :1440.0
## Calories
## Min. : 52
## 1st Qu.:1856
## Median :2220
## Mean :2361
## 3rd Qu.:2832
## Max. :4900
Total Steps had a mean of 8,319 with quite a large spread. The maximum was 36,019 and minimum was 4.
Intensity levels had a mean time of 23.02 minutes for very active, 14.78 minutes for fairly active, and 210.0 minutes for lightly active.
Sedentary minutes had a mean of 955.8 with a max of 1440 (i.e. a full day).
It is recommended for proper health to achieve 10,000 total steps in a day. The mean for the users in this data set is lower than this goal. A goal for Bellabeat could be to raise activity, specifically total steps, to help its consumers meet this target.
It is recommended to aim for at least 30 minutes of moderately intense activity a day. The sum of the mean very active and fairly active minutes is 23.02 + 14.78 = 37.80 minutes, so on average users exceed the recommendation. The medians are much lower (7 and 8 minutes), however, so this average is likely driven by a smaller number of very active days rather than by a majority of users.
With sedentary minutes having a max of 1440 minutes, an entire day, Bellabeat has opportunities to encourage customers to get up and be active for a few minutes if it’s detected no activity has been performed for the majority of the day.
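Because a mean can be pulled up by a few very active days, the share of days that reach each goal is a more direct check. The sketch below uses the six rows printed by `head(Daily_Activity)` above, so the percentages describe one user's first days only and are not representative; the 10,000-step and 30-minute thresholds are the goals quoted in the text.

```r
first_rows <- data.frame(
  TotalSteps          = c(13162, 10735, 10460, 9762, 12669, 9705),
  VeryActiveMinutes   = c(25, 21, 30, 29, 36, 38),
  FairlyActiveMinutes = c(13, 19, 11, 34, 10, 20)
)

# Minutes spent at fairly or very active intensity each day.
active_minutes <- first_rows$VeryActiveMinutes + first_rows$FairlyActiveMinutes

# mean() of a logical vector is the proportion of TRUE values.
share_steps_goal  <- mean(first_rows$TotalSteps >= 10000)
share_active_goal <- mean(active_minutes >= 30)

round(100 * c(steps = share_steps_goal, active = share_active_goal), 1)
```

Run on the full Daily_Activity_v2 data frame, the same two lines would show what fraction of tracked days actually meets each recommendation.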
summary(Daily_Sleep_v3)
## Id Date Time TotalSleepRecords
## Min. :1.504e+09 Length:410 Length:410 Min. :1.00
## 1st Qu.:3.977e+09 Class :character Class :character 1st Qu.:1.00
## Median :4.703e+09 Mode :character Mode :character Median :1.00
## Mean :4.995e+09 Mean :1.12
## 3rd Qu.:6.962e+09 3rd Qu.:1.00
## Max. :8.792e+09 Max. :3.00
## TotalMinutesAsleep TotalTimeInBed
## Min. : 58.0 Min. : 61.0
## 1st Qu.:361.0 1st Qu.:403.8
## Median :432.5 Median :463.0
## Mean :419.2 Mean :458.5
## 3rd Qu.:490.0 3rd Qu.:526.0
## Max. :796.0 Max. :961.0
summary(Weight_v2)
## Id Date Time WeightKg
## Min. :1.504e+09 Length:67 Length:67 Min. : 52.60
## 1st Qu.:6.962e+09 Class :character Class :character 1st Qu.: 61.40
## Median :6.962e+09 Mode :character Mode :character Median : 62.50
## Mean :7.009e+09 Mean : 72.04
## 3rd Qu.:8.878e+09 3rd Qu.: 85.05
## Max. :8.878e+09 Max. :133.50
##
## WeightPounds Fat BMI IsManualReport
## Min. :116.0 Min. :22.00 Min. :21.45 Mode :logical
## 1st Qu.:135.4 1st Qu.:22.75 1st Qu.:23.96 FALSE:26
## Median :137.8 Median :23.50 Median :24.39 TRUE :41
## Mean :158.8 Mean :23.50 Mean :25.19
## 3rd Qu.:187.5 3rd Qu.:24.25 3rd Qu.:25.56
## Max. :294.3 Max. :25.00 Max. :47.54
## NA's :65
## LogId
## Min. :1.460e+12
## 1st Qu.:1.461e+12
## Median :1.462e+12
## Mean :1.462e+12
## 3rd Qu.:1.462e+12
## Max. :1.463e+12
##
Weight in pounds had a mean of 158.8 lbs with a min of 116.0 lbs and a max of 294.3 lbs.
BMI had a mean of 25.19 with a min of 21.45 and a max of 47.54.
I am to perform a deep-dive analysis on a FitBit data set to gauge marketing opportunities for Bellabeat, a high-tech manufacturer of health-focused products for women. I will analyze smart device usage to discover consumer trends. Then I will draw conclusions about how these trends can create insights that increase the efficiency of Bellabeat’s marketing strategy for one of its products.
Bellabeat app: The Bellabeat app provides users with health data related to their activity, sleep, stress, menstrual cycle, and mindfulness habits. This data can help users better understand their current habits and make healthy decisions. The Bellabeat app connects to their line of smart wellness products. I picked this product to aid in the marketing strategy as it tracks the most data and can be linked with many of their other smart devices.
1. The Bellabeat app should offer optional notifications that tell users when they have been sedentary for an extended period. People do not always realize how long they have been inactive when they are fixated on something else, and these reminders could put health and wellness back on their minds. Figure 1, the pie chart, shows that users are sedentary for around 80% of their tracked day, so there is clear opportunity for improvement in this area.
2. Bellabeat could develop a series of workout videos to encourage users to spend at least 30 minutes a day at a moderate or very active intensity level. Not everyone can go to the gym every day, but that does not have to stop them from being active: home workouts can still provide a good level of intensity from the comfort of their own home. This would also raise activity on the app, since users would be on it for the entire workout.
3. The Bellabeat app should offer music and other sounds that aid sleep. An option to set a bedtime with a reminder could also prove useful. Figure 4 showed a strong positive relationship between the time people spent in bed and the time they spent asleep. The recommended amount of sleep per night is 8 hours, while this data set’s average is about 7 hours. Since most people do not get the recommended amount of sleep, sleep aids of this kind could improve customers’ health and increase their activity on the app.
4. Finally, Bellabeat should create an incentives program. This program would encourage daily tracking and positive behavior such as 10,000 steps daily, at least 30 minutes of a moderate or very intense workout, and 8 hours of sleep. Depending on the structure and rewards of the program, it could attract new customers as well as retain and encourage current ones.
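The headline figures used in these recommendations can be cross-checked against the means reported by `summary()` earlier. The following is a rough sketch: it treats the four intensity means as the whole tracked day, which is an assumption, and the variable names are illustrative.

```r
# Means copied from the summary() output above (minutes per day).
mean_minutes <- c(very = 23.02, fairly = 14.78, lightly = 210.0, sedentary = 955.8)

# Share of tracked minutes spent sedentary.
sedentary_share <- mean_minutes[["sedentary"]] / sum(mean_minutes)

# Average sleep in hours, and minutes asleep per minute in bed.
mean_asleep <- 419.2
mean_in_bed <- 458.5
sleep_hours      <- mean_asleep / 60
sleep_efficiency <- mean_asleep / mean_in_bed

round(c(sedentary_pct  = 100 * sedentary_share,
        sleep_hours    = sleep_hours,
        efficiency_pct = 100 * sleep_efficiency), 1)
```

These work out to roughly 79% sedentary time and about 7 hours of sleep per night, in line with the figures quoted in the recommendations.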
The next time Bellabeat analyzes or collects data, a few more parameters should be considered to ensure the most accurate and relevant results. The data should meet the following criteria:
A larger sample should be collected for more accurate results. Ideally, the sample size should support a 95% confidence level with a low margin of error.
Ensure no forms of bias occur in the sampling during the data collection process.
Ensure the users who are being tracked are tracked across all the metrics. This would make it possible to join the different data frames together and produce more insightful findings.
Collect the data over a longer timeframe than a 2-month period.
Ensure the data is relevant and up to date. This data set is from 6 years ago, which calls its relevance to current times into question.
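To make the sample-size recommendation concrete, a common formula for estimating a proportion is n = z^2 * p * (1 - p) / e^2, where z is the critical value for the chosen confidence level and e is the margin of error. The sketch below assumes a very large population, simple random sampling, and the conservative choice p = 0.5; the function names are hypothetical.

```r
# Sample size needed to estimate a proportion within margin `e`.
required_n <- function(conf = 0.95, e = 0.05, p = 0.5) {
  z <- qnorm(1 - (1 - conf) / 2)   # two-sided critical value, about 1.96 for 95%
  ceiling(z^2 * p * (1 - p) / e^2)
}

# Margin of error achieved by a sample of size `n`.
margin_of_error <- function(n, conf = 0.95, p = 0.5) {
  z <- qnorm(1 - (1 - conf) / 2)
  z * sqrt(p * (1 - p) / n)
}

required_n()                    # 95% confidence, 5% margin of error
round(margin_of_error(33), 3)   # the 33 users found in Daily_Activity_v2
```

Under these assumptions about 385 users would be needed, while 33 users give a margin of error of roughly 17 percentage points.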