By: Reilly McCarthy

Intro

Hello! Welcome to the capstone project I completed to earn my Google Data Analytics certificate. I chose to complete this case study in RStudio Desktop because R was the main new skill I learned in the course, and I wanted to embrace my curiosity and learn more about it through this project. At the beginning of this report I describe the scenario of the case study I was given. After that I walk you through my data analysis process using the six steps I learned in the course: Ask, Prepare, Process, Analyze, Share, and Act.

Scenario

I am a junior data analyst working on the marketing analyst team at Bellabeat, a high-tech manufacturer of health-focused products for women. Bellabeat is a successful small company, but they have the potential to become a larger player in the global smart device market. Urška Sršen, cofounder and Chief Creative Officer of Bellabeat, believes that analyzing smart device fitness data could help unlock new growth opportunities for the company. I have been asked to focus on one of Bellabeat’s products and analyze smart device data to gain insight into how consumers are using their smart devices. The insights I discover will then help guide marketing strategy for the company. I will present my analysis to the Bellabeat executive team along with my high-level recommendations for Bellabeat’s marketing strategy for a product.

Products

  • Bellabeat app: The Bellabeat app provides users with health data related to their activity, sleep, stress, menstrual cycle, and mindfulness habits. This data can help users better understand their current habits and make healthy decisions. The Bellabeat app connects to their line of smart wellness products.
  • Leaf: Bellabeat’s classic wellness tracker can be worn as a bracelet, necklace, or clip. The Leaf tracker connects to the Bellabeat app to track activity, sleep, and stress.
  • Time: This wellness watch combines the timeless look of a classic timepiece with smart technology to track user activity, sleep, and stress. The Time watch connects to the Bellabeat app to provide you with insights into your daily wellness.
  • Spring: This is a water bottle that tracks daily water intake using smart technology to ensure that you are appropriately hydrated throughout the day. The Spring bottle connects to the Bellabeat app to track your hydration levels.

1. ASK

Urška Sršen asked me to analyze smart device usage data in order to gain insight into how consumers use non-Bellabeat smart devices. She then wants me to select one of Bellabeat’s products to apply these insights to in my presentation.

Business Task

I am to perform a deep-dive analysis on a FitBit data set to gauge marketing opportunities for Bellabeat, a high-tech manufacturer of health-focused products for women. I will analyze smart device usage to discover consumer trends. Then I will draw conclusions about how these trends can inform and improve the efficiency of Bellabeat’s marketing strategy. The key stakeholders in this project are Urška Sršen, Sando Mur, the Bellabeat marketing analytics team, and Bellabeat investors.

2. PREPARE

Sršen encouraged me to use public data that explores smart device users’ daily habits. She pointed me to this specific data set: https://www.kaggle.com/datasets/arashnic/fitbit

My Notes on the Data

This data set was provided as a downloadable folder containing 18 .csv files. The data is stored in long format. During the preparation phase I focused on 5 specific .csv files and created the data frames Daily_Activity, Daily_Intensities, Daily_Sleep, Daily_Steps, and Weight. I chose these files because a lot of the data in them is also tracked by the Bellabeat products, so they seemed most likely to lead to insightful findings. These findings could then lead to high-level marketing strategies, which is the task of the overall analysis.

Data Credibility

Before diving into any of my code I will use the ROCCC (Reliable, Original, Comprehensive, Current, Cited) framework to determine the credibility of this data.

  • Reliable: This data is NOT reliable. The data set contains a sample of 30 individuals who consented to the study. A sample of 30 is the common rule-of-thumb minimum for the Central Limit Theorem (CLT) to apply, but it is still very small relative to the entire population of FitBit customers. The data was also collected over only a 2-month span and was recorded over 6 years ago. I believe health data should span a longer period to show reliable trends, since changes in health are gradual. The age of the data also raises questions about its relevance today.

  • Original: This data is NOT original. The data set is generated by respondents to a distributed survey via Amazon Mechanical Turk. This data is not Bellabeat’s original data. First-party data would have been better to use.

  • Comprehensive: This data is NOT comprehensive. A couple of things would make this data set more comprehensive. One is a larger sample size: the current sample is too small, and a larger one would raise the confidence level of the analysis and lower the margin of error. Another is a longer collection period. Finally, there is a chance of sampling bias due to the way the survey was conducted. Without more information to rule this out, it should be noted.

  • Current: This data is NOT current. The data set is from 2016, as I also mentioned in the Reliable section.

  • Cited: The data is cited, but I am still unsure of its credibility. The data set is described as generated by a survey distributed via Amazon Mechanical Turk. I would need to look further into how this survey was collected to establish credibility.

The overall integrity of this data is lacking, so any findings should be prefaced with that caveat. Insightful conclusions can still be made by identifying overall trends to investigate further with more relevant data sets. This analysis can also highlight data that would be useful to track but is currently missing, which can help Bellabeat decide what to begin tracking for future analyses.

Installing and Loading Packages

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6     ✔ purrr   0.3.4
## ✔ tibble  3.1.8     ✔ dplyr   1.0.9
## ✔ tidyr   1.2.0     ✔ stringr 1.4.0
## ✔ readr   2.1.2     ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(ggplot2)
library(readr)
library(tidyr)
library(dplyr)
library(skimr)
library(janitor)
## 
## Attaching package: 'janitor'
## 
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test

Importing Data Sets

Daily_Activity <- read_csv("dailyActivity_merged.csv")
## Rows: 940 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (1): ActivityDate
## dbl (14): Id, TotalSteps, TotalDistance, TrackerDistance, LoggedActivitiesDi...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Daily_Steps <- read_csv("dailySteps_merged.csv")
## Rows: 940 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): ActivityDay
## dbl (2): Id, StepTotal
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Daily_Sleep <- read_csv("sleepDay_merged.csv")
## Rows: 413 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): SleepDay
## dbl (4): Id, TotalSleepRecords, TotalMinutesAsleep, TotalTimeInBed
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Weight <- read_csv("weightLogInfo_merged.csv")
## Rows: 67 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Date
## dbl (6): Id, WeightKg, WeightPounds, Fat, BMI, LogId
## lgl (1): IsManualReport
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Daily_Intensities <- read_csv("dailyIntensities_merged.csv")
## Rows: 940 Columns: 10
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): ActivityDay
## dbl (9): Id, SedentaryMinutes, LightlyActiveMinutes, FairlyActiveMinutes, Ve...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

3. PROCESS

Data Validation

head(Daily_Activity)
## # A tibble: 6 × 15
##       Id Activ…¹ Total…² Total…³ Track…⁴ Logge…⁵ VeryA…⁶ Moder…⁷ Light…⁸ Seden…⁹
##    <dbl> <chr>     <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
## 1 1.50e9 4/12/2…   13162    8.5     8.5        0    1.88   0.550    6.06       0
## 2 1.50e9 4/13/2…   10735    6.97    6.97       0    1.57   0.690    4.71       0
## 3 1.50e9 4/14/2…   10460    6.74    6.74       0    2.44   0.400    3.91       0
## 4 1.50e9 4/15/2…    9762    6.28    6.28       0    2.14   1.26     2.83       0
## 5 1.50e9 4/16/2…   12669    8.16    8.16       0    2.71   0.410    5.04       0
## 6 1.50e9 4/17/2…    9705    6.48    6.48       0    3.19   0.780    2.51       0
## # … with 5 more variables: VeryActiveMinutes <dbl>, FairlyActiveMinutes <dbl>,
## #   LightlyActiveMinutes <dbl>, SedentaryMinutes <dbl>, Calories <dbl>, and
## #   abbreviated variable names ¹​ActivityDate, ²​TotalSteps, ³​TotalDistance,
## #   ⁴​TrackerDistance, ⁵​LoggedActivitiesDistance, ⁶​VeryActiveDistance,
## #   ⁷​ModeratelyActiveDistance, ⁸​LightActiveDistance, ⁹​SedentaryActiveDistance
## # ℹ Use `colnames()` to see all variable names
head(Daily_Intensities)
## # A tibble: 6 × 10
##       Id Activ…¹ Seden…² Light…³ Fairl…⁴ VeryA…⁵ Seden…⁶ Light…⁷ Moder…⁸ VeryA…⁹
##    <dbl> <chr>     <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
## 1 1.50e9 4/12/2…     728     328      13      25       0    6.06   0.550    1.88
## 2 1.50e9 4/13/2…     776     217      19      21       0    4.71   0.690    1.57
## 3 1.50e9 4/14/2…    1218     181      11      30       0    3.91   0.400    2.44
## 4 1.50e9 4/15/2…     726     209      34      29       0    2.83   1.26     2.14
## 5 1.50e9 4/16/2…     773     221      10      36       0    5.04   0.410    2.71
## 6 1.50e9 4/17/2…     539     164      20      38       0    2.51   0.780    3.19
## # … with abbreviated variable names ¹​ActivityDay, ²​SedentaryMinutes,
## #   ³​LightlyActiveMinutes, ⁴​FairlyActiveMinutes, ⁵​VeryActiveMinutes,
## #   ⁶​SedentaryActiveDistance, ⁷​LightActiveDistance, ⁸​ModeratelyActiveDistance,
## #   ⁹​VeryActiveDistance
head(Daily_Sleep)
## # A tibble: 6 × 5
##           Id SleepDay              TotalSleepRecords TotalMinutesAsleep TotalT…¹
##        <dbl> <chr>                             <dbl>              <dbl>    <dbl>
## 1 1503960366 4/12/2016 12:00:00 AM                 1                327      346
## 2 1503960366 4/13/2016 12:00:00 AM                 2                384      407
## 3 1503960366 4/15/2016 12:00:00 AM                 1                412      442
## 4 1503960366 4/16/2016 12:00:00 AM                 2                340      367
## 5 1503960366 4/17/2016 12:00:00 AM                 1                700      712
## 6 1503960366 4/19/2016 12:00:00 AM                 1                304      320
## # … with abbreviated variable name ¹​TotalTimeInBed
head(Daily_Steps)
## # A tibble: 6 × 3
##           Id ActivityDay StepTotal
##        <dbl> <chr>           <dbl>
## 1 1503960366 4/12/2016       13162
## 2 1503960366 4/13/2016       10735
## 3 1503960366 4/14/2016       10460
## 4 1503960366 4/15/2016        9762
## 5 1503960366 4/16/2016       12669
## 6 1503960366 4/17/2016        9705
head(Weight)
## # A tibble: 6 × 8
##           Id Date                  WeightKg Weight…¹   Fat   BMI IsMan…²   LogId
##        <dbl> <chr>                    <dbl>    <dbl> <dbl> <dbl> <lgl>     <dbl>
## 1 1503960366 5/2/2016 11:59:59 PM      52.6     116.    22  22.6 TRUE    1.46e12
## 2 1503960366 5/3/2016 11:59:59 PM      52.6     116.    NA  22.6 TRUE    1.46e12
## 3 1927972279 4/13/2016 1:08:52 AM     134.      294.    NA  47.5 FALSE   1.46e12
## 4 2873212765 4/21/2016 11:59:59 PM     56.7     125.    NA  21.5 TRUE    1.46e12
## 5 2873212765 5/12/2016 11:59:59 PM     57.3     126.    NA  21.7 TRUE    1.46e12
## 6 4319703577 4/17/2016 11:59:59 PM     72.4     160.    25  27.5 TRUE    1.46e12
## # … with abbreviated variable names ¹​WeightPounds, ²​IsManualReport
colnames(Daily_Activity)
##  [1] "Id"                       "ActivityDate"            
##  [3] "TotalSteps"               "TotalDistance"           
##  [5] "TrackerDistance"          "LoggedActivitiesDistance"
##  [7] "VeryActiveDistance"       "ModeratelyActiveDistance"
##  [9] "LightActiveDistance"      "SedentaryActiveDistance" 
## [11] "VeryActiveMinutes"        "FairlyActiveMinutes"     
## [13] "LightlyActiveMinutes"     "SedentaryMinutes"        
## [15] "Calories"
colnames(Daily_Intensities)
##  [1] "Id"                       "ActivityDay"             
##  [3] "SedentaryMinutes"         "LightlyActiveMinutes"    
##  [5] "FairlyActiveMinutes"      "VeryActiveMinutes"       
##  [7] "SedentaryActiveDistance"  "LightActiveDistance"     
##  [9] "ModeratelyActiveDistance" "VeryActiveDistance"
colnames(Daily_Sleep)
## [1] "Id"                 "SleepDay"           "TotalSleepRecords" 
## [4] "TotalMinutesAsleep" "TotalTimeInBed"
colnames(Daily_Steps)
## [1] "Id"          "ActivityDay" "StepTotal"
colnames(Weight)
## [1] "Id"             "Date"           "WeightKg"       "WeightPounds"  
## [5] "Fat"            "BMI"            "IsManualReport" "LogId"

Data Observation 1

The data frames titled Daily_Intensities and Daily_Steps are unnecessary. After investigation in this stage I determined the data in those two .csv files is in the Daily_Activity data frame. I will avoid using them further in analysis at this time.
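
One way to back up this observation is to compare the overlapping columns directly. The sketch below is a minimal illustration rather than the exact check I ran: it rebuilds the first six rows shown in the head() output above as small data frames (the names activity_sample and steps_sample are made up for this example) and confirms that the step counts match row for row.

```r
# First six rows of each table, copied from the head() output above.
days <- c("4/12/2016", "4/13/2016", "4/14/2016",
          "4/15/2016", "4/16/2016", "4/17/2016")
step_counts <- c(13162, 10735, 10460, 9762, 12669, 9705)

activity_sample <- data.frame(Id = rep(1503960366, 6),
                              ActivityDate = days,
                              TotalSteps = step_counts)
steps_sample <- data.frame(Id = rep(1503960366, 6),
                           ActivityDay = days,
                           StepTotal = step_counts)

# Join on user and day, then compare the two step columns.
joined <- merge(activity_sample, steps_sample,
                by.x = c("Id", "ActivityDate"),
                by.y = c("Id", "ActivityDay"))
steps_match <- all(joined$TotalSteps == joined$StepTotal)
steps_match
```

Running the same join on the full data frames would show whether every user-day in Daily_Steps has an identical value in Daily_Activity.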

glimpse(Daily_Activity)
## Rows: 940
## Columns: 15
## $ Id                       <dbl> 1503960366, 1503960366, 1503960366, 150396036…
## $ ActivityDate             <chr> "4/12/2016", "4/13/2016", "4/14/2016", "4/15/…
## $ TotalSteps               <dbl> 13162, 10735, 10460, 9762, 12669, 9705, 13019…
## $ TotalDistance            <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59, 9.8…
## $ TrackerDistance          <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59, 9.8…
## $ LoggedActivitiesDistance <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ VeryActiveDistance       <dbl> 1.88, 1.57, 2.44, 2.14, 2.71, 3.19, 3.25, 3.5…
## $ ModeratelyActiveDistance <dbl> 0.55, 0.69, 0.40, 1.26, 0.41, 0.78, 0.64, 1.3…
## $ LightActiveDistance      <dbl> 6.06, 4.71, 3.91, 2.83, 5.04, 2.51, 4.71, 5.0…
## $ SedentaryActiveDistance  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ VeryActiveMinutes        <dbl> 25, 21, 30, 29, 36, 38, 42, 50, 28, 19, 66, 4…
## $ FairlyActiveMinutes      <dbl> 13, 19, 11, 34, 10, 20, 16, 31, 12, 8, 27, 21…
## $ LightlyActiveMinutes     <dbl> 328, 217, 181, 209, 221, 164, 233, 264, 205, …
## $ SedentaryMinutes         <dbl> 728, 776, 1218, 726, 773, 539, 1149, 775, 818…
## $ Calories                 <dbl> 1985, 1797, 1776, 1745, 1863, 1728, 1921, 203…

Data Observation 2

While viewing the Daily_Activity table I noticed that some observations were filled mostly with 0’s, indicating that the user did not wear the smart device that day, so no tracking occurred. If these 0 values are left in the data, our analysis will be skewed. I concluded that the best way to fix this problem is to eliminate rows where TotalSteps = 0.

Daily_Activity_v2 <- Daily_Activity %>% 
  filter(TotalSteps !=0)

glimpse(Daily_Activity_v2)
## Rows: 863
## Columns: 15
## $ Id                       <dbl> 1503960366, 1503960366, 1503960366, 150396036…
## $ ActivityDate             <chr> "4/12/2016", "4/13/2016", "4/14/2016", "4/15/…
## $ TotalSteps               <dbl> 13162, 10735, 10460, 9762, 12669, 9705, 13019…
## $ TotalDistance            <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59, 9.8…
## $ TrackerDistance          <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59, 9.8…
## $ LoggedActivitiesDistance <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ VeryActiveDistance       <dbl> 1.88, 1.57, 2.44, 2.14, 2.71, 3.19, 3.25, 3.5…
## $ ModeratelyActiveDistance <dbl> 0.55, 0.69, 0.40, 1.26, 0.41, 0.78, 0.64, 1.3…
## $ LightActiveDistance      <dbl> 6.06, 4.71, 3.91, 2.83, 5.04, 2.51, 4.71, 5.0…
## $ SedentaryActiveDistance  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ VeryActiveMinutes        <dbl> 25, 21, 30, 29, 36, 38, 42, 50, 28, 19, 66, 4…
## $ FairlyActiveMinutes      <dbl> 13, 19, 11, 34, 10, 20, 16, 31, 12, 8, 27, 21…
## $ LightlyActiveMinutes     <dbl> 328, 217, 181, 209, 221, 164, 233, 264, 205, …
## $ SedentaryMinutes         <dbl> 728, 776, 1218, 726, 773, 539, 1149, 775, 818…
## $ Calories                 <dbl> 1985, 1797, 1776, 1745, 1863, 1728, 1921, 203…

After running this code chunk I could see right away how much the filter helped the integrity of the data. The number of rows went from 940 to 863, so 77 zero-step observations were removed. Now all the observations in this data frame will be relevant to the business task at hand.

glimpse(Weight)
## Rows: 67
## Columns: 8
## $ Id             <dbl> 1503960366, 1503960366, 1927972279, 2873212765, 2873212…
## $ Date           <chr> "5/2/2016 11:59:59 PM", "5/3/2016 11:59:59 PM", "4/13/2…
## $ WeightKg       <dbl> 52.6, 52.6, 133.5, 56.7, 57.3, 72.4, 72.3, 69.7, 70.3, …
## $ WeightPounds   <dbl> 115.9631, 115.9631, 294.3171, 125.0021, 126.3249, 159.6…
## $ Fat            <dbl> 22, NA, NA, NA, NA, 25, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ BMI            <dbl> 22.65, 22.65, 47.54, 21.45, 21.69, 27.45, 27.38, 27.25,…
## $ IsManualReport <lgl> TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, …
## $ LogId          <dbl> 1.462234e+12, 1.462320e+12, 1.460510e+12, 1.461283e+12,…
glimpse(Daily_Sleep)
## Rows: 413
## Columns: 5
## $ Id                 <dbl> 1503960366, 1503960366, 1503960366, 1503960366, 150…
## $ SleepDay           <chr> "4/12/2016 12:00:00 AM", "4/13/2016 12:00:00 AM", "…
## $ TotalSleepRecords  <dbl> 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ TotalMinutesAsleep <dbl> 327, 384, 412, 340, 700, 304, 360, 325, 361, 430, 2…
## $ TotalTimeInBed     <dbl> 346, 407, 442, 367, 712, 320, 377, 364, 384, 449, 3…

Data Observation 3

While looking further into the structure of the Weight and Daily_Sleep data frames I noticed that the dates and times were combined in a single column. I decided to make separate “Date” and “Time” columns in new data frames in case I need to use one of these variables in my analysis.

Weight_v2 <- Weight %>% 
  separate(Date, c("Date", "Time"), " ")

glimpse(Weight_v2)
## Rows: 67
## Columns: 9
## $ Id             <dbl> 1503960366, 1503960366, 1927972279, 2873212765, 2873212…
## $ Date           <chr> "5/2/2016", "5/3/2016", "4/13/2016", "4/21/2016", "5/12…
## $ Time           <chr> "11:59:59", "11:59:59", "1:08:52", "11:59:59", "11:59:5…
## $ WeightKg       <dbl> 52.6, 52.6, 133.5, 56.7, 57.3, 72.4, 72.3, 69.7, 70.3, …
## $ WeightPounds   <dbl> 115.9631, 115.9631, 294.3171, 125.0021, 126.3249, 159.6…
## $ Fat            <dbl> 22, NA, NA, NA, NA, 25, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ BMI            <dbl> 22.65, 22.65, 47.54, 21.45, 21.69, 27.45, 27.38, 27.25,…
## $ IsManualReport <lgl> TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, …
## $ LogId          <dbl> 1.462234e+12, 1.462320e+12, 1.460510e+12, 1.461283e+12,…
Daily_Sleep_v2 <- Daily_Sleep %>% 
  separate(SleepDay, c("Date", "Time"), " ")

glimpse(Daily_Sleep_v2)
## Rows: 413
## Columns: 6
## $ Id                 <dbl> 1503960366, 1503960366, 1503960366, 1503960366, 150…
## $ Date               <chr> "4/12/2016", "4/13/2016", "4/15/2016", "4/16/2016",…
## $ Time               <chr> "12:00:00", "12:00:00", "12:00:00", "12:00:00", "12…
## $ TotalSleepRecords  <dbl> 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ TotalMinutesAsleep <dbl> 327, 384, 412, 340, 700, 304, 360, 325, 361, 430, 2…
## $ TotalTimeInBed     <dbl> 346, 407, 442, 367, 712, 320, 377, 364, 384, 449, 3…

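After running these two chunks of code we have the dates and times stored separately in new data frames! This will come in handy if I need to reference a time or date specifically later on. Note that because the original values contain a second space before AM/PM, splitting into only two columns drops the AM/PM marker from the Time column.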
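
If the AM/PM marker that the split on spaces leaves out matters later, one option is to split only at the first space. The sketch below is a base R illustration under that assumption, using two raw values copied from the head(Weight) output; the variable names are made up for the example. I believe tidyr's separate() can do the same thing with extra = "merge", but I did not use that in this report.

```r
# Two raw values copied from the head(Weight) output above.
raw_stamp <- c("5/2/2016 11:59:59 PM", "4/13/2016 1:08:52 AM")

# Everything before the first space is the date.
date_part <- sub(" .*$", "", raw_stamp)

# Everything after the first space is the time, including AM/PM.
time_part <- sub("^[^ ]+ ", "", raw_stamp)

# Convert the date text to a Date so it sorts chronologically.
date_value <- as.Date(date_part, format = "%m/%d/%Y")

data.frame(date_value, time_part)
```

Storing the date as a Date rather than text also means later plots and filters treat 4/13/2016 as earlier than 5/2/2016.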

Data Observation 4

Before beginning analysis I want to check that there are 30 unique users in each of the data sets. Recall my notes on the Fitbit Kaggle data set:

  • Kaggle data set contents: This Kaggle data set contains fitness data on 30 FitBit users. Thirty eligible Fitbit users consented to the submission of personal tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring. The data was collected by a survey distributed via Amazon Mechanical Turk.
n_distinct(Daily_Activity_v2$Id)
## [1] 33
n_distinct(Daily_Sleep_v2$Id)
## [1] 24
n_distinct(Weight_v2$Id)
## [1] 8

This creates some major issues with the credibility and capability of this data analysis. The code above shows that the number of unique users with recorded data in each of the three data frames is: Daily_Activity_v2 = 33, Daily_Sleep_v2 = 24, Weight_v2 = 8. This calls the integrity of the data into even more question. Some data sets have more unique users than the 30 documented and some have fewer, which creates issues with linking the data sets later in the analysis. Because of this I believe the best thing to do is look at each data frame individually.

  • Note: If this were not a self-guided project, I would have shared this finding with all of my stakeholders so we could decide on next steps together.
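
To see how much the user groups overlap, rather than only how many users each table has, the ID columns can be compared as sets. This is a minimal sketch with made-up IDs (101 to 106 are hypothetical placeholders, not values from the data set); with the real data the vectors would come from the Id column of each data frame.

```r
# Hypothetical ID vectors standing in for the Id columns of three data frames.
activity_ids <- c(101, 102, 103, 104, 105)
sleep_ids    <- c(101, 102, 104)
weight_ids   <- c(102, 106)

# Users present in all three tables.
shared_ids <- Reduce(intersect, list(activity_ids, sleep_ids, weight_ids))

# Users with weight logs but no activity data.
weight_only <- setdiff(weight_ids, activity_ids)

length(shared_ids)
weight_only
```

A small shared set would confirm that joining the tables would throw away most of the users.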

Data Observation 5

The final thing I want to ensure is that there are not duplicate observations in the data I will be analyzing.

nrow(Daily_Activity_v2)
## [1] 863
nrow(unique(Daily_Activity_v2))
## [1] 863
nrow(Daily_Sleep_v2)
## [1] 413
nrow(unique(Daily_Sleep_v2))
## [1] 410
nrow(Weight_v2)
## [1] 67
nrow(unique(Weight_v2))
## [1] 67

It appears the Daily_Sleep_v2 data frame has 3 duplicate rows. To fix this we will make a new data frame with only unique rows.

Daily_Sleep_v3 <- unique(Daily_Sleep_v2)

nrow(Daily_Sleep_v3)
## [1] 410
nrow(unique(Daily_Sleep_v3))
## [1] 410

Great, now our data is ready for analysis!
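
Counting rows shows that duplicates exist, but not which rows they are. A minimal sketch, using a small made-up data frame (sleep_sample and its values are hypothetical), of how base R's duplicated() could be used to inspect them:

```r
# Hypothetical sample with one fully repeated row (rows 2 and 3).
sleep_sample <- data.frame(
  Id = c(1, 1, 1, 2),
  Date = c("4/12/2016", "4/13/2016", "4/13/2016", "4/12/2016"),
  TotalMinutesAsleep = c(327, 384, 384, 412)
)

# duplicated() flags the second and later copies of a row.
dup_rows <- sleep_sample[duplicated(sleep_sample), ]

# A stricter check: the same user and day should appear only once.
key_dups <- sum(duplicated(sleep_sample[, c("Id", "Date")]))

dup_rows
key_dups
```

Checking the Id and Date key on its own would also catch rows that repeat a user-day with different values, which unique() on whole rows would miss.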

4. ANALYZE

Now that the data is stored appropriately and has been prepared for analysis, it’s time to start putting it to work. Throughout this phase I will identify trends and relationships to draw insights for completing the business task.
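
As a hedged illustration of the kind of summary statistics this phase relies on, the sketch below summarizes only the six Daily_Activity rows shown in the head() and glimpse() output earlier (activity_sample is a made-up name). It is not the full analysis, and the numbers describe one user's first six days, not the whole data set.

```r
# Six rows copied from the Daily_Activity output shown earlier.
activity_sample <- data.frame(
  TotalSteps = c(13162, 10735, 10460, 9762, 12669, 9705),
  SedentaryMinutes = c(728, 776, 1218, 726, 773, 539),
  Calories = c(1985, 1797, 1776, 1745, 1863, 1728)
)

# Min, quartiles, median, mean, and max for each column.
summary(activity_sample)

# Column means on their own, handy for comparing against guidelines.
sample_means <- colMeans(activity_sample)
sample_means
```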

5. SHARE

Now that the data has been analyzed through statistics I will dive deeper into understanding and representing it through visualizations. As I stated during the PROCESS phase, I am going to avoid combining the data sets because the number of unique users differs between them (number of unique IDs: Daily_Activity_v2 = 33, Daily_Sleep_v2 = 24, Weight_v2 = 8).

Figure 1

All the figures I create besides Figure 1 will be done through the package ggplot2. I looked into how to create a pie chart in this package; there are workarounds, but no dedicated function for it, so Figure 1 uses base R's pie() instead. First I will total the active and sedentary minutes at each intensity level. The goal is to see what percentage of the participants’ day is spent at each activity intensity level.

V_ActiveMin <- sum(Daily_Activity_v2$VeryActiveMinutes)
F_ActiveMin <- sum(Daily_Activity_v2$FairlyActiveMinutes)
L_ActiveMin <- sum(Daily_Activity_v2$LightlyActiveMinutes)
SedentaryMin <- sum(Daily_Activity_v2$SedentaryMinutes)
## Above I created totals of the minutes at each intensity level to aid in creating the pie chart

ChartPieces <- c(V_ActiveMin, F_ActiveMin, L_ActiveMin, SedentaryMin)
label <- c("Very Active", "Fairly Active", "Lightly Active", "Sedentary")
percentcalc <- round(ChartPieces / sum(ChartPieces) * 100)
## paste0() has no sep argument, so the name, percentage, and "%" are joined in one call
label <- paste0(label, " ", percentcalc, "%")
pie(ChartPieces, labels = label, clockwise = FALSE, col = c("red", "black", "cyan", "green"), main = "Intensity Level Percentage", sub = "% of participants' day spent very active, fairly active, lightly active, or sedentary")
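
For comparison, here is a sketch of what the ggplot2 workaround could look like: a single stacked bar bent into a circle with coord_polar(). The minute totals are an assumption for illustration only; they are the sums of the six Daily_Activity rows shown earlier, not the totals for the full data set, so the shares will not match Figure 1.

```r
library(ggplot2)

# Illustrative totals: sums of the six rows shown in the earlier output.
intensity <- data.frame(
  level = c("Very Active", "Fairly Active", "Lightly Active", "Sedentary"),
  minutes = c(179, 107, 1320, 4760)
)
intensity$share <- intensity$minutes / sum(intensity$minutes)

# One stacked bar, wrapped around the y axis to form a pie.
pie_plot <- ggplot(intensity, aes(x = "", y = share, fill = level)) +
  geom_col(width = 1) +
  coord_polar(theta = "y") +
  labs(title = "Intensity Level Percentage", x = NULL, y = NULL,
       fill = "Intensity level") +
  theme_void()

pie_plot
```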

Deductions

Prior to this phase I mentioned in my Notes From Analysis that I wanted to revisit this concept of time spent in each intensity level. Now that it is visualized in a pie chart, we can see that participants spent 79% of their tracked time in a sedentary state. This creates an opportunity for Bellabeat to run engagement campaigns that raise consumer activity levels.

Figure 2

Now I want to create a visual using ggplot2 to compare Total Steps and Calories burned. I am doing this to see whether, based on the data, encouraging users to reach 10,000 steps would be associated with more calories burned.

ggplot(data = Daily_Activity_v2) +
  geom_smooth(mapping = aes(x = Calories, y = TotalSteps)) +
  geom_jitter(mapping = aes(x = Calories, y = TotalSteps), color = "chocolate2", alpha = .25) +
  labs(title = "Relationship Between Calories Burned and Total Steps", x = "Calories Burned", y = "Total Steps")

Deductions

There appears to be a positive relationship between calories burned and total daily steps. Days with about 10,000 steps correspond to roughly 3,000 calories burned. This supports my earlier thought that encouraging users to reach 10,000 steps a day would be associated with more calories burned, based on this data.
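
A chart shows the direction of a relationship; a correlation coefficient and a simple linear fit put a number on it. The sketch below is a minimal illustration on only the six Daily_Activity rows shown earlier (one user, six days), so its values should not be read as results for the full data set.

```r
# Six rows copied from the Daily_Activity output shown earlier.
steps <- c(13162, 10735, 10460, 9762, 12669, 9705)
calories <- c(1985, 1797, 1776, 1745, 1863, 1728)

# Pearson correlation: +1 is a perfect positive linear relationship.
r <- cor(steps, calories)

# Simple linear regression of calories on steps.
fit <- lm(calories ~ steps)
calories_per_step <- coef(fit)[["steps"]]

r
calories_per_step
```

Note that this treats steps as the predictor, while Figure 2 puts calories on the x axis. Either way, a positive correlation describes an association and does not by itself show that more steps cause the extra calories burned.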

Figure 3

Next I will compare the relationship between Total Minutes Asleep and Total Time in Bed.

ggplot(data = Daily_Sleep_v3) +
  geom_smooth(mapping = aes(x = TotalMinutesAsleep, y = TotalTimeInBed)) +
  geom_jitter(mapping = aes(x = TotalMinutesAsleep, y = TotalTimeInBed), color = "chocolate2", alpha = .25) +
  labs( title = "Relationship Between Minutes Asleep and Minutes in Bed", x = "Minutes Asleep", y = "Minutes in Bed")

Deductions

There is a very strong positive relationship between the time spent in bed and the time spent asleep. This makes logical sense, but it also points to an area of opportunity for Bellabeat. Sleep is quite important for good health, and we already know this group of users is not achieving the recommended 8 hours a night. Encouraging customers to set a time to be in bed should lead to a better overall health experience.
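
Because the two columns move together so closely, the gap between them may be more informative than either one alone. The sketch below derives minutes awake in bed and a sleep efficiency ratio from the six Daily_Sleep rows shown in the head() output; the new column names are my own and are not part of the data set.

```r
# Six rows copied from the head(Daily_Sleep) output shown earlier.
sleep_sample <- data.frame(
  TotalMinutesAsleep = c(327, 384, 412, 340, 700, 304),
  TotalTimeInBed = c(346, 407, 442, 367, 712, 320)
)

# Minutes spent in bed but not asleep.
sleep_sample$MinutesAwakeInBed <-
  sleep_sample$TotalTimeInBed - sleep_sample$TotalMinutesAsleep

# Share of time in bed that was actually spent asleep.
sleep_sample$SleepEfficiency <-
  sleep_sample$TotalMinutesAsleep / sleep_sample$TotalTimeInBed

sleep_sample
```

A user with a large MinutesAwakeInBed value is the kind of customer a bedtime reminder or wind-down feature would be aimed at.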

Figure 4

Finally, I will compare the relationship between Weight in Pounds and BMI.

ggplot(data = Weight_v2) +
  geom_smooth(mapping = aes(x = WeightPounds, y= BMI)) +
  geom_jitter(mapping = aes(x = WeightPounds, y = BMI), color = "chocolate2", alpha = .50)

Data Observation

While creating the code for this chart I noticed an outlier in the data skewing the graphic. I will remove this and then finish creating the visual.

Weight_v3 <- Weight_v2 %>%  
  filter(BMI < 47.54) ## Removing the outlier

Figure 4 (Finished)

ggplot(data = Weight_v3) +
  geom_smooth(mapping = aes(x = WeightPounds, y= BMI)) +
  geom_jitter(mapping = aes(x = WeightPounds, y = BMI), color = "chocolate2", alpha = .50) +
  labs( title = "Relationship Between Weight (lbs) and BMI", x = "Weight (lbs)", y = "BMI")

Note: It is very visible that this data frame contains much less data than the other two I was working with.

Deductions

There is a positive relationship between weight and BMI. A BMI of 25 or higher is considered overweight. With this in mind, it appears that individuals who weigh less are more likely to be in a healthy BMI category: all of the data points in a healthy BMI range fall in the lighter half of the data (i.e. 140 lbs or less). Bellabeat could encourage different activities that aid in weight loss to help its customers reach a healthy BMI.
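
The positive relationship is expected from the definition of BMI, which is weight in kilograms divided by height in meters squared, so for a given height BMI rises in direct proportion to weight. As a worked example, the sketch below uses the first row of the Weight output (52.6 kg, BMI 22.65) to back out the implied height and the weight at which that person would reach a BMI of 25. The variable names are made up, and the kg-to-lb factor of 2.20462 is the standard conversion.

```r
# First row of the Weight output shown earlier.
weight_kg <- 52.6
bmi <- 22.65

# BMI = kg / m^2, so height = sqrt(kg / BMI).
height_m <- sqrt(weight_kg / bmi)

# Standard conversion from kilograms to pounds.
weight_lb <- weight_kg * 2.20462

# Weight at which this height reaches the overweight cutoff of 25.
threshold_kg <- 25 * height_m^2

round(c(height_m = height_m, weight_lb = weight_lb,
        threshold_kg = threshold_kg), 2)
```

Since the data set has no height column, differences in height between users are one reason the points in Figure 4 do not fall on a single straight line.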

6. ACT

Business Task

I am to perform a deep-dive analysis on a FitBit data set to gauge marketing opportunities for Bellabeat, a high-tech manufacturer of health-focused products for women. I will analyze smart device usage to discover consumer trends. Then I will draw conclusions about how these trends can inform and improve the efficiency of Bellabeat’s marketing strategy for one of its products.

Product I Picked

Bellabeat app: The Bellabeat app provides users with health data related to their activity, sleep, stress, menstrual cycle, and mindfulness habits. This data can help users better understand their current habits and make healthy decisions. The Bellabeat app connects to their line of smart wellness products. I picked this product to aid in the marketing strategy as it tracks the most data and can be linked with many of their other smart devices.

Recommendations for Marketing

1. The Bellabeat app should offer an option to enable notifications telling consumers when they have been sedentary for an excessive period of time. Sometimes people don’t realize they haven’t been active in a while if they become fixated on something else. These reminders could put health and wellness back on the minds of Bellabeat customers. We know from Figure 1, the pie chart, that people are sedentary for around 80% of their day, which shows an opportunity to make improvements in this area.

2. Bellabeat could develop a series of workout videos to encourage users to spend at least 30 minutes a day at a moderate or very active intensity level. Not everyone is able to go to the gym every day. However, that does not have to stop them from being active. Home workouts can still provide a good level of intensity from the comfort of their own home. This would also raise activity on the app, as users would be on it for the entire workout.

3. The Bellabeat app should offer music and other sounds that aid in sleep. Adding an option to set a bedtime and a reminder for it could also prove useful. Figure 3 showed a strong positive relationship between the amount of time people spent in bed and the amount of time they spent asleep. The recommended amount of sleep per night is 8 hours, while this data set’s average is 7 hours. Since most people do not get the proper amount of sleep, different aids in this category could improve customers’ health and increase activity on the app.

4. Finally, Bellabeat should create an incentives program. This program will encourage daily tracking and positive behavior such as 10,000 steps daily, at least 30 minutes of a moderate or very intense workout, and 8 hours of sleep. Depending on the structure and awards of the program this could attract new customers as well as retain and encourage current ones.

Recommendations for Data Collection

Next time Bellabeat looks into data or the collection of data, a few more parameters should be considered to ensure the most accurate and relevant results are deduced. The data should meet the following parameters next time:

  • A larger sample size should be collected for more accurate results. A sample size that would give a confidence level of 95% with a low margin of error is optimal.

  • Ensure no forms of bias occur in the sampling during the data collection process.

  • Ensure the users who are being tracked are tracked across all the metrics. This would enable you to join the different data frames together to make more insightful findings.

  • Have the data be collected over a longer timeframe than a 2-month period.

  • Ensure the data is relevant and up to date. This data set is from 6 years ago, which calls its relevance to current times into question.
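
To make the sample size recommendation concrete, here is a minimal sketch using the standard formula for estimating a proportion in a large population, n = z² · p(1 − p) / e². The 5% margin of error and the worst-case proportion p = 0.5 are assumptions I chose for illustration; Bellabeat would set its own targets.

```r
# Assumed targets (illustrative, not from the case study).
confidence <- 0.95
margin_of_error <- 0.05
p <- 0.5  # worst-case proportion, gives the largest required n

# z-score for a two-sided 95% confidence level (about 1.96).
z <- qnorm(1 - (1 - confidence) / 2)

# Required sample size, rounded up to a whole person.
n_required <- ceiling(z^2 * p * (1 - p) / margin_of_error^2)

# Margin of error that a sample of 30 gives under the same assumptions.
moe_30 <- z * sqrt(p * (1 - p) / 30)

n_required
round(moe_30, 3)
```

Under these assumptions a sample of 30 leaves a margin of error of roughly 18 percentage points, which is why the findings in this report are best treated as directional.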