Larry D. Moore, CC BY-SA 4.0 https://creativecommons.org/licenses/by-sa/4.0, via Wikimedia Commons

Background

The following study was conducted in partial fulfillment of the requirements of the Google Data Analytics Career Certificate. All descriptions from this case study are taken from the Google Coursera course associated with this case study. I do not claim authorship of the cases study description, and I have included a link to the PDF here The case study is hypothetically conducted for Bellabeat, a women’s fitness tracker company.This case study involves preparing, cleaning, visualizing, and analyzing data collected from Fitbit users “in order to gain insight into how consumers use non-Bellabeat smart devices.” The data used in this case study comes from the following source: kaggle.com

Guiding Questions and Deliverables

Business Task and Guiding Questions

As was previously mentioned, the business task is to analyzing data collected from Fitbit users “in order to gain insight into how consumers use non-Bellabeat smart devices.” Specifically, I need to discover trends in smart device usage and use those trends to inform Bellabeat’s marketing strategy. I will need to chose one Bellabeat product to apply my insights to the end of the analysis.

The guidelines for the case study summarize my task through the following questions:

  • What are some trends in smart device usage?
  • How could these trends apply to Bellabeat customers?
  • How could these trends help influence Bellabeat marketing strategy?

Deliverables

The guidelines for the case study require me to accomplish the following deliverables:

  • A clear summary of the business task

  • A description of all data sources used

  • Documentation of any cleaning or manipulation of data

  • A summary of my analysis

  • Supporting visualizations and key findings

  • My top high-level content recommendations based on my analysis

The Data

The Date Source

The data packet contained several CSV files tracking Fitbit users’ data. The licensing and DOI of the data set are found here. The description of the data in Kaggle states:

“This data set generated by respondents to a distributed survey via Amazon Mechanical Turk between 03.12.2016-05.12.2016. Thirty eligible Fitbit users consented to the submission of personal tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring.” kaggle.com

Limitations

The limitations of the data are immediately apparent. According to Bellabeat’s website, they have 1.2 million devices synced, but this data set only contains the data from 30 Fitbit users. Even if every Bellabeat user owned synced devices so that the total number of customers was 600,000, I would need a sample size of 273 to have a confidence level of 90% and a margin of error of 5%. In addition to the risk of sampling bias, it should be noted that the data is dated. Given the changes brought by a global pandemic, I would say that the data is nearly irrelevant today because of its age, small size, and relatively short time frame. Another concern is that nothing is known about the demographics of the participants, and this makes it difficult to draw confident conclusions for a company that specializes in women’s health products. I also do not know what survey questions or methods Amazon Mechanical Turk used to collect the data or if those methods could have risked sampling bias. For the sake of the case study, I will overlook these concerns, but I would not present any recommendations to stakeholders based exclusively off this data in a real-world scenario.

The Cleaning Process

Several of the tables contained the data, but in different arrangements. I decided to explore the dailyActivity_merged, sleepDay_merged, weightLogInfo_merged, and the heartrate_seconds_merged CSV files so that I could get the maximum number of variables to work with. I renamed them as daily_activity_merged_raw, sleep_day_merged_raw, weight_log_info_merged_raw, and heart-rate_seconds_merged_raw, respectively.

To begin cleaning the data, I will first install the following packages:

install.packages("tidyverse") 
Installing package into ‘/cloud/lib/x86_64-pc-linux-gnu-library/4.1’
(as ‘lib’ is unspecified)
trying URL 'http://package-proxy/focal/src/contrib/tidyverse_1.3.1.tar.gz'
Content type 'application/x-tar' length 423528 bytes (413 KB)
==================================================
downloaded 413 KB

* installing *binary* package ‘tidyverse’ ...
* DONE (tidyverse)

The downloaded source packages are in
    ‘/tmp/Rtmp5xfOtx/downloaded_packages’
install.packages("skimr")
Installing package into ‘/cloud/lib/x86_64-pc-linux-gnu-library/4.1’
(as ‘lib’ is unspecified)
trying URL 'http://package-proxy/focal/src/contrib/skimr_2.1.3.tar.gz'
Content type 'application/x-tar' length 1220201 bytes (1.2 MB)
==================================================
downloaded 1.2 MB

* installing *binary* package ‘skimr’ ...
* DONE (skimr)

The downloaded source packages are in
    ‘/tmp/Rtmp5xfOtx/downloaded_packages’
install.packages("janitor")
Installing package into ‘/cloud/lib/x86_64-pc-linux-gnu-library/4.1’
(as ‘lib’ is unspecified)
trying URL 'http://package-proxy/focal/src/contrib/janitor_2.1.0.tar.gz'
Content type 'application/x-tar' length 248888 bytes (243 KB)
==================================================
downloaded 243 KB

* installing *binary* package ‘janitor’ ...
* DONE (janitor)

The downloaded source packages are in
    ‘/tmp/Rtmp5xfOtx/downloaded_packages’
install.packages("here")
Installing package into ‘/cloud/lib/x86_64-pc-linux-gnu-library/4.1’
(as ‘lib’ is unspecified)
trying URL 'http://package-proxy/focal/src/contrib/here_1.0.1.tar.gz'
Content type 'application/x-tar' length 52266 bytes (51 KB)
==================================================
downloaded 51 KB

* installing *binary* package ‘here’ ...
* DONE (here)

The downloaded source packages are in
    ‘/tmp/Rtmp5xfOtx/downloaded_packages’

Now, I will load the packages:

library("tidyverse")
Registered S3 methods overwritten by 'dbplyr':
  method         from
  print.tbl_lazy     
  print.tbl_sql      
── Attaching packages ────────────────────────────────── tidyverse 1.3.1 ──
✓ ggplot2 3.3.5     ✓ purrr   0.3.4
✓ tibble  3.1.6     ✓ dplyr   1.0.7
✓ tidyr   1.1.4     ✓ stringr 1.4.0
✓ readr   2.1.0     ✓ forcats 0.5.1
── Conflicts ───────────────────────────────────── tidyverse_conflicts() ──
x dplyr::filter() masks stats::filter()
x dplyr::lag()    masks stats::lag()
library("skimr")
library("janitor")

Attaching package: ‘janitor’

The following objects are masked from ‘package:stats’:

    chisq.test, fisher.test
library("here")
here() starts at /cloud/project
library("lubridate")

Attaching package: ‘lubridate’

The following objects are masked from ‘package:base’:

    date, intersect, setdiff, union
library("ggplot2")
library("dplyr")

Next, I will load the CSV files for analysis and assign variables:

activity_df <- read_csv("daily_activity_merged_raw.csv")
Rows: 940 Columns: 15
── Column specification ───────────────────────────────────────────────────
Delimiter: ","
chr  (1): ActivityDate
dbl (14): Id, TotalSteps, TotalDistance, TrackerDistance, LoggedActivit...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
sleep_df <- read_csv("sleep_day_merged_raw.csv")
Rows: 413 Columns: 5
── Column specification ───────────────────────────────────────────────────
Delimiter: ","
chr (1): SleepDay
dbl (4): Id, TotalSleepRecords, TotalMinutesAsleep, TotalTimeInBed

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
weight_df <- read_csv("weight_log_info_merged_raw.csv")
Rows: 67 Columns: 8
── Column specification ───────────────────────────────────────────────────
Delimiter: ","
chr (1): Date
dbl (6): Id, WeightKg, WeightPounds, Fat, BMI, LogId
lgl (1): IsManualReport

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
heart_rate_df <- read_csv("heart_rate_seconds_merged_raw.csv")
Rows: 2483658 Columns: 3
── Column specification ───────────────────────────────────────────────────
Delimiter: ","
chr (1): Time
dbl (2): Id, Value

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

It’s time to take a peak at the contents of the data frames:

head(activity_df)
head(sleep_df)
head(weight_df)
head(heart_rate_df)

Let’s get some information on the size of the data frames using skimr:

skim_without_charts(activity_df)
── Data Summary ────────────────────────
                           Values     
Name                       activity_df
Number of rows             940        
Number of columns          15         
_______________________               
Column type frequency:                
  character                1          
  numeric                  14         
________________________              
Group variables            None       

── Variable type: character ───────────────────────────────────────────────
  skim_variable n_missing complete_rate   min   max empty n_unique
1 ActivityDate          0             1     8     9     0       31
  whitespace
1          0

── Variable type: numeric ─────────────────────────────────────────────────
   skim_variable            n_missing complete_rate    mean      sd
 1 Id                               0             1 4.86e+9 2.42e+9
 2 TotalSteps                       0             1 7.64e+3 5.09e+3
 3 TotalDistance                    0             1 5.49e+0 3.92e+0
 4 TrackerDistance                  0             1 5.48e+0 3.91e+0
 5 LoggedActivitiesDistance         0             1 1.08e-1 6.20e-1
 6 VeryActiveDistance               0             1 1.50e+0 2.66e+0
 7 ModeratelyActiveDistance         0             1 5.68e-1 8.84e-1
 8 LightActiveDistance              0             1 3.34e+0 2.04e+0
 9 SedentaryActiveDistance          0             1 1.61e-3 7.35e-3
10 VeryActiveMinutes                0             1 2.12e+1 3.28e+1
11 FairlyActiveMinutes              0             1 1.36e+1 2.00e+1
12 LightlyActiveMinutes             0             1 1.93e+2 1.09e+2
13 SedentaryMinutes                 0             1 9.91e+2 3.01e+2
14 Calories                         0             1 2.30e+3 7.18e+2
           p0           p25     p50     p75    p100
 1 1503960366 2320127002    4.45e+9 6.96e+9 8.88e+9
 2          0       3790.   7.41e+3 1.07e+4 3.60e+4
 3          0          2.62 5.24e+0 7.71e+0 2.80e+1
 4          0          2.62 5.24e+0 7.71e+0 2.80e+1
 5          0          0    0       0       4.94e+0
 6          0          0    2.10e-1 2.05e+0 2.19e+1
 7          0          0    2.40e-1 8.00e-1 6.48e+0
 8          0          1.95 3.36e+0 4.78e+0 1.07e+1
 9          0          0    0       0       1.10e-1
10          0          0    4   e+0 3.2 e+1 2.1 e+2
11          0          0    6   e+0 1.9 e+1 1.43e+2
12          0        127    1.99e+2 2.64e+2 5.18e+2
13          0        730.   1.06e+3 1.23e+3 1.44e+3
14          0       1828.   2.13e+3 2.79e+3 4.9 e+3
skim_without_charts(sleep_df)
── Data Summary ────────────────────────
                           Values  
Name                       sleep_df
Number of rows             413     
Number of columns          5       
_______________________            
Column type frequency:             
  character                1       
  numeric                  4       
________________________           
Group variables            None    

── Variable type: character ───────────────────────────────────────────────
  skim_variable n_missing complete_rate   min   max empty n_unique
1 SleepDay              0             1    20    21     0       31
  whitespace
1          0

── Variable type: numeric ─────────────────────────────────────────────────
  skim_variable      n_missing complete_rate          mean      sd
1 Id                         0             1 5000979403.   2.06e+9
2 TotalSleepRecords          0             1          1.12 3.46e-1
3 TotalMinutesAsleep         0             1        419.   1.18e+2
4 TotalTimeInBed             0             1        459.   1.27e+2
          p0        p25        p50        p75       p100
1 1503960366 3977333714 4702921684 6962181067 8792009665
2          1          1          1          1          3
3         58        361        433        490        796
4         61        403        463        526        961
skim_without_charts(weight_df)
── Data Summary ────────────────────────
                           Values   
Name                       weight_df
Number of rows             67       
Number of columns          8        
_______________________             
Column type frequency:              
  character                1        
  logical                  1        
  numeric                  6        
________________________            
Group variables            None     

── Variable type: character ───────────────────────────────────────────────
  skim_variable n_missing complete_rate   min   max empty n_unique
1 Date                  0             1    19    21     0       56
  whitespace
1          0

── Variable type: logical ─────────────────────────────────────────────────
  skim_variable  n_missing complete_rate  mean count           
1 IsManualReport         0             1 0.612 TRU: 41, FAL: 26

── Variable type: numeric ─────────────────────────────────────────────────
  skim_variable n_missing complete_rate    mean            sd      p0
1 Id                    0        1      7.01e 9 1950321944.   1.50e 9
2 WeightKg              0        1      7.20e 1         13.9  5.26e 1
3 WeightPounds          0        1      1.59e 2         30.7  1.16e 2
4 Fat                  65        0.0299 2.35e 1          2.12 2.2 e 1
5 BMI                   0        1      2.52e 1          3.07 2.15e 1
6 LogId                 0        1      1.46e12  782994784.   1.46e12
      p25     p50     p75    p100
1 6.96e 9 6.96e 9 8.88e 9 8.88e 9
2 6.14e 1 6.25e 1 8.50e 1 1.34e 2
3 1.35e 2 1.38e 2 1.88e 2 2.94e 2
4 2.28e 1 2.35e 1 2.42e 1 2.5 e 1
5 2.40e 1 2.44e 1 2.56e 1 4.75e 1
6 1.46e12 1.46e12 1.46e12 1.46e12
skim_without_charts(heart_rate_df)
── Data Summary ────────────────────────
                           Values       
Name                       heart_rate_df
Number of rows             2483658      
Number of columns          3            
_______________________                 
Column type frequency:                  
  character                1            
  numeric                  2            
________________________                
Group variables            None         

── Variable type: character ───────────────────────────────────────────────
  skim_variable n_missing complete_rate   min   max empty n_unique
1 Time                  0             1    19    21     0   961274
  whitespace
1          0

── Variable type: numeric ─────────────────────────────────────────────────
  skim_variable n_missing complete_rate         mean           sd
1 Id                    0             1 5513764629.  1950223761. 
2 Value                 0             1         77.3         19.4
          p0        p25        p50        p75       p100
1 2022484408 4388161847 5553957443 6962181067 8877689391
2         36         63         73         88        203

Preliminary Observations

Some observations to note at this point:

  • The data sets are in the long format.

  • These data sets have some common column names, so they can be combined.

  • These data sets have different numbers of rows.

  • The IDs are being treated as numbers, not strings.

  • The measurement for distance is unknown, but we will assume it is in km. I am basing this on the following discussion although I would confirm this with the data collection agency if I was actually employed by Bellabeat.

  • The names of the columns need to be formatted into snake case (my personal taste).

  • The weight_df has a relatively small number of rows and is missing a lot of data in the fat column.

  • The heart_rate_df data frame has the largest number of rows of all the data frames.

Cleaning Tasks for activity_df

Here is a summary of the cleaning tasks I wish to perform on this data frame:

  • Convert column names to snake case for easy reading

  • Convert IDs to strings so they can be counted

  • Remove duplicate and empty rows and columns

  • Remove columns with a single value

  • Ensure that the total distance is the sum of the subcategories of distance

  • Create a total column for active minutes

  • Return the day of the week as a string for each date

Here is the code I used to clean the data:

activity_cleaned <- activity_df %>% 
  clean_names() %>% #convert to snake case
  distinct() %>% #remove duplicates
  remove_empty(c("rows", "cols")) %>% #remove empty rows or columns
  remove_constant() %>% #remove columns with a single value
  mutate(total_distance_new = #make sure total distance is calculated properly
           (very_active_distance + 
                                 moderately_active_distance + 
                                 light_active_distance + 
                                 sedentary_active_distance)) %>% 
  mutate(total_active_minutes = 
           very_active_minutes + fairly_active_minutes + #add total active minutes column
           lightly_active_minutes + sedentary_minutes) %>% 
  mutate(date_new = as.Date(mdy(activity_date))) %>% #standardize date
  mutate(day = wday(date_new, TRUE)) %>% #convert date to weekday string
  mutate(id = as.character(id)) %>% #convert id to string
  select(id, date_new, day, total_steps, total_distance_new, total_active_minutes, sedentary_minutes,
         lightly_active_minutes, fairly_active_minutes, very_active_minutes,
         calories) 
Cleaning Tasks for sleep_df

Here is a summary of the cleaning tasks I wish to perform on this data frame:

  • Perform a similar cleaning process used on activity_df

  • Create a lying_awake column that will show how much time participants spend awake in bed

  • Create a column that will round the total sleeping minutes to integer hours for better grouping

Here is the code I used to clean the data:

sleep_cleaned <- sleep_df %>% 
  clean_names() %>% #convert to snake case
  distinct() %>% #remove duplicates
  remove_empty(c("rows", "cols")) %>% #remove empty rows or columns
  remove_constant() %>% #remove columns with a single value
  mutate(date_new = as.Date(mdy_hms(sleep_day))) %>% #standardize date
  mutate(day = wday(date_new, TRUE)) %>% #convert date to weekday string
  mutate(id = as.character(id)) %>% # convert id to string
  mutate(lying_awake = total_time_in_bed - total_minutes_asleep) %>% #how much time is spent awake in bed?
  mutate(hours_asleep = total_minutes_asleep %/% 60) %>% #round minutes to hours
  select(id, date_new, day, total_sleep_records, total_minutes_asleep, hours_asleep, lying_awake, total_time_in_bed)
Cleaning Tasks for weight_df

Here is a summary of the cleaning tasks I wish to perform on this data frame:

  • Perform a similar cleaning process used on activity_df

  • Limit measurements in the metric system only

  • Use the BMI formula to use the weight to calculate a height column in meters in case this reveals anything

Here is the code I used to clean the data:

weight_cleaned <- weight_df %>% 
  clean_names() %>% #convert to snake case
  distinct() %>% #remove duplicates
  remove_empty(c("rows", "cols")) %>% #remove empty rows or columns
  remove_constant() %>%   #remove columns with a single value
  mutate(height_m = (weight_kg / bmi ) ^ 0.5) %>% # use weight and bmi to calculate height
  mutate(date_new = as.Date(mdy_hms(date))) %>% 
  mutate(day = wday(date_new, TRUE)) %>% # convert date to weekday string
  mutate(id = as.character(id)) %>% # convert id to string
  select(log_id, id, date_new, day, weight_kg, height_m, bmi, is_manual_report) 
Cleaning Tasks for heart_rate_df

Here is a summary of the cleaning tasks I wish to perform on this data frame:

  • Perform a similar cleaning process used on activity_df

  • Rename the value column to make the data easier to understand at glance

Here is the code I used to clean the data:

heart_rate_cleaned <- heart_rate_df %>% 
  clean_names() %>% #convert to snake case
  distinct() %>% #remove duplicates
  remove_empty(c("rows", "cols")) %>% #remove empty rows or columns
  remove_constant() %>%  #remove columns with a single value
  rename("heart_rate" = value) %>%  #clarify value of this column
  mutate(id = as.character(id)) %>% # convert id to string
  select(everything())
Combining Data Frames

After cleaning the previous data frames, I am interested in combining some of them for ease of analysis. I will use an inner join to merge the daily activity and sleep data frames:

activity_sleep_combined <- inner_join(activity_cleaned, sleep_cleaned) 
Joining, by = c("id", "date_new", "day")
Confirming the Cleaning Process

I will now confirm that the frames have been cleaned properly using skimr:

skim_without_charts(activity_cleaned)
── Data Summary ────────────────────────
                           Values          
Name                       activity_cleaned
Number of rows             940             
Number of columns          11              
_______________________                    
Column type frequency:                     
  character                1               
  Date                     1               
  factor                   1               
  numeric                  8               
________________________                   
Group variables            None            

── Variable type: character ───────────────────────────────────────────────
  skim_variable n_missing complete_rate   min   max empty n_unique
1 id                    0             1    10    10     0       33
  whitespace
1          0

── Variable type: Date ────────────────────────────────────────────────────
  skim_variable n_missing complete_rate min        max        median    
1 date_new              0             1 2016-04-12 2016-05-12 2016-04-26
  n_unique
1       31

── Variable type: factor ──────────────────────────────────────────────────
  skim_variable n_missing complete_rate ordered n_unique
1 day                   0             1 TRUE           7
  top_counts                            
1 Tue: 152, Wed: 150, Thu: 147, Fri: 126

── Variable type: numeric ─────────────────────────────────────────────────
  skim_variable          n_missing complete_rate    mean      sd    p0
1 total_steps                    0             1 7638.   5087.       0
2 total_distance_new             0             1    5.41    3.94     0
3 total_active_minutes           0             1 1219.    266.       2
4 sedentary_minutes              0             1  991.    301.       0
5 lightly_active_minutes         0             1  193.    109.       0
6 fairly_active_minutes          0             1   13.6    20.0      0
7 very_active_minutes            0             1   21.2    32.8      0
8 calories                       0             1 2304.    718.       0
      p25     p50      p75    p100
1 3790.   7406.   10727    36019  
2    2.54    5.17     7.64    28.0
3  990.   1440     1440     1440  
4  730.   1058.    1230.    1440  
5  127     199      264      518  
6    0       6       19      143  
7    0       4       32      210  
8 1828.   2134     2793.    4900  
skim_without_charts(sleep_cleaned)
── Data Summary ────────────────────────
                           Values       
Name                       sleep_cleaned
Number of rows             410          
Number of columns          8            
_______________________                 
Column type frequency:                  
  character                1            
  Date                     1            
  factor                   1            
  numeric                  5            
________________________                
Group variables            None         

── Variable type: character ───────────────────────────────────────────────
  skim_variable n_missing complete_rate   min   max empty n_unique
1 id                    0             1    10    10     0       24
  whitespace
1          0

── Variable type: Date ────────────────────────────────────────────────────
  skim_variable n_missing complete_rate min        max        median    
1 date_new              0             1 2016-04-12 2016-05-12 2016-04-27
  n_unique
1       31

── Variable type: factor ──────────────────────────────────────────────────
  skim_variable n_missing complete_rate ordered n_unique
1 day                   0             1 TRUE           7
  top_counts                        
1 Wed: 66, Tue: 65, Thu: 64, Fri: 57

── Variable type: numeric ─────────────────────────────────────────────────
  skim_variable        n_missing complete_rate   mean      sd    p0   p25
1 total_sleep_records          0             1   1.12   0.347     1    1 
2 total_minutes_asleep         0             1 419.   119.       58  361 
3 hours_asleep                 0             1   6.48   2.00      0    6 
4 lying_awake                  0             1  39.3   46.7       0   17 
5 total_time_in_bed            0             1 458.   127.       61  404.
    p50   p75  p100
1   1       1     3
2 432.    490   796
3   7       8    13
4  25.5    40   371
5 463     526   961
skim_without_charts(weight_cleaned)
── Data Summary ────────────────────────
                           Values        
Name                       weight_cleaned
Number of rows             67            
Number of columns          8             
_______________________                  
Column type frequency:                   
  character                1             
  Date                     1             
  factor                   1             
  logical                  1             
  numeric                  4             
________________________                 
Group variables            None          

── Variable type: character ───────────────────────────────────────────────
  skim_variable n_missing complete_rate   min   max empty n_unique
1 id                    0             1    10    10     0        8
  whitespace
1          0

── Variable type: Date ────────────────────────────────────────────────────
  skim_variable n_missing complete_rate min        max        median    
1 date_new              0             1 2016-04-12 2016-05-12 2016-04-27
  n_unique
1       31

── Variable type: factor ──────────────────────────────────────────────────
  skim_variable n_missing complete_rate ordered n_unique
1 day                   0             1 TRUE           7
  top_counts                        
1 Mon: 12, Wed: 12, Thu: 11, Sun: 10

── Variable type: logical ─────────────────────────────────────────────────
  skim_variable    n_missing complete_rate  mean count           
1 is_manual_report         0             1 0.612 TRU: 41, FAL: 26

── Variable type: numeric ─────────────────────────────────────────────────
  skim_variable n_missing complete_rate    mean            sd      p0
1 log_id                0             1 1.46e12 782994784.    1.46e12
2 weight_kg             0             1 7.20e 1        13.9   5.26e 1
3 height_m              0             1 1.69e 0         0.111 1.52e 0
4 bmi                   0             1 2.52e 1         3.07  2.15e 1
      p25     p50     p75    p100
1 1.46e12 1.46e12 1.46e12 1.46e12
2 6.14e 1 6.25e 1 8.50e 1 1.34e 2
3 1.60e 0 1.60e 0 1.83e 0 1.83e 0
4 2.40e 1 2.44e 1 2.56e 1 4.75e 1
skim_without_charts(heart_rate_cleaned)
── Data Summary ────────────────────────
                           Values            
Name                       heart_rate_cleaned
Number of rows             2483658           
Number of columns          3                 
_______________________                      
Column type frequency:                       
  character                2                 
  numeric                  1                 
________________________                     
Group variables            None              

── Variable type: character ───────────────────────────────────────────────
  skim_variable n_missing complete_rate   min   max empty n_unique
1 id                    0             1    10    10     0       14
2 time                  0             1    19    21     0   961274
  whitespace
1          0
2          0

── Variable type: numeric ─────────────────────────────────────────────────
  skim_variable n_missing complete_rate  mean    sd    p0   p25   p50   p75
1 heart_rate            0             1  77.3  19.4    36    63    73    88
   p100
1   203
skim_without_charts(activity_sleep_combined)
── Data Summary ────────────────────────
                           Values                 
Name                       activity_sleep_combined
Number of rows             410                    
Number of columns          16                     
_______________________                           
Column type frequency:                            
  character                1                      
  Date                     1                      
  factor                   1                      
  numeric                  13                     
________________________                          
Group variables            None                   

── Variable type: character ───────────────────────────────────────────────
  skim_variable n_missing complete_rate   min   max empty n_unique
1 id                    0             1    10    10     0       24
  whitespace
1          0

── Variable type: Date ────────────────────────────────────────────────────
  skim_variable n_missing complete_rate min        max        median    
1 date_new              0             1 2016-04-12 2016-05-12 2016-04-27
  n_unique
1       31

── Variable type: factor ──────────────────────────────────────────────────
  skim_variable n_missing complete_rate ordered n_unique
1 day                   0             1 TRUE           7
  top_counts                        
1 Wed: 66, Tue: 65, Thu: 64, Fri: 57

── Variable type: numeric ─────────────────────────────────────────────────
   skim_variable          n_missing complete_rate    mean       sd       p0
 1 total_steps                    0             1 8515.   4157.     17     
 2 total_distance_new             0             1    5.98    3.02    0.0100
 3 total_active_minutes           0             1  972.    168.      2     
 4 sedentary_minutes              0             1  712.    166.      0     
 5 lightly_active_minutes         0             1  217.     86.7     2     
 6 fairly_active_minutes          0             1   17.9    22.4     0     
 7 very_active_minutes            0             1   25.0    36.2     0     
 8 calories                       0             1 2389.    758.    257     
 9 total_sleep_records            0             1    1.12    0.347   1     
10 total_minutes_asleep           0             1  419.    119.     58     
11 hours_asleep                   0             1    6.48    2.00    0     
12 lying_awake                    0             1   39.3    46.7     0     
13 total_time_in_bed              0             1  458.    127.     61     
       p25     p50      p75    p100
 1 5189.   8913    11370.   22770  
 2    3.58    6.25     7.92    17.5
 3  906.    983     1042     1398  
 4  631.    717      783.    1265  
 5  158     208      263      518  
 6    0      11       26.8    143  
 7    0       9       38      210  
 8 1841    2207     2920     4900  
 9    1       1        1        3  
10  361     432.     490      796  
11    6       7        8       13  
12   17      25.5     40      371  
13  404.    463      526      961  

Analyzing the Data

Observations from activity_cleaned

When looking at the information for this data frame, I was initially confused because there were 33 unique IDs. The description of the data said that there were 30 participants in this study. I visited on of the discussions on Kaggle in which one user commented that some users may be logging data with more than one device. I will assume that this is the correct explanation.

Observations from activity_cleaned

Although it is scientifically accepted that physical activity will result in burned calories, for the sake of testing our data, lets compare daily steps and calories:

ggplot(data = activity_cleaned, aes(x = calories, y = total_steps, color = total_steps)) + 
  geom_point() +  geom_smooth() + labs(title = "Steps vs. Calories", x = "Calories", y = "Total Steps", color = "Total Steps")
`geom_smooth()` using method = 'loess' and formula 'y ~ x'

From the chart, we can see a positive correlation between calories and steps. Let’s compare calories with total distance:

ggplot(data = activity_cleaned, aes(x = calories, y = total_distance_new, color = total_distance_new)) + 
  geom_point() +  geom_smooth() + labs(title = "Distance vs. Calories", x = "Calories", y = "Total Distance (Km)", color = "Km")
`geom_smooth()` using method = 'loess' and formula 'y ~ x'

Note that I am assuming that the distance is in Km per a Kaggle discussion. Since there is a positive correlation between steps and calories, and distance and calories, I want to know which days have the most steps and distances logged. We will begin with the steps. We see that the average total steps is slightly lower on the weekends:

ggplot(data = activity_cleaned) + geom_point(mapping = aes(x = day, y = total_steps, color = total_steps)) + geom_boxplot(mapping = aes(x = day, y = total_steps)) + labs(title = "Total Steps", subtitle = "Frequency Distributions by Day of the Week (4/12/16 - 5/12/16)", x = "Day", y = " Total Steps", colour = "Steps")

The same holds true for the average of total distance:

ggplot(data = activity_cleaned) + geom_point(mapping = aes(x = day, y = total_distance_new, color = total_distance_new)) + geom_boxplot(mapping = aes(x = day, y = total_distance_new)) + labs(title = "Total Distance", subtitle = "Frequency Distributions by Day of the Week (4/12/16 - 5/12/16)", x = "Day", y = " Total Distance (Km)", colour = "Km")

Although we might assume that calories burned will also be lower on the weekends, lets make sure:

ggplot(data = activity_cleaned) + geom_point(mapping = aes(x = day, y = calories, color = calories)) + geom_boxplot(mapping = aes(x = day, y = calories)) + labs(title = "Calories Burned", subtitle = "Frequency Distributions by Day of the Week (4/12/16 - 5/12/16)", x = "Day", y = " Calories", colour = "Calories")

Our hypothesis is confirmed although it should be noted that Thursday’s average seems to be slightly lower than Saturday’s.

Next, I will plot the active minutes distributions by weekday. We will consider the total active minutes by day of the week, and then we will consider the different levels of activity within these totals.

ggplot(data = activity_cleaned) + geom_point(mapping = aes(x = day, y = total_active_minutes, color = total_active_minutes), color = "blue") + geom_boxplot(mapping = aes(x = day, y = total_active_minutes)) + labs(title = "Total Active Minutes", subtitle = "Frequency Distributions by Day of the Week (4/12/16 - 5/12/16)", x = "Day", y = " Total Active Minutes")

I found that the total active minutes logged by Fitbit users tends to be fairly stable throughout the week with the exception of Thursday. The distributions for the weekends are also wider than the weekdays.

ggplot(data = activity_cleaned) + geom_point(mapping = aes(x = day, y = sedentary_minutes, color = sedentary_minutes), color = "green") + geom_boxplot(mapping = aes(x = day, y = sedentary_minutes)) + labs(title = "Sedentary Minutes", subtitle = "Frequency Distributions by Day of the Week (4/12/16 - 5/12/16)", x = "Day", y = "Sedentary Minutes")

The amount of sedentary minutes is also pretty uniform, again with the exception of Thursday. Saturday has one of the highest averages of sedentary minutes, while Thursday has one of the lowest.

I noticed is that Saturdays had the highest average and maximum of lightly active minutes while Sunday had one of the lowest.

ggplot(data = activity_cleaned) + geom_point(mapping = aes(x = day, y = lightly_active_minutes, color = lightly_active_minutes), color = "violet") + geom_boxplot(mapping = aes(x = day, y = lightly_active_minutes)) + labs(title = "Lightly Active Minutes", subtitle = "Frequency Distributions by Day of the Week (4/12/16 - 5/12/16)", x = "Day", y = "Lightly Active Minutes")

I noticed that the average number of fairly active minutes logged was lower on the weekends.

ggplot(data = activity_cleaned) + geom_point(mapping = aes(x = day, y = fairly_active_minutes, color = fairly_active_minutes), color = "purple") + geom_boxplot(mapping = aes(x = day, y = fairly_active_minutes)) + labs(title = "Fairly Active Minutes", subtitle = "Frequency Distributions by Day of the Week (4/12/16 - 5/12/16)", x = "Day", y = "Fairly Active Minutes")

The very active minutes seem to peak Mon - Wed, but if you look at the outliers, it seems that some people do their exercise on the weekends because Sunday and Saturday have a few outliers while the maximums for the weekdays are lower.

ggplot(data = activity_cleaned) + geom_point(mapping = aes(x = day, y = very_active_minutes, color = very_active_minutes), color = "orange") + geom_boxplot(mapping = aes(x = day, y = very_active_minutes)) + labs(title = "Very Active Minutes", subtitle = "Frequency Distributions by Day of the Week (4/12/16 - 5/12/16)", x = "Day", y = " Very Active Minutes")

Observations from sleep_cleaned

As with the daily activity, we can look at how sleeping hours as well as hours spent awake in bed compare between days of the week. Note that this data frame is much smaller than the activity data frame. Let’s begin with minutes asleep:

ggplot(data = sleep_cleaned) + geom_point(mapping = aes(x = day, y = total_minutes_asleep, color = total_minutes_asleep), color = "blue") + geom_boxplot(mapping = aes(x = day, y = total_minutes_asleep)) + labs(title = "Daily Sleep Averages", subtitle = "Frequency Distributions by Day of the Week (4/12/16 - 5/12/16)", x = "Day", y = "Total Minutes Asleep")

From this plot, it appears that Sunday is the most popular sleep day. Since I rounded the minutes into hours during the cleaning process, let’s see what this looks like in hours using a histogram:

ggplot(data = sleep_cleaned) + geom_col(mapping = aes(x = day, y = hours_asleep, alpha = hours_asleep), color = "purple") + labs(title = "Daily Sleep Totals", subtitle = "A Histogram of Data Logged 4/12/16 - 5/12/16", x = "Day", y = "Hours Asleep", alpha = "Hours (Rounded)")

While at first it appears that the histogram contradicts the previous graph, the fill shows that a lot of sleep logged on on weekends is longer, but more people are logging more sleep throughout the week in smaller amounts.

Since the data frame contains total minutes asleep and total time in bed, I want to see how much time people are spending in bed awake:

ggplot(data = sleep_cleaned) + geom_point(mapping = aes(x = day, y = lying_awake, color = lying_awake), color = "green") + geom_boxplot(mapping = aes(x = day, y = lying_awake)) + labs(title = "Total Time Awake in Bed", subtitle = "Frequency Distributions by Day of the Week (4/12/16 - 5/12/16)", x = "Day", y = "Waking Minutes")

We can see that the distributions are pretty similar from day to day, but Friday, Saturday, and Sunday also have some outliers. It looks like a few people logged long minutes awake in bed on the weekends. This could be the reason why the scatter plot and histogram differ appear differently at first glance. Let’s check this data with a histogram:

ggplot(data = sleep_cleaned) + geom_col(mapping = aes(x = day, y = lying_awake, alpha = lying_awake), color = "orange") + labs(title = "Waking Minutes Totals", subtitle = "A Histogram of Data Logged 4/12/16 - 5/12/16", x = "Day", y = "Waking Minutes", alpha = "Minutes")

From this we can see that a lot of people are lying awake in bed on off days while some prefer to get a lot of sleep.

There is one more way that I want to visualize this data:

ggplot(data = sleep_cleaned) + geom_col(mapping = aes(x = day, y = total_time_in_bed, fill = "red")) +
  geom_col(mapping = aes(x = day, y = lying_awake, fill = "blue"))  + labs(title = "Total Time in Bed Breakdown", subtitle = "Awake vs. Asleep 4/12/16 - 5/12/16", x = "Day", y = "Total Minutes", fill = "Minutes") + scale_fill_hue(labels = c("Waking Minutes", "Total Minutes in Bed"))

This visualization is perhaps the clearest way to understand the proportional relationship between waking minutes and total time in bed. We can see that Fri - Sun has proportionally large portions of time spent awake in bed.

Observations from weight_cleaned

The weight cleaned data frame has a very small number of rows compared to the other data frames. Instead of focusing exclusively on the numbers logged, I wanted to think about the behavior of the participants since so few of them logged their weights. One of the columns contains Booleans to the statement “is manual report.” I am curious to see how the weights compare when logged in manually to when they are logged in automatically. Let’s take a look at the following plot:

ggplot(data = weight_cleaned) + geom_violin(mapping = aes(x = is_manual_report, y = weight_kg)) + labs(title = "Manual vs. Automatic Logging of Weight", subtitle = "Why are Manually Recorded Weights Lower?", x = "Manual Log?", y = "Weight (kg)")

While I can’t make a definitive conclusion from so small a sample, I would want to explore user bias that may skew the data they enter. It appears that weight of people who enter weight manually is much lower than that of people who are logged automatically.

Observations from heart_rate_cleaned

The heart rate data frame contains heart rate information for each user by the second, and this data frame contains 2,483,658 rows. Instead of looking at daily trends, I want to focus on the high and low heart rates to try to guess how people are using their Fitbits and if I see anything abnormal. Looking at an article published by Cleveland Clinic, I can see that for normal adults, 60 to 100 beats per minute is considered normal, so I will filter out the extremes. I will create variables to help filter the data:

heart_rate_high <- heart_rate_cleaned %>% subset(heart_rate >= 100)
skim(heart_rate_high)
── Data Summary ────────────────────────
                           Values         
Name                       heart_rate_high
Number of rows             289007         
Number of columns          3              
_______________________                   
Column type frequency:                    
  character                1              
  numeric                  2              
________________________                  
Group variables            None           

── Variable type: character ───────────────────────────────────────────────
  skim_variable n_missing complete_rate   min   max empty n_unique
1 time                  0             1    19    21     0   246749
  whitespace
1          0

── Variable type: numeric ─────────────────────────────────────────────────
  skim_variable n_missing complete_rate        mean           sd         p0
1 id                    0             1 5881498937. 2023965701.  2022484408
2 heart_rate            0             1        116.         17.1        100
         p25        p50        p75       p100 hist 
1 4388161847 6117666160 7007744171 8877689391 ▃▆▇▇▆
2        104        110        122        203 ▇▂▁▁▁
heart_rate_low <- heart_rate_cleaned %>% subset(heart_rate <= 60)
skim(heart_rate_low)
── Data Summary ────────────────────────
                           Values        
Name                       heart_rate_low
Number of rows             435480        
Number of columns          3             
_______________________                  
Column type frequency:                   
  character                1             
  numeric                  2             
________________________                 
Group variables            None          

── Variable type: character ───────────────────────────────────────────────
  skim_variable n_missing complete_rate   min   max empty n_unique
1 time                  0             1    19    21     0   328154
  whitespace
1          0

── Variable type: numeric ─────────────────────────────────────────────────
  skim_variable n_missing complete_rate         mean            sd
1 id                    0             1 5517483664.  1627398472.  
2 heart_rate            0             1         55.7          3.82
          p0        p25        p50        p75       p100 hist 
1 2022484408 4388161847 5553957443 5577150313 8877689391 ▁▆▇▁▂
2         36         54         57         59         60 ▁▁▁▃▇

Now we can plot this data:

ggplot(data = heart_rate_high) + geom_bar(mapping = aes(x = heart_rate), fill = "red") + labs(title = "High Heart Rate", subtitle = "Frequences 4/12/16 - 5/12/16", x = "Heart Rates (BPM)", y = "Frequency") + annotate("text", x = 150, y = 10000, label = "Heart Rate >= 100 BPM", color = "red", fontface = "bold", size = 6)

ggplot(data = heart_rate_low) + geom_bar(mapping = aes(x = heart_rate), fill = "blue") + labs(title = "Low Heart Rate", subtitle = "Frequences 4/12/16 - 5/12/16", x = "Heart Rates (BPM)", y = "Frequency") + annotate("text", x = 43, y = 40000, label = "Heart Rate <= 60 BPM", color = "blue", fontface = "bold", size = 6)

From this data, I can see that both high and low heart rates have occurred amount the participants. This means that the participants are using their Fitbits for sleep and intense physical activity. Some of the participants may be older, or they may be athletes, so I cannot say for certain if the heart rates are at dangerous levels for the individuals who logged them. I would want my fitness tracker to alert me if my heart rate was at an extreme level, like 43 bpm or 180 bpm.

Observations from activity_sleep_combined

Finally, I want to see if there is any connection between sleep and calories to try to discover trends in the data logged by Fitbit users:

ggplot(data = activity_sleep_combined) + 
  geom_col(mapping = aes(x = hours_asleep, y = calories, alpha = calories), fill = "red") + labs(title = "Sleep and Calories", subtitle = "Totals 4/12/16 - 5/12/16", x = "Hours Asleep", y = "Calories", alpha = "Calories")

From this plot, we can see that most calories burned, including the high-level burns are logged by users who get between 6 and 8 hours of sleep a night. This might be because this is how much most people sleep on average, but we know scientifically that sleep is essential for overall wellness.

Conclusions and Recommendations

Summary of Findings

By visualizing the cleaned data, I made the following observations:

  • People are logging less activity on the weekends with the exception of a few outliers who logged intense workouts.

  • People log the most waking minutes in bed on Fri - Sun.

  • People are using their Fitbits to track activity more than they use them to keep track of weight, sleep, and heart rate (based on the total number of unique IDs for each data set).

  • People who log their weights manually tend to log lighter weights than those who don’t.

  • Some people are logging extreme heart rates, both fast and slow. These could be dangerous depending on the activity, age, and fitness-level of the user.

Recommendations

For my recommendations, I must choose one Bellabeat product to apply them too. The case study describes the following products thusly:

  • Bellabeat app: The Bellabeat app provides users with health data related to their activity, sleep, stress, menstrual cycle, and mindfulness habits. This data can help users better understand their current habits and make healthy decisions. The Bellabeat app connects to their line of smart wellness products.

  • Leaf: Bellabeat’s classic wellness tracker can be worn as a bracelet, necklace, or clip. The Leaf tracker connects to the Bellabeat app to track activity, sleep, and stress.

  • Time: This wellness watch combines the timeless look of a classic timepiece with smart technology to track user activity, sleep, and stress. The Time watch connects to the Bellabeat app to provide you with insights into your daily wellness.

  • Spring: This is a water bottle that tracks daily water intake using smart technology to ensure that you are appropriately hydrated throughout the day. The Spring bottle connects to the Bellabeat app to track your hydration levels.

  • Bellabeat membership: Bellabeat also offers a subscription-based membership program for users. Membership gives users 24/7 access to fully personalized guidance on nutrition, activity, sleep, health and beauty, and mindfulness based on their lifestyle and goals.

I will choose the Bellabeat membership because it is accessed through the app and can be highly customized. Bellabeat can encourage users to analyze their data based on the unique needs of their bodies. Here are my top recommendations for the marketing team:

Heart Rate Alert System

Make sure that the Bellabeat membership includes a heart rate alert system based on age and bmi. This could be a huge selling point for older customers because Bellabeat currently seems to market themselves to younger women (judging from their website). A well designed alert system paired with a reliable tracking device could save an older woman’s life and help Bellabeat compete with traditional emergency devices for seniors like Life Alert.

Better Weight Tracking

Try to synchronize the app with a weighing system to encourage users to track their weights digitally and not have to enter them manually. While there was not enough data to be certain, the numbers (and human pride) seem to indicate that weights entered manually are not as reliable as those done automatically.

AI Sleep Coach

Create an AI sleep coach that will help users track their sleep and create better sleeping habits. Syncing this to a device that specializes in comfort for sleeping would perhaps make more customers log their sleep records on a regular basis. Advertising for fitness trackers (including Bellabeat’s) is often filled with images of Olympic-level athletes running marathons or striking advanced yoga poses. Humans need sleep to be able to do these kinds of extreme activities and any kind of activity at all. While some users might be sleeping lightly on the weekends out of habit, the AI sleep coach could remind users to get in some light activity instead of being sedentary.The AI coach can determine what is healthy for the user.

Final Thoughts

Bellabeat has done good work in finding a niche in the growing market of fitness trackers. By focusing on women, they have made their products personable to their customers. Bellabeat can take their marketing strategy a step further by focusing on “holes” in current fitness tracker advertisement. Instead of trying to take on companies like Fitbit in the area of exercise data logging, they have an opportunity to reach a wider audience by focusing on wellness beyond physical exertion. By marketing to seniors and improving their sleep training program, they can reach a wider audience and make a significant impact in the lives of more women.

---
title: "Bellabeat Case Study Report"
author: "Jonathan Scofield"
date: "12/3/2021"
output: html_notebook
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

![Larry D. Moore, CC BY-SA 4.0 https://creativecommons.org/licenses/by-sa/4.0, via Wikimedia Commons](Exercise_Treadmill_Convey_Motion.jpeg)

## Background

The following study was conducted in partial fulfillment of the requirements of the **Google Data Analytics Career Certificate.** All descriptions from this case study are taken from the Google Coursera course associated with this case study. **I do not claim authorship of the cases study description**, and I have included a link to the PDF [here](https://drive.google.com/file/d/16qePUFdnzg4xS0ymWx_grRU3clwiap-C/view?usp=sharing) The case study is hypothetically conducted for [Bellabeat](https://bellabeat.com/), a women's fitness tracker company.This case study involves preparing, cleaning, visualizing, and analyzing data collected from Fitbit users "in order to gain insight into how consumers use non-Bellabeat smart devices." The data used in this case study comes from the following source: [kaggle.com](https://www.kaggle.com/arashnic/fitbit)

## Guiding Questions and Deliverables

#### Business Task and Guiding Questions

As was previously mentioned, the business task is to  analyzing data collected from Fitbit users "in order to gain insight into how consumers use non-Bellabeat smart devices." Specifically, I need to discover trends in smart device usage and use those trends to inform Bellabeat's marketing strategy. I will need to chose one Bellabeat product to apply my insights to the end of the analysis.

The guidelines for the case study summarize my task through the following questions: 

* What are some trends in smart device usage? 
* How could these trends apply to Bellabeat customers? 
* How could these trends help influence Bellabeat marketing strategy? 

#### Deliverables

The guidelines for the case study require me to accomplish the following deliverables: 

* A clear summary of the business task 

* A description of all data sources used 

* Documentation of any cleaning or manipulation of data 

* A summary of my analysis 

* Supporting visualizations and key findings 

* My top high-level content recommendations based on my analysis 

## The Data 

#### The Date Source

The data packet contained several CSV files tracking Fitbit users' data. The licensing and DOI of the data set are found [here](https://zenodo.org/record/53894#.Yar7LvHMK3J). The description of the data in Kaggle states:

"This data set generated by respondents to a distributed survey via Amazon Mechanical Turk between            03.12.2016-05.12.2016. Thirty eligible Fitbit users consented to the submission of personal tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring." [kaggle.com](https://www.kaggle.com/arashnic/fitbit)

#### Limitations

The limitations of the data are immediately apparent. According to Bellabeat's website, they have 1.2 million devices synced, but this data set only contains the data from 30 Fitbit users. Even if every Bellabeat user owned synced devices so that the total number of customers was 600,000, I would need a sample size of 273 to have a confidence level of 90% and a margin of error of 5%. In addition to the risk of sampling bias, it should be noted that the data is *dated*. Given the changes brought by a global pandemic, I would say that the data is nearly irrelevant today because of its age, small size, and relatively short time frame. Another concern is that nothing is known about the demographics of the participants, and this makes it difficult to draw confident conclusions for a company that specializes in *women's* health products. I also do not know what survey questions or methods Amazon Mechanical Turk used to collect the data or if those methods could have risked sampling bias. For the sake of the case study, I will overlook these concerns, but *I would not present any recommendations to stakeholders based exclusively off this data in a real-world scenario.*

## The Cleaning Process

Several of the tables contained the data, but in different arrangements. I decided to explore  the **dailyActivity_merged**, **sleepDay_merged**,
**weightLogInfo_merged**, and the **heartrate_seconds_merged** CSV files so that I could get the maximum number of variables to work with. I renamed them as **daily_activity_merged_raw**, **sleep_day_merged_raw**, **weight_log_info_merged_raw**, and **heart-rate_seconds_merged_raw, respectively**.

To begin cleaning the data, I will first install the following packages:


```{r install packages, echo=TRUE}
install.packages("tidyverse") 
install.packages("skimr")
install.packages("janitor")
install.packages("here")
```

Now, I will load the packages:

```{r load packages, echo=TRUE}
library("tidyverse")
library("skimr")
library("janitor")
library("here")
library("lubridate")
library("ggplot2")
library("dplyr")
```
Next, I will load the CSV files for analysis and assign variables:

```{r dataframes, echo=TRUE}
activity_df <- read_csv("daily_activity_merged_raw.csv")
sleep_df <- read_csv("sleep_day_merged_raw.csv")
weight_df <- read_csv("weight_log_info_merged_raw.csv")
heart_rate_df <- read_csv("heart_rate_seconds_merged_raw.csv")
```

It's time to take a peak at the contents of the data frames:

```{r head, echo=TRUE}
head(activity_df)
head(sleep_df)
head(weight_df)
head(heart_rate_df)
```
Let's get some information on the size of the data frames using **skimr**:

```{r skimr, echo=TRUE}
skim_without_charts(activity_df)
skim_without_charts(sleep_df)
skim_without_charts(weight_df)
skim_without_charts(heart_rate_df)
```

#### Preliminary Observations

Some observations to note at this point:

* The data sets are in the **long** format.

* These data sets have some common column names, so they can be combined.

* These data sets have different numbers of rows.

* The IDs are being treated as numbers, not strings.

* The measurement for distance is unknown, but we will assume it is in km. I am basing this on the following [discussion](https://www.kaggle.com/arashnic/fitbit/discussion/256284) although I would confirm this with the data collection agency if I was actually employed by Bellabeat.

* The names of the columns need to be formatted into snake case (my personal taste).

* The **weight_df** has a relatively small number of rows and is missing a lot of data in the **fat** column.

* The **heart_rate_df** data frame has the largest number of rows of all the data frames.

##### Cleaning Tasks for activity_df

Here is a summary of the cleaning tasks I wish to perform on this data frame:

* Convert column names to snake case for easy reading

* Convert IDs to strings so they can be counted 

* Remove duplicate and empty rows and columns

* Remove columns with a single value

* Ensure that the total distance is the sum of the subcategories of distance

* Create a total column for active minutes

* Return the day of the week as a string for each date

Here is the code I used to clean the data:

```{r activity_cleaned, echo=TRUE}
activity_cleaned <- activity_df %>% 
  clean_names() %>% #convert to snake case
  distinct() %>% #remove duplicates
  remove_empty(c("rows", "cols")) %>% #remove empty rows or columns
  remove_constant() %>% #remove columns with a single value
  mutate(total_distance_new = #make sure total distance is calculated properly
           (very_active_distance + 
                                 moderately_active_distance + 
                                 light_active_distance + 
                                 sedentary_active_distance)) %>% 
  mutate(total_active_minutes = 
           very_active_minutes + fairly_active_minutes + #add total active minutes column
           lightly_active_minutes + sedentary_minutes) %>% 
  mutate(date_new = as.Date(mdy(activity_date))) %>% #standardize date
  mutate(day = wday(date_new, TRUE)) %>% #convert date to weekday string
  mutate(id = as.character(id)) %>% #convert id to string
  select(id, date_new, day, total_steps, total_distance_new, total_active_minutes, sedentary_minutes,
         lightly_active_minutes, fairly_active_minutes, very_active_minutes,
         calories) 
```


##### Cleaning Tasks for sleep_df

Here is a summary of the cleaning tasks I wish to perform on this data frame:


* Perform a similar cleaning process used on activity_df

* Create a **lying_awake** column that will show how much time participants spend awake in bed

* Create a column that will round the total sleeping minutes to integer hours for better grouping

Here is the code I used to clean the data:


```{r sleep_cleaned, echo=TRUE}
sleep_cleaned <- sleep_df %>% 
  clean_names() %>% #convert to snake case
  distinct() %>% #remove duplicates
  remove_empty(c("rows", "cols")) %>% #remove empty rows or columns
  remove_constant() %>% #remove columns with a single value
  mutate(date_new = as.Date(mdy_hms(sleep_day))) %>% #standardize date
  mutate(day = wday(date_new, TRUE)) %>% #convert date to weekday string
  mutate(id = as.character(id)) %>% # convert id to string
  mutate(lying_awake = total_time_in_bed - total_minutes_asleep) %>% #how much time is spent awake in bed?
  mutate(hours_asleep = total_minutes_asleep %/% 60) %>% #round minutes to hours
  select(id, date_new, day, total_sleep_records, total_minutes_asleep, hours_asleep, lying_awake, total_time_in_bed)
```

##### Cleaning Tasks for weight_df

Here is a summary of the cleaning tasks I wish to perform on this data frame:

* Perform a similar cleaning process used on activity_df

* Limit measurements in the metric system only

* Use the BMI formula to use the weight to calculate a height column in meters in case this reveals anything

Here is the code I used to clean the data:


```{r weight_cleaned, echo=TRUE}
weight_cleaned <- weight_df %>% 
  clean_names() %>% #convert to snake case
  distinct() %>% #remove duplicates
  remove_empty(c("rows", "cols")) %>% #remove empty rows or columns
  remove_constant() %>%   #remove columns with a single value
  mutate(height_m = (weight_kg / bmi ) ^ 0.5) %>% # use weight and bmi to calculate height
  mutate(date_new = as.Date(mdy_hms(date))) %>% 
  mutate(day = wday(date_new, TRUE)) %>% # convert date to weekday string
  mutate(id = as.character(id)) %>% # convert id to string
  select(log_id, id, date_new, day, weight_kg, height_m, bmi, is_manual_report) 
```


##### Cleaning Tasks for heart_rate_df

Here is a summary of the cleaning tasks I wish to perform on this data frame:

* Perform a similar cleaning process used on activity_df

* Rename the **value** column to make the data easier to understand at glance

Here is the code I used to clean the data:


```{r heart_rate_cleaned, echo=TRUE}
heart_rate_cleaned <- heart_rate_df %>% 
  clean_names() %>% #convert to snake case
  distinct() %>% #remove duplicates
  remove_empty(c("rows", "cols")) %>% #remove empty rows or columns
  remove_constant() %>%  #remove columns with a single value
  rename("heart_rate" = value) %>%  #clarify value of this column
  mutate(id = as.character(id)) %>% # convert id to string
  select(everything())
```

##### Combining Data Frames

After cleaning the previous data frames, I am interested in combining some of them for ease of analysis. I will use an inner join to merge the daily activity and sleep data frames:

```{r joining data, echo=TRUE}
activity_sleep_combined <- inner_join(activity_cleaned, sleep_cleaned) 
```
##### Confirming the Cleaning Process

I will now confirm that the frames have been cleaned properly using **skimr**:

```{r cleaning confirmation, echo=TRUE}
skim_without_charts(activity_cleaned)
skim_without_charts(sleep_cleaned)
skim_without_charts(weight_cleaned)
skim_without_charts(heart_rate_cleaned)
skim_without_charts(activity_sleep_combined)
```

#### Analyzing the Data

##### Observations from activity_cleaned

When looking at the information for this data frame, I was initially confused because there were **33 unique IDs**. The description of the data said that there were **30 participants** in this study. I visited on of the [discussions](https://www.kaggle.com/arashnic/fitbit/discussion/281874) on Kaggle in which one user commented that some users may be logging data with more than one device. *I will assume that this is the correct explanation.* 

##### Observations from activity_cleaned

Although it is scientifically accepted that physical activity will result in burned calories, for the sake of testing our data, lets compare daily steps and calories:

```{r steps to calories, echo=TRUE}
ggplot(data = activity_cleaned, aes(x = calories, y = total_steps, color = total_steps)) + 
  geom_point() +  geom_smooth() + labs(title = "Steps vs. Calories", x = "Calories", y = "Total Steps", color = "Total Steps")
```
From the chart, we can see a *positive correlation* between calories and steps. Let's compare calories with total distance:

```{r distance to calories, echo=TRUE}
ggplot(data = activity_cleaned, aes(x = calories, y = total_distance_new, color = total_distance_new)) + 
  geom_point() +  geom_smooth() + labs(title = "Distance vs. Calories", x = "Calories", y = "Total Distance (Km)", color = "Km")
```
Note that I am assuming that the distance is in Km per a Kaggle [discussion](https://www.kaggle.com/arashnic/fitbit/discussion/256284). Since there is a positive correlation between steps and calories, and distance and calories, I want to know which days have the most steps and distances logged. We will begin with the steps. We see **that the average total steps is slightly lower on the weekends:** 


```{r total steps, echo=TRUE}
ggplot(data = activity_cleaned) + geom_point(mapping = aes(x = day, y = total_steps, color = total_steps)) + geom_boxplot(mapping = aes(x = day, y = total_steps)) + labs(title = "Total Steps", subtitle = "Frequency Distributions by Day of the Week (4/12/16 - 5/12/16)", x = "Day", y = " Total Steps", colour = "Steps")
```
The same holds true for the average of total distance:

```{r total distance, echo=TRUE}
ggplot(data = activity_cleaned) + geom_point(mapping = aes(x = day, y = total_distance_new, color = total_distance_new)) + geom_boxplot(mapping = aes(x = day, y = total_distance_new)) + labs(title = "Total Distance", subtitle = "Frequency Distributions by Day of the Week (4/12/16 - 5/12/16)", x = "Day", y = " Total Distance (Km)", colour = "Km")
```

Although we might assume that calories burned will also be lower on the weekends, lets make sure:

```{r calories burned}
ggplot(data = activity_cleaned) + geom_point(mapping = aes(x = day, y = calories, color = calories)) + geom_boxplot(mapping = aes(x = day, y = calories)) + labs(title = "Calories Burned", subtitle = "Frequency Distributions by Day of the Week (4/12/16 - 5/12/16)", x = "Day", y = " Calories", colour = "Calories")
```

Our hypothesis is confirmed although it should be noted that *Thursday's average seems to be slightly lower than Saturday's.*

Next, I will plot the active minutes distributions by weekday. We will consider the total active minutes by day of the week, and then we will consider the different levels of activity within these totals.

```{r}
ggplot(data = activity_cleaned) + geom_point(mapping = aes(x = day, y = total_active_minutes, color = total_active_minutes), color = "blue") + geom_boxplot(mapping = aes(x = day, y = total_active_minutes)) + labs(title = "Total Active Minutes", subtitle = "Frequency Distributions by Day of the Week (4/12/16 - 5/12/16)", x = "Day", y = " Total Active Minutes")
```


I found that the total active minutes logged by Fitbit users tends to be fairly stable throughout the week with the exception of Thursday. The distributions for the weekends are also wider than the weekdays.

```{r , echo=TRUE}
ggplot(data = activity_cleaned) + geom_point(mapping = aes(x = day, y = sedentary_minutes, color = sedentary_minutes), color = "green") + geom_boxplot(mapping = aes(x = day, y = sedentary_minutes)) + labs(title = "Sedentary Minutes", subtitle = "Frequency Distributions by Day of the Week (4/12/16 - 5/12/16)", x = "Day", y = "Sedentary Minutes")
```
The amount of sedentary minutes is also pretty uniform, again with the exception of Thursday. Saturday has one of the highest averages of sedentary minutes, while Thursday has one of the lowest.

I noticed is that Saturdays had the highest average and maximum of lightly active minutes while Sunday had one of the lowest. 

```{r lightly active minutes, echo=TRUE}
ggplot(data = activity_cleaned) + geom_point(mapping = aes(x = day, y = lightly_active_minutes, color = lightly_active_minutes), color = "violet") + geom_boxplot(mapping = aes(x = day, y = lightly_active_minutes)) + labs(title = "Lightly Active Minutes", subtitle = "Frequency Distributions by Day of the Week (4/12/16 - 5/12/16)", x = "Day", y = "Lightly Active Minutes")
```
I noticed that the average number of fairly active minutes logged was lower on the weekends.

```{r fairly active minutes, echo=TRUE}
ggplot(data = activity_cleaned) + geom_point(mapping = aes(x = day, y = fairly_active_minutes, color = fairly_active_minutes), color = "purple") + geom_boxplot(mapping = aes(x = day, y = fairly_active_minutes)) + labs(title = "Fairly Active Minutes", subtitle = "Frequency Distributions by Day of the Week (4/12/16 - 5/12/16)", x = "Day", y = "Fairly Active Minutes")
```
The very active minutes seem to peak Mon - Wed, but if you look at the outliers, it seems that some people do their exercise on the weekends because Sunday and Saturday have a few outliers while the maximums for the weekdays are lower. 

```{r very active minutes, echo=TRUE}
ggplot(data = activity_cleaned) + geom_point(mapping = aes(x = day, y = very_active_minutes, color = very_active_minutes), color = "orange") + geom_boxplot(mapping = aes(x = day, y = very_active_minutes)) + labs(title = "Very Active Minutes", subtitle = "Frequency Distributions by Day of the Week (4/12/16 - 5/12/16)", x = "Day", y = " Very Active Minutes")
```

##### Observations from sleep_cleaned

As with the daily activity, we can look at how sleeping hours as well as hours spent awake in bed compare between days of the week. Note that this data frame is much smaller than the activity data frame. Let's begin with minutes asleep:

```{r sleep scatter, echo=TRUE}
ggplot(data = sleep_cleaned) + geom_point(mapping = aes(x = day, y = total_minutes_asleep, color = total_minutes_asleep), color = "blue") + geom_boxplot(mapping = aes(x = day, y = total_minutes_asleep)) + labs(title = "Daily Sleep Averages", subtitle = "Frequency Distributions by Day of the Week (4/12/16 - 5/12/16)", x = "Day", y = "Total Minutes Asleep")
```
From this plot, it appears that Sunday is the most popular sleep day. Since I rounded the minutes into hours during the cleaning process, let's see what this looks like in hours using a histogram:

```{r sleep in hours, echo=TRUE}
ggplot(data = sleep_cleaned) + geom_col(mapping = aes(x = day, y = hours_asleep, alpha = hours_asleep), color = "purple") + labs(title = "Daily Sleep Totals", subtitle = "A Histogram of Data Logged 4/12/16 - 5/12/16", x = "Day", y = "Hours Asleep", alpha = "Hours (Rounded)")
```
While at first it appears that the histogram contradicts the previous graph, the fill shows that a lot of sleep logged on on weekends is longer, but more people are logging more sleep throughout the week in smaller amounts. 

Since the data frame contains total minutes asleep and total time in bed, I want to see how much time people are spending in bed awake:

```{r lyinng_awake1, echo=TRUE}
ggplot(data = sleep_cleaned) + geom_point(mapping = aes(x = day, y = lying_awake, color = lying_awake), color = "green") + geom_boxplot(mapping = aes(x = day, y = lying_awake)) + labs(title = "Total Time Awake in Bed", subtitle = "Frequency Distributions by Day of the Week (4/12/16 - 5/12/16)", x = "Day", y = "Waking Minutes")
```
We can see that the distributions are pretty similar from day to day, but Friday, 
Saturday, and Sunday also have some outliers. It looks like a few people logged long minutes awake in bed on the weekends. This could be the reason why the scatter plot and histogram differ appear differently at first glance. Let's check this data with a histogram: 

```{r lying_awake2, echo=TRUE}
ggplot(data = sleep_cleaned) + geom_col(mapping = aes(x = day, y = lying_awake, alpha = lying_awake), color = "orange") + labs(title = "Waking Minutes Totals", subtitle = "A Histogram of Data Logged 4/12/16 - 5/12/16", x = "Day", y = "Waking Minutes", alpha = "Minutes")
```

From this we can see that a lot of people are lying awake in bed on off days while some prefer to get a lot of sleep. 

There is one more way that I want to visualize this data:

```{r total time in bed, echo=TRUE}
ggplot(data = sleep_cleaned) + geom_col(mapping = aes(x = day, y = total_time_in_bed, fill = "red")) +
  geom_col(mapping = aes(x = day, y = lying_awake, fill = "blue"))  + labs(title = "Total Time in Bed Breakdown", subtitle = "Awake vs. Asleep 4/12/16 - 5/12/16", x = "Day", y = "Total Minutes", fill = "Minutes") + scale_fill_hue(labels = c("Waking Minutes", "Total Minutes in Bed"))
```
This visualization is perhaps the clearest way to understand the proportional relationship between waking minutes and total time in bed. We can see that Fri - Sun has proportionally large portions of time spent awake in bed.

##### Observations from weight_cleaned

The weight cleaned data frame has a very small number of rows compared to the other data frames. Instead of focusing exclusively on the numbers logged, I wanted to think about the *behavior* of the participants since so few of them logged their weights. One of the columns contains Booleans to the statement "is manual report." I am curious to see how the weights compare when logged in manually to when they are logged in automatically. Let's take a look at the following plot: 

```{r manual vs auto, echo=TRUE}
ggplot(data = weight_cleaned) + geom_violin(mapping = aes(x = is_manual_report, y = weight_kg)) + labs(title = "Manual vs. Automatic Logging of Weight", subtitle = "Why are Manually Recorded Weights Lower?", x = "Manual Log?", y = "Weight (kg)")
```
While I can't make a definitive conclusion from so small a sample, I would want to explore user bias that may skew the data they enter. It appears that weight of people who enter weight manually is much lower than that of people who are logged automatically.

##### Observations from heart_rate_cleaned

The heart rate data frame contains heart rate information for each user by the *second*, and this data frame contains **2,483,658 rows**. Instead of looking at daily trends, I want to focus on the high and low heart rates to try to guess how people are using their Fitbits and if I see anything abnormal. Looking at an article published by [Cleveland Clinic](https://my.clevelandclinic.org/health/diagnostics/17402-pulse--heart-rate), I can see that for normal adults, 60 to 100 beats per minute is considered normal, so I will filter out the extremes. I will create variables to help filter the data:

```{r filtering heart rates, echo=TRUE}
heart_rate_high <- heart_rate_cleaned %>% subset(heart_rate >= 100)
skim(heart_rate_high)
heart_rate_low <- heart_rate_cleaned %>% subset(heart_rate <= 60)
skim(heart_rate_low)
```
Now we can plot this data:

```{r heart_rate, echo=TRUE}
ggplot(data = heart_rate_high) + geom_bar(mapping = aes(x = heart_rate), fill = "red") + labs(title = "High Heart Rate", subtitle = "Frequences 4/12/16 - 5/12/16", x = "Heart Rates (BPM)", y = "Frequency") + annotate("text", x = 150, y = 10000, label = "Heart Rate >= 100 BPM", color = "red", fontface = "bold", size = 6)
ggplot(data = heart_rate_low) + geom_bar(mapping = aes(x = heart_rate), fill = "blue") + labs(title = "Low Heart Rate", subtitle = "Frequences 4/12/16 - 5/12/16", x = "Heart Rates (BPM)", y = "Frequency") + annotate("text", x = 43, y = 40000, label = "Heart Rate <= 60 BPM", color = "blue", fontface = "bold", size = 6)
```
From this data, I can see that both high and low heart rates have occurred amount the participants. This means that the participants are using their Fitbits for sleep and intense physical activity. Some of the participants may be older, or they may be athletes, so I cannot say for certain if the heart rates are at dangerous levels for the individuals who logged them. *I would want my fitness tracker to alert me if my heart rate was at an extreme level, like 43 bpm or 180 bpm.*

##### Observations from activity_sleep_combined

Finally, I want to see if there is any connection between sleep and calories to try to discover trends in the data logged by Fitbit users:

```{r calories to sleep, echo=TRUE}
ggplot(data = activity_sleep_combined) + 
  geom_col(mapping = aes(x = hours_asleep, y = calories, alpha = calories), fill = "red") + labs(title = "Sleep and Calories", subtitle = "Totals 4/12/16 - 5/12/16", x = "Hours Asleep", y = "Calories", alpha = "Calories")
```
From this plot, we can see that most calories burned, including the high-level burns are logged by users who get between 6 and 8 hours of sleep a night. This might be because this is how much most people sleep on average, but we know scientifically that sleep is essential for overall wellness.

## Conclusions and Recommendations

#### Summary of Findings

By visualizing the cleaned data, I made the following observations:

* People are logging less activity on the weekends with the exception of a few outliers who logged intense workouts.

* People log the most waking minutes in bed on Fri - Sun.

* People are using their Fitbits to track activity more than they use them to keep track of weight, sleep, and heart rate (based on the total number of unique IDs for each data set).

* People who log their weights manually tend to log lighter weights than those who don't.

* Some people are logging extreme heart rates, both fast and slow. *These could be dangerous depending on the activity, age, and fitness-level of the user.*

#### Recommendations

For my recommendations, I must choose one Bellabeat product to apply them too. The case study describes the following products thusly:

* **Bellabeat app:** The Bellabeat app provides users with health data related to their activity, sleep, stress,
menstrual cycle, and mindfulness habits. This data can help users better understand their current habits and
make healthy decisions. The Bellabeat app connects to their line of smart wellness products.

* **Leaf:** Bellabeat’s classic wellness tracker can be worn as a bracelet, necklace, or clip. The Leaf tracker connects to the Bellabeat app to track activity, sleep, and stress.

* **Time:** This wellness watch combines the timeless look of a classic timepiece with smart technology to track user activity, sleep, and stress. The Time watch connects to the Bellabeat app to provide you with insights into your daily wellness.

* **Spring:** This is a water bottle that tracks daily water intake using smart technology to ensure that you are
appropriately hydrated throughout the day. The Spring bottle connects to the Bellabeat app to track your
hydration levels.

* **Bellabeat membership:** Bellabeat also offers a subscription-based membership program for users.
Membership gives users 24/7 access to fully personalized guidance on nutrition, activity, sleep, health and
beauty, and mindfulness based on their lifestyle and goals.

I will choose the **Bellabeat membership** because it is accessed through the app and can be highly customized. Bellabeat can encourage users to analyze their data based on the unique needs of their bodies. Here are my top recommendations for the marketing team:

##### Heart Rate Alert System

Make sure that the Bellabeat membership includes a heart rate alert system based on age and bmi. This could be a huge selling point for older customers because Bellabeat currently seems to market themselves to younger women (judging from their website). A well designed alert system paired with a reliable tracking device could save an older woman's life and help Bellabeat compete with traditional emergency devices for seniors like [Life Alert](http://www.lifealert.com/). 

##### Better Weight Tracking

Try to synchronize the app with a weighing system to encourage users to track their weights digitally and not have to enter them manually. While there was not enough data to be certain, the numbers (and human pride) seem to indicate that weights entered manually are not as reliable as those done automatically.

##### AI Sleep Coach

Create an AI sleep coach that will help users track their sleep and create better sleeping habits. Syncing this to a device that specializes in comfort for sleeping would perhaps make more customers log their sleep records on a regular basis. Advertising for fitness trackers (including Bellabeat's) is often filled with images of Olympic-level athletes running marathons or striking advanced yoga poses. Humans need sleep to be able to do these kinds of extreme activities and any kind of activity at all. While some users might be sleeping lightly on the weekends out of habit, the AI sleep coach could remind users to get in some light activity instead of being sedentary.The AI coach can determine what is healthy for the user.

## Final Thoughts

Bellabeat has done good work in finding a niche in the growing market of fitness trackers. By focusing on women, they have made their products personable to their customers. Bellabeat can take their marketing strategy a step further by focusing on "holes" in current fitness tracker advertisement. Instead of trying to take on companies like Fitbit in the area of exercise data logging, they have an opportunity to reach a wider audience by focusing on wellness beyond physical exertion. By marketing to seniors and improving their sleep training program, they can reach a wider audience and make a significant impact in the lives of more women.



