Introduction

The following case study is the final capstone project in the Google Data Analytics Professional Certificate. The case study is on Bellabeat, a high-tech wellness company that manufactures health-focused products made specifically for women. We are asked to analyze smart devices-fitness data to gain insight into how consumers are using their smart devices, and to discover trends that can guide a marketing strategy for the company to grow.
For this case study, I will be following the 6 steps of the data analysis process: Ask, Prepare, Process, Analyze, Share, and Act.

1-Ask:

The business task:

Analyze Fitbit users’ data to find trends and insights into how consumers use their smart devices.

Key stakeholders:

-Urska Srsen: Co-founder and CCO of Bellabeat.
-Sando Mur: Mathematician and Co-founder of Bellabeat.
-Bellabeat marketing analytics team: A team of data analysts that uses their analysis to guide the marketing strategy.

2-Prepare:

3-Process:

Even though the survey only has 30 participants, some of the data is still too big to use in Microsoft Excel, since it’s represented in long format.
So I decided to use R to clean and analyze the data.
I’ve also decided to use the following datasets in my analysis: daily_activity, heartrate_seconds, and sleep_day.

1- Loading all the packages needed for the analysis (already installed using install.packages():
library(readr)
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.6     v dplyr   1.0.9
## v tibble  3.1.7     v stringr 1.4.0
## v tidyr   1.2.0     v forcats 0.5.1
## v purrr   0.3.4
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(ggplot2)
library(lubridate)
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
library(janitor)
## 
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test
library(dplyr)
2- Loading the datasets:
daily_activity <- read_csv("C:\\Users\\Lenovo\\Desktop\\Bellabeat\\Fitbit Data\\dailyActivity_merged.csv")
## Rows: 940 Columns: 15
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr  (1): ActivityDate
## dbl (14): Id, TotalSteps, TotalDistance, TrackerDistance, LoggedActivitiesDi...
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
heartrate_seconds<- read_csv("C:\\Users\\Lenovo\\Desktop\\Bellabeat\\Fitbit Data\\heartrate_seconds_merged.csv")
## Rows: 2483658 Columns: 3
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (1): Time
## dbl (2): Id, Value
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
sleep_day <- read_csv("C:\\Users\\Lenovo\\Desktop\\Bellabeat\\Fitbit Data\\sleepDay_merged.csv")
## Rows: 413 Columns: 5
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (1): SleepDay
## dbl (4): Id, TotalSleepRecords, TotalMinutesAsleep, TotalTimeInBed
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
3- Checking the structure of the datasets:
str(daily_activity)
## spec_tbl_df [940 x 15] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Id                      : num [1:940] 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ ActivityDate            : chr [1:940] "4/12/2016" "4/13/2016" "4/14/2016" "4/15/2016" ...
##  $ TotalSteps              : num [1:940] 13162 10735 10460 9762 12669 ...
##  $ TotalDistance           : num [1:940] 8.5 6.97 6.74 6.28 8.16 ...
##  $ TrackerDistance         : num [1:940] 8.5 6.97 6.74 6.28 8.16 ...
##  $ LoggedActivitiesDistance: num [1:940] 0 0 0 0 0 0 0 0 0 0 ...
##  $ VeryActiveDistance      : num [1:940] 1.88 1.57 2.44 2.14 2.71 ...
##  $ ModeratelyActiveDistance: num [1:940] 0.55 0.69 0.4 1.26 0.41 ...
##  $ LightActiveDistance     : num [1:940] 6.06 4.71 3.91 2.83 5.04 ...
##  $ SedentaryActiveDistance : num [1:940] 0 0 0 0 0 0 0 0 0 0 ...
##  $ VeryActiveMinutes       : num [1:940] 25 21 30 29 36 38 42 50 28 19 ...
##  $ FairlyActiveMinutes     : num [1:940] 13 19 11 34 10 20 16 31 12 8 ...
##  $ LightlyActiveMinutes    : num [1:940] 328 217 181 209 221 164 233 264 205 211 ...
##  $ SedentaryMinutes        : num [1:940] 728 776 1218 726 773 ...
##  $ Calories                : num [1:940] 1985 1797 1776 1745 1863 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Id = col_double(),
##   ..   ActivityDate = col_character(),
##   ..   TotalSteps = col_double(),
##   ..   TotalDistance = col_double(),
##   ..   TrackerDistance = col_double(),
##   ..   LoggedActivitiesDistance = col_double(),
##   ..   VeryActiveDistance = col_double(),
##   ..   ModeratelyActiveDistance = col_double(),
##   ..   LightActiveDistance = col_double(),
##   ..   SedentaryActiveDistance = col_double(),
##   ..   VeryActiveMinutes = col_double(),
##   ..   FairlyActiveMinutes = col_double(),
##   ..   LightlyActiveMinutes = col_double(),
##   ..   SedentaryMinutes = col_double(),
##   ..   Calories = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>
str(heartrate_seconds)
## spec_tbl_df [2,483,658 x 3] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Id   : num [1:2483658] 2.02e+09 2.02e+09 2.02e+09 2.02e+09 2.02e+09 ...
##  $ Time : chr [1:2483658] "4/12/2016 7:21:00 AM" "4/12/2016 7:21:05 AM" "4/12/2016 7:21:10 AM" "4/12/2016 7:21:20 AM" ...
##  $ Value: num [1:2483658] 97 102 105 103 101 95 91 93 94 93 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Id = col_double(),
##   ..   Time = col_character(),
##   ..   Value = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>
str(sleep_day)
## spec_tbl_df [413 x 5] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Id                : num [1:413] 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ SleepDay          : chr [1:413] "4/12/2016 12:00:00 AM" "4/13/2016 12:00:00 AM" "4/15/2016 12:00:00 AM" "4/16/2016 12:00:00 AM" ...
##  $ TotalSleepRecords : num [1:413] 1 2 1 2 1 1 1 1 1 1 ...
##  $ TotalMinutesAsleep: num [1:413] 327 384 412 340 700 304 360 325 361 430 ...
##  $ TotalTimeInBed    : num [1:413] 346 407 442 367 712 320 377 364 384 449 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Id = col_double(),
##   ..   SleepDay = col_character(),
##   ..   TotalSleepRecords = col_double(),
##   ..   TotalMinutesAsleep = col_double(),
##   ..   TotalTimeInBed = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

This shows us that:
- ActivityDate in the dataset daily_activity is a character: needs to be converted to a Date object.
- Time in the dataset heartrate_seconds is a character: needs to be converted to a Time object.
- SleepDay in the dataset sleep_day is a character: needs to be converted to a DateTime object.

4- Checking for number of unique participants:
n_distinct(daily_activity$Id)
## [1] 33
n_distinct(heartrate_seconds$Id)
## [1] 14
n_distinct(sleep_day$Id)
## [1] 24

We notice that:
- daily_activity has 33 unique identification numbers. Since the survey has only 30 participants, daily_activity has 3 more participants than it should. This could be due to some people making more than one contribution in the survey.
- heartrate_seconds and sleep_day have only 14 and 24 unique identification numbers, respectively. This means that some people either didn’t track their heartbeat data, or the number of hours that they’ve spent asleep per day, or both.

5- Checking for duplicates:
sum(duplicated(daily_activity))
## [1] 0
sum(duplicated(heartrate_seconds))
## [1] 0
sum(duplicated(sleep_day))
## [1] 3
  • sleep_day has 3 duplicated rows. I will be using the function distinct() from the dplyr package (already loaded) to remove the duplicates.
sleep_day <- sleep_day %>% distinct()
6- Checking for missing values:
sum(is.na(daily_activity))
## [1] 0
sum(is.na(heartrate_seconds))
## [1] 0
sum(is.na(sleep_day))
## [1] 0
colSums(daily_activity==0)
##                       Id             ActivityDate               TotalSteps 
##                        0                        0                       77 
##            TotalDistance          TrackerDistance LoggedActivitiesDistance 
##                       78                       78                      908 
##       VeryActiveDistance ModeratelyActiveDistance      LightActiveDistance 
##                      413                      386                       85 
##  SedentaryActiveDistance        VeryActiveMinutes      FairlyActiveMinutes 
##                      858                      409                      384 
##     LightlyActiveMinutes         SedentaryMinutes                 Calories 
##                       84                        1                        4
colSums(heartrate_seconds==0)
##    Id  Time Value 
##     0     0     0
colSums(sleep_day==0)
##                 Id           SleepDay  TotalSleepRecords TotalMinutesAsleep 
##                  0                  0                  0                  0 
##     TotalTimeInBed 
##                  0
  • We can see that daily_activity has values that are equal to 0 in multiple columns. (This could be due to some people having made no exercise or walks on some days, or failing to wear their smart devices when exercising,…)
7- Checking the data for errors:
  • The column names in all the datasets need to be changed, to be made more consistent. I will be using the function clean_names() from the janitor package (already loaded).
daily_activity <- clean_names(daily_activity)
heartrate_seconds <- clean_names(heartrate_seconds)
sleep_day <- clean_names(sleep_day)
a) daily_activity:
daily_activity %>% mutate(total_distance_calculated=very_active_distance+moderately_active_distance+light_active_distance+sedentary_active_distance) %>% select(c(total_distance,total_distance_calculated))
## # A tibble: 940 x 2
##    total_distance total_distance_calculated
##             <dbl>                     <dbl>
##  1           8.5                       8.49
##  2           6.97                      6.97
##  3           6.74                      6.75
##  4           6.28                      6.23
##  5           8.16                      8.16
##  6           6.48                      6.48
##  7           8.59                      8.60
##  8           9.88                      9.88
##  9           6.68                      6.68
## 10           6.34                      6.34
## # ... with 930 more rows
  • Running the code above shows that some values between the total_distance_calculated (now), and the total_distance already given in the dataset, slightly differ. However, this error is small and doesn’t affect the analysis results.

  • Now, the column ActivityDate needs to be converted from Character to Date, as stated before. I’ll be using the lubridate package (already loaded) to make that conversion.

daily_activity$activity_date <- mdy(daily_activity$activity_date)
  • Since the survey was collected over a 2 month period (from 2016-04-12 to 2016-05-12), I’m gonna be checking that the dates in daily_activity all fall in the right interval.
sum(daily_activity$activity_date<"2016-04-12"|daily_activity$activity_date>"2016-05-12")
## [1] 0
  • After adding new columns that show the number of hours people spent on different activity levels (very_active_hours, fairly_active_hours,lightly_active_hours, sedentary_hours), and the total number of hours that they spent active (total_active_hours), I will remove the columns that used to show this data in minutes.
daily_activity <- daily_activity %>% mutate(very_active_hours=very_active_minutes/60,fairly_active_hours=fairly_active_minutes/60,lightly_active_hours=lightly_active_minutes/60,sedentary_hours=sedentary_minutes/60)

daily_activity <- daily_activity %>% mutate(total_active_hours=very_active_hours+fairly_active_hours+lightly_active_hours+sedentary_hours)

daily_activity <- daily_activity %>% select(-c(very_active_minutes,fairly_active_minutes,lightly_active_minutes,sedentary_minutes))
b) heartrate_seconds:
  • First, (to conserve the AM/PM data), before separating the time column, it needs to be converted from Character to a DateTime object. I will be using the lubridate package (already loaded).

  • Then, using the format() function, I will omit all seconds from the given time (since a) they add no important value to the data, and b) to later be able to calculate the average heartbeat per minute, and then group together the average heartbeat per day, for each id).

heartrate_seconds$time <- mdy_hms(heartrate_seconds$time)
heartrate_seconds$time <- format(heartrate_seconds$time,'%Y-%m-%d %H:%M')
heartrate_seconds <- heartrate_seconds %>% separate(time,c("date","time"),' ')
str(heartrate_seconds)
## tibble [2,483,658 x 4] (S3: tbl_df/tbl/data.frame)
##  $ id   : num [1:2483658] 2.02e+09 2.02e+09 2.02e+09 2.02e+09 2.02e+09 ...
##  $ date : chr [1:2483658] "2016-04-12" "2016-04-12" "2016-04-12" "2016-04-12" ...
##  $ time : chr [1:2483658] "07:21" "07:21" "07:21" "07:21" ...
##  $ value: num [1:2483658] 97 102 105 103 101 95 91 93 94 93 ...
  • After separating the original time column into new separate date and time columns and checking with str(), we notice that the date and time reverted back to Character objects, and need to be converted again to Date and Time objects, respectively (using the lubridate package).
heartrate_seconds$date <- ymd(heartrate_seconds$date)
heartrate_seconds$time <- hm(heartrate_seconds$time)
  • Now, (as done for daily_activity earlier), I will again check that the given dates are included in the required 2 month-interval.
sum(heartrate_seconds$date<"2016-04-12"||heartrate_seconds$date>"2016-05-12")
## [1] 0
  • I will now create a new table (heartrate) with the average heartbeat recorded per day for each individual.
heartrate <- heartrate_seconds %>% group_by(id,date,time) %>% drop_na() %>% summarize(average_heartbeat = mean(value))
## `summarise()` has grouped output by 'id', 'date'. You can override using the
## `.groups` argument.
heartrate <- heartrate %>% select(-time)
heartrate <- heartrate %>% group_by(id,date) %>% drop_na() %>% summarise(heartbeat=mean(average_heartbeat))
## `summarise()` has grouped output by 'id'. You can override using the `.groups`
## argument.
heartrate$heartbeat <- as.integer(heartrate$heartbeat)
head(heartrate)
## # A tibble: 6 x 3
## # Groups:   id [1]
##           id date       heartbeat
##        <dbl> <date>         <int>
## 1 2022484408 2016-04-12        74
## 2 2022484408 2016-04-13        78
## 3 2022484408 2016-04-14        70
## 4 2022484408 2016-04-15        78
## 5 2022484408 2016-04-16        74
## 6 2022484408 2016-04-17        82
c) sleep_day:
  • I will split the sleep_day column into 2 separate columns: sleep_date and sleep_time.

  • Checking with str() shows us that sleep_date needs to be changed from Character to Date, and sleep_time from Character to Time (using the lubridate package).

sleep_day <- sleep_day %>% separate(sleep_day,c("sleep_date","sleep_time"),sep=' ')
## Warning: Expected 2 pieces. Additional pieces discarded in 410 rows [1, 2, 3, 4,
## 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].
sleep_day$sleep_date <- mdy(sleep_day$sleep_date)
sleep_day$sleep_time <- hms(sleep_day$sleep_time)
  • I will convert the data to be shown in hours, and then remove the data that was given in minutes:
sleep_day <- mutate(sleep_day,total_hours_asleep=total_minutes_asleep/60)
sleep_day <- mutate(sleep_day,total_hours_in_bed=total_time_in_bed/60)
sleep_day <- sleep_day %>% select(-c(total_minutes_asleep,total_time_in_bed))
  • Checking for errors:
  1. if total_hours has any data that is negative or exceeds 24 hours:
sum(sleep_day$total_hours_asleep>=24|sleep_day$total_hours_asleep<=0,na.rm=TRUE)
## [1] 0
sum(sleep_day$total_hours_in_bed>=24||sleep_day$total_hours_in_bed<=0,na.rm=TRUE)
## [1] 0
  1. if sleep_date is consistent with the given date interval:
sum(sleep_day$sleep_date<"2016-04-12"||sleep_day$sleep_date>"2016-05-12")
## [1] 0
  1. if any total hours spent asleep exceeds the total time spent in bed:
sum(sleep_day$total_hours_asleep>sleep_day$total_hours_in_bed,na.rm=TRUE)
## [1] 0

4- Analyze:

join_table <- left_join(daily_activity,heartrate,by=c("id","activity_date"="date"))
join_table <- left_join(join_table,sleep_day,by=c("id","activity_date"="sleep_date"))
final_table <- select(join_table,-c(tracker_distance,logged_activities_distance,sleep_time))
final_table %>% select(-c(id,activity_date)) %>% drop_na() %>% summary()
##   total_steps    total_distance   very_active_distance
##  Min.   :   42   Min.   : 0.030   Min.   :0.000       
##  1st Qu.: 5245   1st Qu.: 3.680   1st Qu.:0.000       
##  Median : 9148   Median : 6.380   Median :0.670       
##  Mean   : 8522   Mean   : 6.021   Mean   :1.429       
##  3rd Qu.:11135   3rd Qu.: 7.920   3rd Qu.:2.420       
##  Max.   :22770   Max.   :17.540   Max.   :9.890       
##  moderately_active_distance light_active_distance sedentary_active_distance
##  Min.   :0.0000             Min.   :0.030         Min.   :0.000000         
##  1st Qu.:0.0000             1st Qu.:2.670         1st Qu.:0.000000         
##  Median :0.4000             Median :3.770         Median :0.000000         
##  Mean   :0.6475             Mean   :3.894         Mean   :0.002099         
##  3rd Qu.:1.0000             3rd Qu.:5.270         3rd Qu.:0.000000         
##  Max.   :2.7700             Max.   :9.480         Max.   :0.110000         
##     calories    very_active_hours fairly_active_hours lightly_active_hours
##  Min.   : 403   Min.   :0.0000    Min.   :0.0000      Min.   :0.06667     
##  1st Qu.:1994   1st Qu.:0.0000    1st Qu.:0.0000      1st Qu.:2.93333     
##  Median :2225   Median :0.2000    Median :0.2167      Median :3.78333     
##  Mean   :2467   Mean   :0.4666    Mean   :0.2766      Mean   :3.81077     
##  3rd Qu.:3004   3rd Qu.:0.6333    3rd Qu.:0.4833      3rd Qu.:4.61667     
##  Max.   :4900   Max.   :3.5000    Max.   :1.2333      Max.   :8.63333     
##  sedentary_hours    total_active_hours   heartbeat     total_sleep_records
##  Min.   : 0.03333   Min.   : 0.10      Min.   : 58.0   Min.   :1.000      
##  1st Qu.:10.35000   1st Qu.:15.35      1st Qu.: 65.0   1st Qu.:1.000      
##  Median :11.78333   Median :16.42      Median : 70.0   Median :1.000      
##  Mean   :11.69954   Mean   :16.25      Mean   : 70.9   Mean   :1.122      
##  3rd Qu.:13.01667   3rd Qu.:17.37      3rd Qu.: 75.0   3rd Qu.:1.000      
##  Max.   :20.36667   Max.   :23.30      Max.   :100.0   Max.   :3.000      
##  total_hours_asleep total_hours_in_bed
##  Min.   : 0.9667    Min.   : 1.017    
##  1st Qu.: 6.5167    1st Qu.: 6.883    
##  Median : 7.3167    Median : 7.733    
##  Mean   : 7.1041    Mean   : 7.580    
##  3rd Qu.: 8.2000    3rd Qu.: 8.767    
##  Max.   :12.9167    Max.   :14.050

5- Share:

final_table %>% drop_na() %>%  ggplot(aes(x=sedentary_hours,y=total_hours_asleep)) + geom_point(color="red")+geom_smooth()+labs(title="The relationship between Sedentary Hours and Total Hours Asleep",x="Sedentary Hours",y="Total Hours Asleep")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

final_table %>% drop_na() %>%  ggplot(aes(x=total_steps,y=calories)) + geom_point(color="blue")+geom_smooth()+labs(title="The relationship between Total Steps and burned Calories",x="Total Steps",y="calories")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

final_table %>% drop_na() %>%  ggplot(aes(x=total_hours_in_bed,y=total_hours_asleep)) + geom_point(color="purple")+geom_smooth()+labs(title="The relationship between Total Hours In Bed and Total Hours Asleep",x="Total Hours In Bed",y="Total Hours Asleep")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

final_table %>% drop_na() %>%  ggplot(aes(x=total_distance,y=total_steps)) + geom_point(color="green")+geom_smooth()+labs(title="The relationship between Total Distance and Total Steps",x="Total Distance",y="Total Steps")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

5- Act:

- Recommendations:
  • Since the number of steps taken per day is less than the CDC recommended 10 000 steps (this could be due to people getting busy at work and forgetting to move around), Bellabeat should send notifications every hour or so on the chosen smart device to remind their customers to get up and walk for a minimum of 10 minutes.
  • Since the total number of very active hours per day is low, Bellabeat should create a program where people earn money, or coins they can use to buy premium products, whenever they hit the recommended hours and intensity of activity. This could encourage people to exercise more, which then, as we saw in our analysis, will help them burn more calories, and will improve their quality and time of sleep.

I’m self-learning data analysis and this is my first time using R or trying to clean/analyze data. If there are any notes on my code or mistakes in it, please do not hesitate to let me know. I would love feedback on how I can improve. Thank you for anyone who checked out my case study.