The following case study is the final capstone project in the Google
Data Analytics Professional Certificate. The case study is on Bellabeat,
a high-tech wellness company that manufactures health-focused products
made specifically for women. We are asked to analyze smart
devices-fitness data to gain insight into how consumers are using their
smart devices, and to discover trends that can guide a marketing
strategy for the company to grow.
For this case study, I will be following the 6 steps of the data
analysis process: Ask, Prepare, Process, Analyze, Share, and Act.
Analyze Fitbit users’ data to find trends and insights into how consumers use their smart devices.
-Urska Srsen: Co-founder and CCO of Bellabeat.
-Sando Mur: Mathematician and Co-founder of Bellabeat.
-Bellabeat marketing analytics team: A team of data analysts
that uses their analysis to guide the marketing strategy.
Even though the survey only has 30 participants, some of the data is
still too big to use in Microsoft Excel, since it’s represented in long
format.
So I decided to use R to clean and analyze the data.
I’ve also decided to use the following datasets in my analysis:
daily_activity, heartrate_seconds, and
sleep_day.
library(readr)
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.6 v dplyr 1.0.9
## v tibble 3.1.7 v stringr 1.4.0
## v tidyr 1.2.0 v forcats 0.5.1
## v purrr 0.3.4
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(ggplot2)
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(janitor)
##
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
library(dplyr)
daily_activity <- read_csv("C:\\Users\\Lenovo\\Desktop\\Bellabeat\\Fitbit Data\\dailyActivity_merged.csv")
## Rows: 940 Columns: 15
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (1): ActivityDate
## dbl (14): Id, TotalSteps, TotalDistance, TrackerDistance, LoggedActivitiesDi...
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
heartrate_seconds<- read_csv("C:\\Users\\Lenovo\\Desktop\\Bellabeat\\Fitbit Data\\heartrate_seconds_merged.csv")
## Rows: 2483658 Columns: 3
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (1): Time
## dbl (2): Id, Value
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
sleep_day <- read_csv("C:\\Users\\Lenovo\\Desktop\\Bellabeat\\Fitbit Data\\sleepDay_merged.csv")
## Rows: 413 Columns: 5
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (1): SleepDay
## dbl (4): Id, TotalSleepRecords, TotalMinutesAsleep, TotalTimeInBed
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
str(daily_activity)
## spec_tbl_df [940 x 15] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Id : num [1:940] 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
## $ ActivityDate : chr [1:940] "4/12/2016" "4/13/2016" "4/14/2016" "4/15/2016" ...
## $ TotalSteps : num [1:940] 13162 10735 10460 9762 12669 ...
## $ TotalDistance : num [1:940] 8.5 6.97 6.74 6.28 8.16 ...
## $ TrackerDistance : num [1:940] 8.5 6.97 6.74 6.28 8.16 ...
## $ LoggedActivitiesDistance: num [1:940] 0 0 0 0 0 0 0 0 0 0 ...
## $ VeryActiveDistance : num [1:940] 1.88 1.57 2.44 2.14 2.71 ...
## $ ModeratelyActiveDistance: num [1:940] 0.55 0.69 0.4 1.26 0.41 ...
## $ LightActiveDistance : num [1:940] 6.06 4.71 3.91 2.83 5.04 ...
## $ SedentaryActiveDistance : num [1:940] 0 0 0 0 0 0 0 0 0 0 ...
## $ VeryActiveMinutes : num [1:940] 25 21 30 29 36 38 42 50 28 19 ...
## $ FairlyActiveMinutes : num [1:940] 13 19 11 34 10 20 16 31 12 8 ...
## $ LightlyActiveMinutes : num [1:940] 328 217 181 209 221 164 233 264 205 211 ...
## $ SedentaryMinutes : num [1:940] 728 776 1218 726 773 ...
## $ Calories : num [1:940] 1985 1797 1776 1745 1863 ...
## - attr(*, "spec")=
## .. cols(
## .. Id = col_double(),
## .. ActivityDate = col_character(),
## .. TotalSteps = col_double(),
## .. TotalDistance = col_double(),
## .. TrackerDistance = col_double(),
## .. LoggedActivitiesDistance = col_double(),
## .. VeryActiveDistance = col_double(),
## .. ModeratelyActiveDistance = col_double(),
## .. LightActiveDistance = col_double(),
## .. SedentaryActiveDistance = col_double(),
## .. VeryActiveMinutes = col_double(),
## .. FairlyActiveMinutes = col_double(),
## .. LightlyActiveMinutes = col_double(),
## .. SedentaryMinutes = col_double(),
## .. Calories = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
str(heartrate_seconds)
## spec_tbl_df [2,483,658 x 3] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Id : num [1:2483658] 2.02e+09 2.02e+09 2.02e+09 2.02e+09 2.02e+09 ...
## $ Time : chr [1:2483658] "4/12/2016 7:21:00 AM" "4/12/2016 7:21:05 AM" "4/12/2016 7:21:10 AM" "4/12/2016 7:21:20 AM" ...
## $ Value: num [1:2483658] 97 102 105 103 101 95 91 93 94 93 ...
## - attr(*, "spec")=
## .. cols(
## .. Id = col_double(),
## .. Time = col_character(),
## .. Value = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
str(sleep_day)
## spec_tbl_df [413 x 5] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Id : num [1:413] 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
## $ SleepDay : chr [1:413] "4/12/2016 12:00:00 AM" "4/13/2016 12:00:00 AM" "4/15/2016 12:00:00 AM" "4/16/2016 12:00:00 AM" ...
## $ TotalSleepRecords : num [1:413] 1 2 1 2 1 1 1 1 1 1 ...
## $ TotalMinutesAsleep: num [1:413] 327 384 412 340 700 304 360 325 361 430 ...
## $ TotalTimeInBed : num [1:413] 346 407 442 367 712 320 377 364 384 449 ...
## - attr(*, "spec")=
## .. cols(
## .. Id = col_double(),
## .. SleepDay = col_character(),
## .. TotalSleepRecords = col_double(),
## .. TotalMinutesAsleep = col_double(),
## .. TotalTimeInBed = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
This shows us that:
- ActivityDate in the dataset daily_activity is a
character: needs to be converted to a Date object.
- Time in the dataset heartrate_seconds is a
character: needs to be converted to a Time object.
- SleepDay in the dataset sleep_day is a character:
needs to be converted to a DateTime object.
n_distinct(daily_activity$Id)
## [1] 33
n_distinct(heartrate_seconds$Id)
## [1] 14
n_distinct(sleep_day$Id)
## [1] 24
We notice that:
- daily_activity has 33 unique identification numbers. Since
the survey has only 30 participants, daily_activity has 3 more
participants than it should. This could be due to some people making
more than one contribution in the survey.
- heartrate_seconds and sleep_day have only 14 and 24
unique identification numbers, respectively. This means that some people
either didn’t track their heartbeat data, or the number of hours that
they’ve spent asleep per day, or both.
sum(duplicated(daily_activity))
## [1] 0
sum(duplicated(heartrate_seconds))
## [1] 0
sum(duplicated(sleep_day))
## [1] 3
sleep_day <- sleep_day %>% distinct()
sum(is.na(daily_activity))
## [1] 0
sum(is.na(heartrate_seconds))
## [1] 0
sum(is.na(sleep_day))
## [1] 0
colSums(daily_activity==0)
## Id ActivityDate TotalSteps
## 0 0 77
## TotalDistance TrackerDistance LoggedActivitiesDistance
## 78 78 908
## VeryActiveDistance ModeratelyActiveDistance LightActiveDistance
## 413 386 85
## SedentaryActiveDistance VeryActiveMinutes FairlyActiveMinutes
## 858 409 384
## LightlyActiveMinutes SedentaryMinutes Calories
## 84 1 4
colSums(heartrate_seconds==0)
## Id Time Value
## 0 0 0
colSums(sleep_day==0)
## Id SleepDay TotalSleepRecords TotalMinutesAsleep
## 0 0 0 0
## TotalTimeInBed
## 0
daily_activity <- clean_names(daily_activity)
heartrate_seconds <- clean_names(heartrate_seconds)
sleep_day <- clean_names(sleep_day)
daily_activity %>% mutate(total_distance_calculated=very_active_distance+moderately_active_distance+light_active_distance+sedentary_active_distance) %>% select(c(total_distance,total_distance_calculated))
## # A tibble: 940 x 2
## total_distance total_distance_calculated
## <dbl> <dbl>
## 1 8.5 8.49
## 2 6.97 6.97
## 3 6.74 6.75
## 4 6.28 6.23
## 5 8.16 8.16
## 6 6.48 6.48
## 7 8.59 8.60
## 8 9.88 9.88
## 9 6.68 6.68
## 10 6.34 6.34
## # ... with 930 more rows
Running the code above shows that some values between the total_distance_calculated (now), and the total_distance already given in the dataset, slightly differ. However, this error is small and doesn’t affect the analysis results.
Now, the column ActivityDate needs to be converted from Character to Date, as stated before. I’ll be using the lubridate package (already loaded) to make that conversion.
daily_activity$activity_date <- mdy(daily_activity$activity_date)
sum(daily_activity$activity_date<"2016-04-12"|daily_activity$activity_date>"2016-05-12")
## [1] 0
daily_activity <- daily_activity %>% mutate(very_active_hours=very_active_minutes/60,fairly_active_hours=fairly_active_minutes/60,lightly_active_hours=lightly_active_minutes/60,sedentary_hours=sedentary_minutes/60)
daily_activity <- daily_activity %>% mutate(total_active_hours=very_active_hours+fairly_active_hours+lightly_active_hours+sedentary_hours)
daily_activity <- daily_activity %>% select(-c(very_active_minutes,fairly_active_minutes,lightly_active_minutes,sedentary_minutes))
First, (to conserve the AM/PM data), before separating the time column, it needs to be converted from Character to a DateTime object. I will be using the lubridate package (already loaded).
Then, using the format() function, I will omit all seconds from the given time (since a) they add no important value to the data, and b) to later be able to calculate the average heartbeat per minute, and then group together the average heartbeat per day, for each id).
heartrate_seconds$time <- mdy_hms(heartrate_seconds$time)
heartrate_seconds$time <- format(heartrate_seconds$time,'%Y-%m-%d %H:%M')
heartrate_seconds <- heartrate_seconds %>% separate(time,c("date","time"),' ')
str(heartrate_seconds)
## tibble [2,483,658 x 4] (S3: tbl_df/tbl/data.frame)
## $ id : num [1:2483658] 2.02e+09 2.02e+09 2.02e+09 2.02e+09 2.02e+09 ...
## $ date : chr [1:2483658] "2016-04-12" "2016-04-12" "2016-04-12" "2016-04-12" ...
## $ time : chr [1:2483658] "07:21" "07:21" "07:21" "07:21" ...
## $ value: num [1:2483658] 97 102 105 103 101 95 91 93 94 93 ...
heartrate_seconds$date <- ymd(heartrate_seconds$date)
heartrate_seconds$time <- hm(heartrate_seconds$time)
sum(heartrate_seconds$date<"2016-04-12"||heartrate_seconds$date>"2016-05-12")
## [1] 0
heartrate <- heartrate_seconds %>% group_by(id,date,time) %>% drop_na() %>% summarize(average_heartbeat = mean(value))
## `summarise()` has grouped output by 'id', 'date'. You can override using the
## `.groups` argument.
heartrate <- heartrate %>% select(-time)
heartrate <- heartrate %>% group_by(id,date) %>% drop_na() %>% summarise(heartbeat=mean(average_heartbeat))
## `summarise()` has grouped output by 'id'. You can override using the `.groups`
## argument.
heartrate$heartbeat <- as.integer(heartrate$heartbeat)
head(heartrate)
## # A tibble: 6 x 3
## # Groups: id [1]
## id date heartbeat
## <dbl> <date> <int>
## 1 2022484408 2016-04-12 74
## 2 2022484408 2016-04-13 78
## 3 2022484408 2016-04-14 70
## 4 2022484408 2016-04-15 78
## 5 2022484408 2016-04-16 74
## 6 2022484408 2016-04-17 82
I will split the sleep_day column into 2 separate columns: sleep_date and sleep_time.
Checking with str() shows us that sleep_date needs to be changed from Character to Date, and sleep_time from Character to Time (using the lubridate package).
sleep_day <- sleep_day %>% separate(sleep_day,c("sleep_date","sleep_time"),sep=' ')
## Warning: Expected 2 pieces. Additional pieces discarded in 410 rows [1, 2, 3, 4,
## 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].
sleep_day$sleep_date <- mdy(sleep_day$sleep_date)
sleep_day$sleep_time <- hms(sleep_day$sleep_time)
sleep_day <- mutate(sleep_day,total_hours_asleep=total_minutes_asleep/60)
sleep_day <- mutate(sleep_day,total_hours_in_bed=total_time_in_bed/60)
sleep_day <- sleep_day %>% select(-c(total_minutes_asleep,total_time_in_bed))
sum(sleep_day$total_hours_asleep>=24|sleep_day$total_hours_asleep<=0,na.rm=TRUE)
## [1] 0
sum(sleep_day$total_hours_in_bed>=24||sleep_day$total_hours_in_bed<=0,na.rm=TRUE)
## [1] 0
sum(sleep_day$sleep_date<"2016-04-12"||sleep_day$sleep_date>"2016-05-12")
## [1] 0
sum(sleep_day$total_hours_asleep>sleep_day$total_hours_in_bed,na.rm=TRUE)
## [1] 0
join_table <- left_join(daily_activity,heartrate,by=c("id","activity_date"="date"))
join_table <- left_join(join_table,sleep_day,by=c("id","activity_date"="sleep_date"))
final_table <- select(join_table,-c(tracker_distance,logged_activities_distance,sleep_time))
final_table %>% select(-c(id,activity_date)) %>% drop_na() %>% summary()
## total_steps total_distance very_active_distance
## Min. : 42 Min. : 0.030 Min. :0.000
## 1st Qu.: 5245 1st Qu.: 3.680 1st Qu.:0.000
## Median : 9148 Median : 6.380 Median :0.670
## Mean : 8522 Mean : 6.021 Mean :1.429
## 3rd Qu.:11135 3rd Qu.: 7.920 3rd Qu.:2.420
## Max. :22770 Max. :17.540 Max. :9.890
## moderately_active_distance light_active_distance sedentary_active_distance
## Min. :0.0000 Min. :0.030 Min. :0.000000
## 1st Qu.:0.0000 1st Qu.:2.670 1st Qu.:0.000000
## Median :0.4000 Median :3.770 Median :0.000000
## Mean :0.6475 Mean :3.894 Mean :0.002099
## 3rd Qu.:1.0000 3rd Qu.:5.270 3rd Qu.:0.000000
## Max. :2.7700 Max. :9.480 Max. :0.110000
## calories very_active_hours fairly_active_hours lightly_active_hours
## Min. : 403 Min. :0.0000 Min. :0.0000 Min. :0.06667
## 1st Qu.:1994 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:2.93333
## Median :2225 Median :0.2000 Median :0.2167 Median :3.78333
## Mean :2467 Mean :0.4666 Mean :0.2766 Mean :3.81077
## 3rd Qu.:3004 3rd Qu.:0.6333 3rd Qu.:0.4833 3rd Qu.:4.61667
## Max. :4900 Max. :3.5000 Max. :1.2333 Max. :8.63333
## sedentary_hours total_active_hours heartbeat total_sleep_records
## Min. : 0.03333 Min. : 0.10 Min. : 58.0 Min. :1.000
## 1st Qu.:10.35000 1st Qu.:15.35 1st Qu.: 65.0 1st Qu.:1.000
## Median :11.78333 Median :16.42 Median : 70.0 Median :1.000
## Mean :11.69954 Mean :16.25 Mean : 70.9 Mean :1.122
## 3rd Qu.:13.01667 3rd Qu.:17.37 3rd Qu.: 75.0 3rd Qu.:1.000
## Max. :20.36667 Max. :23.30 Max. :100.0 Max. :3.000
## total_hours_asleep total_hours_in_bed
## Min. : 0.9667 Min. : 1.017
## 1st Qu.: 6.5167 1st Qu.: 6.883
## Median : 7.3167 Median : 7.733
## Mean : 7.1041 Mean : 7.580
## 3rd Qu.: 8.2000 3rd Qu.: 8.767
## Max. :12.9167 Max. :14.050
I’m self-learning data analysis and this is my first time using R or trying to clean/analyze data. If there are any notes on my code or mistakes in it, please do not hesitate to let me know. I would love feedback on how I can improve. Thank you for anyone who checked out my case study.