This case study is my capstone project for the Google Data Analytics course. This project is on Bellabeat, a high-tech company that manufactures health-focused smart products.
Bellabeat is a successful small company, but they have the potential to become a larger player in the global smart device market. Urška Sršen, cofounder and Chief Creative Officer of Bellabeat, believes that analyzing smart device fitness data could help unlock new growth opportunities for the company. I have been tasked as a marketing analyst to gain insight into how people are using their smart devices and come up with recommendations for how these trends can inform Bellabeat marketing strategy.
Urška Sršen: Bellabeat’s cofounder and Chief Creative Officer
Sando Mur: Mathematician and Bellabeat’s cofounder; key member of the Bellabeat executive team
Bellabeat marketing analytics team: A team of data analysts responsible for collecting, analyzing, and reporting data that helps guide Bellabeat’s marketing strategy.
To define new marketing strategies, three things are key, drawing on data samples from the FitBit Fitness Tracker: identifying the trends in smart device usage, how those trends apply to Bellabeat customers, and how they could influence Bellabeat marketing strategy.
The user data from the FitBit Fitness Tracker has been merged and categorised into different sections (daily activity, daily calories, daily intensities, daily steps, etc.) covering two months, April and May 2016. The dataset has been made publicly available through Mobius. It contains personal fitness tracker data from thirty Fitbit users who consented to the submission of their personal tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring.
R
Installing correct packages
install.packages("tidyverse")
## Installing package into '/home/rstudio-user/R/x86_64-pc-linux-gnu-library/4.0'
## (as 'lib' is unspecified)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5 ✓ purrr 0.3.4
## ✓ tibble 3.1.1 ✓ dplyr 1.0.8
## ✓ tidyr 1.2.0 ✓ stringr 1.4.0
## ✓ readr 1.4.0 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
install.packages("dplyr")
## Installing package into '/home/rstudio-user/R/x86_64-pc-linux-gnu-library/4.0'
## (as 'lib' is unspecified)
library(dplyr)
install.packages("tidyr")
## Installing package into '/home/rstudio-user/R/x86_64-pc-linux-gnu-library/4.0'
## (as 'lib' is unspecified)
library(tidyr)
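The attach message above shows that `library(tidyverse)` already loads dplyr and tidyr, so the separate installs are not strictly required. As a minimal sketch, assuming base R only, the helper below reports which packages are missing so that only those get installed. `missing_packages` is a hypothetical name, not part of any package.

```r
# Hypothetical helper: return the names of packages that are not installed.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  unname(pkgs[!installed])
}

# Base packages ship with R, so nothing is reported as missing here.
print(missing_packages(c("stats", "utils")))

# Possible usage (left commented out so the sketch never installs anything):
# to_install <- missing_packages(c("tidyverse"))
# if (length(to_install) > 0) install.packages(to_install)
```

Checking first avoids reinstalling packages on every run of the notebook.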
daily_activity <- read.csv("dailyActivity_merged.csv")
sleep_day <- read.csv("sleepDay_merged.csv")
str(daily_activity)
## 'data.frame': 940 obs. of 15 variables:
## $ Id : num 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
## $ ActivityDate : chr "4/12/2016" "4/13/2016" "4/14/2016" "4/15/2016" ...
## $ TotalSteps : int 13162 10735 10460 9762 12669 9705 13019 15506 10544 9819 ...
## $ TotalDistance : num 8.5 6.97 6.74 6.28 8.16 ...
## $ TrackerDistance : num 8.5 6.97 6.74 6.28 8.16 ...
## $ LoggedActivitiesDistance: num 0 0 0 0 0 0 0 0 0 0 ...
## $ VeryActiveDistance : num 1.88 1.57 2.44 2.14 2.71 ...
## $ ModeratelyActiveDistance: num 0.55 0.69 0.4 1.26 0.41 ...
## $ LightActiveDistance : num 6.06 4.71 3.91 2.83 5.04 ...
## $ SedentaryActiveDistance : num 0 0 0 0 0 0 0 0 0 0 ...
## $ VeryActiveMinutes : int 25 21 30 29 36 38 42 50 28 19 ...
## $ FairlyActiveMinutes : int 13 19 11 34 10 20 16 31 12 8 ...
## $ LightlyActiveMinutes : int 328 217 181 209 221 164 233 264 205 211 ...
## $ SedentaryMinutes : int 728 776 1218 726 773 539 1149 775 818 838 ...
## $ Calories : int 1985 1797 1776 1745 1863 1728 1921 2035 1786 1775 ...
str(sleep_day)
## 'data.frame': 413 obs. of 5 variables:
## $ Id : num 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
## $ SleepDay : chr "4/12/2016 12:00:00 AM" "4/13/2016 12:00:00 AM" "4/15/2016 12:00:00 AM" "4/16/2016 12:00:00 AM" ...
## $ TotalSleepRecords : int 1 2 1 2 1 1 1 1 1 1 ...
## $ TotalMinutesAsleep: int 327 384 412 340 700 304 360 325 361 430 ...
## $ TotalTimeInBed : int 346 407 442 367 712 320 377 364 384 449 ...
colnames(daily_activity)
## [1] "Id" "ActivityDate"
## [3] "TotalSteps" "TotalDistance"
## [5] "TrackerDistance" "LoggedActivitiesDistance"
## [7] "VeryActiveDistance" "ModeratelyActiveDistance"
## [9] "LightActiveDistance" "SedentaryActiveDistance"
## [11] "VeryActiveMinutes" "FairlyActiveMinutes"
## [13] "LightlyActiveMinutes" "SedentaryMinutes"
## [15] "Calories"
colnames(sleep_day)
## [1] "Id" "SleepDay" "TotalSleepRecords"
## [4] "TotalMinutesAsleep" "TotalTimeInBed"
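Since the dataset is described as covering thirty users over April and May 2016, it can be worth confirming both figures directly from the loaded table. The sketch below uses a small invented data frame in place of `daily_activity`. The Ids and dates are made up for illustration.

```r
# Toy stand-in for daily_activity; these rows are invented for illustration.
toy_activity <- data.frame(
  Id = c(1001, 1001, 1002),
  ActivityDate = c("4/12/2016", "4/13/2016", "4/12/2016"),
  stringsAsFactors = FALSE
)

# Number of distinct users in the table.
n_users <- length(unique(toy_activity$Id))

# First and last recorded dates, parsed from month/day/year text.
date_range <- range(as.Date(toy_activity$ActivityDate, format = "%m/%d/%Y"))

print(n_users)
print(date_range)
```

On the real data, the same two lines would be applied to `daily_activity$Id` and `daily_activity$ActivityDate`.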
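# Total tracked minutes per day; note that this sum includes SedentaryMinutes, not only active time
daily_activity$Total_Active_Minutes <- daily_activity$VeryActiveMinutes + daily_activity$FairlyActiveMinutes + daily_activity$LightlyActiveMinutes + daily_activity$SedentaryMinutes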
daily_activity$Total_Active_Hours <- round(daily_activity$Total_Active_Minutes/60)
daily_activity$Dates <- as.Date(daily_activity$ActivityDate, "%m/%d/%Y")
names(daily_activity) <- c( "Id", "Activity_Date", "Total_Steps", "Total_Distance","Tracker_Distance", "Logged_Activities_Distance", "Very_Active_Distance", "Moderately_Active_Distance", "Light_Active_Distance", "Sedentary_Active_Distance", "Very_Active_Minutes","Fairly_Active_Minutes", "Lightly_Active_Minutes", "Sedentary_Minutes", "Calories", "Total_Active_Minutes", "Total_Active_Hours", "Dates")
sleep_day$Total_Hours_Asleep <- round(sleep_day$TotalMinutesAsleep/60)
sleep_day$Dates <- as.Date(sleep_day$SleepDay, "%m/%d/%Y")
names(sleep_day) <- c("Id", "Sleep_Day", "Total_Sleep_Records", "Total_Minutes_Asleep", "Total_Time_In_Bed", "Total_Hours_Asleep", "Dates")
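Because the total computed above adds sedentary minutes to the three activity levels, it behaves more like total tracked time than time spent moving. If time spent moving is what is wanted, one alternative is to sum only the three active columns; this is an assumption about intent, not the method used in this analysis. The sketch uses the first record from the `str(daily_activity)` output.

```r
# First record from the str(daily_activity) output above.
toy <- data.frame(
  VeryActiveMinutes = 25,
  FairlyActiveMinutes = 13,
  LightlyActiveMinutes = 328,
  SedentaryMinutes = 728
)

# Sum used in the analysis: all four columns, sedentary time included.
tracked_minutes <- with(toy, VeryActiveMinutes + FairlyActiveMinutes +
                          LightlyActiveMinutes + SedentaryMinutes)

# Alternative (assumption): only the three active levels.
active_minutes <- with(toy, VeryActiveMinutes + FairlyActiveMinutes +
                         LightlyActiveMinutes)

tracked_hours <- round(tracked_minutes / 60)    # 1094 minutes -> 18 hours
active_hours <- round(active_minutes / 60, 1)   # 366 minutes -> 6.1 hours

print(c(tracked_hours = tracked_hours, active_hours = active_hours))
```

Keeping one decimal place in the alternative also avoids losing up to half an hour per day to rounding.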
daily_activity_b <- daily_activity %>%
  select(Id, Dates, Total_Steps, Total_Distance, Total_Active_Hours, Calories)
sleep_day_b <- sleep_day %>%
  select(Id, Dates, Total_Hours_Asleep)
Merged_data <- daily_activity_b %>% left_join(sleep_day_b)
## Joining, by = c("Id", "Dates")
str(Merged_data)
## 'data.frame': 943 obs. of 7 variables:
## $ Id : num 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
## $ Dates : Date, format: "2016-04-12" "2016-04-13" ...
## $ Total_Steps : int 13162 10735 10460 9762 12669 9705 13019 15506 10544 9819 ...
## $ Total_Distance : num 8.5 6.97 6.74 6.28 8.16 ...
## $ Total_Active_Hours: num 18 17 24 17 17 13 24 19 18 18 ...
## $ Calories : int 1985 1797 1776 1745 1863 1728 1921 2035 1786 1775 ...
## $ Total_Hours_Asleep: num 5 6 NA 7 6 12 NA 5 6 5 ...
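The merged table has 943 rows although `daily_activity` has 940. A left join keeps every row of the left table, so extra rows can only come from keys that occur more than once in the right table. The sketch below reproduces the effect on invented data and flags such keys with `duplicated()` before joining. Base R's `merge(..., all.x = TRUE)` stands in for `left_join()` here so that the example runs without extra packages.

```r
# Invented data: the right table repeats the key (Id = 1, 2016-04-12).
left <- data.frame(
  Id = c(1, 1, 2),
  Dates = as.Date(c("2016-04-12", "2016-04-13", "2016-04-12")),
  Total_Steps = c(100, 200, 300)
)
right <- data.frame(
  Id = c(1, 1, 2),
  Dates = as.Date(c("2016-04-12", "2016-04-12", "2016-04-12")),
  Total_Hours_Asleep = c(5, 5, 7)
)

# Keys that appear more than once in the right table.
dup_keys <- right[duplicated(right[, c("Id", "Dates")]), c("Id", "Dates")]
print(dup_keys)

# Left join: 4 rows from 3, because the duplicated key matches twice.
joined <- merge(left, right, by = c("Id", "Dates"), all.x = TRUE)
print(nrow(joined))

# Removing exact duplicate rows first restores one row per left record.
deduped <- merge(left, unique(right), by = c("Id", "Dates"), all.x = TRUE)
print(nrow(deduped))
```

Removing duplicates before the join, rather than after, makes the row count easier to reason about at each step.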
Merged_data <- distinct(Merged_data) # Remove duplicate rows introduced by the join
Merged_data <- drop_na(Merged_data) # Remove rows with missing values (days without a sleep record)
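Dropping rows with missing values removes every activity day that has no matching sleep record, so the statistics that follow describe only days on which sleep was logged. A short sketch, on invented data, of reporting how many rows and users survive this step. Base R's `complete.cases()` stands in for `drop_na()`.

```r
# Invented merged table: NA marks days without a sleep record.
toy_merged <- data.frame(
  Id = c(1, 1, 2, 2, 3),
  Total_Steps = c(100, 200, 300, 400, 500),
  Total_Hours_Asleep = c(5, NA, 7, 7, NA)
)

# Keep only rows with no missing values.
kept <- toy_merged[complete.cases(toy_merged), ]

rows_before <- nrow(toy_merged)
rows_after <- nrow(kept)
users_before <- length(unique(toy_merged$Id))
users_after <- length(unique(kept$Id))

cat("rows:", rows_before, "->", rows_after, "\n")
cat("users:", users_before, "->", users_after, "\n")
```

Reporting these counts next to the summary makes clear how much of the original sample the conclusions rest on.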
Merged_data %>%
  select(Total_Steps, Total_Active_Hours, Total_Distance, Total_Hours_Asleep, Calories) %>%
  summary()
## Total_Steps Total_Active_Hours Total_Distance Total_Hours_Asleep
## Min. : 17 Min. : 0.0 Min. : 0.010 Min. : 1.00
## 1st Qu.: 5189 1st Qu.:15.0 1st Qu.: 3.592 1st Qu.: 6.00
## Median : 8913 Median :16.0 Median : 6.270 Median : 7.00
## Mean : 8515 Mean :16.2 Mean : 6.012 Mean : 6.99
## 3rd Qu.:11370 3rd Qu.:17.0 3rd Qu.: 8.005 3rd Qu.: 8.00
## Max. :22770 Max. :23.0 Max. :17.540 Max. :13.00
## Calories
## Min. : 257
## 1st Qu.:1841
## Median :2207
## Mean :2389
## 3rd Qu.:2920
## Max. :4900
ggplot(data = Merged_data) +
  geom_smooth(mapping = aes(x = Total_Active_Hours, y = Calories)) +
  labs(title = "The relationship between total hours of activity and calories burned")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
ggplot(data = Merged_data) +
  geom_smooth(mapping = aes(x = Total_Distance, y = Total_Steps)) +
  labs(title = "The relationship between total distance and total steps taken")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
ggplot(data = Merged_data) +
  geom_smooth(mapping = aes(x = Total_Steps, y = Calories)) +
  labs(title = "The relationship between total steps taken and calories burned")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
ggplot(data = Merged_data) +
  geom_smooth(mapping = aes(x = Total_Hours_Asleep, y = Calories)) +
  labs(title = "The relationship between total hours slept and calories burned")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
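A smoothed curve shows the shape of a trend but not its strength. As a complement, a correlation coefficient puts a number on each relationship. The sketch below uses invented values; on the real data, the columns of `Merged_data` would replace the toy columns.

```r
# Invented values for illustration only; not taken from the dataset.
toy <- data.frame(
  Total_Steps = c(2000, 4000, 6000, 8000, 10000),
  Calories = c(1500, 1800, 2000, 2300, 2600),
  Total_Hours_Asleep = c(9, 8, 7, 7, 5)
)

# Pearson correlation: the sign gives direction, the magnitude gives strength.
steps_vs_calories <- cor(toy$Total_Steps, toy$Calories)
sleep_vs_calories <- cor(toy$Total_Hours_Asleep, toy$Calories)

print(round(c(steps = steps_vs_calories, sleep = sleep_vs_calories), 2))
```

A value near 1 or -1 indicates a strong linear relationship, while a value near 0 suggests the trend in the plot may be weak.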
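The first three visualizations above show positive relationships, from which the following can be inferred: calories burned increase with total hours of activity and with total steps taken, and total steps increase with total distance.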
The visualization of hours slept against calories burned, in contrast, shows a negative relationship. This is unexpected: according to research, more hours slept at night should go with more calories burned. The negative trend in the plot could therefore point to an error in how the data was collected.
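Before attributing the negative trend to a collection error, it may be worth checking whether it holds within individual users or only appears when all users are pooled. Users who differ in baseline calorie burn can produce a pooled trend that no single user shows. The sketch below illustrates this possibility on invented data; it is not a finding about the dataset, and `per_user_cor` is a hypothetical name.

```r
# Invented data: within each user, more sleep goes with more calories,
# but user "A" burns far more calories overall and sleeps less.
toy <- data.frame(
  Id = rep(c("A", "B"), each = 3),
  Total_Hours_Asleep = c(6, 7, 8, 8, 9, 10),
  Calories = c(3000, 3100, 3200, 1800, 1900, 2000)
)

# Correlation over all rows, ignoring who the rows belong to.
pooled_cor <- cor(toy$Total_Hours_Asleep, toy$Calories)

# Correlation computed separately for each user.
per_user_cor <- sapply(
  split(toy, toy$Id),
  function(d) cor(d$Total_Hours_Asleep, d$Calories)
)

print(pooled_cor)
print(per_user_cor)
```

If the per-user values on the real data were mostly positive while the pooled value is negative, differences between users rather than faulty data would be the more likely explanation.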