Introduction

This case study is my capstone project for the Google Data Analytics course. It focuses on Bellabeat, a high-tech company that manufactures health-focused smart products.

Scenario

Bellabeat is a successful small company with the potential to become a larger player in the global smart device market. Urška Sršen, cofounder and Chief Creative Officer of Bellabeat, believes that analyzing smart device fitness data could help unlock new growth opportunities for the company. As a marketing analyst, I have been tasked with gaining insight into how people use their smart devices and recommending how these trends can inform Bellabeat's marketing strategy.

Stakeholders

Urška Sršen: Bellabeat’s cofounder and Chief Creative Officer

Sando Mur: Mathematician and Bellabeat’s cofounder; key member of the Bellabeat executive team

Bellabeat marketing analytics team: A team of data analysts responsible for collecting, analyzing, and reporting data that helps guide Bellabeat’s marketing strategy.

Identify the Business Task

To define new marketing strategies, I use sample data from FitBit Fitness Tracker to identify three things: the trends in smart device usage, how those trends apply to Bellabeat customers, and how they could influence Bellabeat's marketing strategy.

Data Sources

The user data from FitBit Fitness Tracker has been merged and categorised into different sections: daily activity, daily calories, daily intensities, daily steps, etc. It covers two months, April and May 2016. The dataset has been made publicly available through Mobius. It contains personal fitness tracker data from thirty Fitbit users who consented to the submission of their personal tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring.

Documentation, Cleaning and Preparation of Data for Analysis

Tools for Analysis

R

Preparing the Data

Installing correct packages

install.packages("tidyverse")
## Installing package into '/home/rstudio-user/R/x86_64-pc-linux-gnu-library/4.0'
## (as 'lib' is unspecified)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.1     ✓ dplyr   1.0.8
## ✓ tidyr   1.2.0     ✓ stringr 1.4.0
## ✓ readr   1.4.0     ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
install.packages("dplyr")
## Installing package into '/home/rstudio-user/R/x86_64-pc-linux-gnu-library/4.0'
## (as 'lib' is unspecified)
library(dplyr)
install.packages("tidyr")
## Installing package into '/home/rstudio-user/R/x86_64-pc-linux-gnu-library/4.0'
## (as 'lib' is unspecified)
library(tidyr)
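
The tidyverse output above shows that dplyr and tidyr are attached together with tidyverse, so the two extra installs are not strictly needed. As a minimal sketch, a package can be checked before it is installed again. The helper name `missing_packages` is my own and is not part of the original analysis.

```r
# Hypothetical helper (not part of the original analysis): return the
# packages from `pkgs` that are not available in the current R library.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# tidyverse already attaches dplyr and tidyr, so one entry is enough here.
to_install <- missing_packages(c("tidyverse"))
if (length(to_install) > 0) {
  message("Run install.packages() for: ", paste(to_install, collapse = ", "))
}
```

The sketch only reports what is missing; the actual `install.packages()` call is left to the reader so that nothing is downloaded unintentionally.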

Importing and loading the dataset

daily_activity <- read.csv("dailyActivity_merged.csv")

sleep_day <- read.csv("sleepDay_merged.csv")
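
With the default import, Id is read as a number, although it is really a label for a user. A possible alternative, shown here as a sketch on a two-row inline sample rather than on the real file, is to ask `read.csv()` to keep Id as text. The Id value below is a made-up placeholder.

```r
# Illustrative two-row sample in the same layout as dailyActivity_merged.csv;
# the Id value is a made-up placeholder, not taken from the dataset.
csv_text <- "Id,ActivityDate,TotalSteps
9999999999,4/12/2016,13162
9999999999,4/13/2016,10735"

# Default import: Id is parsed as a number
default_import <- read.csv(text = csv_text)

# Keeping Id as text so it behaves as a label rather than a quantity
id_as_text <- read.csv(text = csv_text, colClasses = c(Id = "character"))

class(default_import$Id)
class(id_as_text$Id)
```

The rest of this case study keeps the default numeric Id, which still works for joining because both tables are imported the same way.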

Check the data structure of the dataset

str(daily_activity)
## 'data.frame':    940 obs. of  15 variables:
##  $ Id                      : num  1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ ActivityDate            : chr  "4/12/2016" "4/13/2016" "4/14/2016" "4/15/2016" ...
##  $ TotalSteps              : int  13162 10735 10460 9762 12669 9705 13019 15506 10544 9819 ...
##  $ TotalDistance           : num  8.5 6.97 6.74 6.28 8.16 ...
##  $ TrackerDistance         : num  8.5 6.97 6.74 6.28 8.16 ...
##  $ LoggedActivitiesDistance: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ VeryActiveDistance      : num  1.88 1.57 2.44 2.14 2.71 ...
##  $ ModeratelyActiveDistance: num  0.55 0.69 0.4 1.26 0.41 ...
##  $ LightActiveDistance     : num  6.06 4.71 3.91 2.83 5.04 ...
##  $ SedentaryActiveDistance : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ VeryActiveMinutes       : int  25 21 30 29 36 38 42 50 28 19 ...
##  $ FairlyActiveMinutes     : int  13 19 11 34 10 20 16 31 12 8 ...
##  $ LightlyActiveMinutes    : int  328 217 181 209 221 164 233 264 205 211 ...
##  $ SedentaryMinutes        : int  728 776 1218 726 773 539 1149 775 818 838 ...
##  $ Calories                : int  1985 1797 1776 1745 1863 1728 1921 2035 1786 1775 ...
str(sleep_day)
## 'data.frame':    413 obs. of  5 variables:
##  $ Id                : num  1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ SleepDay          : chr  "4/12/2016 12:00:00 AM" "4/13/2016 12:00:00 AM" "4/15/2016 12:00:00 AM" "4/16/2016 12:00:00 AM" ...
##  $ TotalSleepRecords : int  1 2 1 2 1 1 1 1 1 1 ...
##  $ TotalMinutesAsleep: int  327 384 412 340 700 304 360 325 361 430 ...
##  $ TotalTimeInBed    : int  346 407 442 367 712 320 377 364 384 449 ...
colnames(daily_activity)
##  [1] "Id"                       "ActivityDate"            
##  [3] "TotalSteps"               "TotalDistance"           
##  [5] "TrackerDistance"          "LoggedActivitiesDistance"
##  [7] "VeryActiveDistance"       "ModeratelyActiveDistance"
##  [9] "LightActiveDistance"      "SedentaryActiveDistance" 
## [11] "VeryActiveMinutes"        "FairlyActiveMinutes"     
## [13] "LightlyActiveMinutes"     "SedentaryMinutes"        
## [15] "Calories"
colnames(sleep_day)
## [1] "Id"                 "SleepDay"           "TotalSleepRecords" 
## [4] "TotalMinutesAsleep" "TotalTimeInBed"
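
Beyond the structure, it can help to count the distinct users and any fully duplicated rows in each table. The sketch below does this on invented toy data; `describe_table` is a hypothetical helper of mine, and the numbers it prints say nothing about the real dataset.

```r
# Hypothetical helper: count rows, distinct users and fully duplicated rows.
describe_table <- function(df, id_col = "Id") {
  c(rows = nrow(df),
    users = length(unique(df[[id_col]])),
    duplicated_rows = sum(duplicated(df)))
}

# Toy data standing in for sleep_day; values are invented for illustration.
toy_sleep <- data.frame(
  Id = c(1, 1, 2, 2, 2),
  SleepDay = c("4/12/2016 12:00:00 AM", "4/13/2016 12:00:00 AM",
               "4/12/2016 12:00:00 AM", "4/13/2016 12:00:00 AM",
               "4/13/2016 12:00:00 AM"),
  TotalMinutesAsleep = c(327, 384, 412, 340, 340)
)

describe_table(toy_sleep)
```

Running the same helper on daily_activity and sleep_day would show whether both tables cover the same users before they are joined.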

Adding new columns to daily_activity

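# Note: this sum includes SedentaryMinutes, so it measures total tracked minutes rather than strictly active minutes
daily_activity$Total_Active_Minutes <- daily_activity$VeryActiveMinutes + daily_activity$FairlyActiveMinutes + daily_activity$LightlyActiveMinutes + daily_activity$SedentaryMinutes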

daily_activity$Total_Active_Hours <- round(daily_activity$Total_Active_Minutes/60)
daily_activity$Dates <- as.Date(daily_activity$ActivityDate, "%m/%d/%Y")
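
The total computed above adds sedentary minutes to the three active categories. The sketch below uses the first three rows from the str() output to compare that total with a sum of the active categories only. The second variant is an assumption about what "active" could mean, not the method used in this case study.

```r
# First three rows of the minute columns, copied from the str() output above
very_active    <- c(25, 21, 30)
fairly_active  <- c(13, 19, 11)
lightly_active <- c(328, 217, 181)
sedentary      <- c(728, 776, 1218)

# The sum used in this case study: includes sedentary minutes
tracked_minutes <- very_active + fairly_active + lightly_active + sedentary

# Alternative (an assumption, not the original method): active minutes only
active_minutes <- very_active + fairly_active + lightly_active

round(tracked_minutes / 60)    # 18 17 24
round(active_minutes / 60, 1)  # 6.1 4.3 3.7
```

The first result matches the Total_Active_Hours values that appear later in the merged data, which suggests the column is closer to hours of tracker wear than to hours of exercise.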

Renaming columns in daily_activity

names(daily_activity) <- c( "Id", "Activity_Date", "Total_Steps", "Total_Distance","Tracker_Distance", "Logged_Activities_Distance", "Very_Active_Distance", "Moderately_Active_Distance", "Light_Active_Distance", "Sedentary_Active_Distance", "Very_Active_Minutes","Fairly_Active_Minutes", "Lightly_Active_Minutes", "Sedentary_Minutes", "Calories", "Total_Active_Minutes", "Total_Active_Hours", "Dates")

Adding new columns to sleep_day

sleep_day$Total_Hours_Asleep <- round(sleep_day$TotalMinutesAsleep/60)
sleep_day$Dates <- as.Date(sleep_day$SleepDay, "%m/%d/%Y")
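
SleepDay carries a time of day after the date, yet the same format string works because `as.Date()` only reads the part described by the format. The sketch below checks this on two values copied from the str() output and also shows what rounding to whole hours discards.

```r
# Sample values copied from the str() output of sleep_day above
sleep_day_raw  <- c("4/15/2016 12:00:00 AM", "4/16/2016 12:00:00 AM")
minutes_asleep <- c(412, 340)

# The format string consumes only the date part; the trailing time is ignored
parsed_dates <- as.Date(sleep_day_raw, "%m/%d/%Y")
parsed_dates

# Whole hours (as in the case study) versus one decimal place
round(minutes_asleep / 60)     # 7 6
round(minutes_asleep / 60, 1)  # 6.9 5.7
```

Keeping one decimal place is only a suggestion; whole hours are what the rest of the analysis uses.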

Renaming columns in sleep_day

names(sleep_day) <- c("Id", "Sleep_Day", "Total_Sleep_Records", "Total_Minutes_Asleep", "Total_Time_In_Bed", "Total_Hours_Asleep", "Dates")
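
Typing every new name by hand works, but the names must be listed in exactly the right order. As a sketch, the same underscore style can be produced from the original names with a regular expression. The helper name `to_snake_title` is hypothetical.

```r
# Hypothetical helper: insert "_" at each lower-to-upper case boundary
to_snake_title <- function(x) gsub("([a-z])([A-Z])", "\\1_\\2", x)

original_names <- c("Id", "SleepDay", "TotalSleepRecords",
                    "TotalMinutesAsleep", "TotalTimeInBed")
to_snake_title(original_names)
```

Applied as `names(sleep_day) <- to_snake_title(names(sleep_day))`, this leaves names that already contain underscores unchanged.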

Adding relevant columns to a new table

daily_activity_b <- daily_activity %>%
  select(Id, Dates, Total_Steps, Total_Distance, Total_Active_Hours, Calories)

sleep_day_b <- sleep_day %>%
  select(Id, Dates, Total_Hours_Asleep)
Merged_data <- daily_activity_b %>% left_join(sleep_day_b)
## Joining, by = c("Id", "Dates")

Look at the data specifics

str(Merged_data)
## 'data.frame':    943 obs. of  7 variables:
##  $ Id                : num  1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ Dates             : Date, format: "2016-04-12" "2016-04-13" ...
##  $ Total_Steps       : int  13162 10735 10460 9762 12669 9705 13019 15506 10544 9819 ...
##  $ Total_Distance    : num  8.5 6.97 6.74 6.28 8.16 ...
##  $ Total_Active_Hours: num  18 17 24 17 17 13 24 19 18 18 ...
##  $ Calories          : int  1985 1797 1776 1745 1863 1728 1921 2035 1786 1775 ...
##  $ Total_Hours_Asleep: num  5 6 NA 7 6 12 NA 5 6 5 ...
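
The merged table has 943 rows although daily_activity has 940. A left join returns one row per match, so it adds rows whenever the right-hand table holds more than one row for the same Id and date. Repeated rows in sleep_day are one likely cause, though I have not verified this here. The sketch below reproduces the effect on invented toy tables, using base R `merge()` as a stand-in for `left_join()`.

```r
# Toy tables; values are invented for illustration
activity <- data.frame(
  Id = c(1, 1),
  Dates = as.Date(c("2016-04-12", "2016-04-13")),
  Total_Steps = c(13162, 10735)
)
sleep <- data.frame(
  Id = c(1, 1, 1),
  Dates = as.Date(c("2016-04-12", "2016-04-12", "2016-04-13")),
  Total_Hours_Asleep = c(5, 5, 6)
)

# Base R equivalent of a left join on Id and Dates
joined <- merge(activity, sleep, by = c("Id", "Dates"), all.x = TRUE)
nrow(joined)        # 3: the repeated sleep row is matched twice

# Removing repeated sleep rows before joining keeps one row per activity day
joined_clean <- merge(activity, unique(sleep), by = c("Id", "Dates"), all.x = TRUE)
nrow(joined_clean)  # 2
```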

Cleaning the Data

Merged_data <- distinct(Merged_data)  # Remove duplicate rows

Merged_data <- drop_na(Merged_data)  # Remove rows with any missing value, including days with no matching sleep record
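
Both cleaning steps remove rows, so it is worth recording how many each one drops. The sketch below does the accounting on an invented toy table, with base R `unique()` and `complete.cases()` standing in for `distinct()` and `drop_na()`.

```r
# Toy merged table; values are invented for illustration
merged <- data.frame(
  Id = c(1, 1, 1, 2),
  Total_Steps = c(13162, 13162, 10460, 9762),
  Total_Hours_Asleep = c(5, 5, NA, 7)
)

n_start <- nrow(merged)
deduplicated <- unique(merged)                                    # like distinct()
complete_rows <- deduplicated[complete.cases(deduplicated), ]     # like drop_na()

row_counts <- c(
  start = n_start,
  duplicates_removed = n_start - nrow(deduplicated),
  incomplete_removed = nrow(deduplicated) - nrow(complete_rows),
  remaining = nrow(complete_rows)
)
row_counts
```

Because every day without a sleep record is dropped, the analysis that follows describes only the days on which users logged sleep.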

Analyzing the Data

Merged_data %>%
  select(Total_Steps, Total_Active_Hours, Total_Distance, Total_Hours_Asleep, Calories) %>%
  summary()
##   Total_Steps    Total_Active_Hours Total_Distance   Total_Hours_Asleep
##  Min.   :   17   Min.   : 0.0       Min.   : 0.010   Min.   : 1.00     
##  1st Qu.: 5189   1st Qu.:15.0       1st Qu.: 3.592   1st Qu.: 6.00     
##  Median : 8913   Median :16.0       Median : 6.270   Median : 7.00     
##  Mean   : 8515   Mean   :16.2       Mean   : 6.012   Mean   : 6.99     
##  3rd Qu.:11370   3rd Qu.:17.0       3rd Qu.: 8.005   3rd Qu.: 8.00     
##  Max.   :22770   Max.   :23.0       Max.   :17.540   Max.   :13.00     
##     Calories   
##  Min.   : 257  
##  1st Qu.:1841  
##  Median :2207  
##  Mean   :2389  
##  3rd Qu.:2920  
##  Max.   :4900

Supporting Visualisations

ggplot(data = Merged_data) +
  geom_smooth(mapping = aes(x = Total_Active_Hours, y = Calories)) +
  labs(title = "The relationship between total hours of activity and calories burned")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

ggplot(data = Merged_data) +
  geom_smooth(mapping = aes(x = Total_Distance, y = Total_Steps)) +
  labs(title = "The relationship between total distance and total steps taken")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

ggplot(data = Merged_data) +
  geom_smooth(mapping = aes(x = Total_Steps, y = Calories)) +
  labs(title = "The relationship between total steps taken and calories burned")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

ggplot(data = Merged_data) +
  geom_smooth(mapping = aes(x = Total_Hours_Asleep, y = Calories)) +
  labs(title = "The relationship between total hours slept and calories burned")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
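
A smoothed line shows the shape of a relationship but not its strength. As a rough complement, the direction can be checked with a correlation coefficient. The sketch below uses only the first ten rows printed by str(Merged_data) above, which is far too few to draw conclusions from; on the full data the call would be `cor(Merged_data$Total_Steps, Merged_data$Calories)`.

```r
# First ten rows of Merged_data, copied from the str() output above
total_steps  <- c(13162, 10735, 10460, 9762, 12669, 9705, 13019, 15506, 10544, 9819)
calories     <- c(1985, 1797, 1776, 1745, 1863, 1728, 1921, 2035, 1786, 1775)
hours_asleep <- c(5, 6, NA, 7, 6, 12, NA, 5, 6, 5)

# Pearson correlation; the sign gives the direction of the linear relationship
steps_vs_calories <- cor(total_steps, calories)
steps_vs_calories

# use = "complete.obs" skips the days without a sleep record
sleep_vs_calories <- cor(hours_asleep, calories, use = "complete.obs")
sleep_vs_calories
```

A coefficient near zero would mean the smoothed line is describing mostly noise, which is worth knowing before basing a recommendation on it.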

Key Findings

From the positive relationships in the visualizations above, the following can be inferred:

Unfortunately, the visualization of hours slept against calories burned shows a negative relationship. According to research, however, the more hours slept at night, the more calories burned. The negative relationship in the plot could therefore point to an error in how the data was collected.

Recommendations

End of Case Study.