Google Data Analytics Professional Certificate Capstone Project

INTRODUCTION

This case study is the final step of the Google Data Analytics Professional Certificate: the Capstone Project. It follows the six steps of the data analysis process: ask, prepare, process, analyse, share, and act. I used R for data cleaning, data analysis and data visualisation.

BACKGROUND

I chose the second track for my project and analysed Bellabeat, an imaginary high-tech company that makes health products for women. While Bellabeat is already a successful company, it has the potential to grow and become a larger player in the smart device market. This is where I come in: assuming the role of a data analyst at Bellabeat, my task is to analyse relevant data and derive insights and recommendations that will help improve Bellabeat’s marketing strategy.

Bellabeat’s products and services are designed to help women monitor and improve their overall wellness by tracking various health metrics and providing personalised insights.

Bellabeat focuses mainly on digital marketing tools but also uses traditional advertising methods such as radio, billboards, print, and television. Its digital strategy includes year-round investment in Google Search, active engagement on Facebook, Instagram, and Twitter, and video advertisements on YouTube and display ads on the Google Display Network, especially during key marketing periods. This information may prove helpful when deciding on recommendations for the company’s marketing strategy.

The CEO of Bellabeat has tasked the marketing analytics team with analysing smart device usage data and applying the findings to one of Bellabeat’s products.

The business task is to gain insights into current user behaviours and provide high-level recommendations on how these trends can inform Bellabeat’s marketing strategy.

PHASE 1: ASK

The guiding questions for my analysis are:

What are some trends in smart device usage?
How could these trends apply to Bellabeat’s customers?
How could these trends influence Bellabeat’s marketing strategy?

PHASE 2: PREPARE

This is the step where the data is collected or selected for analysis. For this project, I am using the FitBit Fitness Tracker Data.

This Kaggle dataset contains personal fitness tracker data from Fitbit users. Its description states 30 users, although the files analysed here contain 33 distinct user IDs. The dataset includes minute-level output for physical activity, heart rate, and sleep monitoring. It includes information about daily activity, steps, and heart rate that can be used to explore users’ habits.

The main limitation of this dataset is the sample size. With so few users, the results may not generalise and some findings may simply reflect who happened to take part. Therefore, at this step, I would consider adding a second dataset with a bigger sample.

Overview of the data sources:

File type: CSV files
Data size: 33 users (distinct IDs)
Data format: Long data
Data duration: 03.12.2016-05.12.2016 (MM.DD.YYYY)
Data credibility: Data was collected with the consent of the participating FitBit users
Data source: Public data collected from FitBit users via a distributed survey
Data license: CC0: Public Domain

#Importing the dataset

activity <- read.csv('dailyActivity_merged.csv')
calories <- read.csv('dailyCalories_merged.csv')
intensities <- read.csv('dailyIntensities_merged.csv')
steps <- read.csv('dailySteps_merged.csv')
sleep <- read.csv('sleepDay_merged.csv')
weight <- read.csv('weightLogInfo_merged.csv')
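
The `str()` output later in this report shows `Id` read as `num` and printed in scientific notation. A minimal sketch, assuming the IDs should be treated as labels rather than numbers, is to read that column as character. The inline CSV text below is a stand-in for a real file and its values are only illustrative.

```r
# Toy CSV text standing in for a file such as dailyActivity_merged.csv
# (illustrative values, not a full extract of the dataset)
csv_text <- "Id,ActivityDate,TotalSteps
1503960366,4/12/2016,13162
1503960366,4/13/2016,10735"

# Read Id as character so it is kept as a label rather than a number;
# columns not named in colClasses are still type-detected automatically
activity_demo <- read.csv(text = csv_text, colClasses = c(Id = "character"))

str(activity_demo)
```

With a real file, the same `colClasses` argument can be passed alongside the file name.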

PHASE 3: PROCESS

This is the step where the data needs to be cleaned and made ready for analysis.

For cleaning and analysis I mostly use the tidyverse. The packages loaded individually afterwards (lubridate, dplyr, tidyr, ggplot2) are already attached by tidyverse, as its startup message shows, so those extra calls are redundant but harmless:

# Loading Packages

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(lubridate)
library(dplyr)
library(tidyr)
library(ggplot2)

In order to verify that the data is clean and ready for analysis, I use functions such as:

str() to verify the data structure, variable types and dimensions.
summary() to identify potential issues such as outliers or missing values in the data.
is.na() to locate missing values in the dataset.
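
As a rough sketch of how these checks can be turned into counts, the base R snippet below tallies missing values per column and fully duplicated rows. The data frame `weight_demo` and its values are hypothetical and only mimic the shape of the weight data.

```r
# Toy data frame with one missing value and one duplicated row (hypothetical values)
weight_demo <- data.frame(
  Id  = c("A", "A", "B", "B"),
  Fat = c(22, NA, 25, 25),
  BMI = c(22.6, 22.6, 27.4, 27.4)
)

# Number of missing values in each column
na_counts <- colSums(is.na(weight_demo))
print(na_counts)

# Number of rows that exactly repeat an earlier row
dup_rows <- sum(duplicated(weight_demo))
print(dup_rows)
```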

#Preview datasets

str(activity)

## 'data.frame':    940 obs. of  15 variables:
##  $ Id                      : num  1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ ActivityDate            : chr  "4/12/2016" "4/13/2016" "4/14/2016" "4/15/2016" ...
##  $ TotalSteps              : int  13162 10735 10460 9762 12669 9705 13019 15506 10544 9819 ...
##  $ TotalDistance           : num  8.5 6.97 6.74 6.28 8.16 ...
##  $ TrackerDistance         : num  8.5 6.97 6.74 6.28 8.16 ...
##  $ LoggedActivitiesDistance: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ VeryActiveDistance      : num  1.88 1.57 2.44 2.14 2.71 ...
##  $ ModeratelyActiveDistance: num  0.55 0.69 0.4 1.26 0.41 ...
##  $ LightActiveDistance     : num  6.06 4.71 3.91 2.83 5.04 ...
##  $ SedentaryActiveDistance : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ VeryActiveMinutes       : int  25 21 30 29 36 38 42 50 28 19 ...
##  $ FairlyActiveMinutes     : int  13 19 11 34 10 20 16 31 12 8 ...
##  $ LightlyActiveMinutes    : int  328 217 181 209 221 164 233 264 205 211 ...
##  $ SedentaryMinutes        : int  728 776 1218 726 773 539 1149 775 818 838 ...
##  $ Calories                : int  1985 1797 1776 1745 1863 1728 1921 2035 1786 1775 ...

str(calories)

## 'data.frame':    940 obs. of  3 variables:
##  $ Id         : num  1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ ActivityDay: chr  "4/12/2016" "4/13/2016" "4/14/2016" "4/15/2016" ...
##  $ Calories   : int  1985 1797 1776 1745 1863 1728 1921 2035 1786 1775 ...

str(intensities)

## 'data.frame':    940 obs. of  10 variables:
##  $ Id                      : num  1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ ActivityDay             : chr  "4/12/2016" "4/13/2016" "4/14/2016" "4/15/2016" ...
##  $ SedentaryMinutes        : int  728 776 1218 726 773 539 1149 775 818 838 ...
##  $ LightlyActiveMinutes    : int  328 217 181 209 221 164 233 264 205 211 ...
##  $ FairlyActiveMinutes     : int  13 19 11 34 10 20 16 31 12 8 ...
##  $ VeryActiveMinutes       : int  25 21 30 29 36 38 42 50 28 19 ...
##  $ SedentaryActiveDistance : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ LightActiveDistance     : num  6.06 4.71 3.91 2.83 5.04 ...
##  $ ModeratelyActiveDistance: num  0.55 0.69 0.4 1.26 0.41 ...
##  $ VeryActiveDistance      : num  1.88 1.57 2.44 2.14 2.71 ...

str(steps)

## 'data.frame':    940 obs. of  3 variables:
##  $ Id         : num  1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ ActivityDay: chr  "4/12/2016" "4/13/2016" "4/14/2016" "4/15/2016" ...
##  $ StepTotal  : int  13162 10735 10460 9762 12669 9705 13019 15506 10544 9819 ...

str(sleep)

## 'data.frame':    413 obs. of  5 variables:
##  $ Id                : num  1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ SleepDay          : chr  "4/12/2016 12:00:00 AM" "4/13/2016 12:00:00 AM" "4/15/2016 12:00:00 AM" "4/16/2016 12:00:00 AM" ...
##  $ TotalSleepRecords : int  1 2 1 2 1 1 1 1 1 1 ...
##  $ TotalMinutesAsleep: int  327 384 412 340 700 304 360 325 361 430 ...
##  $ TotalTimeInBed    : int  346 407 442 367 712 320 377 364 384 449 ...

str(weight) # NA values found in column 'Fat' of weight

## 'data.frame':    67 obs. of  8 variables:
##  $ Id            : num  1.50e+09 1.50e+09 1.93e+09 2.87e+09 2.87e+09 ...
##  $ Date          : chr  "5/2/2016 11:59:59 PM" "5/3/2016 11:59:59 PM" "4/13/2016 1:08:52 AM" "4/21/2016 11:59:59 PM" ...
##  $ WeightKg      : num  52.6 52.6 133.5 56.7 57.3 ...
##  $ WeightPounds  : num  116 116 294 125 126 ...
##  $ Fat           : int  22 NA NA NA NA 25 NA NA NA NA ...
##  $ BMI           : num  22.6 22.6 47.5 21.5 21.7 ...
##  $ IsManualReport: chr  "True" "True" "False" "True" ...
##  $ LogId         : num  1.46e+12 1.46e+12 1.46e+12 1.46e+12 1.46e+12 ...

After running the code, I notice that the date values are stored as character strings (chr) and need to be converted to a proper date type.

Additionally, the column names use inconsistent capitalisation, so I convert them to lower case to make further analysis easier.
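
The two clean-up steps can be sketched in base R on a toy data frame (`raw_demo` is a hypothetical name, and its rows only mimic the raw files). One detail worth checking is the year specifier: `%Y` reads a four-digit year, whereas `%y` reads only two digits, so a string such as "4/12/2016" would be parsed as the year 2020.

```r
# Toy data frame mimicking the raw column names and date strings
raw_demo <- data.frame(
  Id = c("A", "A"),
  ActivityDate = c("4/12/2016", "4/13/2016"),
  TotalSteps = c(13162L, 10735L)
)

# Lower-case all column names (base R counterpart of rename_with(tolower))
names(raw_demo) <- tolower(names(raw_demo))

# Parse month/day/four-digit-year strings into Date values
raw_demo$date <- as.Date(raw_demo$activitydate, format = "%m/%d/%Y")

str(raw_demo)

# Quick sanity check: the parsed dates should fall in 2016
print(range(raw_demo$date))
```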

# Changing column names and fixing date formats

#activity
activity_updt <- activity %>%
  rename_with(tolower) %>%
  mutate(date = as.Date(activitydate, "%m/%d/%Y")) #creating a new column 'date'; %Y parses the four-digit year
str(activity_updt) #Previewing the results

## 'data.frame':    940 obs. of  16 variables:
##  $ id                      : num  1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ activitydate            : chr  "4/12/2016" "4/13/2016" "4/14/2016" "4/15/2016" ...
##  $ totalsteps              : int  13162 10735 10460 9762 12669 9705 13019 15506 10544 9819 ...
##  $ totaldistance           : num  8.5 6.97 6.74 6.28 8.16 ...
##  $ trackerdistance         : num  8.5 6.97 6.74 6.28 8.16 ...
##  $ loggedactivitiesdistance: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ veryactivedistance      : num  1.88 1.57 2.44 2.14 2.71 ...
##  $ moderatelyactivedistance: num  0.55 0.69 0.4 1.26 0.41 ...
##  $ lightactivedistance     : num  6.06 4.71 3.91 2.83 5.04 ...
##  $ sedentaryactivedistance : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ veryactiveminutes       : int  25 21 30 29 36 38 42 50 28 19 ...
##  $ fairlyactiveminutes     : int  13 19 11 34 10 20 16 31 12 8 ...
##  $ lightlyactiveminutes    : int  328 217 181 209 221 164 233 264 205 211 ...
##  $ sedentaryminutes        : int  728 776 1218 726 773 539 1149 775 818 838 ...
##  $ calories                : int  1985 1797 1776 1745 1863 1728 1921 2035 1786 1775 ...
##  $ date                    : Date, format: "2016-04-12" "2016-04-13" ...

#calories
calories_updt <- calories %>%
  rename_with(tolower) %>%
  mutate(date = as.Date(activityday, "%m/%d/%Y")) #%Y parses the four-digit year
str(calories_updt)

## 'data.frame':    940 obs. of  4 variables:
##  $ id         : num  1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ activityday: chr  "4/12/2016" "4/13/2016" "4/14/2016" "4/15/2016" ...
##  $ calories   : int  1985 1797 1776 1745 1863 1728 1921 2035 1786 1775 ...
##  $ date       : Date, format: "2016-04-12" "2016-04-13" ...

#intensities
intensities_updt <- intensities %>%
  rename_with(tolower) %>%
  mutate(date = as.Date(activityday, "%m/%d/%Y")) #%Y parses the four-digit year
str(intensities_updt)

## 'data.frame':    940 obs. of  11 variables:
##  $ id                      : num  1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ activityday             : chr  "4/12/2016" "4/13/2016" "4/14/2016" "4/15/2016" ...
##  $ sedentaryminutes        : int  728 776 1218 726 773 539 1149 775 818 838 ...
##  $ lightlyactiveminutes    : int  328 217 181 209 221 164 233 264 205 211 ...
##  $ fairlyactiveminutes     : int  13 19 11 34 10 20 16 31 12 8 ...
##  $ veryactiveminutes       : int  25 21 30 29 36 38 42 50 28 19 ...
##  $ sedentaryactivedistance : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ lightactivedistance     : num  6.06 4.71 3.91 2.83 5.04 ...
##  $ moderatelyactivedistance: num  0.55 0.69 0.4 1.26 0.41 ...
##  $ veryactivedistance      : num  1.88 1.57 2.44 2.14 2.71 ...
##  $ date                    : Date, format: "2016-04-12" "2016-04-13" ...

#steps
steps_updt <- steps %>%
  rename_with(tolower) %>%
  mutate(date = as.Date(activityday, "%m/%d/%Y")) #%Y parses the four-digit year
str(steps_updt)

## 'data.frame':    940 obs. of  4 variables:
##  $ id         : num  1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ activityday: chr  "4/12/2016" "4/13/2016" "4/14/2016" "4/15/2016" ...
##  $ steptotal  : int  13162 10735 10460 9762 12669 9705 13019 15506 10544 9819 ...
##  $ date       : Date, format: "2016-04-12" "2016-04-13" ...

#sleep
sleep_updt <- sleep %>%
  rename_with(tolower) %>%
  mutate(date = as.Date(sleepday, "%m/%d/%Y")) #%Y parses the four-digit year; the trailing time is ignored
str(sleep_updt)

## 'data.frame':    413 obs. of  6 variables:
##  $ id                : num  1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ sleepday          : chr  "4/12/2016 12:00:00 AM" "4/13/2016 12:00:00 AM" "4/15/2016 12:00:00 AM" "4/16/2016 12:00:00 AM" ...
##  $ totalsleeprecords : int  1 2 1 2 1 1 1 1 1 1 ...
##  $ totalminutesasleep: int  327 384 412 340 700 304 360 325 361 430 ...
##  $ totaltimeinbed    : int  346 407 442 367 712 320 377 364 384 449 ...
##  $ date              : Date, format: "2016-04-12" "2016-04-13" ...

#weight
weight_updt <- weight %>%
  rename_with(tolower) %>%
  mutate(date_correct = as.Date(date, "%m/%d/%Y")) #%Y parses the four-digit year; the trailing time is ignored
str(weight_updt)

## 'data.frame':    67 obs. of  9 variables:
##  $ id            : num  1.50e+09 1.50e+09 1.93e+09 2.87e+09 2.87e+09 ...
##  $ date          : chr  "5/2/2016 11:59:59 PM" "5/3/2016 11:59:59 PM" "4/13/2016 1:08:52 AM" "4/21/2016 11:59:59 PM" ...
##  $ weightkg      : num  52.6 52.6 133.5 56.7 57.3 ...
##  $ weightpounds  : num  116 116 294 125 126 ...
##  $ fat           : int  22 NA NA NA NA 25 NA NA NA NA ...
##  $ bmi           : num  22.6 22.6 47.5 21.5 21.7 ...
##  $ ismanualreport: chr  "True" "True" "False" "True" ...
##  $ logid         : num  1.46e+12 1.46e+12 1.46e+12 1.46e+12 1.46e+12 ...
##  $ date_correct  : Date, format: "2016-05-02" "2016-05-03" ...

# Checking for distinct IDs
n_distinct(activity$Id)

## [1] 33

n_distinct(intensities$Id)

## [1] 33

n_distinct(calories$Id)

## [1] 33

n_distinct(steps$Id)

## [1] 33

n_distinct(sleep$Id)

## [1] 24

n_distinct(weight$Id) #only 8 distinct IDs

## [1] 8

Here, I notice that the weight dataset contains only 8 user IDs and the sleep dataset only 24, compared with 33 in the activity data. Any analysis that takes weight into consideration must therefore account for possible bias, as not every Fitbit user had their weight recorded.
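
One way to make this coverage gap explicit is to compute the share of activity users that also appear in each other dataset. The sketch below uses made-up ID vectors (`activity_ids`, `sleep_ids`, `weight_ids` are hypothetical names); with the real data they would come from `unique()` on each `Id` column.

```r
# Hypothetical user IDs per dataset (not the real IDs)
activity_ids <- c("A", "B", "C", "D", "E")
sleep_ids    <- c("A", "B", "C")
weight_ids   <- c("A", "D")

# Share of activity users that also appear in each of the other datasets
coverage <- c(
  sleep  = mean(activity_ids %in% sleep_ids),
  weight = mean(activity_ids %in% weight_ids)
)
print(coverage)

# Users with activity data but no weight log
missing_weight <- setdiff(activity_ids, weight_ids)
print(missing_weight)
```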

PHASE 4: ANALYZE

At this stage, I decided to focus on the mean daily steps, calories burned and sleep per user, followed by a summary of the weight data.

mean(activity_updt$totalsteps) #avg of total steps by Fitbit users

## [1] 7637.911

Users average about 7,638 steps per day, which is well below the commonly recommended 10,000 steps per day.
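
A single mean hides how often the 10,000-step mark is actually reached. As a small sketch on invented numbers (`steps_demo` is hypothetical), the median and the share of days at or above the target can be reported alongside the mean:

```r
# Toy daily step counts (hypothetical values)
steps_demo <- data.frame(
  id = c("A", "A", "A", "B"),
  totalsteps = c(12000, 11000, 10000, 3000)
)

# Mean and median of daily steps across all user-days
mean_steps   <- mean(steps_demo$totalsteps)
median_steps <- median(steps_demo$totalsteps)

# Share of user-days that reach at least 10,000 steps
share_10k <- mean(steps_demo$totalsteps >= 10000)

print(c(mean = mean_steps, median = median_steps, share_10k = share_10k))
```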

mean(activity_updt$calories) #avg total calories burned by Fitbit users

## [1] 2303.61

On average, Fitbit users burned approximately 2303.6 calories per day. However, as noted in the Prepare phase, this number reveals little on its own because it is drawn from a small sample of 33 users. A further complication is that the dataset does not record gender, so the average cannot be attributed to women specifically.

sleep_hours <- mean(sleep_updt$totalminutesasleep)/60 #converting sleep minutes to sleep hours per day
print(sleep_hours)

## [1] 6.991122

On average, users sleep approximately 6.99 hours per night. It is recommended that adults sleep 7-9 hours per night (NHLBI, 2022). As such, this number indicates that the users may be slightly undersleeping, which can have negative health effects.
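
Because the average sits so close to the 7-hour threshold, it may help to look at how many nights fall short of it. A minimal sketch on made-up sleep records (`sleep_demo` is a hypothetical data frame):

```r
# Toy sleep records in minutes (hypothetical values)
sleep_demo <- data.frame(
  id = c("A", "A", "B", "B"),
  totalminutesasleep = c(327, 384, 412, 450)
)

# Convert minutes to hours
sleep_demo$hours_asleep <- sleep_demo$totalminutesasleep / 60

# Share of recorded nights with less than 7 hours of sleep
share_short_nights <- mean(sleep_demo$hours_asleep < 7)

print(sleep_demo)
print(share_short_nights)
```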

Data on users’ weight:

summary(weight)

##        Id                Date              WeightKg       WeightPounds  
##  Min.   :1.504e+09   Length:67          Min.   : 52.60   Min.   :116.0  
##  1st Qu.:6.962e+09   Class :character   1st Qu.: 61.40   1st Qu.:135.4  
##  Median :6.962e+09   Mode  :character   Median : 62.50   Median :137.8  
##  Mean   :7.009e+09                      Mean   : 72.04   Mean   :158.8  
##  3rd Qu.:8.878e+09                      3rd Qu.: 85.05   3rd Qu.:187.5  
##  Max.   :8.878e+09                      Max.   :133.50   Max.   :294.3  
##                                                                         
##       Fat             BMI        IsManualReport         LogId          
##  Min.   :22.00   Min.   :21.45   Length:67          Min.   :1.460e+12  
##  1st Qu.:22.75   1st Qu.:23.96   Class :character   1st Qu.:1.461e+12  
##  Median :23.50   Median :24.39   Mode  :character   Median :1.462e+12  
##  Mean   :23.50   Mean   :25.19                      Mean   :1.462e+12  
##  3rd Qu.:24.25   3rd Qu.:25.56                      3rd Qu.:1.462e+12  
##  Max.   :25.00   Max.   :47.54                      Max.   :1.463e+12  
##  NA's   :65

As mentioned, a major limitation of the data is that it does not record users’ gender, which makes it hard to draw recommendations for a company geared towards women. The summary above shows a mean BMI of 25.19 across the 67 weight records. A BMI between 25 and 30 is classified as overweight, so this average sits just above the threshold. Note also that the Fat column is missing in 65 of the 67 records.

Nevertheless, the weight data covers only 8 unique users, and the mean is taken over records rather than users, so frequent loggers count more heavily and the BMI result may be biased.
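
To illustrate how uneven logging can skew such an average, the sketch below compares a mean over records with a mean over per-user averages. The data frame `bmi_demo` and its values are invented for the example.

```r
# Toy weight log: user A logs four times, user B once (hypothetical values)
bmi_demo <- data.frame(
  id  = c("A", "A", "A", "A", "B"),
  bmi = c(24, 24, 24, 24, 34)
)

# How many records each user contributes
records_per_user <- table(bmi_demo$id)
print(records_per_user)

# Mean over all records: dominated by the frequent logger
record_mean <- mean(bmi_demo$bmi)

# Mean of per-user means: each user counts once
per_user  <- aggregate(bmi ~ id, data = bmi_demo, FUN = mean)
user_mean <- mean(per_user$bmi)

print(c(record_mean = record_mean, user_mean = user_mean))
```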

Given the data, I hypothesise that a lack of sleep may go together with fewer active minutes, that is, more sedentary time. I create a simple plot to see whether there is a correlation between the amount of sleep and sedentary minutes:

#merging data frames to see correlation between sleep and active minutes
merged_df <- merge(sleep_updt, intensities_updt, by = c("id", "date"))

#plotting the data as a scatter plot (one point per user-day)
ggplot(data = merged_df, aes(x = totalminutesasleep, y = sedentaryminutes)) +
  geom_point() +
  labs(title = 'Time Asleep and Time Spent Being Sedentary', 
       x = 'Sleep Minutes',
       y = 'Sedentary Minutes') +
  theme_minimal()

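The plot above suggests a negative relationship between time asleep and sedentary time: days with less sleep tend to have more sedentary minutes. Sedentary minutes measure inactivity rather than exercise, so this only indirectly supports the hypothesis about activity levels, and a correlation does not establish causation.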
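
A plot alone does not say how strong the relationship is. As a rough sketch on invented data (`sleep_demo` and `intensity_demo` are hypothetical), the merge can be checked by row count and the relationship summarised with a Pearson correlation coefficient:

```r
# Toy per-day records for three users (hypothetical values)
days <- as.Date(c("2016-04-12", "2016-04-13", "2016-04-12", "2016-04-13", "2016-04-12"))
sleep_demo <- data.frame(
  id = c("A", "A", "B", "B", "C"),
  date = days,
  totalminutesasleep = c(300, 360, 420, 480, 540)
)
intensity_demo <- data.frame(
  id = c("A", "A", "B", "B", "C"),
  date = days,
  sedentaryminutes = c(900, 850, 760, 700, 640)
)

# Inner join on user and day; rows without a match in both tables are dropped
merged_demo <- merge(sleep_demo, intensity_demo, by = c("id", "date"))
print(nrow(merged_demo))

# Pearson correlation between sleep and sedentary minutes
r <- cor(merged_demo$totalminutesasleep, merged_demo$sedentaryminutes)
print(r)
```

A value of `r` close to -1 would indicate a strong negative linear relationship, while a value near 0 would indicate little linear relationship.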

Main insights from the data analysis:

- Users take an average of about 7,638 steps per day, which is below the commonly recommended 10,000 steps. This gap suggests an opportunity to encourage users to increase their daily activity levels.
- Users burn an average of 2,303.6 calories per day. However, the small sample size and lack of gender differentiation limit the ability to draw conclusive insights. Personalized calorie-burning goals could be highlighted as a feature in marketing campaigns.
- Users sleep an average of 6.99 hours per night, slightly below the recommended 7–9 hours. Improved sleep tracking and actionable insights on how to enhance sleep quality can be a key selling point.
- The analysis suggests a negative correlation between sleep duration and sedentary time: users who sleep less tend to be more sedentary. This finding can be used to emphasize the importance of a holistic approach to health, integrating sleep, activity, and overall wellness.

PHASE 5: SHARE

Presenting the results for my hypothesis: less sleep goes together with more sedentary time, that is, less physical activity.

ggplot(data = merged_df, aes(x = totalminutesasleep, y = sedentaryminutes)) +
  geom_point(aes(color = sedentaryminutes), size = 3, alpha = 0.7) +  # Use points with gradient color
  geom_smooth(method = "lm", se = FALSE, color = "blue", linetype = "dashed") +  # Add trend line
  scale_color_gradient(low = "lightblue", high = "darkblue") +  # Gradient color scale
  labs(
    title = "Relationship Between Sleep and Sedentary Time",
    subtitle = "Analyzing activity patterns with sleep duration",
    x = "Total Sleep Minutes",
    y = "Sedentary Minutes",
    color = "Sedentary\nMinutes"
  ) +
  theme_minimal() +  # Minimalistic theme
  theme(
    plot.title = element_text(size = 12, face = "bold", hjust = 0.5),  # Center and style title
    plot.subtitle = element_text(size = 10, hjust = 0.5),  # Center and style subtitle
    axis.title = element_text(size = 8, face = "italic"),  # Style axis titles
    axis.text = element_text(size = 8),  # Style axis text
    legend.position = "right"  # Move legend to the right
  )

## `geom_smooth()` using formula = 'y ~ x'
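
The dashed trend line comes from a linear model of the form `y ~ x`, as the message above states. To read off its slope as a number, the same kind of model can be fitted directly with `lm()`. The sketch below uses invented values (`trend_demo` is hypothetical), so the resulting slope says nothing about the real data.

```r
# Toy sleep and sedentary minutes (hypothetical values)
trend_demo <- data.frame(
  totalminutesasleep = c(300, 360, 420, 480, 540),
  sedentaryminutes   = c(900, 850, 760, 700, 640)
)

# Fit the same kind of straight line that geom_smooth(method = "lm") draws
fit <- lm(sedentaryminutes ~ totalminutesasleep, data = trend_demo)

# Slope: change in sedentary minutes per additional minute of sleep
slope <- coef(fit)[["totalminutesasleep"]]
print(slope)

# The same slope expressed per additional hour of sleep
print(slope * 60)
```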

Recommendations:

Product Enhancements:
- Introduce personalized health goals for steps, sleep, and calorie burn based on user-specific data.
- Develop new sleep-improvement features, such as bedtime reminders, relaxation techniques, and personalized sleep analytics.
Marketing Campaigns, e.g.:
1. Launch a “Steps to 10k” Challenge that incentivizes users to achieve 10,000 steps daily.
2. Create content around the importance of quality sleep for physical activity, focusing on Bellabeat’s devices as essential tools for improving sleep patterns.
Educational Content:
1. How to improve sleep for better productivity.
2. The role of activity in managing overall wellness.
3. Nutrition tips for burning calories effectively.
Highlight Bellabeat’s focus on women-specific wellness to stand out from competitors like Fitbit, which target a broader audience.