BACKGROUND
I took the second trajectory for my project and analysed Bellabeat, a
high-tech company that makes products for women. While Bellabeat is
already a successful company, it has the potential to grow and become a
more important player in the smart device market. This is where I come
in: assuming the role of a data analyst at Bellabeat, my task is to
analyse relevant data and produce insights and recommendations that
will help improve Bellabeat’s marketing strategy.
Bellabeat’s products and services are designed to help women monitor
and improve their overall wellness by tracking various health metrics
and providing personalised insights.
Bellabeat focuses mainly on digital marketing tools but also utilises
traditional advertising methods—such as radio, billboards, print, and
television. Their digital strategy includes year-round investments in
Google Search, active engagement on Facebook, Instagram, and Twitter,
and running video advertisements on YouTube and ads on the Google
Display Network, especially during key marketing periods. This
information may prove useful when deciding on recommendations for the
company’s marketing strategy.
The CEO of Bellabeat has tasked the marketing analytics team with
analysing smart device usage data, focusing on one of Bellabeat’s
products.
The business task is to gain insights into current
user behaviours and provide high-level recommendations on how these
trends can inform Bellabeat’s marketing strategy.
PHASE 3: PROCESS
This is the step where the data is cleaned and made ready
for analysis.
To clean the data, I will mostly be using the tidyverse package. The
individual packages loaded after it (lubridate, dplyr, tidyr and
ggplot2) will also be used later for the analysis:
# Loading Packages
library(tidyverse)
library(lubridate)
library(dplyr)
library(tidyr)
library(ggplot2)
In order to verify that the data is clean and ready for analysis, I
use functions such as:
- str() to verify the data structure, variable types
and dimensions.
- summary() to identify potential issues such as
outliers or missing values in the data.
- is.na() to locate missing values in the
dataset.
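As a minimal sketch of these three checks, the snippet below runs them on a small toy data frame. The data frame `toy` and its values are made up for illustration and are not part of the Fitbit data; `colSums(is.na(...))` is one common way to turn the output of is.na() into a count of missing values per column.

```r
# Toy data frame (hypothetical values, not the Fitbit data)
toy <- data.frame(
  Id = c(1, 1, 2, 2),
  ActivityDate = c("4/12/2016", "4/13/2016", "4/12/2016", "4/13/2016"),
  TotalSteps = c(13162L, 10735L, NA, 9762L),
  stringsAsFactors = FALSE
)

str(toy)      # dimensions and the type of each column
summary(toy)  # per-column summaries; numeric columns also report their NA count

# is.na() returns a logical matrix; colSums() counts the TRUE values per column
na_counts <- colSums(is.na(toy))
print(na_counts)
```

Here only `TotalSteps` contains a missing value, so it is the only column with a non-zero count.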
#Preview datasets
str(activity)
## 'data.frame': 940 obs. of 15 variables:
## $ Id : num 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
## $ ActivityDate : chr "4/12/2016" "4/13/2016" "4/14/2016" "4/15/2016" ...
## $ TotalSteps : int 13162 10735 10460 9762 12669 9705 13019 15506 10544 9819 ...
## $ TotalDistance : num 8.5 6.97 6.74 6.28 8.16 ...
## $ TrackerDistance : num 8.5 6.97 6.74 6.28 8.16 ...
## $ LoggedActivitiesDistance: num 0 0 0 0 0 0 0 0 0 0 ...
## $ VeryActiveDistance : num 1.88 1.57 2.44 2.14 2.71 ...
## $ ModeratelyActiveDistance: num 0.55 0.69 0.4 1.26 0.41 ...
## $ LightActiveDistance : num 6.06 4.71 3.91 2.83 5.04 ...
## $ SedentaryActiveDistance : num 0 0 0 0 0 0 0 0 0 0 ...
## $ VeryActiveMinutes : int 25 21 30 29 36 38 42 50 28 19 ...
## $ FairlyActiveMinutes : int 13 19 11 34 10 20 16 31 12 8 ...
## $ LightlyActiveMinutes : int 328 217 181 209 221 164 233 264 205 211 ...
## $ SedentaryMinutes : int 728 776 1218 726 773 539 1149 775 818 838 ...
## $ Calories : int 1985 1797 1776 1745 1863 1728 1921 2035 1786 1775 ...
str(calories)
## 'data.frame': 940 obs. of 3 variables:
## $ Id : num 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
## $ ActivityDay: chr "4/12/2016" "4/13/2016" "4/14/2016" "4/15/2016" ...
## $ Calories : int 1985 1797 1776 1745 1863 1728 1921 2035 1786 1775 ...
str(intensities)
## 'data.frame': 940 obs. of 10 variables:
## $ Id : num 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
## $ ActivityDay : chr "4/12/2016" "4/13/2016" "4/14/2016" "4/15/2016" ...
## $ SedentaryMinutes : int 728 776 1218 726 773 539 1149 775 818 838 ...
## $ LightlyActiveMinutes : int 328 217 181 209 221 164 233 264 205 211 ...
## $ FairlyActiveMinutes : int 13 19 11 34 10 20 16 31 12 8 ...
## $ VeryActiveMinutes : int 25 21 30 29 36 38 42 50 28 19 ...
## $ SedentaryActiveDistance : num 0 0 0 0 0 0 0 0 0 0 ...
## $ LightActiveDistance : num 6.06 4.71 3.91 2.83 5.04 ...
## $ ModeratelyActiveDistance: num 0.55 0.69 0.4 1.26 0.41 ...
## $ VeryActiveDistance : num 1.88 1.57 2.44 2.14 2.71 ...
str(steps)
## 'data.frame': 940 obs. of 3 variables:
## $ Id : num 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
## $ ActivityDay: chr "4/12/2016" "4/13/2016" "4/14/2016" "4/15/2016" ...
## $ StepTotal : int 13162 10735 10460 9762 12669 9705 13019 15506 10544 9819 ...
str(sleep)
## 'data.frame': 413 obs. of 5 variables:
## $ Id : num 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
## $ SleepDay : chr "4/12/2016 12:00:00 AM" "4/13/2016 12:00:00 AM" "4/15/2016 12:00:00 AM" "4/16/2016 12:00:00 AM" ...
## $ TotalSleepRecords : int 1 2 1 2 1 1 1 1 1 1 ...
## $ TotalMinutesAsleep: int 327 384 412 340 700 304 360 325 361 430 ...
## $ TotalTimeInBed : int 346 407 442 367 712 320 377 364 384 449 ...
str(weight) # NA values found in column 'Fat' of weight
## 'data.frame': 67 obs. of 8 variables:
## $ Id : num 1.50e+09 1.50e+09 1.93e+09 2.87e+09 2.87e+09 ...
## $ Date : chr "5/2/2016 11:59:59 PM" "5/3/2016 11:59:59 PM" "4/13/2016 1:08:52 AM" "4/21/2016 11:59:59 PM" ...
## $ WeightKg : num 52.6 52.6 133.5 56.7 57.3 ...
## $ WeightPounds : num 116 116 294 125 126 ...
## $ Fat : int 22 NA NA NA NA 25 NA NA NA NA ...
## $ BMI : num 22.6 22.6 47.5 21.5 21.7 ...
## $ IsManualReport: chr "True" "True" "False" "True" ...
## $ LogId : num 1.46e+12 1.46e+12 1.46e+12 1.46e+12 1.46e+12 ...
After running the code, I notice that the date values are stored
as chr (character strings) and need to be converted to a proper
date format.
Additionally, the column names are inconsistently formatted and
need to be renamed to make further analysis easier.
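When converting strings such as "4/12/2016", the year part of the format string matters: %Y matches a four-digit year, while %y matches only two digits. The sketch below uses example strings in the same shape as the columns shown above (the variable names are illustrative) to show the difference, and to show that as.Date() ignores any trailing time of day.

```r
# Example strings in the same shape as the date columns above
day_only  <- c("4/12/2016", "4/13/2016")
with_time <- c("4/12/2016 12:00:00 AM", "5/2/2016 11:59:59 PM")

# %Y matches the full four-digit year
parsed_day <- as.Date(day_only, format = "%m/%d/%Y")
print(parsed_day)    # "2016-04-12" "2016-04-13"

# %y matches two digits only, so "2016" is read as "20" and becomes 2020
parsed_wrong <- as.Date(day_only, format = "%m/%d/%y")
print(parsed_wrong)  # "2020-04-12" "2020-04-13"

# as.Date() ignores trailing characters, so the time of day is dropped
parsed_time <- as.Date(with_time, format = "%m/%d/%Y")
print(parsed_time)   # "2016-04-12" "2016-05-02"
```

A quick look at the year in the str() output after conversion is an easy way to catch this kind of mismatch.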
# Changing column names and fixing date formats
#activity
activity_updt <- activity %>%
  rename_with(tolower) %>%
  mutate(date = as.Date(activitydate, "%m/%d/%Y")) #creating a new column 'date'; %Y matches the four-digit year
str(activity_updt) #Previewing the results
## 'data.frame': 940 obs. of 16 variables:
## $ id : num 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
## $ activitydate : chr "4/12/2016" "4/13/2016" "4/14/2016" "4/15/2016" ...
## $ totalsteps : int 13162 10735 10460 9762 12669 9705 13019 15506 10544 9819 ...
## $ totaldistance : num 8.5 6.97 6.74 6.28 8.16 ...
## $ trackerdistance : num 8.5 6.97 6.74 6.28 8.16 ...
## $ loggedactivitiesdistance: num 0 0 0 0 0 0 0 0 0 0 ...
## $ veryactivedistance : num 1.88 1.57 2.44 2.14 2.71 ...
## $ moderatelyactivedistance: num 0.55 0.69 0.4 1.26 0.41 ...
## $ lightactivedistance : num 6.06 4.71 3.91 2.83 5.04 ...
## $ sedentaryactivedistance : num 0 0 0 0 0 0 0 0 0 0 ...
## $ veryactiveminutes : int 25 21 30 29 36 38 42 50 28 19 ...
## $ fairlyactiveminutes : int 13 19 11 34 10 20 16 31 12 8 ...
## $ lightlyactiveminutes : int 328 217 181 209 221 164 233 264 205 211 ...
## $ sedentaryminutes : int 728 776 1218 726 773 539 1149 775 818 838 ...
## $ calories : int 1985 1797 1776 1745 1863 1728 1921 2035 1786 1775 ...
## $ date : Date, format: "2016-04-12" "2016-04-13" ...
#calories
calories_updt <- calories %>%
  rename_with(tolower) %>%
  mutate(date = as.Date(activityday, "%m/%d/%Y")) #%Y matches the four-digit year
str(calories_updt)
## 'data.frame': 940 obs. of 4 variables:
## $ id : num 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
## $ activityday: chr "4/12/2016" "4/13/2016" "4/14/2016" "4/15/2016" ...
## $ calories : int 1985 1797 1776 1745 1863 1728 1921 2035 1786 1775 ...
## $ date : Date, format: "2016-04-12" "2016-04-13" ...
#intensities
intensities_updt <- intensities %>%
  rename_with(tolower) %>%
  mutate(date = as.Date(activityday, "%m/%d/%Y")) #%Y matches the four-digit year
str(intensities_updt)
## 'data.frame': 940 obs. of 11 variables:
## $ id : num 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
## $ activityday : chr "4/12/2016" "4/13/2016" "4/14/2016" "4/15/2016" ...
## $ sedentaryminutes : int 728 776 1218 726 773 539 1149 775 818 838 ...
## $ lightlyactiveminutes : int 328 217 181 209 221 164 233 264 205 211 ...
## $ fairlyactiveminutes : int 13 19 11 34 10 20 16 31 12 8 ...
## $ veryactiveminutes : int 25 21 30 29 36 38 42 50 28 19 ...
## $ sedentaryactivedistance : num 0 0 0 0 0 0 0 0 0 0 ...
## $ lightactivedistance : num 6.06 4.71 3.91 2.83 5.04 ...
## $ moderatelyactivedistance: num 0.55 0.69 0.4 1.26 0.41 ...
## $ veryactivedistance : num 1.88 1.57 2.44 2.14 2.71 ...
## $ date : Date, format: "2016-04-12" "2016-04-13" ...
#steps
steps_updt <- steps %>%
  rename_with(tolower) %>%
  mutate(date = as.Date(activityday, "%m/%d/%Y")) #%Y matches the four-digit year
str(steps_updt)
## 'data.frame': 940 obs. of 4 variables:
## $ id : num 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
## $ activityday: chr "4/12/2016" "4/13/2016" "4/14/2016" "4/15/2016" ...
## $ steptotal : int 13162 10735 10460 9762 12669 9705 13019 15506 10544 9819 ...
## $ date : Date, format: "2016-04-12" "2016-04-13" ...
#sleep
sleep_updt <- sleep %>%
  rename_with(tolower) %>%
  mutate(date = as.Date(sleepday, "%m/%d/%Y")) #%Y matches the four-digit year; the trailing time is ignored
str(sleep_updt)
## 'data.frame': 413 obs. of 6 variables:
## $ id : num 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
## $ sleepday : chr "4/12/2016 12:00:00 AM" "4/13/2016 12:00:00 AM" "4/15/2016 12:00:00 AM" "4/16/2016 12:00:00 AM" ...
## $ totalsleeprecords : int 1 2 1 2 1 1 1 1 1 1 ...
## $ totalminutesasleep: int 327 384 412 340 700 304 360 325 361 430 ...
## $ totaltimeinbed : int 346 407 442 367 712 320 377 364 384 449 ...
## $ date : Date, format: "2016-04-12" "2016-04-13" ...
#weight
weight_updt <- weight %>%
  rename_with(tolower) %>%
  mutate(date_correct = as.Date(date, "%m/%d/%Y")) #%Y matches the four-digit year; the trailing time is ignored
str(weight_updt)
## 'data.frame': 67 obs. of 9 variables:
## $ id : num 1.50e+09 1.50e+09 1.93e+09 2.87e+09 2.87e+09 ...
## $ date : chr "5/2/2016 11:59:59 PM" "5/3/2016 11:59:59 PM" "4/13/2016 1:08:52 AM" "4/21/2016 11:59:59 PM" ...
## $ weightkg : num 52.6 52.6 133.5 56.7 57.3 ...
## $ weightpounds : num 116 116 294 125 126 ...
## $ fat : int 22 NA NA NA NA 25 NA NA NA NA ...
## $ bmi : num 22.6 22.6 47.5 21.5 21.7 ...
## $ ismanualreport: chr "True" "True" "False" "True" ...
## $ logid : num 1.46e+12 1.46e+12 1.46e+12 1.46e+12 1.46e+12 ...
## $ date_correct : Date, format: "2016-05-02" "2016-05-03" ...
# Checking for distinct IDs
n_distinct(activity$Id)
## [1] 33
n_distinct(intensities$Id)
## [1] 33
n_distinct(calories$Id)
## [1] 33
n_distinct(steps$Id)
## [1] 33
n_distinct(sleep$Id)
## [1] 24
n_distinct(weight$Id) #only 8 distinct IDs
## [1] 8
Here, I notice that the weight dataset contains only 8 distinct
user IDs (and the sleep dataset 24), compared with 33 in the other
datasets. Any analysis that takes weight into consideration must
therefore be treated as potentially biased, since not every Fitbit user
had their weight recorded.
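The coverage check above can also be done in one pass, together with a list of which users are missing from a table. The following is a minimal sketch on two toy tables; `activity_toy`, `weight_toy` and their ids are hypothetical stand-ins, not the Fitbit data.

```r
library(dplyr)

# Toy tables (hypothetical ids and values, not the Fitbit data)
activity_toy <- data.frame(id = c(1, 1, 2, 3, 3),
                           totalsteps = c(100, 200, 300, 400, 500))
weight_toy   <- data.frame(id = c(1, 1),
                           weightkg = c(60, 61))

# Distinct users per table, computed for all tables at once
user_counts <- sapply(list(activity = activity_toy, weight = weight_toy),
                      function(df) n_distinct(df$id))
print(user_counts)

# Users who have activity records but no weight records
missing_weight <- base::setdiff(unique(activity_toy$id), unique(weight_toy$id))
print(missing_weight)

# Share of users covered by the weight table
coverage <- n_distinct(weight_toy$id) / n_distinct(activity_toy$id)
print(coverage)
```

A low coverage share is a signal that results from the smaller table should not be generalised to all users.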
PHASE 4: ANALYZE
At this stage, I decided to focus on getting the means of user daily
steps, calories burned and sleep, along with a summary of the weight data.
mean(activity_updt$totalsteps) #avg of total steps by Fitbit users
## [1] 7637.911
Users take an average of 7,637.9 steps per day, which is well below
the commonly recommended 10,000 steps per day.
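Note that a mean taken over all daily records gives more weight to users who logged more days. One possible alternative, sketched below on toy data, is to average per user first and then average those per-user means. The data frame `daily` and its values are hypothetical.

```r
library(dplyr)

# Toy daily records (hypothetical values): user 1 logged three days, user 2 one day
daily <- data.frame(
  id         = c(1, 1, 1, 2),
  totalsteps = c(12000, 11000, 13000, 4000)
)

# Mean over all daily records: user 1 counts three times
record_mean <- mean(daily$totalsteps)

# Mean of per-user means: each user counts once
per_user <- daily %>%
  group_by(id) %>%
  summarise(mean_steps = mean(totalsteps), .groups = "drop")
user_mean <- mean(per_user$mean_steps)

print(record_mean)  # 10000
print(user_mean)    # 8000
```

Which of the two is more appropriate depends on whether the question is about a typical day or a typical user.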
mean(activity_updt$calories) #avg total calories burned by Fitbit users
## [1] 2303.61
On average, Fitbit users burned approximately 2,303.6 calories per
day. However, this number cannot reveal much on its own, since the
average is drawn from a small sample of 33 users. Further complicating
any conclusions on the number of calories burned, the data does not
record gender, so the figure covers both women and men.
sleep_hours <- mean(sleep_updt$totalminutesasleep)/60 #converting sleep minutes to sleep hours per day
print(sleep_hours)
## [1] 6.991122
On average, users sleep approximately 6.99 hours per day, marginally
below the 7-9 hours per day recommended for adults (NHLBI,
2022). As such, this number indicates that the users may be slightly
undersleeping, which can have negative effects on health.
Data on users’ weight:
summary(weight)
## Id Date WeightKg WeightPounds
## Min. :1.504e+09 Length:67 Min. : 52.60 Min. :116.0
## 1st Qu.:6.962e+09 Class :character 1st Qu.: 61.40 1st Qu.:135.4
## Median :6.962e+09 Mode :character Median : 62.50 Median :137.8
## Mean :7.009e+09 Mean : 72.04 Mean :158.8
## 3rd Qu.:8.878e+09 3rd Qu.: 85.05 3rd Qu.:187.5
## Max. :8.878e+09 Max. :133.50 Max. :294.3
##
## Fat BMI IsManualReport LogId
## Min. :22.00 Min. :21.45 Length:67 Min. :1.460e+12
## 1st Qu.:22.75 1st Qu.:23.96 Class :character 1st Qu.:1.461e+12
## Median :23.50 Median :24.39 Mode :character Median :1.462e+12
## Mean :23.50 Mean :25.19 Mean :1.462e+12
## 3rd Qu.:24.25 3rd Qu.:25.56 3rd Qu.:1.462e+12
## Max. :25.00 Max. :47.54 Max. :1.463e+12
## NA's :65
As mentioned, a major limitation of the data is that it does not
record users’ gender, making it hard to make recommendations
for a company geared towards women. Given the summary above, however, we
see that the mean BMI across the 67 weight records is 25.19. For both
women and men, this puts the average just inside the overweight range
(a BMI between 25 and 30 indicates excess weight), although the median
of 24.39 falls just below it.
Nevertheless, the weight data covers only 8 unique users,
so the BMI result may be biased.
Given the data, I hypothesise that less sleep may be associated
with fewer active minutes. I create a simple plot to see if there
is a relationship between the amount of sleep and sedentary time,
which serves as an inverse indicator of physical activity:
#merging data frames to see correlation between sleep and sedentary minutes
merged_df <- merge(sleep_updt, intensities_updt, by = c("id", "date"))
#plotting the data
ggplot(data = merged_df, aes(x = totalminutesasleep, y = sedentaryminutes)) +
  geom_point() + #one point per user-day; geom_line() would connect unrelated observations
  labs(title = 'Time Asleep and Time Spent Being Sedentary',
       x = 'Sleep Minutes',
       y = 'Sedentary Minutes') +
  theme_minimal()

The plot above suggests a negative
relationship between the amount of sleep and sedentary time:
users who sleep more tend to spend fewer minutes sedentary.
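A plot gives only a visual impression. The sign and strength of a linear relationship can be quantified with cor(), and cor.test() adds a p-value and confidence interval. The sketch below uses toy values that are purely hypothetical and not taken from the Fitbit data; only the column names mirror the ones used above.

```r
# Toy values (hypothetical, not the Fitbit data)
toy <- data.frame(
  totalminutesasleep = c(300, 360, 420, 480, 540),
  sedentaryminutes   = c(900, 850, 760, 700, 640)
)

# Pearson correlation coefficient, always between -1 and 1
r <- cor(toy$totalminutesasleep, toy$sedentaryminutes)

# cor.test() reports the same estimate plus a p-value and confidence interval
test <- cor.test(toy$totalminutesasleep, toy$sedentaryminutes)

print(r)
print(test$p.value)
```

A negative coefficient means the two variables tend to move in opposite directions. It does not show that one causes the other.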
Main insights from the data analysis:
- Users take an average of about 7,638 steps
daily, which is below the recommended 10,000 steps. This gap
suggests an opportunity to encourage users to increase their daily
activity levels.
- The average calories burned are 2,303.6 calories per
day. However, the small sample size (33 users) and the lack of gender
data limit the ability to draw conclusive insights.
Highlighting personalised calorie-burning goals could be a feature in
marketing campaigns.
- Users sleep an average of 6.99 hours per night,
slightly below the recommended 7–9 hours. Improved sleep tracking and
actionable insights on how to enhance sleep quality can be a key selling
point.
- The analysis suggests a negative correlation between sleep duration
and sedentary time. Users who sleep less tend to be more sedentary. This
finding can be used to emphasise the importance of a holistic approach
to health, integrating sleep, activity, and overall wellness.