Business Task: I am analyzing Bellabeat’s data usage. They are a smart device manufacturing company targeted towards women’s health. The goal is to determine trends/consumer behavior from consumer data that prove useful to the marketing team’s strategy towards improved user engagement.
Features of Bellabeat’s smart device:
Alongside the watch, there is also the Bellabeat app which is
downloadable, and they both include the following features:
Leaf: a tracker that records sleep, activity, and stress
Time: a watch feature that offers insight on daily wellness and tracks
activity as well
Key Stakeholders:
Urška Sršen: Bellabeat’s co-founder, Chief Creative
Officer
Sando Mur: Mathematician, Bellabeat’s cofounder, key
member of the Bellabeat executive team
Bellabeat Marketing Analytics Team: Tasked with sifting
through data and offering their insights to the marketing team
Objectives:
* find out how consumers use their product; through pattern recognition
or trends
* suggest improvements for Bellabeat smart devices based on consumer
data
Data Source:
Kaggle’s ‘Fitbit Fitness Tracker Data’(CCO: Public Domain) is a dataset
created by Mobius, where there is recorded information on 30 fitbit
wearers, which includes daily activity, calories burned, and more that
is related to health improvement.
Website Link: https://www.kaggle.com/datasets/arashnic/fitbit/data
Data Structure:
For this dataset, We are focusing on the four most important csv files:
daily_steps, daily_sleep, daily_activity, and weight_log_info
#inserting necessary libraries
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
library(readr)
library(lubridate)
library(ggplot2)
## daily steps csv into data.frame
daily_steps <- read.csv("C:/Users/andre/Downloads/bellabeat_data/mturkfitbit_export_4.12.16-5.12.16/Fitabase Data 4.12.16-5.12.16/dailySteps_merged.csv")
#Key variables: Id, ActivityDay, StepTotal
head(daily_steps)
## Id ActivityDay StepTotal
## 1 1503960366 4/12/2016 13162
## 2 1503960366 4/13/2016 10735
## 3 1503960366 4/14/2016 10460
## 4 1503960366 4/15/2016 9762
## 5 1503960366 4/16/2016 12669
## 6 1503960366 4/17/2016 9705
## daily sleep csv into data.frame
daily_sleep <- read.csv("C:/Users/andre/Downloads/bellabeat_data/mturkfitbit_export_4.12.16-5.12.16/Fitabase Data 4.12.16-5.12.16/sleepDay_merged.csv")
#Key variables: Id, SleepDay, TotalMinutesAsleep
head(daily_sleep)
## Id SleepDay TotalSleepRecords TotalMinutesAsleep
## 1 1503960366 4/12/2016 12:00:00 AM 1 327
## 2 1503960366 4/13/2016 12:00:00 AM 2 384
## 3 1503960366 4/15/2016 12:00:00 AM 1 412
## 4 1503960366 4/16/2016 12:00:00 AM 2 340
## 5 1503960366 4/17/2016 12:00:00 AM 1 700
## 6 1503960366 4/19/2016 12:00:00 AM 1 304
## TotalTimeInBed
## 1 346
## 2 407
## 3 442
## 4 367
## 5 712
## 6 320
## daily activity csv into data.frame
daily_activity <- read.csv("C:/Users/andre/Downloads/bellabeat_data/mturkfitbit_export_4.12.16-5.12.16/Fitabase Data 4.12.16-5.12.16/dailyActivity_merged.csv")
#Key variables: Id, ActivityDate, TotalSteps, SedentaryMinutes, Calories, VeryActiveMinutes, FairlyActiveMinutes, LightlyActiveMinutes
head(daily_activity)
## Id ActivityDate TotalSteps TotalDistance TrackerDistance
## 1 1503960366 4/12/2016 13162 8.50 8.50
## 2 1503960366 4/13/2016 10735 6.97 6.97
## 3 1503960366 4/14/2016 10460 6.74 6.74
## 4 1503960366 4/15/2016 9762 6.28 6.28
## 5 1503960366 4/16/2016 12669 8.16 8.16
## 6 1503960366 4/17/2016 9705 6.48 6.48
## LoggedActivitiesDistance VeryActiveDistance ModeratelyActiveDistance
## 1 0 1.88 0.55
## 2 0 1.57 0.69
## 3 0 2.44 0.40
## 4 0 2.14 1.26
## 5 0 2.71 0.41
## 6 0 3.19 0.78
## LightActiveDistance SedentaryActiveDistance VeryActiveMinutes
## 1 6.06 0 25
## 2 4.71 0 21
## 3 3.91 0 30
## 4 2.83 0 29
## 5 5.04 0 36
## 6 2.51 0 38
## FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes Calories
## 1 13 328 728 1985
## 2 19 217 776 1797
## 3 11 181 1218 1776
## 4 34 209 726 1745
## 5 10 221 773 1863
## 6 20 164 539 1728
## weight log csv into data.frame
weight_log <- read.csv("C:/Users/andre/Downloads/bellabeat_data/mturkfitbit_export_4.12.16-5.12.16/Fitabase Data 4.12.16-5.12.16/weightLogInfo_merged.csv")
#Key variables: Id, Date, WeightPounds, IsManualReport
head(weight_log)
## Id Date WeightKg WeightPounds Fat BMI
## 1 1503960366 5/2/2016 11:59:59 PM 52.6 115.9631 22 22.65
## 2 1503960366 5/3/2016 11:59:59 PM 52.6 115.9631 NA 22.65
## 3 1927972279 4/13/2016 1:08:52 AM 133.5 294.3171 NA 47.54
## 4 2873212765 4/21/2016 11:59:59 PM 56.7 125.0021 NA 21.45
## 5 2873212765 5/12/2016 11:59:59 PM 57.3 126.3249 NA 21.69
## 6 4319703577 4/17/2016 11:59:59 PM 72.4 159.6147 25 27.45
## IsManualReport LogId
## 1 True 1.462234e+12
## 2 True 1.462320e+12
## 3 False 1.460510e+12
## 4 True 1.461283e+12
## 5 True 1.463098e+12
## 6 True 1.460938e+12
Objective:
Checking for nulls, duplicates, and converting to correct data types for
each csv.
Daily_steps:
renamed ActivityDay to Date to create a consistent label for the following dates
converted ActivityDay from ‘chr’ to ‘date’
#check for missing values
summary(daily_steps)
## Id ActivityDay StepTotal
## Min. :1.504e+09 Length:940 Min. : 0
## 1st Qu.:2.320e+09 Class :character 1st Qu.: 3790
## Median :4.445e+09 Mode :character Median : 7406
## Mean :4.855e+09 Mean : 7638
## 3rd Qu.:6.962e+09 3rd Qu.:10727
## Max. :8.878e+09 Max. :36019
#checking data type
class(daily_steps$Id)
## [1] "numeric"
class(daily_steps$ActivityDay)
## [1] "character"
#change from character to date data type
daily_steps$ActivityDay <- mdy(daily_steps$ActivityDay)
class(daily_steps$ActivityDay)
## [1] "Date"
class(daily_steps$StepTotal)
## [1] "integer"
cleaned_daily_steps <- daily_steps %>%
group_by(Id) %>% # groups by ID
drop_na %>% #removes missing values
distinct() %>% #removes duplicates
rename(Date = ActivityDay) %>%
head()
Daily_sleep:
converted SleepDay from ‘chr’ to ‘dttm’
renamed SleepDay to Date
summary(daily_sleep)
## Id SleepDay TotalSleepRecords TotalMinutesAsleep
## Min. :1.504e+09 Length:413 Min. :1.000 Min. : 58.0
## 1st Qu.:3.977e+09 Class :character 1st Qu.:1.000 1st Qu.:361.0
## Median :4.703e+09 Mode :character Median :1.000 Median :433.0
## Mean :5.001e+09 Mean :1.119 Mean :419.5
## 3rd Qu.:6.962e+09 3rd Qu.:1.000 3rd Qu.:490.0
## Max. :8.792e+09 Max. :3.000 Max. :796.0
## TotalTimeInBed
## Min. : 61.0
## 1st Qu.:403.0
## Median :463.0
## Mean :458.6
## 3rd Qu.:526.0
## Max. :961.0
#checking data type
class(daily_sleep$Id)
## [1] "numeric"
class(daily_sleep$SleepDay)
## [1] "character"
daily_sleep$SleepDay <- mdy_hms(daily_sleep$SleepDay)
class(daily_sleep$SleepDay)
## [1] "POSIXct" "POSIXt"
class(daily_sleep$TotalMinutesAsleep)
## [1] "integer"
cleaned_daily_sleep <- daily_sleep %>%
select(Id, SleepDay, TotalMinutesAsleep) %>% ##selecting the important variables
group_by(Id) %>% ## groups by ID
drop_na() %>%## removes missing values
distinct() %>% ##removes duplicates
rename(Date = SleepDay) %>%
head()
Daily_Activity:
- created a new variable, TotalMinutesActive, by summarizing the values
of LightlyActiveMinutes, FairlyActiveMinutes, and VeryActiveMinutes from
each observation
- covert ActivityDate from ‘chr’ to ‘date’
summary(daily_activity)
## Id ActivityDate TotalSteps TotalDistance
## Min. :1.504e+09 Length:940 Min. : 0 Min. : 0.000
## 1st Qu.:2.320e+09 Class :character 1st Qu.: 3790 1st Qu.: 2.620
## Median :4.445e+09 Mode :character Median : 7406 Median : 5.245
## Mean :4.855e+09 Mean : 7638 Mean : 5.490
## 3rd Qu.:6.962e+09 3rd Qu.:10727 3rd Qu.: 7.713
## Max. :8.878e+09 Max. :36019 Max. :28.030
## TrackerDistance LoggedActivitiesDistance VeryActiveDistance
## Min. : 0.000 Min. :0.0000 Min. : 0.000
## 1st Qu.: 2.620 1st Qu.:0.0000 1st Qu.: 0.000
## Median : 5.245 Median :0.0000 Median : 0.210
## Mean : 5.475 Mean :0.1082 Mean : 1.503
## 3rd Qu.: 7.710 3rd Qu.:0.0000 3rd Qu.: 2.053
## Max. :28.030 Max. :4.9421 Max. :21.920
## ModeratelyActiveDistance LightActiveDistance SedentaryActiveDistance
## Min. :0.0000 Min. : 0.000 Min. :0.000000
## 1st Qu.:0.0000 1st Qu.: 1.945 1st Qu.:0.000000
## Median :0.2400 Median : 3.365 Median :0.000000
## Mean :0.5675 Mean : 3.341 Mean :0.001606
## 3rd Qu.:0.8000 3rd Qu.: 4.782 3rd Qu.:0.000000
## Max. :6.4800 Max. :10.710 Max. :0.110000
## VeryActiveMinutes FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes
## Min. : 0.00 Min. : 0.00 Min. : 0.0 Min. : 0.0
## 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.:127.0 1st Qu.: 729.8
## Median : 4.00 Median : 6.00 Median :199.0 Median :1057.5
## Mean : 21.16 Mean : 13.56 Mean :192.8 Mean : 991.2
## 3rd Qu.: 32.00 3rd Qu.: 19.00 3rd Qu.:264.0 3rd Qu.:1229.5
## Max. :210.00 Max. :143.00 Max. :518.0 Max. :1440.0
## Calories
## Min. : 0
## 1st Qu.:1828
## Median :2134
## Mean :2304
## 3rd Qu.:2793
## Max. :4900
class(daily_activity$Id)
## [1] "numeric"
class(daily_activity$ActivityDate)
## [1] "character"
daily_activity$ActivityDate <- mdy(daily_activity$ActivityDate)
class(daily_activity$ActivityDate)
## [1] "Date"
class(daily_activity$TotalSteps)
## [1] "integer"
class(daily_activity$Calories)
## [1] "integer"
class(daily_activity$SedentaryMinutes)
## [1] "integer"
class(daily_activity$TotalSteps)
## [1] "integer"
class(daily_activity$VeryActiveMinutes)
## [1] "integer"
class(daily_activity$FairlyActiveMinutes)
## [1] "integer"
class(daily_activity$LightlyActiveMinutes)
## [1] "integer"
cleaned_daily_activity <- daily_activity %>%
select(Id, ActivityDate, TotalSteps, Calories, SedentaryMinutes, FairlyActiveMinutes, VeryActiveMinutes, LightlyActiveMinutes) %>%
group_by(Id) %>% ## groups by ID
drop_na() %>%## removes missing values
distinct() %>% ##removes duplicates
mutate(TotalMinutesActive = FairlyActiveMinutes + LightlyActiveMinutes + VeryActiveMinutes) %>%
rename(Date = ActivityDate) %>%
rename(StepTotal = TotalSteps) %>%
head()
weight_log:
summary(weight_log)
## Id Date WeightKg WeightPounds
## Min. :1.504e+09 Length:67 Min. : 52.60 Min. :116.0
## 1st Qu.:6.962e+09 Class :character 1st Qu.: 61.40 1st Qu.:135.4
## Median :6.962e+09 Mode :character Median : 62.50 Median :137.8
## Mean :7.009e+09 Mean : 72.04 Mean :158.8
## 3rd Qu.:8.878e+09 3rd Qu.: 85.05 3rd Qu.:187.5
## Max. :8.878e+09 Max. :133.50 Max. :294.3
##
## Fat BMI IsManualReport LogId
## Min. :22.00 Min. :21.45 Length:67 Min. :1.460e+12
## 1st Qu.:22.75 1st Qu.:23.96 Class :character 1st Qu.:1.461e+12
## Median :23.50 Median :24.39 Mode :character Median :1.462e+12
## Mean :23.50 Mean :25.19 Mean :1.462e+12
## 3rd Qu.:24.25 3rd Qu.:25.56 3rd Qu.:1.462e+12
## Max. :25.00 Max. :47.54 Max. :1.463e+12
## NA's :65
class(weight_log$Id)
## [1] "numeric"
class(weight_log$Date)
## [1] "character"
weight_log$Date <- mdy_hms(weight_log$Date)
class(weight_log$Date)
## [1] "POSIXct" "POSIXt"
class(weight_log$IsManualReport)
## [1] "character"
weight_log$IsManualReport <- as.logical(weight_log$IsManualReport)
class(weight_log$IsManualReport)
## [1] "logical"
cleaned_weight_log <- weight_log %>%
mutate(changed_weight_date = as.Date(weight_log$Date,format = "%m/%d/%Y")) %>% #converting the dttm to Date data type
select(Id, changed_weight_date, WeightPounds,IsManualReport) %>%
group_by(Id) %>% # groups by ID
drop_na() %>% #removes missing values
distinct() %>% #removes duplicates
rename(Date = changed_weight_date) %>% #renames to 'Date'
head()
Merging:
merged_steps_sleep <- merge(x=cleaned_daily_steps, y=cleaned_daily_sleep, by = c("Id", "Date"))
head(merged_steps_sleep)
## Id Date StepTotal TotalMinutesAsleep
## 1 1503960366 2016-04-12 13162 327
## 2 1503960366 2016-04-13 10735 384
## 3 1503960366 2016-04-15 9762 412
## 4 1503960366 2016-04-16 12669 340
## 5 1503960366 2016-04-17 9705 700
merged_steps_activity <- merge(x=cleaned_daily_steps, y=cleaned_daily_activity, by = c("Id", "Date", "StepTotal"))
head(merged_steps_activity)
## Id Date StepTotal Calories SedentaryMinutes FairlyActiveMinutes
## 1 1503960366 2016-04-12 13162 1985 728 13
## 2 1503960366 2016-04-13 10735 1797 776 19
## 3 1503960366 2016-04-14 10460 1776 1218 11
## 4 1503960366 2016-04-15 9762 1745 726 34
## 5 1503960366 2016-04-16 12669 1863 773 10
## 6 1503960366 2016-04-17 9705 1728 539 20
## VeryActiveMinutes LightlyActiveMinutes TotalMinutesActive
## 1 25 328 366
## 2 21 217 257
## 3 30 181 222
## 4 29 209 272
## 5 36 221 267
## 6 38 164 222
merged_activity_weight <- merge(x=cleaned_daily_activity, y=cleaned_weight_log, by = c("Id","Date"))
head(merged_activity_weight)
## [1] Id Date StepTotal
## [4] Calories SedentaryMinutes FairlyActiveMinutes
## [7] VeryActiveMinutes LightlyActiveMinutes TotalMinutesActive
## [10] WeightPounds IsManualReport
## <0 rows> (or 0-length row.names)
merged_total <- merge(x=merged_steps_sleep, y=merged_activity_weight, by = c("Id","Date"))
head(merged_total)
## [1] Id Date StepTotal.x
## [4] TotalMinutesAsleep StepTotal.y Calories
## [7] SedentaryMinutes FairlyActiveMinutes VeryActiveMinutes
## [10] LightlyActiveMinutes TotalMinutesActive WeightPounds
## [13] IsManualReport
## <0 rows> (or 0-length row.names)
With the information recieved from analysis, we can conclude that Bellabeat should target improving current features, especially surrounding the Time feature because it would better benefit users’ overall health. It would also be in their best interest to add new features that compliment their existing ones such as an app that allows for designing periodic workouts, which would help break the wide degree of minutes being sedentary into smaller chunks. These suggestions would not only boost user interactivity, but satisfaction and higher ratings due to the heavy emphasis on health.