Project Overview
Analysis of Smart Device Usage
This project is a case study for my Google Data Certificate. In it, I perform data analysis for Bellabeat, a high-tech manufacturer of health-focused products for women. The CEO, Urška Sršen, wants us to analyze smart device data to gain insight into how consumers are using their smart devices. This analysis aims to help guide future marketing strategies for my team.
About Bellabeat
Bellabeat is a pioneer in the fem-tech realm: a women’s wellness company that has helped millions of women track their cycles and pregnancies and live more in sync with their cycles. The company has collected data on its customers’ activity, sleep, stress, and reproductive health, which has allowed it to empower women with knowledge about their health and habits. The company offers its customers the following products: the Bellabeat app, Leaf, Time, Spring, and a membership.
Problem Statement
The CEO has asked my team to select one of the company’s products and answer the following questions:
What are some trends in smart device usage?
How could these trends apply to the company’s customers?
How could these trends help influence Bellabeat’s marketing strategy?
Data Source
The data used is the FitBit Fitness Tracker Data, a public-domain dataset made available through Mobius on Kaggle, which contains personal fitness tracker data from thirty Fitbit users. The data can be accessed on Kaggle (https://www.kaggle.com/datasets/arashnic/fitbit). It contains information about the users’ daily activity, steps, and heart rate that can be used to explore their habits.
Tools Used
Excel and R programming. I used Excel to view the downloaded data, which consists of 16 CSV files. Because of the large amount of data, I used R to complete the cleaning and preparation of the data set.
Data Cleaning and Preparation
After downloading the dataset, I used RStudio to clean the data. First, I removed duplicate records from the data sets. Upon checking the data, I also detected a problem with the timestamps: year, month, day, and time are joined together in a single field. I converted each timestamp to date-time format and split it into separate date and time columns. After removing the duplicate values, I checked the uniqueness of the user IDs with the n_distinct function and discovered that the data has 33 users in daily activity, 24 users in sleep, and only 8 users in weight. That is 3 more users than the thirty expected, and some users did not record sleep or weight data. Most data are recorded from Tuesday to Thursday, which may not be comprehensive enough to form an accurate analysis.
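The cleaning checks described above can be sketched in a few lines of base R. This is only an illustration under stated assumptions: the `activity_demo` data frame and its values are made up for the example (they are not rows from the Kaggle files), and base R's `duplicated()` and `unique()` stand in for whatever functions were used in the original cleaning.

```r
# Hypothetical mini data set standing in for one of the CSV files.
activity_demo <- data.frame(
  Id = c(1, 1, 1, 2, 2),
  ActivityDate = c("4/12/2016", "4/12/2016", "4/13/2016", "4/12/2016", "4/14/2016"),
  TotalSteps = c(1000, 1000, 2000, 3000, 4000)
)

# Count fully duplicated rows, then drop them.
n_dupes <- sum(duplicated(activity_demo))
activity_clean <- activity_demo[!duplicated(activity_demo), ]

# Number of distinct users (base R counterpart of dplyr::n_distinct()).
n_users <- length(unique(activity_clean$Id))

# Records per weekday; "%u" gives 1 = Monday ... 7 = Sunday.
dates <- as.Date(activity_clean$ActivityDate, format = "%m/%d/%Y")
weekday_counts <- table(format(dates, "%u"))

print(n_dupes)
print(n_users)
print(weekday_counts)
```

Counting records per weekday in this way is one option for checking whether some days are over-represented before drawing conclusions.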
Data Analysis
#Installation of Packages
install.packages(c("tidyverse", "lubridate", "dplyr", "ggplot2", "tidyr"))
## Installing packages into '/cloud/lib/x86_64-pc-linux-gnu-library/4.4'
## (as 'lib' is unspecified)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(lubridate)
library(ggplot2)
library(tidyr)
Loading of Data Set
Reading data from CSV files
Activity<- read.csv("/cloud/project/DailyActivity.csv")
Calories<- read.csv("/cloud/project/hourlyCalories.csv")
Intensities<- read.csv("/cloud/project/hourlyIntensities.csv")
Sleep<- read.csv("/cloud/project/sleepDay.csv")
Weight<- read.csv("/cloud/project/weightLogInfo.csv")
Viewing the Data Set
Checking the data with the head function
head(Activity)
## Id ActivityDate TotalSteps TotalDistance TrackerDistance
## 1 1503960366 04/12/2016 13162 8.50 8.50
## 2 1503960366 4/13/2016 10735 6.97 6.97
## 3 1503960366 4/14/2016 10460 6.74 6.74
## 4 1503960366 4/15/2016 9762 6.28 6.28
## 5 1503960366 4/16/2016 12669 8.16 8.16
## 6 1503960366 4/17/2016 9705 6.48 6.48
## LoggedActivitiesDistance VeryActiveDistance ModeratelyActiveDistance
## 1 0 1.88 0.55
## 2 0 1.57 0.69
## 3 0 2.44 0.40
## 4 0 2.14 1.26
## 5 0 2.71 0.41
## 6 0 3.19 0.78
## LightActiveDistance SedentaryActiveDistance VeryActiveMinutes
## 1 6.06 0 25
## 2 4.71 0 21
## 3 3.91 0 30
## 4 2.83 0 29
## 5 5.04 0 36
## 6 2.51 0 38
## FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes Calories
## 1 13 328 728 1985
## 2 19 217 776 1797
## 3 11 181 1218 1776
## 4 34 209 726 1745
## 5 10 221 773 1863
## 6 20 164 539 1728
head(Calories)
## Id ActivityHour Calories
## 1 1503960366 4/12/2016 12:00:00 AM 81
## 2 1503960366 4/12/2016 1:00:00 AM 61
## 3 1503960366 4/12/2016 2:00:00 AM 59
## 4 1503960366 4/12/2016 3:00:00 AM 47
## 5 1503960366 4/12/2016 4:00:00 AM 48
## 6 1503960366 4/12/2016 5:00:00 AM 48
head(Intensities)
## Id ActivityHour TotalIntensity AverageIntensity
## 1 1503960366 4/12/2016 12:00:00 AM 20 0.333333
## 2 1503960366 4/12/2016 1:00:00 AM 8 0.133333
## 3 1503960366 4/12/2016 2:00:00 AM 7 0.116667
## 4 1503960366 4/12/2016 3:00:00 AM 0 0.000000
## 5 1503960366 4/12/2016 4:00:00 AM 0 0.000000
## 6 1503960366 4/12/2016 5:00:00 AM 0 0.000000
head(Sleep)
## Id SleepDay TotalSleepRecords TotalMinutesAsleep
## 1 1503960366 4/12/2016 12:00:00 AM 1 327
## 2 1503960366 4/13/2016 12:00:00 AM 2 384
## 3 1503960366 4/15/2016 12:00:00 AM 1 412
## 4 1503960366 4/16/2016 12:00:00 AM 2 340
## 5 1503960366 4/17/2016 12:00:00 AM 1 700
## 6 1503960366 4/19/2016 12:00:00 AM 1 304
## TotalTimeInBed
## 1 346
## 2 407
## 3 442
## 4 367
## 5 712
## 6 320
head(Weight)
## Id Date WeightKg WeightPounds Fat BMI
## 1 1503960366 05/02/2016 23:59 52.6 115.9631 22 22.65
## 2 1503960366 05/03/2016 23:59 52.6 115.9631 NA 22.65
## 3 1927972279 4/13/2016 1:08:52 AM 133.5 294.3171 NA 47.54
## 4 2873212765 4/21/2016 11:59:59 PM 56.7 125.0021 NA 21.45
## 5 2873212765 05/12/2016 23:59 57.3 126.3249 NA 21.69
## 6 4319703577 4/17/2016 11:59:59 PM 72.4 159.6147 25 27.45
## IsManualReport LogId
## 1 TRUE 1.46223e+12
## 2 TRUE 1.46232e+12
## 3 FALSE 1.46051e+12
## 4 TRUE 1.46128e+12
## 5 TRUE 1.46310e+12
## 6 TRUE 1.46094e+12
Converting the timestamps to date-time format and splitting them into date and time
For Intensities
Intensities$ActivityHour=as.POSIXct(Intensities$ActivityHour, format="%m/%d/%Y %I:%M:%S %p", tz=Sys.timezone())
Intensities$time <- format(Intensities$ActivityHour, format = "%H:%M:%S")
Intensities$date <- format(Intensities$ActivityHour, format = "%m/%d/%y")
head(Intensities)
## Id ActivityHour TotalIntensity AverageIntensity time
## 1 1503960366 2016-04-12 00:00:00 20 0.333333 00:00:00
## 2 1503960366 2016-04-12 01:00:00 8 0.133333 01:00:00
## 3 1503960366 2016-04-12 02:00:00 7 0.116667 02:00:00
## 4 1503960366 2016-04-12 03:00:00 0 0.000000 03:00:00
## 5 1503960366 2016-04-12 04:00:00 0 0.000000 04:00:00
## 6 1503960366 2016-04-12 05:00:00 0 0.000000 05:00:00
## date
## 1 04/12/16
## 2 04/12/16
## 3 04/12/16
## 4 04/12/16
## 5 04/12/16
## 6 04/12/16
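One way to confirm that a conversion like the one above worked is to parse a few timestamps and check that none of them turned into NA. The sketch below is an illustration, not the code used in this analysis: the `raw` vector is a hypothetical sample in the same format as ActivityHour, and the time zone is fixed to "UTC" instead of Sys.timezone() so the result is the same on every machine.

```r
# AM/PM parsing with %p depends on the locale, so pin it for this example.
invisible(Sys.setlocale("LC_TIME", "C"))

# Hypothetical timestamps in the same layout as the ActivityHour column.
raw <- c("4/12/2016 12:00:00 AM", "4/12/2016 1:00:00 PM", "4/13/2016 11:00:00 PM")

# Parse to date-time; strings that do not match the format become NA.
parsed <- as.POSIXct(raw, format = "%m/%d/%Y %I:%M:%S %p", tz = "UTC")
n_failed <- sum(is.na(parsed))

# Split into separate time and date strings, as in the analysis.
time_part <- format(parsed, "%H:%M:%S")
date_part <- format(parsed, "%m/%d/%y")

print(n_failed)
print(data.frame(raw, time_part, date_part))
```

If `n_failed` is greater than zero, the format string does not match every row and the affected rows would silently drop out of later summaries.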
Calories
Calories$ActivityHour=as.POSIXct(Calories$ActivityHour, format="%m/%d/%Y %I:%M:%S %p", tz=Sys.timezone())
Calories$time <- format(Calories$ActivityHour, format = "%H:%M:%S")
Calories$date <- format(Calories$ActivityHour, format = "%m/%d/%y")
head(Calories)
## Id ActivityHour Calories time date
## 1 1503960366 2016-04-12 00:00:00 81 00:00:00 04/12/16
## 2 1503960366 2016-04-12 01:00:00 61 01:00:00 04/12/16
## 3 1503960366 2016-04-12 02:00:00 59 02:00:00 04/12/16
## 4 1503960366 2016-04-12 03:00:00 47 03:00:00 04/12/16
## 5 1503960366 2016-04-12 04:00:00 48 04:00:00 04/12/16
## 6 1503960366 2016-04-12 05:00:00 48 05:00:00 04/12/16
Activity
Activity$ActivityDate=as.POSIXct(Activity$ActivityDate, format="%m/%d/%Y", tz=Sys.timezone())
Activity$date <- format(Activity$ActivityDate, format = "%m/%d/%y")
head(Activity)
## Id ActivityDate TotalSteps TotalDistance TrackerDistance
## 1 1503960366 2016-04-12 13162 8.50 8.50
## 2 1503960366 2016-04-13 10735 6.97 6.97
## 3 1503960366 2016-04-14 10460 6.74 6.74
## 4 1503960366 2016-04-15 9762 6.28 6.28
## 5 1503960366 2016-04-16 12669 8.16 8.16
## 6 1503960366 2016-04-17 9705 6.48 6.48
## LoggedActivitiesDistance VeryActiveDistance ModeratelyActiveDistance
## 1 0 1.88 0.55
## 2 0 1.57 0.69
## 3 0 2.44 0.40
## 4 0 2.14 1.26
## 5 0 2.71 0.41
## 6 0 3.19 0.78
## LightActiveDistance SedentaryActiveDistance VeryActiveMinutes
## 1 6.06 0 25
## 2 4.71 0 21
## 3 3.91 0 30
## 4 2.83 0 29
## 5 5.04 0 36
## 6 2.51 0 38
## FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes Calories date
## 1 13 328 728 1985 04/12/16
## 2 19 217 776 1797 04/13/16
## 3 11 181 1218 1776 04/14/16
## 4 34 209 726 1745 04/15/16
## 5 10 221 773 1863 04/16/16
## 6 20 164 539 1728 04/17/16
Sleep
Sleep$SleepDay=as.POSIXct(Sleep$SleepDay, format="%m/%d/%Y %I:%M:%S %p", tz=Sys.timezone())
Sleep$date <- format(Sleep$SleepDay, format = "%m/%d/%y")
head(Sleep)
## Id SleepDay TotalSleepRecords TotalMinutesAsleep TotalTimeInBed
## 1 1503960366 2016-04-12 1 327 346
## 2 1503960366 2016-04-13 2 384 407
## 3 1503960366 2016-04-15 1 412 442
## 4 1503960366 2016-04-16 2 340 367
## 5 1503960366 2016-04-17 1 700 712
## 6 1503960366 2016-04-19 1 304 320
## date
## 1 04/12/16
## 2 04/13/16
## 3 04/15/16
## 4 04/16/16
## 5 04/17/16
## 6 04/19/16
Checking the Uniqueness of IDs in the Data
Counting the unique user IDs in each data set
n_distinct(Activity$Id)
## [1] 33
n_distinct(Calories$Id)
## [1] 33
n_distinct(Intensities$Id)
## [1] 33
n_distinct(Sleep$Id)
## [1] 24
n_distinct(Weight$Id)
## [1] 8
The output above tells us the number of participants in each data set.
The data sets contain 33 users for daily activity, 24 for sleep, and only 8 for weight. That is 3 more users than the thirty expected, and some users did not record sleep or weight data. With only 8 participants, the weight data set is too small to support any recommendations or conclusions.
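To see which users are missing from a data set, rather than only how many, the ID columns can be compared directly. The following is a minimal sketch with hypothetical IDs; in the analysis the vectors would come from Activity$Id and Sleep$Id.

```r
# Hypothetical user IDs; the real ones come from Activity$Id and Sleep$Id.
activity_ids <- c(101, 102, 103, 104, 105)
sleep_ids <- c(101, 103, 105)

# Users who tracked activity but never logged sleep.
missing_sleep <- setdiff(activity_ids, sleep_ids)

# Share of activity users who also have sleep data.
share_with_sleep <- length(intersect(activity_ids, sleep_ids)) / length(activity_ids)

# The same share for the counts reported above (24 of 33 users).
reported_share <- round(24 / 33, 2)

print(missing_sleep)
print(share_with_sleep)
print(reported_share)
```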
Summary of Each Data Set
Activity
Activity %>%
  select(TotalSteps,
         TotalDistance,
         SedentaryMinutes,
         Calories) %>%
  summary()
## TotalSteps TotalDistance SedentaryMinutes Calories
## Min. : 0 Min. : 0.000 Min. : 0.0 Min. : 0
## 1st Qu.: 3790 1st Qu.: 2.620 1st Qu.: 729.8 1st Qu.:1828
## Median : 7406 Median : 5.245 Median :1057.5 Median :2134
## Mean : 7638 Mean : 5.490 Mean : 991.2 Mean :2304
## 3rd Qu.:10727 3rd Qu.: 7.713 3rd Qu.:1229.5 3rd Qu.:2793
## Max. :36019 Max. :28.030 Max. :1440.0 Max. :4900
A summary of activity analysis
Checking min, max, mean, median and any outliers
The average is 7,638 total steps per day, which is a little low in terms of health benefits and typical of adults who are sedentary or lightly active. A study conducted by researchers from Kyoto University and the University of California found that walking 8,000 steps just once or twice a week can significantly improve health, including lowering the risk of early death. Taking 10,000 steps per day has been associated with good health and a reduced risk of chronic disease.
The average sedentary time is 991 minutes, or 991 / 60 ≈ 16.5 hours per day. This definitely needs to be reduced!
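The minutes-to-hours conversion above can be checked directly in R. This small sketch uses the mean SedentaryMinutes from the summary output; the variable names are just for illustration. Note that the summary does not say whether time asleep is counted as sedentary, so the figure is best read as an upper bound on waking sedentary time.

```r
# Mean SedentaryMinutes taken from the summary() output.
mean_sedentary_min <- 991.2

# Convert to hours and to a share of a 24-hour day.
sedentary_hours <- mean_sedentary_min / 60
share_of_day <- mean_sedentary_min / (24 * 60)

print(round(sedentary_hours, 1))
print(round(share_of_day, 2))
```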
Exploring the number of active minutes per category
#Activity
Activity %>%
  select(VeryActiveMinutes, FairlyActiveMinutes, LightlyActiveMinutes) %>%
  summary()
## VeryActiveMinutes FairlyActiveMinutes LightlyActiveMinutes
## Min. : 0.00 Min. : 0.00 Min. : 0.0
## 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.:127.0
## Median : 4.00 Median : 6.00 Median :199.0
## Mean : 21.16 Mean : 13.56 Mean :192.8
## 3rd Qu.: 32.00 3rd Qu.: 19.00 3rd Qu.:264.0
## Max. :210.00 Max. :143.00 Max. :518.0
The data show that users are lightly active for about 3 hours a day (mean 192.8 minutes) and fairly or very active for only about half an hour combined (13.56 + 21.16 ≈ 35 minutes). The majority of the participants are lightly active.
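To put the activity categories on one scale, the mean minutes from the two summaries can be turned into shares of tracked time. This is a rough sketch: it works from the rounded means printed above rather than the raw data, and the vector names are illustrative.

```r
# Mean minutes per category, copied from the summary() outputs.
mean_minutes <- c(very = 21.16, fairly = 13.56, lightly = 192.8, sedentary = 991.2)

# Percentage of tracked time spent in each category.
share <- round(100 * mean_minutes / sum(mean_minutes), 1)

# Combined time in the two most intense categories, in minutes.
active_minutes <- mean_minutes[["very"]] + mean_minutes[["fairly"]]

print(share)
print(active_minutes)
```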
Calories
Calories %>%
  select(Calories) %>%
  summary()
## Calories
## Min. : 42.00
## 1st Qu.: 63.00
## Median : 83.00
## Mean : 97.39
## 3rd Qu.:108.00
## Max. :948.00
Sleep
Sleep %>%
  select(TotalSleepRecords, TotalMinutesAsleep, TotalTimeInBed) %>%
  summary()
## TotalSleepRecords TotalMinutesAsleep TotalTimeInBed
## Min. :1.000 Min. : 58.0 Min. : 61.0
## 1st Qu.:1.000 1st Qu.:361.0 1st Qu.:403.0
## Median :1.000 Median :433.0 Median :463.0
## Mean :1.119 Mean :419.5 Mean :458.6
## 3rd Qu.:1.000 3rd Qu.:490.0 3rd Qu.:526.0
## Max. :3.000 Max. :796.0 Max. :961.0
On average, participants slept about 7 hours (419.5 minutes) and spent about 7.6 hours in bed (458.6 minutes). There is room for improvement here: the National Sleep Foundation (2015) recommends that adults get 7 to 9 hours of sleep to promote optimal health and functioning, so the average sits at the very bottom of that range.
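The gap between time in bed and time asleep can be made explicit per record. The sketch below uses the six rows printed by head(Sleep) earlier; the derived column names (MinutesAwakeInBed, SleepEfficiency, HoursAsleep) are my own labels for illustration, not columns in the dataset.

```r
# The six rows shown in the head(Sleep) output.
sleep_demo <- data.frame(
  TotalMinutesAsleep = c(327, 384, 412, 340, 700, 304),
  TotalTimeInBed = c(346, 407, 442, 367, 712, 320)
)

# Minutes spent in bed but not asleep.
sleep_demo$MinutesAwakeInBed <- sleep_demo$TotalTimeInBed - sleep_demo$TotalMinutesAsleep

# Fraction of time in bed that was spent asleep.
sleep_demo$SleepEfficiency <- sleep_demo$TotalMinutesAsleep / sleep_demo$TotalTimeInBed

# Sleep duration in hours, for comparison with the 7 to 9 hour guideline.
sleep_demo$HoursAsleep <- sleep_demo$TotalMinutesAsleep / 60

print(sleep_demo)
```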
Weight
Weight %>%
  select(WeightKg, BMI) %>%
  summary()
## WeightKg BMI
## Min. : 52.60 Min. :21.45
## 1st Qu.: 61.40 1st Qu.:23.96
## Median : 62.50 Median :24.39
## Mean : 72.04 Mean :25.19
## 3rd Qu.: 85.05 3rd Qu.:25.56
## Max. :133.50 Max. :47.54
The average weight is 72.04 kg (about 159 pounds), with an average BMI of about 25.
Before visualizing the data, I need to merge two data sets. I will merge (inner join) Activity and Sleep on the columns Id and date (the date column I created earlier when converting the timestamps to date-time format).
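The kilogram-to-pound conversion can be verified against the data itself. This sketch assumes a rounded conversion factor of 2.20462 pounds per kilogram and checks it against the first row of head(Weight), where 52.6 kg is listed as 115.9631 pounds.

```r
# Pounds per kilogram (rounded conversion factor, an assumption of this sketch).
kg_to_lb <- 2.20462

# Sanity check against the first row of head(Weight): 52.6 kg.
first_row_lb <- 52.6 * kg_to_lb

# Mean WeightKg from the summary() output, converted to pounds.
mean_weight_kg <- 72.04
mean_weight_lb <- mean_weight_kg * kg_to_lb

print(round(first_row_lb, 2))
print(round(mean_weight_lb, 1))
```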
Merging of the Data Set
merged_data <- merge(Sleep, Activity, by=c('Id', 'date'))
head(merged_data)
## Id date SleepDay TotalSleepRecords TotalMinutesAsleep
## 1 1503960366 04/12/16 2016-04-12 1 327
## 2 1503960366 04/13/16 2016-04-13 2 384
## 3 1503960366 04/15/16 2016-04-15 1 412
## 4 1503960366 04/16/16 2016-04-16 2 340
## 5 1503960366 04/17/16 2016-04-17 1 700
## 6 1503960366 04/19/16 2016-04-19 1 304
## TotalTimeInBed ActivityDate TotalSteps TotalDistance TrackerDistance
## 1 346 2016-04-12 13162 8.50 8.50
## 2 407 2016-04-13 10735 6.97 6.97
## 3 442 2016-04-15 9762 6.28 6.28
## 4 367 2016-04-16 12669 8.16 8.16
## 5 712 2016-04-17 9705 6.48 6.48
## 6 320 2016-04-19 15506 9.88 9.88
## LoggedActivitiesDistance VeryActiveDistance ModeratelyActiveDistance
## 1 0 1.88 0.55
## 2 0 1.57 0.69
## 3 0 2.14 1.26
## 4 0 2.71 0.41
## 5 0 3.19 0.78
## 6 0 3.53 1.32
## LightActiveDistance SedentaryActiveDistance VeryActiveMinutes
## 1 6.06 0 25
## 2 4.71 0 21
## 3 2.83 0 29
## 4 5.04 0 36
## 5 2.51 0 38
## 6 5.03 0 50
## FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes Calories
## 1 13 328 728 1985
## 2 19 217 776 1797
## 3 34 209 726 1745
## 4 10 221 773 1863
## 5 20 164 539 1728
## 6 31 264 775 2035
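Because an inner join silently drops days that appear in only one data set, it can help to compare it with a left join. The sketch below rebuilds a few rows from the outputs above (the `_demo` data frames are cut-down copies for illustration) and shows that the activity record for 04/14/16 has no matching sleep record.

```r
# A few sleep records for one user, taken from head(Sleep).
sleep_demo <- data.frame(
  Id = 1503960366,
  date = c("04/12/16", "04/13/16", "04/15/16"),
  TotalMinutesAsleep = c(327, 384, 412)
)

# The matching activity records, taken from head(Activity).
activity_demo <- data.frame(
  Id = 1503960366,
  date = c("04/12/16", "04/13/16", "04/14/16", "04/15/16"),
  TotalSteps = c(13162, 10735, 10460, 9762)
)

# Inner join: only days present in both data sets.
inner <- merge(sleep_demo, activity_demo, by = c("Id", "date"))

# Left join: every activity day, with NA where sleep was not logged.
left <- merge(activity_demo, sleep_demo, by = c("Id", "date"), all.x = TRUE)
days_without_sleep <- left$date[is.na(left$TotalMinutesAsleep)]

print(nrow(inner))
print(left)
print(days_without_sleep)
```

Comparing the row counts of the two joins gives a quick measure of how many activity days are excluded from any sleep-related plot.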
Data Visualization
Activity
ggplot(data=Activity, aes(x=TotalSteps, y=Calories, color=SedentaryMinutes)) +
  geom_point() +
  stat_smooth(method=lm) +
  labs(title="Total Steps vs. Calories", caption="Source: Kaggle.com/datasets/arashnic/fitbit") +
  # the color scale must be chained with + so that it is applied to the plot
  scale_color_gradient(low="steelblue", high="yellow")
## `geom_smooth()` using formula = 'y ~ x'
## Warning: The following aesthetics were dropped during statistical transformation:
## colour.
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
## the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
## variable into a factor?
I see a positive correlation between Total Steps and Calories, which is to be expected: the more active we are, the more calories we burn.
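The visual impression of a positive relationship can be backed with a number. The sketch below computes the Pearson correlation and a simple linear fit on the six rows printed by head(Activity); with so few rows from a single user it only illustrates the calls, and the same code would need to be run on the full Activity data to describe the whole sample.

```r
# TotalSteps and Calories from the six rows shown in head(Activity).
steps <- c(13162, 10735, 10460, 9762, 12669, 9705)
calories <- c(1985, 1797, 1776, 1745, 1863, 1728)

# Pearson correlation between steps and calories.
r <- cor(steps, calories)

# Linear model: the slope is the estimated extra calories per extra step.
fit <- lm(calories ~ steps)
slope <- unname(coef(fit)["steps"])

print(r)
print(slope)
```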
Sleep
ggplot(data=Sleep, aes(x=TotalMinutesAsleep, y=TotalTimeInBed, color=TotalMinutesAsleep)) +
  geom_point() +
  labs(title="Total Minutes Asleep vs. Total Time in Bed", caption="Source: Kaggle.com/datasets/arashnic/fitbit") +
  # the color scale must be chained with + so that it is applied to the plot
  scale_color_gradient(low='yellow', high='blue')
The relationship between Total Minutes Asleep and Total Time in Bed looks linear. So if users want to improve their sleep, we should consider using notifications that remind them to go to sleep.
Intensities
int_new <- Intensities %>%
  group_by(time) %>%
  drop_na() %>%
  summarise(mean_total_int = mean(TotalIntensity))
# geom_col() draws the precomputed means as bars and avoids the
# "unknown parameters" warning raised by geom_histogram(stat = "identity")
ggplot(data=int_new, aes(x=time, y=mean_total_int, color=time)) +
  geom_col(fill='steelblue') +
  theme(axis.text.x = element_text(angle = 90)) +
  labs(title="Average Total Intensity vs. Time", caption="Source: Kaggle.com/datasets/arashnic/fitbit")
After visualizing the average total intensity by hour, I found that people are more active between 5 am and 10 pm.
Most activity happens between 5 pm and 7 pm. We can use this window in the Bellabeat app to remind and motivate users to go for a run or walk.
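The peak hour can also be read off programmatically instead of from the chart. The sketch below uses a small made-up `intensity_demo` data frame (the values are hypothetical, not taken from hourlyIntensities.csv) and base R's `aggregate()` in place of the dplyr pipeline.

```r
# Hypothetical hourly records; the real data has one row per user per hour.
intensity_demo <- data.frame(
  time = c("17:00:00", "17:00:00", "18:00:00", "18:00:00", "03:00:00", "03:00:00"),
  TotalIntensity = c(20, 24, 30, 26, 0, 2)
)

# Mean intensity for each hour of the day.
hourly <- aggregate(TotalIntensity ~ time, data = intensity_demo, FUN = mean)

# Hour with the highest mean intensity.
peak_hour <- as.character(hourly$time[which.max(hourly$TotalIntensity)])

print(hourly)
print(peak_hour)
```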
Relationship between Total Minutes Asleep and Sedentary Minutes
ggplot(data=merged_data, aes(x=TotalMinutesAsleep, y=SedentaryMinutes)) +
geom_point(color='steelblue') + geom_smooth()+
labs(title="Minutes Asleep vs. Sedentary Minutes", caption = "Source: Kaggle.com/datasets/arashnic/fitbit")
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
Here we can see a negative relationship between Sedentary Minutes and Total Minutes Asleep.
If Bellabeat users want to improve their sleep, the Bellabeat app can recommend reducing sedentary time.
Keep in mind that we need to support these insights with more data, because correlation does not imply causation.
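One way to see how much support a correlation has is to look at its confidence interval. The sketch below runs `cor.test()` on the six rows printed by head(merged_data); with so few points from one user the interval is expected to be wide, which illustrates why more data is needed before acting on the relationship.

```r
# TotalMinutesAsleep and SedentaryMinutes from head(merged_data).
asleep <- c(327, 384, 412, 340, 700, 304)
sedentary <- c(728, 776, 726, 773, 539, 775)

# Pearson correlation with a 95% confidence interval.
result <- cor.test(asleep, sedentary)
r <- unname(result$estimate)
ci <- as.numeric(result$conf.int)

print(r)
print(ci)
```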
Conclusion
The Bellabeat app is not just another fitness activity app. It’s a guide (a friend) that empowers women to balance their personal and professional lives with healthy habits and routines by educating and motivating them daily through the app.
Recommendation
While an average of 7,638 daily steps is a great start, aiming for 8,000 steps per day could significantly improve Bellabeat users’ health outcomes, according to recent research. A study found that, compared with taking 4,000 steps per day, taking 8,000 steps per day was associated with a 51% lower risk of all-cause mortality (death from all causes), and taking 12,000 steps per day was associated with a 65% lower risk.
If users want to lose weight, it’s probably a good idea to control daily calorie consumption. Bellabeat can suggest ideas for low-calorie meals, especially at lunch and dinner.
If users want to improve their sleep patterns, Bellabeat should consider using app notifications that remind them to go to bed.
Most activity happens between 5 pm and 7 pm. I suppose that people go to the gym or for a walk after finishing work. Bellabeat can use this time to remind and motivate users to go for a run or walk.
If users want to improve their sleep, the Bellabeat app can recommend reducing sedentary time.
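The step recommendation above could be tracked as the share of days on which a user reaches the target. In the sketch below, `step_goal` is an illustrative name for the 8,000-step target, and the six step counts are the rows printed by head(Activity), all from one user who happens to be above the sample average. The median of 7,406 steps from the summary already shows that more than half of all recorded days fall below the target.

```r
# Daily step target discussed in the recommendations.
step_goal <- 8000

# TotalSteps from the six rows shown in head(Activity) (a single user).
steps_demo <- c(13162, 10735, 10460, 9762, 12669, 9705)

# Share of these days on which the target was reached.
share_meeting_goal <- mean(steps_demo >= step_goal)

# Median TotalSteps from the summary() output for the whole sample.
median_steps <- 7406
median_below_goal <- median_steps < step_goal

print(share_meeting_goal)
print(median_below_goal)
```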
This is my first project working in R. I would appreciate any comments and recommendations for improvement!
Thank you for your interest in my Bellabeat Case Study!!!