Project Overview

Analysis of Smart Device Usage

This project is a case study for my Google Data Certificate course. In it, I perform data analysis for Bellabeat, a high-tech manufacturer of health-focused products for women. The CEO (Urška Sršen) wants us to analyze smart device data to gain insight into how consumers are using their smart devices. This analysis aims to help guide future marketing strategies for my team.

About Bellabeat

Bellabeat is a pioneer in the fem-tech realm: a women’s wellness company that has helped millions of women track their cycles and pregnancies and live more in sync with their cycles. The company has collected data on the activity, sleep, stress, and reproductive health of its customers, which has allowed it to empower women with knowledge about their health and habits. The company offers the following products to its customers: the Bellabeat app, Leaf, Time, Spring, and a membership.

Problem Statement

The CEO has asked my team to select one of the company’s products and answer the following questions:

  1. What are some trends in smart device usage?

  2. How could these trends apply to the company’s customers?

  3. How could these trends help influence Bellabeat’s marketing strategy?

Data Source

The data used is the FitBit Fitness Tracker Data (a public-domain dataset made available through Mobius on Kaggle), which contains personal fitness tracker data from thirty Fitbit users. The data can be accessed at https://www.kaggle.com/datasets/arashnic/fitbit. It contains information about the users’ daily activity, steps, and heart rate that can be used to explore their habits.

Tools Used

Excel and R programming. I used Excel to view the downloaded data, which comes as 16 CSV files. Because of the large amount of data, I used R to clean and prepare the data set.

Data Cleaning and Preparation

After downloading the dataset, I used RStudio to clean the data as follows. First, I removed duplicate entries from the data sets. Upon checking the data, I found a problem with the timestamps: year, month, day, and time are joined together in a single field. I therefore converted each timestamp to date-time format and split it into separate date and time columns. After removing the duplicate values, I checked the number of unique user IDs with the n_distinct function and discovered that the data contains 33 users in daily activity, 24 users in sleep, and only 8 users in weight. That is 3 more users than the thirty described, and not every user recorded sleep or weight data. Most data are recorded from Tuesday to Thursday, which may not be comprehensive enough to form an accurate analysis.
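As a minimal sketch of the duplicate check described above, the snippet below uses base R on a small made-up data frame. The name toy_sleep and its values are illustrative assumptions, not rows from the Fitbit files. It counts rows that exactly repeat an earlier row and then drops them.

```r
# Hypothetical toy data: the second and third rows are exact duplicates
toy_sleep <- data.frame(
  Id = c(1, 2, 2, 3),
  SleepDay = c("4/12/2016", "4/12/2016", "4/12/2016", "4/13/2016"),
  TotalMinutesAsleep = c(327, 384, 384, 412)
)

# Count rows that repeat an earlier row exactly
n_dupes <- sum(duplicated(toy_sleep))

# Keep only the first occurrence of each row
toy_sleep_clean <- toy_sleep[!duplicated(toy_sleep), ]

n_dupes                 # number of duplicate rows found
nrow(toy_sleep_clean)   # rows remaining after removal
```

In a tidyverse pipeline, dplyr's distinct() should give the same result as the duplicated() filter used here.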

Data Analysis

#Installation of Packages

install.packages(c("tidyverse", "lubridate", "dplyr", "ggplot2", "tidyr"))
## Installing packages into '/cloud/lib/x86_64-pc-linux-gnu-library/4.4'
## (as 'lib' is unspecified)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(lubridate)
library(ggplot2)
library(tidyr)

Loading of Data Set

Reading data from CSV files

Activity<- read.csv("/cloud/project/DailyActivity.csv")
Calories<- read.csv("/cloud/project/hourlyCalories.csv")
Intensities<- read.csv("/cloud/project/hourlyIntensities.csv")
Sleep<- read.csv("/cloud/project/sleepDay.csv")
Weight<- read.csv("/cloud/project/weightLogInfo.csv")

Viewing the Data Set

Checking the data with the head function

head(Activity)
##           Id ActivityDate TotalSteps TotalDistance TrackerDistance
## 1 1503960366   04/12/2016      13162          8.50            8.50
## 2 1503960366    4/13/2016      10735          6.97            6.97
## 3 1503960366    4/14/2016      10460          6.74            6.74
## 4 1503960366    4/15/2016       9762          6.28            6.28
## 5 1503960366    4/16/2016      12669          8.16            8.16
## 6 1503960366    4/17/2016       9705          6.48            6.48
##   LoggedActivitiesDistance VeryActiveDistance ModeratelyActiveDistance
## 1                        0               1.88                     0.55
## 2                        0               1.57                     0.69
## 3                        0               2.44                     0.40
## 4                        0               2.14                     1.26
## 5                        0               2.71                     0.41
## 6                        0               3.19                     0.78
##   LightActiveDistance SedentaryActiveDistance VeryActiveMinutes
## 1                6.06                       0                25
## 2                4.71                       0                21
## 3                3.91                       0                30
## 4                2.83                       0                29
## 5                5.04                       0                36
## 6                2.51                       0                38
##   FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes Calories
## 1                  13                  328              728     1985
## 2                  19                  217              776     1797
## 3                  11                  181             1218     1776
## 4                  34                  209              726     1745
## 5                  10                  221              773     1863
## 6                  20                  164              539     1728
head(Calories)
##           Id          ActivityHour Calories
## 1 1503960366 4/12/2016 12:00:00 AM       81
## 2 1503960366  4/12/2016 1:00:00 AM       61
## 3 1503960366  4/12/2016 2:00:00 AM       59
## 4 1503960366  4/12/2016 3:00:00 AM       47
## 5 1503960366  4/12/2016 4:00:00 AM       48
## 6 1503960366  4/12/2016 5:00:00 AM       48
head(Intensities)
##           Id          ActivityHour TotalIntensity AverageIntensity
## 1 1503960366 4/12/2016 12:00:00 AM             20         0.333333
## 2 1503960366  4/12/2016 1:00:00 AM              8         0.133333
## 3 1503960366  4/12/2016 2:00:00 AM              7         0.116667
## 4 1503960366  4/12/2016 3:00:00 AM              0         0.000000
## 5 1503960366  4/12/2016 4:00:00 AM              0         0.000000
## 6 1503960366  4/12/2016 5:00:00 AM              0         0.000000
head(Sleep)
##           Id              SleepDay TotalSleepRecords TotalMinutesAsleep
## 1 1503960366 4/12/2016 12:00:00 AM                 1                327
## 2 1503960366 4/13/2016 12:00:00 AM                 2                384
## 3 1503960366 4/15/2016 12:00:00 AM                 1                412
## 4 1503960366 4/16/2016 12:00:00 AM                 2                340
## 5 1503960366 4/17/2016 12:00:00 AM                 1                700
## 6 1503960366 4/19/2016 12:00:00 AM                 1                304
##   TotalTimeInBed
## 1            346
## 2            407
## 3            442
## 4            367
## 5            712
## 6            320
head(Weight)
##           Id                  Date WeightKg WeightPounds Fat   BMI
## 1 1503960366      05/02/2016 23:59     52.6     115.9631  22 22.65
## 2 1503960366      05/03/2016 23:59     52.6     115.9631  NA 22.65
## 3 1927972279  4/13/2016 1:08:52 AM    133.5     294.3171  NA 47.54
## 4 2873212765 4/21/2016 11:59:59 PM     56.7     125.0021  NA 21.45
## 5 2873212765      05/12/2016 23:59     57.3     126.3249  NA 21.69
## 6 4319703577 4/17/2016 11:59:59 PM     72.4     159.6147  25 27.45
##   IsManualReport       LogId
## 1           TRUE 1.46223e+12
## 2           TRUE 1.46232e+12
## 3          FALSE 1.46051e+12
## 4           TRUE 1.46128e+12
## 5           TRUE 1.46310e+12
## 6           TRUE 1.46094e+12

Conversion of the timestamps to date-time format, then splitting each timestamp into date and time

For Intensities

Intensities$ActivityHour=as.POSIXct(Intensities$ActivityHour, format="%m/%d/%Y %I:%M:%S %p", tz=Sys.timezone())
Intensities$time <- format(Intensities$ActivityHour, format = "%H:%M:%S")
Intensities$date <- format(Intensities$ActivityHour, format = "%m/%d/%y")
head(Intensities)
##           Id        ActivityHour TotalIntensity AverageIntensity     time
## 1 1503960366 2016-04-12 00:00:00             20         0.333333 00:00:00
## 2 1503960366 2016-04-12 01:00:00              8         0.133333 01:00:00
## 3 1503960366 2016-04-12 02:00:00              7         0.116667 02:00:00
## 4 1503960366 2016-04-12 03:00:00              0         0.000000 03:00:00
## 5 1503960366 2016-04-12 04:00:00              0         0.000000 04:00:00
## 6 1503960366 2016-04-12 05:00:00              0         0.000000 05:00:00
##       date
## 1 04/12/16
## 2 04/12/16
## 3 04/12/16
## 4 04/12/16
## 5 04/12/16
## 6 04/12/16

Calories

Calories$ActivityHour=as.POSIXct(Calories$ActivityHour, format="%m/%d/%Y %I:%M:%S %p", tz=Sys.timezone())
Calories$time <- format(Calories$ActivityHour, format = "%H:%M:%S")
Calories$date <- format(Calories$ActivityHour, format = "%m/%d/%y")
head(Calories)
##           Id        ActivityHour Calories     time     date
## 1 1503960366 2016-04-12 00:00:00       81 00:00:00 04/12/16
## 2 1503960366 2016-04-12 01:00:00       61 01:00:00 04/12/16
## 3 1503960366 2016-04-12 02:00:00       59 02:00:00 04/12/16
## 4 1503960366 2016-04-12 03:00:00       47 03:00:00 04/12/16
## 5 1503960366 2016-04-12 04:00:00       48 04:00:00 04/12/16
## 6 1503960366 2016-04-12 05:00:00       48 05:00:00 04/12/16

Activity

Activity$ActivityDate=as.POSIXct(Activity$ActivityDate, format="%m/%d/%Y", tz=Sys.timezone())
Activity$date <- format(Activity$ActivityDate, format = "%m/%d/%y")
head(Activity)
##           Id ActivityDate TotalSteps TotalDistance TrackerDistance
## 1 1503960366   2016-04-12      13162          8.50            8.50
## 2 1503960366   2016-04-13      10735          6.97            6.97
## 3 1503960366   2016-04-14      10460          6.74            6.74
## 4 1503960366   2016-04-15       9762          6.28            6.28
## 5 1503960366   2016-04-16      12669          8.16            8.16
## 6 1503960366   2016-04-17       9705          6.48            6.48
##   LoggedActivitiesDistance VeryActiveDistance ModeratelyActiveDistance
## 1                        0               1.88                     0.55
## 2                        0               1.57                     0.69
## 3                        0               2.44                     0.40
## 4                        0               2.14                     1.26
## 5                        0               2.71                     0.41
## 6                        0               3.19                     0.78
##   LightActiveDistance SedentaryActiveDistance VeryActiveMinutes
## 1                6.06                       0                25
## 2                4.71                       0                21
## 3                3.91                       0                30
## 4                2.83                       0                29
## 5                5.04                       0                36
## 6                2.51                       0                38
##   FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes Calories     date
## 1                  13                  328              728     1985 04/12/16
## 2                  19                  217              776     1797 04/13/16
## 3                  11                  181             1218     1776 04/14/16
## 4                  34                  209              726     1745 04/15/16
## 5                  10                  221              773     1863 04/16/16
## 6                  20                  164              539     1728 04/17/16
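The cleaning notes say that most records fall between Tuesday and Thursday. One way such a claim could be checked, once ActivityDate is converted, is to tabulate records by weekday. The sketch below uses only the six dates visible in the head(Activity) output, so the counts it produces say nothing about the full data set; the variable names are assumptions for illustration.

```r
# The first six ActivityDate values shown in head(Activity)
activity_dates <- as.Date(c("2016-04-12", "2016-04-13", "2016-04-14",
                            "2016-04-15", "2016-04-16", "2016-04-17"))

# "%u" gives the ISO weekday number (1 = Monday ... 7 = Sunday),
# which avoids locale-dependent weekday names
weekday_num <- format(activity_dates, "%u")

# Tabulate how many records fall on each weekday
weekday_counts <- table(weekday_num)
weekday_counts
```

Applied to the whole Activity table, the same two lines would show whether the Tuesday-to-Thursday concentration holds.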

Sleep

Sleep$SleepDay=as.POSIXct(Sleep$SleepDay, format="%m/%d/%Y %I:%M:%S %p", tz=Sys.timezone())
Sleep$date <- format(Sleep$SleepDay, format = "%m/%d/%y")
head(Sleep)
##           Id   SleepDay TotalSleepRecords TotalMinutesAsleep TotalTimeInBed
## 1 1503960366 2016-04-12                 1                327            346
## 2 1503960366 2016-04-13                 2                384            407
## 3 1503960366 2016-04-15                 1                412            442
## 4 1503960366 2016-04-16                 2                340            367
## 5 1503960366 2016-04-17                 1                700            712
## 6 1503960366 2016-04-19                 1                304            320
##       date
## 1 04/12/16
## 2 04/13/16
## 3 04/15/16
## 4 04/16/16
## 5 04/17/16
## 6 04/19/16
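The Weight table is not converted above, and its Date column mixes two layouts in the head(Weight) output: a 12-hour form with seconds and AM/PM, and a 24-hour form without seconds. A possible approach, sketched here with a hypothetical helper named parse_mixed and an assumed time zone of UTC, is to try one format and fall back to the other where parsing returns NA. Like the conversions above, it assumes a locale in which %p recognizes AM and PM.

```r
# Two of the Date strings shown in head(Weight), one per layout
weight_dates <- c("05/02/2016 23:59", "4/13/2016 1:08:52 AM")

# Hypothetical helper: try the 12-hour format first, then fall back to 24-hour
parse_mixed <- function(x, tz = "UTC") {
  parsed   <- as.POSIXct(x, format = "%m/%d/%Y %I:%M:%S %p", tz = tz)
  fallback <- as.POSIXct(x, format = "%m/%d/%Y %H:%M", tz = tz)
  # Fill only the entries the first format could not parse
  parsed[is.na(parsed)] <- fallback[is.na(parsed)]
  parsed
}

weight_parsed <- parse_mixed(weight_dates)
format(weight_parsed, "%Y-%m-%d %H:%M:%S")
```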

Checking the number of unique user IDs in each data set

n_distinct(Activity$Id)
## [1] 33
n_distinct(Calories$Id)
## [1] 33
n_distinct(Intensities$Id)
## [1] 33
n_distinct(Sleep$Id)
## [1] 24
n_distinct(Weight$Id)
## [1] 8

The output above tells us the number of participants in each data set.

The data contains 33 users in daily activity, 24 in sleep, and only 8 in weight. That is 3 more users than the thirty described, and not every user recorded sleep or weight data. Eight participants in the weight data set are too few to support any recommendations or conclusions based on that data.
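To see which users are missing from one table, the Id columns can be compared directly. This is a minimal base R sketch with made-up Id vectors standing in for Activity$Id and Sleep$Id; the numbers are illustrative only.

```r
# Hypothetical Id vectors standing in for Activity$Id and Sleep$Id
activity_ids <- c(101, 101, 102, 103, 104, 104)
sleep_ids    <- c(101, 103, 103)

# Number of distinct users in each set (base R equivalent of n_distinct())
n_activity <- length(unique(activity_ids))
n_sleep    <- length(unique(sleep_ids))

# Users with activity records but no sleep records
missing_sleep <- setdiff(activity_ids, sleep_ids)
missing_sleep
```

With the real tables, setdiff(Activity$Id, Sleep$Id) would list the users who tracked activity but never logged sleep.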

Summary of Each Data Set

Activity

Activity %>% 
  select(TotalSteps,
         TotalDistance,
         SedentaryMinutes, 
         Calories) %>%
  summary()
##    TotalSteps    TotalDistance    SedentaryMinutes    Calories   
##  Min.   :    0   Min.   : 0.000   Min.   :   0.0   Min.   :   0  
##  1st Qu.: 3790   1st Qu.: 2.620   1st Qu.: 729.8   1st Qu.:1828  
##  Median : 7406   Median : 5.245   Median :1057.5   Median :2134  
##  Mean   : 7638   Mean   : 5.490   Mean   : 991.2   Mean   :2304  
##  3rd Qu.:10727   3rd Qu.: 7.713   3rd Qu.:1229.5   3rd Qu.:2793  
##  Max.   :36019   Max.   :28.030   Max.   :1440.0   Max.   :4900

A summary of activity analysis

Checking min, max, mean, median and any outliers

The average total is 7,638 steps per day, which is a little low in terms of health benefits for adults who are sedentary or lightly active, according to research conducted by researchers from Kyoto University and the University of California. The study found that walking 8,000 steps just once or twice a week can significantly improve health, including lowering the risk of early death. Taking 10,000 steps per day has been associated with good health and a reduced risk of chronic disease.

The users’ average sedentary time is 991 minutes, or about 16.5 hours (991 / 60 ≈ 16.5). This definitely needs to be reduced!
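The unit conversion behind the sedentary figure can be written out explicitly. The sketch below takes the mean of 991.2 minutes from the summary() output and expresses it in hours and as a share of a 24-hour day; the variable names are assumptions for illustration.

```r
# Mean sedentary minutes per day, from the summary() output above
mean_sedentary_min <- 991.2

sedentary_hours <- mean_sedentary_min / 60         # minutes to hours
share_of_day    <- mean_sedentary_min / (24 * 60)  # fraction of a 1,440-minute day

round(sedentary_hours, 1)      # about 16.5 hours
round(100 * share_of_day, 1)   # about 68.8 percent of the day
```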

Exploring the number of active minutes per category

#Activity
Activity %>%
  select(VeryActiveMinutes, FairlyActiveMinutes, LightlyActiveMinutes) %>%
  summary()
##  VeryActiveMinutes FairlyActiveMinutes LightlyActiveMinutes
##  Min.   :  0.00    Min.   :  0.00      Min.   :  0.0       
##  1st Qu.:  0.00    1st Qu.:  0.00      1st Qu.:127.0       
##  Median :  4.00    Median :  6.00      Median :199.0       
##  Mean   : 21.16    Mean   : 13.56      Mean   :192.8       
##  3rd Qu.: 32.00    3rd Qu.: 19.00      3rd Qu.:264.0       
##  Max.   :210.00    Max.   :143.00      Max.   :518.0

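The data set shows that users are lightly active for about 3 hours a day on average (192.8 minutes), but fairly or very active for only about half an hour combined (13.56 + 21.16 ≈ 35 minutes). The majority of the participants are lightly active.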

Calories

Calories %>%
  select(Calories) %>%
  summary()
##     Calories     
##  Min.   : 42.00  
##  1st Qu.: 63.00  
##  Median : 83.00  
##  Mean   : 97.39  
##  3rd Qu.:108.00  
##  Max.   :948.00

Sleep

Sleep %>%
  select(TotalSleepRecords, TotalMinutesAsleep, TotalTimeInBed) %>%
  summary()
##  TotalSleepRecords TotalMinutesAsleep TotalTimeInBed 
##  Min.   :1.000     Min.   : 58.0      Min.   : 61.0  
##  1st Qu.:1.000     1st Qu.:361.0      1st Qu.:403.0  
##  Median :1.000     Median :433.0      Median :463.0  
##  Mean   :1.119     Mean   :419.5      Mean   :458.6  
##  3rd Qu.:1.000     3rd Qu.:490.0      3rd Qu.:526.0  
##  Max.   :3.000     Max.   :796.0      Max.   :961.0

Participants slept about 7 hours per night on average (419.5 minutes) and spent about 7.6 hours in bed (458.6 minutes). The sleep figure sits at the lower end of the 7 to 9 hours of sleep that the National Sleep Foundation (2015) recommends for adults to promote optimal health and functioning, so there is room for improvement.
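The two sleep means from the summary() output can also be combined into a rough efficiency figure. This is a sketch under one assumption worth stating: it divides the mean minutes asleep by the mean minutes in bed, which is a ratio of averages and not the average of each night's ratio. The variable names are illustrative.

```r
# Means taken from the summary() output above
mean_minutes_asleep <- 419.5
mean_minutes_in_bed <- 458.6

hours_asleep <- mean_minutes_asleep / 60   # about 7.0 hours
hours_in_bed <- mean_minutes_in_bed / 60   # about 7.6 hours

# Share of time in bed spent asleep (ratio of the two means)
sleep_efficiency <- mean_minutes_asleep / mean_minutes_in_bed

# Average minutes spent in bed but awake
awake_in_bed <- mean_minutes_in_bed - mean_minutes_asleep

round(100 * sleep_efficiency, 1)   # about 91.5 percent
round(awake_in_bed, 1)             # about 39.1 minutes
```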

Weight

Weight %>%
  select(WeightKg, BMI) %>%
  summary()
##     WeightKg           BMI       
##  Min.   : 52.60   Min.   :21.45  
##  1st Qu.: 61.40   1st Qu.:23.96  
##  Median : 62.50   Median :24.39  
##  Mean   : 72.04   Mean   :25.19  
##  3rd Qu.: 85.05   3rd Qu.:25.56  
##  Max.   :133.50   Max.   :47.54

The average weight is about 159 pounds (72.04 kg), with an average BMI of about 25.

Before I start to visualize the data, I need to merge two data sets. I’m going to merge (inner join) Activity and Sleep on the columns Id and date (the date column I created earlier when converting the timestamps to date-time format).

Merging of the Data Set

merged_data <- merge(Sleep, Activity, by=c('Id', 'date'))
head(merged_data)
##           Id     date   SleepDay TotalSleepRecords TotalMinutesAsleep
## 1 1503960366 04/12/16 2016-04-12                 1                327
## 2 1503960366 04/13/16 2016-04-13                 2                384
## 3 1503960366 04/15/16 2016-04-15                 1                412
## 4 1503960366 04/16/16 2016-04-16                 2                340
## 5 1503960366 04/17/16 2016-04-17                 1                700
## 6 1503960366 04/19/16 2016-04-19                 1                304
##   TotalTimeInBed ActivityDate TotalSteps TotalDistance TrackerDistance
## 1            346   2016-04-12      13162          8.50            8.50
## 2            407   2016-04-13      10735          6.97            6.97
## 3            442   2016-04-15       9762          6.28            6.28
## 4            367   2016-04-16      12669          8.16            8.16
## 5            712   2016-04-17       9705          6.48            6.48
## 6            320   2016-04-19      15506          9.88            9.88
##   LoggedActivitiesDistance VeryActiveDistance ModeratelyActiveDistance
## 1                        0               1.88                     0.55
## 2                        0               1.57                     0.69
## 3                        0               2.14                     1.26
## 4                        0               2.71                     0.41
## 5                        0               3.19                     0.78
## 6                        0               3.53                     1.32
##   LightActiveDistance SedentaryActiveDistance VeryActiveMinutes
## 1                6.06                       0                25
## 2                4.71                       0                21
## 3                2.83                       0                29
## 4                5.04                       0                36
## 5                2.51                       0                38
## 6                5.03                       0                50
##   FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes Calories
## 1                  13                  328              728     1985
## 2                  19                  217              776     1797
## 3                  34                  209              726     1745
## 4                  10                  221              773     1863
## 5                  20                  164              539     1728
## 6                  31                  264              775     2035
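Because merge() performs an inner join by default, any day that appears in only one of the two tables is dropped from merged_data. The sketch below illustrates this on two made-up tables (toy_sleep and toy_activity are hypothetical, with invented values) and contrasts it with a left join using all.x = TRUE.

```r
# Hypothetical toy tables sharing the Id and date key columns
toy_sleep <- data.frame(
  Id = c(1, 1, 2),
  date = c("04/12/16", "04/13/16", "04/12/16"),
  TotalMinutesAsleep = c(327, 384, 412)
)
toy_activity <- data.frame(
  Id = c(1, 1, 2, 3),
  date = c("04/12/16", "04/14/16", "04/12/16", "04/12/16"),
  TotalSteps = c(13162, 10460, 9762, 8000)
)

# Inner join: keeps only Id/date pairs present in both tables
inner <- merge(toy_sleep, toy_activity, by = c("Id", "date"))

# Left join: keeps every sleep row, filling missing activity with NA
left <- merge(toy_sleep, toy_activity, by = c("Id", "date"), all.x = TRUE)

nrow(inner)   # 2 matching rows
nrow(left)    # 3 rows, one with NA in TotalSteps
```

Comparing nrow(Sleep) with nrow(merged_data) in the real analysis would show how many sleep records had no matching activity day.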

Data Visualization

Activity

ggplot(data=Activity, aes(x=TotalSteps, y=Calories)) +
  geom_point(aes(color=SedentaryMinutes)) +  # color only the points, so the smoother keeps a single group
  stat_smooth(method=lm) +
  scale_color_gradient(low="steelblue", high="yellow") +  # joined to the plot with +, so the gradient is applied
  labs(title="Total Steps vs. Calories", caption="Source: Kaggle.com/datasets/arashnic/fitbit")
## `geom_smooth()` using formula = 'y ~ x'

There is a positive correlation between Total Steps and Calories, which is expected: the more active we are, the more calories we burn.
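The strength of this relationship can be put into a number with cor(). As a sketch, the snippet below uses only the six rows visible in the head(Activity) output, all from a single user, so the result is illustrative and not a statistic for the whole data set. On the full table the call would be cor(Activity$TotalSteps, Activity$Calories).

```r
# TotalSteps and Calories from the six rows shown in head(Activity)
steps    <- c(13162, 10735, 10460, 9762, 12669, 9705)
calories <- c(1985, 1797, 1776, 1745, 1863, 1728)

# Pearson correlation measures the linear association
pearson_r <- cor(steps, calories)

# Spearman correlation measures the rank (monotonic) association
spearman_rho <- cor(steps, calories, method = "spearman")

pearson_r
spearman_rho   # 1 for these six rows, because the two rankings are identical
```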

Sleep

ggplot(data=Sleep, aes(x=TotalMinutesAsleep, y=TotalTimeInBed, color=TotalMinutesAsleep)) +
  geom_point() +
  scale_color_gradient(low='yellow', high='blue') +  # joined to the plot with +, so the gradient is applied
  labs(title="Total Minutes Asleep vs. Total Time in Bed", caption="Source: Kaggle.com/datasets/arashnic/fitbit")

The relationship between Total Minutes Asleep and Total Time in Bed looks linear. So if users want to improve their sleep, Bellabeat should consider using notifications that remind them to go to bed.

Intensities

int_new <- Intensities %>%
  group_by(time) %>%
  drop_na() %>%
  summarise(mean_total_int = mean(TotalIntensity))

# geom_col() draws bars from the precomputed means
ggplot(data=int_new, aes(x=time, y=mean_total_int)) +
  geom_col(fill='steelblue') +
  theme(axis.text.x = element_text(angle = 90)) +
  labs(title="Average Total Intensity vs. Time", caption="Source: Kaggle.com/datasets/arashnic/fitbit")

After visualizing the average total intensity by hour, I found that people are more active between 5 am and 10 pm.

Most activity happens between 5 pm and 7 pm. Bellabeat can use this window in the app to remind and motivate users to go for a run or walk.
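The peak hour can also be read off numerically instead of from the chart. A minimal base R sketch, using a made-up table (toy_intensity and its values are hypothetical) with the same time and TotalIntensity columns as above:

```r
# Hypothetical toy data: two readings for each of three hours
toy_intensity <- data.frame(
  time = c("06:00:00", "06:00:00", "12:00:00", "12:00:00", "18:00:00", "18:00:00"),
  TotalIntensity = c(5, 7, 14, 18, 20, 24)
)

# Mean intensity for each hour of the day
mean_by_hour <- tapply(toy_intensity$TotalIntensity, toy_intensity$time, mean)

# Hour with the highest mean intensity
peak_hour <- names(mean_by_hour)[which.max(mean_by_hour)]
peak_hour
```

On the real summary, int_new$time[which.max(int_new$mean_total_int)] would return the busiest hour directly.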

Relationship between Total Minutes Asleep and Sedentary Minutes

ggplot(data=merged_data, aes(x=TotalMinutesAsleep, y=SedentaryMinutes)) + 
    geom_point(color='steelblue') + geom_smooth()+
    labs(title="Minutes Asleep vs. Sedentary Minutes", caption = "Source: Kaggle.com/datasets/arashnic/fitbit")
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

Here we can clearly see the negative relationship between Sedentary Minutes and Total Minutes Asleep.

If Bellabeat users want to improve their sleep, the Bellabeat app can recommend reducing sedentary time.

Keep in mind that we need to support these insights with more data, because correlation does not imply causation.

Conclusion

The Bellabeat app is not just another fitness activity app. It’s a guide (a friend) that empowers women to balance a full personal and professional life with healthy habits and routines by educating and motivating them through daily use of the app.

Recommendation

  1. While an average of 7,638 daily steps is a great start, aiming for 8,000 steps per day could significantly improve Bellabeat users’ health outcomes, according to recent research. A study found that taking 8,000 steps per day was associated with a 51% lower risk of all-cause mortality (death from all causes), and taking 12,000 steps per day with a 65% lower risk, compared with taking 4,000 steps.

  2. If users want to lose weight, it’s probably a good idea to control daily calorie consumption. Bellabeat can suggest some ideas for low-calorie meals, especially at lunch and dinner.

  3. If users want to improve their sleep pattern, Bellabeat should consider using app notifications that remind them to go to bed.

  4. Most activity happens between 5 pm and 7 pm. I suppose that people go to a gym or for a walk after finishing work. Bellabeat can use this time to remind and motivate users to go for a run or walk.

  5. If users want to improve their sleep, the Bellabeat app can recommend reducing sedentary time.

This is my first project working in R, so I would appreciate any comments and recommendations for improvement!

Thank you for your interest in my Bellabeat Case Study!