Introduction

Bellabeat is a high-tech manufacturer of health-focused products for women. It is a successful small company, and Urška Sršen, cofounder and Chief Creative Officer of Bellabeat, believes that analyzing smart device fitness data could help unlock new growth opportunities for the company.

Business task

Data source:

https://www.kaggle.com/arashnic/fitbit The dataset was generated by respondents to a survey distributed via Amazon Mechanical Turk between 03.12.2016 and 05.12.2016. It includes 18 CSV files.

Working on the dataset

Prepare Phase

Installing packages such as ‘tidyverse’, ‘ggplot2’, ‘lubridate’, ‘dplyr’, ‘tidyr’, ‘here’, ‘skimr’ and ‘janitor’, which will help with cleaning, analyzing and plotting the data.
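
If some of these packages are not installed yet, a minimal sketch like the one below can list the missing ones before they are loaded. The helper name `missing_packages` is my own and not part of any package, and the `install.packages()` call is left commented out so nothing is installed by accident.

```r
# Packages used in this case study
required <- c("tidyverse", "ggplot2", "lubridate", "dplyr",
              "tidyr", "here", "skimr", "janitor")

# Hypothetical helper: return the packages that are not installed yet
missing_packages <- function(pkgs) {
  setdiff(pkgs, rownames(installed.packages()))
}

to_install <- missing_packages(required)
print(to_install)

# Uncomment to install whatever is missing
# if (length(to_install) > 0) install.packages(to_install)
```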

# Loading packages :

library(tidyverse)
library(lubridate)
library(dplyr)
library(ggplot2)
library(tidyr)
library(here)
library(skimr)
library(janitor)

Importing datasets

Importing dailyActivity_merged.csv

#daily_activity
daily_activity<- read.csv("C:/Users/saksh/Desktop/case study 2/mturkfitbit_export_4.12.16-5.12.16/Fitabase Data 4.12.16-5.12.16/dailyActivity_merged.csv")
head(daily_activity)
##           Id ActivityDate TotalSteps TotalDistance TrackerDistance
## 1 1503960366    4/12/2016      13162          8.50            8.50
## 2 1503960366    4/13/2016      10735          6.97            6.97
## 3 1503960366    4/14/2016      10460          6.74            6.74
## 4 1503960366    4/15/2016       9762          6.28            6.28
## 5 1503960366    4/16/2016      12669          8.16            8.16
## 6 1503960366    4/17/2016       9705          6.48            6.48
##   LoggedActivitiesDistance VeryActiveDistance ModeratelyActiveDistance
## 1                        0               1.88                     0.55
## 2                        0               1.57                     0.69
## 3                        0               2.44                     0.40
## 4                        0               2.14                     1.26
## 5                        0               2.71                     0.41
## 6                        0               3.19                     0.78
##   LightActiveDistance SedentaryActiveDistance VeryActiveMinutes
## 1                6.06                       0                25
## 2                4.71                       0                21
## 3                3.91                       0                30
## 4                2.83                       0                29
## 5                5.04                       0                36
## 6                2.51                       0                38
##   FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes Calories
## 1                  13                  328              728     1985
## 2                  19                  217              776     1797
## 3                  11                  181             1218     1776
## 4                  34                  209              726     1745
## 5                  10                  221              773     1863
## 6                  20                  164              539     1728

Importing dailyCalories_merged.csv

#calories
calories<-read.csv("C:/Users/saksh/Desktop/case study 2/mturkfitbit_export_4.12.16-5.12.16/Fitabase Data 4.12.16-5.12.16/dailyCalories_merged.csv")
head(calories)
##           Id ActivityDay Calories
## 1 1503960366   4/12/2016     1985
## 2 1503960366   4/13/2016     1797
## 3 1503960366   4/14/2016     1776
## 4 1503960366   4/15/2016     1745
## 5 1503960366   4/16/2016     1863
## 6 1503960366   4/17/2016     1728

Importing dailyIntensities_merged.csv

#intensities
intensities<- read.csv("C:/Users/saksh/Desktop/case study 2/mturkfitbit_export_4.12.16-5.12.16/Fitabase Data 4.12.16-5.12.16/dailyIntensities_merged.csv")
head(intensities)
##           Id ActivityDay SedentaryMinutes LightlyActiveMinutes
## 1 1503960366   4/12/2016              728                  328
## 2 1503960366   4/13/2016              776                  217
## 3 1503960366   4/14/2016             1218                  181
## 4 1503960366   4/15/2016              726                  209
## 5 1503960366   4/16/2016              773                  221
## 6 1503960366   4/17/2016              539                  164
##   FairlyActiveMinutes VeryActiveMinutes SedentaryActiveDistance
## 1                  13                25                       0
## 2                  19                21                       0
## 3                  11                30                       0
## 4                  34                29                       0
## 5                  10                36                       0
## 6                  20                38                       0
##   LightActiveDistance ModeratelyActiveDistance VeryActiveDistance
## 1                6.06                     0.55               1.88
## 2                4.71                     0.69               1.57
## 3                3.91                     0.40               2.44
## 4                2.83                     1.26               2.14
## 5                5.04                     0.41               2.71
## 6                2.51                     0.78               3.19

Importing heartrate_seconds_merged.csv

#heartrate
heartrate<- read.csv("C:/Users/saksh/Desktop/case study 2/mturkfitbit_export_4.12.16-5.12.16/Fitabase Data 4.12.16-5.12.16/heartrate_seconds_merged.csv")
head(heartrate)
##           Id                 Time Value
## 1 2022484408 4/12/2016 7:21:00 AM    97
## 2 2022484408 4/12/2016 7:21:05 AM   102
## 3 2022484408 4/12/2016 7:21:10 AM   105
## 4 2022484408 4/12/2016 7:21:20 AM   103
## 5 2022484408 4/12/2016 7:21:25 AM   101
## 6 2022484408 4/12/2016 7:22:05 AM    95

Importing sleepDay_merged.csv

#sleep
sleep <- read.csv("C:/Users/saksh/Desktop/case study 2/mturkfitbit_export_4.12.16-5.12.16/Fitabase Data 4.12.16-5.12.16/sleepDay_merged.csv")
head(sleep)
##           Id              SleepDay TotalSleepRecords TotalMinutesAsleep
## 1 1503960366 4/12/2016 12:00:00 AM                 1                327
## 2 1503960366 4/13/2016 12:00:00 AM                 2                384
## 3 1503960366 4/15/2016 12:00:00 AM                 1                412
## 4 1503960366 4/16/2016 12:00:00 AM                 2                340
## 5 1503960366 4/17/2016 12:00:00 AM                 1                700
## 6 1503960366 4/19/2016 12:00:00 AM                 1                304
##   TotalTimeInBed
## 1            346
## 2            407
## 3            442
## 4            367
## 5            712
## 6            320

Importing weightLogInfo_merged.csv

#weight
weight<- read.csv("C:/Users/saksh/Desktop/case study 2/mturkfitbit_export_4.12.16-5.12.16/Fitabase Data 4.12.16-5.12.16/weightLogInfo_merged.csv")
head(weight)
##           Id                  Date WeightKg WeightPounds Fat   BMI
## 1 1503960366  5/2/2016 11:59:59 PM     52.6     115.9631  22 22.65
## 2 1503960366  5/3/2016 11:59:59 PM     52.6     115.9631  NA 22.65
## 3 1927972279  4/13/2016 1:08:52 AM    133.5     294.3171  NA 47.54
## 4 2873212765 4/21/2016 11:59:59 PM     56.7     125.0021  NA 21.45
## 5 2873212765 5/12/2016 11:59:59 PM     57.3     126.3249  NA 21.69
## 6 4319703577 4/17/2016 11:59:59 PM     72.4     159.6147  25 27.45
##   IsManualReport        LogId
## 1           True 1.462234e+12
## 2           True 1.462320e+12
## 3          False 1.460510e+12
## 4           True 1.461283e+12
## 5           True 1.463098e+12
## 6           True 1.460938e+12

So now we can see that everything was imported correctly.
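
The imports above repeat the same long folder path for every file. A possible way to avoid that is a small wrapper that builds the path once. The sketch below uses a hypothetical helper, `read_fitbit`, and a temporary folder with a tiny made-up file so that it runs anywhere.

```r
# Hypothetical helper: join the folder and file name, then read the CSV
read_fitbit <- function(data_dir, file_name) {
  read.csv(file.path(data_dir, file_name))
}

# Demo with a temporary folder and a two-row stand-in file
demo_dir <- tempdir()
write.csv(data.frame(Id = c(1503960366, 1503960366),
                     ActivityDay = c("4/12/2016", "4/13/2016"),
                     Calories = c(1985, 1797)),
          file.path(demo_dir, "dailyCalories_merged.csv"),
          row.names = FALSE)

calories_demo <- read_fitbit(demo_dir, "dailyCalories_merged.csv")
head(calories_demo)
```

In the real analysis, `data_dir` would be the folder that holds the Fitabase CSV files.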

Process Phase

Here are the cleaning steps I followed:

1. daily_activity, calories and intensities data sets: I did not find any spelling errors, misfielded values, missing values, extra or blank spaces, or duplicated values in the data.

2. In the sleep data set, 3 duplicates were found and removed.

3. In the weight data set, one column had too many missing values, so I decided to remove that column.
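
The checks in the list above can be sketched in base R as follows. The data frames are toy stand-ins with illustrative values, and the 50% cut-off for dropping a column is my own assumption, not a rule taken from the analysis.

```r
# Toy data in the shape of the sleep data set (values are illustrative)
sleep_demo <- data.frame(
  Id = c(1, 1, 1, 2),
  SleepDay = c("4/12/2016 12:00:00 AM", "4/13/2016 12:00:00 AM",
               "4/13/2016 12:00:00 AM", "4/12/2016 12:00:00 AM"),
  TotalMinutesAsleep = c(327, 384, 384, 412)
)

# Count fully duplicated rows, then keep only the first copy of each
n_duplicates <- sum(duplicated(sleep_demo))
sleep_clean <- sleep_demo[!duplicated(sleep_demo), ]

# Toy data in the shape of the weight data set
weight_demo <- data.frame(Id = c(1, 1, 2),
                          WeightKg = c(52.6, 52.6, 133.5),
                          Fat = c(22, NA, NA))

# Count missing values per column and drop columns that are mostly missing
# (the 0.5 threshold is an assumption made for this sketch)
na_counts <- colSums(is.na(weight_demo))
mostly_missing <- names(na_counts)[na_counts / nrow(weight_demo) > 0.5]
weight_clean <- weight_demo[, !(names(weight_demo) %in% mostly_missing), drop = FALSE]
```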

I also changed some data types:

#daily_activity
daily_activity$ActivityDate=as.Date(daily_activity$ActivityDate, format="%m/%d/%Y", tz=Sys.timezone())
glimpse(daily_activity)
## Rows: 940
## Columns: 15
## $ Id                       <dbl> 1503960366, 1503960366, 1503960366, 150396036…
## $ ActivityDate             <date> 2016-04-12, 2016-04-13, 2016-04-14, 2016-04-…
## $ TotalSteps               <int> 13162, 10735, 10460, 9762, 12669, 9705, 13019…
## $ TotalDistance            <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59, 9.8…
## $ TrackerDistance          <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59, 9.8…
## $ LoggedActivitiesDistance <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ VeryActiveDistance       <dbl> 1.88, 1.57, 2.44, 2.14, 2.71, 3.19, 3.25, 3.5…
## $ ModeratelyActiveDistance <dbl> 0.55, 0.69, 0.40, 1.26, 0.41, 0.78, 0.64, 1.3…
## $ LightActiveDistance      <dbl> 6.06, 4.71, 3.91, 2.83, 5.04, 2.51, 4.71, 5.0…
## $ SedentaryActiveDistance  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ VeryActiveMinutes        <int> 25, 21, 30, 29, 36, 38, 42, 50, 28, 19, 66, 4…
## $ FairlyActiveMinutes      <int> 13, 19, 11, 34, 10, 20, 16, 31, 12, 8, 27, 21…
## $ LightlyActiveMinutes     <int> 328, 217, 181, 209, 221, 164, 233, 264, 205, …
## $ SedentaryMinutes         <int> 728, 776, 1218, 726, 773, 539, 1149, 775, 818…
## $ Calories                 <int> 1985, 1797, 1776, 1745, 1863, 1728, 1921, 203…
#intensities
intensities$ActivityDay=as.Date(intensities$ActivityDay, format="%m/%d/%Y", tz=Sys.timezone())

#sleep
sleep$SleepDay=as.POSIXct(sleep$SleepDay, format="%m/%d/%Y %I:%M:%S %p", tz=Sys.timezone())
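
As a small illustration of the format strings used in these conversions, the sketch below parses one date and one timestamp in the same layouts. The time zone is fixed to "UTC" only to make the example reproducible, and the variable names are mine.

```r
# Daily files store dates as month/day/year text
day <- as.Date("4/12/2016", format = "%m/%d/%Y")

# Timestamps add a 12-hour clock time with an AM/PM marker
stamp <- as.POSIXct("5/2/2016 11:59:59 PM",
                    format = "%m/%d/%Y %I:%M:%S %p", tz = "UTC")

iso_day <- format(day, "%Y-%m-%d")     # ISO date text
time_24h <- format(stamp, "%H:%M:%S")  # 24-hour time of day

# Text that does not match the format becomes NA instead of raising an error,
# so it is worth counting NAs after a conversion
bad <- as.Date("not a date", format = "%m/%d/%Y")
```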

Now the data sets are cleaned and ready for analysis.

Analyze Phase

Checking whether the data sets are compatible

#Finding the total number of participants in each data set
daily_activity%>%
  summarise(total_participants = n_distinct(Id))
##   total_participants
## 1                 33
n_distinct(calories$Id)
## [1] 33
n_distinct(intensities$Id)
## [1] 33
n_distinct(heartrate$Id) 
## [1] 14
n_distinct(sleep$Id)
## [1] 24
n_distinct(weight$Id)
## [1] 8

As I can see, the number of participants in the heart rate data set (14) and the weight data set (8) is very low, so I’ll avoid using these data sets to avoid any bias in the analysis. Next, I analyze the remaining data sets to look for correlations in the data.
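
The participant counts can also be collected in one call, and the overlap between two data sets checked directly. This is a sketch with toy Id vectors standing in for the real `Id` columns.

```r
# Toy Id vectors standing in for the Id columns of two data sets
activity_ids <- c(1, 1, 2, 2, 3, 3, 4)
sleep_ids <- c(1, 1, 3, 5)

# Count distinct participants in every data set in one call
ids <- list(daily_activity = activity_ids, sleep = sleep_ids)
participants <- sapply(ids, function(x) length(unique(x)))

# Participants present in both data sets
in_both <- intersect(activity_ids, sleep_ids)
```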

Using daily_activity data set for analysis:

daily_activity %>%  
  select(TotalSteps,TotalDistance,SedentaryMinutes, Calories) %>%
  summary()
##    TotalSteps    TotalDistance    SedentaryMinutes    Calories   
##  Min.   :    0   Min.   : 0.000   Min.   :   0.0   Min.   :   0  
##  1st Qu.: 3790   1st Qu.: 2.620   1st Qu.: 729.8   1st Qu.:1828  
##  Median : 7406   Median : 5.245   Median :1057.5   Median :2134  
##  Mean   : 7638   Mean   : 5.490   Mean   : 991.2   Mean   :2304  
##  3rd Qu.:10727   3rd Qu.: 7.713   3rd Qu.:1229.5   3rd Qu.:2793  
##  Max.   :36019   Max.   :28.030   Max.   :1440.0   Max.   :4900

Using intensities data set for analysis:

intensities %>%
  select(VeryActiveMinutes, FairlyActiveMinutes, LightlyActiveMinutes, SedentaryMinutes) %>%
  summary()
##  VeryActiveMinutes FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes
##  Min.   :  0.00    Min.   :  0.00      Min.   :  0.0        Min.   :   0.0  
##  1st Qu.:  0.00    1st Qu.:  0.00      1st Qu.:127.0        1st Qu.: 729.8  
##  Median :  4.00    Median :  6.00      Median :199.0        Median :1057.5  
##  Mean   : 21.16    Mean   : 13.56      Mean   :192.8        Mean   : 991.2  
##  3rd Qu.: 32.00    3rd Qu.: 19.00      3rd Qu.:264.0        3rd Qu.:1229.5  
##  Max.   :210.00    Max.   :143.00      Max.   :518.0        Max.   :1440.0

The average sedentary time of participants seems very high at approximately 16.5 hours per day (991.2 minutes), and their active time is comparatively very low.
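
The conversion from minutes to hours can be checked with the means printed in the summary above. Only the variable names in this sketch are my own.

```r
# Means taken from the summary() output above, in minutes per day
mean_minutes <- c(very_active = 21.16, fairly_active = 13.56,
                  lightly_active = 192.8, sedentary = 991.2)

# Convert each mean to hours, rounded to one decimal place
mean_hours <- round(mean_minutes / 60, 1)

# Total of the three active categories, in hours
active_minutes <- sum(mean_minutes[c("very_active", "fairly_active", "lightly_active")])
active_hours <- round(active_minutes / 60, 1)

print(mean_hours)
print(active_hours)
```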

Using sleep data set for analysis:

sleep%>%
  select(TotalSleepRecords,TotalMinutesAsleep,TotalTimeInBed)%>%
  summary()
##  TotalSleepRecords TotalMinutesAsleep TotalTimeInBed 
##  Min.   :1.000     Min.   : 58.0      Min.   : 61.0  
##  1st Qu.:1.000     1st Qu.:361.0      1st Qu.:403.0  
##  Median :1.000     Median :433.0      Median :463.0  
##  Mean   :1.119     Mean   :419.5      Mean   :458.6  
##  3rd Qu.:1.000     3rd Qu.:490.0      3rd Qu.:526.0  
##  Max.   :3.000     Max.   :796.0      Max.   :961.0

Participants usually sleep once per day. On average they spend about 7.6 hours in bed (458.6 minutes), of which about 7.0 hours (419.5 minutes) are spent asleep.
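
A rough sketch of the same arithmetic, using the two means from the summary above. The share of time in bed spent asleep is computed here as a ratio of the two means, which is a simplification and not the mean of each night's ratio.

```r
# Means from the summary() output above, in minutes
mean_asleep <- 419.5
mean_in_bed <- 458.6

hours_asleep <- mean_asleep / 60
hours_in_bed <- mean_in_bed / 60

# Approximate share of time in bed spent asleep (ratio of means)
sleep_share <- mean_asleep / mean_in_bed

# Average minutes in bed but not asleep
awake_in_bed <- mean_in_bed - mean_asleep
```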

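Merging data:

#for most active time
# Date is read in as text, so parse it before extracting the time of day
active <- merge(weight, intensities, by="Id", all=TRUE)
active$Date <- as.POSIXct(active$Date, format="%m/%d/%Y %I:%M:%S %p", tz=Sys.timezone())
active$time <- format(active$Date, format = "%H:%M:%S")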
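
Merging by `Id` alone pairs every row of one data set with every row of the other for the same participant. As a hedged alternative sketch, and not the merge used above, two daily tables can be joined on both participant and date so that each participant-day appears once. The toy values and the shared `Date` column name are assumptions made for illustration.

```r
activity_demo <- data.frame(
  Id = c(1, 1, 2),
  Date = as.Date(c("2016-04-12", "2016-04-13", "2016-04-12")),
  TotalSteps = c(13162, 10735, 9762)
)
sleep_demo <- data.frame(
  Id = c(1, 1),
  Date = as.Date(c("2016-04-12", "2016-04-13")),
  TotalMinutesAsleep = c(327, 384)
)

# Joining on Id only: every activity row of participant 1 meets every sleep row
by_id <- merge(activity_demo, sleep_demo, by = "Id")

# Joining on Id and Date: one row per participant-day present in both tables
by_id_date <- merge(activity_demo, sleep_demo, by = c("Id", "Date"))

nrow(by_id)
nrow(by_id_date)
```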

Share Phase

Now let’s visualize some key explorations.

Relationship between Steps and Sedentary time

ggplot(daily_activity)+
  geom_point(mapping = aes(x=TotalSteps, y= SedentaryMinutes))+
  geom_smooth(mapping = aes(x=TotalSteps, y= SedentaryMinutes))+
  labs(title = "Total steps vs sedentary minutes")
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

In this graph I can see a weak negative correlation between total steps taken and sedentary time in a day. This suggests that the more steps you take in a day, the less sedentary time you are likely to spend.
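
A visual impression like this can be backed by a number using `cor()`. The sketch below uses only the first six rows shown in the `head(daily_activity)` output, so its result is illustrative. In the real analysis the full columns `daily_activity$TotalSteps` and `daily_activity$SedentaryMinutes` would be passed instead.

```r
# First six rows of daily_activity, copied from the head() output
total_steps <- c(13162, 10735, 10460, 9762, 12669, 9705)
sedentary_minutes <- c(728, 776, 1218, 726, 773, 539)

# Pearson correlation: values near -1 or +1 are strong, values near 0 are weak
r_sedentary <- cor(total_steps, sedentary_minutes)
print(r_sedentary)
```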

Relationship between minutes asleep and time in bed

ggplot(sleep)+
  geom_point(mapping= aes(x=TotalMinutesAsleep, y=TotalTimeInBed))+
  geom_smooth(mapping = aes(x=TotalMinutesAsleep , y= TotalTimeInBed))+
  labs(title = "Total time in bed vs asleep")
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

In this graph I can see a strong positive correlation between total time in bed and total time asleep in a day. This shows that the longer a person stays in bed, the more they sleep.
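
One way to describe this relationship more precisely is a simple linear fit. The sketch below uses only the six rows shown in the `head(sleep)` output, so the fitted values are illustrative and would differ on the full data set.

```r
# First six rows of sleep, copied from the head() output
minutes_asleep <- c(327, 384, 412, 340, 700, 304)
time_in_bed <- c(346, 407, 442, 367, 712, 320)

# Fit time in bed as a linear function of minutes asleep
fit <- lm(time_in_bed ~ minutes_asleep)

# Slope: extra minutes in bed per extra minute asleep
slope <- unname(coef(fit)["minutes_asleep"])

# R squared: share of the variation explained by the straight line
r_squared <- summary(fit)$r.squared
```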

Relationship between Calories and steps

ggplot(daily_activity)+
  geom_point(mapping = aes(y=Calories, x=TotalSteps))+
  geom_smooth(mapping = aes(y=Calories, x= TotalSteps))+
  labs(title = "Total steps vs Calories")
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

In this graph I can see a weak positive correlation between total steps and calories burned in a day. This suggests that if a person takes more steps in a day, they are likely to burn more calories.

Looking at intensities over time

ggplot(active)+
  geom_line(mapping = aes(x=time, y=VeryActiveMinutes), color="red")+
  theme(axis.text.x = element_text(angle=90) )+
  labs(title="Total very Active Intensity vs. Time")

As I can see, people are more likely to be active from morning until afternoon and less active during the evening and night.
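
A time-of-day pattern is easiest to check with data recorded per hour. The sketch below assumes an hourly table with a timestamp column and an intensity column. The names `ActivityHour` and `TotalIntensity` and all values are assumptions made for illustration and are not taken from the files imported above.

```r
# Toy hourly data (hypothetical column names and values)
hourly_demo <- data.frame(
  ActivityHour = c("4/12/2016 8:00:00 AM", "4/12/2016 6:00:00 PM",
                   "4/13/2016 8:00:00 AM", "4/13/2016 6:00:00 PM"),
  TotalIntensity = c(30, 10, 50, 20)
)

# Parse the timestamp and keep only the hour of day ("00" to "23")
stamp <- as.POSIXct(hourly_demo$ActivityHour,
                    format = "%m/%d/%Y %I:%M:%S %p", tz = "UTC")
hourly_demo$hour <- format(stamp, "%H")

# Average intensity for each hour of the day
by_hour <- aggregate(TotalIntensity ~ hour, data = hourly_demo, FUN = mean)
print(by_hour)
```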

Recommendations

After looking at the data and the insights we created:

- People spend a lot of time sedentary, so the company could create a marketing strategy focused on making people more active, for example by sending a notification when they have been sedentary for a long time to motivate them to move. The data suggests that the company should market more to the customer segment with high sedentary time.

- We also saw that people are more likely to be active from morning until afternoon, so the product could encourage people to go for an evening walk or otherwise spend their evenings more actively.

- For customers who want to lose weight, it can be a good idea to control daily calorie consumption and take more steps. Bellabeat could suggest ideas for low-calorie healthy food and empower customers with knowledge about their own health and daily habits.

  • On average a person takes almost 7,600 steps per day, which is below the commonly cited target of 10,000 steps per day for staying active and fit.
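
To see how often a step target is actually reached, the share of days at or above it can be computed directly. The sketch below uses only the six daily step counts shown in the `head(daily_activity)` output. On the full data set, `daily_activity$TotalSteps` would take their place.

```r
# First six daily step counts, copied from the head() output
total_steps <- c(13162, 10735, 10460, 9762, 12669, 9705)
step_target <- 10000  # target mentioned in the recommendation

# Number and share of days that reach the target
days_on_target <- sum(total_steps >= step_target)
share_on_target <- mean(total_steps >= step_target)
```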

Thank you very much for your interest in my Bellabeat Case Study!

And I would appreciate any comments and recommendations for improvement!