Bellabeat, a high-tech manufacturer of health-focused products for women. Bellabeat is a successful small company. Urška Sršen, cofounder and Chief Creative Officer of Bellabeat, believes that analyzing smart device fitness data could help unlock new growth opportunities for the company.
https://www.kaggle.com/arashnic/fitbit The dataset was generated by respondents to a distributed survey via Amazon Mechanical Turk between 03.12.2016 and 05.12.2016. And include 18 CSV files.
Installing packages like ‘tidyverse’, ‘ggplot2’, ‘lubridate’, ‘dplyr’, ‘tidyr’, ‘here’, ‘skimr’, ‘janitor’ that will help in cleaning, analyzing and plotting our data.
# Loading packages :
library(tidyverse)
library(lubridate)
library(dplyr)
library(ggplot2)
library(tidyr)
library(here)
library(skimr)
library(janitor)
#daily_activity
daily_activity<- read.csv("C:/Users/saksh/Desktop/case study 2/mturkfitbit_export_4.12.16-5.12.16/Fitabase Data 4.12.16-5.12.16/dailyActivity_merged.csv")
head(daily_activity)
## Id ActivityDate TotalSteps TotalDistance TrackerDistance
## 1 1503960366 4/12/2016 13162 8.50 8.50
## 2 1503960366 4/13/2016 10735 6.97 6.97
## 3 1503960366 4/14/2016 10460 6.74 6.74
## 4 1503960366 4/15/2016 9762 6.28 6.28
## 5 1503960366 4/16/2016 12669 8.16 8.16
## 6 1503960366 4/17/2016 9705 6.48 6.48
## LoggedActivitiesDistance VeryActiveDistance ModeratelyActiveDistance
## 1 0 1.88 0.55
## 2 0 1.57 0.69
## 3 0 2.44 0.40
## 4 0 2.14 1.26
## 5 0 2.71 0.41
## 6 0 3.19 0.78
## LightActiveDistance SedentaryActiveDistance VeryActiveMinutes
## 1 6.06 0 25
## 2 4.71 0 21
## 3 3.91 0 30
## 4 2.83 0 29
## 5 5.04 0 36
## 6 2.51 0 38
## FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes Calories
## 1 13 328 728 1985
## 2 19 217 776 1797
## 3 11 181 1218 1776
## 4 34 209 726 1745
## 5 10 221 773 1863
## 6 20 164 539 1728
#calories
calories<-read.csv("C:/Users/saksh/Desktop/case study 2/mturkfitbit_export_4.12.16-5.12.16/Fitabase Data 4.12.16-5.12.16/dailyCalories_merged.csv")
head(calories)
## Id ActivityDay Calories
## 1 1503960366 4/12/2016 1985
## 2 1503960366 4/13/2016 1797
## 3 1503960366 4/14/2016 1776
## 4 1503960366 4/15/2016 1745
## 5 1503960366 4/16/2016 1863
## 6 1503960366 4/17/2016 1728
#intensities
intensities<- read.csv("C:/Users/saksh/Desktop/case study 2/mturkfitbit_export_4.12.16-5.12.16/Fitabase Data 4.12.16-5.12.16/dailyIntensities_merged.csv")
head(intensities)
## Id ActivityDay SedentaryMinutes LightlyActiveMinutes
## 1 1503960366 4/12/2016 728 328
## 2 1503960366 4/13/2016 776 217
## 3 1503960366 4/14/2016 1218 181
## 4 1503960366 4/15/2016 726 209
## 5 1503960366 4/16/2016 773 221
## 6 1503960366 4/17/2016 539 164
## FairlyActiveMinutes VeryActiveMinutes SedentaryActiveDistance
## 1 13 25 0
## 2 19 21 0
## 3 11 30 0
## 4 34 29 0
## 5 10 36 0
## 6 20 38 0
## LightActiveDistance ModeratelyActiveDistance VeryActiveDistance
## 1 6.06 0.55 1.88
## 2 4.71 0.69 1.57
## 3 3.91 0.40 2.44
## 4 2.83 1.26 2.14
## 5 5.04 0.41 2.71
## 6 2.51 0.78 3.19
#heartrate
heartrate<- read.csv("C:/Users/saksh/Desktop/case study 2/mturkfitbit_export_4.12.16-5.12.16/Fitabase Data 4.12.16-5.12.16/heartrate_seconds_merged.csv")
head(heartrate)
## Id Time Value
## 1 2022484408 4/12/2016 7:21:00 AM 97
## 2 2022484408 4/12/2016 7:21:05 AM 102
## 3 2022484408 4/12/2016 7:21:10 AM 105
## 4 2022484408 4/12/2016 7:21:20 AM 103
## 5 2022484408 4/12/2016 7:21:25 AM 101
## 6 2022484408 4/12/2016 7:22:05 AM 95
#sleep
sleep <- read.csv("C:/Users/saksh/Desktop/case study 2/mturkfitbit_export_4.12.16-5.12.16/Fitabase Data 4.12.16-5.12.16/sleepDay_merged.csv")
head(sleep)
## Id SleepDay TotalSleepRecords TotalMinutesAsleep
## 1 1503960366 4/12/2016 12:00:00 AM 1 327
## 2 1503960366 4/13/2016 12:00:00 AM 2 384
## 3 1503960366 4/15/2016 12:00:00 AM 1 412
## 4 1503960366 4/16/2016 12:00:00 AM 2 340
## 5 1503960366 4/17/2016 12:00:00 AM 1 700
## 6 1503960366 4/19/2016 12:00:00 AM 1 304
## TotalTimeInBed
## 1 346
## 2 407
## 3 442
## 4 367
## 5 712
## 6 320
#weight
weight<- read.csv("C:/Users/saksh/Desktop/case study 2/mturkfitbit_export_4.12.16-5.12.16/Fitabase Data 4.12.16-5.12.16/weightLogInfo_merged.csv")
head(weight)
## Id Date WeightKg WeightPounds Fat BMI
## 1 1503960366 5/2/2016 11:59:59 PM 52.6 115.9631 22 22.65
## 2 1503960366 5/3/2016 11:59:59 PM 52.6 115.9631 NA 22.65
## 3 1927972279 4/13/2016 1:08:52 AM 133.5 294.3171 NA 47.54
## 4 2873212765 4/21/2016 11:59:59 PM 56.7 125.0021 NA 21.45
## 5 2873212765 5/12/2016 11:59:59 PM 57.3 126.3249 NA 21.69
## 6 4319703577 4/17/2016 11:59:59 PM 72.4 159.6147 25 27.45
## IsManualReport LogId
## 1 True 1.462234e+12
## 2 True 1.462320e+12
## 3 False 1.460510e+12
## 4 True 1.461283e+12
## 5 True 1.463098e+12
## 6 True 1.460938e+12
So now, we can see that everything were imported correctly.
And here some cleaning steps I followed:
1.daily_activity, calories and intensities data sets: I did not found any Spelling errors, Misfield values, Missing values, Extra and blank space and duplicated value in the data.’
2.In sleep data set there were 3 duplicates found and removed.
3.In weight data set there were too many missing values found in one column, and I decided to remove that column
there were some data type changes I did
#daily_activity
daily_activity$ActivityDate=as.Date(daily_activity$ActivityDate, format="%m/%d/%Y", tz=Sys.timezone())
glimpse(daily_activity)
## Rows: 940
## Columns: 15
## $ Id <dbl> 1503960366, 1503960366, 1503960366, 150396036…
## $ ActivityDate <date> 2016-04-12, 2016-04-13, 2016-04-14, 2016-04-…
## $ TotalSteps <int> 13162, 10735, 10460, 9762, 12669, 9705, 13019…
## $ TotalDistance <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59, 9.8…
## $ TrackerDistance <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59, 9.8…
## $ LoggedActivitiesDistance <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ VeryActiveDistance <dbl> 1.88, 1.57, 2.44, 2.14, 2.71, 3.19, 3.25, 3.5…
## $ ModeratelyActiveDistance <dbl> 0.55, 0.69, 0.40, 1.26, 0.41, 0.78, 0.64, 1.3…
## $ LightActiveDistance <dbl> 6.06, 4.71, 3.91, 2.83, 5.04, 2.51, 4.71, 5.0…
## $ SedentaryActiveDistance <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ VeryActiveMinutes <int> 25, 21, 30, 29, 36, 38, 42, 50, 28, 19, 66, 4…
## $ FairlyActiveMinutes <int> 13, 19, 11, 34, 10, 20, 16, 31, 12, 8, 27, 21…
## $ LightlyActiveMinutes <int> 328, 217, 181, 209, 221, 164, 233, 264, 205, …
## $ SedentaryMinutes <int> 728, 776, 1218, 726, 773, 539, 1149, 775, 818…
## $ Calories <int> 1985, 1797, 1776, 1745, 1863, 1728, 1921, 203…
#intensities
intensities$ActivityDay=as.Date(intensities$ActivityDay, format="%m/%d/%Y", tz=Sys.timezone())
#sleep
sleep$SleepDay=as.POSIXct(sleep$SleepDay, format="%m/%d/%Y %I:%M:%S %p", tz=Sys.timezone())
Now data sets are cleaned and are ready to start our analysis on them.
#Finding the total number of participants in each data set
daily_activity%>%
summarise(total_participants = n_distinct(daily_activity$Id))
## total_participants
## 1 33
n_distinct(calories$Id)
## [1] 33
n_distinct(intensities$Id)
## [1] 33
n_distinct(heartrate$Id)
## [1] 14
n_distinct(sleep$Id)
## [1] 24
n_distinct(weight$Id)
## [1] 8
As, I can see the data of number of participants in Heart rate data set and Weight data set is very less so, I’ll avoid using these data sets to avoid any bias in analysis. Now, analyzing data sets to find any type of correlation in the data.
daily_activity %>%
select(TotalSteps,TotalDistance,SedentaryMinutes, Calories) %>%
summary()
## TotalSteps TotalDistance SedentaryMinutes Calories
## Min. : 0 Min. : 0.000 Min. : 0.0 Min. : 0
## 1st Qu.: 3790 1st Qu.: 2.620 1st Qu.: 729.8 1st Qu.:1828
## Median : 7406 Median : 5.245 Median :1057.5 Median :2134
## Mean : 7638 Mean : 5.490 Mean : 991.2 Mean :2304
## 3rd Qu.:10727 3rd Qu.: 7.713 3rd Qu.:1229.5 3rd Qu.:2793
## Max. :36019 Max. :28.030 Max. :1440.0 Max. :4900
intensities %>%
select(VeryActiveMinutes, FairlyActiveMinutes, LightlyActiveMinutes, SedentaryMinutes) %>%
summary()
## VeryActiveMinutes FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes
## Min. : 0.00 Min. : 0.00 Min. : 0.0 Min. : 0.0
## 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.:127.0 1st Qu.: 729.8
## Median : 4.00 Median : 6.00 Median :199.0 Median :1057.5
## Mean : 21.16 Mean : 13.56 Mean :192.8 Mean : 991.2
## 3rd Qu.: 32.00 3rd Qu.: 19.00 3rd Qu.:264.0 3rd Qu.:1229.5
## Max. :210.00 Max. :143.00 Max. :518.0 Max. :1440.0
The average sedentary time of participants seems to very high of approx 16.8 hrs and comparatively their active time is very less.
sleep%>%
select(TotalSleepRecords,TotalMinutesAsleep,TotalTimeInBed)%>%
summary()
## TotalSleepRecords TotalMinutesAsleep TotalTimeInBed
## Min. :1.000 Min. : 58.0 Min. : 61.0
## 1st Qu.:1.000 1st Qu.:361.0 1st Qu.:403.0
## Median :1.000 Median :433.0 Median :463.0
## Mean :1.119 Mean :419.5 Mean :458.6
## 3rd Qu.:1.000 3rd Qu.:490.0 3rd Qu.:526.0
## Max. :3.000 Max. :796.0 Max. :961.0
Participants sleep 1 time for an average of 7.6 hrs and also when the average of participants spends 7.6 hrs of their time on bed then the average of 6.9 hrs they spend on sleeping.
merging data:
#for most active time
active <- merge(weight, intensities, by="Id", all=TRUE)
active$time <- format(active$Date, format = "%H:%M:%S")
After looking at the data and the insights we created
-As people spends sedentary time a lot so maybe the company should create a marketing strategy focusing on the goal of making people active maybe by including a notification when they spends a lot of sedentary time to motivate them to be more active and spend more time actively. As, the data shows that the company need to market more to the customer segment with a high Sedentary time
-And also we saw that people are more likely to be active since morning till afternoon only so making something to encourage people to maybe go for an evening walk or spending there time more actively
-For customers who want to lose weight, it can be a good idea to control daily calorie consumption and take more steps. And Bellabeat can suggest some ideas for low-calorie healthy food. And also to empower the customers with knowledge about their own health and daily habits.
Thank you very much for your interest in my Bellabeat Case Study!
And I would appreciate any comments and recommendations for improvement!