The business task is to analyze smart device usage data from non-Bellabeat devices to gain insights into relevant consumer trends, and to discover how these trends can inform Bellabeat's marketing strategy.
Reliability: The data is not reliable. It contains information for only 30 individuals, which is not a representative sample of all Fitbit smart device users.
Original: The data is not original. It would have been had it been provided by Fitbit itself.
Comprehensive: The data is not comprehensive. Other variables, such as gender and age, would have allowed a more accurate analysis.
Current: The data is not current. Going by the folder name (4.12.16-5.12.16), it was collected in 2016.
Cited: The data is cited. It came from Amazon Mechanical Turk.
Keeping all the above points in mind, the analysis may not be accurate, because the integrity and credibility of the data are lacking.
Loading the packages that will be used during the data cleaning and visualization process:
library(tidyverse)
library(lubridate)
library(ggplot2)
library(readr)
library(tidyr)
library(dplyr)
library(skimr)
library(janitor)
library(scales)
Now that the packages have been loaded, the next step is to import the required CSV files into R.
daily_activities <- read.csv("C:\\Users\\Dibyajyoti Das\\Desktop\\case study 2\\Fitabase Data 4.12.16-5.12.16\\dailyActivity_merged.csv")
weight <- read.csv("C:\\Users\\Dibyajyoti Das\\Desktop\\case study 2\\Fitabase Data 4.12.16-5.12.16\\weightLogInfo_merged.csv")
sleep <- read.csv("C:\\Users\\Dibyajyoti Das\\Desktop\\case study 2\\Fitabase Data 4.12.16-5.12.16\\sleepDay_merged.csv")
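The three read.csv() calls repeat the same absolute folder path. A minimal sketch of one way to avoid that is shown here, assuming a hypothetical `data_dir` variable and a hypothetical helper `read_fitbit()` (neither appears in the original). So that it runs without the real files, it first writes a tiny made-up CSV to a temporary folder.

```r
# Hypothetical stand-in for the real data folder; in the report this would be
# the "Fitabase Data 4.12.16-5.12.16" directory.
data_dir <- tempfile("fitbit_demo_")
dir.create(data_dir)

# Made-up two-row file so the sketch runs without the real dataset.
writeLines(
  c("Id,ActivityDate,TotalSteps",
    "1,4/12/2016,13162",
    "1,4/13/2016,0"),
  file.path(data_dir, "dailyActivity_merged.csv")
)

# Hypothetical helper: the folder is named once instead of in every call.
read_fitbit <- function(file_name, dir = data_dir) {
  read.csv(file.path(dir, file_name), stringsAsFactors = FALSE)
}

demo_activities <- read_fitbit("dailyActivity_merged.csv")
str(demo_activities)
```

With this layout, moving the data folder only requires changing `data_dir`.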
The dataset covers about one month of data (April 12 to May 12, 2016, going by the folder name). Also, in some cases, total steps have been recorded as zero, which may be because the user forgot to wear their smart device.
The rows that contain 0 total steps have to be removed.
daily_activities1 <- daily_activities %>%
filter(TotalSteps!=0)
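Before dropping rows it can help to count how many are affected. The sketch below uses a made-up data frame (`demo` and its values are hypothetical) and base R subsetting, which does the same job as filter(TotalSteps != 0).

```r
# Made-up activity log; the second and fourth rows have no steps recorded.
# 1440 sedentary minutes corresponds to a full 24-hour day.
demo <- data.frame(
  Id = c(1, 1, 2, 2),
  TotalSteps = c(13162, 0, 8053, 0),
  SedentaryMinutes = c(728, 1440, 1021, 1440)
)

# Count how many rows would be dropped before dropping them.
n_zero <- sum(demo$TotalSteps == 0)

# Base R equivalent of filter(TotalSteps != 0).
demo_clean <- demo[demo$TotalSteps != 0, ]

n_zero
nrow(demo_clean)
```

Comparing `n_zero` with the total row count shows how much of the dataset the filter removes.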
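Separating date and time into different columns in the weight and sleep datasets.
# extra = "merge" keeps everything after the first space (for example an AM/PM marker) in Time
weight1 <- weight %>%
  separate(Date, c("Date", "Time"), sep = " ", extra = "merge")
sleep1 <- sleep %>%
  separate(SleepDay, c("Date", "Time"), sep = " ", extra = "merge")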
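If the raw timestamps carry an AM/PM marker, they contain two spaces, so splitting on every space yields three pieces rather than two. A minimal base R sketch of splitting at the first space only is shown here. The timestamp layout and the sample values are assumptions, not taken from the real files.

```r
# Made-up timestamps in the month/day/year hour:minute:second AM/PM layout
# that this sketch assumes the raw columns use.
stamp <- c("4/12/2016 12:00:00 AM", "5/2/2016 11:59:59 PM")

# Everything before the first space is the date, everything after it the time.
date_part <- sub(" .*$", "", stamp)
time_part <- sub("^[^ ]+ ", "", stamp)

data.frame(Date = date_part, Time = time_part)
```

Splitting this way keeps the AM/PM marker attached to the time instead of losing it.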
To keep the datasets consistent and make merging them easier and cleaner, the next step is to rename the ActivityDate column in the daily_activities dataset to “Date”.
daily_activities1 <- daily_activities1 %>%
  rename(Date = ActivityDate)
Now we have a date column across all the datasets. For analysis, another column is created containing the weekdays.
weight1 <- weight1 %>%
mutate(Weekday = weekdays(as.Date(Date,"%m/%d/%Y")))
sleep1 <- sleep1 %>%
mutate(Weekday = weekdays(as.Date(Date,"%m/%d/%Y")))
daily_activities1 <- daily_activities1 %>%
mutate(Weekday = weekdays(as.Date(Date,"%m/%d/%Y")))
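One caveat: weekdays() returns day names in the language of the current system locale, while the rest of the report works with English names. The sketch below shows one way to get English names regardless of locale. It uses made-up dates, and the variable names are hypothetical.

```r
# Made-up dates in the same month/day/year text format used above.
date_text <- c("4/12/2016", "4/16/2016", "4/17/2016")
parsed <- as.Date(date_text, format = "%m/%d/%Y")

# POSIXlt stores the weekday as a number: 0 = Sunday, 1 = Monday, ..., 6 = Saturday.
wday_index <- as.POSIXlt(parsed)$wday

# Fixed English lookup, so the result does not depend on the system language.
english_days <- c("Sunday", "Monday", "Tuesday", "Wednesday",
                  "Thursday", "Friday", "Saturday")

# Order the factor Monday to Sunday so tables and charts follow the week.
weekday <- factor(
  english_days[wday_index + 1],
  levels = c("Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday")
)

weekday
```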
The datasets may contain duplicate rows, so we check the datasets for them.
sum(duplicated(sleep1))
## [1] 3
sum(duplicated(weight1))
## [1] 0
sum(duplicated(daily_activities1))
## [1] 0
It shows that the sleep1 dataset has 3 duplicate rows, which are removed.
sleep1 <- sleep1[!duplicated(sleep1), ]
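When removing duplicates, computing the duplicate flags on the same data frame that is being subset keeps the flags and the rows in step. A small sketch with made-up values (`demo_sleep` is hypothetical):

```r
# Made-up sleep log; the second row repeats the first one exactly.
demo_sleep <- data.frame(
  Id = c(1, 1, 1, 2),
  Date = c("4/12/2016", "4/12/2016", "4/13/2016", "4/12/2016"),
  TotalMinutesAsleep = c(327, 327, 384, 412)
)

# duplicated() flags the second and later copies of a row, not the first.
n_dupes <- sum(duplicated(demo_sleep))

# Keep only the rows that are not flagged.
demo_sleep_unique <- demo_sleep[!duplicated(demo_sleep), ]

n_dupes
nrow(demo_sleep_unique)
```

Running sum(duplicated(...)) again on the result is a quick way to confirm that no duplicates remain.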
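Merging all three datasets into one. The datasets are matched on Id, Date and Weekday so that each merged row describes one user on one day; matching on Id alone would pair every activity day of a user with every sleep and weight record of that same user.
combined_data <- merge(daily_activities1, sleep1, by = c("Id", "Date", "Weekday"))
combined_data_final <- merge(combined_data, weight1, by = c("Id", "Date", "Weekday"))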
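The choice of key columns in merge() determines how many rows come out. A minimal sketch with made-up values (the `demo_` data frames are hypothetical, not the real Fitbit files) compares matching on Id alone with matching on Id and Date:

```r
# Made-up data: one user, two days, in both tables.
demo_activity <- data.frame(
  Id = c(1, 1),
  Date = c("4/12/2016", "4/13/2016"),
  TotalSteps = c(13162, 10735)
)
demo_sleep <- data.frame(
  Id = c(1, 1),
  Date = c("4/12/2016", "4/13/2016"),
  TotalMinutesAsleep = c(327, 384)
)

# Matching on Id only: every activity day is combined with every sleep day.
by_id <- merge(demo_activity, demo_sleep, by = "Id")

# Matching on Id and Date: one row per user and day.
by_id_date <- merge(demo_activity, demo_sleep, by = c("Id", "Date"))

nrow(by_id)
nrow(by_id_date)
```

With two days for one user, the Id-only merge returns four rows, while the two-key merge returns two. Comparing row counts before and after a merge is a cheap check for this kind of row multiplication.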
combined_data_final$Weekday <- factor(combined_data_final$Weekday,
levels = c("Monday","Tuesday","Wednesday","Thursday",
"Friday","Saturday","Sunday"))
Now that the data has been properly cleaned and sorted, we move on to the analysis and visualization process.
summary(daily_activities1)
## Id Date TotalSteps TotalDistance
## Min. :1.504e+09 Length:863 Min. : 4 Min. : 0.00
## 1st Qu.:2.320e+09 Class :character 1st Qu.: 4923 1st Qu.: 3.37
## Median :4.445e+09 Mode :character Median : 8053 Median : 5.59
## Mean :4.858e+09 Mean : 8319 Mean : 5.98
## 3rd Qu.:6.962e+09 3rd Qu.:11092 3rd Qu.: 7.90
## Max. :8.878e+09 Max. :36019 Max. :28.03
## TrackerDistance LoggedActivitiesDistance VeryActiveDistance
## Min. : 0.000 Min. :0.0000 Min. : 0.000
## 1st Qu.: 3.370 1st Qu.:0.0000 1st Qu.: 0.000
## Median : 5.590 Median :0.0000 Median : 0.410
## Mean : 5.964 Mean :0.1178 Mean : 1.637
## 3rd Qu.: 7.880 3rd Qu.:0.0000 3rd Qu.: 2.275
## Max. :28.030 Max. :4.9421 Max. :21.920
## ModeratelyActiveDistance LightActiveDistance SedentaryActiveDistance
## Min. :0.0000 Min. : 0.000 Min. :0.00000
## 1st Qu.:0.0000 1st Qu.: 2.345 1st Qu.:0.00000
## Median :0.3100 Median : 3.580 Median :0.00000
## Mean :0.6182 Mean : 3.639 Mean :0.00175
## 3rd Qu.:0.8650 3rd Qu.: 4.895 3rd Qu.:0.00000
## Max. :6.4800 Max. :10.710 Max. :0.11000
## VeryActiveMinutes FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes
## Min. : 0.00 Min. : 0.00 Min. : 0.0 Min. : 0.0
## 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.:146.5 1st Qu.: 721.5
## Median : 7.00 Median : 8.00 Median :208.0 Median :1021.0
## Mean : 23.02 Mean : 14.78 Mean :210.0 Mean : 955.8
## 3rd Qu.: 35.00 3rd Qu.: 21.00 3rd Qu.:272.0 3rd Qu.:1189.0
## Max. :210.00 Max. :143.00 Max. :518.0 Max. :1440.0
## Calories Weekday
## Min. : 52 Length:863
## 1st Qu.:1856 Class :character
## Median :2220 Mode :character
## Mean :2361
## 3rd Qu.:2832
## Max. :4900
summary(sleep1)
## Id Date Time TotalSleepRecords
## Min. :1.504e+09 Length:410 Length:410 Min. :1.00
## 1st Qu.:3.977e+09 Class :character Class :character 1st Qu.:1.00
## Median :4.703e+09 Mode :character Mode :character Median :1.00
## Mean :4.995e+09 Mean :1.12
## 3rd Qu.:6.962e+09 3rd Qu.:1.00
## Max. :8.792e+09 Max. :3.00
## TotalMinutesAsleep TotalTimeInBed Weekday
## Min. : 58.0 Min. : 61.0 Length:410
## 1st Qu.:361.0 1st Qu.:403.8 Class :character
## Median :432.5 Median :463.0 Mode :character
## Mean :419.2 Mean :458.5
## 3rd Qu.:490.0 3rd Qu.:526.0
## Max. :796.0 Max. :961.0
summary(weight1)
## Id Date Time WeightKg
## Min. :1.504e+09 Length:67 Length:67 Min. : 52.60
## 1st Qu.:6.962e+09 Class :character Class :character 1st Qu.: 61.40
## Median :6.962e+09 Mode :character Mode :character Median : 62.50
## Mean :7.009e+09 Mean : 72.04
## 3rd Qu.:8.878e+09 3rd Qu.: 85.05
## Max. :8.878e+09 Max. :133.50
##
## WeightPounds Fat BMI IsManualReport
## Min. :116.0 Min. :22.00 Min. :21.45 Length:67
## 1st Qu.:135.4 1st Qu.:22.75 1st Qu.:23.96 Class :character
## Median :137.8 Median :23.50 Median :24.39 Mode :character
## Mean :158.8 Mean :23.50 Mean :25.19
## 3rd Qu.:187.5 3rd Qu.:24.25 3rd Qu.:25.56
## Max. :294.3 Max. :25.00 Max. :47.54
## NA's :65
## LogId Weekday
## Min. :1.460e+12 Length:67
## 1st Qu.:1.461e+12 Class :character
## Median :1.462e+12 Mode :character
## Mean :1.462e+12
## 3rd Qu.:1.462e+12
## Max. :1.463e+12
##
The median of total daily steps is 8053, with a maximum of 36019 and a minimum of 4.
The median total distance traveled is 5.59 kilometers.
The median for very active minutes is 7.0 (the mean is 23.02), for fairly active minutes 8.0, for lightly active minutes 208.0 and for sedentary minutes 1021.0.
The median for total minutes asleep is 432.5.
The median BMI is 24.39.
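summary() prints both the median and the mean, and for skewed columns the two can differ a lot: in the output above, VeryActiveMinutes has a median of 7.00 but a mean of 23.02. The sketch below reproduces that pattern with a made-up vector (the values are hypothetical, chosen only to show the effect).

```r
# Made-up daily "very active" minutes: mostly low values, one very active day.
very_active <- c(0, 0, 7, 35, 73)

# The median is the middle value; the mean is pulled up by the large value.
med <- median(very_active)
avg <- mean(very_active)

c(median = med, mean = avg)
```

For columns like this, the median is the better description of a typical day.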
Now, we present our insights through graphs and charts.
We see that users spent most of their day sedentary, while very active and fairly active minutes together make up only about 3 % of the total tracked time, based on the column means above.
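The share of each activity level can be checked from the column means printed by summary(daily_activities1) above. Because every column has the same number of rows, the ratio of means equals the ratio of totals. This is a rough check, not the code behind the original chart.

```r
# Column means copied from the summary(daily_activities1) output above.
mean_minutes <- c(
  VeryActiveMinutes = 23.02,
  FairlyActiveMinutes = 14.78,
  LightlyActiveMinutes = 210.0,
  SedentaryMinutes = 955.8
)

# Percentage of tracked minutes spent at each activity level.
share_pct <- 100 * mean_minutes / sum(mean_minutes)

round(share_pct, 1)
```

This gives roughly 79 % sedentary and 17 % lightly active time, with very active and fairly active minutes at about 1.9 % and 1.2 %.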
We find that data recording by users is not consistent throughout the week. Users record the least data on Friday and Saturday, the days leading into the weekend, while the most data is recorded on Wednesday.
Users record the most steps on Monday and Wednesday and the fewest on Friday and Saturday, which are also the days with the least recorded data according to the previous chart.
Users burn the most calories on Wednesday and the fewest on Friday and Saturday, which is consistent with the previous bar graphs.
An interesting trend in this bar chart is that users spend the most time asleep on Wednesday, which is also when they burn the most calories.
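The per-weekday figures behind bar charts like these can be computed with table() and aggregate(). The sketch below uses made-up values (`demo` is hypothetical). Whether the original charts show totals or averages is not stated; averages are used here because they are not inflated on days with more records.

```r
# Made-up daily records.
demo <- data.frame(
  Weekday = c("Monday", "Monday", "Wednesday", "Friday"),
  TotalSteps = c(9000, 11000, 12000, 6000),
  Calories = c(2300, 2500, 2700, 2000)
)

# Order the weekdays Monday to Sunday, as in the report.
demo$Weekday <- factor(
  demo$Weekday,
  levels = c("Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday")
)

# Number of records per weekday (days without records show up as 0).
records_per_day <- table(demo$Weekday)

# Average steps and calories per weekday (only days that have records).
mean_per_day <- aggregate(cbind(TotalSteps, Calories) ~ Weekday,
                          data = demo, FUN = mean)

records_per_day
mean_per_day
```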
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
It can be seen that users burn more calories as their steps increase. There is also a spike in calories burned between 5000 and 10000 steps. This may be due to users being more active and thus burning more calories.
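The code for this chart is not included in the report. The `geom_smooth()` message suggests a scatter plot with a default smoother, so a sketch along those lines is given here. The data frame `demo`, its values and the axis labels are assumptions, and ggplot2 must be installed.

```r
library(ggplot2)

# Made-up step and calorie pairs, only to make the sketch runnable.
demo <- data.frame(
  TotalSteps = c(4, 4923, 8053, 11092, 15000, 36019),
  Calories = c(1500, 1856, 2220, 2832, 3300, 4900)
)

# Scatter plot with a smoothed trend line; with no method given,
# geom_smooth() picks one based on the number of rows.
p <- ggplot(demo, aes(x = TotalSteps, y = Calories)) +
  geom_point() +
  geom_smooth() +
  labs(x = "Total steps", y = "Calories")

# A single number to go with the picture: the linear correlation.
steps_calories_cor <- cor(demo$TotalSteps, demo$Calories)
steps_calories_cor
```

Calling print(p) draws the chart. The correlation gives a quick numeric check on the direction of the relationship seen in the plot.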
Sedentary Minutes vs Total Steps:
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
Some users burn around 4000 kcal while being sedentary.
Very Active Minutes vs Total Minutes:
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
From the above graphs, it can be seen that regardless of how much a user slept, the average user is mostly sedentary. In fact, the more sleep a user had, the more sedentary they were.
Users have had problems entering information consistently. Bellabeat could provide incentives to users for consistent tracking.
Bellabeat products could include an algorithm that tracks the user's schedule and provides health recommendations tailored to the specific user.
Bellabeat could offer different memberships to users, such as premium and casual, with some services reserved for premium members.