Part of the Google Data Analytics Professional Certificate course (Capstone Project).
Click Here to Download FitBit Fitness Tracker Data
See my FitBit Fitness Tracker Data analysis on Kaggle: Click Here
This is a fictional data analytics case study whose purpose is to gain insights that improve business decisions for Bellabeat, a company focused on building products designed to gather information about a woman’s overall activity, sleep schedules, stress and reproductive health.
This analysis follows the structure learned throughout the course, which is the following:
For this fictional case study, I have joined Bellabeat’s marketing analytics team, which is responsible for collecting, analyzing and reporting data that helps guide Bellabeat’s marketing strategy. This report is therefore written from the point of view of that (fictional) team.
In the Ask phase, we (the fictional team) focus on the problem we are trying to solve and on how our insights can drive business decisions.
Bellabeat wants to gather insights regarding the following business tasks:
By analysing the activity of a particular set of customers (who gave consent to share their activity data), we can see which types of features are used the most or have the greatest impact on a healthy lifestyle for those customers. With these insights, Bellabeat’s team can focus on improving or spotlighting these features in marketing.
For this step, we’ll take a closer look at the dataset used for this case study, which will help us address the previous business tasks. We’ll give a brief overview of particular aspects of this dataset: how the data is stored and organized, bias and credibility issues (ROCCC), licensing, privacy, security, accessibility, data integrity, how the data helps answer our questions, and whether there are any problems with it. Finally, this phase also includes selecting or filtering the available data, and sorting it if necessary.
The FitBit Fitness Tracker Data dataset is available on Kaggle. As described on the dataset’s page: “These datasets were generated by respondents to a distributed survey via Amazon Mechanical Turk between 03.12.2016-05.12.2016. Thirty eligible Fitbit users consented to the submission of personal tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring. Individual reports can be parsed by export session ID (column A) or timestamp (column B). Variation between output represents use of different types of Fitbit trackers and individual tracking behaviors / preferences.”
Storage: The data is currently stored on Kaggle’s platform and will also be downloaded to a local file system for analysis and clean-up.
Organized: The dataset consists of 18 CSV files:
Bias/Credibility (ROCCC)
Original: Can we locate the original data source?
Comprehensive: Does it have the necessary and important information to find the solution?
Current: Is the data current?
Cited:
License, privacy, security, accessibility and data integrity: According to the description of the dataset, thirty eligible Fitbit users consented to the submission of personal tracker data. According to the metadata, the data was made public under the following license: CC0: Public Domain.
Problems with the data: The data is not current (it dates from 2016) and has a relatively small sample size (30 customers), and we cannot be sure whether it is actually representative of women or biased in some way.
The data is distributed across multiple files, and the observations (rows) in each file are recorded at different time intervals. Some use long intervals, such as dailyActivity_merged.csv, where each observation covers one day, while others use short intervals, such as heartrate_seconds_merged.csv, where each observation is recorded by the second. To get a general view of the usage trends of Bellabeat’s customers, we will focus on the files with longer-interval observations, which are the following:
dailyActivity_merged.csv
dailyCalories_merged.csv
dailyIntensities_merged.csv
dailySteps_merged.csv
sleepDay_merged.csv
weightLogInfo_merged.csv
However, looking into these datasets, we can see that dailyActivity_merged.csv already contains all the columns of the dailyCalories, dailyIntensities and dailySteps files. Also, a quick look at the weightLogInfo_merged.csv file shows that it only contains weight information for 9 customers, so this file won’t be used either. We end up with the following files:
dailyActivity_merged.csv
sleepDay_merged.csv
We will combine both steps (Process and Analysis) into this one section.
Process: In the Process phase, the key tasks include checking the data for errors, choosing the most appropriate tools to do so, transforming the data so we can work with it effectively, and documenting the entire cleaning process. The tool we chose to clean the data was R.
Analysis: The Analysis phase consists of aggregating the data so it’s useful and accessible, organizing and formatting the data, performing any necessary calculations, and identifying trends and relationships.
As described in the previous phase, the files left for us to work with are dailyActivity_merged.csv and sleepDay_merged.csv. There may be a correlation between the sleep behaviours and the overall activity of the customers, so the next step will be to merge the two files so we have a single data frame to work with.
library(tidyverse)
library(lubridate)
library(skimr)
Loading Files:
dailyActivities <- read_csv("data/dailyActivity_merged.csv")
##
## -- Column specification --------------------------------------------------------
## cols(
## Id = col_double(),
## ActivityDate = col_character(),
## TotalSteps = col_double(),
## TotalDistance = col_double(),
## TrackerDistance = col_double(),
## LoggedActivitiesDistance = col_double(),
## VeryActiveDistance = col_double(),
## ModeratelyActiveDistance = col_double(),
## LightActiveDistance = col_double(),
## SedentaryActiveDistance = col_double(),
## VeryActiveMinutes = col_double(),
## FairlyActiveMinutes = col_double(),
## LightlyActiveMinutes = col_double(),
## SedentaryMinutes = col_double(),
## Calories = col_double()
## )
sleepActivities <- read_csv("data/sleepDay_merged.csv")
##
## -- Column specification --------------------------------------------------------
## cols(
## Id = col_double(),
## SleepDay = col_character(),
## TotalSleepRecords = col_double(),
## TotalMinutesAsleep = col_double(),
## TotalTimeInBed = col_double()
## )
head(dailyActivities)
## # A tibble: 6 x 15
## Id ActivityDate TotalSteps TotalDistance TrackerDistance LoggedActivitie~
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 1.50e9 4/12/2016 13162 8.5 8.5 0
## 2 1.50e9 4/13/2016 10735 6.97 6.97 0
## 3 1.50e9 4/14/2016 10460 6.74 6.74 0
## 4 1.50e9 4/15/2016 9762 6.28 6.28 0
## 5 1.50e9 4/16/2016 12669 8.16 8.16 0
## 6 1.50e9 4/17/2016 9705 6.48 6.48 0
## # ... with 9 more variables: VeryActiveDistance <dbl>,
## # ModeratelyActiveDistance <dbl>, LightActiveDistance <dbl>,
## # SedentaryActiveDistance <dbl>, VeryActiveMinutes <dbl>,
## # FairlyActiveMinutes <dbl>, LightlyActiveMinutes <dbl>,
## # SedentaryMinutes <dbl>, Calories <dbl>
head(sleepActivities)
## # A tibble: 6 x 5
## Id SleepDay TotalSleepRecor~ TotalMinutesAsle~ TotalTimeInBed
## <dbl> <chr> <dbl> <dbl> <dbl>
## 1 1.50e9 4/12/2016 12:00:0~ 1 327 346
## 2 1.50e9 4/13/2016 12:00:0~ 2 384 407
## 3 1.50e9 4/15/2016 12:00:0~ 1 412 442
## 4 1.50e9 4/16/2016 12:00:0~ 2 340 367
## 5 1.50e9 4/17/2016 12:00:0~ 1 700 712
## 6 1.50e9 4/19/2016 12:00:0~ 1 304 320
As we can see, we have the Id and ActivityDate columns in the dailyActivities data frame, and the Id and SleepDay columns in the sleepActivities data frame. These columns will be used for the merge, but first we need to make a few changes. R has read both date-related columns as character strings (chr). Also, ActivityDate represents a date, while SleepDay is a date-time. The time component, however, is unnecessary, as every row shows the exact same time (12 AM). So the next steps will be to:
Using the lubridate library, we can parse the dates with the mdy function for ActivityDate (mdy because the column’s current date format is month/day/year) and with the mdy_hms function for SleepDay, which also contains a time.
dailyActivities <- mutate(dailyActivities, ActivityDate = mdy(ActivityDate))
sleepActivities <- mutate(sleepActivities, SleepDay = mdy_hms(SleepDay))
Both columns are now parsed: ActivityDate is a date and SleepDay is a date-time, and both use the default, much more readable format (year-month-day). Let’s quickly recheck:
head(dailyActivities)
## # A tibble: 6 x 15
## Id ActivityDate TotalSteps TotalDistance TrackerDistance LoggedActivitie~
## <dbl> <date> <dbl> <dbl> <dbl> <dbl>
## 1 1.50e9 2016-04-12 13162 8.5 8.5 0
## 2 1.50e9 2016-04-13 10735 6.97 6.97 0
## 3 1.50e9 2016-04-14 10460 6.74 6.74 0
## 4 1.50e9 2016-04-15 9762 6.28 6.28 0
## 5 1.50e9 2016-04-16 12669 8.16 8.16 0
## 6 1.50e9 2016-04-17 9705 6.48 6.48 0
## # ... with 9 more variables: VeryActiveDistance <dbl>,
## # ModeratelyActiveDistance <dbl>, LightActiveDistance <dbl>,
## # SedentaryActiveDistance <dbl>, VeryActiveMinutes <dbl>,
## # FairlyActiveMinutes <dbl>, LightlyActiveMinutes <dbl>,
## # SedentaryMinutes <dbl>, Calories <dbl>
head(sleepActivities)
## # A tibble: 6 x 5
## Id SleepDay TotalSleepRecords TotalMinutesAsl~ TotalTimeInBed
## <dbl> <dttm> <dbl> <dbl> <dbl>
## 1 1.50e9 2016-04-12 00:00:00 1 327 346
## 2 1.50e9 2016-04-13 00:00:00 2 384 407
## 3 1.50e9 2016-04-15 00:00:00 1 412 442
## 4 1.50e9 2016-04-16 00:00:00 2 340 367
## 5 1.50e9 2016-04-17 00:00:00 1 700 712
## 6 1.50e9 2016-04-19 00:00:00 1 304 320
dailyActivities <- rename(dailyActivities, Date = ActivityDate)
sleepActivities <- sleepActivities %>% rename(Date = SleepDay)
glimpse(dailyActivities)
## Rows: 940
## Columns: 15
## $ Id <dbl> 1503960366, 1503960366, 1503960366, 150396036~
## $ Date <date> 2016-04-12, 2016-04-13, 2016-04-14, 2016-04-~
## $ TotalSteps <dbl> 13162, 10735, 10460, 9762, 12669, 9705, 13019~
## $ TotalDistance <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59, 9.8~
## $ TrackerDistance <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59, 9.8~
## $ LoggedActivitiesDistance <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ VeryActiveDistance <dbl> 1.88, 1.57, 2.44, 2.14, 2.71, 3.19, 3.25, 3.5~
## $ ModeratelyActiveDistance <dbl> 0.55, 0.69, 0.40, 1.26, 0.41, 0.78, 0.64, 1.3~
## $ LightActiveDistance <dbl> 6.06, 4.71, 3.91, 2.83, 5.04, 2.51, 4.71, 5.0~
## $ SedentaryActiveDistance <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ VeryActiveMinutes <dbl> 25, 21, 30, 29, 36, 38, 42, 50, 28, 19, 66, 4~
## $ FairlyActiveMinutes <dbl> 13, 19, 11, 34, 10, 20, 16, 31, 12, 8, 27, 21~
## $ LightlyActiveMinutes <dbl> 328, 217, 181, 209, 221, 164, 233, 264, 205, ~
## $ SedentaryMinutes <dbl> 728, 776, 1218, 726, 773, 539, 1149, 775, 818~
## $ Calories <dbl> 1985, 1797, 1776, 1745, 1863, 1728, 1921, 203~
glimpse(sleepActivities)
## Rows: 413
## Columns: 5
## $ Id <dbl> 1503960366, 1503960366, 1503960366, 1503960366, 150~
## $ Date <dttm> 2016-04-12, 2016-04-13, 2016-04-15, 2016-04-16, 20~
## $ TotalSleepRecords <dbl> 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ~
## $ TotalMinutesAsleep <dbl> 327, 384, 412, 340, 700, 304, 360, 325, 361, 430, 2~
## $ TotalTimeInBed <dbl> 346, 407, 442, 367, 712, 320, 377, 364, 384, 449, 3~
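Note that the sleep date column is still a date-time (dttm) at this point, while the activity date column is a plain date. A minimal sketch of how the time component could be dropped with lubridate's as_date(), using two made-up strings in the same format as the SleepDay column (the vector names below are illustrative and not part of the original analysis):

```r
library(lubridate)

# Hypothetical sample values in the same format as the raw SleepDay column
raw <- c("4/12/2016 12:00:00 AM", "4/13/2016 12:00:00 AM")

parsed <- mdy_hms(raw)     # date-time (POSIXct) at midnight
dates  <- as_date(parsed)  # plain Date, time component dropped

class(parsed)
class(dates)
dates
```

Converting both key columns to the same type before a join avoids relying on implicit type conversion between Date and POSIXct.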
Before merging the data frames, let’s see whether all the customers in one data frame are also present in the other. A simple first check is to count the number of distinct Ids in both data frames.
n_distinct(dailyActivities$Id)
## [1] 33
n_distinct(sleepActivities$Id)
## [1] 24
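Comparing counts shows that the two sets differ in size, but not which Ids are missing. As a sketch, base R's setdiff() can list them directly. The Id vectors below are made up and merely stand in for dailyActivities$Id and sleepActivities$Id:

```r
# Hypothetical Id vectors standing in for the Id columns of the two data frames
activity_ids <- c(101, 102, 103, 104)
sleep_ids    <- c(101, 103)

# Ids with activity data but no sleep data
missing_sleep <- setdiff(activity_ids, sleep_ids)

# Ids with sleep data but no activity data
missing_activity <- setdiff(sleep_ids, activity_ids)

# TRUE when every sleep Id also appears in the activity data
all_sleep_ids_known <- all(sleep_ids %in% activity_ids)

missing_sleep
missing_activity
all_sleep_ids_known
```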
Apparently not all customers have sleep data. In the dailyActivities data frame there are 33 customers, while in sleepActivities there are 24! To handle this, we decided to use dplyr’s inner_join function. It merges both data frames by Id and Date and only keeps observations (rows) that are present in both data frames.
mergeActivity <- inner_join(dailyActivities, sleepActivities, by=c("Id", "Date"))
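To illustrate what inner_join keeps and drops, here is a small sketch on made-up data frames (the values and the names activity, sleep, merged and unmatched are assumptions for illustration only). anti_join returns the rows that the inner join leaves out:

```r
library(dplyr)

# Made-up miniature versions of the two data frames
activity <- tibble(
  Id = c(1, 1, 2),
  Date = as.Date(c("2016-04-12", "2016-04-13", "2016-04-12")),
  TotalSteps = c(13162, 10735, 9000)
)
sleep <- tibble(
  Id = c(1, 3),
  Date = as.Date(c("2016-04-12", "2016-04-12")),
  TotalMinutesAsleep = c(327, 400)
)

# Only the (Id, Date) pairs present in both data frames survive
merged <- inner_join(activity, sleep, by = c("Id", "Date"))

# Activity rows without a matching sleep record
unmatched <- anti_join(activity, sleep, by = c("Id", "Date"))

merged
unmatched
```

The trade-off is that every activity day without a sleep record is discarded, so the merged data only describes days on which sleep was tracked.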
Let’s now get an overview of the final data frame using the skimr library. But first, let’s add one more column to see who’s been lazy or has insomnia! It may correlate with overall workout activity.
mergeActivity <- mutate(mergeActivity, TimeAwakeInBed = TotalTimeInBed - TotalMinutesAsleep)
n_distinct(mergeActivity$Id)
## [1] 24
glimpse(mergeActivity)
## Rows: 413
## Columns: 19
## $ Id <dbl> 1503960366, 1503960366, 1503960366, 150396036~
## $ Date <dttm> 2016-04-12, 2016-04-13, 2016-04-15, 2016-04-~
## $ TotalSteps <dbl> 13162, 10735, 9762, 12669, 9705, 15506, 10544~
## $ TotalDistance <dbl> 8.50, 6.97, 6.28, 8.16, 6.48, 9.88, 6.68, 6.3~
## $ TrackerDistance <dbl> 8.50, 6.97, 6.28, 8.16, 6.48, 9.88, 6.68, 6.3~
## $ LoggedActivitiesDistance <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ VeryActiveDistance <dbl> 1.88, 1.57, 2.14, 2.71, 3.19, 3.53, 1.96, 1.3~
## $ ModeratelyActiveDistance <dbl> 0.55, 0.69, 1.26, 0.41, 0.78, 1.32, 0.48, 0.3~
## $ LightActiveDistance <dbl> 6.06, 4.71, 2.83, 5.04, 2.51, 5.03, 4.24, 4.6~
## $ SedentaryActiveDistance <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ VeryActiveMinutes <dbl> 25, 21, 29, 36, 38, 50, 28, 19, 41, 39, 73, 3~
## $ FairlyActiveMinutes <dbl> 13, 19, 34, 10, 20, 31, 12, 8, 21, 5, 14, 23,~
## $ LightlyActiveMinutes <dbl> 328, 217, 209, 221, 164, 264, 205, 211, 262, ~
## $ SedentaryMinutes <dbl> 728, 776, 726, 773, 539, 775, 818, 838, 732, ~
## $ Calories <dbl> 1985, 1797, 1745, 1863, 1728, 2035, 1786, 177~
## $ TotalSleepRecords <dbl> 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ~
## $ TotalMinutesAsleep <dbl> 327, 384, 412, 340, 700, 304, 360, 325, 361, ~
## $ TotalTimeInBed <dbl> 346, 407, 442, 367, 712, 320, 377, 364, 384, ~
## $ TimeAwakeInBed <dbl> 19, 23, 30, 27, 12, 16, 17, 39, 23, 19, 46, 2~
skim(mergeActivity)
| Data summary | |
|---|---|
| Name | mergeActivity |
| Number of rows | 413 |
| Number of columns | 19 |
| _______________________ | |
| Column type frequency: | |
| numeric | 18 |
| POSIXct | 1 |
| ________________________ | |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| Id | 0 | 1 | 5.000979e+09 | 2.06036e+09 | 1.50396e+09 | 3.977334e+09 | 4.702922e+09 | 6.962181e+09 | 8.79201e+09 | ▆▆▇▅▃ |
| TotalSteps | 0 | 1 | 8.541140e+03 | 4.15693e+03 | 1.70000e+01 | 5.206000e+03 | 8.925000e+03 | 1.139300e+04 | 2.27700e+04 | ▅▆▇▂▁ |
| TotalDistance | 0 | 1 | 6.040000e+00 | 3.05000e+00 | 1.00000e-02 | 3.600000e+00 | 6.290000e+00 | 8.030000e+00 | 1.75400e+01 | ▅▇▇▁▁ |
| TrackerDistance | 0 | 1 | 6.030000e+00 | 3.05000e+00 | 1.00000e-02 | 3.600000e+00 | 6.290000e+00 | 8.020000e+00 | 1.75400e+01 | ▅▇▇▁▁ |
| LoggedActivitiesDistance | 0 | 1 | 1.100000e-01 | 5.10000e-01 | 0.00000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 4.08000e+00 | ▇▁▁▁▁ |
| VeryActiveDistance | 0 | 1 | 1.450000e+00 | 1.99000e+00 | 0.00000e+00 | 0.000000e+00 | 5.700000e-01 | 2.370000e+00 | 1.25400e+01 | ▇▂▁▁▁ |
| ModeratelyActiveDistance | 0 | 1 | 7.500000e-01 | 1.00000e+00 | 0.00000e+00 | 0.000000e+00 | 4.200000e-01 | 1.040000e+00 | 6.48000e+00 | ▇▂▁▁▁ |
| LightActiveDistance | 0 | 1 | 3.810000e+00 | 1.73000e+00 | 1.00000e-02 | 2.540000e+00 | 3.680000e+00 | 4.930000e+00 | 9.48000e+00 | ▂▇▆▂▁ |
| SedentaryActiveDistance | 0 | 1 | 0.000000e+00 | 1.00000e-02 | 0.00000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 1.10000e-01 | ▇▁▁▁▁ |
| VeryActiveMinutes | 0 | 1 | 2.519000e+01 | 3.63900e+01 | 0.00000e+00 | 0.000000e+00 | 9.000000e+00 | 3.800000e+01 | 2.10000e+02 | ▇▂▁▁▁ |
| FairlyActiveMinutes | 0 | 1 | 1.804000e+01 | 2.24000e+01 | 0.00000e+00 | 0.000000e+00 | 1.100000e+01 | 2.700000e+01 | 1.43000e+02 | ▇▂▁▁▁ |
| LightlyActiveMinutes | 0 | 1 | 2.168500e+02 | 8.71600e+01 | 2.00000e+00 | 1.580000e+02 | 2.080000e+02 | 2.630000e+02 | 5.18000e+02 | ▂▇▇▂▁ |
| SedentaryMinutes | 0 | 1 | 7.121700e+02 | 1.65960e+02 | 0.00000e+00 | 6.310000e+02 | 7.170000e+02 | 7.830000e+02 | 1.26500e+03 | ▁▁▇▃▁ |
| Calories | 0 | 1 | 2.397570e+03 | 7.62890e+02 | 2.57000e+02 | 1.850000e+03 | 2.220000e+03 | 2.926000e+03 | 4.90000e+03 | ▁▇▇▃▁ |
| TotalSleepRecords | 0 | 1 | 1.120000e+00 | 3.50000e-01 | 1.00000e+00 | 1.000000e+00 | 1.000000e+00 | 1.000000e+00 | 3.00000e+00 | ▇▁▁▁▁ |
| TotalMinutesAsleep | 0 | 1 | 4.194700e+02 | 1.18340e+02 | 5.80000e+01 | 3.610000e+02 | 4.330000e+02 | 4.900000e+02 | 7.96000e+02 | ▁▂▇▃▁ |
| TotalTimeInBed | 0 | 1 | 4.586400e+02 | 1.27100e+02 | 6.10000e+01 | 4.030000e+02 | 4.630000e+02 | 5.260000e+02 | 9.61000e+02 | ▁▃▇▁▁ |
| TimeAwakeInBed | 0 | 1 | 3.917000e+01 | 4.65700e+01 | 0.00000e+00 | 1.700000e+01 | 2.500000e+01 | 4.000000e+01 | 3.71000e+02 | ▇▁▁▁▁ |
Variable type: POSIXct
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| Date | 0 | 1 | 2016-04-12 | 2016-05-12 | 2016-04-27 | 31 |
Let’s see if there are any duplicates in the final data frame:
sum(duplicated(mergeActivity))
## [1] 3
# Removing duplicates from the data frame
mergeActivity <- distinct(mergeActivity)
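As a sketch of what this step does, the following made-up data frame contains one fully repeated row. duplicated() flags the repeat, and distinct() keeps a single copy of each row:

```r
library(dplyr)

# Made-up sleep records; the second row repeats the first exactly
sleep <- tibble(
  Id = c(1, 1, 1, 2),
  Date = as.Date(c("2016-04-12", "2016-04-12", "2016-04-13", "2016-04-12")),
  TotalMinutesAsleep = c(327, 327, 384, 400)
)

n_duplicates <- sum(duplicated(sleep))  # rows identical to an earlier row
deduped <- distinct(sleep)              # one copy of each unique row

n_duplicates
nrow(deduped)
```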
Let’s now do some exploratory visualizations in R of variables that we expect to be correlated.
ggplot(data = mergeActivity, aes(x=TotalSteps, y=TotalDistance)) +
geom_point(color = "red") + labs(title = "Relationship between TotalSteps and TotalDistance")
ggplot(data = mergeActivity, aes(x=TotalSteps, y=Calories)) +
geom_point(color = "red") + labs(title = "Relationship between TotalSteps and Calories")
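The scatter plots show the relationships visually; cor() can put a number on them. The sketch below uses only the first six rows of dailyActivities printed earlier, so it is an illustration of the call rather than a result for the whole data set (the data frame name first_rows is an assumption):

```r
# First six rows of dailyActivities as shown in the head() and glimpse() output
first_rows <- data.frame(
  TotalSteps    = c(13162, 10735, 10460, 9762, 12669, 9705),
  TotalDistance = c(8.50, 6.97, 6.74, 6.28, 8.16, 6.48),
  Calories      = c(1985, 1797, 1776, 1745, 1863, 1728)
)

# Pearson correlation coefficients (between -1 and 1)
cor_steps_distance <- cor(first_rows$TotalSteps, first_rows$TotalDistance)
cor_steps_calories <- cor(first_rows$TotalSteps, first_rows$Calories)

cor_steps_distance
cor_steps_calories
```

On the full analysis, the same call would take the columns of mergeActivity instead.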
Let’s now look at the actual behaviours of our customers. We can analyse their sleep behaviours and whether these have any impact on overall activity, what type of activity they do the most, and at what time of the day or day of the week the customers are most likely to be active.
ggplot(mergeActivity, aes(x=TotalMinutesAsleep)) +
geom_histogram(aes(y = after_stat(density)), binwidth = 30, alpha = 0.6, color = "red") + # after_stat() replaces the older ..density.. notation
geom_density(alpha = 0.2, fill="blue") +
labs(title = "Total Minutes Asleep")
From this histogram, we can see that most of the customers sleep around 420 minutes, which equals 7 hours, a night.
ggplot(mergeActivity, aes(x=TimeAwakeInBed)) +
geom_histogram(aes(y = after_stat(density)), binwidth = 30, alpha = 0.6, color = "red") + # after_stat() replaces the older ..density.. notation
geom_density(fill= "blue", alpha = 0.2) +
geom_vline(aes(xintercept = mean(TimeAwakeInBed)), color="green", linetype = "dashed") +
geom_vline(aes(xintercept = max(TimeAwakeInBed)), color="green", linetype = "dashed") +
annotate(geom = "text", x = 39 + 45, y = 0.02, label = "Mean = 39", size = 5) +
annotate(geom = "text", x = 371 - 30, y = 0.02, label = "Max = 371", size = 5 ) +
labs(x = "Time Awake in Bed", y = "Density")
Using the same type of visualization and our calculated field TimeAwakeInBed, we can see whether the customers are spending too much time awake in bed, which could be a sign of bad sleeping habits. In the histogram, we see that the mean time awake in bed is 39 minutes, which is not particularly bad but could be improved. There is also a small number of cases where customers are awake between 100 and 300 minutes, with the maximum being 371 minutes! We will leave further comments on this subject for the Share phase.
For this analysis, we’ll construct a data frame where, for each date (or day), we measure the mean of each activity type, and then plot these measurements as lines with different colors.
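Because a few very long values pull the mean upwards, it can help to look at the median and at the share of long awake periods as well. A small sketch follows; the first eleven values are those shown in the glimpse() output, the last one is the reported maximum added by hand, and the 100-minute threshold is an arbitrary choice for illustration:

```r
# TimeAwakeInBed values from the glimpse() output, plus the reported maximum (371)
time_awake <- c(19, 23, 30, 27, 12, 16, 17, 39, 23, 19, 46, 371)

mean_awake   <- mean(time_awake)        # sensitive to the single extreme value
median_awake <- median(time_awake)      # robust to it
share_long   <- mean(time_awake > 100)  # fraction of nights above 100 minutes

mean_awake
median_awake
share_long
```

In this small sample the mean is more than twice the median, which is the pattern one would expect from a right-skewed distribution.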
activitySummary <- mergeActivity %>% group_by(Date) %>%
summarise(Mean_VeryActiveMinutes = mean(VeryActiveMinutes),
Mean_FairlyActiveMinutes = mean(FairlyActiveMinutes),
Mean_LightlyActiveMinutes = mean(LightlyActiveMinutes),
Mean_SedentaryMinutes = mean(SedentaryMinutes)) %>%
arrange(Date)
cols <- c("Very" = "violetred3", "Fairly" = "black", "Lightly" = "orange", "Sedentary" = "red")
ggplot(data = activitySummary, aes(x=Date)) +
geom_line(aes(y=Mean_VeryActiveMinutes, color = "Very"), size = 1) +
geom_line(aes(y=Mean_FairlyActiveMinutes, color = "Fairly"), size = 1) +
geom_line(aes(y=Mean_LightlyActiveMinutes, color = "Lightly"), size = 1) +
geom_line(aes(y=Mean_SedentaryMinutes, color = "Sedentary"), size = 1) +
labs(title = "Most Frequent Types of Activity", y = "Minutes (mean)", color = "Activity") +
scale_color_manual(values = cols)
From the chart above, we can see that most of the recorded time is Sedentary, followed by Lightly active time, while Very active and Fairly active time account for the least. Nowadays, most people spend a lot of their day working while sitting on a chair, which may explain why most of the time is Sedentary, with the Lightly active time coming simply from doing various chores.
To analyse activity by weekday, we can do something similar to what we did for the previous data frame (activitySummary), this time grouping by weekday using the weekdays function.
activityByweekDay <- mergeActivity %>% group_by(weekday = weekdays(Date, abbreviate = T)) %>%
summarise(Very = mean(VeryActiveMinutes),
Fairly = mean(FairlyActiveMinutes),
Lightly = mean(LightlyActiveMinutes),
Sedentary = mean(SedentaryMinutes)) %>%
pivot_longer( cols = Very:Sedentary, names_to = "Activity", values_to = "Mean")
activityByweekDay$weekday <- factor(activityByweekDay$weekday, levels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"))
head(activityByweekDay,5)
## # A tibble: 5 x 3
## weekday Activity Mean
## <fct> <chr> <dbl>
## 1 Fri Very 21.2
## 2 Fri Fairly 14.6
## 3 Fri Lightly 223.
## 4 Fri Sedentary 743.
## 5 Mon Very 30.7
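One caveat: weekdays() returns day names in the current locale, so the hard-coded English levels used above would produce NA values on a non-English system. A locale-independent sketch, assuming lubridate's wday() with week_start = 1 (Monday = 1), could look like this; the sample dates are picked from the data's date range:

```r
library(lubridate)

# Three sample dates within the range covered by the data
dates <- as.Date(c("2016-04-12", "2016-04-16", "2016-04-17"))

# Numeric day of the week, counted from Monday (1) to Sunday (7)
day_number <- wday(dates, week_start = 1)

# Attach fixed English labels, independent of the system locale
weekday <- factor(day_number, levels = 1:7,
                  labels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"))

weekday
```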
#plotting
ggplot(activityByweekDay, aes(x= weekday, y = Mean, fill = Activity)) +
geom_bar(position = "dodge", stat = "identity") + # stat = "identity" plots the Mean values as given; geom_col() does the same without the stat argument
labs(title = "Activity per weekday")
There is not much to take from this chart, since activity is relatively evenly distributed across weekdays; Sedentary time is slightly higher on Fridays and Tuesdays, and Lightly active time on Saturdays and Fridays. Fairly and Very are hard to distinguish at this scale, so we’ll filter for those two and plot again:
activityByweekDay %>%
filter(Activity == "Very" | Activity == "Fairly") %>%
ggplot(aes(x= weekday, y = Mean, fill = Activity)) +
geom_bar(position = "dodge", stat = "identity") + # stat = "identity" plots the Mean values as given; geom_col() does the same without the stat argument
labs(title = "Fairly and Very Active Minutes per Weekday")
Fairly active minutes are highest on Mondays and Tuesdays, while Very active minutes are highest on Saturdays and Tuesdays.
For this analysis, we’ll try to see if there’s any correlation between sleep and activity.
ggplot(data = mergeActivity) +
geom_point(mapping = aes(x = TotalMinutesAsleep, y = TotalSteps, size = Calories, color = Calories)) + # size and color must be mapped inside aes() to vary with Calories
geom_smooth(formula = y ~ x, method=lm, mapping = aes(x = TotalMinutesAsleep, y = TotalSteps), color = "red") +
labs(title = "Sleep and Activity", x = "Total Minutes Asleep", y = "Total Steps")
Apparently, our customers tend to be more active on days when they slept less than usual, rather than more. The fitted line only describes an association, though, and does not show that less sleep causes more activity. As noted in the previous analysis of sleep behaviours, most of the customers sleep relatively well (420 minutes = 7 hours).
In this phase, we take into consideration all the insights gathered and make a business decision based on our findings. The recommendations were presented in the previous phase, and it is now up to the stakeholders to make the desired decision. In this section we can also discuss other data sets that could be used to expand this analysis in the future, for example:
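The direction of the trend line can also be checked numerically with lm() and cor(). The sketch below uses only the first seven merged rows visible in the glimpse() output, all from a single customer, so it illustrates the calls and is not evidence for the full data set (the name sample_rows is an assumption):

```r
# First seven rows of mergeActivity as shown in the glimpse() output
sample_rows <- data.frame(
  TotalMinutesAsleep = c(327, 384, 412, 340, 700, 304, 360),
  TotalSteps         = c(13162, 10735, 9762, 12669, 9705, 15506, 10544)
)

# Linear fit matching the geom_smooth(method = lm) layer
fit <- lm(TotalSteps ~ TotalMinutesAsleep, data = sample_rows)
slope <- coef(fit)[["TotalMinutesAsleep"]]  # steps per extra minute asleep

# Strength and direction of the linear association
r <- cor(sample_rows$TotalMinutesAsleep, sample_rows$TotalSteps)

slope
r
```

A negative slope together with a correlation close to zero would mean the downward trend is weak, so both numbers are worth reporting alongside the plot.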