About the company

Bellabeat is a high-tech company that manufactures health-focused smart products. Collecting data on activity, sleep, stress, and reproductive health has allowed Bellabeat to empower women with knowledge about their own health and habits.

Phase 1: Ask:

  • Create a business objective that aligns with the stakeholders' desired outcomes and expectations.

Business Task

Analyze smart device usage data in order to gain insight into how consumers use smart devices.

Phase 2: Prepare:

  • Run an exploratory data analysis to check the basic characteristics of the dataset.
  • Check the dataset for ROCCC (reliable, original, comprehensive, current, cited).
  • Check data licensing, privacy, security, accessibility and integrity.
  • Identify problems with the data and how the data will help answer the question.

Loading Packages

library(tidyverse)
library(ggpubr)
library(lubridate)

Uploading Datasets

activity <- read.csv("r/files/dailyActivity_merged.csv")
sleep <- read.csv("r/files/sleepDay_merged.csv")
calories <- read.csv("r/files/hourlyCalories_merged.csv")
intensities <- read.csv("r/files/hourlyIntensities_merged.csv")
steps <- read.csv("r/files/hourlySteps_merged.csv")

Exploratory Data Analysis of dataset

str(activity)
## 'data.frame':    940 obs. of  15 variables:
##  $ Id                      : num  1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ ActivityDate            : chr  "4/12/2016" "4/13/2016" "4/14/2016" "4/15/2016" ...
##  $ TotalSteps              : int  13162 10735 10460 9762 12669 9705 13019 15506 10544 9819 ...
##  $ TotalDistance           : num  8.5 6.97 6.74 6.28 8.16 ...
##  $ TrackerDistance         : num  8.5 6.97 6.74 6.28 8.16 ...
##  $ LoggedActivitiesDistance: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ VeryActiveDistance      : num  1.88 1.57 2.44 2.14 2.71 ...
##  $ ModeratelyActiveDistance: num  0.55 0.69 0.4 1.26 0.41 ...
##  $ LightActiveDistance     : num  6.06 4.71 3.91 2.83 5.04 ...
##  $ SedentaryActiveDistance : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ VeryActiveMinutes       : int  25 21 30 29 36 38 42 50 28 19 ...
##  $ FairlyActiveMinutes     : int  13 19 11 34 10 20 16 31 12 8 ...
##  $ LightlyActiveMinutes    : int  328 217 181 209 221 164 233 264 205 211 ...
##  $ SedentaryMinutes        : int  728 776 1218 726 773 539 1149 775 818 838 ...
##  $ Calories                : int  1985 1797 1776 1745 1863 1728 1921 2035 1786 1775 ...
head(activity)
##           Id ActivityDate TotalSteps TotalDistance TrackerDistance
## 1 1503960366    4/12/2016      13162          8.50            8.50
## 2 1503960366    4/13/2016      10735          6.97            6.97
## 3 1503960366    4/14/2016      10460          6.74            6.74
## 4 1503960366    4/15/2016       9762          6.28            6.28
## 5 1503960366    4/16/2016      12669          8.16            8.16
## 6 1503960366    4/17/2016       9705          6.48            6.48
##   LoggedActivitiesDistance VeryActiveDistance ModeratelyActiveDistance
## 1                        0               1.88                     0.55
## 2                        0               1.57                     0.69
## 3                        0               2.44                     0.40
## 4                        0               2.14                     1.26
## 5                        0               2.71                     0.41
## 6                        0               3.19                     0.78
##   LightActiveDistance SedentaryActiveDistance VeryActiveMinutes
## 1                6.06                       0                25
## 2                4.71                       0                21
## 3                3.91                       0                30
## 4                2.83                       0                29
## 5                5.04                       0                36
## 6                2.51                       0                38
##   FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes Calories
## 1                  13                  328              728     1985
## 2                  19                  217              776     1797
## 3                  11                  181             1218     1776
## 4                  34                  209              726     1745
## 5                  10                  221              773     1863
## 6                  20                  164              539     1728
str(sleep)
## 'data.frame':    413 obs. of  5 variables:
##  $ Id                : num  1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ SleepDay          : chr  "4/12/2016 12:00:00 AM" "4/13/2016 12:00:00 AM" "4/15/2016 12:00:00 AM" "4/16/2016 12:00:00 AM" ...
##  $ TotalSleepRecords : int  1 2 1 2 1 1 1 1 1 1 ...
##  $ TotalMinutesAsleep: int  327 384 412 340 700 304 360 325 361 430 ...
##  $ TotalTimeInBed    : int  346 407 442 367 712 320 377 364 384 449 ...
head(sleep)
##           Id              SleepDay TotalSleepRecords TotalMinutesAsleep
## 1 1503960366 4/12/2016 12:00:00 AM                 1                327
## 2 1503960366 4/13/2016 12:00:00 AM                 2                384
## 3 1503960366 4/15/2016 12:00:00 AM                 1                412
## 4 1503960366 4/16/2016 12:00:00 AM                 2                340
## 5 1503960366 4/17/2016 12:00:00 AM                 1                700
## 6 1503960366 4/19/2016 12:00:00 AM                 1                304
##   TotalTimeInBed
## 1            346
## 2            407
## 3            442
## 4            367
## 5            712
## 6            320
str(calories)
## 'data.frame':    22099 obs. of  3 variables:
##  $ Id          : num  1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ ActivityHour: chr  "4/12/2016 12:00:00 AM" "4/12/2016 1:00:00 AM" "4/12/2016 2:00:00 AM" "4/12/2016 3:00:00 AM" ...
##  $ Calories    : int  81 61 59 47 48 48 48 47 68 141 ...
head(calories)
##           Id          ActivityHour Calories
## 1 1503960366 4/12/2016 12:00:00 AM       81
## 2 1503960366  4/12/2016 1:00:00 AM       61
## 3 1503960366  4/12/2016 2:00:00 AM       59
## 4 1503960366  4/12/2016 3:00:00 AM       47
## 5 1503960366  4/12/2016 4:00:00 AM       48
## 6 1503960366  4/12/2016 5:00:00 AM       48
str(intensities)
## 'data.frame':    22099 obs. of  4 variables:
##  $ Id              : num  1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ ActivityHour    : chr  "4/12/2016 12:00:00 AM" "4/12/2016 1:00:00 AM" "4/12/2016 2:00:00 AM" "4/12/2016 3:00:00 AM" ...
##  $ TotalIntensity  : int  20 8 7 0 0 0 0 0 13 30 ...
##  $ AverageIntensity: num  0.333 0.133 0.117 0 0 ...
head(intensities)
##           Id          ActivityHour TotalIntensity AverageIntensity
## 1 1503960366 4/12/2016 12:00:00 AM             20         0.333333
## 2 1503960366  4/12/2016 1:00:00 AM              8         0.133333
## 3 1503960366  4/12/2016 2:00:00 AM              7         0.116667
## 4 1503960366  4/12/2016 3:00:00 AM              0         0.000000
## 5 1503960366  4/12/2016 4:00:00 AM              0         0.000000
## 6 1503960366  4/12/2016 5:00:00 AM              0         0.000000
str(steps)
## 'data.frame':    22099 obs. of  3 variables:
##  $ Id          : num  1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ ActivityHour: chr  "4/12/2016 12:00:00 AM" "4/12/2016 1:00:00 AM" "4/12/2016 2:00:00 AM" "4/12/2016 3:00:00 AM" ...
##  $ StepTotal   : int  373 160 151 0 0 0 0 0 250 1864 ...
head(steps)
##           Id          ActivityHour StepTotal
## 1 1503960366 4/12/2016 12:00:00 AM       373
## 2 1503960366  4/12/2016 1:00:00 AM       160
## 3 1503960366  4/12/2016 2:00:00 AM       151
## 4 1503960366  4/12/2016 3:00:00 AM         0
## 5 1503960366  4/12/2016 4:00:00 AM         0
## 6 1503960366  4/12/2016 5:00:00 AM         0

The activity dataset has 15 variables and 940 observations. The sleep dataset has 5 variables and 413 observations. The calories dataset has 3 variables and 22099 observations. The intensities dataset has 4 variables and 22099 observations, and the steps dataset has 3 variables and 22099 observations.
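Counts like these can be read off programmatically rather than by eye, which avoids transcription slips. The sketch below uses two made-up data frames (`toy_a` and `toy_b` are illustrative stand-ins, not the loaded datasets) and collects their dimensions into one table with base R:

```r
# Hypothetical stand-ins for the loaded datasets
toy_a <- data.frame(Id = 1:3, Steps = c(10, 20, 30))
toy_b <- data.frame(Id = 1:2, Minutes = c(5, 6), Bed = c(7, 8))

datasets <- list(toy_a = toy_a, toy_b = toy_b)

# One row per dataset: number of observations and number of variables
summary_dims <- t(sapply(datasets, dim))
colnames(summary_dims) <- c("observations", "variables")

print(summary_dims)
```

Passing a named list of the real data frames instead of the toy ones would give the same kind of table for the five datasets.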

On inspection, the observations contain blanks and null values, so the datasets cannot be considered fully reliable. The data was generated by respondents to a survey distributed via Amazon Mechanical Turk between 03.12.2016-05.12.2016, so it comes from a second-party data source and is six years old. The dataset has been released into the public domain under its license, which addresses privacy, security and accessibility concerns.

The dataset is neither complete nor fully accurate. It has issues that need to be cleaned before analysis.

Phase 3: Process:

  • Data Cleaning
  • Data transformation

Data Cleaning

Eliminating Duplicates
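
No code is shown for this step. One possible way to do it, sketched here on a small made-up data frame (`toy_sleep` and its values are illustrative, not taken from the datasets), is to count fully duplicated rows with base R's `duplicated()` and drop them with dplyr's `distinct()`:

```r
library(dplyr)

# Hypothetical data: the second and third rows are exact duplicates
toy_sleep <- data.frame(
  Id = c(1, 2, 2, 3),
  SleepDay = c("4/12/2016", "4/12/2016", "4/12/2016", "4/13/2016"),
  TotalMinutesAsleep = c(327, 384, 384, 412)
)

# Count rows that repeat an earlier row in every column
n_duplicates <- sum(duplicated(toy_sleep))

# Keep only the first occurrence of each distinct row
toy_sleep_clean <- distinct(toy_sleep)
```

The same two calls could be applied to each loaded dataset in turn.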

Checking for Null values

sum(is.null(activity))
## [1] 0
sum(is.null(sleep))
## [1] 0
sum(is.null(intensities))
## [1] 0
sum(is.null(calories))
## [1] 0
sum(is.null(steps))
## [1] 0
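
A note on the check above: `is.null()` tests whether an object as a whole is `NULL`, so for a data frame it returns `FALSE` and the sum is 0 regardless of the contents. Missing cells are counted with `is.na()` instead. A minimal sketch on a made-up data frame (`toy` is illustrative only):

```r
# Hypothetical data frame with two missing values
toy <- data.frame(
  steps = c(1200, NA, 3400),
  calories = c(1800, 1750, NA)
)

null_check <- sum(is.null(toy))     # 0: the data frame itself is not NULL
na_total   <- sum(is.na(toy))       # total number of missing cells
na_by_col  <- colSums(is.na(toy))   # missing cells per column
```

Running `sum(is.na(...))` on each loaded dataset would show whether any values are actually missing.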

Fixing date time data type

activity$ActivityDate <- mdy(activity$ActivityDate)
sleep$SleepDay <- mdy_hms(sleep$SleepDay)
calories$ActivityHour <- mdy_hms(calories$ActivityHour)
intensities$ActivityHour <- mdy_hms(intensities$ActivityHour)
steps$ActivityHour <- mdy_hms(steps$ActivityHour)
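
To illustrate what these conversions do, the sketch below parses two strings in the same formats as the `ActivityDate` and `ActivityHour` columns shown earlier. The variable names `d` and `dt` are illustrative.

```r
library(lubridate)

# Month/day/year string, as in ActivityDate
d <- mdy("4/12/2016")

# Month/day/year with a 12-hour clock time, as in ActivityHour
dt <- mdy_hms("4/12/2016 1:00:00 PM")

class(d)    # "Date"
hour(dt)    # 13, since 1 PM is converted to 24-hour time
```

Once parsed, the columns support date arithmetic and helpers such as `hour()`, which plain character columns do not.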

Data Transformation

Standardizing Column names

activity <- rename_with(activity, tolower)
sleep <- rename_with(sleep, tolower)
intensities <- rename_with(intensities, tolower)
calories <- rename_with(calories, tolower)
steps <- rename_with(steps, tolower)

Adding weekday column to activity dataset

activity <- activity %>% 
  mutate(weekday = weekdays(activitydate))
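
As an alternative sketch (not the approach used above), lubridate's `wday()` can return the weekday as an ordered factor, which removes the need to set factor levels by hand before plotting. The labels follow the session locale unless a `locale` argument is given; the `dates` vector is illustrative.

```r
library(lubridate)

dates <- as.Date(c("2016-04-12", "2016-04-16", "2016-04-17"))

# Numeric weekday with Monday counted as day 1
day_number <- wday(dates, week_start = 1)

# Ordered factor of full weekday names, Monday first
day_label <- wday(dates, label = TRUE, abbr = FALSE, week_start = 1)
```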

Phase 4: Analyze and Share

Analyzing steps taken per weekday

activity_steps <- activity %>% 
  group_by(weekday) %>% 
  summarise(avgstepsperday = mean(totalsteps))
activity_steps$weekday <- factor(activity_steps$weekday, levels = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday")) 

ggplot(activity_steps, aes(x = weekday, y = avgstepsperday, group = 1)) +
  geom_line(color = "darkred", linewidth = 0.8) +
  geom_point(color = "darkred") +
  labs(x = "Weekday", y = "Steps per day", title = "Average steps per day")

The line graph above shows how the average number of steps taken per day varies by weekday.

Analyzing average steps taken by users

  # Average daily steps per user; users averaging more than 12,000 steps are left out of the plot
  activity_steps2 <- activity %>% 
    group_by(id) %>% 
    summarise(Avg_daily_steps = mean(totalsteps)) %>% 
    filter(Avg_daily_steps <= 12000)
  
  # Treat id as a discrete label so that each user gets one tick on the x-axis
  ggplot(data = activity_steps2) +
    geom_point(mapping = aes(x = factor(id), y = Avg_daily_steps, color = Avg_daily_steps)) +
    labs(title = "Average steps taken", y = "Average steps taken", x = "User Id", color = "Average steps") +
    theme(axis.text.x.bottom = element_text(angle = 90)) +
    scale_color_gradient(low = "red", high = "green") +
    geom_hline(yintercept = 8000, linetype = "dashed")

According to an NIH report, people who took 8,000 steps a day had about a 50% lower risk of dying from any cause during follow-up than those who took 4,000 steps. The scatterplot above shows the average daily steps of each user, with a dashed line at 8,000 steps. Users whose average falls below that line may therefore face higher health-related risks.
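
How many users fall below the 8,000-step line can be quantified directly instead of being read off the plot. The sketch below does this on made-up records (`toy_activity` and its values are illustrative, not results from the dataset):

```r
library(dplyr)

# Hypothetical per-day records for three users
toy_activity <- data.frame(
  id = c(1, 1, 2, 2, 3, 3),
  totalsteps = c(9000, 11000, 4000, 6000, 7000, 8000)
)

# Average daily steps per user
user_avg <- toy_activity %>%
  group_by(id) %>%
  summarise(avg_daily_steps = mean(totalsteps))

# Proportion of users whose average is under the 8,000-step reference line
share_below_8000 <- mean(user_avg$avg_daily_steps < 8000)
```

Taking the mean of a logical vector gives the fraction of `TRUE` values, so `share_below_8000` is the share of users under the threshold.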

Analyzing steps taken and calories burned

  # One point per user-day; geom_smooth() adds a smoothed trend line
  ggplot(data = activity, mapping = aes(x = totalsteps, y = calories, color = totalsteps)) + 
    geom_point() + geom_smooth() + 
    labs(title = "Total steps v/s Calories burned per day", x = "Total steps taken", y = "Calories burned", color = "Total steps taken") +
    theme_light()

The scatter plot above shows a positive relationship: days with more steps taken tend to be days with more calories burned.
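
The strength of this relationship can also be summarized numerically. The sketch below computes a Pearson correlation and the slope of a simple linear fit on made-up daily totals (`toy_days` is illustrative; the numbers are not taken from the dataset):

```r
# Hypothetical daily totals
toy_days <- data.frame(
  totalsteps = c(2000, 5000, 8000, 11000, 14000),
  calories   = c(1500, 1800, 2100, 2300, 2800)
)

# Pearson correlation between steps and calories (between -1 and 1)
r <- cor(toy_days$totalsteps, toy_days$calories)

# Slope of a simple linear fit: extra calories per additional step
fit <- lm(calories ~ totalsteps, data = toy_days)
slope <- coef(fit)[["totalsteps"]]
```

A correlation close to 1 would support the visual impression, although it says nothing about causation.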

Analyzing average hourly intensities of the users

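  # activityhour is already a date-time here, so derive date and hour from it directly
  intensities <- intensities %>% 
    mutate(date = as_date(activityhour),
           hour = format(activityhour, "%H:%M:%S")) %>% 
    select(-activityhour)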

  intensities_hourly <- intensities %>% 
    group_by(hour) %>% 
  summarise(avg_intensity =  mean(totalintensity))
  
  ggplot(data = intensities_hourly, mapping = aes(x = hour, y = avg_intensity, fill=avg_intensity))+
    geom_col()+ labs(title = "Intensity per hour", x="Hour", y="Average Intensity", fill="Average intensities")+
    scale_fill_gradient(low = "red", high = "darkblue")+
    theme(axis.text.x = element_text(angle = 90))

The bar chart above shows how the users' average activity intensity varies over the day. Intensity peaks between 5 PM and 7 PM, possibly because many people exercise after work.

Phase 5: Action:
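
The busiest hours can also be extracted from the hourly summary instead of being read off the chart. The sketch below uses dplyr's `slice_max()` on made-up hourly averages (`toy_hourly` and its values are illustrative only):

```r
library(dplyr)

# Hypothetical hourly averages
toy_hourly <- data.frame(
  hour = c("16:00:00", "17:00:00", "18:00:00", "19:00:00"),
  avg_intensity = c(17.7, 21.6, 21.9, 21.4)
)

# The three hours with the highest average intensity, highest first
top_hours <- toy_hourly %>%
  slice_max(avg_intensity, n = 3)
```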

Key Recommendations

  1. Many users average fewer than 8,000 steps a day. Device alerts could remind these users of their daily step count.
  2. Alerts that remind users to avoid prolonged idle time could be added to the devices.
  3. A feature could be added to calculate the total amount of calories consumed.