This report contains the following deliverables:
A clear summary of the business task
A description of all data sources used
Documentation of any cleaning or manipulation of data
Summary of analysis
Top high-level content recommendations based on analysis
I am a junior data analyst working on the marketing analyst team at Bellabeat. Bellabeat is a manufacturer of health-focused products for women. I have been asked to focus on one of Bellabeat’s products and analyze smart device data to gain insight into how consumers are using their smart devices. These insights will help guide marketing strategies for the company.
The data used in this analysis can be found at https://www.kaggle.com/arashnic/fitbit/code.
Author - Mobius
License - CC0: Public Domain
Description - These datasets were generated by respondents to a survey distributed via Amazon Mechanical Turk between 3/12/2016 and 05/12/2016. Thirty eligible users consented to the submission of personal tracker data.
Going forward, I will be following the data analysis process taught in the Google Data Analytics Certificate course:
Ask, Prepare, Process, Analyze, Share, Act
What is the problem?
For this assignment we are looking to explore Fitbit usage data to see if there are any identifiable patterns that can potentially be used by the Bellabeat marketing team.
Other relevant questions that remain unanswered:
Are all the users of the data set female?
Is there anything we should know about individual users that would lead us to expect atypical results?
How is the data generated / collected?
The data was collected from Amazon Mechanical Turk respondents, who submitted their Fitbit tracker data at the time of entry; some of it was device-recorded and some was self-reported.
Organize and Protect
There is no real personally identifiable information that can be tied directly to the users/participants in this data collection: no gender, name, address, city, email, etc.
Make sure data is unbiased/credible
For this, we will most likely need to go through the data and see what files are complete and thorough. We can check this by looking at all the missing values in each table. Let’s connect to the data.
PLEASE NOTE:
I have already gone ahead and created a SQLite database with Python. If you are curious about how I approached this, feel free to reach out to me via email.
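For anyone who would rather stay entirely in R, below is a minimal sketch of how a similar database could be built with RSQLite. This is not the exact approach I used; the folder path is hypothetical, and it assumes the Kaggle CSVs keep their original *_merged.csv file names.
# Hedged sketch only - not the exact approach used to build fitbit.db.
# Assumes the Kaggle CSVs sit in one folder and are named like "dailyActivity_merged.csv".
library(DBI)
library(RSQLite)
csv_dir   <- "C:\\path\\to\\fitbit_csvs"   # hypothetical folder
csv_files <- list.files(csv_dir, pattern = "\\.csv$", full.names = TRUE)
build_con <- dbConnect(RSQLite::SQLite(), "C:\\Users\\Phoenix\\fitbit.db")
for (f in csv_files) {
  # Table name = file name without the "_merged" suffix or ".csv" extension
  tbl <- sub("_merged$", "", tools::file_path_sans_ext(basename(f)))
  dbWriteTable(build_con, tbl, read.csv(f), overwrite = TRUE)
}
dbDisconnect(build_con)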
# Setting up CRAN repository for knit
r = getOption("repos")
r["CRAN"] = "http://cran.us.r-project.org"
options(repos = r)
# Packages to connect
install.packages("RSQLite")
## Installing package into 'C:/Users/Phoenix/Documents/R/win-library/4.1'
## (as 'lib' is unspecified)
## package 'RSQLite' successfully unpacked and MD5 sums checked
## Warning: cannot remove prior installation of package 'RSQLite'
## Warning in file.copy(savedcopy, lib, recursive = TRUE): problem copying C:
## \Users\Phoenix\Documents\R\win-library\4.1\00LOCK\RSQLite\libs\x64\RSQLite.dll
## to C:\Users\Phoenix\Documents\R\win-library\4.1\RSQLite\libs\x64\RSQLite.dll:
## Permission denied
## Warning: restored 'RSQLite'
##
## The downloaded binary packages are in
## C:\Users\Phoenix\AppData\Local\Temp\Rtmpg9lVCY\downloaded_packages
library(RSQLite)
# Set database (db) name and the driver
mydb <- "C:\\Users\\Phoenix\\fitbit.db"
con <- dbConnect(drv = SQLite(), dbname = mydb)
# Let's also test the connection by setting the working directory and connecting by file name
setwd("C:\\Users\\Phoenix")
conn <- dbConnect(RSQLite::SQLite(), "fitbit.db")
# View tables to make sure we are connected
dbListTables(conn)
## [1] "dailyActivity" "dailyCalories"
## [3] "dailyIntensities" "dailySteps"
## [5] "heartrate_seconds" "hourlyCalories"
## [7] "hourlyIntensities" "hourlySteps"
## [9] "minuteCaloriesNarrow" "minuteIntensitiesNarrow"
## [11] "minuteMETsNarrow" "minuteSleep"
## [13] "minuteStepsNarrow" "sleepDay"
## [15] "weightLogInfo"
Now that we are connected to the data, we can use SQL and R to create some initial data frames to work with. I will specifically be looking at the following tables:
* dailyActivity
* sleepDay
* weightLogInfo
This is because I plan on looking at trends/patterns by day rather than by minute or hour. Also, it appears that dailyActivity already contains all the information from dailyCalories, dailyIntensities, and dailySteps, just aggregated.
This analysis could likely be extended by looking at the hour/minute/second level for each user, but I will set those tables aside for now.
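As a quick spot check of the claim that dailyActivity is just the other daily tables aggregated, we can compare step totals across tables. This sketch assumes the dailySteps table kept the original CSV column names (ActivityDay, StepTotal):
# Hedged spot check: do the per-day step totals agree across the two tables?
# Assumes dailySteps has columns id, ActivityDay, StepTotal (the original CSV names).
mismatched_steps <- dbGetQuery(conn, "
  SELECT a.id, a.ActivityDate, a.TotalSteps, s.StepTotal
  FROM dailyActivity a
  JOIN dailySteps s
    ON a.id = s.id AND a.ActivityDate = s.ActivityDay
  WHERE a.TotalSteps <> s.StepTotal
")
nrow(mismatched_steps)   # 0 rows would support the 'same data, aggregated' assumption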
# dailyActivity
daily_activity <- dbGetQuery(conn,"SELECT * FROM dailyActivity")
# sleepDay
sleep_day <- dbGetQuery(conn,"SELECT * FROM sleepDay")
# weightLogInfo
weight_log_info <- dbGetQuery(conn,"SELECT * FROM weightLogInfo")
# Let's look at a quick example...
head(daily_activity)
## id ActivityDate TotalSteps TotalDistance TrackerDistance
## 1 1503960366 4/12/2016 13162 8.50 8.50
## 2 1503960366 4/13/2016 10735 6.97 6.97
## 3 1503960366 4/14/2016 10460 6.74 6.74
## 4 1503960366 4/15/2016 9762 6.28 6.28
## 5 1503960366 4/16/2016 12669 8.16 8.16
## 6 1503960366 4/17/2016 9705 6.48 6.48
## LoggedActivitiesDistance VeryActiveDistance ModeratelyActiveDistance
## 1 0 1.88 0.55
## 2 0 1.57 0.69
## 3 0 2.44 0.40
## 4 0 2.14 1.26
## 5 0 2.71 0.41
## 6 0 3.19 0.78
## LightActiveDistance SedentaryActiveDistance VeryActiveMinutes
## 1 6.06 0 25
## 2 4.71 0 21
## 3 3.91 0 30
## 4 2.83 0 29
## 5 5.04 0 36
## 6 2.51 0 38
## FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes Calories
## 1 13 328 728 1985
## 2 19 217 776 1797
## 3 11 181 1218 1776
## 4 34 209 726 1745
## 5 10 221 773 1863
## 6 20 164 539 1728
Now that we can look at the data as data frames in R, we can see where the missing values are.
# dailyActivity
sapply(daily_activity, function(x) sum(is.na(x)))
## id ActivityDate TotalSteps
## 0 0 0
## TotalDistance TrackerDistance LoggedActivitiesDistance
## 0 0 0
## VeryActiveDistance ModeratelyActiveDistance LightActiveDistance
## 0 0 0
## SedentaryActiveDistance VeryActiveMinutes FairlyActiveMinutes
## 0 0 0
## LightlyActiveMinutes SedentaryMinutes Calories
## 0 0 0
dailyActivity looks good!
# sleepDay
sapply(sleep_day, function(x) sum(is.na(x)))
## id SleepDay TotalSleepRecords TotalMinutesAsleep
## 0 0 0 0
## TotalTimeInBed
## 0
sleepDay looks good!
# weightLogInfo
sapply(weight_log_info, function(x) sum(is.na(x)))
## id Date WeightKg WeightPounds Fat
## 0 0 0 0 65
## BMI IsManualReport LogId
## 0 0 0
After reviewing the tables for NULLs, it looks like weightLogInfo is the biggest offender (65 missing values in the Fat column). We can switch gears and count the unique users and the number of entries each one has; this should tell us more about the quality of the data.
# dailyActivities
dbGetQuery(conn,
"
SELECT
id,
COUNT(id)
FROM
dailyActivity
GROUP BY
id
ORDER BY
COUNT(id) DESC
")
## id COUNT(id)
## 1 8877689391 31
## 2 8583815059 31
## 3 8378563200 31
## 4 8053475328 31
## 5 7086361926 31
## 6 6962181067 31
## 7 5553957443 31
## 8 4702921684 31
## 9 4558609924 31
## 10 4445114986 31
## 11 4388161847 31
## 12 4319703577 31
## 13 4020332650 31
## 14 2873212765 31
## 15 2320127002 31
## 16 2026352035 31
## 17 2022484408 31
## 18 1927972279 31
## 19 1844505072 31
## 20 1624580081 31
## 21 1503960366 31
## 22 5577150313 30
## 23 3977333714 30
## 24 1644430081 30
## 25 8792009665 29
## 26 6290855005 29
## 27 6117666160 28
## 28 7007744171 26
## 29 6775888955 26
## 30 3372868164 20
## 31 8253242879 19
## 32 2347167796 18
## 33 4057192912 4
Although the data description says there were 30 participants, there appear to be 33 different users in the dailyActivity table.
Also, the number of entries per user is not consistent, ranging from 4 to 31 records.
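A compact way to confirm both of these numbers in a single query (a small sketch):
# Distinct users plus the fewest and most daily entries per user
dbGetQuery(conn, "
  SELECT COUNT(*) AS users,
         MIN(n)   AS fewest_entries,
         MAX(n)   AS most_entries
  FROM (SELECT id, COUNT(*) AS n FROM dailyActivity GROUP BY id)
")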
# sleepDay
dbGetQuery(conn,
"
SELECT
id,
COUNT(id)
FROM
sleepDay
GROUP BY
id
ORDER BY
COUNT(id) DESC
")
## id COUNT(id)
## 1 8378563200 32
## 2 6962181067 31
## 3 5553957443 31
## 4 4702921684 28
## 5 4445114986 28
## 6 3977333714 28
## 7 2026352035 28
## 8 5577150313 26
## 9 4319703577 26
## 10 1503960366 25
## 11 7086361926 24
## 12 4388161847 24
## 13 6117666160 18
## 14 8792009665 15
## 15 2347167796 15
## 16 4020332650 8
## 17 4558609924 5
## 18 1927972279 5
## 19 1644430081 4
## 20 8053475328 3
## 21 6775888955 3
## 22 1844505072 3
## 23 7007744171 2
## 24 2320127002 1
Here we can see there are fewer users (24 IDs), with a wider range of entries (1 to 32 records).
# weightLogInfo
dbGetQuery(conn,
"
SELECT
id,
COUNT(id)
FROM
weightLogInfo
GROUP BY
id
ORDER BY
COUNT(id) DESC
")
## id COUNT(id)
## 1 6962181067 30
## 2 8877689391 24
## 3 4558609924 5
## 4 4319703577 2
## 5 2873212765 2
## 6 1503960366 2
## 7 5577150313 1
## 8 1927972279 1
We can see that users did not track their weight consistently during data collection. With only 8 users here, and only 2 of them having more than 5 records, we may want to remove this table from the analysis entirely, as it may not be reliable.
Looking further at the file in Excel, we can see that even when entries do exist, they do not cover a consistent time period (e.g., a check-in weight on 4/12/2016 versus one on 5/12/2016); a user's two entries are typically only a day apart.
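To confirm this without opening Excel, we can look at each user's first and last weigh-in date. The sketch below assumes the Date strings use the same m/d/Y text format seen in sleepDay:
# Hedged sketch: first and last weigh-in per user, plus the entry count
# (assumes the Date column is text in m/d/Y format)
weigh_dates <- as.Date(weight_log_info$Date, "%m/%d/%Y")
aggregate(weigh_dates,
          by  = list(id = as.character(weight_log_info$id)),
          FUN = function(d) paste(min(d), "to", max(d), "-", length(d), "entries"))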
At this point, we need to look more closely at our data and see how we can process it for better use. When viewing the head(daily_activity) results in my R terminal, I noticed that "ActivityDate" was stored as a character (chr) data type. Converting it can be the first task in transforming the data so we can use it better.
Please Note: weightLogInfo is being DROPPED at this point
# daily_activity before transformation
str(daily_activity)
## 'data.frame': 940 obs. of 15 variables:
## $ id :integer64 1503960366 1503960366 1503960366 1503960366 1503960366 1503960366 1503960366 1503960366 ...
## $ ActivityDate : chr "4/12/2016" "4/13/2016" "4/14/2016" "4/15/2016" ...
## $ TotalSteps : num 13162 10735 10460 9762 12669 ...
## $ TotalDistance : num 8.5 6.97 6.74 6.28 8.16 ...
## $ TrackerDistance : num 8.5 6.97 6.74 6.28 8.16 ...
## $ LoggedActivitiesDistance: num 0 0 0 0 0 0 0 0 0 0 ...
## $ VeryActiveDistance : num 1.88 1.57 2.44 2.14 2.71 ...
## $ ModeratelyActiveDistance: num 0.55 0.69 0.4 1.26 0.41 ...
## $ LightActiveDistance : num 6.06 4.71 3.91 2.83 5.04 ...
## $ SedentaryActiveDistance : num 0 0 0 0 0 0 0 0 0 0 ...
## $ VeryActiveMinutes : num 25 21 30 29 36 38 42 50 28 19 ...
## $ FairlyActiveMinutes : num 13 19 11 34 10 20 16 31 12 8 ...
## $ LightlyActiveMinutes : num 328 217 181 209 221 164 233 264 205 211 ...
## $ SedentaryMinutes : num 728 776 1218 726 773 ...
## $ Calories : num 1985 1797 1776 1745 1863 ...
# Transformation
daily_activity[[2]] <- as.Date(daily_activity[[2]], "%m/%d/%Y")
# daily_activity after transformation
str(daily_activity)
## 'data.frame': 940 obs. of 15 variables:
## $ id :integer64 1503960366 1503960366 1503960366 1503960366 1503960366 1503960366 1503960366 1503960366 ...
## $ ActivityDate : Date, format: "2016-04-12" "2016-04-13" ...
## $ TotalSteps : num 13162 10735 10460 9762 12669 ...
## $ TotalDistance : num 8.5 6.97 6.74 6.28 8.16 ...
## $ TrackerDistance : num 8.5 6.97 6.74 6.28 8.16 ...
## $ LoggedActivitiesDistance: num 0 0 0 0 0 0 0 0 0 0 ...
## $ VeryActiveDistance : num 1.88 1.57 2.44 2.14 2.71 ...
## $ ModeratelyActiveDistance: num 0.55 0.69 0.4 1.26 0.41 ...
## $ LightActiveDistance : num 6.06 4.71 3.91 2.83 5.04 ...
## $ SedentaryActiveDistance : num 0 0 0 0 0 0 0 0 0 0 ...
## $ VeryActiveMinutes : num 25 21 30 29 36 38 42 50 28 19 ...
## $ FairlyActiveMinutes : num 13 19 11 34 10 20 16 31 12 8 ...
## $ LightlyActiveMinutes : num 328 217 181 209 221 164 233 264 205 211 ...
## $ SedentaryMinutes : num 728 776 1218 726 773 ...
## $ Calories : num 1985 1797 1776 1745 1863 ...
Above we can see that the ActivityDate column has been updated to the Date data type.
When looking at the column names we can see a possible categorization between Distance and Minutes columns.
Distance: TotalDistance, TrackerDistance, LoggedActivitiesDistance, VeryActiveDistance, ModeratelyActiveDistance, LightActiveDistance, SedentaryActiveDistance
Minutes: VeryActiveMinutes, FairlyActiveMinutes, LightlyActiveMinutes, SedentaryMinutes
Seeing this makes me think that there should also be a "TotalMinutes" column, just as there is a total for distance.
Also, when looking at the data in a spreadsheet you can see that there is no real visual difference between TotalDistance and TrackerDistance.
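We can confirm that quickly in R rather than by eye (a small sketch):
# How many rows differ between TotalDistance and TrackerDistance, and by how much?
sum(daily_activity$TotalDistance != daily_activity$TrackerDistance)
summary(daily_activity$TotalDistance - daily_activity$TrackerDistance)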
Let's add up the minutes columns to see if we can get a full picture of the minutes being captured for each user per day. We will also add Hours and Days columns derived from TotalMinutes.
Additional Packages…
install.packages("tidyverse")
## Installing package into 'C:/Users/Phoenix/Documents/R/win-library/4.1'
## (as 'lib' is unspecified)
## package 'tidyverse' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\Phoenix\AppData\Local\Temp\Rtmpg9lVCY\downloaded_packages
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5 v purrr 0.3.4
## v tibble 3.1.4 v dplyr 1.0.7
## v tidyr 1.1.3 v stringr 1.4.0
## v readr 2.0.1 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
# Looking at this, I expect TotalMinutes to account for a significant
# amount of time each day.
# Let's check by creating the columns and viewing the data.
# (Inside a pipe, mutate() already receives the data frame, so it does not
# need to be passed in again.)
daily_activity <- daily_activity %>%
  mutate(TotalMinutes = VeryActiveMinutes +
           FairlyActiveMinutes +
           LightlyActiveMinutes +
           SedentaryMinutes,
         Hours = TotalMinutes / 60,
         Days = Hours / 24)
# Let's view the columns now and order it DESC in the viewer on the "Days" col.
# Using head() for RMarkdown code
head(daily_activity)
## id ActivityDate TotalSteps TotalDistance TrackerDistance
## 1 1503960366 2016-04-12 13162 8.50 8.50
## 2 1503960366 2016-04-13 10735 6.97 6.97
## 3 1503960366 2016-04-14 10460 6.74 6.74
## 4 1503960366 2016-04-15 9762 6.28 6.28
## 5 1503960366 2016-04-16 12669 8.16 8.16
## 6 1503960366 2016-04-17 9705 6.48 6.48
## LoggedActivitiesDistance VeryActiveDistance ModeratelyActiveDistance
## 1 0 1.88 0.55
## 2 0 1.57 0.69
## 3 0 2.44 0.40
## 4 0 2.14 1.26
## 5 0 2.71 0.41
## 6 0 3.19 0.78
## LightActiveDistance SedentaryActiveDistance VeryActiveMinutes
## 1 6.06 0 25
## 2 4.71 0 21
## 3 3.91 0 30
## 4 2.83 0 29
## 5 5.04 0 36
## 6 2.51 0 38
## FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes Calories
## 1 13 328 728 1985
## 2 19 217 776 1797
## 3 11 181 1218 1776
## 4 34 209 726 1745
## 5 10 221 773 1863
## 6 20 164 539 1728
## TotalMinutes Hours Days
## 1 1094 18.23333 0.7597222
## 2 1033 17.21667 0.7173611
## 3 1440 24.00000 1.0000000
## 4 998 16.63333 0.6930556
## 5 1040 17.33333 0.7222222
## 6 761 12.68333 0.5284722
As we can see, the Days column we created ranges from 0.001388889 to 1. Each row represents one user-day, so a user with 30 entries should correspond to 30 days of recorded data. However, when we look at the minutes actually recorded, we can see that at times less than 1% of the day is captured.
That is a significant data quality concern, and it did not surface until the Processing stage, after some transformations.
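A quick check makes the concern concrete (a small sketch):
# Range of daily coverage, and how many rows capture less than 1% of the day
range(daily_activity$Days)
sum(daily_activity$Days < 0.01)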
If we want the most reliable data, we may want to filter down to rows where at least 90% of the day was recorded.
# Filter the data down
filtered_daily_activity <- filter(daily_activity, Days >= .9)
# See structure now
str(filtered_daily_activity)
## 'data.frame': 522 obs. of 18 variables:
## $ id :integer64 1503960366 1503960366 1503960366 1503960366 1503960366 1503960366 1624580081 1624580081 ...
## $ ActivityDate : Date, format: "2016-04-14" "2016-04-18" ...
## $ TotalSteps : num 10460 13019 12764 18134 11100 ...
## $ TotalDistance : num 6.74 8.59 8.13 12.21 7.15 ...
## $ TrackerDistance : num 6.74 8.59 8.13 12.21 7.15 ...
## $ LoggedActivitiesDistance: num 0 0 0 0 0 0 0 0 0 0 ...
## $ VeryActiveDistance : num 2.44 3.25 4.76 6.4 2.46 ...
## $ ModeratelyActiveDistance: num 0.4 0.64 1.12 0.41 0.87 ...
## $ LightActiveDistance : num 3.91 4.71 2.24 5.41 3.82 ...
## $ SedentaryActiveDistance : num 0 0 0 0 0 ...
## $ VeryActiveMinutes : num 30 42 66 78 36 0 0 0 0 0 ...
## $ FairlyActiveMinutes : num 11 16 27 11 22 0 0 0 0 0 ...
## $ LightlyActiveMinutes : num 181 233 130 243 203 0 146 148 236 96 ...
## $ SedentaryMinutes : num 1218 1149 1217 1108 1179 ...
## $ Calories : num 1776 1921 1827 2159 1819 ...
## $ TotalMinutes : num 1440 1440 1440 1440 1440 1440 1440 1440 1440 1440 ...
## $ Hours : num 24 24 24 24 24 24 24 24 24 24 ...
## $ Days : num 1 1 1 1 1 1 1 1 1 1 ...
After filtering, we are down to 522 observations compared with the original 940 we started with.
Would this be an unacceptable cut for the analysis?
By doing this, I believe we will provide more complete data to the stakeholders. As already mentioned, additional data (well beyond 30 people) would be needed to make statistically significant findings, but this should be fine for some exploration.
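Before moving on, it is also worth checking how many of the 33 users still appear after the cut (a quick sketch):
# Users and observations remaining after the 90% filter
filtered_daily_activity %>%
  summarise(users = n_distinct(id), observations = n())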
Looking at the sleepDay data in a spreadsheet, it is quite straightforward. We can follow a similar process and then try adding the sleepDay data to daily_activity to get more variables into one data set.
# Quick view of the data...
head(sleep_day)
## id SleepDay TotalSleepRecords TotalMinutesAsleep
## 1 1503960366 4/12/2016 12:00:00 AM 1 327
## 2 1503960366 4/13/2016 12:00:00 AM 2 384
## 3 1503960366 4/15/2016 12:00:00 AM 1 412
## 4 1503960366 4/16/2016 12:00:00 AM 2 340
## 5 1503960366 4/17/2016 12:00:00 AM 1 700
## 6 1503960366 4/19/2016 12:00:00 AM 1 304
## TotalTimeInBed
## 1 346
## 2 407
## 3 442
## 4 367
## 5 712
## 6 320
# Let's compute the percentage of time in bed that was actually spent asleep
sleep_day <- sleep_day %>%
  mutate(percent_AsleepInBed = TotalMinutesAsleep / TotalTimeInBed)
# Personal Note: I have a hunch that this time is included in the
# SedentaryMinutes in the dailyActivity table.
head(sleep_day)
## id SleepDay TotalSleepRecords TotalMinutesAsleep
## 1 1503960366 4/12/2016 12:00:00 AM 1 327
## 2 1503960366 4/13/2016 12:00:00 AM 2 384
## 3 1503960366 4/15/2016 12:00:00 AM 1 412
## 4 1503960366 4/16/2016 12:00:00 AM 2 340
## 5 1503960366 4/17/2016 12:00:00 AM 1 700
## 6 1503960366 4/19/2016 12:00:00 AM 1 304
## TotalTimeInBed percent_AsleepInBed
## 1 346 0.9450867
## 2 407 0.9434889
## 3 442 0.9321267
## 4 367 0.9264305
## 5 712 0.9831461
## 6 320 0.9500000
percent_AsleepInBed may help us identify the quality of sleep among different users.
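A quick look at how that ratio is distributed (a sketch):
# Distribution of the share of time in bed actually spent asleep
summary(sleep_day$percent_AsleepInBed)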
Now we can bring in the sleep_day data and combine it with the filtered_daily_activity data we created.
# As with daily_activity, the date column here is a character data type.
# We first need to convert "SleepDay" into a Date data type as well.
sleep_day[[2]] <- as.Date(sleep_day[[2]], "%m/%d/%Y %H:%M:%S")
head(sleep_day)
## id SleepDay TotalSleepRecords TotalMinutesAsleep TotalTimeInBed
## 1 1503960366 2016-04-12 1 327 346
## 2 1503960366 2016-04-13 2 384 407
## 3 1503960366 2016-04-15 1 412 442
## 4 1503960366 2016-04-16 2 340 367
## 5 1503960366 2016-04-17 1 700 712
## 6 1503960366 2016-04-19 1 304 320
## percent_AsleepInBed
## 1 0.9450867
## 2 0.9434889
## 3 0.9321267
## 4 0.9264305
## 5 0.9831461
## 6 0.9500000
# Next, we will join the data together ON the "id" and the "date" columns.
joined_dailyA_sleepD <- left_join(filtered_daily_activity, sleep_day,
by = c("id" = "id",
"ActivityDate" = "SleepDay"))
# View joined data
# Using head() for RMarkdown
head(joined_dailyA_sleepD)
## id ActivityDate TotalSteps TotalDistance TrackerDistance
## 1 1503960366 2016-04-14 10460 6.74 6.74
## 2 1503960366 2016-04-18 13019 8.59 8.59
## 3 1503960366 2016-04-22 12764 8.13 8.13
## 4 1503960366 2016-04-27 18134 12.21 12.21
## 5 1503960366 2016-05-04 11100 7.15 7.15
## 6 1503960366 2016-05-12 0 0.00 0.00
## LoggedActivitiesDistance VeryActiveDistance ModeratelyActiveDistance
## 1 0 2.44 0.40
## 2 0 3.25 0.64
## 3 0 4.76 1.12
## 4 0 6.40 0.41
## 5 0 2.46 0.87
## 6 0 0.00 0.00
## LightActiveDistance SedentaryActiveDistance VeryActiveMinutes
## 1 3.91 0 30
## 2 4.71 0 42
## 3 2.24 0 66
## 4 5.41 0 78
## 5 3.82 0 36
## 6 0.00 0 0
## FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes Calories
## 1 11 181 1218 1776
## 2 16 233 1149 1921
## 3 27 130 1217 1827
## 4 11 243 1108 2159
## 5 22 203 1179 1819
## 6 0 0 1440 0
## TotalMinutes Hours Days TotalSleepRecords TotalMinutesAsleep TotalTimeInBed
## 1 1440 24 1 NA NA NA
## 2 1440 24 1 NA NA NA
## 3 1440 24 1 NA NA NA
## 4 1440 24 1 NA NA NA
## 5 1440 24 1 NA NA NA
## 6 1440 24 1 NA NA NA
## percent_AsleepInBed
## 1 NA
## 2 NA
## 3 NA
## 4 NA
## 5 NA
## 6 NA
With a left join, we keep all rows from the filtered_daily_activity table and tack on sleep information wherever there is a match on id and date.
Unfortunately, the join is not very successful, so no real relationship between activity levels and sleep can be established here.
# Get number of NULLs and then compare to overall 522 observations
sapply(joined_dailyA_sleepD, function(x) sum(is.na(x)))
## id ActivityDate TotalSteps
## 0 0 0
## TotalDistance TrackerDistance LoggedActivitiesDistance
## 0 0 0
## VeryActiveDistance ModeratelyActiveDistance LightActiveDistance
## 0 0 0
## SedentaryActiveDistance VeryActiveMinutes FairlyActiveMinutes
## 0 0 0
## LightlyActiveMinutes SedentaryMinutes Calories
## 0 0 0
## TotalMinutes Hours Days
## 0 0 0
## TotalSleepRecords TotalMinutesAsleep TotalTimeInBed
## 505 505 505
## percent_AsleepInBed
## 505
# In fact only 17 records were matched
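That matched count can also be pulled straight from the joined data frame (a quick sketch):
# Rows where the sleep columns actually joined (522 - 505 = 17)
sum(!is.na(joined_dailyA_sleepD$TotalMinutesAsleep))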
Moving forward…
With this in mind, we will have to look at the "filtered_daily_activity" and "sleep_day" data separately.
For my analysis today, I will not include the sleep_day data and will focus on the filtered_daily_activity data set, because it appears to be the most complete and thorough.
Please Note: sleepDay is being DROPPED at this point
First, we will want to see what initial patterns exist in the data.
To do this, I will use histograms to look at the distributions present in the data.
# Install and load needed packages
install.packages("Hmisc")
## Installing package into 'C:/Users/Phoenix/Documents/R/win-library/4.1'
## (as 'lib' is unspecified)
## package 'Hmisc' successfully unpacked and MD5 sums checked
## Warning: cannot remove prior installation of package 'Hmisc'
## Warning in file.copy(savedcopy, lib, recursive = TRUE): problem copying C:
## \Users\Phoenix\Documents\R\win-library\4.1\00LOCK\Hmisc\libs\x64\Hmisc.dll to C:
## \Users\Phoenix\Documents\R\win-library\4.1\Hmisc\libs\x64\Hmisc.dll: Permission
## denied
## Warning: restored 'Hmisc'
##
## The downloaded binary packages are in
## C:\Users\Phoenix\AppData\Local\Temp\Rtmpg9lVCY\downloaded_packages
library(Hmisc)
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:dplyr':
##
## src, summarize
## The following objects are masked from 'package:base':
##
## format.pval, units
# Plot the histograms
hist.data.frame(filtered_daily_activity)
## Warning in pretty.default(range(x), n = breaks, min.n = 1): R_pretty(): very
## small range 'cell'=1.40119e-315, corrected to 2.122e-314
## Warning in plot.window(xlim, ylim, "", ...): R_pretty(): very small range
## 'cell'=1.296e-314, corrected to 2.122e-314
From here I can see a few distributions that we may want to enlarge for better visibility.
# Calories
ggplot(filtered_daily_activity)+
geom_histogram(mapping = aes(x = Calories))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# SedentaryMinutes
ggplot(filtered_daily_activity)+
geom_histogram(mapping = aes(x = SedentaryMinutes))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# LightlyActiveMinutes
ggplot(filtered_daily_activity)+
geom_histogram(mapping = aes(x = LightlyActiveMinutes))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# FairlyActiveMinutes
ggplot(filtered_daily_activity)+
geom_histogram(mapping = aes(x = FairlyActiveMinutes))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# VeryActiveMinutes
ggplot(filtered_daily_activity)+
geom_histogram(mapping = aes(x = VeryActiveMinutes))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
From the start we can see that a few outliers exist; I am going to leave them in for now.
The main chart I want to bring attention to is the “Calories” one because there appears to be a bi-modal distribution. The data point at 0 must be an outlier since it should be impossible to burn 0 calories in a day.
Regardless, bi-modal distributions are great for identifying different groups within the data. This is the first notable find in our analysis and may let us segment the data a little. I do want to mention that the data set may still be too small to confirm a genuinely bi-modal distribution, but we will work with it for now.
Next, we will want to see if there are any inherent relationships among the data after we separate the data into two groups. We will use the cor() function. If we look back at the Calories histogram we saw two peaks.
Peak 1: 2000 or so calories
Peak 2: 2750 or so calories
2000 + 2750 = 4750
4750 / 2 = 2375 calorie divider
So we can create two data frames now using 2,375 calories as the divider…
# Group 1 (lower average calories)
## Let's use this opportunity to prune out some outliers: there are some
## entries recorded with '0' calories.
group_1_low <- filter(daily_activity, Calories > 0 & Calories <= 2375)
# Group 2 (higher average calories)
group_2_high <- filter(daily_activity, Calories > 2375)
# Set up the data frame without any dates or other classifying info
#### group_1_low
cor_group_1 <- group_1_low %>%
subset( , select = -c(id, ActivityDate))
#### group_2_high
cor_group_2 <- group_2_high %>%
subset( , select = -c(id, ActivityDate))
# Install needed package to format the data
install.packages("reshape2")
## package 'reshape2' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\Phoenix\AppData\Local\Temp\Rtmpg9lVCY\downloaded_packages
library(reshape2)
# Reshape the data with upper.tri() and melt()
# Setting the upper triangle to NA lets us keep just half of the symmetric matrix
#### group_1_low
matrix_group_1 <- round(cor(cor_group_1), digits = 2)
matrix_group_1[upper.tri(matrix_group_1)]<-NA
#data.frame(matrix_group_1)
#### group_2_high
matrix_group_2 <- round(cor(cor_group_2), digits = 2)
matrix_group_2[upper.tri(matrix_group_2)]<-NA
#data.frame(matrix_group_2)
# Use the installed package to 'melt' it. This creates a long view with new
# column names Var1 and Var2.
#### group_1_low
melt_group_1 <- melt(matrix_group_1, na.rm = TRUE)
#### group_2_high
melt_group_2 <- melt(matrix_group_2, na.rm = TRUE)
# Now that the data has been reshaped, we can plot it
#### group_1_low
ggplot(data = melt_group_1)+
geom_tile(mapping = aes(x = Var1, y = Var2, fill = value), color = "white")+
scale_fill_gradient2(low = "red", mid = "white", high = "blue",
midpoint = 0, limit = c(-1,1), space = "Lab",
name = "Correlation (Pearson/Default")+
theme_classic()+
theme(axis.text.x = element_text(angle = 90, vjust = 0, hjust = 0))+
labs(title = "Group 1: Low Calorie Burners")+
annotate("rect", xmin = 4, xmax = 12,
ymin = .5, ymax = 4,
alpha = .1, color = "green")+
annotate("rect", xmin = 11.4, xmax = 12.6,
ymin = .5, ymax = 14,
alpha = .1, color = "green")
#### group_2_high
ggplot(data = melt_group_2)+
geom_tile(mapping = aes(x = Var1, y = Var2, fill = value), color = "white")+
scale_fill_gradient2(low = "red", mid = "white", high = "blue",
midpoint = 0, limit = c(-1,1), space = "Lab",
name = "Correlation (Pearson/Default")+
theme_classic()+
theme(axis.text.x = element_text(angle = 90, vjust = 0, hjust = 0))+
labs(title = "Group 2: High Calorie Burners")+
annotate("rect", xmin = 4, xmax = 12,
ymin = .5, ymax = 4,
alpha = .1, color = "green")+
annotate("rect", xmin = 11.4, xmax = 12.6,
ymin = .5, ymax = 14,
alpha = .1, color = "green")
I want to take a moment to compare the two heat maps above. Group 1: Low Calorie Burners appears to have a fairly different landscape than Group 2: High Calorie Burners.
Distance & Minutes
Here, when evaluating distance and minutes, we already expect each pair to be highly correlated with one another (e.g., VeryActiveDistance ~ VeryActiveMinutes).
What we see in the lower green rectangle for Group 1 is that all of the distance types and minute types matter, especially LightActiveDistance and LightlyActiveMinutes.
When we look at Group 2, we begin to see a different story: those same relationships are weaker and pale next to the strength of VeryActiveDistance and VeryActiveMinutes.
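As a rough numeric spot check of those observations, outside of the heat maps (a sketch):
# Correlation of the light and very-active distance/minute pairs in each group
sapply(list(group_1 = group_1_low, group_2 = group_2_high), function(g)
  c(light_pair = cor(g$LightActiveDistance, g$LightlyActiveMinutes),
    very_pair  = cor(g$VeryActiveDistance,  g$VeryActiveMinutes)))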
SedentaryMinutes
For Group 1, SedentaryMinutes has almost exclusively negative relationships across the board…
For Group 2, SedentaryMinutes is not as prominent and has less of an effect on TotalDistance, LightlyActiveMinutes, etc…
By looking at both of these groups, we can see that they do act differently.
Group 1 burns fewer calories overall. We can also see stronger relationships when it comes to light activity, which we can attribute to walking or other light exercise. These users may not be looking to become professional runners or to train for anything in particular; they may just be curious about tracking their data, or trying to increase their total steps and get out of the house.
On the other hand…
Group 2 has a stronger relationship with very active minutes and distance. These may be users who are into sports, or into a variety of pursuits that have them spending more time on demanding exercise.
As a marketer, it would be important to know how to classify current Bellabeat customers and ensure that marketing material is tailored to each type of user.
Now that we understand a little about these groups, we can check whether there are any timing differences in when they are active.
# First we need to make sure we have a column for the weekday.
#### group_1_low
wkd_group_1 <- mutate(group_1_low, weekday = weekdays(ActivityDate))
# Sort the data by weekday
wkd_group_1$weekday <- factor(wkd_group_1$weekday,
levels = c("Sunday",
"Monday",
"Tuesday",
"Wednesday",
"Thursday",
"Friday",
"Saturday"))
# Apply the change
wkd_group_1 <- wkd_group_1[order(wkd_group_1$weekday), ]
#### group_2_high
wkd_group_2 <- mutate(group_2_high, weekday = weekdays(ActivityDate))
# Sort the data by weekday
wkd_group_2$weekday <- factor(wkd_group_2$weekday,
levels = c("Sunday",
"Monday",
"Tuesday",
"Wednesday",
"Thursday",
"Friday",
"Saturday"))
# Apply the change
wkd_group_2 <- wkd_group_2[order(wkd_group_2$weekday), ]
For this next bit of the analysis I am going to switch to using Tableau.
# Export workable CSV files
## group_1_low
write.csv(wkd_group_1, "C:\\Users\\Phoenix\\Desktop\\Kaggle Data Sets\\Fitbit Fitness Tracker Data\\created data\\.group_1.csv")
## group_2_high
write.csv(wkd_group_2, "C:\\Users\\Phoenix\\Desktop\\Kaggle Data Sets\\Fitbit Fitness Tracker Data\\created data\\.group_2.csv")
The goal is to see where these different users spend their time each day. To do this, we use the weekday column we just created and calculate the percentage of time spent in each activity type (e.g., VeryActiveMinutes).
Percentage of “Very Active” minutes = VeryActiveMinutes / TotalMinutes
Once we have the percentages, we can average them by weekday to see where people, on average, spend their time. If you do not see the embedded dashboard below, please click on the link.
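For reference, roughly the same aggregation can be previewed in R before exporting. This is a sketch for group 1; the same idea applies to wkd_group_2:
# Average share of the day spent in each activity type, by weekday (group 1)
wkd_group_1 %>%
  group_by(weekday) %>%
  summarise(pct_very_active    = mean(VeryActiveMinutes    / TotalMinutes),
            pct_fairly_active  = mean(FairlyActiveMinutes  / TotalMinutes),
            pct_lightly_active = mean(LightlyActiveMinutes / TotalMinutes),
            pct_sedentary      = mean(SedentaryMinutes     / TotalMinutes))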
Now it's up to Bellabeat to pursue additional data collection and adjust their marketing strategy accordingly. It has been fun going through the Google Data Analytics Professional Certificate program. I'm new to R, so if anyone has any tips, they would be appreciated.
Thanks, PT