This report contains the following deliverables:
A clear summary of the business task
A description of all data sources used
Documentation of any cleaning or manipulation of data
Summary of analysis
Top high-level content recommendations based on analysis
I am a junior data analyst working on the marketing analyst team at Bellabeat. Bellabeat is a manufacturer of health-focused products for women. I have been asked to focus on one of Bellabeat’s products and analyze smart device data to gain insight into how consumers are using their smart devices. These insights will help guide marketing strategies for the company.
The data used in this analysis can be found at https://www.kaggle.com/arashnic/fitbit/code.
Author - Mobius
License - CC0: Public Domain
Description - These datasets were generated by respondents to a survey distributed via Amazon Mechanical Turk between 3/12/2016 and 05/12/2016. Thirty eligible users consented to the submission of personal tracker data.
Going forward, I will be following the data analysis process taught in the Google Data Analytics Certificate course:
Ask, Prepare, Process, Analyze, Share, Act
What is the problem?
For this assignment we are looking to explore Fitbit usage data to see if there are any identifiable patterns that can potentially be used by the Bellabeat marketing team.
Other relevant questions that remain unanswered:
Are all the users of the data set female?
Is there anything we should know about individual users that would lead us to expect atypical results?
How is the data generated / collected?
The data was collected from Amazon Mechanical Turk respondents, who submitted their Fitbit tracker data at the time of entry; some of it was device-recorded and some was self-reported.
Organize and Protect
There is no real personally identifiable information that can be tied directly to the users/participants in this data collection: no gender, name, address, city, email, etc.
Make sure data is unbiased/credible
For this, we will most likely need to go through the data and see what files are complete and thorough. We can check this by looking at all the missing values in each table. Let’s connect to the data.
PLEASE NOTE:
I have already gone ahead and created a SQLite database with Python. If you are curious about how I approached this, feel free to reach out to me via email.
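For anyone who would rather stay entirely in R, below is a minimal sketch of how a similar database could be built with RSQLite. This is not the exact approach I used; the folder path is hypothetical, and it assumes the Kaggle CSVs keep their original *_merged.csv file names.
# Hedged sketch only - not the exact approach used to build fitbit.db.
# Assumes the Kaggle CSVs sit in one folder and are named like "dailyActivity_merged.csv".
library(DBI)
library(RSQLite)
csv_dir   <- "C:\\path\\to\\fitbit_csvs"   # hypothetical folder
csv_files <- list.files(csv_dir, pattern = "\\.csv$", full.names = TRUE)
build_con <- dbConnect(RSQLite::SQLite(), "C:\\Users\\Phoenix\\fitbit.db")
for (f in csv_files) {
  # Table name = file name without the "_merged" suffix or ".csv" extension
  tbl <- sub("_merged$", "", tools::file_path_sans_ext(basename(f)))
  dbWriteTable(build_con, tbl, read.csv(f), overwrite = TRUE)
}
dbDisconnect(build_con)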
# Setting up CRAN repository for knit
r = getOption("repos")
r["CRAN"] = "http://cran.us.r-project.org"
options(repos = r)
# Packages to connect
install.packages("RSQLite")
## Installing package into 'C:/Users/Phoenix/Documents/R/win-library/4.1'
## (as 'lib' is unspecified)
## package 'RSQLite' successfully unpacked and MD5 sums checked
## Warning: cannot remove prior installation of package 'RSQLite'
## Warning in file.copy(savedcopy, lib, recursive = TRUE): problem copying C:
## \Users\Phoenix\Documents\R\win-library\4.1\00LOCK\RSQLite\libs\x64\RSQLite.dll
## to C:\Users\Phoenix\Documents\R\win-library\4.1\RSQLite\libs\x64\RSQLite.dll:
## Permission denied
## Warning: restored 'RSQLite'
##
## The downloaded binary packages are in
## C:\Users\Phoenix\AppData\Local\Temp\Rtmpg9lVCY\downloaded_packages
library(RSQLite)
# Set database (db) name and the driver
mydb <- "C:\\Users\\Phoenix\\fitbit.db"
con <- dbConnect(drv = SQLite(), dbname = mydb)
# Let's also test the connection by setting the working directory and connecting by file name
setwd("C:\\Users\\Phoenix")
conn <- dbConnect(RSQLite::SQLite(), "fitbit.db")
# View tables to make sure we are connected
dbListTables(conn)
## [1] "dailyActivity" "dailyCalories"
## [3] "dailyIntensities" "dailySteps"
## [5] "heartrate_seconds" "hourlyCalories"
## [7] "hourlyIntensities" "hourlySteps"
## [9] "minuteCaloriesNarrow" "minuteIntensitiesNarrow"
## [11] "minuteMETsNarrow" "minuteSleep"
## [13] "minuteStepsNarrow" "sleepDay"
## [15] "weightLogInfo"
Now that we are connected to the data, we can use SQL and R to create some initial data frames to work with. I will specifically be looking at the following tables:
* dailyActivity
* sleepDay
* weightLogInfo
This is because I plan on looking at trends/patterns by day rather than by minute or hour. Also, it appears that dailyActivity already contains all the information from dailyCalories, dailyIntensities, and dailySteps, just aggregated.
This analysis could likely be extended by looking at the hour/minute/second level for each user, but I will set those tables aside for now.
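As a quick spot check of the claim that dailyActivity is just the other daily tables aggregated, we can compare step totals across tables. This sketch assumes the dailySteps table kept the original CSV column names (ActivityDay, StepTotal):
# Hedged spot check: do the per-day step totals agree across the two tables?
# Assumes dailySteps has columns id, ActivityDay, StepTotal (the original CSV names).
mismatched_steps <- dbGetQuery(conn, "
  SELECT a.id, a.ActivityDate, a.TotalSteps, s.StepTotal
  FROM dailyActivity a
  JOIN dailySteps s
    ON a.id = s.id AND a.ActivityDate = s.ActivityDay
  WHERE a.TotalSteps <> s.StepTotal
")
nrow(mismatched_steps)   # 0 rows would support the 'same data, aggregated' assumption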
# dailyActivity
daily_activity <- dbGetQuery(conn,"SELECT * FROM dailyActivity")
# sleepDay
sleep_day <- dbGetQuery(conn,"SELECT * FROM sleepDay")
# weightLogInfo
weight_log_info <- dbGetQuery(conn,"SELECT * FROM weightLogInfo")
# Let's look at a quick example...
head(daily_activity)
## id ActivityDate TotalSteps TotalDistance TrackerDistance
## 1 1503960366 4/12/2016 13162 8.50 8.50
## 2 1503960366 4/13/2016 10735 6.97 6.97
## 3 1503960366 4/14/2016 10460 6.74 6.74
## 4 1503960366 4/15/2016 9762 6.28 6.28
## 5 1503960366 4/16/2016 12669 8.16 8.16
## 6 1503960366 4/17/2016 9705 6.48 6.48
## LoggedActivitiesDistance VeryActiveDistance ModeratelyActiveDistance
## 1 0 1.88 0.55
## 2 0 1.57 0.69
## 3 0 2.44 0.40
## 4 0 2.14 1.26
## 5 0 2.71 0.41
## 6 0 3.19 0.78
## LightActiveDistance SedentaryActiveDistance VeryActiveMinutes
## 1 6.06 0 25
## 2 4.71 0 21
## 3 3.91 0 30
## 4 2.83 0 29
## 5 5.04 0 36
## 6 2.51 0 38
## FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes Calories
## 1 13 328 728 1985
## 2 19 217 776 1797
## 3 11 181 1218 1776
## 4 34 209 726 1745
## 5 10 221 773 1863
## 6 20 164 539 1728
Now that we can look at the data as data frames in R, we can see where the missing values are.
# dailyActivity
sapply(daily_activity, function(x) sum(is.na(x)))
## id ActivityDate TotalSteps
## 0 0 0
## TotalDistance TrackerDistance LoggedActivitiesDistance
## 0 0 0
## VeryActiveDistance ModeratelyActiveDistance LightActiveDistance
## 0 0 0
## SedentaryActiveDistance VeryActiveMinutes FairlyActiveMinutes
## 0 0 0
## LightlyActiveMinutes SedentaryMinutes Calories
## 0 0 0
dailyActivity looks good!
# sleepDay
sapply(sleep_day, function(x) sum(is.na(x)))
## id SleepDay TotalSleepRecords TotalMinutesAsleep
## 0 0 0 0
## TotalTimeInBed
## 0
sleepDay looks good!
# weightLogInfo
sapply(weight_log_info, function(x) sum(is.na(x)))
## id Date WeightKg WeightPounds Fat
## 0 0 0 0 65
## BMI IsManualReport LogId
## 0 0 0
After reviewing the tables for NULLs, it looks like weightLogInfo is the biggest offender (65 missing values in the Fat column). We can switch gears and count the unique users and the number of entries each one has; this should tell us more about the quality of the data.
# dailyActivities
dbGetQuery(conn,
"
SELECT
id,
COUNT(id)
FROM
dailyActivity
GROUP BY
id
ORDER BY
COUNT(id) DESC
")
## id COUNT(id)
## 1 8877689391 31
## 2 8583815059 31
## 3 8378563200 31
## 4 8053475328 31
## 5 7086361926 31
## 6 6962181067 31
## 7 5553957443 31
## 8 4702921684 31
## 9 4558609924 31
## 10 4445114986 31
## 11 4388161847 31
## 12 4319703577 31
## 13 4020332650 31
## 14 2873212765 31
## 15 2320127002 31
## 16 2026352035 31
## 17 2022484408 31
## 18 1927972279 31
## 19 1844505072 31
## 20 1624580081 31
## 21 1503960366 31
## 22 5577150313 30
## 23 3977333714 30
## 24 1644430081 30
## 25 8792009665 29
## 26 6290855005 29
## 27 6117666160 28
## 28 7007744171 26
## 29 6775888955 26
## 30 3372868164 20
## 31 8253242879 19
## 32 2347167796 18
## 33 4057192912 4
Although the data description says there were 30 participants, there appear to be 33 different users in the dailyActivity table.
Also, the number of entries per user is not consistent, ranging from 4 to 31 records.
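A compact way to confirm both of these numbers in a single query (a small sketch):
# Distinct users plus the fewest and most daily entries per user
dbGetQuery(conn, "
  SELECT COUNT(*) AS users,
         MIN(n)   AS fewest_entries,
         MAX(n)   AS most_entries
  FROM (SELECT id, COUNT(*) AS n FROM dailyActivity GROUP BY id)
")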
# sleepDay
dbGetQuery(conn,
"
SELECT
id,
COUNT(id)
FROM
sleepDay
GROUP BY
id
ORDER BY
COUNT(id) DESC
")
## id COUNT(id)
## 1 8378563200 32
## 2 6962181067 31
## 3 5553957443 31
## 4 4702921684 28
## 5 4445114986 28
## 6 3977333714 28
## 7 2026352035 28
## 8 5577150313 26
## 9 4319703577 26
## 10 1503960366 25
## 11 7086361926 24
## 12 4388161847 24
## 13 6117666160 18
## 14 8792009665 15
## 15 2347167796 15
## 16 4020332650 8
## 17 4558609924 5
## 18 1927972279 5
## 19 1644430081 4
## 20 8053475328 3
## 21 6775888955 3
## 22 1844505072 3
## 23 7007744171 2
## 24 2320127002 1
Here we can see there are fewer users (24 IDs), with a wider range of entries (1 to 32 records).
# weightLogInfo
dbGetQuery(conn,
"
SELECT
id,
COUNT(id)
FROM
weightLogInfo
GROUP BY
id
ORDER BY
COUNT(id) DESC
")
## id COUNT(id)
## 1 6962181067 30
## 2 8877689391 24
## 3 4558609924 5
## 4 4319703577 2
## 5 2873212765 2
## 6 1503960366 2
## 7 5577150313 1
## 8 1927972279 1
We can see that users did not track their weight consistently during data collection. With only 8 users here, and only 2 of them having more than 5 records, we may want to remove this table from the analysis entirely, as it may not be reliable.
Looking further at the file in Excel, we can see that even when entries do exist, they do not cover a consistent time period (e.g., a check-in weight on 4/12/2016 versus one on 5/12/2016); a user's two entries are typically only a day apart.
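To confirm this without opening Excel, we can look at each user's first and last weigh-in date. The sketch below assumes the Date strings use the same m/d/Y text format seen in sleepDay:
# Hedged sketch: first and last weigh-in per user, plus the entry count
# (assumes the Date column is text in m/d/Y format)
weigh_dates <- as.Date(weight_log_info$Date, "%m/%d/%Y")
aggregate(weigh_dates,
          by  = list(id = as.character(weight_log_info$id)),
          FUN = function(d) paste(min(d), "to", max(d), "-", length(d), "entries"))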
At this point, we need to look more closely at our data and see how we can process it for better use. When viewing the head(daily_activity) results in my R terminal, I noticed that "ActivityDate" was stored as a character (chr) data type. Converting it can be the first task in transforming the data so we can use it better.
Please Note: weightLogInfo is being DROPPED at this point
# daily_activity before transformation
str(daily_activity)
## 'data.frame': 940 obs. of 15 variables:
## $ id :integer64 1503960366 1503960366 1503960366 1503960366 1503960366 1503960366 1503960366 1503960366 ...
## $ ActivityDate : chr "4/12/2016" "4/13/2016" "4/14/2016" "4/15/2016" ...
## $ TotalSteps : num 13162 10735 10460 9762 12669 ...
## $ TotalDistance : num 8.5 6.97 6.74 6.28 8.16 ...
## $ TrackerDistance : num 8.5 6.97 6.74 6.28 8.16 ...
## $ LoggedActivitiesDistance: num 0 0 0 0 0 0 0 0 0 0 ...
## $ VeryActiveDistance : num 1.88 1.57 2.44 2.14 2.71 ...
## $ ModeratelyActiveDistance: num 0.55 0.69 0.4 1.26 0.41 ...
## $ LightActiveDistance : num 6.06 4.71 3.91 2.83 5.04 ...
## $ SedentaryActiveDistance : num 0 0 0 0 0 0 0 0 0 0 ...
## $ VeryActiveMinutes : num 25 21 30 29 36 38 42 50 28 19 ...
## $ FairlyActiveMinutes : num 13 19 11 34 10 20 16 31 12 8 ...
## $ LightlyActiveMinutes : num 328 217 181 209 221 164 233 264 205 211 ...
## $ SedentaryMinutes : num 728 776 1218 726 773 ...
## $ Calories : num 1985 1797 1776 1745 1863 ...
# Transformation
daily_activity[[2]] <- as.Date(daily_activity[[2]], "%m/%d/%Y")
# daily_activity after transformation
str(daily_activity)
## 'data.frame': 940 obs. of 15 variables:
## $ id :integer64 1503960366 1503960366 1503960366 1503960366 1503960366 1503960366 1503960366 1503960366 ...
## $ ActivityDate : Date, format: "2016-04-12" "2016-04-13" ...
## $ TotalSteps : num 13162 10735 10460 9762 12669 ...
## $ TotalDistance : num 8.5 6.97 6.74 6.28 8.16 ...
## $ TrackerDistance : num 8.5 6.97 6.74 6.28 8.16 ...
## $ LoggedActivitiesDistance: num 0 0 0 0 0 0 0 0 0 0 ...
## $ VeryActiveDistance : num 1.88 1.57 2.44 2.14 2.71 ...
## $ ModeratelyActiveDistance: num 0.55 0.69 0.4 1.26 0.41 ...
## $ LightActiveDistance : num 6.06 4.71 3.91 2.83 5.04 ...
## $ SedentaryActiveDistance : num 0 0 0 0 0 0 0 0 0 0 ...
## $ VeryActiveMinutes : num 25 21 30 29 36 38 42 50 28 19 ...
## $ FairlyActiveMinutes : num 13 19 11 34 10 20 16 31 12 8 ...
## $ LightlyActiveMinutes : num 328 217 181 209 221 164 233 264 205 211 ...
## $ SedentaryMinutes : num 728 776 1218 726 773 ...
## $ Calories : num 1985 1797 1776 1745 1863 ...
Above we can see that the ActivityDate column has been updated to the Date data type.
When looking at the column names we can see a possible categorization between Distance and Minutes columns.
Distance: TotalDistance, TrackerDistance, LoggedActivitiesDistance, VeryActiveDistance, ModeratelyActiveDistance, LightActiveDistance, SedentaryActiveDistance
Minutes: VeryActiveMinutes, FairlyActiveMinutes, LightlyActiveMinutes, SedentaryMinutes
Seeing this makes me think that there should also be a "TotalMinutes" column, just as there is a total for distance.
Also, when looking at the data in a spreadsheet you can see that there is no real visual difference between TotalDistance and TrackerDistance.
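We can confirm that quickly in R rather than by eye (a small sketch):
# How many rows differ between TotalDistance and TrackerDistance, and by how much?
sum(daily_activity$TotalDistance != daily_activity$TrackerDistance)
summary(daily_activity$TotalDistance - daily_activity$TrackerDistance)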
Let's add up the minutes columns to see if we can get a full picture of the minutes being captured for each user per day. We will also add Hours and Days columns derived from TotalMinutes.
Additional Packages…
install.packages("tidyverse")
## Installing package into 'C:/Users/Phoenix/Documents/R/win-library/4.1'
## (as 'lib' is unspecified)
## package 'tidyverse' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\Phoenix\AppData\Local\Temp\Rtmpg9lVCY\downloaded_packages
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5 v purrr 0.3.4
## v tibble 3.1.4 v dplyr 1.0.7
## v tidyr 1.1.3 v stringr 1.4.0
## v readr 2.0.1 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
# Looking at this, I expect TotalMinutes to account for a significant
# amount of time each day.
# Let's check by creating the columns and viewing the data.
# (Inside a pipe, mutate() already receives the data frame, so it does not
# need to be passed in again.)
daily_activity <- daily_activity %>%
  mutate(TotalMinutes = VeryActiveMinutes +
           FairlyActiveMinutes +
           LightlyActiveMinutes +
           SedentaryMinutes,
         Hours = TotalMinutes / 60,
         Days = Hours / 24)
# Let's view the columns now and order it DESC in the viewer on the "Days" col.
# Using head() for RMarkdown code
head(daily_activity)
## id ActivityDate TotalSteps TotalDistance TrackerDistance
## 1 1503960366 2016-04-12 13162 8.50 8.50
## 2 1503960366 2016-04-13 10735 6.97 6.97
## 3 1503960366 2016-04-14 10460 6.74 6.74
## 4 1503960366 2016-04-15 9762 6.28 6.28
## 5 1503960366 2016-04-16 12669 8.16 8.16
## 6 1503960366 2016-04-17 9705 6.48 6.48
## LoggedActivitiesDistance VeryActiveDistance ModeratelyActiveDistance
## 1 0 1.88 0.55
## 2 0 1.57 0.69
## 3 0 2.44 0.40
## 4 0 2.14 1.26
## 5 0 2.71 0.41
## 6 0 3.19 0.78
## LightActiveDistance SedentaryActiveDistance VeryActiveMinutes
## 1 6.06 0 25
## 2 4.71 0 21
## 3 3.91 0 30
## 4 2.83 0 29
## 5 5.04 0 36
## 6 2.51 0 38
## FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes Calories
## 1 13 328 728 1985
## 2 19 217 776 1797
## 3 11 181 1218 1776
## 4 34 209 726 1745
## 5 10 221 773 1863
## 6 20 164 539 1728
## TotalMinutes Hours Days
## 1 1094 18.23333 0.7597222
## 2 1033 17.21667 0.7173611
## 3 1440 24.00000 1.0000000
## 4 998 16.63333 0.6930556
## 5 1040 17.33333 0.7222222
## 6 761 12.68333 0.5284722
As we can see, the Days column we created ranges from 0.001388889 to 1. Each row represents one user-day, so a user with 30 entries should correspond to 30 days of recorded data. However, when we look at the minutes actually recorded, we can see that at times less than 1% of the day is captured.
That is a significant data quality concern, and it did not surface until the Processing stage, after some transformations.
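A quick check makes the concern concrete (a small sketch):
# Range of daily coverage, and how many rows capture less than 1% of the day
range(daily_activity$Days)
sum(daily_activity$Days < 0.01)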
If we want the most reliable data, we may want to filter down to rows where at least 90% of the day was recorded.
# Filter the data down
filtered_daily_activity <- filter(daily_activity, Days >= .9)
# See structure now
str(filtered_daily_activity)
## 'data.frame': 522 obs. of 18 variables:
## $ id :integer64 1503960366 1503960366 1503960366 1503960366 1503960366 1503960366 1624580081 1624580081 ...
## $ ActivityDate : Date, format: "2016-04-14" "2016-04-18" ...
## $ TotalSteps : num 10460 13019 12764 18134 11100 ...
## $ TotalDistance : num 6.74 8.59 8.13 12.21 7.15 ...
## $ TrackerDistance : num 6.74 8.59 8.13 12.21 7.15 ...
## $ LoggedActivitiesDistance: num 0 0 0 0 0 0 0 0 0 0 ...
## $ VeryActiveDistance : num 2.44 3.25 4.76 6.4 2.46 ...
## $ ModeratelyActiveDistance: num 0.4 0.64 1.12 0.41 0.87 ...
## $ LightActiveDistance : num 3.91 4.71 2.24 5.41 3.82 ...
## $ SedentaryActiveDistance : num 0 0 0 0 0 ...
## $ VeryActiveMinutes : num 30 42 66 78 36 0 0 0 0 0 ...
## $ FairlyActiveMinutes : num 11 16 27 11 22 0 0 0 0 0 ...
## $ LightlyActiveMinutes : num 181 233 130 243 203 0 146 148 236 96 ...
## $ SedentaryMinutes : num 1218 1149 1217 1108 1179 ...
## $ Calories : num 1776 1921 1827 2159 1819 ...
## $ TotalMinutes : num 1440 1440 1440 1440 1440 1440 1440 1440 1440 1440 ...
## $ Hours : num 24 24 24 24 24 24 24 24 24 24 ...
## $ Days : num 1 1 1 1 1 1 1 1 1 1 ...
After filtering, we are down to 522 observations compared with the original 940 we started with.
Would this be an unacceptable cut for the analysis?
By doing this, I believe we will provide more complete data to the stakeholders. As already mentioned, additional data (well beyond 30 people) would be needed to make statistically significant findings, but this should be fine for some exploration.
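Before moving on, it is also worth checking how many of the 33 users still appear after the cut (a quick sketch):
# Users and observations remaining after the 90% filter
filtered_daily_activity %>%
  summarise(users = n_distinct(id), observations = n())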
Looking at the sleepDay data in a spreadsheet, it is quite straightforward. We can follow a similar process and then try adding the sleepDay data to daily_activity to get more variables into one data set.
# Quick view of the data...
head(sleep_day)
## id SleepDay TotalSleepRecords TotalMinutesAsleep
## 1 1503960366 4/12/2016 12:00:00 AM 1 327
## 2 1503960366 4/13/2016 12:00:00 AM 2 384
## 3 1503960366 4/15/2016 12:00:00 AM 1 412
## 4 1503960366 4/16/2016 12:00:00 AM 2 340
## 5 1503960366 4/17/2016 12:00:00 AM 1 700
## 6 1503960366 4/19/2016 12:00:00 AM 1 304
## TotalTimeInBed
## 1 346
## 2 407
## 3 442
## 4 367
## 5 712
## 6 320
# Let's compute the percentage of time in bed that was actually spent asleep
sleep_day <- sleep_day %>%
  mutate(percent_AsleepInBed = TotalMinutesAsleep / TotalTimeInBed)
# Personal Note: I have a hunch that this time is included in the
# SedentaryMinutes in the dailyActivity table.
head(sleep_day)
## id SleepDay TotalSleepRecords TotalMinutesAsleep
## 1 1503960366 4/12/2016 12:00:00 AM 1 327
## 2 1503960366 4/13/2016 12:00:00 AM 2 384
## 3 1503960366 4/15/2016 12:00:00 AM 1 412
## 4 1503960366 4/16/2016 12:00:00 AM 2 340
## 5 1503960366 4/17/2016 12:00:00 AM 1 700
## 6 1503960366 4/19/2016 12:00:00 AM 1 304
## TotalTimeInBed percent_AsleepInBed
## 1 346 0.9450867
## 2 407 0.9434889
## 3 442 0.9321267
## 4 367 0.9264305
## 5 712 0.9831461
## 6 320 0.9500000
percent_AsleepInBed may help us identify the quality of sleep among different users.
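A quick look at how that ratio is distributed (a sketch):
# Distribution of the share of time in bed actually spent asleep
summary(sleep_day$percent_AsleepInBed)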
Now we can bring in the sleep_day data and combine it with the filtered_daily_activity data we created.
# As with daily_activity, the date column here is a character data type.
# We first need to convert "SleepDay" into a Date data type as well.
sleep_day[[2]] <- as.Date(sleep_day[[2]], "%m/%d/%Y %H:%M:%S")
head(sleep_day)
## id SleepDay TotalSleepRecords TotalMinutesAsleep TotalTimeInBed
## 1 1503960366 2016-04-12 1 327 346
## 2 1503960366 2016-04-13 2 384 407
## 3 1503960366 2016-04-15 1 412 442
## 4 1503960366 2016-04-16 2 340 367
## 5 1503960366 2016-04-17 1 700 712
## 6 1503960366 2016-04-19 1 304 320
## percent_AsleepInBed
## 1 0.9450867
## 2 0.9434889
## 3 0.9321267
## 4 0.9264305
## 5 0.9831461
## 6 0.9500000
# Next, we will join the data together ON the "id" and the "date" columns.
joined_dailyA_sleepD <- left_join(filtered_daily_activity, sleep_day,
by = c("id" = "id",
"ActivityDate" = "SleepDay"))
# View joined data
# Using head() for RMarkdown
head(joined_dailyA_sleepD)
## id ActivityDate TotalSteps TotalDistance TrackerDistance
## 1 1503960366 2016-04-14 10460 6.74 6.74
## 2 1503960366 2016-04-18 13019 8.59 8.59
## 3 1503960366 2016-04-22 12764 8.13 8.13
## 4 1503960366 2016-04-27 18134 12.21 12.21
## 5 1503960366 2016-05-04 11100 7.15 7.15
## 6 1503960366 2016-05-12 0 0.00 0.00
## LoggedActivitiesDistance VeryActiveDistance ModeratelyActiveDistance
## 1 0 2.44 0.40
## 2 0 3.25 0.64
## 3 0 4.76 1.12
## 4 0 6.40 0.41
## 5 0 2.46 0.87
## 6 0 0.00 0.00
## LightActiveDistance SedentaryActiveDistance VeryActiveMinutes
## 1 3.91 0 30
## 2 4.71 0 42
## 3 2.24 0 66
## 4 5.41 0 78
## 5 3.82 0 36
## 6 0.00 0 0
## FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes Calories
## 1 11 181 1218 1776
## 2 16 233 1149 1921
## 3 27 130 1217 1827
## 4 11 243 1108 2159
## 5 22 203 1179 1819
## 6 0 0 1440 0
## TotalMinutes Hours Days TotalSleepRecords TotalMinutesAsleep TotalTimeInBed
## 1 1440 24 1 NA NA NA
## 2 1440 24 1 NA NA NA
## 3 1440 24 1 NA NA NA
## 4 1440 24 1 NA NA NA
## 5 1440 24 1 NA NA NA
## 6 1440 24 1 NA NA NA
## percent_AsleepInBed
## 1 NA
## 2 NA
## 3 NA
## 4 NA
## 5 NA
## 6 NA
With a left join, we keep all rows from the filtered_daily_activity table and tack on sleep information wherever there is a match on id and date.
Unfortunately, the join is not very successful, so no real relationship between activity levels and sleep can be established here.
# Get number of NULLs and then compare to overall 522 observations
sapply(joined_dailyA_sleepD, function(x) sum(is.na(x)))
## id ActivityDate TotalSteps
## 0 0 0
## TotalDistance TrackerDistance LoggedActivitiesDistance
## 0 0 0
## VeryActiveDistance ModeratelyActiveDistance LightActiveDistance
## 0 0 0
## SedentaryActiveDistance VeryActiveMinutes FairlyActiveMinutes
## 0 0 0
## LightlyActiveMinutes SedentaryMinutes Calories
## 0 0 0
## TotalMinutes Hours Days
## 0 0 0
## TotalSleepRecords TotalMinutesAsleep TotalTimeInBed
## 505 505 505
## percent_AsleepInBed
## 505
# In fact only 17 records were matched
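That matched count can also be pulled straight from the joined data frame (a quick sketch):
# Rows where the sleep columns actually joined (522 - 505 = 17)
sum(!is.na(joined_dailyA_sleepD$TotalMinutesAsleep))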
Moving forward…
With this in mind, we will have to look at the "filtered_daily_activity" and "sleep_day" data separately.
For my analysis today, I will not include the sleep_day data and will focus on the filtered_daily_activity data set, because it appears to be the most complete and thorough.
Please Note: sleepDay is being DROPPED at this point
First, we will want to see what initial patterns exist in the data.
To do this, I will use histograms to look at the distributions present in the data.
# Install and load needed packages
install.packages("Hmisc")
## Installing package into 'C:/Users/Phoenix/Documents/R/win-library/4.1'
## (as 'lib' is unspecified)
## package 'Hmisc' successfully unpacked and MD5 sums checked
## Warning: cannot remove prior installation of package 'Hmisc'
## Warning in file.copy(savedcopy, lib, recursive = TRUE): problem copying C:
## \Users\Phoenix\Documents\R\win-library\4.1\00LOCK\Hmisc\libs\x64\Hmisc.dll to C:
## \Users\Phoenix\Documents\R\win-library\4.1\Hmisc\libs\x64\Hmisc.dll: Permission
## denied
## Warning: restored 'Hmisc'
##
## The downloaded binary packages are in
## C:\Users\Phoenix\AppData\Local\Temp\Rtmpg9lVCY\downloaded_packages
library(Hmisc)
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:dplyr':
##
## src, summarize
## The following objects are masked from 'package:base':
##
## format.pval, units
# Plot the histograms
hist.data.frame(filtered_daily_activity)
## Warning in pretty.default(range(x), n = breaks, min.n = 1): R_pretty(): very
## small range 'cell'=1.40119e-315, corrected to 2.122e-314
## Warning in plot.window(xlim, ylim, "", ...): R_pretty(): very small range
## 'cell'=1.296e-314, corrected to 2.122e-314
From here I can see a few distributions that we may want to enlarge for better visibility.
# Calories
ggplot(filtered_daily_activity)+
geom_histogram(mapping = aes(x = Calories))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# SedentaryMinutes
ggplot(filtered_daily_activity)+
geom_histogram(mapping = aes(x = SedentaryMinutes))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# LightlyActiveMinutes
ggplot(filtered_daily_activity)+
geom_histogram(mapping = aes(x = LightlyActiveMinutes))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# FairlyActiveMinutes
ggplot(filtered_daily_activity)+
geom_histogram(mapping = aes(x = FairlyActiveMinutes))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# VeryActiveMinutes
ggplot(filtered_daily_activity)+
geom_histogram(mapping = aes(x = VeryActiveMinutes))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
From the start we can see that a few outliers exist; I am going to leave them in for now.
The main chart I want to bring attention to is the “Calories” one because there appears to be a bi-modal distribution. The data point at 0 must be an outlier since it should be impossible to burn 0 calories in a day.
Regardless, bi-modal distributions are great for identifying different groups within the data. This is the first notable find in our analysis and may let us segment the data a little. I do want to mention that the data set may still be too small to confirm a genuinely bi-modal distribution, but we will work with it for now.
Next, we will want to see if there are any inherent relationships among the data after we separate the data into two groups. We will use the cor() function. If we look back at the Calories histogram we saw two peaks.
Peak 1: 2000 or so calories
Peak 2: 2750 or so calories
2000 + 2750 = 4750
4750 / 2 = 2375 calorie divider
So we can create two data frames now using 2,375 calories as the divider…
# Group 1 (lower average calories)
## Let's use this opportunity to prune out some outliers: there are some
## entries recorded with '0' calories.
group_1_low <- filter(daily_activity, Calories > 0 & Calories <= 2375)
# Group 2 (higher average calories)
group_2_high <- filter(daily_activity, Calories > 2375)
# Set up the data frame without any dates or other classifying info
#### group_1_low
cor_group_1 <- group_1_low %>%
subset( , select = -c(id, ActivityDate))
#### group_2_high
cor_group_2 <- group_2_high %>%
subset( , select = -c(id, ActivityDate))
# Install needed package to format the data
install.packages("reshape2")
## package 'reshape2' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\Phoenix\AppData\Local\Temp\Rtmpg9lVCY\downloaded_packages
library(reshape2)
# Reshape the data with upper.tri() and melt()
# Setting the upper triangle to NA lets us keep just half of the symmetric matrix
#### group_1_low
matrix_group_1 <- round(cor(cor_group_1), digits = 2)
matrix_group_1[upper.tri(matrix_group_1)]<-NA
#data.frame(matrix_group_1)
#### group_2_high
matrix_group_2 <- round(cor(cor_group_2), digits = 2)
matrix_group_2[upper.tri(matrix_group_2)]<-NA
#data.frame(matrix_group_2)
# Use the installed package to 'melt' it. This creates a long view with new
# column names Var1 and Var2.
#### group_1_low
melt_group_1 <- melt(matrix_group_1, na.rm = TRUE)
#### group_2_high
melt_group_2 <- melt(matrix_group_2, na.rm = TRUE)
# Now that the data has been reshaped, we can plot it
#### group_1_low
ggplot(data = melt_group_1)+
geom_tile(mapping = aes(x = Var1, y = Var2, fill = value), color = "white")+
scale_fill_gradient2(low = "red", mid = "white", high = "blue",
midpoint = 0, limit = c(-1,1), space = "Lab",
name = "Correlation (Pearson/Default")+
theme_classic()+
theme(axis.text.x = element_text(angle = 90, vjust = 0, hjust = 0))+
labs(title = "Group 1: Low Calorie Burners")+
annotate("rect", xmin = 4, xmax = 12,
ymin = .5, ymax = 4,
alpha = .1, color = "green")+
annotate("rect", xmin = 11.4, xmax = 12.6,
ymin = .5, ymax = 14,
alpha = .1, color = "green")
#### group_2_high
ggplot(data = melt_group_2)+
geom_tile(mapping = aes(x = Var1, y = Var2, fill = value), color = "white")+
scale_fill_gradient2(low = "red", mid = "white", high = "blue",
midpoint = 0, limit = c(-1,1), space = "Lab",
name = "Correlation (Pearson/Default")+
theme_classic()+
theme(axis.text.x = element_text(angle = 90, vjust = 0, hjust = 0))+
labs(title = "Group 2: High Calorie Burners")+
annotate("rect", xmin = 4, xmax = 12,
ymin = .5, ymax = 4,
alpha = .1, color = "green")+
annotate("rect", xmin = 11.4, xmax = 12.6,
ymin = .5, ymax = 14,
alpha = .1, color = "green")
I want to take a moment to compare the two heat maps above. Group 1: Low Calorie Burners appears to have a fairly different landscape than Group 2: High Calorie Burners.
Distance & Minutes
Here, when evaluating distance and minutes, we already expect each pair to be highly correlated with one another (e.g., VeryActiveDistance ~ VeryActiveMinutes).
What we see in the lower green rectangle for Group 1 is that all of the distance types and minute types matter, especially LightActiveDistance and LightlyActiveMinutes.
When we look at Group 2, we begin to see a different story: those same relationships are weaker and pale next to the strength of VeryActiveDistance and VeryActiveMinutes.
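As a rough numeric spot check of those observations, outside of the heat maps (a sketch):
# Correlation of the light and very-active distance/minute pairs in each group
sapply(list(group_1 = group_1_low, group_2 = group_2_high), function(g)
  c(light_pair = cor(g$LightActiveDistance, g$LightlyActiveMinutes),
    very_pair  = cor(g$VeryActiveDistance,  g$VeryActiveMinutes)))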
SedentaryMinutes
For Group 1, SedentaryMinutes has almost exclusively negative relationships across the board…
For Group 2, SedentaryMinutes is not as prominent and has less of an effect on TotalDistance, LightlyActiveMinutes, etc…
By looking at both of these groups, we can see that they do act differently.
Group 1 burns fewer calories overall. We can also see stronger relationships when it comes to light activity, which we can attribute to walking or other light exercise. These users may not be looking to become professional runners or to train for anything in particular; they may just be curious about tracking their data, or trying to increase their total steps and get out of the house.
On the other hand…
Group 2 has a stronger relationship with very active minutes and distance. These may be users who are into sports, or into a variety of pursuits that have them spending more time on demanding exercise.
As a marketer, it would be important to know how to classify current Bellabeat customers and ensure that marketing material is tailored to each type of user.
Now that we understand a little about these groups, we can check whether there are any timing differences in when they are active.
# First we need to make sure we have a column for the weekday.
#### group_1_low
wkd_group_1 <- mutate(group_1_low, weekday = weekdays(ActivityDate))
# Sort the data by weekday
wkd_group_1$weekday <- factor(wkd_group_1$weekday,
levels = c("Sunday",
"Monday",
"Tuesday",
"Wednesday",
"Thursday",
"Friday",
"Saturday"))
# Apply the change
wkd_group_1 <- wkd_group_1[order(wkd_group_1$weekday), ]
#### group_2_high
wkd_group_2 <- mutate(group_2_high, weekday = weekdays(ActivityDate))
# Sort the data by weekday
wkd_group_2$weekday <- factor(wkd_group_2$weekday,
levels = c("Sunday",
"Monday",
"Tuesday",
"Wednesday",
"Thursday",
"Friday",
"Saturday"))
# Apply the change
wkd_group_2 <- wkd_group_2[order(wkd_group_2$weekday), ]
For this next bit of the analysis I am going to switch to using Tableau.
# Export workable CSV files
## group_1_low
write.csv(wkd_group_1, "C:\\Users\\Phoenix\\Desktop\\Kaggle Data Sets\\Fitbit Fitness Tracker Data\\created data\\.group_1.csv")
## group_2_high
write.csv(wkd_group_2, "C:\\Users\\Phoenix\\Desktop\\Kaggle Data Sets\\Fitbit Fitness Tracker Data\\created data\\.group_2.csv")
The goal is to see where these different users spend their time each day. To do this, we use the weekday column we just created and calculate the percentage of time spent in each activity type (e.g., VeryActiveMinutes).
Percentage of “Very Active” minutes = VeryActiveMinutes / TotalMinutes
Once we have the percentages, we can average them by weekday to see where people, on average, spend their time. If you do not see the embedded dashboard below, please click on the link.
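For reference, roughly the same aggregation can be previewed in R before exporting. This is a sketch for group 1; the same idea applies to wkd_group_2:
# Average share of the day spent in each activity type, by weekday (group 1)
wkd_group_1 %>%
  group_by(weekday) %>%
  summarise(pct_very_active    = mean(VeryActiveMinutes    / TotalMinutes),
            pct_fairly_active  = mean(FairlyActiveMinutes  / TotalMinutes),
            pct_lightly_active = mean(LightlyActiveMinutes / TotalMinutes),
            pct_sedentary      = mean(SedentaryMinutes     / TotalMinutes))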
Now it's up to Bellabeat to pursue additional data collection and adjust their marketing strategy accordingly. It has been fun going through the Google Data Analytics Professional Certificate program. I'm new to R, so if anyone has any tips, they would be appreciated.
Thanks, PT