This is a Capstone Project for the Google Data Analytics Certificate Course on Coursera. Portions of this project were completed in SQL via BigQuery, R via RStudio, and Tableau via Tabeleau Public.

Scenario Summary:

You are working on the marketing analyst team at Bellabeat, a high-tech manufacturer of health-focused products for women. Bellabeat is a successful small company, but they have the potential to become a larger player in the global smart device market. Urška Sršen, cofounder and Chief Creative Officer of Bellabeat, believes that analyzing smart device fitness data could help unlock new growth opportunities for the company. She has asked the marketing analytics team to focus on a Bellabeat product and analyze smart device usage data in order to gain insight into how people are already using their smart devices. Then, using this information, she would like high-level recommendations for how these trends can inform Bellabeat marketing strategy

Business Task Statement:

Determine Growth opportunities uncovered by smart device usage trends and how Bellabeat can leverage these insights in marketing strategy

Final Insights and Recommendations

Can be found in this Tableau Viz. They will also be summarized below:

Most users’ activity levels hovered right around the threshold to where increased activity more strongly correlated with higher calorie expenditure.

Bellabeat can utilize this insight to emphasize personalized fitness plans and coaching in the app and membership program, coupled with tracking and immediate insights in the app in marketing messaging. The app and coaching program can offer the extra push individuals need to move past these activity level thresholds.

Every participant experienced abnormally high resting heart rates at some point in the study period. Overall, 6.4% of the time a heart rate was detected to be exceeding 100 bpm, the user was in a period of sedentary activity. More research will be needed to determine true causes, however stress is a common cause of high resting heart rates.

Marketing can utilize this insight to emphasize the mindfulness and stress management features available in app and in the coaching program in messaging.

Roughly 20% of daily activity data involved some sort of empty, missing, or manually entered logs; and over half of the weight log data was manually entered. In an age where convenience is key (especially with smart devices), decreasing manual work on the user’s behalf will go far. According to the Fitabase site, some potential limitations in the devices used for this study, which could lead to data being missed or manual entry, were going too long without syncing device to app (device memory) and going too long between charges (battery life). Additionally, manual entry can be caused by using smart devices which don’t integrate together.

These insights should drive marketing messaging to emphasize connectivity and focus on the seamless experience between Bellabeat product offering and app, as well as to highlight integrations with third party smart devices. Finally, marketing messaging should emphasize battery life, particularly of the Time product as it can last much longer between charges compared to other popular devices.

Saturdays and Tuesdays tended to have the highest levels of activity, whereas Thursday and Sunday generally showed lower levels of activity. Overall, we tend to see peaks in activity around 12-2pm and again around 5-8pm daily. Dips in activity tend to show up between 3-4pm and 8-10pm daily overall. On the weekends, Saturday activity peaks around mid-day (1pm), and Sunday activity is typically lowest 11am-1pm or around 4pm.

Marketing efforts during higher levels of activity should be focused on methods that complement on-the-go activities, such as Podcast or Radio advertisements.
Marketing efforts during lower levels of activity should be focused on methods that complement periods of rest, such as Online, Youtube, and TV Advertisements.

Note: All insights obtained from this data and analysis should be taken generally, and should be further validated with additional surveys and testing. Please see Data Relevance and Reliability section below for more information.

Data Sources

Primary Data Source is from Mobius’ FitBit Dataset on Kaggle, as directed by Coursera Project Instructions.
The Fitabase website was also used as a resource:
- Spec Document was used to discover meanings for some of the data fields, as metadata within the files and datasets themselves were lacking
- FitBit Integrity Page was used to learn about potential limitations in the dataset.

Data Relevance and Reliability

Upon reviewing the data and resources about the data, a number of concerns came to mind:

There is no demographic data - Bellabeat is focused on creating products for women. So, it would make most sense to analyze data from women participants - however, I cannot tell how many, if any, of the participants in this dataset were women.
Overall sample size: Based on the number of “Devices Synced” (1.2 million) on Bellabeat’s website, the sample size of this survey (30-33 participants total, though this is lower for certain statistics that not everyone provided data for) is not representative of the potential population of Bellabeat clients and we will be unable to rely on analysis from this data in high confidence.
Outdated Data: This data is from 2016, and it is now 2022 - much has changed in the world of technology and smart devices in this time.
Inconsistent Data Summary: The Kaggle source for the data set states there are 30 participants and data should span between 3/12/16-5/12/16. However, when the data was downloaded, I learned the true number of unique participants (Determined by Participant ID) was 33, and the true date span contained in the dataset was 4/12/16-5/12/16. This provokes even more concern regarding the reliability of this data.

Thus, my recommendations will be focused around where to focus further research efforts and general areas to look at for marketing. All insights obtained from this data and analysis should be taken generally, and should be further validated with additional surveys and testing.

Initial Review and Data Cleaning - CSV Files

Before beginning, I conducted a few initial checks on the csv files themselves, such as:

Confirmed the “Daily Activity” file contains the other “Daily” file data aggregated into a single location. Confirmed the data in this file lines up with the data in the other files (spot checked as well as utilized VLOOKUP (using older version excel which couldn’t utilize XLOOKUP) using a new UID created with concatenation of: ID & Activity Date)
Confirmed ID values contained same number of characteers/no extra spaces
Noted data’s datespan, though different from data summary description in Kaggle, lined up between files. True data date span was 4/12/16-5/12/16
Noted there is both Total Distance and Tracker Distance field - for most records the numbers match, but not for all. I will use the Total Distance for my analysis as this will be more comprehensive (will contain the auto-calculated as well as manually entered distance data)

Cleaning and Analysis in SQL - BigQuery

I decided to begin looking at this data in Big Query. I loaded the data into BigQuery and used the Spec Document from Fitabase to guide me in specifying the field data types for each file. Due to the way the dates were formatted, I did have to load them as STRING initially and extract dates out of them later in queries.

Files Loaded:

knitr::include_graphics("FilesLoadedBigQuery.png")

With these files loaded, I created and saved a handful of queries which cleaned and organized data into views that could summarize the data for me, or that I thought I may want to use for further analysis in RStudio or Tableau Public later on. I ended up mainly only utilizing the DailyActivity and DailySleep tables for these queries, as I thought I would work with a few other tables in R. These queries will be saved as .txt files and provided with this rmarkdown document so that they can be viewed with a more readable formatting. For now, the query text (without formatting) will be placed in chunks below:

Query #1: Cleaned Daily Sleep Data.

The Daily Sleep data file is one I hadn’t reviewed much prior to importing, and needed to do some additional cleaning before I could query out of it (I had reviewed, done basic cleaning and checks for the DailyActivity file prior to importing). In this query, I removed duplicated rows, extracted the date out of the Sleep_Day field (removed timestamp + AM/PM indication), and changed the ID data type to match the data type of ID in the DailyActivity table.

cat(readLines("SQL/Query_CleanedDailySleep.txt"))

## --Removes duplicated rows, changes date to remove timestamp and AM/PM indication, changes ID data type from STRING to INT so it can be compared with DailyActivity Data  WITH NewDailySleep AS (   SELECT      DISTINCT *,     SUBSTR(Sleep_Day, 1, 9) AS NewSleepDate,     CAST(Id AS INT) AS NewId  FROM `bbfitbit.fitabaseData.DailySleep`)  SELECT    NewDailySleep.NewId,   NewDailySleep.NewSleepDate,   NewDailySleep.Total_Minutes_Asleep,   NewDailySleep.Total_Time_In_Bed FROM NewDailySleep ORDER BY   NewDailySleep.NewId, NewDailySleep.NewSleepDate

Query #2: Summarizing each participant’s activity data across the span of the survey

First I wanted to get an idea of the averages for each participant’s activity data across the span of the survey In my query, I used HA to indicate Highly Active, MOD for Moderately Active, LA for Lightly Active, and SED for Sedentary

cat(readLines("SQL/Query_ParticipantActivityAverages.txt"))

## Warning in readLines("SQL/Query_ParticipantActivityAverages.txt"): incomplete
## final line found on 'SQL/Query_ParticipantActivityAverages.txt'

## --First I would like to get an idea of the averages for each participant's ACTIVITY data across the span of the survey (4/12/16-5/12/16).  -- HA Highly Active, MOD Moderately Active, LA Lightly Active, SED Sedentary  SELECT    Id,    AVG(Calories) AS Avg_Daily_Calories,   Avg(TotalSteps) AS Avg_Daily_Steps,    Avg(TotalDistance) AS Avg_Daily_Distance_km,   Avg(VeryActiveDistance) AS Avg_HA_Distance_km,   safe_divide((Avg(VeryActiveDistance)), Avg(TotalDistance)) * 100 AS Avg_HA_Dist_percent_Total_Dist,   Avg(ModeratelyActiveDistance) AS Avg_MOD_Distance_km,   safe_divide(Avg(ModeratelyActiveDistance), Avg(TotalDistance)) * 100 AS Avg_MOD_Dist_percent_Total_Dist,   Avg(LightActiveDistance) AS Avg_LA_Distance_km,   safe_divide(Avg(LightActiveDistance), Avg(TotalDistance)) * 100 AS Avg_LA_Dist_percent_Total_Dist,   Avg(SedentaryActiveDistance) AS Avg_SED_Distance_km,   safe_divide(Avg(SedentaryActiveDistance), Avg(TotalDistance)) * 100 AS Avg_SED_Dist_percent_Total_Dist,   Avg(VeryActiveMinutes+FairlyActiveMinutes+LightlyActiveMinutes+SedentaryMinutes) AS Avg_Total_Min_Tracked,   Avg(VeryActiveMinutes) AS Avg_HA_Min,   safe_divide(Avg(VeryActiveMinutes), Avg(VeryActiveMinutes+FairlyActiveMinutes+LightlyActiveMinutes+SedentaryMinutes)) * 100 AS avg_HA_Min_Percent_Total_Min,   Avg(FairlyActiveMinutes) AS Avg_MOD_Min,   safe_divide(Avg(FairlyActiveMinutes), Avg(VeryActiveMinutes+FairlyActiveMinutes+LightlyActiveMinutes+SedentaryMinutes)) * 100 AS avg_MOD_Min_Percent_Total_Min,   Avg(LightlyActiveMinutes) AS Avg_LA_Min,   safe_divide(Avg(LightlyActiveMinutes), Avg(VeryActiveMinutes+FairlyActiveMinutes+LightlyActiveMinutes+SedentaryMinutes)) * 100 AS avg_LA_Min_Percent_Total_Min,   Avg(SedentaryMinutes) AS Avg_SED_Min,   safe_divide(Avg(SedentaryMinutes), Avg(VeryActiveMinutes+FairlyActiveMinutes+LightlyActiveMinutes+SedentaryMinutes)) * 100 AS avg_SED_Min_Percent_Total_Min   FROM `bbfitbit.fitabaseData.DailyActivity`  GROUP BY   Id

Query #3: Summary of Daily activity averages for the group

Then I was interested in some general group-level averages for the activities per date between 4/12/16-5/12/16

cat(readLines("SQL/Query_GroupActivityAveragesByDate.txt"))

## Warning in readLines("SQL/Query_GroupActivityAveragesByDate.txt"): incomplete
## final line found on 'SQL/Query_GroupActivityAveragesByDate.txt'

## --Now I would  like the group activity averages per day. Data datespan 4/12/16-512/16  SELECT   ActivityDate,   COUNT(DISTINCT Id) AS Participant_Count,   AVG(Calories) AS grp_Avg_Daily_Calories,   Avg(TotalSteps) AS grp_Avg_Daily_Steps,    Count(*) AS grp_Total_Logs,   COUNTIF(TotalSteps=0 AND TotalDistance=0) AS grp_Empty_Logs,   (COUNTIF(TotalSteps=0 AND TotalDistance=0) / Count(*)) * 100 AS grp_Percentage_EmptyLogs,   COUNTIF(LoggedActivitiesDistance != 0) AS grp_Manual_Logs,   (COUNTIF(LoggedActivitiesDistance != 0) / Count(*)) * 100 AS grp_Percentage_ManualEntry,   Avg(TotalDistance) AS grp_Avg_Daily_Distance_km,   Avg(VeryActiveDistance) AS grp_Avg_HA_Distance_km,   Avg(ModeratelyActiveDistance) AS grp_Avg_MOD_Distance_km,   Avg(LightActiveDistance) AS grp_Avg_LA_Distance_km,   Avg(SedentaryActiveDistance) AS grp_Avg_SED_Distance_km,   Avg(VeryActiveMinutes+FairlyActiveMinutes+LightlyActiveMinutes+SedentaryMinutes) AS grp_Avg_Total_Min_Tracked,   Avg(VeryActiveMinutes) AS grp_Avg_HA_Min,   Avg(FairlyActiveMinutes) AS grp_Avg_MOD_Min,   Avg(LightlyActiveMinutes) AS grp_Avg_LA_Min,   Avg(SedentaryMinutes) AS grp_Avg_SED_Min,   CASE      WHEN EXTRACT(DAYOFWEEK FROM ActivityDate) =1 THEN "Sunday"     WHEN EXTRACT(DAYOFWEEK FROM ActivityDate) =2 THEN "Monday"     WHEN EXTRACT(DAYOFWEEK FROM ActivityDate) =3 THEN "Tuesday"     WHEN EXTRACT(DAYOFWEEK FROM ActivityDate) =4 THEN "Wednesday"     WHEN EXTRACT(DAYOFWEEK FROM ActivityDate) =5 THEN "Thursday"     WHEN EXTRACT(DAYOFWEEK FROM ActivityDate) =6 THEN "Friday"     WHEN EXTRACT(DAYOFWEEK FROM ActivityDate) =7 THEN "Saturday"     ELSE "Unknown"     END AS Day_of_Week  FROM `bbfitbit.fitabaseData.DailyActivity` GROUP BY ActivityDate ORDER BY ActivityDate

Query #4: Summary Stats for Group Activity overall

Next, I was interested in learning more about the participant population’s activity as a whole (rather than grouped by date).

cat(readLines("SQL/Query_GroupActivitySummary.txt"))

## Warning in readLines("SQL/Query_GroupActivitySummary.txt"): incomplete final
## line found on 'SQL/Query_GroupActivitySummary.txt'

## --Now I am interested in learning more about the participant population's activity as a whole. Query for summary of data follows. --Data datespan is 4/12/16-5/12/16 for 33 individuals of unknown demographics  SELECT   CONCAT(MIN(ActivityDate), " to ", MAX(ActivityDate)) AS Data_DateSpan,   COUNT(DISTINCT Id) AS Participant_Count,   Count(*) AS grp_Total_Logs,   AVG(Calories) AS grp_Avg_Daily_Calories,   Avg(TotalSteps) AS grp_Avg_Daily_Steps,    COUNTIF(TotalSteps=0 AND TotalDistance=0) AS grp_Empty_Logs,   (COUNTIF(TotalSteps=0 AND TotalDistance=0) / Count(*)) * 100 AS grp_Percentage_EmptyLogs,   COUNTIF(LoggedActivitiesDistance != 0) AS grp_Manual_Logs,   (COUNTIF(LoggedActivitiesDistance != 0) / Count(*)) * 100 AS grp_Percentage_ManualEntry,   Avg(TotalDistance) AS grp_Avg_Daily_Distance_km,   Avg(VeryActiveDistance) AS grp_Avg_HA_Distance_km,   Avg(ModeratelyActiveDistance) AS grp_Avg_MOD_Distance_km,   Avg(LightActiveDistance) AS grp_Avg_LA_Distance_km,   Avg(SedentaryActiveDistance) AS grp_Avg_SED_Distance_km,   Avg(VeryActiveMinutes+FairlyActiveMinutes+LightlyActiveMinutes+SedentaryMinutes) AS grp_Avg_Total_Min_Tracked,   Avg(VeryActiveMinutes) AS grp_Avg_HA_Min,   Avg(FairlyActiveMinutes) AS grp_Avg_MOD_Min,   Avg(LightlyActiveMinutes) AS grp_Avg_LA_Min,   Avg(SedentaryMinutes) AS grp_Avg_SED_Min  FROM `bbfitbit.fitabaseData.DailyActivity`

Query #5: Combining Sleep + Activity Data and outputting averages of key activities/sleep info

I wanted to combine the Sleep and Activity data for easier comparison. Then, I created a summary of the sleep+activity data, grouped by participant.

cat(readLines("SQL/Query_ParticipantActivity+SleepAverages.txt"))

## Warning in readLines("SQL/Query_ParticipantActivity+SleepAverages.txt"):
## incomplete final line found on 'SQL/Query_ParticipantActivity+SleepAverages.txt'

## --Adding Average sleep data with high-level average activity data for easier comparison, grouped by participant. --Survey Data DateSpan 4/12/22-5/12/22 -- HA Highly Active, MOD Moderately Active, LA Lightly Active, SED Sedentary  WITH NewDailySleep AS (     SELECT        DISTINCT *,       SUBSTR(Sleep_Day, 1, 9) AS NewSleepDate,       CAST(Id AS INT) AS NewId    FROM `bbfitbit.fitabaseData.DailySleep` ), ParticipantActivitySummary AS (   SELECT     Id,      AVG(Calories) AS Avg_Daily_Calories,     Avg(TotalSteps) AS Avg_Daily_Steps,      Avg(TotalDistance) AS Avg_Daily_Distance_km,     Avg(VeryActiveMinutes) AS Avg_HA_Min,     Avg(FairlyActiveMinutes) AS Avg_MOD_Min,     Avg(LightlyActiveMinutes) AS Avg_LA_Min,     Avg(SedentaryMinutes) AS Avg_SED_Min,        FROM `bbfitbit.fitabaseData.DailyActivity`   GROUP BY Id   )  SELECT    ParticipantActivitySummary.Id,   Avg(ParticipantActivitySummary.Avg_Daily_Calories) AS avg_daily_calories,   Avg(ParticipantActivitySummary.Avg_Daily_Steps) AS avg_daily_steps,   Avg(ParticipantActivitySummary.Avg_Daily_Distance_km) AS avg_daily_distance_km,   Avg(ParticipantActivitySummary.Avg_HA_Min) AS avg_HA_min,   Avg(ParticipantActivitySummary.Avg_MOD_Min) AS avg_MOD_min,   Avg(ParticipantActivitySummary.Avg_LA_Min) AS avg_LA_min,   Avg(ParticipantActivitySummary.Avg_SED_Min) AS avg_SED_min,   Avg(NewDailySleep.Total_Time_In_Bed) AS Avg_TotalMin_InBed_min,   Avg(NewDailySleep.Total_Minutes_Asleep) AS Avg_Min_Asleep,  FROM ParticipantActivitySummary LEFT JOIN NewDailySleep    ON ParticipantActivitySummary.Id=NewDailySleep.NewId GROUP BY ParticipantActivitySummary.Id ORDER BY Avg(ParticipantActivitySummary.Avg_HA_Min) DESC

Query #6: Identifying occurrences of missing entries, empty entries, and manual entries.

Finally, I had noticed instances of empty entries (where values were 0), missing entries (no record for a participant on certain dates), and manual entries (where it was indicated that the participant had to add data manually rather than it automatically syncing between device and app). The Fitabase website noted data limitations where data entries would not be logged if the user did not log into the app and sync data frequently enough, or if the battery were too low.

I thought this data might provide insight into how often users are unable to obtain value from these devices (whether due to limitations in device noted above such as memory/ battery life, or due to lifestyle such as neglecting to wear device). However, further data will certainly be needed to determine cause and actionable insights - but this is a starting point.

cat(readLines("SQL/Query_UserLogsEmpty-Missing-Manual.txt"))

## Warning in readLines("SQL/Query_UserLogsEmpty-Missing-Manual.txt"): incomplete
## final line found on 'SQL/Query_UserLogsEmpty-Missing-Manual.txt'

## --I have noticed there are some entries with 0's across the board in daily activities. There are also instances where there are no records logging a certain day for individuals. --The Fitabase website noted data limitations where data entries would not be logged if the user did not log into the app and sync data frequently enough, or if the battery were too low.  --I think this data could provide insight into how often users are unable to obtain value from these devices (whether due to limitations in device such as memory/ battery life, or due to lifestyle such as neglecting to wear device; However, further data would be needed to determine cause and actionable insights) --Also interested in determining how often a user will add manual entries as this could provide insight how importan the ability to add manual logs is to individuals.   SELECT   Id AS Participant_ID,   Count(*) AS Total_Logs,   COUNTIF(TotalSteps=0 AND TotalDistance=0) AS Empty_Logs,   (COUNTIF(TotalSteps=0 AND TotalDistance=0) / Count(*)) * 100 AS Percentage_EmptyLogs,   (31 - COUNTIF(Id=Id)) AS Missing_Logs,   ((31 - COUNTIF(Id=Id)) / 31) * 100 AS Percentage_MissingLogs,   COUNTIF(LoggedActivitiesDistance != 0) AS Manual_Logs,   (COUNTIF(LoggedActivitiesDistance != 0) / Count(*)) * 100 AS Percentage_ManualEntry,  FROM `bbfitbit.fitabaseData.DailyActivity`  GROUP BY   Id

Visualizations In Tableau

Next, I moved over to Tableau Public to create a few initial visualizations using some of the exported results from my queries above, as well as some of the original data files.

DataViz for Bellabeat Initial Analysis

knitr::include_graphics("Tableau/DashboardSummary.png")

Viz #1: Avg Calories Burned by Day of Week

First, I created a chart showing average calories burned by day of week. Additionally, this chart provides details on group participants’ average daily distance (km) and average daily steps by day of week. Unsurprisingly, Saturday appeared to be the most-active day for participants in the survey. I did find it rather surprising, however, that Tuesday seemed to be the 2nd-most-active day - and it was barely behind Saturday’s numbers!

Average Calories - Day of Week:

knitr::include_graphics("Tableau/Calories by Day of Week_Sat.png")

knitr::include_graphics("Tableau/Calories by Day of Week_Tues.png")

Viz #2: Summary of Sleep Data by Day

Next, I created a chart to show the average minutes spent sleeping per day of week (bars), as well as the overall time spent in bed (line). Additionally, the chart provides detail on the average time spent awake in bed, which is a calculated field subtracting minutes asleep from total minutes in bed.

Sunday seemed to be the day of week where participants averaged the most time in bed and asleep, with Wednesday being the 2nd highest. Interestingly, the days of week with highest amounts of sleep seem to follow the days of week where participants are generally most active.

knitr::include_graphics("Tableau/Sleep Data By Day of Week.png")

Viz #3: Percentage of total entries empty, manual, or missing

Next, I simply wanted to see what the percentage of overall logs were Missing, Empty, or Manually Entered. The Empty and Missing log percentages were fairly similar, about 8.62% and 8.13% respectively. The manual entries were less frequent, at about 3.37%.

Together, this accounts for about 20% of entries where there may be some type of limitation at play. A potential recommendation here is to conduct further surveys into this type of data and collecting causes so that actionable insights can be derived. If a key limitation causing these discrepancies turns out to be battery life or device memory, we may want to focus marketing efforts to highlight these features to potential customers to show them they can increase the value they receive from smart devices.

knitr::include_graphics("Tableau/Missing_Empty_Manual_Entries.png")

Viz #4: Hourly Activity Levels

Finally, I created a viz to show average step count by hour of the day (with hour 0 representing the timespan 00:00-00:59, hour 1 representing the timespan 01:00-01:59, and so on). This chart can be filtered to average data for any day off the week, or for all days overall.

The chart shows that overall, there tends to be a peak around hours 12-14 (12:00-14:59, or 12pm-2:59pm) and then a slightly larger peak later on around hours 17-19 (17:00-19:59, or 5pm-5:59pm) each day.

On Saturdays, there tends to be a large peak in the middle of the day around hour 13 (13:00, or the 1pm hour). The average Intensity levels for activity (shown by the line in the chart) also spike at this time. Knowing that Saturdays tend to be a day oh higher activity overall, this is likely when individuals are often working out or running errands.

Further Analysis and Visualizations in R

Lastly, I decided to analyze a few additional data files in RStudio using R. First, I loaded the packages I thought I might use during my work:

library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6     ✔ purrr   0.3.4
## ✔ tibble  3.1.8     ✔ dplyr   1.0.9
## ✔ tidyr   1.2.0     ✔ stringr 1.4.1
## ✔ readr   2.1.2     ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

library(lubridate)

## 
## Attaching package: 'lubridate'
## 
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union

library(janitor)

## 
## Attaching package: 'janitor'
## 
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test

library(rmarkdown)
library(scales)

## 
## Attaching package: 'scales'
## 
## The following object is masked from 'package:purrr':
## 
##     discard
## 
## The following object is masked from 'package:readr':
## 
##     col_factor

library(tidyr)
library(odbc)
library(DBI)
library(skimr)
library(bigrquery)

Then, I loaded the files I wanted to analyze:

OG_MinuteIntensities <- read_csv("minuteIntensitiesNarrow_merged.csv")

## Rows: 1325580 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): ActivityMinute
## dbl (2): Id, Intensity
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

OG_HeartRate <- read_csv("heartrate_seconds_merged.csv")

## Rows: 2483658 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Time
## dbl (2): Id, Value
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

OG_WeightLog <- read_csv("weightLogInfo_merged.csv")

## Rows: 67 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Date
## dbl (6): Id, WeightKg, WeightPounds, Fat, BMI, LogId
## lgl (1): IsManualReport
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Next, I checked for duplicate rows in each file by comparing total number of rows vs number of distinct rows:

nrow(OG_MinuteIntensities)

## [1] 1325580

nrow(unique(OG_MinuteIntensities))

## [1] 1325580

nrow(OG_HeartRate)

## [1] 2483658

nrow(unique(OG_HeartRate))

## [1] 2483658

nrow(OG_WeightLog)

## [1] 67

nrow(unique(OG_WeightLog))

## [1] 67

Next, I wanted to get a general idea of the data contained in each file:

skim(OG_MinuteIntensities)

Data summary
Name	OG_MinuteIntensities
Number of rows	1325580
Number of columns	3
_______________________
Column type frequency:
character	1
numeric	2
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	empty	n_unique	whitespace
ActivityMinute	0	1	19	21	0	44160	0

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
Id	0	1	4847897691.9	2.422313e+09	1503960366	2320127002	4445114986	6962181067	8877689391	▇▅▃▅▅
Intensity	0	1	0.2	5.200000e-01	0	0	0	0	3	▇▁▁▁▁

skim(OG_HeartRate)

Data summary
Name	OG_HeartRate
Number of rows	2483658
Number of columns	3
_______________________
Column type frequency:
character	1
numeric	2
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	empty	n_unique	whitespace
Time	0	1	19	21	0	961274	0

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
Id	0	1	5.513765e+09	1950223761.0	2022484408	4388161847	5553957443	6962181067	8877689391	▃▇▇▅▃
Value	0	1	7.733000e+01	19.4	36	63	73	88	203	▇▇▁▁▁

skim(OG_WeightLog)

Data summary
Name	OG_WeightLog
Number of rows	67
Number of columns	8
_______________________
Column type frequency:
character	1
logical	1
numeric	6
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	empty	n_unique	whitespace
Date	0	1	19	21	0	56	0

Variable type: logical

skim_variable	n_missing	complete_rate	mean	count
IsManualReport	0	1	0.61	TRU: 41, FAL: 26

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
Id	0	1.00	7.009282e+09	1.950322e+09	1.503960e+09	6.962181e+09	6.962181e+09	8.877689e+09	8.877689e+09	▁▁▂▇▆
WeightKg	0	1.00	7.204000e+01	1.392000e+01	5.260000e+01	6.140000e+01	6.250000e+01	8.505000e+01	1.335000e+02	▇▃▃▁▁
WeightPounds	0	1.00	1.588100e+02	3.070000e+01	1.159600e+02	1.353600e+02	1.377900e+02	1.875000e+02	2.943200e+02	▇▃▃▁▁
Fat	65	0.03	2.350000e+01	2.120000e+00	2.200000e+01	2.275000e+01	2.350000e+01	2.425000e+01	2.500000e+01	▇▁▁▁▇
BMI	0	1.00	2.519000e+01	3.070000e+00	2.145000e+01	2.396000e+01	2.439000e+01	2.556000e+01	4.754000e+01	▇▁▁▁▁
LogId	0	1.00	1.461772e+12	7.829948e+08	1.460444e+12	1.461079e+12	1.461802e+12	1.462375e+12	1.463098e+12	▇▇▆▇▇

glimpse(OG_MinuteIntensities)

## Rows: 1,325,580
## Columns: 3
## $ Id             <dbl> 1503960366, 1503960366, 1503960366, 1503960366, 1503960…
## $ ActivityMinute <chr> "4/12/2016 12:00:00 AM", "4/12/2016 12:01:00 AM", "4/12…
## $ Intensity      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…

glimpse(OG_HeartRate)

## Rows: 2,483,658
## Columns: 3
## $ Id    <dbl> 2022484408, 2022484408, 2022484408, 2022484408, 2022484408, 2022…
## $ Time  <chr> "4/12/2016 7:21:00 AM", "4/12/2016 7:21:05 AM", "4/12/2016 7:21:…
## $ Value <dbl> 97, 102, 105, 103, 101, 95, 91, 93, 94, 93, 92, 89, 83, 61, 60, …

glimpse(OG_WeightLog)

## Rows: 67
## Columns: 8
## $ Id             <dbl> 1503960366, 1503960366, 1927972279, 2873212765, 2873212…
## $ Date           <chr> "5/2/2016 11:59:59 PM", "5/3/2016 11:59:59 PM", "4/13/2…
## $ WeightKg       <dbl> 52.6, 52.6, 133.5, 56.7, 57.3, 72.4, 72.3, 69.7, 70.3, …
## $ WeightPounds   <dbl> 115.9631, 115.9631, 294.3171, 125.0021, 126.3249, 159.6…
## $ Fat            <dbl> 22, NA, NA, NA, NA, 25, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ BMI            <dbl> 22.65, 22.65, 47.54, 21.45, 21.69, 27.45, 27.38, 27.25,…
## $ IsManualReport <lgl> TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, …
## $ LogId          <dbl> 1.462234e+12, 1.462320e+12, 1.460510e+12, 1.461283e+12,…

Initial Insights:

ID data type is matching (dbl) in all 3 files, so that is good.
Date/Timestamps are being read as Strings - I want to change it into a DATE TIME format, will need to remove AM/PM? (all 3 files)
Intensity is coded, according to spec doc (https://www.fitabase.com/media/1930/fitabasedatadictionary102320.pdf), scale is as follows: 0=Sedentary, 1 Light, 2=Moderate, 3=VeryActive
Consider replacing intensity values with the coded value? Or, keep in numerical format so it can be analyzed as such?
Header for HR “Value” should be renamed to “Heartrate”

Bringing up listing of column names for each file so I can reference them as I clean

colnames(OG_HeartRate)

## [1] "Id"    "Time"  "Value"

colnames(OG_MinuteIntensities)

## [1] "Id"             "ActivityMinute" "Intensity"

colnames(OG_WeightLog)

## [1] "Id"             "Date"           "WeightKg"       "WeightPounds"  
## [5] "Fat"            "BMI"            "IsManualReport" "LogId"

Cleaning Heart Rate Data:

Heartrate_v1 <- OG_HeartRate %>% 
  rename(Participant_Id=Id) %>% 
  rename(Heart_Rate=Value) %>% 
  mutate(Date_Isolated=substr(Time, 1, 9)) %>% 
  mutate(Time_Isolated=substr(Time, 11, 20)) %>% 
  mutate(Date_Time=mdy_hms(Time)) %>% 
  rename(OldTime=Time)
head(Heartrate_v1)

Heartrate_Cleaned <- Heartrate_v1 %>%
  select(Participant_Id, Date_Time, Heart_Rate) %>% 
  arrange(Participant_Id, Date_Time)

head(Heartrate_Cleaned)

Cleaning Intensities Data:

Intensities_v1 <- OG_MinuteIntensities %>% 
  rename(Participant_Id=Id) %>%
  rename(Intensity_Level_Code=Intensity) %>% 
  mutate(Date_Isolated=substr(ActivityMinute, 1, 9)) %>% 
  mutate(Time_Isolated=substr(ActivityMinute, 11, 20)) %>% 
  mutate(Date_Time=mdy_hms(ActivityMinute)) %>% 
  mutate(
    Intensity_Level=case_when(
      Intensity_Level_Code == "0" ~ "Sedentary",
      Intensity_Level_Code == "1" ~ "Light",
      Intensity_Level_Code == "2" ~ "Moderate",
      Intensity_Level_Code == "3" ~ "Very Active"
    ))

ActivityIntensities_Cleaned <- Intensities_v1 %>% 
  select(Participant_Id, Date_Time, Intensity_Level, Intensity_Level_Code) %>% 
  arrange(Participant_Id, Date_Time)


head(ActivityIntensities_Cleaned)

Next, I was interested to see heart rate mapped alongside activity intensity for users. I hypothesized heart rate should increase as intensity increases but thought it would also be interesting if sedentary or light heart rate spikes are seen much - could indicate stress, which could be useful for marketing purposes.

ActivityxHeartRate_Merged <- merge(
  ActivityIntensities_Cleaned, 
  Heartrate_Cleaned, 
  by=c("Participant_Id", "Date_Time"))
head(ActivityxHeartRate_Merged)

Viz #1: Activity Intensity vs Heart Rate

This next plot shows that Sedentary Activity seems to be weighted to lower HR (as expected), then as HR increases, the intensity gradually transitions to majority Moderate or light, then mostly Very Active at highest HR.

However, can clearly see there are instances where sedentary, light, moderate are appearing at higher HR - worth investigating further…. Calculate occurrence of HR over average during Sed or Light activity -> poss. marketing stress relief in subscription product.

ggplot(data = ActivityxHeartRate_Merged)+
  geom_point(mapping = aes(x = Date_Time, y = Heart_Rate, color = Intensity_Level))+
  labs(title="Activity Intensity and Heart Rate",
       subtitle="Participants tracked 4/12/16-5/12/16",
       caption = "Data from https://www.kaggle.com/datasets/arashnic/fitbit",
       x="Date", y="Heart Rate")

Now I wanted to dive deeper into the HRxActivity data to see if I can identify the percentage of high HR in sedentary activity. According to Mayo Clinic, normal resting HR range for adults is between 60-100 bpm.(NOTE:I am making assumption survey participants were adults). Since I don’t have access to demographic info like ages, I can’t calculate or identify normal ranges for any level of activity, even light. So, I will only look at Sedentary status with > 100 HR.

High_HR_while_Sedentary <- subset(ActivityxHeartRate_Merged, Heart_Rate > 100 & Intensity_Level_Code < 1)
nrow(High_HR_while_Sedentary)

## [1] 935

nrow(subset(ActivityxHeartRate_Merged, Heart_Rate > 100))

## [1] 14573

935 / 14573

## [1] 0.06415975

unique(ActivityxHeartRate_Merged$Participant_Id)

##  [1] 2022484408 2026352035 2347167796 4020332650 4388161847 4558609924
##  [7] 5553957443 5577150313 6117666160 6775888955 6962181067 7007744171
## [13] 8792009665 8877689391

unique(High_HR_while_Sedentary$Participant_Id)

##  [1] 2022484408 2026352035 2347167796 4020332650 4388161847 4558609924
##  [7] 5553957443 5577150313 6117666160 6775888955 6962181067 7007744171
## [13] 8792009665 8877689391

The above determined that there were 935 instances of HR>100 during Sedentary Activity levels, and 14573 total instances of HR >100. So, about 6.42% of the time an individual’s HR was >100, they were sedentary - this isn’t a huge occurrence, but may be worth looking at this population for marketing opportunities.

Viz #2: Plotting High HR while Sedentary

ggplot(data = High_HR_while_Sedentary)+
  geom_point(mapping=aes(x=Date_Time, y=Heart_Rate), color="purple")+
  labs(title="Occurrences of High Heart Rate during Sedentary Activity",
       subtitle="Participants tracked 4/12/16-5/12/16",
       caption = "Data from https://www.kaggle.com/datasets/arashnic/fitbit",
       x="Date", y="Heart Rate")

The above chart zooms in on the >100 HR occurrences during times where the individual was sedentary.

Next, I switched gears and decided to review and clean the weight data. Since the data spans only a month, I didn’t think there was enough time/data for meaningful weight change vs activity analysis. Instead, I focused analysis on frequency of weight logs and how often individuals are logging weight manually vs automatically with a connected smart device.

WeightLog_v1 <- OG_WeightLog %>% 
  rename(Participant_Id=Id) %>%
  mutate(Date_Isolated=substr(Date, 1, 9)) %>% 
  mutate(Time_Isolated=substr(Date, 11, 20)) %>% 
  mutate(Date_Time=mdy_hms(Date)) %>% 
  rename(Weight_Lbs=WeightPounds) %>% 
  rename(Weight_Kg=WeightKg) %>% 
  rename(BodyFat_Percentage=Fat) %>% 
  rename(Is_Manual_Report=IsManualReport) %>% 
  rename(Log_Id=LogId)
head(WeightLog_v1)

WeightLog_Cleaned <- WeightLog_v1 %>% 
  select(Participant_Id, Date_Isolated, Date_Time, Is_Manual_Report, Weight_Lbs, BMI, Log_Id) %>% 
  arrange(Participant_Id, Date_Time)
head(WeightLog_Cleaned)

Created a DF to summarize number of weight entries made for each date:

Entries_By_Date_WL <- WeightLog_Cleaned %>% 
  group_by(Date_Isolated) %>% 
  summarize(Num_Entries=n())

head(Entries_By_Date_WL)

Calculating some summary data points:

WeightLog_Cleaned %>% 
  summarize(Start_Date=min(Date_Isolated), End_Date=max(Date_Isolated), Num_Participants=n_distinct(Participant_Id), Total_Entries=nrow(WeightLog_Cleaned))

## Average Entries per day for the group:
67/31

## [1] 2.16129

## Average Entries per participant:
67/8

## [1] 8.375

## Average number of days per entry for each participant:
31/8.375

## [1] 3.701493

67 total entries over 31 days (last entry made a few days before survey span ended though the final days were still applicable to datespan) - group avg’d 2.16 entries/day; 8 unique participants; avg 8.375 total logs per participant, avg 3.7 days per participant log

Viz #3: Plotting Method of Weight Logs Frequency

ggplot(WeightLog_Cleaned)+
  geom_bar(mapping=aes(x=Is_Manual_Report, fill=Is_Manual_Report), show.legend = FALSE)+
  scale_x_discrete(labels = c("FALSE" = "Automatic/Synced", "TRUE" = "Manually Entered"))+
  labs(title="Weight Logs: Data Entry Method", 
       subtitle="Participants tracked 4/12/16-5/12/16", 
       caption="Data from https://www.kaggle.com/datasets/arashnic/fitbit", 
       x="Entry Method", y="Number of Entries")

This graphic shows that entries were made manually >40 times, and were automatically synced <30 times. So, it suggests that manual entry is more frequent. However, due to limitations of this data set (low number of participants and undetermined relevance to our target demographic), further surveys would be recommended to confirm this trend as well as search for causes. If this trend holds true, a recommendation would be to advertise connectivity between the bellabeat app and smart scales. Internally, we would want to increase the level of connectivity as well, to support these marketing efforts.

Connecting RStudio and BigQuery

I wanted to connect RStudio to BigQuery to turn one of my query results into a data frame for further analysis in R. NOTE: I will be placing the R Code for this activity outside of the code chunks as I do not want the connection to BigQuery to be able to be executed in this document. I will add a photo of the Data Frame created through these codes as well as images of the plots created.

`#Connecting RStudio to BigQuery (The project I created earlier in BigQuery is titled bbfitbit, and the dataset is titled fitabaseData):


bigquery_connect <- dbConnect(bigrquery::bigquery(), project =
“bbfitbit”, dataset = “fitabaseData”)
#Listing available tables(it prompts me to authenticte access and log
in at this time):

dbListTables(bigquery_connect)`

Next, adding one of my Previously created SQL queries in order to be able to bring results to R as dataFrames (using the Participant Activity Averages Query):

`#Querying BigQuery from RStudio Query_ParticipantActivityAverages <- odbc::dbSendQuery( bigquery_connect, ” SELECT CAST(Id AS STRING) AS Participant_Id, AVG(Calories) AS Avg_Daily_Calories, Avg(TotalSteps) AS Avg_Daily_Steps, Avg(TotalDistance) AS Avg_Daily_Distance_km, Avg(VeryActiveDistance) AS Avg_HA_Distance_km, safe_divide((Avg(VeryActiveDistance)), Avg(TotalDistance)) * 100 AS Avg_HA_Dist_percent_Total_Dist, Avg(ModeratelyActiveDistance) AS Avg_MOD_Distance_km, safe_divide(Avg(ModeratelyActiveDistance), Avg(TotalDistance)) * 100 AS Avg_MOD_Dist_percent_Total_Dist, Avg(LightActiveDistance) AS Avg_LA_Distance_km, safe_divide(Avg(LightActiveDistance), Avg(TotalDistance)) * 100 AS Avg_LA_Dist_percent_Total_Dist, Avg(SedentaryActiveDistance) AS Avg_SED_Distance_km, safe_divide(Avg(SedentaryActiveDistance), Avg(TotalDistance)) * 100 AS Avg_SED_Dist_percent_Total_Dist, Avg(VeryActiveMinutes+FairlyActiveMinutes+LightlyActiveMinutes+SedentaryMinutes) AS Avg_Total_Min_Tracked, Avg(VeryActiveMinutes) AS Avg_HA_Min, safe_divide(Avg(VeryActiveMinutes), Avg(VeryActiveMinutes+FairlyActiveMinutes+LightlyActiveMinutes+SedentaryMinutes)) * 100 AS avg_HA_Min_Percent_Total_Min, Avg(FairlyActiveMinutes) AS Avg_MOD_Min, safe_divide(Avg(FairlyActiveMinutes), Avg(VeryActiveMinutes+FairlyActiveMinutes+LightlyActiveMinutes+SedentaryMinutes)) * 100 AS avg_MOD_Min_Percent_Total_Min, Avg(LightlyActiveMinutes) AS Avg_LA_Min, safe_divide(Avg(LightlyActiveMinutes), Avg(VeryActiveMinutes+FairlyActiveMinutes+LightlyActiveMinutes+SedentaryMinutes)) * 100 AS avg_LA_Min_Percent_Total_Min, Avg(SedentaryMinutes) AS Avg_SED_Min, safe_divide(Avg(SedentaryMinutes), Avg(VeryActiveMinutes+FairlyActiveMinutes+LightlyActiveMinutes+SedentaryMinutes)) * 100 AS avg_SED_Min_Percent_Total_Min


FROM bbfitbit.fitabaseData.DailyActivity 
GROUP BY
  Id

” )`

Saving the result of above query into data frame:

`#Fetching query results into RStudio DataFrame:

Participant_Activity_Averages <- odbc::dbFetch(Query_ParticipantActivityAverages)`

Now that I have the dataframe of results saved, I can close the connection between my BigQuery project and RStudio:

#Clearing Query and Disconnecting from BigQuery: rm(Query_ParticipantActivityAverages) dbDisconnect(bigquery_connect) rm(bigquery_connect)

Plot Participant Activity Data

Since I have omitted the connection and query code from the code chunks, I will also include an image of the below plots.
As can be expected, they both suggest that more activity tends to be positively correlated with more calories being burned. However, there does seem to be a threshold of 7-8,000 daily steps or around 5km daily distance that one needs to pass before the calorie burning relationship to activity really takes off. As such, we can include messaging in our marketing to encourage customers who wish to lose weight to challenge themselves to surpass these activity levels (and emphasize how our products can help them track progress towards these goals)

#Plot #1 Calories Expended vs Distance - Code: ggplot(Participant_Activity_Averages)+ geom_smooth(mapping=aes(x=Avg_Daily_Calories, y=Avg_Daily_Distance_km))+ labs(title="Calories Expended vs. Distance Travelled (km)", subtitle="Participants tracked 4/12/16-5/12/16", caption="Data from https://www.kaggle.com/datasets/arashnic/fitbit", x="Average Daily Calories Expended", y="Avg Daily Distance (km)")

knitr::include_graphics("RPlots/CaloriesvsDistancekm.png")

#Plot #2 Calories Expended vs Step Count - Code: ggplot(Participant_Activity_Averages)+ geom_smooth(mapping=aes(x=Avg_Daily_Calories, y=Avg_Daily_Steps))+ labs(title="Calories Expended vs. Step Count", subtitle="Participants tracked 4/12/16-5/12/16", caption="Data from https://www.kaggle.com/datasets/arashnic/fitbit", x="Average Daily Calories Expended", y="Avg Daily Steps")

knitr::include_graphics("RPlots/CaloriesvsSteps.png")

Data Insights and High-Level Marketing Recommendations for Bellabeat

Again, it must be noted that the sample size of this population is much too small to be relied heavily upon for insights. Instead, I recommend using these insights as a starting point for further surveying and testing before launching new campaigns. That being said, these insights can certainly shed light on the directions to explore further.

Please visit this Tableau Viz to review insights and analysis.

BellaBeat_FitBitAnalysis_Notebook

Scenario Summary:

Business Task Statement:

Final Insights and Recommendations

Data Sources

Data Relevance and Reliability

Initial Review and Data Cleaning - CSV Files

Cleaning and Analysis in SQL - BigQuery

Query #1: Cleaned Daily Sleep Data.

Query #2: Summarizing each participant’s activity data across the span of the survey

Query #3: Summary of Daily activity averages for the group

Query #4: Summary Stats for Group Activity overall

Query #5: Combining Sleep + Activity Data and outputting averages of key activities/sleep info

Query #6: Identifying occurrences of missing entries, empty entries, and manual entries.

Visualizations In Tableau

Viz #1: Avg Calories Burned by Day of Week

Viz #2: Summary of Sleep Data by Day

Viz #3: Percentage of total entries empty, manual, or missing

Viz #4: Hourly Activity Levels

Further Analysis and Visualizations in R

Viz #1: Activity Intensity vs Heart Rate

Viz #2: Plotting High HR while Sedentary

Viz #3: Plotting Method of Weight Logs Frequency

Connecting RStudio and BigQuery

Plot Participant Activity Data

Data Insights and High-Level Marketing Recommendations for Bellabeat