Case Study 2: Bellabeat Data Analysis

Phase - Ask

Business Task: Analyze smart device data usage to gain insight on how consumers use non-Bellabeat smart devices and apply these insights to one of Bellabeat’s product.

Questions to be answered are as follows:

What are some trends in smart device usage?
How could these trends apply to Bellabeat customers?
How could these trends help influence Bellabeat’s marketing strategy?

Key Stakeholders include:

Urska Srsen – Bellabeat’s cofounder and Chief Creative Officer
Bellabeat marketing analytics team – team of data analysts responsible for the data analysis process to help guide Bellabeat’s marketing strategy.

Phase - Prepare

Data source: FitBit Fitness Tracker dataset available on Kaggle (CCO: Public Domain, dataset made available via Mobius). To navigate to the website, click link here. This dataset contains smart device data usage stored on Kaggle made open to the public. Data were initially stored on the local drive and then imported in Google’s BigQuery SQL database. The specifed files were utilized:

Folder: Fitabase Data 3.12.16-4.11.16

dailyActivity_merged.csv
hourlyIntensities_merged.csv
minuteSleep_merged.csv

Folder: Fitabase Data 4.12.16-5.12.16

dailyActivity_merged.csv
hourlyIntensities_merged.csv
minuteSleep_merged.csv

Data Organization: smart device data usages pertinent to daily activity, daily calories, daily intensities, and more are divided into two separate folders; each folder containing various usage data formatted via csv files for the dates of March 12th, 2016 through April 11th, 2016 and April 12th, 2016 through May 12th, 2016.

Data was sorted and filtered using Google’s BigQuery Cloud Data Warehouse

Data Credibility: data is from open data available on the Kaggle platform. Due to the nature of the program and directions of the capstone, secondary data pertinent to smart device usages are to be used for the data analysis process.

Phase - Process

Consistent errors discovered among data files which includes, but were not limited to:

Rows(observations) with incomplete entries.
Field values do not have consistent format (i.e. number of decimal values)
One file used ‘Date’ format for fields with dates although other files used ‘Timestamps’ format.

Tools Selected for Data Analysis Process:

SQL - Google’s BigQuery Cloud Data Warehouse - will be referred to as ‘BigQuery’
RStudio - click link

Data Processing/Cleaning/Transformation Steps Documentation:

Upload all data set csv files to BigQuery
Combined datasets from 3/12/2016 through 4/11/2016 and 4/12/2016 through 5/12/2016 using the following SQL syntax (example below):

knitr::include_graphics("Union_all_example.png", error=FALSE)

Saved combined table results using ‘BigQuery Tables’ for subsequent query
For files that contained timestamp data (i.e. “2016-03-19 13:00:00 UTC” ; hourlyintensities & minutesleep csv files), converted timestamp fields into both “date” and “time” fields using the following SQL syntax (example below):
Renamed field names, standardized & summarized value (numeric data) formats, and created new column to subsequently join datasets for daily activity, intensities, and sleep:

Daily Activity Query

Transforming Sleep Minutes to Hours on a per day basis Query

Transforming Hourly Intensities to Daily Total & Average Intensities Query

Innerjoin all tables recently created (5-bulletpoints 1 through 3)

knitr::include_graphics("innerjoin_example.png", error=FALSE)

NOTE: Innerjoin used to establish finalized dataset containing rows/observations where all fields were entered into for ensuring completeness of variable/fields being measured (daily activities, intensities, sleep, etc.)

6.Final query executed to yield cleaned/transformed dataset in SQL and subsequently exported as CSV file for subsequent analysis to be performed (and further clean/transform) in RStudio.

Phase - Analyze

Data organized in SQL as shown above in ‘Process’ Phase.

To set up R environment, install, if necessary, and load packages for data analysis: tidyverse, skimr, and janitor.

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.0     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(skimr)
library(janitor)

## 
## Attaching package: 'janitor'
## 
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test

Next, follow the analysis steps below:

Import data into 1st data frame(df):

ais_df <- read_csv('activity_intensity_sleep_joined.csv')

## Rows: 644 Columns: 26
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (6): ActivityDate, new_id, ActivityHour1, new_id_1, date1, new_id_2
## dbl (20): Id, TotalSteps, total_distance, tracker_distance, very_active_dist...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Inspect 1st df for any errors

view(ais_df)
head(ais_df)

## # A tibble: 6 × 26
##           Id ActivityDate TotalSteps total_distance tracker_distance
##        <dbl> <chr>             <dbl>          <dbl>            <dbl>
## 1 3977333714 4/17/2016         10415           6.97             6.97
## 2 3977333714 4/20/2016         11658           7.83             7.83
## 3 3977333714 5/6/2016          11677           8.28             8.28
## 4 1503960366 4/24/2016         10039           6.41             6.41
## 5 4319703577 4/14/2016         10210           6.88             6.88
## 6 8378563200 5/1/2016          11419           9.06             9.06
## # ℹ 21 more variables: very_active_distance <dbl>,
## #   moderate_active_distance <dbl>, light_active_distance <dbl>,
## #   sedentary_active_distance <dbl>, very_active_minutes <dbl>,
## #   fairly_active_minutes <dbl>, lightly_active_minutes <dbl>,
## #   sedentary_minutes <dbl>, Calories <dbl>, new_id <chr>, Id_1 <dbl>,
## #   ActivityHour1 <chr>, total_daily_intensity <dbl>,
## #   average_daily_intensity <dbl>, new_id_1 <chr>, Id_2 <dbl>, date1 <chr>, …

colnames(ais_df)

##  [1] "Id"                        "ActivityDate"             
##  [3] "TotalSteps"                "total_distance"           
##  [5] "tracker_distance"          "very_active_distance"     
##  [7] "moderate_active_distance"  "light_active_distance"    
##  [9] "sedentary_active_distance" "very_active_minutes"      
## [11] "fairly_active_minutes"     "lightly_active_minutes"   
## [13] "sedentary_minutes"         "Calories"                 
## [15] "new_id"                    "Id_1"                     
## [17] "ActivityHour1"             "total_daily_intensity"    
## [19] "average_daily_intensity"   "new_id_1"                 
## [21] "Id_2"                      "date1"                    
## [23] "value"                     "value_minutes"            
## [25] "value_hours"               "new_id_2"

Convert ‘activitydate’ in char to ‘activitydate_1’ in date format

ais_df$activitydate_1 <- mdy(ais_df$ActivityDate)

Additional cleaning: Remove unneeded columns, lowercase & cleaned column names

trimmed_ais_df <- ais_df %>% 
  select(Id, ActivityDate, TotalSteps, total_distance, tracker_distance, very_active_distance, 
         moderate_active_distance, light_active_distance, sedentary_active_distance, very_active_minutes,
         fairly_active_minutes, lightly_active_minutes, sedentary_minutes, Calories,
         total_daily_intensity, average_daily_intensity, value_minutes, value_hours, activitydate_1) %>% 
  rename_with(tolower) %>% 
  clean_names() %>% 
  rename(sleep_minutes=value_minutes, sleep_hours=value_hours)

Assign to new data frame, create new column, and arrange by id and then by activitydate_1

trimmed_ais_df_v1<- trimmed_ais_df %>% 
  mutate(trimmed_ais_df, total_active_minutes=very_active_minutes+
         fairly_active_minutes+lightly_active_minutes) %>% 
  arrange(id, activitydate_1)

Inspect 2nd df for any errors

view(trimmed_ais_df_v1)
head(trimmed_ais_df_v1)

## # A tibble: 6 × 20
##           id activitydate totalsteps total_distance tracker_distance
##        <dbl> <chr>             <dbl>          <dbl>            <dbl>
## 1 1503960366 3/25/2016         11004           7.11             7.11
## 2 1503960366 3/26/2016         17609          11.6             11.6 
## 3 1503960366 3/27/2016         12736           8.53             8.53
## 4 1503960366 3/28/2016         13231           8.93             8.93
## 5 1503960366 3/29/2016         12041           7.85             7.85
## 6 1503960366 3/31/2016         12256           7.86             7.86
## # ℹ 15 more variables: very_active_distance <dbl>,
## #   moderate_active_distance <dbl>, light_active_distance <dbl>,
## #   sedentary_active_distance <dbl>, very_active_minutes <dbl>,
## #   fairly_active_minutes <dbl>, lightly_active_minutes <dbl>,
## #   sedentary_minutes <dbl>, calories <dbl>, total_daily_intensity <dbl>,
## #   average_daily_intensity <dbl>, sleep_minutes <dbl>, sleep_hours <dbl>,
## #   activitydate_1 <date>, total_active_minutes <dbl>

colnames(trimmed_ais_df_v1)

##  [1] "id"                        "activitydate"             
##  [3] "totalsteps"                "total_distance"           
##  [5] "tracker_distance"          "very_active_distance"     
##  [7] "moderate_active_distance"  "light_active_distance"    
##  [9] "sedentary_active_distance" "very_active_minutes"      
## [11] "fairly_active_minutes"     "lightly_active_minutes"   
## [13] "sedentary_minutes"         "calories"                 
## [15] "total_daily_intensity"     "average_daily_intensity"  
## [17] "sleep_minutes"             "sleep_hours"              
## [19] "activitydate_1"            "total_active_minutes"

Determine date range of analyzed dataset

mdy("05-09-2016") - mdy("03-12-2016")

## Time difference of 58 days

Determining day of the week with highest smart device usage and creating 3rd df to arrange in sequential order

trimmed_ais_df_v1$weekdays <- weekdays(as.Date(trimmed_ais_df_v1$activitydate_1))
trimmed_ais_df_v2 <- trimmed_ais_df_v1
days_of_the_week <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday")
view(days_of_the_week)

Computing summary statistics for informational purposes

summary(trimmed_ais_df_v2)

##        id            activitydate         totalsteps    total_distance  
##  Min.   :1.504e+09   Length:644         Min.   :    0   Min.   : 0.000  
##  1st Qu.:3.977e+09   Class :character   1st Qu.: 4740   1st Qu.: 3.268  
##  Median :4.559e+09   Mode  :character   Median : 8371   Median : 6.065  
##  Mean   :4.862e+09                      Mean   : 8222   Mean   : 5.792  
##  3rd Qu.:6.776e+09                      3rd Qu.:11184   3rd Qu.: 7.860  
##  Max.   :8.792e+09                      Max.   :22770   Max.   :17.540  
##  tracker_distance very_active_distance moderate_active_distance
##  Min.   : 0.000   Min.   : 0.000       Min.   :0.0000          
##  1st Qu.: 3.268   1st Qu.: 0.000       1st Qu.:0.0000          
##  Median : 6.065   Median : 0.450       Median :0.4000          
##  Mean   : 5.779   Mean   : 1.335       Mean   :0.7291          
##  3rd Qu.: 7.860   3rd Qu.: 2.215       3rd Qu.:1.0425          
##  Max.   :17.540   Max.   :12.540       Max.   :6.4800          
##  light_active_distance sedentary_active_distance very_active_minutes
##  Min.   : 0.000        Min.   :0.0000000         Min.   :  0.00     
##  1st Qu.: 2.425        1st Qu.:0.0000000         1st Qu.:  0.00     
##  Median : 3.645        Median :0.0000000         Median :  8.00     
##  Mean   : 3.692        Mean   :0.0009938         Mean   : 22.95     
##  3rd Qu.: 4.885        3rd Qu.:0.0000000         3rd Qu.: 35.00     
##  Max.   :12.510        Max.   :0.1100000         Max.   :210.00     
##  fairly_active_minutes lightly_active_minutes sedentary_minutes    calories   
##  Min.   :  0.00        Min.   :  0.0          Min.   :   0.0    Min.   :   0  
##  1st Qu.:  0.00        1st Qu.:153.0          1st Qu.: 638.8    1st Qu.:1860  
##  Median : 11.00        Median :208.5          Median : 728.0    Median :2210  
##  Mean   : 17.37        Mean   :212.1          Mean   : 733.4    Mean   :2353  
##  3rd Qu.: 26.00        3rd Qu.:266.0          3rd Qu.: 815.0    3rd Qu.:2859  
##  Max.   :143.00        Max.   :518.0          Max.   :1440.0    Max.   :4900  
##  total_daily_intensity average_daily_intensity sleep_minutes   
##  Min.   :  0.0         Min.   :0.0000          Min.   :   2.0  
##  1st Qu.:224.0         1st Qu.:0.1600          1st Qu.: 335.5  
##  Median :326.5         Median :0.2200          Median : 415.0  
##  Mean   :322.0         Mean   :0.2232          Mean   : 401.4  
##  3rd Qu.:413.2         3rd Qu.:0.2900          3rd Qu.: 484.2  
##  Max.   :904.0         Max.   :0.6300          Max.   :1108.0  
##   sleep_hours     activitydate_1       total_active_minutes   weekdays        
##  Min.   : 0.033   Min.   :2016-03-12   Min.   :  0.0        Length:644        
##  1st Qu.: 5.592   1st Qu.:2016-04-09   1st Qu.:194.0        Class :character  
##  Median : 6.917   Median :2016-04-20   Median :262.0        Mode  :character  
##  Mean   : 6.690   Mean   :2016-04-19   Mean   :252.4                          
##  3rd Qu.: 8.071   3rd Qu.:2016-04-30   3rd Qu.:317.0                          
##  Max.   :18.467   Max.   :2016-05-12   Max.   :540.0

Note: Prior calculations, via aggregation of data, were performed on data in SQL under the Process phase prior analysis being performed in RStudio.

Phase - Share

The optimal approach for sharing findings involving one or more of the following data visualizations:

Scatterplots
Smoothlines
Barcharts
Hlines (Y-intercept)

Scatterplot & Smoothline Visualizations: Variables plotted

Daily Total Distance vs. Daily Calories Expended

ggplot(data=trimmed_ais_df_v1) + geom_point(mapping=aes(x = total_distance, y = calories, color = totalsteps)) + 
  geom_smooth(mapping=aes(x = total_distance, y = calories)) + 
  labs(title = "Fitbit Data: Daily Total Distance vs. Daily Calories Expended", subtitle = "Sample of Twenty-Four Participants",
       caption = "FitBit Fitness Tracker Data (CCO: Public Domain, dataset made available by Mobius") +
  annotate("text", x=12.5, y =1000, label = "As total distance increases, calories expended increases\n and, vice-versa",
           color = "red", fontface = "bold", size = 2.9)

## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

Daily Total Distance vs. Daily Sleep Hours

ggplot(data=trimmed_ais_df_v1) + geom_point(mapping=aes(x = total_distance, y = sleep_hours, color = totalsteps))+
  geom_smooth(mapping=aes(x = total_distance, y = sleep_hours)) + 
  labs(title = "Fitbit Data: Daily Total Distance vs. Daily Sleep", subtitle = "Sample of Twenty-Four Participants",
       caption = "FitBit Fitness Tracker Data (CCO: Public Domain, dataset made available by Mobius") +
  annotate("text", x=10, y =16, label = "Slight Inverse Relationship between Total Distance and Daily Sleep",
           color = "red", fontface = "bold", size = 4)

## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

Daily Total Intensity vs. Daily Sleep Hours

ggplot(data=trimmed_ais_df_v1) + geom_point(mapping=aes(x = total_daily_intensity, y = sleep_hours, color = totalsteps))+
  geom_smooth(mapping=aes(x = total_daily_intensity, y = sleep_hours)) + 
  labs(title = "Fitbit Data: Daily Intensity vs. Daily Sleep", subtitle = "Sample of Twenty-Four Participants",
       caption = "FitBit Fitness Tracker Data (CCO: Public Domain, dataset made available by Mobius") +
  annotate("text", x=500, y =16, label = "Little-to-no Relationship between Total Intensity and Total Sleep",
           color = "red", fontface = "bold", size = 4)

## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

Daily Total Intensity vs. Daily Total Steps

ggplot(data=trimmed_ais_df_v1) + geom_point(mapping=aes(x = total_daily_intensity, y =totalsteps , color = total_distance))+
  geom_smooth(mapping=aes(x = total_daily_intensity, y = totalsteps)) +
  labs(title = "Fitbit Data: Daily Intensity vs. Daily Total Steps", subtitle = "Sample of Twenty-Four Participants",
       caption = "FitBit Fitness Tracker Data (CCO: Public Domain, dataset made available by Mobius") +
  annotate("text", x=250, y =17500, label = "As total intensity increases,\n total steps increases, and vice-versa",
           color = "red", fontface = "bold", size = 4)

## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

Daily Sleep Hours vs. Daily Calories Expended

ggplot(data=trimmed_ais_df_v1) + geom_point(mapping=aes(y = sleep_hours, x = calories, color = totalsteps))+
  geom_smooth(mapping=aes(y = sleep_hours, x = calories)) + 
  labs(title = "Fitbit Data: Daily Sleep vs. Daily Calories Expended", subtitle = "Sample of Twenty-Four Participants",
       caption = "FitBit Fitness Tracker Data (CCO: Public Domain, dataset made available by Mobius") +
  annotate("text", x=2500, y=16, label="Little-to-no Relationship between Daily Sleep and Calories expended", color="red")

## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

Barcharts & Hlines: Variable plotted

Number of Logged Days per Participant

ggplot(data=trimmed_ais_df_v1) + geom_bar(mapping=aes(x=id, fill=activitydate)) + theme(axis.text.x=element_text(angle=90))+ 
  labs(title = "Fitbit Data: Number of Logged Days/Participant", subtitle = "Sample of Twenty-Four Participants",
       caption = "Note: A value of 1 used for each day logs were made regardless of # of entries (aggregated data). \n
       FitBit Fitness Tracker Data (CCO: Public Domain, dataset made available by Mobius") +
  geom_hline(yintercept=nrow(trimmed_ais_df_v1)/n_distinct(trimmed_ais_df_v1$id),color="red",linetype= "longdash") +
  annotate("text", x = 5000000000, y = 28, label=round(nrow(trimmed_ais_df_v1)/n_distinct(trimmed_ais_df_v1$id),0), color="red", fontface="bold")

Total Logs per Day of the Week

ggplot(data=trimmed_ais_df_v2) + geom_bar(mapping=aes(x=factor(weekdays, days_of_the_week))) + 
  labs(title = "Fitbit Data: Total Daily Logs per Weekday", subtitle = "Sample of Twenty-Four Participants",
  caption = "FitBit Fitness Tracker Data (CCO: Public Domain, dataset made available by Mobius") +
  geom_hline(yintercept=nrow(trimmed_ais_df_v1)/n_distinct(days_of_the_week),color="red",linetype= "longdash")+
  annotate("text", x="Thursday", y = 95, label=round(nrow(trimmed_ais_df_v1)/n_distinct(days_of_the_week),0), color="red", fontface="bold")

Summary of Analysis/Key Findings:

As daily total distance (km) increased, daily total calories expended increases (strong positive correlation). In other words, the farther a person traveled for a given activity, the more calories they burned.
Weak relationship between variables; however, as daily total distance traveled increased the daily sleep hours decreased, and vice-versa (weak negative correlation).
Little-to-no relationship between daily total intensity and total sleep hours (little-to-no correlation).
As daily total intensity increased, total steps taken increased, and vice-versa (strong positive correlation). In other words, as total intensity increased, the more steps were taken, and vice-versa.
Little-to-no relationship between daily sleep hours and daily total calories expended (little-to-no correlation).
The average participant logged days over the 58-day collection period (range of dates for dataset) was 27 logged days; this is approximately 47% usage (logged) of the smart device by owners.
Based on this dataset, Tuesdays and Saturdays have slightly higher usages (logs) than the other weekdays and are above the average of 92 daily logs.

Accessibility of Work: This RMarkdown file will serve as the deliverable for all aspects of the data analysis phases.

Phase - Act

Questions to be answered:

What are some trends in smart device usage?
How could these trends apply to Bellabeat customers?
How could these trends help influence Bellabeat marketing strategy?

Answer Summary: Users are tracking total distance, total steps taken, states of consciousness (sleep, restless, awake), intensity, caloric expenditures, and more to stay informed about their health and how their body responds to activities, stress, emotions, and illness.

High Level Recommendations include:

Trend 1: There is positive correlation between total distance traveled and daily calories expended. Keeping track of total distance traveled in a day can help indicate and/or predict the number of calories burned in a given day that similar distances are being traveled. In addition, as total intensity increased so did the total steps measured. Given the two examples highlights the importance of tracking/measuring biometrics via a smart device and benchmarking those values against a rested state can provide end users with better responses to their daily habits, lifestyle, diets, stress, sleep, etc.
Trend 2: Tuesdays and Saturdays appear to be the days with the most daily user logs. This information may help the executives and marketing team coordinate when and where to focus heavy advertising on “active” days at local/regional health and fitness centers on the specified days to attempt to gain increased market share.
Trend 3: Smart device usage/logs among smart device owners is approximately 47%. This information may help the marketing team and executives identify areas to improve to increase smart device usages by their current and future consumers, and if there are any smart device features lacking that is contributing to the low usage rates. This may impact whether or not consumers continue to grow with the brand.