---
title: "Bellabeat Case study Capstone Project"
author: "Tasha Gill"
output: html_notebook
---
[]{#top-link}

![](https://mk0bellabeatcomhqlip.kinstacdn.com/wp-content/uploads/2020/10/bb_31.jpg)

<br><br><br>
For this case study, I am taking on the fictional role of a junior data analyst on the marketing analyst team at Bellabeat, a high-tech manufacturer of health-focused products for women. Bellabeat is a successful small company with the potential to become a larger player in the global smart device market.

<br><br>

This case study was provided as a project under the Google Data Analyst certification program. The program approaches data analysis through the following process, which is also how this notebook is organized:

<br>

<justified>


[Ask](#ask-link)<br>
[Prepare](#prepare-link)<br>
[Process](#process-link)<br>
[Analyze](#analyze-link)<br>
[Share](#share-link)<br>
[Act](#act-link)    



<br><br><br>  


<strong>  

### ASK ###{#ask-link}  
[Top](#top-link)  [Bottom](#bottom)<br>


</strong>



In this scenario, I've been asked to focus on one of Bellabeat’s products and analyze smart device data to gain insight into how consumers are using their smart devices. The insights will then help guide the company's marketing strategy. This notebook serves as a record of this work, lets me present my analysis to the Bellabeat executive team, and provides my recommendations for Bellabeat’s marketing strategy.
<br><br><br>  



<strong>  

### PREPARE ###{#prepare-link}   
[Top](#top-link)  [Bottom](#bottom)<br>


</strong>


  The data I was asked to analyze is available as [Kaggle's FitBit data set](https://www.kaggle.com/arashnic/fitbit).  
  This Kaggle data set contains personal fitness tracker data from thirty eligible Fitbit users who consented to the submission of their personal tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring. It includes information about daily activity, steps, and heart rate that can be used to explore users’ habits. The archive contains 18 data sets, and some of the data files are included in both wide and long formats.
  <br><br>
  
  
  <i>Following the ROCCC process to determine if there are any credibility or bias issues with the data:</i>  
  
  <strong>Reliable</strong> - yes, as the information was generated by the sensors of the devices directly, rather than responses from individuals.
  <br>
  <strong>Original</strong> - yes, the original public data can be located at https://zenodo.org/record/53894#.X9oeh3Uzaao <br> 
  <strong>Comprehensive</strong> - yes, both long and wide formats have matching data, and no missing values.<br> 
  <strong>Current</strong> - no, this is a historical data source (04/12/2016 - 05/12/2016)  
  <strong>Cited</strong> - yes    <br>
  <br>
  As for the licensing of the data, it is listed under Creative Commons Attribution 4.0 International.   
  <br><br>
  There do not appear to be any personally identifying values. The data set description notes that individual reports can be parsed by export session ID, so user privacy should be maintained at least at this level, as I have no way of identifying users further. I do, however, have to assume that the data is not limited to female users, so the insights gathered here apply to Fitbit users in general rather than specifically to the audience Bellabeat markets to.  
  <br><br><br> 
  
  
  <strong> 
  
### PROCESS ###{#process-link}  
[Top](#top-link)  [Bottom](#bottom)<br> 

</strong> 
  
  Initially I attempted to view the data sets in Google Sheets, and found that the heart rate data set was too long for Sheets to display. Because of this, I chose R for its ability to handle long data sets, visualizations, and presentation.    
    <br>
      
- Loaded all 18 CSV files into the project for review.  
- Loaded the 'tidyverse' library.  
- Loaded the 'lubridate' library:

```{r}
library("tidyverse")

library("lubridate")
```
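The loading step itself is not shown in this notebook. A minimal sketch of one way to do it follows, assuming the CSV files sit together in one folder. The helper name `load_csv_folder`, the temporary folder, and the two tiny files are hypothetical and exist only to make the example self-contained.

```r
# Hypothetical helper: read every CSV in a folder into a named list of data frames.
load_csv_folder <- function(dir_path) {
  paths <- list.files(dir_path, pattern = "\\.csv$", full.names = TRUE)
  frames <- lapply(paths, read.csv, stringsAsFactors = FALSE)
  # Name each data frame after its file, without the .csv extension
  names(frames) <- tools::file_path_sans_ext(basename(paths))
  frames
}

# Toy demonstration: two small made-up files in a temporary folder.
demo_dir <- file.path(tempdir(), "fitbit_demo")
dir.create(demo_dir, showWarnings = FALSE)
write.csv(data.frame(Id = c(1, 2), TotalSteps = c(1000, 2500)),
          file.path(demo_dir, "dailyActivity_merged.csv"), row.names = FALSE)
write.csv(data.frame(Id = 1, TotalMinutesAsleep = 400),
          file.path(demo_dir, "sleepDay_merged.csv"), row.names = FALSE)

fitbit <- load_csv_folder(demo_dir)
names(fitbit)
```

Base R's `read.csv` is used here so the sketch runs without extra packages; `readr::read_csv` from the tidyverse would be a reasonable substitute.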
<br><br>
  
  As most of the Bellabeat products use smart technology to track user activity, sleep, and stress, I will be reducing the Fitbit files down to data that matches these categories for comparison. The Fitbit data does not account directly for stress, so I may not be able to use it for that comparison; however, it does include quite a bit of data on daily, hourly, and minute-level activity. I will reduce this further to provide a daily overview of the data for review.   
    
  <strong>
  
### ANALYZE ###{#analyze-link}  
[Top](#top-link)  [Bottom](#bottom)<br>
  
  </strong>
      
Files to Analyze:  

- dailyActivity_merged  
- dailyCalories_merged  
- dailyIntensities_merged  
- dailySteps_merged  
- sleepDay_merged  
- heartrate_seconds_merged  
<br>

Check files for structure:  

```{r echo=TRUE}
n_distinct(dailyActivity_merged$Id)
# [1] 33
n_distinct(dailyCalories_merged$Id)
# [1] 33
n_distinct(dailyIntensities_merged$Id)
# [1] 33
n_distinct(dailySteps_merged$Id)
# [1] 33
n_distinct(heartrate_seconds_merged$Id)
# [1] 14
n_distinct(sleepDay_merged$Id)
# [1] 24
```  
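The same check can be run over all data frames at once. This is a minimal sketch on made-up stand-in data frames (the values are not from the Fitbit data), which also shows how to list the IDs that appear in one data set but not in another.

```r
# Toy stand-ins for the loaded data frames (values are made up).
frames <- list(
  dailyActivity_merged     = data.frame(Id = c(1, 1, 2, 3)),
  sleepDay_merged          = data.frame(Id = c(1, 1, 3)),
  heartrate_seconds_merged = data.frame(Id = c(3, 3, 3))
)

# Number of distinct users in each data frame
id_counts <- sapply(frames, function(df) length(unique(df$Id)))
id_counts

# Users who logged activity but have no sleep records
missing_sleep <- setdiff(frames$dailyActivity_merged$Id, frames$sleepDay_merged$Id)
missing_sleep
```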
  
  <br><br>
  
Although the heart rate data set is the longest, it comes from the fewest users (14), and only 24 users contributed sleep data. The activity files also contain 33 distinct IDs, slightly more than the thirty users mentioned in the data set description. Assumptions about sleep and stress may need to be made, as the data might not give a full picture, but the activity data can certainly be used as a proper sample.  
<br>
Comparing the structures further, to check whether the data frames have any other common columns that can be merged:  
dailyActivity_merged lists the date the information was recorded as "ActivityDate", while dailyCalories_merged, dailyIntensities_merged, and dailySteps_merged list it as "ActivityDay". heartrate_seconds_merged lists it as "Time", and sleepDay_merged lists it as "SleepDay". All of these are character types rather than date/time types.  
<br>
Calories are found on both dailyActivity_merged and dailyCalories_merged.    
<br>
Comparing dailySteps_merged, dailyIntensities_merged, dailyCalories_merged, and dailyActivity_merged, I found that all of the data in the first three is also present in dailyActivity_merged, so I am removing those three data sets as well.   
<br><br>
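One way such a redundancy check could be done is sketched below on made-up data. The helper `is_contained` is hypothetical: it joins the smaller table to the larger one on the user and date columns and confirms that every row matches with the same value.

```r
# Toy stand-ins for dailyActivity_merged and dailyCalories_merged (made-up values).
activity_toy <- data.frame(
  Id = c(1, 1, 2),
  ActivityDate = c("4/12/2016", "4/13/2016", "4/12/2016"),
  TotalSteps = c(13162, 10735, 9000),
  Calories = c(1985, 1797, 2100),
  stringsAsFactors = FALSE
)
calories_toy <- data.frame(
  Id = c(1, 1, 2),
  ActivityDay = c("4/12/2016", "4/13/2016", "4/12/2016"),
  Calories = c(1985, 1797, 2100),
  stringsAsFactors = FALSE
)

# Hypothetical helper: TRUE when every row of `small` has a match in `big`
# with an identical `value` column (assumes one row per key in each table).
is_contained <- function(small, big, by_small, by_big, value) {
  joined <- merge(small, big, by.x = by_small, by.y = by_big,
                  suffixes = c(".small", ".big"))
  nrow(joined) == nrow(small) &&
    all(joined[[paste0(value, ".small")]] == joined[[paste0(value, ".big")]])
}

calories_redundant <- is_contained(
  calories_toy, activity_toy,
  by_small = c("Id", "ActivityDay"),
  by_big = c("Id", "ActivityDate"),
  value = "Calories"
)
calories_redundant
```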

Files to keep:

- dailyActivity_merged
- heartrate_seconds_merged
- sleepDay_merged

Apart from Id, no further common columns are found across the remaining 3 data sets. Formatting each of the data frames to use date/time types:

```{r echo=TRUE}
# Work on copies so the imported data frames stay untouched
activity <- dailyActivity_merged

# Parse the character dates, then keep a plain "date" string to use as a shared key
activity$ActivityDate <- as.POSIXct(activity$ActivityDate, format = "%m/%d/%Y", tz = Sys.timezone())
activity$date <- format(activity$ActivityDate, format = "%m/%d/%y")

sleepDay <- sleepDay_merged

sleepDay$SleepDay <- as.POSIXct(sleepDay$SleepDay, format = "%m/%d/%Y", tz = Sys.timezone())
sleepDay$date <- format(sleepDay$SleepDay, format = "%m/%d/%y")

heartrate <- heartrate_seconds_merged

# Heart rate readings carry a 12-hour clock time, so the format includes %I and %p
heartrate$Time <- as.POSIXct(heartrate$Time, format = "%m/%d/%Y %I:%M:%S %p", tz = Sys.timezone())
heartrate$hour <- format(heartrate$Time, format = "%H:%M:%S")
heartrate$date <- format(heartrate$Time, format = "%m/%d/%y")
```  
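To illustrate what the two format strings do, here is a small sketch on two made-up timestamps. The sample strings and the fixed `"UTC"` time zone are assumptions for the example; the notebook itself uses `Sys.timezone()`. Parsing of `%p` (AM/PM) depends on the locale, so results may differ outside an English or C locale.

```r
# Made-up sample values in the style of the raw character columns
raw_day  <- "4/12/2016 12:00:00 AM"
raw_time <- "4/12/2016 7:21:05 PM"

# A date-only format reads as far as it needs; trailing clock text is ignored
day_only <- as.POSIXct(raw_day, format = "%m/%d/%Y", tz = "UTC")

# %I is the 12-hour clock hour and %p the AM/PM marker
full_time <- as.POSIXct(raw_time, format = "%m/%d/%Y %I:%M:%S %p", tz = "UTC")

format(day_only, "%m/%d/%y")   # date key used for merging
format(full_time, "%H:%M:%S")  # 24-hour clock time

# A failed parse returns NA, so this is a quick sanity check
anyNA(c(day_only, full_time))
```

Running `anyNA()` on each parsed column of the real data is a cheap way to confirm that no rows failed to parse.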
<br><br>
Summarizing the individual data sets to get a good idea of what the trends are:  

```{r}
activity %>%
  summary()
```  
<br>
```{r}
sleepDay %>%
  summary()
```
<br>
```{r}
heartrate %>%
  summary()
```  
    
 <br><br><br>    
       
          
Visualizing the Calories by ActivityDate for activity summary:  
```{r}
ggplot(data=activity)+
  geom_smooth(mapping=aes(ActivityDate, Calories))
```  
<br>

It looks like users burned more calories earlier in the tracking window, and that expenditure tapered off toward mid-May: either the users had achieved their goals or they stopped tracking as closely. Because the data covers only one month (04/12/2016 - 05/12/2016), this is a tentative reading, but it suggests that marketing aimed at activity-focused users may work best in spring, when they may be getting ready to work out more before summer.  
<br><br>
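One way to check whether a change in the smoothed calories curve reflects less activity or simply fewer users reporting is to tabulate, for each day, the mean calories and the number of users with a record. This is a minimal sketch on made-up data, not output from the Fitbit files.

```r
# Toy stand-in for the activity data frame (made-up values).
activity_toy <- data.frame(
  Id = c(1, 2, 3, 1, 2, 1),
  date = c("04/12/16", "04/12/16", "04/12/16", "04/13/16", "04/13/16", "05/12/16"),
  Calories = c(2000, 2400, 1900, 2100, 2300, 1500)
)

# Mean calories and number of distinct users per day (both grouped by date)
daily_mean <- tapply(activity_toy$Calories, activity_toy$date, mean)
daily_n <- tapply(activity_toy$Id, activity_toy$date, function(id) length(unique(id)))

daily <- data.frame(
  date = names(daily_mean),
  mean_calories = as.vector(daily_mean),
  users_reporting = as.vector(daily_n)
)
daily
```

If `users_reporting` falls on the same days that `mean_calories` falls, the taper may partly reflect who was still wearing the device rather than a change in behavior.
<br>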
The data set provides different categories for activity. What is the average time logged in each category?  
Creating a new data table that selects the time spent in each category:  
<br>
```{r}
activity_categories <- activity %>%
  select(SedentaryMinutes, LightlyActiveMinutes, FairlyActiveMinutes, VeryActiveMinutes)
```  
<br>
Averages of those selected:  
<br>
```{r}
activity_avgs <- c(sedentary = mean(activity_categories$SedentaryMinutes), lightly = mean(activity_categories$LightlyActiveMinutes), fairly = mean(activity_categories$FairlyActiveMinutes), very = mean(activity_categories$VeryActiveMinutes))
```  
<br>
I would like to display the averages as columns showing the difference between the categories. geom_col() requires both an x and a y aesthetic to do this, so I'm creating labels for the averages, and then a new data frame for the visual to use:  
<br>
```{r}
activity_labels <- c("Sedentary","Lightly","Fairly", "Very Active")
activity_vis_frame <- data.frame(activity_avgs, activity_labels)
```  
<br>
Visualizing the different categories of activity vs time spent in those categories:  
<br>
```{r}
ggplot(activity_vis_frame)+
  geom_col(mapping=aes(x=activity_avgs, y=activity_labels), fill="blue")
```  
<br><br>
This plot shows that the Fitbit users are not primarily active while tracking: on average, the majority of tracked minutes are sedentary. Perhaps they are tracking their current activities to see where they can improve, or they spend the majority of their day working at a desk. Is this also taking into account sleep? More refined data would be needed to confirm whether the devices were tracking activity at the same time as sleep.   
<br>
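A rough way to probe the sleep question is to add up the four minute categories for each day. This is a sketch on made-up rows, and the interpretation is an assumption: a day that sums to a full 1,440 minutes would suggest that the sedentary bucket absorbs everything that is not active time, possibly including sleep, while shorter totals would suggest that some time (sleep or non-wear) is excluded.

```r
# Toy stand-in for the activity data frame (made-up values).
activity_toy <- data.frame(
  SedentaryMinutes     = c(1200, 700, 1440),
  LightlyActiveMinutes = c(180, 250, 0),
  FairlyActiveMinutes  = c(30, 20, 0),
  VeryActiveMinutes    = c(30, 40, 0)
)

minute_cols <- c("SedentaryMinutes", "LightlyActiveMinutes",
                 "FairlyActiveMinutes", "VeryActiveMinutes")

# Total minutes accounted for per day, and whether that covers all 24 hours
activity_toy$tracked_minutes <- rowSums(activity_toy[, minute_cols])
activity_toy$full_day <- activity_toy$tracked_minutes == 24 * 60

# Share of the tracked time that was sedentary
activity_toy$sedentary_share <- activity_toy$SedentaryMinutes / activity_toy$tracked_minutes
activity_toy
```
<br>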
Visualizing the TotalMinutesAsleep vs. the TotalTimeInBed:  

```{r}
ggplot(data=sleepDay)+
  geom_count(mapping=aes(TotalMinutesAsleep, TotalTimeInBed))
```  

Among those reporting their sleep habits, most are asleep for nearly all of the time they spend in bed, with few outliers.  
<br><br>
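The same relationship can be expressed as a ratio of minutes asleep to minutes in bed, which makes the outliers easy to list. This is a sketch on made-up rows, and the 0.85 cutoff is an arbitrary illustrative threshold rather than a clinical definition.

```r
# Toy stand-in for the sleepDay data frame (made-up values).
sleep_toy <- data.frame(
  Id = c(1, 1, 2, 3),
  TotalMinutesAsleep = c(420, 380, 300, 450),
  TotalTimeInBed = c(450, 400, 480, 470)
)

# Fraction of time in bed that was spent asleep
sleep_toy$efficiency <- sleep_toy$TotalMinutesAsleep / sleep_toy$TotalTimeInBed

# Records well below the diagonal of the plot (illustrative cutoff)
low_efficiency <- sleep_toy[sleep_toy$efficiency < 0.85, ]
low_efficiency
```
<br>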
Visualizing the heartrate time vs value:  

```{r}
ggplot(data=heartrate)+
  geom_smooth(mapping=aes(Time, Value))
```  

The smoothed line shows heart rate values staying high from mid-April until the beginning of May, followed by another large spike in the first week of May. Perhaps this was a last push of activity before summer began?  
<br><br>
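Because the heart rate file is recorded by the second, it can help to reduce it to one average per user per day before plotting or merging. This is a minimal sketch on made-up readings, not a step taken in this notebook.

```r
# Toy stand-in for the heartrate data frame (made-up values).
heartrate_toy <- data.frame(
  Id = c(1, 1, 1, 2, 2),
  date = c("04/12/16", "04/12/16", "04/13/16", "04/12/16", "04/12/16"),
  Value = c(60, 80, 75, 90, 100)
)

# One mean heart rate per user per day
daily_hr <- aggregate(Value ~ Id + date, data = heartrate_toy, FUN = mean)
daily_hr
```

A daily summary like this also keeps users with many readings from dominating a smoothed line drawn over every raw reading.
<br>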
Merging data sets to see if there are any further connections to be made:  

```{r}
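# Id and date are the only columns the two data frames share, so join on them explicitly
summary_merged <- merge(activity, sleepDay, by = c("Id", "date"))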
```  
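By default, `merge()` keeps only the rows that match in both data frames, so days without a sleep record drop out of the merged result. The sketch below uses made-up data to show the difference between that default and a left join that keeps every activity row.

```r
# Toy stand-ins for the activity and sleepDay data frames (made-up values).
activity_toy <- data.frame(
  Id = c(1, 1, 2),
  date = c("04/12/16", "04/13/16", "04/12/16"),
  Calories = c(2000, 2100, 2400)
)
sleep_toy <- data.frame(
  Id = c(1, 2),
  date = c("04/12/16", "04/12/16"),
  TotalMinutesAsleep = c(420, 390)
)

# Inner join (the default): only user-days present in both tables
inner <- merge(activity_toy, sleep_toy, by = c("Id", "date"))

# Left join: keep every activity row, with NA where no sleep was logged
left <- merge(activity_toy, sleep_toy, by = c("Id", "date"), all.x = TRUE)

nrow(inner)
nrow(left)
```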

Summarize the two merged data sets:  

```{r}
summary_merged %>%
  summary()
```  
<br><br>
Is there a connection between how long someone logs sleep and how many calories they logged as expended?  

```{r}
ggplot(data=summary_merged)+
  geom_smooth(mapping=aes(Calories, TotalMinutesAsleep))
```  
<br><br>
It looks like longer sleep is associated with more calorie expenditure. This is a trend in the logged data rather than proof of cause, but it may support a recommendation for stress and sleep tracking.
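<br>
A smoothed curve shows the shape of a relationship but not its strength. As a hedged complement, the sketch below computes a Pearson correlation and a simple linear fit on made-up values; the numbers do not come from the Fitbit data.

```r
# Toy stand-in for the merged data frame (made-up values, one missing entry).
merged_toy <- data.frame(
  Calories = c(1800, 2000, 2200, 2400, 2600, NA),
  TotalMinutesAsleep = c(380, 400, 395, 430, 440, 410)
)

# Pearson correlation, skipping rows with missing values
r <- cor(merged_toy$Calories, merged_toy$TotalMinutesAsleep, use = "complete.obs")

# Linear fit: the slope is the change in minutes asleep per extra calorie
fit <- lm(TotalMinutesAsleep ~ Calories, data = merged_toy)
slope <- unname(coef(fit)["Calories"])

r
slope
```

Neither number says anything about cause; both only describe how the two logged quantities move together.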
  
<br><i><strong>
What surprises did you discover in the data?  
</strong></i>
Most users log more sedentary time than active time, and the activity that is logged is concentrated before May, according to the data.  
<br><i><strong>
What trends or relationships did you find in the data? 
<br></strong></i>
The more calories users expend, the longer the sleep they log. This may help with stress, as less sleep is correlated with more stress and health issues. Citation:
https://www.apa.org/news/press/releases/stress/2013/sleep

<br><i><strong>
 How will these insights help answer your business questions?
<br></strong></i>
These insights will help me make suggestions to the marketing team about when to schedule advertising for Bellabeat, based on the Fitbit data, and about app changes that could use more data to make recommendations to users. 


<strong>

### SHARE ### {#share-link}  
[Top](#top-link)  [Bottom](#bottom)<br>

</strong>

<strong>    
  Bellabeat can enhance their product offering by: 
</strong><br>
  * Educating their users on the benefits of reducing sedentary time and expending more calories, and on the trend for those actions to lead to longer sleep. <br>
  * Educating their users on how longer sleep may help alleviate stress. <br>
  * Changing the app to provide periodic and adjustable notifications of sedentary time, activity time, or calories expended.  
<br>
<i><strong>
How could your team and business apply your insights?
<br></strong></i>
Begin planning adjustments to the app and to the data that is collected, along with further marketing research for an advertising campaign.  
<br><br><br>

<strong>

I've also created a Google Slides presentation for this information: <br>
[Bellabeat Case study Capstone Project](https://docs.google.com/presentation/d/1NHgjod6jd3D767cg89Re9jNmeRJcG45eVJ_B2UI_A18/edit?usp=sharing)<br>
And finally, here are the notes that I had taken while progressing through this project:<br>
[Capstone Project Notes](https://docs.google.com/document/d/13UPwqcOaExYfxHDbKxdvTPtb6PKS31v2Y7RtWo_C0xg/edit?usp=sharing)


### ACT ###{#act-link}  
[Top](#top-link)  [Bottom](#bottom)<br>

<i>	
What next steps would you or your stakeholders take based on your findings? 
<br></i></strong>

The marketing team may need to research the benefits of less sedentary time, calorie expenditure, and longer sleep using accredited sources, and condense that information into digestible blurbs in the app, with links to the sources so users can learn more.<br>
They should also work with the app development team to learn what data is currently being tracked and what would need to be added. The app and database teams may also need to change the app's features so users can opt into notifications about the newly tracked data, either through settings or through the educational blurbs offered in the app.<br>
The database may need to be updated to accommodate the new data. After a period of time, limits or aggregates may need to be stored rather than the finer details, so that the data remains retrievable and reviewable for some time.<br>
The marketing team may also want to arrange a large advertising campaign for the spring, presenting the smart devices as a product solution for those who may be working towards a healthier body for the summer season.
<i><strong>
Is there additional data you could use to expand on your findings? 
<br></strong></i>
Any additional data that Bellabeat could provide about what they currently track and offer in the app would allow a proper comparison.  
	<br>
	<br>
	<br>
	<br>
	
[Top](#top-link)

[]{#bottom}